

Formulating a Query

As mentioned in section The WAIS Protocol, freeWAIS-sf uses the original WAIS Protocol. This protocol was designed only for transporting a free text query, that is, a list of search terms separated by spaces. We deliberately decided to keep this old protocol so that all existing clients can use our new features.

So we had to encode the new and richer query semantics in the query string itself. This means that the query now has to obey a certain syntax (see section Query Syntax), and consequently a user might get a syntax error when submitting a query. Our goal was therefore to make the syntax as easy as possible and, in particular, to keep simple free text queries valid.

In the query the categories to be searched should be selectable for each term. To leave the original queries valid (and to support casual users) we provided a default category, which is used if no category is specified in the query. Now we give an outline of the query language:

The atomic search expressions of the language are terms, term wild-cards and phrases (e.g. `information', `inform*', `"information retrieval"').

Stemming is handled transparently for the client: terms searched in a stemmed category are automatically searched by their word stem. For a wildcard (only tail truncation is implemented), all matching words from the dictionary are used as search terms. Phrase search first looks up the individual words of the phrase; at least one of them must be an index term. The server then scans the documents containing that word for string matches of the complete phrase. Phrase search therefore works only if the server has access to the documents, which is not the case for type `URL'.
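Tail truncation can be illustrated with a short sketch. It assumes the dictionary is available as a sorted list of words; the function name `expand_wildcard` and the data layout are hypothetical and are not taken from the freeWAIS-sf sources.

```python
from bisect import bisect_left


def expand_wildcard(pattern, sorted_words):
    """Expand a tail-truncated pattern such as 'inform*' against a dictionary.

    sorted_words is assumed to be a lexicographically sorted list of index
    terms (an assumption of this sketch, not the real index layout).
    """
    if not pattern.endswith("*"):
        # Not a wildcard: the term is used as is.
        return [pattern]
    prefix = pattern[:-1]
    # All words sharing the prefix form one contiguous run in a sorted list.
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


words = sorted(["index", "inform", "information", "informed", "retrieval"])
print(expand_wildcard("inform*", words))
```

Every word returned this way would then be used as a search term, as described above.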

Additionally, the prefix operators `soundex' and `phonix' convert the following query term into its Soundex or Phonix code. This is very useful, for example, when searching phone books if the exact spelling of a name is not known. Arbitrary Boolean combinations of these atomic expressions with the binary operators `and', `or' and `not' are allowed. Note that `not' means `and not' in Boolean logic and is therefore a binary operation, too; see the examples below. Parentheses can be used for grouping. For compatibility with the original syntax, `or' may be omitted.
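To show why a phonetic code helps with uncertain spellings, here is a minimal sketch of the classic American Soundex algorithm. It is an illustration only: the Soundex variant used by freeWAIS-sf may differ in detail, and Phonix is not covered.

```python
# Letter groups of classic American Soundex (assumed variant).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character Soundex code of name (empty string if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Skip vowels and repeated codes of adjacent consonants.
        if digit and digit != prev:
            code += digit
        # 'h' and 'w' do not separate consonants with the same code.
        if ch not in "hw":
            prev = digit
    return (code + "000")[:4]


print(soundex("Salton"), soundex("salatan"))
```

Both spellings map to the same code, which is why a Soundex query for one can match the other.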

For each expression, a semantic category (field) can be defined using the `category pred' operator, where pred is `=' for text categories and one of `=', `<', `>' for numeric categories (`==' is also valid for backward compatibility with version 1.1).

Here are some examples:

`information retrieval'
free text query
`information or retrieval'
same as above
`ti=information retrieval'
`information' must be in the title
`ti=(information retrieval)'
one of them in title
`ti=(information or retrieval)'
same as above
`ti=(information and retrieval)'
both of them in title
`ti=(information not retrieval)'
`information' in title and `retrieval' not in title
numerically equal
numerically less
numerically greater
Date search. Format is yyyymmdd
`au=(soundex salatan)'
soundex search, matches e.g. `Salton'
`ti=("information retrieval")'
phrase search
`ti=(information system*)'
wild-card search
`nuclear w/10 waste'
We have added a proximity search feature to freeWAIS-sf. With this feature, you can search for `nuclear w/10 waste' (similar to the Lexis/Nexis search syntax) to find all stories that have `nuclear' and `waste' within 10 words of each other.
`nuclear pre/10 waste'
If the order of the two words is important, then `nuclear pre/10 waste' will find all stories that have `nuclear' up to 10 words before `waste'. Proximity also works within fields; for instance, `byline=(dan w/2 woods)' will find every story with a byline that has `dan' within 2 words of `woods'. Note that you must use parentheses around the words you want to look for in the field.
`atleast 20 clinton'
`atleast 20 clinton' finds every story that has at least 20 occurrences of `clinton'. `atleast' has to be all lower-case, and there cannot be any spaces between `at' and `least' -- `at least 20 clinton' will not work!
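Because all of these features are encoded in the plain query string, a client can assemble queries with ordinary string handling. The helpers below are a hypothetical sketch (none of these function names exist in freeWAIS-sf); they only reproduce the forms shown in the examples above.

```python
def field_query(field, terms, op="or"):
    """Build `field=(t1 op t2 ...)` for a text category."""
    if op not in ("and", "or", "not"):
        raise ValueError("op must be 'and', 'or' or 'not'")
    return f"{field}=({f' {op} '.join(terms)})"


def proximity(left, right, distance, ordered=False):
    """Build `left w/N right`, or `left pre/N right` if word order matters."""
    op = "pre" if ordered else "w"
    return f"{left} {op}/{distance} {right}"


def at_least(count, term):
    """Build `atleast N term`; the keyword must be lower-case and unsplit."""
    return f"atleast {count} {term}"


print(field_query("ti", ["information", "retrieval"], op="and"))
print(field_query("byline", [proximity("dan", "woods", 2)]))
print(proximity("nuclear", "waste", 10, ordered=True))
print(at_least(20, "clinton"))
```

The server still parses the resulting string, so a malformed combination is reported as a syntax error by the server, not by these helpers.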

Query Weighting

For now, let us disregard the Boolean operators and assume that a query is simply a list of terms. Query weighting uses the vector space model. Each term in the query is associated with a query term weight; currently this weight is always 1. On the other side, each term in a document gets a document term weight, which is the product of a document specific weight and the inverse document frequency. The latter is defined as `idf = log(N/n)', where N is the number of documents in the database and n is the number of documents the term occurs in.

The other part of the document weight is computed as follows: Let tf be the number of occurrences of the term in the document and maxtf the maximum frequency of any term in the document. A preliminary weight is computed according to `w = (0.5 * tf)/(1 + maxtf)'. These weights are then normalized by dividing them by the square root of the sum of the squares of all preliminary weights for terms in this document, so that the document specific weights make up a vector of length 1. The final document term weight is obtained by multiplying this weight by the idf.
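The computation of the document term weights can be sketched as follows. The function names and the dictionary-based inputs are assumptions made for illustration. The sketch normalizes by the Euclidean norm (the square root of the sum of squares), since that is the divisor that yields a vector of length 1; the actual freeWAIS-sf code may differ.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency, idf = log(N/n)."""
    return math.log(num_docs / doc_freq)


def document_specific_weights(term_counts):
    """Map {term: tf} of one document to unit-length document specific weights."""
    if not term_counts:
        return {}
    maxtf = max(term_counts.values())
    # Preliminary weight w = (0.5 * tf) / (1 + maxtf).
    prelim = {t: (0.5 * tf) / (1 + maxtf) for t, tf in term_counts.items()}
    # Assumption: divide by the Euclidean norm so the vector has length 1.
    norm = math.sqrt(sum(w * w for w in prelim.values()))
    return {t: w / norm for t, w in prelim.items()}


def document_term_weights(term_counts, doc_freqs, num_docs):
    """Final weight per term: document specific weight times idf."""
    specific = document_specific_weights(term_counts)
    return {t: w * idf(num_docs, doc_freqs[t]) for t, w in specific.items()}


counts = {"information": 3, "retrieval": 1}   # tf per term in one document
freqs = {"information": 10, "retrieval": 50}  # n per term in the database
print(document_term_weights(counts, freqs, num_docs=100))
```

Note that a term occurring in every document gets idf = log(1) = 0 and therefore contributes nothing to the ranking.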

For simple queries (no Booleans), the weight of a document is computed by multiplying the query term weight by the document term weight for each term in the query and summing up the results. This is often referred to as the vector product (hence the name of the model) or scalar product.
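As a minimal sketch of this scalar product, assume the document term weights are already available as a dictionary; the name `score` and the input format are illustrative only.

```python
def score(query_terms, doc_weights, query_weight=1.0):
    """Scalar product of the query vector and one document vector.

    Every query term carries the same query term weight (1 by default);
    terms missing from the document contribute 0.
    """
    return sum(query_weight * doc_weights.get(term, 0.0) for term in query_terms)


doc = {"information": 0.4, "retrieval": 0.3, "system": 0.9}
print(score(["information", "retrieval"], doc))
```

Only terms that occur in both the query and the document contribute to the sum.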

Now let's get to the Booleans. The `or' operator is simply dropped, so `information or retrieval' yields exactly the same weight as `information retrieval'. To interpret the vector product in another way, you can say the `or' operator just sums up the weights of its arguments. For the `and' operator, the weights of both arguments are computed and the final weight is the minimum of these weights. Similarly, the binary `not' operator returns the minimum of the weight of the left argument and 1 minus the weight of the right argument.
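The three rules can be written down as a small recursive evaluator. The nested-tuple representation of a parsed query is an assumption of this sketch and says nothing about how the server represents queries internally.

```python
def weight(expr, doc_weights):
    """Weight of one document for a Boolean query expression.

    expr is either a term (str) or a tuple (op, left, right) with
    op in {'or', 'and', 'not'}; this tree format is hypothetical.
    """
    if isinstance(expr, str):
        return doc_weights.get(expr, 0.0)
    op, left, right = expr
    lw = weight(left, doc_weights)
    rw = weight(right, doc_weights)
    if op == "or":
        return lw + rw            # same as listing both terms
    if op == "and":
        return min(lw, rw)
    if op == "not":
        return min(lw, 1.0 - rw)  # binary 'and not'
    raise ValueError(f"unknown operator: {op}")


doc = {"information": 0.4, "retrieval": 0.3}
for op in ("or", "and", "not"):
    print(op, weight((op, "information", "retrieval"), doc))
```

Because `or' is a plain sum, the result for `or' equals the scalar product of the simple two-term query.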


Numeric searches are currently handled the same way. This is clearly a bug. Why should a document from 1988 yield a higher weight with respect to a query `<=1990' than one from 1987 just because there are more documents from 1987 in the database than from 1988?

I have no idea how string searches are weighted.

Partial searches hopefully are weighted as if all matching terms were given.

The proximity operators might work like the booleans?
