
Building Databases

If you successfully compiled and installed the system, you have already built your own database. This database will serve as an example throughout this section.

Let's first define some terms.

Document
The entities you want to search for.
Source
The files containing the documents you want to make searchable.
Index
A set of files generated by waisindex which are used by the server to perform searches. See section The Index Files.
Database
The combination of source and index.
Format
The layout of the documents including the separation of documents in the files.
Field
The format description might associate regions in the documents with certain concept types. For example, the sender of an email might be associated with the concept `author'. Searches can be constrained to such a concept type, or field for short.
Type
A type is attached to each document in the database. The possible types roughly correspond to MIME content types. Possible types are `TEXT', `HTML', `URL', ....

In general you have two options when indexing a new database: you can use a built-in format or use the `-t fields' option. The first alternative is not covered by this manual, since the latter is more general and the built-in formats can be emulated with field indexing.

waisindex Options

The following is mostly drawn from the waisindex manual.

-d index-filename
This is the base filename for the index files. Therefore if `/usr/local/foo' is specified, then the index files will be called `/usr/local/foo.dct' etc. The index should be stored on the local file system of the machine running waisindex. It works over NFS, but it is much slower.
-a
Append this index to an existing one. Useful for incremental additions or updates. This will only add onto an index, so that if a file has changed, it will get reindexed, but the old entries will not be purged. Therefore, to save space, it is a good idea to reindex the whole set of files periodically.
-r
Recursively index subdirectories.
-mem
How much main memory to use during indexing. This variable will have a large effect on how fast indexing is done.
-export
This causes the resulting source description file to include the host-name and tcp-port for use by the clients. Otherwise the file contains no connection information, and is expected to be used only for local searches.
-nocat
Inhibits the creation of a catalog. This is useful for databases with a large number of documents, as the catalog contains 3 lines per document.
-T type
Sets the type of the document to type.
-t format
This is the format of files that are handled by waisindex. To find out the list of currently known types, execute the waisindex command with no arguments and it will list them. Use `-t fields' to use the field indexing. The file `database.fmt' must contain the format description.
-stop
This option refers to the list of otherwise valid words which will be excluded from indexing and therefore from searching. If `-stop' is given, the built-in stop list is not used. In any case the file `database.stop' might contain stopwords separated by newlines. In fact, waisindex might add some words itself if they occur too often.
filename ...
These are the files that will be indexed according to the arguments above. To ensure the files are registered in the filename list correctly, it is advised that these are full paths (beginning with a `/'). If the database is used from a machine other than the machine on which the index is created, this should be a machine-independent path.
-stdin
Read the names of the files to index from standard input instead of from the command line.
-line
Makes the proximity operators use lines as units instead of words.

Building a Format Description

The trickiest part when building a new database is to write a format description -- the horrible `database.fmt' files.

An important thing to know when writing format descriptions is that waisindex parses source files line by line. All of the nice regular expressions have to match within a single line, so including multiple linefeeds in an expression will cause the matcher to fail.

Regular Expressions

All regular expressions in the format description must be enclosed between two `/'. Therefore a `/' in the regular expression must be escaped with a `\'. To match `TCP/IP' write `/TCP\/IP/'. As a consequence, a `\' must be escaped too ;-). Other escape sequences interpreted by the parser are:

\n
A newline character. See section Building a Format Description.
\r
The carriage return character.
\f
The formfeed character.
\x
The character C-x (control x) where x ranges from `A' to `Z'.

After escaping is resolved, the string is passed to a standard regular expression package (or your system's own; see section Regular Expressions).

Operator   Meaning
x          the character `x'
"x"        an `x', even if x is an operator
\x         an `x', even if x is an operator. Remember that the parser eats one `\'. So use `/\\*/' to match a `*' in the text.
[xy]       the character `x' or `y'
[a-c]      the characters `a', `b' or `c'
[^x]       any character but `x'
.          any character but newline
^x         an `x' at the beginning of a line
x$         an `x' at the end of a line
x?         an optional `x'
x*         0,1,2, ... instances of `x'
x+         1,2,3, ... instances of `x'
x|y        an `x' or a `y'
(x)        an `x'
x{m,n}     m through n occurrences of `x'
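
The parser-level escaping described at the start of this section can be sketched in a few lines of Python. This is only an illustration, not the code waisindex uses: the function name `resolve_escapes' is made up, and the treatment of a backslash before any other character (drop the backslash, keep the character) is an assumption derived from the remark that the parser eats one `\'.

```python
# Escapes the format-description parser is described as resolving itself.
SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "f": "\f"}


def resolve_escapes(token):
    """Strip the enclosing slashes from a token such as /TCP\\/IP/ and
    resolve parser-level escapes. The result is the string that would be
    handed on to the regular expression package."""
    if len(token) < 2 or not (token.startswith("/") and token.endswith("/")):
        raise ValueError("regular expression must be enclosed in '/'")
    body = token[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of expression")
        nxt = body[i + 1]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
        elif "A" <= nxt <= "Z":
            # \A .. \Z stand for the control characters C-a .. C-z
            out.append(chr(ord(nxt) - ord("A") + 1))
        else:
            # assumption: the parser eats one backslash and keeps the rest
            out.append(nxt)
        i += 2
    return "".join(out)
```

With this sketch, `/TCP\/IP/' yields `TCP/IP', and `/\\*/' yields `\*', which the regular expression package then reads as a literal `*'.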

Structure of the Format Descriptions

A format description consists of three parts. The first part -- usually just one line -- describes how the files should be split into documents. The second part, called the Layout Section, defines how a headline for each document should be computed.

The last and usually largest part defines which fields should be generated and how the indexing should be done.

Separation of Documents

A format description begins with a `record-sep:' directive.

record-sep: regexp

Here regexp might be `/\f/' to match a line containing a formfeed character. Note that the line matching the `record-sep:' regular expression is never indexed!
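
To make the line-by-line behaviour concrete, here is a rough Python sketch of the splitting step. It is not taken from the waisindex sources: Python's `re' module stands in for the regular expression package, and the function name `split_records' is hypothetical.

```python
import re


def split_records(lines, record_sep):
    """Group lines into documents. A line matching record_sep closes the
    current document and is itself dropped, so it is never indexed."""
    pattern = re.compile(record_sep)
    records = []
    current = []
    for line in lines:
        if pattern.search(line):
            if current:
                records.append(current)
            current = []
        else:
            current.append(line)
    if current:
        records.append(current)
    return records


# Two documents separated by lines that contain a formfeed.
sample = ["TI: first", "AU: a", "\f", "TI: second", "\f"]
documents = split_records(sample, "\f")
```

Because each line is tested on its own, a separator expression that spans several lines can never match.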

Layout Section

The layout section must be enclosed by `layout:' and `end:'. It can contain multiple `headline:' directives and one `date:' directive. Each `headline:' keyword defines a region in the document which should be copied to a section of the headline. The region, defined by the regular expressions which start and end it, can occur any number of times in the document. The order of the directives defines the order of the sections in the headline.

headline: start end width [skip]

The above line tells the indexer to copy the text between the matches for start and end into the next width characters of the headline, optionally skipping the text matched by skip.

The `date:' directive defines how a date to be associated with the document can be extracted. Note that this date is not usable for searches. It is part of the headline and displayed to the user by some clients.

date: start scanfarg d-m-y d-m-y d-m-y skip

The start and skip work as above. The scanfarg parameter is passed to scanf to read in year, month, and day. The order of the arguments is indicated by the three d-m-y, each of which might be `year', `month', or `day'. The `month' might be followed by `string' to indicate that the month will be given as a standard three-letter abbreviation.

Here is a complete example:

layout:
headline: /^PY: / /^[A-Z][A-Z]:/ 5 /^PY: */
headline: /^AU: / /^[A-Z][A-Z]:/ 21 /^AU: */
headline: /^TI: / /^[A-Z][A-Z]:/ 41 /^TI: */
date:     /^ED: / /%d-%3s-%d/ day month string year /^ED: [^ ]/
end:
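
How the `headline:' directives of such a layout section combine into one headline can be sketched as follows. This is a simplified Python model, not the indexer's algorithm. It rests on several assumptions: only the first occurrence of each region is used, the end expression is tested only on lines after the start line, the skip text is removed from the start line, and each section is truncated or padded with blanks to its width. The document lines are invented for illustration.

```python
import re


def build_headline(lines, directives):
    """directives is a list of (start, end, width, skip) tuples, one per
    headline: line, in the order they appear in the layout section."""
    sections = []
    for start, end, width, skip in directives:
        text = []
        inside = False
        for line in lines:
            if not inside:
                if re.search(start, line):
                    inside = True
                    # drop the text matched by skip (e.g. the field tag)
                    text.append(re.sub(skip, "", line, count=1) if skip else line)
            elif re.search(end, line):
                break  # assumption: only the first region is copied
            else:
                text.append(line.strip())
        # truncate or pad the section to exactly width characters
        sections.append(" ".join(text)[:width].ljust(width))
    return "".join(sections)


directives = [
    (r"^PY: ", r"^[A-Z][A-Z]:", 5, r"^PY: *"),
    (r"^AU: ", r"^[A-Z][A-Z]:", 21, r"^AU: *"),
    (r"^TI: ", r"^[A-Z][A-Z]:", 41, r"^TI: *"),
]
# A made-up document in the style of the example database.
document = ["PY: 1990", "AU: Doe, Jane", "TI: A Title", "ED: 12-Dec-1994"]
headline = build_headline(document, directives)
```

The resulting headline is always 5 + 21 + 41 = 67 characters wide, with year, author, and title starting at fixed columns.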

Definition of Field and Index Types

The field specification part of the format description is made up of multiple `region:' ... `end:' groups. Each maps a region of text to a set of query categories, also known as fields.

region: start [skip]
   fieldlist options indexspecs
end: end

The start, skip, and end expressions define the region of the document for which the index specification applies.

fieldlist is the list of fields, each optionally followed by a description string enclosed in double quotes (`"'). The description is entered in the database description (see section The Database Description).

Options include the directives `numeric: skip width', `date: d-m-y d-m-y d-m-y', and `stemming'. The first option tells the indexer to allow numeric values only and makes sure that numeric atomic search expressions will work with the categories. The second tells the indexer to convert the date in the region to the `yymmdd' format before indexing it numerically. The last option tells the indexer to stem the words in the region before entering them in the index. When searching, the server also stems search terms prior to the index lookup.

indexspecs is a list of index types along with a keyword indicating if the region should be mapped to the designated categories (`LOCAL'), the default category (`GLOBAL') or to both (`BOTH'). The default category is used when none is specified in the query. See section Query Syntax.

Index types currently supported are `TEXT', `SOUNDEX' and `PHONIX'.

Consider the following example:

region: /^AU: /
        au "author names" SOUNDEX LOCAL TEXT BOTH
end: /^[A-Z][A-Z]:/

To the indexer this means:

For all words in the region that starts with `AU: ' at the beginning of a line and extends up to a line which starts with two capital letters followed by a colon, put the word in the default and the `au' category, and its soundex code only in the `au' category.

Thus an author name can be found in the created database in the default category or the `au' category if the exact spelling is known. If the name is misspelled, it might be found using the query `au=(soundex misspelled-name)'. See section Sample Format Description, section Query Syntax.
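
Why a misspelled name can still be found through its soundex code is easiest to see with a small example. The sketch below implements the classic American Soundex rules in Python. It is an assumption that the `SOUNDEX' index type produces the same codes; the implementation in waisindex may differ in details.

```python
# Letter groups of the classic Soundex algorithm.
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name):
    """Return the four-character Soundex code of name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        digit = _CODES.get(ch, "")  # vowels and Y have no code
        if digit and digit != previous:
            code += digit
        previous = digit
    return (code + "000")[:4]
```

Here `soundex("Pfeifer")' and `soundex("Pfeiffer")' both give `P160', so a query on the soundex code of either spelling would reach the same index entry.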

The Index Files

waisindex generates a number of index files. If called with `waisindex -d test -t fields ...' it uses:

`test.fmt'
The format definition. See section Building a Format Description. This file is not necessary if the option `-t fields' is not used.
`test.fde'
The optional format description. Plain text, which is added to the database description.
`test.syn'
The optional synonym file contains multiple lines with synonym terms separated by spaces.
wais freewais
programming hacking
`test.stop'
The optional stopword file contains words which should be ignored when indexing, one per line.

The following files are generated or modified by waisindex:

`test.src'
The database description. See section The Database Description.
`test.fn'
The filename list. One entry for each file in the database. See section The Filename List.
`test.hl'
The headline list. One entry for each document in the database. See section The Headline List.
`test.doc'
Document table. One entry for each document in the database. Contains pointers to the filename list and the headline list. See section The Document Table.
`test.cat'
The catalog file. One entry for each document in the database. A human readable combination of document table with headline list and filename list. This file may be very space consuming. You can avoid generating this file if you use waisindex with the `-nocat' option. See section waisindex Options. See section The Catalog Files.
`test.dct'
The global dictionary. One entry for each term in the default field. See section The Dictionary Files.
`test.inv'
The inverted file for the default field. For each term in the database, there is a list of postings giving the documents and positions in the documents where the term occurs. See section The Inverted Files.
`test.stop'
waisindex might add words to the stopword file if they occur too often and would break the index.

For each field in the format description a dictionary and an inverted file are generated. This applies only to waisindex run with the `-t fields' option.

`test_field_name.dct'
`test_field_name.inv'

The Database Description

The database description (`database.src') serves two purposes:

This is what a database description looks like (when waisindex is called without the `-export' option):

(:source 
   :version  3 
   :database-name "/usr/local/ls6/src+data/src/freeWAIS-sf-2.0/FIELD-EXAMPLE/test"
   :cost 0.00 
   :cost-unit :free 
   :maintainer "pfeifer@buster"
   :keyword-list (
                  fuhr
                  pfeifer
                  )
   :fields (
                  (:field
                   :name "ck"
                   :description "Citation Code")
                  (:field
                   :name "py"
                   :description "Publication Year (numeric)")
                  (:field
                   :name "au"
                   :description "Author")
                  (:field
                   :name "ti"
                   :description "Title")
                  (:field
                   :name "jt"
                   :description "Journal Title")
                  (:field
                   :name "ed"
                   :description "Date of insertion (date)")
           )
   :description "Server created with freeWAIS-sf 2.0 PL 30 on Oct  6 18:09:24 1995 by pfeifer@buster
The files of type fields used in the index were:
   /usr/local/ls6/src+data/src/freeWAIS-sf-2.0/FIELD-EXAMPLE/TEST
Here goes the contents of the format description file (database.fde)
"
)

With the `-export' option the first part changes to:

(:source 
   :version  3 
   :ip-address "129.217.20.190"
   :ip-name "buster"
   :tcp-port 210
   :database-name "test"
...

You can modify the database description, especially if you want to enter it in a directory of servers. Notice that waisindex might overwrite the description when re-indexing or adding to the database when the list of significant terms (keywords) changes. Therefore, use the format description file (database.fde) to add your comments on the database.

The Filename List

The filename list (`database.fn') contains for each source file of the database its filename, modification time, and the type of the documents.

The file starts with four zero bytes. Then, for each file, follows its name as a zero terminated string, its modification time as a four byte integer, and the display type as a zero terminated string.

Byte     Contents
0-3      0x00000000
4-52     `/usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST'
53       0x00
54-57    0x2EEC461D (-> 13:46 Dec 12 1994 )
58-61    `TEXT'
62       0x00
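
The layout above is simple enough to read with a few lines of Python. The following is a sketch based only on the byte table, not on the freeWAIS-sf sources; the function name is hypothetical, and big-endian byte order is an assumption taken from the way the modification time is written in the example.

```python
import struct


def parse_filename_list(data):
    """Return a list of (name, mtime, doc_type) tuples from the raw bytes
    of a filename list (`database.fn')."""
    entries = []
    pos = 4  # skip the four leading zero bytes
    while pos < len(data):
        name_end = data.index(b"\0", pos)
        name = data[pos:name_end].decode("latin-1")
        # assumption: the four byte modification time is big-endian
        (mtime,) = struct.unpack_from(">I", data, name_end + 1)
        type_start = name_end + 5
        type_end = data.index(b"\0", type_start)
        doc_type = data[type_start:type_end].decode("latin-1")
        entries.append((name, mtime, doc_type))
        pos = type_end + 1
    return entries


# The example entry from the table, rebuilt byte for byte.
path = b"/usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST"
example = b"\0\0\0\0" + path + b"\0" + bytes.fromhex("2EEC461D") + b"TEXT\0"
entries = parse_filename_list(example)
```

The rebuilt example is 63 bytes long, matching the byte positions 0 to 62 in the table.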

The Headline List

The headline list contains the headline, a null terminated string, for each document of the database. The first 4 bytes are zero.

Byte     Contents
0-3      0x00000000
4-73     ` 1990 Ait-Kaci, Hassan; Na Implementing a Knowledge-Based Library I'
74       0x00
75-144   ` 1990 Shepard, Michael A.; Transient Hypergraphs for Citation Netwo'
145      0x00
...

The length of the headlines is variable with an upper limit given by the macro MAX_HEADLINE_LEN in `config.h'.

`config.h': int MAX_HEADLINE_LEN

Maximum length of a headline. It may cause difficulties when increasing the value above the default value of 300. Some of the problems are fixed; maybe not all ;-)
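
Since every headline is null terminated, a headline can be fetched directly once its byte offset is known (the document table stores such offsets). A minimal Python sketch, with a hypothetical function name and a made-up two-entry list:

```python
def headline_at(data, offset):
    """Return the null terminated headline starting at byte offset."""
    end = data.index(b"\0", offset)
    return data[offset:end].decode("latin-1")


# Four zero bytes followed by two short, made-up headlines.
sample = b"\0\0\0\0" + b"first\0" + b"second\0"
first = headline_at(sample, 4)
second = headline_at(sample, 10)
```

The first headline always starts at offset 4; each following one starts right after the null byte of its predecessor.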

The Document Table

The document table contains one entry for each document in the database. Each entry contains a pointer to the filename list and the headline list. The start and end of the document in the file are also given. The other entries are the number of terms and lines and the date extracted from the document (see section Layout Section).

Byte     Contents
0-1      0x0000
2-4      Three byte offset in the filename list.
5-7      Three byte offset in the headline list. This entry may be four bytes if you selected the HUGE_HEADLINES option during configuration. See section Huge Headlines.
8-11     The first character of the document in the source file (four bytes).
12-15    The last character of the document in the source file (four bytes).
16-19    Number of terms in the document (four bytes).
20-22    Number of lines in the document (three bytes).
23-26    Date extracted from the document (yymmdd: a long int year*10000+month*100+day).
27-29    Next document: Three byte offset in the filename list.
...

The first 25-byte entry might be zero filled as well. Therefore the entry for document n starts at n*25+2.
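
The offset rule n*25+2 and the field widths from the table can be combined into a small reader. This Python sketch assumes big-endian integers and a database built without the HUGE_HEADLINES option (so the headline offset is three bytes); the function and key names are made up for illustration.

```python
ENTRY_SIZE = 25  # 3 + 3 + 4 + 4 + 4 + 3 + 4 bytes


def read_document_entry(data, n):
    """Decode the document table entry for document n from the raw bytes
    of a document table (`database.doc')."""
    base = n * ENTRY_SIZE + 2
    entry = data[base:base + ENTRY_SIZE]
    if len(entry) < ENTRY_SIZE:
        raise IndexError("no entry for document %d" % n)

    def number(raw):
        # assumption: multi-byte integers are stored big-endian
        return int.from_bytes(raw, "big")

    return {
        "filename_offset": number(entry[0:3]),
        "headline_offset": number(entry[3:6]),
        "start": number(entry[6:10]),
        "end": number(entry[10:14]),
        "terms": number(entry[14:18]),
        "lines": number(entry[18:21]),
        "date": number(entry[21:25]),  # year*10000 + month*100 + day
    }
```

The returned offsets can then be used to look up the file name and headline in the filename list and the headline list.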

The Catalog Files

Unless prohibited by the `-nocat' option (see section waisindex Options), waisindex generates a human readable version of the document table (see section The Document Table) with the filenames and headline included from the filename list (see section The Filename List) and the headline list (see section The Headline List).

Catalog for database: /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/test
Date: Jun  4 13:32:46 1995
276 total documents

Document # 1
Headline:  1990   Ait-Kaci, Hassan; Na  Implementing a Knowledge-Based Library I
DocID: 0 244 /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST

Document # 2
Headline:  1990   Shepard, Michael A.;  Transient Hypergraphs for Citation Netwo
DocID: 244 437 /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST

...

The Dictionary Files

The dictionary file (`database.dct', `database_fieldname.dct') contains entries for all valid terms in the database. An entry contains:

Bytes    Contents
21       The term as a zero terminated string. Therefore the length of the terms is 20 chars maximum.
4        Pointer in the inverted file. See section The Inverted Files.
4        Should be the number of documents the term occurs in. Is the total number of occurrences instead.

The entries are sorted by term and grouped in blocks of 1000 entries each. This main file is preceded by a directory containing the first entries of the main blocks. Bytes 21-24 of these directory entries contain the pointer to the main block; the last four bytes are zero. Entries in the index block are sorted too, so that binary search on the index can work. The file starts with the number of main blocks used (bytes 2-3 in the layout below). E.g., if we have two blocks the layout is as follows:

Bytes         Contents
0-1           0x0000
2-3           2
4-32          Directory entry for the first block (21 Bytes: Term, 4 Bytes pointer, 4 Bytes zero)
33-61         Directory entry for the second block
62-29061      First block
29062-58061   Second block
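
The byte positions in this layout follow from the entry size: 21 + 4 + 4 = 29 bytes per entry, hence 29000 bytes for a full block of 1000 entries. The Python sketch below recomputes the offsets and decodes a single entry. It assumes big-endian integers and that every block occupies the full 29000 bytes; both function names are hypothetical.

```python
TERM_SIZE = 21         # zero terminated term, at most 20 characters
ENTRY_SIZE = 29        # term + 4 byte pointer + 4 byte count
BLOCK_ENTRIES = 1000   # entries per main block


def parse_dictionary_entry(raw):
    """Split one 29 byte entry into (term, pointer, count)."""
    term = raw[:TERM_SIZE].split(b"\0", 1)[0].decode("latin-1")
    pointer = int.from_bytes(raw[21:25], "big")  # assumption: big-endian
    count = int.from_bytes(raw[25:29], "big")
    return term, pointer, count


def dictionary_layout(n_blocks):
    """Return the start offsets of the directory entries and main blocks."""
    directory_start = 4
    first_block = directory_start + n_blocks * ENTRY_SIZE
    block_size = BLOCK_ENTRIES * ENTRY_SIZE
    return {
        "directory": [directory_start + i * ENTRY_SIZE for i in range(n_blocks)],
        "blocks": [first_block + i * block_size for i in range(n_blocks)],
    }


layout = dictionary_layout(2)
```

For two blocks this gives directory entries at bytes 4 and 33 and main blocks starting at bytes 62 and 29062.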

The Inverted Files

The inverted files are the heart of the index. For each term, they contain the references to the documents in which it occurs. The first 4 bytes contain the number of all terms in the file. The rest of the file is made up of term entries. Note that the following is mainly debugger knowledge; Tung's documentation is not very good here. But a Perl script using this definition was able to parse my inverted files ;-).

Header 1

Bytes      Contents
1          `{', the dictionary flag.
6          0x000000000000. Probably used during building?
2          Length of header 1.
4          Total number of occurrences of the term. Multiple occurrences in documents count!
variable   The term terminated by a newline. Variatio delectat.

Header 2

This is where the entries in the dictionary point to. See section The Dictionary Files.

Bytes      Contents
1          `E', the full flag, indicating a valid entry.
4          Number of postings following this header.
4          Total size of the postings following this header.

Postings

A list of postings, each referring to one document.

Bytes      Contents
4          The document id. This is the entry number in the document table. See section The Document Table.
4          The total size of the character position list including the first one(!). If you use the proximity operators (see section Proximity vs. String Search) the word position is stored instead. This also applies to the rest of the posting list.
4          A single precision float which gives the weight of the term with respect to the document. See section Query Weighting.
3          The position of the first occurrence of the term in the document.
variable   If there is more than one occurrence of the term in the document, it will be followed by a list of compressed integers. From each byte, bits 6 to 0 are used. If 7 bits are not enough to store the number, bit 7 of the higher order bytes is set. So 0 to 127 is encoded as usual, 128 is encoded as 0x8100, 129 as 0x8101, and so on.
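
The compressed integer encoding used for the position list can be sketched in Python as follows. This is written from the description above, not from the freeWAIS-sf sources, and the function names are made up: seven payload bits per byte, most significant group first, with bit 7 set on every byte except the last.

```python
def encode_compressed(n):
    """Encode a non-negative integer as a compressed integer."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [n & 0x7F]  # lowest 7 bits, stored last, bit 7 clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)  # higher order bytes carry bit 7
        n >>= 7
    return bytes(reversed(groups))


def decode_compressed(data, pos=0):
    """Decode one compressed integer starting at pos.
    Returns (value, position of the next byte)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos
```

Calling `decode_compressed' repeatedly, each time with the position returned by the previous call, walks through a whole position list.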

