If you successfully compiled and installed the system, you have already built your own database. This database will serve as an example throughout this section.
Let's first define some terms.
Index files are the files generated by waisindex which are used by the server to perform searches. See section The Index Files.
In general you have two options when indexing a new database: you can use a built-in format or use the `-t fields' option. The first alternative is not covered by this manual, since the latter is more general and you can emulate the built-in formats with field indexing.
The following is mostly drawn from the
Pass `-t fields' to use the field indexing. The file `database.fmt' must contain the format description.
waisindex might add some words if they occur too often.
The trickiest part when building a new database is to write a format
description -- the horrible `
An important thing to know when writing format descriptions is that
waisindex parses source files line by line. All of the
nice regular expressions have to match within a single line, so including
multiple linefeeds in an expression will cause the matcher to fail.
All regular expressions in the format description must be enclosed between two `/'. Therefore a `/' in the regular expression must be escaped with a `\'. To match `TCP/IP' write `/TCP\/IP/'. Consequently, a `\' must be escaped too ;-). Other escape sequences interpreted by the parser are:
After escaping is resolved, the string is passed to a standard regular expression package (or your system's own; see section Regular Expressions).
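The escaping rule can be sketched in Python. This is only an illustration of the two rules stated above (escape `\' and `/', then enclose the result in `/'); the helper name to_format_regexp is hypothetical and is not part of waisindex. It is meant for expressions that should reach the regular expression package unchanged.

```python
def to_format_regexp(regexp: str) -> str:
    """Wrap a plain regular expression for use in a format description.

    Backslashes are doubled first, then slashes are escaped, so that the
    parser hands the original expression to the regex package unchanged.
    """
    escaped = regexp.replace("\\", "\\\\").replace("/", "\\/")
    return "/" + escaped + "/"


print(to_format_regexp("TCP/IP"))    # prints /TCP\/IP/
print(to_format_regexp(r"^a\.b"))    # prints /^a\\.b/
```

The order matters: escaping `/' first would produce backslashes that the second step doubles again.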
A format description consists of three parts. The first part -- usually just one line -- describes how the files should be split into documents. The second part, called the Layout Section, defines how a headline for each document should be computed.
The last and usually largest part defines which fields should be generated and how the indexing should be done.
A format description begins with a `record-sep:' directive.
record-sep: regexp
Here regexp might be `/\f/' to match a line containing a formfeed character. Note that the line matching the `record-sep:' regular expression is never indexed!
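The effect of the record separator can be sketched as follows. This is a minimal Python model of the behaviour described above, not the indexer's actual code; the function name split_records and the sample lines are made up for illustration.

```python
import re


def split_records(lines, record_sep):
    """Split a list of lines into documents at lines matching record_sep.

    The separator line itself is dropped, mirroring the note that it is
    never indexed.
    """
    sep = re.compile(record_sep)
    records, current = [], []
    for line in lines:
        if sep.search(line):
            if current:
                records.append(current)
            current = []
        else:
            current.append(line)
    if current:
        records.append(current)
    return records


sample = ["AU: first author", "TI: first title", "\f", "AU: second author"]
print(split_records(sample, r"\f"))
```

Because matching is done line by line, a separator expression can never span two lines.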
The layout section must be enclosed by `layout:' and `end:'. It can contain multiple `headline:' directives and one `date:' directive. Each `headline:' keyword defines a region in the document which should be copied to a section of the headline. The region, defined by the regular expressions which start and end it, can occur any number of times in the document. The order of the directives defines the order of the sections in the headline.
headline: start end width [skip]
The above line instructs the indexer to copy the text between the matches for start and end, optionally skipping the text matched by skip, to the next width characters of the headline.
The `date:' directive defines how a date to be associated with the document can be extracted. Note that this date is not usable for searches. It is part of the headline and displayed to the user by some clients.
date: start scanfarg d-m-y d-m-y d-m-y skip
The start and skip work as above. The scanfarg
parameter is passed to
scanf to read in year, month, and day. The
order of the arguments is indicated by the three d-m-y, each of which
may be `year', `month', or `day'. The `month'
may be followed by `string' to indicate that the month is
given as a standard three-letter abbreviation.
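To make the argument order concrete, here is a rough Python sketch of reading a date with the scanf format `%d-%3s-%d' and the order `day month string year'. Python has no scanf, so the format is translated by hand into a regular expression; the names parse_date, DATE_RE and MONTHS and the sample line are assumptions for illustration only.

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Hand-written equivalent of the scanf format "%d-%3s-%d" (an assumption:
# this only approximates scanf behaviour).
DATE_RE = re.compile(r"(\d+)-(\S{1,3})-(\d+)")


def parse_date(line, order=("day", "month", "year"), month_is_string=True):
    """Return (year, month, day) read from line, or None if nothing matches."""
    match = DATE_RE.search(line)
    if match is None:
        return None
    parts = dict(zip(order, match.groups()))
    if month_is_string:
        name = parts["month"].lower()
        if name not in MONTHS:
            return None
        month = MONTHS.index(name) + 1
    else:
        month = int(parts["month"])
    return int(parts["year"]), month, int(parts["day"])


print(parse_date("ED: 6-Oct-1995"))
```

The order tuple plays the role of the three d-m-y keywords: it says which scanned value is the day, which the month, and which the year.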
Here is a complete example:
layout:
headline: /^PY: / /^[A-Z][A-Z]:/ 5 /^PY: */
headline: /^AU: / /^[A-Z][A-Z]:/ 21 /^AU: */
headline: /^TI: / /^[A-Z][A-Z]:/ 41 /^TI: */
date: /^ED: / /%d-%3s-%d/ day month string year /^ED: [^ ]/
end:
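How the three `headline:' directives combine into one headline can be approximated in Python. This sketch makes several assumptions that the manual does not confirm: only the first occurrence of each region is used, the text is cut or padded to exactly width characters, and the sample record is invented. The function names are hypothetical.

```python
import re


def extract_region(lines, start, end, skip=None):
    """Collect text from the first line matching start up to, but not
    including, the next line matching end."""
    collected = []
    inside = False
    for line in lines:
        if inside:
            if re.search(end, line):
                break
            collected.append(line.strip())
        elif re.search(start, line):
            inside = True
            if skip:
                line = re.sub(skip, "", line, count=1)
            collected.append(line.strip())
    return " ".join(collected)


def build_headline(lines, directives):
    """Concatenate one fixed-width section per directive, in directive order."""
    parts = []
    for start, end, width, skip in directives:
        text = extract_region(lines, start, end, skip)
        parts.append(text[:width].ljust(width))
    return "".join(parts)


DIRECTIVES = [
    (r"^PY: ", r"^[A-Z][A-Z]:", 5, r"^PY: *"),
    (r"^AU: ", r"^[A-Z][A-Z]:", 21, r"^AU: *"),
    (r"^TI: ", r"^[A-Z][A-Z]:", 41, r"^TI: *"),
]
RECORD = ["AU: Doe, Jane", "TI: An Example Title", "PY: 1995"]  # invented record
headline = build_headline(RECORD, DIRECTIVES)
print(repr(headline))
```

Note that the sections follow the order of the directives, not the order of the fields in the record.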
The field specification part of the format description is made up of multiple `region:' ... `end:' groups. Each maps a region of text to a set of query categories, aka fields.
region: start [skip]
        fieldlist
        options
        indexspecs
end: end
The start, skip, and end expressions define the region of the document for which the index specification applies.
fieldlist is the list of fields, each optionally followed by a description string enclosed in double quotes (`"'). The description is entered in the database description (see section The Database Description).
Options include the directives `numeric: skip width', `date: d-m-y d-m-y d-m-y', and `stemming'. The first option instructs the indexer to allow numeric values only and makes sure that numeric atomic search expressions will work with the categories. The second instructs the indexer to convert the date in the region to the `yymmdd' format before indexing it numerically. The last option tells the indexer to stem the words in the region before entering them in the index. When searching, the server also stems search terms prior to the index lookup.
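The `yymmdd' conversion is simple arithmetic. The following Python fragment is a sketch of that conversion only; the function name to_yymmdd is hypothetical, and how the indexer handles the century is not stated in this manual.

```python
def to_yymmdd(year: int, month: int, day: int) -> int:
    """Encode a date as the number yymmdd, using the last two year digits."""
    return (year % 100) * 10000 + month * 100 + day


print(f"{to_yymmdd(1995, 10, 6):06d}")  # prints 951006
```

Within one century, numbers built this way sort in the same order as the dates they stand for, which is what makes numeric comparisons on the field meaningful.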
indexspecs is a list of index types along with a keyword indicating if the region should be mapped to the designated categories (`LOCAL'), the default category (`GLOBAL') or to both (`BOTH'). The default category is used when none is specified in the query. See section Query Syntax.
Index types currently supported are `TEXT', `SOUNDEX' and `PHONIX'.
Consider the following example:
region: /^AU: /
        au "author names"
        SOUNDEX LOCAL TEXT BOTH
end: /^[A-Z][A-Z]:/
To the indexer this means:
For all words in the region that starts with `AU: ' at the beginning of a line and extends up to a line which starts with two capital letters followed by a colon, put the word in the default and the `au' category and its soundex code only in the `au' category.
Thus an author name can be found in the created database in the default category or the `au' category if the exact spelling is known. If the name is misspelled, it might be found using the query `au=(soundex misspelled-name)'. See section Sample Format Description, section Query Syntax.
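To see why a misspelled name can still be found, here is a sketch of the classic Soundex coding in Python. It follows the commonly documented rules (first letter kept, consonants mapped to digits, repeated codes collapsed, result padded to four characters); whether waisindex uses exactly this variant is an assumption.

```python
def soundex(name: str) -> str:
    """Return the classic four-character Soundex code for name."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]


print(soundex("Pfeifer"), soundex("Pfeiffer"))  # prints P160 P160
```

Both spellings map to the same code, so a soundex query for one spelling matches documents indexed under the other.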
waisindex generates a number of index files. If called with
`waisindex -d test -t fields ...' it uses:
wais freewais programming hacking
The following files are generated or modified by
waisindex with the `-nocat' option. See section
waisindex Options. See section The Catalog Files.
waisindex might add words to the stopword file if they occur too often and would break the index.
For each field in the format description a dictionary and an inverted
file are generated. This applies only to
waisindex run with the
`-t fields' option.
The database description (`database.src') serves two purposes:
This is what a database description looks like (when waisindex is called without the `-export' option):
(:source
   :version 3
   :database-name "/usr/local/ls6/src+data/src/freeWAIS-sf-2.0/FIELD-EXAMPLE/test"
   :cost 0.00
   :cost-unit :free
   :maintainer "pfeifer@buster"
   :keyword-list (
      fuhr
      pfeifer
   )
   :fields (
      (:field :name "ck" :description "Citation Code")
      (:field :name "py" :description "Publication Year (numeric)")
      (:field :name "au" :description "Author")
      (:field :name "ti" :description "Title")
      (:field :name "jt" :description "Journal Title")
      (:field :name "ed" :description "Date of insertion (date)")
   )
   :description "Server created with freeWAIS-sf 2.0 PL 30 on Oct 6 18:09:24 1995 by pfeifer@buster
The files of type fields used in the index were:
   /usr/local/ls6/src+data/src/freeWAIS-sf-2.0/FIELD-EXAMPLE/TEST

Here goes the contents of the format description file (database.fde)
"
)
With the `-export' option the first part changes to:
(:source
   :version 3
   :ip-address "220.127.116.11"
   :ip-name "buster"
   :tcp-port 210
   :database-name "test"
   ...
You can modify the database description, especially if you want to enter
it in a directory of servers. Notice that waisindex might overwrite the
description when re-indexing or adding to the database when the list of
significant terms (keywords) changes. Therefore, use the format
description file (`database.fde') to add your comments on the
The filename list (`database.fn') contains for each source file of the database its filename, modification time, and the type of the documents.
The file starts with four zero bytes. Then for each file follow its name as a zero-terminated string, its modification time as a four-byte integer, and the display type as a zero-terminated string.
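A reader for this layout might look like the following Python sketch. The byte order of the four-byte integer is not stated here; big-endian is assumed, as is Latin-1 for the strings. The function name and the sample path are made up for illustration.

```python
import struct


def parse_filename_list(data: bytes):
    """Return a list of (filename, mtime, doctype) tuples from a .fn image."""
    if data[:4] != b"\x00\x00\x00\x00":
        raise ValueError("filename list must start with four zero bytes")
    pos = 4
    entries = []
    while pos < len(data):
        end = data.index(b"\x00", pos)              # zero-terminated name
        name = data[pos:end].decode("latin-1")
        pos = end + 1
        (mtime,) = struct.unpack_from(">I", data, pos)  # assumed big-endian
        pos += 4
        end = data.index(b"\x00", pos)              # zero-terminated type
        doctype = data[pos:end].decode("latin-1")
        pos = end + 1
        entries.append((name, mtime, doctype))
    return entries


sample = (b"\x00\x00\x00\x00" + b"/tmp/TEST\x00"
          + struct.pack(">I", 812999364) + b"fields\x00")
print(parse_filename_list(sample))
```

If the timestamps come out implausible on a real file, trying "<I" (little-endian) in place of ">I" is the first thing to check.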
The headline list contains the headline, a null-terminated string, for each document of the database. The first 4 bytes are zero.
The length of the headlines is variable with an upper limit given by the
MAX_HEADLINE_LEN in `config.h'.
`config.h': int MAX_HEADLINE_LEN
Maximum length of a headline. Increasing the value above the default of 300 may cause difficulties. Some of the problems are fixed; maybe not all ;-)
The document table contains one entry for each document in the database. Each entry contains a pointer to the filename list and a pointer to the headline list. The start and end of the document in the file are also given. The other entries are the number of terms and lines and the date extracted from the document (see section Layout Section).
HUGE_HEADLINES option during configuration. See section Huge Headlines.
The first 25-byte entry might be zero-filled as well. Therefore the entry for document n starts at n*25+2.
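The offset rule can be written down directly. The Python sketch below only locates the raw bytes of an entry; it does not decode the fields inside it. It assumes the default 25-byte entry size, and the constant and function names are hypothetical.

```python
ENTRY_SIZE = 25   # bytes per document table entry (default configuration)
HEADER_SIZE = 2   # bytes preceding entry 0, implied by the n*25+2 rule


def entry_offset(n: int) -> int:
    """Byte offset of the entry for document n."""
    return n * ENTRY_SIZE + HEADER_SIZE


def read_raw_entry(table: bytes, n: int) -> bytes:
    """Return the undecoded 25 bytes of entry n."""
    offset = entry_offset(n)
    entry = table[offset:offset + ENTRY_SIZE]
    if len(entry) < ENTRY_SIZE:
        raise IndexError("document number out of range")
    return entry


# Synthetic table: 2 header bytes, a zero-filled entry 0, then entry 1.
table = bytes(HEADER_SIZE) + bytes(ENTRY_SIZE) + bytes(range(ENTRY_SIZE))
print(entry_offset(1), read_raw_entry(table, 1)[:4])
```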
Unless prohibited by the `-nocat' option (see section
waisindex Options), waisindex generates a human-readable version of the
document table (see section The Document Table) with the filenames and headlines
included from the filename list (see section The Filename List) and the headline
list (see section The Headline List).
Catalog for database: /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/test
Date: Jun 4 13:32:46 1995
276 total documents

Document # 1
Headline: 1990 Ait-Kaci, Hassan; Na Implementing a Knowledge-Based Library I
DocID: 0 244 /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST

Document # 2
Headline: 1990 Shepard, Michael A.; Transient Hypergraphs for Citation Netwo
DocID: 244 437 /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST
...
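Since the catalog is plain text, it is easy to read back with a script. The Python sketch below extracts the per-document entries; reading the two numbers after `DocID:' as the start and end of the document in the source file is an assumption based on the document table description, and parse_catalog is a hypothetical helper name.

```python
import re

# One catalog entry: document number, headline, two offsets and a file name.
DOC_RE = re.compile(
    r"Document #\s*(\d+)\s+Headline:\s*(.*?)\s+DocID:\s*(\d+)\s+(\d+)\s+(\S+)",
    re.S,
)


def parse_catalog(text: str):
    """Return one dict per 'Document #' entry found in the catalog text."""
    return [
        {
            "number": int(m.group(1)),
            "headline": m.group(2),
            "start": int(m.group(3)),   # assumed: start offset in the file
            "end": int(m.group(4)),     # assumed: end offset in the file
            "filename": m.group(5),
        }
        for m in DOC_RE.finditer(text)
    ]


CATALOG = """Document # 1
Headline: 1990 Ait-Kaci, Hassan; Na Implementing a Knowledge-Based Library I
DocID: 0 244 /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST

Document # 2
Headline: 1990 Shepard, Michael A.; Transient Hypergraphs for Citation Netwo
DocID: 244 437 /usr/local/ls6/src/freeWAIS-sf/FIELD-EXAMPLE/TEST
"""
docs = parse_catalog(CATALOG)
print(docs[0]["start"], docs[0]["end"], docs[1]["start"])
```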
The dictionary file (`database.dct', `database_fieldname.dct') contains entries for all valid terms in the database. An entry contains:
The entries are sorted by term and grouped in blocks of 1000 entries each. This main file is preceded by a directory containing the first entries of the main blocks. Bytes 21-24 of these directory entries contain the pointer to the main blocks; the last four bytes are zero. Entries in the index block are sorted too, so that binary search on the index can work. The first two bytes of the file give the number of main blocks used. E.g., if we have two blocks the layout is as follows:
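The two-level lookup that this layout allows can be modelled in memory. The Python sketch below does not read the on-disk format; it only shows the idea of a binary search in the directory followed by a binary search inside the selected block. The function names are hypothetical, and the block size is reduced in the example so that two blocks appear.

```python
from bisect import bisect_left, bisect_right

BLOCK_SIZE = 1000  # entries per main block, as described above


def build_blocks(sorted_terms, block_size=BLOCK_SIZE):
    """Group sorted terms into blocks; the directory holds each block's first term."""
    blocks = [sorted_terms[i:i + block_size]
              for i in range(0, len(sorted_terms), block_size)]
    directory = [block[0] for block in blocks]
    return directory, blocks


def lookup(directory, blocks, term):
    """Return (block index, position in block) for term, or None if absent."""
    i = bisect_right(directory, term) - 1   # last block starting at or before term
    if i < 0:
        return None
    block = blocks[i]
    j = bisect_left(block, term)
    if j < len(block) and block[j] == term:
        return i, j
    return None


directory, blocks = build_blocks(["fuhr", "hacking", "pfeifer", "wais"], block_size=2)
print(directory, lookup(directory, blocks, "wais"))
```

With N terms this needs one search over about N/1000 directory entries and one over at most 1000 block entries, so only a single main block has to be read per lookup.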
The inverted files are the heart of the index. For each term they contain references to the documents in which it occurs. The first 4 bytes contain the number of all terms in the file. The rest of the file is made up of term entries. Note that the following is mainly debugger knowledge; Tung's documentation is not very good here. But a Perl script using this definition was able to parse my inverted files ;-).