Here is an excerpt of the corresponding SMART routine:

The documents would be represented by term vectors of the form

$$D = (t_1, w_{d1};\ t_2, w_{d2};\ \ldots;\ t_t, w_{dt})$$

where each $t_k$ identifies a content term assigned to some sample document $D$ and $w_{dk}$ represents the weight of term $t_k$ in document $D$ (or query $Q$). Thus, a typical query $Q$ might be formulated as

$$Q = (q_1, w_{q1};\ q_2, w_{q2};\ \ldots;\ q_t, w_{qt})$$

where once again $q_k$ represents a term assigned to query $Q$, with weight $w_{qk}$. The weights could be allowed to vary continuously between 0 and 1, the higher weight assignments near 1 being used for the most important terms, whereas lower weights near 0 would characterize the less important terms. Given the vector representation, a query-document similarity value may be obtained by comparing the corresponding vectors, using for example the conventional vector product formula

$$\mathrm{similarity}(Q, D) = \sum_{k=1}^{t} w_{qk} \cdot w_{dk}$$
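The vector product comparison can be sketched in a few lines of Python. This is only an illustration, not the SMART routine itself: the function name `similarity`, the dictionary representation of the vectors, and the sample terms and weights are all assumptions made for the example.

```python
def similarity(query_weights, doc_weights):
    """Inner (vector) product of two sparse term-weight vectors.

    Both arguments are hypothetical dicts mapping a term to its weight.
    Terms missing from the document contribute zero to the sum.
    """
    return sum(
        weight * doc_weights.get(term, 0.0)
        for term, weight in query_weights.items()
    )


# Illustrative weights in the range 0..1 (made up for this example).
query = {"retrieval": 1.0, "weighting": 0.5}
doc = {"retrieval": 0.8, "weighting": 0.5, "vector": 0.3}

score = similarity(query, doc)
print(round(score, 2))
```

Only terms shared by the query and the document contribute to the score, so a sparse representation avoids multiplying by the many zero weights.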

Three factors are important for term weighting:

- term frequency in individual document (recall)
- inverse document frequency (precision)
- document length (vector length)

Term frequency component used: augmented normalized term frequency, $0.5 + 0.5 \cdot \mathrm{tf} / \max \mathrm{tf}$ (tf factor normalized by the maximum tf in the vector, and further normalized to lie between 0.5 and 1.0).
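As a rough sketch, the augmented normalized term frequency is commonly defined as 0.5 + 0.5 * tf / max tf, where max tf is the largest raw frequency in the same vector. The Python below assumes that definition. The function name `augmented_tf` and the sample token list are hypothetical, and the normalization component is deliberately left out.

```python
from collections import Counter


def augmented_tf(terms):
    """Map each term to 0.5 + 0.5 * tf / max_tf for one document.

    `terms` is a hypothetical list of already-extracted content terms.
    The result lies between 0.5 and 1.0 for every term that occurs.
    """
    counts = Counter(terms)
    if not counts:
        return {}
    max_tf = max(counts.values())
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}


# Made-up document: "term" occurs 3 times, "weight" twice, "vector" once.
weights = augmented_tf(["term", "weight", "term", "vector", "term", "weight"])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 3))
```

The most frequent term always receives weight 1.0, and a term that occurs once can never fall below 0.5, which dampens the effect of very long documents with high raw counts.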

Collection frequency component used: no change in weight; use original term frequency component.

Normalization component used: .

Thus, document term weight is:

By query term weighting, it is assumed that tf is equal to . So that .

___________________________________________________


Thu May 25 16:37:04 MET DST 1995