Here is an excerpt of the corresponding SMART routine:
The documents would be represented by term vectors of the form

$$D = (t_1, w_{d1};\; t_2, w_{d2};\; \ldots;\; t_t, w_{dt})$$

where each $t_k$ identifies a content term assigned to some sample document $D$ and $w_{dk}$ represents the weight of term $t_k$ in document $D$ (or $w_{qk}$ in query $Q$). Thus, a typical query $Q$ might be formulated as

$$Q = (q_1, w_{q1};\; q_2, w_{q2};\; \ldots;\; q_t, w_{qt})$$

where once again $q_k$ represents a term assigned to query $Q$. The weights could be allowed to vary continuously between 0 and 1, the higher weight assignments near 1 being used for the most important terms, whereas lower weights near 0 would characterize the less important terms. Given the vector representation, a query-document similarity value may be obtained by comparing the corresponding vectors, using for example the conventional vector product formula

$$\text{similarity}(Q, D) = \sum_{k=1}^{t} w_{qk} \cdot w_{dk}$$
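As a minimal sketch of the vector product formula, the snippet below stores each vector as a dictionary mapping a term to its weight, so that terms absent from a vector implicitly carry weight 0. The function name `inner_product_similarity` and the sample terms and weights are illustrative assumptions, not part of the original routine.

```python
def inner_product_similarity(query_weights, doc_weights):
    """Sum of w_qk * w_dk over the terms shared by query and document."""
    # A term missing from either vector has weight 0 and contributes nothing,
    # so only the terms present in both dictionaries need to be visited.
    return sum(
        weight * doc_weights[term]
        for term, weight in query_weights.items()
        if term in doc_weights
    )


# Hypothetical weights in the range [0, 1], chosen only for illustration.
query = {"retrieval": 1.0, "weighting": 0.5}
document = {"retrieval": 0.5, "weighting": 1.0, "vector": 0.25}

score = inner_product_similarity(query, document)
print(score)
```

The sparse dictionary form avoids materializing all $t$ positions of each vector, which is usually convenient when most weights are 0.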
Three factors are important for term weighting:
Term frequency component used: augmented normalized term frequency ($tf$ factor normalized by the maximum $tf$ in the vector, and further normalized to lie between 0.5 and 1.0), that is, $0.5 + 0.5\,\frac{tf}{\max tf}$.
Collection frequency component used: no change in weight; use original term frequency component.
Normalization component used: .
Thus, document term weight is:
For query term weighting, it is assumed that tf is equal to . So that .
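The augmented normalized term frequency component can be sketched as follows, assuming its usual form $0.5 + 0.5 \cdot tf / \max tf$. Since the collection frequency component leaves the weight unchanged, it is treated here as a factor of 1. The normalization component is deliberately left out because its formula is not given in this excerpt. The function name `augmented_tf` and the sample token list are hypothetical.

```python
from collections import Counter


def augmented_tf(terms):
    """Map each term to 0.5 + 0.5 * tf / max_tf for one document or query."""
    counts = Counter(terms)
    if not counts:
        # An empty vector has no maximum tf, so there is nothing to weight.
        return {}
    max_tf = max(counts.values())
    # Collection frequency factor is 1 (no change), so it is omitted.
    # No normalization component is applied in this sketch.
    return {term: 0.5 + 0.5 * tf / max_tf for term, tf in counts.items()}


# Hypothetical token list for a single document.
weights = augmented_tf(["term", "weight", "term", "vector", "term", "term"])
print(weights)
```

Every term that occurs at all receives a weight of at least 0.5, and the most frequent term in the vector always receives 1.0.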