next up previous
Next: Retrieval of XML documents Up: pire Previous: PDatalog++ engine

Subsections


PIRE implementation

PIRE is a probabilistic IR engine which is based on the principles introduced in the previous chapters. This chapter describes the internals of PIRE.

Each attribute/operator pair has its own index, which is a non-empty set of pDatalog++ relations with some additional methods. The relations in different indexes can be distinguished by prefixing the relation name with the attribute name, if this is required by the underlying pDatalog++ engine.

PIRE supports arbitrary index types (as Java classes) as long as they implement the Java interface for the index. This allows for using different index types with the same IR engine (and the same data types, see below). Currently only one index type is available, which uses pDatalog++ not only as an interface to the other components, but also internally as it builds on the pDatalog++ engine introduced in chapter 1.

PIRE also uses classes for modelling data types. Although major parts of the data type can be expressed in pDatalog++, some parts (mainly splitting an attribute value or a comparison value into tokens) should be implemented in Java. Thus, each data type has its class, which works on a specified index.

Figure 2.1 shows the different layers of PIRE. On top, there is a thin, simple XML layer (see section 3). The actual PIRE layer contains data types and indexes (including the pDatalog-based index class). The third layer contains the pDatalog++ engine, using an abstract SQL layer, which finally accesses a relational database management system.

Figure 2.1: PIRE architecture


\includegraphics[scale=0.4]{architecture}

Methods for handling an index

The interface de.unidu.is.retrieval.Index is responsible for handling an IR index. It contains high-level (specialised on IR) and low-level (for arbitrary relations) methods.

Each index uses pre-defined relations with the local names docid, tf, weight, rd and idb_rd, and potentially further data type-specific relations.

The higher, IR-specific level methods for creating an index are:

init()
creates the pre-defined relations for this index.
insert(String)
adds the specified document id to the relation docid.
insert(String,String,String)
adds the specific document id, token and token frequency to the relation tf.
computeMoments()
computes the moments of the indexing weights in the relation expectation (the expectation $ \mu$) and variance (the variance $ \sigma^2$). These relations are mainly used for resource selection.
remove()
removes the pre-defined relations from this index.

Computing the weight relation is out of the scope if the index interface, as this typically involves data type-specific knowledge. Thus, this is defined in the data type classes.


In addition, it supports retrieval on the index. One relation [c]_[a]_[o]_rsv[query id]_[i] is used for each sub-query. Each sub-query uses only one attribute/operator pair, and thus only one index (in which RSVs and probabilities for this sub-query are computed. These RSVs are mapped onto a probability of relevance via mapping functions (in a temporary relation [c]_[a]_[o]_prob[query id]_[i]). In a final step, these probabilities are combined in the relation [c]_prob[query id], which is contained in a new index (used only for that purpose), and not in the attribute/operator indexes. The current interface de.unidu.is.retrieval.Index does not provide any method for directly transferring the data from one index into another; instead, it is required to use the same index class for every index, and moving data (if required) is left to the actual index class.

These methods are provided for performing retrieval:

initQuery(String)
initialises the specified query.
getMaxRSV(String,String)
returns the maximum RSV $ \mathit{Pr}(q_i \leftarrow d)$ for the specified sub-query. This method is mainly intended for use by the data type classes.
addRule(String,String[],List)
adds a rule for computing the prob relation to the specified list. The body is the conjunction of the specified unary relations which contain the result for a single condition.
computeProbs(String,List,boolean)
applies the specified rules for the relation prob (where disjointness of the rules can be specified), i.e. it computes probabilities of relevance $ \mathit{Pr}($rel$ \vert q,d)$ from RSVs $ \mathit{Pr}(q_i \leftarrow d)$, where the rules act on behalf of mapping functions.
getProbs(String,int)
returns the first top-ranked document ids together with the probabilities of relevance (as a list of de.unidu.is.retrieval.ProbDoc instances).
closeQuery(String)
frees resources acquired for processing the specified query (e.g. temporary relations).

Methods for computing RSVs are left out here as this is covered by data type classes (see section 2.2).

In addition, also computing the moments for weighted sum queries is supported:

addMomentsCondition(String,weight,Object,List)
adds rules for computing the moments to the specified list, using the specified queryID, weight and comparison value.
computeMoments(String,List)
applies the specified rules for computing the moments. The results are stored in the prob relation, with the document ids expectation and variance (i.e., by partially abusing the existing infrastructure for document retrieval).


The low-level methods allow for direct manipulation of arbitrary relations:

convert(String)
converts a local relation name into an absolute one, using the collection name, the attribute and the operator name as a prefix.
convertCollection(String)
converts a local relation name into an absolute one, using only the collection name as a prefix.
getRD(String)
returns the resource description value for the specified key.
getRD(String,double)
returns the resource description value for the specified key. If no value is the stored, the specified default value is returned.
setRD(String,double)
sets the resource description value for the specified key with the specified value.
add(Fact)
adds a fact to the index.
compute(Rule)
evaluates the specified rule on the index.
compute(IDBRelation,Collection)
evaluates the specified collection of rules with common head predicate.
computeDisjoint(IDBRelation,Collection)
evaluates the specified collection of rules with common head predicate, assuming disjointness.
computeDisjoint(String,int,Collection)
evaluates the specified collection of rules with common head predicate, assuming disjointness.
iterator(String)
returns all facts in the specified relation.
addEDBRelation(String,int,boolean)
adds an extensional relation (defined by facts) to the index.
addIDBRelation(String,int,boolean)
adds an intensional relation (defined by rules) to the index.
removeRelation(String)
removes the specified relation from the index.
completeRelation(String)
signals that the specified relation is complete; the actual index class can perform some post-processing for efficiency improvements.
getRSVRelation(String,String)
returns the name for the RSV relation, following the scheme [c]_[a]_[o]_rsv[query id]_[i].
getProbRelation(String,String)
returns the name for the relation with the probabilities of relevance, following the scheme [c]_[a]_[o]_prob[query id]_[i].


Methods for describing data types

The data type classes cover only the parts of the retrieval process which depend on a specific data type and operator.

These methods support creating an index:

convertOperator(String)
converts an operator name into an identifier (which only contains alphanumeric characters and the underscore).
addToIndex(Index,String,String,Object)
adds a document/value to an index w.r.t. an operator; this method splits the attribute content into tokens (e.g. stemming and stop-word removal), calculates the token frequencies, and calls the specified index via insert(String,String,int) for inserting them into the tf relation.
computeIndex(Index,String)
computes data type/operator-specific indexing weights.
removeIndex(Index,String)
removes the data type/operator-specific relations from the specified index.

E.g., a simple Boolean weighting (which can be used for numbers) can be implemented using the rule

weight(D,V) :- tf(D,V,TF).

A normalised tf weighting schema requires the rules:

dl(D,DL) :- sum(DL,D,{tf(D,_,#)}).
weight(D,V) :- tf(D,V,TF) & dl(D,DL) | (TF/DL).


In addition, the rules for computing RSV for a single condition, and for probabilities of relevance for a sub-query depend on the data type and operator:

addRSVRules(Index,String,String,String,double,Object,List)
adds rules for computing the RSV for the specified condition to a list. The rule head predicate is always
[c]_[a]_[o]_rsv[query id]_[sub-query id]. Typically, it is only one rule which is added.
addProbRules(Index,String,String,String,List)
adds rules for computing the probabilities of relevance for the specified sub-query to a list. The rule head predicate is always
[c]_[a]_[o]_prob[query id]_[sub-query id]. Typically, it is only one rule which is added.

Example: AbstractDT

This class de.unidu.is.pire.dt.AbstractDT is the abstract super-class of all DT classes. It provides these additional methods:

getFilter(String)
returns a filter object for splitting actual attribute content into tokens. The filter either has to return instances of de.unidu.is.util.Tuple (element 0 contains the token, element 1 the token frequency as an integer), or any object whose toString() method has to return the token (with frequency 1). Sub-classes have to overwrite this abstract method.
getQueryFilter(String)
returns a filter object for splitting actual query condition values into tokens. The filter either has to return instances of de.unidu.is.util.Tuple (element 0 contains the token, element 1 the token frequency as an integer), or any object whose toString() method has to return the token (with frequency 1). Sub-classes have to overwrite this abstract method.
getProbsTemplate(Index,String,String,String)
returns the template for computing probabilities of relevance out of the RSVs. The RSV has to be denoted as PROB. This method simply returns PROB, which refers to the identity mapping function. Sub-classes can overwrite this method.

In addition, this class provides default implementations of methods which are defined in the interface
de.unidu.is.pire.dt.DT:

convertOperator(String)
converts an operator name into an identifier (which only contains alphanumeric characters and the underscore). The default implementation simply returns the operator name, as typical operator names are also legal identifiers.
addToIndex(Index,String,String,Object)
adds a document/value to an index w.r.t. an operator. It uses the filter returned by getFilter(String) and inserts them into the specified index.
computeIndex(Index,String)
computes data type/operator-specific indexing weights. This default method uses a simply binary indexing scheme. Sub-classes can overwrite this method.
removeIndex(Index,String)
removes the data type/operator-specific relations from the specified index. This default implementation does nothing.
addRSVRules(Index,String,String,String,double,Object,List)
adds rules for computing the RSV for the specified condition to a list. This default implementation uses the indexing weight directly. Sub-classes can overwrite this method, but this is typically not required.
addProbRules(Index,String,String,String,List)
adds rules for computing the probabilities of relevance for the specified sub-query to a list. This default implementation uses the template returned by getProbsTemplate(Index,String,String,String). Sub-classes can overwrite this method, but this is typically not required.

Example: TextDT

This class defines a data type for textual content, which provides several operators like stemen or nostem.

By default, the RSVs are normalised (i.e., divided by the maximum RSV) before the logistic mapping function is applied. This can be enabled by setDoScaleForBM25(true). However, this normalisation is only applied for the operators stemen and nostem, not for the variants with normalised tf weights.

The parameters $ b_0$ and $ b_1$ of the logistic mapping functions (again, only for the operators stemen and nostem) can be set as the resource description values with keys b0 and b12.1. Default values are $ b_0=-4$ and $ b_1=12$. The logistic mapping function has to be enabled using setDoMapForBM25(true), it is disabled by default (which is equivalent to the identify mapping function).

As a byproduct, this data type computes relations df (document frequency) and dl (document length) in addition to the standard relations.

Example: NumbertDT

This class defines a data type for numbers. Binary indexing weights are used together with the identity mapping function.

This data type supports the binary operators <, <=, >, >= and = with the usual semantics and the identity mapping function.

It will also provide vague operators $ \sim$<, $ \sim$> and $ \sim$= with logistic mapping functions, although this is not fully implemented yet.

Example: YearDT

This class defines a data type for numbers. Binary indexing weights are used together with the identity mapping function. Currently there is no difference to NumberDT.

Example: NameDT

This class defines a data type for person names with two binary operators plainname and soundex. In both cases, names are converted into lower case, split into words (tokens), and their frequencies are counted (although they are currently ignored).

The operator plainname directly compares the name tokens with the tokens in the query condition. The operator soundex compares the soundex values from the name tokens.

Method for the IR engine

Index and data type classes are combined in the PIRE main class. First, this class stores the schema (attributes with data type and operators), it keeps one index for each attribute/operator pair.

Indexing is supported by these methods.

registerAttribute(String,String,List)
registers an attribute with a corresponding data type and a list of operators (a sub-list of the operators provided by the operator).
initIndex()
initialises all indexes, i.e. removes old data (if necessary), and creates fresh indexes for all attribute/operator pairs.
addToIndex(String)
adds the specified document id to all indexes.
addToIndex(String,String,Object)
adds the specified document/value pair to all indexes for the specified attribute (i.e., for all operators which are defined on that attribute). As splitting a value into tokens, and adding tokens to the index depends on the data type and operator, it is delegated to the data type class and not to the index.
computeIndex()
computes the indexes, basically the weight relations for all indexes. As this highly depends on the data type and operator, it is delegated to the data type class and not to the index.
computeMoments()
computes the moments of the indexing weights for all indexes. The results are stored in the relation expectation (the expectation $ \mu$) and variance (the variance $ \sigma^2$). These relations are mainly used for resource selection.
removeIndex()
removes all indexes.

With these methods, creating a PIRE index has to be done in four steps:


For retrieval, these methods are supported:

initQuery(String)
initialises the relations for the specified query.
addCondition(String,String,String,double,Object)
adds the specified weighted condition (for a weighted sum query). The condition is saved and evaluated when all conditions are available, as conditions referring to the same attribute/operator pair have to evaluated together.
addCondition(String,WeightedQueryCondition)
adds the specified weighted condition (for a weighted sum query). The condition is saved and evaluated when all conditions are available, as conditions referring to the same attribute/operator pair have to evaluated together.
addConjunction(String,QueryCondition[])
adds the conditions of a conjunction (for a Boolean-style query in disjunctive form). These rules are stored and later evaluated together.
computeProbs(String)
evaluates all collected rules, and computes the final probabilities of relevance.
getProbs(String,int)
returns the first top-ranked document ids together with the probabilities of relevance (as a list of de.unidu.is.retrieval.ProbDoc instances).
closeQuery(String)
frees resources acquired for processing the specified query (e.g. temporary relations).
addMomentsCondition(String,String,String,double,Object)
adds the specified weighted condition, defined by an queryID, attribute, operator weight and comparison value, to the computation of the moments.
addMomentsCondition(String,WeightedQueryCondition)
adds the specified weighted condition to the computation of the moments.
getMoments(String)
returns the moments for the specified query.

Retrieving documents with PIRE is similarly easy:

PIRE provides convenient methods for setting and retrieving values in the rd relation for a specified attribute and operator:

double getRD(String,String,String
returns the value in the resource description for the specified index and key.
void setRD(String,String,String,double
sets the value in the resource description for the specified index and key.

Finally, PIRE provides access to the underlying index objects for advanced applications:

Index getIndex(String,String
returns the index object for the specified attribute and operator.

Thus, using PIRE from outside is fairly easy, and of course abstracts from the internals.


next up previous
Next: Retrieval of XML documents Up: pire Previous: PDatalog++ engine
Stefan Huesselbeck 2005-03-09