Next: Retrieval of XML documents
Up: pire
Previous: PDatalog++ engine
Subsections
PIRE implementation
PIRE is a probabilistic IR engine which is based on the principles
introduced in the previous chapters. This chapter describes the
internals of PIRE.
Each attribute/operator pair has its own index, which is a non-empty set of
pDatalog++ relations with some additional methods. The relations in
different indexes can be distinguished by prefixing the relation name
with the attribute name, if this is required by the underlying
pDatalog++ engine.
PIRE supports arbitrary index types (as Java classes) as long as they
implement the Java interface for the index. This allows for using
different index types with the same IR engine (and the same data types,
see below). Currently only one index type is available, which uses
pDatalog++ not only as an interface to the other components, but also
internally as it builds on the pDatalog++ engine introduced in
chapter 1.
PIRE also uses classes for modelling data types. Although major parts
of the data type can be expressed in pDatalog++, some parts (mainly
splitting an attribute value or a comparison value into tokens) should
be implemented in Java. Thus, each data type has its class, which
works on a specified index.
Figure 2.1 shows the different layers of PIRE. On
top, there is a thin, simple XML layer (see
section 3). The actual PIRE layer contains data
types and indexes (including the pDatalog-based index class). The
third layer contains the pDatalog++ engine, using an abstract SQL
layer, which finally accesses a relational database management system.
Figure 2.1:
PIRE architecture
|
|
The interface de.unidu.is.retrieval.Index is responsible for
handling an IR index. It contains high-level (specialised on IR) and
low-level (for arbitrary relations) methods.
Each index uses pre-defined relations with the local names docid,
tf, weight, rd and idb_rd, and
potentially further data type-specific relations.
The higher, IR-specific level methods for creating an index are:
- init()
- creates the pre-defined relations for this
index.
- insert(String)
- adds the specified document id to the
relation docid.
- insert(String,String,String)
- adds the specific
document id, token and token frequency to the relation tf.
- computeMoments()
- computes the moments of the indexing
weights in the relation expectation (the expectation
) and variance (the variance
). These
relations are mainly used for resource selection.
- remove()
- removes the pre-defined relations from this
index.
Computing the weight relation is out of the scope if the
index interface, as this typically involves data type-specific
knowledge. Thus, this is defined in the data type classes.
In addition, it supports retrieval on the index. One relation
[c]_[a]_[o]_rsv[query id]_[i] is used for each sub-query.
Each sub-query uses only one attribute/operator pair, and thus only one index (in which
RSVs and probabilities for this sub-query are computed. These RSVs are
mapped onto a probability of relevance via mapping functions (in a
temporary relation [c]_[a]_[o]_prob[query id]_[i]). In a
final step, these probabilities are combined in the relation
[c]_prob[query id], which is contained in a new index (used
only for that purpose), and not in the attribute/operator indexes. The
current interface de.unidu.is.retrieval.Index does not
provide any method for directly transferring the data from one index
into another; instead, it is required to use the same index class for
every index, and moving data (if required) is left to the actual index
class.
These methods are provided for performing retrieval:
- initQuery(String)
- initialises the specified query.
- getMaxRSV(String,String)
- returns the maximum RSV
for the specified sub-query. This
method is mainly intended for use by the data type classes.
- addRule(String,String[],List)
- adds a rule for
computing the prob relation to the specified list. The
body is the conjunction of the specified unary relations which
contain the result for a single condition.
- computeProbs(String,List,boolean)
- applies the
specified rules for the relation prob (where disjointness
of the rules can be specified), i.e. it computes probabilities of
relevance
rel
from RSVs
,
where the rules act on behalf of mapping functions.
- getProbs(String,int)
- returns the first top-ranked
document ids together with the probabilities of relevance (as a list
of de.unidu.is.retrieval.ProbDoc instances).
- closeQuery(String)
- frees resources acquired for
processing the specified query (e.g. temporary relations).
Methods for computing RSVs are left out here as this is covered by
data type classes (see section 2.2).
In addition, also computing the moments for weighted sum queries is
supported:
- addMomentsCondition(String,weight,Object,List)
- adds
rules for computing the moments to the specified list, using the
specified queryID, weight and comparison value.
- computeMoments(String,List)
- applies the
specified rules for computing the moments. The results are stored in
the prob relation, with the document ids
expectation and variance (i.e., by partially
abusing the existing infrastructure for document retrieval).
The low-level methods allow for direct manipulation of arbitrary
relations:
- convert(String)
- converts a local relation name into an
absolute one, using the collection name, the attribute and the
operator name as a prefix.
- convertCollection(String)
- converts a local relation
name into an absolute one, using only the collection name as a
prefix.
- getRD(String)
- returns the resource description value
for the specified key.
- getRD(String,double)
- returns the resource description value
for the specified key. If no value is the stored, the specified
default value is returned.
- setRD(String,double)
- sets the resource description value
for the specified key with the specified value.
- add(Fact)
- adds a fact to the index.
- compute(Rule)
- evaluates the specified rule on the
index.
- compute(IDBRelation,Collection)
- evaluates the specified
collection of rules with common head predicate.
- computeDisjoint(IDBRelation,Collection)
- evaluates the
specified collection of rules with common head predicate, assuming
disjointness.
- computeDisjoint(String,int,Collection)
- evaluates the
specified collection of rules with common head predicate, assuming
disjointness.
- iterator(String)
- returns all facts in the specified
relation.
- addEDBRelation(String,int,boolean)
- adds an extensional
relation (defined by facts) to the index.
- addIDBRelation(String,int,boolean)
- adds an intensional
relation (defined by rules) to the index.
- removeRelation(String)
- removes the specified relation
from the index.
- completeRelation(String)
- signals that the specified
relation is complete; the actual index class can perform some
post-processing for efficiency improvements.
- getRSVRelation(String,String)
- returns the name for the
RSV relation, following the scheme [c]_[a]_[o]_rsv[query
id]_[i].
- getProbRelation(String,String)
- returns the name for
the relation with the probabilities of relevance, following the
scheme [c]_[a]_[o]_prob[query id]_[i].
Methods for describing data types
The data type classes cover only the parts of the retrieval process
which depend on a specific data type and operator.
These methods support creating an index:
- convertOperator(String)
- converts an operator name into
an identifier (which only contains alphanumeric characters and the
underscore).
- addToIndex(Index,String,String,Object)
- adds a
document/value to an index w.r.t. an operator; this method splits the
attribute content into tokens (e.g. stemming and stop-word removal),
calculates the token frequencies, and calls the specified index via
insert(String,String,int) for inserting them into the
tf relation.
- computeIndex(Index,String)
- computes
data type/operator-specific indexing weights.
- removeIndex(Index,String)
- removes the
data type/operator-specific relations from the specified index.
E.g., a simple Boolean weighting (which can be used for numbers) can
be implemented using the rule
weight(D,V) :- tf(D,V,TF).
A normalised tf weighting schema requires the rules:
dl(D,DL) :- sum(DL,D,{tf(D,_,#)}).
weight(D,V) :- tf(D,V,TF) & dl(D,DL) | (TF/DL).
In addition, the rules for computing RSV for a single condition, and
for probabilities of relevance for a sub-query depend on the data type
and operator:
- addRSVRules(Index,String,String,String,double,Object,List)
- adds rules for computing the RSV for the specified condition to a
list. The rule head predicate is always
[c]_[a]_[o]_rsv[query id]_[sub-query id]. Typically, it
is only one rule which is added.
- addProbRules(Index,String,String,String,List)
- adds
rules for computing the probabilities of relevance for the
specified sub-query to a list. The rule head predicate is always
[c]_[a]_[o]_prob[query id]_[sub-query id].
Typically, it is only one rule which is added.
This class de.unidu.is.pire.dt.AbstractDT is the abstract
super-class of all DT classes. It provides these additional methods:
- getFilter(String)
- returns a filter object for
splitting actual attribute content into tokens. The filter either
has to return instances of de.unidu.is.util.Tuple (element
0 contains the token, element 1 the token frequency as an integer),
or any object whose toString() method has to return the
token (with frequency 1). Sub-classes have to overwrite this
abstract method.
- getQueryFilter(String)
- returns a filter object for
splitting actual query condition values into tokens. The filter
either has to return instances of de.unidu.is.util.Tuple
(element 0 contains the token, element 1 the token frequency as an
integer), or any object whose toString() method has to
return the token (with frequency 1). Sub-classes have to overwrite
this abstract method.
- getProbsTemplate(Index,String,String,String)
- returns
the template for computing probabilities of relevance out of the
RSVs. The RSV has to be denoted as PROB. This method simply
returns PROB, which refers to the identity mapping
function. Sub-classes can overwrite this method.
In addition, this class provides default implementations of methods
which are defined in the interface
de.unidu.is.pire.dt.DT:
- convertOperator(String)
- converts an operator name into
an identifier (which only contains alphanumeric characters and the
underscore). The default implementation simply returns the operator
name, as typical operator names are also legal identifiers.
- addToIndex(Index,String,String,Object)
- adds a
document/value to an index w.r.t. an operator. It uses the filter
returned by getFilter(String) and inserts them into the
specified index.
- computeIndex(Index,String)
- computes
data type/operator-specific indexing weights. This default method
uses a simply binary indexing scheme. Sub-classes can overwrite this
method.
- removeIndex(Index,String)
- removes the
data type/operator-specific relations from the specified index. This
default implementation does nothing.
- addRSVRules(Index,String,String,String,double,Object,List)
- adds rules for computing the RSV for the specified condition to a
list. This default implementation uses the indexing weight
directly. Sub-classes can overwrite this method, but this is
typically not required.
- addProbRules(Index,String,String,String,List)
- adds
rules for computing the probabilities of relevance for the specified
sub-query to a list. This default implementation uses the template
returned by getProbsTemplate(Index,String,String,String).
Sub-classes can overwrite this method, but this is typically not
required.
This class defines a data type for textual content, which provides
several operators like stemen or nostem.
By default, the RSVs are normalised (i.e., divided by the maximum RSV)
before the logistic mapping function is applied. This can be enabled
by setDoScaleForBM25(true). However, this normalisation is
only applied for the operators stemen and nostem,
not for the variants with normalised tf weights.
The parameters
and
of the logistic mapping functions
(again, only for the operators stemen and nostem)
can be set as the resource description values with keys b0
and b12.1. Default values are
and
.
The logistic mapping function has to be enabled using
setDoMapForBM25(true), it is disabled by default (which is
equivalent to the identify mapping function).
As a byproduct, this data type computes relations df
(document frequency) and dl (document length) in addition to
the standard relations.
This class defines a data type for numbers. Binary indexing weights
are used together with the identity mapping function.
This data type supports the binary operators <, <=,
>, >= and = with the usual semantics and
the identity mapping function.
It will also provide vague operators
<,
>
and
= with logistic mapping functions, although this is
not fully implemented yet.
This class defines a data type for numbers. Binary indexing weights
are used together with the identity mapping function. Currently there
is no difference to NumberDT.
This class defines a data type for person names with two binary
operators plainname and soundex. In both cases,
names are converted into lower case, split into words (tokens), and
their frequencies are counted (although they are currently ignored).
The operator plainname directly compares the name tokens with
the tokens in the query condition. The operator soundex
compares the soundex values from the name tokens.
Index and data type classes are combined in the PIRE main class. First,
this class stores the schema (attributes with data type and
operators), it keeps one index for each attribute/operator pair.
Indexing is supported by these methods.
- registerAttribute(String,String,List)
- registers an
attribute with a corresponding data type and a list of operators (a
sub-list of the operators provided by the operator).
- initIndex()
- initialises all indexes, i.e. removes old
data (if necessary), and creates fresh indexes for all
attribute/operator pairs.
- addToIndex(String)
- adds the specified document id to
all indexes.
- addToIndex(String,String,Object)
- adds the specified
document/value pair to all indexes for the specified attribute
(i.e., for all operators which are defined on that attribute). As
splitting a value into tokens, and adding tokens to the index
depends on the data type and operator, it is delegated to the
data type class and not to the index.
- computeIndex()
- computes the indexes, basically the
weight relations for all indexes. As this highly depends on
the data type and operator, it is delegated to the data type class and
not to the index.
- computeMoments()
- computes the moments of the indexing
weights for all indexes. The results are stored in the relation
expectation (the expectation
) and variance
(the variance
). These relations are mainly used for
resource selection.
- removeIndex()
- removes all indexes.
With these methods, creating a PIRE index has to be done in four steps:
- First, the schema has to be constructed.
- Then, the indexes have to be initialised.
- For each document, its id has to be added, and the content of
each attribute (if it occurs in the document).
- Finally, the actual index (the weight relations) have
to be computed.
For retrieval, these methods are supported:
- initQuery(String)
- initialises the relations for the
specified query.
- addCondition(String,String,String,double,Object)
- adds
the specified weighted condition (for a weighted sum query). The
condition is saved and evaluated when all conditions are available,
as conditions referring to the same attribute/operator pair have to
evaluated together.
- addCondition(String,WeightedQueryCondition)
- adds the
specified weighted condition (for a weighted sum query). The
condition is saved and evaluated when all conditions are available,
as conditions referring to the same attribute/operator pair have to
evaluated together.
- addConjunction(String,QueryCondition[])
- adds the
conditions of a conjunction (for a Boolean-style query in
disjunctive form). These rules are stored and later evaluated
together.
- computeProbs(String)
- evaluates all collected rules,
and computes the final probabilities of relevance.
- getProbs(String,int)
- returns the first top-ranked
document ids together with the probabilities of relevance (as a list
of de.unidu.is.retrieval.ProbDoc instances).
- closeQuery(String)
- frees resources acquired for
processing the specified query (e.g. temporary relations).
- addMomentsCondition(String,String,String,double,Object)
- adds the specified weighted condition, defined by an queryID,
attribute, operator weight and comparison value, to the computation
of the moments.
- addMomentsCondition(String,WeightedQueryCondition)
- adds the specified weighted condition to the computation of the
moments.
- getMoments(String)
- returns the moments for the
specified query.
Retrieving documents with PIRE is similarly easy:
- First, the query relations have to be initialised.
- Then, each condition of a weighted sum and each conjunction of a
Boolean-style query has to be added.
- Once all conditions are added, the resulting probabilities of
relevance
rel
can be computed.
- Then, the top-ranked documents can be retrieved from the
prob relation.
- Finally, the query has to be closed. This means that the
relations rsv and prob in all involved indexes
have to be removed, as well as internal data about the query.
PIRE provides convenient methods for setting and retrieving
values in the rd relation for a specified attribute and
operator:
- double getRD(String,String,String
- returns the value in
the resource description for the specified index and key.
- void setRD(String,String,String,double
- sets the value
in the resource description for the specified index and key.
Finally, PIRE provides access to the underlying index objects for
advanced applications:
- Index getIndex(String,String
- returns the index object
for the specified attribute and operator.
Thus, using PIRE from outside is fairly easy, and of course abstracts
from the internals.
Next: Retrieval of XML documents
Up: pire
Previous: PDatalog++ engine
Stefan Huesselbeck
2005-03-09