This chapter gives an overview of the Java code implemented for XML retrieval.
This section describes the data structures used to encode documents, schemas and queries.
XML documents are returned in their parsed form as DOM trees, i.e. as org.w3c.dom.Document instances. As input parameters for methods, either Document instances or strings can be used (this is only important for creating a document index).
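As a minimal sketch of how an XML string can be turned into an org.w3c.dom.Document, the following uses only the standard JAXP parser. The class name DomParseSketch and the method parse are hypothetical and not part of the described code base.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the described code base.
final class DomParseSketch {

    /** Parses an XML string into a DOM tree using the standard JAXP parser. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch easy to call.
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<document><abstract>The abstract ...</abstract></document>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```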
Documents are identified by a document id. This is a string of arbitrary form, e.g. a URI, a URN, a number encoded as a string, a file name, or anything else. The only requirement is that it is unique within one index.
The retrieval result is essentially a list of documents, each with an attached weight which describes the similarity between the document and the query. This weight can be the probability of inference used in uncertain inference, the probability of relevance, or another similarity measure. Such a result is encapsulated in ProbDoc instances, which store the document id (which identifies the document itself) and the weight. If the XML document itself is also required, the class XMLDoc can be used, which combines a ProbDoc and a Document instance.
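The idea of a document id paired with a weight can be sketched as follows. WeightedDoc is a hypothetical stand-in for ProbDoc, whose actual fields and methods are not shown in this chapter. Sorting by descending weight is an assumption about how a result list is typically presented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the real ProbDoc API may differ.
final class RankedResultSketch {

    /** Stand-in for ProbDoc: a document id plus its weight. */
    static final class WeightedDoc {
        final String docId;
        final double weight;

        WeightedDoc(String docId, double weight) {
            this.docId = docId;
            this.weight = weight;
        }
    }

    /** Returns a copy of the list sorted by descending weight (assumed ranking order). */
    static List<WeightedDoc> rank(List<WeightedDoc> docs) {
        List<WeightedDoc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((WeightedDoc d) -> d.weight).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<WeightedDoc> result = new ArrayList<>();
        result.add(new WeightedDoc("doc1", 0.2));
        result.add(new WeightedDoc("doc2", 0.9));
        for (WeightedDoc d : rank(result)) {
            System.out.println(d.docId + " " + d.weight);
        }
    }
}
```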
Schemas define the structure of XML documents. Here, a simplified view is used which defines only the elements and attributes that are allowed as children of another element. The order and cardinality of child elements, default values for attributes, etc. are not considered. Schemas are stored in Schema instances. Each schema contains an optional name and a reference to the root schema element.
The class SchemaElement holds the data about the "elements" of the schema, which can be XML elements, XML attributes and text nodes. Each schema element has a name, which is the element name, an attribute name prefixed with @, or text() for text nodes. This naming scheme allows direct use in XPath expressions, which are used for selecting parts of the XML documents. A schema element referring to text nodes also contains an optional data type name and a list of operators, which is important for indexing. Currently, there is no strict list of allowed data types or operators; it depends on the indexing/retrieval engine.
Finally, a schema element contains a list of child schema elements, which defines the hierarchical structure. Recursion (loops) is forbidden, so a schema forms a tree of schema elements. As a consequence, a schema contains a finite set of distinct XPath expressions, each referring to one content-carrying schema element (i.e., one which does not refer to an XML element). This set can be retrieved using the method getXPaths() in both classes Schema and SchemaElement.
E.g., the following XML document
<document>
  <metadata correct="true">
    <title>Foo and Bar</title>
    <author>John Doe</author>
    <author>Jane Doe</author>
    <year>2004</year>
  </metadata>
  <abstract>
    The abstract ...
  </abstract>
  <fulltext>
    The main text ...
  </fulltext>
</document>
corresponds to the XPath expressions
/document/metadata/@correct
/document/metadata/title/text()
/document/metadata/author/text()
/document/metadata/year/text()
/document/abstract/text()
/document/fulltext/text()
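How a tree of schema elements yields this finite set of XPath expressions can be sketched as a depth-first traversal. The class Node below is a hypothetical stand-in for SchemaElement, and xpaths only illustrates what a method like getXPaths() might compute; the actual implementation is not shown in this chapter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a schema tree; not the real Schema/SchemaElement classes.
final class SchemaPathSketch {

    /** Stand-in for SchemaElement: a name and a list of child nodes. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node... kids) {
            for (Node kid : kids) {
                children.add(kid);
            }
            return this;
        }

        /** Attributes (prefixed with @) and text() nodes carry content. */
        boolean carriesContent() {
            return name.startsWith("@") || name.equals("text()");
        }
    }

    /** Collects the XPath of every content-carrying node, depth first. */
    static List<String> xpaths(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Node node, String prefix, List<String> out) {
        String path = prefix + "/" + node.name;
        if (node.carriesContent()) {
            out.add(path);
        }
        for (Node child : node.children) {
            collect(child, path, out);
        }
    }

    /** Builds the schema tree of the example document. */
    static Node exampleSchema() {
        return new Node("document").add(
            new Node("metadata").add(
                new Node("@correct"),
                new Node("title").add(new Node("text()")),
                new Node("author").add(new Node("text()")),
                new Node("year").add(new Node("text()"))),
            new Node("abstract").add(new Node("text()")),
            new Node("fulltext").add(new Node("text()")));
    }

    public static void main(String[] args) {
        for (String path : xpaths(exampleSchema())) {
            System.out.println(path);
        }
    }
}
```

Because loops are forbidden, the traversal always terminates, and each content-carrying node contributes exactly one path.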
Some IR systems cannot deal with hierarchical documents. One example is the IR engine PIRE, which only allows linear schemas. Thus, a schema definition can also contain a mapping from XPath expressions (and, thus, from schema elements) onto so-called "aliases", the elements of linear schemas. This is an n:m mapping: each XPath expression can belong to several aliases, and each alias can contain several XPath expressions. An XPath expression does not have to belong to any alias, and the use of aliases is not mandatory. Typically, the term "path" subsumes XPath expressions (which always start with a slash /) and aliases (which never start with a slash /).
Useful aliases for the example from above would be title, author, abstract, fulltext and text (containing the abstract and the full text).
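The n:m relation between aliases and XPath expressions can be sketched with two maps that are kept in sync. The class AliasMappingSketch and its methods are hypothetical; the chapter does not show how the Schema class stores the mapping internally.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an n:m alias mapping; not the real Schema API.
final class AliasMappingSketch {

    private final Map<String, Set<String>> xpathsByAlias = new LinkedHashMap<>();
    private final Map<String, Set<String>> aliasesByXPath = new LinkedHashMap<>();

    /** Registers one (alias, XPath) pair in both directions. */
    void map(String alias, String xpath) {
        xpathsByAlias.computeIfAbsent(alias, k -> new LinkedHashSet<>()).add(xpath);
        aliasesByXPath.computeIfAbsent(xpath, k -> new LinkedHashSet<>()).add(alias);
    }

    Set<String> xpathsOf(String alias) {
        return xpathsByAlias.getOrDefault(alias, Collections.emptySet());
    }

    Set<String> aliasesOf(String xpath) {
        return aliasesByXPath.getOrDefault(xpath, Collections.emptySet());
    }

    /** The aliases suggested for the example document. */
    static AliasMappingSketch example() {
        AliasMappingSketch m = new AliasMappingSketch();
        m.map("title", "/document/metadata/title/text()");
        m.map("author", "/document/metadata/author/text()");
        m.map("abstract", "/document/abstract/text()");
        m.map("fulltext", "/document/fulltext/text()");
        m.map("text", "/document/abstract/text()");
        m.map("text", "/document/fulltext/text()");
        return m;
    }

    public static void main(String[] args) {
        AliasMappingSketch m = example();
        System.out.println(m.xpathsOf("text"));
        System.out.println(m.aliasesOf("/document/abstract/text()"));
    }
}
```

In this sketch, /document/metadata/@correct and /document/metadata/year/text() belong to no alias, which the mapping permits.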
The method addAliases() creates aliases automatically. Every XPath expression of a text or XML attribute schema element forms one alias: every slash and every minus sign is converted into an underscore, the @ sign is converted into the string at, the /text() suffix is removed, and the very first element name in the XPath is ignored. Thus, in our example we would derive the aliases metadata_atcorrect, metadata_title, metadata_author, metadata_year, abstract and fulltext.
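The naming rules can be sketched as a small string transformation. The class AliasNameSketch and the method toAlias are hypothetical and only mirror the rules as described here (with @ becoming at, as the derived alias metadata_atcorrect shows); the real addAliases() may handle further cases.

```java
// Hypothetical sketch of the alias naming rules; not the real addAliases() code.
final class AliasNameSketch {

    private static final String TEXT_SUFFIX = "/text()";

    /** Derives an alias from an XPath expression such as /document/metadata/title/text(). */
    static String toAlias(String xpath) {
        String path = xpath;
        // Remove the /text() suffix, if present.
        if (path.endsWith(TEXT_SUFFIX)) {
            path = path.substring(0, path.length() - TEXT_SUFFIX.length());
        }
        // Ignore the leading slash and the very first element name.
        int second = path.indexOf('/', 1);
        path = second < 0 ? "" : path.substring(second + 1);
        // Convert @ into "at", and slashes and minus signs into underscores.
        return path.replace("@", "at").replace('/', '_').replace('-', '_');
    }

    public static void main(String[] args) {
        System.out.println(toAlias("/document/metadata/@correct"));
        System.out.println(toAlias("/document/abstract/text()"));
    }
}
```

Note that in this sketch a path consisting only of the root element and text() yields an empty alias; how the real method treats that case is not stated.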
The schema (using the method usesXPathForQuery()) defines which type of paths is used in queries (see below).
One way of defining schemas is to use existing DTDs. The schema sub-class DTDSchema loads a DTD and extracts the parent-child relationships from it.
Data types and operators cannot be extracted, as the DTD contains no information about them. However, they can be added later, either by manipulating the SchemaElement instances or by adding default operators. With the default operators, Text uses stemen, Name uses plainname and soundex, and the data types Year and Number use =.
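The default operator assignment can be pictured as a lookup table that is consulted only when a schema element has no explicit operators. The class DefaultOperatorSketch and the method operatorsFor are hypothetical; only the data type and operator names are taken from the text above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of default operator lookup; not the real schema code.
final class DefaultOperatorSketch {

    private static final Map<String, List<String>> DEFAULTS = new LinkedHashMap<>();

    static {
        DEFAULTS.put("Text", Arrays.asList("stemen"));
        DEFAULTS.put("Name", Arrays.asList("plainname", "soundex"));
        DEFAULTS.put("Year", Arrays.asList("="));
        DEFAULTS.put("Number", Arrays.asList("="));
    }

    /** Returns the explicit operators if any are given, otherwise the defaults for the data type. */
    static List<String> operatorsFor(String dataType, List<String> explicit) {
        if (explicit != null && !explicit.isEmpty()) {
            return explicit;
        }
        return DEFAULTS.getOrDefault(dataType, Collections.emptyList());
    }

    public static void main(String[] args) {
        System.out.println(operatorsFor("Name", null));
        System.out.println(operatorsFor("Text", Arrays.asList("nostem")));
    }
}
```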
In addition, no aliases are defined; if they are needed, they have to be specified manually or by adding all possible aliases from XPath expressions, see above.
Alternatively, HyREX [1] DDL files can be used. Their advantage is that they contain more information for indexing and retrieval than DTD files; in addition, many concepts are very similar in HyREX and PIRE.
This is a sample DDL file:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<hyrex
    directory="/path/to/hyrex/index"
    base="MyBase"
    class="ByClass"
    dtd="/foo/mydtd.dtd">
  <access classname="HyREX::HyPath::Document::Access::Find">
    <parameter name="expression" value="$_[0] =~ /\/.*xml$/"/>
    <parameter name="directories" value="/path/to/xml"/>
  </access>
  <summary>
    <xslfile name="...xsl"/>
  </summary>
  <attribute name="title">
    <datatype classname="HyREX::HyPath::Datatype::Text::English">
      <parameter name="indexfilter" value="latin1_tr"/>
      <parameter name="indexfilter" value="latin1_lc"/>
      <parameter name="indexfilter" value="split2"/>
      <parameter name="indexfilter" value="stop"/>
      <parameter name="filter" value="latin1_tr"/>
      <parameter name="filter" value="latin1_lc"/>
      <parameter name="filter" value="split2"/>
      <parameter name="filter" value="stop"/>
      <query query="/document/metadata/title"/>
    </datatype>
  </attribute>
  <structure classname="HyREX::HyPath::Structure::NoStruct"/>
</hyrex>
In our context, /hyrex/@directory (the directory where HyREX stores the index on disk), /hyrex/@dtd (DTD file), /hyrex/access/@classname (Perl class name for the access method), /hyrex/summary (XSLT stylesheets for displaying information on web pages) and /hyrex/attribute/datatype/parameter (data type parameters) can be ignored.
The parts /hyrex/@base, /hyrex/@class and /hyrex/access/parameter are needed only for indexing; they are used by the indexer program (see section 3.3).
The schema class HyREXSchema extracts all XPaths from /hyrex/attribute/datatype/query/@query and forms a new tree of SchemaElement instances from them. The data type definition from /hyrex/attribute/datatype/@classname is converted into PIRE data types: the HyREX data type HyREX::HyPath::Datatype::Text::English is converted into Text, HyREX::HyPath::Datatype::Name into Name, and HyREX::HyPath::Datatype::Numeric into Number. Using PIRE data type names directly is also allowed, although one then loses compatibility with HyREX.
We extend the DDL file by optional /hyrex/attribute/datatype/predicate/@name constructs, defining operators for the schema elements. Here, plaintexten is converted into nostem, and equal into =.
If /hyrex/structure is set to NoStruct, each attribute defines one alias.
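The extraction of query XPaths and data types from a DDL file can be sketched with the standard javax.xml.xpath API. The class DdlReaderSketch and its methods are hypothetical and do not reproduce HyREXSchema; in particular, passing unknown class names through unchanged is only an assumption based on the remark that PIRE data type names may be used directly.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; not the real HyREXSchema implementation.
final class DdlReaderSketch {

    /** Converts a HyREX data type class name into a PIRE data type name. */
    static String pireType(String hyrexClass) {
        switch (hyrexClass) {
            case "HyREX::HyPath::Datatype::Text::English": return "Text";
            case "HyREX::HyPath::Datatype::Name": return "Name";
            case "HyREX::HyPath::Datatype::Numeric": return "Number";
            default: return hyrexClass; // assumption: PIRE names pass through unchanged
        }
    }

    /** Maps every XPath found in /hyrex/attribute/datatype/query/@query to its PIRE data type. */
    static Map<String, String> typesByXPath(String ddlXml) {
        try {
            Document ddl = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(ddlXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList datatypes = (NodeList) xpath.evaluate(
                "/hyrex/attribute/datatype", ddl, XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < datatypes.getLength(); i++) {
                Element datatype = (Element) datatypes.item(i);
                String type = pireType(datatype.getAttribute("classname"));
                NodeList queries = datatype.getElementsByTagName("query");
                for (int j = 0; j < queries.getLength(); j++) {
                    result.put(((Element) queries.item(j)).getAttribute("query"), type);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot read DDL", e);
        }
    }

    public static void main(String[] args) {
        String ddl = "<hyrex><attribute name='title'>"
            + "<datatype classname='HyREX::HyPath::Datatype::Text::English'>"
            + "<query query='/document/metadata/title'/>"
            + "</datatype></attribute></hyrex>";
        System.out.println(typesByXPath(ddl));
    }
}
```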
Queries are identified by their query id (any string), and contain the number of documents which have to be retrieved.
The system contains different kinds of queries:
wsum(0.3,/a/b/@foo $stemen$ "xyz",0.7,/a/c/d/text() $plainname$ "Doe")
wsum(0.3,/#PCDATA $foo:stemen$ "xyz",0.7,/#PCDATA $bar:plainname$ "Doe")
The following query nodes are supported:
Boolean-style queries can be brought into disjunctive form, i.e. a disjunction of conjunctions of query conditions.
This section describes standard interfaces for indexing and retrieval (which can also be used for other IR engines), as well as the concrete implementation based on PIRE:
The interface Retriever defines methods for XML document retrieval using the above-mentioned data structures. This interface is intended as a common gateway to different IR engines. Several implementations (e.g. for PIRE and HyREX) are available; additional ones will follow.
The interface Retriever defines the following methods:
The interface IR, a sub-interface of Retriever, defines additional methods for XML document indexing. An implementation for PIRE is available; additional ones will follow.
The interface IR defines the following methods in addition to those of Retriever:
With the above classes and interfaces, performing document indexing and retrieval with PIRE is straightforward. It is implemented in the class PDatalogIR.
First, only aliases are used in paths. In other words, parts of the XML documents are mapped onto so-called "attributes" (which form a linear list). So, before PDatalogIR is used, one has to ensure that aliases are defined for the schema.
The schema data type must be one of the PIRE data types. These XML document parts are indexed separately and can be considered for retrieval; only WSumQuery and StructuredQuery are allowed as queries.
Obviously, this is only a primitive form of XML retrieval, as it makes only limited use of the document structure. A more sophisticated version is currently under development, still adhering to the same interface definitions and using the same data structures.
PIRE and the XML extension can easily be used in your own applications. The program index.sh, which calls de.unidu.is.retrieval.Indexer, provides a convenient front-end for indexing a given collection of XML documents. As the program uses PDatalogIR, everything explained in section 3.2.3 is valid here as well, in particular that parts of XML documents defined by XPath expressions are mapped onto attributes.
You can start de.unidu.is.retrieval.Indexer directly, or call the index.sh shell script. This script is found in the bin directory in CVS; it is copied to the dist directory automatically when ant dist is invoked. The shell script uses the JAR file unidu.jar directly, so the classpath is set automatically. It cannot be used without the JAR file.
The indexer program expects some parameters:
-?,--help                     displays this help message
-q,--quiet                    do not output anything to STDOUT and STDERR
-l,--logfile <file>           use the specified file for the output
-u,--user <user>              use specified user name for RDBMS
-p,--password <password>      use specified password for RDBMS
-h,--host <host>              use specified host for RDBMS
-d,--db <database>            use specified database for RDBMS
-1,--ddl <ddl file>           use specified HyREX DDL file
-2,--dtd <dtd file>           use specified DTD
-n,--name <collection name>   use specified collection name instead of
                              combining base/class from HyREX DDL
-x,--xml <dir>                use specified directory for XML files
                              (instead of extracting it from the HyREX DDL file)
The user, password, host and database parameters are required; they specify where the index is created (using a MySQL database). In addition, either a HyREX DDL file or a DTD file has to be specified as a schema.
The DDL file (see section 3.1.2.2) specifies the attributes which are used for indexing and retrieval. If structure is switched off in the DDL file, the attribute names are used directly as aliases; otherwise, the XPath expressions which are explicitly mentioned in the DDL are converted into aliases. If the DDL does not contain operator definitions (remember that this is an extension to HyREX), default operators are used. The collection name (the prefix for the MySQL tables) and the directory containing the XML files are extracted from the DDL, as well as a regular expression which is used to filter documents in the directory (so that only documents matching the regular expression are indexed).
If a DTD is specified, then the collection name and the directory containing XML files (which end with .xml) have to be specified manually. All paths which can be created based on the DTD are used and converted into aliases, and default operators are added.
The document ids are the file names (without path and without the .xml extension in all cases).
All output to STDOUT and STDERR can be switched off with the --quiet option. It is also possible to specify a file to which all output is written (even if output to STDOUT/STDERR is blocked).
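Deriving a document id from a file name can be sketched as follows. The class DocIdSketch and the method docId are hypothetical; the sketch assumes that the input names a regular file rather than a root directory.

```java
import java.nio.file.Paths;

// Hypothetical sketch of document id derivation; not the real Indexer code.
final class DocIdSketch {

    private static final String EXTENSION = ".xml";

    /** Strips the directory part and a trailing .xml extension from a file path. */
    static String docId(String file) {
        String name = Paths.get(file).getFileName().toString();
        return name.endsWith(EXTENSION)
            ? name.substring(0, name.length() - EXTENSION.length())
            : name;
    }

    public static void main(String[] args) {
        System.out.println(docId("/path/to/xml/doc42.xml"));
    }
}
```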
The program directly initialises the index, adds all documents to it, and completes the index computation. After that, the index is ready for querying.