next up previous
Next: PIRE implementation Up: pire Previous: pire

Subsections


PDatalog++ engine

This pDatalog++ implementation follows the approach in [4] and operates on a relational database, i.e. pDatalog++ relations are stored in database tables, and rules are transformed into one or more SQL statements. A relational database is a good choice as it is fast, flexible, and shares the concepts with predicate logics (tables for relations). The major difference is that each database relation needs an additional column for storing the probability of the tuples. Figure 1.1 depicts the overall architecture of the pDatalog++ engine.

Figure 1.1: PDatalog++ engine architecture


\includegraphics[scale=0.4]{arch-pdatalog}

Relations, facts and rules

This section explains the data structures for facts and rules, for relations, and the handling of all relations (e.g. adding facts or computing rules).

Facts and rules

The classes Rule and its sub-class Fact store a rule or a fact, respectively, where a fact is simply a rule with empty body. The head is stored as a Literal instance, the body is a list of Literal instances, and the function for computing the probabilities is stored as an Expression instance (see below). A literal is given by its name, an array of expressions, and a flag indicating if the literal is positive or negated.

Expressions are used for storing e.g. variables (class Variable) and constants (class Constant) in literals, but also in mathematical functions in a recursive way. Expression classes can be found in the packages de.unidu.is.expression and in de.unidu.is.pdatalog.ds.

The easiest way to construct a fact or a rule object is not to create it manually, but to parse a string, employing the de.unidu.is.pdatalog.parser.Parser class. Note that the current implementation requires a lot of round brackets in mathematical expressions.

Expressions

Expressions are used here for defining the function for computing the probability of facts derived by a single rule, and as the literal arguments:

DBColumnExpression
denotes a database table column.
DBColExpression
is equivalent to DBColumnExpression with the standard names of pDatalog++ fact arguments arg0, arg1 etc.
DBProbExpression
denotes the database table column which contains the probabilities of pDatalog++ facts.
DifferenceExpression
denotes the difference of exactly two other expressions.
EqualsExpression
represents the test for equality for two expressions. It is used mainly for internal purpose, e.g. for the SQL formatter (see section 1.2).
FractionExpression
denotes the result of dividing one expression by another one.
FunctionExpression
represents the result of applying a named function onto another expression.
LiteralExpression
represents a literal (for the aggregation operators).
PlainExpression
stores a plain string.
ProbExpression
represents the resulting probability, i.e. PROB.
ProductExpression
denotes the product of exactly two other expressions.
ProductNExpression
denotes the product of a list of other expressions.
Str2NumExpression
presents the function converting a string value into a double number. This is used e.g. for the arguments of a literal.
StringExpression
stores a string in quotation marks.
SumExpression
denotes the sum of exactly two other expressions.
SumNExpression
denotes the sum of a list of other expressions.
VariableExpression
represents a variable.
Constant
is equivalent to StringExpression and has to be used in literals for constants.
Variable
is equivalent to VariableExpression and has to be used in literals for variables.

Expressions have to support to methods:

substitute(Map)
has to apply substitution on itself. The keys of the specified map are variables (instances of VariableExpression), and the values are the expressions which have to substitute the occurrences of the variable.
getSQLTemplate()
returns a template for the SQL formatter for this expression, see section 1.2 for details.

Relations

A relation is identified by its name, and has an arity (the number of arguments). Relations are defined in the class Relation.

Currently there are three different kinds of relations:

EDBRelation:
These relations are extensionally defined by facts, which are stored in a database table.
EDBComputedRelation:
These ``built-in'' relations are extensionally defined by facts, which are computed on demand.
IDBRelation:
These relations are intensionally defined by rules; the results are stored in a database table.

Built-in relations are EqualsRelation (name eq), GreaterEqualsRelation (name ge),
GreaterThanRelation (name gt), LessEqualsRelation (name le), LessThanRelation (name lt) and VagueLessRelation (name ge). Note that currently not all built-in relations are defined as
EDBComputedRelation, the system also knows the ternary relations add, mult, div and sub (the first argument always is the result), the binary log (the natural logarithm), the aggregation operators and neq (for non-equality) .

The relation base

The relation base provides some methods for adding facts, computing rules, performing queries, retrieving facts and removing facts. The current implementation does not compute the order in which rules have to be executed automatically, so rules have to be computed in the correct order by hand.

These methods are provided:

add(Fact)
adds the fact to its relation.
compute(IDBRelation, Collection)
computes the result of the specified non-recursive rules (all corresponding to the specified IDB relation). Probabilities for the same facts derived by multiple rules are combined using the inclusion-exclusion formula.
compute(IDBRelation, Rule)
computes the result of the specified non-recursive rule (corresponding to the specified IDB relation). When facts are derived multiple times, the probabilities are combined using the inclusion-exclusion formula.
compute(IDBRelation, Rule, boolean)
computes the result of the specified non-recursive rule (corresponding to the specified IDB relation). When facts are derived multiple times, the probabilities are combined using the inclusion-exclusion formula. A user can specify that no database index is created for the IDB relation (which typically increases performance, but takes some time to be created).
computeDisjoint(IDBRelation, Collection)
computes the result of the specified non-recursive rules (all corresponding to the specified IDB relation). Probabilities for the same facts derived by multiple rules are added, assuming disjointness.
computeRecursively(IDBRelation, Rule)
computes the result of the specified recursive rules (all corresponding to the specified IDB relation). Probabilities for the same facts derived by multiple rules are combined using the inclusion-exclusion formula. Non-recursive rules for the same relation have to be computed first.
addIndex(Relation)
creates a database index for the specified relation. This typically increases performance, but takes some time to be created.
clear(String)
removes all tuples from the the relation with the specified name.
query(String)
returns all tuples (as Fact objects) matching the specified query (as a single string).
queryRelation(Relation)
returns all tuples (as Fact objects) in the specified relation.
queryRelation(String, int)
returns all tuples (as Fact objects) in the specified relation.
getTupleCount(Relation)
returns the number of tuples in the specified relation.


The relation base keeps track of all defined relations; for that, it stores an internal list of all Relation objects. Thus, every relation has to be added to the relation base before it can be used, even if it already physically exists, by calling add(relation,false) (the false prevents the relation to be physically re-created). These relation list handling methods are available:

clear()
removes all relations from the list, without removing them physically in the database.
containsRelation(String)
returns true iff there is a relation with the specified name.
get(String)
returns the relation object for the specified name.
isEmpty()
returns true if there is no relation in the system.
names()
returns an iterator over the relation names (as strings).
add(Relation)
adds the relation to the system, and creates it physically in the database.
add(Relation, boolean)
adds the relation to the system, and optionally creates it physically in the database.
remove(String)
removes the relation with the specified name, and also physically removes it in the database.
size()
returns the number of relations in the system.
relations()
returns an iterator over all relations (as Relation objects).


The relation base also allows for some lower-level methods for relation handling:

existsRelation(String)
returns true if the relation specified by its name physically exists.
dump(Relation)
dumps the content of the specified relation to STDOUT.
perform(SQL)
performs the specified abstract SQL statement.
performQuery(SQL)
performs the specified abstract SQL select statement, and returns a ResultSet.
close(ResultSet)
closes the specified result set and frees all resources.


Formatting pDatalog++ to SQL

The basic idea is to map each relation onto a table, where the relation name equals the table name. Each argument is a column, and an additional (last) column is used for storing the probabilities.

As different database management systems use slightly different SQL dialects, the concepts of abstract SQL statements is introduced. Abstract SQL statements are converted into concrete SQL statements in a database-dependent way, so that no program code has to be modified when switching to a different database.

Abstract SQL statements

The class de.unidu.is.sql.SQL is an abstraction from SQL queries. It is used in situations where different database management systems with different SQL dialects (e.g. HSQLDB vs. MySQL) are used. These abstract SQL statements are formatter later to native SQL statements, using database-dependent config files.

The bean class SQL stores the following values:

insertTable
contains the table name where rows are inserted.
select
contains the list of expressions for the ``select'' part.
isDistinct
specifies if ``select distinct'' has to be used.
from
contains the list of expressions for the ``from'' part.
where
contains the list of expressions for the ``where'' part. If the list is empty or equals null, no ``where'' is used.
group
contains the list of expressions for the ``group by'' part. If the list is empty or equals null, no ``group by'' is used.
limit
contains the maximum number of rows used (if greater than zero).
order
contains the list of list of expressions for the ``order by'' part. If the list is empty or equals null, no ``order by'' is used.
orderDesc
specifies if ``desc'' has to be used in the ``order by'' construct.

SQL formatter

The interface de.unidu.is.sql.SQLFormatter provides methods for converting an abstract SQL statement into a concrete one, and sending it to a database management system. Its standard implementation is the class de.unidu.is.sql.SQLFormatterImplementation. It can only be instantiated by a factory.

The SQL formatters provide the following methods:

create(String, String[], String[]))
creates a table with the specified column names and types.
clear(String))
removes all rows from the specified table.
remove(String))
removes the specified table.
addIndex(String, String, String[], boolean[]))
adds a named database index to the specified table, using the specified columns (which might be marked as text columns).
getSelect(SQL))
returns the concrete SQL query for the specified abstract SQL query.
perform(SQL,String)
performs the specified abstract SQL statement. If the specified table name equals null, then the table name specified in the abstract SQL statement is used.
performQuery(SQL)
performs the specified abstract SQL select statement, and returns a ResultSet.
close(ResultSet)
closes the specified result set and frees all resources.
getDB())
returns the database object.
setDB(DB)
sets the database object.


The SQL formatter uses a config file (in the conf/db, where the file name equals the database management system identifier in the JDBC uri, followed by .conf) directory with templates which will be filled. For that, sometimes placeholders foo are defined, so that every occurrence of ${foo} can be replaced later by another string.

These are all fields defined in the config file:

table.remove
contains the SQL statement for removing a table, using the placeholder
table.remove.table (the table name).
table.create
contains the SQL statement for creating a table, using the placeholders
table.create.table (the table name) and table.create.args (a comma-separated list of argument name, space and argument type).
table.clear
contains the SQL statement for removing all rows from a table, using the placeholder table.clear.table (the table name).
table.insert
contains the SQL statement for inserting rows into a table, using the placeholders
table.insert.table (the table name) and table.insert.select (a ``select'' statement or a ``values'' statement, defining the row(s) to be inserted).
table.select.select
contains the SQL ``select'' statements, using the placeholders distinct
(for the ``distinct'' construct), select (for the ``select'' construct), table.select.from.option (for the ``from'' construct), table.select.where.option (for the ``where'' construct),
table.select.group.option (for the ``group by'' construct), table.select.order.option (for the ``order by'' construct), table.select.limit.option (for the ``limit'' construct).
table.select.distinct
contains the string used in the ``distinct''.
table.select.from
contains the ``from'' construct template, using from (the table name(s)).
table.select.where
contains the ``where'' construct template, using where.
table.select.where.and
contains the string used for conjunctions in ``where'' constructs.
table.select.group
contains the ``group'' construct template, using group.
table.select.from
contains the ``from'' construct template, using the placeholders order and
table.select.order.desc.option.
table.select.order.desc.option
contains the string used for ``order by desc''.
table.select.limit
contains the ``limit'' construct template, using limit.
table.select.values
contains the ``values'' template for inserting rows as facts, using values.
table.addindex
contains the SQL statement for adding an index, using the placeholders
table.addindex.table (the table name) and table.addindex.definition (the definition).
table.addindex.textcolsuffix
contains the suffix for text columns when adding an index (could be used for restricting the length of the strings considered in the index).
str2num.start
contains the start of the SQL code converting a string into a number. It will be followed by the string in the final concrete SQL statement.
str2num.end
contains the end of the SQL code converting a string into a number. It will be preceded by the string in the final concrete SQL statement.
type.text
contains the SQL type of text columns.
type.double
contains the SQL types of floating numbers.

SQL formatter for pDatalog++

The class PDatalogSQLFormatter delegates most methods to the standard SQL formatter, and provides these additional convenience methods:

create(Relation))
creates the specified relation.
create(int, String))
creates the specified relation.
clear(Relation))
removes all tuples (rows) from the specified relation.
dump(Relation))
dumps the specified relation to STDOUT.
dump(int, String))
dumps the specified relation to STDOUT.
dump(Relation, PrintStream))
dumps the specified relation to the specified stream.
dump(int, String, PrintStream))
dumps the specified relation to the specified stream.

DB objects

The class DB provides an encapsulation of JDBC connections and statements. Besides some convenience methods, it also employs a connection broker for improved efficiency.

Insert and update SQL statements can be sent to the database management system with one call of the execute(String) method. Internally, a connection is retrieved from the pool, a statement is created, the SQL statement is sent, and everything is closed again.

For SQL select statements, the executeQuery(String) method returns a ResultSet object, which can be used for retrieving the data. The result set has to be closed after usage by calling close(ResultSet).

Currently two sub-classes encapsulate specific JDBC drivers for two different database management systems:

MySQLDB
encapsulates the popular MySQL database.
HSQLDBEmbeddedDB
encapsulates the HSQLDB database in the mode where it keeps all data in main memory. So, it does not save anything to disk, and does not need a server, but is embedded in the Java program.


Examples for the formatting process

This section presents some examples for the formatting process from pDatalog++ into SQL. The MySQL database system which is able to keep relations in main memory is used for storing and processing pDatalog++.

First, two binary relations foo and bar are created, which results in creating two tables with three columns (two for the two arguments, the last one for the probabilities):

drop table if exists bar
create table bar (arg0 varchar(255),arg1 varchar(255),prob double)

drop table if exists foo
create table foo (arg0 varchar(255),arg1 varchar(255),prob double)


Inserting facts is straightforward:

0.3 bar('1','2').
0.5 bar('2','3').
0.7 bar('3','4').

insert into bar values('1','2',(0.3*(1)))
insert into bar values('2','3',(0.5*(1)))
insert into bar values('3','4',(0.7*(1)))


Processing rules is more complex. In the following, we consider two rules for the same relation foo, which are defined to be disjunctive:

foo(X,Y) :- bar(X,Y).
foo(X,Z) :- bar(X,Y) & bar(Y,Z).

drop table if exists tmp_foo
create table tmp_foo (arg0 varchar(255),arg1 varchar(255),prob double)
insert into tmp_foo select rel0.arg0,rel0.arg1,(rel0.prob) from bar rel0  
insert into tmp_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
                    from bar rel0,bar rel1 where (rel0.arg1=rel1.arg0)   
insert into foo select tmp_foo.arg0,tmp_foo.arg1,sum(tmp_foo.prob)
                from tmp_foo group by tmp_foo.arg0,tmp_foo.arg1  
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmp_foo
Here, the table tmp_foo stores the facts created by the two rules. Creating these two insert into ... select statements is rather generic and not discussed here any further. The last insert into ... select statement inserts the generated facts into the foo table, where the probabilities are summed up if the same facts occurs multiple times in the temporary table. This can happen if both rules generate the same facts; but also the same rule can generate the fact multiple times (e.g. when the facts bar(1,a), bar(a,2), bar(1,b) are available bar(b,2). Finally, two database indexes are created for both arguments, allowing for faster processing, and the temporary table is removed.


When the same rules are processed without disjointness, the inclusion-exclusion formula has to be applied. This results in these SQL statements:

drop table if exists tmp_foo
create table tmp_foo (arg0 varchar(255),arg1 varchar(255),prob double)
insert into tmp_foo select rel0.arg0,rel0.arg1,(rel0.prob) from bar rel0     
insert into tmp_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
                    from bar rel0,bar rel1 where (rel0.arg1=rel1.arg0)    
select distinct tmp_foo.arg0,tmp_foo.arg1 from tmp_foo     
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='1') and (tmp_foo.arg1='2')    
insert into foo values('1','2',(0.3*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='2') and (tmp_foo.arg1='3')    
insert into foo values('2','3',(0.5*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='3') and (tmp_foo.arg1='4')    
insert into foo values('3','4',(0.7*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='1') and (tmp_foo.arg1='3')    
insert into foo values('1','3',(0.15*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='2') and (tmp_foo.arg1='4')    
insert into foo values('2','4',(0.35*(1)))
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmp_foo
The temporary table tmp_foo is created as before. The select distinct now ranges over all possible arguments. For each argument, the probabilities are retrieved from the database and combined via the inclusion-exclusion formula. The resulting probability is inserted into the foo table.


The pDatalog++ engine has only basic support for recursive rules (employing a naive evaluation scheme):

insert into foo select rel0.arg0,rel0.arg1,(rel0.prob) from bar rel0     
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmpa_foo
create table tmpa_foo (arg0 varchar(255),arg1 varchar(255),prob double)
drop table if exists tmpb_foo
create table tmpb_foo (arg0 varchar(255),arg1 varchar(255),prob double)
insert into tmpa_foo select distinct * from foo
delete from tmpb_foo
insert into tmpb_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
                     from tmpa_foo rel0,bar rel1 where (rel0.arg1=rel1.arg0)    
insert into tmpa_foo select distinct * from tmpb_foo
delete from tmpb_foo
select count(distinct arg0,arg1,prob) from tmpa_foo
delete from tmpb_foo
insert into tmpb_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
                     from tmpa_foo rel0,bar rel1 where (rel0.arg1=rel1.arg0)    
insert into tmpa_foo select distinct * from tmpb_foo
delete from tmpb_foo
select count(distinct arg0,arg1,prob) from tmpa_foo
delete from tmpb_foo
insert into tmpb_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
                     from tmpa_foo rel0,bar rel1 where (rel0.arg1=rel1.arg0)    
insert into tmpa_foo select distinct * from tmpb_foo
delete from tmpb_foo
select count(distinct arg0,arg1,prob) from tmpa_foo
alter table tmpa_foo add index (arg0)
alter table tmpa_foo add index (arg1)
delete from foo
select distinct tmpa_foo.arg0,tmpa_foo.arg1 from tmpa_foo     
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='1') and (tmpa_foo.arg1='2')    
insert into foo values('1','2',(0.3*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='2') and (tmpa_foo.arg1='3')    
insert into foo values('2','3',(0.5*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='3') and (tmpa_foo.arg1='4')    
insert into foo values('3','4',(0.7*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='1') and (tmpa_foo.arg1='3')    
insert into foo values('1','3',(0.38587499999999997*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='2') and (tmpa_foo.arg1='4')    
insert into foo values('2','4',(0.7253750000000001*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='1') and (tmpa_foo.arg1='4')    
insert into foo values('1','4',(0.19897499999999999*(1)))
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmpa_foo
drop table if exists tmpb_foo
The basic idea is to use two temporary tables tmpa_foo and tmpb_foo. The first one accumulates the generated facts, the second one contains newly generated facts. Then, the rule is evaluated repeatedly until the total number of facts is constant. In a final step, the probabilities are computed as described above.


next up previous
Next: PIRE implementation Up: pire Previous: pire
Stefan Huesselbeck 2005-03-09