This pDatalog++ implementation follows the approach in [4] and operates on a relational database, i.e. pDatalog++ relations are stored in database tables, and rules are transformed into one or more SQL statements. A relational database is a good choice as it is fast, flexible, and shares the concepts with predicate logics (tables for relations). The major difference is that each database relation needs an additional column for storing the probability of the tuples. Figure 1.1 depicts the overall architecture of the pDatalog++ engine.
This section explains the data structures for facts and rules, for relations, and the handling of all relations (e.g. adding facts or computing rules).
The classes Rule and its sub-class Fact store a rule or a fact, respectively, where a fact is simply a rule with empty body. The head is stored as a Literal instance, the body is a list of Literal instances, and the function for computing the probabilities is stored as an Expression instance (see below). A literal is given by its name, an array of expressions, and a flag indicating if the literal is positive or negated.
Expressions are used for storing e.g. variables (class Variable) and constants (class Constant) in literals, but also in mathematical functions in a recursive way. Expression classes can be found in the packages de.unidu.is.expression and in de.unidu.is.pdatalog.ds.
The easiest way to construct a fact or a rule object is not to create it manually, but to parse a string, employing the de.unidu.is.pdatalog.parser.Parser class. Note that the current implementation requires a lot of round brackets in mathematical expressions.
Expressions are used here for defining the function for computing the probability of facts derived by a single rule, and as the literal arguments:
Expressions have to support to methods:
A relation is identified by its name, and has an arity (the number of arguments). Relations are defined in the class Relation.
Currently there are three different kinds of relations:
Built-in relations are EqualsRelation (name
eq), GreaterEqualsRelation (name
ge),
GreaterThanRelation (name gt),
LessEqualsRelation (name le),
LessThanRelation (name lt) and
VagueLessRelation (name ge). Note that currently
not all built-in relations are defined as
EDBComputedRelation, the system also knows the ternary
relations add, mult, div and sub
(the first argument always is the result), the binary log
(the natural logarithm), the aggregation operators and neq
(for non-equality) .
The relation base provides some methods for adding facts, computing rules, performing queries, retrieving facts and removing facts. The current implementation does not compute the order in which rules have to be executed automatically, so rules have to be computed in the correct order by hand.
These methods are provided:
The relation base keeps track of all defined relations; for that, it
stores an internal list of all Relation objects. Thus, every
relation has to be added to the relation base before it can be used,
even if it already physically exists, by calling
add(relation,false) (the false prevents the relation
to be physically re-created). These relation list handling methods are
available:
The relation base also allows for some lower-level methods for
relation handling:
The basic idea is to map each relation onto a table, where the relation name equals the table name. Each argument is a column, and an additional (last) column is used for storing the probabilities.
As different database management systems use slightly different SQL dialects, the concepts of abstract SQL statements is introduced. Abstract SQL statements are converted into concrete SQL statements in a database-dependent way, so that no program code has to be modified when switching to a different database.
The class de.unidu.is.sql.SQL is an abstraction from SQL queries. It is used in situations where different database management systems with different SQL dialects (e.g. HSQLDB vs. MySQL) are used. These abstract SQL statements are formatter later to native SQL statements, using database-dependent config files.
The bean class SQL stores the following values:
The interface de.unidu.is.sql.SQLFormatter provides methods for converting an abstract SQL statement into a concrete one, and sending it to a database management system. Its standard implementation is the class de.unidu.is.sql.SQLFormatterImplementation. It can only be instantiated by a factory.
The SQL formatters provide the following methods:
The SQL formatter uses a config file (in the conf/db, where
the file name equals the database management system identifier in the
JDBC uri, followed by .conf) directory with templates which
will be filled. For that, sometimes placeholders foo are
defined, so that every occurrence of ${foo} can be
replaced later by another string.
These are all fields defined in the config file:
The class PDatalogSQLFormatter delegates most methods to the standard SQL formatter, and provides these additional convenience methods:
The class DB provides an encapsulation of JDBC connections and statements. Besides some convenience methods, it also employs a connection broker for improved efficiency.
Insert and update SQL statements can be sent to the database management system with one call of the execute(String) method. Internally, a connection is retrieved from the pool, a statement is created, the SQL statement is sent, and everything is closed again.
For SQL select statements, the executeQuery(String) method returns a ResultSet object, which can be used for retrieving the data. The result set has to be closed after usage by calling close(ResultSet).
Currently two sub-classes encapsulate specific JDBC drivers for two different database management systems:
This section presents some examples for the formatting process from pDatalog++ into SQL. The MySQL database system which is able to keep relations in main memory is used for storing and processing pDatalog++.
First, two binary relations foo and bar are created, which results in creating two tables with three columns (two for the two arguments, the last one for the probabilities):
drop table if exists bar create table bar (arg0 varchar(255),arg1 varchar(255),prob double) drop table if exists foo create table foo (arg0 varchar(255),arg1 varchar(255),prob double)
Inserting facts is straightforward:
0.3 bar('1','2').
0.5 bar('2','3').
0.7 bar('3','4').
insert into bar values('1','2',(0.3*(1)))
insert into bar values('2','3',(0.5*(1)))
insert into bar values('3','4',(0.7*(1)))
Processing rules is more complex. In the following, we consider two
rules for the same relation foo, which are defined to be
disjunctive:
foo(X,Y) :- bar(X,Y).
foo(X,Z) :- bar(X,Y) & bar(Y,Z).
drop table if exists tmp_foo
create table tmp_foo (arg0 varchar(255),arg1 varchar(255),prob double)
insert into tmp_foo select rel0.arg0,rel0.arg1,(rel0.prob) from bar rel0
insert into tmp_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
from bar rel0,bar rel1 where (rel0.arg1=rel1.arg0)
insert into foo select tmp_foo.arg0,tmp_foo.arg1,sum(tmp_foo.prob)
from tmp_foo group by tmp_foo.arg0,tmp_foo.arg1
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmp_foo
Here, the table tmp_foo stores the facts created by the two
rules. Creating these two insert into ... select statements
is rather generic and not discussed here any further. The last
insert into ... select statement inserts the generated facts
into the foo table, where the probabilities are summed up
if the same facts occurs multiple times in the temporary table.
This can happen if both rules generate the same facts; but also the
same rule can generate the fact multiple times (e.g. when the facts
bar(1,a), bar(a,2), bar(1,b) are available
bar(b,2). Finally, two database indexes are created for both
arguments, allowing for faster processing, and the temporary table
is removed.
When the same rules are processed without disjointness, the
inclusion-exclusion formula has to be applied. This results in these
SQL statements:
drop table if exists tmp_foo
create table tmp_foo (arg0 varchar(255),arg1 varchar(255),prob double)
insert into tmp_foo select rel0.arg0,rel0.arg1,(rel0.prob) from bar rel0
insert into tmp_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
from bar rel0,bar rel1 where (rel0.arg1=rel1.arg0)
select distinct tmp_foo.arg0,tmp_foo.arg1 from tmp_foo
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='1') and (tmp_foo.arg1='2')
insert into foo values('1','2',(0.3*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='2') and (tmp_foo.arg1='3')
insert into foo values('2','3',(0.5*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='3') and (tmp_foo.arg1='4')
insert into foo values('3','4',(0.7*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='1') and (tmp_foo.arg1='3')
insert into foo values('1','3',(0.15*(1)))
select tmp_foo.prob from tmp_foo where (tmp_foo.arg0='2') and (tmp_foo.arg1='4')
insert into foo values('2','4',(0.35*(1)))
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmp_foo
The temporary table tmp_foo is created as before. The
select distinct now ranges over all possible arguments. For
each argument, the probabilities are retrieved from the database and
combined via the inclusion-exclusion formula. The resulting
probability is inserted into the foo table.
The pDatalog++ engine has only basic support for recursive rules
(employing a naive evaluation scheme):
insert into foo select rel0.arg0,rel0.arg1,(rel0.prob) from bar rel0
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmpa_foo
create table tmpa_foo (arg0 varchar(255),arg1 varchar(255),prob double)
drop table if exists tmpb_foo
create table tmpb_foo (arg0 varchar(255),arg1 varchar(255),prob double)
insert into tmpa_foo select distinct * from foo
delete from tmpb_foo
insert into tmpb_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
from tmpa_foo rel0,bar rel1 where (rel0.arg1=rel1.arg0)
insert into tmpa_foo select distinct * from tmpb_foo
delete from tmpb_foo
select count(distinct arg0,arg1,prob) from tmpa_foo
delete from tmpb_foo
insert into tmpb_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
from tmpa_foo rel0,bar rel1 where (rel0.arg1=rel1.arg0)
insert into tmpa_foo select distinct * from tmpb_foo
delete from tmpb_foo
select count(distinct arg0,arg1,prob) from tmpa_foo
delete from tmpb_foo
insert into tmpb_foo select rel0.arg0,rel1.arg1,(rel0.prob*rel1.prob)
from tmpa_foo rel0,bar rel1 where (rel0.arg1=rel1.arg0)
insert into tmpa_foo select distinct * from tmpb_foo
delete from tmpb_foo
select count(distinct arg0,arg1,prob) from tmpa_foo
alter table tmpa_foo add index (arg0)
alter table tmpa_foo add index (arg1)
delete from foo
select distinct tmpa_foo.arg0,tmpa_foo.arg1 from tmpa_foo
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='1') and (tmpa_foo.arg1='2')
insert into foo values('1','2',(0.3*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='2') and (tmpa_foo.arg1='3')
insert into foo values('2','3',(0.5*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='3') and (tmpa_foo.arg1='4')
insert into foo values('3','4',(0.7*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='1') and (tmpa_foo.arg1='3')
insert into foo values('1','3',(0.38587499999999997*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='2') and (tmpa_foo.arg1='4')
insert into foo values('2','4',(0.7253750000000001*(1)))
select tmpa_foo.prob from tmpa_foo where (tmpa_foo.arg0='1') and (tmpa_foo.arg1='4')
insert into foo values('1','4',(0.19897499999999999*(1)))
alter table foo add index (arg0)
alter table foo add index (arg1)
drop table if exists tmpa_foo
drop table if exists tmpb_foo
The basic idea is to use two temporary tables tmpa_foo
and tmpb_foo. The first one accumulates the generated facts,
the second one contains newly generated facts. Then, the rule is
evaluated repeatedly until the total number of facts is constant.
In a final step, the probabilities are computed as described above.