A Data Structure for Subsumption-Based Tabling

A Data Structure for Subsumption-Based Tabling in Top-Down Resolution Engines for Data-Intensive Logic Applications

Jacques Calmet and Peter Kullmann

Institut für Algorithmen und Kognitive Systeme (IAKS), Fakultät für Informatik, Universität Karlsruhe, Am Fasanengarten 5, D-76131 Karlsruhe, Germany

Abstract. Using tables in top-down resolution engines is an effective optimization because it avoids recomputing previously derived facts. This effect can be exploited even further if facts are reused not only from identical subgoals but also from subsuming subgoals. This is of particular interest if recalculation is expensive, e.g. due to data transmission cost or the sheer volume of data to be processed. We present a data structure for subsumption-based tabling in subgoal-oriented resolution. Our data structure is based on hB-trees, a multi-attribute indexing structure which emerged from research on database management systems. Hence, our data structure inherits desirable properties such as its suitability as an efficient indexing structure for secondary storage.

Keywords: Logic for Artificial Intelligence, Intelligent Information Retrieval, Subsumption-based Tabling, Information Integration

1 Introduction

Tabling in top-down proof procedures was originally proposed to guarantee the termination of recursive programs and thus allow their evaluation [TS86]. The general effect of using tables is that the recalculation of answer substitutions can be avoided if a subgoal has already been calculated in the resolution process or its calculation is currently in progress. The new subgoal can be fed with answers from the existing subgoal. Thus, answer tables serve as a cache. This aspect also makes tables interesting as an optimization technique for non-recursive subgoals if the cost of calculating answers is high. This might be the case if remote sites have to be contacted via a network in order to retrieve facts from a specific information source, or if the data volume is extremely high.

In connection with our research on the integration of heterogeneous information systems we have developed the logic-based mediator system KOMET [CJKS97]. The core of KOMET is a highly flexible resolution engine which processes many-sorted annotated logic programs [KS92]. It uses SLG resolution [CW93], which has been extended for annotated logic. SLG resolution makes extensive use of answer tables for program evaluation under the well-founded semantics. In the context of a mediator system like KOMET, tables can play a central role in the query optimization strategy by serving as a cache for expensive subgoal calls. Ideally the tables are kept across queries to exploit their effect to the fullest. For certain applications it might even be worthwhile to use local secondary storage devices for maintaining answer tables. Such a mechanism can make a mediator system highly suitable as a data warehouse system in which the gathering, preprocessing and local storing of information is done implicitly by the mediator engine, in contrast to the one or more explicit preprocessing steps of traditional data warehouse systems.

2 Data Structures for Tabling

In this paper we adopt the notion of subgoals as answer producers, whereas the selected literals in clauses are regarded as answer consumers. Upon selection of a literal in a clause body it is necessary to find a producer which generates appropriate answers to be able to resolve the clause. This producer either already exists from previous calculations or has to be created, which usually implies that program clause resolution is initiated.

Common approaches to tabling rely on variant-based tabling. In variant-based tabling an existing subgoal is accepted as producer if it equals the selected literal up to variable renaming. Variant-based tabling has several properties which make the use and processing of answer tables straightforward and efficient. Variant-based tabling schemes mostly adopt a structure-sharing approach and represent subgoal answer tables in the form of indexing tries [RRR96]. Since tries share representations of atoms, they are very efficient in terms of space requirements and exhibit very good performance for insert and lookup operations.

The main drawback of variant-based tabling is that the potential of tables as a caching facility is not fully exploited. Variant-based tabling tends to produce a large number of subgoal answer tables while performing many redundant computations. By exploiting the subsumption of subgoal calls, the performance of a subgoal-based resolution engine can be drastically improved. This has recently been shown in [RRR96], where tries have been extended for subsumption-based tabling.

We believe that a tabling data structure suitable for data-intensive logic-based applications should have the following properties:

- Support for specific needs of subsumptive tabling: Subsumptive tabling requires retrieval of subsets of the table contents according to a query literal which is more specific than the subgoal but may still contain variables. This requires all answers to be indexed on all arguments in some way, since it cannot be foreseen what binding information is supplied by consumers. As another requirement, a mechanism for registering consumers is needed, since a subgoal calculation might still be in progress when a subsumed literal is selected.
- Reduction of subsumed answers: On the one hand, a new answer should be ignored if it is subsumed by another answer in the data structure. On the other hand, if a new answer subsumes one or more answers in the data structure, these should be removed from the table.
- Support for secondary storage: Main memory should not limit the data volume which can be processed in a deductive system. However, disk accesses must be sufficiently efficient.
- Support for persistence: To exploit tabling to the fullest it might be sensible to store answer tables over a longer period of time (not only across queries, but also across query sessions). In a concrete system, tables could be recalculated regularly (e.g. weekly) in order to reduce resource consumption in a production environment.

The use of tries as data structure for subsumption-based tabling has an essential drawback. Since tries are one-dimensional indexes, the concatenation of the attributes is used as index key. Tries inherently index on prefixes of the stored information, i.e. the longer the bound prefix of the query, the more efficient the lookup. Partial-match queries degenerate to traversals of the entire data structure in the worst case, even in the presence of bound arguments.

Another class of access methods applicable to our scenario are hash-based methods. However, they present several problems. The first is poor space efficiency, which is even worse in the case of multi-dimensional methods. The second is that they are not suitable for the efficient incorporation of secondary storage devices. Finally, the dynamic growth of hash tables usually involves expensive reorganization processes.
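The variant and subsumption relations between literals that underlie the properties listed above, together with answer reduction, can be illustrated with a minimal Python sketch. It assumes function-free literals encoded as tuples whose first element is the predicate name, and variables encoded as strings starting with an uppercase letter. This encoding and the helper names `is_var`, `subsumes`, `is_variant` and `reduce_answers` are assumptions made for illustration; they are not taken from KOMET.

```python
def is_var(term):
    """A term counts as a variable if it is a string starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def subsumes(general, specific):
    """Return True if `general` can be instantiated to `specific`.

    Literals are tuples (predicate, arg1, ..., argn) without function symbols.
    This is one-way matching: only variables of `general` may be bound.
    """
    if len(general) != len(specific) or general[0] != specific[0]:
        return False
    binding = {}
    for g, s in zip(general[1:], specific[1:]):
        if is_var(g):
            # A repeated variable must be bound to the same term every time.
            if binding.setdefault(g, s) != s:
                return False
        elif g != s:
            return False
    return True


def is_variant(a, b):
    """Two literals are variants if each subsumes the other."""
    return subsumes(a, b) and subsumes(b, a)


def reduce_answers(table, new):
    """Add `new` to `table` unless it is subsumed; drop the answers it subsumes."""
    if any(subsumes(old, new) for old in table):
        return table
    return [old for old in table if not subsumes(new, old)] + [new]
```

Under variant-based tabling a call `anc(a, Y)` could not reuse a table for `anc(X, Y)` because the two literals are not variants, whereas `subsumes` accepts the more general subgoal as a producer.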

3 Multi-attribute Indexing Structures

Multi-attribute indexing structures have mainly been a concern of database research. Usually, order-preserving access methods applicable to arbitrary sets of key values are based on tree search. The classical indexing structure for one-dimensional access is the B-tree, a balanced, page-oriented multiway search tree. A naive way to achieve multi-attribute indexing is to use a one-dimensional index for each dimension. The greatest disadvantages of this approach are the poor space efficiency and the high cost of insertions and deletions, since these operations have to be performed for each dimension individually. A cost that increases proportionally with the dimension of the search space is not generally acceptable.

Grid files, K-D-B-trees, R-trees and their variants are well-known data structures for multi-attribute indexing. More recently, X-trees [BKK96], hB-trees [LS90,ELS95] and UB-trees [Bay97] have been discussed, each with specific strengths and weaknesses. Our requirements for a tabling data structure are good average storage utilization, efficient dynamic reorganization, support for partial-match queries and robustness with respect to data distribution and key space dimension. Most of the above indexing structures exhibit these properties only partially. For our purposes we have found hB-trees to be a data structure which guarantees these properties in the face of arbitrary data. A more detailed analysis of multi-attribute indexing methods can be found in [Lom92].

4 hB-Trees

The hB-tree, introduced by Lomet and Salzberg [LS90], exhibits several properties in addition to our requirements.

- For small data sets, hB-trees can easily be modified for complete storage and management within physical main memory.
- hB-trees have recently been extended for concurrent access [ELS95]. This property would make them suitable for a multi-threaded resolution engine. Note that SLG resolution is also well-suited for concurrent processing.

The hB-tree is derived from the K-D-B-tree. hB-trees are composed of pages, each of which contains a k-d-tree. A k-d-tree basically consists of two types of nodes. Data nodes contain data objects and represent rectangular subspaces. Index nodes contain an index term and references to the left and right subtree and divide a subspace along one dimension. An hB-tree page is the unit of storage on disk. In contrast to K-D-B-trees, the goal of the hB-tree structure is to avoid downward cascading of page splits, hence avoiding both restructuring cost and poor storage utilization. When a page becomes full, splitting is done based on the existing boundaries within the subspace. This results in the removal of a smaller brick-like subspace from the larger brick-like subspace of the original full page. The result is a "holey brick", from which the hB-tree takes its name.

In the remainder of the paper we describe the modifications necessary to use hB-trees as a tabling structure for subsumptive tabling. We first define an abstract tabling engine and then describe our data structure in terms of this definition. Our work is currently restricted to logical languages without free function symbols but includes non-monotonic reasoning. In addition we require the domains of all predicate arguments to be totally ordered.

5 Tabled Resolution

In this section we give an abstract description of a resolution engine with tabling. This abstract view helps to understand the central properties and difficulties of tabled resolution. Our definition is similar to the one given in [RRR96] but differs in details and emphasizes other aspects of tabled resolution. For this presentation we use a simplified version of our model which is restricted to programs without negation. We divide a tabled resolution engine into two subsystems: the resolution engine RE and the tabling engine TE. Neither of the two systems is a true subsystem of the other. Depending on the evaluation strategy they may call each other. Due to the decoupling of the two subsystems, which is achieved by using input queues, this framework may well serve as a model for a multi-threaded resolution engine.

A tabling engine TE is given by the set of tabled subgoals S, of which each subgoal s is given by the tuple

    ⟨l, A, C, Q⟩

where l denotes a subgoal literal, A represents a set of answers, C represents a set of pending clauses with selected literal and associated subgoal for the head literal, and Q stands for a queue of answers waiting to be inserted into the subgoal table. A contains all facts subsumed by l found up to the current state of processing. If substitution factoring is used for tabling, A is a set of answer substitutions over the set of variables in l. Since substitution factoring can be employed transparently in this framework, we can neglect it here without loss of generality.

A resolution engine RE is given by the tuple

    ⟨P, Qc, Qs, Qr⟩

where P is a logic program and Qc, Qs and Qr represent queues of elements waiting for further processing. Qc is a queue of pending pairs ⟨s, Cl⟩, where Cl is a clause and s an associated subgoal table for its head. Qs is a queue of new subgoals waiting for program clause resolution. Qr is a queue of triples ⟨s, Cl, θ⟩. Such a triple is formed by a clause Cl with a selected literal, an associated subgoal table s for the clause head and a substitution θ for answer resolution.

RE supports the following operations:

- NewClause(s, Cl): This operation takes a clause Cl and its associated subgoal table s. If the clause body is empty, TE::NewAnswer is called; otherwise a literal is selected and TE::FindOrCreateSubgoal is called.
- NewSubgoal(s): This operation determines the appropriate program clauses for resolution with a subgoal literal s. For each clause it performs the appropriate substitution and calls RE::NewClause.
- ResolveAnswer(s, Cl, Cls, θ): This operation performs answer resolution on the clause Cl with selected literal Cls using the given substitution θ. It then calls RE::NewClause with s and the resolved clause.

TE supports the following operations:

- NewAnswer(s, A): This operation inserts an answer A into the specified subgoal table s if it is not subsumed. It determines all appropriate consumers in C and calls RE::ResolveAnswer with each consumer.
- FindOrCreateSubgoal(l, Cl): This operation first determines whether an appropriate subgoal call already exists in S. If it does exist, all subsumed answers are sent to the consuming clause Cl. If the subgoal evaluation is still in progress, Cl is registered as consumer. If the call does not yet exist, a new subgoal table is created and RE::NewSubgoal is invoked.

These operations parallel a subset of the transformations of SLG resolution as defined in [CW93]. The termination property of SLG can be directly transferred to our tabled resolution engine if we ensure that the elements of all queues are processed until every queue is empty.
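The two state tuples and the queue-draining condition described above can be sketched in Python as follows. This is only a structural illustration under stated assumptions, not the KOMET implementation: the class names `Subgoal` and `ResolutionEngine`, the function `drain`, and the order in which the queues are served are hypothetical. The operations of RE and TE are passed in as plain callables, and no resolution logic is included.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Subgoal:
    """A tabled subgoal <l, A, C, Q>."""
    literal: tuple                                  # l: the subgoal literal
    answers: list = field(default_factory=list)     # A: answers found so far
    consumers: list = field(default_factory=list)   # C: pending clauses
    pending: deque = field(default_factory=deque)   # Q: answers awaiting insertion


@dataclass
class ResolutionEngine:
    """A resolution engine <P, Qc, Qs, Qr>."""
    program: list                                   # P: program clauses
    qc: deque = field(default_factory=deque)        # Qc: pairs (s, Cl)
    qs: deque = field(default_factory=deque)        # Qs: new subgoals
    qr: deque = field(default_factory=deque)        # Qr: triples (s, Cl, theta)


def drain(engine, subgoals, new_clause, new_subgoal, resolve_answer, new_answer):
    """Process queue elements until every queue of RE and TE is empty.

    The four callables stand in for the operations of RE and TE; each may
    append further work to any queue. Returns the number of processed elements.
    """
    steps = 0
    while True:
        if engine.qs:
            new_subgoal(engine.qs.popleft())
        elif engine.qc:
            new_clause(*engine.qc.popleft())
        elif engine.qr:
            resolve_answer(*engine.qr.popleft())
        else:
            # Look for a subgoal table that still has answers waiting.
            waiting = next((s for s in subgoals if s.pending), None)
            if waiting is None:
                return steps  # all queues are empty: evaluation is complete
            new_answer(waiting, waiting.pending.popleft())
        steps += 1
```

Because the operations communicate only through the queues, the same structure could in principle be served by several worker threads, which is the decoupling the text refers to.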

6 hBT-Tables

We start from the hB-tree data structure as given in [LS90,ELS95]. As the operations of the tabled resolution engine show, our data structure has to provide the following mechanisms:

- insertion of answers with variable arguments,
- determination of subsumed answers (partial-match queries),
- registration of consumers,
- correct page splitting in the presence of the above modifications.

Insertion. Answers with no variable arguments are inserted as in the standard insert operation, that is, the appropriate leaf node is determined by descending the hB-tree. Since answers with variables in their arguments represent a region in the index space, they are inserted in all data nodes which overlap the given region¹. Since a search only finds data nodes as a whole, it is necessary to regard the answers in a data node as candidate answers and to check them individually against the search literal. Special treatment is given to answers which have only variable arguments, such as A(X, Y) or A(X, X) for a subgoal A(X, Y). Otherwise they would need to be inserted in all data nodes. To reduce redundancy and to be able to guarantee the correctness of our page split algorithm, we store these answers in a separate list at the root of the tree. The algorithm is given in pseudocode below.

    Insert(Answer)
      InsertAnswer(Root, Answer)
      if Answer was inserted
        SendToConsumers(Answer)
    end

    InsertAnswer(Node, Answer)
      if Node is an index node
        SDim is the splitting dimension of the index term
        if Answer is not ground at argument position SDim
          InsertAnswer(left child, Answer)
          InsertAnswer(right child, Answer)
        else if the argument value of Answer at position SDim is less than index term
          InsertAnswer(left child, Answer)
        else
          InsertAnswer(right child, Answer)
      else Node is data node
        forall answers B in the answer container
          Perform reduction
        end forall
        if Answer was not subsumed
          Insert Answer in answer container
          if container overruns
            split node
      end if
    end

¹ This technique is used in R+-trees as well as in UB-trees.
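The insertion pseudocode above can be turned into a small runnable Python sketch. It is a simplification under stated assumptions: the tree is a plain in-memory k-d-tree without pages, answers are tuples of arguments, variables are strings starting with an uppercase letter, and page splitting and consumer notification are omitted. All class and function names (`DataNode`, `IndexNode`, `Table`, `insert`, `insert_answer`) are hypothetical.

```python
from dataclasses import dataclass, field


def is_var(term):
    """Variables are encoded as strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def subsumes(general, specific):
    """One-way matching on argument tuples (no function symbols)."""
    binding = {}
    for g, s in zip(general, specific):
        if is_var(g):
            if binding.setdefault(g, s) != s:
                return False
        elif g != s:
            return False
    return len(general) == len(specific)


@dataclass
class DataNode:
    answers: list = field(default_factory=list)


@dataclass
class IndexNode:
    dim: int        # splitting dimension
    key: object     # index term: values below `key` belong to the left child
    left: object
    right: object


@dataclass
class Table:
    root: object = field(default_factory=DataNode)
    all_var: list = field(default_factory=list)  # answers with only variable arguments


def insert_answer(node, answer):
    """Insert into every data node whose region overlaps `answer`; True if stored."""
    if isinstance(node, IndexNode):
        value = answer[node.dim]
        if is_var(value):
            # Not ground at the splitting dimension: descend into both children.
            left = insert_answer(node.left, answer)
            right = insert_answer(node.right, answer)
            return left or right
        child = node.left if value < node.key else node.right
        return insert_answer(child, answer)
    # Data node: perform reduction against the stored answers.
    if any(subsumes(old, answer) for old in node.answers):
        return False
    node.answers[:] = [old for old in node.answers if not subsumes(answer, old)]
    node.answers.append(answer)
    return True  # splitting on overrun is omitted in this sketch


def insert(table, answer):
    """Entry point: all-variable answers go to the separate list at the root."""
    if any(subsumes(old, answer) for old in table.all_var):
        return False
    if all(is_var(a) for a in answer):
        # Purging tree answers subsumed by the new entry is omitted here.
        table.all_var[:] = [o for o in table.all_var if not subsumes(answer, o)]
        table.all_var.append(answer)
        return True
    return insert_answer(table.root, answer)
```

In this sketch an answer such as `("X", 2)` is stored in both children of an index node that splits on dimension 0, which is the duplication the text describes.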

Finding subsumed or unifiable answers. The search for subsumed (unifiable) answers in general corresponds to a partial-match query. Such a query defines a subspace within the search space. As result, all entries that are enclosed in (overlap) the subspace are returned. The candidate data pages are determined in the same way as in the insertion algorithm. The algorithm for subsumed answers is given below. The algorithm for unifiable answers works analogously.

    GetSubsumed(Node, Query, Subsumed)
    begin
      if Node is an index node
        SDim is the splitting dimension of the index term
        if Query is not ground at argument position SDim
          GetSubsumed(left child, Query, Subsumed)
          GetSubsumed(right child, Query, Subsumed)
        else if the argument value of Query at position SDim is less than index term
          GetSubsumed(left child, Query, Subsumed)
        else
          GetSubsumed(right child, Query, Subsumed)
      else Node is data node
        add all subsumed answers in container to Subsumed
      end if
    end
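The partial-match search above can likewise be sketched in Python. The sketch makes the same assumptions as before (an in-memory k-d-tree without pages, answers as argument tuples, variables as uppercase strings), and the names `DataNode`, `IndexNode` and `get_subsumed` are hypothetical. Because non-ground answers may be stored in several data nodes, the sketch additionally filters duplicates, a detail the pseudocode leaves open.

```python
from dataclasses import dataclass, field


def is_var(term):
    """Variables are encoded as strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def subsumes(general, specific):
    """One-way matching on argument tuples (no function symbols)."""
    binding = {}
    for g, s in zip(general, specific):
        if is_var(g):
            if binding.setdefault(g, s) != s:
                return False
        elif g != s:
            return False
    return len(general) == len(specific)


@dataclass
class DataNode:
    answers: list = field(default_factory=list)


@dataclass
class IndexNode:
    dim: int        # splitting dimension
    key: object     # index term: values below `key` belong to the left child
    left: object
    right: object


def get_subsumed(node, query, found):
    """Collect all answers subsumed by `query` that are reachable from `node`."""
    if isinstance(node, IndexNode):
        value = query[node.dim]
        if is_var(value):
            # Query is not ground at the splitting dimension: search both sides.
            get_subsumed(node.left, query, found)
            get_subsumed(node.right, query, found)
        elif value < node.key:
            get_subsumed(node.left, query, found)
        else:
            get_subsumed(node.right, query, found)
    else:
        for answer in node.answers:
            # Answers in a data node are only candidates and are checked one by one.
            if subsumes(query, answer) and answer not in found:
                found.append(answer)
    return found
```

A query bound in any argument position prunes the subtrees of index nodes that split on that position. This is the advantage over a trie, which can prune only on a bound prefix.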

Consumer registration. We supplement our data structure with a K-D-B-tree in which the consumers for answers of the particular subgoal are stored. After a new answer has been inserted into the answer tree, the consumer entries that unify with the new answer are determined. The answer is then queued in RE for resolution.

Page splitting. The general algorithm for page splitting remains as given for the hB-tree. The general idea is that upon node overrun the content of the node is split evenly into two parts. In [LS90] it has been shown that in a node with point data² such a split is always possible with a ratio of 2:1 or better. Note that the splitting criterion may involve more than one dimension of the search space. In the presence of spatial data³ the situation is different. If an answer or a consumer is variable in the argument position of the split dimension or, generally speaking, overlaps both subspaces, it is duplicated and inserted in both pages. This is the technique used in R+-trees [SRF87]. A problem arises when all entries in a data node cover the entire subspace defined by the node. It is then not possible to find an appropriate splitting criterion. However, by exploiting the special structure of non-ground answers in our application domain we can guarantee valid data node splits if the capacity of data nodes is at least (dim−2)(dim−1)/2 + 1. This number corresponds to the maximum possible number of non-ground entries in the presence of answer reduction due to subsumption.

² i.e. there are no entries with variable arguments
³ here, answers with non-ground arguments such as A(1, X, Y)

    SplitNode(DataNode)
    begin
      Find splitting criterion which splits with 2:1 ratio
      NewNode is a new index node with subnodes realizing the criterion
      forall answers B in answer container
        Insert B in NewNode
      end forall
      Exchange NewNode with DataNode
    end
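The duplication of non-ground answers during a data node split can be sketched as follows. This Python sketch is deliberately simpler than the hB-tree criterion of [LS90]: it considers only single-dimension splits at the median of the ground values and picks the most balanced one, so it does not guarantee the 2:1 ratio and does not produce holey bricks. The names `DataNode`, `IndexNode` and `split_node` are hypothetical, and the ground values of each dimension are assumed to be mutually comparable.

```python
from dataclasses import dataclass, field


def is_var(term):
    """Variables are encoded as strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


@dataclass
class DataNode:
    answers: list = field(default_factory=list)


@dataclass
class IndexNode:
    dim: int        # splitting dimension
    key: object     # index term: values below `key` belong to the left child
    left: object
    right: object


def split_node(node, dims):
    """Return an index node replacing the overfull data node `node`.

    Answers that are variable in the split dimension overlap both subspaces
    and are therefore copied into both children. If no dimension separates
    the entries, the original node is returned unchanged.
    """
    best = None
    for dim in range(dims):
        values = sorted({a[dim] for a in node.answers if not is_var(a[dim])})
        if len(values) < 2:
            continue  # nothing to separate along this dimension
        key = values[len(values) // 2]
        left = [a for a in node.answers if is_var(a[dim]) or a[dim] < key]
        right = [a for a in node.answers if is_var(a[dim]) or a[dim] >= key]
        worst = max(len(left), len(right))
        if worst == len(node.answers):
            continue  # the split would not shrink either side
        if best is None or worst < best[0]:
            best = (worst, dim, key, left, right)
    if best is None:
        return node
    _, dim, key, left, right = best
    return IndexNode(dim, key, DataNode(left), DataNode(right))
```

The unsplittable case in the sketch corresponds to the situation discussed in the text where the entries of a node cannot be separated by any splitting criterion.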

7 Results

We have implemented a version of hBT-tables for complete in-memory tabling. The implementation of our system additionally supports negation and handles many-sorted annotated logic programs according to the well-founded semantics using SLG resolution. We are not aware of any other system which implements the calculation of the well-founded semantics for annotated logic programs. The only inference engine we know of that (partially) implements SLG resolution is the PROLOG system XSB. Due to the additional features of our engine, it is approximately an order of magnitude slower than XSB. In the table below we compare the performance of subsumptive hBT-tables with a variant-based version of our engine.

    Size of            KOMET 1.2,       KOMET 1.2,           Relative
    parent relation    variant-based    subsumption-based    speedup
    512                  4.29             0.82               5.23
    1024                15.81             2.5                6.32
    2048                59.19             8.92               6.63
    4096               235.89            33.7                7.0
    8192               963.93           128.84               7.4

Table 1. Performance comparison

As benchmark we used the calculation of the ancestor relation from a complete binary tree realized by the parent relation with different sizes⁴. The tests were run on an Intel Pentium 233 MMX with 128MB of main memory. Execution times are given in seconds. In our experiments hBT-tables exhibit substantial speedups in comparison to our previous variant-based tabling technique. They prove to be robust under different circumstances such as data distribution and arity of predicates.

⁴ ancr(X,Y):[t]