A Dynamic Approach to Deductive Query Evaluation

Andreas Behrend
University of Bonn, Institute of Computer Science III
Römerstr. 164, D-53117 Bonn, Germany
e-mail: [email protected]
Abstract
In this paper, a dynamic approach to deductive query evaluation is presented which combines the effects of different SIP strategies used for the magic set rewriting technique. Given a query execution plan obtained by a single magic set transformation, this plan is normally not reoptimized, and the actual number of facts generated during query processing cannot be considered. But applying the magic set transformation again during a query evaluation process allows the use of a different SIP strategy which takes into account dynamic criteria as well, e.g. the number of generated answer facts or the number of accesses to relations. It will be shown that adapting query plans at runtime can reduce the overall number of generated facts and hence reach optimization effects which cannot be achieved by magic sets based on a single choice of SIP strategy.
1 Introduction

The magic sets rewriting technique [2, 3] for query evaluation seems to be the most promising approach to evaluating database queries bottom-up for database systems with a powerful view concept. This is in particular the case for systems which will implement the new SQL3 standard and hence will allow the definition of recursive views. The attractiveness of this method lies in its generality and efficiency. Several approaches to improve the magic set method have been proposed which are often applicable in special cases only, e.g. the counting method [10], factorizing methods [15], and linear rules [14]. Another approach called envelopes [18] is as general as the magic set method and can be better than magic sets in many cases. There are dynamic methods [5, 9] as well, which try to restrict the search for a particular query to relevant data at computation time rather than at compilation time. These methods determine relevant facts differently from the magic sets method and can consider further (dynamic) criteria as well, e.g., the actual number of relevant facts computed so far. In [6] it has been shown that an adaptation of query execution plans can be worthwhile by comparing a static and several dynamic strategies. The described adaptive strategies, however, do not use the magic set method for improving the query evaluation process.

In this paper we present an approach where a query execution plan obtained by a single (static) magic set transformation can be reoptimized dynamically and hence allows consideration of the actual number of facts generated during query processing. We consider Datalog rules, i.e., function-free Horn clauses, and distinguish between the extensional database (i.e., base relations) and the intensional database (i.e., derived relations). A Datalog rule consists of a single positive literal called head of the rule and a conjunction
of positive literals called body of the rule. The magic set method is used for transforming a given rule set and query to magic rules, such that the evaluation of these rules focuses only on the part of the database which is relevant to the query. A magic set transformation is done with respect to a chosen sideways information passing strategy (SIP strategy) [3] which indicates how bindings in the head of a rule are passed to the rule's body and in which order the body literals have to be evaluated. Applying the magic set transformation once again during query processing allows the use of a different SIP strategy which takes into account dynamic criteria as well. Since a chosen SIP strategy represents a special way to 'prove' a given query, it is natural to ask whether it is possible to combine the effects of different SIP strategies, leading to a more efficient 'proof process'.¹ Consider for example the following rules for computing the transitive closure of a relation b:

p(X, Y) ← b(X, Y)
p(X, Y) ← b(X, Z), p(Z, Y)
As stated in [18], given a query ?-p(a, b), it is sometimes better to evaluate the more general query ?-p(a, Y) and to check whether b is in the answer than to evaluate ?-p(a, b) directly. For this example there are two possible full SIP strategies which evaluate the body literals from 'left-to-right' or from 'right-to-left'. But the magic set transformation with respect to these strategies already leads to corresponding proof processes like (?-p(a, Y) and check b in Y) and (?-p(X, b) and check a in X); that is, neither of them really uses the other binding. Combining the two strategies, however, leads to two proof processes which evaluate ?-p(a, Z) or ?-p(Z, b), respectively, until all solutions are found for one of the two processes. Thus, changing the SIP strategy during the materialization process corresponds to a bidirectional search. Reoptimization to adapt query plans at runtime can then be regarded as weighted multidirectional search. The method for query evaluation presented in this paper implements such a multidirectional search process and hence finds the cheapest proof process given several SIP strategies. This avoids the problem of choosing a 'bad' SIP strategy when the 'best' SIP strategy with respect to the number of facts generated within a query evaluation process is not known in advance. Moreover, the method is as general as magic sets and assures evaluation costs which can be determined exactly according to the 'best' SIP strategy. Of course, counting the number of facts is not a complete cost measure, but can be a good indication of the relative efficiency of a method. In this paper, we present only the motivation for and the general scheme of the dynamic approach. Algorithms and proofs will be presented in a full version of this paper. The paper is organized as follows.
In Section 2, an example is presented to show how a chosen SIP strategy influences the number of facts generated during a query evaluation process, leading to different evaluation costs. Section 3 presents the dynamic phase of our method by using an example to show the effects of the reoptimization of a query evaluation process using different strategies. In Section 4 the static analysis of SIP strategies which ought to be considered for the reoptimization process is presented. In Section 5 the method is summarized by presenting the static and the dynamic phase within a general query evaluation scheme. Section 6 presents further research topics and a conclusion.

¹ See also [19] for a static approach to combining multiple SIP strategies.
2 A motivating example

Consider the following Datalog rules R:

q(X, Y) ← p(X, Y), r(Y, X)
p(X, Y) ← b1(X, Y)
p(X, Y) ← b1(X, Z), p(Z, Y)
r(Y, X) ← b2(Y, X)
r(Y, X) ← b2(Y, Z), r(Z, X)
Relations b1 and b2 are base relations; r and p are derived and compute the transitive closure of b2 and b1, respectively. The relation q is the intersection of the transitive closures. Given the query ?-q(a,Y), one has to decide which SIP strategy would be the best to compute all relevant answers via the magic set method. Because three rules have two body literals there are 2^3 = 8 rule sets with different orders of body literals for this example. Thus, any choice of a full SIP strategy leads to one of these eight possible rule sets. Consider for example the 'left-to-right' SIP strategy where the binding of X in the query is passed to p. Evaluation of the p-literal leads to a binding for the argument Y of r. This SIP strategy leads to the following magic set transformed rules R1 (note that the adornment b denotes a bound argument position, whereas f denotes a free argument position):

q^bf(X, Y) ← m_q^bf(X), p^bf(X, Y), r^bb(Y, X)
p^bf(X, Y) ← m_p^bf(X), b1(X, Y)
p^bf(X, Y) ← m_p^bf(X), b1(X, Z), p^bf(Z, Y)
r^bb(Y, X) ← m_r^bb(Y, X), b2(Y, X)
r^bb(Y, X) ← m_r^bb(Y, X), b2(Y, Z), r^bb(Z, X)
m_p^bf(X) ← m_q^bf(X)
m_p^bf(Z) ← m_p^bf(X), b1(X, Z)
m_r^bb(Y, X) ← m_q^bf(X), p^bf(X, Y)
m_r^bb(Z, X) ← m_r^bb(Y, X), b2(Y, Z)
The query ?-q(a,Y) is represented by the 'seed' fact m_q^bf(a). Suppose the base relations consist of the following tuples F:

b1(a, b)   b1(a, c)   b1(c, c')   b1(c, c'')   b1(c', d')   b1(c'', d'')
b2(e, f)   b2(f, a)   b2(c'', a)

which can be represented by the following two directed graphs:

[Figure: directed graphs of relation b1 (edges a→b, a→c, c→c', c→c'', c'→d', c''→d'') and relation b2 (edges e→f, f→a, c''→a)]
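The effect of the 'left-to-right' rule set on this database can be reproduced with a small bottom-up evaluator. The following is only a sketch under stated assumptions: the helper names `atom`, `match` and `evaluate` are illustrative, predicates such as `m_p_bf` are ASCII renderings of the adorned magic predicates, the primed constants follow the fact list as reconstructed above, and the fixpoint is computed naively rather than semi-naively.

```python
def atom(pred, *args):
    return (pred, args)

def is_var(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()

def match(body, db, subst):
    """Yield every substitution that satisfies all body atoms against db."""
    if not body:
        yield subst
        return
    (pred, args), rest = body[0], body[1:]
    for fpred, fargs in db:
        if fpred != pred or len(fargs) != len(args):
            continue
        s = dict(subst)
        if all(s.setdefault(t, v) == v if is_var(t) else t == v
               for t, v in zip(args, fargs)):
            yield from match(rest, db, s)

def evaluate(rules, facts):
    """Naive bottom-up fixpoint: apply all rules until nothing new is derived."""
    db = set(facts)
    while True:
        new = {(hp, tuple(s.get(t, t) for t in hargs))
               for (hp, hargs), body in rules
               for s in match(body, db, {})}
        if new <= db:
            return db
        db |= new

# Magic rules R1 ('left-to-right' SIP strategy), written as (head, body) pairs.
R1 = [
    (atom("q_bf", "X", "Y"), [atom("m_q_bf", "X"), atom("p_bf", "X", "Y"), atom("r_bb", "Y", "X")]),
    (atom("p_bf", "X", "Y"), [atom("m_p_bf", "X"), atom("b1", "X", "Y")]),
    (atom("p_bf", "X", "Y"), [atom("m_p_bf", "X"), atom("b1", "X", "Z"), atom("p_bf", "Z", "Y")]),
    (atom("r_bb", "Y", "X"), [atom("m_r_bb", "Y", "X"), atom("b2", "Y", "X")]),
    (atom("r_bb", "Y", "X"), [atom("m_r_bb", "Y", "X"), atom("b2", "Y", "Z"), atom("r_bb", "Z", "X")]),
    (atom("m_p_bf", "X"), [atom("m_q_bf", "X")]),
    (atom("m_p_bf", "Z"), [atom("m_p_bf", "X"), atom("b1", "X", "Z")]),
    (atom("m_r_bb", "Y", "X"), [atom("m_q_bf", "X"), atom("p_bf", "X", "Y")]),
    (atom("m_r_bb", "Z", "X"), [atom("m_r_bb", "Y", "X"), atom("b2", "Y", "Z")]),
]

# Base facts F (primed constants as assumed above) and the seed for ?-q(a,Y).
b1 = [("a", "b"), ("a", "c"), ("c", "c'"), ("c", "c''"), ("c'", "d'"), ("c''", "d''")]
b2 = [("e", "f"), ("f", "a"), ("c''", "a")]
base = {("b1", t) for t in b1} | {("b2", t) for t in b2}
seed = ("m_q_bf", ("a",))

db = evaluate(R1, base | {seed})
generated = db - base - {seed}          # facts derived during evaluation
answers = {args for pred, args in db if pred == "q_bf"}
subqueries_r = {args for pred, args in db if pred == "m_r_bb"}
```

Under these assumptions, `generated` can be compared directly with the fact counts quoted for R1 in the text; the seed fact is excluded from the count because it is given rather than derived.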
While computing the implicit state S_D of the deductive database D = ⟨F, R1⟩ bottom-up, the entire transitive closure from the starting point a for the relation p is computed as well as seven m_r^bb-facts representing all resulting subqueries to the r relation for the solutions found for p. The entire number of generated facts is 28, produced in order to find the only solution q^bf(a, c''). Since b2 is much smaller than b1 it is natural to ask whether another choice of SIP
strategy might lead to fewer facts in order to answer the query. Consider for example a SIP strategy where the relevant part of r is computed before the relevant part of p, leading to the magic rules R2. The implicit state S_D of the deductive database D = ⟨F, R2⟩ then consists of 31 generated facts (7 answer facts and 24 subqueries). Obviously, the intuitive change of the SIP strategy does not provide a better way to compute the answers to the query ?-q(a,Y), although a smaller overall number of 7 answer facts will be generated by a fixpoint iteration process. The problem lies in the generation of many magic facts representing subqueries which cannot be in the answer set of the p relation since the bound arguments are not in the domain and hence not in the transitive closure of b1. The 'best' full SIP strategy for this example is to evaluate all body literals from 'right-to-left', leading to the magic rules R3. Computing the implicit state S_D with D = ⟨F, R3⟩ would generate only 14 facts (7 answer facts and 7 subqueries). Note that even with this choice of SIP strategy subqueries for the p relation are generated which cannot lead to any answers since the bound arguments are not in the domain of the p relation.² Note that for this example it is better not to apply a 'full SIP strategy', e.g. by using only r^fb and p^fb, which would lead to 8 facts only during the materialization process.

The three rule sets above showed that a chosen SIP strategy has a considerable effect on the overall number of generated facts. It is not easy to decide whether a certain SIP strategy is better than another one without evaluating both. The first strategy basically follows the original definition given by the schema developer. Although the number of rules produced by the magic set transformation is the smallest of the three examples, it does not take the different sizes of the derived relations into account. In R2 we have tried to estimate the relation sizes by looking at the base relations. Since b2 is much smaller than b1 we used a SIP strategy which evaluates r before p. In spite of the very small number of answers for p and r, the computation of the complete transitive closure of b1 within the subqueries led to a very expensive query evaluation process for this example. The 'best' SIP strategy yielding R3, however, does not evaluate the body literals with the maximum number of bindings possible and yet leads to the smallest number of generated facts.
3 Changing SIP strategies dynamically

The question arises how to avoid choosing a 'bad' SIP strategy for query evaluation without knowing the 'best' one. In general it is not possible to know in advance the number of facts that will be generated for evaluating a given query, and hence it is not possible to choose the 'best' SIP strategy statically. However, one can get additional information about the number of relevant facts during a bottom-up or top-down query evaluation process. It seems promising to reoptimize a query evaluation process at run-time when the number of generated subqueries and answer facts has significantly changed and another proof process appears to be better. If one wants to change a chosen SIP strategy at run-time, there are several aspects which ought to be considered:
• It is only necessary to adorn magic literals. The adorned answer relations can lead to redundant storage of similar facts, e.g. p^bf(a, b) and p^bb(a, b), which moreover cannot be shared by several rule sets resulting from different SIP strategies.

• More general subqueries subsume subqueries which are less general, e.g. m_p^bf(a) subsumes m_p^bb(a, k) completely. That is, given a set of magic rules where the subqueries m_p^bf(a) and m_p^bb(a, k) have been generated during the evaluation process, all answers for the second subquery will also be generated by the first (more general) subquery alone. Hence, the subqueries m_p^bb(a, _) do not have to be evaluated within this rule set.³

• The use of different SIP strategies can only be valuable if the strategies correspond to different proof processes. This criterion is important to keep the number of SIP strategies to be considered small. Moreover, using different SIP strategies makes the sharing of subqueries and intermediate answers very important.

• The original rules should only be reoptimized after a considerable number of facts have been generated [6]. Thus, we need a selection function to determine at what time a change to another SIP strategy could be worthwhile.

² It is quite simple to integrate a domain check into the rules, but this kind of check is very limited and will be a side effect of the approach presented later.
Given a relation rel, an adornment ad and a SIP strategy si, let us denote the rules produced by the magic set transformation for this particular constellation by R^rel_{si,ad}. Consider again the sample rules R for the query ?-q(a, Y). The possible adornments for this example are given by ad = {ff, bf, fb, bb}. Since the rules have no more than two body literals we can restrict the set of possible SIP strategies to si = {lr, rl}, where lr denotes the 'left-to-right' SIP strategy and rl denotes the 'right-to-left' strategy. The set of derived relations in the example is given by rel = {q, p, r}. Thus, R^q_{rl,bf} specifies, for example, the following rule set:

q(X, Y) ← m_q^bf(X), r(Y, X), p(X, Y)
m_r^fb(X) ← m_q^bf(X).
m_p^bb(X, Y) ← m_q^bf(X), r(Y, X).
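How such a rule set arises from one rule, one head adornment and one body order can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: the function name `magic_transform` and the string encoding of adorned predicates are assumptions, all arguments are taken to be variables, the body is passed already in the evaluation order of the chosen full SIP strategy, and only magic literals are adorned.

```python
def magic_transform(head, body, adornment, derived):
    """Return the modified rule and the magic rules for one rule.

    head, body atoms: (predicate, tuple_of_variables)
    adornment:        string over {'b', 'f'} for the head arguments
    derived:          names of derived (intensional) relations
    """
    hpred, hargs = head
    bound = {v for v, a in zip(hargs, adornment) if a == "b"}
    magic_head = (f"m_{hpred}^{adornment}",
                  tuple(v for v, a in zip(hargs, adornment) if a == "b"))
    prefix = [magic_head]       # literals evaluated so far
    magic_rules = []
    for pred, args in body:
        if pred in derived:
            # A position is bound if its variable was bound by the head
            # or by a literal evaluated earlier.
            ad = "".join("b" if v in bound else "f" for v in args)
            magic_atom = (f"m_{pred}^{ad}", tuple(v for v in args if v in bound))
            magic_rules.append((magic_atom, list(prefix)))
        prefix.append((pred, args))
        bound |= set(args)
    return (head, prefix), magic_rules

# The q rule of the example, evaluated 'right-to-left' for the query adornment bf.
modified, magic = magic_transform(
    head=("q", ("X", "Y")),
    body=[("r", ("Y", "X")), ("p", ("X", "Y"))],   # r before p
    adornment="bf",
    derived={"p", "r"},
)
```

For the 'right-to-left' order this yields a subquery rule for r with adornment fb and one for p with adornment bb, which is the dependency between rule sets that the tree below records.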
Note that a tuple R'_i = ⟨R^q_{si',ad'}, R^p_{si'',ad''}, R^r_{si''',ad'''}⟩ represents a single proof process for the query ?-q(a, Y) with respect to the SIP strategy ⟨si', si'', si'''⟩. Moreover, there exist various dependencies between the rule sets for q, p and r with respect to the adornments, e.g., choosing R^q_{rl,bf} leads to an adornment bb for p, which is represented by the following tree:

m_q^bf(a)
├─ R^q_{lr,bf}
│   ├─ R^p_{lr,bf}
│   │   ├─ R^r_{lr,bb}   (1)
│   │   └─ R^r_{rl,bb}   (2)
│   └─ R^p_{rl,bf}
│       ├─ R^r_{lr,bb}   (3)
│       └─ R^r_{rl,bb}   (4)
└─ R^q_{rl,bf}
    ├─ R^r_{lr,fb}
    │   ├─ R^p_{lr,bb}   (5)
    │   └─ R^p_{rl,bb}   (6)
    └─ R^r_{rl,fb}
        ├─ R^p_{lr,bb}   (7)
        └─ R^p_{rl,bb}   (8)
Note that each path in this tree represents a single proof process. However, not all of the proof processes corresponding to paths in such a tree are really different or are really needed. Hence, first we have to reduce the tree such that these paths cannot be considered as alternative proof processes. This reduction phase will be explained in more detail in the next section.

³ See also [1] for a more detailed discussion of subsumption effects in a deductive database system.

In the example, R^p_{rl,bf} can be eliminated by reduction because R^p_{rl,bf} subsumes R^p_{lr,bf} completely and hence can only be worse with respect to the number of facts generated.⁴ Moreover, R^r_{lr,fb} can be eliminated because after reduction R^r_{lr,fb} and R^r_{rl,fb} represent the same proof process, i.e. both would generate the same facts after reduction, but since r is smaller than b2 throughout the query evaluation the SIP strategy rl is preferred.⁵ Thus, the paths leading to (3), (4), (5) and (6) can be deleted. Moreover, the reduction process relaxes the bb adornments to the adornment bf for the SIP strategy lr and the adornment fb for the SIP strategy rl. This kind of reduction is done in order to reduce the rule sets concerned to the essential part of the SIP strategy which has been applied. Thus, we obtain the following reduced tree:

m_q^bf(a)
├─ R^q_{lr,bf}
│   └─ R^p_{lr,bf}
│       ├─ R^r_{lr,bf}   (1)
│       └─ R^r_{rl,fb}   (2)
└─ R^q_{rl,bf}
    └─ R^r_{rl,fb}
        ├─ R^p_{lr,bf}   (7)
        └─ R^p_{rl,fb}   (8)
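The enumeration of paths and the pruning step can be sketched in a few lines. The encoding of a rule set as a triple (relation, SIP strategy, adornment) and the names `dependency_paths`, `reduced_paths` and `ELIMINATED` are assumptions made for illustration; the set of eliminated rule sets is simply copied from the discussion above rather than computed, and the subsequent relaxation of bb adornments is not modelled.

```python
from itertools import product

STRATEGIES = ("lr", "rl")

def dependency_paths():
    """Enumerate the eight paths of the dependency tree for ?-q(a, Y)."""
    paths = []
    for q_si in STRATEGIES:
        # Adornments passed on by the q rule for the query adornment bf:
        # lr evaluates p first (bf) and then r (bb);
        # rl evaluates r first (fb) and then p (bb).
        if q_si == "lr":
            (rel1, ad1), (rel2, ad2) = ("p", "bf"), ("r", "bb")
        else:
            (rel1, ad1), (rel2, ad2) = ("r", "fb"), ("p", "bb")
        for si1, si2 in product(STRATEGIES, repeat=2):
            paths.append((("q", q_si, "bf"), (rel1, si1, ad1), (rel2, si2, ad2)))
    return paths

# Rule sets removed by the reduction, as stated in the text.
ELIMINATED = {("p", "rl", "bf"), ("r", "lr", "fb")}

def reduced_paths(paths):
    """Keep the numbered paths that contain no eliminated rule set."""
    return {number: path
            for number, path in enumerate(paths, start=1)
            if not ELIMINATED & set(path)}

paths = dependency_paths()
remaining = reduced_paths(paths)
```

With the leaf numbering of the tree, `remaining` contains exactly the paths (1), (2), (7) and (8).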
We now present the dynamic phase of our method for dynamic query evaluation by using the previous example. The following discussion is supposed to give an intuitive idea of how a query evaluation process could work if the actual sizes of derived relations during a bottom-up fixpoint computation were taken into account. First, we need a selection function to decide which of the four proof processes represented by the reduced dependency tree

R1 = ⟨R^q_{lr,bf}, R^p_{lr,bf}, R^r_{lr,bf}⟩
R2 = ⟨R^q_{lr,bf}, R^p_{lr,bf}, R^r_{rl,fb}⟩
R7 = ⟨R^q_{rl,bf}, R^r_{rl,fb}, R^p_{lr,bf}⟩
R8 = ⟨R^q_{rl,bf}, R^r_{rl,fb}, R^p_{rl,fb}⟩

has to be taken for the next iteration round. Suppose we choose a very simple function depending on the exact number of facts, that is, we always choose the proof process with the smallest number of facts generated so far during evaluation. Note that we count the answer facts as well as the generated subqueries for a derived relation to get a better cost measure. Moreover, suppose that the selection function always chooses the rule set with the original order of body literals given by the schema developer (i.e., R1) in case there are several proof processes with the same 'smallest' number of generated facts.

At first, none of the derived relations contains any answers. Thus, the selection function can choose R1, R2, R7 or R8 for evaluation arbitrarily. However, following the original order of body literals, R1 is taken for computing the first relevant answers, generating the fact m_p^bf(a) in the first iteration round. Let us denote the number of facts with respect to a given proof process R_i by the triple R_i^n such that every rule set within R_i is replaced by the actual number of corresponding facts. Thus, the first iteration leads to the following sizes:

R1^n = ⟨1, 1, 0⟩   R2^n = ⟨1, 1, 0⟩   R7^n = ⟨1, 0, 1⟩   R8^n = ⟨1, 0, 0⟩

⁴ See also Example 4.2 in Section 4 for this particular case.
⁵ See also Example 4.3 in Section 4 for this particular case.
Obviously, R8 represents the 'smallest' proof process after the first iteration, and R8 is chosen for the next iteration round, generating the fact m_r^fb(a). This leads to the following sizes:

R1^n = ⟨1, 1, 0⟩   R2^n = ⟨1, 1, 1⟩   R7^n = ⟨1, 1, 1⟩   R8^n = ⟨1, 1, 0⟩

Now R1 is chosen for evaluation, generating the following facts in the next iteration: m_p^bf(b), m_p^bf(c), p(a, b), p(a, c). The resulting relation sizes are:

R1^n = ⟨1, 5, 0⟩   R2^n = ⟨1, 5, 1⟩   R7^n = ⟨1, 1, 5⟩   R8^n = ⟨1, 1, 2⟩

Note that we add the number of all answers for a relation rel to a corresponding rule set R^rel_{sip,ad}, because we left out the adornments for the answer relations and hence these answer facts can be applied in all these sets. The subqueries, however, are only considered if they have the same adornment as the corresponding rule set. Thus, only the two new answers of p are added to the third number of R8^n, which now represents the 'cheapest' proof process, and hence R8 is chosen for evaluation. In the following iteration rounds the selection function chooses R8 for the 5th iteration, R1 for the 6th iteration, and R8 for the 7th and the 8th iteration round. By then, the sizes of the proof processes have changed to

R1^n = ⟨2, 10, 3⟩   R2^n = ⟨2, 10, 4⟩   R7^n = ⟨2, 4, 10⟩   R8^n = ⟨2, 4, 8⟩

and R8 is chosen once more for evaluation within the 9th iteration round. This time, however, no other facts can be derived for this set; that is, one of the proof processes, here R8, has been evaluated completely. Hence we can conclude that all possible answers for r with respect to the query have been found as well as all necessary answers for p. Note that we have not evaluated all generated subqueries for p since the complete evaluation of the subquery m_r^fb(a) generated all possible values for Y in ?-q(a, Y). The overall number of generated facts is 18. Hence for this example the method presented is almost as good as the 'best' full SIP strategy. Moreover, since our process stopped while trying to generate new answers for R8, we know that this proof process generates the smallest number of facts. Hence, the selection function is supposed to use

R8 = ⟨R^q_{rl,bf}, R^r_{rl,fb}, R^p_{rl,fb}⟩

throughout the entire query evaluation the next time. Note that R8 represents the cheapest proof process which would lead to 8 facts only during query evaluation.
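The simple selection function used in this walkthrough can be written down directly. The sketch below assumes that the size of a proof process is the sum of its size triple and that ties are resolved in the fixed order R1, R2, R7, R8; the function name `select` and the dictionary encoding are illustrative only. The size triples are the ones quoted in the text.

```python
PREFERENCE = ("R1", "R2", "R7", "R8")   # R1 keeps the original body order

def select(sizes, preference=PREFERENCE):
    """Choose the proof process with the fewest facts generated so far.

    min() returns the first minimal element, so ties are resolved in
    favour of the process that comes first in `preference`.
    """
    return min(preference, key=lambda name: sum(sizes[name]))

after_round_1 = {"R1": (1, 1, 0), "R2": (1, 1, 0), "R7": (1, 0, 1), "R8": (1, 0, 0)}
after_round_2 = {"R1": (1, 1, 0), "R2": (1, 1, 1), "R7": (1, 1, 1), "R8": (1, 1, 0)}
after_round_3 = {"R1": (1, 5, 0), "R2": (1, 5, 1), "R7": (1, 1, 5), "R8": (1, 1, 2)}
after_round_8 = {"R1": (2, 10, 3), "R2": (2, 10, 4), "R7": (2, 4, 10), "R8": (2, 4, 8)}

choices = [select(s) for s in (after_round_1, after_round_2, after_round_3, after_round_8)]
```

The resulting choices are R8, R1, R8 and R8, matching the rounds described above; the second choice is a tie between R1 and R8 that falls to R1 by the preference order.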
4 Static analysis of SIP strategies

In this section the static phase of our method for determining relevant SIP strategies will be presented. Normally, there is an exponential number of full SIP strategies applicable to a given rule set. These strategies, however, do not always represent different proof processes, and it is useful to reduce the set of all possible SIP strategies to a subset of relevant strategies. We will use examples to show the main ideas behind the entire reduction process. The first example shows how two different proof processes can be distinguished.

Example 4.1: Consider the following rules

p(X, Y) ← q(Z, Y), b(X, Z)
q(Z, Y) ← e(Z, Y)
b(X, Z) ← d(X, Z)
e(Z, Y) ← ...
d(X, Z) ← ...
and the query ?-p(X, c). Note that we assume the relations e and d to be derived from some other relations in our system. We have two possible SIP strategies which evaluate the body literals of p from 'left-to-right' or from 'right-to-left' respectively. These strategies would lead to the following magic rules:

'left-to-right'

p(X, Y) ← m_p^fb(Y), q(Z, Y), b(X, Z)
q(Z, Y) ← m_q^fb(Y), e(Z, Y)
b(X, Z) ← m_b^fb(Z), d(X, Z)
m_q^fb(Y) ← m_p^fb(Y)
m_b^fb(Z) ← m_p^fb(Y), q(Z, Y)

'right-to-left'

p(X, Y) ← m_p^fb(Y), b(X, Z), q(Z, Y)
q(Z, Y) ← m_q^bb(Z, Y), e(Z, Y)
b(X, Z) ← m_b^ff, d(X, Z)
m_q^bb(Z, Y) ← m_p^fb(Y), b(X, Z)
m_b^ff ← m_p^fb(Y)
Thus, the two strategies lead to subqueries like S1 = {m_q^fb, m_b^fb} and to subqueries like S2 = {m_q^bb, m_b^ff}. Now we have to decide whether we want to consider these sets as a real alternative and hence allow these two strategies to be considered for reoptimization. Although the first one seems to be better, i.e. the input binding of the query is used optimally, there are also cases where the second proof process could be better. This holds for example when b contains significantly fewer facts than q and hence is better suited to restrict the number of facts generated for q by providing an additional binding for Z to the binding for Y already given by the query.⁶ More precisely, the subqueries m_q^fb subsume the subqueries m_q^bb and we conclude that the strategy represented by S2 could be better than the strategy represented by S1 with respect to the number of facts generated. On the other hand, the subqueries m_b^fb are subsumed by the subquery m_b^ff, which indicates that the strategy represented by S1 could be better than the strategy represented by S2. Thus, since we cannot conclude that either one is strictly better, we ought to consider both strategies for reoptimization.

⁶ Note that this is true in spite of the cross product within the rule defining m_q^bb, which makes no use of any input binding.

Note that the subsumption effects cannot be discovered by analyzing the sets of subqueries alone but require looking at the corresponding rules. That is, for proving that the subqueries m_q^bb are subsumed by the subqueries m_q^fb we have to show that the following condition holds:

∀c' ∃e' : m_q^bb(e', c') ⇒ m_q^fb(c')

For the given example it is quite simple to show that this condition is satisfied since there exists only one input binding Y=c for the subqueries of q. The general case, however, is an undecidable problem which corresponds to the containment problem [7] or to the problem whether two Datalog rule sets are equivalent [16]. A possible solution which works for a wide class of stratified deductive databases⁷ is given in [7], where the containment problem is reduced to a view update problem. The same technique can also be applied in our case by checking whether we can perform the view update request insert(h'(c1)) on the rules obtained by the two SIP strategies and the following rule:

h'(Y) ← m_q^bb(Z, Y), ¬m_q^fb(Y).

This time the view update request does not succeed (i.e. the fact h'(c1) is not derivable). Thus we know that for all possible states of the database the subqueries m_q^fb subsume the subqueries m_q^bb and we conclude that the two SIP strategies really lead to two different proof processes.
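For a single, already materialized database state, the rule for h' can be evaluated directly; every derivable h'-fact is a counterexample to the subsumption. The sketch below is only such a per-state check with hypothetical names (`subsumption_violations`, sets of argument tuples for the two magic relations). It does not decide the condition for all database states, which is what the view update technique of [7] addresses.

```python
def subsumption_violations(m_q_bb, m_q_fb):
    """Evaluate h'(Y) <- m_q^bb(Z, Y), not m_q^fb(Y) on one database state.

    m_q_bb: set of (Z, Y) tuples, m_q_fb: set of (Y,) tuples.
    Returns the Y values for which h'(Y) is derivable; an empty result
    means m_q^fb subsumes m_q^bb in this particular state.
    """
    return {y for (_z, y) in m_q_bb if (y,) not in m_q_fb}

# Example state for ?-p(X, c): every bb subquery carries the binding Y = c.
m_q_fb = {("c",)}
m_q_bb = {("k", "c"), ("l", "c")}
violations = subsumption_violations(m_q_bb, m_q_fb)
```

In this state no h'-fact is derivable, which is consistent with the argument that only the input binding Y=c occurs in the subqueries of q.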
Following the argumentation of Example 4.1 we do not always use the maximal number of input bindings possible and hence can avoid the computation of unnecessary subqueries as happens with the rule set R2. Using fewer input bindings, however, can lead to the most general subqueries, such as m_q^ff, if we treat the generated subqueries the same way as the original query. That is, we would lose the focus on the relevant part of the query, which is shown in the next example.

Example 4.2: Consider the rules for computing the transitive closure of b

p(X, Y) ← b(X, Y)
p(X, Y) ← b(X, Z), p(Z, Y)
and the query ?-p(a, Y). The two possible SIP strategies which evaluate the body literals of p from 'left-to-right' or from 'right-to-left' respectively lead to the following magic rules:

'left-to-right'

p(X, Y) ← m_p^bf(X), b(X, Y)
p(X, Y) ← m_p^bf(X), b(X, Z), p(Z, Y)
m_p^bf(Z) ← m_p^bf(X), b(X, Z)

'right-to-left'

p(X, Y) ← m_p^bf(X), b(X, Y)
p(X, Y) ← m_p^bf(X), p(Z, Y), b(X, Z)
m_p^ff ← m_p^bf(X)
p(X, Y) ← m_p^ff, b(X, Y)
p(X, Y) ← m_p^ff, ...
This time, the two strategies lead to S1 = {m_p^bf} and S2 = {m_p^bf, m_p^ff}. Before we compare these two sets we have to reduce the sets themselves. In S2 the subquery m_p^ff subsumes all subqueries m_p^bf. That is, the latter ones do not have to be evaluated in this rule set. Thus, choosing the SIP strategy which evaluates the literals from 'right-to-left' leads to a proof process like (?-p(X, Y) and check a in X). However, if we compare the two strategies using the sets S1 = {m_p^bf} and S2' = {m_p^ff} we see that the strategy represented by S1 can only be better than the strategy represented by S2'. Hence, we ought to consider the 'left-to-right' SIP strategy and discard 'right-to-left'. This result corresponds to our intuition, since it is better to use the given binding X=a as a starting point for generating all values for Y connected with a. Note that we can detect subsumption effects within the rule set by performing the view update request insert(h'') on the rules of the 'right-to-left' SIP strategy and the following rule: h'' ← m_p^bf(X), ¬m_p^ff.

⁷ The algorithms for view updating are applicable to all stratified databases but there are cases where they do not terminate.

Example 4.3: However, comparing several given SIP strategies is not always straightforward. Consider again the rules of Example 4.2 but with the query ?-p(X, c), leading to the following magic rules:
'left-to-right'

p(X, Y) ← m_p^fb(Y), b(X, Y)
p(X, Y) ← m_p^fb(Y), b(X, Z), p(Z, Y)
m_p^bb(Z, Y) ← m_p^fb(Y), b(X, Z)
p(X, Y) ← m_p^bb(X, Y), b(X, Y)
p(X, Y) ← m_p^bb(X, Y), ...

'right-to-left'

p(X, Y) ← m_p^fb(Y), b(X, Y)
p(X, Y) ← m_p^fb(Y), p(Z, Y), b(X, Z)
m_p^fb(Y) ← m_p^fb(Y)
This time we have S1 = {m_p^fb, m_p^bb} and S2 = {m_p^fb}. Following the previous argumentation we can reduce S1 to {m_p^fb}. Comparing the two sets after reduction indicates that both strategies represent a similar proof process and hence are supposed to generate the same facts during query evaluation. However, this holds only if we also reduce the rule set for the 'left-to-right' strategy to avoid the generation and not only the application of any m_p^bb facts. Reducing the corresponding rule set will then lead to the following magic rules:
'left-to-right' (reduced)

p(X, Y) ← m_p^fb(Y), b(X, Y)
p(X, Y) ← m_p^fb(Y), b(X, Z), p(Z, Y)

'right-to-left'

p(X, Y) ← m_p^fb(Y), b(X, Y)
p(X, Y) ← m_p^fb(Y), p(Z, Y), b(X, Z)
Note that the redundant subquery rules m_p^fb(Y) ← m_p^fb(Y) for both strategies are omitted. It is obvious now that both SIP strategies represent the same proof process, and we have avoided the cross product within the rule defining m_p^bb. Note that the reduction of the 'left-to-right' SIP strategy leads to a rule set which does not correspond to a full SIP strategy anymore. Moreover, although these two strategies lead to the same number of facts it is still useful to consider both rule sets and hence to allow the consideration of different relation sizes of p for determining the order of the body literals p and b. Note that this example is closely related to the previous Example 4.1 where the two SIP strategies, however, have provided two different proof processes in spite of the cross product. The next example shows that it is not sufficient to compare the subquery sets which are reduced in the described way, but that it is also necessary to check in advance whether all bindings of a subquery are really used.

Example 4.4: Consider again the rules for computing the transitive closure of b from Example 4.2 and the query ?-p(a, c). The two possible SIP strategies lead to the following magic rules:

'left-to-right'

p(X, Y) ← m_p^bb(X, Y), b(X, Y)
p(X, Y) ← m_p^bb(X, Y), b(X, Z), p(Z, Y)
m_p^bb(Z, Y) ← m_p^bb(X, Y), b(X, Z)

'right-to-left'

p(X, Y) ← m_p^bb(X, Y), b(X, Y)
p(X, Y) ← m_p^bb(X, Y), p(Z, Y), b(X, Z)
m_p^fb(Y) ← m_p^bb(X, Y)
p(X, Y) ← m_p^fb(Y), b(X, Y)
p(X, Y) ← m_p^fb(Y), ...
This time we have S1 = {m_p^bb} and S2 = {m_p^bb, m_p^fb}. Reducing these sets will lead to S1' = {m_p^bf} and S2' = {m_p^fb}. Note that the reduction of S1 now has to recognize that the second binding of m_p^bb is not used for the proof process and hence can be eliminated within the subqueries. Unused bindings within subqueries can be detected, for example, by using the envelopes transformation [18]. Comparing the two reduced sets shows that they represent two different proof processes and hence both ought to be considered for reoptimization. The corresponding reduced magic rules are:
'left-to-right'

p(X, Y) ← m_p^bf(X), in_p^fb(Y), b(X, Y)
p(X, Y) ← m_p^bf(X), b(X, Z), p(Z, Y)
m_p^bf(X) ← m_p^bb(X, Y)
in_p^fb(Y) ← m_p^bb(X, Y)
m_p^bf(Z) ← m_p^bf(X), b(X, Z)

'right-to-left'

p(X, Y) ← m_p^fb(Y), b(X, Y)
p(X, Y) ← m_p^fb(Y), p(Z, Y), b(X, Z)
m_p^fb(Y) ← m_p^bb(X, Y)
Note that now the rule set obtained for the 'left-to-right' strategy is equivalent to the corresponding envelope. Moreover, since it is easier to examine envelopes for subsumption effects than magic set transformed rules, it seems promising to use them throughout the entire reduction. The four examples have shown that the reduction process has to be carried out in three phases. First, we have to find all the bindings of the subqueries for a SIP strategy which are really used for the proof process. Subsequently, the unused bindings have to be deleted without losing information; that is, we have to assure that unused bindings within subqueries can still be used within the answer rules to restrict the number of answer facts. In the second phase, we have to discover subsumption effects within the magic rules obtained by applying a single SIP strategy. In the third phase, the possible rule sets are compared and sets which are completely subsumed by others have to be deleted such that only rule sets representing different proof processes are left for consideration.
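The second phase can be illustrated on the level of adornments alone. The sketch below, with the hypothetical helpers `subsumes` and `reduce_adornments`, treats one adornment as more general than another if every position it binds is also bound in the other. This is only a syntactic candidate test: as noted in Example 4.1, an actual subsumption has to be established on the rules themselves.

```python
def subsumes(general, specific):
    """True if every position bound in `general` is also bound in `specific`."""
    return all(g == "f" or s == "b" for g, s in zip(general, specific))

def reduce_adornments(adornments):
    """Drop adornments for which a different, more general one is present."""
    return {s for s in adornments
            if not any(g != s and subsumes(g, s) for g in adornments)}

example_4_2 = reduce_adornments({"bf", "ff"})   # subqueries of 'right-to-left'
example_4_3 = reduce_adornments({"fb", "bb"})   # subqueries of 'left-to-right'
unrelated = reduce_adornments({"bf", "fb"})     # neither is more general
```

For the subquery sets of Examples 4.2 and 4.3 this keeps ff and fb respectively, while two adornments that bind different positions are both retained.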
5 An evaluation scheme for dynamic query processing

Summarizing the results of the previous sections, we have to distinguish between a static analysis phase and a dynamic evaluation phase. During the static analysis phase the magic set transformation is applied to a given rule set for a set of possible SIP strategies and a set of possible adornments. Subsequently, the resulting sets of magic rules are reduced to a subset of magic rules such that each of them represents a distinct proof process. During the dynamic phase a selection function is used to determine successively which of the distinct proof processes is chosen for evaluation until all solutions have been found.

We now present a very simple analysis of how the number of facts generated by using this method can be determined. Consider again the rules of Example 4.2 for computing the transitive closure of b. Suppose the base relation b consists of the following facts:

F1 : b = {(a, d_i) | 1 ≤ i ≤ n} ∪ {(e, c)}

The query ?-p(a, c) must fail since there is no path leading from a to c. Choosing the 'left-to-right' SIP strategy, however, will generate O(n^2) facts in order to prove that the query cannot be answered. Changing the strategies during computation time, however, will allow the use of the 'right-to-left' strategy, which will stop the evaluation process already after the third iteration round, generating three facts only. Note that an integrated domain check within the rules could not discover that there is no solution to the query. Suppose now there is a fact p(a, c) which can be derived by the rules using the following facts:

F2 : b = {(a, d_i) | 1 ≤ i ≤ n} ∪ {(a, b), (b, c)} ∪ {(e_j, c) | 1 ≤ j ≤ m}

Obviously, there exists only one path p(a, c), which can be found after four iteration rounds by either proof process. Nevertheless, the two processes compute O(n^2) or O(m^2) facts, respectively.
Suppose that n is much smaller than m, so that the proof process using the 'left-to-right' strategy generates far fewer facts than the other strategy. Normally, we do not know in advance which SIP strategy is better with respect to the number of facts generated. However, changing the SIP strategy using a fair selection function, i.e., always choosing the rule set with the smallest number of generated facts at the time of reoptimization, will increase the evaluation costs by only a constant factor⁸ with respect to the 'best' SIP strategy. For the given example this leads to $2 \cdot O(n^2)$ facts, which is an improvement whenever the worse strategy, had it been chosen alone for query evaluation, would exceed this number. Note that other optimization techniques to improve the magic set method are still applicable and hence can be used to improve this dynamic approach as well. Applying, for example, the transformation for efficient evaluation of linear rules [14] to the two proof processes of this example will reduce the number of generated facts to $O(n)$ and $O(m)$, respectively. Thus, assuming that n is smaller than m, the dynamic approach would lead to $2 \cdot O(n)$ facts.

⁸ Note that this factor depends on the number of different proof processes considered and hence can be determined exactly.

Moreover, it is possible to reduce the number of facts generated by considering existential queries [17]. For example, for answering the query ?- p(a, c), it is sufficient to stop the query evaluation process after the first p(a, c)-fact has been derived successfully. This can easily be achieved, for example, by integrating an existential query check within the selection function of our dynamic approach. Thus, for the given example, the dynamic query evaluation process would generate no more than 5 facts if existential queries were taken into consideration. Note that the dynamic approach with integrated existential query check corresponds to a multidirectional search. Moreover, using an 'unfair' selection function which prefers certain proof processes to others during query evaluation corresponds to a weighted multidirectional search.
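The interplay of a fair selection function and an existential query check can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation scheme of this paper: each proof process is modelled by a hypothetical generator `proof_process` that performs one round of forward or backward reachability over b per step, in place of one iteration round over a magic rule set, and the names `evaluate` and `f2` are likewise made up for the example. The fact counts it produces therefore differ from those stated in the text.

```python
def proof_process(edges, start, forward):
    """Yield the set of newly derived facts of each iteration round.

    Hypothetical stand-in for one proof process (one magic rule set).
    """
    seen, frontier = set(), {start}
    while frontier:
        new = set()
        for x, y in edges:
            src, dst = (x, y) if forward else (y, x)
            if src in frontier and dst not in seen:
                new.add(dst)
        seen |= new
        frontier = new
        if new:
            yield new


def evaluate(edges, source, target):
    """Interleave two proof processes for the query ?- p(source, target)."""
    procs = {
        "left-to-right": (proof_process(edges, source, True), target),
        "right-to-left": (proof_process(edges, target, False), source),
    }
    counts = {name: 0 for name in procs}
    while True:
        # Fair selection: continue the process with the fewest facts so far.
        name = min(sorted(procs), key=counts.get)
        rounds, goal = procs[name]
        new = next(rounds, None)
        if new is None:
            # Each process is complete on its own, so a fixpoint without
            # an answer means that the query fails.
            return False, counts
        counts[name] += len(new)
        if goal in new:
            # Existential query check: stop after the first answer.
            return True, counts


def f2(n, m):
    """Base relation b for F2."""
    return ({("a", f"d{i}") for i in range(1, n + 1)}
            | {("a", "b"), ("b", "c")}
            | {(f"e{j}", "c") for j in range(1, m + 1)})


answer, counts = evaluate(f2(3, 50), "a", "c")
print(answer, counts)
```

Replacing the `min` over the fact counts by a weighted comparison would give an 'unfair' selection function in the sense used above.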
6 Discussion and further research

We have presented a dynamic approach to deductive query evaluation. This approach can be used for query evaluation when the best method for evaluating a given query is not known. Most of the optimization strategies presented in the literature rely on 'good' methods for estimating the sizes of derived relations [12]. The dynamic approach presented in this paper, however, uses the information about relation sizes during the query evaluation process itself and can thus avoid the static choice of a 'bad' fixed strategy for query evaluation. In this paper, the dynamic approach has been presented using a bottom-up query evaluation process. However, in [4] it has been shown that bottom-up query evaluation via magic sets is basically equivalent to top-down approaches like OLDT [20] and QSQ [22]. Moreover, although we considered only queries with a certain number of bound variables, it seems useful to apply this evaluation technique to more general queries without any input bindings as well. Another advantage of the method is that the number of facts generated during query processing can be determined with respect to the 'cheapest' proof process. In addition, this proof process can be identified after dynamic query evaluation and hence can be used statically for the next evaluation of the same query, unless considerable changes take place. As mentioned in section 4, our dynamic approach is independent of other optimization strategies for improving magic sets. Thus, these optimizations can also be applied to the rule sets of our dynamic approach as long as they are guaranteed to be complete. Moreover, it is simple to extend our method to implement an evaluation process which can be regarded as a weighted multidirectional search. A further research topic could be the implementation of a query evaluation process which corresponds to the A* search algorithm and hence is known to be optimal.
Note that our method can also be applied to rules with stratified negation by restricting the set of possible SIP strategies to a set of 'allowed' SIP strategies. These strategies ensure that negative body literals are always evaluated fully bound and hence cannot lead to different proof processes. The general problem of losing stratification after the magic set transformation with an allowed SIP strategy has been applied remains, however, and can be solved, e.g., by using weak stratification [11] or the alternating fixpoint [21]. In [8, 13] it has been shown that the magic set method can improve the performance of nonrecursive queries as well, and hence it seems worthwhile to apply the dynamic approach to nonrecursive systems, too. This is especially true if the evaluation costs of body literals differ considerably and the evaluation of the query and its subqueries needs more than one iteration round. One disadvantage of our method is that, despite the reduction, an exponential number of different proof processes will be considered. In general, the set of possible SIP strategies ought to be reduced by applying only a subset of these strategies leading to proof processes which seem to be more relevant. A criterion for this reduction could be the 'distinctiveness' of different proof processes which, for example, could be determined by comparing the number of different adornments within these rule sets. Moreover, the dynamic approach needs a very expensive analysis phase which, however, can be carried out in advance at compile time.
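One possible reading of the 'distinctiveness' criterion can be sketched as follows. The measure used here, the number of adorned predicates occurring in exactly one of two rule sets, is an assumption made for illustration, and the rule set names and adornments are hypothetical; the text above does not fix a particular definition.

```python
from itertools import combinations


def distinctiveness(adornments_a, adornments_b):
    """Number of adorned predicates occurring in exactly one of the two rule sets."""
    return len(set(adornments_a) ^ set(adornments_b))


# Hypothetical rule sets, each described only by its adorned predicates.
rule_sets = {
    "R1": {"p_bb", "p_bf"},
    "R2": {"p_bb", "p_fb"},
    "R3": {"p_bb", "p_bf", "p_fb"},
}

# Pairwise scores; the most distinct pair is a candidate for dynamic evaluation.
scores = {
    (x, y): distinctiveness(rule_sets[x], rule_sets[y])
    for x, y in combinations(sorted(rule_sets), 2)
}
most_distinct = max(scores, key=scores.get)
print(scores, most_distinct)
```

A reduction step could then keep only the rule sets that belong to the highest-scoring pairs and discard the remaining ones before the dynamic phase starts.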
References

[1] Azevedo, P. J.: Magic Sets with Full Sharing. JLP 30(3): 223-237, 1997.
[2] Bancilhon, F., Maier, D., Sagiv, Y., Ullman, J. D.: Magic Sets and Other Strange Ways to Implement Logic Programs. PODS 1986: 1-16.
[3] Beeri, C., Ramakrishnan, R.: On the Power of Magic. JLP 10: 255-299, 1991.
[4] Bry, F.: Query Evaluation in Recursive Databases: Bottom-Up and Top-Down Reconciled. DKE 5: 289-312, 1990.
[5] Cole, R. L., Graefe, G.: Optimization of Dynamic Query Evaluation Plans. SIGMOD Conference 1994: 150-160.
[6] Derr, M. A.: Adaptive Query Optimization in a Deductive Database System. CIKM 1993, Washington, DC, USA.
[7] Farre, C., Teniente, E., Urpi, T.: Query Containment Checking as a View Updating Problem. DEXA 1998: 310-321.
[8] Gupta, A., Mumick, I. S.: Magic-sets Transformation in Nonrecursive Systems. PODS 1992: 354-367.
[9] Han, J.: Selection of Processing Strategies for Different Recursive Queries. JCDKB 1988: 59-68.
[10] Haddad, R. W., Naughton, J. F.: Counting Methods for Cyclic Relations. PODS 1988: 333-340.
[11] Kerisit, J.-M., Pugin, J.-M.: Efficient Query Answering on Stratified Databases. FGCS 1988: 719-726.
[12] König, A. C., Weikum, G.: Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-size Estimation. VLDB 1999: 423-434.
[13] Mumick, I. S., Finkelstein, S. J., Pirahesh, H., Ramakrishnan, R.: Magic is Relevant. SIGMOD Conference 1990: 247-258.
[14] Naughton, J. F., Ramakrishnan, R., Sagiv, Y., Ullman, J. D.: Efficient Evaluation of Right-, Left-, and Multi-Linear Rules. SIGMOD Conference 1989: 235-242.
[15] Naughton, J. F., Ramakrishnan, R., Sagiv, Y., Ullman, J. D.: Argument Reduction by Factoring. TCS 146(1&2): 269-310, 1995.
[16] Shmueli, O.: Equivalence of DATALOG Queries is Undecidable. JLP 15(3): 231-241, 1993.
[17] Ramakrishnan, R., Beeri, C., Krishnamurthy, R.: Optimizing Existential Datalog Queries. PODS 1988: 89-102.
[18] Sagiv, Y.: Is There Anything Better than Magic? NACLP 1990: 235-254.
[19] Sippu, S., Soisalon-Soininen, E.: Multiple SIP Strategies and Bottom-Up Adorning in Logic Query Optimization. ICDT 1990: 485-498.
[20] Tamaki, H., Sato, T.: OLD Resolution with Tabulation. ICLP 1986: 84-98.
[21] Van Gelder, A.: The Alternating Fixpoint of Logic Programs with Negation. PODS 1989: 1-10.
[22] Vieille, L.: From QSQ towards QoSaQ: Global Optimization of Recursive Queries. Expert Database Conf. 1988: 743-778.