Appears in: ICDE'95.
Pushing Semantics inside Recursion: A General Framework for Semantic Optimization of Recursive Queries

Laks V.S. Lakshmanan
Dept. of Comp. Sci., Concordia University, Montreal, Canada
[email protected]

Rokia Missaoui
Dep. d'Informatique, Univ. du Quebec a Montreal, Montreal, Canada
[email protected]

This research was supported by grants held by both authors from NSERC (Canada) and FCAR (Quebec).

Abstract
We consider a class of linear query programs and integrity constraints and develop methods for (i) computing the residues and (ii) pushing them inside the recursive programs, minimizing redundant computation and run-time overhead. We also discuss applications of our strategy to intelligent query answering.

1 Introduction
Several approaches have been proposed for semantic query optimization; the reader is referred to Chakravarthy et al. [3] for a detailed survey of work on relational as well as deductive databases. Deductive databases have been recognized as an important data model and a platform for next-generation applications, and the optimization of recursive queries is a very important problem for them (see [1, 2, 12] for surveys). Chakravarthy et al. [3] laid down the foundations for semantic query optimization in non-recursive deductive databases. They proposed compiling queries w.r.t. the integrity constraints (ICs) and extracting "residues", which are imposed on the queries so as to minimize useless computation during query processing. Lee et al. [10] proposed a way of determining ICs for semantically reformulating queries involving joins/unions of predicates, for non-recursive databases.

There are essentially two approaches for recursive queries. The first approach, based on the evaluation paradigm, assumes that queries will be evaluated bottom-up. Since the various subqueries computed in an iteration of the bottom-up evaluation loop are non-recursive, this approach essentially applies the residues to the subqueries being computed in each iteration of the bottom-up evaluation. Chakravarthy et al. [3] and Lee and Han [9] follow this approach. The former extends the residue technique to all recursive (and other) queries, while the latter specializes the technique to a subclass of linear recursive queries.

The second approach, which we proposed in previous works, is program transformation for semantic optimization of recursive queries. One of the early works based on the program transformation paradigm is Sagiv [13], which develops an exponential time procedure for eliminating redundant atoms and rules in programs in the presence of tuple generating dependencies. Lakshmanan and Hernandez [6] studied the problem of detecting and eliminating redundant subgoal occurrences in proof trees generated by programs in the presence of functional dependencies, and gave a syntactic characterization of when the number of times a subgoal is needed can be bounded, for a class of single linear recursive rule programs. They also developed a polynomial time program transformation algorithm for eliminating the redundant subgoal occurrences. Lakshmanan and Missaoui [7] developed sufficiency criteria and algorithms for detecting redundant atoms and rules in a class of linear recursive programs. In comparison with the evaluation approach, the major advantages of the program transformation approach are that (i) it is independent of the particular query (or binding pattern) being processed, and (ii) since the optimizing transformation is carried out in one shot at compile time, it does not incur any run-time overhead.

The central idea behind the program transformation approach is as follows. A recursive query can be thought of as an infinite union of conjunctive queries
associated with the proof trees generated by the query program. In the context of given ICs, some conjunctive queries could have redundancies. The goal is to remove the redundancies in these conjunctive queries. However, since there are infinitely many conjunctive queries in the union, we want to achieve the effect of removing redundancies from the conjunctive queries by transforming the original query program so that the proof trees generated by it are free from these redundancies. Thus, there are two inherently challenging problems in trying to (semantically) optimize recursive query programs by transformation. (1) Typically, the residues of ICs need to be determined w.r.t. the proof trees generated by a program, rather than just the rules of the program. (2) The optimization induced by the residues w.r.t. the proof trees must be "pushed" into the program so that the program will only generate the optimized proof trees. In this paper, we first develop an algorithm for generating residues of ICs for programs containing recursive rules. We then discuss program transformation techniques for pushing the residues inside recursion, corresponding to different types of optimization.

For the basic definitions on deductive databases, the reader is referred to Ullman [14] and Ceri et al. [2]. In particular, we assume the reader is familiar with such notions as edb (extensional database) and idb (intensional database) predicates, recursion (both linear and non-linear), etc. We use a Prolog-like notation for program rules and express integrity constraints (ICs) as implication statements. Built-in predicates like X > Y, X > 100 are called evaluable predicates, while all others are called database predicates. Two programs are semantically equivalent (w.r.t. given constraints I) provided that, on all databases (corresponding to the edb) that satisfy I, the sets of idb relations computed by them are identical.
A rule (or an IC) is connected if any two subgoals in its body either share a variable or are both connected to a common subgoal. A rule is range restricted if every variable appearing in the head appears in the body.¹ We make the following assumptions in this paper. (1) All program rules are range restricted. This is a customary assumption in the literature, which ensures that the (idb) relations computed by the rules are finite. (2) All rules and ICs are connected. Most real-life examples satisfy this condition. (3) Only linear recursive programs with no mutual recursion are considered. Again, most real-life examples fall into this category [14]. (4) We only consider ICs involving edb relations (and evaluable predicates). This still covers a rich class of ICs.

¹ Note the reversal of head and body in our notation for ICs.
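Assumptions (1) and (2) are easy to check mechanically. The following is a minimal Python sketch, not part of the original paper. It assumes atoms are encoded as (predicate, arguments) pairs, that variables start with an uppercase letter (the Prolog convention), and that "connected" is read as connectivity of the graph whose nodes are subgoals and whose edges link subgoals sharing a variable. The names `is_range_restricted` and `is_connected` are hypothetical.

```python
from typing import List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate name, argument tuple); encoding is an assumption


def is_var(term: str) -> bool:
    # Prolog convention: variables start with an uppercase letter.
    return term[:1].isupper()


def variables(atom: Atom) -> Set[str]:
    return {t for t in atom[1] if is_var(t)}


def is_range_restricted(head: Atom, body: List[Atom]) -> bool:
    # Every head variable must also occur somewhere in the body.
    body_vars = set().union(*(variables(a) for a in body))
    return variables(head) <= body_vars


def is_connected(body: List[Atom]) -> bool:
    # Depth-first search over the "shares a variable" graph on subgoals.
    if not body:
        return True
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(body)):
            if j not in seen and variables(body[i]) & variables(body[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(body)


# A linear recursive rule in the style used later in the paper (P1 stands for P').
R1_HEAD: Atom = ("eval", ("P", "S", "T"))
R1_BODY: List[Atom] = [
    ("works_with", ("P", "P1")),
    ("eval", ("P1", "S", "T")),
    ("expert", ("P", "F")),
    ("field", ("T", "F")),
]
```

Both checks succeed on this rule, whereas a rule such as p(X, Y) :- a(X) fails the range restriction test and a body such as a(X), b(Y) fails the connectivity test.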
2 Background
We shall briefly recall the fundamental notions from the framework for semantic optimization proposed by Chakravarthy et al. [3]. The central notion is that of a residue of an IC w.r.t. a query. A residue is essentially a portion of an IC and is a condition that must be satisfied whenever the body of the program rule to which it is attached produces a tuple. This idea is made precise as follows. A clause C subsumes another clause D if there is a mapping θ (called the subsuming substitution) from the variables of C to the arguments of D such that θ(C) is a subclause of D. We say C partially subsumes D if a subclause of C subsumes D. An IC is in expanded form provided no constant appears among the arguments of any database predicate in its body, and each such argument contains a distinct variable. As shown in [3], every IC can be converted to expanded form. An IC partially subsumes the body of a rule if its expanded form does. In this case, the residue of the subsumption is the part of the (expanded) IC that did not participate in the subsumption. Chakravarthy et al. [3] give an algorithm for deciding partial subsumption and for computing residues. For simplicity, we say an IC (partially) subsumes a rule to mean it (partially) subsumes the body of the rule.

In this paper, we shall be interested in "free" (partial) subsumption and free residues. The intuition behind free subsumption can be understood as follows. In the approach proposed by Chakravarthy et al. [3], both the clauses involved in the subsumption test are expanded by replacing shared variables and constants by new distinct variables and making the underlying constraints explicit by introducing equalities. In this paper, we consider the utility of residues obtained from (partial) subsumption among clauses as they appear (i.e., without expanding them). This gives rise to the notion of free residues, since no constraints are introduced.
We shall show that free residues are quite useful in (semantically) optimizing recursive queries via program transformation.

Definition 2.1. An IC ic is said to freely (partially) subsume a rule r provided there is a subclause of ic (not the expanded version) that subsumes r in the above sense. Let ic freely (partially) subsume r via a substitution θ. Then the free residue arising from this is the part of θ(ic) that did not participate in the (free) subsumption above. □
We shall use the term freely subsume even when the subsumption is partial (but free), for convenience. The difference between free residues and the usual residues is illustrated next.

Example 2.1. Consider the following program.
r0: p(X1, ..., X6) :- a(X1, X2, X4), b(X2', X3), c(X3', X4', X5), d(X5', X6), p(X1, X2', X3', X4', X5', X6').
r1: p(X1, ..., X6) :- e(X1, ..., X6).
ic: a(V1, V2, V3), b(V2, V4), c(V4, V5, V6) → d(V6, V7).
The expanded form of ic is
ice: a(V1, V2, V3), b(V8, V4), c(V9, V5, V6), V8 = V2, V9 = V4 → d(V6, V7).
It can be verified that ice, and hence ic, partially subsumes r0, giving rise to the residue X2' = X2, X3' = X3 → d(X5, X6), after simplification. Notice that ic also freely subsumes r0. Depending on the substitution used for the free subsumption, the residue generated can be b(X2, X3') → d(X5, V7), or a(V1, X2', V3), c(X3, V5, V6) → d(V6, V7). □

In [3], Chakravarthy et al. showed that, depending on their type, residues can lead to optimization in five fundamental ways: (i) join elimination; (ii) scan reduction by introduction of evaluable predicates; (iii) introduction of small relations in the context of joining large relations; (iv) detecting that certain (sub)queries have only one answer and stopping the computation after one answer is found; (v) detecting when queries have no answer by virtue of given ICs.

We usually use the letter p to denote the recursive predicate in the program. The next several notions refer to the following program and are needed in later sections.
r0: p(X1, ..., Xn) :- b1(U1, ..., Ul), ..., bs(V1, ..., Vt), p(S1, ..., Sn).
r1: p(X1, ..., Xn) :- a1(Y1, ..., Yk), ..., am(Z1, ..., Zl), p(W1, ..., Wn).
r2: p(X1, ..., Xn) :- e(X1, ..., Xn).
The variables Xi which appear in the head of rules are called output variables. The variables that appear only in the body of rules are called local variables.
Variables occurring in the arguments of the recursive predicate in a rule body (e.g., Si, Wj) are recursive variables. We assume that the rules are rectified so that all rules defining the same predicate have an identical head, and Xi appears in column i of the head predicate [14]. This assumption is not restrictive since it is well known that all programs can be rectified. We shall denote by p_i, 1 ≤ i ≤ n, the ith argument position of p in the body of a recursive rule.

A proof tree for a recursive predicate p defined by a program P is a tree T with root p(X1, ..., Xn), such that an internal node with p(A1, ..., An) as its label has children corresponding to the subgoals in some rule for p, after this rule's head is unified with p(A1, ..., An). An expansion sequence is a sequence of program rules. Expansion sequences are in 1-1 correspondence with proof trees for linear programs, since they indicate the order in which rules were applied top-down in generating tuples for the idb predicate p. E.g., the sequence r0 r1 r0 represents the proof tree in which p is successively expanded top-down using the rules r0, r1, r0, in that order. The notion of residues (free or otherwise) can be associated with expansion sequences in the obvious manner.

We remark that the same algorithm given in [3] for partial subsumption and residue generation can be used for testing free subsumption and generating free residues, with minor modifications. In the sequel, when we refer to subsumption (residue), we mean free subsumption (residue), unless otherwise stated.
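Free subsumption and free residues (Definition 2.1) can be illustrated with a small backtracking matcher. This is a minimal Python sketch and not the algorithm of [3]. It assumes that every argument of the IC is a variable (as in Example 2.1), that atoms are (predicate, arguments) pairs, and that the caller chooses which subclause of the IC body to match. The function names are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]
Subst = Dict[str, str]


def match_atom(pattern: Atom, target: Atom, theta: Subst) -> Optional[Subst]:
    """Extend theta so that theta(pattern) == target, or return None."""
    if pattern[0] != target[0] or len(pattern[1]) != len(target[1]):
        return None
    out = dict(theta)
    for p, t in zip(pattern[1], target[1]):
        if out.setdefault(p, t) != t:  # p already bound to a different term
            return None
    return out


def subsumptions(sub: List[Atom], body: List[Atom],
                 theta: Optional[Subst] = None) -> Iterator[Subst]:
    """Yield every theta mapping all atoms of `sub` onto atoms of `body`."""
    theta = {} if theta is None else theta
    if not sub:
        yield theta
        return
    for target in body:
        ext = match_atom(sub[0], target, theta)
        if ext is not None:
            yield from subsumptions(sub[1:], body, ext)


def apply(theta: Subst, atom: Atom) -> Atom:
    return (atom[0], tuple(theta.get(t, t) for t in atom[1]))


def free_residue(ic_body: List[Atom], ic_head: Optional[Atom],
                 used: Set[int], theta: Subst):
    """The part of theta(ic) that did not take part in the subsumption."""
    rest = [apply(theta, a) for i, a in enumerate(ic_body) if i not in used]
    head = apply(theta, ic_head) if ic_head is not None else None
    return rest, head


# Example 2.1: body of r0 (without the recursive subgoal) and the IC ic.
R0_BODY: List[Atom] = [("a", ("X1", "X2", "X4")), ("b", ("X2'", "X3")),
                       ("c", ("X3'", "X4'", "X5")), ("d", ("X5'", "X6"))]
IC_BODY: List[Atom] = [("a", ("V1", "V2", "V3")), ("b", ("V2", "V4")),
                       ("c", ("V4", "V5", "V6"))]
IC_HEAD: Atom = ("d", ("V6", "V7"))

theta_ac = next(subsumptions([IC_BODY[0], IC_BODY[2]], R0_BODY))
theta_b = next(subsumptions([IC_BODY[1]], R0_BODY))
```

Matching the subclause {a, c} yields the free residue b(X2, X3') → d(X5, V7), and matching {b} yields a(V1, X2', V3), c(X3, V5, V6) → d(V6, V7), the two residues reported in Example 2.1.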
3 Generating Residues
In this section, we address the following question: given a (linear) program and an IC, how do we extract the residues of the IC w.r.t. the program that will most benefit the optimization? We assume that each IC is of the form D1, ..., Dk, E1, ..., Em → A, k ≥ 1, m ≥ 0, where (i) the Di's are database predicates, the Ej's are evaluable predicates, and A (possibly absent) is either type of predicate, and (ii) D_i shares one or more variables with D_{i-1} and D_{i+1} and with no other predicate, 1 < i < k. This still covers a wide class of ICs of practical interest.

Let P be a program and ic be an IC of the above form. The idea behind generating residues [8] is to determine if the IC (partially) subsumes any rule or any expansion sequence. In general, an IC may well (partially) subsume more than one expansion sequence, where the subsumption is free in the sense of Section 2. Intuitively, the IC can be exploited most effectively if the extent of the subsumption is in some sense "maximal".

Definition 3.1. Let ic be an IC of the above form and let s be some expansion sequence from a program P. Then ic is said to maximally subsume the expansion sequence s provided the subclause of ic consisting of all the database subgoals in the body of ic freely subsumes s completely. □

The intuition behind maximal subsumption is that when an IC maximally subsumes a rule/sequence in the above sense, the corresponding residue will only contain evaluable predicates (if any) in its body. The significance of this is as follows. Recall that our goal is to perform optimization by program transformation
in a query independent manner. A residue with database subgoals in its body calls for anticipating such subgoals as part of a specific query. In other words, such residues cannot contribute to a direct optimization of the program, although they can suggest optimization of specific queries based on the program. Thus residues coming from non-maximal subsumption are not appropriate for query independent optimization.

Example 3.1. Consider the program and IC of Example 2.1. The IC ic in that example partially subsumes the expansion sequences r0, r0 r0, and r0 r0 r0. Only the sequence r0 r0 r0 is maximally subsumed by this IC. To see how, express r0 r0 r0 as
p(X1, ..., X6) :- a(X1, X2, X4), b(X2', X3), c(X3', X4', X5), d(X5', X6),
a(X1, X2', X4'), b(X2'', X3'), c(X3'', X4'', X5'), d(X5'', X6'),
a(X1, X2'', X4''), b(X2''', X3''), c(X3''', X4''', X5''), d(X5''', X6''),
a(X1, X2''', X4'''), b(X2'''', X3'''), c(X3'''', X4'''', X5'''), d(X5'''', X6'''),
p(X1, X2'''', X3'''', X4'''', X5'''', X6'''').
Then using the substitution θ = {V1/X1, V2/X2''', V3/X4''', V4/X3'', V5/X4'', V6/X5'} we see that the IC maximally subsumes this sequence. This yields the residue → d(X5', V7). Now, θ can be extended to a substitution θ' which maps V7 to X6, so that we get the variant → d(X5', X6) of this residue.

We now address the question of how to detect sequences that are maximally subsumed by a given IC. The exhaustive approach of enumerating all possible sequences of a certain length and then testing each of them for maximal subsumption is unattractive and inefficient. In the following we give an efficient procedure for detecting these sequences directly. Let the IC be D1, ..., Dk, E1, ..., Em → A. The test consists of two phases: (i) test if there is any expansion sequence that has the same connectivity pattern among the subgoals Di as in the IC; (ii) for such a sequence, test if the IC does subsume that sequence maximally, and if so generate the residue.
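The expansion displayed in Example 3.1 and the maximal subsumption test of Definition 3.1 can be reproduced with a short Python sketch. This is an illustration under assumptions and not the efficient procedure developed in the paper: it handles a single rectified linear rule, renames local variables by appending primes (safe for this example, but not collision-free in general), and tests subsumption by brute-force backtracking. The function names are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...]]
Subst = Dict[str, str]


def unfold(head: Atom, body: List[Atom], copies: int) -> List[Atom]:
    """Body of the rule unfolded into `copies` instances, plus the final recursive subgoal."""
    rec_pred, head_vars = head
    rec_atom = next(a for a in body if a[0] == rec_pred)
    current = head_vars  # terms the next instance's head arguments are unified with
    out: List[Atom] = []
    for level in range(copies):
        def rename(t: str) -> str:
            if t in head_vars:          # output variable: take the recursive argument
                return current[head_vars.index(t)]
            return t + "'" * level      # local variable: add `level` primes (assumption)
        out += [(p, tuple(rename(t) for t in args)) for p, args in body if p != rec_pred]
        current = tuple(rename(t) for t in rec_atom[1])
    out.append((rec_pred, current))
    return out


def match_atom(pattern: Atom, target: Atom, theta: Subst) -> Optional[Subst]:
    if pattern[0] != target[0] or len(pattern[1]) != len(target[1]):
        return None
    out = dict(theta)
    for p, t in zip(pattern[1], target[1]):
        if out.setdefault(p, t) != t:
            return None
    return out


def subsumptions(sub: List[Atom], body: List[Atom],
                 theta: Optional[Subst] = None) -> Iterator[Subst]:
    theta = {} if theta is None else theta
    if not sub:
        yield theta
        return
    for target in body:
        ext = match_atom(sub[0], target, theta)
        if ext is not None:
            yield from subsumptions(sub[1:], body, ext)


def apply(theta: Subst, atom: Atom) -> Atom:
    return (atom[0], tuple(theta.get(t, t) for t in atom[1]))


R0_HEAD: Atom = ("p", ("X1", "X2", "X3", "X4", "X5", "X6"))
R0_BODY: List[Atom] = [("a", ("X1", "X2", "X4")), ("b", ("X2'", "X3")),
                       ("c", ("X3'", "X4'", "X5")), ("d", ("X5'", "X6")),
                       ("p", ("X1", "X2'", "X3'", "X4'", "X5'", "X6'"))]
IC_DB: List[Atom] = [("a", ("V1", "V2", "V3")), ("b", ("V2", "V4")),
                     ("c", ("V4", "V5", "V6"))]
IC_HEAD: Atom = ("d", ("V6", "V7"))

SEQ = unfold(R0_HEAD, R0_BODY, 4)  # the four body instances displayed in Example 3.1
PAPER_THETA: Subst = {"V1": "X1", "V2": "X2'''", "V3": "X4'''",
                      "V4": "X3''", "V5": "X4''", "V6": "X5'"}
```

All database subgoals of ic are matched under the substitution given in Example 3.1, so the residue body is empty and the head becomes d(X5', V7); extending the substitution with V7/X6 makes the head coincide with the subgoal d(X5', X6) of the expansion.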
We need certain auxiliary notions. We assume that the subgoals appearing in the same as well as different rules of a program are distinct.

Definition 3.2. An argument/predicate graph (AP-graph) associated with a program P = {r1, ..., rl} is a labeled graph G consisting of both directed and undirected edges. The vertices of G consist of the edb subgoals a, b, ... appearing in the program rules, argument positions p_j of the recursive predicate, and argument positions d_i of a dummy subgoal. The edges of G are defined as follows. There is an undirected edge (a, p_k) with label < , j > whenever the jth argument of a shares a variable with p_k in the body of some rule; a directed edge <p_i, a> with label <r, j> if a appears in rule r and has the output variable X_i at position j; a directed edge <p_i, p_j> with label <r, > if the output variable X_i corresponds to p_j in the body of rule r. A (directed) path in an AP-graph is a sequence of nodes <v_1, ..., v_k> such that there exists a directed or undirected edge from v_i to v_{i+1}, i = 1, ..., k − 1. The label associated with a directed path <v_1, ..., v_k> is the sequence of labels attached to the (directed) edges in the path, where the order of the sequence corresponds to the direction of the path. □

When two subgoals a and b share a variable within the same rule without sharing that variable with the recursive predicate, we make use of an argument position d_i of a dummy subgoal d, and add two undirected edges (a, d_i) and (b, d_i) to the AP-graph.

From the AP-graph of a program P, we can construct a subgoal dependency graph (SD-graph), in which the nodes correspond to the edb predicates of P, and the edges correspond to expansion sequences of P, as follows. There is an undirected edge (a, b) in the SD-graph when there is an undirected path between a and b in the AP-graph. There is a directed edge <a, b> with label <exp, {(i_1, j_1), ..., (i_k, j_k)}> in the SD-graph exactly when there exists a directed path from a to b in the AP-graph of P such that exp is an expansion sequence (a sequence of rules) appearing in that path and {(i_1, j_1), ..., (i_k, j_k)} is a set of argument positions for shared variables, indicating that the i_l-th argument of a is identical to the j_l-th argument of b, for l = 1, ..., k.

Let D1, ..., Dk, E1, ..., Em → A be an IC such that each D_i shares variables with exactly D_{i-1} and D_{i+1}. We can associate a pattern graph with this IC, which is the (undirected) path graph whose nodes are the edb subgoals D1, ..., Dk, and in which, for i = 1, ..., k − 1, there is an edge (D_i, D_{i+1}) with a label {(i_1, j_1), ..., (i_m, j_m)} indicating the set of pairs of argument positions containing shared variables. The following lemma yields a way to test maximal subsumption using SD-graphs and pattern graphs.

Lemma 3.1. Let G1 be the SD-graph associated with a program, ic: D1, ..., Dk, E1, ..., Em → A be an IC, and G2 be the pattern graph associated with ic. Then ic maximally subsumes some expansion sequence s of the program iff the following conditions are satisfied. (i) G1 contains a subgraph G which corresponds either to a directed path from D1 to Dk or to a directed path from Dk to D1, where the order of the intermediate nodes is preserved. (ii) The label of each edge in the pattern graph G2 is a subset of the second component of the label of the corresponding edge in G.

We next present the algorithm for generating residues.
Algorithm 3.1.
Input: A program P = {r1, ..., rl} and an IC, ic: D1, ..., Dk, E1, ..., Em → A.
Output: The residue R generated from the IC ic and an expansion sequence s of P via maximal subsumption, together with this sequence.
begin
Step (1) Construct the AP-graph and then the SD-graph G1 of P; construct the pattern graph G2 associated with ic.
Step (2) Test if G1 has a subgraph isomorphic to G2; if not, return("IC does not maximally subsume any expansion sequence");
/** If there are isomorphic subgraphs in G1, there can be at most two: one corresponding to the direction <D1, ..., Dk> and another corresponding to the direction <Dk, ..., D1>. **/
Step (3) Let G be the directed path in G1 starting with the subgoal node D1 (resp. Dk) that is isomorphic to G2;
Step (4) Use Lemma 3.1 and test if the label of each edge in G2 is a subset of the second component of the label of the corresponding edge in G; if it is not, then return("IC does not maximally subsume any expansion sequence"); otherwise, generate the expansion sequence r L, where r is the rule in which the atom D1 (resp. Dk) occurs as a subgoal; generate the residue R from this subsumption; return(s, R);
end.
Remarks: The isomorphism between the pattern graph G2 and a subgraph of the SD-graph G1 that we are seeking here has to preserve the names of the nodes. Since the pattern graph is an undirected path, testing for (subgraph) isomorphism can be done efficiently by checking whether the SD-graph contains a directed path corresponding to <D1, ..., Dk> or corresponding to <Dk, ..., D1>. As the subgoals occurring in the same as well as different rules are distinct and the pattern graph is an undirected path, there can be at most two isomorphic copies of G2 in G1, one starting from the subgoal node D1 and another starting from Dk. For testing maximal subsumption, one can use a minor variant of the subsumption algorithm in [3]. For more details on the algorithm, its efficiency, and its proof of correctness, the reader is referred to [8].

Suppose that an IC generates a residue R w.r.t. an expansion sequence s, and that R has a database atom A in its head. Let θ be the subsuming substitution. Then we say that this residue is useful for the sequence s provided there is an atom B in s such that θ can be extended to a mapping θ' so that θ'(A) = B. The motivation for this definition will be clear shortly. (Notice that according to this definition, residues without database atoms in their head are trivially useful.) We next illustrate the algorithm with an example.

Example 3.2. Suppose that a database contains the following EDB predicates:
super(P, S, T), meaning professor P supervises student S for thesis T;
pays(M, G, S, T), meaning amount M from grant G is allocated to student S who works on thesis T;
has(P, G), indicating professor P has a grant G;
works_with(P, P'), meaning P works jointly with P';
expert(P, F), indicating professor P is an expert in the field F;
field(T, F), meaning the field of the thesis T is F;
doctoral(S), meaning student S is a doctoral student.
Suppose also that the database has an IDB predicate eval(P, S, T), meaning P is qualified to evaluate student S on his/her thesis, defined by the program:
r0: eval(P, S, T) :- super(P, S, T).
r1: eval(P, S, T) :- works_with(P, P'), eval(P', S, T), expert(P, F), field(T, F).
and the following IC:
ic1: works_with(P2, P1), expert(P1, F1) → expert(P2, F1).
The expanded form of ic1 is
ic1e: works_with(P2, P1), expert(P3, F1), P3 = P1 → expert(P2, F1).
It can be verified that ic1e, and hence ic1, partially subsumes r1 in the sense of [3], giving rise to the residue P = P' → expert(P, F). However, notice that this residue is trivial in the context of the recursive rule r1, and is hence not useful for optimization.

Now, let us consider free subsumption. The pattern graph associated with ic1 is simply the graph consisting of the nodes works_with and expert and the only edge <works_with, expert>. It can be easily verified that in the SD-graph of the above program, there is a subgraph which is essentially the edge <works_with, expert> with label <r1, {(2, 1)}>. Since works_with occurs in r1, the expansion sequence constructed in Step (3) of Algorithm 3.1 is r1 r1. Subsequently, when we test this sequence in Step (4), we find that it is maximally subsumed by ic1, giving us the residue → expert(P, F). Let θ be the subsuming substitution. Then this residue is clearly useful for the sequence r1 r1, since expert(P, F) (even with no extension to θ) occurs in r1 r1. □

For convenience, we use the notation (s, R) for residues, where s is the expansion sequence that produced the residue R. In the sequel, we only consider residues that are useful for their expansion sequences. Detecting useful residues for expansion sequences is discussed in detail in [8] and is beyond the scope of this paper.

In the next section, we classify residues generated by maximal (free) subsumption into different types, depending on the kind of optimization they can lead to. We then discuss techniques for pushing these residues inside recursion.
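The pattern graph of an IC and condition (ii) of Lemma 3.1 lend themselves to a compact Python sketch. It is illustrative only: the SD-graph path is supplied by hand, using the edge label <r1, {(2, 1)}> reported in Example 3.2, rather than being built from the AP-graph, and the encoding of edges as (source, target, label) triples is an assumption. The names `pattern_graph` and `labels_compatible` are hypothetical.

```python
from typing import List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]
Label = Set[Tuple[int, int]]
PatternEdge = Tuple[str, str, Label]
SDEdge = Tuple[str, str, Tuple[Tuple[str, ...], Label]]  # label = (expansion sequence, positions)


def pattern_graph(db_atoms: List[Atom]) -> List[PatternEdge]:
    """Edges (D_i, D_{i+1}), each labelled with the 1-based argument position
    pairs at which the two consecutive subgoals hold the same variable."""
    edges: List[PatternEdge] = []
    for left, right in zip(db_atoms, db_atoms[1:]):
        label = {(i + 1, j + 1)
                 for i, u in enumerate(left[1])
                 for j, v in enumerate(right[1]) if u == v}
        edges.append((left[0], right[0], label))
    return edges


def labels_compatible(pattern: List[PatternEdge], sd_path: List[SDEdge]) -> bool:
    """Condition (ii) of Lemma 3.1: each pattern label is a subset of the
    position set carried by the corresponding SD-graph edge."""
    if len(pattern) != len(sd_path):
        return False
    return all(ps == ss and pd == sd and plabel <= positions
               for (ps, pd, plabel), (ss, sd, (_exp, positions)) in zip(pattern, sd_path))


# Database subgoals of ic1 from Example 3.2.
IC1_DB: List[Atom] = [("works_with", ("P2", "P1")), ("expert", ("P1", "F1"))]
# SD-graph edge as stated in Example 3.2 (entered by hand, not computed here).
SD_PATH: List[SDEdge] = [("works_with", "expert", (("r1",), {(2, 1)}))]

G2 = pattern_graph(IC1_DB)
```

For ic1 the pattern graph has the single edge (works_with, expert) with label {(2, 1)}, which is contained in the SD-graph label, so the test of Step (4) of Algorithm 3.1 succeeds for this example.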
4 Pushing Residues inside Recursion
The main goal of this section is to develop and illustrate techniques for pushing the residues of ICs inside recursion. Since our (free) residues are generated by maximal (free) subsumption, they never contain a database predicate in their body. This leads to the following classification of residues.

Definition 4.1. Let P be a program, ic an IC, and let R be a residue of ic w.r.t. the expansion sequence s, and suppose R is useful w.r.t. s. Then R is a fact residue if it is of the form E1, ..., Em → A, m ≥ 0, where the Ei are evaluable atoms and A is a database/evaluable atom. Further, it is said to be a conditional fact residue if m > 0 (unconditional, otherwise). The residue R is a null residue if it is of the form E1, ..., Em → (i.e., its head is empty). Further, it is a conditional null residue if m > 0 (unconditional, otherwise). □
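The classification of Definition 4.1 depends only on whether the head is empty and whether the body is empty, as the following minimal Python sketch shows. The `Residue` record and the string encoding of atoms are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Residue:
    body: Tuple[str, ...]   # evaluable atoms E1, ..., Em (possibly empty)
    head: Optional[str]     # atom A, or None when the head is empty


def classify(residue: Residue) -> str:
    kind = "null" if residue.head is None else "fact"
    cond = "conditional" if residue.body else "unconditional"
    return f"{cond} {kind} residue"


# The residue of Example 3.2 has an empty body and head expert(P, F).
EXAMPLE_3_2 = Residue(body=(), head="expert(P, F)")
```

Here `classify(EXAMPLE_3_2)` returns "unconditional fact residue", while a residue with a non-empty evaluable body and an empty head is reported as a conditional null residue.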
We discuss the types of fundamental optimizations suggested by the various types of residues. We shall see that they generalize the operations discussed in Chakravarthy et al. [3] to the recursive case.

(1) Atom Elimination: An atom deduced to be redundant in an expansion sequence of the program is deleted. The atom may be database or evaluable. Atom elimination may be conditional or unconditional, depending on the type of the fact residue that induces it. Thus, atom elimination is a general form of join elimination. The following example illustrates these ideas.

Example 4.1. Consider an organizational database having the following edb predicates: boss(E, B, R): B is a boss of E with rank R; same_level(E1, E2, E3): the employees E1, E2, E3 are at the same level of the organizational hierarchy; and experienced(E): employee E is experienced. Now consider the following program and IC.
r1: triple(E1, E2, E3) :- same_level(E1, E2, E3).
r2: triple(E1, E2, E3) :- boss(U, E3, R), experienced(U), triple(U, E1, E2).
ic1: boss(E, B, R), R = 'executive' → experienced(B).
The predicate triple(E1, E2, E3) defined by the program computes the set of triples of employees such that (i) the employees in the triple are separated by at most one level, and (ii) level(E3) ≥ level(E2) ≥ level(E1), where level(E) indicates the organizational level of E. Rule r2 puts some restrictions on the argument U, saying that only the bosses of people with experience are considered for membership in triple. The reader can verify that the only expansion sequence that is useful w.r.t. the IC ic1 is r2 r2 r2 r2. Indeed, the fact residue of ic1 w.r.t. this sequence is R = 'executive' → experienced(U), trivially making it useful. Thus, the IC above suggests that in the context of this proof tree, the atom experienced(U) can be deleted whenever the condition R = 'executive' holds.
Clearly, unconditional residues are a special case of conditional ones, and an example can be constructed for them rather easily. □

(2) Atom Introduction: An atom which is either evaluable or (is database and) corresponds to a small relation may be introduced in the query computation. If the atom is evaluable, the addition in general causes a scan reduction. If the atom is a (small) database predicate, its addition could reduce the cost of joining some large relations. Atom introduction is also induced by fact residues, and can again be conditional or unconditional. Thus, atom introduction is a general form of join introduction and scan reduction. The following example illustrates these ideas.

Example 4.2. Let us revisit Example 3.2 with one additional rule and IC.
r2: eval_support(P, S, T, M) :- eval(P, S, T), pays(M, G, S, T).
ic2: pays(M, G, S, T), (M > 10,000) → doctoral(S).
(Only doctoral students have a financial support > $10,000.) As seen in Example 3.2, ic1 maximally subsumes the expansion sequence r1 r1 freely. The corresponding residue is the unconditional fact residue → expert(P', F), which is useful for the expansion sequence r1 r1 in the sense of Section 3. This residue suggests the atom expert(P', F) can be eliminated in every subtree of the form r1 r1 (in any proof tree generated by the program). The IC ic2 has the residue M > 10,000 → doctoral(S) w.r.t. the expansion sequence r2. This suggests (conditionally) adding the subgoal doctoral(S) to this sequence (assuming this relation is small compared to other relations). □

(3) Pruning (Sub)Trees: Null residues, both conditional and unconditional, suggest that the proof tree corresponding to their associated expansion sequence will not generate any tuples (for the predicate at the root of the proof tree). Depending on the type of the residue, the subtrees can be pruned either conditionally or unconditionally. Thus, subtree pruning is a general form of detecting unsatisfiable conditions.

Example 4.3. Consider the following example.
EDB-schema: par(Person, Person_Age, Parent, Parent_Age).
IDB-schema: anc(Person, Person_Age, Ancestor, Ancestor_Age).
r0: anc(X, Xa, Y, Ya) :- par(X, Xa, Y, Ya).
r1: anc(X, Xa, Y, Ya) :- anc(X, Xa, Z, Za), par(Z, Za, Y, Ya).
ic1: Ya ≤ 50, par(Z, Za, Y, Ya), par(Z', Za', Z, Za), par(Z'', Za'', Z', Za') → .
(People under 50 years do not have 3 generations of descendants below them.) The IC ic1 maximally subsumes the expansion sequence r1 r1 r1.
The associated residue is Ya ≤ 50 → , which is also useful w.r.t. this sequence. This essentially says the proof tree r1 r1 r1 can be pruned whenever Ya ≤ 50 holds. This example illustrates (conditional) subtree pruning. □

We have seen the fundamental types of optimization induced by the residues. A key difference between the approach of [3] and ours is that our residues are not just w.r.t. rules but rather w.r.t. expansion sequences. Hence the optimization we suggest is not directly on the program, but on the proof trees generated by it. To carry out this optimization, we must make sure the transformed program only generates proof trees with such optimization incorporated! We attack this problem in two stages. First, we transform the original program into an equivalent one which in some sense "isolates" the expansion sequence of interest. We then carry out the required optimization on the rules which isolate the expansion sequence, by pushing the residue inside recursion.

Isolating an Expansion Sequence: We give an algorithm for transforming a program such that the given expansion sequence is isolated.
Algorithm 4.1. Input: A linear recursive program P = fr1; : : :; rm g, de ning a predicate p and an expansion sequence s =< rj1 ; : : :; rj >. Output: A transformed program Q equivalent to P which isolates the sequence s. k
begin Step (1) Introduce auxiliary predicates pi ; qi; 1 i k ? 1; Adopt the convention that p0 = pk = q0 = qk = p; Step (2) Write the following rules, called \-rules", in sequence; pi?1 :- rj , with the occurrence of p (if any) in the body replaced by pi , 1 i k; /** This is the ith -rule. **/ Step (3) Write down the following rules, called \ rules", in sequence; pi?1 :- rj , with the occurrence of p (if any) in the body replaced by qi ; /** This is the ith -rule. **/ Step (4) Write down the last group of rules, called \ rules"; qi?1 :- rl , for all rules rl 2 P; such that l 6= ji , for 1 i k. Step(5) for i := 2 to k do rewrite the ith -rule by unifying its head with the occurrence of pi?1 in the body of i
i
the (i ? 1)th -rule, and renaming all local variables with completely new names; also rewrite the ith -rule by unifying its head with the occurrence of pi?1 in the body of the (i ? 1)th -rule and renaming the local variables as above; nally, rewrite the -rules by making the head of the ith -rule identical to that of the ith -rule, 1 i k ? 1, as far as the arguments are concerned, and renaming local variables as above.
end.
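As a minimal sketch of Steps (1) to (4), the following Python fragment works at the level of predicate names only. It assumes a hypothetical representation in which a rule is a pair (head predicate, list of body predicates); arguments are dropped, so the unification and renaming of Step (5) are not modelled. The names `isolate`, `aux` and `rename` are illustrative and not part of the algorithm as stated.

```python
from typing import List, Tuple

# Hypothetical encoding: (head predicate, body predicates); arguments omitted.
Rule = Tuple[str, List[str]]


def isolate(program: List[Rule], p: str, seq: List[int]) -> List[Rule]:
    """Steps (1)-(4) of Algorithm 4.1 on predicate names.

    program holds r_1..r_m (index 0 is r_1); seq holds the 1-based
    indices j_1..j_k of the expansion sequence.
    """
    k = len(seq)

    # Step (1): auxiliary predicates, with p_0 = p_k = q_0 = q_k = p.
    def aux(base: str, i: int) -> str:
        return p if i in (0, k) else f"{base}{i}"

    def rename(body: List[str], new: str) -> List[str]:
        # A linear rule has at most one occurrence of p in its body.
        return [new if atom == p else atom for atom in body]

    alpha: List[Rule] = []
    beta: List[Rule] = []
    gamma: List[Rule] = []
    for i in range(1, k + 1):
        body = program[seq[i - 1] - 1][1]
        # Step (2): i-th alpha-rule, p replaced by p_i.
        alpha.append((aux("p", i - 1), rename(body, aux("p", i))))
        # Step (3): i-th beta-rule, p replaced by q_i.
        beta.append((aux("p", i - 1), rename(body, aux("q", i))))
        # Step (4): gamma-rules, q_{i-1} :- r_l for every l != j_i.
        for l, (_, other_body) in enumerate(program, start=1):
            if l != seq[i - 1]:
                gamma.append((aux("q", i - 1), list(other_body)))
    return alpha + beta + gamma


# Toy program (hypothetical): r1: p :- e, p.   r2: p :- b.
toy = [("p", ["e", "p"]), ("p", ["b"])]
for head, body in isolate(toy, "p", [1, 2]):
    print(head, ":-", ", ".join(body))
```

For the toy program and the sequence $\langle r_1, r_2 \rangle$, the rules for $p$ and $p_1$ capture the isolated sequence, while the rule for $q_1$ re-enters the original recursion.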
Theorem 4.1. The transformation described by Algorithm 4.1 preserves equivalence.
Proof. The proof is by showing that every proof tree generated by the original program is generated by the transformed program and vice versa. The complete details and illustrative examples can be found in [8]. □
We next discuss Stage 2 of our transformation process.
Optimization by Pushing: The optimizing transformation evidently depends on the type of the residue. In the discussion below, we assume that the residues have at most one subgoal in their body. This is only for simplicity; the techniques and results work for the general case with any number of subgoals. Suppose the transformed program from Stage 1 above is $Q$, $s = \langle r_{j_1}, \ldots, r_{j_k} \rangle$ is the expansion sequence, and the residue is $R$. (Recall that $Q$ isolates the sequence $s$.)
(1) Atom Elimination: Let the residue $R$ be $E \rightarrow A$. By construction, $R$ must be useful for the expansion sequence $s$. Hence, suppose that the atom $A$ occurs in the body of rule $r_{j_i}$ in the expansion sequence. Then to implement the optimization, we consider the $i$th $\alpha$-rule in $Q$ and split it logically as follows: make two copies of the rule; add the subgoal $E$ to the first copy and then delete the subgoal $A$ from the resulting body; for the second copy, add the subgoal $\neg E$. Notice that since $E$, and hence $\neg E$, is an evaluable atom, it can be efficiently implemented as a selection.
(2) Atom Introduction: Both atom introduction and elimination are induced by fact residues. The criteria for looking for introduction are (i) the atom being introduced should be evaluable or a small database predicate, and (ii) it should share some arguments with the expansion sequence. Let $r_{j_i}$ be the rule instance in the sequence $s$ with which the residue $R$ shares variables. Let the residue be of the form $E \rightarrow A$. In this case, optimization can be effected by logically splitting the $i$th $\alpha$-rule as follows. Make two copies of the rule; add the subgoal $A$ to the body of one copy, and add the subgoal $\neg E$ to the body of the other. Again, notice that $\neg E$ is an evaluable predicate.
Once this "high level" optimization is carried out, heuristics such as subgoal reordering may be applied to improve the efficiency even further. Thus, the subgoals in the last two rules may be reordered so that the selection (or restriction) is first performed on the edb relations pays and the bindings from the result are passed on for the computation of eval. Finally, notice that for this program, both atom elimination and atom introduction can be performed independently of each other. □
(3) Pruning (Sub)Trees: Let the residue $E \rightarrow$ constrain the variable(s) occurring in the rule $r_{j_i}$ in the context of the expansion sequence $s$. The significance of this residue is that whenever the condition $E$ holds, the expansion sequence $s$ will not produce any tuples and hence can be pruned (when viewed as a proof tree). In this case, we simply have to modify the $i$th $\alpha$-rule in the transformed program $Q$ by adding the subgoal $\neg E$ to its body. This modification captures the fact that the only way the expansion sequence $s$ can produce tuples is when $E$ is false. In the special case where the null clause is unconditional, we simply delete the rule defining the predicate $p_{k-1}$. This is equivalent to adding the unsatisfiable condition false to its body. Of course, once the rule for $p_{k-1}$ is deleted, every rule making use of the predicate $p_{k-1}$ can be deleted.
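The three ways of pushing a residue into an isolated rule can be sketched as follows. This is only an illustration under stated assumptions: a rule is a hypothetical pair (head, list of body atoms written as strings), the evaluable condition $E$ and the atom $A$ are passed in as strings, and negation is rendered textually by the helper `neg`. None of these names come from the algorithm itself.

```python
from typing import List, Tuple

# Hypothetical encoding: (head atom, body atoms), all written as strings.
Rule = Tuple[str, List[str]]


def neg(e: str) -> str:
    """Textual negation of an evaluable condition."""
    return f"not({e})"


def eliminate_atom(rule: Rule, e: str, a: str) -> List[Rule]:
    """Residue E -> A, with A in the body: split the rule in two.

    First copy: add E and delete A. Second copy: add not(E).
    """
    head, body = rule
    first = [atom for atom in body if atom != a] + [e]
    second = body + [neg(e)]
    return [(head, first), (head, second)]


def introduce_atom(rule: Rule, e: str, a: str) -> List[Rule]:
    """Residue E -> A, with A not in the body: split the rule in two.

    One copy gets A, the other gets not(E).
    """
    head, body = rule
    return [(head, body + [a]), (head, body + [neg(e)])]


def prune(rule: Rule, e: str) -> Rule:
    """Residue E -> (null clause): the rule can only fire when E is false."""
    head, body = rule
    return (head, body + [neg(e)])


# Hypothetical isolated rule and residue, for illustration only.
rule = ("p1(X,Y)", ["a(X,Z)", "b(Z)", "p2(Z,Y)"])
for head, body in eliminate_atom(rule, "Z > 10", "b(Z)"):
    print(head, ":-", ", ".join(body))
```

Each function leaves the head untouched and only edits the body, which mirrors the fact that the split is a logical case analysis on $E$.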
5 Application to Intelligent Query Answering
The framework of semantic query optimization is sufficiently broad to capture forms of intelligent query answering. Intelligent query answering has been studied by several researchers [4, 5, 11]. In this section, we shall show via examples that some of the techniques studied by Motro and Yuan [11] can be captured within our framework for semantic query optimization. The basic idea in intelligent query answering is that it must be possible to elicit answers from the database that describe the "objects" (or tuples) which would constitute answers in a conventional sense of the term. Examples include queries like "describe the honors students of Concordia University", or "what does it take to make the Dean's honors list in the Engineering faculty of UQAM?". The difference between such queries and conventional queries is that answers to these queries are not expected in the traditional form of sets of tuples satisfying query bindings. Rather, some explanation or summary information regarding the (conventional types of) answers to queries is what is expected.
Motro and Yuan [11] proposed the following special syntax for formulating the so-called knowledge queries, which expect intelligent answers.
describe $\varphi(X)$ where $\psi(X)$.
This query is asking for a description (or summary) of the set of all objects $X$ satisfying conditions or properties $\varphi$ given that a certain context expressed by the formula $\psi$ is satisfied. In conceptual terms, we can think of the query as asking "what can you say about objects $X$ satisfying properties $\varphi$ in the context $\psi$?". An example might be "describe honors students given that they are in computer science, come from one of the top 10 colleges in town, and play chess for a hobby". This query may be formally expressed as follows.
describe honors(X) where major(X, CS) $\wedge$ college(X, C) $\wedge$ topten(C) $\wedge$ hobby(X, chess).
For each predicate, the database might either have an explicit relation corresponding to it (in the case of edb predicates), or have a definition for it (in the case of idb predicates). The idea is to extract as compact a description of the answers as possible. In this section, we suggest a methodology for answering such queries intelligently, making use of techniques developed for semantic query optimization.
Identification of relevant context: In the given query, not everything in the context, i.e. the properties $\psi$ stated in the where clause, might be relevant. E.g., in the above example, we know from common sense that the hobby of a student (usually!) might have little to do with his/her academic achievement. Therefore, an intelligent query processor should be able to ignore such irrelevant information in the given context. More generally, determination of the relevant part of the context is required. This may be done by performing a reachability analysis and deciding whether any of the predicates in the context are reachable from the query predicate. Reachability may be defined as follows. Every predicate is reachable from itself. A predicate $p$ is reachable from a predicate $q$ if $q$ occurs in the body of a rule for a predicate $r$ which is reachable from $p$. $p$ is reachable from $q$ if $q$ is reachable from $p$. Reachability is the smallest relation satisfying the conditions above. Predicates in the context not reachable from the query predicate are viewed as irrelevant to the query.
Identifying qualifications given by context: The context in a query can be viewed as asserting that the assertions made in the context are satisfied. Given this, the system has to determine to what extent the objects are qualified to be answers, and what additional qualifications must be met in order that the objects be valid answers.
Application of semantic query optimization techniques: In order to obtain a concise description of the valid answers to the query given that the context is satisfied, we can take advantage of semantic query optimization as follows. We can treat the context as an axiom and consider the proof tree(s) associated with the query. Each proof tree is a conjunctive query that says if an object satisfies the (conditions asserted by the) leaves, then the object is a valid answer to the query associated with the root of the tree. Thus, in order to determine to what extent the context information influences answers to the query, we can determine whether the context (axiom) (partially) subsumes the (leaves of the) query. The associated residue is helpful in generating descriptive answers to the original query.
Example 5.1. Consider the following deductive database, adapted from Motro and Yuan [11].
r0: honors(Stud) :- transcript(Stud, Major, Cred, Gpa), Cred $\ge$ 30, Gpa $\ge$ 3.8.
r1: honors(Stud) :- transcript(Stud, Major, Cred, Gpa), Gpa $\ge$ 3.8, exceptional(Stud).
r2: exceptional(Stud) :- publication(Stud, P), appears(P, Jl), reputed(Jl).
r3: honors(Stud) :- graduated(Stud, College), topten(College).
Now, consider the following query.
describe honors(Stud) where major(Stud, CS) $\wedge$ graduated(Stud, College) $\wedge$ topten(College) $\wedge$ hobby(Stud, chess).
First, we determine the part of the context relevant to the given query. In this case, the predicates graduated and topten are the only predicates in the context which are relevant to the query. At this point, we can regard the context as instructing the query processor to consider the situation where a given student graduated from one of the top ten colleges (i.e. graduated(Stud, College) $\wedge$ topten(College) is true). The next step is to determine whether the relevant part of the context partially subsumes the conditions associated with the query. To do this, we consider the proof tree(s) associated with the query. There are 3 proof trees associated with this query, given by $r_0$, $r_1 r_2$, and $r_3$. It is easy to see that the relevant part of the context does not partially subsume the first two proof trees. However, it does subsume the third proof tree (totally). As a result, the residue resulting from this subsumption is the empty conjunction, which in this case means all individuals satisfying the context of the query qualify to be honors students. The residues resulting from the first two proof trees are the entire proof trees themselves (since there is no subsumption at all). These residues are contained in the residue associated with the third proof tree, since the latter, being an empty conjunction, corresponds to true.
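The two steps of Example 5.1 can be traced with a small Python sketch. It rests on simplifying assumptions that are not part of the method itself: atoms are hypothetical (predicate, arguments) pairs, reachability is followed only downward from the query predicate through rule bodies, and subsumption is approximated by syntactic membership of a leaf in the relevant context, which suffices here only because the query and the rules already use the same variable names.

```python
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # hypothetical encoding: (predicate, arguments)

TRANSCRIPT: Atom = ("transcript", ("Stud", "Major", "Cred", "Gpa"))

# Rules of Example 5.1; comparisons are kept as atoms with predicate ">=".
RULES: Dict[str, Tuple[str, List[Atom]]] = {
    "r0": ("honors", [TRANSCRIPT, (">=", ("Cred", "30")), (">=", ("Gpa", "3.8"))]),
    "r1": ("honors", [TRANSCRIPT, (">=", ("Gpa", "3.8")), ("exceptional", ("Stud",))]),
    "r2": ("exceptional", [("publication", ("Stud", "P")),
                           ("appears", ("P", "Jl")),
                           ("reputed", ("Jl",))]),
    "r3": ("honors", [("graduated", ("Stud", "College")), ("topten", ("College",))]),
}

CONTEXT: List[Atom] = [
    ("major", ("Stud", "CS")),
    ("graduated", ("Stud", "College")),
    ("topten", ("College",)),
    ("hobby", ("Stud", "chess")),
]


def reachable(query_pred: str) -> Set[str]:
    """Predicates reached from query_pred by following rule bodies."""
    seen, stack = {query_pred}, [query_pred]
    while stack:
        pred = stack.pop()
        for head, body in RULES.values():
            if head != pred:
                continue
            for name, _ in body:
                if name not in seen:
                    seen.add(name)
                    stack.append(name)
    return seen


def residue(leaves: List[Atom], context: List[Atom]) -> List[Atom]:
    """Leaves not covered by the context (simplified, syntactic subsumption)."""
    return [atom for atom in leaves if atom not in context]


relevant = [atom for atom in CONTEXT if atom[0] in reachable("honors")]
leaves_r0 = RULES["r0"][1]
# Proof tree r1 r2: expand exceptional(Stud) in r1 by the body of r2.
leaves_r1r2 = [a for a in RULES["r1"][1] if a[0] != "exceptional"] + RULES["r2"][1]
leaves_r3 = RULES["r3"][1]

for name, leaves in [("r0", leaves_r0), ("r1 r2", leaves_r1r2), ("r3", leaves_r3)]:
    print(name, "residue:", residue(leaves, relevant))
```

Under these assumptions the relevant context reduces to the graduated and topten atoms, the residue of the third proof tree is empty, and the residues of the first two are their full sets of leaves, in line with the discussion above.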
6 Summary and Future Research
In this paper, we have developed techniques for generating residues as well as pushing them inside recursion, as an alternative to the evaluation-based approach developed in [3, 9]. This situation is similar to the differences and tradeoffs between the two paradigms [1, 2] for (conventional) recursive query processing: evaluation (e.g., semi-naive evaluation) and rewriting (e.g., magic sets). Semantic optimization by program transformation is comparable to the magic sets method. Just as the magic sets method pushes the goal selectivity of queries inside recursion, our approach tries to push the semantics (in ICs) inside the recursion. We believe this approach constitutes an attractive alternative to the evaluation-based approach, as well as an important area for future research. The application of semantic query optimization for intelligent query answering is promising and merits further exploration.
References
[1] F. Bancilhon and R. Ramakrishnan, "An amateur's introduction to recursive query processing strategies," ACM-SIGMOD Conf., 1986, 16-52.
[2] S. Ceri, G. Gottlob, and L. Tanca, "What You Always Wanted to Know About Datalog (And Never Dared to Ask)," IEEE Trans. Knowledge and Data Eng., (March 1989), 146-166.
[3] U.S. Chakravarthy, J. Grant, and J. Minker, "Logic Based Approach to Semantic Query Optimization," ACM TODS, (June 1990), 162-207.
[4] T. Gaasterland, P. Godfrey, and J. Minker, "Relaxation as a Platform for Cooperative Answering," Journal of Intelligent Information Systems, 1, (1992), 293-321.
[5] T. Imielinski, "Intelligent query answering in rule based systems," Journal of Logic Programming, 4, 3, (September 1987), 229-257.
[6] V.S. Lakshmanan and H. Hernandez, "Structural Query Optimization: A Uniform Framework for Semantic Query Optimization in Deductive Databases," Proc. ACM SIGACT-SIGMOD Symp. Principles of Database Systems, Denver, May 29-31, 1991, 102-114.
[7] V.S. Lakshmanan and R. Missaoui, "On Semantic Query Optimization in Deductive Databases," Proc. IEEE Int. Conf. Data Eng., 1992, 368-375.
[8] V.S. Lakshmanan and R. Missaoui, "Semantic Optimization of Recursive Queries – Pushing Semantics inside Recursion," Tech. Report, Dept. of Comp. Science, Concordia University, May 1994.
[9] S. Lee and J. Han, "Semantic query optimization in recursive databases," Proc. IEEE Int. Conf. Data Eng., 1988, 444-451.
[10] S.-G. Lee, L.J. Henschen, and G.Z. Qadah, "Semantic Query Reformulation in Deductive Databases," Proc. IEEE Int. Conf. Data Eng., April 8-12, 1991, Kobe, Japan, 232-239.
[11] A. Motro and Yuan, "Querying database knowledge," ACM SIGMOD International Conference on Management of Data, 1990.
[12] J.F. Naughton, R. Ramakrishnan, Y. Sagiv, and J.D. Ullman, "Efficient evaluation of right-, left-, and multi-linear rules," ACM SIGMOD International Conference on Management of Data, 1989, 235-242.
[13] Y. Sagiv, "Optimizing datalog programs," 6th ACM PODS, 1987, 349-362.
[14] J.D. Ullman, Principles of Database and Knowledge-Base Systems, vol I & II, Comp. Sci. Press, MD., 1988.