Magic Checking: Constraint Checking for Database Query Optimisation

Mark Wallace, Stephane Bressan and Thierry Le Provost

Contact address: Mark Wallace, IC-Parc, William Penney Laboratory, Imperial College, LONDON SW7 2AZ. email: [email protected]

September 1995

1 Introduction

Constraint programming languages, such as CHIP [DvS+88], ILOG Solver [ILO95] and ECLiPSe [Me95], have proven their worth on a range of practical applications [Wal95]. The common ingredient of their success is constraint satisfaction techniques, embedded in a host programming language, as pioneered by Van Hentenryck [Van89] and the CHIP team. Recently experiments have been carried out on the application of constraint checking and propagation to database query optimisation [BLW94, WL95, Bre94]. Some very promising results have been obtained using the ECLiPSe-BANG database connection [ECR93]. Because ECLiPSe-BANG supports a tight tuple-at-a-time database interface, it was possible to apply the algorithms supporting constraint propagation in ECLiPSe directly to query optimisation in ECLiPSe-BANG. In particular it was possible to use ECLiPSe's advanced control features [BFL+95], which allow constraints to be woken up and redelayed repeatedly during constraint propagation, so as to achieve constraint propagation behaviour by coroutining.

In this paper we describe the application of program transformation techniques, based on sideways information passing, to constraint checking in databases. This implements constraint checking in set-at-a-time database programming systems which may not support ECLiPSe's special functionality mentioned above. The transformation we propose owes much to the ideas of magic sets, as will be apparent from the paper. Therefore we call our approach magic checking.

Optimisers which are defined on Datalog programs are typically designed to handle non-standard database queries. For example much work - including the concept of magic sets - has gone into optimising recursive queries [RLK86, BMSU86, Vie89]. By contrast, magic checking is not designed for the optimisation of recursive queries, though it is fully integrated with the technique of generalised magic sets. Magic checking is an option available to the end-user to guide the evaluation of queries - typically complex queries involving cycles. Nevertheless the option even proves useful for optimising a very simple non-recursive, non-cyclic query (see section 8 below).

Another direction is semantic optimisation, which seeks to transform a query into another query which has the same set of answers but which is cheaper to evaluate [HZ80]. Semantic optimisation is often applied to databases which support integrity constraints [Kin81, CGM88]: theorem proving techniques are used to transform the query into a simpler query that, under the given integrity constraints, is logically equivalent. By contrast, magic checking does not require information about the semantics of the database. It is not an optimisation based on semantics at all. Instead it implements a high level control primitive which is made available to the end-user as a query language annotation. It is up to the end user to decide how to make judicious use of this annotation in order to optimise his query.

Finally, magic checking is designed to augment standard query optimisations supported by current generation database systems. It enables the user to direct the query evaluation so as to make use of non-standard optimisation possibilities, but the resulting plan is still finally optimised by the standard database query optimiser. In other words magic checking does not offer an alternative to standard optimisation techniques but an extension of them.

In the next section we make some definitions; then in section 3 we show a mapping between sets of rules and database query evaluation plans; then in section 4 we discuss constraint satisfaction techniques and their relation to database querying; in section 5 we present some examples motivating the application of constraint checking to query optimisation; in section 6 we present our optimisation architecture and in sections 7 and 8 we explain magic checking in detail; finally in the conclusion we comment on current developments.

2 Definitions

The language, Vanessa, used in this paper to express end-user database queries is an extension of Datalog. Vanessa queries are transformed into Datalog for evaluation against the database.

A Datalog program is a set of clauses. There are three types of clauses: rules, facts and goals. Rules have the form

Head ← A1, ..., An

where each Ai is termed an atomic goal. The head of the rule, and each atomic goal Ai in its body, is an atom. An atom has the form p(T), where p is a predicate and T is a vector of terms, each of which is either a variable or a constant. Predicate names are written starting with a small letter. Variable names begin with a capital letter. The symbol _ stands for an unnamed variable: the variable is distinct from all named variables and also from all other unnamed variables (i.e. all the other variables represented by the symbol _). Constants are either numbers or alphanumeric names starting with a small letter. We shall assume that every variable appearing in the head of a rule also appears in some atomic goal in its body.

The predicate associated with the head of a rule directly depends on the predicate associated with each of the atoms in its body. It depends on each predicate which it directly depends on, and each predicate on which those predicates depend. A predicate is recursive if it depends upon itself, and a Datalog program is recursive if it contains a recursive predicate.

A fact has the form Head ←, where Head is an atom. We shall assume that no variables appear in a fact. A goal has the form ← A1, ..., Am where the Ai are atomic goals.

The usual interpretation of a clause is a logical disjunction, where the head (if there is one) is a positive disjunct and the atomic goals in the body are negative disjuncts. For example the rule Head ← A1, ..., An is interpreted as the disjunction Head ∨ ¬A1 ∨ ... ∨ ¬An.

A tuple T in a database relation p can be viewed as a fact p(T) ←, and we shall use the words fact and tuple interchangeably. The definition of a predicate p, in a program P, is the set of rules and facts in P whose head has the form p(T). If the definition contains only facts, p is termed an extensional predicate. If the definition of p contains only rules, then p is termed an intensional predicate. A predicate defined by both facts and rules can be rewritten as two predicates, one intensional and one extensional. Therefore we shall assume, without loss of generality, that every predicate is either extensional or intensional. The use of Datalog for expressing queries is elucidated in [Ull82].

Finally we define a selection-free Datalog program to be one in which the terms appearing in rules and goals are all variables. Any Datalog program can be translated into a selection-free Datalog program by introducing an atomic goal, whose predicate is defined by a single fact, for each occurrence of a constant term in a rule or goal clause. For pedagogical reasons we shall apply Magic Checking only to selection-free Datalog programs in this paper. Unrestricted Datalog can be handled by applying the above translation before Magic Checking.
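The translation into selection-free form is mechanical. The following Python fragment is a minimal sketch of our own (it is not part of the paper; the rule representation and the is_<constant> predicate naming are invented for the example): each constant is replaced by a fresh variable bound by an extra atomic goal whose predicate is defined by a single fact.

from itertools import count

_fresh = count(1)

def is_variable(term):
    # Strings beginning with an upper-case letter are read as variables.
    return isinstance(term, str) and term[:1].isupper()

def make_selection_free(rule):
    """Rewrite one rule, given as (head_atom, [body_atoms]), where an atom is
    a (predicate, args) pair; returns the rewritten rule and the new facts."""
    head, body = rule
    extra_goals, new_facts = [], []

    def rewrite(atom):
        pred, args = atom
        new_args = []
        for t in args:
            if is_variable(t):
                new_args.append(t)
            else:                      # constant c: add goal is_c(V) and fact is_c(c)
                var = f"V{next(_fresh)}"
                extra_goals.append((f"is_{t}", (var,)))
                new_facts.append(((f"is_{t}", (t,)), []))
                new_args.append(var)
        return (pred, tuple(new_args))

    new_head = rewrite(head)
    new_body = [rewrite(a) for a in body] + extra_goals
    return (new_head, new_body), new_facts

# fredpar(Y) <- parent(fred, Y)   becomes
# fredpar(Y) <- parent(V1, Y), is_fred(V1)   together with the fact is_fred(fred).
print(make_selection_free((("fredpar", ("Y",)), [("parent", ("fred", "Y"))])))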

3 Rules as Query Evaluation Plans

Historically Datalog rules were first associated with the top-down execution mode of logic programming. However with the introduction of magic-set techniques [BMSU86], the idea of bottom-up, or forward-chaining, evaluation became established as an alternative. The forward-chaining mode accommodates set-oriented evaluation and so appears more amenable to integration with current database systems.

Datalog can be seen as a negation-free relational calculus extended with recursion. There is a simple mapping from non-recursive Datalog programs to relational calculus, by "unfolding" all the rules in the Datalog program. Unfolding means replacing each atom by its definition, and similarly replacing the atoms in the definition until the remaining atoms are defined by facts. Thus the selection-free Datalog program:

answer(Z) ← fredpar(Y), brother(Y, Z)      (1)
answer(Z) ← fredaunt(Y), husband(Y, Z)     (2)
fredpar(Y) ← fred(X), parent(X, Y)         (3)
fredaunt(Z) ← fredpar(Y), sister(Y, Z)     (4)
parent(X, Y) ← father(X, Y)                (5)
parent(X, Y) ← mother(X, Y)                (6)
fred(fred) ←                               (7)

represents the query "Who are Fred's uncles?", expressed as a relational calculus query thus:

{ Z : ∃X, Y. (X = fred ∧ (father(X, Y) ∨ mother(X, Y)) ∧ brother(Y, Z)) ∨
      ∃X, Y, W. (X = fred ∧ (father(X, Y) ∨ mother(X, Y)) ∧ sister(Y, W) ∧ husband(W, Z)) }

We assume that there is an underlying database management system (DBMS) which can optimise and execute relational calculus queries against the database system. The main reason for using Datalog instead of the relational calculus in this paper is to separate the magic checking optimisation from any optimisation performed by the underlying DBMS. This avoids the problem of one optimiser working against the other. This separation is achieved because the Datalog clauses are passed one at a time to the DBMS for optimisation.

Magic checking specifies its query evaluation plan as a Datalog program. This plan makes precise what intermediate relations are to be produced during query execution - these correspond to the intensional predicates defined in the program. However the definition of each intermediate relation is, in effect, a relational calculus expression whose atomic expressions refer to database relations or other intermediate relations. The optimisation of each definition is left entirely to the DBMS.

The Datalog program above, expressing the query about Fred's uncles, specifies that parent, fredaunt and fredpar should be materialised as intermediate relations. As a query evaluation plan it is rather inefficient because the selection fred(X) is only applied in the definition of fredpar, and not in the definition of parent. Consequently the intermediate relation parent contains the union of the whole of the mother and father relations, and not just the mother and father of Fred!
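To make the reading of these rules as a bottom-up evaluation plan concrete, here is a small Python sketch of our own (the family tuples below are invented purely for illustration). It materialises parent, fredpar and fredaunt as sets, exactly as the plan prescribes, so parent indeed ends up containing the whole of father ∪ mother:

father = {("fred", "ann"), ("bob", "carl")}
mother = {("fred", "dave"), ("eve", "gil")}
brother = {("ann", "hugh")}              # brother(Y, Z): Z is a brother of Y
sister = {("dave", "ivy")}               # sister(Y, Z): Z is a sister of Y
husband = {("ivy", "jack")}              # husband(W, Z): Z is the husband of W
fred = {("fred",)}

# (5), (6): parent := father union mother -- note it is NOT restricted to Fred
parent = father | mother

# (3): fredpar(Y) <- fred(X), parent(X, Y)
fredpar = {(y,) for (x,) in fred for (x2, y) in parent if x2 == x}

# (4): fredaunt(Z) <- fredpar(Y), sister(Y, Z)
fredaunt = {(z,) for (y,) in fredpar for (y2, z) in sister if y2 == y}

# (1), (2): the two rules defining answer
answer = ({(z,) for (y,) in fredpar for (y2, z) in brother if y2 == y}
          | {(z,) for (w,) in fredaunt for (w2, z) in husband if w2 == w})

print(sorted(parent))   # includes ('bob', 'carl') and ('eve', 'gil') as well
print(sorted(answer))   # [('hugh',), ('jack',)]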

Given the assumption that the database optimiser can locally optimise the evaluation of a single rule, but cannot globally optimise the evaluation of a whole set of rules, it is possible to transform an input Datalog program into another Datalog program for which the local optimisation performed by the database optimiser is more efficient. This assumption underlies a large body of work on transforming Datalog programs for the purpose of optimisation [BR86, BR87, Ram88]. This work is aimed primarily at optimising recursive queries. The facility to handle, and optimise, recursive queries is the second reason why in the current paper we use Datalog rather than relational calculus as our underlying query language.

The main optimisation achieved by the transformation of queries expressed as Datalog programs, in previous work, has been to make use of "sideways information passing strategies" (Sips). This amounts to making selections first wherever possible [BR87, p.270]. The most obvious transformation applicable to non-recursive Datalog programs is unfolding them into a single relational calculus expression. In this case the DBMS has the responsibility for globally optimising the whole query. It is a claim of this paper that there are non-recursive programs where unfolding the program leads to a non-optimal query evaluation. The claim is not that DBMS optimisers are deficient, but that they are black boxes for the end user. Magic checking provides a very simple but powerful form of user control. By annotating their queries, users can specify evaluation plans that are optimal for their own particular applications. Magic checking allows them to do this in a way that does not conflict with the optimisations performed by the underlying DBMS. In short, our contribution is to use the transformational approach to capture constraint checking.

In concluding this section, we note that whilst the optimisation of individual clauses in the Datalog program defining a query evaluation plan is left to the DBMS optimiser, there remains the problem of the order in which the clauses are evaluated. In the case of recursive queries, moreover, clauses may need to be re-evaluated repeatedly until no new tuples are produced. As has been proposed in [BRSS90], the magic checking transformation must also produce a set of control rules governing clause evaluations. In the case of non-recursive queries, these rules are quite straightforward. We do not further discuss the generation of control rules in this paper.

4 Constraint Satisfaction Techniques

Informally, a constraint in a database query is a condition that must be satisfied by answers to the query. One can view the set of possible answers to a query as being the cross-product of all the relations involved in the query. The constraints - specified as selection and join conditions - effectively prune all irrelevant tuples from the cross-product, leaving just those tuples belonging to the answer. From the query answering point of view, then, the involved database relations contribute to the set of possible answer tuples and the constraints reduce this set. Sideways information passing is used to ensure that not only the final answers, but also intermediate solutions, materialised as temporary relations, satisfy the constraints on the query. This keeps the temporary relations as small as possible.

A class of problems called "Constraint Satisfaction Problems" (CSPs) has been identified as a result of work in the field of artificial intelligence. A CSP involves a set of variables and a domain of possible values for each variable. Classically these variables only have a finite set of possible values, and we shall confine ourselves to classical CSPs in this paper. Additionally each CSP has a set of constraints. A constraint is a relation which must hold between a subset of the variables in the CSP. The definition of the relation is a subset of the cross-product of the relevant variables. There is an obvious mapping between constraints in CSPs and database relations [Dec92]. The set of possible answers to a CSP is the cross-product of the domains of its variables. The constraints prune irrelevant tuples from this cross-product.

The classical approach to solving CSPs is to "label" the variables one at a time with a value from their domains until all the variables are labelled, or until it becomes clear that one or more constraints will be violated by any extension of the current partial labelling to a complete labelling. Constraint satisfaction techniques [Kum92] provide ways of detecting possible, or certain, constraint violations as early as possible, so as to reduce the number of possible labellings. The simplest constraint satisfaction technique is constraint checking: as soon as all the variables involved in a constraint have a label, check that the tuple of labels belongs to the constraint definition [GB65]. In fact the check can be made repeatedly as each of the variables involved in the constraint becomes labelled. In this case check that the current labels belong to the projection of the constraint onto the variables labelled so far. There are many more powerful constraint satisfaction techniques [Mac77, Mon74, MH86] which could also be applied to optimise database querying, but in this paper we restrict ourselves to constraint checking.

The CSP approach traditionally used labelling on the variables to explore the space of possible solutions, by contrast with database querying where the space is defined by the relations involved in the query. It was the embedding of CSP techniques into logic programming [Van89] that first began to show how constraints defined by relations could be used both for defining the search space and for pruning it. In this paper we apply a basic constraint satisfaction technique to database querying. Some important steps had already been taken in this direction many years ago. The idea of a reducer is precisely to eliminate tuples not participating in the solution [BFMY83]. This idea was applied by using semi-joins in database querying, which dates back to [WY76, BC83]. One motivation for this work was to minimise data transfer in distributed databases, but another motivation was to optimise query processing on a single (RAP) database machine. Indeed our approach addresses a problem identified in the conclusion of Bernstein and Chiu's paper [BC83]: "For cyclic queries, finding good semi-join programs is likely to be quite difficult. Since cyclic queries are common too, we will...probably have to be satisfied with heuristic approaches." Instead of building a heuristic into the optimiser, our approach offers the end user control over the optimiser in the form of query annotations, which direct the optimiser to apply constraint checking to the annotated query atoms.
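As a concrete illustration of constraint checking during labelling, here is a minimal Python sketch of our own (it is not the paper's algorithm, and for brevity it checks a constraint only once all of its variables are labelled, rather than checking the projection after each labelling step):

def consistent(labelling, constraints):
    """Check every constraint all of whose variables are labelled so far."""
    for variables, relation in constraints:
        if all(v in labelling for v in variables):
            if tuple(labelling[v] for v in variables) not in relation:
                return False
    return True

def solve(domains, constraints, labelling=None):
    """Label the variables one at a time, checking constraints as soon as possible."""
    labelling = dict(labelling or {})
    unlabelled = [v for v in domains if v not in labelling]
    if not unlabelled:
        yield labelling
        return
    var = unlabelled[0]
    for value in domains[var]:
        labelling[var] = value
        if consistent(labelling, constraints):
            yield from solve(domains, constraints, labelling)

# Toy CSP: X, Y, Z over {1, 2, 3} with X < Y and Y < Z given as relations.
lt = {(a, b) for a in range(1, 4) for b in range(1, 4) if a < b}
domains = {"X": [1, 2, 3], "Y": [1, 2, 3], "Z": [1, 2, 3]}
constraints = [(("X", "Y"), lt), (("Y", "Z"), lt)]
print(list(solve(domains, constraints)))    # [{'X': 1, 'Y': 2, 'Z': 3}]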

5 Motivating Examples of Magic Checking

In this section we give two small examples of the application of constraint checking to query optimisation. The first example illustrates the simplest context in which constraint checking could possibly pay off. The example is surprising because it is so simple: there are no cycles in the query connection graph, for instance. Consider the goal

← p(X), q(X, Y), r(X, Z)

A traditional query evaluation plan for this goal is to perform two joins. However quite a different query evaluation plan is possible, if q and r are used additionally as constraints. We can express a plan in the following two rules, in which q and r are used both as database relations to contribute to the search space and as constraints to prune the search space:

pcheckqr(X) ← p(X), q(X, _), r(X, _)
ans(X, Y, Z) ← pcheckqr(X), q(X, Y), r(X, Z)
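The following Python fragment is our own small rendering of this two-rule plan over invented relations; the first rule amounts to two semi-joins that reduce p, and the second performs the joins only on the reduced relation:

p = {(1,), (2,), (3,)}
q = {(1, "a"), (1, "b"), (3, "c")}
r = {(1, "u"), (2, "v")}

# pcheckqr(X) <- p(X), q(X, _), r(X, _)   -- two semi-joins reduce p
q_keys = {x for (x, _) in q}
r_keys = {x for (x, _) in r}
pcheckqr = {(x,) for (x,) in p if x in q_keys and x in r_keys}

# ans(X, Y, Z) <- pcheckqr(X), q(X, Y), r(X, Z)
ans = {(x, y, z)
       for (x,) in pcheckqr
       for (x1, y) in q if x1 == x
       for (x2, z) in r if x2 == x}

print(pcheckqr)   # {(1,)} -- only X = 1 occurs in all of p, q and r
print(ans)        # {(1, 'a', 'u'), (1, 'b', 'u')}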

Although this plan is redundant, it can be superior to the previous plan under certain circumstances. Let us assume that each pair of the relations p, q and r contains some values for X that do not appear in the third relation. In this case it is easy to extend the relations q and r such that the latter plan is more efficient than any possible join ordering on p, q and r, by simply adding more and more tuples for each value of X, because the pairwise joins between p, q and r will include more and more tuples that do not appear in the final answer. The semi-joins introduced in the latter plan avoid at any stage producing an intermediate solution in which these tuples appear. The semi-join reduction algorithm of [BC83] finds the same optimisation for this query.

A more interesting example is that of crossword generation. The example described in [BLW94] has an empty 5x5 crossword grid. This innocent-looking CSP problem consists of a 134-word dictionary, each word with 5 letters. In CSP terms, the problem features 25 variables (one per square in the crossword grid), and 10 constraints, all defined by the same 134-tuple relation, enforcing that each of the 10 down and across slots be a word from the dictionary. Since the query has cycles it is not treated by the algorithm of [BC83]. The words and the answers are given as appendices to this paper. At the start of problem solving, the grid is filled with 25 free variables (see Table 1).

A B C D E
F G H I J
K L M N O
P Q R S T
U V W X Y

Table 1: Crossword

If we call w the table containing the dictionary and if we name the 25 variables from A to Y, the query denoting the crossword problem in Table 1 is represented by the following join sequence (or any permutation of it):

w(A,B,C,D,E) ⋈ w(F,G,H,I,J) ⋈ w(K,L,M,N,O) ⋈ w(P,Q,R,S,T) ⋈ w(U,V,W,X,Y) ⋈
w(A,F,K,P,U) ⋈ w(B,G,L,Q,V) ⋈ w(C,H,M,R,W) ⋈ w(D,I,N,S,X) ⋈ w(E,J,O,T,Y).

Obviously it is important to optimise this join sequence, so that words across alternate with words down. This can be achieved by standard optimisation techniques, and we will apply it to our example. An optimal ordering of goals for evaluating the query is:

w(A,B,C,D,E) ⋈ w(A,F,K,P,U) ⋈ w(B,G,L,Q,V) ⋈ w(F,G,H,I,J) ⋈ w(C,H,M,R,W) ⋈
w(K,L,M,N,O) ⋈ w(D,I,N,S,X) ⋈ w(P,Q,R,S,T) ⋈ w(E,J,O,T,Y) ⋈ w(U,V,W,X,Y)

which we shall refer to as the plan without constraints. However it is possible to use each word goal both as a relation and as a constraint. Adding these constraints by hand we produce a goal where the constraints are enforced by semi-joins:

w(A,B,C,D,E) ⋈ w(A,_,_,_,_) ⋈ w(B,_,_,_,_) ⋈ w(C,_,_,_,_) ⋈ w(D,_,_,_,_) ⋈ w(E,_,_,_,_) ⋈
w(A,F,K,P,U) ⋈ w(F,_,_,_,_) ⋈ w(K,_,_,_,_) ⋈ w(P,_,_,_,_) ⋈ w(U,_,_,_,_) ⋈
w(B,G,L,Q,V) ⋈ w(F,G,_,_,_) ⋈ w(K,L,_,_,_) ⋈ w(P,Q,_,_,_) ⋈ w(U,V,_,_,_) ⋈
w(F,G,H,I,J) ⋈ w(C,H,_,_,_) ⋈ w(D,I,_,_,_) ⋈ w(E,J,_,_,_) ⋈
w(C,H,M,R,W) ⋈ w(K,L,M,_,_) ⋈ w(P,Q,R,_,_) ⋈ w(U,V,W,_,_) ⋈
w(K,L,M,N,O) ⋈ w(D,I,N,_,_) ⋈ w(E,J,O,_,_) ⋈
w(D,I,N,S,X) ⋈ w(P,Q,R,S,_) ⋈ w(U,V,W,X,_) ⋈
w(P,Q,R,S,T) ⋈ w(E,J,O,T,_) ⋈
w(E,J,O,T,Y) ⋈ w(U,V,W,X,Y) ⋈
w(U,V,W,X,Y).

We call this the plan with constraints. Evaluating these plans, we found the plan with constraints to be an order of magnitude faster. The largest intermediate result produced during evaluation of the plan without constraints had around 170,000 tuples. The largest intermediate result produced during evaluation of the plan with constraints had around 20,000 tuples. Details of this experiment, using ECLiPSe-BANG and two major commercial database systems, can be found in [BLW94]. In the following section we shall present an optimisation system which achieves the behaviour of the plan with constraints.

6 Architecture for Magic Checking

In this section we will present the two main inputs to the magic checking system, and then briefly describe the components of the optimiser. The inputs are the language Vanessa used for expressing rules and queries, and the sideways information passing strategy (Sips) which broadly defines the order of evaluation.

6.1 The Vanessa Language

As input to magic checking, it is necessary to specify which goal atoms should be used as constraints. We therefore adapt the syntax introduced in [PW93] for application to database querying. The language Vanessa, introduced in [Bre94], is an extension of Datalog that allows both extensional and intensional predicates to be used as constraints. The syntax of Vanessa is that of Datalog with the added syntax that any atomic goal in the body of any clause can be annotated as a constraint. In discussing Vanessa programs we call annotated atoms constraint goals, or just constraints; ordinary un-annotated goals we call Datalog atomic goals, or just Datalog atoms. For generate-and-check evaluation the syntax of a constraint is check(Atom), where Atom is a Datalog atom. The Datalog part of a constraint check(Atom) is Atom. The Datalog part of a Datalog atom Atom is just Atom itself.

A query is expressed as a Vanessa program whose extensional relations are held in the database. The program includes an initial rule ans(X) ← Goal whose body Goal formalises the query. Atomic goals in the body may be defined by further rules. The example goal in section 5 above would be expressed as the following Vanessa program:

ans(X, Y, Z) ← p(X), check(q(X, Y)), check(r(X, Z))

p(X) is an atomic goal, and check(q(X, Y)) and check(r(X, Z)) are constraint goals.

6.2 Sideways Information Passing

The output from the Sips producer is a sideways information passing strategy. Intuitively a Sips broadly defines which relations should be accessed first in order to produce intermediate results that can be used to focus the remainder of the query. The first formalisation of a Sips is in [BR87]. The following formalisation is much simpler.

An adorned predicate is an intensional predicate with an associated mode declaration, which we shall term an "adornment", following [BR87]. The adornment states conditions on the arguments of the predicate before it can be evaluated. In the current paper the conditions state which arguments must be ground. We shall henceforth call these the input arguments. In database terms the input arguments are subject to a selection condition of the form Argument = Value. For queries involving selection conditions other than equalities, correspondingly more sophisticated adornments have been devised [Ram88], but they are beyond the scope of our current work.

Definition 1 A Sips for a rule defining an adorned predicate is a partial ordering on the Datalog parts of the atomic goals appearing in the rule.

Thus an appropriate Sips for the rule

ans(X, Y, Z) ← p(X), check(q(X, Y)), check(r(X, Z))

could be p(X) > q(X, Y) and p(X) > r(X, Z).

Recall that every variable in the head of a rule also occurs in its body, and that facts are all ground. Therefore after evaluating any atomic goal defined by rules and facts, either tuple-at-a-time or set-at-a-time, each argument subsequently has a ground value or, respectively, a ground value for each answer. Consequently information about the call pattern of each atomic goal appearing in the rule body can be deduced from the Sips. Those variables in an atomic goal which are also input arguments to the head of the clause, or appear in an earlier atomic goal (according to the Sips), are inputs.

Based on the call pattern for an atomic goal with a certain predicate, an adornment for that predicate can be deduced. If the same predicate p appears in several atomic goals with distinct call patterns, then the occurrences of the predicate can be renamed apart p^a1, p^a2, etc. Now each renamed predicate has the same definition as p, but a different adornment.¹ In the following we shall use a superscript a to denote an adornment, and write p^a for the renamed predicate p with adornment a. For a three-argument predicate whose first argument is required to be ground we shall write the adornment g. For example, if p is a predicate with three arguments, the renaming of p with this adornment will be written p^g.

If the query is represented by multiple rules, then the Sips producer first produces a Sips for the initial rule (with head ans(X)). Strictly the answer predicate should be annotated but, by an abuse of notation, the annotation of ans will be omitted. The output of this Sips will imply the input/output pattern of the arguments of each atomic goal in the rule body. When handling the rules defining the predicate for each such goal, therefore, the Sips producer can be applied to the appropriately adorned predicate.

6.3 Components

Our query execution system comprises four components: a Sips producer, a program transformer, a local database management system, and a query evaluator. In this paper we describe in detail only the program transformer, which performs the magic-checking transformation. The query evaluator passes subqueries to the DBMS, which optimises and evaluates them, writing the results to a temporary relation which is used by later subqueries. The intermediate results, recorded in the temporary relations, are the only form of communication between separate subqueries. Thus the database optimiser is free to optimise subqueries in any way. The role of the Magic Checking transformation is to divide the original query into subqueries and specify the contents of the intermediate solution produced by each subquery.

- The Sips producer is a pre-optimiser which takes as input information about the size of each database relation and indexing information. Based on this input it produces an appropriate Sips for the Vanessa program defining the query. The optimisation techniques required here are well-known. Warren describes precisely this form of optimisation, applied to logic programming as a database query language, in [War81].

- The program transformer takes as input the Vanessa program and the Sips. It produces a Datalog program and a set of rules for controlling their evaluation against the database. The generation of these rules - which are unproblematic for non-recursive programs - is not described in this paper.

- The local database optimiser takes each rule of the Datalog program separately and locally optimises it for the database.

- The query evaluator applies the optimised rules to the database in a forward-chaining mode according to the control rules mentioned above.

¹ This can alternatively be avoided by weakening the mode declaration until it is consistent with every call pattern.

7 A Specification of Magic Checking

Magic checking is the process that takes as input a Vanessa program and a Sips, and produces a Datalog program and a control expression. The process involves two steps:

1. For each rule in the Vanessa program, the significant atomic goals in the body are identified. These are termed intermediate solution goals.
2. For each Vanessa rule, the output Datalog rules are generated.

We describe the steps in turn. As a working example we shall use a highly simplified application of database querying to configuring a vehicle. The query input is a vehicle specification in the form of its required performance P, style S, luxury L and export market M. These specifications can be satisfied in combination with different types of fuel F. These choices dictate alternative components that need to be fitted to the vehicles, and these components dictate electrical and fuel requirements - perf(P, E1, F), style(S, E2) and lux(L, M, F). The electrical requirements must be compatible according to local market standards - ok(M, E1, E2). The simplified query is as follows:

ans(P, S, L, M, F) ← perf(P, E1, F), style(S, E2), lux(L, M, F), check(ok(M, E1, E2))

P, S, L, M are input arguments. For purposes of illustration the predicate style is intensional, and the others are extensional. The Sips producer has produced the following Sips for the query:

perf(P, E1, F) > lux(L, M, F) > style(S, E2)

(If all the arguments of a constraint appear in Datalog atoms, which is the case for check(ok(M, E1, E2)) in this example, it will typically not appear in an optimal Sips.)
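Given this Sips and the input arguments P, S, L, M of the head, the call patterns described in section 6.2 can be deduced mechanically. The following Python sketch is our own: it assumes one admissible linearisation of the Sips, and it writes adornments with one letter per argument position (g for an argument that must be ground at call time, f for free), a common convention; the paper abbreviates these, writing for instance style^g for the adorned style predicate.

def adornments(head_inputs, sips_ordered_body):
    """For each body atom, in Sips order, derive its adornment: 'g' for an
    argument that is ground (an input) at call time, 'f' otherwise."""
    known = set(head_inputs)
    result = []
    for predicate, args in sips_ordered_body:
        result.append((predicate, "".join("g" if a in known else "f" for a in args)))
        known.update(args)            # after evaluation every argument is ground
    return result

# Running example: head ans(P,S,L,M,F) with inputs P,S,L,M and the Sips
# perf(P,E1,F) > lux(L,M,F) > style(S,E2), taken here as a total order.
body = [("perf", ("P", "E1", "F")),
        ("lux", ("L", "M", "F")),
        ("style", ("S", "E2"))]
print(adornments({"P", "S", "L", "M"}, body))
# [('perf', 'gff'), ('lux', 'ggg'), ('style', 'gf')]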

7.1 Intermediate Solution Goals

Intermediate solution goals are so-called because their results will be stored as intermediate solutions. Each such atomic goal will be associated with a separate subquery, and will define the result of the subquery that will be temporarily stored by the database management system. The atomic goals between the intermediate solution goals, by contrast, are passed unchanged to the local database optimiser. In other words our transformation leaves optimisations, as far as possible, up to the local database optimiser. There are three kinds of intermediate solution goals:

- Fork Atom: an atomic goal which has more than one immediate successor in the Sips ordering. Handled by simply materialising its solutions.

- Check Atom: an atomic goal which shares non-input variables with constraint goals (other than itself). Handled by checking with the appropriate constraints.

- Magic Atom: an atomic goal whose predicate is intensional. Handled by magic set techniques.

Naturally an intermediate solution goal can be any combination of these. In our running example the intermediate solution goals are:

- perf(P, E1, F), because it shares a variable with the constraint check(ok(M, E1, E2)): thus it is a Check goal.

- style(S, E2), because it has an intensional predicate and because it shares a variable with the constraint check(ok(M, E1, E2)): thus it is both a Magic goal and a Check goal.

The intermediate solution goals are given unique new predicate names. In the following, if b is the predicate name associated with an intermediate solution goal, we shall use b′ for its new name. In our running example the new names are perf′ and style′.

Common to all three kinds of intermediate solution goal above is the need to record the values of all the variables in the current partial solution, to avoid recomputing parts of the solution again. This corresponds to the use of supplementary magic sets [SZ86] in earlier work on magic sets. For each intermediate solution goal b(X) in a Vanessa rule body we identify three sets of variables:

- The environment variables: variables appearing in earlier atomic goals (according to the Sips), or as inputs to the rule head, which also appear in b(X), in some later atomic goal (according to the Sips), or in the rule head.

- The bequeathed variables: arguments of b(X) and environment variables that also appear in some later atomic goal, or the rule head.

- The input variables: recall that these are the environment variables which appear in b(X).

For perf(P, E1, F) in our example, the environment variables are P, S, L, M, because they are the inputs to ans(P, S, L, M, F). The bequeathed variables from perf(P, E1, F) are P, E1, F, S, L, M, because they appear in style(S, E2), ok(M, E1, E2) or ans(P, S, L, M, F). Finally the input variable to perf(P, E1, F) is just P, because its other arguments, E1 and F, are not environment variables. For style(S, E2) the environment variables are P, E1, S, L, M, F, the bequeathed variables are P, S, L, M, F, and the input variable is S.

8 Transformation for Constraint Checking

Given a Sips for a Vanessa clause, its magic implementation comprises the rules derived from the clause, and extra rules for evaluating each intermediate solution goal in the body. For each (renamed) intermediate solution goal b′(Z) we introduce a new predicate done_b′(B), whose arguments are the bequeathed variables of b(Z). done_b′(B) is used to collect up the results of evaluating b(Z) and insert them into the context - i.e. join them with the previous partial solutions. In our running example the new predicates are done_perf′(P, E1, F, S, L, M) and done_style′(P, S, L, M, F). The definitions of done_b′ for the different kinds of intermediate solution goal are presented in sections 8.2.1, 8.2.2 and 8.2.3 below.

8.1 Magic Checking Rules for the Input Vanessa Clause

Suppose the input Vanessa clause has head p^a(W). Let last1, ..., lastk be the last intermediate solution goals (those which have no intermediate solution goals among their direct or indirect successors), and let g1(Y1), ..., gm(Ym) be the Datalog part of the (non-intermediate-solution) goals that succeed them, directly or indirectly. If, for each j, Bj are the variables bequeathed from lastj, the following rule is produced by our transformation:

p^a(W) ← done_last1′(B1), ..., done_lastk′(Bk), g1(Y1), ..., gm(Ym)

In our running example the last intermediate solution node is style(S, E2), and the resulting rule is:

ans(P, S, L, M, F) ← done_style′(P, S, L, M, F)

8.2 Magic Checking Rules for the Intermediate Solution Goals

Let us define the predecessor goals of an intermediate solution node.

We first treat intermediate solution goals for which there are no previous intermediate solution goals according to the Sips. Let g1(Y1), ..., gm(Ym) be the Datalog part of the atomic goals preceding b according to the Sips. If the Vanessa clause has head p^a(W), and I are the input arguments to p^a, the predecessor goals are

magic_p^a(I), g1(Y1), ..., gm(Ym).

We now treat an arbitrary intermediate solution goal b(Z), whose immediately preceding intermediate solution goals (according to the Sips) are prev1, ..., prevk. Let the Datalog part of the non-intermediate-solution goals sandwiched between b(Z) and prev1 or ... or prevk be g1(Y1), ..., gm(Ym). If Bj are the variables bequeathed by prevj, the predecessor goals of b are

done_prev1′(B1), ..., done_prevk′(Bk), g1(Y1), ..., gm(Ym).

Proposition 1 Consider an intermediate solution goal b(Z ).

Env is a subset of V . I is a subset of Env. B is a subset of Env [ Z .

This proposition follows from the de nitions of Env; V ; I and B , and it ensures that the Datalog rules produced by our transformation are admissible: variables appearing in the head of any rule also appear in its body. 8.2.1 Magic Checking Rules for Fork Goals

If b(Z) is a Fork goal, the following rule is produced by the transformation:

done_b′(B) ← p1(X1), ..., pm(Xm), b(Z)

8.2.2 Magic Checking Rules for Check Goals

If b(Z ) is a Check goal, check each constraint check(ci ) which shares a variable with an outgoing edge from b(Z ). Write ci (Y i ) for the atomic goal which results from renaming to \ " all the arguments of ci which do not appear in Env [ Z . 00

15

The following rule is produced by the transformation:

done_b′(B) ← p1(X1), ..., pm(Xm), b(Z), c″1(Y1), ..., c″n(Yn)

done perf (P; E 1; F; S; L; M )

magic ans(P; S; L; M ); perf (P; E 1; F ); ok(M; E 1; ) Notice the constraint check ok(M; E 1; ) in the rule body. 0

8.2.3 Magic Checking Rules for Magic Goals

For each intermediate solution goal b(Z ) of type magic, three new rules are added. The rst rule collects the environment into a supplementary magic predicate: suppmagic b (Env) p1 (X 1 ); : : : pm(X m ) The magic rule is: 0

magic ba (I )

suppmagic b (Env ) 0

The rule which adds the environment back to the result to produce the intermediate solution with all the bequeathed variables is:

done b (B ) 0

ba (Z ); suppmagic b (Env) 0

Notice that in the body of this rule the annotated predicate ba is used: the annotation is that deduced from the Sips. In accordance with this transformation, to start a query execution it is necessary to add one or more facts for the predicate magic ans to the database. These facts record the input values for the ans predicate, or in database terms they specify the selection conditions in the query. In case the intermediate solution goal is both a Magic goal and a Fork goal, exactly the same rules are produced by the transformation. In case it is both a Magic goal and a Check goal, the constraint checks (as described in the previous section) are added to the third rule above. In case it is all three (a Magic goal, a Check goal and a Fork goal), the same rules are produced as for a combined Magic goal and Check goal. For style(S; E 2) in our running example, which is both a Magic goal and a Check goal, the rules produced are:

suppmagic style (P; E 1; F; S; L; M )

done perf (P; E 1; F; S; L; M ); lux(L; M; F )

0

0

16

magic_style^g(S) ← suppmagic_style′(P, E1, F, S, L, M)
done_style′(P, S, L, M, F) ← style^g(S, E2), ok(M, E1, E2), suppmagic_style′(P, E1, F, S, L, M)

As a last reference to our running example, a query is fired by adding a fact such as magic_ans(topperf, saloonstyle, familylux, japanmarket) to the database.
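To see the transformed rules at work, here is a small bottom-up run in Python (our own sketch; every tuple in the relations below is invented for illustration, and since the paper does not list the rules defining the intensional predicate style, the sketch simply treats its relation as available and shows only the magic_style filter being applied to it):

perf  = {("topperf", "e12v", "diesel"), ("topperf", "e24v", "petrol"),
         ("economy", "e12v", "petrol")}
lux   = {("familylux", "japanmarket", "diesel"),
         ("familylux", "japanmarket", "petrol")}
ok    = {("japanmarket", "e12v", "s12v"), ("usmarket", "e24v", "s24v")}
style = {("saloonstyle", "s12v"), ("saloonstyle", "s24v"), ("sportstyle", "s24v")}

# The query is fired by a magic_ans fact recording the selection conditions.
magic_ans = {("topperf", "saloonstyle", "familylux", "japanmarket")}

# done_perf'(P,E1,F,S,L,M) <- magic_ans(P,S,L,M), perf(P,E1,F), ok(M,E1,_)
done_perf = {(p, e1, f, s, l, m)
             for (p, s, l, m) in magic_ans
             for (p2, e1, f) in perf if p2 == p
             if any(m2 == m and e12 == e1 for (m2, e12, _) in ok)}

# suppmagic_style'(P,E1,F,S,L,M) <- done_perf'(P,E1,F,S,L,M), lux(L,M,F)
suppmagic_style = {t for t in done_perf if (t[4], t[5], t[2]) in lux}

# magic_style(S) <- suppmagic_style'(P,E1,F,S,L,M)
magic_style = {(t[3],) for t in suppmagic_style}

# style^g: the style relation restricted by the magic set
style_g = {(s, e2) for (s, e2) in style if (s,) in magic_style}

# done_style'(P,S,L,M,F) <- style^g(S,E2), ok(M,E1,E2), suppmagic_style'(P,E1,F,S,L,M)
done_style = {(p, s, l, m, f)
              for (s, e2) in style_g
              for (m, e1, e22) in ok if e22 == e2
              for (p, e12, f, s2, l, m2) in suppmagic_style
              if e12 == e1 and s2 == s and m2 == m}

# ans(P,S,L,M,F) <- done_style'(P,S,L,M,F)
ans = done_style
print(ans)   # {('topperf', 'saloonstyle', 'familylux', 'japanmarket', 'diesel')}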

8.3 Application of Magic Checking to Crossword Generation

This transformation can be directly applied to the crossword example. We give here a simplified version for illustration. Suppose the original query - for a 3x3 crossword - is as follows:

ans(A, B, C, D, E, F, G, H, I) ← word(A, B, C), word(D, E, F), word(G, H, I), check(word(A, D, G)), check(word(B, E, H)), check(word(C, F, I))

Let us assume the Sips produced is word(A, B, C) > word(D, E, F) > word(G, H, I). The magic checking transformation yields the following set of rules:

done_word1(A, B, C) ← magic_ans, word(A, B, C), word(A, _, _), word(B, _, _), word(C, _, _)
done_word2(A, B, C, D, E, F) ← done_word1(A, B, C), word(D, E, F), word(A, D, _), word(B, E, _), word(C, F, _)
done_word3(A, B, C, D, E, F, G, H, I) ← done_word2(A, B, C, D, E, F), word(A, D, G), word(B, E, H), word(C, F, I)
ans(A, B, C, D, E, F, G, H, I) ← done_word3(A, B, C, D, E, F, G, H, I)
magic_ans ←
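These rules can be run bottom-up as they stand. The following Python sketch of our own does so over a tiny invented three-letter dictionary (not the 134-word relation of the appendix); storing words as letter triples turns the anonymous arguments of the semi-join checks into membership tests on one- and two-letter prefixes.

words = {"bit", "ice", "ten", "ace", "cat"}
word = {tuple(w) for w in words}                  # word(A, B, C)
starts = {w[:1] for w in word}                    # word(X, _, _)
pairs = {w[:2] for w in word}                     # word(X, Y, _)

magic_ans = True                                  # the 0-ary magic fact

# done_word1(A,B,C) <- magic_ans, word(A,B,C), word(A,_,_), word(B,_,_), word(C,_,_)
done_word1 = {w for w in word if magic_ans
              and all((x,) in starts for x in w)}

# done_word2(A..F) <- done_word1(A,B,C), word(D,E,F), word(A,D,_), word(B,E,_), word(C,F,_)
done_word2 = {(a, b, c, d, e, f)
              for (a, b, c) in done_word1
              for (d, e, f) in word
              if (a, d) in pairs and (b, e) in pairs and (c, f) in pairs}

# done_word3(A..I) <- done_word2(A..F), word(A,D,G), word(B,E,H), word(C,F,I)
done_word3 = {(a, b, c, d, e, f, g, h, i)
              for (a, b, c, d, e, f) in done_word2
              for (a2, d2, g) in word if (a2, d2) == (a, d)
              for (b2, e2, h) in word if (b2, e2) == (b, e)
              for (c2, f2, i) in word if (c2, f2) == (c, f)}

ans = done_word3
for square in sorted(ans):        # prints the two squares bit/ice/ten and cat/ace/ten
    print(square)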

9 Conclusion

This paper reports the results of applying program transformation techniques for a form of database query optimisation. The transformation achieves the effect of constraint checking in a set-at-a-time query evaluation scheme. The optimisation is completely compatible with standard magic-set optimisations and so both optimisations can be applied to a single query. However the transformation proposed here leaves local optimisation, as far as possible, up to the local database query optimiser. The approach makes it quite realistic to apply constraint checking techniques to current generation commercial database systems. We have recently applied the technique using an ECLiPSe-Oracle connection that we developed for supporting constraint programming in a database application (in the Chronos project [ILIU94]). The transformation allowed Oracle (release 7) to efficiently optimise and execute the 5x5 crossword query.

Adding control expressions, as introduced in [BRSS90], makes it possible to control evaluation of the transformed program and support repeated evaluation until a fixpoint is reached. With this addition we are currently extending the magic checking transformation to a magic propagation technique. The architecture, and indeed most of the magic rules, remain the same for magic propagation, but many extra checking rules are needed to capture set-at-a-time propagation. We intend to report on this technique in a forthcoming paper, and to experiment with it using the ECLiPSe-Oracle connection mentioned above.

Appendix: The Word Relation

This is the set of words used in the example of section 5 above.

abbas alban cline needs arara bravo itala acara nonyl laser notal rebud rerun sayal

abdal allow demit neeld ardeb braze oraon acate obese layer noter rebut retan skeet

abeam anana denim nodal areek breba oriel alate aaron maral onset recon riata sowel

abele anele emend nonet arena breme osela alosa abram marka rabat redan robin tates

abner anent idiot ochna arete breva ottar blore abrim metal racon refel rodge

acana ankle lader omina aroma broma ovile clite alula nasal ramed reges roper

acoin bacon manny rudas arulo brose ovoid enate amply natal range renes rotal

addie badon monal araca award buret ovula inert baton nathe raspy renky salal

aegle befit namer arain brace crore ozark neese betis netty rated repew salay

alada benin nanny aramu brava eurus abase nenta bosun newel rater rerow sandy

With these words it is possible to generate 72 different 5x5 crosswords.

References

[BC83] P. Bernstein and D. Chiu. Using semi-joins to solve relational queries. Journal of the ACM, 28(1):25-40, January 1983.


[BFL+95] P. Brisset, T. Fruhwirth, P. Lim, M. Meier, T. Le Provost, J. Schimpf, and M. Wallace. ECLiPSe 3.5 extension user manual. CHIC Deliverable D.6.2.4, ECRC, Arabellastr 17, Munich, Germany, 1995.

[BFMY83] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis. On the desirability of acyclic database schemes. Journal of the ACM, 30(3):479-513, July 1983.

[BLW94] S. Bressan, T. Le Provost, and M. Wallace. Towards the application of generalised propagation to database query processing. CHIC Deliverable D.5.2.3.5, ECRC, Arabellastr 17, Munich, Germany, 1994.

[BMSU86] F. Bancilhon, D. Maier, Y. Sagiv, and J. Ullman. Magic sets and other strange ways to implement logic programs. In Proc. 5th ACM Symposium on Principles of Database Systems, 1986.

[BR86] F. Bancilhon and R. Ramakrishnan. An amateur's introduction to recursive query processing. In Proc. SIGMOD, 1986. Invited paper.

[BR87] C. Beeri and R. Ramakrishnan. On the power of magic. In Proc. 6th PODS, 1987.

[Bre94] S. Bressan. Database query optimisation and evaluation as constraint satisfaction problem solving. In P. Revesz, D. Srivastava, P. Stuckey, and D. Sudarshan, editors, Proc. Post-ILPS Workshop on Constraints and Databases, Ithaca, NY, November 1994. Published as Technical Report UNL-CSE-94-025, Univ. of Nebraska.

[BRSS90] C. Beeri, R. Ramakrishnan, D. Srivastava, and S. Sudarshan. Magic implementation of stratified logic programs. Technical Report, August 1990.

[CGM88] U. S. Chakravarthy, J. Grant, and J. Minker. Foundations of semantic query optimization for deductive databases. In J. Minker, editor, Foundations of Deductive Databases and Logic Programming, pages 243-273. Morgan Kaufmann, 1988.

[Dec92] Rina Dechter. Constraint networks. In Stuart C. Shapiro, editor, Encyclopedia of Artificial Intelligence, pages 276-285. Wiley, 1992. Volume 1, second edition.

[DvS+88] M. Dincbas, P. van Hentenryck, H. Simonis, A. Aggoun, T. Graf, and F. Berthier. The Constraint Logic Programming Language CHIP. In Proceedings of the International Conference on Fifth Generation Computer Systems FGCS-88, pages 693-702, Tokyo, Japan, December 1988.

[ECR93] ECRC. ECLiPSe knowledge base user manual. Technical report, ECRC, Arabellastr 17, 81925 Munich, 1993.

[GB65] S.W. Golomb and L.D. Baumert. Backtrack programming. Journal of the ACM, 12:516-524, 1965.

[HZ80] M. Hammer and S. Zdonik. Knowledge-based query processing. In Proc. 6th VLDB Conference, Montreal, 1980.

[ILIU94] ICL, Lloyd's Register, Imperial College, and UMIST. CHRONOS DTI/EPSRC project no. 8028. UK government funded Scientific Research Project, 1994.

[ILO95] ILOG. ILOG Solver users manual. Technical report, ILOG, 1995.

[Kin81] J.J. King. Quist: A system for semantic query optimisation in relational databases. In Proc. 7th VLDB Conference, 1981.

[Kum92] Vipin Kumar. Algorithms for constraint-satisfaction problems: A survey. A.I. Magazine, 13(1):32-44, Spring 1992.

[Mac77] A.K. Mackworth. Consistency in networks of relations. Artificial Intelligence, 8(1):99-118, 1977.

[Me95] M. Meier et al. ECLiPSe 3.5. Technical Report ECRC/ECLIPSE, ECRC, Arabellastr 17, 81725 Munich, 1995.

[MH86] R. Mohr and T.C. Henderson. Arc and path consistency revisited. Artificial Intelligence, 28:225-233, 1986.

[Mon74] U. Montanari. Networks of constraints: fundamental properties and applications to picture processing. Information Science, 7(2):95-132, 1974.

[PW93] Thierry Le Provost and Mark Wallace. Generalised constraint propagation over the CLP scheme. Journal of Logic Programming, 16(3):319-360, July 1993.

[Ram88] R. Ramakrishnan. Magic templates: A spellbinding approach to logic programs. In Proc. International Conference on Logic Programming, Seattle, Washington, 1988.

[RLK86] J. Rohmer, R. Lescoeur, and J.-M. Kerisit. The Alexander method: A technique for the processing of recursive axioms in deductive databases. New Generation Computing, 4(3), 1986.

[SZ86] D. Sacca and C. Zaniolo. The generalised counting method for recursive logical queries. In Proc. 1st International Conference on Database Theory, 1986.

[Ull82] J. D. Ullman. Principles of Database Systems. Pitman, second edition, 1982.

[Van89] Pascal Van Hentenryck. Constraint Satisfaction in Logic Programming. Logic Programming Series. MIT Press, Cambridge, MA, 1989.

[Vie89] L. Vieille. Recursive query processing: The power of logic. Theoretical Computer Science, 69(1), December 1989.

[Wal95] Mark Wallace, editor. Proc. Conf. on Practical Applications of Constraints Technology, Paris, 1995.

[War81] D.H.D. Warren. Efficient processing of interactive relational database queries expressed in logic. In Proc. 7th VLDB, Cannes, 1981.

[WL95] M. Wallace and T. Le Provost. Propia - final implementation. CHIC Deliverable D.5.2.3.6, ECRC, Arabellastr 17, Munich, Germany, 1995.

[WY76] E. Wong and K. Youssefi. Decomposition - a strategy for query processing. ACM Transactions on Database Systems, 1(3):223-241, 1976.
