Exploiting Uniqueness in Query Optimization

G. N. Paulley
Per-Åke Larson
Department of Computer Science, University of Waterloo
Waterloo, Ontario, Canada N2L 3G1

Abstract

Consider an SQL query that specifies duplicate elimination via a Distinct clause. Because duplicate elimination often requires an expensive sort of the query result, it is often worthwhile to identify unnecessary Distinct clauses and avoid the sort altogether. We prove a necessary and sufficient condition for deciding if a query requires duplicate elimination. The condition exploits knowledge about keys, table constraints, and query predicates. Because the condition cannot always be tested efficiently, we offer a practical algorithm that tests a simpler, sufficient condition. We consider applications of this condition for various types of queries, and show that we can exploit this condition in both relational and nonrelational database systems.

1 Introduction

SQL queries that contain Distinct are common enough to warrant special consideration by commercial query optimizers because duplicate elimination often requires an expensive sort of the query result. It is worthwhile, then, for an optimizer to identify redundant Distinct clauses to avoid the sort operation altogether. Example 1 illustrates a situation where a Distinct clause is unnecessary.

Example 1

Consider the query

    SELECT DISTINCT S.SNO, P.PNO, P.PNAME
    FROM SUPPLIER S, PARTS P
    WHERE S.SNO = P.SNO AND P.COLOR = 'RED'

which lists all red parts and the numbers of suppliers that supply them; the tables PARTS and SUPPLIER are defined in Figure 1. The Distinct in the query's Select clause is unnecessary because each tuple in the result is uniquely identified by SNO and PNO, the primary key of PARTS. Conversely, Example 2 presents a case where duplicate elimination must be performed.

Example 2

Consider a query that lists supplier names who supply red parts:

    SELECT DISTINCT S.SNAME, P.PNO, P.PNAME
    FROM SUPPLIER S, PARTS P
    WHERE S.SNO = P.SNO AND P.COLOR = 'RED'.

In this case, duplicate elimination is required because two suppliers, who coincidentally have the same supplier name, could both supply the same part.

    SUPPLIER (SNO, SNAME, SCITY, BUDGET, STATUS)
    PARTS    (SNO, PNO, PNAME, OEM-PNO, COLOR)
    AGENTS   (SNO, ANO, ANAME, ACITY)

Figure 1: Hypothetical supplier database used throughout this paper as an example schema. Primary keys are in italics. Tuples in PARTS reference the SUPPLIER who supply them; tuples in AGENTS reference the SUPPLIER they represent.

These two examples raise the following questions:

- Under what conditions is duplicate elimination unnecessary?
- Are there other types of queries where duplicate analysis enables alternate execution strategies?
- If so, when are these other execution strategies beneficial, in terms of query performance?

This paper explores the first two questions. Our main theorem provides a necessary and sufficient condition for deciding when duplicate elimination is unnecessary. The condition cannot always be tested efficiently, so we offer a practical algorithm that tests a simpler, sufficient condition.

The rest of the paper is organized as follows. Section 2 defines, in terms of SQL, the class of relational algebra expressions that we consider. Section 3 formally defines functional dependencies and contains the main theorem. Section 4 presents our algorithm for detecting when duplicate elimination is redundant for a large subset of possible queries. Section 5 illustrates some applications of duplicate analysis; we consider transformations of SQL queries using schema information such as constraints and candidate keys. Section 6 summarizes related research, and Section 7 presents a summary and lists some directions for future work.
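Examples 1 and 2 can be replayed on a small instance. The sketch below is only an illustration, not part of the paper's method: it uses Python's `sqlite3` module, keeps only the columns the two queries touch, and populates the tables with made-up rows in which two suppliers share a name. A single instance cannot prove that Distinct is redundant in general; it can only show the contrast between the two projection lists.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SUPPLIER (SNO INTEGER PRIMARY KEY, SNAME TEXT);
    CREATE TABLE PARTS (SNO INTEGER, PNO INTEGER, PNAME TEXT, COLOR TEXT,
                        PRIMARY KEY (SNO, PNO));
    -- Hypothetical data: two suppliers with the same name supply part 10.
    INSERT INTO SUPPLIER VALUES (1, 'Acme'), (2, 'Acme');
    INSERT INTO PARTS VALUES (1, 10, 'bolt', 'RED'), (2, 10, 'bolt', 'RED');
""")

def rows(select_list, distinct):
    """Run the example query and return its result as a multiset."""
    keyword = "DISTINCT" if distinct else "ALL"
    sql = (f"SELECT {keyword} {select_list} FROM SUPPLIER S, PARTS P "
           "WHERE S.SNO = P.SNO AND P.COLOR = 'RED'")
    return Counter(conn.execute(sql).fetchall())

# Example 1 projects the key columns, Example 2 projects SNAME instead of SNO.
example1_same = rows("S.SNO, P.PNO, P.PNAME", True) == rows("S.SNO, P.PNO, P.PNAME", False)
example2_same = rows("S.SNAME, P.PNO, P.PNAME", True) == rows("S.SNAME, P.PNO, P.PNAME", False)
print(example1_same, example2_same)
```

On this instance the Example 1 query returns the same multiset with and without DISTINCT, while the Example 2 query does not.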

2 Class of SQL queries considered

In this paper, we consider a subset of SQL2 [9] queries for which our semantic query optimization techniques, described below, may be beneficial. Following the SQL2 standard, queries corresponding to query specifications consist of only the algebraic operators selection, projection, and extended Cartesian product. Without loss of generality, we assume that the From clause consists of only two tables, $R$ and $S$, since we can rewrite a query involving three or more tables in terms of two. The selection condition in a Where clause is expressed as $C_R \wedge C_S \wedge C_{R,S}$ (see Table 1). Subqueries, involving existential or universal quantification, are permitted in all three selection predicates. Query specifications may not contain a Group By or Having clause, involve aggregation operators, or contain arithmetic expressions.

In addition to query specifications we also consider a subset of SQL2 query expressions. These expressions involve two query specifications related by one of these algebraic operators: INTERSECT, INTERSECT ALL, EXCEPT, and EXCEPT ALL. We assume the two query specifications produce derived tables that are union-compatible. In summary, we consider both query specifications, of the familiar Select-From-Where variety, and query expressions that match the following basic syntax:

    SELECT [ALL/DISTINCT] $A$ FROM $R, S$ WHERE $C_R \wedge C_S \wedge C_{R,S}$
    INTERSECT or INTERSECT ALL or EXCEPT or EXCEPT ALL
    SELECT [ALL/DISTINCT] $A$ FROM $X, Y$ WHERE $C_X \wedge C_Y \wedge C_{X,Y}$.

Our objective is to determine under what conditions these queries can benefit from semantic query optimization techniques that test for duplicate rows.

2.1 Constraints

The semantic information we can exploit for query rewrite optimization consists of column constraint definitions and table constraint definitions in the SQL2 standard. A column constraint may specify a particular domain or a Check clause, which defines a search condition that must always be satisfied. For example, the check constraint definition for column SNO in SUPPLIER could be CHECK(SNO BETWEEN 1 AND 499). A tuple in SUPPLIER violates this constraint if the value of its SNO attribute lies outside this range. Check constraints on domains are identical in form to Check constraints on columns and typically specify ranges of possible values. Table constraints in SQL2 subsume column and domain constraints, so we need consider only table constraints in this paper.

    Symbol        Definition
    $\alpha(R)$   Attributes of table $R$
    $A$           Set of query projection attributes
    $C_R$         CNF predicate on attributes of $R$
    $C_{R,S}$     CNF predicate involving columns of both $R$ and $S$
    $h$           The set $\{h_1, h_2, \ldots, h_n\}$ of host variables in a query predicate
    $I(R)$        An instance of table $R$
    $T_R$         Set of table constraints on table $R$ in CNF
    $U_i(R)$      Uniqueness constraint $i$ (candidate key) on table $R$; returns the attributes of the candidate key

Table 1: Summary of symbolic notation.

We consider two forms of table constraints. Check constraints on base tables in SQL2 identify conditions for columns in a table that must always be true. For example, our SUPPLIER table might be defined as follows:

    CREATE TABLE SUPPLIER (
        SNO ..., SNAME ..., SCITY ..., BUDGET ..., STATUS ...,
        PRIMARY KEY (SNO),
        CHECK (SNO BETWEEN 1 AND 499),
        CHECK (SCITY IN ('Chicago', 'New York', 'Toronto')),
        CHECK (BUDGET <> 0 OR STATUS = 'Inactive'))

which specifies Check conditions on attributes SNO and SCITY and an implication constraint relating BUDGET to STATUS. Since these conditions must always be satisfied, the query

    SELECT * FROM SUPPLIER
    WHERE SNO BETWEEN 1 AND 499
      AND SCITY IN ('Chicago', 'New York', 'Toronto')
      AND (BUDGET <> 0 OR STATUS = 'Inactive')

must return all tuples of SUPPLIER. This means that we can add any table constraint to a query without changing the query result.

The second type of constraint we consider is a unique specification that identifies primary or candidate keys. Each column of a primary key identified by the PRIMARY KEY clause cannot contain a Null value. The Unique clause defines additional candidate keys; unlike primary keys, however, components of a candidate key may be Null. Essentially, SQL2 treats Null key values as a 'special' value. For example, if table $R$ has a single-attribute candidate key $K$, only one tuple in $R$ may have $K$ equal to Null. The table definition for the PARTS table:

    CREATE TABLE PARTS (
        SNO ..., PNO ..., PNAME ..., OEM-PNO ..., COLOR ...,
        PRIMARY KEY (SNO, PNO),
        UNIQUE (OEM-PNO),
        CHECK (SNO BETWEEN 1 AND 499))

defines OEM-PNO as a candidate key; any instance of PARTS may have only one tuple with OEM-PNO = NULL.
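The claim that a table constraint can be added to a query without changing its result can be tried out with a small sketch. It uses Python's `sqlite3` module and made-up rows, and the column types are assumptions because the paper elides them. Every column is declared NOT NULL here: a CHECK constraint accepts a row whose condition evaluates to unknown, whereas a Where clause rejects it, so the two only coincide when Null values are ruled out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SUPPLIER (
        SNO INTEGER NOT NULL, SNAME TEXT NOT NULL, SCITY TEXT NOT NULL,
        BUDGET INTEGER NOT NULL, STATUS TEXT NOT NULL,
        PRIMARY KEY (SNO),
        CHECK (SNO BETWEEN 1 AND 499),
        CHECK (SCITY IN ('Chicago', 'New York', 'Toronto')),
        CHECK (BUDGET <> 0 OR STATUS = 'Inactive'))
""")
# Hypothetical rows that satisfy all three CHECK constraints.
conn.executemany("INSERT INTO SUPPLIER VALUES (?, ?, ?, ?, ?)", [
    (1, "Acme", "Chicago", 500, "Active"),
    (2, "Bolt Co", "Toronto", 0, "Inactive"),
])

# A row outside the SNO range is rejected by the engine.
try:
    conn.execute("INSERT INTO SUPPLIER VALUES (900, 'Zed', 'Chicago', 1, 'Active')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

all_rows = conn.execute("SELECT * FROM SUPPLIER ORDER BY SNO").fetchall()
constrained = conn.execute("""
    SELECT * FROM SUPPLIER
    WHERE SNO BETWEEN 1 AND 499
      AND SCITY IN ('Chicago', 'New York', 'Toronto')
      AND (BUDGET <> 0 OR STATUS = 'Inactive')
    ORDER BY SNO
""").fetchall()
print(rejected, all_rows == constrained)
```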

2.2 An algebra for SQL queries

As a shorthand notation for our subset of SQL, we define an algebra whose operations are defined by SQL statements themselves [1,24]. By doing so, we avoid the need to show the semantic equivalence between our algebra and SQL. Note that 'conventional' relational algebra transformations do not necessarily apply to this algebra, which is based on multisets. We define our relational algebra for query specifications as follows:

- $R \times S$: the extended Cartesian product of tables $R$ and $S$ defined by the query SELECT * FROM R, S.
- $\sigma[C](R)$: select all rows of $R$ that satisfy condition $C$ as defined by the SQL statement SELECT * FROM R WHERE C. Conjuncts in $C$ may contain host variables whose values are available only at execution time. Selection does not eliminate duplicate rows. If condition $C$ evaluates to unknown, then SQL interprets $C$ as false.
- $\pi_d[A](R)$, where $d$ is either Dist or All: project $R$ onto attributes $A$, eliminating duplicates if $d$ = Dist and retaining duplicates if $d$ = All. The projection operator is defined by the SQL statement SELECT [ALL/DISTINCT] A FROM R.

For query expressions, we define these operators:

- $R \cap_d S$: intersect union-compatible tables $R$ and $S$, eliminating duplicates if $d$ = Dist and following the semantics of INTERSECT ALL if $d$ = All. Our intersection operator is equivalent to the SQL expression SELECT * FROM R INTERSECT [ALL] SELECT * FROM S. SQL2's INTERSECT ALL works as follows. Let $r_0$ be a row that occurs in $R$, $S$, or both $R$ and $S$. Let $j$ be the number of occurrences (possibly 0) of $r_0$ in $R$, and let $k$ be the number of occurrences (possibly 0) of $r_0$ in $S$. Then the number of instances of $r_0$ that occur in the result is the minimum of $j$ and $k$.
- $R -_d S$: for union-compatible tables $R$ and $S$, subtract $S$ from $R$, eliminating duplicates if $d$ = Dist and following the semantics of EXCEPT ALL if $d$ = All. The semantics of EXCEPT ALL are similar to INTERSECT ALL except that the number of occurrences of $r_0$ (as defined above) in the result is the maximum of $j - k$ and zero. We define our minus operator with the SQL expression SELECT * FROM R EXCEPT [ALL] SELECT * FROM S.

We represent the logical operators implication, equivalence, conjunction and disjunction with the standard notation $\Longrightarrow$, $\Longleftrightarrow$, $\wedge$, and $\vee$, respectively.
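The multiset semantics of the two query-expression operators can be sketched with Python's `collections.Counter`, whose `&` and `-` operators happen to compute exactly the minimum of $j$ and $k$ and the maximum of $j - k$ and zero. The function names and the sample tables are illustrative assumptions; tables are modelled as lists of tuples.

```python
from collections import Counter

def intersect_all(r, s):
    """INTERSECT ALL: each row occurs min(j, k) times."""
    return Counter(r) & Counter(s)

def except_all(r, s):
    """EXCEPT ALL: each row occurs max(j - k, 0) times."""
    return Counter(r) - Counter(s)

def intersect_dist(r, s):
    """INTERSECT: duplicates eliminated."""
    return set(r) & set(s)

def except_dist(r, s):
    """EXCEPT: duplicates eliminated."""
    return set(r) - set(s)

# Hypothetical union-compatible tables with duplicate rows.
R = [("a",), ("a",), ("a",), ("b",)]
S = [("a",), ("a",), ("c",)]

print(intersect_all(R, S))
print(except_all(R, S))
print(intersect_dist(R, S), except_dist(R, S))
```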

3 Formal analysis of duplicate elimination

The previous section detailed the SQL2 mechanisms for declaring primary and candidate keys of base tables. A key declaration implies that all attributes of the table are functionally dependent on the key, which is termed a key dependency (KD). For duplicate elimination, we are interested in which functional dependencies (FDs) hold in a derived table, that is, a table defined by a query or view. We call such FDs derived functional dependencies. Similarly, a KD that holds in a derived table is a derived key dependency. We represent the FD '$A$ functionally determines $B$' with the standard notation $A \longrightarrow B$ [21]. The following example illustrates derived FDs.

Example 3

Consider the derived table defined by the query

    SELECT ALL S.SNO, SNAME, P.PNO, PNAME
    FROM SUPPLIER S, PARTS P
    WHERE P.SNO = :SUPPLIER-NO AND S.SNO = P.SNO

which lists the supplier name and number, and part name and number, for all parts supplied by supplier :SUPPLIER-NO. We claim that PNO is a key of the derived table. PNO is certainly a key of the derived table $D$ where $D = \sigma[\text{SNO} = \text{:SUPPLIER-NO}](\text{PARTS})$. In this case, :SUPPLIER-NO is a host variable in an application program, assumed to have the same domain as P.SNO. Each tuple of $D$ joins with at most one tuple from SUPPLIER since S.SNO is SUPPLIER's primary key. Therefore, PNO remains the key of the derived table obtained after projection. Since the key dependency SNO $\longrightarrow$ SNAME holds in the SUPPLIER table, it should also hold in the derived table. In this case, a key dependency in a source table became a non-key functional dependency in the derived table.

3.1 SQL and functional dependencies

Example 3 illustrates the usefulness of derived functional dependencies in determining if duplicate elimination is required, because the existence of a primary key in each output tuple (PNO in the example) means that duplicates cannot exist. However, 'traditional' dependency theory ignores three-valued logic; only recently have researchers adequately documented the formal semantics of three-valued logic within SQL [5,18,22].

Because the ISO standard permits Null values in any attributes of a candidate key, Null values may exist on both the left- and right-hand sides of a key dependency in both base and derived tables. Essentially, the problem is to determine the result of the comparison Null = Null. In SQL2, the result depends on the context: within Where and Having clauses, the comparison is unknown; within Group By and Order By, and particularly duplicate elimination via Select DISTINCT, the comparison is true. To incorporate three-valued logic semantics into FDs, we adopt the interpretation and Null comparison operators of reference [18]. Table 2 describes their semantics. Using the null comparison operator, we can formally state that two tuples $t_0$ and $t_0'$ from instance $I$ of table $R$ are equivalent if

$$\forall\, t_0, t_0' \in I(R) : \Big\{ \bigwedge_{a_i \in \alpha(R)} \lfloor t_0[a_i] \stackrel{\omega}{=} t_0'[a_i] \rfloor \Big\}. \qquad (1)$$

Henceforth we use the notation $t[A]$ to represent a set of attributes $A = \{a_1, a_2, \ldots, a_n\}$ of tuple $t$.

Definition 1 (Functional Dependency)

Consider a table scheme $R(A, b, \ldots)$ where $A$ is a set of attributes $\{a_1, a_2, \ldots, a_n\}$ and $b$ is an attribute. Let $I(R)$ denote an instance of $R$. Then $A \longrightarrow b$ iff the following condition holds:

$$\forall\, t_0, t_0' \in I(R) : \{ (\lfloor t_0[A] \stackrel{\omega}{=} t_0'[A] \rfloor) \Longrightarrow (\lfloor t_0[b] \stackrel{\omega}{=} t_0'[b] \rfloor) \} \qquad (2)$$

which states that if two tuples agree on the set of attributes $A$, then the two tuples must agree on the value of $b$ if the functional dependency holds. Note the treatment of Null values implicit in this definition: corresponding attributes in $A$ and $b$ must either agree in value, or both be Null. From this definition, we can now define the key dependency over all instances of a table $R$ as

$$\forall\, I_i(R) : \forall\, t, t' \in I_i(R) : \{ (\lfloor t[\mathrm{Key}(R)] \stackrel{\omega}{=} t'[\mathrm{Key}(R)] \rfloor) \Longrightarrow (\lfloor t[\alpha(R)] \stackrel{\omega}{=} t'[\alpha(R)] \rfloor) \} \qquad (3)$$

where $\mathrm{Key}(R)$ denotes the set of attributes that form a candidate key of $R$. This formalism merely states our intuitive notion of a key: if two tuples have the same key, then the tuples must agree on all other attributes. In the next section, we formally define the conditions necessary to determine if a key exists in a derived table.
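The Null comparison operator, the two interpretation operators of Table 2, and the test in condition (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: Null is modelled as `None`, an instance is a list of dictionaries, the function names are made up, and the right-hand side of the dependency is allowed to be a list of attributes rather than the single attribute $b$ of Definition 1.

```python
def null_eq(x, y):
    """Null comparison operator: (X IS NULL AND Y IS NULL) OR X = Y."""
    if x is None or y is None:
        return x is None and y is None
    return x == y

def false_interpreted(pred, x):
    """x IS NOT NULL AND P(x)."""
    return x is not None and pred(x)

def true_interpreted(pred, x):
    """x IS NULL OR P(x)."""
    return x is None or pred(x)

def tuples_agree(t1, t2, attrs):
    """True if the two tuples are equivalent on every attribute in attrs."""
    return all(null_eq(t1[a], t2[a]) for a in attrs)

def fd_holds(instance, lhs, rhs):
    """Check lhs -> rhs on one instance, following condition (2)."""
    return all(tuples_agree(t1, t2, rhs)
               for t1 in instance for t2 in instance
               if tuples_agree(t1, t2, lhs))

# Hypothetical PARTS instance; OEM_PNO stands in for OEM-PNO and is Null once.
parts = [
    {"SNO": 1, "PNO": 10, "PNAME": "bolt", "OEM_PNO": None},
    {"SNO": 1, "PNO": 11, "PNAME": "nut", "OEM_PNO": 77},
    {"SNO": 2, "PNO": 10, "PNAME": "bolt", "OEM_PNO": 78},
]
print(fd_holds(parts, ["SNO", "PNO"], ["PNAME", "OEM_PNO"]))
print(fd_holds(parts, ["PNO"], ["SNO"]))
```

Checking one instance only tests whether a dependency is violated there; a key dependency in the sense of condition (3) must hold over all instances.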

3.2 Main theorem

Consider a simple SQL query specification that involves only projection, selection, and extended Cartesian product; for simplicity, we do not permit queries to contain arithmetic expressions, Group By clauses, or Having clauses. A query's Where clause may contain host variables, that is, constants whose values are known only at query execution. We assume that each selection predicate expression containing host variables compares them to other union-compatible arguments, for example, domains of particular columns. We define a host variable's domain as the intersection of the column domains with which the host variable is compared. We would like to determine if the result of the query

    SELECT $A$ FROM $R, S$ WHERE $C_R \wedge C_S \wedge C_{R,S}$

may contain duplicate rows. Intuitively, the uniqueness condition will be met if:

- both $R$ and $S$ have primary keys, so that the key of $(R \times S)$ is the concatenation of $\mathrm{Key}(R)$ with $\mathrm{Key}(S)$, denoted $\mathrm{Key}(R) \circ \mathrm{Key}(S)$;
- either all the columns of $\mathrm{Key}(R \times S)$ are in the projection list, or
- a subset of the key columns is present in the projection list, and the values of the other key columns are equated to constants or can be inferred through the selection predicate or table constraints.

This notion corresponds to the following theorem.

Theorem 1 (Uniqueness Condition)

Consider a query involving only projection, selection, and extended Cartesian product of two tables $R$ and $S$ where $R$ and $S$ each have at least one candidate key. The selection predicate $C_R \wedge C_S \wedge C_{R,S}$ may contain expressions that include host variables; we denote this set of input parameters by $h$. Thus we identify the test of a selection predicate, which includes host variables, on tuple $r$ of $R$ with the notation $C_R(r, h)$. Then the two expressions

$$Q = \pi_{\mathrm{All}}[A](\sigma[C_R \wedge C_S \wedge C_{R,S}](R \times S))$$

and

$$V = \pi_{\mathrm{Dist}}[A](\sigma[C_R \wedge C_S \wedge C_{R,S}](R \times S))$$

are equivalent if and only if the following condition holds:

$$
\begin{aligned}
& \forall\, r, r' \in \mathrm{Domain}(R \times S);\ \forall\, h \in \mathrm{Domain}(H) : \qquad (4) \\
& \{\, [\, [\, \lfloor T_R(r) \rfloor \wedge \lfloor T_R(r') \rfloor \wedge \lfloor T_S(r) \rfloor \wedge \lfloor T_S(r') \rfloor \;\wedge \\
& \quad (\text{for each } U_i(R) : (\lfloor r[U_i(R)] \stackrel{\omega}{=} r'[U_i(R)] \rfloor) \Longrightarrow \lfloor r[\alpha(R)] \stackrel{\omega}{=} r'[\alpha(R)] \rfloor) \;\wedge \\
& \quad (\text{for each } U_j(S) : (\lfloor r[U_j(S)] \stackrel{\omega}{=} r'[U_j(S)] \rfloor) \Longrightarrow \lfloor r[\alpha(S)] \stackrel{\omega}{=} r'[\alpha(S)] \rfloor) \;\wedge \\
& \quad \lfloor C_R(r, h) \rfloor \wedge \lfloor C_R(r', h) \rfloor \wedge \lfloor C_S(r, h) \rfloor \wedge \lfloor C_S(r', h) \rfloor \wedge \lfloor C_{R,S}(r, h) \rfloor \wedge \lfloor C_{R,S}(r', h) \rfloor \,] \\
& \quad \Longrightarrow (\lfloor r[A] \stackrel{\omega}{=} r'[A] \rfloor) \,] \\
& \quad \Longrightarrow (\lfloor r[\mathrm{Key}(R \times S)] \stackrel{\omega}{=} r'[\mathrm{Key}(R \times S)] \rfloor) \,\}
\end{aligned}
$$

    Notation                         Interpretation of Null   SQL semantics
    $P(x)$                           undefined                x IS NOT NULL $\Longrightarrow$ P(x)
    $\lceil P(x) \rceil$             true-interpreted         x IS NULL OR P(x)
    $\lfloor P(x) \rfloor$           false-interpreted        x IS NOT NULL AND P(x)
    $X \stackrel{\omega}{=} Y$       equivalent               (X IS NULL AND Y IS NULL) OR X = Y

Table 2: Interpretation and Null comparison operator semantics. $P(x)$ represents a predicate $P$ on an attribute $x$.

Proof (Sufficiency) We assert that if the theorem's condition is true then the query result contains no duplicates. In contradiction, assume the condition stated in Theorem 1 holds but $Q \neq V$; $Q$ contains duplicate rows. If $Q \neq V$, then there exists a valid instance $I(R)$ and a valid instance $I(S)$ giving different results for $Q$ and $V$. Then there exist (at least) two different tuples $r_0, r_0' \in (I(R) \times I(S))$ such that $\lfloor r_0[A] \stackrel{\omega}{=} r_0'[A] \rfloor$. Projecting $r_0$ and $r_0'$ onto base tables $I(R)$ and $I(S)$, $r_0$ and $r_0'$ are derived from the tuples $r_0[\alpha(S)]$, $r_0'[\alpha(S)]$, $r_0[\alpha(R)]$, and $r_0'[\alpha(R)]$. Furthermore, $r_0[\alpha(R)], r_0'[\alpha(R)] \in \sigma[C_R](R)$ and $r_0[\alpha(S)], r_0'[\alpha(S)] \in \sigma[C_S](S)$. If $Q \neq V$, then the extended Cartesian product of these tuples, which satisfies the condition $C_{R,S}$, yields at least two tuples in $Q$'s result. This means that either the tuples in $I(S)$ are different ($\lfloor r_0[\alpha(S)] \stackrel{\omega}{\neq} r_0'[\alpha(S)] \rfloor$), the tuples in $I(R)$ are different, or both. It follows that the consequent $\lfloor r_0[\mathrm{Key}(R \times S)] \stackrel{\omega}{=} r_0'[\mathrm{Key}(R \times S)] \rfloor$ must be false, since if either $\lfloor r_0[\alpha(S)] \stackrel{\omega}{\neq} r_0'[\alpha(S)] \rfloor$, $\lfloor r_0[\alpha(R)] \stackrel{\omega}{\neq} r_0'[\alpha(R)] \rfloor$, or both, then the keys of the respective tuples must be different; a contradiction. Therefore, we conclude that no duplicate rows can appear in the query result if the condition of Theorem 1 holds. $\Box$

Proof (Necessity) Assume that for every valid instance of the database, $Q$ cannot generate any duplicate rows, but the condition stated in Theorem 1 does not hold. To prove necessity, we must show that we can construct valid instances of $R$ and $S$ for which $Q$ results in duplicate rows. If Theorem 1's condition does not hold, then there must exist two tuples $r_0, r_0' \in \mathrm{Domain}(R \times S)$ so that the consequent $(\lfloor r_0[A] \stackrel{\omega}{=} r_0'[A] \rfloor) \Longrightarrow (\lfloor r_0[\mathrm{Key}(R \times S)] \stackrel{\omega}{=} r_0'[\mathrm{Key}(R \times S)] \rfloor)$ is false, but its antecedents (table constraints, key dependencies, and query predicates) are true. If $r_0$ and $r_0'$ disagree on their key, then there must exist at least one column $\kappa \in \mathrm{Key}(R) \circ \mathrm{Key}(S)$ where $\lfloor r_0[\kappa] \stackrel{\omega}{\neq} r_0'[\kappa] \rfloor$. Projecting $r_0$ and $r_0'$ onto base tables $R$ and $S$, we get the database instance consisting solely of the tuples $r_0[\alpha(S)]$, $r_0'[\alpha(S)]$, $r_0[\alpha(R)]$, and $r_0'[\alpha(R)]$. This instance is valid since the tuples satisfy the table and uniqueness constraints for $R$ and $S$. Furthermore $r_0[\alpha(S)], r_0'[\alpha(S)] \in \sigma[C_S](S)$ and $r_0[\alpha(R)], r_0'[\alpha(R)] \in \sigma[C_R](R)$. Because all constraints are satisfied and $\lfloor r_0[A] \stackrel{\omega}{=} r_0'[A] \rfloor$, $V$ contains a single tuple. Suppose $\kappa \in \mathrm{Key}(S)$. Then $\lfloor r_0[\alpha(S)] \stackrel{\omega}{\neq} r_0'[\alpha(S)] \rfloor$, and the extended Cartesian product with $r_0[\alpha(R)]$ and $r_0'[\alpha(R)]$ satisfying $C_{R,S}$ yields at least two tuples. A similar result occurs if $\kappa \in \mathrm{Key}(R)$. In either case, $Q$ contains at least two tuples, so $Q \neq V$. Therefore, we conclude that the condition in Theorem 1 is both necessary and sufficient. $\Box$

Note that we can extend this result to involve more than two tables in the Cartesian product.

Example 4

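Consider the query from Example 3, modified to eliminate duplicate rows:

    SELECT DISTINCT S.SNO, SNAME, P.PNO, PNAME
    FROM SUPPLIER S, PARTS P
    WHERE P.SNO = :SUPPLIER-NO AND S.SNO = P.SNO.

We can safely ignore the Distinct specification in the above query if the condition of Theorem 1 holds:

$$\forall\, r, r' \in \mathrm{Domain}(S \times P);\ \forall\, \text{:SUPPLIER-NO} \in \mathrm{Domain}(\text{S.SNO}) : \{\, [\, [$$

Tuple constraints (CHECK conditions)

$$
\begin{aligned}
& \lfloor r[\text{S.SNO}] \geq 1 \rfloor \wedge \lfloor r[\text{S.SNO}] \leq 499 \rfloor \;\wedge \\
& (\lfloor r[\text{S.SCITY}] = \text{Chicago} \rfloor \vee \lfloor r[\text{S.SCITY}] = \text{Toronto} \rfloor \vee \lfloor r[\text{S.SCITY}] = \text{New York} \rfloor) \;\wedge \\
& (\lfloor r[\text{S.BUDGET}] \neq 0 \rfloor \vee \lfloor r[\text{S.STATUS}] = \text{Inactive} \rfloor) \;\wedge \\
& \lfloor r'[\text{S.SNO}] \geq 1 \rfloor \wedge \lfloor r'[\text{S.SNO}] \leq 499 \rfloor \;\wedge \\
& (\lfloor r'[\text{S.SCITY}] = \text{Chicago} \rfloor \vee \lfloor r'[\text{S.SCITY}] = \text{Toronto} \rfloor \vee \lfloor r'[\text{S.SCITY}] = \text{New York} \rfloor) \;\wedge \\
& (\lfloor r'[\text{S.BUDGET}] \neq 0 \rfloor \vee \lfloor r'[\text{S.STATUS}] = \text{Inactive} \rfloor) \;\wedge \\
& \lfloor r[\text{P.SNO}] \geq 1 \rfloor \wedge \lfloor r[\text{P.SNO}] \leq 499 \rfloor \wedge \lfloor r'[\text{P.SNO}] \geq 1 \rfloor \wedge \lfloor r'[\text{P.SNO}] \leq 499 \rfloor \;\wedge
\end{aligned}
$$

Primary key dependency for SUPPLIER

$$
\begin{aligned}
& (\lfloor r[\text{S.SNO}] \stackrel{\omega}{=} r'[\text{S.SNO}] \rfloor) \Longrightarrow \\
& \quad (\lfloor r[\text{S.SNAME}] \stackrel{\omega}{=} r'[\text{S.SNAME}] \rfloor \wedge \lfloor r[\text{S.SCITY}] \stackrel{\omega}{=} r'[\text{S.SCITY}] \rfloor \wedge \lfloor r[\text{S.BUDGET}] \stackrel{\omega}{=} r'[\text{S.BUDGET}] \rfloor \wedge \lfloor r[\text{S.STATUS}] \stackrel{\omega}{=} r'[\text{S.STATUS}] \rfloor) \;\wedge
\end{aligned}
$$

Primary key dependency for PARTS

$$
\begin{aligned}
& (\lfloor r[\text{P.SNO}] \stackrel{\omega}{=} r'[\text{P.SNO}] \rfloor \wedge \lfloor r[\text{P.PNO}] \stackrel{\omega}{=} r'[\text{P.PNO}] \rfloor) \Longrightarrow \\
& \quad (\lfloor r[\text{P.PNAME}] \stackrel{\omega}{=} r'[\text{P.PNAME}] \rfloor \wedge \lfloor r[\text{P.OEM-PNO}] \stackrel{\omega}{=} r'[\text{P.OEM-PNO}] \rfloor \wedge \lfloor r[\text{P.COLOR}] \stackrel{\omega}{=} r'[\text{P.COLOR}] \rfloor) \;\wedge
\end{aligned}
$$

Candidate key dependency for PARTS

$$
\begin{aligned}
& (\lfloor r[\text{P.OEM-PNO}] \stackrel{\omega}{=} r'[\text{P.OEM-PNO}] \rfloor) \Longrightarrow \\
& \quad (\lfloor r[\text{P.SNO}] \stackrel{\omega}{=} r'[\text{P.SNO}] \rfloor \wedge \lfloor r[\text{P.PNO}] \stackrel{\omega}{=} r'[\text{P.PNO}] \rfloor \wedge \lfloor r[\text{P.PNAME}] \stackrel{\omega}{=} r'[\text{P.PNAME}] \rfloor \wedge \lfloor r[\text{P.COLOR}] \stackrel{\omega}{=} r'[\text{P.COLOR}] \rfloor) \;\wedge
\end{aligned}
$$

Query predicate conditions

$$\lfloor r[\text{P.SNO}] = \text{:SUPPLIER-NO} \rfloor \wedge \lfloor r'[\text{P.SNO}] = \text{:SUPPLIER-NO} \rfloor \wedge \lfloor r[\text{S.SNO}] = r[\text{P.SNO}] \rfloor \wedge \lfloor r'[\text{S.SNO}] = r'[\text{P.SNO}] \rfloor \,] \Longrightarrow$$

Projection attributes

$$(\lfloor r[\text{S.SNO}] \stackrel{\omega}{=} r'[\text{S.SNO}] \rfloor \wedge \lfloor r[\text{S.SNAME}] \stackrel{\omega}{=} r'[\text{S.SNAME}] \rfloor \wedge \lfloor r[\text{P.PNO}] \stackrel{\omega}{=} r'[\text{P.PNO}] \rfloor \wedge \lfloor r[\text{P.PNAME}] \stackrel{\omega}{=} r'[\text{P.PNAME}] \rfloor) \,] \Longrightarrow$$

Key of $R$ $\circ$ Key of $S$

$$(\lfloor r[\text{P.PNO}] \stackrel{\omega}{=} r'[\text{P.PNO}] \rfloor \wedge \lfloor r[\text{S.SNO}] \stackrel{\omega}{=} r'[\text{S.SNO}] \rfloor \wedge \lfloor r[\text{P.SNO}] \stackrel{\omega}{=} r'[\text{P.SNO}] \rfloor) \,\}$$

Although complex, this expression is satisfiable: ignoring the table constraints and key dependencies, we can see from the consequent

$$
\begin{aligned}
& (\lfloor r[\text{S.SNO}] \stackrel{\omega}{=} r'[\text{S.SNO}] \rfloor \wedge \lfloor r[\text{S.SNAME}] \stackrel{\omega}{=} r'[\text{S.SNAME}] \rfloor \wedge \lfloor r[\text{P.PNO}] \stackrel{\omega}{=} r'[\text{P.PNO}] \rfloor \wedge \lfloor r[\text{P.PNAME}] \stackrel{\omega}{=} r'[\text{P.PNAME}] \rfloor) \\
& \quad \Longrightarrow (\lfloor r[\text{P.PNO}] \stackrel{\omega}{=} r'[\text{P.PNO}] \rfloor \wedge \lfloor r[\text{S.SNO}] \stackrel{\omega}{=} r'[\text{S.SNO}] \rfloor \wedge \lfloor r[\text{P.SNO}] \stackrel{\omega}{=} r'[\text{P.SNO}] \rfloor)
\end{aligned}
$$

that the conjuncts containing P.PNO and S.SNO in the final consequent are trivially true. The conjunct containing P.SNO is also true, since the antecedent $\lfloor r[\text{P.SNO}] = \text{:SUPPLIER-NO} \rfloor \wedge \lfloor r'[\text{P.SNO}] = \text{:SUPPLIER-NO} \rfloor$ implies that P.SNO is constant. Therefore, the entire condition is true, and duplicate elimination is not necessary. In the next section, we propose a straightforward algorithm for determining if a uniqueness condition, like the one above, holds for a given query and database instance.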
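The reasoning in Example 4 can be mimicked by a brute-force search for a counterexample over a tiny domain. The sketch below is an illustration only and makes several assumptions that are not in the paper: Null values and CHECK constraints are ignored, only five columns are modelled, the OEM-PNO candidate key is left out, every column ranges over the two made-up values 0 and 1, and the condition is read as "key dependencies, query predicates, and agreement on the projection attributes together imply agreement on the key". Finding no counterexample in such a small domain is not a proof; finding one shows how duplicates can arise.

```python
from itertools import product

COLS = ["S.SNO", "S.SNAME", "P.SNO", "P.PNO", "P.PNAME"]
KEY = ["S.SNO", "P.SNO", "P.PNO"]   # Key(SUPPLIER) concatenated with Key(PARTS)

def agree(r, r2, attrs):
    """True if both tuples have equal values on every attribute in attrs."""
    return all(r[a] == r2[a] for a in attrs)

def key_dependencies(r, r2):
    """Primary key dependencies of SUPPLIER and PARTS for the modelled columns."""
    supplier = (not agree(r, r2, ["S.SNO"])) or agree(r, r2, ["S.SNAME"])
    parts = (not agree(r, r2, ["P.SNO", "P.PNO"])) or agree(r, r2, ["P.PNAME"])
    return supplier and parts

def find_counterexample(projection, predicate, domain=(0, 1)):
    """Return (r, r2, h) violating the condition, or None if none exists."""
    tuples = [dict(zip(COLS, vals)) for vals in product(domain, repeat=len(COLS))]
    for h in domain:
        for r, r2 in product(tuples, repeat=2):
            antecedent = (key_dependencies(r, r2)
                          and predicate(r, h) and predicate(r2, h)
                          and agree(r, r2, projection))
            if antecedent and not agree(r, r2, KEY):
                return r, r2, h
    return None

def example4_pred(r, h):
    # P.SNO = :SUPPLIER-NO AND S.SNO = P.SNO
    return r["P.SNO"] == h and r["S.SNO"] == r["P.SNO"]

def example2_pred(r, h):
    # S.SNO = P.SNO; the COLOR test of Example 2 is omitted in this model
    return r["S.SNO"] == r["P.SNO"]

ex4 = find_counterexample(["S.SNO", "S.SNAME", "P.PNO", "P.PNAME"], example4_pred)
ex2 = find_counterexample(["S.SNAME", "P.PNO", "P.PNAME"], example2_pred)
print(ex4 is None, ex2 is not None)
```

For the Example 4 query the search finds nothing, while for the projection list of Example 2 it finds two joined tuples that agree on the projected columns but differ on the key.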

4 Algorithm

We need to test whether a particular query, on an instance of a database, satisfies the conditions of Theorem 1 so that we can decide if duplicate elimination is redundant. Since the conditions are quantified Boolean expressions, the test is equivalent to deciding if the expression is satisfiable, an NP-complete problem. Instead, we offer an efficient algorithm that, although it will not discover all situations where duplicate elimination is unnecessary, handles a large subclass of queries. Domain and column constraints typically specify permissible ranges of values. Therefore, it is usually unproductive to consider atomic conditions that do not involve equality. Knowledge of primary and candidate keys, however, is invaluable when we test for derived functional dependencies. Our proposed algorithm (Algorithm 1) exploits information about primary keys, candidate keys, and equality conditions in a Where clause. Equality conditions are of two types: Type 1 of the form $(v = c)$ and Type 2 of the form $(v_1 = v_2)$ where $v, v_1, v_2$ are columns and $c$ is a constant.

Example 5

Suppose we are given the query of Example 4:

    SELECT DISTINCT S.SNO, SNAME, P.PNO, PNAME
    FROM SUPPLIER S, PARTS P
    WHERE P.SNO = :SUPPLIER-NO AND S.SNO = P.SNO.

Applying Algorithm 1 to this query we can trace the following steps:

    Line 5:    $C \Longleftrightarrow$ P.SNO = :SUPPLIER-NO $\wedge$ S.SNO = P.SNO $\wedge$ T.
    Lines 6-9: $C$ is unchanged.
    Line 10:   $C$ is not simply true; we proceed.
    Line 11:   $E_1 \Longleftrightarrow$ P.SNO = :SUPPLIER-NO $\wedge$ S.SNO = P.SNO.
    Line 13:   $V$ = {S.SNO, SNAME, P.PNO, PNAME}.
    Line 14:   $V$ = {S.SNO, SNAME, P.PNO, PNAME, P.SNO}.
    Line 16:   $V$ is unchanged.
    Line 17:   $V$ contains the primary keys S.SNO and (P.SNO, P.PNO); proceed.
    Line 20:   Return YES and stop.

Since the algorithm returns YES, we know that the Distinct clause in the query is unnecessary.

4.1 Proof of correctness

Algorithm 1 tests a simpler, sufficient condition than that stated in Theorem 1; it ignores table constraints $T_R$ and $T_S$ and non-equality atomic conditions in the query predicate. Any $D_i$ deleted on line 7 weakens condition $C$, but $C$ is still sufficient. Similarly, any $D_i$ deleted on line 8 removes conditions like X = 5 OR X = 10. Therefore, we need to show that the simplified condition

$$
\begin{aligned}
& \forall\, r, r' \in \mathrm{Domain}(R \times S);\ \forall\, h \in \mathrm{Domain}(H) : \qquad (5) \\
& \{\, [\, (\text{for each } U_i(R) : (\lfloor r[U_i(R)] \stackrel{\omega}{=} r'[U_i(R)] \rfloor) \Longrightarrow \lfloor r[\alpha(R)] \stackrel{\omega}{=} r'[\alpha(R)] \rfloor) \;\wedge \\
& \quad (\text{for each } U_j(S) : (\lfloor r[U_j(S)] \stackrel{\omega}{=} r'[U_j(S)] \rfloor) \Longrightarrow \lfloor r[\alpha(S)] \stackrel{\omega}{=} r'[\alpha(S)] \rfloor) \;\wedge \\
& \quad \lfloor C_R(r, h) \rfloor \wedge \lfloor C_R(r', h) \rfloor \wedge \lfloor C_S(r, h) \rfloor \wedge \lfloor C_S(r', h) \rfloor \wedge \lfloor C_{R,S}(r, h) \rfloor \wedge \lfloor C_{R,S}(r', h) \rfloor \,] \\
& \quad \Longrightarrow [\, (\lfloor r[A] \stackrel{\omega}{=} r'[A] \rfloor) \Longrightarrow (\lfloor r[\mathrm{Key}(R \times S)] \stackrel{\omega}{=} r'[\mathrm{Key}(R \times S)] \rfloor) \,] \,\},
\end{aligned}
$$

where $C_R$, $C_S$, and $C_{R,S}$ contain only atomic conditions using '=', is true when Algorithm 1 returns YES.

Assuming Algorithm 1 returns YES, consider one iteration of the main loop starting on line 12. Since line 17 is true (the $\mathrm{Key}(R \times S)$ occurs in $V$), then we know that the $\mathrm{Key}(R) \circ \mathrm{Key}(S)$ is functionally determined from the result attributes; a derived functional dependency. This means that the consequent $(\lfloor r[A] \stackrel{\omega}{=} r'[A] \rfloor) \Longrightarrow (\lfloor r[\mathrm{Key}(R \times S)] \stackrel{\omega}{=} r'[\mathrm{Key}(R \times S)] \rfloor)$ must be true. Since we assume that all key dependencies hold, and we considered all conjunctive components of $E_i$ then the simplified condition must hold for each $E_i$. Since $E_1 \vee E_2 \vee \ldots \vee E_m \Longleftrightarrow C_R \wedge C_S \wedge C_{R,S}$, the condition holds for all $i$.

     1  Algorithm: Determine if duplicate elimination is unnecessary.
     2  Inputs: predicates $C_R$, $C_S$, $C_{R,S}$; key constraints $U(R)$ and $U(S)$;
     3          set of projection list attributes $A$.
     4  Output: YES or NO.
     5  Convert $C_R \wedge C_S \wedge C_{R,S} \wedge T$ into CNF: $C = D_1 \wedge D_2 \wedge \ldots \wedge D_n$;
     6  for each $D_i \in C$ do
     7      if $D_i$ contains an atomic condition not of Type 1 or Type 2 then delete $D_i$ from $C$
     8      else if $D_i$ contains a disjunctive clause on $v$ then delete $D_i$ from $C$
     9  od
    10  if $C = T$ then return NO
    11  else convert $C$ to DNF: $C = E_1 \vee E_2 \vee \ldots \vee E_m$;
    12  for each conjunctive component $E_i \in C$ do
    13      create a set $V$ that contains each attribute in $A$;
    14      for each Type 1 condition $(v = c)$ in $E_i$ do add $v$ to $V$ od;
    15      -- compute the transitive closure of $V$ based on Type 2 conditions in $E_i$.
    16      while $\exists$ a Type 2 condition $\in C$ such that $v_1 \in V$ and $v_2 \notin V$ do add $v_2$ to $V$ od;
    17      if $\mathrm{Key}(R) \circ \mathrm{Key}(S) \subseteq V$ then proceed
    18      else return NO
    19  od;
    20  return YES

Algorithm 1: Test uniqueness of a query result. The algorithm returns YES if primary or candidate keys for both $R$ and $S$ are in $V$.
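Algorithm 1 can be sketched in Python roughly as follows. The representation is an assumption made for illustration: the predicate arrives already in CNF as a list of clauses, each clause is a list of tagged atoms, and columns are plain strings. The sketch also simplifies lines 6 to 11 by dropping every clause with more than one atom, so the DNF of what remains has a single conjunctive component; dropping conjuncts only shrinks $V$, so the test stays conservative.

```python
def distinct_is_redundant(projection, clauses, keys_r, keys_s):
    """Sketch of Algorithm 1: True stands for YES, False for NO.

    clauses: CNF predicate as a list of clauses, each a list of atoms:
        ("const", v)     Type 1 condition v = c (constant or host variable)
        ("eq", v1, v2)   Type 2 condition v1 = v2
        ("other",)       any other atomic condition
    keys_r, keys_s: candidate keys of R and S, each a list of column names.
    """
    # Lines 6-9 (simplified): keep single-atom Type 1 and Type 2 clauses only.
    kept = [clause[0] for clause in clauses
            if len(clause) == 1 and clause[0][0] in ("const", "eq")]
    if not kept:                       # line 10: C is simply true
        return False
    v = set(projection)                # line 13
    v.update(atom[1] for atom in kept if atom[0] == "const")   # line 14
    changed = True                     # lines 15-16: transitive closure
    while changed:
        changed = False
        for atom in kept:
            if atom[0] != "eq":
                continue
            _, a, b = atom
            if (a in v) != (b in v):   # exactly one side is in V so far
                v.update((a, b))
                changed = True
    # Line 17: some candidate key of R and some candidate key of S lie in V.
    return (any(set(k) <= v for k in keys_r)
            and any(set(k) <= v for k in keys_s))

SUPPLIER_KEYS = [["S.SNO"]]
PARTS_KEYS = [["P.SNO", "P.PNO"], ["P.OEM-PNO"]]

# Example 5: WHERE P.SNO = :SUPPLIER-NO AND S.SNO = P.SNO
example5 = distinct_is_redundant(
    ["S.SNO", "S.SNAME", "P.PNO", "P.PNAME"],
    [[("const", "P.SNO")], [("eq", "S.SNO", "P.SNO")]],
    SUPPLIER_KEYS, PARTS_KEYS)

# Example 2: WHERE S.SNO = P.SNO AND P.COLOR = 'RED', projecting SNAME
example2 = distinct_is_redundant(
    ["S.SNAME", "P.PNO", "P.PNAME"],
    [[("eq", "S.SNO", "P.SNO")], [("const", "P.COLOR")]],
    SUPPLIER_KEYS, PARTS_KEYS)
print(example5, example2)
```

The first call follows the trace of Example 5 and answers YES; the second reflects Example 2, where neither key ends up in $V$.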

5 Applications

Our goal is to show how relational query optimizers can employ Theorem 1 to expand the space of possible execution strategies for a variety of queries. Once the optimizer identifies possible transformations, it can then choose the most appropriate strategy on the basis of its cost model. In this section, we identify three important query transformations: detection of unnecessary duplicate elimination, conversion of a subquery to a join, and conversion of an intersection to a subquery. Other researchers have described these query transformations elsewhere [6,11,17,19] but with relatively little formalism. Later, in Section 6, we show the applicability of these transformations in nonrelational environments.

5.1 Unnecessary duplicate elimination

We believe that many queries contain unnecessary Distinct clauses, for two reasons. First, CASE tools often generate queries using 'generic' query templates. These templates specify Distinct as a conservative approach to handling duplicate rows. Second, some practitioners [4] encourage users to always specify Distinct, again as a conservative approach to simplify query semantics. We feel that recognizing redundant Distinct clauses is an important optimization, since it can avoid a costly sort.

Example 6

Consider the following query which lists the supplier number and part data for every part supplied by suppliers with the name :SUPPLIER-NAME (there may be more than one):

    SELECT DISTINCT S.SNO, PNO, PNAME, P.COLOR
    FROM SUPPLIER S, PARTS P
    WHERE S.SNAME = :SUPPLIER-NAME AND S.SNO = P.SNO.

This query satisfies the conditions in Theorem 1, and, consequently, Distinct in the Select clause is unnecessary.

5.2 Subquery to join

Both Kim [11] and Pirahesh et al. [19] devote a great deal of effort to the rewriting of correlated, positive existential subqueries as joins. Their rationale is to avoid processing the query with a naive nested-loop strategy. Instead, they convert the query to a join so that the optimizer can consider alternate join methods. The class of queries we consider corresponds to Type J nested queries in Kim's paper; however, we explicitly consider three-valued logic and duplicate rows. Pirahesh et al. consider merging existential subquery blocks in Rule 7 of their suite of rewrite rules in the Starburst query optimizer. We believe that it is worthwhile to analyze several subquery-to-join transformations, particularly when duplicate rows are permitted.

Example 7

Consider the correlated query

    SELECT ALL S.SNO, S.SNAME
    FROM SUPPLIER S
    WHERE S.SNAME = :SUPPLIER-NAME
      AND EXISTS (SELECT * FROM PARTS P
                  WHERE S.SNO = P.SNO AND P.PNO = :PART-NO)

which lists all suppliers with a given name that supply a particular part. We claim that this query may be rewritten as

    SELECT ALL S.SNO, S.SNAME
    FROM SUPPLIER S, PARTS P
    WHERE S.SNAME = :SUPPLIER-NAME
      AND S.SNO = P.SNO AND P.PNO = :PART-NO

since the conditions in the subquery block can, at most, identify a single tuple in PARTS for each candidate tuple in SUPPLIER.

Theorem 2 (Subquery to Join)

Consider a nested query on tables $R$ and $S$ that contains a positive existential subquery block. Assume that $R$ and $S$ have at least one candidate key and the same preconditions for host variables, as described in Theorem 1, hold. Then the two expressions

\[ Q = \pi_{All}[A_R](\sigma[C_R \wedge \exists(\sigma[C_S \wedge C_{R,S}](S))](R)) \]

and

\[ V = \pi_{All}[A_R](\sigma[C_R \wedge C_S \wedge C_{R,S}](R \times S)) \]

are equivalent if and only if the following condition holds:

\[
\begin{aligned}
&\forall r \in \mathrm{Domain}(R);\ \forall h \in \mathrm{Domain}(H): \\
&\{\, [\, [\, \lfloor T_R(r) \rfloor \wedge \lfloor C_R(r,h) \rfloor \,] \implies \forall s, s' \in \mathrm{Domain}(S): \\
&\quad \lfloor T_S(s) \rfloor \wedge \lfloor T_S(s') \rfloor \wedge {} \\
&\quad (\text{for each } U_j(S): \lfloor s[U_j(S)] \overset{\omega}{=} s'[U_j(S)] \rfloor \implies \lfloor s[\alpha(S)] \overset{\omega}{=} s'[\alpha(S)] \rfloor) \wedge {} \\
&\quad \lfloor C_S(s,r,h) \rfloor \wedge \lfloor C_S(s',r,h) \rfloor \wedge \lfloor C_{R,S}(s,r,h) \rfloor \wedge \lfloor C_{R,S}(s',r,h) \rfloor \,] \\
&\implies (\lfloor s[\mathrm{Key}(S)] \overset{\omega}{=} s'[\mathrm{Key}(S)] \rfloor) \,\}
\end{aligned}
\tag{6}
\]

Proof (Sufficiency). We assert that at most one tuple from $S$ can match the selection predicate $C_S \wedge C_{R,S}$ if the condition in Theorem 2 holds. We prove this claim by contradiction; assume the condition in Theorem 2 holds, but the expressions $Q$ and $V$ are not equivalent. Then there must exist instances $I(R)$ and $I(S)$, a tuple $r_0 \in I(R)$, and (at least) two different tuples $s_0, s_0' \in I(S)$ such that $C_S(s_0, h)$, $C_S(s_0', h)$, $C_{R,S}(r_0, s_0, h)$, and $C_{R,S}(r_0, s_0', h)$ are satisfied. Since all the antecedents in the condition hold, and the table and key constraints hold for every tuple in $\mathrm{Domain}(R \times S)$, then $s_0$ and $s_0'$ must agree on their key. However, if the two tuples $s_0$ and $s_0'$ agree on their key, then they violate the candidate key constraint for $S$, a contradiction.

We now argue that the semantics of $Q$ and $V$ are equivalent if at most one tuple from $S$ matches each tuple from $R$. If the predicate $C_S \wedge C_{R,S}$ in $Q$ is false or unknown, then the existential predicate $\exists(\sigma[C_S \wedge C_{R,S}](S))$ must return false, and the tuple represented by $r_0$ cannot be part of the result. Otherwise, if $C_S \wedge C_{R,S}$ is true then $r_0$ appears in the result. Similarly, for query $V$, any tuple $r_0$ that satisfies $C_R$ will join with at most one tuple $s_0$ of $S$ if the condition in Theorem 2 holds. If $C_S \wedge C_{R,S}$ is false or unknown for the two tuples $r_0$ and $s_0$, the selection predicate is false; hence $r_0$ will not appear in the result. If $C_S \wedge C_{R,S}$ is true then at most one tuple of $S$ qualifies, and the extended Cartesian product produces only a single tuple from $R$. Therefore, if at most one tuple from $S$ matches each tuple of $R$, then $Q = V$. □


(Necessity) Assume that for every valid instance of the database, the subquery block on $S$ can match at most one tuple $r$ of $R$ but the condition in Theorem 2 does not hold. To prove necessity, we must show we can construct valid instances $I(R)$ and $I(S)$ so that evaluating $Q$ and $V$ on those instances yields a different result. If the condition in Theorem 2 is false there must exist two different tuples $s_0, s_0' \in \mathrm{Domain}(S)$ and a tuple $r_0 \in \mathrm{Domain}(R)$ such that the consequent $(\lfloor s_0[\mathrm{Key}(S)] \overset{\omega}{=} s_0'[\mathrm{Key}(S)] \rfloor)$ is false, but its antecedents are true. The instance of $S$ formed by tuples $s_0$ and $s_0'$ is certainly valid, since it satisfies the table and uniqueness constraints for $I(S)$. In turn, $r_0$ is a valid instance of $R$ because it satisfies the constraints on $R$. Since $r_0$ satisfies the condition $C_R$ and since both $s_0$ and $s_0'$ satisfy the selection predicate $C_S \wedge C_{R,S}$, then $Q$ yields one instance of $r_0$ in the result, but $V$ yields two, a contradiction. We conclude that the condition in Theorem 2 is both necessary and sufficient. □

At this point, we can make several observations. Trivially, if the subquery in $Q$ includes more than one table so that the subquery involves an extended Cartesian product of, say, tables $S$ and $W$, we can extend Theorem 2 to include the corresponding conditions of $W$ (similar to Theorem 1). More importantly, we observe that the two expressions

\[ Q = \pi_{Dist}[A_R](\sigma[C_R \wedge \exists(\sigma[C_S \wedge C_{R,S}](S))](R)) \]

and

\[ V = \pi_{Dist}[A_R](\sigma[C_R \wedge C_S \wedge C_{R,S}](R \times S)) \]

are always equivalent, since duplicate elimination in the projection automatically excludes duplicate tuples obtained from the Cartesian product if more than one tuple in $S$ matches the selection predicate. This means that if we can alter the projection $\pi_{All}[A_R]$ to $\pi_{Dist}[A_R]$ without changing the query's semantics, then we can always convert a nested query to a join, as illustrated by the following example.


Example 8

Consider the correlated query

    SELECT ALL S.SNO, S.SNAME
    FROM SUPPLIER S
    WHERE EXISTS (SELECT *
                  FROM PARTS P
                  WHERE P.SNO = S.SNO AND P.COLOR = 'RED')

which lists all suppliers who supply at least one red part. Note that the uniqueness condition does not hold on the subquery block since many red parts may be supplied by one supplier. However, this query may be rewritten as

    SELECT DISTINCT S.SNO, S.SNAME
    FROM SUPPLIER S, PARTS P
    WHERE S.SNO = P.SNO AND P.COLOR = 'RED'

since the uniqueness condition is satisfied for the outer query block (SNO is the key of SUPPLIER). The optimizer converts the query to a join, disregards any columns from PARTS, and then applies duplicate elimination that outputs only one SUPPLIER tuple for each unique SNO in the Cartesian product. This observation leads to the following corollary:

Corollary 1 (Subquery to Distinct Join)

Consider a nested query on tables $R$ and $S$ that contains a positive existential subquery block. Assume that $R$ and $S$ have at least one candidate key and the same preconditions for host variables, as described in Theorem 2, hold. Then the two expressions

\[ Q = \pi_{All}[A_R](\sigma[C_R \wedge \exists(\sigma[C_S \wedge C_{R,S}](S))](R)) \]

and

\[ V = \pi_{Dist}[A_R](\sigma[C_R \wedge C_S \wedge C_{R,S}](R \times S)) \]

are equivalent if $\pi_{All}[A_R](\sigma[C_R](R))$ contains no duplicate rows.

In summary, we have proved the equivalence of nested queries and joins in a variety of situations. Intuitively, it would seem that rewriting nested queries as joins is the most beneficial. In Section 6, we consider the opposite case: transforming a join to a subquery as a potential semantic optimization.
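The rewrite in Example 8 can be checked on a small instance. The following is a minimal sketch, assuming SQLite (through Python's standard `sqlite3` module) as the engine; the table contents, the supplier names, and the composite key on PARTS are hypothetical choices made only for illustration.

```python
import sqlite3

# Hypothetical instance: supplier 1 supplies two red parts, supplier 2 only a
# blue part, supplier 3 one red part. The schema details are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SUPPLIER (SNO INTEGER PRIMARY KEY, SNAME TEXT);
    CREATE TABLE PARTS (SNO INTEGER, PNO INTEGER, COLOR TEXT,
                        PRIMARY KEY (SNO, PNO));
    INSERT INTO SUPPLIER VALUES (1, 'Acme'), (2, 'Bolt'), (3, 'Cog');
    INSERT INTO PARTS VALUES (1, 10, 'RED'), (1, 11, 'RED'),
                             (2, 20, 'BLUE'), (3, 30, 'RED');
""")


def run(sql):
    """Return the query result as a sorted list so results compare as multisets."""
    return sorted(conn.execute(sql).fetchall())


# The nested query of Example 8.
nested = run("""
    SELECT ALL S.SNO, S.SNAME FROM SUPPLIER S
    WHERE EXISTS (SELECT * FROM PARTS P
                  WHERE P.SNO = S.SNO AND P.COLOR = 'RED')""")

# The same query as a join with ALL: supplier 1 is repeated once per red part.
join_all = run("""
    SELECT ALL S.SNO, S.SNAME FROM SUPPLIER S, PARTS P
    WHERE S.SNO = P.SNO AND P.COLOR = 'RED'""")

# The join with DISTINCT, as in Corollary 1.
join_distinct = run("""
    SELECT DISTINCT S.SNO, S.SNAME FROM SUPPLIER S, PARTS P
    WHERE S.SNO = P.SNO AND P.COLOR = 'RED'""")
```

On this instance the nested query and the Distinct join return the same two suppliers, while the All join returns three rows. That is the case Theorem 2 rules out and Corollary 1 repairs.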

5.3 Distinct intersection to subquery

Most relational query optimizers execute the Intersect operation by evaluating each operand, sorting each result, and merging the inputs. Recall that the semantics of Intersect requires ignoring duplicates and, more troublesome, equating two tuples if:
• all non-Null columns are equal, and
• for each Null column, its counterpart in the other (derived) table is also Null.

A subtle difficulty with the transformation of intersection query expressions to subqueries arises because the equivalence of tuples, normally handled by the intersection operator that treats NULL = NULL, is now moved into a Where clause. Pirahesh et al. [19] do not handle this situation adequately in their paper (Rule 8); they transform the query without considering possibly NULL keys.

Theorem 3 (Distinct Intersection to Exists)

Consider a query expression that contains the set intersection operator on two tables $R$ and $S$ where $R$ and $S$ each have at least one candidate key. Either selection predicate $C_R(R)$ or $C_S(S)$ may contain host variables. Then the two expressions

\[ Q = \pi_{All}[A_R](\sigma[C_R](R)) \cap_{Dist} \pi_{All}[A_S](\sigma[C_S](S)) \]

and

\[ V = \pi_{All}[A_R](\sigma[C_R \wedge \exists(\sigma[C_S \wedge C_{R,S}](S))](R)), \quad \text{where } C_{R,S} = \lfloor R[A] \overset{\omega}{=} S[A] \rfloor, \]

are equivalent if the derived table $\pi_{All}[A_R](\sigma[C_R](R))$ does not contain duplicate rows.

Proof. Omitted. □

Recall that the semantics of $X \cap_{Dist} Y$ are to include a tuple from $X$ if it exists in $Y$, and eliminate any duplicates in the result. If each result tuple from $R$ is unique, then a tuple from $\pi_{All}[A_R](\sigma[C_R](R))$ may appear in the final result if at least one matching tuple is found in $\pi_{All}[A_S](\sigma[C_S](S))$. Observe that the predicate $C_{R,S} = \lfloor R[A] \overset{\omega}{=} S[A] \rfloor$ can be expressed in SQL as

    (R.X IS NULL AND S.X IS NULL) OR R.X = S.X

for each attribute X in the projection list (though a plain equijoin predicate will suffice for primary key columns, since a primary key is guaranteed not to contain any Null values). Using a primary key makes the transformation in Example 5 of reference [19] correct.

Example 9

As an example of Theorem 3, consider the SQL query expression

    SELECT ALL S.SNO
    FROM SUPPLIER S
    WHERE S.SCITY = 'Toronto'
    INTERSECT
    SELECT ALL A.SNO
    FROM AGENT A
    WHERE A.ACITY = 'Ottawa' OR A.ACITY = 'Hull'

which lists supplier numbers for suppliers based in Toronto who have agents in Ottawa or Hull. Since SNO is the key of SUPPLIER, the derived table from SUPPLIER cannot contain duplicate rows, and we may rewrite the query as

    SELECT ALL S.SNO
    FROM SUPPLIER S
    WHERE S.SCITY = 'Toronto'
      AND EXISTS (SELECT *
                  FROM AGENT A
                  WHERE (A.ACITY = 'Ottawa' OR A.ACITY = 'Hull')
                    AND ((A.SNO IS NULL AND S.SNO IS NULL)
                         OR A.SNO = S.SNO))
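The role of the Null test in the correlation predicate can be seen on a nullable column. The sketch below is an illustration only, again assuming SQLite through Python's `sqlite3` module; the tables `R` and `S`, their single column `X`, and their contents are hypothetical, and `R` is declared duplicate-free through a UNIQUE constraint (which in SQLite still admits a NULL).

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R (X INTEGER UNIQUE);   -- duplicate-free, but X may be NULL
    CREATE TABLE S (X INTEGER);          -- duplicates allowed
    INSERT INTO R VALUES (1), (2), (NULL);
    INSERT INTO S VALUES (2), (2), (NULL), (3);
""")


def bag(sql):
    """Return the single-column result as a multiset of values."""
    return Counter(row[0] for row in conn.execute(sql))


# Intersection treats two NULLs as matching.
intersect = bag("SELECT X FROM R INTERSECT SELECT X FROM S")

# EXISTS with a plain equality predicate loses the NULL row, because
# NULL = NULL evaluates to unknown in a WHERE clause.
exists_plain = bag("""
    SELECT ALL R.X FROM R
    WHERE EXISTS (SELECT * FROM S WHERE R.X = S.X)""")

# EXISTS with the null-aware predicate from the text.
exists_null_aware = bag("""
    SELECT ALL R.X FROM R
    WHERE EXISTS (SELECT * FROM S
                  WHERE (R.X IS NULL AND S.X IS NULL) OR R.X = S.X)""")
```

Only the null-aware form reproduces the intersection on this instance; the plain equijoin form is safe only when the compared columns cannot be Null, as is the case for a primary key.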

Obviously we can perform this transformation if either of the derived tables from SUPPLIER or AGENT has unique rows. Subsequent conversion of the Exists subquery to a join is possible [19] if the tests for Nulls are maintained¹. We can make two additional observations:
• We now have a means of converting a nested query specification to a query expression involving intersection, another possible execution strategy.
• The semantics of Intersect and Intersect All are equivalent if at least one of the derived tables cannot produce duplicate rows.

The second observation leads to the following corollary:

Corollary 2 (All Intersection to Exists)

Consider a query expression that contains the set operator Intersect All. Assume that $R$ and $S$ have at least one candidate key, and the same preconditions for host variables, as described in Theorem 1, hold. Then the two expressions

\[ Q = \pi_{All}[A_R](\sigma[C_R](R)) \cap_{All} \pi_{All}[A_S](\sigma[C_S](S)) \]

and

\[ V = \pi_{All}[A_R](\sigma[C_R \wedge \exists(\sigma[C_S \wedge C_{R,S}](S))](R)), \]

where $C_{R,S}$ is defined as in Theorem 3, are equivalent if the expression $\pi_{All}[A_R](\sigma[C_R](R))$ does not contain duplicate rows. Similarly, $Q$ and $V$ (modified by interchanging $R$ and $S$) are equivalent if the query specification on $S$ does not contain duplicate rows.

Space restrictions prevent us from describing valid transformations for converting the set operations Except and Except All to existential subqueries, again by paying particular attention to the correlation predicates. In the next section, we document semantic transformations of join queries, in hierarchical and object-oriented database systems, to show the wide applicability of semantic query optimization.
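The observation behind Corollary 2, that Intersect and Intersect All coincide when one operand is duplicate-free, can be illustrated with multisets. This is a minimal sketch in Python using `collections.Counter`; the helper names and the sample rows are hypothetical, and Intersect All is modelled by the usual rule that a row appears min(m, n) times.

```python
from collections import Counter


def intersect_all(r, s):
    """Multiset intersection: each row appears min(count in r, count in s) times."""
    return sorted((Counter(r) & Counter(s)).elements())


def intersect_distinct(r, s):
    """Set intersection: each common row appears exactly once."""
    return sorted(set(r) & set(s))


s = [(2,), (2,), (3,), (4,)]

# r has no duplicate rows, so no row can appear more than once in either result.
r_unique = [(1,), (2,), (3,)]
same_all = intersect_all(r_unique, s)
same_distinct = intersect_distinct(r_unique, s)

# With duplicates on both sides the two operators differ.
r_dup = [(2,), (2,), (3,)]
diff_all = intersect_all(r_dup, s)
diff_distinct = intersect_distinct(r_dup, s)
```

When `r_unique` is used, min(m, n) is at most 1 for every row, which is why the two results agree; with `r_dup`, the row `(2,)` survives twice under Intersect All.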

6 Exploiting uniqueness in nonrelational systems

Several researchers [6,7,10,11,19,23] have studied ways to rewrite nested queries as joins to avoid a nested-loops execution plan. When the query is converted to a join, the optimizer is free to choose the most efficient join strategy while maintaining the semantics of the original query; the assumption is that a nested-loops strategy is inefficient and seldom worth considering.

¹ Because the result of Example 9 is a primary key column in both tables, the test for Null is actually unnecessary.

On the other hand, nonrelational systems such as IMS and various object-oriented database systems are essentially navigational, and queries against these data models inherently use a nested-loops approach. In this section, we propose converting joins to subqueries as a possible execution strategy in these systems.

[Figure 2: diagram of the segment types SUPPLIER, PARTS, and AGENT]

Figure 2: Supplier IMS database. We assume that the database organization is HIDAM [8] with parent-child/twin pointers; root segments are key-sequenced. SNO is the key of SUPPLIER; PNO is the key of PARTS; ANO is the key of AGENT. SNO is a virtual column in the relational views of PARTS and AGENT.

6.1 IMS

Part of the multidatabase project at the University of Waterloo consists of finding ways to support ISO-standard SQL queries against relational views of IMS databases. Essentially, the gateway optimizer attempts to translate an SQL query into an iterative DL/I program consisting of nested loops of IMS calls. Queries that cannot be directly translated by the data access layer (which executes the iterative program) require facilities of the post-processing layer that can perform more complex operations, like sorting, but at increased cost [14]. Therefore, nested-loop strategies, which require only the data access layer, may often be cheaper to execute.

Example 10

Consider the select-project-parent/child join query

    SELECT ALL S.*
    FROM SUPPLIER S, PARTS P
    WHERE S.SNO = P.SNO AND P.PNO = :PARTNO

which lists all suppliers for a particular part. Since this query can be handled exclusively by the data access layer, a straightforward nested-loop strategy is:

    21  GU SUPPLIER;
    22  while status = ' ' do
    23      GNP PARTS (PNO = :PARTNO);
    24      while status = ' ' do
    25          output SUPPLIER tuple;
    26          GNP PARTS (PNO = :PARTNO)
    27      od;
    28      GN SUPPLIER
    29  od

The subquery block satisfies conditions similar to those in Theorem 2. For this example, a necessary condition is that at most a single instance (segment) of PARTS can join with each SUPPLIER. Therefore, we can rewrite this query as

    SELECT ALL S.*
    FROM SUPPLIER S
    WHERE EXISTS (SELECT *
                  FROM PARTS P
                  WHERE S.SNO = P.SNO AND P.PNO = :PARTNO).

This transformation simplifies the iterative method above, since the inner nested loop can stop as soon as one qualifying PARTS segment is found:

    GU SUPPLIER;
    while status = ' ' do
        GNP PARTS (PNO = :PARTNO);
        if status = ' ' then output SUPPLIER tuple;
        GN SUPPLIER
    od

This version reduces the number of DL/I calls against the PARTS segment by half, since the second GNP call in the join strategy (line 26) will always fail with a 'GE' (not found) status code. A greater cost reduction may occur if the optimizer can convert a join that specifies non-key attributes in the join predicate to a nested query. For example, suppose the candidate key OEM-PNO appeared in the join condition. In the join strategy, DL/I would have to scan all PARTS segments with the given OEM part number, instead of halting the search when the next segment's key was greater than :PARTNO. The nested version halts the search immediately once DL/I finds a match.
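The call counts claimed for Example 10 can be reproduced with a small simulation. The following Python sketch is not DL/I and does not use any IMS interface: the in-memory hierarchy, the `gnp` helper, and the sample keys are hypothetical stand-ins, with a `None` return playing the role of the 'GE' status code.

```python
# Hypothetical hierarchy: each SUPPLIER root segment (keyed by SNO) owns a
# list of PARTS child keys (PNO values). Every supplier here supplies part 20.
SUPPLIERS = [
    (1, [10, 20, 30]),
    (2, [20, 40]),
    (3, [5, 20]),
]


def gnp(parts, start, partno):
    """Mimic a qualified GNP call: position of the next child at or after
    `start` whose key equals `partno`, or None for a 'GE' (not found) status."""
    for i in range(start, len(parts)):
        if parts[i] == partno:
            return i
    return None


def join_strategy(suppliers, partno):
    """Nested-loop join: keep issuing GNP calls until one fails."""
    output, gnp_calls = [], 0
    for sno, parts in suppliers:              # GU / GN SUPPLIER
        gnp_calls += 1                        # first GNP under this parent
        hit = gnp(parts, 0, partno)
        while hit is not None:
            output.append(sno)
            gnp_calls += 1                    # repeated GNP inside the inner loop
            hit = gnp(parts, hit + 1, partno)
    return output, gnp_calls


def exists_strategy(suppliers, partno):
    """Nested-query form: one GNP per supplier, stopping at the first match."""
    output, gnp_calls = [], 0
    for sno, parts in suppliers:              # GU / GN SUPPLIER
        gnp_calls += 1
        if gnp(parts, 0, partno) is not None:
            output.append(sno)
    return output, gnp_calls


join_result = join_strategy(SUPPLIERS, 20)
exists_result = exists_strategy(SUPPLIERS, 20)
```

Both strategies output the same suppliers, but on this instance the join form issues six simulated GNP calls against three for the nested form. The halving applies to suppliers that do supply the part; for a supplier with no matching segment, both forms issue a single call.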

6.2 Object-oriented systems

In some object-oriented database systems, physical object identifiers (OIDs) take the place of foreign keys; both EXODUS and O2 take this approach [20]. However, OIDs differ from pointers in IMS because each child object points to its parent (see Figure 3). This pointer scheme does not effectively support select-project-join queries in which the selection predicate on the parent class (for example, SUPPLIER) is more restrictive than the predicate on a subordinate class, because the most efficient way to process this type of join would require pointers in the opposite direction [20].

[Figure 3: diagram of the classes Supplier, Parts, and Agent]

Figure 3: Object-oriented data model for the supplier database. Each class has a primary key attribute. Object identifiers (OIDs), implemented as physical pointers, replace foreign keys as the relationship mechanism between objects.

Example 11

Consider the following join between SUPPLIER and PARTS in an object-oriented database system:

    SELECT ALL S.*
    FROM SUPPLIER S, PARTS P
    WHERE S.SNO BETWEEN 10 AND 20
      AND S.SNO = P.SNO AND P.PNO = :PARTNO

which lists all suppliers whose supplier numbers lie in the range 10 to 20 that supply a particular part. A straightforward nested-loop join strategy is:

    retrieve PARTS (PNO = :PARTNO);
    while parts remaining do
        retrieve PARTS.SUPPLIER;
        if PARTS.SUPPLIER.SNO is between 10 and 20
            then output SUPPLIER object;
        retrieve next PARTS (PNO = :PARTNO)
    od

This strategy is inefficient because many SUPPLIER objects may be referenced, only to find that their supplier number is not in the specified range. From Theorem 2, however, we can rewrite this join as the nested query

    SELECT ALL S.*
    FROM SUPPLIER S
    WHERE S.SNO BETWEEN 10 AND 20
      AND EXISTS (SELECT *
                  FROM PARTS P
                  WHERE S.SNO = P.SNO AND P.PNO = :PARTNO).

Assuming we have an index on PARTS by PNO and an index on SUPPLIER by SNO, then a more efficient strategy, depending on the objects' selectivity, may be as follows:

    retrieve SUPPLIER (SNO between 10 AND 20);
    while suppliers remaining do
        retrieve PARTS (PNO = :PARTNO and
                        PARTS.SUPPLIER.OID = SUPPLIER.OID);
        if found output SUPPLIER object;
        retrieve next SUPPLIER (SNO between 10 AND 20)
    od

The idea is to restrict the search in the PARTS class to only those instances that correspond to a SUPPLIER instance whose supplier number matches the range predicate.

7 Related work

Semantic transformation of SQL queries using our uniqueness condition is a form of semantic query optimization [12]. Kim [11] originally suggested rewriting correlated, nested queries as joins to avoid nested-loop execution strategies. Subsequently, several researchers corrected and extended Kim's work, particularly in the aspects of grouping and aggregation [1,6,7,10,17]. None of this work explicitly addresses applicability of these techniques in other database environments; as our examples show, nested loops remain an attractive execution strategy, under certain conditions, with a variety of database architectures.

Much of the earlier work in semantic transformations ignored SQL's three-valued logic and the presence of Null values. To help better understand these problems, Negri et al. [18] and von Bültzingsloewen [22] defined formal semantics for SQL using an extended relational calculus, although neither paper tackled the problems of duplicates. A significant contribution of Negri et al. is the notion of query equivalence classes for syntactically different, yet semantically equivalent, SQL queries.

Several authors discuss the properties of derived functional dependencies in two-valued logic. Klug [13] studies the problem of derived FDs in two-valued relational algebra expressions with the operators projection, selection, restriction, cross-product, union, and difference. The paper's main contributions are (1) showing that the problem of determining the equivalence of two arbitrary relational expressions is undecidable, (2) the definition and proof of a transitive closure operator for FDs, and (3) an algorithm to derive all FDs for an arbitrary expression, without set difference, and with a restricted order of algebraic operators. Maier [15] describes query modification techniques with respect to minimizing the number of rows in tableaux, which is equivalent to minimizing the number of joins in relational algebra. Maier's chase computation uses functional and join dependencies to transform tableaux. Darwen [3] reiterates Klug's work, and gives an exponential algorithm for generating derived FDs. Darwen concentrates on deriving candidate keys for arbitrary algebraic expressions and their applications, notably view updatability and join optimization.

Ceri and Widom [2] discuss derived key dependencies with respect to updating materialized views. They define these dependencies in terms of an algorithm for deducing bound columns, nearly identical to our Algorithm 1 except for a test for disjunctive clauses. In our approach, however, our formal proofs take into account other static constraints and explicitly handle the existence of Null values; our algorithm is simply a sufficient condition for determining candidate keys.

Pirahesh, Hellerstein, and Hasan [19] draw parallels between optimization of SQL subqueries in relational systems and the optimization of path queries in object-oriented systems. Their work in Starburst focuses on rewriting complex Select statements as select-project-join queries. One of the query rewrite rules identifies when duplicate elimination is not required, through isolation of two conditions: uniqueness, termed the 'one-tuple-condition', and existence of a primary key in a projection list, termed the 'quantifier-nodup-condition'. However, we feel that optimization opportunities may be lost upon their insistence that the Starburst rewrite engine convert all queries, whenever possible, to joins. In contrast, we believe that converting joins to subqueries offers possibilities for optimization in nonrelational systems.

8 Concluding remarks

We have formally proved the validity of a number of semantic query rewrite optimizations for a restricted set of SQL queries, and shown that these transformations can potentially improve query performance in both relational and nonrelational database systems. Although testing the conditions for transformation is NP-complete, our algorithm detects a large subclass of queries for which the transformations are valid. Our approach takes into account static constraints, as defined by the SQL2 standard, and explicitly handles the 'semantic reefs' [10] referred to by Kiessling (duplicate rows and three-valued logic), which continue to complicate optimization strategies.

Our original motivation was to find ways to expand the strategy space for optimizing SQL queries, particularly nested queries and joins, against relational views of IMS databases. We believe these transformations are useful for any database model that uses pointers between objects. Pointer-based systems differ from traditional relational systems in that the cost of processing a particular algebraic operator in a pointer-based DBMS can vary significantly from the cost of processing the same operator in a 'pure' relational system. In our future work, we will study the basis of these new tradeoffs. For example, we wish to study the possibility of
• using query transformations based on true-interpreted predicates, which expand the number of execution strategies for certain SQL expressions;
• utilizing inclusion dependencies [16] to prune query graphs, thus implementing King's notion of join elimination;
• expanding the suite of SQL queries considered here; some relevant work on Group By is still underway [24].

References

[1] Stefano Ceri and Georg Gottlob. Translating SQL into relational algebra: Optimization, semantics, and equivalence of SQL queries. IEEE Trans. on Soft. Eng., 11(4):324–345, April 1985.
[2] Stefano Ceri and Jennifer Widom. Deriving production rules for incremental view maintenance. In Proc. VLDB 17, pages 577–589, Barcelona, 1991.
[3] Hugh Darwen. The role of functional dependence in query decomposition. In Relational Database Writings 1989–1991, chapter 10, pages 133–154. Addison-Wesley, 1992.
[4] C. J. Date. An Introduction to Database Systems, volume 1. Addison-Wesley, fifth edition, 1990.
[5] C. J. Date. Relational Database Writings 1985–1989. Addison-Wesley, 1990.
[6] Umeshwar Dayal. Of nests and trees: A unified approach to processing queries that contain nested subqueries, aggregates, and quantifiers. In Proc. VLDB 13, pages 197–208, Brighton, England, 1987.
[7] Richard A. Ganski and Harry K. T. Wong. Optimization of nested queries revisited. In Proc. ACM SIGMOD Conference, pages 23–33, San Francisco, May 1987.
[8] IBM Corporation. IMS/ESA Version 3 General Information, first edition, June 1989. IBM Order Number GC26-4275-0.
[9] International Standards Organization. Information Technology – Database Language SQL 2 Draft Report, December 1990. ISO Committee ISO/IEC JTC1/SC21.
[10] Werner Kiessling. On semantic reefs and efficient processing of correlation queries with aggregates. In Proc. VLDB 11, pages 241–249, Stockholm, 1985.
[11] Won Kim. On optimizing an SQL-like nested query. ACM TODS, 7(3), 1982.
[12] Jonathan J. King. Query Optimization by Semantic Reasoning. UMI Research Press, 1984.
[13] Anthony Klug. Calculating constraints on relational expressions. ACM TODS, 5(3):260–290, 1980.
[14] Per-Åke Larson. Relational Access to IMS Databases: Gateway Structure and Join Processing. University of Waterloo, December 1990. Unpublished manuscript, 70 pages.
[15] David Maier. The Theory of Relational Databases. Computer Science Press, 1983.
[16] R[okia] Missaoui and R[obert] Godin. Semantic query optimization using generalized functional dependencies. Rapport de Recherche 98, Université du Québec à Montréal, Montréal, Québec, September 1989.
[17] M. Muralikrishna. Improved unnesting algorithms for join aggregate SQL queries. In Proc. VLDB 18, pages 91–102, Vancouver, BC, 1992.
[18] M. Negri, G. Pelagatti, and L. Sbattella. Formal semantics of SQL queries. ACM TODS, 16(3):513–534, 1991.
[19] Hamid Pirahesh, Joseph M. Hellerstein, and Waqar Hasan. Extensible/rule based query rewrite optimization in Starburst. In Proc. ACM SIGMOD Conference, pages 39–48, San Diego, June 1992.
[20] Eugene Shekita. High-performance implementation techniques for next-generation database systems. Tech. Rep. 1026, University of Wisconsin at Madison, May 1991.
[21] Jeffrey D. Ullman. Principles of Database and Knowledge-Base Systems, Volume 1. Computer Science Press, 1988.
[22] Günter von Bültzingsloewen. Translating and optimizing SQL queries having aggregates. In Proc. VLDB 13, pages 235–243, Brighton, England, 1987.
[23] Eugene Wong and Karel Youssefi. Decomposition – A strategy for query processing. ACM TODS, 1(3):223–241, 1976.
[24] W. P. Yan and Per-Åke Larson. Performing group by before join. In Proc. Tenth IEEE Int. Conf. on Data Engineering, Houston, February 1994.