A Nested Relational Approach to Processing SQL Subqueries∗

Bin Cao, Antonio Badia

Computer Engineering and Computer Science Department
University of Louisville
Louisville, KY 40292

ABSTRACT

One of the most powerful features of SQL is the use of nested queries. Most research work on the optimization of nested queries focuses on aggregate subqueries. However, the solutions proposed for non-aggregate subqueries are still limited, especially for queries having multiple subqueries and null values. In this paper, we show that existing approaches to queries containing non-aggregate subqueries proposed in the literature (including rewrites) are not adequate. We then propose a new efficient approach, the nested relational approach, based on the nested relational algebra. Our approach directly unnests non-aggregate subqueries using hash joins, and treats all subqueries in a uniform manner, being able to deal with nested queries of any type and any level. We report on experimental work that confirms that existing approaches have difficulties dealing with non-aggregate subqueries, and that our approach offers better performance. We also discuss some possibilities for algebraic optimization and the issue of integrating our approach in a relational database system.

1. INTRODUCTION

SQL is the standard language for data retrieval and manipulation in relational database systems. One of the most powerful features of SQL is nested queries (queries having subqueries). Theoretically, a query can have an arbitrary number of subqueries nested within it. A subquery can be either aggregate or non-aggregate. An aggregate subquery has an aggregate function in its SELECT clause; it always returns a single value as the result. A non-aggregate subquery is linked to the outer query by one of the following operators: EXISTS, NOT EXISTS, IN, NOT IN, θ SOME/ANY, and θ ALL, where θ ∈ {<, ≤, >, ≥, =, ≠}; the result is either a set of values or empty.

Since it is usually inefficient to directly execute nested queries in their original form [10], query unnesting, i.e. rewriting nested queries into flat forms, has been proposed as a better solution [10, 8, 5, 13, 14, 18]. Unfortunately, most proposed approaches concentrate on aggregate subqueries; optimization of non-aggregate subqueries has some limitations. Some proposed approaches are derived from those for aggregate subqueries [8, 6, 1]. The only solutions proposed for non-aggregate subqueries are limited [5, 3, 2], especially for queries with multiple subqueries and null values. The common problems of these proposed approaches are twofold: first, queries cannot be unnested directly and transformations are required; second, each operator is evaluated in a different manner.

In this paper, we focus on non-aggregate subqueries. We propose a new, efficient approach, the nested relational approach, for evaluating nested queries containing non-aggregate subqueries in a uniform manner. To directly unnest non-aggregate subqueries, we use the nested relational algebra instead of the standard relational algebra. The motivation for using the nested relational algebra is the observation that the subquery result is either a set of values or empty, which can be considered as a set-valued attribute in the nested relational model. Conceptually, our nested relational approach unnests a nested query from the top down, and then uses our extended nested relational algebra to compute the predicates associated with the subquery from the bottom up. We will show that our approach not only allows unnesting non-aggregate subqueries directly without transformation, but also allows each operator to be evaluated in a uniform manner. Furthermore, our approach does not require indexes; only hash joins are necessary. Finally, being algebraic, our approach has clear semantics and can be further optimized.

The rest of this paper is organized as follows: Section 2 summarizes related work and gives the motivation. Section 3 defines the nested relational model and our extended nested relational algebra. Section 4 describes the original algorithm for evaluating queries having non-aggregate subqueries and some special cases for optimization. Section 5 shows our experiments. Section 6 concludes the paper.

∗ This research was sponsored by NSF under grant IIS0091928.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2005 June 14-16, 2005, Baltimore, Maryland, USA. Copyright 2005 ACM 1-59593-060-4/05/06 $5.00.

2. RELATED WORK AND MOTIVATION

Significant research efforts have been devoted to the optimization of nested queries since the 1980s. Kim [10] was motivated by the observation that executing correlated nested queries using the traditional nested iteration method can be very inefficient. As a solution, Kim developed query transformation algorithms to rewrite nested queries into equivalent, flat queries which can be processed more efficiently. Several problems of these algorithms were later pointed out and solved in [8]. Dayal [5] refined and extended all of the previous optimization work to a unified approach for processing queries that contain nested subqueries, aggregates and quantifiers, which enables unnesting queries with more than one nesting level. Muralikrishna [13] extended Dayal's approach to enable processing queries that have an arbitrary number of blocks nested within any given block. Finally, work on magic decorrelation optimization within the logic programming community was brought to SQL optimization [18, 17]. Unfortunately, most proposed strategies focus on aggregate subqueries. Unnesting non-aggregate subqueries, especially those with certain operators, still poses problems.

Before presenting the problems, we first describe the terminology used in this paper. We introduce the term linking predicate to refer to the predicate that connects a subquery and an outer query. In a linking predicate, the attribute of the outer query is called the linking attribute, the attribute of the subquery is called the linked attribute, and the operator is called the linking operator. We call EXISTS, SOME/ANY and IN positive linking operators, and NOT EXISTS, ALL and NOT IN negative linking operators. If a query has both positive and negative linking operators, we say it has mixed linking operators. If a subquery contains a predicate which references the relation in the outer query, we say the subquery is correlated to the outer query, and the predicate is called the correlated predicate. The attribute of the outer query in a correlated predicate is called the correlating attribute, and the attribute of the subquery is called the correlated attribute.

If a query has no nested subqueries, we call it a flat query. If a query has nested subqueries, but they are all flat, we call it a one-level nested query; if a query has nested subqueries, but they are all one-level, we call it a two-level nested query, and so on. Since SQL is a block-structured language, the terms inner query block and outer query block are used interchangeably with subquery and outer query respectively in this paper. As an example, assume relations R(A, B, C, D), S(E, F, G, H, I), T(J, K, L), and consider the following query:

Query Q:
select [...] from R
where R.A > 10 and R.B not in
  (select S.E from S
   where S.F = 5 and R.D = S.G and S.H > all
     (select T.J from T
      where T.K = R.C and T.L <> S.I))

Query Q is a two-level nested query. From the top down, the second query block is correlated to the first query block by the predicate R.D = S.G, and the third query block is correlated to the other two query blocks by the predicates T.K = R.C and T.L <> S.I. It has two negative linking operators, NOT IN and ALL.

Unnesting Query Q using existing techniques presents several problems. First, it cannot be unnested directly; instead, rewriting the predicates NOT IN and ALL is required. However, rewriting such predicates may not preserve semantics when null values are present. Because of null values, R.A >ALL (select S.B...) is not equal to an antijoin of R and S on the condition R.A <= S.B [...] R.A >ALL (select S.B...) is not equal to R.A > (select max(S.B)...) or to 0 = (select count(S.B)...) with the condition R.A <= S.B [...]


queries are evaluated by transforming nested queries into flat queries first, and then flat queries are incrementally computed. Their transformation leads to a Cartesian product followed by difference operations, which is likely not to be an efficient approach. In conclusion, existing approaches either have difficulties in dealing with mixed and negative linking operators, or call for special operations. What is needed is an approach which uniformly deals with all types of linking predicates without introducing undue complexity. We propose to use the nested relational algebra because it explicitly represents the intuition that for a given tuple, a non-aggregate subquery provides a set of values (perhaps empty). As a consequence, linking predicates become set predicates which can be represented in a straightforward manner.
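The effect of null values on rewrites of ALL can be checked with a small sketch. The Python below is an illustration only and is not taken from the paper: it models SQL's three-valued logic with None standing for null (unknown), and the helper names gt, gt_all and gt_max are hypothetical. It compares a predicate of the form R.A >ALL (subquery) with the rewrite R.A > (select max(...)).

```python
# Sketch: SQL three-valued logic, with None standing for NULL / unknown.
# All helper names are hypothetical, chosen for this illustration.

def gt(a, b):
    """SQL 'a > b': unknown (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a > b

def gt_all(a, values):
    """SQL 'a > ALL (values)': true on the empty set, false if any
    comparison is false, otherwise unknown if any comparison is unknown."""
    results = [gt(a, v) for v in values]
    if any(r is False for r in results):
        return False
    if any(r is None for r in results):
        return None
    return True

def gt_max(a, values):
    """The rewrite 'a > (select max(...))': max ignores NULLs and
    yields NULL on an empty input."""
    non_null = [v for v in values if v is not None]
    return gt(a, max(non_null) if non_null else None)

# Empty subquery result: ALL is true, while the max rewrite is unknown.
empty_case = (gt_all(5, []), gt_max(5, []))
# Subquery result containing a NULL: ALL is unknown, while max skips the NULL.
null_case = (gt_all(5, [3, None]), gt_max(5, [3, None]))
print(empty_case, null_case)
```

Since a WHERE clause keeps a tuple only when its predicate evaluates to true, the two forms select different tuples in both cases.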

3. DEFINITION OF EXTENDED NESTED RELATIONAL ALGEBRA

Several well-known, basically equivalent definitions of the nested relational algebra have been introduced [19, 15, 11]. For the purpose of the nested relational approach, the definitions need to be extended and slightly modified.

Definition 1. Let U = {A1, ..., An} be a finite set of attributes. A schema over U and the depth of the schema are defined recursively as follows:

1. If A1, ..., An are atomic attributes from U, then R = (A1, ..., An) is a (flat) schema over U with the name R. The depth of the schema R is 0, denoted by depth(R) = 0.

2. If A1, ..., An are atomic attributes from U, and R1, ..., Rm are distinct names of schemas with sets of attributes (denoted by attr(R1), ..., attr(Rm)) such that {A1, ..., An} and attr(R1), ..., attr(Rm) are pairwise disjoint, then R = (A1, ..., An, R1, ..., Rm) is a (nested) schema with the name R. R1, ..., Rm are called subschemas. The depth of the schema R is defined as: depth(R) = 1 + max_{1 ≤ i ≤ m} depth(Ri). □

Definition 2. Let R denote a schema over a finite set U of attributes. The domain of R, denoted by DOM(R), is defined recursively as follows:

1. If R = (A1, ..., An), where Ai (1 ≤ i ≤ n) are atomic attributes, then DOM(R) = DOM(A1) × ... × DOM(An), where "×" denotes Cartesian product.

2. If R = (A1, ..., An, R1, ..., Rm), where Ai (1 ≤ i ≤ n) are atomic attributes and Rj (1 ≤ j ≤ m) are subschemas nested within R, then DOM(R) = DOM(A1) × ... × DOM(An) × 2^DOM(R1) × ... × 2^DOM(Rm), where "×" denotes Cartesian product and 2^DOM(Rj) denotes the power set of the set DOM(Rj) (1 ≤ j ≤ m).

A nested tuple over R is an element of DOM(R). A nested relation r over R is a finite set of nested tuples over R, which is denoted by: sch(r) = R. □

The nested relational algebra has the standard operations of the relational algebra: selection (σ), projection (π), Cartesian product (×), join (⋈), union (∪), intersection (∩), difference (−), plus the nest and unnest operators. Here we modify this algebra slightly to suit our purpose, redefining nest and modifying selection.

Definition 3. Let R = (A1, ..., An) be a flat relational schema, where Ai (1 ≤ i ≤ n) are atomic attributes. Let attr(R) denote the names of all attributes, that is, attr(R) = {A1, ..., An}. Let r be a flat relation over R, that is, sch(r) = R. Let N1 and N2 be two disjoint subsets of attr(R). Then the nest of r by N1 keeping N2, υ_{N1,N2}(r), is defined as:

υ_{N1,N2}(r) := { t′ | ∃t ∈ r ( t′[N1] = t[N1] ∧ t′[N2] = { t″[N2] | t″ ∈ r ∧ t″[N1] = t[N1] } ) }

N1 is called the set of nesting attributes, N2 the set of nested attributes. □

Note that in the traditional definition, only N2 is specified, and N1 is understood as attr(R) − attr(N2). The definition presented here has an implicit projection on N1 ∪ N2 and will be more convenient for our approach; it also highlights the connection between nesting and grouping. Note also that, for simplicity (and since this will be our most frequent use), we have defined nesting over flat (depth(R) = 0) relations only; however, the definition can be extended to the general case without problems. The unnest operator can be defined as usual to be the inverse of nest.

Definition 4. Let R(A1, ..., An, R1, ..., Rm) be a nested relational schema, where Ai (1 ≤ i ≤ n) are atomic attributes and Rj (1 ≤ j ≤ m) are subschemas. Let r be a nested relation over R, that is, sch(r) = R. Let attr(Rj) denote the names of attributes in Rj (1 ≤ j ≤ m). Then a linking predicate over r is defined as one of:

• A θ L {B}, where A ∈ {A1, ..., An}, B ∈ attr(Rj) (1 ≤ j ≤ m), θ ∈ {<, ≤, >, ≥, =, ≠}, L ∈ {SOME/ANY, ALL}.

• {B} θ ∅, where θ ∈ {=, ≠} and B as above.

The semantics of each predicate are obvious. □

Note that (again for simplicity) we only define the linking predicate over one-level (depth(R) = 1) nested relations. For a multi-level (depth(R) ≥ 2) nested relation, A and B might belong to the subschemas with depth d and d + 1 respectively. Thus, the above definition can still be used.

Definition 5. Let r be a relation over schema R, that is, sch(r) = R. The selection of r with respect to C, σ_C(r), where C is a usual predicate or a linking predicate, is defined as usual:

σ_C(r) := { t | t ∈ r ∧ C(t) is true }

Let attr(R) denote the names of all attributes in R, A a subset of attr(R), and C a usual predicate or a linking predicate. The pseudo-selection of r with respect to C keeping A, σ′_{C,A}(r), is defined as:

σ′_{C,A}(r) := { t′ | ∃t ∈ r ( (C(t) is true ∧ t′ = t) ∨ (C(t) is false ∧ t′[A] = {null} ∧ t′[attr(R) − A] = t[attr(R) − A]) ) } □

Thus, a pseudo-selection keeps all tuples that pass the condition (as the usual selection); for the tuples that fail, it keeps the tuple, but it pads the attributes in A with null values. In this paper, either σ or σ′ is called a linking selection if C is a linking predicate. The linking selection with σ follows the usual definition; the linking selection with σ′ applies the pseudo-selection definition. As usual, the definitions of join, semijoin and outer join carry over to the nested algebra from the regular (flat) algebra. To help understand the above definitions, we give an example.
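Before the worked example, the nest operator of Definition 3 can be sketched in code as a grouping step that collects, for each distinct value of the nesting attributes N1, the set of values of the nested attributes N2. This is a minimal sketch and not the authors' implementation; it assumes a relation is a list of Python dicts, that None stands for null, and the function name nest and the key "nested" are choices made for this illustration.

```python
def nest(rel, n1, n2):
    """Nest rel by attributes n1, keeping attributes n2 as one set-valued
    attribute (with the implicit projection on n1 and n2 of Definition 3)."""
    groups = {}  # dicts preserve insertion order, so the output order is stable
    for t in rel:
        key = tuple(t[a] for a in n1)
        groups.setdefault(key, set()).add(tuple(t[a] for a in n2))
    return [dict(zip(n1, key), nested=members) for key, members in groups.items()]

# Toy flat relation over (A, B, C); None plays the role of null.
r = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": None},
]
nested = nest(r, ["A", "B"], ["C"])
print(nested)
```

Grouping through a dictionary corresponds to a hash-based nest; sorting on the nesting attributes would be the other obvious option.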


Example 1. Assume R(A, B, C, D), S(E, F, G, H, I), T(J, K, L) are the relations shown in figures 1(a), 1(b) and 1(c), where R.D, S.I and T.L are the primary keys of the respective relations (marked with #). The relation Temp1 shown in figure 1(d) is obtained by the projection of R.B, R.C, R.D, S.E, S.H, S.I, T.J and T.L on the result of a left outer join of R and S on the predicate R.D = S.G, followed by a left outer join with T on the predicates T.K = R.C and T.L ≠ S.I.

(a) Relation R

A     B     C     D(#)
12    5     2     1
15    4     3     2
16    null  2     3
11    null  5     4

(b) Relation S

E     F     G     H     I(#)
6     5     1     3     1
5     5     1     1     2
3     5     2     5     3
7     5     4     null  4

(c) Relation T

J     K     L(#)
2     2     4
1     2     3
4     3     1
null  4     2

(d) Temp1 = π_{R.B,R.C,R.D,S.E,S.H,S.I,T.J,T.L}((R ⟕_{R.D=S.G} S) ⟕_{T.K=R.C ∧ T.L≠S.I} T)

B     C     D(#)  E     H     I(#)  J     L(#)
5     2     1     6     3     1     2     4
5     2     1     6     3     1     1     3
5     2     1     5     1     2     2     4
5     2     1     5     1     2     1     3
4     3     2     3     5     3     4     1
null  2     3     null  null  null  null  null
null  5     4     7     null  4     null  null

Figure 1: Base Relations

(a) Temp2 = υ_{{R.B,R.C,R.D,S.E,S.H,S.I},{T.J,T.L}}(Temp1)

B     C     D(#)  E     H     I(#)  {(J, L(#))}
5     2     1     6     3     1     {(2, 4), (1, 3)}
5     2     1     5     1     2     {(2, 4), (1, 3)}
4     3     2     3     5     3     {(4, 1)}
null  2     3     null  null  null  {(null, null)}
null  5     4     7     null  4     {(null, null)}

(b) Temp3 = σ′_{S.H>ALL{T.J} ∨ T.L is null, {S.E,S.H,S.I}}(Temp2)

B     C     D(#)  E     H     I(#)
5     2     1     6     3     1
5     2     1     null  null  null
4     3     2     3     5     3
null  2     3     null  null  null
null  5     4     7     null  4

(c) Temp4 = σ_{S.H>ALL{T.J} ∨ T.L is null}(Temp2)

B     C     D(#)  E     H     I(#)
5     2     1     6     3     1
4     3     2     3     5     3
null  2     3     null  null  null
null  5     4     7     null  4

Figure 2: Example of nest and linking selection

The relation Temp2 shown in figure 2(a) is a one-level nested relation resulting from nesting by {R.B, R.C, R.D, S.E, S.H, S.I}, keeping all of {T.J, T.L}. The reason why we keep the primary keys of R, S and T is that they will be used to identify whether the corresponding tuple is empty. We assume that each relation has a unique non-null attribute serving as a primary key. In our case, a primary key with the null value must have been padded by a left outer join operation. If a tuple does not match the join condition, the left outer join operation will pad null values on its attributes, including the primary key. Thus, a tuple with the primary key being null can be considered empty. Another reason we keep the primary keys of R, S and T is that we have to distinguish between an empty tuple with all attributes being null and a tuple with a certain attribute originally null. As a result, our extended relational algebra can be used on relations containing null values without any problem.

The relation Temp3 shown in figure 2(b) is the projection of R.B, R.C, R.D, S.E, S.H and S.I on the result of the linking selection σ′_{S.H>ALL{T.J} ∨ T.L is null, {S.E,S.H,S.I}}(Temp2). Note that it is a pseudo-selection. A negative linking predicate returns true if the subquery result is empty, which is identified by the primary key being null. Thus, we add the condition T.L is null to the linking selection. Under our definition, even though the linking selection over the second tuple returns false, we cannot discard this tuple. We have to keep this tuple by padding null values on S.E, S.H and S.I. The linking selection over all other tuples returns true, thus we keep these tuples in their original forms. One notable point is that for the fourth and the fifth tuples, although the linking selection compares S.H (null) to {T.J} ({null}), the linking selection returns true because the result of the condition T.L is null is true. From this example, we can see that linking selection only compares the linking attribute to the linked attribute whose corresponding primary key is not null. The result of the comparison is based on the standard definition.

The relation Temp4 shown in figure 2(c) is obtained by the projection of R.B, R.C, R.D, S.E, S.H and S.I on the result of the linking selection σ_{S.H>ALL{T.J} ∨ T.L is null}(Temp2). The linking selection over the second tuple returns false, thus we discard this tuple. All other tuples pass the linking selection and become the result.

Note that the projection operation in each subfigure is omitted. □
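The two linking selections of Example 1 can be replayed in a few lines of code. The sketch below is illustrative only: it hard-codes Temp2 from figure 2(a) as Python tuples with None for null, and the function names are hypothetical. It also assumes that a comparison whose outcome is unknown counts as failing, a case the definitions above do not spell out.

```python
# Temp2 from figure 2(a): (B, C, D, E, H, I, {(J, L)}); None stands for null.
TEMP2 = [
    (5, 2, 1, 6, 3, 1, [(2, 4), (1, 3)]),
    (5, 2, 1, 5, 1, 2, [(2, 4), (1, 3)]),
    (4, 3, 2, 3, 5, 3, [(4, 1)]),
    (None, 2, 3, None, None, None, [(None, None)]),
    (None, 5, 4, 7, None, 4, [(None, None)]),
]

def h_gt_all_j(t):
    """S.H >ALL {T.J} OR T.L is null. Members whose key T.L is null mark an
    empty subquery result and are skipped before comparing."""
    h = t[4]
    members = [j for (j, l) in t[6] if l is not None]
    if not members:
        return True   # empty set: the negative linking predicate holds
    if h is None or any(j is None for j in members):
        return False  # assumption: an unknown comparison counts as failing
    return all(h > j for j in members)

def select(rel, pred):
    """Linking selection (sigma): keep only tuples for which pred is true,
    projected on the six atomic attributes."""
    return [t[:6] for t in rel if pred(t)]

def pseudo_select(rel, pred, pad):
    """Pseudo-selection (sigma'): keep failing tuples as well, but pad the
    attribute positions listed in `pad` with None."""
    out = []
    for t in rel:
        row = list(t[:6])
        if not pred(t):
            for i in pad:
                row[i] = None
        out.append(tuple(row))
    return out

temp3 = pseudo_select(TEMP2, h_gt_all_j, pad=(3, 4, 5))  # pad S.E, S.H, S.I
temp4 = select(TEMP2, h_gt_all_j)
print(temp3)
print(temp4)
```

Only the second tuple fails the predicate (1 is not greater than 2), so it is padded in temp3 and dropped in temp4, matching figures 2(b) and 2(c).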


4. THE NESTED RELATIONAL APPROACH TO PROCESSING SUBQUERIES

The motivation of the nested relational approach is based on the observation that the linking predicate is actually a set computation. The basic idea of the nested relational approach is straightforward: a nested query is unnested from the top down first, and then the linking predicates are computed from the bottom up, which requires: (1) the subquery result to be a set (perhaps empty) and (2) a comparison between a single-valued attribute and a set-valued attribute. Such operations can be achieved by the nest operator and the linking selection operator defined in the previous section. In our approach, non-correlated subqueries are executed once, and the result is used by every tuple (virtual Cartesian product). Correlated subqueries can be executed and then connected to outer queries by join or outer join operations. We first present an original approach and then introduce some optimizations.

4.1 Original approach

For a nested query with n query blocks, in each query block, from the top down, let Ri (1 ≤ i ≤ n) denote the relations in the FROM clause; Li (1 ≤ i ≤ n − 1) denote the linking predicate between blocks i and i + 1; Cij (2 ≤ i ≤ n and 1 ≤ j ≤ n) represent the correlated predicate(s) between block i and j (i > j), and ∆i (1 ≤ i ≤ n) represent the predicates in the WHERE clause except Li and Ci. Our algorithm proceeds in three steps.

First, we reduce each query block to one relation by doing all operations in the WHERE clause except the linking predicate and correlated predicate(s), i.e., at each block i, produce Ti = σ_{∆i}(Ri).¹ Note that this is equivalent to producing the complementary set in the magic decorrelation technique [18, 17]; however, we do not produce a magic set.

Second, we create a tree expression for the query as follows: walk through the query in Depth-First, Left-to-Right order; create one node for each query block. We label each node with the corresponding Ti. Between any two adjacent nodes Ti and Ti+1, we add an edge directed from Ti to Ti+1 labeled with the linking predicate Li. If Ti+1 is correlated to Ti, we add the correlated predicate C(i+1)i to the edge. If Ti is correlated to a non-adjacent node Tj (i > j), we add the correlated predicate Cij to the edge between Ti and Ti−1 if all edges between Tj and Ti have been labeled with correlated predicates; otherwise, we add an edge directed from Tj to Ti labeled with the correlated predicate Cij. The root is labeled by the name of the outermost query block, leaves are labeled by the names of the innermost query blocks, and other nodes are labeled by the names of the middle query blocks. A node is called a subroot if it has more than one child. All nodes under a subroot are called a subtree of the subroot. For a given node n, let name(n) be the Ti that serves as the name of the node; linkC(n, m) be the Cij (if one exists) and linkL(n, m) be the Li which label the link between n and one of its children m.

Third, we compute(root, T1). The algorithm, shown as algorithm 1, recursively goes down the tree in depth-first manner, creating a single relation through the use of join or outer join. Note that the structure created in the previous step may be a graph. In this step, we restrict our attention to edges labeled with correlated predicates, in which case we get a maximal spanning query tree for the graph (when all query blocks are correlated). When a leaf is reached, the algorithm goes bottom-up, nesting the relation obtained and applying a corresponding linking selection to reduce the relation. When a subroot is found on the way down, the algorithm chooses a child to continue towards the leaves; on the way up, however, the algorithm will go down again until all paths in the subtree of the subroot have been covered before proceeding up past the subroot.

We do not provide a formal proof for the correctness of algorithm 1 due to lack of space. Basically, we unnest a query in a traditional way, and then nest by each tuple of the outer query, which preserves tuple iteration semantics. Then, the linking selection operator computes linking predicates in a straightforward manner.

Algorithm 1 Compute(node, relational-expression)
Require: a nested query with non-aggregate subqueries
Ensure: the result of the query

PROCEDURE compute(node, rel) {
  if (node is a leaf) then
    return;
  else
    for each n ∈ children(node) do
      Ti = name(n);
      Cij = linkC(node, n);
      Li = linkL(node, n);
      if (Cij ≠ ∅) then
        rel = rel ⟕_{Cij} Ti or rel = rel ⋈_{Cij} Ti;
      else
        rel = rel × Ti;
      end if
      compute(n, rel);
      rel = υ_{{T1.∗,...},{Ti.∗}}(rel);
      rel = σ_{Li}(rel) or σ′_{Li}(rel);
    end for
  end if
}

The algorithm works equally for nested linear queries and nested tree queries.² In the first case, there is only one child for each node; the net effect is that of going down the tree joining or outer joining, or using the Cartesian product when there is no correlation (this Cartesian product is really virtual), and then going up nesting and evaluating the predicates. In the second case, each subroot makes us go down all paths before continuing on the way up. To show how the original nested relational approach processes a nested query, we give an example.

¹ We assume all relations are connected, i.e. no Cartesian product is present.

² A nested linear query is a query in which at most one query block is nested within any query block. A nested tree query is a query in which there is at least one query block which has two or more query blocks nested within it at the same level.

Example 2. Consider Query Q in section 2. The tree expression for this query is shown in figure 3(a). To process this query, we would start from root node T1 : R, performing a left outer join of R and S on the correlated predicate R.D = S.G. Since T2 : S is not a leaf, we keep performing a left outer join with T on the correlated predicates T.K = R.C and T.L ≠ S.I. Node T3 : T is a leaf node, thus we compute the linking predicate L2 : S.H >ALL {T.J}, which is achieved by nesting {R.B, R.C, R.D, S.E, S.H, S.I}, keeping all of {T.J, T.L}, followed by the projection of R.B, R.C, R.D, S.E, S.H, S.I and the linking selection S.H >ALL {T.J}. Then, it goes back to node T2 : S. Since there are no other children under node T2 : S, we compute the linking predicate R.B ≠ALL {S.E} (the NOT IN linking operator is equal to "≠ALL") by nesting {R.B, R.C, R.D}, keeping all of {S.E, S.I}, followed by the projection of R.B, R.C, R.D and the linking selection R.B ≠ALL {S.E}, which goes back to root T1 : R. The final result is obtained by the projection of the desired attributes.

Note that we use both σ and σ′ linking selection in this example. Generally, σ′ is used for computing negative or mixed linking predicates; σ is used for computing the last unfinished linking predicate, or when all unfinished linking predicates are positive. We use a query tree to represent the process of a query evaluation, in which π denotes projection; ⟕ left outer join; σ or σ′ (linking) selection; υ nest. The query tree for processing Query Q is shown in figure 3(b) (intermediate projections are omitted). □
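The evaluation of Query Q in Example 2 can be traced end to end on the base relations of figure 1. The following Python is a sketch under stated assumptions, not the authors' implementation: relations are lists of dicts, None stands for null, join and comparison helpers count a predicate as satisfied only when it is definitely true, and all helper names are hypothetical. Attribute names are single letters, so a string serves as an attribute list.

```python
# Base relations of figure 1; None stands for null.
R = [dict(A=12, B=5, C=2, D=1), dict(A=15, B=4, C=3, D=2),
     dict(A=16, B=None, C=2, D=3), dict(A=11, B=None, C=5, D=4)]
S = [dict(E=6, F=5, G=1, H=3, I=1), dict(E=5, F=5, G=1, H=1, I=2),
     dict(E=3, F=5, G=2, H=5, I=3), dict(E=7, F=5, G=4, H=None, I=4)]
T = [dict(J=2, K=2, L=4), dict(J=1, K=2, L=3),
     dict(J=4, K=3, L=1), dict(J=None, K=4, L=2)]

# Comparisons that hold only when definitely true (null gives "not true").
def eq(a, b): return a is not None and b is not None and a == b
def ne(a, b): return a is not None and b is not None and a != b
def gt(a, b): return a is not None and b is not None and a > b

def left_outer_join(left, right, cond, right_attrs):
    out = []
    for l in left:
        matches = [r for r in right if cond(l, r)]
        if not matches:
            matches = [dict.fromkeys(right_attrs)]  # pad with nulls
        out.extend({**l, **r} for r in matches)
    return out

def nest(rel, n1, n2):
    """Hash-based nest by n1 keeping n2; returns (tuple, members) pairs."""
    groups = {}
    for t in rel:
        key = tuple(t[a] for a in n1)
        groups.setdefault(key, []).append(tuple(t[a] for a in n2))
    return [(dict(zip(n1, k)), members) for k, members in groups.items()]

def holds_for_all(value, members, cmp):
    """value cmp ALL {members}. Each member is (linked attribute, key);
    members with a null key are outer-join padding and are ignored."""
    real = [m[0] for m in members if m[-1] is not None]
    return all(cmp(value, m) for m in real)

# Step 1: reduce each block with its local predicates (T1, T2, T3).
t1 = [r for r in R if r["A"] > 10]
t2 = [s for s in S if s["F"] == 5]
t3 = T

# Going down: unnest with left outer joins on the correlated predicates.
rel = left_outer_join(t1, t2, lambda r, s: eq(r["D"], s["G"]), "EFGHI")
rel = left_outer_join(
    rel, t3, lambda rs, t: eq(t["K"], rs["C"]) and ne(t["L"], rs["I"]), "JKL")

# Going up, inner level: nest, then pseudo-selection on S.H >ALL {T.J}.
level2 = []
for row, members in nest(rel, "BCDEHI", "JL"):
    if not holds_for_all(row["H"], members, gt):
        row.update(E=None, H=None, I=None)  # pad instead of discarding
    level2.append(row)

# Going up, outer level: nest, then ordinary selection on R.B <>ALL {S.E}.
result = [row for row, members in nest(level2, "BCD", "EI")
          if holds_for_all(row["B"], members, ne)]
print(result)
```

On this data the sketch keeps the R tuples with D = 1, 2 and 3. The tuple with D = 4 is dropped because its R.B is null while the subquery result {7} is not empty, so the NOT IN comparison is unknown.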

In the original approach, we compute each linking predicate by using one nesting operation followed by one linking selection. However, examining the parameters of the nest operator, it is clear that higher levels nest by a prefix of the nesting attributes used by lower levels, and use part of the postfix of those nesting attributes as the nested attributes. For instance, see figure 3(b). To compute the linking predicate S.H >ALL {T.J}, we nest by the nesting attributes {R.B, R.C, R.D, S.E, S.H, S.I}; next to compute R.B 6=ALL {S.E}, we nest by the prefix of the previous nesting attributes {R.B, R.C, R.D}, and choose part of the postfix of the previous nesting attributes {S.E, S.I} as the nested attributes. This advantageous feature gives rise to an optimization of the original approach: doing first all nesting operations in a single step, followed by executing the linking selections one by one, instead of intertwining nesting and linking selection. This gives a feasible and efficient implementation due to the fact that only the deepest or first nesting involves true (physical) reordering of the tuples in the relation, all others are conceptual. For example, the nest and the linking selection operations in figure 3(b) can be rewritten as two consecutive nests followed by two linking selections. Even there still exist two nest operators, the operations can be done in one step. Note that the result of two consecutive nesting is a two-level nested relation. As pointed out in section 3, computing the linking predicate S.H >ALL {T.J} only involves S and T , which still can be considered as a linking selection over a one-level nested relation resulted from the projection of S and T on the two-level nested relation. Similarly, computing the linking predicate R.B 6=ALL {S.E} can be regarded as the projection of R and S on the two-level nested relation.

T1 : R L1 : R.B 6= ALL {S.E} C21 : R.D = S.G T2 : S L2 : S.H > ALL {T.J} C32 : T.L S.I C31 : T.K = R.C T3 : T

(a) Tree Expression

4.2.2 πR.B,R.C,R.D σR.B6=ALL{S.E}∨S.I

υ{R.B,R.C,R.D},{S.E.S.I} is null,{S.E,S.H,S.I}

υ{R.B,R.C,R.D,S.E,S.H,S.I},{T.J,T.L} oT.K=R.C∧T.LS.I oR.D=S.G σR.A>10

σS.F =5

R

S

Pipelining

Pipelining is possible in the context of our algorithm. In particular, it seems clear that it should be possible to pipeline the linking selection with the nesting that is immediately adjacent to it; in some cases, the condition may be evaluated at the same time that the nesting is taking place. Thus, the cost of such plans can be further reduced even if no modification to the plan takes place.

is null

0 σS.H>ALL{T.J}∨T.L

Reduce nesting operations

4.2.3

Linear correlation

Algorithm 1 could be further optimized for some special queries to gain better performance. One such case is linear correlation. A nested query is linear correlated if each inner query block is only correlated to its adjacent outer query block. Since the evaluation of the outer query block only depends on its adjacent inner query block, the linear correlated queries can be processed from bottom-up instead of top-down. For instance, Query Q becomes a linear correlated query by getting rid of one of the correlated predicates T.K = R.C in the innermost query block and changing T.L S.I to T.L = S.I. Instead of from top-down, this query can be efficiently processed from bottom-up by performing nesting on the result of a left outer join of S and T with corresponding selections pushed down, followed by computing the linking predicate S.H >ALL {T.J}; then nesting again on the result of a left outer join of R and the previous resulting tuples, followed by computing the linking predicate R.B 6=ALL {S.E}. Note that pipelining can be applied for computing the linking predicate and nest-

T

(b) Query Tree Figure 3: The Nested Relational Approach Applied to Query Q

4.2 Optimizations Algorithm 1 can evaluate nested queries containing nonaggregate subqueries with any type of linking predicates and any level of nesting in a uniform manner. However, there are several alternatives and optimizations possible. We briefly discuss some of the more interesting ones.

196

5.1

ing. Clearly, this strategy benefits from small intermediate results, since only qualified tuples participate in further (outer) join operations.

4.2.4

As described in the previous sections, our nested relational algebra is an extension of the standard relational algebra, thus only the nest operator and the linking selection operator are not supported by current DBMS. To implement the nested relational approach, we wrote stored procedures in procedural SQL, an extension of SQL that adds programming language-like capabilities to SQL (variable declaration, loop and conditional statements). Our approach was to design the program in two stages: first, an SQL query is used to unnest the query by executing (left outer) joins of the base relations in each query block with corresponding selections pushed down. Second, code in the procedure implements the nest operator and the linking selection operator by processing the tuples fetched from the first stage, which we call intermediate result. In order to simulate nesting in an effective manner, we make the database sort the intermediate result. This is equivalent to implementing nest by sorting, which we believe is a realistic possibility (like a group-by, the two obvious options to implement nest are sorting and hashing). We implemented two variants of the nested relational approach: the original nested relational approach implements the nest operator and the linking selection operator separately (which requires two passes over the intermediate result), and the optimized nested relational approach pipelines the nest operator and the linking selection operator (which requires only one pass over the intermediate result). The reasons we use stored procedures to implement the nested relational approach are: (1) they run inside the database so that the communication overhead can be reduced significantly compared to external processing; (2) they can be called by other applications, which makes the nested relational approach more suited for practical use. 
However, there is still communication overhead when the stored procedure fetches data from the SQL engine (as observed in [1], this is one considerable disadvantage that all experimental settings similar to ours must bear). For that reason, one of the main parameters we use in reporting our results is the size of the intermediate result. In our experiments, we created a TPC-H database [4] at scale factor 1 (total data size 1GB) in System A, hosted on a server with an Intel Pentium 4 2.80GHz processor, two 36GB SCSI disks, and 1GB memory, running Red Hat Enterprise Linux WS release 3. We configured a buffer cache of size 32MB, and installed all data and indexes on a single disk. B+ tree indexes on the primary key of each base table were automatically built by System A. Additional indexes on the selected foreign keys were created manually when needed (more on this below).
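The second stage of the optimized variant (nest by sorting, with the linking selection pipelined into the same pass) can be sketched as follows. This is a minimal Python illustration, not the procedural SQL used in the experiments. The row layout, the names `gt_all` and `nest_and_select`, the sample data, and the choice of a `> ALL` linking predicate are all assumptions made for the example. A null inner key is taken to mark the padding tuple that the left outer join produces for an unmatched outer tuple.

```python
from itertools import groupby
from operator import itemgetter

def gt_all(value, nested):
    """SQL three-valued `value > ALL (nested)`: True, False, or None (unknown)."""
    result = True                      # ALL over an empty set is true
    for item in nested:
        if value is None or item is None:
            result = None              # unknown, unless a later comparison is false
        elif not value > item:
            return False               # a single failing comparison decides
    return result

def nest_and_select(rows):
    """rows: (outer_key, outer_value, inner_key, inner_value) from a left outer join.

    Returns the outer keys whose linking predicate evaluates to true.
    """
    selected = []
    ordered = sorted(rows, key=itemgetter(0))   # stands in for the sort done by the database
    for key, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Nest: collect inner values, skipping the padding tuple of unmatched outer tuples.
        nested = [row[3] for row in group if row[2] is not None]
        # Linking selection, applied as soon as the group is complete (one pass).
        if gt_all(group[0][1], nested) is True:
            selected.append(key)
    return selected

# Hypothetical intermediate result.
rows = [
    (1, 100, 10, 50), (1, 100, 11, 80),    # 100 > ALL {50, 80}: true
    (2, 60, 12, 70),                       # 60 > ALL {70}: false
    (3, 40, None, None),                   # no inner match: empty set, true
    (4, 90, 13, None), (4, 90, 14, 10),    # null in the set: unknown
]
print(nest_and_select(rows))  # [1, 3]
```

Because the rows arrive sorted on the nesting attribute, each group can be discarded once its predicate has been evaluated, which is why a single pass over the intermediate result suffices in this sketch.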

Push down nesting

Another idea is to push nesting operations down past the (outer) join. The original nested relational approach unnests the subquery in the standard way, which may produce a very large intermediate relation for later processing. To avoid this problem, we can push the nesting operation down so that it is applied before the (outer) join. This is not always possible; the conditions under which it can be done are similar to the conditions for pushing a group-by operator down past a join [9]. In particular, one situation in which the push down is possible is when the nesting attribute is also the attribute in the join condition, and this condition is an equality. In symbols, if $R$ and $S$ are flat relations with $B, C \in sch(S)$ and $A \in sch(R)$, then $\upsilon_{\{B\},\{C\}}(R \bowtie_{A=C} S) = R \bowtie (\upsilon_{\{B\},\{C\}} S)$. This is a common pattern in our approach. For example, consider Query Q with the third query block removed. It can be processed as follows: first, nest the relation S using $\upsilon_{\{S.G\},\{S.E,S.I\}}$ with the selection S.F = 5 pushed down. Note that these two steps can be pipelined. Then R is left outer joined with the resulting one-level nested relation on the predicate R.D = S.G, and the linking predicate $R.B \neq \mathrm{ALL}\,\{S.E\}$ is computed. The final result is obtained by projecting the desired attributes.
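The processing order described for the simplified Query Q can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: relations are modelled as lists of dictionaries keyed by the attribute names used in the text, the function names and sample data are hypothetical, a hash table stands in for the join method, and the final projection is omitted because the select list of Query Q is not shown in this section.

```python
from collections import defaultdict

def nest_s(s_rows):
    """Apply the selection S.F = 5 and nest (S.E, S.I) by S.G in one pipelined pass."""
    nested = defaultdict(list)
    for s in s_rows:
        if s["F"] == 5 and s["G"] is not None:   # a null S.G can never satisfy R.D = S.G
            nested[s["G"]].append((s["E"], s["I"]))
    return nested

def neq_all(value, items):
    """SQL three-valued `value <> ALL (items)`: True, False, or None (unknown)."""
    result = True                      # ALL over an empty set is true
    for item in items:
        if value is None or item is None:
            result = None
        elif value == item:
            return False
    return result

def simplified_query(r_rows, s_rows):
    nested = nest_s(s_rows)            # nesting happens before the join
    out = []
    for r in r_rows:
        # Left outer join on R.D = S.G: an unmatched R tuple gets an empty set.
        members = nested.get(r["D"], [])
        if neq_all(r["B"], [e for e, _ in members]) is True:
            out.append(r)
    return out

# Hypothetical data.
r_rows = [{"B": 1, "D": 10}, {"B": 2, "D": 10}, {"B": 3, "D": 20}, {"B": 4, "D": 30}]
s_rows = [
    {"E": 1, "F": 5, "G": 10, "I": 0},
    {"E": 7, "F": 5, "G": 10, "I": 0},
    {"E": 3, "F": 4, "G": 20, "I": 0},     # removed by the selection S.F = 5
    {"E": None, "F": 5, "G": 30, "I": 0},  # null makes the predicate unknown
]
result = simplified_query(r_rows, s_rows)
print([r["B"] for r in result])  # [2, 3]
```

In this sketch the join probes one entry per distinct S.G value instead of one per S tuple, so the large flat intermediate relation is never materialized.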

4.2.5

Implementation

Positive linking operators

Although the nested relational approach focuses on dealing efficiently with mixed and negative linking operators, we would like it to be efficient for positive linking operators as well. Existing approaches already evaluate positive linking operators very efficiently: in the case of IN, for instance, the linking predicate is transformed into a semijoin. Our approach, in contrast, would create an outer join, a nest and a selection. That is, for A IN (SELECT B FROM S...) it would generate the expression $\sigma_{A = \mathrm{SOME}\{B\}}(\upsilon_{\{A\},\{B\}}(R \bowtie_{C} S))$, where A is the nesting attribute with $A \in sch(R)$, B is the nested attribute with $B \in sch(S)$, and C is the correlated predicate. The trick in these cases is to realize that the expression above can be simplified to $R \bowtie_{C \wedge A=B} S$. In a more general setting, the expression $\sigma_{A\,\theta\,\mathrm{SOME}\{B\}}(\upsilon_{\{A\},\{B\}}(R \bowtie_{C} S))$ can be shown to be equivalent to $R \bowtie_{C \wedge A\,\theta\,B} S$. If, furthermore, projection push down shows that only attributes from R are needed, the join can be transformed into a semijoin. Thus, through algebraic rewriting, our approach can be shown to be equivalent to the standard one for positive cases. Positive linking operators are discussed further in Section 5.
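The rewriting for the IN case can be checked on a small example. The Python sketch below evaluates the same predicate in two ways: as a nest followed by a SOME linking selection, and as a semijoin on the combined condition. It is only an illustration: the function names, the correlation attribute `K`, and the sample data are hypothetical, and nested loops are used for brevity where a real plan would use hash joins.

```python
def via_nest_and_select(r_rows, s_rows, corr):
    """Nest the correlated S.B values per R tuple, then select on A = SOME {B}."""
    out = []
    for r in r_rows:
        nested_b = [s["B"] for s in s_rows if corr(r, s)]   # empty if nothing matches
        # SOME is true only if at least one comparison is true; nulls never compare equal.
        if any(r["A"] is not None and r["A"] == b for b in nested_b):
            out.append(r)
    return out

def via_semijoin(r_rows, s_rows, corr):
    """Semijoin of R with S on the combined condition C AND A = B."""
    out = []
    for r in r_rows:
        if any(corr(r, s) and r["A"] is not None and r["A"] == s["B"] for s in s_rows):
            out.append(r)
    return out

# Hypothetical data; K is an assumed correlation attribute.
R = [{"A": 1, "K": 1}, {"A": 2, "K": 1}, {"A": None, "K": 2}, {"A": 5, "K": 3}]
S = [{"B": 1, "K": 1}, {"B": None, "K": 1}, {"B": 5, "K": 2}, {"B": None, "K": 2}]

def same_k(r, s):
    return r["K"] == s["K"]

print(via_nest_and_select(R, S, same_k) == via_semijoin(R, S, same_k))  # True
```

Both functions return only the first R tuple here. Null values cause no difference in the positive case, since a tuple is kept only when some comparison is definitely true.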

5. EXPERIMENTS AND PERFORMANCE ANALYSIS

In this section, we compare the performance of the nested relational approach with that of a popular commercial database management system (DBMS), which we call “System A”, evaluating nested queries in its latest version using its “native” approach. Our experiments focus on nested queries containing negative and mixed linking operators, which are not efficiently evaluated by direct unnesting using existing techniques.

5.2

Performance analysis

To verify the efficiency of the nested relational approach, three queries derived from the TPC-H benchmark, and their variations, were tested with four different sizes in our experiments. For each query, we measured the average execution time over multiple runs of the query as the primary performance metric. Before each run, the buffer cache of System A was flushed. The graphs of the results plot the elapsed time on the Y-axis and the size of each query block (outer/inner) on the X-axis. The size of each query block denotes the size of the base table (or a join of base tables) in a query block with the corresponding selections pushed down, but without the linking predicates executed yet. We chose this size as a parameter because it directly relates to the intermediate result, which in turn relates to the overhead of fetching tuples from the SQL engine. This size is controlled by changing the constants in the selections and thus varying their selectivity factors. Note that the size of the final result is proportional to the size of the intermediate result. Our first experiment was done on Query 1, which is a one-level nested query with an ALL linking operator.

proach. The experimental results are shown in Figure 4. Both the original and the optimized nested relational approaches outperform the native approach, even though the native approach benefits from indexes. One notable point about Query 1 is that, with a NOT NULL constraint on the attribute l_extendedprice, System A directly performs an antijoin of orders and lineitem, and its performance is about the same as ours. However, if the NOT NULL constraint is dropped, antijoin is not used, even though there are no null values in l_extendedprice. In general, an ALL or NOT IN linking predicate cannot be evaluated using antijoin when null values exist. The second experiment was on two variations of Query 2, a two-level nested query. The term [any|all] refers to choosing either one.

Query 1: select o_orderkey, o_orderpriority from orders where o_orderdate>=x1 and o_orderdate all (select l_extendedprice from lineitem where l_orderkey=o_orderkey and l_commitdate
