Selecting and Using Views to Compute Aggregate Queries

Foto Afrati
National Technical University of Athens
157 73 Athens, Greece
[email protected]

Rada Chirkova†
Computer Science
North Carolina State University
[email protected]

∗ Due to space limitations, we do not provide proofs in the text. Selected proofs are in the appendix.
† Contact author: NCSU, 900 Main Campus Dr, Venture III Ste 165-C Rm 196, Raleigh, NC 27695, USA; tel 919-513-3506; fax 919-513-4357; [email protected]

Abstract

We consider the problem of obtaining equivalent rewritings of aggregate queries using views. We assume conjunctive views and rewritings, with or without aggregation; in each rewriting, only one view contributes to computing the aggregated query output. Our focus is on minimizing the cost of computing a query workload; we look at query rewriting using existing views and at view selection. In the query-rewriting problem, we give sufficient and necessary conditions for a rewriting to exist. For view selection, we prove complexity results. We also give algorithms for obtaining rewritings and selecting views.

1 Introduction

The problem of answering and rewriting queries using views for conjunctive queries and views has received considerable attention (see, e.g., [ALU01, CDGLV03, CHS02, Hal01] and references therein). However, only a small amount of work addresses the case where the language is extended with aggregation [ACN00, CNS99, GHQ95, GHRU97, NSS98, SDJL96]. Few complete algorithms are known for finding rewritings; moreover, the existing results address special cases. At the same time, using materialized views to compute aggregate queries results in potentially greater benefits than for purely conjunctive queries, as a view with aggregation precomputes some of the grouping/aggregation on some of the query's subgoals. Also, because aggregate queries are often computed on large amounts of data, in many applications it is beneficial to use previously cached results as views to answer a new query [HRU96, GHRU97, ACN00].

We consider aggregate queries and views and address the problems of (1) how to answer the queries using the views and (2) how to optimally select views to materialize.

EXAMPLE 1.1 This is a simple motivating example. On a database with schema {P(A, B), S(B, C, D), T(C, G), U(A, H)} we consider three queries, Q1, Q2, and Q3:

    q1(A, B, max(C)) :- p(A, B), s(B, C, D), t(C, G), u(A, H).
    q2(B, C, sum(H)) :- p(A, B), s(B, C, D), u(A, H).
    q3(B, count)     :- s(B, C, D), t(C, G).

We consider the following views:

    v1(B, max(C))    :- s(B, C, D), t(C, G).
    v2(A, B, sum(H)) :- p(A, B), u(A, H).
    v3(B, C)         :- s(B, C, D).
    v4(C, count)     :- t(C, G).

We can rewrite the three queries as Q1', Q2', Q3' using the four views:

    q1'(A, B, W)      :- v1(B, W), v2(A, B, X).
    q2'(B, C, sum(W)) :- v2(A, B, W), v3(B, C).
    q3'(B, sum(W))    :- v4(C, W), v3(B, C).

Each rewriting uses more than one view, and not all views in a rewriting are necessarily of the same type, i.e., some are without aggregation (view V3), and some use aggregation different from the aggregation of the query they rewrite (view V4 in Q3'). However, in each rewriting only one view (the first) in the body contributes to the value of the aggregated attribute in the head; we call it the central view. We call these rewritings central rewritings. Also, rewritings Q2' and Q3' are themselves aggregate queries, whereas rewriting Q1' is not. Finally, the grouping attributes in the head of the rewriting are, in general, different from the ones in the views used in the body.
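The rewriting Q2' of Example 1.1 can be checked by hand on a small instance. The following Python sketch is only an illustration under stated assumptions: the toy relations `P`, `S`, `U` and all helper names are hypothetical and not part of the paper. It evaluates Q2 directly and through the views V2 and V3, once with V3 stored as a bag and once with V3 stored as a set.

```python
from collections import defaultdict

# Hypothetical toy instance of the schema P(A, B), S(B, C, D), U(A, H).
P = [(1, 10), (2, 10)]
S = [(10, "c", 0), (10, "c", 1)]   # two D-values for the same (B, C)
U = [(1, 5), (1, 7), (2, 3)]

def group_sum(rows):
    """rows: iterable of (group_key, value); returns {group_key: sum of values}."""
    out = defaultdict(int)
    for key, value in rows:
        out[key] += value
    return dict(out)

def q2():
    # q2(B, C, sum(H)) :- p(A, B), s(B, C, D), u(A, H).
    return group_sum(((b, c), h)
                     for (a, b) in P
                     for (b2, c, d) in S if b2 == b
                     for (a2, h) in U if a2 == a)

def v2():
    # v2(A, B, sum(H)) :- p(A, B), u(A, H).
    return group_sum(((a, b), h) for (a, b) in P for (a2, h) in U if a2 == a)

def v3(as_bag):
    # v3(B, C) :- s(B, C, D), stored either as a bag or as a set.
    rows = [(b, c) for (b, c, d) in S]
    return rows if as_bag else sorted(set(rows))

def q2_rewritten(v3_as_bag):
    # q2'(B, C, sum(W)) :- v2(A, B, W), v3(B, C).
    return group_sum(((b, c), w)
                     for ((a, b), w) in v2().items()
                     for (b2, c) in v3(v3_as_bag) if b2 == b)

print(q2())                  # direct evaluation
print(q2_rewritten(True))    # V3 as a bag: agrees with Q2 on this instance
print(q2_rewritten(False))   # V3 as a set: the sum is halved on this instance
```

On this instance the two `S` tuples differ only in D, so a set-valued V3 loses one of the two copies of (B, C) that the query's sum depends on.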

It is not straightforward how to argue that these rewritings are indeed equivalent to the queries. To see that, take Q2' slightly modified as

    q2''(B, C, W) :- v2(A, B, W), v3(B, C).

Interestingly, rewriting Q2'' is not equivalent to query Q2, although its body is the same as that of Q2' and the head contains the same attributes. Also, Q2' (Q3') is equivalent to Q2 (Q3) only if the view V3 is computed under bag semantics [CV93]. □

One contribution of this paper is a complete algorithm which constructs central rewritings given a query and a set of views. The aggregate operators we consider are the common operators max, min, count, sum, count(∗). As aggregation is not a relational operator, proving equivalence of queries to rewritings is more complicated than when queries have no aggregation. Thus, we investigate this problem first and use the results we obtain to develop our algorithm.

When addressing the view-selection problem, we also consider multiaggregate views and queries with the HAVING clause, as in the following example.

EXAMPLE 1.2 Consider a database with three relations, one relation that stores transactions, and two that store information about store branches: P(storeId, product, salePrice, profit, dayOfSale, monthOfSale, yearOfSale); T(storeId, storeChain); W(storeId, storeCity). We consider three queries. Query Q1 gives maximal profit per store chain per product for year 2002. Query Q2 gives total sales per product per year per city, for all stores. Query Q3 uses a HAVING clause in its definition and returns all product names, together with total sales, for each year after 1997 and only for the city of Seattle. Here is one possible SQL expression for Q3:

    SELECT product, yearOfSale, sum(salePrice)
    FROM P, W
    WHERE P.storeId = W.storeId
    GROUP BY product, yearOfSale, storeCity
    HAVING yearOfSale > 1997 AND storeCity = 'Seattle';

These three queries can be rewritten using a single multiaggregate view. In our datalog rule notation the queries, the view and the rewritings can be written as:

    q1(S, Y, max(T))            :- p(X, Y, Z, T, N, L, 02), t(X, S).
    q2(Y, M, U, sum(Z))         :- p(X, Y, Z, T, N, L, M), w(X, U).
    q3(Y, M, F)                 :- q2(Y, M, U, F), M > 97, U = 'Seattle'.
    v1(X, Y, M, sum(Z), max(T)) :- p(X, Y, Z, T, N, L, M).
    q1'(S, Y, max(K))           :- v1(X, Y, 02, F, K), t(X, S).
    q2'(Y, M, U, sum(J))        :- v1(X, Y, M, J, K), w(X, U).
    q3'(Y, M, F)                :- q2'(Y, M, U, F), M > 97, U = 'Seattle'.

View V1 can be used as a central view to rewrite all three queries. □

Our second main result is an algorithm that selects multiaggregate central views optimally given a query workload. We also prove complexity results for the view-selection problem.

The structure of this paper is as follows. Section 2 defines aggregate queries and equivalence among aggregate queries. Section 3 presents our framework, in particular the types of rewritings we consider, the cost model for view selection, and a more technical presentation of our results. In Section 4, we prove necessary and sufficient conditions for a type of rewriting to exist and also provide negative results. In Section 5, we prove that the view-selection problem is NP-complete for sum and count, and provide an exponential-time lower bound on the complexity of view selection for max and min. In Section 6, we give algorithms for obtaining rewritings given a query and views and for selecting views given a query workload.

Related Work and Comparison to Ours

The problems of rewriting queries using views and of view selection for aggregate queries have been considered in papers related to data warehouses and datacubes [GCB+97, Wid95]; in general, the problem considered in this context was to answer each query (or part of a query) using a single view [ACN00, GHQ95, GHRU97, SDJL96]. Recent work [CNS99] has considered the problem of rewriting a query with aggregation using multiple views with aggregation; to determine whether a rewriting that uses views is equivalent to a query with aggregation, the method is to determine whether the rewriting's unfolding (defined similarly to expansion [Ull97]), which uses base relations only, is equivalent to the query [NSS98]. Thus complete algorithms are obtained that construct rewritings that use multiplication as an aggregate operator and use only aggregate views in the body of the rewritings. In the present paper, we use unfoldings to determine equivalence of a central rewriting to a query and obtain complete algorithms. Our central rewritings use only standard aggregation operators and use any views in the body, including multiaggregate views.

On view selection, considerable work has been done on efficiently selecting views, such as in the datacube context (e.g., [GHRU97]), where the focus was on getting efficient algorithms for interesting special cases of the problem. Here we focus on obtaining results on the complexity of the view-selection problem for central rewritings in a framework similar to [CHS02].

Other related work on aggregate query rewriting includes [GT03], which considers rewriting aggregate queries using multiple aggregate views over a single relation, and [AAD+96], which presents fast algorithms for computing the cube operator. [YW01] considers the problem of using views with aggregation to compute queries in temporal databases. Work related to query languages with aggregate capabilities can be found in [BL02], [RSSS98], [ÖÖM87], [LSV02]. [PDST00] proposes a new method for generating alternative query plans, using an interaction of indexes, materialized views, semantic optimization, and query minimization.

Finally, results on equivalence of aggregate queries are presented in [CNS99], which establishes that checking the equivalence of unions of sum- or count-queries is GI-hard and in PSPACE. (GI is the class of problems that are many-one reducible to the graph isomorphism problem.) It is also shown in [CNS99] that checking equivalence of unions of max-queries is $\Pi_2^p$-complete, whereas checking equivalence of unions of conjunctive queries without aggregation is NP-complete.

2 Preliminaries

A database is a collection of relations. A query is a mapping from databases to databases, where usually the output database (the answer) is a database with a single relation. A relation is viewed as either a set or a bag (a.k.a. multiset) of tuples. A bag can be thought of as a set of elements (we call it the core-set of the bag) with multiplicities attached to each element.

A conjunctive query is of the following form:

    $h(\bar{s}) \mathrel{:-} g_1(\bar{s}_1), \ldots, g_k(\bar{s}_k).$

In each subgoal $g_i(\bar{s}_i)$, predicate $g_i$ is a base relation, and every argument in the subgoal is either a variable or a constant. We shall denote the part on the right-hand side of the ":-" (called the body) by $A$. The part on the left-hand side is called the head. An attribute or variable which is not in the head is called a nondistinguished attribute or variable.

An assignment $\gamma$ for $A$ is a mapping of the variables appearing in $A$ to constants, and of the constants appearing in $A$ to themselves. Assignments are naturally extended to tuples and atoms. For a tuple of variables $\bar{s} = (s_1, \ldots, s_k)$ we let $\gamma\bar{s}$ denote the tuple $(\gamma(s_1), \ldots, \gamma(s_k))$. Satisfaction of atoms (and of conjunctions of atoms) by an assignment w.r.t. a database is defined as follows: $g(\gamma\bar{s})$ is satisfied if the tuple $\gamma\bar{s}$ is in the relation that corresponds to the predicate of subgoal $g$.

Under set semantics, a conjunctive query $q(\bar{s}) \leftarrow A$ defines a new relation $q^D$, for a given set database $D$, as follows: $q^D := \{\gamma\bar{s} \mid \gamma \text{ satisfies } A \text{ w.r.t. } D\}$. Under bag-set semantics [CV93], a conjunctive query $q(\bar{s}) \leftarrow A$ defines a new multiset relation $\{\{q\}\}^D$, for a given set database $D$, as follows: $\{\{q\}\}^D := \{\{\gamma\bar{s} \mid \gamma \text{ satisfies } A \text{ w.r.t. } D\}\}$. We say that the query is computed under bag semantics [CV93] if both the input database and the answer are bags. In this case, the collection of satisfying assignments is viewed as a multiset.

We define equivalence under each of the three types of semantics. Two queries are set-equivalent (bag-set-equivalent, bag-equivalent, respectively) if they produce the same set (multiset, respectively) of answers on every database (every set database for the first two cases, every bag database for the third case).

When we compute a query, we will say whether we compute it as a bag or as a set, unless obvious from the context.

We assume in this paper that the data we want to aggregate are real numbers, $\mathbb{R}$. If $S$ is a set, then $\mathcal{M}(S)$ denotes the set of finite multisets over $S$. A $k$-ary aggregate function is a function $\alpha : \mathcal{M}(\mathbb{R}^k) \rightarrow \mathbb{R}$ that maps multisets of $k$-tuples of real numbers to real numbers. An aggregate term is an expression built up using variables and aggregate functions. Every aggregate term gives rise to an aggregate function in a natural way.

We use $\alpha(y)$ as an abstract notation for an aggregate term, where $y$ is the variable in the term. The aggregate queries that we consider here have the aggregate functions count, count(∗), sum, max, and min. Note that count is over an argument whereas count(∗) is the only function that we consider here that takes no argument. In the rest of the paper, we will not refer again to this distinction as our results carry over.

An aggregate query is a conjunctive query augmented by an aggregate term in its head. Thus it has the syntax:

    $q(\bar{s}, \alpha(y)) \leftarrow A,$    (1)

where
• $A$ is a conjunction of predicate atoms that represent relations;
• $\alpha(y)$ is an aggregate term;
• $\bar{s}$ are the grouping attributes of the query;
• $y$ does not appear among $\bar{s}$;
• all the variables in the head occur in the body.

With each aggregate query $q$ as in Equation 1, we associate its core $\breve{q}$, which is a conjunctive query:

    $\breve{q}(\bar{s}, y) \leftarrow A.$    (2)

For the semantics of an aggregate query we think as follows: Let $D$ be a database and $q$ an aggregate query as in Equation 1. When $q$ is applied on $D$ it yields a new relation $q^D$ that is defined by the following three steps: First, we compute the core $\breve{q}$ on $D$ as a bag $B$. In the second step, we form equivalence classes in $B$. Two tuples belong to the same equivalence class if they agree on the values of the grouping attributes. This is the grouping step. The third step is aggregation; it associates with each equivalence class a value that is the aggregate function computed on a bag which contains all values of the input argument of the aggregated attribute in this class. For each class, it returns one tuple which contains the values of the grouping attributes and the computed aggregated value.

We say that an aggregate function $\alpha$ is duplicate-insensitive if the result of $\alpha$ computed over a bag of values is the same as the result of $\alpha$ computed over the core-set of this bag. Otherwise $\alpha$ is duplicate-sensitive [GHQ95]. We say that an aggregate function $\alpha$ is distributive [GCB+97] if there is a function $\gamma$ such that, whenever a multiset $A$ is partitioned into multisets $A_1, \ldots, A_n$, we have $\alpha(A) = \gamma(\{\{\alpha(A_1), \ldots, \alpha(A_n)\}\})$. All the aggregate functions we consider are distributive. In fact, for all $\alpha$, $\gamma = \alpha$, except that for count, $\gamma$ = sum. The following are useful observations.

Proposition 2.1 Let $Q$ be an aggregate query with $\bar{X}$ the grouping tuple and $Y$ the aggregated attribute. Then the following hold: (1) There is a functional dependency $\bar{X} \rightarrow Y$; (2) the answer to $Q$ is set-valued; (3) the projection of the answer to $Q$ on $\bar{X}$ is set-valued. □
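The three-step semantics of an aggregate query (core as a bag, grouping, aggregation) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's formalism: the function name `evaluate_aggregate_query`, the toy relation `P`, and the query q(X, α(Z)) ← p(X, Y, Z) are assumptions made for the example.

```python
from collections import defaultdict

def evaluate_aggregate_query(core_bag, n_group, agg):
    """Evaluate q(s, alpha(y)) from its core, given as a bag of tuples (s_1..s_n, y).

    Step 1 (done by the caller): compute the core on the database as a bag.
    Step 2: form equivalence classes on the first n_group (grouping) columns.
    Step 3: apply the aggregate function to the bag of y-values of each class.
    """
    classes = defaultdict(list)
    for row in core_bag:
        classes[row[:n_group]].append(row[n_group])
    # One tuple per class: grouping values followed by the aggregated value.
    return {key + (agg(ys),) for key, ys in classes.items()}

# Hypothetical relation P(X, Y, Z) and the core of q(X, alpha(Z)) <- p(X, Y, Z).
P = [(1, 3, 4), (1, 5, 4), (2, 7, 9)]
core = [(x, z) for (x, y, z) in P]      # bag: [(1, 4), (1, 4), (2, 9)]

print(evaluate_aggregate_query(core, 1, sum))   # duplicate-sensitive
print(evaluate_aggregate_query(core, 1, max))   # duplicate-insensitive
print(evaluate_aggregate_query(core, 1, len))   # count
```

The result is a Python set with exactly one tuple per group, in line with Proposition 2.1; the duplicate (1, 4) in the core matters for sum and count but not for max.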

Now we define equivalence between aggregate queries. As two aggregate queries with different aggregate functions may be equivalent but we don’t want to treat such cases here, we define equivalence only among compatible queries.

Definition 2.1 (Compatible queries) [NSS98] Two queries are compatible if they have identical heads, up to variable renaming. □

Definition 2.2 (Equivalence of compatible aggregate queries) [NSS98] For two compatible aggregate queries $Q(\bar{x}, \alpha(y)) \leftarrow B(\bar{s})$ and $Q'(\bar{x}, \alpha(y)) \leftarrow B'(\bar{s}')$, $Q \equiv Q'$ if $Q(D) = Q'(D)$ for every database $D$. □

Equivalence among aggregate queries is investigated in [CNS99, NSS98] where it is shown that: (1) Two conjunctive queries are bag-set equivalent if and only if they are isomorphic; (2) equivalence of sum-queries and count-queries can be reduced to bag-set equivalence among their cores; (3) equivalence of max-queries can be reduced to set-equivalence between their cores.

3 Our Framework and Contributions

3.1 Rewritings for Aggregate Queries

Suppose $\mathcal{V}$ is a set of views defined on a database schema $S$, and suppose $D$ is a database instance with schema $S$. Then by $D_{\mathcal{V}}$ we denote the database obtained by computing all the view relations in $\mathcal{V}$ on the database $D$:

    $D_{\mathcal{V}} = \bigcup_{V \in \mathcal{V}} V(D).$

Definition 3.1 (Equivalent Rewriting) Let $Q$ be a query defined on database schema $S$, and let $\mathcal{V}$ be a set of views defined on $S$; let $R$ be a query defined in terms of the views in $\mathcal{V}$. Then $Q$ and $R$ are equivalent, denoted $Q \equiv R$, if and only if for any database $D$, $Q(D) = R(D_{\mathcal{V}})$. □

We say that a view $V$ is set-valued if $V$ is computed and stored to be accessed as a set, and we say that $V$ is bag-valued if $V$ is computed and stored to be accessed as a bag. Whenever it occurs in a rewriting, a bag-valued view $V$ will be denoted by an adornment as $V^b$. The following example shows that equivalence of a rewriting to a query depends on whether conjunctive views are set- or bag-valued.

EXAMPLE 3.1 We have the following query and one view which is the core of the query.

    Q(X, count)  :- p(X, Y, Z).
    V(X)         :- p(X, Y, Z).
    Q'(X, count) :- V^b(X).

The rewriting is equivalent to the query as it is, i.e., when the view is bag-valued. However, if the view is set-valued, then there is no equivalence. (Consider the following database: P = {(1, 3, 4), (1, 5, 6)}. On P, the answer to Q has one tuple (1, 2), the answer to the view computed as a set has one tuple (1), and hence the answer to Q' has one tuple (1, 1).) □

3.2 Central Rewritings

Finding rewritings for aggregate queries introduces additional complications when compared to finding rewritings for conjunctive queries without aggregation: Now a decision has to be made as to the following parameters: (1) What kinds of queries are the views. (2) What kind of query is the rewriting. (3) Whether the views are computed under set or bag-set semantics. (4) Moreover, as a consequence of the choice we make, the aggregate function may or may not depend on some aggregated attributes of the views. Our choice is to depend only on the aggregated attribute of a single view, which we call central view. The rest of the views in the rewriting are called noncentral views.

Aggregate queries (and views that are defined by aggregate queries) are not symmetrical w.r.t. all their attributes. We call the aggregated attribute the output argument of the query. We do not allow joins on output arguments.

Thus in the setting of our paper, we make the following assumptions on the rewritings we consider:

1. The argument of aggregation in the head of the rewriting comes from exactly one (central) view in the body of the rewriting. We call central aggregate operator the aggregate operator of the central view that contributes to the aggregation in the head (there might be several in the case of a multiaggregate central view) or, in the case the central view is purely conjunctive, the aggregate operator in the head of the rewriting.

2. Aggregated outputs of noncentral views are not used in the head of the rewriting.

3. There is no join on output arguments of views.

We call such types of rewritings central rewritings. In all our results, we will assume that we consider only central rewritings.

We may view our problem now as belonging to one of the following three classes: CQ/CQA when the central view is purely conjunctive and the rewriting has aggregation, CQA/CQ when the central view has aggregation and the rewriting is purely conjunctive, and CQA/CQA when both the central view and the rewriting have aggregation. It is easier to state our results for each class separately. Our rewriting template $R$ for all three rewritings is

    $r(\bar{x}, \alpha(y)) \leftarrow v_0(\bar{x}_0, y),\ v_1^b(\bar{x}_1, y_1),\ \ldots,\ v_k^b(\bar{x}_k, y_k).$    (3)

where $\alpha$ is a nontrivial aggregate operator in cases CQ/CQA and CQA/CQA, and is an identity in case CQA/CQ (i.e., the head is $r(\bar{x}, y)$). Also in the case CQ/CQA, we assume a central view too, which covers all subgoals that contain the variable $y$.

Our contribution presented in Section 4 is: For each central rewriting, we obtain sufficient and necessary conditions for a rewriting to exist. This is achieved by using unfoldings of rewritings as explained in the following section.

3.3 Unfoldings of Rewritings

Unlike the case of conjunctive queries without aggregation, where it is straightforward how to define and use expansions [Ull97] (unfoldings reduce to expansions in this case), in presence of aggregation there are more complications. Sometimes, unfoldings are not equivalent to the rewritings, as we will prove in the section that follows. Here we define unfoldings.

We are given a set of views defined as conjunctive aggregate queries over the base predicates, and are given a conjunctive query $R$ over the views. We refer to $R$ as a "rewriting" even in the case when we have not associated it with any particular query (whose rewriting is to be obtained). The unfolding $R^u$ of $R$ is a join of all the subgoals of the views in $R$, followed by some grouping/aggregation. If we denote by $B_{v_i}$ the body of a view $V_i$, then an unfolding $R^u$ of $R$ is defined as follows:

    $r^u(\bar{x}, \beta(y)) \leftarrow B_{v_0} \,\&\, B_{v_1} \,\&\, \ldots \,\&\, B_{v_k}.$    (4)

where (1) $\beta$ is the aggregate operator of the central view of $R$, if the central view is aggregated, or else is the aggregate operator in the head of $R$; (2) the variables in the $B_{v_i}$'s that are also contained in the $\bar{x}_i$ are retained the same as in the rewriting, whereas the others (nondistinguished variables of the view definition) are replaced by fresh variables that are not used in any other $B_{v_j}$ with $j \neq i$. Moreover, $y$ is the attribute which is aggregated in the definition of the central view $V_0$ of $R$ (in case $V_0$ has aggregation). In the purely conjunctive case, the unfolding is equivalent to the expansion [Ull97] of the rewriting.

In our framework, we also consider multiaggregate queries and views. In this case, we assume again that only one aggregated attribute from one (central) view is used to compute the aggregated value in the head of the rewriting. Our central rewritings are extended naturally.

3.4 View Selection and Cost Model

We want to design minimal-cost views, i.e., those views whose use in the rewriting of a query results in the cheapest computation of the query. We take the assumption that the view relations have been precomputed and stored in the database. Thus, we don't assume any cost on computing the views. We assume that the size of a database relation is the number of tuples in it, and that the cost of computing a join is the sum of the sizes of the input relations and of the output relation (this faithfully models the cost of, e.g., hash joins). For conjunctive queries, we measure the cost of query evaluation as the sum of the costs of all the joins during the computation of the query. (We assume that all selections are pushed down as far as they go, and consider only left-linear query trees for joins.) For queries with aggregation, our sum-cost model measures the cost of evaluating a query as the sum of the costs of the three steps in the computation of the query: computation of the conjunctive core, grouping, aggregation. (Let $N$ be the size of the input relation to a unary operator. Then the cost of the grouping operator, which is the same as sorting, is proportional to $N \log N$; the cost of the aggregate operator, which can be computed in a single scan, is $N$.)
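The sum-cost model can be written down as a small calculator. The Python sketch below is an illustration under explicit assumptions: the function names are hypothetical, the proportionality constant of the grouping step is taken to be 1, and the logarithm is taken base 2; the text only fixes the cost of grouping up to a constant factor.

```python
import math

def join_cost(left_size, right_size, output_size):
    """Cost of one join: sizes of both inputs plus size of the output."""
    return left_size + right_size + output_size

def left_linear_plan_cost(input_sizes, intermediate_sizes):
    """Cost of a left-linear join tree.

    input_sizes: sizes of the relations in join order.
    intermediate_sizes[i]: size of the result after joining the first i + 2 relations.
    Returns (total join cost, size of the final result).
    """
    cost = 0
    current = input_sizes[0]
    for next_size, out_size in zip(input_sizes[1:], intermediate_sizes):
        cost += join_cost(current, next_size, out_size)
        current = out_size
    return cost, current

def aggregate_query_cost(input_sizes, intermediate_sizes):
    """Core + grouping + aggregation, with grouping cost assumed to be N * log2(N)."""
    core_cost, n = left_linear_plan_cost(input_sizes, intermediate_sizes)
    grouping = n * math.log2(n) if n > 1 else 0.0   # assumption: constant factor 1
    aggregation = n                                  # single scan
    return core_cost + grouping + aggregation

# Hypothetical sizes: three relations joined left to right.
print(left_linear_plan_cost([100, 50, 20], [80, 64]))
print(aggregate_query_cost([100, 50, 20], [80, 64]))
```

Because stored views carry no computation cost in this model, a view enters the calculation only through its size as a join input.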

Now we present our formulation of the view-selection problem. We assume that we must satisfy a bound (storage limit) on the sum of the sizes of the relations for the views that will be selected to be materialized.

Definition 3.2 (view-selection problem) Given a query workload, an oracle that gives view sizes¹, and a storage limit (a positive integer), return a set of view definitions, such that:
• the views in the set give an equivalent rewriting (of one of our three central rewriting types) of each query in the workload,
• the view relations satisfy the storage limit, and
• the total cost of computing the queries using the rewritings is minimum among the view sets that satisfy the previous two conditions. □

¹ alternatively, given a specific database

For the view-selection problem, we prove the following (in Section 5): (1) Decidability. (2) NP-hardness, even in the case of queries and views without aggregation. (3) Membership in NP for sum and count aggregate queries. (4) Exponential-time lower bound on the complexity for min and max aggregate queries.

4 Results on Equivalence of Unfoldings and Rewriting

We present results that prove that the unfoldings defined in Section 3 are equivalent to the rewritings. We also present negative results that show that the constraints that need to be satisfied for this to hold are tight. As a consequence of the results in this section, equivalence of a rewriting to a query is reduced to equivalence between two aggregate queries (which is known how to check [NSS98]). In brief, for the cases where we prove that the rewriting is equivalent to its unfolding, it suffices to check whether the unfolding is equivalent to the query.

4.1 Case CQ/CQA: central view CQ and rewriting CQA

Theorem 4.1 Let $R$ be a CQ/CQA rewriting. Suppose that all noncentral views are without aggregation and are bag-valued. Then $R \equiv R^u$. □

Proof: Here all views are without aggregation. Given a database $D$ on the base relations, the result of computing the bag-join of all subgoals in the body of $R^u$ is equivalent to computing each view relation separately as a bag and then computing the bag-join of all the views in the body of the rewriting. After that, the same grouping and aggregation is applied in both $R$ and $R^u$.

The following result relaxes the requirement for noncentral views in the case of duplicate-insensitive functions.

Theorem 4.2 For a CQ/CQA rewriting $R$ with central aggregation max (min), $R \equiv R^u$. □
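The contrast between Theorem 4.1 (bag-valued noncentral views are required for sum) and Theorem 4.2 (no such requirement for max) can be observed on a toy CQ/CQA rewriting. The Python sketch below is an assumption-laden illustration: the relations `P`, `S`, the views `v0`, `v1`, and the rewriting r(X, α(Y)) ← v0(X, Y), v1(X) are hypothetical and chosen only to make the multiplicities visible.

```python
from collections import defaultdict

# Hypothetical base relations P(X, Y) and S(X, Z).
P = [(1, 5), (1, 7)]
S = [(1, "a"), (1, "b")]

def aggregate(rows, agg):
    """Group (x, y) pairs on x and apply agg to the bag of y-values."""
    groups = defaultdict(list)
    for x, y in rows:
        groups[x].append(y)
    return {x: agg(ys) for x, ys in groups.items()}

def unfolding(agg):
    # r^u(X, agg(Y)) <- p(X, Y), s(X, Z'), evaluated under bag-set semantics.
    return aggregate([(x, y) for (x, y) in P for (x2, _z) in S if x2 == x], agg)

def rewriting(agg, noncentral_as_bag):
    # Central view v0(X, Y) <- p(X, Y); noncentral view v1(X) <- s(X, Z).
    v0 = [(x, y) for (x, y) in P]
    v1 = [x for (x, _z) in S]
    if not noncentral_as_bag:
        v1 = sorted(set(v1))            # store v1 as a set instead of a bag
    # r(X, agg(Y)) <- v0(X, Y), v1(X)
    return aggregate([(x, y) for (x, y) in v0 for x2 in v1 if x2 == x], agg)

print(unfolding(sum), rewriting(sum, True), rewriting(sum, False))
print(unfolding(max), rewriting(max, True), rewriting(max, False))
```

With sum, only the bag-valued noncentral view reproduces the unfolding's answer on this instance; with max, both storage choices agree with it.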

Negative Results

Proposition 4.1 Let $R$ be a CQ/CQA rewriting with central aggregation sum or count. Suppose that either there is a noncentral view with aggregation, or there is a set-valued noncentral view. Then the unfolding is not set-equivalent to the rewriting. □

4.2 Case CQA/CQ: central view CQA and rewriting CQ

Lemma 4.1 For every CQA/CQ rewriting $R$, if $R$ is equivalent to its unfolding $R^u$, then all grouping attributes of the central view of $R$ appear in the head of $R$. □

Query Q1 in Example 1.1 is rewritten using a view whose grouping attributes are a proper subset of the arguments in the head of the rewriting.

The following theorem proves equivalence of a rewriting to its unfolding for all aggregate functions that we consider, under some restrictions on the view definitions and on the form of the rewriting.

Theorem 4.3 Consider a CQA/CQ rewriting $R$. Suppose that (i) all noncentral views of $R$ have no aggregation, (ii) $R$ does not have nondistinguished attributes in its body (except possibly noncentral aggregated arguments in $R$'s central view, in case of multiaggregate views), (iii) noncentral views do not have nondistinguished attributes in their definition, and (iv) all grouping attributes of the central view appear in the head of $R$. Then the following hold:
• $R$ is equivalent to its unfolding $R^u$, and
• the answer to $R$ on any set-valued database is a set. □

Although, as we prove in the negative-results section, none of the conditions in Theorem 4.3 can be relaxed for sum or count queries, they can be relaxed for max and min queries:

Theorem 4.4 Let $R$ be a CQA/CQ rewriting with central aggregation max (min). Suppose that all the grouping arguments of the central view of $R$ appear in the head of $R$. Then $R$ is set-valued and is equivalent to its unfolding $R^u$. □

Negative Results

The question arises whether we can extend Theorem 4.3 by relaxing one of the restrictions. Here we prove that it is not possible for aggregate functions sum and count. A counterexample is rewriting Q2'' in Example 1.1. However, it might seem that there could be cases where the unfolding we defined in the previous section does work. In the following proposition, we prove that, for aggregate operators sum or count, the following holds: For any rewriting and its unfolding (the way we define unfoldings), such that some of the restrictions in Theorem 4.3 are relaxed, the unfolding is not equivalent to the rewriting.

Proposition 4.2 Consider a CQA/CQ query $R$ with central aggregation sum or count. Suppose that noncentral aggregated arguments of $R$ cannot be used in the head of $R$ or in joins in the body of $R$. Moreover, suppose that at least one of the following holds:
1. There is a noncentral view in $R$ defined by an aggregate query.
2. There is a noncentral view in $R$ defined by a query with nondistinguished variables.
3. There are nondistinguished variables in $R$ (other than noncentral aggregation in the central view of $R$).
Then $R$ is not set-equivalent to its unfolding $R^u$ (the way we define $R^u$). □

We prove this proposition in three propositions, each relaxing one of the restrictions. The proof techniques are similar in all three cases.

4.3 Case CQA/CQA: central view CQA and rewriting CQA

Here, to prove $R \equiv R^u$, we choose to prove that the standard query plans for $R$ and $R^u$ can be transformed to the same plan $R^{int}$. We give an example to show our technique.

EXAMPLE 4.1 Consider the following rewriting and its unfolding:

    r(X, T, sum(W))        :- v4(X, Z, W), v5(Z, T).
    v4(X, Z, count(∗))     :- p(X, Y, Z).
    v5(Z, T)               :- u(Z, T, L).
    r^u(X, T, count(∗))    :- p(X, Y, Z), u(Z, T, L).

Let $R^{int}$ be defined as follows:

    r^int(X, T, sum(W))       :- r̄^int(X, T, Z, W).
    r̄^int(X, T, Z, count(∗)) :- p(X, Y, Z), u(Z, T, L).

We show that $R \equiv R^u$ by showing that $R \equiv R^{int}$ and $R^{int} \equiv R^u$. □

Theorem 4.5 Let $R$ be a CQA/CQA rewriting. Suppose that noncentral views are without aggregation and are bag-valued. Then $R \equiv R^u$. □

Negative Results

Proposition 4.3 Let $R$ be a CQA/CQA rewriting with central aggregation sum or count. Suppose that either there is a noncentral view with aggregation, or there is a set-valued noncentral view. Then the unfolding is not set-equivalent to the rewriting. □

5 View Selection

5.1 Decidability

Theorem 5.1 The view-selection problem under the storage limit is decidable for finite workloads of conjunctive queries with aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider. □

The query workloads we consider may contain queries both with and without aggregation.

5.2 NP-completeness for sum or count

In this section we present an NP-completeness result for the view-selection problem for workloads of sum or count queries. As the proof also works for purely conjunctive queries, views, and rewritings under bag semantics, the view-selection problem for that case is also NP-complete. (Interestingly, under bag-set semantics, the view-selection problem for conjunctive queries, views, and rewritings has an exponential-time lower bound; cf. [CHS02].)

Theorem 5.2 The view-selection problem under the storage limit is NP-complete for finite workloads of conjunctive queries with sum- or count-aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider. □

5.3 Lower Bound for Workloads of max or min

We prove an exponential-time lower bound for view selection under a storage limit for max- and min-queries.

Theorem 5.3 The view-selection problem under the storage limit has an exponential-time lower bound for finite workloads of conjunctive queries with max- or min-aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider. □

6 Algorithms

As a consequence of the results in Section 4, we obtain algorithms which are based on the following observations.

Proposition 6.1 In a CQA/CQ rewriting, the set of all grouping attributes of the central view is a subset of the set of all grouping attributes of the rewriting. We call this central view grouping-complete.

In a CQA/CQA rewriting, the set of the grouping attributes of the rewriting is a union of subsets of the grouping attributes in the central view and the non-aggregated attributes in noncentral views. We call this central view grouping-incomplete. □

We consider a rewriting $R$ and define its reduced-core rewriting $R^r$ to be a conjunctive rewriting whose head attributes are $R$'s grouping attributes only, and whose body uses reduced-core views. Given an aggregate view $V$, we define its reduced-core view $V^r$ to be a view whose body is the body of $V$ and whose head is a new predicate name $V^r$; the arguments in the head of $V^r$ are all the grouping attributes of $V$. The reduced-core rewriting is a conjunctive query, and the following holds:

Proposition 6.2 Let R^r be a reduced-core rewriting of a CQA/CQA or CQA/CQ rewriting R. Then R^r is an equivalent rewriting of the reduced-core query using the reduced-core views. □
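As an illustration of the reduced-core view defined above, the following minimal Python sketch drops the aggregated arguments from a view's head and keeps its body unchanged. The class names, the `_r` suffix for the new predicate name, and the example view are assumptions made for this sketch only; they do not come from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# An atom is (predicate name, argument variables); hypothetical encoding.
Atom = Tuple[str, Tuple[str, ...]]


@dataclass(frozen=True)
class AggregateView:
    name: str
    grouping: Tuple[str, ...]                 # grouping attributes in the head
    aggregates: Tuple[Tuple[str, str], ...]   # (function, aggregated variable)
    body: Tuple[Atom, ...]


@dataclass(frozen=True)
class ConjunctiveView:
    name: str
    head: Tuple[str, ...]
    body: Tuple[Atom, ...]


def reduced_core(view: AggregateView) -> ConjunctiveView:
    """Same body as the view; head keeps only the grouping attributes."""
    return ConjunctiveView(name=view.name + "_r", head=view.grouping, body=view.body)


# Hypothetical view v(B, C, sum(H)) :- p(A, B), s(B, C, D), u(A, H).
v = AggregateView(
    name="v",
    grouping=("B", "C"),
    aggregates=(("sum", "H"),),
    body=(("p", ("A", "B")), ("s", ("B", "C", "D")), ("u", ("A", "H"))),
)
v_r = reduced_core(v)
```

Here `v_r` stands for the conjunctive view v_r(B, C) over the same three subgoals.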

6.1 Constructing Rewritings

In this section, given a query and a set of views, we construct all equivalent rewritings of the query using the views. The problem is reduced to the problem of obtaining rewritings for purely conjunctive queries. For lack of space, we describe only the case for max queries and CQA/CQA or CQA/CQ rewritings. The other cases are similar, with the additional observation that, in the duplicate-sensitive cases, we find rewritings for the purely conjunctive queries whose unfolding is isomorphically mapped on the query. In the following algorithm, Q^r is the reduced-core query of a query Q, and V^r is the set of reduced-core views of the views. We use an algorithm in the literature [ALU01] to find all rewritings of Q^r using V^r.

Procedure Find-R.
Input: query Q, set of views V
    Consider Q^r, V^r.
    Find all rewritings of Q^r using V^r.
    For each rewriting R^r do:
        Consider the expansion R^{r-exp}.
        For each containment mapping from Q^r to R^{r-exp} do:
            If there is a view in the rewriting such that its aggregated attribute is the image of the aggregated attribute of the query, do:
                Call this the central view.
                If the central view is grouping-incomplete then construct CQA/CQA rewriting.
                If the central view is grouping-complete then construct CQA/CQ rewriting.
            end
        end
    end
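The inner loop of Find-R iterates over containment mappings between two conjunctive queries. The following brute-force Python sketch enumerates such mappings. It assumes queries without constants, encodes an atom as a (predicate, arguments) pair, and uses a made-up example query and expansion; it is not the rewriting algorithm of [ALU01].

```python
from itertools import product
from typing import Dict, Iterator, Sequence, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # hypothetical encoding: (predicate, variables)


def containment_mappings(
    src_head: Sequence[str],
    src_body: Sequence[Atom],
    dst_head: Sequence[str],
    dst_body: Sequence[Atom],
) -> Iterator[Dict[str, str]]:
    """Yield each variable mapping h with h(src_head) == dst_head that sends
    every atom of src_body to some atom of dst_body."""
    if len(src_head) != len(dst_head):
        return
    # For each source atom, the target atoms with the same predicate and arity.
    choices = [
        [a for a in dst_body if a[0] == pred and len(a[1]) == len(args)]
        for pred, args in src_body
    ]
    for combo in product(*choices):
        pairs = list(zip(src_head, dst_head))
        for (_, s_args), (_, d_args) in zip(src_body, combo):
            pairs.extend(zip(s_args, d_args))
        h: Dict[str, str] = {}
        # The mapping is valid only if every variable gets a single image.
        if all(h.setdefault(s, d) == d for s, d in pairs):
            yield h


# Hypothetical reduced-core query q(A, B) :- p(A, B), s(B, C, D)
# and expansion q(X, Y) :- p(X, Y), s(Y, Z, W), s(Y, Z2, W2).
mappings = list(
    containment_mappings(
        ("A", "B"),
        [("p", ("A", "B")), ("s", ("B", "C", "D"))],
        ("X", "Y"),
        [("p", ("X", "Y")), ("s", ("Y", "Z", "W")), ("s", ("Y", "Z2", "W2"))],
    )
)
```

In this example the s-subgoal of the query can be sent to either s-subgoal of the expansion, so two mappings are produced. The search is exponential in the number of subgoals, which is acceptable only for illustration.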

6.2 Selecting Views

We present an algorithm that selects multiaggregate views to be used as central views, given a query workload. It is particularly efficient in the case of queries with the HAVING clause, where a single multiaggregate central view avoids joining several aggregate views. The algorithm selects all maximal such views. For a query workload, a view is maximal if there does not exist another multiaggregate view with more aggregated arguments which can replace it in all the rewritings in the workload.

The algorithm considers each query Q in the workload and constructs a pair of views (V_c^Q, V_n^Q) which essentially represent a central minimal view and a collective noncentral view. We may think of the pair (V_c^Q, V_n^Q) as providing a rewriting for Q with the minimum number of subgoals in the central view V_c^Q. We call them characteristic views of the query Q. In the next step, the algorithm considers all combinations of those pairs and finds compatible pairs of characteristic views. Two pairs are compatible if (1) the two central views can be combined in a single multiaggregate view V_m, and (2) V_m can be used to rewrite both queries.

Proposition 6.3
1. Each query has a bounded number of characteristic views.
2. In any central rewriting of a query Q, the views used in the rewriting can also be used to produce central rewritings of characteristic views.
3. It is decidable to tell whether two pairs of characteristic views are compatible. □

Theorem 6.2 The algorithm finds all maximal multiaggregate views for a query workload. □
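The following Python sketch gives a rough feel for combining central views and keeping the maximal ones. It rests on a strong simplifying assumption that is not taken from the paper: two central views are treated as combinable exactly when they have the same body and the same grouping attributes, and one view can replace another when it carries a strict superset of its aggregated arguments. The paper's compatibility test also requires that the combined view can be used to rewrite both queries, which this sketch does not check. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...]]


@dataclass(frozen=True)
class CentralView:
    body: FrozenSet[Atom]
    grouping: Tuple[str, ...]
    aggregates: FrozenSet[Tuple[str, str]]  # e.g. ("sum", "H")


def combine(v1: CentralView, v2: CentralView) -> Optional[CentralView]:
    """Merge two central views into one multiaggregate view (simplified test)."""
    if v1.body != v2.body or v1.grouping != v2.grouping:
        return None
    return CentralView(v1.body, v1.grouping, v1.aggregates | v2.aggregates)


def maximal(views: List[CentralView]) -> List[CentralView]:
    """Keep views that no other view with the same body and grouping dominates."""
    return [
        v
        for v in views
        if not any(
            w.body == v.body and w.grouping == v.grouping and v.aggregates < w.aggregates
            for w in views
        )
    ]


body = frozenset({("p", ("A", "B")), ("u", ("A", "H"))})
v_sum = CentralView(body, ("B",), frozenset({("sum", "H")}))
v_max = CentralView(body, ("B",), frozenset({("max", "H")}))
v_m = combine(v_sum, v_max)
kept = maximal([v_sum, v_max, v_m])
```

Under these assumptions only the combined view `v_m` survives in `kept`.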

Theorem 6.1 If there is a central rewriting of a query Q using views V, then the algorithm will find it. □

References

[AAD+96] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J.F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proceedings of VLDB, pages 506–521, 1996.

[ACN00] S. Agrawal, S. Chaudhuri, and V.R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In Proceedings of VLDB, pages 496–505, 2000.

[ALU01] F. Afrati, C. Li, and J.D. Ullman. Generating efficient plans for queries using views. In Proceedings of ACM SIGMOD, 2001.

[BL02] M. Benedikt and L. Libkin. Aggregate operators in constraint query languages. JCSS, 64:628–654, 2002.

[CDGLV03] D. Calvanese, G. De Giacomo, M. Lenzerini, and M.Y. Vardi. View-based query containment. In Proc. PODS, pages 56–67, 2003.

[CHS02] R. Chirkova, A.Y. Halevy, and D. Suciu. A formal perspective on the view selection problem. VLDB Journal, 11(3):216–237, 2002.

[CNS99] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using views. In Proceedings of PODS, pages 155–166, 1999.

[CV93] S. Chaudhuri and M. Vardi. Optimization of real conjunctive queries. In Proc. PODS, pages 59–70, 1993.

[GCB+97] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, and M. Venkatrao. Data cube: A relational aggregation operator generalizing Group-by, Cross-Tab, and sub totals. Data Mining and Knowledge Discovery, 1(1):29–53, 1997.

[GHQ95] A. Gupta, V. Harinarayan, and D. Quass. Aggregate-query processing in data warehousing environments. In Proceedings of VLDB, pages 358–369, 1995.

[GHRU97] H. Gupta, V. Harinarayan, A. Rajaraman, and J.D. Ullman. Index selection for OLAP. In Proceedings of ICDE, pages 208–219, 1997.

[GT03] S. Grumbach and L. Tininini. On the content of materialized aggregate views. JCSS, 66:133–168, 2003.

[Hal01] Alon Y. Halevy. Answering queries using views: A survey. VLDB Journal, 10(4):270–294, 2001.

[HRU96] V. Harinarayan, A. Rajaraman, and J. Ullman. Implementing data cubes efficiently. In Proceedings of SIGMOD, pages 205–216, 1996.

[LSV02] J. Lechtenbörger, H. Shu, and G. Vossen. Aggregate queries over conditional tables. Journal of Intelligent Information Systems, 19(3):343–362, 2002.

[NSS98] W. Nutt, Y. Sagiv, and S. Shurin. Deciding equivalences among aggregate queries. In Proceedings of PODS, pages 214–223, 1998.

[ÖÖM87] G. Özsoyoglu, Z.M. Özsoyoglu, and V. Matos. Extending relational algebra and relational calculus with set-valued attributes and aggregate functions. TODS, 12:566–592, 1987.

[PDST00] L. Popa, A. Deutsch, A. Sahuguet, and V. Tannen. A chase too far? SIGMOD Record, 29(2), 2000.

[RSSS98] K.A. Ross, D. Srivastava, P.J. Stuckey, and S. Sudarshan. Foundations of aggregation constraints. Theoretical Computer Science, 193(1-2):149–179, 1998.

[SDJL96] D. Srivastava, S. Dar, H.V. Jagadish, and A.Y. Levy. Answering queries with aggregation using views. In Proceedings of VLDB, pages 318–329, 1996.

[Ull97] Jeffrey D. Ullman. Information integration using logical views. In Proceedings of ICDT, 1997.

[Wid95] Jennifer Widom. Research problems in data warehousing. In Proceedings of CIKM, 1995.

[YW01] J. Yang and J. Widom. Incremental computation and maintenance of temporal aggregates. In Proceedings of ICDE, pages 51–62, 2001.

A From Section 4

A.1 Proof of Theorem 4.3

Theorem 4.3 Consider a CQA/CQ rewriting R. Suppose that (i) all noncentral views of R have no aggregation, (ii) R does not have nondistinguished attributes in its body (except possibly noncentral aggregated arguments in R's central view, in the case of multiaggregate views), (iii) noncentral views do not have nondistinguished attributes in their definition, and (iv) all grouping attributes of the central view appear in the head of R. Then the following hold:

• R is equivalent to its unfolding R^u, and
• the answer to R on any set-valued database is a set. □

Proof (sketch): The proof has two parts:

Part 1: Suppose the central view V of the rewriting has just one aggregated argument (i.e., we do not consider multiaggregate views). We first show that the answer to R is a set on any set-valued database; thus, it is enough to show set equivalence of R and R^u on set-valued databases. We then transform the standard query plan for R into a set-equivalent query plan that is the standard query plan for R^u, as follows.

We fix a set-valued database D. We observe that V can be computed on D by taking a bag projection of the body of V on the head attributes of V, and by then doing V's grouping and aggregation on the result. We use this observation to argue that we can compute R on D as follows: (1) take a join of the bodies of all the views in R; (2) project the resulting relation on the head attributes of R under bag semantics; (3) group the resulting tuples into equivalence classes, based on the union of the grouping arguments of V and of the head arguments of R, and then aggregate using V's aggregation function; as a result, we obtain the value of V's aggregation for each equivalence class w.r.t. the grouping attributes of the view V.

Because the grouping attributes of V are a subset of the head arguments of R, the result of this computation is the relation for R on D. We then observe that it is trivial to transform this plan into standard computation for R^u.

Part 2 (multiaggregate central view V): We reduce this case to the previous case by projecting out extra aggregate arguments of the central view V and thus obtaining a new rewriting R′. We then argue that R and R′ have the same unfolding, and use transitivity of set equivalence to show R ≡ R^u.

A.2 Proof of Proposition 4.2

Proposition 4.2 is proven in three parts with similar proofs, each for one of the clauses in the statement. We give here one of the three proofs.

Proposition A.1 Consider a CQA/CQ query R with central aggregation sum or count. If at least one noncentral view in R has aggregation (with any aggregation function(s)), and if noncentral aggregated arguments of R cannot be used in the head of R or in joins in the body of R, then R is not set-equivalent to its unfolding R^u (the way we define R^u). □

Proof (sketch): Consider an arbitrary CQA/CQ query R with central aggregation sum or count, such that R has a noncentral view with aggregation; let R^u be the unfolding of R. We prove the Proposition by assuming R ≡ R^u and by then constructing a database on which the answers to R and R^u are different as sets; we thus arrive at a contradiction.

Recall that, by definition of R^u, the head variables of R and R^u are the same. Here's the idea of what we show on a counterexample database D, for the case where R has a noncentral aggregate view. For a fixed assignment x̄ of the grouping attributes in the head of R^u, we ascertain that the answer to R^u on D has a tuple, for x̄, with some value z of the aggregated argument Z of R^u. We argue that, for the same assignment x̄, none of the tuples in the answer to R on D has a value of Z that is equal to z. Thus, the answers to R and R^u on D are different as sets.

To produce this counterexample, we build a database D in such a way that the body of the aggregate noncentral view V1 in R has exactly two tuples that correspond to a fixed assignment x̄ of the grouping arguments X̄ of R^u; we build the rest of the database D to ensure that the answer to each of R and R^u on D has at least one tuple whose values of X̄ are x̄. (We build the database D as a union, on each base relation separately, of two canonical databases for R, which result from assigning two different variable names to the argument to be aggregated in V1.)

Now, because V1 has aggregation, the answer to V1 on D has exactly one tuple that corresponds to this assignment x̄; recall that the body of V1 has two tuples for x̄. For this reason, when we compute R and R^u on the database D, the result of joining all the subgoals of R^u has at least two copies of each tuple in the body of the central view of R. Recall that the aggregated argument Z in the head of R^u is also the aggregated argument in the head of the central view V of R. We argue that, for this reason, for the assignment x̄ of the grouping arguments of R^u, the (only) tuple in the answer to R^u on D has the value of Z that is at least twice the value of Z in any tuple for x̄ in the answer to R on D.

Indeed, let there be j tuples in the body, on the database D, of the central view V of R. We construct D in such a way that each tuple in the body of V has the value 1 of the argument Y that is aggregated in the head of V. Therefore, the value of Z = α(Y) in the head of V is exactly j (recall that α is either sum or count). By definition, the answer to R on D is obtained by taking a projection, on the head arguments of R, of the result of joining all the subgoals of R. As the subgoals of R include the view V, any tuple for x̄ in the answer to R on D has Z = j.

On the other hand, the answer to R^u on D is the result of performing R^u's aggregation, which is the central aggregation (sum or count) of the central view of R, on the body of R^u. (The body of R^u is the result of joining all its subgoals.) The body of R^u has at least 2j tuples; the value, in each tuple, of the argument to be aggregated is 1. Thus, the tuple for x̄ in the answer to R^u on D has the value of Z that is at least 2j.

A.3 Proof of Theorem 4.5

Theorem 4.5 Let R be a CQA/CQA rewriting. Suppose that noncentral views are without aggregation and are bag-valued. Then R ≡ R^u.

Proof (sketch): We show that each of R and R^u is equivalent to a query R^int (see Example 4.1) whose definition is based on the definitions of R and R^u; then R ≡ R^u follows from transitivity of equivalence. For a rewriting R defined as

$$r(\bar{x}, \alpha(y)) \leftarrow v_0(\bar{x}_0, y),\ v_1^b(\bar{x}_1, y_1),\ \ldots,\ v_k^b(\bar{x}_k, y_k).$$

and for its unfolding R^u,

$$r^u(\bar{x}, \beta(y)) \leftarrow B_{v_0} \,\&\, B_{v_1} \,\&\, \ldots \,\&\, B_{v_k}.$$

R^int is defined as

$$r^{int}(\bar{x}, \alpha(z)) \leftarrow \bar{r}^{int}(\bar{x} \cup \bar{x}_0, z). \qquad (5)$$

$$\bar{r}^{int}(\bar{x} \cup \bar{x}_0, \beta(y)) \leftarrow B_{v_0} \,\&\, B_{v_1} \,\&\, \ldots \,\&\, B_{v_k}.$$

Here, α is the aggregate function of R, and β is the aggregate function of R's central view V.

We give an intuition for the proof for the case where the aggregation function of the central view V in R is count(∗); the proof carries over in a straightforward way to any distributive aggregation function [GCB+97].
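As a small illustration of the distributivity property invoked here, the following Python sketch checks on made-up data that count(∗) computed in two grouping steps (first per fine-grained group, then summing the partial counts per coarse group) agrees with count(∗) computed in a single grouping step. The relation `rows` and the function names are assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical relation with columns (x, x0): x plays the role of the outer
# grouping attribute, (x, x0) the role of the finer inner grouping.
rows: List[Tuple[str, int]] = [
    ("a", 1), ("a", 1), ("a", 2),
    ("b", 1), ("b", 1), ("b", 1),
]


def one_step_count(rel: List[Tuple[str, int]]) -> Dict[str, int]:
    """count(*) grouped directly by x."""
    return dict(Counter(x for x, _ in rel))


def two_step_count(rel: List[Tuple[str, int]]) -> Dict[str, int]:
    """count(*) grouped by (x, x0), then the partial counts summed per x."""
    inner = Counter(rel)  # one partial count per (x, x0) group
    outer: Dict[str, int] = {}
    for (x, _), k in inner.items():
        outer[x] = outer.get(x, 0) + k
    return outer


single = one_step_count(rows)
double = two_step_count(rows)
```

Both `single` and `double` assign 3 to each of the two groups. Note that the outer step sums the partial counts rather than counting them again, which is what makes the two plans agree.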

In the computation of R on an arbitrary database, consider any group G(t) that results, after grouping and aggregation, in a tuple t in the answer to R. Any tuple p in G(t) is the result of joining tuples in views in R, one tuple from each view. Consider a tuple s in V (central view of R) that contributes to the tuple p, and let k be the aggregated value in s. As V's aggregation is count(∗), s corresponds to k tuples in the body of V. Thus, each tuple p (with some value k) in each group in R corresponds to k tuples in the body of the central view V of R.

We use this observation to see that we can use a query plan for R^int to compute R. For each tuple p (with some value k) in the body of R, we have k tuples in the body of $\bar{R}^{int}$. After doing $\bar{R}^{int}$'s aggregation (the same as V's aggregation) on the union of the grouping attributes of R and V, we obtain, from these k tuples, exactly the tuple p in the body of R^int. As the grouping and aggregation are the same in the heads of R and R^int, we conclude that R and R^int have the same answer on any database.

To show that R^u and R^int have the same answer on any database, we first observe that they are computed on the same relation $B = B_{v_0} \,\&\, \ldots \,\&\, B_{v_k}$. We then use the fact that R's aggregate function is distributive, to argue that the two grouping/aggregation steps in computing R^int result in the same answer, on the relation B, as the single grouping/aggregation step in computing R^u.

B From Section 5

B.1 Proof of Theorem 5.1

Theorem 5.1 The view-selection problem under the storage limit is decidable for finite workloads of conjunctive queries with aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider. □

Proof (sketch): The proof is a consequence of the fact that views in equivalent rewritings have definitions whose length is bounded by the size of the query. This is true for conjunctive queries and carries over to aggregate queries and central rewritings because of our results on equivalence of unfoldings and rewritings (proved in Section 4) and the results on equivalence of aggregate queries. Together, these results imply that the core of the rewriting should be equivalent to the core of the query. Then we argue as in the purely conjunctive case under either semantics.

B.2 NP-hardness proof (Theorem 5.2)

Proposition B.1 The view-selection problem under the storage limit is NP-hard for finite workloads of conjunctive queries with sum- or count- aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider. □

Proof (sketch): We prove the Proposition by reducing an NP-complete problem Partition to the problem of view selection for a single query with sum- or count- aggregation, for each of our three central rewritings. Consider an instance I of Partition, which has n elements a_1, . . . , a_n. We construct an instance J of view selection, in time at most polynomial in the size of I. The instance J has:

1. A sum or count query Q, with n subgoals p_i that correspond to the elements in I, and with an extra subgoal p_0 that provides an aggregation argument of Q.
2. An oracle, which gives the size of the relation for each subgoal of the query Q, as 1 for p_0 and as $2^{s(a_i)}$ for each p_i, i > 0; here the integer s(a_i) is the size of the element a_i in the instance I of Partition. For any view defined on a subset of the subgoals of the query Q, the oracle gives the size of the relation for the view as a product of the sizes of the relations for the relevant subgoals; the size of Q is $2^{S(A)}$, where S(A) is the sum of sizes of all the elements in I. In the full proof we argue that the oracle gives view sizes consistently on some database.
3. A central rewriting type (one of CQ/CQA, CQA/CQ, and CQA/CQA).
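The size oracle in item 2 can be sketched in Python as follows. The function names and the sample element sizes are assumptions of this sketch. The last helper lists the subsets of subgoals p_1, ..., p_n whose view has size 2^(S(A)/2), which is the quantity the remainder of the proof ties to a solution of Partition.

```python
from itertools import combinations
from math import prod
from typing import Iterable, List, Sequence, Set


def subgoal_size(i: int, s: Sequence[int]) -> int:
    """Oracle size of the relation for subgoal p_i: 1 for p_0, 2**s(a_i) for i > 0."""
    return 1 if i == 0 else 2 ** s[i - 1]


def view_size(subgoals: Iterable[int], s: Sequence[int]) -> int:
    """Oracle size of a view: product of the sizes of its subgoals' relations."""
    return prod(subgoal_size(i, s) for i in subgoals)


def balanced_views(s: Sequence[int]) -> List[Set[int]]:
    """Subsets of {1..n} whose view has size 2**(S(A)/2); empty if S(A) is odd."""
    total = sum(s)
    if total % 2:
        return []
    target = 2 ** (total // 2)
    indices = range(1, len(s) + 1)
    return [
        set(c)
        for r in range(len(s) + 1)
        for c in combinations(indices, r)
        if view_size(c, s) == target
    ]


sizes = [1, 2, 3]                       # hypothetical element sizes s(a_1..a_3)
whole_query = view_size(range(0, 4), sizes)
halves = balanced_views(sizes)
```

For the sample sizes, the whole query has oracle size 2^6 = 64, and the balanced subsets are {3} and {1, 2}, matching the partition of the elements into two halves of total size 3. The enumeration is exponential and serves only to make the correspondence concrete.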

The problem for J is: For the query Q and the set of databases for which the oracle gives the sizes of views as described above, does there exist a rewriting R of the specified type, such that the sum cost (see Section 3) of answering the query Q on these databases using the rewriting R does not exceed a numeric value M, which depends on the type of the rewriting R?

We then show that an instance I of Partition has a solution if and only if the corresponding instance J of view selection has a solution. Consider the value of M in J: The component of M that represents the cost of computing the body of the rewriting (i.e., all except the final grouping/aggregation) is $M^* = 2^{S(A)/2} + 2^{S(A)/2} + 2^{S(A)}$. The remainder of the proof is an argument that on the databases described by the oracle, the cost of computing the body of a rewriting does not exceed $M^*$ only if there are exactly two views that have the same size $M_0$, as given by the oracle, and such that the join of the two views gives the body of the query Q. Now the size $M_0$ of any such view can only be $2^{S(A)/2}$, otherwise the (sum) cost of joining the views cannot be $M^*$. But by construction of the query and of the oracle, the size of a view can be $2^{S(A)/2}$ only if in the instance I of Partition there is a subset A′ of the set A, such that the total size of the elements of A′ is S(A)/2.

B.3 Proof of Theorem 5.3

Theorem 5.3 The view-selection problem under the storage limit has an exponential-time lower bound for finite workloads of conjunctive queries with max- or min- aggregation and for conjunctive views and rewritings, with or without aggregation, for the three central rewritings we consider. □

Proof (sketch): We use the construction given in the proof of Theorem 6 in [CHS02]; we take the conjunctive query in the construction and modify it to obtain definitions of queries with aggregation. We then consider rewritings of these queries, one for each of the three rewriting types we consider, and prove that in each case, an exponential number of fixed views (some of them with aggregation) are the only possible viewset that satisfies a chosen storage limit and gives a minimal-cost rewriting of the query.

Here are some details. We take the database schema (relations S_1 through S_n) from the construction in the proof in [CHS02], and change the schemas of two of the relations, to accommodate attributes that would justify the aggregation in each type of central rewritings that we consider; we then construct a database D on which to compute our queries and rewritings. After defining the queries and rewritings on the new schema, we use our results on equivalence of queries with aggregation to their central rewritings to argue that each of the rewritings we produce is equivalent to the corresponding query. Each rewriting has an exponential number of filtering views [ALU01] that, when applied (i.e., joined) together to one of the nonfiltering views in the plan for computing the rewriting on the database, reduce the relation for the view in a way that minimizes the cost of the plan.

Finally, for each rewriting we set a storage limit as the amount of space that is just enough to store the relations, on the database D, for an exponential number of views that we have fixed in each rewriting. Similarly to the proof in [CHS02], we show that (1) the cost of computing the queries using the chosen views and rewritings is lower than the cost of computing the queries without views, and (2) for any other viewset that could produce lower-cost plans to compute the queries on the database D, the relations for the viewset do not satisfy the storage limit. In particular, by construction of the database D, our fixed views with aggregation are more beneficial than views (without aggregation) that are their cores.
