Optimal Histograms for Hierarchical Range Queries (Extended Abstract)

Nick Koudas
AT&T Labs-Research
koudas@research.att.com

S. Muthukrishnan
AT&T Labs-Research
[email protected]

Divesh Srivastava
AT&T Labs-Research
[email protected]

1 Introduction

Now there is tremendous interest in data warehousing and OLAP applications. OLAP applications typically view data as having multiple logical dimensions (e.g., product, location) with natural hierarchies defined on each dimension, and analyze the behavior of various measure attributes (e.g., sales, volume) in terms of the dimensions. OLAP queries typically involve hierarchical selections on some of the dimensions (e.g., product is classified under the jeans product category, or location is in the north-east region), often aggregating measure attributes (see, e.g., [6]). Cost-based query optimization of such OLAP queries needs good estimates of the selectivity of hierarchical selections.

Histograms capture attribute value distribution statistics in a space-efficient fashion. They have been designed to work well for numeric attribute value domains, and have long been used to support cost-based query optimization in databases [11, 9, 2, 4, 10, 5]. Histograms can be used to estimate the selectivity of OLAP queries by modeling the (hierarchical) conditions on a given dimension as a set of hierarchical ranges (i.e., two ranges are either disjoint or one is contained in the other), and using standard range selectivity estimation techniques (see, e.g., [10]).

The quality of selectivity estimates obtained using a histogram depends on computing a good solution to the histogram construction problem, and there has been considerable recent effort in this area (see, e.g., [10, 5]). However, while OLAP queries make extensive use of hierarchical selection conditions, previous works on computing good histograms, for the most part, consider only equality queries when computing the error incurred by a particular choice of histogram bucket boundaries. This mismatch between the nature of OLAP queries, and the class of queries considered when constructing histograms can result in poor selectivity estimates for OLAP queries.

In this paper, we address this problem and focus on efficiently computing optimal histograms for the case of hierarchical range queries. We make the following contributions:

- We show that "optimal" histograms for equality queries are sub-optimal for hierarchical range queries (Section 3).
- We present polynomial-time, dynamic programming algorithms for computing optimal histograms that provably minimize expected error for a given amount of space (equivalently minimize space for a given error), for the special cases of one-sided ranges and balanced binary trees, as well as for the general case of arbitrary hierarchical range queries (Sections 4-5).
- We prove that our algorithm for the case of one-sided ranges is as efficient in running time as the V-Optimal algorithm of [5], which computes optimal histograms for equality queries, and experimentally demonstrate that the histograms produced by our algorithm have substantially lower error (Section 4).

Our work is the first that computes provably good histograms for non-equality queries, and lays the theoretical foundation for this important area.

1.1 Related Work

Histogram construction techniques and their use in selectivity estimation for query optimization, as well as their relationship to approximate query answering, have a long research history in the database literature [7, 3, 2, 4, 10, 5, 8]. The V-Optimal histogram was defined in [2, 4] to be that which minimizes error for estimating equality queries. Heuristically constructed V-Optimal histograms were evaluated in [10] for range selection predicates along with several other histograms and were shown to achieve good accuracy for those predicates as well. However, the evaluation of V-Optimal histograms for range queries was performed on V-Optimal histograms (or approximations thereof) constructed by taking only equality queries into account.

Matias et al. [8] proposed the use of the wavelet transform to construct histograms and experimentally showed that they can construct histograms of superior accuracy over previously proposed heuristics for the construction of V-Optimal histograms. Their technique consists of obtaining the wavelet transform of the data set and greedily choosing a certain number of wavelet coefficients in order to satisfy the space constraint. They proposed efficient algorithms to perform this greedy selection.

Jagadish et al. [5] proposed an optimal polynomial-time algorithm to construct V-Optimal histograms. We expect that the V-Optimal histogram, constructed using the algorithms of Jagadish et al. [5], is superior to the histogram constructed using the greedy selection techniques of Matias et al. [8]. The reason is that the algorithm of Jagadish et al. works explicitly towards achieving the optimization objective, unlike the algorithm of Matias et al., which greedily tries to approximate it.

It is important to note that most previous works on histogram construction do not take range predicates into account towards the construction of histograms specifically tailored to these predicates, and the rare ones that do (such as [8]) do so in a heuristic manner, with no optimality guarantees.

2 Problem Definition

We are given an array $A[1, n]$ of non-negative real numbers. Define $\bar{A}[a, b] = \frac{A[a] + \cdots + A[b]}{b - a + 1}$, the average of items $A[a], \ldots, A[b]$. Intuitively, a histogram of array $A$ using $B$ buckets is a disjoint partition of the range $[1, n]$ into $B$ intervals. More formally, we have the following definition:

Definition 2.1 [Histogram] A histogram of array $A[1, n]$ using $B$ buckets is specified by $B + 1$ integers, $b_1, \ldots, b_{B+1}$, where $0 = b_1 \le b_2 \le \cdots \le b_{B+1} = n$. Each interval $[b_i + 1, b_{i+1}]$ is called a bucket, and each $b_i$ is called a bucket boundary.

The histogram is stored as the series of bucket boundaries together with the average of the array values in each bucket, i.e., $\bar{A}[b_i + 1, b_{i+1}]$. This implies that for any bucket $[b_i + 1, b_{i+1}]$, we can obtain the sum of all values in it from the average and its length, both of which are available from the histogram representation.

For a given array $A$, many histograms are possible, as discussed in Section 1.1. Most previous works on histograms make use only of equality queries of the form "give me $A[i]$" in their definition and construction. Here, we formally define the notion of "optimal" histograms for a more general query workload consisting of hierarchical range queries, defined as follows.

Definition 2.2 [Hierarchical Range Queries] A range query $R_{ij}$ asks for the sum of the values $s_{ij} = A[i] + \cdots + A[j]$. A set $S$ of range queries is said to be hierarchical if, for any two queries $R_{ij}$ and $R_{k\ell}$ in $S$, either the ranges $[i, j]$ and $[k, \ell]$ are disjoint, or one is contained in the other.

Hierarchical range queries generalize equality queries since the degenerate range query $R_{ii}$ is precisely the equality query $A[i]$. Hierarchical range queries can be conveniently displayed as a tree in which each node $u$ has a range $r_u$ associated with it. Node $v$ is a child of node $u$ if and only if $r_v$ is contained in $r_u$, but there is no other node $w$ such that $r_w$ contains $r_v$ and is contained in $r_u$. Figures 1(a) and 1(b) illustrate these trees for the case of point and prefix range queries, and full binary range queries, respectively.¹

We now define the notion of a query workload, based on a simple model of probabilistic query distribution, where the probability of any particular range query being asked is independent of the probabilities of other range queries.

Definition 2.3 [Workload] A workload $W$ consists of a set $S$ of hierarchical range queries, along with a probability $p_{ij}$ associated with each range query $R_{ij}$ in $S$.

The probability $p_{ij}$ associated with every range can be obtained by monitoring and logging queries on the warehouse.

Given a histogram $H$ of array $A[1, n]$, a range query $R_{ij}$ is answered as follows. Recall that the answer is $s_{ij} = A[i] + \cdots + A[j]$. We consider the left bucket $[b_\ell + 1, b_{\ell+1}]$ that "straddles" $i$, i.e., $b_\ell + 1 \le i \le b_{\ell+1}$. Likewise, the right bucket $[b_r + 1, b_{r+1}]$ is defined as the one that straddles $j$. Thus, the range of query $R_{ij}$ contains portions of the left and right buckets, and contains every bucket between these two in its entirety. (The left and right buckets may coincide, and/or there may be no buckets in between, but our discussion here is not seriously affected.) We can obtain the precise total of all values in the buckets between the left and right buckets from the histogram representation, as remarked earlier.

For the portions within the left and right buckets, we use the common assumption of uniformity, i.e., we estimate the sum of the $A$ values in the interval $[i, j] \cap [b_\ell + 1, b_{\ell+1}]$ as $\bar{A}[b_\ell + 1, b_{\ell+1}] \cdot |[i, j] \cap [b_\ell + 1, b_{\ell+1}]|$, and likewise for the right bucket. The total estimate for $s_{ij}$ is then the sum of the estimates for the left and right buckets, and the exact sums for the buckets in between. We denote this estimate by $\hat{s}_{ij}$. See Section 3 for an example of this estimation method. Optimal histograms are defined based on the errors in the estimation for queries.

¹ See Section 5.1 for the notation used for identifying full binary ranges in the tree of Figure 1(b).
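The estimation procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `build_histogram` and `estimate_range_sum` are hypothetical, positions are 1-based as in the text, and every bucket is assumed to be non-empty. A bucket that lies entirely inside the query contributes its average times its length, which equals its exact sum, so one overlap formula covers the left bucket, the right bucket, and the buckets in between.

```python
def build_histogram(A, boundaries):
    """Store the boundaries b_1..b_{B+1} (0 = b_1, b_{B+1} = n) and bucket averages.

    Bucket i covers the 1-based positions b_i + 1 .. b_{i+1}.
    Assumption: every bucket is non-empty.
    """
    averages = [sum(A[lo:hi]) / (hi - lo)
                for lo, hi in zip(boundaries, boundaries[1:])]
    return boundaries, averages


def estimate_range_sum(hist, i, j):
    """Estimate s_ij = A[i] + ... + A[j] using the uniformity assumption."""
    boundaries, averages = hist
    total = 0.0
    for lo, hi, avg in zip(boundaries, boundaries[1:], averages):
        # Number of positions shared by [i, j] and the bucket [lo + 1, hi].
        overlap = min(j, hi) - max(i, lo + 1) + 1
        if overlap > 0:
            total += avg * overlap
    return total


# Data of Figure 2 with buckets [1, 2] and [3, 5].
hist = build_histogram([1, 1, 97, 100, 100], [0, 2, 5])
est_13 = estimate_range_sum(hist, 1, 3)
est_15 = estimate_range_sum(hist, 1, 5)
```

Here `est_13` combines the exact total of the first bucket with one position of the second bucket at its average, while `est_15` covers both buckets completely and is therefore exact.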

Figure 1: Hierarchical Range Query Trees over positions 1-8. (a) Points and Prefix Ranges. (b) Full Binary Ranges.

Definition 2.4 [Optimal Histogram] Given a histogram $H$ of array $A[1, n]$, the error $e_{ij}$ of range query $R_{ij}$ is defined to be $(s_{ij} - \hat{s}_{ij})^2$. Given a histogram $H$ of array $A[1, n]$, and a workload $W$, the total expected error for estimating $W$ is defined to be $\sum_{R_{ij}} p_{ij} e_{ij}$, over all queries $R_{ij}$ in $W$.² Given a workload $W$, an optimal histogram with $B$ buckets, H-opt, of array $A[1, n]$ is the histogram with at most $B$ buckets that has the minimum total expected error for estimating workload $W$, among all histograms with at most $B$ buckets.

We can now formally state the problem that we address in this paper:

Optimal Histogram Problem for Hierarchical Range Queries: Given array $A[1, n]$, $B$ buckets, and a workload $W$ of hierarchical range queries with probability $p_{ij}$ for query $R_{ij}$, determine boundaries and bucket averages of the optimal histogram with $B$ buckets.

There are some variations on the basic problem described above that are of interest in general. For example, a range query $R_{ij}$ may return not the sum of the $A[k]$ values in that range, but say the maximum $\max_{k=i}^{j} A[k]$, or any other suitable function. Another variation is that one may choose a different representation for the histogram, say, by storing not the average value, but some other representative(s) for the bucket values. One possibility would be to store a value that optimizes the error over the range queries. Our discussions below will hold for many such variations of optimal histograms, but we do not elaborate on this issue further.

² This is the standard sum-squared-error measure.

Figure 2: Sub-optimality of V-Optimal for Ranges

Queries: $R_{13}$ (0.5), $R_{15}$ (0.5)

i       1      2      3      4      5
A[i]    1      1      97     100    100
P[i]    4/15   4/15   4/15   1/10   1/10

Bucket boundaries (bucket averages):
V-Opt: [1, 2] with average 1; [3, 5] with average 99
H-Opt: [1, 3] with average 33; [4, 5] with average 100

3 Motivation

In this section, we present some examples to illustrate the intricate nature of the problem of optimal histogram construction for a workload of hierarchical range queries.

First, we show that V-Optimal histograms [10, 5], which minimize expected error for the case of equality (or point) queries, can be sub-optimal for the case of hierarchical range queries. Consider Figure 2. The data set consists of five points, denoted by $i$, $1 \le i \le 5$, whose values are given by $A[i]$, $1 \le i \le 5$. The query workload is given by the two (hierarchical range) queries $R_{13}$ and $R_{15}$, each of which is equally probable. The problem is to identify the optimal bucket boundaries (i.e., one that minimizes expected error), given a space budget of two buckets.

Identifying the V-Optimal histogram requires knowledge of the probabilities of accessing individual points. This can be obtained from the range query workload by projecting (range-length normalized) range query probabilities to the individual points in the range. For the above example, the point with index 1 would be associated with the probability $0.5/3 + 0.5/5$, the first term obtained by projecting the probability of range query $R_{13}$ (of length 3), and the second term obtained by projecting the probability of range query $R_{15}$ (of length 5). Thus, we get the point probabilities, $P[i]$, depicted in Figure 2. For this point query workload, it is easy to verify that the V-Optimal histogram depicted in the figure (the bucket contains the average value) is optimal for two buckets, under the point query workload, with a total expected error of $1.267 = 4/15 \times (97 - 99)^2 + 1/10 \times (100 - 99)^2 + 1/10 \times (100 - 99)^2$, since the first two points do not have any associated error.

For the two range queries of interest, the V-Optimal histogram can be seen to have a total expected error of $2 = 0.5 \times (99 - 101)^2$. However, the sub-optimality of the V-Optimal histogram for range queries is demonstrated by the H-optimal histogram depicted in the figure, which has zero total expected error for the two hierarchical range queries of interest.

The above example may lead the reader to believe that one could get an H-optimal histogram, for a given number of buckets, by aligning bucket boundaries along (some of) the hierarchical range query boundaries. This is not true, as we now show. Consider Figure 3, and again assume that we would like the optimal histogram with two buckets, for the given range query workload. If we align bucket boundaries along range boundaries, we get one of the two histograms, H1 or H2, depicted in Figure 3. H1 has a total expected error of 2.68, while H2 has a total expected error of 2.64, the difference arising because of the marginally higher probability of range query $R_{56}$. However, the H-optimal histogram depicted in the figure has zero total expected error, even though the bucket boundary is not aligned with any hierarchical range query boundary.

Figure 3: Aligning Buckets with Range Boundaries

Queries: $R_{12}$ (0.33), $R_{34}$ (0.33), $R_{56}$ (0.34)

i       1   2   3   4   5   6
A[i]    2   2   2   6   6   6

Bucket boundaries (bucket averages):
H-Opt: [1, 3] with average 2; [4, 6] with average 6
H1: [1, 2] with average 2; [3, 6] with average 5
H2: [1, 4] with average 3; [5, 6] with average 6
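The error figures quoted in this section can be re-derived mechanically from the definitions in Section 2. The following sketch is an illustration only: the helper names (`estimate`, `expected_error`, `point_probabilities`) are hypothetical, and the bucket boundary lists are read off Figures 2 and 3 as the integers $b_1, \ldots, b_{B+1}$ of Definition 2.1.

```python
from fractions import Fraction


def estimate(A, boundaries, i, j):
    """Histogram estimate of A[i] + ... + A[j] (1-based, uniformity assumption)."""
    total = 0.0
    for lo, hi in zip(boundaries, boundaries[1:]):
        avg = sum(A[lo:hi]) / (hi - lo)          # average of bucket [lo + 1, hi]
        overlap = min(j, hi) - max(i, lo + 1) + 1
        if overlap > 0:
            total += avg * overlap
    return total


def expected_error(A, boundaries, workload):
    """Total expected error: sum of p_ij * (s_ij - estimate)^2 over the workload."""
    err = 0.0
    for (i, j), p in workload.items():
        exact = sum(A[i - 1:j])
        err += p * (exact - estimate(A, boundaries, i, j)) ** 2
    return err


def point_probabilities(n, workload):
    """Project range-length normalized range probabilities onto the points."""
    P = [Fraction(0)] * n
    for (i, j), p in workload.items():
        for t in range(i, j + 1):
            P[t - 1] += Fraction(p) / (j - i + 1)
    return P


# Figure 2: two buckets, queries R_13 and R_15 with probability 0.5 each.
A2 = [1, 1, 97, 100, 100]
W2 = {(1, 3): 0.5, (1, 5): 0.5}
P2 = point_probabilities(5, W2)
v_opt_error = expected_error(A2, [0, 2, 5], W2)   # V-Opt buckets [1, 2], [3, 5]
h_opt_error = expected_error(A2, [0, 3, 5], W2)   # H-Opt buckets [1, 3], [4, 5]

# Figure 3: queries R_12, R_34, R_56.
A3 = [2, 2, 2, 6, 6, 6]
W3 = {(1, 2): 0.33, (3, 4): 0.33, (5, 6): 0.34}
h1_error = expected_error(A3, [0, 2, 6], W3)
h2_error = expected_error(A3, [0, 4, 6], W3)
h_opt3_error = expected_error(A3, [0, 3, 6], W3)
```

Under these assumptions the computed values agree with the text: the V-Optimal boundaries give an error of 2 on the range workload of Figure 2, H1 and H2 give roughly 2.68 and 2.64 on Figure 3, and both H-optimal histograms give zero.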

4 Special Case: Prefix Ranges

4.1 A Dynamic Programming Solution

To begin with, we consider a special case which is interesting in itself. The special case is when the set of all ranges is the set of all prefixes, that is, the only range queries allowed are one-sided ranges. (Equality queries may be added to this collection without affecting our results, but for simplicity of exposition, we do not consider them here.) We have range queries $R_i = A[1] + \cdots + A[i]$ with probability $p_i$; some of the probabilities may be zero. Clearly, this is a hierarchical collection of ranges. See Figures 1(a) and 2 for examples.

In what follows, we will describe a dynamic programming based solution for finding the optimal histogram for this set of ranges. We will focus on finding the optimum cost, that is, the minimum expected error for a histogram; as is standard in all dynamic programming solutions, a histogram with the minimum error can be determined easily from our solution. Say $E(i, k)$ is the minimum error of a histogram of $k$ buckets for all ranges that are entirely contained in $[1, i]$. We have, $E(i, k) = \min_{1 \le j \ldots} \bigl( E(j, k - 1) + C(j + 1, i) \bigr)$
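A minimal, unoptimized sketch of this recurrence is given below, under several assumptions that are not stated in the text so far. The cost $C(a, b)$ is taken to be the expected squared error of the prefix queries $R_t$ with $a \le t \le b$ when $[a, b]$ forms a single bucket; this reading relies on the estimation procedure of Section 2, where all buckets before the one containing $t$ contribute their exact sums. The index $j$ is assumed to range over values that keep every bucket non-empty, and the names `prefix_bucket_cost` and `optimal_prefix_histogram` are hypothetical.

```python
def prefix_bucket_cost(A, p, a, b):
    """Assumed C(a, b): expected squared error of prefix queries R_t, a <= t <= b,
    when positions a..b (1-based) form one bucket."""
    avg = sum(A[a - 1:b]) / (b - a + 1)
    cost, running = 0.0, 0.0
    for t in range(a, b + 1):
        running += A[t - 1]                  # exact sum A[a] + ... + A[t]
        estimate = avg * (t - a + 1)         # uniformity assumption inside the bucket
        cost += p[t - 1] * (running - estimate) ** 2
    return cost


def optimal_prefix_histogram(A, p, B):
    """Return (minimum expected error, boundaries b_1..b_{B+1}); requires B <= len(A)."""
    n = len(A)
    INF = float("inf")
    # E[k][i]: minimum error using k non-empty buckets to cover [1, i].
    E = [[INF] * (n + 1) for _ in range(B + 1)]
    back = [[0] * (n + 1) for _ in range(B + 1)]
    for i in range(1, n + 1):
        E[1][i] = prefix_bucket_cost(A, p, 1, i)
    for k in range(2, B + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):        # last bucket is [j + 1, i]
                cand = E[k - 1][j] + prefix_bucket_cost(A, p, j + 1, i)
                if cand < E[k][i]:
                    E[k][i], back[k][i] = cand, j
    # Walk the back-pointers to recover the bucket boundaries.
    boundaries, i = [n], n
    for k in range(B, 1, -1):
        i = back[k][i]
        boundaries.append(i)
    boundaries.append(0)
    return E[B][n], boundaries[::-1]


# Figure 2 viewed as a prefix workload: R_3 and R_5, each with probability 0.5.
best_error, best_boundaries = optimal_prefix_histogram(
    [1, 1, 97, 100, 100], [0, 0, 0.5, 0, 0.5], 2)
```

On the data of Figure 2 this recovers the H-optimal boundaries with zero error. The sketch recomputes each bucket cost from scratch, so it makes no attempt to reach the running time claimed for the algorithm in the paper.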