Materialized Sample Views for Database Approximation

Shantanu Joshi and Christopher Jermaine
Department of Computer and Information Science and Engineering
University of Florida, Gainesville, FL, USA
{ssjoshi,cjermain}@cise.ufl.edu

Abstract—We consider the problem of creating a sample view of a database table. A sample view is an indexed, materialized view that permits efficient sampling from an arbitrary range query over the view. Such “sample views” are very useful to applications that require random samples from a database: approximate query processing, online aggregation, data mining, and randomized algorithms are a few examples. Our core technical contribution is a new file organization called the ACE Tree that is suitable for organizing and indexing a sample view. One of the most important aspects of the ACE Tree is that it supports online random sampling from the view. That is, at all times, the set of records returned by the ACE Tree constitutes a statistically random sample of the database records satisfying the relational selection predicate over the view. Our paper presents experimental results that demonstrate the utility of the ACE Tree.

Index Terms—H.2.2.c Indexing methods, H.2.4.h Query processing, I.4.1.f Sampling

I. INTRODUCTION

With ever-increasing database sizes, randomization and randomized algorithms [1] have become vital data management tools. In particular, random sampling is one of the most important sources of randomness for such algorithms. Scores of algorithms that are useful over large data repositories either require a randomized input ordering for data (i.e., an online random sample), or else they operate over samples of the data to increase the speed of the algorithm.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL XX NO XX, 2006


Applications requiring randomization abound in the data management literature. For example, consider online aggregation [2]–[4]. In online aggregation, database records are processed one-at-a-time, and used to keep the user informed of the current “best guess” as to the eventual answer to the query. If the records are input into the online aggregation algorithm in a randomized order, then it becomes possible to give probabilistic guarantees on the relationship of the current guess to the eventual answer to the query.

For another example, consider data mining algorithms. Many data mining algorithms require an online, randomized input ordering of the data. One of the most widely-cited techniques along these lines is the scalable K-means clustering algorithm of Bradley et al. [5]. For another example, one-pass frequent-itemset mining algorithms are typically useful only if the data are processed in a randomized order so that the first few records are distributed in the same way as later ones [6], [7]. In general, it is often possible to scale data mining and machine learning techniques by incorporating samples into a learned model, one-at-a-time, until the marginal accuracy of adding an additional sample into the model is small [8]–[12]. Such online algorithms could be used in a database context for building decision trees, fitting parametric statistical models [9], [11], learning the characteristics of a certain class of data records (including variations on PAC learning [13]), and so on.

In fact, database sampling has been recognized as an important enough problem that ISO has been working to develop a standard interface for sampling from relational database systems [14], and significant research efforts are directed at providing sampling from database systems by vendors such as IBM [15].
However, despite the obvious importance of random sampling in a database environment and dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD and VLDB conferences are concerned with database sampling), there has been relatively little work towards actually supporting random sampling with physical database file organizations. The classic work in this area (by Olken and his co-authors [16]–[18]) suffers from a key drawback: each record sampled from a database file requires a random disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast approximate query processing or speeding up a data mining algorithm, this is clearly unacceptable.


The Materialized Sample View

In this paper, we propose to use the materialized sample view¹ as a convenient abstraction for allowing efficient random sampling from a database. For example, consider the following database schema:

SALE (DAY, CUST, PART, SUPP)

Imagine that we want to support fast, random sampling from this table, and that most of our queries include a temporal range predicate on the DAY attribute. This is exactly the interface provided by a materialized sample view. A materialized sample view can be specified with the following SQL-like query:

CREATE MATERIALIZED SAMPLE VIEW MySam AS
SELECT * FROM SALE
INDEX ON DAY

In general, the range attribute or attributes referenced in the INDEX ON clause can be spatial, temporal, or otherwise, depending on the requirements of the application. While the materialized sample view is a straightforward concept, efficient implementation is difficult. The primary technical contribution of this paper is a novel index structure called the ACE Tree (Appendability, Combinability, Exponentiality; see Section 4) which can be used to efficiently implement a materialized sample view. Such a view, stored as an ACE Tree, has the following characteristics:
• It is possible to efficiently sample (without replacement) from any arbitrary range query over the indexed attribute, at a rate that is far faster than is possible using techniques proposed by Olken [19] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multi-dimensional predicates.
• The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that match the range query. This is vital for important applications like online aggregation and data mining.
• Finally, the sample view is created efficiently, requiring only two external sorts of the records in the view, and with only a very small space overhead beyond the storage required for the data records.

¹This term was originally used in Olken’s PhD thesis [19] in a slightly different context, where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe subsequently, our materialized sample view is a structure that allows online sampling.


We note that while the materialized sample view is a logical concept, the actual file organization used to implement such a view can be referred to as a sample index, since it is a primary index structure used to efficiently retrieve random samples.

Paper Organization

In the next Section we give an overview of three obvious approaches for supporting such sample views, and in Section 3 we present an overview of our approach, based on the ACE Tree. Section 4 discusses important properties of the ACE Tree, while Section 5 gives the detailed design and construction of the index structure. We present and analyze the retrieval algorithms in Section 6, and experimental results are discussed in Section 8. Section 9 concludes the paper and discusses future avenues of research.

II. EXISTING SAMPLING TECHNIQUES

In this Section, we discuss three simple techniques that can be used to create materialized sample views to support random sampling from a relational selection predicate.

A. Randomly Permuted Files

One option for creating a materialized sample view is to randomly shuffle or permute the records in the view. To sample from a relational selection predicate over the view, we scan it sequentially from beginning to end, accepting those records that satisfy the predicate and rejecting the rest. This method has the advantage that it is very simple, and using a fast external sorting algorithm, permuting the records can be done efficiently. Furthermore, since the process of scanning the file can make use of the fast, sequential I/O provided by modern hard disks, a materialized view organized as a randomly permuted file can be very useful for answering queries that are not very selective. However, the major problem with such a materialized view is that the fraction of useful samples retrieved is directly proportional to the selectivity of the selection predicate.
For example, if the selectivity of the query is 10%, then on average only 10% of the random samples obtained by such a view can be used to answer the query. Hence for moderate to low selectivity queries, most of the random samples retrieved by such a view will not be useful for answering queries. Thus, the performance of such a view quickly degrades as selectivity of the selection predicates decreases.
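As a concrete illustration, the scan-and-reject scheme over a permuted file can be sketched as follows (a minimal Python sketch; the toy record layout, seed, and predicate are our own and not from the paper):

```python
import random

def permute_file(records, seed=None):
    """One-time preprocessing: randomly permute the records. In practice
    this would be done with an external sort on a random key."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def sample_by_scan(permuted, predicate, n):
    """Scan the permuted file front to back, accepting records that satisfy
    the predicate. Any prefix of the accepted records is a true random
    sample of the matching records."""
    sample = []
    scanned = 0
    for rec in permuted:
        scanned += 1
        if predicate(rec):
            sample.append(rec)
            if len(sample) == n:
                break
    return sample, scanned

# With selectivity p, roughly n / p records must be scanned to obtain n hits.
data = list(range(10_000))                     # toy "DAY" key values
permuted = permute_file(data, seed=42)
sample, scanned = sample_by_scan(permuted, lambda d: 100 <= d < 1100, 50)
```

The `scanned` counter makes the drawback visible: at ~10% selectivity, about ten records must be read for every useful sample.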


Algorithm 1: Sampling from a Ranked B+-Tree

Algorithm SampleRankedB+Tree(Value v1, Value v2)
1. Find the rank r1 of the record which has the smallest DAY value greater than v1.
2. Find the rank r2 of the record which has the largest DAY value smaller than v2.
3. While sample size < desired sample size:
   3.a Generate a uniformly distributed random number i between r1 and r2.
   3.b If i has been generated previously, discard it and generate the next random number.
   3.c Using the rank information in the internal nodes, retrieve the record whose rank is i.
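A hedged Python sketch of Algorithm 1: a sorted in-memory array stands in for the ranked B+-Tree, and each rank lookup in step 3.c models the random disk I/O the real structure would pay. The function name and the use of `bisect` to compute rank bounds are our own simplification, not the paper's implementation:

```python
import bisect
import random

def sample_ranked_bptree(sorted_days, v1, v2, n, rng=random.Random(0)):
    """Sketch of Algorithm 1 over a sorted array standing in for a ranked
    B+-Tree; sorted_days[i] plays the role of 'the record with rank i'."""
    # Steps 1-2: rank of the smallest value > v1 and of the largest value < v2.
    r1 = bisect.bisect_right(sorted_days, v1)
    r2 = bisect.bisect_left(sorted_days, v2) - 1
    n = max(0, min(n, r2 - r1 + 1))
    # Step 3: draw distinct ranks uniformly; retrieve each record by rank.
    seen, sample = set(), []
    while len(sample) < n:
        i = rng.randint(r1, r2)        # step 3.a: uniform rank in [r1, r2]
        if i in seen:                  # step 3.b: discard repeated ranks
            continue
        seen.add(i)
        sample.append(sorted_days[i])  # step 3.c: one random disk I/O each
    return sample
```

Because step 3.c touches one leaf page per retrieved record, throughput is bounded by the random-I/O rate, which is the drawback discussed below.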

B. Sampling from Indices

The second approach to creating a materialized sample view is to use one of the standard indexing structures, such as a hashing scheme or a tree-based index, to organize the records in the view. In order to produce random samples from such a materialized view, we can employ iterative or batch sampling techniques [16], [18]–[21] that sample directly from a relational selection predicate, thus avoiding the aforementioned problem of obtaining too few relevant records in the sample. Olken [19] presents a comprehensive analysis and comparison of many such techniques. In this Section we discuss the technique of sampling from a materialized view organized as a ranked B+-Tree, since it has been shown to be the most efficient existing iterative sampling technique in terms of the number of disk accesses. A ranked B+-Tree is a regular B+-Tree whose internal nodes have been augmented with information which permits one to find the ith record in the file. Let us assume that the relation SALE presented in the Introduction is stored as a ranked B+-Tree file indexed on the attribute DAY, and that we want to retrieve a random sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005. This translates to the following SQL query:

SELECT * FROM SALE
WHERE SALE.DAY BETWEEN ’11-28-2004’ AND ’03-02-2005’

Algorithm 1 above can then be used to obtain a random sample of relevant records from the ranked B+-Tree file. The drawback of this algorithm is that whenever a leaf page is accessed, it retrieves only the one record whose rank matches the rank being searched for. Hence for every record which


resides on a page that is not currently buffered, the retrieval time is the same as the time required for a random disk I/O. Thus, as long as there are unbuffered leaf pages containing candidate records, the rate of record retrieval is very slow.

C. Block-based Random Sampling

While the classic algorithms of Olken and Antoshenkov sample records one-at-a-time, it is possible to sample from an indexing structure such as a B+-Tree and make use of entire blocks of records [14], [22]. The number of records per block is typically on the order of 100 to 1000, leading to a speedup of two or three orders of magnitude in the number of records retrieved over time if all of the records in each block are consumed, rather than a single record. However, there are two problems with this approach. First, if the structure is used to estimate the answer to some aggregate query, then the confidence bounds associated with any estimate provided after N samples have been retrieved from a range predicate using a B+-Tree (or some other index structure) may be much wider than the confidence bounds that would have been obtained had all N samples been independent. In the extreme case where the values on each block of records are closely correlated with one another, all of the N samples may be no better than a single sample. Second, any algorithm which makes use of such a sample must be aware of the block-based method used to sample the index and adjust its estimates accordingly, thus adding complexity to the query result estimation process. For algorithms such as Bradley’s K-means algorithm [5], it is not clear whether or not such samples are even appropriate.

III. OVERVIEW OF OUR APPROACH

We propose an entirely different strategy for implementing a materialized sample view. Our strategy uses a new data structure called the ACE Tree to index the records in the sample view.
At the highest level, the ACE Tree partitions a data set into a large number of different random samples such that each is a random sample without replacement from one particular range query. When an application asks to sample from some arbitrary range query, the ACE Tree and its associated algorithms filter and combine these samples so that very quickly, a large and random subset of the records satisfying the range query is returned. The sampling algorithm of the ACE Tree is an online algorithm, which means that as time progresses, a larger and larger sample is produced by the structure. At all times, the set of records retrieved is a true random sample of all the database records matching the range selection predicate.


A. ACE Tree Leaf Nodes

The ACE Tree stores records in a large set of leaf nodes on disk. Every leaf node has two components:
1) A set of h ranges, where a range is a pair of key values in the domain of the key attribute and h is the height of the ACE Tree. Unlike a B+-Tree, each leaf node in the ACE Tree stores records falling in several different ranges. The ith range associated with leaf node L is denoted by L.Ri. The h different ranges associated with a leaf node are hierarchical; that is, L.R1 ⊃ L.R2 ⊃ · · · ⊃ L.Rh. The first range in any leaf node, L.R1, always corresponds to the range (−∞, ∞), and so its section contains a uniform random sample of all records of the database. The hth range in any leaf node is the smallest of all the ranges in that leaf node.
2) A set of h associated sections. The ith section of leaf node L is denoted by L.Si. The section L.Si contains a random subset of all the database records with key values in the range L.Ri.

Figure 1 depicts an example leaf node in the ACE Tree, with attribute range values written above each section and section numbers marked below; records within each section are shown as circles.

Fig. 1. Structure of a leaf node of the ACE tree (ranges R1: 0-100, R2: 0-50, R3: 0-25, R4: 0-12 over sections S1 through S4).

B. ACE Tree Structure

Logically, the ACE Tree is a disk-based binary tree data structure, with internal nodes used to index leaf nodes and leaf nodes used to store the actual data. Since the internal nodes in a binary tree are much smaller than disk pages, they are packed and stored together in disk-page-sized units [23]. Each internal node has the following components:
1) A range R of key values associated with the node.
2) A key value k that splits R and partitions the data on the left and right of the node.
3) Pointers ptr_l and ptr_r, which point to the left and right children of the node.
4) Counts cnt_l and cnt_r, which give the number of database records falling in the ranges associated with the left and right child nodes. These values can be used, for example, during evaluation of online aggregation queries, which require the size of the population from which we are sampling [4].
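The two node layouts just described might be sketched as follows (Python dataclasses; the field names are illustrative, not the paper's):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Range = Tuple[float, float]  # closed interval of key values

@dataclass
class LeafNode:
    """Leaf with h hierarchical ranges R1 ⊃ R2 ⊃ ... ⊃ Rh; section i
    holds a random subset of the records whose keys fall in ranges[i]."""
    ranges: List[Range]        # ranges[0] logically covers (-inf, +inf)
    sections: List[List[dict]] # sections[i] pairs with ranges[i]

@dataclass
class InternalNode:
    """Internal node: its range R, the key k that splits R, child
    pointers, and per-child record counts (useful for online aggregation)."""
    R: Range
    k: float
    ptr_l: Optional[object] = None  # left child (internal or leaf node)
    ptr_r: Optional[object] = None  # right child
    cnt_l: int = 0                  # records under the left child's range
    cnt_r: int = 0                  # records under the right child's range
```

A design note: keeping the h ranges explicit in each leaf (rather than deriving them from the root-to-leaf path at query time) lets a leaf be interpreted on its own once it is read from disk.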

Fig. 2. Structure of the ACE Tree (internal nodes Ii,j over leaf nodes L1 through L8; the root I1,1 covers 0-100 with key 50, its children cover 0-50 and 51-100 with keys 25 and 75, the third-level nodes cover 0-25, 26-50, 51-75, and 76-100, and sample contents of L4's sections S1 through S4 are also shown).

Figure 2 shows the logical structure of the ACE Tree. Ii,j refers to the jth internal node at level i. The root node is labeled with the range I1,1.R = [0-100], signifying that all records in the data set have key values within this range. The key of the root node partitions I1,1.R into I2,1.R = [0-50] and I2,2.R = [51-100]. Similarly, each internal node divides the range of its descendants with its own key. The ranges associated with each section of a leaf node are determined by the ranges associated with each internal node on the path from the root node to the leaf. For example, if we consider the path from the root node down to leaf node L4, the ranges that we encounter along the path are 0-100, 0-50, 26-50 and 38-50. Thus for L4, L4.S1 has a random sample of records in the range 0-100, L4.S2 has a random sample in the range 0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in the range 38-50.

C. Example Query Execution in ACE Tree

In the following discussion, we demonstrate how the ACE Tree efficiently retrieves a large random sample of records for any given range query. The query algorithm is formally described in Section 6. Let Q = [30-65] be our example query posed over the ACE Tree depicted in Figure 2. The query algorithm starts at I1,1, the root node. Since I2,1.R overlaps Q, the algorithm decides to explore the left child node, labeled I2,1 in Figure 2. At this point the two range values associated with the left and right


children of I2,1 are 0-25 and 26-50. Since the left child range has no overlap with the query range, the algorithm chooses to explore the right child next. At this child node (I3,2), the algorithm picks leaf node L3 to be the first leaf node retrieved by the index. Records from section 1 of L3 (whose range totally encompasses Q) are filtered for Q and returned immediately to the consumer of the sample as a random sample from the range [30-65], while records from sections 2, 3 and 4 are stored in memory. Figure 3 shows the random sample from section 1 of L3 which can be used directly for answering query Q.

Fig. 3. Random samples from section 1 of L3.

Next, the algorithm again starts at the root node and now chooses to explore the right child node I2,2. After performing range comparisons, it explores the left child of I2,2, which is I3,3, since I3,4.R has no overlap with Q. The algorithm next visits the left child node of I3,3, which is leaf node L5. This is the second leaf node to be retrieved. As depicted in Figure 4, since L5.R1 encompasses Q, the records of L5.S1 are filtered and returned immediately to the user as two additional samples from Q. Furthermore, section 2 records are combined with the section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with the section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are stored in memory for later use. Note that after retrieving just two leaf nodes in our small example, the algorithm obtains eleven randomly selected records from the query range. However, in a real index, this number would be many times greater. Thus, the ACE Tree supports “fast first” sampling from a range predicate: a large number of samples are returned very quickly. We contrast this with a sample taken from a B+-Tree having a


similar structure to the ACE Tree depicted in Figure 2. The B+-Tree sampling algorithm would need to pre-select which nodes to explore. Since four leaf nodes in the tree are needed to span the query range, there is a reasonably high likelihood that the first four samples taken would need to access all four leaf nodes. As the ACE Tree Query Algorithm progresses, it goes on to retrieve the rest of the leaf nodes in the order L4 , L6 , L1 , L7 , L2 , L8 .

Fig. 4. Combining samples from L3 and L5.

D. Choice of Binary Versus k-Ary Tree

The ACE Tree as described above can also be implemented as a k-ary tree instead of a binary tree. For example, for a ternary tree, each internal node can have two (instead of one) keys and three (instead of two) children. If the height of the tree were h, every leaf node would still have h ranges and h sections associated with it. Like a standard complete k-ary tree, the number of leaf nodes would be k^(h-1). However, the big difference would be the manner in which a query is executed using a k-ary ACE Tree as opposed to a binary ACE Tree. The query algorithm would still start at the root node and traverse down to a leaf. However, at every internal node it would alternate between the k children in a round-robin fashion. Moreover, since the data space would be divided into k equal parts at each level, the query algorithm might have to make k traversals, and hence access k leaf nodes, before it can combine sections that can be used to answer the query. This means the query algorithm would have to wait longer (than with a binary ACE Tree) before it can combine leaf node sections and thus return useful random samples. Since


the goal of the ACE Tree is to support “fast first” sampling, use of a binary tree instead of a k-ary tree seems to be the better choice for implementing the ACE Tree.

IV. PROPERTIES OF THE ACE TREE

In this Section we describe the three important properties of the ACE Tree which facilitate the efficient retrieval of random samples from any range query, and which will be instrumental in ensuring the performance of the algorithm described in Section 6.

A. Combinability

Fig. 5. Combining two sections of leaf nodes of the ACE tree.

The various samples produced from processing a set of leaf nodes are combinable. For example, consider the two leaf nodes L1 and L3 , and the query “Compute a random sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 5, first we read leaf node L1 and filter the second section in order to produce a random sample of size n1 from Ql which is returned to the user. Next we read leaf node L3 , and filter its second section L3 .S2 to produce a random sample of size n2 from Ql which is also returned to the user. At this point, the two sets returned to the user constitute a single random sample from Ql of size n1 + n2 . This means that as more and more nodes are read from disk, the records contained in them can be combined to obtain an ever-increasing random sample from any range query.
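A toy sketch of combinability (Python; the leaf and section contents are invented stand-ins, not the paper's data): filtered samples drawn from sections of different leaf nodes whose ranges each cover the query can simply be pooled into one larger random sample.

```python
import random

def filter_section(section, lo, hi):
    """Keep only the records whose keys fall in the query range [lo, hi]."""
    return [r for r in section if lo <= r <= hi]

rng = random.Random(7)
universe = list(range(0, 51))        # keys covered by L1.R2 = L3.R2 = [0, 50]
l1_s2 = rng.sample(universe, 10)     # stand-in for section L1.S2
l3_s2 = rng.sample(universe, 10)     # stand-in for section L3.S2

# Query Ql = [3, 47]: filter each section independently, then pool.
# The pooled result is a single random sample of size n1 + n2 from Ql.
q_lo, q_hi = 3, 47
combined = filter_section(l1_s2, q_lo, q_hi) + filter_section(l3_s2, q_lo, q_hi)
```

Combinability is what lets the returned sample grow monotonically: each newly read leaf contributes another filtered batch to the same pool.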


B. Appendability

The ith sections from two leaf nodes are appendable. That is, given two leaf nodes Lj and Lk, Lj.Si ∪ Lk.Si is always a true random sample of all records of the database with key values within the range Lj.Ri ∪ Lk.Ri. For example, reconsider the query, “Compute a random sample of the records in the query range Ql = [3 to 47].” As depicted in Figure 6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Ql. This means that sections are never wasted.

Fig. 6. Appending two sections of leaf nodes of the ACE tree.
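Appendability can be sketched in the same toy style (Python; the section contents are invented): the ith sections of two leaves are concatenated to form a random sample of the union of their ith ranges, which can then be filtered for the query.

```python
import random

# Toy setup mirroring Fig. 6: L1.R3 = [0, 25] and L3.R3 = [26, 50] are
# disjoint halves, so their union is [0, 50].
rng = random.Random(3)
l1_s3 = rng.sample(range(0, 26), 4)    # stand-in for L1.S3, keys in [0, 25]
l3_s3 = rng.sample(range(26, 51), 4)   # stand-in for L3.S3, keys in [26, 50]

# Appending the two sections yields a random sample from [0, 50] ...
appended = l1_s3 + l3_s3
# ... and filtering it for Ql = [3, 47] yields yet another random sample
# from the query range, on top of what combinability already produced.
hits = [r for r in appended if 3 <= r <= 47]
```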

C. Exponentiality

The ranges in a leaf node are exponential: the number of database records that fall in L.Ri is twice the number that fall in L.Ri+1. This allows the ACE Tree to maintain the invariant that for any query Q′ over a relation R such that at least hµ database records fall in Q′, […] |R|/2^(k+1) = (1/2) × |σ_RC1(R)|. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6. If RC2 = L4.R3 ∪ L6.R3, then half the database records fall in RC2. Also, since |σ_Q2(R)| ≥ |R|/4, we have |σ_Q2(R)| ≥ (1/2) × |σ_RC2(R)|. This can be generalized to obtain the invariant stated in Section 4.3.

D. Construction Phase 2

The objective of Phase 2 is to construct leaf nodes with appropriate sections and populate them with records. This can be achieved by the following three steps:
1) Assign a uniformly generated random number between 1 and h to each record as its section number.
2) Associate an additional random number with the record that will be used to identify the leaf node to which the record will be assigned.


Fig. 9. Phase 2 of tree construction: (a) records assigned section numbers; (b) records assigned leaf numbers; (c) records organized into leaf nodes (leaves 1 through 8).

3) Finally, re-organize the file by performing an external sort to group records in a given leaf node and a given section together.

Figure 9(a) depicts our example data set after we have assigned each record a randomly generated section number, assuming four sections in each leaf node. In Step 2, the algorithm assigns one more randomly generated number to each record, which will identify the leaf node to which the record will be assigned. We assume for our example that the number of leaf nodes is 2^(h-1) = 2^3 = 8. The number to identify the leaf node is assigned as follows.
1) First, the section number of the record is checked. We denote this value as s.
2) We then start at the root of the tree and traverse down by comparing the record key with s − 1 key values. After the comparisons, if we arrive at an internal node Ii,j, then we assign the record to one of the leaves in the subtree rooted at Ii,j.

From the example of Figure 9(a), the first record, having key value 3, has been assigned to section 1. Since this record can be randomly assigned to any leaf from 1 through 8, we assign it to leaf 7.


The next record of Figure 9(a) has been assigned to section number 2. Referring back to Figure 7, we see that the key of the root node is 50. Since the key of the record is 7, which is less than 50, the record will be assigned to a leaf node in the left subtree of the root. Hence we assign a leaf node between 1 and 4 to this record; in our example, we randomly choose leaf node 3. For the next record, having key value 10, we see that the section number assigned is 3. To assign a leaf node to this record, we initially compare its key with the key of the root node. Referring to Figure 7, we see that 10 is smaller than 50; hence we then compare it with 25, which is the key of the left child node of the root. Since the record key is smaller than 25, we assign the record to some leaf node in the left subtree of the node with key 25 by assigning to it a random number between 1 and 2. The section number and leaf node identifiers for each record are written in a small amount of temporary disk space associated with each record. Once all records have been assigned to leaf nodes and sections, the dataset is re-organized into leaf nodes using a two-pass external sorting algorithm as follows:
• Records are sorted in ascending order of their leaf node number.
• Records with the same leaf node number are arranged in ascending order of their section number.

The re-organized data set is depicted in Figure 9(c).

E. Combinability/Appendability Revisited

In Phase 2 of the tree construction, we observe that all records belonging to some section s are segregated based upon the result of the comparison of their key with the appropriate medians, and are then randomly assigned a leaf node number from the feasible ones. Thus, if records from section s of all leaf nodes are merged together, we will obtain all of the section s records. This ensures the appendability property of the ACE Tree. Also note that the probability of assignment of one record to a section is unaffected by the probability of assignment of some other record to that section. Since this results in each section having a random subset of the database records, it is possible to merge a sample of the records from one section that match a range query with a sample of records from a different section that match the same query. This will produce a larger random sample of records falling in the range of the query, thus ensuring the combinability property.
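Under simplifying assumptions (a complete binary tree with 2^(h-1) leaves over a uniform key domain), Phase 2 can be sketched as follows; the function and its arithmetic shortcut for the s − 1 key comparisons are our own, not the paper's code:

```python
import random

def phase2(records, h, dom=100, rng=random.Random(0)):
    """Hedged sketch of Phase 2 over a complete binary ACE Tree with
    2**(h-1) leaves and integer keys in [0, dom). After s - 1 key
    comparisons, a record with section number s is confined to a block of
    2**(h-s) consecutive leaves; we pick one uniformly, then 'externally
    sort' by (leaf, section)."""
    n_leaves = 2 ** (h - 1)
    tagged = []
    for key in records:
        s = rng.randint(1, h)                    # step 1: uniform section number
        # Leaf the key would reach at full depth (stand-in for the key
        # comparisons against the internal-node medians).
        full = min(int(key / (dom / n_leaves)), n_leaves - 1)
        block = 2 ** (h - s)                     # feasible leaves after s-1 comparisons
        start = (full // block) * block
        leaf = start + rng.randrange(block)      # step 2: random feasible leaf
        tagged.append((leaf, s, key))
    tagged.sort()                                # step 3: sort by (leaf, section)
    return tagged
```

Note how the two extremes behave: a section-1 record may land in any leaf (its sample covers the whole domain), while a section-h record is pinned to the single leaf determined by its key.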


F. Page Alignment In Phase 2 of the construction algorithm, section numbers and leaf node numbers are randomly generated. Hence we can only predict on expectation the number of records that will fall in each section of each leaf node. As a result, section sizes within each leaf node can differ, and the size of a leaf node itself is variable and will generally not be equal to the size of a disk page. Thus when the leaf nodes are written out to disk, a single leaf node may span across multiple disk pages or may be contained within a single disk page. This situation could be avoided if we fix the size of each section a priori. However, this poses a serious problem. Consider two leaf node sections Li .Sj and Li+1 .Sj . We can force these two sections to contain the same number of records by ensuring that the set of records assigned to section j in Phase 2 of the construction algorithm has equal representation from Li .Rj and Li+1 .Rj . However, this means that the set of records assigned to section j is no longer random. If we fix the section size and force a set number of records to fall in each section, we invalidate the appendability and combinability properties of the structure. Thus, we are forced to accept a variable section size. In order to implement variable section size, we can adopt one of the following two schemes: 1) Enforce fixed-sized leaf nodes and allow variable-sized sections within the leaf nodes. 2) Allow variable-sized leaf nodes along with variable-sized sections. If we choose the fixed-sized leaf node, variable-sized section scheme, leaf node size is fixed in advance. However, section size is allowed to vary. This allows full sections to grow further by claiming any available space within the leaf node. The leaf node size chosen should be large enough to prevent any leaf node from becoming completely filled up, which prevents the partitioning of any leaf node across two disk pages. 
The major drawback of this scheme is that the average leaf node space utilization will be very low. Assuming a reasonable set of ACE Tree parameters, a quick calculation shows that if we want to be 99% sure that no leaf node fills up, the average leaf node space utilization will be less than 15%.

The variable-sized leaf node, variable-sized section scheme does not impose a size limit on either the leaf node or the section; it allows leaf nodes to grow beyond disk page boundaries if space is required. The important advantage of this scheme is that it is space-efficient. Its main drawback is that leaf nodes may span multiple disk pages, and hence all such pages must be accessed in order to retrieve such a leaf node. Given that most of the cost associated with reading an arbitrary leaf page is


associated with the disk head movement needed to move the disk arm to the appropriate cylinder, this does not pose too much of a problem. Hence we use this scheme for the construction of leaf nodes of the ACE Tree.

VI. QUERY ALGORITHM

In this Section, we describe in detail the algorithm used to answer range queries using the ACE Tree.

A. Goals

The algorithm has been designed to meet the primary goal of achieving "fast-first" sampling from the index structure: it attempts to be greedy with respect to the number of records relevant to the query in the early stages of execution. To meet this goal, the query answering algorithm identifies the leaf nodes which contain the maximum number of sections relevant to the query. A section Li1.Sj is relevant for a range query Q if Li1.Rj ∩ Q ≠ ∅ and Li1.Rj ∪ Li2.Rj ∪ · · · ∪ Lin.Rj ⊇ Q, where Li1, . . . , Lin are some leaf nodes in the tree. The query algorithm prioritizes retrieval of leaf nodes so as to:
• Facilitate the combination of sections so as to maximize n in the above formulation, and
• Maximize the number of relevant sections in each retrieved leaf node L, such that L.Sj ∩ Q ≠ ∅ for j = (c + 1) . . . h, where L.Rc is the smallest range in L that encompasses Q.
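The coverage condition in this relevance definition — a family of section ranges whose union contains Q — can be checked with a simple interval sweep. The helper below is illustrative only (it is not one of the paper's algorithms) and treats ranges as half-open (lo, hi) intervals.

```python
def spans(ranges, q):
    # True if the union of the given (lo, hi) ranges covers query range q.
    q_lo, q_hi = q
    covered = q_lo                 # rightmost point of q covered so far
    for lo, hi in sorted(ranges):
        if lo > covered:           # a gap inside q that no range covers
            return False
        covered = max(covered, hi)
        if covered >= q_hi:        # q is fully covered
            return True
    return covered >= q_hi

print(spans([(0, 25), (25, 50)], (10, 40)))   # → True
print(spans([(0, 25), (30, 50)], (10, 40)))   # → False
```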

B. Algorithm Overview

At a high level, the query answering algorithm retrieves the leaf nodes relevant to answering a query via a series of stabs, or traversals, accessing one leaf node per stab. Each stab begins at the root node and traverses down to a leaf. The distinctive feature of the algorithm is that at each internal node traversed during a stab, the algorithm chooses to access the child node that was not chosen the last time the node was traversed. For example, imagine that for a given internal node I, the algorithm chooses to traverse to the left child of I during a stab. The next time that I is accessed during a stab, the algorithm will choose to traverse to the right child node. This can be seen in Figure 10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to traverse to the left child of the root node during the first stab, while during the second stab it chooses to traverse to the right child of the root node.
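The alternating-child rule can be sketched in a few lines. The toy model below is our own (not the paper's code): it ignores query-range pruning and the done flags, and simply reports which leaf each stab reaches in an eight-leaf tree when every internal node toggles a "next child" bit on each visit.

```python
class Node:
    def __init__(self, left=None, right=None, leaf_id=None):
        self.left, self.right, self.leaf_id = left, right, leaf_id
        self.next_left = True            # which child the next stab takes

def stab(node):
    if node.leaf_id is not None:         # reached a leaf
        return node.leaf_id
    child = node.left if node.next_left else node.right
    node.next_left = not node.next_left  # toggle for the next stab
    return stab(child)

def build(depth, start=1):
    # balanced binary tree with leaves numbered start..start + 2^depth - 1
    if depth == 0:
        return Node(leaf_id=start)
    half = 2 ** (depth - 1)
    return Node(build(depth - 1, start), build(depth - 1, start + half))

root = build(3)                          # 8 leaves, L1..L8
order = [stab(root) for _ in range(8)]
print(order)                             # → [1, 5, 3, 7, 2, 6, 4, 8]
```

Successive stabs visit the leaves in bit-reversed order (L1, L5, L3, L7, . . .), so early stabs land in maximally distant parts of the key range.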

Fig. 10. Execution runs of the query answering algorithm: (a) Stab 1, 1 contributing section; (b) Stab 2, 3 contributing sections; (c) Stab 3, 7 contributing sections; (d) Stab 4, 16 contributing sections.

The advantage of retrieving leaf nodes in this back-and-forth sequence is that it allows us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a given number of stabs. The reason that we want a non-homogeneous set of nodes is that nodes from very distant portions of a query range will tend to have sections covering large ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes with the corresponding sections of previously retrieved leaf nodes. The samples obtained can then be filtered and immediately returned.

This order of retrieval is implemented by associating a bit with each internal node that indicates whether the next child node to be retrieved should be the left node or the right node. The value of this bit is toggled every time the node is accessed. Figure 10 illustrates the choices made by the algorithm at each internal

node during four separate stabs. Note that when the algorithm reaches an internal node where the range associated with one of the child nodes has no overlap with the query range, the algorithm always picks the child node that has overlap with the query, irrespective of the value of the indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at an internal node which overlaps the query range have been accessed. In such a case, that internal node is not chosen and is never accessed again.

C. Data Structures

In addition to the structure of the internal and leaf nodes of the ACE Tree, the query algorithm uses and updates the following two memory-resident data structures:
1) A lookup table T that stores internal node information as a pair of values (next = LEFT | RIGHT, done = TRUE | FALSE). The first value indicates whether the next node to be retrieved should be the left child or the right child. The second value is TRUE if all leaf nodes in the subtree rooted at the current node have already been accessed; otherwise it is FALSE.
2) An array buckets[h] that holds sections of all the leaf nodes which have been accessed so far and whose records could not yet be used to answer the query, where h is the height of the ACE Tree.

D. Actual Algorithm

We now present the algorithms used for answering queries using the ACE Tree. Algorithm 2 simply calls Algorithm 3, the main tree traversal algorithm, called Shuttle(). Each traversal or stab begins at the root node and proceeds down to a leaf node. In each invocation of Shuttle(), a recursive call is made to either its left or right child, with the recursion ending when it reaches a leaf node. At this point, the sections in the leaf node are combined with previously retrieved sections so that they can be used to answer the query. The algorithm for combining sections is described in Algorithm 4.
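A minimal sketch of these two structures follows; the names and types are our own, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    # lookup-table entry for one internal node:
    next_left: bool = True   # which child the next stab should take
    done: bool = False       # True once every leaf below has been retrieved

h = 4                                # assumed tree height
T = {}                               # lookup table: node id -> NodeState
buckets = [[] for _ in range(h)]     # buckets[j]: sections S_j awaiting combination

# entries are created lazily the first time a stab touches a node,
# and the bit is toggled as the stab passes through
state = T.setdefault("I_1,1", NodeState())
state.next_left = not state.next_left
```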
Algorithm 4 determines the sections that are required to be combined with every new section s that is retrieved, and then searches for them in the array buckets[]. If all required sections are found, it combines them with s and removes them from buckets[]. If it does not find all the required sections in buckets[], it stores s in buckets[].

E. Algorithm Analysis

We now present a lower bound on the expected performance of the ACE Tree index for sampling from a relational selection predicate. For simplicity, our analysis assumes that the number of leaf nodes in the


Algorithm 2: Query Answering Algorithm

  Algorithm Answer(Query Q)
    Let root be the root of the ACE Tree
    While (!T.lookup(root).done)
      T.lookup(root).done = Shuttle(Q, root)

Algorithm 3: ACE Tree Traversal Algorithm

  Algorithm Shuttle(Query Q, Node curr_node)
    If (curr_node is an internal node)
      left_node = curr_node->get_left_node()
      right_node = curr_node->get_right_node()
      If (left_node is done AND right_node is done)
        Mark curr_node as done
      Else If (left_node is done)          // only the right subtree remains
        Shuttle(Q, right_node)
      Else If (right_node is done)         // only the left subtree remains
        Shuttle(Q, left_node)
      Else                                 // both children are not done
        If (Q overlaps only with left_node.R)
          Shuttle(Q, left_node)
        Else If (Q overlaps only with right_node.R)
          Shuttle(Q, right_node)
        Else                               // Q overlaps both sides or none
          If (next_node is LEFT)
            Shuttle(Q, left_node)
            Set next_node to RIGHT
          Else                             // next_node is RIGHT
            Shuttle(Q, right_node)
            Set next_node to LEFT
    Else                                   // curr_node is a leaf node
      Combine_Tuples(Q, curr_node)
      Mark curr_node as done

tree is a power of 2.

Lemma 1. Efficiency of the ACE Tree for query evaluation.
• Let n be the total number of leaf nodes in an ACE Tree used to sample from some arbitrary range query Q
• Let p be the largest power of 2 no greater than n
• Let µ be the mean section size in the tree
• Let α be the fraction of database records falling in Q
• Let N be the size of the sample from Q that has been obtained after m ACE Tree leaf nodes have


Algorithm 4: Algorithm for Combining Sections

  Algorithm Combine_Tuples(Query Q, LeafNode node)
    For each section s in node do
      Store the section numbers required to be combined with s to span Q in a list list
      flag = true
      For each section number i in list do
        If buckets[] does not have section i
          flag = false
      If (flag == true)
        Combine all sections from list with s and use the records to answer Q
      Else
        Store s in the appropriate bucket

been retrieved from disk. If m is not too large (that is, if m ≤ 2αn + 2), then:

$$E[N] \ge \frac{\mu}{2}\, p \log_2 p$$

where E[N] denotes the expected value of N (the mean value of N after an infinite number of trials).

Proof: Let Ii,j and Ii,j+1 be the two internal nodes in the ACE Tree where R = Ii,j.R ∪ Ii,j+1.R covers Q and i is maximized. As long as the shuttle algorithm has not retrieved all the children of Ii,j and Ii,j+1 (this is the case as long as m ≤ 2αn + 2), when the mth leaf node has been processed, the expected number of new samples obtained is:

$$N_m = \sum_{k=1}^{\lfloor \log_2 m \rfloor} \sum_{l=1}^{2^{k-1}} w_{kl}\,\mu$$

where the outer summation is over each of the h − i contributing sections of the leaf nodes, starting with section number i up to section number h, while $\sum_l w_{kl}$ represents the fraction of records of the 2^{k−1} combined sections that satisfy Q. By the exponentiality property, $\sum_l w_{kl} \ge 1/2$ for every k, so:

$$N_m \ge \frac{\mu}{2} \log_2 m$$


Thus after m leaf nodes have been obtained, the total number of expected samples is given by:

$$E[N] \;\ge\; \sum_{k=1}^{m} N_k \;\ge\; \sum_{k=1}^{m} \frac{\mu}{2} \log_2 k \;\ge\; \frac{\mu}{2}\, m \log_2 m$$

If m is a power of 2, the result is proven.

Lemma 2. The expected number of records µ in any leaf node section is given by:

$$E[\mu] = \frac{|R|}{h\,2^{h-1}}$$

where |R| is the total number of database records, h is the height of the ACE Tree, and 2^{h−1} is the number of leaf nodes in the ACE Tree.

Proof: The probability of assigning a record to any section i, i ≤ h, is 1/h. Given that the record is assigned to section i, it can be assigned to only one of 2^{i−1} leaf node groups after comparing with the appropriate medians. Since each group would have 2^{h−1}/2^{i−1} candidate leaf nodes, the probability that the record is assigned to some leaf node Lj yields:

$$E[\mu_{i,j}] \;=\; \sum_{t\in R} \frac{1}{h}\cdot\frac{1}{2^{i-1}}\cdot\frac{2^{i-1}}{2^{h-1}} \;=\; \frac{|R|}{h\,2^{h-1}}$$
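Lemma 2's formula is easy to sanity-check numerically; the parameter values below are arbitrary examples, not the paper's.

```python
def expected_section_size(num_records, h):
    # expected records per leaf-node section: |R| / (h * 2^(h-1)),
    # i.e. |R| divided by the total number of sections in the tree
    return num_records / (h * 2 ** (h - 1))

# one million records in a tree of height 5 (16 leaves, 5 sections per leaf):
print(expected_section_size(1_000_000, 5))   # → 12500.0
```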

VII. MULTI-DIMENSIONAL ACE TREES

The ACE Tree can be easily extended to support queries that include multi-dimensional predicates. The change needed to incorporate this extension is to use a k-d binary tree instead of the regular binary tree for the ACE Tree. Let a1 . . . ak be the k key attributes for the k-d ACE Tree. To construct such a tree, the root node is assigned the median of all the a1 values in the database; thus the root partitions the dataset based on a1. At the next step, we assign values to the level-2 internal nodes of the tree. For each of the two resulting partitions of the dataset, we calculate the median of all the a2 values. These two medians are assigned to the two internal nodes at level 2, respectively, and we recursively partition the two halves based on a2. This process is continued until we finish level k. At level k + 1, we again consider a1



Fig. 11. Sampling rate of an ACE Tree vs. rate for a B+ Tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation

for choosing the medians. We would then assign a randomly generated section number to every record. The strategy for assigning a leaf node number to the records would also be similar to the one described in Section 5.4, except that the appropriate key attribute is used while performing comparisons with the internal nodes. Finally, the dataset is sorted into leaf nodes as in Figure 9(c).

Query answering with the k-d ACE Tree can use the Shuttle algorithm described earlier, with a few minor modifications. Whenever a section is retrieved by the algorithm, only records which satisfy all predicates in the query should be returned. Also, the mth sections of two leaf nodes can be combined only if they match in all m dimensions. The nth sections of two leaf nodes can be appended only if they match in the first n − 1 dimensions and form a contiguous interval over the nth dimension.

VIII. BENCHMARKING

In this Section, we describe a set of experiments designed to test the ability of the ACE Tree to quickly provide an online random sample from a relational selection predicate, as well as to demonstrate that the memory requirement of the ACE Tree is reasonable. We performed two sets of experiments. The first set is designed to test the utility of the ACE Tree for use with one-dimensional data, where the ACE Tree is compared with a simple sequential file scan as well as Antoshenkov's algorithm for sampling from a



Fig. 12. Sampling rate of an ACE Tree vs. rate for a B+ Tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation

ranked B+-Tree. In the second set, we compare a multi-dimensional ACE Tree with the sequential file scan as well as with the obvious extension of Antoshenkov's algorithm to a two-dimensional R-Tree.

A. Overview

All experiments were performed on a Linux workstation having 1GB of RAM, a 2.4GHz clock speed, and two 80GB, 15,000 RPM Seagate SCSI disks. 64KB data pages were used.

Experiment 1. For the first set of experiments, we consider the problem of sampling from a range query of the form:

SELECT * FROM SALE WHERE SALE.DAY >= d1 AND SALE.DAY <= d2

The queries used in the second set of experiments additionally range over SALE.AMOUNT