SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING
By SHANTANU JOSHI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007
© 2007 SHANTANU JOSHI
To my parents, Dr Sharad Joshi and Dr Hemangi Joshi
ACKNOWLEDGMENTS

Firstly, my sincerest gratitude goes to my advisor, Professor Chris Jermaine, for his invaluable guidance and support throughout my PhD research work. During the initial several months of my graduate work, Chris was extremely patient and always led me in the right direction whenever I would waver. His acute insight into the research problems we worked on set an excellent example and provided me with immense motivation to work on them. He has always emphasized the importance of high-quality technical writing and has spent several painstaking hours reading and correcting my technical manuscripts. He has been the best mentor I could have hoped for, and I shall always remain indebted to him for shaping my career and, more importantly, my thinking. I am also very thankful to Professor Alin Dobra for his guidance during my graduate study. His enthusiasm and constant willingness to help have always amazed me. Thanks are also due to Professor Joachim Hammer for his support during the very early days of my graduate study. I take this opportunity to thank Professors Tamer Kahveci and Gary Koehler for taking the time to serve on my committee and for their helpful suggestions. It was a pleasure working with Subi Arumugam and Abhijit Pol on various collaborative research projects. Several interesting technical discussions with Mingxi Wu, Fei Xu, Florin Rusu, Laukik Chitnis and Seema Degwekar provided a stimulating work environment in the Database Center. This work would not have been possible without the constant encouragement and support of my family. My parents, Dr Sharad Joshi and Dr Hemangi Joshi, always encouraged me to focus on my goals and pursue them against all odds. My brother, Dr Abhijit Joshi, has always placed trust in my abilities and has been an ideal example to follow since my childhood. My loving sister-in-law, Dr Hetal Joshi, has been supportive since the time I decided to pursue computer science.
TABLE OF CONTENTS

                                                                          page

ACKNOWLEDGMENTS . . . 4
LIST OF TABLES . . . 8
LIST OF FIGURES . . . 9
ABSTRACT . . . 11

CHAPTER

1  INTRODUCTION . . . 13

   1.1  Approximate Query Processing (AQP) - A Different Paradigm . . . 13
   1.2  Building an AQP System Afresh . . . 14
        1.2.1  Sampling Vs Precomputed Synopses . . . 15
        1.2.2  Architectural Changes . . . 16
   1.3  Contributions in This Thesis . . . 18

2  RELATED WORK . . . 19

   2.1  Sampling-based Estimation . . . 19
   2.2  Estimation Using Non-sampling Precomputed Synopses . . . 28
   2.3  Analytic Query Processing Using Non-standard Data Models . . . 30

3  MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION . . . 33

   3.1  Introduction . . . 33
   3.2  Existing Sampling Techniques . . . 35
        3.2.1  Randomly Permuted Files . . . 35
        3.2.2  Sampling from Indices . . . 36
        3.2.3  Block-based Random Sampling . . . 37
   3.3  Overview of Our Approach . . . 38
        3.3.1  ACE Tree Leaf Nodes . . . 38
        3.3.2  ACE Tree Structure . . . 39
        3.3.3  Example Query Execution in ACE Tree . . . 40
        3.3.4  Choice of Binary Versus k-Ary Tree . . . 42
   3.4  Properties of the ACE Tree . . . 43
        3.4.1  Combinability . . . 43
        3.4.2  Appendability . . . 44
        3.4.3  Exponentiality . . . 44
   3.5  Construction of the ACE Tree . . . 45
        3.5.1  Design Goals . . . 45
        3.5.2  Construction . . . 46
        3.5.3  Construction Phase 1 . . . 46
        3.5.4  Construction Phase 2 . . . 48
        3.5.5  Combinability/Appendability Revisited . . . 51
        3.5.6  Page Alignment . . . 51
   3.6  Query Algorithm . . . 52
        3.6.1  Goals . . . 53
        3.6.2  Algorithm Overview . . . 53
        3.6.3  Data Structures . . . 55
        3.6.4  Actual Algorithm . . . 55
        3.6.5  Algorithm Analysis . . . 57
   3.7  Multi-Dimensional ACE Trees . . . 59
   3.8  Benchmarking . . . 60
        3.8.1  Overview . . . 61
        3.8.2  Discussion of Experimental Results . . . 66
   3.9  Conclusion and Discussion . . . 70

4  SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES . . . 73

   4.1  Introduction . . . 73
   4.2  The Concurrent Estimator . . . 78
   4.3  Unbiased Estimator . . . 80
        4.3.1  High-Level Description . . . 80
        4.3.2  The Unbiased Estimator In Depth . . . 81
        4.3.3  Why Is the Estimator Unbiased? . . . 85
        4.3.4  Computing the Variance of the Estimator . . . 87
        4.3.5  Is This Good? . . . 89
   4.4  Developing a Biased Estimator . . . 91
   4.5  Details of Our Approach . . . 92
        4.5.1  Choice of Model and Model Parameters . . . 92
        4.5.2  Estimation of Model Parameters . . . 95
        4.5.3  Generating Populations From the Model . . . 100
        4.5.4  Constructing the Estimator . . . 102
   4.6  Experiments . . . 103
        4.6.1  Experimental Setup . . . 103
               4.6.1.1  Synthetic data sets . . . 104
               4.6.1.2  Real-life data sets . . . 106
        4.6.2  Results . . . 109
        4.6.3  Discussion . . . 111
   4.7  Related Work . . . 118
   4.8  Conclusion . . . 119

5  SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES . . . 121

   5.1  Introduction . . . 121
   5.2  Background . . . 124
        5.2.1  Stratification . . . 124
        5.2.2  “Optimal” Allocation and Why It’s Not . . . 126
   5.3  Overview of Our Solution . . . 128
   5.4  Defining XΣ . . . 129
        5.4.1  Overview . . . 129
        5.4.2  Defining Xcnt . . . 130
        5.4.3  Defining XΣ′ . . . 132
        5.4.4  Combining The Two . . . 135
        5.4.5  Limiting the Number of Domain Values . . . 137
   5.5  Updating Priors Using The Pilot . . . 139
   5.6  Putting It All Together . . . 141
        5.6.1  Minimizing the Variance . . . 141
        5.6.2  Computing the Final Sampling Allocation . . . 142
   5.7  Experiments . . . 143
        5.7.1  Goals . . . 143
        5.7.2  Experimental Setup . . . 144
        5.7.3  Results . . . 147
        5.7.4  Discussion . . . 147
   5.8  Related Work . . . 151
   5.9  Conclusion . . . 153

6  CONCLUSION . . . 154

APPENDIX

EM ALGORITHM DERIVATION . . . 155

REFERENCES . . . 157

BIOGRAPHICAL SKETCH . . . 165
LIST OF TABLES

Table                                                                     page

4-1  Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator. . . . 112

4-2  Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator. . . . 113

4-3  Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator. . . . 114

4-4  Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator. . . . 115

5-1  Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for a Simple Random Sampling estimator for the KDD Cup data set. Results are shown for varying sample sizes and for three different query selectivities - 0.01%, 0.1% and 1%. . . . 146

5-2  Average running time of Neyman and Bayes-Neyman estimators over three real-world datasets. . . . 147

5-3  Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata and for varying number of records in pilot sample per stratum (PS), and sample sizes (SS) for three different query selectivities - 0.01%, 0.1% and 1%. . . . 148

5-4  Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata with varying number of records in pilot sample per stratum (PS), and sample sizes (SS) for three different query selectivities - 0.01%, 0.1% and 1%. . . . 149
LIST OF FIGURES

Figure                                                                    page

1-1  Simplified architecture of a DBMS . . . 17

3-1  Structure of a leaf node of the ACE tree. . . . 39

3-2  Structure of the ACE tree. . . . 40

3-3  Random samples from section 1 of L3. . . . 41

3-4  Combining samples from L3 and L5. . . . 42

3-5  Combining two sections of leaf nodes of the ACE tree. . . . 43

3-6  Appending two sections of leaf nodes of the ACE tree. . . . 45

3-7  Choosing keys for internal nodes. . . . 47

3-8  Exponentiality property of ACE tree. . . . 48

3-9  Phase 2 of tree construction. . . . 49

3-10 Execution runs of query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing sections. . . . 54

3-11 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation. . . . 60

3-12 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation. . . . 61

3-13 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation. . . . 62

3-14 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results till all three sampling techniques return all the records matching the query predicate. . . . 63

3-15 Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time plotted as a percentage of the time required to scan the relation. . . . 64

3-16 Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples. . . . 67

3-17 Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 2.5% of the database tuples. . . . 68

3-18 Sampling rate of an ACE tree vs. rate for an R-tree, and scan of a randomly permuted file with a spatial selection predicate accepting 25% of the database tuples. . . . 69

4-1  Sampling from a superpopulation. . . . 90

4-2  Six distributions used to generate for each e in EMP the number of records s in SALE for which f3(e, s) evaluates to true. . . . 105

5-1  Beta distribution with parameters α = β = 0.5. . . . 131
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING

By

SHANTANU JOSHI

August 2007

Chair: Christopher Jermaine
Major: Computer Engineering

The past couple of decades have seen a significant amount of research directed towards data warehousing and efficient processing of analytic queries. This is a daunting task due to the massive sizes of data warehouses and the nature of complex, analytical queries. This is evident from standard, published benchmarking results such as TPC-H, which show that many typical queries can require several minutes to execute despite the use of sophisticated hardware. Such costs are hard to justify, especially for ad-hoc, exploratory data analysis. One way to speed up the execution of such exploratory queries is to rely on approximate results. This approach is especially promising if approximate answers and their error bounds can be computed in a small fraction of the time required to execute the query to completion. Random samples can be used effectively to perform such estimation.

Two important problems must be addressed before random samples can be used for estimation. The first problem is that retrieving random samples from a database is generally very expensive, so index structures must be designed that permit efficient random sampling under arbitrary selection predicates. Secondly, approximate computation of arbitrary queries generally requires complex statistical machinery, and reliable sampling-based estimators must be developed for different types of analytic queries. My research addresses the two problems described above by making the following contributions: (a) a novel file organization and index structure called the ACE Tree, which permits efficient random sampling from an arbitrary range query; (b) sampling-based estimators for aggregate queries which have a correlated subquery where the inner and outer queries are related by the SQL EXISTS, NOT EXISTS, IN or NOT IN clause; and (c) a stratified sampling technique for estimating the result of aggregate queries having highly selective predicates.
CHAPTER 1
INTRODUCTION

The last couple of decades have seen an explosive growth of electronic data. It is not unusual for data management systems to support several terabytes or even petabytes of data. Such massive volumes of data have led to the evolution of “data warehouses”, which are systems capable of supporting storage and efficient retrieval of large amounts of data. Data warehouses are typically used for applications such as online analytical processing, among others. Such applications process queries and expect results in a manner that is different from traditional transaction processing. For example, a typical query by a sales manager on a sales data warehouse might be: “Return the average salary of all employees at locations whose sales have increased by at least 10% over the past 3 years.” The result of such a query could be used to make high-level decisions such as whether or not to hire more employees at the locations of interest. Such queries are typical of a data warehousing environment in that their evaluation requires complex analytical processing over huge amounts of data. Traditional transaction processing methods may be unacceptably slow to answer such complex queries.

1.1 Approximate Query Processing (AQP) - A Different Paradigm
The nature of analytical queries and their associated applications creates an opportunity to return results that are not exact. Since computation of exact results may require an unreasonable amount of time due to massive volumes of data, approximation is attractive if the approximate results can be computed in a fraction of the time it would take to compute the exact results. Moreover, providing approximate results is useful for quickly exploring the whole data set at a high level. This technique of providing fast but approximate results has been termed “Approximate Query Processing” in the literature.
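As a concrete, highly simplified illustration of this idea (a sketch written for this discussion, not code from any real system; the relation is modeled as an in-memory list of dictionaries and all names are purely illustrative), the following Python fragment estimates a SUM aggregate from a small uniform random sample and scales the result up to the full relation:

```python
import random

def approximate_sum(table, attribute, sample_fraction=0.01, seed=42):
    """Estimate SUM(attribute) over `table` from a uniform random sample.

    The estimate is the sample sum scaled up by N/n, the standard
    expansion estimator for a uniform random sample without replacement.
    """
    random.seed(seed)
    n_total = len(table)
    n_sample = max(1, int(n_total * sample_fraction))
    sample = random.sample(table, n_sample)    # uniform sample, no replacement
    sample_sum = sum(row[attribute] for row in sample)
    return sample_sum * (n_total / n_sample)   # scale up to the whole relation

# Hypothetical usage: approximate_sum(emp_table, "salary", sample_fraction=0.05)
```

Only a small fraction of the records is touched, which is exactly why the approximate answer can be produced far faster than an exact scan.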
In addition to computing an approximate answer, it is also important to provide metrics about the accuracy of the answer. One way to express the accuracy is in terms of error bounds with certain probabilistic guarantees of the form, “The estimated answer is 2.45 × 10^5, and with 95% confidence the true answer lies within ±1.18 × 10^3 of the estimate”. Here, the error bounds are expressed as an interval and the accuracy guarantee is provided at 95% confidence. A promising approach for aggregation queries in Approximate Query Processing (AQP), called Online Aggregation (OLA), has been proposed by Haas and Hellerstein [63]. They propose an interactive interface for data exploration and analysis where records are retrieved in a random order. Using these random samples, running estimates and error bounds are computed and immediately displayed to the user. As time progresses, the size of the random sample keeps growing and so the estimate is continuously refined. At predetermined time intervals, the refined estimate along with its improved accuracy is displayed to the user. If at any point during execution the user is satisfied with the accuracy of the answer, she can terminate further execution. The system also gives an overall progress indicator based on the fraction of records that have been sampled thus far. Thus, OLA provides an interface where the user is given a rough estimate of the result very quickly.
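The following Python generator is a minimal sketch of this style of processing (it is not the actual OLA implementation; it assumes the attribute values arrive in random order and, for simplicity, ignores the finite-population correction): it maintains a running scaled-up SUM estimate and reports a CLT-based 95% confidence bound that shrinks as more samples are consumed.

```python
import math

def running_sum_estimate(random_stream, n_total, report_every=1000, z=1.96):
    """Yield (estimate, half_width) pairs for SUM over a relation of n_total
    rows, given attribute values arriving in random order."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for value in random_stream:
        n += 1
        total += value
        total_sq += value * value
        if n % report_every == 0 and n > 1:
            mean = total / n
            var = max(0.0, (total_sq - n * mean * mean) / (n - 1))  # sample variance
            estimate = n_total * mean                     # scaled-up SUM estimate
            half_width = z * n_total * math.sqrt(var / n)  # CLT-based 95% bound
            yield estimate, half_width                     # report: estimate ± half_width
```

Each reported pair corresponds to one refinement step: the user sees the current estimate and an interval that narrows as the sample grows, and may stop the query whenever the interval is tight enough.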
1.2 Building an AQP System Afresh
The OLA system described above presents an intuitive interface for approximate answering of aggregate queries. However, to support the functionality proposed by the system, fundamental changes need to be incorporated in several components of a traditional database management system. In this section, we first examine why sampling is a good approach for AQP, and then present an overview of the changes needed in the architecture of a database management system to support sampling-based AQP.
1.2.1 Sampling Vs Precomputed Synopses
We now discuss two techniques that can be used to support fast but approximate answering of queries. One intuitive technique is to use some compact information about records for answering queries. Such information is typically called a database statistic and is summary information about the actual records of the database. Commonly used database statistics are wavelets, histograms and sketches. Such statistics, also known as synopses, are orders of magnitude smaller in size than the actual data. Hence it is much faster and more efficient to access synopses than to read the entire data. However, such synopses are precomputed and static. If a query is issued which requires synopses that are not already available, they would have to be computed by scanning the dataset, possibly multiple times, before answering the query. A second approach to AQP is to use samples of database records to answer queries. Query execution is extremely fast since the number of records in the sample is a small fraction of the total number of records in the database. The answer is then extrapolated or “scaled-up” to the size of the entire database. Since the answer is computed by processing very few records of the database, it is an approximation of the true answer. For the work in this thesis, we propose to use sampling [25, 109] in order to support AQP. We make this choice due to the following important advantages of sampling over precomputed synopses. The accuracy of an estimate computed from samples can easily be improved by obtaining more samples to answer the query. On the other hand, if the estimate computed using synopses is not sufficiently accurate, a new synopsis providing greater accuracy would have to be built; since this would require scanning the dataset, it is impractical. Secondly, sampling is very amenable to scalability. Even for extremely large datasets of the order of hundreds of gigabytes, it is generally possible to accommodate a small sample in main memory and use efficient in-memory algorithms to process it. If this is not possible, disk-based samples and algorithms have also been proposed [76] and are as effective as their in-memory counterparts. This is an important benefit of sampling
as compared to histograms, which become unwieldy as the number of attributes of the records in the dataset increases. Thirdly, since real records (although very few) are used in a sample, it is possible to answer any statistical query, including queries with arbitrarily complex functions in relational selection and join predicates. This is a very important advantage of sampling as opposed to synopses such as sketches, which are not suitable for answering arbitrary queries. Finally, unlike precomputed synopses, on-the-fly sampling requires no maintenance or updates as the data change.

1.2.2 Architectural Changes
In order to support sampling-based AQP in a database management system, major changes need to be incorporated in the architecture of the system. The reason for this is that traditional database management systems were not designed to work with random samples or to support computation of approximate results. In this section, we briefly describe some of the most critical changes that are required in the architecture of a DBMS to support sampling-based AQP. Figure 1-1 depicts the components of a simplified DBMS architecture. The four components that require major changes in order to support sampling-based AQP are as follows:

• Index/file/record manager - Traditional index structures like B+-Trees are not appropriate for obtaining random samples, because such structures order records by their search key values, which is the opposite of retrieving records in a random order. Hence, for AQP it is important to provide physical structures or file organizations which support efficient retrieval of random samples.

• Execution engine - The execution engine needs to be revamped completely so that it can execute the query over the random samples returned by the lower level. Further, the result of the query needs to be scaled up appropriately to the size of the entire database. This component also needs to be able to compute accuracy guarantees for the approximate answer.
Figure 1-1. Simplified architecture of a DBMS. The layers, from top to bottom, are: user interface (queries/updates), query compiler (query plan), execution engine (index, file and record requests), index/file/record manager (page commands), buffer manager (read/write pages), storage manager, and storage.
• Query compiler - The query compiler has to be modified so that it can devise a different execution strategy for various types of queries, such as relational joins, subset-based queries or queries with a GROUP-BY clause. Moreover, optimization of queries needs to be done very differently from traditional query optimizers, which create the most efficient query plan to run a query to completion. For AQP, queries should be optimized so that the first few result tuples are output as quickly as possible.
• User interface - There is tremendous scope for providing an intuitive user interface for an online AQP system. In addition to displaying accuracy guarantees for the estimate, it would be very useful to provide a visualization of the intermediate results as they become available, so that the user can continue to explore the query or decide to modify or terminate it. Current database management systems provide user interfaces with very limited functionality.
1.3 Contributions in This Thesis
These tasks involve significant research and implementation issues. Since many of the problems have never been tackled in the literature, there are several challenging tasks to be addressed. For the scope of my research, I choose to address the following three problems. The motivation and our solutions for each of these research problems are described separately in the following chapters of this thesis.

• We present a primary index structure which can support efficient retrieval of random samples from an arbitrary range query. This requires a specialized file organization and an efficient algorithm to actually retrieve the desired random samples from the index. This work falls in the scope of the index/file/record manager component described earlier.

• We present our solution to support execution of queries which have a nested subquery where the inner query is correlated to the outer query, in an approximate query processing framework. This work falls in the purview of the execution engine of the system.

• Finally, we also present a technique to support efficient execution of queries which have predicates with low selectivities, such as GROUP BY queries with many different groups. This work also falls in the scope of the query execution engine.
CHAPTER 2
RELATED WORK

This chapter presents previous work in the data management and statistics literature related to estimation using sampling as well as non-sampling precomputed synopsis structures. Finally, it describes work related to OLAP query processing using non-relational data models like data cubes.

2.1 Sampling-based Estimation
Sampling has a long history in the data management literature. Some of the pioneering work in this field has been done by Olken and Rotem [96, 98–101] and Antoshenkov [9], though the idea of using a survey sample for estimation in statistics literature goes back much earlier than these works. Most of the work by Olken and Rotem describes how to perform simple random sampling from databases. Estimation for several types of database tasks has been attempted with random samples. The rest of this section presents important works on sampling-based estimation of major database tasks. Some of the initial work on estimating selectivity of join queries is due to Hou et al. [67, 68]. They present unbiased and consistent estimators for estimating the join size and also provide an algorithm for cluster sampling. In [64] they propose unbiased estimators for COUNT aggregate queries over arbitrary relational algebra expressions. However, computation of variance of their estimators is very complex [67]. They also do not provide any bounds on the number of random samples required for estimation. Adaptive sampling has been used for estimation of selectivity of predicates in relational selection and join operations [83, 84, 86] and for approximating the size of a relational projection operation [94]. Adaptive sampling has also been used in [85], to estimate transitive closures of database relations. The authors point out the benefits and generality of using sampling for selectivity estimation over parametric methods which make assumptions about an underlying probability distribution for the data as well as over non-parametric methods which require storing and maintaining synopses about the
underlying data. The algorithms consider the query result as a collection of results from several disjoint subqueries. Subqueries are sampled randomly and their result sizes are computed. The estimate of the actual query result size is then obtained from the results of the various subqueries. The sampling of subqueries is continued until either the sum of the subquery sizes is sufficiently large or the number of samples taken is sufficiently large. The method requires that the maximum size of a subquery be known. Since this is generally not available, the authors use an upper bound for the maximum subquery size in their method. Haas and Swami [59] observe that using a loose upper bound for the maximum subquery size can lead to sampling more subqueries than necessary, potentially increasing the cost of sampling significantly. Double sampling or two-phase sampling has been used in [66] for estimating the result of a COUNT query with a guaranteed error bound at a certain confidence level. The error bound is guaranteed by performing sampling in two steps. In the first step, a small pilot sample is used to obtain preliminary information about the input relation. This information is then used to compute the size of the sample for the second step such that the estimator is guaranteed to produce an estimate with the desired error bound. As Haas and Swami [59] point out, the drawback of using double sampling is that there is no theoretical guidance for choosing the size of the pilot sample. This could lead to an unpredictably imprecise estimate if the pilot sample size is too small or an unnecessarily high sampling cost if the pilot sample size is too large. In their work [59], Haas and Swami present sequential sampling techniques which provide an estimate of the result size and also bound the estimation error with a prespecified probability. They present two algorithms in the paper to estimate the size of a query result. Although both algorithms have been proven to be asymptotically correct and efficient, the first algorithm suffers from the problem of undercoverage. This means that in practice the probability with which it estimates the query result within the computed error bound is less than the specified confidence level of the algorithm. This problem is addressed by the second
algorithm, which organizes groups of equal-sized result sets into a single stratum and then performs stratified sampling over the different strata. However, their algorithms do not perform very well when estimating the size of joins between a skewed and a non-skewed relation. Ling and Sun [82] point out that general sampling-based estimation methods have a high cost of execution since they make an overly restrictive assumption of no knowledge about the overall characteristics of the data. In particular, they note that estimation of the overall mean and variance of the data not only incurs cost but also introduces error in estimation. The authors rather suggest an alternative approach of actually keeping track of these characteristics in the database at a minimal overhead. A detailed study of the cost of sampling-based methods to estimate join query sizes appears in [58]. The paper systematically analyses the factors which influence the cost of a sampling-based method to estimate join selectivities. Based on their analysis, their findings can be summarized as follows: (a) When the measure of precision of the estimate is absolute, the cost of sampling increases with the number of relations involved in the join as well as the sizes of the relations themselves. (b) When the measure of precision of the estimate is relative, the cost of using sampling increases with the sizes of the relations, but decreases as the number of input relations increases. (c) When the distribution of the join attribute values is uniform or highly skewed for all input relations, the cost of sampling tends to be low, while it is high when only some of the input relations have a skewed join attribute value distribution. (d) The presence of tuples in a relation which do not join with any other tuples from other relations always increases the cost of sampling. Haas et al. [56, 57] study and compare the performance of new as well as previous sampling-based procedures for estimating the selectivity of queries with joins. In particular they identify estimators which have a minimum variance after a fixed number of sampling steps have been performed. They note that use of indexes on input relations can further
reduce variance of the selectivity estimate. The authors also show how their estimation methods can be used to estimate the cost of implementing a given join query plan without making any assumptions about the underlying data or requiring storage and maintenance of summary statistics about the data. Ganguly et al. [35] describe how to estimate the size of a join in the presence of skew in the data by using a technique called bifocal sampling. This technique classifies tuples of each input relation into two groups, sparse and dense, based on the number of tuples with the same value for the join attribute. Every combination of these groups is then subject to different estimation procedures. Each of these estimation procedures requires a sample size larger than a certain value (in terms of the total number of tuples in the input relation) to provide an estimate within a small constant factor of the true join size. In order to guarantee estimates with the specified accuracy, bifocal sampling also requires the total join size and the join sizes from sparse-sparse subjoins to be greater than a certain threshold. Gibbons and Matias [40] introduce two sampling-based summary statistics called concise samples and counting samples and present techniques for their fast and incremental maintenance. Although the paper describes summary statistics rather than on-the-fly sampling techniques, the summary statistics are created from random samples of the underlying data and are actually defined to describe characteristics of a random sample of the data. Since summary statistics of a random sample require much less memory than the sample itself, the paper describes how information from a much larger sample can be stored in a given amount of memory by storing sample statistics instead of using the memory to store actual random samples. Thus, the authors claim that since information from a larger sample can be stored by their summary statistics, the accuracy of approximate answers can be boosted. Chaudhuri, Motwani and Narasayya [22, 24] present a detailed study of the problem of efficiently sampling the output of a join operation without actually computing the
entire join. They prove a negative result that it is not possible to generate a sample of the join result of two relations by merely joining samples of the relations involved in the join. Based on this result, they propose a biased sampling strategy which samples tuples from one relation in the proportion with which their matching tuples appear in the other relation. The intuition behind this approach is that the resulting biased sample is more likely to reflect the structure of the actual join result between the two relations. Information about the frequency of the various join attribute values is assumed to be available in the form of some synopsis structures like histograms. There has also been work to estimate the actual result of an aggregate query which involves a relational join operation on its input relations. In fact, Haas, Hellerstein and Wang [63] propose a system called Online Aggregation (OLA) that can support online execution of analytic-style aggregation queries. They propose the system to have a visual interface which displays the current estimate of the aggregate query along with error bounds at a certain confidence level. Then, as time progresses, the system continually refines the estimate and at the same time shrinks the width of the error bounds. The user, who is presented with such a visual interface, has at all times the option to terminate further execution of the query in case the error bound width is satisfactory for the given confidence level. The authors propose the use of random sampling from input relations to provide estimates in OLA. Further, they describe some of the key changes that would be required in a DBMS to support OLA. In [51], Haas describes statistical techniques for computing error bounds in OLA. The work on OLA eventually grew into the UC Berkeley CONTROL project. In their article [62], Hellerstein et al. describe various issues in providing interactive data analysis and possible approaches to address those issues. Haas and Hellerstein [53, 54] propose a family of join algorithms called ripple joins to perform relational joins in an OLA framework. Ripple joins were designed to minimize the time until an acceptably precise estimate of the query result is made available, as opposed to minimizing the time to completion of the query as in a traditional DBMS. For
a two-table join, the algorithm retrieves a certain number of random tuples from both relations at each sampling step; these new tuples are joined with previously seen tuples and with each other. The running result of the aggregate query is updated with these newly retrieved tuples. The paper also describes how a statistically meaningful confidence interval of the estimated result can be computed based on the Central Limit Theorem (CLT). Luo et al. [87] present an online parallel hash ripple join algorithm to speed up the execution of the ripple join, especially when the join selectivity is low and also when the user wishes to continue execution until completion. The algorithm is assumed to be executed at a fixed set of processor nodes. At each node, a hash table is maintained for every relation. Moreover, every bucket in each hash table could have some tuples stored in memory and some others stored on disk. The join algorithm proceeds in two phases; in the first phase, tuples from both relations are retrieved in a random order and distributed to the processor nodes so that each node would perform roughly the same amount of work for executing the join. By using multiple threads at each node, production of join tuples from the in-memory hash table buckets begins even as tuples are being distributed to the various processors. The second phase begins after redistribution from the first phase is complete. In this phase, a new in-memory hash table is created which uses a hashing function different from the function used in phase 1. The tuples in the disk-resident buckets of the hash table of phase 1 are then hashed according to the hashing function of phase 2 and joined. The algorithm provides a considerable speed-up factor over the one-node ripple join, provided its memory requirements are met. Jermaine et al. [73, 74] point out that the drawback of both the ripple join algorithms described above is that the statistical guarantees provided by the estimator are valid only as long as the output of the join can be accommodated in main memory. In order to counteract this problem, they propose the Sort-Merge-Shrink join algorithm as a generalization of the ripple join which can provide error guarantees throughout execution,
even if it operates from disk. The algorithm proceeds in three phases. In the sort phase, the two input relations are read in parallel and sorted into runs. Each pair of runs is subject to an in-memory hash ripple join and provides a corresponding estimate of the join result. The merge and shrink phases execute concurrently: in the merge phase, tuples are retrieved from the various sorted runs of both relations and joined with each other. Since the sorted runs “lose” tuples which are pulled by the merge phase, the shrinking phase takes these tuples into account and updates the estimator accordingly. The authors provide a detailed statistical analysis of the estimator as well as computation of error bounds. Estimation using sampling of the number of distinct values in a column has been studied by Haas et al. [48]. They provide an overview of the estimators used in the database and statistics literature and also develop several new sampling-based estimators for the distinct value estimation problem. They propose a new hybrid sampling estimator which explicitly adapts to different levels of data skew. Their hybrid estimator performs a Chi-square test to detect skew in the distribution of the attribute value. If the data appears to be skewed, then Shlosser's estimator is used, while if the test does not detect skew, a smoothed-jackknife estimator (which is a modification of the conventional jackknife estimator) is used. The authors attribute the dearth of work on sampling-based estimation of the number of distinct values to the inherent difficulty of the problem, noting that it is a much harder problem than estimating the selectivity of a join. Haas and Stokes [50] present a detailed study of the problem of estimating the number of classes in a finite population. This is equivalent to the database problem of estimating the number of distinct values in a relation. The authors make recommendations about which statistical estimator is appropriate subject to constraints and finally claim from empirical results that a hybrid estimator which adapts according to data skew is the best estimator.
There has also been work by Charikar et al. [16] which establishes a negative result stating that no sampling-based estimator for estimating the number of distinct values can guarantee small error across all input distributions unless it examines a large fraction of the input data. They also present a Guaranteed Error Estimator (GEE) whose error is provably no worse than their negative result. Since the GEE is a general estimator providing optimal error over all distributions, the authors note that its accuracy may be lower than some previous estimators on specific distributions. Hence, they propose an estimator called the Adaptive Estimator (AE) which is similar in spirit to Haas et al.’s hybrid estimator [50], but unlike the latter, is not composed of two distinct estimators. Rather the AE considers the contribution of data items having high and low frequencies in a single unified estimator. In the AQUA system [41] for approximate answering of queries, Acharya et al. [6] propose using synopses for estimating the result of relational join queries involving foreign-key joins rather than using random samples from the base relations. These synopses are actually precomputed samples from a small set of distinguished joins and are called join synopses in the paper. The idea of join synopses is that by precomputing samples from a small set of distinguished joins, these samples can be used for estimating the result of many other joins. The concept is applicable in a k-way join where each join involves a primary and foreign key of the participating relations. The paper describes that if workload information is available, it can be used to design an optimal allocation for the join synopses that minimizes the overall error in the approximate answers over the workload. Acharya et al. [5] propose using a mix of uniform and biased samples for approximately answering queries with a GROUP-BY clause. Their sampling technique called congressional sampling relies on using precomputed samples which are a hybrid union of uniform and biased samples. They assume that the selectivity of the query predicate is not so low that their precomputed sample completely misses one or more groups from the result of
the GROUP-BY query. Based on this assumption, they devise a sampling plan for the different groups such that the expected minimum number of tuples satisfying the query predicate in any group is maximized. The authors also present one-pass algorithms [4] for constructing the congressional samples. Ganti et al. [37] describe a biased sampling approach which they call ICICLES to obtain random samples which are tuned to a particular workload. Thus, if a tuple is chosen by many queries in a workload, it has a higher probability of being selected in the self-tuning sample as compared to tuples which are chosen by fewer queries. Since this is a non-uniform sample, traditional sampling-based estimators must be adapted for these samples. The paper describes modified estimators for the common aggregation operations. It also describes how the self-tuning samples are tuned in the presence of a dynamically changing workload. Chaudhuri et al. [18] note that uniform random sampling to estimate aggregate queries is ineffective when the distribution of the aggregate attribute is skewed or when the query predicate has a low selectivity. They propose using a combination of two methods to address this problem. Their first approach is to index separately those attribute values which contribute significantly to the query result. This method is called Outlier Indexing in the paper. The second approach proposed in the paper is to exploit workload information to perform weighted sampling. According to this technique, records which satisfied many queries in the workload are sampled more heavily than records that satisfied fewer queries. Chaudhuri, Das and Narasayya [19, 20] describe how workload information can be used to precompute a sample that minimizes the error for the given workload. The problem of selection of the sample is framed as an optimization problem so that the error in estimation of the workload queries using the resulting sample is minimized. When the actual incoming queries are identical to queries in the workload, this approach gives a solution with minimal error across all queries. The paper also describes how the choice of
the sample can be tuned to achieve effective estimates when the actual queries are similar but not identical to the workload. Babcock, Chaudhuri and Das [10] note that a uniformly random sample can lead to inaccurate answers for many queries. They observe that for such queries, estimation using an appropriately biased sample can lead to more accurate answers as compared to estimation using uniformly random samples. Based on this idea, the paper describes a technique called small group sampling which is designed to approximately answer aggregation queries having a GROUP-BY clause. The distinctive feature of this technique as compared to previous biased sampling techniques like congressional sampling is that a new biased sample is chosen for every GROUP-BY query, such that it maximizes the accuracy of estimating the query rather than trying to devise a biased sample that maximizes the accuracy over an entire workload of queries. According to this technique, larger groups from the output of the GROUP-BY queries are sampled uniformly while the small groups are sampled at a higher rate to ensure that they are adequately represented. The group samples are obtained on a per-query basis from an overall sample which is computed in a pre-processing phase. In fact, database sampling has been recognized as an important enough problem that ISO has been working to develop a standard interface for sampling from relational database systems [55], and significant research efforts are directed at providing sampling from database systems by vendors such as IBM [52].

2.2 Estimation Using Non-sampling Precomputed Synopses
Estimation in databases using a non-sampling technique was first proposed by Rowe [106, 107]. The proposed technique is called antisampling and involves the creation of a special auxiliary structure called a database abstract. The abstract considers the distribution of several attributes and groups of attributes. Correlations between different attributes can also be characterized as statistics. This technique was found to be faster than random sampling, but required domain knowledge about the various attributes.
Classic work on histogram-based estimation of predicate selectivity is by Selinger et al. [110] and Piatetsky-Shapiro and Connell [102]. Selectivity estimation of queries with multidimensional predicates using histograms was presented by Muralikrishna and DeWitt [92]. They show that the maximum error in estimation can be controlled more effectively by choosing equi-depth histograms as opposed to equi-width histograms. Ioannidis [70] describes how serial histograms are optimal for aggregate queries involving arbitrary join trees with equality predicates. Ioannidis and Poosala [71] have also studied how histograms can be used to approximately answer non-aggregate queries which have a set-based result. Several histogram construction schemes [42, 45, 72] have been proposed in the literature. Jagadish et al. [72] describe techniques for constructing histograms which can minimize a given error metric, where the error is introduced because of approximation of values in a bucket by a single value associated with the bucket. They also describe techniques for augmenting histograms with additional information so that they can be used to provide accuracy guarantees of the estimated results. Construction of approximate histograms by considering only a random sample of the data set was investigated by Chaudhuri et al. [23]. Their technique uses an adaptive sampling approach to determine the sample size that would be sufficient to generate approximate histograms which can guarantee pre-specified error bounds in estimation. They also extend their work to consider duplicate values in the domain of the attribute for which a histogram is to be constructed. The problem of estimation of the number of distinct value combinations of a set of attributes has been studied by Yu et al. [121]. Due to the inherent difficulty of developing a good, sampling-based estimation solution to the problem, they propose using additional information about the data in the form of histograms, indexes or data cubes. In a recent paper [28], Dobra presents a study of when histograms are best suited for approximation. The paper considers the long-standing assumption that histograms are
most effective only when all elements in a bucket have the same frequency and actually extends it to a less restrictive assumption that histograms are well-suited when elements within a bucket are randomly arranged even though they might have different frequencies. Wavelets have a long history as mathematical tools for hierarchical decomposition of functions in signal and image processing. Vitter and his collaborators have also studied how wavelets can be applied to selectivity estimation of queries [89] and also for computing aggregates over data cubes [118, 119]. Chakrabarti et al. [15] present techniques for approximate computation of results for aggregate as well as non-aggregate queries using Haar wavelets. One more summary structure that has been proposed for approximating the size of joins is sketches. Sketches are small-space summaries of data suited for data streams. A sketch generally consists of multiple counters corresponding to random variables which enable them to provide approximate answers with error guarantees for a priori decided queries. Some of the earliest work on sketches was presented by Alon, Gibbons, Matias and Szegedy [7, 8]. Sketching techniques with improved error guarantees and faster update times have been proposed as Fast-Count sketches [117]. A statistical analysis of various sketching techniques along with recommendations on their use for estimating join sizes appears in [108].

2.3 Analytic Query Processing Using Non-standard Data Models
A data model for OLAP applications called the data cube was proposed by Gray et al. [44] for processing of analytic-style aggregation queries over data warehouses. The paper describes a generalization of the SQL GROUP BY operator to multiple dimensions by introducing the data cube operator. This operator treats each of the possible aggregation attributes as a dimension of a high-dimensional space. The aggregate of a particular set of attribute values is considered a point in this space. Since the cube holds precomputed aggregate values over all dimensions, it can be used to quickly compute results to GROUP-BY queries over multiple dimensions. The data cube is precomputed
and can require significant amount of space for storage of the precomputed aggregates along the different dimensions. A more serious drawback of the data cube approach is that it can be used to efficiently answer only such queries which have a grouping hierarchy that conforms to the hierarchy on which the data cube is built. Moreover, complex queries which have been addressed in this thesis such as queries having correlated subqueries are not amenable to efficient processing with the data cube model. Due to potentially large sizes of data cubes for high dimensions, researchers have studied techniques to discover semantic relationships in a data cube. This approach reduces the number of precomputed aggregates grouped by different attributes if their aggregate values are identical. The quotient cube [79] and quotient cube tree [80] structures are such compressed representations of the data cube which preserve semantic relationships while also allowing processing of point and range queries. Another approach that has been employed in shrinking the data cube while at the same time preserving all the information in it is the Dwarf [113, 114] structure. Dwarf identifies and eliminates redundancies in prefixes and suffixes of the values along different dimensions of a data cube. The paper shows that by eliminating prefix as well as suffix redundancies, both dense as well as sparse data cubes can be compressed effectively. The paper also shows improved cube construction time, query response time as well as update time as compared to cube trees [105]. Although, the dwarf structure improves the performance of the data cube model, it still suffers from the inherent drawback of the data cube model – it is not suitable to efficiently answer arbitrarily complex queries such as queries with correlated subqueries. Recently, a new column-oriented architecture for database systems called C-store was proposed by Stonebraker et al [115]. The system has been designed for an environment that has much higher number of database reads as opposed to writes, such as a data warehousing environment. C-store logically splits attributes of a relational table into projections which are collections of attributes, and stores them on disk such that all values
of any attribute are stored adjacent to each other. The paper presents experimental results which show that C-store executes several select-project-join and group-by queries over the TPC-H benchmark much faster than commercial row-oriented or column-oriented systems. At the time of the paper [115], the system was still under development.
CHAPTER 3
MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION
3.1 Introduction
With ever-increasing database sizes, randomization and randomized algorithms [91] have become vital data management tools. In particular, random sampling is one of the most important sources of randomness for such algorithms. Scores of algorithms that are useful over large data repositories either require a randomized input ordering for data (i.e., an online random sample), or else they operate over samples of the data to increase the speed of the algorithm. Although applications requiring randomization abound in the data management literature, we specifically consider online aggregation [54, 62, 63] in this thesis. In online aggregation, database records are processed one-at-a-time, and used to keep the user informed of the current “best guess” as to the eventual answer to the query. If the records are input into the online aggregation algorithm in a randomized order, then it becomes possible to give probabilistic guarantees on the relationship of the current guess to the eventual answer to the query. Despite the obvious importance of random sampling in a database environment and dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD and VLDB conferences are concerned with database sampling), there has been relatively little work towards actually supporting random sampling with physical database file organizations. The classic work in this area (by Olken and his co-authors [98, 99, 101]) suffers from a key drawback: each record sampled from a database file requires a random disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast approximate query processing or speeding up a data mining algorithm, this is clearly unacceptable.
The Materialized Sample View
In this chapter, we propose to use the materialized sample view¹ as a convenient abstraction for allowing efficient random sampling from a database. For example, consider the following database schema:
SALE (DAY, CUST, PART, SUPP)
Imagine that we want to support fast, random sampling from this table, and most of our queries include a temporal range predicate on the DAY attribute. This is exactly the interface provided by a materialized sample view. A materialized sample view can be specified with the following SQL-like query:
CREATE MATERIALIZED SAMPLE VIEW MySam AS
SELECT * FROM SALE
INDEX ON DAY
In general, the range attribute or attributes referenced in the INDEX ON clause can be spatial, temporal, or otherwise, depending on the requirements of the application. While the materialized sample view is a straightforward concept, efficient implementation is difficult. The primary technical contribution of this thesis is a novel index structure called the ACE Tree (Appendability, Combinability, Exponentiality; see Section 3.4) which can be used to efficiently implement a materialized sample view. Such a view, stored as an ACE Tree, has the following characteristics:
• It is possible to efficiently sample (without replacement) from any arbitrary range query over the indexed attribute, at a rate that is far faster than is possible using the techniques proposed by Olken [96] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multi-dimensional predicates.
• The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that match the range query. This is vital for important applications like online aggregation and data mining.
• Finally, the sample view is created efficiently, requiring only two external sorts of the records in the view, and with only a very small space overhead beyond the storage required for the data records.
We note that while the materialized sample view is a logical concept, the actual file organization used to implement such a view can be referred to as a sample index, since it is a primary index structure used to efficiently retrieve random samples.
¹ This term was originally used in Olken's PhD thesis [96] in a slightly different context, where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe subsequently, our materialized sample view is a structure that allows online sampling.
3.2 Existing Sampling Techniques
In this section, we discuss three simple techniques that can be used to create materialized sample views to support random sampling from a relational selection predicate.
3.2.1 Randomly Permuted Files
One option for creating a materialized sample view is to randomly shuffle or permute the records in the view. To sample from a relational selection predicate over the view, we scan it sequentially from beginning to end and accept those records that satisfy the predicate while rejecting the rest. This method has the advantage that it is very simple, and using a fast external sorting algorithm, permuting the records can be very efficient. Furthermore, since the process of scanning the file can make use of the fast, sequential I/O provided by modern hard disks, a materialized view organized as a randomly permuted file can be very useful for answering queries that are not very selective. However, the major problem with such a materialized view is that the fraction of useful samples retrieved by it is directly proportional to the selectivity of the selection predicate. For example, if the selectivity of the query is 10%, then on average only 10% of the random samples obtained by such a view can be used to answer the query. Hence for moderate to low selectivity queries, most of the random samples retrieved by such a view will not be useful for answering queries. Thus, the performance of such a view quickly degrades as selectivity of the selection predicates decreases.
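For illustration, here is a minimal Python sketch of this scan-and-reject scheme (the record layout, predicate, and scan budget are hypothetical and not part of the system described); with a 10% selectivity predicate, only about one scanned record in ten is a useful sample.

import random

def permuted_file_sample(records, predicate, scan_budget):
    """Scan the first scan_budget records of a randomly permuted file and keep
    those satisfying the predicate; the kept set is a true random sample of all
    records matching the predicate."""
    accepted = []
    for rec in records[:scan_budget]:      # sequential scan of the permuted file
        if predicate(rec):                 # reject records outside the range
            accepted.append(rec)
    return accepted

# Hypothetical usage: a table of (DAY, AMOUNT) pairs and a ~10% selective predicate.
table = [(day, random.random()) for day in range(100_000)]
random.shuffle(table)                      # one-time random permutation of the file
sample = permuted_file_sample(table, lambda r: r[0] < 10_000, scan_budget=5_000)
print(len(sample), "useful samples out of 5,000 records scanned")  # roughly 500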
3.2.2 Sampling from Indices
The second approach to creating a materialized sample view is to use one of the standard indexing structures like a hashing scheme or a tree-based index structure to organize the records in the view. In order to produce random samples from such a materialized view, we can employ iterative or batch sampling techniques [9, 96, 99–101] that sample directly from a relational selection predicate, thus avoiding the aforementioned problem of obtaining too few relevant records in the sample. Olken [96] presents a comprehensive analysis and comparison of many such techniques. In this Section we discuss the technique of sampling from a materialized view organized as a ranked B+-Tree, since it has been proven to be the most efficient existing iterative sampling technique in terms of number of disk accesses. A ranked B+-Tree is a regular B+-Tree whose internal nodes have been augmented with information which permits one to find the ith record in the file. Let us assume that the relation SALE presented in the Introduction is stored as a ranked B+-Tree file indexed on the attribute DAY and we want to retrieve a random sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005. This translates to the following SQL query: SELECT * FROM SALE WHERE SALE.DAY BETWEEN ’11-28-2004’ AND ’03-02-2005’ Algorithm 1 above can then be used to obtain a random sample of relevant records from the ranked B+-Tree file. The drawback of the above algorithm is that whenever a leaf page is accessed, the algorithm retrieves only that record whose rank matches with the rank being searched for. Hence for every record which resides on a page that is not currently buffered, the retrieval time is the same as the time required for a random disk I/O. Thus, as long as there are unbuffered leaf pages containing candidate records, the rate of record retrieval is very slow.
Algorithm 1: Sampling from a Ranked B+-Tree
Algorithm SampleRankedB+Tree (Value v1, Value v2)
1. Find the rank r1 of the record which has the smallest DAY value greater than v1.
2. Find the rank r2 of the record which has the largest DAY value smaller than v2.
3. While sample size < desired sample size
   3.a Generate a uniformly distributed random number i, between r1 and r2.
   3.b If i has been generated previously, discard it and generate the next random number.
   3.c Using the rank information in the internal nodes, retrieve the record whose rank is i.
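To make the rank-based procedure concrete, the following Python sketch mimics Algorithm 1 using a sorted in-memory array as a stand-in for the ranked B+-Tree; the bisect-based rank lookups and inclusive bounds are illustrative assumptions rather than the thesis's implementation, and in a real ranked B+-Tree each rank lookup would typically cost one random disk I/O for an unbuffered leaf page.

import bisect
import random

def sample_by_rank(sorted_keys, v1, v2, desired):
    """Sample record ranks uniformly, without replacement, from the predicate
    v1 <= key <= v2 over a sorted key file (stand-in for a ranked B+-Tree)."""
    r1 = bisect.bisect_left(sorted_keys, v1)        # rank of first key >= v1
    r2 = bisect.bisect_right(sorted_keys, v2) - 1   # rank of last key <= v2
    if r2 < r1:
        return []
    desired = min(desired, r2 - r1 + 1)
    ranks = random.sample(range(r1, r2 + 1), desired)  # without replacement
    # Each lookup below corresponds to one root-to-leaf descent in the tree.
    return [sorted_keys[i] for i in ranks]

days = sorted(random.randrange(0, 3650) for _ in range(100_000))
print(sample_by_rank(days, 1200, 1300, desired=10))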
3.2.3 Block-based Random Sampling
While the classic algorithms of Olken and Antoshenkov sample records one-at-a-time, it is possible to sample from an indexing structure such as a B+-Tree, and make use of entire blocks of records [21, 55]. The number of records per block is typically on the order of 100 to 1000, leading to a speedup of two or three orders of magnitude in the number of records retrieved over time if all of the records in each block are consumed, rather than a single record. However, there are two problems with this approach. First, if the structure is used to estimate the answer to some aggregate query, then the confidence bounds associated with any estimate provided after N samples have been retrieved from a range predicate using a B+-Tree (or some other index structure) may be much wider than the confidence bounds that would have been obtained had all N samples been independent. In the extreme case where the values on each block of records are closely correlated with one another, all of the N samples may be no better than a single sample. Second, any algorithm which makes use of such a sample must be aware of the block-based method used to sample the index, and adjust its estimates accordingly, thus adding complexity to the query result estimating process. For algorithms such as Bradley’s K-means algorithm [11], it is not clear whether or not such samples are even appropriate.
3.3 Overview of Our Approach
We propose an entirely different strategy for implementing a materialized sample view. Our strategy uses a new data structure called the ACE Tree to index the records in the sample view. At the highest level, the ACE Tree partitions a data set into a large number of different random samples such that each is a random sample without replacement from one particular range query. When an application asks to sample from some arbitrary range query, the ACE Tree and its associated algorithms filter and combine these samples so that very quickly, a large and random subset of the records satisfying the range query is returned. The sampling algorithm of the ACE Tree is an online algorithm, which means that as time progresses, a larger and larger sample is produced by the structure. At all times, the set of records retrieved is a true random sample of all the database records matching the range selection predicate.
3.3.1 ACE Tree Leaf Nodes
The ACE Tree stores records in a large set of leaf nodes on disk. Every leaf node has two components:
1. A set of h ranges, where a range is a pair of key values in the domain of the key attribute and h is the height of the ACE Tree. Unlike a B+-Tree, each leaf node in the ACE Tree stores records falling in several different ranges. The ith range associated with leaf node L is denoted by L.Ri. The h different ranges associated with a leaf node are hierarchical; that is, L.R1 ⊃ L.R2 ⊃ · · · ⊃ L.Rh. The first range in any leaf node, L.R1, always contains a uniform random sample of all records of the database, thus corresponding to the range (−∞, ∞). The hth range in any leaf node is the smallest among all other ranges in that leaf node.
2. A set of h associated sections. The ith section of leaf node L is denoted by L.Si. The section L.Si contains a random subset of all the database records with key values in the range L.Ri.
Figure 3-1 depicts an example leaf node in the ACE Tree with attribute range values written above each section and section numbers marked below. Records within each section are shown as circles.
Figure 3-1. Structure of a leaf node of the ACE tree.
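A minimal Python rendering of this leaf-node layout, with field names chosen for illustration and section contents loosely following the example of Figure 3-1:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LeafNode:
    ranges: List[Tuple[int, int]]    # hierarchical ranges R1 ⊃ R2 ⊃ ... ⊃ Rh
    sections: List[List[int]]        # section Si holds a random subset of the
                                     # database records with keys in Ri

# An example leaf with h = 4, in the spirit of Figure 3-1.
leaf = LeafNode(
    ranges=[(0, 100), (0, 50), (0, 25), (0, 12)],
    sections=[[75, 36, 47, 41], [18, 10, 22], [25, 3], [11, 7, 1]],
)
# Every record in section i must fall inside range Ri.
assert all(lo <= k <= hi
           for (lo, hi), sec in zip(leaf.ranges, leaf.sections) for k in sec)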
3.3.2 ACE Tree Structure
Logically, the ACE Tree is a disk-based binary tree data structure with internal nodes used to index leaf nodes, and leaf nodes used to store the actual data. Since the internal nodes in a binary tree are much smaller than disk pages, they are packed and stored together in disk-page-sized units [27]. Each internal node has the following components:
1. A range R of key values associated with the node.
2. A key value k that splits R and partitions the data on the left and right of the node.
3. Pointers ptr_l and ptr_r, that point to the left and right children of the node.
4. Counts cnt_l and cnt_r, that give the number of database records falling in the ranges associated with the left and right child nodes. These values can be used, for example, during evaluation of online aggregation queries which require the size of the population from which we are sampling [54].
Figure 3-2 shows the logical structure of the ACE Tree. Ii,j refers to the jth internal node at level i. The root node is labeled with a range I1,1.R = [0-100], signifying that all records in the data set have key values within this range. The key of the root node partitions I1,1.R into I2,1.R = [0-50] and I2,2.R = [51-100]. Similarly each internal node divides the range of its descendents with its own key. The ranges associated with each section of a leaf node are determined by the ranges associated with each internal node on the path from the root node to the leaf. For example, if we consider the path from the root node down to leaf node L4, the ranges that we encounter along the path are 0-100, 0-50, 26-50 and 38-50. Thus for L4, L4.S1 has a random sample of records in the range 0-100, L4.S2 has a random sample in the range
0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in the range 38-50.
Figure 3-2. Structure of the ACE tree.
3.3.3 Example Query Execution in ACE Tree
In the following discussion, we demonstrate how the ACE Tree efficiently retrieves a large random sample of records for any given range query. The query algorithm is formally described in Section 3.6. Let Q = [30-65] be our example query postulated over the ACE Tree depicted in Figure 3-2. The query algorithm starts at I1,1 , the root node. Since I2,1 .R overlaps Q, the algorithm decides to explore the left child node labeled I2,1 in Figure 3-2. At this point the two range values associated with the left and right children of I2,1 are 0-25 and 26-50. Since the left child range has no overlap with the query range, the algorithm chooses to explore the right child next. At this child node (I3,2 ), the algorithm picks leaf node L3 to be the first leaf node retrieved by the index. Records from section 1 of L3 (which totally encompasses Q) are filtered for Q and returned immediately to the consumer of the sample
as a random sample from the range [30-65], while records from sections 2, 3 and 4 are stored in memory. Figure 3-3 shows the one random sample from section 1 of L3 which can be used directly for answering query Q.
Figure 3-3. Random samples from section 1 of L3.
Next, the algorithm again starts at the root node and now chooses to explore the right child node I2,2. After performing range comparisons, it explores the left child of I2,2, which is I3,3, since I3,4.R has no overlap with Q. The algorithm chooses to visit the left child node of I3,3 next, which is leaf node L5. This is the second leaf node to be retrieved. As depicted in Figure 3-4, since L5.R1 encompasses Q, the records of L5.S1 are filtered and returned immediately to the user as two additional samples from Q. Furthermore, section 2 records are combined with section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are stored in memory for later use. Note that after retrieving just two leaf nodes in our small example, the algorithm obtains eleven randomly selected records from the query range. However, in a real index, this number would be many times greater. Thus, the ACE Tree supports “fast first”
sampling from a range predicate: a large number of samples are returned very quickly. We contrast this with a sample taken from a B+-Tree having a similar structure to the ACE Tree depicted in Figure 3-2. The B+-Tree sampling algorithm would need to pre-select which nodes to explore. Since four leaf nodes in the tree are needed to span the query range, there is a reasonably high likelihood that the first four samples taken would need to access all four leaf nodes. As the ACE Tree Query Algorithm progresses, it goes on to retrieve the rest of the leaf nodes in the order L4 , L6 , L1 , L7 , L2 , L8 .
Figure 3-4. Combining samples from L3 and L5.
3.3.4 Choice of Binary Versus k-Ary Tree
The ACE Tree as described above can also be implemented as a k-ary tree instead of a binary tree. For example, for a ternary tree, each internal node can have two (instead of one) keys and three (instead of two) children. If the height of the tree were h, every leaf node would still have h ranges and h sections associated with it. Like a standard complete k-ary tree, the number of leaf nodes will be k^h. However, the big difference would be the manner in which a query is executed using a k-ary ACE Tree as opposed to a binary ACE Tree. The query algorithm will always start at the root node and traverse
down to a leaf. However, at every internal node it will alternate between the k children in a round-robin fashion. Moreover, since the data space would be divided into k equal parts at each level, the query algorithm might have to make k traversals and hence access k leaf nodes before it can combine sections that can be used to answer the query. This would mean that the query algorithm will have to wait longer (than a binary ACE Tree) before it can combine leaf node sections and thus return useful random samples. Since the goal of the ACE Tree is to support “fast first” sampling, use of a binary tree instead of a k-ary tree seems to be a better choice to implement the ACE Tree.
3.4 Properties of the ACE Tree
In this Section we describe the three important properties of the ACE Tree which facilitate the efficient retrieval of random samples from any range query, and will be instrumental in ensuring the performance of the algorithm described in Section 3.6.
3.4.1 Combinability
Figure 3-5. Combining two sections of leaf nodes of the ACE tree.
The various samples produced from processing a set of leaf nodes are combinable. For example, consider the two leaf nodes L1 and L3, and the query “Compute a random sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 3-5, first we read leaf node L1 and filter the second section in order to produce a random sample of size n1 from Ql which is returned to the user. Next we read leaf node L3, and filter its second section L3.S2 to produce a random sample of size n2 from Ql which is also returned to the user. At this point, the two sets returned to the user constitute a single random sample from Ql of size n1 + n2. This means that as more and more nodes are read from disk, the records contained in them can be combined to obtain an ever-increasing random sample from any range query.
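A small Python sketch of this filter-and-combine step (the section contents are made up for illustration); because each section is an independent random subset of the records in its range, the concatenation of the filtered sections remains a single random sample from Ql:

def filter_range(section, lo, hi):
    """Keep only the sampled records that satisfy the range predicate."""
    return [k for k in section if lo <= k <= hi]

# Second sections of leaf nodes L1 and L3 both cover 0-50, which contains Ql = [3, 47].
l1_s2 = [18, 10, 22]          # random subset of records with keys in 0-50
l3_s2 = [39, 26, 33]          # another random subset of records with keys in 0-50
running_sample = []
for section in (l1_s2, l3_s2):             # leaf nodes arrive one at a time
    running_sample += filter_range(section, 3, 47)
# running_sample is a single, growing random sample from Ql = [3, 47].
print(running_sample)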
3.4.2 Appendability
The ith sections from two leaf nodes are appendable. That is, given two leaf nodes Lj and Lk, Lj.Si ∪ Lk.Si is always a true random sample of all records of the database with key values within the range Lj.Ri ∪ Lk.Ri. For example, reconsider the query, “Compute a random sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 3-6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Ql. This means that sections are never wasted.
3.4.3 Exponentiality
The ranges in a leaf node are exponential. The number of database records that fall in L.Ri is twice the number of records that fall in L.Ri+1. This allows the ACE Tree to maintain the invariant that for any query Q′ over a relation R such that at least hµ database records fall in Q′, and with |R|/2^{k+1} = (1/2) × |σ_{RC1}(R)|. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6. If RC2 = L4.R3 ∪ L6.R3, then half the database records fall in RC2. Also, since |σ_{Q2}(R)| ≥ |R|/4 we have |σ_{Q2}(R)| ≥ (1/2) × |σ_{RC2}(R)|. This can be generalized to obtain the invariant stated in Section 3.4.3.
3.5.4 Construction Phase 2
The objective of Phase 2 is to construct leaf nodes with appropriate sections and populate them with records. This can be achieved by the following three steps:
1. Assign a uniformly generated random number between 1 and h to each record as its section number.
2. Associate an additional random number with the record that will be used to identify the leaf node to which the record will be assigned.
3. Finally, re-organize the file by performing an external sort to group records in a given leaf node and a given section together.
Figure 3-9. Phase 2 of tree construction. (a) Records assigned section numbers. (b) Records assigned leaf numbers. (c) Records organized into leaf nodes.
Figure 3-9(a) depicts our example data set after we have assigned each record a
randomly generated section number, assuming four sections in each leaf node. In Step 2, the algorithm assigns one more randomly generated number to each record, which will identify the leaf node to which the record will be assigned. We assume for our example that the number of leaf nodes is 2^{h−1} = 2^3 = 8. The number to identify the leaf node is assigned as follows.
1. First, the section number of the record is checked. We denote this value as s.
2. We then start at the root of the tree and traverse down by comparing the record key with s − 1 key values. After the comparisons, if we arrive at an internal node, Ii,j, then we assign the record to one of the leaves in the subtree rooted at Ii,j.
From the example of Figure 3-9(a), the first record having key value 3 has been
assigned to section 1. Since this record can be randomly assigned to any leaf from 1 through 8, we assign it to leaf 7. The next record of Figure 3-9(a) has been assigned to section number 2. Referring back to Figure 3-7, we see that the key of the root node is 50. Since the key of the record is 7 which is less than 50, the record will be assigned to a leaf node in the left subtree of the root. Hence we assign a leaf node between 1 and 4 to this record. In our example, we randomly choose the leaf node 3. For the next record having key value 10, we see that the section number assigned is 3. To assign a leaf node to this record, we initially compare its key with the key of the root node. Referring to Figure 3-7, we see that 10 is smaller than 50; hence we then compare it with 25 which is the key of the left child node of the root. Since the record key is smaller than 25, we assign the record to some leaf node in the left subtree of the node with key 25 by assigning to it a random number between 1 and 2. The section number and leaf node identifiers for each record are written in a small amount of temporary disk space associated with each record. Once all records have been assigned to leaf nodes and sections, the dataset is re-organized into leaf nodes using a two-pass external sorting algorithm as follows: •
Records are sorted in ascending order of their leaf node number.
•
Records with the same leaf node number are arranged in ascending order of their section number. The re-organized data set is depicted in Figure 3-9(c).
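The following Python sketch (variable names, the in-memory sort, and the toy median table are illustrative assumptions, not the thesis's implementation) walks through the three Phase 2 steps for a tree of height h = 4, loosely following the running example:

import random

def phase2_assign(keys, medians, h):
    """Assign each key a section number (step 1) and a leaf number (step 2),
    then sort by (leaf, section) as a stand-in for the external sort (step 3).
    medians[d][j] is the split key of the j-th internal node at depth d."""
    num_leaves = 2 ** (h - 1)
    out = []
    for key in keys:
        s = random.randint(1, h)               # step 1: uniform section number
        j = 0                                  # step 2: descend s-1 comparisons
        for d in range(s - 1):
            j = 2 * j if key <= medians[d][j] else 2 * j + 1
        width = num_leaves // (2 ** (s - 1))   # leaves under the node reached
        leaf = j * width + random.randrange(width) + 1
        out.append((leaf, s, key))
    return sorted(out)                         # step 3: group by leaf, then section

# Toy example with 8 leaves and medians in the spirit of the running example.
medians = [[50], [25, 75], [12, 37, 62, 88]]
for leaf, s, key in phase2_assign(random.sample(range(1, 100), 30), medians, h=4):
    print(leaf, s, key)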
3.5.5 Combinability/Appendability Revisited
In Phase 2 of the tree construction, we observe that all records belonging to some
section s are segregated based upon the result of the comparison of their key with the appropriate medians, and are then randomly assigned a leaf node number from the feasible ones. Thus, if records from section s of all leaf nodes are merged together, we will obtain all of the section s records. This ensures the appendability property of the ACE Tree. Also note that the probability of assignment of one record to a section is unaffected by the probability of assignment of some other record to that section. Since this results in each section having a random subset of the database records, it is possible to merge a sample of the records from one section that match a range query with a sample of records from a different section that match the same query. This will produce a larger random sample of records falling in the range of the query, thus ensuring the combinability property.
3.5.6 Page Alignment
In Phase 2 of the construction algorithm, section numbers and leaf node numbers are randomly generated. Hence we can only predict on expectation the number of records that will fall in each section of each leaf node. As a result, section sizes within each leaf node can differ, and the size of a leaf node itself is variable and will generally not be equal to the size of a disk page. Thus when the leaf nodes are written out to disk, a single leaf node may span across multiple disk pages or may be contained within a single disk page. This situation could be avoided if we fix the size of each section a priori. However, this poses a serious problem. Consider two leaf node sections Li .Sj and Li+1 .Sj . We can force these two sections to contain the same number of records by ensuring that the set of records assigned to section j in Phase 2 of the construction algorithm has equal representation from Li .Rj and Li+1 .Rj . However, this means that the set of records assigned to section j is no longer random. If we fix the section size and force a set number
of records to fall in each section, we invalidate the appendability and combinability properties of the structure. Thus, we are forced to accept a variable section size. In order to implement variable section size, we can adopt one of the following two schemes: 1.
Enforce fixed-sized leaf nodes and allow variable-sized sections within the leaf nodes.
2.
Allow variable-sized leaf nodes along with variable-sized sections. If we choose the fixed-sized leaf node, variable-sized section scheme, leaf node size is
fixed in advance. However, section size is allowed to vary. This allows full sections to grow further by claiming any available space within the leaf node. The leaf node size chosen should be large enough to prevent any leaf node from becoming completely filled up, which prevents the partitioning of any leaf node across two disk pages. The major drawback of this scheme is that the average leaf node space utilization will be very low. Assuming a reasonable set of ACE Tree parameters, a quick calculation shows that if we want to be 99% sure that no leaf node gets filled up, the average leaf node space utilization will be less than 15%.
The variable-sized leaf node, variable-sized section scheme does not impose a size limit on either the leaf node or the section. It allows leaf nodes to grow beyond disk page boundaries, if space is required. The important advantage of this scheme is that it is space-efficient. The main drawback of this approach is that leaf nodes may span multiple disk pages, and hence all such pages must be accessed in order to retrieve such a leaf node. Given that most of the cost associated with reading an arbitrary leaf page is associated with the disk head movement needed to move the disk arm to the appropriate cylinder, this does not pose too much of a problem. Hence we use this scheme for the construction of leaf nodes of the ACE Tree.
3.6 Query Algorithm
In this Section, we describe in detail the algorithm used to answer range queries using the ACE Tree.
3.6.1 Goals
The algorithm has been designed to meet the primary goal of achieving “fast-first” sampling from the index structure, which means it attempts to be greedy on the number of records relevant for the query in the early stages of execution. In order to meet this goal, the query answering algorithm identifies the leaf nodes which contain the maximum number of sections relevant for the query. A section Li1.Sj is relevant for a range query Q if Li1.Rj ∩ Q ≠ ∅ and Li1.Rj ∪ Li2.Rj ∪ · · · ∪ Lin.Rj ⊇ Q, where Li1, . . . , Lin are some leaf nodes in the tree. The query algorithm prioritizes retrieval of leaf nodes so as to:
• Facilitate the combination of sections so as to maximize n in the above formulation, and
• Maximize the number of relevant sections in each leaf node L retrieved, such that L.Sj ∩ Q ≠ ∅ for j = (c + 1) . . . h, where L.Rc is the smallest range in L that encompasses Q.
3.6.2 Algorithm Overview
At a high level, the query answering algorithm retrieves the leaf nodes relevant to answering a query via a series of stabs or traversals, accessing one leaf node per stab. Each stab begins at the root node and traverses down to a leaf. The distinctive feature of the algorithm is that at each internal node that is traversed during a stab, the algorithm chooses to access the child node that was not chosen the last time the node was traversed. For example, imagine that for a given internal node I, the algorithm chooses to traverse to the left child of I during a stab. The next time that I is accessed during a stab, the algorithm will choose to traverse to the right child node. This can be seen in Figure 3-10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to traverse to the left child of the root node during the first stab, while during the second stab it chooses to traverse to the right child of the root node. The advantage of retrieving leaf nodes in this back and forth sequence is that it allows us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a
given number of stabs. The reason that we want a set of non-homogeneous nodes is that nodes from very distant portions of a query range will tend to have sections covering large ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes with the corresponding sections of previously retrieved leaf nodes. The samples obtained can then be filtered and immediately returned.
Figure 3-10. Execution runs of the query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections, and (d) 16 contributing sections.
This order of retrieval is implemented by associating a bit with each internal node that indicates whether the next child node to be retrieved should be the left node or the right node. The value of this bit is toggled every time the node is accessed. Figure 3-10 illustrates the choices made by the algorithm at each internal node during four separate stabs. Note that when the algorithm reaches an internal node where the range associated with one of the child nodes has no overlap with the query range, the algorithm always picks the child node that has overlap with the query, irrespective of the value of the indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at an internal node which overlaps the query range have been accessed. In such a case, the internal node which overlaps the query range is not chosen and is never accessed again.
3.6.3 Data Structures
In addition to the structure of the internal and leaf nodes of the ACE Tree, the query algorithm uses and updates the following two memory resident data structures:
1.
A lookup table T , to store internal node information in the form of a pair of values (next = [LEFT] | [RIGHT], done = [TRUE] | [FALSE]). The first value indicates whether the next node to be retrieved should be the left child or right child. The second value is TRUE if all leaf nodes in the subtree rooted at the current node have already been accessed, else it is FALSE.
2.
An array buckets[h] to hold sections of all the leaf nodes which have been accessed so far and whose records could not be used to answer the query. h is the height of the ACE Tree.
3.6.4
Actual Algorithm
We now present the algorithms used for answering queries using the ACE Tree. Algorithm 2 simply calls Algorithm 3 which is the main tree traversal algorithm, called Shuttle(). Each traversal or stab begins at the root node and proceeds down to a leaf node. In each invocation of Shuttle(), a recursive call is made to either its left or right child with the recursion ending when it reaches a leaf node. At this point, the sections in the leaf node are combined with previously retrieved sections so that they can be used to answer the query. The algorithm for combining sections is described in Algorithm 4. This
Algorithm 2: Query Answering Algorithm
Algorithm Answer (Query Q)
  Let root be the root of the ACE Tree
  While (!T.lookup(root).done)
    T.lookup(root).done = Shuttle(Q, root);

Algorithm 3: ACE Tree traversal algorithm
Algorithm Shuttle (Query Q, Node curr_node)
  If (curr_node is an internal node)
    left_node = curr_node→get_left_node();
    right_node = curr_node→get_right_node();
    If (left_node is done AND right_node is done)
      Mark curr_node as done
    Else if (only right_node is not done)
      Shuttle(Q, right_node);
    Else if (only left_node is not done)
      Shuttle(Q, left_node);
    Else  // both children are not done
      If (Q overlaps only with left_node.R)
        Shuttle(Q, left_node);
      Else if (Q overlaps only with right_node.R)
        Shuttle(Q, right_node);
      Else  // Q overlaps both sides or none
        If (next_node is LEFT)
          Shuttle(Q, left_node);
          Set next_node to RIGHT;
        Else  // next_node is RIGHT
          Shuttle(Q, right_node);
          Set next_node to LEFT;
  Else  // curr_node is a leaf node
    Combine_Tuples(Q, curr_node);
    Mark curr_node as done
Algorithm 4: Algorithm for combining sections
Algorithm Combine_Tuples (Query Q, LeafNode node)
  For each section s in node do
    Store the section numbers required to be combined with s to span Q, in a list list
    flag = true
    For each section number i in list do
      If buckets[] does not have section i
        flag = false
    If (flag == true)
      Combine all sections from list with s and use the records to answer Q
    Else
      Store s in the appropriate bucket
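To make the alternating traversal concrete, here is a compact Python sketch of the stab order produced by Algorithms 2 and 3 (the node layout, overlap test, and done-propagation details are simplifying assumptions, and the section combining of Algorithm 4 is omitted). On an eight-leaf tree shaped like Figure 3-2 and the query Q = [30-65], it reproduces the retrieval order L3, L5, L4, L6, L1, L7, L2, L8 discussed in Section 3.3.3.

class Node:
    def __init__(self, lo, hi, left=None, right=None, leaf_id=None):
        self.range, self.left, self.right, self.leaf_id = (lo, hi), left, right, leaf_id
        self.next_child = "L"          # toggled every time this node routes a stab
        self.done = False

def overlaps(node, q):
    lo, hi = node.range
    return not (hi < q[0] or lo > q[1])

def shuttle(node, q, visit_leaf):
    """One stab: descend from node to a single not-yet-retrieved leaf."""
    if node.leaf_id is not None:
        visit_leaf(node.leaf_id)
        node.done = True
        return
    left, right = node.left, node.right
    if left.done and right.done:
        node.done = True
        return
    if left.done:                                  # only the right side remains
        shuttle(right, q, visit_leaf)
    elif right.done:                               # only the left side remains
        shuttle(left, q, visit_leaf)
    elif overlaps(left, q) and not overlaps(right, q):
        shuttle(left, q, visit_leaf)               # prefer the overlapping child
    elif overlaps(right, q) and not overlaps(left, q):
        shuttle(right, q, visit_leaf)
    else:                                          # Q overlaps both sides or none
        child = left if node.next_child == "L" else right
        node.next_child = "R" if node.next_child == "L" else "L"
        shuttle(child, q, visit_leaf)
    node.done = left.done and right.done

def build(lo, hi, depth, counter):
    if depth == 0:
        counter[0] += 1
        return Node(lo, hi, leaf_id=counter[0])
    mid = (lo + hi) // 2
    return Node(lo, hi, left=build(lo, mid, depth - 1, counter),
                right=build(mid + 1, hi, depth - 1, counter))

root, order = build(0, 100, 3, [0]), []
while not root.done:
    shuttle(root, (30, 65), order.append)
print(order)   # [3, 5, 4, 6, 1, 7, 2, 8]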
3.6.5 Algorithm Analysis
We now present a lower bound on the expected performance of the ACE Tree index for sampling from a relational selection predicate. For simplicity, our analysis assumes that the number of leaf nodes in the tree is a power of 2.
Lemma 1. Efficiency of the ACE Tree for query evaluation.
• Let n be the total number of leaf nodes in an ACE Tree used to sample from some arbitrary range query, Q
• Let p be the largest power of 2 no greater than n
• Let µ be the mean section size in the tree
• Let α be the fraction of database records falling in Q
• Let N be the size of the sample from Q that has been obtained after m ACE Tree leaf nodes have been retrieved from disk
If m is not too large (that is, if m ≤ 2αn + 2), then:
E[N] ≥ (µ/2) p log₂ p
where E[N] denotes the expected value of N (the mean value of N after an infinite number of trials).
Proof. Let Ii,j and Ii,j+1 be the two internal nodes in the ACE Tree where R = Ii,j.R ∪ Ii,j+1.R covers Q and i is maximized. As long as the shuttle algorithm has not retrieved all the children of Ii,j and Ii,j+1 (this is the case as long as m ≤ 2αn + 2), when the mth leaf node has been processed, the expected number of new samples obtained:
Nm = Σ_{k=1}^{⌊log₂ m⌋} Σ_{l=1}^{2^{k−1}} w_{kl} µ
where the outer summation is over each of the h − i contributing sections of the leaf nodes, starting with section number i up to section number h, while Σ_l w_{kl} represents the fraction of records of the 2^{k−1} combined sections that satisfy Q. By the exponentiality property, Σ_l w_{kl} ≥ 1/2 for every k,
Nm ≥ (µ/2) log₂ m
Thus after m leaf nodes have been obtained, the total number of expected samples is given by:
E[N] ≥ Σ_{k=1}^{m} Nk ≥ Σ_{k=1}^{m} (µ/2) log₂ k ≥ (µ/2) m log₂ m
If m is a power of 2, the result is proven.
Lemma 2. The expected number of records µ in any leaf node section is given by:
E[µ] = |R| / (h · 2^{h−1})
where |R| is the total number of database records, h is the height of the ACE Tree and 2^{h−1} is the number of leaf nodes in the ACE Tree.
Proof. The probability of assigning a record to any section i, i ≤ h, is 1/h. Given that the record is assigned to section i, it can be assigned to only one of 2^{i−1} leaf node groups after comparing with the appropriate medians. Since each group would have 2^{h−1}/2^{i−1} candidate leaf nodes, the probability that the record is assigned to some leaf node Lj gives:
E[µi,j] = Σ_{t∈R} (1/h) × [ Σ_{k=1}^{2^{i−1}−1} (1/2^{i−1}) × 0 + (1/2^{i−1}) × (2^{i−1}/2^{h−1}) ] = |R| / (h · 2^{h−1})
This completes the proof of the lemma.
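As an informal sanity check of Lemma 2 (a simulation sketch, not part of the thesis; keys are assumed uniform and the medians dyadic), a Monte Carlo estimate of the size of one fixed leaf-node section can be compared against |R|/(h · 2^{h−1}):

import random

def expected_section_size(num_records, h, leaf=1, section=1, trials=50):
    """Monte Carlo estimate of E[mu_{section,leaf}] under the Phase 2 assignment."""
    num_leaves = 2 ** (h - 1)
    total = 0
    for _ in range(trials):
        hits = 0
        for _ in range(num_records):
            key = random.random()                 # uniform key in [0, 1)
            s = random.randint(1, h)              # section number
            group = int(key * 2 ** (s - 1))       # node reached after s-1 comparisons
            width = num_leaves // (2 ** (s - 1))  # candidate leaves under that node
            chosen = group * width + random.randrange(width) + 1
            if chosen == leaf and s == section:
                hits += 1
        total += hits
    return total / trials

R, h = 20_000, 5
print(expected_section_size(R, h), R / (h * 2 ** (h - 1)))   # both close to 250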
3.7 Multi-Dimensional ACE Trees
The ACE Tree can be easily extended to support queries that include multi-dimensional predicates. The change needed to incorporate this extension is to use a k-d binary tree instead of the regular binary tree for the ACE Tree. Let a1 , . . . , ak be the k key attributes for the k-d ACE Tree. To construct such a tree, the root node would be the median of all the a1 values in the database. Thus the root partitions the dataset based on a1 . At the next step, we need to assign values for level 2 internal nodes of the tree. For each of the resulting partitions of the dataset, we calculate the median of all the a2 values. These two medians are assigned to the two internal nodes at level 2 respectively, and we recursively partition the two halves based on a2 . This process is continued until we finish level k. At level k + 1, we again consider a1 for choosing the medians. We would then assign a randomly generated section number to every record. The strategy for assigning a leaf node number to the records would also be similar to the one described in Section 3.5.4 except that the appropriate key attribute is used while performing comparisons with the internal nodes. Finally, the dataset is sorted into leaf nodes as in Figure 3-9(c). Query answering with the k-d ACE Tree can use the Shuttle algorithm described earlier with a few minor modifications. Whenever a section is retrieved by the algorithm, only records which satisfy all predicates in the query should be returned. Also, the mth
sections of two leaf nodes can be combined only if they match in all m dimensions. The nth sections of two leaf nodes can be appended only if they match in the first n − 1 dimensions and form a contiguous interval over the nth dimension.
Figure 3-11. Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.
3.8 Benchmarking
In this Section, we describe a set of experiments designed to test the ability of the ACE Tree to quickly provide an online random sample from a relational selection predicate as well as to demonstrate that the memory requirement of the ACE Tree is reasonable. We performed two sets of experiments. The first set is designed to test the utility of the ACE Tree for use with one-dimensional data, where the ACE Tree is compared with a simple sequential file scan as well as Antoshenkov’s algorithm for sampling from a ranked B+-Tree. In the second set, we compare a multi-dimensional ACE Tree with the
sequential file scan as well as with the obvious extension of Antoshenkov's algorithm to a two-dimensional R-Tree.
Figure 3-12. Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.
3.8.1 Overview
All experiments were performed on a Linux workstation having 1GB of RAM, 2.4GHz clock speed and with two 80GB, 15,000 RPM Seagate SCSI disks. 64KB data pages were used. Experiment 1. For the first set of experiments, we consider the problem of sampling from a range query of the form: SELECT * FROM SALE WHERE SALE.DAY >= i AND SALE.DAY = d1 AND SALE.DAY = a1 AND SALE.AMOUNT 0 records. Since any reasonable estimator for NOT EXISTS will likely have to compute an estimated sum over each class, a solution for NOT EXISTS should immediately suggest a solution for EXISTS. Also, nested queries having an IN (or NOT IN) clause can be easily rewritten as a nested query having the EXISTS (or NOT EXISTS) clause. For example, the query “Find the total salary of all employees who have not made a sale in the past year” given above can also be written as: SELECT SUM (e.SAL) FROM EMP as e WHERE e.EID NOT IN (SELECT s.EID FROM SALE AS s) Furthermore, a solution to the problem of sampling for subset queries would allow sampling-based aggregates over SQL DISTINCT queries, which can easily be re-written as subset queries. For example: SELECT SUM (DISTINCT e.SAL)
FROM EMP AS e
is equivalent to:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS (SELECT * FROM EMP AS e2
    WHERE id(e) < id(e2) AND e.SAL = e2.SAL)
In this query, id is a function that returns the row identifier for the record in question. Some work has considered the problem of sampling for counts of distinct attribute values [17, 49], but aggregates over DISTINCT queries remain an open problem. Similarly, it is possible to write, as a subset-based SQL query, an aggregate query in which records with identical values may appear more than once in the data but should be considered no more than once by the aggregate function. For example:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS (SELECT * FROM EMP AS e2
    WHERE id(e) < id(e2) AND identical(e, e2))
In this query, the function identical returns true if the two records contain identical values for all of their attributes. This would be very useful in computations where the same data object may be seen at many sites in a distributed environment (packets in an IP network, for example). Previous work has considered how to perform sampling in such a distributed system [12, 77], but not how to deal with the duplicate data problem.
Unfortunately, it turns out that handling subset queries using sampling is exceedingly difficult, due to the fact that the subquery in a subset query is not asking for a mean or a
sum – tasks for which sampling is particularly well-suited. Rather, the subquery is asking whether we will ever see a match for each tuple from the outer relation. By looking at an individual tuple, this is very hard to guess: either we have seen a match already in our sample (in which case we are assured that the inner relation has a match), or we have not, in which case we may have almost no way to guess whether we will ever see a match. For example, imagine that employee Joe does not have a sale in a 10% sample of the SALE relation. How can we guess whether or not he has a sale in the remaining 90%? There is little relevant work in the statistical literature to suggest how to tackle subset queries, because such queries ask a simultaneous question linking two populations (database tables EMP and SALE in our example), which is an uncommon question in traditional applications of finite population sampling. Outside of the work on sampling for the number of distinct values [17, 49, 50] and one method that requires an index on the inner relation [75], there is also little relevant work in the data management literature; we presume this is due to the difficulty of the problem; researchers have considered the difficulty of the more limited problem of sampling for distinct values in some detail [17].
Our Contributions
In this chapter, we consider the problem of developing sampling-based statistical estimators for such queries. In the remainder of this chapter, we assume without-replacement sampling, though our methods could easily be extended to other sampling plans. Given the difficulty of the problem, it is perhaps not surprising that significant statistical and mathematical machinery is required for a satisfactory solution. Our first contribution is to develop an unbiased estimator, which is the traditional first step when searching for a good statistical estimator. An unbiased estimator is one that is correct on expectation; that is, if an unbiased estimator is run an infinite number of times, then the average over all of the trials would be exactly the same as the correct answer to the query. The reason that an unbiased estimator is the natural first choice is
that if the estimator has low variance1 , then the fact that it is correct on average implies that it will always be very close to the correct answer. Unfortunately, it turns out that the unbiased estimator we develop often has high variance, which we prove analytically and demonstrate experimentally. Since it is easy to argue that our unbiased estimator is the only unbiased estimator for a certain subclass of subset-based queries (see the Related Work section of this chapter), it is perhaps doubtful that a better unbiased estimator exists. Thus, we also propose a novel, biased estimator that makes use of a statistical technique called “superpopulation modeling”. Superpopulation modeling is an example of a so-called Bayesian statistical technique [39]. Bayesian methods generally make use of mild and reasonable distributional assumptions about the data in order to greatly increase estimation accuracy, and have become very popular in statistics in the last few decades. Using this method in the context of answering subset-based queries presents a number of significant technical challenges whose solutions are detailed in this chapter, including: •
The definition of an appropriate generative statistical model for the problem of sampling for subset-based queries.
•
The derivation of a unique Expectation Maximization algorithm [26] to learn the model from the database samples.
•
The development of algorithms for efficiently generating many new random data sets from the model, without actually having to materialize them. Through an extensive set of experiments, we show that the resulting biased Bayesian
estimator has excellent accuracy on a wide variety of data. The biased estimator also has the desirable property that it provides something closely related to classical confidence bounds, that can be used to give the user an idea of the accuracy of the associated estimate.
¹ Variance is the statistical measure of the random variability of an estimator.
4.2 The Concurrent Estimator
With a little effort, it is not hard to imagine several possible sampling-based estimators for subset queries. In this section, we discuss one very simple (and sometimes unusable) sample-based estimator. This estimator has previously been studied in detail [75], but we present it here because it forms the basis for the unbiased estimator described in the next section. We begin our description with an even simpler estimation problem. Given a one-attribute relation R(A) consisting of nR records, imagine that our goal is to estimate the sum over attribute A of all the records in R. A simple, sample-based estimator would be as follows. We obtain a random sample R′ of size nR′ of all the records of R, compute total = Σ_{r∈R′} r.A, and then scale up total to output total × nR/nR′ as the estimate for the final sum. Not only is this estimator extremely simple to understand, but it is also unbiased, consistent, and its variance reduces monotonically with increasing sample size.
We can extend this simple idea to define an estimator for the NOT EXISTS query considered in the introduction. We start by obtaining random samples EMP′ and SALE′ of sizes nEMP′ and nSALE′, respectively, from the relations EMP and SALE. We then evaluate the NOT EXISTS query over the samples of the two relations. We compare every record in EMP′ with every record in SALE′, and if we do not find a matching record (that is, one for which f3 evaluates to true), then we add its f1 value to the estimated total. Lastly, we scale up the estimated total by a factor of nEMP/nEMP′ to obtain the final estimate, which we term M:
M = (nEMP/nEMP′) × Σ_{e∈EMP′} f1(e) × (1 − min(1, cnt(e, SALE′)))
In this expression, cnt(e, SALE′) = Σ_{s∈SALE′} I(f3(e, s)), where I is the standard indicator function, returning 1 if the boolean argument evaluates to true, and 0 otherwise. The algorithm can be slightly modified to accommodate growing samples of the relations, and has been described in detail in [75], where it is called the “concurrent estimator” since it samples both relations concurrently.
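A minimal Python sketch of the concurrent estimator (the table layouts and match predicate are hypothetical, and the sketch draws both samples in one batch rather than growing them incrementally as described above); it also illustrates the overestimation discussed next, since employees whose sales are absent from the SALE sample are misclassified as having no sale:

import random

def concurrent_estimate(emp, sale, emp_frac=0.1, sale_frac=0.1):
    """emp: list of (eid, sal); sale: list of eids that have a sale.
    Estimates SUM(e.SAL) over employees with no matching SALE record."""
    emp_s = random.sample(emp, int(len(emp) * emp_frac))
    sale_s = set(random.sample(sale, int(len(sale) * sale_frac)))
    total = sum(sal for eid, sal in emp_s if eid not in sale_s)
    return total * len(emp) / len(emp_s)        # scale up by n_EMP / n_EMP'

# Hypothetical data: 10,000 employees, about 60% of them appear in SALE.
emp = [(eid, random.uniform(30_000, 90_000)) for eid in range(10_000)]
sale = [eid for eid, _ in emp if random.random() < 0.6]
truth = sum(sal for eid, sal in emp if eid not in set(sale))
print(concurrent_estimate(emp, sale), "vs. exact", truth)   # typically overestimates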
Unfortunately, on expectation, the estimator is often severely biased, meaning that it is, on average, incorrect. The reason for this bias is fairly intuitive. The algorithm compares a record from EMP with all records from SALE′, and if it does not find a matching record in SALE′, it classifies the record as having no match in the entire SALE relation. Clearly, this classification may be incorrect for certain records in EMP, since although they might have no matching record in SALE′, it is possible that they may match with some record from the part of SALE that was not included in the sample. As a result, M typically overestimates the answer to the NOT EXISTS query. In fact, the bias of M is:
Bias(M) = Σ_{e∈EMP} f1(e) (1 − min(1, cnt(e, SALE)) − ϕ(nSALE, nSALE′, cnt(e, SALE)))
In this expression, ϕ denotes the hypergeometric probability² that a sample of size nSALE′ will contain none of the cnt(e, SALE) matching records of e. The solution that was employed previously to counteract this bias requires an index such as a B+-Tree on the entire SALE relation, in order to estimate and correct for Bias(M). Unfortunately, the requirement for an index severely limits the applicability of the method. If an index on the “join” attribute in the inner relation is not available, the method cannot be used. In a streaming environment where it is not feasible to store SALE in its entirety, an index is not practical. The requirement of an index also precludes use of the concurrent estimator for a non-equality predicate in the inner subquery or
² The hypergeometric probability distribution models the distribution of the number of red balls that will be obtained in a sample without replacement of n′ balls from an urn containing r red balls and n − r non-red balls.
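The probability in footnote 2 is simple to compute directly. A small sketch (ours, assuming Python's math.comb), which we reuse below:

from math import comb

def phi(n, n_prime, c):
    """Probability that a without-replacement sample of size n_prime, drawn from
    a population of size n, misses all c matching records:
    phi = C(n - c, n_prime) / C(n, n_prime)."""
    # math.comb returns 0 when n_prime > n - c, i.e. the matches cannot all be avoided
    return comb(n - c, n_prime) / comb(n, n_prime)

For instance, phi(1000, 100, 5) is roughly 0.59, so a record with five matches in SALE still has better-than-even odds of looking like a class 0 record in a 10% sample, which is why the bias of M can be severe.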
In the remainder of this chapter, we consider the development of sampling-based estimators for this problem that require nothing but samples from the relations themselves. Our first estimator makes use of a provably unbiased estimator B̂ias(M) for Bias(M). Taken together, M − B̂ias(M) is then an unbiased estimator for the final query answer. The second estimator we consider is quite different in character, making use of Bayesian statistical techniques.
4.3 Unbiased Estimator
4.3.1 High-Level Description
In order to develop an unbiased estimator for Bias(M), it is useful to first re-write the formula for Bias(M) in a slightly different fashion. We subsequently refer to the set of records in EMP that have i matches in SALE as "class i records". Let m be the maximum number of matching records in SALE for any record of EMP, and denote the sum of the aggregate function over all records of class i by t_i, so t_i = Σ_{e∈EMP} f_1(e) × I(cnt(e, SALE) = i) (note that the final answer to the NOT EXISTS query is the quantity t_0). Given that the probability that a record with i matches in SALE happens to have no matches in SALE′ is ϕ(n_SALE, n_SALE′, i), we can re-write the expression for the bias of M as:

Bias(M) = \sum_{i=1}^{m} \varphi(n_{SALE}, n_{SALE'}, i) \times t_i \qquad (4–1)
The above equation computes the bias of M because it captures the expected sum, over the aggregate attribute, of all records of EMP that are incorrectly classified as class 0 records by M. Equation 4–1 also suggests an unbiased estimator for Bias(M), because it turns out to be easy to generate an unbiased estimate for t_m: since no records other than those with m matches in SALE can have m matches in SALE′, we can simply count the sum
of the aggregate function f1 over all such records in our sample, and scale up the total accordingly. The scale-up must also account for the fact that we use SALE′ and not SALE to count matches. Once we have an estimate for t_m, it is possible to estimate t_{m−1}. How? Note that records with m − 1 matches in SALE′ must be members of either class m or class m − 1. Using our unbiased estimate for t_m, it is possible to guess the total aggregate sum for those records with m − 1 matches in SALE′ that in reality have m matches in SALE. By subtracting this from the sum for those records with m − 1 matches in SALE′ and scaling up accordingly, we can obtain an unbiased estimate for t_{m−1}. In a similar fashion, each unbiased estimate for t_i leads to an unbiased estimate for t_{i−1}. By using this recursive relationship, it is possible to guess, in an unbiased fashion, the value of each t_i in the expression for Bias(M). This leads to an unbiased estimator for the Bias(M) quantity, which can be subtracted from M to provide an unbiased guess for the query result.
4.3.2 The Unbiased Estimator In Depth
We now formalize the above ideas to develop an unbiased estimator for each t_k that can be used in conjunction with Equation 4–1 to obtain an unbiased estimator for Bias(M). We use the following additional notation for this section and the remainder of this chapter:
• ∆_{k,i} is a 0/1 (non-random) variable which evaluates to 1 if the ith tuple of EMP has k matches in SALE, and evaluates to 0 otherwise.
• s_k is the sum of f1 over all records of EMP′ having k matching records in SALE′: s_k = Σ_{i=1}^{n_EMP′} I(cnt(e_i, SALE′) = k) × f_1(e_i).
• α′ is n_EMP′/n_EMP, the sampling fraction of EMP.
• Y_i is a random variable which governs whether or not the ith record of EMP appears in EMP′.
• h(k; n_SALE, n_SALE′, i) is the hypergeometric probability that, out of the i interesting records in a population of size n_SALE, exactly k will appear in a random sample of size n_SALE′. For compactness of representation we will refer to this probability as h(k; i) in the remainder of the thesis, since our sampling fraction never changes.
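Both h(k; i) and the s_k values can be computed directly; the sketch below (our own helper names, same in-memory samples as before) shows one way to obtain them, and ϕ from the earlier sketch is just h(0; i).

from math import comb

def h(k, i, n_sale, n_sale_prime):
    """Hypergeometric probability h(k; i): exactly k of the i matching records
    fall in a without-replacement sample of size n_sale_prime drawn from a
    population of size n_sale. Assumes 0 <= k <= min(i, n_sale_prime)."""
    return comb(i, k) * comb(n_sale - i, n_sale_prime - k) / comb(n_sale, n_sale_prime)

def observed_s(emp_sample, sale_sample, f1, f3, m):
    """s_k for k = 0..m, read directly off the samples EMP' and SALE'.
    A record can never have more matches in SALE' than in SALE, so k <= m."""
    s = [0.0] * (m + 1)
    for e in emp_sample:
        k = sum(1 for rec in sale_sample if f3(e, rec))
        s[k] += f1(e)
    return s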
We begin by noting that if we consider only those records from EMP which appear in the sample EMP′, an unbiased estimator for t_k can be expressed as follows:

\hat{t}_k = \frac{1}{\alpha'} \sum_{i=1}^{n_{EMP}} Y_i \times \Delta_{k,i} \times f_1(e_i) \qquad (4–2)
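As a quick check of unbiasedness (a step left implicit in the text), note that under simple random sampling each inclusion indicator satisfies E[Y_i] = α′, so taking expectations in Equation 4–2 gives

E[\hat{t}_k] = \frac{1}{\alpha'} \sum_{i=1}^{n_{EMP}} E[Y_i] \, \Delta_{k,i} \, f_1(e_i) = \sum_{i=1}^{n_{EMP}} \Delta_{k,i} \, f_1(e_i) = t_k .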
Unfortunately, this estimator relies on being able to evaluate ∆_{k,i} for an arbitrary record, which is impossible without scanning the inner relation in its entirety. However, with a little cleverness, it is possible to remove this requirement. We have seen earlier that a record e can have k matches in the sample SALE′ provided it has i ≥ k matches in SALE. This implies that records from all classes i with i ≥ k can contribute to s_k. The contribution of a class i record towards the expected value of s_k is obtained by simply multiplying the probability that it will have k matches in SALE′ by its aggregate attribute value. Thus, a generic expression for the contribution of an arbitrary record e_j from EMP′ towards the expected value of s_k is Σ_{i=k}^{m} ∆_{i,j} × h(k; i) × f_1(e_j). Then the following random variable has an expected value equal to the expected value of s_k:

\hat{s}_k = \sum_{j=1}^{n_{EMP}} \sum_{i=k}^{m} Y_j \times \Delta_{i,j} \times h(k; i) \times f_1(e_j) \qquad (4–3)
The fact that E[ŝ_k] = E[s_k] (proven in Section 4.3.3) is significant, because there is a simple algebraic relationship between the various ŝ variables and the various t̂ variables. Thus, we can express one set in terms of the other, and then replace each ŝ_k with s_k in order to derive an unbiased estimator for each t̂. The benefit of doing this is that since s_k is defined as the sum of f1 over all records of EMP′ having k matching records in SALE′, it can be directly evaluated from the samples EMP′ and SALE′.
To derive the relationship between ŝ and t̂, we start with an expression for ŝ_{m−r} using Equation 4–3:

\hat{s}_{m-r} = \sum_{j=1}^{n_{EMP}} \sum_{i=m-r}^{m} Y_j \times \Delta_{i,j} \times h(m-r; i) \times f_1(e_j)
             = \sum_{i=m-r}^{m} h(m-r; i) \sum_{j=1}^{n_{EMP}} Y_j \times \Delta_{i,j} \times f_1(e_j)
             = \sum_{i=0}^{r} h(m-r; m-r+i) \sum_{j=1}^{n_{EMP}} Y_j \times \Delta_{m-r+i,j} \times f_1(e_j)
             = \sum_{i=0}^{r} h(m-r; m-r+i) \, \alpha' \, \hat{t}_{m-r+i} \qquad (4–4)
By re-arranging the terms we get the following important recursive relationship:

\hat{t}_{m-r} = \frac{\hat{s}_{m-r} - \alpha' \sum_{i=1}^{r} h(m-r; m-r+i) \times \hat{t}_{m-r+i}}{\alpha' \, h(m-r; m-r)} \qquad (4–5)
For the base case we obtain:

\hat{t}_m = \frac{\hat{s}_m}{\alpha' \, h(m; m)} = a_m \times \hat{s}_m \qquad (4–6)
where a_m = 1/(α′ h(m; m)). By replacing ŝ_{m−r} in the above equations with s_{m−r}, which is readily observable from the data and has the same expected value, we obtain a simple recursive algorithm for computing an unbiased estimator for any t_i. Before presenting the recursive algorithm, we note that we can re-write Equation 4–5 for t̂_i by replacing ŝ with s, changing the summation variable from i to k, and substituting i for m − r:

\hat{t}_i = \frac{s_i - \alpha' \sum_{k=1}^{m-i} h(i; i+k) \, \hat{t}_{i+k}}{\alpha' \, h(i; i)}
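Before the pseudo-code below, it may help to see the whole estimator assembled in executable form. The following Python sketch is ours (hypothetical names; it assumes the sample size n_SALE′ is at least m so every h(i; i) is defined). It combines Equations 4–6, 4–5, and 4–1 and returns the bias-corrected estimate M − B̂ias(M), given the concurrent estimate M and the observed values s_0 through s_m.

from functools import lru_cache
from math import comb

def unbiased_estimate(M, s, m, alpha_prime, n_sale, n_sale_prime):
    """M: concurrent estimate; s[k]: observed s_k for k = 0..m."""

    def h(k, i):  # hypergeometric probability h(k; i), as defined earlier
        return comb(i, k) * comb(n_sale - i, n_sale_prime - k) / comb(n_sale, n_sale_prime)

    @lru_cache(maxsize=None)
    def t_hat(i):
        if i == m:                                  # base case, Equation 4-6
            return s[m] / (alpha_prime * h(m, m))
        val = s[i]                                  # recursion, Equation 4-5
        for k in range(1, m - i + 1):               # (with s substituted for s-hat)
            val -= alpha_prime * h(i, i + k) * t_hat(i + k)
        return val / (alpha_prime * h(i, i))

    # Equation 4-1, using the fact that phi(n_SALE, n_SALE', i) = h(0; i)
    bias_hat = sum(h(0, i) * t_hat(i) for i in range(1, m + 1))
    return M - bias_hat

Memoization ensures each t_hat(i) is computed only once; without it, the recursion would redo the same work an exponential number of times.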
The following pseudo-code then gives the algorithm³ for computing an unbiased estimator for any t_i.
------------------------------------------------------------
Function GetEstTi(int i)
    if (i == m)
        return s_m / (α′ × h(m; m))
    else
        returnval = s_i
        for(int k = 1; k