Query Size Estimation using Systematic Sampling

Banchong Harangsri, John Shepherd, Anne Ngu
School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, AUSTRALIA
Telephone: +61 2 9385 3980   Fax: +61 2 9385 1813   Email: {bjtong,jas,anne}@cse.unsw.edu.au

Abstract

In this paper, we propose a new approach to estimating the query result size of select and join operations. The technique, which we call "systematic sampling", is a novel variant of the sampling approach: it sorts the relation before sampling, and it maintains a summary relation to improve run-time performance. We compare the method to a number of existing approaches to query size estimation, and demonstrate, with extensive experimental results, that it performs better than existing approaches over a wide range of data sets.

1 Introduction

Query optimisers for database systems aim to determine the most efficient execution plan for a query. Choosing an efficient plan relies on cost estimates derived from the statistics maintained by the underlying database system. Work in [12] pointed out that inaccurate estimates derived from such statistics may cause the optimiser to choose a very poor plan: although the initial error might be negligible for the first subplan (such as the first join/selection), the errors in subsequent subplans can grow very rapidly (i.e., exponentially). Good estimates of the cost of database operations are thus critical to the effective operation of query optimisers, and ultimately of the database systems that rely on them. This paper proposes a novel method to improve such cost estimation.

There has been a considerable amount of work on selectivity estimation over the past decade and a half [22, 6, 7, 19, 13, 11, 17, 18, 16, 8, 23, 5]. This work can be classified into four categories [23, 5], namely parametric, histogram, curve-fitting and sampling. We briefly describe each of them below; the reader can find more details in the references given above.

Parametric. The parametric methods [22, 6, 7] depend upon underlying assumptions about the data distribution, such as the uniform, normal, Poisson or Zipf distributions. These methods approximate query result sizes effectively if the actual data distribution follows the a priori assumption. In reality, however, data distributions in real database systems may not fit the assumed distribution well and, consequently, the quality of the size estimates can be unpredictable.
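To make this concrete, here is a minimal sketch (our illustration, not taken from the methods in [22, 6, 7]) of a parametric estimate under the simplest of these assumptions, uniformity:

```python
# A minimal sketch of a parametric estimate: under a uniformity
# assumption, the size of a range selection lo <= a <= hi is |R| times
# the fraction of a's domain that the range covers.

def uniform_range_estimate(num_tuples, dom_min, dom_max, lo, hi):
    overlap = max(0.0, min(hi, dom_max) - max(lo, dom_min))
    return num_tuples * overlap / (dom_max - dom_min)

# 100,000 tuples, attribute domain [0, 1000], predicate 100 <= a <= 200:
# the estimate is 10,000 tuples, accurate only if a really is uniform.
print(uniform_range_estimate(100_000, 0, 1000, 100, 200))
```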

Histogram. A histogram [19, 18] is built by dividing an attribute's domain into buckets and counting the number of tuples that fall into each bucket's range. The histogram methods and our sampling method both require storage space for summary information used to estimate query result sizes. The histogram methods assume a uniform distribution within each bucket (each attribute value in the same bucket has the same frequency, i.e., number of occurrences) and estimate query result sizes on this basis. Our sampling method, by contrast, uses a summary relation that better represents the source relation: the frequency distribution of each attribute of the summary relation follows the actual frequency distribution in the source relation much more closely.

Curve-Fitting. The methods [23, 5] in this class use polynomial regression to find the set of coefficients that minimises the least-squared error. Recently [9], we proposed the use of a learning scheme called M5 [20], which combines model-based and instance-based learning [14, 1, 2]. Using feedback from user queries (instances), a regression tree [4] is created whose leaf nodes consist of linear regression functions. When the size of a new query must be estimated, the stored user queries most similar to it are retrieved, and the result size of the new query is calculated from (1) linear regression functions of the regression tree and (2) the actual result sizes of the most similar queries. In [9], we compared the performance of M5 with a curve-fitting method called ASE (Adaptive Selectivity Estimation) [5]; M5 significantly outperformed ASE. These curve-fitting methods deal very well with queries with simple selections (i.e., whose selection predicates involve a single attribute), where their performance is even better than that of the sampling methods; both our experiments and the experiments reported in [23] confirm this observation.
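As a minimal illustration of the curve-fitting idea only (our sketch, not ASE [5] or M5 [9] themselves; the feedback values below are invented):

```python
import numpy as np

# Invented query feedback: predicate constants c (from queries such as
# "a <= c") and the observed result sizes of those queries.
constants = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 85.0])
sizes = np.array([120.0, 410.0, 980.0, 1750.0, 2900.0, 4300.0])

# Fit a degree-2 polynomial that minimises the least-squared error,
# then evaluate it for a new query with constant 60.
coeffs = np.polyfit(constants, sizes, deg=2)
print(np.polyval(coeffs, 60.0))
```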

However, these methods do not handle well queries with complex selections, whose selection predicates specify multiple attributes; on such queries they actually performed worse than the sampling methods. Again, both our experiments and the experiments reported in [23] confirmed this observation.

Sampling. The basic idea of the sampling methods is as follows: given a query¹, to estimate its result size, tuples are sampled from the source relation, checked against the query to see whether they satisfy it, and counted in a running total if they do. From this total, the result size of the query can be estimated. The main advantages of the sampling methods are that they:

• can handle correlation among attributes (since whole tuples are considered and checked against the given query);
• require no storage for statistical information;
• are very simple to implement.

The method we propose in this paper is called systematic sampling [21]. There are two separate algorithms to implement it: one for selections and one for joins (most previous work on query size estimation has concentrated on these two operations). Ling and Sun [15] compared three sampling methods, adaptive, double and sequential sampling, and pointed out that all three are in fact instances of sequential random sampling², differing only in their stopping conditions. Hereafter, we use the term SEQSMP to refer to these three sequential sampling methods and SYSSMP to refer to our systematic sampling method. The main advantages of SYSSMP are:

• With regard to selections, SYSSMP provides considerably better query size estimates than SEQSMP. In addition, since we create a summary relation separate from the source relation, the CPU time SYSSMP spends scanning the summary relation through an index and calculating query result sizes is significantly less than the CPU time SEQSMP spends sampling the source relation and performing the same calculation.

• With regard to joins, we propose to sample fewer tuples of the target relation: a sample tuple of the source relation is joined with only a fixed portion of the target relation, whereas traditional SEQSMP joins the sample tuple with the full target relation. In the experiments we report, with 10% sampling on the source relation and between 60% and 100% sampling on the target relation, the deviation between the actual result sizes and their estimates is, in most cases, less than 2%.

¹ We use the term "query" in this paper to mean a query with either a simple/complex selection on a relation or a join predicate on two relations.
² That is, every tuple of the source relation has an equal chance of being chosen.
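To fix ideas before outlining the paper's structure, here is a minimal sketch of the basic random-sampling estimator described at the start of this passage (our illustration with a fixed sample size n; the actual SEQSMP variants differ in how their stopping conditions choose n):

```python
import random

def seqsmp_estimate(relation, predicate, n):
    """Estimate the result size of a selection from n random tuples."""
    sample = random.sample(relation, n)            # each tuple equally likely
    hits = sum(1 for t in sample if predicate(t))  # tuples satisfying the query
    return hits / n * len(relation)                # scale the hit rate up to |R|

# Toy relation of 100,000 single-attribute tuples; the true size of the
# selection a < 100 is about 10,000.
R = [random.randint(0, 999) for _ in range(100_000)]
print(seqsmp_estimate(R, lambda a: a < 100, n=1000))
```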

This paper is structured as follows. Basic Idea of Systematic Sampling (Section 2) presents the basic idea of systematic sampling for selections and joins. Rationale behind SYSSMP (Section 3) gives the rationale behind systematic sampling, explaining why sorting is so important to it. Result Size Estimation for Joins (Section 4) gives a specific description of systematic sampling for join result size estimation. Definitions and Method Descriptions (Section 5) defines terms used in the experiment sections and describes the six selection-estimation methods compared in Section 6. Experimental Results for Selections (Section 6) and Experimental Results for Joins (Section 7) give the results for selections and joins respectively. The Conclusion (Section 3 of this paper) summarises what we have achieved. Due to space limits, we describe only the core concept of systematic sampling, in Section 2; Sections 3–7 are not given here, and we refer the reader to the complete version of this paper [10].

2 Basic Idea of Systematic Sampling

In this section, we describe the basic idea of systematic sampling, which is essentially the same for selections and joins. The main difference is that systematic sampling for selections builds a summary relation while systematic sampling for joins does not (see Section 4 of [10] for query result size estimation for joins). The idea explained next, although oriented towards selections, is applicable to joins too.

Previous work has proposed run-time sampling to estimate query result sizes: whenever a query result size estimate is required, a sampling pass over the source relation must be performed. With our method, we instead create a summary relation, separate from the source relation; when a query size estimate is required, the summary relation is scanned (optionally via some form of index). The rationale behind this design is described at the end of this section, after we have explained how the summary relation is created.

We begin by defining the notion of "biased attributes": the attributes which occur most frequently in the selection predicates of user queries. In most database applications, although a relation has several attributes, only a small number of them are routinely used to form selection predicates. Note that it is possible for all of the attributes in a relation to be biased. Before giving the algorithm to create the summary relation, let us informally explain how the method works:

input: R = source relation, B = list of biased attributes
output: R̄ = summary relation created from R.

• Randomly choose a permutation of the attributes in B. For example, suppose attributes a2, a3 and a5 are in B; one possible permutation is a3 → a5 → a2, where → denotes that a3 must come before a5 and a5 before a2. Let Li store the i-th attribute in the permutation: L1 stores the first attribute, L2 the second, and so on. In the example above, L1 = a3, L2 = a5 and L3 = a2.

• Split relation R into |B| partitions of equal size |R|/|B|, as shown in Figure 1(a) (P1, P2 and P3). For each partition Pi, sort it on attribute Li


and randomly choose a tuple near the top of the sorted partition by

    t = rand(NumTupStep / 2)

where NumTupStep is a fixed integral value calculated as |R| / n, and n (= |R̄|) is the number of tuples to be sampled from R. The value of n is determined by the stopping condition of the sampling method; there has been a fair amount of work in the sampling literature on stopping conditions, and several are discussed in [15]. The one we adopt for our work fixes n as

    n = \left( \frac{t_{1-\alpha} \sqrt{V}}{e \, \bar{Y}} \right)^2        (1)

where 1 − α is the confidence level, t_{1−α} the abscissa of the standard normal curve, e the required relative error, Ȳ the total mean of the population, and V the total variance of the population (see [15] for more details). Then store the t-th tuple into the summary relation R̄ and step forward to the next tuple by

    t = t + NumTupStep

repeating until t goes beyond the last tuple of the sorted partition Pi, i.e., beyond |R| / |B|. Figure 1(b) shows how relation R is stepped through to create the summary relation R̄ of Figure 1(c). The algorithm for these steps is given in Figure 2.
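As an aside, a worked instance of equation (1) may help; all the population figures below are invented, and the abscissa 1.96 corresponds to a 95% confidence level:

```python
from math import sqrt

# Worked instance of equation (1); every population figure is assumed.
t_1_alpha = 1.96     # abscissa of the standard normal curve for 1 - alpha = 0.95
e = 0.10             # required relative error
Y_bar = 500.0        # total mean of the population (assumed)
V = 250_000.0        # total variance of the population (assumed)

n = (t_1_alpha * sqrt(V) / (e * Y_bar)) ** 2
print(round(n))      # about 384 tuples to sample from R
```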

[Figure 1: Partitioning, Sort Order and Summary Relation. (a) Relation R split into three partitions P1, P2, P3 under the sort order a3 → a5 → a2; (b) a sorted partition Pi of R, stepped through at NumTupStep intervals; (c) the summary relation R̄, assembled from tuples drawn from P1, P2 and P3.]

input:  R = source relation, B = list of biased attributes
output: R̄ = summary relation created from R

L = randomly chosen permutation of the attributes in B
R̄ = {}
for i = 1 to |B| do
    get partition Pi from R
    sort Pi on attribute Li
    t = rand(NumTupStep / 2)
    while t ≤ |R|/|B| do
        R̄ = R̄ ∪ { t-th tuple of sorted Pi }
        t = t + NumTupStep
    endwhile
endfor

Figure 2: Algorithm Gen_Summary_Relation
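For concreteness, here is our transcription of the Figure 2 algorithm into Python (a sketch only, assuming tuples are represented as dicts keyed by attribute name and that the caller supplies NumTupStep ≥ 2):

```python
import random

def gen_summary_relation(R, B, num_tup_step):
    """Python sketch of Gen_Summary_Relation (Figure 2).

    R is a list of tuples (dicts keyed by attribute name), B is the list
    of biased attributes, and num_tup_step = |R| / n for a sample size n
    chosen via equation (1).
    """
    L = random.sample(B, len(B))           # random permutation of B
    part_size = len(R) // len(B)           # each partition holds |R|/|B| tuples
    summary = []
    for i, attr in enumerate(L):
        # Partition P_i of R, sorted on attribute L_i.
        Pi = sorted(R[i * part_size:(i + 1) * part_size], key=lambda t: t[attr])
        # Random start near the top, then fixed NumTupStep strides.
        t = random.randint(1, num_tup_step // 2)
        while t <= len(Pi):
            summary.append(Pi[t - 1])      # t-th tuple (1-based) of sorted P_i
            t += num_tup_step
    return summary
```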

Having explained how to create the summary relation, we now describe the rationale behind it:

• Previous sampling work proposed attaching an index such as a B-tree to the source relation to allow fast access to the required tuples. With our systematic sampling, it could be very complicated, or very costly, to attach an index that supports stepping through the source relation in the way described above, since the source relation must be sorted on several biased attributes, not a single attribute.

• It is significantly faster to access a separately created summary relation and estimate query result sizes from it (see the experiments in the complete version of this paper [10]). In the worst case, accessing n tuples of the source relation costs O(n) disk pages, since each of the n tuples may reside on a separate disk page; in the summary relation, all n records are stored near one another, so if k is the number of tuples per disk page, the worst case for accessing n tuples of the summary relation is O(⌈n/k⌉) disk pages. Moreover, we can also attach an index (in our implementation, a multi-attribute kd-tree index [3]) to the summary relation to speed up retrieval.

Given a query, to estimate its result size we scan the summary relation to find all the tuples that satisfy the query. Let S̄ be the total number of such tuples. The result size estimate Ŝ of the query is then

    \hat{S} = \frac{|R|}{|\bar{R}|} \, \bar{S}
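Putting the pieces together, the estimation step itself is a short scan; this is our sketch of the formula above, with the kd-tree index omitted:

```python
def syssmp_estimate(summary, source_size, predicate):
    """Estimate a selection's result size from the summary relation.

    Counts the summary tuples satisfying the query and scales by
    |R| / |R-bar|, as in the formula above. This sketch scans linearly;
    the paper attaches a kd-tree index [3] to speed up this step.
    """
    s_bar = sum(1 for t in summary if predicate(t))
    return source_size / len(summary) * s_bar

# e.g., combined with the sketch above (names ours):
# R_bar = gen_summary_relation(R, B=["a3", "a5", "a2"], num_tup_step=100)
# print(syssmp_estimate(R_bar, len(R), lambda t: t["a3"] < 50))
```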

3 Conclusion

In this paper, we have proposed a promising new

approach to the estimation of query result size for select and join operations. The technique, called "systematic sampling", estimates the size of a query result using information stored in a summary relation, which is an accurate reflection of the original relation. This approach has the benefits of both speed (we can efficiently access the information in the summary relation to get the estimate) and accuracy (the summary relation provides a good model of the data in the original relation). Systematic sampling appears to be an effective solution to the problem of query size estimation, and we plan to test its effectiveness in existing query optimisers in the near future.

References

[1] D. W. Aha. A Study of Instance-Based Algorithms for Supervised Learning Tasks: Mathematical, Empirical, and Psychological Evaluations. PhD thesis, Department of Information and Computer Science, University of California, Irvine, CA 92717, Nov 27 1990.
[2] D. W. Aha, D. Kibler, and M. K. Albert. Instance-Based Learning Algorithms. Machine Learning, 6(1):37–66, 1991.
[3] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Comm. ACM, 18(9):507–517, 1975.
[4] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, Inc., 1984.
[5] C. M. Chen and N. Roussopoulos. Adaptive Selectivity Estimation using Query Feedback. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, 1994.
[6] S. Christodoulakis. Estimating Block Transfers and Join Sizes. In Proceedings of the ACM SIGMOD Conference, pages 40–54, 1983.
[7] S. Christodoulakis. Estimating Record Selectivities. Information Systems, 8(2):105–115, 1983.
[8] P. Haas and A. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD Conference on the Management of Data, pages 341–350, 1992.
[9] B. Harangsri, J. Shepherd, and A. Ngu. Query Size Estimation using Machine Learning. In 1996 International Computer Symposium (ICS '96), National Sun Yat-Sen University, Kaohsiung, Taiwan, R.O.C., December 19–21, 1996. To be published.
[10] B. Harangsri, J. Shepherd, and A. Ngu. Query Size Estimation using Systematic Sampling. Technical report, School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, Australia, 1996.
[11] W. Hou, G. Ozsoyoglu, and B. K. Taneja. Statistical Estimators for Relational Algebra Expressions. In Proceedings of the ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 276–287, 1988.
[12] Y. E. Ioannidis and S. Christodoulakis. On the Propagation of Errors in the Size of Join Results. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, pages 268–277, 1991.
[13] N. Kamel and R. King. A Method of Data Distribution Based on Texture Analysis. In Proceedings of the ACM SIGMOD Intl. Conf. on Management of Data, pages 319–325, 1985.
[14] D. Kibler, D. W. Aha, and M. K. Albert. Instance-Based Prediction of Real-Valued Attributes. Computational Intelligence, 5:51–57, 1989.
[15] Y. Ling and W. Sun. An Evaluation of Sampling-Based Size Estimation Methods for Selections in Database Systems. In The International Conference on Data Engineering, pages 532–539, 1995.
[16] R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical Selectivity Estimation through Adaptive Sampling. In Proceedings of ACM SIGMOD, pages 1–12, 1990.
[17] M. Mannino, P. Chu, and T. Sager. Statistical Profile Estimation in Database Systems. ACM Computing Surveys, 20(3):191–221, September 1988.
[18] M. Muralikrishna and D. DeWitt. Equi-depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. In Proceedings of the ACM SIGMOD Conf. on Management of Data, pages 28–36, 1988.
[19] G. Piatetsky-Shapiro and C. Connell. Accurate Estimation of the Number of Tuples Satisfying a Condition. In Proceedings of the ACM SIGMOD Conference, pages 256–276, Boston, MA, June 1984. ACM, New York.
[20] J. R. Quinlan. Combining Instance-Based and Model-Based Learning. In Proceedings of Machine Learning. Morgan Kaufmann, 1993.
[21] R. L. Scheaffer, W. Mendenhall, and L. Ott. Elementary Survey Sampling. PWS-KENT Publishing Company, fourth edition, 1990.
[22] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access Path Selection in a Relational Database Management System. In ACM SIGMOD, pages 23–34, Boston, MA, June 1979.
[23] W. Sun, Y. Ling, N. Rishe, and Y. Deng. An Instant and Accurate Size Estimation Method for Joins and Selection in a Retrieval-Intensive Environment. In Proceedings of ACM SIGMOD, pages 79–88, 1993.