Selectivity Estimation for Joins Using Systematic Sampling

Banchong Harangsri, John Shepherd, Anne Ngu
School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, AUSTRALIA.
Email: {bjtong,jas,anne}@cse.unsw.edu.au

Abstract

We propose a new approach to the estimation of join selectivity. The technique, which we have called "systematic sampling", is a novel variant of the sampling-based approach. Systematic sampling works as follows: given a relation R of N tuples, with a join attribute that can be accessed in ascending/descending order via an index, if n is the number of tuples to be sampled from R, select a tuple at random from the first $k = \lceil N/n \rceil$ tuples of R and every kth tuple thereafter. We first develop a theoretical foundation for systematic sampling which suggests that the method gives a more representative sample than traditional simple random sampling. Subsequent experimental analysis on a range of synthetic relations confirms that the quality of the sample relations (participating in a join) yielded by systematic sampling is higher than that of the sample relations produced by traditional simple random sampling. To check that the sample relations produced by systematic sampling do lead to more accurate join selectivities, we compare systematic sampling with the most efficient simple random sampling method, called t_cross, using a variety of star joins and a variety of relation configurations. The results demonstrate that, with the same amount of sampling, systematic sampling can provide considerably more accurate join selectivities than t_cross sampling.

1. Introduction

Query optimisers for database systems aim to determine the most efficient query execution plan to be executed by the database system. Choosing an efficient plan relies on cost estimates derived from the statistics maintained by the underlying database system. Work by [8] pointed out that inaccurate estimates derived from such statistics may cause the optimiser to choose a very poor plan. Although the initial error might be negligible for the first subplan (such as the first join/selection), the errors in subsequent subplans can grow very rapidly (i.e., exponentially).

Good estimates for the cost of database operations are thus critical to the effective operation of query optimisers and ultimately of the database systems that rely on them. This paper proposes a novel sampling-based method to improve such cost estimation for the join operation.

Most previous work [7, 9, 6, 3, 4] on sampling-based methods has focused on simple random sampling (SRS), whereby each unit (tuple) in the population (relation) of interest has an equal chance of being selected in the sample. Simple random sampling can be performed under two distinct regimes. The first is with replacement; that is, a unit (tuple) which has already been selected from the population can subsequently be selected again. We call this scheme of sampling SRSWR. Most previous SRS work uses this scheme because it is simple to implement. The second scheme does not allow replacement; any unit (tuple) already selected cannot be selected again. This scheme, which we call SRSWOR, requires a more sophisticated data structure to do the sampling. The simple random sampling methods proposed in the literature [9, 6, 3] differ from one another primarily in their stopping conditions, i.e., when to stop sampling.

Systematic sampling was first proposed by [12] in the context of multidatabase systems; this work made no assumptions about the sortedness of the underlying relations. In this paper, we suggest that a new systematic sampling method (SYSSMP) that exploits the sortedness of data can be profitably used in the context of query size estimation for conventional query optimisation. The overall idea behind systematic sampling is: start with a relation R with cardinality N whose tuples can be accessed in ascending/descending order on the join attribute(s) of R; decide on the size n of the sample relation; to produce the sample relation, select a tuple at random from the first $k = \lceil N/n \rceil$ tuples of R and every kth tuple thereafter. The most efficient SRS method, named t_cross, was proposed by [4].
To compare the effectiveness of t_cross and SYSSMP, we applied them both to the task of estimating join selectivities using a variety of relation configurations and a variety of star joins (a join in which any pair of join-compatible attributes can participate). Star joins are used for comparison because (1) they are amenable to simulation and analysis, (2) they play a significant role in decision-support system applications, and (3) Haas and Swami [4] also used star joins in their experiments.

The ideal case for SYSSMP is where each join attribute participating in a star join has an index (such as a B+ tree [1]) on it, which ensures that each relation is sorted on that attribute. However, achieving this would incur a prohibitive cost in building and maintaining indices. To make our approach applicable to the more realistic case where only some relations are sorted on their join attributes, we propose a hybrid between SYSSMP and t_cross. The hybrid scheme works as follows: samples of the relations which are sorted on the join attributes are created via SYSSMP, and samples of the unsorted relations are created via t_cross. All the samples are then joined together to produce an estimated join selectivity for the star join.

We have conducted two sets of experiments. The first was performed for the ideal case, and the results show that, in this case, SYSSMP is far superior to t_cross in approximating join result sizes. In the second set of experiments, we consider the more realistic case (namely, some relations participating in the join are not sorted on their join attributes) and show that, despite a somewhat decreased performance in approximating join result sizes, the hybrid SYSSMP/t_cross sampling scheme still outperforms the pure t_cross sampling procedure.

The structure of the presentation is as follows. Section 2 gives an example of sampling and provides an informal argument as to why SYSSMP is likely to perform better than SRS with/without replacement.
Section 3 gives a theoretical foundation for SYSSMP, which shows that when relations can be accessed in ascending/descending order on the join attributes, SYSSMP can in general yield more accurate sample relations than SRSWOR and SRSWR. Section 4 summarises the results of the experiments for the realistic case. Complete details of the experiments and their results for both the ideal and realistic cases can be found in [5]. Finally, we summarise the work and give concluding remarks in Section 5.
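The hybrid scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the experiments: the t_cross algorithm of [4] is not specified in this paper, so plain simple random sampling without replacement stands in for it, and all function names (`systematic_sample`, `random_sample`, `hybrid_estimate`, and so on) are hypothetical. The join selectivity of the samples is computed as the sum, over the common distinct values, of the product of the per-relation selectivities.

```python
import math
import random
from collections import Counter
from fractions import Fraction


def systematic_sample(sorted_values, n, rng):
    """Random start within the first k = ceil(N/n) values, then every k-th value."""
    k = math.ceil(len(sorted_values) / n)
    start = rng.randrange(k)
    return sorted_values[start::k]


def random_sample(values, n, rng):
    """ASSUMPTION: stand-in for t_cross (plain SRS without replacement)."""
    return rng.sample(values, n)


def selectivities(values):
    """Map each distinct value to the fraction of tuples that carry it."""
    counts = Counter(values)
    return {v: Fraction(c, len(values)) for v, c in counts.items()}


def star_join_selectivity(columns):
    """Sum over common distinct values of the product of per-relation selectivities."""
    per_relation = [selectivities(col) for col in columns]
    common = set.intersection(*(set(s) for s in per_relation))
    total = Fraction(0)
    for v in common:
        product = Fraction(1)
        for s in per_relation:
            product *= s[v]
        total += product
    return total


def hybrid_estimate(relations, fraction, rng):
    """relations: list of (join_column_values, has_index) pairs."""
    samples = []
    for values, has_index in relations:
        n = max(1, round(fraction * len(values)))
        if has_index:
            # An index gives ordered access, modelled here by sorting.
            samples.append(systematic_sample(sorted(values), n, rng))
        else:
            samples.append(random_sample(values, n, rng))
    return star_join_selectivity(samples)
```

A call such as `hybrid_estimate([(r1, True), (r2, False)], 0.1, random.Random(0))` would sample 10% of each join column, using systematic sampling only for the indexed relation.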

2. Rationale behind why systematic sampling works

The purpose of a sample relation is to provide a basis to estimate selectivity for use in estimating the cost of relational operations such as select and join. In order to do this effectively, the sample relation must capture well the underlying distribution of the data in the original relation. Consider Figure 1, which shows the list of join-attribute values for unsorted (Figure 1(a)) and sorted (Figure 1(b)) versions of one particular relation.

If we use SRS with replacement to generate a sample relation from either of these lists, we may choose the same tuple from the relation multiple times. In some sense, this “wastes” some of our sampling effort, and, in the worst case, we might choose the same tuple every time, giving a totally inaccurate impression of the underlying distribution. If we use SRS without replacement, we get “better value” for our sampling efforts, because we never reconsider the same tuple. The same is true of a systematic scan through either list. A little thought suggests that the case of scanning through the unsorted list is very similar to SRS without replacement; and, in fact, it has been proved to be equivalent (see [2]). The question remains as to whether a systematic scan of the sorted list is better than these methods. It seems likely that it is, since such a scan ensures that we take samples from each “region” of the underlying data distribution. With the other methods, it is possible that our sampling may miss some regions of the underlying distribution altogether. The rest of this paper aims to determine whether this intuition is correct both by formal and empirical means.
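The intuition about covering every "region" of the distribution can be tried out with a small simulation. The sketch below is illustrative only: the function names are hypothetical, and the data is the sorted 25-value relation used as the paper's running example (nine 1s, five 2s, four 3s, four 4s, three 5s). It draws samples of size 5 under each regime and reports the smallest number of distinct values any sample covered.

```python
import math
import random


def srswr(values, n, rng):
    """Simple random sampling with replacement: a tuple may be drawn repeatedly."""
    return [rng.choice(values) for _ in range(n)]


def srswor(values, n, rng):
    """Simple random sampling without replacement: each position is drawn at most once."""
    positions = rng.sample(range(len(values)), n)
    return [values[p] for p in positions]


def systematic(values, n, rng):
    """Random start among the first k = ceil(N/n) positions, then every k-th position."""
    k = math.ceil(len(values) / n)
    start = rng.randrange(k)
    return values[start::k]


def worst_coverage(sampler, values, n, trials, rng):
    """Smallest number of distinct values seen in any of the sampled relations."""
    return min(len(set(sampler(values, n, rng))) for _ in range(trials))


sorted_values = [1] * 9 + [2] * 5 + [3] * 4 + [4] * 4 + [5] * 3

if __name__ == "__main__":
    rng = random.Random(42)  # seed chosen arbitrarily for repeatability
    for name, sampler in [("SRSWR", srswr), ("SRSWOR", srswor), ("systematic", systematic)]:
        print(name, worst_coverage(sampler, sorted_values, 5, 1000, rng))
```

On this sorted list, each of the five possible systematic samples of size 5 contains at least four of the five distinct values, whereas the random schemes carry no such guarantee.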

3. Theoretical foundation for systematic sampling

The selectivity of a distinct value is the ratio of the number of tuples in a relation having that distinct value to the cardinality of the relation. SYSSMP is more efficient than SRS with/without replacement if and only if:

1. The variance of the estimated selectivity of a distinct value in the common join domain is lower. This variance indicates the quality of the estimated selectivity for one distinct value in the join-attribute domain of the sample relation. Lower variance indicates more accurate estimates.

2. The total variance of estimated selectivities for all distinct values in the common join domain is lower. This total variance indicates the overall quality of the sample relation. This affects the accuracy of the estimated join selectivity, which is calculated using estimated selectivities from all of the sample relations participating in the join.

The proof in Section 3.1 shows when the variance for a single distinct value will be lower for SYSSMP than for SRS (point 1). Section 3.2 completes the proof by showing when the total variance over the entire join domain will be lower for SYSSMP than for SRS (point 2).

[Figure 1: panel (a) Unsorted values of a join attribute; panel (b) Values after sorting, with the positions visited in steps of k marked.]

Figure 1. Unsorted and sorted values and stepping through the sorted values

3.1. Variance of estimated selectivity of a distinct value

A star join is a join whose qualification is of the form:

$$R_1.a_1 = R_2.a_2 = R_3.a_3 = \cdots = R_m.a_m$$

where $R_1, R_2, \ldots, R_m$ are the relations participating in the join, $a_1, a_2, \ldots, a_m$ are the join attributes of $R_1, R_2, \ldots, R_m$, respectively, and $m$ is the number of relations participating in the star join. Let $d$ be the number of distinct values of a common join domain of the attributes $a_1, a_2, \ldots, a_m$. The join selectivity of a star join with the $m$ participating relations is defined as follows:

$$\text{join selectivity} = \sum_{i=1}^{d} \prod_{j=1}^{m} \left(\text{selectivity of the } i\text{th distinct value on } R_j\right)$$

From the point of view of a single relation, say $R_j$, the selectivities of its distinct values $i = 1, 2, \ldots, d$ contribute directly to the calculation of the join selectivity of the star join.

Let us consider a relation $R$ which participates in the star join. Let $R'$ be a sample relation from $R$. Let $Y_i$ be the selectivity of the $i$th distinct value ($i = 1, 2, \ldots, d$) from relation $R$ and let $\hat{Y}_i$ be the selectivity of the same distinct value from sample relation $R'$. By definition, the variance $V(\hat{Y}_i)$ is

$$V(\hat{Y}_i) = E(\hat{Y}_i - Y_i)^2 = \sum (\hat{Y}_i - Y_i)^2 \Pr(\hat{Y}_i)$$

where $\Pr(\hat{Y}_i)$ is the probability that each sample can be selected. If the sample relation $R'$ created by SYSSMP is to be more efficient than one created by SRS, then the sum of the variances of the $\hat{Y}_i$, which we call the total variance of estimated selectivities for all distinct values, must be lower under SYSSMP. This total is:

$$\sum_{i=1}^{d} V(\hat{Y}_i) = V(\hat{Y}_1) + V(\hat{Y}_2) + \cdots + V(\hat{Y}_d)$$

where each $V(\hat{Y}_i)$ is the variance of the selectivity of the $i$th distinct value on a sample relation, or the variance of an estimated selectivity of the $i$th distinct value. Towards the end of Section 3.2, we will show how to obtain such an efficient sample relation $R'$.

Given relation $R$ with $N$ tuples, to select $n$ ($\le N$) tuples of the relation, the method of systematic sampling is to choose a tuple at random from the first $k = \lceil N/n \rceil$ tuples of $R$ and every $k$th tuple thereafter. Suppose that relation $R$ given in Figure 2 is sorted (see footnote 1) on its join attribute. The cardinality of $R$ in the figure is $N = 25$. Let $n = 5$, so $k = N/n = 5$. Table 1(a) shows all possible systematic sample relations $R'$ which can be taken from $R$. Consider Table 1(b). Denote by $t_{ij}$ tuple $j$ of systematic sample relation $i$ and by $f(t_{ij}, x)$ a function taking the value 0 or 1, i.e.:

$$y_{ij} = f(t_{ij}, x) = \begin{cases} 1 & \text{if tuple } j \text{ of sample } i \text{ contains value } x \\ 0 & \text{if it doesn't} \end{cases}$$

For example, for the selectivity of the distinct value equal to $x$, if tuple $j$ of sample relation $i$ has $x$ as its value, then $y_{ij}$ is 1, otherwise 0. Writing $a_i = \sum_{j=1}^{n} y_{ij}$ for the number of tuples of sample relation $i$ that contain $x$, the selectivity calculated from sample relation $i$ is $\bar{y}_i = a_i/n$. Therefore, using systematic sample relation number 1 in Table 1(a), the selectivity when the distinct value is equal to 1 is $\bar{y}_1 = a_1/n = 2/5$ (there are two rows which contain the value 1 in sample relation number 1). Using the same distinct value (= 1), the selectivities calculated using systematic sample relations 2, 3, 4 and 5 are $\bar{y}_2 = a_2/n = 2/5$, $\bar{y}_3 = a_3/n = 2/5$, $\bar{y}_4 = a_4/n = 2/5$ and $\bar{y}_5 = a_5/n = 1/5$, respectively.
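The worked example can be reproduced mechanically. The following sketch (function names are hypothetical) builds the sorted relation R of Figure 2, enumerates the k = 5 possible systematic samples, and computes the selectivity of the distinct value 1 in each; exact fractions are used so the results can be compared directly with the values quoted in the text.

```python
from fractions import Fraction

# Relation R of Figure 2: 25 join-attribute values in ascending order.
R = [1] * 9 + [2] * 5 + [3] * 4 + [4] * 4 + [5] * 3
n = 5
k = len(R) // n  # N is a multiple of n here, so k = N/n = 5


def systematic_samples(values, k):
    """All k possible systematic samples: start at offset i, then every k-th value."""
    return [values[i::k] for i in range(k)]


def sample_selectivity(sample, x):
    """Fraction of sampled tuples whose value equals x (mean of the 0/1 indicators)."""
    return Fraction(sum(1 for v in sample if v == x), len(sample))


samples = systematic_samples(R, k)
selectivities_of_1 = [sample_selectivity(s, 1) for s in samples]

if __name__ == "__main__":
    for number, (sample, sel) in enumerate(zip(samples, selectivities_of_1), start=1):
        print(number, sample, sel)
```

The first sample is `[1, 1, 2, 3, 4]` and the five selectivities are 2/5, 2/5, 2/5, 2/5 and 1/5, matching the text.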

1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5

Figure 2. Relation R sorted on the join attribute

Table 1. Systematic sample relations and notations

(a)

    systematic sample relation no.
    I    II   III  IV   V
    1    1    1    1    1
    1    1    1    1    2
    2    2    2    2    3
    3    3    3    4    4
    4    4    5    5    5

(b)

    sample no.   tuple values                 \sum_{j=1}^{n} y_ij   Means
    1            y_11  y_12  ...  y_1n        a_1                   \bar{y}_1 = a_1/n
    2            y_21  y_22  ...  y_2n        a_2                   \bar{y}_2 = a_2/n
    ...
    i            y_i1  y_i2  ...  y_in        a_i                   \bar{y}_i = a_i/n
    ...
    k            y_k1  y_k2  ...  y_kn        a_k                   \bar{y}_k = a_k/n

Footnote 1: With regard to the implementation, the term "sort", which we use throughout the paper, means attaching an index such as a B+-tree to the join attribute of relation R; we can then systematically access R on the join attribute in ascending/descending order.

By definition, the mean $Y$ of a population of size $N$ can be defined as:

$$\text{population mean} = Y = \frac{y_1 + y_2 + \cdots + y_N}{N} = \frac{A}{N}$$

If each unit $y_i$ of the population can have two values, 0 or 1 (e.g., satisfy or not satisfy, contain or not contain), then the mean of the population is a proportion. In our case, $x$ is a distinct value in a common join attribute domain and we want to know how many of the tuples of relation $R$ satisfy $x$, namely, its selectivity. This problem is, in fact, the same as finding the "proportion" of relation $R$ which has $x$ as its value. Thus the proportion of $R$ with value $x$ is the selectivity $Y$ of value $x$ on $R$, namely $Y = A/N$, where $A$ is the total number of tuples of $R$ having $x$ as their value.

The main theorem, which is quoted from page 209 of reference [2], is applicable to the proportion (or mean) which we have just described. Since this theorem (and its two corollaries) concern only a given distinct value of relation $R$, i.e., the proportion of $R$ with the given distinct value, for simplicity of notation in the theorem and its two corollaries we will drop the subscript $i$ of $V(\hat{Y}_i)$, the variance of the selectivity of the $i$th distinct value on a sample relation.

Let $V(\hat{Y})$, $V(\hat{Y}_{swr})$ and $V(\hat{Y}_{swor})$ be such a variance obtained via SYSSMP, SRSWR and SRSWOR, respectively, where $\hat{Y}$, $\hat{Y}_{swr}$ and $\hat{Y}_{swor}$ are the selectivities of the $i$th distinct value calculated from each individual $R'$ yielded by SYSSMP, SRSWR and SRSWOR, respectively. Let the value of the $i$th distinct value equal $x$. Before proceeding to the theorem, let us define one more symbol, $S^2$. Let $S^2$ be the variance of the selectivity of the distinct value $x$ over the entire relation $R$. $S^2$ is thus defined as:

$$S^2 = \frac{\sum_{i=1}^{N} (y_i - Y)^2}{N - 1} = \frac{\sum_{i=1}^{N} y_i^2 - N Y^2}{N - 1} \tag{1}$$

where each $y_i$ corresponds to an element $y_{rj}$ in Table 1(b) ($r = 1, 2, \ldots, k$; $j = 1, 2, \ldots, n$). It is straightforward to see that $\sum_{i=1}^{N} y_i^2 = A$, since each $y_i$ in the table is either 0 or 1. Thus, the variance $S^2$ is:

$$S^2 = \frac{A - N Y^2}{N - 1} \tag{2}$$

Since $A = NY$, the substitution of this in (2) gives:

$$S^2 = \frac{NY - NY^2}{N - 1} = \frac{N}{N - 1}\, Y (1 - Y) \tag{3}$$
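As a quick check on the algebra leading to Equation (3), the sketch below computes $S^2$ for the relation of Figure 2 in two ways: directly from the squared deviations of the 0/1 indicators, as in Equation (1), and from the closed form $\frac{N}{N-1} Y(1-Y)$. The function names are hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Relation R of Figure 2, sorted on the join attribute.
R = [1] * 9 + [2] * 5 + [3] * 4 + [4] * 4 + [5] * 3


def population_variance(values, x):
    """S^2 as in Equation (1): squared deviations of the 0/1 indicators over N - 1."""
    N = len(values)
    y = [1 if v == x else 0 for v in values]
    Y = Fraction(sum(y), N)  # selectivity of x on the whole relation
    return sum((yi - Y) ** 2 for yi in y) / (N - 1)


def population_variance_closed_form(values, x):
    """S^2 as in Equation (3): N / (N - 1) * Y * (1 - Y)."""
    N = len(values)
    Y = Fraction(values.count(x), N)
    return Fraction(N, N - 1) * Y * (1 - Y)


if __name__ == "__main__":
    for x in sorted(set(R)):
        print(x, population_variance(R, x), population_variance_closed_form(R, x))
```

For the distinct value 1, where $Y = 9/25$, both routes give $S^2 = 6/25$.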

The theorem can be stated as follows:

Theorem 3.1 The variance of the selectivity $\hat{Y}$ for distinct value $x$ on a systematic sample relation is:

$$V(\hat{Y}) = \frac{N - 1}{N}\, S^2 - \frac{k(n - 1)}{N}\, S^2_{wsmp} \tag{4}$$

where

$$S^2_{wsmp} = \frac{1}{k(n - 1)} \sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{y}_i)^2 \tag{5}$$

is the variance among tuples that lie within the same systematic sample relation. Recall that $y_{ij}$, as shown in Table 1(b), is the 0/1 value $f(t_{ij}, x)$ of tuple $j$ of systematic sample relation $i$, and $\bar{y}_i = a_i/n$ is the selectivity of distinct value $x$ calculated from the $i$th systematic sample relation. The proof is not given here since it was already shown in [2].

The first important result from Theorem 3.1 is:

Corollary 3.1 The selectivity for distinct value $x$ obtained from a systematic sample relation is more accurate than the selectivity for the same distinct value from a simple random sample relation without replacement if and only if

$$S^2_{wsmp} > S^2 \tag{6}$$

The proof was also given in the same reference. The main result from this corollary is that SYSSMP is more efficient than SRSWOR if the variance within systematic sample relations is greater than the variance of the actual selectivity over the entire relation. To ensure this, the entire population (in our case, the entire relation) must be sorted in ascending or descending order [10, 11], which will then result in heterogeneous tuples within the same systematic sample relation.

Next is the second result, which we have derived based also on Theorem 3.1. The proofs of this result and the remaining results in Section 3.2 are given in [5].

Corollary 3.2 The selectivity for distinct value $x$ obtained from a systematic sample relation is more accurate than the selectivity for the same distinct value from a simple random sample relation with replacement if and only if

$$S^2_{wsmp} > \frac{N - 1}{N}\, S^2 \tag{7}$$
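Theorem 3.1 and the two corollaries can be checked numerically on the relation of Figure 2. The sketch below is an illustration under stated assumptions rather than part of the proof. It computes $V(\hat{Y})$ directly as the average squared error over the k equally likely systematic samples, compares it with the right-hand side of Equation (4), and tests conditions (6) and (7). The variances used for SRSWOR, $\frac{N-n}{N}\frac{S^2}{n}$, and for SRSWR, $\frac{N-1}{N}\frac{S^2}{n}$, are the standard textbook formulas for a sample mean; they are not stated in this paper. The function name `check` is hypothetical.

```python
from fractions import Fraction

R = [1] * 9 + [2] * 5 + [3] * 4 + [4] * 4 + [5] * 3
N, n = len(R), 5
k = N // n  # 5 possible systematic samples


def check(x):
    """Variance quantities for distinct value x on relation R, as exact fractions."""
    # 0/1 indicators y_ij for each of the k systematic samples.
    samples = [[1 if v == x else 0 for v in R[i::k]] for i in range(k)]
    means = [Fraction(sum(s), n) for s in samples]  # sample selectivities
    Y = Fraction(R.count(x), N)                     # true selectivity
    S2 = Fraction(N, N - 1) * Y * (1 - Y)           # Equation (3)
    # Direct variance: each systematic sample is chosen with probability 1/k.
    V_direct = sum((m - Y) ** 2 for m in means) / k
    # Within-sample variance, Equation (5).
    S2_w = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s) / (k * (n - 1))
    # Right-hand side of Equation (4).
    V_formula = Fraction(N - 1, N) * S2 - Fraction(k * (n - 1), N) * S2_w
    # ASSUMPTION: standard textbook variances of a sample mean under SRS.
    V_swor = Fraction(N - n, N) * S2 / n
    V_swr = Fraction(N - 1, N) * S2 / n
    return {"V_direct": V_direct, "V_formula": V_formula, "S2": S2,
            "S2_w": S2_w, "V_swor": V_swor, "V_swr": V_swr}


if __name__ == "__main__":
    for x in sorted(set(R)):
        r = check(x)
        print(x, r["V_direct"], r["V_swor"], r["V_swr"],
              r["S2_w"] > r["S2"], r["S2_w"] > Fraction(N - 1, N) * r["S2"])
```

For the distinct value 1, the direct computation and Equation (4) both give $V(\hat{Y}) = 4/625$, with $S^2_{wsmp} = 7/25 > S^2 = 6/25$; the corresponding SRSWOR and SRSWR variances are 24/625 and 144/3125.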

In summary, the first result in Corollary 3.1 indicates that, for a single distinct value, if $S^2_{wsmp} > S^2$, then $V(\hat{Y}) < V(\hat{Y}_{swor})$, and the second result in Corollary 3.2 indicates that if $S^2_{wsmp} > \frac{N-1}{N} S^2$, then $V(\hat{Y}) < V(\hat{Y}_{swr})$.

3.2. Total variances of estimated selectivities

Let $S_i^2$ be the variance of the actual selectivity of the $i$th distinct value over the entire relation $R$, where $i = 1, 2, \ldots, d$. Let $S^2_{wsmp,i}$ be the variance within systematic sample relations of the $i$th distinct value. In this section, there are two main results, which are also based on Theorem 3.1. In essence, they state that systematic sampling is more efficient than simple random sampling with/without replacement if and only if the total of the $S^2_{wsmp,i}$ is larger than the total of the $S_i^2$ over all $i = 1, 2, \ldots, d$ (scaled by $\frac{N-1}{N}$ in the with-replacement case). Let $T$ be the total of the $S_i^2$, i.e.,

$$T = S_1^2 + S_2^2 + \cdots + S_d^2 = \sum_{i=1}^{d} S_i^2$$

and let $T_{wsmp}$ be the total variance within systematic sample relations, defined as:

$$T_{wsmp} = S^2_{wsmp,1} + S^2_{wsmp,2} + \cdots + S^2_{wsmp,d} = \sum_{i=1}^{d} S^2_{wsmp,i}$$

Corollary 3.3 SYSSMP yields a more efficient sample relation than SRSWOR if and only if

$$T_{wsmp} > T \tag{8}$$

Again, to ensure that $T_{wsmp} > T$, relation $R$ must be sorted in ascending/descending order [10, 11].

Corollary 3.4 SYSSMP yields a more efficient sample relation than SRSWR if and only if

$$T_{wsmp} > \frac{N - 1}{N}\, T \tag{9}$$

The two results in Corollaries 3.3 and 3.4 indicate that the total variance of estimated selectivities for all distinct values is lower under SYSSMP than under SRSWOR and SRSWR, respectively, when conditions (8) and (9) hold.

4. Experimental results

Having established theoretically that SYSSMP can yield sample relations of better quality, we now demonstrate that such sample relations do lead to more accurate estimates of the result sizes of star joins. We have conducted 6 experiments in the realistic case to compare t_cross with the hybrid sampling scheme (combining SYSSMP and t_cross); that is, only some of the join attributes participating in a star join have their own indices and the rest do not. All 6 experiments use 5-relation star joins, and we take a 10% sample of each relation participating in the star joins. In each experiment, we ran the star join 30 times. For the $i$th run ($i = 1, 2, \ldots, 30$) of either t_cross or the hybrid sampling scheme, 5 sample relations are generated and joined together to yield an estimated result size of the star join, namely the estimated join selectivity of the $i$th run multiplied by $\tilde{N} = \prod_{j=1}^{m} |R_j|$, where $m$ is equal to 5. We then summarise the errors of these estimated result sizes relative to the actual result size of the star join. We used three kinds of error measures, namely root mean square, mean residual and mean relative errors, to gauge the average error between the estimated result sizes and the actual result size.

Table 2 shows the average errors for the 6 experiments described above. In each experiment, we increased the number of indices from left to right, i.e., 2, 4 and all (5) indices, respectively. A clear trend is that the average errors of the hybrid sampling scheme decrease from left to right. This can be attributed to the fact that, as the number of indices increases, more of the sample relations are of high quality. Hence, we can conclude that the sample relations yielded by SYSSMP do lead to more accurate estimates of query result sizes than those yielded by t_cross. The more high-quality sample relations we have, the more accurate the estimated join result sizes become.

Table 2. Error Trend with Increment of Indexed Attributes.

(a) Root Mean Square Error (e+07)

    exp.   t_cross   2 inds   4 inds   all inds
    SJ1      25.52    20.45    14.65      10.35
    SJ2       6.41     4.79     3.56       2.50
    SJ3      31.13    16.70     9.73       5.17
    SJ4     259.30   157.10    49.68      48.53
    SJ5      21.60    22.19    11.11       6.59
    SJ6     244.20   163.00    85.39      35.03

(b) Mean Residual Error (e+07)

    exp.   t_cross   2 inds   4 inds   all inds
    SJ1      21.13    16.36    12.05       8.08
    SJ2       5.34     4.26     2.96       1.80
    SJ3      15.29    11.67     6.44       4.29
    SJ4     136.80    94.73    49.59      48.25
    SJ5      12.52    11.33     8.34       6.52
    SJ6      99.04    69.97    56.01      35.02

(c) Mean Relative Error

    exp.   t_cross   2 inds   4 inds   all inds
    SJ1         63       49       36         24
    SJ2         72       57       40         24
    SJ3         93       71       39         26
    SJ4        227      157       82         80
    SJ5        161      146      108         84
    SJ6        267      188      151         94
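The three error measures named in this section are not defined in this paper (the details are deferred to [5]), so the sketch below uses the conventional definitions as an explicit assumption: root mean square error as the square root of the mean squared difference, mean residual error as the mean absolute difference, and mean relative error as the mean absolute difference divided by the actual result size. The function names are hypothetical, and whether [5] applies additional scaling (for example, reporting percentages) is not known from this text.

```python
import math


def root_mean_square_error(estimates, actual):
    """ASSUMED definition: sqrt of the mean squared difference from the actual size."""
    return math.sqrt(sum((e - actual) ** 2 for e in estimates) / len(estimates))


def mean_residual_error(estimates, actual):
    """ASSUMED definition: mean absolute difference from the actual size."""
    return sum(abs(e - actual) for e in estimates) / len(estimates)


def mean_relative_error(estimates, actual):
    """ASSUMED definition: mean absolute difference as a fraction of the actual size."""
    return sum(abs(e - actual) / actual for e in estimates) / len(estimates)


if __name__ == "__main__":
    # Made-up estimated result sizes from three runs, for illustration only.
    runs = [90.0, 110.0, 100.0]
    actual_size = 100.0
    print(root_mean_square_error(runs, actual_size))
    print(mean_residual_error(runs, actual_size))
    print(mean_relative_error(runs, actual_size))
```

Under these definitions the root mean square error is never smaller than the mean residual error, which agrees with the ordering of panels (a) and (b) of Table 2.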

5. Conclusion

The main achievement of this paper is to demonstrate that, compared with SRSWOR and SRSWR, SYSSMP using sorted data provides a lower total variance of estimated selectivities for all distinct values in a common join attribute domain. The quality of a sample relation yielded by SYSSMP is consequently higher than that of one yielded by SRSWOR or SRSWR. The more high-quality sample relations we have (those whose original relations have indices on the join attributes), the more accurate the estimated join selectivity becomes. This has been verified by experiments which consider both the ideal case and the more realistic case where only some of the relations are sorted on their join attributes. The last important result of this paper is that SYSSMP can provide more accurate query result size estimates than t_cross from the same amount of sampling.

References

[1] R. Bayer and E. M. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1(3), February 1972. Also published in: ACM SIGFIDET 1970, pp. 107–141.
[2] W. G. Cochran. Sampling Techniques. John Wiley & Sons, Inc., second edition, 1963.
[3] P. J. Haas and A. N. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD Conference on the Management of Data, pages 341–350, 1992.
[4] P. J. Haas and A. N. Swami. Sampling-Based Selectivity Estimation for Joins Using Augmented Frequency Value Statistics. In The International Conference on Data Engineering, pages 522–531, 1995.
[5] B. Harangsri, J. Shepherd, and A. Ngu. Selectivity Estimation for Joins using Systematic Sampling. Technical report, The University of New South Wales, School of Computer Science and Engineering, Sydney 2052, AUSTRALIA, 1997.
[6] W. Hou, G. Ozsoyoglu, and E. Dogdu. Error Constrained COUNT Query Evaluation in Relational Databases. In ACM SIGMOD Conference on the Management of Data, pages 278–287, 1991.
[7] W. Hou, G. Ozsoyoglu, and B. K. Taneja. Statistical Estimators for Relational Algebra Expressions. In Proceedings of the ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 276–287, 1988.
[8] Y. E. Ioannidis and S. Christodoulakis. On the Propagation of Errors in the Size of Join Results. In Proceedings of the ACM-SIGMOD Intl. Conf. on Management of Data, pages 268–277, 1991.
[9] R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical Selectivity Estimation through Adaptive Sampling. In Proceedings of ACM SIGMOD, pages 1–12, 1990.
[10] M. N. Murthy and T. J. Rao. Systematic Sampling with Illustrative Examples, volume 6, chapter 7, pages 147–185. Elsevier Science Publishers, 1988. Handbook of Statistics.
[11] R. L. Scheaffer, W. Mendenhall, and L. Ott. Elementary Survey Sampling. PWS-KENT Publishing Company, fourth edition, 1990.
[12] Q. Zhu. An Integrated Method for Estimating Selectivities in a Multidatabase System. In Proceedings of Distributed Computing (CASCON '93), volume 2, pages 832–847, Toronto, Ontario, Canada, 24–28 October 1993.
