Query Classification in Multidatabase Systems

Banchong Harangsri, John Shepherd, Anne Ngu
School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, AUSTRALIA.
Email: {bjtong,jas,anne}@cse.unsw.edu.au

In Proceedings of the 7th Australasian Database Conference, Melbourne, Australia, January 29-30, 1996.

Abstract

Query optimisation is a significant unsolved problem in the development of multidatabase systems. The main reason for this is that the query cost functions for the component database systems may not be known to the global query optimiser. In this paper, we describe a method, based on a classical clustering algorithm, for classifying queries which allows us to derive accurate approximations of these query cost functions. The experimental results show that the cost functions derived by the clustering algorithm yield a lower average error as compared to the error produced by a manual classification.

Keywords: Cost function derivation, Classification, Query optimisation, Multidatabase systems

1 Introduction

Query optimisation in multidatabase systems is fundamentally different from distributed query optimisation, for three major reasons [5]: site autonomy, system heterogeneity and semantic heterogeneity. Site autonomy means that the essential information for optimisation, namely cost functions and database statistics, may not be available to the global query optimiser to assist in choosing query execution plans. Clearly, before effective query optimisation is possible in such a system, some means must be found of estimating query costs in the component (or local) database systems. Du et al. [2] were the first to address this problem. They identified three types of component database systems: proprietary databases, for which cost functions and database statistics are known; conforming databases, which can provide database statistics but not cost functions; and non-conforming databases, for which neither cost functions nor database statistics are available. Du et al.'s approach was to derive the coefficients (parameters) of the cost functions by using a synthetic database. The main limitations of this approach were:

- The derivation can be done only with conforming databases.
- The derivation process requires us to know a priori the access methods employed by the component databases.
- The synthetic relations used by the calibrating process must always have a field (attribute) whose values are normally distributed.

Recently, Zhu and Larson [8] proposed the use of a query sampling method to derive the cost parameters of a local database. Their method has three steps:

1. develop a manual classification of queries based on their access semantics, and derive a cost function for each class
2. sample queries from each class and run them on a real (not synthetic) local database to observe their running times
3. use multiple linear regression to derive the parameters of the cost function for the local database

The manual classification proposed by Zhu and Larson [8], which we call the ZL classification, basically gives three classes of select or join queries: clustered index, non-clustered index, and non-index. The main problem with this method is that, when it is used for non-conforming databases, the manual classification cannot produce the three classes (since we know nothing about the underlying access methods). Thus all queries are placed in a single class, with a single cost function which has a relatively high average error of cost estimation.

While query classification is clearly important in deriving accurate cost functions, a significant question is "How many classes do we need to get accurate cost functions?". The more classes of queries we have, the more likely it is that the average error over all classes will be reduced. There are reasons, however, that can prevent us from having the maximum number of classes.

First, the more classes we have, the more sampling we are required to do. In non-conforming databases, it is reasonable to classify queries into two main classes of select queries, i.e., the equi and non-equi select queries. The maximum number of query classes would then be 2a, where a is the number of all attributes in the local database (the value 2a will be clarified in section 4.3). Suppose the database we are considering has 10 relations, each of which has 10 attributes, so that the total number of attributes is 100. The maximum number of classes would then be 2 × 100 = 200 classes. Each class of select queries requires at least 40 sample queries [8] to make its cost function accurate enough for on-line use. Let us relax the assumption of 2a classes to only half of that; that is, we ignore the a classes of equi queries. The total number of queries we then need to sample is 100 × 40 = 4000 queries. In experiments we have done with database sizes of 10,000 up to 25,000 tuples and 5-10 attributes per relation, non-equi select queries run at approximately 60-100 queries per hour on average, so the time needed to perform the sampling for the local database would be 1.7-2.8 days. The problem becomes considerably worse when we want to perform sampling over non-equi join queries and the maximum number of classes of join queries is required. Thus the sampling process may form a significant part of the total expense of running the database system if a large number of classes are involved. Second, for dynamic local databases, we periodically need to perform new sampling to update the cost function coefficients. Last, the fact that, in general, several query classes will have similar running-time characteristics1 means that grouping them together into the same class would not reduce the accuracy of the cost functions.

In this paper, we suggest that the number of classes k can be based on the number of queries q that we are willing to run in the sample. Whenever the required number of query classes is less than the maximum, the query classification problem can be formulated as a clustering problem in a large search space. Here, we propose to use a hierarchical clustering algorithm (HCA) [4] to perform the classification. Note that our method does not require any a priori knowledge about the local database system, apart from the relational schema, which makes it more widely applicable than the ZL method (which requires us to know the access methods used in the local database).

1 In this paper, we use the elapsed running time of queries as the cost metric, the same as in [8, 2].

The rest of the paper is organised as follows: Section 2 describes the model of queries that we use in optimisation and the cost functions on which the global query optimiser is based. In section 3, we examine query classification (1) where we can use a priori knowledge of the local database to classify queries into top-level classes and (2) where we have no such knowledge but can still classify the queries further. In section 4 we describe how to perform query sampling. The HCA algorithm is explained in section 5 and its experimental results are shown in section 6. An example of how to apply HCA to a local database is given in the appendix. In the last section, we present our conclusions and give some issues for future research.

2 Query Optimisation and Cost Function

In a multidatabase environment, each local database system may use different kinds of data models but, in this paper, the relational data model is assumed at the global level. That is, each local database is connected to the global multidatabase agent via an interface that provides a relational appearance, even if the participating database is non-relational [8]. In this paper, we adopt the standard treatment of queries used in most of the query optimisation literature. A query is regarded as a sequence of select (σ), project (π) and join (⋈) operations, and the cost of a query is the sum of the costs of these component operations in the given order. The projection operation is usually grouped together with a select or join operation, so its cost is computed in conjunction with the cost of that select or join operation. One of the aims of this work is to produce cost functions which can be exploited by a global query optimiser to answer select-project-join queries in conjunctive normal form (CNF)2. Each predicate of a CNF query is ANDed together to form the whole condition of the query and is of the form Ri.aj θ const or Ri.aj θ Rs.bt, where Ri.aj is attribute j of relation i, const is a constant value in the domain of attribute j, and θ ∈ {=, ≠, >, ≥, <, ≤}. Fundamentally, a predicate is either a select or a join operation.

Our proposed method may be used to derive query cost functions for either of the two main classes of queries (i.e. select and join queries). However, in this paper, we study the application of our method only to the classification of select queries (for the rest of the paper, we use the word "query" to refer to "select query"). Select queries are of the form πL(σF(Ri)) as in [8], where L is a list of projected attributes of relation Ri and F a predicate of the form Ri.aj θ const. Although simple, select queries of this form are sufficient to compute the cost of any complex CNF select query on a single relation.

Under our scheme, select queries are classified into subclasses, where each subclass has its own cost function found by a least squared error (LSE) method. Each select query has two independent variables which affect the running time of the query (the dependent variable): the number of tuples of the input relation, x1, and of the output relation, x2. Basically, x2 is unknown at runtime, but it can be estimated from the selectivity of the select query, i.e., x2 = selectivity × x1. Therefore, what we try to do is to derive one cost function t̂ = f(x1, x2) for the queries in each subclass, where t̂ is the estimate of the real running time t of the query. To account for variations in system load, we ran each query three times and used the average running time as the value for t.

2 CNF is the most commonly used query form in the optimisation literature, basically because its search space is smaller than that of its counterpart, disjunctive normal form.
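To make this concrete, here is a minimal sketch (ours, not the authors' implementation) of deriving one subclass cost function by least squares; the linear form t̂ = β0 + β1·x1 + β2·x2 matches the one used in the appendix, and the function names are our own.

```python
import numpy as np

# Fit one per-subclass cost function t_hat = b0 + b1*x1 + b2*x2 by
# least squared error, given the observed sample queries of the subclass.
def fit_cost_function(x1, x2, t):
    X = np.column_stack([np.ones_like(x1, dtype=float), x1, x2])
    coeffs, _, _, _ = np.linalg.lstsq(X, t, rcond=None)  # minimises ||X b - t||^2
    return coeffs  # array([b0, b1, b2])

# Estimate the running time of an unseen query in this subclass.
def estimate_time(coeffs, x1, x2):
    b0, b1, b2 = coeffs
    return b0 + b1 * x1 + b2 * x2
```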

3 Query Classification

We propose a query classification scheme that can be used with local databases which:
1. can provide some a priori knowledge, or
2. cannot provide any knowledge.

For the former, we can use knowledge gained from the applications at hand, such as the database schema, key information, query type information (for instance, point query, multipoint query, range query, prefix match query [6]), etc., to roughly classify all given queries into top-level classes. For example, the ZL manual classification method, as shown in Figure 1, can be considered a knowledge-based approach: queries which use any kind of clustered index are in the first class, queries which use any kind of non-clustered index are in the second, and queries which fall outside the first two classes are in the last.

[Figure 1: ZL Manual Classification — the query set Q is divided into Q1 (clustered index queries), Q2 (non-clustered index queries) and Q3 (non-index queries).]

The second kind of classification assumes no such knowledge is available to assist in classifying the queries of a higher-level class into subclasses. For example, based on the top-level classes (Q1, Q2 and Q3), the queries in each class can be classified further into subclasses in order to obtain more accurate cost functions. Recall that the more classes we have, the more accurate the cost functions we obtain. The algorithm HCA described in section 5 is used to carry out this kind of classification. Basically, this classification is useful in two situations:

1. It can be used to enhance the a priori knowledge classification.
2. It can be used to derive more subclasses from a single higher-level class when the local database system is non-conforming, i.e., when the system cannot reveal any useful information to help classify queries into the top-level classes.

For the former situation, consider an example. In the ZL manual classification, which is knowledge-based, all queries in each top-level class, namely Q1, Q2 or Q3 (such as all clustered index queries), are placed in a single class whose single cost function basically yields a high average error, compared to the lower average error produced by multiple classes over the same queries. The second situation, where classification without knowledge can help, is particularly useful for any non-conforming database system. We start off from a single query class Q which contains all queries from Q1, Q2 and Q3 (in contrast to the knowledge-based classification, which starts off from a certain number of classes, 3 in Figure 1). In both situations (starting either from Q or from Q1, Q2 and Q3), based on the number of queries given, the HCA algorithm works out query subclasses whose cost functions have a low average error.

4 Query Sampling

The query sampling we use here is simple random sampling [7], the same as in [8]. For the purpose of introducing a number of parameters, we explain the sampling method in terms of the ZL knowledge-based classification. Although the sampling method presented here is described for the knowledge-based classification, it can be applied straightforwardly to non-conforming database systems, where no knowledge about the local database is available.

4.1 Sampling Method

Sampling is best explained by Figure 2. The local database in the figure consists of 7 relations. In query set Q1 (Figure 2(a)), R1.a3, R2.b1, and so on are all clustered index attributes, whereas in Q2 (Figure 2(b)), R1.a1, R1.a2, R3.h2 and so on are non-clustered index attributes. In Q3, shown in Figure 2(c), we use R1 as a representative for the rest of the relations. Note that R1.a1, R1.a2 and R1.a3 are index attributes and therefore, while drawing up the queries in Q31, Q32, ..., Q39, we consider only the non-equality operators3 {<, ≤, >, ≥, ≠}, since queries that use the "=" operator already appear in sets Q11, Q21 and Q22.

[Figure 2: Query Sampling. (a) Sampling Q1: one query class per clustered index attribute, i.e. Q11 = {R1.a3 = c1}, Q12 = {R2.b1 = c2}, Q13 = {R3.h1 = c3}, Q14 = {R4.d1 = c4}, Q15 = {R5.e4 = c5}, Q16 = {R6.f1 = c6}, Q17 = {R7.g3 = c7}. (b) Sampling Q2: one query class per non-clustered index attribute, i.e. Q21 = {R1.a1 = c1}, Q22 = {R1.a2 = c2}, Q23 = {R3.h2 = c3}, Q24 = {R3.h3 = c4}, Q25 = {R4.d2 = c5}, Q26 = {R5.e3 = c6}, Q27 = {R6.f4 = c7}, Q28 = {R6.f5 = c8}, Q29 = {R7.g2 = c9}. (c) Sampling Q3: classes Q31, ..., Q39 of non-equality predicates (<, ≤, >, ≥, ≠) over the attributes of R1, used as a representative of the other relations.]

Given the number of queries to be sampled, we sample queries from the entire query population such that each query is chosen randomly with equal probability. To clarify the sampling method, consider query set Q1. The average number of queries q̄ in each set Q11, Q12, ..., Q17 is computed by:

    q̄ = q / K    (1)

where q is the total number of queries in Q1 to be sampled and K (= 7 in Figure 2(a)) is the maximum number of classes (see the next section). More details about the average number of queries q̄ when q < K, q = K and q > K are given in [8].

3 ≤ and ≥ are treated similarly to < and >, respectively.
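As an illustration, the following is a minimal sketch of the equal-allocation sampling of equation (1), assuming each initial class is represented by an (attribute, operator) template and that the constants each attribute ranges over are known; all names are hypothetical, not from the original system.

```python
import random

# Minimal sketch of simple random sampling with the equal allocation of
# equation (1). Each initial class is an (attribute, operator) template,
# e.g. ("R1.a3", "="); domain[attr] lists the constants attr ranges over.
def sample_queries(classes, domain, q):
    q_bar = q // len(classes)  # equation (1): q_bar = q / K
    sample = []
    for attr, op in classes:
        for _ in range(q_bar):
            const = random.choice(domain[attr])  # equal-probability choice
            sample.append((attr, op, const))     # predicate "attr op const"
    return sample
```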

4.2 Maximum Number of Classes (K)

Recall that Q (= Q1 ∪ Q2 ∪ Q3) is the entire set of queries to be sampled. The maximum number of classes for query set Q is:

    K = 4a    (2)

where a is the total number of attributes over all relations in the local database and the constant factor 4 is due to the four different relational operators {=, <, >, ≠}. Now let us consider the maximum number of classes for each individual Q1, Q2 and Q3 (see Figure 2). The maximum number of classes for Q1 and Q2, respectively, is the number of clustered and non-clustered index attributes in the database. Suppose K1 is the number of clustered indexes and K2 is the number of non-clustered indexes. For Q3, the maximum number of classes K3 is 4a − (K1 + K2); therefore, K1 + K2 + K3 = 4a. For example, in Database 1 of section 6 (Table 2(a)), a = 81, K1 = 7 and K2 = 10, so K3 = 4 × 81 − (7 + 10) = 307.

4.3 Preliminary Clustering

Since K is a vital factor in controlling the time and search space used in searching for the best clustering of query classes by the algorithm HCA, and K is generally large, we propose a preliminary clustering method to reduce the value of K and thus reduce both time and search space. Figure 3 helps to clarify the method. The basic idea is to cluster "similar" relational operators together into the same query class. For example, in the figure, one may want to cluster equi select queries together to form one class and non-equi queries to form another, on the grounds that equi select queries should have similar running-time characteristics, as should the non-equi queries.

[Figure 3: Preliminary clustering of "similar" relational operators — for each attribute R.m1, R.m2, ..., R.mn of a relation R, predicates using the equi operator (=) form one class and predicates using the non-equi operators (<, ≤, >, ≥, ≠) form another, giving two classes per attribute.]

Note that even though the time and search space can be reduced by clustering some relational operators, the maximum number of classes is still large; namely, the total number of classes for query set Q after the preliminary clustering is 2a. Recall that a is the total number of attributes in the local database; since the preliminary clustering of queries in Figure 3 is based on 2 groups of relational operators, namely the equi (=) and non-equi (<, ≤, >, ≥, ≠) operators, the total number of classes is equal to 2a.
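A minimal sketch of this preliminary clustering, assuming attributes are given as strings; the operator groups mirror Figure 3, but the representation is our own.

```python
# Build the 2a initial classes of the preliminary clustering: one equi and
# one non-equi class per attribute (operator groups as in Figure 3).
EQUI = ("=",)
NON_EQUI = ("<", "<=", ">", ">=", "!=")

def preliminary_classes(attributes):
    classes = []
    for attr in attributes:          # e.g. ["R1.a1", "R1.a2", ...]
        classes.append((attr, EQUI))
        classes.append((attr, NON_EQUI))
    return classes                   # 2a classes in total
```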

4.4 Number of Classes Required (k)

The number of queries q that users are willing to sample indicates how many classes k we need. The expensive (non-equi) classes of queries make up around 3/4 of the maximum number of classes (4a), so we can afford to have only a certain number of classes, less than the maximum. To make each cost function accurate enough, we require at least w queries per class, where w ≥ 40 for select queries, as proposed in [8]. Therefore, the number of classes we need is:

    k = ⌊q/w⌋    (3)

where ⌊·⌋ denotes the largest integer less than or equal to q/w.


5 Hierarchical Clustering Algorithm

Hierarchical clustering has been applied successfully in several applications. It may yield suboptimal solutions, but its great advantage is that it has polynomial running time. In any hierarchical clustering algorithm, one is required to define a matrix of similarity values [4, 1]. Based on these values, the two clusters of "entities" with the highest similarity are grouped together into a new cluster. The semantics of similarity are problem-dependent: it could be Euclidean distance, correlation, and so on [1]. In our case, it is the average of the root mean squared errors (RMS) of the cost functions, weighted by class size:

    RMS = ( Σ_{i=1}^{c} RMS_i · n_i ) / ( Σ_{i=1}^{c} n_i )    (4)

where n_i is the number of queries in each class, c is the number of classes, and each RMS_i is defined as:

    RMS_i = sqrt( ( Σ_{j=1}^{n_i} (t_j − t̂_j)² ) / n_i )    (5)

where t_j is the real observed running time of query j and t̂_j the running time estimated by the cost function. The HCA algorithm is given in Figure 4.

Figure 4: Algorithm HCA(k)

    let k be the number of query classes required
    let O be the set of initial query classes to be clustered
    let Ci/Cc/Cij denote clusters of initial query classes
    let M(Ci, Cj) be the matrix of average RMS errors, of size n × n, 1 ≤ n ≤ |O|

    place each initial query class in O in its own cluster Ci, where i = 1..|O|
    num_clus ← |O|
    for each pair of clusters Ci, Cj do
        compute the average error RMS of M(Ci, Cj)
    endfor
    while num_clus > k do
        choose the two clusters Ci, Cj with the least RMS in matrix M
        update matrix M by grouping Ci and Cj into a new cluster Cij
        for each cluster Cc such that c ≠ ij do
            compute the average error RMS of M(Cc, Cij)
        endfor
        num_clus ← num_clus − 1
    endwhile

The initial query classes in O are all query subclasses, such as Q11, Q12, ..., Q17 for query set Q1 (see Figure 2(a)) and Q21, Q22, ..., Q29 for query set Q2 (see Figure 2(b)). The algorithm starts with each initial query class placed in its own cluster and, in each iteration, combines two existing clusters into a new cluster. That is, the algorithm starts with |O| clusters, then |O| − 1, |O| − 2, ... until the number of clusters is equal to the desired number of query classes k.
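The following sketch computes the error measures of equations (4) and (5), reusing the hypothetical estimate_time helper from the section 2 sketch; each class is assumed to carry its sample data and fitted coefficients.

```python
import numpy as np

# Root mean squared error of one class's fitted cost function (equation 5).
def rms(coeffs, x1, x2, t):
    residuals = t - estimate_time(coeffs, x1, x2)
    return np.sqrt(np.mean(residuals ** 2))

# Weighted average RMS over a set of classes (equation 4). Each class is a
# dict carrying sample arrays "x1", "x2", "t" and fitted "coeffs".
def average_rms(classes):
    errors = [rms(c["coeffs"], c["x1"], c["x2"], c["t"]) for c in classes]
    sizes = [len(c["t"]) for c in classes]
    return np.average(errors, weights=sizes)
```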

Table 1 shows how the matrix M is updated from 5 × 5 to 4 × 4. Note that M is symmetric.

    (a) Before grouping C2 and C5

          C1     C2     C3     C4     C5
    C1    0      0.2    0.23   0.4    0.3
    C2    0.2    0      0.25   0.65   0.12
    C3    0.23   0.25   0      0.33   0.47
    C4    0.4    0.65   0.33   0      0.21
    C5    0.3    0.12   0.47   0.21   0

    (b) After grouping C2 and C5

          C1     C25    C3     C4
    C1    0      0.32   0.23   0.4
    C25   0.32   0      0.27   0.69
    C3    0.23   0.27   0      0.33
    C4    0.4    0.69   0.33   0

Table 1: Grouping C2 and C5
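A compact sketch of the HCA loop of Figure 4, under the assumption that the matrix entry for a pair of clusters is the RMS error of a single cost function refitted over their pooled sample queries; it builds on the hypothetical fit_cost_function and rms sketches above. For clarity it re-evaluates every pair on each iteration, whereas Figure 4 more economically recomputes only the row of the newly formed cluster.

```python
import itertools
import numpy as np

def pool(ci, cj):
    """Pool the sample queries of two clusters."""
    return {key: np.concatenate([ci[key], cj[key]]) for key in ("x1", "x2", "t")}

def fitted(c):
    """Attach an LSE-fitted cost function to a cluster."""
    return dict(c, coeffs=fit_cost_function(c["x1"], c["x2"], c["t"]))

def hca(initial_classes, k):
    """Agglomerate the initial query classes down to k clusters (Figure 4)."""
    clusters = [fitted(c) for c in initial_classes]
    while len(clusters) > k:
        # Evaluate every pairwise merge; keep the one with the least RMS error.
        candidates = [((i, j), fitted(pool(clusters[i], clusters[j])))
                      for i, j in itertools.combinations(range(len(clusters)), 2)]
        (i, j), merged = min(
            candidates,
            key=lambda cand: rms(cand[1]["coeffs"], cand[1]["x1"],
                                 cand[1]["x2"], cand[1]["t"]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```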

6 Experimental Results

The main aim of the experiments here is to see how well the HCA algorithm performs in reducing the average error RMS of the query cost functions. To do this, we compare the average error from the manual query classification [8] with the average error from a query classification produced by HCA. In each experiment, as the number of queries fed into algorithm HCA increases in multiples of 40 (namely 40, 80, 120, and so on), the number of classes increments by 1. Table 2 shows three different database configurations. The databases have 7,000-25,000 tuples per relation and each relation has 5-10 attributes. The number of queries in each experiment varies from 1,200 to 2,000 for each individual class of clustered index, non-clustered index and non-index queries. We used around 30% of the queries as the sample set to derive cost functions, and the remainder as the test set for measuring the average error yielded by the cost functions.

The results for each database configuration are shown in Figures 5, 6 and 7, respectively. The results show a tendency for the average error to decrease as the number of query classes increases. To illustrate, consider the graphs in Figure 5. In graph 5(a), the maximum number of classes is 7, as labelled at the rightmost point of the X-axis. This number stems from the total number of clustered index attributes, which is the sum of the values in the K1 row of Table 2(a). For graph 5(b), the maximum number of classes is 10, which is the total number of non-clustered indexes in the K2 row of Table 2(a). As for graph 5(c), we show only part of the maximum number of non-index attribute classes, i.e., 10 out of 307 = 4 × 81 − (7 + 10) classes. The solid line in, for example, graph 5(a) shows the average errors with only a single class, compared with the errors of the dashed line with multiple classes (2-7 classes in Figure 5(a)). In most cases, over the different numbers of query classes, the HCA algorithm reduces the average error and thus provides better cost estimates than a single class. Out of 81 cases, 71 cases yielded by HCA give lower average errors whereas only 7 cases give worse errors. The reason there are 7 cases with worse errors could be that the number of queries is initially insufficient; after it reaches a sufficient number, the average errors produced by multiple classes again become lower than those produced by a single class (see Figure 6(c) for example).

K1 = number of clustered index attributes
K2 = number of non-clustered index attributes
others = any other non-index attributes

(a) Database 1

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10
    tuples  14420  17122  8767   20056  18910  9854   19635  18284  18899  21836
    K1      0      1      1      1      1      1      0      1      0      1
    K2      0      1      1      1      1      1      2      0      1      2
    others  9      8      3      7      6      8      5      7      8      3

(b) Database 2

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    R11
    tuples  24825  10665  18565  20962  19707  7045   18406  22185  9289   10111  13795
    K1      1      1      1      1      0      0      1      0      0      1      1
    K2      1      0      1      0      1      1      1      1      0      2      2
    others  8      5      6      4      9      7      8      9      9      6      4

(c) Database 3

            R1     R2     R3     R4     R5     R6     R7     R8     R9     R10    R11    R12
    tuples  12095  17302  8862   9224   19795  20954  16914  20908  19283  18789  13841  24420
    K1      1      0      0      0      1      0      1      1      1      1      1      1
    K2      1      0      3      2      1      2      1      0      2      1      0      2
    others  6      7      6      3      6      3      4      6      6      7      7      2

Table 2: Different Database Configurations

[Figure 5: Database 1 — average error in seconds per query versus number of classes for (a) the clustered index class, (b) the non-clustered index class and (c) the non-index class; each graph compares the single-class ZL cost function (solid line) against the multi-class HCA cost functions (dashed line).]

[Figure 6: Database 2 — as Figure 5, for the second database configuration.]

[Figure 7: Database 3 — as Figure 5, for the third database configuration.]

7 Conclusion and Future Research

The paper addressed the derivation of query cost functions for multidatabase systems. The contributions of the paper are:

- We propose to use a hierarchical clustering algorithm to perform query classification, which achieves better performance in reducing the average error of query cost estimation than the manual classification.
- We propose a query classification which can be used with both conforming and non-conforming database systems. In particular, query classification for non-conforming systems has not been tackled successfully before.

There are several issues of interest that we plan to investigate further:

- Extend the current method to use non-linear regression techniques, to compare with the linear regression technique we currently use. The reason is that there are cases where the distribution of attribute values may not be uniform, and therefore the running times of queries in a class may not be linear. Thus, non-linear regression techniques could be better at finding best-fit cost functions.
- Compare the HCA algorithm with other classification algorithms, such as the partitioning algorithm in [3] or the algorithms used in machine learning.
- Investigate how many sampled queries are "sufficient" for each query class.
- Investigate how the cost functions of multiple query classes derived by HCA affect the choice of query execution plans, as compared to the three cost functions of the individual classes (clustered index, non-clustered index and non-index) derived by the manual classification.
- Investigate the use of other cost metrics instead of just the elapsed running time.

The HCA algorithm runs in polynomial time to produce the cost functions of each query class; this is an advantage when we want to combine the algorithm with a non-linear regression technique, which is likely to be slower than linear regression in finding best-fit cost functions.


Acknowledgements

We would like to thank Christopher R. Birchenhall from the University of Manchester, UK for his superb state-of-the-art C++ MatClass package, which he has made publicly available together with an excellent manual. His matrix class library contains several useful linear LSE functions which helped our project, such as the SVD, QR and LU decompositions.

References

[1] M.R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.

[2] W. Du, R. Krishnamurthy and M.C. Shan. Query Optimization in Heterogeneous DBMS. In Proceedings of the 18th VLDB Conference, pages 277-291, 1992.

[3] J.A. Hartigan. Clustering Algorithms. John Wiley & Sons, 1975.

[4] S.C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241-254, September 1967.

[5] H. Lu, B.C. Ooi and C.H. Goh. Multidatabase Query Optimization: Issues and Solutions. In Proceedings of the Third International Workshop on Research Issues in Data Engineering: Interoperability in Multidatabase Systems, pages 137-143, 1993.

[6] D.E. Shasha. Database Tuning: A Principled Approach, chapter 3, pages 53-88. Prentice Hall, Englewood Cliffs, New Jersey, 1992.

[7] S.K. Thompson. Sampling. John Wiley & Sons, 1992.

[8] Q. Zhu and P.A. Larson. A Query Sampling Method for Estimating Local Cost Parameters in a Multidatabase System. In Proceedings of the 10th International Conference on Data Engineering, pages 144-153, 1994.

Appendix: Example

This appendix demonstrates how to apply the HCA algorithm to obtain a solution of query classes with a low average error. In database configuration 3 (see Table 2(c)), there are 8 clustered indexes, namely r1.a5, r5.e7, r7.g3, r8.h2, r9.i3, r10.j9, r11.k5 and r12.l3. Each clustered index forms one initial query class; that is, queries which use index r1.a5 are in query class C1, queries which use index r5.e7 are in C2, and so on. Table 3 shows the queries4, the number of tuples of the input relation x1 and of the output relation x2, and their running times t, in columns 2, 3, 4 and 5 respectively. Note that we show only 3 queries (out of 10) for each query class. The total number of queries we were willing to sample in this experiment was 80; divided by 8 (the number of clustered indexes), this gives 10 queries per class.

4 Recall that select queries are of the form πL(σF(Ri)). To keep Table 3 concise, we omit the lists of projected attributes of the queries.

    class  query                    x1     x2  t
    C1     where r1.a5 = 39138;     12095  4   0.69657
    C1     where r1.a5 = 38464;     12095  5   0.03793
    C1     where r1.a5 = 38828;     12095  1   0.02229
    C2     where r5.e7 = 13006;     19795  1   0.06499
    C2     where r5.e7 = 26025;     19795  3   0.02647
    C2     where r5.e7 = 32182;     19795  1   0.02414
    C3     where r7.g3 = 33075;     16914  3   0.04472
    C3     where r7.g3 = 32120;     16914  1   0.02136
    C3     where r7.g3 = 33262;     16914  2   0.02870
    C4     where r8.h2 = 44688;     20908  2   0.07292
    C4     where r8.h2 = 55941;     20908  5   0.09813
    C4     where r8.h2 = 55410;     20908  2   0.06793
    C5     where r9.i3 = 41119;     19283  3   0.03425
    C5     where r9.i3 = 40610;     19283  5   0.04323
    C5     where r9.i3 = 35895;     19283  3   0.02748
    C6     where r10.j9 = 23224;    18789  2   0.03273
    C6     where r10.j9 = 14760;    18789  5   0.04467
    C6     where r10.j9 = 11924;    18789  2   0.03331
    C7     where r11.k5 = 20329;    13841  2   0.05460
    C7     where r11.k5 = 8263;     13841  2   0.04523
    C7     where r11.k5 = 21263;    13841  3   0.07179
    C8     where r12.l3 = 72124;    24420  1   0.02655
    C8     where r12.l3 = 61785;    24420  4   0.04207
    C8     where r12.l3 = 62874;    24420  2   0.02692

Table 3: Queries, input and output tuples and running times

Initially, HCA placed the 8 initial query classes in their own clusters, as shown in the first column of Table 4. The second, third and fourth columns of the table are the regression coefficients β0, β1 and β2 of the equation t̂ = f(x1, x2) = β0 + β1·x1 + β2·x2.

    cluster  β0           β1           β2
    {C1}     1.18804e-10  1.43693e-06  0.02363
    {C2}     6.80326e-11  1.34671e-06  -0.00032
    {C3}     4.59872e-11  7.77827e-07  0.00725
    {C4}     5.28670e-11  1.10534e-06  0.00901
    {C5}     3.90722e-11  7.53428e-07  0.00491
    {C6}     1.06346e-10  1.99814e-06  0.00189
    {C7}     9.42720e-11  1.30482e-06  0.00878
    {C8}     3.03456e-11  7.41039e-07  0.00491

Table 4: Clusters and their regression coefficients

By "learning" from the numbers of input and output tuples x1 and x2 of the queries and their running times t, the regression formulas (cost functions) are found by the LSE method for each individual cluster. These formulas can then be employed on-line to estimate the running times of unseen queries.

In the first iteration of HCA, query classes C3 and C8 were merged into the same cluster, as shown in Table 5. The reason C3 and C8 were merged is that, compared with the other possible mergings (such as C1 with C2, C1 with C3, and so on), merging C3 and C8 gave the least average error RMS. In addition, HCA recomputed the coefficients of the new cluster comprising C3 and C8.

    cluster   β0           β1            β2
    {C1}      1.18804e-10  1.43693e-06   0.02363
    {C2}      6.80326e-11  1.34671e-06   -0.00032
    {C4}      5.28670e-11  1.10534e-06   0.00901
    {C5}      3.90722e-11  7.53428e-07   0.00491
    {C6}      1.06346e-10  1.99814e-06   0.00189
    {C7}      9.42720e-11  1.30482e-06   0.00878
    {C3, C8}  0.01698      -2.89993e-08  0.00567

Table 5: Clusters and their regression coefficients

Due to the length limit of the paper, we omit the outputs from the second to the fifth iterations and show only the last iteration (the sixth) in Table 6, which yields the final solution of applying HCA. Recall that HCA stops when the number of clusters (num_clus) is less than or equal to the desired number of clusters, here 2 (calculated by k = ⌊q/w⌋ = ⌊80/40⌋ = 2).

    cluster                       β0           β1            β2
    {C1}                          1.18804e-10  1.43693e-06   0.02363
    {C2, C3, C4, C5, C6, C7, C8}  0.02877      -3.67496e-07  0.00538

Table 6: Clusters and their regression coefficients
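To close the loop, a small sketch of using the final coefficients of Table 6 on-line; the values are transcribed from the table, and estimate_time is the hypothetical helper from the section 2 sketch.

```python
# Final cost functions from Table 6: cluster -> (beta0, beta1, beta2).
COST_FUNCTIONS = {
    ("C1",): (1.18804e-10, 1.43693e-06, 0.02363),
    ("C2", "C3", "C4", "C5", "C6", "C7", "C8"): (0.02877, -3.67496e-07, 0.00538),
}

# Estimate the running time of an unseen query on r1.a5 (cluster {C1}),
# with x1 = 12095 input tuples and x2 estimated from the selectivity.
coeffs = COST_FUNCTIONS[("C1",)]
t_hat = estimate_time(coeffs, x1=12095, x2=4)  # roughly 0.11 seconds
```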
