and l_i(ã_i)_free the free restricted extension of literal l_i. If B_i is the union of the constants occurring in ã_i and those variables in S_{i-1} which occur among the ã_i, and if dom(v) denotes the class to which a variable or constant v is bound, then the fan-out of l_i(ã_i) in S is defined as

    fo(l_i(ã_i), B_i) = |l_i(ã_i)_free| / ∏_{b ∈ B_i} |dom(b)|
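For illustration, a minimal Python sketch of this computation; the function name and the representation of extensions and domains are our own assumptions, not part of ConceptBase:

    from math import prod

    def fan_out(ext_free_size, bound_domain_sizes):
        # Fan-out of a literal: size of its free restricted extension,
        # divided by the product of the domain sizes of its bound
        # arguments and constants (empty product = 1).
        return ext_free_size / prod(bound_domain_sizes)

    # A literal with 432 solutions whose only bound argument ranges over
    # a class with 12 instances has fan-out 432 / 12 = 36.
    print(fan_out(432, [12]))  # 36.0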
The fan-out of an extensional literal in a given sequence is directly related to the selectivity. Let L = <l_1(ã_1), l_2(ã_2)> be a join operation defined by the common variable v with domain (class) V; then the result size of L is estimated as

    n_{l_1,l_2} = n_{l_1} · n_{l_2} · 1/|V| = |l_1(ã_1)| · fo(l_2(ã_2), {v})    (1)
Example 3.1 (Fan-Out vs. Selectivity) In the example model, 432 files of 12 different types exist, where every file has a unique type. The cardinality of S = <A.type(F1,T), A.type(F2,T)> is estimated according to equation (1) as

    n_{A.type(F1,T),A.type(F2,T)} = 432 · 432 · 1/12 = 15552

When using the fan-out of A.type(F2,T) in the given sequence, the estimate is

    n_{A.type(F1,T),A.type(F2,T)} = |A.type(F1,T)| · fo(A.type(F2,T), {T}) = |A.type| · |A.type_free|/|FileType| = 432 · 432/12 = 15552
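As a quick check of these numbers, here is a minimal Python sketch; the variable names and the uniformity assumption baked into it are ours:

    # Example 3.1: 432 files of 12 types, types uniformly distributed.
    n_lit = 432    # |A.type|, number of solutions of each literal
    n_types = 12   # |FileType|, domain size of the join variable T

    # Selectivity view (equation 1): selection from the cross product.
    est_selectivity = n_lit * n_lit / n_types

    # Fan-out view: the right-hand literal applied to the left-hand one.
    fo_right = n_lit / n_types          # fo(A.type(F2,T), {T}) = 36
    est_fan_out = n_lit * fo_right

    assert est_selectivity == est_fan_out == 15552.0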
Though both equations in the example deliver the same result, fan-out and selectivity are slightly different concepts. Selectivity interprets a join as a selection from the cross product of two relations. With the fan-out, the right-hand literal can be seen as an operator that is applied to the left-hand literal. In contrast to the selectivity, the fan-out of a literal directly makes it possible to distinguish between "good" and "bad" literals in a sequence: literals with a fan-out less than 1 are "good", because they reduce the intermediate result size, while literals with a fan-out larger than 1 enlarge the intermediate result size.

Negated literals are evaluated according to the "closed world assumption": not(l(ã)) can be evaluated only if every argument in ã is bound. In this case the fan-out of l(ã) is (theoretically) guaranteed to be less than or equal to 1, and we define the fan-out of not(l(ã)) as fo(not(l(ã)), B) = 1 - fo(l(ã), B). In practice the fan-out fo(l(ã), B) can be greater than one because of the additional solutions produced by deductive rules; the computation above would then result in a negative value. Therefore fo(l(ã), B) is in this case restricted to the solutions stored extensionally, and the intensional fan-outs are added afterwards to fo(not(l(ã)), B). The fan-out of the arithmetic comparison literals LT (<), LE (≤) and GE (≥) is estimated as 0.5, the fan-out of EQ (=) as 1/|Integer|, and the fan-out of NE (neq) as 1 - 1/|Integer|. (The extension of class Integer contains all integers occurring as attribute values for a stored object.)

Cyclic dependencies are handled similarly to negated literals. If the query optimizer detects a cycle in the rule dependency graph, all recursively defined literals are removed from the rules that occur in the cycle. After the optimization of the reduced rules, these literals are inserted at the end of the sequence, because we assume high costs for evaluating recursive literals.
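A sketch of how the negation rule could be realized; the explicit split of the fan-out into an extensional and an intensional part is our own rendering of the correction described above:

    def fan_out_not(fo_ext, fo_int=0.0):
        # Fan-out of not(l(ã)) with all arguments bound (closed world
        # assumption): 1 - fo(l(ã), B). Only the extensionally stored
        # solutions enter the subtraction; fan-outs contributed by
        # deductive rules are added back afterwards.
        return (1.0 - fo_ext) + fo_int

    print(fan_out_not(0.4))        # purely extensional literal: 0.6
    print(fan_out_not(0.4, 0.8))   # deductive rules add solutions: 1.4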
3.3 The Size Function

The Size function used in definition 3.2 can be defined as follows using the fan-out.

Definition 3.6 (Size) Let S = <l_1(ã_1), ..., l_n(ã_n)> be a literal sequence and let S_i denote the prefix of length i of this sequence. Then Size(S_i) is defined as

    Size(S_i) = fo(l_1(ã_1), B_1)                    if i = 1
    Size(S_i) = Size(S_{i-1}) · fo(l_i(ã_i), B_i)    if i > 1
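A minimal recursive rendering of this definition in Python; the list-of-fan-outs representation is our assumption:

    def size(fan_outs, i):
        # Size(S_i) for the prefix of length i, as in definition 3.6:
        # fo(l_1, B_1) for i = 1, otherwise Size(S_{i-1}) * fo(l_i, B_i).
        if i == 1:
            return fan_outs[0]
        return size(fan_outs, i - 1) * fan_outs[i - 1]

    # Two literals: 432 solutions with nothing bound, then fan-out 0.5.
    print(size([432.0, 0.5], 2))  # 216.0, cf. example 3.2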
Since we consider rule sets, S may occur as the body of a rule with head literal l. This literal l can in turn be used in other rule bodies with specific instantiation patterns. Therefore B_1 not only contains the constants of l_1 but also those of its variables that already occur as bound variables in l. (As an obvious consequence, we have to optimize each rule body with respect to all relevant instantiation patterns of its corresponding head.) This definition fills the gap left in definition 3.2, and our cost function is now defined completely.

Example 3.2 (BigFile) Assume that the class BigFile is defined via the rule

    In(F,BigFile) :- A.size(F,S), GT(S,10000)

Then the estimated cost for this sequence would be

    Cost(In(F,BigFile), ∅) = Size(A.size(F,S), ∅) + Size(A.size(F,S), ∅) · fo(GT(S,10000), {F,S})
                           = 432 + 432 · 0.5 = 648
if F is free and
    Cost(In(F,BigFile), {F}) = Size(A.size(F,S), {F}) + Size(A.size(F,S), {F}) · fo(GT(S,10000), {F,S})
                             = 1 + 1 · 0.5 = 1.5

if F is bound to an object of class SourceFile. The estimated size of the class BigFile would be Size(S_2, {F,S}) = 216. If the literal In(F,BigFile) occurs in a conjunction, the fan-out of this literal can thus be estimated as 216 if F is free and, e.g., as 0.5 if F is bound to an object of class SourceFile. The estimated evaluation costs for the conjunction increase by 648 if F is free and by 1.5 if F is bound to an object of class SourceFile.
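The two cost computations can be replayed in a few lines; this is a sketch under our assumption that the cost of a sequence is the sum of the sizes of all its prefixes, as in the two computations above:

    def cost(size_first, fan_outs_rest):
        # Accumulate Size(S_1), Size(S_2), ... and sum them up.
        total = size = size_first
        for fo in fan_outs_rest:
            size *= fo
            total += size
        return total

    print(cost(432.0, [0.5]))  # F free:  432 + 432*0.5 = 648.0
    print(cost(1.0, [0.5]))    # F bound: 1 + 1*0.5 = 1.5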
3.4 Histograms

The above formulas base costs on the assumptions of uniform distribution and statistical independence. In some cases more detailed distribution information can be applied. For A.p literals, the distribution of values for the source and destination component can be stored in the statistical profile. This information can be used directly if one of the arguments of an A.p literal is a constant: if the other argument is a free variable, the histogram entry is the exact fan-out; if the other argument is bound, the fan-out can be computed based on the number of solutions for the constant rather than on all solutions of A.p.

Example 3.3 (Use of Histogram Entries) Assume that the directory d occurs 20 times as destination component in the literal A.dir, and that 400 source-file objects and 80 directories are stored in the database. Then, according to the histogram entry for the destination component of A.dir, the fan-out of A.dir(F,d) is 20 if F is free and can be estimated as 20/400 = 1/20 if F is bound. The latter value expresses the probability that the value of F is among the 20 values which fulfill A.dir(F,d). If the histogram were not available, the constant d would have to be treated in the cost estimate like a bound variable whose concrete value is unknown. Then the fan-out would be 400/80 = 5 if F is free and 400/(80 · 400) = 1/80 if F is bound.
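A sketch of this histogram lookup with the numbers of example 3.3; the dictionary representation of the statistical profile is our assumption:

    # Histogram for the destination component of A.dir: value -> frequency.
    dir_hist = {"d": 20}     # directory d occurs 20 times as destination
    n_source_files = 400     # stored source-file objects (domain of F)

    fo_free = dir_hist["d"]                    # F free: exact fan-out, 20
    fo_bound = dir_hist["d"] / n_source_files  # F bound: 20/400 = 0.05
    print(fo_free, fo_bound)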
The histograms stored in the statistical profile are end-biased histograms [2]. This type of histogram requires less memory than ordinary histograms, because it represents distribution information in a compressed form. The variant used here represents some objects of the highest and lowest frequency explicitly and uses one average frequency for the remaining values. Furthermore, the number of tuples in the result of a binary equi-join A.p_1(A,B), A.p_2(C,B) with A, B and C free can be known exactly if histograms for A.p_1 and A.p_2 exist.

Example 3.4 (Uniform Distribution vs. Histograms) Consider the sequence S = <A.type(A,B), A.type(C,B)> and a database with 10 files of type a, 10 files of type b and 70 files of type c. Using the uniform distribution assumption, 30 files would be assumed to exist for each type, and C(S_2) would then be estimated as Size(S_1) + Size(S_2) = 90 + 90 · 90/3 = 2790. But we know, for example, that for each of the files of type a in the first relation, 10 matching files exist in the second relation. If we take the histogram for the destination component of A.type as a vector and compute the scalar product of this vector with itself, we obtain the exact size of the extension of S_2: Size(S_2) = 10·10 + 10·10 + 70·70 = 5100.
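The exact join size is simply the scalar product of the destination-component histogram with itself; a short sketch with the numbers of example 3.4:

    type_hist = {"a": 10, "b": 10, "c": 70}  # files per type

    # Exact size of <A.type(A,B), A.type(C,B)>: per type value, every
    # file in the first literal matches every file in the second one.
    print(sum(f * f for f in type_hist.values()))  # 5100

    # Uniform-distribution estimate of Size(S_2) for comparison:
    print(90 * (90 / 3))  # 2700.0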
3.5 Exploitation of Materialized Views

The usage of histograms described before is one possible way of improving the accuracy of cost estimates. The implemented cost function additionally exploits further background knowledge stored in the system catalogue, for example knowledge about the size of materialized views. Views in ConceptBase are defined like query classes and are translated to Datalog rules in the same way as queries and deductive rules. If the body of a Datalog rule for a materialized view is a subset of the current set of literals to optimize, the statistical knowledge about the extension of the view is employed during cost estimation as shown in the following example. Details of the necessary query subsumption algorithm can be found in [10].

3.6 Assumptions and Overall Cost Estimation Strategy

The preceding examples illustrated the application of more sophisticated statistics for obtaining cost estimates, which avoids significant underestimates as in example 3.4. The decision about the applicability of available histograms and materialized views is based on the assumption of uniform selection, which states that the shape of the distribution of the instances of a class is preserved during evaluation of the literals in between; [11] explains this in more detail. The overall cost function of the ConceptBase query optimizer applies the following strategy:

1. if a view-based estimate is applicable, use it;
2. if a histogram-based estimate is applicable, use it;
3. else use the "standard" fan-out method.

4 SEARCH STRATEGIES AND HEURISTICS

The computational costs of query optimization are dominated by the search space. Computing the cost function for a given literal sequence can be done in linear time (linear in the number of literals), but even when only considering left-deep evaluation, there are n! possible literal sequences for a sequence of n literals. We decided to combine our cost function with a global (and complete) search method, but to stay pragmatic: the optimizer performs a best-first algorithm until a certain upper time bound is reached; then the execution is suspended and continued as cheapest-first search. The best-first search proceeds as proposed in [9] and is sketched below: starting with a set of literal sequences of length one formed by the literals of the query, the sequences are successively extended by including literals not contained before. The selection criterion for a sequence to be further expanded is minimal cost compared to all other sequences currently available. If the sequence already contains all literals, it represents the optimal QEP; otherwise it is replaced by its successor and its neighbour with minimal costs among all other successors and neighbours. Whereas best-first search checks the costs of several partial QEPs as candidates for an extension, a possibly subsequent cheapest-first search starts with the cheapest single literal and stepwise includes the other literals, choosing at each step the literal with the lowest evaluation costs given the existing partial sequence.
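A compact Python sketch of the best-first phase; the priority-queue formulation and the cost_of callback are our simplification of the procedure from [9], not the exact implemented variant:

    import heapq
    from itertools import count

    def best_first_order(literals, cost_of):
        # Expand the cheapest partial sequence until a complete one is
        # found; since extending a sequence never decreases its cost,
        # the first complete sequence popped is the cheapest QEP.
        tie = count()  # tie-breaker, avoids comparing sequences
        frontier = [(cost_of((l,)), next(tie), (l,)) for l in literals]
        heapq.heapify(frontier)
        while frontier:
            c, _, seq = heapq.heappop(frontier)
            if len(seq) == len(literals):
                return seq, c
            for l in literals:
                if l not in seq:
                    ext = seq + (l,)
                    heapq.heappush(frontier, (cost_of(ext), next(tie), ext))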
In addition, heuristics are employed to restrict the number of possible QEPs considered by this algorithm. So-called "Early Reduction" heuristics aim at identifying literals that can be evaluated at low cost and are suited to reduce intermediate result sizes as early as possible. The first two categories of heuristics eliminate literals before executing the cost-based optimization and reinsert them afterwards; the others perform an additional reordering of literals in the optimized sequence. Details of the algorithm and the heuristics are elaborated in [11].

5 ACCURACY AND PRACTICAL IMPACT

The techniques presented above were implemented in the ConceptBase query optimizer [10]. In section 3 we discussed how the basic cost function can respect additional information such as histograms and materialized views. We will demonstrate the relationship between the accuracy of the cost estimation and the exploited additional information with the sample query BadMakeFiles. The very uneven distribution of file types in directories makes this query an interesting challenge for testing the optimizer. We look at the cost estimation with respect to four different constellations of background information available to the optimizer. After applying the early reduction heuristics, 360 possible QEPs remain. Figure 2 compares the estimated costs with the costs measured by directly computing the cost function with the actual Size values for all intermediate results.

Measurement 1: The cost estimation relies on the uniform distribution assumption only. Obviously we get a clear underestimation, in particular for the minimum value.

Measurement 2: For the attributes occurring in the query the optimizer can access histograms and assumes uniform selection. In certain parts of the curve we get a better approximation.

Measurement 3: In addition to histograms, the optimizer also respects two materialized views which correspond to certain subexpressions of BadMakeFiles. Although still underestimated, the global minimum is found.

Measurement 4: Employing specialized histograms (maintained during the QEP analysis) yields a further improvement of the cost approximation, in particular in the lower part of the curve.

For validation purposes we employ two applications: the first (I) concerns the software source and change management for the ConceptBase system development, based on a schema similar to the one used above; the second (II) stems from a business process modelling project [7]. For both applications we undertook measurements for a number of queries and view definitions. Without explaining their semantics, tables 1 and 2 compare the query evaluation time (t_new) for a subset of them with the time t_old needed by the same ConceptBase release without the cost-based optimizer. The costs of optimization are covered by t_opt. As in measurement 1, we only use the uniform distribution assumption; the search for QEPs is done with best-first. The acceleration factor c results from dividing the old by the new response time.
Query            t_new   t_best   t_worst   t_opt   t_old     c
BadMakeFiles     8.4     0.48     45.00     0.3     33.11     3.9
BadPrologFiles   0.23    0.23     58.45     0.3     32.87     142.9
SameComment      7.35    7.35     426.65    0.64    80045.3   10890.5
MultiUser        12.44   11.56    48.89     2.26    446.02    35.9
DoubleSize       0.9     0.82     10.45     0.85    12.85     14.3
QuixAndMe        1.6     1.51     19.51     0.37    5.74      3.6
PowerUser        0.47    0.47     7.62      0.08    1001.42   2130.7

Table 1: Optimization results I
Query    t_new   t_opt   t_old   c
QII_1    0.07    0.13    0.09    1.3
QII_2    0.20    0.84    0.21    1.1
QII_3    0.29    0.6     0.77    2.7
QII_4    0.17    0.07    3.65    21.5
QII_5    0.68    2.88    73.16   107.6
QII_6    0.24    0.31    3.99    16.6
QII_7    0.23    0.31    2.04    8.9
QII_8    0.68    0.09    28.82   42.4
QII_9    0.73    0.1     28.38   38.9
QII_10   0.69    0.41    14.15   20.5
QII_11   0.73    0.41    14.34   19.6

Table 2: Optimization results II

Application I: For each query we measured the running time of all possible QEPs. Table 1 contains the best (t_best) and worst (t_worst) values found.
Application II: The measured values (table 2) show a significant improvement of the response time compared to the previous ConceptBase optimizer, which applied only some very rough heuristics. In some cases we observe a negative overall effect if we look at t_new + t_opt. This can be explained by the QEP selected with the rough heuristics being good anyway, or simply by chance; additional effort for systematic optimization in these cases yields only relatively small improvements. The optimization time can in particular be neglected for views, which are stored together with a fixed QEP that is executed whenever the view is accessed.

6 CONCLUSIONS

We have described the main features of the ConceptBase query optimizer, which explicitly takes into account the complexities introduced by the combination of complex structures and the sophisticated analytic rule-based capabilities required by meta data managers in applications such as data warehousing or information systems design. Our cost-based query optimization approach is based on the asymmetric principle of fan-out rather than the symmetric principle of join selectivity. It supports this principle with adapted indexing and view materialization techniques that not only accelerate query execution but also improve cost estimates. Experiences in a number of real-world applications demonstrate the value of these more precise cost estimates as well as the performance advantages over the purely heuristic strategies used in most implemented deductive databases.

A number of issues remain to be investigated. The current query execution strategy in ConceptBase supports ad-hoc queries only, even though query optimization is heavily based on query reuse. A compiled approach that requires less iteration between the reasoning/planning level and the object store level should result in a significant, albeit constant-factor, improvement of query performance in database application programs. More fundamentally, the experiments in section 5 show that even though more precise statistical profiles and materialized views are important in approximating the overall cost estimates, the true goal of query optimization is just to find some plan which in practice has close to minimal costs. The question of which views to materialize and which histograms to keep is, similar to the general question of view materialization strategies, still a subject of ongoing research.
[Figure 2 contains four panels (Measurement 1 to Measurement 4), each plotting measured costs and estimated costs (number of tuples, logarithmic scale) against the sequence number (0 to 400); the example sequence and the cost minimum are marked in each panel.]
Figure 2: Accuracy of Estimation

References

[1] Y.E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 268-277, 1991.

[2] Y.E. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems, 18(4):709-748, 1993.

[3] M. Jarke, R. Gallersdörfer, M.A. Jeusfeld, M. Staudt, and S. Eherer. ConceptBase - a deductive object base for meta data management. Journal of Intelligent Information Systems, 4(2):167-192, March 1995.

[4] M.A. Jeusfeld. Update Control in Deductive Object Bases. PhD thesis, University of Passau (in German), 1992.

[5] A. Kemper and G. Moerkotte. Access support relations: An indexing method for object bases. Information Systems, 17(2):117-145, 1992.

[6] J. Mylopoulos, A. Borgida, M. Jarke, and M. Koubarakis. Telos: Representing knowledge about information systems. ACM Transactions on Information Systems, 8(4):325-362, October 1990.

[7] H.W. Nissen, M.A. Jeusfeld, M. Jarke, G.V. Zemanek, and H. Huber. Managing multiple requirements perspectives with metamodels. IEEE Software, 13(2):37-48, March 1996.

[8] P.G. Selinger, M.M. Astrahan, D.D. Chamberlin, R.A. Lorie, and T.G. Price. Access path selection in a relational database management system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 23-34, 1979.

[9] D.E. Smith and M.R. Genesereth. Ordering conjunctive queries. Artificial Intelligence, 26:171-215, 1985.

[10] R. Soiron. Cost-based query optimization in deductive databases (in German). Master's thesis, Lehrstuhl Informatik V, RWTH Aachen, 1996.

[11] M. Staudt, R. Soiron, C. Quix, and M. Jarke. Cost-based query optimization in deductive object bases. Technical report, Swiss Life, 1998. Submitted for publication.

[12] M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and randomized optimization for the join ordering problem. VLDB Journal, 6(3):191-208, 1997.

[13] A.N. Swami and K.B. Schiefer. On the estimation of join result sizes. In Proc. 4th Int. Conf. on Extending Database Technology, pages 287-300, 1994.