Genetic Programming in Database Query Optimization

Michael Stillger
Institut für Informatik, Humboldt-Universität zu Berlin
Unter den Linden 6, 10099 Berlin, Germany
[email protected] http://www.dbis.informatik.hu-berlin.de/ stillger

Myra Spiliopoulou
Institut für Wirtschaftsinformatik, Humboldt-Universität zu Berlin
Spandauer Str. 1, 10178 Berlin, Germany
[email protected] http://www.wiwi.hu-berlin.de/ myra/

The work of this author is supported by the German Research Council under contract DFG Fr 1142/1-1.

ABSTRACT
Database query optimization is a hard research problem. Exhaustive techniques are adequate for trivial instances only, while combinatorial optimization techniques are vulnerable to the peculiarities of specific instances. We propose a model based on genetic programming to address this problem, motivated by its robustness and efficiency in a wide area of search problems. We adapt the genetic programming paradigm to the requirements of the query optimization problem, showing that the nature of the problem makes genetic programming a particularly attractive approach to it.

1 Introduction
Genetic programming enjoys increasing popularity for mastering difficult optimization problems. We propose a methodology that applies genetic programming to the query optimization problem in relational databases.

Query optimization has been the subject of active research for more than 20 years. Recently, research has turned towards techniques based on combinatorial optimization. It has been observed, however, that their efficiency is affected by peculiarities of specific instances of the problem [Ioannidis and Kang, 1991]. The robustness of evolutionary computation towards such problems has led researchers to consider genetic algorithms for query optimization [Bennett et al., 1991; Steinbrunn et al., 1993], yielding encouraging results. However, the methods used in [Bennett et al., 1991; Steinbrunn et al., 1993] to represent the problem according to the requirements of genetic algorithms are counterintuitive and prone to ambiguities. We show that a methodology based on genetic programming is closer to the nature of the problem, avoids much of the complexity and ambiguity of GA methods, and better exploits the advantages of evolutionary computation.

In the next section we briefly describe the query optimization problem and the techniques proposed to tackle it. In section 3, we present our genetic programming model for query optimization, specifying our GP representation and the GP operators applied on it. In section 4, we discuss the current status of our model. Section 5 concludes the study.
2 The Query Optimization Problem
The relational data model introduced by Codd [Codd, 1970] opened a new era in database technology. Relational database requests are formulated as declarative queries. A query can be represented as the conjunction of the operators in a "query graph": the graph nodes are database relations, while the edges represent operators applied on them, namely "selection", "join", "projection", "set operators" and "aggregates". Fig. 1 shows an example query graph.

Optimizing a query. The query graph is still a declarative representation of the query, from which several query execution plans (QEPs) can be produced, specifying different ways of executing the query. A QEP in relational databases is an "operator tree". Its non-leaf nodes are the query operators labelled with their execution algorithms, while the leaves are the database relations labelled with the access method (e.g. index) used to retrieve them. An edge represents the data stream output by a node and input to its parent. Since edges denote data flow, the relations on which an operator is applied must appear in the leaves of the subtree below it. Different QEPs can be produced by specifying different execution algorithms for the nodes and by changing the relative placement of the nodes while preserving the data flow semantics of the edges.

[Figure 1: Join graph for an 18-join query, with relations R1 to R19 as nodes and joins J1 to J18 as edges]

All QEPs for a query constitute its "search space" S. Each database system contains a module, the query optimizer, whose goal is the selection of the least expensive QEP in S. A cost model, usually measuring CPU and I/O time, attaches a value cost(qep) to each qep ∈ S. Hence, the least expensive QEP is a qep_0 ∈ S such that:

$$\mathrm{cost}(qep_0) = \min_{qep \in S} \mathrm{cost}(qep)$$
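The definition of the least expensive QEP can be illustrated with a minimal sketch. The `Node` type, the relation names `A` and `B`, and the weights in `toy_cost` are assumptions made for illustration only; they do not reproduce the cost models of this paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Node:
    """Operator-tree node: a leaf is a relation, a non-leaf is a join."""
    label: str                      # e.g. "A" for a relation, "J" for a join
    method: str = ""                # access method (leaf) or join algorithm (non-leaf)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def least_expensive(space: Iterable[Node], cost: Callable[[Node], float]) -> Node:
    """Return qep_0 such that cost(qep_0) is minimal over the search space."""
    return min(space, key=cost)


def toy_cost(qep: Node) -> float:
    """Hypothetical cost: sum of per-algorithm weights over all join nodes."""
    if qep.left is None or qep.right is None:
        return 0.0                  # leaves cost nothing in this toy model
    weight = {"h": 1.0, "s": 2.0, "l": 3.0}[qep.method]
    return weight + toy_cost(qep.left) + toy_cost(qep.right)


# A tiny explicit search space: one join, three alternative plans.
a, b = Node("A"), Node("B")
space = [Node("J", "l", a, b), Node("J", "h", a, b), Node("J", "s", b, a)]
best = least_expensive(space, toy_cost)
```

For realistic queries the search space cannot be enumerated like this, which is the motivation for the search strategies discussed in the remainder of the paper.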
Optimizing a join query. The join is a binary relational operator associated with a constraint on its two input relations. It combines their tuples in a cartesian product fashion and uses the constraint to specify which tuples will be output. Joins appear in all but trivial queries. Their importance is such that much research is devoted to the optimization of queries comprised solely of joins. We also focus on join queries in our study.

The join optimization problem covers the development of join execution algorithms and the design of efficient join ordering techniques. For the execution of a join, several algorithms have been proposed, based on nested loops, on merging and sorting, and on hashing of the input relations [Graefe, 1993]. The join ordering problem concerns the relative placement of joins in a QEP. Each repositioning of nodes and each change of a node's join algorithm produces a new QEP. Thus, the search space increases exponentially with the number of joins in the query. Optimizers scanning this space exhaustively are inappropriate for queries containing tens of joins. Such queries occur in databases used for CAD/CAM, expert systems, data mining and decision support; their efficient processing is constantly gaining in importance.

For the optimization of large join queries, combinatorial optimization techniques with polynomially increasing optimization time have been studied in [Swami and Gupta, 1988; Ioannidis and Kang, 1990; Lanzelotte et al., 1993; Spiliopoulou et al., 1996] and elsewhere. However, their efficiency depends on the shape of the cost function, which can vary considerably [Bennett et al., 1991]. Consequently, methodologies based on evolutionary computation are gaining interest, since they are more robust towards varying cost function shapes.

Genetic algorithms for join query optimization have been studied in [Bennett et al., 1991; Steinbrunn et al., 1993]. They transform QEPs into chromosomes, on which crossover, selection and mutation are applied. The fitness function is based on the query cost model, which is designed for QEPs. Therefore, the chromosomes must be transformed back to QEPs to compute their fitness. Those transformations are complicated and expensive. More seriously, the crossover operator may destroy the tree structure of a chromosome, so the QEP must be "repaired" before computing its cost.

Despite these shortcomings, the results of [Bennett et al., 1991; Steinbrunn et al., 1993] are encouraging. This indicates that evolutionary computation is appropriate for the query optimization problem, but the methodology has to be revised. This is the goal of our study. Rather than genetic algorithms, we use genetic programming, showing that the QEPs can be conveniently used as genetic programs.
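To give a feel for the exponential growth of the search space mentioned above, the following sketch computes an upper bound on the number of join trees. It is an illustration under stated assumptions, not a figure from this paper: it counts all ordered binary trees over n distinct relations, which is Catalan(n-1) · n! = (2n-2)!/(n-1)!, and multiplies by the number of algorithm choices per join. Cross products are permitted in this count, so the number of plans an optimizer actually considers may be smaller. The function name is hypothetical.

```python
from math import factorial


def join_tree_upper_bound(n_relations: int, n_algorithms: int = 1) -> int:
    """Upper bound on the number of join trees over n_relations relations.

    Counts ordered binary trees with distinct labelled leaves,
    (2n-2)! / (n-1)!, times one algorithm choice per join node.
    Cross products are NOT excluded, hence an upper bound only.
    """
    n = n_relations
    shapes_and_orders = factorial(2 * (n - 1)) // factorial(n - 1)
    return shapes_and_orders * n_algorithms ** (n - 1)


# Growth for a few query sizes, assuming three join algorithms per node.
for n in (3, 5, 10, 19):
    print(n, join_tree_upper_bound(n, n_algorithms=3))
```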
3 Genetic Programming for Join Queries
We use genetic programming to optimize queries with up to hundreds of joins. The QEPs of a join query consist solely of join operators. We use the term "join tree" or "tree" to refer to the operator tree of the QEP of a join query, and the term "join-node" or "node" to refer to the non-leaf nodes denoting the operators.

3.1 The Representation Structure
A QEP can be viewed as a program in an abstract tree representation, which is evaluated bottom-up by the database system. The relations are the terminals and the joins are the functions of the genetic program. Hence, the QEP satisfies the structural requirements of the genetic programming method, which applies the search paradigm of genetic algorithms developed in [Holland, 1975; Goldberg, 1989] to a tree representation of an abstract computer program.

The input and the output of all operators in a QEP are relations. So, the nodes of a QEP satisfy the closure property defined in [Koza, 1991] (chapter 6.1.1). However, the random placement of nodes in the QEP can introduce cross products and artificial joins. "Cross products" materialize joins between relations that are not directly connected on the query graph; for the query graph of Fig. 1, the operator R1 ⋈ R7 in a QEP would be a cross product. "Artificial joins" are joins between two instances of the same relation. For the example query graph, consider a subtree of a QEP in which joins J1 and J5 are performed simultaneously, as in Fig. 2. Then, the fragments of R5 contained in the tuples output by the two joins satisfy different constraints. However, the fragments must satisfy the conjunction of the join constraints. Thus, an artificial join is introduced to enforce the conjunction constraint.
[Figure 2: A QEP subtree with an artificial join]

Cross products and artificial joins are expensive and increase the size of the search space. We therefore disallow them by introducing the notion of a valid tree.

Definition 1: The join tree of a QEP is valid if:
1. it contains no cross products: the left, resp. right, relation of each join node is referenced in the left, resp. right, subtree below that node
2. it contains no artificial joins: each relation and each join-node appears only once in the tree

A QEP with a valid tree is a valid QEP. Obviously, if a tree is valid, all its subtrees are valid. Hereafter, we consider only valid QEPs and use the terms (valid) "QEP" and "tree" interchangeably. By this definition, the search space of a join query consists of all valid QEPs.

To preserve the validity property, random placement of join nodes in a tree is not allowed. Instead, we define our crossover and mutation operators so that they always produce valid QEPs. Hence, the functions of the genetic program/QEP are closed with respect to our crossover and mutation.

Each join node has a label consisting of the join identifier, the algorithm identifier (e.g. Nested Loop 'l', Hash 'h', Sort merge 's') and the join constraint. It also contains pointers to its left and right subtree: the left and the right input streams are processed in different ways, so the distinction between them is important for the node's execution time.

In Fig. 3, we show two valid QEPs T1 and T2 for the join graph in Fig. 1. Non-leaf nodes are surrounded by a circle. All nodes are identified by their number, e.g. 12 for join J12 and R12 for relation R12. To keep the figure simple, we omit the labels of the join nodes, as well as the leaf nodes and the edges pointing to them. We retain the join algorithm identifier and show the leaf nodes only for the two marked subtrees S1 and S2, which are used in later examples.
[Figure 3: Join trees T1 and T2]
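The validity conditions of Definition 1 can be checked mechanically. The following is a minimal sketch under assumptions: the types `Rel` and `Join`, the field names, and the three-relation chain query A, B, C with joins J1(A, B) and J2(B, C) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Rel:
    """Leaf node: a database relation."""
    name: str


@dataclass(frozen=True)
class Join:
    """Join node: identifier, the two relations named in its constraint,
    its left and right input subtrees, and its algorithm identifier."""
    ident: str
    left_rel: str
    right_rel: str
    left: "Tree"
    right: "Tree"
    alg: str = "h"


Tree = Union[Rel, Join]


def relations(t: Tree) -> List[str]:
    """All relation names in t, with repetitions."""
    if isinstance(t, Rel):
        return [t.name]
    return relations(t.left) + relations(t.right)


def joins(t: Tree) -> List[str]:
    """All join identifiers in t, with repetitions."""
    if isinstance(t, Rel):
        return []
    return joins(t.left) + joins(t.right) + [t.ident]


def is_valid(t: Tree) -> bool:
    """Check both conditions of Definition 1."""
    rels, js = relations(t), joins(t)
    # Condition 2: no relation and no join node appears more than once.
    if len(set(rels)) != len(rels) or len(set(js)) != len(js):
        return False

    # Condition 1: each join finds its left/right relation in the
    # corresponding subtree, so no cross product is needed.
    def no_cross(n: Tree) -> bool:
        if isinstance(n, Rel):
            return True
        return (n.left_rel in relations(n.left)
                and n.right_rel in relations(n.right)
                and no_cross(n.left) and no_cross(n.right))

    return no_cross(t)


A, B, C = Rel("A"), Rel("B"), Rel("C")
valid = Join("J2", "B", "C", Join("J1", "A", "B", A, B), C)
cross = Join("J1", "A", "B", A, C)             # B is missing on the right
artificial = Join("J2", "B", "C",
                  Join("J1", "A", "B", A, B),
                  Join("J3", "B", "C", B, C))  # relation B appears twice
```

Here `is_valid(valid)` holds, while the other two trees violate condition 1 and condition 2 respectively.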
3.2 Crossover, Selection and Mutation
The potential of genetic algorithms has not been sufficiently exploited in query optimization, because the string representation and the crossover operators used in [Bennett et al., 1991] and [Steinbrunn et al., 1993] were rather inappropriate: small changes to a genome string produced large changes in the trees. Moreover, the Partially Matched Crossover operators adapted from [Goldberg, 1989] did not obey the semantics of building blocks for QEPs: an arbitrary string selected from a chromosome for crossover is often not a subtree. Hence, its insertion into another chromosome requires "repairing" the tree. The tree produced after repair has lost most of the structural information of its ancestor. In our GP model, we use trees/QEPs as they are, so the structural information is properly inherited from ancestors to successors. Our crossover and mutation operators produce valid QEPs according to the GP principles.
Crossover. According to the GP paradigm, crossover is applied on two trees T1, T2, in which two subtrees S1, S2 have been selected, as marked in Fig. 3. Crossover produces two trees of the next generation, NG1 and NG2, each of which must:
(a) be valid,
(b) inherit most of the structure of its parent, and
(c) contain the selected subtree of the other parent.
Definition 2: postorder(Ti) is the list of join nodes produced by traversing tree Ti in postorder, i.e. in left child, right child, root order.

Definition 3: leaves_of(Ti) is the set of relations in Ti.

Definition 4: The tree-building operator takes two arguments (nodelist, T_set), an ordered list of join nodes and a set of trees or relations, and produces a valid tree as follows (see footnote 1):

    foreach join J in nodelist {
        A := element of T_set that references the left relation of J;
        B := element of T_set that references the right relation of J;
        build new tree T, where J is its root, A its left child and B its right child;
        remove A and B from T_set;
        insert T into T_set;
    }

Footnote 1: A similar algorithm appears in [Bennett et al., 1991] for the transformation between the tree and its chromosome representation.

The operator of Definition 4 produces a valid tree if all trees in T_set are valid.

Definition 5: crossover(T1, S1, T2, S2) is applied on two valid trees T1 and T2, from which two subtrees S1 and S2 are selected. It produces two valid trees NG1 and NG2 of the next generation by applying the operator of Definition 4 to the following arguments:

    NG1: nodelist = postorder(T1) − postorder(S2),  T_set = {S2} ∪ (leaves_of(T1) − leaves_of(S2))
    NG2: nodelist = postorder(T2) − postorder(S1),  T_set = {S1} ∪ (leaves_of(T2) − leaves_of(S1))

Example 1: We apply crossover to the trees T1, T2 of Fig. 3 using their marked subtrees S1 and S2. The next generation trees NG1 and NG2 are shown in Fig. 4. To construct NG1, we proceed as follows. We create the postorder list of T1 and remove the join nodes of S2 from it. We then apply the operator of Definition 4 on the remaining postorder list and on the set comprised of S2 and the relations (leaf nodes) of T1 that do not belong to S2. So, the T_set parameter consists of the intact subtree S2 and a number of relations. By selecting the nodes from the postorder list and connecting them to form NG1, we practically restore T1: the only nodes missing are those belonging to S2. This subtree, however, belongs to T_set and is attached to NG1 intact. Hence, NG1 inherits most of the structure of T1 and contains S2 in its original form. NG2 is created in a similar way.

Our crossover operator ensures that most of the structural characteristics of the trees are inherited by their successors, while the two subtrees selected for crossover are attached intact to the new trees. Note that if S1 and S2 have no nodes in common, crossover rebuilds S1 in NG1 exactly as it was in T1 (and likewise S2 in NG2). Hence, both subtrees appear intact in both trees of the next generation. Thus, a presumably good subtree of one ancestor is combined with the structure and the original join nodes of the other ancestor into a valid successor tree.
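Definitions 2 to 5 can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the operator of Definition 4 is given the hypothetical name `build`, the types `Rel`, `JoinLabel` and `Node` are invented for the sketch, join nodes are matched by identifier when subtracting postorder lists, and the query graph is assumed to be acyclic (as in Fig. 1) so that every tree uses the same set of joins. The example uses a hypothetical chain query A, B, C, D rather than the trees of Fig. 3.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple, Union


@dataclass(frozen=True)
class Rel:
    """Leaf node: a database relation."""
    name: str


@dataclass(frozen=True)
class JoinLabel:
    """Label of a join node: identifier, joined relations, algorithm."""
    ident: str
    left_rel: str
    right_rel: str
    alg: str


@dataclass(frozen=True)
class Node:
    """Join node with its left and right input subtrees."""
    label: JoinLabel
    left: "Tree"
    right: "Tree"


Tree = Union[Rel, Node]


def postorder(t: Tree) -> List[JoinLabel]:
    """Definition 2: join labels in left child, right child, root order."""
    if isinstance(t, Rel):
        return []
    return postorder(t.left) + postorder(t.right) + [t.label]


def leaves_of(t: Tree) -> FrozenSet[str]:
    """Definition 3: the set of relation names in t."""
    if isinstance(t, Rel):
        return frozenset({t.name})
    return leaves_of(t.left) | leaves_of(t.right)


def build(nodelist: Sequence[JoinLabel], t_set: Sequence[Tree]) -> Tree:
    """Definition 4: connect the elements of t_set with the joins in nodelist."""
    pool = list(t_set)
    for j in nodelist:
        a = next(t for t in pool if j.left_rel in leaves_of(t))
        b = next(t for t in pool if j.right_rel in leaves_of(t))
        pool.remove(a)
        pool.remove(b)
        pool.append(Node(j, a, b))
    (root,) = pool  # exactly one tree must remain
    return root


def crossover(t1: Tree, s1: Tree, t2: Tree, s2: Tree) -> Tuple[Tree, Tree]:
    """Definition 5: each child keeps its parent's joins plus the other subtree."""
    def child(parent: Tree, sub: Tree) -> Tree:
        sub_ids = {j.ident for j in postorder(sub)}
        nodelist = [j for j in postorder(parent) if j.ident not in sub_ids]
        rest = sorted(leaves_of(parent) - leaves_of(sub))
        return build(nodelist, [sub] + [Rel(r) for r in rest])
    return child(t1, s2), child(t2, s1)


# Hypothetical chain query A - B - C - D with joins J1, J2, J3.
A, B, C, D = (Rel(n) for n in "ABCD")
s1 = Node(JoinLabel("J1", "A", "B", "l"), A, B)
t1 = Node(JoinLabel("J3", "C", "D", "l"),
          Node(JoinLabel("J2", "B", "C", "l"), s1, C), D)   # left-deep tree
s2 = Node(JoinLabel("J3", "C", "D", "h"), C, D)
t2 = Node(JoinLabel("J2", "B", "C", "h"),
          Node(JoinLabel("J1", "A", "B", "h"), A, B), s2)   # bushy tree
ng1, ng2 = crossover(t1, s1, t2, s2)
```

In this example NG1 keeps the joins J1 and J2 of T1 with their nested-loop labels and receives the hash-join subtree S2 unchanged, while NG2 keeps J2 and J3 of T2 and receives S1 unchanged.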
[Figure 4: Next generation trees NG1 and NG2]

Mutation. We consider two mutation operators:

mutate1(T, N, newAlg) changes the join algorithm into newAlg for a randomly selected node N in tree T. The structure of T is not affected, but its cost, and hence its fitness value, does change.

mutate2(T, N, parentOf(N)) swaps the positions of two neighbouring nodes, the randomly selected node N and its parent. A repair action might be needed to produce a valid tree according to Def. 1.
Selection. Our selection operator is adopted from [Koza, 1991]. It uses a fitness-proportionate selection method. We form the next generation by recombining 90% of the old population with crossover, mutating 5% and reproducing the remaining 5%.
3.3 Fitness Function and Cost Model
The fitness function of our GP model is based on the "cost" of a QEP, defined as the total execution time from the earliest retrieval of a database relation until the root node of the QEP has finished generating the output. Unlike in other GP applications, the cost of a QEP is not computed by executing the QEP but predicted by a cost model that reflects the factors of execution time. Then, according to fitness-proportionate selection, the fitness of a QEP is its normalized execution cost, where the fittest QEP has fitness 1.

We consider a database in a multiprocessor environment. Parallelism is exploited by executing the join nodes simultaneously: nodes belonging to different subtrees can be executed concurrently, while neighbouring nodes can be executed in "pipeline", where the parent node processes the output of its children as it is produced. In this environment, we consider two cost models.

A cost model is a mathematical approximation of the real behaviour of a QEP. It must be as fine as possible to reliably reflect the actual QEP cost. The finer a cost model, though, the more complex and time-consuming the evaluation of QEP cost becomes. By considering two cost models with different search space shapes, we can study their impact on the efficiency of the search strategy.

The first model, "CM-1", reflects concurrent execution of nodes in different subtrees: once the input streams of a node are available, the node can start execution, possibly in parallel with other nodes. When a node completes execution, its output is written to disk, from which it is retrieved by the parent. The second cost model, "CM-2", also incorporates pipelining: the output of a node is not written to disk but sent immediately to its parent, which is executed simultaneously with its children. Those cost models are analyzed in [Stillger and Spiliopoulou, 1996].
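The fitness normalization and the selection scheme of Section 3.2 can be sketched as follows. Several details are assumptions made for illustration: fitness is taken as the minimum cost divided by the QEP's cost (one normalization under which the fittest QEP has fitness 1), the function names are hypothetical, and the `crossover` and `mutate` arguments are placeholders supplied by the caller.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Q = TypeVar("Q")  # a QEP, whatever its concrete representation


def fitness(costs: Sequence[float]) -> List[float]:
    """Assumed normalization: cheapest QEP gets 1, costlier ones less."""
    best = min(costs)
    return [best / c for c in costs]


def select(population: Sequence[Q], fits: Sequence[float],
           rng: random.Random) -> Q:
    """Fitness-proportionate (roulette wheel) selection of one individual."""
    return rng.choices(population, weights=fits, k=1)[0]


def next_generation(population: Sequence[Q], costs: Sequence[float],
                    crossover: Callable[[Q, Q], Tuple[Q, Q]],
                    mutate: Callable[[Q], Q],
                    rng: random.Random,
                    p_cross: float = 0.90, p_mut: float = 0.05) -> List[Q]:
    """Build a new population: 90% crossover, 5% mutation, 5% reproduction."""
    fits = fitness(costs)
    n = len(population)
    n_cross = int(round(p_cross * n)) // 2 * 2   # crossover yields pairs
    n_mut = int(round(p_mut * n))
    new: List[Q] = []
    while len(new) < n_cross:
        new.extend(crossover(select(population, fits, rng),
                             select(population, fits, rng)))
    for _ in range(n_mut):
        new.append(mutate(select(population, fits, rng)))
    while len(new) < n:                          # plain reproduction
        new.append(select(population, fits, rng))
    return new


# Toy usage: "QEPs" are just integers whose cost is their value.
rng = random.Random(0)
pop = list(range(1, 101))
new_pop = next_generation(pop, [float(x) for x in pop],
                          crossover=lambda a, b: (a, b),
                          mutate=lambda x: x, rng=rng)
```

With a population of 100 this produces 90 individuals by crossover, 5 by mutation and 5 by reproduction; cheaper QEPs are selected more often because their fitness is higher.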
4 Current Status
Our GP model is part of a larger model for parallel query optimization. For small join queries, the search space is scanned exhaustively [Spiliopoulou et al., 1993]. For larger queries, two variations of iterative improvement have been implemented [Spiliopoulou et al., 1996].
4.1 Architecture of the Model
The optimization module is parallel and has been implemented on several platforms; its most recent version runs on a network of Sun™ workstations, organized according to the client-server paradigm. The clients scan different areas of the search space simultaneously. Parallel optimization is subject to one or more termination criteria, such as maximum optimization time or a threshold value for QEP cost. When the termination criterion is met, each client sends its least expensive QEP to the server. Among the QEPs received, the optimal one is the QEP with the lowest cost.

The exploitation of parallelism during optimization conforms nicely to the inherent parallelism of the distributed genetic algorithm approach [Koza, 1991] (chapter 22). The population is partitioned into subpopulations, which are assigned to different clients/processors and processed concurrently, applying crossover, selection and mutation to generate the next population. The termination criterion may be extended to specify a maximum number of populations. When the termination criterion is met, each client sends its fittest QEP to the server, which selects the optimal QEP. We can enhance this scheme by migrating selected individuals from one subpopulation to another after a number of generations [Koza, 1991], thus establishing cooperation among the clients in the generation of the optimal QEP.
4.2 Performance Experiments
In order to study the behaviour of our technique, we have compared it with iterative improvement, a combinatorial optimization technique frequently used for the optimization of large join queries, especially in parallel environments [Ioannidis et al., 1992; Lin et al., 1994; Spiliopoulou et al., 1996; Swami and Gupta, 1988]. For our experiments, we have generated queries with 10 to 100 joins over a database of relations with 10^3 to 10^6 tuples. The experimentation process is analyzed in [Stillger and Spiliopoulou, 1996], from which the following figures are taken.

The genetic programming technique used a population of 100 to 1,000 QEPs, increasing with the query size. 50 generations were produced. To yield comparable results, iterative improvement produced an equal number of QEPs during its scan of the search space.

In Figures 5 and 6, we show the behaviour of our technique relative to iterative improvement in terms of query cost improvement for each cost model. The cost values are normalized, using iterative improvement as the reference strategy.

For cost model CM-1 (Fig. 5), our technique is better than iterative improvement for small queries. For larger queries, it converges rapidly towards the optimal QEPs of iterative improvement.

[Figure 5: GP performance for cost model CM-1. Curves: iterative improvement, genetic programming; horizontal axis ticks 10 to 100, vertical axis ticks 0.96 to 1.04]

For cost model CM-2 (Fig. 6), most optimal QEPs of GP are located within a distance of 10% from the optimum of iterative improvement. This distance becomes smaller as the query size increases.
[Figure 6: GP performance for cost model CM-2. Curves: iterative improvement, genetic programming; horizontal axis ticks 10 to 100, vertical axis ticks 0.75 to 1.25]

5 Conclusions
In this study, we have proposed a genetic programming model for one of the hardest problems in databases, the query optimization problem. We have shown that the nature of the problem makes it particularly appropriate for GP, since the QEP of a query can be conveniently viewed as a genetic program. We have specified the search space of our model and presented the GP operators applied on the QEPs of this space to produce the successors of each population. Our experiments have demonstrated that our technique has comparable performance to a widely used classic technique.

We are currently performing a wide set of comparative experiments between our GP model and several other search techniques, including simulated annealing and tabu search. We consider various cost models with different search space shapes, and we study the impact of this difference on the behaviour of the search techniques. It is very important to test whether the GP model remains unaffected by the search space shape, since this would indicate that it is more appropriate for query optimization than the techniques currently used.
References
[Bennett et al., 1991] Kristin Bennett, Michael C. Ferris, and Yannis Ioannidis. A genetic algorithm for database query optimization. Technical Report TR1004, University of Wisconsin, Madison, WI, 1991.

[Codd, 1970] E.F. Codd. A relational model of data for large shared data banks. CACM, 13(6):377–387, 1970.

[Goldberg, 1989] D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley, Reading, MA, 1989.

[Graefe, 1993] Goetz Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73–170, 1993.

[Holland, 1975] J. Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI, 1975.

[Ioannidis and Kang, 1990] Yannis Ioannidis and Y.C. Kang. Randomized algorithms for optimizing large join queries. In SIGMOD Int. Conf. on Management of Data, pages 312–321, Atlantic City, NJ, 1990. ACM.

[Ioannidis and Kang, 1991] Yannis Ioannidis and Y.C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its implications on query optimization. In SIGMOD Int. Conf. on Management of Data, pages 168–177, Denver, CO, 1991. ACM.

[Ioannidis et al., 1992] Yannis Ioannidis, Raymond T. Ng, Kyuseok Shim, and Timos K. Sellis. Parametric query optimisation. In Int. Conf. on Very Large Databases, pages 103–114, Vancouver, Canada, 1992.

[Koza, 1991] John R. Koza. Genetic Programming. The MIT Press, Cambridge, MA, 1991.

[Lanzelotte et al., 1993] Rosana Lanzelotte, Patrick Valduriez, and Mohamed Zaït. On the effectiveness of optimization search strategies for parallel execution spaces. In Int. Conf. on Very Large Databases, pages 493–504, Dublin, Ireland, 1993.

[Lin et al., 1994] E.T. Lin, E.R. Omiecinski, and S. Yalamanchili. Large join optimization on a hypercube multiprocessor. IEEE Trans. on Knowledge and Data Engineering, 6(2):304–315, 1994.

[Spiliopoulou et al., 1993] Myra Spiliopoulou, Michalis Hatzopoulos, and Costas Vassilakis. Parallel optimization of join queries using a technique of exhaustive nature. Computers & Artificial Intelligence, 12(2):145–166, 1993.

[Spiliopoulou et al., 1996] Myra Spiliopoulou, Michalis Hatzopoulos, and Yannis Cotronis. Parallel optimization of large join queries with set operators and aggregates in a parallel environment supporting pipeline. IEEE Trans. on Knowledge and Data Engineering, 1996. To appear.

[Steinbrunn et al., 1993] Michael Steinbrunn, Guido Moerkotte, and Alfons Kemper. Optimizing join orders. Technical Report MIP9307, Faculty of Mathematic, University of Passau, Passau, Germany, 1993.

[Stillger and Spiliopoulou, 1996] Michael Stillger and Myra Spiliopoulou. Exploiting genetic programming in parallel query spaces. Technical report, Institut für Informatik, Humboldt-Universität zu Berlin, Berlin, Germany, 1996.

[Swami and Gupta, 1988] Arun Swami and Anoop Gupta. Optimization of large join queries. In SIGMOD Int. Conf. on Management of Data, pages 8–17, Chicago, IL, 1988. ACM.