The Use of Randomized Search Strategies for Complex Parallel Relational Query Optimization
Harald Kosch
[email protected]
Department of Information Technology - University Klagenfurt, Universitätsstr. 65, A-9020 Klagenfurt (Austria)
tel +43-463-2700-513, fax +43-463-2700-505

Abstract

This article proposes a complete parallel relational optimization methodology based on randomized search strategies, from the theoretical basics to the experimental validation. The methodology is grounded in a survey-like analysis of related search techniques. We argue that randomized search strategies are an effective technique, both in output quality and optimization effort, for complex parallel query optimization. The traditional techniques are tuned and parallelized versions are developed. Furthermore, we describe how the transformation rules of the search strategy can interact with the resource allocation, and we present an allocation model. A series of experiments performed on a 100-relation database with 18 randomly chosen queries demonstrates that an excellent tradeoff between optimization time and optimization quality can be achieved.

Key words : Parallel databases, parallel query optimization, randomized search strategies.

1 Introduction

Modern database applications, such as data mining and decision support, pose several new challenges to parallel relational query optimization and processing [1, 2]. The complexity of the queries and the size of the databases rise significantly compared to traditional systems. Typically, the number of joins and the relation sizes [3] are the

dominant factors in complex query processing; e.g. recent Teradata relational data warehouse queries involved more than 30 joins [4] over some terabytes of data. In this context, the fast development of high performance parallel machines provided with efficient multi-tasking opens the possibility to exploit massive parallelism for each complex query execution and higher throughput of concurrent executions [5]. In addition to the sequential optimization issues (i.e. operator ordering and implementation method choice), a parallel query optimizer must determine the degree of inter-operator parallelism, i.e. the number of operators to be executed concurrently. The number of possible orderings faces a combinatorial explosion, i.e. up to 1.8 × 10^11 for a 10-way join and 4.3 × 10^27 for a 20-way join query [6]. Furthermore, for each participating operator the optimizer must compute the degree of intra-operator parallelism, i.e. the number of processors executing one join operator. Finally, the resource allocation, i.e. processor and memory allocation, must be done. The traditional dynamic programming technique [7, 8] utilized in sequential query optimization performs an almost exhaustive search over the search space. Even if special implementation techniques are used to reject obviously costly execution plans, this technique soon becomes intractable in the parallel context, as pointed out by several authors [9, 10, 11]. Therefore several works propose heuristic-based optimization methods [12, 13, 14, 15]. These methods work well if low-cost partitions of the search space are accessed. However, as the navigation is more or less random, locally advantageous moves might be accepted which preclude access to globally low-cost strategies later on. This is not acceptable for complex database applications. Randomized search strategies offer a way around these problems; they perform local transformations on already complete orderings until a low-cost execution plan is found.
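The combinatorial figures above can be related to the standard count of bushy join trees over n base relations, (2(n−1))!/(n−1)! (n! leaf orderings times the Catalan number of binary tree shapes). This formula and its connection to the cited figures are our assumption, sketched here only for illustration:

```python
from math import factorial

def bushy_tree_count(n):
    """Number of bushy join trees over n base relations:
    n! leaf orderings times the Catalan number C(n-1) of
    binary tree shapes, which simplifies to (2(n-1))!/(n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

# For a 20-way join this yields about 4.3e27, matching the quoted figure.
print(f"{bushy_tree_count(20):.1e}")
```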
The use of randomized search strategies for parallel query optimization was first examined by Lanzelotte et al. [16, 9], based on experiences in sequential optimization [17]. The presented framework only considered intra-operator parallelism. Later work, like Spiliopoulou et al. [10], studied the effectiveness of those strategies for inter-operator parallelism. However, this approach suffers from the fact that it does not consider resource limitations. In this context we propose to study randomized search strategies not only from the point of view of the quality of the solutions they generate, but also from that of their complexity and their integration in the resource allocation process. We will demonstrate that randomized search with an adapted resource allocation offers an excellent tradeoff between optimization time and the quality of the generated solution for complex parallel queries. The paper is structured as follows. Section 2 gives a detailed introduction to parallel relational query optimization and discusses open problems. Section 3 analyzes previous work on search strategies in query optimization. Section 4 introduces how we tuned the traditional randomized search strategies. Section 5 presents the resource allocation of our query optimizer

and how the search strategy interacts with the allocation module. Section 6 describes the experimental validation of our tuning strategies : determination of the algorithm's parameters, comparison to traditional randomized search, and the query complexity/output quality tradeoff. Section 7 concludes this paper and points to future work.

2 From sequential to parallel query optimization

Sequential query optimization can be roughly divided into three steps : the rewriting of the query (e.g. push the selections as near as possible to the scan of the base relations/object sets), the operator ordering problem and the choice of the access methods (e.g. use of an index or not). In the two latter steps, the mapping from a logical query representation (consisting of relational algebra operators) to a physical query representation (consisting of physical operators working on records, data segments and pages) is performed [18]. This mapping is hereafter called the physical operator mapping phase. Most query optimizers use a data-flow model for the logical representation, the so-called processing tree, where the nodes model relational operators and receive their input relations via the incoming edges (see left hand of Figure 1). For the physical representation commonly a physical operator tree is used, which has the same structure as a processing tree, but where the nodes model physical operators. Let us consider a sample query to illustrate the operator ordering and method choice problem, as well as the different query representation forms. The example relation schema consists of three relations from a workshop planning :

Participant(no_Participant, Name, Lab)
Participant_in_WS(no_Participant, no_WS)
Workshop(no_WS, Name, Thematic)

The following SQL query searches for all participant names who intend to visit at least one Workshop with a Thematic dealing with 'Parallel Computing' :

SELECT P.name
FROM Participant P, Participant_in_WS P_WS, Workshop WS
WHERE WS.Thematic LIKE 'Parallel Computing'
AND WS.no_WS = P_WS.no_WS
AND P_WS.no_Participant = P.no_Participant

The output of the rewriter phase is a normalized relational algebra expression, e.g. project_{P.Name}(join(join(P, P_WS), select_{WS.Thematic='Parallel Computing'}(WS))), represented by the processing tree in Figure 1, left hand.
The operator ordering phase takes this processing tree as input and then seeks the best execution order of the joins. Furthermore, it splits each join operator into its physical components.
Figure 1, right hand, shows the generated physical operator tree in the representation form proposed by Schneider [19]. The join is assumed to be executed by a hash join method: first a hash table is Built on the inner relation (B operator) and then the tuples of the outer relation are Probed against this hash table (P operator). Furthermore, sel (S) and proj (PJ) denote the select and project operators. The best operator ordering is selected with respect to a cost model, associating a cost to each ordering. Typical sequential cost models include two weighted cost factors, the number of transferred disk pages and the number of processed tuples. Thus, the chosen operator ordering can differ from that proposed by the rewriter, as in our example.

[Figure 1: Processing tree (after rewriting, left) and physical operator tree (after join ordering and mapping, right), with S1 = sel_{WS.Thematic}, PJ1 = proj_{WS.noWS}, further project operators PJ2-PJ4 (on P_WS.noParticipant; P.noParticipant, P.Name; and P.Name), and the Build (B1, B2) and Probe (P1, P2) operators of the two hash joins.]

In a parallel system, the optimization problem is more complicated with the new dimension introduced by parallelism, especially for the operator ordering and physical operator mapping phases. In general, three forms of parallelization are distinguished :

pipeline parallelism : let us consider the Project operator PJ2 and the Probe operator P2 of the join (WS ⋈ P_WS) ⋈ P. Clearly P2 can start its job as soon as the first tuple has been processed by PJ2 and then work in parallel with the latter.

inter-operator parallelism : let us consider now the Select-Project S1PJ1 and once again the Project operator PJ2. These two operators can be performed in parallel.

intra-operator parallelism : let us assume a select operator is to be run on a relation. The tuples of that relation can be partitioned into sub-relations processed separately.

Obviously, the number of feasible execution strategies increases dramatically [6] as, in addition to the sequential query optimization issues, the degrees of intra- and inter-operator parallelism and the possible pipeline parallelism have to be determined. In order to manage this complexity, a classification of the different parallelization strategies has been designed according to the shape of the processing tree. Two major forms are actually distinguished [20], linear trees and bushy trees. In linear trees, only one join operator is executed at a time. In bushy trees, two or more join operators can be processed at the same time (inter-operator parallelism). More precisely, two classes of linear trees are considered, the left-deep trees (based on a data-parallel strategy) and the right-deep trees (more pipeline oriented). In this article we study the search techniques of a parallel query optimizer to solve the operator ordering problem. This problem must be addressed very carefully, as any form of restriction of the search space of an optimizer takes the risk that the optimal or the suboptimal solutions might be excluded from the search space [15]. Randomized search strategies appear as a very promising solution to determine the operator ordering, because they provide a very good tradeoff of optimization time versus quality of the optimization. The proposed methodologies in the search strategy and in the resource allocation are orthogonal to the question whether a one-phase or a two-phase optimization is chosen. In the two-phase optimization [21], the operator ordering generates one solution without taking the parallel resources into consideration. This solution is then scheduled for parallel execution in a second phase. In the one-phase optimization strategy [9], operator ordering and scheduling go hand in hand. For every generated ordering a physical scheduling is done.
We adopted a one-phase optimization for our shared-nothing implementation, because the risk of excluding a well parallelizable operator ordering in the first phase is very high for the bushy tree space on shared-nothing or hybrid systems [5, 22]. However, if we wanted to run our methods within a two-phase optimization, we would apply the resource allocation only to the solution of the first phase.

Assumptions In the remainder of the article we concentrate on equi-join operators with join predicates of the form R.attr1 = S.attr2 for some relations R and S and attributes attr1 and attr2. However, our approach remains general, because the join operator can be exchanged, without further adaptation of the proposed methodologies, for other complex operators (such as intersection, union and the object-oriented flatten operator [23, 8]). Aggregation operators are supposed to be executed after the joins.

3 Related work on search strategies

Many different combinatorial optimization algorithms have been proposed as search strategies for parallel query optimization. They can be classified into four basic groups : exhaustive search, dynamic programming, randomized search and polynomial heuristic search. Most query optimizers take pride in their extensibility and propose transformation rules to render the optimizer extensible. Exhaustive search and randomized search can be implemented relatively easily with the use of a transformation rule system [24], while dynamic programming is much harder to write with a rule system [8]. We refer the interested reader to the latter references for more information; we concentrate here on the basic structure, the algorithmic feasibility and the quality of the generated solutions.

3.1 Exhaustive search

The exhaustive search generates all possible physical operator trees. In general this is done by first generating all join orderings successively and then choosing for each of these orderings the appropriate method. Obviously, such an approach is of combinatorial complexity (depending on the number of participating relations n) and only feasible for small values of n (typically up to a 6-way join [25]).

3.2 Bottom-up dynamic programming

The extensible bottom-up dynamic programming optimization of Starburst [26], based on that of System R [27], is the most used and tuned operator ordering algorithm [8]. It is actually implemented in several industrial products, such as IBM DB2 PE [5] and the NonStopSql product [28]. For example, for generating a linear tree solution, it works iteratively over the number of relations n, by constructing an optimal physical operator tree based on the expansion of optimal subtrees involving smaller numbers of relations, i.e. the operator tree of i relations (2 < i ≤ n) is built from the optimal subtrees of i − j relations combined with the optimal subtrees of j relations for any j, j < i. The search space for linear-tree scheduling is already exponential in the number of relations [6]. In order to handle such an exponential complexity, a pruning function is introduced to realize a branch and bound in the dynamic programming, i.e. it reduces the number of trees to be expanded in the remainder of the search algorithm. Pruning is achieved by comparing all (sub)trees which join the same set of relations with respect to an equivalence criterion and then discarding all the trees of non-optimal costs. The equivalence criterion in sequential optimization is that two trees can be compared when they involve the

same relations and when their result relations are sorted on the same attribute. This heuristic solution to the optimality decision of the dynamic programming is generally accepted to be sufficient in practice. However, in the parallel context an acceptable solution cannot be guaranteed for the described dynamic programming technique. For instance, reconsider the workshop planning database of section 2, consisting of the tables :

Participant(no_Participant, Name, Lab)
Participant_in_WS(no_Participant, no_WS)
Workshop(no_WS, Name, Thematic)

and the sample query searching for all participant names who intend to visit at least one Workshop with a Thematic dealing with 'Parallel Computing' (the select and project operators are omitted) :

Workshop ⋈ Participant ⋈ Participant_in_WS

Let us assume that the sample query is executed on a shared nothing system, the relation Participant (P) being partitioned on disk1, the relation Participant_in_WS (P_WS) on disk2 and Workshop (WS) on disk2. The two join operators are implemented using a hash-based algorithm. Furthermore, the relation P is supposed to have a smaller size than the relation P_WS. Thus, the dynamic programming will discard, from the two subtrees P ⋈ P_WS and P_WS ⋈ P, the second one (without any knowledge of the remainder of the query processing, at that moment the hash table should be built on the smaller relation). Consequently, the result relation of P ⋈ P_WS is stored on disk1. When joining P ⋈ P_WS with WS, redistribution costs must be taken into account. On the contrary, for the join of P_WS ⋈ P with WS no redistribution costs arise. It can therefore easily turn out that the complete tree (P_WS ⋈ P) ⋈ WS outperforms (P ⋈ P_WS) ⋈ WS, i.e. the dynamic programming has made a false local decision. Ganguly et al. [29] extended the cost model by considering separately the work done on individual resources, in order to nevertheless generate the optimal tree for parallel execution.
Thus they propose to represent the cost as a cost vector, where one component is the response time and the other components are the work done on each of the resources, e.g. the CPUs, the communication network and the disks. The less-than relation must be modified in order to compare two cost vectors. Given two vectors v1 and v2, v1 is less than v2 iff all the components of v1 are smaller than those of v2. Obviously, the efficiency of the branch and bound is drastically reduced [30], as among the subtrees of i relations which join the same set of relations, only those are discarded whose cost vector is larger in all components than that of another subtree. The performance of this algorithm is similar to an exhaustive search and thus not applicable to complex queries. In order to limit the optimization budget for complex queries, the equivalence criterion is in general relaxed. Lanzelotte et al. [31, 16] studied possible relaxation strategies and proposed, for distributed memory systems, to include, in addition to

the sorting property, the data location of the result relation, i.e. two trees which join the same set of relations are said to be equivalent when their result relations are sorted on the same attribute and are materialized on the same set of processors. This renders the optimization methodology tractable only up to a 9-way join. The authors showed that for more complex queries the optimizer ran out of memory. Even if we use the fast implementation of Vance et al. [11], which uses a bit-vector representation for the relations to be processed, we obtain an upper bound for the feasibility. They can handle sequential queries of up to 15 relations in a modest optimization time. Above this limit the authors suggest the use of pure randomized search or a combination of dynamic programming and randomized search! Finally, it should be remarked that the dynamic programming optimization can also be done in a top-down manner (we start with complete trees and then try to rearrange the subtrees). The latter method recently became popular as complete trees are generated very fast [32, 33]. Applied to parallel query optimization, bottom-up and top-down optimization suffer from the same optimization-time explosion [21].
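The weakened pruning condition on Ganguly et al.'s cost vectors can be sketched as a componentwise dominance test; the concrete component layout (response time, CPU, network, disk work) follows the description above, while the numbers are invented for illustration:

```python
def dominates(v1, v2):
    """v1 is 'less than' v2 iff every component of v1 is smaller;
    only then may the plan with cost vector v2 be pruned."""
    return all(a < b for a, b in zip(v1, v2))

# cost vectors: (response time, CPU work, network work, disk work)
plan_a = (10.0, 40.0, 5.0, 20.0)
plan_b = (12.0, 50.0, 7.0, 25.0)   # dominated by plan_a -> prunable
plan_c = (8.0, 60.0, 5.0, 20.0)    # faster but more CPU work -> must be kept

print(dominates(plan_a, plan_b))   # True
print(dominates(plan_a, plan_c))   # False
```

Because incomparable vectors such as plan_a and plan_c must both be retained, far fewer subtrees are pruned than in the scalar-cost case, which explains the near-exhaustive behavior noted above.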

3.3 Randomized search

For managing inter-operator parallelism in complex queries, randomized search strategies have been suggested as one alternative to restrict the search space. Randomized algorithms are well known in combinatorial optimization [34], and two basic variants have been adapted for relational query optimization : simulated annealing and iterative improvement [31, 35, 36, 9]. Furthermore, several extensions have been proposed, such as Toured Simulated Annealing [16] and Two-Phase Optimization [36]. As this is the search strategy we chose, we do not go into detail here and refer the reader to the detailed presentation given in section 4. The general idea is as follows. A randomized algorithm starts from a random state in the search space, i.e. an initial physical operator tree, which can be obtained as a result of the query rewriting, or a state obtained by some simple greedy heuristic, e.g. the augmentation heuristic described in the next section. The algorithm then walks through the search space, evaluating the cost of each state and stopping either when it estimates that it has found the optimum tree or when a predefined optimization time expires. The walk between the states is controlled by transformation rules and a global strategy. Iterative improvement accepts a move from one state to another only if the cost of the destination state is lower than the cost of the source state. Simulated annealing, on the other hand, allows a move to a higher-cost state with a certain probability that diminishes as optimization time moves along. Obviously, as randomized search investigates only a portion of the search space, the result cannot be guaranteed to be optimal. However, it has been shown, for both the parallel and the sequential context, that it is fairly close to the optimal state given sufficient time [36, 10]. The latter sentence reveals the main problem of such approaches and suggests the main criticism which has been made : in the extreme case, a good solution might be found as soon as the first local minimum is reached or, in the worst case, never. Related studies focus mainly on the quality of the generated tree, but rarely on the complexity of the search itself and the integration of the resource scheduling. This is quite astonishing, as the latter are the most crucial points in this kind of technique. Thus, the complexity of our proposed technique will be studied intensively. Furthermore, we will propose an efficient and original parallelization of the randomized search techniques, in order to handle the optimization budget.

3.4 Polynomial heuristic search

Another class of methods to manage the optimization costs is the class of polynomial algorithms. The most straightforward algorithm comes under the greedy paradigm, which builds a complete physical operator tree iteratively from the initial set of base relations [37, 14]. In each iteration, the algorithm greedily obtains the new (sub)tree that has the lowest cost among all possible combinations of joining two elements of the set of not yet considered subtrees. Let n denote the number of participating relations in the query. Furthermore, let us assume that the parallel resource allocation can be done in O(1), e.g. in a shared nothing system when the join is performed on the processors where the inner relation is stored (as in Gamma). Under this hypothesis, the complexity of the algorithm is O(n^2) in memory consumption and O(n^2 log n) in computing time (the search for the best join combination from OpenSet can be made in O(n log n)). Other approaches adapt the dynamic programming optimization technique in order to achieve a polynomial complexity behavior. If a breadth-first search is applied, a restrictive pruning policy is introduced [20] : a bound for the generation of successor trees of the expand() function is set, in such a way that the algorithm complexity reduces to a polynomial one for any given scheduling goal. However, such a technique has not been tested. More popular is the adaption of a depth-first search. In the augmentation heuristic [25], the initial set is restricted to one element, the relation with the least cardinality. Then, at each expansion, only the one successor (sub)tree with the least costs is retained and added to the set of not yet considered trees (possibly another subtree is then pruned). Such an algorithm always behaves as O(n^2) in terms of computing time and O(n) in terms of memory consumption.
An improved variant, the uniform greedy heuristic, produces one complete tree starting from each base relation by applying the augmentation heuristic, and then chooses the least costly among the generated complete trees. The complexity in terms of computing time rises to O(n^3), while the memory consumption stays at O(n). The proposed heuristics perform very quickly and the quality of the generated tree is quite good for linear tree scheduling [16]. However, in the context of bushy tree scheduling, they are very likely to miss the region of good physical operator trees [16]. The reason is that the proposed heuristics perform too few changes for exploring a large search space. This motivated us to use randomized search algorithms for bushy-tree scheduling in parallel, complex query optimization. Note that some works have applied the genetic programming (GP) paradigm to query optimization [38, 39]. The processing tree can easily be associated with a genetic program and the transformation rules with genetic programming operators. Stillger et al. show in [38] that queries of up to 100 joins can be optimized with such a technique. However, these first results are too sparse to compare the GP technique to related methods, such as randomized algorithms or polynomial heuristic search. Furthermore, the performance of the GP technique seems to be very sensitive to the query characteristics and the chosen cost model.
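The augmentation and uniform greedy heuristics can be sketched as follows; the toy cost function and relation sizes are our own placeholders for the optimizer's cost model and resource allocation:

```python
def augmentation(relations, join_cost, start):
    """Augmentation heuristic: grow one linear tree from a start
    relation, always joining in the cheapest remaining relation."""
    tree, remaining, total = start, set(relations) - {start}, 0.0
    while remaining:
        best = min(remaining, key=lambda r: join_cost(tree, r))
        total += join_cost(tree, best)
        tree, remaining = (tree, best), remaining - {best}
    return tree, total

def uniform_greedy(relations, join_cost):
    """Uniform greedy: run the augmentation heuristic once per start
    relation (O(n^3) overall) and keep the cheapest complete tree."""
    return min((augmentation(relations, join_cost, r) for r in relations),
               key=lambda t: t[1])

# toy cost: sum of the (hypothetical) input sizes of the join
sizes = {"P": 100, "P_WS": 1000, "WS": 50}
def toy_cost(tree, r):
    left = sizes[tree] if isinstance(tree, str) else 500  # crude estimate
    return left + sizes[r]

tree, cost = uniform_greedy(list(sizes), toy_cost)
print(tree, cost)
```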

4 Tuning the Randomized search strategies

This section describes the different ways in which we tune the randomized search strategies. First, the transformation rules are modified in order to decrease the number of applied rules. Second, the basic algorithms are parallelized in an original way. Compared to the traditional techniques, the number of required optimization runs is reduced and the parallel machine resources are better utilized.

4.1 Tuning the transformation rules

For join ordering, six transformation rules [20] are distinguished to generate the bushy tree search space :

1. Method choice-transformation: R1 ⋈_method1 R2 → R1 ⋈_method2 R2
2. Swap-transformation: R1 ⋈ R2 → R2 ⋈ R1
3. LeftJoin exchange-transformation: (R1 ⋈ R2) ⋈ R3 → (R1 ⋈ R3) ⋈ R2
4. RightJoin exchange-transformation: R1 ⋈ (R2 ⋈ R3) → R2 ⋈ (R1 ⋈ R3)
5. Join associativity-transformation: (R1 ⋈ R2) ⋈ R3 → R1 ⋈ (R2 ⋈ R3)

The three join exchange rules (Rules 3, 4 and 5) are conditional rules, i.e. even if the left hand side of the rule matches the input expression, the rule may only fire if a supplementary condition (the so-called rule condition) holds. Indeed, the attribute dependencies within the expressions lead to a rule condition for each of the join exchange rules. For example, consider the Join associativity-transformation. Let the join R1 ⋈ R2 be executed over the join predicate R1.attr1 = R2.attr2 and the join (R1 ⋈ R2) ⋈ R3 be executed over the predicate T.attr3 = R3.attr3 with T = R1 ⋈ R2; then the rule can only fire if R2 possesses the attribute attr3. Similar conditions have to be established for the LeftJoin exchange- and the RightJoin exchange-transformation. Remark that the LeftJoin exchange and RightJoin exchange rules are redundant [24]. Transformation-rule optimizers nevertheless use the set of three join exchange rules because one of these rules is always applicable. As they choose the next transformation rule arbitrarily out of the possible set, it is not guaranteed that an applicable one is chosen. In this context we propose to regroup the join exchange rules into one rule which is always applicable. This new rule requires the local study of the attribute dependencies, which can be achieved in a simple way when a trace of the different attribute sets of the relations is kept, which should be the normal case for optimizers. In our objective to implement a fast parallel query optimizer, we also encapsulated the method choice in a separate optimization module, which decides upon an optimal implementation strategy based on a Decision Table established by an exhaustive literature study. The best strategy is determined very fast by a lookup in that Decision Table of typical execution cases. The construction of such a table is possible, as the join implementation strategies have been largely investigated in the literature. In our related paper [41], an exhaustive comparative study of the three basic classes of join algorithms (nested loop, sort-merge and hash join) in a shared-nothing environment, as well as the structure of the Decision Table, is presented.
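Keeping a trace of each (sub)tree's attribute set makes such a rule condition a simple membership test. A minimal sketch, with hypothetical attribute sets from the workshop schema:

```python
def associativity_applicable(attrs_R2, outer_join_attr):
    """Join associativity (R1 |><| R2) |><| R3 -> R1 |><| (R2 |><| R3)
    may only fire if the attribute over which R3 is joined is provided
    by R2 (otherwise R2 |><| R3 would be a cartesian product)."""
    return outer_join_attr in attrs_R2

# workshop example: joining Participant_in_WS with Workshop on no_WS
attrs_P_WS = {"no_Participant", "no_WS"}
print(associativity_applicable(attrs_P_WS, "no_WS"))     # True
print(associativity_applicable(attrs_P_WS, "Thematic"))  # False
```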

4.1.1 Regrouping the join exchange rules

Before explaining the regrouping we should introduce processing trees. Processing trees are a special optimizer data structure to represent an operator ordering (2). The leaves of a processing tree represent the base relations that participate in the query and the intermediate nodes model operators. The latter receive their input relations via the incoming edges and send the result relation through the outgoing edge to the next operator. The root of the tree produces the result of the whole query. For instance, consider the processing tree of Figure 2 representing the join expression q = R1 ⋈ R2 ⋈ R3.

In order to explain our approach of regrouping the join exchange rules, the following conventions must be introduced. A join always depends on the incoming relations that are subject to this join. This way, a parent-child relationship is established between this join and the base relations or other joins that hand down the required data. Following this basic idea, any query can be represented as a processing tree with a level concept, e.g. the query q1 = (R1 ⋈ (R2 ⋈ R3)) leads to the processing tree shown in Figure 2, where J1 depends on the base relations R2 and R3 and is located at level 1, while J2 depends on R1 but also on the newly created intermediate relation provided by J1, and is thus located at level 2. In the example of Figure 3, q2 = ((R1 ⋈ R2) ⋈ (R3 ⋈ R4)), the joins J1 and J2 occupy the same level.

[Figure 2: Sample processing tree — the leaves R1, R2, R3 at level 0, J1 = R2 ⋈ R3 at level 1, J2 at level 2.]

[Figure 3: Processing tree of the query q2 = ((R1 ⋈ R2) ⋈ (R3 ⋈ R4)) — J1 = R1 ⋈ R2 and J2 = R3 ⋈ R4 at level 1, J3 at level 2.]

(1) We follow the common hypothesis of excluding cartesian products from the parallel search space [20, 40, 10].
(2) The processing tree is a representation of the relational algebra [7].
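The level concept can be computed bottom-up: base relations sit at level 0 and each join is located one level above its deepest input. A minimal sketch (strings stand for base relations, 2-tuples for joins; this encoding is our assumption):

```python
def level(node):
    """Base relations (strings) are at level 0; a join (2-tuple)
    is located one level above its deepest child."""
    if isinstance(node, str):
        return 0
    left, right = node
    return 1 + max(level(left), level(right))

q1 = ("R1", ("R2", "R3"))          # J1 at level 1, J2 (the root) at level 2
q2 = (("R1", "R2"), ("R3", "R4"))  # J1 and J2 at level 1, J3 at level 2

print(level(q1))  # 2
print(level(q2))  # 2
```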

The unique JoinExchange rule Given an arbitrary query processing tree structure q, two basic starting points for the JoinExchange can be distinguished (see Figure 4). A, B and C are three subtrees of q, while J1 and J2 are the two joins that are subject to the JoinExchange. As these two configurations differ only in the parent-child relationship between J1 and J2, it is sufficient to regard the different cases that may evolve when applying a JoinExchange to the case of a left parent-child relationship, as shown in Figure 4, left scheme.

[Figure 4: Left and right parent-child relationship between the joins J1 and J2 over the subtrees A, B and C.]

At first, let us suppose that only one of the two subtrees A and B provides relations containing the join attribute of join J2. Then two cases, as shown in Figure 5, can be distinguished. The subtrees providing the join attribute of join J2 are marked.

[Figure 5: Different subtrees may provide the join attribute of J2.]

Taking a closer look at the case where the resulting relations of A and C carry the join attribute (left hand of Figure 5), it becomes clear that J2 must take A and C as its inputs. Four different implementations are then possible : either A or C is the outer relation of J2, and for each of these possibilities, J2 can be the inner or the outer relation of J1. Our JoinExchange implements the case where the result of J2 is the inner relation of J1 and the result of A is the inner relation of J2 (see Figure 6). The three other alternatives can be attained from the chosen JoinExchange implementation by applying at most two supplementary Swap transformations. Figure 7 illustrates the transformation for the case (right hand of Figure 5) where the join attribute is provided by the subtrees B and C. Here, the existing left parent-child relationship between J1 and J2 is changed into a right parent-child relationship. Once again, all other alternatives can be attained by supplementary Swap transformations. Second, considering the case when both subtrees A and B provide the join attribute of J2 (in practice this case is rare), a heuristic approach is chosen to handle


Figure 6: Subtree A is providing the join attribute.


the larger set of successor processing trees. First, the transformation is performed as if only subtree A provided the join attribute. The cost of the resulting processing tree is compared with the cost of the processing tree obtained by a transformation which assumes that only subtree B provides the join attribute. The lower-cost processing tree is retained for further searching.

Figure 7: Subtree B is providing the join attribute.
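The case analysis above can be illustrated on a minimal binary-tree representation of a processing tree. The `Node` class, the child-ordering convention (right child = inner relation) and the helper names below are our own illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A processing (sub)tree: a base relation (leaf) or a join of two subtrees.
    By convention here, the right child is the inner relation of a join."""
    name: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def join_exchange_left(q: Node) -> Node:
    """JoinExchange on a left parent-child relationship J1(J2(A, B), C),
    assuming only subtree A provides the join attribute of J2: J2 is rebuilt
    as a join of A and C, its result becomes the inner relation of J1, and A
    becomes the inner relation of the new J2 (cf. Figure 6)."""
    j2, c = q.left, q.right
    a, b = j2.left, j2.right
    new_j2 = Node("J2", left=c, right=a)     # A is the inner relation of J2
    return Node("J1", left=b, right=new_j2)  # J2's result is the inner relation of J1

def join_exchange_both(q: Node, cost) -> Node:
    """Heuristic for the rare case where both A and B provide the join
    attribute: perform the transformation once as if only A provided it and
    once as if only B did, then retain the lower-cost successor tree."""
    via_a = join_exchange_left(q)
    swapped = Node(q.name,
                   left=Node(q.left.name, left=q.left.right, right=q.left.left),
                   right=q.right)
    via_b = join_exchange_left(swapped)
    return min(via_a, via_b, key=cost)
```

The three remaining alternatives mentioned in the text would then be obtained by applying up to two Swap transformations to the result.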

4.2 Tuning the search algorithms

Randomized search strategies are based either on the Iterative Improvement or on the Simulated Annealing technique. Let us first explain these basic techniques and then describe the commonly used search strategies. The Iterative Improvement (II) strategy accepts only those transformations which reduce the cost function. The Simulated Annealing (SA) algorithm, on the other hand, also accepts transformations generating higher costs. The acceptance depends on a temperature property T of the system: the lower this property, the lower the probability of accepting higher-cost transformations. The generic Iterative Improvement (II) algorithm is shown in Figure 8 (left scheme).

It starts from an initial randomized state(3) and converges to a local minimum state, which is returned by the II.

State Iterative_Improvement()
begin
    S = random state
    while not local_minimum(S) do
        S' = random state in neighbors(S)
        if cost(S') < cost(S) then S = S'
    end while
    return S
end

State Simulated_Annealing()
begin
    S = random state; T = initial T; minS = S
    while T >= 1 do
        while not equilibrium do
            S' = random state in neighbors(S)
            dC = cost(S') - cost(S)
            if dC <= 0 then S = S'
            if dC > 0 then S = S' with probability e^(-dC/T)
            if cost(S) < cost(minS) then minS = S
        end while
        T = reduce(T)
    end while
    return minS
end

Figure 8: (a) The Iterative Improvement and (b) the Simulated Annealing algorithm.

Then, for the Simulated Annealing (SA) algorithm, shown in Figure 8 (right scheme), the inner loop is executed at a fixed temperature T, which controls the probability of accepting higher-cost transformations. This probability is equal to e^(-dC/T), where dC is the difference between the new physical operator tree cost and the old one. Thus the probability of accepting a higher-cost transformation is a monotonically increasing function of the temperature and a monotonically decreasing function of the difference dC. It is therefore more probable that large differences dC are accepted at a high temperature, i.e. at the beginning of the SA. Each inner-loop run ends when the number of inner-loop iterations exceeds an equilibrium condition, e.g. in [36] 16 times the number of participating relations. The temperature is then reduced according to some function reduce(T). The global algorithm stops when the temperature is considered frozen, i.e. when T is smaller than 1.

(3) We call the generated processing trees states, according to the usual convention in randomized optimization algorithms. This convention goes back to Kirkpatrick et al. [42], who applied such algorithms successfully to physical systems.

The commonly used randomized search strategies based on the II and SA techniques are the Repetitive II (RII), the Toured Simulated Annealing (TSA) and the Two Phase optimization (TPO). In the Repetitive II (RII), an Iterative Improvement strategy is applied to several initial, randomized states [10, 36, 20]. This naturally increases the probability of avoiding non-optimal minima. The same idea applied to the Simulated Annealing algorithm leads to the so-called [16] Toured Simulated Annealing (TSA) algorithm. Other variants spend a higher search effort on finding a better initial state, mostly by applying some polynomial heuristic, e.g. the uniform greedy solution in [20], with a computational complexity of O(n^3), or the greedy heuristics [37] with a computational complexity of O(n^2 log n), where n is the number of relations. Such techniques have been applied to all search variants proposed so far and will hereafter be called Modified Start II/SA/RII/TSA algorithms. Clearly, randomized search algorithms will converge faster to good local minima, but the complexity of finding good initial states is not negligible when the number of relations grows. For instance, in the study made by Lanzelotte et al. in [16, 20], for 10-way joins about 200 processing trees (PTs) have to be generated until the start state is found. The latter authors state that choosing a good initial solution for complex queries in the bushy space still remains an open problem. Let us recall that even authors proposing fast implementations of polynomial heuristics, such as Vance [11], doubt whether the cuts introduced in the search tree might not significantly affect the efficacy of the output execution plan in the context of complex queries.
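A runnable rendering of the two generic algorithms of Figure 8 may help; the state space, neighbor function and cost function below are stand-ins for processing trees and the parallel cost model, and the probe-based local-minimum test mimics the r-local minimum discussed later:

```python
import math
import random

def iterative_improvement(random_state, neighbors, cost, probes=16):
    """Generic II (Figure 8a): accept only cost-reducing transformations and
    declare an (r-)local minimum after `probes` consecutive failed random
    neighbor probes."""
    s = random_state()
    fails = 0
    while fails < probes:
        s2 = random.choice(neighbors(s))
        if cost(s2) < cost(s):
            s, fails = s2, 0
        else:
            fails += 1
    return s

def simulated_annealing(start, neighbors, cost, t0, reduce_t, equilibrium):
    """Generic SA (Figure 8b): also accept higher-cost transformations with
    probability e^(-dC/T); stop when the temperature is frozen (T < 1) and
    return the cheapest state seen."""
    s = min_s = start
    t = t0
    while t >= 1:
        for _ in range(equilibrium):
            s2 = random.choice(neighbors(s))
            dc = cost(s2) - cost(s)
            if dc <= 0 or random.random() < math.exp(-dc / t):
                s = s2
            if cost(s) < cost(min_s):
                min_s = s
        t = reduce_t(t)
    return min_s
```

A Two Phase run simply chains the two, feeding the II output to `simulated_annealing` as its start state.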
We overcome this by using a simple O(n) algorithm to find a randomized start state anywhere in the bushy space, avoiding a high-cost search for the start state which cannot really guarantee a low-cost state. The algorithm has a structure similar to the greedy algorithm introduced in section 3.4: we build a complete physical operator tree iteratively from the initial set of base relations. In each iteration, the algorithm obtains a new (sub)tree by joining two randomly chosen (joinable) elements out of the not-yet-considered set of subtrees. Finally, a combination of II and SA has been presented by Ioannidis et al. [36], the Two Phase optimization (TPO). In the first phase, an RII is performed. The output of this phase, the best local minimum found, is the initial state of the second phase, in which an SA is performed. In order to obtain similar quality, the TPO requires fewer runs in its RII phase than the plain RII described above. The reason is that the second-phase SA can escape from a too-high-cost local minimum produced by an II in the first phase. We implemented all these variants and performed some pre-experiments. We quickly recognized that the performance of the pure SA techniques, i.e. the SA and TSA, is very sensitive to parameter settings, which can vary from query to query. In general the exact shape of the cost function is not known at query compile-time, so such algorithms behave very differently. These observations are in accordance with experience from other domains, e.g. the use of the simulated annealing technique for matching problems [34]. The RII and TPO optimizations hold a greater attraction for us, as they offer the most robust approach to query optimization. Even if the parameter settings for the SA run within a TPO are not optimal for the shape of a query's cost function, at least the best local minimum of the RII is returned.
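The randomized start-state construction described above can be sketched as follows; the joinability test is simplified to a shared-attribute check, and the per-relation attribute sets are an assumed input:

```python
import random

def random_bushy_start(relations):
    """Construct a randomized start state anywhere in the bushy space:
    repeatedly join two randomly chosen joinable subtrees from the
    not-yet-considered set. `relations` maps a relation name to its set of
    join attributes; a subtree is a pair (tree, attributes), where a tree is
    a relation name or a nested 2-tuple of trees."""
    pool = [(name, set(attrs)) for name, attrs in relations.items()]
    while len(pool) > 1:
        random.shuffle(pool)
        for i in range(1, len(pool)):
            if pool[0][1] & pool[i][1]:          # subtrees share a join attribute
                t1, a1 = pool[0]
                t2, a2 = pool.pop(i)
                pool[0] = ((t1, t2), a1 | a2)    # the new join replaces both subtrees
                break
        else:
            raise ValueError("join graph is disconnected")
    return pool[0][0]
```

Note that the inner scan makes this sketch worst-case quadratic; the paper's O(n) bound presumably relies on a cheaper joinability test, so treat this as a structural illustration only.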

4.3 Parallelized search strategies

Taking a very close look at the proposed search algorithms, one notices that the II and SA runs in the TPO, RII and TSA algorithms are independent from each other and could thus be executed in parallel. So why not use the capacities of the parallel machine for optimizing the queries? Parallelizing the RII algorithm was first considered in [10], but the idea has not yet been applied to other algorithms. We go one step further and parallelize all variants of the randomized algorithms. To begin with the TSA algorithm, parallelization is straightforward, as each SA run can be performed independently of the others. However, the problem remains which degree of parallelism should be chosen for the RII and TSA algorithms. Spilipoulou et al. [10] distributed the II runs of the RII over all processors. Obviously, this is not possible in a multi-query environment, as query optimization is very memory consuming: running an II for a 15-way join consumed about 15 MB in our experiments, and Michael Stillger reported that for their prototype [38] the memory consumption could easily reach 50 MB for the tested 40-way joins. Therefore only those processors are chosen which hold enough memory for the optimization. The estimated memory consumption of the optimizer can be determined by continuously building statistics on previous optimization runs. In the parallelized TPO, as many SA runs are performed as there were II runs within the RII first phase. Compared to the RII we do not increase the optimization time. Experiments (see section 6) will demonstrate that this parallelized TPO is the most cost-effective technique. In order to generate a good physical operator tree(4) with 95% probability for up to 25-way joins, only 6 processors are required, each independently running one II+SA.

(4) Good means, in the experiments, not more than 10% over the lowest local-minimum cost found in the experiments.
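The scheme of one independent II+SA per processor can be sketched with a process pool; the toy integer search, cost function and parameter values below are illustrative stand-ins, not the paper's optimizer:

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor

def ptpo_run(seed: int) -> int:
    """One independent PTPO run (an II followed by an SA), reduced here to a
    toy search over integers with cost |x - 42|; each run draws from its own
    random stream so that runs are independent."""
    rng = random.Random(seed)
    cost = lambda x: abs(x - 42)
    s = rng.randrange(1000)                 # randomized start state
    for _ in range(5000):                   # II phase: accept downhill moves only
        s2 = s + rng.choice((-1, 1))
        if cost(s2) < cost(s):
            s = s2
    t, best = 10.0, s                       # SA phase starts from the II minimum
    while t >= 1:
        for _ in range(50):                 # equilibrium condition
            s2 = s + rng.choice((-1, 1))
            dc = cost(s2) - cost(s)
            if dc <= 0 or rng.random() < math.exp(-dc / t):
                s = s2
            if cost(s) < cost(best):
                best = s
        t *= 0.5                            # temperature reduction
    return best

def parallel_ptpo(n_workers: int = 6) -> int:
    """Run one II+SA per worker process and keep the cheapest result, as in
    the parallelized TPO."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return min(pool.map(ptpo_run, range(n_workers)), key=lambda x: abs(x - 42))
```

In a real optimizer the worker set would first be filtered by the memory statistics described above, so that only processors with enough free memory participate.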


5 Resource allocation and cost model

This section presents the original resource allocation technique used in our randomized-search-strategy-based query optimizer and introduces the parallel cost model. First, a resource usage model for a multi-query parallel database is presented, and then the novel four-level resource allocation heuristic is introduced. The latter computes the degree of intra-operator parallelism and the processor allocation with regard to the available main memory size, CPU power and disk bandwidth of the processors. Finally, subsection 5.3 describes the interaction between the search strategy and the resource allocation.

5.1 Resource usage model

The work done on the resources in a multi-query parallel database is measured by most related works as the sum of the effective time during which an operator uses the resources [9, 43]. Such an approach makes it difficult to compute the additional cost of timesharing preemptable resources like the CPU and the disk bandwidth. Rather than computing the sum of the elapsed time spent on the resources, we are more interested in the question whether or not a resource is over-used at any moment of the query execution. For the preemptable resources, the CPU and the disk bandwidth, we therefore introduce the term "number of concurrent users" in order to measure the work done at any moment on those resources. A maximal number is fixed a priori, depending on the standard query complexity and the resource availability. Time sharing of the CPU and the disk bandwidth causes additional context-switch costs. We compute these costs according to the work of Mehta et al. [44], done in the context of parallel single-join queries in multi-query databases: once an operator is scheduled on a processor already occupied by some other operators, the operator execution time is increased by 10%. Moreover, beyond a certain number of concurrent processes, the system performance deteriorates. Hence, the maximal number of concurrent users must be chosen such that on each processor no deterioration occurs for normal workload estimates. For computing the work done on non-preemptable resources, like the memory, and for determining their allocation, special features must be introduced. A regulation factor is presented (see [45, 46] for details) which guarantees a fair utilization of the memory. The higher the value of this factor, the more memory is granted to the first join, which consequently leaves less memory to the other joins running in parallel with it. This factor is also set a priori in dependence on the initial memory availability.
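The concurrent-users accounting with the 10% context-switch surcharge can be sketched as follows; the class name and interface are our own illustrative assumptions:

```python
class PreemptableResource:
    """Track the number of concurrent users of a preemptable resource (CPU or
    disk bandwidth) on one processor."""
    def __init__(self, max_users):
        self.max_users = max_users    # fixed a priori from workload estimates
        self.users = 0

    def can_schedule(self):
        """The resource is over-used once the maximal number of concurrent
        users is reached."""
        return self.users < self.max_users

    def schedule(self, base_time):
        """Admit one more operator; if the processor is already occupied by
        other operators, the execution time is increased by 10% (context
        switch surcharge, after Mehta et al. [44])."""
        if not self.can_schedule():
            raise RuntimeError("resource over-used: operator must wait")
        shared = self.users > 0
        self.users += 1
        return base_time * 1.1 if shared else base_time
```

The first operator scheduled on an idle processor pays no surcharge; every later concurrent operator does.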


5.2 Hierarchical allocation heuristic

At the first level, it is examined whether the data locality of the relations can be retained. In such a case, the join is performed on the processors/disks where the relations are stored, thereby avoiding additional communication and memory utilization. This first allocation strategy appeared to us the most important, as all shared resources are used less. The same significance was attributed by many other authors; see for instance the algorithms developed by Hasan et al. [47] or Hameurlain et al. [48] for optimizing the communication costs. Even if communication network speed has increased dramatically over the last years (e.g. faster networks such as optical ones are emerging quickly), the additional costs on the sender and receiver machines cannot be neglected. For instance, on both sides a communication buffer must be provided in memory to store the data to be sent and received, causing memory space problems and additional computing costs in the context of huge relation sizes [10]. Of course, data locality cannot be preserved when the concerned processors do not hold enough resources to process the join, or when the relations are not distributed on the join attribute, which is in general the case for intermediate results. The second-level heuristic therefore tries to distribute the relations in such a way that they fit in the processors' main memory. This avoids cost-intensive temporary I/O accesses. We put this strategy at the second level, motivated by the work done on parallel in-memory databases showing excellent performance [49, 50]. If no appropriate processor allocation is found, the third-level heuristic is called. It splits the relation into partitions to be held in main memory and seeks out the processors which hold the most memory to perform the temporary I/O accesses efficiently. If the processor allocation fails once again, the join operators running in parallel are serialized.
Here, the resource contentions directly affect the computation of the degree of inter-operator parallelism, i.e. the degree must decrease to enable a correct processor allocation by serializing parallel joins. This maintains the original join ordering while adapting more precisely to the machine's workload than earlier approaches do. After the serialization, the resource allocation heuristics are reapplied (starting from level 1). If, even with this new constraint, the allocation is not possible, the transformation must be rejected. If the four levels find a possible processor allocation, it is adjusted in order to optimize the stream of pipeline parallelism, i.e. the latency of a pipeline chain should be as low as possible. If the work of two parallel join operators differs significantly, no adjustment can be found to distribute the work equally. In that case the joins are serialized and the resource allocation heuristics are reapplied (starting from level 1).
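The level cascade can be sketched as a chain of allocation strategies; the processor/join dictionaries and the per-level predicates below are simplified stand-ins for the paper's heuristics:

```python
def keep_data_locality(join, processors):
    """Level 1: perform the join where its relations already reside, if those
    processors hold enough memory; avoids communication entirely."""
    home = [p for p in processors if p["relations"] & join["relations"]]
    if home and sum(p["mem"] for p in home) >= join["mem_needed"]:
        return [p["id"] for p in home]
    return None

def fit_in_memory(join, processors):
    """Level 2: redistribute the operands so they fit entirely in the chosen
    processors' main memory, avoiding temporary I/O."""
    chosen, mem = [], 0
    for p in sorted(processors, key=lambda p: -p["mem"]):
        chosen.append(p["id"])
        mem += p["mem"]
        if mem >= join["mem_needed"]:
            return chosen
    return None

def partition_with_spill(join, processors):
    """Level 3: accept temporary I/O and pick the memory-richest processors
    (crudely, the top two here)."""
    ranked = sorted(processors, key=lambda p: -p["mem"])
    return [p["id"] for p in ranked[:2]] or None

def allocate(join, processors):
    """Try the levels in turn; on overall failure (None) the caller serializes
    parallel joins and restarts at level 1 (level 4 of the heuristic)."""
    for strategy in (keep_data_locality, fit_in_memory, partition_with_spill):
        alloc = strategy(join, processors)
        if alloc is not None:
            return alloc
    return None
```

The serialization step itself is left to the caller, since it changes the degree of inter-operator parallelism of the whole plan rather than of a single join.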


5.3 The resource allocation and the application of transformation rules

The application of a transformation rule to a join (or two joins) implies a reallocation of the resources, a possible serialization and a new choice of the implementation strategies for those joins, i.e. the transformation of the processing tree into a physical operator graph. In comparison to a sequential optimizer, such a reallocation for a join has side effects on the query processing of the remainder of the query. For example, consider the example query tree of Figure 9. Performing a JoinExchange(J3, J4) changes the producer of the outer relation of J5 from J3 to J4. Consequently, the arrival time of the data pages of J4 for pipelined consumption changes, and thus the whole memory availability changes. Therefore, the resource allocation for join J5 has to be reconsidered.


Figure 9: The transformation application has side effects on the query remainder.

Obviously, when applying a local transformation, the resource allocation must be done for all joins present in the processing tree(5). This allocation is achieved by ordering the joins according to their position in the processing tree. The furthest-left join receives the lowest position, 0. Walking to the right until the end of one level, or switching to the next level, increases the position by one. For example, in Figure 9, join J5 has position 4, and join J2 holds position 1. After the ordering of the joins is complete, we are left with the task of detecting which of the joins are parallel. However, this can only be done for the joins having a smaller position (only for these joins has the new resource allocation already been computed). We therefore take the risk of not detecting some of the parallel joins: detecting all of them would require us to traverse the processing tree at least once again, which is not cost-effective.

(5) This is not the case for sequential optimization [17].


Our chosen strategy can also be understood in terms of priority. The joins with the smaller positions have the higher priority to use resources such as memory, CPU and disk. The join operators with the higher positions must use the resources not yet exploited. That does not mean that they generally have fewer resources at their disposal than the joins with a smaller position. A special factor (see [46] for details) regulates the memory attributions. It is a pre-defined function of the user requirements and the system specifications and guarantees the fairness of the resource allocation.
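The position numbering described above can be sketched on a tuple-based tree; the exact tree of Figure 9 is not reproducible from the extraction, so the example tree and the height-based level definition are our assumptions, chosen so that J2 gets position 1 and J5 gets position 4 as stated in the text:

```python
def join_positions(root):
    """Order the joins of a processing tree by position: level by level from
    the leaves upward, left to right within a level; a lower position means a
    higher priority on shared resources. A join node is a tuple
    (name, left, right); a base relation is a plain string."""
    levels = {}                            # level -> join names, left to right

    def visit(node):
        if isinstance(node, str):
            return 0                       # base relations sit below level 1
        name, left, right = node
        level = max(visit(left), visit(right)) + 1
        levels.setdefault(level, []).append(name)
        return level

    visit(root)
    positions, pos = {}, 0
    for level in sorted(levels):
        for name in levels[level]:
            positions[name] = pos
            pos += 1
    return positions
```

Because the recursion visits the left subtree first, joins within a level are collected in left-to-right order, as the text prescribes.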

5.4 Cost model

In the implemented query optimizer, the cost of a query is represented by an estimation of the query response time. The response time is composed of a local and a global time. The local time represents the elapsed time to execute in parallel either a basic operator (atomic operators working on relation partitions, e.g. build hash table, probe against a hash table), a communication operator (representing the data redistributions), or a control operator (operators used to control query processing, e.g. the choose operator [51]) of the execution scenario. The global time combines the different local times and incorporates the time to execute the data dependencies (pipelined or sequential). A precedence dependency between two operators is handled in the same way as a data dependency, set to sequential. In order to account for pipeline parallelism, we have chosen to let the global and local response times have two parts (the first-page and last-page time). The first-page time is the time at which the first page is output by the operator (for the local time) or a subgraph (for the global time). The last-page time is the time at which the operator terminates its processing or the execution of the plan has been completed. This model was first introduced by Ganguly et al. [29] (see subsection 3.2) and was later reused by Ioannidis et al. [52]. The original definition does not include communication costs; in our framework, communication costs are modeled with the help of special communication operators, whose associated local cost represents the communication cost [53].
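One plausible composition of the two-part (first-page, last-page) times for the two dependency kinds can be sketched as follows; the exact formulas are our reading of the model, not the paper's:

```python
def compose_sequential(producer, consumer):
    """Sequential data (or precedence) dependency: the consumer starts only
    after the producer's last page. Times are (first_page, last_page) pairs."""
    p_first, p_last = producer
    c_first, c_last = consumer
    return (p_last + c_first, p_last + c_last)

def compose_pipelined(producer, consumer):
    """Pipelined dependency: the consumer starts as soon as the producer's
    first page arrives; the chain finishes when the slower side completes."""
    p_first, p_last = producer
    c_first, c_last = consumer
    return (p_first + c_first, max(p_last + c_first, p_first + c_last))
```

For a producer (2, 10) and a consumer (1, 5), the sequential composition gives (11, 15) while the pipelined one gives (3, 11), illustrating why pipelining shortens both the first-page and the last-page time of a chain.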

6 Experimental validation

The goal of this experimental section is twofold. First, we examine how the randomized search strategies (the Repetitive Iterative Improvement (RII) and the Parallel Two Phase technique (PTPO)) have to be parameter-tuned for the bushy tree space, and we compare their optimization times. Second, we examine the optimization time/quality tradeoff of our proposed optimization technique.

To start with, we describe the experimental testbed. Afterwards we summarize the parameter settings of previous works and reveal problems. We then report the results of several experiments to set the parameters of the RII and PTPO techniques(6) and to compare their optimization times. Finally, in another experiment we examine the optimization time/quality tradeoff of our optimization technique, implemented with the proposed resource allocation.

6.1 Experimental testbed

We first describe the experimental testbed, i.e. the relation schema and the individual relation characteristics, and then the architecture of the system and the query characteristics.

Relation schema and individual relation characteristics

The relation schema of 100 relations and its corresponding join graph was developed to support various application fields. Thus, three star graphs (subgraphs with one special node which is connected by only one edge to each of several others) were introduced. The corresponding star queries have traditionally been regarded as one of the representative DSS query types [11](7). Our implemented star queries integrate not only relations which are in the star graph but also relations connected in a string to different relations of the star (see Figure 10). The reason is that in practice such mixtures of one big star and some strings are likely to appear.


Figure 10: Implemented queries: star and high connectivity.

(6) Recall that the pure Simulated Annealing (SA) techniques, i.e. the SA strategies which start searching from a randomized initial state, are not considered, because their efficiency is very sensitive to parameter settings which vary from query to query.

(7) Star queries typically correspond to commercial database applications where a lot of small relations are joined to one large one. Some commercial optimizers are especially tuned for optimizing this query type [54].

Several long strings were introduced in the join graph (a string is a subgraph whose nodes have only one incoming and one outgoing edge) in order to implement the so-called string queries. Moreover, subgraphs in which the relations are highly inter-connected were realized (see Figure 10). Queries where at least 50% of the


participating relations belong to such subgraphs will hereafter be called high-connectivity queries. Although star and string queries are considered the DSS representatives, the high-connectivity queries are not unlikely to occur. Different values of relation cardinality, tuple size, number of attributes and number of unique values in the join attribute (which controls the join selectivity) were chosen for the 100 relations. The relation cardinality ranged from 10 tuples (representing small relations such as country code and name) to 50,000,000 tuples (representing large relations such as all clients), the tuple size from 1 to 1,000 bytes, and the number of attributes from 2 to 26.

Architecture of the system and query characteristics

The relations were assumed to be partially declustered over 8, 32 or 64 disks of a shared-nothing system according to the following schema. The relations with a cardinality smaller than 10,000 tuples were randomly allocated to one disk. The relations with a cardinality between 10,000 and 1,000,000 were randomly assigned to half of the available disks. Finally, the larger relations were declustered over all available disks. Eighteen different query types (q1-18) were run, classified according to their complexity (6, 11, 15 and 25 joins) and their query class. The query classes for 6, 11 and 15 joins are: Query 1 implemented a star query, Query 2 a high-connectivity query, Query 3 a string query, and Queries 4 and 5 two random queries. For 25 joins we had only three classes: Query 1 implemented a star query, Query 2 a string query and Query 3 a random query. The sample size for each query type and each experiment was 4000, except for the 25-join queries, where it was 2000 (due to the sharply increasing optimization time compared to the less complex queries; for the same reason we implemented only 3 query types for 25 joins).
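The declustering rule of the testbed is mechanical enough to state as code (the function name is ours; the thresholds are those given in the text):

```python
import random

def decluster(cardinality, disks):
    """Testbed declustering rule: relations with fewer than 10,000 tuples go
    to one random disk; relations with between 10,000 and 1,000,000 tuples to
    a random half of the disks; larger relations to all available disks."""
    if cardinality < 10_000:
        return [random.choice(disks)]
    if cardinality <= 1_000_000:
        return random.sample(disks, len(disks) // 2)
    return list(disks)
```

For an 8-disk configuration, a 500-tuple relation lands on one disk, a 50,000-tuple relation on four, and a 5,000,000-tuple relation on all eight.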

6.2 Parameter setting of previous work

Several parameters of the proposed randomized search strategies are implementation-dependent and must be tuned to improve performance and output quality. Table 1 (for the RII) and Table 2 (for the TPO) show the parameter settings of previous works, which deal with sequential (Ioannidis et al. [36], Swami et al. [35] and Lu et al. [40]) and parallel (Lanzelotte et al. [16, 20] and Spilipoulou et al. [10]) optimization. Let n be the number of relations in the query. A double question mark (??) indicates a setting not mentioned in previous work. For the RII, two parameters must be fixed: the number of II runs (runs) and how a local minimum is detected (local minimum). Either an exhaustive search of all neighbors is performed (p-local minimum), or only n randomly chosen states are tested (r-local minimum). The latter strategy might miss a low-cost transformation and falsely declare a local minimum, yet it saves execution time. Spilipoulou et al. [10]

and Swami et al. [35] furthermore introduce a limiting execution time for each run; however, no experimental validation is given for this limiting factor.

                 Ioannidis          Swami          Lu                 Lanzelotte     Spilipoulou
  runs           equal time to TPO  n              n                  n              n
  local minimum  r-local            p-local        r-local            r-local        r-local

Table 1: Parameter settings by previous work for the RII search strategy.

                 Ioannidis          Swami          Lu                 Lanzelotte
  runs           10                 1              n                  n
  initial T      0.1*best RII cost  2.0*init cost  0.1*best RII cost  0.1*init cost
  reduction T    0.95*T             ??             ??                 ??
  equilibrium    16*n               ??             n                  n

Table 2: Parameter settings by previous work for the TPO search strategy.

Parameter setting for the TPO includes the number of runs for the RII and the way the local minimum is determined(8), the initial temperature of the SA phase (initial T), the temperature reduction (reduction T) and the equilibrium condition for the inner loop (equilibrium). In previous work which does not implement a TPO, the TPO parameter settings correspond to an SA for Swami et al. and to a TSA for Lanzelotte et al. Spilipoulou et al. have not implemented any SA algorithm.

6.3 Parameter tuning the Repetitive Iterative Improvement

As described above, the RII technique requires only the setting of the number of Iterative Improvement (II) runs. In the first experiment, the percentage of local minima whose costs are no more than 10% above the optimal cost found was examined (hereafter called the II hit-rate). Based on that II hit-rate, an average number of runs is calculated for which the RII finds a local minimum cost below the optimal cost found plus 10% with a probability of more than 95% (hereafter called the RII hit-rate). Figure 11 shows the average II hit-rate for every query type. For the 6-way joins, the average II hit-rate is 33%; for the 11-way joins, it decreases significantly to 8%, and for the 15-way joins to 6%.

(8) Here, always the r-local minimum is performed; this parameter setting is therefore not mentioned in Table 2.

Astonishing, however, is that for the 25-way



join, the average II hit-rate increases slightly, to 12%. This could be explained by the smaller sample size of the experiments compared to the less complex queries.

Figure 11: Average hit-rate of the II.

The number of runs required to achieve an RII hit-rate of 95% is shown in Table 3. The required number of runs is relatively high and renders the RII technique costly, even when parallelized.

  Query complexity  Number of runs
  6 joins            9
  11 joins          36
  15 joins          48
  25 joins          26

Table 3: Number of runs in the RII technique required for a 95% hit-rate.

Among the query types, the II hit-rate for the star queries (type 1) is the best. This is expected, as in these queries the participating relations are less inter-connected and thus the search space is smaller compared to the other query types. The high-connectivity queries (type 2) generate the largest space, and the results reveal the smallest II hit-rate.
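The number of runs needed for a given hit-rate follows from a standard independence argument: if a single II run succeeds with probability p, then k runs succeed at least once with probability 1 - (1 - p)^k, so k >= ln(0.05)/ln(1 - p) for a 95% target. This closed form is our reading (the paper reports the measured per-query averages, whence small deviations), but it reproduces the Table 3 entry for the 11-way joins:

```python
import math

def runs_for_hit_rate(p, target=0.95):
    """Smallest number of independent runs k with 1 - (1 - p)**k >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))
```

With the measured II hit-rate of 8% for the 11-way joins, `runs_for_hit_rate(0.08)` yields 36, matching Table 3.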

In this context, the assumption of related works that the II hit-rate decreases proportionally with the number of relations, which was not validated experimentally, could not be confirmed. Our experiments demonstrate clearly the opposite: only from 6 to 11 joins can a clear decrease of the hit-rate be observed; with more complex queries, the hit-rate hovers around 10%.

6.4 Parameter tuning the Parallel Two Phase optimization

The parameter settings for the TPO technique are more complicated than those for the RII. The classic approach (sequential TPO) consists of running one SA with the lowest local minimum found in the first-phase RII as input. In this context, the original parallelized TPO technique is studied, where as many SA runs are performed in parallel in the second phase of the TPO as there are parallel II runs in the first phase. Such a technique is referred to as the PTPO technique, and an II run followed by an SA run is called a PTPO run. The parameter settings for that SA run are the initial temperature, the temperature reduction and the equilibrium condition. As mentioned above, in order to find near-optimal local minima with the pure SA techniques for different query characteristics, very different settings are required. But for the SA integrated in the TPO, the parameter settings are less critical. This is because the first-phase local minimum cost is generally much lower than the randomized sample costs. Thus, it is not as catastrophic if the SA fails to find an improvement over the first-phase local minimum cost as when the pure SA techniques, starting from a randomly sampled state, achieve no real cost improvement. The optimization budget of the second-phase SA can therefore be strictly limited. The experiments were run with two optimization budgets: the number of applied SA transformations is limited either to one (variant 1) or to two times (variant 2) the number of applied transformations of the first-phase II. Furthermore, based on the findings of previous works, the equilibrium condition is set to the number n of relations participating in the query (see also section 4). Finally, for each of the two optimization budgets, the initial temperature is set once to 0.1 times the local minimum cost of the first-phase II (variants 1a and 2a), and once to 1.0 times the local minimum cost of the first-phase II (variants 1b and 2b).
The temperature reduction reduce(T) can then be calculated and expressed as (N denotes the number of applied transformations for the SA phase, and T_initial the initial temperature):

    reduce(T) = T - (n/N) * T_initial

i.e. a linear cooling schedule that freezes the temperature (T < 1) after the N/n temperature reductions permitted by the budget. Experimental results show that all PTPO variants (1a, 1b, 2a, 2b) have nearly the same hit-rate. When looking at the average local minimum costs achieved by the different variants, it turned out that the 1a variant performed best in about half the cases.


Figure 12: Average hit-rate of the PTPO run.

Figure 12 shows the average hit-rate of the PTPO run (only for the 1a variant) for every query type as a dashed line (curve marked TPO), together with the average II hit-rate, shown as a solid line (curve marked II). The hit-rate of the PTPO run increases, compared to the II hit-rate, on average by a factor of 3.8. Table 4 shows the number of PTPO runs needed to achieve a 95% hit-rate for the PTPO technique. Compared to the RII technique, the PTPO technique (variant 1a) requires double the number of applied transformations, but still reduces the number of required runs by significantly more than half. Thus, the PTPO technique turns out to be the most effective search algorithm. Furthermore, Table 4 illustrates that with only 6 processors available for optimization, each PTPO run can be assigned to a different processor.

  Query complexity  Number of runs
  6 joins           4
  11 joins          6
  15 joins          6
  25 joins          6

Table 4: Number of runs in the PTPO technique required for a 95% hit-rate.

6.5 Optimization time versus estimated processing time of the Two Phase optimization


In a new series of experiments, the optimization effort of the PTPO technique was studied. The time to perform an optimization varies on average from 2.5 seconds (6 joins) on four SUN Sparc 20 workstations, to 6.3 seconds (11 joins), to 21.7 seconds (15 joins) and to 67.7 seconds (25 joins) on six SUN Sparc 20 workstations. Considering only these optimization times, however, distorts a correct complexity analysis, as long-running queries can afford more time-intensive optimization. Thus, for each query complexity, the average optimization time of the PTPO was compared to the average estimated query processing time, calculated by the optimizer for the example of a 32-processor system (Figure 13). Note, however, that this ratio differs between small and large numbers of processors. The optimization time increases with the number of processors, because more resource scheduling is required, while the estimated query processing time decreases. For instance, for the 15-way join, the optimization time for 32 processors is about five times that for 8 processors, and about a third of that for 64 processors.

Figure 13: Percentage of optimization time versus processing time for the PTPO technique.

Figure 13 illustrates that very acceptable optimization times were achieved, even for 25 joins. However, the results for 25 joins indicate that the proposed optimization techniques will become intractable for super-complex queries. The use of such queries is very limited, and they are unlikely to become more popular in the future. Some research exists in the field of this "super-complex" query optimization [55, 56, 39]. Stillger et al. [38] seem to present the most promising technique, based on genetic programming, which can be viewed as a kind of extension of randomized techniques (see section 3.4).

Finally, it is quite interesting to look at the total memory requirement of the PTPO technique: on average 1.5 MB (6 joins), 3.1 MB (11 joins), 4.3 MB (15 joins) and 14 MB (25 joins). This demonstrates a major advantage of randomized search strategies over the related methods (breadth-first and depth-first search and polynomial heuristic search, see section 3), because the latter require much more information to be kept in main memory.

6.6 Summary

This section presented how the RII and the TPO have to be tuned to exploit the bushy tree space. Furthermore, the two search strategies were compared with respect to their complexity and output quality. The experimental results demonstrated that the parallelized TPO (PTPO) technique is the most effective search algorithm. A maximum of 6 runs was required to achieve, with a probability of more than 95%, a local minimum cost below the optimal cost found plus 10%. The parallelized TPO technique can thus, with only six processors available for optimization, allocate the TPO runs to different processors. In this context, the assumptions of related works, which were for the most part not validated experimentally, were checked. Some of the assumptions were confirmed; others had to be corrected. For instance, a relatively high number of runs was confirmed to be necessary for the RII to find a good physical operator tree. However, the common assumption that the RII hit-rate (i.e., the probability that a single RII run finds a local minimum cost below the optimal cost found plus 10%) decreases proportionally with the number of relations could not be validated. Our experiments showed a clear decrease of the hit-rate only from 6 to 11 joins; for more complex queries, the hit-rate hovers around 10%. Finally, the optimization complexity was examined. The results show that very acceptable optimization times can be achieved: the ratio of optimization time to estimated processing time for a 32-processor system was on average only 5%.
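The hit-rate criterion used throughout this comparison can be made concrete with a small sketch; the run costs below are hypothetical values, chosen only to illustrate the computation:

```python
def hit_rate(run_costs, tolerance=0.10):
    """Fraction of optimizer runs whose final plan cost lies within
    `tolerance` (here 10%) of the best cost found over all runs --
    the hit-rate criterion used to compare RII and PTPO above."""
    best = min(run_costs)
    threshold = best * (1.0 + tolerance)
    hits = sum(1 for cost in run_costs if cost <= threshold)
    return hits / len(run_costs)

# Hypothetical per-run plan costs from four optimizer runs:
costs = [104.0, 98.5, 131.2, 100.3]
print(hit_rate(costs))  # -> 0.75 (three of four runs within 10% of 98.5)
```

A strategy with a higher per-run hit-rate needs fewer independent runs to reach the 95% confidence level discussed above.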

7 Conclusion and future work

Efficient parallel processing of complex relational queries, with processing times measured in units of hours, depends to a high degree on the quality of the query optimizer. The technical design of a parallel query optimizer relies on the careful conception and tuning of all its components. In this paper, we addressed the most important problems in parallel query optimization: the operator ordering and the physical operator mapping phase. We reconsidered randomized search strategies and tuned them for efficiently searching the bushy tree space with a very good tradeoff between optimization effort and output quality. Our main contributions are threefold.

First, a complete description, not yet considered in this context, of the randomized search strategies, including the transformation rules, search algorithms, initial-stage computation and their parameter settings, was presented. We tuned each component individually: the transformation rules were regrouped, the initial-stage computation was accelerated, the parameter settings were experimentally revised, and finally the search itself was parallelized.

Second, an intelligent resource allocation module, especially adapted to the transformation-based nature of the randomized search strategies, was presented. It manages heterogeneous resource availabilities and resource contention and handles the additional cost of time-shared resources.

Third, rigorous experiments based on a decision support database schema of 100 relations and 18 queries were performed to illustrate that the parallelized search strategies, combined with our hierarchy-based resource allocation, are very cost-effective methods which achieve high-quality execution plans.

Finally, let us remark that our methodologies also apply to object-oriented (relational) databases, as long as no methods and complex datatypes are employed. This is true because the best way to evaluate a path expression in an OO(-REL)DB query is to use pointer joins between the extents of the objects involved in the path expression [23, 8]. Future work concerns query optimization in parallel object-oriented (relational) databases. We will study object distribution and the impact of complex datatypes and method invocation on parallel databases.
Another topic will be the interaction of multimedia operators with a parallel object or relational algebra.

References

[1] W. Hasan, D. Florescu, and P. Valduriez. Open issues in parallel query optimization. SIGMOD Record, 25(3):28-33, September 1996.
[2] A. Silberschatz, M. Stonebraker, and J. Ullman. Database research: Achievements and opportunities into the 21st century. SIGMOD Record, 25(1):52-63, March 1996.


[3] Jim Gray. Parallel Database Systems Survey. In Tutorial Handouts of the 21st International Conference on Very Large Data Bases, Zurich, Switzerland, September 1995.
[4] Paul Krill. NCR boosts Teradata decision support database. Teradata News, April 1998.
[5] C.K. Baru, G. Fecteau, A. Goyal, H. Hsiao, A. Jhingran, S. Padmanabhan, G.P. Copeland, and W.G. Wilson. DB2 Parallel Edition. IBM Systems Journal, 34(2):292-323, 1995.
[6] K.-L. Tan and H. Lu. A Note on the Strategy Space of Multiway Join Query Optimization Problem in Parallel Systems. SIGMOD Record, 20(4):81-82, December 1991.
[7] M. Jarke and J. Koch. Query optimization in database systems. ACM Computing Surveys, 16(2), June 1984.
[8] N. Kabra and D.J. DeWitt. OPT++: An Object-Oriented Implementation for Extensible Database Query Optimization. The VLDB Journal, 8(1):55-78, 1999.
[9] M. Zait, D. Florescu, and P. Valduriez. Benchmarking the DBS3 Parallel Query Optimizer. IEEE Parallel and Distributed Technology: Systems and Applications, 4(2):26-40, 1996.
[10] M. Spiliopoulou, M. Hatzopoulos, and Y. Contronis. Parallel Optimization of Large Join Queries with Set Operators and Aggregates in a Parallel Environment Supporting Pipeline. IEEE Transactions on Knowledge and Data Engineering, 8(3):429-445, June 1996.
[11] B. Vance and D. Maier. Rapid Bushy Join-order Optimization with Cartesian Products. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 35-46, Montreal, Canada, June 1996.
[12] E.J. Shekita, K.L. Tan, and H.C. Young. Multi-Join Optimization for Symmetric Multiprocessors. In Proceedings of the International Conference on Very Large Data Bases, pages 479-492, Dublin, Ireland, August 1993.
[13] J. Srivastava and G. Elsesser. Optimizing Multi-Join Queries in Parallel Relational Databases. In Proceedings of the Second International Conference on Parallel and Distributed Information Systems, pages 84-96, Los Alamitos, California, USA, December 1993.

[14] Leonidas Fegaras. A new heuristic for optimizing large queries. In International Database and Expert Systems Applications Conference, pages 726-735, Vienna, Austria, August 1998. Springer Verlag, LNCS 1460.
[15] H. Lu, B.-C. Ooi, and K.-L. Tan, editors. Query Processing in Parallel Relational Database Systems, chapter Parallel Query Optimization. IEEE Computer Society Press, 1994.
[16] R.S.G. Lanzelotte, P. Valduriez, and M. Zait. On the effectiveness of optimization search strategies for parallel execution spaces. In Proceedings of the International Conference on Very Large Data Bases, pages 429-445, Dublin, Ireland, August 1993.
[17] Y.E. Ioannidis and Y. Cha Kang. Randomized Algorithms for Optimizing Large Join Queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 312-321, Atlantic City, New Jersey, USA, 1990.
[18] Goetz Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73-170, June 1993.
[19] D. Schneider and D.J. DeWitt. Tradeoffs in processing complex join queries via hashing in multi-processor database machines. In Proceedings of the International Conference on Very Large Databases, pages 469-490, Melbourne, Australia, August 1990.
[20] R.S.G. Lanzelotte, P. Valduriez, M. Zait, and M. Ziane. Invited project review: Industrial-strength parallel query optimization: issues and lessons. Information Systems, 19(4):311-330, 1994.
[21] C. Nippl and B. Mitschang. TOPAZ: a Cost-Based, Rule-Driven, Multi-Phase Parallelizer. In Proceedings of the International Conference on Very Large Databases, pages 251-262, New York City (NY), USA, August 1998.
[22] L. Bouganim, D. Florescu, and P. Valduriez. Load balancing for parallel query execution on NUMA multiprocessors. Distributed and Parallel Databases, 7(1):99-121, 1999.
[23] D. DeWitt, J. Naughton, J. Shafer, and S. Venkataraman. Parallelizing OODBMS traversals: A performance evaluation. Very Large Databases Journal, 5(1):3-18, 1996.
[24] A. Pellenkoft, C.A. Galindo-Legaria, and M.L. Kersten. The Complexity of Transformation-Based Join Enumeration. In Proceedings of the International Conference on Very Large Databases, pages 306-315, Athens, Greece, September 1997.

[25] Arun Swami. Optimization of Large Join Queries: Combining Heuristics and Combinatorial Techniques. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 367-376, Portland, USA, 1989.
[26] K. Ono and G.M. Lohman. Measuring the complexity of join enumeration in query optimization. In International Conference on Very Large Databases, pages 314-325, Brisbane, Queensland, Australia, August 1990.
[27] P. Selinger, M. Astrahan, D.A. Chamberlin, R.A. Lorie, and T.G. Price. Access Path Selection in a Relational Database Management System. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 23-34, 1979.
[28] S. Englert, R. Glasstone, and W. Hasan. Parallelism and its Price: A Case Study of NonStop SQL/MP. SIGMOD Record, 24(3):61-71, December 1995.
[29] S. Ganguly, W. Hasan, and R. Krishnamurthy. Query Optimization for Parallel Execution. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 9-18, San Diego, California, USA, 1992.
[30] G. Graefe and R.L. Cole. Optimization of Dynamic Query Execution Plans. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 150-160, Minneapolis, Minnesota, USA, May 1994.
[31] R.S.G. Lanzelotte, P. Valduriez, M. Ziane, and J.-P. Cheiney. Optimization of Nonrecursive Queries in OODBs. In Proceedings of Deductive and Object-Oriented Databases, pages 1-21, 1991.
[32] Goetz Graefe. The Cascades Framework for Query Optimization. Bulletin of the IEEE Technical Committee on Data Engineering, 18(3):19-29, September 1995.
[33] Pedro Celis. The query optimizer in Tandem's ServerWare SQL Product. In Proceedings of the International Conference on Very Large Databases, page 512, Bombay, India, September 1996.
[34] M. Groetschel and L. Lovasz. Combinatorial Optimization, chapter 28, Handbook of Combinatorics. Elsevier Science B.V., 1995.


[35] A. Swami and A. Gupta. Optimization of Large Join Queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 8-17, Chicago, Illinois, USA, June 1988.
[36] Y.E. Ioannidis and S. Christodoulakis. On the Propagation of Errors in the Size of Join Results. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 268-277, Denver, USA, 1991.
[37] H. Lu and K.-L. Tan. Load-Balanced Join Processing in Shared-Nothing Systems. Journal of Parallel and Distributed Computing, 23:382-398, 1994.
[38] M. Stillger and M. Spiliopoulou. Genetic Programming in Database Query Optimization. In Proceedings of the First International Conference on Genetic Programming, USA, July 1996.
[39] K.A. Nafjan and J.M. Kerridge. Large join optimisation on parallel shared-nothing database machines using genetic algorithms. In EUROPAR 97, Parallel Processing, pages 1159-1163, Passau, Germany, September 1997. Springer Verlag, LNCS.
[40] H. Lu, K.-L. Tan, and S. Dao. The fittest survives: An adaptive approach to query optimization. In Proceedings of the International Conference on Very Large Databases, pages 251-262, Zurich, Switzerland, September 1995.
[41] N. Biscondi, A. Flory, L. Brunie, and H. Kosch. Encapsulation of intra-operator parallelism in a parallel match operator. In ACPC 96, pages 24-32, Klagenfurt, Austria, September 1996. Springer Verlag, LNCS 1127.
[42] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, May 1983.
[43] S. Ganguly, A. Goel, and A. Silberschatz. Efficient and Accurate Cost Models for parallel query optimization. In Proceedings of the ACM SIGMOD Symposium on Principles of Database Systems, pages 172-181, New York, USA, June 1996. ACM Press.
[44] M. Mehta and D.J. DeWitt. Data Placement in Shared-Nothing Parallel Database Systems. The VLDB Journal, 6(1):53-72, January 1997.
[45] Harald Kosch. Exploiting serialized bushy trees for parallel relational query optimization. PhD thesis, Ecole Normale Superieure de Lyon, Lyon, France, June 1997.

[46] L. Brunie and H. Kosch. Optimizing complex decision support queries for parallel execution. In International Conference of PDPTA 97, pages 1123-1133, Las Vegas, USA, July 1997. CSREA Press.
[47] C. Chekuri, W. Hasan, and R. Motwani. Scheduling Problems in Parallel Query Optimization. In Proceedings of the Principles of Database Systems, pages 255-265, San Jose, California, USA, May 1995.
[48] A. Hameurlain and F. Morvan. Scheduling and Mapping for Parallel Execution of Extended SQL Queries. In ACM CIKM 95, pages 197-204, Baltimore, MD, USA, November 1995.
[49] A.N. Wilschut, J. Flokstra, and P.M.G. Apers. Parallel Evaluation of Multi-Join Queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 115-126, San Jose, California, USA, May 1995.
[50] N. Bassiliades and I. Vlahavas. A Non-Uniform Data Fragmentation Strategy for Parallel Main-Memory Database Systems. In Proceedings of the International Conference on Very Large Databases, pages 370-381, Zurich, Switzerland, September 1995.
[51] G. Graefe and W.J. McKenna. The Volcano Optimizer Generator: Extensibility and Efficient Search. In Proceedings IEEE International Conference on Data Engineering, pages 209-218, Vienna, Austria, April 1993.
[52] M.N. Garofalakis and Y.E. Ioannidis. Multi-dimensional Resource Scheduling for Parallel Queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 365-376, Montreal, Canada, June 1996.
[53] H. Kosch, L. Brunie, and W. Wohner. From the modeling of parallel relational query processing to query optimization and simulation. Parallel Processing Letters, 8(1):2-14, March 1998.
[54] E. Ding, L.A. Diminio, G. Gopal, and T.K. Rengarajan. Parallel processing capabilities of Sybase Adaptive Server Enterprise 11.5. Data Engineering Bulletin, 20(2):35-43, 1997.
[55] M. Stillger, M. Spiliopoulou, and J.-C. Freytag. Parallel Query Optimization: Exploiting Bushy and Pipeline Parallelism with Genetic Programs. Technical report, Humboldt-Universitaet Berlin, 1996.
[56] C. Galindo-Legaria, A. Pellenkoft, and M. Kersten. Fast, randomized join-order selection - why use transformations? In Proceedings of the International Conference on Very Large Databases, pages 85-95, Santiago, Chile, 1994.
