Optimizing Multi-Join Queries in Parallel Relational Databases

Jaideep Srivastava    Gary Elsesser
Computer Science Department, University of Minnesota, Minneapolis, MN 55455

(This work has been supported in part by the National Science Foundation through grant number IRI-91-10584.)

Abstract

Query optimization for parallel machines needs to consider the machine architecture, the processor and memory resources available, and different types of parallelism, making the search space much larger than in the sequential case. In this paper our aim is to determine a plan that makes the execution of an individual query very fast, making the minimization of parallel execution time the right objective. This creates the following circular dependence: a plan tree is needed for effective resource assignment, which is needed to estimate the parallel execution time, which in turn is needed for the cost-based search for a good plan tree. In this paper we propose a new search heuristic that breaks the cycle by constructing the plan tree layer by layer in a bottom-up manner. To select nodes at the next level, lower and upper bounds on the execution time of the plans consistent with the decisions made so far are estimated and used to guide the search. A query plan representation for intra- and inter-operator parallelism, pipelining, and processor and memory assignment is proposed. Also proposed is a new approach to estimating the parallel execution time of a plan that takes the sum and the max of the costs of operators working sequentially and in parallel, respectively. The results obtained from a prototype optimizer are presented.

1 Introduction

The relational data model has been found especially suited for massive parallelization due to its set-oriented nature [3, 8]. The focus in query optimization so far has been on finding least-work plans, since on a uniprocessor the time taken is proportional to the work done. Since the most efficient uniprocessor solutions often have sequential dependencies, making them difficult to parallelize, an important aim is to find solutions with the least parallel time; the amount of work done may become a secondary concern. Thus, a plan should employ different kinds of parallelism, including intra-operator and inter-operator parallelism, and pipelining.

Optimizers for existing parallel database systems consider only a small subset of the plan space. A common approach is to take a good sequential plan, typically a linear join tree, and parallelize each operation using intra-operator parallelism [4]. This approach does not exploit the full range of parallelism achievable. Recent work has proposed applying inter-operator [11] and pipelining [18] parallelism to query processing.

Query Plan Representation: Left- and right-deep trees to represent parallelism and pipelining were introduced in [10], and an experimental analysis of the query processing tradeoffs among them is presented in [18]. An important finding was that right-deep trees perform very well given sufficient resources, primarily memory. However, no analytical cost expressions were provided which can be used by an optimizer, and no notation was provided for depicting the degree of intra-operator parallelism of an operator. Since memory size is an important constraint for query plan selection [18], it is important to distinguish between plans with different memory requirements, both for individual operator evaluation and for buffer space between successive producer-consumer pairs of pipelined operators. This distinction has been discussed in [11], but no notation to distinguish between them is proposed. In the XPRS system [6] the main focus is on intra-operator parallelism, and a comprehensive notation for plans is not discussed. Recent work [13] has evaluated the technique of parallelizing a good sequential plan. [5] depicts the parallelism in an example query plan using a time-resource diagram, without, however, a systematic way of generating the diagram for a given plan. [20] has proposed a dataflow query execution model which can represent pipelined query execution. [14] has proposed a model in which intra- and inter-operator parallelism can be expressed, but not pipelining. The Papyrus project [9] has developed a model which considers inter-operator dependent and independent parallelism, as well as operator cloning, i.e., intra-operator parallelism.

Cost Model for Query Plans: No good cost models currently exist for the parallel environment [3]. Cost models for the sequential environment consider the cost of a query plan to be the sum of the costs of its components [7]. This observation does not hold in the parallel environment, where the cost of the plan is the sum of the costs of only the tasks on the critical path (of the tree representing the plan). The impact of this is clear from the following (approximate) quote from [12]:


"For a sequential machine the problems of minimizing execution time and minimizing the amount of work done are the same, since the same processor has to do all the work. For a parallel machine, however, the two problems are different, since the solution with the minimum amount of work may have a higher degree of sequential dependency."

Various algorithms for the processing of relational operators on general-purpose parallel machines have been proposed in recent years [1, 2, 15, 17, 4]. While most of these have been evaluated either by simulation or implementation, very little effort has been made in deriving analytical cost expressions for them. [15] provides expressions for the hybrid-hash and join-index join algorithms for the hypercube and ring architectures, while [16] provides cost expressions for sort-merge, hash-based sort-merge, and hybrid-hash join algorithms for a specific multiprocessor. [13] uses work, i.e., uniprocessor execution time, as the objective function in phase 1, and a weighted sum of resource consumption and response time in phase 2. Detailed analytical formulae of the cost model have not been reported, and it is not clear how Selinger-type [7] cost-based optimization may be done. [14] has proposed a cost model in which parallel execution time is the objective function; however, the complete details of the cost model have not been reported. In [20] a cost model based on input and output rates to join operators has been proposed; it is an interesting approach whose details remain to be worked out. The cost model in [9] is quite comprehensive, but at a very abstract level; its details for specific architectures remain to be investigated. In summary, analytical cost expressions for parallel implementations of relational operators are limited to only a few algorithms for some architectures. Also, cost models for parallel query plans are far from validated [3].

Search Algorithm: Determining the optimal sequential plan is a hard combinatorial problem. Since for every sequential query plan a number of parallel ones are possible, the search space for parallel query optimization is much larger. The XPRS optimizer [6, 13] reduces the search space by finding a good sequential plan and parallelizing it. The bottom-up search approach proposed in [14] is a promising one, but the details of how the cost estimate is used to guide the search have not been reported. [9] proposes the use of dynamic programming as the search algorithm; however, specific pruning heuristics and their performance evaluation have not been reported.

Our specific contributions include (i) developing a representation for parallel query plans, (ii) developing a cost model for query processing on a shared-everything multiprocessor, (iii) developing a heuristic search algorithm for optimization, and (iv) carrying out its performance evaluation. This paper is organized as follows: Section 2 describes the query plan representation, Section 3 presents the analytical cost model, Section 4 describes the search heuristic, and Section 5 presents the results obtained from building an optimizer based on our model.

2 Query Plan Representation

In our model a parallel query plan is represented as a capacitated labeled ordered binary tree. The shape represents inter-operator parallelism, the orientation represents pipelining, the node labeling represents intra-operator parallelism, and the branch capacity represents the main-memory buffer size between pipelined operations (for right branches) or the main-memory buffer size for storing intermediate results (for left branches). Figure 1 shows an example plan.


Figure 1: A capacitated labeled ordered binary tree

No two operators on a root-leaf path may be evaluated in parallel (though they may be pipelined, as we shall presently see). Thus, bushy and linear trees represent parallel query plans with and without inter-operator parallelism, respectively. A pair of operators represented by a node and its right child are pipelined, while a pair represented by a node and its left child are executed strictly sequentially. The capacity of a branch is the size of the main-memory buffer between the operators on either end. Since a left branch represents sequential processing, its label is the buffer size allocated for storing (perhaps only part of) the intermediate results. (The intermediate results are the output of the child operator and the input to the parent operator.) The label of a right branch is the size of the producer-consumer buffer between the operators on either end; this capacity determines the tolerance of the pipeline to variations in the rates of production and consumption.

Figure 1 shows a parallel query plan for a query with four joins, J1, J2, J3 and J4, between five relations, R1, R2, R3, R4 and R5. J1 has inter-operator parallelism with J2 (and with J3). Operators J1 and J4 are on a root-leaf path and thus do not have inter-operator parallelism; the same holds for J2, J3 and J4. Since J1 is the left child of J4, it must complete before the latter can begin; the same holds for J2 and J3. J3 is the right child of J4, and thus the two are pipelined, with J4 beginning as soon as J3 has produced its first result tuple (and, of course, J1 has completed). The labels 4, 4, 6 and 6 on J1, J4, J2 and J3, respectively, represent the number of processors assigned to each. It is important to note that the processors assigned to operators at the opposite ends of a left branch are the same group, i.e., they first perform the child task and then the parent task, while the processors at the opposite ends of a right branch are distinct groups, since the operations are pipelined. Thus the 4 processors first perform the join J1 and then J4, and the 6 processors first perform the join J2 and then J3. While performing J3 and J4, the 6-processor group is the producer while the 4-processor group is the consumer. The capacity of 10 on the branch (J4, J3) means that the intermediate buffer is assigned 10 units of memory. The capacity of 0 on the other two branches means the intermediate results cannot be stored in main memory (i.e., they must be stored on disk).
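For concreteness, here is a minimal Python sketch of one way to encode such a capacitated labeled ordered binary tree. The class and field names are our own, and the assignment of base relations to the leaves of the example plan is illustrative; only the shape, processor labels, and branch capacities are taken from the discussion of Figure 1 above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Plan:
    """A node of a capacitated labeled ordered binary tree."""
    op: str                           # e.g. "J1" for a join, "R1" for a base relation
    procs: int = 0                    # node label: degree of intra-operator parallelism
    left: Optional["Plan"] = None     # left child: must complete before this node starts
    right: Optional["Plan"] = None    # right child: pipelined with this node
    left_buf: int = 0                 # left-branch capacity: buffer for intermediate results
    right_buf: int = 0                # right-branch capacity: producer-consumer buffer

# The plan of Figure 1 (leaf placement illustrative): a capacity of 0 means
# the intermediate result must be staged on disk.
R = {i: Plan(op=f"R{i}") for i in range(1, 6)}
J1 = Plan("J1", procs=4, left=R[1], right=R[2])
J2 = Plan("J2", procs=6, left=R[4], right=R[5])
J3 = Plan("J3", procs=6, left=J2, right=R[3], left_buf=0)
J4 = Plan("J4", procs=4, left=J1, right=J3, left_buf=0, right_buf=10)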

3 Cost Model for Parallel Query Plans

Developing a comprehensive cost model for parallel query plans requires work in the following two directions:

1. Developing analytical cost expressions for individual operators such as select, project, join, and distribute.

2. Combining the expressions for individual operators to obtain costs for entire plans. Special care must be taken when combining costs for operators executing in a parallel or pipelined manner.

3.1 Cost of Individual Operators

For query optimization an analytical, parameterized cost model is needed which accounts for operators being evaluated by multiple processors. In addition to conventional parameters such as database size, query selectivity, indexes, algorithm used, etc., the cost of a relational operator will depend on (i) the number of processors assigned to it, (ii) the memory available, and (iii) the machine architecture. Thus the cost expression should be of the form:

COST_ARCH(relop, ALG)(DBSIZE, SEL, NPROCS, MEM)

where COST is an analytical cost expression for evaluating relational operator relop using algorithm ALG on a machine with architecture ARCH, using NPROCS processors, MEM memory, a database of size DBSIZE, and predicate selectivities SEL. Because of the wide variety of parallel architectures available, developing analytical cost expressions for all of them is a major task; since our primary aim is to illustrate the query optimization methodology, it is beyond the scope of this paper. Thus, we make the following assumptions:

1. The queries consist only of join operators. (This assumption is not an oversimplification, since the NP-hardness of the query optimization problem is due to the join operators in it.)

2. The machine is a shared-memory multiprocessor with a shared disk array, with at least as many processors as disks.

3. Many parallel join algorithms have been proposed in the literature. For purposes of illustration, in this paper we consider only the hash-join algorithm. (The approach we present here can be easily generalized to include other algorithms.)

4. This paper considers a scenario where there is enough main memory (aggregated over all processors) to hold the smaller relation of a join. The obvious disadvantage is that joins where this assumption does not hold cannot be handled. On the other hand, we discovered that this assumption provides an extremely powerful memory-cutoff heuristic which can be used during the search process. Furthermore, we believe that with the huge amounts of aggregate memory available in today's parallel machines (e.g., the NCube at Sandia National Laboratory has 1024 processors with 4 megabytes/processor, i.e., a total aggregate memory of 4 gigabytes) our assumption would be useful in many real-life applications. (Our ongoing work is considering a model without this assumption, which provides increased generality at the expense of losing a powerful search-pruning heuristic.)

Let X be a join operator whose children X_L and X_R are also join operators. In our notation, X executes strictly after X_L, and is pipelined with X_R. We believe that memory left over from holding the smaller relations of joins in progress is best used if it holds intermediate results from X_R (rather than from X_L), since the processors assigned to X are waiting for it. We assume that the result of X_L goes to disk, while some buffer space is allocated to hold the result from X_R. To model this, we introduce a factor β (β ≤ 1), i.e., the ratio of the cost of getting a tuple from X_R to the cost of getting a tuple from X_L. The parameters associated with X are listed in Table 1. All times are measured considering the start of query plan execution to be t = 0. Many terms in Table 1 represent the work to be done in various subtasks (of a join operation) at a node. The formulae for these terms are given by the equations below.

Symbol            Meaning
X                 A join operator
X_L               Left child of X
X_R               Right child of X
X.begin           When X begins
X.hash_done       When the hash table of X is done
X.end             When X is done
L                 Left operand of X
R                 Right operand of X
σ_X               Selectivity of X (L ⋈ R)
B                 Tuples per disk block
w                 Average disk I/O time
h                 Hash join of 1 pair of tuples (avg. seq. time)
m                 Merge join of 1 pair of tuples (avg. seq. time)
l                 Nested-loop join of 1 pair of tuples (avg. seq. time)
Hash_Work(X)      Work to read X_L and create its hash table
Probe_Work(X)     Work to probe with X_R and write the result of X
EndH_Work(X)      Probe work for the last block of X_R
Pbegin_Work(X)    Probe work for the first block of the result
β                 read_cost(X_R) / read_cost(X_L)
P                 Total number of available processors
P_X               Number of processors assigned to X
f(P)              Speed-up using P processors (see Section 4.1.2)

Table 1: Notation Used (all times in seconds)

Hash Join:

Hash_Work(X)   = (w/B)|L| + h|L|

Probe_Work(X)  = (w/B)|R| + h|R| + σ_X (w/B)|L||R|

EndH_Work(X)   = w + hB + σ_X w|L|

Pbegin_Work(X) = min( Probe_Work(X), (B / (σ_X |L||R|)) Probe_Work(X) )
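To make these work terms concrete, the following Python transcription can be used; the function names are our own, and the parameter defaults loosely follow Table 2 (w = 20 ms per block, B = 100 tuples per block, and h assumed to be 100 μsec per hashed tuple).

def hash_work(L, w=0.020, B=100, h=100e-6):
    # Hash_Work(X): read the left operand (L tuples) and build its hash table.
    return (w / B) * L + h * L

def probe_work(L, R, sel, w=0.020, B=100, h=100e-6):
    # Probe_Work(X): read and probe the right operand, then write the
    # sel * L * R result tuples at w/B per tuple.
    return (w / B) * R + h * R + sel * (w / B) * L * R

def endh_work(L, sel, w=0.020, B=100, h=100e-6):
    # EndH_Work(X): probe work for the last block (B tuples) of the right operand.
    return w + h * B + sel * w * L

def pbegin_work(L, R, sel, w=0.020, B=100, h=100e-6):
    # Pbegin_Work(X): probe work done before the first result block of B tuples exists.
    full = probe_work(L, R, sel, w, B, h)
    return min(full, (B / (sel * L * R)) * full)

# Example: a 10K x 20K join with selectivity 1e-5.
print(hash_work(10_000), probe_work(10_000, 20_000, 1e-5))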

3.2 Combining Operator Costs

For a query plan with multiple operators there is more than one way of combining the individual operator costs described above. We are interested in the following two metrics:

1. Work: the Work of a query plan is defined as the time it would take to execute the plan on a uniprocessor which has the same power as one of the processing elements.

2. ExecutionTime: the ExecutionTime of a query plan is the time it would take to execute the plan on the multiprocessor, given some processor and memory assignment.

Combining operator costs to get the Work metric is straightforward, since the uniprocessor operator costs are simply summed. Combining operator costs to get the ExecutionTime, however, needs to consider intra- and inter-operator parallelism and pipelining. We propose a cost combination model for parallel query plans as shown in Figure 2, where H and P are the hash and probe phases of a hash-join algorithm.

[Figure 2: Nodes in a Plan Tree — each join node X consists of a hash phase H and a probe phase P; the hash phase of X follows the completion of its left child X_L, while its probe phase is pipelined with its right child X_R.]

3.2.1 Hash Join Algorithm Considered

Many different types of hash joins have been described in the literature. The one we consider here (purely for illustrative purposes) is the asymmetric variety, where one relation (the left) is used to create a hash table, and the other (the right) is used to probe the table. Furthermore, we assume there is enough memory to hold the hash table of the smaller (left) relation. Figure 3 shows the various subtasks to be performed in a join of this type.

[Figure 3: Subtasks of the Hash Join — a timeline through the points X.begin, X.hash_done, X.probe_begin, and X.end.]

The labels are the various time points in the processing of the operation. A solid line shows processing, while a broken line shows a waiting period. We assume that the hash table creation starts strictly after the left child join is done, while the probe is pipelined with the right child join. The time points are related by the following (recursive) equations:

X.begin        = X_L.end
X.hash_done    = X.begin + Hash_Work(X) / f(P_X)
X.probe_begin  = max( X.hash_done, X_R.result_begin )
X.result_begin = X.probe_begin + Pbegin_Work(X) / f(P_X)
X.end          = max( X.probe_begin + Probe_Work(X) / f(P_X), X_R.end + EndH_Work(X) / f(P_X) )

The above formulae hold for join nodes whose left and right children are themselves join nodes. If the left child is a relation, X.begin = 0. If the right child is a relation, X_R.result_begin = 0. The way we have defined the costs, a post-order traversal of a plan tree will label all the nodes with their appropriate costs. In our notation the parallel execution time of the entire plan is root.end.
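A minimal transcription of these recursive equations, reusing the Plan sketch from Section 2: the work terms hash_w, pbegin_w, probe_w, and endh_w and the processor count procs are assumed to be attached to each join node beforehand, the exponent q of the speed-up curve is an assumed value, and relation children contribute zero times per the boundary conditions above.

def f(P, q=0.9):
    # Speed-up curve f(P) = min(P**q, P) from Section 4.1.2; q is assumed.
    return min(P ** q, P) if P > 0 else 1.0

def is_join(node):
    return node is not None and node.op.startswith("J")

def label_times(X):
    # Post-order walk labelling every join node with its time points;
    # after the call, the parallel execution time of the plan is root.end.
    for child in (X.left, X.right):
        if is_join(child):
            label_times(child)
    s = f(X.procs)
    X.begin = X.left.end if is_join(X.left) else 0.0
    X.hash_done = X.begin + X.hash_w / s
    right_rb = X.right.result_begin if is_join(X.right) else 0.0
    X.probe_begin = max(X.hash_done, right_rb)
    X.result_begin = X.probe_begin + X.pbegin_w / s
    right_end = X.right.end if is_join(X.right) else 0.0
    X.end = max(X.probe_begin + X.probe_w / s,
                right_end + X.endh_w / s)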

4 Search Algorithm

Beyond six joins, the number of plans quickly becomes too large for exhaustive search to be practical (see [19]). It is therefore necessary to consider heuristic methods to narrow the search space and thereby reduce query planning time. Our cost model uses two cost measures: work, i.e., sequential time, and (parallel) execution time. The work offered by a join is determined entirely by the attributes of its operands, i.e., relation sizes and query predicate selectivities. A query's execution time, however, is a function of its work estimate, its degree and type of parallelism, and its resource allocation, i.e., the processors and memory assigned to it. Figure 4 shows the dependence between the various tasks and parameters.

[Figure 4: Dependence Between Tasks and Parameters in Optimization — database statistics, query predicate selectivities, and the relational operation algorithms determine the work estimate (sequential time) and the sequential plan time; the parallel plan tree determines the degree and type of parallelism, which, together with the processor and memory assignments, determines the parallel execution time estimate.]

We used two search algorithms for finding a good query plan. The first is an "exhaustive" approach (with qualifications discussed in the next subsection) which enumerates a very large number of query plans; its result is used as a baseline measure. The second is a heuristic approach which examines a much smaller portion of the search space.

4.1 Exhaustive Search for a Query Plan

The "exhaustive" search algorithm employs a two-phase approach to examining the query plan space. Phase 1 generates candidate ordered trees (plans) and computes work and result-size estimates for each tree and all of its subtrees. Processor assignment and execution-time estimation are performed during phase 2. While phase 1 works inductively, from the bottom up, phase 2 is a top-down process.


4.1.1 Phase 1: Plan Enumeration and Work Estimation

Two functions are performed by phase 1: construction of candidate query plans and assignment of work and result-size estimates.

Cell(0) := { plan_tree(r) : r ∈ input relations };
for N := 1, 2, ..., number of joins
    Cell(N) := ∅;
    for I ∈ { 0, ..., N - 1 }
        for (L, R) ∈ Cell(I) × Cell(N - 1 - I)
            if leaves(L) ∩ leaves(R) = ∅ then
                Cell(N) := Cell(N) ∪ { plan_tree(L, R) };

Figure 5: Plan Tree Construction

As shown in Figure 5, the forest of plan trees is generated inductively; see the executable sketch after this list of pruning criteria. Cell(k) is filled with plan trees which represent exactly k join operations (interior nodes). Work-load and result-size estimates are computed as the subtrees are constructed. The work-load estimate of each subtree is computed using the cost model described above, while our selection heuristic and method of result-size estimation are described below. In phase 1 each subtree is evaluated exactly once; since a given subtree is common to many larger trees, phase 1 uses a dynamic programming approach. The exhaustive search is not truly exhaustive, since it prunes the search space based on the following criteria:

1. When joining two relations, the smaller one is always chosen as the left relation.

2. A subtree, and thus all potential trees it can be a part of, is discarded if the subtree is found to require more memory than is available. (As discussed earlier, for illustration we consider only situations where the smaller relation fits in main memory; thus, we have inherently reduced the search space.)
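An executable rendering of the Figure 5 construction, in Python; for brevity this sketch omits the two pruning criteria above and the work and result-size bookkeeping.

from itertools import product

def enumerate_plans(relations):
    # cell[N] holds (leaf-set, ordered tree) pairs with exactly N joins.
    n = len(relations)
    cell = {0: [(frozenset([r]), r) for r in relations]}
    for N in range(1, n):
        cell[N] = []
        for I in range(N):
            for (lv_l, L), (lv_r, R) in product(cell[I], cell[N - 1 - I]):
                if lv_l.isdisjoint(lv_r):          # no relation may appear twice
                    cell[N].append((lv_l | lv_r, (L, R)))
    return [tree for _, tree in cell[n - 1]]       # plans joining all relations

print(len(enumerate_plans(["R1", "R2", "R3"])))    # 12 ordered trees over 3 relations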

Estimating Memory Requirements of a Plan: The memory requirements of a query plan can be determined from the memory requirements of its constituent operators and the manner in which they are combined. Memory assigned to an operator can be reused by operators starting strictly after it ends, but not by operators whose execution overlaps with it. Let m_X, M_L, M_R, and M_X be the maximum memory ever needed by the join operator X, the subtree rooted at X_L, the subtree rooted at X_R, and the subtree rooted at X, respectively. Since we assume that the smaller relation in a join must fit in memory, its size provides a lower bound on m_X.

[Figure 6: A Plan Node and its Subtrees — node X with parameters (p_X, m_X, w_X), left subtree X_L with (P_L, M_L, W_L), and right subtree X_R with (P_R, M_R, W_R).]

From Figure 6 the following rules can be derived:

int_node(X)    =>  M_X = max(M_L + M_R, M_R + m_X)
left_leaf(X)   =>  M_X = |X|
right_leaf(X)  =>  M_X = 0

The term M_L + M_R arises because the execution of the subplans represented by the subtrees rooted at X_L and X_R may overlap, while M_R + m_X arises because the execution of the subplans of the subtree rooted at X_R and the operator X may overlap. In this approach, the value associated with a node is the maximum memory ever needed during the execution of the plan rooted at that node; the value at the root gives the maximum memory needed for the entire plan.
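A minimal sketch of these rules; the leaf attribute size, the per-operator need m_x (lower-bounded by the size of the smaller, memory-resident input), and the helper is_join are assumptions carried over from the earlier sketches.

def max_memory(X, as_left_child=True):
    # M_X: the maximum memory ever needed by the subtree rooted at X.
    if not is_join(X):                              # base relation leaf
        return X.size if as_left_child else 0       # left_leaf -> |X|, right_leaf -> 0
    M_L = max_memory(X.left, as_left_child=True)
    M_R = max_memory(X.right, as_left_child=False)
    return max(M_L + M_R, M_R + X.m_x)              # int_node rule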

4.1.2 Phase 2: Processor Assignment & Execution Time Estimation

Phase 2 walks each of the plan trees generated by phase 1 and selects a plan tree with minimal (estimated) parallel execution time. Though it is convenient to think of the two phases as occurring in strict temporal succession, it is possible to substantially reduce memory cost by performing the phase-2 evaluation of each plan tree immediately after its construction. This optimization is well worth considering because the last "Cell" (see Figure 5) populated by phase 1 generally requires many times the combined storage of all preceding cells. The processor assignment and time estimation algorithms described below do not depend on the order in which candidate trees are generated, or on the subsequent reuse of storage held by already-evaluated plans.

Execution time, like work, is computed in a bottom-up manner. Unlike work, however, execution time depends on the degree and type of parallelism of the plan tree, and on the processor assignment. Therefore the execution time of a plan cannot be computed until the processor assignment for the plan is known. Since processor assignment cannot be performed until the entire plan tree is available, execution time estimates cannot be computed for partial plans. Thus, in contrast to phase 1, phase 2 must evaluate each plan as a logical unit. The evaluation of a plan tree is carried out by performing two logically distinct traversals: a preorder walk to perform processor assignment, and a postorder walk to compute execution time estimates. Though in practice these traversals are performed by a single tree walk, we describe them as if they were distinct processing stages.

Proportional Processor Assignment: Processors are assigned to operators with the aim of minimizing idling. Extensive work in parallel processing has shown that multiprocessor speed-ups are almost always less than linear. A number of speed-up curves for simulation studies and implementations of parallel join algorithms have been reported. However, for query optimization an analytical expression for speed-up is needed. Linear speed-up is only an idealization; in reality speed-ups tend to be much lower due to resource contention and multi-threading overheads. After examining a number of curves, we decided to use the speed-up function f(P) = min(P^q, P), where 0 < q < 1.

In Figure 6, let p_X, P_L, P_R, and P_X represent the number of processors assigned to the operator X and to the plans represented by the trees rooted at X_L, X_R, and X, respectively. The W's represent the corresponding work estimates. Proportional processor assignment is defined by equations 1 through 3:

p_X + P_R = P_X                          (1)
P_L = p_X                                (2)
(w_X + W_L) / p_X^q = W_R / P_R^q        (3)

Equation 1 follows from the assumption that the processing power available for executing the plan of the tree rooted at X is the sum of the processing powers available for the operator X and the subtree rooted at X_R. Recall that in our notation an operator may not begin until its left subtree has completed, so the tree rooted at X_L must complete execution before operator X can begin. Thus, the processors that execute the tree rooted at X_L are a natural choice for executing operator X; equation 2 is a direct consequence of this understanding. (If processors are dynamically assigned, this rule need not hold; in our present work we consider only static processor assignment.) Equation 3 captures the intuitive notion that the processing power assigned to a task, or group of tasks, should be proportional to the amount of work it is expected to perform: the time at which the processor set p_X completes must be the same as the time at which the set P_R completes. This approach can also be considered as minimizing the query execution time. Boundary conditions are handled by assigning no processors to the leaf nodes, i.e., relations. (This is a simplification for expository purposes. If an index scan of a relation is done, or a projection has to be carried out, some processors have to be assigned; in such a case, the scan or projection can also be considered as an operator.)

It is easy to observe that these equations always have a positive solution. The difficulty lies in the fact that they do not, in general, have integer solutions. It is unrealistic to assign a non-integer number of processors to an operator, so we are forced to adjust our assignments to nearby integer values. Presently we round to the nearest integer subject to two constraints: (i) each of p_X and P_R must be at least 1; (ii) equation 1 is always satisfied. If p_X < 2 and the tree rooted at X has a right subtree, then the tree must be executed serially, by a single processor. (When assigned to a single processor, a subtree is evaluated in some serial order in accordance with the evaluation method given by the associated query plan subtree.)

These three equations capture a processor assignment policy that is intuitively appealing and has been observed to be effective in our preliminary evaluation. Additional experimentation and theoretical work are necessary before the impact of our rounding policy can be properly evaluated.

Execution Time Estimates: The cost model section describes how execution time estimates are computed using the tree structure, the intermediate result sizes and join selectivities of phase 1, and the processor assignment of phase 2. These equations treat execution time as a synthesized attribute, i.e., when computing the execution time estimates of an operator, it is assumed that the corresponding estimates have already been computed for the operator's immediate subtrees. Therefore, execution time estimates are computed by the visit operation of a postorder walk of a candidate tree. Phase 2 evaluates each tree constructed by phase 1 and selects the tree with minimal expected parallel execution time.
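One possible realization of the proportional assignment above: since equations (1)-(3) have a positive but generally non-integer solution, this sketch locates it by bisection and then rounds under the two stated constraints. The recursion over the tree (the operator and its left subtree reuse p_X, per equation (2)) is omitted.

def split_processors(P_X, left_path_work, W_R, q=0.9):
    # Find p in (0, P_X) with left_path_work / p**q == W_R / (P_X - p)**q,
    # where left_path_work = w_X + W_L (equation (3)).
    lo, hi = 1e-9, P_X - 1e-9
    for _ in range(100):
        p = (lo + hi) / 2.0
        if left_path_work / p ** q > W_R / (P_X - p) ** q:
            lo = p                           # left group finishes later: enlarge it
        else:
            hi = p
    p_int = min(max(round(p), 1), P_X - 1)   # constraints: p_X >= 1 and P_R >= 1
    return p_int, P_X - p_int                # equation (1) holds by construction

# Example: 32 processors, the left path carrying twice the right subtree's work.
print(split_processors(32, left_path_work=2.0, W_R=1.0))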

4.2 Heuristic Search

Despite the search-space reduction ideas used in the exhaustive algorithm, its overhead is still prohibitively large. We now present an efficient algorithm that employs additional heuristics to reduce the search space drastically. The additional heuristics are:

1. While the exhaustive approach carries forward a very large number of possible trees, the heuristic builds exactly one plan tree.

2. In the bottom-up tree-building process an ordered tree is created, thus deciding the sequential, inter-operator-parallel, and pipelining relationships between operators.

3. Upper and lower bounds on execution time are used to prune the search space.

Circular Dependence in the Search Heuristic: As observed in the case of our exhaustive search, the work estimate for a plan can be built in a bottom-up fashion. However, as shown in Figure 7, processor assignment can be done only after the complete ordered tree has been determined. Next, (parallel) execution time estimation can be done only after the processor assignment has been done. But if we want to construct exactly one plan tree, and our aim is to minimize the parallel execution time, the search should be guided by the parallel execution time estimates. This inherent circularity is shown in Figure 7.

[Figure 7: Circular dependency in the search heuristic — the heuristic search produces the plan tree and work estimate, which drive the processor assignment, which drives the parallel execution time estimate, which in turn guides the search.]

4.2.1 Layered Approach to Dependence Resolution

We resolve this circular dependency by taking a layered approach, i.e., constructing the plan (an ordered capacitated binary tree) in a bottom-up manner, level by level. At a generic step of the algorithm some i levels (starting from the bottom) have been constructed; at this point there are actually a number of trees, each having i levels or fewer, which will eventually become part of the complete plan. The next step is to decide how to combine pairs of trees to create the next level. To guide this, the algorithm first constructs the left-deep and right-deep trees from this point upwards, to estimate upper and lower bounds, respectively, on the cost of all plans still possible in the optimization process. These estimates are used to guide the search, i.e., to decide which trees will be merged at this level. The main procedure, Heuristic_Plan, is described in Figure 8; detailed descriptions of all the procedures are in [19].

function Heuristic_Plan( R1, ..., Rn ) return Plan_tree is
  -- start with a forest of n one-node trees: F1, ..., Fn.
  for i ∈ 1 .. n do Fi := tree( Ri );
  while |F| > 1 loop
    (p, q) := select_join_pairs(F);
    -- join pairs (Fp1, Fq1), (Fp2, Fq2), ... in turn,
    -- testing cost and memory after each join.
    F := merge_pair(F, p1, q1);
    cost := Cost_est(F);
    if cost > Max_cost then FAIL;
    for i := 2 .. floor(|F|/2) loop
      F' := merge_pair(F, pi, qi);
      cost' := Cost_est(F');
      exit when cost' > cost;  -- exit merge loop.
      (F, cost) := (F', cost');
    end loop;
  end loop;
  return T, where F = {T};
end Heuristic_Plan;

Figure 8: The Search Heuristic

The algorithm Heuristic_Plan takes the n initial relations R1, R2, ..., Rn and merges them layer by layer in a bottom-up manner. The key idea is the manner in which some i pairs of relations are merged in a particular layer. To decide the value of i it is necessary to know its Cost_est, i.e., the estimate of the parallel execution time of the best plan that can be constructed by merging i particular pairs at this layer, in conjunction with the decisions made earlier. This is determined by estimating lower and upper bounds on Cost_est. It was observed in [18] that a right-deep tree is preferable for minimizing parallel execution time, and under our assumption of having the smaller relation completely in memory during the execution of an operator, the right-deep tree leads to the maximal degree of pipelining. Thus, the bounds are obtained by creating the right-deep and left-deep trees that can be constructed starting with the i pairs merged at this level. The bounds on the costs are obtained by doing the phase-2 computation of the exhaustive algorithm, i.e., processor assignment and execution time estimation, on these trees. The estimate Cost_est is then taken as a linear combination of the upper and lower bounds. Cost_est values are calculated in increasing order of i, i.e., the number of pairs joined at this level, until an upward trend is encountered. The process also stops if for a certain value of i no left-deep tree can be constructed that will execute in the available memory, since then clearly even the least memory-intensive plan tree from this point onwards, i.e., a tree that is left-deep from this point upwards, requires more memory than is available. Once i pairs have been selected at a certain level, the 2i merged relations are removed from the original set of n relations and the i intermediate results created are added to the set. The whole process is repeated until a single relation is left.
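A small sketch of the Cost_est combination and of the merge loop's stopping rule. The bound values are assumed to come from running the phase-2 processor assignment and timing estimate on the right-deep (lower-bound) and left-deep (upper-bound) completions of the current forest, and we read Table 2's α = 0.5 as the weight on the right-deep bound.

def cost_est(lower_bound, upper_bound, alpha=0.5):
    # Linear combination of the right-deep (lower) and left-deep (upper) bounds.
    return alpha * lower_bound + (1.0 - alpha) * upper_bound

def best_merge_count(bounds, alpha=0.5):
    # bounds[i-1] = (lower, upper) after merging i pairs at this level; return
    # the last i before Cost_est starts an upward trend, as in Figure 8.
    best_i, best_cost = 1, cost_est(*bounds[0], alpha)
    for i, (lower, upper) in enumerate(bounds[1:], start=2):
        c = cost_est(lower, upper, alpha)
        if c > best_cost:
            break
        best_i, best_cost = i, c
    return best_i

print(best_merge_count([(12.0, 20.0), (10.0, 21.0), (11.0, 23.0)]))   # -> 2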

5 Experimental Evaluation

A query optimizer was implemented based on the ideas described in the previous sections. The aim was to carry out an evaluation of the proposed search heuristic. In this section we describe the experimental setup and the results obtained.

5.1 Experimental Setup

Table 2 shows the fixed parameter values used in the experiment.

Parameter   Value      Comment
B           100        disk transfer size in tuples
w           20 ms      disk transfer time (per block)
β           0.8        buffer factor (see cost model)
h           100 μsec   in-memory hash of 1 tuple
l           20 μsec    1 tuple-pair compare (nested loop)
α           0.5        combining weight for the right-deep (lower) bound

Table 2: Fixed Parameters of the Experiment

Two relations each of sizes 10K, 20K, 30K, 40K, and 50K tuples were considered in the database. The selectivities ranged from 10^-6 to 10^-4. The experiment was conducted for α = 0.25, 0.5, 0.75; however, no noticeable difference was observed, so we report the results only for α = 0.5. A more detailed evaluation is needed to determine the exact effect of α on the robustness and Opt-Cost Ratio of the heuristic. To study the behavior of the heuristic, we varied the following parameters:

1. Number of relations: varied between 6 and 10;
2. Total memory size: 20K, 30K, and 40K tuples, i.e., roughly 7-13% of the database size;
3. Number of processors: 16, 32, and 64.

For each parameter setting 100 runs were made. The values reported are the averages of the following metrics:

1. Robustness: Since the heuristic is somewhat greedy, it is possible that it picks too bushy a tree at the lower levels and ends up in a situation where no feasible, i.e., within-memory-limits,

complete tree is possible. Thus the heuristic has failed to find a solution. We define the robustness of the heuristic to be the ratio of the cases in which the heuristic succeeds in finding a solution to the cases in which the exhaustive approach succeeds. A heuristic with robustness = 1 will succeed in all cases in which the exhaustive approach does.

2. Plan Quality: the ratio of the ExecTime estimates obtained from the heuristic and the "exhaustive" approach. This measures the effectiveness of the proposed heuristic.

3. Opt-Cost Ratio: the ratio of the work done by the exhaustive approach to the work done by the heuristic. This is a measure of the optimization-time savings obtained by the heuristic.

4. Speed-Up: the ratio of the Work estimate of a plan and the ExecTime estimate of its parallelization. This is a measure of the overall effect of parallel processing.

5.2 Results and Discussion

Table 3 shows the effect of varying the number of relations in the query. The robustness of the heuristic was observed to be very high in all cases. While the search for a more robust heuristic is certainly desirable, another possibility is to use the exhaustive approach when the heuristic fails; Table 3 shows that this is not likely to happen often. The plan quality of the solution found by the heuristic was observed to be very close to that of the one found by the exhaustive approach in almost all cases. The Opt-Cost Ratio was found to increasingly favor the heuristic as the number of relations increased.

Relations   Robustness (%)   Plan Quality   Opt-Cost Ratio
6           92.5             1.00768        3.7
7           96.4             1.00409        8.1
8           100.0            1.00200        23.2
9           92.6             1.00317        67.0
10          95.0             1.00709        224.8

Table 3: Effect of Number of Relations with 32 PEs, α = 0.5, Memory cutoff = 30K

Table 4 shows the effect of varying the memory size. With an increase in available memory, there is a greater chance for a very bushy tree, chosen early by the heuristic, to be completed into a feasible plan. This can be observed as the increase in robustness of the heuristic with increasing memory size. The plan quality remains good. The Opt-Cost Ratio increases with memory size, since fewer plans fail and the heuristic is able to complete its search sooner.

Memory   Robustness (%)   Plan Quality   Opt-Cost Ratio
20K      91               1.00371        147
30K      95               1.00709        225
40K      96               1.00794        297

Table 4: Effect of Memory Size with 10 Relations, 32 PEs, α = 0.5

Table 5 shows the predicted speedups of the heuristic and optimal plans, based on the cost model and the best parallel plan selected. As expected, the speedup increases with the number of processors. The increase in speedup with the number of relations is due to the increased opportunity for parallelism.

            Heuristic Plans           Optimal Plans
Relations   16 PE   32 PE   64 PE     16 PE   32 PE   64 PE
6           13.8    24.5    43.7      13.7    24.4    43.5
7           14.0    25.0    44.5      14.0    25.0    44.5
8           14.2    25.5    45.1      14.3    25.5    45.3
9           14.9    26.6    47.3      14.9    26.5    47.3
10          15.4    27.4    48.7      15.4    27.4    48.9

Table 5: Average Speedup Predicted with α = 0.5, Memory cutoff = 30K

6 Conclusions

Query optimization for parallel machines needs to consider the machine architecture, the processor and memory resources available, and different types of parallelism, making the search space much larger than in the sequential case. Minimizing parallel execution time, rather than work, however, creates the following circular dependence: a plan tree is needed for effective resource assignment, which is needed to estimate the parallel execution time, which is needed to guide the cost-based search for a good plan tree. An exhaustive search easily breaks this cycle, since it enumerates all possible trees. In this paper we proposed a new search heuristic that breaks the cycle by constructing a plan tree layer by layer in a bottom-up manner. Lower and upper bounds on the execution time of plans consistent with the decisions made so far are estimated and used to guide the search. We proposed a query plan representation for expressing intra- and inter-operator parallelism, pipelining, and processor and memory assignment. We also proposed a new approach to estimating the parallel execution time of a plan that takes the sum and the max of the costs of operators working sequentially and in parallel, respectively. The results obtained from a prototype optimizer were presented.

References

[1] C. K. Baru, O. Frieder, D. Kandlur, and M. Segal. Join on a cube: Analysis, simulation, and implementation. In M. Kitsuregawa and H. Tanaka, editors, Database Machines and Knowledge Base Machines, pages 61-74. Kluwer, Boston, 1988.

[2] D. J. DeWitt and R. Gerber. Multiprocessor hash-based join algorithms. In Proceedings of the 11th VLDB Conference, pages 151-164, 1985.

[3] D. J. DeWitt and J. Gray. Parallel database systems: The future of database processing or a passing fad? CACM, 35(6), June 1992.

[4] D. J. DeWitt et al. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering, 2(1), March 1990.

[5] H. Pirahesh et al. Parallelism in relational database systems: Architectural issues and design approaches. In Proc. of the 2nd Intl. Symp. on Databases in Parallel and Distributed Systems, July 1990.


[6] M. R. Stonebraker et al. The design of XPRS. In Proc. of the 14th Very Large Database Conference, August 1988.

[7] P. P. Selinger et al. Access path selection in a relational database management system. In Proc. of the ACM-SIGMOD Intl. Conf. on Mgmt. of Data, 1979.

[8] The Committee for Advanced DBMS Function. Third-generation database system manifesto. ACM SIGMOD Record, 19(3), September 1990.

[9] S. Ganguly, W. Hasan, and R. Krishnamurthy. Query optimization for parallel execution. In Proc. of the ACM-SIGMOD Intl. Conf. on Mgmt. of Data, 1992.

[10] R. Gerber. Dataflow Query Processing using Multiprocessor Hash-Partitioned Algorithms. PhD dissertation, University of Wisconsin, October 1986.

[11] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In Proc. of the ACM-SIGMOD Intl. Conf. on Mgmt. of Data, May 1990.

[12] J. L. Gustafson. Challenges to parallel processing. Talk given at the University of Minnesota, Minneapolis, September 1989.

[13] W. Hong and M. R. Stonebraker. Optimization of parallel query execution plans in XPRS. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, December 1991.

[14] H. Lu, M.-C. Shan, and K.-L. Tan. Parallel query optimization for shared-memory multiprocessors. In Proceedings of the 17th VLDB Conference, August 1991.

[15] E. R. Omiecinski and E. T. Lin. Hash-based and index-based join algorithms for cube and ring connected multicomputers. IEEE Transactions on Knowledge and Data Engineering, 1(3), September 1989.

[16] J. Richardson, H. Lu, and K. Mikkilineni. Design and evaluation of parallel pipelined join algorithms. In Proc. of the ACM-SIGMOD Intl. Conf. on Mgmt. of Data, pages 399-409, 1987.

[17] D. A. Schneider and D. J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor. In Proc. of the ACM-SIGMOD Intl. Conf. on Mgmt. of Data, June 1989.

[18] D. A. Schneider and D. J. DeWitt. Tradeoffs in processing complex join queries via hashing in multiprocessor database machines. In Proc. of the 16th Very Large Database Conference, August 1990.

[19] J. Srivastava and G. W. Elsesser. Optimizing multi-join queries in parallel relational databases. Computer Science Department Technical Report, University of Minnesota, November 1992.

[20] A. N. Wilschut and P. M. G. Apers. Dataflow query execution in a parallel main-memory environment. In Proceedings of the First International Conference on Parallel and Distributed Information Systems, December 1991.

