IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 8, NO. 3, JUNE 1996
429
Parallel Optimization of Large Join Queries with Set Operators and Aggregates in a Parallel Environment Supporting Pipeline

Myra Spiliopoulou, Michael Hatzopoulos, Member, IEEE Computer Society, and Yannis Cotronis

Abstract—We propose a parallel optimizer for queries containing a large number of joins, as well as set operators and aggregate functions. The platform of execution is a shared-disk multiprocessor machine supporting bushy parallelism and pipelining. Our model partitions the query into almost independent subtrees that can be optimized simultaneously, and applies an enhanced variation of the iterative improvement technique on those subtrees that contain a large number of joins. This technique is parallelized, too. In order to estimate the cost of the states constructed during the optimization of join subtrees, cost formulae are developed that estimate the cost of relational algebra operators when executed across coalescing pipes.

Index Terms—Parallel query optimization, parallelism in optimization, iterative improvement, large join queries, bushy parallelism, pipeline, shared-disk architectures, query optimization, parallelism, databases.
1 INTRODUCTION

Parallelism opens new perspectives in the area of query processing but increases the complexity of the query optimization problem. There are three different kinds of parallel database architectures: shared-memory, shared-nothing, and shared-disk. In these architectures, there are different ways of exploiting parallelism. Graefe distinguishes between intraoperator and interoperator parallelism, the latter being further divided into horizontal or bushy, and vertical or pipeline parallelism [5]. In intraoperator parallelism, each operator in the query execution plan (QEP) is executed in parallel. In interoperator parallelism, different operators are executed simultaneously. Bushy parallelism denotes the parallel execution of independent operators on the QEP; pipelining refers to the parallel execution of operators on adjacent nodes in producer-consumer mode. Pipelining is determined by the structure of the query processing tree [16]. In the left-deep tree used, for instance, in System R [23], the output of each node is materialized before being read by the parent node. In the right-deep tree [22], the right input stream to a node is consumed in pipeline mode, while the left input stream must still be materialized. Several variations of the right-deep tree have been proposed [1], [36] to alleviate its main disadvantage, namely the high memory demand. For the full exploitation of bushy parallelism, though, the processing tree should be bushy itself. In this study, we present our work on query optimization for a shared-disk architecture where bushy parallelism
• M. Spiliopoulou is with Humboldt University of Berlin, 10178 Berlin, Germany. E-mail: [email protected].
• M. Hatzopoulos and Y. Cotronis are with the Department of Informatics, University of Athens, Greece.
Manuscript received July 22, 1993; revised Oct. 6, 1995. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number K96034.
and pipelining are exploited on a bushy processing tree structure. We propose a parallel optimization model for queries with a large number of joins, restrictions, projections, set operations, and aggregates. For the optimization of join (sub)queries, we propose a parallel variation of the iterative improvement combinatorial technique [20]. For the optimization of queries containing also set operators and aggregates, we introduce a parallel optimization scheme, by which a query is partitioned into independent fragments, the fragments are simultaneously optimized, and the resulting sub-QEPs are merged into a global QEP.
2 RELATED WORK

We identify two general approaches in query optimization for parallel databases. In "two-step optimization," an optimal plan is first constructed for a uniprocessor machine and then parallelized. In "one-step optimization," the parameters of parallelism are already taken into account when establishing the optimal plan, which thus contains scheduling information. The first approach is adopted in studies like [1], [6], [7]. An optimal sequential plan is produced at compile time, and an optimal parallelization of this plan is selected according to some heuristics at runtime. As pointed out by Lanzelotte et al., there is no guarantee that the optimal uniprocessor plan will remain optimal when parallelized according to some (different) optimality criteria [16]. The approach incorporating the parallelism parameters into the optimizer [4], [26], [16], [25], [17] bypasses this problem at the cost of increased complexity. In the models of [26], [17], parallelism is incorporated into the cost function. In the cost functions of [4], [16], [25], the utilization of resources is further estimated. This extension has the disadvantage that the output of the cost function is no longer a scalar value, but a vector. Special rules
1041-4347/96$05.00 ©1996 IEEE
are introduced to compare alternative QEPs, while some QEPs remain incomparable and need to be temporarily maintained. Thus, not only the size of the search space but also the memory demand during optimization is substantially increased. The overhead of exhaustive exploration of the search space increases exponentially with the query size [15]. If parallelism is also considered, the space becomes even larger [16]. Hence, as large join queries emerge in applications requiring the coupling of knowledge bases with database systems [15], as well as in deductive and in object-oriented databases, researchers are turning towards alternatives to exhaustive search, such as heuristics, dynamic programming, and combinatorial optimization techniques. Dynamic programming has been studied in [4], [25]. However, dynamic programming performs an almost exhaustive search over the search space, and thus soon becomes intractable, as pointed out in [16]. Heuristics are used in [23], [7], [25]. However, heuristics often rely on assumptions about the cost function that are not always satisfied. For instance, most heuristics incrementally construct a QEP by making an optimal decision at each step: if a decision taken in one step is affected by previous decisions or affects later decisions, then optimality may be lost due to a wrong decision. This is the case when the merge-sort join algorithm is used [9] or if data locality is taken into account in a shared-nothing architecture [16], [17]. A viable alternative to heuristic-based and dynamic optimization is found in the usage of combinatorial optimization techniques. The most promising results are observed for iterative improvement [35], [34], simulated annealing [12], [9], [33], and combinations of the two [10]. Most of the work on nonexhaustive optimization techniques refers to uniprocessor query execution. Simulated annealing for parallel spaces is studied in [16], where an enhanced variation of the base technique is proposed. Lin et al.
use parallel iterative improvement to optimize join queries on a hypercube [17]. In our work, we consider optimization of parallel queries by incorporating several aspects of parallelism into the cost model, and by employing parallelized iterative improvement to explore the large search space thus produced. Our approach differs from related ones in many aspects. Our model combines the aforementioned optimization approaches. It generates a parallel QEP by incorporating the qualitative impact of bushy and pipeline parallelism into the cost function. However, the quantitative impact of parallelism, i.e., resource utilization, is not taken into account. We rather assume that there is a sufficient number of processors to execute the query in parallel. Hence, the optimizer can achieve a better approximation of the optimum than a sequential QEP parallelized at runtime. At the same time, the cost function still produces scalar values, so that different QEPs are always comparable. Moreover, our parallel QEP can be optimally scheduled on the processors actually available at runtime, even in an architecture supporting dynamic (re)assignment of processes to processors. We study bushy and pipelined parallelism on bushy trees, as in [4], [16], [25], but we consider a shared-disk rather than a shared-nothing architecture. Moreover, our
notion of pipeline is more general than that of [4], as we do not assume that the pipelines are synchronized. We scan the query search space using a combinatorial optimization technique, as in [16], [17]. Like Lin et al. [17], we use iterative improvement. However, our optimizer exploits pipelining and not intraoperator parallelism. Moreover, our parallel machine is of generic nature, while the particular machine they assume (a hypercube) leads to decisions that simplify the optimization problem [17]. While the usage of parallelism for query execution is already widespread, the usage of parallelism for query optimization is much less addressed [30], [17], although the optimization time is far from negligible. In [30], we have shown that the performance of an exhaustive technique can be improved by parallelism, as the optimization overhead is decreased and its growth with the query size is slowed down. In this study, we show that a nonexhaustive technique can also be improved by parallelism, as its overhead is reduced and optimal plans of higher quality are produced. The contribution of our model is manifold. We present a complete optimizer which exploits parallelism not only during query execution, but also during optimization, to produce optimal execution plans within a shorter time span. For the optimization of large join (sub)queries, we propose two parallel variations of iterative improvement for bushy trees with joins on the order of a hundred. Finally, we do not isolate join processing, but also account for the presence of other operators. Our study is organized as follows: In the next section, we outline our model, present its objectives, and describe the internal query representation selected. In Section 4, we briefly describe our cost model that estimates query execution time. In Section 5, we analyze our search strategy, a parallelized variation of iterative improvement. Then, in Section 6, we present our first results of its exploitation.
In Section 7, we discuss an enhanced variant of our base technique. In Section 8, we consider the impact of other operations on the query optimization process, resulting in a parallel optimizer formed as a tree of interacting optimization tasks. Section 9 concludes our study.
3 OUTLINE OF THE OPTIMIZER

We propose a parallel optimizer for SQL or SQL-like queries, in which the number of joins is the dominant factor.
3.1 Framework of the Parallel Optimizer

Our model is designed for parallel environments supporting bushy and pipeline parallelism [5]. Processors have private main memories and shared secondary storage, and they communicate across a network of sufficiently high bandwidth to ensure that data transfer time is lower than I/O time. Pipelining is performed by data transfer via nonshared buffers written by the producer and read by the consumer process. After retrieving the initial data from secondary storage, disk accesses are avoided as far as possible. All processors access the relations on the same shared device(s); no partitioning or replication of relations is considered. In general, the base relations do not fit in main
memory, but no tuple is larger than the buffer used for interprocessor communication. We define as cost of a query the elapsed time required to execute it in parallel mode. Our objective is to minimize both the optimization and the execution time of a query by utilizing the available processing power. To satisfy this objective, independent operations must be executed in parallel, and interdependent operations must be processed across coalescing pipes, to minimize I/O accesses and to overlap the communication and CPU cost of consecutive operations across each pipe. This approach holds both for the I/O-bound query execution and for the CPU-bound query optimization.
3.2 Query Tree Structure

We study queries containing restrictions, projections, joins, set operators, and aggregates. We use an operator-tree structure (terminology in [13]), the nodes of which are the operators applied on the relations, while the edges represent the flow of information from the node producing an intermediate relation towards its parent-consumer. The tree is directed from the leaves, retrieving relations from disk, towards the root, producing the final output. Our query tree is bushy [5], so that nodes in different subtrees can be executed in parallel, while adjacent nodes can be executed in pipeline. Each node/operator in a pipe starts execution as soon as the operator(s) producing its input have output enough data for it to start; until these data are produced, the consumer waits. The initial query tree is built by a preoptimizer, which performs "unnesting" of nested queries [14], transforms disjunctions into unions, and places selection nodes below join nodes on the same relations. Set operations are placed below the root projection and separated from the nodes containing predicates by further projections. So, projections, selections, and joins form "PSJ-subtrees," in which joins are gathered in a "JOIN-zone." Similarly, set operators are gathered in a zone directly below the root. In [29], we present a parser/preoptimizer producing this tree structure. Queries with aggregates in subqueries are transformed during unnesting into one main query producing the final result and a list of secondary queries executed in a specific order. Due to the semantics of the parent-child relationship of the query tree, the output of each secondary query tree must be consumed by a leaf of another (secondary or the main) tree. Thus, a single query tree is built. Join optimization is the most time-consuming part of the optimization process. We handle it separately, focusing on PSJ-trees in the next sections.
We consider trees containing PSJ-subtrees and other operators in Section 8.
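The operator-tree structure described above can be sketched as a minimal data type; the `Node` class, the `join_zone` helper, and the example relations R, S, T are purely illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One operator of the bushy query tree; edges run from the
    producer children to the consumer parent (illustrative names)."""
    op: str                                  # e.g. 'SCAN', 'SELECT', 'JOIN', 'PROJECT'
    arg: str = ""                            # relation name or predicate
    children: List["Node"] = field(default_factory=list)

def join_zone(root: Node) -> List[Node]:
    """Collect the JOIN-nodes of a PSJ-subtree (its 'JOIN-zone')."""
    zone, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.op == "JOIN":
            zone.append(n)
        stack.extend(n.children)
    return zone

# A tiny bushy PSJ-subtree: selections pushed below the joins,
# the root projection above the JOIN-zone.
tree = Node("PROJECT", "a,b", [
    Node("JOIN", "R.a = S.a", [
        Node("SELECT", "R.x > 0", [Node("SCAN", "R")]),
        Node("JOIN", "S.b = T.b", [Node("SCAN", "S"), Node("SCAN", "T")]),
    ]),
])
```

In such a representation, nodes in disjoint subtrees are candidates for bushy parallelism, while each parent-child edge is a candidate pipe.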
4 COST MODEL FOR THE JOIN-ZONE

In order to compare the QEPs produced by the optimizer, we need a function computing the cost of each PSJ-(sub)tree. In this section, we briefly present our cost function, which is analytically described in [32]. We first compute the cost of a node when executed in isolation, and then the cost of nodes executed in pipeline. The cost of a pipe is the cost of each producer until it produces enough tuples for its parent to start, plus the cost of the last node (consumer) on the pipe. Since the branches below any node of a bushy tree may be processed in parallel, the cost of a subtree is the cost of its most expensive pipe ending at the root node. Since the database operations are not CPU-intensive, we assume that as soon as a consumer-node can start processing, it processes the data sent to it at the same rate as its producer(s).

4.1 Execution Algorithms

For the execution of joins, we consider the nested loops algorithm and the merge join on sorted input. For equijoins, antijoins, and equality outerjoins, we also consider a hash algorithm. Merge join is always used when the input relations come sorted on the join attribute. If not, nested loops is selected if the inner relation fits in memory. If the hash table of the inner relation is no larger than the square root of the processor's memory size, the hash algorithm of Shapiro [24] is used. Otherwise, projections are introduced to sort the input relations, and the merge join is used. The inner relation or hash table for classic joins is selected as the largest one fitting in memory [14]. The inner relation for semijoins is the one not retained in the output. The inner relation of an antijoin or outerjoin is determined by the operator's semantics [14], [2] and cannot be changed. For restrictions, we consider two algorithms, one used when the input relation comes sorted on the restriction attribute and one used otherwise. For projections, we consider an (M − 1)-way merge sort method [14], where M is the memory size in pages. The optimizer introduces projections below a join node that requires sorted input. Hence, projections in a PSJ-subtree always sort their input, while the root projection may or may not do so. Restriction and join algorithms filter their input by eliminating attributes not used in ancestor nodes. The duplicates thus produced are not eliminated. So, projections do not filter their input but remove existing duplicates, if the aggregate functions in ancestor nodes permit so. We omit indexing from the repertoire of execution algorithms, for reasons of simplicity.

4.2 Cost Parameters

4.2.1 System Constants
In Table 1, we present the constants of our environment. We compute I/O and communication time only. We do not consider CPU time, because it is orders of magnitude lower than the other two factors.

TABLE 1
SYSTEM CONSTANTS

PG      Page size
M       Memory size in pages
tdisk   Disk page transfer time
tnet    Network page transfer time

4.2.2 Database Parameters
In Table 2, we present the parameters of our cost formulae. They refer to the relations' sizes and can typically be obtained from the information held in the data dictionary. In
these parameters, the k attributes of a relation R are denoted as R.C1, R.C2, ..., R.Ck.

TABLE 2
DATABASE PARAMETERS

NR      Number of tuples in R
LR      Tuple length for R
PR      Number of pages of R
lR.Ci   Maximum length of attribute R.Ci
nR.Ci   Number of distinct values of R.Ci
φR,q    Filtering factor of R over q attributes
φR,R    Filtering factor of R over all attributes forming R
Fe(R)   Redundancy factor of operation e on R

The number of pages of a relation R is:

    PR = (NR · LR) / PG    (1)

The filtering factor of Table 2 describes the reduction of the tuple size caused by the elimination of attributes, as mentioned in Section 4.1. If only the attributes R.Ci, i = 1..q, are retained, then

    φR,R′ = φR,q = (Σ_{i=1..q} lR.Ci) / LR.

The redundancy factor denotes the percentage of pages containing false hits on the total number of pages. The operation e is a restriction (f), a projection (p), or a join (j), where the semijoin is also treated as a join in this context. In the worst case, Fe(R) = 1.

4.2.3 Selectivity Factors
In Table 3, we present the selectivity factors of the operators. The selectivity factor SF is the ratio of tuples output from a task to the number of tuples input to the task, i.e.,

    SF = Noutput / Ninput.

Typically, the selectivity factors are available as the result of database usage statistics.

TABLE 3
SELECTIVITY FACTORS

fR      Restriction s.f. on relation R
JR,S    s.f. for the join between R, S
AJR,S   s.f. for the antijoin between R, S, where R is the outer relation
OJR,S   s.f. for the outerjoin between R, S, where R is the outer relation
SR      Projection s.f. for duplicates removal on R

The selectivity factors in Table 3 refer to tuples. The page selectivity factor PSF is defined as the ratio of pages output from a task to the pages input to it, i.e.,

    PSF = Poutput / Pinput.

The PSF for projections and restrictions on relation R can be computed as:

    PSFR = (Noutput · Loutput) / (NR · LR) = SFR · φR,output    (2)

where SFR stands for fR or SR. For joins, antijoins, and their semi-counterparts, and for outerjoins, it holds that:

    PSFR,S = ((Noutput · Loutput) / PG) / (((NR · LR) / PG) · ((NS · LS) / PG)) = SFR,S · (Loutput · PG) / (LR · LS)

Since Loutput consists of the output part of R and of S:

    PSFR,S = SFR,S · (φR,R′ / LS + φS,S′ / LR) · PG    (3)

4.3 Execution Cost of Operations

For each operator and candidate algorithm, we present a cost formula of the generic form:

    Toperator = Tinput + TintermIO + Toutput    (4)

The first term of the sum is the cost of retrieving the input. The last term is the cost of producing the output and forwarding it to the parent task or to disk. The second term is nonzero only if disk access is needed, e.g., to store intermediate results. Hereafter, we denote by t′ the time unit tdisk or tnet, depending on whether the relation is read from or written to disk or to another processor's memory.

4.3.1 Unary Operators
The cost of a restriction on a relation R is, according to (2):

    Trestriction = Ff(R) · PR · t′ + φR,q · fR · PR · t′

where Ff(R) < 1 if R comes sorted on the restriction attribute. A projection is introduced below a join to sort a relation not fitting in memory and to remove its duplicates. No attributes are filtered out, i.e., φR,R = 1.

    Tprojection = PR · t′ + PR · logM−1(PR) · tdisk + SR · PR · t′
4.3.2 Join Between Two Relations R, S
Poutput is equal to PSFR,S · PR · PS for joins and outerjoins, and equal to PSFR,S · PR for semijoins and antijoins, where PSFR,S is computed by (3). So, Toutput = Poutput · t′. If the nested loops algorithm is applied, there are no intermediate results, since the inner relation, say S, fits in memory. The algorithm needs one page of R and the whole of S to start. Let 1R denote the single page of R. Then:

    T^nl_input = max(1R · t′, PS · t′)
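The algorithm-selection policy of Section 4.1, which determines which of these join-cost formulae applies, can be sketched as follows; the function and parameter names are assumptions for illustration, with all sizes measured in pages.

```python
from math import sqrt

def choose_join_algorithm(p_inner, p_hash_table, m_pages,
                          sorted_on_join_attr, hash_applicable):
    """Pick a join algorithm following the policy of Section 4.1.

    p_inner:             pages of the inner relation
    p_hash_table:        pages of the hash table built on the inner relation
    m_pages:             processor memory size M in pages
    sorted_on_join_attr: both inputs already sorted on the join attribute
    hash_applicable:     equijoin, antijoin, or equality outerjoin only
    """
    if sorted_on_join_attr:
        return "merge join"
    if p_inner <= m_pages:
        return "nested loops"
    if hash_applicable and p_hash_table <= sqrt(m_pages):
        return "hash join (Shapiro)"
    # otherwise, sort both inputs via introduced projections, then merge
    return "sort + merge join"
```

For example, with M = 100 pages, an unsorted equijoin whose inner relation spans 500 pages but whose hash table fits in 9 pages would be assigned the hash algorithm.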
When the merge algorithm is used, the two relations are retrieved in parallel:
    T^mj_input = max(Fp(R) · PR · t′, Fp(S) · PS · t′)

The hash algorithm creates the hash table HT on the inner relation, say S, and writes it to disk for a subsequent read. Then HT and R are joined in a nested loops fashion, where HT is the inner relation. So,

    T^hj_input = max(1R · t′, PS · t′)

The cost of retrieving the successful tuples of S according to the hash table, including the possibility of collisions, is:

    T^hj_intermIO = Fj(S) · PS · tdisk

4.4 Cost of Tasks-Operations in a Pipe

Let s0 be a task and s1 be its consumer. The cost of the pipe is the cost of s0 until s1 can start execution, plus the cost of s1. Let N0 be the number of tuples processed by s0, and let SF be its selectivity factor. Further, let k be the number of tuples that must be produced by s0 for s1 to start. The number of tuples processed by s0 to produce them is computed according to the Hypergeometric Waiting Time Distribution [21]:

    N1 = k · (N0 + 1) / (SF · N0 + 1)    (5)

We compute hereafter the number of pages P1 corresponding to N1 tuples. This value must be used in the previous formulae of Tinput and TintermIO instead of the full sizes of the relations.

4.4.1 Task s0 Is Applied on a Single Relation R
If s0 is a restriction, let NR (respectively N1) be the number of tuples input to s0 (to s1), and LR (L1) be the length of the tuples input to s0 (to s1). Then, according to (5):

    P1R = (L1 · N1) / PG = K · (NR + 1) / (fR · NR + 1),  where K = (k · L1) / PG pages

If s0 is a projection, then a holding point occurs, since a projection cannot output any tuple before reading and sorting its whole input. So, P1R = PR.

4.4.2 Task s0 Is Applied on Two Relations R, S
The nested loops algorithm requires that the whole inner relation S be available for comparisons with the first page of the outer relation R. The inner relation must be read in its entirety, while only P1^nl_R pages of the outer relation are necessary to produce K output pages:

    P1^nl_R = K · PR / Poutput

Let T1, T2 be the time needed by the children of s0 to produce P1^nl_R, respectively PS. Then:

    T^nl_inpipe = max(T1 + P1^nl_R · t′, T2 + PS · t′)

The merge join algorithm is applied on already sorted input. The two relations are read at the same rate, merging the tuples satisfying the join predicate. For the generation of K pages, P1^mj pages are needed, as computed in Section 4.3:

    P1^mj = K · (Fp(R) · PR + Fp(S) · PS) / Poutput

Therefore:

    P1^mj_R = K · Fp(R) · PR / Poutput,   P1^mj_S = K · Fp(S) · PS / Poutput

Using T1, T2 similarly to the above case:

    T^mj_inpipe = max(T1 + P1^mj_R · t′, T2 + P1^mj_S · t′)

The hash algorithm requires the whole inner relation to be available before starting execution, in order to construct the hash table and begin the comparisons. Hence, as for the nested loops algorithm, only P1^hj_R pages are required to produce K output pages:

    P1^hj_R = K · PR / Poutput

Using T1, T2 as before:

    T^hj_inpipe = max(T1 + P1^hj_R · t′, T2 + PS · t′) + PS · tdisk + Fj(S) · PS · tdisk
5 PARALLEL ITERATIVE IMPROVEMENT

5.1 General Principles

The search space of all possible solutions to the join optimization problem is the graph having as nodes all query trees equivalent to the initial query [10], [35]. A state in the search space is a query tree that forms a solution and has a cost associated with it. The objective of the optimization process is to find a state with a globally minimum cost, termed the global minimum (state). Starting from an initial state, start states are generated using random transformations or augmentation heuristics [34]. A move is a transformation of random nature that is applied on a state of the search space in order to produce another state in the same space. A move is downhill if the output state has lower cost than the input state; otherwise, it is uphill. In iterative improvement, only downhill moves are permitted. The set of states reachable from a state by a single move forms its neighborhood. A state is termed a local minimum if it has lower cost than all its neighbors. Starting from a start state, a series of moves is performed until a local minimum is reached; this series is called an optimization run [35] or local optimization [10]. The local minimum with the lowest cost is the global minimum. As the size of a neighborhood is too large to be scanned exhaustively, a state is considered a local minimum if a number of consecutive moves from it are rejected as uphill [35], [19]. This number is called the sequence length; in [35] it is set equal to the number of joined relations, while in [19] it is a constant. Combinatorial optimization techniques need a termination criterion. One such criterion could be based on the usage of a lower bound of the global minimum, as suggested in [35]. However, such lower bounds are too imprecise for other than comparative purposes [35]. Hence, a time limit is necessary. This limit T is a function of the query size. In [35], [34],
the time assigned to an optimization run is T = c · N^2, where N is the number of joined relations.

Fig. 1. Dispersion of the duration of an optimization run.

We present hereafter our strategy for the optimization of large PSJ-subtrees, namely a parallel variation of iterative improvement. The states of the search space are deep and bushy query trees of the structure described in Section 3.2. As "query size" we define the number of joins in the PSJ-subtree, i.e., in the JOIN-zone; the query size is denoted in the following as n.
5.2 States and Stopping Criteria

In our model, the initial state is the query tree produced by the parser. A start state is constructed by applying a transformation of random nature on the initial state; the cost of the result is not estimated. The number of start states is in the range of the query size n. For each start state, a local minimum is constructed; the sequence length for declaring a state a local minimum is set equal to the number of joins on the tree, in analogy to the approach of [35]. Parallelism is introduced by initiating as many tasks corresponding to optimization runs as there are start states; so, n "LM-constructor" tasks are loaded on the available processors and executed simultaneously. The stopping criterion used for the termination of the execution of runs is an upper limit T on the duration of the whole optimization process. We decided to set a limit on the whole process, instead of limiting the execution time Trun of a single optimization run as in [35], because we have observed that the duration of individual runs can vary considerably, as shown in Fig. 1 for a series of runs on an example query. Thus, a limit Trun = a · n^2 for each run may prevent the completion of long runs and may never be totally consumed by short ones. This has been verified experimentally, as reported in Section 6.2. We perform optimization runs on n start states. Hence, the limit on the total execution time of the parallel machine is set to T = T(n) = c · n^2 · n. It must be noted that by "time" we mean absolute execution time and not CPU time of individual processors. In order to compute the value of the constant c, we define as
nmin the smallest JOIN-zone size for which we use iterative improvement. For smaller JOIN-zones, an exhaustive technique can be used. For this boundary size, the iterative improvement technique is assigned the same time span t_nmin required by an exhaustive technique. We use the exhaustive technique proposed in [30] as our reference technique, because it relies on the same principles and cost function. Thus:

    T(nmin) = c · nmin^3 = t_nmin  ⇒  c = t_nmin / nmin^3    (6)
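The calibration of (6) can be written out directly; dividing c by the number of processors p is an assumption about the exact form of the inverse-proportionality remark, and the function names are illustrative.

```python
def calibrate_c(t_nmin, n_min, p=1):
    """Eq. (6): at the boundary size n_min, iterative improvement gets the
    same time span as the exhaustive reference technique of [30].
    The 1/p scaling (assumed here to be exact) reflects that c decreases
    in inverse proportion to the number of processors p."""
    return t_nmin / (n_min ** 3 * p)

def total_time_limit(n, c):
    """T(n) = c * n^2 * n: a per-run budget of order n^2, times n runs."""
    return c * n ** 2 * n
```

By construction, `total_time_limit(n_min, calibrate_c(t_nmin, n_min))` returns t_nmin again.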
The constant c further depends on the number of processors p on the machine.¹ In fact, c decreases in inverse proportion to p. Using this function T, we observed that the number of local minima is sufficiently large and does not change drastically with the query size. Moreover, the optimization overhead remains lower than the overhead of exhaustive techniques, which increases exponentially with the query size.
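A single optimization run under the downhill-only acceptance rule and the sequence-length stopping criterion described above can be sketched as follows; the function interface, the fake clock, and the toy integer search space are illustrative assumptions (real states are query trees costed as in Section 4).

```python
import random
from itertools import count

def optimization_run(start_state, cost, random_neighbor, seq_len, deadline, now):
    """One iterative-improvement run: accept only downhill moves, and
    declare a local minimum after seq_len consecutive uphill rejections
    or when the global time limit (deadline) expires.
    cost, random_neighbor, and the clock now() are caller-supplied."""
    state, c = start_state, cost(start_state)
    rejected = 0
    while rejected < seq_len and now() < deadline:
        candidate = random_neighbor(state)
        c_new = cost(candidate)
        if c_new < c:          # downhill: accept, reset the rejection count
            state, c = candidate, c_new
            rejected = 0
        else:                  # uphill: reject (plain iterative improvement)
            rejected += 1
    return state, c

# Demo on a toy search space: minimize x^2 over the integers.
random.seed(0)
ticks = count()                # a fake clock counting loop iterations
best, best_cost = optimization_run(
    start_state=10,
    cost=lambda x: x * x,
    random_neighbor=lambda x: x + random.choice([-1, 1]),
    seq_len=50,
    deadline=10**6,
    now=lambda: next(ticks),
)
```

The same skeleton applies to query trees: `random_neighbor` would perform a ROLL move and `cost` would be the COST function over the resulting PSJ-subtree.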
1. For reasons of simplicity, t_nmin is the time of sequential execution, so that parallelism can be introduced independently later on.

5.3 Set of Moves on the Query Tree

Our model uses a single move, ROLL. ROLL is a complex operation based on the swapping of adjacent JOIN-nodes, and has a number of parameters selected at random, namely:

• the JOIN-node J to be moved,
• the direction of the move, downwards or upwards,
• the depth of the move, indicating the number of nodes with which J will be swapped,
• for downward moves only: the child with which J is swapped in each individual swap, i.e., the left or right child.

Outerjoins are excluded from JOIN-node selection, because they do not always commute with normal joins [2]. If swapping of an outerjoin occurs during a ROLL, the move
stops there and the swap is cancelled. Thus, the impact of an outerjoin on the acceptability of a move is limited to the algorithm selected in each state. The ROLL(QT, J, direction, d) move, depicted in Fig. 2, takes as parameters the query tree (actually the JOIN-zone) QT, the JOIN-node J to be moved, the direction of the move, and the depth d of the move. A ROLL move consists of multiple SWAP(QT, J1, J2) operations, where the JOIN-node J2 must shift position with its parent J1. The SWAP(¹) operation is shown in Fig. 3.
435
ROLL(QT, J, direction, d)
    copy the tree QTold to QTnew;
    u = 1; Ju = J;
    if direction is towards the root
        find the ancestor of J at depth d;
        repeat d times
            SWAP(QTnew, parentOf(Ju), Ju);
            u = u + 1; Ju = parentOf(Ju);
        end-repeat;
    else
        repeat d times
            randomly select a side-direction;
            if side-direction is LEFT
                child = leftChildOf(Ju);
            else
                child = rightChildOf(Ju);
            endif
            SWAP(QTnew, Ju, child);
            u = u + 1; Ju = child;
        end-repeat;
    endif
    oldCost = COST(QTold); newCost = COST(QTnew);
    if oldCost > newCost
        replace QTold with QTnew; discard QTold;
    else
        discard QTnew;
    endif

Fig. 2. The ROLL(·) operation.

SWAP(QT, J1, J2)
    if the position to be obtained by J1 is occupied by a SELECT-node S2
        place J1 in this location;
        merge the subtree rooted at S2 with the subtree below J1 applied on the same relation;
    else if the position is occupied by a JOIN-node J3
        insert J1 between J2 and J3;
        adjust the attributes passed to/from the repositioned node;
    else
        place J1 in the free position;
    endif

Fig. 3. The SWAP(·) operation.

A ROLL move is legal if none of its swap operations attempts to place a node J1 below a node J2 for which J1 does not produce the required input. The query tree structure proposed in [26] ensures that all moves are legal. Using our cost model, as encapsulated in the function COST(QT), we decide whether a move should be accepted as downhill or rejected as uphill. If the move is accepted, the old tree is replaced by the new one. The impact of a legal ROLL(J, upwards, 2) move is shown in Fig. 4, where only the affected subtree of the JOIN-zone is presented.

Fig. 4. A legal ROLL move on a JOIN-zone.

5.4 Parallel Construction of Local Minima

Our variation of Iterative Improvement is adapted for parallel environments where the number of processors is less than the number of start states to be processed. We assume a machine with p processors, on which we load the n CPU-intensive "LM-constructor" processes, i.e., optimization runs on the start states to construct Local Minima. We also initiate one "JOIN-zone coordinator" process, intended to supervise the LM-constructors. If there are more LM-constructor processes than processors, roughly n/p processes are loaded on each processor, and multitasking with time slicing is used. This has the advantage that all start states are processed simultaneously, and the impact of long runs is limited to the processor they are loaded on. The parallel processing algorithm is shown in Fig. 5.

PARALLEL-LM-CONSTRUCTION
[INITIALIZATION:] The initial state and a "start parameter" are forwarded to the LM-constructors. This start parameter is used by the transformation operator for the generation of each of the n start states and is different for each LM-constructor.
[LM-CONSTRUCTOR:] Generation of a start state and execution of an optimization run on it, to find a local minimum.
[LM-CONSTRUCTOR:] Forwarding of the found local minimum to the coordinator and termination.
[LM-CONSTRUCTOR:] Upon expiration of the preset optimization time T, forwarding of the last state constructed to the coordinator and termination.
[COMPLETION:] Selection of the least cost state by the coordinator. Continuation of the optimization run by the coordinator, if this state is not a local minimum.

Fig. 5. Algorithm for parallel LM-construction.
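The skeleton of parallel LM-construction can be sketched as follows, simulated sequentially under a common deadline (as in a testbed implementation); the integer state space, quadratic cost function, and neighbor operator are illustrative stand-ins for QEPs and the actual cost model:

```python
import random
import time

def lm_constructor(start_param, deadline, cost, neighbors):
    """One LM-constructor: generate a start state from its start parameter,
    then run iterative improvement until a local minimum or the deadline."""
    rng = random.Random(start_param)          # the "start parameter"
    state = rng.randint(0, 10_000)            # toy start state (integer state space)
    while time.monotonic() < deadline:
        best = min(neighbors(state), key=cost)
        if cost(best) >= cost(state):         # no downhill move left: local minimum
            return state
        state = best                          # accept the downhill move
    return state                              # time expired: forward the last state

def parallel_lm_construction(n, time_span, cost, neighbors):
    """Coordinator: run n LM-constructors under a common time span and keep
    the least-cost state (simulated sequentially here; a real system loads
    the constructors on p processors with time slicing)."""
    deadline = time.monotonic() + time_span
    minima = [lm_constructor(seed, deadline, cost, neighbors) for seed in range(n)]
    return min(minima, key=cost)

# Toy landscape standing in for QEP costs: quadratic cost over integer states.
cost = lambda s: (s - 4321) ** 2
neighbors = lambda s: [s + d for d in (-3, -2, -1, 1, 2, 3)]
print(parallel_lm_construction(8, 1.0, cost, neighbors))   # → 4321
```

Each constructor is deterministic given its start parameter, so long runs only affect the constructors sharing their processor, as in Fig. 5.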
The conceptual topology of the LM-constructors and their coordinator is a tree rooted at the JOIN-zone coordinator. The parent of an LM-constructor is either the coordinator or another LM-constructor. This topology is analyzed in [30]. We exploit the tree structure in order to reduce the communication overhead during the forwarding of local minima to the coordinator (completion phase). A processor Px on a route towards the coordinator may discard a local minimum received from another processor Py, if the state currently processed by Px already has lower cost. So, communication is reduced and localized.
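This discarding of dominated states along the route to the coordinator amounts to a minimum-reduction over the process tree. A minimal sketch, with a hypothetical node class standing in for the processes:

```python
from dataclasses import dataclass, field

@dataclass
class OptNode:
    """A process in the conceptual topology: an LM-constructor holding its
    own local minimum cost, with child LM-constructors below it."""
    local_min_cost: float
    children: list = field(default_factory=list)

def forward_minimum(node):
    """Forward local minima towards the coordinator. Each node compares the
    minima received from its children against its own state and forwards only
    the best one, so worse states are discarded close to where they arose."""
    best = node.local_min_cost
    for child in node.children:
        best = min(best, forward_minimum(child))   # discard the worse state locally
    return best

# A small topology rooted at the JOIN-zone coordinator (inf = pure coordinator).
tree = OptNode(float("inf"), [
    OptNode(40.0, [OptNode(55.0), OptNode(31.0)]),
    OptNode(27.0, [OptNode(90.0)]),
])
print(forward_minimum(tree))   # → 27.0
```

Only one state per subtree ever travels upward, which is the source of the reduced and localized communication.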
6 BEHAVIOR OF THE PARALLEL ITERATIVE IMPROVEMENT TECHNIQUE

6.1 Rationale and Relationship with Other Techniques

In our technique for large join query optimization, parallelism is exploited to improve the quality of the results and to reduce the optimization overhead. In this respect, our technique is innovative, since previous studies on combinatorial query optimization were based on sequential algorithms. Our technique processes bushy query trees. In models like [1], [7], [18], one bushy tree is created by a constructive algorithm, which is intended to find the optimal way of performing the joins. In models like [35], [34], [12], [9], [10], an initial tree is created using some mechanism (possibly an augmentation heuristic) and a randomized algorithm is applied on it to transform the initial state into an optimal one. Models adopting the first approach use heuristics and/or fix certain parameters of the problem (like the choice of the join algorithm to be employed [1]) in order to reduce the search space. Models adhering to the second approach do not attempt to reduce the search space in advance, and are therefore applicable in more general cases. We opted for the second approach, as noted in Section 2. The incorporation of parallelism parameters into our cost function increases its complexity, as well as the size of the search space. This is circumvented by using a combinatorial optimization technique rather than an exhaustive strategy. Such techniques guarantee convergence to the global minimum, but in an arbitrary amount of time. Since the duration of the optimization phase cannot be unbounded, we had to choose a technique that shows satisfactory performance under timing constraints and which can be parallelized. According to the results of [35], [9], we considered iterative improvement and simulated annealing.
As pointed out in [9], simulated annealing should not be subject to stopping criteria based on timing, both because it converges slowly and because it occasionally accepts states of high cost. Moreover, simulated annealing relies on certain assumptions on the shape of the search space [10]; the time it needs to identify local minima increases if these assumptions do not hold. On the other hand, iterative improvement shows good performance under timing constraints [35] and does not rely on any assumptions on the search space. Its performance is satisfactory, and its convergence is faster than that of simulated annealing [11].
Moreover, iterative improvement is parallelizable by nature, and a timing criterion can be used to ensure that the execution time on each processor is roughly the same. For simulated annealing, only the variation proposed in [16] may be appropriate for parallel implementation. However, even in this variation the load balancing on the different processors may vary considerably. We have, therefore, selected iterative improvement as the basis of our model. A disadvantage of iterative improvement, identified in [10], is that it may produce local minima of poor quality if the search space contains "wells." This disadvantage is bypassed in our model by using parallelism to build a large number of local minima, thus decreasing the probability of having all of them end inside a well of poor quality. It should be noted that even simulated annealing can exit from wells only with a certain probability. Parallel iterative improvement has been recently used in the model of [17]. This model differs from ours both in the exploitation of parallelism, as they consider intraoperator parallelism but not pipelining, and in the optimization goal. Moreover, the architecture they use (a hypercube) implies some design decisions that further reduce the search space to be explored.
6.2 Experimental Results

Four implementations of our technique are currently operational: A sequential version on a Sun workstation has been used as our simulation and testbed platform. A parallel version runs on a 20-transputer PARSYTEC machine; this version is also coupled with a parallel query execution model.² A more sophisticated parallel version runs on a 512-transputer PARSYTEC machine.³ The most recent version is distributed and runs on a network of Sun workstations, using sockets for the communication between the coordinator and the LM-constructors.

The experiments presented here have been performed on the sequential testbed implementation, which simulates parallel execution on p processors by distributing the preset execution time among them and performing the processing tasks of each processor in turn. For the study of the behavior of the technique in terms of optimality of the results, there is no difference between the parallel and sequential implementations, since the technique uses the same total amount of time.

For the parameter settings of the cost model, we have used a database schema with relations having 1-10 join attributes and 1,000 to 100,000 tuples. As we do not have a fully fledged DBMS to gather query statistics, we have used constant values for the selectivity factors in the ranges assumed in [23]. For the system parameters, we have assumed a page size of 1,024 bytes and a network transfer time equal to half the disk access time. While this might seem a modest assumption, given the bandwidths of modern networks, the impact of the setting is only quantitative and does not affect the validity and quality of our results.

2. Part of this work was performed within the ESPRIT Parallel Computing Action 4021 (PCA), partially funded by the European Community.
3. Part of this work was performed within the ESPRIT Project GP-MIMD, partially funded by the EC.
6.2.1 Comparative Results

In order to verify the performance of our parallel iterative improvement variation, we have compared it with a reference technique of exhaustive nature. Comparisons of a new technique against a reference technique, rather than against the output of an actual query processor, are quite usual for studies on large join queries [19], [33], [3]. For our technique, we have decided to use the output of an exhaustive optimizer as a measure of the optimal QEP. The most widespread reference optimizer is that of System R [23], used also in the distributed R*. However, R* is a distributed DBMS, not a shared-disk parallel system. Moreover, its optimizer uses left-deep instead of bushy query trees. We have, therefore, used the exhaustive optimization technique of [30] as reference technique. Due to the time and memory limitations of exhaustive techniques, we have experimented with modest query sizes. We have tuned our iterative improvement method by setting its parameters in accordance with the exhaustive technique. In particular, the constant c = c(p) was computed from Equation (6) by estimating t_n_min for the exhaustive
technique on a single processor and dividing the result by the number of processors used for the construction of local minima, as noted in Section 5.2. In Fig. 6, we show the results of optimization for queries with 20 to 33 joins, comparing the proposed technique with the exhaustive one. The initial cost per query is the cost of the initial state received by the optimizer. The Global Minimum produced by the proposed technique is considerably lower than the initial cost, while it is very close to the cost of the Optimal Plan produced by the exhaustive technique. The latter obviously produces better optimal plans, since it scans the search space exhaustively. However, the duration of the exhaustive technique increases exponentially with the query size, while the duration of the proposed technique increases polynomially. Further details on this comparative experiment can be found in [31], where we focused on the exhaustive technique.
We have also studied the impact of time units allocation on the optimization runs. As noted in Section 5.2, we introduce a total optimization time span T(n), which is distributed among the p processors, each one running for a span T(n)/p. We have compared this approach to the one used in [35], according to which a time limit T_run is placed on the execution of each optimization run. For our experiments, we have used two sets of queries (set-A and set-B) over the database described above. The two sets differ in the connectivity of the graphs of the queries they contain:⁴ for the same query size, the query graph in set-B has more edges than the one in set-A. Since we are not limited by the resource demand and duration of the exhaustive technique, we have experimented with larger queries, having 50 to 250 joins. The number of processors was set to one-tenth of the query size. In Figs. 7 and 8, we show the performance of our technique, which uses a Processor-based Time Limit (PTL), versus the variant placing a Time Limit on each optimization Run (RTL). PTL outperforms RTL in all cases, indicating that the time limit on the run may indeed affect the quality of the results.
Fig. 7. PTL versus RTL technique for set-A.
Fig. 8. PTL versus RTL technique for set-B.
Fig. 6. Performance of the Parallel Iterative Improvement Technique versus an Exhaustive Parallel Technique for a series of join queries.
The superiority of our technique PTL versus RTL shows that, although the actual duration of a run is in the range of a · n², it may diverge from this average in both directions. If

4. The query graph is a graph having as nodes the relations of the query and edges denoting the joins between the relations.
a time limit on the runs is applied, as in RTL, the unused time of short runs is wasted. PTL assigns this unused time to longer runs that converge more slowly. Although the duration of optimization is c · n³ for both PTL and RTL, PTL shows a more judicious processor utilization. In RTL, a short run implies an idle time for the processor executing it. In PTL, this extra time will be used by another, longer run on the same processor. Dynamic reallocation of long runs to idle processors can also be supported under PTL, to compensate for the placement of many short runs on the same processor.
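The difference between the two allocation policies can be illustrated with a small accounting sketch on one processor; the run durations are hypothetical:

```python
def rtl_usage(run_times, t_run):
    """Run-based Time Limit: every run is cut off at t_run; the time left
    over from short runs is wasted. Returns (time used, runs completed)."""
    return sum(min(t, t_run) for t in run_times), sum(t <= t_run for t in run_times)

def ptl_usage(run_times, budget):
    """Processor-based Time Limit: the runs loaded on one processor share the
    processor's whole time span, so short runs donate their unused time."""
    used, done, left = 0.0, 0, budget
    for t in sorted(run_times):          # shorter runs release time for longer ones
        if t <= left:
            used, done, left = used + t, done + 1, left - t
        else:
            used, left = used + left, 0.0
            break
    return used, done

runs = [1.0, 2.0, 9.0]                   # hypothetical run durations on one processor
print(rtl_usage(runs, 4.0))              # → (7.0, 2): the 9.0-run is cut off
print(ptl_usage(runs, 12.0))             # → (12.0, 3): leftover time finishes it
```

With the same total budget (three runs at 4.0 units versus one processor span of 12.0 units), RTL wastes the slack of the short runs while PTL lets the long run complete.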
6.2.2 Qualitative Results

In [9], the shape of the search space for left-deep and bushy trees has been extensively studied for a uniprocessor architecture. This shape is determined by the cost model and affected by the transformation rules used to explore the search space. Using our own transformation rules and cost model, we have also performed a preliminary experiment to observe the distribution of states and local minima. In [33], it has been noted that the behavior of the optimization technique is affected by the connectivity of the query graph. In our experiment, we have studied two 120-join queries, queryA and queryB. The graph of queryB has higher connectivity than that of queryA for the same number of joins. In Figs. 9 and 10, we show the distribution of states and local minima in the search space of the two queries. We have generated a sample of 500,000 QEPs per query by applying iterative improvement without a time limit. We have divided the cost ranges of those QEPs into 10 logarithmically specified groups, grouped the cost values of states and local minima into them, and counted the frequency of occurrence. The frequencies are normalized and presented as probability distributions. The normalization shows a strong correlation between the distribution of states and that of local minima in both queries; the correlation factor is more than 0.98 for both. In Fig. 9, we observe a uniform distribution, slightly shifted to the left. This indicates that the number of local minima with high cost values is not negligible, which implies the need for a large number of local optimizations, as performed by iterative improvement. Simulated annealing may fail to improve those expensive local minima satisfactorily. In Fig. 10, we observe a shifting of the values towards the J-distribution. Ioannidis and Kang have identified a J-distribution of local minima and states in their search space [9]. This implied the existence of areas where states of low cost are gathered.
Our results indicate that their observation also holds in parallel spaces, but that the distribution is affected by the connectivity of the query graph: a graph of low connectivity produces a distribution closer to the uniform one than to the J-distribution. However, it should be noted that our experiments are not yet as extensive, our system architecture is parallel and, consequently, our cost function is very different from theirs.
Fig. 9. Distribution of the cost of local minima and states for queryA.
Fig. 10. Distribution of the cost of local minima and states for queryB.
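The grouping into 10 logarithmically specified cost groups with normalized frequencies can be sketched as follows; the sample data is synthetic:

```python
import math

def log_binned_distribution(costs, n_bins=10):
    """Group cost values into n_bins logarithmically spaced groups and
    normalize the frequencies into a probability distribution."""
    lo, hi = math.log10(min(costs)), math.log10(max(costs))
    width = (hi - lo) / n_bins or 1.0        # guard against identical costs
    freq = [0] * n_bins
    for c in costs:
        b = min(int((math.log10(c) - lo) / width), n_bins - 1)
        freq[b] += 1
    return [f / len(costs) for f in freq]

# Synthetic sample of QEP costs spanning four orders of magnitude.
sample = [10 ** (1 + 4 * i / 999) for i in range(1000)]
dist = log_binned_distribution(sample)
print(len(dist), round(sum(dist), 6))        # → 10 1.0
```

A sample uniform in log-space yields a flat distribution; applying the same binning to states and to local minima separately allows the correlation of the two distributions to be measured.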
7 AN EXTENSION: TWO-PASS PARALLEL ITERATIVE IMPROVEMENT We have developed an extension to the technique described thus far, which performs two passes over the search space. Its aim is to produce an intermediate global optimal plan of reasonable quality within a shorter time than needed by the single-pass technique, and then attempt to improve its quality by performing optimization runs that are the best candidates to produce local minima of lower cost. The algorithm and results in this section have appeared as a poster in [27].
7.1 Analysis

The first pass of the extended technique is similar to the procedure described thus far, except that the time span assigned to it is T1 = c1 · n³, with c1 < c. By the end of the first pass, the coordinator obtains the least cost local minimum and all states produced by runs interrupted before reaching a local minimum. We call these states "pending states." The state with the lowermost cost produced in Pass 1 is called the Intermediate Global Optimal Plan (IGOP); it is not necessarily a local minimum. Some of the pending states of Pass 1, one of which may be the IGOP itself, are processed in a second scan over the search space, by initiating LM-constructors for them and creating local minima for a time span T2 = c2 · n³, where c2 < c − c1. The state with the lowermost cost produced in Pass 2 is the Global Minimum;
if it is not a local minimum, its run is allowed to continue to completion by the coordinator, as for the Global Minimum of the single-pass technique. An optimization run ending in a pending state is observed as a process r that stepwise decreases the cost of its Start State, cost_StS_r, towards the cost of the IGOP, cost_IGOP, which is by definition the lowest cost achieved in Pass 1. Thus, each run r approaches the IGOP according to the following coefficient of the speed at which the cost of the states produced by r converges towards cost_IGOP:
speed-coeff(r) = dMoves_r · (cost_PS_r − cost_IGOP) / (cost_StS_r − cost_PS_r)

where dMoves_r is the number of downhill moves performed by r and cost_PS_r is the cost of the final pending state of r. The speed coefficient informally corresponds to the number of downhill moves that a run would further need to reduce the cost of its pending state as low as cost_IGOP, given the number of downhill moves it needed to reach this pending state.

For Pass 2, interrupted optimization runs are queued in increasing order of the value of the speed coefficient. The reason for the queuing is that Pass 2 is assigned a time span T2 much smaller than T1, within which only some of the interrupted optimization runs will be continued. Therefore, the selected runs must be the best candidates for producing a final state with cost lower than the IGOP cost. The criterion for this selection and placement in the queue is the speed of cost decrease of each run producing a pending state, as expressed by the speed coefficient. In Pass 2, each processor processes a single pending state, using all its resources to produce a local minimum. If a local minimum is found within the time span T2, the processor receives another pending state from the coordinator maintaining the queue. The policy of loading only one LM-constructor to each physical processor reflects the fact that the pending states must be processed according to the ordering by speed coefficient: the good candidates for improving the quality of the IGOP must have enough time and resources to run to completion.

Fig. 11. Speed coefficients of runs relative to the IGOP.

7.2 Experimental Results

For the comparison of the single-pass and two-pass parallel variations of iterative improvement, we used a series of join queries with 20 to 40 joins. In order to specify the time spans T1 and T2, we have set c1 = 2c/3 and c2 = (c − c1)/2, i.e., c2 = c/6. For a query with 31 joins, the speed coefficients for the interrupted runs by the end of Pass 1 are shown in Fig. 11 in ascending order. The further to the right a run appears on the X-axis, the smaller is the probability that it will be initiated and performed to completion in Pass 2. The enumeration of the runs in the figure is the enumeration of their start states. Run 13 has produced the IGOP. In order to study the behavior of runs during Pass 2, we relaxed the time limit T2 and allowed all runs interrupted in Pass 1 to complete. In Fig. 12, we see the final state each of them would have reached, if time permitted. Optimization runs that have reached a local minimum by the end of Pass 1 do not contribute to Pass 2 and thus have the same final state at the end of both passes. As can be seen from the figure, the pending states at the left side of the X-axis, i.e., those placed at the beginning of the queue, produced results of generally good quality during Pass 2.

Fig. 12. Evolution of optimization runs in Pass 1 and Pass 2.
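The speed coefficient and the resulting Pass-2 queue can be sketched as follows; the run statistics are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PendingRun:
    """An optimization run interrupted at the end of Pass 1."""
    name: str
    cost_start: float     # cost of the run's start state
    cost_pending: float   # cost of its final (pending) state
    downhill_moves: int   # downhill moves performed so far

def speed_coeff(run, cost_igop):
    """Estimated number of further downhill moves the run needs to push its
    pending state down to the IGOP cost, given its rate of cost decrease
    (equivalent to dMoves * (PS - IGOP) / (StS - PS))."""
    rate = (run.cost_start - run.cost_pending) / run.downhill_moves
    return (run.cost_pending - cost_igop) / rate

def pass2_queue(pending, cost_igop):
    """Queue interrupted runs in increasing order of the speed coefficient:
    the best candidates for beating the IGOP come first."""
    return sorted(pending, key=lambda r: speed_coeff(r, cost_igop))

runs = [PendingRun("r1", 1000, 300, 70),   # slow: many moves, still far away
        PendingRun("r2", 1000, 320, 20),   # fast: few moves, similar distance
        PendingRun("r3",  900, 250, 40)]
print([r.name for r in pass2_queue(runs, cost_igop=200.0)])   # → ['r3', 'r2', 'r1']
```

The run with the cheapest pending state is not necessarily first in the queue: what matters is how few further downhill moves the run is expected to need.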
In Fig. 13, we compare the quality of the results of the single-pass and the two-pass techniques for queries with 20 to 40 joins. In the figure, we denote as 1-GM the global minima produced by the single-pass variation, as IGOP the intermediate global optimal plans produced by the end of Pass 1 of the two-pass variation, and as 2-GM the global minima produced by the end of Pass 2. It must be stressed that the time span T required to produce the 1-GM plans is larger than the time span T1 + T2 required to produce the 2-GM plans. The overhead of ordering the pending states by speed coefficient is less than T − (T1 + T2) and can be performed using a heapsort algorithm. So, the sort is partially overlapped with the forwarding of the sorted states to the processors. As shown in the figure, the global minima of the two-pass technique are satisfactorily close to the corresponding 1-GMs.
8 PARALLEL OPTIMIZATION OF A QUERY WITH MANY PSJ-SUBTREES As analyzed in Section 3.2, the query tree for a general query consists of the root PROJECT-node, nodes containing set operators and/or aggregates, and PSJ-subtrees producing the intermediate relations on which the set operations and aggregates are applied. We present hereafter the complete optimization scheme for such query trees.
8.1 PSJ-Subtrees in the Query

A PSJ-subtree can be optimized almost independently of its ancestor and completely independently of its sibling PSJ-subtree(s). For PSJ-subtrees with at least n_min joins, as defined in Section 5.2, we use parallel iterative improvement. For PSJ-subtrees with fewer than n_min joins, we use the exhaustive technique presented in [30].
8.2 Impact of Set Operators Let QT be a query tree consisting of a set operator and two PSJ-subtrees. We split this tree into three parts QTset−op, QT1, QT2, of which QT1, QT2 are the left and right PSJ-subtrees including the projections below the set operator, while QTset−op contains the root of the query tree and the set operator. These subtrees can be optimized in parallel: the optimization of each of QT1, QT2 involves one JOIN-zone coordinator process and a number of subordinate processes for the exhaustive technique or for iterative improvement, depending on the subtree size. QTset−op is optimized independently.
So, it is sufficient to forward the desired sort order for QTset-op to the JOIN-zone coordinators, which will specify that the root projections of QT1 and QT2 must sort their input accordingly. Decisions on sort orders of joined relations, as taken during optimization of the PSJ-subtrees, should not be affected, since the execution cost of a PSJ-subtree is by far larger than the cost of QTset-op. In the general case of multiple set operators appearing below the root of the query tree, a child of any set operator is either a PSJ-subtree or another set operator. So, we extend the previous pattern by gathering all set operators into QTset-op and having more than two PSJ-subtrees QT1, QT2, QT3, …. Since the relations input to all set operators must have the same structure to be set-operator-compatible, a single sort order is specified for all relations and is propagated to the underlying JOIN-zone coordinators. Therefore, the optimization scheme for a query tree comprised of set operators and PSJ-subtrees consists of a "first level coordinator" and as many "JOIN-zone coordinator" processes as there are PSJ-subtrees. The first level coordinator retrieves the query tree from the parser, retains the QTset-op, and forwards the PSJ-subtrees to JOIN-zone coordinators. These subtrees are optimized in parallel. Upon completion of the optimization of QTset-op, the first level coordinator propagates the required sort order to the JOIN-zone coordinators. Finally, the first level coordinator receives the optimal execution plans for the PSJ-subtrees and merges them with the plan for QTset-op into a plan for the whole query tree. In Fig. 14, a query tree with two set operators is presented, for which a first level coordinator and three JOIN-zone coordinators are initialized. We assume that the PSJ-subtrees are large enough to be optimized using parallel iterative improvement.
Fig. 14. Optimization of a query tree with set operators and PSJ-subtrees.

Fig. 13. Optimal query execution plans produced by the parallel single-pass and two-pass variations of iterative improvement.
The three execution plans have two interdependence points concerning the order of the relations input to the set operator, because we use a merge algorithm on sorted input for all set operators. The optimization of QTset−op will almost certainly finish before the optimization of QT1, QT2.
8.3 Impact of Aggregate Functions 8.3.1 Query Tree Partitioning If the initial query contains aggregates in subqueries, the equivalent query tree consists of a main tree and a number of secondary ones. Each of these trees may also have aggregate nodes and nodes produced from HAVING clauses appearing above the set operators and below its root. These
trees are connected via dataflow edges. Since nodes cannot migrate from one tree to the other, the trees can be optimized concurrently. Therefore, the optimization scheme of the previous section is extended to the following: A "zero level coordinator" process retrieves the initial query tree and splits it into a main query tree QT_Q1 and a number of secondary trees QT_Q2, QT_Q3, …. Subtree QT_Qi is assigned to first level coordinator 1st-C_i, which splits it into a number of PSJ-subtrees and a subtree QT_Qi,aggr&set containing the set operators and the aggregates. The coordinator 1st-C_i optimizes QT_Qi,aggr&set, enforces the desired sort orders upon the underlying JOIN-zone coordinators, and merges their optimal plans with the optimal plan it produced. Finally, the zero level coordinator receives the optimal plans for the main and secondary query trees, and merges them into a single plan for the whole query tree.
8.3.2 Treatment of Interdependencies

The interdependence among the main and the secondary trees concerns the sort order of the relations output by secondary query trees and input to other trees. Since ordered relations are essential for the optimization of a PSJ-subtree receiving the output of a secondary query tree, we adopt the following policy: the optimization techniques used on JOIN-zones (i.e., the parallel iterative improvement variation and the parallel exhaustive technique) consider for each joined relation produced by a secondary tree all sort orders of interest. Then the root PROJECT-node of the secondary tree is enforced to sort the relation it produces as desired. The cost of this sorting operation is negligible compared to the cost of a whole PSJ-subtree. This policy causes a slight delay in the optimization of a secondary tree QT_Qi,aggr&set by the first level coordinators, since an algorithm can be assigned to its root only after the PSJ-subtree receiving its output is optimized. However, this delay only affects a single node and can be ignored when compared to the long time required to optimize the PSJ-subtrees. Finally, for the optimization of a PSJ-subtree, the cost of each secondary tree attached to it must be known as early as possible. Therefore, the first level coordinator of each secondary tree QT_Qi sends the cost of its initial state to the first level coordinator of the query receiving the output of QT_Qi, to be used in the cost estimations of the PSJ-subtree. This approximate cost is higher than the cost yielded by the optimal plan of QT_Qi and is, therefore, replaced by a better estimate whenever one is available: an early local minimum (or the intermediate global optimal plan produced at the end of Pass 1 of the two-pass iterative improvement variation) is such an estimate of better quality. This is a further advantage of iterative improvement, which allows parallel optimization with good approximations of the optimal subplans, whenever they are needed.

EXAMPLE. In Fig. 15, we present the query tree of Fig. 14 extended to have a secondary tree in one of its branches. This secondary tree has a SUM-node below its root and no set operators. So, it contains a single PSJ-subtree.
8.3.3 Optimizer Tree

In Fig. 16, we depict the optimization scheme for a query tree comprised of a main and a secondary query tree, the former having one set operator. All PSJ-subtrees are assumed to be optimized using the parallel iterative improvement technique. The optimizer takes the form of a tree, the "optimizer tree," consisting of coordinator nodes and LM-constructors. On this tree, data (nonoptimized query subtrees) flow from the root, the zero level coordinator, towards the first level coordinators, JOIN-zone coordinators, and LM-constructors. The optimal QEPs for the subtrees flow from the leaves towards the root. They are merged at each level, up to the optimal QEP for the whole query tree, which is output by the zero level coordinator. The number of nodes at each level of the optimizer tree depends on the contents of the query tree it processes. The specific optimizer tree contains the zero level coordinator, two first level coordinators, three JOIN-zone coordinators, and as many LM-constructors per PSJ-subtree as there are joins in the subtree.
8.4 Processor Assignment in the Parallel Optimizer

In order to integrate our parallel iterative improvement technique for large JOIN-zones into the global optimization scheme presented in this section, the processors must be assigned to the independently optimizable subtrees in a balanced way. Furthermore, the time spans assigned to the PSJ-subtrees optimized by iterative improvement must be set using a global policy. The LM-constructors are the most CPU-intensive optimization tasks. First-level coordinators optimize subtrees of rather simple structure and monitor JOIN-zone coordinators, passing cost estimates among them. JOIN-zone coordinators and zero-level coordinators are supervision tasks, which only distribute the initial data and gather the final results. Coordinator tasks have a very low CPU overhead. They are rather communication-bound, but the communication among tasks is negligible compared to the overall CPU cost of query optimization, as pointed out in [30]. According to these observations, the following policy is adopted for the assignment of p processors to the parallel optimizer's subtasks: Let a query contain k PSJ-subtrees, QT_1, QT_2, …, QT_k with n_1, n_2, …, n_k joins, respectively. For PSJ-subtrees of size n_min or more, the iterative improvement technique is used. The duration of the optimization of those subtrees can be computed as described in Section 5.2. For PSJ-subtrees of size less than n_min, the parallel exhaustive technique of [30] is used. Then, as mentioned in Section 5.2, the duration of the optimization of those subtrees is upper-bounded by T_n_min.
If the whole query were optimized by a single processor, the time needed to optimize all PSJ-subtrees would be:

$$T_{PSJ}^{single} = \sum_{i=1}^{k} T_i = \sum_{i=1}^{k} c_0 \cdot n_i^3 = c_0 \cdot \sum_{i=1}^{k} n_i^3$$
where $T_i$ is the optimization time of PSJ-subtree $QT_i$. If the size of $QT_i$ is $n_i \geq n_{min}$, $T_i$ is computed according to the principles of iterative improvement, as described in Section 5.2, using a constant $c_0$ for a uniprocessor machine. If $n_i < n_{min}$, we set $T_i$ equal to $T_{n_{min}}$, i.e., $n_i$ is replaced by $n_{min}$.

Fig. 15. Optimization of a query tree with set operators, PSJ-subtrees, and a secondary query tree with an aggregate function.

Fig. 16. Optimizer tree for the query tree of Fig. 15.
Let $p_{PSJ}$ be the number of processors available for the optimization of PSJ-subtrees. We assign $p_i$ processors to $QT_i$ for the computation of local minima (local optima for the exhaustive technique):
$$p_i = \max\left(1,\; p_{PSJ} \cdot \frac{T_i}{T_{PSJ}^{single}}\right)$$

This equation indicates that $QT_i$ will be optimized in parallel only if its expected optimization time justifies the usage of parallelism. Then, the duration of parallel optimization for $QT_i$ becomes $T_i' = T_i / p_i$. So, for each $QT_i$ it holds that:

$$T_i' = \frac{T_i}{p_i} = \frac{T_i}{p_{PSJ} \cdot T_i / T_{PSJ}^{single}} = \frac{T_{PSJ}^{single}}{p_{PSJ}}$$

This policy ensures both that the number of processors assigned to a PSJ-subtree is proportional to its optimization time, and that all PSJ-subtrees are assigned the same time span $T_{PSJ}^{single} / p_{PSJ}$, thus guaranteeing uniform processor utilization during optimization. Optimization time is mainly CPU time, so as many processors as possible must be employed for the CPU-intensive optimization of PSJ-subtrees. However, the number $p_{PSJ}$ must be less than $p$, so that coordinators are not assigned to heavily loaded processors. Coordinators should be assigned to processors in such a way that the communication overhead with the processes they monitor is minimized, i.e., as few physical links of the topology as possible are traversed. This mapping mechanism is beyond the scope of this paper.
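The assignment policy above can be illustrated with a small numeric sketch. The constants ($c_0$, $n_{min}$, $p_{PSJ}$) and the rounding of $p_i$ to an integer are illustrative assumptions; the paper itself only specifies $p_i = \max(1, p_{PSJ} \cdot T_i / T_{PSJ}^{single})$.

```python
# Illustrative sketch of the processor-assignment policy of Section 8.4.
# c0, n_min, p_PSJ, and the join counts are made-up example values.

c0 = 1.0        # uniprocessor cost constant (assumed)
n_min = 5       # threshold between iterative improvement and exhaustive search
p_PSJ = 16      # processors reserved for PSJ-subtree optimization

def subtree_time(n_i):
    # T_i = c0 * n_i^3, with n_i replaced by n_min for small subtrees,
    # since exhaustive optimization is upper-bounded by T_{n_min}.
    return c0 * max(n_i, n_min) ** 3

joins = [12, 7, 3]                        # joins in PSJ-subtrees QT1..QT3
times = [subtree_time(n) for n in joins]  # [1728.0, 343.0, 125.0]
T_single = sum(times)                     # single-processor total time

# p_i proportional to T_i; at least one processor per subtree
# (rounding to whole processors is an assumption of this sketch).
procs = [max(1, round(p_PSJ * T_i / T_single)) for T_i in times]

# Each subtree then finishes in roughly T_i / p_i, which is close to
# T_single / p_PSJ, the uniform time span the policy aims for.
spans = [T_i / p_i for T_i, p_i in zip(times, procs)]
```

With these example values the large subtree receives most of the processors while the small one keeps a single processor, and the per-subtree time spans cluster around $T_{PSJ}^{single} / p_{PSJ}$ as the policy intends.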
9 CONCLUSIONS

In this study, we have presented a parallel optimization model for queries containing a large number of joins, set operators, and aggregate functions. We have considered a multiprocessor architecture with private processor memories and shared secondary storage. We have proposed a complete optimizer, accompanied by a cost function for the estimation of PSJ-subtree cost, as needed for the optimization of JOIN-zones, whose cost is the dominant factor of query execution cost. The aim of our model is to reduce not only query execution time, but also query optimization time. Our model differs from other parallel database optimizers in that optimization is performed in parallel, uses a cost function expressing several aspects of query parallelism, and is not limited to join queries. Parallelism is exploited to reduce optimization time and to improve the quality of the output optimal QEP. Our optimizer is a hybrid approach between optimizers generating a sequential QEP that is parallelized at runtime, and optimizers that incorporate processor utilization into the cost function. Thus, our optimizer takes parallelism into account and achieves a better approximation of the optimum than the former models, while avoiding the usage of multidimensional cost functions, as in the latter models. Moreover, our optimizer can be combined with any process scheduler, even one that reallocates processes to processors at runtime.
For the optimization of large join queries, we propose two parallel variations of iterative improvement, which construct global minima of improved quality in shorter time than the classical technique. We exploit the available processing power for the simultaneous construction of local minima. Hence, we ensure the creation of a sufficient number of local minima within the given time span and evenly spread the risk of overly slow optimization runs among the processors involved. The existence of set operators and aggregate functions in the query tree is used by the parallel optimizer to split the initial query tree into concurrently optimizable subtrees, with the least possible interaction needed between the modules optimizing them in parallel. Thus, our model takes the form of an "optimizer tree" with four types of nodes, responsible for different optimization tasks on the query subtrees. The first version of our parallel optimizer prototype covers large queries with set operators and joins. As described in Section 6.2, two parallel implementations and one distributed implementation are currently operational. The parallel versions are being used for the stepwise coupling of the optimizer with a parallel query processing engine, while the distributed version is being used for the study of parallel query optimization strategies. Our ongoing work includes the implementation of parallel variations of further combinatorial optimization techniques and comparative experimentation with them. We are currently implementing threshold acceptance and "old bachelor's acceptance" [8], as well as a tabu-search extension that can be used in conjunction with most techniques. We further want to study the characteristics of the search space of our cost function.
We argue that the behavior of combinatorial optimization techniques is heavily affected by the complexity of the cost function; comparative studies of combinatorial optimization techniques [35], [33] use rather simple cost models, which might distort the behavior of the techniques being studied. We therefore intend to study the relative performance of different techniques for cost models incorporating as many aspects of the parallel query optimization problem as possible. Our cost model does not consider the impact of multitasking and physical network workload on query execution time. We have designed an extended cost model, which takes those two factors into account in a generic environment covering both shared-disk and shared-nothing architectures [28]. We are currently working on the incorporation of this richer cost model into our distributed query optimization prototype, and on the extension of our transformation rules to cover the new, larger search space.
ACKNOWLEDGMENTS

We would like to thank Dr. Costas Vassilakis of the Department of Informatics, University of Athens, for his crucial contribution to the development of this prototype and for many valuable comments on the cost model, as well as the whole teams of ESPRIT 1588-SPAN, 4021-PCA, and GPMIMD, who participated in the prototype implementation on the various platforms. We also wish to thank the anonymous referees, whose comments greatly helped in improving the quality of this paper.
REFERENCES

[1] M.-S. Chen, P. Yu, and K.-L. Wu, "Scheduling and Processor Allocation for Parallel Execution of Multi-Join Queries," Proc. Eighth Int'l Conf. Data Eng., pp. 58–67, IEEE, 1992.
[2] U. Dayal, "Of Nests and Trees: A Unified Approach to Processing Queries that Contain Nested Subqueries, Aggregates, and Quantifiers," Proc. Int'l Conf. Very Large Databases, pp. 197–208, Brighton, England, 1987.
[3] C. Galindo-Legaria, A. Pellenkoft, and M. Kersten, "Fast, Randomized Join-Order Selection: Why Use Transformations?" Proc. Int'l Conf. Very Large Databases, pp. 85–95, Santiago, Chile, 1994.
[4] S. Ganguly, W. Hasan, and R. Krishnamurthy, "Query Optimization for Parallel Execution," Proc. SIGMOD Int'l Conf. Management of Data, pp. 9–18, San Diego, Calif., ACM, 1992.
[5] G. Graefe, "Query Evaluation Techniques for Large Databases," ACM Computing Surveys, vol. 25, no. 2, pp. 73–170, 1993.
[6] W. Hasan and R. Motwani, "Optimization Algorithms for Exploiting the Parallelism-Communication Tradeoff in Pipelined Parallelism," Proc. Int'l Conf. Very Large Databases, pp. 36–47, Santiago, Chile, 1994.
[7] W. Hong, "Exploiting Interoperation Parallelism in XPRS," Proc. SIGMOD Int'l Conf. Management of Data, pp. 19–28, San Diego, Calif., ACM, 1992.
[8] T. Hu, A.B. Kahng, and C.-W.A. Tsao, "Old Bachelor Acceptance: A New Class of Non-Monotone Threshold Accepting Methods," technical report, UCLA Dept. of Computer Science, Los Angeles, and UC San Diego Computer Science and Engineering Dept., La Jolla, Calif., 1995.
[9] Y. Ioannidis and Y. Kang, "Randomized Algorithms for Optimizing Large Join Queries," Proc. SIGMOD Int'l Conf. Management of Data, pp. 312–321, Atlantic City, N.J., ACM, 1990.
[10] Y. Ioannidis and Y. Kang, "Left-Deep vs. Bushy Trees: An Analysis of Strategy Spaces and Its Implications on Query Optimization," Proc. SIGMOD Int'l Conf. Management of Data, pp. 168–177, Denver, Colo., ACM, 1991.
[11] Y. Ioannidis, R.T. Ng, K. Shim, and T.K. Sellis, "Parametric Query Optimisation," Proc. Int'l Conf. Very Large Databases, pp. 103–114, Vancouver, Canada, 1992.
[12] Y. Ioannidis and E. Wong, "Query Optimization by Simulated Annealing," Proc. SIGMOD Int'l Conf. Management of Data, pp. 9–22, San Francisco, Calif., ACM, 1987.
[13] M. Jarke and J. Koch, "Query Optimization in Database Systems," ACM Computing Surveys, vol. 16, no. 2, pp. 111–152, 1984.
[14] W. Kim, "On Optimizing an SQL-Like Nested Query," ACM Trans. Database Systems, vol. 7, no. 3, pp. 443–469, 1982.
[15] R. Krishnamurthy, H. Boral, and C. Zaniolo, "Optimization of Nonrecursive Queries," Proc. Int'l Conf. Very Large Databases, pp. 128–137, Kyoto, Japan, 1986.
[16] R. Lanzelotte, P. Valduriez, and M. Zaït, "On the Effectiveness of Optimization Search Strategies for Parallel Execution Spaces," Proc. Int'l Conf. Very Large Databases, pp. 493–504, Dublin, Ireland, 1993.
[17] E. Lin, E. Omiecinski, and S. Yalamanchili, "Large Join Optimization on a Hypercube Multiprocessor," IEEE Trans. Knowledge and Data Eng., vol. 6, no. 2, pp. 304–315, 1994.
[18] H. Lu, M.-C. Shan, and K.-L. Tan, "Optimization of Multi-Way Join Queries for Parallel Execution," Proc. Int'l Conf. Very Large Databases, pp. 549–560, Barcelona, Spain, 1991.
[19] T. Morzy, M. Matysiak, and S. Salza, "Tabu Search Optimization of Large Join Queries," Proc. EDBT '94 Int'l Conf., pp. 309–322, Cambridge, U.K., Springer-Verlag, 1994.
[20] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, chapter 10. Englewood Cliffs, N.J.: Prentice Hall, 1982.
[21] G. Patil, M. Boswell, S. Joshi, and M. Ratnaparkhi, "Discrete Models," Dictionary and Classified Bibliography of Statistical Distributions in Scientific Work, vol. 1. Maryland: International Cooperative Publications House, 1984.
[22] D.A. Schneider, "Complex Query Processing in Multiprocessor Database Machines," Technical Report TR965, Univ. of Wisconsin, Madison, 1990.
[23] P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and T. Price, "Access Path Selection in a Relational Database Management System," Proc. SIGMOD Int'l Conf. Management of Data, pp. 23–34, Boston, 1979.
[24] L. Shapiro, "Join Processing in Database Systems with Large Main Memories," ACM Trans. Database Systems, vol. 11, no. 3, pp. 239–264, 1986.
[25] E.J. Shekita, H.C. Young, and K.-L. Tan, "Multi-Join Optimization for Symmetric Multiprocessors," Proc. Int'l Conf. Very Large Databases, pp. 479–492, Dublin, Ireland, 1993.
[26] M. Spiliopoulou, "Parallel Optimization and Execution of Queries towards an RDBMS in a Parallel Environment Supporting Pipeline" (in Greek), PhD thesis, Dept. of Informatics, Univ. of Athens, Athens, Greece, 1992.
[27] M. Spiliopoulou, Y. Cotronis, and M. Hatzopoulos, "Parallel Optimisation of Join Queries Using an Enhanced Iterative Improvement Technique," Proc. 1993 PARLE Conf., Poster Session, pp. 716–719, Munich, Germany, 1993.
[28] M. Spiliopoulou and J.C. Freytag, "Modelling Resource Utilization in Pipelined Query Execution," Proc. Euro-Par Conf., Lyon, France, to appear in 1996.
[29] M. Spiliopoulou and M. Hatzopoulos, "Translation of SQL Queries into a Graph Structure: Query Transformations and Preoptimisation Issues in a Pipeline Multiprocessor Environment," Information Systems, vol. 17, no. 2, pp. 161–170, 1992.
[30] M. Spiliopoulou, M. Hatzopoulos, and C. Vassilakis, "Using Parallelism and Pipeline for the Optimisation of Join Queries," Proc. 1992 PARLE Conf., pp. 279–294, Paris, 1992.
[31] M. Spiliopoulou, M. Hatzopoulos, and C. Vassilakis, "Parallel Optimization of Join Queries Using a Technique of Exhaustive Nature," Computers and Artificial Intelligence, vol. 12, no. 2, pp. 145–166, 1993.
[32] M. Spiliopoulou, M. Hatzopoulos, and C. Vassilakis, "A Cost Model for the Estimation of Query Execution Time in a Parallel Environment Supporting Pipeline," Computers and Artificial Intelligence, to appear in 1996.
[33] M. Steinbrunn, G. Moerkotte, and A. Kemper, "Optimizing Join Orders," Technical Report MIP9307, Faculty of Mathematics, Univ. of Passau, Passau, Germany, 1993.
[34] A. Swami, "Optimization of Large Join Queries: Combining Heuristics and Combinatorial Techniques," Proc. SIGMOD Int'l Conf. Management of Data, pp. 367–376, Portland, Ore., ACM, 1989.
[35] A. Swami and A. Gupta, "Optimization of Large Join Queries," Proc. SIGMOD Int'l Conf. Management of Data, pp. 8–17, Chicago, ACM, 1988.
[36] M. Ziane, M. Zaït, and P. Borla-Salamet, "Parallel Query Processing with Zigzag Trees," The VLDB J., vol. 2, no. 3, pp. 277–301, 1993.
Myra Spiliopoulou received the BS degree in mathematics and the PhD degree in computer science from the University of Athens, Greece, in 1986 and 1992, respectively. From 1987-1994, she worked as a research assistant in the Department of Informatics at the University of Athens, and was involved in national and European projects on parallel database query optimization, hypermedia and multimedia modeling and querying, and computers in education. Dr. Spiliopoulou is currently an assistant professor with the Institute of Information Systems, Humboldt University of Berlin. Her research interests include query optimization, cost modeling, parallel databases, federated databases, and multimedia. Michael Hatzopoulos received the BSc degree in mathematics from the University of Athens, Greece, in 1971; and the MSc and PhD degrees in computer science from the Loughborough University of Technology, United Kingdom, in 1972 and 1974, respectively. Dr. Hatzopoulos has served in several academic posts at the University of Athens, Greece, since 1975, and is currently a professor and chairman of the Department of Informatics there. From 1981-1983, he was a visiting associate professor at the University of Michigan. His research interests include multimedia databases, object-oriented databases, hypermedia applications, and physical database design. Dr. Hatzopoulos is a member of the IEEE Computer Society and the Association for Computing Machinery. Yannis Cotronis received his BSc in mathematics with applied sciences from the University of Sussex, United Kingdom; and the MSc and PhD degrees in computer science from the University of Newcastle-upon-Tyne, United Kingdom. He has worked as a research associate at the University of Newcastle-upon-Tyne and the University of Athens, Greece, in projects on the specification, design, and verification of parallel systems, and parallel relational query execution. 
As a research and development manager in a Greek company, he has been involved in projects on telepublishing, a translation workbench, expert systems for drafting legal contracts, and the hellenization of the X.400 protocol. Dr. Cotronis is currently an assistant professor with the Department of Informatics at the University of Athens. His research interests include parallel programming techniques and applications, reusability of parallel components, and portability of parallel applications.