ON OPTIMAL PROCESSOR ALLOCATION TO SUPPORT PIPELINED HASH JOINS

Ming-Ling Lo, Ming-Syan Chen†, C. V. Ravishankar and Philip S. Yu†

EECS Department, University of Michigan at Ann Arbor, Ann Arbor, MI 48109
†IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598

Abstract

In this paper, we develop algorithms to achieve optimal processor allocation for pipelined hash joins in a multiprocessor-based database system. A pipeline of hash joins is composed of several stages, each of which is associated with one join operation. The whole pipeline is executed in two phases: (1) the table-building phase, and (2) the tuple-probing phase. We focus on the problem of allocating processors to the stages of a pipeline so as to minimize the query execution time. We formulate the processor allocation problem as a two-phase mini-max optimization problem, and develop three optimal allocation schemes under three different constraints. The effectiveness of our problem formulation and solution is verified through a detailed tuple-by-tuple simulation of pipelined hash joins. Our solution scheme is general and applicable to any optimal resource allocation problem formulated as a two-phase mini-max problem.

1 Introduction

In recent years, multiprocessor-based parallel database machines have attracted considerable attention from both the academic and industrial communities because they can efficiently execute complex database operations [1] [5] [10] [17]. In relational database systems, joins are the most expensive operations to execute, especially with the increases in database size and query complexity [3] [14] [19]. Many applications need to specify the desired results in terms of multi-join queries, some of which may take hours or even days to complete. As a result, parallelism has been recognized as the only solution for the efficient execution of multi-join queries in future database management [6] [7] [8] [12]. (The authors were supported in part by the Consortium for International Earth Science Information Networking.)

Among the various join methods, the hash join has been the focus of much research effort and has been reported to outperform the others, particularly because it presents an opportunity for pipelining [4] [13] [15] [18]. Using hash joins, multiple joins can be pipelined so that the early resulting tuples of a join can be sent to the next join for processing before the whole join completes. A pipeline of hash joins is composed of several stages, each of which is associated with one join operation that can be executed, in parallel, by several processors. Though pipelining has been shown to be very effective in reducing query execution time, prior studies of pipelined hash joins have focused mainly on heuristic methods for query plan generation. Most of the prior work on query plan generation, such as static right-deep scheduling, dynamic bottom-up scheduling [16], and segmented right-deep trees [2], resorted to simple heuristics to allocate processors to pipeline stages. Also, these methods dealt with memory as the only constraint on the execution of pipelined hash joins; little effort was made to take processing power into consideration and to optimize processor allocation. Notice that two opportunities for parallelism exist for pipelined hash joins: not only can each hash join be executed by several processors, but several joins can also be pipelined. As a result, processor allocation arises as an important, unexplored issue in the performance of pipelined hash joins. In view of this fact and the increasing demand for better performance of database operations, the objective of this paper is to study and improve processor allocation for pipelined hash joins. In this paper we derive optimal processor allocation algorithms that take both the memory constraint and processing power into account. We assume that a pipeline of hash joins is given a priori; it can be generated based on the approaches in [2] [16].
(Footnote: Segmented right-deep trees are bushy trees with right-deep subtrees [2].)

A pipeline of hash joins is executed sequentially in two phases:

(1) the table-building phase and (2) the tuple-probing phase. In the table-building phase, hash tables are constructed from the inner relations using hash functions on the join attributes. In the tuple-probing phase, tuples of the outer relation are used to probe the hash tables for matches. Note that the processing time of each phase is determined by the maximal execution time among all stages, and that the same allocation of processors to a stage is retained across the table-building and tuple-probing phases. The execution time of a pipeline is thus the sum of two correlated maxima. The characteristics of pipelined hash joins allow the processor allocation problem to be formulated as a two-phase mini-max optimization problem. Specifically, for a pipeline with k stages, the execution time of the pipeline, TS, can be expressed as

  TS = max_{0≤i≤k−1} WB_i/n_i + max_{0≤i≤k−1} WP_i/n_i,   (1)

where WB_i and WP_i are, respectively, the workloads of the table-building and tuple-probing phases in stage i. Consequently, the processor allocation problem for pipelined hash joins can be stated as follows: "Given WB_i and WP_i, 0 ≤ i ≤ k−1, determine the processor allocation ⟨n_i⟩ = (n_0, n_1, ..., n_{k−1}) that minimizes TS in Eq. (1), where n_i is the number of processors allocated to stage i." For example, consider the workloads shown in Table 1 for a pipeline of five stages. First, observe that the workloads of stage 2 are smaller than those of stage 3, suggesting that stage 3 should be assigned more processors than stage 2. However, stage 3 has a heavier load than stage 4 in the table-building phase, while the latter has the heavier load in the tuple-probing phase. In such a configuration, there is no obvious way to allocate processors so as to minimize the pipeline execution time specified in Eq. (1). For illustrative purposes, suppose the total number of processors available to execute the pipeline in Table 1 is 20.
It can be seen that the allocation ⟨n_i⟩ = (4, 4, 4, 4, 4) leads to max_i WB_i/n_i = 1.5, max_i WP_i/n_i = 1.25, and TS = 2.75, whereas the allocation ⟨n_i⟩ = (6, 3, 3, 6, 2), which is based on the workloads of the table-building phase, leads to max_i WB_i/n_i = 1.0, max_i WP_i/n_i = 2.5, and TS = 3.5. Clearly, developing an optimal processor allocation to minimize TS in Eq. (1) is in general a difficult and important problem.

  stage   0   1   2   3   4
  WB_i    6   3   3   6   2
  WP_i    4   4   2   3   5

  Table 1: The workloads in the two phases of each stage.

Since the table-building and tuple-probing phases are executed one after the other, we minimize the sum of two correlated maxima in Eq. (1). In view of this, the optimal processor allocation problem in Eq. (1) is hence termed the two-phase mini-max optimization problem. This feature distinguishes our allocation problem from conventional resource allocation problems [9], which may be considered one-phase mini-max optimization problems. To develop the optimal processor allocation scheme for the two-phase mini-max optimization, we consider the following three constraints: (1) the total number of processors available for allocation is fixed, (2) a lower bound is imposed on the number of processors required by each stage to meet its memory requirement, and (3) processors are available only in discrete units. We develop solution schemes by incrementally adding constraints to our optimization problem. Specifically, three optimal processor allocation schemes are devised: one under Constraint (1), another under Constraints (1) and (2), and the third under all three constraints. The three allocation schemes will be shown to be optimal under their corresponding constraints. Also, it can be verified that, for a system with N processors and k pipeline stages, the time complexity of the first two allocation schemes is O(k log k), and that of the third is O(Nk log(Nk)). Generally speaking, a complex real-world problem such as pipelining usually needs to be simplified and abstracted before it can be mapped into a mathematical model for further analysis and optimization. The contributions of this study are twofold. We not only formulate the optimal processor allocation problem for pipelined hash joins as a two-phase mini-max problem, but also derive solutions to this new class of optimization problems. We verify the effectiveness of our problem formulation and solution procedure through a detailed simulation of pipelined hash joins. Simulation results show that the proposed allocation schemes lead to much shorter query execution times than conventional approaches. It is worth mentioning that our solution scheme is general and applicable to any optimal resource allocation problem formulated as a two-phase mini-max problem.
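As a concrete check of Eq. (1), the two allocations above can be evaluated with a few lines of code (a sketch; the workload vectors are those of Table 1):

```python
# Workloads of Table 1: WB_i (table-building) and WP_i (tuple-probing).
WB = [6, 3, 3, 6, 2]
WP = [4, 4, 2, 3, 5]

def pipeline_time(n):
    """TS of Eq. (1): the building-phase bottleneck plus the probing-phase bottleneck."""
    tb = max(wb / ni for wb, ni in zip(WB, n))
    tp = max(wp / ni for wp, ni in zip(WP, n))
    return tb + tp

uniform = pipeline_time([4, 4, 4, 4, 4])   # 1.5 + 1.25 = 2.75
by_build = pipeline_time([6, 3, 3, 6, 2])  # 1.0 + 2.5  = 3.5
```

Neither simple choice is optimal, which is exactly the difficulty the two-phase mini-max formulation addresses.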
Although the one-phase mini-max optimization problem has been studied extensively in the literature, this study, to the best of our knowledge, provides the first solution to the two-phase mini-max optimization problem. This paper is organized as follows. Section 2 provides the background and the problem description. Section 3 develops three algorithms for deriving the three optimal processor allocations. Proofs of lemmas and theorems can be found in [11]. In Section 4, we demonstrate the performance improvement achieved by our proposed

scheme via simulation. The paper concludes with Section 5.

2 Preliminaries

2.1 Notation, Assumptions and Definitions

As in most prior work, we assume that execution time is the primary cost measure for estimating the efficiency of database operations. The architecture assumed is a multiprocessor system with distributed memory and shared disks. Each processing node (or processor) in the system has its own memory and address space, and communicates with the others through message passing. Each node is assumed to have the same amount of memory, and the amount of memory available to execute a join is proportional to the number of processors involved. The disks in the system are accessible by all nodes through shared I/O buses. A pipeline of hash joins is composed of several stages, each of which is associated with one join operation. The relation in a hash join that is loaded into memory to build the hash table is called the inner relation, while the other relation, whose tuples are used to probe the hash table, is called the outer relation. The inner relations of a pipeline are the inner relations of its stages. The outer relation of a pipeline is defined to be the outer relation of its first stage. In the table-building phase, the hash tables of the inner relations are built using hash functions on the join attributes. In the tuple-probing phase, tuples of the outer relation are first probed, one by one, against the entries in the hash table of the first stage using the corresponding hash function. If there are matches, the resulting tuples are generated and then sent to the next stage for similar processing. The table-building and tuple-probing times of a pipeline are the time spans of the building and probing phases, respectively. The execution of one pipeline is illustrated in Figure 1.
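The build-then-probe mechanics described above can be illustrated with a toy two-stage pipeline (a sketch only; the relations, attribute names, and values are invented for illustration and do not come from the paper):

```python
from collections import defaultdict

def build(inner, key):
    """Table-building phase: hash the inner relation on its join attribute."""
    table = defaultdict(list)
    for t in inner:
        table[t[key]].append(t)
    return table

def probe(outer, key, table):
    """Tuple-probing phase: each outer tuple probes the hash table; matching
    result tuples are emitted immediately, so they can flow to the next stage."""
    for t in outer:
        for m in table.get(t[key], []):
            yield {**t, **m}

# Stage 1 joins S with R1 on 'a'; its output is piped into stage 2 (join with R2 on 'b').
R1 = [{'a': 1, 'b': 10}, {'a': 2, 'b': 20}]
R2 = [{'b': 10, 'c': 'x'}]
S  = [{'a': 1}, {'a': 2}, {'a': 3}]

t1, t2 = build(R1, 'a'), build(R2, 'b')        # both hash tables built first
out = list(probe(probe(S, 'a', t1), 'b', t2))  # probing is pipelined stage to stage
```

Because `probe` is a generator, a result tuple of stage 1 reaches stage 2 before stage 1 has consumed all of S, which is the pipelining opportunity exploited throughout the paper.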

2.2 Problem Formulation

As pointed out earlier, the query processing time can be approximated as T_ins + max_i WB_i/n_i + max_i WP_i/n_i, where n_i is the number of nodes allocated to stage i and T_ins is the sum of the costs independent of processor allocation. It is also derived in [2] that WB_i is proportional to |R_i|, where |R_i| is the size of the inner relation R_i at stage i of the pipeline. (Footnote: This relationship between WB_i and |R_i| is used in our derivation of the optimal allocation in Section 3.2.) Since T_ins is constant over all processor allocations, the objective function that we shall minimize is TS = max_i WB_i/n_i + max_i WP_i/n_i, the sum of the costs that depend on processor allocation. In what follows, we concentrate on deriving the optimal processor allocation that minimizes TS.

[Figure 1 (schematic): a pipeline with k = 3 stages on N = 9 processing nodes; stages 1, 2, and 3 are allocated n_1 = 3, n_2 = 2, and n_3 = 4 nodes, join inner relations R1, R2, R3 with the outer relation S, and read the relations from shared disks, passing tuples between stages as packets.]

Figure 1: Execution of one pipeline.

We denote an allocation as Π = (n_0, n_1, ..., n_{k−1}), where k is the number of stages in the pipeline, and seek the Π that minimizes TS. Note that the formula for TS does not depend on the order of the stages in the pipeline; the workloads of the various stages determine the allocation. As mentioned earlier, we approach this problem by expressing it as an allocation problem with three constraints: we start by assuming that only one constraint holds, and incrementally consider more constraints.

Constraint I (no idling): The total number of processors assigned to all stages equals the total number of processors in the system, i.e., Σ_{i=0}^{k−1} n_i = N.

Constraint II (sufficient memory): The amount of memory allocated to each stage must be large enough to hold the hash table of that stage, i.e., n_i ≥ |R_i|/M for all i, where M is the memory size of each processor.

Constraint III (discrete allocation): Processors must be allocated to stages in discrete units.
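The three constraints can be expressed as a simple predicate (a sketch; the relation sizes |R_i| and the per-node memory M below are illustrative values):

```python
import math

def check_constraints(n, R, M, N):
    """Return (I, II, III): no idling, sufficient memory, discrete allocation."""
    c1 = math.isclose(sum(n), N)                     # Constraint I: sum_i n_i = N
    c2 = all(ni >= Ri / M for ni, Ri in zip(n, R))   # Constraint II: n_i >= |R_i|/M
    c3 = all(float(ni).is_integer() for ni in n)     # Constraint III: whole processors
    return c1, c2, c3

flags = check_constraints([5, 3, 3, 5, 4], R=[50, 25, 25, 50, 17], M=10, N=20)
```

With these example sizes, the uniform allocation (4, 4, 4, 4, 4) satisfies Constraints I and III but violates Constraint II for the stages whose hash tables need five nodes.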

3 Optimal Processor Allocation

We shall develop three optimal processor allocation schemes in this section. The scheme in Section 3.1 corresponds to Constraint I (its allocation is denoted Π_I), that in Section 3.2 is subject to Constraints I and II (denoted Π_II), and the one in Section 3.3 satisfies all three constraints (denoted Π_III).

3.1 Allocation under Constraint I

Under Constraint I, n_i, the number of processors allocated to stage i, can be any positive real number. We obtain the following necessary condition for Π_I, stating that under Π_I every stage must be a bottleneck in either the table-building or the tuple-probing phase. This is referred to as the two-phase allocation condition.

Lemma 1: Let TB_i and TP_i be, respectively, the table-building and tuple-probing times of stage i under Π_I. Then either TB_i = TB or TP_i = TP, where TB and TP are the pipeline building and probing times, respectively.

Since this lemma gives the two-phase allocation condition only as a necessary condition, some non-optimal allocations may also satisfy it. For every allocation, we identify an ordered pair of sets (A, B) such that

  A = {stage i | WB_i/n_i = TB},   (2)
  B = {stage j | WP_j/n_j = TP},   (3)

where TB = max_i WB_i/n_i and TP = max_i WP_i/n_i. Specifically, A (respectively, B) is the set of stages that are bottlenecks in the table-building (respectively, tuple-probing) phase. Define

  B' = B − (A ∩ B) = {stage j | WP_j/n_j = TP, j ∉ A}.   (4)

It follows from Lemma 1 that A ∪ B' comprises all stages in the pipeline. The problem of finding the optimal allocation can now be reduced to that of finding all (A, B), or (A, B'), satisfying the two-phase allocation requirement in Eqs. (2)-(4), and then determining the one with the shortest processing time. To avoid the exponential complexity of enumerating all possible two-phase allocations, we first introduce the concept of workload ratios (B/P ratios) for the pipeline stages, and then, in light of this concept, prove that the number of allocations that need to be considered is no more than the number of pipeline stages. The B/P ratio of stage i is defined as r_i = WB_i/WP_i. We order the stages in descending order of their B/P ratios, and denote the sequence as ~a = (a_0, a_1, ..., a_{k−1}). For the example profile in Table 1, we obtain Table 2, where stages are sorted according to their B/P ratios. Let WB_I = Σ_{i∈I} WB_i, WP_I = Σ_{i∈I} WP_i, and n_I = Σ_{i∈I} n_i, where I denotes A, B, or B'. An example of B' can be found in Figure 2. Then, under the two-phase allocation condition, we have

  TB = WB_i/n_i = WB_A/n_A,   for all i ∈ A,   (5)
  TP = WP_j/n_j = WP_B'/n_B',  for all j ∈ B'.   (6)

  stage   a_0 = 3   a_1 = 0   a_2 = 2   a_3 = 1   a_4 = 4
  WB_i      6         6         3         3         2
  WP_i      3         4         2         4         5
  r_i       2         1.5       1.5       0.75      0.4
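The sorting step that produces Table 2 from Table 1 is direct (a sketch):

```python
# Workloads of Table 1; r_i = WB_i / WP_i is the B/P ratio of stage i.
WB = [6, 3, 3, 6, 2]
WP = [4, 4, 2, 3, 5]
r = [wb / wp for wb, wp in zip(WB, WP)]
# Stages in descending order of B/P ratio: the sequence ~a of Table 2.
a = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
```

Note that stages 0 and 2 tie at r = 1.5; Python's stable sort keeps stage 0 first, matching the order shown in Table 2.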

Table 2: The workloads in the two phases of each stage after sorting.


Figure 2: Example of a candidate ordered pair (A, B) for a pipeline of 5 stages.

The pipeline processing time can be expressed as

  TS = max_{i∈A} WB_i/n_i + max_{i∈B'} WP_i/n_i.

Then we get

  TS = WB_A/n_A + WP_B'/n_B',   (7)

where the total number of processors N = n_A + n_B'. If we select an A, thus fixing WB_A and WP_B', TS can be expressed as a function of n_A. Note that there is an upper bound and a lower bound on the number of processors one can assign as n_A (and n_B') for a given pair A and B. The reason is that Eq. (7) is valid only when A and B satisfy the condition that they are the bottleneck sets of the table-building and tuple-probing phases, respectively. If, for a particular pair of sets A and B (and the associated WB_A and WP_B'), we allocate so many processors to the stages in A (i.e., assign so large a value to n_A) that the table-building times of the stages in A become shorter than those of some stages in B, then A is no longer the bottleneck set of the table-building phase, and Eq. (7) no longer gives the correct pipeline processing time TS. Under such an allocation of processors, the processing time TS is determined by different sets A and B (and hence different WB_A and WP_B'). Therefore, there is an upper bound on the number of processors one can meaningfully assign to the stages in A; since N = n_A + n_B', there is likewise a lower bound. This phenomenon is formally stated below. Let a_g be the last stage (the stage with the smallest B/P ratio) in A, i.e., g = |A| − 1, and let a_{g+1} be the first stage in B'.

Lemma 2: Suppose all stages in set A (respectively, B') have the same table-building time (respectively, tuple-probing time). Then the stages in A and those in B' are indeed the bottlenecks of the table-building and tuple-probing phases, respectively, i.e., WB_A/n_A > WB_i/n_i for all i ∈ B' and WP_B'/n_B' ≥ WP_i/n_i for all i ∈ A, if and only if n_A satisfies

  N / (1 + (WB_{a_g}/WP_{a_g}) · (WP_B'/WB_A)) ≤ n_A < N / (1 + (WB_{a_{g+1}}/WP_{a_{g+1}}) · (WP_B'/WB_A)).   (8)

From the above lemma, we get the following lemma to determine the optimal allocation for a given set A.

Lemma 3: The optimal value of n_A minimizes WB_A/n_A + WP_B'/n_B' under the constraints in Eq. (8).

Next we consider the n_A that minimizes f(n_A) = WB_A/n_A + WP_B'/(N − n_A) without any constraint; it can be derived by setting df(n_A)/dn_A = 0. This optimal allocation n_A^onc, where the superscript onc indicates optimality under no constraints, is

  n_A^onc = N · √WB_A / (√WB_A + √WP_B').   (9)

The corresponding optimal total processing time without the constraints in Eq. (8) is

  TS^onc = (1/N) · (WB_A + WP_B' + 2·√(WB_A · WP_B')).   (10)

We next define inf(n_A) and sup(n_A) as follows:

  inf(n_A) = N / (1 + (WB_{a_g}/WP_{a_g}) · (WP_B'/WB_A)),   (11)

  sup(n_A) = ( N / (1 + (WB_{a_{g+1}}/WP_{a_{g+1}}) · (WP_B'/WB_A)) )^−.   (12)

Note that x^− represents x − ε, where ε is an infinitesimal quantity. Let n_A^oc be the n_A that minimizes TS under the constraints in Eq. (8).

Lemma 4: n_A^oc falls into the set {inf(n_A), sup(n_A), n_A^onc}.

Obviously, if n_A^onc lies between inf(n_A) and sup(n_A), i.e., n_A^onc satisfies the constraints in Eq. (8), then n_A^oc equals n_A^onc. Otherwise, if n_A^onc is smaller than inf(n_A), then n_A^oc equals inf(n_A); on the other hand, if n_A^onc is greater than sup(n_A), then n_A^oc is sup(n_A). The procedure PA_I to derive Π_I can now be summarized as follows.
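Lemma 4 together with Eqs. (9), (11), and (12) translates into a short computation (a sketch; eps stands in for the infinitesimal ε of sup(n_A)^−, and the A = {a_0} configuration of the running example is used as input):

```python
import math

def optimal_nA(WB_A, WP_Bp, N, inf_nA, sup_nA, eps=1e-9):
    """Clamp the unconstrained optimum of Eq. (9) into [inf(n_A), sup(n_A))."""
    onc = N * math.sqrt(WB_A) / (math.sqrt(WB_A) + math.sqrt(WP_Bp))  # Eq. (9)
    return min(max(onc, inf_nA), sup_nA - eps)

# A = {a_0} = {stage 3} of Table 2: WB_A = 6, WP_B' = 15, N = 20.
inf_nA = 20 / (1 + 2.0 * 15 / 6)   # Eq. (11): r_{a_g} = 2
sup_nA = 20 / (1 + 1.5 * 15 / 6)   # Eq. (12): r_{a_{g+1}} = 1.5
nA = optimal_nA(6, 15, 20, inf_nA, sup_nA)
TS = 6 / nA + 15 / (20 - nA)       # Eq. (7)
```

Here the unconstrained optimum (about 7.75) exceeds sup(n_A), so n_A^oc is clamped to sup(n_A) ≈ 4.21, giving TS = 2.375 for this candidate A.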

Procedure for allocation Π_I (PA_I)

1. For each stage i, 0 ≤ i ≤ k−1, calculate its B/P ratio r_i = WB_i/WP_i.
2. Sort the stages in descending order of r_i to get the sequence ~a.
3. For each j, 1 ≤ j ≤ k, let A = {a_0, ..., a_{j−1}}, and find the n_A that minimizes TS = WB_A/n_A + WP_B'/n_B' under the constraint

  N / (1 + (WB_{a_g}/WP_{a_g}) · (WP_B'/WB_A)) ≤ n_A < N / (1 + (WB_{a_{g+1}}/WP_{a_{g+1}}) · (WP_B'/WB_A))

(based on Lemma 4). Choose the A and the associated n_A that achieve the shortest processing time.
4. From this optimal set A and n_A, obtain the number of processors n_i for each stage:

  n_i = n_A · (WB_i/WB_A)         if i ∈ A,
  n_i = (N − n_A) · (WP_i/WP_B')  if i ∈ B'.   (13)

This is allocation Π_I.

For example, from the profile in Table 2, we have the following 4 possible configurations: A = {a_0}, {a_0, a_1, a_2}, {a_0, a_1, a_2, a_3}, or {a_0, a_1, a_2, a_3, a_4}. For A = {a_0} (and B' = {a_1, a_2, a_3, a_4}), we have WB_A = 6 and WP_B' = 15 from the rows WB_i and WP_i in Table 2, leading to TS = 2.375 by Eq. (7). Similarly, each configuration of A determines a value of TS by Eq. (7). It can then be verified that, among the 4 possible configurations of A, the minimum of TS, denoted TS_I, is 2.362, achieved for A = {a_0, a_1, a_2}. From Eq. (9), we then have n_A = 11.27 and n_B' = 8.73, which leads to Π_I = (4.51, 3.88, 2.25, 4.51, 4.85) by Eq. (13).
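Procedure PA_I can be sketched as follows (an illustrative implementation, not the authors' code; it evaluates the true TS of each candidate allocation directly, which also screens out configurations of A whose bounds in Eq. (8) are infeasible):

```python
import math

def PA_I(WB, WP, N, eps=1e-9):
    k = len(WB)
    a = sorted(range(k), key=lambda i: WB[i] / WP[i], reverse=True)  # by B/P ratio
    best_ts, best_n = float('inf'), None
    for j in range(1, k + 1):          # candidate bottleneck sets A = {a_0..a_{j-1}}
        A, Bp = a[:j], a[j:]
        WB_A = sum(WB[i] for i in A)
        WP_B = sum(WP[i] for i in Bp)
        if Bp:
            g, h = a[j - 1], a[j]      # last stage of A, first stage of B'
            inf_nA = N / (1 + (WB[g] / WP[g]) * (WP_B / WB_A))        # Eq. (11)
            sup_nA = N / (1 + (WB[h] / WP[h]) * (WP_B / WB_A)) - eps  # Eq. (12)
            onc = N * math.sqrt(WB_A) / (math.sqrt(WB_A) + math.sqrt(WP_B))  # Eq. (9)
            nA = min(max(onc, inf_nA), sup_nA)      # Lemma 4
        else:
            nA = N                     # j = k: every stage is a building bottleneck
        n = [nA * WB[i] / WB_A if i in A else (N - nA) * WP[i] / WP_B
             for i in range(k)]        # Eq. (13)
        ts = max(WB[i] / n[i] for i in range(k)) + max(WP[i] / n[i] for i in range(k))
        if ts < best_ts:
            best_ts, best_n = ts, n
    return best_ts, best_n

TS_I, n_I = PA_I([6, 3, 3, 6, 2], [4, 4, 2, 3, 5], 20)
```

On the Table 1 workloads this reproduces TS_I ≈ 2.362 and Π_I ≈ (4.51, 3.88, 2.25, 4.51, 4.85).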

0

In PA_I, the combined complexity of Steps 1 and 2 is O(k log k) because of the sorting. The complexity of Step 3 is O(k), since there are only k different ways of determining the set A and, for each A, only 3 potential processor allocations can be optimal according to Lemma 4. The complexity of PA_I is thus O(k log k). Since PA_I considers all allocations satisfying the necessary conditions stated in Lemmas 1 and 4, the allocation Π_I it obtains is optimal.

Theorem 1: PA_I determines the optimal processor allocation to the pipeline stages under Constraint I.

3.2 Allocation under Constraints I and II

Next, we take Constraint II into account; it imposes a lower bound on the number of processors required by each stage. Let m_i be the number of nodes required to hold R_i, i.e., m_i = |R_i|/M. As mentioned earlier, WB_i = C2 · |R_i|, where C2 is a constant. Hence, Constraint II can be expressed as n_i ≥ m_i = |R_i|/M = WB_i/(C2·M) for every stage i. We will show that taking Constraint II into consideration amounts to adding the condition n_A ≥ WB_A/(C2·M) to the inequality in Eq. (8) derived in Section 3.1. The procedure PA_II to derive Π_II can be summarized as follows.

Procedure for allocation Π_II (PA_II)

1. For each stage i, 0 ≤ i ≤ k−1, calculate its B/P ratio r_i = WB_i/WP_i.
2. Sort the stages in descending order of r_i to get the sequence ~a.
3. For each j, 1 ≤ j ≤ k, let A = {a_0, ..., a_{j−1}}, and find the n_A that minimizes TS = WB_A/n_A + WP_B'/n_B' under the constraint

  max( N / (1 + (WB_{a_g}/WP_{a_g}) · (WP_B'/WB_A)), WB_A/(C2·M) ) ≤ n_A < N / (1 + (WB_{a_{g+1}}/WP_{a_{g+1}}) · (WP_B'/WB_A)).   (14)

Choose the A and the associated n_A that achieve the shortest processing time.
4. From this optimal set A and n_A, calculate the number of processors n_i for each stage as in Eq. (13). This is allocation Π_II.

Also, Lemma 4 continues to hold if inf(n_A) is redefined as

  inf(n_A) = max( N / (1 + (WB_{a_g}/WP_{a_g}) · (WP_B'/WB_A)), WB_A/(C2·M) ).   (15)

To show that the allocation Π_II satisfies Constraint II, we need to prove the following two lemmas.

Lemma 5: Under allocation Π_II, every stage in A satisfies Constraint II.

The lemma follows directly from Eq. (13), since n_A ≥ WB_A/(C2·M) implies n_i = n_A · (WB_i/WB_A) ≥ WB_i/(C2·M).

Lemma 6: Under allocation Π_II, every stage in B' satisfies Constraint II.

For example, suppose ⟨m_i⟩ = ⟨|R_i|/M⟩ = (5.0, 2.5, 2.5, 5.0, 1.7) for the profile in Table 1, with N = 20. We then have 2 possible configurations of A that meet Constraints I and II, i.e., A = {a_0, a_1, a_2} or {a_0, a_1, a_2, a_3}. From Eq. (7), it can be verified that these 2 configurations of A lead to TS = 2.40 and TS = 2.538, respectively. We hence get Π_II = (5.00, 3.33, 2.50, 5.00, 4.17) and TS_II = 2.40. Like PA_I, PA_II is of complexity O(k log k). Following the same reasoning as for PA_I, we have the theorem below, stating the optimality of PA_II.

Theorem 2: PA_II determines the optimal processor allocation to the pipeline stages under Constraints I and II.
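PA_II differs from PA_I only in the tightened lower bound of Eq. (14). A sketch (again not the authors' code; the bound WB_A/(C2·M) is supplied as Σ_{i∈A} m_i, which equals it because m_i = WB_i/(C2·M)):

```python
import math

def PA_II(WB, WP, N, m, eps=1e-9):
    k = len(WB)
    a = sorted(range(k), key=lambda i: WB[i] / WP[i], reverse=True)
    best_ts, best_n = float('inf'), None
    for j in range(1, k + 1):
        A, Bp = a[:j], a[j:]
        WB_A = sum(WB[i] for i in A)
        WP_B = sum(WP[i] for i in Bp)
        if Bp:
            g, h = a[j - 1], a[j]
            inf_nA = max(N / (1 + (WB[g] / WP[g]) * (WP_B / WB_A)),
                         sum(m[i] for i in A))               # Eqs. (14)/(15)
            sup_nA = N / (1 + (WB[h] / WP[h]) * (WP_B / WB_A)) - eps
            if inf_nA > sup_nA:
                continue               # this A cannot satisfy Constraints I and II
            onc = N * math.sqrt(WB_A) / (math.sqrt(WB_A) + math.sqrt(WP_B))
            nA = min(max(onc, inf_nA), sup_nA)
        else:
            nA = N
        n = [nA * WB[i] / WB_A if i in A else (N - nA) * WP[i] / WP_B
             for i in range(k)]        # Eq. (13)
        if any(ni + eps < m[i] for i, ni in enumerate(n)):
            continue                   # a stage could not hold its hash table
        ts = max(WB[i] / n[i] for i in range(k)) + max(WP[i] / n[i] for i in range(k))
        if ts < best_ts:
            best_ts, best_n = ts, n
    return best_ts, best_n

TS_II, n_II = PA_II([6, 3, 3, 6, 2], [4, 4, 2, 3, 5], 20, [5.0, 2.5, 2.5, 5.0, 1.7])
```

On the example above this reproduces Π_II = (5.00, 3.33, 2.50, 5.00, 4.17) with TS_II = 2.40; the configurations A = {a_0} and A = {a_0, a_1} are rejected because their memory floors exceed the upper bound of Eq. (14).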

3.3 Allocation under three constraints

Given allocation Π_II, one straightforward approach to meeting Constraint III is to round up the number of processors allocated to some stages and truncate the number allocated to others. This simple approach, though adequate in some cases, does not in general yield the optimal allocation. For certain configurations, if the number of processors allocated to stage i by Π_II is n_i, the allocation to stage i under Π_III can be very different from both ⌊n_i⌋ and ⌈n_i⌉. It is thus necessary to develop a procedure for the optimal allocation Π_III. From Constraints II and III, it follows that stage i must be allocated at least ⌈m_i⌉ processors, where m_i = |R_i|/M. If we allocate ⌈m_i⌉ processors to each stage i, there will be L = N − Σ_j ⌈m_j⌉ remaining processors. To achieve Π_III, we must allocate these L processors to the appropriate stages. Clearly, the number of additional processors allocatable to a given stage i is between 0 and L, and the total number of processors allocated to that stage can thus range from ⌈m_i⌉ to ⌈m_i⌉ + L.

After each stage has received its ⌈m_i⌉ processors to satisfy the memory constraint, we consider how to allocate the remaining L processors. The subtlety of the allocation comes from the fact that the optimal allocation of p processors cannot be obtained from a given optimal allocation of p − 1 processors by greedily giving the additional processor to the bottleneck stage. For example, consider a 3-stage pipeline with workloads ⟨WB_i⟩ = (1, 1, 2) and ⟨WP_i⟩ = (10, 10, 2), and assume that each stage needs one processor to hold its hash table. Clearly, if there are 4 processors, the optimal allocation is (1, 1, 2), since allocating the extra processor to either stage 1 or stage 2 will not change the pipeline building and probing times. However, if there are 5 processors, the optimal allocation is (2, 2, 1), meaning that the optimal allocation of 5 processors cannot simply be obtained from the optimal allocation of 4 processors. We therefore need to devise an efficient mechanism to determine the optimal allocation.

Let TB_i^x and TP_i^x be, respectively, the table-building and tuple-probing times of stage i, where x is the total number of processors allocated to this stage. For each allocation of x processors to stage i, we then have a corresponding allocation descriptor (i, x, TB_i^x, TP_i^x). We maintain certain data structures over the allocation descriptors to facilitate the allocation of the L additional processors. First, we order the possible allocation descriptors into two allocation queues, QB and QP, each of which consists of all k(L + 1) descriptors. Elements in QB and QP are sorted in decreasing order of TB_i^x and TP_i^x, respectively. Note that each allocation descriptor appears in both QB and QP, but possibly at different positions. Marking an allocation descriptor can be thought of as allocating one additional processor to the stage specified in that descriptor.

Allocating L additional processors to the pipeline stages then corresponds to marking L distinct allocation descriptors. Our method of marking allocation descriptors is to follow the descriptor orders in either QB or QP. Thus a legitimate marking of L descriptors marks the first j descriptors of QB and the first k(j) descriptors of QP such that there are exactly L distinct descriptors among them. (Note that one descriptor may fall among both the first j elements of QB and the first k(j) elements of QP.) Within each stage, marking can only occur in increasing order of the x values for that stage. The allocation descriptors of stage i start with (i, ⌈m_i⌉, TB_i^{⌈m_i⌉}, TP_i^{⌈m_i⌉}). When (i, x, TB_i^x, TP_i^x) is marked, at least x + 1 − ⌈m_i⌉ descriptors of stage i must have been marked, and at least x + 1
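The counterexample above can be confirmed by exhaustive search over integer allocations (a sketch):

```python
from itertools import product

# Workloads of the 3-stage counterexample; each stage needs >= 1 processor.
WB, WP = (1, 1, 2), (10, 10, 2)

def best_alloc(N):
    """Exhaustive search: every integer allocation summing to exactly N."""
    cands = [n for n in product(range(1, N + 1), repeat=3) if sum(n) == N]
    return min(cands, key=lambda n: max(b / x for b, x in zip(WB, n))
                                    + max(p / x for p, x in zip(WP, n)))
```

With 4 processors the search returns (1, 1, 2), but with 5 it jumps to (2, 2, 1), so no greedy one-processor-at-a-time extension of the 4-processor optimum can find it.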

Lemma 7: The table-building time of the first unmarked element of QB is the table-building time of the pipeline, and the tuple-probing time of the first unmarked element of QP is the tuple-probing time of the pipeline.

Denote the i-th element of QB as b_i and its associated building time as TB_{b_i}; p_j in QP and TP_{p_j} are defined similarly. The procedure PA_III for Π_III can be summarized as follows.

Procedure for allocation Π_III (PA_III)

1. For each stage i, 0 ≤ i ≤ k−1, first allocate ⌈m_i⌉ processors to that stage.
2. For each stage i, 0 ≤ i ≤ k−1, determine the L + 1 allocation descriptors (i, x, TB_i^x, TP_i^x), ⌈m_i⌉ ≤ x ≤ ⌈m_i⌉ + L, for the possible allocations of additional processors, where L = N − Σ_j ⌈m_j⌉.
3. Construct QB and QP (each with k(L + 1) descriptors), in which the allocation descriptors are sorted in descending order of TB_i^x and TP_i^x, respectively.
4. For each j, 0 ≤ j ≤ L, determine k(j) (≤ L − j) such that the number of distinct elements in the set D_j = {b_1, b_2, ..., b_j} ∪ {p_1, p_2, ..., p_{k(j)}} is L. (This can be achieved by first marking the first j elements of QB and then the first L − j unmarked elements of QP. Thus D_j is the set of marked allocation descriptors and corresponds to an allocation of the L processors.)
5. For each j, calculate the time TB_{b_{j+1}} + TP_{p_{k(j)+1}}, which is the processing time of the pipeline under the corresponding allocation. Choose the j, 0 ≤ j ≤ L, with the shortest processing time. The corresponding allocation is Π_III.

Consider the example in Table 2. Since ⟨m_i⟩ = ⟨|R_i|/M⟩ = (5.0, 2.5, 2.5, 5.0, 1.7) as in Section 3.2, we get ⟨⌈m_i⌉⟩ = (5, 3, 3, 5, 2), meaning that there are L = 2 processors left that will be added to the appropriate stages by Π_III. QB and QP can then be determined, each consisting of (2 + 1) × 5 = 15 elements of the form (i, x, TB_i^x, TP_i^x). We then have b_1 = (0, 5, 1.2, 0.8), b_2 = (3, 5, 1.2, 0.6), b_3 = (4, 2, 1.0, 2.5), b_4 = (1, 3, 1.0, 1.33), etc., for QB, and p_1 = (4, 2, 1.0, 2.5), p_2 = (4, 3, 0.66, 1.66), p_3 = (1, 3, 1.0, 1.33), p_4 = (4, 4, 0.5, 1.25), etc., for QP. From Step 4 of PA_III we obtain, for (j, k(j)) = (0, 2), TS = TB_{b_1} + TP_{p_3} = 1.2 + 1.33 = 2.53; for (j, k(j)) = (2, 0), TS = TB_{b_3} + TP_{p_1} = 1.0 + 2.5 = 3.5; for (j, k(j)) = (1, 1), TS = TB_{b_2} + TP_{p_2} = 1.2 + 1.66 = 2.86; etc. It follows that the 2 additional processors should be added to stage 4 to achieve the optimal allocation Π_III = (5, 3, 3, 5, 4) with TS_III = 2.53.

It can be verified that the complexity of PA_III is O(Nk · log(Nk)) because of the sorting of QB and QP. Also, from the way QB and QP are constructed, it follows that the allocation achieved by Π_III has the shortest processing time among all possible allocations.
Theorem 3: Considering all three constraints, PA_III gives the optimal processor allocation to the stages in a pipeline.

4 Simulation

We have formulated processor allocation as a deterministic optimization problem and developed optimal solution procedures. In the simulation, each individual tuple actually runs through the stages of a pipeline of hash joins, so that burst effects and the actions for pipeline fill-up and depletion are captured. The simulation verifies that our formulation as a two-phase mini-max optimization problem approximates the original pipelined hash join problem well and provides it with an effective solution procedure.

4.1 The Simulation Model

To focus on the effect of processor allocation on pipelined hash joins, we simulate pipeline segments of hash joins, with and without the optimal processor allocation scheme. The simulator takes as inputs the hardware parameters, a pipeline segment of hash joins, and a processor allocation for the pipeline; it outputs the query processing time of the pipeline. The number of stages in the pipeline is predetermined in the simulation, and the cardinalities of the relations in the pipeline stages are randomly chosen from fixed ranges. In order to simulate the behavior of each tuple accurately, we scaled down the average relation size and reduced the memory size of each processor accordingly, so that the ratio of average relation size to memory size is nonetheless realistic.

The hash table for each pipeline stage is built in the table-building phase by partitioning its inner relation into subtables, one for each processor allocated to the stage. In the tuple-probing phase, each incoming tuple is routed to one of these subtables at random. A random number generator, coded based on the join selectivities, is then used to generate the resulting tuples. The time spent on actions such as partitioning, hashing, matching, and building resulting tuples is highly dependent on the architecture, the OS (or equivalent system software), and the actual implementation. To better reflect reality, we determined the relative CPU times for the various actions by taking measurements of sample code on a SUN SPARCstation. These parameters are reported in Table 3.

Table 3: Architectural parameters employed in simulation (times in micro-seconds).

    tread     14
    treceive  20
    thash     12
    tcomp     12
    tbuild     8
    tpart     12
    tsend     20
    tinsert    2
    twrite    14
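As a rough illustration of how the per-action times in Table 3 might combine into per-tuple costs, the sketch below assumes one plausible composition of actions for each phase; this particular composition is our assumption and is not specified in the paper.

```python
# Per-action times in micro-seconds, from Table 3.
t = {"read": 14, "receive": 20, "hash": 12, "comp": 12,
     "build": 8, "part": 12, "send": 20, "insert": 2, "write": 14}

# Assumed per-tuple composition for the tuple-probing phase:
# receive a tuple, hash it, compare it against the subtable,
# build the result tuple, and send it downstream.
probe_cost = t["receive"] + t["hash"] + t["comp"] + t["build"] + t["send"]

# Assumed per-tuple composition for the table-building phase:
# read an inner tuple, hash it, and insert it into the subtable.
build_cost = t["read"] + t["hash"] + t["insert"]
```

Under these assumptions a probed tuple costs a few times more than an inserted one, which is consistent with the tuple-probing phase often dominating total execution time.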

4.2 Simulation Results

Two processor allocation schemes are used for each query in the simulation: the optimal processor allocation algorithm PAIII and a heuristic HT. Heuristic HT allocates processors to the stages in proportion to their hash table sizes. The numbers of processors so allocated are real numbers, which are then converted to integers subject to the constraint that the number of processors allocated to each stage is large enough to hold its hash table. The pipelines employed in the simulation are chosen to be of five stages (i.e., six relations). Results from four experiments are presented here; more experiments on sensitivity analysis have been conducted and indicated similar trends. In each experiment, three hardware configurations, with 16, 32 and 64 processors, respectively, are used. Sixty pipelines are simulated for each hardware configuration. Each combination of pipeline and hardware configuration is simulated three times to capture the randomness in the tuple generation
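Heuristic HT as described above can be sketched as follows. The rounding rule (hand leftover processors to the stages with the largest fractional remainders) is one reasonable reading of the real-to-integer conversion, and all names are illustrative.

```python
import math

def ht_allocate(table_sizes, mem_per_proc, total_procs):
    """Heuristic HT: processors in proportion to hash table sizes, with
    each stage given at least enough processors to hold its hash table."""
    # Minimum processors per stage so its subtables fit in memory.
    alloc = [math.ceil(s / mem_per_proc) for s in table_sizes]
    remaining = total_procs - sum(alloc)
    assert remaining >= 0, "not enough processors to hold all hash tables"
    # Proportional (real-valued) shares of the remaining processors.
    total = sum(table_sizes)
    shares = [s / total * remaining for s in table_sizes]
    extra = [int(sh) for sh in shares]
    # Hand leftover processors to the largest fractional remainders.
    leftover = remaining - sum(extra)
    order = sorted(range(len(table_sizes)),
                   key=lambda i: shares[i] - extra[i], reverse=True)
    for i in order[:leftover]:
        extra[i] += 1
    return [a + e for a, e in zip(alloc, extra)]
```

For example, with five stages whose hash tables occupy 800, 1600, 3200, 800, and 1600 pages, 800 pages of memory per processor, and 16 processors, ht_allocate([800, 1600, 3200, 800, 1600], 800, 16) yields [2, 3, 6, 2, 3].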

Figure 3: Comparison of the optimal scheme and heuristic HT when max card./min card.=4.

Figure 4: Comparison of the optimal scheme and heuristic HT when max card./min card.=6.

Figure 5: Comparison of the optimal scheme and heuristic HT for table-building phase bottleneck.

Figure 6: Comparison of the optimal scheme and heuristic HT for tuple-probing phase bottleneck.

(Each figure plots execution time in micro-seconds against the number of processors, 16, 32 and 64, over 60 pipeline segments, comparing heuristic HT against the optimal allocation.)

and matching in a join during the simulation. In the first experiment, the cardinalities of all inner relations and intermediate relations are randomly selected from the range 800 to 3200, so the ratio of maximal to minimal cardinality equals 4. The second experiment deals with relations with a larger variance, with cardinalities generated from 600 to 3600, so the ratio of maximal to minimal cardinality equals 6. In the third experiment, we randomly set a stage to be the bottleneck of the table-building phase by assigning it an inner relation ten times the average size of the other inner relations. The ratio of maximal to minimal cardinality of the non-bottleneck stages is set to 4. In the fourth experiment, we set one stage to be the bottleneck of the tuple-probing phase by adjusting its tuple-probing phase workload to be ten times the average of the other stages. (This is achieved by making the intermediate relation from the previous stage ten times as large as the average.) Without loss of generality, we set the bottleneck stage to be the last stage in the pipeline. The ratio of maximal to minimal cardinality of the non-bottleneck stages is also set to 4.

The performance data for the first two experiments are shown in Figures 3 and 4. It can be seen that in the first experiment, the optimal processor allocation scheme shows a 28% to 31% improvement over the simple heuristic. When the ratio of maximal to minimal cardinality increases to 6 in the second experiment, the improvement of the proposed scheme grows to between 36% and 38%, meaning that when the cardinalities of relations and join attributes become more widely varied, PAIII offers an even larger performance improvement.

As shown in Figures 5 and 6, in the case of a tuple-probing phase bottleneck the optimal processor allocation provides a 39% to 43% improvement in execution time. In the case of a table-building phase bottleneck, it shows a 45% to 51% improvement, an even better result. In general, the more skewed the input is, the more improvement a better processor allocation scheme can achieve. As a matter of fact, skewed inputs are usually the ones that create performance bottlenecks, and the very ones we would like to tackle for better overall performance. Overall, the simulation results show that the proposed scheme performs very consistently and can lead to significant performance improvement over simple heuristics.

5 Conclusions

This paper presents a method for achieving optimal processor allocation for pipelined hash joins. We formulated the processor allocation problem as a two-phase mini-max optimization problem and developed both exact and approximate solution procedures. We also demonstrated through simulation that our problem formulation, though not explicitly incorporating all dynamic effects of pipelining, can lead to solutions with substantial performance improvement over previous approaches. The results are useful for dealing with both right-deep trees and segmented right-deep trees.

References

[1] H. Boral, W. Alexander, et al. Prototyping Bubba, A Highly Parallel Database System. IEEE Transactions on Knowledge and Data Engineering, 2(1):4-24, March 1990.

[2] M.-S. Chen, M.-L. Lo, P. S. Yu, and Y. C. Young. Using Segmented Right-Deep Trees for the Execution of Pipelined Hash Joins. Proceedings of the 18th International Conference on Very Large Data Bases, pages 15-26, August 1992.

[3] M.-S. Chen, P. S. Yu, and K.-L. Wu. Scheduling and Processor Allocation for Parallel Execution of Multi-Join Queries. Proceedings of the 8th International Conference on Data Engineering, pages 58-67, February 1992.

[4] D. J. DeWitt and R. Gerber. Multiprocessor Hash-Based Join Algorithms. Proceedings of the 11th International Conference on Very Large Data Bases, pages 151-162, August 1985.

[5] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma Database Machine Project. IEEE Transactions on Knowledge and Data Engineering, 2(1):44-62, March 1990.

[6] D. J. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Comm. of the ACM, 35(6):85-98, June 1992.

[7] W. Hong. Exploiting Inter-Operator Parallelism in XPRS. Proceedings of ACM SIGMOD, pages 19-28, June 1992.

[8] K. Hua, Y.-L. Lo, and H. C. Young. Including the Load Balancing Issue in the Optimization of Multi-Way Join Queries for Shared-Nothing Database Computers. Proceedings of the 2nd Conference on Parallel and Distributed Information Systems, pages 74-83, January 1993.

[9] T. Ibaraki and N. Katoh. Resource Allocation Problems: Algorithmic Approaches. The MIT Press, Cambridge, Massachusetts, 1988.

[10] M. Kitsuregawa, H. Tanaka, and T. Moto-Oka. Architecture and Performance of Relational Algebra Machine GRACE. Proceedings of the International Conference on Parallel Processing, pages 241-250, August 1984.

[11] M.-L. Lo, M.-S. Chen, C. V. Ravishankar, and P. S. Yu. Optimal Processor Allocation for Pipelined Hash Joins. IBM Research Report RC 18303, September 1992.

[12] R. A. Lorie, J.-J. Daudenarde, J. W. Stamos, and H. C. Young. Exploiting Database Parallelism in a Message-Passing Multiprocessor. IBM Journal of Research and Development, 35(5/6):681-695, September/November 1991.

[13] H. Lu, K. L. Tan, and M.-C. Shan. Hash-Based Join Algorithms for Multiprocessor Computers with Shared Memory. Proceedings of the 16th International Conference on Very Large Data Bases, pages 198-209, August 1990.

[14] P. Mishra and M. H. Eich. Join Processing in Relational Databases. ACM Computing Surveys, 24(1):63-113, March 1992.

[15] N. Roussopoulos and H. Kang. A Pipeline N-Way Join Algorithm Based on the 2-Way Semijoin Program. IEEE Transactions on Knowledge and Data Engineering, 3(4):461-473, December 1991.

[16] D. Schneider and D. J. DeWitt. Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines. Proceedings of the 16th International Conference on Very Large Data Bases, pages 469-480, August 1990.

[17] M. Stonebraker, R. Katz, D. Patterson, and J. Ousterhout. The Design of XPRS. Proceedings of the 14th International Conference on Very Large Data Bases, pages 318-330, 1988.

[18] A. Wilschut and P. Apers. Dataflow Query Execution in a Parallel Main-Memory Environment. Proceedings of the 1st Conference on Parallel and Distributed Information Systems, pages 68-77, December 1991.

[19] P. S. Yu, M.-S. Chen, H. Heiss, and S. H. Lee. On Workload Characterization of Relational Database Environments. IEEE Transactions on Software Engineering, 18(4):347-355, April 1992.
