Efficient Techniques for Fast Nested Barrier Synchronization*

Vara Ramakrishnan        Isaac D. Scherson        Raghu Subramanian

{vara,isaac,[email protected]
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92717-3425

Abstract

Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implementations. Since large data parallel programs often have several levels of nested barriers, these schemes provide significant speedups in the execution of such programs on MIMD computers. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. However, this scheme increases the code size. The second scheme uses a more expensive integer-tree to support an exponential number of nested barriers without increasing the code size. Using hardware already available on commercial MIMD computers, this scheme can support more than four billion levels of nesting.

1 Introduction

The data parallel programming model allows a natural way of expressing the large degree of parallelism involved in most computationally intensive and mathematically complex problems. In the data parallel model, the programmer can retain a single thread of control while performing computations on a large set of data. This is perhaps the reason for the preponderance of data parallelism in massively parallel computations. For example, all of the more than 120 parallel algorithms in a survey of three ACM Symposia on Theory of Computing (STOC) were data parallel [14]. One way of executing a data parallel program is on an SIMD (Single-Instruction Multiple-Data) computer. The semantics of the abstract model have often been confused with its hardware implementation, and data parallelism has been equated to SIMD. However, not only is it possible to execute a data parallel program on an MIMD (Multiple-Instruction Multiple-Data) computer, there are also several advantages of doing so [6]. The chief advantage is that an MIMD computer does not force unnecessary synchronization after every instruction, or unnecessary sequentialization of non-interfering branches, as an SIMD computer does. Moreover, many jobs can be run simultaneously on different partitions of an MIMD computer. These factors, among others, explain the recent market trend towards MIMD computers, exemplified by Thinking Machines' CM-5 and Cray Research's T3D.

Data parallel programs involve a form of synchronization called barrier synchronization: a point in the code is designated as a barrier, and no processor is allowed to cross the barrier until all processors involved in the computation have reached it. Since data parallel programs involve frequent barrier synchronizations, a computer intended to run data parallel programs must implement them efficiently. For this purpose, current MIMD computers, including the CM-5 and T3D, provide a dedicated barrier tree exclusively for barrier synchronizations [2, 13, 16].

Current MIMD computers provide just one barrier tree per user application. However, data parallel programs very often require the simultaneous use of more than one barrier synchronization tree, because data dependent conditionals and loops may have barriers nested within them. Any reasonably large data parallel program has several levels of nested barriers. Since providing as many trees as the number of nestings in programs is not feasible due to cost constraints, current machines solve this problem by also implementing barriers in software. Software barriers are implemented using shared semaphores or complicated message passing protocols, and they suffer from either the sequential bottleneck associated with shared semaphores or the large communication latencies of message passing. Since dedicated hardware barrier trees are intrinsically parallel and have very low latency, they are usually an order of magnitude faster than software barriers. There exist numerous algorithms and methods in the literature for improved software barriers, including [1, 5, 7, 10, 11, 12, 17], but these are improvements on a mechanism that is inherently slow. Methods for masking the latency of barriers have also been proposed [3, 4]. These methods hide the synchronization overhead as well as the time spent waiting for other processors to reach the barrier. They depend on being able to schedule other operations on the processors while waiting for barriers to complete, and therefore perform well on some applications and poorly on others.

* This research was supported in part by the Air Force Office of Scientific Research under grant number F49620-92-J-0126, NASA under grant number NAG-5-2561, and the NSF under grant number MIP-9205737.

Appeared in the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '95), July 1995, Santa Barbara, CA.

Figure 1. A barrier tree is used in current MIMD computers to perform barrier synchronizations. (The diagram shows a reduction tree of AND gates joined at its root to a broadcast tree of duplicators; each processor's LOCAL flag feeds the reduction tree, and the broadcast tree drives each processor's GLOBAL flag.)

Figure 2. An OP-tree with n-bit inputs and outputs. (Same structure as Figure 1, with n-bit edges, OP logic in place of the AND gates, and n-bit LOCAL and GLOBAL flags.)

The ability to provide hardware support for all nested barriers in a data parallel program will result in a significant speedup of all data parallel applications. In this paper, two schemes are presented for supporting nested barriers using only limited hardware. The first scheme uses two single-bit-trees to support any number of nested barriers. The method relies on code transformations, and it increases the code size. The second scheme uses an integer-tree, which requires more expensive hardware, to support an exponential number of nested barriers without increasing the code size. With hardware currently available on the CM-5, this scheme can support more than four billion levels of nesting.

The barrier synchronization hardware available on existing machines is described in Section 2. Section 3 examines the semantics of data parallel programs and the problems of implementing them with current barrier synchronization hardware. The first scheme for executing nested data parallel programs, using two single-bit-trees and code transformations, is described in Section 4.1. The second scheme, using an integer-tree, is presented in Section 4.2. Section 5 gives a comparison of the two schemes and directions for future work.

2 Barrier Synchronization Trees

The following functionality is desired of a barrier: a point in the code is designated as a barrier, and no processor may cross the barrier until all processors have reached it. A barrier is implemented as follows. Each processor is assumed to have two 1-bit flags, LOCAL and GLOBAL. When a processor arrives at a barrier instruction in its program, it sets its LOCAL flag and begins testing its GLOBAL flag. As soon as the processor finds its GLOBAL flag set, it clears its LOCAL flag and carries on with its next instruction. The barrier synchronization tree must ensure that no processor's GLOBAL flag is set until all processors have set their LOCAL flags.

A barrier tree is implemented as two complete binary trees, called the reduction tree and the broadcast tree, joined at their roots (see Figure 1). The reduction tree consists of AND gates, and takes its inputs from the LOCAL flags of the processors. The broadcast tree consists of duplicators. (A duplicator is a unit that takes in an input and outputs two copies of it.) The root of the reduction tree feeds the input of the duplicator at the root of the broadcast tree. The outputs of the broadcast tree are delivered to the GLOBAL flags of the processors.

The barrier tree works as follows: the output of the reduction tree at the root is the AND of all the LOCAL flags. The broadcast tree merely copies this output into all the GLOBAL flags. Therefore, none of the GLOBAL flags are set to 1 before all the LOCAL flags are set to 1, just as desired. The time for the GLOBAL flags to be set after all processors have set their LOCAL flags (measured in gate delays) is proportional to the height of the tree, or equivalently, Θ(log p), where p is the number of processors. This is much faster than older techniques that used either shared semaphores or complicated message-passing protocols.

The barrier tree in Figure 1 is a special case of the OP-tree shown in Figure 2. In this case, each processor has two n-bit flags, LOCAL and GLOBAL. The processor puts a value on its LOCAL flag, and tests for a particular value on the GLOBAL flag, which will be available after a hardware latency of Θ(log p log(n+1)). If it finds that particular value on the GLOBAL flag, it proceeds. The tree is constructed of n-bit edges. The operation (OP) performed in the nodes of the reduction tree may be any associative operation on two n-bit inputs, such as integer maximum, minimum, sum, bitwise AND, OR, etc. For instance, if the OP is maximum, the GLOBAL flag returns the maximum of all the LOCAL flag values. The barrier tree available in existing MIMD computers is the special case where n = 1 and the operation is AND. The generic OP-tree will be used later in our solutions.
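To make the combine-and-broadcast behaviour concrete, here is a minimal functional model of one pass through such a tree (a Python sketch of the semantics only; the function name reduce_broadcast is mine, not a CM-5 or T3D interface):

    # Functional model of an OP-tree pass: combine every processor's LOCAL
    # flag with an associative OP at the root, then broadcast the root value
    # to every processor's GLOBAL flag.
    from functools import reduce

    def reduce_broadcast(local_flags, op):
        root = reduce(op, local_flags)
        return [root] * len(local_flags)

    # 1-bit AND-tree (the barrier tree of Figure 1): GLOBAL becomes 1 only
    # after every processor has set its LOCAL flag.
    print(reduce_broadcast([1, 1, 0, 1], lambda a, b: a & b))   # -> [0, 0, 0, 0]
    print(reduce_broadcast([1, 1, 1, 1], lambda a, b: a & b))   # -> [1, 1, 1, 1]

    # n-bit max-tree (the OP-tree of Figure 2 with OP = integer maximum):
    # every GLOBAL flag returns the maximum of all LOCAL values.
    print(reduce_broadcast([3, 7, 7, 2], max))                  # -> [7, 7, 7, 7]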

3 Barrier Synchronizations within Parallel Conditionals and Loops

In the data parallel model of execution, it is assumed that processors run asynchronously between barrier synchronizations in which all processors participate. The barrier synchronization hardware (an AND-tree) in existing MIMD computers is based on this model of execution. In reality, the local operations in a data parallel program may be conditional or loop constructs with control expressions involving parallel (non-scalar) variables. This leads to barrier synchronizations in which not all processors may participate. These constructs drastically affect the hardware barrier synchronization requirements of data parallel programs. To understand these requirements, the semantics of parallel conditionals and loops are examined below, with the assumption that all barriers must be executed using available hardware.

The set of processors which must execute the statements inside a parallel conditional or loop construct is determined by a parallel control expression. A processor which evaluates the expression to false is assumed to skip to the end of the construct and wait at a barrier there. Therefore, while executing a data parallel program asynchronously, barrier synchronizations need to be placed at the end of loops and conditionals. To ensure a correct implementation of the semantics of data parallel programs without knowledge of the data dependencies in the program, it is also necessary to place barriers before and after each remote read/write operation (or computation involving shared variables). For the code shown in Figure 3, the implicit barriers are placed as shown in Figure 4.

    parbegin
    (1)    computation 1
    (2)    if (C1) then
    (3)      read shared data
    (4)    endif
    (5)    write shared data
    (6)    computation 2
    (7)    while (C2) do
    (8)      write shared data
    (9)    endwhile
    parend

    Figure 3. A parallel program containing a conditional and a loop.

    parbegin
    (1)    computation 1
    (2)    if (C1) then
    (3)      barrier
    (4)      read a neighbor's data
    (5)    endif
    (6)    barrier
    (7)    write shared data
    (8)    computation 2
    (9)    barrier
    (10)   while (C2) do
    (11)     barrier
    (12)     write shared data
    (13)   endwhile
    (14)   barrier
    parend

    Figure 4. The parallel program of Figure 3 with the implicit barriers placed in the code.


Figure 5. The data parallel model of execution due to conditionals and loops. An OP is a local operation or remote read/write performed asynchronously between barriers. A * indicates that the processor did not perform that OP and skipped to the next barrier. (The diagram shows processors 1 through p advancing through rows of OPs separated by Barrier1 through Barrier5; Barrier4 appears three times because the processors still inside the loop execute it once per iteration.)

Parallel Conditionals

In the example in Figure 4, the processors that evaluate the control expression C1 (on line 2) to false skip to the endif. The semantics of parallel ifs assume an implicit barrier (on line 6) after the endif. The processors not participating in the conditional operation fall to this barrier. If there were no remote reads or writes in the construct, the processors could continue execution beyond the conditional. However, in this example there is a remote read in the construct, so a barrier is required to correctly implement the semantics of the if. Let us assume that an AND-tree is used to execute this barrier. The processors which skipped to line 7 set their LOCAL flags on the AND-tree and wait. There is another implicit barrier (on line 3) inside the construct, to ensure that correct data is read on line 4. So a problem is encountered: the AND-tree is already in use, and the barrier at line 3 cannot be executed without setting the GLOBAL flags of the processors waiting at line 6, thereby letting them cross that barrier. In other words, the existing information on the AND-tree would be destroyed if it were used to execute another barrier. Therefore, an independent tree is needed to execute the barriers inside the if construct.

If a parallel conditional has an else, and either the then or else branch contains remote read or write operations, the following question must be asked: are there data dependencies between the branches? If there are, then one way to ensure correct execution is to sequentialize the execution of the branches. This can be done by placing a barrier between the two branches and treating the else as a separate conditional: if not(control expression) then. It may be possible to partially overlap the execution of the branches by rearranging the placement of barriers between them, but this can be done only after data dependencies across the branches are known. For programs which produce the same result regardless of the order of execution of the branches, a parallel execution of the branches is possible. The barrier tree requirements arising from such an execution will be discussed in a later section.

Parallel Loops

The semantics of parallel while-do, repeat-until, and for loops are similar. Therefore, without loss of generality, the while loop is used to illustrate our schemes in this paper. In the while loop of Figure 4, the processors which evaluate the control expression to false (on line 10) fall out to the end of the loop. All the processors which enter the loop eventually fall out of it, perhaps after a different number of iterations. Just as in the if construct, an implicit barrier (on line 14) is assumed after an endwhile statement, and an independent AND-tree is required to execute the barrier (on line 11) inside the loop.

Due to parallel loops and conditionals, the data parallel model of execution resembles the one shown in Figure 5. The figure represents p processors, P1, P2, ..., Pp, executing the program shown in Figure 4. Suppose P3, ..., Pp evaluate C1 to false, and fall out to barrier 2. P1 and P2 encounter barrier 1 inside the conditional before they join the other processors at barrier 2. Between barrier 2 and barrier 3, all processors execute all OPs. After crossing barrier 3, they encounter the loop. Now suppose P1 and P2 evaluate C2 to false, fall out of the loop immediately, and wait at barrier 5. Therefore, only processors P3, ..., Pp encounter barrier 4. Then they loop back to line 10. Let all of them evaluate C2 to true and enter the loop again. They encounter barrier 4 once more. When they loop back to line 10, let P3 evaluate C2 to false and fall out to barrier 5. P4, ..., Pp encounter barrier 4 before finally evaluating C2 to false and falling to barrier 5, which is common to all processors. The above example shows that even a simple data parallel program may require barrier synchronizations which are not common to all processors executing the program.

A barrier is now defined as a point in the program which cannot be crossed by a processor unless every other processor is either at that barrier or at some other barrier which the processor will reach later in its execution.

Nested Conditionals and Loops

In general, a parallel conditional or loop construct without any nestings can be executed using only hardware barriers if two AND-trees are available. One AND-tree (T1) can be used for barriers at the end of constructs, and the other (T2) is dedicated to barriers that occur inside such constructs and in the outer level of code. Processors that have fallen out of a construct need to indicate that they will not be participating in barriers occurring inside the construct, so they must set their LOCAL flags on T1 as well as T2. At barriers within constructs, processors set their LOCAL flags only on T2 (a sketch of this discipline is given after Figure 6). The hardware requirements for supporting barrier synchronization increase when conditionals and loops are nested. These requirements are exemplified by the code in Figure 6.

    parbegin
           some computations ...
    (1)    if (C1) then
    (2)      computation 1
    (3)      while (C2) do
    (4)        computation 2
    (5)        barrier
    (6)        computation 3
    (7)      endwhile
    (8)      barrier
    (9)      computation 4
    (10)   endif
    (11)   barrier
           some computations ...
    parend

    Figure 6. Barriers within nested constructs. Note that each of the three barriers will require a separate barrier tree.
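As a compact restatement of this two-tree discipline (a Python sketch; the event names are mine, not the paper's notation):

    def trees_to_signal(event):
        # Which AND-trees a processor sets its LOCAL flag on, for a
        # conditional or loop construct with no further nesting.
        if event == "barrier at end of construct":
            return {"T1"}
        if event == "barrier inside construct or at outer level":
            return {"T2"}
        if event == "falling out of a construct":
            # Wait at the end-of-construct barrier on T1, and pre-signal T2
            # so the barriers inside the construct can complete without us.
            return {"T1", "T2"}
        raise ValueError(event)

    print(trees_to_signal("falling out of a construct"))   # -> {'T1', 'T2'}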

In Figure 6, some processors fall out of the if on line 1 and wait at the barrier on line 11. Since the code within the if has barriers at two different levels (lines 5 and 8), it already requires two AND-trees. Neither of those two trees can be used for the barrier on line 11 without losing information about waiting processors. Hence, a third AND-tree is needed. When loops or conditionals are nested in any combination, one additional AND-tree will be required for every level of nesting. Providing as many trees as the number of nestings in programs is infeasible for deep levels of nesting. Due to cost constraints, real machines can only provide a small fixed number of barrier trees. The option of using software barriers is available, but since they are an order of magnitude slower than hardware barriers, it is important to avoid their usage. Therefore, we would like to implement all barriers using a limited number of hardware trees.
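A small counting sketch (the token format is my own, not the authors' notation) makes the same point: the number of AND-trees this naive approach needs equals the number of distinct nesting depths at which barriers occur.

    def trees_needed(tokens):
        # One AND-tree per nesting depth that contains at least one barrier.
        depth, depths_with_barriers = 0, set()
        for t in tokens:
            if t in ("if", "while"):
                depth += 1
            elif t in ("endif", "endwhile"):
                depth -= 1
            elif t == "barrier":
                depths_with_barriers.add(depth)
        return len(depths_with_barriers)

    # Figure 6: barriers at depths 2, 1 and 0, so three separate trees.
    fig6 = ["if", "work", "while", "work", "barrier", "work", "endwhile",
            "barrier", "work", "endif", "barrier"]
    print(trees_needed(fig6))   # -> 3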

4 Supporting Nested Barriers in Hardware

In this section, two schemes are proposed to execute nested barriers using only hardware trees and no software barriers. The first scheme involves some code transformations leading to increased code size, and uses just two single-bit-trees (one AND-tree and one OR-tree). It accomplishes this by ensuring that all processors step through every loop body the same number of times. The second scheme does not alter the code size, since it requires no code transformations. However, it uses an integer max-tree, which is more expensive than a single-bit-tree. Both schemes execute all barrier synchronizations in hardware, allowing a much faster execution of data parallel programs than when software barriers are used.

4.1 Using Two Single-Bit-Trees

Every conditional construct can be transformed so that all barriers within it are moved one level outward. By repeated application of this transformation, all barriers within a nested if (with any number of nestings of conditionals) can be moved to the outermost level of code. By these transformations, a parallel program that does not contain any loops can be executed using just one AND-tree. However, loops cannot be flattened in this manner, because barriers inside a loop may be executed many times, each time on a smaller subset of active processors. A partial flattening of the code in Figure 6 by transforming the if is shown in Figure 7 (based on [8]).

    parbegin
           some computations ...
    (1)    T1 <- C1
    (2)    if (T1) then
    (3)      computation 1
    (4)    endif
    (5)    T2 <- C2
    (6)    while (T1 AND T2) do
    (7)      computation 2
    (8)      barrier
    (9)      computation 3
    (10)   endwhile
    (11)   barrier
    (12)   if (T1) then
    (13)     computation 4
    (14)   endif
    (15)   barrier
           some computations ...
    parend

    Figure 7. A flattening of the conditional in Figure 6 such that all barriers nested within it are moved one level outward.

We now show how barriers within loops can be executed using an AND-tree and one additional single-bit-tree, regardless of the number of nestings in the program. The tree required is a special case of the generic OP-tree defined in Section 2; here n = 1, and the operation is OR. Such a tree is commonly known as an OR-tree. Since each processor has LOCAL and GLOBAL flags for each tree, the terms LOCALAND and GLOBALAND are used for the AND-tree flags, and the terms LOCALOR and GLOBALOR for the OR-tree flags. The while loop shown on lines 6 to 10 of Figure 7 can be transformed to the one shown in Figure 8.

    parbegin
           some computations ...
    (1)    T3 <- (T1 and T2)
    (2)    barrier
    (3)    while (GLOBALOR(T3) = true) do
    (4)      if (T3) then computation 2 endif
    (5)      barrier
    (6)      if (T3) then computation 3 endif
    (7)      if (T3) then T3 <- (T1 and T2) endif
    (8)      barrier
    (9)    endwhile
    (10)   barrier
           some computations ...
    parend

    Figure 8. Transformed while-loop from Figure 7, requiring one AND-tree and one OR-tree for execution.

In the transformed version, no processor ever falls out of the while on line 3 if there is even one other processor which needs to enter the loop. The OR-tree is used to execute the transformed code. Each processor evaluates the GLOBALOR of the control expression by placing its value in the LOCALOR flag. If at least one processor's LOCALOR flag is true, the root of the OR-tree outputs a true. Therefore, the GLOBALOR flags of all processors will be set to true (and they all enter the loop). This scheme is similar to the implementation used in SIMD machines to execute loops, where the front end uses the OR of the control expressions of all processors to decide how many times to broadcast the instructions within a loop. In the CM-5, the programming model allows only scalar variables to be used in the control expression of a loop, automatically creating loops which all or none of the processors can enter.

All barrier statements are executed using the AND-tree. When a processor reaches a barrier statement, it sets its LOCALAND flag to true, and waits for its GLOBALAND flag to be true. In the example in Figure 8, all processors reach line 4 together. Processors whose control expressions are false skip computation 2 and reach the barrier on line 5. When all processors whose control expressions are true also reach the barrier, the GLOBALAND flags of the processors become true, and they proceed to line 6. Note that the barrier is fulfilled even by processors which would not have entered the loop. Line 6 is executed similarly to line 4. The control expression is re-evaluated on line 7 only by processors whose control expression is currently true. This is because the semantics of the loop are such that a processor whose control expression is false must never get a chance to recompute it. The barrier is placed on line 8 to ensure that all processors check their OR-tree using the same control expression. The same AND-tree is used for executing the barriers at lines 5 and 10 (which are at different nesting levels). Even if one loop is nested inside another, by applying this transformation to both loops, the code can be executed using just one AND-tree and one OR-tree. For each level of loop nesting, the control expression evaluated and placed on the OR-tree is different, but one OR-tree suffices since all processors are in the same nesting level at any given time.

This transformation applies to any number of nestings in any combination of conditionals and loops. The transformation for conditionals applies to those with an else branch as well, by treating the branch as a separate conditional with a negated control expression. Since all barriers within conditionals are moved to an outer level of code, the branches of each conditional will be partially or fully sequentialized. Even if two branches containing barriers are shown to have no data dependencies across them, there will be at least a partial sequentialization of the branches, since only one barrier tree is available in this scheme.
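The control flow of Figure 8 can be mimicked with the following lockstep model (a Python sketch under my own simplifying assumption that computations 2 and 3 only update each processor's private loop condition): the single OR-tree decides whether anyone still needs the loop body, so every processor steps through it until the last active processor is done.

    def global_or(bits):
        return any(bits)   # OR-tree: the root value is broadcast to every GLOBALOR flag

    def run_transformed_loop(t1, iterations):
        # t1[i]: processor i's T1 (true if it entered the enclosing if).
        # iterations[i]: how many trips through the while body processor i needs.
        p = len(t1)
        remaining = list(iterations)
        t3 = [t1[i] and remaining[i] > 0 for i in range(p)]      # line (1): T3 <- (T1 and T2)
        trips = 0
        while global_or(t3):                                     # line (3): nobody falls out early
            trips += 1                                           # every processor executes the body
            for i in range(p):
                if t3[i]:
                    remaining[i] -= 1                            # lines (4)/(6): computations 2 and 3
            t3 = [t3[i] and remaining[i] > 0 for i in range(p)]  # line (7): re-evaluated only where true
            # lines (5), (8) and (10) are full barriers on the one AND-tree,
            # implicit in this lockstep model.
        return trips

    # Two processors entered the if and need 2 and 5 iterations; one skipped it.
    # All three busy-loop through the body 5 times.
    print(run_transformed_loop([True, True, False], [2, 5, 0]))  # -> 5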

4.2 Using an Integer-Tree

A scheme for executing nested barrier synchronizations in hardware without any code transformations is presented in this section. The scheme uses a max-tree, which is a special case of the generic OP-tree shown in Figure 2, with n-bit wide edges and the OP being integer maximum. The scheme also requires the use of one n-bit integer counter at each of the processors. Consider the example in Figure 9, which has implicit barriers at three levels of nesting, labeled 0, 1, and 2. If these barriers were implemented in software, labelling the synchronizing messages or shared semaphores with these numbers would suffice to distinguish one barrier from another. In other words, as many logical "trees" as required by the application can easily be constructed in software. To run this example, a naive implementation that uses one hardware tree for every logical tree would require three AND-trees (for three levels of barriers). By extension, it would be necessary to provide as many AND-trees in hardware as there are nested barriers in the application. However, it is only possible to construct a limited number of hardware barrier trees. Consequently, such an implementation is infeasible.

A scheme that implements 2^n - 1 logical barrier trees using one n-bit max-tree is now presented. Our scheme is based on the following key observation: barriers at an inner level of nesting must always complete before a barrier at an outer level (2 before 1 before 0). If each processor keeps track of its nesting level using its integer counter, one max-tree is sufficient to execute all nested barriers. The processor increments its counter at every nesting (loop or conditional) it encounters, and decrements it when it leaves a nesting. When it reaches a barrier, it knows its nesting level, and it proceeds beyond the barrier only when the value returned by the max-tree is equal to its nesting level. The processor may either poll the tree or be interrupted when the tree returns an appropriate value. The barrier synchronization algorithm using a max-tree is shown in Figure 10. The working of the algorithm is illustrated with the following example.

    parbegin
           some computations ...
    (1)    if (C1) then
    (2)      computation 1
    (3)      while (C2) do
    (4)        computation 2
    (5)        barrier          (Nesting Level 2)
    (6)        computation 3
    (7)      endwhile
    (8)      barrier            (Nesting Level 1)
    (9)      computation 4
    (10)   endif
    (11)   barrier              (Nesting Level 0)
           some computations ...
    parend

    Figure 9. The program in Figure 6, with nested constructs containing barriers at three nesting levels.

    Algorithm BarrierSynchronize
    parbegin
        LOCAL ← ∞
        NestingLevel ← 0
        while (program has not ended) do
            fetch next instruction
            if (instruction is conditional or loop) then
                NestingLevel ← NestingLevel + 1
            endif
            if (instruction is end-conditional or end-loop) then
                NestingLevel ← NestingLevel - 1
            endif
            if (instruction is barrier) then
                LOCAL ← NestingLevel
                wait until GLOBAL = LOCAL
                LOCAL ← ∞
            endif
        endwhile
        LOCAL ← 0
    parend

    Figure 10. Algorithm for barrier synchronization using the max-tree.

Consider the case of four processors, P1, P2, P3, P4, executing the example in Figure 9. Initially, all processors have their NestingLevel equal to 0, and their LOCAL flags on the max-tree have been initialized to ∞.

    Levels = ⟨0, 0, 0, 0⟩,  Flags = ⟨∞, ∞, ∞, ∞⟩,  Max = ∞

At line 1, say processor P1 evaluates C1 to false and falls to the endif on line 10. It puts its NestingLevel (0) on the max-tree and waits at the barrier on line 11. Processors P2, P3, and P4 increment their NestingLevels to 1, perform computation 1 and reach line 3.

    Levels = ⟨0, 1, 1, 1⟩,  Flags = ⟨0, ∞, ∞, ∞⟩,  Max = ∞

Say P2 evaluates C2 to false and falls to the endwhile on line 7. It puts a 1 on the max-tree and waits. Processors P3 and P4 increment their NestingLevels to 2 and reach line 4.

    Levels = ⟨0, 1, 2, 2⟩,  Flags = ⟨0, 1, ∞, ∞⟩,  Max = ∞

After performing computation 2, they reach line 5 (possibly at different times). When they have both put a 2 on the max-tree, the tree returns a 2.

    Levels = ⟨0, 1, 2, 2⟩,  Flags = ⟨0, 1, 2, 2⟩,  Max = 2

P3 and P4 cross the barrier and clear the values they had placed on the max-tree (by setting them to ∞).

    Levels = ⟨0, 1, 2, 2⟩,  Flags = ⟨0, 1, ∞, ∞⟩,  Max = ∞

P1 and P2 continue waiting, since they are waiting for the tree to return a 0 and a 1 respectively. Then P3 and P4 perform computation 3, loop back to line 3 and evaluate C2 again. Say P3 evaluates it to false and falls out to the endwhile on line 7. It decrements its NestingLevel to 1, reaches the barrier on line 8, and puts a 1 on the max-tree.

    Levels = ⟨0, 1, 1, 2⟩,  Flags = ⟨0, 1, 1, ∞⟩,  Max = ∞

P4 passes through to line 5 and puts a 2 on the max-tree. The tree returns a 2 immediately, which is ignored by processors P1, P2, and P3.

    Levels = ⟨0, 1, 1, 2⟩,  Flags = ⟨0, 1, 1, 2⟩,  Max = 2

P4 clears its tree value and proceeds back to line 3.

    Levels = ⟨0, 1, 1, 2⟩,  Flags = ⟨0, 1, 1, ∞⟩,  Max = ∞

When it finally evaluates C2 to false, it falls out to line 7 and decrements its NestingLevel to 1. On line 8, it puts this value on the max-tree, and the tree returns a 1.

    Levels = ⟨0, 1, 1, 1⟩,  Flags = ⟨0, 1, 1, 1⟩,  Max = 1

Now processors P2, P3 and P4 cross this barrier and do computation 4 on line 9.

    Levels = ⟨0, 1, 1, 1⟩,  Flags = ⟨0, ∞, ∞, ∞⟩,  Max = ∞

They reach the endif on line 10 (possibly at different times) and decrement their NestingLevels to 0.

    Levels = ⟨0, 0, 0, 0⟩,  Flags = ⟨0, ∞, ∞, ∞⟩,  Max = ∞

At the barrier, they each put a 0 on the max-tree, and the tree returns a 0.

    Levels = ⟨0, 0, 0, 0⟩,  Flags = ⟨0, 0, 0, 0⟩,  Max = 0

This lets all the processors (including P1) cross the barrier and continue execution.

    Levels = ⟨0, 0, 0, 0⟩,  Flags = ⟨∞, ∞, ∞, ∞⟩,  Max = ∞

The details of the implementation are described in the algorithm in Figure 10. The outermost level of code is assumed to be at nesting level 0. During program execution, the default value placed in the LOCAL flag by each processor is the largest n-bit integer, 2^n - 1. This is done so that the GLOBAL flag values returned by the max-tree are larger than the LOCAL flags of any processors that are waiting at barriers, until all processors are at barriers. In Figure 10, the symbol ∞ is used to represent this number. When a processor finishes executing its program, it sets its LOCAL flag to 0 to avoid interfering with the completion of the program on the other processors.

This scheme also correctly executes all conditionals with else branches. First consider else branches which have barriers logically independent from the barriers of the corresponding then branches. In such conditionals, a parallel execution of the two branches should always produce the same results. The barriers of the two branches are considered to be at the same nesting level l. Although the barriers are logically independent, they are executed on the same tree. Consider a then branch with a barriers and a corresponding else branch with b barriers. Without loss of generality, let a ≤ b. The a barriers of the then are executed by aligning them one-to-one with the first a barriers of the else. The next b - a barriers of the else are executed after the then branch processors fall to the end of the conditional and wait at the barrier at nesting level l - 1. Now consider else branches which need to be run after the corresponding then branches have been executed (because there are data dependencies across the branches). In this case, if the outer level of nesting is l - 1, the nesting level assigned to the then branch is l + 1 and the nesting level of the else is l. The first barrier of the else will be completed only after all the then barriers have completed and the processors of the then branch have fallen out to the barrier at nesting level l - 1. This ensures a correct sequential execution of the two branches.
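To see the max-tree scheme end to end, the following is a small, idealized simulation of the Figure 10 algorithm (a sketch: the instruction tokens, the random interleaving, and the atomic release step are my modelling assumptions, not the hardware's behaviour).

    import random

    INF = float("inf")   # models the largest n-bit integer used as the idle value

    def run(programs, seed=0):
        # programs[i] is processor i's instruction list over the tokens
        # 'begin' (enter a conditional/loop), 'end' (leave it), 'barrier', 'work'.
        rng = random.Random(seed)
        p = len(programs)
        pc      = [0] * p          # program counters
        level   = [0] * p          # the NestingLevel counters of Figure 10
        local   = [INF] * p        # LOCAL flags on the max-tree
        waiting = [False] * p
        trace = []
        while True:
            runnable = [i for i in range(p)
                        if not waiting[i] and pc[i] < len(programs[i])]
            if not runnable:
                if not any(waiting):
                    return trace                 # every processor has finished
                g = max(local)                   # GLOBAL = maximum of all LOCAL flags
                trace.append(("barrier completes at level", g))
                for i in range(p):               # release the processors waiting at that level
                    if waiting[i] and local[i] == g:
                        waiting[i] = False
                        local[i] = INF if pc[i] < len(programs[i]) else 0
                continue
            i = rng.choice(runnable)             # asynchronous interleaving
            op = programs[i][pc[i]]
            pc[i] += 1
            if op == "begin":
                level[i] += 1
            elif op == "end":
                level[i] -= 1
            elif op == "barrier":
                local[i] = level[i]              # announce the nesting level, then wait
                waiting[i] = True
            # 'work' needs no bookkeeping
            if pc[i] == len(programs[i]) and not waiting[i]:
                local[i] = 0                     # end of program: LOCAL <- 0, as in Figure 10

    # Figure 9 with a single loop iteration: two processors take the if,
    # one skips it and goes straight to the outermost (level 0) barrier.
    takes_if = ["begin", "work", "begin", "work", "barrier", "work", "end",
                "barrier", "work", "end", "barrier", "work"]
    skips_if = ["barrier", "work"]
    for event in run([takes_if, takes_if, skips_if]):
        print(event)   # barriers complete at levels 2, then 1, then 0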


5 Conclusions

In this paper we identified an often overlooked problem: efficiently implementing nested barrier synchronizations when executing data parallel programs asynchronously on MIMD machines. Two efficient hardware techniques for implementing nested barrier synchronizations were presented. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. The second scheme uses an integer-tree to support an exponential number of nested barriers.

One problem with the scheme using two single-bit-trees is that the code size increases significantly due to the compiler transformations. It is common for the transformed code to be two or three times as large as the original code. This also significantly increases the running time of the application. Another disadvantage is that the scheme works by making all processors busy-loop through every iteration of a loop. (A processor fetches the instructions inside the loop even if it has already determined that it should not execute the loop.) If the scheduling policy on the MIMD computer allows time-sharing of the individual processors, this scheme wastes processing power which could be utilized to run other jobs scheduled on the processors.

The scheme using a max-tree does not rely on any code transformations for execution, so the code size is not increased as in the earlier scheme. This is traded off against the fact that a max-tree can be much more expensive to implement in hardware than two single-bit-trees, if n is large. However, since an n-bit max-tree can support 2^n - 1 nested barriers simultaneously, small values of n are sufficient in general. For example, the CM-5, which provides a 32-bit max-tree in hardware as part of its control network [9], could support 4 billion levels of nesting (even though the max-tree is not currently used in this manner).

A shortcoming of both schemes is that they sequentialize (at least partially) the branches of conditionals that contain barriers, even if there are no data dependencies between them. Future work is required in finding techniques to independently synchronize the branches of a conditional, allowing a completely parallel execution. This is part of the more general problem of independently synchronizing multiple disjoint subsets of processors. Some results in this direction are available in [15].
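As a quick check of the arithmetic behind that figure (an n-bit max-tree has 2^n values, one of which is reserved as the idle value ∞):

    for n in (8, 16, 32):
        print(n, "bits:", 2**n - 1, "nesting levels")
    # 32 bits: 4294967295 levels, the "4 billion" cited above.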


Acknowledgements

We would like to thank Umesh Krishnaswamy and Luis Miguel Campos for their valuable suggestions and for convincing us these results were worth writing up. We also thank Bradley C. Kuszmaul for his prompt and detailed responses to our questions.


References

[1] A. Agarwal and M. Cherian. Adaptive backoff synchronization techniques. In 16th Annual International Symposium on Computer Architecture, pages 396-406, 1989.

[2] Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview Manual, 1993.

[3] R. Gupta. The fuzzy barrier: A mechanism for high speed synchronization of processors. In Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 54-63, 1989.

[4] R. Gupta and M. Epstein. Achieving low cost synchronization in a multiprocessor system. Future Generation Computer Systems, 6:255-269, December 1990.

[5] R. Gupta and C. R. Hill. A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming, 18:161-180, 1989.

[6] P. J. Hatcher, A. J. Lapadula, M. J. Quinn, and R. J. Anderson. Compiling data parallel programs for MIMD architectures. In W. Joosen and E. Milgrom, editors, Parallel Computing: From Theory to Sound Practice, Proceedings of EWPC '92, the European Workshops on Parallel Computing, pages 28-39. IOS Press, Amsterdam, Netherlands, 1992.

[7] L. Kontothanassis and R. Wisniewski. Using scheduler information to achieve optimal barrier synchronization performance. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, volume 28, pages 64-72, 1993.

[8] Bradley C. Kuszmaul. Personal communications, 1995.

[9] C. Leiserson et al. The network architecture of the Connection Machine CM-5. In Symposium on Parallel Algorithms and Architectures, pages 272-285, June 1992.

[10] E. Markatos, M. Crovella, and P. Das. The effects of multiprogramming on barrier synchronization. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 662-669, December 1991.

[11] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.

[12] J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269-278, April 1991.

[13] Wilfried Oed. The Cray Research Massively Parallel Processor System CRAY T3D. Available by anonymous ftp from ftp.cray.com, November 1993.

[14] Gary Sabot. The Paralation Model: Architecture-Independent Parallel Programming. MIT Press, Cambridge, MA, 1988.

[15] R. Subramanian and I. D. Scherson. Networks for multiple disjoint barrier synchronizations. Submitted to the 4th IEEE International Symposium on High Performance Distributed Computing (HPDC 95), 1995.

[16] Thinking Machines Corporation, Cambridge, MA. The Connection Machine CM-5 Technical Summary, October 1991.

[17] H. Xu, P. K. McKinley, and L. M. Ni. Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers. Journal of Parallel and Distributed Computing, 16:172-184, 1992.