Efficient Techniques for Nested and Disjoint Barrier Synchronization
Vara Ramakrishnan*
[email protected]
University of California, Dept. of Info. & Computer Science, Irvine, CA 92697

Isaac D. Scherson
[email protected]
University of California, Dept. of Info. & Computer Science, Irvine, CA 92697

Raghu Subramanian
[email protected]
PMC-Sierra San Jose R&D Center San Jose, CA 95110
* Also with PMC-Sierra
Efficient Nested and Disjoint Barrier Synchronization
Contact Author: Vara Ramakrishnan
[email protected]
University of California Dept. of Info. & Computer Science Irvine, CA 92697 (408) 432 0432
Abstract

Current MIMD computers support the execution of data parallel programs by providing a tree network to perform fast barrier synchronizations. However, there are two major limitations to using tree networks: The first arises due to control nesting in programs, and the second arises when the MIMD computer needs to run several programs simultaneously.

First, we present two hardware barrier synchronization schemes which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implementations. Since large data parallel programs often have several levels of nested barriers, these schemes provide significant speedups in the execution of such programs on MIMD computers. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. However, this scheme increases the code size. The second scheme uses a more expensive integer-tree to support an exponential number of nesting levels without increasing the code size: we show that up to n nested barriers can be supported by a network with bisection bandwidth O(log n) and a latency of O(log p log n) gate delays. Using tree network hardware already available on commercial MIMD computers, this scheme can support more than four billion levels of nesting.

Second, we present a design for a barrier synchronization network that is free from the partitioning constraints imposed by barrier trees. When the MIMD computer is partitioned among several jobs, then rather than barrier synchronizations, we desire multiple disjoint barrier synchronizations (MDBSs), where processors within each partition barrier synchronize among themselves without interfering with other partitions. Barrier trees can be adapted to handle MDBSs, but only if the partitions are constrained to be of very special sizes and shapes. These stringent constraints on partitioning often run contrary to other important considerations, such as the contiguity of the processors of each partition within the data network. Our MDBS network design allows for any number of partitions of any size and shape, as long as the processors comprising each partition are contiguous in the data network.

Key Words: barrier synchronization, barrier tree, MIMD, data parallel, control nesting, partition
Efficient Techniques for Nested and Disjoint Barrier Synchronization
1 Introduction

The data parallel programming model allows a natural way of expressing the large degree of parallelism involved in most computationally intensive and mathematically complex problems. In the data parallel model, the programmer can retain a single thread of control while performing computations on a large set of data. This is perhaps the reason for the preponderance of data parallelism in massively parallel computations. For example, all of the more than 120 parallel algorithms in a survey of three ACM Symposia on Theory of Computing (STOC) were data parallel [17, 18].

One way of executing a data parallel program is on an SIMD (Single-Instruction Multiple-Data) computer. The semantics of the abstract model have often been confused with its hardware implementation, and data parallelism has been equated to SIMD. However, not only is it possible to execute a data parallel program on an MIMD (Multiple-Instruction Multiple-Data) computer, there are also several advantages of doing so [8]. The chief advantage is that an MIMD computer does not force unnecessary synchronization after every instruction, or unnecessary sequentialization of non-interfering branches, as an SIMD computer does. Moreover, many jobs can be run simultaneously on different partitions of an MIMD computer. These factors, among others, explain the recent market trend towards MIMD computers, exemplified by the Thinking Machines CM-5 and the Cray Research T3D.

Data parallel programs involve a form of synchronization called barrier synchronization: A point in the code is designated as a barrier and no processor is allowed to cross the barrier until all processors involved in the computation have reached it. Since data parallel programs involve frequent barrier synchronizations, a computer intended to run data parallel programs must implement them efficiently. For this purpose, current MIMD computers, including the CM-5 and T3D, provide a dedicated barrier tree exclusively for barrier synchronizations [4, 15, 19].
1.1 Limitations from Control Nesting
Current MIMD computers provide just one barrier tree per user application. However, data parallel programs very often require the simultaneous use of more than one barrier synchronization tree, due to data-dependent conditionals and loops that have barriers nested in them. Any reasonably large data parallel program has several levels of nested barriers. Examples include parallel simulation programs and applications such as parallel electronic design automation software. Since providing as many trees as the number of nestings in programs is not feasible due to cost constraints, current machines solve this problem by also implementing barriers in software. Software barriers are implemented using shared semaphores or complicated message passing protocols. They suffer from either the sequential bottleneck associated with shared semaphores or the large communication latencies of message passing protocols. Since dedicated hardware barrier trees are intrinsically parallel and have very low latency, they are usually an order of magnitude faster than software barriers. There exist numerous algorithms and methods in the literature for improved software barriers, including [1, 7, 10, 13, 14, 12, 20], but these are improvements on a mechanism that is inherently slow. Methods for masking the latency of barriers have also been proposed [5, 6]. These methods hide the synchronization overhead as well as the time spent waiting for other processors to reach the barrier. They depend on being able to schedule other operations on the processors while waiting for barriers to complete. Therefore, they perform well on some applications and poorly on others.
Figure 1. A barrier tree is used in current MIMD computers to perform barrier synchronizations.
Figure 2. An OP-tree with n-bit inputs and outputs.
The ability to provide hardware support for all nested barriers in a data parallel program will result in a significant speedup of most data parallel applications. In this paper, two schemes are presented for supporting nested barriers using only limited hardware. Preliminary results in this area appeared in [16]. The first scheme uses two single-bit-trees to support any number of nested barriers. The method relies on code transformations, and it increases the code size. The second scheme uses an integer max-tree, which requires more expensive hardware, to support an exponential number of nesting levels without increasing the code size. With hardware currently available on the CM-5, this scheme can support more than four billion levels of nesting. Both schemes are also applicable to the design of general purpose barrier networks as in [2, 9].
1.2 Limitations from Partitioning
When an MIMD machine is partitioned among several jobs, then multiple disjoint barrier synchronizations (MDBS), rather than barrier synchronizations, are required. In an MDBS, processors within each partition barrier synchronize among themselves, but processors in different partitions do not interfere with each other. What kind of a network is required to implement MDBSs? It turns out that a barrier tree network can be modified to handle MDBSs, but only if the partitions are constrained to be of very special size and shape. This is a problem because there are other important factors that dictate the size and shape of partitions. For example, given the high cost of interprocessor communications, it is desirable that the processors of a partition be contiguous in the data network. The constraints on partitioning imposed by the barrier tree network may run contrary to these contiguity considerations. The only proposed solution that we are aware of involves using several barrier trees, one for each partition, and in each barrier tree, masking off the processors irrelevant to that synchronization [6, 4]. Apart from being wasteful in terms of hardware, this solution places an a priori limit on the number of partitions that can be created.

In this paper, we present a design for an MDBS network which matches the data network topology, and is therefore free of the constraints imposed by barrier tree networks. The design allows for any number of partitions of any size and shape, as long as the processors of each partition are contiguous in the data network. Both schemes outlined in the previous section can be trivially applied to MDBS networks.

This paper is organized as follows. The barrier synchronization hardware available on existing machines is described in Section 2. Section 3 examines the semantics of data parallel programs and the problems of implementing them with current barrier synchronization hardware. The first scheme for executing nested data parallel programs, using two single-bit-trees and code transformations, is described in Section 4.1. The second scheme, using an integer-tree, is presented in Section 4.2. Section 5 describes the constraints imposed on partitions by the barrier tree network, presents a design for an MDBS network, and shows that it is free of such constraints. Section 6 shows how the operating system may quickly reconfigure the MDBS network if partitions change dynamically as jobs arrive and complete.
2 Barrier Synchronization Trees

The following functionality is desired of a barrier synchronization network. Each processor is assumed to have two bits, a READY flag and a GO flag. When a processor arrives at a barrier instruction in its program, it sets its READY flag. Then it keeps testing its GO flag. As soon as the processor finds its GO flag set, it clears its READY flag and carries on with its next instruction. The barrier synchronization network must ensure that no processor's GO flag is set until all processors have set their READY flags. It must also ensure that the GO flag is not reset till all processors have a chance to see it.

The conventional implementation of a barrier synchronization network is the barrier tree. A barrier tree is implemented as two complete binary trees, called the reduction tree and the broadcast tree, joined at their roots (see Figure 1). The reduction tree consists of AND gates, and takes its inputs from the LOCAL flags of the processors. The broadcast tree consists of duplicators. (A duplicator is a unit that takes in an input and outputs two copies of it.) The root of the reduction tree feeds the input of the duplicator at the root of the broadcast tree. The outputs of the broadcast tree are delivered to the GLOBAL flags of the processors.

The barrier tree works as follows: The output of the reduction tree at the root is the AND of all the LOCAL flags. The broadcast tree merely copies this output into all the GLOBAL flags. Therefore, none of the GLOBAL flags are set to 1 before all the LOCAL flags are set to 1, just as desired. The time for the GLOBAL flags to be set after all processors have set their LOCAL flags (measured in gate delays) is proportional to the height of the tree, or equivalently, Θ(log p), where p is the number of processors. This is usually orders of magnitude faster than software barriers, which tend to have sequential bottlenecks as well as much higher latencies.
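As a concrete illustration of the processor-side protocol just described, the following C sketch busy-waits on memory-mapped READY and GO flags. The register addresses and the one-byte encoding of the flags are assumptions made only for this sketch; they are not taken from any particular machine.

#include <stdint.h>

/* Hypothetical memory-mapped views of this processor's barrier flags.
 * The addresses are placeholders, not those of a real machine.        */
static volatile uint8_t *const READY = (volatile uint8_t *)0xFFFF0000u;
static volatile uint8_t *const GO    = (volatile uint8_t *)0xFFFF0004u;

/* Arrive at a barrier: announce arrival, wait until the reduction and
 * broadcast trees report that every processor has arrived, then retract. */
static void barrier_wait(void)
{
    *READY = 1;          /* set READY: this processor is at the barrier  */
    while (*GO == 0)     /* poll the GO flag driven by the barrier tree  */
        ;                /* busy-wait                                    */
    *READY = 0;          /* clear READY before continuing                */
}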
parbegin
(1)  computation 1
(2)  if (C1) then
(3)      read shared data
(4)  endif
(5)  write shared data
(6)  computation 2
(7)  while (C2) do
(8)      write shared data
(9)  endwhile
parend
Figure 3. A parallel program containing a conditional and a loop.
The barrier tree in Figure 1 is a special case of the OP-tree shown in Figure 2. In this case, each processor has two n-bit flags, LOCAL and GLOBAL. The processor puts a value on its LOCAL flag, and tests for a particular value on the GLOBAL flag, which will be available after a hardware latency no worse than Θ(log p log n). If it finds that particular value on the GLOBAL flag, it proceeds. The tree is constructed of n-bit edges. The operation (OP) performed in the nodes of the reduction tree may be any associative operation on two n-bit inputs, such as integer maximum, minimum, sum, bitwise AND, OR and XOR. For instance, if the OP is maximum, the GLOBAL flag returns the maximum of all the LOCAL flag values. To ensure that all processors see their GLOBAL flag when it changes, processors may be imposed with a lower limit on the time they must hold their LOCAL flag value before resetting it. The barrier tree available in existing MIMD computers is the special case where n = 1 and the operation is AND. The generic OP-tree will be used later in our solutions.
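To make the OP-tree semantics concrete, the following C sketch models the reduction half of the tree in software: it folds the p LOCAL values pairwise, level by level, under an arbitrary associative operation, and the single value produced at the root is what the broadcast tree would copy into every GLOBAL flag. The function names and the limit on p are our own; this is a behavioral model, not a description of the hardware.

#include <stdint.h>

typedef uint32_t (*op_fn)(uint32_t, uint32_t);   /* associative OP on n-bit values */

static uint32_t op_and(uint32_t a, uint32_t b) { return a & b; }
static uint32_t op_max(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Behavioral model of the reduction tree over the p LOCAL flags. */
static uint32_t op_tree(const uint32_t *local, int p, op_fn op)
{
    uint32_t level[1024];                 /* this sketch assumes p <= 1024 */
    for (int i = 0; i < p; i++)
        level[i] = local[i];
    int width = p;
    while (width > 1) {                   /* one iteration per tree level */
        int next = 0;
        for (int i = 0; i + 1 < width; i += 2)
            level[next++] = op(level[i], level[i + 1]);
        if (width % 2)                    /* an odd node is passed up unchanged */
            level[next++] = level[width - 1];
        width = next;
    }
    return level[0];                      /* root value, copied into every GLOBAL flag */
}

Under this model, the AND-tree of Figure 1 is the special case op_tree(local, p, op_and) over 1-bit flags, and the max-tree used in Section 4.2 is op_tree(local, p, op_max).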
3 Barrier Synchronizations within Parallel Conditionals and Loops

In the data parallel model of execution, it is assumed that processors run asynchronously between barrier synchronizations in which all processors participate. The barrier synchronization hardware (an AND-tree) in existing MIMD computers is based on this model of execution. In reality, the local operations in a data parallel program may be conditional or loop constructs with control expressions involving parallel (nonscalar) variables. This leads to barrier synchronizations in which not all processors may participate. These constructs drastically affect the hardware barrier synchronization requirements of data parallel programs. To understand these requirements, the semantics of parallel conditionals and loops are examined with the assumption that all barriers must be executed using available hardware.

The set of processors which must execute the statements inside a parallel conditional or loop construct is determined by a parallel control expression. A processor which evaluates the expression to false is assumed to skip to the end of the construct and wait at a barrier there. Therefore, while executing a data parallel program asynchronously, barrier synchronizations need to be placed at the end of loops and conditionals (1). To ensure a correct implementation of the semantics of data parallel programs without a knowledge of the data dependencies in the program, it is also necessary to place barriers before and after each remote read/write operation or computation involving shared variables (2). After introducing barriers based on rules (1) and (2), there may be some redundant barriers (barriers with no executable code between them). Such barriers are eliminated, leaving only one representative barrier. For the code shown in Figure 3, the implicit barriers are placed and shown in Figure 4.
parbegin
(1)  computation 1
(2)  if (C1) then
(3)      barrier
(4)      read a neighbor's data
(5)  endif
(6)  barrier
(7)  write shared data
(8)  barrier
(9)  computation 2
(10) barrier
(11) while (C2) do
(12)     barrier
(13)     write shared data
(14)     barrier
(15) endwhile
(16) barrier
parend
Figure 4. The parallel program of Figure 3 with the implicit barriers placed in the code.
Parallel Conditionals
In the example in Figure 4, the processors that evaluate the control expression C1 (on line 2) to false skip to the endif. The semantics of parallel ifs assume an implicit barrier (on line 6) after the endif. The processors not participating in the conditional operation fall to this barrier. If there were no remote reads or writes in the construct, the processors could continue execution beyond the conditional. However, in this example, there is a remote read in the construct, so a barrier is required to correctly execute the semantics of the if. Let us assume that an AND-tree is used to execute this barrier. The processors which skipped to line 7 set their LOCAL flags on the AND-tree and wait. There is another implicit barrier (on line 3) inside the construct, to ensure that correct data is read in line 4. So, a problem is encountered: the AND-tree is already being used, and the barrier at line 3 cannot be executed without setting the GLOBAL flags of the processors waiting at line 6, therefore letting them cross that barrier. In other words, the existing information on the AND-tree would be destroyed if it were used to execute another barrier. Therefore, an independent tree is needed to execute the barriers inside the if construct.

If a parallel conditional has an else, and either the then or else branch contains remote read or write operations, the following question must be asked: Are there data dependencies between the branches? If there are, then one way to ensure correct execution is to sequentialize the execution of the branches. This can be done by placing a barrier between the two branches and treating an else as a separate conditional: if not(control expression) then. It may be possible to partially overlap the execution of the branches by rearranging the placement of barriers between them, but this can be done only after data dependencies across the branches are known. For programs which produce the same result regardless of the order of execution of the branches, a parallel execution of the branches is possible. The barrier tree requirements arising from such an execution will be discussed in a later section.
Parallel Loops
The semantics of parallel while-do, repeat-until, and for loops are similar. Therefore, without loss of generality, the while loop is used to illustrate the schemes in this paper.
Figure 5. The data parallel model of execution due to conditionals and loops. An OP is a local operation or remote read/write performed asynchronously between barriers. The * indicates that the processor did not perform that OP and skipped to the next barrier.
In the while loop of Figure 4, the processors which evaluate the control expression C2 to false (on line 10) fall out to the end of the loop. All the processors which enter the loop eventually fall out of the loop, perhaps after a different number of iterations. Just as in the if construct, an implicit barrier (on line 14) is assumed after an endwhile statement, and an independent AND-tree is required to execute the barrier (on line 11) inside the loop.

Due to parallel loops and conditionals, the data parallel model of execution resembles the one shown in Figure 5. The figure represents p processors, P1, P2, ..., Pp, executing the program shown in Figure 4. Suppose P3, ..., Pp evaluate C1 to false, and fall out to barrier 2. P1 and P2 encounter barrier 1 inside the conditional before they join the other processors at barrier 2. Between barrier 2 and barrier 3, all processors execute all OPs. After crossing barrier 3, they encounter the loop. Now suppose P1 and P2 evaluate C2 to false, fall out of the loop immediately, and wait at barrier 5. Therefore, only processors P3, ..., Pp encounter barrier 4. Then they loop back to line 10. Let all of them evaluate C2 to true and enter the loop again. They encounter barrier 4 once more. When they loop back to line 10, let P3 evaluate C2 to false, and fall out to barrier 5. P4, ..., Pp encounter barrier 4 before finally evaluating C2 to false and falling to barrier 5, which is common to all processors.

The above example shows that even a simple data parallel program may require barrier synchronizations which are not common to all processors executing the program. A barrier is now defined as a point in the program which cannot be crossed by a processor unless every other processor is either at that barrier or at some other barrier which the processor will reach later in its execution.
Nested Conditionals and Loops
In general, a parallel conditional or loop construct without any nestings can be executed using only hardware barriers if two AND-trees are available. One AND-tree (T1) can be used for barriers at the end of constructs, and the other (T2) is dedicated for barriers that occur inside such constructs and in the outer level of code.
parbegin
     some computations ...
(1)  if (C1) then
(2)      computation 1
(3)      while (C2) do
(4)          computation 2
(5)          barrier
(6)          computation 3
(7)      endwhile
(8)      barrier
(9)      computation 4
(10) endif
(11) barrier
     some computations ...
parend
Figure 6. Barriers within nested constructs. Note that each of the three barriers will require a separate barrier tree.
Processors that have fallen out of a construct need to indicate that they will not be participating in barriers occurring inside the construct, so they must set their LOCAL flags on T1 as well as T2. At barriers within constructs, processors must set their LOCAL flags only on T2.

The hardware requirements for supporting barrier synchronization increase when conditionals and loops are nested. These requirements are exemplified by the code in Figure 6. In Figure 6, some processors fall out of the if on line 1, and wait at the barrier on line 11. Since the code within the if has barriers at two different levels (lines 5 and 8), it already requires two AND-trees. Neither of those two trees can be used for the barrier on line 11 without losing information about waiting processors. Hence, a third AND-tree is needed. When loops or conditionals are nested in any combination, one additional AND-tree will be required for every level of nesting.

Providing as many trees as the number of nestings in programs is infeasible for deep levels of nesting. Due to cost constraints, real machines can only provide a small fixed number of barrier trees. The option of using software barriers is available, but since they are an order of magnitude slower than hardware barriers, it is important to avoid their usage. Therefore, we would like to implement all barriers using a limited number of hardware trees.
4 Supporting Nested Barriers in Hardware

In this section, two schemes are proposed to execute nested barriers using only hardware trees and no software barriers. The first scheme involves some code transformations leading to increased code size, and uses just two single-bit-trees (one AND-tree and one OR-tree). It accomplishes this by ensuring that all processors step through every loop body the same number of times. The second scheme does not alter the code size, since it requires no code transformations. However, it uses an integer max-tree, which is more expensive than a single-bit-tree. Both schemes execute all barrier synchronizations in hardware, allowing a much faster execution of data parallel programs than when software barriers are used.
parbegin
(1)  some computations ...
(2)  T1 ← C1
(3)  if (T1) then
(4)      computation 1
(5)  endif
(6)  while (T1 AND C2) do
(7)      computation 2
(8)      barrier
(9)      computation 3
(10) endwhile
(11) barrier
(12) if (T1) then
(13)     computation 4
(14) endif
     barrier
     some computations ...
parend
Figure 7. A flattening of the conditional in Figure 6 such that all barriers nested within it are moved one level outward.
4.1 Using Two Single-Bit-Trees
Every conditional construct can be transformed so that all barriers within it are moved one level outward. The transformation is done as follows: 1. Split the conditional at every barrier and additional nesting level occurring inside it. 2. Store the value of the conditional control expression in a temporary variable. 3. Use the variable as the control expression for all the split conditional constructs. 4. Modify the control expression of all nested constructs to include the temporary variable as an additional condition. By repeated application of this transformation, all barriers within a nested if (with any number of nestings of conditionals) can be moved to the outermost level of code. By these transformations, a parallel program that does not contain any loops can be executed using just one AND-tree. A partial flattening of the code in Figure 6 by transforming the if is shown in Figure 7. Note that T1 is only modified when explicitly assigned the value of C1 before the first conditional, thus maintaining the semantics of the original code.

However, loops cannot be flattened in this manner. This is because barriers inside a loop may be executed many times, each time on a smaller subset of active processors. We now show how barriers within loops can be executed using an AND-tree and one additional single-bit-tree, regardless of the number of nestings in the program. The tree required is a special case of the generic OP-tree defined in Section 2; here n = 1, and the operation is OR. Such a tree is commonly known as an OR-tree. (Note that an AND-tree can be used as an OR-tree by simply reversing the logic of the LOCAL and GLOBAL flags.) Since each processor has LOCAL and GLOBAL flags for each tree, the terms LOCALAND and GLOBALAND are used for the AND-tree flags, and the terms LOCALOR and GLOBALOR for the OR-tree flags.

The while loop shown on lines 6 to 10 of Figure 7 can be transformed to the one shown in Figure 8. The transformation is done as follows: 1. Store the value of the loop control expression in a temporary variable. 2. Use the variable as the control expression for a conditional which prefixes every operation within the loop (except barriers). 3. Recompute the temporary variable as the last operation within the loop. 4. Introduce a barrier after each point where the temporary variable is modified.
parbegin
     some computations ...
(1)  T3 ← (T1 and C2)
(2)  barrier
(3)  while (GLOBALOR(T3) = true) do
(4)      if (T3) then computation 2 endif
(5)      barrier
(6)      if (T3) then computation 3 endif
(7)      if (T3) then T3 ← (T1 and C2) endif
(8)      barrier
(9)  endwhile
(10) barrier
     some computations ...
parend
Figure 8. Transformed while-loop from Figure 7, requiring one AND-tree and one OR-tree for execution.
In the transformed version, no processor ever falls out of the while on line 3 if there is even one other processor which needs to enter the loop. Therefore, several processors may be busy-looping instead of idling during loop execution, but this is not significant if the scheduling policy on the MIMD computer only allows time-sharing by switching the whole machine or a whole partition between jobs (as in gang scheduling, a popular MIMD scheduling policy).

The OR-tree is used to execute the transformed code. Each processor evaluates the GLOBALOR of the control expression by placing its value in the LOCALOR flag. If at least one processor's LOCALOR flag is true, the root of the OR-tree outputs a true. Therefore, the GLOBALOR flags of all processors will be set to true (and they all enter the loop). This scheme is similar to the implementation used in SIMD machines to execute loops, where the front-end uses the OR of the control expressions of all processors to decide how many times to broadcast the instructions within a loop. In the CM-5, the programming model allows only scalar variables to be used in the control expression of a loop, automatically creating loops which only all or none of the processors can enter.

All barrier statements are executed using the AND-tree. When a processor reaches a barrier statement, it sets its LOCALAND flag to true, and waits for its GLOBALAND flag to be true. In the example in Figure 8, all processors reach line 4 together. Processors whose control expressions are false skip computation 2 and reach the barrier on line 5. When all processors whose control expressions are true also reach the barrier, the GLOBALAND flags of all processors are true, and they proceed to line 6. Note that the barrier is fulfilled even by processors which would not have entered the loop. Line 6 is executed similarly to line 4. The control expression is re-evaluated on line 7 only by processors whose control expression is currently true. This is because the semantics of the loop are such that a processor whose control expression is false must never get a chance to recompute it. The barrier is placed on line 8 to ensure that all processors check their OR-tree using the same control expression. The same AND-tree is used for executing the barriers at lines 5 and 10 (which are at different nesting levels).

Even if one loop is nested inside another, by applying this transformation to both loops, the code can be executed using just one AND-tree and one OR-tree. For each level of loop nesting, the control expression evaluated and placed on the OR-tree is different, but one OR-tree suffices since all processors are in the same nesting level at any given time. This transformation applies to any number of nestings in any combination of conditionals and loops.
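A per-processor rendering of the transformed loop of Figure 8 might look as follows in C. Here global_or() and barrier() are hypothetical wrappers around the OR-tree and AND-tree protocols of this scheme (the LOCALOR/GLOBALOR and LOCALAND/GLOBALAND flags), and the computation and control-expression functions are placeholders; none of these names come from the paper or from a real machine's interface.

#include <stdbool.h>

/* Hypothetical tree primitives assumed by this sketch:
 *   global_or(v): place v in LOCALOR and return the resulting GLOBALOR.
 *   barrier():    set LOCALAND and spin until GLOBALAND becomes true.   */
bool global_or(bool v);
void barrier(void);
void computation_2(void);
void computation_3(void);
bool C2(void);                            /* per-processor control expression */

void transformed_loop(bool T1)
{
    bool T3 = T1 && C2();                 /* line 1: latch the loop guard         */
    barrier();                            /* line 2                               */
    while (global_or(T3)) {               /* line 3: loop while ANY guard is true */
        if (T3) computation_2();          /* line 4                               */
        barrier();                        /* line 5                               */
        if (T3) computation_3();          /* line 6                               */
        if (T3) T3 = T1 && C2();          /* line 7: only active processors
                                             re-evaluate the guard                */
        barrier();                        /* line 8                               */
    }                                     /* line 9: endwhile                     */
    barrier();                            /* line 10                              */
}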
parbegin
     some computations ...
(1)  if (C1) then
(2)      computation 1
(3)      while (C2) do
(4)          computation 2
(5)          barrier (Nesting Level 2)
(6)          computation 3
(7)      endwhile
(8)      barrier (Nesting Level 1)
(9)      computation 4
(10) endif
(11) barrier (Nesting Level 0)
     some computations ...
parend
Figure 9. The program in Figure 6, with nested constructs containing barriers at three nesting levels.
The transformation for conditionals applies to those with an else branch, by treating the branch as a separate conditional with a negated control expression. Since all barriers within conditionals are moved to an outer level of code, the branches of each conditional will be partially or fully sequentialized. Even if two branches containing barriers are shown to have no data dependencies across them, there will be at least a partial sequentialization of the branches since only one barrier tree is available in this scheme.
4.2 Using an Integer-Tree
A scheme for executing nested barrier synchronizations in hardware without any code transformations is presented in this section. The scheme uses a max-tree, which is a special case of the generic OP-tree shown in Figure 2, with n-bit wide edges and the OP being integer maximum. The scheme also requires the use of one n-bit integer counter at each of the processors.

Consider the example in Figure 9, which has implicit barriers at three levels of nesting, labeled 0, 1, and 2. If these barriers were implemented in software, labelling the synchronizing messages or shared semaphores with these numbers would suffice to distinguish one barrier from another. In other words, as many logical "trees" as required by the application can be easily constructed in software. To run this example, a naive implementation that uses one hardware tree for every logical tree would require three AND-trees (for three levels of barriers). By extension, it is necessary to provide as many AND-trees in hardware as there are nested barriers in the application. However, it is only possible to construct a limited number of hardware barrier trees. Consequently, such an implementation is infeasible.

A scheme that implements 2^n − 1 logical barrier trees using one n-bit max-tree is now presented. This scheme is based on the following key observation: barriers at an inner level of nesting must always complete before a barrier at an outer level (2 before 1 before 0). If each processor keeps track of its NestingLevel using its integer counter, one max-tree is sufficient to execute all nested barriers. The processor increments its counter at every nesting (loop or conditional) it encounters, and decrements it when it leaves a nesting. When it reaches a barrier, it knows its nesting level and only proceeds beyond the barrier when the value returned by the max-tree is equal to its nesting level. The processor may either poll the tree or be interrupted when the tree returns an appropriate value. The barrier synchronization algorithm using a max-tree is shown in Figure 10. The working of the algorithm is illustrated with the following example. Consider the case of four processors, P1, P2, P3, and P4, executing the example in Figure 9.
Algorithm BarrierSynchronize
parbegin
    LOCAL ← ∞
    NestingLevel ← 0
    while (program has not ended) do
        fetch next instruction
        if (instruction is conditional or loop) then
            NestingLevel ← NestingLevel + 1
        endif
        if (instruction is end-conditional or end-loop) then
            NestingLevel ← NestingLevel − 1
        endif
        if (instruction is barrier) then
            LOCAL ← NestingLevel
            wait until GLOBAL = LOCAL
            LOCAL ← ∞
        endif
    endwhile
    LOCAL ← 0
parend
Figure 10. Algorithm for barrier synchronization using the max-tree.
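A C sketch of the same per-processor algorithm is given below. The functions max_tree_put() and max_tree_get() are hypothetical accessors for this processor's LOCAL flag and the GLOBAL value returned by the n-bit max-tree, and INFINITY_FLAG stands for the largest n-bit integer, 2^n − 1, written as ∞ in Figure 10; the interface is our own invention for illustration.

#include <stdint.h>

#define INFINITY_FLAG UINT32_MAX     /* the largest n-bit value, 2^n - 1, for n = 32 */

/* Hypothetical max-tree interface (not part of any real machine's API). */
void     max_tree_put(uint32_t local);   /* write this processor's LOCAL flag       */
uint32_t max_tree_get(void);             /* read GLOBAL, the max of all LOCAL flags */

static uint32_t nesting_level = 0;

void enter_construct(void) { nesting_level++; }   /* conditional or loop         */
void leave_construct(void) { nesting_level--; }   /* end-conditional or end-loop */

/* Barrier at the current nesting level (Figure 10): advertise the level and
 * wait until the deepest outstanding level in the machine equals it.        */
void nested_barrier(void)
{
    max_tree_put(nesting_level);
    while (max_tree_get() != nesting_level)
        ;                                /* poll; could also be interrupt-driven */
    max_tree_put(INFINITY_FLAG);         /* default value between barriers       */
}

void program_done(void)
{
    max_tree_put(0);                     /* do not block processors still running */
}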
Initially, all processors have their NestingLevel equal to 0, and their LOCAL flags on the max-tree have been initialized to ∞.

Levels = ⟨0, 0, 0, 0⟩, Flags = ⟨∞, ∞, ∞, ∞⟩, Max = ∞

At line 1, say processor P1 evaluates C1 to false and falls to the endif on line 10. It puts its NestingLevel, 0, on the max-tree and waits at the barrier on line 11. Processors P2, P3, and P4 increment their NestingLevels to 1, perform computation 1 and reach line 3.

Levels = ⟨0, 1, 1, 1⟩, Flags = ⟨0, ∞, ∞, ∞⟩, Max = ∞

Say P2 evaluates C2 to false and falls to the endwhile on line 7. It puts a 1 on the max-tree and waits. Processors P3 and P4 increment their NestingLevels to 2 and reach line 4.

Levels = ⟨0, 1, 2, 2⟩, Flags = ⟨0, 1, ∞, ∞⟩, Max = ∞

After performing computation 2, they reach line 5 (possibly at different times). When they have both put a 2 on the max-tree, the tree returns a 2.

Levels = ⟨0, 1, 2, 2⟩, Flags = ⟨0, 1, 2, 2⟩, Max = 2

P3 and P4 cross the barrier and clear the values they had placed on the max-tree (by setting them to ∞).

Levels = ⟨0, 1, 2, 2⟩, Flags = ⟨0, 1, ∞, ∞⟩, Max = ∞

P1 and P2 continue waiting, since they are waiting for the tree to return a 0 and a 1 respectively. Then P3 and P4 perform computation 3, loop back to line 3 and evaluate C2 again. Say P3 evaluates it to false and falls out to the endwhile on line 7. It decrements its NestingLevel to 1, reaches the barrier on line 8, and puts a 1 on the max-tree.

Levels = ⟨0, 1, 1, 2⟩, Flags = ⟨0, 1, 1, ∞⟩, Max = ∞

P4 passes through to line 5 and puts a 2 on the max-tree. The tree returns a 2 immediately, which is ignored by processors P1, P2, and P3.

Levels = ⟨0, 1, 1, 2⟩, Flags = ⟨0, 1, 1, 2⟩, Max = 2

P4 clears its tree value and proceeds back to line 3.

Levels = ⟨0, 1, 1, 2⟩, Flags = ⟨0, 1, 1, ∞⟩, Max = ∞

When it finally evaluates C2 to false, it falls out to line 7 and decrements its NestingLevel to 1. On line 8, it puts this value on the max-tree, and the tree returns a 1.

Levels = ⟨0, 1, 1, 1⟩, Flags = ⟨0, 1, 1, 1⟩, Max = 1

Now processors P2, P3, and P4 cross this barrier and do computation 4 on line 9.

Levels = ⟨0, 1, 1, 1⟩, Flags = ⟨0, ∞, ∞, ∞⟩, Max = ∞

They reach the endif on line 10 (possibly at different times) and decrement their NestingLevels to 0.

Levels = ⟨0, 0, 0, 0⟩, Flags = ⟨0, ∞, ∞, ∞⟩, Max = ∞

At the barrier, they each put a 0 on the max-tree, and the tree returns a 0.

Levels = ⟨0, 0, 0, 0⟩, Flags = ⟨0, 0, 0, 0⟩, Max = 0

This lets all the processors (including P1) cross the barrier and continue execution.

Levels = ⟨0, 0, 0, 0⟩, Flags = ⟨∞, ∞, ∞, ∞⟩, Max = ∞
The details of the implementation are described in the algorithm in Figure 10. The outermost level of code is assumed to be at nesting level 0. During program execution, the default value placed in the LOCAL flag by each processor is the largest n-bit integer, 2^n − 1. This is done so that the GLOBAL flag values returned by the max-tree are larger than the LOCAL flags of any processors that are waiting at barriers, until all processors are at barriers. In Figure 10, the symbol ∞ is used to represent this number. When a processor finishes executing its program, it sets its LOCAL flag to 0 to avoid interfering with the completion of the program on the other processors.

This scheme also correctly executes all conditionals with else branches.
First consider else branches which have barriers logically independent from the barriers of the corresponding then branches. In such conditionals, a parallel execution of the two branches should always produce the same results. The barriers of the two branches are considered to be at the same nesting level l. Although the barriers are logically independent, they are executed on the same tree. Consider a then branch with a barriers and a corresponding else branch with b barriers. Without loss of generality, let a ≤ b. The a barriers of the then are executed by aligning one-to-one with the first a barriers of the else. The next b − a barriers of the else are executed after the then branch processors fall to the end of the conditional, and wait at the barrier at nesting level l − 1.

Now consider else branches which need to be run after the corresponding then branches have been executed (because there are data dependencies across the branches). In this case, if the outer level of nesting is l − 1, the nesting level assigned to the then branch is l + 1 and the nesting level of the else is l. The first barrier of the else will be completed only after all the then barriers have completed and the processors of the then branch have fallen out to the barrier at nesting level l − 1. This ensures a correct sequential execution of the two branches.
5 Multiple Disjoint Barrier Synchronizations

A barrier tree efficiently implements barrier synchronizations among all processors of the machine. However, an across-the-board barrier synchronization is not always desired. Suppose we want several jobs to run on the machine simultaneously, by partitioning the processors among the various jobs. In such a case, each processor must not synchronize with all processors in the machine, but only with the processors in its own partition; otherwise, a job may not only waste time aligning with another job unnecessarily, but may also deadlock (say, if another job does not encounter any more barrier instructions). The problem is to devise a network that efficiently implements such multiple disjoint barrier synchronizations (MDBSs).

One proposed solution, which has been implemented in the CM-5 and the T3D, involves splitting the barrier synchronization tree into several disjoint subtrees, and allocating to each job the processors at the leaves of one of the subtrees. For example, consider a machine with 8 processors, numbered from 0 to 7, running left to right with respect to the barrier tree (see Figure 11). Suppose there are three jobs, requiring 4, 2, and 2 processors. The solution is to reconfigure some nodes of the barrier tree so that the output of that node's AND-gate is connected directly to the input of that node's duplicator, thereby bypassing higher levels of the tree. This has the effect of splitting the barrier synchronization tree into three subtrees, one whose leaves are processors 0 to 3, another whose leaves are processors 4 and 5, and a third whose leaves are processors 6 and 7. Let us allocate processors 0 to 3 to the first job, processors 4 and 5 to the second job, and processors 6 and 7 to the third job. Now, all three jobs may perform barrier synchronizations in parallel, without interfering with each other.

However, the above solution enforces stringent constraints on the size and shape of partitions: If the number of processors required by jobs are not all powers of 2 (for example, if the first job requires 6 processors, and the second job requires 2 processors), then it is impossible to partition the machine such that each job uses a disjoint subtree of the barrier tree. Apart from the barrier tree, parallel computers also have a data network. Due to the high costs of communication, it is often desirable that the processors allocated to a job be contiguous in the data network. However, the above scheme requires the processors allocated to a job to be contiguous in the barrier tree. Unless the data network is also tree-shaped, these two contiguity requirements may work at cross purposes. This problem was encountered by the designers of the T3D, where the data network is a 3-dimensional torus, onto which a tree does not map naturally.
Figure 11. A barrier tree could be modified to handle multiple disjoint barrier synchronizations if all partitions satisfy the following condition: the processors of each partition form the leaves of some subtree of the barrier tree.
We present an MDBS network that performs multiple disjoint barrier synchronizations without enforcing the above stringent constraints on machine partitions. Specifically, we allow any number of partitions of any size and shape, as long as the processors within a partition are contiguous in the data network.

The following functionality is desired of an MDBS network. As before, we postulate that each processor has a READY flag and a GO flag. When a processor arrives at a barrier instruction in its program, it sets its READY flag. Then it keeps testing its GO flag. As soon as the processor finds its GO flag set, it clears its READY flag and carries on with its next instruction. The MDBS network should ensure that no processor crosses the barrier instruction until all processors belonging to the same partition have reached the barrier instruction. As mentioned before, it is assumed that the processors comprising each partition are contiguous in the data network.

Recall that the problem with using barrier trees to perform MDBSs is that the tree topology may not agree with the topology of the data network. The key insight into alleviating this disagreement is to endow the MDBS network with the same topology as the data network. Hence, the processors of each partition, being contiguous in the data network, will also be contiguous in the MDBS network. Therefore, one can find trees within the MDBS network that span the processors of each partition. This is illustrated in Figure 12. In this toy parallel machine, the data network is mesh-shaped, so the barrier synchronization network is also designed in the shape of a mesh. The dotted lines enclose three partitions. The bold edges form three disjoint trees that span the processors of each partition. Using these trees (in a fashion similar to the way described in the previous section) the processors of the various partitions can barrier synchronize without getting in one another's way.

It is assumed that when the operating system allocates a partition, it configures a tree that spans all the processors of that partition. This tree does not change during the lifetime of the partition. When the operating system frees the partition, the tree is dissolved. What does it mean to configure and dissolve trees? If a processor has d neighbors, then it has two registers of d bits each, one bit for each neighbor. One register is called the child register, and has a 0 in precisely those positions where it has a child, and 1 in all other positions. The other is called the parent register, and has a 0 in the precise position where its parent resides, and a 1 in all other positions.
Figure 12. The idea underlying the design of a network for multiple disjoint barrier synchronizations. For each partition, a spanning tree is configured, which is used to perform barrier synchronizations within that partition.
If a certain neighbor is not in the same partition, then both the child and parent registers show a 1 in that place. As an example, refer to Figure 12. Processor X has four neighbors, called A, B, C, and D. Therefore, both X's child and parent registers have 4 bits each. Now, processors A and D are X's children; C is X's parent; and B is not even in X's partition. Therefore the contents of X's child register is ⟨0110⟩, and the contents of X's parent register is ⟨1101⟩.

By saying that the operating system configures or dissolves trees, we merely mean that the operating system sets the child and parent registers with appropriate values. A discussion of how the operating system may configure and dissolve trees efficiently is deferred to the next section. In this section, we consider how an existing tree enables the processors of a partition to barrier synchronize.

Recall that the barrier tree described in Section 2 consists of two superposed trees called the reduction and broadcast trees. Similarly, the MDBS network of this section consists of two superposed networks called the reduction and broadcast networks, each of them topologically identical to the data network.

The reduction phase is performed as follows: Each processor receives a 1-bit input from each of its d neighbors through the reduction network. The input bits are filtered using the child register as a mask: input bits coming from children processors are passed through unchanged, whereas all other input bits are forced to 1. All the filtered input bits are ANDed together, and the answer is ANDed with the processor's own READY flag. Effectively, all the inputs coming from the children, along with the current processor's READY flag, are ANDed together, ignoring inputs from all other neighbors. The final 1-bit result is sent to all the neighboring processors through the reduction network. (They, of course, will ignore the bit if the sending processor is not their child.) The whole process is illustrated in the left half of Figure 13.

The broadcast phase is performed as follows: Each processor receives a 1-bit input from each of its d neighbors through the broadcast network. The input bits are filtered using the parent register as a mask: the input bit coming from the parent processor is passed through unchanged, whereas all other input bits are forced to 1. All the filtered input bits are ANDed together. The effect is to select the input bit that came from the parent processor, ignoring inputs from all other neighbors.
Figure 13. This figure shows the circuitry at each node of the MDBS network. Several such nodes are connected together, in exactly the same topology as the data network.
The final 1-bit result is written into the processor's own GO flag, and is also sent to all the neighboring processors through the broadcast network. (They, of course, will ignore the bit if the sending processor is not their parent.) The process is illustrated in the right half of Figure 13.

Of course, if the data network topology already matches the barrier tree topology, an MDBS network is not required, since natural partition sizes based on the data network will satisfy the requirement of being on a disjoint subtree of the barrier tree. In most cases however, since the data network tends to be a denser network than a tree, this scheme does add significantly to the cost of building a barrier network.

There is one detail left to be filled in: The root processor must connect the output of the reduction tree to the input of the broadcast tree. This requires some extra logic, which is not depicted in Figure 13. It suffices to note that a processor may identify itself as the root by the fact that it has no parent, that is, its parent register is all 1s.
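The per-node filtering just described can be written down as a small C model. The child and parent registers follow the convention of this section (a 0 bit marks a child or the parent, a 1 bit everything else), while the structure layout and function names are ours, chosen only for this sketch.

#include <stdint.h>

/* One MDBS node with up to 32 neighbors (d <= 32 in this sketch). */
struct mdbs_node {
    uint32_t child_reg;    /* 0 in the positions of children      */
    uint32_t parent_reg;   /* 0 in the position of the parent     */
    int      d;            /* number of neighbors                 */
    int      ready, go;    /* this processor's READY and GO flags */
};

static uint32_t low_bits(int d)          /* mask of the d neighbor positions */
{
    return (d >= 32) ? ~0u : ((1u << d) - 1u);
}

/* Reduction phase: AND the inputs arriving from children with this node's
 * READY flag; inputs from non-children are forced to 1 by the child mask. */
int mdbs_reduce(const struct mdbs_node *n, uint32_t in_bits)
{
    uint32_t filtered = in_bits | n->child_reg;          /* non-children -> 1 */
    int and_of_inputs = ((filtered & low_bits(n->d)) == low_bits(n->d));
    return and_of_inputs && n->ready;    /* sent to all d neighbors           */
}

/* Broadcast phase: select the bit arriving from the parent (all others are
 * forced to 1 by the parent mask), latch it into GO, and forward it.       */
int mdbs_broadcast(struct mdbs_node *n, uint32_t in_bits)
{
    uint32_t filtered = in_bits | n->parent_reg;
    n->go = ((filtered & low_bits(n->d)) == low_bits(n->d));
    return n->go;                        /* forwarded to all d neighbors      */
}

The extra root logic mentioned above (feeding the reduction result into the broadcast input when the parent register is all 1s) is omitted from the sketch.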
6 Configuring and Dissolving Trees

As mentioned in the previous section, when the operating system allocates a partition, it must configure a tree that spans all the processors of that partition. Similarly, the operating system must dissolve the tree when the partition is freed. The configuration process is essentially a breadth-first search, with in-order delivery of configuration messages being the safeguard against race conditions. The details are explained below.

Assume that each processor has a register called Partition ID, which gives the partition that processor belongs to. Thus, all processors with the same value in Partition ID belong to the same partition. Allocating and freeing partitions involves setting the Partition ID registers appropriately. We assume that the operating system takes care of this.
Algorithm ConfigureTree
for all nodes dopar
    if (the operating system designated you as the root)
        send yourself a REQ
    endif
    for i ← 1 to (2 × degree) do
        Wait for a signal
        if (the signal is a REQ) then
            if (this is the first REQ from a node in your partition) then
                update parent register to indicate that sender is your parent
                reply with a POS-ACK
                send a REQ to all your neighbors
            else
                reply with a NEG-ACK
            endif
        else /* the signal is a POS-ACK or NEG-ACK */
            update child register to indicate that sender is, or is not, your child
        endif
    endfor
endfor
Figure 14. Algorithm for configuring a synchronization tree.
Given that a partition has been allocated, the question remains how a barrier synchronization tree may be configured and dissolved. If partitions are static, i.e., they are created at boot time by the system administrator and never change thereafter until the machine is brought down, then configuring spanning trees is not a problem. The spanning trees can be computed off-line and stored in a configuration file. While booting, the trees are simply read from the file, and the various child and parent registers are set accordingly.

Configuring spanning trees becomes more interesting if partitions are dynamic, i.e., they change as jobs come and go. We now present an algorithm to do so. The idea behind the algorithm is to start at one node of the partition, and grow the tree larger and larger, rooted at this node. Specifically: 1. Designate any one node of the partition as the root. 2. Conduct a breadth-first search (BFS) starting at the root, but ignore exploring all neighbors that are in a different partition. 3. The BFS results in the formation of a breadth-first tree [3, page 475], which spans all processors of the partition, as desired.

It only remains to show how to implement BFS in parallel. The complication: No synchronizations are allowed in the implementation, since a barrier tree (whose construction is the very aim of the BFS) is not yet available. Thus race conditions are inevitable. The implementation must guarantee that no matter how the races are won, a valid breadth-first tree is found. The details of the implementation are shown in Figure 14. Three kinds of signals, REQ, POS-ACK, and NEG-ACK, are exchanged between nodes. When a node X sends a REQ to a neighboring node Y, then X is requesting to be Y's parent in the tree. In turn, Y may acquiesce by replying with a POS-ACK, or reject the offer by replying with a NEG-ACK.
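For the static case mentioned above, the spanning tree of each partition can be computed off-line by an ordinary sequential breadth-first search restricted to the partition; the resulting parent relation is exactly what the operating system would load into the child and parent registers at boot time. The C sketch below does this for a partition described by an adjacency list; the data layout and size limits are assumptions made for the sketch only.

#include <string.h>

#define MAXP 1024                 /* maximum number of processors in this sketch */
#define MAXDEG 6                  /* maximum degree of the data network          */

int adj[MAXP][MAXDEG], deg[MAXP]; /* data-network adjacency lists                */
int in_part[MAXP];                /* nonzero iff the node is in this partition   */
int parent[MAXP];                 /* -1 for the root and for nodes outside it    */

/* Sequential BFS from 'root', ignoring neighbors outside the partition.
 * parent[] then defines the breadth-first spanning tree whose edges the OS
 * would encode into each node's child and parent registers.                 */
void configure_tree_offline(int root, int num_nodes)
{
    int queue[MAXP], head = 0, tail = 0, visited[MAXP];
    memset(visited, 0, sizeof(visited));
    for (int u = 0; u < num_nodes; u++)
        parent[u] = -1;

    visited[root] = 1;
    queue[tail++] = root;
    while (head < tail) {
        int u = queue[head++];
        for (int k = 0; k < deg[u]; k++) {
            int v = adj[u][k];
            if (in_part[v] && !visited[v]) {   /* skip other partitions */
                visited[v] = 1;
                parent[v] = u;                 /* u becomes v's parent  */
                queue[tail++] = v;
            }
        }
    }
}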
We have shown how to configure spanning trees. The mechanics of dissolving a tree are trivial, involving nothing more than clearing the child and parent registers.

Once configured, the first difference between the spanning tree of an MDBS network and a traditional barrier tree is that processors are located at all nodes of the spanning tree instead of only at the leaves. For example, in the max-tree, each node would now compute the max of all its children's LOCAL flags as well as the flag of the processor located at the node itself. This difference does not affect the performance of the schemes presented in Section 4. Secondly, it is apparent that the spanning tree need not be a balanced or binary tree, while a traditional barrier tree is usually built with those properties. This means the latency of the max-tree could be as bad as Θ(p log n) gate delays, which is still insignificant compared to the latency of software barriers. Therefore, nested barriers can be efficiently supported on an MDBS network by applying the same schemes.
7 Summary

In this paper, we identified an often overlooked problem of efficiently implementing nested barrier synchronizations when executing data parallel programs asynchronously on MIMD machines. Two efficient hardware techniques for implementing nested barrier synchronizations were presented. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. The second scheme uses an integer-tree to support an exponential number of nesting levels.

One problem with the scheme using two single-bit-trees is that the code size increases significantly due to the compiler transformations. It is common for the transformed code to be two or three times as large as the original code. This also significantly increases the running time of the application. Another disadvantage is that the scheme works by making all processors busy-loop through every iteration of a loop. (A processor fetches the instructions inside the loop even if it has already determined that it should not execute the loop.) However, this disadvantage is not significant if the MIMD computer is gang scheduled. In gang scheduling, an idle processor cannot be timed out for use by any other job, so the busy-looping does not waste any useful machine resources.

The scheme using a max-tree does not rely on any code transformations for execution. Therefore, the code size is not increased as in the earlier scheme. This is traded off against the fact that a max-tree can be much more expensive to implement in hardware than two single-bit-trees, if n is large. However, since an n-bit max-tree can support 2^n − 1 nested barriers simultaneously, small values of n are sufficient in general. For example, the CM-5, which provides a 32-bit max-tree in hardware as part of its control network [11], could support 4 billion levels of nesting (even though the max-tree is not currently used in this manner).

The above schemes can be applied to the multiple disjoint barrier synchronization network which we presented, where processors of each partition barrier synchronize among themselves, without interfering with other partitions. Any number of partitions, of any size and shape, are allowed, as long as the processors of each partition are contiguous in the data network.
Acknowledgements
We acknowledge the help we received from R. Kent Koeninger (Cray Research) and Bradley C. Kuszmaul (MIT) in understanding the details of the T3D and CM-5 barrier implementations. We would like to thank Cadence Design Systems and PMC-Sierra for their support in conducting this research.
References

[1] A. Agarwal and M. Cherian. Adaptive backoff synchronization techniques. In 16th Annual International Symposium on Computer Architecture, pages 396-406, 1989.

[2] W. E. Cohen, H. G. Dietz, and J. B. Sponaugle. Dynamic barrier architecture for multi-mode fine grain parallelism using conventional processors. In 23rd Annual International Conference on Parallel Processing, pages 93-96, August 1994.

[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.

[4] Cray Research, Inc., Eagan, MN. Cray T3D System Architecture Overview Manual, 1993.

[5] R. Gupta. The fuzzy barrier: A mechanism for high speed synchronization of processors. In Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 54-63, 1989.

[6] R. Gupta and M. Epstein. Achieving low cost synchronization in a multiprocessor system. Future Generation Computer Systems, 6:255-269, December 1990.

[7] R. Gupta and C. R. Hill. A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming, 18:161-180, 1989.

[8] P. J. Hatcher, A. J. Lapadula, M. J. Quinn, and R. J. Anderson. Compiling data parallel programs for MIMD architectures. In W. Joosen and E. Milgrom, editors, Parallel Computing: From Theory to Sound Practice, Proceedings of EWPC '92, the European Workshops on Parallel Computing, pages 28-39. IOS Press, Amsterdam, Netherlands, 1992.

[9] D. Johnson, D. Lilja, J. Riedl, and J. Anderson. Low-cost, high-performance barrier synchronization on networks of workstations. The Journal of Parallel and Distributed Computing, 40, January 1997.

[10] L. Kontothanassis and R. Wisniewski. Using scheduler information to achieve optimal barrier synchronization performance. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, volume 28, pages 64-72, 1993.

[11] C. E. Leiserson, et al. The network architecture of the Connection Machine CM-5. In Symposium on Parallel Algorithms and Architectures, pages 272-285, June 1992.

[12] E. Markatos, M. Crovella, and P. Das. The effects of multiprogramming on barrier synchronization. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 662-669, December 1991.

[13] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991.

[14] J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 269-278, April 1991.

[15] Wilfried Oed. The Cray Research Massively Parallel Processor System CRAY T3D. Available by anonymous ftp from ftp.cray.com, November 1993.

[16] V. Ramakrishnan, I. D. Scherson, and R. Subramanian. Efficient techniques for fast nested barrier synchronization. In Symposium on Parallel Algorithms and Architectures, pages 157-164, July 1995.
[17] Gary Sabot. The Paralation Model: Architecture-Independent Parallel Programming. MIT Press, Cambridge, MA, 1988.

[18] Guy L. Steele, Jr. and W. Daniel Hillis. Connection Machine LISP: Fine-grained parallel symbolic processing. In 1986 ACM Conference on Lisp and Functional Programming, pages 279-297, August 1986.

[19] Thinking Machines Corporation, Cambridge, MA. The Connection Machine CM-5 Technical Summary, October 1991.

[20] H. Xu, P. K. McKinley, and L. M. Ni. Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers. The Journal of Parallel and Distributed Computing, 16:172-184, 1992.