Data Flow Equations for Explicitly Parallel Programs
Dirk Grunwald and Harini Srinivasan
Department of Computer Science
University of Colorado at Boulder
Campus Box 430
Boulder, CO 80309
CU-CS-605-92                                 July 1992
Copyright (c) 1992 by Dirk Grunwald and Harini Srinivasan, Department of Computer Science, University of Colorado at Boulder, Campus Box 430, Boulder, CO 80309
Data Flow Equations for Explicitly Parallel Programs

Dirk Grunwald and Harini Srinivasan
Department of Computer Science
University of Colorado at Boulder
Campus Box 430
Boulder, CO 80309

July 1992
Abstract

We have extended the standard monotone data flow system for the reaching definitions problem to accommodate explicitly parallel programs; this information is used in many standard optimization problems. This paper considers the parallel sections construct, both with and without explicit synchronization; a future paper considers the parallel do construct. Although work has been done on analyzing parallel programs to detect data races, little work has been done on optimizing such programs; the equations in this paper should form the basis for extensive work on optimization.
1 Introduction

In this paper, we describe a data flow framework [1] for computing the reaching definition information in parallel programs, with a view to being able to optimize such programs. It is desirable to be able to perform classical code optimization on parallel programs for two main reasons: first, parallel programming and the use of explicitly parallel constructs is becoming common in scientific and numerical programs. Second, we are interested in being able to achieve good execution rates for such programs on high performance architectures. Though extensive work has been done on analyzing parallel programs for potential races [6, 4, 2], little work has been done on analyzing parallel programs with an aim of being able to perform code optimizations. Midkiff and Padua [7] point out the difficulties in optimizing explicitly parallel programs. Our work focuses on developing an intermediate representation for optimizing parallel programs. Reaching definition information, i.e., the set of definitions reaching each use of a variable in a program, is vital for various code optimizations, including constant propagation, induction variable analysis, common subexpression elimination, and dead code elimination.
This work was supported in part by the National Science Foundation under NSF Grant CCR-9010624.
(a) Sequential Program:
(1) j = 0
(1) k = 1
(2) loop
(3)   if (condition) then
(4)     j = j + 1
      else
(5)     k = 5
(6)   endif
(6)   l = k + 4
(7) endloop

(b) Parallel Program:
(1) j = 0
(1) k = 1
(2) loop
(3)   Parallel Sections
(4)   Section A
(4)     j = j + 1
(5)   Section B
(5)     k = 5
(6)   End Parallel Sections
(6)   l = k + 4
(7) endloop
Figure 1: Example sequential and parallel programs.
Most commercial compilers use the bit vector intermediate representation, and we feel the extensions presented in this paper can be easily incorporated in compilers to analyze explicitly parallel programs. In our work we consider the parallel extensions to FORTRAN as specified by the Parallel Computing Forum [9], which is the basis of the ANSI committee X3H5 standardization effort.

As mentioned earlier, the performance of parallel programs on existing and future high performance architectures depends to a great extent on the ability to perform aggressive code optimizations, particularly scalar optimizations across parallel constructs. Most of the existing compilers for parallel programs do not perform scalar optimizations across parallel constructs. Instead, they restrict optimizations to specific sequential sections of code in the parallel program. Consider the sequential and parallel programs in Figure 1; these two programs have very similar control flow structures. The variable `j' in Figure 1(a) is not an induction variable, because the if .. then may not be executed for each iteration of the loop. However, in the parallel program, `j' is an induction variable, since both branches of the Parallel Sections statement always execute for all iterations of the loop; this could not be automatically detected without adequate data flow information. Detecting such induction variables is useful for strength reduction, data dependence analysis and other optimizations. Likewise, data flow information would show that the variable `k' has the value 5 at the end of the parallel construct during each iteration.

In Section 2, we explain data flow analysis and how the reaching definition information is computed in sequential programs. Section 3 describes the parallel constructs and the semantics of the parallel constructs considered in this paper. In Section 4, we describe data structures to represent such constructs.
Figure 2: Control Flow Graph for the sequential program in Figure 1.
Section 5 extends the data flow equations for computing reaching definition information in sequential programs to handle parallel constructs, and in Section 6 we extend these equations to handle synchronization in the form of post/wait statements.
2 Global Data Flow Analysis

The problem of global data flow analysis can be explained as follows [8]: given the control flow structure, we must discern the nature of the data flow (which definitions of program quantities can affect which uses) within the program. Data flow problems are often posed as a system of equations based on the Control Flow Graph of the program. A Control Flow Graph, CFG, of a program is a directed graph <V, E, V0>, where V is the set of vertices representing basic blocks in the program, E is the set of edges representing flow of control in the program, and V0 is the unique node representing the entry into the program. We say a node P is the predecessor of node Q if there is an edge in the CFG from P to Q. For each vertex in the CFG, we define some basic attributes, which can be defined unambiguously from an analysis of the program. Then we define inherited and synthesized attributes in a set of data flow equations, and solve these equations.

This section discusses the data flow equations to solve the reaching definitions problem in sequential programs [1]. The reaching definitions problem is to find the set of definitions of a variable `v' that can reach a particular use of `v'. This is also referred to as the ud-chaining problem in the literature. In the later sections of the paper, we explain how these equations can be extended to solve the reaching definitions problem across parallel constructs and synchronization constructs in explicitly parallel programs.
Node  | Gen      | Kill     | In (iteration 1)      | Out (iteration 1)     | In (iteration 2)      | Out (iteration 2)
Entry | {}       | {}       | {}                    | {}                    | {}                    | {}
(1)   | {j1, k1} | {j4, k5} | {}                    | {j1, k1}              | {}                    | {j1, k1}
(2)   | {}       | {}       | {j1, k1}              | {j1, k1}              | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}
(3)   | {}       | {}       | {j1, k1}              | {j1, k1}              | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}
(4)   | {j4}     | {j1}     | {j1, k1}              | {j4, k1}              | {j1, j4, k1, k5, l6}  | {j4, k1, k5, l6}
(5)   | {k5}     | {k1}     | {j1, k1}              | {j1, k5}              | {j1, j4, k1, k5, l6}  | {j1, j4, k5, l6}
(6)   | {l6}     | {}       | {j1, j4, k1, k5}      | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}
(7)   | {}       | {}       | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}
Exit  | {}       | {}       | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}  | {j1, j4, k1, k5, l6}
Table 1: Table showing two iterations of the data flow equations to solve the reaching definitions problem for the sequential program in Figure 1(a).
2.1 Reaching Definitions

We say a definition of a variable `v' reaches a point p in the program if there is a path in the CFG from that definition to p, such that no other definitions of `v' appear on the path. To determine the definitions that can reach a given point in a program, we first assign a distinct number to each definition. Our problem is to find, for each node n of the CFG, In(n), the set of definitions that reach the beginning of n. Formally, a definition d of a variable name `v' reaches a node n if there is a path n1, n2, ..., nk, n in the flow graph such that:

1. d is within n1,
2. d is not subsequently killed in n1 (i.e., `v' is not redefined), and
3. d is not killed in any of n2, ..., nk.
One way of calculating In(n) is to determine all generated definitions and then to propagate each definition from the point of generation to n. An easy way of doing this is to solve the following set of 2N simultaneous equations for a CFG of N nodes:

    Out(n) = (In(n) - Kill(n)) ∪ Gen(n)
    In(n)  = ∪_{p ∈ pred(n)} Out(p)

Out(n) is similar to In(n) but pertains to the point immediately after the end of the basic block. Kill(n) is the set of definitions outside of n that define variables that also have definitions within n, and Gen(n) is the set of definitions generated within n that reach the end of n. We are interested in the smallest solution possible for In, which is why we start with In as the empty set for all n. The algorithm that computes the In sets starts with this initial approximation and iterates through the above set of equations until a fixpoint is reached. This particular set of equations and the iterative algorithm form a monotone data flow system. In such a system, the order of traversal of the CFG only affects the convergence rate of the different sets to their fixpoint. It has been proven that a depth-first traversal of the CFG helps reduce the number of iterations to five in most practical cases [1].

The CFG for the sequential program in Figure 1 is given in Figure 2. Variable `j' is defined at nodes (1) and (4); call these j1 and j4 respectively. The reaching definitions for the use of `j' at node (6) are j1 and j4. The In, Out, Kill and Gen sets for the different nodes are given in Table 1. This table shows two iterations of the data flow equations; the third iteration is the same as the second, indicating that a fixpoint has been reached. This example illustrates how the In and Out sets are computed for sequential programs, given the Gen and Kill sets. In the following sections, we derive an analogous procedure for explicitly parallel programs.
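The iterative algorithm described above is small enough to state directly. The following Python sketch is ours, not part of the original report; the function and variable names (solve_reaching_definitions, pred, gen, kill) are invented, and the predecessor lists used in the example are our reading of the CFG in Figure 2.

    # Minimal sketch of the iterative reaching-definitions solver.
    def solve_reaching_definitions(nodes, pred, gen, kill):
        # Start with the smallest approximation: all In and Out sets empty.
        In = {n: set() for n in nodes}
        Out = {n: set() for n in nodes}
        changed = True
        while changed:                          # iterate until a fixpoint is reached
            changed = False
            for n in nodes:                     # ideally visited in depth-first order
                new_in = set()
                for p in pred[n]:
                    new_in |= Out[p]            # In(n) = union of Out(p) over predecessors
                new_out = (new_in - kill[n]) | gen[n]   # Out(n) = (In(n) - Kill(n)) u Gen(n)
                if new_in != In[n] or new_out != Out[n]:
                    In[n], Out[n], changed = new_in, new_out, True
        return In, Out

    # Example: the sequential program of Figure 1(a); edges are our reading of Figure 2.
    nodes = ['Entry', 1, 2, 3, 4, 5, 6, 7, 'Exit']
    pred = {'Entry': [], 1: ['Entry'], 2: [1, 7], 3: [2], 4: [3], 5: [3],
            6: [4, 5], 7: [6], 'Exit': [7]}
    gen = {n: set() for n in nodes}
    kill = {n: set() for n in nodes}
    gen[1], kill[1] = {'j1', 'k1'}, {'j4', 'k5'}
    gen[4], kill[4] = {'j4'}, {'j1'}
    gen[5], kill[5] = {'k5'}, {'k1'}
    gen[6] = {'l6'}
    In, Out = solve_reaching_definitions(nodes, pred, gen, kill)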
3 Parallel Constructs

In this paper, we only consider the Parallel Sections construct [9]. The Parallel Sections construct is used to specify parallel execution of explicitly identified sections of code. Each section of code is interpreted as a parallel thread, and must be data independent except where an appropriate synchronization mechanism is used. The Parallel Sections construct can also be nested, appear in the body of a loop, and so on.

We consider synchronization between threads in the form of event synchronization, described by a binary event variable. Operations are available to indicate that an event has occurred (post), to ensure that an event has occurred (wait), and to indicate that an event has not occurred (clear).
(Entry) event( ev )
(Entry) x = 2
(Entry) y = 5
(1)  loop
(2)    Parallel Sections
(3)      Section A
(3)        if (condition) then
(4)          x = 7
(4)          post( ev )
           else
(5)          x = 8
(5)          post( ev )
(6)        endif
(6)        z = y * 7
(7)      Section B
(7)        Parallel Sections
(8)          Section B1
(8)            wait( ev )
(8)            x = x * 32
(9)          Section B2
(9)            z = y * 54
(10)       End Parallel Sections
(11)   End Parallel Sections
(11)   y = x * z
(12) endloop
Figure 3: Parallel Program with Parallel Sections and event synchronization
In our work, we only consider post and wait statements. When a post statement is executed, the appropriate shared variables are made consistent and the value of the event is set to "posted", no matter what its value was previously. When a wait statement is executed, the appropriate shared variables are made consistent and the thread waits for the event to be marked "posted".

An example parallel program with the Parallel Sections construct and event synchronization is shown in Figure 3. Sections A and B execute in parallel. Within Section B, there is a nested Parallel Sections construct where Sections B1 and B2 can execute in parallel. The event variable `ev' will be posted in one of the branches of the if-construct, depending on the value of `condition'. The execution of Section B1 can not proceed until at least one of the posts occurs. Note that the Parallel Sections is inside a loop. This example is purely illustrative; in particular, the event variable `ev' is not cleared between iterations of the loop, and thus this example would not execute properly. We refer to this example in Section 6 to show the interaction of loops and synchronization variables. Note that this is a sequential loop; analysis of parallel loops is a topic of future papers.

The language standard does not define the memory consistency model for the target architecture. Rather, it allows a range of implementations, including copy-in/copy-out semantics. We assume copy-in/copy-out semantics in the compiler, because it provides more opportunity for optimization. For example, within a single thread, we are free to load copies of variable values into registers or propagate subexpressions and the like, disregarding the actions of other threads. This does not imply that we implement a pure copy-in/copy-out program. Rather, we use this as one model of memory consistency because it is convenient for compiler optimizations and allowed by the language standard. Correct programs should obey copy-in/copy-out semantics as well as other memory consistency models allowed by the language standard.

At a fork point, i.e., a Parallel Sections statement, every branch of the fork (each thread) gets its own copy of the shared variables. Each thread modifies its own local copy, and at the join point, i.e., the End Parallel Sections statement, the copies from the different threads are merged with the global values. In the presence of post/wait synchronization, the thread that waits for an event to occur updates its copy with the values from all the threads posting that event. Multiple copies of a variable may potentially reach a wait statement, either because of multiple posts executed by different threads, or because of one or more posts (executed by different threads) and the waiting thread defining that variable prior to the wait statement. Some decision has to be made at run time as to which value will reach the wait statement. However, at the compiler level, we allow more than one value to reach that point, and the presence of multiple values at
such wait statements indicates potential anomalies.¹ Similarly, at a join node, multiple values for a variable reaching that node indicate a potential anomaly in the Parallel Sections construct.
4 Parallel Flow Graph

In this section, we describe the Parallel Flow Graph [12], a data structure used to represent control flow, parallelism and synchronization in explicitly parallel programs. The Parallel Flow Graph (PFG) is similar to the Synchronized Control Flow Graph [4] and the Program Execution Graph [2]. A PFG is basically a directed graph with nodes representing extended basic blocks in the program and edges representing either sequential control flow, parallel control flow or synchronization. An extended basic block is a basic block that may have at most one wait statement at the start of the basic block and at most one post or branch statement at the end of the basic block. A sequential control flow edge represents sequential flow of control within sequential parts of the program. A parallel control flow edge represents parallel control flow, as at fork and join points in the program. Finally, a synchronization edge is an edge from a post statement to a corresponding wait statement. The PFG for the parallel program in Figure 3 is shown in Figure 4. Nodes (2) and (7) represent fork nodes, and nodes (11) and (10) are the respective join nodes. Sequential, parallel and synchronization edges are identified in this figure as indicated.
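As a concrete illustration only (this sketch is not taken from [12]; the class and field names are ours), a PFG can be represented as a graph whose nodes are extended basic blocks and whose edges are tagged with one of the three kinds just described:

    from dataclasses import dataclass, field
    from enum import Enum

    class EdgeKind(Enum):
        SEQUENTIAL = 1        # ordinary control flow within one thread
        PARALLEL = 2          # fork/join control flow of a Parallel Sections
        SYNCHRONIZATION = 3   # from a post statement to a corresponding wait

    @dataclass
    class ExtendedBasicBlock:
        name: str
        stmts: list = field(default_factory=list)  # at most one leading wait, one trailing post/branch
        is_fork: bool = False       # contains a Parallel Sections statement
        is_join: bool = False       # contains an End Parallel Sections statement
        fork: 'ExtendedBasicBlock' = None   # for a join node, its corresponding fork node

    @dataclass
    class ParallelFlowGraph:
        nodes: list = field(default_factory=list)
        edges: list = field(default_factory=list)   # (src, dst, EdgeKind) triples

        def preds(self, n, kind=None):
            # predecessors of n, optionally restricted to one edge kind
            return [s for (s, d, k) in self.edges
                    if d is n and (kind is None or k is kind)]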
5 Data Flow Equations for Parallel Sections

In Section 2, we reviewed the data flow equations from [1] to compute the reaching definition information at any point in a sequential program. In this section, we extend these equations to handle the Parallel Sections construct. The extensions are based on the following fundamental concepts:

- At parallel branch points, such as fork nodes, all the branches execute; in the case of sequential branch points, e.g., if-statements, only one of the branches will be executed.
- A value defined at a point prior to a parallel construct does not reach the corresponding parallel merge point if it is always killed in at least one of the branches. In contrast, for sequential branches, the value would need to be always killed along every branch.
- The compiler must assume that a conditionally defined value in a parallel section may reach the parallel merge point. These definitions do not kill the definitions prior to the Parallel Sections statement. In actuality, only one definition reaches the merge point, but determining the actual reaching definition is undecidable. Thus, the compiler must be conservative and assume that both definitions reach.

¹ Note that multiple values reaching a wait statement do not necessarily mean there are anomalous updates; for example, the post statements may have been conditionally executed.
[Figure 4 is a diagram of the Parallel Flow Graph, with nodes Entry, 1-12, and Exit; its legend distinguishes sequential flow edges, parallel flow edges, and synchronization edges.]
Figure 4: Parallel Flow Graph for the example parallel program.
(A) Example Sequential Program:
(1) a = 0
(1) b = 1
(2) if (condition) then
(3)   a = a + 1
(3)   b = 7
    else
(4)   b = 5
(4) endif
(5) c = a * b

(B) Example Parallel Program:
(1) a = 0
(1) b = 1
(2) Parallel Sections
(3)   Section A
(3)     a = a + 1
(3)     b = 7
(4)   Section B
(4)     b = 5
(5) End Parallel Sections
(5) c = a * b
Figure 5: Example sequential and parallel programs.
(1)  a = 0
(1)  b = 1
(1)  c = 2
(2)  Parallel Sections
(3)    Section A
(3)      a = a + 1
(3)      b = 7
(4)    Section B
(4)      Parallel Sections
(5)        Section B1
(5)          b = 5
(6)        Section B2
(6)          if (P) then
(7)            c = 6
(8)          endif
(9)      End Parallel Sections
(10) End Parallel Sections
(10) d = a * b + c
Figure 6: Example parallel program to illustrate data flow equations.
These concepts are illustrated by the sequential and parallel programs in Figure 5 and by the program in Figure 6. The values of the variable `a' reaching the sequential and parallel merge points (i.e., the endif and End Parallel Sections statements, respectively) in Figure 5 are different. In the case of the sequential program, the value of `a' reaching the endif statement is either the value defined before the if test or the value defined in the then-part of the if-construct. However, at the parallel merge point, the only reaching value of `a' is the value defined in Section A. In Figure 6, the variable `c' is defined conditionally in Section B. Therefore, this value and the value of `c' defined prior to the outer Parallel Sections construct both reach the parallel merge points. The sequential data flow equations (Section 2) are not able to handle such cases. The new data flow equations for parallel programs must still be able to say that the values of `b' in Figure 5 reaching the join node are either from Section A or Section B. As mentioned earlier, more than one value of a variable reaching a parallel merge point indicates a potential anomaly in the program.

To handle the above situation, we introduce two new sets to the data flow framework of Section 2: the ACCKillin and ACCKillout sets. These sets accumulate definitions that occur outside a parallel construct and that are killed along specific parallel branches in the parallel construct. The ACCKillin set at a node is the set propagated by its predecessors, and the ACCKillout set at the node is its ACCKillin set updated by the definitions killed in this node, excluding the definitions generated in this node, i.e., its Kill set minus the Gen set. In our first example (Figure 5), the accumulated kill set at the end of Section A is the value of `a' defined prior to the parallel construct, because the definition of `a' inside Section A will always kill the previous definition.

Parallel sections can be nested, but the information represented by the ACCKillout set pertains to a single parallel block. For example, in Figure 5, the ACCKillin set at the entry to the parallel program is empty. At node (1), the ACCKillout set includes `a3' since it is in its Kill set. However, if we propagate this set via Section B, which does not define `a', to the parallel merge node, the ACCKill set at this node will contain this definition. However, `a3' always reaches the parallel merge point and should not be in the ACCKill set of any of its parallel predecessors. Therefore, we clear ACCKillout at fork nodes and use this empty set in computing the accumulated kill sets inside the corresponding parallel block. We must also preserve the current value across internal nested parallel blocks, because a join node must have access to the ACCKillout set from the corresponding fork node. Thus, fork nodes store the ACCKillout, computed from their Gen and Kill sets, in another set, ForkKill, and a `technical edge' between corresponding fork and join nodes makes this information available to the join node. At join nodes, the In set will exclude definitions from the ACCKillout sets of all the parallel predecessors of this node.
We propagate the ACCKill sets by computing the ACCKillin set at a merge node as the union of the ACCKillout sets of its parallel predecessors and the intersection of the ACCKillout sets of its sequential predecessors.

In sequential programs, we define Kill(n) to be the set of all the definitions of variables outside n for those variables defined in n; these are the definitions that will be overridden when the variable is defined in node n. This is appropriate for sequential programs or a single thread of control, because assignments can not occur in parallel. By comparison, in the case of parallel programs, where we can have multiple simultaneous threads of execution, we distinguish between the Kill set and the ParallelKill set. The Kill set for node n contains all killed definitions from nodes that can not execute at the same time as node n. Similarly, the ParallelKill for n contains all definitions from nodes that can execute at the same time. For example, in Figure 5(B), the Kill set of Section B contains the definition `b1' (the definition of b from node 1), while the ParallelKill set contains the definition `b3'. We would expect both definitions `b3' and `b4', but not `b1', to reach the join node (node 5). Definition `b1' should not reach because there are assignments to `b' that are guaranteed to occur later in the execution order. Both `b3' and `b4' should reach the join node because the compiler can not assume a particular execution order or memory semantics. Indeed, this indicates a potential data anomaly or race condition in this particular program. We segregate the kill sets into Kill and ParallelKill sets to distinguish between these cases.

ParallelKill(n) can be computed by traversing the PFG and including those definitions di of variables `v' such that `v' has a definition in n and di occurs in a node that can execute in parallel with n. This can be done by traversing the parallel flow edges and the sequential flow edges in all Sections that have the same fork node and join node as the Section Sn corresponding to n, but not Sn itself (see the sketch below). Thus, as in the sequential data flow problem, Kill and ParallelKill can be computed directly and need not be computed using an iterative algorithm.

The ACCKill sets accumulate information about definitions that are killed within a sequential thread, and we include the Kill sets in the ACCKillin and ACCKillout sets. We do not include the ParallelKill set, because that set represents information about other threads where the temporal ordering of definitions is undefined. When computing the Out set for each node, we must consider all killed definitions, i.e., the union of the Kill and ParallelKill sets.

The data flow equations for the reaching definitions problem in programs that have the Parallel Sections construct are given in Figure 7. In those equations, par pred refers to the set of parallel flow predecessors of the node; seq pred refers to the set of sequential flow predecessors of the node; and pred refers to the set of all predecessors (both parallel and sequential flow predecessors) of the node. The reaching definition information, i.e., the In set at each node, is defined by the fixpoint of the equations in Figure 7.
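The direct computation of ParallelKill can be illustrated with a small sketch. This is our own illustration under the assumption that each node's enclosing Section is recorded together with its fork and join nodes; none of the names below come from the paper.

    def parallel_kill(n, section_of, defs_in, var_of):
        # section_of[m] = (fork, join, name) of the Section containing node m, or None
        # defs_in[m]    = definitions generated in node m, e.g. defs_in[3] = {'a3', 'b3'}
        # var_of[d]     = the variable a definition d assigns, e.g. var_of['b3'] = 'b'
        if section_of.get(n) is None:
            return set()
        fork, join, my_name = section_of[n]
        my_vars = {var_of[d] for d in defs_in[n]}
        killed = set()
        for m, sec in section_of.items():
            if m == n or sec is None:
                continue
            f, j, name = sec
            # sibling Sections of the same Parallel Sections construct may run in parallel with n
            if f == fork and j == join and name != my_name:
                killed |= {d for d in defs_in[m] if var_of[d] in my_vars}
        return killed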
    Out(n) = (In(n) - Kill(n) - ParallelKill(n)) ∪ Gen(n)

    In(n) = ∪_{p ∈ pred(n)} Out(p)  -  ∪_{p ∈ par pred(n)} ACCKillout(p)

    ACCKillout(n) =
        ∅                                                              if n is a fork node
        ((ACCKillin(n) + Kill(n)) - Gen(n)) + (ForkKill(f) - Out(n))   if n is a join node, with corresponding fork node f
        (ACCKillin(n) + Kill(n)) - Gen(n)                              otherwise

    ACCKillin(n) = ∪_{p ∈ par pred(n)} ACCKillout(p)  +  ∩_{p ∈ seq pred(n)} ACCKillout(p)

    ForkKill(n) =
        (ACCKillin(n) + Kill(n)) - Gen(n)   if n is a fork node
        ∅                                   otherwise

Figure 7: Data Flow Equations for Programs with Parallel Sections
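To make the case analysis of Figure 7 concrete, the sketch below shows one update step for a single node n; iterating it over all nodes until nothing changes yields the fixpoint. It is our own illustration, not the authors' implementation; the graph object G and its fields (gen, kill, pkill, the predecessor lists, kind, fork_of) are assumed names.

    def update_node(n, G, In, Out, AccIn, AccOut, ForkKill):
        preds     = G.pred[n]        # all (parallel and sequential) predecessors
        par_preds = G.par_pred[n]    # parallel flow predecessors
        seq_preds = G.seq_pred[n]    # sequential flow predecessors

        In[n] = set().union(*(Out[p] for p in preds))
        In[n] -= set().union(*(AccOut[p] for p in par_preds))

        # Out excludes everything killed here or in parallel siblings (ParallelKill)
        Out[n] = (In[n] - G.kill[n] - G.pkill[n]) | G.gen[n]

        acc_par = set().union(*(AccOut[p] for p in par_preds))
        acc_seq = set.intersection(*(AccOut[p] for p in seq_preds)) if seq_preds else set()
        AccIn[n] = acc_par | acc_seq

        if G.kind[n] == 'fork':
            AccOut[n] = set()                                # cleared at fork nodes
            ForkKill[n] = (AccIn[n] | G.kill[n]) - G.gen[n]  # saved for the matching join
        elif G.kind[n] == 'join':
            f = G.fork_of[n]                                 # corresponding fork node
            AccOut[n] = ((AccIn[n] | G.kill[n]) - G.gen[n]) | (ForkKill[f] - Out[n])
            ForkKill[n] = set()
        else:
            AccOut[n] = (AccIn[n] | G.kill[n]) - G.gen[n]
            ForkKill[n] = set()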
For the parallel program given in Figure 6, the In, Out, Gen, Kill, ParallelKill and the accumulated kill sets are given in Figure 8. The system of equations converges on the second iteration. The figure shows the first iteration (which is the same as the second). Note that ACCKillout(10) contains b1. This indicates that b1 is killed by one or more of the parallel branches; in this case, it is killed by both Sections A and B (via Section B1). By comparison, even though `c' is defined in node 7, the definition is conditional on `P', and thus c1 does not appear in ACCKillout(10). The set Out(10) contains definitions b3 and b5, indicating a potential anomaly. In the case of `b', this is an actual anomaly. In the next section, we extend these data flow equations to consider event synchronization between parallel Sections.
6 Including the effect of Synchronization

We extend the data flow equations in the previous section to consider event synchronization by using the preserved sets formulation given in [3]. Synchronization using post/wait occurs between different threads that execute in parallel. Synchronization edges carry data flow information, i.e., they propagate values of variables from the thread that posted the event to the thread that is waiting for the event to be posted. According to [9], it must be possible to execute each post before its corresponding wait for a parallel program to be deadlock free and correct.
Node | Gen          | Kill             | ParKill
1    | {a1, b1, c1} | {a3, b3, b5, c7} | {}
2    | {}           | {}               | {}
3    | {a3, b3}     | {a1, b1}         | {b5}
4    | {}           | {}               | {}
5    | {b5}         | {b1}             | {b3}
6    | {}           | {}               | {}
7    | {c7}         | {c1}             | {}
8    | {}           | {}               | {}
9    | {}           | {}               | {}
10   | {d10}        | {}               | {}

Node | In                   | Out                       | ACCKillIn        | ACCKillOut       | ForkKill
1    | {}                   | {a1, b1, c1}              | {}               | {a3, b3, b5, c7} | {}
2    | {a1, b1, c1}         | {a1, b1, c1}              | {a3, b3, b5, c7} | {}               | {a3, b3, b5, c7}
3    | {a1, b1, c1}         | {a3, b3, c1}              | {}               | {a1, b1}         | {}
4    | {a1, b1, c1}         | {a1, b1, c1}              | {}               | {}               | {}
5    | {a1, b1, c1}         | {a1, b5, c1}              | {}               | {b1}             | {}
6    | {a1, b1, c1}         | {a1, b1, c1}              | {}               | {}               | {}
7    | {a1, b1, c1}         | {a1, b1, c7}              | {}               | {c1}             | {}
8    | {a1, b1, c1, c7}     | {a1, b1, c1, c7}          | {}               | {}               | {}
9    | {a1, b5, c1, c7}     | {a1, b5, c1, c7}          | {b1}             | {b1}             | {}
10   | {a3, b3, b5, c1, c7} | {a3, b3, b5, c1, c7, d10} | {a1, b1}         | {a1, b1}         | {}
Figure 8: Data Flow Sets for one iteration on the parallel program in Figure 6.
If the post statement at a node n_p always executes before the corresponding wait node n_w,² then n_w will have to update its reaching definitions information with that from the Out set of the node corresponding to the post. Apart from updating the reaching definitions information, i.e., the In set at the wait node n_w, we also want to be able to update its accumulated kill sets; e.g., if n_w defines a variable, then all definitions of that variable reaching n_w via synchronization edges from post nodes n_p that always execute before n_w must be included in the ACCKillin set of n_w. This is important because the definitions propagated by such synchronization edges are killed by the corresponding definition in n_w.

If there is a synchronization edge from n_p to n_w, we can not say that n_p always executes before n_w. It is possible that there are multiple posts of the same event variable and multiple waits for the same event variable. It is also possible that these multiple posts and waits are executed conditionally. Thus, a synchronization edge does not always imply an execution order. We are, however, interested in the potential execution order for computing the reaching definition information. Preserved sets, as defined in [3], give precisely the set of nodes that execute before a given node, defined as follows:
Definition 1  A node n_j ∈ Preserved(n_i) if and only if, for all parallel executions x, if n_j and n_i are both executed, n_j is completed before n_i is begun.
However, Callahan and Subhlok [3] have shown that computing this information is Co-NP Hard and have given a data flow framework to compute a conservative approximation to the Preserved sets. The approximate Preserved sets are computed as the least fixpoint of a set of data flow equations over the Parallel Flow Graph. The Preserved set for a block is defined as a function of its control flow (parallel and sequential) and synchronization predecessors. Clearly, by using the Preserved set formulation, we can determine at a wait node n_w if a post node n_p always completes execution before n_w begins, i.e., if n_p ∈ Preserved(n_w).

We use a new data flow set, called the SynchPass set, that propagates definitions via synchronization edges. If n_p executes before n_w, we propagate the definitions from n_p to n_w (and thus to all nodes that execute after n_w in the same thread), because we know those definitions must have occurred before the synchronization occurred. Any definitions that occur in node n_w (and nodes subsequently executed by that thread) will kill the previous definitions in the thread executing n_p. These definitions will also kill any definitions that occur before n_w executes in the thread corresponding to n_p, but not necessarily those definitions occurring after n_p executed in that thread (e.g., because the thread executing n_p may have already completed execution, as it does not wait for the wait statement to occur).
² We say a wait node starts executing when the wait statement is successful and the code following the wait in this node starts execution.
fork node:    X = 1
  branch 1:   X = 2
              post( ev )
  branch 2:   wait( ev )
              Y = X
              X = 3
join node
Figure 9: Synchronization Example
This means that the join node must realize that the definitions in n_w occur after the definitions passed in from n_p; this is the role of the ACCKill sets.

For example, consider the Parallel Sections in the PFG shown in Figure 9. The fork node defines a value for `x'. This value reaches the predecessor of the wait node and the post node. The definition in the fork node is in the ACCKillout set for the post node, indicating that some branch of the Parallel Section has killed that value. However, only the value from the wait node should reach the join node, because that definition must occur after the assignments in the post node and the fork node. We get the execution ordering information from the Preserved set. The value of `Y' following the post node is not specified by the language definition. One could argue that `Y' should have the value `3'; however, we have chosen to assume copy-in/copy-out semantics, and would thus believe that `Y' has the value `2'; ideally, an error message would be issued concerning this data race.

For this example, the data flow formulation for Preserved sets given in [3] will be able to determine the Preserved sets of the wait node accurately. However, since this data flow formulation is conservative, we may not always be able to compute the exact Preserved sets for every node. This would result in a conservative approximation to the reaching definitions information in our data flow framework. For example, in the absence of the Preserved sets information in Figure 9, we would derive the Out set of the join node to contain the definitions from both the post and the wait node. This is a conservative, yet correct, approximation to the reaching definition information at the join node. In the worst case, the effect of synchronization is lost at parallel merge points, i.e., in the absence of any Preserved set information our data flow equations would compute multiple reaching definitions at the respective parallel merge nodes. This simply reduces the opportunity for, or effectiveness of, some optimizations.

Therefore, at wait nodes, we update the SynchPass set with the Out sets from the corresponding synchronization predecessors in the Preserved set of this node, indicating that the definitions from those predecessors have occurred. In order to propagate the SynchPass information to other nodes after a wait node, we consider the union of the SynchPass from all the parallel predecessors (since all these predecessors always execute) and the intersection of the SynchPass from the sequential predecessors (since only one of them executes). We update the ACCKillin set of each node with the definitions of variables that are propagated by synchronization edges (i.e., SynchPass). We only consider the definitions of SynchPass also defined in this node. To do this, we use the set OtherDefs(n), the definitions in the program outside of n that define variables that also have definitions within n.

The data flow equations taking into account Parallel Sections constructs with event synchronization are given in Figure 10. In this figure, synch pred refers to the synchronization predecessors of the node. Figures 11 and 12 show the data flow sets for the first two iterations for the parallel program in Figure 3; the fixpoint is reached in the third iteration. The Preserved set of node (8) (the wait node) is the set {Entry, 1, 2, 3, 4, 5, 7}, since each of these nodes always completes execution before node (8), if they execute at all. The reaching definition information in this figure has been computed using the Preserved set information. The definitions `x4' and `x5' will not reach the join node, (11), because the definition `x8' always executes after `x4' and `x5'. It is this information on execution order that we borrowed from the Preserved set formulation. Also, the ACCKillout set of (11) includes `x4' and `x5'. This information was propagated to node (8) by the synchronization edges, since (4) and (5) were in the Preserved set of (8). The definitions `z6' and `z9' reach the merge node (11); this is an indication of a potential anomaly in the program, since the two definitions occur in distinct parallel branches, i.e., threads that can execute in parallel.

The importance of the ParallelKill set is seen in the Out sets of nodes (6) and (9). Even though the corresponding In sets have both definitions of `z', only the definition in that node should be in its Out set. The reason the In sets of (6) and (9) both have `z6' and `z9' is the loop around the parallel block. Since we exclude the ParallelKill set from the Out set, we are able to compute the correct Out sets; for example, the Out set of (6) does not contain `z9' since this definition is in its ParallelKill set.
    SynchPass(n) =
        ∪_{p ∈ synch pred(n) ∧ p ∈ Preserved(n)} Out(p)                          if n is a wait node
        ∩_{p ∈ seq pred(n)} SynchPass(p)  +  ∪_{p ∈ par pred(n)} SynchPass(p)    otherwise

    Out(n) = ((In(n) - Kill(n) - ParallelKill(n)) ∪ Gen(n)) - (OtherDefs(n) ∩ SynchPass(n))

    In(n) = ∪_{p ∈ pred(n)} Out(p)  -  ∪_{p ∈ par pred(n)} ACCKillout(p)  -  ∩_{p ∈ synch pred(n)} ACCKillout(p)

    ACCKillout(n) =
        ∅                                                              if n is a fork node
        ((ACCKillin(n) + Kill(n)) - Gen(n)) + (ForkKill(f) - Out(n))   if n is a join node, with corresponding fork node f
        (ACCKillin(n) + Kill(n)) - Gen(n)                              otherwise

    ACCKillin(n) = ∪_{p ∈ par pred(n)} ACCKillout(p)  +  ∩_{p ∈ seq pred(n)} ACCKillout(p)  +  (OtherDefs(n) ∩ SynchPass(n))

    ForkKill(n) =
        (ACCKillin(n) + Kill(n)) - Gen(n)   if n is a fork node
        ∅                                   otherwise

Figure 10: Data Flow Equations for Programs with Parallel Sections and Event Synchronization
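The parts of Figure 10 that differ from Figure 7 are the SynchPass set and the extra terms involving synchronization predecessors and OtherDefs. A sketch of just those additions follows, in the same illustrative style as before; all names are ours, Preserved(n) is assumed to be available from the approximation of [3], and, as the example tables suggest, pred(n) here is read as all PFG predecessors, including synchronization predecessors.

    def synch_pass(n, G, Out, SynchPass):
        if G.kind[n] == 'wait':
            # Out sets of posts known to complete before n begins
            return set().union(*(Out[p] for p in G.synch_pred[n] if p in G.preserved[n]))
        seq = [SynchPass[p] for p in G.seq_pred[n]]
        par = [SynchPass[p] for p in G.par_pred[n]]
        inter = set.intersection(*seq) if seq else set()
        return inter | set().union(*par)

    def out_with_synch(n, G, In, SynchPass):
        # compute synch_pass(n) first so SynchPass[n] is current
        base = (In[n] - G.kill[n] - G.pkill[n]) | G.gen[n]
        # definitions of variables redefined in n that arrived over synchronization
        # edges are known to be overwritten here, so they are removed as well;
        # G.other_defs[n] plays the role of OtherDefs(n)
        return base - (G.other_defs[n] & SynchPass[n])

    def in_with_synch(n, G, Out, AccOut):
        result = set().union(*(Out[p] for p in G.pred[n]))
        result -= set().union(*(AccOut[p] for p in G.par_pred[n]))
        sync = [AccOut[p] for p in G.synch_pred[n]]
        if sync:
            result -= set.intersection(*sync)
        return result
    # ACCKillin(n) additionally gains the term (OtherDefs(n) ∩ SynchPass(n)), as in Figure 10.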
Node  | Gen      | Kill              | ParKill
Entry | {x0, y0} | {x4, x5, x8, y11} | {}
1     | {}       | {}                | {}
2     | {}       | {}                | {}
3     | {}       | {}                | {}
4     | {x4}     | {x0, x5}          | {x8}
5     | {x5}     | {x0, x4}          | {x8}
6     | {z6}     | {}                | {z9}
7     | {}       | {}                | {}
8     | {x8}     | {x0}              | {x4, x5}
9     | {z9}     | {}                | {z6}
10    | {}       | {}                | {}
11    | {y11}    | {y0}              | {}
12    | {}       | {}                | {}

Node  | In               | Out               | ACCKillIn        | ACCKillOut        | ForkKill
Entry | {}               | {x0, y0}          | {}               | {x4, x5, x8, y11} | {}
1     | {x0, y0}         | {x0, y0}          | {}               | {}                | {}
2     | {x0, y0}         | {x0, y0}          | {}               | {}                | {}
3     | {x0, y0}         | {x0, y0}          | {}               | {}                | {}
4     | {x0, y0}         | {x4, y0}          | {}               | {x0, x5}          | {}
5     | {x0, y0}         | {x5, y0}          | {}               | {x0, x4}          | {}
6     | {x4, x5, y0}     | {x4, x5, y0, z6}  | {x0}             | {x0}              | {}
7     | {x0, y0}         | {x0, y0}          | {}               | {}                | {}
8     | {x4, x5, y0}     | {x8, y0}          | {x4, x5}         | {x0, x4, x5}      | {}
9     | {x0, y0}         | {x0, y0, z9}      | {}               | {}                | {}
10    | {x8, y0, z9}     | {x8, y0, z9}      | {x0, x4, x5}     | {x0, x4, x5}      | {}
11    | {x8, y0, z6, z9} | {x8, y11, z6, z9} | {x0, x4, x5, y0} | {x0, x4, x5, y0}  | {}
12    | {x0, y0}         | {x0, y0}          | {}               | {}                | {}
Figure 11: Data Flow Sets for the program in Figure 3: Iteration 1.
Node  | In                            | Out                        | ACCKillIn        | ACCKillOut        | ForkKill
Entry | {}                            | {x0, y0}                   | {}               | {x4, x5, x8, y11} | {}
1     | {x0, x8, y0, y11, z6, z9}     | {x0, x8, y0, y11, z6, z9}  | {x4, x5}         | {x4, x5}          | {}
2     | {x0, x8, y0, y11, z6, z9}     | {x0, x8, y0, y11, z6, z9}  | {x4, x5}         | {}                | {x4, x5}
3     | {x0, x8, y0, y11, z6, z9}     | {x0, x8, y0, y11, z6, z9}  | {}               | {}                | {}
4     | {x0, x8, y0, y11, z6, z9}     | {x4, y0, y11, z6, z9}      | {}               | {x0, x5}          | {}
5     | {x0, x8, y0, y11, z6, z9}     | {x5, y0, y11, z6, z9}      | {}               | {x0, x4}          | {}
6     | {x4, x5, y0, y11, z6, z9}     | {x4, x5, y0, y11, z6}      | {x0}             | {x0}              | {}
7     | {x0, x8, y0, y11, z6, z9}     | {x0, x8, y0, y11, z6, z9}  | {}               | {}                | {}
8     | {x4, x5, x8, y0, y11, z6, z9} | {x8, y0, y11, z6, z9}      | {x4, x5}         | {x0, x4, x5}      | {}
9     | {x0, x8, y0, y11, z6, z9}     | {x0, x8, y0, y11, z9}      | {}               | {}                | {}
10    | {x8, y0, y11, z6, z9}         | {x8, y0, y11, z6, z9}      | {x0, x4, x5}     | {x0, x4, x5}      | {}
11    | {x8, y0, y11, z6, z9}         | {x8, y11, z6, z9}          | {x0, x4, x5, y0} | {x0, x4, x5, y0}  | {}
12    | {x0, x8, y0, y11, z6, z9}     | {x0, x8, y0, y11, z6, z9}  | {x4, x5}         | {x4, x5}          | {}
Figure 12: Data Flow Sets for the program in Figure 3: Iteration 2.
7 Conclusion & Future Work

We have presented data flow equations to compute the reaching definition information at any point in an explicitly parallel program with event synchronization. Data flow equations for computing reaching definitions information in sequential programs are well understood and used in many current compilers for the optimization of such programs. We believe that the data flow framework presented in this paper can be used to perform rigorous scalar optimization on parallel programs and thus help achieve better execution rates for such programs on existing high performance architectures. This information will particularly benefit distributed shared memory systems, because optimizations using the data flow information will reduce the amount of communication between processors.

This work has considered the Parallel Sections construct and event synchronization. We have implemented these algorithms in a data flow tool; we hope to eventually implement them in an actual compiler such as Sigma. In the future, we propose to extend the data flow equations to handle Parallel Do, another parallel construct specified by PCF FORTRAN. We anticipate the use of some technology from data dependence analysis to analyze these constructs. Earlier work [14, 11, 13, 12] has looked at translating explicitly parallel programs to their Static Single Assignment intermediate representation [5], a more efficient data flow representation. We hope to extend these results on post/wait synchronization to that representation as well.
References

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.

[2] Vasanth Balasundaram and Ken Kennedy. Compile-time detection of race conditions in a parallel program. In Proc. 3rd International Conference on Supercomputing, pages 175-185, June 1989.

[3] D. Callahan and J. Subhlok. Static analysis of low-level synchronization. In Proc. of the ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 100-111, Madison, WI, May 1988.

[4] David Callahan, Ken Kennedy, and Jaspal Subhlok. Analysis of event synchronization in a parallel programming tool. In Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming [10], pages 21-30.

[5] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. on Programming Languages and Systems, 13(4):451-490, October 1991.

[6] Anne Dinning and Edith Schonberg. An empirical comparison of monitoring algorithms for access anomaly detection. In Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming [10], pages 1-10.

[7] Samuel P. Midkiff and David A. Padua. Issues in the optimization of parallel programs. In David Padua, editor, Proc. 1990 International Conf. on Parallel Processing, volume II, pages 105-113, St. Charles, IL, August 1990. Penn State Press.

[8] Steven S. Muchnick and Neil D. Jones. Program Flow Analysis: Theory and Applications. Prentice Hall, Englewood Cliffs, NJ, 1981.

[9] Parallel Computing Forum. PCF Parallel FORTRAN Extensions. FORTRAN Forum, 10(3), September 1991. (special issue).

[10] Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seattle, Washington, March 1990. ACM Press.

[11] Harini Srinivasan. Analyzing programs with explicit parallelism. M.S. thesis 91-TH-006, Oregon Graduate Institute of Science and Technology, Dept. of Computer Science and Engineering, July 1991.

[12] Harini Srinivasan and Dirk Grunwald. An Efficient Construction of Parallel Static Single Assignment Form for Structured Parallel Programs. Technical Report CU-CS-564-91, University of Colorado at Boulder, December 1991.

[13] Harini Srinivasan and Michael Wolfe. Analyzing programs with explicit parallelism. In Utpal Banerjee, David Gelernter, Alexandru Nicolau, and David A. Padua, editors, Languages and Compilers for Parallel Computing, pages 405-419. Springer-Verlag, 1992.
[14] Michael Wolfe and Harini Srinivasan. Data structures for optimizing programs with explicit parallelism. In H. P. Zima, editor, Parallel Computation: First International ACPC Conference, volume 591 of Lecture Notes in Computer Science, pages 139-156. Springer-Verlag, Berlin, 1992.