Static Single Assignment Form for Explicitly Parallel Programs: Theory and Practice

Eric Stoltz†, Harini Srinivasan‡§, James Hook¶, Michael Wolfe†

Department of Computer Science and Engineering
Oregon Graduate Institute of Science & Technology
P.O. Box 91000, Portland, OR 97291-1000
(503) 690-1121 ext. 7404
[email protected] [email protected]
Abstract
To sensibly reason about parallel programs, a coherent intermediate form needs to be developed. We describe, and prove correct and safe, algorithms to convert programs that use the Parallel Computing Forum Parallel Sections construct into a parallel Static Single Assignment (SSA) form. We define what the concepts of dominator and dominance frontier mean in parallel programs. We describe how to extend the SSA form to handle parallel updates while still preserving the SSA properties, by introducing a new parallel merge operator, the ψ-function. The minimal placement points for ψ-functions are identified and proved correct by introducing the meet of two nodes in a Parallel Precedence Graph (PPG), which is the dual of the concept of join in a sequential control flow graph (CFG). The resulting intermediate form allows compilers to apply classical scalar optimization algorithms to explicitly parallel programs. We also discuss issues encountered while implementing these constructs in Nascent, our restructuring research compiler.

Keywords: Optimizing compilers, use-def chains, parallel languages, explicit parallel sections, Static Single Assignment.

∗Corresponding author. †Supported in part by NSF grant CCR-9113885 and a grant from Intel Corporation and the Oregon Advanced Computing Institute. ‡University of Colorado, Boulder, CO. §Supported in part by an IBM Graduate Fellowship. ¶Supported in part by NSF grant CCR-9101721.
1 Introduction

While parallelizing sequential programs has long been a popular research topic, little work has been done on applying classical optimizations to explicitly parallel programs. For sequential programs, the Static Single Assignment (SSA) form [1] has proven to be an efficient and practical basis for many code optimization algorithms, including constant propagation, redundancy elimination, and induction variable analysis [2, 3, 4, 5]. Indeed, it has already found its way into modern commercial compilers. In this paper we extend the algorithms developed by Cytron et al. that convert a program into SSA form to handle explicitly parallel programs. To support this, we have designed abstractions that represent both control flow and parallelism. Other issues that affect parallel programs, such as multiple parallel updates and the concept of parallel precedence, are also considered.
We chose to start with the Parallel Computing Forum Parallel Fortran extensions since this is the basis of the ANSI committee X3H5 standardization effort [6]. Because of the industrial participation and commitment to this effort, these particular parallel language extensions may have widespread impact. While multiprocessor workstations are now available with parallel language extensions that will allow users to write programs to take advantage of them, the actual performance of these programs will depend to a large extent on the ability of the compiler to perform aggressive optimizations (particularly scalar optimizations) within and across parallel constructs.

Consider the sequential and parallel programs in Figure 1; even though the conditional branching behavior of the sequential program resembles the parallel fork structure of the parallel program, these two programs are quite different.

    k = 0                        k = 0
    j = 0                        j = 0
    loop                         loop
      if (condition) then          Parallel Sections
        j = j + 1                    Section A
      else                             j = j + 1
        k = 5                        Section B
      endif                            k = 5
      A(j) = B(k)                      A(j) = B(k)
    endloop                        End Parallel Sections
                                 endloop

Figure 1 Similar sequential and parallel programs, but not identical

Variable j is a linear induction variable in the parallel program [5], but not in the sequential program. Recognition of j as an induction variable is important for data dependence analysis and strength reduction. Furthermore, variable k has a constant value in the parallel program since both branches of the fork, i.e. the Parallel Sections statement, will be executed independently. This is also useful
information since the compiler can determine that the value assigned to A(j) is loop invariant. This is not necessarily true for the sequential program since only one branch of the conditional statement executes. More detail on this crucial distinction is provided in Section 4.

Our work takes an important step in the intraprocedural analysis necessary to extend traditional sequential optimization to parallel programs. We provide both the theoretical background as well as the techniques necessary for a successful implementation.

The rest of this article is organized as follows: In Section 2 the explicit parallel syntax which we will address in this work is introduced. Section 3 reviews concepts and terms dealing with Static Single Assignment form. Section 4 covers the definition of Extended Flow Graphs, which account for explicit parallelism and synchronization in a program. Sections 5 and 6 give the theoretical basis for our work, with well-known concepts in sequential analysis extended to explicit parallel sections. Section 7 describes how we incorporate multiple parallel updates, and introduces the new parallel merge operator to handle such cases. We also identify and
prove correct the minimal set of nodes at which to place parallel merge operators. Section 8 details the methods used to implement these ideas and concepts, while Section 9 presents the complete algorithm, with proofs of safety and correctness for efficiently transforming programs into the extended SSA form. Section 10 contrasts this work with related research, suggesting some continuing directions of interest. Section 11 provides conclusions.
2 Parallel Section Semantics

The Parallel Sections construct [6] is similar to the cobegin/coend of Brinch Hansen [7] or the IBM Parallel Cases statement [8]. It is a block structured construct used to specify parallel execution of identified sections of code. The parallel sections may also be nested. The sections of code must be data independent, except where an appropriate synchronization mechanism is used. Transfer of control into or out of a parallel section is not supported [6]. Here we consider only structured synchronization expressed as Wait clauses, i.e. DAG parallelism [9]. An example parallel program with Wait clauses is given in Figure 2. The Wait clause in section D specifies that this section can start executing only after both sections A and B have completed execution. Note that Wait clauses do not affect control dependence relations [10]; all unconditional code in the Parallel Sections construct is identically control dependent on the same predicate that controls execution of the Parallel Sections statement itself. Once this predicate evaluates to true, all the parallel sections will execute.

It is important that a well-defined interpretation be assigned to the case when two sections of code that can execute in parallel both modify the same variable, or when one section modifies a variable that is used by the other. We assume copy-in/copy-out semantics in the compiler, where the values of shared variables in a parallel section are defined to be initialized to the values they had when the parallel block was entered; any updates are made (conceptually)
    (p)  a = 2
    (p)  b = 3
    (p)  c = 4
    (p)  if (Q) then
           Parallel Sections
             Section A
    (s)        if (P) then
    (t)           b = a - 5
               else
    (u)           b = 15
    (u)           c = 16
               endif
    (v)        c = c + 7
             Section B
    (w)        f = a + b
    (w)        a = f
             Section C, Wait(A)
    (x)        d = b - a
             Section D, Wait(A, B)
    (y)        c = a - b + c * f
           End Parallel Sections
    (n)    d = d + f
         else
    (q)    d = 23
         endif
    (r)  e = a + b - c - d

Figure 2 Example parallel program
to local copies of the variable [11]. When the parallel block is complete, the global state is updated with any modifications made within any section. This gives a well-defined program without volatile variables, and allows optimization within a parallel section independent of code in other sections. Such a model is not adequate for certain systems-level tasks, but is appropriate for most higher-level applications programming. Strict observance of the model has several potential problems, however, such as the overhead of making local copies of variables and atomic merging of updated variables. In the PCF Fortran extensions, the required data independence between parallel sections allows the compiler to ignore this issue. The interplay between the language model, the compiler model and the architectural model is a subject for another paper.
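The copy-in/copy-out model can be made concrete with a small sketch. The following Python fragment (ours for illustration, not part of the PCF definition or the Nascent compiler) simulates one parallel block under these semantics: every section reads a snapshot of the shared state taken at block entry, and the global state is updated from the sections' writes only when the block completes.

    def run_parallel_block(shared, sections):
        """Simulate copy-in/copy-out semantics for one parallel block.

        shared:   dict mapping variable names to values at block entry.
        sections: list of functions; each takes a private copy of the
                  entry state and returns a dict of the updates it made."""
        entry_state = dict(shared)            # copy-in: snapshot at entry
        updates = {}
        for section in sections:              # conceptually concurrent
            local = dict(entry_state)         # sections never see each other
            updates.update(section(local))    # collect this section's writes
        shared.update(updates)                # copy-out: merge on block exit
        return shared

    # The parallel loop body of Figure 1: Section A increments j while
    # Section B sets k; Section B reads only the entry value of j.
    state = {"j": 0, "k": 0}
    section_a = lambda env: {"j": env["j"] + 1}
    section_b = lambda env: {"k": 5}
    print(run_parallel_block(state, [section_a, section_b]))  # {'j': 1, 'k': 5}

If two sections were to update the same variable, this sketch would arbitrarily keep the last writer; the ψ-function of Section 7 is precisely the mechanism that makes such multiple parallel updates explicit in the intermediate form.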
3 The Static Single Assignment Form

This section reviews some concepts of SSA construction introduced by Cytron et al. [1]. The algorithm to convert a sequential program into SSA form uses the Control Flow Graph (CFG) of a procedure. A CFG is a graph G = ⟨V, E, Entry, Exit⟩, where V is a set of nodes representing basic blocks in the program, E is a set of edges representing sequential control flow in the program, and Entry and Exit are nodes representing the unique entry point into the program and the unique exit from the program, respectively. When a program is converted into SSA form, the resulting program has two distinguishing properties:

1. At appropriate confluence points in the CFG, merge functions called φ-functions are introduced. A φ-function for a variable merges the values of the variable from distinct incoming control flow paths when a definition of that variable occurs along at least one of the paths.
2. Every use of any variable in the program has exactly one reaching definition.

The algorithm which achieves the first property finds where to place φ-functions. Cytron et al. describe the SSA algorithm using the dominance relation and dominance frontiers. A node X dominates a node Y (written X ≥ Y) in a CFG if X appears on all paths from Entry to Y; a node X strictly dominates a node Y (X > Y) if X ≥ Y but X ≠ Y. The dominance frontier of a node X, DF(X), is defined as the set of nodes Z such that X dominates a predecessor of Z and X does not strictly dominate Z (X ≯ Z). The dominance frontier of a set of nodes, DF(S), is defined as the union of the dominance frontiers of the nodes in the set S. The iterated dominance frontier, DF⁺(S), is the limit of the increasing sequence of sets of nodes

    DF¹(S) = DF(S),    DFⁱ⁺¹(S) = DF(S ∪ DFⁱ(S)).

Cytron et al. prove that φ-functions must be placed at the iterated dominance frontier of the set of nodes which have assignments to a variable. DF⁺(S) is calculated using a simple and efficient worklist algorithm. (To be precise, φ-functions are placed at special confluence points in the CFG, defined as the join of the set S, where informally the join of S is defined to be the set of all nodes Z such that there are two non-null CFG paths that start at two distinct nodes in S and converge at Z. Formally, the join of nodes X and Y, J(X, Y), is:

    J(X, Y) = { Z | ∃ ZX, ZY with ZX → Z and ZY → Z, paths pX : X →⁺ ZX and pY : Y →⁺ ZY, pX ∩ pY = ∅ }.

Cytron et al. proved that J⁺(S) = DF⁺(S).) To achieve the second property, renaming of all variable definitions is performed uniquely (including φ-functions), followed by replacing each variable use with its one reaching definition. Further details on the construction of SSA form may be found in the original paper [1].
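For reference, the dominance frontier and iterated dominance frontier computations referred to above can be sketched as follows. This Python fragment is a standard rendering of the algorithms of Cytron et al. [1]; the inputs `preds` (each node's CFG predecessors) and `idom` (each node's immediate dominator) are our assumed data structures, not notation from this paper.

    def dominance_frontiers(nodes, preds, idom):
        """DF(X) for every node: for each join node Z, walk from each
        predecessor up the dominator tree until idom(Z) is reached."""
        df = {n: set() for n in nodes}
        for z in nodes:
            if len(preds.get(z, [])) >= 2:
                for p in preds[z]:
                    runner = p
                    while runner != idom[z]:
                        df[runner].add(z)
                        runner = idom[runner]
        return df

    def iterated_df(df, s):
        """DF+(S) by worklist: the limit of DF(S), DF(S ∪ DF(S)), ..."""
        result = set()
        worklist = list(s)
        enqueued = set(s)
        while worklist:
            x = worklist.pop()
            for z in df[x]:
                result.add(z)
                if z not in enqueued:      # new frontier nodes contribute too
                    enqueued.add(z)
                    worklist.append(z)
        return result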
4 Flow Graphs for Parallel Constructs

As illustrated by the programs in Figure 1, traditional analysis methods on Control Flow Graphs will not easily apply to explicitly parallel programs. It is not possible to analyze parallel merge points using the techniques for sequential merge points like endif statements. In this section, we present an abstraction that handles both sequential control flow and parallelism in parallel programs. This abstraction, which extends Control Flow Graphs, will be used to translate an explicitly parallel program to its SSA representation just as the CFG was used to translate a sequential program to its SSA representation in Section 3.
4.1 The Extended Flow Graph

We define a Parallel Control Flow Graph (PCFG) to be a CFG that may have a special type of node called a supernode. A supernode essentially represents an entire Parallel Sections construct (or parallel block), described earlier. Formally, a PCFG is defined as a graph G = ⟨VG, EG, EntryG, ExitG⟩ where VG is a set of vertices (each representing a basic block or an entire parallel block), EG is a set of edges representing sequential control flow, EntryG is the unique start vertex and ExitG is the unique end vertex.

For each supernode P, two additional basic block nodes, called the head and tail nodes for that supernode, are introduced. The head node captures all the incoming control flow edges to P, and the control flow successors of the tail node are those of P in the original PCFG. The tail node has exactly one control flow predecessor, namely P. Node P is the only control flow successor of the corresponding head node. These additional nodes are helpful both for proving correctness and for implementation. We will return to their function in later sections.

Parallel execution of the sections within a parallel block is represented by a Parallel Precedence Graph (PPG). Nodes in the PPG represent the sections in the parallel block with two
additional nodes, cobegin and coend. A Wait clause in a parallel block imposes wait dependences between the waiting section and the sections in the same parallel block specified in the Wait clause. The edges in the PPG (also called wait-dependence arcs) represent those wait dependences. Edges from cobegin have as their targets sections which wait upon no other sections; hence they may execute without synchronization when that parallel block is entered. Edges with the coend node as a target represent nodes which have no other nodes waiting upon them. Formally, a PPG is defined as a directed graph P = ⟨VP, EP, EntryP, ExitP⟩ where VP is a set of vertices, each representing a section in a parallel block, EP is the set of edges or wait-dependence arcs, EntryP is the cobegin node, and ExitP is the coend node. By definition of the language, the PPG must be acyclic. Each section in a parallel block is again represented by a PCFG S = ⟨VS, ES, EntryS, ExitS⟩ where EntryS marks the entry into that section and ExitS marks the exit from that section.

[Figure 3 shows the EFG for the example program of Figure 2. Gmain is the PCFG Entry → p, with p branching to the head node Hd (then branch) and to q (else branch); Hd → P1 → Tl → n, and n and q merge at r, which leads to Exit. The supernode P1 is expanded as PPGP1: cobegin → A, B; A → C; A, B → D; C, D → coend. Each section is a separate PCFG: Section A is EntryA → s, with s branching to t and u, which merge at v before ExitA; Sections B, C and D consist of the single nodes w, x and y, respectively, between their Entry and Exit nodes.]

Figure 3 EFG for the example parallel program
It is crucial to note that merge nodes in the PPG, such as the node D in PPGP1 (Figure 3), have semantics which are quite different from sequential merge nodes. A parallel confluence point is a true merge: each predecessor must execute. Section A and Section B must both execute, perhaps in parallel, before Section D may execute. A sequential merge, on the other hand, is actually a choice or selection from the possible incoming edges. At most one of the predecessors of any sequential merge will be executed.

The Extended Flow Graph (EFG) is the union of PCFGs and PPGs representing sequential control flow and parallelism for a single program unit. The distinguished PCFG corresponding to the program unit is called Gmain. We will talk about the set of nodes in an EFG, which is the union of all the nodes in all the PCFGs and PPGs in the EFG. Each node falls into one of several categories:

1. program Entry or Exit nodes,
2. basic block nodes,
3. supernodes corresponding to a parallel block,
4. cobegin or coend nodes in a PPG,
5. section nodes, and
6. section Entry or Exit nodes in a section PCFG.

The PCFG containing any basic block node X is designated GX, the section node corresponding to GX is designated SX, and PX corresponds to the supernode representing the PPG containing SX. The EFG for the parallel program in Figure 2 is shown in Figure 3. The parallel block (supernode) is represented by the node P1 in Gmain, which is in turn represented by PPGP1. Each section of the parallel construct is represented by a separate PCFG, as shown in the figure. For example, for basic block node x, Gx is the PCFG for Section C, Sx is the node C in PPGP1, and Px is P1.
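To make these node categories concrete, the following Python sketch shows one possible encoding of the EFG hierarchy. The class and field names are ours for illustration only; the paper's actual implementation, in the Nascent compiler, is discussed in Section 8.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class BasicBlock:
        name: str                          # e.g. "p", "s", "x"

    @dataclass
    class PCFG:
        entry: BasicBlock
        exit: BasicBlock
        nodes: List[object] = field(default_factory=list)   # BasicBlock | SuperNode
        edges: List[Tuple[object, object]] = field(default_factory=list)

    @dataclass
    class SectionNode:
        name: str                          # e.g. "A"
        body: PCFG                         # the PCFG for this section

    @dataclass
    class PPG:
        sections: List[SectionNode] = field(default_factory=list)
        wait_arcs: List[Tuple[str, str]] = field(default_factory=list)
        # cobegin/coend are the implicit source/sink of the wait-dependence DAG

    @dataclass
    class SuperNode:
        name: str                          # e.g. "P1"
        ppg: PPG                           # the parallel block it stands for
        head: Optional[BasicBlock] = None  # added head node
        tail: Optional[BasicBlock] = None  # added tail node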
5 Factoring

Control Flow Graphs capture information about the relative ordering of events. In particular, they allow us to determine those events that may happen before a specific event and those events that must happen before that event. In the sequential case, this is determined by analyzing all paths through the graph. To relate properties of parallel programs to well understood properties of sequential programs, we exploit the algebraic structure of parallelism to factor parallel programs, expressed as EFGs, into a set of CFGs representing all possible control flow paths through the program. For example, the EFG in Figure 3 can be factored into the CFGs in Figure 4. Questions of precedence can then be answered by doing conventional analysis of the sequential factors and asking if there is a sequential factor in which one event always precedes another (in sequential programs, the "must precede" relation is the dominance relation). In essence, a factor displays one of many sequential threads identified within the procedure.

While factoring is a useful technique for reasoning about the application of sequential techniques to parallel programs, it is computationally infeasible; the number of factors increases exponentially in the number of parallel segments in the program. In the sequel, factoring is used to argue that the efficient algorithms we develop for parallel program analysis are correct generalizations of sequential program analysis techniques.
Definition 5.1 For any basic block X in the EFG,

    dominators of X = ⋃_{G ∈ factored CFGs} (dominators of X in G).

Definition 5.2 For any basic block X in the EFG, the parallel dominance frontier (PDF) of X is defined as:

    PDF(X) = ⋃_{G ∈ factored CFGs} (DF(X) in G).

The EFG in Figure 3 has three factors, corresponding to the possible threads in the PPG for supernode P1, as shown in Figure 4. In the A-C factor p ≥ s, and in the B-D factor p ≥ w; by our definition, the precedence relation in any factor holds in the original program, thus p must dominate both s and w in the original EFG. Since s and w do not appear in any common factors, this definition also implies that neither dominates the other. In general, a basic block node in an EFG may have more than one immediate dominator (a set of immediate dominators), meaning the parallel dominance relation can no longer be represented by a dominator tree. However, another benefit of using the hierarchical EFG abstraction is that the dominator relation of each PCFG and PPG is a tree; this allows us to use proven techniques based on dominator tree traversal, such as building dominance frontiers.
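Definitions 5.1 and 5.2 can be read directly as an (intentionally naive) algorithm: enumerate every factor and take unions. The Python sketch below, under the simplifying assumption of a single non-nested parallel block, produces one thread per maximal cobegin-to-coend path of the PPG; it makes the exponential blow-up visible and motivates the PPF computation that follows.

    def ppg_threads(succs, node="cobegin"):
        """All cobegin-to-coend paths of an acyclic PPG, one per factor.
        succs maps each node to the targets of its wait-dependence arcs."""
        if node == "coend":
            return [["coend"]]
        return [[node] + rest
                for nxt in succs[node]
                for rest in ppg_threads(succs, nxt)]

    # Wait-dependence arcs of PPG_P1 in Figure 3.
    succs = {"cobegin": ["A", "B"], "A": ["C", "D"], "B": ["D"],
             "C": ["coend"], "D": ["coend"]}
    for thread in ppg_threads(succs):
        print(thread)
    # ['cobegin', 'A', 'C', 'coend']   -> the A-C factor
    # ['cobegin', 'A', 'D', 'coend']   -> the A-D factor
    # ['cobegin', 'B', 'D', 'coend']   -> the B-D factor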
5.1 Precedence relations and parallel precedence frontiers

Computing parallel dominance frontiers as defined above is not efficient, due to the cost of factoring. This section defines the precedence relation and the parallel precedence frontiers; the theorems that follow will then relate the parallel dominance frontiers to the parallel precedence frontiers. This relationship is important for results in the next section concerning iterated join sets.

Definition 5.3 For any two nodes X, Y, where X and Y are either nodes in the same PCFG, nodes in distinct PCFGs, or nodes in a PPG, we say X precedes Y if the execution of X must precede the execution of Y.

Since Definition 5.3 is based upon static analysis, precedence within an EFG is analogous to dominance within a sequential CFG. Thus, if node X always precedes node Y, even in the
[Figure 4 shows the three factors of the example EFG, one per thread of PPGP1. Each factor is a sequential CFG: the A-C factor threads Entry, p, Head, EntryA, s, t, u, v, ExitA, EntryC, x, ExitC, Tail, n, q, r, Exit; the A-D factor threads Section A followed by EntryD, y, ExitD; and the B-D factor threads EntryB, w, ExitB followed by Section D.]

Figure 4 Factors for the example parallel program
presence of some looping construct, we can apply Definition 5.3.
Definition 5.4 Given the following descriptions: PPFlocal(X) is the sequential dominance frontier of X, defined within GX; PPFlocal(X) is defined between nodes and supernodes in GX and does not consider nodes within supernodes; and PX is the supernode containing X, as defined earlier (if X ∈ Gmain then PX = ∅). We define the parallel precedence frontier of a basic block node or supernode X, denoted PPF(X), as follows:

    If X ≱ ExitGX then PPF(X) = PPFlocal(X).
    If X ≥ ExitGX then PPF(X) = PPFlocal(X) ∪ PPF(PX).

The definition of PPF is recursive: the case when X ≱ ExitGX corresponds to the base case of this definition, since this is when the recursive computation of PPF terminates. For example, in Figure 3 the sequential dominance frontier of node v (PPFlocal(v)) in the PCFG for Section A is { }. But the parallel precedence frontier of node v is PPF(v) = PPFlocal(v) ∪ PPF(P1) = {r}. Similarly, the parallel precedence frontier of nodes w, x and y is PPF(P1). This is because P1 represents the enclosing parallel block and PPF(P1) = PPFlocal(P1) = {r}.

The algorithm to compute the parallel precedence frontiers of all nodes, given the sequential dominance frontiers of each node in the PCFGs in the program, is given in Figure 5. If P̂ is the number of sequential sections of code in the parallel program, and N̂ and Ê are the maximum of the total number of nodes and edges, respectively, in the PCFGs corresponding to each of these sections, then the algorithm takes O(P̂(N̂² + Ê) + P̂N̂) time, which asymptotically
    PFRONT(Gmain)

    Procedure PFRONT(PCFG G)
      /* Compute dominance frontiers (DF) using the algorithm in [1] */
      for each node X in G do
        compute DF(X)
      enddo
      for each node X in G do
        PPF(X) ← DF(X)
        if X ≥ ExitG then
          PPF(X) ← PPF(X) ∪ PPF(PX)
        endif
        if X is a parallel block, PX, then
          for each section GS in PX do
            PFRONT(GS)
          endfor
        endif
      enddo
    end PFRONT

Figure 5 Algorithm PFRONT: Computing parallel precedence frontiers

reduces to O(P̂N̂²). The algorithm is called recursively for each section; the outer for loop is executed at most N̂ times for each section, and the worst case time to compute the sequential dominance frontiers is O(N̂² + Ê).
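A direct transliteration of PFRONT into Python looks as follows. The helpers `sequential_df` (the dominance frontier algorithm of [1]) and `dominates` are assumed rather than shown, as are the PCFG/SuperNode structures sketched in Section 4; `ppf` maps nodes to their frontier sets and is shared across the recursion.

    def pfront(g, ppf, enclosing=None):
        """Compute PPF for every node of PCFG g (Figure 5).
        `enclosing` is the supernode whose parallel block contains g,
        or None when g is Gmain."""
        df = sequential_df(g)                   # per-PCFG dominance frontiers
        for x in g.nodes:
            ppf[x] = set(df[x])
            if dominates(g, x, g.exit) and enclosing is not None:
                ppf[x] |= ppf[enclosing]        # X ≥ Exit_GX: inherit PPF(PX)
            if isinstance(x, SuperNode):        # recurse into each section,
                for section in x.ppg.sections:  # after PPF(PX) is available
                    pfront(section.body, ppf, enclosing=x)
        return ppf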
5.2 Equivalence between PDF and PPF

The theorems presented in this section establish the equivalence of the parallel precedence frontiers and parallel dominance frontiers defined above. Lemmas 5.1 through 5.6 are used in the proofs of the theorems. These lemmas establish the properties of PCFGs and factoring, and apply to any basic block node or supernode. In the following proofs, Xhead and Xtail refer to the head and tail nodes of the closest enclosing supernode for node X.
Lemma 5.1 In any factored CFG Gf such that X ∈ Gf, if X ≥ Xtail and Z ∈ DF(Xtail) then Z ∈ DF(X).

Proof: By definition of DF, Xtail ≥ P for some P ∈ Pred(Z). Therefore, by construction, X ≥ P. Also, since Xtail ≯ Z and any path from X to Z must include Xtail, X ≯ Z. Hence, Z ∈ DF(X). □
Lemma 5.2 If X ≥ Xtail and Y ∈ PDF(Xtail) then Y ∈ PDF(X).

Proof: By construction, X ∈ Gf ⇒ Xtail ∈ Gf, where Gf is a factored CFG. Hence, by Lemma 5.1 and the definition of PDF, PDF(Xtail) ⊆ PDF(X). □
Lemma 5.3 If Y ∈ PPFlocal(X) then for all Z ∈ {Y} ∪ Pred(Y), Z is not a supernode.

Proof: Y is not a supernode, since (by construction) a supernode has exactly one predecessor, the corresponding head node. If a predecessor of Y is a supernode then, by construction, its only successor is the tail node (i.e. Y), which has only one predecessor; this contradicts the assumption that Y ∈ PPFlocal(X). □
Lemma 5.4 For any node X representing a basic block, PPFlocal(X) ⊆ PDF(X).

Proof: Consider Y ∈ PPFlocal(X). By Lemma 5.3, Y represents a basic block (i.e. Y is not a supernode) and, by definition of PPFlocal(X), X ≥ Z where Z is some predecessor of Y, and X ≯ Y. By Lemma 5.3, Z is not a supernode. By construction and the definition of factoring, we know that if P and Q are two nodes that represent basic blocks in a PCFG, then P ≥ Q in the PCFG if and only if P ≥ Q in some factored CFG where P and Q are nodes. Therefore, X ≥ Z in some factored CFG where X and Z are nodes, and X ≯ Y in all the factored CFGs where X and Y are nodes. Therefore, by definition of parallel dominance frontiers, Y ∈ PDF(X). □
Lemma 5.5 If X ≱ ExitGX then PDF(X) = PPFlocal(X).

Proof: Let R1 be the set of basic block nodes that are reachable from Xtail in GX, including Xtail. Let R2 be the set of basic block nodes that occur in a section SY that is preceded in execution order by SX (the section corresponding to X), where SY and SX are sections in the same parallel block, i.e. SY wait-depends on SX. Finally, let R3 be the set of basic block nodes in a parallel block, PZ, nested within PX. The set R1 ∪ R2 ∪ R3 is the set of nodes that can potentially be reached from X and that are not in GX. The Extended Flow Graph in Figure 6 illustrates the sets R1, R2 and R3. In this figure, R1 = {Xtail ... Exit}; R2 is the set of all basic block nodes in the PCFG for section SY; R3 is the set of all basic block nodes in the PPG for PZ, i.e., the nodes representing basic blocks in each parallel section in PZ. Note that in the figure PZ is nested within PX and is a supernode in GX. In general, PZ can either be a supernode in GX or not. In either case, the head node corresponding to PZ, Zhead, is the unique predecessor of PZ. Therefore, by construction of factored graphs, any node that represents a basic block within the parallel block corresponding to PZ is not in PDF(X), i.e.

    ∀Z ∈ R3, Z ∉ PDF(X).    (1)

Also, by construction,

    X ≱ ExitGX ⇒ X ≱ Z, ∀Z ∈ (R1 ∪ R2).    (2)
[Figure 6 sketches an EFG illustrating the sets R1, R2 and R3: node X lies in section SX of parallel block PX; R1 contains Xtail and the nodes reachable from it in GX; R2 contains the basic block nodes of a section SY that wait-depends on SX; and R3 contains the basic block nodes of a parallel block PZ (with head node Zhead and tail node Ztail) nested within PX.]

Figure 6 Figure illustrating sets R1, R2 and R3
Claim 5.1 If X ≱ ExitGX then for all Y such that Y ∈ PDF(X), Y ∈ GX, i.e. Y is local.

Proof: Suppose Y is not local. Since Y ∈ PDF(X), X ≥ P for some P ∈ Pred(Y). If Y is not local, then P is local only if P is ExitGX, and we know that X ≱ ExitGX. If P is not local then P is not in GX; hence, by construction, P ∈ (R1 ∪ R2 ∪ R3). Consider the case when P ∈ (R1 ∪ R2). By equation (2), X ≱ P; clearly this contradicts Y ∈ PDF(X). If P is not local and P ∈ R3, then by construction Y is also a node in R3; therefore, by equation (1), we have a contradiction to Y ∈ PDF(X). Therefore, Y ∈ GX. □

Consider Y ∈ PDF(X). Then X ≥ P for some P ∈ Pred(Y) and X ≯ Y. Since X ≱ ExitGX, from the above claim, Y and its predecessors are local. From Lemma 5.3 we know that Y and its predecessors are not supernodes. By definition of PPFlocal(X), Y ∈ PPFlocal(X). Therefore, if X ≱ ExitGX,

    PDF(X) ⊆ PPFlocal(X).

By Lemma 5.4, PPFlocal(X) ⊆ PDF(X). Thus, PPFlocal(X) = PDF(X) if X ≱ ExitGX. □
Lemma 5.6 If X ≥ ExitGX then PPF(X) ⊆ PDF(X).

Proof: The proof proceeds by induction on the nest-level of the closest supernode containing X. The outermost parallel block is at nest-level 0. Let X occur at nest-level n.

Basis: When the nest-level n = 0, ExitGX corresponds to the exit node of Gmain by construction. Therefore, PPF(X) = PPFlocal(X). But, from Lemma 5.3, the nodes in PPFlocal(X) must be basic block nodes. Hence, by Lemma 5.4, PPF(X) = PPFlocal(X) is a subset of PDF(X).

Inductive Hypothesis: Assume PPF(Y) ⊆ PDF(Y) for all Y at nest-level < n.

Inductive Step: When nest-level = n. Define PDFunlocal(X) = PDF(X) − PDFlocal(X). We know that PDFlocal(X) = PPFlocal(X) and PPF(X) = PPFlocal(X) ∪ PPF(Xtail). Therefore, to prove that PPF(X) ⊆ PDF(X), it is sufficient to show that PPF(Xtail) ⊆ PDFunlocal(X). Since Xtail occurs at nest-level < n, by the inductive hypothesis PPF(Xtail) ⊆ PDF(Xtail). Thus, to verify the inductive step, we show that PDFunlocal(X) = PDF(Xtail). The proof of this statement follows from the proofs of the following two claims:

Claim 5.2 PDF(Xtail) ⊆ PDFunlocal(X).

Claim 5.3 PDFunlocal(X) ⊆ PDF(Xtail).

Proof of Claim 5.2: Since X ≥ ExitGX, by construction X ≥ Xtail. Therefore, from Lemma 5.2, PDF(Xtail) ⊆ PDF(X). By definition, ∀Z ∈ PDFlocal(X), Z ∈ GX, and by construction, ∀Z ∈ PDF(Xtail), Z ∉ GX (since EntryGX and ExitGX are the unique entry and exit nodes of GX). Therefore, PDF(Xtail) ∩ PDFlocal(X) = ∅. Hence, PDF(Xtail) ⊆ PDFunlocal(X). □

Proof of Claim 5.3: We know that ∀Z ∈ PDFunlocal(X), Z is not local. Therefore, considering the sets R1, R2 and R3 introduced in Lemma 5.5, Z ∈ R1 ∪ R2 (since Z ∉ R3 by equation (1)). Since X ≥ ExitGX, by construction X ≥ W for all W ∈ R2, and X ≥ Xtail. Since Z ∈ PDFunlocal(X), there exists P ∈ Pred(Z) such that X ≥ P and X ≯ Z (Z ≠ X). Therefore, Z ∉ R2 and Z ∈ R1 − {Xtail}.

Suppose Z ∉ PDF(Xtail). Then either (i) ∀P ∈ Pred(Z), Xtail ≱ P, or (ii) Xtail > Z.

Case (i): ∀P ∈ Pred(Z), Xtail ≱ P. Since X ≥ Xtail, we know by construction that X ≥ Q ⟺ Xtail ≥ Q, where Q is not local. This is because there are no jumps into or out of a parallel block, and EntryGX and ExitGX are unique entry and exit nodes. Therefore, using this property in case (i) (since Z is not local), X ≱ P; a contradiction to Z ∈ PDFunlocal(X).

Case (ii): If Xtail > Z, then by transitivity of the dominance relation, X > Z; a contradiction to Z ∈ PDFunlocal(X).

The contradictions in cases (i) and (ii) prove Claim 5.3. From Claims 5.2 and 5.3, PDFunlocal(X) = PDF(Xtail), completing the induction. □
Theorem 5.1 PPF(X) ⊆ PDF(X).

Proof: From Lemma 5.4 we know that PPFlocal(X) ⊆ PDF(X). Therefore, by definition of PPF(X), when X ≱ ExitGX, PPF(X) ⊆ PDF(X). By Lemma 5.6, PPF(X) ⊆ PDF(X) when X ≥ ExitGX. Therefore, by definition of PPF(X), PPF(X) ⊆ PDF(X). □
Theorem 5.2 PDF(X) ⊆ PPF(X).

Proof: From Lemma 5.5, PDF(X) = PPFlocal(X) if X ≱ ExitGX. Therefore, it is enough to prove that PDF(X) ⊆ PPF(X) when X ≥ ExitGX. The proof proceeds by induction on the nest-level of the closest supernode enclosing X.

Basis: When the nest-level n = 0, the ExitGX and EntryGX nodes correspond to the exit and entry nodes of Gmain. Therefore, PDF(X) = PPFlocal(X).

Inductive Hypothesis: Assume, for all Q at nest-level < n, PDF(Q) ⊆ PPF(Q).

Inductive Step: When nest-level = n (i.e. for X). We know, by definition, PPF(X) = PPFlocal(X) ∪ PPF(Xtail), and PDF(X) = PDFlocal(X) ∪ PDFunlocal(X). But PDFlocal(X) = PPFlocal(X) and PDFunlocal(X) ∩ PPFlocal(X) = ∅. Therefore, to prove the inductive step, it suffices to show that PDFunlocal(X) ⊆ PPF(Xtail). By the inductive hypothesis,

    PDF(Xtail) ⊆ PPF(Xtail).    (3)

From Claims 5.2 and 5.3 (Lemma 5.6), we know that PDFunlocal(X) = PDF(Xtail). Substituting this in equation (3), we have the required proof. □
6 Placement of φ-functions

In the case of sequential programs, φ-functions for a variable are located at all nodes that represent merge points in the program where the variable is defined along at least one of the incoming paths. These nodes are precisely the nodes in the set J⁺(S) (the iterated join set), where S is the set of nodes containing assignments to the variable. Cytron et al. [1] have proved the equivalence between J⁺(S) and DF⁺(S) for the sequential case.

We define join sets for parallel programs in two ways; Ju is defined by taking the union of the join sets of factors, while the parallel join set Jp is defined from basic principles. The Jp set is an extension of the join set in a sequential CFG to an analogous set of merge points based upon the structure of the EFG. This section proves that the definitions of Ju and Jp are identical, and establishes the equivalence between the iterated parallel join set (Jp⁺) and the iterated parallel dominance frontier (PDF⁺); by Theorems 5.1 and 5.2, this is equivalent to PPF⁺.
6.1 Join sets for factored CFGs

A φ-function for a variable needs to be placed at any point in a parallel program where two or more definitions of that variable merge along control flow paths that do not execute in parallel. Supernodes have to be considered in order to manage the paths that can execute in parallel. One way of defining the join set is by considering all possible sequential paths of execution, i.e. by considering the sequential CFGs derived by factoring the PCFG. The join set J({P, Q}) for nodes P and Q in any factor of a PCFG is defined as in the sequential case (since the factor here is a sequential CFG). Two nodes X and Y which have assignments to the same variable in a PCFG can both appear in more than one factor; thus we are interested in the union of the join sets of the factors containing X and Y. The set Ju({X, Y}) for any two nodes X and Y that represent basic blocks in the PCFG is defined as:

    Ju({X, Y}) = ⋃ J({X, Y}),

where the union is taken over all the factored CFGs G such that X, Y ∈ G.
6.2 Parallel join set for PCFGs

It is also possible to recursively define the parallel join set for nodes in a PCFG.

Base case: For any two nodes X and Y such that GX = GY, the set Jlocal({X, Y}) is defined as the join set of nodes X and Y within GX. Jlocal is defined just as in the sequential case, i.e. J({X, Y}).

Recursive case: If GX is not the same as GY, then at least one of X or Y is in a parallel block by construction.

Suppose X is enclosed in a parallel block, PX, but not Y. Here, Y is a node in Gmain and Jp({X, Y}) is defined as Jp({PX, Y}).

When both X and Y are enclosed within a parallel block, the closest supernodes enclosing X and Y, PX and PY, may be the same or different.

– PX and PY are different: Any downwards exposed definition within parallel block PY may reach points in the program outside the parallel block. If we assume PY to be at nest-level n and GY to be "nested within" GX, then clearly X is at a nest-level less than n, say m. In order to compute the join nodes, it is important to consider those points in the parallel program where the definitions at X and Y will converge, which is clearly at nodes at nest-level m. Therefore, the join of X and Y is equal to the join of X and PY, i.e. Jp({X, PY}). If the two sub-PCFGs are not nested, we must look at the supernodes, PX and PY. If both X and Y are at the same nest-level, the join set for X and Y is the join set of PX and PY, Jp({PX, PY}). If the nest-level of X is less than that of Y, then the join set for X and Y is Jp({X, PY}).

– The case when PX and PY are the same is analyzed on the basis of whether SX and SY (which represent sections in the parallel block) are related by the wait-closure relation (W⁺). Wait-closure is the transitive closure of the wait-dependence relation between section nodes, where A W⁺ B means that A is in the wait-closure set of B (B waits upon A). If there is no wait-closure relation between SX and SY, the join set Jp({X, Y}) is the empty set. This is the case when SX and SY are along paths in the parallel program that may execute concurrently. Consider the case when SX W⁺ SY. Since the Parallel Precedence Graph is a DAG, i.e. has no cycles, the join of X and Y is the same as the join of the EntryGY node and Y. Since EntryGY and Y are nodes in the same PCFG (namely GY), Jp({X, Y}) = Jlocal({EntryGY, Y}).

Now we prove the equivalence of the two definitions of the join sets in our parallel programs.
Theorem 6.1 For every pair of nodes X, Y representing basic blocks in a parallel program, Z ∈ Jp({X, Y}) iff Z ∈ Ju({X, Y}).

Proof: The proof of this theorem proceeds by considering the different cases in the definition of Jp and showing the equivalence between the two sets for each case. Details of the proof can be found in previous work [12]. □

We know for the sequential case [1] that the iterated join set J⁺(S) is equal to the iterated dominance frontier DF⁺(S). Since this equivalence is unaffected by the union operator, the result obtained by iterating the union of the join sets for each factored CFG (Ju) is equal to the iterated parallel dominance frontier set. The iterated parallel join set Jp⁺(S) for a variable v, where S is the set of nodes that have assignments to v, is precisely the set of nodes where φ-functions for v need to be placed. However, computing Jp⁺ can be very inefficient. We claim that the iterated parallel precedence frontier is equal to the iterated join set Jp⁺. The proof of this theorem uses the definition of parallel dominance frontier and Theorem 6.1.
Theorem 6.2 For a set S of assignment nodes, Jp⁺(S) = PPF⁺(S).

Proof: From Theorem 6.1, we know that Jp({X, Y}) = Ju({X, Y}). Therefore,

    ⋃_{X,Y ∈ S, X ≠ Y} Jp({X, Y}) = ⋃_{X,Y ∈ S, X ≠ Y} Ju({X, Y}).

This equivalence can be extended to the iterated joins, i.e.

    Jp⁺(S) = Ju⁺(S).    (4)

From the paper by Cytron et al. [1], we know that J⁺(S) = DF⁺(S). This equivalence again is unaffected by taking the union over all factored CFGs, i.e.

    Ju⁺(S) = PDF⁺(S).    (5)

From equations (4) and (5), Jp⁺(S) = PDF⁺(S). The proof of the theorem follows from Theorems 5.1 and 5.2, i.e. Jp⁺(S) = PPF⁺(S). □
6.3 Algorithm to place φ-functions

The algorithm to place φ-functions is a worklist algorithm, involving several modifications to the algorithm for placing φ-functions in sequential programs [1]. Essentially, a recursive call to the algorithm is made at each supernode, retaining the reaching definitions from outside the supernode, while propagating reaching definitions from within the supernode to the outer program block. Details are provided in Sections 8 and 9.
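The shape of that worklist is the familiar one from sequential SSA construction, sketched below in Python under assumed helpers (`ppf` as computed by PFRONT, and an `insert_phi` callback, both our names); the parallel-specific recursion at supernodes and the propagation through section Exit nodes are spelled out in Sections 8 and 9.

    def place_phi_functions(v, def_nodes, ppf, insert_phi):
        """Place phi-functions for variable v at PPF+(def_nodes)."""
        worklist = list(def_nodes)
        has_phi = set()
        while worklist:
            x = worklist.pop()
            for y in ppf[x]:
                if y not in has_phi:
                    insert_phi(v, y)       # a phi for v at frontier node y
                    has_phi.add(y)
                    worklist.append(y)     # the phi is itself a definition,
        return has_phi                     # so its frontier is processed too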
7 Multiple Parallel Updates

7.1 The ψ-function

In a Parallel Sections construct, when only one section updates a shared variable (or a variable is defined in more than one section but the definitions are chained together by wait clauses), the value of the variable after the parallel block is well-defined. However, since two parallel sections can update the same variable, more than one SSA name of the variable may reach assignments after the parallel block. Similarly, when two different array elements are updated in different parallel sections, the two "names" for the updated array must be merged to preserve the SSA properties. Since this merge does not arise from sequential control flow branching, we use a different merge operator to distinguish it from a φ-function.

Note that a parallel merge does not necessarily mean that the program is incorrect, is nondeterministic, or contains an anomaly. Instead, a parallel merge means that a variable might be updated in more than one parallel section, and the updated value may be used in code following the parallel block, regardless of which section performed the update. The updates may be controlled by mutually exclusive conditions, or may be to different elements of an array. We use the Parallel Precedence Graph to detect multiple updates. We introduce a new operator to the SSA representation, the ψ-function, to merge multiple parallel updates. Placement of ψ-functions must occur along with φ-placement, before the SSA renaming phase.
7.2 Placement of ψ-functions

At what points in a PPG do we need to merge reaching definitions? We need to merge definitions at the precedence section nodes in which these definitions first come together. However, as opposed to sequential control flow, a variable need not be defined along every path which reaches a confluence point; as long as it is defined along some path which reaches a use for that variable, a definition for that variable will be available. This also implies that a definition from one node can be killed in a PPG if any path exists to another node which contains a definition of the same variable. Thus, for PPGs, we define reaching definitions as follows:

Definition 7.1 (Reaching Definitions Within a PPG.) Within section nodes of a PPG, a definition of v at section node X reaches section node Y if no path from X to Y contains a definition of v, except at X or Y.

This important distinction between how information flows in sequential and parallel graphs suggests that identifying merge points as the iterated dominance frontier of the set of definition points in the PPG may not be correct. To see why, examine Figure 7(a). If v is only defined at node X, then any use of v at W, Z, or A will have that definition available, since X will always
have been executed before any of these other sections execute. But DF⁺(X) = {W, Z, A}; clearly a merge node is not necessary when only a single definition of a variable reaches any point. Since a PPG guarantees execution of all predecessors, we need not be concerned about a definition of v flowing from node Y. Since Y does not define v, it does not contribute to the reaching definitions of v for other nodes.
[Figure 7: two example PPGs. In (a), Entry precedes X and Y; W and Z each have both X and Y as predecessors; A has predecessors W and Z; v is defined only at X. In (b), X, Y and Z follow Entry and are the three predecessors of A; v is defined at X and at Y, and used at A.]

Figure 7
To see where merge operators are needed in a PPG, first examine Figure 7(b), in which v is defined in sections X and Y, and used in section A. Since both definitions reach A without either killing the other, a merge operator is needed at A. However, we need merge only two definitions, even though there are three predecessors. Thus, a merge function for a PPG only needs arguments for predecessors with definitions which reach the confluence node along that path. This highlights another major difference between sequential and parallel merges; therefore we use the ψ-function as the merge operator for reaching definitions within the PPG.

The ψ-function is similar to the φ-function in that it acts as a non-killing definition in terms of data-flow analysis, and is also a use for all definitions which reach the ψ-function via its arguments. By collecting multiple reaching definitions, the ψ-function linearizes definition chains within a PPG in the same manner as the φ-function within a CFG.

To identify precisely where to place ψ-functions in a PPG we begin with the definition of meet for nodes within PPGs. Meet is the dual of join in sequential CFGs with respect to universal and existential quantifiers.
Definition 7.2 The meet of nodes X and Y is

    M(X, Y) = { Z | ∀ ZX, ZY with ZX → Z and ZY → Z, and ∀ paths pX : X →⁺ ZX and pY : Y →⁺ ZY, pX ∩ pY = ∅ }.

For a set of nodes S, M(S) is defined in the usual pairwise manner: M(S) = ⋃_{X,Y ∈ S} M(X, Y). We also define M⁺(S) as the limit of an increasing sequence analogous to that used for join and dominance frontier:

    M¹(S) = M(S)
    M²(S) = M(S ∪ M¹(S))
    Mⁱ⁺¹(S) = M(S ∪ Mⁱ(S))

The definition of join, the basis of the work to place φ-functions [1], is a well-known concept. Although in a CFG J⁺(S) = J(S) [13], the dual definition of join for PPGs, meet, does not possess this property. Consider Figure 7(a). Let S = {X, Y}. Then M(S) = {W, Z}; in fact, A ∉ M(S), but A ∈ M(S ∪ M(S)) = M({X, Y, W, Z}) = {W, Z, A}.
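Because meet, unlike join, must be iterated, a from-first-principles computation is expensive. The following brute-force Python sketch implements Definition 7.2 and the iterated meet literally; path enumeration is exponential, so it is meant only to make the definitions executable on small PPGs such as Figure 7 (all function names are ours).

    from itertools import product

    def paths(succs, src, dst):
        """Node sets of all src-to-dst paths in an acyclic PPG."""
        if src == dst:
            return [frozenset([src])]
        return [p | {src}
                for nxt in succs.get(src, [])
                for p in paths(succs, nxt, dst)]

    def meet(succs, preds, nodes, x, y):
        """M(X,Y): nodes reached from both X and Y only via disjoint paths."""
        result = set()
        for z in nodes:
            pairs = [(px, py)
                     for p1, p2 in product(preds.get(z, []), repeat=2)
                     for px in paths(succs, x, p1)
                     for py in paths(succs, y, p2)]
            # z is a meet if values from x and y actually reach z, and
            # every way they do so uses disjoint paths
            if pairs and all(px.isdisjoint(py) for px, py in pairs):
                result.add(z)
        return result

    def iterated_meet(succs, preds, nodes, s):
        """M+(S): limit of M(S), M(S ∪ M(S)), ..."""
        result = set()
        while True:
            base = set(s) | result
            m = {z for a in base for b in base if a != b
                 for z in meet(succs, preds, nodes, a, b)}
            if m <= result:
                return result
            result |= m

On the PPG of Figure 7(a) this yields M({X, Y}) = {W, Z} on the first round and then adds A, reproducing M⁺(S) = {W, Z, A}.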
The meet of two nodes possesses one of the important properties which characterize nodes in a PPG: it is unaffected by transitive edges. To prove this claim, we first formalize the concept of a transitive edge as follows:

Definition 7.3 Edge E: X → Y added to graph G is a transitive edge if there exists Z ∈ G such that X → Z →⁺ Y.

We now show that a central concept of PPGs, path reachability, is unaltered in the presence of transitive edges.
Theorem 7.1 Path reachability in a PPG is unaffected by transitive edges.

Proof: Consider PPG G′, consisting of G plus transitive edge E: X → Y. Since all edges in G exist in G′, if A reached B in G, A reaches B in G′. Now, let A reach B in G′, but assume that A does not reach B in G. Then the path p₁: A →⁺ B in G′ must include E, else no distinction is possible between paths in G and G′. Thus, p₁ must be of the form A →∗ X → Y →∗ B. By Definition 7.3, X → Z →⁺ Y in G. Thus, the path p₂: A →∗ X → Z →⁺ Y →∗ B exists in G. By contradiction, we have demonstrated equivalence of path reachability between G and G′. □

Next, we demonstrate that the meet of a set of nodes is also unaffected in the presence of transitive edges.
Theorem 7.2 The meet relation is insensitive to transitive edges.

Proof: We use G and G′ as defined in the proof of Theorem 7.1, except that E is any transitive edge added to G. We first show that for nodes X and Y, M(X, Y) in G is equal to M(X, Y) in G′, by means of double inclusion.

1. Let Z ∈ M(X, Y) in G. We show that Z ∈ M(X, Y) in G′. By Definition 7.2 for meet, the intersection of all pairs of paths in G from X and Y to predecessors of Z is empty. Now consider G′, which includes edge E: A → B. Assume Z ∈ M(X, Y) in G, but Z ∉ M(X, Y) in G′. Then in G′ there exists a node V such that V →⁺ Z with X →⁺ V and Y →⁺ V. If no path from X to V or from Y to V passes through A, then V does not exist, since the only difference between G and G′ is edge E. Thus, without loss of generality, at least X, and perhaps Y, has a path to V which passes through A. But from A, no nodes are reachable in G′ that were not reachable in G, as Theorem 7.1 demonstrated. Hence, if V exists in G′ it exists in G, since its existence is predicated upon reachability. We conclude that since V does not exist in G it cannot exist in G′. By contradiction, Z ∈ M(X, Y) in G′.

2. Let Z ∈ M(X, Y) in G′. By Definition 7.2 the intersection of all pairs of paths from X and Y to predecessors of Z is empty in G′. Since the edges in G are a subset of the edges in G′, any pair of paths from X and Y to predecessors of Z which exists in G exists in G′, and is empty in G′ by assumption. Thus, that pair of paths is empty in G, and Z ∈ M(X, Y) in G.

Now consider M(S), where S is a set of nodes. Since M(S) = ⋃_{X,Y ∈ S} M(X, Y), we apply the property just proved to each pair X, Y to obtain the desired result for S: M(S) in G equals M(S) in G′. □

Given S, a set of nodes in a PPG which define variable v, merge operators for PPGs (ψ-functions) need to be placed at the iterated meet of S. A ψ-function for v at section node A collects all definitions of v that reach A. That is, there is an argument of the ψ-function for each predecessor of A that has a definition of v reaching A. Figure 8 shows the case where even though an edge exists from a definition of v (in node A) to the confluence node N, the ψ-function placed at N will only collect the definitions from nodes B and C. That is because the definition at A gets killed by the definition at B in this PPG. In this case, S = {A, B, C} and M(S) = M⁺(S) = {N}, and we note that the edge A → N is a transitive edge, with M(B, C) = {N}.
[Figure 8: a PPG in which v is defined at nodes A, B and C; A precedes B and C, and A, B and C are all predecessors of the confluence node N, the edge A → N being transitive.]

Figure 8
How do ψ-functions affect reaching definitions in a PPG? If node N is reached by ψ-function s, and s is reached by definition d (where s collects d as a ψ-argument), then d reaches N indirectly via a ψ-function. In general, it may be that one or more ψ-functions lie on the path from d to N. In that case, d reaches N indirectly via a ψ-chain. Thus, a definition or ψ-function in a PPG which reaches node N in the sense of Definition 7.1 is called a direct reaching definition, whereas a definition which reaches node N via a ψ-chain is called an indirect reaching definition.

We now prove the following important results: placing ψ-functions at the iterated meet of the set of nodes which define a variable (1) maintains the property of unique reaching definitions, (2) collects all definitions (directly or indirectly) which could reach each node before ψ-function placement, and (3) uses the minimal set of nodes at which to place such merge operators. We will first prove that it is sufficient to place ψ-functions at M⁺(S). The concept of iterated meet is a refinement of the ψ-function placement method suggested earlier [14], in that the iterated meet is smaller and, in fact, the minimal set.
Theorem 7.3 In a PPG with ψ-functions for v placed at M⁺(S), all uses of v within node N will be reached (in the sense of Definition 7.1) by exactly one definition (including ψ-functions) of v.

Proof: Let G be a PPG before placing ψ-functions, and G∗ be the same graph after ψ-function placement. Within G, let the set of nodes with definitions of v be S, let S′ ⊆ S be the set of nodes in S whose definitions reach N, and let T ⊆ S be the set of nodes in S with paths that reach N.

Existence. We first show that any use which had at least one reaching definition in G has at least one reaching definition in G∗. Since S′ ≠ ∅, let A ∈ S′ in G. Then all paths pA: A →∗ Z, with Z → N, contain no definitions of v (except at A). For all pA in G∗, if the definition of v in A does not directly reach N, then there must be at least one ψ-function along some pA. In this case, at least one ψ-function reaches N.

Uniqueness. We consider cases:

(i) Only one W ∈ S′ reaches N (|S′| = 1). In this case, the last definition in W kills any other definitions which may exist in nodes of T. Then, for all t1, t2 ∈ T, there exist paths pt1: t1 →⁺ N and pt2: t2 →⁺ N such that W ∈ pt1 ∩ pt2. Thus, N ∉ M(T), and more generally, no node on any pW: W →⁺ N is in M(T) (except, perhaps, W). Repeating this argument, no node on any pW − {W} is in M⁺(T). Thus, in G∗ only the definition of v in W reaches N, since no additional definitions (ψ-functions) created in G∗ can directly reach N.

(ii) Multiple definitions of v reach N from S′, with N ∈ M⁺(S). Then a ψ-function will be placed at the beginning of node N in G∗, and uses of v within N will be reached by that ψ-function.

(iii) Multiple definitions of v reach N from S′, with N ∉ M⁺(S). Assume N is reached by more than one definition from members of S ∪ M⁺(S); call this set R1. Then either (a) N ∈ M(R1), which contradicts our assumption, or (b) for all A, B ∈ R1, M(A, B) is non-empty (since A and B reach N), and we call R2 the set of all elements of M(R1) which have paths that reach N. Repeating this process, we note that the sequence R must converge to a set R⁺, since Rn+1 is always composed of nodes closer to N, along the paths from nodes in R1 to N, than the nodes in Rn. If the set R⁺ consists of exactly one node (it cannot be zero, by the existence proof) we have a contradiction of our assumption, and if it contains more than one node (which cannot include N, by assumption) R has not converged. This result holds as long as G contains a finite number of nodes.

Thus, in all cases, we have shown that in G∗ there is precisely one reaching definition for each use which had at least one reaching definition in G. □

We now show that ψ-functions correctly collect reaching definitions.
Theorem 7.4 Within a PPG with ψ-functions placed at M⁺(S), any use of v at node N will be reached directly or indirectly by all definitions of v which reached N before placing ψ-functions.

Proof: Let G be a PPG before placing ψ-functions, and G∗ be the same graph after ψ-function placement. We consider two cases:

(i) Only one definition of v reaches N in G. This case is handled similarly to case (i) in Theorem 7.3, and the single definition which reached N in G will reach N in G∗.

(ii) Multiple definitions of v reach N in G. All definitions of v from a node A which reach N in G reach N indirectly in G∗ via a ψ-chain. To show this, consider all paths p from A to N in G. By Definition 7.1 no nodes in p (with the possible exception of the first and last node on any p) contain definitions of v. Since, by Theorem 7.3, a single definition of v in G∗ reaches N from A, by Definition 7.1 there must exist a definition along some path from A to N which did not exist in G. That definition can only be a ψ-function. If there is just one ψ-function along the path then it collects all definitions which reach it, and that ψ-function will now reach N, resulting in the definition of v reaching N indirectly. If there is more than one ψ-function along any path from A to N, the argument is repeated. By induction, a definition of v in A will reach N via a ψ-chain.

Thus, we have shown that reachability of all definitions is maintained when placing ψ-functions. □

We now show that M⁺(S) is the minimal set at which to place ψ-functions.
Theorem 7.5 Within a PPG, for a set of nodes S which define v, M⁺(S) is the smallest set at which to place ψ-functions in order to ensure unique reaching definitions at all nodes.

Proof: Given S for variable v, consider any element N ∈ M⁺(S). Let N ∈ Mʲ(S) for the minimum j ≥ 1. Then there exist X, Y ∈ Mʲ⁻¹(S) (where M⁰(S) = S) such that all pairs of paths from X and Y to predecessors of N are disjoint. Since X and Y contain definitions of v (either assignments to v or ψ-functions for v), the definitions at X and Y (or, perhaps, a later definition of v within some node along one of these disjoint paths) both reach N. Thus, by removing the ψ-function for v at N, any use of v within N would be reached by multiple definitions. □

Due to the expense of calculating the iterated meet from first principles, the algorithm we use actually places ψ-functions at the iterated dominance frontier; spurious ψ-functions are removed during the renaming phase. For efficiency and implementation purposes, we would like to place ψ-functions at DF⁺(S) of the transitive reduction of the PPG. To understand why, examine the code fragment and its associated graph in Figure 9. Only the definition of x from Section C should reach Section D; Definition 7.2 correctly identifies this case. However, Section D is in the dominance frontier of both Sections A and C. Thus, a naive implementation would incorrectly place a ψ-function at Section D with two reaching definitions. The problem is caused by the redundant, transitive requirement that Section D wait upon Section A. A transitive reduction of the PPG would remove the wait-dependence arc from A to D, eliminating an unnecessary ψ-function argument. For DAGs, transitive closure can be computed in O(N² + E) time, and transitive reduction in worst case O(N²) time.
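A transitive reduction of a DAG can be computed by deleting exactly those edges that Definition 7.3 identifies, i.e. edges u → v for which v remains reachable from u some other way. A simple Python sketch (ours; faster quadratic-time variants exist, as noted above):

    def transitive_reduction(nodes, succs):
        """Drop every transitive edge (Definition 7.3) from an acyclic graph.
        By Theorem 7.1 this preserves path reachability, so edges may be
        tested against the original graph in any order."""
        def reachable_avoiding(u, v):
            # Is v reachable from u without taking the edge u -> v directly?
            stack, seen = [u], set()
            while stack:
                n = stack.pop()
                for m in succs.get(n, []):
                    if (n, m) == (u, v) or m in seen:
                        continue
                    if m == v:
                        return True
                    seen.add(m)
                    stack.append(m)
            return False

        return {u: [v for v in succs.get(u, []) if not reachable_avoiding(u, v)]
                for u in nodes}

    # The wait-dependence graph of Figure 9: the arc A -> D is transitive.
    succs = {"A": ["C", "D"], "B": ["D"], "C": ["D"], "D": []}
    print(transitive_reduction(["A", "B", "C", "D"], succs))
    # {'A': ['C'], 'B': ['D'], 'C': ['D'], 'D': []}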
    ...
    Parallel Sections
      Section A
        x = ...
      Section B
      Section C, Wait(A)
        x = ...
      Section D, Wait(A, B, C)
        ... = x
    End Parallel Sections
    ...

[The associated wait-dependence graph: Entry precedes A and B; A precedes C; A, B and C all precede D, so the arc from A to D is transitive; D precedes Exit.]

Figure 9 Code and wait-dependence graph to illustrate transitive edges
8 Adjustments for Implementation

Implementation of φ- and ψ-functions requires propagating definitions within each section to the enclosing parallel block, as they must be considered potential reaching definitions for code in an outer PCFG. Thus, a consistent mechanism is needed that correctly handles this definition propagation. Within our implementation we have added to each PCFG an extra edge from Entry to Exit, called the slice edge. This edge is necessary for the proofs in the original SSA transformation algorithms dealing with control dependence [1], and it is also needed for the construction we provide to support the SSA extensions for explicit parallelism. Conceptually, this edge represents a conditional within each Entry node as to whether the CFG will be executed.

The key idea behind a slice edge within each PCFG is that each section Exit node will always be in the iterated dominance frontier of all nodes within that section (except the section Entry and Exit nodes). Thus, any variable definition within a section will always have a φ-function for that variable placed at the local Exit node. The φ-function at Exit also contains the last definition for each variable within that local PCFG. In this way, it is easy to propagate any reaching definition from within a section: just examine the local Exit node for φ-functions. These φ-functions are instrumental in the correct propagation of definitions within the algorithms given in Section 9. On the other hand, ψ-functions within sections, as will be explained in more detail later, are placed at a section Entry node, and thus will not be propagated beyond that section.

We also place a slice edge within a PPG, from the cobegin node to the coend node. This is necessary in a case such as Figure 10, where the definition of x in Sections A and B needs to be propagated outside the parallel block. This is accomplished by slice edges, which locally ensure that every section Exit node is in the iterated dominance frontier of each local PCFG node, and similarly that the coend node is in the iterated dominance frontier of each section node. All variables defined within a section will thus result in a φ-function being created at the section Exit, which, when passed to the corresponding section node, will be propagated to a ψ-function in the tail node of the enclosing supernode via the coend node. Without the slice edge, the coend node in Figure 10 would not be in the iterated DF of either Section A or Section B, making the propagation of the merge function from Section C to the enclosing PCFG more difficult. We now restate the definition of parallel precedence frontier to accommodate the additional edge from Entry to Exit.
Definition 8.1 For any basic block node or supernode X in the EFG, PPF(X) is defined as follows:

    If ExitGX ∉ PPFlocal(X) then PPF(X) = PPFlocal(X).
    If ExitGX ∈ PPFlocal(X) then PPF(X) = PPFlocal(X) ∪ PPF(PX).
[Figure 10: a parallel block whose PPG (with a slice edge from cobegin to coend) contains Sections A, B and C; Sections A and B each define x, and the slice edges within their PCFGs place a merge x = φ(...) at ExitA and ExitB, which must then be propagated out of the parallel block.]

Figure 10 Necessity for the slice edge within a PPG

The only difference between the parallel precedence frontiers computed using Definition 8.1 and those computed using Definition 5.4 is that the former set includes Exit nodes, because of the edge from Entry to Exit in each PCFG. To prove the correctness of Definition 8.1 and to show that this definition can be used to place φ-functions, we prove the following theorem, where PPF1 refers to the parallel precedence frontiers computed using Definition 5.4 and PPF2 refers to those computed using Definition 8.1. Similarly, we denote the PPFlocal corresponding to Definitions 5.4 and 8.1 as PPF1local and PPF2local, respectively.
Theorem 8.1 Let E be the set of Exit nodes in all the PCFGs in an EFG. Then PPF1(X) = PPF2(X) − E for any basic block node or supernode X in the EFG.

Proof: The proof proceeds by induction on the nest-level of the closest enclosing parallel
block. The proof is also based on the fact that X dominates Exit_GX in the EFG (G1) that does not have an edge from Entry to Exit if and only if Exit_GX ∈ PPF2_local(X) in the EFG (G2) that includes this edge.†

† We assume by construction that an exit node Exit_GX has exactly one predecessor that is not equal to Entry_GX.
Basis: the nest-level of X is 0. This corresponds to the outermost PCFG, i.e., Gmain; there is no enclosing parallel block. Hence, if X does not dominate Exit_GX in G1, then PPF1(X) = PPF1_local(X), and correspondingly Exit_GX ∉ PPF2_local(X) in G2, so PPF2(X) = PPF2_local(X) = PPF1_local(X).
Suppose instead that X dominates Exit_GX in G1 and Exit_GX ∈ PPF2_local(X) in G2. Then, since there is no closest enclosing supernode, PPF2(X) = PPF2_local(X) = PPF1_local(X) ∪ {Exit_GX} = PPF1(X) ∪ {Exit_GX}. In either case, PPF1(X) = PPF2(X) − E.

Inductive Hypothesis: Assume that for all X at nest-level less than n, PPF1(X) = PPF2(X) − E.

Induction Step: X is at nest-level n.

Case 1: Suppose X does not dominate Exit_GX in G1 and Exit_GX ∉ PPF2_local(X) in G2. Then PPF1(X) = PPF1_local(X) and PPF2(X) = PPF2_local(X) = PPF1_local(X). Therefore, PPF1(X) = PPF2(X) − E.

Case 2: Suppose X dominates Exit_GX in G1 and Exit_GX ∈ PPF2_local(X) in G2. Then PPF1(X) = PPF1_local(X) ∪ PPF1(P_X). But P_X is at nest-level less than n by construction. Therefore, by the inductive hypothesis,

PPF1(P_X) = PPF2(P_X) − E.   (6)

But PPF2(X) = PPF2_local(X) ∪ {Exit_GX} ∪ PPF2(P_X). Therefore,

PPF2(X) − E = (PPF2_local(X) − E) ∪ ({Exit_GX} − E) ∪ (PPF2(P_X) − E).   (7)

Using equation (6) in equation (7), PPF2(X) − E = PPF1_local(X) ∪ ∅ ∪ PPF1(P_X), i.e., PPF2(X) − E = PPF1(X). □
Placement of φ-functions is done according to the theorems in Section 6. The key result of that section, Theorem 6.2, shows that Jp⁺(S) = PPF⁺(S). The proof used the fact, from Section 5, that the PDF and the PPF are equivalent. With the introduction of the slice edge, this result is no longer strictly true. However, Theorem 6.2 actually depends only upon the somewhat weaker result that PDF⁺(S) = PPF⁺(S). This relationship is maintained with the addition of slice edges, hence the theoretical results proved in Sections 5 and 6 retain their validity.

The φ-functions are placed in the iterated parallel precedence frontier set, in theory taking time proportional to the number of ordinary assignments and φ-functions, plus the total number of relevant parallel precedence frontiers, to process a single variable. In practice, we do not actually calculate PFRONT for each node. Instead, if Exit_X is in the iterated dominance frontier for node X in a parallel section, then both P_X and S_X are added to the worklist. This technique, which is essentially a dynamic implementation of the PPF⁺ set for each node, also eliminates potentially redundant computation by not calculating PFRONT independently for each node before the renaming phase of Algorithm 2, given in Section 9.
9 Algorithms and Correctness

In this section, we first discuss several necessary details of the implementation. Next, we present the complete algorithms to insert φ- and ψ-functions and to correctly create and generate the proper reaching definitions as arguments. We also demonstrate the correctness and safety of these algorithms, which have been successfully implemented in our restructuring compiler, Nascent [15]. We close this section with a few reflections on the implementation.
9.1 An Introduction to Implementation

In our implementation, the compiler finds, for each variable, the set of nodes in each PCFG or PPG where the variable is assigned. A section node in the PPG has an assignment to the variable if the section assigns the variable. Function (φ or ψ) placement is done jointly, with the function type distinguished by the type of node at the confluence point: a supernode or section node tells us that a ψ-function is required. Our method can result in a ψ-function being initially placed at a PPG join point reached by only one definition from within the parallel block. This can occur in two ways: a variable is defined within a section node that has only one section successor, or a variable is defined within a supernode such that only one definition of that variable reaches outside the parallel block. Although this means that the ψ-function will have only a single argument, it is a necessary step for renaming, as will be detailed in Section 9.2.
9.2 Contrasting φ- and ψ-functions

It has already been noted that placing ψ-functions at the DF of nodes in the PPG may result in only a single argument to such a function. Although every predecessor will have a reaching definition for each variable (we always add an initial definition at program Entry), we do not want to include a reaching definition from outside the current parallel block as an argument to a ψ-function; that will always be the default reaching definition if none exists from wait-dominance ancestors. When renaming arguments to φ-functions, we know how many arguments must be filled in with appropriate definitions: the number of predecessors. But with ψ-functions, we only fill in arguments as needed. There will be at least one ψ-argument, since the presence of the function tells us that a definition exists in another section node of this parallel block. But there may be just that one argument to a ψ-function. If that is the case,
a singleton-ψ is created, which serves a special purpose. As explained in Section 4, there is a crucial distinction to be made between φ- and ψ-functions. A φ-function is a variable assignment; the choice of assignment is given by the predecessor number of the path taken to reach the merge. A ψ-function, on the other hand, reports anomalous or multiple updates. Hence, ψ-arguments are not necessary for each predecessor of a section node, only for those in which an update occurs for a given variable. This means that the order of the arguments of a ψ-function is not important, since there is not a one-to-one relationship between arguments and predecessors. Confluence points in the PPG do not represent different possible paths for the sections (as they would in a sequential CFG), since all parallel sections are executed. Rather, they identify those parallel sections which must be executed before the merge section (hence altering the copy-in status for all variables in the merge section), and these are sections which may re-define variables whose definitions reach beyond the merge section. Thus, even if there is a singleton-ψ, its reaching definition is critical, since it must be propagated to other sections which may wait upon the confluence point. This is reflected in the complete algorithm, where in this case we propagate the reaching definition of the argument rather than the ψ-function itself. Once it has served its purpose, we can delete a singleton-ψ, since we have discovered that only one definition reaches this merge point from within the parallel section. Singleton-ψ's are essentially used as a temporary holding pen for single reaching definitions between explicit parallel sections. Propagating the argument's reaching definition in this case also eliminates redundant links to names, which can otherwise arise. Consider the example program from Figure 2, shown in its SSA form in Figure 18. Section D uses variables a, b, c,
and f to redefine c. Variable a's reaching definition comes from outside the parallel block, and f's reaching definition comes from the ψ-function at Entry_D, which merges the definitions from Sections A and B. But b and c have their reaching definitions propagated from single wait-predecessor sections, A and B respectively. To correctly propagate these values to D, a ψ-function (call it b′ = ψ) is created in Entry_D for b (and likewise for c). Yet if b′ = ψ(b5) were treated as a normal definition, b′ would be pushed onto the stack of definitions for b. When used in the new generation of c, a pointer to b′ would be inserted, which only points to b5. Thus, b′ would be just another link to b5, which is redundant. The ψ-function was necessary to propagate the correct reaching definition of b to Section D, but after visiting all sections which may wait upon this definition of b, b′ = ψ can be deleted. Notice, also, that in the ψ-function creation phase, the variable generations of b at Sections A (b5) and D (b′) will create another ψ-function at tailP1 with arguments b5 and b5. By eliminating redundant links this duplication is detected, reducing this ψ-function to a singleton, and hence propagating the correct reaching definition of b to the rest of the program before the singleton is deleted.
9.3 Depth-first Renaming

Computing the iterated meet seems somewhat impractical from its definition. After placing
φ-functions, a technique known as renaming transforms each variable definition into a unique name and each use into the name of its unique reaching definition [1]. The method employed to perform this renaming is depth-first, in that it recursively traverses the dominator tree in depth-first order, keeping a stack of current definitions for each variable. The key property that this renaming scheme satisfies is that at each node, the correct "current" definition (an original definition or φ-function) of each variable is the most recent definition on the depth-first path to this node from Entry, i.e., the definition on top of the definition stack [1, Lemma 10]. In fact, a depth-first traversal of any spanning tree of the CFG will also satisfy this property.
Unfortunately, a depth-first traversal of the nodes of a PPG will not satisfy this key property with merge operators at M⁺(S). For instance, in Figure 11, no ψ-function is needed at node C for either x or y, since only one definition of each variable reaches node C (in the sense of Definition 7.1). Suppose the depth-first traversal of the PPG visits node C after node A; when visiting node C, the current definition of variable x will be the definition in A, but the current definition of variable y will be wrong.

[Figure 11: a PPG in which Section A defines x, Section B defines y, and Section C reads both x and y.]
9.4 Efficient Implementation

What method can be used that is relatively efficient and yet correctly propagates information between the section nodes of a PPG? We need to look more closely at how information flows between nodes in a PPG, keeping in mind that a precedence graph has different semantics from a CFG. Since information flowing through the PPG is described in terms of reachability, we have found the concept of a reaching frontier useful. This concept describes reachable nodes in a PPG in a way that is analogous to the dominance frontier for nodes within a CFG.
Definition 9.1 The reaching frontier of X is
RF(X) = {Z | X reaches a predecessor of Z, but X does not reach all predecessors of Z}.
The reaching frontier of a set S, RF(S), is defined to be the union of the reaching frontiers of all elements of S, i.e., RF(S) = ∪_{X ∈ S} RF(X). The iterated reaching frontier, RF⁺(S), is defined similarly to that for join, meet, and dominance frontier.

The reaching frontier is used to relate important properties between the meet and the dominance frontier. To implement the placement of operators which merge information within a PPG, we would like to show that M⁺(S) ⊆ RF⁺(S) ⊆ DF⁺(S). How are the meet and the reaching frontier related? The analogous relations in sequential CFGs, join and dominance frontier, are shown to be equal when iterated, with the provision that Entry ∈ S. However, Entry adds no information to either the meet or the reaching frontier in a PPG. M(Entry, X) = ∅ for all X, since Entry reaches all nodes, and thus there is always a path from Entry to any node on any path from X. Also, RF(Entry) = ∅, since Entry reaches all predecessors of all nodes. We can also show that RF⁺(S) ≠ M⁺(S) in general: simply choose the set T = {X, Entry}. Then M(T) = ∅, so M⁺(T) = ∅, while RF(T) clearly may not be empty. We now show that in general M⁺(S) ⊆ RF⁺(S).
Theorem 9.1 M⁺(S) ⊆ RF⁺(S).

Proof: Let Z ∈ M(S). Then there is a node X ∈ S such that X has a path that reaches a predecessor of Z, but X cannot reach all predecessors of Z, or else there would be no path from any other node that did not intersect some path from X to each predecessor of Z (which would imply that Z ∉ M(S)). So we have Z ∈ RF(X), and hence Z ∈ RF(S). Finally, M(X) ⊆ RF(X) implies M⁺(X) ⊆ RF⁺(X). □
We also show that DF(S) is in general neither a subset nor a superset of RF(S). In Figure 12, DF(X) = {A, Z}, but RF(X) = {Z}, since X reaches all predecessors of A.

[Figure 12: a CFG with nodes Entry, X, Y, W, Z, and A, in which DF(X) = {A, Z} but RF(X) = {Z}.]

It is also easy to find a graph where X reaches a predecessor of Z but does not dominate any predecessor of Z, so that Z ∈ RF(X) but Z ∉ DF(X). Next, we show that the iterated dominance frontier is a superset of the iterated reaching frontier on all graphs.
Theorem 9.2 DF⁺(S) ⊇ RF⁺(S).

Proof: It has been shown [1, Lemma 4] that for any node Z that X reaches, some node Y ∈ {X} ∪ DF⁺(X) dominates Z. Now, for any node Z that X reaches, if Z is in RF(X), then Z is in DF⁺(X); this is because some node in DF⁺(X) must dominate Z. Choose a path p from X to Z. Let Y be the last node on p in {X} ∪ DF⁺(X); Y must dominate Z. If Y is not Z, then Y dominates all predecessors of Z, so there is a path from Y to all predecessors of Z; thus there is a path from X to all predecessors of Z, and Z is not in RF(X). Thus, RF(X) ⊆ DF⁺(X), and therefore RF(S) ⊆ DF⁺(S). RF²(S) = RF(S ∪ RF(S)) ⊆ RF(S ∪ DF⁺(S)) ⊆ DF⁺(S ∪ DF⁺(S)) = DF⁺(S). By induction, RF⁺(S) ⊆ DF⁺(S). □

In general, RF⁺(X) ≠ DF⁺(X). Although DF⁺(X) ⊇ RF⁺(X), the converse containment does not necessarily hold. Consider Figure 13: DF(X) = {B, Z} and DF²(X) = DF⁺(X) = {B, Z, X}. However, RF(X) = {B} and RF²(X) = RF⁺(X) = {B, X}. Thus, by counterexample, RF⁺(X) ⊉ DF⁺(X).

[Figure 13: a cyclic graph with nodes Entry, X, B, and Z; Z ∈ DF⁺(X) but Z ∉ RF⁺(X).]

But we note that the above example contains a cycle. We are interested in placing ψ-functions in a PPG, which we know to be acyclic. And in a DAG, we next show that RF⁺(S) = DF⁺(S).
Theorem 9.3 In a DAG, RF⁺(S) = DF⁺(S).

Proof: Given a DAG, we demonstrate two preliminary lemmas.

Lemma 9.1 RF⁺(S) ⊇ DF(S).

Let X ∈ S and let Z be in DF(X). Then X dom A, a predecessor of Z. X does not dominate B, some other predecessor of Z, since X does not strictly dominate Z. If X does not reach B, then Z is in RF(X). So assume that X reaches B. We now show that on some path from X to B there exists a C such that C is in RF(X). Since X does not dominate B, consider a path from Entry to B such that X is not on the path (there must be at least one such path). Let C be the first node on this path that X can reach (C may be B). Then, since X can reach C but not the predecessor of C on this path, C is in RF(X). Next, note that C cannot reach A; otherwise we would have a path Entry → C → A (which cannot go through X, since the graph is acyclic) that does not pass through X, contradicting the fact that X dom A. But this means that Z is in RF(C), since C reaches Z through B but cannot reach A. We already know that C is in RF(X), so we have shown that Z is in RF⁺(X). □

Lemma 9.2 RF⁺(S) ⊇ DF⁺(S).

Given Lemma 9.1, we know that RF⁺(X) ⊇ DF(X). So RF⁺(S) ⊇ DF(S). DF²(S) = DF(S ∪ DF(S)) ⊆ RF⁺(S ∪ RF⁺(S)) = RF⁺(S). By induction, DF⁺(S) ⊆ RF⁺(S). □

Lemma 9.2 together with Theorem 9.2 gives our result. □

Since M⁺(S) ⊆ RF⁺(S) ⊆ DF⁺(S) (with RF⁺(S) = DF⁺(S) in a DAG), we have shown that placing ψ-functions within a PPG at DF⁺(S) is a safe approximation to the somewhat smaller set M⁺(S). However, for the common depth-first implementations which use renaming, placing merge operators at DF⁺(S) may well be the method of choice. How conservative is the use of DF⁺(S) as an approximation for M⁺(S)? First, if there is only one member of S, then M⁺(S) will be empty, while DF⁺(S) will usually not be. Second,
DF⁺(S) assumes a definition lies along all possible paths. Thus, in the case of Figure 14, where S = {A, C}, M(S) = M⁺(S) = {E}, while DF(S) includes D. Third, M⁺(S) is insensitive to transitive edges, while DF⁺(S) is not. Again examine Figure 14, where DF⁺(S) = {D, E, F}: a ψ-function is only needed at E, but the sensitivity of DF⁺(S) to transitive edges adds node F to its set.

[Figure 14: a PPG with Entry followed by sections A (defines v), B, and C (defines v), then D, E, and F (uses v).]

However, extra ψ-functions are safe, since they only pass along the information collected at those points. Thus, merging information at DF⁺(S) within a PPG has been shown to be a safe method, and it is relatively efficient, since it can be performed with the same complexity as φ-function placement. In terms of the space requirements for placing ψ-functions within the PPG, we can use the space consumed by φ-function placement as an upper bound, since M⁺(S) ⊆ DF⁺(S).
While the worst case could be O(N²), in practice most programs exhibit linear space requirements when placing φ-functions [1, 16].
9.5 The Complete Algorithms

The complete transformation of an intermediate representation into parallel SSA form is accomplished in two main phases: function placement and renaming. For these algorithms, successor and predecessor always refer to nodes in the PCFG, while children refers to the dominator tree of the associated PCFG. We describe here the data structures used in the following algorithms:
A(V) → list of all nodes with assignments to variable V.
S( ) → an array of stacks, one for each variable V; holds pointers to definitions.
T → a stack of nodes that holds section nodes of the PPG for popping in topological order.
DF(N) → the local dominance frontier for node N.
WhichPred(N,Q) → an integer telling which predecessor of Q in the PCFG is N.
Work List → for each variable V, the Work List is initialized to A(V), all assignments to V.
HasFunc( ) → a pointer field to a variable in each basic block node. HasFunc(N) = V means block N already has a φ- or ψ-function added for variable V.
Work( ) → a reference flag for each PCFG or PPG node. Work(N) = V means that node N has already been added to the Work List for variable V.
set delete(ψ) → marks a singleton ψ-function for later deletion.
Algorithm 1: Placement of φ- and ψ-Functions

1:  given: A(V), ∀ V
2:  compute DF(N), ∀ N ∈ EFG
3:  for all nodes N do
4:     HasFunc(N) ← ∅ ; Work(N) ← ∅
5:  endfor
6:  for each variable V do
7:     Work List ← ∅
8:     for each N in A(V)
9:        Work(N) ← V ; Work List ← Work List ∪ {N}
10:    endfor
11:    while Work List ≠ ∅ do
12:       take N from Work List
13:       for each Q in DF(N) do
14:          if HasFunc(Q) ≠ V then
15:             HasFunc(Q) ← V
16:             if Q is a basic block of a PCFG, add-φ(Q,V)
17:             else if Q is a member of a PPG, add-ψ(Q,V)
18:             endif
19:          if Work(Q) ≠ V then
20:             Work(Q) ← V
21:             Work List ← Work List ∪ {Q}
22:             if Q is a section Exit basic block then
23:                Work List ← Work List ∪ {P_Q, S_Q}
24:          endif
25:          endif
26:       endfor /* each Q in DF */
27:    endwhile
28: endfor /* each variable V */

29: add-φ(N,V)
30:    i = number of predecessors of N
31:    place V = φ(V1, V2, ..., Vi) at the beginning of basic block N,
32:    where Vj corresponds to the jth predecessor of N

33: add-ψ(N,V)
34:    if N is a section node, then N = Entry_N
35:    if N is a coend node, then N = tailP_N
36:    place V = ψ at N

Figure 15 Placement: locations for φ- and ψ-functions
Algorithm 2: Procedure for Renaming φ- and ψ-Functions

Each assignment has the form LHS(A) ← RHS(A).
for each variable V, S(V) ← ∅
T ← ∅
call RenameCFG(Entry_main)

1:  RenameCFG(N)
2:  if N is a node of a PCFG then
3:     for each assignment statement A in N do
4:        for each V used in RHS(A) of an ordinary assignment do
5:           replace fetch of V by link to Top(S(V))
6:        endfor
7:        for each V in LHS(A) do
8:           if A is an ordinary or φ assignment then
9:              push pointer to A onto S(V)
10:          else if A is a ψ-function do
11:             eliminate duplicate arguments to ψ
12:             num = number of arguments
13:             if num = 1 then
14:                set delete(ψ)
15:                push reaching definition of the ψ-argument of A onto S(V)
16:             else if num > 1 then
17:                push pointer to A onto S(V)
18:             endif
19:          endif
20:       endfor /* each V */
21:    endfor /* each assignment statement A */
22: endif /* PCFG node */
23: if N is a supernode then
24:    traversePPG(Entry_N) /* cobegin for N */
25: if N is a node of a PCFG then
26:    for each Q ∈ Succ(N) do /* Succ(N) in CFG */
27:       j = WhichPred(N,Q)
28:       for each φ-function f in Q with variable V do
29:          replace the jth argument in RHS of f with link to Top(S(V))
30:       endfor
31:    endfor
32:    for each Q ∈ children(N) do /* children in dom tree */
33:       RenameCFG(Q)
34:    endfor
35:    for each assignment A in N do
36:       for each V in LHS(A) do
37:          pop S(V)
38:       endfor
39:       if A is a ψ-function & set delete(A), then remove statement A
40:    endfor
41: endif
end RenameCFG

Figure 16 Renaming: correctly inserting links to all φ- and ψ-function arguments as well as all ordinary uses
Algorithm 3: Procedure for traversing PPG section nodes

1:  traversePPG(E)
2:  call dfst(E)
3:  while T ≠ ∅ do
4:     M = pop(T)
5:     if M is a section node do
6:        RenameCFG(Entry_M)
7:        for all ψ-functions s at Entry_M do
8:           for each V on LHS of s
9:              push pointer to V onto S(V)
10:          endfor
11:       enddo
12:       for all φ-functions h at Exit_M do
13:          for each V on LHS of h
14:             push pointer to V onto S(V)
15:          endfor
16:       enddo
17:       for each Q ∈ Succ(M) do /* Succ(M) in the PPG */
18:          if Q is a section node, then Q = Entry_Q
19:          if Q is a coend node, then Q = tailP_Q
20:          for each V = ψ-function s in Q do
21:             if Top(S(V)) is contained within the enclosing parallel block then
22:                add a ψ-argument with pointer to Top(S(V)) as a RHS argument of s
23:             endif
24:          enddo
25:       enddo
26:    enddo /* of section node */
27:    R = M
28:    while R ≠ parent(Top(T)) do
29:       for each φ assignment at Exit_R do
30:          for each V in LHS(φ)
31:             pop S(V)
32:          endfor
33:       enddo
34:       for each ψ assignment at Entry_R do
35:          for each V in LHS(ψ)
36:             pop S(V)
37:          endfor
38:          if set delete(ψ) then
39:             remove statement containing ψ
40:          endif
41:       enddo
42:       R = parent(R) /* parent set in dfst */
43:    endwhile
44: endwhile
45: end traversePPG

Figure 17 Correct traversal of nodes in the PPG
The placement of ψ-functions is done concurrently with the placement of φ-functions, as shown in Algorithm 1. Functions are placed at the iterated dominance frontier of each assignment to a given variable V. A(V), the list of all initial assignments to V, is found in one pass through the program, storing the definitions of V as a linked list. We do not have to reinitialize the fields HasFunc and Work as each variable is processed, since they are just pointers to the variable under consideration. At each iterated dominance frontier node† we distinguish whether to place a φ- or a ψ-function by the type of node encountered (lines 16-17 in Algorithm 1): a basic block node in a PCFG always receives a φ-function, while a PPG node indicates that a ψ-function is required. Note, however, that a ψ-function is not actually placed within the PPG node, but rather within the Entry node of the corresponding section, unless the PPG node is coend, in which case it is placed within the tail node of the enclosing supernode in the outer PCFG. In this way we correctly propagate definitions which reach the end of a parallel section to the sequential flow which follows the supernode in the enclosing PCFG. The distinction made to determine which type of merge node to create also enables a single field, HasFunc, to be used for each node; there can never be both a φ- and a ψ-function placed at the same node.

The other important difference between φ- and ψ-functions in the placement phase can be seen by examining the add-φ and add-ψ routines in Algorithm 1. When a φ-function is placed at a node, its arity is fixed at i, where i is the number of CFG predecessors of the node. On the other hand, when a ψ-function is placed at a node, we do not know its arity, other than that it will be at least one. There is not necessarily a correspondence between ψ-arguments and PPG predecessors. Remember, a ψ-argument reflects a definition of that variable within the corresponding parallel section. It may be that no definition of the variable exists within a predecessor section for some ψ-function. It is in the next phase, renaming, that arguments are added to ψ-functions.

† Although line 13 in Algorithm 1 looks at each Q in the dominance frontier of N, lines 19-25 effectively iterate the dominance frontier by placing nodes back into the worklist.

Once φ- and ψ-function placement is accomplished, the renaming phase is invoked. Algorithm 2 in Figure 16 fills in the correct argument pointers in the case of φ-functions and creates ψ-arguments when needed, filled in with the current reaching definition for each variable. This algorithm also links each ordinary use to its unique reaching definition. Renaming refers to giving each definition point for every variable a unique name, as described in Section 3. In this way, every use has exactly one reaching definition. However, in real implementations, such as our Fortran restructuring compiler Nascent, one does not seriously entertain the notion of a symbol table explosion to insure this property [17]. Instead, a use-def ssalink pointer field is assigned to each use (fetch in the intermediate representation) and to all arguments of φ- and ψ-functions [17]. The example program transformed into SSA form (Figure 18) demonstrates the unique reaching definition semantics, not the syntax of the implementation. We retain the designation "renaming" for historical purposes.

The renaming algorithm we present works the same as the original when traversing a PCFG, except that ψ-functions as well as φ-functions are treated as variable generations and pushed onto the definition stack S. However, when looking for φ-functions at CFG successors, we will never be examining a node which could contain both a φ-function and a ψ-function. A ψ-function can only be placed at two types of nodes: section Entry nodes and the tail node of a supernode. In the former case, we have a node with no nodes in its dominance frontier, and in the latter we have a node with exactly one predecessor. The original renaming algorithm [1, Figure 5] performs a depth-first traversal of the dominator tree of the CFG. We modify this algorithm for parallel constructs as follows:

1. Begin the traversal with the Entry node for Gmain.
(p)        a1 = 2
(p)        b1 = 3
(p)        c1 = 4
(p)        if ... then
(s)          Parallel Sections
(t)          Section A
(u)            if (P) then
(u)              b2 = a1 * 5
               else
(Q)              b3 = a1 + 7
(Q)              f1 = b3 * a1
               endif
(v)            b4 = φ(b2, b3)
(v)            f2 = φ(f0, f1)
(Exit_A)       b5 = φ(b4, b1)
(Exit_A)       f3 = φ(f0, f2)
             Section B
(w)            c2 = c1 + 15
(w)            f4 = c2 * 16
(Exit_B)       c3 = φ(c1, c2)
(Exit_B)       f5 = φ(f0, f4)
             Section C, Wait(A)
(x)            d1 = b5 * a1
(Exit_C)       d2 = φ(d0, d1)
             Section D, Wait(A, B)
(Entry_D)      f6 = ψ(f3, f5)
(y)            c4 = a1 * b5 + c3 * f6
(Exit_D)       c5 = φ(c3, c4)
             End Parallel Sections
(tailP1)     f7 = ψ(f3, f6)
(n)        d3 = d2 + f7
(p)        else
(q)          d4 = 23
(r)        endif
(r)        b6 = φ(b5, b1)
(r)        c6 = φ(c5, c1)
(r)        d5 = φ(d3, d4)
(r)        f8 = φ(f0, f7)
(r)        e1 = a1 + b6 * c6 * d5

Figure 18 SSA form of the parallel program
2. When visiting a basic block node or Entry node, the algorithm works the same as originally presented.

3. When visiting a supernode, the procedure recurses to perform a traversal of the nodes in the corresponding PPG. The order of traversal of these nodes is important: the traversal of section nodes must preserve topological order, that is, every predecessor of a section node must be visited before visiting the section node itself. Since the PPG is acyclic, it is fairly easy to discover a correct order. We call the routine dfst() to build a correct traversal order (see Algorithm 4 and Claims 9.1-9.3).

4. When visiting a section node, singleton ψ-functions can be identified for deletion (a separate pass is not needed to actually delete these functions, since the deletion occurs at lines 38-40 of Algorithm 3, at the same time definitions are popped off the stack). First, for each ψ-function at this section node, remove any duplicate ψ-arguments. If there is only one remaining ψ-argument, then that argument can be marked for future deletion. If there is more than one remaining ψ-argument, the ψ-function is necessary. The procedure then recurses to visit the dominator tree of the corresponding PCFG. Insertion of ψ-arguments is done to the wait-dependence successors in a fashion similar to the renaming of arguments of φ-functions.

5. When visiting an Exit node for a section, the SSA name for every variable modified in that section must be propagated back to the section node, as though there were an assignment in that section node. Due to slice edges, all variables defined within the section will have corresponding φ-functions at the section Exit.

6. Similarly, when visiting a coend node, each SSA name modified in the parallel block must be propagated to the corresponding supernode. We accomplish this by placing ψ-functions for coend nodes at the parallel block tail node. If only a single definition of a variable reaches coend, a singleton-ψ will be created.

The SSA form of the parallel program (where the EFG includes slice edges) in Figure 2 is shown in Figure 18. Recall the copy-in/copy-out semantics of the parallel language. A major revision made to the algorithm presented by Cytron et al. occurs when a supernode is encountered. At this point we must traverse the section nodes of the corresponding PPG by calling traversePPG(), recursively calling RenameCFG() on each local PCFG (see Algorithm 3). If that local PCFG contains a supernode, then traversePPG() will again be called. Thus, RenameCFG() and traversePPG() seesaw back and forth as needed.
9.6 Safety and Correctness of the Algorithms

We show in this section that the algorithms presented perform as intended. We first demonstrate that φ-functions are correctly placed, and that ψ-functions are placed at all points identified by Definition 7.2. Our algorithm may place ψ-functions at more points than required, but these functions are useful for implementation, notably as singleton-ψ's, and are deleted later. Next, we show that the correct reaching definitions are propagated and inserted as arguments to φ- and ψ-functions. Finally, we prove that the traversePPG() routine visits nodes within the PPG in the correct order, and we also provide complexity analysis for the algorithms.
Theorem 9.4 The placement algorithm inserts a φ-function at all points in PPF⁺(S) for any variable, and a ψ-function at all points identified by Definition 7.2.

Proof: We first consider the proper placement of φ-functions. As shown in Section 6, they belong, for variable v, at Jp⁺(A(v)), which by Theorem 6.2 is equal to PPF⁺(A(v)). Thus we need to show that Algorithm 1 places φ-functions at precisely those points. For each element in A(v), lines 16 and 21 operate the same as the original sequential algorithm. This satisfies the first half of Definition 8.1, while line 23 satisfies the second half of the definition (adding S_Q on this line to the worklist generates a ψ-function, as seen by line 17). Finally, lines 19 and 20 insure that the PPF is iterated.

Next, we show that the points identified by Definition 7.2 are a subset of those identified in Algorithm 1. Due to the slice edge in each section PCFG, every local Exit node is always in the iterated dominance frontier of all nodes (except the local Entry and Exit) within the section. Thus, if variable v is defined within a section, the local Exit node will always have a φ-function created for v. Line 22 of Algorithm 1 guarantees that the section node in the PPG containing the variable definition is added to the worklist. Similarly, the slice edge in the PPG insures that the coend node is added to the worklist via the DF⁺. Any section node from Definition 7.2 is contained in the DF⁺ of a variable definition by Theorems 9.1 and 9.2, and we have shown that our algorithm identifies all nodes in the DF⁺ for ψ-function placement. □
Theorem 9.5 The correct reaching definitions for ψ-functions are propagated by Algorithms 2 and 3.

Proof: By exhaustive cases. Let g be any ψ-function for variable v. From Algorithm 1 we know g is either (i) at a section Entry node or (ii) at the tail node of a supernode.

Case (i). Let g ∈ Section B, for arbitrary Sections A and B such that A W⁺ B. We must show that all reaching definitions of v from Section A are correctly propagated to g. Any downward-exposed definition of v in A results in a φ-function, f, being created for v in Exit_A. Lines 12-16 of Algorithm 3 push a pointer to f onto S(v), which at this point is Top(S(v)). We have two sub-cases. In sub-case (i.a), B waits directly upon A. Lines 17-18 of Algorithm 3 will examine B (which we know contains v = ψ by Theorem 9.4) and create a ψ-argument in lines 20-22 with a pointer to f. In sub-case (i.b), B waits transitively upon A with no intervening definition of v. Here, since there are no intervening definitions of v, Top(S(v)) remains unchanged until reaching B, as long as it is not popped off of S(v). The only issue concerns whether the section nodes are visited in the correct order; this issue is dealt with in Claims 9.1-9.3 later in this section.

Case (ii). Let g ∈ tailP. This is actually a special case of (i), where coend ∈ DF⁺. Line 19 of Algorithm 3 insures that the reaching definition is propagated to g in this case. □
Theorem 9.6 The correct reaching definitions for φ-functions are propagated by Algorithms 2 and 3.

Proof: Consider any φ-function f for variable v. If f is within a local PCFG, then Algorithm 2 works as originally presented [1]. We need only consider the case where the PCFG contains a supernode P, and (i) f is within P, or (ii) f is at a point reached by P.

Case (i). Either f is in the DF⁺ of A(v), or not. If not, then Top(S(v)) will contain the current reaching definition of v, propagated from the φ-function at its local Exit node and pushed onto S by lines 12-16 of Algorithm 3. If so, then Algorithm 1 guarantees that a ψ-function was created at Entry_GX, and Theorem 9.5 assures us that it possesses the proper reaching definition.

Case (ii). The last definition of v from one branch of the PCFG reaching f comes from inside P. But here Top(S(v)) will be the ψ-function at tailP when the correct φ-argument is filled in by lines 26-29 of Algorithm 2. □

Algorithm 3 traverses the PPG section nodes in the right order: they must be visited in topological order, but must also be visited in a depth-first fashion of some spanning tree of the PPG. The algorithm given by Cytron et al. visits nodes for
renaming in a depth-first order of its dominator tree. We visit the section nodes of a supernode in topological order. Note that a depth-first order of a graph will not, in general, visit the nodes in topological order, and all topological orders do not visit a directed graph in a depth-first manner of some spanning tree of that graph. The key idea of the algorithm presented by Cytron et al. is that when a depth-first search of the dominator tree visits a node, all reaching definitions of previous nodes are on a stack. This is accomplished by the depth-first search, as it visits all of a node's dominator-tree children before completing its call, and only then pops off the definitions within the node. For sequential code, visiting nodes in a depth-first order of the dominator tree effectively produces a 'must-precede' ordering; for a supernode, we visit section nodes in the 'must-precede' order by examining them topologically, while we insure that the correct reaching definitions between section nodes exist by visiting these nodes in a depth-first order of some spanning tree of the PPG. Thus, we would like to find a spanning tree of the PPG such that there exists a depth-first search of that tree which maintains topological order. We prove that Algorithm 4 in Figure 19, which is called by the routine traversePPG(), accomplishes the desired task.

Algorithm 4: A Topological Depth-First Sorting

set of edges E = ∅
stack of nodes T = ∅
given dag G with root R, call dfst(R)

dfst(V)
   mark visited(V)
   for each child (successor in a PPG) of V do
      if unvisited(child) then
         add edge V → child to E
         set parent(child) = V
         dfst(child)
      endif
   enddo
   push V onto T
end dfst

Figure 19 Ordering PPG nodes for processing
Claim 9.1 Popping T will visit the nodes of G in topological order. This result is well-known [18].
Claim 9.2 E is a spanning tree of G.

Proof: Choose any node N of G. We know N is visited (Claim 9.1), and visited only once, since it is marked when visited the first time and will not be revisited once so marked. Since each node N has at most one edge in E with head N, we need only show that N can be reached from the root R. Simply follow the parent links repeatedly from N. Each unique parent P of N is in 1-1 correspondence with an edge P → N in E. Since G is finite (a necessary assumption), this chain terminates at the only node without a parent, R. □
Claim 9.3 Popping T will visit the nodes of E in a depth-first order.

Proof: In the context of visiting tree E, a depth-first order of E means that we visit all descendants of node N before any unvisited siblings of N. Let N and M be siblings in E, with D a descendant of N. We must show that, given N, M, and D unvisited, if N is visited first, then D will be visited before M (by Claim 9.1 we know N will be visited before D). Assume, to the contrary, that M is visited before D. This implies that M is between N and D in stack T. Since D is reachable from N, and dfst(N) reached M before completing, D must be a descendant of M in E. But this implies two paths from R to D in E (one through N and one through M), since M and N are siblings. This contradicts the fact that E is a tree. Thus, D will be visited before M. □

Using the same notation employed for computing the running time for Definition 5.4, and where V̂ is the number of variables in the program, we calculate the running time of our algorithms as follows. The first phase, φ- and ψ-function placement, takes worst case O(N̂² + Ê) per section [1]; thus over all sequential sections it will take O(P̂(N̂² + Ê)) time. For the second phase, the RenameCFG() routine takes maximum time O(N̂·V̂) per sequential section, while traversePPG() will traverse all sections (O(P̂)), calling dfst() (O(1) with respect to the traversePPG() call) and RenameCFG(), and processing φ- and ψ-functions on each section (O(V̂)). Thus, the running time of the second phase, over all variables, is O(P̂(N̂·V̂ + V̂)).
9.7 Implementation Observations

Here we examine some of the salient features observed while implementing the algorithms presented in this paper:
• The slice edges proved to be an invaluable tool for propagating reaching definitions. All variables defined within a section would have a φ-function at the local Exit node; almost as important, ψ-functions, which were inserted at the section Entry nodes, would never be propagated beyond that section, since Entry nodes always have an empty local dominance frontier. Thus, by looking first for ψ-functions at Entry, followed by φ-functions at Exit, the proper reaching definition will always be on the top of the stack when proceeding to a new section.
• As noted in Section 8, renaming is actually done by providing links which point to the unique definition point for each use of a variable. In Nascent, we have accomplished this by supplying a sparse form of use-def links. This is in contrast to most other implementations of which we are aware, which provide def-use links. We believe our approach has numerous advantages, including efficient space utilization and effective solutions to data-flow problems [17].
• Removing duplicate ψ-arguments. We have seen how duplicate arguments can occur. At first glance, it may appear that in order to remove duplicates the arguments would need to be sorted, taking N log N time. Although we expect N to usually be fairly small, we can in fact perform the duplicate elimination in linear time by using a variant of a bucket sort. For each ψ-function s, examine its arguments, marking a reference flag (a pointer to a symbol) at the end of each argument's ssalink with s. Since each ψ-function is unique, we can immediately identify duplicate entries and remove them. Note that this technique is possible because we can follow the use of a variable (from the ψ-argument in this case) to its definition via our SSA implementation.
10 Related Work and Future Directions

10.1 Related Work

Analysis of parallel programs for detecting races has been a very popular research topic. Data-flow analysis of these programs for compile-time optimizations has become a topic of interest only recently. To our knowledge, none of the earlier work on race detection [19, 20, 21, 22] uses the SSA form of a program. We have shown that one outcome of translating explicitly parallel programs to their SSA form is the static detection of write-write races in these programs.† However, race detection is not the main focus of this research work. More recently there has been considerable interest in data-flow analysis of explicitly parallel programs. Chow and Harrison [23] have presented a general framework for analyzing parallel programs using abstract interpretation. They do not consider synchronization and assume a strong memory consistency model. As mentioned earlier, we believe this is overly restrictive if the purpose of the data-flow analysis framework is code optimization. Grunwald and Srinivasan [24] present a data-flow framework to compute reaching-definitions information in explicitly parallel programs with post/wait synchronization. This work also does not focus on SSA form. Sarkar and Simons [25, 26] define a representation for explicitly parallel programs, called the Parallel Program Graph. They describe a way of specifying concurrent execution semantics for a Parallel Program Graph and also suggest the use of this representation for parallel program analysis. Some work has also been done by Grunwald and Srinivasan [27] on translating structured parallel programs to their SSA form. Their algorithm uses the Parallel Flow Graph representation, which is not a hierarchical data structure like the Extended Flow Graph. The algorithm by Grunwald and Srinivasan differs from the algorithms presented in this work (Section 9) in that the latter add ψ-functions at confluence points only on demand, i.e., as they become necessary. Also, the algorithms in this paper are based on implementing SSA with use-def chains, as opposed to def-use chains, which is the standard interpretation.

† Note that since we assume copy-in/copy-out semantics, we never report a write-read race in the parallel program.
10.2 Future Directions

This paper presents algorithms to translate explicitly parallel programs, with synchronization in the form of Wait clauses, to their Static Single Assignment form. SSA form gives the reaching definition for every use of a variable in the program. As future work, we would like to extend the algorithms that derive Sparse Evaluation Graphs [28], which are used to solve several data-flow problems including reaching definitions in sequential programs, to handle the parallel and synchronization constructs mentioned in this paper. The concepts of the dominance relation and dominance frontiers have been used in deriving Sparse Evaluation Graphs as well [28]. We believe that extensions of these concepts to handle parallel control flow can be used to derive Sparse Evaluation Graphs for explicitly parallel programs.
Another important piece of future work that we plan to pursue is using the SSA form of parallel programs to develop algorithms for code optimizations such as constant propagation.
11 Conclusion

Static Single Assignment form is a powerful intermediate representation for optimizing sequential programs. We have extended the algorithms that translate a sequential program to its SSA representation to handle parallel constructs with synchronization. These extensions entailed modifying existing algorithms and introducing a new parallel merge operator, the ψ-function. The resulting SSA representation of parallel programs can be used to optimize explicitly parallel programs. We believe that the ability to perform classical code optimizations on parallel programs is critical to the performance of such programs on existing and forthcoming high-performance parallel architectures. Previous work on applying scalar optimizations to parallel programs focused on stricter language semantics and the problems this caused for the compiler [29, 30]. Our language model is more appropriate for application-level code and allows more aggressive optimizations.
Acknowledgements

We would like to gratefully acknowledge the reviewers, whose thorough reading and helpful comments definitely improved the quality of this paper.
References

[1] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. on Programming Languages and Systems, 13(4):451-490, October 1991.

[2] B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. In Conf. Record 15th Annual ACM Symp. Principles of Programming Languages [31], pages 1-11.

[3] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value numbers and redundant computations. In Conf. Record 15th Annual ACM Symp. Principles of Programming Languages [31], pages 12-27.

[4] Mark N. Wegman and F. Kenneth Zadeck. Constant propagation with conditional branches. ACM Trans. on Programming Languages and Systems, 13(2):181-210, July 1991.

[5] Michael Wolfe. Beyond induction variables. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 162-174, June 1992.

[6] Parallel Computing Forum. PCF Parallel Fortran extensions. Fortran Forum, 10(3), September 1991. (Special issue).

[7] Per Brinch Hansen. Operating System Principles. Automatic Computation. Prentice-Hall, 1973.

[8] IBM Corporation, Kingston, NY. Parallel FORTRAN Language and Library Reference, 1988.

[9] Ron Cytron, Michael Hind, and Wilson Hsieh. Automatic generation of DAG parallelism. In Proc. ACM SIGPLAN '89 Conf. on Programming Language Design and Implementation, pages 54-68, Portland, OR, June 1989.

[10] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph and its use in optimization. ACM Trans. on Programming Languages and Systems, 9(3):319-349, July 1987.

[11] Harini Srinivasan and Michael Wolfe. Analyzing programs with explicit parallelism. In Utpal Banerjee, David Gelernter, Alexandru Nicolau, and David A. Padua, editors, Languages and Compilers for Parallel Computing, number 589 in Lecture Notes in Computer Science, pages 403-419. Springer-Verlag, 1992.

[12] Harini Srinivasan. Analyzing programs with explicit parallelism. M.S. thesis 91-TH-006, Oregon Graduate Institute, Dept. of Computer Science and Engineering, July 1991.

[13] Michael Wolfe. J+ = J. ACM SIGPLAN Notices, 29(7):51-53, July 1994.

[14] Harini Srinivasan, James Hook, and Michael Wolfe. Static single assignment for explicitly parallel programs. In Conf. Record 20th Annual ACM Symp. Principles of Programming Languages, pages 260-272, Charleston, SC, January 1993.

[15] Michael Wolfe, Michael P. Gerlek, and Eric Stoltz. Nascent: A Next-Generation, High-Performance Compiler. Oregon Graduate Institute of Science & Technology, unpublished, 1993.

[16] Paul Havlak. Interprocedural Symbolic Analysis. PhD thesis, Department of Computer Science, Rice University, 1994.

[17] Eric Stoltz, Michael P. Gerlek, and Michael Wolfe. Extended SSA with factored use-def chains to support optimization and parallelism. In Proceedings of the Hawaii International Conference on System Sciences, January 1994.

[18] Robert Sedgewick. Algorithms. Addison-Wesley, 1988.

[19] David Callahan, Ken Kennedy, and Jaspal Subhlok. Analysis of event synchronization in a parallel programming tool. In Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming [32], pages 21-30.

[20] Anne Dinning and Edith Schonberg. An empirical comparison of monitoring algorithms for access anomaly detection. In Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming [32], pages 1-10.

[21] Vasanth Balasundaram and Ken Kennedy. Compile-time detection of race conditions in a parallel program. In Proc. 3rd International Conference on Supercomputing, pages 175-185, June 1989.

[22] D. Callahan and J. Subhlok. Static analysis of low-level synchronization. In Proc. of the ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 100-111, Madison, WI, May 1988.

[23] Jyh-Herng Chow and Williams Ludwell Harrison. Compile-time analysis of parallel programs that share memory. In Conf. Record 19th Annual ACM Symp. Principles of Programming Languages, pages 130-141, 1992.

[24] Dirk Grunwald and Harini Srinivasan. Data flow equations for explicitly parallel programs. In Conf. Record 4th ACM Symp. Principles and Practices of Parallel Programming, San Diego, CA, May 1993.

[25] Vivek Sarkar. A concurrent execution semantics for parallel program graphs and program dependence graphs. In Conf. Record 5th Workshop on Languages and Compilers for Parallel Computing, pages 13-20, New Haven, CT, August 1992.

[26] Vivek Sarkar and Barbara Simons. Parallel program graphs and their classification. In Proc. of the Sixth Workshop on Languages and Compilers for Parallel Computing, pages 633-655, Portland, OR, August 1993.

[27] Harini Srinivasan and Dirk Grunwald. An Efficient Construction of Parallel Static Single Assignment Form for Structured Parallel Programs. Technical Report CU-CS-564-91, University of Colorado at Boulder, December 1991.

[28] Jong-Deok Choi, Ron Cytron, and Jeanne Ferrante. Automatic construction of sparse data flow evaluation graphs. In Conf. Record 18th Annual ACM Symp. Principles of Programming Languages, Orlando, FL, January 1991.

[29] Samuel P. Midkiff and David A. Padua. Issues in the optimization of parallel programs. In Proc. 1990 International Conf. on Parallel Processing, volume II, pages 105-113, St. Charles, IL, August 1990. Penn State Press.

[30] Samuel P. Midkiff, David A. Padua, and Ron Cytron. Compiling programs with user parallelism. In David Gelernter, Alexandru Nicolau, and David A. Padua, editors, Languages and Compilers for Parallel Computing, Research Monographs in Parallel and Distributed Computing, pages 402-422. MIT Press, Boston, 1990.

[31] Conf. Record 15th Annual ACM Symp. Principles of Programming Languages, San Diego, CA, January 1988.

[32] Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seattle, WA, March 1990. ACM Press.