Simplifying Control Flow in Compiler-Generated Parallel Code

John Mellor-Crummey and Vikram Adve
{johnmc,adve}@cs.rice.edu
Department of Computer Science, MS 132, Rice University, 6100 Main Street, Houston, TX 77005-1892
Abstract

Optimizing compilers for data-parallel languages such as High Performance Fortran perform a complex sequence of transformations. However, the effects of many transformations are not independent, which makes it challenging to generate high-quality code. In particular, some transformations introduce conditional control flow, while others make some conditionals unnecessary by refining program context. Eliminating unnecessary conditional control flow during compilation can reduce code size and remove a source of overhead in the generated code. This paper describes algorithms to compute symbolic constraints on the values of expressions used in control predicates and to use these constraints to identify and remove unnecessary conditional control flow. These algorithms have been implemented in the Rice dHPF compiler, and we show that they are effective in reducing the number of conditionals and the overall size of generated code. Finally, we describe a synergy between control flow simplification and data-parallel code generation based on loop splitting which achieves the effects of narrower data-parallel compiler optimizations such as vector message pipelining and the use of overlap areas.
Key words: expression range analysis, constraint propagation
1 Introduction

High Performance Fortran (HPF) (15), an extension of Fortran 90 (3), has attracted considerable attention as a promising language for writing portable parallel programs. HPF offers a simple, high-level programming model that shields programmers from the intricacies of concurrent programming and managing distributed data. Programmers express data parallelism using Fortran 90 array operations and use data layout directives to direct a compiler's partitioning of the data and computation among the processors of a parallel machine.
The price of the simplicity of the HPF programming model is that optimizing compilers for HPF are very complex. Given an HPF source program, optimizing HPF compilers perform a sequence of code generation steps to synthesize an efficient single-program-multiple-data (SPMD) program. Each code generation step performs one or more complex code transformations. Even if program analysis results are updated incrementally and continuously during code generation, transformations will typically have incomplete knowledge about code produced in subsequent steps. As a result, code generated in this manner will often have opportunities for further optimization. For example, the Rice dHPF compiler for HPF employs transformations that partition a loop into multiple loops with disjoint iteration spaces to minimize the need for ownership guards: conditionals that test whether or not a processor should perform a particular computation or communication, or whether an access through a potentially non-local reference touches off-processor data (2). (We use the term ownership guards because these conditionals follow from data and computation partitionings (loosely) based on the owner-computes rule (23).) The resulting loops contain a copy of some or all of the statements inside the loop before splitting. Within the "refined contexts" of new loops created by splitting, conditional control flow in the enclosed code may now be redundant.

In this paper, we motivate, describe, and evaluate a global optimization to eliminate redundant conditional control flow generated in the course of compiling data-parallel programs. This optimization, as implemented in the Rice dHPF compiler, is performed in two phases. The first phase analyzes the SPMD code generated by dHPF to collect symbolic constraints on the values of integer expressions imposed by loops, conditional branches, assertions, and integer computations, and propagates these constraints over the control dependence graph. The second phase uses these constraints to eliminate or simplify conditional tests whose outcome can be determined (in full or in part) at compile time, and simplifies the control flow based on those tests. Although this algorithm can be applied to either sequential or parallel programs, we believe it will be most useful for optimizing machine-generated SPMD code, which has the greatest likelihood of containing superfluous conditional control flow. While we have only studied the use of constraint propagation for control flow simplification, it can also be useful for other applications such as debugging (7), optimizing embedded system software (16), optimizing array bounds checking (13; 20), improving dependence analysis (5), array privatization (5; 27), and constant propagation (9). This work makes the following contributions:
We describe an efficient algorithm for propagating constraints imposed on the values of integer expressions, and an algorithm for using these constraints to elide redundant conditional control flow. Our algorithms use three key technologies: global value numbering, control dependence analysis, and analysis of integer constraints using Fourier-Motzkin elimination. Unlike most previously proposed algorithms for symbolic range propagation (5; 7; 13; 16; 27), our constraint propagation algorithm is not iterative. By employing global value numbering based on static single assignment form (14), our approach achieves most of the benefits of an iterative algorithm in a single pass. Also unlike previous algorithms, we analyze not just simple bound and range constraints but also logical combinations of affine inequalities, which are important for simplifying conditionals in SPMD parallel code.
We evaluate our algorithms using three benchmarks of varying complexity. The results show that our approach is consistently highly effective at removing excess conditional control flow in the generated code. For the benchmarks studied, our control-flow simplification strategy eliminates 35% to 67% of the if statements generated by previous stages of code generation.
Finally, our experiments demonstrate an interesting synergy between control-flow simplification and data-parallel code generation based on loop-splitting transformations. The combination of these general-purpose techniques provides most of the benefits of two narrower data-parallel compiler optimizations, namely vector message pipelining (26) and overlap areas (11).

The following section motivates the need for a global control-flow simplification post-pass optimization, through a discussion of several important code generation steps in data-parallel compilation. Section 3 presents our algorithms for constraint propagation and control-flow simplification. Section 4 describes an evaluation of the algorithms in the context of a version of the dHPF compiler that compiles HPF for message-passing systems. Section 5 contrasts our algorithms with previous approaches to the problem of constraint propagation. Finally, Section 6 summarizes our conclusions and suggests some directions for future work.
2 Motivation

Data-parallel compilers require several complex and potentially expensive code generation steps to synthesize an explicitly parallel program (1). Typical steps include computation partitioning and communication generation, along with code transformations for communication-computation overlap and non-local storage management. Good scalar performance of the compiler-parallelized code is essential; however, it is desirable to defer many scalar code optimization issues until after parallelization. This approach has two major advantages: the parallelization steps can be simpler, and scalar optimizations can exploit global information that is not available until all parallelization steps are complete. Here, we use an example to illustrate several of the code generation steps taken by the Rice dHPF compiler in compiling an HPF program to a message-passing system, and we show how these steps interact to yield opportunities for eliminating redundant conditionals and empty loops. It is important to note that the individual code generation steps use powerful algorithms designed to improve scalar and parallel performance (1). The redundant control flow arises from the interaction of these sophisticated steps, and not from naive code generation strategies.

Consider the example source loop nest in Figure 1. This code is a fragment abstracted from a pipelined computation in the Erlebacher benchmark, a derivative calculation using an implicit sixth-order compact-differencing scheme. (This benchmark was developed by Thomas Eidson at ICASE.) Comments in the figure show the computation partitioning (CP) and the initial placement of communication operations chosen automatically by the dHPF compiler. Figure 2 shows the corresponding intermediate code generated by dHPF. (For brevity, Figure 2 uses "COMPUTE f(1:64, j, k)" to represent a copy of the i loop.) The innermost statement is assigned to be executed by the processor that owns f(i,j,k), as defined by the alignment and distribution of the array f. This results in the k loop being partitioned among the processors. The reference f(i,j,k+1) therefore accesses non-local data, and the required communication is placed inside the k loop since the communicated data is modified inside this loop. The communication is initially represented simply by placeholders for SEND and RECV statements. The compiler assigns appropriate CPs to these placeholders to ensure the communication is executed by the required processors. These CPs simply specify that the SEND is to be executed by the owner of f(i,j,k+1) and the RECV by the reader of the data, viz. the owner of f(i,j,k) (for 1 ≤ k ≤ N−1). Note that these communication CPs are conservative because the SEND and RECV should actually execute only in a subset of the iterations of the loop, namely in the "boundary" iterations on each processor. (Communication CPs are conservative only when communication happens inside a partitioned loop. In such cases, precise CPs for communication are difficult to compute and express in any general manner since communication patterns can be quite complex.) We rely on conditionals in the communication code instantiated for the SEND and RECV placeholders to enforce precisely when communication needs to occur. Based on these selected CPs for computation and communication, the first few key code generation steps the compiler carries out are as follows:
1. Non-local index-set splitting on the i loop: This step splits the innermost loop to separate the iterations that access only local data from those that access non-local data. The motivation for this splitting operation is twofold. First, it enables data communication for non-local iterations to be overlapped with the computation of local iterations. Second, it is desirable to access non-local data directly out of communication buffers (to avoid data copies) while still accessing local data in place; splitting minimizes the execution frequency of ownership guards needed to select between the local and non-local storage areas. (These guards are not shown in this example, but are explained in more detail below.) Splitting the i loop results in two full copies of the loop with different conditions on k (k ≤ 16·pmyid1 + 15 and k ≥ 16·pmyid1 + 16 on lines 8 and 10, respectively). During this step, the only information available about the enclosing context is that the bounds of the k loop will be reduced according to the CPs of its enclosed statements, i.e., 16·pmyid1 ≤ k ≤ 16·pmyid1 + 16.
2. Loop bounds reduction on the k loop: This step reduces the loop bounds of the k loop to partition the iterations among processors as specified by the CPs assigned to the SEND, COMPUTE, and RECV statements. Since the SEND should not execute in the first iteration of each processor and the RECV and COMPUTE blocks should not execute in the last, the loop is tiled into three sections to avoid introducing additional ownership guards within the k loop. The resulting loop sections are {k = 16·pmyid1 + 16}, {16·pmyid1 + 15 ≥ k ≥ 16·pmyid1 + 1}, and {k = 16·pmyid1}. The COMPUTE blocks for the two sections of the i loop (local and non-local) are simply replicated in the first two sections of the k loop (lines 3,5 and 8,10). Similarly, the RECV and SEND are replicated as shown. However, now the conditions on lines 3, 7, 9 and 10 are unsatisfiable, and those on lines 4, 5, 8 and 11 are guaranteed true. Thus, although code generation for the k loop performed a complex transformation to avoid introducing ownership guards to enforce the CPs of the three inner statements, it also caused previously generated ownership guards to become unsatisfiable or tautological.

3. Communication generation: This step replaces the communication placeholders with code to carry out the communication. (Results of this step are not shown in Figure 2 for brevity.) As mentioned above, this communication code contains conditionals which ensure that communication is performed only for iterations referencing or providing off-processor data. While this communication generation step has information available about the CP of the k loop, it does not have information about the tiled loop sections for the k loop resulting from step 2 (because this would require expensive reanalysis of previously generated code). Because of this tiling, the communication generation step inserts conditionals on lines 4 and 11 that are in fact tautological, and on lines 7 and 9 that are unsatisfiable.

The index-set splitting (2; 19) in step 1 and bounds reduction operations (1) in step 2 above perform sophisticated loop transformations aimed at minimizing the dynamic frequency of conditionals and otherwise improving the efficiency of generated code. Keeping code generation efficient, however, constrains such transformations to operate with only limited information about the context provided by enclosing loops and the code already generated for inner loops. For this reason, code resulting from a sequence of such loop transformations can benefit from a subsequent pass that uses more complete global information to improve the control flow efficiency of the code.

In addition to the three examples described above, a fourth source of redundant conditionals is the ownership guards that arise for references that access both local and non-local data (which we briefly mentioned above in the context of step 1). When non-local data is stored in a separate communication buffer, each potentially non-local reference would require an ownership test to determine whether to reference the local array or the buffer in each iteration. Executing an ownership test on every iteration of a loop nest can be expensive. "Overlap areas" can be used to avoid these tests for shift communication patterns by extending the local data storage to hold adjacent non-local values (11). When overlap areas are not applicable, current commercial compilers (e.g., the Portland Group's pgHPF (8) and IBM's xlhpf (12) compilers) avoid these ownership guards by copying the local and non-local data into a single unified region called a computation buffer. While this approach is efficient for some cases, in others it can be expensive both in terms of memory usage and the cost of data copies. By using index-set splitting, the dHPF compiler can avoid the data copies associated with a computation buffer approach, as well as most of the execution overhead associated with data ownership guards. In a loop with r non-local references, completely eliminating the ownership tests while accessing data directly out of communication buffers would in general require splitting the index space into 2^r − 1 non-local subsets, plus the local section. To avoid this exponential behavior, we simply split the loop into a local subset (in which all references are guaranteed to access only local data) and a non-local subset containing all other iterations. In the local section, no ownership guards are necessary. In the non-local section, we insert ownership guards only when r > 1 and different references access non-local data in different iterations. Often, however, the non-local iteration set is non-convex, and the loop nest generated for this set gets tiled into multiple convex regions. In each of these regions, some or all of these ownership guards may be redundant.

It is important to note that the code generation algorithms and the trade-offs between algorithm complexity and code quality described above are not limited to the dHPF implementation. Loop splitting and bounds reduction are well-known techniques, although only bounds reduction is used widely because of the perceived difficulty of applying splitting in general cases. Although the use of imprecise CPs for communication, as described above, could be avoided in simple cases, it is difficult to avoid imprecise CPs in general with the arbitrary communication patterns supported by dHPF. Although some research and commercial compilers reanalyze the generated code between major code generation steps, such reanalysis would be too expensive to perform after each loop transformation, such as between steps 1 and 2 above. If a single global post-pass optimization can effectively compensate for the lack of full contextual information at each step in code generation by removing the excess conditional control flow that results, such a step would clearly be of significant benefit in parallelizing compilers.
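To make the local/non-local splitting concrete, the following Python sketch computes the split iteration sets for a one-dimensional block distribution and a single shifted reference. It is an illustration only: the function names, the concrete bounds, and the interval representation are ours, not dHPF's, which performs this computation symbolically on integer sets (2).

    # A minimal sketch of non-local index-set splitting for a 1-D block
    # distribution and a single shifted reference a(i+shift); names and the
    # interval representation are illustrative, not dHPF's data structures.

    def owned_range(pid, block):
        """1-based index range owned by processor `pid` under a block distribution."""
        return pid * block + 1, pid * block + block

    def split_iterations(pid, block, n, shift=1):
        """Split this processor's tile of "do i = 1, n-1" (owner-computes on
        a(i)) into iterations where a(i+shift) is local and the boundary
        iterations where it is owned by a neighboring processor."""
        olo, ohi = owned_range(pid, block)
        tlo, thi = max(olo, 1), min(ohi, n - 1)   # this processor's tile
        local_hi = min(thi, ohi - shift)          # a(i+shift) local iff i+shift <= ohi
        local = (tlo, local_hi) if tlo <= local_hi else None
        boundary = (local_hi + 1, thi) if local_hi < thi else None
        return local, boundary

    # 4 processors, block size 16, n = 64, reference a(i+1):
    for pid in range(4):
        print(pid, split_iterations(pid, 16, 64))
    # Processor 0 owns 1..16: local iterations (1, 15) plus boundary (16, 16),
    # whose reference a(17) is owned by processor 1; processor 3 has no boundary.

In the local interval no ownership guards are needed at all, which is why the split pays off even though it duplicates the loop body.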
3 Constraint Propagation and Control-Flow Simplification

We describe an algorithm for computing and propagating constraints on the values of integer symbolic expressions due to conditional expressions in a program, and a second algorithm that uses these constraints to simplify program control flow. The constraints we compute for each point p in the program are conservative approximations of its path condition, which specifies the precise conditions under which p will be reached during execution. The constraint propagation algorithm includes both constraints imposed by conditional branches and constraint assertions inserted either by a programmer or by previous phases of compilation.

Our constraint propagation algorithm leverages three key program analysis technologies: control dependence analysis (10), global value numbering based on static single assignment form (10; 14), and simplification of symbolic integer constraints expressed as formulae in Presburger arithmetic (18; 22). (A Presburger formula consists of arbitrary affine inequalities and equalities of integer variables, together with the logical operations ¬, ∨, ∧, and the quantifiers ∃ and ∀. For example, given a loop from 1 to N with a stride of 2, the constraints on the loop index variable i within the loop can be expressed as {∃e : i = 2e + 1 ∧ 1 ≤ i ≤ N}.) By leveraging control dependence analysis and value numbering, we are able to use a non-iterative constraint propagation algorithm without significant loss of information for most cases we expect to encounter in practice. By accommodating the rich set of constraints representable in Presburger arithmetic, we can capture and exploit complex integer constraints. This can be important for reasoning about ownership guards and loop bounds which arise in compiler-generated SPMD code. In the dHPF implementation, we use the Omega library from the University of Maryland to support representation, manipulation, and code generation from Presburger formulae.

We use value numbers as the symbolic representation for integer values in the program. Value numbering assigns the same value number to two expressions iff it can be proven that they will have the same value. Computing constraints on value numbers is convenient because any time a variable's value is redefined, uses of that variable after the redefinition will be labeled with a different value number. Any constraints on a previous value number will still be valid (since they apply to the old value), although they may be irrelevant if the value is not used further. This leads to a simple, monotonic algorithm for constraint propagation in which constraints never need to be removed. This simplifies the handling of loop-variant values, variables modified inside loops, and assertions, as explained below. Finally, note that value number construction uses full forward substitution for program variables. Therefore, the leaves of the symbolic expressions are limited to atomic values such as integer constants, loop index variables, procedure arguments, function return values and side effects, and values read through external I/O.
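As an executable illustration of what such a formula denotes, the following Python sketch enumerates a small domain for the stride constraint above and shows how an unsatisfiable conjunction would be detected. This brute-force check merely stands in for the Omega library, which reasons about these formulas symbolically; all names here are illustrative.

    # The stride constraint {exists e : i = 2e + 1 and 1 <= i <= N} for the
    # loop "do i = 1, N, 2", evaluated by enumeration instead of symbolically.

    def stride_constraint(i, N):
        # exists e : i = 2e + 1, together with the bounds 1 <= i <= N
        return 1 <= i <= N and any(i == 2 * e + 1 for e in range(N))

    N = 10
    print([i for i in range(-2, N + 3) if stride_constraint(i, N)])
    # -> [1, 3, 5, 7, 9]: exactly the iterations the loop executes

    # A guard requiring an even i is unsatisfiable inside this loop, so a
    # control-flow simplifier could delete the guarded branch outright:
    print(any(stride_constraint(i, N) and i % 2 == 0 for i in range(1, N + 1)))
    # -> False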
3.1 Computing Constraints

We compute constraints that arise from conditional branches, loop bounds, and explicit constraint assertions which specify arbitrary logical expressions in terms of program variables. At the lowest level, we represent these constraints directly as Presburger formulae, using the Omega library as mentioned above. Each variable in an inequality or equality (other than those that are existentially quantified) represents a value number of one of the following types: (a) an atomic program variable, (b) non-affine expressions of program variables, or (c) merged values such as the value for a variable on exit from a loop or conditional branch. As an example, consider how we compute constraints enforced by the if statement in the source fragment below.

      i = 2 * m
      if (i * j + 8 .ge. i * 7 - 4 .and. intmod(i, 4) .eq. 1)

In this example, intmod(i, n) = i − n(i div n), so that 0 ≤ intmod(i, n) ≤ n − 1. First, we compute a value number for the logical expression in the conditional test. In computing the value number for this expression, i gets replaced with 2m everywhere. The first term of the resulting value number expression then is 2·V_m·V_j + 8 ≥ 14·V_m − 4, where V_m and V_j represent the value numbers for the variables m and j. In translating this value number into a Presburger formula, since Presburger arithmetic only accommodates affine relations, we collapse the symbolic product V_m·V_j into a single Presburger variable V_mj. The resulting constraint thus becomes {2·V_mj + 12 ≥ 14·V_m}. In the second term, the expression intmod(i,4) .eq. 1 is translated as {∃e : 0 ≤ i − 4e ≤ 3 ∧ (i − 4e) = 1}. After substituting i with 2·V_m in this term and combining with the first term's constraints, the constraints for the full logical expression become:

    {2·V_mj + 12 ≥ 14·V_m  ∧  (∃e : 0 ≤ 2·V_m − 4e ≤ 3  ∧  2·V_m − 4e = 1)}
Constraints for a loop are similarly computed by constructing a formula that constrains the index variable range according to the loop bounds and stride. One restriction is that a symbolic (i.e., unknown) stride cannot be represented precisely using Presburger formulae because it would require a product of symbolic terms. Similarly, an intmod or div with a symbolic divisor cannot be expanded as above, and thus must be treated as a non-affine expression, i.e., represented with a single Presburger variable. For such cases, we compute a formula representing only the bounds constraints, which is a conservative approximation for our constraint analysis. After constraint simplification, we convert a satisfiable constraint formula for a conditional back into an equivalent Fortran logical expression with the aid of a code generation operation, mmcodegen, provided by the Omega library (17). Non-affine expressions are then substituted back into the generated code (e.g., V_mj is replaced by m*j).
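The following Python sketch replays the translation worked above: it substitutes i = 2·m, collapses the non-affine product m·j into a single stand-in variable, and checks by enumeration that the translated formula agrees with the original Fortran test. The enumeration is only a sanity check for this example; dHPF performs the translation symbolically through value numbers and the Omega library.

    def intmod(i, n):
        return i - n * (i // n)        # Fortran intmod for nonnegative i, n > 0

    def original_test(i, j):
        # i * j + 8 .ge. i * 7 - 4 .and. intmod(i, 4) .eq. 1
        return i * j + 8 >= i * 7 - 4 and intmod(i, 4) == 1

    def translated(m, j):
        mj = m * j                                    # collapsed product V_mj
        first = 2 * mj + 12 >= 14 * m                 # {2 V_mj + 12 >= 14 V_m}
        second = any(0 <= 2 * m - 4 * e <= 3 and 2 * m - 4 * e == 1
                     for e in range(0, m + 1))        # {exists e : ...}
        return first and second

    assert all(original_test(2 * m, j) == translated(m, j)
               for m in range(0, 30) for j in range(-10, 11))
    print("formulas agree on the sampled domain")
    # (Since 2m - 4e is always even, the second conjunct can never hold, so a
    # simplifier could in fact prove this particular guard never fires.)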
3.2 Constraint Propagation

We use a one-pass algorithm that propagates constraints throughout a program. We can use a one-pass algorithm because our value-number-based constraint representation allows us to ignore any constraints that would reach a statement along a back edge in the control flow graph. (The algorithms we present assume that the control flow graph is reducible (25), i.e., for a node within a natural loop, every path from the root to the node must include the loop header. Handling irreducible graphs correctly is straightforward: it is only necessary to avoid propagating constraints into or out of any nodes within an irreducible subgraph.) In particular, a variable that is modified inside a loop is assigned a new value number at the top of the loop that represents the merge of the value number for the variable entering the loop from the outside, and the value number from the previous iteration. Constraints on the value number entering the loop from the outside need not be invalidated inside the loop because all uses of the loop-variant variable in the loop will refer to one or more different value numbers defined in the body of the loop. However, by ignoring constraints along back edges, our propagation algorithm cannot directly compute constraints for iterative values defined in loops. Instead, for simple iterative values (e.g., an auxiliary induction variable that is a linear function of a loop index variable) our value numbering package computes symbolic range information which provides us with the appropriate constraints for these iterative values. We have not found it necessary to compute constraints on more complex iterative constructs, as briefly discussed in Section 3.4.

Because we can safely ignore constraints along backward control flow edges, we perform our constraint propagation on the control dependence graph (CDG), which is a natural representation for logical constraint propagation. A node n in a control flow graph is control dependent on another node b if and only if there is a path from b to the exit node that does not execute n, and an outgoing edge from b that (if taken) is guaranteed to execute n. The node b must be a branch node, and the CDG will include an edge from b to n. A statement inside an ordinary loop (without loop exits) has a single control dependence predecessor, i.e., the CDG has a single edge from the loop header node to the statement. Therefore, ordinary loops do not cause cycles in the CDG, and so backward control flow in ordinary loops is automatically ignored with this representation. Non-trivial back edges (forming a cycle) in the CDG occur due to constructs such as a jump out of a loop, but these too can be safely ignored as described above because any references to loop-variant values inside the loop will refer to different value numbers than those entering the loop with constraints along forward control dependence edges.

Figure 4 shows our algorithm for computing path constraints at each conditional branch node in a program.
The inputs to the algorithm are the constraints computed from conditional branches, loop bounds and strides, and assertions. For static analysis, we assume that an assertion is valid at any statement that is dominated by the assertion but is prior to any redefinition of any of the variables in the assertion. All IF statements are assumed to be in IF-THEN-ELSE form (dHPF converts other IF statements to this form). Conditions tested by other types of branch nodes in Fortran, such as computed GOTOs, are conservatively ignored. The goal of the algorithm is to compute the incoming constraints, outgoing constraints, and valid assertions at each conditional branch point in the program, as defined in Figure 4. The outgoing control dependence (CD) edges from a node are assigned a label identifying the type of edge, namely IF_TRUE and IF_FALSE for an IF statement, and DO_ENTER and DO_FALLTHROUGH for a DO loop. Separate outgoing constraints are computed for each type of outgoing edge at a branch.

We begin by initializing the incoming and outgoing constraints at each branch to false, and the assertions that apply to each outgoing CD edge type to true. Next, we annotate CD edges with constraints from logical assertions. Although an assertion is only valid at subsequent statements, once the logical expression is translated in terms of variable value numbers, the expression is globally true throughout the entire program and can be safely applied at any statement. We exploit this property as follows to treat assertions uniformly with conditional branch expressions. We apply the constraints from an assertion s to all statements that have the same control dependence relationship with a common CD parent, even though this includes statements that precede s. In other words, for each statement s with CD edge b → s of type t, we associate the assertion constraints with all CD edges of type t emanating from b. The second phase in Figure 4 propagates these assertion constraints forward along the CD edges to which they apply.

The final phase of the propagation in Figure 4 computes incoming and outgoing path constraints at each conditional branch node. These constraints are computed for each node in a reverse post order traversal. In reverse post order, all control dependence predecessors of a node along forward edges are visited before the node itself. This evaluation order enables transitive constraints to be propagated along all forward control dependence edges. At each conditional branch node b, we compute the incoming path constraints that hold when b is reached along any path. We compute these as a disjunction of the constraints along each incoming control dependence edge. The incoming constraints along a control dependence edge are simply the intersection of the outgoing and assertion constraints from b's predecessor for that type of edge. Predecessors across back edges will contribute outgoing constraints of false (the identity element for logical disjunction) since they have not been visited yet. As described earlier in this section, it is safe for us to ignore constraints along CD back edges; our initialization accomplishes this simply. Next, for each CD edge type leaving b, we set the default outgoing path constraints to the incoming path constraints for b. Finally, depending on b's node type, we fold the local constraints enforced by b into the different types of outgoing control dependence edges as appropriate.
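The sketch below gives a toy rendition of this propagation phase in Python. Constraints are modeled as sets of literal strings (conjunctions) and disjunctions as lists of such sets, standing in for Presburger formulas and the Omega library; the three-node control-dependence graph is invented for the example.

    from collections import defaultdict

    FALSE = []                                 # empty disjunction = unsatisfiable

    cdg = {                                    # parent -> [(edge type, child)]
        "entry": [("IF_TRUE", "b1")],
        "b1":    [("IF_TRUE", "b2")],
        "b2":    [],
    }
    local = {                                  # constraint each branch imposes
        "entry": {"IF_TRUE": frozenset({"n>0"})},
        "b1":    {"IF_TRUE": frozenset({"k<=n"})},
        "b2":    {},
    }

    incoming = defaultdict(lambda: list(FALSE))
    outgoing = defaultdict(dict)
    incoming["entry"] = [frozenset()]          # the entry is reached unconditionally

    for b in ("entry", "b1", "b2"):            # a reverse post order of the CDG
        for etype, child in cdg[b]:
            # outgoing = incoming plus the branch's own constraint on this edge type
            out = [conj | local[b].get(etype, frozenset()) for conj in incoming[b]]
            outgoing[b][etype] = out
            incoming[child] = incoming[child] + out   # disjunction over CD edges

    print(sorted(incoming["b2"][0]))
    # -> ['k<=n', 'n>0']: b2 is reached only when both enclosing tests held

Because unvisited back-edge predecessors contribute the empty disjunction (false), ignoring them falls out of the initialization, just as in the text.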
3.3 Control-Flow Simplification

Figure 5 uses the constraints computed at each conditional branch node to simplify a procedure's control flow. If the outgoing edge constraints for a loop entry or the true branch of a conditional are unsatisfiable, we eliminate its code by calling SimplifyInfeasibleCode. In the case of an IF, the entire IF statement is replaced with the statements in the false branch, if any. For conditionals, one of two further simplifications is possible. If the incoming constraints at a logical IF are as strict as the outgoing constraints on the true branch, the conditional will always evaluate to true when reached. Therefore, we replace the entire IF statement by the statements in the true branch using SimplifyTautologicalGuard. Otherwise, we can try to simplify the guard condition by eliminating those constraints in the guard that are guaranteed by the incoming constraints. For two constraint formulae f1 and f2, the operation Gist(f1, f2) computes a (possibly simpler) set of inequalities f such that f ∧ f2 ⇒ f1. Applying the Gist operation to the outgoing constraints given the incoming constraints returns a conservative approximation of the non-redundant constraints. The function SimplifyGuard then invokes mmcodegen on the simplified constraints to regenerate a simpler guard.

Currently, the Omega library's implementation of the Gist operation requires that the second parameter be a single conjunct. We must accommodate this limitation when implementing the control flow simplification algorithm shown in Figure 5. It is inappropriate to just use hull(incoming[b]) (the convex hull of the constraints) as the second parameter to Gist here, since the precise value of incoming[b] has already been incorporated into outgoing[b][etype]. (Doing so without adjusting outgoing[b][etype] accordingly could result in more complex guards.) Instead, we augmented the constraint propagation algorithm in Figure 4 to also compute incomingh[b] := hull(incoming[b]), and outgoingh, which is computed in the same fashion as outgoing except that it is initialized differently: outgoingh[b][*] := incomingh[b]. With these variables, we compute simplified constraints as Gist(outgoingh[b][etype], incomingh[b]), which gives us just the new constraints for the conditional branch that have not already been guaranteed by the incoming constraints.
4 Experimental Results In this section, we present some experiments to evaluate the eectiveness of the constraint-based control ow simpli cation algorithm, when used by the dHPF compiler for optimizing the Fortran 77 SPMD code it generates.
12
4.1 Methodology For our study, we used two dierent versions of the generated code for each of three benchmarks. The two versions for each benchmark dier in whether or not dHPF was allowed to use overlap regions for holding o-processor data for shift communication. As described in Section 2, overlap areas allow non-local data (received in regular shift communication patterns) to be referenced uniformly using the same subscripted reference that accesses local data; otherwise a guard is needed to coordinate use of dierent access methods for local and non-local data. We consider the code without overlap areas because overlap areas have some potentially signi cant disadvantages: (a) the use of an overlap area may introduce an extra memory-tomemory copy when data cannot be received directly into the overlap area, (b) an overlap area for an array can consume a large amount of memory, typically for the entire lifetime of the program, and (c) implementing overlap areas in any general (and memory-ecient) way requires interprocedural analysis. In either version, for any communication not using overlap areas, non-local data is used directly out of the buers without unpacking, if the required buer indexing expressions are possible to compute. For all these experiments, non-local index set splitting was turned on. We use two metrics to evaluate the eectiveness of our algorithms: static code size (number of conditionals as well as total code length) and program execution time. We used the number of words of generated Fortran 77+MPI code to measure code length, rather than the number of lines, because it is better able to account for partial simpli cation of conditionals, and it provides a better estimate for the static number of data and memory operations in the program. To measure the impact on execution time, we collected performance measurements on 4-16 processors of a 64-processor IBM SP-2. The processors on the SP2 used for the measurements were 66MHz thin nodes. The Fortran 77 SPMD code generated by dHPF was compiled with the IBM xlf compiler using -02 optimization. The measurements were collected on idle nodes of the SP2, as far as possible. Communication among the processes was performed using MPI (24) over the SP2 high-performance dataswitch, using the user-level message-passing (us) communication substrate (the most ecient). Since the measurements were collected in unprivileged mode, they can re ect perturbation by any operating system activity on any of the nodes. For this reason, at least 5 executions of each of the two variants of the benchmark were taken and the lowest measured time for each variant is reported in the tables.
4.2 Results for Individual Benchmarks Tomcatv Tomcatv is one of the SPEC oating point benchmarks. We used the version from SPEC 92, and increased
13
the problem size from 257 257 to 514 514 to make it more suitable as a parallel benchmark. We speci ed a (block, *) data distribution for the two-dimensional arrays onto a linear array of 4 processors. (Parallelizing the second dimension is very expensive because it generates numerous ne-grain messages.) Figure 1 shows the measured values of the above metrics for the versions of Tomcatv with and without overlap areas. There were two main reasons for the large reduction in static code size. The major reduction occurred because some relatively complex communication patterns resulted in redundant guards. The initialization phase involves transpose communication between the left and top edges of the matrices, as well as several non-local references to individual corner elements. In the code generated, both computation partition (CP) for the communication operation and the code for the communication operation itself (which is synthesized oblivious of the CP context) each enforced which processors participate. Some of the resulting redundant guards were then replicated as many as four times as the computation partitioning generated multiple loop nests in the course of tiling the iteration spaces for a processor. Although eliminating redundancy in these guards results in substantial code size reductions, the impact on performance is small because the bulk of them are in the initialization code outside the main convergence loop. The second main reason for code reduction was through elimination of ownership guards introduced for non-local references, particularly in the version without overlaps. (Perhaps not surprisingly, in all our experiments, we found little correlation between overall code size reductions and the corresponding reduction in execution time.) Another interesting fact emerging from the numbers in the table is that after range propagation, the version without overlap areas actually slightly outperforms the version with overlap areas. Overlap areas are widely considered an essential optimization for regular data-parallel compilers. Even before constraintbased control- ow simpli cation, however, the tradeo in using overlap areas is apparent from an inspection of the code: their use eliminates a large number of ownership guards in the non-local sections of the loop iteration spaces, but also introduces unpacking operations for six messages in each iteration. When not using overlaps, data from these messages is directly accessed out of the buers they are received in. Overall, overlap areas provide slightly higher performance before constraint-based control- ow simpli cation. However, after control- ow simpli cation, virtually all the ownership guards within the main convergence loop of Tomcatv are eliminated, because all the non-local references access non-local data in identical sets of iterations. (This is only possible because of the non-local index set splitting, without which the guards would be needed on all non-local references in all iterations.) Now, the unpacking of message buers into overlap areas is almost pure overhead. Overall, the combination of non-local index set splitting and constraint propagation provides an eective alternative to overlap areas in Tomcatv. The other two benchmarks we studied are small, fairly simple kernels. Even so, though they still show 14
signi cant reductions in static code size and some slight improvements in run-time performance from control simpli cation.
Erlebacher Erlebacher is a partial dierential equation solver based on Alternating Direction Implicit (ADI) inte-
gration. We used arrays of size 128 128 128, distributed (*,*,block) onto 16 processors. With this distribution, one of the 3 symmetric phases of the application contains most of the communication. In particular, this phase contains two key pipeline stages that are most important in determining its scalability and performance. We focused on this phase of Erlebacher for our experiments. The code fragment used as an example in Section 2 is abstracted from the second of the two pipeline stages of this phase. Table 2 shows the results of our experiments for this phase of Erlebacher. The data shows signi cant reductions in the static number of if statements in the code, in both versions (with and without overlap areas). Execution time shows a small but noticeable improvement, particularly in the version without overlap areas. In the version with overlaps, the primary source of improvement is eliminating the conditionals checking for communication within the main section of the pipelined loop (as illustrated in Figures 2 and 3) The version without overlap areas also bene ts from eliminating ownership guards on non-local sections, although only in the boundary (non-local) iterations. The version with overlap areas does not require these guards.
Jacobi The third kernel we examine is Jacobi, a simple stencil code for solving partial dierential equations. We used arrays of size 2048 2048 distributed (block,block) on 16 processors to minimize the communication to computation ratio. This kernel has very regular, coarse-grain, shift communication patterns in all four directions, and the combination of message vectorization and overlap areas are sucient to provide very good performance. Although the number of if statements is substantially reduced in both versions (with and without overlap areas), the huge number of oating point operations dilutes the impact on total execution time.
4.3 Summary and Discussion The experiments presented here indicate that control- ow simpli cation based on constraint propagation consistently provides substantial improvements in the number and complexity of conditional branches, but its impact on performance is small. The key reason why the performance impact of eliminating these guards 15
is small is the eectiveness of the loop splitting transformations of the dHPF compiler which separate the iterations that access local-only data from those that access potentially non-local data. After this splitting, the most signi cant computational loop operates only on local data and contains no compiler-generated guards. Any improvements made in minimizing guards occur only at the processor boundaries of the data partitions. For this reason, improvements are proportional to the surface to volume ratio, namely the ratio of the number of iterations accessing non-local data at the boundaries versus the number of iterations accessing local-only data in the interior. The eectiveness of the algorithm for eliminating if statements is important for the following reason. Because some applications may see signi cant performance degradation from complex control ow, compiler optimizations must be concerned with avoiding this performance impact. Because constraint propagation is so eective at simplifying conditional branches, however, other compiler optimizations do not have to burdened with the need to minimize the control ow complexity of the generated code; they can rely on a single powerful post-pass optimization to do so.
5 Related Work Several previous systems have developed techniques for propagating and using constraint information for dierent purposes. These include debugging (7) , optimizing embedded system software (16) , optimizing array bounds checking (13; 20) , improving dependence analysis (5) , array privatization (5; 27) , and constant propagation (9) . Our work is speci cally aimed at optimizing the control- ow generated by a parallelizing compiler. This has led to two principal dierences between our approach and those used in previous systems. First, most previous systems use iterative data ow techniques to determine value ranges of variables, whereas we use a simpler and much more ecient (but somewhat less precise) non-iterative algorithm based on the control dependence graph. Second, most previous systems have used simpler representations of constraints limited to ranges or bounds for integer values, whereas we permit much more general logical combinations of integer inequalities (but with constant coecients). Both these choices in our system are well-suited to analyzing integer conditional expressions that arise in compiler-generated SPMD code. In addition to these dierences in the underlying algorithms, we have shown experimental evidence that a global control- ow simpli cation phase signi cantly reduces the control ow complexity and code size and can help simplify the implementation of other code generation steps in a parallelizing compiler. Below, we discuss brie y the principal dierences between our algorithms and those used in previous work. Most previous techniques for variable range analysis use an iterative approach related to abstract interpretation (5; 7; 13; 20; 16) . Such expensive techniques are usually justi ed for the intended applications, 16
e.g., Bourdoncle's Syntox system (7) for static and formal debugging of programs, and Johnson's Protel system (16) for optimizing imbedded software for telecommunication systems. Blume and Eigenmann (5) use an iterative data ow analysis algorithm similar to that of Bourdoncle for computing range information to support powerful dependence analysis and other optimizations in the context of a parallelizing compiler. They incorporate a demand-driven algorithm that only constructs range information for variables when desired. Compared to these iterative approaches, our one-pass propagation along control dependence edges is simpler. Without iteration, we nonetheless are able to construct constraints for simple iterative values (linear auxiliary induction variables) directly from value numbers. This information seems sucient for simplifying compiler-generated data-parallel control ow because data-parallel compilers typically do not introduce complex recurrences in the generated code. Another distinctive feature of our approach is the richness of the constraint model we have used for control
ow simpli cation. We leverage the Omega library developed by Pugh et al (18) , which uses techniques based on Fourier-Motzkin elimination (22) to support representation and manipulation of constraints expressed as formulae in Presburger arithmetic. Most previous work on analysis of constraints on variable values has focused on much simpler cases, in particular, value ranges delimited by constants (6; 7; 20; 21) , ranges delimited by symbolics (13; 5; 27) or equality constraints (9) . Johnson (16) uses the most similar constraint representation which consists of unquanti ed logical assertions about variable values. Omega's support for quanti ed formulae enables us to reason about strided sequences of values, which can be important for codes using cyclic data distributions or for applications such as multi-grid interpolation (4) . Furthermore, by using Fourier-Motzkin elimination (FME), we can detect when a conditional is redundant not only because it has been tested directly along all incoming paths, but also when its truth or falsity is mathematically implied by other constraints. Constraint implication using this approach is strictly stronger than methods used previously in analysis of subscript range checks (20) . There are two primary disadvantages of using techniques based on FME: the solution complexity of FME, and restricted handling of symbolic coecients in inequalities. The worst-case complexity of FME is exponential in the number of variables, but the cost appears quite reasonable in practice since the depth of the control dependence graph is small and there is typically only a modest degree of coupling between constraints. FME does not directly support symbolic coecients in inequalities, which could be a signi cant limitation for guards that arise with cyclic or cyclic( ) distributions on a symbolic number of processors. In the Polaris compiler, Blume and Eigenmann (5) use a more powerful but ad hoc symbolic simpli cation technique that supports products of symbolic terms (although they focus on ranges rather than general integer inequalities). This is motivated by their intended application, namely dependence analysis, which they aim to make more powerful than just handling ane expressions. These techniques, however, can be k
17
quite expensive, as their experimental results show (5) . Tu and Padua (27) describe an approach for computing symbolic variable ranges using gated single assignment (GSA) form to improve array privatization analysis. Through its gate predicates, GSA form encodes an implicit representation of the path constraints on values, including constraints along back edges. Thus, the GSA approach can support precise analysis of iterative values, whereas our approach only handles linear iterative values. The GSA representation must be used with care: naive expansion of GSA-based value numbers into explicit constraints can cause an exponential explosion in the complexity of constraints. We experimented with using GSA for expanding merge nodes and iterative values (as described in Section 3.4), but the increased cost due to the expansion in the size of the logical expressions did not appear to be justi ed for control ow simpli cation. Tu and Padua use demand-driven elaboration of GSA-based value numbers to perform symbolic comparisons while avoiding the exponential explosion in many cases. The most signi cant dierence between Tu and Padua's work and our approach arises out of the intended applications of these techniques. Tu and Padua focus on using GSA form to compute symbolic ranges to improve array privatization analysis. Since array privatization requires compile-time proof of inequalities, they focus on bounds and symbolic comparisons where one expression contains a subset of the non-constant literals in the other. In contrast, we use a much stronger method (FME) for testing the satis ability of more general conditionals given more general path constraints, which is more suited to control ow simpli cation. Finally, some previous systems have used techniques that can be combined with ours to improve the eectiveness of control ow simpli cation. These systems use partial redundancy elimination to optimize conditionals implementing subscript range checks (20) , and to guide node splitting along control ow paths to eliminate redundancy among conditionals (6; 21) . These approaches not only eliminate conditionals that are redundant, but also reduce the dynamic execution frequency of conditionals that are redundant along some paths through code restructuring. This work is largely orthogonal to ours: we focus on identifying and eliminating redundancy in complex constraints, whereas they focus on transformations to eliminate partially redundant conditionals that arise out of simple constraints (i.e. comparison of a variable w.r.t. a constant).
6 Conclusions and Future Work We have presented an algorithm for discovering and propagating constraints on the values of variables in a program, and its use for simplifying conditional branches and loops. The principal advantages of our algorithm are that it is non-iterative and much simpler than previous algorithms, it achieves the most common bene ts of the previous data ow-based iterative approaches, and it considers not just ranges but arbitrary ane constraints on integer variables. The algorithm achieves these bene ts by exploiting several 18
key analysis techniques, namely, control-dependence analysis, global value numbering based on static singleassignment form, and simpli cation of integer constraints expressed as formulae in Presburger arithmetic using a technique based on Fourier-Motzkin elimination (FME). A disadvantage of using constraint analysis and simpli cation based on FME is that non-ane terms cannot be simpli ed. If this is a signi cant concern (e.g., for applications other than simplifying control ow), the algorithm could still be used by using a dierent technique for symbolic range simpli cation. Our experimental results have shown that control- ow simpli cation is very eective in simplifying the generated code, eliminating from 35% to 67% of the generated if statements in the benchmarks we studied. Furthermore, control- ow simpli cation combines with other transformations such as non-local index set splitting to achieve many of the bene ts of speci c, complex code optimizations such as vector message pipelining and overlap areas. The results in this paper make a case that control- ow simpli cation using constraint-propagation can be a useful tool in a parallelizing compiler. In such a compiler, it is faster and simpler to carry out code generation steps without full contextual information. Our results show that a late optimization step such as control- ow simpli cation can exploit global contextual information about the penultimate parallelized program to greatly simplify the generated code. It would be interesting to compare our code generation approach with the alternative approach (common at least in research compilers) of reanalyzing the generated code between optimization phases. The latter approach only provides information about preceding passes at any point during code generation, and requires full program reanalysis of increasingly complex code (or incremental updates of analysis results). In the future, it would be interesting to study the impact of our constraint propagation algorithm on the accuracy of other analyses such as dependence analysis, constant propagation, and array privatization.
Acknowledgments Paul Havlak provided valuable input in early discussions about simplifying control ow. Also, his global value numbering package is a cornerstone of our methods. This work has been supported in part by DARPA Contract DABT63-92-C-0038, the Texas Advanced Technology Program Grant TATP 003604-017, an NSF Research Instrumentation Award CDA-9617383, and sponsored by the Defense Advanced Research Projects Agency and Rome Laboratory, Air Force Materiel Command, USAF, under agreement number F30602-96-1-0159. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted 19
as representing the ocial policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency and Rome Laboratory or the U.S. Government.
References [1] Vikram Adve and John Mellor-Crummey. Advanced code generation for High Performance Fortran. In Languages, Compilation Techniques and Run Time Systems for Scalable Parallel Systems, Lecture Notes in Computer Science Series. Springer-Verlag, 1997. [2] Vikram Adve and John Mellor-Crummey. Using integer sets for data-parallel program analysis and optimization. In Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998. [3] ANSI X3J3/S8.115. Fortran 90, June 1990. [4] D. Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS parallel benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, December 1995. [5] W. Blume and R. Eigenmann. Demand-driven symbolic range propagation. In Proceedings of the Eighth Workshop on Languages and Compilers for Parallel Computing, pages 141{160, Columbus, OH, August 1995. [6] Rastislav Bodik, Rajiv Gupta, and Mary Lou Soa. Interprocedural conditional branch elimination. In ACM SIGPLAN '97 Conference on Programming Language Design and Implementation, Las Vegas, NV, June 1997. [7] Francois Bourdoncle. Abstract debugging of higher-order imperative languages. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation, pages 46{55, June 1993. [8] Z. Bozkus, L. Meadows, S. Nakamoto, V. Schuster, and M. Young. Compiling High Performance Fortran. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scienti c Computing, pages 704{709, San Francisco, CA, February 1995. [9] Preston Briggs, Linda Torczon, and Keith D. Cooper. Using conditional branches to improve constant propagation. Technical Report CRPC-TR95533, Center for Research on Parallel Computation, Rice University, April 1995. [10] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and K. Zadeck. Eciently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451{ 490, October 1991. [11] M. Gerndt. Updating distributed variables in local computations. Concurrency: Practice and Experience, 2(3):171{193, September 1990. [12] M. Gupta, S. Midki, E. Schonberg, V. Seshadri, D. Shields, K. Wang, W. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proceedings of Supercomputing '95, San Diego, CA, December 1995.
[13] W. H. Harrison. Compiler analysis of the value ranges for variables. IEEE Transactions on Software Engineering, SE-3(3):243–250, May 1977.

[14] Paul Havlak. Interprocedural Symbolic Analysis. PhD thesis, Dept. of Computer Science, Rice University, May 1994. Also available as CRPC-TR94451 from the Center for Research on Parallel Computation and CS-TR94-228 from the Rice Department of Computer Science.

[15] High Performance Fortran Forum. High Performance Fortran language specification. Scientific Programming, 2(1-2):1–170, 1993.

[16] Harold Johnson. Data flow analysis of 'intractable' imbedded system software. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, pages 109–117, 1986.

[17] W. Kelly, W. Pugh, and E. Rosser. Code generation for multiple mappings. In Frontiers '95: The 5th Symposium on the Frontiers of Massively Parallel Computation, McLean, VA, February 1995.

[18] Wayne Kelly, Vadim Maslov, William Pugh, Evan Rosser, Tatiana Shpeisman, and David Wonnacott. The Omega Library Interface Guide. Technical report, Dept. of Computer Science, Univ. of Maryland, College Park, April 1996.

[19] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440–451, October 1991.

[20] Priyadarshan Kolte and Michael Wolfe. Elimination of redundant array subscript range checks. In ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 270–278, La Jolla, CA, June 1995.

[21] Frank Mueller and David B. Whalley. Avoiding conditional branches by code replication. In ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 56–66, La Jolla, CA, June 1995.

[22] W. Pugh. A practical algorithm for exact array dependence analysis. Communications of the ACM, 35(8):102–114, August 1992.

[23] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, Portland, OR, June 1989.

[24] Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete Reference. MIT Press, 1995.

[25] R. E. Tarjan. Testing flow graph reducibility. Journal of Computer and System Sciences, 9:355–365, 1974.

[26] C.-W. Tseng. An Optimizing Fortran D Compiler for MIMD Distributed-Memory Machines. PhD thesis, Dept. of Computer Science, Rice University, January 1993.

[27] Peng Tu and David Padua. Gated SSA-based demand-driven symbolic analysis for parallelizing compilers. In Proceedings of the 1995 ACM International Conference on Supercomputing, Barcelona, Spain, July 1995.
      parameter (N=64)
      real c(N), f(N,N,N)
CHPF$ processors p(4)
CHPF$ distribute f(*,*,block) onto p
CHPF$ distribute c(block) onto p
      do j=1,N
        do k=N-1,1,-1
C         SEND f(1:N,j,k+1)                         ! ON_HOME f(1:N,j,k+1)
C         RECV f(1:N,j,k+1)                         ! ON_HOME f(1:N,j,k)
          do i=1,N
            f(i,j,k) = f(i,j,k) - c(k)*f(i,j,k+1)   ! ON_HOME f(i,j,k)
          enddo
        enddo
      enddo
Figure 1: HPF source fragment abstracted from the Erlebacher benchmark.
      do j = 1, 64
        if (pmyid1 <= 2) then                               ! boundary section: k = 16*pmyid1 + 16
          k = 16 * pmyid1 + 16
          if (16 * pmyid1 >= k - 15) then                   ! --> UNSATISFIABLE
            COMPUTE f(1:64, j, k)
          if (16 * pmyid1 == k - 16 && pmyid1 <= 2) then    ! --> TAUTOLOGY
            COMPUTE f(1:64, j, k)
          if (16 * pmyid1 == k && pmyid1 >= 1) then         ! --> UNSATISFIABLE
            SEND f(1:64, j, 16 * pmyid1 + 1)
        do k = 16 * pmyid1 + 15, 16 * pmyid1 + 1, -1        ! interior section
          if (16 * pmyid1 >= k - 15) then                   ! --> TAUTOLOGY
            COMPUTE f(1:64, j, k)
          if (16 * pmyid1 == k - 16 && pmyid1 <= 2) then    ! --> UNSATISFIABLE
            COMPUTE f(1:64, j, k)
          if (16 * pmyid1 == k && pmyid1 >= 1) then         ! --> UNSATISFIABLE
            SEND f(1:64, j, k + 1)
        if (pmyid1 >= 1) then                               ! send section: k = 16*pmyid1
          k = 16 * pmyid1
          if (16 * pmyid1 == k && pmyid1 >= 1) then         ! --> TAUTOLOGY
            SEND f(1:64, j, k + 1)
Figure 2: Skeletal SPMD code for Fig. 1 with partitioned computation.
      do j = 1, 64
        if (pmyid1 <= 2) then
          k = 16 * pmyid1 + 16
          COMPUTE f(1:64, j, k)
        do k = 16 * pmyid1 + 15, 16 * pmyid1 + 1, -1
          COMPUTE f(1:64, j, k)
        if (pmyid1 >= 1) then
          k = 16 * pmyid1
          SEND f(1:64, j, k + 1)
Figure 3: Skeletal SPMD code for Fig. 2 after simplification.
Inputs:
  for each branch b:
    constraints[b] := the set of constraints on value numbers enforced at conditional
                      branch b; there are two cases for b:
      LOGICAL_IF: constraints[b] represents the constraints tested by the conditional
      DO_LOOP:    constraints[b] represents the loop bounds and stride constraints
  for each ASSERTION statement s:
    assert[s] := the constraints tested by the assertion

Outputs:
  for each branch b:
    incoming[b]      := constraints that hold when b is reached
    outgoing[b][t]   := for each CD edge type t leaving this node, the constraints
                        that hold when leaving b via a CD edge of type t
    assertions[b][t] := for each CD edge type t leaving this node, the constraints
                        asserted to hold when leaving b via a CD edge of type t,
                        where t ∈ { IF_TRUE, IF_FALSE, DO_ENTER, other }
Algorithm:
  // initialize results
  foreach conditional branch node b ∈ cdg
    incoming[b] := outgoing[b][*] := false
    assertions[b][*] := true

  // note assertion constraints
  foreach statement s ∈ cfg
    if s is an ASSERTION
      foreach edge e ∈ cdg, where e.sink = s
        assertions[e.source][e.type] := assertions[e.source][e.type] ∧ assert[s]

  // propagate guard constraints and assertions
  incoming[cdg.root] := true
  foreach conditional branch node b ∈ cdg (in reverse postorder)
    foreach edge e ∈ cdg, where e.sink = b
      incoming[b] := incoming[b] ∨ (outgoing[e.source][e.type] ∧ assertions[e.source][e.type])
    outgoing[b][*] := incoming[b]    // default: all outgoing constraints equal those incoming
    switch (b.type)                  // refine outgoing constraints of IF and DO statements
      case LOGICAL_IF:
        outgoing[b][IF_TRUE]  := outgoing[b][IF_TRUE] ∧ constraints[b]
        outgoing[b][IF_FALSE] := outgoing[b][IF_FALSE] ∧ ¬constraints[b]
        break
      case DO_LOOP:
        outgoing[b][DO_ENTER] := outgoing[b][DO_ENTER] ∧ constraints[b]
        break
Figure 4: Algorithm for propagating symbolic constraints on value numbers.
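The pass in Figure 4 is, in essence, a single sweep over the control dependence graph in topological (reverse postorder) order. The following Python sketch captures that shape under simplified, hypothetical data structures: a constraint is modeled as a frozenset of atomic predicate strings treated as a conjunction, and where Figure 4 forms the disjunction of the contexts arriving on incoming CD edges, the sketch conservatively intersects them (keeping only predicates that hold on every edge), which is sound but strictly weaker than the Presburger formulae dHPF manipulates.

    from collections import defaultdict

    # Edge types on the control dependence graph (CDG).
    IF_TRUE, IF_FALSE, DO_ENTER, OTHER = "IF_TRUE", "IF_FALSE", "DO_ENTER", "OTHER"

    def propagate(branches_rpo, cd_edges, constraints, negations, assertions):
        # branches_rpo: branch nodes in reverse postorder (root first)
        # cd_edges:     (source, edge_type, sink) control-dependence edges
        # constraints:  node -> frozenset of predicates tested at the branch
        # negations:    node -> frozenset for the negated test (IF_FALSE side)
        # assertions:   (node, edge_type) -> frozenset from ASSERTION statements
        incoming = {}
        outgoing = defaultdict(frozenset)
        preds = defaultdict(list)
        for src, etype, sink in cd_edges:
            preds[sink].append((src, etype))

        for b in branches_rpo:
            if not preds[b]:
                incoming[b] = frozenset()          # CDG root: empty context
            else:
                # Meet the contexts arriving on all incoming CD edges.
                ctxts = [outgoing[(s, t)] | assertions.get((s, t), frozenset())
                         for s, t in preds[b]]
                incoming[b] = frozenset.intersection(*ctxts)
            # Refine each outgoing edge's context with the branch's own test.
            outgoing[(b, IF_TRUE)] = incoming[b] | constraints.get(b, frozenset())
            outgoing[(b, IF_FALSE)] = incoming[b] | negations.get(b, frozenset())
            outgoing[(b, DO_ENTER)] = incoming[b] | constraints.get(b, frozenset())
            outgoing[(b, OTHER)] = incoming[b]
        return incoming, outgoing

    # Example: a loop guard k >= 1 enclosing the send test 16*p == k.
    inc, out = propagate(
        branches_rpo=["doK", "ifSend"],
        cd_edges=[("doK", DO_ENTER, "ifSend")],
        constraints={"doK": frozenset({"k >= 1"}), "ifSend": frozenset({"16p == k"})},
        negations={"ifSend": frozenset({"16p != k"})},
        assertions={})
    # out[("ifSend", IF_TRUE)] == frozenset({"k >= 1", "16p == k"})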
Inputs:
  for each branch b:
    incoming[b]      := constraints that hold when b is reached
    outgoing[b][t]   := for each meaningful CD edge type t for this node, the constraints
                        that hold when leaving b via a CD edge of type t
    assertions[b][t] := for each meaningful CD edge type t for this node, the constraints
                        asserted to hold when leaving b via a CD edge of type t

Algorithm:
  foreach conditional branch node b ∈ cdg (in reverse preorder)
    if b is a DO_LOOP or LOGICAL_IF
      etype := if b is a LOGICAL_IF then IF_TRUE else DO_ENTER
      if outgoing[b][etype] is FALSE
        SimplifyInfeasibleCode(b)
      else if b is a LOGICAL_IF            // nothing more to do for a DO_LOOP
        if incoming[b] implies outgoing[b][etype]
          SimplifyTautologicalGuard(b)
        else
          // newConstraints is the residual test:
          // (newConstraints ∧ incoming[b]) implies outgoing[b][etype]
          newConstraints := Gist(outgoing[b][etype], incoming[b])
          SimplifyGuard(b, newConstraints)
Figure 5: Control flow simplification using constraints on variable values.
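The driver of Figure 5 then classifies each guard against its incoming context. Continuing the same hypothetical set-based encoding as the previous sketch, the code below stands in for the Omega library's Gist operation with a set difference: the residual guard is whatever the context does not already imply. SimplifyInfeasibleCode, SimplifyTautologicalGuard, and SimplifyGuard are represented here only by the action each would take on the AST.

    IF_TRUE, DO_ENTER = "IF_TRUE", "DO_ENTER"   # as in the previous sketch
    FALSE = None    # stand-in marker for an infeasible (unsatisfiable) context

    def gist(guard, context):
        # Set-based stand-in for Omega's gist: the conjuncts of `guard`
        # not already implied by `context`.
        return guard - context

    def classify_guards(branches_rpre, kinds, incoming, outgoing):
        # branches_rpre: DO_LOOP / LOGICAL_IF nodes in reverse preorder
        # kinds:         node -> "LOGICAL_IF" or "DO_LOOP"
        # Returns node -> action, mirroring Figure 5's three outcomes.
        actions = {}
        for b in branches_rpre:
            etype = IF_TRUE if kinds[b] == "LOGICAL_IF" else DO_ENTER
            taken = outgoing[(b, etype)]
            if taken is FALSE:
                actions[b] = "delete guarded code"         # SimplifyInfeasibleCode
            elif kinds[b] == "LOGICAL_IF":                 # nothing more for a DO loop
                residual = gist(taken, incoming[b])
                if not residual:                           # context implies the guard
                    actions[b] = "hoist body, drop test"   # SimplifyTautologicalGuard
                else:
                    actions[b] = ("narrow test to", residual)   # SimplifyGuard
        return actions

In dHPF these outcomes rewrite the generated SPMD code directly; the sketch merely reports them, since the set encoding cannot express the full implication tests FME provides.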
                              Without overlaps                With overlaps
                           before    after  reduction    before    after  reduction
words                        9602     6465        33%      8627     6648        23%
if-statements                 228       78      66.6%       204       88      56.8%
execution time (seconds)    16.31    13.79      15.4%     15.96    14.20      11.0%
Table 1: Statistics for 4-processor SPMD code for Tomcatv before and after control-flow simplification.
                              Without overlaps                With overlaps
                           before    after  reduction    before    after  reduction
words                        4406     3292        25%      4376     3329        24%
if-statements                  80       38      52.5%        76       37        51%
execution time (seconds)     2.30     2.22       3.5%      2.29     2.24       2.2%
Table 2: Statistics for 16-processor SPMD code for Erlebacher before and after control-flow simplification.
                              Without overlaps                With overlaps
                           before    after  reduction    before    after  reduction
words                        9515     6829        20%      5803     5672         2%
if-statements                 141       79        44%        97       63        35%
execution time (seconds)     2.25     2.23         1%      2.20     2.16         2%
Table 3: Statistics for 16-processor SPMD code for Jacobi before and after control-flow simplification.