On Privatization of Variables for Data-Parallel Execution
Manish Gupta IBM T. J. Watson Research Center P. O. Box 218 Yorktown Heights, NY 10598
[email protected]
Abstract

Privatization of data is an important technique that compilers use to parallelize loops by eliminating storage-related dependences. When a compiler partitions computations based on the ownership of data, selecting a proper mapping of privatizable data is crucial to obtaining the benefits of privatization. This paper presents a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We show that there are numerous alternatives available for mapping privatized variables, and that the choice of mapping can significantly affect the performance of the program. We present an algorithm that attempts to preserve parallelism and minimize communication overheads. We also introduce the concept of partial privatization of arrays, which combines data partitioning and privatization and enables efficient handling of a class of codes with multidimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. An implementation of these ideas in the pHPF prototype compiler for High Performance Fortran on the IBM SP2 machine has shown impressive results.

1 Introduction

Parallelizing compilers have traditionally used scalar privatization [3] to eliminate the storage-related anti and output dependences associated with scalar variables inside loops. Under this technique, private copies of the variable are created for each processor participating in the parallel execution of a loop, so that writes followed by read accesses of these variables in iterations executed by different processors do not interfere with each other. More recent studies have pointed out the importance of applying this technique to arrays as well, in order to expose parallelism in loops that otherwise appear to be sequential [6]. A number of research compilers now aggressively perform array privatization [18, 10, 8], which has contributed significantly to the capability of these compilers to recognize coarse-grain parallelism in programs.

Privatization assumes an even greater importance for parallel execution of programs on machines with high interprocessor communication costs, which include most distributed-memory machines. The penalty for not privatizing a variable amounts to much more than merely the loss of parallelism; it can lead to very high communication overheads. Under control-based parallelization, in which the entire loop body corresponding to an iteration (or chunk of iterations) of a parallel loop is assigned as a unit to a processor, the actual steps of privatizing a variable, once it has been recognized as privatizable, are relatively straightforward. However, further analysis is needed under data-driven parallelization, in which the assignment of computation to processors is based on ownership of data, so that all statements corresponding to a loop iteration may not be executed by the same processor. An example of such a method of parallelization is the owner-computes rule [11], which assigns computation to the processor that owns the data being modified in that computation. The owner-computes rule and its generalized variants (where the computation may be assigned to processors that own some other data not being modified) are followed by most compilers for languages like High Performance Fortran (HPF) [13, 11, 19, 14, 2, 9, 1]. Many of these compilers have not paid adequate attention to the problem of mapping privatizable scalar and array variables.

This paper presents a framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We describe different alternatives available for mapping privatized variables, and show how that choice influences parallelization and communication overheads. We present an algorithm to select the alignment of privatizable scalar and array variables that attempts to preserve parallelism and minimize communication overheads. Our algorithm is guided by a realistic communication cost model which takes into account the placement of communication, and hence optimizations like message vectorization. We introduce a novel concept of partial privatization of arrays that combines data partitioning and privatization, and enables efficient handling of a class of codes with multi-dimensional data distribution that was not previously possible. Finally, we show how the ideas of privatization apply to the execution of control flow statements as well. The ideas presented in this work have been implemented in the pHPF prototype compiler for HPF [9]. Our preliminary results are very encouraging.

      !HPF$ Align (i) with A(i) :: B, C, D
      !HPF$ Align (i) with A(*) :: E, F
      !HPF$ Distribute (block) :: A

            m = 2
            do i = 2, n-1
      S1:      m = m + 1         ! induction variable
      S2:      x = B(i) + C(i)   ! align with consumer
      S3:      y = A(i) + B(i)   ! align with producer
      S4:      z = E(i) + F(i)   ! no alignment
      S5:      A(i+1) = y/z
      S6:      D(m) = x/z
            enddo

Figure 1: Different alignments of privatized scalars
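The owner-computes assignment of loop iterations can be sketched as follows. This is our own minimal illustration, not pHPF internals: the function names and the block-distribution formula are assumptions.

```python
# Hypothetical sketch of the owner-computes rule for a 1-D
# block-distributed array: each statement instance executes on the
# processor that owns the element being assigned.

def owner(i, n, nprocs):
    """Processor owning element i of an n-element block-distributed array."""
    block = (n + nprocs - 1) // nprocs   # ceiling division: block size
    return i // block

def partition_iterations(n, nprocs):
    """Assign iteration i (which writes A(i)) to the owner of A(i)."""
    work = {p: [] for p in range(nprocs)}
    for i in range(n):
        work[owner(i, n, nprocs)].append(i)
    return work
```

With n = 8 and 4 processors, each processor owns a block of two elements, so processor 0 executes iterations 0 and 1, processor 1 iterations 2 and 3, and so on.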
2 Mapping of Privatized Scalars

2.1 Alignment Choices
We shall first illustrate the need for scalar privatization and for different kinds of alignment using the example shown in Figure 1. It is necessary to privatize each of the variables m, x, y, and z to achieve partitioned execution of the loop, as replication of any variable would force all processors to execute the assignment to that variable under the owner-computes rule. We refer to a statement in which a variable is defined as the producer statement for that variable, and a statement which uses the value of that variable as a consumer statement.

Alignment with Consumer. Consider the variable x: replicating it would lead to each processor executing the first statement in the loop in every iteration. Furthermore, that would require the values of B(1:n) and C(1:n) to be unnecessarily broadcast to all the processors. As part of privatization, if x is aligned with the producer reference B(i) or C(i) (i.e., owned by the same processor as the owner of B(i) and C(i)), no communication is needed to compute x. However, the value of x has to be communicated to the owner of D(m) (the value of m is known to be i+1 via induction variable analysis). This communication takes place inside the i-loop, because of a dependence from the definition of x to the use of x inside the loop. If x is aligned instead with the consumer reference D(m), communication is now needed for the two references B(i) and C(i) to the owner of D(m) in the computation of x. However, both of these communications can be moved outside the i-loop and carried out with a collective shift communication.

      !HPF$ Align G(i,j) with H(i,j)
      !HPF$ Align A(i) with H(i,*)
      !HPF$ Distribute (block,*) :: H

            do i = 1, n
               p = B(i)   ! not needed on all processors
               q = C(i)   ! needed on all processors
               A(i) = H(i,p) + G(q,i)
            enddo

Figure 2: Availability requirements for subscripts

The consumer reference for a read reference u is a reference r whose owner needs the value of u during the execution of that statement. Thus, in most cases, under the owner-computes rule, the consumer reference is the lhs (left-hand side) of the assignment statement. For special cases where a read reference, such as a subscript, is needed by all processors, the consumer reference is set to be a dummy replicated reference. As an optimization, for a reference which appears as a subscript of an rhs reference which does not need communication, pHPF sets the consumer reference to be the lhs reference, because only the processor executing that statement needs to know the value of the subscript. Thus, for the example shown in Figure 2, the consumer reference for p is A(i), and for q it is the dummy replicated reference.
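The consumer-reference rules above can be condensed into a small decision sketch. The context labels are our own naming, not pHPF's:

```python
# Sketch of consumer-reference selection for a read reference u
# (per the rules in the text; context labels are hypothetical names).

def consumer_reference(use_context, lhs):
    """use_context is one of: 'loop_bound', 'subscript_with_comm',
    'subscript_no_comm', 'plain_rhs'.  Returns the reference whose
    owner must receive u's value."""
    if use_context in ("loop_bound", "subscript_with_comm"):
        # The value must be broadcast: all processors need it.
        return "DUMMY_REPLICATED"
    # Otherwise only the processor executing the statement needs u,
    # so the lhs of the assignment serves as the consumer reference.
    return lhs
```

For Figure 2, p subscripts H(i,p), which needs no communication, so its consumer reference is A(i); q subscripts G(q,i), which does require communication, so q gets the dummy replicated reference.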
Alignment with Producer. The preferable alignment for the variable y is with the producer reference A(i) (or, equivalently, B(i)), which helps avoid the need for any communication in the computation of y, though communication is needed for statement S5. The alignment of y with the consumer reference A(i+1) would instead lead to an extra communication, i.e., communication of both rhs references on S3 to the owner of A(i+1), of which the communication for A(i) would take place inside the i-loop.
Privatization without Alignment. The variable z uses the values of the replicated array elements E(i) and F(i) in its computation. The alignment of z with these producer references would amount to replication, and hence is not desirable. Its alignment with one of the consumer references on statement S5 or S6 would lead to communication being required for the other statement. Since the data needed for the computation of z is available on all processors, z can be privatized without explicit alignment with any other reference. Each processor that executes an iteration of the i-loop under the computation partitioning, as determined by the partitioning of the other statements in that loop, "owns" and computes a temporary value of z in that loop iteration.

Any scalar variable recognized as an induction variable, such as m in Figure 1, should be privatized without alignment. The pHPF compiler replaces the rhs of the assignment statement by the closed-form expression for the value of the induction variable as a function of the surrounding loop indices. Thus, the expression "m + 1" on S1 is replaced by the expression "i + 1", which represents the closed-form value of the variable.

Privatization of a scalar without alignment impacts computation partitioning by ensuring that the statement assigning a value to that scalar is not forced to execute on all processors, which would happen if the scalar were replicated. There is no computation partitioning guard associated with the statement [9]. Hence, if that statement appears inside a loop, it is executed by the union of all processors executing any other statement inside that loop for a given iteration. If the statement appears outside any loop, it is executed by all processors. For the purpose of communication analysis, the scalar is viewed as if it has been replicated, and therefore no use of that scalar requires any communication. In fact, pHPF selects this mapping only when the computation of the scalar value requires no communication either.
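The producer-versus-consumer tradeoff illustrated for x above can be made concrete with a toy cost model. This is our own simplification, not the compiler's actual cost analysis, and the startup and per-element costs are assumed values:

```python
# Toy communication cost model (assumed constants, for illustration only).
STARTUP, PER_ELEM = 100.0, 1.0

def cost(num_messages, elems_per_message):
    return num_messages * (STARTUP + elems_per_message * PER_ELEM)

def producer_alignment_cost(n):
    # x aligned with B(i): its value must be sent to the owner of D(m)
    # inside the loop -- n small messages, one per iteration.
    return cost(n, 1)

def consumer_alignment_cost(n):
    # x aligned with D(m): the reads of B(i) and C(i) can be message-
    # vectorized into 2 collective shifts hoisted outside the loop.
    return cost(2, n)
```

Because startup cost dominates small messages, the vectorized consumer-aligned version wins for any sizable n, mirroring the argument in the text.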
2.2 Algorithm to Determine Mapping

The mapping of scalar variables and privatizable array variables is determined in a separate, first pass during the communication analysis phase of the pHPF compiler. It follows an earlier program analysis phase which constructs the static single assignment (SSA) [5] representation of the program and performs constant propagation and induction variable recognition. pHPF uses the SSA representation to associate a separate mapping decision with each assignment to a scalar. This permits flexibility in choosing an appropriate mapping when logically different variables used in different segments of a procedure happen to reuse a name. However, for simplicity of communication analysis and code generation, the compiler imposes the restriction that, given a use (read reference) of a scalar variable, all reaching definitions are given an identical mapping. Thus, during the later phases of compilation, there is no scope for ambiguity regarding the mapping of a scalar value: it can be determined by obtaining the mapping information recorded with the first reaching definition of that reference. Any other reaching definition of that use is guaranteed by our algorithm to have the same mapping associated with it.

      DetermineMapping(def, stmt)
      {
         /* Default mapping of def is replication */
         if (IsPrivatizable(def) == TRUE) then
            RhsReplicated = IsRhsReplicated(stmt)
            if (RhsReplicated == TRUE and IsUniqueDef(def) == TRUE) then
               Add def to NoAlignExam list
            end if
            Traverse reached uses of def and select a consumer ref as AlignRef
            if (RhsReplicated == FALSE and
                (no consumer ref selected as AlignRef or
                 alignment of def with AlignRef leads to inner-loop
                 commn. for some RHS ref on stmt)) then
               Select a partitioned RHS ref as AlignRef
            end if
            if (AlignRef has been selected and
                alignment valid inside current loop) then
               for each reached use of def do
                  for each reaching def of use do
                     Align def with AlignRef
                  end do
               end do
            end if
         end if
      }

Figure 3: Pseudocode for determining the mapping of a scalar reference

The scalar variables involved in reduction operations are treated in a special manner, which is described in the next subsection. For other scalar variables, Figure 3 gives an overview of the algorithm to determine the mapping for a given scalar definition. The default mapping of each scalar definition is set to replication. We now explain the different steps of our algorithm.

Privatization without alignment. If the dataflow analysis shows that a given definition is privatizable and not live outside the current loop (the compiler also takes advantage of the NEW clause in the INDEPENDENT directive of HPF to infer this), we first check if all rhs references on that statement are to replicated data. If so, the compiler considers privatizing the scalar without alignment with any reference, provided the given definition is the only reaching definition of all the reached uses of that definition. This ensures that, in spite of the privatized execution of the statement S computing the scalar value, each reached use of the scalar variable sees the correct value. We note that at this stage, an eligible scalar definition is only added to the list of definitions being considered for privatization without alignment. The reason for this deferral is that there may be rhs references to privatizable scalar or array variables in statement S for which mapping decisions have not yet been made, so those variables appear to be replicated at this stage. At the end of the compiler pass making mapping decisions, the list is examined again, and if all rhs data on the corresponding statement continue to be replicated, the scalar definition is privatized without alignment.

Identification of Alignment Target. The next step in the algorithm is to examine each reached use of def and identify a consumer reference with which def could be aligned. The selection of a single alignment target is done using a heuristic algorithm. If any reached use appears inside a loop bound expression or a subscript that has to be broadcast to all processors (note that subscript values of rhs references not involved in communication need not be made available on all processors), the dummy replicated reference is returned as the selected consumer reference and the traversal through reached uses is terminated. Otherwise, pHPF ignores any consumer reference that refers to replicated data, and selects a consumer reference (if any) to partitioned data.
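The requirement that all reaching definitions of a use receive an identical mapping (enforced by the final nested loop of Figure 3) can be sketched as follows; the data structures are our own simplification of the compiler's SSA-based bookkeeping:

```python
# Sketch: group the definitions reached by a common use and give the
# whole group the mapping of the first reaching definition, mirroring
# the "for each reached use / for each reaching def" loop of Figure 3.

def unify_mappings(use_to_defs, chosen):
    """use_to_defs: use -> list of reaching definitions.
    chosen: definition -> mapping selected for it in isolation.
    Returns a definition -> mapping table in which every definition
    reaching the same use shares the first definition's mapping.
    (The real algorithm guarantees consistency during selection; this
    sketch only shows the propagation step.)"""
    final = dict(chosen)
    for use, defs in use_to_defs.items():
        mapping = final[defs[0]]      # mapping of the first reaching def
        for d in defs:
            final[d] = mapping
    return final
```

If definitions d1 and d2 both reach use u1, d2 inherits d1's mapping, so a later compilation phase can consult any reaching definition without ambiguity.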
This selection process favors a reference in which a distributed array dimension is traversed in the innermost common loop enclosing the scalar definition and the reached use, since alignment with such a reference will ensure that the scalar is mapped to different processors during different loop iterations. For example, inside an i-loop, alignment with a reference A(i) would be preferred over alignment with a reference A(1), where A is a partitioned array. As described earlier, a consumer reference corresponding to a use of the scalar variable is usually the lhs reference of the assignment statement in which the use occurs. If this reference is to a privatizable variable, the compiler invokes the mapping algorithm recursively on that definition to determine the mapping of that variable before determining whether this consumer reference may serve as a suitable alignment target. If there are rhs references to partitioned data on the statement computing the scalar value, the compiler selects one of those producer references (similar to the selection of the consumer reference) as another potential alignment target. Given a choice between producer and consumer references as the alignment target, our algorithm selects the consumer reference unless that selection leads to inner-loop communication for some rhs reference on the given assignment statement.

      !HPF$ Distribute (block,block,*) :: A, B

            do i = 1, n
               do j = 1, n
                  ...
                  s = ...
                  do k = 1, n
                     A(i,j,k) = ...   ! AlignLevel = 2
                     B(s,j,k) = ...   ! AlignLevel = 3
                  enddo
               enddo
            enddo

Figure 4: AlignLevel for array references

Scope of Validity of Alignment. We now describe how the compiler determines the program region in which the alignment information about a scalar variable is well-defined. Given a subscript s in an array reference r, let VarLevel(s) denote the innermost loop nesting level in which subscript s varies in value. We define SubscriptAlignLevel(s) as:

      SubscriptAlignLevel(s) = VarLevel(s)       if s is an affine function of loop indices
                               VarLevel(s) + 1   otherwise

Thus, SubscriptAlignLevel(s) gives the nesting level of the outermost loop throughout which the value of the subscript s is well-defined. For example, in Figure 4, the k-loop is the outermost loop in which both subscripts s and k are well-defined. We now define AlignLevel(r) as the maximum of the SubscriptAlignLevel values over the subscripts in the partitioned dimensions of r. Therefore, in Figure 4, the AlignLevel of A(i,j,k) is 2, which corresponds to the j-loop, and the AlignLevel of B(s,j,k) is 3, corresponding to the k-loop, which is the outermost loop in which subscript s is invariant.

Given a reference r which is selected as an alignment target of a scalar definition, AlignLevel(r) indicates the outermost loop throughout which the alignment information is valid. Therefore, a scalar definition which is privatizable at nesting level l can be aligned unambiguously with the selected reference r if AlignLevel(r) <= l.

Marking alignment information. Finally, once a valid alignment target for the scalar definition def has
been identified, the compiler records that alignment information for each reaching definition of every reached use of def. The mapping information at a use during communication analysis is obtained initially by accessing the information recorded with its first reaching definition, and is cached in a data structure associated with the use for subsequent inquiries. Our procedure ensures that consistent mapping information is seen by each reached use of def.

      !HPF$ Align B(i) with A(i,*)
      !HPF$ Distribute (block,block) :: A

            do i = 1, n
               s = 0
               do j = 1, n
                  s = s + A(i,j)
               enddo
               B(i) = s
            enddo

Figure 5: Scalar variable involved in a reduction

2.3 Mapping of Scalars Involved in Reductions
Any scalar computed in a reduction operation, such as a sum, carried out across a processor grid dimension is handled in a special manner. An additional privatized temporary copy of the scalar is created during code generation to hold the result of the local reduction computation initially performed by each processor. A global reduction operation combines the values of the local operations and stores the result into the variable which retains the original name of the scalar. This scalar variable is replicated in the dimensions in which the reduction takes place. However, it may be privatized with respect to the other processor grid dimensions.

Given a statement assigning a value to a scalar variable which is recognized as a reduction, the compiler checks if the scalar definition is privatizable without copy-out with respect to the loop immediately surrounding the reduction loop. If so, the special array reference whose ownership governs the partitioning of the partial reduction operation [9] serves as the alignment target. However, in this case, the compiler constructs a new alignment mapping in which the scalar variable is replicated in each dimension over which the reduction takes place, and is aligned with the target array reference in only the remaining grid dimensions. This alignment information is propagated for each reaching definition of every reached use of the original definition. Finally, at code generation time, another privatized copy of the scalar variable is created, which differs from the original variable only in that it is private on each processor involved in the reduction rather than being replicated.

For example, in Figure 5, a sum reduction takes place in the j-loop. The definition of variable s is verified as being privatizable with respect to the i-loop, and A(i,j) serves as the alignment target. Hence, s is replicated in the second grid dimension and is aligned with the ith row of A in the first dimension. As a result of this alignment, the reduction computation can proceed without the need to broadcast the ith row of A to other processors along the first grid dimension.
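The two-phase reduction scheme for Figure 5 can be simulated with ordinary Python lists; this is an illustration of the idea, not the generated SPMD code, and the block-slicing formula is our own assumption:

```python
# Each processor along the reduction (second) grid dimension sums its
# private block of row i into a privatized temporary; a global combine
# then produces the value stored into the scalar s.  No broadcast of
# the row is needed.

def local_partial_sums(row, nprocs):
    """Partial sum per processor over a block-distributed row."""
    block = (len(row) + nprocs - 1) // nprocs
    return [sum(row[p * block:(p + 1) * block]) for p in range(nprocs)]

def global_reduce(partials):
    """Stands in for the collective combine across the grid dimension."""
    return sum(partials)
```

For a row [1.0, 2.0, 3.0, 4.0] on two processors, the private partials are 3.0 and 7.0, and the combine yields the same 10.0 a sequential sum would.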
3 Mapping of Privatized Arrays

We now describe the procedure for mapping privatizable arrays. The pHPF compiler currently relies on directives from the programmer to infer that arrays are privatizable. While the basic alternatives explored for the alignment of privatized arrays are the same as those for scalars, there are additional options available when dealing with arrays.

3.1 Basic Procedure

The INDEPENDENT directive for loops in HPF asserts that the iterations of the loop may be executed in any order without changing the program semantics. The NEW clause attached to such a directive supplies a list of variables (see the example in Figure 6) and modifies the INDEPENDENT directive to assert that the independence of different loop iterations holds if new objects are created for the named variables for each loop iteration. Thus, those variables can be regarded as privatizable with respect to the INDEPENDENT loop. pHPF is also able to infer the privatizability of an array from a weaker form of a parallel loop directive which indicates that a loop has no true loop-carried value-based dependences. Any lhs array reference in which each subscript is either invariant with respect to the parallel loop or is an affine function of inner loop indices contributes to memory-based loop-carried dependences, which can be eliminated only by privatizing that array.

The algorithm to determine the target alignment reference is identical to that used for scalar variables. Similarly, once an alignment target has been selected and the AlignLevel for that reference determined, the compiler examines each reached use to ensure that it does not appear outside the loop corresponding to AlignLevel. Any seemingly reached uses outside the loop associated with the NEW directive are assumed to be spurious and hence are ignored. The alignment information is kept in a data structure associated with the loop with respect to which the array has been privatized, and applies to all references to that array variable within that loop. The privatizable arrays used to hold the results of a reduction operation are also handled in a manner similar to that for scalar variables in reduction computations.

      !HPF$ Distribute (*,*,block,block) :: rsd
      !HPF$ INDEPENDENT, NEW(c)
            do k = 2, nz-1
               do j = 2, ny-1
                  do i = 2, nx-1
                     c(i,j,1) = ...
                  enddo
               enddo
               do j = 3, ny-1
                  do i = 2, nx-1
                     rsd(1,i,j,k) = ... c(i,j-1,1) ...
                  enddo
               enddo
            enddo

Figure 6: Need for partial privatization

      !HPF$ Align (:) with A(:) :: B, C
      !HPF$ Distribute (block) :: A

            do i = 1, n
               if (B(i) .ne. 0.0) then
                  A(i) = A(i)/B(i)
                  if (B(i) < 0.0) go to 100
               else
                  A(i) = C(i)
               end if
               ...
       100     continue
               C(i) = C(i) + 1
            enddo

Figure 7: Privatized execution of control flow statements

3.2 Partial Privatization

The concept of privatization of variables has traditionally been associated with a single parallel loop at a time. With nested loop parallelism, which is enabled by the use of a multi-dimensional processor grid in HPF, the idea of privatization can be trivially extended to apply to each grid dimension and to each loop. However, yet another alternative which can be considered in this scenario is one combining partitioning and privatization: the array may be partitioned in some grid dimensions and privatized with respect to the other dimensions. We refer to this as partial privatization.

Figure 6 shows a program segment adapted from the Appsp program of the NAS benchmarks, which illustrates the benefits of partial privatization. The array c is privatizable with respect to the k-loop, but not with respect to the j-loop. Correspondingly, the compiler will fail in its attempt to privatize the array in both grid dimensions. Clearly, replication in either dimension would lead to a loss of parallelism and a great deal of extra communication, due to the owner-computes rule. In accordance with the HPF data distribution directives, the only way to exploit parallelism in both the k- and the j-loops is to partition the second dimension of c across the first grid dimension, and to privatize it along the second grid dimension.

Under partial privatization, the procedure to determine the AlignLevel of a target reference is modified to consider subscripts only in those distributed dimensions in which the array is to be privatized. Thus, in the example shown in Figure 6, since c is to be privatized only in the second grid dimension, the AlignLevel for rsd(1,i,j,k) is obtained as 1 (corresponding to the k-loop) rather than 2 (corresponding to the j-loop). The information from the NEW clause indicates that the value of c can be discarded after the k-loop. Hence, the compiler is able to proceed with partial privatization of c, whereas complete privatization was not possible.

It is well known that distribution of arrays on a multi-dimensional rather than a single-dimensional processor grid can lead to more scalable solutions for many stencil codes, due to the lower volume of interprocessor communication resulting from such a distribution. For the Appsp program specifically, a 3-D distribution of arrays has been known to outperform the 2-D distribution, which in turn has outperformed a 1-D distribution in a hand-tuned message-passing implementation [15].
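A minimal sketch of what partial privatization means for local storage, under assumed names and a simple block formula (not pHPF's actual layout computation):

```python
# For array c(i,j,1) on a P1 x P2 grid: the j dimension is partitioned
# across grid dimension 1, while each processor along grid dimension 2
# keeps a private copy that is logically re-created for every k
# iteration it executes -- so no k extent is stored at all.

def local_shape_partial_priv(nj, p1):
    """Local buffer shape for c under partial privatization:
    full i extent, a block of nj/p1 j-values, and no k extent."""
    jblock = (nj + p1 - 1) // p1     # block size along grid dimension 1
    return ("full_i", jblock, 1)
```

For example, 64 j-values partitioned over 4 processors in grid dimension 1 leaves each processor a private buffer covering 16 j-values, versus the full 64 that replication would require.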
4 Control Flow Statements

The handling of control flow statements is often left vague under the data-driven compilation model. A default strategy of executing these statements on all processors would lead to problems very similar to those encountered with the indiscriminate replication of variables. For example, if all processors were forced to execute every control flow statement in Figure 7, the loop would not be parallelized effectively, and expensive communication would be required for the array B. On the other hand, "privatized" execution of these statements by the owner of A(i) (which also owns B(i) and C(i)) eliminates the need for any communication and allows the loop to be parallelized effectively, as the loop bounds can be shrunk [9] in the final SPMD code. The pHPF compiler applies the following rules to privatize the execution of a control flow statement S (otherwise executed by all processors, by default), and, if necessary, to identify a reference which will serve as the alignment target for any data needed to execute that statement.
- If the statement S cannot transfer control to a target statement outside the body of loop L, then S does not contribute to a computation partitioning guard [9] for the loop L. Essentially, S will be executed by the union of all processors executing any other statement inside loop L for a given iteration. Conceptually, this corresponds to the notion of privatization without alignment.

- Any data referenced in the control predicate of S has to be communicated to the union of all processors that participate in the execution of any statement that is control-dependent on S.
In the example shown in Figure 7, both of the if statements transfer control only to statements inside the i-loop. Hence the execution of those statements can be privatized. Furthermore, following the owner-computes rule, only the owner of A(i) (or, equivalently, C(i)) needs to participate in the execution of any statement that is control-dependent on either of those control flow statements. Therefore, no communication is needed for the predicates of those if statements, as B(i) is owned by the same processor as A(i).
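The privatized execution of the Figure 7 loop can be mimicked as follows. Python lists stand in for the distributed arrays, and the go to 100 path is noted only in a comment, since the work it skips is elided ("...") in the figure:

```python
# Sketch: each processor runs the Figure 7 loop body only for the
# indices it owns, so every reference (A, B, C share one alignment)
# is local and no predicate data needs communication.

def run_owned_iterations(A, B, C, owned):
    """Execute the loop body for the locally owned indices only."""
    for i in owned:
        if B[i] != 0.0:
            A[i] = A[i] / B[i]
            # 'go to 100' would skip the elided work when B(i) < 0.0
        else:
            A[i] = C[i]
        C[i] = C[i] + 1          # statement at label 100
    return A, C
```

Running the body on a processor owning indices 0 and 2 leaves index 1 untouched; in the SPMD code this shows up as shrunk loop bounds rather than an explicit index list.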
5 Experimental Results

The ideas presented in this paper have been implemented in the pHPF prototype compiler [9]. In this section, we describe some preliminary experimental results to show the impact of this analysis. We present performance results on three benchmark programs, each of which illustrates different aspects of our procedure for mapping privatized variables. The first program is Tomcatv, a mesh generator with Thomson's solver. The program, originally from the Spec92fp benchmark suite, has been augmented with HPF directives. The second program, Dgefa, performs Gaussian elimination on a matrix with partial pivoting. It is the HPF version of the original routine from Linpack, to which we have applied procedure inlining by hand. The third program, Appsp from the NAS benchmarks, is a pseudo-application for performance evaluation of a solver for five coupled, nonlinear partial differential equations. Each of these programs was compiled with the -O3 option for optimizations. All measurements were done using 16 thin nodes of an IBM SP2.
      Program      #P          Execution Time (sec)
                        Replication   Producer    Selected
                                      Alignment   Alignment
      Tomcatv       1      777.13      339.06       46.04
      (*,block)     2     1439.58      339.17       22.74
      n = 514       4     1493.67      366.17       12.04
                    8     1759.91      428.97        6.42
                   16     1964.37      553.52        3.73

Table 1: Performance of Tomcatv on IBM SP2

5.1 Tomcatv
Table 1 shows the performance of Tomcatv obtained with three different levels of optimization. The first version, which is the most naive version of the compiler, does not perform privatization and replicates all scalar variables. The second version performs privatization, but always aligns each scalar definition with a producer reference, i.e., with a partitioned array or scalar reference on that statement. The third version applies the algorithm described in Section 2.2 to determine the alignment of privatized scalar variables.

As expected, replication of all scalar variables leads to extremely poor performance. This can be attributed to the loss of parallelism and excessive communication in the main computational loop nest of the program. We find the performance figures in the second column to be even more interesting. They show that, in spite of privatization, there can be a substantial loss of performance if the scalar variables are not mapped carefully. The alignment of a privatizable scalar variable with a partitioned producer reference on the same statement is quite simple to support. In contrast, alignment with a consumer reference requires a more complex procedure that may be recursively invoked to deal with a privatizable consumer reference which in turn needs to be aligned with a target reference. However, alignment of scalar variables with producer references leads to a considerable amount of inner-loop communication in Tomcatv. Our algorithm is able to avoid that by selecting alignment with consumer references in the main computational loop of the program. With proper alignment, we obtain performance improvements of more than two orders of magnitude on 16 processors. In fact, it is only with the appropriate alignment of scalar variables that the program exhibits speedups.

5.2 Dgefa
The array on which gaussian elimination is performed is partitioned column-wise in a cyclic manner. In each step of the elimination, partial pivoting involves a maxloc operation along a single array column which
Program # Procs Execution Time (sec) Default Alignment
Dgefa
(,cyclic) n = 2048
1 2 4 8 16
217.88 119.81 72.33 52.51 54.52
203.40 105.69 59.12 41.00 41.97
Table 2: Performance of Dgefa on IBM SP2 is mapped to a single processor. Our optimization to align privatizable variables holding the results of a reduction operation in the dimensions not involved in reduction leads to the computation for partial pivoting being con ned to just the relevant processor in each step, and also helps avoid unnecessary communication. Table 2 shows the performance results of Dgefa without and with this optimization. The communication overhead incurred when the reduction variable is replicated across the columns remains roughly constant, but it accounts for an increasing percentage of the execution time as the number of processors is increased. 5.3
5.3 Appsp
We present performance results for two HPF versions of the program: one with a 1-D distribution of arrays and redistribution (transpose) of data in the sweepz subroutine, and the other employing a fixed 2-D distribution throughout the program. The first two columns of results in Table 3 show that the execution time of the program becomes prohibitively large if array privatization is disabled; in fact, in that case, we had to abort the parallel program after more than a day of execution. The remaining columns show that with a 2-D distribution of arrays, even regular array privatization does not help, and the program performs extremely poorly. However, with the partial privatization employed by the compiler, we obtain significantly better performance.

The program version using the 2-D distribution starts out with better performance at fewer processors, mainly due to the absence of global transpose operations in the sweepz subroutine, but it does not scale as well as the version using the 1-D distribution, unlike hand-tuned message-passing versions of Appsp [15]. An examination of the message-passing code produced by the HPF compiler showed that there is considerable scope for improving the performance of that version through global message combining across loop nests; the pHPF compiler does not currently perform that optimization.
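Why the 1-D version needs a transpose at all can be seen from a small ownership sketch. The grid size, processor counts, and function names here are our own illustration, not taken from the Appsp source: under a 1-D block distribution of the z dimension, a sweep line along z necessarily crosses processors, while sweeps along the undistributed dimensions stay local.

```python
def owner_1d(x, y, z, n, p):
    """Block distribution of the z dimension of an n^3 grid over p processors."""
    return z * p // n

def sweep_is_local(owner, points):
    """True if every grid point of one sweep line maps to a single processor."""
    return len({owner(*pt) for pt in points}) == 1

n, p = 8, 2
own = lambda x, y, z: owner_1d(x, y, z, n, p)
x_line = [(x, 0, 0) for x in range(n)]   # a sweep line along x
z_line = [(0, 0, z) for z in range(n)]   # a sweep line along z
print(sweep_is_local(own, x_line))  # True: x-sweeps need no communication
print(sweep_is_local(own, z_line))  # False: z-sweeps cross processors,
                                    # hence the transpose before sweepz
```

A 2-D distribution avoids the global transpose because no sweep direction is wholly serialized across processors, at the price of communication in two partitioned dimensions.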
6 Related Work
There has been a great deal of work done on techniques related to privatization for exposing more parallelism, such as scalar expansion [16], scalar privatization [3], array expansion [7], and array privatization [18, 10, 8]. Our work focuses on the additional analysis needed to apply privatization effectively to data-driven execution, and hence is complementary to previous work.

Knobe and Dally present a subspace model and describe an algorithm, meant to be applied before data partitioning and scheduling, which attempts to resolve mismatches in the shapes of various operands [12]. Their method achieves privatization by adding an expansion dimension that is indexed by a loop induction variable. They also apply the subspace model to optimize the execution of control flow statements. They do not discuss alternatives regarding the alignment of privatized data with other partitioned data, or the impact of such a mapping on the loop-level placement of communication involving privatized data.

Chatterjee et al. present the concept of mobile alignment of arrays with respect to loops [4], which is similar to the idea of array privatization. Their work focuses on choosing between replication and mobile alignment of data. Their algorithm does not take into account information about the privatizability of arrays, which can make code generation difficult or expensive for arrays with mobile alignment that are not privatizable.

The work of Palermo et al. [17] is the most closely related to ours. They use a simpler algorithm in which an assignment to a privatized scalar variable is executed by every processor that participates in the execution of any statement in the given loop iteration, which is similar to our notion of privatization without alignment with a specific reference. This can lead to more communication if fewer processors use a scalar value than are made to execute the assignment statement. An earlier implementation of the pHPF compiler [9] employed a simpler and more limited version of our analysis for handling privatizable scalar variables: it did not privatize a scalar definition that was not the only reaching definition of the reached uses, and it did not deal with privatizable arrays. Privatization of variables is performed by many other HPF compilers as well; however, the method of determining the ownership of those variables has usually not been discussed.
Program       # Processors   Execution Time (sec)
                             1-D, No Array Priv.   1-D, Priv.   2-D, No Partial Priv.   2-D, Partial Priv.
Appsp         1              21032.34              6787.56      13054.20                3188.54
n = 64        2              > 86400.00 (1 day)    4621.79      > 86400.00 (1 day)      3020.53
niter = 400   4              -                     2631.10      -                       2059.54
              8              -                     1585.35      -                       1959.93
              16             -                     900.89       -                       1470.73

Table 3: Performance of Appsp on IBM SP2

7 Conclusions

In this paper, we have presented a novel framework for privatizing scalar and array variables in the context of a data-driven approach to parallelization. We show that there are numerous alternatives available for mapping privatized variables, and that the choice of mapping can significantly affect the performance of the program, by as much as two orders of magnitude in some cases on the IBM SP2. Our algorithm to select this mapping is guided by a realistic communication cost model which takes into account optimizations like message vectorization. We also introduce the notion of partial privatization of arrays, which enables a compiler to exploit nested parallelism even when that nested form is incompatible with the conventional definition of array privatization. Our preliminary results, based on an implementation of these ideas in the pHPF compiler, have been very encouraging. In the future, we plan to integrate our mapping techniques with automatic array privatization.

Acknowledgements

The author wishes to thank Sam Midkiff for his help in implementing the technique of partial privatization.

References

[1] P. Banerjee, J. Chandy, M. Gupta, E. Hodges IV, J. Holm, A. Lain, D. Palermo, S. Ramaswamy, and E. Su. An overview of the PARADIGM compiler for distributed-memory multicomputers. IEEE Computer, October 1995.
[2] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. A compilation approach for Fortran 90D/HPF compilers on distributed memory MIMD computers. In Proc. 6th Annual Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.
[3] M. Burke, R. Cytron, J. Ferrante, and W. Hsieh. Automatic generation of nested, fork-join parallelism. Journal of Supercomputing, pages 71-88, 1989.
[4] S. Chatterjee, J. R. Gilbert, and R. Schreiber. Mobile and replicated alignment of arrays in data-parallel programs. In Proc. Supercomputing '94, Washington, D.C., November 1994.
[5] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and F. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, October 1991.
[6] R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua. Experience in the automatic parallelization of four Perfect-Benchmark programs. In Proc. 4th Workshop on Languages and Compilers for Parallel Computing. Pitman/MIT Press, August 1991.
[7] P. Feautrier. Array expansion. In Proc. 1988 ACM International Conference on Supercomputing, July 1988.
[8] J. Gu, Z. Li, and G. Lee. Symbolic array dataflow analysis for array privatization and program parallelization. In Proc. Supercomputing '95, San Diego, CA, December 1995.
[9] M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, K. Wang, D. Shields, W.-M. Ching, and T. Ngo. An HPF compiler for the IBM SP2. In Proc. Supercomputing '95, San Diego, CA, December 1995.
[10] M. Hall, S. Amarasinghe, B. Murphy, S.-W. Liao, and M. Lam. Detecting coarse-grain parallelism using an interprocedural parallelizing compiler. In Proc. Supercomputing '95, San Diego, CA, December 1995.
[11] S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.
[12] K. Knobe and W. Dally. The subspace model: A theory of shapes for parallel systems. In Proc. 5th Workshop on Compilers for Parallel Computers, Malaga, Spain, June 1995.
[13] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. E. Zosel. The High Performance Fortran Handbook. The MIT Press, Cambridge, MA, 1994.
[14] C. Koelbel and P. Mehrotra. Compiling global name-space parallel loops for distributed execution. IEEE Transactions on Parallel and Distributed Systems, 2(4):440-451, October 1991.
[15] V. K. Naik. Scalability issues for a class of CFD applications. In Proc. 1992 Scalable High Performance Computing Conference, pages 268-275, May 1992.
[16] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.
[17] D. Palermo, E. Su, E. Hodges IV, and P. Banerjee. Compiler support for privatization on distributed-memory machines. In Proc. 25th International Conference on Parallel Processing, Bloomingdale, IL, August 1996.
[18] P. Tu and D. Padua. Automatic array privatization. In Proc. 6th Workshop on Languages and Compilers for Parallel Computing, Portland, OR, August 1993.
[19] H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.