A Framework for Exploiting Data Availability to Optimize Communication

Manish Gupta and Edith Schonberg
IBM T. J. Watson Research Center, P. O. Box 704, Yorktown Heights, NY 10598
Abstract. This paper presents a global analysis framework for determining the availability of data on a virtual processor grid. The data availability information obtained is useful for optimizing communication when generating SPMD programs for distributed address-space multiprocessors. We introduce a new kind of array section descriptor, called an Available Section Descriptor, which represents the mapping of an array section onto the processor grid. We present an array data-flow analysis procedure, based on interval analysis, for determining data availability at each statement. Several communication optimizations, including redundant communication elimination, are also described. An advantage of our approach is that it is independent of the actual data partitioning and of the representation of explicit communication.
1 Introduction

Distributed memory architectures are becoming increasingly popular as a viable and cost-effective method of building massively parallel computers. However, the absence of global address space, and consequently, the need for explicit message passing among processes makes such machines very difficult to program. This has motivated considerable research towards developing compilers that relieve the programmer of the burden of generating communication [15, 23, 17, 19, 18, 16, 21]. Such compilers take a sequential or a shared-memory parallel program, annotated with directives specifying data decomposition, and generate the target SPMD program with explicit message passing. Thus, the compiler performs two essential tasks:

- partitioning of computation, usually based on the owner computes rule [15, 23, 19], which makes the processor that owns a data item (being assigned a value) responsible for its computation, and
- generation of communication, whereby the owner of a data item (being used in a computation) sends its value to the processor performing that computation.

In order to reduce the cost of communication, the compilers perform optimizations like message vectorization [15, 23], using collective communication [13, 17], and overlapping communication with computation [15]. However, the extent to
which most compilers are able to (or even attempt to) optimize communication is still limited by the following factors:

- It is always the owner of a data item that supplies its value when required by another processor. It may be possible to generate more efficient communication if any processor that has a valid copy of that data is allowed to send its value. In the special case when a processor receiving a value via communication already has a valid copy of that data item available, the communication can be recognized as being redundant, and eliminated.
- There is no global analysis performed of the communication requirements of array references over different loop nests. This precludes general optimizations, such as redundant communication elimination, or carrying out extra communication inside one loop nest if it subsumes communication required in the next loop nest.

          DIMENSION B(100, 100), A(100, 100), D(100)
    !HPF$ ALIGN B(I, J) WITH VPROCS(I, J)
    !HPF$ ALIGN A(I, J) WITH VPROCS(I, J)
    !HPF$ ALIGN D(I) WITH VPROCS(I, 1)

    s1:       DO I = 2, 100
    s2:         DO J = 1, 100
    s3:           B(I, J) = ... A(J, I)
    s4:         ENDDO
    s5:         A(100, I-1) = ...
    s6:         D(I) = A(1, I)
    s7:       ENDDO
              ...
Fig. 1. Example Program

As an example, consider the program fragment in Figure 1. According to the HPF alignment directives, B and A are identically aligned to an abstract template, represented by VPROCS. The variable D is aligned with the first column of VPROCS. To enforce the owner computes rule at statement s3, communication must be generated which creates a copy of A transpose-aligned with B. Similarly, statement s6 would require a second communication to align the first row of A with D. However, since D is aligned with the first column of B, the second communication is subsumed by the first, and is therefore redundant.

Our goal is to provide a framework, based on global array data-flow analysis, for performing communication optimizations. The conventional approach to data-flow analysis regards each access to an array element as an access to the entire array. Previous researchers [12, 11, 20] have applied data-flow analysis to array sections to improve its precision. The array section descriptors, such as the regular section descriptor (RSD) [7], and the data access descriptor (DAD) [5], provide a compact representation for elements comprising certain special substructures of arrays. However, in the context of communication optimizations,
these descriptors are insufficient, because they do not capture information about the processors where the array elements are available. Granston and Veidenbaum [11] use data-flow analysis to detect redundant accesses to global memory in a hierarchical, shared-memory machine. However, they do not explicitly represent information about the availability of data on processors. Instead, they rely on simplistic assumptions about scheduling of parallel loops, which are often not applicable. Amarasinghe and Lam [4] use the last write tree framework to perform optimizations like eliminating redundant messages. Their framework does not handle general conditional statements, and they do not eliminate redundant communication due to different references in arbitrary statements (for instance, statements appearing in different loop nests). Von Hanxleden et al. [22] have developed a data-flow framework for generating communication in the presence of indirection arrays. Their work focusses on irregular subscripts, and can be viewed as complementary to this work, which deals with regular subscripts, and attempts to obtain more precise information.

In this paper, we first introduce a new kind of descriptor, the Available Section Descriptor (ASD), that describes the availability of array elements on processors in an abstract space. We apply ideas from the interval analysis method [2, 12] to develop a procedure for obtaining global data-flow information. This information is recorded using the ASDs. Finally, we describe some optimizations that can now be performed to reduce communication costs, and present an algorithm for one such optimization, eliminating redundant communication. An advantage of our approach is that availability analysis is performed on the original program form, before any communication is introduced by the compiler. Thus, communication optimizations based on data availability analysis need not depend on detailed knowledge of explicit communication representations. Additionally, our analysis does not rely on how data is eventually partitioned onto physical processors. Availability information can therefore be valuable input for automatic data partitioning decisions.
2 Available Section Descriptor

The Available Section Descriptor is an extended version of an array section descriptor, that also records information about the availability of data items on processors. It is defined as a pair ⟨D, M⟩, where D is an array section descriptor, and M is a descriptor of the function mapping elements in D to virtual processors. We use the term mapping to convey the availability of data at processors, not simply the ownership of data by processors. Data is made available through communication.

In this paper, we shall use the bounded regular section descriptor (BRSD) [14], a version of the RSD, to represent array sections. Bounded regular sections allow representation of subarrays that can be specified using the Fortran 90 triplet notation. We represent a bounded regular section as an expression A(S), where A is the name of an array variable, and S is a vector of subscript values such that each of its elements is either (i) an expression of the form α·k + β, where k is a loop index variable and α and β are invariants, (ii) a triple l : u : s, where l, u, and s are invariants (the triple represents the expression discussed above expanded over a range), or (iii) ⊥, indicating no knowledge of the subscript value.

The processor space is regarded as an unbounded grid of virtual processors. The abstract processor space is similar to a template in High Performance Fortran (HPF) [9], which is a grid over which different arrays are aligned. The mapping function descriptor M is a pair ⟨P, F⟩, both P and F being vectors of length equal to the dimensionality of the processor grid. The i-th element of P (denoted as P_i) indicates the dimension of the array A that is mapped to the i-th grid dimension, and F_i is the mapping function for that array dimension, i.e., F_i(j) returns the position(s) along the i-th grid dimension to which the j-th element of the array dimension is mapped. We represent a mapping function as

    F_i(j) = (c·j + l : c·j + u : s)    (Form 1)
    F_i(j) = ⊥                          (Form 2)

In the expression for Form 1, c, l, u and s are invariants. The parameters c, l and u may take rational values, as long as F_i(j) evaluates to a range over integers, over the data domain. The above formulation allows representation of one-to-one mappings (when l = u), one-to-many mappings (when u ≥ l + s), and also constant mappings (when c = 0). The one-to-many mappings expressible with this formulation are more general than the replicated mappings for ownership that may be specified using HPF [9]. As with the usual triplet notation, we shall omit the stride, s, from the expression when it is equal to one. Form 2 represents the case when there is no information about the availability of data.

If an array has fewer dimensions than the processor grid, clearly there is no array dimension mapped to some of the grid dimensions. For each such grid dimension m, P_m takes the value ⊥, which represents a "missing" array dimension. In that case, F_m is no longer a function of a subscript position. It is simply an expression of the form l : u : s or ⊥, and indicates the position(s) in the m-th grid dimension at which the array is available.

Example. Consider a 2-D virtual processor grid VPROCS, and an ASD ⟨A(2:100:2, 1:100), ⟨[2, 1], [F_1, F_2]⟩⟩, where F_1(j) = 1:100 and F_2(i) = ((3/2)·i : (3/2)·i). The ASD represents an array section A(2:100:2, 1:100), each of whose elements A(2i, j) is available at a hundred processor positions given by VPROCS(1:100, 3i). This ASD is illustrated in Figure 2. Figure 2(a) shows the array A, where each horizontal stripe A_i represents A(2i, 1:100). Figure 2(b) represents the mapping of the array section onto the virtual processor template VPROCS, where each subsection A_i is replicated along its corresponding column.
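To make the descriptor concrete, here is a minimal Python sketch of one possible ASD encoding; the class and field names are ours, not part of the paper, and rational coefficients such as 3/2 are modeled with Fraction.

    from dataclasses import dataclass
    from fractions import Fraction
    from typing import List, Optional

    @dataclass(frozen=True)
    class Range:                # the triple l : u : s
        l: int
        u: int
        s: int = 1

    @dataclass(frozen=True)
    class MapFn:
        # Form 1: F_i(j) = (c*j + l : c*j + u : s); Form 2 (no information)
        # is represented by None in the ASD below.
        c: Fraction
        l: Fraction
        u: Fraction
        s: int = 1

        def apply(self, j: int) -> Range:
            # must evaluate to an integer range over the data domain
            return Range(int(self.c * j + self.l), int(self.c * j + self.u), self.s)

    @dataclass(frozen=True)
    class ASD:
        array: str                   # array name A
        section: List[Range]         # BRSD subscript vector S
        P: List[Optional[int]]       # array dim per grid dim (None = "missing")
        F: List[Optional[MapFn]]     # mapping function per grid dimension

    # The example ASD <A(2:100:2, 1:100), <[2,1], [F1, F2]>>:
    F1 = MapFn(Fraction(0), Fraction(1), Fraction(100))    # F1(j) = 1:100 (replication)
    F2 = MapFn(Fraction(3, 2), Fraction(0), Fraction(0))   # F2(i) = (3/2)*i
    a = ASD("A", [Range(2, 100, 2), Range(1, 100)], P=[2, 1], F=[F1, F2])
    assert F2.apply(2 * 5) == Range(15, 15)                # A(10, j) lives on column 15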
3 Data-Flow Analysis

In this section, we present a procedure for obtaining data-flow information for a structured program. The basic item of information that we derive is the availability of data values at different processors, at each node in the control flow graph of the program.
[Figure 2: (a) Array A, drawn as horizontal stripes A_1, A_2, ...; (b) Virtual Processor Grid VPROCS, with each stripe A_i replicated along its column.]

Fig. 2. ASD Illustration
This is similar to determining available expressions in classical data-flow analysis [1], with the difference that we also determine the processor(s) at which each expression is available. Our procedure, like the one presented in [12], is based on the interval analysis method [2], with data-flow extensions for arrays. We are assuming Fortran 77-like language constructs. However, doall loops can also be analyzed in this framework, as well as HPF forall constructs and array language after scalarization.

Let IN_i denote the availability of data at entry to statement node n_i, OUT_i denote the availability of data at exit from n_i, and K_i be the data that is unavailable at exit. The set K_i corresponds to killed variables. Let Kill_i be the set of variables defined in the statement node n_i, and Gen_i be the set of variables, along with their mapping functions, made available at n_i. The sets Kill_i and Gen_i can be computed locally from each statement. All of these sets are represented as ASDs. The ASD representation of any killed data always has a mapping function ⊤ associated with it, which signifies the data being killed on all processors. The data-flow transfer functions for n_i are defined as follows:

    K_i = (∪_p K_p) ∪ Kill_i                                   (1)
    IN_i = ∩_p OUT_p,    OUT_i = [IN_i − Kill_i] ∪ Gen_i       (2)

where p ranges over the predecessors of n_i.

The set Gen_i is dependent on the compute rule used by the compiler in translating the source program into SPMD form. In the next section, we define a
function Avail, which corresponds to the owner computes rule. For an assignment statement lhs = F(rhs_1, ..., rhs_n),

    Gen_i = ∪_{rhs_i} Avail(lhs, rhs_i).
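As an illustration, the transfer functions might be applied as in the following Python sketch; asd_union, asd_intersect, asd_difference, and avail stand for the ASD operations of Section 4 and the Avail function of Section 3.1, and are assumed here rather than defined.

    # Sketch of applying transfer functions (1) and (2) at one CFG node.
    def apply_transfer(node, preds, asd_union, asd_intersect, asd_difference):
        if preds:
            # (1) data killed on any path, plus locally killed data
            node.K = asd_union(asd_union(*(p.K for p in preds)), node.Kill)
            # data is available on entry only if available on every predecessor
            node.IN = asd_intersect(*(p.OUT for p in preds))
        else:
            node.K, node.IN = node.Kill, frozenset()
        # (2) availability on exit
        node.OUT = asd_union(asd_difference(node.IN, node.Kill), node.Gen)

    def gen_for_assignment(lhs, rhs_list, avail, asd_union):
        # Gen_i for lhs = F(rhs_1, ..., rhs_n) under the owner computes rule
        return asd_union(*(avail(lhs, r) for r in rhs_list))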
In Section 3.2, we describe our global data-flow procedure, based on interval analysis, for propagating availability data.
3.1 Compute Rule Function

The Avail function can easily be defined to handle various methods of assigning computation to processors [8], as long as that decision is made statically. In this work, we define this function based on the owner computes rule. The owner computes rule requires each item referenced on the rhs of an assignment statement to be sent to the processor that owns the lhs. Let ⟨D^L, M^L⟩ represent the ASD which specifies ownership for the lhs variable. Avail calculates the mapping of the rhs variable ⟨D^R, M^R⟩ that results from enforcing the owner computes rule. The new ASD represents the rhs data aligned with the lhs. The regular section descriptor D^R represents the element(s) referenced by the rhs. The mapping descriptor M^R = ⟨P^R, F^R⟩ is obtained by the following procedure:

Step 1. Align array dimensions with processor grid dimensions:
1. For each processor grid dimension i, if the subscript expression obtained from P_i^L has the form α₁·k + β₁ and there is a rhs subscript expression α₂·k + β₂, for the same loop index variable k, set P_i^R to the rhs subscript position.
2. For each remaining processor grid dimension i, set P_i^R to j, where j is an unassigned rhs subscript position. If there is no unassigned rhs subscript position left, set P_i^R to ⊥.
Step 2. Calculate the mapping function for each grid dimension:
For each processor grid dimension i, let F_i^L(j) = c·j + l be the ownership mapping function of the lhs variable.¹ We determine the rhs mapping function F_i^R(j) from the subscript expressions corresponding to P_i^R and P_i^L. The details are specified in Table 1.

Finally, if any rhs or lhs subscript expressions are coupled, or if any subscript expressions are non-linear, then ⟨D^R, M^R⟩ is set to ⊥.

Example. Consider the assignment statement in the code fragment:

    !HPF$ ALIGN A(I, J) WITH VPROCS(J, I+1)
          ...
          A(I, J) = ..B(2*I)..

¹ This is the form allowed by HPF alignment directives.
    P_i^L          P_i^R                   F_i^R(j)
    ----------------------------------------------------------------
    α₁·k + β₁      α₂·k + β₂, α₂ ≠ 0       c·(α₁·(j − β₂)/α₂ + β₁) + l
    α₁·k + β₁      β₂                      c·(α₁·k + β₁) + l
    α₁·k + β₁      ⊥                       c·(α₁·k + β₁) + l
    ⊥              α₂·k + β₂               l (c must be 0)
Table 1. Mapping Function Calculation for Owner Computes Rule

The ownership mapping descriptor M^L for the lhs variable A is ⟨[2, 1], F^L⟩, where F_1^L(j) = j and F_2^L(j) = j + 1. This mapping descriptor is derived from the HPF alignment specification. Applying Step 1 of the compute rule algorithm, P^R is set to [⊥, 1]. That is, the second dimension of VPROCS is aligned with the first dimension of B, and the first entry of P^R is ⊥, since B only has one dimension. The second step is to determine the mapping function F^R. For the first grid dimension, P_1^L corresponds to the subscript expression J and P_1^R is ⊥. Therefore, using F_1^L and the third rule in Table 1, F_1^R is set to 1·(1·J + 0) + 0 = J. For the second grid dimension, P_2^L corresponds to the subscript expression I, and P_2^R corresponds to the subscript expression 2·I. Using F_2^L and the first rule in Table 1, F_2^R(j) is set to j/2 + 1. The mapping descriptor thus obtained maps B(2*I) onto VPROCS(J, I+1).
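The Step 2 case analysis can be sketched in Python as follows. Subscripts are modeled as α·k + β pairs; this is our paraphrase of Table 1 with illustrative names, not the authors' implementation, and the division in the first row is exact whenever the mapping is well formed.

    from dataclasses import dataclass
    from fractions import Fraction
    from typing import Callable, Optional

    @dataclass(frozen=True)
    class Affine:          # a subscript expression alpha*k + beta
        alpha: int
        beta: int

    def rhs_map_fn(c: Fraction, l: Fraction,
                   lhs: Optional[Affine],
                   rhs: Optional[Affine]) -> Callable[[Fraction], Fraction]:
        """Table 1: derive F_i^R from the lhs ownership function F_i^L(j) = c*j + l."""
        if lhs is None:
            # Row 4: no lhs dimension on this grid dimension; c must be 0.
            assert c == 0
            return lambda j: l
        if rhs is None or rhs.alpha == 0:
            # Rows 2 and 3: the grid position is fixed by the loop index k,
            # F_i^R = c*(alpha1*k + beta1) + l -- a function of k, not of j.
            return lambda k: c * (lhs.alpha * k + lhs.beta) + l
        # Row 1: invert the rhs subscript j = alpha2*k + beta2.
        return lambda j: c * (Fraction(lhs.alpha) * (j - rhs.beta) / rhs.alpha
                              + lhs.beta) + l

    # The worked example: F_2^L(j) = j + 1, lhs subscript I, rhs subscript 2*I.
    F2R = rhs_map_fn(Fraction(1), Fraction(1), Affine(1, 0), Affine(2, 0))
    assert F2R(Fraction(2 * 7)) == 8     # B(2*I) lands on grid position I + 1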
3.2 Interval-Based Data Availability Analysis

Interval analysis is precisely defined in [6]. An interval corresponds to a program loop. Each interval has a header node h, a last node l, and a single backedge ⟨l, h⟩. The analysis for structured (reducible) control flow graphs is performed in two phases, an elimination phase and a propagation phase.
Elimination Phase

The elimination phase processes the program intervals in a bottom-up (innermost to outermost) traversal. The nodes of each interval are visited in a forward traversal. After each interval has been processed, its data-flow information is summarized and associated with its header node, and the interval is logically collapsed. Thus, when an outer interval is traversed, each inner interval is treated as a single node, represented by its header. For availability analysis, we initialize the local sets for the header node of each interval as follows:

    OUT_h = Gen_h        K_h = Kill_h

Transfer functions (1) and (2) are then applied to each statement node during the forward traversal of the interval. Finally, the data availability generated for the
interval last node l must be summarized for the entire interval, and associated with h. However, the data availability at l, obtained from (1) and (2), is only for a single iteration of the loop. Following [12], we would like to represent the availability of data corresponding to all iterations of the loop.

Definition. For an ASD set S, expand(S, k, low : high) is a function which replaces all single data item references α·k + β used in any array section descriptor D in S by the triple (α·low + β : α·high + β : α), and any mapping function of the form F_i(j) = c·k + l by F_i(j) = (c·low + l : c·high + l : c).

The following equations define the transfer functions which summarize the data being killed and the data being made available in an interval for all iterations low : high.

    Kill_h = expand(K_l, k, low : high)
    Gen_h = expand(OUT_l, k, low : high) − (∪_def expand(def, k, low : high))

where def ranges over definitions in the interval loop, such that each definition is the target of an anti-dependence carried by the loop [3]. If the loop is a doall loop, the range of def is empty, so that Gen_h = expand(OUT_l, k, low : high).

For computing the interval kill set Kill_h, we simply expand the kill set generated at l over the interval loop bounds low : high. Computing the interval availability set Gen_h requires more work, because a variable definition in an iteration k may kill data made available in previous iterations low : k − 1. Therefore, we first expand the data made available in a single iteration, obtaining all data made available in any iteration, and then subtract out the data that may be killed after it is made available. A definition kills data made available in a previous iteration of a loop if it is the target of an anti-dependence carried by the loop, that is, if it defines data previously used.

Example. Table 2 shows ASD sets for the example in Figure 1, obtained during the elimination phase of interval analysis.² The availability domain of the inner loop, Gen_2, is the array section A(1:100, I). For a single iteration I, the assignment to A(100, I-1) at s5 does not kill Gen_2. Therefore, after s6, both A(1:100, I) and A(1, I) are available. Since A(1, I) is a subset of A(1:100, I), OUT_6 is simply A(1:100, I). (See the union operation in Section 4.) The final availability set Gen_1 is obtained by: a) expanding OUT_6 into A(1:100, 2:100), b) expanding the definition at s5, which is the target of an anti-dependence carried by the loop, into A(100, 1:99), and c) taking the difference between the resulting sets. The result of the difference operation is shown in Figure 3.

² Since the mapping descriptor ⟨P, F⟩ for variable A is identical for all availability sets, we ignore it in the discussion below.
    ASD set   D                              P       F
    ------------------------------------------------------------------
    OUT_3     A(J, I)                        [2, 1]  F_1(j) = j, F_2(j) = j
    K_3       B(I, J)                        ⊤       ⊤
    Gen_2     A(1:100, I)                    [2, 1]  F_1(j) = j, F_2(j) = j
    Kill_2    B(I, 1:100)                    ⊤       ⊤
    OUT_5     A(1:100, I)                    [2, 1]  F_1(j) = j, F_2(j) = j
    K_5       A(100, I-1)                    ⊤       ⊤
              B(I, 1:100)                    ⊤       ⊤
    OUT_6     A(1:100, I)                    [2, 1]  F_1(j) = j, F_2(j) = j
    K_6       D(I)                           ⊤       ⊤
              A(100, I-1)                    ⊤       ⊤
              B(I, 1:100)                    ⊤       ⊤
    Gen_1     A(1:99, 2:100), A(1:100, 100)  [2, 1]  F_1(j) = j, F_2(j) = j
    Kill_1    D(2:100)                       ⊤       ⊤
              A(100, 1:99)                   ⊤       ⊤
              B(2:100, 1:100)                ⊤       ⊤

Table 2. Steps in Elimination Phase for Example in Figure 1
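A sketch of expand and the interval summary, continuing the Python encoding above; expand_set (expand over a whole ASD set) and asd_difference are assumed operations.

    def expand_ref(alpha: int, beta: int, low: int, high: int) -> Range:
        # a single-iteration reference alpha*k + beta becomes
        # (alpha*low + beta : alpha*high + beta : alpha)
        return Range(alpha * low + beta, alpha * high + beta, alpha)

    def expand_map_fn(c: int, l: int, low: int, high: int) -> Range:
        # a mapping function F_i(j) = c*k + l becomes (c*low + l : c*high + l : c)
        return Range(c * low + l, c * high + l, c)

    def summarize_interval(OUT_l, K_l, antidep_defs, low, high,
                           expand_set, asd_difference):
        # Kill_h expands the kill set of the last node over all iterations;
        # Gen_h subtracts data re-killed by loop-carried anti-dependences
        # (antidep_defs is empty for a doall loop).
        Kill_h = expand_set(K_l, low, high)
        Gen_h = expand_set(OUT_l, low, high)
        for d in antidep_defs:
            Gen_h = asd_difference(Gen_h, expand_set(d, low, high))
        return Gen_h, Kill_h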
Propagation Phase

The propagation phase processes the program intervals in a top-down (outermost to innermost) traversal, and the nodes of each interval are visited in a forward traversal. Thus, data made available outside each interval is propagated to nodes inside the interval. During this node traversal, any inner interval is treated as a single node represented by its header.

Our analysis calculates the data availability for a loop iteration k. To determine the data availability of the entire loop, we set k = high. For the elimination phase, the reaching set IN_h is initially empty. For the propagation phase, the initial reaching set for iteration k consists of:
1. The data made available before the loop is entered which is not killed in iterations 1 : k − 1, and
2. The data made available on all previous iterations 1 : k − 1, which has not been killed before iteration k.

We define IN_i^k to be the availability of data at entry to a statement i on the k-th iteration of its loop. Therefore, IN_h^low is the available data that reaches the header node h of an interval before the interval is entered. The data that is available on entry to the k-th loop iteration is given by the equation:

    IN_h^k = [IN_h^low − expand(K_l, k, low : k−1)]
             ∪ [expand(OUT_l, k, low : k−1) − (∪_def expand(def, k, low : k−1))]

where def ranges over definitions that are targets of anti-dependences carried by the loop. Availability information is propagated within the interval using the transfer functions defined in (1) and (2).
[Figure 3: D1 = A(1:100, 2:100), D2 = A(100, 1:99); D1 − D2 = A(1:99, 2:100), A(1:100, 100).]

Fig. 3. Result of Difference Operation on RSD
Example. For the example in Figure 1, IN_1^I = ⟨A(1:99, 2:I-1), A(1:100, I-1)⟩. During the forward traversal of this outer interval, we obtain:

    OUT_2 = ⟨A(1:99, 2:I), A(1:100, I-1:I)⟩
    OUT_6 = ⟨A(1:99, 2:I), A(1:100, I)⟩

For brevity, we have omitted the mapping function, since it is the same for all ASDs on A.
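The per-iteration reaching set can be sketched the same way; the helper operations are again assumed.

    def in_at_iteration(IN_h_low, OUT_l, K_l, antidep_defs, k, low,
                        expand_set, asd_union, asd_difference):
        # Part 1: data reaching the loop, minus data killed in iterations low : k-1.
        survived = asd_difference(IN_h_low, expand_set(K_l, low, k - 1))
        # Part 2: data made available by iterations low : k-1, minus data
        # re-killed by loop-carried anti-dependences before iteration k.
        made = expand_set(OUT_l, low, k - 1)
        for d in antidep_defs:
            made = asd_difference(made, expand_set(d, low, k - 1))
        return asd_union(survived, made)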
4 Operations on Available Section Descriptors

In this section, we present the algorithms for various operations on the ASDs. Each operation is described in terms of further operations on the array section descriptors and the mapping descriptors that constitute the ASDs. There is an implicit order in each of those computations. The operations are first carried out over the array section descriptors, and then over the descriptors of the mapping functions applied to the resulting array section.

Intersection Operation

The intersection of two ASDs represents the elements constituting the common part of their array sections, that are mapped to the same processors. The operation is given by:

    ⟨D_1, M_1⟩ ∩ ⟨D_2, M_2⟩ = ⟨(D_1 ∩ D_2), (M_1 ∩ M_2)⟩

In the above equation, M_1 ∩ M_2 represents the intersection of each of the mapping functions M_1 and M_2 applied to the array region (D_1 ∩ D_2).
Difference Operation

The difference operation causes a part of the array region associated with the first operand to be invalidated at all the processors where it was available.

    ⟨D_1, M_1⟩ − ⟨D_2, ⊤⟩ = ⟨(D_1 − D_2), M_1⟩

The result represents the elements of the reduced array region (D_1 − D_2) that are still available at processors given by the original mapping function M_1.

Union Operation

The union operation presents some difficulty because the ASDs are not closed under the union operation. The same is true for data descriptors like the DADs and the BRSDs, that have been used in practical optimizing compilers [5, 14]. Clearly, there is a need to compute an approximation to the exact union. In the context of determining the availability of data, any errors introduced during this approximation should be towards underestimating the extent of the descriptor.

One way to minimize the loss of information in computing ⟨D_1, M_1⟩ ∪ ⟨D_2, M_2⟩ is to maintain a list consisting of (1) ⟨D_1, M_1⟩, (2) ⟨D_2, M_2⟩, (3) ⟨(D_1 ∪ D_2), (M_1 ∩ M_2)⟩, and (4) ⟨(D_1 ∩ D_2), (M_1 ∪ M_2)⟩. Subsequently, any operations involving the descriptor would have to be carried out over all the elements in that list. The items (3) and (4) are included in the list because they potentially provide more useful information than just (1) and (2).

Example. Let ⟨D_1, M_1⟩ = ⟨A(1:100, 3), ⟨[2, 1], F⟩⟩, and let ⟨D_2, M_2⟩ = ⟨A(1:100, 4), ⟨[2, 1], F⟩⟩, where F_1(j) = F_2(j) = j. Figure 4(a) shows ⟨D_1, M_1⟩ and Figure 4(b) shows ⟨D_2, M_2⟩. Consider the subset test: ⟨A(7:8, 3:4), ⟨[2, 1], F⟩⟩ ⊆ ⟨D_1, M_1⟩ ∪ ⟨D_2, M_2⟩. If the union is represented as the list of ASDs 4(a) and 4(b) only, this subset test will fail, which is inaccurate. The ASD for item (3), ⟨(D_1 ∪ D_2), (M_1 ∩ M_2)⟩, is shown in Figure 4(c). The subset test succeeds for 4(c).

However, this solution is too expensive, as the size of the list grows exponentially in the number of original descriptors to be unioned (the growth would be linear if only items (1) and (2) were included in the list). There are a number of optimizations that can be done to exclude redundant terms from the above list:
- If D_1 ⊆ D_2, the first term can be dropped from the list, since ⟨D_1, M_1⟩ is subsumed by ⟨(D_1 ∩ D_2), (M_1 ∪ M_2)⟩. Similarly, appropriate terms can be dropped when D_2 ⊆ D_1, M_1 ⊆ M_2, or M_2 ⊆ M_1.
- If D_1 = D_2, only the fourth term needs to be retained, since ⟨D_1, M_1 ∪ M_2⟩ subsumes all other terms. If M_1 = M_2, the third term, which effectively evaluates to ⟨D_1 ∪ D_2, M_1⟩, subsumes all other terms, which may hence be dropped.

In addition to these optimizations, the compiler can use heuristics like dropping terms associated with the smaller array regions to ensure that the size of such lists is bounded by a constant. Further discussion of these heuristics is beyond the scope of this paper. A sketch of this list-based union appears below.
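One possible rendering of the bounded list-based union, using the pruning rules above; the helper operations (d_union, d_intersect, m_union, m_intersect, d_subset) are assumed, and this is our sketch, not the authors' implementation.

    def asd_union_list(a1, a2, ops):
        """Approximate <D1,M1> U <D2,M2> as a small list of ASDs."""
        (D1, M1), (D2, M2) = a1, a2
        if D1 == D2:
            return [(D1, ops.m_union(M1, M2))]       # term (4) subsumes the rest
        if M1 == M2:
            return [(ops.d_union(D1, D2), M1)]       # term (3) subsumes the rest
        terms = [(D1, M1), (D2, M2),
                 (ops.d_union(D1, D2), ops.m_intersect(M1, M2)),   # term (3)
                 (ops.d_intersect(D1, D2), ops.m_union(M1, M2))]   # term (4)
        if ops.d_subset(D1, D2):
            terms.remove((D1, M1))                   # subsumed by term (4)
        elif ops.d_subset(D2, D1):
            terms.remove((D2, M2))
        # a real compiler would also bound len(terms) heuristically
        return terms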
[Figure 4: (a) D1 = A(1:100, 3) with M1 = ⟨[2, 1], F⟩; (b) D2 = A(1:100, 4) with M2 = ⟨[2, 1], F⟩; (c) D1 ∪ D2 = A(1:100, 3:4) with M1 ∩ M2 = ⟨[2, 1], F⟩; in each case F_1(j) = F_2(j) = j.]

Fig. 4. ASD Union Operation Example
4.1 Operations on Bounded Regular Section Descriptors

Let D_1 and D_2 be two BRSDs that correspond to array sections A(S_1^1, ..., S_1^n) and A(S_2^1, ..., S_2^n) respectively, where each S_1^i and S_2^i term is a range represented as a triple. We now describe the computations corresponding to different operations on D_1 and D_2.

Intersection Operation

The result is obtained by carrying out individual intersection operations on the ranges corresponding to each array dimension.

    D_1 ∩ D_2 = A(S_1^1 ∩ S_2^1, ..., S_1^n ∩ S_2^n)

The formula for computing the exact intersection of two ranges, S_1^i and S_2^i, expressed as triples, is given in [16]. The result of the intersection operation is another range represented as a triple. If either of the two ranges is equal to ⊥, the result also takes the value ⊥.

Difference Operation

Conceptually, the result is a list of n regions, each region corresponding to the difference taken along one array dimension.

    D_1 − D_2 = list(A(S_1^1 − S_2^1, S_1^2, ..., S_1^n), ..., A(S_1^1, ..., S_1^{n−1}, S_1^n − S_2^n))

When the regions corresponding to D_1 and D_2 differ only in one dimension, all except one of the terms in the above expression evaluate to null. When there is more than one non-null term, the result may be represented as a list, and heuristics used (as discussed earlier) to keep such lists bounded in length. Again, the formula for computing the exact difference, R^i = S_1^i − S_2^i, when both S_1^i and S_2^i are expressed as triples, is given in [16]. If either S_1^i = ⊥ or S_2^i = ⊥, then R^i = ⊥.

Union Operation

The BRSDs are not closed under the union operation. The algorithm described in [14] to compute an approximate union potentially overestimates the region corresponding to the union. We need a different algorithm, one that underestimates the array region while being conservative. In the special case when the array regions identified by D_1 and D_2 differ only in one dimension, say, i, the union operation is given by:

    D_1 ∪ D_2 = A(S_1^1, ..., S_1^{i−1}, S_1^i ∪ S_2^i, S_1^{i+1}, ..., S_1^n)

In the most general case, when the regions differ in each dimension, an exhaustive list of regions corresponding to D_1 ∪ D_2 would include (1) A(S_1^1, ..., S_1^n), (2) A(S_2^1, ..., S_2^n), and the following n terms: A(S_1^1 ∪ S_2^1, S_1^2 ∩ S_2^2, ..., S_1^n ∩ S_2^n), ..., A(S_1^1 ∩ S_2^1, ..., S_1^{n−1} ∩ S_2^{n−1}, S_1^n ∪ S_2^n). Once again, heuristics are needed to keep the lists bounded. Figure 5 describes an algorithm for computing an approximate union of two ranges.
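For illustration, here is how the per-dimension intersection and the dimension-wise BRSD intersection might look, continuing the earlier Range sketch; this handles unit strides only, while the general strided formulas are those of [16].

    from typing import List, Optional

    def range_intersect(r1: Optional[Range], r2: Optional[Range]) -> Optional[Range]:
        # Unit-stride case only; None plays the role of the bottom value.
        if r1 is None or r2 is None:
            return None
        assert r1.s == 1 and r2.s == 1, "general strides: see the formulas in [16]"
        lo, hi = max(r1.l, r2.l), min(r1.u, r2.u)
        return Range(lo, hi) if lo <= hi else None

    def brsd_intersect(s1: List[Range], s2: List[Range]) -> Optional[List[Range]]:
        # D1 n D2 = A(S1^1 n S2^1, ..., S1^n n S2^n); an empty range in any
        # dimension makes the whole section empty.
        dims = [range_intersect(a, b) for a, b in zip(s1, s2)]
        return None if any(d is None for d in dims) else dims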
4.2 Operations on Mapping Function Descriptors

Consider two mapping function descriptors, M_1 = ⟨P_1, F_1⟩ and M_2 = ⟨P_2, F_2⟩, associated with the same array section A(S). The intersection and union operations are defined as:

    M_1 ∩ M_2 = ⟨P_1, F_1 ∩ F_2⟩    if P_1 = P_2
                ⟨⊥, ⊥⟩              if P_1 ≠ P_2

    M_1 ∪ M_2 = ⟨P_1, F_1 ∪ F_2⟩    if P_1 = P_2
                list(M_1, M_2)      if P_1 ≠ P_2

Let F_1 = [F_1^1, ..., F_1^n], and let F_2 = [F_2^1, ..., F_2^n]. The computations of the various operations between F_1 and F_2 can be described at two levels: (i) computation of F_1 ∩ F_2 and F_1 ∪ F_2 in terms of further operations over F_1^i and F_2^i, 1 ≤ i ≤ n, and (ii) computation of F_1^i ∩ F_2^i, and of F_1^i ∪ F_2^i.
    function union-range(S_1, S_2)
        if (S_1 = ⊥) return S_2
        if (S_2 = ⊥) return S_1
        Let S_1 = (l_1 : u_1 : s_1), S_2 = (l_2 : u_2 : s_2)
        if (s_1 mod s_2 = 0) and ((l_2 − l_1) mod s_1 = 0)
            s = s_1
        else if (s_2 mod s_1 = 0) and ((l_2 − l_1) mod s_2 = 0)
            s = s_2
        else return list(S_1, S_2)
        if (l_1 ≤ l_2) and (u_1 ≥ l_2)
            return (l_1 : max(u_1, u_2) : s)
        else if (l_2 ≤ l_1) and (u_2 ≥ l_1)
            return (l_2 : max(u_1, u_2) : s)
        return list(S_1, S_2)
    end union-range

Fig. 5. Algorithm to Compute Union of Ranges
The computations of the first type are identical to those described for the BRSDs. For example, the intersection operation is given by:

    F_1 ∩ F_2 = [F_1^1 ∩ F_2^1, ..., F_1^n ∩ F_2^n]

We now describe the intersection and union operations on the mapping functions of individual array dimensions. As mentioned in Section 2, depending on whether P_i represents a true array dimension or a "missing" dimension, F_1^i is either a function of the type F_1^i(j) = (c_1·j + l_1 : c_1·j + u_1 : s_1), or simply a constant range, F_1^i = (l_1 : u_1 : s_1). We need to describe only the results for the former case, since the latter can be viewed as a special case of the former, with c_1 set to zero. Thus, in the remainder of this section, we shall regard F_1^i and F_2^i as functions of the form F_1^i(j) = (c_1·j + l_1 : c_1·j + u_1 : s_1), and F_2^i(j) = (c_2·j + l_2 : c_2·j + u_2 : s_2).

Intersection Operation

There are two cases:

Case 1: c_1 = c_2. Let (l : u : s) = (l_1 : u_1 : s_1) ∩ (l_2 : u_2 : s_2), as described in [16]. The result is given by:

    (F_1^i ∩ F_2^i)(j) = (c_1·j + l : c_1·j + u : s)
Case 2: c_1 ≠ c_2. In this case, we check if one of the mapping functions completely covers the other function over the data domain, and otherwise return ⊥. The high-level algorithm is:

    if F_1^i(j) ⊆ F_2^i(j), ∀ j ∈ S_P    return F_1^i
    else if F_2^i(j) ⊆ F_1^i(j), ∀ j ∈ S_P    return F_2^i
    else return ⊥

Let S_P = (low : high : step). We now describe the conditions for checking if F_1^i is covered by F_2^i. For j ∈ (low : high : step),

    F_1^i(j) ⊆ F_2^i(j) ⟺ (c_1·j + l_1 : c_1·j + u_1 : s_1) ⊆ (c_2·j + l_2 : c_2·j + u_2 : s_2)

A set of sufficient conditions for the above relationship to hold is: (i) c_1·j + l_1 ≥ c_2·j + l_2 for low ≤ j ≤ high, (ii) c_1·j + u_1 ≤ c_2·j + u_2 for low ≤ j ≤ high, (iii) [(c_1 − c_2)·low + (l_1 − l_2)] mod s_2 = 0, and (iv) s_1 mod s_2 = 0. In the special case when s_1 = s_2 = 1, the last two conditions are satisfied trivially, and conditions (i) and (ii) are both necessary and sufficient. The conditions (i) and (ii) can be further simplified to the following:

    (i) c_1·low + l_1 ≥ c_2·low + l_2,    (ii) c_1·high + u_1 ≤ c_2·high + u_2,    if c_1 ≥ c_2
    (i) c_1·high + l_1 ≥ c_2·high + l_2,  (ii) c_1·low + u_1 ≤ c_2·low + u_2,      if c_1 < c_2

The conditions for checking if F_2^i(j) ⊆ F_1^i(j), j ∈ S_P, can be derived in a similar manner.

Union Operation

Once again, there are two cases:

Case 1: c_1 = c_2. Let (l : u : s) denote (l_1 : u_1 : s_1) ∪ (l_2 : u_2 : s_2), as described in Figure 5. The result is given by:

    (F_1^i ∪ F_2^i)(j) = (c_1·j + l : c_1·j + u : s)

If (l_1 : u_1 : s_1) ∪ (l_2 : u_2 : s_2) is actually a list, F_1^i ∪ F_2^i is also represented as a list, each element of which is given by the above equation.

Case 2: c_1 ≠ c_2. We check if one of the mapping functions completely covers the other, and otherwise return a list, as shown below:

    if F_1^i(j) ⊆ F_2^i(j), ∀ j ∈ S_P    return F_2^i
    else if F_2^i(j) ⊆ F_1^i(j), ∀ j ∈ S_P    return F_1^i
    else return list(F_1^i, F_2^i)
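The coverage test is straightforward to code. The following Python sketch implements the sufficient conditions (i)-(iv) above, assuming integer coefficients, bounds, and strides; the function names are ours.

    def f1_within_f2(f1, f2, low, high):
        """Sufficient test that F1(j) = (c1*j+l1 : c1*j+u1 : s1) is contained
        in F2(j) = (c2*j+l2 : c2*j+u2 : s2) for every j in low..high."""
        (c1, l1, u1, s1), (c2, l2, u2, s2) = f1, f2
        if s1 % s2 != 0:                                   # condition (iv)
            return False
        if ((c1 - c2) * low + (l1 - l2)) % s2 != 0:        # condition (iii)
            return False
        if c1 >= c2:                                       # simplified (i), (ii)
            return (c1 * low + l1 >= c2 * low + l2 and
                    c1 * high + u1 <= c2 * high + u2)
        return (c1 * high + l1 >= c2 * high + l2 and
                c1 * low + u1 <= c2 * low + u2)

    def map_fn_intersect(f1, f2, low, high):
        # Case 2 of the intersection operation (c1 != c2): keep the covered
        # function, otherwise give up and return None (the bottom value).
        if f1_within_f2(f1, f2, low, high):
            return f1
        if f1_within_f2(f2, f1, low, high):
            return f2
        return None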
5 Communication Optimizations

Availability of global data-flow information enables the compiler to perform communication optimizations that improve the performance of the program as a whole. Several optimizations are briefly described below.
5.1 Redundant Communication Elimination

The most obvious optimization is the elimination of redundant messages: if the receiving processor already has a valid copy of the data being communicated, that communication can be eliminated. More precisely, let IN_i = ⟨D_1, M_1⟩ be the reaching availability set and Gen_i = ⟨D_2, M_2⟩ be the data made available at node n_i. Then communication for Gen_i is redundant if ⟨D_2, M_2⟩ ⊆ ⟨D_1, M_1⟩, i.e., if the following conditions are satisfied:
1. D_2 ⊆ D_1
2. M_2(D_2) ⊆ M_1(D_2).

This analysis assumes that the relevant data received during prior communication is not destroyed. In fact, the results of the analysis can be used to determine which data can be usefully cached. Data invalidation is not necessary, since the compile-time analysis ensures consistency.
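In the notation of the sketches above, the redundancy test might read as follows; d_subset, m_subset, and the restriction of a mapping descriptor to a section are assumed operations.

    def communication_redundant(IN_i, Gen_i, ops):
        """True if <D2, M2> (Gen_i) is contained in <D1, M1> (IN_i)."""
        (D1, M1), (D2, M2) = IN_i, Gen_i
        # 1. the section itself must already be available somewhere ...
        if not ops.d_subset(D2, D1):
            return False
        # 2. ... and on (at least) the processors that would receive it;
        #    M2 and M1 are compared after restricting both to D2.
        return ops.m_subset(ops.restrict(M2, D2), ops.restrict(M1, D2))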
5.2 Communication Placement

The placement of communication can greatly impact performance. For example, prefetching data enables optimizations such as message vectorization and overlapping communication with computation. Another reason to move the communication earlier is for subsuming subsequent communication. Consider the program:

    do k = 1, n
        z(k, 100) = F(d(k))
    enddo
    ...
    do j = 1, n
        do k = 1, n
            z(k, j) = F(d(k))
        enddo
    enddo
In the first loop, d must be aligned with the hundredth column of z. In the second loop nest, d must be aligned with every column of z. If the communication needed for the second loop nest is performed before the first loop, then the first communication is unnecessary. Availability analysis is useful for optimizing communication placement, as the check below illustrates.
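In ASD terms this is the same containment test as in Section 5.1. With constant mapping functions (c = 0), the coverage check from Section 4.2 decides it directly; a small worked instance, assuming n ≥ 100 so that column 100 exists:

    # d(1:n) on grid column 100 (first loop's need) vs. on columns 1:n
    # (what the hoisted communication for the second loop makes available).
    n = 200
    col_first = (0, 100, 100, 1)     # F(j) = 100     (c=0, l=u=100, s=1)
    col_hoisted = (0, 1, n, 1)       # F(j) = 1 : n   (c=0, l=1, u=n, s=1)
    assert f1_within_f2(col_first, col_hoisted, low=1, high=n)
    # hence the first loop's communication is subsumed and can be eliminated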
5.3 Message Merging

When different messages must be sent between the same processors, it is possible to optimize the communication by merging messages. Availability information specifies when data is identically mapped across multiple statements over the whole program. This is precisely the information that is needed for merging messages.
5.4 Multiple Sender Selection

If message-send data is available on several processors, one processor may be better suited to playing the role of the sending processor than the others. A processor may be better suited if it is closer to the receiving processor, or if selecting it leads to a better communication pattern. For instance, consider the well-known distributed memory algorithm for matrix multiplication C = A * B in which A is distributed by rows, and B is distributed by columns [10]. The columns of the second array are to be rotated among processors as the computation proceeds. The communication can be realized using a shift operation if we let a processor other than the owner send the necessary data. Our analysis makes explicit when different copies of the same data are available.
6 Conclusions

We have presented a new framework for analyzing array references for distributed memory compilation. This framework has the advantage that the analysis is performed on the original program form, before it is transformed into a local SPMD node program: it is generally much harder to analyze a program once the semantic level has been lowered by inserting a lot of additional code. The availability analysis results can be used to perform various global communication optimizations, thus improving the final node program. For future work, the communication optimization algorithms briefly described in Section 5 must be further developed. Also, experience is needed with the ASDs, to design an effective union operation, and to assess the cost of this analysis in practice.
References

1. A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
2. F. E. Allen and J. Cocke. A program data flow analysis procedure. Communications of the ACM, 19(3):137-147, March 1976.
3. J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.
4. S. P. Amarasinghe and M. S. Lam. Communication optimization and code generation for distributed memory machines. In Proc. ACM SIGPLAN '93 Conference on Programming Language Design and Implementation, Albuquerque, New Mexico, June 1993.
5. V. Balasundaram. A mechanism for keeping useful internal information in parallel programming tools: the data access descriptor. Journal of Parallel and Distributed Computing, 9(2):154-170, June 1990.
6. M. Burke. An interval-based approach to exhaustive and incremental interprocedural data-flow analysis. ACM Transactions on Programming Languages and Systems, 12(3):341-395, July 1990.
7. D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. Journal of Parallel and Distributed Computing, 5:517-550, 1988.
8. S. Chatterjee, J. R. Gilbert, R. Schreiber, and S.-H. Teng. Optimal evaluation of array expressions on massively parallel machines. In Proc. Second Workshop on Languages, Compilers, and Runtime Environments for Distributed Memory Multiprocessors, Boulder, CO, October 1992.
9. High Performance Fortran Forum. High Performance Fortran language specification, version 1.0. Technical Report CRPC-TR92225, Rice University, January 1993.
10. G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice Hall, 1988.
11. E. Granston and A. Veidenbaum. Detecting redundant accesses to array data. In Proc. Supercomputing '91, pages 854-965, 1991.
12. T. Gross and P. Steenkiste. Structured dataflow analysis for arrays and its use in an optimizing compiler. Software - Practice and Experience, 20(2):133-155, February 1990.
13. M. Gupta and P. Banerjee. A methodology for high-level synthesis of communication on multicomputers. In Proc. 6th ACM International Conference on Supercomputing, Washington D.C., July 1992.
14. P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.
15. S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66-80, August 1992.
16. C. Koelbel. Compiling programs for nonshared memory machines. PhD thesis, Purdue University, August 1990.
17. J. Li and M. Chen. Compiling communication-efficient programs for massively parallel machines. IEEE Transactions on Parallel and Distributed Systems, 2(3):361-376, July 1991.
18. M. J. Quinn and P. J. Hatcher. Data-parallel programming on multicomputers. IEEE Software, 7:69-76, September 1990.
19. A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proc. SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 69-80, June 1989.
20. C. Rosene. Incremental Dependence Analysis. PhD thesis, Rice University, March 1990.
21. R. Ruhl and M. Annaratone. Parallelization of Fortran code on distributed-memory parallel processors. In Proc. 1990 ACM International Conference on Supercomputing, Amsterdam, The Netherlands, June 1990.
22. R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz. Compiler analysis for irregular problems in Fortran D. In Proc. 5th Workshop on Languages and Compilers for Parallel Computing, New Haven, CT, August 1992.
23. H. Zima, H. Bast, and M. Gerndt. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, 1988.