Advanced Code Generation for High Performance Fortran

Vikram Adve and John Mellor-Crummey
Department of Computer Science and Center For Research on Parallel Computation, Rice University, Houston, Texas, USA
Summary. For data-parallel languages such as High Performance Fortran to
achieve wide acceptance, parallelizing compilers must be able to provide consistently high performance for a broad spectrum of scientific applications. Although compilation of regular data-parallel applications for message-passing systems has been widely studied, current state-of-the-art compilers implement only a small number of key optimizations, and the implementations generally focus on optimizing programs using a "case-based" approach. For these reasons, current compilers are unable to provide consistently high levels of performance. In this paper, we describe techniques developed in the Rice dHPF compiler to address key code generation challenges that arise in achieving high performance for regular applications on message-passing systems. We focus on techniques required to implement advanced optimizations and to achieve consistently high performance with existing optimizations. Many of the core communication analysis and code generation algorithms in dHPF are expressed in terms of abstract equations manipulating integer sets. This approach enables general and yet simple implementations of sophisticated optimizations, making it more practical to include a comprehensive set of optimizations in data-parallel compilers. It also enables the compiler to support much more aggressive computation partitioning algorithms than in previous compilers. We therefore believe this approach can provide higher and more consistent levels of performance than are available today.
1. Introduction

Data-parallel languages such as High Performance Fortran (HPF) [29, 31] aim to make parallel scientific computing accessible to a much wider audience by providing a simple, portable, abstract programming model applicable to a wide variety of parallel computing systems. For such languages to achieve wide acceptance, it will be essential to have parallelizing compilers that provide consistently high performance for a broad spectrum of scientific applications. To achieve the desired levels of performance and consistency, compilers must necessarily exploit a wide variety of optimization techniques and effectively apply them to programs with as few restrictions as possible. Engineering HPF compilers that provide consistently high performance for a wide range of programs is a challenging task. The data layout directives in an HPF program provide an abstract, high-level specification of maximal data-parallelism and data-access locality. The compiler must use this information to choose how to partition the computation among processors,
determine what data movement and synchronization is necessary, and generate code to implement the partitioning, communication and synchronization. Accounting for interactions and feedback among these steps complicates program analysis and code generation. To achieve high efficiency, optimizations must analyze and transform the program globally within procedures, and often interprocedurally as well. The most widely studied sub-problem of data-parallel compilation is that of compiling "regular" data-parallel applications on message-passing systems. Data-parallel programs are known as "regular" if the mapping of each array's data elements to processors can be described by an affine mapping function and the array sections accessed by each array reference can be computed symbolically at compile time. Even within this class of applications, state-of-the-art commercial and research compilers do not consistently achieve performance competitive with hand-written code [16, 24]. Although many important optimizations for such systems have been proposed by previous researchers, current compilers implement only a small fraction of these optimizations, generally focusing on the most fundamental ones such as static loop partitioning based on the "owner-computes" rule [39], moving messages out of loops, reducing the number of data copies, and exploiting collective communication. Furthermore, even for these optimizations, most research and commercial data-parallel compilers to date [7, 10, 15, 16, 17, 19, 24, 32, 33, 35, 42, 45, 46] (including the Rice Fortran 77D compiler [24]) perform communication analysis and code generation for specific combinations of the form of references, data layouts and computation partitionings. While such "case-based" approaches can provide excellent performance where they apply, they will provide poor performance for cases that have not been explicitly considered. More importantly, case-based compilers require a relatively high development cost for each new optimization because the analysis and code generation for each case is handled separately; this makes it difficult to achieve wide coverage with optimizations, which in turn makes it difficult to offer consistently high performance. In this paper, we describe techniques to address key code generation challenges that arise in a sophisticated compiler for regular data-parallel applications on message-passing systems. We focus on techniques required to implement advanced optimizations and to achieve consistently high performance with existing optimizations. With minor exceptions, these techniques have been implemented in the Rice dHPF compiler, an experimental research compiler for High Performance Fortran. Although this paper focuses on compilation techniques for regular problems on message-passing systems, the dHPF compiler is being designed to integrate handling for regular and irregular applications, and to target other architectures including shared-memory and hybrid systems. The principal code generation challenges we address are the following:
- Flexible computation partitionings: Higher performance can be achieved if compilers can go beyond the widely-used owner-computes rule to support a much more flexible class of computation partitionings. More general computation partitionings require two key compiler enhancements. First, they require robust communication analysis techniques that are not limited to specific partitioning assumptions. Second, they also require sophisticated code generation techniques to guarantee correctness in the presence of arbitrary control flow, and to generate partitioned code with good scalar efficiency. The dHPF compiler supports a much more general computation partitioning (CP) model than in previous data-parallel compilers. We describe the communication analysis and code generation techniques required to support this model.
- Robust algorithms for communication and code generation: The core communication analysis, optimization, and code generation algorithms in dHPF are expressed in terms of abstract equations manipulating integer sets rather than as a collection of strategies for different cases. Optimizations we have formulated in this manner include message vectorization [14], message coalescing [43], recognizing in-place communication [1], code generation for our general CP model [1], non-local index set splitting [32], control-flow simplification [34], and generalized loop-peeling for improving parallelism. By formulating these algorithms in terms of operations on integer sets, we are able to abstract away the details of the CPs, references, and data layouts for each problem instance. All of these algorithms fully support our general computation partitioning model, and can be used for arbitrary combinations of computation partitionings, data layouts and affine reference subscripts.
- Simplifying compiler-generated control flow: Loop splitting transformations, performed to minimize the dynamic cost of adding new guards, produce new loops with smaller iteration spaces which can render existing guards and loops inside redundant or infeasible. This raises the need for an algorithm that can determine the symbolic constraints that hold at each control point and use them to simplify or eliminate branches and loops in the generated code. We motivate and briefly describe a powerful algorithm for constraint propagation and control-flow simplification used in the dHPF compiler [34]. In a preliminary evaluation, the algorithm has proven highly effective at eliminating excess control flow in the generated code. Furthermore, we find that the general-purpose control-flow simplification algorithm provides some or all of the benefits of special-purpose optimizations such as vector-message pipelining [43] and overlap areas [14].

Our aim in this chapter is to motivate and provide an overview of the techniques we use to address the challenges described above. The algorithms underlying these techniques are described and evaluated in detail elsewhere [1, 34]. In the following section, we use an example to describe
the basic steps in generating an explicit message-passing parallel program for HPF. We also use the example to show why integer sets are fundamental to the problem of HPF compilation, and describe the approaches used in previous data-parallel compilers for computing and generating code from integer sets. In Section 3 we describe a general integer-set framework underlying HPF compilation, and the implementation of this framework in dHPF. The framework directly supports integer set based algorithms for many of our optimizations, and these are briefly described in the subsequent sections. In Section 4, we define our general computation partitioning model and describe how we support code generation for it. In Section 5, we provide an overview of the principal communication optimizations in dHPF and then briefly describe the key optimizations that were formulated in terms of the integer set framework. In Section 6, we motivate and describe control-flow simplification in dHPF, and present a brief evaluation of its effectiveness. Finally, in Section 7, we conclude with a brief summary and discussion of the techniques described in this chapter.
2. Background: The Code Generation Problem For HPF

The High Performance Fortran standard describes a number of extensions to Fortran 90 to guide compiler parallelization for parallel systems. The language is discussed in some detail in an earlier chapter [29], and we assume the reader is familiar with the major aspects of the language (particularly, the data distribution directives). Throughout this paper, we assume a message-passing target system, although many of the same analyses are required or profitable for shared-memory systems as well.
2.1 Communication analysis and code generation for HPF

To understand the basic problem of compiling an HPF program into an explicitly parallel message-passing program, and to motivate our use of a general integer-set framework for analysis and code generation, consider the simple example in Figure 2.1. The source loop represents a nearest-neighbor stencil computation similar to those found in partial differential equation solvers. The two arrays are aligned with each other and both are distributed (block,block) on a two-dimensional processor array. To generate an explicitly parallel code for the program, the compiler must first decide (a) how to partition the computation for each statement in the program, (b) which references might access non-local data due to the chosen partitioning, and (c) how and when to instantiate the communication to obtain this non-local data.
CHPF$ processors P(3,3)
CHPF$ distribute A(block,block) onto P
CHPF$ distribute B(block,block) onto P
      do 10 j = 2, N-1
         do 10 i = 2, N-1
            A(i,j) = 0.25 * (B(i-1,j) + B(i+1,j) + B(i,j-1) + B(i,j+1))
   10 continue
(a) HPF source code
(b) Processor array P(3,3)
(c) Communication and iteration sets [figure: shows the local section of processor P(2,2) (and the iterations it executes), the non-local data P(2,2) accesses from its neighbors, and the iterations that access non-local data]

Fig. 2.1. Example illustrating integer sets required for code generation for the Jacobi kernel.
Assume the compiler chooses an "owner-computes" partitioning for the statement in the loop, i.e., each instance of the statement is executed by the processors that own the value being computed, viz., A(i,j). In this case, each of the four references on the right hand side (RHS) accesses some off-processor elements, namely the boundary elements on the four neighboring processors. Since the array B is not modified inside the loop nest, communication for these references can be moved out of the loops and placed before the loop nest. In order to generate efficient explicitly parallel SPMD code, the compiler must compute the following quantities, and then use these to generate code. These quantities are illustrated in Figure 2.1:
1. the sections of each array allocated to (owned by) each processor;
2. the set of iterations to be executed by each processor (conforming with the owned section of array A);
3. the non-local data accessed from each other processor by each reference (the off-processor boundary sections shown in the figure);
4. the iterations that access non-local data and the iterations that access exclusively local data. (These sets are used by advanced optimizations such as those described in Section 5.3.)
All of these quantities can be symbolically represented as sets of integer tuples (representing array indices, loop iterations, or processor indices), or as mappings between integer tuples (e.g., an array layout is a mapping from processor indices to array indices). These sets and mappings are defined in Section 3. The sets may be non-convex, as is the set of iterations accessing non-local data shown in Figure 2.1. To generate a statically partitioned, message-passing program, any data-parallel compiler must implicitly or explicitly compute the above sets, and then use these sets to generate code. The compiler typically generates SPMD code for a representative processor myid by performing the following tasks in some order (the resulting code is omitted here):
1. Synthesize a loop nest to execute the iterations assigned to myid.
2. For each message, if explicit buffering of data is necessary, synthesize code to pack a buffer at the sending processor and/or to unpack a buffer at the receiving processor.
3. Allocate storage to hold the non-local data, and modify the code to access data out of this storage (note that different references access non-local data in different iterations).
4. Allocate storage for the local sections for each array, and modify array references (for local data) to index appropriately into these local sections.
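To make these tasks concrete, the following is a minimal hand-written sketch (not dHPF output) of the kind of SPMD code a compiler must synthesize for the Jacobi example. It assumes each processor owns an nr x nc tile of A and B stored with a one-element overlap ("ghost") border, and that the hypothetical neighbor ranks west and east have been computed elsewhere; only the east-to-west boundary exchange is shown, and the adjustment of loop bounds at the global array edges is omitted.

      subroutine jacobi_sweep(A, B, nr, nc, west, east, comm)
c     Hedged sketch: illustrates tasks 1-4 above for one boundary exchange.
      include 'mpif.h'
      integer nr, nc, west, east, comm, ierr
      integer status(MPI_STATUS_SIZE)
      double precision A(0:nr+1, 0:nc+1), B(0:nr+1, 0:nc+1)
      integer i, j

c     Tasks 2/3: send my easternmost owned column; receive my western
c     neighbor's boundary column into the overlap (ghost) column 0.
c     (A column is contiguous in Fortran, so no explicit packing is needed.)
      if (east .ne. MPI_PROC_NULL) then
         call MPI_Send(B(1,nc), nr, MPI_DOUBLE_PRECISION, east, 0,
     &                 comm, ierr)
      endif
      if (west .ne. MPI_PROC_NULL) then
         call MPI_Recv(B(1,0), nr, MPI_DOUBLE_PRECISION, west, 0,
     &                 comm, status, ierr)
      endif
c     (the west-to-east and north/south exchanges are analogous and omitted)

c     Task 1: iterate over only the locally owned iterations (reduced bounds),
c     indexing into the local sections of A and B (task 4).
      do j = 1, nc
         do i = 1, nr
            A(i,j) = 0.25d0 * (B(i-1,j) + B(i+1,j)
     &                       + B(i,j-1) + B(i,j+1))
         enddo
      enddo
      end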
2.2 Previous Approaches to Communication Analysis and Code Generation

To compute the above sets and to generate code using them, the primary approach in most previous research and commercial compilers has been to focus on individual common cases and to precompute the iteration and communication sets symbolically for specific forms of references, data layouts and computation partitionings [7, 10, 15, 16, 17, 19, 24, 32, 33, 42, 46]. For example, to implement the Kali compiler [32], the authors pre-compute symbolically the iteration sets and communication sets for subscripts of the form c, i + c and c - i, where i is a loop index variable, for BLOCK and CYCLIC distributions. The Fortran 77D compiler also handled the same classes of references and distributions, but computed special-case expressions for the iteration sets for "interior" and "boundary" processors [24]. (In fact, both these groups have described basic compilation steps in terms of abstract set operations [23, 32]; however, this was used only as a pedagogical abstraction and the corresponding compilers were implemented using case-based analysis.) Li and Chen describe algorithms to classify communication caused by more general reference patterns (assuming aligned arrays and the owner-computes rule), and generate code to realize these patterns efficiently on a target machine [33]. In general, these compilers focus on providing specific optimizations aimed at cases that are considered to be the most common and most important. The principal benefits of such case-based strategies are
that they are conceptually simple and hence lend themselves well to initial implementations, they have predictable and fast running times, and they can provide excellent performance in cases where they apply. Three groups have used a more abstract and general approach based on linear inequalities to support code generation for communication and iteration sets [2, 3, 4, 5]. In this approach, each code generation or communication optimization problem is described by a collection of linear inequalities representing integer sets or mappings. Fourier-Motzkin elimination [41] is used to simplify the resulting inequalities, and to compute a range of values for individual index variables that together enumerate the integer points described by these inequalities. Code generation then amounts to generating loops to iterate over these index ranges. In the PIPS and Paradigm compilers, these techniques were primarily used for code generation for communication and iteration sets [3, 5]. In the SUIF compiler, these techniques were also applied to carry out specific optimizations including message vectorization, message coalescing (limited to certain combinations of references) and redundant message elimination [2]. The advantage of using linear inequalities over case-based approaches is that each optimization or code generation problem can be expressed and solved in abstract terms, independent of the specific forms of references, data layouts, and computation partitionings. Furthermore, Fourier-Motzkin elimination is applicable to arbitrary affine references and data layouts. The primary limitation of linear-inequality based approaches in previous compilers is that they have limited their focus to problems that can be represented by the intersection of a single set of inequalities. This limited the scope of their techniques so that, for example, they would be unable to support our general computation partitioning model, coalescing communication for arbitrary affine references, or general loop-splitting into local and non-local iterations. We considered all of these capabilities to be important goals for the dHPF compiler. A second drawback of linear-inequality based approaches is that each problem or optimization is expressed directly in terms of large collections of inequalities, which must be constructed to represent operations such as intersections and compositions of sets and mappings. It appears much easier and more intuitive to express complex optimizations directly in terms of sequences of abstract set operations on integer sets, as shown in [1]. There is also a large body of work on techniques to enumerate communication sets and iteration sets in the presence of cyclic(k) distributions (e.g., [12, 17, 28, 35, 45]). Compared to more general approaches based on integer sets or linear inequalities, these techniques likely provide more efficient support for cyclic(k) distributions, particularly when k > 1, but would be much less efficient for simpler distributions, and are much less general in the forms of references and computation partitionings they could handle. Our goal has been to base the dHPF compiler on a general analysis framework that provides good performance in the vast majority of common cases
(within regular distributions), and requires such special-purpose techniques as infrequently as possible. Such techniques can be added as special-purpose optimizations in conjunction with the integer-set framework, but even in the absence of these techniques, we expect that the set framework itself will provide acceptably efficient support for cyclic(k) distributions. To summarize, we believe there are two essential advantages and one significant disadvantage of the more general and abstract approaches based on linear inequalities or integer sets, compared with the case-based approaches. First, the former use general algorithms that handle the entire class of regular problems fairly well, whereas case-based approaches apply more narrowly and must fall back more often on inefficient techniques such as run-time resolution for cases they do not handle. In the absence of special-case algorithms, the general approaches are likely to provide much higher performance. Support for exploiting special cases (e.g., for using collective communication primitives) can be added to the former if they provide substantial performance improvements, but they should be needed in very few cases. Second, the more abstract framework provided by linear inequalities (to some extent) and by integer sets (to a greater extent) greatly simplifies the compiler-writer's task of implementing important optimizations that are generally applicable, and therefore makes it practical to achieve high performance for a wider class of programs. By combining both generality and simplicity, we believe an approach such as that of using integer sets can provide higher and more consistent levels of performance than is available today. In contrast, the principal advantages of case-based approaches are that preliminary implementations can be simple, and that they typically have fast and predictable running times. For more general approaches, running time is the greatest concern, since manipulation of linear inequalities and integer sets can be costly and unpredictable in difficult cases. This issue is discussed in more detail in later sections.
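As a small worked example of the linear-inequality approach discussed above (our own construction, not taken from any of the compilers cited), suppose a processor p must execute the iterations {(i,j) : 1 ≤ i ≤ N ∧ i ≤ j ≤ N ∧ 25p+1 ≤ j ≤ 25p+25}. Eliminating j with Fourier-Motzkin elimination yields the outer bounds 1 ≤ i ≤ min(N, 25p+25) plus the processor-only condition 25p+1 ≤ N (which guards whether the nest executes at all), and the remaining constraints on j become the inner bounds, so code generation reduces to emitting a loop nest of roughly this form (the loop body is a hypothetical placeholder):

      do i = 1, min(N, 25*p + 25)
         do j = max(i, 25*p + 1), min(N, 25*p + 25)
c           hypothetical placeholder for the statement instance (i,j)
            call body(i, j)
         enddo
      enddo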
3. An Integer Set Framework for Data-Parallel Compilation

As discussed in the previous section, any compiler for a data-parallel language based on data distributions can be viewed as operating in terms of some fundamental sets of integer tuples and mappings between these sets. This formulation is made explicit in dHPF and in the SUIF compiler [2], and similar formulations have been discussed elsewhere [32, 23]. The integer set framework in dHPF includes the representation of these primitive quantities as integer tuple sets and mappings, together with the operations to manipulate them and generate code from them. The core optimizations in dHPF are implemented directly on this framework. This section explains the primitive components of the framework, and the implementation of the framework. The following sections describe the optimizations formulated using the framework.
3.1 Primitive Components of the Framework

An integer k-tuple is a point in Z^k; a tuple space of rank k is a subset of Z^k. Any compiler for a data-parallel language based on data distributions operates primarily on three types of tuple spaces, and the three pairwise mappings between these tuple spaces [32, 23, 2]. These are:¹

   data_k : the index set of an array of rank k, k ≥ 0
   loop_k : the iteration space of a loop nest of depth k, k ≥ 0
   proc_k : the processor index space of a processor array of rank k, k ≥ 1

   Layout : [p] → [a] : p ∈ proc_n owns array element a ∈ data_k
   Ref    : [i] → [a] : i ∈ loop_k references array element a ∈ data_k
   CPMap  : [p] → [i] : p ∈ proc_n executes statement instance i ∈ loop_k

Scalar quantities, such as a "data set" for a scalar or the "iteration set" for a statement not enclosed in any loop, are handled uniformly within the framework as tuples of rank zero.² For example, the computation partitioning for a statement (outside any loop) assigned to processor P in a 1-D processor array would be represented as the mapping {[ ] → [p] : p = P}. Hereafter, the terms "array" and "iterations of a statement" should be thought of as including scalars and outermost statements as well. Note that any mapping we require, including a mapping with a domain of rank 0, will be invertible.

¹ We use names with lower-case initial letters for tuple sets and names with upper-case initial letters for mappings, respectively.
² A set of rank 0, {[ ] : f(v_1, ..., v_n)}, should be interpreted as a boolean that takes the value true or false, depending on whether the constraints given by f(v_1, ..., v_n) are satisfied. Here v_1, ..., v_n are symbolic integer variables.

Of these primitive sets and mappings, the sets loop and proc and the mappings Layout and Ref are constructed directly from the compiler's intermediate representation, and form the primary inputs for further analyses. These quantities are constructed from a powerful symbolic representation used in dHPF, namely global value numbering. A value number in dHPF is a handle for a symbolic expression tree. Value numbers are constructed from dataflow analysis of the program based on its Static Single Assignment (SSA) form [13], such that any two subexpressions that are known to have identical runtime values are assigned the same value number [21]. Their construction subsumes expression simplification, constant propagation, auxiliary induction variable recognition, and computing range information for expressions of loop index variables. A value number can be reconstituted back into an equivalent code fragment that represents the value.

Figure 3.1 illustrates simple examples of the primitive sets and mappings for an example HPF code fragment. For clarity, we use different set variables to denote points in different tuple spaces.
      real A(0:99,100), B(100,100)
      processors P(4)
      template T(100,100)
      align A(i,j) with T(i+1,j)
      align B(i,j) with T(*,i)
      distribute T(*,block) onto P

      read(*), N
      do i = 1, N
        do j = 2, N+1
          A(i,j) = B(j-1,i)      ! on home B(j-1,i)
        enddo
      enddo
      symbolic N

      Align_A  = {[a_1, a_2] → [t_1, t_2] : t_1 = a_1 + 1 ∧ t_2 = a_2}
      Align_B  = {[b_1, b_2] → [t_1, t_2] : t_2 = b_1}
      Dist_T   = {[t_1, t_2] → [p] : 25p + 1 ≤ t_2 ≤ 25(p + 1) ∧ 0 ≤ p ≤ 3}
      Layout_A = Dist_T^{-1} ∘ Align_A^{-1}
               = {[p] → [a_1, a_2] : max(25p + 1, 1) ≤ a_2 ≤ min(25p + 25, 100) ∧ 0 ≤ a_1 ≤ 99}
      Layout_B = Dist_T^{-1} ∘ Align_B^{-1}
               = {[p] → [b_1, b_2] : max(25p + 1, 1) ≤ b_1 ≤ min(25p + 25, 100) ∧ 1 ≤ b_2 ≤ 100}
      loop     = {[l_1, l_2] : 1 ≤ l_1 ≤ N ∧ 2 ≤ l_2 ≤ N + 1}
      CPRef    = {[l_1, l_2] → [b_1, b_2] : b_2 = l_1 ∧ b_1 = l_2 - 1}
      CPMap    = (Layout_B ∘ CPRef^{-1}) ∩_range loop
               = {[p] → [l_1, l_2] : 1 ≤ l_1 ≤ min(N, 100) ∧ max(2, 25p + 2) ≤ l_2 ≤ min(N + 1, 101, 25p + 26)}
Fig. 3.1. Construction of primitive sets and mappings for an example program. Align_A, Align_B, and Dist_T also include constraints for the array and template ranges, but these have been omitted here for brevity.
The construction of the Layout mapping follows the two steps used to describe an array layout in HPF [31], namely, alignment of the array with a template and distribution of the template on a physical processor array (the template and processor array are each represented by a separate tuple space). The on home CP notation and construction of CPMap are described in Section 4.
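To illustrate how these primitive sets and mappings combine (this is our own illustration in the notation above; the precise equations used inside dHPF are given in [1]), the quantities enumerated in Section 2.1 can be expressed as short set equations. For a statement S with computation partitioning CPMap(S) and a read reference R with access mapping Ref_R to an array A with layout Layout_A, a representative processor p must fetch:

      LocalIter(S, p)    = CPMap(S)(p)                            (iterations p executes)
      Data(R, p)         = Ref_R(LocalIter(S, p))                  (elements R touches on p)
      NonLocal(R, p)     = Data(R, p) - Layout_A(p)                (elements to communicate)
      NonLocalIter(R, p) = LocalIter(S, p) ∩ Ref_R^{-1}(NonLocal(R, p))

Each right-hand side is an image, composition, difference, or intersection of the primitive sets and mappings, which is why a general-purpose integer set package suffices to compute all of them.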
3.2 Implementation of the Framework

Expressing optimizations in terms of this framework requires an integer set package that supports all of the key set and map operations including intersection, union, difference, domain, range, composition of a map with another
map or set, and projection to eliminate a variable from a map or set. We use the Omega library developed by Pugh et al. at the University of Maryland for this purpose [27]. The library operations use powerful algorithms based on Fourier-Motzkin elimination for manipulating integer tuple sets represented by Presburger formulae [37]. In particular, the library provides two key capabilities: it supports a general class of integer set operations including set union, and it provides an algorithm to generate efficient code that enumerates points in a given sequence of iteration spaces associated with a sequence of statements in a loop [26]. (Appendix A describes this code generation capability.) These capabilities are an invaluable asset for implementing set-based versions of the core HPF compiler optimizations as well as enabling a variety of interesting new optimizations, described in later sections. One potentially significant disadvantage of using such a general representation is the compile-time cost of the algorithms used in Omega. In particular, simplification of formulae in Presburger arithmetic can be extremely costly in the worst case [36]. Pugh has shown, however, that when the underlying algorithms in Omega (for Fourier-Motzkin elimination) are applied to dependence analysis, the execution time is quite small even for complex constraints with coupled subscripts and also for synthetic problems known to cause poor performance [37]. These experimental results at least provide evidence that the basic techniques could be practical for use in a compiler. In dHPF, the Omega library has already proved to be a powerful tool for prototyping advanced optimizations based on the integer set framework. On small benchmarks, the compiler provides acceptably fast running times, for example, requiring about 3 minutes on a SparcStation-10 to compile the Spec92 benchmark Tomcatv with all optimizations. Further evidence for a variety of real applications will be required to judge whether or not this technology for implementing the integer set framework will prove practical for commercial data-parallel compilers. The dHPF implementation will provide a testbed for developing this evidence. If this approach does not prove practical, it is still possible that a simpler and more efficient underlying set representation could be used to support the same abstract formulation of optimizations, but with some loss of precision. Another significant and fundamental limitation is that Presburger arithmetic is undecidable in the presence of multiplication. For this reason, the Omega library provides only limited support for handling multiplication, and in particular, cannot represent sets with an unknown (i.e., symbolic) stride. Most importantly (from the viewpoint of HPF compilation), such strided sets are required for any HPF distribution when the number of processors is not known at compile time, and for a cyclic(k) distribution with unknown k. We have extended our framework to permit these parameters to be symbolic, as described below. Symbolic strides also arise for a loop with a non-constant stride or a subscript expression with a non-constant coefficient, although we expect these to be rare in practice. These are not supported by our
framework, and would have to fall back on more expensive run-time techniques such as a finite-state-machine approach for computing communication and iteration sets (for example, [28]), or an inspector-executor approach. To permit a symbolic number of processors or a cyclic(k) distribution with symbolic k, we use a virtual processor (VP) model that naturally matches the semantics of templates in HPF [22]. The VP model uses a virtual processor array for each physical processor array, using template indices (i.e., ignoring the distribute directive) in dimensions where the block size or number of processors is unknown, but using physical processor indices in all other dimensions. Using physical processor indices where possible facilitates better analysis and improves the efficiency of generated code. All of the analyses described in the following sections operate unchanged on physical or virtual processor domains. During code generation for each specific problem (e.g., generating a partitioned loop), we add extra enclosing loops that iterate over the VPs that are owned by the relevant physical processor (e.g., the representative processor myid). For each problem, we use an additional optimization step (consisting of a few extra integer set equations) to compute the precise set of iterations required for these extra loops, and therefore to minimize the runtime overhead in the resulting code. The details of our extensions to handle a symbolic number of processors are given in [1].
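As a concrete (and hypothetical) illustration of the extra enclosing loops described above, consider a one-dimensional array of n elements distributed cyclic(k) onto np processors, where both k and np are symbolic. Each template block of k elements acts as a virtual processor; physical processor myid owns blocks myid, myid+np, myid+2*np, and so on. A loop over iterations 2..n-1 partitioned by ownership might then be generated roughly as follows (declarations are omitted, global indices are used for clarity, and the statement body is a placeholder):

      do vp = myid, (n + k - 1)/k - 1, np
c        enumerate the virtual processors (template blocks) owned by myid
         lo = vp*k + 1
         hi = min(vp*k + k, n)
c        intersect the block with the original loop bounds 2..n-1
         do i = max(lo, 2), min(hi, n-1)
            A(i) = B(i)
         enddo
      enddo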
4. Computation Partitioning

A computation partitioning (CP) for a statement is a precise specification of which processor or processors must execute each dynamic instance of the statement. The CPs chosen by the compiler play a fundamental role in determining the performance of the resulting code. For the compiler to have the freedom to choose a partitioning well-suited to an application's needs, the communication analysis and code generation phases of the compiler must be able to support a flexible class of computation partitionings. In this section, we describe the computation partitioning model provided by dHPF and the code generation framework used to support the model. The communication analysis techniques supporting the model are described in Section 5.
4.1 Computation Partitioning Models

Most research and commercial compilers for HPF to date primarily use the owner-computes rule [39] to partition a computation. This rule specifies that each value assigned by a statement is computed by the owner (i.e., the "home") of the location being assigned the value, e.g., the left hand side (LHS) in an assignment. The owner-computes rule amounts to a simple heuristic choice for partitioning a computation. It is straightforward to show that this approach is not optimal in general [12]. An alternate partitioning
strategy used by SUIF [2] and Barua, Kranz & Agarwal [6] requires a CP to be described by a single affine mapping of iterations to processors, and assigns a single CP to an entire loop iteration and not to individual statements in a loop. This strategy is also not optimal, because (in general) it does not permit optimal CPs to be chosen separately for different statements in a loop. A major goal of the dHPF compiler is to support more general computation partitionings. Doing so requires new support from the compiler's communication analysis and code-generation phases. Previous compilers based on the owner-computes rule have benefited from two key simplifying assumptions: (1) communication patterns are defined by a single pair of LHS and right-hand-side (RHS) references, and (2) all communication is caused by reads of non-local data. The SUIF partitioning model also has the benefit that each communication is defined by a single reference and a single CP mapping (before coalescing communication), although write references can cause communication too. This model has the additional benefit that code generation is greatly simplified by having a common CP for all statements in a loop (which will become clear from the discussion in Section 4.2). None of these simplifying assumptions are true for the more general partitioning model used in dHPF.

4.1.1 The Computation Partitioning Model in dHPF. The computation partitioning model supported by dHPF combines and generalizes the features of both previous CP models described above (the owner-computes rule and the SUIF model). Below we describe the key features of the dHPF CP model, including the implicit CP representation used by early phases in the compilation and conversion to an explicit CP representation required for communication analysis and code generation. In Section 4.2.1, we discuss code generation for these general CPs and in Section 5 we discuss the role of CPs in communication analysis. In dHPF, a computation partitioning for a statement can be specified as the set of owners of the locations accessed by one or more arbitrary data references. Every statement (including control flow statements) can be assigned a partitioning independent of every other statement, restricted only to preserving the semantics of the source program. For a statement S enclosed in a loop nest with iteration space i, the CP of S is specified by a union of one or more on home terms:
      CP(S) :  ∪_{k=1}^{n}  on home A_k(f_k(i))                         (4.1)
An individual term, on home A_k(f_k(i)), specifies that the instance of S in iteration i is to be executed by the processor(s) that own the array element(s) A_k(f_k(i)). This set of processors is uniquely specified by the subscript vector f_k(i)
and the layout of array A_k at that point in the execution of the program.³ This implicit representation of a computation partitioning supports arbitrary index expressions or any set of values in each index position in f_k(i). A data reference A_k(f_k(i)) in an on home clause need not be a reference existing in the program. Even the variable A_k and its corresponding data layout may be synthesized for representing a desired CP, though our implementation is restricted to legal HPF layouts. With this representation, the CP model permits specification of a very wide class of partitionings. While early analysis phases in the dHPF compiler use this implicit CP representation, communication analysis and code generation require that the CP for each statement is converted into an explicit mapping of type CPMap defined in Section 3.1. The integer set framework is used to construct this explicit mapping. This construction requires that each subscript expression in f_k(i) be an affine expression of the index variables i with known constant coefficients, or a strided range specifiable by a triplet lb:ub:step with known constant step. We construct the explicit integer tuple mapping representing the CP for a statement as follows:
      CPMap(S) = ∪_{k=1}^{n} (Layout_{A_k} ∘ Ref_k^{-1}) ∩_range loop          (4.2)
For each term on home A_k(f_k(i)) in CP(S), the composition of its layout and inverse reference maps results in a new map that specifies all possible iterations assigned to each processor by this CP term. We restrict the range of this map to the iteration space given by loop. Taking the union over all CP terms gives the full mapping of iterations to processors for statement S. CPMap(S) specifies the processor assignment for the single instance of statement S in loop iteration i. Figure 3.1 shows a simple example of the construction of CPMap. The mapping can be vectorized over the range of iterations of one or more enclosing loops to represent the combined processor assignment for the set of statement instances in those loop iterations. Careful assignment of CPs to control-flow related statements (namely, DO, IF, and GOTO statements, as well as labeled branch targets) is necessary to preserve the semantics of the source program. In particular, a legal partitioning must ensure that each statement in the program is reached by a superset of the processors that need to participate in its execution, as specified by its CP. The code generation phase will then ensure that the statement is executed by exactly the processors specified by the CP. The algorithms dHPF uses to select computation partitions and ensure legality are beyond the scope of this paper.
³ In the presence of dynamic REALIGN and REDISTRIBUTE directives, we assume that only a single known layout is possible for each reference in the program. Multiple reaching layouts would require generating multi-version code or assuming that the layout is unknown until run time (as done for inherited layouts).
In Section 4.2.1, we discuss the interaction between correctness constraints on CP assignments for control-flow related statements and code generation. To support dHPF's general computation partitioning model, the communication analysis and code-generation phases in the compiler must fully support any legal partitioning. Supporting this partitioning model would be impractical using a case-based approach; the dHPF compiler's representation of computation partitionings using an abstract integer set framework has proven essential for making the required analysis and code generation capabilities practical.
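As an illustration of the vectorization mentioned above (our own worked example, derived from the mappings in Figure 3.1, not an equation taken from dHPF), vectorizing CPMap over the inner j loop amounts to projecting away l_2 while requiring that some l_2 satisfies the constraints:

      VCPMap = {[p] → [l_1] : ∃ l_2 . [p] → [l_1, l_2] ∈ CPMap}
             = {[p] → [l_1] : 1 ≤ l_1 ≤ min(N, 100) ∧ 25p + 1 ≤ N}

(together with the processor-range constraints 0 ≤ p ≤ 3 that are omitted here, as in Figure 3.1). This maps each processor to the outer-loop iterations in which it executes at least one instance of the enclosed statement; the residual constraint 25p + 1 ≤ N simply says that processor p owns a nonempty part of the accessed section of B.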
4.2 Code Generation to Realize Computation Partitions

A general CP model such as that in dHPF poses several challenges for static code generation. First, the code generator must ensure correctness in the presence of arbitrary structured and unstructured control flow, without sacrificing available parallelism. Second, generating efficient parallel code for a loop nest containing multiple statements that potentially have different iteration spaces is an intrinsically difficult problem. Previous compilers use simple approaches for code generation and do not solve this problem in its general form (as described briefly below), but Kelly, Pugh & Rosser have developed an aggressive algorithm for "multiple-mappings code generation" which directly tackles this problem [26]. A third and related difficulty, however, is that good algorithms for generating efficient code (like that of Kelly, Pugh & Rosser) will be inherently expensive because of the potential complexity of the iteration spaces and the resulting code. Ensuring reasonable compile times requires an effective strategy to control the compile-time cost of applying such an algorithm while still producing as high quality code as possible. A fourth problem (not directly related to a general CP model) is that static code generation techniques will not be useful for code with irregular or complex partitionings. Such cases require runtime strategies such as the inspector-executor approach (e.g., [32, 18, 40]). However, regular and irregular partitionings may coexist in the same program, and perhaps even in a single loop nest. This raises the need for a flexible code generation framework that allows each part of the source program to be partitioned using the most efficient strategy applicable. Before describing the techniques used in dHPF to address these challenges, we briefly describe the strategies used to realize computation partitions in current compilers, and the limitations of these strategies in addressing these challenges. We begin with the second of the four problems described above because the approaches to addressing this problem have implications for the other problems as well. As in the rest of the paper, we focus on compile-time techniques that are applicable when the compiler can compute static (symbolic) mappings between processors and the data they own or communicate. If a static mapping is not computable at compile time, alternative
strategies that can be used include run-time resolution [39], the inspector-executor approach, and run-time techniques for handling cyclic(k) partitionings. The latter two approaches are described in other chapters within this volume [40, 38]. It is relatively straightforward to partition a simple loop nest that contains a single statement or a sequence of statements with the same CP. The loop bounds can be reduced so that each processor will execute a smaller iteration space that contains only the statement instances the processor needs to execute. For loops containing multiple statements with different CPs, it is important that each processor execute as few guards as possible to determine which statement instances it must execute in each iteration. Previous compilers, such as IBM's pHPF compiler [16], rely on loop distribution to construct separate loop nests containing statements with identical CPs, so as to avoid the need for run-time guards. There are two drawbacks to using loop distribution in this manner. First, loop distribution may be impossible because of cyclic data dependences in the loop. In such cases, compilers add statement guards to implement the CPs and, except for Paradigm, don't reduce loop bounds. The Paradigm compiler reduces loop bounds to the convex hull of the iteration spaces of statements inside the loop, in order to reduce the number of guards executed [5]. Second, fragmenting a loop nest into a sequence of separate loops over individual statements can (a) significantly reduce reuse of cached values between statements, and (b) significantly increase the contribution of loop overhead to overall execution time. A loop fusion pass that attempts to recover cache locality and reduce loop overhead is possible, but complex. Both the IBM and the Portland Group HPF compilers use a loop fusion pass, but apply it only to the simplest cases, namely conformant loops [10, 16]. Kelly, Pugh & Rosser describe an aggressive algorithm to generate efficient code without relying on loop distribution, for a loop nest containing multiple statements with different iteration spaces [26]. Given a sequence of (possibly non-convex) iteration spaces, the algorithm, mmcodegen, synthesizes a code fragment to enumerate the points in the iteration spaces in lexicographic order, tiling the loops as necessary to lift guards out of one or more levels of loops. Thus, the algorithm provides one of the key capabilities required to support our general CP model. The algorithm is briefly described in Appendix A along with an example that highlights its capabilities. As mentioned earlier, one potential drawback of an algorithm like mmcodegen is that it can be costly for large loop nests with multiple iteration spaces. Because of the potential for achieving high performance, however, we use the algorithm in dHPF as one of the core techniques for supporting the general CP model, and we develop a strategy to control compile-time cost while still producing high quality code. To our knowledge, this is the first use of their algorithm for code generation in a data-parallel compiler, and therefore these issues have not been addressed previously.
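The difference between guard-based partitioning (run-time resolution) and the bounds reduction discussed above can be seen in a small hand-written sketch (our illustration, not output of any of the compilers cited; owner, lb, and ub are hypothetical helpers giving the owner of an element and the bounds of myid's block for a one-dimensional BLOCK distribution, and the assignment is a placeholder):

c     run-time resolution: every processor scans the full iteration space
      do i = 1, n
         if (owner(i) .eq. myid) then
            A(i) = B(i) + C(i)
         endif
      enddo

c     bounds reduction: each processor visits only the iterations it owns
      do i = lb(myid), ub(myid)
         A(i) = B(i) + C(i)
      enddo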
The techniques used in previous compilers to partition code in the presence of control flow are not clearly described in the written literature. Some compilers simply replicate the execution of control-flow statements on all processors, which is a simple method to ensure correctness but can substantially reduce parallelism in loops containing control flow [19]. Many other compilers ignore control flow because loop bounds reduction and ownership guards will enforce the appropriate CP for enclosed statements. However, this approach sacrifices potential parallelism for the sake of simplifying code generation. In particular, by not enforcing an explicit CP for a block-structured IF statement, all processors that enter the scope enclosing the IF will execute the test, enter the appropriate branch, and execute CP guards for the statements inside, even though some of the processors do not need to participate in the execution of the enclosed statements at all.

4.2.1 Realizing Computation Partitions in dHPF. We have developed a hierarchical code generation framework to realize the general class of partitionings that dHPF supports. Our approach is hierarchical because it operates on nested block-structured scopes, one scope at a time. (A scope refers to a sequence of statements immediately enclosed within a procedure, a DO loop, or a single branch of an IF statement.) Key benefits of the hierarchical code generation framework are (1) it supports partitioning of scopes that can contain arbitrary control flow, (2) it supports multiple code generation strategies for individual scopes so that each scope can be partitioned with the most appropriate strategy, and (3) it uses a two-pass strategy that is effective at minimizing guard overhead without sacrificing compile-time efficiency. Below, we first describe the hierarchical code generation framework, and then describe strategies used in dHPF to generate code for a single scope within this framework.

The hierarchical code generation framework. Briefly, the code generation framework in dHPF operates as follows. Each scope in the program is handled independently. The code generation operates one scope at a time, visiting scopes bottom-up (i.e., innermost scopes first). Although the framework operates one scope at a time, any particular strategy can partition multiple scopes in a single step if desired. At each scope that has not yet been partitioned, the framework attempts to apply a sequence of successively more general strategies until one successfully performs the partitioning for the scope. Control flow in the program is handled as follows. This discussion assumes that a pre-processing phase has transformed all DO loops into DO/ENDDO form and relocated all branch target labels to CONTINUE statements. The compiler computes partitionings for control-flow statements so that all processors that need to execute any statement reach it, but as few processors execute the control-flow statements as possible, in order to maximize parallelism. Informally, the computation partitioning for an IF statement or a DO loop must involve all of the processors that need to execute any statement that is transitively control-dependent on it. For an IF statement, the union of the CPs
of its control dependents gives a suitable CP. For a DO loop, the union of the CPs of its control dependents gives a CP suitable for a single iteration; a suitable CP for the entire compound DO statement can be computed by vectorizing the iteration CP over the loop range. A GOTO statement is assigned the union of the CPs of all control-flow statements on which it is immediately control-dependent. Note that this will be a superset of the CPs of the statements following the branch target, i.e., any processors that need to execute the target statements will reach those statements (if the condition controlling the branch is satisfied). Finally, a branch target label is assigned the CP of the immediately enclosing scope, for reasons discussed below. In order to ensure correctness while preserving maximum parallelism in the presence of control flow, the code generation framework simply has to ensure that the above assignment of CPs to control-flow related statements is correctly preserved. In particular, any particular strategy used to partition a scope must correctly enforce the CPs of all statements within the scope. The only subtlety is that reducing the bounds of a DO loop does not enforce the CP of the compound DO loop statement itself; that must be enforced when partitioning the scope enclosing the DO loop (otherwise, extra processors may evaluate the bounds of the DO loop). The above handling of partitioned control-flow statements yields a key simplification of the framework, namely, that each block-structured scope can be handled independently, even in the presence of arbitrary control flow such as a GOTO that transfers control from one scope to another. In particular, the CPs assigned to individual GOTO statements are simply enforced during code generation, independent of the location and CPs of their branch targets. A significant difficulty that must be addressed is ensuring that GOTOs are matched with the correct branch targets (and labels). This can be difficult because statements in different scopes and even statements in the same scope may be cloned into different numbers of copies. (Statements can be cloned by the tiling performed by mmcodegen to minimize guards, as shown in the example in Appendix A.) We ensure that branches and branch targets are matched as follows. Fortran semantics dictate that a GOTO cannot branch from an outer to an inner scope; a GOTO can only branch within the same scope or to an outer scope. The CP assigned to a GOTO may cause it to be cloned into multiple copies. By assigning a labeled statement the CP of its enclosing scope, we ensure that a label appears in every instance of that scope (in particular, in a DO scope, every iteration of the enclosing scope will include the labeled CONTINUE). Since every GOTO branching to this label must come from the same or an inner scope, every cloned copy of the GOTO will be matched with exactly one copy of the labeled CONTINUE. Furthermore, the GOTOs that must match a particular copy of the label are exactly those that appear within the same instance of the scope enclosing the label. This allows us to
renumber the label definitions and any matching label references. The details of the renumbering scheme are described elsewhere [1]. The second key issue we address with the framework is controlling the compile-time cost of using expensive algorithms such as mmcodegen, while still ensuring that guard overhead is minimized. There are two features in the framework that ensure efficiency. First, we take advantage of the independent handling of scopes to apply mmcodegen independently, one loop or one perfect loop nest at a time. This ensures that the iteration spaces in each invocation of mmcodegen are as simple as possible (though there can still be multiple different iteration spaces). Second, the use of a bottom-up approach greatly reduces the number of times mmcodegen is invoked, compared to a top-down approach. The drawback, however, is that the top-down approach could yield much more efficient code. This tradeoff between the bottom-up and top-down approaches arises as follows. When generating code for a DO scope, the loop's iteration space is often split into multiple sections to enable guards to be lifted out of the loop. If we used a top-down scope traversal order for generating code, information about the bounds of the different sections could be passed downward during code generation and exploited in inner scopes. However, a top-down strategy would require many more applications of the partitioning algorithm than a bottom-up strategy. For example, for a triply nested loop in which each loop will be split into two sections by code generation, a top-down strategy would invoke the loop partitioning algorithm seven times. A bottom-up strategy would invoke it only three times. Because of the potentially high compile-time cost of the former, we use a bottom-up code generation strategy. We use two techniques to ensure the quality of the generated code despite the trade-offs made above. First, an important optimization we apply when generating code for individual scopes is to exploit as much known contextual information as possible about the enclosing scopes. Second, we use a powerful, global control-flow simplification phase as the last step of code generation, which further simplifies the control flow in the resulting program. The control-flow simplification algorithm is described in Section 6. The use of available contextual information during the bottom-up strategy is described below. Together, these achieve much or all of the benefit of a top-down code generation strategy in which full context information is available to code generation in inner scopes, but at a fraction of the cost. When generating code for a scope in the bottom-up strategy, we can assume that code generation in the enclosing scope will ensure that only the correct processors enter the current scope. For example, consider the loop nest in Fig. 4.1. Statements S1, S2, and S3 each have a simple computation partition consisting of a single on home clause. LoopCP_j(i) represents the CP for the j loop, which consists of the union of the CPs of the statements in its scope, vectorized across the range of the j loop. Inside the j loop, we assume that the constraints in LoopCP_j(i) hold because these constraints
will have been enforced when partitioning the enclosing i loop. Any code generation strategy used for the inner scope can exploit this information. Similarly, LoopCP_i() represents the CP for the i loop, which consists of the union of the CPs of the statements it contains, vectorized across the range of the i loop. We assume that the constraints in LoopCP_i() are true when generating code for the i loop.

      do i = 1, N            LoopCP_i() = CP_1(1:N) ∪ LoopCP_j(1:N)
        S1(i)                CP_1(i)   = { [i] : proc. myid owns A_1(f_1(i)) }
        do j = 1, M          LoopCP_j(i) = CP_2(i,1:M) ∪ CP_3(i,1:M)
          S2(i,j)            CP_2(i,j) = { [i,j] : proc. myid owns A_2(f_2(i,j)) }
          S3(i,j)            CP_3(i,j) = { [i,j] : proc. myid owns A_3(f_3(i,j)) }
        enddo
      enddo

Fig. 4.1. Example showing iteration sets constructed for code generation.

Realizing a CP for a single scope. The first step in the process for partitioning a scope is to separate the statements in the scope into statement groups, which are sequences of adjacent statements that have homogeneous computation partitions. Second, we use equation 4.2 to construct the explicit representation of the iteration space for each statement group according to its computation partitioning. Third, we use some available strategy to partition the computation in the scope. When multiple strategies are available, we currently apply them in a fixed sequence, stopping when one succeeds in partitioning the scope. This permits a fixed series of strategies to be tried (typically attempting specific optimizations and, if these fail, then finally applying some general strategy). We currently support two strategies for partitioning a loop: bounds reduction, or loop-splitting combined with bounds reduction for the individual loop sections generated by splitting (the latter is described in Section 5.3). For non-loop scopes (conditional branches and the outermost routine level), we also use bounds reduction, which reduces to inserting guards on the relevant statements. Two alternatives applicable to loops or statement groups with irregular data layouts or irregular references are under development, namely, runtime resolution and an inspector-executor strategy. To perform bounds reduction as part of the above strategies, we apply Kelly, Pugh & Rosser's mmcodegen algorithm to the sequence of iteration spaces for the statement groups in a scope. Applying mmcodegen to the iteration spaces for statement groups reduces loop bounds as needed and lifts guards out of inner loops when statement groups with non-overlapping iteration spaces exist. This results in a code template with placeholders representing the statement groups. Finally, we replace each of the placeholders in the code template by a copy of the code for the corresponding statement group. When labels are present in the code for the statement groups, we renumber the labels to ensure unique numbers as discussed earlier.
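The effect of applying mmcodegen to the statement groups of one loop can be illustrated with a hand-written sketch (ours, not actual dHPF output). Suppose a loop contains two statement groups whose local iteration ranges on myid are [lb1:ub1] and [lb2:ub2], and assume for illustration that lb1 <= lb2 <= ub1 <= ub2 (group1 and group2 are hypothetical placeholders for the statement groups). Rather than guarding each statement in every iteration, the loop is split into sections in which the set of active statement groups is invariant:

c     section executing only the first statement group
      do i = lb1, lb2 - 1
         call group1(i)
      enddo
c     section where the two iteration sets overlap: no guards needed
      do i = lb2, ub1
         call group1(i)
         call group2(i)
      enddo
c     section executing only the second statement group
      do i = ub1 + 1, ub2
         call group2(i)
      enddo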
As an alternative to this base strategy for realizing the computation partition for a scope, Section 5.3 describes a loop-splitting transformation that may be applied during code generation to any perfect loop nest that has no carried dependences. From the perspective of computation partitioning code generation, this approach serves as an alternate partitioning method that subdivides the iteration space for a DO into a sequence of iteration spaces, and then generates code for each with the method described above for a single scope. The purpose of the splitting transformation is described in Section 5.3. Another, much more specialized, strategy we expect to add for loop nests is a code transformation for coarse-grain pipelining. The transformation simultaneously performs strip-mining of one or more non-partitioned loops and loop bounds reduction for partitioned loops. The last two optimizations (loop splitting and pipelining) illustrate that the hierarchical code generation framework provides a natural setting within which to perform any code transformation that has the side effect of producing partitioned code, i.e., realizing the CPs assigned to statements in a scope.
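To suggest the shape of code that coarse-grain pipelining produces, the following hand-written sketch strip-mines the non-partitioned j loop of a simple wavefront computation and reduces the bounds of the partitioned loop: each processor receives a strip of boundary values from its predecessor, computes, and forwards its own boundary strip to its successor. The distribution (BLOCK over the second array dimension), the strip size bs, and the ghost column are assumptions of the example, not the dHPF transformation itself.

    subroutine pipelined_sweep(a, m, nlocal, myid, nprocs, bs)
      use mpi
      integer, intent(in) :: m, nlocal, myid, nprocs, bs
      ! Local columns 1..nlocal of an array whose second dimension is
      ! BLOCK-distributed, plus a ghost column 0 holding the boundary column
      ! of the preceding processor (on processor 0 it holds the global
      ! boundary values supplied by the caller).
      real, intent(inout) :: a(m, 0:nlocal)
      integer :: i, j, jb, jend, ierr, status(MPI_STATUS_SIZE)

      do jb = 1, m, bs                   ! strip-mined, non-partitioned j loop
        jend = min(jb + bs - 1, m)
        if (myid .gt. 0) then            ! wait for the predecessor's boundary strip
          call MPI_Recv(a(jb, 0), jend - jb + 1, MPI_REAL, myid - 1, 0, &
                        MPI_COMM_WORLD, status, ierr)
        end if
        do i = 1, nlocal                 ! bounds-reduced, partitioned loop
          do j = jb, jend
            a(j, i) = 0.5 * (a(j, i) + a(j, i - 1))
          end do
        end do
        if (myid .lt. nprocs - 1) then   ! forward this processor's boundary strip
          call MPI_Send(a(jb, nlocal), jend - jb + 1, MPI_REAL, myid + 1, 0, &
                        MPI_COMM_WORLD, ierr)
        end if
      end do
    end subroutine pipelined_sweep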
5. Communication Code Generation

On message-passing systems, the most efficient communication is obtained when the compiler can statically compute the set of data that needs to be exchanged between processors to satisfy each non-local reference. For references with statically analyzable communication requirements, a data-parallel compiler must compute the set of non-local data to be communicated for each non-local reference, and then use these sets to generate efficient code to pack, communicate, unpack, and access the non-local data. In this section, we describe implementation techniques for several key communication optimizations used in the dHPF compiler to synthesize high-performance communication code for regular applications. Many of these techniques are based on the integer set framework. For references with unanalyzable communication requirements, typically due to non-affine subscripts, runtime techniques such as the inspector-executor model must be used to manage communication. For more information on such techniques, we refer the reader to Chapter 21 and the references therein.

The dHPF compiler includes a comprehensive set of communication optimizations that have been identified as important for high performance on message-passing systems. The benefits obtained from these optimizations (with very few exceptions) can vary widely between applications as well as between different systems. This implies that a compiler may have to incorporate many different optimizations to obtain consistently high performance across large classes of applications and systems. Previous commercial and research compilers, however, have generally implemented only a few of these techniques because of the significant implementation effort entailed in each
case. The important communication optimizations in the dHPF compiler include the following.
Optimizations to reduce message overhead

- Message vectorization moves communication out of loops in order to replace element-wise communication with fewer but larger messages (a sketch of the resulting code structure appears after this list). It is implemented by virtually all data-parallel compilers, but in case-based compilers it is usually restricted to specific reference patterns for which the compiler can derive (or conservatively approximate) the data sets to be communicated [16, 24, 32].
- Exploiting collective communication is essential for achieving good speedup in important cases such as reductions, broadcasts, and array redistribution [33]. On certain systems, collective communication primitives may also provide significant benefits for other patterns such as shift communication. The important patterns (particularly reductions and broadcasts) have been supported in most data-parallel compilers.
- Message coalescing combines messages for multiple non-local references to the same or different variables, in order to reduce the total number of messages and to eliminate redundant communication. Previous implementations in Fortran 77D [24], SUIF [2], Paradigm [5], and IBM's pHPF [11, 16] have some significant limitations. In particular, coalescing can produce fairly complex data sets from the union of data sets for individual references. The previous implementations are limited to cases where the combined data sets are representable with (or can be approximated by) regular sections (in Fortran 77D, Paradigm, and pHPF) or a single collection of inequalities (in SUIF).
- Coarse-grain pipelining trades off parallelism to reduce communication overhead in loop nests with loop-carried, cross-processor data dependences. It is an important optimization for effectively implementing parallelism in such loop nests because the only alternative may be to perform a full array redistribution, which can be much more expensive. To our knowledge, this optimization has been implemented in a few research compilers [24, 5] and one commercial one [16].
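The following hand-written sketch contrasts element-wise communication with a vectorized message for a simple shift reference, assuming a one-dimensional BLOCK distribution over the second array dimension and a ghost column for the non-local data; it illustrates the generated-code pattern and is not output of any particular compiler.

    subroutine vectorized_shift(a, b, n, nlocal, myid, nprocs)
      use mpi
      integer, intent(in) :: n, nlocal, myid, nprocs
      ! Local columns 1..nlocal of a BLOCK-distributed array, plus a ghost
      ! column 0 holding the neighbor's boundary column for the a(i, j-1)
      ! reference.
      real, intent(inout) :: a(n, 0:nlocal)
      real, intent(out)   :: b(n, nlocal)
      integer :: i, j, ierr, status(MPI_STATUS_SIZE)

      ! Without vectorization, the n boundary elements needed at j = 1 would
      ! be exchanged one at a time inside the loop nest. Vectorizing the
      ! message hoists the communication out of the loops and moves the whole
      ! boundary column in a single message.
      if (myid .lt. nprocs - 1) then
        call MPI_Send(a(1, nlocal), n, MPI_REAL, myid + 1, 0, MPI_COMM_WORLD, ierr)
      end if
      if (myid .gt. 0) then
        call MPI_Recv(a(1, 0), n, MPI_REAL, myid - 1, 0, MPI_COMM_WORLD, status, ierr)
      end if

      do j = 1, nlocal
        do i = 1, n
          b(i, j) = a(i, j) + a(i, j - 1)   ! column 0 was filled by the message
        end do
      end do
    end subroutine vectorized_shift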
Optimizations to overlap communication with computation

- Dataflow-based communication placement attempts to hide the latency of communication by placing message sends early and receives late so as to overlap messages with unrelated computation. A few compilers, including IBM's pHPF, SUIF, Paradigm, and Fortran 77D, have used dataflow techniques to overlap communication in this manner.
- Communication overlap via non-local index set splitting attempts to overlap communication from a given loop nest with the local iterations of the same loop nest. This overlap generally cannot be achieved by the above dataflow placement techniques. Non-local index set splitting (or loop splitting) separates iterations that access non-local data from those that access
only local data. Communication can be overlapped with the local iterations by first executing send operations for the non-local data required in the loop, then the local iterations, then the receives, and finally the non-local iterations (a sketch of this schedule appears after this list). Loop splitting was implemented in Kali [32], albeit with significant limitations as described in Section 5.3.
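The sketch below illustrates the send / compute-local / receive / compute-non-local schedule that non-local index set splitting enables, using a non-blocking send so the boundary exchange proceeds while the purely local iterations execute. It is a hand-written example under the same BLOCK-distribution and ghost-column assumptions as the earlier sketches, not the code dHPF emits.

    subroutine split_overlap(a, b, n, nlocal, myid, nprocs)
      use mpi
      integer, intent(in) :: n, nlocal, myid, nprocs
      real, intent(inout) :: a(n, 0:nlocal)    ! ghost column 0 from left neighbor
      real, intent(out)   :: b(n, nlocal)
      integer :: i, j, ierr, req, status(MPI_STATUS_SIZE)

      req = MPI_REQUEST_NULL
      ! 1. Initiate the send of this processor's boundary column (non-blocking).
      if (myid .lt. nprocs - 1) then
        call MPI_Isend(a(1, nlocal), n, MPI_REAL, myid + 1, 0, &
                       MPI_COMM_WORLD, req, ierr)
      end if

      ! 2. Local iterations: columns 2..nlocal reference only local data, so
      !    they execute while the message is in flight.
      do j = 2, nlocal
        do i = 1, n
          b(i, j) = a(i, j) + a(i, j - 1)
        end do
      end do

      ! 3. Receive the non-local data needed by the remaining iterations.
      if (myid .gt. 0) then
        call MPI_Recv(a(1, 0), n, MPI_REAL, myid - 1, 0, MPI_COMM_WORLD, &
                      status, ierr)
        ! 4. Non-local iterations: column 1 references only the ghost column,
        !    so no ownership tests are needed. (Processor 0 has no non-local
        !    iterations for this reference.)
        do i = 1, n
          b(i, 1) = a(i, 1) + a(i, 0)
        end do
      end if

      call MPI_Wait(req, status, ierr)   ! complete the outstanding send
    end subroutine split_overlap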
Optimizations to minimize data buffering and access costs

- Minimizing buffer copying overhead is essential to minimize the overall cost of communication. This can be achieved in multiple ways. First, in most message-passing implementations, when the data to be sent or received is contiguous in memory, it can be communicated "in-place" rather than copied to or from message buffers. Second, asynchronous send and receive primitives can be used to avoid additional buffer copies between user and system buffers by making user-level buffers available for the duration of communication. Third, non-local data received into a buffer can in some cases be directly referenced out of the buffer (if the indexing functions can be generated by the compiler), thus avoiding an unpacking operation. All of these techniques appear to be widely used in data-parallel compilers, though the effectiveness of the implementations may vary.
- Minimizing buffer access checks via non-local index set splitting. Access checks (i.e., ownership tests) are required when the same reference may access local data from an array or non-local data from a separate buffer on different loop iterations. Loop splitting separates out the local iterations (which are guaranteed to access local data) from the non-local ones. Even the latter may not need access checks if all non-local references now access only non-local data. The alternative to this transformation is to copy local and non-local data into a common buffer (as done in the IBM pHPF compiler [16]), which can be costly in time and memory usage. As mentioned above, non-local index set splitting was implemented in a limited form in Kali.
- Overlap areas for shift communication are extra boundary elements added to the local sections of arrays involved in shift communication [14]. They permit local and non-local data to be referenced uniformly, thus avoiding the need for the access checks (or alternatives) mentioned above. Generally, interprocedural analysis must be used to determine the size of the required overlap areas globally for each array. Simpler implementations may waste significant memory and may have to be controlled by the programmer. Overlap areas have been implemented in several research and commercial compilers.

The Rice dHPF compiler implements all of the above optimizations except coarse-grain pipelining and the use of asynchronous message primitives (both of which are currently being implemented). Some other specific communication optimizations have been implemented in other compilers but are not included in dHPF. IBM's pHPF coalesces diagonal shift communication into
two messages [16], whereas dHPF requires three. This is useful, for example, in stencil computations that access diagonal neighbors, such as a nine-point stencil over a two-dimensional array. Chakrabarti et al. describe a powerful communication placement algorithm that can be used to maximize opportunities for message coalescing or to balance message coalescing with communication overlap [11]. SUIF uses array dataflow analysis to communicate data directly from a processor executing a non-local write to the next processor executing a non-local read [2], whereas dHPF must use an extra message to send the data first to the owner and from there to the reader. The former two optimizations could be directly added to the current implementation of dHPF. The SUIF model is a different and significantly more complex communication model compared to that used in dHPF, and there is little evidence available so far to evaluate whether the additional complexity is justified for message-passing systems.

One reason that it has been practical to implement a fairly large collection of advanced optimizations in dHPF is our use of the integer set framework. By formulating optimizations abstractly in terms of integer set operations, we have obtained simple, concise, and general implementations of some of the most important phases of the compiler (such as communication code generation) as well as of complex optimizations like loop splitting. These implementations apply broadly to arbitrary combinations of affine references, data distributions, and computation partitionings, because the analysis does not depend on the specific forms of these parameters. In the remainder of this section, we briefly describe the implementation of communication optimizations that use the integer set framework. These include our entire communication generation phase, which incorporates message vectorization and coalescing, the two optimizations based on non-local index set splitting, and an algorithm for recognizing in-place communication. A control-flow simplification algorithm, which is also implemented using integer sets, is described in Section 6.
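The flavor of such formulations can be conveyed by a simple illustrative equation (not dHPF's actual equations, which are given in [1]): for a reference R with affine subscript function f_R in a statement whose computation partition assigns iteration set CP(m) to processor m, the non-local data m must obtain for R is the data it accesses minus the data it owns,

    NonLocal_R(m) = { a : exists i in CP(m) with a = f_R(i) }  \  { a : proc. m owns A(a) }

Because both operands are integer sets defined by affine constraints, the difference (and any subsequent unions across coalesced references) can be computed symbolically for any combination of affine reference, data layout, and computation partitioning, which is what makes the formulation independent of specific reference or distribution patterns.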
5.1 Communication generation with message vectorization and coalescing
The communication insertion steps in dHPF can be classified into two phases: a preliminary decision-making phase that identifies and places the required communication in the program, and a communication generation phase that computes the sets of processors and data involved in each communication, and synthesizes code to carry out the communication. In this paper, we primarily focus on the communication generation phase, which is based on the integer set framework. We briefly describe the decisions made in the former phase, since these feed directly into communication generation.

The preliminary communication analysis steps in dHPF determine (a) which references are potentially "non-local", i.e., might access non-local data, (b) where to place communication for each reference, (c) whether to use a
collective communication primitive in each case, and (d) when communication for multiple references can be combined. The first step is a very simple analysis to filter out references that can easily be proven to access only local data. The second step uses a combination of dependence and dataflow analysis to choose the placement of communication, so as to determine how far each message can be vectorized out of enclosing loops, and to optionally move communication calls early or late to hide communication latency [30]. The third step uses the algorithms of Li and Chen [33] to determine whether specialized collective communication primitives such as a broadcast could be exploited. (Reductions are recognized using separate algorithms.) Otherwise, the compiler directly implements the communication using pairwise point-to-point communication. The fourth step chooses references whose communication can be combined. Any two references whose communication involves one or more common pairs of processors can be coalesced in our implementation. In practice, however, it is usually not beneficial to combine references that should use different communication primitives, such as a broadcast with any pairwise point-to-point communication. (One instance where it is profitable is combining a reduction and a broadcast by using a special reduction primitive like MPI_AllReduce, which leaves every processor involved with a copy of the result.) We refer to the entire collection of messages required for a set of coalesced references as a single logical communication event.

The code generation phase must then use the results of the previous phases to synthesize vectorized and coalesced messages that implement the desired communication for each logical communication event. For each reference, the compiler first computes the set of data to send between pairs (or groups) of processors; these communication sets depend on the reference, layout, computation partitioning, and the loop level at which vectorized communication is to be performed. Message coalescing requires computing the union of the above communication sets for the coalesced references. We directly compute the communication sets for each communication event using a sequence of integer set operations, independent of the specific form of the reference, layout, and computation partitioning. We then generate code from these sets directly.

The integer set equations used to compute the communication sets for each logical communication event are described in detail elsewhere [1]. We briefly describe the key aspects of the algorithm here. The goal of the algorithm is to compute two separate maps for a fixed symbolic processor index m (where m is the index tuple for processor myid in the processor array to which the data is mapped, and myid is the representative processor index of the SPMD program):

    SendCommMap(m) = { [p] -> [a] : array elements a that proc. m must send to proc. p }
    RecvCommMap(m) = { [p] -> [a] : array elements a that proc. m must receive from proc. p }
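To suggest how such maps drive code generation, the sketch below shows a generic pack / send / receive / unpack structure one could emit once the per-partner data sets have been enumerated into explicit index lists. The helper arrays (send_cnt, send_idx, recv_cnt, recv_idx) and their construction are assumptions of the example, not dHPF's actual runtime structures; recv_idx is assumed to index overlap or buffer storage allocated within the local array.

    subroutine exchange(a, asize, nprocs, send_cnt, send_idx, &
                        recv_cnt, recv_idx, maxlen)
      use mpi
      integer, intent(in) :: asize, nprocs, maxlen
      ! send_cnt(p) / recv_cnt(p): number of elements exchanged with processor p;
      ! send_idx(k,p) / recv_idx(k,p): local indices of those elements, as
      ! enumerated from SendCommMap(m) and RecvCommMap(m) for the executing
      ! processor.
      integer, intent(in) :: send_cnt(0:nprocs-1), recv_cnt(0:nprocs-1)
      integer, intent(in) :: send_idx(maxlen, 0:nprocs-1)
      integer, intent(in) :: recv_idx(maxlen, 0:nprocs-1)
      real, intent(inout) :: a(asize)
      real    :: sendbuf(maxlen, 0:nprocs-1), recvbuf(maxlen)
      integer :: p, k, ierr, nreq, reqs(nprocs)
      integer :: status(MPI_STATUS_SIZE), statuses(MPI_STATUS_SIZE, nprocs)

      nreq = 0
      do p = 0, nprocs - 1                 ! pack and post all sends first
        if (send_cnt(p) .gt. 0) then
          do k = 1, send_cnt(p)
            sendbuf(k, p) = a(send_idx(k, p))
          end do
          nreq = nreq + 1
          call MPI_Isend(sendbuf(1, p), send_cnt(p), MPI_REAL, p, 0, &
                         MPI_COMM_WORLD, reqs(nreq), ierr)
        end if
      end do
      do p = 0, nprocs - 1                 ! receive and unpack each partner's data
        if (recv_cnt(p) .gt. 0) then
          call MPI_Recv(recvbuf, recv_cnt(p), MPI_REAL, p, 0, &
                        MPI_COMM_WORLD, status, ierr)
          do k = 1, recv_cnt(p)
            a(recv_idx(k, p)) = recvbuf(k)
          end do
        end if
      end do
      call MPI_Waitall(nreq, reqs, statuses, ierr)   ! complete outstanding sends
    end subroutine exchange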
    DataAccessedMap = { [p1,p2]->[b1,b2] : max(1, 20*p1)