Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines
M. Kandemir
R. Bordawekar
A. Choudhary
Abstract

This paper describes optimization techniques for translating out-of-core programs written in a data parallel language like HPF to message passing node programs with explicit parallel I/O. We first discuss how an out-of-core program can be translated by extending the method used for translating in-core programs. We demonstrate that a straightforward extension of in-core compilation techniques does not work well for out-of-core programs. We then describe how the compiler can optimize the code by (1) determining appropriate file layouts for out-of-core arrays, (2) permuting the loops in the nest(s) to allow efficient file reading/writing, and (3) partitioning the available in-core node memory among references based on I/O cost estimation. Both our analytical approach and experimental results indicate that these optimizations can reduce the amount of time spent in I/O by as much as an order of magnitude. Performance results on an IBM SP-2 are presented and discussed.
1 Introduction

The use of massively parallel machines to solve large scale computational problems in physics, chemistry, medicine and other sciences has increased considerably in recent times. This is primarily due to the tremendous improvements in the computational speeds of parallel computers in the last few years. Many of these applications, also referred to as Grand Challenge Applications [11], have computational requirements which stretch the capabilities of even the fastest supercomputer available today.

(This work was supported in part by NSF Young Investigator Award CCR-9357840, NSF CCR-9509143, and in part by the Scalable I/O Initiative, contract number DABT63-94-C-0049 from the Defense Advanced Research Projects Agency (DARPA), administered by the US Army at Fort Huachuca.)

Author affiliations: M. Kandemir, CIS Dept., Syracuse University, Syracuse, NY 13244, e-mail: [email protected]; R. Bordawekar, CACR, Caltech, Pasadena, CA 91125, e-mail: [email protected]; A. Choudhary, ECE Dept., Northwestern University, Evanston, IL 60208-3118, e-mail: [email protected]
In addition to requiring a great deal of computational power, these applications usually deal with large quantities of data. At present, a typical Grand Challenge Application could require 1 GByte to 4 TBytes of data per run [8]. Main memories are not large enough to hold this much data, so the data needs to be stored on disks and fetched during the execution of the program. Unfortunately, the performance of the I/O subsystems of massively parallel computers has not kept pace with their processing and communication capabilities. Hence, the performance bottleneck is the time taken to perform disk I/O. An overview of the various issues involved in high performance I/O is given in [8].

Data parallel languages like HPF [13] have recently been developed to provide support for portable high performance programming on parallel machines. For these languages to be used for large scientific computations, support for performing large scale I/O from programs written in them is necessary. It is, therefore, essential to provide compiler support for these languages so that the programs can be translated automatically and efficiently.

In this paper we describe data access reorganization strategies for efficient compilation of out-of-core data parallel programs on distributed memory machines. In particular, we address the following issues: (1) how to estimate the I/O costs associated with different access patterns in out-of-core computations, (2) how to reorganize data storage on disks to reduce I/O costs, (3) how to reorganize computations based on the reorganized data, and (4) when multiple out-of-core arrays are involved in the computations, how to allocate memory to individual arrays to minimize I/O accesses. We demonstrate that these techniques can reduce I/O costs by as much as an order of magnitude compared to the costs of I/O when in-core compilation methods are naively extended to out-of-core computations.

The rest of the paper is organized as follows. Section 2 introduces our model and Section 3 explains the out-of-core compilation strategy. How the I/O optimizations can reduce the I/O cost of loop nests is discussed in Section 4. Experimental results are presented in Section 5. Section 6 discusses related work, and we conclude in Section 7.
2 Model for Out-of-Core Compilation

2.1 Programming Model

The most widely used programming model for large-scale scientific and engineering applications on distributed memory machines is the Single Program Multiple Data (SPMD) model. In this model, parallelism is achieved by partitioning data among processors. To achieve load balance, express locality of accesses, and reduce communication, several distribution and data alignment strategies are often used (e.g., block, cyclic, etc.). Many parallel programming languages or language extensions provide directives that enable the expression of mappings from the problem domain to the processing domain and allow the user to align and distribute arrays in the most appropriate fashion for the underlying computation. The compiler uses the information provided by these directives to compile global name space programs for distributed memory computers. Examples of parallel languages which support data distributions include Vienna Fortran [21], Fortran D [10] and High Performance Fortran (HPF) [13]. In this paper we describe the compilation of out-of-core HPF programs, but the discussion is applicable to any other data parallel language in general.

Explicit or implicit distribution of data to processors results in each processor having a local array associated with it. For large data sets, local arrays cannot entirely fit in local memory. In such cases, parts of the local arrays have to be stored on disk. We refer to such local arrays as Out-of-Core Local Arrays. Parts of the out-of-core local arrays need to be swapped between main memory and disk during the course of the computation.
Figure 1: Flow chart for out-of-core compilation: an HPF program passes through an in-core phase and an out-of-core phase to produce the node program.
2.2 Data Storage Model

The data storage model specifies how the out-of-core arrays are placed on disks and how they are accessed by the processors. The out-of-core local arrays of each processor are stored in separate files called Local Array Files. The local array files can be assumed to be owned by that processor. We assume that each processor has its own logical disk, with the local array files stored on that disk; the mapping of the logical disk to the physical disks is system dependent. If a processor needs data from any of the local array files of another processor, the required data is first read by the owner processor and then communicated to the requesting processor. Since data sharing is performed by explicit message passing, this system is a natural extension of the distributed-memory paradigm.
3 Compilation Methodology

In order to translate out-of-core programs, in addition to the steps used for in-core compilation, the compiler also has to schedule explicit I/O accesses to disks. The compiler has to take into account the data distribution on disks, the number of disks used for storing data, and the prefetching/caching strategies used. The portions of local arrays currently required for computation are fetched from disk into memory. These portions are called Data Tiles or simply Tiles. The larger the tiles, the better, since larger tiles reduce the number of disk accesses. Each processor performs computation on the data in its tiles. Notice that the node memory must somehow be divided among the tiles of the different out-of-core arrays. Thus, during the course of execution, a number of data tiles belonging to a number of different out-of-core local arrays are brought into memory, the new values for these data tiles are computed, and the tiles are stored back into the appropriate files, if necessary.

  (A) Example loop nest:

      DO i = LB_i, UB_i, S_i
        DO j = LB_j, UB_j, S_j
          computation
        ENDDO j
      ENDDO i

  (B) Resulting node program:

      DO IT = LB_IT, UB_IT, S_IT
        DO JT = LB_JT, UB_JT, S_JT
          read data tile(s) from local file(s)
          handle communication and storage of non-local data
          DO IE = LB_IE, UB_IE, S_IE
            DO JE = LB_JE, UB_JE, S_JE
              perform computation on data tile(s) to compute the new data values
            ENDDO JE
          ENDDO IE
          handle communication and storage of non-local data
          write data tile(s) into local file(s)
        ENDDO JT
      ENDDO IT

  (C) Simplified node program:

      DO IT = LB_IT, UB_IT, S_IT
        DO JT = LB_JT, UB_JT, S_JT
          read data tile(s) from local file(s)
          computation
          write data tile(s) into local file(s)
        ENDDO JT
      ENDDO IT

Figure 2: (A) Example loop nest (B) Resulting node program (C) Simplified node program

Figure 1 shows the various steps involved in translating an out-of-core HPF program. The compilation consists of two phases. It should be noted that we concentrate on the compilation of a single loop nest. In the first phase, called the in-core phase, the arrays in the source HPF program are partitioned according to the distribution information and lower/upper bounds for each local out-of-core file are computed. Array expressions are then analyzed for detecting communication. In other words, the compilation in this phase proceeds the way it would in an in-core HPF compiler. The second phase, called the out-of-core phase, involves adding appropriate statements to perform I/O and communication. The local arrays are first tiled according to the node memory available in each processor. The resulting tiles are analyzed for communication. The loops are then modified to insert the necessary I/O calls.

Consider the loop nest shown in Figure 2(A), where LB_k, UB_k and S_k are the lower bound, upper bound and step size, respectively, for loop k. This nest is translated by the compiler into the node program shown in Figure 2(B). In this translated code, loops IT and JT are called tiling loops, and loops IE and JE are called element loops. Note that communication is allowed only at tile boundaries (outside the element loops). In the rest of the paper, for the sake of clarity, we write this translated version as shown in Figure 2(C): all communication statements and element loops are omitted, and in the computation part each reference is replaced by its sub-matrix version; for example, a reference such as A(i,j) is replaced by A[IT,JT]. We will also show the bounds and the step sizes symbolically, as their actual values are not significant for what this paper focuses on.
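As a concrete illustration, the following is a minimal C sketch of the node-program pattern of Figure 2(C). It is an illustration only: it uses plain stdio calls rather than the run-time library calls an actual out-of-core compiler would emit, and the file name, the array extent N, the tile extent S and the placeholder computation are all assumptions made for this example.

    /* Sketch of the node-program pattern of Figure 2(C) in plain C.
       The out-of-core local array is an N x N column-major array of doubles
       kept in one local array file; one S x S tile is staged in memory
       at a time.                                                          */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024          /* local array extent (assumed)            */
    #define S 64            /* tile extent, chosen from node memory    */

    /* read or write the S x S tile whose top-left corner is (i0, j0):
       in a column-major file each tile column is one contiguous run    */
    static void tile_io(FILE *f, double *tile, long i0, long j0, int write)
    {
        for (long j = 0; j < S; j++) {
            long offset = ((j0 + j) * (long)N + i0) * (long)sizeof(double);
            fseek(f, offset, SEEK_SET);
            if (write) fwrite(tile + j * S, sizeof(double), S, f);
            else       fread (tile + j * S, sizeof(double), S, f);
        }
    }

    int main(void)
    {
        double *tile = malloc(S * S * sizeof(double));
        FILE *f = fopen("local_array_A.dat", "r+b");   /* hypothetical file */
        if (!f || !tile) return 1;

        for (long it = 0; it < N; it += S)          /* tiling loops            */
            for (long jt = 0; jt < N; jt += S) {
                tile_io(f, tile, it, jt, 0);        /* read data tile          */
                for (long k = 0; k < S * S; k++)    /* element loops (in core) */
                    tile[k] += 1.0;                 /* placeholder computation */
                tile_io(f, tile, it, jt, 1);        /* write data tile back    */
            }
        fclose(f); free(tile);
        return 0;
    }

Each tile column is fetched with one seek and one contiguous read; this per-tile access pattern is exactly what the cost analysis in the following sections estimates and optimizes.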
Since, in out-of-core computations, I/O time (cost) is the dominating term in the overall cost expression and node memory is at a premium, the proposed techniques decompose the node memory among competing out-of-core local arrays such that the total I/O time is minimized. It will be shown later through examples that partitioning the node memory equally among the local out-of-core arrays generally results in poor performance.

  (A) Example out-of-core loop nest:

      PARAMETER (n=..., p=...)
      REAL A(n,n), B(n,n), C(n,n)
!hpf$ PROCESSORS P(p)
!hpf$ TEMPLATE D(n)
!hpf$ DISTRIBUTE D(BLOCK) on P
!hpf$ ALIGN (:,*) WITH D :: A
!hpf$ ALIGN (*,:) WITH D :: B
      DO i = LB_i, UB_i, S_i
        DO j = LB_j, UB_j, S_j
          DO k = LB_k, UB_k, S_k
            DO l = LB_l, UB_l, S_l
              A(i,j) = A(i,j) + B(k,i) + C(l,k) + 1
            ENDDO l
          ENDDO k
        ENDDO j
      ENDDO i
      END

  (B) Node program resulting from straightforward compilation:

      DO IT = LB_IT, UB_IT, S_IT
        DO JT = LB_JT, UB_JT, S_JT
          read data tile for A
          DO KT = LB_KT, UB_KT, S_KT
            read data tile for B
            DO LT = LB_LT, UB_LT, S_LT
              read data tile for C
              A[IT,JT] = A[IT,JT] + B[KT,IT] + C[LT,KT] + 1
            ENDDO LT
          ENDDO KT
          write data tile for A
        ENDDO JT
      ENDDO IT

  (C) I/O-optimized node program:

      DO IT = LB_IT, UB_IT, S_IT
        read data tile for A
        read data tile for B
        DO LT = LB_LT, UB_LT, S_LT
          read data tile for C
          A[IT,1:n] = A[IT,1:n] + B[1:n,IT] + C[LT,1:n] + 1
        ENDDO LT
        write data tile for A
      ENDDO IT

Figure 3: (A) Example out-of-core loop nest (B) Node program resulting from straightforward compilation (C) I/O-optimized node program
4 I/O Optimizations

To illustrate the significance of our I/O optimizations, we first consider the example shown in Figure 3(A), assuming arrays A, B and C are out-of-core and reside on (logical) disk in column-major order. (In all HPF examples given in this paper, the compiler directives apply to data on files.)
The compilation is performed in two phases as described before. In the in-core phase, using the array distribution information, the compiler computes the local array bounds and partitions the computation. It then analyzes the array assignment statement to determine if communication is required. In the second phase, stripmining (tiling) of the out-of-core data is carried out using the information about the available node memory size. The I/O calls to fetch the necessary data tiles for A, B and C are inserted, and finally the resulting node program with explicit I/O and communication calls is generated. Figure 3(B) shows the simplified resulting node program (tiling loops) without any I/O optimization performed. Notice that this translation is a straightforward extension of the in-core compilation strategy.

Figure 4: Out-of-core local arrays and different tile allocations for the example shown in Figure 3. (A) Square S x S tiles for A, B and C. (B) Allocation under a fixed column-major layout: S x S tiles for A and B and an n x S tile for C. (C) Allocation under a fixed row-major layout: an S x n tile for A and S x S tiles for B and C. (D) Allocation produced by our approach: S x n tiles for A and C and an n x S tile for B.
Suppose that every processor has a memory of size M and that this memory is divided equally among all arrays. Also assume that the compiler works on square tiles of size S x S, as shown in Figure 4(A). The overall I/O cost (time) of the nest shown in Figure 3(B), T_overall, considering file reads only, can be calculated as follows:

  T_{overall} = \frac{n^2}{S^2 p}\left[ S C_{I/O} + S^2 t_{I/O} + \frac{n[S C_{I/O} + S^2 t_{I/O}]}{S} + \frac{n^2[S C_{I/O} + S^2 t_{I/O}]}{S^2} \right]
              = \underbrace{\frac{n^2 C_{I/O}}{Sp} + \frac{n^2 t_{I/O}}{p}}_{T_A}
              + \underbrace{\frac{n^3 C_{I/O}}{S^2 p} + \frac{n^3 t_{I/O}}{Sp}}_{T_B}
              + \underbrace{\frac{n^4 C_{I/O}}{S^3 p} + \frac{n^4 t_{I/O}}{S^2 p}}_{T_C}
under the memory constraint 3S^2 ≤ M. In this calculation, and in the rest of the paper, C_{I/O} is the startup cost of a file read (write), t_{I/O} is the cost of reading (writing) an element from (into) a file, and p is the number of processors. We assume that the cost of reading (writing) l consecutive elements from (into) a file can be approximated as T = C_{I/O} + l t_{I/O}. T_A, T_B and T_C denote the I/O costs for the arrays A, B and C, respectively.
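To make these expressions concrete, the following is a minimal C sketch of this cost model. It is an illustration under stated assumptions (sample values for n, S, C_I/O and t_I/O, and one I/O request per contiguous tile column or row), not the compiler's actual cost estimator.

    /* Sketch of the I/O cost model T = C_IO + l*t_IO per contiguous run.
       A tile of rows x cols elements is read from a file with the given
       layout; adjacent runs are not coalesced, mirroring the assumption
       that at most one full column (or row) is transferred per request. */
    #include <stdio.h>

    #define C_IO 1000.0     /* assumed startup cost of one request       */
    #define T_IO 1.0        /* assumed per-element transfer cost         */

    typedef enum { COLMAJOR, ROWMAJOR } layout_t;

    double tile_read_cost(long rows, long cols, layout_t layout)
    {
        /* column-major: each of the cols tile columns is one contiguous
           run of rows elements; the row-major case is symmetric.        */
        long calls   = (layout == COLMAJOR) ? cols : rows;
        long run_len = (layout == COLMAJOR) ? rows : cols;
        return calls * (C_IO + run_len * T_IO);
    }

    int main(void)
    {
        long n = 4096, S = 64;                  /* sample sizes (assumed) */
        printf("S x S tile, column-major file: %.0f\n", tile_read_cost(S, S, COLMAJOR));
        printf("n x S tile, column-major file: %.0f\n", tile_read_cost(n, S, COLMAJOR));
        printf("S x n tile, row-major file   : %.0f\n", tile_read_cost(S, n, ROWMAJOR));
        return 0;
    }

Reading an S x S tile and an n x S tile from a column-major file both take S requests, but the latter moves far more data per request; this is why the optimized versions below enlarge tiles along the dimension that is contiguous on disk.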
We believe that this straightforward translation can be improved substantially by being more careful about file layouts, loop ordering, and tile allocations. The proposed techniques transform the loop nest shown in Figure 3(B) into the nest shown in Figure 3(C); they associate a row-major file layout with arrays A and C and a column-major file layout with array B, and then allocate data tiles of size S x n for A and C and a data tile of size n x S for B, as shown in Figure 4(D). (We assume here that the trip counts are equal to the array sizes in the relevant dimensions; otherwise, either the trip count, if it is known at compile time, or an estimate of it based on the array size should be used instead of n. Notice also that since the compiler reads tiles of size S x n and n x S, the tiling loops JT and KT disappear.) The overall I/O cost of this new loop order and allocation scheme is

  T_{overall} = \frac{n}{Sp}\left[ S C_{I/O} + Sn t_{I/O} + S C_{I/O} + nS t_{I/O} + \frac{n[S C_{I/O} + Sn t_{I/O}]}{S} \right]
              = \underbrace{\frac{n C_{I/O}}{p} + \frac{n^2 t_{I/O}}{p}}_{T_A}
              + \underbrace{\frac{n C_{I/O}}{p} + \frac{n^2 t_{I/O}}{p}}_{T_B}
              + \underbrace{\frac{n^2 C_{I/O}}{Sp} + \frac{n^3 t_{I/O}}{Sp}}_{T_C}
provided that 3nS ≤ M. Notice that this cost is much better than that of the original. Note also that, in order to keep the calculations simple, we have assumed that at most n elements can be requested in a single I/O call. The rest of the paper explains how to obtain this I/O-optimized file layout, loop order and memory allocation, given a naive translation. Our approach consists of three steps:

- Determination of the most appropriate file layouts for all arrays referenced in the loop nest.
- Permutation of the loops in the nest in order to maximize spatial and temporal locality.
- Partitioning of the available memory among references based on I/O cost estimation.

Throughout the paper we assume that the file layout for any out-of-core array may be either row-major or column-major and that there is only one distinct reference per array. (To be precise, we are assuming only one Uniformly Generated Reference Set (UGRS) [9] per array; for our purposes, all references in a UGRS can be treated as a single reference.)

First we need a few definitions.

Definition: Assume a loop index IT, an array reference R with an associated file layout, and an array index position r. Also assume a data tile of size S in each dimension except the rth dimension, where its size is n = Θ(N) ≥ S, with N the size of the array in the rth dimension. Then the Index I/O Cost of IT with respect to R, layout and r is the number of I/O calls required to read such a tile from the associated file into memory if IT appears in the rth position of R; otherwise the Index I/O Cost is zero. The Index I/O Cost is denoted ICost(IT, R, r, layout), where layout can be either row-major (rowmajor) or column-major (colmajor).

Definition: Assuming a file layout, the Basic I/O Cost of a loop index IT with respect to a reference R is the sum of the Index I/O Costs of IT with respect to all index positions of reference R.
Index  Reference  BCost (colmajor)  BCost (rowmajor)
IT     A[IT,JT]   S                 n/p
IT     B[KT,IT]   n/p               S
IT     C[LT,KT]   0                 0
JT     A[IT,JT]   n                 S
JT     B[KT,IT]   0                 0
JT     C[LT,KT]   0                 0
KT     A[IT,JT]   0                 0
KT     B[KT,IT]   S                 n
KT     C[LT,KT]   n                 S
LT     A[IT,JT]   0                 0
LT     B[KT,IT]   0                 0
LT     C[LT,KT]   S                 n

Table 1: BCost values for the program shown in Figure 3(B)
Mathematically speaking,

  BCost(IT, R, layout) = \sum_{r} ICost(IT, R, r, layout)

Notice that the BCost function can be used in two different ways: (1) if IT is fixed and the (R, layout) pair is varied, it gives the I/O cost induced by loop index IT for different local array layouts; (2) if the (R, layout) pair is fixed and IT is varied over all loop indices in the nest, it gives the I/O cost induced by R with the associated layout. The following definition employs the latter usage, while the former is used in Section 4.2.

Definition: The Array Cost of an array reference R, assuming a layout, is the sum of the BCost values of all loop indices with respect to reference R. In other words,

  ACost(R, layout) = \sum_{IT} BCost(IT, R, layout)
4.1 Determining File Layouts

Our heuristic for determining file layouts for out-of-core local arrays first computes the ACost values for all arrays under the possible layouts. It then chooses the combination that will allow the compiler to perform efficient reading/writing. Consider the statement A[IT,JT] = A[IT,JT] + B[KT,IT] + C[LT,KT] + 1 in Figure 3(B). The Basic I/O costs for this statement are given in Table 1. (They are computed from ICost values. For example, ICost(IT, A[IT,JT], 1, colmajor) is S, because S I/O calls are needed to read a data tile of size n x S from the associated file with column-major layout; on the other hand, ICost(LT, C[LT,KT], 1, rowmajor) is n, because n I/O calls are required to read a data tile of size n x S from a file with row-major layout. The other ICost values can be computed similarly.) Using these BCost values, the array costs (ACost values) for A, B and C can be found as in Table 2. Notice that the ACost values are listed term by term, and each term corresponds to the BCost of a loop index (IT, JT, KT and LT, in that order) under the given layout of the file.
Array  Layout    ACost (terms for IT, JT, KT, LT)
A      colmajor  S + n + 0 + 0
A      rowmajor  n/p + S + 0 + 0
B      colmajor  n/p + 0 + S + 0
B      rowmajor  S + 0 + n + 0
C      colmajor  0 + 0 + n + S
C      rowmajor  0 + 0 + S + n

Table 2: ACost values for the program shown in Figure 3(B)
Next our heuristic considers all possible layout combinations by summing the ACost values of the components of each combination term by term. Since in our example there are eight possible combinations, the term-by-term additions of the ACost values are as shown in Table 3.

Definition: The Order of a term is the greatest symbolic value it contains. For example, the order of (S + n) is n, whereas the order of S is S. A term that contains neither n nor S is called a constant-order term.

After listing all possible layout combinations term by term, our layout determination algorithm chooses the combination with the greatest number of constant-order and/or S-order terms. The reason for this choice is that each loop index with only S-order and/or constant-order terms will enable the compiler to read all the data along the relevant dimension(s) with low I/O cost. If there is more than one combination with the maximum number of constant-order and/or S-order terms, our heuristic chooses the one with the minimum cost. It is easy to see from Table 3 that, for our example, row-major layout for arrays A and C and column-major layout for array B (Combination 6) is an optimum combination, since it contains two S-order terms (S and 2S). Notice that there may be more than one optimum combination.
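The following C sketch illustrates this selection procedure on the running example; it is an illustration under stated assumptions, not the authors' implementation. The per-loop ACost terms of Table 2 are hard coded symbolically, each of the 2^3 combinations is scored by its number of constant- and S-order terms, and ties are broken with a numeric cost evaluated for sample values of n, S and p.

    /* Sketch of the file-layout selection heuristic of Section 4.1 for the
       running example, scoring the layout combinations of Table 3 (below). */
    #include <stdio.h>

    #define NARRAYS 3   /* A, B, C        */
    #define NLOOPS  4   /* IT, JT, KT, LT */

    /* one symbolic cost term: value = c0 + cS*S + cNp*(n/p) + cN*n */
    typedef struct { int c0, cS, cNp, cN; } term_t;

    /* ACost terms per array and layout (0 = colmajor, 1 = rowmajor), listed
       in loop order IT, JT, KT, LT: exactly the rows of Table 2.           */
    static const term_t acost[NARRAYS][2][NLOOPS] = {
      /* A */ { {{0,1,0,0},{0,0,0,1},{0,0,0,0},{0,0,0,0}},   /* col: S,n,0,0   */
                {{0,0,1,0},{0,1,0,0},{0,0,0,0},{0,0,0,0}} }, /* row: n/p,S,0,0 */
      /* B */ { {{0,0,1,0},{0,0,0,0},{0,1,0,0},{0,0,0,0}},   /* col: n/p,0,S,0 */
                {{0,1,0,0},{0,0,0,0},{0,0,0,1},{0,0,0,0}} }, /* row: S,0,n,0   */
      /* C */ { {{0,0,0,0},{0,0,0,0},{0,0,0,1},{0,1,0,0}},   /* col: 0,0,n,S   */
                {{0,0,0,0},{0,0,0,0},{0,1,0,0},{0,0,0,1}} }  /* row: 0,0,S,n   */
    };

    /* order of a term: 2 if it contains n or n/p, 1 if it contains S, else 0 */
    static int order(term_t t) { return (t.cN || t.cNp) ? 2 : (t.cS ? 1 : 0); }

    /* numeric value of a term, used only for tie breaking (assumed sizes) */
    static double value(term_t t, double n, double S, double p)
    { return t.c0 + t.cS * S + t.cNp * n / p + t.cN * n; }

    int main(void)
    {
        const double n = 4096, S = 64, p = 4;   /* sample sizes (assumed) */
        int best = 0, best_low = -1; double best_cost = 0;

        for (int combo = 0; combo < (1 << NARRAYS); combo++) {
            int low_order = 0; double cost = 0;
            for (int l = 0; l < NLOOPS; l++) {        /* term-by-term sum    */
                term_t sum = {0,0,0,0};
                for (int a = 0; a < NARRAYS; a++) {
                    term_t t = acost[a][(combo >> a) & 1][l];
                    sum.c0 += t.c0; sum.cS += t.cS; sum.cNp += t.cNp; sum.cN += t.cN;
                }
                if (order(sum) < 2) low_order++;      /* constant or S order */
                cost += value(sum, n, S, p);
            }
            if (low_order > best_low ||
               (low_order == best_low && cost < best_cost)) {
                best = combo; best_low = low_order; best_cost = cost;
            }
        }
        printf("chosen layouts: A=%s B=%s C=%s\n",
               (best & 1) ? "rowmajor" : "colmajor",
               (best & 2) ? "rowmajor" : "colmajor",
               (best & 4) ? "rowmajor" : "colmajor");
        return 0;
    }

For the running example this scoring rates Combinations 5 and 6 of Table 3 equally (two S-order terms and the same tie-breaking cost), consistent with the remark above that there may be more than one optimum combination; with the simple first-found tie-breaking used here the sketch reports Combination 5, while the paper proceeds with Combination 6.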
Combination  Array A    Array B    Array C    Cost
1            colmajor   colmajor   colmajor   (S+n/p) + n + (S+n) + S
2            colmajor   colmajor   rowmajor   (S+n/p) + n + 2S + n
3            colmajor   rowmajor   colmajor   2S + n + 2n + S
4            colmajor   rowmajor   rowmajor   2S + n + (S+n) + n
5            rowmajor   colmajor   colmajor   2n/p + S + (S+n) + S
6            rowmajor   colmajor   rowmajor   2n/p + S + 2S + n
7            rowmajor   rowmajor   colmajor   (n/p+S) + S + 2n + S
8            rowmajor   rowmajor   rowmajor   (n/p+S) + S + (n+S) + n

Table 3: Possible file layout combinations for the program shown in Figure 3(B)

4.2 Deciding Loop Order

After selecting file layouts for each array, our compiler next determines an optimum (tiling) loop order that will enable efficient file reading/writing.

Definition: The Total I/O Cost of a loop index IT is the sum of the Basic I/O Costs (BCost) of IT with respect to every distinct array reference it surrounds. Generally speaking, TCost(IT) is the estimated I/O cost caused by loop IT when all array references are considered. Mathematically,

  TCost(IT) = \sum_{R} BCost(IT, R, layout_R)

where R ranges over the array references and layout_R is the layout of the associated file as determined in the previous step. Our algorithm for the desired loop permutation is as follows:

- Calculate TCost(IT) for each tiling loop IT.
- Permute the tiling loops from the outermost to the innermost position according to nonincreasing values of TCost. (It should be noted that the desired loop permutation must observe the existing dependences; if it does not, our heuristic keeps the loop order as it is and applies only the memory allocation scheme.)
- Apply the necessary loop interchange(s) to improve the temporal locality for (the tile of) the reference being updated. This step prevents write operations from occurring in inner tiling loops. (Loop indices that do not have any n term should be placed in the innermost positions; other loop indices, however, can be interchanged with one another if doing so promotes the temporal locality for the reference(s) being updated.)
Returning to our example, under the chosen file layouts (Combination 6), TCost(IT) = 2n/p, TCost(JT) = S, TCost(KT) = 2S, and TCost(LT) = n. So the desired loop permutation, from outermost to innermost position, is LT, IT, KT, JT, assuming p ≥ 2. Considering the temporal locality for the array being written to, however, the compiler interchanges LT and IT and reaches the order IT, LT, KT, JT.
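A small C sketch of this step for the running example follows; it is an illustration with assumed sample sizes, not the authors' code. It evaluates the TCost values listed above numerically and sorts the tiling loops by nonincreasing TCost; the final locality-driven interchange of LT and IT is noted in a comment rather than implemented.

    /* Sketch of the loop-ordering step of Section 4.2 for the running
       example, under the layouts chosen in Section 4.1 (A, C row-major;
       B column-major).                                                 */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { const char *name; double tcost; } loopinfo_t;

    static int by_tcost_desc(const void *x, const void *y)
    {
        double a = ((const loopinfo_t *)x)->tcost;
        double b = ((const loopinfo_t *)y)->tcost;
        return (a < b) - (a > b);                  /* descending order */
    }

    int main(void)
    {
        const double n = 4096, S = 64, p = 4;      /* sample sizes (assumed) */
        /* TCost values from the BCost entries of Table 1:
           TCost(IT) = n/p + n/p, TCost(JT) = S, TCost(KT) = 2S, TCost(LT) = n */
        loopinfo_t loops[4] = {
            { "IT", 2 * n / p }, { "JT", S }, { "KT", 2 * S }, { "LT", n }
        };
        qsort(loops, 4, sizeof loops[0], by_tcost_desc);
        printf("tiling loop order (outermost first):");
        for (int i = 0; i < 4; i++) printf(" %s", loops[i].name);
        printf("\n");   /* prints LT IT KT JT; the paper then swaps LT and  */
        return 0;       /* IT for temporal locality of A, giving IT LT KT JT */
    }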
4.3 Memory Allocation Scheme

Since each node has a limited memory capacity and, in general, a loop nest may contain a number of out-of-core arrays, the node memory should be partitioned among these out-of-core arrays optimally; that is, the partitioning of the node memory should lead to minimum I/O cost.

Definition: The Column-Conformant (Row-Conformant) Position of an array reference is its first (last) index position.

The following is a description of our scheme:

- The memory allocation algorithm starts with tiles of size S in each dimension of each reference. For example, if a loop nest contains a one-dimensional array, two two-dimensional arrays and a three-dimensional array, it first allocates a tile of size S for the one-dimensional array, a tile of size S x S for each of the two-dimensional arrays, and a tile of size S x S x S for the three-dimensional array. This allotment scheme implies the memory constraint S^3 + 2S^2 + S ≤ M, where M is the size of the node memory.

- Then the compiler divides the array references in the nest into two disjoint groups: a group whose associated files have row-major layout, and a group whose associated files have column-major layout.

  - For the row-major layout group: the compiler considers all loop indices in turn. For every loop whose index appears in at least one row-conformant position and does not appear in any other position of any reference in this group, it increases the tile size, in the row-conformant position(s) of the references which have that loop index in the row-conformant position, to the full array size.

  - For the column-major layout group: the compiler considers all loop indices in turn. For every loop whose index appears in at least one column-conformant position and does not appear in any other position of any reference in this group, it increases the tile size, in the column-conformant position(s) of the references which have that loop index in the column-conformant position, to the full array size.

Of course, the memory constraint should be adjusted accordingly. It should be noted that, after these adjustments, any inconsistency between the two groups (due to a common loop index) should be resolved by not changing the original tile sizes in the dimensions in question.

For our running example, the compiler first allocates a tile of size S x S for each reference. It then divides the array references into two groups: A[IT,JT] and C[LT,KT] in the first group, and B[KT,IT] in the second group. Since JT and KT appear in the row-conformant positions of the first group and do not appear elsewhere in this group, our algorithm allocates data tiles of size S x n for A[IT,JT] and C[LT,KT]. Similarly, since KT appears in the column-conformant position of the second group and does not appear elsewhere in this group, our algorithm allocates a data tile of size n x S for B[KT,IT], as shown in Figure 4(D). Notice that after these tile allocations the tiling loops KT and JT disappear and the node program shown in Figure 3(C) is obtained.
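The following C sketch applies the two expansion rules above to the running example; it is an illustrative sketch, not the authors' implementation, and the inter-group conflict-resolution step and the adjustment of the memory constraint are omitted for brevity.

    /* Sketch of the memory-allocation rule of Section 4.3 for the running
       example.  Every tile starts at S in each dimension; within each layout
       group, a dimension indexed in the conformant position (last for
       row-major, first for column-major) by a loop that appears nowhere else
       in that group is expanded to the full size n.                        */
    #include <stdio.h>
    #include <string.h>

    #define NREFS 3
    #define NDIMS 2

    typedef struct {
        const char *array;
        const char *idx[NDIMS];   /* loop index in each dimension          */
        int rowmajor;             /* chosen file layout of this reference  */
        const char *tile[NDIMS];  /* resulting tile extent: "S" or "n"     */
    } ref_t;

    int main(void)
    {
        /* layouts as chosen in Section 4.1: A, C row-major, B column-major */
        ref_t refs[NREFS] = {
            { "A", { "IT", "JT" }, 1, { "S", "S" } },
            { "B", { "KT", "IT" }, 0, { "S", "S" } },
            { "C", { "LT", "KT" }, 1, { "S", "S" } },
        };
        const char *loops[] = { "IT", "JT", "KT", "LT" };

        for (int g = 0; g <= 1; g++) {            /* g=0: colmajor, g=1: rowmajor */
            int conf = g ? NDIMS - 1 : 0;         /* conformant position          */
            for (int l = 0; l < 4; l++) {
                int in_conf = 0, elsewhere = 0;
                for (int r = 0; r < NREFS; r++) {
                    if (refs[r].rowmajor != g) continue;
                    for (int d = 0; d < NDIMS; d++)
                        if (strcmp(refs[r].idx[d], loops[l]) == 0) {
                            if (d == conf) in_conf++; else elsewhere++;
                        }
                }
                if (in_conf > 0 && elsewhere == 0)    /* safe to expand to n */
                    for (int r = 0; r < NREFS; r++)
                        if (refs[r].rowmajor == g &&
                            strcmp(refs[r].idx[conf], loops[l]) == 0)
                            refs[r].tile[conf] = "n";
            }
        }
        for (int r = 0; r < NREFS; r++)
            printf("%s: tile %s x %s\n", refs[r].array,
                   refs[r].tile[0], refs[r].tile[1]);
        return 0;   /* prints A: S x n, B: n x S, C: S x n, as in Figure 4(D) */
    }

Running this sketch reproduces the allocation of Figure 4(D): S x n tiles for A and C and an n x S tile for B.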
In what follows we show that neither a fixed column-major file layout, as in Fortran, nor a fixed row-major file layout, as in C, results in optimal I/O cost.

- Assume a fixed column-major file layout for all arrays, as in Fortran. In that case, TCost(IT) = S + n/p, TCost(JT) = n, TCost(KT) = S + n, and TCost(LT) = S (from the first row of Table 3). So, from outermost to innermost position, KT, JT, IT, LT is the desirable loop permutation. Again considering the temporal locality for the array being written to, the compiler interchanges KT and IT, and the order IT, JT, KT, LT is obtained; in other words, for a fixed column-major layout the initial loop order is the most appropriate one. Our memory allocation scheme allocates a tile of size n x S for C, and tiles of size S x S for each of A and B, as shown in Figure 4(B). The reason for assigning tiles of size S x S is that although IT and KT appear in column-conformant positions, they appear in other positions as well. (Note that for this case and the following one, during memory allocation there is only one group, and all array references belong to it.) The overall I/O cost of this approach is

    T_{overall} = \frac{n^2}{S^2 p}\left[ S C_{I/O} + S^2 t_{I/O} + \frac{n[S C_{I/O} + S^2 t_{I/O}]}{S} + \frac{n[S C_{I/O} + nS t_{I/O}]}{S} \right]
                = \underbrace{\frac{n^2 C_{I/O}}{Sp} + \frac{n^2 t_{I/O}}{p}}_{T_A}
                + \underbrace{\frac{n^3 C_{I/O}}{S^2 p} + \frac{n^3 t_{I/O}}{Sp}}_{T_B}
                + \underbrace{\frac{n^3 C_{I/O}}{S^2 p} + \frac{n^4 t_{I/O}}{S^2 p}}_{T_C}
  under the memory constraint 2S^2 + nS ≤ M.

- Assume a fixed row-major layout for all arrays, as in C. In that case, TCost(IT) = n/p + S, TCost(JT) = S, TCost(KT) = n + S, and TCost(LT) = n (from the last row of Table 3). From outermost to innermost position, KT, LT, IT, JT is the desirable loop permutation. Once more considering the temporal locality, our compiler takes IT to the outermost position, so the final loop order is IT, KT, LT, JT. The compiler allocates a tile of size S x n for A, and tiles of size S x S for each of B and C, as shown in Figure 4(C). The overall I/O cost of this approach is

    T_{overall} = \frac{n}{Sp}\left[ S C_{I/O} + Sn t_{I/O} + \frac{n[S C_{I/O} + S^2 t_{I/O}]}{S} + \frac{n^2[S C_{I/O} + S^2 t_{I/O}]}{S^2} \right]
                = \underbrace{\frac{n C_{I/O}}{p} + \frac{n^2 t_{I/O}}{p}}_{T_A}
                + \underbrace{\frac{n^2 C_{I/O}}{Sp} + \frac{n^2 t_{I/O}}{p}}_{T_B}
                + \underbrace{\frac{n^3 C_{I/O}}{S^2 p} + \frac{n^3 t_{I/O}}{Sp}}_{T_C}
  under the memory constraint 2S^2 + nS ≤ M.

It should be noted that although for reasonable values of M these costs are better than that of the original version, they are much worse than the one obtained by our approach. These analytical results have also been confirmed by our experiments, as will be shown in Section 5.

As a second example we consider a nest taken from a GAXPY matrix multiplication program (see Figure 5(A)). Suppose that the jth column of a matrix X is denoted by x_j, and let A, B and C be n x n matrices. Then the GAXPY algorithm for computing C = AB is c_j = \sum_{k=1}^{n} b_{kj} a_k, for j = 1, ..., n. As before, the compilation is performed in two phases. The resulting straightforward node program is shown in Figure 5(B). The local out-of-core arrays with square data tiles are shown in Figure 6(A).
  (A) Out-of-core GAXPY nest:

      PARAMETER (n=..., p=...)
      REAL A(n,n), B(n,n), C(n,n), TEMP(n,n)
!hpf$ PROCESSORS P(p)
!hpf$ TEMPLATE D(n)
!hpf$ DISTRIBUTE D(BLOCK) on P
!hpf$ ALIGN (*,:) WITH D :: A, C, TEMP
!hpf$ ALIGN (:,*) WITH D :: B
      DO j = 1, n
        FORALL (k=1:n)
          TEMP(1:n,k) = B(k,j) * A(1:n,k)
        ENDFORALL
        C(1:n,j) = SUM(TEMP,2)   ! SUM intrinsic
      ENDDO j
      END

  (B) Resulting straightforward node program:

      DO JT = LB_JT, UB_JT, S_JT
        DO IT = LB_IT, UB_IT, S_IT
          read data tile for B
          DO KT = LB_KT, UB_KT, S_KT
            read data tile for A
            TEMP[KT,IT] = A[KT,IT] * B[IT,JT]
          ENDDO KT
        ENDDO IT
        participate in global sum
        if (owner) store the jth column
      ENDDO JT

Figure 5: (A) Out-of-core GAXPY nest (B) Resulting straightforward node program
Index  Reference  BCost (colmajor)  BCost (rowmajor)
IT     A[KT,IT]   n/p               S
IT     B[IT,JT]   S                 n/p
JT     A[KT,IT]   0                 0
JT     B[IT,JT]   n                 S
KT     A[KT,IT]   S                 n
KT     B[IT,JT]   0                 0

Table 4: BCost values for the program shown in Figure 5(B)

Next we demonstrate how our three-step approach optimizes the I/O cost of this program. We concentrate only on A and B, as they are used much more frequently than array C. The Basic I/O costs for this nest are given in Table 4, and the array costs for A and B are shown in Table 5. The compiler considers all possible layout combinations by summing the ACost values of each combination term by term, obtaining the costs for the different layout combinations shown in Table 6. It is clear that Combination 2, which associates column-major layout with A and row-major layout with B, is the most appropriate one, since it contains two S-order terms. Under these layouts, TCost(IT) = 2n/p, TCost(JT) = S, and TCost(KT) = S. From outermost to innermost position, IT, JT, KT and IT, KT, JT are two desirable loop permutations. Going on with the former, the compiler allocates a tile of size n x S for array A and a tile of size S x n for array B, as illustrated in Figure 6(B).
Array  Layout    ACost (terms for IT, JT, KT)
A      colmajor  n/p + 0 + S
A      rowmajor  S + 0 + n
B      colmajor  S + n + 0
B      rowmajor  n/p + S + 0

Table 5: ACost values for the program shown in Figure 5(B)
Combination  Array A    Array B    Cost
1            colmajor   colmajor   (n/p+S) + n + S
2            colmajor   rowmajor   2n/p + S + S
3            rowmajor   colmajor   2S + n + n
4            rowmajor   rowmajor   (S+n/p) + S + n

Table 6: Possible file layout combinations for the program shown in Figure 5(B)
Figure 6: Out-of-core local arrays and different tile allocations for the GAXPY nest. (A) Square S x S tiles for A and B. (B) Optimized allocation: an n x S tile for A and an S x n tile for B.

The complexity of our heuristics is O(lmd + 2^m + l log l) if the layout determination algorithm is used; otherwise it is O(lmd + l log l), where l is the number of loops, m is the number of distinct array references, and d is the maximum number of array dimensions over all references. The 2^m term comes from enumerating the possible layout combinations, and the log term comes from sorting once the TCost value for each loop index has been computed. Since in practice l, m and d are very small (e.g., 2 or 3), all the steps are inexpensive and the approach is efficient.

Notice that our approach is also applicable to loop nests that have arbitrary reuse spaces [20]. As an example, consider A[IT,JT] = A[IT,JT] + B[JT+LT,KT]. Assuming a uniprocessor and column-major layout for all arrays, TCost(IT) = S, TCost(JT) = n + S, TCost(KT) = n and TCost(LT) = S.

It should also be noted that our three-step approach can be used in different ways. If the way the data arrives (e.g., from archival storage, a satellite, or over the network) does not conform to the I/O-optimized layout, the compiler either
- keeps the original layout and applies only the loop permutation and memory allocation heuristics, or
- redistributes the data based on the result of the layout determination algorithm and then applies the loop permutation and memory allocation heuristics. Of course, this redistribution involves some additional overhead, which can be amortized if the array is used several times.

If, on the other hand, the compiler is to create the out-of-core arrays from scratch, then it can apply all three steps of the proposed technique. As we have indicated before, if the desired loop permutation is not legal (semantics-preserving), then the compiler keeps the original loop order and applies only the memory allocation algorithm. (Another option would be to try the next most desirable loop permutation; our choice is simpler and guarantees that the optimized program will be at least as good as the original one.) An important characteristic of our scheme is that all three steps are independent of each other, and depending on the specific requirements of the application at hand, some combination of them can be applied.
Number of Processors = 1
SR      Original  Col-Opt  Row-Opt  Opt
1/4     363       311      281      199
1/16    1441      1108     1074     441
1/64    8384      6449     5912     1422
1/256   48880     35759    30055    4584

Number of Processors = 4
SR      Original  Col-Opt  Row-Opt  Opt
1/4     202       161      119      66
1/16    672       412      354      152
1/64    5193      2411     2196     490
1/256   29504     16904    12291    1689

Number of Processors = 8
SR      Original  Col-Opt  Row-Opt  Opt
1/4     154       117      93       39
1/16    427       359      288      87
1/64    2858      1955     1327     272
1/256   21026     8901     7505     941

Number of Processors = 16
SR      Original  Col-Opt  Row-Opt  Opt
1/4     99        79       61       21
1/16    286       271      259      48
1/64    1874      1654     1141     160
1/256   16719     6221     5687     522

Table 7: I/O times (in seconds) for our first example (Figure 3) with 4K*4K (128 MByte) double arrays
5 Experimental Results

The optimizations introduced in this paper were performed by hand using PASSION [19], a run-time library for out-of-core computations. PASSION routines can be called from C and Fortran, and an out-of-core array can be associated with a row-major or column-major layout. In the experiments the C language was used. All experiments were run on an IBM SP-2 running the AIX 4.1.4 operating system. Each node has an I/O subsystem containing two SCSI buses and six Starfire 7200 SCSI disks, a 66.7 MHz clock speed, and SCSI disks with a minimum bandwidth of 8 MB/s. All the reported times are in seconds. The experiments were performed for different Slab Ratios (SR), the ratio of the available in-core node memory to the combined size of the out-of-core local arrays.

Table 7 presents the I/O times of four different versions of our first example (Figure 3) with 4K*4K (128 MByte) double arrays: the original version (Original), the optimized version using column-major layout for all arrays (Col-Opt), the optimized version using row-major layout for all arrays (Row-Opt), and the version that is optimized by our approach (Opt). Figure 7 shows the I/O times normalized with respect to the original version within each slab ratio.
Figure 7: Normalized I/O times for the first example (Figure 3) with 4K*4K (128 MByte) double arrays (Original, Col-Opt, Row-Opt and Opt versions, normalized within each slab ratio, for 1, 4, 8 and 16 processors).
Figure 8 illustrates the speedups for the Original and Opt versions. We define two kinds of speedups: the speedup obtained for each version by increasing the number of processors, which we call S_p, and the speedup obtained by using the Opt version instead of the Original when the number of processors is fixed. We call this second speedup the Local Speedup (S_l), and the product S_p * S_l is termed the Combined Speedup. Figure 9 illustrates the combined speedup curves for different slab ratios. From these results we conclude the following:

- The Opt version performs much better than all the other versions. The reason for this expected result is that in the Opt version all array readings are perfectly optimized (as shown in Figure 4(D)), in contrast to the other versions, where at most one array reading could be optimized and the readings of the other arrays remained a bottleneck (see Figure 4(A), (B) and (C)).
Figure 8: Speedups for the unoptimized and optimized versions of the first example (Figure 3) with 4K*4K double arrays (speedup versus number of processors for slab ratios 1/4, 1/16, 1/64 and 1/256; Original, Opt and ideal).
- When the number of processors is increased, the effectiveness of our approach (Opt) increases (see Figure 7). This is because more processors are now working on the out-of-core local array(s) in an I/O-optimal manner.

- When the slab ratio is decreased, the effectiveness of our approach increases (see Figure 7). As the amount of node memory is reduced, the Original version performs a large number of small I/O requests (for square data tiles), and that in turn degrades the performance dramatically. The Opt version, on the other hand, continues with optimized I/O no matter how small the in-core node memory is.
Figure 9: Combined speedups for the first example (Figure 3) with 4K*4K double arrays (combined speedup versus number of processors for slab ratios 1/4, 1/16, 1/64 and 1/256, with linear speedup shown for reference).
- As shown in Figure 8, the Opt version also scales better than the Original for all slab ratios. This is due to the poor I/O performance of all the processors involved in the Original version. The speedups are, to an extent, limited by the fact that the array C is not distributed.

- It is clear from Figure 9 that the combined speedup is much higher for small slab ratios. That again points to the importance of our I/O optimizations in environments where the amount of in-core node memory is very small compared to the size(s) of the out-of-core local array(s). Note that all of the combined speedups are superlinear, as the algorithm (loop order) is changed in the Opt version.

- It should also be noted that when the slab ratio is very small, the optimized versions with fixed layouts for all files also perform much better than the Original.

The I/O optimizations introduced in this paper can be evaluated in two different ways:
- First, a problem that is solved by the Original version using a fixed slab ratio on p processors can, in principle, be solved in the same or less time on p′ processors with the same slab ratio using the Opt version. The ratio p/p′ is termed the Processor Coefficient (PC).

- Second, a problem that is solved on a fixed number of processors with a slab ratio sr by the Original version can, in principle, be solved in the same or less time on the same number of processors with a smaller slab ratio (less memory) sr′ by the Opt version. We call the ratio sr/sr′ the Memory Coefficient (MC).
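As a purely hypothetical illustration of these definitions (the numbers below are not measurements), if a run that the Original version needs p = 16 processors for can be matched by the Opt version on p′ = 2 processors at the same slab ratio, and a run that the Original version needs sr = 1/16 for can be matched by the Opt version with sr′ = 1/256 on the same number of processors, then

  PC = p / p′ = 16 / 2 = 8,    MC = sr / sr′ = (1/16) / (1/256) = 16.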
The larger these coefficients are, the better, as they indicate reductions in the processor and memory requirements of the application program, respectively.

Figure 10: PC and MC curves for the first example (Figure 3): PC versus the number of processors for slab ratios from 1/4 to 1/1024, and MC versus the slab ratio for 1 to 32 processors.
Figure 10 shows the PC and MC curves for the first example with 4K*4K (128 MByte) double arrays. It can be observed that there is a slab ratio, called the Critical Slab Ratio, beyond which the shape of the PC curve does not change; in Figure 10 the critical slab ratio is 1/64. The practical significance of this observation is that below the critical slab ratio, independent of the node memory capacities, for a given p it is possible to find the corresponding p′, where p and p′ are as defined above. For instance, suppose that in this example nest a problem is solved by the Original version on a number of processors, in a given time, with a slab ratio that is equal to or smaller than the critical slab ratio. It was observed that the same problem can be solved in the same time on a single processor using the Opt version with the same slab ratio. The reason for this is that the I/O in the Original version is performed suboptimally, and that once the critical slab ratio is reached, increasing the number of processors does not give much additional benefit at all. Similarly, it can be observed that there is a number of processors beyond which the shape of the MC curve does not change; in Figure 10 that number is 8. This result means that beyond that number of processors, given an sr it is possible to find the corresponding sr′, where sr and sr′ are as defined above. We have also conducted experiments with 2K*2K (32 MByte) double arrays for this example and obtained similar results, but due to space limitations we do not present those results. We believe that the final PC and MC curves give enough information about the performance of the I/O optimizations in a multicomputer environment where very large arrays are processed.

Figures 11 and 12 demonstrate the performance of our optimizations for the second running example of this paper (the GAXPY nest). The experiments were performed with 4K*4K double arrays. As shown in Figure 11, the improvement in the I/O time is highly significant, especially with smaller slab ratios. As noted before, the effectiveness of our optimizations gets stronger as the number of processors is increased. Figure 12(A) clearly shows that the
number of processors=1
number of processors=4
1.0
1.0 Original Opt
Original Opt 0.8
Normalized I/O Time
Normalized I/O Times
0.8
0.6
0.4
0.2
0.0
0.6
0.4
0.2
1/4
1/16 1/64 Slab Ratio
0.0
1/256
1/4
number of processors=8
1/16 1/64 Slab Ratio
1/256
number of processors=16
1.0
1.0 Original Opt
Original Opt 0.8
Normalized I/O Times
Normalized I/O Times
0.8
0.6
0.4
0.2
0.0
0.6
0.4
0.2
1/4
1/16 1/64 Slab Ratio
1/256
0.0
1/4
1/16 1/64 Slab Ratio
1/256
Figure 11: Normalized I/O Times for GAXPY nest (Figure 5) with 4K*4K (128 MByte) double arrays
optimized version scales much better than the original. Due to lack of space we present only the case where the slab ratio is 1/256; the curves for the other slab ratios are similar. The combined speedups in Figure 12(B) indicate the importance of the I/O optimizations with smaller slab ratios. Finally, the PC and MC values for this loop nest are given in Figure 13.
(B)
Slab Ratio=1/256
Combined Speedups
20.0
200.0
150.0
Original Opt Ideal
Speedup
Speedup
15.0
Slab Ratio=1/4 Slab Ratio=1/16 Slab Ratio=1/64 Slab Ratio=1/256 Linear
10.0
5.0
100.0
50.0
0.0 0.0
5.0
10.0 15.0 Number of Processors
20.0
0.0 0.0
5.0
10.0 15.0 Number of Processors
20.0
Figure 12: (A) Speedups for Unoptimized and Optimized Versions of GAXPY nest with 4K*4K double arrays (B) Combined Speedups for GAXPY nest with 4K*4K double arrays
We have also applied our optimizations to a number of common kernels. The experiments were conducted on four nodes of the IBM SP-2 with 4K*4K two-dimensional and 4K one-dimensional double arrays, and the results are shown in Figure 14 as normalized I/O times. Figure 14(A) gives the performance improvement obtained on a matrix-transpose loop that contains the statement B(i,j) = A(j,i). Our layout heuristic assigns column-major layout to array A and row-major layout to array B. It then allocates a tile of size n x S to A and a tile of size S x n to B. Notice that no fixed file layout for all arrays (as in C or Fortran) can obtain that performance. The performance improvement on the iterative-solver nest is given in Figure 14(B).
Figure 13: PC and MC curves for the GAXPY nest (PC versus the number of processors for slab ratios from 1/4 to 1/1024, and MC versus the slab ratio for 1 to 32 processors).
The innermost loop contains the statement Y(k) = Y(k) - U(k,j) * X(j). For this imperfectly nested example our heuristic associates column-major layout with all arrays. Figure 14(C) shows the results of our optimizations on a matrix-smoothing (four-point relaxation) nest. The optimized statement is B(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1)). Our compiler associates row-major layout with all arrays (a fixed column-major layout for all arrays is equally acceptable). Finally, Figure 14(D) illustrates the performance improvement on matrix-vector multiplication (Y = AX). It is interesting to note that in this example all layout combinations are equally good as far as the I/O cost is concerned. Going with column-major layout for all arrays, our compiler allocates a data tile of size n x S for A, a tile of size n for Y, and a tile of size S for X.
(B)
Matrix Transpose
Iterative Solver
1.0
1.0 Original Opt
Original Opt 0.8
Normalized I/O Times
Normalized I/O Times
0.8
0.6
0.4
0.2
0.0
0.6
0.4
0.2
1/4
1/16 1/64 Slab Ratio
0.0
1/256
1/4
1/16 1/64 Slab Ratio
(C)
(D)
Matrix Smoothing
Matrix-vector Multiplication
1.0
1/256
1.0
Original Opt
Original Opt 0.8
Normalized I/O Times
Normalized I/O Times
0.8
0.6
0.4
0.2
0.0
0.6
0.4
0.2
1/4
1/16 1/64 Slab Ratio
1/256
0.0
1/4
1/16 1/64 Slab Ratio
1/256
Figure 14: Normalized I/O times for dierent nests. (A) Matrix Transpose (B) Iterative Solver (C) Matrix Smoothening (D) Matrix-vector Multiplication
6 Related Work

Previous work on compiler optimizations to improve locality has concentrated on iteration space tiling. In [20] and [14] iteration space tiling is used for optimizing the cache performance of programs. In [15] a simple model for optimizing cache locality is developed. In [18] reorganization of loops is proposed in order to create sub-matrix versions of algorithms automatically. Most of the previous work targets locality enhancements for sequential programs. Our work deals with locality optimizations for out-of-core programs running on distributed-memory machines.

Iteration space tiling has also been used for purposes other than optimizing locality. In [12] an iteration space partitioning technique based on hyperplanes is introduced. In [17] the problem of compiling perfectly nested loops for distributed-memory message-passing machines is addressed. In [2] a solution to the problem of determining loop and data partitions automatically for programs with multiple loops and arrays is presented.

Our work also bears similarity to that of Abu-Sufah et al. [1], which deals with optimizations to enhance the locality properties of programs in a virtual memory environment. Among the program transformations they use are loop fusion, loop distribution and tiling (page indexing). We, instead, do not assume the existence of virtual memory; recognizing the difficulty of writing an out-of-core version of an in-core program, we give the compiler the responsibility of achieving reasonable performance.

There are a few works on the compilation of out-of-core problems. In [7] the functionality of ViC*, a compiler-like preprocessor for out-of-core C*, is described. The output of ViC* is a standard C* program with the appropriate I/O and library calls added for efficient access to out-of-core parallel variables. Several issues that arise in the design and implementation of virtual-memory systems for data-parallel computing are investigated in [6]. In [16] compiler support for handling out-of-core arrays on parallel architectures is discussed. In [3] a strategy to compile out-of-core programs on distributed-memory message-passing systems is offered. In [4] a methodology to optimize the communication in out-of-core computations is presented. It should be noted that our optimization techniques are general in the sense that they can be incorporated into any out-of-core compilation framework for parallel and sequential machines. Finally, in [5] a unified approach that uses both data and control transformations for optimizing the locality of in-core programs on distributed shared-memory machines is offered.
7 Conclusion

In this paper we presented how the basic in-core compilation method can be extended to compile out-of-core programs. However, the code generated using such a straightforward extension may not give good performance. We proposed a three-step I/O optimization process by which the compiler can improve the code generated by the above method. The compiler estimates the I/O costs associated with different access patterns and selects the method with the least I/O cost. This can reduce the amount of I/O by as much as an order of magnitude. It has been shown in this paper that, instead of dividing the available memory equally among all arrays, the best performance is obtained when the most frequently accessed array(s) are allocated larger tiles.

Our work is unique in the sense that it combines data transformations (layout determination) and control transformations (loop permutation) in a unified framework for optimizing out-of-core programs on distributed-memory message-passing machines. It has been shown in this paper that the combination of these two transformations leads to optimized memory allocations for the different out-of-core arrays, and that, in turn, minimizes the overall I/O time.
References

[1] W. Abu-Sufah. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations. IEEE Transactions on Computers, C-30(5), pages 341-355, May 1981.
[2] A. Agarwal, D. Kranz, and V. Natarajan. Automatic Partitioning of Parallel Loops for Cache-Coherent Multiprocessors. In 22nd International Conference on Parallel Processing, St. Charles, IL, August 1993.
[3] R. Bordawekar, A. Choudhary, K. Kennedy, C. Koelbel, and M. Paleczny. A Model and Compilation Strategy for Out-of-Core Data Parallel Programs. In Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming, July 1995.
[4] R. Bordawekar, A. Choudhary, and J. Ramanujam. Automatic Optimization of Communication in Out-of-Core Stencil Codes. In Proc. 10th ACM International Conference on Supercomputing, Philadelphia, PA, May 1996.
[5] M. Cierniak and W. Li. Unifying Data and Control Transformations for Distributed Shared Memory Machines. Technical Report 542, Dept. of Computer Science, University of Rochester, NY, November 1994.
[6] T. H. Cormen. Virtual Memory for Data-Parallel Computing. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, 1992.
[7] T. H. Cormen and A. Colvin. ViC*: A Preprocessor for Virtual-Memory C*. Dartmouth College Computer Science Technical Report PCS-TR94-243, November 1994.
[8] J. Del Rosario and A. Choudhary. High Performance I/O for Parallel Computers: Problems and Prospects. IEEE Computer, pages 59-68, March 1994.
[9] D. Gannon, W. Jalby, and K. Gallivan. Strategies for Cache and Local Memory Management by Global Program Transformations. Journal of Parallel and Distributed Computing, 5:587-616, 1988.
[10] S. Hiranandani, K. Kennedy, and C. Tseng. Compiling Fortran D for MIMD Distributed-Memory Machines. Communications of the ACM, 35(8), pages 66-80, August 1992.
[11] High Performance Computing and Communications: Grand Challenges 1993 Report. A Report by the Committee on Physical, Mathematical and Engineering Sciences, Federal Coordinating Council for Science, Engineering and Technology.
[12] F. Irigoin and R. Triolet. Supernode Partitioning. In Proc. 15th Annual ACM Symposium on Principles of Programming Languages, pages 319-329, San Diego, CA, January 1988.
[13] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, and M. Zosel. High Performance Fortran Handbook. The MIT Press, 1994.
[14] W. Li. Compiling for NUMA Parallel Machines. Ph.D. Thesis, Cornell University, 1993.
[15] K. McKinley, S. Carr, and C.-W. Tseng. Improving Data Locality with Loop Transformations. ACM Transactions on Programming Languages and Systems (to appear).
[16] M. Paleczny, K. Kennedy, and C. Koelbel. Compiler Support for Out-of-Core Arrays on Parallel Machines. CRPC Technical Report CRPC-TR94509-S, Rice University, Houston, TX, 1994.
[17] J. Ramanujam and P. Sadayappan. Tiling Multidimensional Iteration Spaces for Multicomputers. Journal of Parallel and Distributed Computing, 16(2):108-120, October 1992.
[18] R. Schreiber and J. Dongarra. Automatic Blocking of Nested Loops. Technical Report, RIACS, May 1990.
[19] R. Thakur, R. Bordawekar, A. Choudhary, R. Ponnusamy, and T. Singh. PASSION Runtime Library for Parallel I/O. In Proc. of the Scalable Parallel Libraries Conference, October 1994.
[20] M. Wolf and M. Lam. A Data Locality Optimizing Algorithm. In Proc. ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 30-44, June 1991.
[21] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald. Vienna Fortran: A Language Specification. ICASE Interim Report 21, MS 132c, ICASE, NASA, Hampton, VA 23681, 1992.