Distributed Memory Compiler Design for Sparse Problems

Janet Wu (1), Raja Das (2), Joel Saltz (2), Harry Berryman (1,2), Seema Hiranandani (3)

Abstract

This paper addresses the issue of compiling concurrent loop nests in the presence of complicated array references and irregularly distributed arrays. Arrays accessed within loops may contain accesses that make it impossible to precisely determine the reference pattern at compile time. This paper proposes a run time support mechanism that is used effectively by a compiler to generate efficient code in these situations. The compiler accepts as input a Fortran 77 program enhanced with specifications for distributing data, and outputs a message passing program that runs on the nodes of a distributed memory machine. The runtime support for the compiler consists of a library of primitives designed to support irregular patterns of distributed array accesses and irregularly distributed array partitions. A variety of performance results on the Intel iPSC/860 are presented.

(1) Computer Science Department, Yale University, New Haven, CT 06520
(2) ICASE, NASA Langley Research Center, Hampton, VA 23665
(3) Computer Science Department, Rice University, Houston, TX 77251

Research supported by the National Aeronautics and Space Administration under NASA contract NAS1-18605 while the authors were in residence at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665, and by NSF grant ASC-8819374.



1 Introduction

On modern scalable multicomputers it is widely recognized that, in addition to detecting and exploiting available parallelism, reducing communication costs is crucial to achieving good performance. Existing systems such as DINO [34], Fortran D [12], Superb [44], and Id Nouveau [31] perform communication optimizations, such as message blocking, collective communication utilization, and message coalescing and aggregation, only in the presence of regular array reference patterns within loops. Parallel loop nests, however, often contain array references that cannot be analyzed at compile time. Such array references are classified as irregular. Methods are described here that deal with parallel loops and loops that contain reduction type output dependencies. The methods work for loops that do not contain cross-processor loop-carried dependencies or cross-processor loop-independent dependencies. A cross-processor dependence is one whose end points cross processors. A loop-carried dependence involves a write to a location in one iteration, followed by a read of the same location in a later iteration. A loop-independent dependence involves a write to a location followed by a read of the same location in the same loop iteration. Data parallelism is achieved by partitioning arrays across the nodes of the machine, with each processor performing computations on a part of the array. When parallelism is achieved by partitioning loop iterations between processors, cross-processor loop-independent dependences will not occur. Runtime optimization techniques have been developed that are designed to reduce communication costs for irregular references in the following ways:

- judicious partitioning of data and computational work,
- combining element messages into a larger message, thereby reducing the number of messages transmitted, and
- eliminating redundant communication of array elements.

To demonstrate that these optimizations can be performed automatically by a compiler, a prototype compiler called ARF (Arguably Fortran) was developed. ARF accepts a simplified Fortran 77 program enhanced with specifications for distributing data. It outputs a program that executes directly on the nodes of a distributed memory machine, in this case the Intel iPSC/860. The compiler partitions computations and analyzes array references to classify them as regular or irregular. For irregular references it performs runtime optimizations to reduce communication costs.

Since the development of ARF a significant amount of work has been done in standardizing extensions to the Fortran language. The High Performance Fortran Forum (HPFF), a joint effort between the academic community and industry, has agreed on a preliminary set of data parallel programming language extensions [16], [20]. It has been heavily influenced by experimental languages such as Fortran D [12], Vienna Fortran [45], Crystal [7], [24], [23], [25], Kali [22], DINO [32], and CM Fortran [9]. The HPFF decided to defer consideration of language extensions targeting irregular problems; over the next few years, the HPFF plans to consider possible irregular problem language extensions.

1.1 Overview of PARTI and ARF

These runtime optimizations are implemented using PARTI (Parallel Automated Runtime Toolkit at ICASE) runtime preprocessing procedures which can be embedded by the compiler. These procedures (1) support a shared name space, (2) provide the infrastructure needed to implement non-uniform data mappings efficiently, (3) coordinate interprocessor data movement, and (4) manage the storage of, and access to, copies of off-processor data.

The compiler consists of two distinct layers. The bottom layer is the library of PARTI runtime procedures that are designed to support irregular patterns of distributed array accesses efficiently. The top layer is a compiler that carries out program transformations by embedding calls to the PARTI primitives in the original program. PARTI procedures support a variety of operations on globally named distributed array indices. The distributed arrays can be partitioned in a non-uniform manner where each distributed array element is assigned to an arbitrary processor. The operations include off-processor data fetches, data stores, and accumulations to off-processor memory locations. A multicomputer program is generated in which all distributed memory accesses are carried out using embedded procedures. We emphasize that the goal of this project is not to develop a production quality compiler, but to demonstrate that run time optimizations can be generated automatically and efficiently by a compiler. Most of the complexity of this system is in the PARTI procedures. The PARTI procedures have been developed so that the transformations needed to embed the appropriate primitives can be implemented with relative ease in distributed memory compilers.

This paper begins with a description of the language that is accepted and how it relates to Fortran D. An outline of the compiler phases is given in Section 3. Section 4 describes the PARTI run time primitives that have been implemented and incorporated in the runtime systems employed by the compiler. Section 5 provides details of the code generation and optimizations performed by the compiler. The compiler is described in the context of two example code kernels. The kernels are written in ARF and are translated by the compiler to message passing code. Section 6 reports experimental performance measurements for the codes compiled by ARF. Section 7 describes the relationship between this research and other related projects.

2 Language Support

The ARF compiler was developed to demonstrate the feasibility of our approach to compiling irregular problems; consequently, the ARF language extensions are limited in scope. For the sake of clarity, we show how our language extensions are related to a real data-parallel language like Fortran D. We first describe the syntax of the Fortran D language extensions that provide the same functionality as the ARF extensions. We then go on to describe the corresponding ARF language extensions.

2.1 Fortran D Language

Fortran D is a version of Fortran 77 enhanced with data decomposition specifications. In this section, we present a brief description of the features of Fortran D that can be used to support irregular problems. ALIGN and DECOMPOSITION are two key Fortran D data decomposition constructs. A DECOMPOSITION is an abstract problem domain. ALIGN is used to map arrays with respect to a decomposition. ALIGN guarantees that if elements from different arrays are mapped to the same element of a decomposition, they will reside on the same processor. The simplest alignment occurs when an array is exactly mapped onto a decomposition. The DISTRIBUTE statement specifies the mapping of a decomposition to a physical machine. Distributions can be regular; for example, consider the BLOCK distribution. If we have n$proc processors and N elements in a decomposition (where n$proc divides N), BLOCK distribution divides a decomposition into contiguous chunks of size N/n$proc, assigning one block to each processor. Fortran D also allows user specified irregular distributions through the use of a mapping array, which itself is typically distributed. A mapping array contains processor numbers used to specify the processor that owns each individual decomposition element. Below is an example that specifies an irregular partitioning in Fortran D:

S1   REAL x(1000), y(1000)
S2   INTEGER map(1000)
S3   DECOMPOSITION reg(1000), irreg(1000)
S4   ALIGN map with reg
S5   DISTRIBUTE reg(BLOCK)
     ....

set values of map array using a mapping method ....

S6   ALIGN x, y with irreg
S7   DISTRIBUTE irreg(map)

In this example, arrays x and y are the data arrays and map is the mapping array. The array map is mapped onto decomposition reg (statement S4). Decomposition reg in turn is distributed by blocks across the processors (statement S5). Arrays x and y are aligned with the decomposition irreg (statement S6). Finally, decomposition irreg is irregularly partitioned across processors using the distributed mapping array map (statement S7). The result of the above statements is that array elements x(i) and y(i) are assigned to processor map(i).

It is sometimes convenient to ignore certain array dimensions when mapping an array to a decomposition. All array elements in the unassigned dimensions are collapsed and mapped to the same index of the decomposition. For instance, ALIGN z(i,j) with map(j) means that the second dimension of z is aligned with map; in this example it means that we map column j of z to processor map(j).

Fortran D also provides a directive, the on clause [21], to specify the processor that will execute each iteration of a loop. For instance, if we have n$proc processors, the loop

     do i=1,n on mod(i,n$proc)
       ....
     end do

assigns loop iterations to processors in a round-robin fashion.
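As an illustration of the two mappings just described, the following C sketch (not part of Fortran D or ARF; the function names, sizes and 0-based indexing are choices made here for illustration) computes the owner and local offset of a decomposition element under a BLOCK distribution, and the processor assigned to an iteration by the round-robin on clause.

    #include <stdio.h>

    /* Owner and local offset of global index g (0-based) under a BLOCK
       distribution of n elements over nproc processors (nproc divides n). */
    static void block_owner(int g, int n, int nproc, int *proc, int *offset)
    {
        int chunk = n / nproc;     /* elements per processor              */
        *proc   = g / chunk;       /* contiguous chunks in processor order */
        *offset = g % chunk;       /* position within the local chunk      */
    }

    /* Processor that executes iteration i under "on mod(i,n$proc)". */
    static int round_robin_owner(int i, int nproc)
    {
        return i % nproc;
    }

    int main(void)
    {
        int proc, offset;
        block_owner(612, 1000, 4, &proc, &offset);       /* hypothetical sizes */
        printf("global 612 -> processor %d, local offset %d\n", proc, offset);
        printf("iteration 17 -> processor %d\n", round_robin_owner(17, 4));
        return 0;
    }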

2.2 ARF Extensions

Distributed arrays declared in ARF can be partitioned between processors either in a regular manner (e.g. equal sized blocks of contiguous array elements assigned to each processor) or in an irregular manner. ARF extensions explicitly specify how each array is partitioned between processors; ARF does not make use of decomposition statements like the ones found in Fortran D. ARF arrays are irregularly partitioned across processors using a distributed mapping array. Below is an ARF code fragment that has the same effect as the first code fragment presented in the previous section. Irregular distribution is specified as follows:

S1:  distributed regular using block integer map(1000)
     ....
     set values of map array using a mapping method ...
S2:  distributed irregular using map real x(1000),y(1000)

Statement S1 declares an integer array map and states that the array will be distributed blockwise across processors. Statement S2 declares real arrays x and y and assigns array elements x(i) and y(i) to processor map(i). ARF arrays can only be distributed in a single dimension; the distributed dimension must be the last declared array dimension. For instance, the ARF statement:

S2   distributed irregular using map real x(10,1000)

would assign column i of x to processor map(i). ARF also contains an on clause; for example:

     distributed do i=1,n on partition

means that work associated with iteration i is to be carried out on processor partition(i).


3 Compiler Support for Irregular Computations

Compile time analysis can make it possible to generate highly efficient code when a compiler can gather enough information through analysis of the source code. In order to generate efficient code, a compiler needs to have tractable representations of array subscript functions. It also needs tractable representations of how data and computational work are to be partitioned [18], [15], [19], [35]. For instance, consider the Fortran D example:

S1   REAL x(1000), y(1000)
S2   DECOMPOSITION blocks(1000)
S3   ALIGN x,y with blocks
S4   DISTRIBUTE blocks(BLOCK)
S5   DO i=1,750
S6     x(i) = y(i+2)
S7   END DO

Assume that each processor is responsible for computing values of the data it owns (i.e. the owner computes rule [18]). If we have 4 processors, each processor will own a contiguous chunk of 250 elements of arrays x and y. Since the subscript function for x(i) is the identity, the owner computes rule implies that three of the four processors will each execute 250 iterations of loop S5. In this example, it is clear by inspection that non-local accesses occur during each processor's last two loop iterations. In addition, it is easy to determine which non-local data must be obtained. For instance, the processor responsible for loop iterations 1 through 250 will need the first two values of y stored on the processor responsible for loop iterations 251 through 500. A variety of researchers [18], [15] have implemented techniques to generate optimized calls to message passing routines given compile-time information about array subscript functions, array distribution and the distribution of loop iterations.

This paper deals with situations where compile time analysis fails because crucial information is not available until a program executes. There are a variety of applications in which array subscript functions cannot be known at compile time. In many cases, these subscript functions are given by integer arrays; consider the reference y(ia(i)) in the code fragment below.

S1   REAL x(1000), y(1000)
S2   INTEGER ia(1000)
S3   DECOMPOSITION blocks(1000)
S4   ALIGN x,y,ia with blocks
S5   DISTRIBUTE blocks(BLOCK)
     ....

ia gets assigned values at runtime ....

S6   do i=1,750
S7     x(i) = y(ia(i))
S8   end do

Compile time analysis is difficult to do when we have irregular array distributions or irregular partitions of loop iterations. In the example below, it is impossible to predict at compile time the data that needs to be communicated, because the distributions of x and y are not known until runtime.

S1   REAL x(1000), y(1000)
S2   INTEGER map(1000)
S3   DECOMPOSITION reg(1000), irreg(1000)
S4   ALIGN map with reg
S5   DISTRIBUTE reg(BLOCK)
     ....

set values of map array using a mapping method ....

S6   ALIGN x, y with irreg
S7   DISTRIBUTE irreg(map)
S8   do i=1,750
S9     x(i) = y(i+2)
S10  end do

The ARF compiler is able to handle parallel loops (marked by distributed do) in which array references have subscripts that are given by functions of:

1. the loop index,
2. scalars that are not redefined in the loop body, and
3. arrays indexed by just the loop index.

Examples of such index functions (assuming that i is the index of a distributed do loop) are: (i + 4), (i*i + 3), (ia(i) + i + a). Note that ia could be distributed in a regular or an irregular manner. The ARF compiler cannot, in general, handle loops in which reference patterns are not of this simple form. For instance, the compiler presented here could not deal with the following loop:

S1   distributed do i = 1,100 on partition
S2     do j = 1,num(i)
S3       col(j) = num(i-1) + j
S4       y(i) = y(i) + a(j)*x(col(j))
S5     end do
S6   end do

One difficulty arises in the reference x(col(j)) in statement S4. The values of the subscript array col(j) are computed in statement S3. Statement S3 in turn lies within a loop S2 whose upper bound is itself determined by the values taken on by array num. Das et al. [11] describe program slicing techniques that can be used to extend the methods described here to a broader set of constructs.

Except for one special case, the ARF compiler is unable to handle loops with loop carried dependencies. The special case involves accumulation type dependencies. The decision to include this special case greatly expands the number of irregular application codes to which these methods apply. The ARF compiler is able to recognize accumulations to an indirectly addressed array as shown in the following example.

S1   distributed do i = 1,100 on partition
S2     x(ia(i)) = x(ia(i)) + ...
S3   end do


The commutative and associative properties of the "+" operator allow the ARF compiler to postpone all accumulations to the distributed array x until the end of the loop computation.

3.1 The Inspectors/Executors

Inspectors and executors perform optimizations to reduce communication costs for non-local accesses arising from irregular array references. Each processor pre-computes which data it will have to send or receive. Communication volume can be reduced by pre-fetching a single copy of each off-processor datum, even if it is referenced several times. The number of messages can be reduced by pre-fetching large quantities of off-processor data in a single message.

3.2 Inspector

The inspector loop carries out the preprocessing needed to reduce the volume of communication and the number of messages transmitted. Figure 1 illustrates how the inspector is generated by the ARF compiler for a parallel loop. Hash tables, called hashed caches, are used for temporary storage. Run time primitives initialize the hashed caches, store and retrieve data from them, and flush the hashed caches when appropriate. During program execution, a hash table records off-processor fetches and stores, enabling us to recognize when more than one reference is being made to the same off-processor distributed array element. This way only one copy of that element must be fetched or stored. During the inspector phase, we carry out a set of interprocessor communications that allows us to anticipate exactly which send and receive communication calls each processor must execute before and after executing the loop.

To carry out the inspector loop described above, we must be able to find the owner of each distributed array element. Regular distributions are those that require only simple functions to compute the processor and local offset of a particular array element. For example, if a one dimensional array is distributed in a block manner, a simple function can be used to compute the processor and local offset of a particular array element. Irregular distributions, on the other hand, are those where we attempt to partition the data in a way that balances the following two objectives:

1. to have each processor perform approximately the same amount of work, and
2. to minimize communication overhead.

Foreach processor P
  - Generate a clone of the partitioned loop nest
  - Insert code to perform the following:
      Foreach rhs irregular array reference:
        generate list of off-processor data to be fetched
      Foreach lhs irregular array reference:
        generate list of data to be stored off-processor
      Exchange messages with other processors to determine copies of non-local
      data to be sent and received during the executor phase

Figure 1: Simplified Inspector for a single loop nest

Typically, it is not possible to express the resulting array partitions in a simple way. By allowing an arbitrary assignment of distributed array elements to processors, we take on the additional burden of maintaining a data structure that describes the partitioning. The size of this data structure must be the same as the size of the irregularly distributed array. We call this data structure a distributed translation table. Distributed translation tables are partitioned between processors in a simple manner (described in Section 4.3). Distributed translation tables are accessed during the inspector phase to determine where each data element resides. Once the preprocessing is completed, every processor knows exactly which non-local data elements it needs to send to and receive from the other processors. Once finished, we are in a position to carry out the necessary communication and computation.
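The duplicate-removal step that the inspector performs can be sketched in C as follows; the types and names are invented for illustration, and a small open-addressing table stands in for the PARTI hashed cache.

    #include <stdio.h>
    #include <stdlib.h>

    /* One off-processor reference: owning processor and local index there. */
    typedef struct { int proc, local; } Ref;

    /* Build the unique list of off-processor references for processor "me". */
    static int unique_offproc(const Ref *refs, int n, int me,
                              Ref *out /* capacity n */)
    {
        int tabsize = 2 * n + 1, nout = 0;
        Ref *tab = malloc(tabsize * sizeof(Ref));
        for (int i = 0; i < tabsize; i++) tab[i].proc = -1;   /* empty slot */

        for (int i = 0; i < n; i++) {
            if (refs[i].proc == me) continue;                 /* local access */
            int h = (refs[i].proc * 31 + refs[i].local) % tabsize;
            while (tab[h].proc != -1 &&
                   !(tab[h].proc == refs[i].proc && tab[h].local == refs[i].local))
                h = (h + 1) % tabsize;                        /* linear probe */
            if (tab[h].proc == -1) {                          /* first time seen */
                tab[h] = refs[i];
                out[nout++] = refs[i];
            }
        }
        free(tab);
        return nout;   /* number of distinct off-processor elements to fetch */
    }

    int main(void)
    {
        Ref refs[] = { {1,2}, {0,5}, {1,2}, {2,7}, {1,2} };   /* made-up trace */
        Ref out[5];
        int n = unique_offproc(refs, 5, /*me=*/0, out);
        for (int i = 0; i < n; i++)
            printf("fetch local index %d from processor %d\n",
                   out[i].local, out[i].proc);
        return 0;
    }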

3.3 Executor

The loop is transformed into an executor loop. Figure 2 outlines the steps involved (the nature of the distributed array distribution does not affect the executor). The initial data exchange phase follows the plan established by the inspector. When a processor obtains copies of non-local distributed array elements, the copies are written into the processor's hashed cache. Once the communication phase is over, each processor carries out its computation. Each processor uses locally stored portions of distributed arrays along with non-local distributed array elements stored in the hashed cache. When the computational phase is finished, distributed array elements to be stored off-processor are obtained from the hashed cache and sent to the appropriate off-processor locations.

Insert code before loop to
  - communicate local data to be referenced by other processors
  - receive non-local data to be referenced locally
Insert code inside loop to
  - obtain non-local data from the hashed cache
  - store non-local writes to the hashed cache
Insert code after loop to
  - update off-processor stores

Figure 2: Executor for a single loop nest

In the next section, we describe the details of the PARTI run time primitives that may be invoked during the inspector and executor phases.
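As a bridge to the primitives described next, the access test at the heart of the executor (the same test that appears later in Figures 12 and 15) can be reduced to a short C sketch. The names and sizes are invented, and gathered copies are consumed here in the order the inspector recorded them rather than being looked up in a hashed cache as PARTI does.

    #include <stdio.h>

    #define NLOCAL 4            /* locally owned elements (made-up size)        */
    #define NGHOST 2            /* off-processor copies fetched by the gather   */

    /* Return the value of a referenced element: PA/LA are the processor and
       local index produced by dereference; ghost[] holds gathered copies in
       the order the inspector listed them.                                    */
    static double fetch(int count, int me, const int *PA, const int *LA,
                        const double *local, const double *ghost, int *ghost_pos)
    {
        if (PA[count] == me)
            return local[LA[count]];        /* owned: read local memory        */
        else
            return ghost[(*ghost_pos)++];   /* copy deposited by gather phase  */
    }

    int main(void)
    {
        double local[NLOCAL] = { 1.0, 2.0, 3.0, 4.0 };
        double ghost[NGHOST] = { 10.0, 20.0 };              /* gathered copies  */
        int PA[] = { 0, 1, 0, 2 }, LA[] = { 2, 5, 0, 7 };   /* from dereference */
        int gpos = 0;
        double sum = 0.0;
        for (int count = 0; count < 4; count++)
            sum += fetch(count, /*me=*/0, PA, LA, local, ghost, &gpos);
        printf("sum = %g\n", sum);          /* 3 + 10 + 1 + 20 = 34 */
        return 0;
    }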

4 PARTI primitives

The PARTI run time primitives can be divided into three categories: primitives that may be invoked during the inspector phase, during the executor phase, or during both the inspector and executor phases. The scheduler primitive, invoked during the inspector phase, determines the send and receive calls that are needed during the executor phase. These calls may scatter data, gather data or perform reduction operations during the executor phase. The distributed translation table mentioned earlier is used during the inspector phase. The hashed cache primitives are used during both the inspector and executor phases. The next section describes the details of the scheduler, distributed translation table, scatter, gather, reduction, and hashed cache primitives.

4.1 The Scheduler Primitive

We will use a simple example to illustrate the preprocessing carried out by the scheduler. Assume we have a distributed array a that is partitioned among three processors in an irregular fashion, as depicted in Figure 3, and a loop computation such that the access pattern of array a is as shown in Figure 4. Each processor stores its elements of distributed array a in a local array a'. Thus processor P1 needs to fetch array element a(3), element a'(2) of the local array, from processor P2, and processors P2 and P3 need to fetch a(4), element a'(2) of the local array, from P1.

Figure 3: Mapping of a Global Array to Processors (the four elements of global array a assigned to processors P(1)-P(3), with the corresponding local array a' offsets)

Recall that the task of the scheduler is to anticipate exactly which send and receive communications must be carried out by each processor. The scheduler first determines how many messages each processor will have to send and receive during the data exchange that takes place in the executor phase. To gather this information, each processor needs to know the total number of processors executing the code. Defined on each processor Pi is an array nmsgs_i. Each processor sets its value of nmsgs_i(j) to 1 if it needs data from processor j, or to 0 if it does not. The scheduler then updates nmsgs on each processor with the element-by-element sum nmsgs_i(j) = sum over all processors k of nmsgs_k(j). This operation uses a fan-in tree to find the sums. At the end of the fan-in, the entries of nmsgs are identical on all processors. The value nmsgs(j) is equal to the number of messages that processor Pj must send during the exchange phase. In our example scenario, we see that at the end of the fan-in, the value of nmsgs on each processor will be [2,1,0] (Figure 5). Thus P1 is able to determine that it needs to send data to two other (as yet unspecified) processors, P2 needs to send data to one processor, and P3 does not need to send any data. At this point, each processor transmits to the appropriate processors a list of required array elements. This list contains the local offsets of the global array elements.

Figure 4: Irregular Access Pattern (the loop's access pattern over the four elements of global array a and the corresponding local array a' offsets)

In our example, P1 sends a message to P2 requesting element 2 of the local array a'; P2 and P3 send a message to P1 requesting element 2 of the local array a'. Each processor now has the information required to set up the send and receive messages that are needed to carry out the scheduled communications (Figure 6). The schedule generated by the scheduler can be reused. A schedule can be used to carry out identical patterns of data exchange on several different identically distributed arrays. The same schedule can be reused to carry out a particular pattern of data exchange on a single distributed array, and any of the data exchange primitives can make use of a given schedule.
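The element-by-element sum that produces nmsgs (Figure 5) can be sketched as follows; this is a single-process C simulation of the logic, whereas the actual scheduler forms the sum with a fan-in tree across the processors.

    #include <stdio.h>

    #define NPROC 3

    int main(void)
    {
        /* nmsgs[i][j] = 1 if processor i needs data from processor j. */
        int nmsgs[NPROC][NPROC] = {
            { 0, 1, 0 },   /* P1 needs data from P2 */
            { 1, 0, 0 },   /* P2 needs data from P1 */
            { 1, 0, 0 }    /* P3 needs data from P1 */
        };
        int sum[NPROC] = { 0, 0, 0 };

        /* Element-by-element sum; sum[j] = number of messages P(j+1) must send. */
        for (int i = 0; i < NPROC; i++)
            for (int j = 0; j < NPROC; j++)
                sum[j] += nmsgs[i][j];

        printf("messages to send per processor: [%d %d %d]\n",
               sum[0], sum[1], sum[2]);     /* [2 1 0], as in Figure 5 */
        return 0;
    }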

4.2 Data Exchange Primitives

Data exchangers can be called by each processor to:

- gather data from other processors,
- scatter data to other processors, or
- perform global reduction operations.

These exchangers use state information stored by the scheduler. As described in the previous section, the scheduler determines the send and receive calls needed to carry out data exchanges.

  P(1) needs data from P(2): input to sum tree [0 1 0]
  P(2) needs data from P(1): input to sum tree [1 0 0]
  P(3) needs data from P(1): input to sum tree [1 0 0]
  Output from sum tree, distributed to all processors: [2 1 0]

Figure 5: Computing the number of Send Messages

The scheduler is not given any information about memory locations; it involves only processors and local indices. When a processor P calls a data exchanger, it passes to the exchanger routine the starting address of the first local array element in its memory. We call this address A_P. The exchanger routines use A_P as the base address to read or write distributed array elements.

Figure 6: Final Message Pattern (the sending processors P(1)-P(3), the receiving processors, and the data sent in each message: element 2 of the sender's local array a')

4.3 The Translation Table

We allow users to assign globally numbered distributed array elements to processors in an irregular pattern, using a distributed translation table. Recall that the scheduler and the data exchangers deal with indices of arrays that are local to each processor. The translation primitives, however, assume that distributed array elements have been assigned global indices. The procedure build-translation-table constructs the distributed translation table. Each processor passes to build-translation-table the set of globally numbered indices for which it will be responsible. The distributed translation table may be striped or blocked across the processors. With a striped translation table, the translation table entry for global index i is stored on processor i mod P, where P represents the number of processors. In a blocked translation table, translation table entries are partitioned into a number of equal sized ranges of contiguous integers; these ranges are placed on consecutively numbered processors.

Dereference accesses the distributed translation table constructed by build-translation-table. For a given distributed array, dereference is passed a set of global indices that need to be accessed in the distributed memory. Dereference returns the processors and memory locations where the specified global indices are stored.

We will illustrate these primitives using a simple two processor example where Processor P1 is assigned indices 1 and 4, and Processor P2 is assigned indices 2 and 3. In this example, we assume that the translation table is partitioned between the two processors by blocks. We depict the translation table data structure in Table 1. Each entry of the translation table assigns a processor and a local array index to each globally indexed distributed array element. In our example, translation table information about global indices 1 and 2 is stored on Processor 1, while information about global indices 3 and 4 is stored on Processor 2.

Table 1: Translation Table Entries

               global   assigned   local
               index    processor  index
  Processor 1    1          1        1
                 2          2        1
  Processor 2    3          2        2
                 4          1        2

To continue our example, assume that both processors use the dereference primitive to find the assigned processors and local indices corresponding to particular global distributed array indices. In Table 2 we depict the results obtained when Processor 1 dereferences global indices 1 and 3, and Processor 2 dereferences global indices 2, 3 and 4.

Table 2: Results obtained from Dereference

  processor   global   assigned   local
  number      index    processor  index
      1         1          1        1
      1         3          2        2
      2         2          2        1
      2         3          2        2
      2         4          1        2
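A condensed C sketch of the translation table idea, with invented names: build records an owning processor and local index for every global index, and dereference looks them up. The real PARTI table is itself distributed (striped or blocked) and dereference involves communication; here the table is kept in one address space for clarity, using the two-processor example above with 0-based indices.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int proc, local; } TTEntry;

    /* Build a translation table for nglobal elements: owner[g] gives the
       processor assigned global index g (as the user's mapping array would);
       local indices are assigned in order of appearance on each processor. */
    static TTEntry *build_ttable(const int *owner, int nglobal, int nproc)
    {
        TTEntry *t = malloc(nglobal * sizeof(TTEntry));
        int *next_local = calloc(nproc, sizeof(int));
        for (int g = 0; g < nglobal; g++) {
            t[g].proc  = owner[g];
            t[g].local = next_local[owner[g]]++;
        }
        free(next_local);
        return t;
    }

    int main(void)
    {
        /* Two-processor example of Section 4.3: P0 owns global 0 and 3,
           P1 owns global 1 and 2 (a 0-based version of indices 1..4).   */
        int owner[4] = { 0, 1, 1, 0 };
        TTEntry *t = build_ttable(owner, 4, 2);

        int query[3] = { 0, 2, 3 };                /* dereference three indices */
        for (int i = 0; i < 3; i++)
            printf("global %d -> processor %d, local %d\n",
                   query[i], t[query[i]].proc, t[query[i]].local);
        free(t);
        return 0;
    }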

4.4 The Hashed Cache

The usefulness of the PARTI primitives described in Section 4 can be enhanced by coupling these primitives with hash tables. The hash table records the numerical value associated with each distributed array element. The hash table also records the processor and local index associated with the element. Dereference uses the hash table to reduce the volume of interprocessor communication. Recall that dereference returns the processor assignments and the memory locations that correspond to a given list of distributed array indices. Each distributed array index may appear several times in lists passed to dereference. The hash table is used to remove these duplicates. Lists of off-processor distributed array elements passed to the scheduler may contain multiple references to the same element. The scheduler uses the hash table to identify unique off-processor data references.

The data exchange procedures use hash tables to store copies of off-processor distributed array elements. The gather-exchanger fetches copies of off-processor distributed array elements and places the values in a hash table. Similarly, the scatter-exchanger obtains copies of off-processor distributed array elements from a hash table and writes the values obtained into a specified local array element on a designated processor. Primitives to support accumulations to non-local memory use hash tables in the same way as the scatter-exchanger. PARTI supplies a number of other primitives that support reading from, as well as writing and accumulating to, hash tables. When off-processor accumulations must be performed, we first carry out all possible accumulations to copies of distributed array elements in the hash table, and then we perform an accumulation data exchange.

We use a hash function that, for a hashed cache of size 2^k, masks the lower k bits of the key. The key is formed by concatenating the processor and local index pair that corresponds to a distributed array reference.
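The hash function just described might look as follows in C; the exact key layout is an assumption, since the paper only states that the processor and local index are concatenated and the lower k bits of the key are kept.

    #include <stdio.h>
    #include <stdint.h>

    /* Hashed cache of size 2^k: concatenate the (processor, local index)
       pair into one key and keep the low k bits as the bucket number.   */
    static unsigned hash_bucket(uint32_t proc, uint32_t local, unsigned k)
    {
        uint64_t key = ((uint64_t)proc << 32) | local;   /* concatenation     */
        return (unsigned)(key & ((1u << k) - 1));        /* mask lower k bits */
    }

    int main(void)
    {
        unsigned k = 10;                                 /* 2^10 = 1024 slots */
        printf("bucket for (proc 3, local 517) = %u\n", hash_bucket(3, 517, k));
        printf("bucket for (proc 3, local 518) = %u\n", hash_bucket(3, 518, k));
        return 0;
    }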

4.5 Summary of the PARTI primitives

In this section we summarize the PARTI primitives that we have described and present an example of how they are used. We consider the following PARTI procedure calls:

     ttable_id = build translation table(distribution, mapping, num_elements)
     call dereference(ttable_id, global_indices, processors, local_indices, num_indices)
     call setup hashed-cache(hashed-cache, processors, local_indices)
     call scheduler(id, n, hashed-cache, local_indices, processors)
     call gather-exchanger(id, hashed-cache, local-array)

In this example, a processor P arranges to obtain copies of specified off-processor data elements, and these copies are placed in the hash table hashed-cache. All processors call the build translation table function with the data mapping. This function returns a pointer to a structure which stores the data layout. P calls the dereference function to find the local addresses corresponding to the global indices it requires. The dereference call returns the processor number and local address corresponding to each of the global indices. P calls the function setup hashed-cache with the information returned by dereference to allocate the hash table. P passes to scheduler a list of off-processor local array indices. The scheduler will build a schedule that will make it possible for P to obtain n data elements. P will obtain data element i, 1 <= i <= n, from processor processors(i), at local index local_indices(i). A previously allocated hash table hashed-cache is used to eliminate duplicate off-processor indices. In most irregular problems, the data access pattern in loops is such that the same data point is referenced multiple times; partitioning of such loops causes duplicate off-processor references. The scheduler returns an integer id which is used by the subsequent call to gather-exchanger.

Each processor then calls gather-exchanger. On each processor, the gather-exchanger primitive is passed a pointer to the schedule (id) generated by the previous call to the scheduler, a pointer to the allocated hash table (hashed-cache), and the base address of its portion of the array local-array. After execution of the gather-exchanger call, copies of the off-processor elements of array local-array reside in the hash table hashed-cache.

5 The ARF Compiler

The ARF compiler transforms the source program into single program multiple data (SPMD) form. Data distribution specifications are used to partition the program and generate appropriate communication. The compiler incorporates the PARTI primitives to carry out the computations on each processor efficiently. The kernels presented here have been coded in Fortran 77, enhanced with ARF data distribution statements, compiled, and run on an iPSC/860. Section 6 presents performance data obtained from both kernels. We describe a compilation algorithm that is slightly more general than the algorithm actually used in the ARF compiler; the two algorithms produce equivalent code on the test data sets.

5.1 Code Generation by the ARF Compiler

The compiler uses distribution specifications to generate code to set up the distributed translation tables; calls to build translation table are embedded in the sequential code. One call is generated for each distribution. The translation table pointer for an array is stored in the symbol table. If the array is distributed in a regular manner, then the translation table contains a function which is evaluated at runtime to find the processor and local index of a particular datum. If the array is irregularly distributed, both the processor and the local index are stored explicitly in the distributed translation table for each index.

     do i = 1, n
       x(ia(i)) = y(ib(i))
     end do

Figure 7: Simple Irregular Loop

In order to describe the algorithm used to generate the inspector and executor for a loop, an s descriptor must be defined. An s descriptor is a tuple which gives the complete description of a subscript and consists of the following components:

     S_descriptor = (Array, Distribution, Type, SubscriptExpression)

where, for an s descriptor sd,

- sd(1) = Array: the name of the array indexed by the subscript,
- sd(2) = Distribution: identifies how the array is distributed (BLOCK, CYCLIC, IRREGULAR, etc.),
- sd(3) = Type: the type of reference in which the subscript expression is used; it can be any one of the exchanger types: gather, scatter or accumulation,
- sd(4) = List of subscript expressions: the expressions used to determine the array index. For our implementation we assume that only a single dimension is accessed, using the type of index functions shown in Section 3.

In Figure 7, arrays x, y, ia and ib are all distributed. The arrays ia and ib are used to index the arrays x and y respectively. At compile time it is not possible to figure out the indices of x and y that are accessed, because they depend on the values stored in the arrays ia and ib; the data access pattern becomes available only at runtime. For the algorithm, it is assumed that the loops do not have cross-processor loop carried dependencies. Later in this section we will describe how loops that contain reductions are handled. First, the basic algorithm to produce the inspector and executor for a given loop is presented. For any loop l,

- Find all array references. For the loop in Figure 7, the array references are x(ia(i)) and y(ib(i)).
- Using these references and the subscript expressions, form a list of s descriptors SD. For the loop shown in Figure 7 two s descriptors are generated, one for the reference x(ia(i)) and the other for y(ib(i)).

After generating the list SD, we are ready to generate the inspector and the executor code. For each sd in SD,

- Generate a declaration statement for a temporary array temp to store the values that will be assigned to the subscript corresponding to sd, i.e. sd(4), inside l. Note that for the two s descriptors generated for the example loop, the storing of the reference trace in a temporary array can be skipped and the arrays ia and ib can be used directly to do the dereferencing.
- Generate a clone of loop l, loop l', before l. The body of the loop l' consists of a statement that records into temp each value taken on by the subscript expression sd(4).
- Generate a call to dereference, passing the array temp and the translation table pointer associated with array sd(1). For the example loop the dereferencing is done with the arrays ia and ib.
- Next generate a call to the scheduler, using the arrays PA and LA that are returned by dereference, to form the schedule S.
- If sd(3) = gather, then a call to the gather-exchanger is generated using schedule S. At runtime this obtains the off-processor data and puts the data in the hash table HS. For the example loop the off-processor y values are gathered. If sd(3) = scatter, then a call to the scatter-exchanger is generated using schedule S. This call to the scatter-exchanger, at runtime, takes the data from the hash table HS and sends it to the other processors. For the example loop the data values from the array x are scattered. If sd(3) = accumulation, then a call to the scatter-op exchanger is generated using schedule S. This call to the scatter-op exchanger, at runtime, takes the data from the hash table HS and accumulates it into the appropriate locations on the other processors.
- Replace the subscript expression that indexes the array sd(1) inside the loop l by the temporary array temp.

     do i = 1, n
       n1 = nde(i,1)
       n2 = nde(i,2)
       x(n1) = x(n1) + y(n2)
     end do

Figure 8: Irregular Loop with Staged Indirect Indexing

The ARF compiler was tailored to recognize an idiom that is used to index distributed arrays in many irregular codes (see for example Figure 8). A programmer assigns an expression that would otherwise have been used to subscript an array reference to a scalar s; the scalar s is then used as an array subscript. In this type of indexing pattern, a scalar s is defined inside a loop and then used to index distributed arrays. More precisely,

- A scalar s is defined once in each iteration of the loop. The definition of s may be a function of:
    a. the loop index,
    b. scalars that are not defined in the loop body,
    c. arrays indexed by just the loop index.
- s is used to index the distributed dimension of distributed arrays in the loop body.

When one carries out forward substitution, subscript expressions in loops written using this idiom have the properties outlined in Section 3. Note that forward substitution transforms the example in Figure 8 into the example in Figure 7.

5.2 Optimizations

Two main optimizations are performed. The first optimization reduces the scheduling overhead by identifying sets of distributed array references that can make use of the same schedule. The second optimization reduces data transfer costs by identifying distributed array references that can make use of precisely the same exchanger invocation. These optimizations are carried out by sorting s descriptors into equivalence classes. Several distributed array references can share the same schedule as long as all arrays in question 1) are identically distributed and 2) have matching subscript expressions. A set of distributed array references can share the same exchanger call if all references have identical s descriptors. Table 3 summarizes these conditions.

Table 3: Optimization Patterns

  Optimization Name              Array       Distribution   Subscript    Type
                                                             Expression
  Common Schedule Elimination    Don't care  Match           Match        Don't care
  Common Exchanger Elimination   Match       Match           Match        Match
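The conditions in Table 3 amount to comparing s descriptors on different subsets of their fields, as in the following C sketch (the struct layout and names are invented): references may share a schedule when distribution and subscript expression match, and may share an exchanger call only when the entire descriptor matches.

    #include <stdio.h>
    #include <string.h>

    typedef enum { GATHER, SCATTER, ACCUMULATE } RefType;

    typedef struct {
        char  array[16];        /* sd(1): array name                  */
        char  distribution[16]; /* sd(2): BLOCK, CYCLIC, IRREGULAR... */
        RefType type;           /* sd(3)                              */
        char  subscript[32];    /* sd(4): canonical subscript text    */
    } SDescriptor;

    /* Common schedule elimination: distribution and subscript must match. */
    static int same_schedule(const SDescriptor *a, const SDescriptor *b)
    {
        return strcmp(a->distribution, b->distribution) == 0 &&
               strcmp(a->subscript, b->subscript) == 0;
    }

    /* Common exchanger elimination: every field must match. */
    static int same_exchanger(const SDescriptor *a, const SDescriptor *b)
    {
        return same_schedule(a, b) && a->type == b->type &&
               strcmp(a->array, b->array) == 0;
    }

    int main(void)
    {
        SDescriptor x = { "x", "IRREGULAR", GATHER,  "ia(i)" };
        SDescriptor y = { "y", "IRREGULAR", SCATTER, "ia(i)" };
        printf("share schedule:  %d\n", same_schedule(&x, &y));   /* 1 */
        printf("share exchanger: %d\n", same_exchanger(&x, &y));  /* 0 */
        return 0;
    }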

5.3 ARF Compiler Examples

In this section we present two examples that demonstrate how the ARF compiler works. Section 5.3.1 presents how ARF was used to program a distributed memory block sparse matrix vector multiply kernel. Section 5.3.2 presents an example from computational fluid dynamics.

5.3.1 Sparse Block Matrix Vector Multiply

Figure 10 presents an ARF program that carries out a block sparse matrix vector multiply. This kernel is from an iterative solver produced for a program designed to calculate fluid flow for geometries defined by an unstructured mesh [40]. The matrix is assumed to consist of 4 by 4 blocks of non-zero entries. Statements S4 and S5 are loops that sweep over the non-zero entries in each block. It is assumed that the array partition is passed to the sparse matrix vector multiply kernel after having been generated elsewhere.

Figure 11 presents the specification of the data decomposition for the sparse block matrix vector multiplication example written in Fortran D. If Fortran D is used to write the example, the only change to Figure 10 is the replacement of statements S1 and S2 with statements S1 through S10 from Figure 11. The array map in Figure 11 specifies the mapping of the data arrays: for each data array a single dimension is distributed and the rest are compressed. In Figure 10 the integer array partition is local to each processor and enumerates a list of indices assigned to the processor. As mentioned earlier, the current implementation partitions only one dimension, the last dimension of the array. PARTI primitives, however, support a broader class of array mappings [6]. Thus partition describes the partitioning of the last dimension of the arrays declared in statements S1 and S2. The ARF compiler uses the information in partition to make calls to primitives that initialize the distributed translation tables. These distributed translation tables are used to describe the mapping of x, y, cols, ncols and f (statements S1 and S2). The partitioning of computational work is specified in statement S3 using an on clause; in this example, the distributed array partition is used to specify the loop iterations to be carried out on each processor. The reference x(m,cols(j,i)) in S6 may require off-processor references. ARF consequently generates an inspector to produce a schedule and a hash table to handle accesses to the distributed array x. A reference to the irregularly distributed array f occurs in statement S6. Note that distributed array f is irregularly distributed using array partition and that partition is also used by the on clause to partition loop iterations in S3. Therefore, it can be deduced that the reference to f in statement S6 is on-processor; partition specifies how both distributed array elements and loop iterations are to be distributed between processors. A separate partitioning routine generates partition.

The ARF compiler generates an inspector and an executor to run on each processor. The inspector code executed on each processor is shown in Figure 9. Statement S1 shows the generation of the translation table using the partition array. Statement S2 shows the dereference call made to figure out the addresses of the various data elements. The next two statements in the inspector code generate the data communication schedule and the hash table structure.

S1   Tpartition = build translation table using the mapping defined by array partition
S2   call dereference to find processor assignments PA and local indices LA for
     consecutive references to x(m,cols(j,i)), employing Tpartition
S3   call setup hashed-cache(hashed-cache, PA, LA)
S4   call scheduler(id, n, hashed-cache, LA, PA)

Figure 9: Inspector generated from ARF for Sparse Block Matrix Vector Multiply

The executor generated by ARF on processor P is depicted in Figure 12. Fortran 90 notation is used where appropriate to enhance readability. Off-processor elements of x are gathered and placed in hash table H (step I, Figure 12). Values from x are obtained from H or from local memory (step IIa, Figure 12). Arrays PA and LA are used to distinguish local from off-processor array accesses. In step IIb, we accumulate to y. Note that the declarations in S1 and S3 in Figure 10 allow the compiler to determine that accumulations to y are local.

5.3.2 The Fluxroe Kernel

This kernel is taken from a program that computes convective fluxes using a method based on Roe's approximate Riemann solver [41], [42]; it is referred to as the Fluxroe kernel in this paper. Fluxroe computes the flux across each edge of an unstructured mesh. It accesses elements of array yold, carries out flux calculations, and accumulates results to array y. As was the case in the sparse block matrix vector multiply kernel, four sections of each array are distributed and accessed in an identical manner. Figure 13 depicts an outline of the Fluxroe kernel. The indices of the two vertices that comprise edge i are denoted n1 = node(1,i) and n2 = node(2,i). To compute the fluxes flux(k) across the ith edge, we access yold(k,n1) and yold(k,n2), for 1 <= k <= 4 (part I, Figure 13). Once the fluxes have been computed, we add the newly computed flux values flux(k) to y(k,n1) and subtract flux(k) from y(k,n2) (part III, Figure 13). Note that arrays y and yold are irregularly distributed using y-partition, and that distributed array node is irregularly distributed using edge-partition. Since the on clause in the distributed do statement also uses edge-partition to specify how loop iterations are to be partitioned, no off-processor references are made to node in part I, Figure 13. In the inspector, we compute a schedule Sn1 for the off-processor additions to y(k,n1) (part IIIa, Figure 13), and a different schedule Sn2 for the off-processor subtractions from y(k,n2) (part IIIb, Figure 13).

S1   distributed irregular using partition real*8 x(4,n), y(4,n), f(4,4,maxcols,n)
S2   distributed irregular using partition integer cols(9,n), ncols(n)
     ... initialization of local variables ...
S3   distributed do i = 1,n on partition
       do j = 1, ncols(i)
S4       do k = 1,4
           sum = 0
S5         do m = 1,4
S6           sum = sum + f(m,k,j,i)*x(m,cols(j,i))
           enddo
S7         y(k,i) = y(k,i) + sum
         enddo
       enddo
     distributed enddo

Figure 10: ARF Sparse Block Matrix Vector Multiply

S1   real*8 x(4,n), y(4,n), f(4,4,maxcols,n)
S2   integer cols(9,n), ncols(n)
S3   integer map(n)
S4   DECOMPOSITION reg(n), irreg(n)
S5   ALIGN map with reg
S6   ALIGN x(i,j), y(i,j), col(i,j) with map(j)
S7   ALIGN f(i,j,k,l) with map(l)
S8   ALIGN ncols(i) with map(i)
S9   DISTRIBUTE reg(BLOCK)
S10  DISTRIBUTE irreg(map)

Figure 11: Fortran D Data Distribution Statements for Sparse Block Matrix Vector Multiply


I.    call gather-exchanger using schedule S to obtain off-processor elements of x;
      gather-exchanger places the gathered data in hash table H

      count = 1
II.   for all rows i assigned to processor P
        do j = 1, ncols(i)
          do k = 1, 4
            sum = 0
IIa.        if (PA(count) == P) then
              vx(1:4) = x(1:4,LA(count))
            else
              use PA(count), LA(count) to get vx(1:4) from hash table H
            endif
            do m = 1, 4
              sum = sum + f(m,k,j,i)*vx(m)
            enddo
IIb.        y(k,i) = y(k,i) + sum
          enddo
          count = count + 1
        enddo

Figure 12: Executor generated from ARF for Sparse Block Matrix Vector Multiply

      distributed irregular using y-partition real*8 yold(4,Number-nodes), y(4,Number-nodes)
      distributed irregular using edge-partition integer node(2,Number-edges)
      ... initialization of local variables ...

      distributed do i = 1,Number-edges on edge-partition
I.      n1 = node(1,i)
        n2 = node(2,i)
        do k=1,4
Ia.       Va(k) = yold(k,n1)
Ib.       Vb(k) = yold(k,n2)
        enddo
II.     Calculate flux using Va(k), Vb(k)
III.    do k=1,4
IIIa.     y(k,n1) = y(k,n1) + flux(k)
IIIb.     y(k,n2) = y(k,n2) - flux(k)
        enddo
      distributed enddo

Figure 13: ARF Kernel From Riemann Solver

S1   Ty-partition = build translation table using the mapping defined by array y-partition
S2   call dereference to find processor assignments PAn1 and local indices LAn1 for
     consecutive references to y(k,n1), employing Ty-partition
S3   call dereference to find processor assignments PAn2 and local indices LAn2 for
     consecutive references to y(k,n2), employing Ty-partition
S4   call setup hashed-cache(hashed-cache-n1, PAn1, LAn1)
S5   call setup hashed-cache(hashed-cache-n2, PAn2, LAn2)
S6   call scheduler(id, n, hashed-cache-n1, LAn1, PAn1)
S7   call scheduler(id, n, hashed-cache-n2, LAn2, PAn2)

Figure 14: Inspector generated from ARF for Fluxroe Kernel

When parallelized, Fluxroe reads from, as well as accumulates to, off-processor distributed array locations. Any of the data exchange primitives can use the same schedule: schedule Sn1 can be used to gather off-processor references from yold(k,n1) (part Ia, Figure 13), and schedule Sn2 can be used to gather off-processor references from yold(k,n2) (part Ib, Figure 13).

The inspector code generated by the ARF compiler for the Fluxroe kernel is shown in Figure 14. Statement S1 shows the call to the build translation table function to store the information about how the array y is partitioned. Statements S2 and S3 are calls to the dereference function to find the addresses of the various references to the y array; both of these dereference calls use the translation table set up in statement S1. Statements S4 and S5 generate the hash table structures. The last two statements in the code fragment show the building of the communication schedules.

Figure 15 outlines the executor produced by ARF on processor P; Fortran 90 notation is used where appropriate to enhance readability. In steps Ia and Ib two sets of off-processor elements of yold are gathered using schedules Sn1 and Sn2. In step II the appropriate elements of yold are accessed either from local memory or from the appropriate hash table, and in step III the yold values are used to calculate fluxes. If the newly computed fluxes are to be accumulated to a local element of distributed array y, the appropriate addition or subtraction is carried out at once (steps IVa and IVc, Figure 15). When a flux must be accumulated to an off-processor element of y, the flux is accumulated to a copy of y stored in a hash table (steps IVb and IVd, Figure 15). When all fluxes have been calculated and all local accumulations completed, the scatter-add and scatter-subtract exchangers are called; these exchangers carry out the needed off-processor accumulations.

The current version of the ARF compiler attempts to minimize the number of schedules to be computed. A single schedule for all off-processor yold data accesses could have been produced; computing a single schedule for all references to yold would lead to a more efficient executor at the cost of a more expensive inspector.

5.4 Memory Utilization

In this section an overview is given of some of the memory requirements exacted by the methods described, along with suggestions of some ways in which these requirements can be reduced. Many sparse and unstructured programs use large integer arrays to determine reference patterns; in this respect, the kernels depicted here are typical. In Figure 10, a 9n element integer array cols is used for this purpose, while in Figure 13 an array node of size 2 * Number-edges is employed. The executors depicted in Figure 12 and Figure 15 replace cols and node with local arrays that store the processor assignments and the local indices for references to irregularly distributed arrays. In the kernel in Figure 10 the sum of the number of elements used on all processors to store both processor assignments and local indices is no larger than 18n. In Figure 13 the parallelized code uses a total of 4 * Number-edges elements.

The amount of additional storage needed for the parallelized code can be reduced in the following simple manner. The iterations I of a loop are divided into two disjoint sets. The first set of iterations is I_local, in which all memory references are to locally stored array elements. The second set is I_off-processor, in which each iteration contains some off-processor distributed array reference. In this case, listing processor assignments is necessary only for the loop iterations in I_off-processor. Since it is frequently possible to map problems so that most memory references are local to a processor, a substantial memory savings results.

The schemes described thus far would use very large quantities of extra memory when attempting to handle a loop in which a small number of distributed array elements are accessed many times.

Ia.   call gather-exchanger using schedule Sn1 to obtain the first set of off-processor
      elements of yold; gather-exchanger places the data in hash table Hn1
Ib.   call gather-exchanger using schedule Sn2 to obtain the second set of off-processor
      elements of yold; gather-exchanger places the data in hash table Hn2

      count = 1
      .... for edges i assigned to processor P
      do i = 1, Number of edges assigned to P
II.     if (PAn1(count) == P) then
          va(1:4) = yold(1:4,LAn1(count))
        else
          get va(1:4) from hash table Hn1
        endif
        if (PAn2(count) == P) then
          vb(1:4) = yold(1:4,LAn2(count))
        else
          get vb(1:4) from hash table Hn2
        endif
III.    .... calculate fluxes flux(1:4) using va(1:4) and vb(1:4)
IV.     if (PAn1(count) == P) then
IVa.      yold(1:4,LAn1(count)) = yold(1:4,LAn1(count)) + flux(1:4)
        else
IVb.      Accumulate flux(1:4) to hash table Hn1
        endif
        if (PAn2(count) == P) then
IVc.      yold(1:4,LAn2(count)) = yold(1:4,LAn2(count)) - flux(1:4)
        else
IVd.      Accumulate flux(1:4) to hash table Hn2
        endif
        count = count + 1
      enddo
Va.   Call scatter-add exchanger using schedule Sn1 and hash table Hn1
Vb.   Call scatter-subtract exchanger using schedule Sn2 and hash table Hn2

Figure 15: Executor generated from ARF for Fluxroe Kernel

For instance, consider the following loop, where f is a function defined so that 1 <= f(i) <= 2 for any i:

     distributed irregular partition y
     ......
     do i=1, HUGE-NUMBER
       .... y(f(i))
     end do

The reference pattern of distributed array y is determined by f. At most two distinct elements of y are referenced in the loop. Loops of this sort can be handled by using a hash table to store processor and local index assignments for each distinct memory reference. In this example, each processor would store processor and local index assignments for no more than two references to distributed array y. There is a performance penalty for using a hash table to find processor and local index assignments for distributed array elements. After examining a variety of sparse and unstructured codes, it was decided not to implement the method described in this section in the ARF compiler. See the analysis in [30] for the time and space tradeoffs outlined in this section.
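The first saving discussed in this section, splitting the iterations into I_local and I_off-processor, can be sketched as follows in C (the names and the sample dereference trace are invented): only the off-processor iterations keep their processor assignment and local index records.

    #include <stdio.h>

    #define NITER 6

    int main(void)
    {
        int me = 0;
        /* Owning processor of the single irregular reference in each iteration
           (a made-up trace produced by dereference).                           */
        int PA[NITER] = { 0, 0, 1, 0, 2, 0 };

        int remote_iter[NITER];        /* iterations that need PA/LA records */
        int nlocal = 0, nremote = 0;
        for (int i = 0; i < NITER; i++) {
            if (PA[i] == me) nlocal++;                    /* no bookkeeping needed */
            else             remote_iter[nremote++] = i;  /* keep PA/LA for these  */
        }

        printf("local iterations: %d, off-processor iterations: %d\n",
               nlocal, nremote);
        for (int i = 0; i < nremote; i++)
            printf("iteration %d fetches from processor %d\n",
                   remote_iter[i], PA[remote_iter[i]]);
        /* Only the nremote entries of PA (and the matching LA) must be stored,
           so a well-partitioned problem stores far fewer than NITER records.  */
        return 0;
    }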

6 Experimental Results

This section presents a range of performance data that summarizes the effects of preprocessing on measures of overall efficiency. Also discussed are the performance effects of problem irregularity and partitioning. The computational experiments employed the Fluxroe kernel and the block sparse matrix vector multiply kernel. Both kernels were coded in ARF; the parallelized benchmark numbers were obtained from programs generated by the ARF compiler. Note that the syntax accepted by the ARF compiler differs in minor ways from what was presented in previous sections. The experiments described in this paper used either a 32 processor iPSC/860 machine located at ICASE, NASA Langley Research Center, or a 128 processor iPSC/860 machine located at Oak Ridge National Laboratories. Each processor had 8 megabytes of memory. The Greenhill 1.8.5 Beta version C compiler was used to generate code for the 80860 processors.


6.1 Unstructured Mesh Data

Input data from a variety of unstructured meshes were used, including actual unstructured meshes obtained from aerodynamic simulations and synthetically generated meshes.

Unstructured Meshes from Aerodynamics: Two unstructured meshes generated from aerodynamic simulations were used.

Mesh A: A 21,672 element mesh generated to carry out an aerodynamic simulation involving a multi-element airfoil in a landing configuration [28]. This mesh has 11,143 points.

Mesh B: A 37,741 element mesh generated to simulate a 4.2% circular arc airfoil in a channel [14]. This mesh has 19,155 points.

Each mesh point is associated with an (x, y) coordinate in a physical domain. Domain information was used to partition each mesh in three different ways: strips, the orthogonal binary dissection algorithm [5], [13], and another mesh partitioning algorithm, jagged partitioning [38]. The partitioning of the meshes is done sequentially, and mapping arrays are generated for distribution of the data structures.

Synthetic Mesh from Templates: A finite difference template links K points in a square two-dimensional mesh. This connectivity pattern is distorted incrementally: random edges are introduced subject to the constraint that, in the new mesh, each point still requires information from K other mesh points. This mesh generator makes the following assumptions:

1. The problem domain consists of a 2-dimensional square mesh of N points,

2. Each point is initially connected to K neighbors determined by a finite difference template,

3. With probability q, each mesh link is replaced by a link to a randomly chosen mesh point.

Note that when q is equal to 0.0, no mesh links are modified and no changes are introduced. When q is equal to 1.0, a random graph is generated. Two templates are used. One template connects each point to its four nearest neighbors (K = 4); the other template connects each point to both its four nearest neighbors and to each of its four diagonal neighbors (K = 8).

The K = 4 template is referred to as a five point template and the K = 8 template as a nine point template. In the experiments described in this section, a 256 by 256 point mesh was employed.
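The generator just described can be sketched in a few lines of C; the edge-list layout and function name below are assumptions made for this example, and only the five point (K = 4) template is shown.

    /* Sketch (assumed layout): build the synthetic mesh described above, an
     * n x n grid with a five point template whose links are each replaced,
     * with probability q, by a link to a randomly chosen point (self links
     * are not excluded in this sketch).  Each link is stored once in the
     * endpoint arrays end1/end2; the edge count is returned. */
    #include <stdlib.h>

    int build_synthetic_mesh(int n, double q, int *end1, int *end2)
    {
        int nedges = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                int p = i * n + j;
                int nbr[2] = { (i + 1 < n) ? p + n : -1,    /* south neighbour */
                               (j + 1 < n) ? p + 1 : -1 };  /* east  neighbour */
                for (int k = 0; k < 2; k++) {
                    if (nbr[k] < 0) continue;
                    int other = nbr[k];
                    /* with probability q, rewire this link to a random point */
                    if ((double)rand() / RAND_MAX < q)
                        other = rand() % (n * n);
                    end1[nedges] = p;
                    end2[nedges] = other;
                    nedges++;
                }
            }
        }
        return nedges;   /* q = 0.0 keeps the grid; q = 1.0 gives a random graph */
    }

Setting q = 0.0 reproduces the regular grid used as the baseline in the experiments below, while intermediate values of q interpolate toward the random graph obtained at q = 1.0.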

6.2 Overall Performance

Data is presented to give an overview of the performance obtained on the iPSC/860 from the ARF compiler output. A block distributed translation table was used. Table 4 presents a) the inspector time: the time required to carry out the inspector preprocessing phase, b) the computation time: the time required to perform computations in the iterative portion of the program, and c) the communication time: the time required to exchange messages within the iterative portion of the program. The inspector time includes the time required to set up the needed distributed translation table as well as the time required to access the distributed translation table when carrying out preprocessing. Unstructured Meshes A and B were partitioned using orthogonal binary dissection. In these experiments, the ratio of the time required to carry out the inspector to the time required for a single iteration (computation time + communication time) ranged from a factor of 0.7 to a factor of 3.6. Most of the preprocessing time represents setup and use of the distributed translation table. For instance, consider the block matrix vector multiply on 64 processors using the 21,672 element mesh. The total preprocessing cost was 122 milliseconds, of which 111 milliseconds represent work related to the translation table.

Parallel efficiency for a given number of processors P is defined as the sequential time divided by the product of the execution time on P processors and P. The sequential time was measured using a separate sequential version of each kernel run on a single node of the iPSC/860. The algorithm of the sequential code was the same as that of the parallel code. Table 4, under the column single sweep efficiency, depicts the parallel efficiencies that would have been obtained had the kernel been preprocessed each time the calculations were carried out. In reality, preprocessing time can be amortized over multiple mesh sweeps. If the time required to preprocess the problem is neglected in computing parallel efficiencies, the second set of measurements, the executor efficiency, is obtained; it is also presented in Table 4. The executor efficiencies for 64 processors ranged from 0.48 to 0.59, while the single sweep efficiencies ranged from 0.10 to 0.17.

In the experiments depicted in Table 4, computing time is at least a factor of 2 greater than the communication time. The executor efficiencies are affected by the fact that the computations in the parallelized codes are carried out less efficiently than those in the sequential program: the parallel code spends time accessing the hashed cache, and it also needs to perform more indirections than the sequential program.
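Restated in symbols (the notation here is ours, not the paper's): with $T_{\text{seq}}$ the single-node sequential time and $T_{\text{insp}}$, $T_{\text{comp}}$, $T_{\text{comm}}$ the per-sweep inspector, computation, and communication times on P processors, the two measures reported in Table 4 are

$$E_{\text{single sweep}}(P) = \frac{T_{\text{seq}}}{P\,(T_{\text{insp}} + T_{\text{comp}} + T_{\text{comm}})}, \qquad E_{\text{executor}}(P) = \frac{T_{\text{seq}}}{P\,(T_{\text{comp}} + T_{\text{comm}})}.$$

For example, the 64-processor Fluxroe run on Mesh A has $T_{\text{comp}} + T_{\text{comm}} = 41 + 19 = 60$ ms and $E_{\text{executor}} = 0.48$, which implies $T_{\text{seq}} \approx 0.48 \times 64 \times 60\ \text{ms} \approx 1.8$ s.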

Table 4: Performance on different numbers of processors

  nprocs   inspector   comp        comm        single sweep   executor
           time (ms)   time (ms)   time (ms)   efficiency     efficiency
  Sparse Block Matrix Vector Multiply - Mesh A
    32        148         49           9          0.15           0.55
    64        122         25           9          0.10           0.48
  Sparse Block Matrix Vector Multiply - Mesh B
    32        200         85          10          0.19           0.59
    64        150         42           9          0.14           0.54
  Fluxroe - Mesh A
     8        231        310          24          0.40           0.69
    16        162        157          21          0.34           0.65
    32        135         80          22          0.19           0.57
    64        172         41          19          0.12           0.48
  Fluxroe - Mesh B
     8        393        534          23          0.41           0.70
    16        249        269          18          0.36           0.68
    32        191        156          23          0.28           0.62
    64        203         69          14          0.17           0.59

Table 5 summarizes the performance of the Fluxroe kernel for meshes with varying degrees of regularity and for varying mesh mappings. This experiment was conducted using 32 processors. Table 5 depicts synthetic meshes derived from 5 and 9 point stencils with probability of edge move q equal to either 0.0 or 0.4. These meshes were mapped by 1-D strips or by 2-D blocks. As one might expect for the synthetic meshes, the communications costs increase dramatically with increasing q. These dramatic increases arise because both the volume of communication required and the number of messages sent per node are much higher for large q. Preprocessing costs also increased with q, but while the communications costs went up by at least a factor of 16, preprocessing costs went up by at most a factor of 1.8. Table 5 also summarizes results from Meshes A and B, which were partitioned in three ways: strips, the orthogonal binary dissection algorithm, and jagged partitioning. Both binary dissection and the jagged partitioning algorithm break the domain into two-dimensional rectangular regions; the two methods produce very similar performance results.

Table 5: Performance on 32 processors with different meshes

                inspector   comp        comm        single sweep   executor
                time (ms)   time (ms)   time (ms)   efficiency     efficiency
  5 point template synthetic mesh partitioned into strips
    q=0.0          200         275          22         0.49           0.82
    q=0.4          310         293         361         0.25           0.37
  5 point template synthetic mesh partitioned into 2-D block
    q=0.0          398         275          15         0.35           0.84
    q=0.4          463         291         319         0.23           0.40
  9 point template synthetic mesh partitioned into strips
    q=0.0          211         583          21         0.58           0.80
    q=0.4          385         620         530         0.31           0.42
  9 point template synthetic mesh partitioned into 2-D block
    q=0.0          447         589          20         0.46           0.79
    q=0.4          595         624         527         0.28           0.42
  Mesh A
    binary         134          80          22         0.24           0.57
    jagged         135          81          22         0.24           0.56
    strips         148          83          26         0.22           0.53
  Mesh B
    binary         191         136          23         0.28           0.61
    jagged         186         137          21         0.28           0.62
    strips         219         149          31         0.24           0.54


6.3 Breakdown of Inspector Overhead

Table 6 summarizes the cost of dereferencing and scheduling the Fluxroe kernel on different numbers of processors using a blocked translation table. A five point template was used and the mesh was partitioned either into 1-D strips or into 2-D blocks. When the mesh is partitioned into strips, dereference involves mostly local data accesses since the domain data and the translation table are identically partitioned. When strip partitioning is used, translation table initialization does not involve communication. The measurements presented in Table 6 are defined in the following manner:

- Executor time is the computation and communication time required to execute the kernel; it does not include the time required for preprocessing,

- Table initialization time is the time needed to initialize the distributed translation table,

- Dereference time is the time taken by the dereference PARTI primitive, and

- Scheduler time is the time required to produce the communications schedule once the required processor locations and local indices have been found by dereference.

The majority of the costs incurred by the inspector are due to the translation table initialization and dereference (see Table 6). For instance, consider the case where 64 processors are used to carry out a sweep over a 2-D block partitioned mesh with a 5 point template. The translation table initialization and dereference together require 183 % of the executor time, while the generation of the schedule requires only 12 % of the executor time. In these problems communication costs comprise a small fraction of the executor time; consequently the method used to partition the domain does not have a significant performance impact on executor time. In Table 6, the costs of translation table initialization and of dereference are both strongly dependent on how the domain is partitioned. 2-D block partitioning leads to higher translation table related costs. This is almost certainly due to the increased communication required for translation table initialization and dereference. Strip partitioning per se does not necessarily lead to low translation table related costs: in Table 5 it is seen that strip partitioning actually leads to higher inspector costs for both Mesh A and Mesh B. The translation table is partitioned so that blocks of contiguously numbered indices are assigned to each processor. However, in Mesh A and Mesh B, mesh points are not numbered in a regular fashion, so the indices corresponding to a domain strip are not contiguously numbered.

Table 6: Cost of dereferencing and scheduling on different numbers of processors

  nprocs   executor    table init   dereference   schedule
           time (ms)   time (ms)    time (ms)     time (ms)
  5 point template synthetic mesh partitioned into strips
     8       1192         131           143           83
    16        606         115           109           42
    32        297          92            83           27
    64        167          63            62           17
  5 point template synthetic mesh partitioned into 2-D blocks
     8       1189         333           595           83
    16        599         192           311           42
    32        290         136           235           26
    64        158          77           212           19

6.4 Cost of the Translation Table

Section 4.3 discussed two straightforward ways to map a distributed translation table onto processors. Consider the question of how to distribute the translation table so as to minimize the costs associated with translation table access. Table 7 compares the time required to carry out dereference on blocked and striped translation tables by depicting:

- the time required to carry out a particular call to dereference,

- the average number of non-local accesses to table entries required by dereference, and

- the average number of non-local processors accessed during the call to dereference.

When the results for unstructured Meshes A and B are examined, no consistent performance difference is seen in the cost of dereferencing a blocked versus a striped translation table. Similar numbers of off-processor table entries need to be accessed for either translation table distribution. Blocked translation tables lead to superior performance when synthetic meshes are used. For the reasons described in Section 6.3, particularly good results are obtained when a mesh partitioned into strips is combined with a blocked translation table. It is of interest that the blocked translation table also proved superior when synthetic meshes partitioned into 2-D blocks are used.
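For concreteness, the sketch below shows how the owning processor and local offset of a translation-table entry might be computed under the two distributions. The notation is ours; "blocked" is taken to mean contiguous runs of indices per processor and "striped" is taken to mean a cyclic (round-robin) assignment, which is an assumption about the layouts compared in Table 7.

    /* Sketch: locate the translation-table entry for global index g
     * (numbered from 0) under a blocked or a striped distribution.
     * n is the distributed-array size, nprocs the number of processors. */

    /* Blocked: contiguous runs of indices live on each processor. */
    void blocked_owner(int g, int n, int nprocs, int *proc, int *offset)
    {
        int block = (n + nprocs - 1) / nprocs;   /* ceiling(n / nprocs) */
        *proc   = g / block;
        *offset = g % block;
    }

    /* Striped: indices are dealt out round-robin. */
    void striped_owner(int g, int n, int nprocs, int *proc, int *offset)
    {
        (void)n;                                 /* size not needed for cyclic */
        *proc   = g % nprocs;
        *offset = g / nprocs;
    }

Under the blocked layout, a processor that owns a contiguous run of mesh point indices also tends to own the table entries for those points, so dereference can stay local; under a cyclic layout almost every lookup lands on another processor, which is consistent with the 31 non-local processors reported in the striped columns of Table 7.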


Table 7: Cost of dereference on 32 processors

                          Indirect - Blocked              Indirect - Striped
  Problem          Time (ms)  Nonlocal  Nonlocal    Time (ms)  Nonlocal  Nonlocal
                              Data      Proc                   Data      Proc
  Synthetic: 5 point template, strip partition
    q=0               109        256        1          346       2232       31
    q=0.2             157       1045       17          365       2862       31
    q=0.4             218       1825       17          368       3350       31
  Synthetic: 5 point template, 2-D block partition
    q=0               235       2143        9          336       2078       31
    q=0.2             326       2841       25          355       2782       31
    q=0.4             330       3352       25          370       3273       31
  Mesh A
    binary             97        768       21           96        743       31
    jagged             98        772       20           98        751       31
    strips            109        860       29          102        843       31
  Mesh B
    binary            130       1271       24          122       1230       31
    jagged            139       1293       24          130       1263       31
    strips            159       1519       31          172       1513       31

6.5 Scheduler and Data Exchanger Performance

To quantify the communications costs incurred by the PARTI scheduler and data exchange primitives, the times required to carry out the scheduler, gather-exchanger and scatter-exchanger procedure calls were measured and compared against hand-coded versions using the iPSC/860-supplied sends and receives. The sends and receives communicated the same amount of data as did the PARTI procedures. An experiment was conducted in which two processors repeatedly exchanged W single precision words of information. The exchange was carried out using gather-exchangers, scatter-exchangers and the iPSC/860-supplied send and receive calls. Table 8 summarizes the results of these experiments. Presented are: the time (in milliseconds) required to carry out the requisite data exchange using send and receive messages, and the ratios between the times taken by the scheduler and gather-exchanger PARTI primitive calls and the time taken by the equivalent send and receive calls. The scatter-exchanger calls were also timed; the results were virtually identical to those of the corresponding gather-exchanger calls. The gather-exchanger cost exceeded that of the explicitly coded send/receive pairs moving W words of information between two processors by no more than 20%. The additional overhead required for the scheduler to carry out the data exchange was a factor of 2.1 to 1.0 times the cost of using explicitly coded send/receive pairs to move W words.

Table 8: Overheads for PARTI Scheduler and Gather-Exchanger Primitives

  Number of       Send/Receive   Gather-Exchanger   Scheduler
  Data Elements   Time (ms)      (ratio)            (ratio)
      100             0.5            1.0               2.1
      400             1.0            1.1               1.4
      900             1.8            1.1               1.3
     1600             2.9            1.2               1.3
     2500             4.3            1.2               1.1
     3600             6.0            1.2               1.0
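A ping-pong loop in the spirit of this two-processor exchange is sketched below. MPI is used purely as a stand-in for the machine-specific send and receive calls used in the paper, and the word count and repetition count are arbitrary choices for the example.

    /* Sketch: time the repeated exchange of W single precision words between
     * two processes, analogous to the send/receive baseline of Table 8. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, W = 900, reps = 1000;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        float *buf = malloc(W * sizeof(float));   /* W single precision words */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, W, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, W, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, W, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, W, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("exchange of %d words: %.3f ms\n", W, 1e3 * (t1 - t0) / reps);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Timing this loop against an equivalent loop built from the scheduler and gather-exchanger calls, on the same W words, gives ratios analogous to those reported in Table 8.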

7 Relation to Other Work

Programs designed to carry out a range of irregular computations [2, 26, 4, 43, 13], including sparse direct and iterative methods, require many of the optimizations described in this paper. Several researchers have developed programming environments that target particular classes of irregular or adaptive problems. Williams [43] describes a programming environment (DIME) for calculations with unstructured triangular meshes on distributed memory machines. Baden [3] has developed a programming environment targeting particle computations, which provides facilities that support dynamic load balancing. One of the key distinctions between the present work and that of Baden and Williams is that the PARTI runtime support is designed to be used by compilers to handle parallel loops with irregular array references. In addition, it can be used by programmers in a wide range of applications. By contrast, programming environments such as those described by Baden and Williams are highly customized for use in specific application areas.

There are a variety of compilers targeting distributed memory multiprocessors [44, 8, 33, 31, 1, 39]. With the exception of the Kali project [22], and the PARTI work described here and in [36, 29, 37], these compilers do not attempt to deal efficiently with loops having irregular references. The work described in this paper is also related to schemes that carry out distributed memory runtime parallelization [29, 27]. These schemes are more ambitious than those described in this paper; they include mechanisms to carry out runtime partitioning and parallelization. Chen [27] suggests an optimization similar to one described here: she proposed reducing scheduling overheads by identifying distributed array references for which one can employ identical schedules.

At this point, only hand coding based timing experiments have been carried out to study the schemes proposed in [29, 27]. The prototype compiler described here is able to generate code capable of efficiently handling kernels with parallel loops containing irregular array references. The procedures that carry out runtime optimizations are coupled to a distributed memory compiler via a set of compiler transformations.

The compiler described and tested in this paper is qualitatively different from the efforts cited above in a number of important respects. Mechanisms have been developed and demonstrated that support irregularly distributed arrays, making it possible to map data and computational work in an arbitrary manner. Because irregularly distributed arrays can be supported, it was possible to compare the performance effects of different problem mappings. Support for arbitrary distributions was proposed in [29, 37], but this is the first implementation of a compiler-based distributed translation table mechanism for irregular scientific problems. Many unstructured NASA codes must carry out data accumulations to off-processor memory locations. One of the demonstration kernels addressed this, and the primitives and the compiler were designed to handle this situation. This compiler effort is unique in its ability to carry out irregular patterns of off-processor data accumulation efficiently. The primitives are augmented with a hash table designed to eliminate duplicate data accesses. In addition, the hash table manages copies of off-processor array elements. Other researchers have used different data structures for the management of off-processor data copies [22].

8 Conclusion

This paper described and experimentally characterized a compiler and runtime support procedures which embody methods capable of handling an important class of irregular problems that arise in scientific computing. After examining a number of complete NASA codes, two kernels were extracted to demonstrate the methods. Both of these kernels involved computations over unstructured meshes. Both kernels were coded in ARF, a dialect of Fortran, and the ARF compiler generated code to run on the nodes of the iPSC/860. Detailed timings were carried out on both kernels using unstructured meshes from aerodynamics, along with meshes that were generated by using random numbers to incrementally distort matrices obtained from a fixed finite difference template. This benchmarking suite stressed the communications capabilities of the iPSC/860 and the PARTI primitives in a variety of ways.

In the experiments reported in Section 6.2, the ratio of the time required to carry out all preprocessing to the time required for a single iteration of either kernel ranged from a factor of 0.7 to a factor of 3.6. In Section 6.3 it was shown that the majority of the preprocessing costs arose from the need to support irregularly distributed arrays. In Section 6.5 the performance of the scheduler and data exchanger PARTI primitives was quantified; the data exchangers demonstrated a maximum increase of 20% over the analogous send and receive calls provided by Intel.

One of the virtues of the layered approach to distributed memory compiler design is the capture of a set of critical optimizations in the runtime support primitives. These primitives, and hence these optimizations, can be migrated to a variety of compilers targeting distributed memory multiprocessors. It is intended to implement these primitives in the ParaScope parallel programming environment [17]. In addition, PARTI primitives can be, and are being, used directly by programmers in applications codes [6], [10]. The applications described in [10] are particularly noteworthy: they are explicit and multigrid unstructured Euler solvers which were employed to compute flows over full aircraft configurations. The explicit unstructured Euler solver achieved a computational rate of 1.5 Gflops on 512 processors of the Intel Touchstone Delta. The multigrid unstructured Euler solver achieved a computational rate of 1.2 Gflops on 512 Delta processors. In both cases, the cost of the inspector's preprocessing was approximately equal to the cost of a single iteration of the Euler solver, amounting to less than 3 % of the total time.

Most of the complexity in this system lies in the PARTI procedures. The PARTI procedures have been developed so that the transformations needed to embed the appropriate primitives can be implemented with relative ease in distributed memory compilers. The primitives used to implement the runtime support include communications procedures designed to support irregular patterns of distributed array access, and procedures to find the location of irregularly mapped distributed array data using distributed translation tables. Primitives also support the maintenance of hash tables that store copies of off-processor data.

9 Acknowledgements

We would like to thank Harry Jordan, Bob Voigt and Donna Meisel for their careful editing of this manuscript. We would also like to thank the Advanced Computing Laboratory at Oak Ridge National Laboratories and NAS at NASA Ames for providing access to the 128 node Intel iPSC/860 hypercubes. We wish to thank Dimitri Mavriplis and David Whitaker for supplying unstructured meshes, and David Whitaker and P. Venkatkrishnan for access to their codes.

References

[1] F. Andre, J.-L. Pazat, and H. Thomas, PANDORE: A system to manage data distribution, in International Conference on Supercomputing, June 1990, pp. 380-388.

[2] C. Ashcraft, S. C. Eisenstat, and J. W. H. Liu, A fan-in algorithm for distributed sparse numerical factorization, SISSC, 11 (1990), pp. 593-599.

[3] S. Baden, Programming abstractions for dynamically partitioning and coordinating localized scientific calculations running on multiprocessors, SIAM J. Sci. and Stat. Computation, 12 (1991).

[4] D. Baxter, J. Saltz, M. Schultz, S. Eisenstat, and K. Crowley, An experimental study of methods for parallel preconditioned Krylov methods, in Proceedings of the 1988 Hypercube Multiprocessor Conference, Pasadena, CA, January 1988, pp. 1698-1711.

[5] M. J. Berger and S. H. Bokhari, A partitioning strategy for PDEs across multiprocessors, in The Proceedings of the 1985 International Conference on Parallel Processing, August 1985.

[6] H. Berryman, J. Saltz, and J. Scroggs, Execution time support for adaptive scientific algorithms on distributed memory architectures, Concurrency: Practice and Experience, 3 (1991), pp. 159-178.

[7] M. C. Chen, A design methodology for synthesizing parallel algorithms and architectures, Journal of Parallel and Distributed Computing, (1986), pp. 116-121.

[8] A. Cheung and A. P. Reeves, The Paragon multicomputer environment: A first implementation, Tech. Rep. EE-CEG-89-9, Cornell University Computer Engineering Group, Cornell University School of Electrical Engineering, July 1989.

[9] Thinking Machines Corporation, CM Fortran reference manual, Tech. Rep. version 1.0, Feb. 1991.

[10] R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy, The design and implementation of a parallel unstructured Euler solver using software primitives, AIAA-92-0562, in Proceedings of the 30th Aerospace Sciences Meeting, Jan. 1992.

[11] R. Das, J. Saltz, and R. v. Hanxleden, Slicing analysis and indirect access to distributed arrays, in Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing, Springer-Verlag, Aug. 1993, pp. 152-168. Also available as University of Maryland Technical Report CS-TR-3076 and UMIACS-TR-93-42.

[12] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu, Fortran D language specification, Tech. Rep. TR90-141, Dept. of Computer Science, Rice University, Dec. 1990.

[13] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Computers, Prentice-Hall, Englewood Cliffs, New Jersey, 1988.

[14] Numerical methods for the computation of inviscid transonic flows with shock waves - a GAMM workshop, in Notes on Numerical Fluid Mechanics, vol. 3.

[15] M. Gerndt, Updating distributed variables in local computations, Concurrency: Practice and Experience, 2 (1990), pp. 171-193.

[16] High Performance Fortran Forum, High Performance Fortran language specification, Scientific Programming, 2 (1993), pp. 1-170.

[17] S. Hiranandani, K. Kennedy, and C. Tseng, Compiler support for machine-independent parallel programming in Fortran D, in Compilers and Runtime Software for Scalable Multiprocessors, J. Saltz and P. Mehrotra, eds., Elsevier, Amsterdam, The Netherlands, to appear 1991.

[18] S. Hiranandani, K. Kennedy, and C.-W. Tseng, Compiler optimizations for Fortran D on MIMD distributed-memory machines, in Proceedings Supercomputing '91, IEEE Computer Society Press, Nov. 1991, pp. 86-100.

[19] C. Koelbel, Compiling Programs for Nonshared Memory Machines, PhD thesis, Purdue University, West Lafayette, IN, Aug. 1990.

[20] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, Jr., and M. Zosel, The High Performance Fortran Handbook, 1994.

[21] C. Koelbel and P. Mehrotra, Compiling global name-space programs for distributed execution, Report 90-70, ICASE, 1990.

[22] C. Koelbel, P. Mehrotra, and J. V. Rosendale, Supporting shared data structures on distributed memory architectures, in 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, Mar. 1990, pp. 177-186.

[23] J. Li and M. Chen, Generating explicit communication from shared-memory program references, in Proceedings Supercomputing '90, November 1990.

[24] J. Li and M. Chen, Index domain alignment: Minimizing cost of cross-references between distributed arrays, in Proceedings of the 3rd Symposium on the Frontiers of Massively Parallel Computation, October 1990.

[25] J. Li and M. Chen, Automating the coordination of interprocessor communication, in Programming Languages and Compilers for Parallel Computing, MIT Press, Cambridge, Mass., 1991.

[26] J. W. Liu, Computational models and task scheduling for parallel sparse Cholesky factorization, Parallel Computing, 3 (1986), pp. 327-342.

[27] L. C. Lu and M. Chen, Parallelizing loops with indirect array references or pointers, in Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, CA, August 1991.

[28] D. J. Mavriplis, Multigrid solution of the two-dimensional Euler equations on unstructured triangular meshes, AIAA Journal, 26 (1988), pp. 824-831.

[29] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and K. Crowley, Principles of runtime support for parallel processors, in Proceedings of the 1988 ACM International Conference on Supercomputing, July 1988, pp. 140-152.

[30] S. Mirchandaney, J. Saltz, P. Mehrotra, and H. Berryman, A scheme for supporting automatic data migration on multicomputers, in Proceedings of the Fifth Distributed Memory Computing Conference, Charleston, S.C., 1990.

[31] A. Rogers and K. Pingali, Process decomposition through locality of reference, in Proceedings of the SIGPLAN '89 Conference on Program Language Design and Implementation, Portland, OR, June 1989.

[32] M. Rosing and R. Schnabel, An overview of Dino - a new language for numerical computation on distributed memory multiprocessors, Tech. Rep. CU-CS-385-88, University of Colorado, Boulder, 1988.

[33] M. Rosing, R. Schnabel, and R. Weaver, Expressing complex parallel algorithms in Dino, in Proceedings of the 4th Conference on Hypercubes, Concurrent Computers and Applications, 1989, pp. 553-560.

[34] M. Rosing, R. Schnabel, and R. Weaver, Massive parallelism and process contraction in Dino, in Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, Dec. 1989.

[35] M. Rosing, R. B. Schnabel, and R. P. Weaver, The DINO parallel programming language, Journal of Parallel and Distributed Computing, 13 (1991), pp. 30-42.

[36] J. Saltz and M. Chen, Automated problem mapping: the Crystal runtime system, in The Proceedings of the Hypercube Microprocessors Conference, Knoxville, TN, September 1986.

[37] J. Saltz, K. Crowley, R. Mirchandaney, and H. Berryman, Run-time scheduling and execution of loops on message passing machines, Journal of Parallel and Distributed Computing, 8 (1990), pp. 303-312.

[38] J. Saltz, S. Petiton, H. Berryman, and A. Rifkin, Performance effects of irregular communications patterns on massively parallel multiprocessors, Report 91-12, ICASE, 1991; to appear in Journal of Parallel and Distributed Computing.

[39] P. S. Tseng, A Parallelizing Compiler for Distributed Memory Parallel Computers, PhD thesis, Carnegie Mellon University, Pittsburgh, PA, May 1989.

[40] P. Venkatkrishnan, J. Saltz, and D. Mavriplis, Parallel preconditioned iterative methods for the compressible Navier-Stokes equations, in 12th International Conference on Numerical Methods in Fluid Dynamics, Oxford, England, July 1990.

[41] D. L. Whitaker and B. Grossman, Two-dimensional Euler computations on a triangular mesh using an upwind, finite-volume scheme, in Proceedings AIAA 27th Aerospace Sciences Meeting, Reno, Nevada, January 1989.


[42] D. L. Whitaker, D. C. Slack, and R. W. Walters, Solution algorithms for the two-dimensional Euler equations on unstructured meshes, in Proceedings AIAA 28th Aerospace Sciences Meeting, Reno, Nevada, January 1990.

[43] R. D. Williams and R. Glowinski, Distributed irregular finite elements, Tech. Rep. C3P 715, Caltech Concurrent Computation Program, February 1989.

[44] H. Zima, H.-J. Bast, and M. Gerndt, SUPERB: A tool for semi-automatic MIMD/SIMD parallelization, Parallel Computing, 6 (1988), pp. 1-18.

[45] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, and A. Schwald, Vienna Fortran - a language specification, version 1.1, Interim Report 21, ICASE, NASA Langley Research Center, Mar. 1992.

