Runtime Support to Parallelize Adaptive Irregular Programs

Yuan-Shin Hwang    Bongki Moon    Shamik Sharma    Raja Das    Joel Saltz
Abstract
This paper describes how a runtime support library can be used as compiler runtime support in irregular applications. The CHAOS runtime support library carries out optimizations designed to reduce communication costs by performing software caching, communication coalescing, and inspector/executor preprocessing. CHAOS also supplies special-purpose routines to support specific types of irregular reductions, as well as runtime support for partitioning data and work between processors. A number of adaptive irregular codes have been parallelized using the CHAOS library, and performance results from these codes are also presented in this paper.
1 Introduction
Recently there have been major efforts in developing programming language and compiler support for distributed memory parallel machines. Programming languages such as High Performance Fortran, Fortran-D, Fortran 90D, and Vienna Fortran allow users to make use of a single-threaded data-parallel programming model. Compilers for these languages typically produce Single Program Multiple Data (SPMD) code with calls to runtime support library procedures. These library procedures must efficiently support the address translations and data movements that occur when a globally indexed program is mapped onto a multiple processor architecture.

Many applications make extensive use of indirectly indexed arrays; such applications are sometimes referred to as irregular applications. Examples of irregular applications include unstructured computational fluid dynamics solvers, molecular dynamics codes (CHARMM, AMBER, GROMOS), Direct Simulation Monte Carlo codes, diagonal or polynomial preconditioned iterative linear solvers, and particle-in-cell (PIC) codes.

The focus of this paper is to describe a runtime support library that can be used as compiler runtime support in irregular applications. This paper does not describe how this runtime support is integrated into compilers; discussions of irregular problem compilation may be found elsewhere [5], [7], [16], [15], [17], [1], [8].

Figure 1 illustrates a typical irregular loop. The access patterns of arrays x and y are determined by the values assigned to the elements of arrays ia and ib. Integer arrays used to index other arrays (e.g. ia and ib in Figure 1) are called indirection arrays. Runtime support libraries use indirection array values to carry out a variety of optimizations.
(Authors' affiliation: UMIACS and Computer Science Dept., University of Maryland, College Park, MD 20742; [email protected]. This work was sponsored in part by ARPA (NAG-1-1485), NSF (ASC 9213821), ONR (SC292-1-22913), and NASA (NAG-11560). Thanks to DCRT/NIH for access to the Intel iPSC/860 and to MCS/ANL for access to the IBM SP-1.)
These optimizations [6] include support for partitioning data and work between processors and communication optimizations (Section 2.2). The goal of this partitioning is to balance the computational load and to reduce the net communication volume. The communication optimizations include:

Software caching. During the execution of irregular loops, the same off-processor data may be accessed repeatedly. It is often profitable to identify cases in which only a single copy of that data must be fetched from off-processor, even though the same off-processor data is accessed multiple times.

Communication coalescing. Many data items destined for the same processor can frequently be collected into a single message. This optimization is sometimes called communication coalescing; its object is to reduce the number of message startups.

Inspector/executor preprocessing. The costs required to carry out communication optimizations can be amortized when a communication pattern is repeated several times. In some iterative codes, some indirection arrays change from iteration to iteration while other indirection arrays remain unchanged. It is possible to incrementally revise the results of inspector/executor preprocessing (Section 2.5.1).

Reductions. There are a number of common situations in which applications programmers are willing to tolerate a degree of ambiguity in operation order, and it is possible to take advantage of this ambiguity when designing optimized runtime support. The loop shown in Figure 1 provides one example in which an applications programmer allows additions to be carried out in an arbitrary order. In Figure 1, arrays ia and ib may be defined so that, for distinct integers i and j, ia(i) = ia(j) but ib(i) ≠ ib(j). Applications programmers may decide to ignore round-off error and allow the additions to be carried out in any order.

Another example of a reduction can be seen when a programmer chooses to view an array as an implementation of a set. This situation arises in DSMC codes. A simplified loop that might be associated with such a code is shown in Figure 2. Here, elements are moved across the rows of a 2-D array based on the information provided in indirection array icell. The elements of array data are shuffled and stored in array new_data. While the total number of rows remains the same after the shuffle, the sizes of individual rows may not. At the programming language level this process can be viewed as the first dimension being statically distributed while the second dimension is compressed; elements migrate across the first dimension. Figure 3 shows an example of such data movement. In this application, the order in which elements are appended to the new rows does not matter. The inspector/executor preprocessing designed to deal with this situation is called set-append scheduling; it is described in Section 2.5.2.

This paper presents a new set of runtime procedures designed to efficiently implement adaptive programs on distributed memory machines. This runtime library is called CHAOS; it subsumes PARTI, a library aimed at static irregular problems [20]. CHAOS has been used to parallelize two challenging real-life adaptive applications: CHARMM, a molecular dynamics code, and a Direct Simulation Monte Carlo (DSMC) code. Section 2 describes the runtime support to parallelize adaptive irregular programs. Section 3 demonstrates the performance of the runtime support library for real application codes. Related work and conclusions are presented in Section 4.
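To make software caching and communication coalescing concrete, the sketch below examines one indirection array of a loop like that in Figure 1 on a single processor. This is a hedged illustration only, not CHAOS code; the BLOCK distribution, the sample values, and names such as seen and req are invented. Each distinct off-processor index is recorded once (software caching), and the requests are grouped so that each owning processor receives a single message (coalescing):

      program inspector_sketch
        implicit none
        integer, parameter :: nproc = 4, nglobal = 100, myid = 0
        integer, parameter :: blk = nglobal / nproc          ! assume a BLOCK distribution of y
        integer :: ib(10), owner, i, p
        logical :: seen(nglobal)                             ! software cache: fetch each index only once
        integer :: req(nglobal, 0:nproc-1), nreq(0:nproc-1)  ! coalesced request list per owner

        ib = (/ 7, 2, 9, 1, 3, 7, 55, 55, 80, 9 /)           ! sample indirection array values
        seen = .false.
        nreq = 0
        do i = 1, size(ib)
          owner = (ib(i) - 1) / blk                          ! owning processor under BLOCK
          if (owner /= myid .and. .not. seen(ib(i))) then    ! off-processor and not yet cached
            seen(ib(i)) = .true.
            nreq(owner) = nreq(owner) + 1
            req(nreq(owner), owner) = ib(i)                  ! coalesce into one list per owner
          end if
        end do
        do p = 0, nproc - 1                                  ! one message per owner, not per reference
          if (nreq(p) > 0) print *, 'request to proc', p, ':', req(1:nreq(p), p)
        end do
      end program inspector_sketch

In this example the duplicated index 55 is requested only once, and all requests destined for the same processor travel in a single message.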
      real x(max_nodes), y(max_nodes)            ! data arrays
      integer ia(max_edges), ib(max_edges)       ! indirection arrays
L1:   do n = 1, n_step                           ! outer loop
L2:     do i = 1, sizeof_indirection_arrays      ! inner loop
          x(ia(i)) = x(ia(i)) + y(ib(i))
        end do
      end do

Fig. 1. An Example with an Irregular Loop
      do i = 1, rows
        do j = 1, size(i)                        ! size(i) is the number of elements in the ith row
          new_data(icell(i,j), next(icell(i,j))) = data(i,j)
          next(icell(i,j)) = next(icell(i,j)) + 1
        end do
      end do

Fig. 2. Example of data movement in a Particle-in-cell code
[Figure 3 (diagram): the rows of array cells, with lengths size, are redistributed into array new_cells, with lengths new_size, according to entries of icell; e.g. icell(1,3) = 3, icell(2,2) = 4, and icell(4,2) = 3 give the destination rows of individual elements.]

Fig. 3. Data movement in a Particle-in-cell code
Phase A - Data Partitioning: Assign elements of data arrays to processors
Phase B - Data Remapping: Redistribute data array elements
Phase C - Iteration Partitioning: Allocate iterations to processors
Phase D - Iteration Remapping: Redistribute indirection array elements
Phase E - Inspector: Translate indices; generate schedules
Phase F - Executor: Use schedules for data transportation; perform computation

Fig. 4. Solving Irregular Problems
2 Runtime Support
This section presents the principles and functionality of the CHAOS runtime support library, a superset of the PARTI library [11, 23, 20], and describes the new features, set-append schedules and two-phase schedule generation, designed to handle adaptive irregular programs.
2.1 Overview of CHAOS
The CHAOS runtime library has been developed to efficiently handle problems that consist of a sequence of clearly demarcated concurrent computational phases. Solving such concurrent irregular problems on distributed memory machines using the CHAOS runtime support involves six major phases (Figure 4). The first four phases concern mapping data and computations onto processors. The last two phases concern analyzing the data access patterns in loops and communicating between processors to retrieve copies of the required data. In static irregular problems, Phase F is typically executed many times, while Phases A through E are executed only once. In some adaptive problems where data access patterns change periodically but reasonable load balance is maintained, Phase E must be repeated whenever the data access patterns change. In even more highly adaptive problems, the data arrays may need to be repartitioned in order to maintain load balance; in such applications, all the phases described above are repeated. We briefly present below the features of CHAOS that are most useful for parallelizing adaptive irregular programs; a detailed description of their functionality is given in Saltz et al. [19].
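How often each phase recurs can be outlined as the driver skeleton below. This is a hedged sketch, not CHAOS code: the stub routines and the adaptation frequencies (every 25 and 200 steps) are invented purely for illustration.

      program adaptive_driver_sketch
        implicit none
        integer :: step
        call repartition_and_remap()         ! Phases A-D: partition and remap data and iterations
        call inspector()                     ! Phase E: translate indices, build schedules
        do step = 1, 1000
          call executor()                    ! Phase F: communicate using schedules, then compute
          if (mod(step, 200) == 0) then      ! load imbalance detected: repeat all phases
            call repartition_and_remap()
            call inspector()
          else if (mod(step, 25) == 0) then  ! access pattern changed: repeat Phase E only
            call inspector()
          end if
        end do
      contains
        subroutine repartition_and_remap()   ! stub standing in for Phases A-D
        end subroutine repartition_and_remap
        subroutine inspector()               ! stub standing in for Phase E
        end subroutine inspector
        subroutine executor()                ! stub standing in for Phase F
        end subroutine executor
      end program adaptive_driver_sketch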
2.2 Data Distribution
2.2.1 Data Access Descriptors - Translation Tables. When an array is irregularly distributed, a mechanism is needed to retrieve required elements of that array. CHAOS supports a translation mechanism using a data structure called the translation table. A translation table lists the home processor and the local address in the home processor's memory for each element of the irregularly distributed array. In order to access an element A(m) of a distributed array A, a translation table lookup is needed to find out the location
of A(m) in the distributed memory. A translation table lookup aimed at discovering the home processor and the offset associated with a global distributed array index is called a dereference request. Any preprocessing aimed at communication optimization needs to perform dereferencing, since it must be determined where each element resides before reading or writing it.

The fastest table lookup is achieved by replicating the translation table in each processor's local memory. This type of translation table is a replicated translation table. Clearly, the storage cost for this type of translation table is maximal; however, the dereference cost in each processor is constant and independent of the number of processors involved in the computation. The redistribution of the data can be handled efficiently using vendor-supplied global communication routines. Note that, for the replicated translation table, the translation table in each processor is identical. Due to memory considerations, it is not always feasible to place a copy of the translation table on each processor. The approach taken in these cases is to distribute the translation table between processors. This type of translation table is a distributed translation table. Earlier versions of PARTI supported a translation table that was distributed between processors in a blocked fashion [4, 23]. CHAOS also supports an intermediate degree of replication with a paged translation table [6].

2.2.2 Data Partitioners. Data distribution is important for parallelization because it determines the pattern of communication between processors. While regular decompositions such as BLOCK and CYCLIC are easy to achieve, they might cause load imbalance or substantially increase communication in irregular problems. Over the years researchers have developed heuristic methods to obtain irregular data distributions based on a variety of criteria, such as spatial location, connectivity, and computational load. In order to achieve good performance, CHAOS offers parallel versions of these heuristic methods.
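To make the dereference operation of Section 2.2.1 concrete, the following sketch stores a replicated translation table as two arrays indexed by global index, so that a dereference request becomes a constant-time local lookup. This is a hedged illustration, not the CHAOS data structure; the odd/even distribution and all names are invented.

      program translation_table_sketch
        implicit none
        integer, parameter :: nglobal = 10
        ! replicated translation table: home processor and local offset of each global index
        integer :: home(nglobal), offset(nglobal)
        integer :: m

        ! an irregular distribution: odd elements on processor 0, even elements on processor 1
        do m = 1, nglobal
          home(m)   = mod(m + 1, 2)          ! 0 for odd m, 1 for even m
          offset(m) = (m + 1) / 2            ! local position within the home processor
        end do

        ! dereference request for global index A(7): no communication is needed
        m = 7
        print *, 'A(', m, ') lives on processor', home(m), 'at local offset', offset(m)
      end program translation_table_sketch

A distributed or paged translation table trades this constant-time lookup for lower storage, since the home/offset entries themselves must then be fetched from the processor that stores that part of the table.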
2.3 Data Redistribution
For efficiency, distributed data arrays in scientific programs may have to change their distribution between computational phases. For instance, as computation progresses in an adaptive problem, the work load and distributed array access patterns may change based on the nature of the problem, leading to poor load balance among processors. Hence, data must be redistributed periodically to maintain balance. To obtain an irregular data distribution for an irregular concurrent problem, one starts with a known initial distribution A of the data arrays and applies a heuristic partitioner to obtain an irregular distribution B. Once the new data distribution is obtained, all data arrays associated with distribution A must be transformed to distribution B; similarly, the loop iterations and the indirection arrays associated with the loops must be remapped. To redistribute data and loop iterations, a runtime procedure called remap has been developed. This procedure takes as input both the original and the new distribution in the form of translation tables and returns a communication schedule, which can then be used to move data between the initial and new distributions.
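The kind of information remap must derive can be sketched as follows. This is a hedged illustration, not the CHAOS remap routine; the BLOCK-to-CYCLIC example and all names are invented. Given the old and new home processor of every element (i.e. the two translation tables), each processor can determine how many elements it must send to every other processor, which is exactly the send/receive pattern that the returned schedule encodes.

      program remap_sketch
        implicit none
        integer, parameter :: nglobal = 12, nproc = 3
        integer :: old_home(nglobal), new_home(nglobal)  ! old and new translation tables (home processors)
        integer :: moves(0:nproc-1, 0:nproc-1)           ! moves(p,q) = elements processor p sends to q
        integer :: m

        do m = 1, nglobal
          old_home(m) = (m - 1) / (nglobal / nproc)      ! old distribution: BLOCK
          new_home(m) = mod(m - 1, nproc)                ! new distribution: CYCLIC, standing in for a partitioner result
        end do

        moves = 0
        do m = 1, nglobal
          if (old_home(m) /= new_home(m)) &
            moves(old_home(m), new_home(m)) = moves(old_home(m), new_home(m)) + 1
        end do
        print *, 'elements moved between processors:'
        do m = 0, nproc - 1
          print *, 'from proc', m, ':', moves(m, :)
        end do
      end program remap_sketch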
2.4 Loop Iteration Partitioning
Once data arrays are partitioned, loop iterations must also be partitioned. Loop partitioning refers to determining which processor will evaluate which expressions of the loop body. Loop partitioning can be performed at several levels of granularity. At the finest level, each operation may be individually assigned to a processor. At the coarsest level, a block of iterations may be assigned to a processor without considering the data distribution and access patterns. Both extremes are expensive: in the first case the preprocessing overhead can be very high, whereas in the second case the communication cost can be high. CHAOS's approach represents a compromise: each loop iteration is individually considered prior to assignment to a processor. For this task, a set of runtime procedures has been developed. These procedures compute, from the currently known distribution of loop iterations, a list containing the home processors of the distinct data references of each local iteration. Following the loop iteration distribution, the data references in each iteration must be remapped to conform with the new loop iteration distribution. For example, in Figure 1, once loop L2 is partitioned, the data references ia(i) and ib(i) occurring in iteration i are moved to the processor that executes that iteration. An inspector is carried out to remap these data references; a communication schedule built during this inspector can be used later to gather the new data references.
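One plausible assignment rule is sketched below with invented data. This is a hedged illustration, not the CHAOS partitioning procedure: each iteration of a loop like L2 in Figure 1 is given to the processor that owns most of the data it references.

      program iteration_partition_sketch
        implicit none
        integer, parameter :: niters = 5, nproc = 2
        ! home processors of the data referenced by each iteration, i.e. the owners
        ! of x(ia(i)) and y(ib(i)); the values are chosen arbitrarily for illustration
        integer :: home_ia(niters), home_ib(niters)
        integer :: votes(0:nproc-1), assign_to(niters), i

        home_ia = (/ 0, 0, 1, 1, 0 /)
        home_ib = (/ 0, 1, 1, 0, 1 /)

        do i = 1, niters
          votes = 0
          votes(home_ia(i)) = votes(home_ia(i)) + 1
          votes(home_ib(i)) = votes(home_ib(i)) + 1
          assign_to(i) = maxloc(votes, dim=1) - 1   ! processor owning most references wins; ties go to the lower id
          print *, 'iteration', i, '-> processor', assign_to(i)
        end do
      end program iteration_partition_sketch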
2.5 Communication Schedules
As described in Section 2.1, a communication schedule is used to fetch off-processor elements into a local buffer and to scatter these elements back to their home processors after the computational phase is completed. Communication schedules determine the number of communication startups and the volume of communication, so it is important to optimize them. The schedule for a processor p stores the following information:

1. send list: a list of arrays that specifies the local elements of processor p required by all processors;
2. permutation list: an array that specifies the placement order of off-processor elements in the local buffer of processor p;
3. send size: an array that specifies the sizes of the outgoing messages from processor p to all processors;
4. fetch size: an array that specifies the sizes of the incoming messages to processor p from all processors.

2.5.1 Two-Phase Schedule Generation. The basic idea of the inspector/executor concept is to hoist the preprocessing out of the loop as much as possible so that it need not be repeated unnecessarily. In adaptive codes where the data access pattern occasionally changes, the inspector is not a one-time preprocessing cost. Every time an indirection array changes, the schedules associated with it must be regenerated. For example, in Figure 5, if the indirection array ic is modified, the schedules inc_sched_c and sched_ac must be regenerated. Generating inc_sched_c involves inspecting sched_ab to determine which off-processor elements are duplicated in that schedule. Thus, the communication schedule generators must be efficient while maintaining the necessary flexibility. In CHAOS, the schedule-generation process is carried out in two distinct phases:
L1:   do n = 1, nsteps                                       ! outer loop
        call gather(y(begin_buf), y, sched_ab)               ! fetch off-processor data
        call zero_out_buffer(x(begin_buf), off_x)            ! initialize buffer
L2:     do i = 1, local_sizeof_indir_arrays                  ! inner loop
          x(local_ia(i)) = x(local_ia(i)) + y(local_ia(i)) * y(local_ib(i))
        end do
S:      if (required) then
          modify part_ic(:)                                  ! ic is modified
          call CHAOS_clear_mask(hashtable, stamp_c)          ! clear ic
          local_ic(:) = part_ic(:)
          stamp_c = CHAOS_enter_hash(local_ic)               ! enter new ic
          inc_sched_c = CHAOS_incremental_schedule(stamp_c)  ! incremental sched
          sched_ac = CHAOS_schedule(stamp_a, stamp_c)        ! sched for ia, ic
        endif
        call gather(y(begin_buf2), y, inc_sched_c)           ! incremental gather
        call zero_out_buffer(x(begin_buf2), off_x2)          ! initialize buffer
L3:     do i = 1, local_sizeof_ic                            ! inner loop
          x(local_ic(i)) = x(local_ic(i)) + y(local_ic(i))
        end do
        call scatter_add(x(begin_buf), x, sched_ac)          ! scatter addition
      end do

Fig. 5. Schedule Optimizations
Index analysis phase: examines the data access patterns to determine which references are off-processor, removes duplicate off-processor references by keeping only distinct references in hash tables, assigns local buffer locations to off-processor references, and translates global indices to local indices.

Schedule generation phase: creates communication schedules based on the information stored in the hash tables.

The principal advantage of such a two-step process is that some of the index analysis can be reused in adaptive applications. In the index analysis phase, hash tables are used to store global-to-local translations and to remove duplicate off-processor references. Each entry stores the following information:

1. global index: the global index hashed in;
2. translated address: the processor and offset where the element is stored; this information is obtained from the translation table;
3. local index: the local buffer address assigned to hold a copy of the element, if it is off-processor;
4. stamp: an integer used to identify which indirection arrays entered the element into the hash table.

The same global index might be hashed in by several different indirection arrays; a bit in the stamp is marked for each such entry. Stamps are very useful when implementing adaptive irregular programs, especially those with several index arrays of which most are static. In the index analysis phase, each index array hashed into the hash table is assigned a unique stamp, which marks all its entries in the table. Communication schedules are generated based on combinations of stamps. If any one index array changes, only the entries pertaining to that index array, i.e. those with the stamp assigned to it, have to be removed from the hash table. Once the new index array is hashed into the hash table, a new schedule can be generated without rehashing the other index arrays.

Figure 5 illustrates (in pseudo-code) how CHAOS primitives are used to parallelize the adaptive problem. The conditional statement S may modify the indirection array ic. Whenever this occurs, the communication schedules that involve prefetching references of ic must be modified. Since the values of ic in the hash table are no longer valid, the entries with stamp stamp_c are cleared by calling CHAOS_clear_mask(). New values of ic are then entered into the hash table by CHAOS_enter_hash(). After all indirection arrays have been hashed in, communication schedules can be built for any combination of indirection arrays by calling CHAOS_schedule() or CHAOS_incremental_schedule() with an appropriate combination of stamps.

An example of schedule generation for two processors with sample values of indirection arrays ia, ib, and ic is shown in Figure 6. The global references due to indirection array ia are stored in hash table H with stamp a, those due to ib with stamp b, and those due to ic with stamp c. The indirection arrays might have some common references; hence, a hashed global reference might have more than one stamp. The gather schedule sched_ab for loop L2 in Figure 5 is built using the union of references with stamps a or b. The scatter operation for loop L2 can be combined with the scatter operation for loop L3. The gather schedule inc_sched_c for loop L3 is built with those references that have stamp c alone, because references with stamps a or b as well as c can already be fetched using the schedule sched_ab. The scatter schedule for loops L2 and L3 is built using the union of references with stamps a or c.

2.5.2 Set-Append Schedules. In a class of highly adaptive problems, such as Direct Simulation Monte Carlo (DSMC) methods, the patterns of data access frequently vary in the course of the computation. The implication of this variation is that preprocessing for a loop must be repeated whenever the data access pattern of the loop changes.
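The stamp mechanism can be illustrated with a hedged sketch that uses the sample values of Figure 6 (shown next). This is not the CHAOS implementation: a dense array stands in for the hash table, and all other names are invented. Each off-processor global index carries a bitmask recording which indirection arrays referenced it, and an incremental schedule selects only the indices whose stamps are not covered by an existing schedule.

      program stamp_sketch
        implicit none
        integer, parameter :: nglobal = 10, myid = 0, nproc = 2, blk = nglobal / nproc
        integer, parameter :: stamp_a = 1, stamp_b = 2, stamp_c = 4   ! one bit per indirection array
        integer :: stamps(nglobal)       ! stands in for the hash table: stamp bitmask per global index
        integer :: ia(5), ib(5), ic(5), g

        ia = (/ 7, 2, 9, 1, 3 /);  ib = (/ 1, 5, 7, 8, 2 /);  ic = (/ 10, 4, 8, 9, 3 /)
        stamps = 0
        call enter(ia, stamp_a);  call enter(ib, stamp_b);  call enter(ic, stamp_c)

        print *, 'sched_ab gathers:'     ! any entry stamped a or b
        do g = 1, nglobal
          if (iand(stamps(g), stamp_a + stamp_b) /= 0) print *, g
        end do
        print *, 'inc_sched_c gathers:'  ! entries stamped c but neither a nor b (already covered by sched_ab)
        do g = 1, nglobal
          if (iand(stamps(g), stamp_c) /= 0 .and. iand(stamps(g), stamp_a + stamp_b) == 0) print *, g
        end do
        ! if ic changes, only its stamp bit is cleared and ic is re-entered; ia and ib are not rehashed
        stamps = iand(stamps, not(stamp_c))

      contains
        subroutine enter(idx, stamp)
          integer, intent(in) :: idx(:), stamp
          integer :: i
          do i = 1, size(idx)
            if ((idx(i) - 1) / blk /= myid) stamps(idx(i)) = ior(stamps(idx(i)), stamp)  ! off-processor only
          end do
        end subroutine enter
      end program stamp_sketch

With the Figure 6 values, sched_ab gathers 7, 8, 9 and inc_sched_c gathers only 10.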
Initial distribution of data arrays:
  Processor 0: elements 1, 2, 3, 4, 5        Processor 1: elements 6, 7, 8, 9, 10

Inserting indirection arrays into the hash table (Processor 0):
  indirection array ia = 7, 2, 9, 1, 3:
    7   proc 1, addr 2, stamps a
    9   proc 1, addr 4, stamps a
  indirection array ib = 1, 5, 7, 8, 2:
    7   proc 1, addr 2, stamps a b
    9   proc 1, addr 4, stamps a
    8   proc 1, addr 3, stamps b
  indirection array ic = 10, 4, 8, 9, 3:
    7   proc 1, addr 2, stamps a b
    9   proc 1, addr 4, stamps a c
    8   proc 1, addr 3, stamps b c
    10  proc 1, addr 5, stamps c

Generating communication schedules from the hash table:
  sched_a     = schedule(stamp = a)                          will gather/scatter 7, 9
  sched_b     = schedule(stamp = b)                          will gather/scatter 7, 8
  inc_sched_b = incremental_schedule(base = a, stamp = b)    will gather/scatter 8
  sched_c     = schedule(stamp = c)                          will gather/scatter 9, 8, 10
  inc_sched_c = incremental_schedule(base = a,b, stamp = c)  will gather/scatter 10

Fig. 6. Schedule generation with hash table
In other words, previously built communication schedules cannot be used once the data access pattern changes. For some adaptive applications, such as DSMC, there is no computational significance attached to the placement order of incoming array elements. Such application-specific information is used to build much cheaper set-append communication schedules. A set-append schedule for a processor p stores the following information:

1. send list: a list of arrays that specifies the local elements of processor p required by all processors;
2. send size: an array that specifies the size of the outgoing message from processor p to each processor;
3. fetch size: an array that specifies the size of the incoming message to processor p from each processor.

Set-append schedules are similar to the previously described schedules (Section 2.5.1) except that they do not carry information about the data placement order on the receiving processor. While the cost of building a set-append schedule is less than that of a regular schedule, a set-append schedule still provides communication optimizations by aggregating and vectorizing messages.
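A hedged sketch of the information a set-append schedule carries is shown below; it is not the CHAOS implementation, and the BLOCK distribution of rows and the sample destinations are invented. For a loop like the one in Figure 2, a processor only needs per-destination send lists and message sizes, since the receiver may append incoming elements in any order.

      program set_append_sketch
        implicit none
        integer, parameter :: nrows = 8, nproc = 2, rows_per_proc = nrows / nproc, myid = 0
        integer, parameter :: nelem = 6
        integer :: dest_row(nelem)               ! icell value of each local element
        integer :: send_size(0:nproc-1)          ! outgoing message size per processor
        integer :: send_list(nelem, 0:nproc-1)   ! local elements destined for each processor
        integer :: e, p

        dest_row = (/ 3, 6, 1, 7, 8, 2 /)        ! sample destination rows
        send_size = 0
        do e = 1, nelem
          p = (dest_row(e) - 1) / rows_per_proc  ! processor owning the destination row
          if (p /= myid) then                    ! off-processor: append to that processor's list
            send_size(p) = send_size(p) + 1
            send_list(send_size(p), p) = e       ! no permutation list: placement order is irrelevant
          end if
        end do
        ! fetch sizes would be obtained by exchanging the send sizes among processors
        do p = 0, nproc - 1
          if (send_size(p) > 0) print *, 'send', send_size(p), 'elements to proc', p, ':', &
                                         send_list(1:send_size(p), p)
        end do
      end program set_append_sketch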
2.6 Data Transportation
While communication schedules store data send/receive patterns, the CHAOS data transportation procedures actually move data using these schedules.
Table 1
Performance of Parallel CHARMM on Intel iPSC/860 (in sec.)

  Number of Processors     1 (2)      16       32       64       128
  Execution Time           74595.5    4356.0   2293.8   1261.4   781.8
  Computation Time         74595.5    4099.4   2026.8   1011.2   507.6
  Communication Time       0.0        147.1    159.8    181.1    219.2
  Load Balance Index       1.00       1.03     1.05     1.06     1.08
The procedure gather can be used to fetch copies of off-processor elements. The procedure scatter can be used to send off-processor elements back to their home processors. The procedure scatter_append can be used to perform data movement using the set-append schedules. A detailed description of these procedures can be found in Saltz et al. [19].
3 Experimental Results
This section presents the computational structure and performance of two adaptive irregular application programs: 1) a molecular dynamics code, Chemistry at HARvard Macromolecular Mechanics (CHARMM), and 2) a Direct Simulation Monte Carlo (DSMC) code. These two application programs were ported to distributed memory machines using CHAOS primitives. CHARMM adapts occasionally, whereas DSMC adapts frequently. Experimental results are presented for the two parallelized programs on the Intel iPSC/860.
3.1 CHARMM
CHARMM is a program that calculates empirical energy functions to model macromolecular systems. The purpose of CHARMM is to derive structural and dynamic properties of molecules using first and second order derivative techniques [2]. The computationally intensive part of CHARMM is the molecular dynamics simulation.

3.1.1 Performance. The performance of the molecular dynamics simulation was studied with a benchmark case (MbCO + 3830 water molecules) on the Intel iPSC/860. It ran for 1000 steps, with the non-bonded list regenerated 40 times during the simulation; the cutoff for non-bonded list generation was 14 Å. The results are presented in Table 1. The recursive coordinate bisection (RCB) partitioner was used to partition atoms. The execution time includes the energy calculation time and communication time of each processor. The computation time is the average over processors of the computation time of the dynamics simulation; the communication time is the average communication time. The load balance index was calculated as

    LB = \frac{n \cdot \max_{1 \le i \le n} T_i}{\sum_{i=1}^{n} T_i},

where T_i is the computation time of processor i and n is the number of processors.
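As a minimal sketch of this calculation (the timings below are made up, and this is not CHAOS code), the index can be computed directly from the per-processor computation times:

      program load_balance_sketch
        implicit none
        integer, parameter :: n = 4                        ! number of processors
        real :: comp_time(n), lb
        comp_time = (/ 9.8, 10.2, 10.0, 10.4 /)            ! hypothetical per-processor computation times
        lb = maxval(comp_time) * real(n) / sum(comp_time)  ! 1.0 for perfect balance, larger otherwise
        print *, 'load balance index =', lb
      end program load_balance_sketch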
The results show that CHARMM scales well and that good load balance is maintained up to 128 processors. CHARMM was also ported to the IBM SP-1 using the CHAOS software. The 64-processor SP-1 version ran 2.5 times faster than the corresponding iPSC/860 version, but the communication time on the SP-1 was higher by 20%.

(2) The single-processor time in Table 1 is an estimate by Brooks and Hodoscek [3].
Table 2
Preprocessing Overheads of CHARMM (in sec.)

  Number of Processors           16      32      64      128
  Data Partition                 0.27    0.47    0.83    1.63
  Non-bonded List Update         7.18    3.85    2.16    1.22
  Remapping and Preprocessing    0.03    0.03    0.02    0.02
  Schedule Generation            1.31    0.80    0.64    0.42
  Schedule Regeneration (40)     43.51   23.36   13.18   8.92
Table 3
Inspector/executor vs. Data migration (Time in secs)

                          48x48 Cells                      96x96 Cells
  Processors              16     32     64     128         16      32      64      128
  Regular Schedules       63.74  50.50  79.58  95.50       226.89  131.99  125.64  118.89
  Set-Append Schedules    20.14  11.54  7.60   6.77        79.89   40.46   21.77   14.23
Overheads of Preprocessing
Data and iteration partitioning, remapping, and loop preprocessing must be done at runtime. The preprocessing overheads of the simulation are shown in Table 2. The data partition time is the execution time of RCB. After partitioning atoms, the non-bonded list is regenerated; this regeneration was performed because atoms were redistributed over processors, and it was done before the simulation started. In Table 2, this regeneration time is shown as the non-bonded list update time. During the simulation, the non-bonded list was regenerated periodically, and whenever the non-bonded list was updated, the schedules had to be regenerated. The schedule regeneration time in Table 2 gives the total schedule regeneration time for the 40 non-bonded list updates. Comparing these numbers to those in Table 1, it can be observed that the preprocessing overhead is relatively small compared to the total execution time.
3.2 Direct Simulation Monte Carlo
This section presents an example of a highly adaptive code, the DSMC code. It is similar to PIC codes in that it simulates the physics of flows directly through Lagrangian particle movement and interaction. The DSMC method is a technique for computer modeling of a real gas by a large number of simulated molecules. It features highly efficient movement and collision handling of simulated molecules on a spatial flow-field domain overlaid by a Cartesian mesh [18, 22]. A detailed description of the parallelization approach is given in [12]. Performance results of the 2-dimensional and 3-dimensional DSMC codes on the Intel iPSC/860 hypercube are presented below.

3.2.1 Performance Results. Table 3 compares the execution time of the 2-dimensional DSMC code using set-append schedules (Section 2.5.2) with the time obtained using the regular communication schedules (Section 2.5). Because the computational requirement is evenly distributed over the whole domain of the 2-dimensional DSMC code, load balance is not an issue.
Table 4
Performance effects of remapping (remapped every 40 time steps)

  (Time in secs)          Number of processors                           Sequential Code
                          8        16      32      64      128
  Static partition        1161.69  675.75  417.17  285.56  215.06        4857.69
  Recursive bisection     850.75   462.15  278.23  209.75  267.24
  Chain partition         807.19   423.50  237.12  154.39  127.26

The inspector/executor method applied to the 2-dimensional problem must carry out preprocessing of communication patterns every time step, because the reference patterns to off-processor data fluctuate from time step to time step. Consequently, the preprocessing cost of the inspector/executor method is substantial. The use of the data migration primitives also incurs preprocessing of communication patterns; however, the cost of generating set-append schedules is much lower than that of the regular schedules.

Periodic data remapping provides better performance than static partitioning. Experiments with the 3-dimensional DSMC code on 128 processors reveal that, with remapping, the degree of load imbalance does not exceed 30 percent of perfect load balance, whereas with static partitioning it exceeds 400 percent. Table 4 compares the performance of periodic domain partitioning methods with that of static partitioning (i.e. no remapping) for the 3-dimensional DSMC code. The domain is repartitioned every 40 time steps based on the workload information collected for each Cartesian mesh cell. The table presents execution times for 1000 time steps. The results show that the repartitioning methods outperform static partitioning significantly on a small number of processors. However, using the recursive inertial bisection partitioner leads to performance degradation on a large number of processors, with the net result of poorer performance than static partitioning. This degradation is caused by the large communication overhead, which increases with the number of processors. The chain partitioner, on the contrary, provides better results for this problem. The 3D DSMC code was also ported to the IBM SP-1; the execution time on 56 processors for 1000 time steps, for a problem with 30x18x18 cells, was 56.86 secs. The speedup achieved on 56 processors is 40.
4 Related Work and Conclusions
Several researchers have developed programming environments that target particular classes of irregular or adaptive problems. Williams [21] describes a programming environment (DIME) for calculations with unstructured triangular meshes on distributed memory machines. Baden and Quinlan have developed C++ based programming environments that target a range of non-uniform scientific calculations [9], [10], [13], [14]; these environments provide facilities that support dynamic load balancing.

Many applications make extensive use of indirectly indexed arrays. This paper has described how a runtime support library can be used as compiler runtime support in such irregular applications, and how runtime libraries can use indirection array values to carry out a variety of optimizations. The CHAOS runtime library carries out optimizations designed to reduce communication costs by performing software caching, communication coalescing, and inspector/executor preprocessing. CHAOS also supplies special-purpose routines to support specific types of irregular reductions and runtime support [6] for partitioning data and work between processors and for communication optimizations
(Section 2.2). This paper does not describe how the runtime support is integrated into compilers. While irregular problem compilation is by no means trivial, a variety of limited prototype compilers have been demonstrated. Several prototype compilers have been developed that are able to use the CHAOS runtime support to generate reasonable distributed memory code corresponding to Fortran D versions of unstructured mesh and molecular dynamics procedures (e.g. [5], [7]). However, the authors are not aware of any prototype compiler that is able to generate correct SPMD code from a DSMC-type application; it is believed that both new language extensions and new compilation methods are required to accomplish this.
References
[1] P. Brezany, M. Gerndt, V. Sipkova, and H.P. Zima. SUPERB support for irregular scientific computations. In Proceedings of the Scalable High Performance Computing Conference (SHPCC-92), pages 314-321. IEEE Computer Society Press, April 1992.
[2] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry, 4:187, 1983.
[3] B. R. Brooks and M. Hodoscek. Parallelization of CHARMM for MIMD machines. Chemical Design Automation News, 7:16, 1992.
[4] R. Das, J. Saltz, D. Mavriplis, and R. Ponnusamy. The incremental scheduler. In Unstructured Scientific Computation on Scalable Multiprocessors, Cambridge, Mass., 1992. MIT Press.
[5] Raja Das, Joel Saltz, and Reinhard von Hanxleden. Slicing analysis and indirect access to distributed arrays. In Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing, pages 152-168. Springer-Verlag, August 1993. Also available as University of Maryland Technical Report CS-TR-3076 and UMIACS-TR-93-42.
[6] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Technical Report CS-TR-3163 and UMIACS-TR-93-109, University of Maryland, Department of Computer Science and UMIACS, October 1993. To appear in Journal of Parallel and Distributed Computing.
[7] R. v. Hanxleden, K. Kennedy, and J. Saltz. Value-based distributions in Fortran D: a preliminary report. Technical Report CRPC-TR93365-S, Center for Research on Parallel Computation, Rice University, December 1993. Submitted to Journal of Programming Languages, Special Issue on Compiling and Run-Time Issues for Distributed Address Space Machines.
[8] C. Koelbel, P. Mehrotra, and J. Van Rosendale. Supporting shared data structures on distributed memory architectures. In 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 177-186. ACM, March 1990.
[9] Scott R. Kohn and Scott B. Baden. An implementation of the LPAR parallel programming model for scientific computations. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 759-766. SIAM, March 1993.
[10] S. R. Kohn and S. B. Baden. A robust parallel programming model for dynamic non-uniform scientific computations. In Proceedings of the Scalable High Performance Computing Conference (SHPCC-94), pages 509-517. IEEE Computer Society Press, May 1994.
[11] R. Mirchandaney, J. H. Saltz, R. M. Smith, D. M. Nicol, and Kay Crowley. Principles of runtime support for parallel processors. In Proceedings of the 1988 ACM International Conference on Supercomputing, pages 140-152, July 1988.
[12] B. Moon and J. Saltz. Adaptive runtime support for direct simulation Monte Carlo methods on distributed memory architectures. In Proceedings of the Scalable High Performance Computing Conference (SHPCC-94), pages 176-183. IEEE Computer Society Press, May 1994.
[13] R. Parsons and D. Quinlan. Run-time recognition of task parallelism within the P++ class library. In Proceedings of the Scalable Parallel Libraries Conference, Mississippi State University, Starkville, MS, pages 77-86. IEEE Computer Society Press, October 1993.
[14] Rebecca Parsons and Daniel Quinlan. A++/P++ array classes for architecture independent finite difference computations. To appear in the Proceedings of the Second Annual Object Oriented Numerics Conference, Sunriver, Oregon, April 1994.
[15] Ravi Ponnusamy, Yuan-Shin Hwang, Joel Saltz, Alok Choudhary, and Geoffrey Fox. Supporting irregular distributions in FORTRAN 90D/HPF compilers. Technical Report CS-TR-3268 and UMIACS-TR-94-57, University of Maryland, Department of Computer Science and UMIACS, May 1994. Submitted to IEEE Parallel and Distributed Technology.
[16] Ravi Ponnusamy, Joel Saltz, and Alok Choudhary. Runtime-compilation techniques for data partitioning and communication schedule reuse. In Proceedings Supercomputing '93, pages 361-370. IEEE Computer Society Press, November 1993. Also available as University of Maryland Technical Report CS-TR-3055 and UMIACS-TR-93-32.
[17] Ravi Ponnusamy, Joel Saltz, Alok Choudhary, Yuan-Shin Hwang, and Geoffrey Fox. Runtime support and compilation methods for user-specified data distributions. Technical Report CS-TR-3194 and UMIACS-TR-93-135, University of Maryland, Department of Computer Science and UMIACS, November 1993. Submitted to IEEE Transactions on Parallel and Distributed Systems.
[18] D. F. G. Rault and M. S. Woronowicz. Spacecraft contamination investigation by direct simulation Monte Carlo: contamination on UARS/HALOE. In Proceedings AIAA 31st Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January 1993.
[19] J. Saltz et al. A manual for the CHAOS runtime library. Technical report, University of Maryland, 1993.
[20] Joel Saltz, Harry Berryman, and Janet Wu. Multiprocessors and run-time compilation. Concurrency: Practice and Experience, 3(6):573-592, December 1991.
[21] R. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5):457-482, February 1991.
[22] M. S. Woronowicz and D. F. G. Rault. On predicting contamination levels of HALOE optics aboard UARS using direct simulation Monte Carlo. In Proceedings AIAA 28th Thermophysics Conference, Orlando, Florida, June 1993.
[23] J. Wu, J. Saltz, S. Hiranandani, and H. Berryman. Runtime compilation methods for multicomputers. In Proceedings of the 1991 International Conference on Parallel Processing, volume 2, pages 26-30, 1991.