OPTIMIZATION AND SCALING OF SHARED-MEMORY AND MESSAGE-PASSING IMPLEMENTATIONS OF THE ZEUS HYDRODYNAMICS ALGORITHM

Robert A. Fiedler
Hewlett Packard Co.
U.S. Naval Research Laboratory
Code 5593; Bldg. A49; Rm. 28
4555 Overlook Ave. SW
Washington, DC 20375-5320
Ph.: (202) 767-8426
FAX: (202) 404-7402
Email:
[email protected]
Abstract
We compare the performance of shared-memory and message-passing versions of the ZEUS algorithm for astrophysical fluid dynamics on a 64-processor HP/Convex Exemplar SPP-2000. Single-processor optimization is guided by timing several versions of simple loops whose structure typifies the main performance bottlenecks. Overhead is minimized in the message-passing implementation through the use of non-blocking communication operations. Our benchmark results agree reasonably well with the predictions of a simple performance model. The message-passing version of ZEUS scales better than the shared-memory one primarily because, under shared-memory, the domain decomposition is effectively one-dimensional (unless data-layout directives are utilized).
Introduction
As more and more researchers migrate from traditional vector supercomputers to RISC-based multiprocessors and workstation clusters, optimization of existing codes for efficient parallel execution on these architectures becomes increasingly important. This paper illustrates how a venerable 3-D fluid dynamics application was tuned for systems with a data cache, parallelized using compiler directives, and restructured for message passing. Our benchmarks demonstrate that the choice of programming model and the amount of programming effort the developer is willing to invest can have a profound impact on the speed and scalability of an application.

Under the shared-memory programming model, the compiler, operating system, and hardware perform the task of ensuring that all threads of execution see the same values of all variables residing in global memory (see, e.g., Dowd, 1993, Chapter 16). Under this paradigm, the data is shared by all threads (unless declared otherwise via compiler directives), and parallelism is achieved primarily by directing the compiler to distribute the iterations of each loop among the threads (normally one per processor). The compiler also automatically places barriers after each loop to ensure that the threads remain synchronized. Nested loops are normally parallelized on only one loop index. In this case, the best speedups usually occur when the iterations of the outermost loop can be distributed, since this strategy maximizes the "granularity", or amount of work done within parallel regions of code, and reduces overhead. However, parallelizing outermost loops effectively decomposes the computational domain along just one dimension for a given loop nest. In many cases, such a decomposition is less than optimal for large numbers of threads, as we explain in our theoretical scaling analysis below. Other loop nests in the same application may be parallelized in a way that implies a different domain decomposition (e.g., along a different dimension), so optimal distribution of data among threads can be difficult to achieve without using non-portable data-layout directives. The main advantage of shared-memory programming is that reasonably good parallel speedups on up to about 16 threads often can be obtained without a great deal of effort on the part of the programmer.

Under message-passing, all processes execute the same program and all data is local to a given process (see, e.g., Dowd, 1993, Chapter 15). Whenever one process requires data computed by another process, that data is transferred explicitly via calls to routines in a message-passing library, preferably a widely available standard such as MPI (see Snir et al., 1996; the MPI Web page at http://www.mcs.anl.gov/mpi/index.html). The message-passing programming model generally allows coarser granularity and greater control of the distribution of data among processors than does the shared-memory model. For a
given algorithm, there is potentially less system overhead for a well-written message-passing implementation compared to a shared-memory one, since there is no need to ensure that variables in memory or cache have the same values for all processors. Particularly efficient are "native" implementations of message-passing libraries written by the vendors, as opposed to public domain versions, because the exchange of data between processors can take advantage of the shared-memory communication system. The main disadvantage of using message-passing when parallelizing existing serial codes is that the programmer usually must make extensive modifications in order to create an efficient application. Both the shared-memory and the message-passing programming paradigms may be adopted to obtain respectable parallel speedups on shared-memory multiprocessors such as the HP/Convex Exemplar SPP-2000 and the SGI Origin 2000, in which the physically distributed memory is mapped onto a global address space accessible from all processors.
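To make the contrast concrete, the following sketch shows how the outermost loop of a triple nest is parallelized under the shared-memory model. It is illustrative only and is not taken from ZEUS-3D; the directive is written in OpenMP-style syntax purely as an assumption, whereas the codes benchmarked here rely on vendor-specific directives. Distributing the iterations of the k loop among threads assigns each thread a contiguous slab of k values, which is the effectively one-dimensional domain decomposition discussed above.

c     Illustrative sketch only: outermost-loop parallelization.
c     Each thread receives a contiguous block of k iterations, so the
c     domain is effectively decomposed into slabs along the k-axis.
      program slabs
      implicit none
      integer m
      parameter (m = 64)
      real q(m,m,m), f(m,m,m)
      integer i, j, k
      do k = 1, m
        do j = 1, m
          do i = 1, m
            q(i,j,k) = real(i + j + k)
          enddo ! i
        enddo ! j
      enddo ! k
c$omp parallel do private(i,j)
      do k = 2, m
        do j = 1, m
          do i = 1, m
            f(i,j,k) = q(i,j,k) - q(i,j,k-1)
          enddo ! i
        enddo ! j
      enddo ! k
c$omp end parallel do
      print *, 'f(1,1,2) = ', f(1,1,2)
      end

Note that the j and i loops inside the parallel k loop are executed sequentially by each thread; only the k iterations are distributed.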
The ZEUS Algorithm
The ZEUS algorithm, developed at NCSA by Michael L. Norman, James M. Stone, and David A. Clarke, solves the ideal MHD equations governing the evolution of a wide variety of astrophysical systems using a time-explicit, multi-step, finite-difference scheme (see Stone and Norman, 1992). Since 1993, implementations of ZEUS for two- and three-dimensional problems have been disseminated to a growing international scientific community (see the LCA Web page at http://zeus.ncsa.uiuc.edu:8080/lca_home_page.html, or send e-mail to
[email protected]). Although the full ZEUS algorithm includes such physical processes as magnetic fields, radiation, and self-gravity, for this study we apply it to a simple hydrodynamics test problem. We consider the expansion of an initially stationary bubble of high-pressure gas in a homogeneous ambient medium. After the pressure gradient drives a shock front out to a radius much larger than the initial size of the bubble, the solution should approach the Sedov-Taylor analytical result for a point-source explosion. The evolution equations describing the blast wave are:
\[
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0 \,, \qquad (1)
\]
\[
\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot (\rho \mathbf{v} \mathbf{v}) + \nabla p = 0 \,, \qquad (2)
\]
\[
\frac{\partial (\rho \varepsilon)}{\partial t} + \nabla \cdot (\rho \varepsilon \mathbf{v}) + p\, \nabla \cdot \mathbf{v} = 0 \,, \qquad (3)
\]
where $\rho$ is the mass density, $\mathbf{v}$ is the fluid velocity, $\varepsilon$ is the specific internal energy, and $p$ is the thermal pressure at time $t$. The quantities $\rho$, $\varepsilon$, and the velocity components constitute the set of five "field variables" to be evolved. We chose to evolve the system in Cartesian coordinates to simplify visualization of the domain decomposition; however, the ZEUS algorithms are implemented in a "covariant" form and can solve equations (1)-(3) in cylindrical-polar or spherical coordinates as well.

Finite-difference approximations replace the temporal and spatial partial derivatives in equations (1)-(3), and the updated values of each field variable at each mesh point
are computed explicitly from straightforward algebraic expressions involving values of quantities at the previous time step (or substep). The transport of field variables with the fluid ("advection") is treated separately from the effect of the remaining terms (source terms). Moreover, the advection and source steps are further divided into several substeps in which the field variables are updated from values computed in the previous substep (see Flowchart 1). In the source step, the first substep accelerates the fluid through the action of body forces such as pressure gradients. In the next substep, artificial viscosity spreads shock fronts over several zones and converts kinetic energy into internal energy. In the third substep, compressional heating raises the internal energy. In the transport step, 1-dimensional sweeps are performed along each axis. A different sweep ordering (one of six possible permutations) is used each time step to prevent truncation errors from building up along a preferred direction.

This "operator splitting" technique produces more accurate numerical solutions than does updating the field variables in one operation that includes every term in the evolution equations. The drawback of operator splitting is that additional communication must occur each time the field variables are (partially) updated, but the amount of computation remains the same. Clearly, parallel execution can scale linearly with the number of threads only if the time spent on communication operations is small compared to the time spent performing computations, or if communication and computation can be overlapped (see below).

A certain amount of communication is unavoidable in any implementation of the ZEUS algorithm, since the finite-difference expressions that update each field variable at zone (i, j, k) involve values of these quantities 1 to 3 grid zones away in up to 6 different directions. When the computational domain is divided into groups of zones ("tiles") belonging to each processor, evaluating the evolution equations in zones near tile boundaries requires values of some field variables from zones belonging to one or more neighboring tiles. The concept of processors owning tiles (i.e., data locality) applies even to shared-memory applications, since access times to a given shared memory location on large multiprocessors are usually non-uniform.
Single-Processor Optimizations
The ZEUS algorithm was developed on a single Cray vector processor. The original implementation performs poorly on RISC-based systems because its memory access pattern causes frequent cache misses. This paper describes ZEUS-3D (Clarke, Norman, and Fiedler, 1994), a tuned shared-memory version, and a new message-passing implementation, ZEUS-MP. Below we illustrate the strategies we used to reduce memory latency by applying some of them to a representative loop nest and compare the performance of different optimized versions on several platforms.

On RISC processors, access to encached data is commonly on the order of 100 times faster than access to memory, so it is critically important to perform as many computations as possible with any data brought into the cache before it gets overwritten. When a cache miss occurs, typically several adjacent storage locations (comprising a cache line) are fetched from memory. Loops that access consecutive array elements as the iterations proceed will frequently make use of all of the elements in each cache line. Reducing a loop's stride through memory can therefore greatly improve performance.

Often it is not possible to reduce the stride of a loop to unity. In this case the size and design of the cache are important factors in determining whether encached data
remains there long enough to be available when needed. More than one memory location may map to each location in the cache. In a "fully associative" cache, data from memory can be written to any cache line. This design minimizes the chance that subsequent memory references will overwrite encached data before it can be (re)used (cache thrashing). However, this type of cache is very expensive. Some systems have "set associative" caches, in which each location in memory may be stored in one of two or more cache lines. Others have "direct mapped" caches, in which each memory location is assigned to a single cache line. This type of cache is the easiest to thrash.

The most time-consuming stage of the ZEUS hydrodynamics algorithm is the advection step, which involves 1-dimensional sweeps along each of the 3 coordinate axes. Consider the following simple loop nest, whose basic structure is common to many conservative advection schemes:

      do i=1,m
        do k=1,m
          do j=0,m
            dq(j) = q(i,j,k) - q(i,j-1,k)
          enddo ! j
          do j=1,m
            f(i,j,k) = dq(j) - dq(j-1)
          enddo ! j
        enddo ! k
      enddo ! i
In Fortran, arrays are laid out in memory so that the leftmost index changes most rapidly as consecutive elements are accessed. We can reduce the stride through arrays q and f in the above loop nest by interchanging the i and k loops:

      do k=1,m
        do i=1,m
          do j=0,m
            dq(j) = q(i,j,k) - q(i,j-1,k)
          enddo ! j
          do j=1,m
            f(i,j,k) = dq(j) - dq(j-1)
          enddo ! j
        enddo ! i
      enddo ! k
This interchanged loop nest leads to the same numerical results as the original. If the cache is large enough to hold several times m words, then the extra elements of q in the cache lines fetched from memory for a given value of i while performing the j-loop can be used in the next several i-loop iterations. New values of f collect in the cache and are written out to memory before the individual cache lines containing them are overwritten. The stride may be reduced to unity by splitting the i loop into two separate nests, saving intermediate results in a two-dimensional scratch array, and interchanging the i and j loops:
      do k=1,m
        do j=0,m
          do i=1,m
            qq(i,j) = q(i,j,k) - q(i,j-1,k)
          enddo ! i
        enddo ! j
        do j=1,m
          do i=1,m
            f(i,j,k) = qq(i,j) - qq(i,j-1)
          enddo ! i
        enddo ! j
      enddo ! k
This loop-splitting strategy should work about as well as the interchanged loop nest above, provided array qq remains in cache while the values of f for a given k are computed. We can also fuse the split loops, eliminating the need for a scratch array by recomputing the differences in elements of q:

      do k=1,m
        do j=1,m
          do i=1,m
            dqj   = q(i,j  ,k) - q(i,j-1,k)
            dqjm1 = q(i,j-1,k) - q(i,j-2,k)
            f(i,j,k) = dqj - dqjm1
          enddo ! i
        enddo ! j
      enddo ! k
Optimizing compilers can use registers to save dqj for the next iteration of the fused loop. If the bodies of these loops were to contain many more assignment statements, the number of registers available might not be sufficient to hold all of the new temporary variables required for a fused version. This would prevent the compiler from generating an efficient software pipeline for the fused loop and severely degrade performance. We obtained the following CPU times in seconds for the four versions of the loop nest on various machines for m = 100:

  MACHINE    ORIGINAL   INTERCH   SPLIT    FUSED    COMPILER
  --------   --------   -------   -----    -----    --------
  Cray C90   0.012      0.017     0.009    0.015    cf77 6.0.x
  SPP-2000   0.380      0.080     0.060    0.050    f77 5.2
  SGI O2k    1.112      0.104     0.119    0.100    f77 7.1
  SGI R10k   2.593      0.134     0.176    0.135    f77 7.0
On each system, these loops were compiled at a very high level of optimization, including data prefetching if available. Except for the fused version, the two inner loops are each executed approximately 1 million times, and each iteration contains one floating-point operation. Therefore, the C90 runs this loop nest at about 166 MFLOPS (roughly $2 \times 10^6$ floating-point operations in 0.012 s). The best RISC time to solution for the same operation count (the SPP-2000 on the split version) corresponds to 33.5 MFLOPS. These loop nests run at a fraction of the peak performance on any of these systems because there are 3 memory operations per floating-point operation
and the processor spends too much time waiting for the data to move to and from memory. Since the execution speed of this loop is limited by the speed of the memory system, the Cray finishes this computation several times faster than any of the RISC processors.

On the Cray, the interchanged loop runs 40 percent slower than the original version, but on the RISC processors, the interchanged loop runs many times faster than the original. Thus, legacy codes tuned for a vector processor often perform poorly on RISC-based systems without additional modification. Evidently, none of the RISC compilers automatically interchanges the outer loops. The one-dimensional array dq is small enough to fit in cache on any machine, so execution times for the interchanged version are insensitive to cache size. On the Cray, the split version runs 25 percent faster than the original, despite the Cray's scatter/gather hardware for accessing data in non-unit-stride loops. On the RISC systems, execution times range from somewhat longer to somewhat shorter than the times for the interchanged version. Scratch array qq occupies only about 10 kwords (80 kB for double precision data) and easily fits in cache. The time to solution for the fused loop version is shorter than the time for the interchanged version on most systems, despite the apparent extra work. On the Exemplar SPP-2000, the split and fused versions may run faster than the interchanged version because loops of unit stride are less likely to thrash its direct-mapped cache.

The ZEUS codes also contain loop nests that perform advection sweeps along the i- and k-coordinate axes. The number of operations in each advection sweep is the same for any sweep direction, but execution times can be very different, depending on the memory access pattern. The original loop nest that operates along the k-axis has a stride roughly proportional to $m^2$:

      do i=1,m
        do j=1,m
          do k=0,m
            dq(k) = q(i,j,k) - q(i,j,k-1)
          enddo ! k
          do k=1,m
            f(i,j,k) = dq(k) - dq(k-1)
          enddo ! k
        enddo ! j
      enddo ! i
In this case, the split version involves a three-dimensional scratch array, which is about 8 MB in size for m = 100 and double precision data, too large to fit in cache all at once:
      do k=0,m
        do j=1,m
          do i=1,m
            qq(i,j,k) = q(i,j,k) - q(i,j,k-1)
          enddo ! i
        enddo ! j
      enddo ! k
      do k=1,m
        do j=1,m
          do i=1,m
            f(i,j,k) = qq(i,j,k) - qq(i,j,k-1)
          enddo ! i
        enddo ! j
      enddo ! k
On machines with small direct-mapped caches (such as the Cray T3D, whose cache is 8 kB), a different approach to reducing cache thrashing can be helpful, especially when fusing the loops is not a viable option. For loops whose sweep direction is not along the i-axis, a factor of 2 or more speedup can be obtained by "unrolling and jamming" the fused loop on index i:

      do j=1,m
        do i=1,m,2
          do k=0,m
            dp(k,1) = q(i  ,j,k) - q(i  ,j,k-1)
            dp(k,2) = q(i+1,j,k) - q(i+1,j,k-1)
          enddo ! k
          do k=1,m
            f(i  ,j,k) = dp(k  ,1) - dp(k-1,1)
            f(i+1,j,k) = dp(k  ,2) - dp(k-1,2)
          enddo ! k
        enddo ! i
      enddo ! j
Here we have assumed that m is even; if not, we would need to add code for the remaining iteration. This technique can help because 2 of the 4 (or more) elements of q on a cache line are used immediately, rather than risk overwriting them while executing the second k loop. Better performance is obtained when the loop is unrolled 4 times instead of twice as in the example above, since there are 4 (double precision) words in a T3D cache line. Some machines have compiler directives to perform this type of unrolling automatically.

In addition to cache misses, it is important to reduce Translation Lookaside Buffer (TLB) misses. The TLB is a type of cache that stores recently used entries in the table that keeps track of the mapping between virtual and physical memory pages (see Dowd, 1993, Chapter 3). Sweeps along the k-axis can generate many TLB misses because the stride through q is so long that we may access data on several different pages for each value of i and j in the outer loops. In the interchanged version for advection sweeps along the k-axis, we essentially copy a strip of the 3-D array q into a 1-D scratch array dq, perform all the desired operations on this data (subtract elements), and then scatter the result to the 3-D array f. The TLB misses that these gather/scatter operations generate would be
outweighed by the benefits only if a substantially larger number of floating-point operations were performed with the gathered data. As an alternative to copying 3-D arrays to 1-D arrays, one may perform the k-sweeps in blocks:

      do kk=1,m-1,kblock
        do j=1,m
          do i=1,m
            if (kk .eq. 1) dq(0) = q(i,j,0) - q(i,j,-1)
            do k=kk,kk+kblock-1
              dq(k) = q(i,j,k) - q(i,j,k-1)
            enddo ! k
            do k=kk,kk+kblock-1
              f(i,j,k) = dq(k) - dq(k-1)
            enddo ! k
          enddo ! i
        enddo ! j
      enddo ! kk
Here we have added an outer loop along the k-axis which simply breaks up the length of the sweeps so that as many computations as possible can be done with data stored on one memory page before moving on to the next block. We have taken m to be evenly divisible by kblock; otherwise more code is needed to perform the leftover iterations. The optimal number of iterations per block depends on the number of 2-D planes of data accessed in the loop body, the page size, the number of entries in the TLB, and the penalty for a TLB miss; some experimentation may be necessary to determine the best value for each machine and problem size. For m = 100, kblock = 25 was close to optimal on all three RISC systems. We obtain the following execution times (CPU seconds) for m = 100:

  MACHINE    ORIGINAL   INTERCH   SPLIT    FUSED    UNROLL   BLOCKED
  --------   --------   -------   -----    -----    ------   -------
  Cray C90   0.028      0.023     0.010    0.010    0.029    0.043
  SPP-2000   1.460      0.980     0.150    0.050    0.440    0.100
  SGI O2k    1.943      0.748     0.188    0.100    0.252    0.116
  SGI R10k   2.197      0.663     0.270    0.117    0.234    0.139
Note that the split and fused versions run nearly 3 times as fast as the original version on the Cray. Thus, drastically reducing the loop stride can be beneficial on both vector processors and RISC systems. On the RISC processors, the 1-D scratch array dq in the interchanged version fits in cache, but performance suffers due to TLB misses. The 3-D scratch array in the split version impacts performance somewhat because it is too large to fit in cache. After the first loop completes, most of the elements must be brought in from memory again for the second loop. On RISC systems, the fused version runs as fast as it did for sweeps along the j-axis, since the number of cache misses is nearly the same. The unrolled-and-jammed version is significantly faster than the interchanged version because it reduces the number of TLB misses by the unrolling factor (4), utilizing more data on the same memory page at a time.
TLB blocking is especially effective on the Exemplar SPP-2000, because it has a smaller page size than the SGI Origin 2000. For larger loop bodies, if loop fusion prohibits software pipelining, loop blocking may be the next best option on the RISC platforms. We adopted the fused or the split versions for most of these types of loop nests in ZEUS-3D, depending on the complexity of the loop. For ZEUS-MP, split versions were avoided in favor of blocked versions to allow better cache use.

Data reuse was improved for ZEUS-MP compared to ZEUS-3D in other ways as well. Whereas ZEUS-3D consists of a large number of subroutines that perform a few simple operations on all elements of several input arrays, ZEUS-MP is structured so that once the elements of these input arrays are brought into cache, as many operations as possible are performed on them before they are overwritten. For example, the artificial viscosity substep is performed in 3 separate sweeps in ZEUS-3D, but in ZEUS-MP this substep is done in a single loop so that the values of $\rho$ and $\varepsilon$ are brought in from memory just once instead of 3 times. Moreover, in the same loop, $\rho$ and $\mathbf{v}$ are reused once again to compute momentum densities for a subsequent transport substep. Another technique used to boost the single-processor performance of ZEUS-MP was to in-line the routines which interpolate the values of advected quantities at the interfaces between mesh zones. The interpolation of $\rho$ and $\varepsilon$ was further combined into one loop, since the values of both quantities are defined at the same locations on the mesh.

With all these improvements, on RISC processors the per-processor performance of ZEUS-MP exceeds that of ZEUS-3D by 50 to 100 percent, unless the problem size is so small that a substantial portion of the field variable and scratch arrays fit in the cache. However, on vector machines ZEUS-3D runs nearly a factor of 2 faster than ZEUS-MP on a single processor.
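The data-reuse strategy just described can be sketched as follows. This is a simplified illustration and not the ZEUS-MP artificial-viscosity routine: the viscous update shown is schematic, and the array names d, e, v1, v2, v3, s1, s2, s3 and the constant qcon are assumptions. The point is simply that the density, internal energy, and velocities are brought into cache once per zone and used for several updates, including the momentum densities needed by a subsequent transport substep.

c     Schematic sketch only (not the ZEUS-MP source): one pass over the
c     mesh reads d (density), e (internal energy), and the velocities
c     once per zone, applies a simplified viscous heating to e, and
c     reuses d and the velocities to form the momentum densities
c     s1, s2, s3 needed by a subsequent transport substep.
      subroutine visc_and_mom (m, dt, qcon, d, e, v1, v2, v3,
     &                         s1, s2, s3)
      implicit none
      integer m
      real dt, qcon
      real d(m,m,m), e(m,m,m), v1(m,m,m), v2(m,m,m), v3(m,m,m)
      real s1(m,m,m), s2(m,m,m), s3(m,m,m)
      integer i, j, k
      real dv1, dv2, dv3, q1, q2, q3
      do k = 2, m-1
        do j = 2, m-1
          do i = 2, m-1
            dv1 = v1(i+1,j,k) - v1(i,j,k)
            dv2 = v2(i,j+1,k) - v2(i,j,k)
            dv3 = v3(i,j,k+1) - v3(i,j,k)
            q1 = 0.0
            q2 = 0.0
            q3 = 0.0
c           viscous pressure acts only in compressing zones
            if (dv1 .lt. 0.0) q1 = qcon * d(i,j,k) * dv1 * dv1
            if (dv2 .lt. 0.0) q2 = qcon * d(i,j,k) * dv2 * dv2
            if (dv3 .lt. 0.0) q3 = qcon * d(i,j,k) * dv3 * dv3
c           schematic viscous heating of the internal energy
            e(i,j,k) = e(i,j,k) - dt * (q1*dv1 + q2*dv2 + q3*dv3)
c           reuse d and the velocities while they are still encached
c           to form momentum densities for the transport substep
            s1(i,j,k) = d(i,j,k) * v1(i,j,k)
            s2(i,j,k) = d(i,j,k) * v2(i,j,k)
            s3(i,j,k) = d(i,j,k) * v3(i,j,k)
          enddo ! i
        enddo ! j
      enddo ! k
      return
      end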
Parallelization
Cray, SGI, and HP/Convex systems all feature compilers that can parallelize some loops automatically. However, a number of programming constructs inhibit this process, including subroutine/function calls, reduction operations (summation, global min/max), and scratch arrays passed as arguments or in common blocks. Loops which contain any of these constructs must be concurrentized manually using compiler directives. Fortunately, among different vendors, most directives are similar in both concept and syntax. On the Exemplar, some directives from other vendors are correctly interpreted. Parallelization directives can be inserted automatically by the EDITOR preprocessor, written for the original version of ZEUS-3D for Cray Autotasking by David Clarke and extended to SGI and Exemplar multiprocessors by Robert Fiedler. EDITOR scopes the variables in each loop nest, checks for data dependencies, and determines which variables should be declared local. Reduction operations require special consideration, since incorrect results can be obtained if a loop is forced by a directive to go parallel without ensuring that the (shared) reduction variable is updated by just one thread at a time. In ZEUS-3D, the reduction operations are isolated in separate loops to allow automatic parallelization.

In ZEUS-MP, the computational domain is divided into 3-dimensional tiles, each containing the same number of mesh zones. In order to evaluate the finite-difference forms of the evolution equations (1)-(3) at zones near tile surfaces, the dimensions of the field variable arrays are extended by two extra layers (ghost zones) to store "boundary" values. Whenever the field variables are updated, this boundary data becomes invalid
and new values must be received from neighboring tiles before the field variables (in zones requiring this data) can be updated again. Boundary values for surfaces along the physical boundaries of the computational domain are specified by the physical problem unless the physical boundary is periodic. Periodic boundary conditions are treated in the same manner as boundaries between tiles, since messages are passed between tiles in both situations. After the type of physical boundary condition for each axis is read in during initialization, we define a virtual Cartesian topology to map processors to tiles. The MPI specification includes an option to instruct the system to reorder the processors to minimize communication times for adjacent tiles. However, not all MPI implementations actually perform any reordering.

For optimum parallel scaling, communication in ZEUS-MP is overlapped with computation as much as possible by means of non-blocking send and receive operations (see Gropp et al., 1994, Chapter 4). Each substep is carried out by a driver routine which first calls "boundary value" routines to initiate the exchange of the minimum amount of field variable data between the boundary zones of neighboring tiles, and then calls a worker routine to update the field variables only in zones for which all required data is already available (see Flowchart 2). Ideally, the worker routine's computations should take more time than the communication operations previously initiated by the driver routine; otherwise execution is blocked until all communication initiated by that tile is completed. Unlike the transit of messages across the interconnection network, however, the communication overhead associated with packing and unpacking messages typically involves the CPU and is not overlapped with computation.

Exchanging data with neighboring tiles is more complicated for zones along the edges or on the corners of a tile. For these zones, field variable boundary data is required from tiles that are one step away in more than one coordinate direction, e.g., diagonally opposite an edge. Rather than initiate separate communication operations to transmit edge and corner boundary values for each tile, we exchange data between neighboring tiles in three stages, sending and receiving messages along just one coordinate axis per stage. These messages include the updated edge and corner data from neighboring zones. We divide the interior zones into 3 groups along the k-axis and work on one of the groups during each stage (see Diagram 1):

1) Boundary data is exchanged with neighbors along the i-axis while updates are performed for the first group of interior zones.
2) Boundary data is exchanged along the j-axis while updates are performed a) for the values of i skipped in step (1) in the first group of interior zones, and b) for the second group of interior zones (all i).
3) Boundary data is exchanged along the k-axis while updates are performed a) for the values of j skipped in step (2) in the first two groups of interior zones, and b) for the third group of interior zones (all i and j).
4) Updates are performed for the values of k skipped in the previous steps (all i and j).

At the very end of the time step, the size of the next time step is computed from the minimum value of the Courant limit over all zones. MPI and other message-passing libraries provide routines to perform such global reduction operations efficiently in parallel.
The time spent waiting for this operation to complete comprises only a fraction of the total communication time.
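The non-blocking exchange pattern described above maps onto a handful of MPI calls. The following is a minimal sketch, not the ZEUS-MP driver itself: the array name d, the buffer names, and the tile size n are assumptions, and only one field variable, one axis, and one ghost layer are shown. The ranks nbrlo and nbrhi of the neighboring tiles would come from MPI_CART_SHIFT applied to the virtual Cartesian topology mentioned above (MPI_PROC_NULL at a non-periodic physical boundary turns the corresponding sends and receives into no-ops).

c     Minimal sketch (not the ZEUS-MP source): non-blocking exchange of
c     one layer of boundary data along the i-axis, which the driver
c     overlaps with updates of interior zones that do not depend on the
c     incoming values.
      subroutine exchange_i (n, d, sendlo, sendhi, recvlo, recvhi,
     &                       comm, nbrlo, nbrhi)
      implicit none
      include 'mpif.h'
      integer n, comm, nbrlo, nbrhi, ierr, j, k
      integer req(4), stat(MPI_STATUS_SIZE,4)
      double precision d(0:n+1,0:n+1,0:n+1)
      double precision sendlo(n,n), sendhi(n,n)
      double precision recvlo(n,n), recvhi(n,n)
c     pack the two i-faces needed by the neighboring tiles
      do k = 1, n
        do j = 1, n
          sendlo(j,k) = d(1,j,k)
          sendhi(j,k) = d(n,j,k)
        enddo ! j
      enddo ! k
c     post the receives and sends without blocking
      call MPI_IRECV (recvlo, n*n, MPI_DOUBLE_PRECISION, nbrlo, 1,
     &                comm, req(1), ierr)
      call MPI_IRECV (recvhi, n*n, MPI_DOUBLE_PRECISION, nbrhi, 2,
     &                comm, req(2), ierr)
      call MPI_ISEND (sendhi, n*n, MPI_DOUBLE_PRECISION, nbrhi, 1,
     &                comm, req(3), ierr)
      call MPI_ISEND (sendlo, n*n, MPI_DOUBLE_PRECISION, nbrlo, 2,
     &                comm, req(4), ierr)
c     ... the worker routine would now update zones i = 2, n-1,
c     which do not require the incoming boundary values ...
      call MPI_WAITALL (4, req, stat, ierr)
c     unpack the ghost layers
      do k = 1, n
        do j = 1, n
          d(0,j,k)   = recvlo(j,k)
          d(n+1,j,k) = recvhi(j,k)
        enddo ! j
      enddo ! k
      return
      end

The global time-step reduction at the end of the step likewise reduces to a single collective call, for example (dtlocal and dtglobal are assumed names for the per-tile and global Courant limits):

      call MPI_ALLREDUCE (dtlocal, dtglobal, 1, MPI_DOUBLE_PRECISION,
     &                    MPI_MIN, comm, ierr)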
In this version of ZEUS-MP, each process writes a separate output file. These files may be combined later by the ZEUS-MP postprocessor "ZMP_PP", if desired. In contrast, ZEUS-3D offers a wide variety of output modes and can write any data residing in shared memory to a single file without regard to which processor owns it.
Theoretical Scaling Analysis
A straightforward analysis of the domain decomposition in ZEUS-MP and ZEUS-3D allows us to anticipate some of our parallel performance results reported below. For a more detailed discussion of performance models, see Foster (1995), Chapter 3. For an ideal parallel machine consisting of a number of von Neumann computers linked by an interconnection network, the cost of sending a message between two nodes is proportional to message length but independent of node location and network traffic (see Foster, 1995, Section 1.2.1). Below we demonstrate that on such a machine the 3-dimensional domain decomposition used in ZEUS-MP should lead to better scaling than the effectively 1-dimensional domain decomposition used in ZEUS-3D (see Diagram 2). We also address the limitations imposed by some of the non-ideal aspects of existing multiprocessors.

We perform two types of scaling studies. In the first type, the total number of grid points is scaled with the number of threads N so that the computational work per processor is constant. In the second type of scaling study, the total number of grid points is held constant while the number of processors sharing the work is increased.

Suppose a given problem run on a single processor takes $t_1$ seconds, and a problem with N times as many grid points run on N processors takes a total of $t_N$ seconds (summed over processors). The parallel speedup per processor E is given by

\[
E = \frac{t_1}{t_N / N} = \frac{N t_1}{t_N} \,, \qquad (4)
\]

and the speedup is simply $S = E N$. If communication accounts for all of the parallel overhead, then

\[
t_N = N (t_1 + t_c) \,, \qquad (5)
\]

where $t_c$ is the time per processor spent waiting for communication operations to complete. In this case,

\[
E = \frac{t_1}{t_1 + t_c} = \frac{1}{1 + t_c / t_1} \,. \qquad (6)
\]

If all communication is overlapped with computation, then $t_c/t_1$ is negligible and the speedup is perfectly linear. On the other hand, if the communication operations are performed serially (all processes wait for all communication), the communication time per thread is proportional to N. In this case E drops as we add more processors and no further parallel speedup is possible (the speedup $S = E N$ approaches a constant).
In ZEUS-MP, each thread works on a cubic or block-shaped tile (see Diagram 2). The bulk of the messages are passed only to a nearest neighbor and consist of a single 2-D layer of boundary data from one surface of a tile. During each time step, a typical tile sends or receives well over 100 messages of this type. Tiles with surfaces on physical boundaries have fewer neighbors with which to exchange boundary data, unless the boundary condition is periodic. In our experience the type of physical boundary condition has only a small effect on parallel speedups.

Consider a simulation in which each grid zone is of size unity (the "logical grid"). As the number of grid zones increases, the full logical computational volume increases accordingly. The amount of computational work performed by each processor is proportional to the logical tile volume, and the amount of communication required per processor is proportional to the logical tile surface area. For ZEUS-MP, when performing a scaling study in which the logical computational volume increases with N, the logical tile size and the computation per process are both fixed, and $t_c/t_1$ is constant. In ZEUS-3D, the computational domain is partitioned along only one direction. When the total computational work is scaled with N, the work per thread is constant, but the amount of communication per thread increases with the logical tile surface area. For a cubic computational domain, $t_c/t_1$ is proportional to $N^{2/3}$.

Now consider a scaling study in which the total number of grid points is constant. If $t_1$ is the execution time for the full problem on 1 processor and $t_N$ is the sum of the execution times for all N processors, the per-processor speedup is

\[
E = t_1 / t_N \,. \qquad (7)
\]

If communication again accounts for all of the parallel overhead, then

\[
t_N = t_1 + t_c \,, \qquad (8)
\]

where $t_c$ is now the total time spent on communication. In this case,

\[
E = \frac{t_1}{t_1 + t_c} = \frac{1}{1 + t_c / t_1} \,, \qquad (9)
\]

the same relation as equation (6) derived above for scaled total work, although $t_c$ and $t_1$ here refer to the full problem. When the total work is fixed, each process is given a chunk of the problem proportional to 1/N. For ZEUS-MP with cubic tiles, the logical surface area of each tile now scales as $1/N^{2/3}$, and hence the total amount of communication increases as $N^{1/3}$; therefore $t_c/t_1$ should scale as $N^{1/3}$. For ZEUS-3D, the total amount of communication is proportional to N, and $t_c/t_1$ scales as N. In this case, our performance model predicts that there should be no parallel speedup beyond a certain value of N.

From our above analysis we expect ZEUS-MP to scale better than ZEUS-3D on an ideal parallel machine for more than a few processors in both types of scaling study. For ZEUS-MP and scaled total work, the parallel speedup should be linear, although the slope
could be less than unity; for fixed total work, the speedup should approach $N^{2/3}$ for large N. The parallel speedup for ZEUS-3D should approach $N^{1/3}$ for scaled total work, but should eventually reach a constant value for fixed total work.

Our predictions may be inaccurate for real shared-memory multiprocessors because they differ from an ideal parallel machine in several important ways. First, sending and receiving messages always involves some latency. For short messages, latency may dominate the communication time. In this case, the changes in tile surface area with N no longer affect the communication time, and therefore for both codes $t_c/t_1$ is constant for scaled total work and scales as N for fixed total work. However, communication times under these circumstances would probably dominate the computation time, so good parallel speedups on large numbers of processors would not occur for either code. Another non-ideal feature of real multiprocessors is the fact that the communication time depends on the location of the processors exchanging messages. In multiprocessors such as the Exemplar, it takes somewhat longer to access data across the interconnection network (CTI rings) to a processor on a different 16-processor HYPERnode than it does to access data residing on the same HYPERnode via the crossbar switch. On the Origin, a processor has faster access to memory on its own node board than it does to memory on a node board that may be a few router hops away.

Real multiprocessors can also exhibit unexpectedly high parallel speedups. For fixed total work, the chunk that each processor works on gets smaller as N increases. Eventually, this chunk begins to fit in cache and performance receives a boost (relative to the per-processor performance for smaller N). A performance boost can also result with a very large fixed problem size when the chunks begin to fit in the memory most readily available to each processor. This effect is more likely to be seen on the Origin, in which the memory is distributed over many 2-CPU node boards, than it is on the Exemplar, in which the memory is distributed over 16-CPU HYPERnodes.
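The surface-to-volume estimates above can be collected in one place. Writing $M$ for the number of zones along each edge of the cubic global mesh and $m$ for the number of zones along each edge of a ZEUS-MP tile (both symbols are introduced here for bookkeeping only), the ratios quoted in the text follow directly:

\[
\textit{Scaled total work } (M^3 = N m^3): \qquad
\left.\frac{t_c}{t_1}\right|_{\rm MP} \propto \frac{m^2}{m^3} = {\rm const}, \qquad
\left.\frac{t_c}{t_1}\right|_{\rm 3D} \propto \frac{M^2}{m^3} \propto N^{2/3};
\]
\[
\textit{Fixed total work } (M = {\rm const}): \qquad
\left.\frac{t_c}{t_1}\right|_{\rm MP} \propto \frac{N\,(M^3/N)^{2/3}}{M^3} \propto N^{1/3}, \qquad
\left.\frac{t_c}{t_1}\right|_{\rm 3D} \propto \frac{N\,M^2}{M^3} \propto N.
\]

Inserting these ratios into $S = N E = N/(1 + t_c/t_1)$ from equations (4)-(6) and (7)-(9) reproduces the predictions stated above: linear speedup for ZEUS-MP with scaled work, $S \to N^{2/3}$ for ZEUS-MP with fixed work, $S \to N^{1/3}$ for ZEUS-3D with scaled work, and a constant ceiling for ZEUS-3D with fixed work.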
Scaling Study Results
Figures 1 and 2 show results on the Exemplar SPP-2000 for scaling studies in which the total work is scaled with the number of processors. We study more than one per-tile problem size in order to determine the effect of cache performance and logical tile surface-to-volume ratio on scaling. Both measured (solid) and theoretical (dotted) speedup curves are shown for ZEUS-MP (thick) and ZEUS-3D (thin). The theoretical speedup curves are based on our analytical predictions above, with the constant of proportionality in the ratio of communication time to computation time evaluated by passing the curves through the measured speedup for 8 threads. The speedup is determined by comparing CPU times on processor 0 for 10 time steps. These times do not include any time spent initializing the problem or finishing up.

ZEUS-MP exhibits good scaling all the way out to 64 processors (4 HYPERnodes) for either problem size, with the larger problem scaling somewhat better. The 32 x 32 x 32 tile requires about 7 MB per processor and the message size is around 11 kB, while the 64 x 64 x 64 tile requires about 56 MB and passes 38 kB messages. Therefore, the ratio of communication to computation for the larger tile is roughly half its value for the smaller tile. Moreover, latency should be less important for larger message sizes. The theoretical speedups are linear, but with a slope less than unity, since the communication time is not negligible compared to the computation time (see equation (6)). The somewhat longer access time for exchanging messages across HYPERnodes may account for the change in slope of the measured speedup at about 16 processors.
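For reference, the dotted curves in Figures 1 and 2 presumably follow from equation (6) in the form

\[
S_{\rm MP}(N) = \frac{N}{1 + c}\,, \qquad S_{\rm 3D}(N) = \frac{N}{1 + c\,N^{2/3}}\,,
\]

with a separate constant $c$ for each code chosen so that $S(8)$ matches the measured 8-thread speedup; this is our reading of how the curves were constructed, based on the description above.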
ZEUS-3D does not scale as well as our analysis indicates that it should, particularly for the larger problem size. This could be due to the fact that ZEUS-3D generates many more cache misses than ZEUS-MP, especially on the larger problem, which increases memory traffic as the missing data are fetched from memory far more often than would occur in a better optimized version. The change in slope of the measured speedup for ZEUS-3D at about 16 processors occurs because the data must be distributed across multiple HYPERnodes beyond that point.

Figures 3 and 4 show results from scaling studies in which the total work is fixed. For the smaller problem, the work per processor decreases from 128 x 128 x 128 for 1 processor to as little as 32 x 32 x 32 for each of 64 processors. The speedup curves agree fairly well with our predictions for both codes. As in the case of scaled total work, scaling is somewhat better on the larger problem for ZEUS-MP. Again, the measured slope for ZEUS-3D changes at about 16 processors because the data must be distributed across multiple HYPERnodes.
Conclusions
The shared-memory and message-passing paradigms are equally valid and inherently efficient. The additional effort required to rewrite existing codes for message-passing is offset by the control gained over the distribution of data among processes. In this report, we apply a 3-D algorithm to a 3-D problem which is scaled in all 3 dimensions, since this is a common situation. The use of data-layout directives to specify a 3-D domain decomposition could potentially help the shared-memory implementation of the ZEUS algorithm (ZEUS-3D) attain parallel speedups comparable to those achieved by the message-passing implementation (ZEUS-MP).

We transformed many of the loop nests in the original ZEUS code, written for vector processors, to improve data locality for cache-based RISC architectures. Timings of several versions of a representative code fragment quantify the influence of the memory access pattern on single-processor performance. Interchanging loops to reduce the stride, fusing loops to avoid flushing the cache, and blocking loops to reduce TLB misses are all effective optimization strategies on SGI and HP/Convex RISC systems. Our tuned code fragment runs at 166 MFLOPS on a Cray C90, 2.8 times faster than the original version. The fastest RISC system ran the same tuned version 30 times faster than it ran the original version, but the absolute performance was under 34 MFLOPS for a problem size too large to fit in cache. On these systems, execution times for this code fragment are limited by memory system bandwidth, not raw processor speed, because the bodies of the loops contain 3 memory references per floating-point operation. Memory bandwidth similarly limits the single-processor performance of many CFD applications to between 10 and 30 percent of the peak speed.

Our simple performance model shows that, all else being equal, ZEUS-MP should exhibit better parallel scaling than ZEUS-3D because the ratio of communication to computation increases less rapidly with the number of processors N for ZEUS-MP, by a factor of $N^{2/3}$. This is due to the fact that in ZEUS-MP the domain is decomposed into 3-D cubic tiles, whereas in ZEUS-3D only the outer loop of a nest is parallelized and the domain is effectively decomposed into 1-D slabs. This advantage is independent of whether the total computational work is fixed or is scaled with the number of processors. Our benchmarks conclusively demonstrate the superiority of our MPI-based implementation over the loop-parallel version. ZEUS-MP scales far better than ZEUS-3D in every test for more than 4 processors, and the per-processor performance of ZEUS-MP
is also significantly better (unless the problem fits entirely in cache) due to increased reuse of encached data. We obtained the best speedups with ZEUS-MP on the larger of two per-processor problem sizes: 56/64 for scaled total work and 47/64 for fixed total work.
Disclaimer
The HP/Convex Exemplar SPP-2000 on which most of these experiments were conducted was a "field test" system, not a production version. The benchmarks presented here are to be considered preliminary and do not guarantee a particular level of performance in future hardware or software releases.
References
Clarke, David A., M. L. Norman, and R. A. Fiedler, "ZEUS-3D User Manual", LCA Technical Report No. 15, 1994.

Dowd, Kevin, "High Performance Computing", O'Reilly and Associates, Inc., 1993.

Foster, Ian T., "Designing and Building Parallel Programs", Addison-Wesley, 1995.

Gropp, William, E. Lusk, and A. Skjellum, "Using MPI", MIT Press, 1994.

Snir, Marc, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Dongarra, "MPI: The Complete Reference", MIT Press, 1996.

Stone, James M., and M. Norman, Astrophysical Journal Supplement Series, vol. 80, pp. 753-818, 1992.