Evaluating Distributed Shared Memory for Parallel Numerical Applications
Larry Wittie, Gudjon Hermannsson, and Ai Li
TR #93/01, January 1993
To appear in the Sixth SIAM Conference on Parallel Processing for Scientific Computing, March 1993
This research has been supported in part by Department of Energy/Superconducting Super Collider contract SSC-91W09964; by National Science Foundation grants for equipment (CCR87-05079, CDA88-22721, and CDA9022388) and research (CCR87-13865 and MIP89-22353); and by Office of Naval Research grant N00014-88-K-0383. Computer Science Department, SUNY Stony Brook, NY 11794-4400,
[email protected] (516)632-8456
Abstract
One method to evaluate a distributed shared memory (DSM) system is to analyze its performance for a variety of algorithms using a simulator. Eager DSM systems reduce latencies for remote data accesses by sending shared data changes immediately to all processors that may need them. Simulation results for two parallel application programs, Gaussian elimination and Fast Fourier Transform, are compared for eagersharing SESAME versus more traditional demand-driven DSM systems. The evaluations indicate that (1) eager DSM systems can scale hundreds of times more efficiently in large networks; and (2) different data sharing strategies greatly affect system performance. Eager sharing lets systems of thousands of processors efficiently run parallel scientific programs by providing each processor with needed data at the earliest possible time. On-going research is creating parallel programming environments for high performance SESAME systems.
1 Introduction

This paper shows how latency hiding is needed to run parts of a parallel program efficiently on thousands of processors simultaneously. Distributed shared memory (DSM) systems have been proposed to give users of parallel computers the programming convenience of shared memory, while simultaneously scaling to large numbers of processors and avoiding the severe bus-contention or network-expense problems of existing multiprocessors. DSM systems pass short messages, but hide the underlying mechanism. DSM systems are feasible because only a small fraction of write accesses in parallel programs, much less than 3% of all memory references[5], are to variables used by more than one processor. DSM systems maintain local memory, or cache, copies of shared data.

One of the main classifications of DSM systems depends on the way they handle remote memory accesses. It defines a spectrum from on-demand to eager sharing. Older on-demand mechanisms introduce long delays whenever processors wait to fetch data across the network. On the other hand, as the name indicates, eagersharing memories immediately ("eagerly") send each changed shared datum to whatever processors may need it. Demand-driven mechanisms[10, 6, 2, 3] for DSM minimize network traffic by passing only needed data, whereas eager sharing may pass data values that are never used.

Eager sharing is simple. Each processor sharing variable s has a local copy of s. Whenever it changes s, the new value is sent to the other processors that share it. With demand fetch, whenever a processor needs shared data not already present in its local memory, it sends a request to obtain the data. Since newly changed data can be present locally before a processor needs them, eager sharing has the potential to overlap communication with computation to the greatest extent possible. Processors may never be idle.

A few parallel systems, DASH[10], KSR1 from Kendall Square, TERA[1], and SESAME[12], provide some eagersharing hardware to disseminate shared data values whenever they are changed. The SESAME (Scalable Eagerly ShAred MEmory) project[12] at Stony Brook is linking fast workstations with latency-hiding interfaces that give programmers the illusion of physically shared memory.
Each workstation in a SESAME network has an interface which snoops on its local memory system. Whenever a variable value is written to local memory, its address is checked against a preset directory of memory regions containing shared variables, and a copy of the new value is immediately multicast to all processors that share the variable. If no processor shares it, or if sharing has been temporarily disabled by the host, the value is discarded. Disabling sharing is useful to decrease network load when interim values do not need to be shared. For SESAME, the interfaces route each sharing message and steal a memory cycle to store each incoming data copy into local memory. Memory contention from locally shared incoming copies is the only cost to hosts from sharing. Sending is cost-free; receiving has a tiny cost that may, however, become significant if too many values are shared too frequently.

SESAME interfaces distribute copies of each newly written shared variable in parallel over a forest of shortest-path spanning trees so that sharing will work efficiently in large networks. Sharing is organized within multicast domains, with centralized control of each domain for efficiency and consistent sequential ordering[9] of all writes within each group. Control roots for different domains are spread throughout the network to avoid bottlenecks that would limit scaling.

The next section describes the simulator DSIM, used to evaluate eagersharing and demand-fetch mechanisms. Section 3 gives processor efficiency estimates for parallel execution of Gaussian elimination and Fast Fourier Transform under eager sharing or demand fetch. Section 4 gives conclusions.
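The per-write decision made by a SESAME interface, as described at the start of this section, can be summarized in a few lines of C. This is only an illustrative sketch, not the hardware design: the directory layout and the names (shared_region, multicast_to_group, snoop_write) are invented here for exposition.

    /* Illustrative sketch of the eagersharing interface's per-write check.
       All names and structures are hypothetical, for exposition only. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uintptr_t base, limit;   /* address range holding shared variables    */
        int       group;         /* multicast domain that shares this range   */
        int       enabled;       /* host may disable sharing of interim values */
    } shared_region;

    extern shared_region directory[];
    extern size_t        num_regions;
    extern void multicast_to_group(int group, uintptr_t addr, uint64_t value);

    /* Called (conceptually) for every write the interface snoops on the
       local memory bus.  If the address falls in an enabled shared region,
       the new value is multicast at once; otherwise it is simply dropped. */
    void snoop_write(uintptr_t addr, uint64_t value)
    {
        for (size_t i = 0; i < num_regions; i++) {
            shared_region *r = &directory[i];
            if (addr >= r->base && addr < r->limit) {
                if (r->enabled)                      /* sharing may be off */
                    multicast_to_group(r->group, addr, value);
                return;                              /* at most one region */
            }
        }
        /* address is not shared: no network traffic, no cost to the host */
    }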
2 The DSIM Simulation of Parallel Programs

Computer simulations allow exploration, evaluation, and analysis of systems that are new, not physically available, or too expensive to bring into a research lab. Simulating a new computer system before actually building it can answer critical questions, such as whether the system design is feasible and how to configure components to yield maximal performance. Simulation is especially desirable for evaluating systems with non-traditional designs, to avoid wasting time and money.

The DSIM distributed shared memory simulator evaluates the scalability of parallel C programs running on DSM systems. DSIM predicts average CPU efficiency for parallel programs using eager sharing or demand sharing methods like those in DASH[10]. DSIM is an event-driven simulator. The primary events are estimated completion times of lengthy calculations, propagation of messages in the network, synchronization accesses, and memory accesses through bus and network interfaces. Events signal possible major changes in system behavior, for example, an increase in memory system utilization that may slow processor speed, or a change in the value of a synchronization variable that generates sharing messages which may reduce computing capacity.

The efficiency model for DSIM assumes that the instantaneous computation rate of each processor is the minimum allowed by three constraints:

(1) its CPU-to-memory bandwidth versus the number of operands that must be fetched from or stored into local memory for each calculation step;
(2) its sharing interface acceptance rate versus the number, if any, of shared variables written during each calculation; and

(3) the peak processor MFLOPS rate.

Delays on synchronization locks and barriers are recorded as intervals of zero computation rate. The model is justified by observing that computation speeds for most pipelined processors are limited mainly by the CPU-memory bandwidth needed to access data[7].

The parallel C language application code used for each simulation is an SPMD (Single-Program, Multiple-Data) program. Every processor executes the same program. However, execution is not performed in lock step. Program flow can depend on a variable CPU, which encodes the identification number of the executing processor. Different processors can execute different parts of the program and can evaluate only specified data items. At explicit "barrier" synchronization points, each processor must wait, not executing, until all processors have reached the same point. Each SPMD program is translated into internal code structures which DSIM interprets. A compiler front end[8] does the translation.
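As a concrete illustration of the three-way minimum above, the rate model can be written as a small C function. This is a sketch with hypothetical parameter names, not DSIM's actual code.

    /* Sketch of DSIM's instantaneous-rate model (hypothetical names).
       The processor runs at the slowest of the three limits; waits on
       locks and barriers are handled separately as zero-rate intervals. */
    double instantaneous_mflops(double mem_bw_words_per_sec, /* CPU-memory bandwidth */
                                double words_per_flop,       /* operands moved per step */
                                double iface_accepts_per_sec,/* sharing-interface rate */
                                double shared_writes_per_flop,
                                double peak_mflops)
    {
        /* (1) memory-bandwidth-limited rate */
        double mem_limit = mem_bw_words_per_sec / words_per_flop / 1e6;

        /* (2) sharing-interface-limited rate (no limit if nothing is shared) */
        double share_limit = peak_mflops;
        if (shared_writes_per_flop > 0.0)
            share_limit = iface_accepts_per_sec / shared_writes_per_flop / 1e6;

        /* (3) peak processor speed caps the result */
        double rate = mem_limit;
        if (share_limit < rate) rate = share_limit;
        if (peak_mflops < rate) rate = peak_mflops;
        return rate;
    }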
3 Execution Efficiency of Parallel Numerical Applications

Parallel Gaussian elimination (dgauss)[4] and Fast Fourier Transform (FFT)[11] codes have been used with DSIM to compare data sharing methods and to determine how to structure parallel programs to run efficiently on thousands of processors. DSIM estimates the efficiency of each application as if it were executed using demand fetch or eager sharing.
3.1 Dgauss: Gaussian Elimination
Parallel Gaussian elimination has been used with DSIM to compare data sharing methods since:

(1) Gaussian elimination is technologically important for solving systems of linear equations;

(2) a well designed parallel algorithm (gauss.c) from Livermore National Labs runs rapidly on Sequent shared memory multiprocessors[4]; and

(3) Gaussian elimination in theory can run rapidly on parallel processors, but in practice is hard to run efficiently, for several reasons. Simplistic solutions that globally share all data saturate any bus or network connecting many processors to memory. Rapid data writes, one after every two non-cache fetches and two floating-point operations, make performance very sensitive to synchronization delays. Processor workload per iteration varies greatly, from a full matrix row to a single element at the triangle tip.

In phase one of gauss.c, each of N processors has a set of rows (i, i+N, i+2N, ..., for some i <= N) which it alone changes, reducing the number of non-zero terms whenever a higher row has been completely reduced. Phase one requires O((2/3)D^3) floating-point operations and takes at least D^3/(N x peak CPU rate) time, since each processor is idle once its last row has been reduced to its upper triangular length. Memory system conflicts normally lengthen times for the SAXPY (Sum aX plus Y) reductions. Phase two is quicker, requiring only O(D^2) operations to calculate
and substitute the solution for each of the D variables in turn back into the D equations to produce the rest.

Dgauss.c, as simulated with DSIM, differs in three ways from Livermore's gauss.c:

(1) A few lines selectively prevent unneeded interprocessor sharing of temporary values to improve efficiency. If all changes are shared, network power under eager sharing is equivalent to only 1.3 CPUs at the peak rate, regardless of network size N. If only final values are shared, network power is roughly N/3 CPUs. Changed row elements are shared only during their final row reduction.

(2) Prefetch statements have been added to the reduction loop. These statements are ignored during eager sharing and pure demand fetch simulations.

(3) Synchronization flags allow computational granularity to vary from a full row to one cache line as small as two data values.

DSIM has determined dgauss.c efficiency for three sharing methods: demand fetch, demand fetch with prefetch, and eager sharing. All runs start with exactly one matrix copy somewhere in memory. The methods differ only in the way signals are passed to trigger sharing of variables. For pure demand fetch, a cache line is shared only when the receiving processor requests access to it. For prefetch, the algorithm periodically requests cache line(s) for the next few reducing row elements before it needs them. For eager sharing, each row value datum is multicast whenever its final value is set. The two demand fetch schemes use a directory similar to that for DASH[10]. Each shared cache line has two entries: the index of the owning processor which last changed it and a list of all processors that have acquired a copy. Whenever a processor changes a datum, it becomes owner of the corresponding cache line and all old copies are invalidated.

Figure 1 shows cumulative network computing power for Gaussian elimination versus the logarithm of network size, from N = 1 to 400 computers. Input is a densely non-zero 400x400 matrix representing a family of 400 linear equations. Network power is average processor efficiency times network size. Efficiency is the percentage of peak processor speed. Each processor is assumed to have a peak computation speed of 33 MFLOPS for 64-bit data and a fast 400 MB/sec local memory bus. Each data sharing hop takes 200 ns, the delay possible with eagersharing interfaces using 1 gigabit/sec links. Prefetch results show the best combination of fixed sizes for two parameters: how many cache lines to prefetch at once and how long before the cache values are needed in calculations.

From Figure 1, the differences in scaling ability of the three sharing methods are clear. For both demand sharing methods, large networks are less powerful than one processor. For demand fetch, network power peaks at 6 CPUs with an effective power of only 4.2; for prefetch, near 20 CPUs, totaling 13.7; and for selective eager sharing, at 400 CPUs, equivalent to 126.8. For eager sharing among 33 MFLOPS processors, the sustained aggregate computing power is 4.18 gigaFLOPS. For dgauss with multiple flags per row, eager sharing extracts ever-increasing network power from more processors.

Figure 2 shows that the same eagersharing interfaces can support larger problems running on huge networks. For Gaussian elimination of 2800 equations with one flag per row, network performance peaks between 900 and 1400 processors, with a cumulative sustained power of 690 CPUs of 33 MFLOPS, or 23 gigaFLOPS. Demand fetch executions of such large programs cannot be simulated because there is not enough room in 128 MB of DSIM workstation memory for the roughly 10^10 (2,800^3/2) cache line directory entries needed.
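To make the selective sharing of change (1) concrete, the sketch below shows one way the phase-one reduction loop could toggle sharing so that only a row's final values are multicast. It is a simplified illustration, not the actual dgauss.c: names such as share_off, share_on, and wait_for_pivot are hypothetical placeholders for the program's real primitives and synchronization flags.

    /* Simplified SPMD sketch of dgauss phase one (not the actual dgauss.c).
       CPU is this processor's id, 0 <= CPU < N; share_on/share_off stand in
       for whatever primitives toggle eager sharing of a row's memory. */
    #define D 400                        /* number of equations */
    extern double a[D][D + 1];           /* matrix plus right-hand side */
    extern int    CPU, N;
    extern void   share_off(double *row), share_on(double *row);
    extern void   wait_for_pivot(int k); /* spins on the row-k "done" flag */

    void phase_one(void)
    {
        for (int k = 0; k < D - 1; k++) {
            wait_for_pivot(k);               /* row k must be fully reduced */
            /* each processor owns rows CPU, CPU+N, CPU+2N, ... */
            for (int i = k + 1; i < D; i++) {
                if (i % N != CPU) continue;
                int last = (k == i - 1);     /* the row's final reduction?  */
                if (!last) share_off(a[i]);  /* interim values stay local   */
                double m = a[i][k] / a[k][k];
                for (int j = k; j <= D; j++) /* SAXPY reduction of row i    */
                    a[i][j] -= m * a[k][j];
                if (!last) share_on(a[i]);
                /* when last is true, the interface multicasts the final
                   values as they are written and the row's flag is set    */
            }
        }
    }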
[Figure 1 appears here: a log-log plot of network power (2 to 128 CPUs) versus number of CPUs (2 to 400) for three curves: Eager Share, Prefetch, and Demand Fetch.]

Figure 1: Network Power (in CPUs) for Gaussian Elimination: 400 Equations
[Figure 2 appears here: 1000*efficiency and network power (0 to 1000) plotted against number of CPUs (2 to 2800, log scale).]

Figure 2: Gaussian Elimination of 2800 Equations: CPU Efficiency and Network Power
[Figure 3 appears here: 1000*efficiency (0 to 1000) versus CPUs (2 to 2048, log scale) for six curves: 1: Selective, 65536 Data; 2: Selective, 4096 Data; 3: Pipelined Selective, 65536 Data; 4: Pipelined Selective, 4096 Data; 5: Global, 65536 Data; 6: Demand, 65536 Data.]

Figure 3: CPU Efficiency of FFT for Selective, Pipelined, and Global Sharing
3.2 Parallel Fast Fourier Transform
Performance simulations of Fast Fourier Transform (FFT)[11] have been used to determine other factors critical to efficient eager sharing: selective distribution of shared results and overlapping of communication with computation. One reason for choosing FFT is its very fine computation grain when using many processors. Each processor calculates only a few values in each iteration and then waits for remote results from one other processor before starting the next iteration. Each processor sends results to a different processor in every iteration. For efficient execution, great care must be taken to share values only when and where needed. One proposed mechanism for SESAME eagersharing interfaces is hardware support for dynamically selective sharing. DSIM results show this support is needed.

The one-dimensional Fourier transform of a sequence of M = 2^L complex numbers (x_0, ..., x_{M-1}) can be computed as an FFT with computation complexity O(M log M):

    F(x_k) = \sum_{j=0}^{M-1} x_j e^{2\pi i jk/M}
           = \sum_{j_0=0}^{1} \sum_{j_1=0}^{1} \cdots \sum_{j_{L-1}=0}^{1} x_j e^{2\pi i jk/M}    (1)

where i = \sqrt{-1}, e^{i\theta} = \cos\theta + i\sin\theta, k = 0, ..., M-1, and the binary digits of j are j_{L-1}, ..., j_1, j_0.

One way to solve the FFT in parallel is to let each of N processors handle M/N data per iteration. The iterations do the sums from right to left for all x_k and use each set of results as inputs to the next iteration. For example, in the first iteration, x_0 and x_{M/2} will be combined to calculate new values for iteration two.
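The per-iteration pairing just described can be sketched as a radix-2 butterfly in C. The fragment below only illustrates the data movement under the M/N decomposition stated above; the indexing details of the Pease formulation[11], the twiddle-factor schedule, and the arrival flags are simplified, and all names are invented for exposition.

    /* Sketch of one radix-2 FFT iteration in the SPMD style described above.
       Indexing follows the simple pairing (x_j with x_{j+M/2} in the first
       iteration); Pease-formulation details are simplified. */
    #include <complex.h>
    #include <math.h>

    extern double complex x[];   /* this iteration's input, M points total */
    extern double complex y[];   /* this iteration's output                */
    extern int CPU, N, M;        /* processor id, network size, data size  */

    void fft_iteration(int half)            /* half = M/2 in iteration one */
    {
        /* each of N processors handles M/N points: this processor owns
           pair indices CPU*half/N .. (CPU+1)*half/N - 1                   */
        int per_cpu = half / N;
        for (int t = 0; t < per_cpu; t++) {
            int j = CPU * per_cpu + t;
            double complex w = cexp(2.0 * M_PI * I * j / M);   /* twiddle */
            y[j]        = x[j] + x[j + half];
            y[j + half] = (x[j] - x[j + half]) * w;
            /* writing y[...] is what triggers eager sharing: the interface
               multicasts each final value to the one processor that needs
               it in the next iteration (selective sharing)                */
        }
    }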
Figure 3 shows average CPU efficiency during a linear FFT with D = 4096 and 65,536 data points for eagersharing networks of N = 1 to 2048 processors, each with 33 MFLOPS of peak computing power. During each iteration, each processor calculates D/2N pairs of complex data values and must send one complex value plus an arrival flag to one other node for the next iteration. Each value is represented by two floating-point data words, the real and imaginary parts. With a flag, each value forms a sharing triplet. If sharing is selective, exactly one node receives each triplet and each node receives D/2N triplets per iteration, versus D - D/N triplets for fully global sharing.

For all but the largest network sizes, sharing latency is totally overlapped by calculations involving other data. For selective sharing, efficiency stays high until there are only a few sets of calculations per processor in each iteration, then falls as more and more of the network latency becomes visible. With global sharing of data, overused memory systems cause efficiency to drop at N = 16 CPUs, as soon as the calculation phase is short enough that each host memory is saturated by the nearly 3D values that arrive per iteration. Limiting sharing is critical to achieving high performance in scalably large eager DSM systems.

Two FFT codes have been tested (a simplified sketch of the two waiting disciplines appears at the end of this subsection):

(1) Selective: each processor waits until all the data needed in the current iteration have arrived before starting computing, and

(2) Pipelined: each processor starts computing whenever any data used in the current iteration are available.

Figure 3 shows that for large grain sizes, the pipelined version is slightly less efficient, since it sends many more arrival flags, which steal memory cycles to be stored. However, it performs much better when there are so many processors that task times are short enough for initial waits to be significant. With 4,096 or more data, eagerly shared pipelined FFT efficiency stays above 60% for a thousand or more processors. With other sharing methods, using more processors usually quickly degrades efficiency. Eager sharing allows overlapping of computations and communications. It hides sharing latencies and sustains high efficiencies even for fine-grained computations involving very many processors.

The FFT results are shown in Figure 4 from a more familiar viewpoint, in the form of equivalent network power, or sustained speedup. It is apparent that eager sharing is superior to demand fetch, even when using global sharing. Selective eager sharing achieves nearly linear speedup.
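The fragment below sketches the difference between the two codes as two waiting disciplines. It is a hedged illustration: the flag array, pair count, and compute_pair are hypothetical names, and a real pipelined code may consume triplets in arrival order rather than index order.

    /* Sketch of the selective and pipelined waiting disciplines.
       flag[j] is set (by an eagerly shared write from the sending node)
       when the remote triplet for pair j of this iteration has arrived. */
    extern volatile int flag[];
    extern int pairs;                    /* D/2N pairs owned per iteration */
    extern void compute_pair(int j);

    void selective_iteration(void)       /* code (1): wait for all data */
    {
        for (int j = 0; j < pairs; j++)
            while (!flag[j]) ;           /* spin until every input is here */
        for (int j = 0; j < pairs; j++)
            compute_pair(j);
    }

    void pipelined_iteration(void)       /* code (2): compute as data arrive */
    {
        for (int j = 0; j < pairs; j++) {
            while (!flag[j]) ;           /* wait only for this pair's input */
            compute_pair(j);             /* overlap: later triplets arrive
                                            while earlier pairs compute     */
        }
    }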
4 Conclusions for Large Distributed Shared Memory Systems

As can be seen from the simulations, demand fetch does not scale well for Gaussian elimination, which exhibits well structured but global data sharing. Codes that share variables only between neighboring processors usually run efficiently on any parallel computer, including message-passing "hypercubes".
[Figure 4 appears here: a log-log plot of network power (2 to 2048) versus number of CPUs (2 to 2048) for three curves: Eager Sharing, Global Sharing, and Demand Fetch, each with 65536 data.]

Figure 4: Network Power for FFT with 65536 Data
Adding prefetch to demand fetch only slightly improves scalability for Gaussian elimination, mainly because of the overhead in responding separately to many requests for newly reduced row values and flags. Gaussian elimination scales very well using eager sharing. For FFT, selective eager sharing improves efficiency markedly by reducing sharing traffic and by allowing even very fine-grained computations to overlap communication times and avoid delays. These results indicate that dynamically selectable eager sharing of variables should be an option for programmers of future massively parallel distributed shared memory systems.

Distributed shared memory systems give programmers the convenience of shared memory and the scalability of message passing protocols without the coding difficulty. Eagersharing protocols supported by hardware interfaces allow efficient execution of parallel programs that are latency sensitive because of fine-grained computations. This paper compares demand fetch with eager sharing for two applications: Gaussian elimination and Fast Fourier Transform. The simulation results show that eager sharing allows sustained speeds for both applications that are above 60% of peak performance, even for networks containing as many as 1,000 fast processors. The simulation results show orders of magnitude faster execution of massively parallel programs using selective eagersharing rather than demand fetch mechanisms. Eager sharing allows many more processors to work efficiently on large single problems, shortening run times.

Current SESAME research includes development of new methods for structuring parallel algorithms to run efficiently on large eagersharing networks. Eagersharing mechanisms are important enough for achieving general purpose high performance parallel computing to merit inclusion in the cache memory systems of future massively parallel supercomputers.
References

[1] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. Computer Architecture News: SigArchNews, 18(3):1-6, September 1990. Reprinted from Proceedings of the 1990 International Conference on Supercomputing.

[2] M. Carlton and A. Despain. Multiple-Bus Shared-Memory System: Aquarius Project. IEEE Computer, 23(6):80-83, June 1990.

[3] D.R. Cheriton, H.A. Goosen, and P.D. Boyle. Multi-Level Shared Caching Techniques for Scalability in VMP-MC. The 16th Ann. Int. Symp. on Comp. Arch., pages 16-24, May 1989.

[4] G.A. Darmohray and E.D. Brooks III. A Parallel Gauss Elimination Algorithm with Minimized Barrier Synchronization. Technical Report UCRL-101587, Lawrence Livermore National Laboratory, 1989.

[5] S.J. Eggers and R.H. Katz. A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation. The 15th Ann. Int. Symp. on Comp. Arch., pages 373-382, May 1988.

[6] J.R. Goodman and P.J. Woest. The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor. The 15th Ann. Int. Symp. on Comp. Arch., pages 422-431, May 1988.

[7] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1990.

[8] G. Hermannsson, A. Li, and L. Wittie. E-C: A Front End for Parallel System Simulations. Technical Report TR-91/23, SUNY Stony Brook, December 1991.

[9] G. Hermannsson and L. Wittie. Scalable Group Write Consistency. Technical Report TR-92/19, SUNY Stony Brook, October 1992.

[10] D. Lenoski, K. Gharachorloo, J. Laudon, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. Design of Scalable Shared-Memory Multiprocessors: The DASH Approach. Proceedings of CompCon Spring 90, pages 62-67, February 1990.

[11] M.C. Pease. An Adaptation of the Fast Fourier Transform for Parallel Processing. Journal of the ACM, 15:252-264, April 1968.

[12] L.D. Wittie, G. Hermannsson, and A. Li. Eager Sharing for Efficient Massive Parallelism. Proceedings of the 1992 International Conference on Parallel Processing, pages 251-255, August 1992.