International Journal of Parallel Programming, Vol. 29, No. 3, 2001
A Comparison of MPI, SHMEM and Cache-Coherent Shared Address Space Programming Models on a Tightly-Coupled Multiprocessor

Hongzhang Shan(1) and Jaswinder Pal Singh(1)

(1) Department of Computer Science, Princeton University; e-mail: {shz, jps}@cs.princeton.edu

Received October 1999; revised September 2000

We compare the performance of three major programming models on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular and predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by modifying either the application programming interfaces or the implementations of the programming models on this type of tightly-coupled multiprocessor platform.

KEY WORDS: Parallel programming models; performance; ease of programming; shared memory; message passing.

1. INTRODUCTION

Architectural convergence has made it common for different programming models to be supported on the same platform, either directly in hardware or via software. Three common programming models in use today are
(i) explicit message passing (MP, exemplified by the Message Passing Interface or MPI standard(1)), in which both communication and replication are explicit; (ii) a cache-coherent shared address space (CC-SAS), in which both communication and replication are implicit; and (iii) the SHMEM programming model. SHMEM is like MPI in that communication and replication are explicit and usually made coarse-grained for good performance; however, unlike the send-receive pair in MPI, communication in SHMEM requires processor involvement on only one side (using put or get primitives), and SHMEM allows a process to name or specify remote data via a local name and a process identifier.

On the platform side, high-performance computing is converging to mainly two types of platforms: (i) tightly-coupled multiprocessors, which increasingly support a cache-coherent shared address space in hardware, and in which the hardware support is leveraged to implement the MP and SHMEM models efficiently as well; and (ii) less tightly-coupled clusters of either uniprocessors or such tightly-coupled multiprocessors, in which all the programming models are implemented in software across nodes. From both a user's and a system designer's perspective, this state of affairs makes it important to understand the relative advantages and disadvantages of these three models, in both programmability and performance, when implemented on these two types of platforms. Our focus in this paper is on the former, tightly-coupled multiprocessor platform. In particular, we examine an SGI Origin2000 machine, a cache-coherent distributed shared memory (DSM) machine, as an aggressive representative that is widely used in high-performance computing.

The tradeoffs between models depend on the nature of the applications as well. For certain classes of irregular, dynamically changing applications, it has been argued that a CC-SAS model has substantial algorithmic and ease-of-programming advantages over message passing that often translate to advantages in performance as well.(2, 3) The best implementations of such applications in the CC-SAS and MP models often look very different. While it is very important to examine the programming model question for such applications, we leave this more complex and subjective question to future work. In this paper, we restrict ourselves to applications that are either regular in their data access and communication patterns or that perform irregular accesses but do not require fine-grained dynamic replication of irregularly communicated remote data. We use applications or kernels for which the basic parallel algorithm structures are very similar across models and the amount of useful data communicated is about the same, so that differences in performance can be attributed to differences in how communication is actually performed. Within this class, we choose programs that cover many of the most interesting communication patterns, including
regular all-to-all personalized (FFT), nearest-neighbor and multi-grid (exemplified by Ocean, a computational fluid dynamics application), irregular all-to-all personalized (radix sorting, sample sorting), and multicast oriented (LU). In particular, we are interested in the following questions, for which our results will be summarized in Section 6:

• For these types of fairly regular applications, is it indeed the case that parallel algorithms can be structured in the same way for good performance in all three models? Or do we need to restructure the algorithms to match a programming model? Where are the main differences in high-level or low-level program orchestration?

• Are there substantial differences in performance under the three models?

• If so, where are the key bottlenecks in each case? Are they similar or different aspects of performance across models?

• Can these bottlenecks be alleviated by changing the implementation of the programming model, or do we need to change the algorithms or data structures substantially? If the former, does this require changes in the programming model or interface visible to the application programmer as well?

The rest of the paper is organized as follows. Section 2 briefly examines some related work in comparing the message passing and shared memory programming models. Section 3 describes the Origin2000 platform and the three programming models. Section 4 describes the applications we used and the programming differences for them among the three models. Performance is analyzed in Section 5, which also examines methods for addressing performance bottlenecks in either the model or the application. Finally, Section 6 summarizes our key conclusions and discusses future work.
2. RELATED WORK

Previous research in comparing models has focused on the CC-SAS and MP models, but not on SHMEM. It can be divided into three groups: research related to hardware-coherent shared address space systems, research related to clusters or other systems in which the CC-SAS model is implemented in software, and research related to irregular applications with naturally fine-grained, dynamic and unpredictable communication and replication needs. For the latter, which are increasingly important, it has been argued that CC-SAS, when implemented efficiently in hardware,
has substantial ease-of-programming and likely performance advantages compared to MP.(2, 3) However, a proper evaluation for this class of programs requires a much more involved study of programming issues and is not our focus here. Let us examine the first two groups.

For hardware-coherent systems, Ngo and Snyder(4) compared several CC-SAS programs against MP versions running on the same platform. The CC-SAS programs they used were not written to take locality into account (i.e., were written somewhat ``naively''), and they found such programs to perform worse than the message passing ones. We start in this study with well-written and tuned programs for all models. Chandra et al.(5) compared MP with CC-SAS using simulators of the two programming models and examined where the programs spent their time. They found that the CC-SAS programs could perform as well as message passing programs. Important differences of their study from ours are that they examined only a single problem and machine size for each program, that their study used simulation, which has limitations in accuracy (especially with regard to modeling contention) and in the ability to run large problem and machine sizes, that the hardware platform they simulated (the Thinking Machines CM-5) is now quite dated, and that they used different programs with somewhat less challenging communication patterns than we do (e.g., none so challenging as FFT or Radix sorting). Another simulation study, by Woo et al.,(6) examined the impact of using a block transfer (message-passing) facility to accelerate hardware-coherent shared memory on a system that provides integrated support for block transfer. They found that block transfer did not improve performance as greatly as had been expected. Both these studies examined differences in generated traffic as well. Kranz et al.(7) showed that message passing can improve the performance of certain primitive communication and synchronization operations over using cache-coherent shared memory. LeBlanc and Markatos(8) concluded that shared memory is preferable on multiprocessors where communication is relatively cheap, and that as the cost of communication increases in shared memory multiprocessors, message passing becomes an increasingly attractive alternative to shared memory. Finally, Klaiber and Levy(9) used both simulation and direct execution to compare the message traffic (not performance) of C* data-parallel programs from which a compiler automatically generates SAS and MP versions.

In the second group of related work, researchers have compared the performance of message passing with the CC-SAS model implemented in software at page granularity, on either older message-passing multiprocessors or on very small-scale networks of workstations.(10, 11) They found that the CC-SAS model generally performs a little worse for the regular applications they studied. In contrast with these two groups of related
work, our study uses well-written programs to compare modern implementations of all three major programming models on a modern hardware-coherent multiprocessor platform at a variety of problem and machine scales.

3. PLATFORMS AND PROGRAMMING MODELS

3.1. Platform: SGI Origin2000

The SGI Origin2000 is a scalable, hardware-supported, cache-coherent, non-uniform memory access machine, with perhaps the most aggressive communication architecture among such machines today. The machine we use has 64 processors, organized in 32 nodes with two 195 MHz MIPS R10000 microprocessors each. Each processor has separate 32 KB first-level instruction and data caches, and a unified 4 MB second-level cache with 2-way associativity and a 128-byte block size. The machine has 16 GB of main memory (512 MB per node) with a page size of 16 KB. Each pair of nodes (i.e., 4 processors) is connected to a network router. The interconnect topology across the 16 node pairs (routers) is a hypercube. The peak point-to-point bandwidth between nodes is 1.6 GB/sec (total in both directions). The average uncontended read latency to access the first word of a cache line is as follows: local memory, 313 ns; average over local and all remote memories on a machine this size, 796 ns; and furthest remote memory, 1010 ns.(12) The latency grows by about 100 ns for each router hop.

3.2. Parallel Programming Models

The Origin2000 provides full hardware support for a cache-coherent shared address space. Other programming models like MP (here using the Message Passing Interface standard, or MPI, primitives) and SHMEM are built in software but leverage the hardware support for a shared address space and efficient communication for both ease of implementation and performance, as is increasingly the case in high-end tightly-coupled multiprocessors.

3.2.1. CC-SAS

In this model, remotely allocated data are accessed just like locally allocated data or data in a sequential program, using ordinary loads and stores. A load or store that misses in the cache and must be satisfied remotely communicates the data in hardware at cache block granularity, and automatically replicates it in the local cache. The transparent naming
and replication provides programming simplicity, especially for dynamic, fine-grained applications. In all our parallel programs, the initial or parent process spawns a number of child processes, one for each additional processor. These cooperating processes are assigned chunks of work using static assignment. The synchronization structures used are locks and barriers. Processes are spawned once near the beginning of the program, do their work, and then terminate at the end of the parallel part of the program.

3.2.2. MP

In the message passing model, each process has only a private address space, and must communicate explicitly with other processes to access their (also private) data. Communication is done via explicit send-receive pairs, so the processes on both sides are involved. The sender specifies to whom to send the data but does not specify the destination addresses; these are specified by the matching receiver, in whose address space they lie. The data may have to be packed and unpacked at each end to make the transferred data contiguous and hence increase communication performance. While the MP model can be more difficult to program, more so for irregular applications, its potential advantages are better performance for coarse-grained communication and the fact that, once communication is explicitly coordinated with sends and receives, synchronization is implicit in the send-receive pairs in some blocking message passing models.

We began by using the vendor-optimized native MPI implementation (Message-Passing Toolkit 1.2), which was developed starting from the publicly available MPICH.(13) Both use the hardware shared address space and fast communication support to accelerate message passing. We found that the performance of the native SGI implementation and of MPICH is quite comparable for our applications, especially for larger numbers of processors. We therefore selected MPICH, since its source code is available. Let us examine how it works at a high level.

The MPICH implementation (like the native SGI one) is faithful to the message passing model in that application data structures are allocated only in private per-process address spaces. Only the buffers and other data structures used by the MPI library itself, to implement send and receive operations, are allocated in the shared address space. The MPI buffers are allocated during the initialization process; they include a shared packet pool used to exchange control information for all messages as well as the data for short messages (each packet has header and flag information as well as space for some data), and data buffer space for the data in large messages. There are three data exchange mechanisms: short, eager and rendezvous. Which mechanism is used in a particular instance is determined by the library and depends on the size of the exchanged data. All copying
of data to and from packet queues and data buffers is done with the memcpy function. Note that while the hardware support for load-store communication is very useful, an invalidation-based coherence protocol can make such producer-consumer communication inefficient compared to an update protocol or a hardware-supported but noncoherent shared address space, since it potentially causes many more protocol transactions.

Short Mode. If the message size is smaller than a certain threshold, the sender first requests a packet from the pre-allocated shared packet pool. The sender copies the data into the packet body itself (using memcpy), fills in the control information and then adds this packet to the incoming queue of the destination process. A receive operation checks the incoming queue and, if the corresponding packet is there, copies the data from the packet into its application data structure and releases the packet. Two other incoming queues per process, called the posted queue and the unexpected messages queue, are also used by receives to manage the flow of packets and handle the cases where a receive is posted before the data arrive. If a nonblocking or asynchronous receive is used, the wait function that is called later, before the data are actually needed, performs similar queue management.

Eager Mode. If the data length is larger than the short mode threshold but smaller than another threshold, the transfer uses eager mode. Message data are not kept in the packet queue in this case; only control information is. A send operation first requests a data buffer from the shared memory space and (if successful) copies the data into the buffer using memcpy. It then requests and uses packet queues for control in much the same way as the short mode does. When the receiving side receives the packet, it obtains the buffer address from the packet and then copies the data from the buffer to its own application data structure. It then frees the packet and the buffer. Eager mode often offers the highest performance per byte transferred.

Rendezvous Mode. If the message is beyond the threshold size for eager mode, or if a large enough buffer cannot be obtained from the shared buffer space for an eager-mode message, rendezvous mode is used. It is similar to eager mode, except that the data are transferred into the shared buffer not when the send operation is called but only when the send-receive match occurs (this means that a sender using non-blocking sends has to be careful not to overwrite the application data too early). A large message may be partitioned by the library into many smaller messages, each of which is managed in this manner. This mode is the most robust, but it may be less efficient than the eager protocol and is not used in our applications.
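As a concrete illustration of the dispatch among these three modes, the sketch below shows how a send path might choose a mechanism based on message size. The threshold names and values are our own assumptions for illustration, not the actual MPICH constants.

/* Hypothetical sketch of the mode dispatch described above; threshold
 * names and values are illustrative, not the actual MPICH settings. */
#include <stddef.h>

#define SHORT_THRESHOLD  64          /* bytes that fit in a packet body (assumed) */
#define EAGER_THRESHOLD  (16 * 1024) /* largest message sent eagerly (assumed)    */

enum xfer_mode { XFER_SHORT, XFER_EAGER, XFER_RENDEZVOUS };

enum xfer_mode choose_mode(size_t len, int eager_buffer_available)
{
    if (len <= SHORT_THRESHOLD)
        return XFER_SHORT;       /* data travel inside the control packet         */
    if (len <= EAGER_THRESHOLD && eager_buffer_available)
        return XFER_EAGER;       /* data copied to a shared buffer at send time   */
    return XFER_RENDEZVOUS;      /* data copied only after the send-receive match */
}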
Fig. 1. Bandwidth for the MP and SHMEM programming models.
3.2.3. SHMEM

The SHMEM library provides the fastest inter-processor communication for large messages, using data passing and one-sided communication techniques. Figures 1 and 2 show the bandwidth and latency for the MP and SHMEM models. The two major primitives are put and get. A get is similar to a read in the CC-SAS model. In CC-SAS, an ordinary load instruction is used to fetch a cache block of remote data, and data replication is supported automatically by hardware. In SHMEM, an explicit get operation is used to copy a variable amount of data from another process (using bcopy, which does the same thing as the memcpy used in MP, but has more complex semantics in that it can correctly handle overlapping sources and destinations) and to replicate it locally explicitly. The get operation specifies the address space (process number) from which to get (copy) the data, the local source address in that (private) address space, the size of the data to fetch, and the local destination address at which to place the fetched data.

Fig. 2. Latency for the MP and SHMEM programming models.

In SHMEM, there is no flat, uniformly addressable shared address space or data structures that all processes can load/store to. However, the portions of the private address spaces of processes that hold the logically shared data structures are identical in their data allocation. Thus, a process refers to data in a remote process's partition of a distributed data structure by using an address as if it were referring to the corresponding location in its own partition of that data structure (and by also specifying which process's address space it is referring to), not by using a ``global'' address in the larger, logically shared data structure. Unlike in send-receive message passing, a process can refer to local variables in another process's address space when explicitly specifying communication, but unlike in CC-SAS it cannot load/store directly to those variables. A put is the dual of a get; however, each is an independent and complete way of performing a data transfer. Only one of them is used per communication; they are not used as pairs to orchestrate a data transfer as send and receive are. By providing a ``global'' segmented address space and by avoiding the need for matching send and receive operations to supply the full naming, the SHMEM model delivers significant programming simplicity over MP, even though it too does not provide fully transparent naming or replication.
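To make the put/get naming concrete, here is a minimal sketch in the style of the SHMEM interface (using OpenSHMEM-flavored routine names; the exact names and initialization calls in the SGI library differ slightly, so treat this as an assumption-laden illustration rather than the code used in the paper).

#include <shmem.h>

#define N 1024

/* Symmetric arrays: allocated at the same local address on every PE. */
long part[N];
long remote_copy[N];

int main(void)
{
    shmem_init();
    int me = shmem_my_pe();

    for (int i = 0; i < N; i++)
        part[i] = (long)me * N + i;   /* fill the local partition           */
    shmem_barrier_all();              /* all partitions ready before access */

    if (me == 0)
        /* One-sided: copy PE 1's "part" into a local buffer; PE 1 does not
         * participate in the transfer at all. */
        shmem_long_get(remote_copy, part, N, 1);

    shmem_barrier_all();
    shmem_finalize();
    return 0;
}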
Table I summarizes the properties of the three models, both in general and as implemented on the Origin2000.

Table I. Summary of the Properties of the Three Models, Both in General and as Implemented on the Origin2000

Naming model for remote data:
  CC-SAS: shared address space.
  MP: none; explicit messages between private address spaces.
  SHMEM: segmented, symmetric global address space with explicit operations.

Replication and coherence:
  CC-SAS: implicit, hardware supported in caches.
  MP: explicit, no hardware support.
  SHMEM: explicit, no hardware support.

Hardware support:
  CC-SAS: leverages hardware shared address space, cache coherence and low latency communication.
  MP: uses SAS and low latency for communication through shared buffers; doesn't need coherence.
  SHMEM: uses SAS and low latency for communication through shared buffers; doesn't need coherence.

Primitives used for data transfer on the Origin2000:
  CC-SAS: load/store.
  MP: memcpy.
  SHMEM: bcopy.

Communication overhead:
  CC-SAS: efficient for fixed-size, fine-grain transfers.
  MP: inefficient for fine-grain, efficient for coarse-grain transfers.
  SHMEM: more efficient than MP for both, due to one-sided communication.

Synchronization:
  CC-SAS: explicit and separate from communication.
  MP: can be implicit in the explicit communication.
  SHMEM: explicit and separate from communication.

Performance predictability:
  CC-SAS: implicit communication, so more difficult.
  MP: explicit communication, so easier.
  SHMEM: explicit communication, so easier.

4. APPLICATIONS AND ALGORITHMS

We examine applications whose CC-SAS versions are from the SPLASH-2 suite or from other research scientists and that are within the class of applications on which we focus, choosing within this class a range of communication patterns and communication-to-computation ratios. The first application, FFT, uses a nonlocalized but regular all-to-all personalized communication pattern to perform a matrix transposition; i.e., every process communicates with every other, but the data sent to the different processes are different. The communication-to-computation ratio is quite high and diminishes only logarithmically with problem size. The second application, Ocean, exhibits primarily nearest-neighbor patterns, which are very important in practice, but in a multigrid formulation rather than on a single grid. The communication-to-computation ratio is large for small problem sizes but diminishes rapidly with increasing problem size. The third application, Radix sorting, uses all-to-all personalized communication but in an irregular and scattered fashion, and has a very high communication-to-computation ratio that is independent of problem size and number of processors. The fourth application, Sample sorting, also uses an irregular all-to-all personalized communication, but compared with radix sorting it is much more regular and the communication-to-computation ratio is much smaller. The final application, blocked LU factorization of a dense matrix, uses one-to-many non-personalized communication: the pivot block and the pivot row blocks are communicated to √p processors each. However, the communication needs are relatively small compared to load imbalance.

The CC-SAS programs for these applications are taken from the SPLASH-2 suite (FFT, Ocean, Radix, LU)(14-16) and from other research scientists (Sample),(17) using the best versions of each application with proper data placement. Only Radix is modified, to use a prefix tree to accumulate local histograms into global histograms. In the following, we discuss the differences in communication orchestration and implementation across models. For mostly regular applications such as these, the basic
partitioning method and parallel algorithm are usually the same for the CC-SAS and MP programming models. The main difference is that communication is usually sender-based in MP for better performance, and it is structured to communicate in larger messages, as described later. We examined some of the best MP implementations of these applications and kernels obtained from other scientists at a variety of sites, but our transformed SPLASH-2 programs were as good as or better than any of those under message passing. So we retained the programs we produced (they also have the benefit of being directly comparable, in a node-performance sense, with the CC-SAS programs). When noncontiguous data have to be transferred, we pack/unpack them in the application programs themselves to avoid the buffer malloc/free overhead used by the corresponding MPI functions.

Finally, for the SHMEM versions we restructured the MP versions to use put or get rather than send-receive pairs, and to synchronize appropriately. Packing and unpacking regularly structured data is left to the strided get and put operations, which do not have performance problems here. The choice of using get or put is based on performance first and ease of programming second, experimenting with both options in various ways to determine which one to use. Using put generally transfers the data earlier (as soon as they are produced, as with a send) and reduces the latency seen by the destination; however, get brings data into the cache while put does not push the data into the destination cache (it cannot do so on this and many modern machines), and using get can obtain better reuse of buffers at the destination of the data.

No prefetching is used in the CC-SAS programs, although we have found that software-controlled prefetching of only remote data could improve the performance of some applications.(17, 18) Using prefetching in applications would increase programming complexity, and it is not common in practice. The dynamically scheduled processor hides some memory latency, and in the SHMEM and MP cases we use asynchronous (nonblocking) operations to try to hide their latency, with wait function calls used after these operations when necessary to wait for data to leave or arrive (a sketch of this pattern follows below). Let us discuss the differences of the MP and SHMEM versions from CC-SAS for the individual applications. The partitioning of work is the same across models in all cases.
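The nonblocking pattern just mentioned, in which asynchronous operations are posted and a wait is issued only when the data are actually needed, looks roughly as follows in MPI; the function and buffer names here are illustrative, not taken from our programs.

#include <mpi.h>

/* Post a nonblocking exchange with one partner and wait only when the data
 * are needed; buffer names, counts, and the single-partner structure are
 * illustrative assumptions. */
void exchange_with_partner(double *sendbuf, double *recvbuf,
                           int count, int partner)
{
    MPI_Request req[2];

    /* Post both operations without blocking, so the transfer can overlap
     * with whatever local work follows. */
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);

    /* ... local computation that does not touch the buffers ... */

    /* Wait only when the data are actually needed (and the send buffer must
     * be reusable). */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}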
4.1. FFT

CC-SAS. The FFT kernel is a double-precision complex 1-D version of the radix-√n six-step FFT algorithm described by Bailey.(15) 1-D FFTs are more challenging than higher-dimensional FFTs, since there is more
communication relative to computation. The data set consists of n double-precision complex data points to be transformed, and another array of double-precision complex data to be used as the roots of unity. The n-point data set is arranged in the form of a √n × √n matrix for this high-radix implementation, and the matrix is partitioned among the processors in blocks of √n/p contiguous rows each. Each processor is responsible for transposing and computing on the √n/p rows assigned to it (p is the total number of processors). The matrices are transposed three times, alternating which matrix is the input to the transpose and which is the output. In each transpose stage, each processor communicates a sub-matrix of size (√n/p) × (√n/p) to every other processor, resulting in coarse-grained and regular (value-independent and completely predictable a priori) all-to-all personalized communication. A blocked transpose is used to exploit cache line reuse. To avoid memory hot-spotting, the sub-matrix is transmitted in a staggered way: processor i first sends data to processors i+1, i+2,..., p-1 and then to 0,..., i-1.

MP. In the MP implementation, the partitioning of the matrix is the same; however, each processor owns private arrays to store the rows assigned to it. The communication in the transpose phase is sender-initiated for higher performance. Each processor still communicates √n/p subrows of size √n/p to each other processor, but these subrows are disjoint in the local address space; they are therefore packed into a buffer before sending and unpacked implicitly when transposing locally at the destination. Another change we make from the CC-SAS version, based on observed performance, is that we do not use the linear, staggered way of communicating to avoid algorithmic hot-spots in the transpose. Rather, the all-to-all personalized communication is performed in p-1 loop iterations. In each iteration, each processor chooses a unique partner with which to exchange data bidirectionally (see the sketch below). After the p-1 iterations, each processor has exchanged data with every other processor. We experimented with other methods, including using smaller messages (a few subrows at a time) to take advantage of the overlap between communication in the transpose and computation in the local row-wise FFTs before or after it. However, the high cost of messages and the small amount of work between them end up hurting performance.

SHMEM. The SHMEM implementation is very similar to the MP implementation except that it uses put operations rather than send and receive (the sender-initiated put is more efficient than get here due to latency hiding).
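A sketch of the pairwise exchange schedule used in the MP transpose is shown below. The XOR partner schedule is one standard way to pick a unique partner per iteration and is our assumption here (it requires p to be a power of two); the buffer layout is also illustrative.

#include <mpi.h>
#include <stddef.h>

/* All-to-all personalized exchange done in p-1 pairwise steps, as described
 * above.  'chunk' is the number of doubles destined for each partner; the
 * send and receive areas are assumed to be laid out by destination rank. */
void transpose_exchange(const double *sendbufs, double *recvbufs,
                        int chunk, int rank, int p)
{
    for (int step = 1; step < p; step++) {
        int partner = rank ^ step;   /* unique partner in every iteration */

        MPI_Sendrecv(sendbufs + (size_t)partner * chunk, chunk, MPI_DOUBLE,
                     partner, 0,
                     recvbufs + (size_t)partner * chunk, chunk, MPI_DOUBLE,
                     partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}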
4.2. Ocean

CC-SAS. In Ocean, the principal data structures are about 25 two-dimensional arrays holding discretized values of the various functions associated with the model's equations. These grids are partitioned among the processors into square-like subgrids, which are represented as 4-D arrays, with each subgrid allocated contiguously in the owning processor's local memory. The equation solver, used twice in every time-step, is a W-cycle multigrid solver that uses red-black Gauss-Seidel iterations at each level of the multigrid hierarchy. Computing the value at a grid point requires data from all its nearest neighbors at that grid level.

MP. In the MP implementation, the grids in this mostly nearest-neighbor application are partitioned into subgrids in the same way as in the CC-SAS program. A processor sends its upper and lower border data to its neighbors in one message each. When it communicates with its left or right neighbors, the (sub)column of data is noncontiguous; it is therefore first packed locally in the application and then sent in one message, to reduce communication overhead, and unpacked into the ghost subcolumn at the other end.

SHMEM. Unlike in FFT, the SHMEM implementation uses get operations to receive border data in a receiver-initiated way, due to the advantages of get here, but it uses the SHMEM strided get functions instead of packing data itself, since unlike in MPI there is no performance difference here.
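For the border-column exchange just described, a strided get can fetch the noncontiguous subcolumn directly into a contiguous ghost column. The rough sketch below uses an OpenSHMEM-flavored strided get; the row-major layout, sizes and names are our own assumptions.

#include <shmem.h>

#define ROWS 258
#define COLS 258

/* Symmetric subgrid: same layout on every PE (row-major, ROWS x COLS). */
double subgrid[ROWS][COLS];
double ghost_col[ROWS];          /* local ghost column for the left border */

void fetch_left_border(int left_pe)
{
    /* Copy the left neighbor's rightmost interior column (stride COLS in its
     * memory) into our contiguous ghost column (stride 1), in one call. */
    shmem_double_iget(ghost_col,             /* target (local)             */
                      &subgrid[0][COLS - 2], /* source (on remote PE)      */
                      1,                     /* target stride, in elements */
                      COLS,                  /* source stride, in elements */
                      ROWS,                  /* number of elements         */
                      left_pe);
}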
4.3. Radix

CC-SAS. The CC-SAS radix sort is borrowed from the SPLASH-2 application suite and is based on the method described by Blelloch et al.(16) The algorithm is iterative, performing one iteration for each r-bit digit in the keys, where r is the radix used. The maximum key value determines how many iterations are actually needed. Each iteration uses two arrays of n keys each (an input and an output array, which swap roles across iterations), each partitioned into p parts. A process's partition of the input array contains its assigned keys to process for that iteration. In every iteration, every process first sweeps over its assigned keys and generates a local histogram of the frequencies of their values in the current radix digit. After this, the local histograms are accumulated into global histograms, using a parallel prefix tree. Then, each process uses the local and global histogram values to permute its keys into the output array, thus
conceptually performing remote writes to other processes' partitions of that array and resulting in all-to-all personalized communication. The input and output arrays swap their roles in the next iteration (for the next digit). In our original (SPLASH-2) CC-SAS program, keys are written directly into the output array as their permuted positions are computed from the histograms. Thus, although a given process will end up writing to several (up to r) contiguous segments of each process's partition of the output array (the average segment size being n/(2^r * p), where n is the number of keys, r is the radix size and p is the number of processes), these writes are temporally interleaved with writes to many other segments and hence appear scattered.

MP. Our MPI implementation follows the same overall structure as the SPLASH-2 CC-SAS program. The first major difference is in how the global histogram is generated from local histograms. In the CC-SAS implementation, a single shared copy of the global histograms is constructed, using a binary prefix tree. In MPI, the fine-grained communication needed for this turns out to be very expensive. We therefore use an MPI_Allgather function to collect the local histograms from all processes and make a local copy of each for all of them. Then, each process computes the global histograms locally. The performance of this phase does not affect overall performance much, which is dominated by the permutation itself. In fact, having all the histogram information locally greatly simplifies the later computation of parameters for the MPI send/receive functions in the permutation phase.

Another difference is that in the MPI implementation, it is extremely expensive to send/receive a message for each key in the permutation phase. Since the keys that process i permutes into process j's partition of the output array will end up falling into that partition in several contiguous chunks, of average size n/(2^r * p), our MP program first writes the data locally into such contiguous buffers to compose larger messages before sending them out, which amounts to a local permutation of the data (using the now-local histograms) followed by communication. An interesting question is how to send the data. One possibility is for process i to send only one message to each other process j, containing all its chunks of keys that are destined for j. Process j then reorganizes the data chunks into their correct positions in its partition of the output array at the other end. This is similar to the algorithm used in the NAS parallel application IS.(19) Another method is for a process to send each contiguously-destined chunk of keys directly as a separate message so that it can be put into the correct position at the destination process, leading to multiple messages from process i destined for each other process. This is an implementation-dependent tradeoff between communication and computation. Our experiments
show that the latter performs better than the former on this machine, so we use the latter.

SHMEM. Our SHMEM implementation is transformed from the MPI program, though the specification of communication is simplified. Since SHMEM uses one-sided communication, only one of the sender and receiver needs to compute the parameters for the message, not both. Since the entire histogram data are available locally to each process, receiver-initiated communication can be used: each needed remote chunk of permuted keys is brought into its destination locations using a get operation (get has the advantage that data are brought into the cache, while put does not deposit them in the destination cache). The symmetric arrangement of process partitions of the arrays makes this easy to program (a process simply specifies the positions within a partition and the source partition or process number).

4.4. Sample Sort

CC-SAS. Suppose we have p processes, and each process has its own partition of keys. The CC-SAS sample sort program proceeds in five phases:

• Each process sorts its own keys locally using radix sort.

• Each process selects a fixed number of keys (sample keys) from its sorted array.

• A small number of processes is responsible for reading these sample keys from all processes, sorting them, and selecting p-1 keys from these collected keys, which are the splitters used in the next phase.

• Each process uses the p-1 splitters to decide (locally) how to distribute its keys to other processes or fetch keys from them. An all-to-all personalized communication follows to distribute the keys.

• Each process sorts its received keys locally.

Thus, sample sort does two local sorts (the first and last steps), and so does almost double the sorting work of radix sort, but its communication behavior is better than that of parallel radix sort. It does not involve scattered writes in CC-SAS, and it requires neither one message per small chunk nor reorganizing received data into small chunks at the destination: there is one contiguous message from a process to/from each other process. There is also no loop around the steps above: each local sort sorts the keys completely.

There are many ways to decide how to sample the keys in the second phase and how to find the p-1 splitters in the third phase; these affect
load balance and program complexity.(20) We choose a method that performs best on our system: each process selects 128 sample keys in the second phase; in the third phase, every set of 32 processes forms a group and selects one member to be responsible for collecting the sample keys, sorting them, and communicating with the other groups to find the splitters.

MP. In the MPI program, the first, second and fifth phases are the same as in the CC-SAS program. In the third phase, we use the MPI_Allgather function instead of loads and stores to collect the sample keys from all processes. Then, the computation of the splitters becomes completely local, with the tradeoff that much of it is performed redundantly on all processes. Unlike in the CC-SAS program, we do not divide the 64 processes into two groups. In the fourth phase, each process uses an explicit send/receive operation to distribute its keys to their destinations.

SHMEM. The SHMEM program is obtained directly from the MPI program. The only difference is that in the fourth (communication) phase it uses a get operation to replace the send/receive pair.
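The Allgather step used in the MP versions of both Radix and Sample sort (to collect local histograms or sample keys so that the subsequent computation becomes purely local) can be written roughly as follows; the buffer names are illustrative, and the count of 128 matches the per-process sample size mentioned above.

#include <mpi.h>

#define SAMPLES_PER_PROC 128

/* Gather every process's sample keys into one local array on every process,
 * as described above.  'all_samples' must hold nprocs * SAMPLES_PER_PROC
 * entries; after the call, splitter selection needs no further communication. */
void gather_samples(const int *my_samples, int *all_samples)
{
    MPI_Allgather(my_samples,  SAMPLES_PER_PROC, MPI_INT,
                  all_samples, SAMPLES_PER_PROC, MPI_INT,
                  MPI_COMM_WORLD);
}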
4.5. LU

CC-SAS. The LU kernel factors a dense matrix into the product of a lower triangular and an upper triangular matrix. The dense n × n matrix A is divided into an N × N array of B × B blocks (n = N*B) to exploit temporal locality on sub-matrix elements. To reduce communication, block ownership is assigned using a 2-D scatter decomposition, with blocks being updated by the processors that own them. A 4-D array is used to ensure that the blocks assigned to a processor are allocated locally and contiguously. The following pseudo-code shows the most important steps in the CC-SAS program:

For k = 0 to N-1 do
    Factor diagonal block A[k][k]
    BARRIER
    Update all perimeter blocks in column k and row k using A[k][k]
    BARRIER
    For j = k+1 to N-1 do
        For i = k+1 to N-1 do
            A[i][j] = A[i][j] - A[i][k] * A[k][j]

MP. In CC-SAS, each process directly fetches the pivot block data (or the needed pivot row blocks) from the owner, using load instructions.
In MPI, however, the owner of a block sends it to the √p other processes that need it once it is produced.

SHMEM. The SHMEM implementation replaces the sends with get operations on whole blocks. Get is used instead of put since it brings data into the cache, as in Ocean and Radix, and enables reuse of the buffer used by the get operation.

5. PERFORMANCE ANALYSIS

Let us compare the performance of the applications under the different programming models. For each application, we first examine speedups, measuring them with respect to the same sequential program for all models. Then we examine per-processor breakdowns of execution time, obtained using various tools available on the machine, to obtain more detailed insight into how time is distributed in each programming model and where the bottlenecks are. We divide the per-processor wall-clock running time into four categories: CPU busy time in computation (BUSY), CPU stall time waiting for local cache misses (LMEM), CPU stall time for sending/receiving remote data (RMEM), and CPU time spent at synchronization events (SYNC). For CC-SAS programs, with their implicit data access and communication, the available tools do not allow us to distinguish LMEM time from RMEM time, so we are forced to lump them together (MEM = LMEM + RMEM). However, they can be distinguished for the other two models.

In the MP model, since we are using asynchronous mode, on the receiver side the SYNC time is the time spent in MPI_Waitall, waiting for the incoming messages for which receives have been posted to arrive in the packet queue, indicating that the data are ready to be copied. During this time, if new messages that are not expected arrive, then the receiver will also spend some time processing those messages, but that time is counted as RMEM time. On the sender side, SYNC time is the time the sender spends adding the control packet to the receiver's incoming queue. The RMEM time is all the time spent in MP functions (like send and receive) excluding the SYNC time. In the SHMEM model, the SYNC time is the global barrier time. The RMEM time is the time spent in get/put operations and collective communication calls; there is a little synchronization time included in these operations, but unlike for MPI we do not have the source code for SHMEM, and the available tools cannot tell this time apart. In the CC-SAS model, the SYNC time is the synchronization time, including barrier time and locking/unlocking time. The BUSY time is obtained with the vendor-supplied tool SpeedShop. The LMEM time is
approximated by subtracting the SYNC, RMEM, and BUSY times from the total execution time.

For a given machine size, problem size is a very important variable in comparing programming models. Generally, increasing the problem size reduces the communication-to-computation ratio and tends to diminish the performance differences among models. Thus, although large problems are important for large machines, it is very important to examine smaller problems too. Of course, we must be careful to pay significant attention only to those problem sizes that are realistic for the application and machine at hand, and that deliver reasonable speedups for at least one of the programming models. Our general approach is to examine a range of potentially interesting problem sizes at each machine size.

That said, Fig. 3 shows the speedups for FFT, OCEAN, RADIX, SAMPLE, and LU under the three programming models for only the largest data set we have run (FFT: 16M double complex, OCEAN: 2050 × 2050 grid, RADIX: 256M integers, SAMPLE: 256M integers, LU: 4096 × 4096 matrix) on 64 processors. The dplace tool is used to map the data to memory modules when running the programs. The SHMEM model performs well on all applications. The CC-SAS model is close, except for RADIX (we discuss why later). For MP, however, none of these applications' performance is initially satisfactory, even though we are using almost the same algorithms and data structures as in SHMEM. Let us examine why.

5.1. Improving MP Performance

Consider FFT as an example. Figure 4a shows the time breakdown for a smaller, 256K-point data set for FFT on 64 processors. The BUSY times are extremely flat across processors, as they are for other applications as well, since every processor executes nearly the same number of instructions.
Fig. 3. Speedups of FFT(16M), OCEAN(2050), RADIX(256M), SAMPLE(256M) and LU(4096) for the three models on 64 processors.
Fig. 4. Time breakdown for FFT under the MP model for a 256K-point problem on 64 processors.
The LMEM time is a little imbalanced, but it is not the major bottleneck here. It is the RMEM time and the SYNC time that are very high and extremely unbalanced in the MP version, and that cause parallel performance to be poor. This is despite the fact that we make a special effort to avoid using rendezvous mode, since it is potentially slower than eager mode, by making the threshold large enough and allocating enough buffer space. Further analysis tells us that the problem is caused mainly by an extra copy in the send function in the MP implementation. As discussed earlier, only the buffers and other data structures used by the MP library itself to implement send and receive calls are allocated in the shared address space (in both the MPICH and SGI implementations). This means that a sending process cannot directly write data into a receiving process's data structures, since it cannot reference them directly, but can only write the data into the shared buffers from where the receiver can read them. Thus, the data are copied twice. If we can copy the data directly from the source to the destination data structures without using the buffers (the sender will no longer copy the data, only the receiver will), we may be able to improve performance by eliminating one copy. This idea has also been explored in other systems.(21)

Eliminating the use of the shared buffer space has other performance benefits as well. Requesting and obtaining a buffer itself takes time. Worse still, for a large number of processors such as 64, processes often compete for shared memory resources, causing a lot of contention in the shared memory allocation function. This contention increases RMEM time at the sender, but also increases SYNC time at the corresponding receivers, which now have to wait longer, and causes imbalances in both these time components.
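Schematically, the difference is between the original two-copy path through a shared library buffer and a single direct copy once the application data themselves live in the shared address space. The fragment below only illustrates this idea; the function names are hypothetical and this is not the library code.

#include <string.h>
#include <stddef.h>

/* Illustration only: the original eager path copies twice (sender into a
 * shared library buffer, receiver out of it); the modified path lets the
 * receiver perform a single copy straight from the sender's data structure
 * once the send-receive match is known.  All names are hypothetical. */
void eager_two_copy(void *shared_buf, void *recv_dst,
                    const void *send_src, size_t len)
{
    memcpy(shared_buf, send_src, len);   /* copy 1: sender -> shared buffer   */
    /* ... control packet posted, receive matched ... */
    memcpy(recv_dst, shared_buf, len);   /* copy 2: shared buffer -> receiver */
}

void eager_one_copy(void *recv_dst, const void *send_src_in_sas, size_t len)
{
    /* The sender only posts a control packet pointing at send_src_in_sas;
     * the receiver makes the single copy itself. */
    memcpy(recv_dst, send_src_in_sas, len);
}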
Since processes allocate their data in private address spaces in MP programs, eliminating the extra copy (buffering) would normally require the help of the operating system to break the process boundary. However, since we have an underlying shared address space machine, we can achieve this goal without involving the operating system, if we modify both the application (slightly) and the message-passing library. We increase the size of the shared address space and, in the application, allocate all the logically shared data structures that are involved in communication in this shared address space, even though the data structures are organized exactly the same way as in the original MP program (they are simply allocated in the shared rather than the private address space, by using a shared malloc rather than a regular one). How sends and receives are called does not change; however, in the MPI implementation, once the send-receive match via the packet queues establishes the source and destination addresses, either application process can directly copy data to or from the application data structures of the other (still using memcpy). In particular, in eager mode, the sender now only places the control packet into the receiver's queue; it does not request a shared buffer and copy data. When the match happens, the receiver copies the data directly from the sender's data structures (which are in the shared address space). Of course, this means that the sender cannot modify those data structures in the interim (as it could with a general nonblocking asynchronous send), so additional synchronization might be needed in the application. Without buffers, rendezvous mode now works similarly to eager mode and its overhead is greatly reduced. Short mode remains the same as before, since the data to be copied are small and there is no buffer allocation overhead anyway.

Figure 4b shows the new per-processor breakdowns of execution time. Removing the extra copy clearly improves performance dramatically and reduces the imbalances in RMEM and SYNC time. The speedups for the 256K- and 16M-point problem sizes increase from 1.26 and 33.52 to 8.17 and 55.17, respectively. However, the speedup is still lower than that of CC-SAS or SHMEM. The SYNC and RMEM time components are still high. This brings us to another major source of performance loss in the MP implementations: the locking mechanism used to manage the incoming packet queues. In the original implementations, when a process sends a message to another it obtains a lock on the latter's incoming queue, adds the control information packet to the queue, and releases the lock. When the receiver receives a message, it also has to lock/unlock to delete the entry from the queue. This locking and contention shows up as a significant problem, especially for smaller problem sizes.

Performance can be improved by using lock-free queue management, as follows. Instead of locking to add or delete a packet in a shared incoming queue per process, every process has a dedicated 1-deep packet slot for communicating with each of the other processes (thus, there are p^2 packet slots instead of p packet queues). A flag in this fixed packet is used to control the message flow. When one process wants to send a message to another, it checks the corresponding packet, in which the flag indicates whether the receiver is busy or not. A busy value means the sender has already sent a previous message to the receiver that has not been processed; in this case, the sender waits for the flag to be cleared. If the flag is clear, the sender sets the flag, puts the message control information into the packet, and is done. On the other side, the receiver checks this packet when it wants to get data from this source. If the flag has been set, it obtains the data and clears the flag; otherwise it either waits for the flag (in blocking mode) or leaves (in non-blocking mode). Note that this still provides point-to-point ordering among messages.
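A minimal sketch of such flag-based, lock-free packet slots is shown below. The names and the use of C11 atomics are our own assumptions; the actual modification relies on ordinary loads and stores to the hardware-coherent shared memory, so treat this purely as an illustration of the mechanism.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROCS 64

typedef struct {
    atomic_bool full;       /* set by the sender, cleared by the receiver */
    int         src;        /* sending process id                         */
    size_t      len;        /* message length                             */
    void       *data_addr;  /* source address in the shared address space */
} packet_slot_t;

/* slots[dest][src]: one 1-deep slot per (receiver, sender) pair, p*p total. */
packet_slot_t slots[MAX_PROCS][MAX_PROCS];

/* Sender side: wait until the slot is free, then publish the control info. */
void post_packet(int me, int dest, void *addr, size_t len)
{
    packet_slot_t *s = &slots[dest][me];
    while (atomic_load_explicit(&s->full, memory_order_acquire))
        ;                                   /* previous message not yet consumed */
    s->src = me;
    s->len = len;
    s->data_addr = addr;
    atomic_store_explicit(&s->full, true, memory_order_release);
}

/* Receiver side: non-blocking check of the slot for one particular sender. */
bool try_get_packet(int me, int src, packet_slot_t *out)
{
    packet_slot_t *s = &slots[me][src];
    if (!atomic_load_explicit(&s->full, memory_order_acquire))
        return false;                       /* nothing posted yet */
    out->src = s->src;                      /* copy out the control info */
    out->len = s->len;
    out->data_addr = s->data_addr;
    atomic_store_explicit(&s->full, false, memory_order_release);
    return true;
}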
The lock-free mechanism further improves the performance of FFT. Indeed, after all these changes to the MP library (mostly) and the application (a little, in how data are allocated in address spaces), the performance of the MP versions is comparable with that of the equivalent CC-SAS programs, at least for this problem size. The final time breakdown is shown in Fig. 5, for both this and a larger problem size.

Fig. 5. Time breakdown for FFT under the MP model with 256K and 4M problem sizes on 64 processors, with the new MPI implementation.

In fact, using the final improved MP implementation, we find that the performance of OCEAN, RADIX, SAMPLE, and LU is also greatly improved. The comparison of the speedups for MP and improved MP (NEW) is shown in Fig. 6 for a large data set and in Fig. 7 for a small data set.

Fig. 6. Speedups of MP and improved MP for FFT(16M), OCEAN(2050), RADIX(256M), SAMPLE(256M) and LU(4096) on 64 processors.

Fig. 7. Speedups of MP and improved MP for FFT(64K), OCEAN(258), RADIX(1M), SAMPLE(1M) and LU(256) on 64 processors.

Let us now compare the performance under the different programming models for each application. From here on we use the improved MP implementation (no extra copy, and lock-free queue management) in all the applications, calling it simply MP, to enable exploration of the remaining performance differences among models once these dominant bottlenecks are alleviated, even though it violates the ``pure'' MP model of using only private address spaces in the applications themselves.

5.2. FFT

The speedups with the three programming models for different data sizes, from 64K to 16M double-complex points, are shown in Fig. 8.

Fig. 8. Speedups for FFT under SHMEM, CC-SAS and MP on 16, 32, and 64 processors with different problem sizes.

[Note: We found that for the MP and SHMEM programs, which use explicit communication, the same communication functions take much longer in the first transpose than in the following transposes. Detailed profiling showed that the extra time was in the actual remote data movement operations, like the memcpy and bcopy used by these functions, which are much more expensive when invoked for the first time on a given set of pages. This page-level cold-start effect, which is not substantial in the CC-SAS case, is eliminated in our results by having each processor touch in advance all remote pages it may need to communicate with later. Simply doing a single load-store
reference to each such page using the machine's shared address space support, suffices, though many other methods will do. This touching is done before our timing measurements begin. This cold-start problem on pages is large only for kernels with a lot of communicated pages like FFT and Radix (as we will see later); real applications that use these kernels may use them multiple times, amortizing this cost.] Speedups are quite similar across models at 16 processors for most problem sizes, with differences becoming substantial only beyond that. Even for larger p, the performance of (the new) MP and SHMEM is quite similar on all data sets. However, compared with CC-SAS, their speedups for smaller problem sizes are much lower. With increasing problem size, their speedups improve and finally catch up with that of CC-SAS. For the 16M data size on 64 processors, the speedups for all models are about 60. One reason that speedup is so high for big problems is that stall time on local memory relative to busy time may be reduced greatly compared to a uniprocessor, due to local working sets fitting in the cache in a multiprocessor run while they didn't in the uniprocessor case (for small problems they fit in the cache in the uniprocessor case as well, and for very large problems they may fit in neither). Note that the inherent communication to computation ratio itself does not diminish rapidly with problem size in FFT (only logarithmically). So although message sizes become larger and amortize overheads much better, only communication does not account for the large increases in speedup with problem sizes even in MP. To illustrate this, Table II shows the average ratios across processors (expressed as percentages) of local memory time to busy time for two data sets, one smaller and one larger, using the MP executions as an example. The reduction in this ratio with increasing number of processors shows that the capacity-induced superlinear effect in local access is much larger for the larger data set (a two-fold reduction in the local memory time component when going from 1 to 64
File: 828J 171923 . By:XX . Date:02:04:01 . Time:09:35 LOP8M. V8.B. Page 01:01 Codes: 2686 Signs: 2276 . Length: 44 pic 2 pts, 186 mm
306
Shan and Singh Table II. Average Ratio of Local Memory Time to Busy Time for the 256K- and 4M-Point Problem Sizes in the MP Model a
256K 4M a
1P
16P
32P
64P
41 90
36 37
23 24
25 19
In percentage.
processors) than for the smaller in this case. Fortunately, this effect applies about equally to all programming models, and is quite clean in our applications even in the CC-SAS case since they do not have significant capacity misses on remotely allocated data. [Note: CC-SAS has another small problem for making this measurement in that some of the local misses in the transpose are converted to remote misses, but MP and SHMEM do not have this problem since all transposition is done locally separately from communication.] Although the capacity-induced superlinear effect is real, we can ignore it by replacing the LMEM time in the uniprocessor case with the sum of the LMEM times across processors in the parallel case. The speedups calculated in this way are smaller, as shown in the ``no-cap'' entries in Table III. For comparison, we also include the actual speedups (including capacity effects) as well. In other applications, such as OCEAN and RADIX, the superlinear capacity effect on local misses is also severe, though again fortunately similar for all models. The per-processor execution time breakdowns for the 256K and 4M problem sizes on 64 processors for MP, CC-SAS, and SHMEM are shown in Figs. 5, 9, and 10, respectively. The BUSY time for SHMEM and MP for each problem size is almost the same, and is a little higher than that of CC-SAS. This is primarily due to the extra packing and unpacking operation needed in SHMEM and MP programs, in which the (noncontiguous) Table III. FFT Speedup Comparison With and Without Cache Capacity Effects for 256K- and 4M-Point Problem Sizes on 16, 32, 64 Processors
                      16P    32P    64P
    256K-no-cap        11     18     25
    256K-with-cap      11     19     27
    4M-no-cap          10     19     38
    4M-with-cap        14     28     60
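To make the ``no-cap'' adjustment concrete, the following sketch shows our reading of the calculation described above; the variable names are hypothetical, and the time components correspond to the BUSY/LMEM breakdown used in this paper:

    /* ``No-cap'' speedup: the uniprocessor LMEM time is replaced by the
       sum of the LMEM times measured across the p parallel processes,
       removing the capacity-induced superlinear effect on local misses. */
    double speedup_no_cap(double t1_total, double t1_lmem,
                          const double *par_lmem, int p, double tp_total)
    {
        double lmem_sum = 0.0;
        for (int i = 0; i < p; i++)
            lmem_sum += par_lmem[i];          /* aggregate parallel LMEM */
        double t1_adjusted = t1_total - t1_lmem + lmem_sum;
        return t1_adjusted / tp_total;        /* adjusted speedup */
    }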
Fig. 9. Time breakdown for FFT under the CC-SAS model with 256K and 4M point problems on 64 processors.

The per-processor execution time breakdowns for the 256K and 4M problem sizes on 64 processors for MP, CC-SAS, and SHMEM are shown in Figs. 5, 9, and 10, respectively. The BUSY time for SHMEM and MP at each problem size is almost the same, and is a little higher than that of CC-SAS. This is primarily due to the extra packing and unpacking needed in the SHMEM and MP programs, in which the (noncontiguous)
sub-rows of a transferred √n/p-by-√n/p patch are packed contiguously before they are sent out and unpacked after they arrive at the destination. In CC-SAS, on the other hand, the data are read individually at the fine granularity of cache lines, so there is no need to pack and unpack them. This difference is imposed by the performance-driven need to make messages larger in SHMEM and MP.

The main differences among the models for FFT lie in the data access stall components. The CC-SAS model has a much lower MEM time than the others for smaller data sets and larger p. (Recall that we have to lump LMEM and RMEM together for this model, since they cannot be separated by the available tools.) When √n/p is small, so are the messages in MP and SHMEM, so message overhead (of software management in MP, as well as of the basic data transfer operations used by both MPI and SHMEM) is not well amortized. This is worse still in MP than in SHMEM, both because the producer-consumer communication needed for (control) packet queue management is a poor match for the underlying invalidation-based cache coherence protocol, and because an explicit send and a matching receive must be initiated separately for each communication.
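The packing step described above is essentially a strided copy. A minimal sketch, with hypothetical names (the real codes handle complex data and several patches per phase):

    #include <string.h>
    #include <complex.h>

    /* Pack the noncontiguous sub-rows of one sqrt(n)/p-by-sqrt(n)/p patch
       of the local matrix into a contiguous send buffer, as the MP and
       SHMEM versions must do before communicating.  'row_len' is the
       full row length (sqrt(n)) and 'patch' is sqrt(n)/p. */
    void pack_patch(const double complex *local, int row_len, int patch,
                    int dest, double complex *sendbuf)
    {
        for (int r = 0; r < patch; r++)
            memcpy(&sendbuf[r * patch],
                   &local[r * row_len + dest * patch],
                   patch * sizeof(double complex));
    }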
Fig. 10. Time breakdown for FFT under the SHMEM model with 256K and 4M point problems on 64 processors.
The latter potentially increases not only messaging overhead and end-point contention but also synchronization time, since the sends and receives have to be posted in a timely fashion and matched. In the CC-SAS model, the transfers of cache blocks triggered by loads and stores are very efficient because of the good data locality of FFT; with automatic hardware caching, the fetched data also arrive in the cache rather than in main memory, and can then be used very efficiently. As √n/p increases, message size grows and explicit communication with send-receive or (even more so) put/get becomes more efficient, so the performance of MP and SHMEM catches up with that of CC-SAS.

Finally, consider the difference between the MEM times of SHMEM and MP. SHMEM's RMEM time is lower than MP's because its one-sided communication is more efficient, as discussed earlier. Surprisingly, however, SHMEM has a much higher LMEM time. Further analysis shows that the extra LMEM time is spent in the transpose phases, specifically in the local data movement needed to unpack the deposited data, i.e., to extract the sub-rows of the transferred square blocks and move them to their transposed positions in the local matrix. The data transfer operation we use in SHMEM is put. Unlike a receive in MP, where the data moved into the receiver's data structures are also placed in its cache, the bcopy in a put places the data in the receiver's main memory but not in its cache (see Section 4). Thus, when unpacking, the MP code reads the data out of the cache while the SHMEM code reads it from main memory, increasing local memory stall time. We verified this by measuring the unpacking separately, as well as by having the destination process of the put touch the buffer before unpacking (thus bringing the data into the cache), in which case the unpacking is as fast as in the MP version. Using get instead of put helps with the caching issue, but then communication latency is not hidden as well and synchronization time increases, so overall there is little difference in performance for FFT. Note that in CC-SAS the transposition of the data is done as part of the load-store communication itself, and the data are brought into the destination processor's cache.
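The put-plus-pre-touch experiment described above can be sketched as follows. shmem_putmem and shmem_barrier_all are standard SHMEM calls; the symmetric buffer, the offsets, and the 128-byte line-size constant are illustrative assumptions, not our actual code:

    #include <shmem.h>
    #include <stddef.h>

    #define LINE 128                  /* assumed secondary cache line size */

    extern double recvbuf[];          /* symmetric (SHMEM-allocated) buffer */

    void exchange_and_pretouch(const double *packed, size_t patch_bytes,
                               int dest_pe, size_t my_offset,
                               size_t total_bytes)
    {
        /* Deposit the packed patch directly into the destination's
           symmetric receive buffer; the target issues no receive call. */
        shmem_putmem((char *)recvbuf + my_offset, packed, patch_bytes,
                     dest_pe);
        shmem_barrier_all();          /* wait until all patches have arrived */

        /* The bcopy inside put leaves the deposited data in local memory,
           not in the cache; touching one word per cache line before
           unpacking lets the unpack run out of the cache, as in MP. */
        volatile const char *p = (const char *)recvbuf;
        for (size_t off = 0; off < total_bytes; off += LINE)
            (void)p[off];
        /* ...unpack recvbuf into transposed positions here... */
    }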
5.3. OCEAN

In OCEAN, a large fraction of the execution time is spent in a multigrid equation solver in each time step, which exhibits nearest-neighbor communication, but at various levels of a hierarchy of successively smaller grids.

Fig. 11. Speedups for OCEAN under SHMEM, CC-SAS and MP on 16, 32, 64 processors with different problem sizes.
The speedups for OCEAN are shown in Fig. 11. For 16 processors, the speedups under the three programming models are similar; for larger processor counts, however, there are large differences. In particular, the performance of CC-SAS is now much worse at smaller problem sizes (the opposite of the FFT situation). The time breakdowns for the intermediate, 1026-by-1026 grid size on 64 processors are shown in Fig. 12. The BUSY times are well balanced and similar. The MEM time is imbalanced in all three cases, but is much higher and more imbalanced in CC-SAS.

There are several likely reasons for this behavior of CC-SAS relative to SHMEM and MP, which we unfortunately cannot pin down because of the lack of suitable tools (note that the LMEM category for CC-SAS is actually MEM = LMEM + RMEM). One is poor spatial locality at cache-block granularity for remote accesses at the column-oriented boundaries of the square partitions: only one boundary word is needed from each row, but a whole cache block is fetched. For MP and SHMEM this poor spatial locality occurs on local rather than remote accesses, since they pack the boundary data contiguously before communicating. Another likely cause is that local capacity misses behave differently across programming models. In MP and SHMEM, each process's partitions of all the different grids are allocated contiguously in its private address space, while in CC-SAS each entire grid is allocated as one large contiguous array in the shared address space; even though the process's partition of each grid is contiguous, due to the use of 4-D arrays, there is a very large gap between a processor's partitions of different grids in the data layout. This causes many more, and more imbalanced, local conflict misses in OCEAN, since multiple grids are accessed together in many computations. A third possibility is that certain data and pointer arrays have not been properly placed among the distributed memories in CC-SAS, though they necessarily are in MP and SHMEM, and such placement becomes relatively more important at smaller problem sizes. (We already obtained a great improvement over our original program by placing some data structures better; perhaps more can be done, though the lack of appropriate information from the machine makes these problems difficult to diagnose.) Larger problem sizes and smaller machines make local capacity misses dominant, so the differences between models are then small.
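The layout difference conjectured above can be pictured with the following allocation sketch; all names, including the shared_malloc() stand-in for the platform's shared allocator, are hypothetical:

    #include <stdlib.h>

    #define LEVELS 4                   /* number of grid levels; illustrative */

    /* MP/SHMEM style: one private block holds this process's partition of
       every grid level, back to back, so partitions of different levels
       are adjacent in memory. */
    void alloc_partitions_private(double *part[LEVELS],
                                  const size_t words[LEVELS])
    {
        size_t total = 0, off = 0;
        for (int l = 0; l < LEVELS; l++)
            total += words[l];
        double *block = malloc(total * sizeof(double));
        for (int l = 0; l < LEVELS; l++) {
            part[l] = block + off;
            off += words[l];
        }
    }

    /* CC-SAS style: each whole grid is one shared array, so this process's
       partitions of different levels are separated by large gaps in the
       address space, which changes cache-conflict behavior. */
    extern void *shared_malloc(size_t bytes);   /* hypothetical stand-in */

    void alloc_grids_shared(double *grid[LEVELS],
                            const size_t grid_words[LEVELS])
    {
        for (int l = 0; l < LEVELS; l++)
            grid[l] = shared_malloc(grid_words[l] * sizeof(double));
    }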
Fig. 12. Time breakdown for OCEAN (1026) on 64 processors.
5.4. RADIX

The speedups are shown in Fig. 13. The SHMEM model almost always performs best, except for the smallest (1M-key) data set on 32 and 64 processors, where the CC-SAS model does best. There are two reasons for this exception. One is that in SHMEM, as in MPI, we use a collective communication function by which each process collects the local histograms from all processors. This operation has a fixed cost that does not change with the data set size, so for smaller data sets it occupies a larger fraction of the execution time.
Fig. 13. Speedups for RADIX under SHMEM, CC-SAS and MP on 16, 32, 64 processors with different problem sizes.
Fig. 14. Time breakdown for RADIX (64M) on 64 processors.
In CC-SAS, only a single shared histogram is built, without replicating local histograms, and the efficient fine-grained load-store communication allows the histogram accumulation to be implemented by constructing a binary prefix tree, so the computation of the global histogram is quite cheap. The other reason is that the message size in SHMEM and MPI is small for this data set on 64 processors, even in the key permutation phase. Message overhead is therefore not amortized well, and the RMEM time is higher for SHMEM and especially for MPI, with its two-sided communication and hence greater overhead.

As the data set size increases, both problems for SHMEM and MPI diminish greatly. However, the performance of the CC-SAS program suffers. To understand this effect better, we show the per-processor time breakdowns for the 64M-key data set in Fig. 14. From Fig. 14a we find that the MEM time in CC-SAS is very high and dominates the total execution time. False sharing of data is very low for this problem and machine configuration. The reason for the high MEM time is that in the CC-SAS case not only is the communication very bursty, with little computation to overlap it, but the nature of the remote-write-based communication is such that many cache coherence protocol transactions, such as invalidations and acknowledgments, compete for communication resources with the data transfer itself. In addition, for large problem sizes data are written back to remote memories as they are replaced from the caches during the communication, and these writeback transactions further contend for controllers and other resources. This contention hurts performance greatly. In MPI (Fig. 14c) and SHMEM (Fig. 14d), the explicit messages are larger and less scattered due to local buffering, there are not nearly so many protocol transactions to contend with data movement, and there are no remote writeback operations, since data from remote memories are not cached. Thus the total memory time is lower.
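For reference, the flavor of the CC-SAS histogram accumulation mentioned at the start of this discussion is sketched below. This simplified version does a pairwise tree reduction in log2(p) barrier-separated rounds, whereas the actual code builds a binary prefix tree that also produces the prefix sums needed for the permutation; the names and the barrier() stand-in are hypothetical:

    #define BUCKETS 1024                 /* radix; illustrative value */

    extern int hist[][BUCKETS];          /* shared: one row per process */
    extern void barrier(void);           /* hypothetical stand-in */

    /* Combine per-process histogram rows pairwise up a binary tree;
       after the loop, hist[0][*] holds the global histogram. */
    void accumulate_histogram(int pid, int p)
    {
        for (int stride = 1; stride < p; stride *= 2) {
            if (pid % (2 * stride) == 0 && pid + stride < p)
                for (int b = 0; b < BUCKETS; b++)
                    hist[pid][b] += hist[pid + stride][b];
            barrier();                   /* finish this round everywhere */
        }
    }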
Compared with SHMEM, MPI has higher SYNC time, despite the lock-free mechanism used to manage the queues for incoming messages. In fact, the problem lies in the implementation of that mechanism: every process has a separate 1-deep buffer for communicating with each of the other processes, so if a process wants to send several consecutive messages to the same destination, each message must wait until the previous one has been received. This leads to higher synchronization time in this situation. Using deeper buffers alleviates the problem but does not eliminate it in this application, since many messages (chunks of keys for different radix values in the permutation phase) are sent from each processor to every other; also, adding a buffer requires p^2 memory. In SHMEM, since communication is one-sided, messages between the same pair of processes can proceed one after another without delay, and the synchronization time is smaller.

The effect of protocol interference in CC-SAS can be greatly reduced by restructuring the application to bring its permutation phase closer in structure to the SHMEM and MPI implementations (while retaining its histogram-accumulation advantages). That is, instead of performing scattered remote writes to a large output array as the permuted location of each key is computed, processes can locally buffer the permuted data during the permutation and then copy (via reads or writes) the contiguous buffered chunks to their remote destinations. This increases the local work and data movement, but reduces the temporal scatteredness of the fine-grained writes. Although this does not eliminate coherence protocol interference, it reduces it greatly, as can be clearly seen from the new per-processor time breakdown in Fig. 14b. The new speedup for CC-SAS is shown in Fig. 13 under the label CC-SAS-NEW. The improved CC-SAS version is dramatically better than the original, though still not quite as good as the SHMEM version (except for the 1M data set on 32 or 64 processors). If prefetching were used in the communication, the performance difference would be reduced further. Interestingly, the new version is inferior to the original for the smallest (1M) data set, because the trade-off between the savings in traffic and the increase in local work or BUSY time (for the local permutation buffering) resolves itself differently for small ratios of data set size to number of processors.
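A sketch of the restructured CC-SAS permutation just described, with hypothetical names; in the real code the destination offsets come from the global histogram/prefix computation and chunks are organized per radix digit rather than simply per processor:

    #include <string.h>

    /* Buffer keys locally by destination and then copy each buffered
       chunk contiguously into the shared output array, instead of
       scattering fine-grained remote writes as each key is permuted. */
    void permute_buffered(const int *my_keys, int nkeys, int p,
                          int *shared_out, const int *out_base,
                          int **buf, int *buf_count,
                          int (*dest_of)(int key))
    {
        for (int d = 0; d < p; d++)
            buf_count[d] = 0;
        for (int i = 0; i < nkeys; i++) {             /* local buffering */
            int d = dest_of(my_keys[i]);
            buf[d][buf_count[d]++] = my_keys[i];
        }
        for (int d = 0; d < p; d++)                   /* contiguous copies */
            memcpy(&shared_out[out_base[d]], buf[d],
                   buf_count[d] * sizeof(int));
    }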
5.5. Sample

The speedups of sample sort for the three programming models are shown in Fig. 15 for data set sizes ranging from 1M to 256M. The CC-SAS model now works best up to the 4M data set size; after that, SHMEM and CC-SAS perform similarly, with MPI following somewhat behind.

Fig. 15. Speedups for sample sort under SHMEM, CC-SAS, and MP on 16, 32, 64 processors with different problem sizes.
Compared with Radix sort, sample sort has only one global communication phase but two local sorting phases, and the communication is naturally contiguous (spatially and temporally), so it is better behaved for all models than in Radix sort: each processor sends only one message to every other processor in MPI and SHMEM (rather than potentially one per radix value), and the temporal scatteredness, and even the need for remote writes, disappear in CC-SAS (remote reads are used instead). For larger data sets, the two local sorting phases dominate the total execution time, so communication matters less; this can be seen from the per-processor time breakdown for the 64M data set in Fig. 16. For smaller data sets, CC-SAS performs better for the same two reasons it performs better on 64 processors for the 1M data size in Radix sort: we have to use an expensive collective function to collect the sample data in MPI and SHMEM, while in CC-SAS the necessary fine-grained load-store operations are directly supported by hardware, and the message overhead is not well amortized for smaller data sets. Compared with SHMEM, MPI's performance is a little worse, again because in MPI the communication is two-sided (a send and a receive) and the collective communication function is not as efficient as in SHMEM.
Fig. 16. Time breakdown for sample sort (64M) on 64 processors.
Note that the computation (BUSY) time increases substantially compared with Radix sort, due to the two local sorts needed here. If the best programming model is used for each application, sample sort performs better than Radix sort up to 64K integers per processor (due to better communication behavior) and worse after that point (because the extra local sort comes to dominate the communication savings).

5.6. LU

In LU, the communication pattern is multicast-oriented. Communication occurs when a diagonal block is needed by some processes to update the perimeter blocks they own, and when perimeter blocks are needed by processes to update their interior blocks. The speedups for LU are shown in Fig. 17. The speedups under all three programming models are lower for smaller data sets, because load imbalance is the main problem there. Here too, the models perform similarly up to 16 processors in all cases, but quite differently beyond that. Breakdowns (not shown here for space reasons) show that the performance of MP is much lower at large processor counts for small problems: it suffers much greater synchronization cost than CC-SAS and SHMEM, due both to the need to match sends and receives on individual messages (even if asynchronous) rather than waiting for them all at a barrier and to the lock-free queue management discussed earlier, and it also suffers higher communication cost due to two-sided communication. SHMEM turns out to be worse than CC-SAS, almost entirely because many barriers are used and the barrier implementation is more expensive in SHMEM than in CC-SAS: we use the barrier supplied with the SHMEM library, for which we do not have the source code, whereas in CC-SAS we use our own tournament barrier. This is the only application in which we find the barrier implementation to matter much, and then only for small data sets, but it is a problem that should be fixable.
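As an illustration of why the barrier implementation can matter, the sketch below shows a simple sense-reversing centralized barrier of the kind a user can substitute for a library barrier. It is not the tournament barrier used in our CC-SAS programs, and all names are hypothetical:

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;        /* processes that have arrived */
        atomic_int sense;        /* flipped when the last one arrives */
        int        nproc;
    } simple_barrier_t;

    /* Each process keeps its own local_sense, initialized to 0. */
    void simple_barrier(simple_barrier_t *b, int *local_sense)
    {
        *local_sense = !*local_sense;
        if (atomic_fetch_add(&b->count, 1) == b->nproc - 1) {
            atomic_store(&b->count, 0);              /* last arrival resets */
            atomic_store(&b->sense, *local_sense);   /* release the others */
        } else {
            while (atomic_load(&b->sense) != *local_sense)
                ;                                    /* spin until released */
        }
    }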
Fig. 17. Speedups for LU under SHMEM, CC-SAS, and MP on 16, 32, 64 processors with different problem sizes.
Larger data sets deliver good speedups in all cases, even though the capacity-induced superlinear effect is not substantial in this application (due to blocking), since load imbalance and the communication-to-computation ratio are reduced and barriers are less frequent, owing to the n^3 growth rate of the computation.

6. CONCLUSION

We have compared the performance of three major programming models (CC-SAS, MPI, and SHMEM) on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging (the other being commodity-oriented clusters). We have focused on applications or kernels that are either completely regular or at least do not require replication of irregularly accessed data.

First, we found that the parallel algorithm structuring and the resulting communication volume and patterns needed to obtain good performance are indeed basically the same for the three models for this class of computations, at least for the major phases of computation. The differences lie in how the resulting communication and synchronization are orchestrated. Second, we found that removing the extra copy in the MP implementation is very important for all our applications, but it requires changes both to the MPI implementation and to the API (providing a new malloc function). Using lock-free management instead of locking on the incoming message control queues further improves MP performance. The rest of our conclusions assume this new implementation of MPI message passing, impure as it may be at the application level.

Given this implementation, we find that all three programming models usually perform quite similarly up to the 16-processor scale; only beyond that do large differences usually become manifest. Even then, all three models perform well for our applications as the problem size is made very large, and the breakdowns of execution time into busy, local memory, remote memory, and synchronization components are often similar. However, smaller but still realistic problem sizes reveal substantial differences. Some of the most important differences, arising variously in different applications, have to do with the following: (i) the efficiency of fine-grained remote access at cache-line granularity compared to the overhead of packing and unpacking data in some cases (FFT, where CC-SAS does best for smaller problem sizes); (ii) conversely, the disadvantage of fixed-size remote access when spatial locality on remote data is not good in CC-SAS (e.g., smaller problem sizes in Ocean); (iii) whether explicit transfers put data in the cache or only in main memory at the destination
(e.g., MP versus SHMEM in FFT); (iv) differences in cache conflict behavior, even on local data, in programs in which many large data structures are decomposed into large chunks and then allocated differently in the different models (Ocean); (v) situations in which transparent cache coherence protocol actions get in the way of CC-SAS performance, as opposed to explicitly managed communication (an issue primarily in Radix); and (vi) the implementation of barriers in barrier-heavy situations (e.g., smaller problem sizes in LU), which should be fixable by using the underlying machine support.

In general, SHMEM is the best-performing of the three models for applications of the class we have considered; in the exceptional cases, CC-SAS is best at smaller problem sizes. Our results for the MP and CC-SAS models also support earlier results from simulation studies indicating that explicit message passing does not have substantial advantages over the native load-store CC-SAS model on efficient modern hardware-coherent multiprocessors, even for regular applications with naturally coarse-grained communication. (5, 6) This holds despite the fact that we do not use prefetching to hide remote access latency in our CC-SAS programs (or to hide local access latency in any model).

In terms of programmability, we found the CC-SAS model the simplest to use, though SHMEM was also very simple for these regular applications. We can take advantage of this simplicity to program more complex algorithms easily and achieve high performance, as we did in radix sort (the prefix tree). In the MP model, the inability to name data at the other end of a send or receive sometimes made it difficult to compute the parameters of the send and receive calls, and made us change the way some less important phases of computation were done, incurring more communication in them than under the other models (e.g., the histogram computation in Radix). This restructuring was also useful in letting SHMEM use get- rather than put-based communication in these cases, which helped by bringing data directly into the cache.

Having studied some regular applications, we plan to extend this work to other classes of applications, especially those that have irregular, unpredictable, and naturally fine-grained data access and communication patterns. We will also study how the tradeoffs among models change on larger-scale machines, as well as on flat and two-level clusters, where all programming models are implemented in software across nodes.

ACKNOWLEDGMENTS

We thank Chris Ding for sharing his FFT code with us, Eric Salo for his generous help in understanding SGI's MPI implementation, Alexandros Poulos, Rohit Chandra, and John McCalpin for their help in using the
performance tools, Michael Way and Stephane Ethier for their help in using the Origin2000, and Dongming Jiang for her help with the CC-SAS programs.
REFERENCES

1. Message Passing Interface Forum, Document for a Standard Message Passing Interface (June 1993), http://www-c.mcs.anl.gov/mpi.
2. J. P. Singh, A. Gupta, and J. L. Hennessy, Implications of Hierarchical N-Body Techniques for Multiprocessor Architecture, ACM Trans. Computer Syst. (May 1995).
3. J. P. Singh, A. Gupta, and M. Levoy, Parallel Visualization Algorithms: Performance and Architectural Implications, IEEE Computer, 27(6) (June 1994).
4. T. A. Ngo and L. Snyder, On the Influence of Programming Models on Shared Memory Computer Performance, Scalable High Performance Computing Conf. (April 1992).
5. S. Chandra, J. R. Larus, and A. Rogers, Where is Time Spent in Message Passing and Shared Memory Programs, ASPLOS (October 1994).
6. S. C. Woo, J. P. Singh, and J. L. Hennessy, The Performance Advantages of Integrating Message-Passing in Cache-Coherent Multiprocessors, Proc. Architectural Support Progr. Lang. Oper. Syst. (1994).
7. D. Kranz et al., Integrating Message-Passing and Shared Memory: Early Experience, Principles and Practice of Parallel Progr. (May 1993).
8. T. LeBlanc and E. Markatos, Shared Memory vs. Message Passing in Shared-Memory Multiprocessors, Fourth SPDP (1992).
9. A. C. Klaiber and H. M. Levy, A Comparison of Message Passing and Shared Memory Architectures for Data Parallel Programs, Proc. 21st Ann. Int'l. Symp. Computer Architecture (April 1994).
10. H. Lu, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel, Quantifying the Performance Differences Between PVM and TreadMarks, J. Parallel and Distributed Computing (June 1997).
11. S. Karlsson and M. Brorsson, A Comparative Characterization of Communication Patterns in Applications Using MPI and Shared Memory on the IBM SP2, Network-Based Parallel Computing, CANPC'98 (1998).
12. D. Cortesi, Origin2000 and Onyx2 Performance Tuning and Optimization Guide (1997), http://techpubs.sgi.com.
13. ANL/MSU MPI Implementation, MPICH: A Portable Implementation of MPI (June 1995), http://www-c.mcs.anl.gov/mpi/mpich.
14. S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, Proc. 22nd Ann. Int'l. Symp. Computer Architecture (June 1995).
15. D. H. Bailey, FFTs in External or Hierarchical Memories, J. Supercomputing, 4:23-35 (1990).
16. G. E. Blelloch et al., A Comparison of Sorting Algorithms for the Connection Machine CM-2, Symp. Parallel Algorithms and Architectures (July 1991).
17. D. Jiang and J. P. Singh, Does Application Performance Scale on Modern Cache-Coherent Multiprocessors: A Case Study of a 128-Processor SGI Origin2000, Proc. 26th Int'l. Symp. Computer Architecture (May 1999).
18. H. Shan, J. Feng, and H. Shan, Programming FFT on DSM Multiprocessors, HPC Asia 2000 (May 2000).
19. NASA Ames Research Center, The NAS Parallel Benchmarks 2.0, http://science.nas.nasa.gov/software/NPB (November 1995).
20. X. Li and P. Lu, On the Versatility of Parallel Sorting by Regular Sampling, Parallel Computing (1993).
21. A. Sohn, P. Druschel, and W. Zwaenepoel, IO-Lite: A Unified I/O Buffering and Caching System, Proc. Third Symp. Oper. Syst. Design and Implementation, New Orleans, Louisiana (February 1999).