Parallel Sorting on Cache-coherent DSM Multiprocessors

Hongzhang Shan and Jaswinder Pal Singh
Department of Computer Science, Princeton University
{shz, jps}@cs.princeton.edu

Abstract
The performance of parallel sorting is not well understood on hardware cache-coherent shared address space (CC-SAS) multiprocessors, which increasingly dominate the market for tightly-coupled multiprocessing. We study two high-performance parallel sorting algorithms, radix and sample sorting, under three major programming models (a load-store CC-SAS, message passing, and the segmented SHMEM model) on a 64-processor SGI Origin2000. We observe surprisingly good speedups on this demanding application. The performance of radix sort is greatly affected by the programming model and the particular implementation used. Sample sort exhibits more uniform performance across programming models on this platform, but for larger data sets it is usually not as good as the best radix sort when each algorithm is allowed to use the programming model that suits it best. The best combination of algorithm and programming model is radix sorting under the SHMEM model for larger data sets and sample sorting under CC-SAS for smaller data sets.
1 Introduction

Sorting is an important kernel for high-performance multiprocessing. It is also a core utility for database systems in organizing and indexing data. Dramatic improvements in sorting have been made in the last few years, largely due to increased attention to interactions with computer architecture. Zagha [14] designed a radix sort algorithm for the CRAY Y-MP vector machine that runs 25 times faster than the sorting routine provided in that machine's library. Blelloch et al. [4] and Dusseau et al. [1] present comparative studies of different parallel sorting algorithms on the CM-2 and CM-5 multiprocessors, respectively. Sohn and Kodama [3] implemented a load-balanced parallel radix sort algorithm that can sort 0.5G integers in 20 seconds on a 64-processor IBM SP2-WN. Dusseau et al. and Rivera et al. studied the performance of disk-to-disk sorting on clusters of workstations and broke the minute-sorting record using these parallel machines [2, 9]. Helman [7] and Li [13] studied sample sort with regular sampling on various message passing and vector computers and found it to have good portability and to outperform other sampling strategies for sample sorting.

Recently, a new type of platform has begun to dominate tightly-coupled multiprocessing, and together with less tightly-coupled commodity clusters it constitutes the state-of-the-art going forward in multiprocessing. This type of platform supports a shared address space with implicit coherent replication of data in hardware, even though memory is physically distributed. It is very different from the traditional message-passing
machines, vector supercomputers and clusters on which high-performance sorting has been studied. It is thus important to investigate the performance of parallel sorting on this type of platform. A radix sorting program for this type of platform is included in the SPLASH-2 suite [12], and a CC-SAS sample sorting program has been discussed in [8].

With today's architectural convergence, it is common for the same platform to support different programming models, either directly in hardware or via software or both. With this flexibility available to programmers, it is important to understand how programming models affect application performance on a given type of platform. Do the same basic algorithms perform well under different models? What is the best combination of algorithm and programming model for a given type of platform? And how does the ease of programming compare? These questions have important implications for both system designers and users; they are especially important for tightly-coupled hardware-coherent machines, which support all the programming models quite efficiently.

This paper investigates the performance of parallel sorting on this type of platform using two sorting algorithms and three major programming models. The two algorithms are radix sort and sample sort, which have previously been shown to outperform other sorting algorithms, especially for the larger data sets that demand parallel computing [4, 1, 3, 7]. The three programming models are (i) explicit message passing (MP), using the MPI library, in which both communication and replication are explicit; (ii) a cache-coherent shared address space (CC-SAS), in which both communication and replication are implicit; and (iii) the SHMEM programming model. SHMEM is like MPI in that communication and replication are explicit and usually made coarse-grained for good performance; however, unlike the send-receive pair in MPI, communication in SHMEM requires application process involvement on only one side (using put or get primitives). SHMEM also supports a symmetric, segmented address space, which simplifies naming compared to MP by allowing a process to name (specify) remote data via a local name and a process identifier (in MP, a process cannot name another process's data).

The platform we use, a 64-processor SGI Origin2000, is a leading example of the tightly-coupled, hardware-coherent style of architecture. It has an aggressive communication architecture and relatively few organizational artifacts, so it forms a good basis for examining the potential of and bottlenecks in this style of architecture, as well as for comparing algorithms and programming models. The Origin 2000 provides full hardware support for the CC-SAS model. The other two programming models leverage this hardware support and the aggressive communication architecture, but are implemented in software (as they are on most systems). For MPI, implementations of pure message passing such as the SGI vendor-supplied implementation have performance problems. Specifically, since address spaces are private in the MP model and a process cannot reference another process's address space directly, to transfer data from one process to another the data often has to be copied by the library to an internal buffer in the underlying shared address space and then from that buffer to the destination.
This buffering is needed to enable asynchronous messages, whereby the MPI library can return from MPI functions quickly and give execution control back to the user without worrying about data being overwritten, thus tolerating message latency. However, the extra staging copy greatly increases message overhead, hurting application performance. We therefore developed our own MPI, starting from the MPICH source code, that implements an "impure" message passing model and substantially outperforms the SGI MPI implementation [10]. The impurity in this version is that it allows a process to transfer data directly into another process's address space; taking advantage of this requires that at least those application data structures that are
involved in communication be allocated in the underlying shared address space rather than in private address spaces as in the pure MP model (a simple change to make). We use our implementation in the results we present. An added advantage is that we can instrument the source code of MPICH to obtain per-process execution time breakdowns. We also present results obtained with the SGI vendor-supplied implementation, since it is widely used.

The rest of this paper is organized as follows. Section 2 briefly describes the Origin 2000 platform, and Section 3 describes the parallel radix and sample sorting implementations we used. Performance is analyzed and compared in Section 4, which also examines methods for addressing performance bottlenecks in the application. Finally, Section 5 summarizes our key conclusions.
2 The SGI Origin2000 Platform

The SGI Origin 2000 is a scalable, hardware-supported, cache-coherent, non-uniform memory access (CC-NUMA) machine, with perhaps the most aggressive communication architecture among such machines today. The machine we use has 64 processors, organized in 32 nodes with two 195 MHz MIPS R10000 microprocessors and 512 MB of main memory each. Each processor has separate 32 KB first-level instruction and data caches, and a unified 4 MB second-level cache with 2-way associativity and a 128-byte block size. The machine has 16 GB of main memory with a default page size of 16 KB (which can be changed easily). Each pair of nodes (i.e., 4 processors) is connected to a network router. The interconnect topology across the 16 routers (node pairs) is a hypercube. The peak point-to-point bandwidth between nodes is 1.6 GB/s (total in both directions). The average uncontended read latencies to access the first word of a cache line are as follows: local memory, 313 ns; the average over local and all remote memories on a machine this size, 796 ns; and the furthest remote memory, 1010 ns [6]. The latency grows by about 100 ns for each router hop.
3 Sorting Implementations

We do not describe the parallel radix and sample sorting algorithms themselves, since they are well known. The basic parallel algorithms are also similar across programming models, a useful property that allows the programming models to be compared more easily. Rather, we focus on how data access, communication and synchronization are orchestrated in the three programming models. For each algorithm, we first describe the implementation for the CC-SAS model and then discuss the differences across models. We also briefly discuss programming complexity.
3.1 Radix sort

CC-SAS. The CC-SAS radix sort is borrowed from the SPLASH-2 application suite and is based on the method described in [4]. The algorithm is iterative, performing one iteration for each r-bit digit in the keys, where r is the radix used. The maximum key value determines how many iterations are actually needed. Each iteration uses two arrays of n keys each (an input and an output array, which toggle roles across iterations), each partitioned into p parts. A process's partition of the input array contains the keys it is assigned to process for that iteration. In every iteration, every process first sweeps over its assigned keys and generates a local histogram of the frequencies of their values in the current radix digit. After this, the local histogram is accumulated
into global histograms, using a parallel prefix tree. Then, each process uses the local and global histogram values to permute its keys into the output array, conceptually performing remote writes to other processes' partitions of that array and resulting in all-to-all personalized communication. The input and output arrays swap roles in the next iteration (for the next digit). In our original (SPLASH-2) CC-SAS program, keys are written directly into the output array as their permuted positions are computed from the histograms. Thus, although a given process will end up writing to several (up to 2^r) contiguous segments of each process's partition of the output array (the average segment size being n/(2^r p), where n is the number of keys, r is the radix size and p is the number of processes), these writes are temporally interleaved with writes to many other segments and hence appear scattered.

MPI. Our MPI implementation follows the same overall structure as the SPLASH-2 CC-SAS program. The first major difference is in how the global histogram is generated from the local histograms. In the CC-SAS implementation, a single shared copy of the global histograms is constructed, using a binary prefix tree. In MPI, the fine-grained communication needed for this turns out to be very expensive. We therefore use an MPI_Allgather to collect the local histograms from all processes and make a local copy of each for all of them. Then, each process computes the global histograms locally. The performance of this phase does not affect overall performance much, which is dominated by the permutation itself. In fact, having all the histogram information locally greatly simplifies the later computation of parameters for the MPI send/receive functions in the permutation phase. Another difference is that in the MPI implementation, it is extremely expensive to send/receive a message for each key in the permutation phase. Since the keys that process i permutes into process j's partition of the output array end up falling into that partition in several contiguous chunks, of average size n/(2^r p), our MP program first writes the data locally into contiguous buffers to compose larger messages before sending them out; this amounts to a local permutation of the data (using the now-local histograms) followed by communication. An interesting question is how to send the data. One possibility is for process i to send only one message to each other process j, containing all its chunks of keys that are destined for j. Process j then reorganizes the data chunks to their correct positions in its partition of the output array at the other end. This is similar to the algorithm used in the NAS parallel application IS [5]. Another method is for a process to send each contiguously-destined chunk of keys directly as a separate message, so that it can be put into the correct position at the destination process, leading to multiple messages from process i to each other process. This is an implementation-dependent tradeoff between communication and computation. Our experiments show that the latter performs better than the former on this machine, so we use the latter.

SHMEM. Our SHMEM implementation is transformed from the MPI program, though the specification of communication is simplified. Since SHMEM uses one-sided communication, only one of the sender and receiver needs to compute the parameters for the message, not both.
Since the entire histogram data is available locally to each process, receiver-initiated communication can be used: each needed remote chunk of permuted keys is brought into its destination locations using a get operation (get has the advantage that data are brought into the cache, whereas put does not deposit them in the destination cache). The symmetric arrangement of the process partitions of the arrays makes this easy to program: a process simply specifies the positions within a partition and the source partition or process number.
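To make the orchestration concrete, the following is a minimal sketch of one pass, with invented names and with the global-histogram exchange and buffer management elided; it is our own illustration, not the program text. It shows the local histogram over the current digit, the local permutation into contiguous per-bucket chunks (as in the MP and SHMEM versions), and a receiver-initiated fetch of a remote chunk using the OpenSHMEM-style shmem_getmem (the SGI/Cray SHMEM names differ slightly).

    #include <string.h>
    #include <shmem.h>

    #define R 8                      /* radix size in bits (6..12 in our runs) */
    #define BUCKETS (1 << R)

    /* One local pass: histogram the digit at bit position `shift`, then
     * permute the keys into contiguous per-bucket chunks in buf[], so each
     * chunk can later be moved as one large transfer instead of one
     * fine-grained write per key. */
    void radix_pass_local(const unsigned *in, unsigned *buf, long n_local,
                          int shift, long hist[BUCKETS])
    {
        long rank[BUCKETS], sum = 0;

        memset(hist, 0, BUCKETS * sizeof(long));
        for (long i = 0; i < n_local; i++)
            hist[(in[i] >> shift) & (BUCKETS - 1)]++;

        for (int b = 0; b < BUCKETS; b++) {   /* exclusive prefix sum */
            rank[b] = sum;
            sum += hist[b];
        }
        for (long i = 0; i < n_local; i++)
            buf[rank[(in[i] >> shift) & (BUCKETS - 1)]++] = in[i];
    }

    /* Receiver-initiated permutation step: pull one remote chunk into our
     * partition of the output array. send_buf must be symmetric, so our
     * local address for it is also valid on the source PE; offsets and
     * lengths come from the locally replicated histograms. A
     * shmem_barrier_all() between composing and fetching ensures every
     * process's chunks are complete before any gets begin. */
    void fetch_chunk(unsigned *out, long dst_off, const unsigned *send_buf,
                     long src_off, long nkeys, int src_pe)
    {
        shmem_getmem(&out[dst_off], &send_buf[src_off],
                     nkeys * sizeof(unsigned), src_pe);
    }

In the MPI version, the same locally composed chunks would instead be handed to send/receive calls, one message per contiguously-destined chunk, as described above.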
3.2 Sample sort

CC-SAS. Suppose we have p processes, and each process has its own partition of keys. The CC-SAS sample sort program proceeds in five phases:
1. Each process sorts its own keys locally using radix sort.
2. Each process selects a fixed number of keys (sample keys) from its sorted array.
3. A small number of processes is responsible for reading these sample keys from all processes, sorting them, and selecting p-1 of the collected keys; these are the splitters used in the next phase.
4. Each process uses the p-1 splitters to decide (locally) how to distribute its keys to other processes or fetch keys from them. An all-to-all personalized communication follows to distribute the keys.
5. Each process sorts its received keys locally.
Thus, sample sort does two local sorts (the first and last phases), and hence almost double the sorting work of radix sort, but its communication behavior is better than that of parallel radix sort. It does not involve scattered writes in CC-SAS, and it requires neither one message per small chunk nor the reorganization of received data from small chunks at the destination: there is one contiguous message between each pair of processes. There is also no loop around the phases above: each local sort sorts the keys completely. There are many ways to decide how to sample the keys in the second phase and how to find the p-1 splitters in the third phase; these affect load balance and program complexity [13]. We choose a method that performs best on our system: each process selects 128 sample keys in the second phase; in the third phase, every set of 32 processes forms a group and selects one member to be responsible for collecting the sample keys, sorting them, and communicating with the other groups to find the splitters.

MPI. In the MPI program, the first, second and fifth phases are the same as in the CC-SAS program. In the third phase, we use MPI_Allgather instead of loads and stores to collect the sample keys from all processes. The computation of the splitters then becomes completely local, with the tradeoff that much of it is redundantly performed on all processes. Unlike in the CC-SAS program, we do not divide the 64 processes into two groups. In the fourth phase, each process uses explicit send/receive operations to distribute its keys to their destinations.

SHMEM. The SHMEM program is obtained directly from the MPI program. The only difference is that in the fourth (communication) phase it uses a get operation in place of the send/receive pair.
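Since each process's keys are already sorted locally after the first phase, the fourth-phase partitioning reduces to locating the p-1 splitters in the sorted array. A minimal sketch of this step (our own illustration, with invented names, not the program text):

    /* Find the first index at which v could be inserted into the sorted
     * array a[0..n) while keeping it sorted (a standard lower bound). */
    static long lower_bound(const unsigned *a, long n, unsigned v)
    {
        long lo = 0, hi = n;
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (a[mid] < v) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /* Given locally sorted keys and the p-1 sorted splitters, compute how
     * many keys go to each of the p destination processes. Because the
     * keys are sorted, p-1 binary searches suffice, and each destination's
     * keys form one contiguous block, i.e., one message (or one get) per
     * pair of processes. */
    void partition_counts(const unsigned *sorted_keys, long n,
                          const unsigned *splitters, int p, long *count)
    {
        long prev = 0;
        for (int d = 0; d < p - 1; d++) {
            long pos = lower_bound(sorted_keys, n, splitters[d]);
            count[d] = pos - prev;    /* keys destined for process d */
            prev = pos;
        }
        count[p - 1] = n - prev;      /* the rest go to the last process */
    }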
3.3 Data sets

Since parallel sorting performance depends on the distribution of key values, we initialized the keys with eight methods: Gauss, random, zero, bucket, stagger, half, remote and local. The first five are from the literature; the last three are designed by us to exercise certain behaviors. In the following descriptions, p is the number of processes, n is the number of integer keys, r is the radix size, and MAX is the maximum key value, which was set to 2^31.
Gauss is the default method used in the original SPLASH-2 radix program and in the NAS parallel benchmark IS [5]. Each key is the average of four consecutive uniformly distributed pseudo-random numbers generated by the recurrence x_{k+1} = a*x_k (mod 2^46), where a = 513 and x_0 = 314159265.

Random simply calls the C library random number generator random() to initialize each key. It returns successive pseudo-random numbers in the range 0 to 2^31 - 1.

Zero is essentially the same as random, except that every tenth key is set to zero.

Bucket is obtained by setting the first n/p^2 elements initially assigned to each process to random numbers between 0 and MAX/p - 1, the second n/p^2 elements at each process to random numbers between MAX/p and 2*MAX/p - 1, and so forth.

Stagger is obtained by setting all the n/p elements initially assigned to the process with index i to random numbers between (2i+1)*MAX/p and (2i+2)*MAX/p if i is less than p/2, and otherwise to random numbers between (2i-p)*MAX/p and (2i-p+1)*MAX/p.

Half is almost the same as Gauss, except that all keys are restricted to be even numbers.

Remote is designed to maximize communication and is created as follows: for all the n/p elements initially assigned to the process with index i, (i) the first r bits of each key (r being the radix size), starting from the least significant bit, are set to random numbers between 0 and 2^r - 1, excluding the range between i*2^r/p and (i+1)*2^r/p; (ii) the second r bits are set to values between i*2^r/p and (i+1)*2^r/p; (iii) the third r bits are the same as the first r bits; (iv) the fourth r bits are the same as the second r bits; and so forth.
Local incurs no remote communication. All the n/p elements initially assigned to the process with index i are created as follows: the first r bits, starting from the least significant bit, are set to random numbers between i*2^r/p and (i+1)*2^r/p, and this value is duplicated for the next several r-bit digits.
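For concreteness, here is a sketch of the Gauss initialization as we read the description above; the scaling of the 46-bit sums down to the key range, and per-process seeding for parallel initialization, are elided, so treat it as an illustration rather than the exact generator.

    #include <stdint.h>

    /* Each key is the average of four consecutive values of the linear
     * recurrence x_{k+1} = a * x_k mod 2^46, with a = 513 and
     * x_0 = 314159265, as in the SPLASH-2 radix program and NAS IS. */
    void init_gauss(uint64_t *keys, long n)
    {
        const uint64_t a = 513;
        const uint64_t mask = (1ULL << 46) - 1;   /* reduction mod 2^46 */
        uint64_t x = 314159265;

        for (long i = 0; i < n; i++) {
            uint64_t sum = 0;
            for (int j = 0; j < 4; j++) {
                x = (a * x) & mask;   /* a*x < 2^56, so no 64-bit overflow */
                sum += x;
            }
            keys[i] = sum / 4;        /* average of four draws */
        }
    }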
We selected these distributions for various reasons. Gauss is the default distribution used in the SPLASH-2 and NAS application suites. Random, zero, bucket and stagger have been used by other researchers [3, 7]. We designed the remote, half and local distributions. The purpose of the remote distribution is to maximize the number of keys moved between processes: for example, all the elements belonging to one process in one iteration of the parallel radix sort will be dispersed to other processes in the next iteration. The local distribution has the opposite behavior: all the elements belonging to one process in one iteration are still owned by the same process in the next iteration, so no communication is needed to permute the keys. The half distribution is designed to observe the effect of reducing the number of messages while keeping the amount of data transferred fixed. Since all the keys are even numbers, no data falls into the odd buckets during sorting; thus the number of messages needed to distribute the elements in radix sort from one process to another is halved compared with the Gauss distribution, and the message size is correspondingly increased.
4 Experimental Results and Discussion

This section compares the performance of the applications under the different programming models. For each application, we first examine speedups for different data set sizes and processor counts, measuring them with respect to the same sequential radix sorting program for both algorithms and all models (recall that sample sorting uses radix sort for its local sorts, and a single local sort is all there is to do on a uniprocessor). The execution times are obtained using the page size that gives the best performance: 64 KB for the 1M-64M data sets and 256 KB for the 256M data set. Then, we examine per-processor breakdowns of execution time, obtained using program/library instrumentation and various tools available on the machine, to gain more detailed insight into how time is spent in each programming model and where the bottlenecks are. We divide the per-processor execution time into four categories: CPU time spent executing instructions, assuming no memory stalls (BUSY); CPU stall time waiting for local cache misses (LMEM); CPU stall time for communicating remote data (RMEM); and CPU time spent at synchronization events (SYNC). For CC-SAS programs, with their implicit data access and communication, the available tools on this machine do not allow us to distinguish LMEM time from RMEM time, so we are forced to lump them together (MEM = LMEM + RMEM). They can, however, be distinguished for the other two models. Finally, we study the performance effects of data set characteristics and radix size.

    Keys:        1M        4M        16M        64M         256M
    Time (us):   1610142   7013044   33668308   143693696   947575676

Table 1: The sequential execution time (in microseconds) of radix sort for different numbers of keys, initialized with the Gauss distribution.

We are interested in the performance of the algorithms when the number of keys per processor is large enough, specifically starting from 16K. For much smaller data sets, it is difficult to get good parallel performance on a 64-processor machine. The sequential execution times for the different data set sizes, initialized with the Gauss distribution, are shown in Table 1.
4.1 MPI Performance

The introduction discussed the development of our own "impure" MPI (starting from MPICH), which outperforms the SGI vendor-supplied MPI implementation (MPT 1.3). We begin our discussion of performance by examining the difference between these two for the same application. Recall that our implementation violates the pure message passing model by allowing a process to transfer data directly into another process's address space without buffering in the library; taking advantage of this requires that the application data structures involved in communication be allocated in the underlying shared address space rather than in private address spaces as in the pure message passing model. Our implementation also manages memory buffers, queues and synchronization more efficiently than MPICH [11]. While SGI's MPT 1.3 (the version we compare against) outperforms their previous MPT 1.2 implementation, we have not been able to determine what changes they made or whether any of them are similar to our changes to MPICH.

Figure 1 shows the results for radix sort with the Gauss distribution. The results for our implementation are labeled NEW. Our implementation performs much better, especially on larger numbers of processes.
Figure 1: Speedups of radix sort for the two MPI implementations, for different numbers of processes and data set sizes.
Further analysis shows that the performance difference is indeed mainly caused by differences in remote communication time; the local sorting times of the two implementations are similar. For sample sort, the performance gap between the two implementations (shown in Figure 2) is smaller. This is because sample sort has only one remote communication stage and two local sorting stages (i.e., more computation relative to communication than in radix sort), and the communication involves fewer messages, as discussed earlier: each process needs only one message to send its keys to another process, while in radix sort up to 2^r/p messages may be needed per destination, where r is the radix size and p is the number of processes.
Figure 2: Speedups of sample sort for the two MPI implementations, for different numbers of processes and data set sizes.

We next examine performance across programming models. In the following sections, when we refer to MPI performance, we mean our improved MPICH implementation (NEW).
4.2 Radix sort

The speedups for radix sort under the three programming models with the Gauss distribution are shown in Figure 3 for data sets ranging from 1M to 256M integer keys. For larger data sets, the effect of limited cache and memory capacity on a node often yields
superlinear speedup. This is a real effect, and in any case the fact that we use the same sequential program as the baseline allows us to use the speedups to compare absolute performance across algorithms and programming models. If we substitute, for the MEM time of the uniprocessor execution, the sum of the LMEM times across processors in a 64-processor MPI (or SHMEM) run, we obtain a rough estimate of the speedup without superlinear effects. For example, the speedup computed in this way for 64M integer keys under MPI is only 38, instead of 75, on 64 processors, implying that the capacity-induced superlinear effect contributes a factor of about 2 to the speedup.
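In symbols (our notation, reconstructing the estimate just described): if BUSY_1 is the busy time of the uniprocessor run, LMEM_i the local-memory stall time on processor i of the parallel run, and T_p the parallel execution time, then

    estimated speedup without capacity effects = (BUSY_1 + LMEM_0 + ... + LMEM_{p-1}) / T_p,

which for the 64M-key MPI run on p = 64 processors gives about 38, versus the measured 75.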
Figure 3: Speedups of radix sort for the three models on 16, 32, and 64 processors.

The SHMEM model almost always performs best, except for the smallest (1M-key) data set on 32 and 64 processors, where the CC-SAS model does best. There are two reasons for this exception. One is that in SHMEM, as in MPI, we use a collective communication function by which each process collects the local histograms from all processors. This operation has a fixed cost that does not change with the data set size, so for smaller data sets it occupies a larger fraction of the execution time. In CC-SAS, only a shared histogram is built, without replicating the local histograms, and the efficient fine-grained load-store communication allows the histogram accumulation to be implemented with a binary prefix tree, so the computation of the global histogram is quite cheap. The other reason is that the message size in SHMEM and MPI is small for this data set on 64 processors even in the key permutation phase, so the message overhead is not amortized well and the RMEM time is higher for MPI and SHMEM (especially for MPI, with its two-sided communication and overhead). As the data set size increases, both problems for SHMEM and MPI diminish greatly. However, the performance of the CC-SAS program suffers. To understand this effect better, we show the per-processor time breakdowns for the 64M-key data set in Figure 4. From Figure 4 (a) we find that the MEM time in CC-SAS is very high, and it dominates the total execution time. False sharing of data is very low for this configuration. The reason for the high MEM time is that in the CC-SAS case not only is the communication very bursty, with little computation to overlap it, but the nature of the remote-write-based communication is such that many cache coherence protocol transactions, such as invalidations and acknowledgements, compete for communication resources with data transfer. In addition, for large problem sizes data are written back from the cache as they are replaced, and these writeback transactions further contend for controllers and other resources. This contention causes performance to suffer greatly. In MPI (Figure 4 (c)) and SHMEM (Figure 4 (d)) the explicit messages are larger and less scattered due to local buffering, and there are not nearly so many protocol transactions to contend with data movement. Thus the total memory time is lower.
Figure 4: Time breakdown for radix sort (64M) on 64 processors.

Compared with SHMEM, MPI has higher SYNC time, despite the lock-free mechanism used to manage queues for incoming messages (see [11]). In fact, the higher time has to do with the implementation of the lock-free mechanism. Every process has a separate 1-deep buffer for communicating with each of the other processes. If one process wants to send several consecutive messages to the same process, each message has to wait until the previous one has been received by the receiver. This leads to higher synchronization time. Using deeper buffers alleviates the problem, but does not eliminate it in this application, since many messages (chunks) are sent from each processor to each other processor; moreover, each added buffer slot requires O(p^2) memory. In SHMEM, since communication is one-sided, messages between the same pair of processes can proceed one after another without delay, and the synchronization time is smaller.

4.2.1 Improving CC-SAS Performance for Radix Sort

The effect of protocol interference in CC-SAS can be greatly reduced by restructuring the application to bring it closer to the SHMEM and MPI implementations in the permutation phase (while retaining its histogram-accumulation advantages). That is, the program can locally buffer the data during the permutation and then copy the chunks to their remote destinations, reducing the temporal scatteredness of the fine-grained writes. Although this does not eliminate coherence protocol interference, it reduces it greatly. The effect can be clearly seen from the new per-processor time breakdown in Figure 4 (b). The new speedup for CC-SAS is shown in Figure 3 under the label CC-SAS-NEW. The improved CC-SAS version is dramatically better than the original, though it is still not quite as good as the SHMEM version (except for the 1M data set on 32 or 64 processors). Interestingly, the new version is inferior to the original for the smallest (1M) data set, because the trade-off between the savings in traffic and the increase in local work or BUSY time (for buffering) resolves differently for small ratios of data set size to number of processors.

4.2.2 Effect of Distribution of Keys

Figure 5 shows the execution time for the different key distribution methods described in Section 3.3, relative to the execution time with the Gauss distribution. Results are presented for the SHMEM programming model; for the other two models, the conclusions are similar. We found that the local method indeed performs best, because every process holds its initially assigned keys and there is no remote key movement at all. The only interprocess communication is the collective function call to obtain the key distribution information.
Figure 5: The relative execution time for radix sort under the SHMEM programming model on 64 processors.
Quite surprisingly, for the other distribution methods the execution times are quite similar to one another. The exception is the remote distribution at 256M keys, where performance is counter-intuitively better than with the other distributions. This is because the remote data set exhibits better spatial locality in local accesses during the local sort phase after the first pass: by the design of the distribution, the data assigned to a process are already sorted within each of p chunks (p being the number of processors). So there is little local (scattered) permutation of data and hence few TLB misses (this is also true of the local distribution, which does best). This effect becomes prominent at 256M integer keys (and beyond), since at that point the data being locally permuted no longer fit in the 4 MB per-processor second-level cache, so the data access pattern matters a great deal. It is visible for the 64M data set on 64 processors too. Finally, to our surprise, the performance of the half distribution is quite similar to that of the Gauss distribution despite its smaller number of messages. Aggregate data traffic seems to matter more than the number of messages in the permutation phase of this application.

4.2.3 Effect of Radix Size

Another important factor that affects sorting performance is the size of the radix used. The radix size r determines the number of sorting passes or iterations needed (32/r, rounded up), the total number of messages per pass (2^r p) and their average size (n/(2^r p)). The optimal radix size may therefore depend on n and p as well as on other machine factors. We present results for radix sizes 6 through 12; we also experimented with larger and smaller values and found that none of them achieves better performance for our configurations. The performance relative to radix size 8 is shown for the Gauss distribution on 64 processors under the SHMEM model in Figure 6. The effect of radix size is much larger for smaller data sets than for larger ones. For 1M integers, radix 7 (5 passes) performs best. Radix 8 (4 passes) is best for 4M-16M keys. For 64M and 256M keys, the best radix sizes are 11 and 12 (3 passes), respectively. Larger data sets usually require a bigger radix to achieve the best performance, since their communication overhead is relatively smaller, so reducing the number of (expensive) passes matters more. We can sort 1G integers using radix 12 in 30 seconds on our machine, even though our code is not hand-optimized for single-node performance. The performance of radix 8 is quite good across all the data set sizes we studied on 64 processors, though it is a little worse on the larger data sets than the best radix size.
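As a concrete illustration of these formulas (our arithmetic, not additional measurements): for n = 256M keys, p = 64 and r = 12, a run needs ceil(32/12) = 3 passes, generates about 2^12 * 64 = 262144 chunks per pass, and the average chunk holds 256M/(2^12 * 64) = 1024 keys; with r = 8, the same run would need 4 passes but would send 16K-key chunks on average.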
Figure 6: The relative performance for radix sort under the SHMEM programming model on 64 processors, using radix sizes 6 to 12; the Gauss distribution is used.
4.3 Sample sort

The speedups of sample sort for the three programming models are shown in Figure 7 for data set sizes ranging from 1M to 256M keys, initialized with the Gauss method.
Figure 7: Speedups of sample sort for the three models on 16, 32, and 64 processors.

The CC-SAS model now works best up to the 4M data set size. Beyond that, SHMEM and CC-SAS perform similarly, with MPI following somewhat behind. Compared with radix sort, sample sort has only one global communication phase but two local sorting phases, and its communication is naturally contiguous (spatially and temporally), so it is better behaved than in radix sort: each processor sends only one message to each other processor in MPI and SHMEM, and the temporal scatteredness, and even the need for remote writes, disappear in CC-SAS (remote reads are used instead). For larger data sets, the two local sorting phases dominate the total execution time, so communication matters less. This can be seen from the per-processor time breakdown for the 64M data set in Figure 8. For smaller data sets, CC-SAS performs better for the same two reasons that it performs better on 64 processors for the 1M data set in radix sort: MPI and SHMEM use an expensive collective function to collect the sample keys, while in CC-SAS the load/store operations are directly supported in hardware, and the message overhead is not well amortized for smaller data sets. Compared
with SHMEM, MPI’s performance is a little worse, again because in MPI the communication is two-sided (send and receive) and the collective communication function is not so efficient as in SHMEM. Note that the computation (BUSY) time increases a lot compared with radix sort due to the two local sorts needed here (we shall compare the two further in section 4.4). (a) CC-SAS
Figure 8: Time breakdown for sample sort (64M) on 64 processors.
4.3.1 Effect of Distribution of Keys

The execution times of sample sort for the different key distributions on 64 processors under the CC-SAS model are shown in Figure 9, relative to the execution time with the Gauss distribution. The local distribution again performs best, since there is no remote key movement and it also has good data locality during the local sort phases. For the other distributions, before the cache size limit is reached (4 MB, i.e., up to 1M integer keys per processor or 64M keys in total), the characteristics of the keys have little effect on the execution time. Beyond that point, the execution times for the remote and half distributions become much lower than for the other methods (except local). As with radix sort, the difference lies in the local sorting phases and is caused by the better spatial data locality of the remote and half methods (which is also true of the local distribution, and which affects performance once the data no longer fit in the cache). Compared with radix sort, (i) the effect of data locality becomes prominent from the 64M-key data set rather than the 256M-key data set, and (ii) the half distribution is more strongly affected. One reason is that in sample sort each local sort phase sorts the data continuously for 32/r passes, while in radix sort there is remote communication between the passes; the latter causes other overhead, such as cache coherence traffic and TLB misses, and dilutes the effect of data locality. Another reason is that there are two local sort phases in sample sort, which almost doubles the sorting work relative to radix sort.

4.3.2 Effect of Radix Size

Figure 10 shows the relative execution times for different radix sizes in the CC-SAS programming model, with keys initialized using the Gauss method. Unlike in radix sort, small radix values do not work well here; 11 is the best choice. It performs best up to 64M keys and is close to best beyond that point, where radix 12 is fastest for 256M keys. For each data set, the ratio of the worst time to the best is within a factor of 2, which is smaller than in radix sort. Compared with radix sort, sample sort spends less time in communication and more time in local computation, so reducing the number of (local) sorting passes by using a larger radix is relatively more important.
Figure 9: The relative execution time for sample sort in the CC-SAS programming model on 64 processors for different key distributions.
Figure 10: The relative performance for sample sort in the CC-SAS programming model on 64 processors, using radix sizes 6 to 12.
4.4 Putting it All Together

Finally, let us compare the two algorithms, as well as the programming models, across data set size and processor count regimes. Table 2 shows the best times for radix sort and sample sort under the different programming models and radix sizes for the Gauss distribution (the other distributions behave quite similarly in relative terms). Table 3 shows the model and radix size with which the best performance is obtained for each combination of problem size and number of processors. Comparing the sorting algorithms under their best-performance conditions of Table 3, we see that sample sort performs better than radix sort up to 64K integers per processor (due to better communication) and worse after that point (because the extra local sort comes to dominate the communication savings). Interestingly, sample sort's performance is more uniform across programming models than radix sort's for large data sets, due to its better balanced and less important communication. The best combination of programming model and algorithm is radix sorting under the SHMEM model for larger data sets and sample sorting under the CC-SAS programming model for smaller data sets.

The chosen radix size affects the performance of both radix sorting and sample sorting. For radix sorting, the best radix is 8 for our smaller data sets; for sample sorting it is 11. Larger data sets require somewhat larger radix sizes to achieve the best
performance in both cases. Compared with the radix size, the effect of the key distribution characteristics is smaller. Before the data set size per processor reaches the cache size limit, the different (realistic) data distributions perform similarly. Beyond that point, however, data sets with better spatial data locality in the local sorting phase perform much better, since many data accesses are no longer absorbed by the cache, and such data sets also incur fewer TLB misses.
                       radix                                    sample
            16P        32P        64P          16P         32P         64P
    1M      63249      55068      33546        74301       42998       29470
    4M      229182     133296     134407       343466      148800      98720
    16M     1008322    483560     306429       1490045     634267      380864
    64M     6547243    2557912    1147412      13699476    3902624     1503827
    256M    29650916   15054134   7191246      54852935    23838522    11891683

Table 2: The best execution time (in microseconds) with Gauss-distributed keys for radix sort and sample sort under the three programming models and different radix sizes.
                       radix                                    sample
            16P        32P        64P          16P         32P         64P
    1M      CC-SAS 8   CC-SAS 9   CC-SAS 8     CC-SAS 11   CC-SAS 11   CC-SAS 11
    4M      SHMEM 8    SHMEM 8    SHMEM 8      CC-SAS 11   CC-SAS 11   CC-SAS 11
    16M     SHMEM 11   SHMEM 11   SHMEM 8      CC-SAS 11   CC-SAS 12   SHMEM 11
    64M     SHMEM 12   SHMEM 11   SHMEM 8      CC-SAS 12   CC-SAS 12   SHMEM 11
    256M    SHMEM 14   SHMEM 13   SHMEM 12     CC-SAS 14   CC-SAS 13   SHMEM 12

Table 3: The combination of programming models and radix sizes to achieve the best performance for each data set on 16, 32 and 64 processors.
5 Conclusions

For radix sort, we found that the original, naturally structured CC-SAS program performs poorly, since it suffers from the interaction of its bursty communication, based on temporally scattered remote writes and remote writebacks, with the underlying cache coherence protocol. In the MPI and SHMEM versions, larger explicit messages are used due to local buffering, and there are not so many protocol transactions to contend with data movement. Thus, these models perform better for radix sort once the data set becomes large enough that the message overhead and the fixed costs of histogram accumulation are amortized. However, there is still a substantial performance gap between MPI and SHMEM. This is largely due to the incoming-message management mechanism used in MPI, its two-sided communication and overhead, and its collective communication functions. If the CC-SAS program is also modified to use local buffering before communication, as in the MPI and SHMEM programs, the scatteredness of writes to remotely allocated data is reduced. This greatly reduces the cache coherence protocol overhead, and the performance of the CC-SAS program improves greatly; however, it still lags behind SHMEM and MPI for larger data sets.

Unlike for radix sort, SHMEM and CC-SAS perform similarly for sample sort once the data set per processor becomes large enough, and CC-SAS remains competitive. This is because the communication in sample sort is in contiguous blocks rather than
scattered. MPI follows a little behind. For smaller data sets, CC-SAS again performs best, though here up to somewhat larger data sets than for radix sort. Comparing the best implementations of the two sorting algorithms, sample sort is generally better than radix sort up to 64K integers per processor (where its better communication behavior is very important) and worse after that point (where the extra cost of the second local sort outweighs the communication benefits). Overall, the best combination of algorithm and programming model depends on the data set size and the number of processors, but apparently not much on the key distribution (for realistic distributions) on our hardware-supported cache-coherent Origin 2000 machine. The best combination is sample sort under the CC-SAS programming model for smaller data sets and radix sort under the SHMEM programming model for larger data sets. The latter (like some other combinations) delivers a very high (superlinear) speedup for data sets larger than 16M keys on our machine, due to cache and TLB capacity effects. The gap between these two combinations is larger for large data sets per processor than for smaller ones. If one combination had to be chosen, it might be radix sort under the SHMEM programming model, since sample sort does not scale well to large data sets per processor. However, the SHMEM model is currently not widely available on parallel machines other than those of SGI/Cray. Between the other two programming models, the CC-SAS model generally delivers much better performance than MPI, and dramatically better performance for smaller data sets (using the vendor-optimized MPI; we ignore our modified MPI here because it, too, is not widely available, and because it violates the pure message passing model). Under this model, sample sort is better than radix sort when there are fewer than 1M keys per processor, and radix sort is better otherwise; if one had to be chosen, it would likely be radix sort, since the gap is greater for the larger problem sizes, where radix sort dominates.

Finally, in terms of programming overhead, the CC-SAS model usually provides the greatest programming ease, due to its global naming and its implicit communication and replication. We can take advantage of this simplicity to program more complex algorithms easily for high performance, as we have done in sample sort (and in radix sort with the prefix tree). In the MPI model, for irregular all-to-all communication as in radix sort, computing the parameters for both the send and the receive functions can be more difficult. Compared with MPI, SHMEM provides some programming simplicity in such cases, due to its one-sided communication and the fact that only one side needs to determine where data should come from or go to at the remote end, but it is still more difficult to program than CC-SAS. Overall, parallel sorting has a regular enough structure that the programming difficulties are not as great as in many other applications. Future work includes improving the collective functions in our MPI implementation and developing a (profile-based) model to predict performance for each programming model.
References

[1] A.C. Dusseau, D.E. Culler, et al. Fast parallel sorting under LogP: experience with the CM-5. IEEE Transactions on Parallel and Distributed Systems, pages 791-805, August 1996.

[2] A.C. Dusseau, R.H. Dusseau, et al. High-performance sorting on networks of workstations. In SIGMOD '97, AZ, USA, 1997.

[3] A. Sohn and Y. Kodama. Load balanced parallel radix sort. In International Conference on Supercomputing, 1998.

[4] Guy E. Blelloch et al. A comparison of sorting algorithms for the Connection Machine CM-2. In Symposium on Parallel Algorithms and Architectures, pages 3-16, July 1991.
[5] NASA Ames Research Center. The NAS parallel benchmarks 2.0. http://science.nas.nasa.gov/Software/NPB, November 1995.
[6] David Cortesi. Origin2000 and Onyx2 performance tuning and optimization guide. http://techpubs.sgi.com, 1997.

[7] D.R. Helman, D.A. Bader, and J. JáJá. Parallel algorithms for personalized communication and sorting with an experimental study. In SPAA '96, Padua, Italy, 1996.

[8] Dongming Jiang and Jaswinder Pal Singh. Does application performance scale on modern cache-coherent multiprocessors: a case study of a 128-processor SGI Origin2000. In Proceedings of the 26th International Symposium on Computer Architecture, May 1999.

[9] L. Rivera and A.A. Chien. A high speed disk-to-disk sort on a Windows NT cluster running HPVM. To be published, 1999.

[10] H. Shan and J.P. Singh. A comparison of MPI, SHMEM and cache-coherent shared address space programming models on the SGI Origin 2000. In International Conference on Supercomputing, Greece, June 1999.

[11] Hongzhang Shan and Jaswinder Pal Singh. Comparison of message passing, SHMEM and cache-coherent shared address space programming models on the SGI Origin 2000. In International Conference on Supercomputing, June 1999.

[12] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. Methodological considerations and characterization of the SPLASH-2 parallel application suite. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995.

[13] X. Li, P. Lu, et al. On the versatility of parallel sorting by regular sampling. Parallel Computing, 1993.

[14] M. Zagha and G.E. Blelloch. Radix sort for vector multiprocessors. In Proceedings of the 1991 Conference on Supercomputing, 1991.