A Comparison of MPICH Allgather Algorithms on Switched Networks*

Gregory D. Benson, Cho-Wai Chu, Qing Huang, and Sadik G. Caglar
Keck Cluster Research Group
Department of Computer Science
University of San Francisco
2130 Fulton Street, San Francisco, CA 94117-1080
{benson,cchu,qhuang,gcaglar}@cs.usfca.edu
Abstract. This study evaluates the performance of MPI_Allgather() in MPICH 1.2.5 on a Linux cluster. This version of MPICH improves on the allgather performance of previous versions by using a recursive doubling algorithm. We have developed a dissemination allgather based on the dissemination barrier algorithm. This algorithm takes ⌈log2 p⌉ stages for any value of p. We experimentally evaluate the MPICH allgather and our implementations on a Linux cluster of dual-processor nodes using both TCP over FastEthernet and GM over Myrinet. We show that on Myrinet, variations of the dissemination algorithm perform best for both large and small messages. However, when using TCP, the dissemination allgather algorithm performs poorly because data is not exchanged in a pairwise fashion. Therefore, we recommend the dissemination allgather only for low-latency switched networks such as Myrinet.
1 Introduction
The MPI_Allgather() function is a useful, high-level collective communication function. Each process that participates in an allgather contributes a portion of data. At the end of the allgather, each participating process ends up with the contributions from all the other processes. Allgather is often used in simulation and modeling applications where each process is responsible for the calculation of a subregion that depends on the results of all the other subregions. A poor implementation of allgather can have a significant, negative impact on the performance of such applications. The MPICH [1] and LAM [3] implementations of MPI are widely used on clusters with switched networks, such as FastEthernet or Myrinet. However, the collective communication functions in MPICH and LAM have long suffered from poor performance due to unoptimized implementations. As a result, programmers often resort to writing their own collective functions on top of the point-to-point functions.
* This work was supported by the W. M. Keck Foundation.
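As a point of reference for the interface discussed above, a minimal use of MPI_Allgather() in C looks like the following. This is an illustrative sketch only; the buffer names, the MPI_DOUBLE element type, and the single-element contribution are choices made for the example, not part of any particular application.

/* Minimal usage sketch of MPI_Allgather(): each of the p processes
 * contributes `local` and ends up with all p contributions in `all`. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double local = (double)rank;               /* this process's contribution */
    double *all  = malloc(p * sizeof(double)); /* gathered contributions      */

    MPI_Allgather(&local, 1, MPI_DOUBLE,
                  all,    1, MPI_DOUBLE, MPI_COMM_WORLD);

    /* all[i] now holds the contribution of process i, on every process. */
    free(all);
    MPI_Finalize();
    return 0;
}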
For our own local applications we have experimented with improved implementations of allgather. We have developed several implementations of dissemination allgather based on the dissemination barrier algorithm [2]. This approach is attractive because, for p processes, it requires at most ⌈log2 p⌉ stages for any value of p. However, some processes may have to send non-contiguous data in some stages. For non-contiguous data, either the data must be copied into a contiguous buffer or two sends must be issued, one for each region. We have experimented with both approaches.

Recently, many of the collective routines in MPICH have been greatly improved. In particular, allgather is now implemented with a recursive doubling algorithm. This algorithm avoids the need to send non-contiguous data, but can require up to 2⌊log2 p⌋ stages.

In this paper we report on an experimental evaluation of our dissemination allgather algorithms and the new recursive doubling algorithm in MPICH 1.2.5 on a Linux cluster connected by switched FastEthernet and Myrinet. We show that on Myrinet, variations of the dissemination algorithm generally perform best for both large and small messages. For TCP over FastEthernet, the recursive doubling algorithm performs best for small messages, and for large messages all of the algorithms perform similarly, with MPICH doing slightly better. Our results suggest that for a low-latency network, like Myrinet, a dissemination allgather that uses copying for short messages and two sends for large messages should be used. For TCP, an algorithm that incorporates pairwise exchange, such as recursive doubling or the butterfly algorithm, is best in order to minimize TCP traffic.

The rest of this paper is organized as follows. Section 2 describes the allgather algorithms used in this study. Section 3 presents our experimental data and analyzes the results. Finally, Section 4 makes some concluding remarks.
2 Allgather Algorithms
Early implementations of allgather used simple approaches. One such approach is to gather all the contributed data regions onto process 0 and then have process 0 broadcast the collected data to all the other participating processes. This approach is used in LAM version 6.5.9. Its performance depends, in part, on the implementation of the gather and broadcast functions; however, in LAM, gather and broadcast also have simple and inefficient implementations. Another approach is the ring algorithm used in previous versions of MPICH. The ring algorithm requires p − 1 stages. This section describes two more sophisticated algorithms: recursive doubling and dissemination allgather.
2.1 Recursive Doubling
In the latest version of MPICH, 1.2.5, a new algorithm based on recursive doubling is used to implement allgather. First consider the power-of-2 case. In the first stage, each process exchanges its contribution with a process that is distance 1 away: process 0 exchanges data with process 1, process 2 exchanges with process 3, and so on. In the next stage, groups of 4 are formed, and each process exchanges all of the data it has accumulated with the process that is distance 2 away within its group. For example, 0 exchanges with 2 and 1 exchanges with 3. In the next stage, the group size is doubled again to 8 and each process exchanges data with the process that is distance 4 away. After log2 p stages, all processes have all the contributions.
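To make the exchange pattern concrete, the power-of-2 case can be sketched in C as follows. This is only an illustration of the pattern, not the MPICH source; the function names, the MPI_DOUBLE element type, and the use of MPI_Sendrecv are choices made for the example.

/* Sketch of recursive-doubling allgather for a power-of-2 number of
 * processes.  Each process starts with its own chunk of `count` elements
 * at recvbuf[rank*count]; after log2(p) stages every process holds all
 * p chunks. */
#include <mpi.h>
#include <stdlib.h>

static void allgather_recursive_doubling(double *recvbuf, int count,
                                         MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);               /* assumed to be a power of 2 */

    for (int dist = 1; dist < p; dist *= 2) {
        int partner = rank ^ dist;         /* pairwise exchange partner  */
        /* Block of chunks currently held starts at the beginning of this
         * process's group of size `dist`. */
        int my_start      = (rank / dist) * dist;
        int partner_start = (partner / dist) * dist;

        MPI_Sendrecv(recvbuf + my_start * count,      dist * count, MPI_DOUBLE,
                     partner, 0,
                     recvbuf + partner_start * count, dist * count, MPI_DOUBLE,
                     partner, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p, count = 4;                /* 4 elements per chunk (example) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double *buf = malloc((size_t)p * count * sizeof(double));
    for (int j = 0; j < count; j++)
        buf[rank * count + j] = rank;      /* this process's contribution */

    allgather_recursive_doubling(buf, count, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}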
[Figure: the chunk contents held by each of 5 processes after each stage, for recursive doubling (left column) and dissemination allgather (right column)]
Fig. 1. An example of recursive doubling and dissemination allgather for 5 processes
For the non-power-of-2 case, correction steps are introduced to ensure that all processes receive the data they would have received in the power-of-2 case. The left column of Figure 1 shows how recursive doubling works for 5 processes. The first two stages are just like the first two stages of the power-of-2 case. The third stage marks the beginning of the correction steps; it can be thought of as a backward butterfly. In the third stage, process 4 exchanges with process 0. In the fourth and fifth stages, the contribution of process 4 is propagated to the first four processes. To accommodate the non-power-of-2 cases, recursive doubling with correction can take up to 2⌊log2 p⌋ stages. As Figure 1 shows, data is always exchanged in contiguous chunks, so the exchange steps are simple. Furthermore, processes always exchange data in a pairwise manner. As we will see in the experimental results, this property benefits a TCP implementation of MPI.
2.2 Dissemination Allgather
We have developed an allgather algorithm based on the dissemination barrier algorithm [2]. Every process is involved in each stage. In the first stage, each
process i sends to process (i + 1) mod p. Note that this is not an exchange; it is just a send. In the second stage each process i sends to process (i + 2) mod p, and in the third stage process i sends to process (i + 4) mod p. This pattern continues until ⌈log2 p⌉ stages have completed. See the right column of Figure 1 for an example of dissemination allgather for 5 processes. Note that the number of stages is bounded by ⌈log2 p⌉ for all values of p, which is better than the bound for recursive doubling.

An important aspect of the dissemination algorithm is determining which data needs to be sent in each stage. The data to be sent can be determined from the process rank (i) and the stage number (s), starting at 0. First we consider all stages other than the final stage. Each process sends 2^s chunks. The starting chunk that process i sends is ((i + 1) − 2^s + p) mod p, and the starting chunk that process i receives is ((i + 1) + (2^s × (p − 2))) mod p. In the final stage, each process sends p − 2^s chunks. The starting chunk that process i sends is ((i + 1) + 2^s) mod p, and the starting chunk that process i receives is (i + 1) mod p. A sketch of this index computation is given at the end of this subsection.

It is possible that a sequence of chunks is non-contiguous because it wraps around to the beginning of the chunk array; this can also be seen in Figure 1. Handling the wrap-around requires either copying or at most two sends, one for each contiguous region. If copying is used, the non-contiguous chunks must be copied into a contiguous buffer before a transfer, and after the buffer is received the chunks must be copied into their proper locations. We have implemented three versions of the dissemination allgather algorithm; each deals with non-contiguous data in a different way:

– One send: copy non-contiguous regions into a buffer and issue one send.
– Two sends: send each non-contiguous region with its own send.
– Indexed type: use MPI_Type_indexed to send the non-contiguous regions.

All of our implementations are built using MPI point-to-point functions. We use non-blocking sends and receives. During experimentation we found that we could achieve better performance if we varied the order in which processes send first and receive first. For example, in the first stage, every other process, starting at process 0, sends first, and every other process, starting at process 1, receives first. In the second stage the first two processes send first, the next two receive first, and so on. We tried having all processes issue an MPI_Irecv() first, but this did not achieve the same performance as the alternating send/receive approach. Finally, we also tried using MPI_Sendrecv(), but this resulted in slightly lower performance than using individual non-blocking sends and receives.
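As an illustration of the index arithmetic, the following C fragment computes the per-stage schedule directly from the formulas above. It is a sketch for exposition only, not the implementation measured in Section 3; the function name and the printed output are invented for the example.

/* Per-stage bookkeeping for dissemination allgather: for process rank i out
 * of p processes, print the partner ranks, the starting chunk indices, and
 * the number of chunks transferred in every stage.  A chunk sequence that
 * runs past chunk p-1 wraps to chunk 0; this wrap-around is where the one
 * send, two sends, and indexed type variants differ. */
#include <stdio.h>

static void dissemination_schedule(int i, int p)
{
    int stages = 0;                        /* stages = ceil(log2 p) */
    for (int d = 1; d < p; d *= 2)
        stages++;

    for (int s = 0, d = 1; s < stages; s++, d *= 2) {
        int dest = (i + d) % p;            /* send to (i + 2^s) mod p      */
        int src  = (i - d + p) % p;        /* receive from (i - 2^s) mod p */
        int last = (s == stages - 1);

        int nchunks    = last ? p - d : d;                /* chunks sent    */
        int send_start = last ? (i + 1 + d) % p           /* final stage    */
                              : ((i + 1) - d + p) % p;    /* earlier stages */
        int recv_start = last ? (i + 1) % p
                              : ((i + 1) + d * (p - 2)) % p;

        printf("stage %d: send %d chunk(s) starting at chunk %d to process %d, "
               "receive chunk(s) starting at chunk %d from process %d\n",
               s, nchunks, send_start, dest, recv_start, src);
    }
}

int main(void)
{
    int p = 5;                             /* the example from Figure 1 */
    for (int i = 0; i < p; i++)
        dissemination_schedule(i, p);
    return 0;
}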
3 Experimental Results
We ran several experiments to compare the performance of the new MPICH recursive doubling algorithm to our implementations of dissemination allgather. Our test environment is a cluster of dual Pentium III 1GHz nodes connected
by Myrinet and FastEthernet. Our test program simply measures the time to complete an allgather operation for a given number of processes and a given data size. We measure 500 to 2000 iterations and divide the total time by the iteration count to determine the cost of a single allgather operation. We take the mean of three runs for each data point. The variance in the data was quite small; the normalized standard deviation for each data point was never larger than 0.02. Our benchmarking methodology is simple and suffers from some of the problems noted in [5]. However, the experiments are reproducible and all the algorithms are compared using the same technique. More work is needed to better measure the cost of a single allgather invocation and to predict allgather performance in real applications.

We present the results in terms of the data size, which is the amount of data contributed by each process in a run. Thus, if the data size is 64K and we use 4 processes, the total data accumulated at each process is 256K. In the graphs, MPICH denotes the recursive doubling algorithm.
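The timing loop can be sketched as follows. This is illustrative only; the 8-byte data size and the iteration count of 1000 are example values, and the actual test program varied both across experiments.

/* Sketch of the timing methodology described above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int data_size = 8;           /* bytes contributed per process  */
    const int iters     = 1000;        /* 500 to 2000 in the experiments */

    char *sendbuf = malloc(data_size);
    char *recvbuf = malloc((size_t)data_size * p);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allgather(sendbuf, data_size, MPI_BYTE,
                      recvbuf, data_size, MPI_BYTE, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("mean allgather time: %g microseconds\n",
               1e6 * elapsed / iters);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}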
3.1 MPICH-GM on Myrinet
Figures 2, 3, 4, and 5 show the performance of the different allgather algorithms using MPICH-GM over Myrinet. The first two figures, 2 and 3, give results in which we assign a single process to each node. The second two figures, 4 and 5, give results in which we assign two processes per node.
[Plot: allgather time in microseconds vs. number of processors for MPICH, Two sends, Indexed send, and One send; data size = number of processors * 8 bytes]
Fig. 2. Allgather performance for small messages on MPICH-GM-Myrinet (1 process per node)
[Plot: allgather time in microseconds vs. number of processors for MPICH, Two sends, Indexed send, and One send; data size = number of processors * 65536 bytes]
Fig. 3. Allgather performance for large messages on MPICH-GM-Myrinet (1 process per node)
[Plot: allgather time in microseconds vs. number of processors for MPICH, Two sends, Indexed send, and One send; data size = number of processors * 8 bytes]
Fig. 4. Allgather performance for small messages on MPICH-GM-Myrinet (2 processes per node)
[Plot: allgather time in microseconds vs. number of processors for MPICH, Two sends, Indexed send, and One send; data size = number of processors * 65536 bytes]
Fig. 5. Allgather performance for large messages on MPICH-GM-Myrinet (2 processes per node)
The small message (8 bytes) results in Figure 2 show that the one send and indexed send dissemination allgathers perform the best for almost all process counts. For power-of-2 process counts, the recursive doubling algorithm does just as well because only log2 p stages are used; in this case, recursive doubling reduces to the butterfly algorithm. The graph also reveals the varying number of stages required by recursive doubling in the non-power-of-2 cases. Note that the two sends approach generally performs the worst due to the small message size: the two sends incur two start-up latencies. The one send approach performs slightly better than the indexed send approach, due to the overhead of setting up the indexed type.

Figure 3 shows the results for large messages (64K bytes). In this case, the two sends dissemination approach consistently performs better, up to 40% faster, than the recursive doubling approach for the non-power-of-2 cases. The one send approach performs the worst because it incurs a large amount of copying. The indexed send appears to avoid some of this copying. In additional experiments not shown here, we found the break-even point for one send versus two sends to be 1024 bytes.

The 2 processes per node results in Figures 4 and 5 are similar to the 1 process per node results. A notable exception is for short messages in Figure 4, in which recursive doubling performs much better than dissemination allgather in the power-of-2 cases. This is because recursive doubling uses pairwise exchange and, in the first stage, each pair of processes resides on the same node. This suggests that for 2-process-per-node assignments a hybrid of dissemination allgather and recursive doubling is desirable.
3.2 TCP on FastEthernet
Figures 6 and 7 show the performance of the different allgather algorithms using TCP over FastEthernet. Due to limited space we only show the results for 1 process per node. For small messages, in Figure 6, the results are highly erratic. We found large variances in our collected data for TCP and small messages, which is likely due to buffering in TCP. However, the recursive doubling approach is clearly the best in all cases. As expected, two sends performs the worst. However, one send and indexed send do much more poorly than expected, especially considering the good small message results for Myrinet.

The difference between recursive doubling and dissemination allgather comes from the use of pairwise exchange. Because recursive doubling uses pairwise exchange, TCP can optimize the exchange by piggybacking data on ACK packets. The dissemination allgather never does a pairwise exchange; in each stage a process sends to a different process than the one from which it receives. To test this hypothesis we ran tcpdump [4] to observe the TCP traffic during an allgather operation, monitoring the rank 0 process. With a data size of 8 bytes and 7 processes, we found that recursive doubling transmitted 3654 packets, of which only 114 (3.1%) were pure ACKs, that is, ACKs without piggybacked data. For the same parameters, dissemination allgather transmitted 6746 packets, of which 2161 (32%) were pure ACKs. Dissemination allgather generates much more TCP traffic.

Figure 7 shows the results for TCP and large messages. As the data size increases, the advantage of pairwise exchange begins to diminish. However, in the power-of-2 cases, the recursive doubling algorithm still performs better than dissemination allgather.
4 Conclusions
We have presented the dissemination allgather algorithm and compared its performance to the recursive doubling approach used in MPICH 1.2.5. We showed that for GM over Myrinet, dissemination allgather consistently performs better than recursive doubling except in the power-of-2 cases with a 1-process-per-node assignment. For a 2-process-per-node assignment, a hybrid of dissemination allgather and recursive doubling should be used. A general purpose implementation will have to use both one send and two sends in order to achieve good performance for both small and large messages. For TCP over FastEthernet the best choice is recursive doubling due to its use of pairwise exchange. Our results suggest that a version of the dissemination algorithm should be used to implement allgather for MPICH-GM.

For future work we plan to improve our benchmarking methodology. We also plan to apply our results by developing an allgather implementation that can choose the most efficient algorithm based on input parameters and the underlying network. Finally, we plan to analyze the other collective operations to see where the dissemination approach can be applied.
[Plot: allgather time in microseconds vs. number of processors for MPICH, Two sends, Indexed send, and One send; data size = number of processors * 8 bytes]
Fig. 6. Allgather performance for small messages on MPICH-TCP-Ethernet (1 process per node)
[Plot: allgather time in microseconds vs. number of processors for MPICH, Two sends, Indexed send, and One send; data size = number of processors * 65536 bytes]
Fig. 7. Allgather performance for large messages on MPICH-TCP-Ethernet (1 process per node)
Acknowledgements

We thank Peter Pacheco and Yuliya Zabiyaka for insightful discussions on different allgather algorithms. Peter Pacheco also provided feedback on an earlier version of this paper. Amol Dharmadhikari helped us verify the correctness of our dissemination algorithm. Alex Fedosov provided system support for the Keck Cluster and responded to last minute requests during data collection. Finally, we thank the anonymous reviewers for their useful feedback and pointers to related work.
References

1. Argonne National Laboratory. MPICH - A Portable Implementation of MPI, 2003. http://www-unix.mcs.anl.gov/mpi/mpich/.
2. D. Hensgen, R. Finkel, and U. Manber. Two algorithms for barrier synchronization. International Journal of Parallel Programming, 17(1):1–17, February 1988.
3. Indiana University. LAM / MPI Parallel Computing, 2003. http://www.lam-mpi.org/.
4. Lawrence Berkeley National Laboratory. tcpdump, 2003. http://www.tcpdump.org/.
5. Thomas Worsch, Ralf H. Reussner, and Werner Augustin. On benchmarking collective MPI operations. In J. Volkert, D. Kranzlmüller, and J. J. Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface: 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September 29 – October 02, 2002, Lecture Notes in Computer Science, 2002.