Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers
Jun Doi and Yasushi Negishi
IBM Research – Tokyo, 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken, Japan
[email protected], [email protected]
Abstract—Torus networks are commonly used for massively parallel computers, and their performance often constrains total application performance. Especially on an asymmetric torus network, the traffic along the longest axis is the performance bottleneck for all-to-all communication, so it is important to schedule the longest-axis traffic smoothly. In this paper, we propose a new algorithm, based on an indirect method, that pipelines the all-to-all procedures using shared memory parallel threads and (1) isolates the longest-axis traffic from other traffic, (2) schedules it smoothly, and (3) overlaps all of the other traffic and overhead of the all-to-all communication behind the longest-axis traffic. The proposed method achieves up to 95% of the theoretical peak. We also integrated the overlapped all-to-all method with parallel FFT algorithms, so that the local FFT calculations are overlapped behind the longest-axis traffic as well. The FFT performance achieves up to 90% of the theoretical peak for the parallel 1D FFT.
Index Terms—all-to-all communication, collective communication, Blue Gene, FFT, multi-core, multi-thread, parallel algorithm, SMP, torus network
I. INTRODUCTION
All-to-all communication is one of the most performance-critical communication patterns for scientific applications on massively parallel computers. It is often used for matrix or array transposition and for parallel Fast Fourier Transform (FFT) algorithms. Since the performance of the parallel FFT largely depends on the performance of the all-to-all communication, its performance is a key parameter for applications such as molecular dynamics [7] [8], quantum molecular dynamics [9], plasma physics, and digital signal processing. In this paper, we focus on all-to-all communication on massively parallel computers with torus interconnects. Today, the torus interconnect is becoming a common topology for massively parallel computers [1] [2] [3] [4] [5] [6] since it offers high scalability, high cost effectiveness, and high flexibility. As torus networks grow toward petaflops and exaflops scales, the performance of the all-to-all communication is becoming critical to a wider range of applications because the final performance depends on the size and shape of the torus network. The bandwidth of the all-to-all communication is theoretically in inverse
proportion to the longest dimension of the torus network, and the network contention caused by the asymmetry of a torus network decreases the effective bandwidth. In most cases the torus network is asymmetric. For example, a standard rack of Blue Gene®/P [3] is a 3D torus of dimensions 8x8x16. There are only two symmetric configurations, 4 racks (16x16x16) and 32 racks (32x32x32), for machines below the petaflops scale (72 racks). In this paper, we propose an optimal all-to-all communication algorithm and an FFT algorithm for asymmetric torus configurations. There are many approaches to all-to-all communication optimized for torus networks. They are classified into two types, direct and indirect methods. In direct methods, the all-to-all communication is done via point-to-point communications between all of the pairs of nodes. In contrast, in indirect methods the messages are combined into packets and routed or sent through intermediate nodes. There are several indirect methods suitable for torus networks [10] [11] [12] [13], and most of them focus on scheduling and routing the packets in torus networks to minimize network contention. For example, [13] presented a two-phase all-to-all schedule to minimize network contention on asymmetric torus networks. In indirect methods, it is important to minimize additional software overhead such as message combining and routing in the intermediate nodes. In this paper, we propose an overlapping method for all-to-all communication that reduces the network contention in asymmetric torus networks. We pipeline an indirect all-to-all method using shared memory parallel (SMP) threads on multi-core processors, which (1) isolates the longest-axis traffic from other traffic, (2) schedules that traffic smoothly, and (3) overlaps all other traffic and overhead of the all-to-all communication behind the longest-axis traffic. By using threads, the network latencies, startup overheads, and internal array transposes that arrange the message packets on the different threads are overlapped to maximize the performance. We also propose a method to integrate our optimized all-to-all communication with parallel FFT algorithms. Although the parallel FFT algorithm is usually implemented using synchronous all-to-all communication, we integrated the
parallel FFT algorithm into the pipelined all-to-all communication using SMP threads, so that the local FFT calculations are also hidden behind the longest-axis traffic of the all-to-all communication. We implemented parallel 1D, 2D and 3D FFTs. The remainder of this paper is organized as follows: Section II describes the performance characteristics of all-to-all communication on a torus network. Section III presents our pipelined all-to-all method based on the indirect method. Section IV explains how to integrate our all-to-all method with the parallel FFT algorithm. We present conclusions and future work in Section V.
II. ALL-TO-ALL COMMUNICATION ON THE TORUS NETWORK
A. Torus network
In a torus network, the nodes are arranged in an N-dimensional grid and connected to their neighboring nodes, so each node in the torus network has 2N nearest neighbors. Fig. 1 shows an example of a 4x4 2D torus network.
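To make the 2N-neighbor structure concrete, the following minimal sketch (not taken from the paper; the 4x4x4 shape is an arbitrary example requiring 64 MPI ranks) builds a periodic 3D Cartesian communicator and queries each rank's six neighbors with standard MPI calls.

/* Minimal sketch: a 3D torus communicator and each rank's 2N neighbors (N = 3). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[3]    = {4, 4, 4};   /* example 4x4x4 torus; run with 64 ranks */
    int periods[3] = {1, 1, 1};   /* wraparound links in every dimension */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &torus);

    int rank;
    MPI_Comm_rank(torus, &rank);

    for (int d = 0; d < 3; d++) {
        int minus, plus;
        MPI_Cart_shift(torus, d, 1, &minus, &plus);   /* neighbors at -1 and +1 */
        printf("rank %d: dimension %d neighbors = %d, %d\n", rank, d, minus, plus);
    }

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}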
In Table I, the measured bandwidths for the symmetric 3D torus networks of 512 and 4,096 nodes are much better than those for the asymmetric 3D torus networks of 1,024 and 2,048 nodes. Although MPI_Alltoall is well optimized [15], the measured bandwidth is only 72% of the theoretical peak on 1,024 nodes because of the network contention and the traffic imbalance bottleneck along the longest axis. Table I shows the best measured bandwidth for MPI_Alltoall; as the message size increases, the effective bandwidth is reduced, as shown in Fig. 4 in Section III.C. We propose a new all-to-all method to address these problems in the next section.

TABLE I
BANDWIDTH OF MPI_ALLTOALL ON THE 3D TORUS OF BLUE GENE/P

Number of Nodes | Torus Shape | Theoretical Peak Bandwidth [MB/s] | Measured Bandwidth [MB/s] | Efficiency [%]
512             | 8x8x8       | 374                               | 340.6                     | 91.1
1,024           | 8x8x16      | 187                               | 134.2                     | 71.8
2,048           | 8x16x16     | 187                               | 146.2                     | 78.2
4,096           | 16x16x16    | 187                               | 158.9                     | 84.9
Fig. 1. Example of a 4x4 2D "symmetric" torus network. In this case each node has four nearest neighbors.

The packets for point-to-point communications are routed to their final destinations by routers in each node. Network contention occurs frequently when the routes of packets cross in dense communication patterns such as all-to-all, broadcast, scatter, or gather.
B. Performance of all-to-all communications
In all-to-all communication, each node exchanges messages with all of the other nodes. The all-to-all bandwidth peaks when all of the routes are filled with packets, and the theoretical peak bandwidth is determined by the length of the longest axis of the torus network. The time for all-to-all communication T is given by Equation 1 [13]:

T = np · (L / 8) · m · β    (1)

where np is the number of processors, L is the length of the longest axis, m is the total size of all of the all-to-all messages, and β is the per-byte network transfer time. The theoretical peak bandwidth of the all-to-all communication is given by Equation 2:

A = e · (1 / β) · (8 / L)    (2)

where e is the efficiency and 1/β is the network bandwidth. Table I shows the theoretical peak bandwidth and the measured all-to-all bandwidth of an MPI library (MPI_Alltoall) [14] for the 3D torus network of Blue Gene/P.
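As a concrete check of Equation 2 against Table I (an illustrative back-calculation, not a hardware specification): the 8x8x8 entry implies an effective link term e · (1/β) ≈ 374 MB/s, and halving the 8/L factor for a longest axis of L = 16 reproduces the 187 MB/s peak listed for the larger configurations:

    L = 8:   A = e · (1/β) · (8/8)  ≈ 374 MB/s
    L = 16:  A = e · (1/β) · (8/16) ≈ 187 MB/s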
III. OVERLAPPING METHOD FOR ALL-TO-ALL COMMUNICATION
A. Two-phase scheduling of all-to-all communication
To reduce the network contention of all-to-all communication on asymmetric torus networks, two-phase scheduling [13] offers advantages. The all-to-all communication is separated into two phases, with the group of nodes participating in each communication called a communicator. The entire torus network is separated into two sub-communicators so that each sub-communicator forms either a 1D axis or a symmetric multidimensional torus, and the bandwidths of the two separate all-to-all communications can approach their theoretical peaks. For example, if the configuration of the 3D torus network is X=4, Y=4 and Z=8, then the sub-communicator for phase 1 (A2A-1) contains the Z axis, while the one for phase 2 (A2A-2) contains the X and Y axes, forming a symmetric plane as shown in Fig. 2.

Fig. 2. Separating the all-to-all communication into two phases: A2A-1 along the longest axis, and A2A-2 in the plane formed by the other axes.
The all-to-all communication is processed in two phases (see the sketch below):
Phase-1: For each final destination, a sender node chooses a node belonging to both A2A-1 (including the sender) and A2A-2 (including the final destination) as an intermediate node, and sends the packets for the final destination to the chosen intermediate node.
Phase-2: Each intermediate node sends its packets to their final destinations within A2A-2.
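As an illustration of this two-phase structure (a sketch under assumed details, not the authors' implementation): the code below splits an assumed 4x4x8 Cartesian torus communicator into an A2A-1 sub-communicator along the longest (Z) axis and an A2A-2 sub-communicator over the X-Y plane, then runs the two phases with MPI_Alltoall; the internal rearrangement of packets between the phases is only indicated by a comment.

/* Sketch: two-phase all-to-all on a 3D torus whose last (Z) dimension is the
 * longest axis. Phase 1 (A2A-1) runs along Z; phase 2 (A2A-2) runs in the
 * X-Y plane. count_per_dest is the number of doubles for each of the
 * np = np_z * np_xy final destinations; packing details are omitted. */
#include <mpi.h>

void two_phase_alltoall(const double *sendbuf, double *tmpbuf, double *recvbuf,
                        int count_per_dest, MPI_Comm torus /* 3D Cartesian */)
{
    MPI_Comm a2a1, a2a2;
    int keep_z[3]  = {0, 0, 1};   /* keep Z only: one sub-communicator per Z line */
    int keep_xy[3] = {1, 1, 0};   /* keep X and Y: one sub-communicator per plane */
    MPI_Cart_sub(torus, keep_z,  &a2a1);
    MPI_Cart_sub(torus, keep_xy, &a2a2);

    int np_z, np_xy;
    MPI_Comm_size(a2a1, &np_z);
    MPI_Comm_size(a2a2, &np_xy);

    /* Phase 1: send toward intermediate nodes along the longest axis. Each
     * packet bundles the data for the np_xy final destinations that share
     * the intermediate node's Z coordinate. */
    MPI_Alltoall(sendbuf, count_per_dest * np_xy, MPI_DOUBLE,
                 tmpbuf,  count_per_dest * np_xy, MPI_DOUBLE, a2a1);

    /* ...rearrange tmpbuf here so the data is grouped by final destination... */

    /* Phase 2: each intermediate node forwards the packets to their final
     * destinations inside the symmetric X-Y plane. */
    MPI_Alltoall(tmpbuf,  count_per_dest * np_z, MPI_DOUBLE,
                 recvbuf, count_per_dest * np_z, MPI_DOUBLE, a2a2);

    MPI_Comm_free(&a2a1);
    MPI_Comm_free(&a2a2);
}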
Fig. 3. Pipelining of the two-phase all-to-all communication using 4 SMP threads, with inter-thread atomic barrier synchronization serializing A2A-1. A2A-1 is the all-to-all along the longest axis, which takes much longer than A2A-2, so A2A-2 and most of the other procedures are hidden behind A2A-1. In this way the two-phase all-to-all bandwidth can approach the theoretical peak bandwidth of A2A-1. To arrange the packet messages, there are three internal array transpositions performed in cache memory, tr1, tr2 and tr3.
By dividing the all-to-all communication into two phases, we can avoid the network contention caused by the longest-axis traffic, but the theoretical communication time increases. Therefore we need a new approach to hide the additional communication time behind the communication time of the longest-axis all-to-all communication.
B. Overlapping all-to-all communications on multi-thread nodes
We divide the messages and pipeline the procedures of the two-phase scheduling technique so that we can overlap the two-phase all-to-all communications in the different sub-communicators with each other. The pipelined procedures for handling each message consist of two all-to-all communications and three internal array transpositions:
tr1: Array transpose to arrange packets for the intermediate nodes
A2A-1: All-to-all in the sub-communicator with the longest axis
tr2: Array transpose to arrange packets for the final destinations
A2A-2: All-to-all in the second sub-communicator
tr3: Array transpose to arrange the output array
We use shared memory parallelization (SMP) on multi-thread or multi-core processors, and we simply map these pipelined procedures to each thread as shown in Fig. 3 to process the independent messages sequentially. The threads must be carefully scheduled so that only one thread at a time processes the all-to-all communication on the same sub-communicator. We use atomic counters for exclusive access control of the two sub-communicators. On Blue Gene/P, we use the PowerPC® atomic load and store instructions, lwarx and stwcx, to read and increment the atomic counters. This scheduling prioritizes the longest-axis traffic to maximize the bandwidth of the all-to-all communication. The all-to-all communications along the longest axis are the dominant procedures, and the other all-to-all communications and the three internal array transpositions are hidden behind them. In this method, the all-to-all communication does not need an asynchronous implementation because each thread has to wait until all of the data has been received from the earlier procedures. However, by using SMP threads, the other threads can proceed independently behind the communication, and this complex scheduling is easier to implement than overlapping the other procedures with asynchronous communication. We can implement the overlapping method by inserting OpenMP directives in front of the pipelined loop and atomic barrier operations in the loop.
The following pseudocode describes the overlapped two-phase all-to-all communication for np processors from array A(N) to B(N):

d: denominator for message division
nt: number of SMP threads
np1: number of processors belonging to communicator A2A-1
np2: number of processors belonging to communicator A2A-2 (np = np1*np2)
c1: atomic counter for A2A-1
c2: atomic counter for A2A-2
T1, T2, T3: temporary buffers

shared memory parallel section:
do i=1 to d {
  transpose 1: A(N/np/d, i, np2, np1) to T1(N/np/d, np2, np1)
  wait until mod(c1,nt) becomes my thread ID
  A2A-1 from T1 to T2
  atomic increment c1
  transpose 2: T2(N/np/d, np2, np1) to T1(N/np/d, np1, np2)
  wait until mod(c2,nt) becomes my thread ID
  A2A-2 from T1 to T3
  atomic increment c2
  transpose 3: T3(N/np/d, np1, np2) to B(N/np/d, i, np2, np1)
}

Since the sizes of the temporary buffers T are inversely related to the denominator d, they can be chosen to maximize the performance. Larger d values increase the pipeline efficiency, while larger T values increase the efficiency of the communication. It is also important to choose T so that the internal array transposes are done in the cache memory, or cache misses will extend the times for the internal array transpositions beyond what can be hidden behind the communication times. For Blue Gene/P, we chose 512 KB for the size of the temporary buffer T: Blue Gene/P has 8 MB of L3 cache memory per node and we need three temporary buffers per thread, so 512 KB is the largest buffer size that can be processed in the L3 cache (512 KB x 3 buffers x 4 threads = 6 MB). This means that all of the received data is stored in the L3 cache and the internal array transpositions are done smoothly without cache misses.
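The following C sketch renders the pseudocode above more concretely (it is a simplified illustration, not the authors' Blue Gene/P code): OpenMP threads process the d chunks round-robin, C11 atomic counters take the place of the lwarx/stwcx counters, and tr1/tr2/tr3 are hypothetical packing routines. It assumes the MPI library was initialized with MPI_THREAD_MULTIPLE and that d is a multiple of the number of threads.

/* Sketch of the pipelined two-phase all-to-all of Section III.B.
 * chunk_elems doubles per chunk per node; np1 and np2 are the sizes of the
 * A2A-1 and A2A-2 sub-communicators. Counters c1/c2 must be zero on entry. */
#include <mpi.h>
#include <omp.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the internal array transposes tr1, tr2, tr3. */
extern void tr1(const double *A, double *T1, int chunk, int chunk_elems);
extern void tr2(const double *T2, double *T1, int chunk_elems);
extern void tr3(const double *T3, double *B, int chunk, int chunk_elems);

static atomic_int c1;   /* turn counter for A2A-1 (longest axis) */
static atomic_int c2;   /* turn counter for A2A-2 (remaining plane) */

static void wait_turn(atomic_int *c, int tid, int nt)
{
    /* Spin until it is this thread's turn to use the communicator. */
    while (atomic_load_explicit(c, memory_order_acquire) % nt != tid)
        ;
}

void overlapped_alltoall(const double *A, double *B, int chunk_elems, int d,
                         MPI_Comm a2a1, int np1, MPI_Comm a2a2, int np2)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nt  = omp_get_num_threads();

        /* Per-thread temporaries, sized so three of them per thread fit in L3. */
        double *T1 = malloc(sizeof(double) * chunk_elems);
        double *T2 = malloc(sizeof(double) * chunk_elems);
        double *T3 = malloc(sizeof(double) * chunk_elems);

        for (int i = tid; i < d; i += nt) {        /* round-robin over chunks */
            tr1(A, T1, i, chunk_elems);            /* pack for intermediate nodes */

            wait_turn(&c1, tid, nt);               /* serialize A2A-1 */
            MPI_Alltoall(T1, chunk_elems / np1, MPI_DOUBLE,
                         T2, chunk_elems / np1, MPI_DOUBLE, a2a1);
            atomic_fetch_add_explicit(&c1, 1, memory_order_release);

            tr2(T2, T1, chunk_elems);              /* regroup by final destination */

            wait_turn(&c2, tid, nt);               /* serialize A2A-2 */
            MPI_Alltoall(T1, chunk_elems / np2, MPI_DOUBLE,
                         T3, chunk_elems / np2, MPI_DOUBLE, a2a2);
            atomic_fetch_add_explicit(&c2, 1, memory_order_release);

            tr3(T3, B, i, chunk_elems);            /* write this output chunk */
        }

        free(T1); free(T2); free(T3);
    }
}

Because one thread can be inside A2A-1 while another is inside A2A-2, two collectives on different communicators may be in flight at once, which is why MPI_THREAD_MULTIPLE is assumed.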
Fig. 4. Performance comparisons of all-to-all communication on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots the all-to-all bandwidth [MB/s] of the overlapping method and of MPI_Alltoall against the point-to-point message size (0.5 KB to 128 KB), together with the theoretical peak.
C. Performance of the overlapping all-to-all method
We developed the overlapping method for the all-to-all communication and measured its performance on Blue Gene/P. Blue Gene/P has four processor cores per node, and SMP is supported within each node. The nodes of Blue Gene/P are connected in a 3D torus network; the smallest torus is a half rack with an 8x8x8 torus network, and two 8x8x8 torus networks are combined into an 8x8x16 full rack. The shapes of the torus networks for multiple racks are combinations of 8x8x16, so most large Blue Gene/P systems have asymmetric torus networks. We measured and compared the all-to-all performance on a half rack (8x8x8 = 512 nodes), one rack (8x8x16 = 1,024 nodes), two racks (8x16x16 = 2,048 nodes) and four racks (16x16x16 = 4,096 nodes) of Blue Gene/P. Fig. 4 shows the measured all-to-all bandwidths of the proposed overlapping method and of MPI_Alltoall on Blue Gene/P. On all of the networks, the performance of MPI_Alltoall saturates at 2 KB messages and degrades for larger messages. This saturation is due to L3 cache misses, because MPI_Alltoall handles each entire message at one time. In contrast, the performance of the proposed method continues to approach the theoretical peak, even for larger messages and without degradation. To avoid the performance degradation caused by cache misses, the proposed method pipelines each message to be sent and processed, and handles a constant amount of data that fits within the L3 cache in each processing step.
On 512 nodes (symmetric torus network), the performance of MPI_Alltoall is better than that of our proposed method, because the advantage of our method is relatively small for a small symmetric torus network. We observed that some of the internal array transposes are not masked behind the all-to-all communication time in this case. With 1,024 or 2,048 nodes (asymmetric torus networks), the proposed method performs 20% to 30% better and achieves up to 95% of the theoretical peak bandwidth. These results show that the proposed method does a good job of overlapping all of the other overhead of the all-to-all communication behind the longest-axis traffic. On 4,096 nodes (symmetric torus network), the performance of the proposed method is slightly better than that of MPI_Alltoall because of the effects of L3 cache misses on MPI_Alltoall.
IV. OVERLAPPING FAST FOURIER TRANSFORM
A. Parallel 1D FFT
1) Six-step FFT algorithm
A discrete Fourier transform (DFT) is defined in Equation 3:

y_k = Σ_{j=0}^{n−1} x_j · w_n^{jk},  where w_n = e^{−2πi/n}    (3)
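For reference, a direct O(n²) evaluation of Equation 3 in C (this naive loop is only meant to make the definition concrete; an FFT computes the same result in O(n log n)):

/* Direct evaluation of Equation 3: y[k] = sum_j x[j] * w_n^(j*k). */
#include <complex.h>
#include <math.h>

void dft(const double complex *x, double complex *y, int n)
{
    const double pi = acos(-1.0);
    for (int k = 0; k < n; k++) {
        double complex sum = 0.0;
        for (int j = 0; j < n; j++) {
            /* w_n^(jk) = exp(-2*pi*i*j*k/n) */
            double angle = -2.0 * pi * (double)j * (double)k / (double)n;
            sum += x[j] * cexp(I * angle);
        }
        y[k] = sum;
    }
}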
To parallelize a DFT on parallel machines, algorithms based on the six-step FFT [16] are generally used [17]. When n can be decomposed as n = n1 × n2, Equation 3 can be rewritten using 2D arrays as Equation 4:

y(k2, k1) = Σ_{j1=0}^{n1−1} Σ_{j2=0}^{n2−1} x(j1, j2) · w_{n2}^{j2·k2} · w_{n1·n2}^{j1·k2} · w_{n1}^{j1·k1}    (4)
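Equation 4 follows from Equation 3 by the index split standard in the six-step method (a short derivation, added here for clarity): write j = j1 + j2·n1 and k = k2 + k1·n2 with 0 ≤ j1, k1 < n1 and 0 ≤ j2, k2 < n2; then

w_n^{jk} = w_n^{(j1 + j2·n1)·(k2 + k1·n2)}
         = w_n^{j1·k2} · w_n^{j1·k1·n2} · w_n^{j2·n1·k2} · w_n^{j2·k1·n1·n2}
         = w_{n1·n2}^{j1·k2} · w_{n1}^{j1·k1} · w_{n2}^{j2·k2}

since w_n^{n1·n2} = 1, w_n^{n2} = w_{n1} and w_n^{n1} = w_{n2}. Substituting this product into Equation 3 and summing over j1 and j2 yields Equation 4; the factor w_{n1·n2}^{j1·k2} is the twiddle factor multiplied in Step 3 of the six-step algorithm below.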
When we use np parallel processes, a 1D array x(n) is distributed so that each process has a partial array x(n/np), which is equivalent to the 2D array x(n1, n2/np). To calculate a 1D FFT using Equation 4 on a parallel computer, three all-to-all communications are needed to transpose the global array. Usually the all-to-all communication is synchronous and the local FFT calculations can only start when all of the elements of the array have been received, so the performance of the parallel FFT largely depends on the performance of the all-to-all communication. The parallel six-step FFT algorithm is as follows (a structural sketch of this synchronous form is shown below):
Step 1: Global array transpose, x(n1, n2/np) to t(n2, n1/np)
Step 2: n2 FFT
Step 3: Multiply the elements of the array by the twiddle factor w_{n1·n2}^{j1·k2}
Step 4: Global array transpose, t(n2, n1/np) to t(n1, n2/np)
Step 5: n1 FFT
Step 6: Global array transpose, t(n1, n2/np) to y(n2, n1/np)
In more detail, we also need internal array transposes before and after the all-to-all communications of the global array transposes, and we can merge Step 3 into the internal array transpose needed in Step 4 before the all-to-all communication.
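The following is a minimal structural sketch of this conventional synchronous form (an illustration, not the authors' code): each global transpose is an MPI_Alltoall bracketed by local packing, while pack/unpack, local_fft and twiddle_multiply are hypothetical helpers standing in for the internal array transposes, the node-local FFT library and Step 3.

/* Sketch of the synchronous parallel six-step 1D FFT (Steps 1-6).
 * x, y and t each hold (n1*n2)/np double complex elements per node.
 * pack/unpack, local_fft and twiddle_multiply are hypothetical helpers. */
#include <mpi.h>
#include <complex.h>

extern void pack(const double complex *in, double complex *out, long elems, int np);
extern void unpack(const double complex *in, double complex *out, long elems, int np);
extern void local_fft(double complex *data, long len, long howmany);
extern void twiddle_multiply(double complex *data, long n1, long n2, int np, int rank);

void sixstep_fft(double complex *x, double complex *y, double complex *t,
                 long n1, long n2, int np, int rank, MPI_Comm comm)
{
    long local = (n1 * n2) / np;          /* elements per node */
    int  cnt   = (int)(local / np);       /* elements sent to each destination */

    /* Step 1: global transpose x(n1, n2/np) -> y(n2, n1/np) */
    pack(x, y, local, np);
    MPI_Alltoall(y, cnt, MPI_C_DOUBLE_COMPLEX, t, cnt, MPI_C_DOUBLE_COMPLEX, comm);
    unpack(t, y, local, np);

    /* Step 2: n1/np local FFTs of length n2 */
    local_fft(y, n2, n1 / np);

    /* Step 3: multiply by the twiddle factors (mergeable with Step 4's packing) */
    twiddle_multiply(y, n1, n2, np, rank);

    /* Step 4: global transpose t(n2, n1/np) -> t(n1, n2/np) */
    pack(y, t, local, np);
    MPI_Alltoall(t, cnt, MPI_C_DOUBLE_COMPLEX, y, cnt, MPI_C_DOUBLE_COMPLEX, comm);
    unpack(y, t, local, np);

    /* Step 5: n2/np local FFTs of length n1 */
    local_fft(t, n1, n2 / np);

    /* Step 6: global transpose t(n1, n2/np) -> y(n2, n1/np) */
    pack(t, y, local, np);
    MPI_Alltoall(y, cnt, MPI_C_DOUBLE_COMPLEX, t, cnt, MPI_C_DOUBLE_COMPLEX, comm);
    unpack(t, y, local, np);     /* final result ends up in y */
}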
2) Overlapping parallel six-step FFT algorithm
By replacing the conventional synchronous all-to-all communication with the proposed overlapping all-to-all method using multiple SMP threads, we can optimize the all-to-all communication parts of the FFT. In addition to the all-to-all communication optimization, we can also optimize the local FFT calculations and the internal array transposes of the FFT by hiding their time behind the longest-axis all-to-all traffic. To integrate the overlapping method of the all-to-all communication into the parallel six-step 1D FFT, the algorithm and the array are divided for pipelining [18]. We separate the six-step FFT algorithm into two phases: Phase-1 combines Step 1 to Step 4, and Phase-2 combines Step 5 and Step 6. Each phase is then pipelined separately: Phase-1 is pipelined n1/np ways and Phase-2 is pipelined n2/np ways, the arrays are divided in the same way as the pipelines, and the divided arrays are processed in parallel loops using SMP threads. Fig. 5 shows the pipelined procedures of Phase-1 and Phase-2 using the proposed overlapping method for the all-to-all communication.

Fig. 5. Pipelining and overlapping the all-to-all communications and FFT calculations of the parallel six-step FFT algorithm using 4 SMP threads: (a) Phase 1, scheduling of the all-to-all and the n2 FFTs; (b) Phase 2, scheduling of the all-to-all and the n1 FFTs. tr1 to tr5 are internal array transposes, and "mul w" is the multiplication by the twiddle factor, merged with an internal array transpose.

In Phase-1 there are two all-to-all communications for the global array transposes, so A2A-1 and A2A-2 each appear twice in a pipelined iteration. The all-to-all communication is exclusively scheduled using atomic counters as described in Section III.B. In Fig. 5, A2A-1 is the all-to-all communication along the longest axis, and most of the other procedures, including the local FFT calculations, are hidden behind A2A-1.
B. Parallel multi-dimensional FFT
A multi-dimensional FFT is calculated by applying successive 1D FFTs in each dimension. A multi-dimensional array is usually decomposed by mapping the problem to the shape of the torus network. For example, with a 3D problem, a 3D array would be decomposed in three dimensions and mapped to a 3D torus network by dividing the array by the size of the torus network in each dimension. In this case, we have three sub-communicators, one along each axis of the torus network, and we calculate the 1D parallel FFTs using the overlapping all-to-all method with these sub-communicators.
1) Overlapping parallel 2D FFT
When we map a 2D problem to a 3D torus network, we decompose the array onto two sub-communicators. One is decomposed along an axis, and the other is decomposed on a plane. We apply the overlapped parallel 1D FFT in the two directions independently. Fig. 6 describes the pipelined 1D FFT procedures for symmetric or linear sub-communicators.
Fig. 6. Overlapping FFTs and all-to-all communications on the symmetric sub-communicator using 4 SMP threads (tr1 to tr4 are internal array transposes).

Fig. 7. Overlapping FFTs and two-phase all-to-all communications on the asymmetric sub-communicator using 4 SMP threads (tr1 to tr6 are internal array transposes).

Fig. 8. Merging two pipelines of the overlapped method of all-to-all communications and FFTs using 4 SMP threads (tr1 to tr5 are internal array transposes).
For asymmetric sub-communicators, the sub-communicator is divided into two smaller sub-communicators and overlapped as shown in Fig. 7.
2) Overlapping parallel 3D FFT
When we map a 3D problem to a 3D torus network, we decompose the array onto three sub-communicators, one for each axis. Although a 3D FFT is calculated with three independent pipelines of overlapped 1D FFTs, two of the pipelines can be merged into one pipeline, because we need at least two independent pipeline loops of local parallelism to process the pipelined procedures using SMP threads. We can overlap the all-to-all communications on the different axes of the torus network as shown in Fig. 8. The rest of the pipeline is processed as shown in Fig. 6. For higher dimensions, the pipelines can be merged in the same manner to overlap the procedures for the different dimensions.
C. Performance of the overlapping method of the parallel FFT
We also measured the performance of the overlapping FFT method on Blue Gene/P. We used our own optimized FFT library for the local FFT calculations to exploit the SIMD instructions for complex arithmetic. We compared our overlapping FFT method to a conventional synchronous implementation using MPI_Alltoall. The maximum performance for the synchronous parallel FFT is defined by the theoretical peak bandwidth of the all-to-all communications: the number of floating point operations is divided by the theoretical total time for the all-to-all communications. The number of floating point operations in a 1D FFT for an array of length N is given by Equation 5:

P(N) = 5·N·log2(N)    (5)
The theoretical total time for the all-to-all communications is calculated by dividing the array size by the theoretical peak bandwidth (Equation 2) and multiplying by the number of all-to-all communications needed for the parallel FFT calculation. Fig. 9 shows the measured performance for a 1D FFT on Blue Gene/P. We can see curves very similar to the all-to-all bandwidths of the MPI implementation. The performance of the overlapped FFT increases for larger arrays, similar to the bandwidth of the overlapping all-to-all method, and we achieved 60% to 90% of the maximum performance for the parallel 1D FFT calculation as defined by the theoretical peak bandwidth of the all-to-all communication. Our measured improvements were 1.3 to 1.8 times over the synchronous implementation using MPI_Alltoall. Fig. 10 shows the measured performance for a 2D FFT on Blue Gene/P. We achieved 70% to 85% of the maximum performance for the parallel 2D FFT calculation, and the measured improvement was 1.3 times over the synchronous implementation using MPI_Alltoall. Fig. 11 shows the measured performance for a 3D FFT on Blue Gene/P. For 3D and higher dimensional FFTs, by merging two pipelines of overlapping FFTs, two pairs of all-to-all communications are hidden behind the longest-axis traffic, so the maximum performance for the proposed overlapping method is higher than that for the MPI implementation. We achieved 75% to 95% of the maximum performance for the parallel 3D FFT calculation defined by the theoretical peak bandwidth of the all-to-all communication, and the performance of the new overlapping method exceeded the maximum performance of the conventional implementation. The performance of the MPI implementation is stable, while the performance of the overlapping FFT improves with larger arrays, eventually becoming 2.0 times faster than the synchronous implementation using MPI_Alltoall.
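The sketch below turns this definition into a simple model (an estimate added here for illustration): it assumes 16-byte double-precision complex elements, takes the per-node peak bandwidth A from Equation 2 (e.g., the Table I values), and counts num_a2a all-to-all exchanges (three for the parallel 1D FFT).

#include <math.h>

/* Estimated upper bound (GFLOPS) on parallel FFT performance when the run
 * time is dominated by num_a2a synchronous all-to-all exchanges. n is the
 * total number of complex elements, np the node count, peak_MBps the
 * per-node all-to-all peak bandwidth from Equation 2 or Table I. */
double max_fft_gflops(double n, int np, double peak_MBps, int num_a2a)
{
    double flops          = 5.0 * n * log2(n);      /* Equation 5 */
    double bytes_per_node = 16.0 * n / np;          /* complex doubles sent per node per exchange */
    double seconds        = num_a2a * bytes_per_node / (peak_MBps * 1.0e6);
    return flops / seconds / 1.0e9;
}

For example, max_fft_gflops(pow(2, 33), 512, 374.0, 3) evaluates to roughly 660 GFLOPS for the largest 512-node array size; the maximum-performance curves in Figs. 9 to 11 come from this kind of calculation.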
V. CONCLUSIONS
In this paper, we presented an overlapping method for all-to-all communication that reduces network contention in asymmetric torus networks. We improved the all-to-all bandwidth for asymmetric torus networks by pipelining the indirect all-to-all algorithm and by using SMP threads to overlap the all-to-all communications and the internal array transpositions. We also eliminated the performance degradation for larger arrays by processing the internal array transposes in cache memory. It is important to schedule the indirect all-to-all communications to avoid network contention on asymmetric torus networks. However, indirect methods incur extra overhead such as the internal array transpositions, and it is also important to reduce this overhead to improve the performance. The overlapping all-to-all method addresses these problems by hiding the overhead behind the all-to-all communication along the longest axis. The proposed method improves the performance of all-to-all communication by 20% to 30% on an asymmetric torus and achieves up to 95% of the theoretical peak bandwidth. This pipelining technique for SMP threads is useful not only for torus networks but also for other networks. We are planning to develop and analyze our overlapping all-to-all method on other parallel computers without torus interconnects. We also improved the performance of the parallel FFT on torus networks by integrating the new overlapping all-to-all method into a parallel six-step FFT algorithm and by hiding the calculation time of the local FFTs behind the all-to-all communications. The proposed method improves performance by 80% to 100% for the parallel 1D FFT and reaches up to 90% of the maximum theoretical performance as defined by the bandwidth of the all-to-all communications. The proposed overlapping 3D FFT exceeds the maximum theoretical performance of the synchronous implementation. We are also planning to find other applications that can be improved by overlapping the all-to-all communications with their calculations.
ACKNOWLEDGEMENTS
We tested and developed our experimental programs on the Watson4P Blue Gene/P system installed at the IBM T.J. Watson Research Center, Yorktown Heights, N.Y. We would like to thank Fred Mintzer for helping us obtain machine time, James Sexton for managing our research project, and Shannon Jacobs for proofreading. We also would like to thank Sameer Kumar, Yogish Sabharwal and Philip Heidelberger.

REFERENCES
[1] P.A. Boyle et al., "Overview of the QCDSP and QCDOC computers," IBM Journal of Research and Development, Vol. 49, No. 2/3, 2005, pp. 351-365.
[2] A. Gara et al., "Overview of the Blue Gene/L System Architecture," IBM Journal of Research and Development, Vol. 49, No. 2/3, 2005, pp. 195-212.
[3] IBM Journal of Research and Development staff, "Overview of the IBM Blue Gene/P project," IBM Journal of Research and Development, Vol. 52, No. 1/2, 2008, pp. 199-220.
[4] G. Goldrian et al., "QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine," Computing in Science & Engineering, Vol. 10, No. 6, 2008, pp. 46-54.
[5] S.R. Alam et al., "Performance analysis and projections for Petascale applications on Cray XT series systems," 2009 IEEE International Symposium on Parallel & Distributed Processing, 2009, pp. 1-8.
[6] Y. Ajima et al., "Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers," Computer, Vol. 42, No. 11, Nov. 2009, pp. 36-40.
[7] B.G. Fitch et al., "Blue Matter: Approaching the Limits of Concurrency for Classical Molecular Dynamics," SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006, p. 87.
[8] M. Sekijima et al., "Optimization and Evaluation of Parallel Molecular Dynamics Simulation on Blue Gene/L," Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2007), 2007, pp. 257-262.
[9] E. Bohm et al., "Fine grained parallelization of the Car-Parrinello ab initio MD method on Blue Gene/L," IBM Journal of Research and Development, Vol. 52, No. 1/2, 2007, pp. 159-176.
[10] Y.J. Suh and S. Yalamanchili, "All-To-All Communication with Minimum Start-Up Costs in 2D/3D Tori and Meshes," IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 5, 1998, pp. 442-458.
[11] L.V. Kalé et al., "A Framework for Collective Personalized Communication," International Parallel and Distributed Processing Symposium (IPDPS'03), 2003, p. 69a.
[12] S. Souravlas and M. Roumeliotis, "A message passing strategy for array redistributions in a torus network," The Journal of Supercomputing, Vol. 46, No. 1, 2008, pp. 40-57.
[13] S. Kumar et al., "Optimization of All-to-All Communication on the Blue Gene/L," International Conference on Parallel Processing (ICPP'08), 2008, pp. 320-329.
[14] S. Kumar et al., "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," Proceedings of the 22nd Annual International Conference on Supercomputing, 2008, pp. 94-103.
[15] A. Faraj et al., "MPI Collective Communications on the Blue Gene/P Supercomputer: Algorithms and Optimizations," 17th IEEE Symposium on High Performance Interconnects (HOTI 2009), 2009, pp. 63-72.
[16] D.H. Bailey, "FFTs in external or hierarchical memory," The Journal of Supercomputing, Vol. 4, 1990, pp. 23-35.
[17] D. Takahashi et al., "A Blocking Algorithm for Parallel 1-D FFT on Clusters of PCs," Proc. 8th International Euro-Par Conference (Euro-Par 2002), Lecture Notes in Computer Science, No. 2400, 2002, pp. 691-700.
[18] Y. Sabharwal et al., "Optimization of Fast Fourier Transforms on the Blue Gene/L Supercomputer," International Conference on High Performance Computing (HiPC 2008), 2008, pp. 309-322.
Fig. 9. Performance comparisons for the parallel 1D FFT on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots GFLOPS for the overlapping method and the MPI implementation against the array size; the array size is the length of the complex array, and the point-to-point message length is shown as the label of the horizontal axis (array size divided by 512^2, 1024^2, 2048^2 or 4096^2). The maximum performance curve shows the parallel 1D FFT performance of a synchronous implementation as defined by the theoretical peak bandwidth of the all-to-all communications.
Fig. 10. Performance comparisons for the parallel 2D FFT on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots GFLOPS for the overlapping method and the MPI implementation against the 2D array size (8192x8192 up to 65536x65536, depending on the node count). The maximum performance curve shows the parallel 2D FFT performance of a synchronous implementation as defined by the theoretical peak bandwidth of the all-to-all communications.
Fig. 11. Performance comparisons for the parallel 3D FFT on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots GFLOPS for the overlapping method and the MPI implementation against the 3D array size (256^3 up to 4096^3, depending on the node count). There are two different maximum performance curves: the higher one is the maximum performance for the proposed overlapping method, and the other is the maximum performance for the conventional synchronous implementation using MPI_Alltoall as defined by the theoretical peak bandwidth of the all-to-all communications.