Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers

Jun Doi and Yasushi Negishi

IBM Research – Tokyo 1623-14 Shimo-tsuruma, Yamato-shi, Kanagawa-ken, Japan. [email protected], [email protected]

Abstract—Torus networks are commonly used in massively parallel computers, and their performance often constrains total application performance. Especially on an asymmetric torus network, traffic along the longest axis is the performance bottleneck for all-to-all communication, so it is important to schedule the longest-axis traffic smoothly. In this paper, we propose a new algorithm, based on an indirect method, that pipelines the all-to-all procedures using shared-memory parallel threads and (1) isolates the longest-axis traffic from other traffic, (2) schedules it smoothly, and (3) overlaps all of the other traffic and overhead for the all-to-all communication behind the longest-axis traffic. The proposed method achieves up to 95% of the theoretical peak. We also integrated the overlapped all-to-all method with parallel FFT algorithms, so that local FFT calculations are also overlapped behind the longest-axis traffic. The FFT performance achieves up to 90% of the theoretical peak for the parallel 1D FFT.

Index Terms—all-to-all communication, collective communication, Blue Gene, FFT, multi-core, multi-thread, parallel algorithm, SMP, torus network

I. INTRODUCTION

All-to-all communication is one of the most performance-critical communication patterns for scientific applications on massively parallel computers. All-to-all communication is often used for matrix or array transposition and in parallel Fast Fourier Transform (FFT) algorithms. Since the performance of the parallel FFT largely depends on the performance of the all-to-all communication, its performance is a key parameter for applications such as molecular dynamics [7] [8], quantum molecular dynamics [9], plasma physics, or digital signal processing. In this paper, we focus on all-to-all communication on massively parallel computers with torus interconnects. Today, the torus interconnect is becoming a common interconnect topology for massively parallel computers [1] [2] [3] [4] [5] [6] since it offers high scalability, high cost effectiveness, and high flexibility. As torus networks grow toward petaflops and exaflops scales, the performance of the all-to-all communication is becoming critical to a wider range of applications because the final performance depends on the size and shape of the torus network. The bandwidth of the all-to-all communication is theoretically in inverse proportion to the longest dimension of the torus network, and the network contention caused by the asymmetry of a torus network decreases the effective bandwidth. In most cases the torus network is asymmetric. For example, a standard rack of Blue Gene®/P [3] machines is composed of a 3D torus whose dimensions are 8x8x16. There are only two symmetric configurations, 4 racks (16x16x16) and 32 racks (32x32x32), for machines below the petaflops scale (72 racks). In this paper, we propose an optimal all-to-all communication algorithm and an FFT algorithm for asymmetric torus configurations.

There are many approaches to all-to-all communication optimized for torus networks. These are classified into two types, direct and indirect methods. In direct methods the all-to-all communication is done via point-to-point communications between all of the pairs of nodes. In contrast, in indirect methods the messages are combined into packets and routed or sent through intermediate nodes. There are several indirect methods suitable for torus networks [10] [11] [12] [13], and most of the indirect methods focus on scheduling and routing the packets in torus networks to minimize network contention. For example, [13] presented a two-phase all-to-all schedule to minimize network contention on asymmetric torus networks. In indirect methods, it is important to minimize additional software overhead such as message combining and routing in the intermediate nodes.

In this paper, we propose an overlapping method for all-to-all communication to reduce the network contention in asymmetric torus networks. We pipeline an indirect all-to-all method using shared memory parallel (SMP) threads on multi-core processors, which (1) isolates the longest-axis traffic from the other traffic, (2) schedules that traffic smoothly, and (3) overlaps all of the other traffic and overhead for the all-to-all communication behind the longest-axis traffic. By using threads, the network latencies, startup overheads, and internal array transposes that arrange the message packets are overlapped on the different threads to maximize the performance.

We also propose a method to integrate our optimized all-to-all communication with parallel FFT algorithms. Although a parallel FFT algorithm is usually implemented using synchronous all-to-all communication, we integrated the parallel FFT algorithm into the pipelined all-to-all communication using SMP threads, so that the local FFT calculations are also hidden behind the longest-axis traffic of the all-to-all communication. We implemented parallel 1D, 2D and 3D FFTs.

The remainder of this paper is organized as follows: Section II describes the performance characteristics of all-to-all communication on a torus network. Section III presents our proposal for a pipelined all-to-all method based on the indirect method. Section IV explains how to integrate our all-to-all method with the parallel FFT algorithm. We present conclusions and future work in Section V.

II. ALL-TO-ALL COMMUNICATION ON THE TORUS NETWORK

A. Torus network

In a torus network, the nodes are arranged in an N-dimensional grid and connected to their neighboring nodes, so each node in the torus network has 2N nearest neighbors. Fig. 1 shows an example of a 4x4 2D torus network.
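To make the wraparound concrete, the following is a minimal sketch (not from the paper; the function and variable names are illustrative) that computes the 2N nearest neighbors of a node on an N-dimensional torus using modular arithmetic; with NDIM = 2 and extents 4x4 it reproduces the four neighbors of a node in the example of Fig. 1.

#include <stdio.h>

#define NDIM 2                           /* NDIM = 2 matches the 4x4 example */

/* Fill neigh[2*NDIM][NDIM] with the coordinates of the 2N nearest neighbors
 * of node c on a torus of extents dim, using wraparound (modular) indexing. */
static void torus_neighbors(const int c[NDIM], const int dim[NDIM],
                            int neigh[2 * NDIM][NDIM]) {
    for (int d = 0; d < NDIM; ++d) {
        for (int k = 0; k < NDIM; ++k) {
            neigh[2 * d][k]     = c[k];
            neigh[2 * d + 1][k] = c[k];
        }
        neigh[2 * d][d]     = (c[d] + 1) % dim[d];           /* +1 along axis d */
        neigh[2 * d + 1][d] = (c[d] - 1 + dim[d]) % dim[d];  /* -1 along axis d */
    }
}

int main(void) {
    int dim[NDIM] = {4, 4}, c[NDIM] = {0, 3}, neigh[2 * NDIM][NDIM];
    torus_neighbors(c, dim, neigh);
    for (int i = 0; i < 2 * NDIM; ++i)
        printf("(%d,%d)\n", neigh[i][0], neigh[i][1]);  /* (1,3) (3,3) (0,0) (0,2) */
    return 0;
}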

Fig. 1. Example of a 4x4 2D "symmetric" torus network. In this case each node has four nearest neighbors.

The packets for point-to-point communications are routed to their final destinations by routers in each node. Network contention occurs frequently when the routes of packets cross in dense communication patterns such as all-to-all, broadcast, scatter, or gather.

B. Performance of all-to-all communications

In all-to-all communication, each node exchanges messages with all of the other nodes. The all-to-all bandwidth peaks when all of the routes are filled with packets, and the theoretical peak bandwidth is determined by the length of the longest axis of the torus network. The time for all-to-all communication T is given by Equation 1 [13]:

$T = np \cdot (L/8) \cdot m \cdot \beta$    (1)

where np is the number of processors, L is the length of the longest axis, m is the total size of all of the all-to-all messages, and β is the per-byte network transfer time. The theoretical peak bandwidth of the all-to-all communication is given by Equation 2:

$A = e \cdot (1/\beta) \cdot (8/L)$    (2)

where e is the efficiency and 1/β is the network bandwidth. Table I shows the theoretical peak bandwidth and the measured all-to-all bandwidth of an MPI library (MPI_Alltoall) [14] for the 3D torus network of Blue Gene/P. In this table, the measured bandwidth for the 3D symmetric torus networks of 512 and 4,096 nodes is much better than for the asymmetric 3D torus networks of 1,024 and 2,048 nodes. Although MPI_Alltoall is well optimized [15], the measured bandwidth is only 72% of the theoretical peak on 1,024 nodes because of the network contention and the traffic imbalance bottleneck along the longest axis. Table I shows the best measured bandwidth for MPI_Alltoall; as the message size increases, the effective bandwidth is reduced, as shown in Fig. 4 in Section III.C. We propose a new all-to-all method to address these problems in the next section.

TABLE I
BANDWIDTH OF MPI_ALLTOALL ON THE 3D TORUS OF BLUE GENE/P

Number of Nodes | Torus Shape | Theoretical Peak Bandwidth [MB/s] | Measured Bandwidth [MB/s] | Efficiency [%]
512             | 8x8x8       | 374                               | 340.6                     | 91.1
1,024           | 8x8x16      | 187                               | 134.2                     | 71.8
2,048           | 8x16x16     | 187                               | 146.2                     | 78.2
4,096           | 16x16x16    | 187                               | 158.9                     | 84.9
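As an illustrative check of Equation 2 against Table I, note that the 1,024-, 2,048- and 4,096-node configurations all have a longest axis of L = 16, so their theoretical peak follows from the 512-node (L = 8) value by the 8/L factor alone:

$A_{L=16} = A_{L=8} \cdot \frac{8}{16} = 374 \cdot \frac{1}{2} = 187\ \mathrm{MB/s},$

which is exactly the peak bandwidth listed in the table for those three configurations.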

III. OVERLAPPING METHOD FOR ALL-TO-ALL COMMUNICATION

A. Two-phase scheduling of all-to-all communication

To reduce the network contention of all-to-all communication on asymmetric torus networks, two-phase scheduling [13] offers advantages. The all-to-all communication is separated into two phases, with the group of nodes for each communication being called a communicator. The entire torus network is separated into two sub-communicators so that each sub-communicator forms either a 1D axis or a symmetric multi-dimensional torus, and the bandwidths of the two separated all-to-all communications can approach their theoretical peaks. For example, if the configuration of the 3D torus network is X=4, Y=4 and Z=8, then the sub-communicator for phase-1 (A2A-1) contains the Z axis, while the sub-communicator for phase-2 (A2A-2) contains the X and Y axes, forming a symmetric plane as shown in Fig. 2.

Fig. 2. Separating the all-to-all communication into two phases: A2A-1 along the longest axis and A2A-2 in the plane consisting of the other axes.

The all-to-all communication is processed in two phases:
Phase-1: For each final destination, a sender node chooses a node belonging to both A2A-1 (including the sender) and A2A-2 (including the final destination) as an intermediate node, and sends the packets for the final destination to the chosen intermediate node.
Phase-2: Each intermediate node sends its packets to their final destinations in A2A-2.
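As a minimal sketch of this phase-1 routing rule (illustrative names, assuming a 3D torus whose longest axis is Z as in the Fig. 2 example, so that A2A-1 runs along Z and A2A-2 spans the XY plane):

/* The intermediate node lies on the sender's Z line (same x, y as the sender)
 * and in the final destination's XY plane (same z as the destination), so
 * phase-1 traffic moves only along Z and phase-2 traffic only within the plane. */
typedef struct { int x, y, z; } Coord;

static Coord intermediate_node(Coord sender, Coord dest) {
    Coord im = { sender.x, sender.y, dest.z };
    return im;
}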

Fig. 3. Pipelining of the two-phase all-to-all communication using 4 SMP threads. A2A-1 is the all-to-all along the longest axis, which takes much longer than A2A-2, so A2A-2 and most of the other procedures are hidden behind A2A-1. In this way the two-phase all-to-all bandwidth can approach the theoretical peak bandwidth of A2A-1. To arrange the packet messages, there are three internal array transpositions performed in cache memory: tr1, tr2 and tr3. The threads synchronize with an inter-thread atomic barrier so that only one thread at a time performs A2A-1.

By dividing the all-to-all communication into two phases, we can avoid the network contention caused by the longest-axis traffic, but the theoretical communication time increases. Therefore we need a new approach that hides the additional communication time behind the communication time of the longest-axis all-to-all communication.

B. Overlapping all-to-all communications on multi-thread nodes

We divide the messages and pipeline the procedures of the two-phase scheduling technique so that we can overlap the two-phase all-to-all communications in the different sub-communicators with each other. The pipelined procedures for handling each message consist of two all-to-all communications and three internal array transpositions:
tr1: Array transpose to arrange packets for the intermediate node
A2A-1: All-to-all in the sub-communicator with the longest axis
tr2: Array transpose to arrange packets for the final destination
A2A-2: All-to-all in the second sub-communicator
tr3: Array transpose to arrange the output array
We use shared memory parallelization (SMP) on multi-thread or multi-core processors, and we simply map these pipelined procedures to each thread as shown in Fig. 3 to process the independent messages sequentially. The threads must be carefully scheduled so that only one thread at a time processes the all-to-all communication on the same sub-communicator. We use atomic counters for exclusive access control of the two sub-communicators; on Blue Gene/P, we use the PowerPC® atomic load and store instructions, lwarx and stwcx, to read and increment the atomic counters. This scheduling prioritizes the longest-axis traffic to maximize the bandwidth of the all-to-all communication: the all-to-all communications along the longest axis are the dominant procedures, and the other all-to-all communications and the three internal array transpositions are hidden behind them. In this method, the all-to-all communication does not need an asynchronous implementation, because each thread has to wait until all of the data is received from the earlier procedures. However, by using SMP threads, the other threads can proceed independently behind the communication, and it is easier to implement this complex scheduling than to overlap the other procedures with asynchronous communication.
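A minimal sketch of this exclusive-access counter, using C11 atomics in place of the PowerPC lwarx/stwcx pair (the names are illustrative, not the paper's implementation):

#include <stdatomic.h>

typedef struct { atomic_int turn; } a2a_lock;      /* one per sub-communicator */

/* Block until counter mod nthreads equals this thread's ID, i.e. until it is
 * this thread's turn to issue the all-to-all on that sub-communicator. */
static void a2a_wait_turn(a2a_lock *l, int tid, int nthreads) {
    while (atomic_load_explicit(&l->turn, memory_order_acquire) % nthreads != tid)
        ;                                           /* spin */
}

/* Pass the turn to the next thread after the all-to-all completes. */
static void a2a_release_turn(a2a_lock *l) {
    atomic_fetch_add_explicit(&l->turn, 1, memory_order_release);
}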

We can implement the overlapping method by inserting OpenMP directives in front of the pipelined loop and atomic barrier operations inside the loop. The following pseudocode describes the overlapped two-phase all-to-all communication for np processors from array A(N) to B(N):

d: denominator for message division
nt: number of SMP threads
np1: number of processors belonging to communicator A2A-1
np2: number of processors belonging to communicator A2A-2 (np = np1*np2)
c1: atomic counter for A2A-1
c2: atomic counter for A2A-2
T1, T2, T3: temporary buffers

shared memory parallel section:
do i=1 to d {
  transpose 1: A(N/np/d, i, np2, np1) to T1(N/np/d, np2, np1)
  wait until mod(c1,nt) becomes my thread ID
  A2A-1 from T1 to T2
  atomic increment c1
  transpose 2: T2(N/np/d, np2, np1) to T1(N/np/d, np1, np2)
  wait until mod(c2,nt) becomes my thread ID
  A2A-2 from T1 to T3
  atomic increment c2
  transpose 3: T3(N/np/d, np1, np2) to B(N/np/d, i, np2, np1)
}

Since the sizes of the temporary buffers T are inversely related to the denominator d, they can be chosen to maximize the performance: larger d values increase the pipeline efficiency, while larger T values increase the efficiency of the communication. It is also important to choose T so that the internal array transposes are done in the cache memory, or cache misses will extend the times for the internal array transpositions beyond what can be hidden behind the communication times. For Blue Gene/P, we chose 512 KB for the size of each temporary buffer T, because Blue Gene/P has 8 MB of L3 cache memory per node and we need three temporary buffers per thread; 512 KB is therefore the largest buffer size that can be processed in the L3 cache (512 KB x 3 buffers x 4 threads = 6 MB). This means that all of the received data is stored in the L3 cache and the internal array transpositions are done without cache misses.
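The pseudocode above might look roughly as follows in C with MPI and OpenMP. This is a minimal sketch, not the paper's implementation: it assumes MPI was initialized with MPI_THREAD_MULTIPLE, that the two sub-communicators comm1 (longest axis) and comm2 (remaining plane) already exist, and it reduces the three internal array transposes to plain copies; real code would reorder the blocks as described above.

#include <mpi.h>
#include <omp.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

void overlapped_alltoall(const double *A, double *B, long n_local, int d,
                         MPI_Comm comm1, MPI_Comm comm2) {
    int np1, np2;
    MPI_Comm_size(comm1, &np1);
    MPI_Comm_size(comm2, &np2);
    long chunk = n_local / d;              /* elements handled per pipeline step  */
    long blk = chunk / ((long)np1 * np2);  /* block per (intermediate,final) pair */
    atomic_int c1 = 0, c2 = 0;             /* turn counters for A2A-1 and A2A-2   */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num(), nt = omp_get_num_threads();
        double *T1 = malloc(chunk * sizeof *T1);
        double *T2 = malloc(chunk * sizeof *T2);
        double *T3 = malloc(chunk * sizeof *T3);

        /* schedule(static,1) gives iteration i to thread i % nt, which matches
         * the round-robin order enforced by the two turn counters below.       */
        #pragma omp for schedule(static, 1)
        for (int i = 0; i < d; ++i) {
            /* tr1: pick out the i-th slice, ordered by intermediate node       */
            memcpy(T1, A + (long)i * chunk, chunk * sizeof *T1);

            while (atomic_load(&c1) % nt != tid) ;           /* wait for A2A-1  */
            MPI_Alltoall(T1, (int)(blk * np2), MPI_DOUBLE,
                         T2, (int)(blk * np2), MPI_DOUBLE, comm1);
            atomic_fetch_add(&c1, 1);                        /* pass the turn   */

            /* tr2: reorder the received blocks by final destination            */
            memcpy(T1, T2, chunk * sizeof *T1);

            while (atomic_load(&c2) % nt != tid) ;           /* wait for A2A-2  */
            MPI_Alltoall(T1, (int)(blk * np1), MPI_DOUBLE,
                         T3, (int)(blk * np1), MPI_DOUBLE, comm2);
            atomic_fetch_add(&c2, 1);

            /* tr3: place the result in the i-th slice of the output array      */
            memcpy(B + (long)i * chunk, T3, chunk * sizeof *T3);
        }
        free(T1); free(T2); free(T3);
    }
}

Because only one thread at a time owns each counter's turn, at most one MPI_Alltoall is in flight per sub-communicator, while the other threads overlap their transposes and the other phase behind it.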

Fig. 4. Performance comparisons of all-to-all communication on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots all-to-all bandwidth [MB/s] against point-to-point message size [KB] for the proposed overlapping method, MPI_Alltoall, and the theoretical peak.

C. Performance of the overlapping all-to-all method

We developed the overlapping method for the all-to-all communication and measured its performance on Blue Gene/P. Blue Gene/P has four processor cores per node and SMP is supported in each node. Each node of Blue Gene/P is connected in a 3D torus network, and the smallest torus is a half-rack with an 8x8x8 torus network; two 8x8x8 torus networks are combined in each full rack of 8x8x16. The shapes of the torus networks for multiple racks are combinations of 8x8x16, so most large Blue Gene/P systems have asymmetric torus networks. We measured and compared the all-to-all performance on a half-rack (8x8x8 = 512 nodes), one rack (8x8x16 = 1,024 nodes), two racks (8x16x16 = 2,048 nodes) and four racks (16x16x16 = 4,096 nodes) of Blue Gene/P.

Fig. 4 shows the measured all-to-all bandwidths of the proposed overlapping method and MPI_Alltoall on Blue Gene/P. On all of the networks, the performance of MPI_Alltoall saturates at 2 KB and degrades for larger messages. This saturation is due to L3 cache misses, because MPI_Alltoall handles all of each message at one time. In contrast, the performance of the proposed method continues to approach the theoretical peak, even for larger messages and without degradation. To avoid the performance degradation caused by cache misses, the proposed method pipelines each message to be sent and processed, and handles a constant amount of data that fits within the L3 cache in each processing step.

On 512 nodes (a symmetric torus network), the performance of MPI_Alltoall is better than our proposed method, because the advantage of our method is relatively small for a small symmetric torus network; we observed that some of the internal array transposes are not masked behind the all-to-all communication time in this case. With 1,024 or 2,048 nodes (asymmetric torus networks), the proposed method has 20% to 30% better performance and achieves up to 95% of the theoretical peak bandwidth. These results show that the proposed method does a good job of overlapping all of the other overhead for the all-to-all communication behind the longest-axis traffic. On 4,096 nodes (a symmetric torus network), the performance of the proposed method is slightly better than MPI_Alltoall because of the L3 cache misses that affect MPI_Alltoall.

IV. OVERLAPPING FAST FOURIER TRANSFORM

A. Parallel 1D FFT

1) Six-step FFT algorithm
A discrete Fourier transform (DFT) is defined in Equation 3:

$y_k = \sum_{j=0}^{n-1} x_j w_n^{jk}, \quad \text{where } w_n = e^{-2\pi i/n}$    (3)

Fig. 5. Pipelining and overlapping the all-to-all communications and FFT calculations of the parallel six-step FFT algorithm using 4 SMP threads: (a) Phase 1, scheduling of the all-to-all and the n2 FFT; (b) Phase 2, scheduling of the all-to-all and the n1 FFT. tr1 through tr7 are internal array transposes, and "mul w" is the multiplication by the twiddle factors, which includes an internal array transpose. The threads synchronize with an inter-thread atomic barrier for A2A-1.

To parallelize a DFT on parallel machines, algorithms based on a six-step FFT [16] are generally used [17]. When n can be decomposed as n1 * n2, Equation 3 can be rewritten using 2D arrays as Equation 4:

$y(k_2, k_1) = \sum_{j_1=0}^{n_1-1} \sum_{j_2=0}^{n_2-1} x(j_1, j_2)\, w_{n_2}^{j_2 k_2}\, w_{n_1 n_2}^{j_1 k_2}\, w_{n_1}^{j_1 k_1}$    (4)
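Equation 4 follows from Equation 3 by splitting the indices. Under one common convention, $j = j_1 + n_1 j_2$ and $k = k_2 + n_2 k_1$ (an assumption for this illustration, since the paper does not spell out the index mapping), so that

$w_n^{jk} = w_n^{(j_1 + n_1 j_2)(k_2 + n_2 k_1)} = w_n^{j_1 k_2}\, w_n^{n_2 j_1 k_1}\, w_n^{n_1 j_2 k_2}\, w_n^{n_1 n_2 j_2 k_1} = w_{n_1 n_2}^{j_1 k_2}\, w_{n_1}^{j_1 k_1}\, w_{n_2}^{j_2 k_2},$

because $w_n^{n_2} = w_{n_1}$, $w_n^{n_1} = w_{n_2}$ and $w_n^{n_1 n_2} = 1$. Substituting this into Equation 3 and summing over $j_1$ and $j_2$ gives Equation 4.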

When we use np parallel processes, a 1D array x(n) is distributed so that each process has a partial array x(n/np). This is equivalent to the 2D array x(n1, n2/np). To calculate a 1D FFT using Equation 4 on a parallel computer, three all-to-all communications are needed to transpose the global array. Usually the all-to-all communication is synchronous and the local FFT calculations can only be done once all of the elements of the array have been received, so the performance of the parallel FFT depends largely on the performance of the all-to-all communication. Here is the algorithm for a parallel six-step FFT:
Step 1: Global array transpose, x(n1, n2/np) to t(n2, n1/np)
Step 2: n2 FFT
Step 3: Multiply the elements of the array by the twiddle factor $w_{n_1 n_2}^{j_1 k_2}$
Step 4: Global array transpose, t(n2, n1/np) to t(n1, n2/np)
Step 5: n1 FFT
Step 6: Global array transpose, t(n1, n2/np) to y(n2, n1/np)
In more detail, we also need internal array transposes before and after the all-to-all communications of the global array transposes, and we can merge Step 3 into the internal array transpose needed in Step 4 before the all-to-all communication. A serial sketch of this decomposition is shown below.
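The following is a minimal serial sketch of the six-step decomposition in Equation 4, using a naive O(m^2) DFT in place of an optimized local FFT and assuming the index convention x(j1, j2) = x[j1 + n1*j2], y(k2, k1) = y[k2 + n2*k1]. It illustrates the data flow only; it is not the paper's optimized, distributed implementation, where the transposes in Steps 1, 4 and 6 become the global all-to-all transposes listed above.

#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Naive in-place DFT of length m, standing in for an optimized local FFT. */
static void dft(complex double *v, int m) {
    complex double *t = malloc(m * sizeof *t);
    for (int k = 0; k < m; ++k) {
        t[k] = 0;
        for (int j = 0; j < m; ++j)
            t[k] += v[j] * cexp(-2.0 * M_PI * I * j * k / m);
    }
    for (int k = 0; k < m; ++k) v[k] = t[k];
    free(t);
}

/* Six-step 1D FFT of length n = n1*n2: y[k2 + n2*k1] = sum_j x[j] w_n^(jk). */
static void sixstep_fft(const complex double *x, complex double *y, int n1, int n2) {
    int n = n1 * n2;
    complex double *t = malloc(n * sizeof *t);
    for (int j1 = 0; j1 < n1; ++j1)            /* Step 1: transpose x(n1,n2) -> t(n2,n1) */
        for (int j2 = 0; j2 < n2; ++j2)
            t[j2 + n2 * j1] = x[j1 + n1 * j2];
    for (int j1 = 0; j1 < n1; ++j1)            /* Step 2: n1 local FFTs of length n2     */
        dft(t + n2 * j1, n2);
    for (int j1 = 0; j1 < n1; ++j1)            /* Step 3: twiddle factors w_n^(j1*k2)    */
        for (int k2 = 0; k2 < n2; ++k2)
            t[k2 + n2 * j1] *= cexp(-2.0 * M_PI * I * j1 * k2 / n);
    for (int j1 = 0; j1 < n1; ++j1)            /* Step 4: transpose t(n2,n1) -> y(n1,n2) */
        for (int k2 = 0; k2 < n2; ++k2)
            y[j1 + n1 * k2] = t[k2 + n2 * j1];
    for (int k2 = 0; k2 < n2; ++k2)            /* Step 5: n2 local FFTs of length n1     */
        dft(y + n1 * k2, n1);
    for (int k2 = 0; k2 < n2; ++k2)            /* Step 6: transpose y(n1,n2) -> y(n2,n1) */
        for (int k1 = 0; k1 < n1; ++k1)
            t[k2 + n2 * k1] = y[k1 + n1 * k2];
    for (int i = 0; i < n; ++i) y[i] = t[i];
    free(t);
}

int main(void) {
    int n1 = 3, n2 = 4, n = n1 * n2;
    complex double x[12], y[12], ref[12];
    for (int i = 0; i < n; ++i) ref[i] = x[i] = i + 0.5 * I * i;
    sixstep_fft(x, y, n1, n2);
    dft(ref, n);                               /* direct DFT for comparison */
    double err = 0.0;
    for (int k = 0; k < n; ++k)
        if (cabs(y[k] - ref[k]) > err) err = cabs(y[k] - ref[k]);
    printf("max |six-step - direct| = %g\n", err);   /* should be ~1e-14 */
    return 0;
}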

2) Overlapping parallel six-step FFT algorithm
By replacing the conventional synchronous all-to-all communication with the proposed overlapping all-to-all method using multi-threaded SMP, we can optimize the all-to-all communication parts of the FFT. In addition to the all-to-all communication optimization, we can also optimize the local FFT calculations and the internal array transposes of the FFT by hiding their time behind the longest-axis all-to-all traffic. To integrate the overlapping method of the all-to-all communication into the parallel six-step 1D FFT, the algorithm and the array are divided for pipelining [18]. We separate the six-step FFT algorithm into two phases: Phase-1 combines Step 1 to Step 4, and Phase-2 combines Step 5 and Step 6. Each phase is then pipelined separately: Phase-1 is pipelined n1/np ways and Phase-2 is pipelined n2/np ways, the arrays are divided in the same ways as the pipelines, and the divided arrays are processed in parallel loops using SMP threads. Fig. 5 shows the pipelined procedures of Phase-1 and Phase-2 using the proposed overlapping method for the all-to-all communication. In Phase-1, there are two all-to-all communications for the global array transposes, so A2A-1 and A2A-2 each appear twice in a pipelined iteration. The all-to-all communication is exclusively scheduled using atomic counters as described in Section III.B. In Fig. 5, A2A-1 is the all-to-all communication along the longest axis, and most of the other procedures, including the local FFT calculations, are hidden behind it.

B. Parallel multi-dimensional FFT
The multi-dimensional FFT is calculated by applying successive 1D FFTs in each dimension. A multi-dimensional array is usually decomposed by mapping the problem to the shape of the torus network. For example, with a 3D problem, a 3D array would be decomposed in 3 dimensions and mapped to a 3D torus network by dividing the array by the size of the torus network in each dimension. In this case, we have three sub-communicators, one along each axis of the torus network, and we calculate the 1D parallel FFTs using the overlapping all-to-all method with these sub-communicators.

1) Overlapping parallel 2D FFT
When we map a 2D problem to a 3D torus network, we decompose the array onto two sub-communicators. One is decomposed along an axis, and the other is decomposed on a plane. We apply the overlapped parallel 1D FFT in the two directions independently. Fig. 6 describes the pipelined 1D FFT procedures for symmetric or linear sub-communicators.
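As a hedged illustration of how the axis and plane sub-communicators used by A2A-1 and A2A-2 could be created (the paper does not specify this; the 8x8x16 extents, the dimension order, and the names are assumptions), the standard MPI Cartesian topology calls suffice:

#include <mpi.h>

void make_subcomms(MPI_Comm *comm_axis, MPI_Comm *comm_plane) {
    int dims[3] = {8, 8, 16};            /* one Blue Gene/P rack                 */
    int periods[3] = {1, 1, 1};          /* torus: wraparound links on every axis */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);

    /* A2A-1: keep only the longest (Z) axis -> 1D sub-communicator of 16 ranks */
    int keep_z[3] = {0, 0, 1};
    MPI_Cart_sub(cart, keep_z, comm_axis);

    /* A2A-2: keep the X and Y axes -> 8x8 plane sub-communicator of 64 ranks   */
    int keep_xy[3] = {1, 1, 0};
    MPI_Cart_sub(cart, keep_xy, comm_plane);
}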

Fig. 6. Overlapping FFTs and all-to-all communications on the symmetric sub-communicator using 4 SMP threads. tr1 through tr4 are internal array transposes; the threads synchronize with an inter-thread atomic barrier for A2A.

Fig. 7. Overlapping FFTs and two-phase all-to-all communications on the asymmetric sub-communicator using 4 SMP threads. tr1 through tr6 are internal array transposes; the threads synchronize with an inter-thread atomic barrier for A2A-1.

Fig. 8. Merging two pipelines of the overlapped all-to-all communications and FFTs using 4 SMP threads. tr1 through tr5 are internal array transposes; the threads synchronize with an inter-thread atomic barrier for A2A-1.

For asymmetric sub-communicators, the sub-communicator is divided into two smaller sub-communicators and overlapped as shown in Fig. 7.

2) Overlapping parallel 3D FFT
When we map a 3D problem to a 3D torus network, we decompose the array onto three sub-communicators, one for each axis. Although a 3D FFT is calculated with three independent pipelines of overlapped 1D FFTs, two of the pipelines can be merged into one pipeline, because we need at least two independent pipeline loops for local parallelism to process the pipelined procedures using SMP threads. We can overlap the all-to-all communications on the different axes of the torus network as shown in Fig. 8, and the rest of the pipeline is processed as shown in Fig. 6. For higher dimensions, the pipelines can be merged in the same manner to overlap the procedures for the different dimensions.

C. Performance of the overlapping method of the parallel FFT
We also measured the performance of the overlapping FFT method on Blue Gene/P. We used our own optimized FFT library for the local FFT calculations to exploit the SIMD instructions for complex arithmetic. We compared our overlapping FFT method to a conventional synchronous implementation using MPI_Alltoall. The maximum performance for the synchronous parallel FFT is defined by the theoretical peak bandwidth of the all-to-all communications: the number of floating point operations is divided by the theoretical total time for the all-to-all communications. The number of floating point operations in a 1D FFT for an array of length N is given by Equation 5:

$P(N) = 5N \log(N)/\log(2)$    (5)

The theoretical total time for the all-to-all communications is calculated by dividing the array size by the theoretical peak bandwidth (Equation 2) and multiplying by the number of all-to-all communications needed for the parallel FFT calculation.

Fig. 9 shows the measured performance for a 1D FFT on Blue Gene/P. The curves closely resemble the all-to-all bandwidth curves of the MPI implementation. The performance of the overlapped FFT increases for larger arrays, similar to the bandwidth of the overlapping all-to-all method, and we achieved 60% to 90% of the maximum performance for the parallel 1D FFT calculation as defined by the theoretical peak bandwidth of the all-to-all communication. Our measured results were 1.3 to 1.8 times faster than the synchronous implementation using MPI_Alltoall.

Fig. 10 shows the measured performance for a 2D FFT on Blue Gene/P. We achieved 70% to 85% of the maximum performance for the parallel 2D FFT calculation, and the measured results were 1.3 times faster than the synchronous implementation using MPI_Alltoall.

Fig. 11 shows the measured performance for a 3D FFT on Blue Gene/P. For 3D and higher-dimensional FFTs, by merging two pipelines of overlapping FFTs, two pairs of all-to-all communications are hidden behind the longest-axis traffic, so the maximum performance for the proposed overlapping method is higher than for the MPI implementation. We achieved 75% to 95% of the maximum performance for the parallel 3D FFT calculation as defined by the theoretical peak bandwidth of the all-to-all communication, and the performance of the new overlapping method exceeded the maximum performance of the conventional implementation. The performance of the MPI implementation is stable, while the performance of the overlapping FFT improves with larger arrays, eventually becoming 2.0 times faster than the synchronous implementation using MPI_Alltoall.

V. CONCLUSIONS

In this paper, we presented an overlapping method for all-to-all communication to reduce the network contention in asymmetric torus networks. We improved the all-to-all bandwidth for asymmetric torus networks by pipelining the indirect all-to-all algorithm and by using SMP threads to overlap the all-to-all communications and the internal array transpositions. We also eliminated the performance degradation for larger arrays by processing the internal array transposes in cache memory. It is important to schedule the indirect all-to-all communications to avoid network contention on asymmetric torus networks. However, indirect methods need extra overhead, such as the internal array transpositions, and it is also important to reduce this overhead to improve the performance. The overlapping all-to-all method addresses these problems by hiding the overhead behind the all-to-all communications along the longest axis. The proposed method improves the performance of all-to-all communication by 20% to 30% on an asymmetric torus, and achieves up to 95% of the theoretical peak bandwidth. This pipelining technique for SMP threads is useful not only for torus networks but also for other networks. We are planning to develop and analyze our overlapping all-to-all method on other parallel computers without torus interconnects.

We also improved the performance of the parallel FFT on torus networks by integrating the new overlapping all-to-all method into a parallel six-step FFT algorithm and by hiding the calculation time for the local FFTs behind the all-to-all communications. The proposed method improves performance by 80% to 100% for the parallel 1D FFT and reaches up to 90% of the maximum theoretical performance as defined by the bandwidth of the all-to-all communications. The proposed overlapping 3D FFT exceeds the maximum theoretical performance of the synchronous implementation. We are also planning to find other applications that can be improved by overlapping the all-to-all communications with the calculations.

ACKNOWLEDGEMENTS

We tested and developed our experimental programs on the Watson4P Blue Gene/P system installed at the IBM T.J. Watson Research Center, Yorktown Heights, N.Y. We would like to thank Fred Mintzer for helping us obtain machine time, James Sexton for managing our research project, and Shannon Jacobs for proofreading. We also would like to thank Sameer Kumar, Yogish Sabharwal and Philip Heidelberger.

REFERENCES

[1] P.A. Boyle et al., "Overview of the QCDSP and QCDOC computers," IBM Journal of Research and Development, Vol. 49, No. 2/3, 2005, pp. 351-365.
[2] A. Gara et al., "Overview of the Blue Gene/L System Architecture," IBM Journal of Research and Development, Vol. 49, No. 2/3, 2005, pp. 195-212.
[3] IBM Journal of Research and Development staff, "Overview of the IBM Blue Gene/P project," IBM Journal of Research and Development, Vol. 52, No. 1/2, 2008, pp. 199-220.
[4] G. Goldrian et al., "QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine," Computing in Science & Engineering, Vol. 10, Issue 6, 2008, pp. 46-54.
[5] S.R. Alam et al., "Performance analysis and projections for Petascale applications on Cray XT series systems," 2009 IEEE International Symposium on Parallel & Distributed Processing, 2009, pp. 1-8.
[6] Y. Ajima et al., "Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers," Computer, Vol. 42, No. 11, Nov. 2009, pp. 36-40.
[7] B.G. Fitch et al., "Blue Matter: Approaching the Limits of Concurrency for Classical Molecular Dynamics," SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006, p. 87.
[8] M. Sekijima et al., "Optimization and Evaluation of Parallel Molecular Dynamics Simulation on Blue Gene/L," Proceedings of the IASTED International Conference on Parallel and Distributed Computer and Networks (PDCN 2007), 2007, pp. 257-262.
[9] E. Bohm et al., "Fine grained parallelization of the Car-Parrinello ab initio MD method on Blue Gene/L," IBM Journal of Research and Development, Vol. 52, No. 1/2, 2007, pp. 159-176.
[10] Y.J. Suh and S. Yalamanchili, "All-To-All Communication with Minimum Start-Up Costs in 2D/3D Tori and Meshes," IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 5, 1998, pp. 442-458.
[11] L.V. Kalé et al., "A Framework for Collective Personalized Communication," International Parallel and Distributed Processing Symposium (IPDPS'03), 2003, p. 69a.
[12] S. Souravlas and M. Roumeliotis, "A message passing strategy for array redistributions in a torus network," The Journal of Supercomputing, Vol. 46, No. 1, 2008, pp. 40-57.
[13] S. Kumar et al., "Optimization of All-to-All Communication on the Blue Gene/L," International Conference on Parallel Processing (ICPP'08), 2008, pp. 320-329.
[14] S. Kumar et al., "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," Proceedings of the 22nd Annual International Conference on Supercomputing, 2008, pp. 94-103.
[15] A. Faraj et al., "MPI Collective Communications on the Blue Gene/P Supercomputer: Algorithms and Optimizations," 2009 17th IEEE Symposium on High Performance Interconnects, 2009, pp. 63-72.
[16] D.H. Bailey, "FFTs in external or hierarchical memory," The Journal of Supercomputing, Vol. 4, 1990, pp. 23-35.
[17] D. Takahashi et al., "A Blocking Algorithm for Parallel 1-D FFT on Clusters of PCs," Proc. 8th International Euro-Par Conference (Euro-Par 2002), Lecture Notes in Computer Science, No. 2400, 2002, pp. 691-700.
[18] Y. Sabharwal et al., "Optimization of Fast Fourier Transforms on the Blue Gene/L Supercomputer," International Conference on High Performance Computing (HiPC 2008), 2008, pp. 309-322.

Fig. 9. Performance comparisons for the parallel 1D FFT on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots GFLOPS against the array size (the array size is the length of the complex array, and the point-to-point message length is shown as the label of the horizontal axis) for the overlapping method and the MPI implementation. The maximum performance shows the parallel 1D FFT performance of a synchronous implementation as defined by the theoretical peak bandwidth of the all-to-all communications.

Fig. 10. Performance comparisons for the parallel 2D FFT on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots GFLOPS against the array size for the overlapping method and the MPI implementation. The maximum performance shows the parallel 2D FFT performance of a synchronous implementation as defined by the theoretical peak bandwidth of the all-to-all communications.

Fig. 11. Performance comparisons for the parallel 3D FFT on Blue Gene/P: (a) 512 nodes (symmetric), (b) 1,024 nodes (asymmetric), (c) 2,048 nodes (asymmetric), (d) 4,096 nodes (symmetric). Each panel plots GFLOPS against the array size. There are two different maximum performances: the higher one is the maximum performance for the proposed overlapping method, and the other is the maximum performance for the conventional synchronous implementation using MPI_Alltoall as defined by the theoretical peak bandwidth of the all-to-all communications.
