IMPLEMENTING FAST FOURIER TRANSFORMS ON DISTRIBUTED-MEMORY MULTIPROCESSORS USING DATA REDISTRIBUTIONS 1
S. K. S. GUPTA, C.-H. HUANG, P. SADAYAPPAN
Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210, U.S.A.

and

R. W. JOHNSON
Department of Computer Science, St. Cloud State University, St. Cloud, MN 56301, U.S.A.

Received (received date)
Revised (revised date)
Communicated by (Name of Editor)

ABSTRACT

Implementations of various fast Fourier transform (FFT) algorithms are presented for distributed-memory multiprocessors. These algorithms use data redistribution to localize the computation. The goal is to optimize communication cost by using a minimum number of redistribution steps. Both analytical and experimental performance results on the Intel iPSC/860 system are presented.

Keywords: fast Fourier transform, parallel algorithm, distributed-memory multiprocessor, High Performance Fortran, data distribution, communication optimization.
1. Introduction

This paper presents distributed-memory parallel programs for the fast Fourier transform (FFT) [1, 2] which use data redistribution to localize the computation. In a distributed-memory multiprocessor such as the Intel iPSC/860 or the Thinking Machines CM-5, shared data is distributed across the local memories of interconnected processors. The most commonly used distributions are the block, cyclic, and block-cyclic distributions; these distributions are also proposed in High Performance Fortran (HPF) [3]. Communication is needed when a processor requires data from another processor's local memory. In most distributed-memory multiprocessors the cost of communicating a data value is considerably larger than the cost of a primitive arithmetic operation on that value, so it is important to reduce the communication overhead to achieve high performance. To achieve good scalable performance, programs may have to use different approaches to solve the same problem for different ranges of problem size. In this paper we present FFT algorithms based on redistribution primitives, which perform better than FFT programs based on point-to-point message passing when the problem size is
1 This work was supported in part by ARPA, order number 7898, monitored by NIST under grant number 60NANB1D1151, and ARPA, order number 7899, monitored by NIST under grant number 60NANB1D1150.
large. The overall communication cost in the FFT programs with redistribution decreases due to a reduction in the total communication volume, although the number of transmitted messages increases. Furthermore, using redistribution primitives to eliminate explicit message passing in the computation leads to a more portable implementation, by abstracting away details of the target machine such as the network topology. Both analytical and experimental performance results on an Intel iPSC/860 system show that real benefits can be achieved by such a scheme. For simplicity, only radix-2 decimation-in-time FFT algorithms are developed.

The paper is organized as follows. Section 2 describes the semantics of block-cyclic distributions. In Section 3, we present a distributed-memory node program for the Cooley-Tukey FFT. Section 4 and Section 5 present implementations of the Pease FFT and the Stockham FFT. Performance results are given in Section 6. Conclusions are included in Section 7.
2. Semantics of Data Distribution

Various data distributions of an array correspond to mapping certain bits of the address (the binary representation of the array index) of an element to represent the processor address [4]. One such data distribution is the block-cyclic distribution. A block-cyclic distribution partitions an array into equal-sized blocks of consecutive elements and maps them to the processors in a cyclic manner. The elements mapped to a processor are stored in increasing order of their indices in its local memory. We will use the following convention to express the block-cyclic distribution of a linear array $A(0 : N-1)$ on P processors:
A(cyclic(b)): the block size is b and element i, $0 \leq i < N$, is on processor $(i \;\mathrm{div}\; b) \bmod P$ at local index $(i \bmod b) + (i \;\mathrm{div}\; (Pb)) \cdot b$. The local array $A'(0 : \lceil N/P \rceil - 1)$ will contain the elements of array A mapped to a processor.
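To make this mapping concrete, here is a small Python sketch (ours, not part of the paper) that applies the cyclic(b) formula above to compute the owning processor and local index of a global element:

```python
def cyclic_owner(i, b, P):
    """Owner and local index of global element i under a cyclic(b)
    distribution of A(0:N-1) on P processors (formula from the text)."""
    proc = (i // b) % P                     # (i div b) mod P
    local = (i % b) + (i // (P * b)) * b    # (i mod b) + (i div (P*b)) * b
    return proc, local

# Example: N = 32, P = 4.
# block  distribution = cyclic(N/P) = cyclic(8)
# cyclic distribution = cyclic(1)
print(cyclic_owner(13, 8, 4))   # block:  (1, 5)
print(cyclic_owner(13, 1, 4))   # cyclic: (1, 3)
```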
A(block) and A(cyclic) will be used to express block and cyclic distributions of array A. These distributions are equivalent to cyclic($\lceil N/P \rceil$) and cyclic(1), respectively. Suppose an array of size $2^n$ is block-cyclically distributed on $2^p$ processors. Then the p bits required for the processor address are chosen from the n bits used for array indexing. If $b_{n\ldots1}$ is the index of an array element, then cyclic($2^b$) corresponds to choosing $b_{b+p\ldots b+1}$ as its processor address.a As the array elements are stored in increasing order of their indices in each processor's local memory, the local address of element $b_{n\ldots1}$ on processor $b_{b+p\ldots b+1}$ is $b_{n\ldots b+p+1} b_{b\ldots1}$. We will mark any bit $b_i$ of the global address that is used for processor indexing as $\hat{b}_i$. For example, the global address for the index $b_{n\ldots1}$ would be $\hat{b}_{n\ldots n-p+1} b_{n-p\ldots1}$ for a linear array of size $2^n$ distributed as block. It would be $b_{n\ldots p+1}\hat{b}_{p\ldots1}$ when it is distributed as cyclic. The local address and the processor address can be extracted from a global address by the functions local and proc. For example, local($\hat{b}_{n\ldots n-p+1} b_{n-p\ldots1}$) = $b_{n-p\ldots1}$ and proc($\hat{b}_{n\ldots n-p+1} b_{n-p\ldots1}$) = $b_{n\ldots n-p+1}$. Redistribution corresponds to a remapping of the processor bits. Changing the distribution from cyclic($2^b$) to cyclic($2^{b'}$) implies using bits $b_{b'+p\ldots b'+1}$ of the global address for the processor address instead of bits $b_{b+p\ldots b+1}$.

a The notation $b_{i\ldots j}$, where $i \geq j$, denotes the bit sequence $b_i b_{i-1} \cdots b_j$, and $b_{k\rightarrow j}$, where $k \leq j$, denotes the bit sequence $b_k b_{k+1} \cdots b_j$.
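Under this bit-address view, the proc and local functions can be sketched in Python as bit-field extractions (our illustration; the function names follow the text):

```python
def proc(i, b, p):
    """Processor address of global index i under cyclic(2**b) on 2**p
    processors: bits b+p .. b+1 of i."""
    return (i >> b) & ((1 << p) - 1)

def local(i, b, p):
    """Local address: the remaining bits of i, kept in order
    (bits above the processor field, followed by the low b bits)."""
    return ((i >> (b + p)) << b) | (i & ((1 << b) - 1))

i = 0b01101                            # global index 13 in a 32-element array (n = 5)
print(proc(i, 3, 2), local(i, 3, 2))   # block  = cyclic(2**3): processor 1, local 5
print(proc(i, 0, 2), local(i, 0, 2))   # cyclic = cyclic(2**0): processor 1, local 3
# Redistributing from cyclic(2**b) to cyclic(2**b') just changes which p bits
# are selected as the processor field (the argument b above).
```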
Fig. 1. Cooley-Tukey FFT with point-to-point message passing.
3. Cooley-Tukey FFT

For simplicity, we will assume that the input is in bit-reversed order for the Cooley-Tukey FFT and the Pease FFT. The twiddle factors will be assumed to be replicated on all processors. Furthermore, we will present only radix-2 decimation-in-time FFT algorithms. However, the techniques presented in this paper are also applicable to mixed-radix and decimation-in-frequency FFTs [2]. We will use $N (= 2^n)$ to denote the size of the FFT being performed and $P (= 2^p, p \leq n)$ to denote the number of processors being used to perform the FFT. An FFT is an $O(N \log N)$ algorithm to compute the matrix-vector product:
$B(i) = \sum_{j=0}^{N-1} \omega_N^{ij} A(j)$, where $0 \leq i < N$ and $\omega_N = e^{2\pi\sqrt{-1}/N}$.
In a $2^n$-point Cooley-Tukey FFT [5], there are n steps of computation. At step i, $1 \leq i \leq n$, the following butterfly computation is performed on the elements at addresses $b_{n\ldots i+1} 0 b_{i-1\ldots1}$ and $b_{n\ldots i+1} 1 b_{i-1\ldots1}$:
$B(b_{n\ldots i+1} 0 b_{i-1\ldots1}) = A(b_{n\ldots i+1} 0 b_{i-1\ldots1})$
$B(b_{n\ldots i+1} 1 b_{i-1\ldots1}) = \omega_N^{2^{n-i}(b_{i-1\ldots1})} A(b_{n\ldots i+1} 1 b_{i-1\ldots1})$
$A(b_{n\ldots i+1} 0 b_{i-1\ldots1}) = B(b_{n\ldots i+1} 0 b_{i-1\ldots1}) + B(b_{n\ldots i+1} 1 b_{i-1\ldots1})$
$A(b_{n\ldots i+1} 1 b_{i-1\ldots1}) = B(b_{n\ldots i+1} 0 b_{i-1\ldots1}) - B(b_{n\ldots i+1} 1 b_{i-1\ldots1})$
The above computation is performed on elements whose addresses differ only in bit $b_i$. This butterfly computation will be denoted by $b_{n\ldots i+1} B_i b_{i-1\ldots1}$. A simple way to implement the Cooley-Tukey FFT on a distributed-memory multiprocessor is to use a block distribution for the input array A. The communication required for such a program can be determined from its trace in terms of the global address. For $i \leq n-p$, the computation can be summarized as $\hat{b}_{n\ldots n-p+1} b_{n-p\ldots i+1} B_i b_{i-1\ldots1}$, whereas, for $i > n-p$, it can be summarized as $\hat{b}_{n\ldots i+1} B_i \hat{b}_{i-1\ldots n-p+1} b_{n-p\ldots1}$.
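As a point of reference, the butterfly recurrence of this section corresponds to the usual sequential radix-2 decimation-in-time loop nest; the following Python sketch (ours, not the paper's node program) assumes the input is already in bit-reversed order and uses the paper's convention $\omega_N = e^{2\pi\sqrt{-1}/N}$:

```python
import cmath

def fft_dit(a):
    """Radix-2 decimation-in-time Cooley-Tukey FFT.
    a: list of 2**n complex values, already in bit-reversed order."""
    N = len(a)
    n = N.bit_length() - 1
    for i in range(1, n + 1):              # step i = 1 .. n
        half = 1 << (i - 1)                # paired indices differ in bit b_i
        for base in range(0, N, 2 * half):
            for k in range(half):          # k plays the role of (b_{i-1..1})
                w = cmath.exp(2j * cmath.pi * (k << (n - i)) / N)  # omega_N^(2^(n-i) k)
                u = a[base + k]
                t = w * a[base + half + k]
                a[base + k] = u + t
                a[base + half + k] = u - t
    return a

# fft_dit([1, 0, 0, 0]) -> [(1+0j)]*4: the 4-point transform of a unit impulse.
```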
Fig. 2. Cooley-Tukey FFT with redistribution.
Communication is required for steps where the butterfly is performed on elements differing in bits used for the processor index, i.e., for steps n−p+1 through n. Hence, such a program requires p communication steps; at step n−p+j, $1 \leq j \leq p$, processor $b_{p\ldots1}$ communicates with processor $b_{p\ldots j+1}\bar{b}_j b_{j-1\ldots1}$, where $\bar{b}_j$ is the complement of $b_j$. It can be similarly determined that, if A has a fixed distribution of cyclic($2^b$), then communication is required for steps b+1 to b+p; the rest of the n−p steps are communication-free. For example, Fig. 1 shows the data flow diagram (after bit-reversal) for a 32-point Cooley-Tukey FFT on four processors with a fixed block distribution of the input array. Communication is required for the last two steps (indicated by lines crossing a processor boundary).

If $n \geq 2p$ (i.e., $N \geq P^2$), then by choosing the most significant p bits for the first n−p steps and the least significant p bits for the last p steps to represent the processor address, we can localize the computation for all n steps. This implies that the input array A is initially distributed using a block distribution and then, after performing n−p steps of computation, its distribution is changed to a cyclic distribution. The computation performed on each node before the redistribution is:
$B'(b_{n-p\ldots i+1} 0 b_{i-1\ldots1}) = A'(b_{n-p\ldots i+1} 0 b_{i-1\ldots1})$
$B'(b_{n-p\ldots i+1} 1 b_{i-1\ldots1}) = \omega_N^{2^{n-i}(b_{i-1\ldots1})} A'(b_{n-p\ldots i+1} 1 b_{i-1\ldots1})$
$A'(b_{n-p\ldots i+1} 0 b_{i-1\ldots1}) = B'(b_{n-p\ldots i+1} 0 b_{i-1\ldots1}) + B'(b_{n-p\ldots i+1} 1 b_{i-1\ldots1})$
$A'(b_{n-p\ldots i+1} 1 b_{i-1\ldots1}) = B'(b_{n-p\ldots i+1} 0 b_{i-1\ldots1}) - B'(b_{n-p\ldots i+1} 1 b_{i-1\ldots1})$

After a block to cyclic redistribution, a node with processor identifier pid performs:

$B'(b_{n\ldots i+1} 0 b_{i-1\ldots p+1}) = A'(b_{n\ldots i+1} 0 b_{i-1\ldots p+1})$
$B'(b_{n\ldots i+1} 1 b_{i-1\ldots p+1}) = \omega_N^{2^{n-i}(2^p(b_{i-1\ldots p+1}) + pid)} A'(b_{n\ldots i+1} 1 b_{i-1\ldots p+1})$
$A'(b_{n\ldots i+1} 0 b_{i-1\ldots p+1}) = B'(b_{n\ldots i+1} 0 b_{i-1\ldots p+1}) + B'(b_{n\ldots i+1} 1 b_{i-1\ldots p+1})$
$A'(b_{n\ldots i+1} 1 b_{i-1\ldots p+1}) = B'(b_{n\ldots i+1} 0 b_{i-1\ldots p+1}) - B'(b_{n\ldots i+1} 1 b_{i-1\ldots p+1})$

In general, $\lceil n/(n-p) \rceil$ redistribution steps would be required for any $n \geq p$. If the initial distribution of the input array is to be preserved, then one extra redistribution is required at the end of the computation.
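The block-to-cyclic redistribution used here is an all-to-all personalized communication: each node scans its local block and determines, for every element, the destination processor and destination local index under the cyclic distribution. A minimal sketch of that destination computation (ours; the function name is illustrative):

```python
def block_to_cyclic_destinations(pid, n, p):
    """For each local element of node `pid` under the block distribution of a
    2**n array on 2**p nodes, return (dest_processor, dest_local_index)
    under the cyclic distribution."""
    N, P = 1 << n, 1 << p
    blk = N // P                           # local block size 2**(n-p)
    dests = []
    for j in range(blk):                   # local index j
        i = pid * blk + j                  # block:  global index (proc = high p bits)
        dests.append((i % P, i // P))      # cyclic: proc = low p bits, local = high bits
    return dests

# On 4 nodes with N = 32, node 0 holds globals 0..7; after redistribution
# every node receives exactly 2 = N/P**2 of them, hence P-1 messages of N/P**2
# elements per node (the message size used in the analysis of Section 6).
print(block_to_cyclic_destinations(0, 5, 2))
```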
For example, Fig. 2 shows the data flow diagram (after bit-reversal) for a 32-point Cooley-Tukey FFT on four processors using redistribution. The input array initially has a block distribution, which is changed to a cyclic distribution after the third computation step.

Fig. 3. Pease FFT with point-to-point message passing.
4. Pease FFT

Pease proposed an FFT algorithm suitable for parallel processing [6]. In this algorithm, the array of intermediate results is permuted (using the perfect shuffle) after the butterfly computation at each step so that the two data elements involved in a butterfly computation at the next step are moved adjacent to each other. The following computation is performed at each step i, $1 \leq i \leq n$, of the Pease FFT:
$A(b_{n\ldots2} 1) = \omega_N^{2^{n-i}(b_{n\ldots n-i+2})} A(b_{n\ldots2} 1)$
$B(b_{n\ldots2} 0) = A(b_{n\ldots2} 0) + A(b_{n\ldots2} 1)$
$B(b_{n\ldots2} 1) = A(b_{n\ldots2} 0) - A(b_{n\ldots2} 1)$
$A(b_1 b_{n\ldots2}) = B(b_{n\ldots2} b_1)$.
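A sequential Python sketch (ours) of one Pease step as defined above — a butterfly on adjacent pairs followed by the perfect shuffle; applying steps i = 1, ..., n to a bit-reversed input computes the transform:

```python
import cmath

def pease_step(a, i):
    """Step i (1 <= i <= n) of the 2**n-point Pease FFT:
    twiddle/butterfly on adjacent pairs (bit b_1), then a perfect shuffle."""
    N = len(a)
    n = N.bit_length() - 1
    b = [0j] * N
    for x in range(0, N, 2):
        k = x >> (n - i + 1)               # the top i-1 address bits (b_{n..n-i+2})
        w = cmath.exp(2j * cmath.pi * (k << (n - i)) / N)
        u, t = a[x], w * a[x + 1]
        b[x], b[x + 1] = u + t, u - t
    for x in range(N):                     # perfect shuffle: A(b_1 b_{n..2}) = B(b_{n..2} b_1)
        a[((x & 1) << (n - 1)) | (x >> 1)] = b[x]
    return a

# Usage: for i in range(1, n + 1): pease_step(a, i)
```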
A direct implementation of the Pease FFT algorithm on a distributed-memory multiprocessor would be to distribute array A according to some block-cyclic distribution. Suppose array A is initially distributed using a block distribution. The shuffle permutation at each step, in terms of the global address, can be expressed as:
$A(\hat{b}_1 \hat{b}_{n\ldots n-p+2} b_{n-p+1\ldots2}) = B(\hat{b}_{n\ldots n-p+1} b_{n-p\ldots1})$.

Note that proc($\hat{b}_{n\ldots n-p+1} b_{n-p\ldots1}$) ≠ proc($\hat{b}_1 \hat{b}_{n\ldots n-p+2} b_{n-p+1\ldots2}$), which implies that communication is needed at every step. This is true for any distribution of the input array. To avoid communication at each step, we need to modify the Pease FFT algorithm. We now present a modified Pease FFT algorithm which requires a data redistribution after every n−p steps. When $n \geq 2p$, only a single redistribution is required. Fig. 3 shows the data flow diagram (after bit-reversal) for a 32-point Pease FFT on four processors with the input array initially distributed block-wise. Communication is required for all five steps of the program.
Fig. 4. Pease FFT with redistribution.
In the modified algorithm, a shuffle permutation is performed only on the local address bits. For the first n−p steps, the shuffle permutation is performed on the least significant n−p bits; this preserves bits $b_{n\ldots n-p+1}$ in the most significant p address positions. Similarly, for the last p steps, the shuffle permutation is performed only on the most significant p bits, preserving bits $b_{p\ldots1}$ in the least significant p address positions. This modification permits the use of a block distribution for the first n−p steps and a cyclic distribution for the last p steps, with the desirable property that communication is only needed for a redistribution after step n−p. The computation performed on each node before the redistribution is:
$A'(b_{n-p\ldots2} 1) = \omega_N^{2^{n-i}(b_{n-p\ldots n-p-i+2})} A'(b_{n-p\ldots2} 1)$
$B'(b_{n-p\ldots2} 0) = A'(b_{n-p\ldots2} 0) + A'(b_{n-p\ldots2} 1)$
$B'(b_{n-p\ldots2} 1) = A'(b_{n-p\ldots2} 0) - A'(b_{n-p\ldots2} 1)$
$A'(b_1 b_{n-p\ldots2}) = B'(b_{n-p\ldots2} b_1)$.

The computation performed on each node after a block to cyclic redistribution is:

$A'(b_{n\ldots n-p+2} 1 b_{n-p\ldots p+1}) = \omega_N^{2^{n-i}\phi(j,pid)} A'(b_{n\ldots n-p+2} 1 b_{n-p\ldots p+1})$
$B'(b_{n\ldots n-p+2} 0 b_{n-p\ldots p+1}) = A'(b_{n\ldots n-p+2} 0 b_{n-p\ldots p+1}) + A'(b_{n\ldots n-p+2} 1 b_{n-p\ldots p+1})$
$B'(b_{n\ldots n-p+2} 1 b_{n-p\ldots p+1}) = A'(b_{n\ldots n-p+2} 0 b_{n-p\ldots p+1}) - A'(b_{n\ldots n-p+2} 1 b_{n-p\ldots p+1})$
$A'(b_{n-p+1} b_{n\ldots n-p+2} b_{n-p\ldots p+1}) = B'(b_{n\ldots n-p+2} b_{n-p+1} b_{n-p\ldots p+1})$,

where $j = i - n + p$ and $\phi(j, pid) = 2^{n-p}(b_{n\ldots n-j+2}) + 2^p(b_{n-p\ldots p+1}) + pid$.

A trace of the modified Pease FFT algorithm in terms of the initial global address bit pattern is shown below. Bits $b_{n\ldots n-p+1}$ are fixed to represent the processor address for the first n−p steps and $b_{p\ldots1}$ are fixed to represent the processor address for the last p steps of the computation.
Steps ia and ib represent the computation phase and the permutation phase, respectively, at step i.

Step       | global addr.                                                       | proc. addr.         | local addr.
1a         | $\hat{b}_{n\ldots n-p+1} b_{n-p\ldots2} B_1$                       | $b_{n\ldots n-p+1}$ | $b_{n-p\ldots1}$
1b         | $\hat{b}_{n\ldots n-p+1} b_1 b_{n-p\ldots2}$                       | $b_{n\ldots n-p+1}$ | $b_1 b_{n-p\ldots2}$
2a         | $\hat{b}_{n\ldots n-p+1} b_1 b_{n-p\ldots3} B_2$                   | $b_{n\ldots n-p+1}$ | $b_1 b_{n-p\ldots2}$
2b         | $\hat{b}_{n\ldots n-p+1} b_{2\ldots1} b_{n-p\ldots3}$              | $b_{n\ldots n-p+1}$ | $b_{2\ldots1} b_{n-p\ldots3}$
...        | ...                                                                | ...                 | ...
(n−p)b     | $\hat{b}_{n\ldots n-p+1} b_{n-p\ldots1}$                           | $b_{n\ldots n-p+1}$ | $b_{n-p\ldots1}$
redist.    | $b_{n\ldots p+1}\hat{b}_{p\ldots1}$                                | $b_{p\ldots1}$      | $b_{n\ldots n-p+1} b_{n-p\ldots p+1}$
(n−p+1)a   | $b_{n\ldots n-p+2} B_{n-p+1} b_{n-p\ldots p+1}\hat{b}_{p\ldots1}$  | $b_{p\ldots1}$      | $b_{n\ldots n-p+1} b_{n-p\ldots p+1}$
(n−p+1)b   | $b_{n-p+1} b_{n\ldots n-p+2} b_{n-p\ldots p+1}\hat{b}_{p\ldots1}$  | $b_{p\ldots1}$      | $b_{n-p+1} b_{n\ldots n-p+2} b_{n-p\ldots p+1}$
...        | ...                                                                | ...                 | ...
nb         | $b_{n\ldots n-p+1} b_{n-p\ldots p+1}\hat{b}_{p\ldots1}$            | $b_{p\ldots1}$      | $b_{n\ldots n-p+1} b_{n-p\ldots p+1}$
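As a small illustration (ours, not from the paper), the restricted shuffles of the modified algorithm act on a bit field of the local address: the whole (n−p)-bit local address during the block phase, and only its top p bits during the cyclic phase (assuming $n \geq 2p$):

```python
def shuffle_field(j, lo, width):
    """Map an old local address j to its new address when the `width`-bit field
    starting at bit position `lo` (0-based) is perfect-shuffled: the lowest bit
    of the field moves to the top of the field, the rest shift down."""
    mask = ((1 << width) - 1) << lo
    field = (j & mask) >> lo
    rot = ((field & 1) << (width - 1)) | (field >> 1)
    return (j & ~mask) | (rot << lo)

# Data movement per step: A_local[shuffle_field(j, ...)] = B_local[j]
# block phase (first n-p steps):  shuffle_field(j, 0, n - p)        # whole local address
# cyclic phase (last p steps):    shuffle_field(j, n - 2 * p, p)    # top p bits only
```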
For example, Fig. 4 shows the data flow diagram (after bit-reversal) for a 32-point Pease FFT on four processors, which uses a single redistribution. The input array has a block distribution for the first three steps and a cyclic distribution for the remaining two steps. Communication is needed only to change the distribution from block to cyclic, after the third step. The Korn-Lambiotte FFT [7], which was developed for vector processing, can be similarly implemented on a distributed-memory vector multiprocessor. The Korn-Lambiotte FFT performs a shuffle permutation at the beginning of each step so that a fixed vector length of $2^{n-1}$ is achieved. Using a technique similar to the one used for the Pease FFT, the Korn-Lambiotte FFT can be modified so that only a single redistribution is required when $n \geq 2p$. Further, a fixed vector length of $2^{n-p-1}$ is achieved on each node.
5. Stockham FFT
In the Stockham FFT [8], at step i, $1 \leq i \leq n$, a permutation corresponding to a right cyclic shift of the most significant i address bits is performed on the input vector. The input vector is then multiplied by the appropriate twiddle factors and a butterfly operation is performed with the lower half and upper half of the input vector as its two inputs. The Stockham FFT has the bit-reversal permutation implicitly embedded in its computation and therefore does not require the initial bit-reversal needed in the Cooley-Tukey and Pease FFTs. The following computation is performed at step i:

$B(b_{n-i+1} b_{n\ldots n-i+2} b_{n-i\ldots1}) = \omega_N^{2^{n-i}\phi(i)} A(b_{n\ldots n-i+2} b_{n-i+1} b_{n-i\ldots1})$
$A(0 b_{n\ldots n-i+2} b_{n-i\ldots1}) = B(0 b_{n\ldots n-i+2} b_{n-i\ldots1}) + B(1 b_{n\ldots n-i+2} b_{n-i\ldots1})$
$A(1 b_{n\ldots n-i+2} b_{n-i\ldots1}) = B(0 b_{n\ldots n-i+2} b_{n-i\ldots1}) - B(1 b_{n\ldots n-i+2} b_{n-i\ldots1})$,

where $\phi(i) = (b_{n-i+1})(b_{n\ldots n-i+2})$.
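A sequential Python sketch (ours) of one Stockham step as written above: the permutation is a right cyclic shift of the top i address bits, the twiddle is applied only to elements whose permuted address has a 1 in the top bit, and the butterfly combines the lower and upper halves of the array:

```python
import cmath

def stockham_step(a, i):
    """Step i of the 2**n-point Stockham FFT (natural-order input):
    permute with twiddles, then butterfly the lower and upper halves."""
    N = len(a)
    n = N.bit_length() - 1
    b = [0j] * N
    for x in range(N):                     # x has bit pattern b_{n..n-i+2} b_{n-i+1} b_{n-i..1}
        top = x >> (n - i)                 # top i bits of the A-address
        low = x & ((1 << (n - i)) - 1)
        moved = top & 1                    # bit b_{n-i+1}
        rest = top >> 1                    # bits b_{n..n-i+2}
        y = (moved << (n - 1)) | (rest << (n - i)) | low   # B-address
        w = cmath.exp(2j * cmath.pi * ((moved * rest) << (n - i)) / N)
        b[y] = w * a[x]
    half = N // 2
    for x in range(half):                  # butterfly lower/upper halves
        a[x], a[x + half] = b[x] + b[x + half], b[x] - b[x + half]
    return a

# Usage: for i in range(1, n + 1): stockham_step(a, i)   # no bit-reversal needed
```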
To implement the Stockham FFT on a $2^p$-processor distributed-memory multiprocessor, it can be determined that if we choose the least significant p bits for the processor address, i.e., a cyclic distribution, then the first n−p steps are communication-free. This is because the least significant p bits are not affected by the permutations performed in these steps. However, communication will be required for the remaining p steps. Redistributing the input array to any other distribution will also require p steps of communication. For example, Fig. 5 shows the data flow diagram for a 32-point Stockham FFT on four processors, with the input array initially distributed cyclically.
Fig. 5. Stockham FFT with point-to-point message passing.
In order to use redistributions to eliminate subsequent communication steps, we modify the Stockham FFT algorithm. Now assuming that $n \geq 2p$, a trace of the modified algorithm is shown below. Steps ia and ib represent the permutation phase and the computation phase, respectively, at step i.

Step       | global addr.
1a         | $b_{n\ldots p+1}\hat{b}_{p\ldots1}$
1b         | $B_n b_{n-1\ldots p+1}\hat{b}_{p\ldots1}$
...        | ...
(n−p)a     | $b_{p+1\rightarrow n}\hat{b}_{p\ldots1}$
(n−p)b     | $B_{p+1} b_{p+2\rightarrow n}\hat{b}_{p\ldots1}$
redist.    | $b_{p+1\rightarrow n-p}\hat{b}_{n-p+1\rightarrow n} b_{p\ldots1}$
perm.      | $b_{p+1\rightarrow n-p} b_{p\ldots1}\hat{b}_{n-p+1\rightarrow n}$
(n−p+1)a   | $b_{p\rightarrow n-p} b_{p-1\ldots1}\hat{b}_{n-p+1\rightarrow n}$
(n−p+1)b   | $B_p b_{p+1\rightarrow n-p} b_{p-1\ldots1}\hat{b}_{n-p+1\rightarrow n}$
...        | ...
na         | $b_{1\rightarrow n-p}\hat{b}_{n-p+1\rightarrow n}$
nb         | $B_1 b_{2\rightarrow n-p}\hat{b}_{n-p+1\rightarrow n}$
Assuming that the global arrays A and B both have an initial cyclic distribution, the computation performed on each node for the first n−p steps is:

$B'(b_{n-i+1} b_{n\ldots n-i+2} b_{n-i\ldots p+1}) = \omega_N^{2^{n-i}\phi(i)} A'(b_{n\ldots n-i+2} b_{n-i+1} b_{n-i\ldots p+1})$
$A'(0 b_{n\ldots n-i+2} b_{n-i\ldots p+1}) = B'(0 b_{n\ldots n-i+2} b_{n-i\ldots p+1}) + B'(1 b_{n\ldots n-i+2} b_{n-i\ldots p+1})$
$A'(1 b_{n\ldots n-i+2} b_{n-i\ldots p+1}) = B'(0 b_{n\ldots n-i+2} b_{n-i\ldots p+1}) - B'(1 b_{n\ldots n-i+2} b_{n-i\ldots p+1})$,

where $\phi(i) = (b_{n-i+1})(b_{n\ldots n-i+2})$. After performing this computation, each node copies $A'(j)$ to $B'(j)$, $0 \leq j < 2^{n-p}$. The distribution of array B is then changed to cyclic($2^p$) (step redist. in the trace above). Subsequently, an address-bit permutation, which corresponds to exchanging bits $b_{n-p+1\rightarrow n}$ with $b_{p\ldots1}$, is performed (step perm. in the trace above). This permutation is performed by copying $B(b_{p+1\rightarrow n-p}\hat{b}_{n-p+1\rightarrow n} b_{p\ldots1})$ to $A(b_{p+1\rightarrow n-p} b_{p\ldots1}\hat{b}_{n-p+1\rightarrow n})$.
Fig. 6. Stockham FFT with redistribution.
Since we have

proc($b_{p+1\rightarrow n-p}\hat{b}_{n-p+1\rightarrow n} b_{p\ldots1}$) = proc($b_{p+1\rightarrow n-p} b_{p\ldots1}\hat{b}_{n-p+1\rightarrow n}$)

and

local($b_{p+1\rightarrow n-p}\hat{b}_{n-p+1\rightarrow n} b_{p\ldots1}$) = local($b_{p+1\rightarrow n-p} b_{p\ldots1}\hat{b}_{n-p+1\rightarrow n}$),

no communication is needed for this copying. Assuming $j = i - n + p$, the following computation is performed locally on each node for the remaining p steps:
$B'(b_{2p-j+1} b_{n\ldots2p-j+2} b_{2p-j\ldots p+1}) = \omega_N^{2^{n-i}\phi(j,pid)} A'(b_{n\ldots2p-j+2} b_{2p-j+1} b_{2p-j\ldots p+1})$
$A'(0 b_{n\ldots2p-j+2} b_{2p-j\ldots p+1}) = B'(0 b_{n\ldots2p-j+2} b_{2p-j\ldots p+1}) + B'(1 b_{n\ldots2p-j+2} b_{2p-j\ldots p+1})$
$A'(1 b_{n\ldots2p-j+2} b_{2p-j\ldots p+1}) = B'(0 b_{n\ldots2p-j+2} b_{2p-j\ldots p+1}) - B'(1 b_{n\ldots2p-j+2} b_{2p-j\ldots p+1})$,

where $\phi(j, pid) = (b_{2p-j+1})\,(2^{p+j-1}(b_{n\ldots2p+1}) + 2^{j-1}\,pid + (b_{2p\ldots2p-j+2}))$.

As the distribution of array A remains unchanged, no final redistribution is needed to
preserve the same distribution for the input and the output. Hence, the Stockham FFT can be performed with only one redistribution. Furthermore, the Stockham FFT does not require a bit-reversal permutation. This leads to a communication-efficient distributed-memory implementation of the Stockham FFT. For example, Fig. 6 shows the data flow diagram for a 32-point Stockham FFT on four processors using redistribution.
6. Performance Results

We now present analytical and experimental performance results for the Cooley-Tukey, Pease, and Stockham FFTs on the Intel iPSC/860 distributed-memory hypercube. The Cooley-Tukey FFT requires communication only during the redistribution step, where the distribution of array A is changed from block to cyclic. If the initial distribution of array A is to be preserved, then another redistribution is required at the end of the program. Hence, two versions of the Cooley-Tukey FFT were implemented, one using only a single redistribution step (CT-1R) and another performing an additional redistribution at the end
Table 1. Number of redistributions required for various FFT algorithms when $n \geq 2p$.

Algorithm | Preserving initial distribution | Not preserving initial distribution
CT        | 2 + 1 (for bit-reversal)        | 1 + 1 (for bit-reversal)
PS        | 2 + 1 (for bit-reversal)        | 1 + 1 (for bit-reversal)
ST        | 1 (no bit-reversal)             | -
Table 2. Estimated communication time (sec.) for CT-PP and CT-1R (without bit-reversal) for P = 32.

log(N) | T_CT-PP  | T_CT-1R  | T_CT-PP / T_CT-1R
12     | 0.003007 | 0.007796 | 0.385741
13     | 0.005045 | 0.008191 | 0.615931
14     | 0.009121 | 0.008981 | 1.015592
15     | 0.017272 | 0.010560 | 1.635601
16     | 0.033574 | 0.013718 | 2.447362
17     | 0.066178 | 0.020035 | 3.303051
18     | 0.131386 | 0.032669 | 4.021679
19     | 0.261803 | 0.057938 | 4.518695
20     | 0.522636 | 0.108474 | 4.818070
to restore the original distribution (CT-2R). Similarly, for the Pease FFT, two versions were implemented: PS-1R and PS-2R. The program for the Stockham FFT (ST-1R) requires communication only for redistributing array B from a cyclic to a cyclic($2^p$) distribution. The redistribution requirements for the Cooley-Tukey FFT, Pease FFT, and Stockham FFT are summarized in Table 1.

Parallel programs which use point-to-point message passing were implemented for the Cooley-Tukey FFT (CT-PP) and the Stockham FFT (ST-PP). These programs require log(P) communication steps [9]. All the communication in CT-PP is nearest-neighbor. PS-PP was not implemented as its performance is certain to be worse than that of CT-PP, due to the log(N) communication steps needed. In all programs except ST-1R and ST-PP, there is an initial bit-reversal permutation step which requires one additional redistribution.

We now compare the communication requirements of the CT-PP and CT-1R algorithms. CT-PP requires log(P) messages, each of size N/P. CT-1R requires a block to cyclic redistribution involving an all-to-all personalized communication, which can be performed using the pairwise exchange algorithm [10]. The pairwise exchange algorithm is contention-free on hypercubes. In each of the P−1 steps of this algorithm, each processor sends and receives a message of size $N/P^2$. Not including the bit-reversal, the total communication volume for CT-1R is $P \cdot (P-1) \cdot N/P^2 \approx N$, whereas for CT-PP it is $P \cdot (N/P) \cdot \log(P) = N \log(P)$. If $t_s$ is the message setup time and $t_p$ is the link transmission time per data element, then the time spent in communication per processor can be estimated as

$T_{CT\text{-}PP} = \log(P)\, t_s + (\log(P)\, N/P)\, t_p$  and  $T_{CT\text{-}1R} = (P-1)\, t_s + ((P-1)\, N/P^2)\, t_p$.

The Intel iPSC/860 is a circuit-switched hypercube. The communication time (in $\mu$sec) for a buffered (UNFORCED) message of size m bytes over a distance of d is approximately $164 + 0.398m + 29.9d$ [10]. Therefore, $t_s = 164\,\mu$sec and $t_p = 8 \times 0.398 = 3.184\,\mu$sec (because one complex number has eight bytes). For CT-PP, all the messages are nearest-neighbor. Therefore, the total communication time for CT-PP is:
$T_{CT\text{-}PP} = 164 \log(P) + 3.184 (\log(P)\, N/P) + 29.9 \log(P)$.
For CT-1R, the average distance a message travels is $\log(P)/2$; therefore, the communication time is:
$T_{CT\text{-}1R} = 164 (P-1) + 3.184 ((P-1)\, N/P^2) + (29.9 \log(P) (P-1))/2$.

Table 2 shows the estimated communication time for CT-PP and CT-1R on 32 processors. For large values of N, it is evident that the communication time for CT-1R is much smaller than that for CT-PP even though more messages are exchanged in CT-1R. A similar analysis can be performed for the Pease FFT and the Stockham FFT.
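These two estimates are easy to tabulate; the following Python sketch (ours) plugs in the stated iPSC/860 parameters and reproduces the Table 2 entries:

```python
from math import log2

def t_ct_pp(n_log, P):
    """Estimated CT-PP communication time (seconds) on the iPSC/860:
    log(P) nearest-neighbor messages of N/P complex elements."""
    N, lp = 1 << n_log, log2(P)
    return (164 * lp + 3.184 * lp * N / P + 29.9 * lp) * 1e-6

def t_ct_1r(n_log, P):
    """Estimated CT-1R communication time (seconds): P-1 messages of N/P**2
    elements (pairwise exchange), average distance log(P)/2."""
    N, lp = 1 << n_log, log2(P)
    return (164 * (P - 1) + 3.184 * (P - 1) * N / P**2
            + 29.9 * lp * (P - 1) / 2) * 1e-6

for n_log in range(12, 21):
    pp, r1 = t_ct_pp(n_log, 32), t_ct_1r(n_log, 32)
    print(n_log, round(pp, 6), round(r1, 6), round(pp / r1, 6))
```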
Fig. 7. FFT on iPSC/860 with P=16 and P=32.
(Two panels: P=16 and P=32. x-axis: N (× 1024); y-axis: finish time (sec.); curves: CT-PP, CT-1R, CT-2R, PS-1R, PS-2R, ST-PP, ST-1R.)
Figure 7 shows performance results for CT-PP, CT-1R, CT-2R, PS-1R, PS-2R, ST-PP, and ST-1R for data sizes ranging from 1K to 2M, on subcubes of size 16 and 32. The redistribution routines were implemented using the iPSC's communication library. Timings were measured using the millisecond node timer mclock. From Figure 7, it can be observed that CT-2R performs nearly as well as or better than CT-PP. CT-1R performs better than both CT-2R and CT-PP. PS-1R and PS-2R perform better than all the Cooley-Tukey programs. ST-1R performs better than ST-PP and has the best performance among all the FFT implementations. These results indicate that efficient portable programs can be implemented using redistribution primitives.
7. Conclusions

We have presented FFT programs in which communication is expressed through data redistribution primitives. The resulting programs have reduced communication volume compared to point-to-point message-passing programs when the problem size is large.
Acknowledgments

We thank the referees for their suggestions. We are grateful to NASA Lewis Research Center for providing access to the Intel iPSC/860 system.
References

1. J. R. Johnson, R. W. Johnson, D. Rodriguez, and R. Tolimieri, A methodology for designing, modifying, and implementing Fourier transform algorithms on various architectures, Circuits, Systems, and Signal Processing, 9 (1990) 449–500.
2. C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, 1992.
3. High Performance Fortran Forum, High Performance Fortran language specification, Technical Report CRPC-TR92225, Rice University, 1993.
4. S. K. S. Gupta, S. D. Kaushik, C.-H. Huang, J. R. Johnson, R. W. Johnson, and P. Sadayappan, A methodology for the generation of data distributions to optimize communication, Proc. Fourth IEEE Symposium on Parallel and Distributed Processing, Dec. 1992, 436–441.
5. J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, 19 (1965) 297–301.
6. M. C. Pease, An adaptation of the fast Fourier transform for parallel processing, J. ACM, 15(2) (1968) 252–264.
7. D. G. Korn and J. J. Lambiotte, Computing the fast Fourier transform on a vector computer, Mathematics of Computation, 33 (1979) 977–992.
8. T. G. Stockham, High speed convolution and correlation, Proc. Spring Joint Computer Conf., AFIPS, 1966, 229–233.
9. P. N. Swarztrauber, Multiprocessor FFT's, Parallel Computing, 5 (1987) 197–210.
10. S. H. Bokhari, Complete Exchange on the iPSC-860, ICASE Report 91-4, 1991.