ON SOME BLOCK ALGORITHMS FOR FAST FOURIER TRANSFORMS

MARKUS HEGLAND
Abstract. Since the publication of the Cooley-Tukey algorithm a variety of related algorithms for fast Fourier transforms have been suggested by Gentleman/Sande, Stockham, Pease, Johnson/Burrus and others. These algorithms differ mainly in the way they store and access their data. From a numerical point of view, they use a splitting which is related to a factorization of the complex Fourier transform matrix into two factors. One factor is usually a multiple Fourier transform of slightly smaller order than the original and the other is a scaled and permuted multiple Fourier transform of minimal order. Depending on the relative position of these matrices in the factorization we speak of decimation in time or decimation in frequency. Although these algorithms contain large amounts of parallelism, it does not appear in a uniform way, and thus these algorithms are often not optimal for vector or parallel processing as they may generate, e.g., bank conflicts or short vector lengths. This is corrected by a slightly different kind of algorithm, called segmented, which actually corresponds to blocking. These algorithms have been proven efficient on vector computers by Ashworth/Lyne and Bailey and on parallel computers by Swarztrauber. The idea is to initially factor the FFT matrix into two multiple Fourier transforms of about the same size and to use the traditional algorithms on each of the factors. It is known that this class of algorithms needs slightly more floating point operations, but the savings due to regular data access outweigh this disadvantage by far. Recently, a recursive and in-place implementation of this idea has been suggested by the author. We will present a systematic and unified approach to these blocking algorithms for FFTs. The presentation is centered around the blocking of a triangular matrix which is related to the algorithms. A revised version of this paper appeared in the Proceedings of the Computational Techniques and Applications Conference CTAC93, Canberra, pp. 276-284, 1994, preprint as CMA report CMA-MR51-93.
1. Introduction

Fast Fourier transform algorithms [Loa92] are a centerpiece of any numerical algorithm collection. They use the representation of indices as arrays of digits to reduce operation counts from $O(n^2)$ to $O(n \log(n))$. Although these algorithms are extremely parallel, this parallelism is not regular enough for some hardware, especially vector processors, and so the classical algorithms need very careful tuning. By analyzing the connection between the index representation and the fast Fourier transform it is possible to obtain some new and very efficient algorithms. This connection is discussed in Section 2 and some known algorithms are derived in Section 3. The last section discusses our new algorithms. Matrix factorizations relating to fast Fourier transform algorithms (including some of our new ones) can be found in [Heg92].
2. Interpretation of a one-dimensional Fourier transform as a t-dimensional generalized Fourier transform
Let $Z_n := \{0, \ldots, n-1\}$ be the set of nonnegative integers less than $n$. In the context of Fourier transforms it is very natural, and simplifies notation, if $n$-dimensional complex vectors are interpreted as mappings on $Z_n$, i.e., $C^n := C^{Z_n}$. The one-dimensional discrete Fourier transform is a linear operator on $C^n$, defined by $y = F_n x$,

(2.1)  $y(k) = \sum_{i=0}^{n-1} \omega_n^{ki} x(i), \qquad k \in Z_n,$

where $\omega_n := e^{-2\pi i/n}$ is an $n$-th root of 1. A $t$-dimensional transform is an operator on $C^{\mathbf{n}} := C^{Z_{\mathbf{n}}}$, where the grid $Z_{\mathbf{n}}$ is the Cartesian product $Z_{\mathbf{n}} = Z_{n_0} \times \cdots \times Z_{n_{t-1}}$ and $\mathbf{n} = (n_0, \ldots, n_{t-1}) \in \mathbf{N}^t$. We will show how Fourier transforms are related to $t$-dimensional transforms. The basic tool is a mapping between the grid $Z_{\mathbf{n}}$ and the set of integers $Z_{|\mathbf{n}|}$, where $|\mathbf{n}| := n_0 \cdots n_{t-1}$. This mapping is called the mixed-radix representation $\pi_{\mathbf{n}} : Z_{\mathbf{n}} \to Z_{|\mathbf{n}|}$ and is defined as

(2.2)  $\pi_{\mathbf{n}}(\mathbf{i}) = i_0 + i_1 n_0 + \cdots + i_{t-1} n_0 \cdots n_{t-2}, \qquad \mathbf{i} = (i_0, \ldots, i_{t-1}) \in Z_{\mathbf{n}}.$
The array $\mathbf{n}$ is the radix vector and $\mathbf{i}$ is the digit vector of $i = \pi_{\mathbf{n}}(\mathbf{i})$. Radix 10 and radix 2 representations (meaning that the radix vectors have only components 10 or 2) are the decimal and binary number systems, respectively. Mixed-radix representations have different radices for the different digits and are discussed, e.g., in [Knu81]. It is well known that $\pi_{\mathbf{n}}$ is bijective, and so the mapping between $C^{|\mathbf{n}|}$ and $C^{\mathbf{n}}$ defined by $x \in C^{|\mathbf{n}|} \mapsto x \circ \pi_{\mathbf{n}} \in C^{\mathbf{n}}$ is an isomorphism. Now let $L(U, V)$ be the space of linear mappings between the vector spaces $U$ and $V$. Then the isomorphism defined by the mixed-radix representation generates in a canonical way an isomorphism between one-dimensional operators $A \in L(C^n, C^m)$ and multidimensional operators $B \in L(C^{\mathbf{n}}, C^{\mathbf{m}})$, as suggested by the following diagram:

(2.3)
$$\begin{array}{ccc}
C^{|\mathbf{n}|} & \stackrel{\cong}{\longrightarrow} & C^{\mathbf{n}} \\
\downarrow A & & \downarrow B \\
C^{|\mathbf{m}|} & \stackrel{\cong}{\longrightarrow} & C^{\mathbf{m}}
\end{array}$$
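To make this machinery concrete, here is a small Python sketch (our own illustration, not from the paper; the helper names pi and pi_inv are our choices) of the mixed-radix representation (2.2), its inverse, and the induced reindexing $x \mapsto x \circ \pi_{\mathbf{n}}$. Note that $\pi_{\mathbf{n}}$ coincides with column-major (Fortran) array ordering.

```python
# Illustrative sketch (names are ours): the mixed-radix
# representation pi_n of (2.2) and its inverse.
import numpy as np

def pi(n, digits):
    """Map a digit vector (i_0, ..., i_{t-1}) to the integer
    i_0 + i_1*n_0 + i_2*n_0*n_1 + ...  (equation (2.2))."""
    value, weight = 0, 1
    for i_r, n_r in zip(digits, n):
        value += i_r * weight
        weight *= n_r
    return value

def pi_inv(n, i):
    """Inverse of pi: extract the digit vector of i with radix vector n."""
    digits = []
    for n_r in n:
        digits.append(i % n_r)
        i //= n_r
    return tuple(digits)

n = (2, 3, 4)                      # radix vector, |n| = 24
assert all(pi(n, pi_inv(n, i)) == i for i in range(24))   # bijectivity

# The isomorphism x -> x o pi_n: a vector of length |n| viewed as a
# t-dimensional array X with X[i_0,...,i_{t-1}] = x(pi_n(i)).
x = np.arange(24)
X = np.empty(n)
for idx in np.ndindex(*n):
    X[idx] = x[pi(n, idx)]
# Equivalently: Fortran (column-major) ordering matches pi_n.
assert np.array_equal(X, x.reshape(n, order="F"))
```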
Thus instead of studying the one-dimensional operators we can study multidimensional ones. We will use the following "left-upper triangular" integer matrix $B_{\mathbf{n}}$ with entries $B_{\mathbf{n}}[r, s]$, $r, s = 0, \ldots, t-1$, defined as

(2.4)
$$B_{\mathbf{n}}[r, s] = \begin{cases} |\mathbf{n}| / (n_s \cdots n_{t-r-1}) & \text{for } r + s \le t - 1 \\ 0 & \text{else.} \end{cases}$$

In the case of radix 2, i.e., $\mathbf{n} = (2, 2, \ldots, 2)$, we get the following Hankel matrix:

$$B_{\mathbf{n}} = \begin{pmatrix}
1 & 2 & \cdots & 2^{t-2} & 2^{t-1} \\
2 & 4 & \cdots & 2^{t-1} & \\
\vdots & \vdots & \iddots & & \\
2^{t-2} & 2^{t-1} & & & \\
2^{t-1} & & & &
\end{pmatrix}.$$

The next proposition shows a relation between the mixed-radix representation and the matrix $B_{\mathbf{n}}$.

Proposition 2.1. Let $\bar{\mathbf{n}} = (n_{t-1}, \ldots, n_0)$ be the reverse of $\mathbf{n} = (n_0, \ldots, n_{t-1}) \in \mathbf{N}^t$. Then the following holds:

(2.5)  $\pi_{\bar{\mathbf{n}}}(\mathbf{k}) \, \pi_{\mathbf{n}}(\mathbf{i}) = \mathbf{k}^T B_{\mathbf{n}} \mathbf{i} \mod |\mathbf{n}| \qquad \text{for } \mathbf{k} \in Z_{\bar{\mathbf{n}}}, \; \mathbf{i} \in Z_{\mathbf{n}}.$
Proof. Let $b := (1, n_0, n_0 n_1, \ldots, n_0 \cdots n_{t-2})^T$ and $c := (1, n_{t-1}, n_{t-1} n_{t-2}, \ldots, n_{t-1} \cdots n_1)^T$. Then $\pi_{\mathbf{n}}(\mathbf{i}) = b^T \mathbf{i}$ and $\pi_{\bar{\mathbf{n}}}(\mathbf{k}) = c^T \mathbf{k}$. From this we get $\pi_{\bar{\mathbf{n}}}(\mathbf{k}) \, \pi_{\mathbf{n}}(\mathbf{i}) = \mathbf{k}^T B' \mathbf{i}$ with $B' := c b^T$. The matrix elements of this bilinear form are computed from the definitions of $c$ and $b$ as

$$B'[r, s] = (n_{t-1} \cdots n_{t-r})(n_0 \cdots n_{s-1}) = \begin{cases}
|\mathbf{n}| / (n_s \cdots n_{t-r-1}) & \text{if } s < t - r \\
|\mathbf{n}| \, (n_{t-r} \cdots n_{s-1}) & \text{if } s > t - r \\
|\mathbf{n}| & \text{if } s = t - r.
\end{cases}$$

From this we see that $B' = B_{\mathbf{n}} \mod |\mathbf{n}|$, which completes the proof.
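As a sanity check, the following sketch (ours) builds the exponent matrix from (2.4) and verifies identity (2.5) exhaustively for one small radix vector:

```python
# Illustrative check of Proposition 2.1 (helper names are ours).
import itertools
import numpy as np

def exponent_matrix(n):
    """B_n from (2.4): B_n[r,s] = |n|/(n_s*...*n_{t-r-1}) if r+s <= t-1."""
    t, size = len(n), int(np.prod(n))
    B = np.zeros((t, t), dtype=np.int64)
    for r in range(t):
        for s in range(t - r):
            B[r, s] = size // int(np.prod(n[s:t - r]))
    return B

def pi(n, digits):
    """Mixed-radix representation (2.2)."""
    value, weight = 0, 1
    for d, radix in zip(digits, n):
        value += d * weight
        weight *= radix
    return value

n = (2, 3, 4)
nbar = n[::-1]
B = exponent_matrix(n)
size = int(np.prod(n))
for k in itertools.product(*(range(r) for r in nbar)):
    for i in itertools.product(*(range(r) for r in n)):
        lhs = (pi(nbar, k) * pi(n, i)) % size
        rhs = int(np.array(k) @ B @ np.array(i)) % size
        assert lhs == rhs   # identity (2.5)
```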
With $B_{\mathbf{n}}$ we can now define a $t$-dimensional generalized Fourier transform as a mapping $F_{\mathbf{n}} : C^{\mathbf{n}} \to C^{\bar{\mathbf{n}}}$ by $Y = F_{\mathbf{n}} X$,

(2.6)  $Y(\mathbf{k}) = \sum_{\mathbf{i} \in Z_{\mathbf{n}}} \omega_{|\mathbf{n}|}^{\mathbf{k}^T B_{\mathbf{n}} \mathbf{i}} \, X(\mathbf{i}), \qquad \mathbf{k} \in Z_{\bar{\mathbf{n}}}.$

As $B_{\mathbf{n}}$ defines the bilinear form in the exponent of $\omega_{|\mathbf{n}|}$ of the generalized transform, we call it the exponent matrix of $F_{\mathbf{n}}$. We get the following isomorphism between the one-dimensional and this $t$-dimensional generalized Fourier transform:

Proposition 2.2. For any $x \in C^{|\mathbf{n}|}$ the following holds:

(2.7)  $(F_{|\mathbf{n}|} x) \circ \pi_{\bar{\mathbf{n}}} = F_{\mathbf{n}} (x \circ \pi_{\mathbf{n}}).$

With $n := |\mathbf{n}| = |\bar{\mathbf{n}}|$ this proposition can be summarized using Diagram 2.3 as:

(2.8)
$$\begin{array}{ccc}
C^n & \stackrel{\cong}{\longrightarrow} & C^{\mathbf{n}} \\
\downarrow F_n & & \downarrow F_{\mathbf{n}} \\
C^n & \stackrel{\cong}{\longrightarrow} & C^{\bar{\mathbf{n}}}
\end{array}$$
Proof. By using the definitions, Proposition 2.1 and the bijectivity of $\pi_{\mathbf{n}}$ we get

$$F_{\mathbf{n}}(x \circ \pi_{\mathbf{n}})(\mathbf{k}) = \sum_{\mathbf{i} \in Z_{\mathbf{n}}} \omega_{|\mathbf{n}|}^{\mathbf{k}^T B_{\mathbf{n}} \mathbf{i}} \, x(\pi_{\mathbf{n}}(\mathbf{i}))
= \sum_{\mathbf{i} \in Z_{\mathbf{n}}} \omega_{|\mathbf{n}|}^{\pi_{\bar{\mathbf{n}}}(\mathbf{k}) \, \pi_{\mathbf{n}}(\mathbf{i})} \, x(\pi_{\mathbf{n}}(\mathbf{i}))
= \sum_{i \in Z_{|\mathbf{n}|}} \omega_{|\mathbf{n}|}^{\pi_{\bar{\mathbf{n}}}(\mathbf{k}) \, i} \, x(i)
= (F_{|\mathbf{n}|} x)(\pi_{\bar{\mathbf{n}}}(\mathbf{k})).$$

As a consequence we can develop algorithms for the $t$-dimensional generalized Fourier transforms in order to obtain algorithms for the one-dimensional ones. Note that the one-dimensional transform maps $C^n$ onto itself, but the generalized multidimensional transform maps $C^{\mathbf{n}}$ onto the different (although isomorphic) space $C^{\bar{\mathbf{n}}}$. The mapping of $F_n$ onto $F_{\mathbf{n}}$ is not uniquely defined by $n$ but depends heavily on the factorization $n = n_0 \cdots n_{t-1}$. This gives some additional freedom, and fast Fourier transform algorithms have to choose $\mathbf{n}$ carefully for a given $n$. This, however, will not be discussed here. The propositions of this section seem pretty basic and might not involve any deeper mathematics. However, they lead to the understanding of a wide class of fast Fourier transform algorithms and also to the suggestion of some new algorithms with very favorable properties, as will be shown in the next two sections. Furthermore, to our knowledge, these isomorphisms have not been spelled out in as much detail as was done here, although they do show up implicitly in many earlier papers on fast Fourier transforms.
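Before turning to fast algorithms, Proposition 2.2 can be verified numerically with a direct $O(n^2)$ evaluation of (2.6). The sketch below is ours and is meant only as an executable restatement of (2.7); it reuses the exponent_matrix helper from above.

```python
# Illustrative check of Proposition 2.2 (direct O(n^2) evaluation, ours).
import numpy as np

def exponent_matrix(n):
    """Exponent matrix B_n of (2.4)."""
    t, size = len(n), int(np.prod(n))
    B = np.zeros((t, t), dtype=np.int64)
    for r in range(t):
        for s in range(t - r):
            B[r, s] = size // int(np.prod(n[s:t - r]))
    return B

def generalized_ft(n, X):
    """Y = F_n X as in (2.6); X is an array over Z_n, Y over Z_nbar."""
    nbar = n[::-1]
    B = exponent_matrix(n)
    w = np.exp(-2j * np.pi / np.prod(n))
    Y = np.zeros(nbar, dtype=complex)
    for k in np.ndindex(*nbar):
        for i in np.ndindex(*n):
            Y[k] += w ** int(np.array(k) @ B @ np.array(i)) * X[i]
    return Y

n = (2, 3, 4)
x = np.random.rand(24) + 1j * np.random.rand(24)
X = x.reshape(n, order="F")          # X = x o pi_n
Y = generalized_ft(n, X)
y = np.fft.fft(x)                    # one-dimensional DFT F_{|n|} x
# (2.7): (F_{|n|} x) o pi_nbar == F_n (x o pi_n)
assert np.allclose(Y, y.reshape(n[::-1], order="F"))
```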
3. Block Algorithms

In this section we show how some known fast Fourier transform algorithms can be derived from a partitioning of the exponent matrix $B_{\mathbf{n}}$. We will denote by $B_{\mathbf{n}}[a : b, c : d]$ the sub-matrix of $B_{\mathbf{n}}$ consisting of all elements from the columns $c, \ldots, d$ of rows $a, \ldots, b$. An important fact is that anti-diagonal blocks of $B_{\mathbf{n}}$ are, up to a factor, of the same form as $B_{\mathbf{n}}$. This is proved in the next proposition:

Proposition 3.1. For every $\mathbf{m} = (n_p, \ldots, n_q)$ the anti-diagonal sub-blocks $B_{\mathbf{n}}[t-1-q : t-1-p, \, p : q]$ of $B_{\mathbf{n}}$ are (up to a factor) matrices of the same type as $B_{\mathbf{n}}$, and

(3.1)  $|\mathbf{m}| \, B_{\mathbf{n}}[t-1-q : t-1-p, \, p : q] = |\mathbf{n}| \, B_{\mathbf{m}}.$

Proof. Both sides are left-upper triangular. Furthermore, for the nonzero elements and $r, s \in \{0, \ldots, q-p\}$ we get

$$|\mathbf{n}| \, B_{\mathbf{m}}[r, s] = |\mathbf{n}| \, |\mathbf{m}| / (n_{p+s} \cdots n_{q-r}) = |\mathbf{m}| \, B_{\mathbf{n}}[t-1-q+r, \, p+s],$$

from which the proposition follows.

Now let $\mathbf{n} = (\mathbf{n}^0, \mathbf{n}^1)$ be a partitioning of $\mathbf{n}$. Then an application of the previous proposition shows:

$$B_{\mathbf{n}} = \begin{bmatrix} B' & |\mathbf{n}^0| B_{\mathbf{n}^1} \\ |\mathbf{n}^1| B_{\mathbf{n}^0} & 0 \end{bmatrix}$$

The remainder of this section will discuss some consequences of this blocking. The basic proposition is a direct consequence:

Proposition 3.2 (Blocking). Let $F_{\mathbf{n}}$ be the $t$-dimensional Fourier transform as defined earlier and $X \in C^{\mathbf{n}}$, $\mathbf{n} = (\mathbf{n}^0, \mathbf{n}^1)$ for $\mathbf{n}^0 \in \mathbf{N}^s$ and $\mathbf{n}^1 \in \mathbf{N}^{t-s}$. Then, with $B' = B_{\mathbf{n}}[0 : t-s-1, \, 0 : s-1]$, the following holds:

$$F_{\mathbf{n}} X(\mathbf{k}) = \sum_{\substack{\mathbf{i}_j \in Z_{\mathbf{n}^j} \\ j = 0, 1}} \omega_{|\mathbf{n}|}^{\mathbf{k}_0^T B' \mathbf{i}_0} \, \omega_{|\mathbf{n}^0|}^{\mathbf{k}_1^T B_{\mathbf{n}^0} \mathbf{i}_0} \, \omega_{|\mathbf{n}^1|}^{\mathbf{k}_0^T B_{\mathbf{n}^1} \mathbf{i}_1} \, X(\mathbf{i}),$$

where $\mathbf{i} = (\mathbf{i}_0, \mathbf{i}_1)$ with $\mathbf{i}_j \in Z_{\mathbf{n}^j}$ and $\mathbf{k} = (\mathbf{k}_0, \mathbf{k}_1)$ with $\mathbf{k}_j \in Z_{\bar{\mathbf{n}}^{1-j}}$ for $j = 0, 1$.

Proof. By using the blocking of $B_{\mathbf{n}}$ suggested previously we get:

$$\mathbf{k}^T B_{\mathbf{n}} \mathbf{i} = \mathbf{k}_0^T B' \mathbf{i}_0 + |\mathbf{n}^1| \, \mathbf{k}_1^T B_{\mathbf{n}^0} \mathbf{i}_0 + |\mathbf{n}^0| \, \mathbf{k}_0^T B_{\mathbf{n}^1} \mathbf{i}_1.$$

The proposition follows if this is inserted into the definition of $F_{\mathbf{n}}$, using $\omega_{|\mathbf{n}|}^{|\mathbf{n}^j|} = \omega_{|\mathbf{n}^{1-j}|}$.
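A quick numerical illustration (ours) of this $2 \times 2$ blocking assembles $B_{\mathbf{n}}$ from $B'$ and the two scaled anti-diagonal sub-blocks:

```python
# Illustrative check of the 2x2 blocking of B_n (via Prop. 3.1; ours).
import numpy as np

def exponent_matrix(n):
    t, size = len(n), int(np.prod(n))
    B = np.zeros((t, t), dtype=np.int64)
    for r in range(t):
        for s in range(t - r):
            B[r, s] = size // int(np.prod(n[s:t - r]))
    return B

n = (2, 3, 4, 5)
s = 2
n0, n1 = n[:s], n[s:]          # n = (n^0, n^1)
t = len(n)
B = exponent_matrix(n)

blocked = np.zeros_like(B)
blocked[: t - s, : s] = B[: t - s, : s]                         # B'
blocked[: t - s, s:] = int(np.prod(n0)) * exponent_matrix(n1)   # |n^0| B_{n^1}
blocked[t - s :, : s] = int(np.prod(n1)) * exponent_matrix(n0)  # |n^1| B_{n^0}
assert np.array_equal(B, blocked)
```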
This proposition is implemented in the following algorithm. It consists of four steps. The first and the last step each do a multiple lower-dimensional generalized Fourier transform. The second step does an element-wise multiplication and the third step interchanges the two arguments, which is a transposition. We will always write $X(\mathbf{i}_0, \mathbf{i}_1)$ for $X(\mathbf{i})$ if $\mathbf{i} = (\mathbf{i}_0, \mathbf{i}_1)$, and by $X(\mathbf{i}_0, \cdot)$ we mean the restriction of $X$ to the affine set $\{(\mathbf{i}_0, \mathbf{i}_1) \mid \mathbf{i}_1 \in Z_{\mathbf{n}^1}\}$.

Algorithm 3.1 (Block algorithm). For $X \in C^{\mathbf{n}}$ compute $F_{\mathbf{n}} X$. All statements are meant for all $\mathbf{i} \in Z_{\mathbf{n}}$ and all $\mathbf{k} \in Z_{\bar{\mathbf{n}}}$. Furthermore, let $B' = B_{\mathbf{n}}[0 : t-s-1, \, 0 : s-1]$.

1. $Z^{(1)}(\mathbf{i}_0, \cdot) := F_{\mathbf{n}^1} X(\mathbf{i}_0, \cdot)$
2. $Z^{(2)}(\mathbf{i}_0, \mathbf{k}_0) := \omega_{|\mathbf{n}|}^{\mathbf{k}_0^T B' \mathbf{i}_0} \, Z^{(1)}(\mathbf{i}_0, \mathbf{k}_0)$
3. $Z^{(3)}(\mathbf{k}_0, \mathbf{i}_0) := Z^{(2)}(\mathbf{i}_0, \mathbf{k}_0)$
4. $Y(\mathbf{k}_0, \cdot) := F_{\mathbf{n}^0} Z^{(3)}(\mathbf{k}_0, \cdot)$

This algorithm is always self-sorting as it computes $Y$ and not a permutation of $Y$. The implementation of the first and last steps is left unspecified at the moment. In general, they are done by recursive application of the block algorithm. The second step is very simple and always in-place. However, the efficient implementation of this step is crucial for the performance of the algorithm on vector processors. The third step can be implemented efficiently in-place if $|\mathbf{n}^0| = |\mathbf{n}^1|$.

The block algorithm is the mother of the main classical fast Fourier transform algorithms, but also of some newer vector and parallel algorithms. Sometimes it is also just called the four-step algorithm, or segmenting, or splitting, depending on the context. We will now discuss three basic variants of the block algorithm. They differ in the way they partition $\mathbf{n}$.

A first variant chooses $s = t-1$ and so $\mathbf{n} = (\mathbf{n}^0, n_{t-1})$. The block algorithm is called decimation in frequency (DIF) in this case [Loa92]. It uses the following partitioning of the exponent matrix $B_{\mathbf{n}}$:

$$B_{\mathbf{n}} = \begin{bmatrix} b^T & |\mathbf{n}^0| \\ n_{t-1} B_{\mathbf{n}^0} & 0 \end{bmatrix}$$

Here the first step of the block algorithm is just a multiple one-dimensional transform. As the ranges of $\mathbf{i}$ and $\mathbf{k}$ can take largely different values during the lifetime of an actual FFT implementation, one has the choice for step 2 of either carefully choosing which loop is taken as the innermost [Pet83], which requires only small tables to store the powers of $\omega_n$, or of using large tables with duplicate entries, which gives optimal vector lengths [Bai88]. The third step cannot be done efficiently in-place on vector and parallel computers, and so some workspace is required. If the last step is done recursively using the same block algorithm, one obtains the Stockham algorithm.
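In flat-vector terms, the four steps of Algorithm 3.1 take the following shape. The Python sketch below is our own rendering (not the paper's implementation): it uses the correspondence $i = a + N_0 b$, $k = c + N_1 d$ given by the mixed-radix representation, realizes steps 1 and 4 by recursive application as described above, and checks the result against a library FFT.

```python
# Four-step block algorithm (Algorithm 3.1) on a flat vector (ours).
import numpy as np

def dft(v):
    """Direct O(m^2) DFT, used at the bottom of the recursion."""
    m = len(v)
    k = np.arange(m)
    return np.exp(-2j * np.pi * np.outer(k, k) / m) @ v

def block_fft(x, split=None):
    """Steps: (1) multiple DFTs of length N1, (2) element-wise twiddle
    multiplication, (3) transposition, (4) multiple DFTs of length N0."""
    N = len(x)
    if N <= 4:
        return dft(x)
    N0 = split if split is not None else int(round(np.sqrt(N)))
    while N % N0:                  # make N0 a divisor of N
        N0 -= 1
    if N0 == 1:                    # N prime: fall back to direct DFT
        return dft(x)
    N1 = N // N0
    X = x.reshape(N0, N1, order="F")           # X[a, b] = x(a + N0*b)
    Z1 = np.array([block_fft(X[a, :]) for a in range(N0)])    # step 1
    a = np.arange(N0)[:, None]
    c = np.arange(N1)[None, :]
    Z2 = np.exp(-2j * np.pi * a * c / N) * Z1                 # step 2
    Z3 = Z2.T                                                 # step 3
    Y = np.array([block_fft(Z3[cc, :]) for cc in range(N1)])  # step 4
    return Y.reshape(N, order="F")             # y(c + N1*d) = Y[c, d]

x = np.random.rand(24) + 1j * np.random.rand(24)
assert np.allclose(block_fft(x), np.fft.fft(x))
```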
Algorithms by Gentleman-Sande, Pease, etc. are obtained from this by rearranging the data.

Decimation in time (DIT) chooses $s = 1$ and so $\mathbf{n} = (n_0, \mathbf{n}^1)$. The partitioning of the exponent matrix $B_{\mathbf{n}}$ is here:

$$B_{\mathbf{n}} = \begin{bmatrix} b & n_0 B_{\mathbf{n}^1} \\ |\mathbf{n}^1| & 0 \end{bmatrix}$$

The first step needs recursive application of the algorithm and the last step is a multiple one-dimensional transform. Otherwise, the same comments hold here as in the case of decimation in frequency. The original FFT algorithm by Cooley and Tukey implements decimation in time. Sometimes the block algorithm is called splitting in the case of decimation in time or decimation in frequency.

A third variant chooses $s = \lfloor t/3 \rfloor$, where $\mathbf{n}$ should be such that $|\mathbf{n}^0| \approx |\mathbf{n}^1| \approx \sqrt{|\mathbf{n}|}$. In [Loa92] this is called the 4-step algorithm, but we suggest it should be called decimation in time and frequency. This algorithm can be implemented without the need for large tables or loop interchange and gives good vectorization. Data access is with stride 1. The transposition step can only be done efficiently in-place if $|\mathbf{n}^0| = |\mathbf{n}^1|$. The first and last steps are implemented by recursive application of DIF or DIT. Algorithms along these lines have been suggested in [Swa87, AL88, Bai90]. It is well known that these algorithms might use slightly more arithmetic operations than the Cooley-Tukey type. However, the way they access data compensates for this deficiency.
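As a usage illustration (ours, reusing block_fft from the sketch above), the decimation in time and frequency variant corresponds to choosing the split point so that both factors are close to $\sqrt{|\mathbf{n}|}$:

```python
# Balanced split for decimation in time and frequency (ours):
# choose N0 as the divisor of N closest to sqrt(N), so that the two
# multiple transforms have about the same size and long vector lengths.
import numpy as np

def balanced_split(N):
    divisors = [d for d in range(1, N + 1) if N % d == 0]
    return min(divisors, key=lambda d: abs(d - np.sqrt(N)))

N = 720
N0 = balanced_split(N)           # 24, close to sqrt(720) = 26.8...
x = np.random.rand(N) + 1j * np.random.rand(N)
assert np.allclose(block_fft(x, split=N0), np.fft.fft(x))
```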
4. Double Block Algorithms

In this section we will derive a new family of algorithms which share the attractive features of the decimation in time and frequency algorithm but in addition are also in-place and can obtain longer vector lengths. Furthermore, half of the transpositions are saved. The array $\mathbf{n}$ is partitioned into three sub-arrays as $\mathbf{n} = (\mathbf{n}^0, \mathbf{n}^1, \mathbf{n}^2)$. With this the exponent matrix $B_{\mathbf{n}}$ is partitioned as

(4.1)
$$B_{\mathbf{n}} = \begin{bmatrix}
* & * & |\mathbf{n}^0||\mathbf{n}^1| B_{\mathbf{n}^2} \\
* & |\mathbf{n}^0||\mathbf{n}^2| B_{\mathbf{n}^1} & 0 \\
|\mathbf{n}^1||\mathbf{n}^2| B_{\mathbf{n}^0} & 0 & 0
\end{bmatrix}$$

Properties of the Fourier transform related to this blocking are:

Proposition 4.1 (Double blocking). Let $F_{\mathbf{n}}$ be the $t$-dimensional Fourier transform as defined earlier and $X \in C^{\mathbf{n}}$, $\mathbf{n} = (\mathbf{n}^0, \mathbf{n}^1, \mathbf{n}^2)$. Then for some integer matrices $B', B'', B'''$ the following holds:

$$F_{\mathbf{n}} X(\mathbf{k}) = \sum_{\substack{\mathbf{i}_j \in Z_{\mathbf{n}^j} \\ j = 0, \ldots, 2}} \omega_{|\mathbf{n}|}^{\mathbf{k}_0^T B' \mathbf{i}_0 + \mathbf{k}_0^T B'' \mathbf{i}_1 + \mathbf{k}_1^T B''' \mathbf{i}_0} \, \omega_{|\mathbf{n}^2|}^{\mathbf{k}_0^T B_{\mathbf{n}^2} \mathbf{i}_2} \, \omega_{|\mathbf{n}^1|}^{\mathbf{k}_1^T B_{\mathbf{n}^1} \mathbf{i}_1} \, \omega_{|\mathbf{n}^0|}^{\mathbf{k}_2^T B_{\mathbf{n}^0} \mathbf{i}_0} \, X(\mathbf{i}),$$

where $\mathbf{i} = (\mathbf{i}_0, \mathbf{i}_1, \mathbf{i}_2)$ with $\mathbf{i}_j \in Z_{\mathbf{n}^j}$ and $\mathbf{k} = (\mathbf{k}_0, \mathbf{k}_1, \mathbf{k}_2)$ with $\mathbf{k}_j \in Z_{\bar{\mathbf{n}}^{2-j}}$ for $j = 0, 1, 2$.

Proof. By using the blocking of $B_{\mathbf{n}}$ we just described we get

(4.2)
$$\mathbf{k}^T B_{\mathbf{n}} \mathbf{i} = \mathbf{k}_0^T B' \mathbf{i}_0 + \mathbf{k}_0^T B'' \mathbf{i}_1 + \mathbf{k}_1^T B''' \mathbf{i}_0 + |\mathbf{n}^0||\mathbf{n}^1| \, \mathbf{k}_0^T B_{\mathbf{n}^2} \mathbf{i}_2 + |\mathbf{n}^0||\mathbf{n}^2| \, \mathbf{k}_1^T B_{\mathbf{n}^1} \mathbf{i}_1 + |\mathbf{n}^1||\mathbf{n}^2| \, \mathbf{k}_2^T B_{\mathbf{n}^0} \mathbf{i}_0.$$

The $B', \ldots, B'''$ are sub-matrices of $B_{\mathbf{n}}$.

From this we get a new 6-step algorithm (different from what is called the 6-step algorithm in [Loa92]):

Algorithm 4.1 (Double block algorithm). For $X \in C^{\mathbf{n}}$ compute $Y = F_{\mathbf{n}} X$.

1. $Z^{(1)}(\mathbf{i}_0, \mathbf{i}_1, \cdot) := F_{\mathbf{n}^2} X(\mathbf{i}_0, \mathbf{i}_1, \cdot)$
2. $Z^{(2)}(\mathbf{i}_0, \mathbf{i}_1, \mathbf{k}_0) := \omega_{|\mathbf{n}|}^{\mathbf{k}_0^T (B' \mathbf{i}_0 + B'' \mathbf{i}_1)} \, Z^{(1)}(\mathbf{i}_0, \mathbf{i}_1, \mathbf{k}_0)$
3. $Z^{(3)}(\mathbf{i}_0, \cdot, \mathbf{k}_0) := F_{\mathbf{n}^1} Z^{(2)}(\mathbf{i}_0, \cdot, \mathbf{k}_0)$
4. $Z^{(4)}(\mathbf{i}_0, \mathbf{k}_1, \mathbf{k}_0) := \omega_{|\mathbf{n}|}^{\mathbf{k}_1^T B''' \mathbf{i}_0} \, Z^{(3)}(\mathbf{i}_0, \mathbf{k}_1, \mathbf{k}_0)$
5. $Z^{(5)}(\mathbf{k}_0, \mathbf{k}_1, \mathbf{i}_0) := Z^{(4)}(\mathbf{i}_0, \mathbf{k}_1, \mathbf{k}_0)$
6. $Y(\mathbf{k}_0, \mathbf{k}_1, \cdot) := F_{\mathbf{n}^0} Z^{(5)}(\mathbf{k}_0, \mathbf{k}_1, \cdot)$

Steps 1, 3 and 6 are multiple one-dimensional transforms and steps 2 and 4 are unitary scaling steps. For these steps the same comments apply as for the four-step algorithm. Step 5, however, implements a slightly more general transposition than the block algorithm. This step can be done in-place if $|\mathbf{n}^0| = |\mathbf{n}^2|$.
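The following sketch (ours) renders the six steps of Algorithm 4.1 for a flat vector with a three-way factorization $N = N_0 N_1 N_2$. For brevity the multiple one-dimensional transforms in steps 1, 3 and 6 call a library FFT; a full implementation would apply the block algorithm of Section 3 recursively. The twiddle factors in steps 2 and 4 are the flat-index counterparts of $\omega_{|\mathbf{n}|}^{\mathbf{k}_0^T(B'\mathbf{i}_0 + B''\mathbf{i}_1)}$ and $\omega_{|\mathbf{n}|}^{\mathbf{k}_1^T B'''\mathbf{i}_0}$.

```python
# Six-step double block algorithm (Algorithm 4.1) for flat vectors (ours).
import numpy as np

def double_block_fft(x, N0, N1, N2):
    """DFT of length N = N0*N1*N2 via the six steps of Algorithm 4.1,
    using i = a + N0*b + N0*N1*c and k = d + N2*e + N2*N1*f."""
    N = N0 * N1 * N2
    assert len(x) == N
    a = np.arange(N0).reshape(N0, 1, 1)
    b = np.arange(N1).reshape(1, N1, 1)
    d = np.arange(N2).reshape(1, 1, N2)
    e = np.arange(N1).reshape(1, N1, 1)

    X = x.reshape(N0, N1, N2, order="F")   # X[a,b,c] = x(a + N0*b + N0*N1*c)
    Z1 = np.fft.fft(X, axis=2)                                  # step 1
    Z2 = np.exp(-2j * np.pi * (a + N0 * b) * d / N) * Z1        # step 2
    Z3 = np.fft.fft(Z2, axis=1)                                 # step 3
    Z4 = np.exp(-2j * np.pi * e * a / (N0 * N1)) * Z3           # step 4
    Z5 = np.transpose(Z4, (2, 1, 0))                            # step 5
    Y = np.fft.fft(Z5, axis=2)                                  # step 6
    return Y.reshape(N, order="F")         # y(d + N2*e + N2*N1*f) = Y[d,e,f]

x = np.random.rand(30) + 1j * np.random.rand(30)
assert np.allclose(double_block_fft(x, 2, 3, 5), np.fft.fft(x))
```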
A first variant is very similar to the decimation in time and frequency of the previous section. Here $\mathbf{n} = (\mathbf{n}^0, n_s, \mathbf{n}^2)$. In this case the blocking of $B_{\mathbf{n}}$ takes the form

$$B_{\mathbf{n}} = \begin{bmatrix}
B' & b & |\mathbf{n}^0| n_s B_{\mathbf{n}^2} \\
c^T & |\mathbf{n}^0||\mathbf{n}^2| & 0 \\
n_s |\mathbf{n}^2| B_{\mathbf{n}^0} & 0 & 0
\end{bmatrix}$$

If $|\mathbf{n}^0| \approx |\mathbf{n}^2|$ the data can be accessed with long vectors and stride 1. For any $n \in \mathbf{N}$ one can find an $\mathbf{n}$ with $n = |\mathbf{n}|$ such that either $\mathbf{n} = (\mathbf{n}^0, \mathbf{n}^2)$ or $\mathbf{n} = (\mathbf{n}^0, n_s, \mathbf{n}^2)$ with $|\mathbf{n}^0| = |\mathbf{n}^2|$, where all the components of $\mathbf{n}$ are not necessarily primes but contain no second or higher powers of primes. The same blocking as for $\mathbf{n}$ can also be obtained for $\mathbf{n}^0$ etc. In this case a combination of the decimation in time and frequency variant of the block algorithm and the first variant of the double block algorithm gives an algorithm with very high vector lengths which is in-place, self-sorting and accesses data with stride one [Heg92]. This algorithm has been successfully implemented on the vector computer Fujitsu VP2200 at the ANU and is being implemented for the scientific library of the new vector-parallel computer VPP 500 of Fujitsu Ltd.

There are several more variants of the six-step algorithm which could be considered. As an example we show a variant which combines a Stockham DIF step with a Stockham DIT step. In this case $\mathbf{n} = (n_0, \mathbf{n}^1, n_{t-1})$ and the blocking of the exponent matrix is as follows:

$$B_{\mathbf{n}} = \begin{bmatrix}
1 & b^T & n_0 |\mathbf{n}^1| \\
c & n_0 n_{t-1} B_{\mathbf{n}^1} & 0 \\
n_{t-1} |\mathbf{n}^1| & 0 & 0
\end{bmatrix}$$
In analogy to the usual terminology we call this algorithm decimation between time and frequency. This variant is in-place and self-sorting if $n_0 = n_{t-1}$. However, it has the same problems with loop interchange as the original Stockham algorithm. It shows some similarities with the Johnson-Burrus algorithm [JB84, Heg92], which is related to the Cooley-Tukey algorithm. A feature of both algorithms is that they need only half the transposition steps of the Stockham algorithm.

Finally, one can group two Stockham DIF or two DIT steps together to get a double block decimation in frequency or decimation in time algorithm. In these cases the blockings of the exponent matrix $B_{\mathbf{n}}$ are as follows. For DIF one has $\mathbf{n} = (\mathbf{n}^0, n_{t-2}, n_{t-1})$ and

$$B_{\mathbf{n}} = \begin{bmatrix}
b^T & |\mathbf{n}^0| & |\mathbf{n}^0| n_{t-2} \\
c^T & |\mathbf{n}^0| n_{t-1} & 0 \\
n_{t-2} n_{t-1} B_{\mathbf{n}^0} & 0 & 0
\end{bmatrix}.$$
The double block DIT algorithm uses $\mathbf{n} = (n_0, n_1, \mathbf{n}^2)$ and so

$$B_{\mathbf{n}} = \begin{bmatrix}
b & c & n_0 n_1 B_{\mathbf{n}^2} \\
|\mathbf{n}^2| & n_0 |\mathbf{n}^2| & 0 \\
n_1 |\mathbf{n}^2| & 0 & 0
\end{bmatrix}.$$

The advantage of these algorithms is again that they need only half the transposition steps compared to the original Stockham algorithm.
Acknowledgements
This work was done as part of a joint project of the ANU and Fujitsu Ltd Japan for the development of parallel numerical algorithms and software. One of the algorithms will be included in the Fujitsu SSL II Scientific Software Library. Implementations are done in cooperation with Judy Jenkinson and Murray Dow of the ANU Supercomputer Facility.
References

[AL88] Mike Ashworth and Andrew G. Lyne, A segmented FFT algorithm for vector computers, Parallel Computing 6 (1988), 217-224.
[Bai88] David H. Bailey, A high-performance FFT algorithm for vector supercomputers, Int. J. Supercomputer Appl. 2 (1988), no. 1, 82-87.
[Bai90] David H. Bailey, FFTs in external or hierarchical memory, J. Supercomputing 4 (1990), 23-35.
[Heg92] Markus Hegland, A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing, submitted, 1992.
[JB84] H.W. Johnson and C.S. Burrus, An in-place, in-order radix-2 FFT, Proc. IEEE ICASSP, 1984, p. 28A.2.
[Knu81] Donald E. Knuth, The art of computer programming, vol. 2, Addison Wesley, 1981.
[Loa92] Charles Van Loan, Computational frameworks for the fast Fourier transform, SIAM, 1992.
[Pet83] W.P. Petersen, Vector Fortran for numerical problems on CRAY-1, Comm. ACM 26 (1983), no. 11, 1008-1021.
[Swa87] Paul N. Swarztrauber, Multiprocessor FFTs, Parallel Comput. 5 (1987), 197-210.