Block Algorithms for Fast Fourier Transforms on Vector and Parallel Computers

Markus Hegland, CISR, Australian National University

Fast Fourier transform (FFT) algorithms are extremely important in many fields of scientific computing. They are used in fast Poisson solvers and spectral methods for computational fluid dynamics, and in fast convolution algorithms for signal processing, to name just three examples out of a vast field of applications. The main advantage of FFT algorithms is that they reduce the number of floating point operations from $O(n^2)$ to $O(n \log(n))$. Furthermore, they consist of $O(\log(n))$ steps containing $n$ independent tasks each. However, the original algorithms suffer from varying, short vector lengths and non-unit stride data access on vector processors, and from the need for synchronization and communication after every step on parallel computers.

Examples of a new class of algorithms were proposed by Ashworth and Lyne [1], Bailey [2] and Swarztrauber [5]. They access data mainly with stride one and their vector lengths are $\sqrt{n}$. They require only one synchronization or communication step. We will show how a general class of algorithms can be defined by partitioning a related integer matrix into sub-blocks. Members of this class will be called block FFT algorithms, and the cited methods fall into this class. We will also discuss a new block FFT algorithm implementing recursive blocking, which leads to even longer vector lengths. Like the Johnson-Burrus algorithm [3], and unlike other block FFT algorithms, this new algorithm is also in-place and self-sorting. It has been implemented on the vector processor Fujitsu VP2200 and on the Fujitsu AP 1000 with 128 nodes and distributed memory.

Essentially, fast Fourier transform algorithms implement efficient ways to compute the matrix-vector product $F_n x$ for $x \in \mathbb{C}^n$, where

(1)   $F_n = \bigl[\omega_n^{-jk}\bigr]_{j,k=0,\dots,n-1}, \qquad \omega_n = \exp(2\pi i/n).$
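To make (1) concrete, here is a minimal NumPy sketch (ours, not from the paper) that builds $F_n$ and checks it against a library FFT; with $\omega_n = \exp(2\pi i/n)$, the entries $\omega_n^{-jk} = \exp(-2\pi i\,jk/n)$ follow the same sign convention as numpy.fft.fft.

```python
import numpy as np

def fourier_matrix(n):
    # F_n = [omega_n^(-jk)] with omega_n = exp(2*pi*i/n), as in eq. (1);
    # note omega_n^(-jk) = exp(-2*pi*i*j*k/n).
    jk = np.outer(np.arange(n), np.arange(n))
    return np.exp(-2j * np.pi * jk / n)

n = 8
x = np.random.rand(n) + 1j * np.random.rand(n)
# F_n x agrees with the library FFT, so (1) matches numpy's sign convention.
assert np.allclose(fourier_matrix(n) @ x, np.fft.fft(x))
```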
FFT algorithms can be interpreted as a factorization of the dense matrix $F_n$ into sparse factors, obtained by successively applying splitting steps. One splitting step is

(2)   $F_n = (F_q \otimes I_p)\, W_{q,p}\, T_{p,q}\, (F_p \otimes I_q), \qquad n = pq,$

where $A \otimes B$ denotes the Kronecker product, $T_{p,q}$ is a matrix transposition permutation, and $W_{q,p}$ is a unitary diagonal matrix of twiddle factors.
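The following NumPy sketch (ours; the index conventions are an assumption) carries out one splitting step of the form (2): length-$p$ DFTs at stride $q$, a diagonal of twiddle factors, a $p \times q$ matrix transposition, and length-$q$ DFTs. The diagonal and the transposition commute up to a relabelling of the diagonal entries, so the order of those two factors below may differ from the paper's convention.

```python
import numpy as np

def split_step_fft(x, p, q):
    """One splitting step in the spirit of eq. (2), n = p*q (sketch)."""
    n = p * q
    X = x.reshape(p, q)                # x[c*q + d] -> X[c, d]
    Y = np.fft.fft(X, axis=0)          # (F_p (x) I_q): p-point DFTs at stride q
    a = np.arange(p).reshape(p, 1)
    d = np.arange(q).reshape(1, q)
    W = np.exp(-2j * np.pi * a * d / n)  # diagonal twiddle factors
    U = (W * Y).T                      # matrix transposition permutation T
    Z = np.fft.fft(U, axis=0)          # (F_q (x) I_p): q-point DFTs, stride one
    return Z.reshape(n)

x = np.random.rand(12) + 1j * np.random.rand(12)
assert np.allclose(split_step_fft(x, p=3, q=4), np.fft.fft(x))
```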
The original algorithms used splittings where either $p$ or $q$ is small, often a prime factor. Block algorithms, on the other hand, try to get both $p$ and $q$ near $\sqrt{n}$ and thus have uniformly long vector lengths. Our new algorithm applies blocking recursively and furthermore combines two splitting steps into one, such that only square transpositions are needed. The corresponding matrix factorization is

(3)   $F_n = (F_q \otimes I_{pq})\, W_{q,pq}\, T_{q,p,q}\, (I_q \otimes F_p \otimes I_q)\, (W_{q,p} \otimes I_q)\, (F_q \otimes I_{pq}), \qquad n = qpq.$

Here $q^2$ is the largest square factor of $n$. The permutation $T_{q,p,q}$ is essentially a $p$-fold square matrix transposition.
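The structural point behind the in-place property is that a square transposition is an involution. Under one plausible reading of the index convention (our assumption; the abstract does not spell it out), $T_{q,p,q}$ views the data as a $q \times p \times q$ array and swaps the two $q$-axes, i.e. it transposes $p$ square $q \times q$ blocks, as the sketch below illustrates.

```python
import numpy as np

def square_block_transpose(x, q, p):
    # T_{q,p,q} (assumed convention): swap the two outer q-indices of x
    # viewed as a q x p x q array -- a p-fold square q x q transposition.
    return x.reshape(q, p, q).transpose(2, 1, 0).reshape(-1)

q, p = 4, 3
x = np.arange(q * p * q)
y = square_block_transpose(x, q, p)
# An involution: applying it twice restores the input, which is why a
# square transposition can be performed in place by pairwise swaps.
assert np.array_equal(square_block_transpose(y, q, p), x)
```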
This work was funded by Fujitsu Ltd. Japan under a research and development contract at the Australian National University.
References
1. Mike Ashworth and Andrew G. Lyne, A segmented FFT algorithm for vector computers, Parallel Computing 6 (1988), 217–224.
2. David H. Bailey, FFTs in external or hierarchical memory, J. Supercomputing 4 (1990), 23–35.
3. H. W. Johnson and C. S. Burrus, An in-place, in-order radix-2 FFT, Proc. IEEE ICASSP, 1984, p. 28A.2.
4. W. P. Petersen, Vector Fortran for numerical problems on CRAY-1, Comm. ACM 26 (1983), no. 11, 1008–1021.
5. Paul N. Swarztrauber, Multiprocessor FFTs, Parallel Comput. 5 (1987), 197–210.
6. Clive Temperton, Self-sorting mixed-radix fast Fourier transforms, J. Comput. Phys. 53 (1983), 1–23.
7. Walter Waelde and Oswald Haan, Optimization of the FFT for SIEMENS VP systems, Tech. Report 39.89, University of Karlsruhe, Computer Center, 1989.