Numer. Math. 68: 507{547 (1994)
Numerische Mathematik
c Springer-Verlag 1994 Electronic Edition
A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing? Markus Hegland
Program in Advanced Computation, Center for Mathematics and its Applications, School of Mathematical Sciences, The Australian National University, Canberra ACT 0200, Australia e-mail:
[email protected] Received October 29, 1992 / Revised version received October 21, 1993
Summary. We propose a new algorithm for fast Fourier transforms. This algorithm features uniformly long vector lengths and stride one data access. Thus it is well adapted to modern vector computers like the Fujitsu VP2200 having several oating point pipelines per CPU and very fast stride one data access. It also has favorable properties for distributed memory computers as all communication is gathered together in one step. The algorithm has been implemented on the Fujitsu VP2200 using the basic subroutines for fast Fourier transforms discussed elsewhere. We develop the theory of index digit permutations to some extent. With this theory we can derive the splitting formulas for almost all mixed-radix FFT algorithms known so far. This framework enables us to prove these algorithms but also to derive our new algorithm. The development and systematic use of this framework is new and allows us to simplify the proofs which are now reduced to the application of matrix recursions. Mathematics Subject Classi cation (1991): 65T20
1. Introduction Fast Fourier transforms are of primary importance in computationally intensive applications. They are used for large-scale data analysis and to solve partial dierential equations. Applications of FFTs in data analysis include computer tomography, data ltering and structure or uid-structure interaction analysis. Other areas of interest here are spectral analysis of speech, sonar, radar, seismic and vibration detection. They are used for digital ltering, convolution evaluation and signal decomposition. A second eld of FFT applications are in the solution of partial dierential equations. An earlier example here is the fast Poisson solver which is closely related to FFTs. In recent years spectral methods featuring FFTs have been used successfully to solve computational uid dynamics (CFD) problems. They ?
This research was part of a software project by ANU and Fujitsu Ltd
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 507 of Numer. Math. 68: 507{547 (1994)
508
M. Hegland
are routinely used in weather forecasting and for simulation in geophysics and quantum mechanics. All these applications use discrete Fourier transforms. They are nitedimensional, linear, complex operators related to the transforms of classical Fourier analysis which play a preeminante role in mathematical physics. They implement fast algorithms for the matrix-vector product with the complex Fourier transform matrix. It is well known that the order n complex matrix-vector product can be done with order O(n2 ) oating point operations. In the sixties, however, algorithms were suggested by Cooley and Tukey [CT65] and later by Gentleman and Sande [GS66] which all only need O(n log(n)) operations if n is \far from prime" in the sense that it is composed of many factors. These algorithms were given the name fast Fourier transforms and have been essential for the success of the applications mentioned. They can be implemented without the use of extra workspace of order O(n) overwriting the input vector with the transformed data. Thus they are called \in-place". However, they have one important drawback in that they either compute the discrete Fourier transform in a \digit-reversed" order or compute the transform of digit-reversed data. Thus in order to get the actual Fourier transform some extra permutations are needed. For the application of FFTs to spectral methods, however, a combination of the Cooley-Tukey and Gentleman-Sande algorithm can be used without any permutations. This is usually not true for data analysis applications where Fourier transforms are used to compute the spectrum and not just as a tool to get fast algorithms. Later, algorithms were suggested by Stockham [CCF+ 67] which are in-order, combining digit reversions with elementary transforms. These algorithms for general n being both in-place and in-order could be created and a rst example was suggested by Johnson and Burrus [JB84, Tem91]. (Naturally, in-place, inorder transforms for special orders n using the prime factor algorithm were known for some time [Tem85].) While in-order transforms are needed for some applications in-place transforms generally make larger transform sizes feasible if the limiting factor is storage space rather than computing time. Fourier transform algorithms contain a large amount of parallelism. Thus the algorithms by Cooley-Tukey, Gentleman-Sande and Stockham have all been implemented on vector and parallel computers. A problem of these algorithms, however, is that they have varying loop lengths. Thus some implementations use loop interchanges [Pet83]. A drawback of the loop interchange method is that it requires data access with non-unit stride. This is avoided by the Pease algorithm [Pea68] using extended trigonometric tables or by using a combination of the Stockham algorithm and the transposed Stockham algorithm [WH90]. A drawback of these two methods, however, is their need to access the trigonometric table in the innermost loop as well as the data. A somehow new approach can be found in a paper by Swarztrauber [Swa87] and Ashworth and Lyne [AL88]. In order to get suitable parallelism they implement what Van Loan [Loa92] calls four or six-step frameworks. These frameworks are dierent from the original algorithms as they \split" or \segment" the discrete Fourier transform into factors implementing multiple transforms of similar order instead of a factor of elementary (e.g., prime) order and a factor of composite order. 
However, after just one step of this splitting into multiple transforms of similar order these algorithms proceed with the original splitting Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 508 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
509
into a small (often prime) order transform and a large order transform. A disadvantage of these algorithms is that the rst step might not be in-place. Their big advantage is their long vector lengths and stride one data access after the rst step. Our new algorithm can be interpreted as a combination of these algorithms with ideas from the Johnson-Burrus algorithm. A review of the historical development of fast Fourier transforms since Gauss can be found in [DV90]. Several of the 117 references in this paper point to software. Furthermore a simple tutorial on FFTs is presented there without explicit usage of Kronecker products. Both complexity results and implementation ideas are presented. A broad introduction to FFTs featuring \matrix language" can be found in [Loa92]. It seems safe to guess that this book will have large impact on the further development of FFT algorithms as it gives a concise mathematical treatment and thorough discussion of most of the major FFT ideas known so far. The purpose this paper is threefold: First we present a new FFT algorithm. Then we discuss the theory of index digit permutations on which such algorithms are based. Finally we derive some old and new FFT algorithms using splitting lemmata. Our new algorithm implements a recursive Swarztrauber-type of splitting. In contrast to Swarztrauber's, it is in-place for any order n. It is naturally selfsorting and essentially accesses data only with stride one. This is important for computers featuring very fast stride one data access like the VP series of Fujitsu. As in Swarztrauber's algorithm, it also lends itself to parallel implementation. In a joint project between ANU and Fujitsu Ltd. implementations of this algorithm for the new VPP 500 supercomputer are being developed. The Fujitsu VPP 500 is a state-of-the-art vector parallel computer with up to 222 vector processors and capable of up to 355 Giga ops (from a product description by Fujitsu Ltd.). As a rst stage we implemented this algorithm for n being a power of 2 on the VP 2200 at ANU. We will introduce some basic matrix families to describe the Fourier transform algorithms. These families are closely related to basic subroutines (similar to the BLAS in linear algebra) needed for the implementation of the algorithms. The basic subroutines will be discussed elsewhere. For our derivation we suggest some new notation based on Kronecker products. Kronecker products are well known to be very useful in the derivation and implementation of FFT algorithms. This was repeatedly demonstrated, a recent reference is [JJRT90]. In [JJRT90] an approach very similar to the one suggested here was used. Several identities they discuss will be used here for the implementation of FFTs on vector and parallel computers. One point deserves special mentioning here: They discuss the idea of creating a compiler for FFTs which could select between the myriads of choices the one best adapted to a particular computer architecture. In the sections following the description of our new algorithm we present in some detail many relations needed for the implementation of such a compiler. The next section introduces the notation needed. This section is rather long but necessary as the notation is often new and in some cases not well-known. Then, in the third section we present our new algorithm. A forth section gives some basic propositions on index-digit permutations and Kronecker products. Index digit permutations are used extensively for self-sorting Fourier transforms. 
Some of their basic properties are discussed in [Fra76]. Index digit permutations are transformations which permute digits of a Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 509 of Numer. Math. 68: 507{547 (1994)
510
M. Hegland
mixed radix representation of the integers. The questions we discuss focus around two main points: First, a product of an index digit permutation is not always an index digit permutation again. Second, the same permutation can be de ned by dierent radix vectors. By investigating these questions carefully we are able to derive a framework which leads to a comprehensive new uni ed treatment of all essential mixed-radix algorithms including some yet unknown algorithms. This treatment is based on matrix factorizations as they are used in [Loa92]. However, we extend the discussion of [Loa92] by including new splittings and the algebraic treatment of index-digit permutations. The fth section develops some known and several new algorithms and the underlying splittings or matrix factorizations. Finally, in the last section, we give bounds for the performance indicators.
2. Notation In the following we will have a rather lengthy introduction to notation. One reason for this is that for the comparison of FFTs we will need index-digit permutations which have hardly been systematically formalized in the literature. Thus some of the notation will be new or not so well-known. Also we need notation for various objects as integers (for the indices), vectors and matrices and also for permutations relating to these objects. 2.1. Integers
The set of integers will be denoted by Z, the set of positive integers by N , and the set of non-negative integers less than t by Zt = f0; : : : ; t ? 1g, where t 2 N . Individual integers will be denoted by the lower case Roman letters i; j; k; l; m; n; p; q; r; s; t. Contrary to some algebraic notation the operators +; ?; ; = will always be the ones de ned for the rational numbers Q . Thus for i 2 Zn and j 2 Zm the sum i + j is well de ned and i + j 2 Zn+m. If we use the operators modulo n we will explicitly write so. For example if i; j 2 Zn we write k = i + j mod n if k 2 Zn such that k = i + j + ln for some l 2 Z. For t 2 N any bijective mapping : Zt ! Zt; s 7! (s): is called a permutation of Zt. The set of all permutations of Zt is the symmetric group St . We will denote elements of St by the Greek letters ; ; ; . Special elements of St are { transpositions : (s1 ) = s2 ; (s2 ) = s1 ; and (s) = s; s 2 Zt n fs1 ; s2 g for some s1 ; s2 2 Zt, Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 510 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
{ shifts : { and reversals :
511
(0) = t ? 1; (s) = s ? 1; s 2 Zt n f0g; (s) = t ? 1 ? s; s 2 Zt:
It is well known that any permutation can be represented as a product of transpositions. The composition 1 2 of permutations 1 ; 2 2 St is de ned by (1 2 )(s) := 1 (2 (s)); s 2 Zt; and is also a permutation. We will use the same symbol for the composition of any mappings. In order to make them distinct from integers we will denote arrays of integers by underlined lower case Roman letters, e.g., n = (n0 ; : : : ; nt?1 ); i = (i0 ; : : : ; it?1 ): The components of n will usually be denoted by ns . As usual, N k denotes the set of k-tuples of positive integers. Furthermore, we will need the set of all nite arrays of integers larger than 1 and call it
N=
[
k2N
(n0 ; : : : ; nk?1 ) 2 N k j ns > 1; s 2 Zk :
The dimension of n 2 N is retrieved by , i.e., if n 2 N then
n 2 N (n ) :
For convenience we denote the product of the components of n 2 N by jnj = n0 n(n)?1 = n: Finally we will use a pre-ordering on N de ned by the following: For any n; m 2 N ; m shall be subordinate to n or
mn
if there exist (m) arrays n0 ; : : : ; n(m)?1 such that
n = (n0 ; : : : ; n(m)?1 ) and m = (jn0 j; : : : ; jn(m)?1 j): Following this de nition, the array (6; 3; 7) is subordinate to (2; 3; 3; 7). If m is subordinate to n then jmj = jnj and (m) (n). The set of all t-tuples i = (i0 ; : : : ; it?1 ) of integers is 2 Zns; s = f0; : : : ; t ? 1g will be denoted by
Zn = Zn0
Znt? : 1
This set is sometimes also called a grid. Note that Zn
Z n
( )
where Zt is the set of t-tuples of integers as usual. Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 511 of Numer. Math. 68: 507{547 (1994)
512
M. Hegland
Now we de ne the actions of permutations on integer arrays by permuting the components. In order to be able to distinguish the permutations from their actions on integer arrays we will denote them by underlined Greek letters. Thus for 2 St and i = (i0 ; : : : ; it?1 ) 2 Zt we de ne (i) = (i(0) ; : : : ; i(t?1)): From this we get (Zn) = Z(n) : The set of all will be called S t . Note that composition in S t is reversed compared to composition in St . With other words (1 2 )(i) = 2 (1 (i)) or 1 2 = 2 1 if denotes the composition in S t . Finally permutations leave our \norm" and dimension of n invariant: j(n)j = jnj; and ( (n)) = (n) for all 2 S (n) ; n 2 N . The mixed-radix digit representation maps integer t-tuples onto integers where the components of the t-tuples are just the digits of the number. This is done by the following formula:
n (i) =
t?1 X s=0
is n0 ns?1 ; i 2 Zn; n 2 N ;
where i = (i0 ; : : : ; it?1 ) and n = (n0 ; : : : ; nt?1 ). This de nes a bijective mapping from Zn to Zjnj . This mapping is well-known and can be found, e.g., in [Knu81]. If jnj = jmj then ?m1 n de nes a bijection between Zn and Zm. Alternatives to n are well-known and can be generated, e.g., from the Chinese remainder theorem. However, they will not be discussed here. We will call i the digit vector of i = n (i) relative to a radix vector n 2 N . As an example consider the radix vector n = (10; 10; 10). With this radix vector we can represent integers between 0 and 999 and the digit vector of the number 572 would be (2; 7; 5) which are just the ordinary digits in reversed order. Mixed radix representation extends this to radices other than 10 and even dierent radices. For example with the radix vector n = (9; 10; 10) we can represent integers between 0 and 899 and the number 572 has the index vector (5; 3; 6) as 572 = 5 + 3 9 + 6 9 10. If the digits of an integer are rst computed with the inverse of n , then permuted with and nally mapped back to an integer with (n) we get an index-digit permutation. Formally we de ne ~n = (n) ?n 1 ; n 2 N ; 2 S (n) : As n?1 maps Zjnj onto Zn, maps Zn onto Z(n) , and (n) maps Z(n) back to Zjnj the composition maps Zjnj onto itself and as all factors are bijective we get Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 512 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
513
~n 2 Sjnj :
We call ~n the index-digit permutation generated by and n. A composition of two index-digit permutations gives a permutation but in general this is not an index-digit permutation any more. In a later section we will study the question when a product of two index-digit permutations is an index-digit permutation. Furthermore the same index-digit permutation can be represented by dierent radix vectors n and permutations . This question will also be discussed later. We now continue our previous example. Given the radix vector (10; 10; 10) and the permutation 2 S3 with (0) = 2; (1) = 0 and (2) = 1 the corresponding index-digit-permutation permutes the actual digits. Thus the number 572 with digit vector (2; 7; 5) has a permuted digit vector (5; 2; 7) which corresponds to the number 725. With the same permutation but a dierent radix vector (9; 10; 10) the digit vector of 572 was seen to be (5; 3; 6) and so the permuted digit vector is (6; 5; 3) which corresponds with respect to the permuted radix vector (10; 9; 10) to the number 326 = 6+5 10+3 10 9. So the same permutation can give rise to quite dierent index-digit permutations depending on the radix vector. 2.2. Vectors
We interpret n-vectors x 2 C n as mappings which map the set of indices Zn into the complex numbers C : x : Zn ! C ; i 7! x(i); where x(i) is the i-th component of x. This might seem just like a fancy interpretation, however, it will lead to compact notation and better understanding of the permutations. We will denote vectors by the lower-case Roman letters x; y; z; u; v; e; f and sometimes write them as column vectors of their components. We will use the scalar product de ned in the usual way as (x; y) =
X
i2Zn
x(i)y(i);
where x(i) is the conjugate complex of x(i). We let ei 2 C n be the standard basis so that X x= x(i)ei : i2Zn n C s (s = 0; : : : ; t ? 1) we de ne the Kronecker
For a set of vectors xs 2 product of the xs which we write as
x = x0 xt?1 = x(i) =
tY ?1 s=0
t?1 O s=0
xs by
xs (is ); where (i0 ; : : : ; it?1 ) = ?n 1 (i)
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 513 of Numer. Math. 68: 507{547 (1994)
514
M. Hegland
for n = (n0 ; : : : ; nt?1 ) and n = jnj. (Sometimes this product is also denoted as xt?1 x0 .) The algebraic closure (the set of all linear combinations) of such products is C n . Permutations of vectors are de ned by permutations of the indices. As vectors are interpreted as mappings we can write a permutation of a vector as a composition of a permutation and a vector, i.e., as x . For example if 2 S3 with ((0); (1); (2)) = (2; 0; 1) and x 2 C 3 we get x((0)) ! x(2) ! x = x((1)) = x(0) : x((2)) x(1) It is easy to show that permutations are unitary, i.e., (x ; y ) = (x; y): The i-th component of x is x((i)), the (i)th component of x and so ei = e? (i) : 1
2.3. Matrices
Elements of the set L(C n ) of linear operators on C n are denoted by capital Roman letters. For any A 2 L(C n ) the matrix elements are A[i; j ] = (Aej )(i) = (ei ; Aej ); i; j 2 Zn: Two special matrices will be used frequently, namely In , the identity in C n and Fn 2 L(C n ), the discrete Fourier transform matrix with components being powers of the n-th root of unity, i.e., Fn [i; j ] = !nij ; i; j 2 Zn; where p !n = exp(?2 ?1=n) and exp is the exponential function. We de ne the Kronecker product A of a set of matrices As 2 L(C ns ) by
A
t?1 O s=0
!
xs =
and will denote it by
t?1 O s=0
A=
(As xs ); xs 2 C ns ; (s = 0; : : : ; t ? 1);
t?1 O s=0
As = A0 At?1 :
It can be easily checked that the matrix elements of the Kronecker product are given by
A[i; j ] =
tY ?1
s=0
As [is ; js ]
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 514 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
515
if (i0 ; : : : ; it?1 ) = ?n 1 (i) and (j0 ; : : : ; jt?1 ) = ?n 1 (j ). We will use various matrix families. They are all mappings from Nt into L(C n ) and will be denoted by T (n0; n1 ; n2 ; n3 ; n4 ); W (n0 ; n1 ; n2 ; n3 ); etc. Quite frequently we will make use of permutation matrices. They are generated by a permutation 2 Sn and are de ned by P ()x = x ?1 : The matrix elements of the permutations are ? 1 ? 1 P ()[i; j ] = (P ()ej )(i) = (ej )(i) = ej ( (i)) = 1; if (j ) = i 0; else The reason for using x ?1 instead of x is purely esthetic, we want to de ne x on the \non-permuted argument set". A consequence of this is the composition rule for transpositions: Let 1 ; 2 2 Sn . Then for any x 2 C n : P (1 2 )x = x (1 2 )?1 = x 2?1 1?1 = P (1 )(x 2?1 ) = P (1 )P (2 )x and so P (1 2 ) = P (1 )P (2 ): We will now look closer at a special transposition which we will use later. It is generated by 2 S5 with (1) = 3; (3) = 1; and (s) = s; for s = 0; 2; 4: As de ned earlier the action of on Z5 is with (i) = (i0 ; i3 ; i2 ; i1; i4 ); for i = (i0 ; i1 ; i2 ; i3 ; i4) 2 Z5: The index-digit permutation associated with is then ~n = (n) ?n 1 and we denote its action on C n with n = jnj by T (n0; n1 ; n2 ; n3 ; n4 ) = P (~n ): Now let y = P (~n )x = x n ?1 ? (1n) and set i = ?1 ? (1n) (j ); j 2 Zn: Then j = (n) ( (i)) and nally y (n) (i0 ; i3; i2 ; i1 ; i4 ) = x n (i0 ; i1 ; i2; i3 ; i4 ); which shows that if x and y are interpreted as multidimensional arrays (functions on C n and C (n) respectively) the permutation is just a transposition of the arrays. The implementation of such transpositions usually needs to have the vectors x and y in dierent storage locations. However, as in the case of matrix transpositions, the x and y can use the same storage locations if n1 = n3 . We call this a square transposition. FFT algorithms using only T (n0; n1 ; n2 ; n3 ; n4 ) with n1 = n3 can be implemented without the need for any extra workspace and will be called in-place. Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 515 of Numer. Math. 68: 507{547 (1994)
516
M. Hegland
The transpositions as de ned above also have been used to solve banded systems and there were termed \wrap-around-partitioning permutations" [Heg91]. A second matrix family is de ned by F (n0 ; n1 ; n2 ) = In Fn In : These matrices are just multiple Fourier transforms. Now let n = (n0 ; n1 ; n2 ; n3 ). Then we need a family of diagonal matrices with diagonal elements 1
0
2
W (n0 ; n1 ; n2 ; n3)[i; i] = (!n n )i i ; if (i0 ; i1 ; i2 ; i3 ) = ?n 1 (i); i 2 Zjnj ; where the diagonal elements are powers of the n1 n2 th root of unity. This matrix 1
2
1 2
is called weight matrix or twiddle-factor matrix. Usually multiplication by this matrix is implemented by using a precomputed table containing the necessary powers of !n . 2.4. Algorithms
Our algorithms will be formulated in a symbolic language which has some similarities with Matlab [Mat93], see also [Loa92]. We will introduce this notation here with an example which describes the generic Fourier transform. It is de ned by
Algorithm 2.1.
function y = gft(x; n) (*compute Fourier transform of x*) n 2 N ; x; y 2 C n ; ; 2 Sn y := P (1 )Fn P (0 )x
0
1
On the rst line, the name of the function, here gft, is speci ed and the arguments x; n and the result y are named. The second line is a comment which speci es the functionality of the algorithm and the third line speci es the type of arguments, results and extra data, here the permutations k . The fourth line de nes an assignment statement with the variable to be rede ned on the left side and an expression to rede ne this variable on the right. For this expression the same terminology as in the rest of the paper will be used. In the example the vector x is rst permuted then multiplied by the discrete Fourier transform matrix Fn and nally permuted again. Although not implemented in this way, the functionality of the generic Fourier transform is the one we will encounter. If P (0 ) = P (1 ) = In we call the transform self-sorting. Further functionality like some obvious control statements will be de ned as we go along. Often the functions will be recursive. Most of the functions de ned in this way have been translated to Matlab and tested. The ultimate test for performance of an algorithm is to code it and run it on the desired computer. Various properties of hard- and software used will in uence performance. It is useful to have indicators summarizing aspects of the algorithm to be tested, and con ned with some information about the hardware, allow comparisons of dierent algorithms. Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 516 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
517
A rst indicator which we denote by counts the number of oating point operations needed. For the generic Fourier transform, if we use the ordinary complex matrix-vector product without taking into account any special values (like 1) we need n2 complex multiplications and n(n ? 1) complex additions for a transform of size n, and we would set (gft; n) = 8n2 ? 2n: This indicator gives reasonable comparisons for sequential computers. For vector and parallel computers, however, other indicators are needed. Fourier transform algorithms contain essentially mappings of the sort x 7! (In A)x; A 2 L(C n ); x 2 C n n : Such mappings can be implemented as triple loops where the \innermost" loop is of length n0 and the two outer loops are of length n1 . Data access to x is with stride one in the innermost loop and the elements of A are constant in this loop. On a vector processor, the innermost loop would be the favoured candidate for vectorization. On a distributed memory parallel processor, data is distributed such that this innermost loop can be run in parallel and no communication is needed. Both vectorization and parallelization, however, are only ecient if n0 is large. Our Fourier transform algorithms consists of mappings x 7! F (n0 ; n1 ; n2 )x where n1 is small, x 7! W (n0 ; n1 ; n2 ; n3 )x and x 7! T (n0 ; n1 ; n2 ; n3 ; n4 )x: We assume that the rst and third mappings can be overlapped with mappings of the second kind. If the algorithm contains only one mapping of the second kind, 1
0
Algorithm 2.2.
0
1
function y = wft(x; n)
::: y := W (n0 ; n1 ; n2 ; n3 )x ::: we set our second indicator to ( 1 ; if n0 > 1 (wft; n) = n2 if n0 = 1. n 0 1
The factor 2 in the second case takes into account that two vectors usually have to be loaded in this case. We also desire our indicator to be additive, i.e., if an algorithm consists of two parts as in
Algorithm 2.3.
function y = paft(x; n) ::: y := part1(x; n) y := part2(y; n)
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 517 of Numer. Math. 68: 507{547 (1994)
518
we require
M. Hegland
(paft; n) = (part1; n) + (part2; n):
Since data from part2 has to be computed in part1 the two parts can not be done in parallel and thus additivity is also valid for parallel processing. Some times we will use the symbol f (n) = O(n) meaning that there is a constant such that f (n) Cn for all n.
3. A new FFT algorithm This section speci es a new algorithm to do fast Fourier transforms. Some propositions are stated about the performance of this algorithm and some performance data are supplied. Proofs of the propositions will be given in later sections. The number of oating point operations required is bounded by the number of operations needed for the traditional radix 2 algorithms. This can be up to 20% more than for higher radix or even split-radix [DH84] methods. On many platforms, however, the new algorithm performs faster as it essentially only accesses data with stride one and allows long vector lengths. The algorithm is described by the following pseudo-code:
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 518 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
519
Algorithm 3.1. function y = tft(x; n ; n ; n ) (*compute multiple Fourier transform, y = F (n ; n ; n )x*) n ; n ; n 2 N ; x;p y 2 Cn n n p := minfs 2 N j n =s 2 Ng if (p = n ) y := F (n ; p; n )x else p 0
0
1
1 0
2
2
1
0
2
1
2
1
1
0
2
q := n1 =p y := tft(x; n0 qp; q; n2 ) y := W (n0 ; qp; q; n2 )T (n0 ; q; p; q; n2)F (n0 q; p; qn2 )W (n0 q; p; q; n2 )y y := tft(y; n0 qp; q; n2 )
end if Let (1)
n1 =
tY ?1 i=0
pri i
be the prime factor decomposition of the order n1 of the Fourier transform. In a rst step our algorithm computes (2)
p=
Y
ri is odd
pi
the product of all primes which appear in odd powers in the factorization. If n1 is equal to p then its prime factor decomposition contains only mutually dierent prime factors. In this case x is multiplied by the matrix F (n0 ; p; n2 ). For this multiplication a separate algorithm has to be supplied (e.g., the prime factor algorithm [Tem83, Tem85]). If n1 is not equal to p then n1 =p = q2 is a square of an integer. The algorithm then proceeds by rst calling itself with order q < n1 , then doing some intermediate matrix vector multiplies, and nally calling itself again. Thus it both does decimation in time and decimation in frequency. The above de nes our algorithm in a recursive fashion. However, for the VP2200 we implemented this algorithm in a non-recursive Fortran 77 program. In order to get a better understanding of this algorithm, we will study how it proceeds in the case of n1 = 8. In the rst step it gets p = 2 and proceeds to the statements following \else". There it computes q = 2, then does an order 2 transform, then the intermediate step and a nal order 2 transform. For the order two transforms tft is called recursively, but as 2 is prime, we end up with p = 2 = n1 on this level. Thus essentially the following three steps are performed: y := F (4n0 ; 2; n2 )x y := W (n0 ; 4; 2; n2)T (n0 ; 2; 2; 2; n2)F (2n0 ; 2; 2n2)W (2n0 ; 2; 2; n2)y y := F (4n0 ; 2; n2 )y So in this case we have two multiplications with W , three order 2 transforms and one square transposition. The following proposition states that our algorithm does what we want it to do. Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 519 of Numer. Math. 68: 507{547 (1994)
520
M. Hegland
Proposition 3.1. Algorithm 3 de nes an in-place and self-sorting Fourier transform and for all x 2 C n with n = n n n the following holds: 0
1
2
(3) tft(x; n0 ; n1 ; n2 ) = F (n0 ; n1 ; n2 )x: The proof uses a new splitting formula. It will be given in Sect. 5. The following proposition states that our new algorithm is at least as fast as the traditional radix 2 algorithms in the case of n1 being a power of 2. The general mixed-radix case can be treated in a similar way. Proposition 3.2. The number of oating point operations needed for Algorithm 3 is for orders being a power of 2 bounded by (tft; n0 ; 2t ; n2 ) 5tn02t n2 : This theorem will be proved in Sect. 6. Compared to other algorithms which come close to operation counts of 4tn0 2t n2 this new algorithm doesn't seem to perform better than the usual ones. However, an indication for better performance is given by the following theorem giving an estimate for the loop-overhead indicator. We will compute estimates for this indicator for various other algorithms in Sect. 6 for comparison. As simple transforms (n0 = n2 = 1) are most crucial we give an estimate for this case. Proposition 3.3. Let n be the product of powers of some small prime numbers, e.g., 2,3 and 5. Then the innermost loop overhead indicator for Algorithm 3 is in the case of simple transforms bounded by p (4) (tft; 1; n; 1) = O( n?1 ): In the following we give some timing results for an implementation of our new algorithm on the VP 2200. In comparison we give the timings for the library routine DVCFT1 from Fujitsu's scienti c subroutine library SSL II [Fuj90] which implement specially tuned variants of the Pease and Stockham Algorithms. The highest MFlop rate is around 470. Here we only counted the operations done in the actual transform and not the ones needed for the table setup. If we take these in account also we get a performance around 500 MFlops. The reduction in computer time for our new algorithm seems to be rather small. However, DVCFT1 only achieves the performance displayed if some preliminary computations have been done. Furthermore it uses a large workspace array. Our new algorithm requires no such workspace.
Table 1. Time in microseconds and performance in MFlops of tft (double precision) n
1024 2048 4096 8192 16384 32768 65536 131072
time 380.0 640.0 740.0 1438.0 2507.0 5114.0 10603.0 22040.0
performance time for Stockham 119 173.0 157 406.0 299 821.0 336 1755.0 418 3814.0 442 8190.0 457 17617.0 470 37553.0
Finally, a word on parallel execution. In the core of the code of FFT algorithms one usually nds multiple loops. On vector computers the innermost of Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 520 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
521
these loops is vectorized as it gives stride-one data access. This is why, in our algorithm, we try to maximize the length of the innermost loop to get optimal vector performance. In the ordinary algorithms the outer most loop can be reasonably large as well. This is not the case for our algorithm where we try to minimize the outermost loop, which often has length one for our algorithm. Thus distribution of the outermost loop over dierent processors, as it is often done for the \classical" algorithms is not a feasible option here. Instead, we use the innermost loop again for distribution over processors. If a distributed memory is used, this means that the data has to by cyclically distributed. Note that the usage of the innermost loop for parallelization does not mean any disadvantage in comparison with the usage of the outermost loop as we essentially replace the outermost loop by a longer innermost loop. If the inner most loop is used for parallelization and vectorization, the inner most loop overhead indicator gives us information about vector and parallel performance.
4. Index digit permutations and Kronecker products This section does three things: In a rst subsection the question is studied when a product of two index-digit permutations is again an index-digit permutation. Then we will give a fundamental theorem showing how Kronecker products of matrices transform under certain index-digit permutations. Finally we will give some recursions for index-digit reversals. The rst two parts are fundamental to all FFT algorithms we will discuss here. Index-digit reversals are closely related to FFT algorithms and are used on there own in connection with the algorithms of Cooley-Tukey, Gentleman-Sande and Pease. A special instance of this are the bit-reversal algorithms. Some work on index-digit permutations was done in [Fra76], see also [Loa92]. 4.1. Product of two index-digit permutations
In general, a product of two index-digit permutations is a permutation but not an index-digit permutation. We will study conditions for the factors such that their product is an index-digit permutation. If this is the case we show how the product of two index-digit permutations can simply be computed as the product of the permutations of the corresponding index-digits. The following proposition is our basic \multiplication theorem". Proposition 4.1. For t 2 N let (s) 2 St; s = 1; 2; 3 be three permutations and (3) = (1) (2) . As usual, let (s) denote the action of the (s) on Zt and for n 2 N t let ~n(s) be the index-digit permutations generated by (s) and n, s = 1; 2; 3. Then with n0 = (1) (n) the composition ~n(2)0 ~n(1) is an index-digit permutation and ~n(2)0 ~n(1) = ~n(3) : Proof. This is a direct consequence of the de nitions and the associative law for compositions of mappings: Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 521 of Numer. Math. 68: 507{547 (1994)
522
M. Hegland
~n(2)0 ~n(1) = (n0 ) (2) ?n01 (n) (1) ?n 1 = (n) (3) ?n 1 = ~n(3) (2)
(1)
(3)
as on the second line the intermediate terms drop out because n0 = (1) (n) and (2) (1) = (1) (2) = (3) . ut Thus we can multiply index digit permutations simply by multiplying their de ning permutations (s) in reversed order if the radix vectors \match". However, in many cases this matching condition is too restrictive and in the following we will relax it slightly. First we just state a consequence of the \multiplication theorem:" Corollary 4.2. For t 2 N let ; 0 2 St ; be two permutations and 0 = ?1 . As usual, let ; 0 denote the action of the ; 0 on Zt and for any n 2 N t let ~n ; ~n0 be the index-digit permutations generated by ; 0 and n. Then the inverse of ~n is given by ? ?1 ~n = ~0 (n) : Proof. Set (1) = ; (2) = ?1 = 0 and apply Proposition 4.1 to get
~0 (n) ~n = ~n(3)
where ~n(3) is generated by (3) = 0 = id and n, so ~n(3) = id. ut Earlier we de ned a radix vector to be subordinate to another radix vector if the second radix vector can be partitioned into blocks in such a way that the product of the components of the blocks just form the components of the subordinate vector. Then the mixed radix digit representations are closely related. Consider as an example the number 1597. With respect to the radix vector (10; 10; 10; 10) its digits are 7; 9; 5 and 1. A subordinate radix vector is (100; 100) and with respect to this radix vector we get the \digits" 97 and 15. The mixed radix representation with respect to the original radix vector can be obtained by inserting the mixed radix representation of the \digits" with respect to the blocks (10; 10) of the original radix vector into the representation with respect to the subordinate radix vector. This might seem all very trivial but is nevertheless quite useful in the development of FFT algorithms. We formulate this in the following lemma for the general mixed-radix case. Lemma 4.3. Let n = (n0 ; : : : ; nt?1) 2 N be partitioned such that m = (jn0 j; : : :, jnt?1 j) is subordinate. Furthermore let the digit vector i = (i0 ; : : : ; it?1 ) 2 Zn be partitioned such that is 2 Zns. Then every digit vector j 2 Zm is given by
j = (n (i0 ); : : : ; nt? (it?1 )) 0
if and only if
1
m (j ) = n (i):
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 522 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
523
The proof simply inserts the de nition of the mixed radix representation n and uses its bijectivity. We will omit it here. Now any index digit permutation with respect to a subordinate radix vector also de nes an index digit permutation with respect to the original radix vector. Thus the number 9715 is obtained from 1597 by permuting the two \digits" 97 and 15 but also by permuting the digits 7; 9; 5 and 1. For the general mixedradix representation this is formulated in the following corollary to the previous lemma: Corollary 4.4. Let ~n be the index-digit permutation generated by the permutation and the radix vector n. Furthermore let n be subordinate to some n0 , i.e., n n0 . Then there exists an index-digit permutation ~n0 0 generated by a permutation 0 and the radix vector n0 such that ~n0 0 = ~n : Proof. Let n0 = (n00 ; : : : ; n0t?1 ) such that n = (jn00 j; : : : ; jn0t?1 j). Furthermore for any j 2 Zjnj = Zjn0 j let
i = (i0 ; : : : ; it?1 ) = ?n 1 (j ); i0 = (i00 ; : : : ; i0t?1 ) = n0 (j ): Finally let 0 be de ned by (j 0 ; : : : ; j t?1 ) = (j (0) ; : : : ; j (t?1) )
for any j 2 Zn0 . Then we get from the previous lemma is = n0s (i0s ); s = 0; : : : ; t ? 1 and so i(s) = n0 s (i0(s) ); s = 0; : : : ; t ? 1: Applying the previous lemma a second time results in (n) ( (i)) = 0 (n0 ) ( 0 (i0 )): ( )
From this the corollary follows by insertion of the de nitions of i and i0 . ut From this we see that the same index-digit permutation is generated by a whole class of pairs of permutations and radix vectors. Our basic multiplication theorem required the radix vectors of two indexdigit permutations to \match" in order to guarantee that the product is also an index digit permutation. The matching condition can now be generalized using the previous corollary.
Corollary 4.5. Let the two index-digit permutations ~n and ~m be generated by the radix vectors n; m 2 N and the permutations and . Furthermore let (n) m: Then the composition ~m ~n is an index-digit permutation as well generated (1)
(1)
(2)
(2)
(1)
(2)
by some permutation
(3)
(1)
and the radix vector n.
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 523 of Numer. Math. 68: 507{547 (1994)
524
M. Hegland
Proof. By the previous corollary there exists a permutation (3) such that (3) and the radix vector (1) (n) generate ~m(2) . The multiplication theorem shows then that the composition of this index-digit permutation with ~n(1) is an indexdigit permutation.
ut
With this we complete the introduction to index-digit permutation algebra and go on to the connection of index-digit permutations and Kronecker products. 4.2. Permutation of Kronecker products
The eect of index-digit permutations on Kronecker products is such that the factors of the product are permuted according to the generating permutation. A rst proposition shows this for Kronecker products of vectors. Here the usefulness of the interpretation of vectors as mappings becomes obvious as it simpli es the proof. Proposition 4.6. Let ~n be the index-digit permutation generated by a radix vector n = (n0 ; : : : ; nt?1 ) 2 N and a permutation 2 St . Then for xs 2 C ns ; s = 0; : : : ; t ? 1 the permuted Kronecker product of the xs is the same as the Kronecker product of the factors in permuted order or
P (~n ) Proof. Let
t?1 O s=0
xs =
x= Then by de nition
x n (i) =
tY ?1 s=0
t?1 O s=0
t?1 O s=0
x(s) :
xs :
xs (is ); for i = (i0 ; : : : ; it?1) 2 Zn:
Using this and the de nition of index-digit permutations we get for any j = (j0 ; : : : ; jt?1 ) 2 Z(n) (P (~n )x) (n) (j ) = x n ?1 (j ) =
tY ?1
s=0 tY ?1
xs (j? (s) ) 1
x(s) (js ) s=0 ! t?1 O = x(s) (n) (j ): s=0
=
As this is valid for any j the proof is complete. ut
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 524 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
525
A similar proposition holds for matrices and is a consequence of the previous proposition. Proposition 4.7. Let ~n be the index-digit permutation generated by a radix vector n = (n0 ; : : : ; nt?1 ) 2 N and a permutation 2 St . Then for As 2 L(C ns ); s = 0; : : : ; t ? 1 the permuted Kronecker product of the As is the same as the Kronecker product of the factors in permuted order multiplied by the permutation or t?1 t?1 O O P (~n ) As = A(s) P (~n ): s=0
s=0
Proof. By de nition, for any xs 2 C ns ; s = 0; : : : ; t ? 1 we have
A
t?1 O s=0
xs =
t?1 O s=0
(As xs )
and so, by the previous proposition we get
P (~n )
t?1 O s=0
As
!
= =
t?1 O
A(s) x(s)
s=0 t?1 O
s=0 t?1 O
A(s)
! !
t?1 O s=0
x(s)
!
t?1 O
!
xs : s=0 As the Kronecker products of the xs contain a basis of C jn j the proof is com=
pleted.
s=0
A(s) P (~n )
ut
Now we can apply Proposition 4.7 to our basic matrices T, W and F to get: Corollary 4.8. For any two integers p; q the following commutation relations hold: (5) F (p; q; 1)T (1; q; 1; p; 1) = T (1; q; 1; p; 1)F (1; q; p) and (6) W (1; p; q; 1)T (1; q; 1; p; 1) = T (1; q; 1; p; 1)W (1; q; p; 1): The proof just uses Proposition 4.7 and the previous de nitions. We will not give it here. However, this corollary will be used in the next section a various places.
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 525 of Numer. Math. 68: 507{547 (1994)
526
M. Hegland
4.3. Recursions for digit-reversals
In this subsection we are going to characterize index-digit reversals by index-digit permutations generated by transpositions. In the case of radix vectors of length three or less any index-digit permutation can be generated by a transposition as the following lemma shows. This is a consequence of the blocking idea of Corollary 4.4. Lemma 4.9. For every t 3, the index digit permutation ~n generated by 2 St and n = (n0 ; : : : ; nt?1 ) 2 N is either the identity or it can be generated by a radix vector m n and a transposition 2 S(m) or ~n = ~m : Proof. S1 only contains the identity and S2 only contains a transposition besides the identity. Obviously the lemma holds in these cases. The symmetric group S3 contains the identity, three transpositions and the two elements (1) and (2) with (1) (0) = 1; (1) (1) = 2; (1) (2) = 0 (2) (0) = 2; (2) (1) = 0; (2) (2) = 1 We will only give the proof for (1) as the proof for (2) is similar. Set m = (n0 ; n1 n2 ). Then m n and for the transposition 2 S2 there is a 0 2 S3 such that ~m = ~n0 if these two are the index-digit permutations generated by and m and by 0 and n respectively. This follows from Corollary 4.4. Furthermore, the proof of Corollary 4.4 shows that 0 = . ut Thus if the radix vector only has three components, a digit-reversal is also a transposition. This is not true for longer radix vectors. Using the same blocking idea we can show that any index-digit permutations generated by circular shifts can also be generated by transpositions. This is the content of the next lemma. Here the restriction to small t is not necessary. Lemma 4.10. For any t 2 N the index digit permutation ~n generated by a shift 2 St and n = (n0 ; : : : ; nt?1 ) 2 N can be generated by a radix vector m = (m0 ; m1 ) 2 N with m n and the transposition 2 S2 or ~n = ~m : Proof. As before it can be seen from Corollary 4.4 that our index-digit permutation is generated by the transposition in S2 and the radix vector m = (n0 nt?2 ; nt?1 ).
ut
Thus the index-digit permutations generated by circular shifts can be identi ed with the ones generated by the transposition in S2 . We will now continue by giving some recursions for more general index-digit reversals using transpositions. All our propositions on index-digit permutations could also be formulated for the matrix representations P (~n ). The following Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 526 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
527
recursions will be given as recursions for the representations as notation can be simpli ed. We will use the following lemma combining index-digit permutations with Kronecker products. Lemma 4.11. Let ~n and ~m0 be two index-digit permutations generated by the radix vectors n 2 N and m 2 N and the permutations and 0 respectively. Then P (~n ) P (~m0 ) = P (~k00 ) is an index-digit permutation generated by the radix vector k = (n; m) and the permutation 00 de ned by = 0; : : : ; (n) ? 1 00 (s) = (0 (ss);? (n)) + (n); ss = (n); : : : ; (n) + (m) ? 1. Proof. Let ns = 0; : : : ; (n) ? 1 xs 2 CC ms;? n ; ss = (n); : : : ; (n) + (m) ? 1. Then we get by inserting the de nition of the Kronecker product and its associativity: ( )
P (~n ) P (~m0 )
(m)?1 (n)+O s=0
2
xs =
P (~n )
4
2
(O n)?1
(O n)?1 s=0
3
3
2
xs 5 4P (~m0 ) 2
(m)?1 (n)+O
(m)?1 (n)+O
s=(n)
3
xs 5 3
x(s) 5 4 x0 (s?(n))+(n) 5 s=0 s=(n) (n)+O (m)?1 = x00 (s) : s=0 As the Kronecker products contain a basis of C jn j the proof is completed. ut =
4
A special case of this lemma will be frequently used where one of the indexdigit permutations is the identity, e.g., 2 S1 for which the lemma shows In P (~m0 ) = P (~k00 ); k = (n0 ; m): 0
In the following let 2 St always be the reversal permutation. Then the index-digit permutation generated by and a radix-vector n 2 N t will be denoted by ~n and is called the index-digit reversal generated by the radix vector n. In the case of n = (2; 2; : : : ; 2) (i.e., radix 2) the index-digit reversals are called bit-reversals as the digits are represented by one bit. The next proposition gives some basic recursions for index-digit reversals. They use the blocking idea of Corollary 4.4 and the multiplication theorem 4.1. They are very similar to the recursions for FFTs and can be used to get ecient implementations of the index-digit reversal.
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 527 of Numer. Math. 68: 507{547 (1994)
528
M. Hegland
Proposition 4.12. Let n = (n ; : : : ; nt? ) 2 N and ~n be the index-digit re0
1
versal generated by n. Then the following three sets of recursions hold: (1) If nl = (n0 ; : : : ; nt?2 ), nu = (n1 ; : : : ; nt?1 ) and ~nl and ~nu are the indexdigit reversals generated by nl and nu respectively one has: P (~n ) = P (~nu ) In T (1; n0; 1; jnu j; 1) = Int? P (~nl ) T (1; jnlj; 1; nt?1 ; 1) = T (1; n0; 1; jnu j; 1) In P (~nu ) = T (1; jnl j; 1; nt?1; 1) P (~nl ) Int? 0
1
0
1
(2) If nm = (n1 ; : : : ; nt?2 ) and ~nm is the index-digit reversal generated by nm we get P (~n ) = Int? P (~nm ) In T (1; n0; jnm j; nt?1 ; 1) = T (1; n0; jnm j; nt?1 ; 1) In P (~nm ) Int? 1
0
0
1
(3) If nl = (n0 ; : : : ; ns?1 ) and nu = (ns+1 ; : : : ; nt?1 ) for s 2 f1; : : : ; t ? 2g then i
h
h
i
Ijnu jns P (~nl ) T (1; jnl j; ns ; jnu j; 1) Ijnl jns P (~nu ) i i h h = P (~nu ) Ijnl j ns T (1; jnl j; ns ; jnu j; 1) P (~nl ) Ijnu jns :
P (~n ) =
Proof. The proof of all the equations proceeds in a similar way. We will only give the proof of the rst equation here. First we get from the previous lemma: P (~nu ) In = P (~m ) where the index digit permutation ~m is generated by the radix vector m = (nu ; n0) and the permutation 2 St with t ? 2 ? s; s=0,: : : ,t-2 (s) = t ? 1; s=t-1. By de nition, T (1; n0; 1; jnu j; 1) = P (~(n ;jnu j) ) where ~(n ;jnu j) is the index-digit permutation generated by the transposition 2 S2 and the radix-vector (n0 ; jnu j). As (n0 ; jnu j) n, there is by using the blocking idea, see Corollary 4.4, a permutation 0 2 St such that ~(n ;jnu j) = ~n0 : 0
0
0
0
Furthermore, as can be seen from the proof of Corollary 4.4, t ? 1; s = 0 0 (s) = s ? 1; s = 1; : : : ; t ? 1 and 0 (n0 ; nu ) = (nu ; n0 ) = m. As P is a homomorphism (see Sect. 2.3) we get
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 528 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
529
P (~nu ) In T (1; n0; 1; nu ; 1) = P (~m )P (~n0 ) = P (~m ~n0 ): It is easily checked that 0 = and so, by the multiplication Theorem 4.1 we 0
get
P (~m ~n0 ) = P (~n )
which completes the proof for the rst equation. ut
5. Recursions and algorithms In this section we will apply some of the ideas of the previous section to prove the new algorithm. We will introduce a new splitting formula which essentially combines two \classical" splitting steps with the Johnson-Burrus idea. We will also prove some theorems on oating point counts and loop overheads. First we restate the general splitting lemma. Then we give some not so well known splittings which are useful for the derivation of well-known algorithms. In order to do this we introduce some new matrix families. These families are generated by the recursions and some contain the \usual" Fourier transforms as end points. Although many of these algorithms are well-known their systematic development from splittings and matrix families is less so. Usually these algorithms are all developed from one splitting directly by using dierent data arrangements [Loa92, Swa87]. In this derivation, splittings and recursions mean pretty much the same and the whole derivation could be reformulated as matrix factorizations. One bene t of our approach is that it displays what is computed at every step. Thus switching between algorithms as is done in [WH90] becomes mathematically tractable. Furthermore, using the splitting formulas and the de nitions of the various families it is possible to code a fairly general FFT library using the best method available at any stage thus combining all the methods. In principle most of the matrix families are just some Kronecker products of permutations with the multiple Fourier transforms. We denote them by special symbols, however, in order to simplify notation. Note that the algorithm by Johnson and Burrus is special as it is not a combination of permutations and the ordinary transform. Thus switching between the Johnson-Burrus algorithm and the other algorithms might not be feasible. Our new algorithm, however, does not share this drawback and might be combined with algorithms using one-sided splitting. 5.1. Recursions for the ordinary Fourier transforms { the Stockham algorithm
We start our visit two dierent Fourier transforms with the Stockham algorithm. We choose this algorithm as it combines simplicity with versatility. It is thus the algorithm which usually is implemented in general purpose libraries of newer date.
Algorithm 5.1 (Stockham, decimation in time).
Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 529 of Numer. Math. 68: 507{547 (1994)
530
M. Hegland
function y = sft(x; n ; n ; n ) (* compute Fourier transform of x *) n ; n ; n 2 N ; y; x 2 C n n n ; p; q 2 N p := maxfs 2 R j n =s 2 N g 0
0
1
1 0
2
2
1
2
1
q := n1 =p y := x if (p < n1) y := sft(y; n0 p; q; n2 ) y := T (n0; p; 1; q; n2)W (n0 ; p; q; n2)y
end
y := F (n0 q; p; n2)y
The set R is the radix set. Examples of radix sets are the powers of 2, products of the powers of 2,3,5 and 7 or simply all prime numbers (more of theoretical interest.) The transform is de ned for all orders (7)
8 1) y = trft(y; p; nl ; nt?1 q; r) y = W (pq; jnl j; nt?1 ; r)T (p; nt?1 ; 1; jnl jq; r)y y = F (pjnl jq; nt?1 ; r)y else if ((n) = 1) y = T (p; q; 1; n0; r)F (p; q; n0 r)y
end
The following proposition tells us what the Pease algorithm computes: Proposition 5.10 (Pease). Algorithm 5.4 de nes a Fourier transform and (22) trft(x; p; n; q; r) = FTR(p; n; q; r)y: The proof of this proposition uses the splitting formula. Note that this transform is neither self-sorting nor in-place. Furthermore, for switching between a self-sorting algorithm and the Pease algorithm both digit-reversal and a transposition is needed. 5.5. A rst two-sided recursion for a partial reversed Fourier transform { the Johnson-Burrus algorithm
Now we will discuss the splitting used to prove a new algorithm by Johnson and Burrus [JB84]. This generalizes the concept of a Fourier transform further. In previous subsections we de ned matrix families which might be interpreted as orbits starting with the identity and ending with the desired Fourier transform and which are generated by recursions or splittings. The intermediate stages are permuted Fourier transforms as well. This is no longer the case for a new family or orbit we discuss in this section. However, the new orbit still starts with the identity and its nal point is a Fourier transform. Numerische Mathematik Electronic Edition { page numbers may dier from the printed version page 538 of Numer. Math. 68: 507{547 (1994)
A fast Fourier transform algorithm
539
Just for this section we need some extra notation which shall be included here. For any n = (n0 ; : : : ; nt?1 ) 2 N and m = (m0 ; : : : ; ms?1 ) 2 N we denote the concatenation by n m = (n0 ; : : : ; nt?1; m0 ; : : : ; ms?1 ) 2 N : The neutral element of this binary operation on N is the \empty array" which we will denote by () and we set j()j = 1: The matrix FR(p; n; q) can be de ned for n = () as
FR(p; (); q) := Ipq :
It is easy to check that this de nition allows the extension of the recursion for FR as: FR(p; (n0 ); q) = F (p; n0 ; q)W (p; n0 ; j()j; q)FR(p; (); n0 q): Finally we de ne P (~() ) = 1 and can equally extend the recursions from Proposition 4.12. We then de ne the matrix family FRR by: FRR(p; n; m; q) := FR(p; n m; q)FR(p; n; jmjq)?1 Ipjnj P (~m ) Iq ?1
Using the de nitions from above we get the special values FRR(p; n; (); q) = Ipjnjq and FRR(p; (); m; q) = F (p; jmj; q): Using splitting for FR (i.e. Lemma 5.4) we get furthermore FRR(p; n; (m0 ); q) = F (pjnj; m0 ; q)W (p; jnj; m0 ; q): Now we can proceed to obtain a recursion for FRR. Lemma 5.11 (Two-sided splitting for FRR). Let p; q 2 N, n = (n0 ; : : : ; nt?1) 2 N , m0 = (m0 ; : : : ; ms?1 ) 2 N and n0 = (n0 ; : : : ; nt?1 ; m0 ), m0 = (m1 ; : : : ; ms?2 ) and q = ms?1 q. Then FRR(p; n; m; q) = F (pjn0 m0 j; ms?1 ; q)W (p; jn0 m0 j; ms?1 ; q)FRR(p; n0 ; m0 ; q0 ) F (pjnj; m0 ; jm0 jq0 )W (p; jnj; m0 ; jm0 jq0 )T (pjnj; ms?1 ; jm0 j; m0 ; q):
Proof. The proof proceeds by establishing and combining recursions for the three factors of FRR. The recursion for the first factor uses the splitting for FR (Lemma 5.4):

FR(p, n ⊕ m, q) = F(p|n' ⊕ m'|, m_{s-1}, q) W(p, |n' ⊕ m'|, m_{s-1}, q) FR(p, n' ⊕ m', q').

For the second factor we apply Lemma 5.4 again:

FR(p, n', |m'|q') = F(p|n|, m0, |m'|q') W(p, |n|, m0, |m'|q') FR(p, n, |m|q);

by taking inverses and rearranging the factors we get

FR(p, n, |m|q)^{-1} = FR(p, n', |m'|q')^{-1} F(p|n|, m0, |m'|q') W(p, |n|, m0, |m'|q').

For the third factor we need Proposition 4.12 twice to get

P(m̃) = T(1, m0, 1, |m|/m0, 1) [I_{m0} ⊗ T(1, |m'|, 1, m_{s-1}, 1)] [I_{m0} ⊗ P(m̃') ⊗ I_{m_{s-1}}].
Now let k = (m0, |m'|, m_{s-1}) and k' = (m0, m_{s-1}, |m'|). Applying Corollary 4.4 we get

I_{m0} ⊗ T(1, |m'|, 1, m_{s-1}, 1) = P(k̃^{(1)}),

an index-digit permutation generated by k and σ^{(1)} with σ^{(1)}(0) = 0, σ^{(1)}(1) = 2, σ^{(1)}(2) = 1. Furthermore,

T(1, m0, 1, |m|/m0, 1) = P(k̃'^{(2)})

is an index-digit permutation generated by k' and σ^{(2)} with σ^{(2)}(0) = 1, σ^{(2)}(1) = 2, σ^{(2)}(2) = 0. By setting σ^{(3)} = σ^{(1)} ∘ σ^{(2)} we get, by applying the multiplication theorem 4.1,

T(1, m0, 1, |m|/m0, 1) [I_{m0} ⊗ T(1, |m'|, 1, m_{s-1}, 1)] = T(1, m0, |m'|, m_{s-1}, 1).
By taking the inverse and looking at multiple instead of simple permutations we finally get the following recursion for the third factor:

[I_{p|n|} ⊗ P(m̃) ⊗ I_q]^{-1} = [I_{p|n'|} ⊗ P(m̃') ⊗ I_{q'}]^{-1} T(p|n|, m_{s-1}, |m'|, m0, q).

The proof is completed by multiplying these three recursions and using the fact that the transposition involved happens to commute with some of the other factors. □
This lemma forms the basis of both the proof and the implementation of the Johnson-Burrus algorithm. Strictly speaking, this algorithm is neither decimation in time nor decimation in frequency, as the decimated transform sits "in between". For its derivation, however, a decimation in time recursion for FR and FR^{-1} was used, so we call this algorithm decimation in time. The corresponding decimation in frequency algorithm computes transforms with a matrix FRR^T, and its recursion is obtained by ordinary matrix algebra from the recursion we already have. We should finally mention that if the radix set chosen is "symmetric", i.e. m0 = m_{s-1} in our recursion, then the transposition needed is square and thus can be done in-place. This is the big advantage of this algorithm, which otherwise performs similarly to the Cooley-Tukey or Gentleman-Sande algorithms. Here is a recursive implementation of the Johnson-Burrus algorithm:
Algorithm 5.8 (Johnson-Burrus).
function y = jft(x, p, n, m, q)    (* compute Fourier transform of x *)
p, q ∈ N; n = (n0, …, n_{t-1}) ∈ N^t; m = (m0, …, m_{s-1}) ∈ N^s; x, y ∈ C^{p|n||m|q}
y := x
if (s > 1)
  n' := (n0, …, n_{t-1}, m0)
  m' := (m1, …, m_{s-2})
  q' := m_{s-1} q
  y := T(p|n|, m_{s-1}, |m'|, m0, q) y
  y := F(p|n|, m0, |m'|q') W(p, |n|, m0, |m'|q') y
  y := jft(y, p, n', m', q')
  y := F(p|n' ⊕ m'|, m_{s-1}, q) W(p, |n' ⊕ m'|, m_{s-1}, q) y
else if (s = 1)
  y := F(p|n|, m0, q) W(p, |n|, m0, q) y
end
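As a concrete illustration, the following Python sketch traces Algorithm 5.8 with the F, W and T helpers from the sketch given with the Stockham algorithm above. The argument lists, in particular the five-argument transposition, are my reading of Lemma 5.11, not the author's code; the top-level call uses n = ().

    import math
    import numpy as np
    # F, W, T as in the earlier sketch (multiple DFT, twiddle, transposition)

    def jft(x, p, n, m, q):
        # n, m are tuples of factors; jft(x, p, (), m, q) should compute F(p, |m|, q) x
        s = len(m)
        if s == 0:
            return x.copy()
        N = math.prod(n)
        if s == 1:
            return F(W(x, p, N, m[0], q), p * N, m[0], q)
        n1, m1, q1 = n + (m[0],), m[1:-1], m[-1] * q
        M1 = math.prod(m1)
        y = T(x, p * N, m[-1], M1, m[0], q)              # the single combined transposition
        y = F(W(y, p, N, m[0], M1 * q1), p * N, m[0], M1 * q1)
        y = jft(y, p, n1, m1, q1)
        NM = math.prod(n1) * M1
        return F(W(y, p, NM, m[-1], q), p * NM, m[-1], q)

    p, m, q = 2, (2, 3, 2), 3
    rng = np.random.default_rng(1)
    x = rng.standard_normal(p * math.prod(m) * q) * (1 + 0j)
    assert np.allclose(jft(x, p, (), m, q), F(x, p, math.prod(m), q))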
One way to interpret the Johnson-Burrus algorithm is as an application of two-sided splitting: it does two splitting steps at once and is thereby able to combine two transpositions into one, thus saving half of the transpositions. More important is the fact that the algorithm can be implemented in-place. A drawback of the Johnson-Burrus algorithm from a theoretical point of view is that it computes a rather complicated transform at intermediate stages, which is not even a Fourier transform. An interesting question is whether two-sided splittings also exist for other transforms. This is indeed the case, and these splittings lead to some efficient algorithms as well, as we will see in the next section.

5.6. A new two-sided splitting for the ordinary Fourier transform – our new algorithm
First we present a simple two-sided splitting for F.

Lemma 5.12 (Two-sided splitting for F). Let p, q and r be three integers and n1 = pqr. Then the following factorizations hold:

F(n0, n1, n2) = F(n0 pq, r, n2) W(n0, pq, r, n2) T(n0, r, q, p, n2) F(n0 r, q, pn2) W(n0 r, q, p, n2) F(n0 rq, p, n2)
             = F(n0 pq, r, n2) W(n0 p, q, r, n2) T(n0, r, q, p, n2) F(n0 r, q, pn2) W(n0, rq, p, n2) F(n0 rq, p, n2).

Proof. Applying the splitting Lemma 5.1 twice gives

F(n0, pqr, n2) = F(n0 pq, r, n2) T(n0, r, 1, pq, n2) W(n0, r, pq, n2) F(n0 rp, q, n2) T(n0 r, q, 1, p, n2) W(n0 r, q, p, n2) F(n0 rq, p, n2).

Now use the commutation formulas and move both transpositions one position towards the center. Thus we get in the center:
T(n0, r, 1, pq, n2) T(n0 r, q, 1, p, n2).
Using the definitions of the transpositions we get

T(n0 r, q, 1, p, n2) = P(ñ^{(1)}),

an index-digit permutation generated by n = (n0, r, q, p, n2) and σ^{(1)} with σ^{(1)}(0) = 0, σ^{(1)}(1) = 1, σ^{(1)}(2) = 3, σ^{(1)}(3) = 2, σ^{(1)}(4) = 4, and similarly

T(n0, r, 1, pq, n2) = P(m̃^{(2)}),

an index-digit permutation generated by m = (n0, r, p, q, n2) and σ^{(2)} with σ^{(2)}(0) = 0, σ^{(2)}(1) = 2, σ^{(2)}(2) = 3, σ^{(2)}(3) = 1, σ^{(2)}(4) = 4. With this we get from Lemma 4.1

T(n0, r, 1, pq, n2) T(n0 r, q, 1, p, n2) = P(m̃^{(2)}) P(ñ^{(1)}) = T(n0, r, q, p, n2).

Including this proves the first splitting formula of this lemma. The second one is obtained simply by taking the matrix transpose of the first one and renumbering.
□
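Since the proof is pure operator algebra, the factorizations can be checked numerically. The sketch below verifies the first formula with the F, W and T helpers introduced with the Stockham algorithm above (again under my assumed index conventions; the sizes n0, p, q, r, n2 are arbitrary small examples):

    import numpy as np
    # F, W, T as in the sketch given with the Stockham algorithm

    n0, p, q, r, n2 = 2, 3, 2, 4, 2        # n1 = p*q*r = 24
    rng = np.random.default_rng(2)
    x = rng.standard_normal(n0 * p * q * r * n2) * (1 + 0j)

    y = F(x, n0 * r * q, p, n2)            # rightmost factor acts first
    y = W(y, n0 * r, q, p, n2)
    y = F(y, n0 * r, q, p * n2)
    y = T(y, n0, r, q, p, n2)
    y = W(y, n0, p * q, r, n2)
    y = F(y, n0 * p * q, r, n2)

    assert np.allclose(y, F(x, n0, p * q * r, n2))   # first splitting formula of Lemma 5.12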
We can repeat the comment made for the Johnson-Burrus algorithm about decimation in time and frequency, as this splitting lemma can be viewed as doing one step of the Johnson-Burrus algorithm for n1 = pqr with radices p, q and r. Thus we may interpret (by abuse of terminology) the first formula as decimation in time and the second as decimation in frequency. Now we can implement these formulas in (at least) two ways, depending on how we interpret r, p and q. A first implementation was given in Algorithm 3.1. There we chose r = q; furthermore, p should be as small as possible. A different implementation is more similar to the Johnson-Burrus algorithm, as it does two radix steps at each instance of the recursion. In our Matlab-like notation this algorithm is:
Algorithm 5.9.
function y = taft(x, n0, n1, n2)    (* compute the Fourier transform of x *)
n0, n1, n2 ∈ N; x, y ∈ C^{n0 n1 n2}
p := max{s ∈ R | n1/s ∈ N}
r := max{s ∈ R | n1/(ps) ∈ N}
q := n1/(pr)
if (q < n1)
  y := W(n0 r, q, p, n2) F(n0 rq, p, n2) x
  y := taft(y, n0 r, q, pn2)
  y := F(n0 pq, r, n2) W(n0, pq, r, n2) T(n0, r, q, p, n2) y
else
  y := F(n0, n1, n2) x
end
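A direct transcription into Python/NumPy may look as follows. This is a hedged sketch under the same assumed conventions, reusing the F, W and T helpers from the sketch given with the Stockham algorithm; the radix set R = (2, 3, 5) is just an example, and the base case also covers orders that do not factor over R:

    import numpy as np
    # F, W, T as in the sketch given with the Stockham algorithm

    def taft(x, n0, n1, n2, R=(2, 3, 5)):
        divs = [s for s in R if n1 % s == 0]
        p = max(divs) if divs else 1
        rdivs = [s for s in R if n1 % (p * s) == 0]
        r = max(rdivs) if rdivs else 1
        q = n1 // (p * r)
        if q < n1:
            y = W(F(x, n0 * r * q, p, n2), n0 * r, q, p, n2)
            y = taft(y, n0 * r, q, p * n2, R)
            return F(W(T(y, n0, r, q, p, n2), n0, p * q, r, n2), n0 * p * q, r, n2)
        return F(x, n0, n1, n2)    # q = n1: the order is elementary (or 1)

    rng = np.random.default_rng(3)
    x = rng.standard_normal(2 * 24 * 2) * (1 + 0j)
    assert np.allclose(taft(x, 2, 24, 2), F(x, 2, 24, 2))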
This algorithm starts with short innermost loops which then grow. Note that if the radices come in pairs of equal numbers, this algorithm can also be done in-place. (The comments about equal radices apply especially to constant radix algorithms like the radix 2 algorithm.) An advantage of this algorithm is that at intermediate steps an in-order Fourier transform has been computed. This makes it easy to switch to a different algorithm. If the Johnson-Burrus algorithm is used, such switching is less feasible, as an intermediate step does not compute a Fourier transform. Switching is also possible for Algorithm 3.1. In a previous section we stated that Algorithm 3.1 defines an in-place and self-sorting Fourier transform. We can now use the two-sided splitting lemma to prove this:

Proof. The proof is by induction over u. Here u is the largest integer such that the order n1 of the transform contains a radix to the power 2^u as a factor. For any order n1 which does not contain any squared radices (i.e. u = 0) we have by definition tft(x, n0, n1, n2) = F(n0, n1, n2) x. Note that in this case separate implementations have to be provided. Now assume that the assertion holds for u, and let n1 contain a power 2^{u+1} of a prime number as a factor. In one sweep of the algorithm we compute q, which can contain at most a power 2^u of any prime number, as it is essentially the square root of n1. Thus we get

y = F(n0 qp, q, n2) W(n0, qp, q, n2) T(n0, q, p, q, n2) F(n0 q, p, qn2) W(n0 q, p, q, n2) F(n0 qp, q, n2) x.

The two-sided splitting lemma shows us that this is just F(n0, n1, n2) x. Thus Algorithm 3.1 is a self-sorting Fourier transform algorithm. As the transpositions are square, it can also be implemented in-place. □

Now an interesting question is whether such two-sided splitting formulas also exist for other matrix families. We will not discuss this question here.
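Algorithm 3.1 itself is not repeated here, but a hedged reconstruction of its square-split recursion (r = q, with p as small as possible) can again be checked numerically with the same helpers; the selection rule for q below is my reading of the text, not the author's code:

    import numpy as np
    # F, W, T as in the sketch given with the Stockham algorithm

    def tft(x, n0, n1, n2, R=(2, 3, 5)):
        # split n1 = q*p*q with q maximal and p = n1/q^2 either 1 or a radix
        cands = [k for k in range(2, n1 + 1)
                 if n1 % (k * k) == 0 and n1 // (k * k) in (1,) + R]
        if not cands:                         # no squared radix: elementary transform
            return F(x, n0, n1, n2)
        q = max(cands)
        p = n1 // (q * q)
        y = tft(x, n0 * q * p, q, n2, R)                  # first order-q pass
        y = F(W(y, n0 * q, p, q, n2), n0 * q, p, q * n2)  # middle radix-p step (trivial if p = 1)
        y = W(T(y, n0, q, p, q, n2), n0, q * p, q, n2)    # square transposition and twiddle
        return tft(y, n0 * q * p, q, n2, R)               # second order-q pass

    rng = np.random.default_rng(4)
    x = rng.standard_normal(3 * 32 * 2) * (1 + 0j)
    assert np.allclose(tft(x, 3, 32, 2), F(x, 3, 32, 2))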
6. Performance indicators

In this section we show some bounds for the algorithms developed earlier. A first subsection covers very briefly the floating point operation count. As this indicator has been studied in depth in the literature [Loa92], we mainly prove a proposition for our new algorithm in the case of radix 2. A second subsection discusses our new loop overhead indicator for various radix sets and all of the algorithms mentioned here.

6.1. Operation counts
Having established the Fourier transforms, we prove estimates for radix 2 transforms. The upper bound is the same for all Fourier transform routines displayed here and is given in the following proposition:
Proposition 6.1. Let n = 2^s and R = {2}. Furthermore assume that the setup of the elementary matrices like W does not cost anything. Then all the algorithms mentioned so far for the order n Fourier transform can be implemented such that they need at most n(log2(n) - 2)/2 complex multiplications and n log2(n) complex additions if n > 2.

Proof. The proof is slightly different for our new algorithm and for the other algorithms (including Johnson-Burrus and Algorithm 5.9). Both proofs use induction and start by noting that the estimate is valid for order n = 4. For the "old" algorithms one sweep consists of one application of the algorithm for order n/2, one twiddle factor multiplication (n/2 complex multiplications) and one transform of order 2 (n complex additions). Thus if the estimate is valid for order n/2, we get for order n at most n(log2(n/2) - 2)/2 + n/2 complex multiplications and n log2(n/2) + n complex additions, so the estimate is also valid for order n.

Now to our new algorithm. Assume that the estimate is valid for all orders up to n/2. We have to distinguish two cases. In the first case n is a square. Then the algorithm calls itself twice with order q = √n and has to do only one twiddle factor multiplication in addition, as p = 1. Thus it needs at most 2n log2(√n) complex additions and 2n(log2(√n) - 2)/2 + n complex multiplications, so in this case the estimate is also valid for n. In the second case n/2 is a square. Here our algorithm chooses q = √(n/2) as the order for the transforms which are recursively called, and p = 2. Thus it needs per step one radix 2 transform and two twiddle factor multiplications in addition to the calls to the lower order transforms. One of the twiddle factor multiplications can be implemented with only half the number of multiplications (the same as for the "traditional" algorithms). Thus an order n transform needs in this case

2n(log2(√(n/2)) - 2)/2 + n + n/2 complex multiplications

and

2n log2(√(n/2)) + n complex additions.

This completes the proof. □

Note that the recursive formulation of the algorithms simplified the proof of the operation counts. As a consequence of this general proposition we get the estimate for the floating point operations of our algorithm as stated in a previous section. Note that we only counted the operations needed for the actual matrix operations and not those for the table setup. In order to get correct MFlop rates, however, these table setups need to be taken into account as well.
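The two closing estimates are exactly the identities that make the induction work; they can be checked mechanically. The following sanity check is mine, not part of the paper:

    import math

    for s in range(2, 20):
        n = 2 ** s
        target_m = n * (math.log2(n) - 2) / 2          # claimed multiplication bound
        target_a = n * math.log2(n)                    # claimed addition bound
        # case 1: n a square, q = sqrt(n), p = 1
        assert math.isclose(2 * n * (math.log2(math.sqrt(n)) - 2) / 2 + n, target_m)
        assert math.isclose(2 * n * math.log2(math.sqrt(n)), target_a)
        # case 2: n/2 a square, q = sqrt(n/2), p = 2
        assert math.isclose(2 * n * (math.log2(math.sqrt(n / 2)) - 2) / 2 + n + n / 2, target_m)
        assert math.isclose(2 * n * math.log2(math.sqrt(n / 2)) + n, target_a)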
6.2. The loop overhead indicator
Finally we give bounds for the loop overhead indicator λ for the algorithms defined earlier. A first proposition shows bounds for the algorithms based on one-step splitting. A second proposition shows that these bounds are considerably smaller for our new two-step-splitting based algorithm.

Proposition 6.2. Let n = (n0, …, n_{t-1}) ∈ R^t ⊂ N^t be such that n_j = max{s ∈ R | n_j ⋯ n_{t-1}/s ∈ N}, let n = |n| and p, q, r ∈ N. Then the following bounds hold. For the Stockham algorithm,

λ(sft, p, n, q) ≤ 2/p.

The same holds for the decimation in frequency algorithm s̃ft. For the self-sorting Cooley-Tukey variant one gets

λ(csft, p, n, q) = (ν(n) - 1)/p if p > 1 and λ(csft, 1, n, q) ≤ 2,

and the same is valid for the transposed Stockham algorithm tsft and essentially also for the usual (permuted) Cooley-Tukey algorithm:

λ(cft, p, n, q) = (ν(n) - 1)/p if p > 1 and λ(cft, 1, n, q) ≤ 2.

For the decimation in frequency transposed Stockham algorithm we get

λ(t̃sft, p, q, |n|, r) = 2/(pq) if pq > 1 and λ(t̃sft, 1, 1, |n|, r) ≤ 1 + 1/|n|.

For the Pease algorithm one has

λ(trft, p, n, q, r) = (ν(n) - 1)/(pq) if pq > 1 and λ(trft, 1, n, 1, r) ≤ 2.

Finally, for the Johnson-Burrus algorithm one obtains

λ(jft, p, n, m, q) = (ν(n) - 1)/p if p > 1 and λ(jft, 1, n, m, q) ≤ 2.
Proof. All the bounds are proved in a very similar way, so we restrict ourselves to the proof of the bound for sft. It is a simple induction argument. First, the bound is valid for n ∈ R. Assume now that it is valid for any n ∈ R^s with 1 ≤ s < t. Then we get for n ∈ R^t:

λ(sft, p, |n|, q) ≤ 1/p + λ(sft, pn0, |n|/n0, q) ≤ 1/p + 2/(n0 p) ≤ 2/p

if p > 1, and

λ(sft, 1, |n|, q) ≤ 2/n0 + λ(sft, n0, |n|/n0, q) ≤ 2/n0 + 2/n0 ≤ 2

as n0 ≥ 2. □

One thing this proposition shows is that loop overheads can only be neglected for multiple Fourier transforms with very large multiplicity p. The situation is improved by our new algorithm; however, the bounds are slightly more difficult to establish. We will make frequent use of increasing integer sequences {m_j} with the following properties:

(23)  m0 ≥ 2
(24)  m_j^2 ≤ m_{j+1} ≤ r m_j^2 for j ∈ N.
Here r > 0 is a constant which controls the growth of {m_j}. For any sequence with these properties we get:

(25)  m_j ≤ √m_{j+1}
(26)  m_{j+1}/m_j ≤ √(r m_{j+1})
(27)  m_j ≥ 2^{2^j}
(28)  √m_j ≤ √m_{j+1} / 2^{2^{j-1}}.

We obtain Relation (25) by taking the square root of (24). Relation (26) is obtained if one premultiplies (24) by m_{j+1}/m_j^2 and takes the square root of the result. Relation (27) is valid for j = 0 because of (23) and can be shown to hold for all j by a standard induction argument. From this we get the last inequality of the list with Relations (24) and (27).

Let n be a product of powers of the radix set R. Then there is an s ∈ N and m0, …, m_s with m_s = n such that Relations (23) and (24) hold and

m_{j+1}/m_j^2 = min{ r_j ∈ N | √(m_{j+1}/r_j) ∈ N }.

The m_j defined in this way are just the orders of the intermediate Fourier transforms as defined by our algorithm, and r is the product of the elements of the radix set R.
These are the numbers used in the proof of the proposition giving bounds for λ(tft, p, n, q). We need the discrete Dirac function defined by δ(0) = 1 and δ(k) = 0 if k ≠ 0. With this we define an integer sequence {κ_j} by

(29)  κ0 = 0
(30)  κ_{j+1} = 2κ_j + m_{j+1} + (1 - δ(m_{j+1} - m_j^2)) m_{j+1}/m_j.

We will see how this sequence is closely related to our loop overhead indicator. The aim is to get bounds for this sequence. For this we need a sequence of real numbers {c_j} defined by

(31)  c0 = 0
(32)  c_{j+1} = 2^{1 - 2^{j-1}} c_j + √r + 2.

As the factors of c_j in this recursion form a null sequence, the c_j converge to √r + 2. Thus they are bounded, and as the convergence is very fast, √r + 2 is a good approximation for c_j even for moderate j. With these definitions we get the crucial lemma:

Lemma 6.3. Let the sequences m_j, κ_j and c_j be as defined above. Then

κ_j ≤ m_j + c_j √m_j.

Given that c_j is very well approximated by a constant, this lemma says that κ_j is essentially bounded by m_j + (√r + 2)√m_j if j is large enough.

Proof. The proof uses induction. The bound is obvious for κ0 = 0. Now assume it is valid for j. Then, by using Relations (25) to (28) and the recursion for c_j, we get:
κ_{j+1} = 2κ_j + m_{j+1} + (1 - δ(m_{j+1} - m_j^2)) m_{j+1}/m_j
        ≤ 2(m_j + c_j √m_j) + m_{j+1} + m_{j+1}/m_j
        ≤ m_{j+1} + (2 + 2 c_j 2^{-2^{j-1}} + √r) √m_{j+1}
        = m_{j+1} + c_{j+1} √m_{j+1}.
□
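The recursions (23)-(32) and the bound of the lemma can also be traced numerically. The following sketch is mine: it builds the intermediate orders m_j downwards from n (the splitting rule is my reconstruction of the text above) and then checks κ_j ≤ m_j + c_j √m_j; the radix set is just an example.

    import math

    R = (2, 3)        # example radix set
    r = 6             # product of its elements

    def split(m):
        # smallest r_j such that m/r_j is a perfect square with root >= 2
        for rj in range(1, m + 1):
            if m % rj == 0:
                root = math.isqrt(m // rj)
                if root >= 2 and root * root == m // rj:
                    return root
        return None

    n = 2 ** 10 * 3 ** 4
    ms = [n]
    while (root := split(ms[0])) is not None:
        ms.insert(0, root)                    # ms = (m_0, ..., m_s) with m_s = n

    kappa, c = [0], [0.0]
    for j in range(len(ms) - 1):
        extra = 0 if ms[j + 1] == ms[j] ** 2 else ms[j + 1] // ms[j]
        kappa.append(2 * kappa[j] + ms[j + 1] + extra)                    # recursion (30)
        c.append(2.0 ** (1 - 2.0 ** (j - 1)) * c[j] + math.sqrt(r) + 2)   # recursion (32)

    for j, m in enumerate(ms):
        assert kappa[j] <= m + c[j] * math.sqrt(m)                        # Lemma 6.3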
This lemma is used in the proof of the bound for the loop overhead as put forward in the next proposition:

Proposition 6.4. Let c_s be the constants as defined earlier in this section. Then, if p > 1,

λ(tft, p, n, q) ≤ p^{-1} (1 + c_s/√n)

and

λ(tft, 1, n, q) ≤ (2 + c_s)/√n.
Proof. Let m_s = n and let the m_j be the orders of the intermediate transforms of our algorithm as defined earlier. From the definition of tft we first get the recursion

λ(tft, pn/m_j, m_j, q) = m_j/(pn) + (1 - δ(m_j - m_{j-1}^2)) m_j/(pn m_{j-1}) + 2λ(tft, pn/m_{j-1}, m_{j-1}, q)

for j = 1, …, s-1, and λ(tft, pn/m0, m0, q) = 0. Thus λ(tft, pn/m_j, m_j, q) = κ_j/(pn) for j = 0, …, s-1. For j = s we have to distinguish two cases. First, if p > 1 we get the same relations as above and so λ(tft, p, n, q) = κ_s/(pn), from which the bound to prove follows with the previous lemma. If p = 1 we have

λ(tft, 1, n, q) = 2m_{s-1}/n + 1/m_{s-1} + 2λ(tft, m_{s-1}, n/m_{s-1}, q) = 2m_{s-1}/n + 1/m_{s-1} + 2κ_{s-1}/n

and we use our lemma once again to prove the bound in this case. □

Note that slightly sharper bounds might be obtained if the recursion is used and the bound of the lemma is only inserted for κ_{s-1} instead of κ_s. Thus we can learn several things. First look at the case of multiple transforms, p > 1. If p is very large, then loop overhead is not a problem and all methods work equally well. If, however, p is around 10 and n is large, then loop overhead becomes important. In this case the Stockham algorithm or our new algorithm would be preferred. For the Cooley-Tukey, transposed Stockham and Johnson-Burrus algorithms one might need to vectorize over the second loop, although this would mean more data movement or even non-unit stride. This is certainly not a preferable situation.

The situation is even more drastic when p = 1. In this case the first class of algorithms has to vectorize over the second loop, as the first loop is always of length 1. The Stockham algorithm would have to vectorize over the second loop in some cases. Note that if half of the sweeps were vectorized over the second loop, then the loop overhead would be proportional to n^{-1/2}. However, as already mentioned, in the second loop some items which were constant in the first loop become vectors, and in some cases non-unit stride has to be accepted. Thus it is preferable to have fewer than half of the sweeps vectorizing over the second loop, which means higher loop overhead. For our new algorithm we only need to vectorize over the second loop in one case to get loop overhead of order n^{-1/2}.

Acknowledgement. The implementation on the VP2200 was done in cooperation with J. Jenkinson and M. Dow of the ANU supercomputer facility. This research was done as part of the Area 4 project for the development and implementation of numerical algorithms on vector and parallel computers. It is a joint effort by the Australian National University and Fujitsu Ltd., Japan. The algorithm will be included in SSL II, the Scientific Subroutine Library of Fujitsu [Fuj90], which is available for a wide variety of platforms. I would like to thank Wesley Petersen for various suggestions which have made the paper more understandable.
References

[AL88] Ashworth, M., Lyne, A.G.: A segmented FFT algorithm for vector computers. Parallel Comput. 6, 217-224 (1988)
[Bai88] Bailey, D.H.: A high-performance FFT algorithm for vector supercomputers. Int. J. Supercomputer Appl. 2(1), 82-87 (1988)
[CCF+67] Cochran, W.T., Cooley, J.W., Favin, D.L., Helms, H.D., Kaenel, R.A., Lang, W.W., Maling, G.C., Nelson, D.E., Rader, C.M., Welch, P.D.: What is the fast Fourier transform? IEEE Trans. Audio Electroacoust. AU-15, 45-55 (1967)
[CT65] Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comp. 19, 297-301 (1965)
[DH84] Duhamel, P., Hollmann, H.: Split radix FFT algorithms. Electron. Lett. 20, 14-16 (1984)
[DV90] Duhamel, P., Vetterli, M.: Fast Fourier transforms: a tutorial review and a state of the art. Signal Process. 19(4), 259-299 (1990)
[Fra76] Fraser, D.: Array permutation by index-digit permutation. J. ACM 23(2), 298-309 (1976)
[Fuj90] Fujitsu: FACOM OS IV SSL II user's guide, 99SP0050E5, 1990
[GS66] Gentleman, W.M., Sande, G.: Fast Fourier transforms, for fun and profit. Proc. 1966 Fall Joint Computer Conference, AFIPS 29, pp. 563-578, 1966
[Heg91] Hegland, M.: On the parallel solution of tridiagonal systems by wrap-around partitioning and incomplete LU factorization. Numer. Math. 59(5), 453-472 (1991)
[JB84] Johnson, H.W., Burrus, C.S.: An in-place, in-order radix-2 FFT. Proc. IEEE ICASSP, 1984, p. 28A.2
[JJRT90] Johnson, J.R., Johnson, W.R., Rodriguez, D., Tolimieri, R.: A method for design, modification and implementation of FFT algorithms on various architectures. Circuits Systems Signal Process. 9(4), 449-500 (1990)
[Knu81] Knuth, D.E.: The art of computer programming, vol. 2. Addison-Wesley, 1981
[Loa92] Van Loan, C.: Computational frameworks for the fast Fourier transform. SIAM, 1992
[Mat93] The MathWorks Inc.: MATLAB user's guide, 1993
[Pea68] Pease, M.C.: An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252-264 (1968)
[Pet83] Petersen, W.P.: Vector Fortran for numerical problems on CRAY-1. Comm. ACM 26(11), 1008-1021 (1983)
[Swa87] Swarztrauber, P.N.: Multiprocessor FFTs. Parallel Comput. 5, 197-210 (1987)
[Tem83] Temperton, C.: Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys. 52, 1-23 (1983)
[Tem85] Temperton, C.: Implementation of a self-sorting in-place prime factor fast Fourier transform. J. Comput. Phys. 58, 283-299 (1985)
[Tem91] Temperton, C.: Self-sorting in-place fast Fourier transforms. SIAM J. Sci. Statist. Comput. 12(4), 808-823 (1991)
[WH90] Waelde, W., Haan, O.: Performance of fast Fourier transforms on vector computers. Supercomputer 40, 42-49 (1990)

This article was processed by the author using the LaTeX style file cljour1 from Springer-Verlag.