Twiddle-Factor-Based FFT Algorithm with Reduced Memory Access

Yingtao Jiang Department of Electrical & Computer Engineering University of Nevada, Las Vegas Las Vegas, NV 89119 USA yingtao@eng.unlv.edu

Ting Zhou ASIC Design Division Gennum Corporation Kanada, Ontario Canada

Abstract

In microprocessor-based systems, memory access is expensive because of its long latency and high power consumption. In this paper, we present a novel FFT algorithm that reduces the number of memory accesses as well as the number of multiplications. For an N-point FFT, we organize the computation into two distinct sections. (1) The first section computes the butterflies involving the twiddle factors $W_N^j$ ($j \neq 0$) through a computation/partitioning scheme similar to Huffman coding. In this section, all butterflies sharing the same twiddle factor are clustered and computed together, so that redundant memory accesses to load twiddle factors are avoided. (2) In the second section, the remaining $(N - 1)$ butterflies, which involve the twiddle factor $W_N^0$, are computed with a register-based breadth-first tree traversal algorithm. The twiddle-factor-based FFT is tested on the TI TMS320C62x digital signal processor. The results show that, for a 32-point FFT, the new algorithm requires up to 20% fewer clock cycles and, on average, 30% fewer memory accesses than the conventional DIF FFT.

1. Introduction

In the field of digital signal processing, the Discrete Fourier Transform (DFT) plays an important role in the analysis, design, and implementation of discrete-time signal-processing algorithms and systems [1]-[10][12][20]. For instance, the DFT can be used to calculate a signal's frequency spectrum, to find a system's frequency response from its impulse response, and to serve as an intermediate step in more elaborate signal processing techniques. The Fast Fourier Transform (FFT) is a class of efficient algorithms for computing the DFT. FFT algorithms are based on the fundamental principle of decomposing the computation of the DFT of a sequence of length N into successively smaller DFTs, all with comparable improvements in computational speed.

Yiyan Tang and Yuke Wang Department of Computer Science University of Texas at Dallas Richardson, TX 75083 USA {yiyan, yuke}@utdallas.edu

The study of FFT algorithms has a long history and a large bibliography, and it remains an exciting research field in which new results are put to use in practical applications. Efficient FFT algorithms were first discovered by Gauss [7], and later by Runge and König [13]. The importance of FFT algorithms was not fully recognized until their rediscovery by Cooley and Tukey [4] in the 1960s. Since then, research on the FFT has proliferated; to name a few results, there are higher-radix algorithms [2], mixed-radix [15], prime-factor [8], Winograd (WFTA) [20], the split-radix Fourier transform algorithms [16][17], the recursive FFT algorithm [19], and the combination of decimation-in-time and decimation-in-frequency FFT algorithms [14]. The structures of these FFT computations are all organized in the way defined in [4]. There are many ways to measure the complexity and efficiency of the proposed FFT algorithms, and a final assessment depends on both the available technology and the intended application. However, careful analysis shows that the previously proposed approaches share a memory access problem. For instance, unless the processor on which the FFT runs provides a large number of registers, repeated accesses to memory to load the same twiddle factors are unavoidable. It has been recognized that memory access is expensive because of its long latency and high power consumption. In this paper, we propose an algorithm that removes the redundant memory accesses in the calculation of the DFT. For an N-point FFT, we consider two distinct cases:

$W_N^0$ and $W_N^j$ ($j \neq 0$). The FFT structure is, therefore, organized as two concatenated sections. The first section computes the butterflies involving the twiddle factors $W_N^j$ ($j \neq 0$). In this section, once a twiddle factor $W_N^j$ is loaded, it is used until its value is no longer needed in the subsequent computation. In this way, we show that an N-point Radix-2 FFT needs only $(N/2 - 1)$ memory accesses to load twiddle factors, whereas classical approaches may require $(N - 1)$. The power saving can be quite significant when N is a

0-7695-1573-8/02/$17.00 (C) 2002 IEEE

very large number. In the second section, which computes the remaining butterflies involving the twiddle factor $W_N^0$ (a total of $(N - 1)$ butterflies), the main concern is to construct a tree structure that minimizes the number of read/write operations needed to store intermediate results. To this end, we propose a breadth-first traversal algorithm. As $W_N^0 = 1$, no multiplication is needed for these $(N - 1)$ butterflies. This twiddle-factor-based algorithm leads to efficient implementations and suits a wide range of applications, such as low-power, high-performance ASIC designs. We test the proposed algorithm on the TI TMS320C62x fixed-point digital signal processor (DSP). The experimental results show that the new algorithm requires fewer clock cycles to compute the N-point FFT than conventional FFT approaches. Furthermore, we expect the power consumption of the new approach to be significantly lower than that of conventional FFT schemes, owing to the reduction in power-hungry memory accesses and multiplications. The rest of this paper is organized as follows. In section 2, the conventional Radix-2 FFT algorithm is briefly reviewed. The new twiddle-factor-based FFT algorithm is described in section 3. Some practical issues are addressed in section 4. Experimental results are presented in section 5. The conclusions are summarized in section 6.

2. Discrete Fourier Transform and FFT

The Discrete Fourier Transform (DFT) of a discrete signal $x(n)$ can be directly computed as

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (3)$$

where $W_N = e^{-j 2\pi / N} = \cos\frac{2\pi}{N} - j \sin\frac{2\pi}{N}$ is known as the phase or twiddle factor, and $j^2 = -1$. Here $x(n)$ and $X(k)$ are sequences of complex numbers. An efficient method of computing the DFT that significantly reduces the number of required arithmetic operations is called the FFT [1]-[10][12]-[20]. An FFT algorithm divides the DFT calculation into many short-length DFTs, which results in large savings in computation. If the DFT length is $N = R^v$, i.e., a product of identical factors, the corresponding FFT algorithms are called Radix-R algorithms. Assume the FFT length is $2^M$, where M is the number of stages. The radix-2 DIF FFT divides an N-point DFT into two N/2-point DFTs, then into four N/4-point DFTs, and so on. That is, the radix-2 DIF FFT expresses the DFT equation as two summations and then divides it into two equations, each of which computes every other output sample. To arrive at a two-point DFT decomposition, let $W_N^{2nr} = W_{N/2}^{nr}$; the following equations are then derived:

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \frac{N}{2}\right) \right] W_{N/2}^{nk}, \quad k = 0, 1, \ldots, \frac{N}{2} - 1 \qquad (4)$$

$$X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x\!\left(n + \frac{N}{2}\right) \right] W_N^{n} W_{N/2}^{nk}, \quad k = 0, 1, \ldots, \frac{N}{2} - 1 \qquad (5)$$

These equations are frequently represented in butterfly format. The butterfly of a Radix-2 algorithm is shown in Fig. 1.a. The complete flow graph of an N-point Radix-2 FFT can be constructed by applying the basic butterfly structure (Fig. 1.a) recursively, where $N = 2, 4, 8, \ldots$ An N-point Radix-2 FFT has $\log_2 N$ stages. Within stage s, for $s = 1, 2, \ldots, \log_2 N$, there are $N / 2^s$ groups of butterflies, with $2^{s-1}$ butterflies per group. The computation of the 8-point DFT, for instance, can be accomplished by the algorithm depicted in Fig. 1.b.

[Fig. 1: (a) butterfly of the Radix-2 algorithm, with inputs $x(n)$, $x(n + N/2)$, outputs $y(n)$, $y(n + N/2)$ and twiddle factor $W_N^0$; (b) flow graph of the 8-point FFT, with inputs $x(0), \ldots, x(7)$ in natural order, twiddle factors $W_8^0$ to $W_8^3$, and outputs in the order $X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7)$.]

Fig. 1 Flow graph of FFT
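As a small numerical check of Eqs. (3) to (5), the following C sketch evaluates the DFT directly and then through a single DIF split. It is only an illustration, not part of the original paper: the function names `dft_direct` and `dft_dif_split`, the limit `MAX_POINTS`, and the convention of passing a full table of $W_N^m$ values are assumptions made here. Data and twiddle factors use the interleaved real/imaginary layout of the paper's listings.

```c
#include <assert.h>

#define MAX_POINTS 64   /* hypothetical size limit for this sketch */

/* Direct DFT, Eq. (3). x and X hold n complex points as interleaved
 * (re, im) pairs. w holds W_wn^m for m = 0..wn-1 in the same layout,
 * where wn is a multiple of n, so that W_n^m == W_wn^(m * wn / n). */
void dft_direct(int n, const double *x, double *X, int wn, const double *w)
{
    assert(n >= 1 && wn % n == 0);
    int stride = wn / n;
    for (int k = 0; k < n; k++) {
        double re = 0.0, im = 0.0;
        for (int m = 0; m < n; m++) {
            int e = ((m * k) % n) * stride;          /* exponent in the table */
            double wr = w[2 * e], wi = w[2 * e + 1];
            re += x[2 * m] * wr - x[2 * m + 1] * wi;
            im += x[2 * m] * wi + x[2 * m + 1] * wr;
        }
        X[2 * k] = re;
        X[2 * k + 1] = im;
    }
}

/* One decimation-in-frequency split, Eqs. (4) and (5): the even outputs are
 * the N/2-point DFT of x(n) + x(n + N/2), the odd outputs are the N/2-point
 * DFT of [x(n) - x(n + N/2)] * W_N^n. w holds W_N^m for m = 0..N-1. */
void dft_dif_split(int n, const double *x, double *X, const double *w)
{
    assert(n >= 2 && n % 2 == 0 && n <= MAX_POINTS);
    int h = n / 2;
    double sum[MAX_POINTS], dif[MAX_POINTS];
    double even[MAX_POINTS], odd[MAX_POINTS];

    for (int m = 0; m < h; m++) {
        sum[2 * m]     = x[2 * m]     + x[2 * (m + h)];
        sum[2 * m + 1] = x[2 * m + 1] + x[2 * (m + h) + 1];
        double dr = x[2 * m]     - x[2 * (m + h)];
        double di = x[2 * m + 1] - x[2 * (m + h) + 1];
        dif[2 * m]     = dr * w[2 * m]     - di * w[2 * m + 1];
        dif[2 * m + 1] = dr * w[2 * m + 1] + di * w[2 * m];
    }
    dft_direct(h, sum, even, n, w);   /* Eq. (4) */
    dft_direct(h, dif, odd, n, w);    /* Eq. (5) */
    for (int k = 0; k < h; k++) {
        X[4 * k]     = even[2 * k];
        X[4 * k + 1] = even[2 * k + 1];
        X[4 * k + 2] = odd[2 * k];
        X[4 * k + 3] = odd[2 * k + 1];
    }
}
```

For N = 4 the twiddle table is exactly {1, -j, -1, j}, so both routines can be compared without rounding error; applying the split recursively to the two half-length DFTs gives the flow graph of Fig. 1.b.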

3. The New Twiddle-Factor-Based FFT Algorithm

It can be seen from Fig. 1.b that, unless a sufficiently large number of registers is available, the twiddle factor $W_8^2$ will in most practical situations be loaded from memory to the CPU twice, once in Stage 1 and once in Stage 2. Such redundant memory access also occurs when loading other twiddle factors and therefore becomes a serious problem when computing a large FFT. In this section, we present the twiddle-factor-based FFT algorithm, which reduces the number of memory accesses as well as the number of multiplications.
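The redundancy described above can be quantified with a short C sketch. It assumes a simple cost model that is not spelled out in the paper: a conventional stage-by-stage DIF FFT fetches each twiddle factor once per stage in which it occurs and does not keep it across stages. The function names are hypothetical.

```c
#include <assert.h>

/* Number of times W_N^j is fetched by a stage-by-stage radix-2 DIF FFT,
 * under the assumed model that a fetched factor is reused within a stage
 * but is not kept in a register from one stage to the next. */
int dif_loads_of_twiddle(int n, int j)
{
    assert(n >= 2 && (n & (n - 1)) == 0);   /* n must be a power of two */
    assert(j >= 0 && j < n / 2);
    int loads = 0;
    int scale = 1;   /* stage s uses the exponents p * 2^(s-1) */
    for (int half = n / 2; half >= 1; half /= 2, scale *= 2) {
        for (int p = 0; p < half; p++) {
            if (p * scale == j) {
                loads++;   /* W_N^j occurs in this stage: one fetch */
            }
        }
    }
    return loads;
}

/* Total number of twiddle fetches over all N/2 distinct factors. */
int dif_total_twiddle_loads(int n)
{
    int total = 0;
    for (int j = 0; j < n / 2; j++) {
        total += dif_loads_of_twiddle(n, j);
    }
    return total;
}
```

Under this model `dif_loads_of_twiddle(8, 2)` is 2, matching the two loads of $W_8^2$ noted above, and the total for N points is N/2 + N/4 + ... + 1 = N - 1, although only N/2 distinct twiddle factors exist.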


Theorem 1: The total number of butterflies involving the twiddle factor $W_N^0$ is $N - 1$ for an N-point Radix-2 FFT.

Proof: The proof is carried out on an N-point DIF (Decimation-In-Frequency) FFT. At stage 1, there is only one butterfly requiring $W_N^0$. At stage 2, there are two butterflies requiring $W_N^0$. In general, at stage k there are $2^{k-1}$ butterflies requiring $W_N^0$. There are $\log_2 N$ stages in total in the FFT structure. Therefore, the total number of butterflies that require $W_N^0$ is $1 + 2 + \ldots + 2^{\log_2 N - 1} = N - 1$.

Theorem 2: The total number of butterflies involving the same twiddle factor $W_N^j$ ($j \neq 0$) is $2^{k+1} - 1$ for an N-point Radix-2 FFT, where $\frac{j}{2^k} \bmod 2 = 1$ and $\frac{j}{2^{k-1}} \bmod 2 = 0$, $k = 0, 1, 2, \ldots, \log_2 N - 1$.

Proof: (1) If $j = 1, 3, 5, 7, \ldots, N/2 - 1$, $W_N^j$ will only appear in the first stage. Similarly, from Eqs. (5) and (6), we can see that when $\frac{j}{2^k} \bmod 2 = 1$ and $\frac{j}{2^{k-1}} \bmod 2 = 0$, $k = 0, 1, 2, \ldots$, the appearance of $W_N^j$ ($j \neq 0$) will span from Stage 1 up to Stage $k + 1$. (2) A twiddle factor $W_N^j$ ($j \neq 0$) appears in the first stage once; it appears in the second stage twice, as there are two butterflies requiring this twiddle factor. This continues to the (k+1)th stage, where $2^k$ butterflies share this twiddle factor. Therefore, the total number of butterflies that require $W_N^j$ ($j \neq 0$) is $1 + 2 + \ldots + 2^k = 2^{k+1} - 1$.

The C-like pseudo code of the proposed twiddle-factor-based algorithm is shown in Fig. 2. For an N-point FFT, this algorithm consists of two concatenated sections that deal with two distinct cases: $W_N^0$ and $W_N^j$ ($j \neq 0$).

Section 1: In the first section of the FFT structure, the butterflies involving twiddle factors $W_N^j$ ($j \neq 0$) are computed. In this section, the major concern is to minimize the number of memory accesses needed to load the twiddle factors. That is, once a twiddle factor $W_N^j$ is loaded, it is used repeatedly until its value is no longer needed in the subsequent computation. For an N-point FFT, the binary index of a data sample has the form $(B_k B_{k-1} \ldots B_0)$, where $k = \log_2 N - 1$. All butterflies involving the twiddle factor $W_N^j$ ($j \neq 0$) will be computed at super-stage $(k + 1)$, such that $\frac{j}{2^k} \bmod 2 = 1$ and $\frac{j}{2^{k-1}} \bmod 2 = 0$, $k = 0, 1, 2, \ldots, (\log_2 N - 1)$. The proposed twiddle-factor-based algorithm can be viewed as a skewed version of the popular DIF FFT structure. We therefore use the term "super-stage" to reflect the fact that at each super-stage ss, the butterflies to be computed span stages $1, 2, \ldots, ss$ of the classical DIF FFT. Section 1 consists of $(\log_2 N - 1)$ super-stages.

(1) At the first super-stage, all the data samples with binary indices $(B_k B_{k-1} \ldots B_2 B_1 1)$ are computed. Among these $(N/2)$ data samples, any two with indices $(B_k B_{k-1} \ldots B_2 B_1 1)$ and $(\bar{B}_k B_{k-1} \ldots B_2 B_1 1)$ can pair together to form a butterfly (the overbar denotes the complemented bit). The twiddle factor involved in this butterfly is $W_N^j$, where j corresponds to the decimal value of the binary $(0 B_{k-1} \ldots B_2 B_1 1)$. In total, $(N/4)$ butterflies are computed in this super-stage.

(2) At the second super-stage, all the data samples with binary indices $(B_k B_{k-1} \ldots B_2 1 0)$ and $(B_k B_{k-1} \ldots B_2 B_1 1)$ are computed. There are $(N/4 + N/2)$ data samples falling into this category. Any two data samples with binary indices $(B_k B_{k-1} \ldots B_2 1 0)$ and $(\bar{B}_k B_{k-1} \ldots B_2 1 0)$, or $(B_k B_{k-1} \ldots B_2 B_1 1)$ and $(B_k \bar{B}_{k-1} \ldots B_2 B_1 1)$, can pair together to form a butterfly (the overbar denotes the complemented bit). The twiddle factor involved in this butterfly is $W_N^j$, where j is the decimal value of the binary index $(0 B_{k-1} \ldots B_2 1 0)$. In total, $(N/8 + N/4)$ butterflies are computed in this super-stage. That is, a quarter of the butterflies in the first stage and a half of the butterflies in the second stage of the original DIF FFT are computed.

(3) Within the ss-th super-stage, for $ss = 1, 2, \ldots, \log_2 N - 1$, all the data samples with binary indices $(B_k B_{k-1} \ldots B_{ss+1} 1 0 \ldots 0)$, $(B_k B_{k-1} \ldots B_{ss} B_{ss-1} 1 0 \ldots 0)$, …, $(B_k B_{k-1} \ldots B_{ss} 1 0 \ldots 0)$, $(B_k B_{k-1} \ldots B_2 1 0)$, and $(B_k B_{k-1} \ldots B_2 B_1 1)$ are computed. In this case, the butterflies originate from Stage 1 all the way up to Stage ss of the corresponding DIF FFT. There are $N / 2^{ss+1}$ different twiddle factors involved in this super-stage.

Under this algorithm, according to Theorem 1, only $(N/2 - 1)$ non-redundant memory accesses are needed. The classical approach, the DIF FFT shown in Fig. 1, may however require as many as $(N - 1)$ memory accesses to load twiddle factors unless the size of the register file is comparable to the number of input samples. Very large register files, however, are rarely seen in current microprocessor designs. As an illustrative example, Table 1 lists the computation order of the 16-point FFT, with indexing and pairing information presented in binary format. Altogether, there are 3 super-stages and 17 butterflies. The overall computation structure of this 16-point FFT based on the proposed algorithm can be seen in Fig. 3.

Table 1. The computation order of a 16-point twiddle-factor-based FFT: indexing and pairing

| Super-stage | Original FFT stage | Paired indices |
|---|---|---|
| 1 | 1 | $(B_3 B_2 B_1 1)$ and $(\bar{B}_3 B_2 B_1 1)$ |
| 2 | 1 | $(B_3 B_2 1 0)$ and $(\bar{B}_3 B_2 1 0)$ |
| 2 | 2 | $(B_3 B_2 B_1 1)$ and $(B_3 \bar{B}_2 B_1 1)$ |
| 3 | 1 | $(B_3 1 0 0)$ and $(\bar{B}_3 1 0 0)$ |
| 3 | 2 | $(B_3 B_2 1 0)$ and $(B_3 \bar{B}_2 1 0)$ |
| 3 | 3 | $(B_3 B_2 B_1 1)$ and $(B_3 B_2 \bar{B}_1 1)$ |

    // n: number of points of the FFT (a power of two)
    // x: input data samples, overwritten in place
    //    x[2k]     ---- real part of the kth sample
    //    x[2k + 1] ---- imaginary part of the kth sample
    // w: pre-computed twiddle factors
    //    w[2k]     ---- real part of the kth twiddle factor
    //    w[2k + 1] ---- imaginary part of the kth twiddle factor
    // As in the conventional DIF FFT, the outputs are left in
    // bit-reversed order.
    void radix2_fft(int n, float* x, float* w)
    {
        int n2 = 0;      // index of the current super-stage
        int start = 1;   // first twiddle index of the super-stage
        int step = 2;    // distance between twiddle indices
        int proc, twiddle, stage, i0, i1, n3, n4;
        float co, si, re0, im0, re1, im1;

        // Section 1: Compute butterflies with twiddle factors
        // W(N, j), j != 0.
        for (proc = n; proc > 2; proc /= 2) {   // super-stage
            n2++;
            for (twiddle = start; twiddle < n/2; twiddle += step) {
                // load one twiddle factor and repeatedly use it
                // (indices follow the layout documented above)
                co = w[twiddle*2];       // real part of W(N, twiddle)
                si = w[twiddle*2 + 1];   // imaginary part of W(N, twiddle)
                n3 = n4 = n;
                for (stage = 0; stage < n2; stage++) {   // stage
                    n4 /= 2;
                    for (i0 = twiddle >> stage; i0 < n; i0 += n3) {
                        // butterfly computation
                        i1 = i0 + n4;
                        re0 = x[2*i0] + x[2*i1];
                        im0 = x[2*i0 + 1] + x[2*i1 + 1];
                        re1 = x[2*i0] - x[2*i1];
                        im1 = x[2*i0 + 1] - x[2*i1 + 1];
                        x[2*i0] = re0;
                        x[2*i0 + 1] = im0;
                        x[2*i1] = re1*co - im1*si;
                        x[2*i1 + 1] = re1*si + im1*co;
                    }
                    n3 = n4;
                }
            }
            start *= 2;
            step *= 2;
        }
        n2++;

        // Section 2: Compute the butterflies with
        // twiddle factor W(N, 0)
        n3 = n4 = n;
        for (stage = 0; stage < n2; stage++) {
            n4 /= 2;
            for (i0 = 0; i0 < n; i0 += n3) {
                i1 = i0 + n4;
                re0 = x[2*i0] + x[2*i1];
                im0 = x[2*i0 + 1] + x[2*i1 + 1];
                re1 = x[2*i0] - x[2*i1];
                im1 = x[2*i0 + 1] - x[2*i1 + 1];
                x[2*i0] = re0;
                x[2*i0 + 1] = im0;
                x[2*i1] = re1;
                x[2*i1 + 1] = im1;
            }
            n3 = n4;
        }
    }
    /* Note that in Section 2, no multiplication operation is needed.
       Furthermore, the algorithm used in this section is not an optimized
       one in terms of memory access. If a small number of registers are
       allocated to save temporary values, we will show an algorithm
       (Fig. 4) that can help to partition the system so that the
       intermediate memory access for read/write can be significantly
       reduced. */

Fig. 2 Pseudo code of the proposed twiddle-factor-based FFT algorithm
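The counts stated in Theorems 1 and 2, and the figure of 17 Section 1 butterflies quoted for the 16-point case, can be checked by enumeration. The C sketch below is an illustration only; the three function names are hypothetical. The first function walks the conventional DIF stages, the second evaluates the closed form of Theorem 2, and the third replays the Section 1 loop nest of Fig. 2 while counting butterflies instead of computing them.

```c
#include <assert.h>

/* Butterflies that use W_N^j in a conventional radix-2 DIF FFT.
 * Stage s (s = 1, 2, ...) has 2^(s-1) groups of N / 2^s butterflies, and
 * the butterfly at position p of a group uses exponent p * 2^(s-1). */
int butterflies_with_twiddle(int n, int j)
{
    assert(n >= 2 && (n & (n - 1)) == 0);   /* n must be a power of two */
    assert(j >= 0 && j < n / 2);
    int count = 0;
    int groups = 1;   /* equals 2^(s-1), which is also the exponent scale */
    for (int half = n / 2; half >= 1; half /= 2, groups *= 2) {
        for (int p = 0; p < half; p++) {
            if (p * groups == j) {
                count += groups;   /* one such butterfly in every group */
            }
        }
    }
    return count;
}

/* Closed form of Theorem 2: 2^(k+1) - 1, where 2^k is the largest power
 * of two dividing j (j != 0). */
int theorem2_count(int j)
{
    assert(j > 0);
    int k = 0;
    while ((j & 1) == 0) {
        j >>= 1;
        k++;
    }
    return (2 << k) - 1;
}

/* Butterflies executed by Section 1 of the Fig. 2 listing, obtained by
 * replaying its loop nest and counting the innermost iterations. */
int section1_butterflies(int n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);
    int count = 0, superstage = 0;
    int start = 1, step = 2;
    for (int proc = n; proc > 2; proc /= 2) {
        superstage++;
        for (int tw = start; tw < n / 2; tw += step) {
            int span = n;
            for (int stage = 0; stage < superstage; stage++) {
                for (int i0 = tw >> stage; i0 < n; i0 += span) {
                    count++;
                }
                span /= 2;
            }
        }
        start *= 2;
        step *= 2;
    }
    return count;
}
```

For N = 16 the enumeration gives 15 butterflies with $W_N^0$ and 17 in Section 1, which together account for all $(N/2)\log_2 N = 32$ butterflies of the transform.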

[Fig. 3: flow graph of the 16-point FFT under the proposed algorithm, with nodes labeled x(0) to x(15) and twiddle factors W[0] to W[7]. Bold lines denote multiplication by -1. The graph is divided into Section 1 and Section 2, and each super-stage is annotated with (1) stage, (2) twiddle factor and (3) butterfly.]

Fig. 3 Structure of a 16-point FFT based on the proposed algorithm.

From the above discussion, we can see that the FFT structure can be computed in a way similar to Huffman coding. This resolves the data dependence, and the verification of this algorithm can be viewed as "decoding" of the Huffman codes. It can be seen that four loops are required in this section of the computation, while traditional approaches may require three. This loop overhead, however, can be easily absorbed by current processors with multiple data paths, such as the TI TMS320C62x DSP [18].

Section 2: In the second section, the remaining butterflies involving the twiddle factor $W_N^0$ are computed. Note that no multiplication is needed in computing these $(N - 1)$ butterflies (Theorem 1), as $W_N^0 = 1$. All these butterflies are organized as a binary tree with $\log_2 N$ stages. The memory access of this section can be significantly reduced if a few user-visible data registers are available. Depending on the number of registers (M) available to hold intermediate results, we can traverse the binary tree with the algorithm shown in Fig. 4, where the visit to a node refers to a 2-point butterfly computation.

    // The number of the stages to be calculated
    // in the prolog
    int n5 = n1 % 2;   // n1: N = 2^n1
    int n2 = N;

    // The Prolog of the tree structure
    for (proc = 0; proc < n5; proc++) {
        n3 = n2;
        n2 >>= 1;
        for (bu = 0; bu < n; bu += n3) {
            // Calculate the butterfly
            // butterfly_cal(x[0], x[1], x[2], x[3]): Calculate
            // the butterfly with two specified points, x[0], x[2]
            // are the real parts of the points; x[1], x[3] are
            // the imaginary parts of the points.
            butterfly_cal(x[2*bu], x[2*bu + 1], x[2*(bu + n2)], x[2*(bu + n2) + 1]);
        }
    }

    // The Kernel of the tree structure
    // If there are r pairs of registers, denoted as
    // Reg_real[0: (r-1)] and Reg_im[0: (r-1)],
    // to be used, (r-1) butterflies can be computed.
    // Reg_real and Reg_im are for real and imaginary parts,
    // respectively. Intermediate results can be saved in the given
    // registers, rather than writing them back to the memory.
    for (proc = N >> n5; proc > 1; proc >>= j) {
        int n4 = proc >> j;
        int index = 0;
        for (group = 0; group < 2^(n1 - n3); group++) {
            // Fetch the points from memory to registers
            for (i = 0; i < r; i++) {
                Reg_real[i] = x[2*index];
                Reg_im[i] = x[2*index + 1];
                index = index + n4;
            }
            m = r;
            // Calculate the butterflies: (r-1) in total
            for (i = 1; i
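Because the listing above breaks off, the following is a separate, self-contained C sketch of the same register-blocking idea for the special case r = 4. It is an assumption-laden illustration rather than the authors' Fig. 4: the function names are hypothetical, local variables stand in for the register pairs, and the kernel always covers exactly two stages per pass. With four points held in registers, r - 1 = 3 butterflies are computed between one round of loads and one round of stores; a one-stage prolog is used when $\log_2 N$ is odd, as in the prolog of the listing.

```c
#include <assert.h>

/* Multiplier-free butterfly on points i0 and i1 of the interleaved
 * (re, im) array x: the W(N, 0) butterfly of Section 2. */
static void butterfly_w0(float *x, int i0, int i1)
{
    float re0 = x[2 * i0] + x[2 * i1], im0 = x[2 * i0 + 1] + x[2 * i1 + 1];
    float re1 = x[2 * i0] - x[2 * i1], im1 = x[2 * i0 + 1] - x[2 * i1 + 1];
    x[2 * i0] = re0;  x[2 * i0 + 1] = im0;
    x[2 * i1] = re1;  x[2 * i1 + 1] = im1;
}

/* Section 2 one stage at a time, as in the Fig. 2 listing. */
void w0_tree_by_stage(int n, float *x)
{
    assert(n >= 2 && (n & (n - 1)) == 0);
    for (int span = n; span >= 2; span /= 2)
        for (int i0 = 0; i0 < n; i0 += span)
            butterfly_w0(x, i0, i0 + span / 2);
}

/* The same butterflies, two stages per pass, with four points held in
 * local variables that play the role of the register pairs. */
void w0_tree_blocked4(int n, float *x)
{
    assert(n >= 2 && (n & (n - 1)) == 0);
    int log2n = 0;
    while ((1 << log2n) < n) log2n++;

    int span = n;
    if (log2n % 2 == 1) {             /* prolog: odd number of stages */
        butterfly_w0(x, 0, n / 2);
        span = n / 2;
    }
    for (; span >= 4; span /= 4) {    /* kernel: two stages per pass */
        int q = span / 4;
        for (int i0 = 0; i0 < n; i0 += span) {
            int ia = i0, ib = i0 + q, ic = i0 + 2 * q, id = i0 + 3 * q;
            /* fetch four points from memory */
            float ar = x[2 * ia], ai = x[2 * ia + 1];
            float br = x[2 * ib], bi = x[2 * ib + 1];
            float cr = x[2 * ic], ci = x[2 * ic + 1];
            float dr = x[2 * id], di = x[2 * id + 1];
            float t;
            /* first stage of the pass: butterfly (a, c) */
            t = ar - cr; ar = ar + cr; cr = t;
            t = ai - ci; ai = ai + ci; ci = t;
            /* second stage of the pass: butterflies (a, b) and (c, d) */
            t = ar - br; ar = ar + br; br = t;
            t = ai - bi; ai = ai + bi; bi = t;
            t = cr - dr; cr = cr + dr; dr = t;
            t = ci - di; ci = ci + di; di = t;
            /* write the four results back */
            x[2 * ia] = ar; x[2 * ia + 1] = ai;
            x[2 * ib] = br; x[2 * ib + 1] = bi;
            x[2 * ic] = cr; x[2 * ic + 1] = ci;
            x[2 * id] = dr; x[2 * id + 1] = di;
        }
    }
}
```

In the kernel, each group of three butterflies costs four complex loads and four complex stores, compared with six of each when every butterfly reads and writes memory on its own. Both routines perform the same additions and subtractions in a compatible order, so they should produce identical results on the same input.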
