Efficient VLSI Architecture for Decimation-in-Time Fast ...

IEEE TRANSACTIONS ON CIRCUIT AND SYSTEM-I, REGULAR PAPERS

1

Efficient VLSI Architecture for Decimation-in-Time Fast Fourier Transform of Real-Valued Data Pramod K. Meher, Senior Member, IEEE, Basant K. Mohanty, Senior Member, IEEE, Sujit K. Patel, Soumya Ganguly, Thambipillai Srikanthan Senior Member, IEEE

Index Terms—Fast Fourier transform, FFT, in-place computation, real-valued FFT, decimation-in-time FFT

I. I NTRODUCTION HE fast Fourier transform (FFT) algorithm is frequently encountered in almost every application area of digital signal processing. There are several applications such as speech, audio, image, and video processing, where FFT is very often performed on real-valued signals. Efficient realization of FFT of real-valued signals has received further attention nowadays due to the emergence of biomedical signal processing, and wide applications of real-valued time-series analysis.

T

Manuscript submitted on 08 May 2015, revised August 9, 2015, and September 20, 2015. This paper was recommended by Associate Editor Hakan Johansson. P. K. Meher and T. Srikanthan are with School of Computer Engineering, Nanyang Technological University, Singapore, Email: {aspkmeher,astsrikan} @ntu.edu.sg. B. K. Mohanty and S. K. Patel are with Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhy Pradesh, India-473226, Email: [email protected],[email protected]. S. Ganguly was with EEE Department, BITS, Hyderabad, Andhra Pradesh, India. He worked in the School of Computer Engineering, Nanyang Technological University, Singapore, during June 2014 to November 2014. Email:[email protected] Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

#m#m samples samples per clock per clock cyclecycle

# m# m samples samples per clock per clock cyclecycle

(a) x1 x2

(c)

# m# m samples DIT FFT DIT FFTsamples per clock UnitUnitper clock cycle cycle

inputt reordering input reordering

#m#m samples samplesDIF FFT DIF FFT per clock per clock UnitUnit cyclecycle

outpu ut reordering output reordering

Abstract—The decimation-in-time (DIT) fast Fourier transform (FFT) very often has advantage over the decimation-infrequency (DIF) FFT for most real-valued applications, like speech/ image/ video processing, bio-medical signal processing, and time-series analysis, etc., since it does not require any output reordering. Besides, the DIT FFT butterfly involves less computation time than its DIF counterpart. In this paper, we present y x1 1 an efficient architecture for the radix-2 DIT real-valued FFT k W (RFFT). We present here the necessary mathematical formulation for removing the redundancies in the radix-2x2DIT RFFT, - and y2 present a formulation to regularize its flow graph to facilitate folded computation with a simple control unit. We propose here a register-based storage design which involves significantly less area at the cost of a little higher latency compared with the conventional RAM-based storage. The address generation for folded in-place DIT RFFT computation with register-based storage is challenging since both read and write operations are performed in the same clock cycle at different locations. Therefore, we present here a simple formulation of address generation for the proposed radix-2 DIT RFFT structure. The proposed structure involves ∼ 61% less area and ∼ 40% less power consumption than those of [8], on average, for FFT sizes 16, 32, 64, and 128. It involves ∼ 70% less area-delay product and ∼ 57% less energy per sample than those of the other, on average, for the same FFT sizes.

(b) y1

x1

Wk

Wk

y2

x2

(d)

y1

x1

y2

x2

Fig. 1. The key differences in the DIT and DIF FFT computation. (a) DIT FFT processing. (b) DIF FFT processing. (c) DIT FFT butterfly. (d) DIF FFT butterfly. m is a power-of-2 integer.

FFT of real-valued signal exhibits conjugate symmetry which renders half the FFT outputs redundant [1]. The conjugate symmetry of real-valued FFT (RFFT) is used to compute FFTs of a pair of N -point real-valued sequences from N -point FFT of a complex-valued data1 . There has been a continued effort for several decades on reducing the hardware cost of VLSI implementation of FFT and improving its performance [2]. Some efforts have been made for efficient VLSI implementation of RFFT, as well. Some folded pipeline architectures have been proposed for the computation of RFFT [3], [4], where butterfly operations are multiplexed into a small logic unit. The structures in [3] and [4] could provide adequate throughput for some applications but the storage complexity of those structures continues to be very high. A few in-place architectures have also been proposed for RFFT using specialized packing algorithms [5], [6]. Memory-conflict for read/write operation is found to be the major challenge in the design of algorithms and architectures for in-place computation [7]. Recently, an inplace architecture and conflict-free memory addressing scheme have been proposed for continuous processing of RFFT [8]. The FFT algorithms are classified into two broad categories, namely, the decimation-in-time (DIT) and the decimation-infrequency (DIF) algorithms. The key differences between the two are shown in Fig.1. In case of DIF algorithm (Fig.1(a)), the input samples are fed to the computing structure in their 1 This is sometimes referred to as ‘Bach and Rock’ approach since one can find FFTs of two independent streams of real-valued data, e.g., a classical track by J. S. Bach and the music by a rock group using this approach. Based on this approach open source FFT libraries are developed by the Center for Astronomy Signal Processing and Electronics Research (CASPER) of UC Berkeley ( https://casper.berkeley.edu/).

-


2

TABLE I C OMPUTATIONAL D ELAY OF M ULT-A DD AND A DD -M ULT O PERATIONS BASED ON 65 NM CMOS T ECHNOLOGY L IBRARY IN NS . input size 8-bit 16-bit 24-bit

TMA 1.24 2.29 3.52

TAM 1.30 2.48 3.82

TAM -TMA 0.06 0.19 0.30

% Difference 4.84% 8.30% 8.52%

X(k) =

TMA is the computation time of multiplication followed by addition. TAM is the computation time of addition followed by multiplication.

II. R EAL -VALUED M ATHEMATICAL F ORMULATION AND D ERIVATION OF F LOW GRAPH FOR R ADIX -2 RFFT The DFT of an N -point sequence {x(n)} is given by x(n) · e−j

2πkn N

X

x0 (m) · e−j

2πk(2m) N

+ x1 (m) · e−j

2πk(2m+1) N

i

(2) for k, n = 0, 1, · · · , N − 1. Equation (2) then can be split into real and imaginary parts as X(k) = [A0 (k) + D(k)] − j[B0 (k) + E(k)]

(3a)

X(k + N/2) = [A0 (k) − D(k)] − j[B0 (k) − E(k)] (3b) where D(k) = A1 (k) cos(2πk/N ) − B1 (k) sin(2πk/N ) (4a) E(k) = B1 (k) cos(2πk/N ) + A1 (k) sin(2πk/N ) (4b) (N/2)−1

Ai (k) =

X

xi (m) cos

m=0 (N/2)−1

Bi (k) =

X

xi (m) sin

m=0

2πkm

(4c)

N/2 2πkm

(4d)

N/2

for k = 0, 1, 2, · · · , (N/2) − 1. Note that Ai (k) and Bi (k) are, respectively, the real and negative imaginary parts of the (N/2)-point DFT of the sub-sequence {xi (m)}, for i = 0 and 1. When N/2 is again an even integer, each of the pair of subsequences {x0 (m)} and {x1 (m)} can be further decomposed into two sub-sequences corresponding to their even- and oddindexed elements. When N is a power-of-2 integer, such decomposition could continue recursively to compute an N point RFFT using a DIT flow graph of P = log2 N stages of butterfly (BF) computation. Real and imaginary parts of the p-th stage (for 1 ≤ p ≤ P ) of DIT-RFFT {Ap (k), B p (k)} can be computed from the real and imaginary parts of (p − 1)-th stage of the BF output using the following relations: Ap0 (k) = Aq0 (k) + D0p (k), B0p (k) = B0q (k) + E0p (k), Ap0 (k0 ) = Aq0 (k) − D0p (k), B0p (k0 ) = B0q (k) − E0p (k), Ap1 (k) = Aq0 (k0 ) + D1p (k), B1p (k) = B0q (k0 ) + E1p (k), Ap1 (k0 ) = Aq0 (k0 ) − D1p (k), B1p (k0 ) = B0q (k0 ) − E1p (k), D0p (k) = Aq1 (k)ck − B1q (k)sk , E0p (k) = B1q (k)ck + Aq1 (k)sk , D1p (k) = Aq1 (k0 )ck − B1q (k0 )sk , E1p (k) = B1q (k0 )ck + Aq1 (k0 )sk ,

A. Radix-2 DIT RFFT using Real-valued Arithmetic

N −1 X

(N/2)−1 h m=0

natural order, while the output are generated in bit-reversed order. On the other hand, in case of DIT algorithm (Fig.1(b)), the input samples need bit-reversal reordering before being processed, while the output FFT coefficients are generated in natural order. In different RFFT applications such as image and video processing, biomedical signal processing, and timeseries analysis etc., the complete input sequence is generally available together at the same time for the FFT computation. The DIT RFFT has an advantage over the DIF form for these applications, since a DIT RFFT structure need not wait for the arrival of input samples but can produce the outputs as soon as those are computed. As shown in Figs.1(c) and 1(d) the DIF butterfly involves an addition followed by a multiplication while the DIT butterfly involves a multiplication followed by additions. As shown in Table I the computation time of fixed-point multiplication followed by an addition is less than that of addition followed by a multiplication. The DIT-based RFFT butterfly thus involves less propagation delay than that of DIF-based RFFT butterfly although both these butterflies involve the same number of multipliers and adders. Therefore, the choice of DIT algorithm to derive RFFT structure has an advantage over DIF algorithm. In this paper, we present an efficient architecture for the DIT radix-2 RFFT algorithm. The main contributions of this paper, as discussed in the next three Sections of the paper, are: 1) Mathematical formulation of the radix-2 DIT RFFT algorithm using real-valued arithmetic. 2) Derivation of a regularized flow graph for folded computation of radix-2 DIT RFFT using a simple control. 3) Formulation of address generation for folded in-place radix-2 DIT RFFT algorithm and derivation of proposed RFFT structure using register-based storage. Hardware and time complexities along with performance comparison are presented in Section V. Conclusions are presented in Section VI.

X(k) =

comprised of the even-indexed and odd-indexed elements of {x(n)} respectively, such that x0 (m) = x(2m) and x1 (m) = x(2m + 1) for 0 ≤ m ≤ (N/2) − 1. Using the time-decimated subsequences {x0 (m)} and {x1 (m)}, (1) can be written as

(5) 0

(1)

n=0

for k, n = 0, 1, · · · , N − 1. When N is an even integer, the input sequence {x(n)} could be decomposed into two sub-sequences {x0 (m)} and {x1 (m)} of size (N/2) each,

0

where q = p − 1, k 0 = k + (N/2p ), ck = cos(2p πk/N ), 0 0 sk = sin(2p πk/N ), for 0 ≤ k ≤ (N/2p )−1, p0 = P −p+1, P 1 ≤ p ≤ P , and P = log2 N . Note that AP 0 (k) and B0 (k) are, respectively, the real and the negative imaginary parts of N -point DFT sequence {X(k)}. The radix-2 RFFT given by (5) is optimized further using the symmetric property of realvalued DFT as discussed in the following Subsection.


3

For a real-valued N -point sequence {x(n)}, X(0) and X(N/2) are real-valued since their imaginary part is zero, and X(N − k)∗ = X(k). Therefore, A(k) = A(N − k) and B(k) = −B(N − k). Due to redundancies resulting from the symmetric and anti-symmetric behaviour of A(k) and B(k), only half the number of DFT components need to be computed at each stage of BF computations. Further, the sine and the cosine functions of (5) are, respectively, 0 symmetric and anti-symmetric for k and [(N/2p ) − k] where 0 0 ≤ k ≤ (N/2p ) − 1, for p0 = P − p + 1. Using this feature, the radix-2 RFFT given by (5) is optimized and generalized expressions for the computation of real parts {Ap (k)}, and the negative imaginary-parts {B p (k)} of the p-th BF stage of RFFT are obtained as follows:

x(8) x(4) x(12) x(2) x(10) x(6) x(14)

+ +

for 1 ≤ k ≤ M − 1 except k = M/2, where q = p − 0 1, M = N/2p , M 1 = M − k, M 2 = M + k, p0 = P − p + 1, 1 ≤ p ≤ P , and P = log2 N . We may define the real0 valued sequence {x(n)} = {A00 (k)} = {xP 0 (m)}; {A1 (k)} = P 0 0 {x1 (m)}, B0 (k) = 0, B1 (k) = 0, for 0 ≤ (m, k) ≤ (N/2) − P 1, where {xP 0 (m)} and {x1 (m)} are, respectively, the evenand the odd-indexed sub-sequences obtained after the P -th time-decimation of sequence of {x(n)}. Using (6), the flow graph for 16-point real-valued FFT is derived and shown in Fig.2. As shown in Fig.2, the flow graph corresponding to the optimized RFFT (referred to as optimized flow graph) involves 25 BFs (2-point) while the original flow graph involves 49 such BFs. In general, the number of 2-point BFs of original flow-graph (α) and optimized flow-graph (β) of N -point realvalued FFT can be calculated using the following relations. i 7 15 + + · · · + (2P − 1)/2P 4 8 16 h3 3 i 7 β=N + + + · · · + (2P −1 − 1)/2P 4 8 16 α=N

h5

+

(6a) (6b)

The optimized flow graph is not regular and the numbers of BFs involved in different stages of computation are not the same. For example, stage-1, stage-2, stage-3, and stage4 in Fig.2 involve (8, 4, 6, and7) BFs, respectively. From the optimized flow graph of DIT-RFFT of various sizes, we observe that intermediate result E of stage-2 onwards pass through type-II BFs whereas rest of the inputs and intermediate results are processed by type-I BFs. The irregular distribution of type-I and type-II BFs of the optimized flow graph requires a complex control unit particularly when the BF operations

A0(1) 2 A0(2)

+ -

+

+ -

1

+

A1(3)

-

1

x(13) x(3)

-

x(11) x(7)

-

+

1

A1(4)

-

1 1

A0(5) 2

A0(6) +

2

1

A0(7) 1

+

A1(6)

-

1

A1(7)

Stage-1

+

B0(5)

A0(6)

+

+ E0(1) +

2

A1(5)

+

+ -

3

2

A0(5)

+

-

3

+ D0(1) -

A0(4)

1

+ -

A1(0) {c2,s2} 2 A1(1) 2 A1(2) {c2,s2} 2 B1(1) 2

A0(4)

x(9) x(5)

+

2

1

4

A0(1) 4

A0(2)

3

B0(1)

A0(3) 1 A1(2)

4

A0(0)

3

A0(1) 3 A0(2)

2

1

-

3

A0(0)

2

A1(1) 1 A0(2)

-

A1(4) {c2,s2} 2 A1(5) 2 A1(6) {c2,s2} 2 B1(5)

Stage-2

-

3

+ D1(1) -

+ -

3

+ E1(1)

-

+

+

4

B0(1)

B0(1)

3

4

A0(4)

A0(4)

3

4

A0(3) 3 B0(2)

A0(3) 4 B0(2)

3

4

B0(3) 3 A1(0) {c1,s1} 3 A1(1) 3 {c ,s } A1(2) 2 2 {c1,s1} 3 B1(1)

B0(3) 4

A0(8)

4

+ D0(1) - 4 + D0(2) - 4 +E0(1) +

3

A1(4) {c3,s3} 3 A1(3) 3 {c ,s } B1(2) 2 2 {c3,s3} 3 B1(3)

+ + +

4

+ D0(3) - 4 + E0(2) + 4 + E0(3) + Stage-4

Stage-3

+ -

+

+

4

A0(7) 4

A0(6) 4

B0(7) 4

B0(4) 4

A0(5) 4

B0(6) 4

B0(5)

Fig. 2. Optimized flow graph of 16-point DIT real-valued FFT. cl = Fig-3 cos(2πl/16) and sl = sin(2πl/16), where l = 1, 2, and 3 +

x1 +

x1

Ap0 (k) = Aq0 (k) + D0p (k), Ap0 (M 1) = Aq0 (k) − D0p (k), B0p (k) = B0q (k) + E0p (k), B0p (M 1) = E0p (k) − B0q (k), Ap0 (M/2) = Aq0 (M/2), B0p (M/2) = Aq1 (M/2), Ap0 (0) = Aq0 (0) + Aq1 (0), Ap0 (M ) = Aq0 (0) − Aq1 (0), Ap1 (k) = Aq0 (M 2) + D1p (k), Ap1 (M 1) = Aq0 (M 2) − D1p (k), B1p (k) = B1q (M 2) + E1p (k), B1p (M 1) = E1p (k) − B1q (M 2), Ap1 (M/2) = Aq0 (M + M/2), B1p (M/2) = Aq1 (M + M/2), Ap1 (0) = Aq0 (M ) + Aq1 (M ), Ap1 (M ) = Aq0 (M ) − Aq1 (M ), B0p (0) = B0p (M ) = B1p (0) = B1p (M ) = 0. (6)

A0(0)

1

A0(1) 1 A1(0)

x(1)

x(15)

2

1

A0(0)

x(0)

B. Optimized Radix-2 DIT Real-valued FFT

+ x2

+ -

(a)

y1

x1

+

y1 +

y2

x2

+

(b)

x2 x3

y2

x4

(c)

+ + + + +

y1 y2 y3 y4

Fig. 3. (a) Type-I butterfly operation. (b) Type-II butterfly operation. (c) 4point butterfly operation.

are mapped onto a folded structure. Moreover, the conflictfree memory access is another issue associated with the inplace computation by a folded structure. The optimized flow graph therefore requires highly complex control due to its irregular data-flow. To resolve the these issues, we present here a modified flow graph (MFG), which could lead to a simplified control structure for the conflict-free memory access for the in-place computation by a folded structure. III. R EGULARIZATION OF F LOW G RAPH AND C ONTROL D ESIGN FOR RFFT C OMPUTATION From the optimized flow graph of 16-point RFFT (Fig.2), we find that stage-1, stage-2, stage-3, and stage-4, respectively involve (8, 4, 4, and 4) type-I BFs and (0, 0, 2, and 3) type-II BFs. Similarly, stage-1, stage-2, stage-3, stage-4, and stage-5 of the optimized flow graph of 32-point DIT RFFT, respectively, involve (16, 8, 8, 8, and 8) type-I BFs and (0, 0, 4, 6, and 7) type-II BFs. Interestingly, the numbers of type-I and type-II BFs are nearly equal after stage-2 onwards. Therefore, we use a 4-input hybrid BF comprised of a type-I BF and a type-II BF as the basic building block to derive a regular flow graph. The input pattern of the original flow graph (Fig.2) is modified to match with the input pattern of 4-input BF [shown in Fig.3(c)]. The BFs of stage-3 and stage-4 are also reordered accordingly. A few redundant arithmetic operations are introduced to transform the original flow graph to a regular form without changing intermediate signal values. The MFG of 16-point DIT RFFT is shown in Fig.4. The dashed-lines in Fig.4 represent absence of signal-path. As shown in Fig.4, each stage of MFG involves four 4-point BFs. A. Regularization of Flow graph for Real-valued FFT The data-flow of the MFG for in-place computation of RFFT is shown in Fig.5. The index of input sequence {x(n)}


4

4-input BF unit x(0) x(12) {c0,s0} x(8) {c0,s0} x(4)

A0(2)

E1(0)

1

-

A1(1)

-

{c0,s0} {c0,s0}

1

D2(0) E2(0)

+

A0(1)

-

B0(1)

-

1

A1(2) 1

A1(3)

D2(1)

{c0,s0}

E2(1)

+

2

-

B1(1)

A0(4)

A0(4)

1

A0(6)

{c0,s0}

1

A1(5)

+

{c0,s0}

D2(2)

+

A0(5)

-

B0(5)

+

A1(6)

E1(3)

1

-

A1(7)

+

-

{c0,s0}

+ -

E3(2)

+

D2(3)

{c0,s0}

E2(3)

+

A1(5)

-

B1(5)

2

-

2

+

A0(3) 3

B0(3)

B0(1)

{c1,s1}

D4(1)

{c1,s1}

E4(1)

+

A0(7)

-

B0(7)

+

D3(3)

{c1,s1}

E3(3)

+ +

4

4

4

{c2,s2}

3

A1(2)

{c2,s2}

3

B1(2)

B0(2) D4(2)

+ -

E4(2)

+

4

A0(6) 4

B0(6) 4

3

A0(3)

3

B0(3)

4

B1(1)

{c1,s1}

4

A0(2)

A1(1)

A1(6)

{c0,s0}

4

3

A1(4) D3(2)

4

B0(4) 4

3

{c0,s0}

4

A0(4)

A0(1)

A1(0)

2

A0(7) 1

E3(1)

+

A1(4)

1

D1(3)

{c1,s1}

2

A0(6)

-

3

2

1

x(3)

D3(1)

+

2

E2(2)

E4(0)

+

B0(1)

{c1,s1}

2

1

+

3

A1(1)

-

{c0,s0}

3

B0(2)

D4(0)

3

2

A1(4) -

-

{c0,s0}

3

A0(2)

A0(8)

A0(1)

2

+

A0(5) E1(2)

E3(0)

+

A1(2)

{c0,s0}

1

D1(2)

+

2

A0(3)

x(1)

D3(0)

A1(0)

1

-

{c0,s0}

4

A0(4)

2

A0(2)

E1(1)

{c0,s0}

2

+

4

A0(0)

3

2

1

D1(1)

3

A0(0)

2

A1(0)

+

x(15) {c0,s0} x(11) {c0,s0} x(7)

A0(0)

1

+

x(2)

x(13) {c0,s0} x(9) {c0,s0} x(5)

1

A0(1) D1(0)

+

x(14) {c0,s0} x(10) {c0,s0} x(6)

2

A0(0)

3

A1(3) 3

B1(3)

{c3,s3}

D4(3)

{c3,s1}

E4(3)

A0(5)

-

B0(5)

+

4

+

4

Fig. 4. Modified flow graph (MFG) of 16-point DIT RFFT. The dashed line represent no signal path.

x(2) x(14) x(10) x(6)

BF-9 (c0,s0)

w(0) w(2) w(4) w(6)

w(0) w(2) w(1)

v(8) v(12) v(10) v(14)

BF-10 (c2,s2)

w(8) w(12) w(10) w(14)

v(1) v(5) v(3) v(7)

BF-11 (c0,s0)

w(1) w(3) w(5) w(7)

v(9) v(13) v(11) v(15)

BF-12 (c2,s2)

w(9) w(13) w(11) w(15)

BF-5 (c0,s0)

v(0) v(4) v(8) v(12)

v(0) v(4) v(2) v(6)

BF-2 (c0,s0)

u(2) u(10) u(6) u(14)

BF-6 (c0,s0)

v(2) v(10) v(6) v(14)

x(1) x(13) x(9) x(5)

BF-3 (c0,s0)

u(1) u(9) u(5) u(13)

BF-7 (c0,s0)

v(1) v(5) v(9) v(13)

x(3) x(15) x(11) x(7)

BF-4 (c0,s0)

u(3) u(11) u(7) u(15)

BF-8 (c0,s0)

v(3) v(7) v(11) v(15)

BF-13 (c0,s0)

z(0) z(1) z(2) z(3)

w(8) w(12) w(9) w(13)

BF-14 (c1,s1)

z(8) z(12) z(9) z(13)

w(4) w(6) w(5) w(7)

BF-15 (c2,s2)

z(4) z(6) z(5) z(7)

w(10) w(14) w(11) w(15)

BF-16 (c3,s3)

z(10) z(14) z(11) z(15)

w(3)

Data Reordering

u(0) u(8) u(4) u(12)

Data Reordering

BF-1 (c0,s0)

x(0) x(12) x(8) x(4)

Fig. 5. DFG for in-place computation of 16-point decimation-in-time real-valued FFT.

is in natural order and indices of all the intermediate and output signal match with the requirement of folded in-place computation. The horizontal and vertical dashed-lines of Fig.5 represent the possible folding of the MFG. Each small rectangular box of Fig.5 represents a 4-point BF. The MFG of 16-point RFFT involves 16 number of 4-point hybrid BFs. The computation of those BFs are scheduled in 16 successive clock cycles such that 4 BFs of each stage are scheduled in 4 successive clock cycles. The 16 BF operations of Fig.5 are numbered according to their schedule in a folded structure. Therefore, the computation of stage-1, stage-2, stage-3, and stage-4 of 16-point RFFT are folded and scheduled during 1-4, 5-7, 8-11, and 13-16 clock cycles to be processed in a 4point BF processing unit. In general, N -point RFFT involves [(N/4) log2 N ] folded 4-point BF operations. Intermediate coefficients v(k) and w(k) are reordered before they are scheduled to be processed during the next stage of computation. This reordering requires changing the storage access order in case of in-place computation, and does not involve any extra clock cycle. Therefore, the entire in-place computations of 16-point RFFT MFG can be computed in 16 clock cycles using one 4-input BF processing unit. The MFG of the optimized 32-point radix-2 RFFT is shown in Fig.6. We find from this figure that from stage-4 onwards

the order of occurrence of twiddle factors {cl , sl } deviate from the natural order of their index ‘l0 . In case of 32point RFFT, the order of occurrence of twiddle factors in terms of their index, l in stage-5 is {0, 1, 2, 3, 4, 7, 6, 5}, where {cl , sl } := {cos(2πl/32), sin(2πl/32)}. Similarly, the order of twiddle factors in stage-6 in case of 64-point RFFT is {0, 1, 2, 3, 4, 7, 6, 5, 8, 15, 14, 13, 12, 9, 10, 11} where {cl , sl } := {cos(2πl/64), sin(2πl/64)}. After close observation of the order of occurrence of twiddle factors in different stages of the MFG of RFFT of sizes N = 8, 16, 32, 64, 128, and 256, we find that the twiddle factor indices satisfy a periodic property. Using that periodic property, order of occurrence of twiddle factors for p-th stage can be obtained easily from those of (p − 1)-th stage. The order of twiddle-factors of a given BF stage of MFG is obtained iteratively using the following relations: tpi = tp−1 and tp2p−3 +i = 2p−2 − tp−1 i i

(7)

for 1 ≤ i ≤ 2p−3 − 1, 1 ≤ p ≤ log2 N , tp0 = tp−1 , and 0 tp2p−3 = 2p−3 . Using these relations, we can find the order of occurrence of twiddle-factors in different BF stages of RFFT of different sizes as shown in Table II. The sequence of occurrence of twiddle factors of Table II, can be used to


x(0) {c0,s0} x(16) {c0,s0} x(24)

D1(0) E1(0)

E1(1)

A0(2)

1

A1(0)

{c0,s0} {c0,s0}

1

A1(1)

D2(0)

2

A1(2) 1

A1(3)

{c0,s0}

E2(1)

2

A1(1) 2

B1(1)

E1(2)

1

A1(5)

{c0,s0}

D2(2)

{c0,s0}

E2(2)

E1(3)

A1(6) 1

2

B0(4)

A1(7)

2

B1(4) A0(6)

x(9)

1

A0(8)

E1(0)

A1(9)

{c0,s0}

E1(1)

A0(11) {c0,s0} 1 A1(10) {c0,s0} 1 A1(11)

2

A1(7)

E2(1)

2

B1(7)

A0(13) {c0,s0} 1 A1(12) {c0,s0} 1 A1(13)

E3(0)

D3(1)

{c1,s1}

E3(1)

2

2

3

A0(7) 3

B0(7)

D4(2)

{c2,s2}

E4(2)

A1(10) 2

B1(10)

3

B0(8)

D3(2)

{c0,s0}

E3(2)

3

A1(7) 3

B1(7)

D4(3)

{c3,s1}

E4(3)

{c1,s1}

E3(3)

3

A1(8) 3

B1(8)

4

A0(6) 4

B0(6)

5

5

{c2,s2}

D4(2)

{c2,s2}

E4(2)

E4(0)

5

5

B0(3) {c3,s3}

D4(3)

A0(13)

4

{c3,s3}

E4(3)

B0(13)

A0(5) B0(5)

4

A1(4) 4

B1(4)

5

5

{c4,s4} {c4,s4}

D4(4) E4(4)

{c1,s1}

E4(1)

4

A1(7) 4

B1(7)

5

B0(7)

{c7,s7}

D4(5)

{c7,s7}

E4(5)

4

D4(2)

{c2,s2}

E4(2)

4

A1(6) 4

B1(6)

5

5

{c6,s6}

D4(6)

{c6,s6}

E4(6)

5

B1(3)

E4(3)

5

B0(10) A0(5)

4

{c3,s1}

5

A0(10)

5

A1(3)

D4(3)

5

B0(9)

B0(6)

4

{c3,s3}

5

A0(9)

A0(6)

4

{c2,s2}

5

B0(12) 5

A1(2) B1(2)

5

A0(12)

A0(7)

4

B1(1)

5

B0(4)

A1(1)

D4(1)

5

A0(4)

4

{c1,s1}

5

B0(14) A0(3)

4

{c0,s0}

5

A0(14)

4

A1(8) D4(0)

5

B0(15)

B0(2)

4

{c0,s0}

5

A0(15)

A0(2)

A1(0)

3

B1(6)

E4(1)

B0(3) {c3,s3}

A1(6)

D3(3)

{c1,s1}

4

3

{c1,s1}

D4(1)

4

3

{c0,s0}

5

{c1,s1}

A0(3)

3

2

2

A0(8)

A1(9)

A1(9) A1(11)

3

5

4

{c2,s2}

5

B0(8)

B0(1)

4

A1(5)

2

E2(3)

B1(3)

B0(6)

{c1,s1}

B0(10)

D2(3)

3

4

B0(7)

B0(2)

3

A0(10)

1

3

A1(3)

4

A0(7)

5

A0(8)

A0(1)

A0(2)

A0(6)

2

E2(2)

E4(1)

3

2

A0(14)

E1(3)

{c0,s0}

A0(11)

1

A0(15) {c0,s0} 1 A1(14) {c0,s0} 1 A1(15)

D3(0)

A0(9)

D2(2)

3

B1(2)

A0(9)

2

1

D1(3)

E3(3)

A1(6)

1

E1(2)

{c1,s1}

2

A0(12)

D1(2)

D3(3)

{c0,s0}

B0(7)

A1(8)

{c1,s1}

3

2

D2(1)

D4(1)

3

A0(7)

E2(0)

{c1,s1}

A0(5)

2

1

D1(1)

x(7) x(15) {c0,s0} x(23) {c0,s0} x(31)

1

D2(0)

A0(10)

x(3) x(11) {c0,s0} x(19) {c0,s0} x(27)

{c0,s0}

A1(2)

B1(1)

{c1,s1}

2

1

x(5) x(13) {c0,s0} x(21) {c0,s0} x(29)

A1(8)

E3(2)

2

1

A0(9) 1

{c0,s0}

3

E4(0)

4

3

2

A0(8)

D1(0)

D3(2)

{c0,s0}

D4(0)

4

3

A1(4)

x(1) {c0,s0} x(17) {c0,s0} x(25)

{c0,s0}

4

B0(4)

B0(1)

A1(1)

A1(5)

E2(3)

3

B0(3)

4

A0(4)

5

{c0,s0}

A0(16)

A0(1)

3

2

{c0,s0}

E4(0)

3

2

D2(3)

3

A0(3)

A1(4)

A1(3) {c0,s0}

{c0,s0}

D4(0)

A1(0)

A0(4)

1

1

E3(1)

2

1

A0(7)

{c1,s1}

2

A0(6)

D1(3)

D3(1)

2

A0(5)

{c0,s0}

3

{c1,s1}

5

A0(0)

4

A0(8)

3

B0(1)

A0(3)

1

3

B0(2) A0(1)

1

A1(4)

3

A0(2)

2

A0(4) A0(5) D1(2)

E3(0)

2

A1(2) D2(1)

D3(0)

{c0,s0}

2

A1(0) {c0,s0}

A0(4)

{c0,s0}

B0(1)

1

4

A0(0)

3

A0(1)

E2(0)

A0(2)

1

3

A0(0)

2

1

x(6) x(14) {c0,s0} x(22) {c0,s0} x(30)

A0(0)

1

A0(3) D1(1)

x(2) x(10) {c0,s0} x(18) {c0,s0} x(26)

1

1

x(4) x(12) {c0,s0} x(20) {c0,s0} x(28)

2

A0(0) A0(1)

x(8)

5

B0(5)

4

{c5,s5}

D4(5)

A0(11)

4

{c5,s5}

E4(5)

B0(11)

A1(5) B1(5)

5 5

Fig. 6. Modified DFG of 32-point decimation-in-time real-valued FFT.

TABLE II T WIDDLE FACTOR INDEX SEQUENCE USED IN THE MODIFIED DFG BF Stage stage-3 (N = 8) stage-4 (N = 16) stage-5 (N = 32) stage-6 (N = 64) stage-7 (N = 128) stage-8 (N = 256)

Twiddle factor index sequence 0,1 0,1,2,3 0,1,2,3,4,7,6,5 0,1,2,3,4,7,6,5,8,15,14,13,12,9,10,11 0,1,2,3,4,7,6,5,8,15,14,13,12,9,10,11 16,31,30,29,28,25,26,27,24,17,18,19,20,23,22,21 0,1,2,3,4,7,6,5,8,15,14,13,12,9,10,11 16,31,30,29,28,25,26,27,24,17,18,19,20,23,22,21 32,63,62,61,60,57,58,59,56,49,50,51,52,55,54,53 48,33,34,35,36,39,38,37,40,47,46,45,44,41,42,43

derive the MFG of the DIT RFFT of different lengths.

B. The Control Design and Storage Structure In the proposed structure, the in-place RFFT computation based on the MFG of Fig.4, is to be performed by one 4point BF processing unit and a data storage unit. The data storage unit stores the inputs as well as the intermediate results produced after the computation of different stages. In every clock cycle, 4 input samples/intermediate results are, therefore, read from the data storage unit and accordingly four result values are written into the same storage locations. As shown in Fig.4, the data-input pattern changes from the BF stage-2 onwards. The change in input pattern results in the increase of complexity of address generation unit which needs to be taken care of to perform appropriate read/write operations with the register banks. From the analysis of the MFG of Fig.4, we find that only one pair of input values out of two such pairs in each 4-point BF of stage-3 or higher stages need to be reordered while the other pair of input samples are to be fed directly from the corresponding BF of the previous stage. However, the reordered input pair could be any one from the two pairs. To resolve the addressing issue we swap the input pairs before writing them into the register banks. The corresponding pairs


6

TABLE III R EAD AND W RITE O PERATIONS FOR D IFFERENT S TAGES OF B UTTERFLY O PERATIONS OF 16- POINT RFFT clock cycle 1-4 5-8 9-12 13-16

Reading for BF stage S1 S2 S3 S4

Writing after BF stage S2 S3 S4 S1

Si , for i = 1, 2, 3, and 4 represents the ith butterfly stage .

of data when read from the register banks also need to be reverse swaped to feed appropriate data to the BF processing unit. Therefore, by introducing two data-swapping operations, the memory conflict for in-place computation of DIT RFFT is completely resolved. The data storage unit is comprised of four register banks (RB1 , RB2 , RB3 , and RB4 ), so that a block of 4 words can be read/stored from/in the four banks in each cycle using a common address for reference. Besides, the data storage unit consists of two data-selectors (DS1 and DS2 ) for swapping of data before and after the write and read operations, respectively. The internal structure of the data storage unit is shown in Fig.7. We implement each register-bank by (N/4) registers where both read and write operations can be performed in the same clock cycle. We discuss here the control design for the proposed register-based storage unit of 16-point DIT RFFT. Similar control design however can easily be derived for higher FFT sizes, as well. For 16-point RFFT, each register bank (shown in Fig.8) consists of (N/4) = 4 registers, one 2-to-4 line decoder and one 4-to-1 line multiplexer (word-level). During each clock cycle of a set of 4 clock cycles, content of one register of

From 4-point butterfly unit ctr1

DATA SELECTOR (DS1) q0 q1

RB1

RB2

RB3

r0 r1 w0 w1

RB4

CLK

ctr2

DATA SELECTOR (DS2)

To 4-point butterfly unit Fig. 7. Data storage unit of 16-point RFFT for folded in-place computation. w0 w1

Decoder

u1 u2

u3

u4

ctr1 /ctr2

CLK

en R1 en R2 en R3 en R4 r0 r1

Din v1

v2

v3

v4

Dout

(a)

(b)

Fig. 8. (a) Register-bank for N = 16. (b) Structure of data-selector (DS).

TABLE IV R EGISTER BANK L OADING OF I N - PLACE 16- POINT RFFT CC CC0.1 CC0.2 CC0.3 CC0.4 CC1.1 CC1.2 CC1.3 CC1.4 CC1.5 CC1.6 CC1.7 CC1.8 CC1.9 CC1.10 CC1.11 CC1.12 CC1.13 CC1.14 CC1.15 CC1.16

RB1 x(0) x(2) x(1) x(3) u(0) u(2) u(1) u(3) v(0) v(10) v(1) v(11) w(0) w(8) w(5) w(11) x(16) x(18) x(17) x(19)




each bank is read according to the address available at MUX select line and a new value is thus written during the next clock transition in the same register, such that each register of particular register-bank is accessed (for read followed by write) once in every four cycles. During each clock cycle, four data values are read from four register-banks to perform the BF operation of the i-th BF stage and the outputs of the BF operations are written back into the same locations (registers) during the next clock transition for the (i + 1)-th stage, for 1 ≤ i ≤ log2 N . For continuous processing, the write operations with the register bank for the first stage of the next set of N input samples is performed concurrently with the read operation of (log2 N )-th stage (the final stage) of BF operation of current set of input. The read and write operations with the register banks for different stages of BF operation of 16point RFFT is shown in Table III for a set of 16 clock cycles, which is repeated for the next set of 16 clock cycles to process the next 16-point input sequence. Based on the read/write pattern of Table III and necessary data reordering for conflictfree continuous read/write operations with the register-banks it is possible to perform the in-place computation of RFFT. The register-bank loading for RFFT computation of two successive 16-point input x1 = {x(0), x(1), · · · , x(15)} and x2 = {x(16), x(17), · · · , x(31)} is given in Table IV for 20 clock cycles, where clock cycles CC0.1 - CC0.4 are the four clock cycles during which input-vector x1 is loaded in the register-banks. During the 16 clock cycles CC1.1 to CC1.16 , the FFT of input x1 is computed. The 16-point intermediate datavectors u1 , v1 , and w1 corresponding to BF stage-1, stage-2, and stage-3 of 16-point RFFT are loaded in the register banks during CC1.1 to CC1.4 , CC1.5 to CC1.8 , CC1.9 to CC1.12 , respectively. During CC1.13 to CC1.16 , the FFT output of x1 is delivered out and during the same period the next input sequence x2 is loaded into the register banks. One can find from Table IV and Fig.5 that the necessary data-blocks can be read and written continuously from and to the four registerbanks without any conflict during different BF stages. The


7

TABLE VI R EAD A DDRESS (RA) AND W RITE A DDRESS (WA) OF M EMORY BANKS AND C ONTROL S IGNALS FOR DATA - SELECTORS

Control Unit Data Storage Unit

RA3 / RA4

W A1 / W A2

W A3 / W A4

ctrl1

ctrl2

1 2 3 4

00 01 10 11

00 01 10 11

00 01 10 11

11 10 01 00

0 0 0 0

0 0 0 0

5 6 7 8

00 01 10 11

00 01 10 11

00 01 10 11

00 01 10 11

0 1 0 1

0 0 0 0

9 10 11 12

00 01 10 11

01 00 11 10

00 01 10 11

00 01 10 11

0 0 1 1

0 1 0 1

13 14 15 16

00 01 10 11

11 10 01 00

00 01 10 11

01 00 11 10

0 0 0 0

0 0 1 1

Twiddle-Factor Storage Unit

Arithmetic Unit input vector

output vector

Fig. 9. Proposed structure for in-place computation of DIT RFFT.

MUX-ARRAY

x1 x2 x3 x4

To SU

1 1

From SU

+

+ -

-

-

y1 y2 y3

2 2

y4

2 2

+

From TSU

1 1

LC2

RA1 / RA2

LC1

clock cycle

sel1 sel2

ck, sk

sel3

IV. P ROPOSED S TRUCTURE

TABLE VII S ELECT SIGNAL STATUS FOR 16- POINT RFFT 1−4 {1, 1, 1, 1} {1, 1, 1, 1} {0, 0, 0, 0}

clock cycles 5−8 9 − 12 {0, 0, 0, 0} {0, 0, 0, 0} {1, 1, 1, 1} {1, 0, 1, 0} {0, 0, 0, 0} {0, 0, 0, 0}

q0

q1

q2 q3

ROM c0 s0 c0 s0 c0 s0 c2 s2 c0 c1

s0 s1

c2 c3

s2 s3

{cl, sl}

CLK

Fig. 11. Internal structure of twiddle-factor storage unit (TFSU).

The proposed structure for N -point in-place DIT RFFT computation is shown in Fig.9. It consists of one arithmetic unit (AU), one data storage-unit (DSU), one twiddle-factor storage unit (TFSU), and one control-unit (CU). During every cycle, the AU receives a block of 4 samples from the DSU and a pair of twiddle factors {cl , sl } from the TFSU. During the

Select Signal sel1 sel2 sel3

Address Generator

Address Port

input and output data-flow of DS1 and DS2 are given in Table V for 16 processing cycles of x1 . It may be noted that the loading of subsequent input sequences happens during the last four processing clock cycles of a period of 16 processing clock cycles of its previous input sequence. Using Table IV and Table V, the necessary read-addresses (RA) and writeaddresses (W A) for register-banks and control signals ctr1 and ctr2 for DS1 and DS1 , respectively, are derived and shown in Table VI. The required address and control signals can be easily generated by a 2-bit counter and a few additional gates.

Read Data

Fig. 10. Structure of arithmetic unit (AU).

13 − 16 {0, 0, 0, 0} {1, 0, 0, 0} {1, 1, 1, 1}

When sel1 /sel2 =0 10 then I1 → O2 and I2 → O1 in LC1 and LC2 of AU (Fig.10) respectively. In case of MUX-array when sel3 =0 10 , then {x1 → O1 , x2 → O2 , x3 → O3 , x4 → O4 }. TABLE VIII LUT A DDRESSES FOR T WIDDLE FACTOR cc Address

1 000

2 000

3 000

4 000

5 001

6 001

7 001

8 001

cc Address

9 010

10 011

11 010

12 011

13 100

14 101

15 110

16 111

first 12 clock cycles of a period of 16 clock cycles, it performs a 4-point BF operation in every clock cycle to produce a 4point intermediate-output data block which is written back into the DSU. During the last 4 clock cycles of a period of 16 clock cycles the FFT coefficients are delivered out as output of the AU. The structure and function of the DSU is described in Section III-B (Fig.7). The internal structure of AU is shown in Fig.10. AU uses two line-changers (LC1 and LC2 ) to steer the BF outputs along the desired path according to the MFG shown in Fig.4. Both LC1 and LC2 move the input data from ports I1 and I2 to port O1 and O2 , respectively, when the control signal is ‘0’ otherwise they steer the input data from ports I1 and I2 to port O2 and O1 , respectively. Both LC1 and LC2 are implemented by a pair of 2-to-1 line MUXes. The MUX array receives the samples {x(n)} from the input port as well as from the AU. It uses signal sel3 to select input from the AU during first 12 clock cycles and from the input-port during last 4 clock cycles of each set of 16 clock cycles. The select control for 16 clock cycles for 16-point RFFT computation is given in Table VII. The twiddle factors of 16-point RFFT are stored in a ROM look-up-table (LUT) of 8-words as shown in Fig.11. It has an address generation unit to produce appropriate addresses


8

TABLE V I NPUT-O UTPUT DATA - FLOW OF DATA - SELECTORS . DS1

DS2 Input-block

Output-block

CC1.1 CC1.2 CC1.3 CC1.4

{u(0), u(8), u(4), u(12)} {u(2), u(10), u(6), u(14)} {u(1), u(9), u(5), u(13)} {u(3), u(11), u(7), u(15)}


{x(0), x(12), x(8), x(4)} {x(2), x(14), x(10), x(6)} {x(1), x(13), x(9), x(5)} {x(3), x(15), x(11), x(7)}


CC1.5 CC1.6 CC1.7 CC1.8

{v(0), v(4), v(8), v(12)} {v(2), v(6), v(10), v(14)} {v(1), v(5), v(9), v(13)} {v(3), v(7), v(11), v(15)}




CC1.9 CC1.10 CC1.11 CC1.12

{w(0), w(2), w(4), w(6)} {w(8), w(12), w(10), w(14)} {w(1), w(3), w(5), w(7)} {w(9), w(13), w(11), w(15)}




CC1.13 CC1.14 CC1.15 CC1.16





MUX(2-bit)

q0 q1 Stage Select

Counter (4-bit)

ctr1 D

MUX(2-bit)

CLK

MUX(1-bit)

Output-block

AC (2-bit)

Input-block

RA/WA for RB1/RB2

clock cycle no.

ctr2

r0 r1 D D

w0 w1

q2 q3

q3 + q1 q0 + q3 q2 q0

sel2

q3 .q2

sel1

q3 .q2

sel3

Fig. 12. Control unit (CU). AC stands for AND cell.

during different clock cycles (shown in Table VIII for 16point RFFT). The logic circuit at the input of 2-bit MUX of the address generation unit produces the upper 2-bits of address of each BF stages. The four BF stages of 16-point FFT are encoded by a pair of bits {q2 , q3 } used as the control of the MUX to select the 2-bit address of particular BF stage. The MSB of the address word is directly obtained by ANDing {q2 , q3 }. In general, a ROM LUT consisting N/2 words is required for N -point RFFT, where log2 N -to-1 MUX (log2 (N/4)-bit) is used to produce upper log2 (N/2) address bits. The logic design of the CU is shown in Fig.12. It produces control signals for TFSU, RA/WR of RBs and control signals for DS0 and DS1 of DSU, and select signals of AU to be used for the implementation of different BF stages. The AU receives a block of 4 samples from the DSU in every clock cycle to perform the BF operations. The BF outputs are sent back to the DSU except during the [(N/4)(log2 N − 1) + 1] to [(N/4)(log2 N )] clock cycles. During this period, AU computes the outputs of final BF stage of N -point RFFT and (N/4) number of 4-point input-blocks

corresponding to a new N -point input-vector is loaded onto the register bank. Therefore, during [(N/4)(log2 N −1)+1] to [(N/4)(log2 N )] clock cycles of every set of N clock cycles, N FFT coefficients {X(k)} are computed by the AU and N input samples of a new sequence {x(n)} are loaded into register banks. The computation FFT of an N -point sequence is completed by the proposed structure in [(N/4)(log2 N )] clock cycles. V. C OMPLEXITIES AND P ERFORMANCE C OMPARISON The proposed structure consists of one DSU, one AU, one TFSU, and one CU. The DSU consists of four register-banks and two data-selectors, where each register-bank is comprised of (N/4) registers of width w-bits, one log2 (N/4):(N/4) decoder, and one (N/4):1 line MUX. Each data-selector is comprised of four 2:1 MUXes. The AU involves 4 real multipliers, 6 real adders, 2 LCs, and one MUX-array, where each LC is comprised of two 2:1 MUXes. The MUX-array is comprised of four 2:1 MUXes. The TFSU involves one ROM unit of (N/2) words of 2W -bit width and an address generation unit. The CU and the address generation unit of TFSU involve only a few gates. The complexities of CU, ROM address generation unit, and decoders are negligible compared to the other components. We have not included the complexities of these components in the theoretical estimation (Table IX). The duration of clock period T = TM A + TF A + 5TM X + TRM , where TM A , TF A , TM X , and TRM are, respectively, the computation time of multiplication followed by addtion, full-adder delay, MUX delay and register-bank access delay. TRM = TAN D + TD + TM X log2 (N/4), where TD and TAN D are, respectively, the flip-flop delay, and the AND-gate delay. We have estimated hardware and time complexities of proposed structure accordingly, and listed in Table IX with those of the structures of [8] and [9]. The structure of [8] uses two memory units (memory-1 and memory-2). Each memory unit is comprised of 4 banks (for 1 PE case) where each bank


9

TABLE IX C OMPARISON OF H ARDWARE AND T IME C OMPLEXITIES Structures

Multiplier

Adder

REG

RAM words*

MUX/DMUX

Minimum clock period

Computation time (cc)

Jo et al [9]

12

22

0

8N

38

TM A + 2TA + 8TM X + 2TR

Ayinala et al [8] Proposed

4 4

6 6

0 N

4N 0

38 N + 12

TM A + TA + 8TM X + 2TR TM A + TF A + 5TM X + TRM

(N/4) log4 N or (N/4)(log4 (N/2) + 1) (N/4) log2 N (N/4) log2 N

*Single-port w-bit RAM, where w is the wordlength. TM A : computation time of multiplication followed by addtion, TA : real adder delay, TF A : full-adder delay, TM X : 2:1 MUX delay, TR : RAM access delay, TRM : register-based memory-bank access delay. TABLE X S YNTHESIS R ESULTS OF P ROPOSED S TRUCTURES AND S TRUCTURE OF [8] USING TMSC 65 NM CMOS S TANDARD C ELL L IBRARY Data Storage Unit Area (um2 ) Power (mW)

Complete Design Power (mW) ADP (um2 .us)

Design

N

MCP (ns)

Structure of [8]

16 32 64 128

1.66 1.67 1.7 1.77

8926.6 15793.6 29430.7 56917.8

1.8 3.1 5.6 10.6

17933.4 24481.4 38530.8 65804.7

2.7 4.2 7.5 13.9

476.3 1308.3 6288.2 26090.3

4.48 8.76 19.02 43.03

Proposed structure

16 32 64 128

1.1 1.17 1.32 1.61

2257.9 4374.0 7937.3 15130.4

0.9 1.7 2.6 4.9

8028.72 10513.44 13879.80 21267.35

2.2 2.9 3.3 6.1

143.9 492.0 1758.8 6478.9

2.49 4.29 6.59 14.63

Power estimated at MCP for both (EPS)=power×MCP×[(N/4) log2 N )]/N .

designs,

area

delay

is of depth (N/4) and needs to perform simultaneous readwrite operations. In order to perform simultaneous read-write operations, they are required to be implemented as dual-port RAM. N -word dual-port RAM can be implemented using two single-port RAMs of depth N words and width of w-bits, where w is word-length of input/output data. The structure of [8] involves 4N single-port RAM words of w-bit width or 2N dual-port RAM words 2w bit-width. As shown in Table IX, the proposed structure involves the same number of multipliers and adders as that of [8], but it involves N registers and (N − 26) extra MUXes against 4N single-port RAM words of the other. The proposed structure has shorter clock period than that of [8] as TR and TRM are nearly the same for smaller N values. For higher values of N , TRM may be marginally higher than TR . But the other components except the register-bank of proposed structure have less delay than those of [8]. Consequently, the overall clock period of the proposed structure is expected to be less than or nearly equal to that of [8] for higher values of N . Due to shorter clock period, the proposed structure has less computation time than the structure of [8] as both the structures involve the same number of clock cycles to complete the computation of N -point RFFT. Compared with the structure of [9], the proposed structure involves 3 times less multipliers, 3.6 times less adders, N extra registers and (N − 26) extra MUXes against 8N RAM words, and offers nearly half the throughput. The proposed register-based DSU requires nearly half of the area of the RAM based storage unit and offers higher area saving for higher values of N . The folded FFT is used when the length of input sequence is not very large size, and generally, less than 512. Moreover, for small FFT size N , RAM-based storage unit is more expensive compared to the register-based storage unit. Therefore, the

Area (um2 )

product

(ADP)=area×

MCP×(N/4) log2 N ,

EPS (pJ)

energy

per

sample

proposed structure could offer better area-delay efficiency and power efficiency than the existing structures for most practical applications which involve FFTs of length N = 256 or less. We have coded the proposed design in VHDL for RFFT of length 16, 32, 64, and 128. We have also coded the structure of [8] for the same FFT sizes using one processing element (2butterfly) since that is the most efficient amongst the existing structures. We have used the single-port RAM generated by Synopsis DesignWare for the implementation of data storage unit of [8] whereas delay flip-flop (D-FF) is used for implementation of proposed register-based DSU. We have taken 8bit word-length for the input and the output for all stages of computation. We have synthesized the proposed structure and the structure of [8] including its data storage unit in Synopsys Design Compiler (DC) and IC Compiler (ICC) using TSMC 65nm CMOS standard cell library. The area, minimum cycle period (MCP), and power consumption reported by ICC are listed in Table X. Power consumption is estimated at the MCP for both the designs. As per the theoretical estimates given in Table IX, the MCP of the proposed structure increases marginally when N increases. As shown in Table X, the proposed register-based DSU involves ∼ 73% less area and consumes ∼ 50% less power than the RAM-based DSU of [8] on average for different FFT sizes. The proposed structure involves ∼ 61% less area and ∼ 40% less power consumption than those of [8] on average for different FFT sizes. As shown in Table X, the proposed structure involves ∼ 70% less areadelay product (ADP) and ∼ 57% less energy per sample (EPS)2 than those of [8] on average for different FFT sizes.

2 ADP = Area × MCP× N log N 2 4 1 EPS= N [Power Consumption × MCP ×

N 4

log2 N ]


VI. CONCLUSIONS In this paper, we present an area-efficient and energyefficient architecture for radix-2 DIT real-valued FFT. Besides, we have proposed a register-based storage design which involves significantly less storage area compared with RAMbased storage unit at the cost of marginal increase in latency. The address generation for folded in-place DIT RFFT for register-based storage-unit is challenging since both read and write are performed in the same clock cycle at multiple different locations. Therefore, we have regularized the flow graph of RFFT and presented a recursive formulation of the necessary address generation. The proposed structure involves significantly less area-delay product and less energy per sample than the existing folded structures for RFFT implementation. R EFERENCES [1] H. Sorensen, D. Jones, M. Heideman, and C. Burrus, ”Real-valued fast Fourier transform algorithms,” IEEE Transactions on. Acoust., Speech Signal Process., vol. 35, no.6, pp. 849863, Jun. 1987. [2] Shen-Fu Hsiao and Wei-Ren Shiue, ”Design of Low-Cost and HighThroughput Linear Arrays for DFT Computations: Algorithms, Architectures and Implementations,” in Proc. IEEE Trans. Circuits and SystemsII: Analog and Digital Signal Processing, Nov 2000, vol. 47, no. 11, pp. 1188-1203. [3] M. Garrido, K. K. Parhi, J. Grajal, ”A pipelined FFT architecture for real-valued signals,” IEEE Transactions on Circuits and SystemsI, Regular Papers, vol. 56, no.12, pp. 2634-2643, Dec. 2009. [4] M. Ayinala, M. Brown, K. K. Parhi, ”Pipelined parallel FFT architectures via folding transformation,” IEEE Transactions on Very Large Scale Integration Systems, vol. 20, no. 6, pp. 1068-1081, June 2012. [5] A. Wang and A. P. Chandrakasan, ”Energy-aware architectures for a real valued FFT implementation,” in Proc. Int. Symp. Low Power Electron. Design, Aug. 2003, pp. 360-365. [6] H. Chi and Z. Lai, ”A cost-effective memory-based real-valued FFT and Hermitian symmetric IFFT processor for DMT-based wire-line transmission systems,” in Proc. IEEE Int. Symp. Circuits Syst., May 2005, vol. 6, pp. 6006-6009. [7] L. G. Johnson, ”Conflict free memory addressing for dedicated FFT hardware,” IEEE Transactions on Circuits and Systems-II., Analog Digit. Signal Process., vol. 39, no. 5, pp. 312316, May 1992. [8] M. Ayinala, Y. Lao and K. K. Parhi, ”An in-place FFT architecture for real-valued signals”, IEEE Transaction on Circuits and System-II, Express Briefs, vol. 60, no. 10, pp. 652656, Oct. 2013. [9] B. G. Jo and M. H. Sunwoo, ”New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy”, IEEE Transaction on Circuits and System-I, Regular Papers, vol. 52, no. 5, pp. 911919, May 2005.

Pramod Kumar Meher (SM’03) received his Ph.D. degree in science in the year 1996 from Sambalpur University (India), and currently he is a Senior Research Scientist with Nanyang Technological University, Singapore. He has contributed more than 200 technical papers listed in http://www.ntu.edu.sg/home/ASPKMeher/List.pdf. Dr. Meher has served as a speaker for the Distinguished Lecturer Program of IEEE Circuits and Systems Society during 2011 and 2012. He has also served as Associate Editor of IEEE Transactions on Circuits Systems-II: Express Briefs, IEEE Transactions on Circuits & Systems-I: Regular Papers, and IEEE Transactions on VLSI Systems, during 2008-2011, 2012-2013, and 2009-2014, respectively.

10

Basant Kumar Mohanty (M’06, SM’11) received Ph.D degree in 2000, and currently he is full professor in Jaypee University of Engineering and Technology, Guna, Madhya Pradesh. His research interest includes design and implementation of lowpower and high-performance systems for digital signal processing applications, secured communication and reconfigurable architectures. He has published nearly 60 technical papers. Dr.Mohanty is serving as Associate Editor for the Journal of Circuits, Systems, and Signal Processing.

Sujit Kumar Patel received B.E degree in electronics and communication engineering from Jabalpur Engineering College, Jabalpur, M.P, India and M.Tech. degree from the DA-IICT, Gandhinagar, Gujarat, India in 2006 and 2009, respectively. Currently he is an Assistant Professor with ECE department, Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, India where he is pursuing his Ph.D degree. His research interests on algorithms and concurrent architectures for adaptive filters.

Soumya Ganguly received the B.E (Hons.) degree in Electrical and Electronics Engineering from Birla Institute of Technology and Science, Hyderabad Campus, India, in 2014. His research interests include VLSI Architectures for Signal Processing applications, and Computer Arithmetic. Mr. Ganguly has received the Silver Leaf Certificate at the 2012 IEEE Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics, and the VLSI Society of India Fellowship in 2014.

Thambipillai Srikanthan (SM’92) is a full Professor in the School of Computer Engineering in Nanyang Technological University, Singapore. He is the Chair of the same school since February 2010. He is the Director of a 100-strong Centre for High Performance Embedded Systems (CHiPES), which he founded in 1998. Previously he was also the Director of the Intelligent Devices and Systems (IDeAS) Cluster at NTU. His research interests include design methodologies for highproductivity embedded systems, architectural translations of compute-intensive algorithms, computer arithmetic and high-speed techniques for vision- enabled systems. He has published more than 400 technical papers including 100 papers in various IEEE Transactions, IEE proceedings and other reputed international journals.

Efficient VLSI Architecture for Decimation-in-Time Fast ...

Efficient VLSI Architecture for Decimation-in-Time Fast ...

Suggest Documents

AN EFFICIENT VLSI ARCHITECTURE FOR MC ... - CiteSeerX

An Efficient VLSI Architecture for CORDIC Algorithm

An Efficient VLSI Architecture for Fingerprint

vlsi architecture of an area efficient image

Adaptable, fast, area-efficient architecture for

Efficient VLSI Architecture for Training Radial Basis Function ... - MDPI

Area Efficient VLSI Architecture for Square Root Carry Select ... - Core

An Efficient VLSI Architecture for New Three-Step Search ... - VTVT

An Efficient VLSI Architecture for CORDIC Algorithm - Semantic Scholar

An Efficient VLSI Architecture for Lifting-Based Flipping Discrete ...

An Efficient VLSI Architecture for Motion Compensation of AVS ... - UTA

An Efficient VLSI Architecture for Rivest-Shamir-Adleman Public-key ...

An efficient reformulation based VLSI architecture for ... - IEEE Xplore

High-Throughput Power-Efficient VLSI Architecture of Fractional ...

Energy-efficient Hardware Architecture and VLSI Implementation of a ...

Area-Efficient Architecture for Fast Fourier Transform - Biblioteca

An Efficient Architecture for the in Place Fast Cosine Transform

an Energy-Efficient Fast Adder Architecture for Cell ...

New Architecture Paradigms for Analog VLSI ... - People.csail.mit.edu

Display Architecture for VLSI -based Graphics Workstations

VLSI Architecture for Spatial Domain Spread ...

SYSTEM ANALYSIS OF VLSI ARCHITECTURE FOR MOTION ...

FPGA Implementation Technology for Memory Efficient VLSI

VLSI: Techniques for Efficient Standard Cell