Multi-mode parallel and folded VLSI architectures for ...

INTEGRATION, the VLSI journal 55 (2016) 43–56


INTEGRATION, the VLSI journal. Journal homepage: www.elsevier.com/locate/vlsi

Multi-mode parallel and folded VLSI architectures for 1D-fast Fourier transform

Mohamed Asan Basiri M*, Noor Mahammad Sk*

Indian Institute of Information Technology Design and Manufacturing Kancheepuram, Chennai, Tamil Nadu, India

Article info

Article history: Received 21 August 2014; received in revised form 9 July 2015; accepted 22 February 2016; available online 5 March 2016.

Keywords: DFT; DSP processor; FFT; Single path delay feedback; Multi path delay commutator

Abstract

Modern real-time applications such as orthogonal frequency division multiplexing demand high-performance fast Fourier transform (FFT) designs with less area and fewer clock cycles. This paper proposes efficient FFT VLSI architectures using folded and parallel implementations. In the proposed folded FFT architecture, the number of cycles required to complete the operation is less than in the single path delay feedback (SDF) and multi-path delay commutator (MDC) architectures. In the proposed parallel FFT architecture, an N-point FFT is implemented using one N/2-point FFT without much extra hardware. Both proposed architectures are implemented for radix-2, radix-$2^2$, and radix-4 using a 45 nm technology library. The proposed parallel architecture achieves 56.7% and 40.6% area reduction compared with the existing parallel architecture based 16-point radix-2 and radix-$2^2$ DIF FFTs, respectively. The proposed folded architecture achieves 65.5%, 51.1%, and 35.8% worst path delay reduction compared with the existing SDF based 16-point radix-2, radix-$2^2$, and radix-4 DIF FFTs, respectively. © 2016 Elsevier B.V. All rights reserved.

1. Introduction

The FFT [1] is the most popular algorithm used in digital signal processing applications such as orthogonal frequency division multiplexing (OFDM) [2,3] and ultra wide band (UWB) [4]. In general, VLSI architectures for digital signal processing applications are classified into two categories: (1) parallel form and (2) folded form [5]. The main difference between them is the trade-off between area and the number of clock cycles: a parallel architecture occupies more area than a folded one, whereas a folded architecture needs more clock cycles than a parallel one. Therefore, parallel architectures suit applications where time optimization (high throughput) is the primary goal (for example, supercomputers), and folded architectures suit applications where area optimization is the primary goal (for example, handheld devices). Parallel and folded architectures of this kind are commonly used in digital filter design, discrete transforms, and similar functions. The radix-2, radix-$2^2$, and radix-4 parallel FFT architectures are explained in [7–9], respectively. The folded FFT architectures are further classified into single path delay feedback (SDF), single path delay commutator (SDC), multi-path delay feedback (MDF), and multi-path delay commutator (MDC). The objective of the FFT is to reduce the time complexity of the discrete Fourier transform (DFT) [10].

* Corresponding authors. E-mail addresses: [email protected] (M. Asan Basiri M), [email protected] (N. Mahammad Sk). http://dx.doi.org/10.1016/j.vlsi.2016.02.007 0167-9260/© 2016 Elsevier B.V. All rights reserved.

Eq. (1) shows the basic operation of the DFT, where $W_{N/2}^{nr} = e^{-j2\pi nr/(N/2)}$.

\[ X[r] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi nr/N}, \quad r = 0, 1, \ldots, N-1 \tag{1} \]

\[ X[2r] = \sum_{n=0}^{N/2-1} \left( x[n] + x[n + N/2] \right) W_{N/2}^{nr} \tag{2} \]

where $r = 0, 1, \ldots, (N/2) - 1$.

\[ X[2r+1] = \sum_{n=0}^{N/2-1} \left( x[n] - x[n + N/2] \right) e^{-j2\pi n/N}\, W_{N/2}^{nr} \tag{3} \]

where $r = 0, 1, \ldots, (N/2) - 1$.

Eqs. (2) and (3) show the decomposition used in the radix-2 decimation in frequency (DIF) FFT. The radix-2 16-point single path delay feedback (SDF) [11] DIF FFT architecture is shown in Fig. 1(a). Here, the Butterfly units are represented as BF1, BF2, BF3, and BF4. Each Butterfly output is multiplied by the corresponding twiddle factor supplied by a multiplexer. The twiddle factor is written as $W_N^{nk} = e^{-j2\pi nk/N}$. The twiddle factors are selected by the select lines se1, se2, and se3. Throughout this paper, registers are represented as shaded square boxes. Similarly, the notation $z^{-N_1}$ represents $N_1$ cascaded hardware registers used for delay. During each clock cycle, every Butterfly unit can process only one set of inputs.

Fig. 1. 16-point single path delay feedback DIF FFT using (a) radix-2, (b) radix-$2^2$, and (c) radix-4 implementation.

Eqs. (4) and (5) show the index variables used to derive the radix-$2^2$ FFT algorithm, which is given in Eqs. (6)–(10).

\[ n = \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3 \right\rangle_N \tag{4} \]

\[ k = \langle k_1 + 2k_2 + 4k_3 \rangle_N \tag{5} \]

\[ X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{(N/4)-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left( \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3 \right) W_N^{nk} \tag{6} \]

\[ W_N^{nk} = (-j)^{n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3} \tag{7} \]

\[ X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{(N/4)-1} \left[ H_{N/4}(n_3, k_1, k_2)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3} \tag{8} \]

\[ H_{N/4}(n_3, k_1, k_2) = B_{N/2}(n_3, k_1) + (-j)^{(k_1 + 2k_2)}\, B_{N/2}\!\left( n_3 + \frac{N}{4}, k_1 \right) \tag{9} \]

\[ B_{N/2}(n_3, k_1) = x(n_3) + (-1)^{k_1}\, x\!\left( n_3 + \frac{N}{2} \right) \tag{10} \]
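The radix-$2^2$ decomposition of Eqs. (8)–(10) can be checked numerically against the direct DFT of Eq. (1). The following is a minimal sketch in plain Python; the function names `dft` and `radix22_dft` are illustrative and not taken from the paper, and the code evaluates the sums directly rather than modelling any hardware.

```python
import cmath


def dft(x):
    """Direct DFT of Eq. (1), used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / N) for n in range(N))
            for r in range(N)]


def radix22_dft(x):
    """Evaluate X(k1 + 2*k2 + 4*k3) through Eqs. (8)-(10).

    Assumes len(x) is a multiple of 4.
    """
    N = len(x)
    q = N // 4

    def B(n3, k1):
        # Eq. (10): first butterfly, span N/2
        return x[n3] + (-1) ** k1 * x[n3 + N // 2]

    X = [0j] * N
    for k1 in (0, 1):
        for k2 in (0, 1):
            rot = (-1j) ** (k1 + 2 * k2)  # trivial rotation of Eq. (9)
            # Eq. (9): second butterfly, span N/4
            H = [B(n3, k1) + rot * B(n3 + q, k1) for n3 in range(q)]
            # Twiddle W_N^{n3 (k1 + 2 k2)} inside the brackets of Eq. (8)
            G = [H[n3] * cmath.exp(-2j * cmath.pi * n3 * (k1 + 2 * k2) / N)
                 for n3 in range(q)]
            for k3 in range(q):
                # Remaining N/4-point DFT kernel W_{N/4}^{n3 k3} of Eq. (8)
                X[k1 + 2 * k2 + 4 * k3] = sum(
                    G[n3] * cmath.exp(-2j * cmath.pi * n3 * k3 / q)
                    for n3 in range(q))
    return X
```

For a 16-point input, `radix22_dft(x)` and `dft(x)` should agree to within floating-point rounding.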

The radix-$2^2$ 16-point single path delay feedback [12,14] DIF FFT is shown in Fig. 1(b). Here, the second Butterfly unit (BF2) contains two complex number multipliers. The outputs of the first and third Butterfly units are multiplied by $(-j)$. The operation of the radix-$2^2$ 16-point SDF FFT follows that of Fig. 1(a). Eqs. (11)–(17) show the radix-4 DIF FFT algorithm, where $X(4k)$, $X(4k+1)$, $X(4k+2)$, and $X(4k+3)$ are the radix-4 $N/4$-point DIF FFTs of $y(n)$, $y(n + N/4)$, $y(n + 2N/4)$, and $y(n + 3N/4)$, respectively.



Fig. 2. Radix-2 16-point DIF FFT (a) existing parallel (b) proposed folded architectures.

\[ X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk} = \sum_{n=0}^{\frac{N}{4}-1} x[n] W_N^{nk} + \sum_{n=\frac{N}{4}}^{\frac{2N}{4}-1} x[n] W_N^{nk} + \sum_{n=\frac{2N}{4}}^{\frac{3N}{4}-1} x[n] W_N^{nk} + \sum_{n=\frac{3N}{4}}^{N-1} x[n] W_N^{nk} \tag{11} \]

\[ X[k] = \sum_{n=0}^{\frac{N}{4}-1} \left[ x(n) + W_N^{Nk/4}\, x\!\left(n + \tfrac{N}{4}\right) + W_N^{2Nk/4}\, x\!\left(n + \tfrac{2N}{4}\right) + W_N^{3Nk/4}\, x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{nk} \tag{12} \]

\[ X[k] = \sum_{n=0}^{\frac{N}{4}-1} \left[ x(n) + (-j)^k\, x\!\left(n + \tfrac{N}{4}\right) + (-1)^k\, x\!\left(n + \tfrac{2N}{4}\right) + (j)^k\, x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{nk} \tag{13} \]

\[ y(n) = \left[ x(n) + x\!\left(n + \tfrac{N}{4}\right) + x\!\left(n + \tfrac{2N}{4}\right) + x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{0} \tag{14} \]

\[ y\!\left(n + \tfrac{N}{4}\right) = \left[ x(n) - j\,x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{2N}{4}\right) + j\,x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{n} \tag{15} \]

\[ y\!\left(n + \tfrac{2N}{4}\right) = \left[ x(n) - x\!\left(n + \tfrac{N}{4}\right) + x\!\left(n + \tfrac{2N}{4}\right) - x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{2n} \tag{16} \]

\[ y\!\left(n + \tfrac{3N}{4}\right) = \left[ x(n) + j\,x\!\left(n + \tfrac{N}{4}\right) - x\!\left(n + \tfrac{2N}{4}\right) - j\,x\!\left(n + \tfrac{3N}{4}\right) \right] W_N^{3n} \tag{17} \]

The radix-4 16-point SDF [15] architecture is shown in Fig. 1(c). Here, only one complex number multiplier is used. The corresponding twiddle factors are selected by Sel2. During each cycle, 4 outputs are produced from 4 inputs. Radix-2 and radix-4 multi-path delay commutator (MDC) architectures are explained in [16], and the radix-$2^2$ MDC architecture is shown in [17]. In Fig. 1, BF refers to the Butterfly unit. The twiddle factor for each cycle is selected through the multiplexer with the appropriate select line. Throughout this paper, the multiplication unit is a complex number multiplier, which consists of 4 real number multipliers and 2 real number signed adders. The addition unit is a complex number adder, which consists of 2 real number signed adders. In an FFT using the SDF/MDC architecture, each Butterfly unit can take only one set of inputs during each cycle. So, it takes more cycles to complete the whole operation. At the same time, this architecture needs few Butterfly units, and hence its area/power requirement is less than that of the parallel architecture.

Table 1 Operation of proposed folded N-point radix-2 DIF FFT. The entries give the value of n for each Butterfly unit (BF).

| Cycle | m | BF 1 | BF 2 | … | BF N/4−1 | BF N/4 | BF N/4+1 | BF N/4+2 | … | BF N/2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N/2 | 0 | 1 | … | N/4−2 | N/4−1 | N/4 | N/4+1 | … | N/2−1 |
| 2 | N/4 | 0 | 1 | … | N/4−2 | N/4−1 | 2N/4 | 2N/4+1 | … | 2N/4+N/4−1 |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ |
| log2 N | 1 | 0 | 2 | … | N/2−4 | N/2−2 | N/2 | N/2+2 | … | N−2 |

1.1. Contribution of this paper

This paper proposes an efficient folded FFT architecture with fewer clock cycles, which is compared with the existing SDF/MDC architectures. This paper also proposes an area-efficient parallel architecture for FFT, where the required N-point FFT is implemented using one N/2-point parallel FFT architecture without much additional hardware. The proposed folded/parallel FFT architectures are implemented for radix-2, radix-$2^2$, and radix-4. The number of clock cycles and the number of BF units are compared between the existing and proposed FFT architectures. The proposed folded FFT architecture is designed to reduce the number of cycles (at the cost of more BF units), and the proposed parallel FFT architecture is designed to reduce the number of BFs (at the cost of more cycles). The rest of the paper is organized as follows: Section 2 describes the proposed multi-mode folded FFT architecture. Section 3 describes the proposed multi-mode parallel FFT architecture. The time/hardware analysis of the proposed multi-mode DIF FFT is discussed in Section 4. Design modeling, implementation, and results are discussed in Section 5, followed by a conclusion in Section 6.
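As a numerical cross-check of the radix-4 decomposition in Eqs. (14)–(17), the sketch below applies one radix-4 DIF stage and then takes the four $N/4$-point DFTs of the resulting segments, which should yield $X(4r)$, $X(4r+1)$, $X(4r+2)$, and $X(4r+3)$. It is plain Python with illustrative names (`dft`, `radix4_dif_stage`, `radix4_dft`) that do not come from the paper, and the sub-transforms are computed with a direct DFT for brevity.

```python
import cmath


def dft(x):
    """Direct DFT, X[r] = sum_n x[n] * W_N^{nr}; reference only."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / N) for n in range(N))
            for r in range(N)]


def twiddle(N, e):
    """W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def radix4_dif_stage(x):
    """One radix-4 DIF stage following Eqs. (14)-(17)."""
    N = len(x)
    q = N // 4
    y = [0j] * N
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        y[n] = (a + b + c + d) * twiddle(N, 0)                    # Eq. (14)
        y[n + q] = (a - 1j * b - c + 1j * d) * twiddle(N, n)       # Eq. (15)
        y[n + 2 * q] = (a - b + c - d) * twiddle(N, 2 * n)         # Eq. (16)
        y[n + 3 * q] = (a + 1j * b - c - 1j * d) * twiddle(N, 3 * n)  # Eq. (17)
    return y


def radix4_dft(x):
    """N-point DFT from one radix-4 stage plus four N/4-point DFTs."""
    N = len(x)
    q = N // 4
    y = radix4_dif_stage(x)
    X = [0j] * N
    for i in range(4):
        sub = dft(y[i * q:(i + 1) * q])  # N/4-point transform of segment i
        for r in range(q):
            X[4 * r + i] = sub[r]        # segment i produces X(4r + i)
    return X
```

For a 16-point input, `radix4_dft(x)` should match `dft(x)` to within rounding error.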



Fig. 3. Proposed 16-point (a) radix-2/$2^2$ multi-mode folded DIF FFT, (b) radix-2 Butterfly unit, and (c) radix-$2^2$ Butterfly unit.

2. The proposed multi-mode folded DIF FFT architectures

The architecture of the proposed folded radix-2 16-point DIF FFT is derived from the existing parallel implementation; both are shown in Fig. 2. The existing parallel 16-point DIF FFT architecture is shown in Fig. 2(a), where $x[n]$ is the input. Here, the total number of stages is four, each with 8 Butterfly units (BFs). The first, second, and third stage outputs are represented as $y_1[n]$, $y_2[n]$, and $y_3[n]$, respectively. The value of n and the twiddle factor vary from one BF to another. The proposed radix-2 folded architecture follows Eqs. (2) and (3). In Fig. 2(b), only one stage of the radix-2 DIF FFT is used, which includes eight BFs. Therefore, the operation of every stage of the 16-point parallel architecture can be performed using only one stage of BFs in 4 cycles. The output of the previous cycle is sent as the input to the next cycle. During the first cycle, $x[n]$ and $x[n+m]$ are selected using a multiplexer (MUX). From the second cycle onwards, $y[n]$ and $y[n+m]$ from the previous cycle are selected using the MUX. The horizontal dark line shows the pipeline registers.

In general, the radix-2 N-point DIF FFT parallel architecture consists of $\log_2 N$ stages, each with $N/2$ BFs. The inputs of the first stage are $x[n]$ and $x[n + N/2]$, where $n = 0, 1, 2, \ldots, N/2 - 1$ for BFs $1, 2, \ldots, N/2$, respectively. The outputs of the first stage are $y_1[n] = x[n] + x[n + N/2]$ and $y_1[n + N/2] = (x[n] - x[n + N/2])\, W_N^{k}$, where $k = 0, 1, 2, \ldots, N/2 - 1$ for BFs $1, 2, \ldots, N/2$, respectively. The inputs of the second stage are $y_1[n]$ and $y_1[n + N/4]$, where $n = 0, 1, 2, \ldots, N/4 - 1$ for BFs $1, 2, \ldots, N/4$ and $n = 2N/4, 2N/4 + 1, 2N/4 + 2, \ldots, 2N/4 + N/4 - 1$ for BFs $N/4 + 1, N/4 + 2, \ldots, N/2$, respectively. The outputs of the second stage are $y_2[n] = y_1[n] + y_1[n + N/4]$ and $y_2[n + N/4] = (y_1[n] - y_1[n + N/4])\, W_{N/2}^{k}$, where $k = 0, 1, 2, \ldots, N/4 - 1$ for BFs $1, 2, \ldots, N/4$ and again $k = 0, 1, 2, \ldots, N/4 - 1$ for BFs $N/4 + 1, N/4 + 2, \ldots, N/2$, respectively. Similarly, the inputs of the last stage ($\log_2 N$) are $y_{\log_2 N - 1}[n]$ and $y_{\log_2 N - 1}[n+1]$, where $n = 0, 2, 4, \ldots, N - 2$ for BFs $1, 2, 3, \ldots, N/2$, respectively. The outputs of the last stage are $y_{\log_2 N}[n] = y_{\log_2 N - 1}[n] + y_{\log_2 N - 1}[n+1]$ and $y_{\log_2 N}[n+1] = (y_{\log_2 N - 1}[n] - y_{\log_2 N - 1}[n+1])\, W_2^{k}$, where $k = 0$. Table 1 shows the operation of the proposed folded N-point radix-2 DIF FFT.

Theorem 1. The operation of the radix-2 N-point DIF FFT parallel architecture ($\log_2 N$ stages) can be completed using $\log_2 N$ cycles of the proposed single-stage N-point DIF FFT folded architecture, with a trade-off in the number of cycles.

Proof. The existing radix-2 16-point DIF FFT parallel architecture consists of four stages, each with eight BFs. The 16 inputs are given to the first stage, and every other stage takes its inputs from the previous stage. According to the hardware reuse strategy [23], the same hardware is used for further processing in a time-multiplexed manner. During the first cycle, the 16 inputs are given to the proposed single-stage 16-point DIF FFT folded architecture. During clock cycles 2, 3, and 4, the output of the previous cycle is given to the same single-stage folded architecture. Therefore, the operation of the existing N-point DIF FFT parallel architecture ($\log_2 N$ stages) can be completed with the proposed single-stage N-point DIF FFT folded architecture in $\log_2 N$ cycles by using the hardware reuse strategy [23]. □

Table 2 Butterfly units with twiddle factors for 16-point radix-$2^2$ proposed folded DIF FFT (inputs of multiplexers M1–M6, indexed by their select lines, for Butterfly units BF1–BF8).

Table 3 Butterfly units with twiddle factors for proposed multi-mode folded 16-point radix-2 DIF FFT. Columns m3–m0 are the inputs of multiplexer U1.

| BF unit | m3 | m2 | m1 | m0 |
|---|---|---|---|---|
| BF1 | $W_{16}^{0}$ | $W_{8}^{0}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF2 | $W_{16}^{1}$ | $W_{8}^{1}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |
| BF3 | $W_{16}^{2}$ | $W_{8}^{2}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF4 | $W_{16}^{3}$ | $W_{8}^{3}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |
| BF5 | $W_{16}^{4}$ | $W_{8}^{0}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF6 | $W_{16}^{5}$ | $W_{8}^{1}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |
| BF7 | $W_{16}^{6}$ | $W_{8}^{2}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF8 | $W_{16}^{7}$ | $W_{8}^{3}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |

Table 4 Operation of proposed multi-mode folded radix-2/$2^2$ DIF FFT.

| Sel | Cycle | s1 | s2 | s3 | se | Operation |
|---|---|---|---|---|---|---|
| 0 | 0 | X | X | X | 0 | 2-point DIF FFT |
| 1 | 0 | 0 | X | X | 1 | 4-point DIF FFT |
| 1 | 1 | 1 | X | X | 0 | 4-point DIF FFT |
| 2 | 0 | X | 0 | X | 2 | 8-point DIF FFT |
| 2 | 1 | X | 1 | X | 1 | 8-point DIF FFT |
| 2 | 2 | X | 2 | X | 0 | 8-point DIF FFT |
| 3 | 0 | X | X | 0 | 3 | 16-point DIF FFT |
| 3 | 1 | X | X | 1 | 2 | 16-point DIF FFT |
| 3 | 2 | X | X | 2 | 1 | 16-point DIF FFT |
| 3 | 3 | X | X | 3 | 0 | 16-point DIF FFT |

Fig. 3(a) shows the proposed 16-point radix-2/$2^2$ multi-mode folded DIF FFT architecture with a 1-stage pipeline, where one 16-point, two 8-point, four 4-point, or eight 2-point FFTs can be performed. Throughout this paper, the input signal values of the FFT unit are represented as i with the corresponding index, i.e. i0, i1, i2, …. Similarly, the outputs of the FFT are represented as o with the corresponding index, i.e. o0, o1, o2, …. In general, a 16-point parallel FFT has 4 stages, each with 8 Butterfly units. In the proposed folded architecture, only one stage with 8 Butterfly units is used iteratively (4 iterations). So, the area of the proposed folded architecture is less than that of the existing parallel architecture, while the number of cycles needed to complete the whole operation is greater.

Fig. 3(b) and (c) show the proposed Butterfly units for the 16-point radix-2 and radix-$2^2$ DIF FFT, respectively. Here, the multiplexers are named U1, U2, U3, M1, M2, M3, M4, M5, and M6. Each multiplexer has dedicated twiddle factors as inputs. Tables 2 and 3 show the corresponding twiddle factors used in the BF units of the 16-point radix-$2^2$ and radix-2 DIF FFT, respectively. In Figs. 3 and 4, vertical dark lines represent pipelining. Here, the output of the previous cycle is sent back to the input if a 16/8/4-point FFT operation is performed. In the case of a 2-point FFT, the output is obtained after the first cycle. Table 4 shows the operation of the 16-point radix-2/$2^2$ DIF FFT. In all the tables, X represents a don't-care condition.

Fig. 4(a) and (b) show the proposed 16-point radix-4 multi-mode folded DIF FFT with a one-stage pipeline and its Butterfly unit, respectively. Here, only one clock cycle is required for four 4-point radix-4 DIF FFT operations. In 16-point DIF FFT mode, the outputs from the first cycle of BF1, BF2, BF3, and BF4 are sent back as inputs to the same Butterflies. Table 5 shows the BF units with twiddle factors for the proposed folded 16-point radix-4 DIF FFT. Table 6 shows the



Fig. 4. Proposed 16-point radix-4 (a) folded 16-point DIF FFT and (b) Butterfly unit.

Table 5 Butterfly units with twiddle factors for proposed multi-mode folded 16-point radix-4 DIF FFT.

| BF unit | m1 | m2 | m3 |
|---|---|---|---|
| BF1 | 1 | 1 | 1 |
| BF2 | $W_{16}^{1}$ | $W_{16}^{2}$ | $W_{16}^{3}$ |
| BF3 | $W_{16}^{2}$ | $W_{16}^{4}$ | $W_{16}^{6}$ |
| BF4 | $W_{16}^{3}$ | $W_{16}^{6}$ | $W_{16}^{9}$ |

Table 6 Operation of proposed multi-mode folded 16-point radix-4 DIF FFT.

| Sel | Cycle | s | Operation |
|---|---|---|---|
| 0 | 0 | 0 | 16-point DIF FFT |
| 1 | 1 | 1 | 16-point DIF FFT |
| 2 | 0 | 1 | 4-point DIF FFT |

operation of the proposed folded 16-point radix-4 DIF FFT. The number of BFs in the proposed folded DIF FFTs is greater than in the SDF and MDC architectures, but the number of cycles is smaller. Similarly, the worst path delay of the proposed folded DIF FFTs is less than that of SDF and MDC, because the critical path of the proposed folded architecture consists of only one BF, whereas the critical path of the radix-2 and radix-$2^2$ N-point DIF FFT SDF and MDC architectures consists of $\log_2 N$ BFs. In the case of radix-4 N-point SDF and MDC, the critical path consists of $\log_4 N$ Butterfly units.
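The folded radix-2 schedule described in this section can be sketched as a behavioural model: a single stage of $N/2$ butterflies is reused for $\log_2 N$ cycles, with the span m halving each cycle as in Table 1. The code below is plain Python, not the authors' Verilog. The names `folded_radix2_dif`, `bit_reverse`, and `dft` are illustrative, and the final reordering relies on the standard property that a DIF FFT delivers its outputs in bit-reversed order, which is an assumption about output ordering rather than something stated in the text.

```python
import cmath


def dft(x):
    """Direct DFT used as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / N) for n in range(N))
            for r in range(N)]


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def folded_radix2_dif(x):
    """Reuse one stage of N/2 butterflies for log2(N) cycles.

    Assumes len(x) is a power of two. Returns (X, cycles).
    """
    N = len(x)
    cycles = N.bit_length() - 1          # log2 N
    y = list(x)
    m = N // 2                           # span between butterfly inputs
    for _ in range(cycles):
        nxt = [0j] * N
        for b in range(N // 2):          # b = BF index - 1
            n = (b // m) * 2 * m + (b % m)                   # index pattern of Table 1
            w = cmath.exp(-2j * cmath.pi * (b % m) / (2 * m))  # W_{2m}^{b mod m}
            nxt[n] = y[n] + y[n + m]
            nxt[n + m] = (y[n] - y[n + m]) * w
        y = nxt                          # previous cycle output feeds the next cycle
        m //= 2
    X = [0j] * N
    for i in range(N):
        X[bit_reverse(i, cycles)] = y[i]  # undo bit-reversed output order
    return X, cycles
```

For N = 16 the loop runs four times, and the twiddle exponents generated per cycle follow the same pattern as the m3 to m0 columns of Table 3.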

3. The proposed multi-mode parallel DIF FFT architectures

In the proposed multi-mode parallel DIF FFT, an N/2-point parallel architecture is used to compute the N-point DIF FFT, where the number of cycles is the trade-off. Fig. 5(a) shows the operation of the existing radix-2 8-point DIF FFT parallel architecture, which uses three stages, each with four BFs. The inputs are represented as $x[n]$. The output of the first stage, $y_1[n]$, is given to the second stage, which produces $y_2[n]$. The third stage takes $y_2[n]$ as its input. The proposed 16-point parallel DIF FFT operation is shown in Fig. 5(b). The radix-2 16-point parallel DIF FFT requires 4 stages, each with 8 BFs. Therefore, stage 1 of the radix-2 16-point parallel DIF FFT is completed in 2 cycles (cycles 0 and 1) using stage 1 of the radix-2 8-point DIF FFT parallel architecture. Similarly, stage 2 of the 16-point parallel DIF FFT is completed in another 2 cycles (cycles 2 and 3), again using stage 1 of the 8-point architecture. Stage 3 of the 16-point parallel DIF FFT is completed during cycles 3 and 4 using stage 2 of the 8-point architecture. The final stage of the 16-point parallel DIF FFT is completed during cycles 4 and 5 using stage 3 of the 8-point architecture.

Theorem 2. The radix-2 N/2-point DIF FFT parallel architecture can be used to compute the N-point DIF FFT, with a trade-off in the number of cycles.

Proof. The existing radix-2 8-point and 16-point DIF FFT parallel architectures have 3 and 4 stages, with 4 and 8 BFs per stage, respectively. Therefore, the 16-point DIF FFT parallel architecture is equivalent to eight stages of the 8-point DIF FFT parallel architecture, each with four BFs. In general, the N-point DIF FFT parallel architecture is equivalent to $2\log_2 N$ stages of the N/2-point DIF FFT parallel architecture, each with N/4 BFs. Therefore, the operation of the radix-2 N-point DIF FFT parallel architecture can be completed in at most $2\log_2 N$ cycles using the N/2-point DIF FFT parallel architecture with the hardware reuse strategy. □

Fig. 5. Radix-2 (a) 8-point existing parallel and (b) 16-point proposed parallel DIF FFT architectures.

Fig. 6. Proposed 16-point radix-2 (a) parallel DIF FFT, (b) Butterfly unit used in BF1, BF2, BF3, and BF4, (c) Butterfly unit used in BF5, BF6, BF7, and BF8, and (d) Butterfly unit used in BF9, BF10, BF11, and BF12.

Table 7 Butterfly units with twiddle factors for proposed multi-mode parallel 16-point radix-2 DIF FFT.

| BF unit | m0 | m1 | m2 | BF unit | m3 |
|---|---|---|---|---|---|
| BF1 | $W_{16}^{0}$ | $W_{16}^{4}$ | $W_{8}^{0}$ | BF5 | $W_{4}^{0}$ |
| BF2 | $W_{16}^{1}$ | $W_{16}^{5}$ | $W_{8}^{1}$ | BF6 | $W_{4}^{1}$ |
| BF3 | $W_{16}^{2}$ | $W_{16}^{6}$ | $W_{8}^{2}$ | BF7 | $W_{4}^{0}$ |
| BF4 | $W_{16}^{3}$ | $W_{16}^{7}$ | $W_{8}^{3}$ | BF8 | $W_{4}^{1}$ |

Table 8 Operation of proposed multi-mode parallel radix-2 DIF FFT.

| Cycle | Sel | s | s1 | Operation |
|---|---|---|---|---|
| 0 | 0 | X | 0 | 16-point DIF FFT |
| 1 | 1 | X | 1 | 16-point DIF FFT |
| 2 | 2 | 0 | 2 | 16-point DIF FFT |
| 3 | 2 | 1 | 2 | 16-point DIF FFT |
| 0 | 3 | X | 2 | 8-point DIF FFT |

Fig. 6(a) shows the proposed multi-mode parallel architecture for the 16-point radix-2 DIF FFT using an 8-point architecture with a 3-stage pipeline. Fig. 6(b)–(d) show the corresponding BF units. Table 7 shows the Butterfly units with the corresponding twiddle factors for the proposed multi-mode parallel 16-point radix-2 DIF FFT. The operation of the proposed radix-2 16-point parallel architecture during clock cycles 0, 1, 2, and 3 is explained in Table 8. In 16-point DIF FFT mode, the first and second sets of 8 inputs are accepted by the architecture during clock cycles 0 and 1, respectively. During clock cycles 2 and 3, the register outputs are sent back as the inputs. The outputs of the registers R10, R11, R12, R13, R14, R15, R16, and R17 are sent to the second-stage BF units BF5, BF6, BF7, and BF8 during the 4th clock cycle. Similarly, the outputs of the second-stage Butterfly units BF5, BF6, BF7, and BF8 are sent to the third-stage BF units BF9, BF10, BF11, and BF12 during the 5th clock cycle. Hence, the first 8 outputs of the 16-point DIF operation are obtained after the 5th clock cycle, and the second 8 outputs are obtained after the 6th clock cycle. So, the proposed parallel radix-2 architecture requires 7 cycles to complete a 16-point DIF FFT and 3 clock cycles to complete an 8-point DIF FFT operation.

Fig. 7(a) shows the proposed multi-mode parallel architecture for the 16-point radix-$2^2$ DIF FFT using an 8-point architecture with a 3-stage pipeline. Fig. 7(b) and (c) show the corresponding Butterfly units. Here, the outputs of the stage-1 registers are sent as inputs to stage 2. The outputs of stage 2 and of the stage-3 registers (R50, R51, R52, R53, R54, R55, R56, and R57) are sent as inputs to stage 3. Table 9 shows the Butterfly units with the corresponding twiddle factors for the proposed 16-point parallel radix-$2^2$ DIF FFT. The operation of the proposed parallel 16-point radix-$2^2$ DIF FFT architecture is explained in Table 10. It requires 8 cycles to complete a 16-point DIF FFT and 3 clock cycles to complete an 8-point DIF FFT operation.

Fig. 8(a) shows the proposed 64-point radix-4 parallel DIF FFT. Fig. 8(b) shows the Butterfly unit used in BF1, BF2, BF3, and BF4, and Fig. 8(c) shows the Butterfly unit used in BF5, BF6, BF7, and BF8 of the proposed radix-4 parallel 64-point DIF FFT. Here, the outputs of the stage-1 (BF1, BF2, BF3, and BF4) registers are sent as inputs both to the stage-2 BF units (BF5, BF6, BF7, and BF8) and back to the stage-1 BF units. Table 11 shows the BF units with the corresponding inputs and twiddle factors for the proposed multi-mode parallel 64-point radix-4 DIF FFT, and its operation is explained in Table 12. Here, the operation of the first and second stages of the conventional parallel radix-4 64-point DIF FFT is performed during clock cycles 0–7, as shown in Table 12. So, the first 16 outputs of the 64-point DIF FFT operation are produced after the 8th clock cycle, and 11 clock cycles are required to obtain all 64 outputs. In the case of a 16-point DIF FFT, 2 clock cycles are required to obtain all the results.
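The counting argument behind Theorem 2 can be written out as a short sketch. It only counts butterflies and stage passes; it does not model the register-level schedule of Fig. 6. The function names `stage_passes` and `half_size_bf_count` are illustrative and not from the paper.

```python
def log2_int(n):
    """Integer log2 for a power of two."""
    return n.bit_length() - 1


def half_size_bf_count(n_points):
    """Butterfly units in an N/2-point radix-2 parallel FFT: (N/4) * log2(N/2)."""
    return (n_points // 4) * log2_int(n_points // 2)


def stage_passes(n_points):
    """Passes through one N/2-point stage (N/4 BFs each) needed for an
    N-point radix-2 DIF FFT. This equals 2 * log2(N), the upper bound on
    the number of cycles stated in Theorem 2."""
    total_butterflies = log2_int(n_points) * (n_points // 2)  # log2(N) stages of N/2 BFs
    bfs_per_half_stage = n_points // 4
    return total_butterflies // bfs_per_half_stage
```

For N = 16 this gives 12 Butterfly units (BF1 to BF12 in Fig. 6) and 8 stage passes. The 7 cycles reported above for the pipelined radix-2 design stay within that bound because passes through different stages overlap in time.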

4. Time/Hardware analysis of proposed multi-mode DIF FFT architectures

Table 13 compares the proposed N-point folded/parallel FFTs with other designs. In an SDF/MDC based N-point radix-2/$2^2$ DIF FFT, the first set of outputs is produced after $N - 1$ clock cycles, and the remaining $N/2 - 1$ sets of outputs are produced one per clock cycle after the $(N-1)$th clock cycle. So, it takes $(N - 1) + (N/2 - 1) = 3N/2 - 2$ clock cycles to complete the whole operation. Similarly, a radix-4 N-point SDF/MDC based DIF FFT requires $\log_4 N$ clock cycles to produce the first 4 outputs and $(N/4 - 1)\log_4 N$ clock cycles to produce the rest of the outputs. So, it requires $\log_4 N + (N/4 - 1)\log_4 N = (N/4)\log_4 N$ clock cycles to complete the whole operation.

The parallel radix-2/$2^2$ FFT has $\log_2 N$ stages, each with $N/2$ Butterfly units. The proposed folded radix-2/$2^2$ FFT uses one of the $\log_2 N$ stages of the existing parallel architecture. Therefore, it requires $\log_2 N$ cycles to complete its operation, as explained in Theorem 1. Similarly, the proposed folded radix-4 FFT requires $\log_4 N$ cycles to complete its operation, using one of the $\log_4 N$ stages of the existing parallel radix-4 FFT, where each stage consists of $N/4$ Butterfly units. In the proposed radix-2/$2^2$ parallel N-point FFT, the number of Butterfly units equals that of the parallel radix-2 N/2-point FFT, i.e. $\frac{N}{4}\log_2\frac{N}{2}$. So, it requires at most $2\log_2 N$ cycles to complete the N-point FFT operation, as explained in Theorem 2. Similarly, in the proposed radix-4 parallel N-point FFT, the number of Butterfly units equals that of the parallel radix-4 N/4-point FFT, i.e. $\frac{N}{16}\log_4\frac{N}{4}$. So, it requires at most $4\log_4 N$ cycles to complete the N-point FFT operation. Table 14 shows the worst/best/typical case comparison of the proposed N-point folded/parallel DIF FFTs with respect to the number of cycles and the mode, where the mode represents the length of the DIF FFT operation.

In [20], a radix-2 SDC-SDF based folded FFT architecture ($\log_2 N$ stages) is proposed, where the Butterfly units of the first $(\log_2 N) - 1$ stages are designed with 3 real number adders and 2 real number multipliers. So, the hardware cost is reduced compared with the conventional radix-2 SDF based FFT, which uses one complex number adder (two real adders) and one complex number multiplier (4 real number multipliers and 2 real number adders). The last stage contains one conventional SDF Butterfly unit. Therefore, the number of complex number multipliers used in [20] is half that of the conventional radix-2 SDF based architecture, as listed in Table 13. In [21], a 4-parallel MDF based folded FFT architecture ($\log_2 N$ stages) is proposed. Here, two Butterfly units are used in parallel with one common complex number multiplier. The FIFO registers used in the feedback path are divided into two unequal parts to share the common multiplier.
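The cycle-count expressions derived in this section can be tabulated for a given N with a small helper. This is a sketch that simply evaluates the formulas quoted above. The dictionary keys and the function names `cycle_counts` and `exact_log` are illustrative, and the radix-4 entries are returned as `None` when N is not a power of 4.

```python
def exact_log(n, base):
    """Return k with base**k == n, or None if n is not an exact power."""
    k = 0
    while n > 1:
        if n % base:
            return None
        n //= base
        k += 1
    return k


def cycle_counts(n_points):
    """Clock cycles per N-point DIF FFT according to the Section 4 formulas."""
    l2 = exact_log(n_points, 2)
    l4 = exact_log(n_points, 4)
    if l2 is None:
        raise ValueError("N must be a power of two")
    return {
        "sdf_mdc_radix2": 3 * n_points // 2 - 2,            # (N-1) + (N/2-1)
        "sdf_mdc_radix4": None if l4 is None else (n_points // 4) * l4,
        "folded_radix2": l2,                                # Theorem 1
        "folded_radix4": l4,
        "parallel_radix2_bound": 2 * l2,                    # Theorem 2, upper bound
        "parallel_radix4_bound": None if l4 is None else 4 * l4,
    }
```

For example, `cycle_counts(16)` reports 22 cycles for radix-2 SDF/MDC against 4 for the proposed folded radix-2 design, with an upper bound of 8 for the proposed radix-2 parallel design.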

5. Design modeling, implementation, and results

The proposed and existing designs are modeled in Verilog HDL. These models are simulated and verified using the Xilinx ISE simulator. The timing, area, and power analysis of the FFT implementations is done with the Cadence 6.1 ASIC design tool. All designs are implemented in 45 nm technology [26] using the technology library tcbn45gsbwpbc088_ccs.lib, with an operating voltage of 0.88 V. The timing, area, and power details are obtained directly from synthesis using Cadence (Encounter). Table 15 compares the worst path delay, total area, net power, and power delay product (PDP), or energy per operation [29], of various 1D DIF FFT architectures. The hardware cost parameters [27] of any digital circuit are the development time, chip area, and power consumption. The PDP stands for the average energy consumed per switching event, as is apparent from its units (W·s = J). In Cadence (Encounter), switching and leakage power can be measured directly during synthesis. Therefore, the PDP [29] is calculated by multiplying the worst path delay by the sum of the switching and leakage power.

Fig. 7. Proposed 16-point radix-$2^2$ (a) Parallel DIF FFT, (b) Butterfly unit used in BF5, BF6, BF7, and BF8, and (c) Butterfly unit used in BF1, BF2, BF3, BF4, BF9, BF10, BF11, and BF12.

Table 9 Butterfly units with twiddle factors for proposed multi-mode parallel 16-point radix-$2^2$ DIF FFT.

| BF unit | m4 | m0 | m5 | m1 | m6 | m2 |
|---|---|---|---|---|---|---|
| BF1 | 1 | $W_{8}^{0}$ | 1 | $W_{16}^{0}$ | $W_{16}^{0}$ | $W_{16}^{0}$ |
| BF2 | 1 | $W_{8}^{2}$ | 1 | $W_{16}^{2}$ | $W_{16}^{1}$ | $W_{16}^{3}$ |
| BF3 | $W_{8}^{0}$ | $W_{8}^{0}$ | 1 | $W_{16}^{4}$ | $W_{16}^{2}$ | $W_{16}^{6}$ |
| BF4 | $W_{8}^{1}$ | $W_{8}^{3}$ | 1 | $W_{16}^{6}$ | $W_{16}^{3}$ | $W_{16}^{9}$ |

The number of cycles used by the proposed parallel architectures is greater than that of the existing parallel architectures, as shown in Table 13. The total cycle delay is computed by multiplying the worst path delay by the number of cycles (pipeline stages). Since the worst path delay of the proposed parallel architectures is less than that of the existing architectures, the total cycle delay for a given number of tasks (N-point DIF FFTs) is lower for the proposed parallel architectures. For example, suppose the system requires five 16-point radix-2 parallel DIF FFTs. With the existing parallel architecture, the total computation time is 15,067.7 × 5 ≈ 75,338 ps. With the proposed pipelined parallel architecture, the total cycle delay to produce the output of the first DIF FFT operation is 5422 × 7 = 37,954 ps, because the proposed radix-2 16-point parallel architecture requires 7 cycles, as explained in Section 3. The remaining four DIF FFT outputs are produced with an extra 5422 × 4 = 21,688 ps, which is needed to empty the pipeline. Therefore, the total delay required to produce five 16-point radix-2 DIF FFT operations is 37,954 + 21,688 = 59,642 ps.
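The arithmetic of this worked example can be reproduced with a few lines of Python. The two delay constants are the worst path delays quoted in the text. The function names, and the model of one extra cycle per additional FFT once the pipeline is full, are a reading of the example above rather than a general timing model.

```python
EXISTING_DELAY_PS = 15067.7   # worst path delay, existing parallel radix-2 16-point
PROPOSED_DELAY_PS = 5422.0    # worst path delay, proposed parallel radix-2 16-point


def existing_total_ps(num_ffts):
    """Existing parallel design: one worst-path delay per FFT."""
    return EXISTING_DELAY_PS * num_ffts


def proposed_total_ps(num_ffts, cycles_first=7):
    """Proposed pipelined design: `cycles_first` cycles for the first FFT,
    then one extra cycle for each remaining FFT while the pipeline drains."""
    return PROPOSED_DELAY_PS * (cycles_first + (num_ffts - 1))


if __name__ == "__main__":
    print(existing_total_ps(5), proposed_total_ps(5))
```

Calling both functions with `num_ffts=5` gives the two totals compared in the example.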


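| Cycle | s | s1 | s2 | Sel1 | Sel2 | s8 | s5 | Operation |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | X | X | X | X | 16-point DIF FFT |
| 1 | 1 | 0 | 0 | X | X | X | X | 16-point DIF FFT |
| 2 | X | X | X | 0 | X | X | 1 | 16-point DIF FFT |
| 3 | X | X | X | 1 | X | X | 2 | 16-point DIF FFT |
| 4 | X | X | X | X | 0 | 0 | X | 16-point DIF FFT |
| 5 | X | X | X | X | 0 | 0 | X | 16-point DIF FFT |
| 6 | X | X | X | X | 1 | 1 | X | 16-point DIF FFT |
| 7 | X | X | X | X | 1 | 1 | X | 16-point DIF FFT |
| 0 | 2 | 1 | 0 | X | X | X | 0 | 8-point DIF FFT |
| 1 | X | X | X | 2 | X | X | 0 | 8-point DIF FFT |
| 2 | X | X | X | X | 2 | 1 | 0 | 8-point DIF FFT |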

The proposed parallel architectures achieve 56.7% and 40.6% area reduction compared with the existing parallel architecture based 16-point radix-2 and radix-$2^2$ DIF FFTs, respectively. Similarly, the proposed parallel architecture achieves 49.5% area reduction compared with the existing parallel architecture based 64-point radix-4 DIF FFT. The proposed parallel architectures achieve 90.9% and 88.3% PDP reduction compared with the existing parallel architecture based 16-point radix-2 and radix-$2^2$ DIF FFTs, respectively. Similarly, the proposed parallel architecture achieves 90.4% PDP reduction compared with the existing parallel architecture based 64-point radix-4 DIF FFT. The area of the proposed folded architectures is greater than that of the existing SDF/MDC architectures; this is the trade-off explained in Section 1. In SDF/MDC, the hardware registers act as shift registers rather than pipeline registers. So, the worst path delay of SDF/MDC is almost equal to the worst path delay of the existing parallel architectures. The number of cycles and the worst path delay of the proposed folded DIF FFT architectures are less than those of the existing SDF/MDC architectures, as explained in Section 1. The proposed folded architectures achieve

Fig. 8. Proposed 64-point radix-4 parallel DIF FFT (a) Architecture, (b) Butterfly unit used in BF1, BF2, BF3, and BF4, and (c) Butterfly unit used in BF5, BF6, BF7, and BF8.



Table 11 BF units with corresponding inputs and twiddle factors for proposed multi-mode parallel 64-point radix-4 DIF FFT.

[Table body: for the multiplexers M1 to M4 and K1 to K3 and the select lines s1 and se2, the inputs routed to BF1 to BF4, drawn from the data inputs i0 to i63, the registers R, and the twiddle factors $W_{64}^{k}$ and $W_{16}^{k}$.]

[Table: settings of the select lines (Sel, s1, se2) for the 64-point and 16-point DIF FFT operations.]

65.5%, 51.1%, and 35.8% worst path delay reduction compared with the existing SDF based 16-point radix-2, radix-$2^2$, and radix-4 DIF FFTs, respectively. Figs. 9 and 10 show the layout diagrams for the proposed radix-2 16-point folded and parallel DIF FFT architectures. Area and power dissipation are, in most cases, directly proportional to each other. The proposed folded FFT architecture is designed to optimize the number of cycles (the trade-off is the number of BF units). So, the delay and number of cycles for the proposed folded FFT architectures are less than those of the existing ones (SDF, MDC, 4-parallel MDC, MDF, and SDC-SDF). The area of the proposed folded architectures is higher than that of the existing folded architectures. Therefore, the PDP of the proposed folded architectures is greater than that of the existing ones. At the same time, the 4-parallel MDF, 4-parallel MDC, 8-parallel MDC, and mixed radix MDC architectures have less area than the proposed architectures and greater area than the other existing techniques. The number of clocked registers used in these architectures is greater than in the proposed folded architectures. Therefore, the PDPs of the 4-parallel MDF, 4-parallel MDC, 8-parallel MDC, and mixed radix MDC architectures are greater than those of the proposed designs. The clock in a digital system tends to increase the power, so a pipelined system dissipates more power than a non-pipelined one. The proposed parallel FFT architecture is designed to optimize the number of BFs (the trade-off is the number of cycles, i.e., the throughput of the proposed parallel architecture is less than



Table 13 Comparison of N-point proposed folded/parallel DIF FFT with others.

[Table body: for each radix (2, $2^2$, and 4) and each architecture (SDF [11], [14], [15]; MDC [16], [17]; 4-parallel MDC [6]; 8-parallel MDC [6]; SDC-SDF [20]; variable length SDF [25]; 4-parallel MDF [21]; proposed folded; parallel [7], [8], [9]; proposed parallel), the hardware cost (number of BF units, number of complex number adders, number of complex number multipliers) and the speed (number of cycles, throughput).]

Throughput = number of operations (outputs) per cycle [24]. Each complex number adder contains two real number adders. Each complex number multiplier contains four real number multipliers and two real number adders.
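The operator counts in the note to Table 13 can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the authors' hardware description: the type alias `Complex`, the three function names, and the example counts (8 complex adders, 3 complex multipliers) are hypothetical and chosen only to show the arithmetic.

```python
from typing import Dict, Tuple

Complex = Tuple[float, float]  # (real part, imaginary part)


def complex_add(a: Complex, b: Complex) -> Complex:
    """One complex addition costs two real additions."""
    return (a[0] + b[0], a[1] + b[1])


def complex_mul(a: Complex, b: Complex) -> Complex:
    """One complex multiplication costs four real multiplications
    and two real additions (one of them a subtraction)."""
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi, ar * bi + ai * br)


def real_operator_count(complex_adders: int, complex_multipliers: int) -> Dict[str, int]:
    """Translate complex operator counts into real adders and multipliers."""
    return {
        "real_adders": 2 * complex_adders + 2 * complex_multipliers,
        "real_multipliers": 4 * complex_multipliers,
    }


# Illustrative counts only (not taken from a specific row of the table).
print(real_operator_count(complex_adders=8, complex_multipliers=3))
```

For the illustrative counts, 8 complex adders and 3 complex multipliers correspond to 22 real adders and 12 real multipliers.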

existing). The number of pipeline stages in the proposed parallel architecture is greater than in the existing one (listed in Table 13 as the number of cycles). Hence, more hardware registers are driven by the common clock in the proposed parallel FFT architectures. Therefore, PDP for proposed parallel architectures is greater than existing.

6. Conclusion

In this paper, efficient VLSI architectures for folded/parallel FFT operations are proposed. The proposed folded architecture performs the whole N-point radix-k DIF FFT operation iteratively by using the butterfly units of a single stage of the parallel radix-k DIF FFT with the corresponding twiddle factors; the parallel radix-k DIF FFT has $\log_k N$ stages, and each stage contains $N/k$ butterfly units. In the proposed parallel FFT architecture, an N-point FFT is implemented by using one N/2-point FFT without much extra hardware. In this work, the proposed and existing architectures are implemented for radix-2, $2^2$, and 4 using a 45 nm technology library. The proposed parallel architectures achieve 56.7% and 40.6% area reduction compared with the existing parallel 16-point radix-2 and radix-$2^2$ DIF FFTs, respectively. The proposed folded



Table 14 Worst/best/typical case comparison of N-point proposed folded/parallel DIF FFT with respect to number of cycles and mode of operation (length of DIF FFT).

| Proposed DIF FFT architecture | Best case: mode | Best case: number of cycles | Worst case: mode | Worst case: number of cycles | Typical case: mode | Typical case: number of cycles |
|---|---|---|---|---|---|---|
| Radix-2/$2^2$ 16-point folded | 2-point | 1 | 16-point | 4 | 4 or 8-point | 2 or 3 |
| Radix-2/$2^2$ N-point folded | 2-point | 1 | N-point | $\log_2 N$ | other than 2 and N-point | $1 < \text{cycles} < \log_2 N$ |
| Radix-4 16-point folded | 4-point | 1 | 16-point | 2 | 16-point | 2 |
| Radix-4 N-point folded | 4-point | 1 | N-point | $\log_4 N$ | other than 4 and N-point | $1 < \text{cycles} < \log_4 N$ |
| Radix-2 16-point parallel | 8-point | 3 | 16-point | 7 | 16-point | 7 |
| Radix-2 N-point parallel | N/2-point | $\log_2 (N/2)$ | N-point | $\le 2 \log_2 N$ | N-point | $\le 2 \log_2 N$ |
| Radix-$2^2$ 16-point parallel | 8-point | 3 | 16-point | 8 | 16-point | 8 |
| Radix-$2^2$ N-point parallel | N/2-point | $\log_2 (N/2)$ | N-point | $\le 2 \log_2 N$ | N-point | $\le 2 \log_2 N$ |
| Radix-4 64-point parallel | 16-point | 2 | 64-point | 11 | 64-point | 11 |
| Radix-4 N-point parallel | N/4-point | $\log_4 (N/4)$ | N-point | $\le 4 \log_4 N$ | N-point | $\le 4 \log_4 N$ |
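The folded radix-2 cycle counts in Table 14 can be illustrated with a behavioural model in Python. This is only a sketch under assumptions: it models one pass over N/2 butterflies as one cycle and ignores the multiplexers, registers, and select lines of the actual design. The names `folded_radix2_dif_fft` and `dft` are hypothetical, and the direct DFT is included only as a reference for checking the result.

```python
import cmath
from typing import List, Sequence, Tuple


def folded_radix2_dif_fft(x: Sequence[complex]) -> Tuple[List[complex], int]:
    """Reuse one stage of N/2 butterflies once per cycle, log2(N) cycles in total."""
    n = len(x)
    assert n >= 2 and n & (n - 1) == 0, "length must be a power of two"
    data = [complex(v) for v in x]
    cycles = 0
    span = n // 2  # distance between the two inputs of a butterfly
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                a = data[start + j]
                b = data[start + j + span]
                twiddle = cmath.exp(-2j * cmath.pi * j / (2 * span))
                data[start + j] = a + b
                data[start + j + span] = (a - b) * twiddle
        cycles += 1
        span //= 2
    # DIF leaves the outputs in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, value in enumerate(data):
        out[int(format(i, f"0{bits}b")[::-1], 2)] = value
    return out, cycles


def dft(x: Sequence[complex]) -> List[complex]:
    """Direct O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


samples = [complex(i, -i) for i in range(16)]
spectrum, cycles = folded_radix2_dif_fft(samples)
print(cycles)  # 4 passes for a 16-point transform
```

In this model a 2-point transform takes 1 pass, 4-point and 8-point transforms take 2 and 3 passes, and a 16-point transform takes 4, which matches the best, typical, and worst case entries for the radix-2 16-point folded row.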

Table 15 Comparison results of various 1D-DIF FFT architectures using 45-nm technology.

| Radix | N | N-point DIF FFT architecture | Worst path delay (ps) | Frequency (MHz) | Throughput | Total area (μm²) | Number of cells | Net power (nW) | Switching power (nW) | Leakage power (nW) | PDP (fJ), energy per operation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 16 | Parallel [7] | 15,067.7 | 66.4 | 16 | 410,638.9 | 309,711 | 13,239,211.1 | 37,252,446.3 | 19,903,330.7 | 861,206.2 |
| 2/4 | 16 | Parallel [19] | 14,708.7 | 68.1 | 16 | 474,734.7 | 361,916 | 14,156,306.4 | 39,834,243.3 | 22,942,066.6 | 9,233,578.1 |
| 2 | 16 | Proposed parallel | 5422.0 | 184.4 | 8 | 177,847.0 | 127,755 | 1,967,268.6 | 6,093,506.4 | 8,353,796.4 | 78,333.3 |
| $2^2$ | 16 | Parallel [8] | 11,870.3 | 84.2 | 16 | 382,225.5 | 290,078 | 11,878,794.5 | 33,236,865.3 | 18,316,633.9 | 611,955.6 |
| $2^2$ | 16 | Proposed parallel | 5217.7 | 191.7 | 8 | 226,941.4 | 155,635 | 313,145.8 | 1,196,138.8 | 12,454,303.2 | 71,224.4 |
| 4 | 64 | Parallel [9] | 16,467.9 | 60.7 | 64 | 1,362,300.5 | 1,176,504 | 56,677,494.8 | 136,346,215.1 | 62,585,455.1 | 3,275,986.9 |
| 2/4 | 64 | Parallel [19] | 15,432.5 | 64.8 | 64 | 982,521.5 | 694,238 | 41,572,714.1 | 63,256,661.1 | 37,326,613.1 | 1,552,251.5 |
| 4 | 64 | Proposed parallel | 7282.4 | 137.3 | 16 | 688,497.1 | 498,627 | 2,652,370.7 | 8,266,041.9 | 34,747,213.0 | 313,239.6 |
| 2 | 16 | SDF [11] | 15,867.5 | 63.02 | 2 | 52,281.9 | 39,611 | 925,647.6 | 2,473,586.9 | 2,457,504.8 | 78,244.1 |
| 2 | 16 | MDC [16] | 15,267.5 | 65.5 | 2 | 53,789.1 | 39,833 | 721,160.4 | 1,966,510.8 | 2,442,210.5 | 67,310.2 |
| 2 | 16 | 4-parallel MDC [6] | 15,084.6 | 66.3 | 4 | 102,784.1 | 78,616 | 1,297,226.6 | 3,458,148.5 | 4,727,111.7 | 123,471.4 |
| 2 | 16 | SDC-SDF [20] | 8711.8 | 114.8 | 2 | 27,484.4 | 20,586 | 230,978.2 | 626,789.9 | 1,309,002.4 | 16,864.2 |
| 2 | 16 | 4-parallel MDF [21] | 8740.4 | 114.4 | 4 | 91,817.4 | 68,783 | 1,356,854.7 | 3,644,161.9 | 4,200,483.1 | 68,565.3 |
| 2 | 16 | Variable length SDF [25] | 15,548.3 | 64.3 | 2 | 52,954.4 | 40,163 | 944,359.9 | 2,513,912.6 | 2,491,889.5 | 77,831.7 |
| 2 | 16 | Proposed folded | 5486.7 | 182.3 | 16 | 202,105.8 | 137,228 | 1,563,533.9 | 4,607,206.9 | 9,946,621.1 | 79,852.5 |
| $2^2$ | 16 | SDF [14] | 11,208.7 | 89.2 | 2 | 48,048.8 | 36,124 | 804,591.8 | 2,162,001.3 | 2,250,845.1 | 49,462.3 |
| $2^2$ | 16 | MDC [17] | 11,078.4 | 90.3 | 2 | 49,686.5 | 36,442 | 689,429.2 | 1,880,231.6 | 2,255,065.9 | 45,812.5 |
| $2^2$ | 16 | 4-parallel MDC [6] | 11,127.2 | 89.9 | 4 | 94,059.3 | 71,456 | 109,154.5 | 2,934,221.4 | 4,924,749.9 | 87,448.6 |
| $2^2$ | 16 | 4-parallel MDF [21] | 10,917.5 | 91.5 | 4 | 92,016.1 | 68,855 | 1,582,354.9 | 4,250,352.6 | 4,217,923.4 | 92,452.4 |
| $2^2$ | 16 | Variable length SDF [25] | 10,986.0 | 91.0 | 2 | 48,552.3 | 36,594 | 817,193.1 | 2,190,436.2 | 2,272,758.9 | 49,032.6 |
| $2^2$ | 16 | Proposed folded | 5490.2 | 182.1 | 16 | 247,243.8 | 167,851 | 1,395,058.0 | 4,098,530.3 | 10,040,188.6 | 77,624.4 |
| 4 | 16 | SDF [15] | 11,261.3 | 88.8 | 4 | 110,383.4 | 87,234 | 1,992,840.8 | 5,257,551.5 | 5,134,070.0 | 117,023.2 |
| 4 | 16 | MDC [16] | 11,761.3 | 85.1 | 4 | 117,931.7 | 94,112 | 2,357,822.5 | 6,198,834.8 | 5,500,590.5 | 137,600.5 |
| 4 | 16 | 8-parallel MDC [6] | 11,137.0 | 89.8 | 8 | 230,931.8 | 187,104 | 555,551.6 | 14,575,191.5 | 10,911,237.9 | 283,842.4 |
| 4 | 16 | Variable length SDF [25] | 10,939.5 | 91.4 | 4 | 111,018.4 | 87,646 | 2,014,283.5 | 5,326,900.1 | 5,176,334.3 | 114,900.1 |
| 4/8 | 16 | Mixed Radix MDC [22] | 12,835.1 | 77.9 | 4 | 199,398.7 | 159,774 | 4,134,073.2 | 10,846,919.1 | 9,219,826.7 | 257,558.8 |
| 4 | 16 | Proposed folded | 7227.3 | 138.4 | 16 | 331,073.2 | 245,509 | 1,983,434.4 | 5,671,832.4 | 17,585,383.0 | 168,086.9 |
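The PDP column and the quoted percentage reductions can be reproduced from the other columns of Table 15. The following Python sketch assumes PDP = worst path delay × (switching power + leakage power), as stated in the text, with delays in ps and powers in nW (so 1 ps·nW = 10⁻⁶ fJ). The function names are hypothetical, and the numbers are the radix-2 16-point rows for Parallel [7] and the proposed parallel design.

```python
def pdp_fj(worst_path_delay_ps: float, switching_nw: float, leakage_nw: float) -> float:
    """Power-delay product in fJ: ps * nW = 1e-21 J = 1e-6 fJ."""
    return worst_path_delay_ps * (switching_nw + leakage_nw) * 1e-6


def reduction_percent(existing: float, proposed: float) -> float:
    """Relative saving of the proposed design with respect to the existing one."""
    return 100.0 * (existing - proposed) / existing


# Radix-2, 16-point rows of Table 15.
pdp_existing = pdp_fj(15067.7, 37_252_446.3, 19_903_330.7)
pdp_proposed = pdp_fj(5422.0, 6_093_506.4, 8_353_796.4)

area_saving = reduction_percent(410_638.9, 177_847.0)
pdp_saving = reduction_percent(pdp_existing, pdp_proposed)

print(f"PDP existing: {pdp_existing:.1f} fJ, proposed: {pdp_proposed:.1f} fJ")
print(f"area reduction: {area_saving:.1f}%, PDP reduction: {pdp_saving:.1f}%")
```

Under these assumptions the computed PDPs agree with the tabulated 861,206.2 fJ and 78,333.3 fJ to within rounding, and the area and PDP reductions come out at about 56.7% and 90.9%.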



Fig. 9. Layout chip diagram for proposed radix-2 16-point folded DIF FFT with core area of 232,420.75 μm², die space around core of 60 μm, and total chip area of 293,872.75 μm² using 45-nm technology.

Fig. 10. Layout chip diagram for proposed radix-2 16-point parallel DIF FFT with core area of 204,524.05 μm², die space around core of 60 μm, and total chip area of 262,364.05 μm² using 45-nm technology.

architectures achieve 65.5%, 51.1%, and 35.8% worst path delay reduction compared with the existing SDF based 16-point radix-2, radix-$2^2$, and radix-4 DIF FFTs, respectively.

References

[1] Steven W. Smith, The Scientist and Engineers Guide to Digital Signal Processing, California Technical Publishing, Poway, California, USA, 1997, pp. 551–566.
[2] L. Liu, J. Ren, X. Wang, F. Ye, Design of low-power, 1GS/s throughput FFT processor for MIMO-OFDM UWB communication system, in: Proceedings of IEEE International Symposium on Circuits System, 2007, pp. 210–213.
[3] Eun Ji Kim, Myung Hoon Sunwoo, High speed eight-parallel mixed-radix FFT Processor for OFDM systems, in: IEEE International Symposium on Circuits and Systems (ISCAS), 2011, pp. 1684–1687.
[4] J. Lee, H. Lee, S. in Cho, S.-S. Choi, A high-speed, low-complexity radix-$2^4$ FFT processor for MB-OFDM UWB systems, in: Proceedings of IEEE International Symposium on Circuits Systems, 2010, pp. 2594–2597.
[5] Sohaib Ahmed Khan, Digital Design of Signal Processing Systems, A Practical Approach, first edition, Wiley Publications, West Sussex, UK, 2011, pp. 343–378.
[6] Manohar Ayinalai, Michael Brown, Keshab K. Parhi, Pipelined parallel FFT architectures via folding transformation, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 20 (6) (2012) 1068–1081.
[7] Alan V. Oppenheim, Ronald W. Schafer, John R. Buck, Discrete Time Signal Processing, Prentice Hall Publishers, New Jersey, USA, 1999, pp. 629–691.
[8] Nuo Li, N.P. van der Meijs, A radix-$2^2$ based parallel pipeline FFT processor for MB-OFDM UWB system, in: IEEE International SOC Conference, 2009, pp. 383–386.
[9] Software Optimization of FFTs and IFFTs Using the SC3850 Core, Free Scale Semiconductor Application Note, Document Number: AN3666, 2010, pp. 1–52.
[10] Zhong Cui-xiang, Han Guo-qiang, Huang Ming-he, Some new parallel fast Fourier transform algorithms, in: IEEE International Conference on Parallel and Distributed Computing, Applications and Technologies, 2005, pp. 624–628.
[11] E.H. Wold, A.M. Despain, Pipeline and parallel-pipeline FFT processors for VLSI implementation, IEEE Trans. Comput. C-33 (5) (1984) 414–426.
[12] Shousheng He, Mats Torkelson, A new approach to pipeline FFT processor, in: IEEE International Parallel Processing Symposium, 1996, pp. 766–770.
[14] Yazan Samir Algnabi, Furat A. Aldaamee, Rozita Teymourzadeh, Masuri Othman, Md Shabiul Islam, Novel architecture of pipeline Radix-$2^2$ SDF FFT based on digit-slicing technique, in: IEEE International Conference on Semiconductor Electronics (ICSE), 2012, pp. 470–474.
[15] A.M. Despain, Fourier transform computer using CORDIC iterations, IEEE Trans. Comput. C-23 (10) (1974) 993–1001.
[16] Tzi-Dar Chiueh, Pei-Yun Tsai, OFDM Baseband Receiver Design for Wireless Communications, John Wiley and Sons (Asia) Pte Ltd, Clementi Loop, Singapore, 2007, pp. 195–232.
[17] Manohar Ayinalai, Low-power architectures for signal processing and classification systems (A dissertation Ph.D.), submitted to the Faculty of the Graduate School of the University of Minnesota, Minneapolis, United States, 2012, pp. 10–43.
[19] Marwan A. Jaber, Daniel Massicotte, A new FFT concept for efficient VLSI implementation: part II – parallel pipelined processing, in: IEEE International Conference on Digital Signal Processing, 2009, pp. 1–5.
[20] Zeke Wang, Xue Liu, Bingsheng He, Feng Yu, A combined SDC-SDF architecture for normal I/O pipelined radix-2 FFT, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23 (5) (2015) 973–977.
[21] Seung-Won Yang, Jong-Yeol Lee, Constant twiddle factor multiplier sharing in multipath delay feedback parallel pipelined FFT processors, IEEE Electron. Lett. 50 (15) (2014) 1050–1052.
[22] Kai-Jiun Yang, Shang-Ho Tsai, Gene C.H. Chuang, MDC FFT/IFFT processor with variable length for MIMO-OFDM systems, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (4) (2013) 720–731.
[23] P. Milder, F. Franchetti, J.C. Hoe, M. Puschel, Computer generation of hardware for linear digital signal processing transforms, ACM Trans. Des. Autom. Electron. Syst. 17 (2) (2012) 15:1–33.
[24] Emmanuel Casseau, Bertrand Le Gal, Design of multi-mode application-specific cores based on high-level synthesis, Integr. VLSI J. 45 (2012) 9–21.
[25] Padma Prasad Boopal, Mario Garrido, Oscar Gustafsson, A reconfigurable FFT architecture for variable-length and multi-streaming OFDM standards, in: IEEE International Symposium on Circuits and Systems (ISCAS), 2013, pp. 2066–2070.
[26] Cadence, 〈http://www.cadence.com/Alliances/pages/tsmcrequest.aspx〉.
[27] L. Xu, VLSI Circuit Design Methodology Demystified: A Conceptual Taxonomy, first edition, Wiley-IEEE Press, New Jersey, USA, 2007, pp. 42–69.
[29] Ricardo Gonzalez, Benjamin M. Gordon, Mark A. Horowitz, Supply and threshold voltage scaling for low power CMOS, IEEE J. Solid State Circuits 32 (8) (1997) 1210–1216.

Mohamed Asan Basiri M received B.E. (Electronics and Communication Engg.) and M.E. (Embedded Systems) from Anna University, Tamil Nadu, India in 2009 and 2011, respectively. Currently, he is working towards a Ph.D. in the department of computer science and engineering, Indian Institute of Information Technology Design and Manufacturing (IIITDM) Kancheepuram, Chennai, Tamil Nadu, India. His research interests include VLSI for signal processing and reconfigurable systems.

Noor Mahammad Sk obtained his Ph.D. from Indian Institute of Technology Madras, and he is currently working as an Assistant Professor in the department of computer science and engineering, Indian Institute of Information Technology Design and Manufacturing (IIITDM) Kancheepuram, Chennai, India. His research interests include software for VLSI, network on chip, open flow networking, software defined radio, and evolvable hardware.
