INTEGRATION, the VLSI journal 55 (2016) 43–56
INTEGRATION, the VLSI journal journal homepage: www.elsevier.com/locate/vlsi
Multi-mode parallel and folded VLSI architectures for 1D-fast Fourier transform
Mohamed Asan Basiri M*, Noor Mahammad Sk*
Indian Institute of Information Technology Design and Manufacturing Kancheepuram, Chennai, Tamil Nadu, India
Article info
Article history: Received 21 August 2014; received in revised form 9 July 2015; accepted 22 February 2016; available online 5 March 2016
Keywords: DFT; DSP processor; FFT; Single path delay feedback; Multi path delay commutator

Abstract
Modern real-time applications such as orthogonal frequency division multiplexing demand high-performance fast Fourier transform (FFT) designs with small area and few clock cycles. This paper proposes efficient FFT VLSI architectures using folded/parallel implementation. In the proposed folded FFT architecture, the number of cycles required to complete the operation is less than that of the single path delay feedback (SDF)/multi-path delay commutator (MDC) architectures. In the proposed parallel FFT architecture, the N-point FFT is implemented by using one N/2-point FFT without much extra hardware. Both proposed architectures are implemented for radix-2, radix-2², and radix-4 using a 45 nm technology library. The proposed parallel architecture achieves area reductions of 56.7% and 40.6% compared with the existing parallel architecture based 16-point radix-2 and radix-2² DIF FFTs respectively. The proposed folded architecture achieves worst path delay reductions of 65.5%, 51.1%, and 35.8% compared with the existing SDF based 16-point radix-2, radix-2², and radix-4 DIF FFTs respectively. © 2016 Elsevier B.V. All rights reserved.
1. Introduction

The FFT [1] is the most popular algorithm used in digital signal processing applications such as orthogonal frequency division multiplexing (OFDM) [2,3] and ultra wide band (UWB) [4]. In general, VLSI architectures for digital signal processing applications are classified into two categories: (1) parallel form and (2) folded form [5]. The main difference between the parallel and folded architectures is the number of clock cycles and the area. A parallel architecture occupies more area than a folded one, whereas a folded architecture requires more clock cycles than a parallel one. Therefore, a parallel architecture suits applications where time optimization (high throughput) is the primary goal (for example, supercomputers). Similarly, a folded architecture suits applications where area optimization is the primary goal (for example, handheld devices). Parallel/folded architectures of this kind are commonly used in digital filter design and discrete transforms. The radix-2, radix-2², and radix-4 parallel FFT architectures are explained in [7–9], respectively. The folded FFT architectures are further classified into single path delay feedback (SDF), single path delay commutator (SDC), multi-path delay feedback (MDF), and multi-path delay commutator (MDC). The objective of the FFT is to reduce the time complexity of the discrete Fourier transform (DFT) [10].
* Corresponding authors. E-mail addresses: [email protected] (M. Asan Basiri M), [email protected] (N. Mahammad Sk).
http://dx.doi.org/10.1016/j.vlsi.2016.02.007
0167-9260/© 2016 Elsevier B.V. All rights reserved.
Eq. (1) shows the basic operation of DFT.

\[ X[r] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi nr/N}, \quad r = 0, 1, \ldots, N-1 \tag{1} \]

\[ X[2r] = \sum_{n=0}^{N/2-1} \left( x[n] + x[n+N/2] \right) W_{N/2}^{r}, \tag{2} \]

where $r = 0, 1, \ldots, (N/2)-1$ and $W_{N/2}^{r} = e^{-j2\pi nr/(N/2)}$.

\[ X[2r+1] = \sum_{n=0}^{N/2-1} \left( x[n] - x[n+N/2] \right) e^{-j2\pi n/N}\, W_{N/2}^{r}, \tag{3} \]

where $r = 0, 1, \ldots, (N/2)-1$.

Eqs. (2) and (3) show the decompositions in the radix-2 decimation in frequency (DIF) FFT. The radix-2 16-point single path delay feedback (SDF) [11] DIF FFT architecture is shown in Fig. 1(a). Here, the Butterfly units are represented as BF1, BF2, BF3, and BF4. Each Butterfly output is multiplied by the corresponding twiddle factor from the multiplexer. The twiddle factor is represented as $W_N^k$, which is equal to $e^{-j2\pi nk/N}$. The twiddle factors are selected by the select lines se1, se2, and se3. Throughout this paper, registers are represented as shaded square boxes. Similarly, the notation $z^{-N_1}$ is used to represent $N_1$ cascaded hardware registers, which are used for delay. During each clock cycle, only one set of inputs can be processed by every Butterfly unit. Eqs. (4) and (5) show the variables used to derive the radix-2²
Fig. 1. 16-point single path delay feedback DIF FFT using (a) radix-2, (b) radix-2², and (c) radix-4 implementation.
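FFT algorithm, which is shown in Eqs. (6)–(10).

\[ n = \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3 \right\rangle_N \tag{4} \]

\[ k = \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N \tag{5} \]

\[ X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{(N/4)-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left( \frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3 \right) W_N^{nk} \tag{6} \]

\[ W_N^{nk} = (-j)^{n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3} \tag{7} \]

\[ X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{(N/4)-1} \left[ H_{N/4}(n_3, k_1, k_2)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3} \tag{8} \]

\[ H_{N/4}(n_3, k_1, k_2) = B_{N/2}(n_3, k_1) + (-j)^{(k_1 + 2k_2)}\, B_{N/2}\!\left( n_3 + \frac{N}{4}, k_1 \right) \tag{9} \]

\[ B_{N/2}(n_3, k_1) = x(n_3) + (-1)^{k_1}\, x\!\left( n_3 + \frac{N}{2} \right) \tag{10} \]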
The radix-2² 16-point single path delay feedback [12,14] DIF FFT is shown in Fig. 1(b). Here, the second Butterfly unit (BF2) contains two complex number multipliers. The outputs from the first and third Butterfly units are multiplied by $(-j)$. The radix-2² 16-point SDF FFT operates in the same way as the architecture in Fig. 1(a). Eqs. (11)–(17) show the radix-4 DIF FFT algorithm, where $y(n)$, $y(n+N/4)$, $y(n+2N/4)$, and $y(n+3N/4)$ correspond to the radix-4 $N/4$-point DIF FFTs of $X(4k)$, $X(4k+1)$, $X(4k+2)$, and $X(4k+3)$ respectively.
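The radix-2 DIF decomposition of Eqs. (2) and (3) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the helper names `dft` and `dif_split` and the test vector are illustrative assumptions. It computes the even-indexed outputs from the sums $x[n] + x[n+N/2]$ and the odd-indexed outputs from the twiddled differences, then compares them with the direct DFT of Eq. (1).

```python
import cmath

def dft(x):
    """Direct DFT, Eq. (1): X[r] = sum_n x[n] * exp(-j*2*pi*n*r/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / N) for n in range(N))
            for r in range(N)]

def dif_split(x):
    """One radix-2 DIF step: even bins via Eq. (2), odd bins via Eq. (3)."""
    N = len(x)
    half = N // 2
    # Butterfly sums feed the N/2-point DFT that yields X[2r].
    sums = [x[n] + x[n + half] for n in range(half)]
    # Butterfly differences are scaled by exp(-j*2*pi*n/N) before the
    # N/2-point DFT that yields X[2r+1].
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]
    X = [0j] * N
    X[0::2] = dft(sums)
    X[1::2] = dft(diffs)
    return X

# Hypothetical 16-point test vector (any complex sequence would do).
x = [complex(n % 5, -n) for n in range(16)]
max_err = max(abs(a - b) for a, b in zip(dft(x), dif_split(x)))
print(max_err < 1e-9)
```

Applying the same split recursively to the two half-length sequences gives the full $\log_2 N$-stage radix-2 DIF FFT.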
Fig. 2. Radix-2 16-point DIF FFT (a) existing parallel (b) proposed folded architectures.
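\[ X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk} = \sum_{n=0}^{\frac{N}{4}-1} x[n] W_N^{nk} + \sum_{n=\frac{N}{4}}^{\frac{2N}{4}-1} x[n] W_N^{nk} + \sum_{n=\frac{2N}{4}}^{\frac{3N}{4}-1} x[n] W_N^{nk} + \sum_{n=\frac{3N}{4}}^{N-1} x[n] W_N^{nk} \tag{11} \]

\[ X[k] = \sum_{n=0}^{\frac{N}{4}-1} \left[ x(n) + W_N^{\frac{Nk}{4}}\, x\!\left(n+\frac{N}{4}\right) + W_N^{\frac{2Nk}{4}}\, x\!\left(n+\frac{2N}{4}\right) + W_N^{\frac{3Nk}{4}}\, x\!\left(n+\frac{3N}{4}\right) \right] W_N^{nk} \tag{12} \]

\[ X[k] = \sum_{n=0}^{\frac{N}{4}-1} \left[ x(n) + (-j)^k\, x\!\left(n+\frac{N}{4}\right) + (-1)^k\, x\!\left(n+\frac{2N}{4}\right) + (j)^k\, x\!\left(n+\frac{3N}{4}\right) \right] W_N^{nk} \tag{13} \]

\[ y(n) = \left[ x(n) + x\!\left(n+\frac{N}{4}\right) + x\!\left(n+\frac{2N}{4}\right) + x\!\left(n+\frac{3N}{4}\right) \right] W_N^{0} \tag{14} \]

\[ y\!\left(n+\frac{N}{4}\right) = \left[ x(n) - j\,x\!\left(n+\frac{N}{4}\right) - x\!\left(n+\frac{2N}{4}\right) + j\,x\!\left(n+\frac{3N}{4}\right) \right] W_N^{n} \tag{15} \]

\[ y\!\left(n+\frac{2N}{4}\right) = \left[ x(n) - x\!\left(n+\frac{N}{4}\right) + x\!\left(n+\frac{2N}{4}\right) - x\!\left(n+\frac{3N}{4}\right) \right] W_N^{2n} \tag{16} \]

\[ y\!\left(n+\frac{3N}{4}\right) = \left[ x(n) + j\,x\!\left(n+\frac{N}{4}\right) - x\!\left(n+\frac{2N}{4}\right) - j\,x\!\left(n+\frac{3N}{4}\right) \right] W_N^{3n} \tag{17} \]

Table 1. Operation of proposed folded N-point radix-2 DIF FFT (n value for each Butterfly unit).

| Cycle | m | BF 1 | BF 2 | … | BF N/4−1 | BF N/4 | BF N/4+1 | BF N/4+2 | … | BF N/2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | N/2 | 0 | 1 | … | N/4−2 | N/4−1 | N/4 | N/4+1 | … | N/2−1 |
| 2 | N/4 | 0 | 1 | … | N/4−2 | N/4−1 | 2N/4 | 2N/4+1 | … | 2N/4+N/4−1 |
| ⋮ | ⋮ | | | | | | | | | |
| log₂ N | 1 | 0 | 2 | … | N/2−4 | N/2−2 | N/2 | N/2+2 | … | N−2 |

The radix-4 16-point SDF [15] architecture is shown in Fig. 1(c). Here, only one complex number multiplier is used. The corresponding twiddle factors are selected by Sel2. During each cycle, 4 outputs can be produced from 4 inputs. Radix-2 and radix-4 multi-path delay commutator (MDC) architectures are explained in [16]. The radix-2² MDC architecture is shown in [17]. In Fig. 1, BF refers to the Butterfly unit. The twiddle factor for each cycle is selected through the multiplexer with the appropriate select line. Throughout this paper, the multiplication unit is a complex number multiplier, which consists of 4 real number multipliers and 2 real number signed adders. The addition unit is a complex number adder, which consists of 2 real number signed adders. In an FFT using the SDF/MDC architecture, each Butterfly unit can take only one set of inputs during each cycle. So, it will take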
more cycles to complete the whole operation. At the same time, this architecture needs few Butterfly units, and hence its area/power requirement is less than that of the parallel architecture.

1.1. Contribution of this paper

This paper proposes an efficient folded FFT architecture with fewer clock cycles, which is compared with the existing SDF/MDC architectures. This paper also proposes an area-efficient parallel architecture for FFT, where the required N-point FFT is implemented by using one N/2-point FFT parallel architecture without much additional hardware. The proposed folded/parallel FFT architectures are implemented for radix-2, radix-2², and radix-4. The number of clock cycles and the number of BF units are compared between the existing and proposed FFT architectures. The proposed folded FFT architecture is designed to optimize the number of cycles (at the cost of the number of BF units), and the proposed parallel FFT architecture is designed to optimize the number of BFs (at the cost of the number of cycles). The rest of the paper is organized as follows: Section 2 presents the proposed multi-mode folded FFT architecture. Section 3 presents the proposed multi-mode parallel FFT architecture. The time/hardware analysis of the proposed multi-mode DIF FFT is discussed in Section 4. Design modeling, implementation, and results are discussed in Section 5, followed by a conclusion in Section 6.
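As a side check on the radix-4 DIF equations of this section, the sketch below applies the radix-4 butterfly of Eqs. (14)–(17) once and confirms that the four N/4-point DFTs of the butterfly outputs give X(4k), X(4k+1), X(4k+2), and X(4k+3). It is an illustrative Python model under the assumption that N is a multiple of 4; the function names and test data are hypothetical and not taken from the paper.

```python
import cmath

def dft(x):
    """Direct DFT used as the reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / N) for n in range(N))
            for r in range(N)]

def radix4_stage(x):
    """One radix-4 DIF stage: butterfly outputs y per Eqs. (14)-(17)."""
    N = len(x)
    q = N // 4

    def W(e):
        # Twiddle factor W_N^e = exp(-j*2*pi*e/N).
        return cmath.exp(-2j * cmath.pi * e / N)

    y = [0j] * N
    for n in range(q):
        a, b, c, d = x[n], x[n + q], x[n + 2 * q], x[n + 3 * q]
        y[n] = (a + b + c + d) * W(0)                      # Eq. (14)
        y[n + q] = (a - 1j * b - c + 1j * d) * W(n)        # Eq. (15)
        y[n + 2 * q] = (a - b + c - d) * W(2 * n)          # Eq. (16)
        y[n + 3 * q] = (a + 1j * b - c - 1j * d) * W(3 * n)  # Eq. (17)
    return y

# Hypothetical 16-point input.
x = [complex((7 * n) % 11, n % 3) for n in range(16)]
X = dft(x)
y = radix4_stage(x)
q = len(x) // 4
radix4_err = 0.0
for p in range(4):
    sub = dft(y[p * q:(p + 1) * q])   # N/4-point DFT of the p-th output block
    for k in range(q):
        radix4_err = max(radix4_err, abs(sub[k] - X[4 * k + p]))
print(radix4_err < 1e-9)
```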
Fig. 3. Proposed 16-point (a) radix-2/2² multi-mode folded DIF FFT, (b) radix-2 Butterfly unit, and (c) radix-2² Butterfly unit.
2. The proposed multi-mode folded DIF FFT architectures

The architecture for the proposed folded radix-2 16-point DIF FFT is derived from the existing parallel implementation; both are shown in Fig. 2(a) and (b). The existing parallel 16-point DIF FFT architecture is shown in Fig. 2(a), where $x[n]$ is the input. Here, the total number of stages is four, each with 8 Butterfly units (BFs). The first, second, and third stage outputs are represented as $y_1[n]$, $y_2[n]$, and $y_3[n]$ respectively. The value of $n$ and the twiddle factor vary for each BF. The proposed radix-2 folded architecture follows Eqs. (2) and (3). In Fig. 2(b), only one stage of the radix-2 DIF FFT is taken, which includes eight BFs. Therefore, the operation of every stage of the 16-point parallel architecture can be done using only one stage of BFs in 4 cycles. The output from the previous cycle is sent as the input to the next cycle. During the first cycle, $x[n]$ and $x[n+m]$ are selected using the multiplexer (MUX). From the second cycle onwards, $y[n]$ and $y[n+m]$ from the previous cycle are selected using the MUX. The horizontal dark line shows the pipeline registers.

In general, the radix-2 N-point DIF FFT parallel architecture consists of $\log_2 N$ stages, each with $N/2$ BFs. The inputs for the first stage are $x[n]$ and $x[n+N/2]$, where $n$ takes the values $0, 1, 2, \ldots, N/2-1$ for BFs $1, 2, \ldots, N/2$ respectively. The outputs from the first stage are $y_1[n] = x[n] + x[n+N/2]$ and $y_1[n+N/2] = (x[n] - x[n+N/2])\,W_N^k$, where $k$ takes the values $0, 1, 2, \ldots, N/2-1$ for BFs $1, 2, \ldots, N/2$ respectively. The inputs for the second stage are $y_1[n]$ and $y_1[n+N/4]$, where $n$ takes the values $0, 1, 2, \ldots, N/4-1$ for BFs $1, 2, \ldots, N/4$ respectively and $2N/4, 2N/4+1, 2N/4+2, \ldots, 2N/4+N/4-1$ for BFs $N/4+1, N/4+2, \ldots, N/2$ respectively. The outputs from the second stage are $y_2[n] = y_1[n] + y_1[n+N/4]$ and $y_2[n+N/4] = (y_1[n] - y_1[n+N/4])\,W_{N/2}^k$, where $k$ takes the values $0, 1, 2, \ldots, N/4-1$ for BFs $1, 2, \ldots, N/4$ respectively and again $0, 1, 2, \ldots, N/4-1$ for BFs $N/4+1, N/4+2, \ldots, N/2$ respectively. Similarly, the inputs for the last stage $(\log_2 N)$ are $y_{\log_2 N-1}[n]$ and $y_{\log_2 N-1}[n+1]$, where $n$ takes the values $0, 2, 4, \ldots, N-2$ for BFs $1, 2, 3, \ldots, N/2$ respectively. The outputs from the last stage are $y_{\log_2 N}[n] = y_{\log_2 N-1}[n] + y_{\log_2 N-1}[n+1]$ and $y_{\log_2 N}[n+1] = (y_{\log_2 N-1}[n] - y_{\log_2 N-1}[n+1])\,W_2^k$, where $k = 0$. Table 1 shows the operation of the proposed folded N-point radix-2 DIF FFT.

Theorem 1. The radix-2 N-point DIF FFT parallel architecture ($\log_2 N$ stages) operation can be completed using $\log_2 N$ cycles of the single-stage proposed N-point DIF FFT folded architecture, with a trade-off in the number of cycles.

Proof. The radix-2 16-point DIF FFT existing parallel architecture consists of four stages, each with eight BFs. The 16 inputs are given to the first stage, and each of the other stages receives its inputs from the previous stage. According to the hardware reuse strategy [23], the same hardware is used for further processing in a time-multiplexed manner. During the first cycle, the 16 inputs are given to the single-stage 16-point DIF FFT proposed folded architecture. During clock cycles 2, 3, and 4, the output from the previous cycle is given to the same single-stage 16-point DIF FFT proposed folded architecture. Therefore, the N-point DIF FFT existing parallel architecture ($\log_2 N$ stages) operation can be completed with the single-stage N-point DIF FFT proposed folded architecture in $\log_2 N$ cycles by using the hardware reuse [23] strategy. □

Fig. 3(a) shows the proposed 16-point radix-2/2² multi-mode folded DIF FFT architecture with a 1-stage pipeline, where one 16-point, two 8-point, four 4-point, or eight 2-point FFTs can be performed. Throughout this paper, input signal values of the FFT unit are represented as i with the corresponding index value, i.e. i0, i1, i2, …. Similarly, the outputs of the FFT are represented as o with the corresponding index value, i.e. o0, o1, o2, …. In general, a 16-point parallel FFT has 4 stages, each with 8 Butterfly units. In the proposed folded architecture, only one stage with 8 Butterfly units is used iteratively (4 iterations). So, the area of the proposed folded architecture is less than that of the existing parallel architecture. Here, the number of cycles used to complete the whole operation is greater than in the parallel architecture.

Fig. 3(b) and (c) shows the proposed Butterfly units for the 16-point radix-2 and radix-2² DIF FFT respectively. Here, the multiplexers are named U1, U2, U3, M1, M2, M3, M4, M5, and M6. Each multiplexer unit has dedicated twiddle factors as inputs. Tables 2 and 3 show the corresponding twiddle factors used in the BF units of the 16-point radix-2² and radix-2 DIF FFT respectively. In Figs. 3 and 4, vertical dark lines represent pipelining. Here, the output from the previous cycle is sent to the input if a 16/8/4-point FFT operation is performed. In the case of the 2-point FFT, the output is obtained after the first cycle. Table 4 shows the operation of the 16-point radix-2/2² DIF FFT. In all the tables, X represents the do not care condition.

Table 2. Butterfly units with twiddle factors for 16-point radix-2² proposed folded DIF FFT (inputs of multiplexers M1–M6 for BF1–BF8, indexed by the select lines).

Table 3. Butterfly units with twiddle factors for proposed multi-mode folded 16-point radix-2 DIF FFT (inputs of multiplexer U1).

| BF unit | m3 | m2 | m1 | m0 |
|---|---|---|---|---|
| BF1 | $W_{16}^{0}$ | $W_{8}^{0}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF2 | $W_{16}^{1}$ | $W_{8}^{1}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |
| BF3 | $W_{16}^{2}$ | $W_{8}^{2}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF4 | $W_{16}^{3}$ | $W_{8}^{3}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |
| BF5 | $W_{16}^{4}$ | $W_{8}^{0}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF6 | $W_{16}^{5}$ | $W_{8}^{1}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |
| BF7 | $W_{16}^{6}$ | $W_{8}^{2}$ | $W_{4}^{0}$ | $W_{2}^{0}$ |
| BF8 | $W_{16}^{7}$ | $W_{8}^{3}$ | $W_{4}^{1}$ | $W_{2}^{0}$ |

Table 4. Operation of proposed multi-mode folded radix-2/2² DIF FFT.

| Sel | Cycle | s1 | s2 | s3 | se | Operation |
|---|---|---|---|---|---|---|
| 0 | 0 | X | X | X | 0 | 2-point DIF FFT |
| 1 | 0 | 0 | X | X | 1 | 4-point DIF FFT |
| 1 | 1 | 1 | X | X | 0 | 4-point DIF FFT |
| 2 | 0 | X | 0 | X | 2 | 8-point DIF FFT |
| 2 | 1 | X | 1 | X | 1 | 8-point DIF FFT |
| 2 | 2 | X | 2 | X | 0 | 8-point DIF FFT |
| 3 | 0 | X | X | 0 | 3 | 16-point DIF FFT |
| 3 | 1 | X | X | 1 | 2 | 16-point DIF FFT |
| 3 | 2 | X | X | 2 | 1 | 16-point DIF FFT |
| 3 | 3 | X | X | 3 | 0 | 16-point DIF FFT |

Fig. 4(a) and (b) shows the proposed 16-point radix-4 multi-mode folded DIF FFT with a one-stage pipeline and its Butterfly unit respectively. Here, only one clock cycle is required for four 4-point radix-4 DIF FFT operations.
In 16-point DIF FFT mode, the outputs from the first cycle of BF1, BF2, BF3, and BF4 are sent as inputs to the same Butterflies. Table 5 shows the BF units with twiddle factors for the proposed folded 16-point radix-4 DIF FFT. Table 6 shows the operation of the proposed folded 16-point radix-4 DIF FFT.

Fig. 4. Proposed 16-point radix-4 (a) folded DIF FFT and (b) Butterfly unit.

Table 5. Butterfly units with twiddle factors for proposed multi-mode folded 16-point radix-4 DIF FFT.

| BF unit | m1 | m2 | m3 |
|---|---|---|---|
| BF1 | 1 | 1 | 1 |
| BF2 | $W_{16}^{1}$ | $W_{16}^{2}$ | $W_{16}^{3}$ |
| BF3 | $W_{16}^{2}$ | $W_{16}^{4}$ | $W_{16}^{6}$ |
| BF4 | $W_{16}^{3}$ | $W_{16}^{6}$ | $W_{16}^{9}$ |

Table 6. Operation of proposed multi-mode folded 16-point radix-4 DIF FFT.

| Sel | Cycle | s | Operation |
|---|---|---|---|
| 0 | 0 | 0 | 16-point DIF FFT |
| 1 | 1 | 1 | 16-point DIF FFT |
| 2 | 0 | 1 | 4-point DIF FFT |

The number of BFs in the proposed folded DIF FFTs is greater than in the SDF and MDC architectures, but the number of cycles for the proposed folded DIF FFTs is less than for the SDF and MDC architectures. Similarly, the worst path delay of the proposed folded DIF FFTs is less than that of SDF and MDC, because the critical path of the proposed folded architecture consists of only one BF, whereas the critical path of the radix-2 and radix-2² N-point DIF FFT SDF and MDC architectures consists of $\log_2 N$ BFs. In the case of the radix-4 N-point SDF and MDC, the critical path consists of $\log_4 N$ Butterfly units.
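The hardware reuse idea behind Theorem 1 can be illustrated with a small behavioural model. The Python sketch below is an assumption-laden illustration, not the Verilog design of the paper: it reuses one bank of N/2 radix-2 butterflies for $\log_2 N$ cycles, with the spacing m halving each cycle (N/2, N/4, …, 1) and the per-BF index n following the pattern of Table 1. The function names, the bit-reversal helper, and the test vector are hypothetical. The result appears in bit-reversed order, as is usual for an in-place DIF FFT with natural-order input.

```python
import cmath

def dft(x):
    """Direct DFT used as the reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / N) for n in range(N))
            for r in range(N)]

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def folded_dif_fft(x):
    """Reuse one stage of N/2 butterflies for log2(N) cycles (N a power of two)."""
    N = len(x)
    y = list(x)
    m = N // 2          # butterfly spacing for the current cycle
    cycles = 0
    while m >= 1:
        nxt = [0j] * N
        for bf in range(N // 2):
            group, offset = divmod(bf, m)
            n = group * 2 * m + offset                     # index pattern of Table 1
            w = cmath.exp(-2j * cmath.pi * offset / (2 * m))  # twiddle W_{2m}^{offset}
            nxt[n] = y[n] + y[n + m]
            nxt[n + m] = (y[n] - y[n + m]) * w
        y = nxt          # output of this cycle is fed back as the next input
        m //= 2
        cycles += 1
    return y, cycles

x = [complex(n, (3 * n) % 7) for n in range(16)]
y, cycles = folded_dif_fft(x)
X = dft(x)
folded_err = max(abs(y[i] - X[bit_reverse(i, 4)]) for i in range(16))
print(cycles, folded_err < 1e-9)
```

For N = 16 the twiddle exponents generated per cycle match the columns m3, m2, m1, and m0 of Table 3.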
3. The proposed multi-mode parallel DIF FFT architectures

In the proposed multi-mode parallel DIF FFT, the $N/2$-point parallel architecture is used to find the N-point DIF FFT, where the number of cycles is the trade-off. Fig. 5(a) shows the operation of the radix-2 8-point DIF FFT existing parallel architecture, where 3 stages are used, each with four BFs. The inputs are represented as $x[n]$. The output from the first stage, $y_1[n]$, is given to the second stage, where $y_2[n]$ is produced. In the third stage, $y_2[n]$ is used as the input. The proposed 16-point parallel DIF FFT operation is shown in Fig. 5(b). The radix-2 16-point parallel DIF FFT requires 4 stages, each with 8 BFs. Therefore, stage-1 of the radix-2 16-point parallel DIF FFT is completed in 2 cycles (cycles 0 and 1) using stage-1 of the radix-2 8-point DIF FFT parallel architecture. Similarly, stage-2 of the 16-point parallel DIF FFT is completed in another 2 cycles (cycles 2 and 3) using stage-1 of the radix-2 8-point DIF FFT parallel architecture. Stage-3 of the 16-point parallel DIF FFT is completed during cycles 3 and 4 using stage-2 of the radix-2 8-point DIF FFT parallel architecture. The final stage of the 16-point parallel DIF FFT is completed during cycles 4 and 5 using stage-3 of the radix-2 8-point DIF FFT parallel architecture.

Fig. 5. Radix-2 (a) 8-point existing parallel and (b) 16-point proposed parallel DIF FFT architectures.

Theorem 2. The radix-2 $N/2$-point DIF FFT parallel architecture can be used to find the N-point DIF FFT with a trade-off in the number of cycles.

Proof. The number of stages in the radix-2 8-point and 16-point DIF FFT existing parallel architectures is 3 and 4 respectively, with 4 and 8 BFs per stage respectively. Therefore, the 16-point DIF FFT parallel architecture consists of eight 8-point DIF FFT parallel architecture stages, each with four BFs. In general, the N-point DIF FFT parallel architecture consists of $2\log_2 N$ stages of the $N/2$-point DIF FFT parallel architecture, each with $N/4$ BFs. Therefore, the operation of the radix-2 N-point DIF FFT parallel architecture can be completed in at most $2\log_2 N$ cycles by using the $N/2$-point DIF FFT parallel architecture with the strategy of hardware reuse. □

Fig. 6(a) shows the proposed multi-mode parallel architecture for the 16-point radix-2 DIF FFT using the 8-point architecture with a 3-stage pipeline. Fig. 6(b)–(d) shows the corresponding BF units. Table 7 shows the Butterfly units with the corresponding twiddle factors for the proposed multi-mode parallel 16-point radix-2 DIF FFT. The operation of this proposed radix-2 16-point parallel architecture during clock cycles 0, 1, 2, and 3 is explained in Table 8. In 16-point DIF FFT mode, the first and second sets of 8 inputs are accepted by the architecture during clock cycles 0 and 1 respectively. During clock cycles 2 and 3, the register outputs are sent back as the inputs. The outputs from the registers R10, R11, R12, R13, R14, R15, R16, and R17 are sent to the second stage BF units BF5, BF6, BF7, and BF8 during the 4th clock cycle. Similarly, the outputs from the second stage Butterfly units BF5, BF6, BF7, and BF8 are sent to the third stage BF units BF9, BF10, BF11, and BF12 during the 5th clock cycle. Hence, the first 8 outputs of the 16-point DIF operation are obtained after the 5th clock cycle and the second 8 outputs are obtained after the 6th clock cycle. So, the proposed parallel radix-2 architecture requires 7 cycles to complete the 16-point DIF FFT and 3 clock cycles to complete the 8-point DIF FFT operation.

Fig. 6. Proposed 16-point radix-2 (a) parallel DIF FFT, (b) Butterfly unit used in BF1, BF2, BF3, and BF4, (c) Butterfly unit used in BF5, BF6, BF7, and BF8, and (d) Butterfly unit used in BF9, BF10, BF11, and BF12.

Table 7. Butterfly units with twiddle factors for proposed multi-mode parallel 16-point radix-2 DIF FFT.

| BF unit | m0 | m1 | m2 | BF unit | m3 |
|---|---|---|---|---|---|
| BF1 | $W_{16}^{0}$ | $W_{16}^{4}$ | $W_{8}^{0}$ | BF5 | $W_{4}^{0}$ |
| BF2 | $W_{16}^{1}$ | $W_{16}^{5}$ | $W_{8}^{1}$ | BF6 | $W_{4}^{1}$ |
| BF3 | $W_{16}^{2}$ | $W_{16}^{6}$ | $W_{8}^{2}$ | BF7 | $W_{4}^{0}$ |
| BF4 | $W_{16}^{3}$ | $W_{16}^{7}$ | $W_{8}^{3}$ | BF8 | $W_{4}^{1}$ |

Table 8. Operation of proposed multi-mode parallel radix-2 DIF FFT.

| Cycle | Sel | s | s1 | Operation |
|---|---|---|---|---|
| 0 | 0 | X | 0 | 16-point DIF FFT |
| 1 | 1 | X | 1 | 16-point DIF FFT |
| 2 | 2 | 0 | 2 | 16-point DIF FFT |
| 3 | 2 | 1 | 2 | 16-point DIF FFT |
| 0 | 3 | X | 2 | 8-point DIF FFT |

Fig. 7(a) shows the proposed multi-mode parallel architecture for the 16-point radix-2² DIF FFT using the 8-point architecture with a 3-stage pipeline. Fig. 7(b) and (c) shows the corresponding Butterfly units. Here, the outputs from the stage-1 registers are sent as inputs to stage-2. The outputs from stage-2 and from the registers of stage-3 (R50, R51, R52, R53, R54, R55, R56, and R57) are sent as the inputs to stage-3. Table 9 shows the Butterfly units with the corresponding twiddle factors for the 16-point proposed parallel radix-2² DIF FFT. The operation of this proposed parallel 16-point radix-2² DIF FFT architecture is explained in Table 10. It requires 8 cycles to complete the 16-point DIF FFT and 3 clock cycles to complete the 8-point DIF FFT operation.

Fig. 8(a) shows the proposed 64-point radix-4 parallel DIF FFT. Fig. 8(b) shows the Butterfly units used in BF1, BF2, BF3, and BF4 of the proposed radix-4 parallel 64-point DIF FFT. Fig. 8(c) shows the Butterfly units used in BF5, BF6, BF7, and BF8. Here, the outputs from the stage-1 (BF1, BF2, BF3, and BF4) registers are sent as the inputs to the stage-2 (BF5, BF6, BF7, and BF8) and stage-1 BF units. Table 11 shows the BF units with the corresponding inputs and twiddle factors for the proposed multi-mode parallel 64-point radix-4 DIF FFT, and its operation is explained in Table 12. Here, the operation of the first and second stages of the conventional parallel radix-4 64-point DIF FFT is done during clock cycles 0–7, as shown in Table 12. So, the first 16 outputs of the 64-point DIF FFT operation are produced after the 8th clock cycle, and 11 clock cycles are required to obtain all 64 outputs. In the case of the 16-point DIF FFT, 2 clock cycles are required to obtain all the results.
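The counting argument of Theorem 2 can be illustrated in software. The following Python sketch is a simplified, unpipelined model (an assumption; the proposed hardware overlaps stages and finishes the 16-point case in fewer cycles): each radix-2 DIF stage of N/2 butterflies is executed in passes of at most `bfs_per_pass` butterflies, so with N/4 butterflies available per pass the N-point transform needs $2\log_2 N$ passes. All names and the test vector are hypothetical.

```python
import cmath

def dft(x):
    """Direct DFT used as the reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / N) for n in range(N))
            for r in range(N)]

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def dif_fft_limited(x, bfs_per_pass):
    """Radix-2 DIF FFT where each pass may use at most bfs_per_pass butterflies."""
    N = len(x)
    y = list(x)
    passes = 0
    m = N // 2
    while m >= 1:
        # All N/2 butterflies of this stage as (first index, twiddle exponent).
        jobs = [((bf // m) * 2 * m + bf % m, bf % m) for bf in range(N // 2)]
        for start in range(0, len(jobs), bfs_per_pass):
            for n, offset in jobs[start:start + bfs_per_pass]:
                w = cmath.exp(-2j * cmath.pi * offset / (2 * m))
                a, b = y[n], y[n + m]
                y[n], y[n + m] = a + b, (a - b) * w
            passes += 1   # one pass through the smaller butterfly bank
        m //= 2
    return y, passes

x = [complex(n % 4, n) for n in range(16)]
y, passes = dif_fft_limited(x, bfs_per_pass=4)   # N/4 = 4 butterflies per pass
X = dft(x)
limited_err = max(abs(y[i] - X[bit_reverse(i, 4)]) for i in range(16))
print(passes, limited_err < 1e-9)
```

With `bfs_per_pass=8` the same routine needs only $\log_2 16 = 4$ passes, which is the fully parallel stage count.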
4. Time/Hardware analysis of proposed multi-mode DIF FFT architectures

Table 13 shows the comparison of the N-point proposed folded/parallel FFT with others. In the SDF/MDC based N-point radix-2/2² DIF FFT, the first set of outputs is produced after $N-1$ clock cycles, and the remaining $N/2-1$ sets of outputs are produced in every clock cycle after the $(N-1)$th clock cycle. So, it takes $(N-1) + (N/2-1) = 3N/2-2$ clock cycles to complete the whole operation. Similarly, the radix-4 N-point SDF/MDC based DIF FFT requires $\log_4 N$ clock cycles to produce the first 4 outputs and $(N/4-1)\log_4 N$ clock cycles to produce the rest of the outputs. So, it requires $\log_4 N + (N/4-1)\log_4 N = (N/4)\log_4 N$ clock cycles to complete the whole operation. The parallel radix-2/2² FFT has $\log_2 N$ stages, each with $N/2$ Butterfly units. The proposed folded radix-2/2² FFT uses one of the $\log_2 N$ stages of the existing parallel architecture. Therefore, it requires $\log_2 N$ cycles to complete its operation, as explained in Theorem 1. Similarly, the proposed folded radix-4 FFT requires $\log_4 N$ cycles to complete its operation, where one of the $\log_4 N$ stages of the existing parallel radix-4 FFT is used and each stage consists of $N/4$ Butterfly units. In the proposed radix-2/2² parallel N-point FFT, the number of Butterfly units is equal to that of the parallel radix-2 $N/2$-point FFT, i.e. $(N/4)\log_2 (N/2)$. So, it requires at most $2\log_2 N$ cycles to complete the N-point FFT operation, as explained in Theorem 2. Similarly, in the proposed radix-4 parallel N-point FFT, the number of Butterfly units is equal to that of the parallel radix-4 $N/4$-point FFT, i.e. $(N/16)\log_4 (N/4)$. So, it requires at most $4\log_4 N$ cycles to complete the N-point FFT operation. Table 14 shows the worst/best/typical case comparison of the N-point proposed folded/parallel DIF FFT with respect to the number of cycles and the mode, where mode represents the length of the DIF FFT operation.

In [20], a radix-2 SDC-SDF based folded FFT ($\log_2 N$ stages) architecture is proposed, where the Butterfly units of the first $(\log_2 N)-1$ stages are designed with 3 real number adders and 2 real number multipliers. So, the hardware cost is reduced compared with the conventional radix-2 SDF based FFT, where one complex number adder (two real number adders) and one complex number multiplier (4 real number multipliers and 2 real number adders) are used. The last stage contains one conventional SDF Butterfly unit. Therefore, the number of complex number multipliers used in [20] is half of that in the conventional radix-2 SDF based architecture, which is mentioned in Table 13. In [21], a 4-parallel MDF based FFT ($\log_2 N$ stages) folded architecture is proposed. Here, two Butterfly units are used in parallel with one common complex number multiplier. The FIFO registers used in feedback are divided into two unequal parts to share the common multiplier.
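The cycle and Butterfly-unit expressions quoted in this section can be tabulated for concrete sizes. The helper functions below are a minimal Python sketch with hypothetical names; they assume N is a power of two (a power of four for the radix-4 expressions) and simply evaluate the formulas stated in the text.

```python
def log2_int(N):
    """log2 of a power of two."""
    return N.bit_length() - 1

def log4_int(N):
    """log4 of a power of four."""
    return log2_int(N) // 2

def cycles_sdf_mdc_radix2(N):
    # (N - 1) + (N/2 - 1) = 3N/2 - 2
    return 3 * N // 2 - 2

def cycles_sdf_mdc_radix4(N):
    # log4(N) + (N/4 - 1) * log4(N) = (N/4) * log4(N)
    return (N // 4) * log4_int(N)

def cycles_folded_radix2(N):
    return log2_int(N)

def cycles_folded_radix4(N):
    return log4_int(N)

def cycles_parallel_radix2_max(N):
    # Upper bound from Theorem 2.
    return 2 * log2_int(N)

def cycles_parallel_radix4_max(N):
    return 4 * log4_int(N)

def bfs_parallel_radix2(N):
    # Butterfly units of an N/2-point parallel radix-2 FFT.
    return (N // 4) * log2_int(N // 2)

def bfs_parallel_radix4(N):
    # Butterfly units of an N/4-point parallel radix-4 FFT.
    return (N // 16) * log4_int(N // 4)

for N in (16, 64):
    print(N, cycles_sdf_mdc_radix2(N), cycles_folded_radix2(N),
          cycles_parallel_radix2_max(N), bfs_parallel_radix2(N))
```

For N = 16 the radix-2 Butterfly count evaluates to 12, and for N = 64 the radix-4 count evaluates to 8, which agree with the BF1–BF12 and BF1–BF8 units named in the captions of Figs. 6 and 8.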
5. Design modeling, implementation, and results

The proposed and existing designs are modeled in Verilog HDL. These Verilog HDL models are simulated and verified using the Xilinx ISE simulator. The timing, area, and power analysis of this FFT implementation has been done with the Cadence 6.1 ASIC design tool. All the designs are implemented in 45 nm technology [26], where the technology library tcbn45gsbwpbc088_ccs.lib is used. Here, the operating voltage is 0.88 V. The timing/area/power details are obtained directly from synthesis using Cadence (Encounter). Table 15 shows the comparison of worst path delay, total area, net power, and power delay product (PDP) or energy per operation [29] between various 1D-DIF FFT architectures.

Fig. 7. Proposed 16-point radix-2² (a) parallel DIF FFT, (b) Butterfly unit used in BF5, BF6, BF7, and BF8, and (c) Butterfly unit used in BF1, BF2, BF3, BF4, BF9, BF10, BF11, and BF12.

Table 9. Butterfly units with twiddle factors for proposed multi-mode parallel 16-point radix-2² DIF FFT.

| BF unit | m4 | m0 | m5 | m1 | m6 | m2 |
|---|---|---|---|---|---|---|
| BF1 | 1 | $W_{8}^{0}$ | 1 | $W_{16}^{0}$ | $W_{16}^{0}$ | $W_{16}^{0}$ |
| BF2 | 1 | $W_{8}^{2}$ | 1 | $W_{16}^{2}$ | $W_{16}^{1}$ | $W_{16}^{3}$ |
| BF3 | $W_{8}^{0}$ | $W_{8}^{0}$ | 1 | $W_{16}^{4}$ | $W_{16}^{2}$ | $W_{16}^{6}$ |
| BF4 | $W_{8}^{1}$ | $W_{8}^{3}$ | 1 | $W_{16}^{6}$ | $W_{16}^{3}$ | $W_{16}^{9}$ |

The hardware cost parameters [27] of any digital circuit are the development time, chip area, and power consumption. The power delay product (PDP) stands for the average energy consumed per switching event, as is apparent from its units (W·s = Joule). In Cadence (Encounter), switching and leakage power can be measured directly during synthesis. Therefore, the PDP [29] is calculated by multiplying the worst path delay by the sum of the switching and leakage power. The number of cycles used by the proposed parallel architectures is greater than that of the existing parallel architectures, as explained in Table 13. The total cycle delay is computed by multiplying the worst path delay by the number of cycles (pipeline stages). Since the worst path delay of the proposed parallel architectures is less than that of the existing architectures, the total cycle delay for a given number of tasks (N-point DIF FFTs) on the proposed parallel architectures is less than on the existing architectures. For example, suppose the system requires five 16-point radix-2 parallel DIF FFTs. With the existing parallel architecture, the total computation time is 15,067.7 × 5 ≈ 75,335 ps. With the proposed pipelined parallel architecture, the total cycle delay to produce the output of the first DIF FFT operation is 5422 × 7 = 37,954 ps, because the proposed radix-2 16-point parallel architecture requires 7 cycles, as explained in Section 3. The remaining four DIF FFT outputs are produced with an extra 5422 × 4 = 21,688 ps, which is the time needed to empty the pipeline. Therefore, the total delay required to produce five 16-point radix-2 DIF FFT operations is 37,954 + 21,688 = 59,642 ps.
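The delay comparison in the example above can be reproduced with a few lines of arithmetic. The Python sketch below uses the worst path delays and the cycle count quoted in the text; the function names are hypothetical, and the pipelined model assumes, as the example does, that each FFT after the first adds one extra clock cycle.

```python
def sequential_total_ps(delay_ps, tasks):
    """Existing parallel design: every FFT costs one full worst path delay."""
    return delay_ps * tasks

def pipelined_total_ps(cycle_ps, cycles_first, tasks):
    """Proposed pipelined design, under the one-extra-cycle-per-task assumption."""
    first = cycle_ps * cycles_first      # latency of the first FFT
    rest = cycle_ps * (tasks - 1)        # one extra cycle per remaining FFT
    return first + rest

existing = sequential_total_ps(15067.7, 5)   # worst path delay quoted in the text
proposed = pipelined_total_ps(5422, 7, 5)    # 7 cycles for the 16-point radix-2 FFT
print(existing, proposed, proposed < existing)
```

The same two functions can be reused for other task counts to see how the gap between the two designs grows with the number of back-to-back FFTs.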
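Table 10. Operation of proposed multi-mode parallel radix-2² DIF FFT.

| Cycle | s | s1 | s2 | Sel1 | Sel2 | s8 | s5 | Operation |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | X | X | X | X | 16-point DIF FFT |
| 1 | 1 | 0 | 0 | X | X | X | X | 16-point DIF FFT |
| 2 | X | X | X | 0 | X | X | 1 | 16-point DIF FFT |
| 3 | X | X | X | 1 | X | X | 2 | 16-point DIF FFT |
| 4 | X | X | X | X | 0 | 0 | X | 16-point DIF FFT |
| 5 | X | X | X | X | 0 | 0 | X | 16-point DIF FFT |
| 6 | X | X | X | X | 1 | 1 | X | 16-point DIF FFT |
| 7 | X | X | X | X | 1 | 1 | X | 16-point DIF FFT |
| 0 | 2 | 1 | 0 | X | X | X | 0 | 8-point DIF FFT |
| 1 | X | X | X | 2 | X | X | 0 | 8-point DIF FFT |
| 2 | X | X | X | X | 2 | 1 | 0 | 8-point DIF FFT |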
The proposed parallel architectures achieve area reductions of 56.7% and 40.6% compared with the existing parallel architecture based 16-point radix-2 and radix-2² DIF FFTs respectively. Similarly, the proposed parallel architecture achieves a 49.5% area reduction compared with the existing parallel architecture based 64-point radix-4 DIF FFT. The proposed parallel architectures achieve PDP reductions of 90.9% and 88.3% compared with the existing parallel architecture based 16-point radix-2 and radix-2² DIF FFTs respectively. Similarly, the proposed parallel architecture achieves a 90.4% PDP reduction compared with the existing parallel architecture based 64-point radix-4 DIF FFT. The area of the proposed folded architectures is greater than that of the existing SDF/MDC architectures; this is the trade-off explained in Section 1. In the case of SDF/MDC, the hardware registers act as shift registers rather than pipeline registers. So, the worst path delay of SDF/MDC is almost equal to the worst path delay of the existing parallel architectures. The number of cycles and the worst path delay of the proposed folded DIF FFT architectures are less than those of the existing SDF/MDC architectures, as explained in Section 1. The proposed folded architectures achieve
Fig. 8. Proposed 64-point radix-4 parallel DIF FFT (a) Architecture, (b) Butterfly unit used in BF1, BF2, BF3, and BF4, and (c) Butterfly unit used in BF5, BF6, BF7, and BF8.
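Table 11 BF units with corresponding inputs and twiddle factors for proposed multi-mode parallel 64-point radix-4 DIF FFT.

Inputs for BF units:

| Multiplexer | Sel | s1 | BF1 | BF2 | BF3 | BF4 |
|---|---|---|---|---|---|---|
| M1 | 0 | 0 | i0 | i1 | i2 | i3 |
| M1 | 0 | 1 | i4 | i5 | i6 | i7 |
| M1 | 0 | 2 | i8 | i9 | i10 | i11 |
| M1 | 0 | 3 | i12 | i13 | i14 | i15 |
| M1 | 1 | 0 | R4_0 | R4_4 | R4_8 | R4_12 |
| M1 | 1 | 1 | R5_1 | R5_5 | R5_9 | R5_13 |
| M1 | 1 | 2 | R6_2 | R6_6 | R6_10 | R6_14 |
| M1 | 1 | 3 | R7_3 | R7_7 | R7_11 | R7_15 |
| M1 | 2 | - | i0 | i1 | i2 | i3 |
| M2 | 0 | 0 | i16 | i17 | i18 | i19 |
| M2 | 0 | 1 | i20 | i21 | i22 | i23 |
| M2 | 0 | 2 | i24 | i25 | i26 | i27 |
| M2 | 0 | 3 | i28 | i29 | i30 | i31 |
| M2 | 1 | 0 | R3_0 | R3_4 | R3_8 | R3_12 |
| M2 | 1 | 1 | R4_1 | R4_5 | R4_9 | R4_13 |
| M2 | 1 | 2 | R5_2 | R5_6 | R5_10 | R5_14 |
| M2 | 1 | 3 | R6_3 | R6_7 | R6_11 | R6_15 |
| M2 | 2 | - | i4 | i5 | i6 | i7 |
| M3 | 0 | 0 | i32 | i33 | i34 | i35 |
| M3 | 0 | 1 | i36 | i37 | i38 | i39 |
| M3 | 0 | 2 | i40 | i41 | i42 | i43 |
| M3 | 0 | 3 | i44 | i45 | i46 | i47 |
| M3 | 1 | 0 | R2_0 | R2_4 | R2_8 | R2_12 |
| M3 | 1 | 1 | R3_1 | R3_5 | R3_9 | R3_13 |
| M3 | 1 | 2 | R4_2 | R4_6 | R4_10 | R4_14 |
| M3 | 1 | 3 | R5_3 | R5_7 | R5_11 | R5_15 |
| M3 | 2 | - | i8 | i9 | i10 | i11 |
| M4 | 0 | 0 | i48 | i49 | i50 | i51 |
| M4 | 0 | 1 | i52 | i53 | i54 | i55 |
| M4 | 0 | 2 | i56 | i57 | i58 | i59 |
| M4 | 0 | 3 | i60 | i61 | i62 | i63 |
| M4 | 1 | 0 | R1_0 | R1_4 | R1_8 | R1_12 |
| M4 | 1 | 1 | R2_1 | R2_5 | R2_9 | R2_13 |
| M4 | 1 | 2 | R3_2 | R3_6 | R3_10 | R3_14 |
| M4 | 1 | 3 | R4_3 | R4_7 | R4_11 | R4_15 |
| M4 | 2 | - | i12 | i13 | i14 | i15 |

Twiddle factors for BF units:

| Multiplexer | se2 | BF1 | BF2 | BF3 | BF4 |
|---|---|---|---|---|---|
| K1 | 0 | W_64^0 | W_64^1 | W_64^2 | W_64^3 |
| K1 | 1 | W_64^4 | W_64^5 | W_64^6 | W_64^7 |
| K1 | 2 | W_64^8 | W_64^9 | W_64^10 | W_64^11 |
| K1 | 3 | W_64^12 | W_64^13 | W_64^14 | W_64^15 |
| K1 | 4 | W_16^0 | W_16^1 | W_16^2 | W_16^3 |
| K2 | 0 | W_64^0 | W_64^2 | W_64^4 | W_64^6 |
| K2 | 1 | W_64^8 | W_64^10 | W_64^12 | W_64^14 |
| K2 | 2 | W_64^16 | W_64^18 | W_64^20 | W_64^22 |
| K2 | 3 | W_64^24 | W_64^26 | W_64^28 | W_64^30 |
| K2 | 4 | W_16^0 | W_16^2 | W_16^4 | W_16^6 |
| K3 | 0 | W_64^0 | W_64^3 | W_64^6 | W_64^9 |
| K3 | 1 | W_64^12 | W_64^15 | W_64^18 | W_64^21 |
| K3 | 2 | W_64^24 | W_64^27 | W_64^30 | W_64^33 |
| K3 | 3 | W_64^36 | W_64^39 | W_64^42 | W_64^45 |
| K3 | 4 | W_16^0 | W_16^3 | W_16^6 | W_16^9 |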
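The twiddle factor entries of Table 11 appear to follow a regular pattern. For se2 = 0 to 3, butterfly unit BFb (b = 1 to 4) handles index n = 4·se2 + (b − 1) and multiplexer Km supplies W_64 raised to the power m·n; for se2 = 4, Km supplies W_16 raised to the power m·(b − 1). The Python sketch below regenerates the exponents under that assumption. The pattern is inferred from the table entries rather than stated by the authors, the function names are hypothetical, and the convention W_N = exp(−j2π/N) is the usual DFT one, assumed here.

```python
import cmath


def twiddle_exponent(m, bf, se2):
    """Return (exponent, size) such that the factor is W_size^exponent.

    m   : index of the twiddle multiplexer K1..K3 (1, 2 or 3)
    bf  : index of the butterfly unit BF1..BF4 (1 to 4)
    se2 : select value from Table 11 (0 to 4)
    """
    if not (1 <= m <= 3 and 1 <= bf <= 4 and 0 <= se2 <= 4):
        raise ValueError("m in 1..3, bf in 1..4, se2 in 0..4 expected")
    if se2 == 4:
        # 16-point stage: exponents are taken modulo-free from W_16.
        return m * (bf - 1), 16
    n = 4 * se2 + (bf - 1)  # sample index served in this cycle (assumed)
    return m * n, 64


def twiddle_value(m, bf, se2):
    """Complex value of the twiddle factor, assuming W_N = exp(-2j*pi/N)."""
    exponent, size = twiddle_exponent(m, bf, se2)
    return cmath.exp(-2j * cmath.pi * exponent / size)


# Print the schedule in the same layout as the twiddle part of Table 11.
for m in (1, 2, 3):
    for se2 in range(5):
        row = [twiddle_exponent(m, bf, se2) for bf in (1, 2, 3, 4)]
        print("K%d se2=%d:" % (m, se2), ["W_%d^%d" % (size, e) for e, size in row])
```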
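Table 12 Operation of proposed multi-mode parallel radix-4 DIF FFT.

| Operation | Cycle | Sel | s1 | se2 |
|---|---|---|---|---|
| 64-point DIF FFT | 0 | 0 | 0 | 0 |
| 64-point DIF FFT | 1 | 0 | 1 | 1 |
| 64-point DIF FFT | 2 | 0 | 2 | 2 |
| 64-point DIF FFT | 3 | 0 | 3 | 3 |
| 64-point DIF FFT | 4 | 1 | 0 | 4 |
| 64-point DIF FFT | 5 | 1 | 1 | 4 |
| 64-point DIF FFT | 6 | 1 | 2 | 4 |
| 64-point DIF FFT | 7 | 1 | 3 | 4 |
| 16-point DIF FFT | 0 | 2 | X | 4 |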
65.5%, 51.1%, and 35.8% worst path delay reduction compared with the existing SDF-based 16-point radix-2, radix-2², and radix-4 DIF FFTs, respectively. Figs. 9 and 10 show the layout diagrams for the proposed radix-2 16-point folded and parallel DIF FFT architectures. In most cases, area and power dissipation are directly proportional to each other. The proposed folded FFT architecture is designed to minimize the number of cycles (the trade-off is the number of BF units). So, the delay and number of cycles for the proposed folded FFT architectures are less than those of the existing architectures (SDF, MDC, 4-parallel MDC, MDF, and SDC-SDF). The area of the proposed folded architecture is higher than that of the existing folded architectures. Therefore, the PDP of the proposed folded architectures is greater than that of the existing ones. At the same time, the 4-parallel MDF, 4-parallel MDC, 8-parallel MDC, and mixed radix MDC architectures have less area than the proposed architecture and greater area than the other existing techniques. The number of clocked registers used in these architectures is greater than in the proposed folded architectures. Therefore, the PDPs of the 4-parallel MDF, 4-parallel MDC, 8-parallel MDC, and mixed radix MDC architectures are greater than those of the proposed designs. The clock in a digital system tends to increase the power: a pipelined system dissipates more power than a non-pipelined system. The proposed parallel FFT architecture is designed to minimize the number of BFs (the trade-off is the number of cycles, i.e., the throughput of the proposed parallel architecture is less than that of the
Table 13 Comparison of N-point proposed folded/parallel DIF FFT with others. Radix
Architecture
Hardware cost
Speed
Number of BF units
# of complex number adders
# of complex number multipliers
2
SDF [11]
log 2 N
2 log 2 N
ðlog 2 NÞ 1
2
MDC [16]
log 2 N
2 log 2 N
ðlog 2 NÞ 1
2
4-parallel MDC [6]
2 log 2 N
2ð2 log 2 NÞ
2ððlog 2 NÞ 1Þ
2
8- parallel MDC [6]
4 log 2 N
4ð2 log 2 NÞ
4ððlog 2 NÞ 1Þ
2
SDC-SDF [20]
log 2 N
2 log 2 N
2
Variable length SDF [25]
log 2 N
2 log 2 N
1 ððlog 2 NÞ 1Þ 2 ðlog 2 NÞ 1
2
4-parallel MDF [21]
2 log 2 N
2ð2 log 2 NÞ
ðlog 2 NÞ 1
2
Proposed folded
N 2
2
2
Parallel [7]
2
Proposed parallel
N log 2 N 2 N N log 2 4 2
N 2 log 2 N 2 1 N 2 log 2 N 2 2
N ððlog 2 NÞ 1Þ 2 1 N ððlog 2 NÞ 1Þ 2 2
22
SDF [14]
log 2 N
2 log 2 N
ðlog 2 NÞ 1
22
MDC [17]
log 2 N
2 log 2 N
ðlog 2 NÞ 1
22
4-parallel MDC [6]
2 log 2 N
2ð2 log 2 NÞ
2ððlog 2 NÞ 1Þ
2
8-parallel MDC [6]
4 log 2 N
4ð2 log 2 NÞ
4ððlog 2 NÞ 1Þ
Variable length SDF [25]
log 2 N
2 log 2 N
ðlog 2 NÞ 1
4-parallel MDF [21]
log 2 N
2ð2 log 2 NÞ
ðlog 2 NÞ 1
22
Proposed folded
N 2
2
22
Parallel [8]
N log 2 N 2 N N log 2 4 2
N 2 log 2 N 2 1 N 2 log 2 N 2 2
N ððlog 2 NÞ 1Þ 2 1 N ððlog 2 NÞ 1Þ 2 2
2
22 2
2
2
2
Proposed parallel
N ¼N 2
N ¼N 2
N 2
2
N ¼N 2
4
SDF [15]
log 4 N
12 log 4 N
ðlog 4 NÞ 1
4
MDC [16]
log 4 N
12 log 4 N
3ðlog 4 NÞ 1
4
8- parallel MDC [6]
2 log 4 N
2ð12 log 4 NÞ
2ð3ðlog 4 NÞ 1Þ
4
Variable length SDF [25]
log 4 N
12 log 4 N
ðlog 4 NÞ 1
4
Proposed folded
N 4
12
4
Parallel [9]
4
Proposed parallel
N log 4 N 4 N N log 4 16 4
N 12 log 4 N 4 1 N 12 log 4 N 4 4
N 4
3
N 4
N 3 ððlog 4 NÞ 1Þ 4 1 N 3 ððlog 4 NÞ 1Þ 4 4
Number of cycles 3N 2 2 3N 2 Z 2 1 3N 2 Z 2 2 1 3N 2 Z 4 2 2N þ ðlog 2 NÞ 1 Z
Throughput 2 2 4 8 2
3N 2 Z 2 1 3N 2 Z 2 2 log 2 N
4
1
N
r 2 log 2 N
N 2
3N 2 2 3N 2 Z 2 1 3N 2 Z 2 2 1 3N 2 Z 4 2 3N 2 Z 2 3N 2 Z 2 log 2 N Z
2
N
2 2 4 8 2 4 N
1
N
r 2 log 2 N
N 2
N log 4 N 4 N log 4 N Z 4 1 N log 4 N Z 2 4 N log 4 N Z 4 log 4 N
8
1
N
r 4 log 4 N
N 4
Z
4 4
4 N
Throughput = number of operations (outputs) per cycle [24]. Each complex number adder contains two real number adders. Each complex number multiplier contains four real number multipliers and two real number adders.
existing). The number of pipeline stages in the proposed parallel architecture is greater than in the existing one (listed in Table 13 as the number of cycles). Hence, more hardware registers are driven by the common clock in the proposed parallel FFT architectures. Therefore, PDP for proposed parallel architectures is greater than existing.
6. Conclusion

In this paper, efficient VLSI architectures for folded/parallel FFT operations are proposed. The proposed folded architecture performs the whole N-point radix-k DIF FFT operation iteratively by using the butterfly units of a single stage of the parallel radix-k DIF FFT with the corresponding twiddle factors; the parallel radix-k DIF FFT has log_k N stages, and each stage contains N/k butterfly units. In the proposed parallel FFT architecture, the N-point FFT is implemented by using one N/2-point FFT without much extra hardware. In this work, the proposed and existing architectures are implemented for radix-2, 2², and 4 using a 45 nm technology library. The proposed parallel architectures achieve 56.7% and 40.6% area reduction compared with the existing parallel 16-point radix-2 and radix-2² DIF FFTs, respectively. The proposed folded
Table 14 Worst/best/typical case comparison of N-point proposed folded/parallel DIF FFT with respect to number of cycles and mode of operation (length of DIF FFT).

| Proposed DIF FFT architecture | Mode/Number of cycles | Best case | Worst case | Typical case |
|---|---|---|---|---|
| Radix-2/2² 16-point folded | Mode | 2-point | 16-point | 4 or 8-point |
| Radix-2/2² 16-point folded | Number of cycles | 1 | 4 | 2 or 3 |
| Radix-2/2² N-point folded | Mode | 2-point | N-point | other than 2 and N-point |
| Radix-2/2² N-point folded | Number of cycles | 1 | log_2 N | 1 < cycles < log_2 N |
| Radix-4 16-point folded | Mode | 4-point | 16-point | 16-point |
| Radix-4 16-point folded | Number of cycles | 1 | 2 | 2 |
| Radix-4 N-point folded | Mode | 4-point | N-point | other than 4 and N-point |
| Radix-4 N-point folded | Number of cycles | 1 | log_4 N | 1 < cycles < log_4 N |
| Radix-2 16-point parallel | Mode | 8-point | 16-point | 16-point |
| Radix-2 16-point parallel | Number of cycles | 3 | 7 | 7 |
| Radix-2 N-point parallel | Mode | N/2-point | N-point | N-point |
| Radix-2 N-point parallel | Number of cycles | log_2 (N/2) | ≤ 2 log_2 N | ≤ 2 log_2 N |
| Radix-2² 16-point parallel | Mode | 8-point | 16-point | 16-point |
| Radix-2² 16-point parallel | Number of cycles | 3 | 8 | 8 |
| Radix-2² N-point parallel | Mode | N/2-point | N-point | N-point |
| Radix-2² N-point parallel | Number of cycles | log_2 (N/2) | ≤ 2 log_2 N | ≤ 2 log_2 N |
| Radix-4 64-point parallel | Mode | 16-point | 64-point | 64-point |
| Radix-4 64-point parallel | Number of cycles | 2 | 11 | 11 |
| Radix-4 N-point parallel | Mode | N/4-point | N-point | N-point |
| Radix-4 N-point parallel | Number of cycles | log_4 (N/4) | ≤ 4 log_4 N | ≤ 4 log_4 N |
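The folded rows of Table 14 reflect the idea that one bank of N/2 radix-2 butterflies is reused once per cycle, so an N-point transform takes log_2 N cycles. The following Python sketch is a software model of that idea only, not the authors' hardware: each pass of the outer loop stands for one cycle in which all N/2 butterflies of a single DIF stage operate, and the pass counter plays the role of the cycle count. The function name and the final bit-reversal step are assumptions made for the illustration.

```python
import cmath


def folded_radix2_dif_fft(samples):
    """Radix-2 DIF FFT computed stage by stage; returns (spectrum, cycles)."""
    n = len(samples)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two, at least 2")
    data = [complex(v) for v in samples]
    cycles = 0
    half = n // 2
    while half >= 1:
        # One pass models one cycle: the same n/2 butterflies are reused.
        for start in range(0, n, 2 * half):
            for k in range(half):
                top = data[start + k]
                bottom = data[start + k + half]
                twiddle = cmath.exp(-2j * cmath.pi * k / (2 * half))
                data[start + k] = top + bottom
                data[start + k + half] = (top - bottom) * twiddle
        half //= 2
        cycles += 1
    # DIF leaves the results in bit-reversed order; restore natural order.
    bits = n.bit_length() - 1
    spectrum = [0j] * n
    for i, value in enumerate(data):
        reversed_i = int(format(i, "0{}b".format(bits))[::-1], 2)
        spectrum[reversed_i] = value
    return spectrum, cycles


spectrum, cycles = folded_radix2_dif_fft(range(16))
print("cycles used for a 16-point transform:", cycles)
```

Running the same function on 2-, 4-, and 8-point inputs gives the smaller cycle counts listed for the shorter modes of the radix-2 folded design.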
Table 15 Comparison results of various 1D-DIF FFT architectures using 45-nm technology. Worst path delay, frequency, and throughput describe speed; total area and number of cells describe hardware cost; PDP is the energy per operation.

| Radix | N | N-point DIF FFT architecture | Worst path delay (ps) | Frequency (MHz) | Throughput | Total area (μm²) | Number of cells | Net power (nW) | Switching power (nW) | Leakage power (nW) | PDP (fJ) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 16 | Parallel [7] | 15,067.7 | 66.4 | 16 | 410,638.9 | 309,711 | 13,239,211.1 | 37,252,446.3 | 19,903,330.7 | 861,206.2 |
| 2/4 | 16 | Parallel [19] | 14,708.7 | 68.1 | 16 | 474,734.7 | 361,916 | 14,156,306.4 | 39,834,243.3 | 22,942,066.6 | 9,233,578.1 |
| 2 | 16 | Proposed parallel | 5422.0 | 184.4 | 8 | 177,847.0 | 127,755 | 1,967,268.6 | 6,093,506.4 | 8,353,796.4 | 78,333.3 |
| 2² | 16 | Parallel [8] | 11,870.3 | 84.2 | 16 | 382,225.5 | 290,078 | 11,878,794.5 | 33,236,865.3 | 18,316,633.9 | 611,955.6 |
| 2² | 16 | Proposed parallel | 5217.7 | 191.7 | 8 | 226,941.4 | 155,635 | 313,145.8 | 1,196,138.8 | 12,454,303.2 | 71,224.4 |
| 4 | 64 | Parallel [9] | 16,467.9 | 60.7 | 64 | 1,362,300.5 | 1,176,504 | 56,677,494.8 | 136,346,215.1 | 62,585,455.1 | 3,275,986.9 |
| 2/4 | 64 | Parallel [19] | 15,432.5 | 64.8 | 64 | 982,521.5 | 694,238 | 41,572,714.1 | 63,256,661.1 | 37,326,613.1 | 1,552,251.5 |
| 4 | 64 | Proposed parallel | 7282.4 | 137.3 | 16 | 688,497.1 | 498,627 | 2,652,370.7 | 8,266,041.9 | 34,747,213.0 | 313,239.6 |
| 2 | 16 | SDF [11] | 15,867.5 | 63.02 | 2 | 52,281.9 | 39,611 | 925,647.6 | 2,473,586.9 | 2,457,504.8 | 78,244.1 |
| 2 | 16 | MDC [16] | 15,267.5 | 65.5 | 2 | 53,789.1 | 39,833 | 721,160.4 | 1,966,510.8 | 2,442,210.5 | 67,310.2 |
| 2 | 16 | 4-parallel MDC [6] | 15,084.6 | 66.3 | 4 | 102,784.1 | 78,616 | 1,297,226.6 | 3,458,148.5 | 4,727,111.7 | 123,471.4 |
| 2 | 16 | SDC-SDF [20] | 8711.8 | 114.8 | 2 | 27,484.4 | 20,586 | 230,978.2 | 626,789.9 | 1,309,002.4 | 16,864.2 |
| 2 | 16 | 4-parallel MDF [21] | 8740.4 | 114.4 | 4 | 91,817.4 | 68,783 | 1,356,854.7 | 3,644,161.9 | 4,200,483.1 | 68,565.3 |
| 2 | 16 | Variable length SDF [25] | 15,548.3 | 64.3 | 2 | 52,954.4 | 40,163 | 944,359.9 | 2,513,912.6 | 2,491,889.5 | 77,831.7 |
| 2 | 16 | Proposed folded | 5486.7 | 182.3 | 16 | 202,105.8 | 137,228 | 1,563,533.9 | 4,607,206.9 | 9,946,621.1 | 79,852.5 |
| 2² | 16 | SDF [14] | 11,208.7 | 89.2 | 2 | 48,048.8 | 36,124 | 804,591.8 | 2,162,001.3 | 2,250,845.1 | 49,462.3 |
| 2² | 16 | MDC [17] | 11,078.4 | 90.3 | 2 | 49,686.5 | 36,442 | 689,429.2 | 1,880,231.6 | 2,255,065.9 | 45,812.5 |
| 2² | 16 | 4-parallel MDC [6] | 11,127.2 | 89.9 | 4 | 94,059.3 | 71,456 | 109,154.5 | 2,934,221.4 | 4,924,749.9 | 87,448.6 |
| 2² | 16 | 4-parallel MDF [21] | 10,917.5 | 91.5 | 4 | 92,016.1 | 68,855 | 1,582,354.9 | 4,250,352.6 | 4,217,923.4 | 92,452.4 |
| 2² | 16 | Variable length SDF [25] | 10,986.0 | 91.0 | 2 | 48,552.3 | 36,594 | 817,193.1 | 2,190,436.2 | 2,272,758.9 | 49,032.6 |
| 2² | 16 | Proposed folded | 5490.2 | 182.1 | 16 | 247,243.8 | 167,851 | 1,395,058.0 | 4,098,530.3 | 10,040,188.6 | 77,624.4 |
| 4 | 16 | SDF [15] | 11,261.3 | 88.8 | 4 | 110,383.4 | 87,234 | 1,992,840.8 | 5,257,551.5 | 5,134,070.0 | 117,023.2 |
| 4 | 16 | MDC [16] | 11,761.3 | 85.1 | 4 | 117,931.7 | 94,112 | 2,357,822.5 | 6,198,834.8 | 5,500,590.5 | 137,600.5 |
| 4 | 16 | 8-parallel MDC [6] | 11,137.0 | 89.8 | 8 | 230,931.8 | 187,104 | 555,551.6 | 14,575,191.5 | 10,911,237.9 | 283,842.4 |
| 4 | 16 | Variable length SDF [25] | 10,939.5 | 91.4 | 4 | 111,018.4 | 87,646 | 2,014,283.5 | 5,326,900.1 | 5,176,334.3 | 114,900.1 |
| 4/8 | 16 | Mixed Radix MDC [22] | 12,835.1 | 77.9 | 4 | 199,398.7 | 159,774 | 4,134,073.2 | 10,846,919.1 | 9,219,826.7 | 257,558.8 |
| 4 | 16 | Proposed folded | 7227.3 | 138.4 | 16 | 331,073.2 | 245,509 | 1,983,434.4 | 5,671,832.4 | 17,585,383.0 | 168,086.9 |
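The PDP column of Table 15 and the reduction percentages quoted in the text can be reproduced from the other columns. The Python sketch below assumes the definition given earlier in the paper (PDP = worst path delay × (switching power + leakage power)) and the usual relative-reduction formula; the function names are hypothetical, and the input numbers are copied from Table 15.

```python
def pdp_fj(delay_ps, switching_nw, leakage_nw):
    """Power-delay product in fJ.

    1 ps * 1 nW = 1e-12 s * 1e-9 W = 1e-21 J = 1e-6 fJ.
    """
    return delay_ps * (switching_nw + leakage_nw) * 1e-6


def reduction_percent(existing, proposed):
    """Relative reduction of the proposed value with respect to the existing one."""
    return 100.0 * (existing - proposed) / existing


# 16-point radix-2 parallel designs from Table 15.
pdp_existing = pdp_fj(15067.7, 37252446.3, 19903330.7)   # Parallel [7]
pdp_proposed = pdp_fj(5422.0, 6093506.4, 8353796.4)      # Proposed parallel

print("PDP, Parallel [7]      (fJ):", round(pdp_existing, 1))
print("PDP, proposed parallel (fJ):", round(pdp_proposed, 1))
print("PDP reduction   (%):", round(reduction_percent(pdp_existing, pdp_proposed), 1))
print("Area reduction  (%):", round(reduction_percent(410638.9, 177847.0), 1))

# 16-point radix-4 folded design against the radix-4 SDF design.
print("Delay reduction (%):", round(reduction_percent(11261.3, 7227.3), 1))
```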
Fig. 9. Layout chip diagram for proposed radix-2 16-point folded DIF FFT with core area of 232,420.75 μm², die space around core of 60 μm, and total chip area of 293,872.75 μm² using 45-nm technology.
Fig. 10. Layout chip diagram for proposed radix-2 16-point parallel DIF FFT with core area of 204,524.05 μm², die space around core of 60 μm, and total chip area of 262,364.05 μm² using 45-nm technology.
architectures achieve 65.5%, 51.1%, and 35.8% worst path delay reduction compared with the existing SDF-based 16-point radix-2, radix-2², and radix-4 DIF FFTs, respectively.

References
[1] Steven W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, California Technical Publishing, Poway, California, USA, 1997, pp. 551–566.
[2] L. Liu, J. Ren, X. Wang, F. Ye, Design of low-power, 1GS/s throughput FFT processor for MIMO-OFDM UWB communication system, in: Proceedings of IEEE International Symposium on Circuits and Systems, 2007, pp. 210–213.
[3] Eun Ji Kim, Myung Hoon Sunwoo, High speed eight-parallel mixed-radix FFT processor for OFDM systems, in: IEEE International Symposium on Circuits and Systems (ISCAS), 2011, pp. 1684–1687.
[4] J. Lee, H. Lee, S.-I. Cho, S.-S. Choi, A high-speed, low-complexity radix-2⁴ FFT processor for MB-OFDM UWB systems, in: Proceedings of IEEE International Symposium on Circuits and Systems, 2010, pp. 2594–2597.
[5] Sohaib Ahmed Khan, Digital Design of Signal Processing Systems: A Practical Approach, first edition, Wiley Publications, West Sussex, UK, 2011, pp. 343–378.
[6] Manohar Ayinala, Michael Brown, Keshab K. Parhi, Pipelined parallel FFT architectures via folding transformation, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 20 (6) (2012) 1068–1081.
[7] Alan V. Oppenheim, Ronald W. Schafer, John R. Buck, Discrete Time Signal Processing, Prentice Hall Publishers, New Jersey, USA, 1999, pp. 629–691.
[8] Nuo Li, N.P. van der Meijs, A radix-2² based parallel pipeline FFT processor for MB-OFDM UWB system, in: IEEE International SOC Conference, 2009, pp. 383–386.
[9] Software Optimization of FFTs and IFFTs Using the SC3850 Core, Freescale Semiconductor Application Note, Document Number: AN3666, 2010, pp. 1–52.
[10] Zhong Cui-xiang, Han Guo-qiang, Huang Ming-he, Some new parallel fast Fourier transform algorithms, in: IEEE International Conference on Parallel and Distributed Computing, Applications and Technologies, 2005, pp. 624–628.
[11] E.H. Wold, A.M. Despain, Pipeline and parallel-pipeline FFT processors for VLSI implementation, IEEE Trans. Comput. C-33 (5) (1984) 414–426.
[12] Shousheng He, Mats Torkelson, A new approach to pipeline FFT processor, in: IEEE International Parallel Processing Symposium, 1996, pp. 766–770.
[14] Yazan Samir Algnabi, Furat A. Aldaamee, Rozita Teymourzadeh, Masuri Othman, Md Shabiul Islam, Novel architecture of pipeline radix-2² SDF FFT based on digit-slicing technique, in: IEEE International Conference on Semiconductor Electronics (ICSE), 2012, pp. 470–474.
[15] A.M. Despain, Fourier transform computer using CORDIC iterations, IEEE Trans. Comput. C-23 (10) (1974) 993–1001.
[16] Tzi-Dar Chiueh, Pei-Yun Tsai, OFDM Baseband Receiver Design for Wireless Communications, John Wiley and Sons (Asia) Pte Ltd, Clementi Loop, Singapore, 2007, pp. 195–232.
[17] Manohar Ayinala, Low-power architectures for signal processing and classification systems (Ph.D. dissertation), submitted to the Faculty of the Graduate School of the University of Minnesota, Minneapolis, United States, 2012, pp. 10–43.
[19] Marwan A. Jaber, Daniel Massicotte, A new FFT concept for efficient VLSI implementation: part II – parallel pipelined processing, in: IEEE International Conference on Digital Signal Processing, 2009, pp. 1–5.
[20] Zeke Wang, Xue Liu, Bingsheng He, Feng Yu, A combined SDC-SDF architecture for normal I/O pipelined radix-2 FFT, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23 (5) (2015) 973–977.
[21] Seung-Won Yang, Jong-Yeol Lee, Constant twiddle factor multiplier sharing in multipath delay feedback parallel pipelined FFT processors, IEEE Electron. Lett. 50 (15) (2014) 1050–1052.
[22] Kai-Jiun Yang, Shang-Ho Tsai, Gene C.H. Chuang, MDC FFT/IFFT processor with variable length for MIMO-OFDM systems, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (4) (2013) 720–731.
[23] P. Milder, F. Franchetti, J.C. Hoe, M. Puschel, Computer generation of hardware for linear digital signal processing transforms, ACM Trans. Des. Autom. Electron. Syst. 17 (2) (2012) 15:1–33.
[24] Emmanuel Casseau, Bertrand Le Gal, Design of multi-mode application-specific cores based on high-level synthesis, Integr. VLSI J. 45 (2012) 9–21.
[25] Padma Prasad Boopal, Mario Garrido, Oscar Gustafsson, A reconfigurable FFT architecture for variable-length and multi-streaming OFDM standards, in: IEEE International Symposium on Circuits and Systems (ISCAS), 2013, pp. 2066–2070.
[26] Cadence, 〈http://www.cadence.com/Alliances/pages/tsmcrequest.aspx〉.
[27] L. Xu, VLSI Circuit Design Methodology Demystified: A Conceptual Taxonomy, first edition, Wiley-IEEE Press, New Jersey, USA, 2007, pp. 42–69.
[29] Ricardo Gonzalez, Benjamin M. Gordon, Mark A. Horowitz, Supply and threshold voltage scaling for low power CMOS, IEEE J. Solid State Circuits 32 (8) (1997) 1210–1216.

Mohamed Asan Basiri M received the B.E. (Electronics and Communication Engineering) and M.E. (Embedded Systems) degrees from Anna University, Tamil Nadu, India, in 2009 and 2011, respectively. He is currently working towards a Ph.D. in the Department of Computer Science and Engineering, Indian Institute of Information Technology Design and Manufacturing (IIITDM) Kancheepuram, Chennai, Tamil Nadu, India. His research interests include VLSI for signal processing and reconfigurable systems.
Noor Mahammad Sk obtained his Ph.D from Indian Institute of Technology Madras, and he is currently working as an Assistant professor in the department of computer science and engineering, Indian Institute of Information Technology Design and Manufacturing (IIITDM) Kancheepuram, Chennai, India. His research interest includes software for VLSI, network on chip, open flow networking, software defined radio, and evolvable hardware.