RADIX-2 DECIMATION IN TIME (DIT) FFT IMPLEMENTATION BASED ON A MATRIX-MULTIPLE CONSTANT MULTIPLICATION APPROACH
1,2Sidinei Ghissoni, 3Eduardo Costa, 4Cristiano Lazzari, 4Jose Monteiro, 4Levent Aksoy, 1Ricardo Reis
1 Programa de Pós-Graduação em Microeletrônica - PGMICRO/UFRGS
2 Universidade Federal do Pampa - UNIPAMPA
3 Universidade Católica de Pelotas - UCPEL
4 Instituto de Engenharia de Sistemas e Computadores - INESC-ID
978-1-4244-8157-6/10/$26.00 ©2010 IEEE
ABSTRACT
This paper proposes the implementation of a fully-parallel radix-2 Decimation in Time (DIT) Fast Fourier Transform (FFT) using the Matrix-Multiple Constant Multiplication (M-MCM) approach at gate level. In the FFT algorithm, the butterfly plays a central role in the complex multiplications by constants. The Matrix-MCM approach can significantly reduce the cost of the real and imaginary multiplications by constants. In this work, for each stage of the real and imaginary parts of the butterflies, we maximize the sharing of the partial products of the coefficients using M-MCM. The experimental results show that reductions of 58% in area and 74% in power dissipation can be obtained with the M-MCM approach when the FFT designs are synthesized using the Cadence Encounter RTL Compiler under the UMC 130nm technology.
Index Terms- FFT, radix-2 DIT, Matrix-MCM, gate level
1. INTRODUCTION
The Fast Fourier Transform (FFT) is widely used in Digital Signal Processing (DSP) applications such as audio and video processing and wireless communication, and it is also found in modules of WLAN chips. In FFTs, the set of inputs is split into two groups. This is done recursively until only sequences of length 2 remain, where the input samples cannot be partitioned any further. This algorithm is named radix-2 FFT [1]. Partitioning the input sequence into more than two subsequences leads to higher radices of the FFT algorithm, which leads to an increase in the number of arithmetic operations. Note that the complexity of an FFT architecture is dominated by the multiplications of inputs with a large number of coefficients denoting the twiddle factors. In the last decades, researchers have focused on the implementation of area-efficient, high-speed, and low-power multipliers. However, in the case of the FFT circuit, where several complex multiplications are performed by the butterflies, even these efficient multipliers are not able to optimize the complexity of the fully-parallel FFT architectures.
The multiplication of a set of constants by a variable, generally known as the Multiple Constant Multiplications (MCM) operation, has a significant impact on the design of DSP systems. Since multipliers are expensive in terms of hardware area and the constants are known beforehand, the MCM operation is generally implemented using only addition, subtraction, and shift operations. Note that shifts are free in terms of hardware, since they can be realized using only wires. Over the years, many efficient algorithms that find the fewest number of addition and subtraction operations required for the implementation of the MCM operation have been proposed [2-5]. However, only a few of them have been applied in the design of FFTs [6]. In this paper, we propose a design methodology for the efficient implementation of FFT architectures. In our method, constant multiplications are realized in a shift-adds architecture obtained by a method based on the optimization algorithm of [2], and each addition and subtraction operation is designed at gate level as described in [7]. This paper is organized as follows. In Section 2, we present an overview of FFT algorithms and of Multiple Constant Multiplications at gate level. In Section 3, we present the application of the Matrix-MCM in a butterfly. Some results on the comparison between butterflies using MCM and behavioral implementations are presented in Section 4. Finally, Section 5 presents the conclusions and gives some ideas for future work.
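As a rough illustration of the shift-adds realization of an MCM operation mentioned in the introduction, the sketch below multiplies one input by a small set of constants using only shifts, additions, and subtractions. The constant set {3, 5, 11, 13} and the function name `mcm_shift_add` are assumptions chosen for illustration; they are not taken from the paper, and the decomposition shown is hand-picked rather than produced by the algorithm of [2].

```python
def mcm_shift_add(x):
    """Multiply x by the constants 3, 5, 11 and 13 using shifts, adds and subtracts only.

    Illustrative constant set (an assumption, not from the paper).
    Four add/subtract operations are used in total; shifts are treated as free wiring.
    """
    x3 = (x << 1) + x      # 3x  = 2x + x
    x5 = (x << 2) + x      # 5x  = 4x + x
    x11 = (x3 << 2) - x    # 11x = 4*(3x) - x, reuses the partial product 3x
    x13 = (x3 << 2) + x    # 13x = 4*(3x) + x, reuses the partial product 3x
    return {3: x3, 5: x5, 11: x11, 13: x13}
```

The point of the sketch is the reuse of `x3`: sharing one partial product among several constants is the kind of saving that MCM algorithms search for systematically.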
ICECS 2010
2. BACKGROUND
The FFT algorithm is an efficient way of computing the Discrete Fourier Transform (DFT). Many algorithms have been proposed to improve the performance of the FFT butterfly architecture [8-12]. The radix-2 butterfly with Decimation in Time (DIT) [1], where the $N^2$ multiplications found in the direct DFT are reduced to the order of $N \log_2 N$ multiplications, is shown in Eqn. 1. This aspect enables a significant increase in computational performance when obtaining the solution of the Fourier transform. The FFT can reduce the computational complexity of the DFT because its butterfly can process the calculation of two samples at a time.
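To make the radix-2 DIT idea concrete, the following sketch compares a direct DFT, X(k) = sum over n of x(n) W_N^(nk) with W_N = exp(-j 2 pi / N), against a recursive radix-2 DIT FFT built from two-input butterflies. It is a software illustration under the assumption that the input length is a power of two; the function names are hypothetical and it does not model the fully-parallel hardware architecture discussed in this paper.

```python
import cmath


def dft(x):
    """Direct DFT: X(k) = sum_n x(n) * W_N^(n*k), with W_N = exp(-2j*pi/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft_radix2_dit(x[0::2])   # N/2-point FFT of the even-indexed samples
    odd = fft_radix2_dit(x[1::2])    # N/2-point FFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        # Multiplication by the twiddle factor W_N^k: the constant multiplication
        # that a multiplierless design replaces with shifts and adds.
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t             # butterfly, upper output
        out[k + N // 2] = even[k] - t    # butterfly, lower output
    return out
```

Each butterfly produces two outputs from one twiddle multiplication, which is where the saving over the direct DFT comes from.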
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \qquad (1)$$
where $W_N = e^{-j\frac{2\pi}{N}}$ is the twiddle factor.
Several other algorithms were developed to further reduce the computational complexity of the butterfly, such as radix-4, split-radix, radix-8, radix-2/4/8, and higher radix versions. However, all of them are based on the butterfly of [1] given in Eqn. 1.
2.1. Related Work
Several algorithms have been proposed to reduce the complexity of the FFT architectures. The radix-2/4/8 algorithm, which can effectively minimize the number of complex multiplications in pipelined FFTs, was introduced in [8]. An optimization method based on multirate signal processing and asynchronous circuit technology was proposed in [9]. Design solutions based on parallel architectures for high-throughput and power-efficient FFT cores were presented in [10]. In that work, different combinations of hybrid low-power techniques were exploited, such as i) the use of multiplierless units, which replace the complex multipliers in FFTs, ii) the use of low-power commutators, which are based on advanced interconnection, and iii) the use of the parallel-pipelined architecture. The proposed methods also use an MCM method for the sharing of various multipliers that are located in the same stage of the hybrid architectures. This methodology is efficient, but in some cases area is increased. The optimization of the twiddle factors using trigonometric identities for FFT architectures with few points was proposed in [11]. This methodology proposes replacing the adders of the circuits with multiplexers. Based on the same idea, the work of [6] introduces a low-complexity reconfigurable complex constant multiplication for FFTs that reduces area for a larger number of points (32 points). The methodology proposed in [11] and [6] was compared with the work of [3, 5], and reductions in terms of the number of adders were shown to be achievable. Although power and delay results are not presented in [11] and [6], it is commented there that these metrics may lead to large circuits due to a limitation of the proposed architecture, which performs the FFT computation only in serial form.
In [12], methods for designing multiplierless implementations of fixed-point rotators and FFTs, in which multiplications are replaced by additions, subtractions, and shifts, were introduced. These methods minimize the adder cost (the number of additions and subtractions) while achieving a specified level of accuracy, but they take much more processing time as the data wordlength increases. Although several techniques have been proposed to reduce the complexity of the FFT architectures, only a few of them use the MCM approach aiming at increasing the sharing of real and imaginary twiddle factors inside the butterflies. In this paper, we present the use of a Matrix-MCM algorithm based on the algorithm of [2] and the gate-level implementation of addition and subtraction operations as described in [7] for the design of low-complexity radix-2 butterflies of FFTs with DIT.
2.2. Matrix-MCM for FFT
The Matrix-MCM operation consists of the multiplication of an m×n matrix A of constant coefficients by an n×1 input vector X, as presented in Figure 1. The constant matrix specifies how the output vector Y = A·X is obtained from linear transformations of the inputs. The M-MCM problem is defined as finding the minimum number of additions and subtractions that implement the linear transforms. Note that the M-MCM problem is a more generic version of the MCM problem, in which multiple constants are multiplied by only one input variable. Thus, algorithms designed for the MCM problem can be easily extended to handle the M-MCM problem, where each constant and the input variable in the MCM are replaced by a constant vector and an input vector, respectively.
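A small worked instance may help to show what sharing among linear transforms means. The sketch below computes Y = A·X for the constant matrix A = [[3, 5], [6, 11]] using shifts and additions only. The matrix and the function name `mmcm_shift_add` are assumptions made for illustration; the decomposition is hand-derived and is not claimed to be the output of the algorithm of [2] or to be minimal.

```python
def mmcm_shift_add(x1, x2):
    """Compute Y = A.X for A = [[3, 5], [6, 11]] with shifts and additions only.

    Illustrative matrix (an assumption, not from the paper).
    Uses four additions in total; shifts are treated as free wiring.
    """
    t1 = (x1 << 1) + x1    # 3*x1
    t2 = (x2 << 2) + x2    # 5*x2
    y1 = t1 + t2           # y1 = 3*x1 + 5*x2
    y2 = (y1 << 1) + x2    # y2 = 2*y1 + x2 = 6*x1 + 11*x2, reuses the whole first row
    return y1, y2
```

Here the second output is obtained from the first with a single extra addition, because the linear transform 6·x1 + 11·x2 contains 2·(3·x1 + 5·x2) as a common subexpression.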
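$$
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}
=
\begin{bmatrix}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n \\
\vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n
\end{bmatrix}
$$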
Fig. 1: Matrix-Multiple Constant Multiplications structure.
The M-MCM approach can be used in the design of an N-point FFT with DIT, because the data flow in the FFT can be represented by a set of constants multiplied by each input, as given in the structure of Figure 1. Thus, the complexity of the M-MCM operation can be reduced by replacing constant multiplications with addition/subtraction operations and by finding the common subexpressions among the linear transforms.
2.3. Gate-Level Design
In [7], the optimization of area in the MCM operation is realized by determining the cost of each operation in terms of half adders (HAs), full adders (FAs), and logic gates under a given technology library, and by formalizing the area optimization problem as a 0-1 Integer Linear Programming (ILP) problem. The area of an operation at gate level depends on:
i. the type of operation (addition or subtraction);
ii. the shifted input in a subtraction (minuend or subtrahend);
iii. the number of shifts at the inputs;
iv. the position of the operation in the architecture;
v. the range and type of numbers considered.
Since shifts are free in terms of hardware, the constants are considered as odd numbers. Thus, there are three different types of operations that can be considered [7]:
1. $A \ll S_A + B \ll S_B$ (an adder where $S_A = 0$, $S_B = S$)
2. $A \ll S_A - B \ll S_B$ (a subtracter where $S_A = S$, $S_B = 0$)
3. $A \ll S_A - B \ll S_B$ (a subtracter where $S_A = 0$, $S_B = S$)
For a given operation, $A$ and $B$ represent the inputs of the operation, and $S_A$ and $S_B$ denote the amount of shift on $A$ and $B$, respectively. Figure 2 shows examples of the computation of the cost of an $A + B \ll S$ operation in terms of the number of FAs and HAs, where $S$ denotes the amount of shift; $n_A$ and $n_B$ represent the number of bits of $A$ and $B$, respectively; and $n_m$ and $n_M$ are determined as $\min(n_A + S_A, n_B + S_B)$ and $\max(n_A + S_A, n_B + S_B)$, respectively.
[Figure residue: panels "Signed Input Examples" and "Unsigned Input Examples"; annotations: sign extension (positive or negative); operation A + B]
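One reason the cost of an operation depends on the shift amount can be checked directly in software: in an adder of type 1 with unsigned inputs, the S least significant bits of B ≪ S are zero, so the corresponding result bits are simply the low bits of A and need no adder cells. The sketch below illustrates only this observation and the definitions of n_m and n_M given above; it is not the cost model of [7], and the function names are hypothetical.

```python
def operand_widths(n_a, s_a, n_b, s_b):
    """Return (n_m, n_M): the min and max of the shifted widths n_A + S_A and n_B + S_B."""
    wa, wb = n_a + s_a, n_b + s_b
    return min(wa, wb), max(wa, wb)


def add_shifted(a, b, s):
    """Compute a + (b << s) for unsigned a and b, adding only above bit position s."""
    low = a & ((1 << s) - 1)    # low s bits of A pass through unchanged (wires only)
    high = (a >> s) + b         # the only true addition, on the upper bits
    return (high << s) | low    # recombine; low < 2**s, so OR acts as concatenation
```

For example, with 8-bit inputs and S = 3, `operand_widths(8, 0, 8, 3)` gives n_m = 8 and n_M = 11, and only the bits from position 3 upward take part in the addition.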