an fft processor based on 16-point module - CiteSeerX

AN FFT PROCESSOR BASED ON 16-POINT MODULE

Weidong Li, Mark Vesterbacka and Lars Wanhammar
Electronics Systems, Dept. of EE., Linköping University, SE-581 83 LINKÖPING, SWEDEN
E-mail: {weidongl, markv, larsw}@isy.liu.se, Tel.: +46 13284059, Fax: +46 139282

ABSTRACT: The number of multiplications has been a key figure of merit for FFT algorithms, and it has an important impact on the total power consumption. In this paper, we present a 16-point FFT module that reduces the multiplicative complexity by using real constant multiplications. A pipeline FFT processor has been implemented with the 16-point module, and simulation results show that it is an attractive candidate for reducing the power consumption.

1. INTRODUCTION

FFT processors are widely used in digital signal processing. Recently, FFT processors have been applied to Orthogonal Frequency Division Multiplex (OFDM) based communication systems, like xDSL modems and wireless mobile terminals, because they implement the modulator and demodulator banks efficiently. Furthermore, low power has become a main constraint for battery-operated devices. Hence, an effective low-power design of the FFT processor is vital.

In this paper we present the design of an FFT processor that computes a 1024-point FFT, including I/O, within 40 µs and is part of a high bit rate mobile radio modem. Since our target application has high requirements on throughput, power, and area, an ASIC implementation is one of the most feasible implementations.

Complex multiplication has been an expensive operation both in the past [10] and now [1] [2]. One method to reduce the complexity is to replace complex multiplications with less expensive real multiplications when possible. Several 16-point modules are summarized in the following section. Among them, we use the most efficient one as the basic building block for the whole FFT processor, which is described in section 3. Results are presented in section 4. Finally, conclusions are given in section 5.

2. 16-POINT FFT MODULE

An N-point DFT can be expressed as follows:

\[ X(n) = \sum_{k=0}^{N-1} x(k)\, W^{nk}, \quad \text{where } W^{nk} = e^{-2\pi j nk / N}. \]

In this section, we concentrate on the 16-point FFT module. There are mainly three FFT algorithms that are suitable for VLSI implementation: radix-2, radix-4 and split-radix 2/4. For simplicity, all algorithms in this paper are based on decimation-in-frequency (DIF), which is equivalent in complexity to decimation-in-time (DIT) based algorithms. These algorithms are derived in the rest of this section.

2.1 RADIX-2

The radix-2 16-point FFT maps the indices with

\[ k = \sum_{i=0}^{3} 2^i k_i \quad \text{and} \quad n = \sum_{i=0}^{3} 2^i n_i, \qquad k_i, n_i \in \{0, 1\}. \]

With the notation $(m_3, m_2, m_1, m_0) = \sum_{i=0}^{3} 2^i m_i$, the 16-point FFT can be expressed as follows:

\[ X(n_3, n_2, n_1, n_0) = \sum_{k_0=0}^{1} \Bigg\{ \sum_{k_1=0}^{1} \Bigg[ \sum_{k_2=0}^{1} \Bigg( \sum_{k_3=0}^{1} x(k_3, k_2, k_1, k_0)\, W^{(8k_3 + 4k_2 + 2k_1 + k_0) n_0} \Bigg) W^{(4k_2 + 2k_1 + k_0) 2 n_1} \Bigg] W^{(2k_1 + k_0) 4 n_2} \Bigg\} W^{8 k_0 n_3} \tag{1} \]

which includes the simplification $W^{mN} = e^{-2\pi j mN / N} = 1$. A 16-point FFT with radix-2 algorithm is illustrated in Fig. 1. The inputs x(0), ..., x(15) enter in natural order and the outputs leave in bit-reversed order: X(0), X(8), X(4), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15).
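The nested form of equation (1) can be checked numerically against the direct DFT definition. The following is a minimal Python sketch for illustration only; the names `w`, `dft16` and `nested_dft16` and the test signal are assumptions introduced here, not part of the original design.

```python
import cmath

N = 16

def w(e):
    """Twiddle factor W^e = exp(-2*pi*j*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft16(x):
    """Direct evaluation of the DFT definition."""
    return [sum(x[k] * w(n * k) for k in range(N)) for n in range(N)]

def nested_dft16(x, n):
    """Evaluate X(n) with the nested binary sums of equation (1)."""
    n3, n2, n1, n0 = (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1
    total = 0
    for k0 in (0, 1):
        s1 = 0
        for k1 in (0, 1):
            s2 = 0
            for k2 in (0, 1):
                s3 = 0
                for k3 in (0, 1):
                    k = 8 * k3 + 4 * k2 + 2 * k1 + k0
                    s3 += x[k] * w(k * n0)
                s2 += s3 * w((4 * k2 + 2 * k1 + k0) * 2 * n1)
            s1 += s2 * w((2 * k1 + k0) * 4 * n2)
        total += s1 * w(8 * k0 * n3)
    return total

# Arbitrary (hypothetical) complex test signal.
x = [complex(k, -2 * k + 1) for k in range(N)]
reference = dft16(x)
max_error = max(abs(nested_dft16(x, n) - reference[n]) for n in range(N))
print(max_error < 1e-9)
```

The agreement relies only on $W^{16m} = 1$, which is the simplification used to drop terms such as $16 k_3 n_1$ from the exponents.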

Figure 1 16-point FFT with radix-2 algorithm.

The multiplication with $W^4 = -j$ can be done by swapping and inversion and is therefore trivial. The number of complex multiplications is 10. The complex multiplications with $W^2$ can be implemented with two real multiplications, and the other non-trivial complex multiplications can be implemented with three real multiplications. The number of real multiplications is therefore 24 (six multiplications of the $W^2$ type, counting $W^6$, and four others: 6 × 2 + 4 × 3 = 24).

2.2 RADIX-4

The 16-point FFT with radix-4 algorithm can be derived in a similar manner. This is illustrated in Fig. 2. The number of complex multiplications is 8. With the simplification for multiplications with $W^2$, the total number of real multiplications is reduced to 20. In Fig. 2 the inputs x(0), ..., x(15) enter in natural order, the twiddle factors between the two radix-4 stages are $W^1, W^2, W^3$; $W^2, W^4, W^6$; and $W^3, W^6, W^9$, and the outputs leave in the order X(0), X(4), X(8), X(12), X(1), X(5), X(9), X(13), X(2), X(6), X(10), X(14), X(3), X(7), X(11), X(15).

Figure 2 16-point FFT with radix-4 algorithm.

2.3 SPLIT-RADIX FFT

The split-radix FFT algorithm combines the radix-2 and radix-4 algorithms to reduce the number of operations [11]. The odd terms of the DFT are computed with the radix-4 algorithm, while the even terms are computed with the radix-2 algorithm. A 16-point FFT with split-radix 2/4 is shown in Fig. 3. The inputs x(0), ..., x(15) enter in natural order and the outputs leave in bit-reversed order: X(0), X(8), X(4), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15).

Figure 3 Split-radix 16-point FFT.

The number of complex multiplications is 8. With simplification, the number of real multiplications is 20, which is the same as for the radix-4 algorithm.

To reduce the power consumption, the number of multiplications must be reduced. The radix-2 algorithm is less attractive because it requires more multiplications. Since the FFT processor is implemented with fixed-point arithmetic, the multiplication with $W^2$ can be implemented as a constant multiplication with sufficient accuracy. In our target application, this constant multiplication is realized with five shifted additions.
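The multiplication counts quoted in sections 2.1 and 2.2 can be reproduced by listing the twiddle exponents of the 16-point flow graphs. The sketch below is illustrative only: the function names are hypothetical, and it assumes that every odd multiple of $W^2$ (that is, $W^2$ and $W^6$) costs two real multiplications while the other non-trivial twiddles cost three, which is the reading consistent with the totals of 24 and 20 given above.

```python
def radix2_twiddle_exponents(n=16):
    """Exponents e of the W^e multiplications in a radix-2 DIF flow graph."""
    exponents = []
    size, step = n, 1
    while size > 2:                     # the last butterfly stage has no twiddles
        for _ in range(n // size):      # one group of twiddles per sub-transform
            exponents.extend(j * step for j in range(size // 2))
        size //= 2
        step *= 2
    return exponents

def radix4_twiddle_exponents():
    """Exponents between the two radix-4 stages of a 16-point DIF FFT."""
    return [i * r for r in range(4) for i in range(4)]

def count_multiplications(exponents, n=16):
    """Return (non-trivial complex multiplications, real multiplications)."""
    # Multiples of n/4 are +1, -j, -1 or +j and need no multiplier.
    nontrivial = [e for e in exponents if e % (n // 4) != 0]
    # Remaining multiples of n/8 are W^2-type twiddles: two real multiplications.
    real = sum(2 if e % (n // 8) == 0 else 3 for e in nontrivial)
    return len(nontrivial), real

print(count_multiplications(radix2_twiddle_exponents()))  # (10, 24)
print(count_multiplications(radix4_twiddle_exponents()))  # (8, 20)
```

The split-radix graph is not enumerated here; the text gives its counts as 8 complex and 20 real multiplications.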

3. PIPELINE FFT PROCESSOR

High performance, such as high throughput and continuous input/output, is required for communication systems. The pipeline architecture is suitable for those ends. In our target application, the input data arrive in natural order. Data memory is required for a pipeline processor since the incoming data have to be rearranged according to the FFT algorithm. The data memory consumes a large portion of the power for an FFT processor with a large transform length. Hence the data memory is also a critical factor for the power consumption. In this section, we discuss the mapping of the 16-point module onto the pipeline processor.

3.1 MULTIPLICATIONS

A large portion of the total power is consumed by the computation of complex multiplications in the FFT processor. A complex multiplier consumes 72.6 mW with a supply voltage of 3.3 V at 25 MHz. A 1024-point FFT processor requires four complex multipliers and hence consumes [email protected], 25MHz. Even with a bypass technique for trivial complex multiplications, the power consumption for the computation of complex multiplications is still larger than 210 mW. Hence reducing the number of complex multiplications is vital.

Using a high radix module can reduce the number of complex multiplications outside the module. However, high radix modules are not commonly used in implementations due to two main drawbacks: they increase the number of complex multiplications within the module if the radix is larger than 4, and they increase the routing complexity as well. Overcoming those drawbacks is the key to using a high radix module, and it is the key issue in our discussion.

It is well known that adders consume much less power than multipliers with the same wordlength. This is because an adder has less hardware and far fewer glitches. A 32-bit Brent-Kung adder (real) consumes [email protected], 25MHz, which is much less than a 17 × 13 bit complex multiplier ([email protected], 25MHz).

Therefore it is efficient to replace the complex multiplier with a constant multiplier (carry-save adders). We apply this idea to the design of the 16-point module in order to reduce the number of complex multiplications. For a 16-point FFT module, there are three types of non-trivial complex multiplications, i.e., multiplications with $W_{16}^{1}$, $W_{16}^{2}$, and $W_{16}^{3}$. The multiplications with $W_{16}^{1}$ and $W_{16}^{3}$ can share coefficients since

\[ \cos\frac{\pi}{8} = \sin\left(\frac{\pi}{2} - \frac{\pi}{8}\right) = \sin\frac{3\pi}{8} \quad \text{and} \quad \sin\frac{\pi}{8} = \cos\left(\frac{\pi}{2} - \frac{\pi}{8}\right) = \cos\frac{3\pi}{8}. \]

We can therefore use constant multiplication, which reduces the multiplication complexity. The implementation of the multiplication with $W_{16}^{1}$ is illustrated in Fig. 4, a block diagram with inputs Re{input} and Im{input}, outputs Re{output} and Im{output}, and constant multiplications by $\cos\frac{\pi}{8} + \sin\frac{\pi}{8}$, $\cos\frac{\pi}{8} - \sin\frac{\pi}{8}$ and $\cos\frac{\pi}{8}$.

Figure 4 Complex multiplication with $W_{16}^{1}$.

For the three different algorithms, the different positions of the multiplications lead to different hardware implementations. Both the radix-2 and split-radix algorithms require three multipliers (two multipliers with $W_{16}^{2}$ and one multiplier with $W_{16}^{1}$), while the radix-4 algorithm requires only two multipliers (one multiplier with $W_{16}^{2}$ and one multiplier with $W_{16}^{1}$). Hence the 16-point FFT module with radix-4 is more efficient and is selected for our implementation. The power consumption for complex multiplication within the 16-point module is about [email protected], 25MHz.
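The coefficient sharing described above can be illustrated in floating point. The following Python sketch is one possible reading of Fig. 4, not the authors' circuit: it assumes the three constants $\cos\frac{\pi}{8}$, $\cos\frac{\pi}{8} + \sin\frac{\pi}{8}$ and $\cos\frac{\pi}{8} - \sin\frac{\pi}{8}$ are combined as shown, and the function names are hypothetical. The fixed-point carry-save adder realization is not modelled.

```python
import cmath
import math

C = math.cos(math.pi / 8)
S = math.sin(math.pi / 8)
C_PLUS_S = C + S                    # applied to Re{input} (assumed reading of Fig. 4)
C_MINUS_S = C - S                   # applied to Im{input} (assumed reading of Fig. 4)
INV_SQRT2 = math.cos(math.pi / 4)   # magnitude of both parts of W_16^2

def mul_w16_1(z):
    """z * W_16^1 using three real constant multiplications."""
    a, b = z.real, z.imag
    shared = C * (a + b)
    return complex(shared - C_MINUS_S * b, shared - C_PLUS_S * a)

def mul_w16_3(z):
    """z * W_16^3 reusing the same three constants (cos and sin swap roles)."""
    a, b = z.real, z.imag
    shared = C * (a + b)
    return complex(shared - C_MINUS_S * a, C_PLUS_S * b - shared)

def mul_w16_2(z):
    """z * W_16^2 using two real constant multiplications."""
    a, b = z.real, z.imag
    return complex(INV_SQRT2 * (a + b), INV_SQRT2 * (b - a))

def w16(e):
    """Reference twiddle factor W_16^e = exp(-2*pi*j*e/16)."""
    return cmath.exp(-2j * cmath.pi * e / 16)

z = complex(0.3, -0.8)              # arbitrary test value
errors = [
    abs(mul_w16_1(z) - z * w16(1)),
    abs(mul_w16_2(z) - z * w16(2)),
    abs(mul_w16_3(z) - z * w16(3)),
]
print(max(errors) < 1e-12)
```

Expanding `shared - C_MINUS_S * b` gives $a\cos\frac{\pi}{8} + b\sin\frac{\pi}{8}$, the real part of $(a + jb)\,W_{16}^{1}$, so each product needs three real multiplications instead of four.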

By replacing the complex multiplications with constant multiplications within the 16-point module, the number of non-trivial complex multiplications can be reduced to 1776 with a 16 × 16 × 4 configuration. The total number of complex multipliers is reduced to two for the 1024-point FFT due to the use of the 16-point module. Table 1 shows the number of non-trivial complex multiplications required for a 1024-point FFT with different algorithms.

Algorithm             Radix-2   Radix-4   Split-radix   Our approach
No. of comp. mult.    3586      2732      2390          1776

Table 1: Number of non-trivial complex multiplications for 1024-point FFT.

3.2 DATA MEMORY

The data memory consumes a significant portion of the total power. It is therefore desirable to reduce the size of the data memory. For the pipeline FFT processor, the data memory for the first few stages dominates both the size and the power consumption of the total memory. The architecture selection for those stages is therefore important.

There are two main methods to reorder data for the FFT algorithm in a pipeline FFT processor: delay-forward and feedback. The key difference between the two methods is that the delay-forward method stores only the incoming data in the data memory, while the feedback method stores both the incoming data and partial results in the data memory at each stage. This is shown in Fig. 5 for a 4-point FFT, where each structure is built from delay elements (D) and two butterfly elements.

Figure 5 Delay-forward (a) and feedback (b).

As described in [2], the efficient way to reduce the data memory size is to use the feedback method. We select single-path feedback for the data memory since it gives the minimum data memory, with N − 1 words for an N-point FFT [2].

3.3 REALIZATION OF 16-POINT MODULE

Direct use of the feedback method for the three algorithms listed in section 2 faces two main problems: large memory bandwidth and a complex interconnection scheme. Direct implementation of the 16-point module is also complicated.

Figure 6 16-point FFT module.

The radix-4 algorithm can be decomposed into the radix-2 algorithm, as is done in [7]. Hence the mapping of the 16-point module can be done with four pipelined radix-2 butterfly elements. Each butterfly element has its own feedback memory. The 16-point module is illustrated in Fig. 6, which shows four butterfly elements, each with a feedback memory (Mem.), together with the constant multipliers. With this mapping, the two main drawbacks of the high radix module have been removed.
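The decomposition of a radix-4 step into radix-2 butterflies can be sketched as a functional model. The Python code below is an illustration under stated assumptions, not the processor itself: it uses one common way of writing a radix-4 DIF step as two radix-2 butterfly stages with a trivial multiplication by −j in between, and whether this matches the exact structure of [7] or Fig. 6 is an assumption. It does not model pipeline timing, feedback memories, or fixed-point arithmetic, and all names are hypothetical.

```python
import cmath

def radix4_step_as_radix2(x):
    """One radix-4 DIF step computed with two radix-2 butterfly stages.

    Returns four branches; branch r feeds the outputs X(4m + r).
    """
    n = len(x)
    h, q = n // 2, n // 4
    # First radix-2 butterfly stage (distance n/2).
    a = [x[i] + x[i + h] for i in range(h)]
    b = [x[i] - x[i + h] for i in range(h)]
    # Trivial multiplication by -j on the upper half of the difference branch.
    b = b[:q] + [-1j * v for v in b[q:]]
    # Second radix-2 butterfly stage (distance n/4).
    return [
        [a[i] + a[i + q] for i in range(q)],   # r = 0
        [b[i] + b[i + q] for i in range(q)],   # r = 1
        [a[i] - a[i + q] for i in range(q)],   # r = 2
        [b[i] - b[i + q] for i in range(q)],   # r = 3
    ]

def fft16(x):
    """16-point DIF FFT built from four radix-2 butterfly stages."""
    w = [cmath.exp(-2j * cmath.pi * e / 16) for e in range(16)]
    result = [0j] * 16
    for r, branch in enumerate(radix4_step_as_radix2(list(x))):
        # Twiddle factors W_16^(i*r): the only non-trivial multiplications.
        t = [branch[i] * w[(i * r) % 16] for i in range(4)]
        for s, y in enumerate(radix4_step_as_radix2(t)):
            result[4 * s + r] = y[0]
    return result

def dft(x):
    """Direct DFT used as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

x = [complex(k % 5, 3 - k) for k in range(16)]   # arbitrary test signal
expected = dft(x)
error = max(abs(u - v) for u, v in zip(fft16(x), expected))
print(error < 1e-9)
```

In this model the twiddle exponents i·r are those of Fig. 2, so the only non-trivial multiplications are of the $W_{16}^{1}$, $W_{16}^{2}$ and $W_{16}^{3}$ types handled by constant multipliers.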

4. RESULTS

For the complex multipliers, the conventional radix-4 algorithm requires 4 complex multipliers. Each complex multiplier consumes [email protected], 25MHz at full rate (simulation result). With the bypass technique, the total power consumption for the complex multipliers is about 210 mW. In our approach, there are only two complex multipliers and two constant multipliers (one consumes [email protected], 25MHz), which consume a total power of less than 160 mW. This is a power saving of more than 20% for the computation of complex multiplications. It is less than the theoretical saving of 35% (the ratio of the numbers of complex multiplications, 1776 versus 2732 in Table 1) due to the computation of complex multiplications within the 16-point module.

The power consumption of the data memory and of the butterfly elements is the same for both approaches. The power consumption for the data memory is estimated at 300 mW (the power consumption for memories of 128 words or more is given by the vendor, and that of smaller memories is estimated through linear approximation down to 32 words). The butterfly elements consume about 30 mW.

5. CONCLUSIONS

In this paper, we introduce an FFT processor based on a 16-point module. The new approach reduces the number of complex multiplications and retains the minimum size of the data memory. The simulation results show that it can reduce the power consumption.

REFERENCES

[1] W. Li and L. Wanhammar, “A Pipeline FFT Processor,” IEEE Workshop on Signal Processing Systems (SiPS), Taipei, China, Oct. 1999.
[2] J. Melander, Design of SIC FFT Architectures, Linköping Studies in Science and Technology, Thesis No. 618, Linköping University, Sweden, 1997.
[3] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[4] Z. Mou and F. Jutand, “‘Overturned-Stairs’ Adder Trees and Multiplier Design,” IEEE Trans. on Computer, vol. C-41, no. 8, pp. 940-948, Aug. 1992.
[5] W. Li and L. Wanhammar, “A Complex Multiplier Using ‘Overturned-Stairs’ Adder Tree,” Int. Conf. on Electronic Circuits and Systems (ICECS), Sept. 1999.
[6] G. Bi and E. V. Jones, “A Pipelined FFT Processor for Word-Sequential Data,” IEEE Trans. on Acoustic, Speech, and Signal Process., vol. ASSP-37, no. 12, pp. 1982-1985, Dec. 1989.
[7] S. He and M. Torkelson, “A New Approach to Pipeline FFT Processor,” The 10th International Parallel Processing Symposium (IPPS), pp. 766-770, 1996.
[8] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, 1975.
[9] A. M. Despain, “Fourier Transform Computer Using CORDIC Iterations,” IEEE Trans. on Computers, vol. C-23, no. 10, pp. 993-1001, 1974.
[10] M. T. Heideman and S. Burrus, “On the Number of Multiplications Necessary to Compute a Length-$2^n$ DFT,” IEEE Trans. on Acoustic, Speech, and Signal Process., vol. ASSP-34, no. 1, pp. 91-95, 1986.
[11] P. Duhamel and H. Hollmann, “‘Split Radix’ FFT Algorithm,” Electronics Letters, vol. 20, no. 1, pp. 14-16, Jan. 1984.
[12] P. Duhamel and H. Hollmann, “Existence of a $2^n$ FFT algorithm with a number of multiplications lower than $2^{n+1}$,” Electronics Letters, vol. 20, no. 17, pp. 690-692, Aug. 1984.