IEICE TRANS. FUNDAMENTALS, VOL.E95–A, NO.2 FEBRUARY 2012
PAPER
Design of Area- and Power-Efficient Pipeline FFT Processors for 8x8 MIMO-OFDM Systems
Shingo YOSHIZAWA†a), Member and Yoshikazu MIYANAGA†, Fellow
SUMMARY We present area- and power-efficient pipeline 128- and 128/64-point fast Fourier transform (FFT) processors for 8x8 multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) systems based on the specification framework of IEEE 802.11ac WLANs. Our new FFT processors use a mixed-radix multipath delay commutator (MRMDC) architecture from the point of view of low complexity and high memory use. A conventional MRMDC architecture requires large circuits in its delay commutators, which change the order of data sequences for the butterfly units. The proposed architecture replaces the delay elements with new commutators that cooperate with other MIMO-OFDM processing blocks. These commutators are inserted at the front and rear of the input and output memory units. Our FFT processors exhibit a 50–51% reduction in logic gates and a 70–72% reduction in power dissipation compared with conventional ones.
key words: pipeline FFT processor, MIMO-OFDM, VLSI architecture, IEEE 802.11ac
1. Introduction
Multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) is powerful in enhancing communication capacity or reliability and is widely adopted in current wireless communication systems. The IEEE 802.11n standard supports a data rate of 600 Mbps using four spatial data streams. The IEEE 802.11ac task group suggests the use of eight spatial streams for single-user MIMO (SU-MIMO) in the framework of the physical layer (PHY) [1]. As the number of MIMO spatial streams increases, the computational and hardware complexities of MIMO-OFDM systems also increase greatly. It is challenging to design a MIMO-OFDM transceiver with minimal hardware cost and power dissipation in very-large-scale integration (VLSI) implementation. A fast Fourier transform/inverse fast Fourier transform (FFT/IFFT) processor is one of the processing blocks with large hardware complexity, along with a MIMO detector and an FEC decoder, which are the principal processing blocks in a MIMO-OFDM transceiver [2].

Various FFT architectures, such as memory, cache-memory, pipeline, and array architectures, have been presented [11]. We discuss pipeline architectures from the point of view of energy savings in mobile wireless terminals, where the pipeline FFT architecture is efficient in low-speed clock and low-voltage operations. Pipeline FFT architectures have been presented [3]–[16], and radix-2, radix-2^2, and radix-2^3 single-path delay feedback (R2SDF, R2^2SDF, and R2^3SDF) architectures have also been reported [3]. R2^αSDF (α ∈ N) is widely adopted in FFT/IFFT processors for single-input and single-output (SISO) OFDM systems. The radix-2 multipath delay commutator (R2MDC) is the most classical approach [4], and R4MDC, a radix-4 version of R2MDC, has been presented [5]. RβMDC (β ∈ N) exhibits low utilization of all components because it induces idle cycles during serial-to-parallel data conversion. For MIMO-OFDM systems, a multi-channel FFT processor is required to have multiple FFT inputs based on the number of transmitter/receiver antennas. For a multi-channel FFT architecture, multi-channel RβMDC is the most area-efficient approach [6] and can overcome the issue of low utilization. It can also reduce computational complexity by taking a larger radix (e.g., radix-8 for eight FFT inputs). The complexity of multi-channel RβMDC is less than that of multi-channel R2^αSDF. Recent studies have presented a multi-channel mixed-radix MDC (MRMDC) pipeline FFT processor, which is more area-efficient than the multi-channel RβMDC architecture owing to a reduced number of non-trivial multiplications [7], [8].

We describe the design of 128- and 128/64-point pipeline FFT processors using multi-channel MRMDC for 8x8 MIMO-OFDM systems, where the combination of the R8MDC (β=8) and R2SDF (α=1) is applied to an 8-channel 128-point FFT. Although the number of complex multipliers decreases with the RβMDC and MRMDC, the delay elements, which change the order of data sequences before the butterfly units, become dominant in the circuit area as the number of multipliers decreases. When the delay elements are implemented using shift registers, our implementation indicates that the delay elements account for about 50% of the circuit area.

Manuscript received April 27, 2011. Manuscript revised September 28, 2011.
† The authors are with the Graduate School of Information Science and Technology, Hokkaido University, Sapporo-shi, 060-0814 Japan.
a) E-mail: [email protected]
DOI: 10.1587/transfun.E95.A.550
In conventional FFT processors, the hardware size of the delay commutators increases as the numbers of MIMO spatial streams and FFT points increase. Our proposed architecture replaces the delay elements with new commutators named the "pre-commutator" and "post-commutator". By collaborating with the input and output buffer units, which are shared by the FFT/IFFT and other MIMO processing blocks, these commutators can execute the same processing as the delay elements (i.e., changing the order of data sequences). Since the pre-commutator and the post-commutator consist of small address converter and commutator units, the proposed architecture can decrease the circuit area and power dissipation of the delay elements. We also apply the proposed architecture to a 128/64-point FFT processor that supports both 20- and 40-MHz channel bandwidths in a WLAN.

The paper is organized as follows: Sect. 2 explains the FFT implementation architecture for a MIMO-OFDM transceiver. Section 3 describes mixed-radix FFT algorithms. Section 4 reviews existing FFT architectures for 128- and 64-point FFT processor implementations. The proposed FFT processors are presented in Sect. 5. The evaluation results in circuit area and power dissipation are reported in Sect. 6. Section 7 summarizes our work.

Copyright © 2012 The Institute of Electronics, Information and Communication Engineers
2. FFT Implementation Architecture

A basic FFT implementation architecture for a MIMO-OFDM transceiver is illustrated in Fig. 1. The FFT/IFFT processor and the other blocks must demodulate/modulate multiple spatial data streams simultaneously, so these blocks are implemented by duplicating operators and arithmetic units inside each block. For continuous flow in the MIMO-OFDM transceiver, input and output memory units are typically inserted at the front and rear of the FFT/IFFT processor [9]. These memory units play an important role in arranging data sequences in the time and frequency domains in terms of OFDM subcarrier assignment, guard interval insertion, and bit-reverse ordering. For M transmitter or receiver antennas, M-bank dual-port RAMs are used in the input and output memory units. The input and output RAMs can be shared with the other processing blocks by adjusting memory access schedules. An example of data arrangement between the QAM mapping and IFFT processing is illustrated in Fig. 2. The null and pilot subcarriers are inserted into the sequence of data subcarriers before the IFFT operation. Note that the length of the data sequence is changed by inserting the null and pilot subcarriers. To maintain the same clock sampling rate, the input RAM is used for data buffering. We discuss our pipeline FFT processors for MIMO-OFDM systems considering the above implementation.

Fig. 1 FFT implementation architecture for MIMO-OFDM systems.

Fig. 2 Data arrangement between QAM mapping and IFFT processing.

3. FFT Algorithms

The discrete Fourier transform (DFT) of size N is defined by

$$X(k) = \sum_{n=0}^{N-1} x_0(n)\, W_N^{nk}, \qquad 0 \le k \le N-1, \tag{1}$$

where n and k are the time and frequency indexes, respectively, and $W_N^{nk} = \exp(-j2\pi nk/N)$ is a twiddle factor. The radix-r FFT algorithm is derived from the DFT by decomposing the N-point DFT into a set of recursively computed r-point DFTs, which reduces the computational complexity from $O(N^2)$ to $O(N \log_r N)$. If N is not a power of r, a mixed-radix algorithm is used to maintain a high radix r. For instance, a 128-point FFT can be computed from the combination of radix-8 and radix-2 algorithms [11] and is described by the following equations. The indexes n and k can be expressed as

$$
\begin{aligned}
k &= 16k_2 + 2k_1 + k_0, \qquad n = 64n_2 + 8n_1 + n_0, \\
&\text{where}\quad k_2 = 0, 1, \ldots, 7;\quad k_1 = 0, 1, \ldots, 7;\quad k_0 = 0, 1; \\
&\phantom{\text{where}}\quad n_2 = 0, 1;\quad n_1 = 0, 1, \ldots, 7;\quad n_0 = 0, 1, \ldots, 7.
\end{aligned}
\tag{2}
$$

Using (2), (1) can be rewritten as

$$
\begin{aligned}
X(16k_2 + 2k_1 + k_0) &= \sum_{n=0}^{127} x_0(n)\, W_{128}^{(16k_2+2k_1+k_0)(64n_2+8n_1+n_0)} \\
&= \sum_{n_0=0}^{7} \left[ \sum_{n_1=0}^{7} \left[ \sum_{n_2=0}^{1} x_0(n)\, W_{2}^{n_2 k_0} \right] W_{16}^{n_1(2k_1+k_0)} \right] W_{128}^{n_0(16k_2+2k_1+k_0)}.
\end{aligned}
\tag{3}
$$

Equation (3) starts from two-point FFTs by radix-2 in the first FFT stage and then applies eight-point FFTs by radix-8 in the last two FFT stages. This computation is called the radix-(2+8) algorithm. Alternatively, the radix-(8+2) algorithm is obtained by performing the radix-8 FFTs first:

$$
\begin{aligned}
k &= 64k_2 + 8k_1 + k_0, \qquad n = 16n_2 + 2n_1 + n_0, \\
&\text{where}\quad k_2 = 0, 1;\quad k_1 = 0, 1, \ldots, 7;\quad k_0 = 0, 1, \ldots, 7; \\
&\phantom{\text{where}}\quad n_2 = 0, 1, \ldots, 7;\quad n_1 = 0, 1, \ldots, 7;\quad n_0 = 0, 1.
\end{aligned}
\tag{4}
$$

Using (4), (1) can be rewritten as

$$
\begin{aligned}
X(64k_2 + 8k_1 + k_0) &= \sum_{n=0}^{127} x_0(n)\, W_{128}^{(64k_2+8k_1+k_0)(16n_2+2n_1+n_0)} \\
&= \sum_{n_0=0}^{1} \left[ \sum_{n_1=0}^{7} \left[ \sum_{n_2=0}^{7} x_0(n)\, W_{8}^{n_2 k_0} \right] W_{64}^{n_1(8k_1+k_0)} \right] W_{128}^{n_0(64k_2+8k_1+k_0)}.
\end{aligned}
\tag{5}
$$
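The radix-(2+8) index mapping of (2) and (3) can be checked numerically. The following is a minimal sketch in plain Python, not the hardware data flow of the paper: the function names `w`, `dft_direct`, and `dft_radix_2_plus_8` are hypothetical, and the twiddle exponents are written out explicitly as $W_2^{n_2 k_0}$, $W_{16}^{n_1(2k_1+k_0)}$, and $W_{128}^{n_0 k}$, which follow from reducing the exponent of $W_{128}$ modulo 128.

```python
import cmath


def w(base, exponent):
    """Twiddle factor W_base^exponent = exp(-j*2*pi*exponent/base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def dft_direct(x):
    """Direct N-point DFT, following Eq. (1)."""
    n_points = len(x)
    return [sum(x[n] * w(n_points, n * k) for n in range(n_points))
            for k in range(n_points)]


def dft_radix_2_plus_8(x):
    """128-point DFT with n = 64*n2 + 8*n1 + n0 and k = 16*k2 + 2*k1 + k0."""
    assert len(x) == 128
    # Stage 1: two-point DFTs over n2 (radix-2), producing the index k0.
    s1 = {}
    for n0 in range(8):
        for n1 in range(8):
            for k0 in range(2):
                s1[n0, n1, k0] = sum(
                    x[64 * n2 + 8 * n1 + n0] * w(2, n2 * k0) for n2 in range(2))
    # Stage 2: eight-point DFTs over n1, with the W_16 twiddle folded in.
    s2 = {}
    for n0 in range(8):
        for k1 in range(8):
            for k0 in range(2):
                s2[n0, k1, k0] = sum(
                    s1[n0, n1, k0] * w(16, n1 * (2 * k1 + k0)) for n1 in range(8))
    # Stage 3: eight-point DFTs over n0, with the W_128 twiddle folded in.
    out = [0j] * 128
    for k2 in range(8):
        for k1 in range(8):
            for k0 in range(2):
                k = 16 * k2 + 2 * k1 + k0
                out[k] = sum(s2[n0, k1, k0] * w(128, n0 * k) for n0 in range(8))
    return out
```

Comparing `dft_radix_2_plus_8(x)` with `dft_direct(x)` for any 128-sample input should give agreement up to floating-point rounding. The radix-(8+2) mapping of (4) and (5) could be checked in the same way by swapping the roles of the two-point and eight-point stages.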
We omit the explanation of the complexity reduction in (3) and (5) because the methods are the same as in a basic FFT algorithm [17].

4. Pipeline FFT Processors
The IEEE 802.11a/g/n WLAN standard provides 64- and 128-point IFFT/FFT operations in OFDM modulation/demodulation for 20- and 40-MHz channel bandwidths, which will also be adopted in IEEE 802.11ac. To review recent research, we introduce 128- and 64-point FFT processors in the R2^3SDF, R8MDC, and MRMDC architectures.

4.1 R2^3SDF

The R2^3SDF is an improved version of the R2SDF in terms of complexity reduction [3]. A block diagram of a 64-point R2^3SDF FFT processor is illustrated in Fig. 3. It consists of butterfly units, delay element units, and trivial/non-trivial multipliers. The R2^3SDF breaks down the radix-8 butterfly operation into a cascade of three radix-2 butterfly operations by applying trivial (constant) multipliers having twiddle factors such as $W_N^{N/8} = \frac{1}{\sqrt{2}}(1 - j)$ and $W_N^{3N/8} = -\frac{1}{\sqrt{2}}(1 + j)$. By using the R2^3SDF, the number of non-trivial (complex) multipliers is reduced from 6 to 1 for a 64-point FFT. The delay element unit delays data so that the butterfly unit processes a pair of data points separated by a certain distance. The values in the delay elements indicate the numbers of delay cycles. The structure of the delay element is the same as that of a first in, first out (FIFO) memory, which is typically implemented using shift registers. In application specific integrated circuit (ASIC) design, memory macro cells are superior with respect to circuit area and power consumption if the FIFO memory word size is large (e.g., more than 128).
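To illustrate why an eight-point DFT needs only trivial multipliers, the following minimal sketch computes it as a cascade of three radix-2 decimation-in-frequency butterfly stages. It is an assumption-laden illustration rather than the R2^3SDF data path of [3]: the names `TRIVIAL_W8` and `dif_radix2` are hypothetical, and the placement of the constant multipliers between stages may differ from the actual architecture.

```python
import cmath
import math

SQRT_HALF = 1 / math.sqrt(2)
# The only twiddle factors an eight-point DFT needs: W_8^0, W_8^1, W_8^2, W_8^3.
TRIVIAL_W8 = [1, SQRT_HALF * (1 - 1j), -1j, -SQRT_HALF * (1 + 1j)]


def dif_radix2(x):
    """Radix-2 decimation-in-frequency DFT for len(x) in {1, 2, 4, 8}.

    Returns the spectrum in natural order.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    step = 8 // n  # W_n^m equals W_8^(m * step)
    # Radix-2 butterflies: sums feed the even outputs, differences the odd ones.
    top = [x[m] + x[m + half] for m in range(half)]
    bottom = [(x[m] - x[m + half]) * TRIVIAL_W8[m * step] for m in range(half)]
    out = [0j] * n
    out[0::2] = dif_radix2(top)
    out[1::2] = dif_radix2(bottom)
    return out
```

Every multiplication in this cascade is by 1, $-j$, or $\pm\frac{1}{\sqrt{2}}(1 \mp j)$, so no general complex multiplier is required inside the eight-point butterfly.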
Fig. 3 Block diagram of 64-point R2^3SDF FFT processor.

4.2 R8MDC

We compare single-channel and multi-channel RβMDC architectures. The block diagram and timing chart of R4MDC FFT processors are illustrated in Fig. 4, which depicts a simple example of a four-point FFT. The single-channel R4MDC distributes FFT input data to four butterfly units by the commutator. The delay elements adjust the timing of the butterfly input data. As shown in the timing chart (Fig. 4(a)), the butterfly unit idles for three of every four cycles, so the utilization of the butterfly unit is 25%. The R2^αSDF is superior in SISO-OFDM systems because of its higher utilization (100%). For MIMO-OFDM systems, an FFT processor has multiple inputs, which solves the problem of low utilization of the RβMDC. The multi-channel R4MDC in Fig. 4(b) uses extra delay elements to delay and change the data sequences of "A", "B", "C", and "D". The letters "A", "B", "C", and "D" correspond to the order of MIMO antennas. The timing chart indicates that the multi-channel R4MDC removes the idle cycles that occur in the single-channel R4MDC and increases utilization to 100%.

Fig. 4 Block diagram and timing chart of single-channel and multi-channel R4MDC FFT processors.

The hardware complexities of the R2^3SDF and R8MDC architectures are listed in Table 1, quoted from Sansaloni et al. [6], where N indicates the number of FFT points and T is the number of complex adders required in the implementation of multiplications by constant values. Sansaloni et al. denote the memory size (the number of memory words) of the R8MDC as 9N/2 − 8 [6]. We replace this size with 8(N − 1) due to the existence of the extra delay elements. Moreover, Fu and Ampadu [8] mention the reduction in complex multipliers from $2(\log_8 N - 1)$ to $\log_8 N - 1$ by making use of constant multipliers in the R2^3SDF. Table 2 summarizes the hardware complexity for N = 64, T = 6, and 8 FFT inputs to target 8x8 MIMO-OFDM systems. Since Jia et al. implemented a constant multiplier using six complex adders [15], we use T = 6. This comparison shows that the multi-channel R8MDC is more area-efficient than the multi-channel R2^3SDF.

Table 1 Hardware complexity of R2^3SDF and R8MDC architectures.

Table 2 Hardware complexity comparison for N = 64, T = 6, and 8 FFT inputs.

A block diagram of a 64-point R8MDC FFT processor is illustrated in Fig. 5. The notations of trivial multipliers are omitted. The trivial multipliers and other complex adders are included in the radix-8 butterfly unit. This processor consists of two radix-8 butterfly operation stages.

Fig. 5 Block diagram of 64-point R8MDC FFT processor.

4.3 MRMDC

The MRMDC combines R2^αSDF and RβMDC (or other architectures) FFT processors that have different radix numbers [7], [8]. As mentioned in Sect. 3, a 128-point FFT can be computed with a combination of radix-2 and radix-8 algorithms. Block diagrams of two 128-point MRMDC FFT processors based on the radix-(2+8) and radix-(8+2) algorithms, called the radix-(2+8) MRMDC and the radix-(8+2) MRMDC, respectively, are shown in Fig. 6. The first butterfly stage in the radix-(2+8) MRMDC executes radix-2 butterfly computations using the R2SDF. The other butterfly stages are the same as those of a 64-point FFT processor using the R8MDC. The radix-(8+2) MRMDC uses the R8MDC to apply 64-point FFTs first. The delay elements and twiddle factor coefficients are modified for this version. The last butterfly stage is the same as a two-point FFT processor based on the R2SDF.

Fig. 6 Block diagram of 128-point MRMDC FFT processors.

There is not much difference in the hardware complexities of the radix-(2+8) and radix-(8+2) MRMDCs. With the radix-(2+8) MRMDC, a 128/64-point FFT processor is easy to design by bypassing data to skip the R2SDF units. The radix-(8+2) MRMDC requires some modifications because the delay elements and the twiddle factor coefficients differ between 128- and 64-point FFTs. These modifications are not as simple as those for the radix-(2+8) MRMDC; however, they are not difficult (we present the modifications in Sect. 5.3). The radix-(8+2) MRMDC requires more hardware resources for a variable length. However, we focus on the radix-(8+2) MRMDC to further reduce circuit area and power consumption. Figure 6(b) highlights the first stage delay commutator, which has delay elements and a commutator in the first butterfly stage. In the next section, we point out problems in the first stage delay commutator and propose new FFT processors.

5. Proposed FFT Processors
5.1 Problems in Delay Commutator

The first stage delay commutator in Fig. 6(b) has large FIFO memory sizes in the delay elements. Lee et al. [7] mentioned that the delay commutator takes a small percentage of the circuit area in the implementation of an R4MDC FFT processor. However, the delay commutator is no longer small when the number of complex multipliers decreases and when larger radix numbers and larger FFT sizes are used. Our implementation of a 128-point radix-(8+2) MRMDC FFT processor indicates that the first stage delay commutator occupies about 50% of the circuit area. The delay elements are implemented using shift registers or memory macro cells in the ASIC design. Since shift registers are driven by a clock, they consume a larger circuit area and more power than logic circuits. Use of memory macro cells is effective for area and power saving. However, the first stage delay commutator has a variety of delay values, which makes its physical design (inserting many memory macro cells) difficult. Moreover, the numbers of memory words in the delay elements are not powers of two, so their RAM implementations are not efficient even with memory macro cells†.

5.2 Proposed Architecture

Even though many studies have tackled the improvement of FFT processors for OFDM and MIMO-OFDM systems, their concerns were only with the FFT processor. As explained in Sect. 2, an FFT processor is connected to other processing blocks. This indicates that an FFT processor need not be designed in isolation. For example, Jang et al. have presented an efficient IFFT processor that integrates quadrature amplitude modulation (QAM) mapping and IFFT processing for an OFDM transmitter [10].

We propose a new architecture to reduce the delay elements implemented using shift registers. Figure 7 compares the conventional and proposed architectures. It is assumed that an FFT block is implemented in an eight-channel MIMO-OFDM receiver. The FFT block executes three tasks: delay and commutation, butterfly operation, and bit-reverse ordering. The bit-reverse ordering is required as post-processing in a pipeline FFT processor. The output RAM is also used for converting data from bit-reversed into natural order. For the single-channel R2^αSDF, bit-reverse ordering is achieved using a trivial circuit, i.e., memory write address conversion. However, the multi-channel RβMDC and MRMDC must have a non-trivial circuit equivalent to the first stage delay commutator, in which the inverse conversion of Fig. 4(b) is needed for the FFT outputs. The proposed architecture divides the task of delay and commutation into two parts. The first part is shifted to the time and frequency synchronization block. This shift is also applied to the second part of bit-reverse ordering. This does not merely shift hardware components from the FFT block to the adjoining blocks; cooperating with the adjoining blocks makes it possible to reduce the circuit of the first delay commutator.

A block diagram of the proposed 128-point FFT processor is illustrated in Fig. 8. The first delay commutator is removed from Fig. 6(b), and the "pre-commutator" and "post-commutator" are inserted at the front and rear of the input and output RAMs. The role of the input and output RAMs is data arrangement, such as OFDM subcarrier assignment and guard interval insertion, between the adjoining OFDM blocks, as explained in Sect. 2. However, they can also be used for delaying data.
Fig. 7 Comparison of conventional and proposed architectures.

Fig. 8 Block diagram of proposed 128-point FFT processors.

† This paper focuses only on ASIC design in terms of low power dissipation. In field-programmable gate array (FPGA) design, the delay elements can be mapped into small block RAMs, so the problems in the physical design and in memory units whose word counts are not powers of two would not occur.

We investigated the relation between write and read addresses under the condition that the addresses are selected in a natural order. When a write address is shifted by one, the read-out data is delayed by one cycle in memory read. This indicates that the tasks of the delay elements can be achieved using the input and output RAMs. Figure 9 shows a block diagram of the pre- and post-commutators in the input RAM. The input RAM consists of eight memory banks whose write and read addresses are controlled by individual counters. Data is delayed by write address conversion in the address converter. The pre-commutator changes the input ports of the memory banks to avoid data collision caused by the address conversion. The post-commutator restores the order of FFT inputs changed by the pre-commutator.

Fig. 9 Block diagram of pre- and post-commutators in FFT input-side.

The timing charts of the input and output data in the pre- and post-commutators are shown in Fig. 10. The data patterns for the eight antenna inputs, "A", "B", ..., "H", and time indexes, 0 to 127, are changed every 16 cycles, which are grouped into "Block 1", "Block 2", and so on (Fig. 10(a)). The changes in the data "B" sequence (enclosed by rectangles) are explained as follows:
• The data sequence is in a natural order at the pre-commutator input (Fig. 10(a)).
• The data after "Block 2" is shifted to the neighboring antenna slot by the pre-commutator (Fig. 10(b)).
• The time orders of data are individually adjusted for all antenna slots by the write address conversion (Fig. 10(c)).
• The orders of antenna slots are recovered by the post-commutator (Fig. 10(d)).

Fig. 10 Timing charts of input and output data in pre- and post-commutators.

The data sequences of the post-commutator output are the same as those in the conventional FFT processor. Hence, the first delay commutator is removed in the proposed FFT processor. Next, the data path and write address conversion in the pre-commutator, address converter, and post-commutator in Fig. 9 are summarized in Table 3. The data path conversions of the pre- and post-commutators are expressed in the form "O0 ← I0", and the address conversion is expressed as "W0 ← {000, c[3:0]}". The upper 3 bits, ranging from "000" to "111", delay data by multiples of 16. Note that the delay elements in the first stage delay commutator in Fig. 6(b) have the same values. The address change from "000" to "111" indicates that data is delayed by 112 cycles in memory read-out. The lower 4 bits keep the same time order as before conversion and are extracted from "counter A" in the address converter. The behavior of the pre- and post-commutators in the output RAM is almost the same as the procedure mentioned above. We implement the bit-reverse ordering in the post-commutator in the output RAM by modifying the order of read addresses.
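The principle that a shifted write address acts as a delay can be sketched with a simplified single-bank model. This is only an illustration under stated assumptions: the bank depth of 128 words, the write-before-read behavior within a cycle, and the names `write_address` and `ram_delay` are hypothetical, and the actual per-bank, per-block conversions are those of Table 3, which are not reproduced here.

```python
DEPTH = 128  # assumed words per memory bank (one 128-point OFDM symbol)


def write_address(counter, upper_offset):
    """Form a 7-bit write address {upper 3 bits, c[3:0]}.

    The upper 3 bits of the counter are advanced by upper_offset (0..7);
    the lower 4 bits keep the original time order.
    """
    upper = ((counter >> 4) + upper_offset) & 0b111
    lower = counter & 0b1111
    return (upper << 4) | lower


def ram_delay(samples, upper_offset):
    """Pass samples through one bank: converted write address, natural read order."""
    memory = [None] * DEPTH
    output = []
    for cycle, sample in enumerate(samples):
        counter = cycle % DEPTH
        memory[write_address(counter, upper_offset)] = sample  # write port
        output.append(memory[counter])  # read port, natural-order address
    return output
```

In this model, `ram_delay(samples, d)` returns the input delayed by 16·d cycles, so an offset of 7 ("111") gives the 112-cycle delay mentioned above, with `None` marking the cycles before the first delayed sample emerges.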
5.3 Variable-Length FFT

The radix-(8+2) MRMDC can execute a 64-point FFT by using most of the components common to the 128- and 64-point FFTs. Figure 11 shows a block diagram of the proposed 128/64-point FFT processor and the modifications for a 64-point FFT. The pre- and post-commutators change data paths every 8 cycles; that is, the block length in Fig. 10(a) changes from 16 to 8 cycles. Doubling of the twiddle coefficients is achieved by modifying the read address indexes to the twiddle coefficient ROM. The operation of the third butterfly stage in the R2SDF is skipped by bypassing data. The above modifications are implemented by inserting extra switching units and data paths.
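The twiddle coefficient doubling relies on the identity $W_{64}^{m} = W_{128}^{2m}$, so a 64-point mode can reuse a 128-point coefficient table by doubling the read address. The sketch below assumes a hypothetical 128-entry ROM holding $W_{128}^{m}$ for every m; the names `TWIDDLE_ROM` and `read_twiddle` are illustrative, and the ROM in the actual processor may store only the coefficients each stage needs.

```python
import cmath

ROM_SIZE = 128
# Hypothetical coefficient ROM holding W_128^m for m = 0..127.
TWIDDLE_ROM = [cmath.exp(-2j * cmath.pi * m / ROM_SIZE) for m in range(ROM_SIZE)]


def read_twiddle(index, fft_points):
    """Return W_{fft_points}^{index} from the shared 128-entry ROM."""
    if fft_points == 128:
        address = index % ROM_SIZE
    elif fft_points == 64:
        # Doubled read address, since W_64^m = W_128^(2m).
        address = (2 * index) % ROM_SIZE
    else:
        raise ValueError("only the 128- and 64-point modes are modeled")
    return TWIDDLE_ROM[address]
```

In hardware terms, doubling the address is a one-bit left shift of the read index, which is why the mode switch needs only a small amount of extra selection logic.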
Table 3 Data path and write address conversion in pre-commutator, address converter, and post-commutator.

Fig. 11 Block diagram of proposed 128/64-point FFT processor and modifications for 64-point FFT.

6. Evaluation

The proposed 128- and 128/64-point FFT processors for 8x8 MIMO-OFDM systems were implemented using a 90-nm CMOS standard library. We used Verilog source code for the register transfer level (RTL) design and performed logic synthesis to evaluate circuit area and power dissipation at the gate level. A 12-bit word width was chosen for the internal data path and twiddle factors to balance quantization noise and hardware cost, as Fu and Ampadu indicated in [8]. The clock period was set to 8 ns as the timing constraint in the logic synthesis tool. The synthesis results of the proposed 128-point FFT processor are summarized in Table 4. The hardware costs of the pre-commutator (including the address converter) and post-commutator are much smaller than that of the R8MDC block.

Table 4 Logic synthesis results from proposed 128-point FFT processor.

Table 5 compares the circuit area (given by a gate count), maximum clock frequency, gate count per throughput, and power dissipation of the proposed and conventional FFT processors. The logic synthesis of the conventional processors was performed under the same conditions as for the proposed processors. The radix-(2+2^3) FFT processor is based on the R2^3SDF and deploys as many identical units as the number of MIMO antennas. The maximum clock frequency of each processor was evaluated by measuring the timing margin from the 8-ns clock period. All the FFT processors can operate in real time because their clock speeds are much higher than the channel bandwidths in a WLAN (i.e., 20 and 40 MHz). The proposed and conventional processors could operate at faster clock speeds under tighter timing constraints in the logic synthesis. However, their synthesized circuits would then require more logic gates, increasing circuit area and power dissipation. The power dissipation of each processor was measured with a 1.0-V supply voltage and a 120-MHz clock frequency. This clock condition is close to the maximum clock frequency of the proposed 128/64-point processor and is a multiple of the channel bandwidths.

Table 5 Comparison of proposed and conventional FFT processors.

In order to evaluate area efficiency from the viewpoint of clock frequency, we use the gate count per throughput, which normalizes the gate count by the maximum clock frequency. The gate count per throughput indicates how small an FFT processor can be implemented for a certain throughput target, and is calculated by
$$
\text{Gate count per throughput (per Msample/s)} = \frac{\text{Number of logic gates}}{\text{Throughput (Msamples/s)}} = \frac{\text{Number of logic gates}}{\text{Max. clock frequency (MHz)} \times 8}. \tag{6}
$$

Since a pipeline FFT inputs/outputs one sample per cycle on each channel, the throughput is given by the product of the maximum clock frequency and the number of FFT channels (i.e., eight). The radix-(2+2^3) SDF FFT processor with eight units has the largest circuit area and power dissipation. The radix-(2+8) MRMDC and radix-(8+2) MRMDC have longer circuit delays than the radix-(2+2^3) SDF. Taking a larger radix increases the number of input/output ports in a butterfly unit and causes a longer butterfly unit delay. Even though the radix-(2+8) MRMDC and radix-(8+2) MRMDC have slower clock speeds, they perform better in gate count per throughput than the radix-(2+2^3) SDF. The radix-(8+2) MRMDC provides better results than the radix-(2+8) MRMDC in circuit area and power dissipation because it has fewer delay elements. The proposed FFT processor achieved a 51% reduction in logic gate count and a 72% reduction in power dissipation compared with the radix-(8+2) MRMDC processor. This reduction was achieved by the removal of the first stage delay commutator. The proposed 128/64-point FFT processor requires additional hardware components for switching between 128- and 64-point FFTs. However, it still maintains a 50% reduction in logic gate count and a 70% reduction in power dissipation. The proposed processors also show better performance in gate count per throughput than the conventional processors.

7. Conclusion

We presented new 128-point and 128/64-point FFT processors for 8x8 MIMO-OFDM systems. The proposed FFT processors use pre- and post-commutators instead of delay elements, which improves circuit area and power efficiency. Currently, we are developing a complete 8x8 MIMO-OFDM transceiver. The results of the 8x8 MIMO-OFDM implementation will be reported in future work. We will reveal the percentages of logic gate counts and power dissipation in the proposed FFT processors and the input and output RAMs.

Acknowledgment
This study is supported in part by the Japan Science and Technology Agency (JST) A-STEP, whose project title is "Design and Development of Ultra-Low Power High Speed Wireless Communications LSI".

References

[1] R. Stacey, E. Perahia, A. Stephens, et al., "Specification framework for TGac," doc.:IEEE 802.11-09/0992r13, July 2010.
[2] H. Bölcskei, "MIMO-OFDM wireless systems: basics, perspectives, and challenges," IEEE Wireless Commun., vol.13, no.4, pp.31–37, Aug. 2006.
[3] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," URSI International Symposium on Signals, Systems, and Electronics (ISSSE), pp.257–262, Oct. 1998.
[4] L.R. Rabiner and B. Gold, Theory and application of digital signal processing, Prentice-Hall, 1975.
[5] E.E. Swartzlander, W.K.W. Young, and S.J. Joseph, "A radix-4 delay commutator for fast Fourier transform processor implementation," IEEE J. Solid-State Circuits, vol.SC-19, no.5, pp.702–709, Oct. 1984.
[6] T. Sansaloni, A. Pérez-Pascual, V. Torres, and J. Valls, "Efficient pipeline FFT processors for WLAN MIMO-OFDM systems," Electron. Lett., vol.41, no.19, pp.1043–1044, Sept. 2005.
[7] S. Lee, Y. Jung, and J. Kim, "Low complexity pipeline FFT processor for MIMO-OFDM systems," IEICE Electron. Express, vol.4, no.23, pp.750–754, Dec. 2007.
[8] B. Fu and P. Ampadu, "An area efficient FFT/IFFT processor for MIMO-OFDM WLAN 802.11n," J. Signal Process. Syst., vol.56, no.1, pp.59–68, July 2009.
[9] J. Wu, K. Liu, B. Shen, and H. Min, "A hardware efficient VLSI architecture for FFT processor in OFDM systems," IEEE International Conference on ASIC (ASICON), pp.232–235, Oct. 2005.
[10] I.-G. Jang, Y.-E. Kim, Y.-N. Xu, and J.-G. Chung, "Efficient IFFT design using mapping method," IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp.878–881, Dec. 2008.
[11] Y.-W. Lin and C.-Y. Lee, "Design of an FFT/IFFT processor for MIMO OFDM systems," IEEE Trans. Circuits Syst. I, Regular Papers, vol.54, no.4, pp.807–815, April 2007.
[12] B. Zhou, Y. Peng, and D. Hwang, "Pipeline FFT architectures optimized for FPGAs," Int. J. Reconfigurable Computing, vol.2009, Article ID 219140, doi:10.1155/2009/219140, 2009.
[13] G. Bi and E.V. Jones, "A pipelined FFT processor for word-sequential data," IEEE Trans. Acoust. Speech Signal Process., vol.37, no.12, pp.1982–1985, Dec. 1989.
[14] L. Hang and H. Lee, "A high performance four-parallel 128/64-point radix-2^4 FFT/IFFT processor for MIMO-OFDM systems," IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp.834–837, Nov. 2008.
[15] L. Jia, Y. Gao, and H. Tenhunen, "Efficient VLSI implementation of radix-8 FFT algorithm," IEEE Pacific Rim Conference on Communications Computers and Signal Processing (PACRIM), pp.468–471, Aug. 1999.
[16] S. Yoshizawa, K. Nishi, and Y. Miyanaga, "Reconfigurable two-dimensional pipeline FFT processor in OFDM cognitive radio systems," IEEE International Symposium on Circuits and Systems (ISCAS), pp.2486–2489, May 2008.
[17] E. Oran Brigham, The fast Fourier transform, pp.148–171, Prentice-Hall, Englewood Cliffs, 1974.
Shingo Yoshizawa received the B.E., M.E., and Ph.D. degrees from Hokkaido University, Japan in 2001, 2003, and 2005, respectively. He is an Assistant Professor currently working at the Graduate School of Information Science and Technology, Hokkaido University. His research interests are speech processing, wireless communication, and VLSI architecture.
Yoshikazu Miyanaga received the B.S., M.S., and D.Eng. degrees from Hokkaido University, Japan in 1979, 1981, and 1986, respectively. From 1983 to 1987, he was a Research Associate at the Institute for Electronic Science, Hokkaido University. From 1987 to 1988, he was a Lecturer at the Faculty of Engineering of Hokkaido University. From 1988 to 1997, he was an Associate Professor there. He is currently a Professor at the Graduate School of Information Science and Technology, Hokkaido University. His current research interests are adaptive signal processing, non-linear signal processing, and parallel-pipelined VLSI systems.