Circuits Syst Signal Process DOI 10.1007/s00034-011-9332-7
Design and Comparison of FFT VLSI Architectures for SoC Telecom Applications with Different Flexibility, Speed and Complexity Trade-Offs Sergio Saponara · Massimo Rovini · Luca Fanucci · Athanasios Karachalios · George Lentaris · Dionysios Reisis Received: 5 July 2010 / Revised: 16 June 2011 © Springer Science+Business Media, LLC 2011
Abstract The design of Fast Fourier Transform (FFT) integrated architectures for System-on-Chip (SoC) telecom applications is addressed in this paper. After reviewing the FFT processing requirements of wireless and wired Orthogonal Frequency Division Multiplexing (OFDM) standards, including the emerging Multiple Input Multiple Output (MIMO) and OFDM Access (OFDMA) schemes, three FFT architectures are proposed: a fully parallel, a pipelined cascade and an in-place variable-size architecture, which offer different trade-offs among flexibility, processing speed and complexity. Silicon implementation results and comparisons with the state of the art show that each macrocell outperforms known works for its target application. The fully parallel macrocell is optimized for throughput requirements up to several GSamples/s, enabling Ultra-wideband (UWB) communications that use all the channels foreseen in the standard. The pipelined cascade macrocell minimizes complexity for large-size FFTs, sustaining throughput up to 100 MSamples/s. The in-place variable-size FFT macrocell stands out for its flexibility, allowing the run-time reconfigurability required in OFDMA schemes while attaining the throughput required to support MIMO communications. The three architectures are also compared on common case studies and a common target technology.
Keywords VLSI design · Fast Fourier Transform · System-on-Chip · OFDM telecom systems
S. Saponara · M. Rovini · L. Fanucci: Department of Information Engineering, University of Pisa, Via G. Caruso 16, 56122 Pisa, Italy
A. Karachalios · G. Lentaris · D. Reisis: Department of Physics, University of Athens, Panepistimiopolis, Zografou, 15784 Athens, Greece
1 Introduction
In evolving telecommunication applications, dedicated FFT/IFFT architectures are required for baseband processing. A plethora of such applications (see [1, 6, 11, 12, 17, 18, 20, 22, 38, 41]) suggests the design of configurable FFT architectures, capable of achieving high throughput while keeping gate complexity and power consumption relatively low. Aiming at accommodating these types of applications, this paper proposes the design of different VLSI (Very Large Scale Integration) FFT/IFFT architectures targeting different trade-offs among the above performance metrics. In particular, the design aspects allowing for optimized FPGA (Field Programmable Gate Array) implementations are considered. FPGAs provide an attractive implementation platform for telecom applications, because they can be reconfigured at compilation and/or run time and hence support different wireless standards. Moreover, today's FPGA designs extend their application range from prototyping platforms to user products, from fixed to mobile terminals: indeed, FPGA families are available at the cost of a few dollars for the large-volume market, while embedded FPGAs can be integrated as reconfigurable logic in Systems-on-Chip (SoCs). The specifications of advanced OFDM-based standards for telecom systems lead to a wide configuration space to be faced by the FFT engine. The throughput may vary from a few MSamples/s in xDSL (Digital Subscriber Line) modems for residential Internet connections (see [12, 41]) up to GSamples/s in UWB terminals for short-range communication of multimedia contents [18]. The I/O data-width may vary from 4 or 5 bits in UWB up to 16 bits in VDSL (Very high-speed DSL) or BPL (Broadband on Power Lines) applications. Similarly, the FFT size (i.e. the FFT length) varies from 64 complex points in Wireless Local Area Network (WLAN) (see [20, 38]) or ADSL, to 8192 in DVB (Digital Video Broadcasting) [22].
Moreover, the FFT engine should be conceived as a parametric IP (Intellectual Property) macrocell and, once integrated, it should still be configurable at run time to support standards with multi-mode adaptive behavior. Examples of such standards are the Worldwide Interoperability for Microwave Access, WiMAX [2], and the 3rd Generation Partnership Project Long Term Evolution, 3GPP LTE [19], with FFT lengths ranging from 128 to 2048. To achieve a unified view of the variety of possible design solutions meeting the above requirements, this paper proposes different architectural approaches with respect to the degree of parallelism, memory access strategy and machine arithmetic style. Furthermore, it shows their implementations and analyzes their advantages and disadvantages in terms of performance, complexity and flexibility, considering FPGA devices as the target implementation technology. Exploiting different FFT processing
schemes, each of the proposed architectures introduces design features allowing for efficient support of a specific group of the aforementioned standards. The paper is organized as follows. Section 2 reviews the OFDM communication standards and the requirements for the FFT processing core. Section 3 proposes a massively parallel FFT architecture suitable for high-throughput applications (up to GSamples/s) such as UWB. Section 4 presents a configurable cascade FFT core which ensures an optimal trade-off between complexity and performance for applications requiring large-size FFTs (1024 complex points), such as DVB, and large data-widths, but with throughput requirements below one hundred MSamples/s. Section 5 describes an in-place variable-length FFT core with parallel butterfly processors, optimizing run-time reconfigurability while still supporting high-throughput applications. Such an architecture is suitable for emerging WiMAX terminals needing run-time FFT length configuration and a computational throughput up to hundreds of MSamples/s to support Multi-Input Multi-Output (MIMO) communications. Implementation results of the above architectures on the same target technology, and comparisons between them, are presented in Sect. 6. Results are also compared with the state of the art of FFT VLSI designs for OFDM telecom applications. Conclusions are drawn in Sect. 7.
2 Overview on OFDM-Based Communication Standards
2.1 OFDM and MIMO-OFDM Architectures
The multi-carrier OFDM scheme has fostered the rise of several wireless and wired communication standards including: Digital Broadcasting of Audio and Video contents, in Terrestrial and Handheld scenarios (DAB, DVB-T/H) (see [6, 22]); 802.16d/e Wireless Metropolitan Area Network (WMAN), known respectively as fixed and mobile WiMAX [17], for wireless fast Internet access in metropolitan scenarios; xDSL [7, 12, 31, 41] and BPL [1, 3, 11] modems for fast Internet access through wired channels, the telephone line and the power line respectively; 802.11 a/n WLAN for medium-range indoor networking [20, 26, 38]; UWB radio [18, 32, 36] for high data rate personal area network connectivity. The connectivity range covers short range using UWB radio, mid range based on WLAN, BPL and VDSL, and wide range through DVB-T/H, DAB, WMAN and xDSL standards. With respect to single-carrier modulation, OFDM-based systems offer enhanced robustness against cross-talk, fading channels and multi-path distortion [5]. In OFDM systems, channel equalization is simplified because the transmitted data are spread across orthogonal sub-carriers, hence OFDM can be viewed as the contribution of many narrow-band signals rather than a rapidly modulated wideband signal. In 802.16e, OFDM is also deployed as a multi-user access technology (OFDMA), where carriers are clustered in subsets dynamically assigned to each user. Therefore, the channel capacity is shared among multiple users. All the aforementioned standards exploit a similar baseband processing scheme whose core is an FFT processor, in charge of multi-carrier symbol demodulation at the receiver (rx), plus an IFFT processor in charge of symbol modulation at the transmitter (tx). FFT and IFFT require roughly half of the total circuit complexity of the
baseband processing in OFDM systems (see [6, 29]). During modulation (IFFT), the symbol is cyclically extended to insert a guard interval that handles time-spreading and eliminates inter-symbol interference. The cyclic prefix is removed at the receiver side (FFT). Note that FFT and IFFT operations can be merged in a single FFT/IFFT processor when the communication is based on a time division duplexing (TDD) scheme, since the transceiver works either in rx mode (demodulation by FFT) or in tx mode (modulation by IFFT). In full-duplex transceivers adopting frequency division duplexing (FDD), with concurrent tx and rx, FFT and IFFT have to be implemented through different dedicated processors. OFDM can be used in conjunction with MIMO techniques to increase the system capacity and/or the diversity gain (see [21, 25, 29]). The MIMO scheme, adopted in emerging standards such as 802.16 WMAN and 802.11n WLAN, uses multiple antennas at both the receiver and the transmitter side to exploit spatial diversity and/or spatial multiplexing. Spatial multiplexing increases the capacity of a MIMO link by simultaneously transmitting, from each tx antenna, independent data streams in the same time slot and frequency band. Multiple data streams are then differentiated at the receiver side by using channel information about each propagation path. Because multiple data streams are transmitted in parallel from different antennas, there is a linear increase in throughput for every pair of antennas added to the system (see [21, 29]). In contrast to spatial multiplexing, spatial diversity increases the diversity order of a MIMO link to mitigate fading by coding a signal across space and time using special space-time code techniques such as the Alamouti code [21]. At the rx side, the multiple replicas of the signal are combined constructively to achieve a diversity gain.
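The cyclic-prefix handling mentioned above (inserted after the IFFT at the transmitter, removed before the FFT at the receiver) can be illustrated with a minimal sketch. The function names, the 8-carrier symbol and the naive O(N^2) transform are assumptions made only to keep the example self-contained; they are not taken from any of the proposed macrocells.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT/IDFT, used here only to avoid external dependencies."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def ofdm_modulate(symbols, cp_len):
    """Transmitter side: IFFT, then copy the last cp_len samples in front."""
    time = dft(symbols, inverse=True)
    return time[len(time) - cp_len:] + time

def ofdm_demodulate(samples, cp_len):
    """Receiver side: drop the cyclic prefix, then FFT."""
    return dft(samples[cp_len:])

# Hypothetical QPSK symbols on 8 sub-carriers, guard interval of 2 samples.
tx_symbols = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]
tx_samples = ofdm_modulate(tx_symbols, cp_len=2)
rx_symbols = ofdm_demodulate(tx_samples, cp_len=2)
```

Over an ideal channel the receiver recovers the transmitted symbols up to floating-point error; the same transform core serves both directions, which is why TDD transceivers can share one FFT/IFFT processor.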
Fig. 1 WLAN 2 × 2 MIMO-OFDM scheme
Figure 1 shows the baseband processing architecture for a 2 × 2 WLAN OFDM system in [21], featuring 2 rx paths and 2 tx paths. It is clear that the number of FFT and IFFT processing units to be integrated in the transceiver depends on the number of rx and tx paths. In an M × M OFDM-MIMO system, up to M different streams can be transmitted concurrently over multiple antennas; to serve these streams, M FFT (and M IFFT) processing units working in parallel are required, thus increasing area by a factor M. Alternatively, a lower number of P processors, with P ∈ [1, M], can be used in a time-division way, but to sustain the same data throughput, the clock frequency of the P processors should be increased by a factor M/P. Typical values for M are 2 and 4; the IEEE 802.11n and 802.16e standards require 2 × 2 MIMO schemes but 4 × 4 is allowed. Also in TDD-MIMO systems, FFT and IFFT functions can be merged in a single FFT/IFFT processor, while in FDD schemes different FFT and IFFT processors, working simultaneously, are required.
2.2 OFDM and MIMO-OFDM Processing Requirements
The different OFDM standards considered in this paper (UWB, WLAN 802.11 a/n, WMAN 802.16 d/e, DVB-T/H, DAB, VDSL, BPL) are characterized by different requirements for the FFT/IFFT processing in terms of I/O data-width, from 4 to 16 bits, transform length, from 64 to 8192 points, and throughput, from 1 MSamples/s to several GSamples/s (footnote 1) (see [1, 6, 11, 12, 17, 18, 20, 22, 38, 41]). Table 1 and Fig. 2 summarize the FFT/IFFT requirements, which range between two extremes: small FFT size and high-throughput applications, e.g. UWB, or large-size but low-throughput applications, e.g. DVB-T/H. Some standards support multiple modes, requiring different configurations in terms of FFT length and throughput, to face different scenarios in terms of connected users, channel conditions, communication bandwidth and latency. For such standards, the configuration with the highest size and throughput was considered in Fig.
2. The proposed macrocells have been designed according to a design-for-reuse approach, as in [14–16, 33–35], and are parametric in terms of FFT length and I/O bit-width. The cascade and the in-place variable-size FFT architectures are also run-time configurable: the macrocell is synthesized for the maximum size (e.g., 8192 for DVB-T/H) and any smaller size specified by the standard (e.g., 2048 and 4096 in DVB-T/H) is supported. For M × M MIMO-OFDM systems the throughput requirements should be increased M times with respect to the basic Single-Input Single-Output (SISO) scheme. The requirements on throughput and energy efficiency, particularly for mobile terminals, call for hardware implementation of the FFT and IFFT processors. The widespread diffusion of FFT/IFFT cores in different transceiver schemes suggests the design of parametric, reconfigurable IP hardware macrocells, which are addressed in the following sections considering different architectural approaches. In particular, a massively parallel FFT architecture suitable for high-throughput applications (up to GSamples/s) such as UWB is proposed in Sect. 3. A configurable cascade FFT core optimized for large-size FFTs, such as for DVB, and large data-widths is discussed in Sect. 4. Section 5 presents an in-place variable-length FFT core suitable for OFDMA MIMO schemes such as WiMAX, requiring run-time reconfigurability and sustaining throughputs up to hundreds of MSamples/s.

Table 1 FFT requirements of OFDM and M × M MIMO OFDM standards
| Standard | FFT size (complex) | I/O data-width (bits) | Throughput (MSample/s) |
|---|---|---|---|
| DVB-T/H | 2048–8192 | 8 | 9 |
| DAB | 256–2048 | 8 | 8.26 |
| VDSL | 256–4096 | 16 | 1–35 |
| 802.11 a | 64 | 8 | 20 |
| 802.16d/e | 128–2048 | 10 | 1.25–20 (xM) |
| 802.11 n | 128 | 8 | 40 (xM) |
| UWB | 128 | 4 | 1584 (3 channels) |
| BPL | 512 or 1024 | 16 | 30 |

1 UWB can reach a throughput of N × 528 MSamples/s, with N the number of sub-channels supported in parallel and 528 MSamples/s the requirement of a single channel; the maximum number of channels is 14, but the typical value is 3 so as to allow frequency hopping.

Fig. 2 FFT throughput and size for different OFDM and MIMO-OFDM transceivers
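The throughput scaling for M × M MIMO systems served by P time-shared processors can be turned into a quick clock-frequency estimate. The sketch below is only a back-of-the-envelope aid; the function name and the `samples_per_cycle` parameter are hypothetical and introduced here for illustration.

```python
def required_clock_mhz(siso_throughput_msps, m_streams, p_processors,
                       samples_per_cycle=1.0):
    """Clock (MHz) each of the P processors needs to serve M streams.

    siso_throughput_msps: throughput of one stream in MSamples/s.
    samples_per_cycle: complex samples one processor completes per clock
    (an assumed parameter: 1 for a streaming core, N for a fully parallel
    N-point core).
    """
    if not 1 <= p_processors <= m_streams:
        raise ValueError("P must lie in [1, M]")
    total_msps = siso_throughput_msps * m_streams
    return total_msps / (p_processors * samples_per_cycle)

# 802.11n, 2 x 2 MIMO on a single shared processor
wlan_clock = required_clock_mhz(40, m_streams=2, p_processors=1)
# 802.16e, 4 x 4 MIMO on two shared processors
wman_clock = required_clock_mhz(20, m_streams=4, p_processors=2)
# UWB, 3 channels, one 128-point transform completed per clock cycle
uwb_clock = required_clock_mhz(1584, 1, 1, samples_per_cycle=128)
```

With the Table 1 figures this gives 80 MHz, 40 MHz and 12.375 MHz respectively, which makes explicit why a one-sample-per-cycle core cannot serve UWB while a fully parallel one can run at a very low clock.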
3 Fully Parallel Architecture Although intrinsically more complex than other solutions based on hardware sharing, the parallel approach allows a memory-free design and is the natural choice to tackle very high data throughput up to GSamples/s [40]. This is particularly true for implementations on FPGA, whose system clock frequency is far lower than in custom ASIC designs. The key facet of the parallel architecture, not allowed by other solutions, is the possibility of an ad hoc customization of the data flow, whose width can be controlled stage by stage.
Furthermore, all multiplications in the algorithm turn into multiplications by a constant factor (the so-called twiddles are constant roots of unity), which can be greatly optimized by the logic synthesis tool. The fixed-point representation of several multiplicands (real or imaginary part or both) reduces to trivial values (zero or ± powers of two) that are costless in terms of implementation, and their number becomes higher when reducing the required precision, which is the typical use for parallel FFT.
3.1 VLSI Architecture
The parallel architecture of a generic N-point (I)FFT, directly derived from the Cooley–Tukey algorithm [8], is shown in Fig. 3. The architecture is arranged in ρ = ⌈log4 N⌉ stages, all radix-4 except the first one, which is a radix-2 stage if N is not a power of 4. As widely shown in the literature, radix-4 (used also for the cascade approach) and radix-2 (used for the in-place variable FFT core) are the most suitable factorizations for cost-effective implementations, since high-radix factorizations (i.e. 8, 16 and 32) require a basic computational unit (butterfly) with non-trivial multiplications, thus bearing an unacceptable increase of hardware complexity (see [13, 26, 30]). Each stage of the parallel FFT is composed of a set of butterfly blocks, followed by N complex multiplications by the twiddle factors. As opposed to the case of time-sharing architectures, twiddle factors take (different) constant values over the stages of a parallel FFT; so, (real) multipliers do not need to resort to any particular architecture, such as the parallel Booth architecture [9], to speed up the computation or reduce the complexity. Since twiddle factors are complex roots of unity, they can be treated with the lifting approach described in [28]: in this way, only three real multiplications and three additions are required instead of four multiplications and two additions as in a straightforward implementation.
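To make the three-multiplication count concrete, the following sketch uses the standard three-step lifting factorization of a plane rotation. It is an assumption that this matches the element of Fig. 4 in detail: the coefficient formulas, the function names and the omission of the sign inversions and cross-bar are choices made here for illustration, in floating point rather than fixed point.

```python
import cmath
import math

def lifting_coefficients(theta):
    """Return (L1, L2) for a rotation by theta; sin(theta) must be non-zero."""
    s = math.sin(theta)
    if abs(s) < 1e-12:
        raise ValueError("trivial twiddle: no multiplier needed")
    return (math.cos(theta) - 1.0) / s, s

def lifting_rotate(x, y, l1, l2):
    """Rotate (x + j*y) with three real multiplications and three additions."""
    t = x + l1 * y          # first lifting step
    y_out = y + l2 * t      # second lifting step
    x_out = t + l1 * y_out  # third lifting step
    return x_out, y_out

# Twiddle W_128^3 = exp(-j*2*pi*3/128) applied to a hypothetical sample.
theta = -2.0 * math.pi * 3 / 128
l1, l2 = lifting_coefficients(theta)
re, im = lifting_rotate(0.5, -0.25, l1, l2)
expected = complex(0.5, -0.25) * cmath.exp(1j * theta)
```

The three steps are sequential, each depending on the result of the previous one, which is consistent with the longer critical path attributed to the lifting element.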
This approach helps to save complexity but, as a drawback, increases the delay of the critical path in the design, so it is only suitable for applications where complexity is crucial while throughput is easily met. The generic architecture of a lifting element is reported in Fig. 4, where all the parts that are customized at synthesis time are enclosed in dashed lines. Actually, the real factors L1, L2 are optimized to the value of the particular twiddle factor, input/output sign inversions are implemented only when required, and the cross-bar turns into a hard-wired connection.
Fig. 3 Mixed-radix FFT processor architecture
Fig. 4 Architecture of the lifting element
3.2 Case study: Ultra-Wide Band (UWB)
As a case study, the parallel architecture described in Sect. 3.1 has been tailored for a very high speed application such as the UWB standard [18]. The 128-point parallel FFT is designed to compute 128 complex samples per clock cycle, so that it can meet the throughput requirement of 1584 MSamples/s at a clock frequency of only 12.375 MHz. As a distinguishing feature, the UWB signal is quantized on a limited number of bits, typically in the range from 4 to 6, and this is exploited to contain the complexity of the VLSI design. In this case study, the input signal was quantized on 5 bits, corresponding to about 28 dB of signal-to-quantization-noise ratio (SQNR). Then, the data flow was optimized so as to guarantee the same SQNR on the output of the transform, by means of truncation (rounding) and saturation after every step (butterfly and lifting multiplication). Twiddle factors were quantized on Btwd = 6 bits, which resulted in a maximum output SQNR of 27.2 dB, attained with neither truncation nor saturation (free-growing data flow). Actually, since the parallel FFT adopts the lifting approach, the lifting coefficients L1 and L2 were quantized on Blift = 6 bits, and Blift − 1 bits were discarded after each real multiplication in Fig. 4. Following a cascade approach, truncation (rounding) and saturation of the butterfly outputs first, and then saturation of the output of the lifting multiplier, were applied to achieve the same output SQNR of 27 dB. The results of this procedure are graphically depicted in Fig. 5, where data-widths are represented with squares and, for every stage, the width is shown on its input, after butterfly combining and after twiddle multiplication (three columns for each stage of the FFT, overall).
Bits saturated (rounded) are also shown as cross-hatched squares with top-left/bottom-right (bottom-left/top-right) diagonals. As shown in Fig. 5, 1 bit is saturated after twiddle multiplication at every stage and 1 more bit is saturated at the output of the butterfly of stages 2 and 3.
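The word-length control described in this case study (discarding least significant bits after a multiplication, then saturating to a narrower signed word) can be sketched as follows. The round-half-up rule, the function name and the sample values are assumptions made for illustration; the text does not specify the exact rounding mode.

```python
def round_and_saturate(value, drop_bits, out_bits):
    """Drop `drop_bits` LSBs with round-half-up, then clip to a signed word.

    value: integer result of a butterfly or multiplier.
    out_bits: width of the signed two's-complement output word.
    """
    if drop_bits > 0:
        # Adding half an LSB before the arithmetic shift rounds to nearest.
        value = (value + (1 << (drop_bits - 1))) >> drop_bits
    hi = (1 << (out_bits - 1)) - 1
    lo = -(1 << (out_bits - 1))
    return max(lo, min(hi, value))

# A 5-bit sample times a 6-bit lifting coefficient, discarding
# Blift - 1 = 5 bits and keeping a hypothetical 6-bit result.
product = 13 * 21
scaled = round_and_saturate(product, drop_bits=5, out_bits=6)

# Saturation of a value that no longer fits a 5-bit signed word.
clipped_pos = round_and_saturate(100, drop_bits=1, out_bits=5)
clipped_neg = round_and_saturate(-100, drop_bits=1, out_bits=5)
```

Applying such a step after every butterfly and multiplier, with `drop_bits` and `out_bits` chosen per stage, is one way to obtain a stage-by-stage width profile of the kind shown in Fig. 5.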
Fig. 5 Growth of the fixed-point data flow in a 128-point parallel FFT with input signal on 5 bits and twiddle quantized on 6 bits
4 Cascade Architecture
4.1 Radix Cascade Architecture
A cascade approach, alternative to the high-throughput parallel FFT IP core design described in Sect. 3, can offer a good trade-off between complexity and speed, with remarkable length flexibility for a variety of communication and multimedia applications. The architecture presented in this section adopts a cascade of radix-4 butterfly stages (the last stage of the cascade is mixed radix-4/radix-2 to also support FFT lengths that are powers of two); such an approach is suitable for the stream-oriented data processing systems found in communication and multimedia applications. In fact, owing to the inherent pipeline of the cascade architecture, input buffers can be removed since buffering capability is spread across the whole data-path. Figure 6 illustrates the top-level data-path of the cascade architecture, which is fully parametric in terms of maximum FFT size (Nmax), number of radix stages (S), and word length of the I/O (IOWL), of the twiddle coefficients (TWL) and of the internal system data-path (SWL). Moreover, each radix stage supports different types of machine arithmetic: fixed-point, block floating point (BFP) and convergent block floating point (CBFP). For the applications considered in this paper the configuration parameters are in the ranges: N ∈ [64, 8192], k ∈ [3, 6] and S ∈ [3, 7]. In Fig. 6, the multiplexers at the input of each stage enable the FFT/IFFT size to be configured at run time by selecting the number of stages in the cascade through the length input signal. The flush and the freeze signals control the internal pipeline, while the FFT/IFFT signal is used to switch between FFT and IFFT computation. The ability to compute both the direct and inverse transforms with the same core is useful in transceivers based on TDD techniques, as discussed in Sect. 2, where the core works alternately as IFFT processor in the transmitter chain and as FFT processor in the receiver chain.
The global control unit in Fig. 6 can also manage the insertion/removal of the cyclic prefix and suffix of the sequence, as foreseen in the OFDM scheme. As shown in Fig. 7, each stage in the cascade architecture includes the butterfly/multiplier unit, a module for sequencing data and a ROM containing the twiddle factors. The data-path of the radix-4 butterfly is sketched in Fig. 8, where thick arrows are used for complex values and thin arrows for real ones. The complex multiplier is implemented with three Booth multipliers and six adders. Therefore, the basic butterfly in the cascade approach is more complex than the same unit in the parallel architecture of Sect. 3, which avoids complex multipliers. However, the parallel architecture includes multiple butterfly units at every radix stage, while a single butterfly per stage is used in the cascade architecture; so, for FFTs of considerable length, the number of butterflies and the related hardware complexity are lower in the cascade than in the parallel approach.
Fig. 6 Programmable cascade of radix-4/2 stages in the FFT/IFFT IP core
Fig. 7 Generic stage architecture
Fig. 8 Radix-4 butterfly with complex multiplier
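As a point of reference for the butterfly/multiplier unit, the sketch below shows a generic radix-4 butterfly (a 4-point DFT needing only additions and multiplications by ±j) and one well-known way to build a complex product from three real multiplications. Both are textbook forms written under the assumption that they resemble the unit of Fig. 8; the exact adder arrangement of the macrocell may differ, and the function names are hypothetical.

```python
def radix4_butterfly(a, b, c, d, inverse=False):
    """4-point DFT (or inverse, unscaled) without general multiplications."""
    s0, s1 = a + c, a - c
    s2, s3 = b + d, b - d
    j_s3 = complex(-s3.imag, s3.real)  # j * s3: swap parts, negate one
    if inverse:
        return s0 + s2, s1 + j_s3, s0 - s2, s1 - j_s3
    return s0 + s2, s1 - j_s3, s0 - s2, s1 + j_s3

def complex_mul_3(a, b, c, d):
    """(a + jb) * (c + jd) using three real multiplications."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2  # (real part, imaginary part)

outputs = radix4_butterfly(1, 2, 3, 4)
product = complex_mul_3(3, 4, 1, -2)
```

For the inputs (1, 2, 3, 4) the butterfly returns the 4-point DFT, and the three-multiplier routine reproduces (3 + 4j)(1 − 2j) = 11 − 2j; when the second operand is a stored twiddle, the sums c + d and d − c can be precomputed.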
The data-sequencing module uses memory banks for reordering data as dictated by the algorithm [4]. These memory banks can be exploited to remove input buffers since buffering capabilities are embedded in the data-path. Equations (1) and (2) detail the memory requirements, RAM for data and ROM for twiddle coefficients, for the j-th stage in the cascade architecture in terms of number of banks × data-width × number of locations. IOWL is the data-width of the real/imaginary part of the processor input/output. SWL is the data-width of the internal data-path. TWL is the data-width for the real and imaginary parts of the twiddle factors.

$$
\mathrm{RAM}(j) =
\begin{cases}
7 \cdot 2\,\mathrm{IOWL} \cdot \dfrac{N_{max}}{4 \cdot 2^{m}} & j = 1 \\[2mm]
7 \cdot 2\,\mathrm{SWL} \cdot \dfrac{N_{max}}{4^{j}} & j \in [2, S-1] \\[2mm]
2\,\mathrm{SWL} & j = S
\end{cases}
\qquad (1)
$$

$$
\mathrm{ROM}(j) = 2\,\mathrm{TWL} \cdot \frac{N_{max}}{8 \cdot 4^{\,j-1}}, \quad j \in [1, S-1] \qquad (2)
$$
By exploiting the symmetry of the unit circle on the complex plane, the size of the twiddle ROM has been reduced by a factor of 8 vs. a conventional implementation. This is achieved through a very simple circuit that exchanges the real and imaginary parts and/or complements the sign of the coefficients stored in ROM. The same circuit conjugates the twiddle factors when the core is configured for IFFT. The data-sequencing stage is based on small memory banks and is designed so that the whole processor can sustain a throughput of one complex sample per clock cycle. Therefore, for standards such as UWB or MIMO systems with a large number of channels, a cascade architecture would require a clock frequency above 1 GHz, unfeasible on FPGA devices or with low-power ICs. The cascade approach is more suitable for mid-range throughput applications within one hundred MSamples/s such as DVB, DAB, WLAN, WMAN, DSL and BPL. In the proposed architecture, the internal pipeline can be frozen or flushed by an external control unit to match the input traffic rate or in case of run-time reconfiguration (FFT/IFFT mode, transform length). The latency of the whole architecture varies with the actual transform length and is mainly determined by the sum of the latencies of the cascaded radix-4 stages (hence, w.r.t. the parallel approach, a higher latency is paid in the cascade architecture for FFTs of considerable length). Figure 9 details the pipeline of a radix-4 stage where

$$
L_{dr}(N, j) = \frac{N}{4^{\,j-1}} \qquad (3)
$$
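The eight-fold twiddle ROM reduction mentioned at the start of this passage relies on the octant symmetry of the unit circle. The sketch below is one possible realization, written under stated assumptions: the ROM holds cosine/sine pairs for the first octant only (N/8 + 1 entries, the extra entry being a convenience of this sketch), and the function names and floating-point storage are hypothetical.

```python
import cmath
import math

def build_twiddle_rom(n):
    """Store (cos, sin) of 2*pi*k/n for the first octant only, k = 0..n/8."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n // 8 + 1)]

def twiddle_from_rom(rom, n, k, inverse=False):
    """Rebuild W_n^k = exp(-j*2*pi*k/n) by swapping parts and changing signs."""
    e = n // 8
    octant, r = divmod(k % n, e)
    if octant % 2:
        s, c = rom[e - r]      # odd octant: mirrored entry, parts exchanged
    else:
        c, s = rom[r]
    quadrant = octant // 2     # remaining rotation by a multiple of 90 degrees
    if quadrant == 1:
        c, s = -s, c
    elif quadrant == 2:
        c, s = -c, -s
    elif quadrant == 3:
        c, s = s, -c
    # The IFFT uses the conjugate twiddle, i.e. a sign change on the sine.
    return complex(c, s if inverse else -s)

rom64 = build_twiddle_rom(64)
w = twiddle_from_rom(rom64, 64, 21)
```

Only exchanges and sign changes are applied to the stored values, which mirrors the "very simple circuit" described in the text; the conjugation for IFFT comes for free from the same sign control.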
4.2 FFT/IFFT IP Core Configuration and DVB Case Study By using a custom software tool, based on Monte Carlo simulations, we are able to profile the three arithmetic types supported by the FFT cascade macrocell in terms of SQNR and to provide the designer with information for selecting the arithmetic that suits the target application best. System-level requirements specify the FFT/IFFT length (N ) and a SQNR budget, i.e. the desired bit true IOWL, and the tool derives the
Fig. 9 Internal pipeline of a radix-4 stage
optimal values of internal width (SWL and TWL) and the most suitable arithmetic to achieve the desired output precision while minimizing data-path and memory sizes. A 64-bit floating point FFT/IFFT processor is considered as a golden reference model, and the tool has been applied to the cascade architecture to cover the OFDM standards presented in Sect. 2. The results of the parametric configuration are summarized in Table 2. Besides the configuration parameters, the FFT macrocell database is also generated, including the RTL VHDL code, test-benches and test vectors for verification and performance estimation (e.g. dynamic power consumption). As a rule of thumb, CBFP arithmetic is a good solution for FFTs of considerable length (N ≥ 1024) requiring very high accuracy (IOWL ≥ 14), as in the case of VDSL or BPL. For example, using BFP arithmetic with the same set of parameters (SWL, TWL) as CBFP would result in a performance loss of about 4 dB for VDSL. In other words, BFP arithmetic would require SWL = 23 to achieve the same SQNR that CBFP achieves with SWL = 18. For the other standards (DVB-T/H, DAB, WLAN, WMAN, UWB) the SQNR budget is such that the BFP and CBFP approaches exhibit the same performance with similar SWL and TWL values, thus BFP arithmetic is preferred to save the extra circuit complexity of the magnitude estimation unit. Figure 10 shows the SQNR for different values of SWL and TWL for the 8K mode of DVB-T/H. Selecting BFP instead of CBFP, the same configuration of SWL and TWL (11 and 3, respectively) results in a loss of 0.7 dB in SQNR: 43.7 dB for BFP instead of 44.4 dB for CBFP. Since the standard requirement is around 43 dB, BFP is preferred; indeed, its complexity is lower than that of CBFP for the same SWL and TWL width. Adopting an M × M MIMO scheme, such as for WLAN 802.11n or WMAN 802.16, requires the integration of P processors running at a clock frequency M/P times faster than the basic SISO scheme, independently of the SQNR budget.
As a result, the arithmetic type and the word lengths are the same for a given standard in both MIMO (M = 2, 4) and SISO (M = 1) configurations.
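For readers unfamiliar with block floating point, the following sketch shows the generic idea behind BFP arithmetic: a whole block of samples shares one exponent, and the block is shifted right only as much as needed to fit the data-path word. This is a textbook illustration under stated assumptions, not the macrocell's arithmetic unit; the function name and sample values are hypothetical, and the finer-grained scaling of CBFP is not modeled.

```python
def bfp_normalize(block, word_bits, exponent=0):
    """Fit a block of (re, im) integer pairs into signed `word_bits` words.

    Returns the scaled block and the updated shared exponent, so that each
    original value is approximately scaled_value * 2**(exponent increase).
    """
    hi = (1 << (word_bits - 1)) - 1
    peak = max(max(abs(re), abs(im)) for re, im in block)
    shift = 0
    while (peak >> shift) > hi:   # smallest shift that avoids overflow
        shift += 1
    scaled = [(re >> shift, im >> shift) for re, im in block]
    return scaled, exponent + shift

# Stage outputs that have grown beyond a hypothetical 8-bit data-path.
stage_out = [(100, -300), (40, 25)]
scaled, exp = bfp_normalize(stage_out, word_bits=8)
```

Because the shift is driven by the largest sample of the block, small samples lose precision whenever a single large one is present, which is the kind of loss that a more local scaling scheme is meant to reduce.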
5 In-place Variable Length FFT with Parallel Butterfly Processors This section presents a reconfigurable FFT architecture suitable for 4 × 4 MIMO OFDMA wireless systems that processes up to 4 streams with variable symbol lengths, ranging from 128 to 2048 complex points.
Fig. 10 SQNR vs. SWL and TWL, 8K mode DVB-T/H, 43 dB SQNR target

Fig. 11 Overview of the in-place, variable length, FFT architecture

Table 2 Cascade IP configuration in different OFDM standards
| Standard | Arithmetic type | IOWL | SWL | TWL | SQNR (dB) | Stages S |
|---|---|---|---|---|---|---|
| DVB-T/H | BFP | 8 | 11 | 3 | 43.6 | 7 |
| DAB | BFP | 8 | 11 | 5 | 43 | 6 |
| VDSL | CBFP | 16 | 18 | 12 | 94.2 | 6 |
| 802.11a/n | BFP | 8 | 11 | 6 | 44.7 | 4 |
| 802.16d/e | BFP | 10 | 13 | 7 | 54.9 | 6 |
| UWB | BFP | 5 | 7 | 4 | 25.4 | 4 |
| BPL | CBFP | 16 | 18 | 10 | 93.7 | 5 |
5.1 Overview of the FFT Organization The engine computes decimation in time (DIT) FFT algorithms of variable length by using in-place technique with radix-2 factorization. It consists of 16 butterfly processors, 32 banks and an interconnection network used to group the processors in sets of 2, 4, 8 or 16 (Fig. 11). Each processor can be reconfigured at run time to compute FFTs of 32, 64 or 128 complex points. The execution of the radix-2 FFT algorithm
for these input sizes requires 5, 6 or 7 stages, respectively. At these stages, the processor Pi updates two points per cycle and uses only two memory banks for storing all the intermediate results: its private bank Bi,0 and its auxiliary bank Bi,1 (64 addresses each). FFTs of length 256, 512, 1024 or 2048 are computed by allowing the processors to cooperate in groups and execute the remaining stages of the algorithm. During these higher stages of the computation, the processor Pi uses the interconnection network to access the auxiliary bank Bk,1 of another processor Pk. The indices i and k are defined by the computation flow of the FFT algorithm (the required butterfly calculation) and by a specific conflict-free in-place technique described in the following subsection. The application of this novel technique leads to the reduction of the interconnection network and to the minimization of the total computation cycles: only half of the 32 banks need to be shared among the processors (i.e. the auxiliary banks) and, moreover, no conflicts stall the execution of the algorithm. The reconfiguration and the grouping of the processors are determined by an internal scheduler depending on the length of the 4 FFTs (input symbols) at each time. The scheduler design has focused on sustaining the required throughput rate of each stream. To simplify the process, we have included an input buffer and two distinct operation modes for the scheduler. We use the first operation mode when there is at least one stream with symbol length greater than or equal to 1024. For such cases the scheduler will use the input buffer to collect symbols of the same length, whose sum is 2048, i.e. 16 symbols of length 128, 8 symbols of length 256, etc. Each collection of 2048 data belongs to the same stream and constitutes the input to the 16-processor engine. The collections are processed sequentially in a round-robin fashion, sustaining an average throughput for each stream equal to its input rate.
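As background for the in-place radix-2 processing used by each butterfly processor, the sketch below shows the textbook in-place decimation-in-time FFT, in which every butterfly overwrites the two locations it reads. It is only a generic reference under stated assumptions: it takes bit-reversed input ordering and a single flat memory, and does not model the two-bank, conflict-free mapping or the sorted-output permutation of the proposed engine. The function names are hypothetical.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the `bits` least significant bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft_inplace_dit(x):
    """In-place radix-2 DIT FFT on a list; returns the number of stages."""
    n = len(x)
    stages = n.bit_length() - 1
    if n < 2 or (1 << stages) != n:
        raise ValueError("length must be a power of two")
    for i in range(n):                       # bit-reversal permutation
        r = bit_reverse(i, stages)
        if r > i:
            x[i], x[r] = x[r], x[i]
    for j in range(1, stages + 1):           # stage (pass) j
        half = 1 << (j - 1)
        for start in range(0, n, 2 * half):
            for t in range(half):
                w = cmath.exp(-2j * cmath.pi * t / (2 * half))
                s, r = start + t, start + t + half  # differ in one bit only
                a, b = x[s], w * x[r]
                x[s], x[r] = a + b, a - b    # results overwrite the inputs
    return stages

data = [complex(i % 5, -(i % 3)) for i in range(32)]
buf = list(data)
num_stages = fft_inplace_dit(buf)
```

A 32-point transform takes 5 stages, and 64 and 128 points take 6 and 7, matching the stage counts quoted above; at each stage the two operands of a butterfly have indices that differ in exactly one bit.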
Each processor is configured to perform a 128-point FFT. The second mode handles the remaining cases (where all symbols have length ≤ 512). The scheduler assigns the symbol of each stream to a dedicated group of four processors. The input streams are processed in parallel, as the processor groups operate independently of each other. Each processor is configured to perform an FFT with length equal to 1/4 of the symbol length assigned to its group. Note that, in order to sustain the required throughput even in the worst case (4 FFTs of 2048 points each), the operating clock frequency of the design is set to fop = 1.375 · fin, where fin denotes the input data rate.
5.2 In-place Technique
The proposed organization uses the in-place technique of [27], modified accordingly, to produce a sorted FFT output by using as a key the indices of the elements (the initial address of the elements). The input elements are stored in the banks Bi,0, Bi,1 such that the LSB of each element's index specifies its storing bank. Specifically, at the output of each butterfly computation, the following permutation is performed. Consider the elements xs, xr forming a transformation couple at stage (pass) j. The indices of xs and xr differ only at the j-th bit, and the results exchange memory locations if bit (j + 1) of these indices is 1. The output elements are sorted with indices 0, . . . , N/2 − 1 stored in banks Bi,0 in increasing order and indices
Circuits Syst Signal Process
N/2, . . . , N − 1 stored in Bi,1 in decreasing order. Besides the output permutations at each butterfly, we perform an input permutation: we exchange xs , xr at the butterfly input if xs is stored in Bi,1 . Note that the aforementioned scheme allows an efficient interconnection of the processors because each processor needs to access only four auxiliary banks (besides its own). 5.3 Butterfly Processor Architecture Each radix-2 butterfly processor Pi has two inputs (IR , IS ) and two outputs (OR , OS ), the two dual-port memory banks Bi,0 , Bi,1 , the FFT control, the interconnection between the processor and the banks, the data address generation circuit and the twiddle address generation (Fig. 12). Focusing on a 128-point processor as a reference case (the cases of 32- and 64point are similar), the FFT control includes a 10 bits up counter to handle 1024 pairs of data (worst case scenario) and a 4 bits down counter to handle the FFT stages (passes). During an FFT, the first 64 pairs are in Bi,0 , Bi,1 . With more than 64 pairs (2j pairs, 7 ≤ j ≤ 10), the processor performs the same operations on data stored in Bi,0 and Bx,1 , with x defined by the interconnection. The address generation circuit (Fig. 13) uses the two control counters to generate the data addresses at each stage and to control the I/O multiplexers (Fig. 12). During the j th stage the circuit will address N/2 pairs belonging to N/2j +1 FFT sub-blocks. The circuit generates the addresses of the pairs by forming a word, which consists of the 6 least significant bits of the up counter. The addresses for Bi,0 are generated by resetting the bit j − 1 of this word. The addresses for Bi,1 are generated by inverting the j − 1 least significant bits of the Bi,0 address. 
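Two of the quantities described above lend themselves to a small numerical check: the sorted output layout across the two banks and the widths of the two control counters. The sketch below is a minimal model under stated assumptions; the function names are hypothetical, bank 0 stands for B_{i,0} and bank 1 for B_{i,1}, and the stage counter is assumed to hold the stage count itself.

```python
def output_location(d, n=128):
    """Map output index d of an n-point FFT to a (bank, address) pair."""
    if not 0 <= d < n:
        raise ValueError("index out of range")
    half = n // 2
    if d < half:
        return 0, d              # private bank, increasing order
    return 1, n - 1 - d          # auxiliary bank, decreasing order


def control_counter_widths(n_max=2048):
    """Bit widths of the pair (up) counter and of the stage (down) counter."""
    pairs = n_max // 2                   # butterfly pairs per stage
    stages = n_max.bit_length() - 1      # log2(n_max) radix-2 stages
    # Assumption: the stage counter must be able to hold the value `stages`.
    return (pairs - 1).bit_length(), stages.bit_length()


if __name__ == "__main__":
    print(output_location(64), output_location(127))
    print(control_counter_widths())
```

For the 128-point reference case, index 64 maps to address 63 of the auxiliary bank and index 127 to address 0, while a maximum length of 2048 gives a 10-bit pair counter and a 4-bit stage counter, in line with the figures quoted in the text.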
The multiplexers at the butterfly outputs (O_R, O_S) realize the in-place technique permutations and are controlled by the jth bit of the up counter (at pass j): if the jth bit is 1, then the multiplexers exchange the outputs of the butterfly (swapout signal of Fig. 13). The multiplexers at the I_R, I_S inputs of the butterfly are controlled by the (j − 1)th bit of the up counter (swapin signal of Fig. 13). The multiplexers at the output of the address generator (read addresses) are also controlled by the swapin signal.

Fig. 12 Architecture of the radix-2 processor

Fig. 13 The Address Generator of the radix-2 processor

The radix-2 processor is organized to compute an FFT by performing DIT and producing 32, 64 or 128 sorted outputs: after the FFT completion, the elements with (max) indices d_0, ..., d_63 will be in the addresses 0, ..., 63 of B_{i,0} (increasing order) and the elements d_64, ..., d_127 will be in the addresses 63, ..., 0 of B_{i,1}, respectively (in decreasing order). A twiddle generator circuit is used with a ROM of N_max/8 coefficients, i.e. 256, controlled by the vInv unit (Fig. 13). Specifically, assume that we execute the last stage of a sub-FFT of length 2^{j+1} on the two sub-FFTs of length 2^j (the two sub-FFTs have their results sorted as above). We read the twiddles of the first 2^{j−1} pairs by increasing a counter and the twiddles of the remaining 2^{j−1} pairs by decreasing a counter as follows. At pass j we use the j + 1 least significant bits to create a 10-bit word: these j + 1 bits are used as MSBs followed by 0's (input V of the vInv of Fig. 13). If the MSB of the above word is equal to 0, then this word is used as a twiddle address; otherwise, the remaining j MSBs (apart from the MSB itself) are inverted to create the twiddle address.

5.4 Interconnection Scheme

In the proposed architecture, certain stages of the FFT require that each processor P_i accesses data from the auxiliary bank of a remote processor. To optimize the interconnection network, we have designed the data flow such that only the lower input/output of P_i is connected to a remote bank (the upper is always connected to B_{i,0}). Moreover, each P_i connects only to 4 auxiliary remote banks (besides B_{i,1}). To determine the set of banks connected to each P_i, we must take into account the flow-graph of the FFT and the aforementioned in-place technique. Assume that j is
an FFT stage where each P_i will access data belonging to processor P_k (recall that not all stages require remote accesses). Let i = [i_3 i_2 i_1 i_0]; the index k = [k_3 k_2 k_1 k_0] is obtained in the following two steps. First, we consider the effect of the output permutation, which is a bitwise exclusive-or (XOR) operation on the index i with a 4-bit number containing j ones in the j LSBs and zeros otherwise. Second, data exchanges at the input occur at processors whose index has the (j − 1)th bit set. For these processors, the index k is computed by the first-step calculation and corrected by performing another bitwise exclusive-or operation with a number containing j − 1 ones in the j − 1 LSBs (and zeros otherwise). Therefore, k is produced by superimposing the two permutations (input and output) during stages j ≥ 8. More specifically,

\[
k = [k_3\, k_2\, k_1\, k_0] = [i_3\, i_2\, i_1\, i_0] \oplus [0 \ldots \underbrace{1 \ldots 1}_{j}] \oplus [0 \ldots \underbrace{i_{j-1} \ldots i_{j-1}}_{j-1}]
\]
Therefore, the interconnection network for each processor consists of a 5-to-1 multiplexer at the processor's lower input I_S and a 1-to-5 demultiplexer at its lower output O_S. The connections to each (de)multiplexer can be computed from the equation above. Note that, with the proposed technique, the depth of the address calculation circuit is constant, irrespective of the size N of the FFT.
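The index computation for the remote bank can be transcribed directly into code. The sketch below is an interpretation rather than the authors' circuit: the function names are hypothetical, j is assumed to count the remote stages from 1 to 4 (so that the masks fit the 4-bit processor index), and i_{j-1} is taken as the zero-based bit j − 1 of i.

```python
def remote_bank(i, j):
    """Index k of the processor whose auxiliary bank P_i accesses at remote stage j."""
    if not (0 <= i < 16 and 1 <= j <= 4):
        raise ValueError("expected 0 <= i < 16 and 1 <= j <= 4")
    out_mask = (1 << j) - 1                        # j ones in the j LSBs (output permutation)
    bit = (i >> (j - 1)) & 1                       # bit i_{j-1} of the processor index
    in_mask = ((1 << (j - 1)) - 1) if bit else 0   # j-1 copies of i_{j-1} (input permutation)
    return i ^ out_mask ^ in_mask


def remote_banks(i):
    """All remote auxiliary banks reached by P_i over the four remote stages."""
    return [remote_bank(i, j) for j in range(1, 5)]


if __name__ == "__main__":
    for i in range(16):
        print(i, remote_banks(i))
```

With this reading, every processor reaches exactly four distinct remote banks, none of them its own, and at each stage the mapping from i to k is a permutation of the 16 processors, so no two processors address the same auxiliary bank in the same stage. This matches the 5-to-1 multiplexer (four remote banks plus B_{i,1}) described in the text.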
6 Implementation Results and Comparison with the State-of-the-Art

This section compares the three architectures described in this paper with state-of-the-art FFT VLSI designs, considering the most suitable target application for each solution. Comparing IP cores of different architectures and implementation technologies can be ambiguous; so, for the sake of fairness, we selected state-of-the-art designs whose system-level requirements, expressed in terms of throughput and numerical accuracy, are similar to those of each of the three proposed architectures for its target application. Section 6.1 reports the results in terms of complexity and throughput for the three architectures. Then, Sect. 6.2 compares the three proposed implementations with each other, assuming common applications and FPGA technology. Finally, the most suitable architecture for each of the telecommunication standards of Sect. 2 is presented, along with the relevant implementation complexity results.

6.1 Comparison vs. State-of-the-Art VLSI FFT Designs

Table 3 compares the cascade architecture presented in Sect. 4 with other state-of-the-art FFT cores targeting DVB-T/H applications. Complexity results refer to gate and memory complexity in silicon for a 90 nm CMOS technology and standard-cell library. The comparison includes an application-specific instruction set processor (ASIP) [23], two macrocells specifically designed for DVB-T (see [24, 42]) and a macrocell obtained by an automatic IP generator [10]. Note that the proposed cascade macrocell stands out for its low complexity while maintaining similar application
Table 3 Comparison with state-of-the-art DVB-T FFT cores (SQNR budget ≥40 dB) Implementation
IP type
Arithmetic
fclk
Complexity
type
(MHz)
Kgates
RAM bits
ROM bits
219738
8190
Cascade [this paper]
generator
BFP
9
37
Lee et al. [23]
ASIP
fixed-point
280
80
Wang et al. [42]
custom
fixed-point
16
139
211008
165120
Cortés et al. [10]
generator
fixed-point
9
48.7
262112
305760
Li et al. [24]
custom
BFP
8
91
n.a.
n.a.
1572864
performance: throughput of 9 MSamples/s, SQNR greater than 40 dB, variable transform length between 2048 and 8192. It is worth noting that, when implemented in FPGA technology, the cascade FFT core configured for DVB-T/H fits in a low-cost device family such as the Spartan3 from Xilinx (an XC3S200 device is sufficient); if the more powerful Virtex4 FPGA family is adopted, then the cascade FFT core needs less than 10000 slices. Both the Spartan3 and Virtex4 device families are SRAM-based FPGAs realized in 90 nm CMOS technology.

As far as the parallel architecture is concerned, it can be fairly compared with the UWB FFT core proposed by Sherrat et al. in [37], whose target is the real-time implementation of a 128-point 528-MSamples/s FFT. To achieve this performance, [37] exploits four 128-point radix-2 pipelined processors working concurrently, each clocked at 132 MHz with a clock phase delay of 0, 90, 180, and 270 degrees generated by a digital clock management (DCM) unit. Fitted on FPGA technology, this state-of-the-art UWB FFT core requires roughly 5000 slices of a Virtex 4 device. In contrast, the fully parallel architecture proposed in Sect. 3 requires roughly 20,000 slices on a similar Virtex4 FPGA technology but, owing to its high parallelism, it can reach a throughput higher than 7 GSamples/s with a clock frequency of only 55 MHz. Therefore, our architecture occupies 4 times as many slices as [37], but it is also 14 times faster; as a result, while the macrocell in [37] can sustain only 1 UWB channel out of the 14 available channels in the ECMA standard [18], the proposed architecture allows the real-time realization of a UWB communication with full capabilities (14 channels).

Finally, we examine the in-place, reconfigurable architecture presented in Sect. 5. To estimate its cost, we implemented a 10-bit I/O FFT on a Xilinx Virtex 4 FPGA (XC4VLX200). The FFT core occupies 8614 slices, 64 DSP blocks, and 80 RAM blocks.
The input buffer occupies 3000 slices and 256 RAM blocks. Overall, the design operates at 34.375 MHz and achieves 56.8 dB SQNR by using 13-bit data-paths. Recalling that the implemented design computes 128–2048 complex-point FFTs on 4 independent streams (targeting 4 × 4 MIMO applications), it can be fairly compared to a straightforward solution made of 4 independent FFT modules. For this reason, we also implemented the well-known SDF (Single Delay Feedback) architecture of [39] (radix-2, 10-bit I/O, variable length). Overall, four instances of the SDF architecture occupy 16,124 slices, 152 DSP blocks, and 152 RAM blocks, and operate at 25 MHz. Clearly, the solution presented in Sect. 5 requires fewer processing resources (almost half) than the straightforward solution. As a drawback of the proposed solution, we point out the extra memory resources required to buffer the data, as well as the increased operating frequency. We have also compared the proposed architecture to SDF architectures of higher radix, namely radix-2^2 and radix-2^3. Note that in these cases reconfiguration becomes quite involved, since it requires extra paths to bypass the stage processors that remain idle in the various configurations for different FFT lengths. Table 4 compares the complexity of the proposed architecture against three variations of the straightforward solution, which is based on four parallel SDF architectures: the first includes only radix-2 butterflies, the second realizes the radix-2^2 algorithm and the third the radix-2^3.

Table 4 FPGA implementation comparison: in-place variable-length against SDF architecture

Architecture   Xilinx slices   Operating frequency   DSP blocks   RAM blocks
In-place       8614            34 MHz                64           80
R-2            16124           25 MHz                152          152
R-2^2          13745           25 MHz                96           174
R-2^3          13467           25 MHz                88           174

6.2 Comparisons Between the Proposed FFT Architectures

To compare the three architectures with each other, we implemented distinct FFT modules on the same Xilinx Virtex 4 FPGA (XC4VLX200) technology. Table 5 reports the implementation results assuming three distinct applications. The first two cases assume a fixed length for the FFT input (128 and 1024 points, respectively), while the third case is for a variable-length FFT (128–2048 points). Note that, for the sake of a fair comparison, each implementation uses only FPGA slices, with no DSP blocks. As expected, the fully parallel FFT can sustain the highest throughput rate at the expense of an increased hardware cost. The cost difference between the parallel and the cascaded FFT cores increases significantly with the length of the FFT input. However, as Table 5 shows, the fully parallel architecture is the most prominent solution to tackle throughput rates as high as 50 GSamples/s. In another direction, the cascade FFT offers higher throughput rates with fewer hardware resources than the in-place reconfigurable FFT. Nonetheless, we must consider the fact that the reconfigurable architecture supports variable-length FFTs of up to 2048 points.
Similar modifications in the cascaded architecture would increase its hardware cost above 12,400 slices (due to the extra FFT stages and the bypass circuits). Moreover, the reconfigurable architecture is tailored for MIMO applications. A direct use of the cascaded approach in such applications requires either multiple FFT modules (with a significant cost increase) or one module operating at a multiple of the input rate (with higher power consumption). The results of Table 5 show the advantages of each architecture. On the one hand, the in-place variable FFT is suitable for mobile WiMAX and 3GPP LTE applications (MIMO-OFDMA systems) with up to 4 streams at 100 MSamples/s. On the other hand, the cascaded architecture is the most economical solution in SISO systems with FFT of fixed size, e.g., for fixed WiMAX, DVB, DAB, VDSL, BPL, and WLAN.
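The trade-off reported in Table 5 can be made concrete with two derived figures: the slice ratio between the fully parallel and the cascade cores, and the throughput obtained per slice. The sketch below only re-uses the numbers as read from Table 5; the dictionary layout and function names are hypothetical, and throughput per slice is an illustrative metric that the paper itself does not report.

```python
# (XC4V slices, throughput in samples/s) as read from Table 5, keyed by (architecture, case).
TABLE5 = {
    ("fully parallel", 1): (22270, 7e9),
    ("cascade", 1): (6236, 132e6),
    ("fully parallel", 2): (134523, 50e9),
    ("cascade", 2): (12400, 132e6),
    ("in-place", 3): (18981, 100e6),
}


def slice_ratio(case):
    """Slices of the fully parallel core divided by slices of the cascade core."""
    return TABLE5[("fully parallel", case)][0] / TABLE5[("cascade", case)][0]


def throughput_per_slice(arch, case):
    """Samples per second delivered per occupied slice (illustrative metric)."""
    slices, rate = TABLE5[(arch, case)]
    return rate / slices


if __name__ == "__main__":
    for case in (1, 2):
        print(case, round(slice_ratio(case), 2))
    for key in TABLE5:
        print(key, round(throughput_per_slice(*key)))
```

The slice ratio grows from about 3.6 for the 128-point case to about 10.8 for the 1024-point case, which is consistent with the observation that the cost gap between the parallel and cascaded cores widens with the FFT length.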
Table 5 Implementation of the proposed FFT architectures on Virtex 4 FPGA (XC4VLX200)

                                       case 1: 128-points, 5-bit I/O  case 2: 1024-points, 8-bit I/O case 3: variable length, 10-bit I/O
Architecture                           XC4V slices   throughput       XC4V slices   throughput       XC4V slices   throughput
Fully parallel                         22270         7 GSamples/s     134523        50 GSamples/s    –             –
Cascade                                6236          132 MSamples/s   12400         132 MSamples/s   –             –
In-place, reconfigurable (4 streams)   –             –                –             –                18981         100 MSamples/s
Table 6 Most suited FFT architecture for different telecommunication standards in Virtex4 FPGA technology

Standard           FFT arch.       XC4V slices   MSamples/s
DAB and DVB-T/H    cascade         15195         10
xDSL and BPL       cascade         25507         35
802.16d/e          in-place var.   18981         100
802.11n (MIMO)     in-place var.   15790         160
UWB (14 chan.)     parallel        22270         7000
Clearly, the in-place variable FFT implements very effective techniques to support multiple streams and/or different FFT lengths with small complexity overhead, while the cascaded FFT avoids any unnecessary overhead to efficiently support a single stream. Finally, for the case of UWB, the two above solutions fall short in terms of throughput rate, making the fully parallel architecture the most suitable approach. Table 6 summarizes the most suitable solution for each of the aforementioned standards (fully parallel, cascade, or in-place variable) and the relevant complexity on Virtex 4 FPGA technology (XC4VLX200). Again, each implementation uses only FPGA slices (no DSP blocks) for the sake of comparison. Note that in Table 6 the same FFT engine can support different standards with similar requirements: for instance, from Tables 1 and 2, DAB and DVB-T/H have similar throughput (around 10 MSamples/s) and arithmetic accuracy requirements (8-bit I/O), while the maximum FFT length is set by DVB; BPL and xDSL have similar throughput (around 30 MSamples/s) and arithmetic accuracy requirements (16-bit I/O), while the maximum FFT length is set by VDSL.
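The selection logic behind Table 6 can be summarized as a simple rule of thumb. The sketch below is only one possible reading of the discussion: the function name, the boolean flag and both thresholds are assumptions made for illustration (the thresholds re-use the cascade throughput of Table 5 and the highest in-place entry of Table 6), not design limits stated by the paper.

```python
CASCADE_MAX = 132.0    # MSamples/s, cascade throughput reported in Table 5 (assumed limit)
IN_PLACE_MAX = 160.0   # MSamples/s, highest in-place entry in Table 6 (assumed limit)


def pick_architecture(msamples_per_s, multi_stream_or_variable):
    """Suggest an FFT architecture for a throughput and flexibility requirement."""
    if multi_stream_or_variable and msamples_per_s <= IN_PLACE_MAX:
        return "in-place var."   # MIMO streams and/or run-time variable length
    if msamples_per_s <= CASCADE_MAX:
        return "cascade"         # single stream, moderate throughput
    return "parallel"            # throughput beyond the other two solutions


if __name__ == "__main__":
    rows = [
        ("DAB and DVB-T/H", 10, False),
        ("xDSL and BPL", 35, False),
        ("802.16d/e", 100, True),
        ("802.11n (MIMO)", 160, True),
        ("UWB (14 chan.)", 7000, False),
    ]
    for name, rate, flexible in rows:
        print(name, pick_architecture(rate, flexible))
```

Applied to the five rows of Table 6, this rule reproduces the architecture listed for each standard; it should not be read as a general design procedure.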
7 Conclusion

This paper has presented three distinct FFT/IFFT architectures aiming to support multi-carrier OFDM-based telecommunication systems. Introducing design features to enhance the fully parallel, cascade and reconfigurable architectural approaches led to the design of FFT/IFFT modules that are suitable for the most widely used protocols and improve the performance-to-cost ratio. The fully parallel architecture employs fine-grained techniques to achieve high throughput rates, up to tens of GSamples/s, such as those required in UWB when all channels are used. The cascade architecture leads to an efficient pipeline for SISO systems with large FFT length but moderate throughput (e.g. DSL, BPL, DVB-T/H, DAB). Finally, combining reconfiguration and in-place techniques results in a low-cost architecture fulfilling the requirements of MIMO systems with FFTs of variable size (WiMAX, MIMO WLAN). The design trade-offs are shown through the implementation results of each architecture. The comparison with the corresponding solutions in the literature favors the three architectures proposed in this paper.

Acknowledgement This work was supported by the European Commission in the framework of the FP7 Network of Excellence in Wireless COMmunications NEWCOM++ (contract n. 216715).
References

1. P. Amirshahi, M. Navidpour, M. Kavehrad, Performance analysis of uncoded and coded OFDM broadband transmission over low voltage power-line channels with impulsive noise. IEEE Trans. Power Deliv. 21(4), 1927–1934 (2006)
2. J.G. Andrews, A. Ghosh, R. Muhamed, Fundamentals of WiMAX, Understanding Broadband Wireless Networking. Prentice Hall Communications Engineering and Emerging Technologies Series (Prentice Hall, New York, 2007)
3. F. Baronti et al., Design and verification of hardware building blocks for high-speed and fault-tolerant in-vehicle networks. IEEE Trans. Ind. Electron. 58(3), 792–801 (2011)
4. G. Bi, E. Jones, A pipelined FFT processor for word-sequential data. IEEE Trans. Acoust. Speech Signal Process. 37(12), 1982–1985 (1988)
5. J. Bingham, Multicarrier modulation for data transmission: an idea whose time has come. IEEE Commun. Mag. 28(5) (1990)
6. R. Cabral, S. Escarigo, H. Neto, H. Sarmento, Implementation of a DAB receiver with FPGA technology, in Proc. IEEE ICCE, Jan 2006, pp. 397–398
7. A. Chimenti et al., VLSI architecture for a low-power video codec system. Microelectron. J. 33(5–6), 417–427 (2002)
8. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. IEEE Trans. Electron. Comput. EC-15(4), 680–681 (1966)
9. A.R. Cooper, Parallel architecture modified booth multiplier. IEE Proc. G, Electron. Circuits Syst. 135, 125–128 (1988)
10. A. Cortes, I. Velez, J. Sevillano, A. Irizar, An FFT core for DVB-T/DVB-H receivers, in Proc. Third IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Dec 2006, pp. 102–105
11. M. Deinzer, M. Stoger, Integrated PLC-modem based on OFDM, in Int. Sym. On Power-line Communications and its Applications (ISPLC'99) (1999)
12. C. Del-Toso, M. Nava, A short overview of the VDSL system requirements. IEEE Commun. Mag. 40(12), 82–90 (2002)
13. P. Duhamel, M. Vetterli, Fast Fourier transforms: a tutorial review and a state of the art. Signal Process. 19(4), 259–299 (1990)
14. L. Fanucci et al., A parametric VLSI architecture for video motion estimation. Integration 31(1), 79–100 (2001)
15. L. Fanucci et al., Parametrized and reusable VLSI macrocells for the low-power realization of 2-D discrete-cosine-transform. Microelectron. J. 32(12), 1035–1045 (2001)
16. L. Fanucci et al., Power optimization of an 8051-compliant microcontroller. IEICE Trans. Electron. 88(4), 597–600 (2005)
17. B. Farahani, M. Ismail, WiMAX/WLAN radio receiver architecture for convergence in WMANS, in IEEE 48th Midwest Symposium on Circuits and Systems, Aug 2005, pp. 1621–1624
18. High rate ultra wideband PHY and MAC standard, Dec 2005, standard ECMA-368
19. H. Holma, A. Toskala, LTE for UMTS, OFDMA and SC-FDMA Based Radio Access (Wiley, New York, 2009)
20. IEEE 802.11-05/1102r4, IEEE P802.11 Wireless LANs Joint Proposal: High throughput extension to the 802.11 Standard: PHY, Jan 2006
21. Y. Jung, J. Kim, S. Lee, H. Yoon, J. Kim, Design and implementation of MIMO-OFDM baseband processor for high-speed wireless LANs. IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process. 54(7), 631–635 (2007)
22. M. Kornfeld, DVB-H - the emerging standard for mobile data communication, in IEEE International Symposium on Consumer Electronics, Sept. 2004, pp. 193–198
23. J. Lee, J. Moon, K. Heo, M. Sunwoo, S. Oh, I. Kim, Implementation of application-specific DSP for OFDM systems, in Proc. IEEE International Conference on Circuits and Systems (ISCAS), May 2004, vol. 3, pp. 665–668
24. X. Li, Z. Lai, J. Cui, A low power and small area FFT processor for OFDM. IEEE Trans. Consum. Electron. 53(2), 274–277 (2007)
25. Y.-W. Lin, C.-Y. Lee, Design of an FFT/IFFT processor for MIMO OFDM systems. IEEE Trans. Circuits Syst. I 54(4), 807–815 (2007)
26. N. L'insalata et al., Automatic synthesis of cost effective FFT/IFFT cores for VLSI OFDM systems. IEICE Trans. Electron. E91-C(4), 487–496 (2008)
27. K. Nakos, D. Reisis, N. Vlassopoulos, Addressing technique for parallel memory accessing in Radix-2 FFT Processors, in IEEE Int. Conference on Electronics, Circuits and Systems (ICECS), Sep 2008, pp. 52–56
28. S. Oraintara, Y.J. Chen, T.Q. Nguyen, Integer Fast Fourier Transform. IEEE Trans. Signal Process. 50(3), 607–618 (2002)
29. S. Perels, D. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, H. Bolcskei, ASIC implementation of a MIMO OFDM transceiver for 192 Mbps WLAN, in Proc. IEEE ESSCIRC2005 (2005)
30. K. Prakash, M.M. Rao, Fixed-point error analysis of radix-4 fht algorithm with optimised scaling schemes. IEE Proc., Vis. Image Signal Process. 142, 65–70 (1995)
31. S. Saponara, L. Fanucci, VLSI design investigation for low-cost, low-power FFT/IFFT processing in advanced VDSL transceivers. Microelectron. J. 34(2), 133–148 (2003)
32. S. Saponara, K. Denolf, G. Lafruit, C. Blanch, J. Bormans, Performance and complexity co-evaluation of the advanced video coding standard for cost-effective multimedia communications. EURASIP J. Appl. Signal Process. 2004(2), 220–235 (2004)
33. S. Saponara, L. Fanucci, S. Marsi, G. Ramponi, Algorithmic and architectural design for real-time and power-efficient Retinex image/video processing. J. Real-Time Image Process. 1(4), 267–283 (2007)
34. S. Saponara, L. Fanucci, S. Marsi, G. Ramponi, D. Kammler, E. Witte, Application-specific instruction-set processor for retinex-like image and video processing. IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process. 54(7), 596–600 (2007)
35. S. Saponara, L. Fanucci, P. Terreni, Architectural-level power optimization of microcontroller cores in embedded systems. IEEE Trans. Ind. Electron. 54(1), 680–683 (2007)
36. S. Saponara, P. Nuzzo, C. Nani, G. Van der Plas, L. Fanucci, Architectural exploration and design of time-interleaved SAR arrays for low-power and high speed A/D converters. IEICE Trans. Electron. 92-C(6), 843–851 (2009)
37. R.S. Sherrat, O. Cadenas, N. Goswami, A low clock frequency FFT core implementation for multiband full-rate ultra-wideband (UWB) receivers. IEEE Trans. Consum. Electron. 51(3), 798–802 (2005)
38. D. Skellern, A high-speed wireless LAN. IEEE MICRO 17(1), 40–47 (1997)
39. C.D. Thompson, Fourier transform in VLSI. IEEE Trans. Comput. C-32(11), 1047–1057 (1983)
40. F. Vitullo et al., Low-complexity link microarchitecture for mesochronous communication in Networks-on-Chip. IEEE Trans. Comput. 57(9), 1196–1201 (2008)
41. J. Walko, Click here for VDSL2. Commun. Eng. 3(4), 9–12 (2005)
42. C.-C. Wang, J.-M. Huang, H.-C. Cheng, A 2K/8K mode small-area FFT processor for OFDM demodulation of DVB-T receivers. IEEE Trans. Consum. Electron. 51(1), 28–32 (2005)