2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip
1024-Point Pipeline FFT Processor with Pointer FIFOs Based on FPGA
Guanwen Zhong, Hongbin Zheng, ZhenHua Jin, Dihu Chen and Zhiyong Pang∗
School of Physics and Engineering, Sun Yat-sen University, Guangzhou 510275, P.R. China
Email:
[email protected]
Abstract—This paper presents the design and optimized implementation of 16-bit and 32-bit 1024-point pipeline FFT processors. The architecture is based on the R2²SDF algorithm, with a new pointer FIFO that embeds Gray code counters. It is implemented on Spartan-3E, Spartan-6 and Virtex-4 devices and fully tested by co-simulation, using SMIMS® VeriLink® as a bridge between software (Matlab® Simulink®) and real hardware (the FPGA targets). The implementation results show that our pointer FIFO FFT processor uses fewer resources while achieving higher performance. Our 16-bit 1024-point FFT processor costs only 2580 slices, 2030 slice flip-flops and just 2 block RAMs, and achieves a maximum clock frequency of 92.6 MHz with a throughput per area of 0.035 Msamples/s/slice. Because the input wordlength, output wordlength, twiddle factor wordlength and number of processing stages are parameterized, a 16-point, 64-point, 256-point, 1024-point, 4096-point or larger power-of-4 pointer FIFO FFT processor can easily be synthesized from the same code simply by modifying the corresponding parameters.
Index Terms—FFT; Pointer FIFO; Radix-2² SDF; Co-simulation.
Fig. 1. The architecture of the FFT processor. (b) The architecture of the 1024-point FFT processor based on R2²SDF.
I. INTRODUCTION
The Fast Fourier Transform (FFT) is a very efficient algorithm for computing the Discrete Fourier Transform (DFT) and is used in a wide range of applications in modern digital processing and communication systems, such as Orthogonal Frequency Division Multiplexing (OFDM), radar, spectrum analysis and high-speed image processing. The pipeline FFT is a special class of FFT algorithms that computes the FFT in a sequential manner, and its architectures have been studied since the 1970s. There are many ways to implement pipeline FFT hardware, and the most commonly used fall into three kinds of pipelined architectures: multiple delay commutator (MDC), single delay commutator (SDC) and single delay feedback (SDF). In this paper we focus on the Radix-2² SDF architecture and implement its FIFOs with a new method, the pointer FIFO embedded with Gray code counters.
The rest of the paper is organized as follows. Section II discusses the common pipeline FFT architectures and focuses on the Radix-2² SDF architecture, especially the new pointer FIFO implementation and some architecture-specific optimizations. Section III describes the verification by co-simulation with Matlab® Simulink®, SMIMS® VeriLink® and FPGA boards, the implementation results and the comparisons with other Radix-2² SDF FFT processors. The conclusions and future work are given in Section IV.
978-1-4577-0170-2/11/$26.00 ©2011 IEEE
Fig. 1 (a). 256-point R2²SDF architecture.
II. PIPELINE FFT ARCHITECTURE
A. The Selection of FFT Algorithms
The common algorithms for implementing FFT processors include Radix-2, Radix-4, Radix-2² and Split-Radix (SRMDC). The detailed derivation of these algorithms can be found in [1] and [2]. As shown in [1] and [2], which list the major characteristics and resource requirements of the pipeline FFT architectures, the Radix-2² Single Delay Feedback (R2²SDF) architecture provides the highest computational efficiency while remaining simple to implement in hardware, so it was selected as the basic architecture of our FFT processor.
B. R2²SDF Architecture
The R2²SDF architecture was proposed by He and Torkelson [1]. Fig. 1 (a) shows a 256-point R2²SDF FFT processor, which has four levels. Similarly, a 1024-point R2²SDF FFT has five levels, as shown in Fig. 1 (b). Fig. 2 details one level. Each level includes two butterfly processors, two FIFOs of different sizes, a −j unit, a twiddle factor unit and a controller that determines when to enable the −j and twiddle factor modules. This paper focuses on optimizing the FIFOs, using pointer FIFOs embedded with Gray code counters, of the
Fig. 2. The architecture of the R2²SDF butterfly processor.
Fig. 3. The architecture of the pointer FIFO.
Fig. 4. The dual n-bit Gray code counter block diagram (based on [5]).
R2²SDF algorithm. The other parts of the implementation of the R2²SDF algorithm are described in detail in [1], [2], [3] and [4].
C. Pointer FIFO Implementation
The buffers that store the data coming out of the butterfly processors are generally implemented either with two RAMs, known as ping-pong buffers, or with a single RAM used as a shift register. The ping-pong buffer has two RAMs that are read and written simultaneously, while the shift register uses only one RAM but costs a lot of slice flip-flops. The pointer FIFO achieves the same function with fewer resources than either of these. It consists of one RAM, called the FIFO memory, of the same size as the RAM in the other two schemes, a FIFO write pointer and full signal generator (FIFO wptr&full), a FIFO read pointer signal generator (FIFO rptr) and an address controller. Fig. 3 shows the architecture of the pointer FIFO.
1) FIFO Memory: The FIFO memory consists of one dual-port RAM. The dual-port RAM stores the outputs of the butterfly processors, and the FFT processor handles the data according to the signals given by the controller.
2) Controller: The controller traces the data flow. For clarity, take a 64-point FFT as an example. During the first 32 cycles, the butterfly processor of stage 1 does not operate on the data and puts it directly into the FIFO memory of the same stage. When the 32nd sample enters the memory, the delay signal of the controller is enabled, and during the next 32 cycles the controller asserts the rinc signal, which enables the FIFO rptr signal generator to count the read address and feed back the important signal rptr to the part
of the FIFO wptr&full. After the second group of 32 cycles, the incoming data have been combined with the earlier data read from the FIFO memory; the computed data (the results of the subtract operations) are first held in registers to delay them by 2 clocks and then put into the FIFO memory. After all the results have been stored, the delay signal is disabled and the controller waits for the next 32 cycles.
3) FIFO wptr&full: The FIFO wptr&full consists of a dual n-bit Gray code counter and the full signal generator. Unlike a binary counter, in which several bits can change on a single increment, a Gray code counter changes only one bit per increment. In addition, with the MSB inverted, the second half of the n-bit Gray code is a mirror image of the first half. Fig. 4 is a block diagram of a dual n-bit Gray code counter [5]. The Gray code outputs are passed to a Gray-to-binary converter (bin), whose output is passed to a conditional binary-value incrementer to generate the next binary count value (bnext), which is passed to a binary-to-Gray converter that generates the next Gray count value (gnext), which is in turn passed to the register inputs. The signal ptr[n-1:0] (wptr in Fig. 3) is passed to the read clock domain, and ptr[n-1:0] is also used to address the FIFO buffer. The addrmsb signal is formed from the inverted MSB and the inverted second MSB of gnext; addrmsb and ptr[n-2:0] together form the signal that is compared with the rptr signal passed from the FIFO rptr module. If the two signals are equal, the write pointer has caught up with the synchronized read pointer, and the FIFO wptr&full module asserts the full signal wfull.
4) FIFO rptr: The FIFO rptr module is similar to the FIFO wptr&full module. It also consists of a dual n-bit Gray code counter.
D. Optimization
1) Multiplier Unit: A complex multiplication is usually computed as:

$$(a + bj)(c + dj) = (ac - bd) + (ad + bc)j \tag{1}$$

This method costs four multipliers and two add/sub operations. However, simply by rearranging a, b, c and d, the equation above can be written as:

$$(a + bj)(c + dj) = (Ab + Cc) + (Ba - Cc)j \tag{2}$$
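As a quick numerical check of the rearrangement in (2), the sketch below compares the four-multiplier form of (1) with the three-multiplier form, using the paper's definitions A = c − d, B = c + d and C = a − b. The function names are hypothetical and plain Python integers stand in for fixed-point samples; this is an illustration of the arithmetic only, not the authors' hardware description.

```python
# Hypothetical helper names; integers stand in for fixed-point samples.

def cmul_four(a, b, c, d):
    """Direct form of (1): four multiplications, two add/sub operations."""
    return a * c - b * d, a * d + b * c


def cmul_three(a, b, c, d):
    """Rearranged form of (2): three multiplications, five add/sub operations."""
    A = c - d               # add/sub 1
    B = c + d               # add/sub 2
    C = a - b               # add/sub 3
    shared = C * c          # multiplication 1, reused by both outputs
    real = A * b + shared   # multiplication 2, add/sub 4
    imag = B * a - shared   # multiplication 3, add/sub 5
    return real, imag


if __name__ == "__main__":
    samples = [(3, 4, 5, 6), (-7, 2, 1, -9), (32767, -32768, 123, -456)]
    for operands in samples:
        # Both forms must give the same real and imaginary parts.
        assert cmul_three(*operands) == cmul_four(*operands)
    print(cmul_three(3, 4, 5, 6))  # (3 + 4j)(5 + 6j) = -9 + 38j
```

The product C·c appears in both the real and the imaginary part, which is why one multiplier can be saved at the price of three extra add/sub operations.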
Fig. 5. Verification methods, panels (a) and (b).
Fig. 6. The results of the Pointer FIFO FFT and the Matlab FFT.
Fig. 7. Co-simulation with Xilinx® boards, Matlab® Simulink® and SMIMS® VeriLink®. (a) The Simulink® and VeriLink® model composed of functional blocks. (b) The sine wave after FFT processing.
Here A equals (c − d), B equals (c + d) and C equals (a − b). This method requires only three multipliers and five add/sub operations to compute a complex multiplication.
2) Twiddle Factor: Because of the large number of twiddle factors, this paper uses block RAMs configured as ROMs to store them. This saves the distributed RAM that would otherwise be built from slices.
3) Reversed Unit: Using a controller that consists of several counters, together with the reversed mapping data stored in the corresponding ROM, the Reversed Unit outputs the result of the input data after FFT processing.
III. VERIFICATION AND EXPERIMENTAL RESULTS
A. Verification
1) Method 1: Fig. 5 (a) shows the block diagram of the comparison between the Pointer FIFO FFT and the Matlab® FFT. A fixed-point sine signal generated by Visual C++ is used as the input to both the Pointer FIFO FFT and the Matlab® FFT. After processing, the results are plotted with Matlab® and shown in Fig. 6. As Fig. 6 shows, the output of the Pointer FIFO FFT is nearly the same as that of the Matlab® FFT.
2) Method 2: SMIMS® VeriLink® is used as a bridge that connects software (Simulink®) and hardware for co-simulation, as shown in Fig. 5 (b). With Simulink® from Matlab® and VeriLink® from SMIMS®, input data generated by the Simulink® sine wave block can be fed directly into the FFT processor downloaded to the FPGA board through the RT2C block provided by SMIMS® VeriLink®, which converts and scales real-format input to two's complement. Simulink® can then read the output data directly from the FPGA device after FFT processing and display the corresponding waveforms with the Simulink® scope block. The TWOCTR block provided by SMIMS® VeriLink® converts and scales two's-complement input to real format. Fig. 7 (a) shows the Simulink® and VeriLink® model, composed of functional blocks, that is used to co-simulate the FFT processor (hardware). Fig. 7 (b) shows the two impulses that form the spectrum of the sine signal.
B. Experimental Results
Table I gives the performance results of the R2²SDF architectures based on shift registers and on pointer FIFOs with different FFT sizes on Spartan-3E FPGAs, together with the FFTs reported in [4]. Because of the shift operation, the FFT based on shift registers costs many more slices than the FFTs based on pointer FIFOs and Bin Zhou's [4]. Each kind of FFT keeps the same maximum frequency across different FFT sizes because of the pipeline architecture. From Table I, the pointer FIFO R2²SDF FFT with a data width of 16 occupies only 2580 slices and 2 block RAMs, fewer than the R2²SDF of [4], and its throughput per area is better than that of the other FFTs. A summary of the performance comparison with selected FFT implementations is shown in Table II. Figures for the Amphion core CS2411XV [8], the Sundance core FC200 [7], the Xilinx core version 2 for Virtex-E [6] and Bin Zhou's [4] are given together with those of our FFT. The resulting figures show different features of the implementations of the R2²SDF FFT. Our implementation outperforms Sundance's [7], Bin Zhou's [4], and Sukhsawas and Benkrid's
TABLE I
IMPLEMENTATION RESULTS ON SPARTAN-3E DEVICES

| Design | FFT size | Input data width | Twiddle factor width | Slices | Slice flip-flops | BRAMs | Maximum frequency (MHz) | Latency (clk) | Transform time (µs) | Throughput (MS/s) | Throughput/area (MS/s/slice) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Our PFR2²SDF¹ | 1024 | 32 | 32 | 3692 | 2788 | 5 | 71.947 | 1046 | 14.54 | 70.44 | 0.019 |
| Our SFR2²SDF² | 1024 | 32 | 32 | 23425 | 39056 | 5 | 71.947 | 1046 | 14.54 | 70.44 | 0.003 |
| Our PFR2²SDF | 1024 | 16 | 16 | 2580 | 2030 | 2 | 92.595 | 1046 | 11.30 | 90.62 | 0.035 |
| Our SFR2²SDF | 1024 | 16 | 16 | 13485 | 22303 | 2 | 92.595 | 1046 | 11.30 | 90.62 | 0.007 |
| R2²SDF [4] | 1024 | 16 | 16 | 4409 | Not given | 8 | 123.84 | 1041 | 8.27 | 123.84 | 0.028 |
| R4SDC [4] | 1024 | 16 | 16 | 2802 | Not given | 8 | 95.25 | 1042 | 10.75 | 95.25 | 0.034 |

¹ PFR2²SDF: Pointer FIFOs R2²SDF
² SFR2²SDF: Shift Registers R2²SDF
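The derived columns of Table I can be cross-checked against the measured ones. The sketch below assumes, as the pointer FIFO rows suggest, that transform time is latency divided by maximum frequency, that throughput is FFT size divided by transform time, and that throughput per area is throughput divided by slices. These formulas are inferred from the table rather than stated in the paper, the function name is hypothetical, and the rows taken from [4] appear to follow a different convention (throughput equal to the clock frequency), so only one of our rows is used.

```python
def derived_metrics(n_points, latency_clk, fmax_mhz, slices):
    """Recompute Table I's derived columns under the assumed formulas."""
    # One clock period is 1 / fmax_mhz microseconds.
    transform_time_us = latency_clk / fmax_mhz
    # Samples per microsecond equals Msamples/s.
    throughput_msps = n_points / transform_time_us
    throughput_per_slice = throughput_msps / slices
    return transform_time_us, throughput_msps, throughput_per_slice


if __name__ == "__main__":
    # 16-bit pointer FIFO R2²SDF row of Table I.
    t_us, msps, per_slice = derived_metrics(1024, 1046, 92.595, 2580)
    print(f"transform time  {t_us:.2f} us")
    print(f"throughput      {msps:.2f} MS/s")
    print(f"throughput/area {per_slice:.3f} MS/s/slice")
```

For this row the recomputed values agree with the table to within rounding: roughly 11.30 µs, about 90.6 MS/s and 0.035 MS/s/slice.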
TABLE II
PERFORMANCE COMPARISON ON VIRTEX-E DEVICES

| Design | FFT size | Input data width | Twiddle factor width | Slices | BRAMs | Maximum frequency (MHz) | Latency (clk) | Transform time (µs) | Throughput (MS/s) | Throughput/area¹ (MS/s/slice) |
|---|---|---|---|---|---|---|---|---|---|---|
| Amphion [3] | 1024 | 13 | 13 | 1639 | 9 | 57 | 5097 | 71.86 | 14.25 | 0.009 |
| Xilinx [3], [6] | 1024 | 16 | 16 | 1968 | 24 | 83 | 4096 | 49.35 | 20.75 | 0.011 |
| Sundance [7] | 1024 | 16 | 10 | 8031 | 20 | 49 | 1320 | 27.00 | 49.00 | 0.006 |
| Sukhsawas R2²SDF [3] | 1024 | 16 | 16 | 7365 | 28 | 82 | 1099 | 12.49 | 82.00 | 0.011 |
| R2²SDF [4] | 1024 | 16 | 16 | 5008 | 32 | 95 | 1042 | 10.78 | 95.00 | 0.019 |
| Our PFR2²SDF | 1024 | 16 | 16 | 3390 | 11 | 53.978 | 1046 | 19.38 | 52.84 | 0.016 |

¹ Block RAMs are not taken into consideration.
[3] in slice (only 3390) and block RAM (only 11) cost, but has a lower maximum clock frequency than theirs, at 54 MHz. Its throughput per area of 0.016 Msamples/s/slice is better than the others' except for Bin Zhou's, whose throughput per area is 0.019 Msamples/s/slice. According to the corresponding reference, the Xilinx FFT IP core on Virtex-E in Table II shows four times the latency (4096 cycles) because of its internal architecture, while our FFT has a latency of only 1046 cycles thanks to its internal pipeline architecture. Our FFT uses the fewest block RAMs except for Amphion's, whose latency is 5097 cycles.
IV. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a new R2²SDF architecture based on pointer FIFOs embedded with Gray code counters, which makes the FFT processor more stable when dealing with large amounts of data. The 16-bit 1024-point R2²SDF FFT based on pointer FIFOs embedded with Gray code counters reached a maximum clock frequency of 92.6 MHz and used only 2580 slices and 2030 slice flip-flops with just 2 block RAMs, giving a throughput of 92.6 Msamples/s on Spartan-3E devices. Our pointer FIFO FFT processor outperforms the others in slice and block RAM cost and has a high throughput per area. This paper also presented a new verification method, hardware and software co-simulation, which uses Matlab® Simulink® and SMIMS® VeriLink® together with FPGA boards to test the implementation of the R2²SDF FFT based on pointer FIFOs.
Future work includes improving the pointer FIFO embedded with Gray code counters for slice optimization, improving the multipliers to raise the maximum clock frequency, an implementation with CORDIC arithmetic, and power and SQNR analysis. These improvements should result in higher performance.
REFERENCES
[1] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proceedings of the 10th International Parallel Processing Symposium (IPPS ’96), Apr. 1996, pp. 766–770.
[2] J. Garcia, J. Michell, and A. Buron, “VLSI configurable delay commutator for a pipeline split radix FFT architecture,” IEEE Transactions on Signal Processing, vol. 47, no. 11, pp. 3098–3107, Nov. 1999.
[3] S. Sukhsawas and K. Benkrid, “A high-level implementation of a high performance pipeline FFT on Virtex-E FPGAs,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2004, pp. 229–232.
[4] B. Zhou and D. Hwang, “Implementations and optimizations of pipeline FFTs on Xilinx FPGAs,” in International Conference on Reconfigurable Computing and FPGAs (ReConFig ’08), 2008, pp. 325–330.
[5] C. E. Cummings, “Synthesis and scripting techniques for designing multi-asynchronous clock designs,” SNUG 2001 (Synopsys Users Group Conference, San Jose, CA, 2001) User Papers, no. Section MC1, 3rd paper, Mar. 2001.
[6] Xilinx, Inc., “High-performance 1024-point complex FFT/IFFT v2.0,” San Jose, CA, USA, Jul. 2000. [Online]. Available: http://www.xilinx.com/ipcenter
[7] Sundance Multiprocessor Technology Ltd., “1024-point fixed point FFT processor,” Jul. 2008. [Online]. Available: http://www.sundance.com/web/files/productpage.asp?STRFilter=FC200
[8] Amphion Semiconductor Ltd., “1024 point block based FFT/IFFT,” Apr. 2002. [Online]. Available: http://www.amphion.com/signal.html