
Design Automation for a 3DIC FFT Processor for Synthetic Aperture Radar: A Case Study

Thorlindur Thorolfsson [email protected]

Kiran Gonsalves [email protected]

Paul D. Franzon [email protected]

Department of Electrical and Computer Engineering North Carolina State University Box 7911 Raleigh, NC 27695

ABSTRACT

This work discusses a 1024-point, memory-on-logic 3DIC FFT processor for synthetic aperture radar (SAR), sent to fabrication in the 180 nm MIT Lincoln Labs 3D FDSOI 1.5 V process[12], along with the design flow required to realize it with off-the-shelf commercial 2D tools. The work shows how the vertical dimension can be exploited for novel memory architecture tradeoffs that are not feasible in 2D, reducing the energy consumed per memory operation in the FFT by 60.3%. In comparison to its 2D counterpart, the SAR FFT processor exhibits a 53.0% decrease in average wire length, a 24.6% increase in maximum operating frequency and a 25.3% decrease in total silicon area.

Categories and Subject Descriptors B.7.1 [Integrated Circuits]: Types and Design Styles; C.4 [Performance Of Systems]: Design studies

General Terms Design

Keywords SAR, 3DIC, TSV, FFT

1.

INTRODUCTION

New developments in fabrication technology allow vertical integration using 3D thru-silicon vias. Vertical integration has the potential to cut wire length drastically for standard cell designs, as reported by Davis et al.[5]. This is important because, as designs move to smaller feature sizes, wires will increasingly dominate the delay and power budgets of digital logic circuits[7]. 3D integration faces several major obstacles, including increased thermal densities[10], increased test costs, and the lack of commercial EDA tool support for 3DICs[2]. It has been shown that using custom 3D placement and routing tools, a 28-51% reduction in total wire length can be achieved[4]. In this paper we demonstrate one way in which an application-specific processor can be re-architected to take advantage of 3DIC technology, using an FFT engine designed for a Synthetic Aperture Radar (SAR) processor as a case study. We re-architected the baseline design to reduce power and total area simultaneously by taking advantage of 3DIC with through-silicon via (TSV) technology. Furthermore, by separating the logic and memory layers we simplify the tool flow so that the design can be executed easily using extensions of current 2D CAD tools. 3DIC stacking is used to interconnect the power-optimized memory to the logic tier while reducing overall area. The design has been sent to fabrication at MIT Lincoln Laboratory, but fabrication of the chip has not yet been completed.

The paper is organized as follows. Section 2 describes the algorithm on which the SAR FFT processor is based. Section 3 describes the architecture of the SAR FFT processor. Section 4 describes the tool flow and manufacturing process. Finally, Section 5 compares the metrics of the 3D implementation to a 2D equivalent.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC’09, July 26-31, 2009, San Francisco, California, USA Copyright 2009 ACM 978-1-60558-497-3/09/07....10.00

2.

SAR ALGORITHM

Synthetic aperture radar, unlike most radar, is used for imaging. While conventional images are formed using the visible spectrum, SAR images are formed using the radio region of the spectrum. A tremendous amount of digital signal processing and memory bandwidth is required to form a SAR image. The required digital signal processing and memory bandwidth increase exponentially with the desired image resolution. This makes a SAR FFT processor an excellent candidate to demonstrate the memory bandwidth benefits that 3D integrated circuits can provide.

The image-forming algorithm used for the SAR FFT processor is derived from the one used in the RASSP[6] project and is based on the Range Doppler Algorithm[8]. The steps required to form the SAR image, along with the portion of the floating point operations performed by each step (for 30 cm imaging resolution), are shown in Table 1. Note that the majority of the floating point operations are FFT/IFFT operations, which occur in steps 2, 3 and 5.
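As a quick check of the operation breakdown in Table 1, the short Python sketch below tallies the per-step shares and recovers the combined FFT/IFFT share of steps 2, 3 and 5. Only the percentages come from Table 1; the dictionary layout and all names are illustrative assumptions.

```python
# Share of floating point operations per SAR step (percentages from Table 1).
# The key names and the data layout are illustrative assumptions.
SAR_STEPS = {
    "range_fir": 35.6,      # 1) Range Low Pass FIR Filtering
    "range_fft": 12.3,      # 2) Range Fast Fourier Transform
    "azimuth_fft": 22.3,    # 3) Azimuth Fast Fourier Transform
    "azimuth_cmul": 3.8,    # 4) Azimuth Complex Multiply
    "azimuth_ifft": 26.0,   # 5) Azimuth Inverse Fast Fourier Transform
}

FFT_STEPS = ("range_fft", "azimuth_fft", "azimuth_ifft")  # steps 2, 3 and 5


def fft_share(steps=SAR_STEPS, fft_steps=FFT_STEPS):
    """Combined percentage of operations spent in FFT/IFFT steps."""
    return round(sum(steps[name] for name in fft_steps), 1)


def total_share(steps=SAR_STEPS):
    """Sum of all listed shares, as a sanity check on the table."""
    return round(sum(steps.values()), 1)


if __name__ == "__main__":
    print("FFT/IFFT share:", fft_share(), "%")
    print("All steps:", total_share(), "%")
```

The combined figure matches the last row of Table 1, which is why the FFT engine is the natural target for optimization.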

3.

ARCHITECTURE

As shown in Table 1 and discussed in Section 2, the majority of SAR processing involves computing FFTs. As a result, the main objective of the design is to efficiently calculate the FFTs used in the SAR algorithm. We use a radix-2 Cooley-Tukey FFT[3] for all the FFT calculations in the processor.

Table 1: The Steps in the SAR Algorithm.

Step                                           % of floating point operations
1) Range Low Pass FIR Filtering                35.6%
2) Range Fast Fourier Transform                12.3%
3) Azimuth Fast Fourier Transform              22.3%
4) Azimuth Complex Multiply                    3.8%
5) Azimuth Inverse Fast Fourier Transform      26.0%
FFT Steps Combined (Steps 2, 3 & 5)            60.6%

A radix-2 FFT has a data dependency that resembles a hypercube. This hypercube data dependency can be exploited in two ways. First, a radix-2 FFT will process two memory locations every cycle, one of which will have odd parity while the other will have even parity. As a result we can split the processing memory into two independent memory groups that never need to be accessed at the same time. Second, we can sub-divide the even and odd groups into smaller subgroups where each processing element is only connected to the absolute minimum number of memory locations required to successfully compute the FFT. Furthermore, in this subdivision each memory subgroup is not accessed by more than one processing element at the same time.

The benefit of splitting the memories into smaller subgroups is that smaller memories are faster, and since each memory subgroup can be accessed simultaneously, the system can perform a greater number of reads and writes per cycle. Conversely, a single memory requires less area, as only one set of peripheral logic (write driver and sense amp) is required, and it has a shorter and simpler interconnect structure between the logic and the memory. We use Cacti 4.1[13] to assess the architectural tradeoff by comparing the properties of a single 8 kByte memory to sixteen 512 Byte memories. The memory-core area savings of using a single memory would have been 67.6%. By using multiple smaller memories, the energy per read is reduced by 60.8% (from 68.205 to 26.718 pJ), the energy per write is reduced by 57.6% (from 14.48 to 6.142 pJ) and the memory bandwidth is increased by 854.9% (13.4 to 128.4 GBps). The number of wires interconnecting the memory to the logic increases from 150 to 2272. 3DIC stacking is used to minimize the area impact of the added wires. This tradeoff is illustrated in Figure 1.

Figure 1: The memory design tradeoffs.

Overall, the architecture we fabricated can process a 1024-pixel wide image using an FFT of the same width. Each pixel/data point in the FFT has a precision of 32 imaginary and 32 real bits. Computing an N-point FFT requires N/2 FFT twiddle factors of the same precision. To minimize the actual storage of the FFT twiddle factors we utilize two optimizations. First, we use trigonometric properties to reduce the number of twiddle factors stored[11] from N/2 to N/8 + 1. For the 1024-point FFT this reduces the number of twiddle factors stored from 512 to 129. Second, if any bit is the same for all the words, that bit is hard coded into the processing element rather than being stored in the ROM. This optimization reduces the number of bits required to store each twiddle factor from 64 to 52.

We use the memory-dividing scheme described above to divide the processing memory into 32 smaller memories (16 even and 16 odd). Furthermore, every memory is dual-ported (one read and one write port). Overall, this allows the system to perform 32 memory accesses per cycle (16 reads and 16 writes), completing a 1024-point FFT in 653 cycles assuming five pipeline stages.

The components of the architecture are described below and illustrated in Figure 3. The system consists of four types of components: eight processing elements, one controller, thirty-two SRAMs, and eight ROMs. The processing elements are the core of the system, implementing the FFT butterfly with four floating point multipliers and six addition/subtraction units. The internal structure of the processing element is shown in Figure 2. The controller orchestrates the overall operation of the system by setting the addresses and read enables of the memories. The controller requires very little communication with the processing elements, only three signals per processing element. The SRAMs implement the main processing memory using 8-transistor dual-ported SRAMs. The ROMs store the FFT twiddle factors and are implemented as single-ported NOR-type ROMs.

Figure 2: The internal structure of a processing element.

The SAR FFT processor architecture is a good example of a design that has a significant number of heavily shared and interconnected resources. For this reason it can be expected to benefit significantly from 3D integration, due to the long wires in the interconnect between these resources (memories and processing elements).

Figure 3: The SAR FFT processor architecture.

4.

IMPLEMENTATION AND TOOL FLOW

In this section we discuss the design, implementation, tool flow and manufacturing process used to create the system. Before the design flow is explained, it is important to understand the manufacturing process. The MIT Lincoln Labs’ manufacturing process is a three-tier, 180 nm wafer-scale 3D integration process[9, 1]. It features a 1.5 V low-power fully depleted silicon-on-insulator CMOS technology with one layer of polysilicon, three metal layers per tier and a back-metal layer between the top two tiers, with an additional metal layer on top of the entire stack. The bottom tier is named A, the middle tier B and the top tier C. Tier A is closest to the heat sink. Tier C is the only tier which has off-chip inputs and outputs. Tiers B and C face down, while tier A faces up. Figure 4 shows a side view of the process with the thru-silicon vias and the orientation of the tiers shown. In this process the dimensions of a single thru-silicon via are 2.5 × 2.5 μm and the smallest pitch the vias can be placed on is 3.9 μm.

Overall, the design is a mix of standard cell and full custom design. The processing elements and controller are coded in Verilog, while the memories (SRAMs and ROMs) are implemented using full custom design. One of the benefits of doing the memories in full custom, rather than using an off-the-shelf memory generator, is that it allows the thru-silicon vias to be implemented on the outside edges of the memories. This simplifies the flow as the thru-silicon vias get placed along with the memory. This, however, is not the case for the 24 logic-to-logic vias, which must have their position predetermined and are then placed in the final assembly stage.
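To give a feel for the via geometry quoted above, the following Python sketch estimates how many vias fit along one millimetre of memory edge at the minimum pitch and how much silicon the signal vias would reserve. The via size, the pitch, and the via counts (4128, 4128 and 24, reported later in this section) come from the text; the assumption that each via reserves a pitch × pitch square, and all function names, are illustrative only.

```python
# Geometry from the text: 2.5 um x 2.5 um vias on a minimum pitch of 3.9 um.
TSV_SIDE_UM = 2.5
TSV_PITCH_UM = 3.9

# Signal via counts reported in this section:
# tier A to tier B, tier C to tier B, and controller to processing elements.
SIGNAL_TSVS = 4128 + 4128 + 24


def vias_per_mm(pitch_um=TSV_PITCH_UM):
    """Whole vias that fit in a single row along 1 mm at the given pitch."""
    return int(1000.0 // pitch_um)


def reserved_area_mm2(count, pitch_um=TSV_PITCH_UM):
    """Area reserved if each via occupies a pitch x pitch square (assumption)."""
    return count * pitch_um ** 2 / 1e6


def via_area_mm2(count, side_um=TSV_SIDE_UM):
    """Area of the via openings themselves."""
    return count * side_um ** 2 / 1e6


if __name__ == "__main__":
    print("signal vias:", SIGNAL_TSVS)
    print("vias per mm of edge:", vias_per_mm())
    print("reserved area (mm^2):", round(reserved_area_mm2(SIGNAL_TSVS), 4))
    print("via opening area (mm^2):", round(via_area_mm2(SIGNAL_TSVS), 4))
```

Under these assumptions the signal vias reserve on the order of a tenth of a square millimetre, which suggests why placing them along the memory edges is practical at this pitch.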

Figure 4: A side view of the MIT Lincoln Labs’ process with the thru-silicon vias and tier orientation shown.

Figure 5 shows the complete design flow. The first step in the design flow is 3D floorplanning, partitioning and selecting the locations for the memories. In the 3D floorplanning phase, the main objective is to get the memories as close as possible to the processing elements that use them. We define PE0, PE1, PE2 and PE3 to be the lower numbered processing elements and PE4, PE5, PE6 and PE7 to be the upper numbered processing elements. Figure 3 shows that every memory is connected to one lower numbered PE and one upper numbered PE. To exploit this connectivity we partition the system so that the controller and the memories are on the middle tier (tier B), with the upper numbered PEs and their respective twiddle factor ROMs placed on tier A and the lower numbered PEs along with their ROMs placed on tier C. This partitioning scheme guarantees that a memory is never more than one tier away from the processing elements that are connected to it. The memory is also on the same tier as the controller that sets its address lines.

On the middle tier we have thirty-two memories and one controller to place. To accomplish this, we use an 11 × 3 grid. We place the controller in the center location of the grid in the middle tier. For the remaining memories we use a Python constraints package to generate an optimal memory placement based on the distance from a given memory to the two processing elements that use it. The resulting floorplan is shown in Figure 8. In the system there are a total of 8280 thru-silicon vias: 4128 connect the logic on tier A to the memories on tier B, another 4128 connect the logic on tier C to the memories on tier B, and the remaining 24 connect the controller to the processing elements.

The next step in the design flow is synthesis, which was accomplished using a standard cell library based on the IITSoC library from the Illinois Institute of Technology. Each tier is synthesized separately in Synopsys Design Compiler. After synthesis, we perform static timing analysis and add pipeline stages to the processing elements until a further stage yields no overall speed increase. The optimal pre-place-and-route pipeline depth for the system was found to be five stages for this manufacturing process and standard cell library, yielding a maximum operating frequency of 196 MHz (without parasitics).

Figure 5: The design flow.

After synthesis, we perform place and route. This stage deviates the most from a conventional 2D flow. To complete place and route, global information about the placement of the memories and the thru-silicon vias is required. Using the standard string and file manipulation functions built into the TCL interpreter in Encounter, the thru-via and pin locations can easily be extracted by parsing the DEF files of the custom memory designs. Using the information from the DEF files, routing and placement are blocked over the areas of the memories and the inter-tier thru-silicon via locations. Normal placement is then performed, followed by clock tree synthesis. Because the process only allows three metal layers per tier, the clock tree is not routed before regular routing, as is common for processes with a greater number of metal layers. Instead the clock tree is routed along with all other routing. This causes more clock skew than would have occurred if a greater number of metal layers had been available. After clock tree synthesis, the “preassignPin” command is used to place virtual input/output pins directly on top of the thru-vias on the edge of the memories. Encounter then performs routing as normal, connecting the standard cells, clock tree and virtual pins (effectively performing 3D routing).

After place and route, the design along with its parasitics is imported into PrimeTime and post-place-and-route timing analysis is performed. In this step it is important to make sure that each tier has no setup or hold violations, and that signals that travel between tiers have no setup or hold violations either. This step is greatly simplified by the fact that there are very few logic-to-logic vias (24) and the remaining signals are either data pins to the SRAMs or address pins to the twiddle factor ROMs.

Finally, all the tiers are imported separately into Virtuoso. In Virtuoso the three tiers and the full-custom memories are combined. For the 24 signals that connect the controller to the processing elements, the through-silicon vias are placed by hand; the rest are placed automatically as part of the memory. The 24 TSVs were placed by hand because there were so few of them that it was quicker to place them by hand than to write a script to do so. However, this process can easily be scripted using SKILL code in Virtuoso, as the locations of all through-silicon vias are known. Furthermore, scripting this process would be necessary for other 3D designs that contain a greater number of logic-to-logic through-silicon vias.

The power and ground rings of the three tiers are then combined into 3D meshes by placing thru-silicon vias all along the perimeter. Because Encounter routed over the power and ground rings in some areas, a few thru-silicon vias along the perimeter had to be removed to avoid shorts. All in all, we managed to fit 4554 power and ground vias between tiers A and B and 4800 vias between tiers B and C. The next step was to place the input and output pads and perform a final DRC and LVS. Furthermore, since the process being used is an SOI process, we added extra power and ground decoupling capacitors wherever there was room left over, to compensate for the limited native decoupling in SOI. Figure 6 shows the three tiers stacked, along with the thru-silicon vias.

Figure 6: The 3D SAR FFT processor with thru silicon vias drawn in.
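The memory-placement step of the floorplanning phase described in this section can be illustrated with a small Python sketch. It is not the constraints package used for the design: plain exhaustive search is swapped in, and the 3 × 3 grid, the processing-element positions, and the memory-to-PE connectivity are all hypothetical values chosen to keep the example small. Only the idea comes from the text: reserve the center slot for the controller and place each memory to minimize its distance to the two processing elements that use it.

```python
from itertools import permutations

# Hypothetical toy instance (the real design uses an 11 x 3 grid, 32 memories).
COLS, ROWS = 3, 3
CONTROLLER_SLOT = (1, 1)  # center of the grid is reserved for the controller
SLOTS = [(x, y) for y in range(ROWS) for x in range(COLS)
         if (x, y) != CONTROLLER_SLOT]

# Hypothetical PE positions projected onto the memory tier.
PE_POS = {"PE0": (0, 0), "PE1": (2, 0), "PE4": (0, 2), "PE5": (2, 2)}

# Hypothetical connectivity: each memory serves one lower and one upper PE.
MEM_TO_PES = {
    "M0": ("PE0", "PE4"), "M1": ("PE0", "PE4"),
    "M2": ("PE0", "PE5"), "M3": ("PE0", "PE5"),
    "M4": ("PE1", "PE4"), "M5": ("PE1", "PE4"),
    "M6": ("PE1", "PE5"), "M7": ("PE1", "PE5"),
}


def manhattan(a, b):
    """Manhattan distance between two grid points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def slot_cost(mem, slot):
    """Distance from a slot to both PEs that use the memory."""
    return sum(manhattan(slot, PE_POS[pe]) for pe in MEM_TO_PES[mem])


def best_placement():
    """Exhaustively search all memory-to-slot assignments for the lowest cost."""
    mems = sorted(MEM_TO_PES)
    best_cost, best = None, None
    for perm in permutations(SLOTS, len(mems)):
        cost = sum(slot_cost(m, s) for m, s in zip(mems, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, dict(zip(mems, perm))
    return best_cost, best


if __name__ == "__main__":
    cost, placement = best_placement()
    print("total distance:", cost)
    for mem, slot in sorted(placement.items()):
        print(mem, "->", slot)
```

Exhaustive search only scales to a handful of memories; for the full 32-memory problem a constraint solver or an assignment algorithm would be the practical choice.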

5.

RESULTS

To quantify the improvements of the 3D circuit over its 2D counterpart, we place and route the design in 2D. To ensure a fair comparison between the two circuits, the circuit is not resynthesized; instead the same synthesis output is used. For the comparison, a similar floorplan is used. This floorplan is essentially the floorplan of tier B expanded, with the ROMs placed in similar locations to the 3D version, as shown in Figure 9. Due to increased congestion, the 2D design does not route successfully with the same area as its 3D counterpart (4.8 × 4.8 mm). To remedy this, the area used for place and route is grown until the design routes without any design rule violations. The total area must be expanded from 3 × 2.6 × 3 mm (three tiers, 23.40 mm2) for the 3D circuit to 5.6 × 5.6 mm (31.36 mm2) for its 2D counterpart, so the 3D circuit uses 25.3% less total area. To get just the core placement area, we exclude the power and ground rings from the total area (0.1 mm on every side) and the comparison becomes 3 × 2.8 × 2.4 mm versus 5.4 × 5.4 mm. The discrepancy between the total area and the core area illustrates an interesting point: given the same total area and the same power and ground ring width, a 3D design will devote more area to the power and ground rings.

The next metric examined was net length. We extract all net information directly from Encounter, combining the information from all the different tiers. As expected, the average wire length decreased drastically, from 836.0 μm down to 392.9 μm. This is a 53.0% decrease. Similarly, the total wire length decreased from 19.107 m to 8.238 m, a 56.9% decrease. A histogram of the wire lengths is shown in Figure 7.

To gather the speed and power metrics of the design, we have to extract the parasitics and characterize the switching activity of the design. The parasitics are extracted into a SPEF file using Encounter. The switching activity is generated by simulating an FFT test bench in Mentor Graphics ModelSim and exporting the resulting activity of the test bench to a SAIF file. Both files were then read into Synopsys PrimeTime. In PrimeTime the clock period was reduced to the shortest period that did not cause any setup violations, to determine the maximum operating frequency. The 3D design simulated correctly at 79.4 MHz (12.6 ns), whereas the 2D design simulated correctly at 63.7 MHz (15.7 ns). This is a 24.6% increase in maximum operating frequency and a 19.7% reduction in critical path delay. As these numbers may seem a bit slow for the given technology node, it is important to keep two points in mind. First, the process only has three metal layers, which limits clock tree routing and causes more skew than would occur if more metal layers were available. Second, the standard cell library does not have the multi-adder cells that many commercial libraries have, which would have helped increase the maximum operating frequency.

Finally, using both the SPEF and the SAIF file, power dissipation numbers (excluding power dissipated in the memories) were generated using PrimeTime. For the 3D design the power dissipation is determined at the maximum operating frequency of both the 2D and the 3D design, while for the 2D design the power dissipation is only determined at its own maximum operating frequency. At an operating frequency of 79.4 MHz the 3D design dissipates 409.2 mW. Operating at 63.7 MHz the 3D design dissipates 324.9 mW and the 2D design dissipates 340.0 mW. This is a 4.4% improvement. Using the power numbers of both circuits operating at maximum frequency, we compute the energy (excluding memory accesses) required per 1024-point FFT. The energy required for completing the FFT in 3D is 3.366 μJ, as opposed to 3.552 μJ for the 2D version, which is a 5.2% improvement.

The results are summarized in Table 2; the memory tradeoffs discussed in Section 3 are summarized in Table 3.

Table 2: Comparison between the 2D and 3D metrics of the SAR FFT along with read and write energy from Cacti.

Metric                          2D        3D        Improvement
Total Area (mm2)                31.36     23.40     25.3%
Core Area (mm2)                 29.16     20.16     30.9%
Mean Net Length (μm)            836.0     392.9     53.0%
Total Wire Length (m)           19.107    8.238     56.9%
Max Speed (MHz)                 63.7      79.4      24.6%
Critical Path (ns)              15.7      12.6      19.7%
Logic Power @ 63.7 MHz (mW)     340.0     324.9     4.4%
Logic Power @ 79.4 MHz (mW)     n/a       409.2     n/a
FFT Logic Energy (μJ)           3.552     3.366     5.2%

Figure 7: Histogram of wire lengths of the SAR FFT processor for both the 2D and 3D versions (bin size = 250 μm).

Table 3: Read and write energy from Cacti comparing the un-optimized (undivided) to the optimized (divided) design.

Metric                   Undivided   Divided    Improvement
Bandwidth (GBps)         13.4        128.4      854.9%
Energy Per Write (pJ)    14.48       6.142      57.6%
Energy Per Read (pJ)     68.205      26.718     60.8%
Memory Wires (#)         150         2272       -1414.7%
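Several of the headline percentages in Tables 2 and 3 can be re-derived from the tabulated values with the short Python sketch below. The function and key names are illustrative. The last entry rests on an assumption: it treats the 60.3% reduction in energy per memory operation as the combined energy of one read plus one write, weighted equally, which reproduces the reported figure but is not stated explicitly in the text.

```python
def reduction(old, new):
    """Percentage decrease from old to new, rounded to one decimal place."""
    return round(100.0 * (old - new) / old, 1)


def increase(old, new):
    """Percentage increase from old to new, rounded to one decimal place."""
    return round(100.0 * (new - old) / old, 1)


# Values taken from Table 2 (2D versus 3D) and Table 3 (single versus divided memory).
CHECKS = {
    "core area": reduction(29.16, 20.16),
    "mean net length": reduction(836.0, 392.9),
    "total wire length": reduction(19.107, 8.238),
    "max frequency": increase(63.7, 79.4),
    "critical path": reduction(15.7, 12.6),
    "logic power @ 63.7 MHz": reduction(340.0, 324.9),
    "FFT logic energy": reduction(3.552, 3.366),
    "energy per read": reduction(68.205, 26.718),
    "energy per write": reduction(14.48, 6.142),
    # Assumption: one read plus one write, equally weighted.
    "energy per read+write": reduction(68.205 + 14.48, 26.718 + 6.142),
}

if __name__ == "__main__":
    for name, pct in CHECKS.items():
        print(f"{name}: {pct}%")
```

Each value agrees with the corresponding table entry to one decimal place, including the 56.9% reduction in total wire length.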

6.

CONCLUSIONS

The main point of this paper is not necessarily to compare the 2D and 3D implementations of the same design but also to show how a system can be re-optimized in 3D in ways that are not available in 2D. Furthermore, the system can be realized with the use of commercial 2D tools; it is not necessary to use 3D tools. The 3D-optimized design permits a single large memory to be broken down into multiple smaller memories to reduce the energy consumption in memory operations per FFT by 60.3%. This memory re-optimization would not be suited to a 2D design due to the high interconnect cost. In the 2D design, the increase in interconnect area is greater than the increase in memory area. Case in point, the 2D implementation of the architecture on the left side of Figure 1 is significantly worse than the 3D implementation of the architecture on the right side of Figure 1 in all metrics: power, performance and area. Finally, comparing the 2D and 3D implementations of the SAR FFT processor, we show an average wire length reduction of 53.0%, an overall wire length reduction of 56.9%, a 24.6% increase in maximum operating frequency, a 5.2% reduction in energy per FFT and a 30.9% reduction in core area.

7.

ACKNOWLEDGMENTS

This project was funded by DARPA under contract FA8650-04-C-7127 and contract FA8650-04-C-7120, both managed by AFRL. Additional funding was provided by Semiconductor Research Corporation. The authors would like to thank MIT Lincoln Labs for providing access to their FD-SOI technology and Magnus Halldorsson at Reykjavik University for help with the memory partitioning approach.

8.

REFERENCES

[1] J. Burns, B. Aull, C. Chen, C.-L. Chen, C. Keast, J. Knecht, V. Suntharalingam, K. Warner, P. Wyatt, and D. Yost. A wafer-scale 3-D circuit integration technology. IEEE Transactions on Electron Devices, 53(10):2507–2516, October 2006.

[2] P. Clarke. EDA’s big three unready for 3D chip packaging. EE Times Asia Online, October 2007.

[3] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.

[4] S. Das, A. Chandrakasan, and R. Reif. Design tools for 3-D integrated circuits. In ASPDAC: Proceedings of the 2003 conference on Asia South Pacific design automation, pages 53–56, New York, NY, USA, 2003. ACM.

[5] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and P. D. Franzon. Demystifying 3D ICs: The Pros and Cons of Going Vertical. IEEE Design and Test of Computers, 22(6):498–510, Nov.-Dec. 2005.

[6] C. Hein, J. Pridgen, and W. Kline. RASSP Virtual Prototyping of DSP Systems. Design Automation Conference DAC 97, pages 492–497, 1997.

[7] R. Ho, K. Mai, and M. Horowitz. The Future of Wires. Proceedings of the IEEE, 89(4):490–504, 2001.

[8] M. Jin and C. Wu. SAR correlation algorithm which accommodates large-range migration. IEEE Transactions on Geoscience and Remote Sensing, 22(6):592–597, 1984.

[9] Massachusetts Institute of Technology Lincoln Labs. MITLL Low-Power FDSOI CMOS Process Design Guide, revision 2008:6 edition, September 2008.

[10] A. Rahman and R. Reif. Thermal analysis of three-dimensional (3-D) integrated circuits (ICs). Interconnect Technology Conference, 2001. Proceedings of the IEEE 2001 International, pages 157–159, 2001.

[11] T. Sansaloni, A. Pérez-Pascual, V. Torres, and J. Valls. Scheme for Reducing the Storage Requirements of FFT Twiddle Factors on FPGAs. The Journal of VLSI Signal Processing, 47(2):183–187, 2007.

[12] V. Suntharalingam, R. Berger, J. Burns, C. Chen, C. Keast, J. Knecht, R. Lambert, K. Newcomb, D. O’Mara, D. Rathman, D. Shaver, A. Soares, C. Stevenson, B. Tyrrell, K. Warner, B. Wheeler, D.-R. Yost, and D. Young. Megapixel CMOS image sensor fabricated in three-dimensional integrated circuit technology. Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International, pages 356–357 Vol. 1, Feb. 2005.

[13] S. Wilton and N. Jouppi. CACTI: an enhanced cache access and cycle time model. Solid-State Circuits, IEEE Journal of, 31(5):677–688, 1996.

Figure 8: The 3D floorplan.

Figure 9: The 2D floorplan for comparison.

