1D-15
A Low-Power FPGA Based on Autonomous Fine-Grain Power-Gating

Shota Ishihara, Masanori Hariyama, and Michitaka Kameyama
Graduate School of Information Sciences, Tohoku University
Aoba 6-6-05, Aramaki, Aoba, Sendai, Miyagi, 980-8579, Japan
Email: {ishihara@kameyama., hariyama@, kameyama@}ecei.tohoku.ac.jp

Abstract: This is the first implementation of an FPGA based on autonomous fine-grain power-gating. To cut the power consumption of the clock network and to detect the activity of each cell efficiently, an asynchronous architecture is fully exploited. The proposed FPGA is fabricated in a 90nm CMOS process with dual threshold voltages. It is more power-efficient than the synchronous FPGA when the utilization is less than 30%.
I. Introduction

The major problem of conventional FPGAs is their large power overhead compared to ASICs. In FPGAs, the clock network accounts for over 30% of the total power consumption. The most common way to reduce the clock distribution power is clock-gating. However, clock-gating for FPGAs prevents unnecessary activity of the data paths but does not gate the clock network itself [1]. Moreover, as transistor feature sizes shrink, the stand-by power is becoming comparable to the dynamic power. A well-known technique to reduce the stand-by power is power-gating. However, the sleep controller requires a sequencer, and the sequencer is always in active mode. In FPGAs in particular, the sleep controller must be built from programmable resources. These area and dynamic power overheads make fine-grain power-gating inefficient [2].

To solve these problems, this paper proposes a fine-grain asynchronous architecture. The proposed architecture not only cuts the power of the clock network [3] but also reduces the overhead of the sleep control for power-gating, since an asynchronous architecture inherently has information on the activity of each cell.

II. Architecture

A. Overall architecture

In reconfigurable VLSIs, the data paths vary widely, so delay-insensitive encoding is suitable. Among delay-insensitive encodings, level-encoded dual-rail (LEDR) encoding achieves the highest throughput and the lowest dynamic power because of its small number of signal transitions. The major drawback of LEDR encoding is its larger area overhead, because it requires two wires for a single data bit. Based on this observation, we reported the LEDR-based architecture and its area-efficient implementation [3]. To reduce the inherent wire overhead, we employ a bit-serial architecture. Figure 1 shows the overall architecture and Fig. 2 shows the block diagram of the logic block.
Fig. 1. Overall architecture.
Fig. 2. Block diagram of the logic block.
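As a rough illustration of why LEDR encoding needs few signal transitions, the following Python sketch models a single bit stream on a rail pair (V, R). The function names, the resting state of the wires, and the convention that the phase equals V XOR R are assumptions of this sketch; it is a behavioural model, not a description of the fabricated circuit.

```python
def ledr_encode(bits, first_phase=0):
    """Encode bits as LEDR (V, R) pairs whose phase alternates per token.

    Assumed convention for this sketch: phase = V XOR R, so R = V XOR phase.
    """
    pairs = []
    phase = first_phase
    for v in bits:
        pairs.append((v, v ^ phase))
        phase ^= 1  # consecutive tokens carry opposite phases
    return pairs


def ledr_phase(pair):
    """Recover the phase of a token from its two rails."""
    v, r = pair
    return v ^ r


def rail_transitions(pairs, initial):
    """Count how many of the two rails toggle when each new token is sent."""
    counts = []
    prev = initial
    for cur in pairs:
        counts.append((prev[0] != cur[0]) + (prev[1] != cur[1]))
        prev = cur
    return counts


bits = [1, 0, 0, 1, 1]
pairs = ledr_encode(bits)
# Assumption: the wires rest in an odd-phase state before the first token.
toggles = rail_transitions(pairs, initial=(0, 1))
print(pairs)
print(toggles)
```

In this model exactly one rail toggles for every data bit, and the receiver detects a new token from the change of phase. A 4-phase dual-rail protocol, by contrast, raises one rail and then returns it to zero for every bit.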
As shown in Fig. 3, the LUT based on LEDR encoding consists of a decoder and a multiplexer-based LUT that has the same structure as a typical LUT. If the combination of inputs is invalid, the decoder puts the outputs of the multiplexer-based LUT in a Hi-Z state, so the latch outputs do not change. Therefore, the multiplexers for invalid inputs become unnecessary. As a result, the transistor count is reduced to 64% of that of the typical multiplexer-based LUT. Because LEDR is a dual-rail encoding, two registers are normally required to store a data set (V, R). To reduce this overhead, we exploit the Phase signal stored in the C-element. By calculating R from V and the Phase signal, only one register, for V, is required.

B. Autonomous fine-grain power-gating scheme

An asynchronous architecture easily detects the activity of a cell since it inherently has information on operation completion and new-data arrival. As shown in Fig. 4, a cell has three states: sleep, stand-by and active. As shown in Fig. 5, when the data arrive at Cell1, a data-arrival signal is sent to Cell2 to wake it
Fig. 3. LUT based on LEDR encoding.
978-1-4244-2749-9/09/$25.00 ©2009 IEEE
Fig. 4. State diagram of a cell.
Fig. 8. Comparison in a 90nm process.
Fig. 5. Example of the autonomous fine-grain power-gating.
up, and Cell2 turns to the stand-by state. When the data arrive at Cell2, Cell2 turns to the active state. The operation is executed immediately because the power switch has already been turned on in the stand-by state. When the operation is completed, Cell2 returns to the stand-by state. If no data arrive at Cell2 during the waiting time, Cell2 predicts that no data will arrive for quite a while and turns to the sleep state. The waiting time for each cell is determined empirically by system-level simulation so as to minimize the total power. The sleep controller consists mainly of a detector for data arrival and operation completion. As described above, there is no wake-up time penalty, even if the whole sleep controller is composed of small transistors with a high threshold voltage. As a result, the area and power overheads are small.

III. Evaluation

The proposed FPGA is fabricated in a 90nm CMOS process (Fig. 6). From the delay of 200 series-connected cells, the delay of a single cell is 1.3ns (Fig. 7). Thanks to the robustness of the asynchronous circuit, the chip operates correctly even when the supply voltage changes dynamically from 1.0V to 0.5V. This result indicates that the power consumption can be reduced by controlling the supply voltage according to the processing load. The proposed FPGA is compared with an FPGA based on 4-phase dual-rail encoding, which is the most common delay-insensitive encoding. The proposed FPGA reduces the delay and the power consumption to 61% and 58%, respectively, while the number of transistors is only 13% larger.

Next, we compare the proposed FPGA with the synchronous FPGA with power-gating. In the synchronous FPGA, the whole chip is equally divided into several blocks, each of which has G cells, a sleep controller, and a
Fig. 9. Comparison in a 45nm process.
power switch. Figure 8 shows the comparison of the proposed FPGA and the synchronous FPGAs in a 90nm process. The proposed FPGA is more power-efficient than the synchronous FPGAs when the utilization is less than 30%. Unfortunately, the benefit of the power-gating is small. This is because this 90nm process is well tuned to reduce the leakage current.

To estimate the efficiency of the autonomous power-gating scheme in more advanced processes, we estimate the power consumption in a 45nm process. The ITRS indicates that, at the 45nm process, the dynamic power per gate would be reduced to 48% and the stand-by power per gate would increase to 1030%. Considering these effects, as shown in Fig. 9, the proposed FPGA is more efficient than the synchronous FPGA without power-gating at higher utilization than in the 90nm process, and is more power-efficient than the synchronous FPGA with power-gating when the utilization is less than 30%.

IV. Conclusion

In the proposed FPGA, the LEDR-based bit-serial architecture is fully exploited for an area-efficient implementation of autonomous fine-grain power-gating with performance enhancement.

Acknowledgements

This work is supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Synopsys, Inc., and Cadence Design Systems, Inc.

References

[1] Yan Zhang, Jussi Roivainen and Aarne Mammela, “Clock-Gating in FPGAs: A Novel and Comparative Evaluation,” DSD, pp.584-590, 2006.
[2] Arifur Rahman, Satyaki Das, Tim Tuan and Steve Trimberger, “Determination of Power Gating Granularity for FPGA Fabric,” CICC, pp.9-12, 2006.
[3] Masanori Hariyama, Shota Ishihara, Chang Chia Wei and Michitaka Kameyama, “A Field-Programmable VLSI Based on an Asynchronous Bit-Serial Architecture,” A-SSCC, pp.380-383, 2007.
Fig. 6. Chip photograph.
Fig. 7. Measured waveform (1.0V).
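The cell behaviour of Section II-B (Fig. 4 and Fig. 5) can be summarized as a small state machine. The Python sketch below is only a behavioural model under stated assumptions: the class and method names are hypothetical, the waiting time is counted in abstract ticks, and the rule that data must be preceded by a wake-up signal is a simplification made for this sketch.

```python
from enum import Enum


class State(Enum):
    SLEEP = "sleep"
    STANDBY = "stand-by"
    ACTIVE = "active"


class Cell:
    """Behavioural model of one power-gated cell (hypothetical names)."""

    def __init__(self, waiting_time):
        self.waiting_time = waiting_time  # ticks to wait before sleeping
        self.state = State.SLEEP
        self.idle_ticks = 0

    def wake(self):
        """Data-arrival signal from the preceding cell: power up early."""
        if self.state is State.SLEEP:
            self.state = State.STANDBY
        self.idle_ticks = 0  # assumption: a new wake-up restarts the wait

    def data_arrive(self):
        """Data reach this cell; it can start at once from stand-by."""
        if self.state is State.SLEEP:
            # Assumption of this sketch: a wake-up always precedes the data.
            raise RuntimeError("data arrived at a sleeping cell")
        self.state = State.ACTIVE
        self.idle_ticks = 0

    def complete(self):
        """Operation completed: return to stand-by and start waiting."""
        if self.state is State.ACTIVE:
            self.state = State.STANDBY
            self.idle_ticks = 0

    def tick(self):
        """One unit of time passes without any event."""
        if self.state is State.STANDBY:
            self.idle_ticks += 1
            if self.idle_ticks >= self.waiting_time:
                self.state = State.SLEEP  # no data expected for a while


cell2 = Cell(waiting_time=3)
trace = [cell2.state]
for event in (cell2.wake, cell2.data_arrive, cell2.complete,
              cell2.tick, cell2.tick, cell2.tick):
    event()
    trace.append(cell2.state)
print([s.value for s in trace])
```

In this model the waiting time is a free parameter per cell, which mirrors the way the paper selects it by system-level simulation; a shorter value sleeps sooner but, in a real design, risks more frequent wake-ups.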