An FPGA Based on Synchronous/Asynchronous Hybrid Architecture with Area-Efficient FIFO Interfaces

Masanori Hariyama, Yoshiya Komatsu, Shota Ishihara, Ryoto Tsuchiya, and Michitaka Kameyama
Graduate School of Information Sciences, Tohoku University
Aoba 6-6-05, Aramaki, Aoba, Sendai, Miyagi, 980-8579, Japan

Abstract: This paper presents an FPGA architecture that combines synchronous and asynchronous architectures. Datapath components such as logic blocks and switch blocks are designed to run in both asynchronous and synchronous modes. Moreover, a logic block is presented that implements area-efficient first-in-first-out (FIFO) interfaces, which are usually used for communication between synchronous and asynchronous logic cores. The FPGA based on the hybrid architecture is fabricated in a 65nm process.

Keywords: FPGA, Reconfigurable VLSI, Self-timed architecture, Delay-insensitive architecture.
1. Introduction

Field-programmable gate arrays (FPGAs) are widely used to implement special-purpose processors. FPGAs are cost-effective for small-lot production because the functions and interconnections of logic resources can be programmed directly by end users. Despite their design-cost advantage, FPGAs impose a large power-consumption overhead compared to custom silicon alternatives [1]. This overhead increases packaging costs and limits the integration of FPGAs into portable devices. In FPGAs, the power consumption of clock distribution is a serious problem because an FPGA has far more registers than a custom VLSI. To cut the clock distribution power, several asynchronous FPGAs have been proposed [2], [3], [4], [5], [6]. According to references [4]-[6], the asynchronous architecture is more power-efficient than the synchronous one under low-workload conditions. Under high-workload conditions, the asynchronous architecture is less advantageous in power than the synchronous one because of the overhead of its complex control circuitry. In practice, even a single application consists of many tasks with various workloads. The best way to minimize the power consumption is therefore to use both asynchronous and synchronous architectures within a single application. For this purpose, we have reported an FPGA architecture based on a hybrid of synchronous and asynchronous architectures [7]. However, it lacks an area-efficient communication interface between synchronous and asynchronous cores.

This paper proposes an FPGA architecture, based on the hybrid of asynchronous and synchronous architectures, that can implement an area-efficient communication interface. Datapath components such as logic blocks and switch blocks are used in both asynchronous and synchronous modes; each block is programmed in advance to be either an asynchronous block or a synchronous block. When designing the hybrid datapath components, the major issue is to share the datapath resources efficiently. For this purpose, we propose dual-bit operation, in which the resources that form a one-bit logic block and its interconnections in asynchronous mode are used as a two-bit logic block and interconnections in synchronous mode. As a result, the datapath resources are fully exploited in both modes. So that common look-up tables (LUTs) of the logic block can be used in both modes, 4-phase dual-rail encoding is adopted.

Another design issue is to implement an area-efficient communication interface. In general, the processing speeds of the synchronous and asynchronous cores differ. Therefore, a first-in-first-out (FIFO) interface is usually used to absorb the difference in processing speed. However, a FIFO interface imposes a large hardware overhead. To solve this problem, this paper presents a logic block structure that can implement area-efficient FIFOs. The FPGA based on the hybrid architecture is implemented in a 65nm process.
2. Architecture

2.1 Asynchronous protocols

Asynchronous encoding schemes are mainly classified into

• Single-rail encoding (e.g. bundled-data encoding)
• Dual-rail encoding (e.g. 4-phase dual-rail encoding, LEDR encoding)
Bundled-data encoding is the most common single-rail encoding. It is the most frequently used scheme in ASICs since its hardware overhead is relatively small. Its major disadvantage is that it requires a constraint on delay length. If the data path is fixed in advance, it is relatively easy to meet the constraint by optimizing the wire layout. In reconfigurable VLSIs such as FPGAs, on the other hand, it is not easy to always meet the constraint since the data path is programmable.

Dual-rail encoding encodes a bit onto two wires. In dual-rail encoding, the request is implicit in the data value, so no delay insertion is required [8]. Hence, dual-rail encoding is well suited to reconfigurable VLSIs. 4-phase dual-rail encoding is the most common dual-rail encoding. Table 1 shows its code table. The data value 0 is encoded as (0, 1) and 1 is encoded as (1, 0). The spacer is encoded as (0, 0).

Table 1: Code table of 4-phase dual-rail encoding

            Code word (T, F)
Data 0      (0, 1)
Data 1      (1, 0)
Spacer      (0, 0)

Figure ?? shows an example in which the data values 0, 0 and 1 are transferred. The main feature is that the sender sends a spacer after each data value. The receiver recognizes the arrival of a data value by detecting a 0-to-1 change on either wire. The insertion of spacers keeps the encoding rule simple, which results in simple hardware for the function unit. Figures 1 and 2 show the XOR gates of the synchronous architecture (synchronous XOR) and the 4-phase dual-rail architecture (4-phase dual-rail XOR), respectively. The circuit that generates out.t in the 4-phase dual-rail XOR gate is similar to the pMOS network of the synchronous XOR gate, and the circuit that generates out.f is similar to its nMOS network. This similarity makes the function unit of the 4-phase dual-rail architecture easy to implement. The synchronous XOR gate uses 12 transistors, while the 4-phase dual-rail XOR gate uses 16. Accordingly, the hardware overhead of the 4-phase dual-rail architecture is smaller than that of other dual-rail architectures.

Figure 1: XOR gate of synchronous architecture (inputs A, B; output out)
Figure 2: XOR gate of 4-phase dual-rail architecture (inputs A.T, A.F, B.T, B.F; outputs Out.T, Out.F; Req used for pre-charge)
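As a rough behavioral illustration of the code table and of the dual-rail XOR function described above, the following Python sketch may help. It is only a logic-level model under stated assumptions: the names encode, decode and dual_rail_xor are hypothetical, code words are written as (T, F) pairs, and neither the transistor-level circuit nor the pre-charge timing is modeled.

```python
# Logic-level sketch of 4-phase dual-rail encoding (code table of Table 1).
# All names are hypothetical; code words are (T, F) pairs.

SPACER = (0, 0)
CODE = {0: (0, 1), 1: (1, 0)}                 # data value -> (T, F)
VALUE = {tf: v for v, tf in CODE.items()}     # (T, F) -> data value


def encode(bits):
    """Return the wire sequence: every data code word is followed by a spacer."""
    wires = []
    for b in bits:
        wires.append(CODE[b])
        wires.append(SPACER)
    return wires


def decode(wires):
    """Recover data values by detecting a 0-to-1 change on either wire."""
    bits = []
    prev = SPACER
    for cur in wires:
        if prev == SPACER and cur != SPACER:  # one wire rose: data has arrived
            bits.append(VALUE[cur])
        prev = cur
    return bits


def dual_rail_xor(a, b):
    """XOR on code words; a spacer on either input yields a spacer output."""
    (a_t, a_f), (b_t, b_f) = a, b
    out_t = (a_t & b_f) | (a_f & b_t)         # rises when the inputs differ
    out_f = (a_t & b_t) | (a_f & b_f)         # rises when the inputs match
    return (out_t, out_f)


if __name__ == "__main__":
    stream = encode([0, 0, 1])
    print(stream)
    print(decode(stream))
    print(dual_rail_xor(CODE[1], CODE[0]))
```

In this model the spacer needs no special case in the XOR: with both wires of an input at 0, every product term is 0, so the output stays at (0, 0) until both inputs carry valid data.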
2.2 Overall architecture

Figure 3 shows the overall architecture of the proposed FPGA. Like conventional FPGAs, the FPGA consists of a mesh-connected cellular array. As mentioned in the previous section, 4-phase dual-rail encoding is employed as the asynchronous protocol because of its similarity to synchronous circuits. The logic blocks, connection blocks and switch blocks are shared by the asynchronous and synchronous architectures. The clock-tree network is based on an H-tree topology; for simplicity, the clock tree is not illustrated in the figure. The clock signal is distributed to all the registers in the logic blocks. Since the chip presented in this paper is a prototype, the simplest 2-input LUT is used in the logic block.

Figure 3: Overall architecture. (LB: Logic Block, CB: Connection Block, SB: Switch Block)
2.3 Logic block structure

As shown in Fig. 4, the LUT is constructed from two identical smaller LUTs. In asynchronous mode, shown in Fig. 4(a), the upper and lower LUTs are used for out.t and out.f, respectively. In synchronous mode, shown in Fig. 4(b), the upper and lower LUTs are used for different bits, and all the LUTs are fully exploited.

Figure 4: Resource sharing for the logic block of the hybrid architecture. (a) Asynchronous mode. (b) Synchronous mode.

Figure 5 shows the circuit of the LUT. As explained in the previous section, the logic circuit of 4-phase dual-rail encoding is quite similar to a dynamic circuit in a synchronous design. Based on this observation, the LUT for the hybrid architecture is designed as a dynamic circuit. Hence, exactly the same LUTs can be used in both asynchronous and synchronous modes.

Figure 5: Block diagram of the LUT of the hybrid architecture.

Another design issue in the hybrid architecture is to implement an area-efficient communication interface between synchronous and asynchronous cores. In general, the processing speeds of a synchronous core and an asynchronous core differ. A FIFO interface is commonly used to absorb the difference, as shown in Fig. 6. If the sender core is faster than the receiver core, the FIFO stores the data from the sender core until the imbalance in processing speed is resolved. Moreover, a protocol converter is required to convert data from the synchronous format to the asynchronous one or vice versa. Figures 7 and 8 show the functions of the protocol converter and the FIFO interface. The number of cells in the FIFO depends on the design and is not determined in advance. Therefore, it is desirable to implement the FIFO cell using a logic block. Figure 9 shows the structure of the logic block with the FIFO function; Figure 10 shows the FIFO-cell mode of the logic block. It can be seen that the area overhead for the FIFO interface is relatively small in the logic block.

Figure 6: Interface between synchronous and asynchronous cores. (Synchronous core, Sync/Async converter, FIFO, asynchronous core)
Figure 7: Function of the converter from synchronous style to asynchronous style.

3. Evaluation
The FPGA based on the hybrid architecture is implemented in a 65nm CMOS process. The supply voltage is 1.2V. The processing performance in asynchronous mode corresponds to that of a synchronous FPGA running at 720MHz. Table 2 compares the cell of the synchronous architecture with the cell of the hybrid architecture in synchronous mode. Thanks to the resource sharing, the energy and area overheads are just 29% and 22%, respectively. Table 3 compares the cell of the asynchronous architecture [4] with the cell of the proposed hybrid architecture in asynchronous mode. The transistor-count overhead is as small as 18%.
Table 2: Comparison of cells of synchronous and the hybrid architecture.

                           Sync.     Our hybrid (Sync. mode)
Energy per data set [fJ]   355       459 (129%)
Delay [ps]                 263       367 (151%)
Transistor count           1703      2069 (122%)

Table 3: Comparison of cells of asynchronous and the hybrid architecture.

                           Async [4]   Our hybrid (LB mode)   Our hybrid (FIFO mode)
Energy per data set [fJ]   755         925 (122%)             664
Delay [ps]                 482         713 (148%)             422
Transistor count           1757        2069 (118%)

Figure 8: Function of the FIFO interface.
Figure 9: Structure of the logic block with the FIFO function.
Figure 10: FIFO cell implemented using the logic block.

4. Conclusion

This paper proposes an FPGA based on a hybrid of asynchronous and synchronous architectures, in which the FIFO interface between synchronous and asynchronous cores is implemented using logic blocks with small overhead. The FIFO interface is also expected to be useful for power reduction. If the number of stored data items becomes large, the processing speed of the sender is much higher than that of the receiver. The power consumption can then be reduced by lowering the supply voltage of the sender. Moreover, the proposed architecture with FIFOs has higher flexibility than the conventional GALS (Globally Asynchronous and Locally Synchronous) architecture. As future work, we are evaluating the hybrid architecture on practical benchmarks. Developing the CAD environment is also an important topic.
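The relation between FIFO occupancy and sender speed mentioned in the conclusion can be illustrated with a small behavioral simulation. The Python sketch below is an illustration under assumptions, not the circuit or control scheme of this paper: the function name simulate, the rates, the FIFO depth and the threshold are all hypothetical, and switching the sender to a lower rate merely stands in for lowering its supply voltage.

```python
from collections import deque

# Behavioral sketch: FIFO occupancy as a signal that the sender outpaces
# the receiver. All parameters are hypothetical; reducing the send rate
# stands in for lowering the supply voltage of the sender.


def simulate(steps, fast_rate=2, slow_rate=1, recv_rate=1, depth=8, threshold=4):
    """Return the FIFO occupancy observed after each time step."""
    fifo = deque()
    occupancy = []
    for _ in range(steps):
        # The sender runs fast until the FIFO holds `threshold` items.
        send_rate = slow_rate if len(fifo) >= threshold else fast_rate
        for _ in range(send_rate):
            if len(fifo) < depth:        # a full FIFO stalls the sender
                fifo.append(1)
        for _ in range(recv_rate):
            if fifo:                     # an empty FIFO stalls the receiver
                fifo.popleft()
        occupancy.append(len(fifo))
    return occupancy


if __name__ == "__main__":
    print(simulate(8))                   # sender throttled at the threshold
    print(simulate(8, threshold=100))    # threshold never reached
```

With the default parameters, the occupancy rises by one item per step and then settles at the threshold once the sender is slowed down. When the threshold is set above the depth, the FIFO keeps filling until the sender stalls on a full queue.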
Acknowledgment

This work is supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with STARC, Fujitsu Limited, Matsushita Electric Industrial Company Limited, NEC Electronics Corporation, Renesas Technology Corporation, Toshiba Corporation, Cadence Design Systems Inc. and Synopsys Inc.
References

[1] V. George, H. Zhang, and J. Rabaey, “The design of a low energy FPGA,” in Proceedings of 1999 International Symposium on Low Power Electronics and Design, California, USA, Aug. 1999, pp. 188–193.
[2] J. Teifel and R. Manohar, “An asynchronous dataflow FPGA architecture,” IEEE Transactions on Computers, vol. 53, no. 11, pp. 1376–1392, 2004.
[3] R. Manohar, “Reconfigurable Asynchronous Logic,” in Proceedings of IEEE Custom Integrated Circuits Conference, Sept. 2006, pp. 13–20.
[4] M. Hariyama, S. Ishihara, and M. Kameyama, “Evaluation of a Field-Programmable VLSI Based on an Asynchronous Bit-Serial Architecture,” IEICE Trans. Electron, vol. E91-C, no. 9, pp. 1419–1426, 2008.
[5] M. Hariyama, S. Ishihara, and M. Kameyama, “A Low-Power Field-Programmable VLSI Based on a Fine-Grained Power-Gating Scheme,” in Proceedings of IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Knoxville (USA), Aug. 2008, pp. 430–433.
[6] S. Ishihara, Y. Komatsu, M. Hariyama, and M. Kameyama, “An Asynchronous Field-Programmable VLSI Using LEDR/4-Phase-Dual-Rail Protocol Converters,” in Proceedings of The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas (USA), Jul. 2009, pp. 145–150.
[7] M. Hariyama, R. Tsuchiya, S. Ishihara, and M. Kameyama, “A Field-Programmable VLSI Based on Synchronous/Asynchronous Hybrid Architecture,” in Proceedings of The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas (USA), Jul. 2010, pp. 217–274.
[8] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective. Kluwer Academic Publishers, 2001.