An FPGA Based on Synchronous/Asynchronous Hybrid Architecture with Area-Efficient FIFO Interfaces

Masanori Hariyama, Yoshiya Komatsu, Shota Ishihara, Ryoto Tsuchiya, and Michitaka Kameyama
Graduate School of Information Sciences, Tohoku University
Aoba 6-6-05, Aramaki, Aoba, Sendai, Miyagi, 980-8579, Japan

Abstract: This paper presents an FPGA architecture that combines synchronous and asynchronous architectures. Datapath components such as logic blocks and switch blocks are designed to run in either asynchronous or synchronous mode. Moreover, a logic block is presented that implements area-efficient first-in-first-out (FIFO) interfaces, which are usually used for communication between synchronous and asynchronous logic cores. The FPGA based on the hybrid architecture is fabricated in a 65nm process.

Keywords: FPGA, Reconfigurable VLSI, Self-timed architecture, Delay-insensitive architecture.

1. Introduction

Field-programmable gate arrays (FPGAs) are widely used to implement special-purpose processors. FPGAs are cost-effective for small-lot production because the functions and interconnections of logic resources can be programmed directly by end users. Despite this design-cost advantage, FPGAs impose a large power overhead compared with custom silicon alternatives [1]. The overhead increases packaging costs and limits the integration of FPGAs into portable devices. In FPGAs, the power consumed by clock distribution is a serious problem because an FPGA has far more registers than a custom VLSI. To cut the clock distribution power, several asynchronous FPGAs have been proposed [2], [3], [4], [5], [6]. According to references [4]-[6], the asynchronous architecture is more power-efficient than the synchronous one under low-workload conditions. Under high-workload conditions, the asynchronous architecture is less power-efficient than the synchronous one because of the overhead of its complex control circuitry. In practice, even a single application consists of many tasks with various workloads. The best way to minimize power consumption is therefore to use both asynchronous and synchronous architectures within a single application. For this purpose, we have reported an FPGA architecture based on a hybrid of synchronous and asynchronous architectures [7]. However, it lacks an area-efficient communication interface between synchronous and asynchronous cores.

This paper proposes an FPGA architecture based on a hybrid of asynchronous and synchronous architectures that can implement an area-efficient communication interface. Datapath components such as logic blocks and switch blocks are used in both asynchronous and synchronous modes; each block is programmed in advance to be either an asynchronous block or a synchronous block. When designing the hybrid datapath components, the major issue is to share the datapath resources efficiently. For this purpose, we propose dual-bit operation, in which the logic block and interconnection resources that handle one bit in asynchronous mode are used to handle two bits in synchronous mode. As a result, the datapath resources are fully exploited in both modes. In order to use common look-up tables (LUTs) of the logic block in both modes, 4-phase dual-rail encoding is adopted.

Another design issue is to implement an area-efficient communication interface. In general, the processing speeds of the synchronous and asynchronous cores differ from each other. Therefore, a first-in-first-out (FIFO) interface is usually used to absorb the difference in processing speed. However, a FIFO interface imposes a large hardware overhead. To solve this problem, this paper presents a logic block structure that can implement area-efficient FIFOs. The FPGA based on the hybrid architecture is implemented in a 65nm process.

2. Architecture

2.1 Asynchronous protocols

Asynchronous encoding schemes are mainly classified into

• Single-rail encoding (e.g., bundled-data encoding)
• Dual-rail encoding (e.g., 4-phase dual-rail encoding, LEDR encoding)

Bundled-data encoding is the most common single-rail encoding and the most frequently used scheme in ASICs, since its hardware overhead is relatively small. Its major disadvantage is that it imposes a constraint on delay lengths. If the data path is fixed in advance, it is relatively easy to meet the constraint by optimizing the layout of wires. On the other hand, in reconfigurable VLSIs such as FPGAs, it is not easy to always meet the constraint, since the data path is programmable. Dual-rail encoding encodes each bit onto two wires. In dual-rail encoding, the request is implicit in the data value, and no delay insertion is therefore required [8]. Hence, dual-rail encoding is the ideal one for reconfigurable VLSIs.

4-phase dual-rail encoding is the most common dual-rail encoding. Table 1 shows its code table. The data value 0 is encoded as (0, 1) and 1 is encoded as (1, 0); the spacer is encoded as (0, 0). Figure ?? shows an example in which the data values 0, 0 and 1 are transferred. The main feature is that the sender sends a spacer after each data value. The receiver recognizes the arrival of a data value by detecting a 0-to-1 transition on either wire. The insertion of spacers keeps the encoding rule simple, which in turn simplifies the hardware of the function unit.

Table 1: Code table of 4-phase dual-rail encoding

            Code word (T, F)
Data 0      (0, 1)
Data 1      (1, 0)
Spacer      (0, 0)

Figures 1 and 2 show the XOR gates of the synchronous architecture (synchronous XOR) and the 4-phase dual-rail architecture (4-phase dual-rail XOR), respectively. The circuit that generates out.t in the 4-phase dual-rail XOR gate is similar to the pMOS network of the synchronous XOR gate. Likewise, the circuit that generates out.f in the 4-phase dual-rail XOR gate is similar to the nMOS network of the synchronous XOR gate. This similarity makes the function unit of the 4-phase dual-rail architecture easy to implement. The synchronous XOR gate has 12 transistors, while the 4-phase dual-rail XOR gate has 16. Accordingly, the hardware overhead of the 4-phase dual-rail architecture is smaller than that of other dual-rail architectures.

Figure 1: XOR gate of synchronous architecture

Figure 2: XOR gate of 4-phase dual-rail architecture (signals: Req (pre-charge), A.T, A.F, B.T, B.F, Out.T, Out.F)
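As an illustration of Table 1 and of the spacer insertion described above, the following Python sketch models 4-phase dual-rail encoding at a purely behavioral level. The function names (`encode`, `decode`, `send`, `receive`, `xor_dual_rail`) are hypothetical and do not come from the paper. The XOR equations are the usual dual-rail formulation and are assumed here; they are not derived from the transistor-level circuit of Figure 2.

```python
# Behavioral sketch of 4-phase dual-rail encoding (all names are illustrative).
SPACER = (0, 0)

def encode(bit):
    """Map a data bit to its (T, F) code word as in Table 1."""
    return (1, 0) if bit else (0, 1)

def decode(word):
    """Return the data bit, or None for a spacer."""
    if word == SPACER:
        return None
    if word == (1, 0):
        return 1
    if word == (0, 1):
        return 0
    raise ValueError("(1, 1) is not a valid code word")

def send(bits):
    """Sender: follow every data value with a spacer."""
    stream = []
    for b in bits:
        stream.append(encode(b))
        stream.append(SPACER)
    return stream

def receive(stream):
    """Receiver: a data value arrives when either wire rises from 0 to 1."""
    received = []
    prev = SPACER
    for word in stream:
        rose = (prev[0] == 0 and word[0] == 1) or (prev[1] == 0 and word[1] == 1)
        if rose:
            received.append(decode(word))
        prev = word
    return received

def xor_dual_rail(a, b):
    """Dual-rail XOR: the output stays a spacer until both inputs are valid."""
    a_t, a_f = a
    b_t, b_f = b
    out_t = (a_t & b_f) | (a_f & b_t)  # true rail: inputs differ
    out_f = (a_t & b_t) | (a_f & b_f)  # false rail: inputs are equal
    return (out_t, out_f)
```

For the data sequence 0, 0, 1 mentioned above, `send([0, 0, 1])` produces the code words (0,1), (0,0), (0,1), (0,0), (1,0), (0,0), and `receive` recovers the original bits from the rising transitions alone, without any separate request wire.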

2.2 Overall architecture

Figure 3 shows the overall architecture of the proposed FPGA. The FPGA consists of a mesh-connected cellular array, like conventional FPGAs. As mentioned in the previous section, 4-phase dual-rail encoding is employed as the asynchronous protocol because of its similarity to synchronous circuits. The logic blocks, connection blocks and switch blocks are shared by the asynchronous and synchronous architectures. The clock-tree network is based on an H-tree topology; for simplicity, the clock tree is not illustrated in the figure. The clock signal is distributed to all the registers in the logic blocks. Since the chip presented in this paper is a prototype, the simplest 2-input LUT is used in the logic block.

Figure 3: Overall architecture. (LB: Logic Block, CB: Connection Block, SB: Switch Block; cell signals: Data, Ack, A, B, C, D, Carry_in, Carry_out, Out)
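To make the role of the 2-input LUT concrete, the sketch below models a programmable 2-input LUT operating on 4-phase dual-rail signals. It is a behavioral illustration only: the class name `DualRailLUT2`, the configuration format (a list of four truth-table bits), and the rule that the output remains a spacer until both inputs carry valid data are assumptions made for this example, not details of the fabricated chip.

```python
# Behavioral sketch of a programmable 2-input LUT on dual-rail signals.
# The class name and the configuration format are hypothetical.
SPACER = (0, 0)

class DualRailLUT2:
    def __init__(self, truth_table):
        # truth_table[2*a + b] is the output bit for input bits a and b.
        if len(truth_table) != 4 or any(v not in (0, 1) for v in truth_table):
            raise ValueError("truth_table must hold four bits")
        self.truth_table = list(truth_table)

    @staticmethod
    def _value(word):
        """Decode a (T, F) pair; None means spacer."""
        if word == (1, 0):
            return 1
        if word == (0, 1):
            return 0
        if word == SPACER:
            return None
        raise ValueError("(1, 1) is not a valid code word")

    def evaluate(self, a, b):
        """Return the output code word, or a spacer if an input is a spacer."""
        va, vb = self._value(a), self._value(b)
        if va is None or vb is None:
            return SPACER
        out = self.truth_table[2 * va + vb]
        return (1, 0) if out else (0, 1)

# The same block becomes XOR or AND by changing only the configuration bits.
xor_lut = DualRailLUT2([0, 1, 1, 0])
and_lut = DualRailLUT2([0, 0, 0, 1])
```

With inputs 1 and 0, encoded as (1, 0) and (0, 1), `xor_lut.evaluate` returns (1, 0) while `and_lut.evaluate` returns (0, 1); if either input is still a spacer, both return a spacer.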

2.3 Logic block structure

As shown in Fig. 4, the LUT is constructed from two identical smaller LUTs. In asynchronous mode, shown in Fig. 4(a), the upper and lower LUTs are used for out.t and out.f, respectively. In synchronous mode, shown in Fig. 4(b), the upper and lower LUTs are used for different bits, and all the LUTs are fully exploited.

Figure 4: Resource sharing for the logic block of the hybrid architecture. (a) Asynchronous mode. (b) Synchronous mode.

Figure 5 shows the circuit of the LUT. As explained in the previous section, the logic circuit for 4-phase dual-rail encoding is quite similar to a dynamic circuit in a synchronous design. Based on this observation, the LUT for the hybrid architecture is designed using a dynamic circuit. Hence, completely common LUTs can be used in both asynchronous and synchronous modes.

Figure 5: Block diagram of the LUT of the hybrid architecture.

Another design issue in the hybrid architecture is to implement an area-efficient communication interface between synchronous and asynchronous cores. In general, the processing speeds of a synchronous core and an asynchronous core are different. A FIFO interface is commonly used to absorb the difference, as shown in Fig. 6. If the processing speed of the sender core is higher than that of the receiver core, the FIFO stores the data from the sender core until the imbalance in processing speed is resolved. Moreover, a protocol converter is required to convert data from the synchronous format to the asynchronous one or vice versa. Figures 7 and 8 show the functions of the protocol converter and the FIFO interface, respectively. The number of cells in the FIFO depends on the design and is not determined in advance. Therefore, it is desirable to implement the FIFO cell using a logic block. Figure 9 shows the structure of the logic block with the FIFO function; Figure 10 shows the FIFO-cell mode of the logic block. We can see that the area overhead for the FIFO interface is relatively small in the logic block.

Figure 6: Interface between synchronous and asynchronous cores.

Figure 7: Function of the converter from synchronous style to asynchronous style.

3. Evaluation

The FPGA based on the hybrid architecture is implemented in a 65nm CMOS process. The supply voltage is 1.2V. The processing performance in asynchronous mode corresponds to that of a synchronous FPGA running at 720MHz. Table 2 compares the cell of the synchronous architecture with the cell of the hybrid architecture in synchronous mode. Thanks to the resource sharing, the energy and area overheads are just 29% and 22%, respectively. Table 3 compares the cell of the asynchronous architecture [4] with the cell of the proposed hybrid architecture in asynchronous mode. The transistor-count overhead is as small as 18%.

Table 2: Comparison of cells of synchronous and the hybrid architecture.

                            Synchronous    Our hybrid (Sync. mode)
Energy per data set [fJ]    355            459 (129%)
Delay [ps]                  263            367 (151%)
Transistor count            1703           2069 (122%)

Table 3: Comparison of cells of asynchronous and the hybrid architecture.

                            Async [4]    Our hybrid (LB mode)    Our hybrid (FIFO mode)
Energy per data set [fJ]    755          925 (122%)              664
Delay [ps]                  482          713 (148%)              422
Transistor count            1757         2069 (118%)

Figure 8: Function of the FIFO interface.

Figure 9: Structure of the logic block with the FIFO function.

Figure 10: FIFO cell implemented using the logic block.

4. Conclusion

This paper proposes an FPGA based on a hybrid of asynchronous and synchronous architectures, in which the FIFO interface between synchronous and asynchronous cores is implemented using logic blocks with small overhead. The FIFO interface is also useful for power reduction. If the number of stored data items becomes large, the processing speed of the sender is much higher than that of the receiver. The power consumption can then be reduced by lowering the supply voltage of the sender. Moreover, the proposed architecture with FIFOs has higher flexibility than the conventional GALS (Globally Asynchronous and Locally Synchronous) architecture. As future work, we are evaluating the hybrid architecture on practical benchmarks. Developing a CAD environment is also an important topic.

Acknowledgment

This work is supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with STARC, Fujitsu Limited, Matsushita Electric Industrial Company Limited, NEC Electronics Corporation, Renesas Technology Corporation, Toshiba Corporation, Cadence Design Systems Inc. and Synopsys Inc.

References

[1] V. George, H. Zhang, and J. Rabaey, “The design of a low energy FPGA,” in Proceedings of 1999 International Symposium on Low Power Electronics and Design, California, USA, Aug 1999, pp. 188–193.
[2] J. Teifel and R. Manohar, “An asynchronous dataflow FPGA architecture,” IEEE Transactions on Computers, vol. 53, no. 11, pp. 1376–1392, 2004.
[3] R. Manohar, “Reconfigurable Asynchronous Logic,” in Proceedings of IEEE Custom Integrated Circuits Conference, Sept. 2006, pp. 13–20.
[4] M. Hariyama, S. Ishihara, and M. Kameyama, “Evaluation of a Field-Programmable VLSI Based on an Asynchronous Bit-Serial Architecture,” IEICE Trans. Electron, vol. E91-C, no. 9, pp. 1419–1426, 2008.
[5] M. Hariyama, S. Ishihara, and M. Kameyama, “A Low-Power Field-Programmable VLSI Based on a Fine-Grained Power-Gating Scheme,” in Proceedings of IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), Knoxville (USA), Aug 2008, pp. 430–433.
[6] S. Ishihara, Y. Komatsu, M. Hariyama, and M. Kameyama, “An Asynchronous Field-Programmable VLSI Using LEDR/4-Phase-Dual-Rail Protocol Converters,” in Proceedings of The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas (USA), Jul 2009, pp. 145–150.
[7] M. Hariyama, R. Tsuchiya, S. Ishihara, and M. Kameyama, “A Field-Programmable VLSI Based on Synchronous/Asynchronous Hybrid Architecture,” in Proceedings of The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas (USA), Jul 2010, pp. 217–274.
[8] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective. Kluwer Academic Publishers, 2001.