IEICE Electronics Express, Vol.3, No.8, 163–171

Comparative study of asynchronous pipeline design methods

K. Shojaee¹, M. Gholipour¹, A. Afzali-Kusha¹, and M. Nourani²

¹ Nanoelectronics Center of Excellence, School of Electrical and Computer Engineering, University of Tehran, P.O. Box 14395/515, Tehran, Iran
² Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75083, USA

Abstract: In this paper, a performance comparison of several proposed asynchronous pipeline styles is presented. The asynchronous styles include GasP, MOUSETRAP, IPCMOS, LPSR 2/1, HC, STFB, LDA, LP2/1, RSPCFB, and NCL. Both 4-bit and 16-bit 4-stage FIFO circuits are designed and simulated using HSPICE. The simulation results are then used to compare the styles in terms of throughput, latency, power dissipation, transistor count, and datapath width. In addition, two figures of merit that relate the energy and the delay of the circuit are used in the comparison. To estimate the throughput and the latency of the circuits, a simple analytical model for the transistor delay is also proposed. The predictions of the analytical model for the throughput and the latency are compared to the simulation results to assess the accuracy of the model.

Keywords: asynchronous pipeline, FIFO, throughput, latency
Classification: Integrated circuits

© IEICE 2006
DOI: 10.1587/elex.3.163
Received January 26, 2006; Accepted March 27, 2006; Published April 25, 2006

References

[1] I. E. Sutherland, “Micropipelines,” Comm. ACM, vol. 32, no. 6, pp. 720–738, 1989.
[2] I. E. Sutherland and S. Fairbanks, “GasP: A Minimal FIFO Control,” Proc. ASYNC, pp. 46–53, 2001.
[3] M. Singh and S. M. Nowick, “MOUSETRAP: Ultra-High-Speed Transition-Signaling Asynchronous Pipelines,” Proc. ICCD 2001, pp. 9–17, 2001.
[4] S. Schuster, W. Reohr, P. Cook, D. Heidel, M. Immediato, and K. Jenkins, “Asynchronous Interlocked Pipelined CMOS Circuits Operating at 3.3–4.5 GHz,” Proc. ISSCC, pp. 292–293, Feb. 2000.
[5] M. Singh and S. M. Nowick, “High-Throughput Asynchronous Pipelines for Fine-Grain Dynamic Datapaths,” Proc. Int’l Symp. Advanced Research in Asynchronous Circuits and Systems (ASYNC 2000), pp. 198–209, 2000.
[6] M. Singh and S. M. Nowick, “Fine-grain Pipelined Asynchronous Adders for High-Speed DSP Applications,” IEEE Computer Society Annual Workshop VLSI, pp. 111–118, 2000.
[7] M. Ferretti and P. A. Beerel, “Single-Track Asynchronous Pipeline Templates Using 1-of-N Encoding,” Proc. DATE’02, pp. 1008–1015, 2002.
[8] C. Choy, J. Butas, J. Povazanec, and C. Chan, “A Fine-Grain Asynchronous Pipeline Reaching the Synchronous Speed,” Proc. ASIC, pp. 547–550, 2001.
[9] R. O. Ozdag and P. A. Beerel, “High-speed QDI asynchronous pipelines,” Proc. Int’l Symp. Asynchronous Circuits and Systems (ASYNC ’02), pp. 13–22, April 8–11, 2002.
[10] K. Fant and S. Brandt, “NULL Convention Logic: A Complete and Consistent Logic for Asynchronous Digital Circuit Synthesis,” Proc. 1996 International Conference on Application Specific Systems, Architectures, and Processors (ASAP 96), pp. 261–273, Aug. 1996.

1 Introduction

Micropipelines (asynchronous pipelines), introduced in [1], are an important asynchronous architecture. Based on the type of interface between stages, these circuits may be classified into two groups: Bundled-Data (BD) and Data-Driven (DD). The choice between the two is an important design decision that should be based on the system requirements. The objective of this paper is to compare several asynchronous styles when they are used to implement FIFOs. Ten micropipeline design styles are studied, and their parameters are extracted and compared. When the design requirements are specified in terms of the parameters discussed in this paper, the results of this study can help in choosing a suitable architecture. The paper is organized as follows. In Section 2, we describe the analytical model used to estimate the delay of a transistor. Sections 3 and 4 present the analytical expressions for the throughput and latency of each asynchronous FIFO circuit. The results of the HSPICE simulations of the asynchronous FIFOs are presented and discussed in Section 5, and the summary and conclusions are given in Section 6.

2 Design Approach

The delay of a circuit may be obtained in terms of the delays of the transistors in the signal path. To obtain an analytical expression for the delay, we propose a simple analytical model for the transistor delay as a function of the transistor sizes. The model accounts for the parameters that affect the delay: whether the transistor acts as a switching or a pass transistor (modeled by $\alpha$), the width and type of the transistor ($W_{o,i}/W$), and the gate ($W_k$) and diffusion ($W_j$) capacitances connected to the transistor. To take these effects into account, we use the following expression for the delay of the transistor:

$$T_i = \left(\alpha + \frac{W_{o,i}}{W}\left(\frac{1}{\beta}\sum_k W_k + \frac{1}{\gamma}\sum_j W_j\right)\right)\tau \tag{1}$$

Here, $\alpha$, $\beta$, $\gamma$, and $\tau$ are technology dependent parameters; $i$ is either p for PMOS or n for NMOS transistors; $W$ is the width of the transistor under consideration; $W_{o,i}$ is the width of the minimum size transistor of type $i$; $W_k$ is the width of a transistor $k$ whose gate is connected to this transistor; and $W_j$ is the width of a transistor $j$ whose diffusion capacitance is connected to this transistor. For an NMOS (PMOS) transistor, $W_{o,i}$ is equal to (twice) the minimum channel length of the technology. The factor of two compensates for the lower mobility of holes compared with electrons. Note that the technology dependent parameters are determined experimentally; their values for the technology used in this work are given in Table I (a). To assess the accuracy of the analytical model, its predictions are compared with circuit simulation results in Section 5.
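To make the transistor delay model concrete, the sketch below evaluates Eq. (1) for a single transistor. It assumes that the ratio $W_{o,i}/W$ scales both the gate and the diffusion load terms, and it uses hypothetical parameter values (widths in units of the minimum NMOS width, $\tau$ in picoseconds); the actual fitted values are in Table I (a) and are not reproduced here. The function name is illustrative.

```python
def transistor_delay(alpha, beta, gamma, tau, w_min, w, gate_widths, diff_widths):
    """Evaluate the transistor delay model of Eq. (1).

    T_i = (alpha + (W_o,i / W) * (sum(W_k) / beta + sum(W_j) / gamma)) * tau

    alpha, beta, gamma, tau : technology-dependent fitting parameters
    w_min       : W_o,i, minimum width for this transistor type (NMOS or PMOS)
    w           : width W of the transistor whose delay is estimated
    gate_widths : widths W_k of transistors whose gates this transistor drives
    diff_widths : widths W_j of transistors whose diffusions load this node
    """
    if w <= 0:
        raise ValueError("transistor width must be positive")
    # Capacitive load seen by the transistor, weighted by the fitted constants.
    load = sum(gate_widths) / beta + sum(diff_widths) / gamma
    # A wider transistor (larger w) drives the same load faster.
    return (alpha + (w_min / w) * load) * tau


# Hypothetical values, NOT the paper's Table I (a).
params = dict(alpha=1.0, beta=4.0, gamma=8.0, tau=10.0)
t_1x = transistor_delay(**params, w_min=1.0, w=2.0, gate_widths=[2.0, 2.0], diff_widths=[2.0])
t_2x = transistor_delay(**params, w_min=1.0, w=4.0, gate_widths=[2.0, 2.0], diff_widths=[2.0])
print(t_1x, t_2x)  # 16.25 13.125
```

Doubling the width of the driving transistor halves only the load-dependent term; the $\alpha\tau$ term is a fixed intrinsic delay in this model.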

3 Bundled-Data Based Micropipeline Design

In these design styles, Request and Acknowledge signals are used for handshaking, and data are transmitted using single-rail encoding, i.e., one wire per data bit. A control circuit generates the handshaking signals, with a matched delay chosen according to the time required for the data to be processed. In this section, five asynchronous design styles based on the Bundled-Data architecture are described.
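The timing assumption behind bundled data can be stated as a simple check: the request, delayed by the matched delay, must not reach the next stage before the slowest data bit has settled. A minimal sketch, assuming the per-bit datapath delays and the matched delay are already known; the function name and the optional `margin_ps` safety margin are hypothetical, not taken from the paper.

```python
def bundling_constraint_met(bit_delays_ps, matched_delay_ps, margin_ps=0.0):
    """Check the bundled-data timing assumption for one stage.

    The request generated through the matched delay must not arrive before
    the slowest single-rail data bit has settled; margin_ps is an optional
    safety margin for delay variation (a design choice, not a fixed rule).
    """
    return matched_delay_ps >= max(bit_delays_ps) + margin_ps


# Hypothetical per-bit datapath delays of a 4-bit stage, in picoseconds.
bits = [40.0, 55.0, 48.0, 51.0]
print(bundling_constraint_met(bits, matched_delay_ps=60.0))                  # True
print(bundling_constraint_met(bits, matched_delay_ps=60.0, margin_ps=10.0))  # False
```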

3.1 GasP Method

Fig. 1 (a) shows the circuit of one stage of the GasP FIFO [2]. The throughput time, $T_{Thr}$, is the time interval between two successive data items. For GasP, $T_{Thr}$ is the time between two consecutive low states on a state conductor (Hnd In, Hnd Out), and is given by

$$T_{Thr} \approx t_{a(N+1)} + t_{b(N+1)} + t_{y(N+1)} + t_{x(N)} + t_{b(N)} + t_{c(N)} + t_{md(N)} + t_{d(N)} \tag{2}$$

where $t_{md(N)}$ is the delay added by the matched-delay circuit of stage $N$, and $t_{\theta(i)}$ is the delay of transistor $\theta$ located in stage $i$. The latency is the time needed for a data item to travel from the input to the output of one stage. For GasP, the latency of stage $N$ is the time it takes for a low state induced on Hnd In$_N$ to propagate to Hnd Out$_N$, and is given by

$$T_{Lat} = t_{a(N)} + t_{b(N)} + t_{c(N)} + t_{md(N)} + t_{d(N)} \tag{3}$$
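Equations (2) and (3) are plain sums of element delays, so they can be evaluated mechanically once each $t_{\theta(i)}$ is known (for example, from Eq. (1)). A minimal sketch; the `(name, stage_offset)` term encoding, the `path_time` helper, and the uniform 20 ps / 60 ps delays are illustrative assumptions, not values from the paper.

```python
# Each term is (transistor_name, stage_offset): ("a", 1) stands for t_a(N+1).
GASP_THROUGHPUT = [("a", 1), ("b", 1), ("y", 1), ("x", 0),
                   ("b", 0), ("c", 0), ("md", 0), ("d", 0)]        # Eq. (2)
GASP_LATENCY = [("a", 0), ("b", 0), ("c", 0), ("md", 0), ("d", 0)]  # Eq. (3)


def path_time(terms, delays):
    """Sum the per-element delays along a timing path (repeated terms count twice)."""
    return sum(delays[term] for term in terms)


# Hypothetical delays in picoseconds: every transistor 20 ps, matched delay 60 ps.
delays = {(name, off): 20 for name in "abcdxy" for off in (0, 1)}
delays[("md", 0)] = 60

t_thr = path_time(GASP_THROUGHPUT, delays)  # 7 * 20 + 60 = 200 ps
t_lat = path_time(GASP_LATENCY, delays)     # 4 * 20 + 60 = 140 ps
rate_ghz = 1e3 / t_thr                      # data items per ns, i.e. GHz
print(t_thr, t_lat, rate_ghz)               # 200 140 5.0
```

The throughput rate is the reciprocal of the throughput time, so a 200 ps cycle corresponds to 5 G data items per second under these assumed delays.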


Fig. 1. Circuits of (a) GasP, (b) MOUSETRAP, (c) IPCMOS, (d) LPSR2/1, and (e) HC.

3.2 MOUSETRAP Method

One stage of the MOUSETRAP FIFO [3] is illustrated in Fig. 1 (b). Since the durations of the high and low states differ, two slightly different values are obtained for the throughput time. For the low-to-high transition, the throughput time is

$$T_{Thr\uparrow} \approx t_{a(N+1)} + t_{b(N+1)} + t_{c(N+1)} + t_{e(N)} + t_{f(N)} + t_{a(N)} + t_{b(N)} + t_{c(N)} + t_{md(N)} \tag{4}$$

For the high-to-low transition, the throughput time is

$$T_{Thr\downarrow} \approx t_{a(N+1)} + t_{b(N+1)} + t_{c(N+1)} + t_{d(N)} + t_{a(N)} + t_{b(N)} + t_{c(N)} + t_{md(N)} \tag{5}$$

For this architecture, the latency is approximated by

$$T_{Lat} \approx t_{a(N)} + t_{b(N)} + t_{c(N)} + t_{md(N)} \tag{6}$$
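Because Eqs. (4), (5), and (6) share most of their terms, subtracting them shows which elements account for the differences. This is a short derivation from those three equations, not an additional result of the paper:

$$\begin{aligned}
T_{Thr\uparrow} - T_{Thr\downarrow} &= t_{e(N)} + t_{f(N)} - t_{d(N)} \\
T_{Thr\downarrow} - T_{Lat} &= t_{a(N+1)} + t_{b(N+1)} + t_{c(N+1)} + t_{d(N)}
\end{aligned}$$

Thus, within this model, the two MOUSETRAP throughput times differ only through elements $e$, $f$, and $d$ of stage $N$, and the part of the high-to-low cycle beyond the forward latency is the $a$, $b$, $c$ path of stage $N+1$ plus element $d$ of stage $N$.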

3.3 IPCMOS Method

The IPCMOS FIFO [4] is depicted in Fig. 1 (c). Its throughput time can be obtained as

$$\begin{aligned}
T_{Thr} \approx\ & t_{a(N)} + t_{b(N)} + t_{c(N)} + t_{k(N)} + t_{d(N)} + t_{e(N)} + t_{p(N)} + t_{q(N)} \\
& + t_{g(N)} + t_{md(N)} + t_{j(N+1)} + t_{k(N+1)} + t_{d(N+1)} + t_{e(N+1)}
\end{aligned} \tag{7}$$

and its latency is given by

$$T_{Lat} \approx t_{j(N)} + t_{k(N)} + t_{d(N)} + t_{e(N)} + t_{p(N)} + t_{q(N)} + t_{g(N)} + t_{md(N)} \tag{8}$$

3.4 LPSR 2/1 Method

One stage of the 4-bit LPSR 2/1 FIFO [5] is shown in Fig. 1 (d). Tracing the complete cycle path gives the throughput time as

$$\begin{aligned}
T_{Thr} \approx\ & t_{a(N-1)} + t_{f(N-1)} + t_{d(N-1)} + t_{e(N-1)} + t_{c(N-1)} \\
& + t_{md(N-1)} + t_{e(N)} + t_{c(N)} + t_{md(N)} + t_{e(N+1)} + t_{c(N+1)}
\end{aligned} \tag{9}$$

The latency is given by

$$T_{Lat} \approx t_{e(N)} + t_{c(N)} + t_{md(N)} \tag{10}$$

3.5 HC Method

One stage of the HC FIFO is shown in Fig. 1 (e) [6]. The throughput time of HC can be expressed as

$$\begin{aligned}
T_{Thr} \approx\ & t_{a(N+1)} + t_{b(N+1)} + t_{c(N)} + t_{d(N)} + t_{e(N)} + t_{f(N)} + t_{b(N)} \\
& + t_{md(N)} + t_{g(N)} + t_{h(N)} + t_{a(N)} + t_{b(N)} + t_{md(N)}
\end{aligned} \tag{11}$$

The latency is given by

$$T_{Lat} \approx t_{a(N)} + t_{b(N)} + t_{md(N)} \tag{12}$$

4 Data-Driven Based Micropipeline Design

Dual-rail encoding, in which each data bit is represented by two signals, is the common choice in Data-Driven designs. The encoded data indicate both data validity and data value; hence, the Request and Acknowledge signals can be generated from the data itself. In this section, five asynchronous design styles based on the Data-Driven architecture are discussed.
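The sketch below illustrates why dual-rail data can act as its own request: a receiver can tell from the wires alone whether a complete, valid word has arrived. It assumes a return-to-zero protocol in which valid words alternate with an all-zero spacer; the rail ordering and function names are assumptions for illustration, and the templates in this section differ in their circuit-level details.

```python
# Assumed convention: each bit is a (true_rail, false_rail) pair.
#   (0, 0) -> spacer / NULL (no data)
#   (1, 0) -> logic 1
#   (0, 1) -> logic 0
#   (1, 1) -> illegal
SPACER = (0, 0)


def encode(bits):
    """Encode a list of 0/1 bits as dual-rail pairs."""
    return [(1, 0) if b else (0, 1) for b in bits]


def is_complete(word):
    """Completion detection: every bit has exactly one rail asserted."""
    return all(t + f == 1 for t, f in word)


def is_spacer(word):
    """True when every bit has returned to the spacer state."""
    return all(pair == SPACER for pair in word)


def decode(word):
    """Recover the bit values of a complete code word."""
    if not is_complete(word):
        raise ValueError("word is not a complete dual-rail code word")
    return [t for t, _ in word]


word = encode([1, 0, 1, 1])
print(word)                                           # [(1, 0), (0, 1), (1, 0), (1, 0)]
print(is_complete(word), decode(word))                # True [1, 0, 1, 1]
print(is_complete([(1, 0), (0, 0), (0, 1), (1, 0)]))  # False: one bit still NULL
```

The same completion test that marks a word as valid is what a data-driven stage uses, in hardware, to produce its acknowledge; no separate bundled request wire is needed.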

4.1 STFB Method

The complete 1-bit STFB cell, shown in Fig. 2 (a), is replicated for every bit [7]. The throughput times for the valid input data 01 and 10 are, respectively,

$$T_{Thr01} \approx t_{k(N)} + t_{e(N)} + t_{f(N)} + t_{b(N)} + t_{a(N)} + t_{b(N+1)} + t_{c(N+1)} + t_{d(N+1)} \tag{13}$$

and

$$T_{Thr10} \approx t_{k(N)} + t_{e(N)} + t_{f(N)} + t_{h(N)} + t_{g(N)} + t_{h(N+1)} + t_{i(N+1)} + t_{j(N+1)} \tag{14}$$

Similarly, the latency times for these two valid input data are given by

$$T_{Lat01} \approx t_{b(N)} + t_{a(N)} \tag{15}$$

$$T_{Lat10} \approx t_{h(N)} + t_{g(N)} \tag{16}$$
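Because the 01 and 10 paths pass through different elements, STFB timing is data dependent. The sketch below evaluates Eqs. (13) to (16) under hypothetical element delays and takes the worst case over the two data values. It assumes a homogeneous FIFO (identical stages), so stage indices are dropped; the worst-case convention is an assumption for illustration, since the text gives separate expressions for the two values.

```python
# Hypothetical per-element delays (ps); identical stages are assumed, so
# t_x(N) and t_x(N+1) take the same value and stage indices are dropped.
delay = {"k": 20, "e": 20, "f": 20,
         "a": 15, "b": 15, "c": 15, "d": 15,
         "g": 18, "h": 18, "i": 18, "j": 18}

# Element lists transcribed from Eqs. (13)-(16), with stage indices dropped.
paths = {
    "Thr01": ["k", "e", "f", "b", "a", "b", "c", "d"],
    "Thr10": ["k", "e", "f", "h", "g", "h", "i", "j"],
    "Lat01": ["b", "a"],
    "Lat10": ["h", "g"],
}
t = {name: sum(delay[x] for x in elems) for name, elems in paths.items()}

# A conservative (worst-case) figure takes the slower of the two data values.
worst_thr = max(t["Thr01"], t["Thr10"])
worst_lat = max(t["Lat01"], t["Lat10"])
print(t, worst_thr, worst_lat)
```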

4.2 LDA Method

One stage of the 4-bit LDA FIFO is illustrated in Fig. 2 (b) [8]. The throughput time for the 4-bit datapath is given by

$$\begin{aligned}
T_{Thr} \approx\ & t_{a(N+1)} + t_{i(N+1)} + t_{j(N+1)} + t_{k(N+1)} + t_{b(N+1)} + t_{c(N+1)} + t_{d(N)} \\
& + t_{e(N)} + t_{c(N)} + t_{f(N-1)} + t_{h(N-1)} + t_{g(N-1)} + t_{h(N)} + t_{g(N)}
\end{aligned} \tag{17}$$

The latency is given by

$$T_{Lat} \approx t_{h(N)} + t_{g(N)} \tag{18}$$

4.3 LP2/1 Method

One stage of the 4-bit LP2/1 FIFO [5] is shown in Fig. 2 (c). The throughput time for the 4-bit FIFO is given by

$$\begin{aligned}
T_{Thr} \approx\ & t_{\mathrm{EvalDelay}(N-1)} + t_{k(N-1)} + t_{h(N-1)} + t_{j(N-1)} + t_{a(N-1)} \\
& + t_{b(N-1)} + t_{a(N)} + t_{b(N)} + t_{c(N+1)} + t_{d(N+1)} + t_{e(N+1)} \\
& + t_{f(N+1)} + t_{g(N+1)}
\end{aligned} \tag{19}$$

where $t_{\mathrm{EvalDelay}}$ is the evaluation delay required for proper operation of the circuit; here, it is generated by two cascaded NOT gates. The latency is given by

$$T_{Lat} \approx t_{a(N)} + t_{b(N)} \tag{20}$$

4.4 RSPCFB Method

One stage of the 4-bit RSPCFB FIFO is illustrated in Fig. 2 (d) [9]. The throughput time for the 4-bit FIFO is given by

$$\begin{aligned}
T_{Thr} \approx\ & t_{a(N)} + t_{b(N)} + t_{zp2(N)} + t_{dn(N)} + t_{ep1(N)} + t_{ep2(N)} + t_{ep3(N)} \\
& + t_{ep4(N)} + t_{eNOT(N)} + t_{fp2(N)} + t_{fNOT(N)} + t_{g(N)} + t_{zn2(N)} \\
& + t_{dp2(N)} + t_{en(N)} + t_{eNOT(N)} + t_{fn2(N)} + t_{fNOT(N)}
\end{aligned} \tag{21}$$

The latency is given by

$$T_{Lat} \approx t_{b(N)} + t_{zp2(N)} \tag{22}$$


Fig. 2. Circuits of (a) STFB, (b) LDA, (c) LP2/1, (d) RSPCFB, and (e) NCL; (f) throughput and (g) latency.


4.5 NCL Method

One stage of the 4-bit NCL FIFO is illustrated in Fig. 2 (e); it consists of several 2-of-2 and 4-of-8 threshold gates, whose detailed circuits are also shown in the figure [10]. Assuming valid data on the P-rail, the throughput time for the 4-bit datapath is

$$\begin{aligned}
T_{Thr} \approx\ & t_{g2(N)} + t_{k(N)} + t_{x1(N)} + t_{x2(N)} + t_{x3(N)} + t_{x3(N)} + t_{l(N)} \\
& + t_{p(N)} + t_{g4(N-1)} + t_{g3(N-1)} + t_{k(N-1)} + t_{y8(N)} + t_{y7(N)} \\
& + t_{y6(N)} + t_{y5(N)} + t_{y4(N)} + t_{l(N)} + t_{p(N)} + t_{g1(N-1)} \\
& + t_{g2(N-1)} + t_{k(N-1)}
\end{aligned} \tag{23}$$
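The m-of-n threshold gates used in NCL have hysteresis: the output is asserted once at least m inputs are high and is de-asserted only after all inputs have returned low. A minimal behavioral sketch of that rule, as an idealized and unweighted model; the class name is hypothetical and no transistor-level timing is modeled.

```python
class ThresholdGate:
    """Idealized, unweighted m-of-n threshold gate with hysteresis (NCL THmn style).

    The output rises once at least m of the n inputs are high, falls only when
    all inputs are low, and otherwise holds its previous value.
    """

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("require 1 <= m <= n")
        self.m, self.n = m, n
        self.out = 0

    def update(self, inputs):
        if len(inputs) != self.n:
            raise ValueError(f"expected {self.n} inputs, got {len(inputs)}")
        high = sum(inputs)
        if high >= self.m:
            self.out = 1  # set: threshold reached
        elif high == 0:
            self.out = 0  # reset: all inputs have returned to NULL
        # otherwise: hold the previous output (hysteresis)
        return self.out


th22 = ThresholdGate(2, 2)  # behaves like a Muller C-element
print([th22.update(x) for x in ([1, 0], [1, 1], [0, 1], [0, 0])])  # [0, 1, 1, 0]

th48 = ThresholdGate(4, 8)
print(th48.update([1, 1, 1, 0, 0, 0, 0, 0]), th48.update([1, 1, 1, 1, 0, 0, 0, 0]))  # 0 1
```

The hold state is what lets a chain of such gates wait for a complete DATA or complete NULL wavefront before switching, which is why NCL needs no separate timing assumption.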

The latency of this method is

$$T_{Lat} \approx t_{g1(N)} + t_{g2(N)} + t_{k(N)} \tag{24}$$

5 Results and Discussion

The throughput and latency times of the 4- and 16-bit FIFOs designed in the ten styles can be estimated using the analytical transistor delay model of Section 2. The expressions obtained for these times are given in Table I (b).

Table I (a). Technology dependent parameters for the transistor delay model

Table I (b). Prediction of the analytical model for the throughput and latency times

Table I (c). Simulation results

Energy and delay are among the most important design parameters of digital circuits; hence, a figure of merit, denoted by $\alpha_1$, is defined here as

$$\alpha_1 = \frac{\text{Power}}{\text{Throughput}^2} \tag{25}$$

If, for a given circuit, throughput (delay) is more important, a better figure of merit may be defined as

$$\alpha_2 = \frac{\text{Power}}{\text{Throughput}^3} \tag{26}$$
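The two figures of merit can be computed directly from power and throughput. A minimal sketch, assuming throughput is expressed as a rate (data items per second) and that, as for other energy-delay products, lower values are better. Under that assumption, Power/Throughput is the energy per data item and 1/Throughput the cycle time, so $\alpha_1$ behaves like energy times delay and $\alpha_2$ like energy times delay squared. The two designs below are hypothetical, not entries from Table I (c).

```python
def figures_of_merit(power_w, throughput_hz):
    """alpha1 = P / Thr^2 (Eq. 25) and alpha2 = P / Thr^3 (Eq. 26).

    With throughput as a rate, P / Thr is the energy per data item and
    1 / Thr the cycle time, so alpha1 = E * T and alpha2 = E * T^2.
    """
    alpha1 = power_w / throughput_hz ** 2
    alpha2 = power_w / throughput_hz ** 3
    return alpha1, alpha2


# Two hypothetical designs (not from Table I (c)).
a1_A, a2_A = figures_of_merit(power_w=5e-3, throughput_hz=2e9)  # faster, more power
a1_B, a2_B = figures_of_merit(power_w=1e-3, throughput_hz=1e9)  # slower, less power
print(a1_A > a1_B)  # True: B has the lower (better) alpha1
print(a2_A < a2_B)  # True: A has the lower (better) alpha2
```

The example shows the intended effect of the extra power of throughput in $\alpha_2$: the faster design loses on $\alpha_1$ but wins on $\alpha_2$.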

For our comparative study, we simulated 4- and 16-bit, 4-stage FIFOs designed in each of the ten styles in a 0.18 µm CMOS technology (Vdd = 2 V, T = 25 °C). The simulations were run for 30 ns with an identical input data pattern. The results are summarized in Table I (c), which lists the throughput, latency, power, $\alpha_1$, and $\alpha_2$ of both the 4- and 16-bit FIFOs, as well as the relative change of each parameter when the datapath width is increased from 4 to 16 bits. The throughput and latency are plotted in Fig. 2 (f) and (g), respectively, where the analytical and simulation results are compared. The comparison shows good agreement between the analytical predictions and the simulation results for the designs under study.
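The accuracy comparison and the 4-bit to 16-bit relative changes reduce to simple ratios. A minimal sketch, assuming the conventional definitions below (the exact formulas used for Table I are not stated, and the helper names are hypothetical); the numbers are placeholders, not values from Table I (b) or (c).

```python
def relative_error(predicted, simulated):
    """Signed relative error of an analytical prediction against simulation."""
    return (predicted - simulated) / simulated


def relative_change(value_4bit, value_16bit):
    """Relative change of a parameter when the datapath grows from 4 to 16 bits."""
    return (value_16bit - value_4bit) / value_4bit


# Placeholder numbers only.
print(f"{relative_error(predicted=210.0, simulated=200.0):+.1%}")   # +5.0%
print(f"{relative_change(value_4bit=4.0, value_16bit=10.0):+.0%}")  # +150%
```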

6 Summary and Conclusions

This paper presented a comparative study of ten micropipeline design styles, in which the latency and throughput times of each style were expressed in terms of the delays of its elements. For the study, 4- and 16-bit FIFOs were designed and simulated using HSPICE in a 0.18 µm CMOS technology. The designs were then compared in terms of throughput, latency, power, transistor count, and the effect of datapath width. In addition, we proposed an analytical transistor delay model, which was used to estimate the latency and throughput times of the circuits. The accuracy of the model was assessed by comparing its predictions with the simulation results.
