ASYNCHRONOUS WAVE PIPELINES FOR HIGH THROUGHPUT DATAPATHS

O. Hauck and S. A. Huss
{hauck | [email protected]}
phone: +49 6151 16 3983, fax: +49 6151 16 4810
Integrated Circuits and Systems Lab, Departments of CS and EE
Darmstadt University of Technology
Alexanderstr. 10, 64283 Darmstadt, Germany

ABSTRACT

A novel VLSI pipeline architecture for high-speed clockless computation is proposed. It features gate-level pipelining to maximize throughput and uses dynamic latches to keep the latency low. The most salient property is the asynchronous operation using a modified handshake protocol. Data words are accompanied by associated control signals resembling a local clock and propagate in coherent waves through the logic. As a result one can take advantage of the asynchronous operation and avoid the problems prevalent with global high-speed clocks in synchronous designs. HSpice simulations of a 4-bit adder designed in 0.7 µm CMOS indicate data throughput rates of 1 GHz.

Figure 1: Pipelining combinational logic

Figure 2: Wave pipelining combinational logic

1. INTRODUCTION

This research work originates from the attempt to combine the advantages of wave pipelining and asynchronous operation, i.e., very high throughput at low latency without any of the problems related to a global clock. Furthermore, it aims at avoiding well-known drawbacks of synchronous wave pipelines [1] and of previously presented elastic asynchronous datapaths such as Micropipelines [2]. The result is a hybrid architecture that incorporates ingredients from conventional pipelines, synchronous wave pipelines, and asynchronous Micropipelines. This paper is organized as follows. The next two sections review wave pipelines and Micropipelines, respectively. In section 4 the new architecture of asynchronous wave pipelines is introduced. Section 5 applies the principles outlined in section 4 to an example circuit and shows pre-layout simulation waveforms. Section 6 gives a summary and discusses topics for further research.

2. REVIEW OF WAVE PIPELINES

In applications where high throughput is the first concern, pipelining is the technique of choice. This is the case, e.g., in RISC microprocessors and in the DSP domain. Some increase in latency is tolerated there for the sake of higher throughput. On the other hand, latency matters when, e.g., an addition has to be completed within a single clock cycle. In general, latency and throughput trade off against each other. Figure 1 shows how a block of combinational logic is partitioned and pipeline registers are inserted. Partitioning into N stages nominally multiplies the throughput by N, provided that the longest path in the logic is evenly partitioned into N slices as well.

However, pipelining comes at the cost of increased area and power consumption. In addition, the propagation delay of the registers (or latches) adds to the latency, and the setup time lengthens the cycle time. Finally, when a high-speed clock is used, clock skew cuts into the cycle. It is therefore worthwhile to look for alternative methods of pipelining. While in conventional pipelines each logic slice contains only one data word, wave pipelining uses fewer registers but has more than one data word simultaneously active in a slice of logic, with the internal node capacitances providing the storage. Figure 2 illustrates this concept. The name wave pipelining reflects the way the data propagate through the logic. It is of paramount importance that a fast wave does not overrun a previous slow wave, which would result in data loss. Whereas in a conventional pipeline throughput is determined by the longest path in any stage, in a wave pipeline throughput is limited by the difference between the shortest and longest paths. Throughput in a wave pipeline is therefore maximized by equalizing path delays. However, this is not an easy undertaking, as there are several sources of delay mismatch. Variations due to the different logic depths in the network can be compensated by inserting delay elements. Variations due to data dependencies in CMOS gates are more difficult to handle: e.g., the delay of a NAND gate is shortest when both inputs switch from 1 to 0 (parallel pull-up) and longest when both inputs switch from 0 to 1 (series pull-down), and the difference may well reach up to 50% of the average delay.
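As a rough illustration of this difference, consider a standard first-order timing model (ours, not taken from this paper): let D_max and D_min be the longest and shortest path delays through the logic block, t_R the per-stage register overhead of a conventional N-stage pipeline, and t_S the sampling window (setup and hold plus skew) of the registers framing a wave pipeline. Then, approximately,

\[
T_{\mathrm{conv}} \;\ge\; \frac{D_{\max}}{N} + t_R ,
\qquad
T_{\mathrm{wave}} \;\ge\; \left(D_{\max} - D_{\min}\right) + t_S ,
\]

so a wave pipeline can approach the cycle time of a deeply partitioned conventional pipeline without inserting N register banks, provided the spread D_max − D_min is kept small.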

Previous work has addressed these problems with specialized logic structures, e.g. transmission-gate structures as in [3, 4] or wave-domino logic [5, 6]. Klass and Flynn [7] use ordinary static CMOS NAND2 gates and rely on statistical properties of the data, since the maximal delay difference occurs only when a NAND2 gate is used as an inverter, which is deemed unlikely. Wong et al. [8] have developed CAD algorithms aimed at CML/ECL. A first pass called 'rough tuning' pads all logical paths to the length of the longest path; a subsequent 'fine tuning' pass incorporates layout information and tunes the gates appropriately. Nowka [9] presents methods such as transistor sizing, biased logic, and current-starved drivers to balance CMOS logic. In addition, [9] analyzes the performance limits of wave-pipelined systems in the presence of delay variations induced by temperature, noise, and process parameters. With the above methods the delay variation can be limited to about 10% to 20% of the maximum delay in CMOS, so speedups over traditional pipelines of between two and three are achievable. The bottom line is that in order to achieve a significant speedup, either complicated design methods have to be followed or one has to restrict oneself to a special logic style. Furthermore, reliable operation in the presence of environmental variations is difficult to guarantee.
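The 'rough tuning' idea can be pictured as a longest-path padding pass over the gate network. The sketch below is only our illustration of that idea, not the algorithm of [8]; the netlist format, gate names and delays are hypothetical.

```python
# Illustrative "rough tuning": pad gate inputs so that every path from the
# primary inputs arrives at the longest-path arrival time.
# Format: gate -> (delay, [fan-in nets]); nets not listed as gates are
# primary inputs.  Delays are in arbitrary units.

netlist = {
    "g1":  (1.0, ["a", "b"]),
    "g2":  (1.5, ["b", "c"]),
    "g3":  (1.0, ["g1", "g2"]),
    "out": (0.5, ["g3", "a"]),
}

def arrival_times(netlist):
    """Longest-path arrival time at each net (primary inputs arrive at t = 0)."""
    arr = {}
    def t(net):
        if net not in netlist:              # primary input
            return 0.0
        if net not in arr:
            delay, fanins = netlist[net]
            arr[net] = delay + max(t(f) for f in fanins)
        return arr[net]
    for net in netlist:
        t(net)
    return arr

def rough_tuning(netlist):
    """Delay padding to insert on each gate input so all fan-ins of a gate
    arrive simultaneously (hence all paths are padded to the longest one)."""
    arr = arrival_times(netlist)
    at = lambda n: arr.get(n, 0.0)
    padding = {}
    for gate, (_, fanins) in netlist.items():
        latest = max(at(f) for f in fanins)
        for f in fanins:
            slack = latest - at(f)
            if slack > 0.0:
                padding[(f, gate)] = slack
    return padding

print(rough_tuning(netlist))
# -> {('g1', 'g3'): 0.5, ('a', 'out'): 2.5}  (delay elements to insert)
```

The returned map says how much artificial delay to insert on each under-balanced edge; 'fine tuning' would then refine these values with layout-extracted delays.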

3. REVIEW OF MICROPIPELINES

Asynchronous design [10] has three main potential advantages over synchronous design. Above all, there are no problems related to the generation and distribution of a global clock, which is a dominant challenge in today's high-speed designs and one that advances in fabrication technology will only aggravate. Instead, in asynchronous systems communication is carried out locally according to a protocol, which helps in the modular design of complex systems. Secondly, asynchronous systems exhibit average-case performance, while synchronous ones are limited by the worst case. This can be turned into a speed advantage of asynchronous designs [11]. Finally, asynchronous systems have an inherent power-down mode, since computation is performed only when data is available. This is in contrast with purely synchronous systems, where registers are clocked regularly even when the input data has not changed. However, there are clouds in the sky. Asynchronous designs are plagued by the well-known hazard problems and are notoriously difficult to design. The potential advantages are hard to realize in practice due to the difficulty of efficiently implementing the communication protocol and the completion detection units. Besides, there is no commercial CAD support available to date. As the main focus of this work is on datapaths, one has to consider Micropipelines [2], a framework for two-phase handshake, elastic, pipelined computation. Figure 3 shows a Micropipeline without processing logic between the stages, i.e. a FIFO. The gates with a C inside are Muller C-elements. Communication within the pipeline follows a two-phase handshake, or transition signalling, protocol. Data at the inputs requests operation with an event (a rising or falling transition) on the R(in) request line.

Figure 3: Micropipeline in FIFO configuration

An event on the A(in) line acknowledges the data word. The unit accepts data as long as its internal memory permits and tries to deliver data at the output by signalling an event on R(out). If the receiver cannot accept the data, no event on A(out) occurs and the Micropipeline blocks, saving the data for future delivery. If data words continue to flow into the pipeline, it will eventually fill up and consequently deny the acknowledge event on A(in). There are several important properties the Micropipeline is noted for and that have made it an essential building block in asynchronous design. First, the input and output side operate independently; in contrast, synchronous FIFOs have to accommodate two independent clocks when placed at the border between the system and the outside world. Secondly, the Micropipeline is self-flushing. Thirdly, the elastic operation is a very convenient feature that allows modular system design exploiting standard building blocks. On the other hand, Micropipelines exhibit high latency and satisfactory, but not the utmost, throughput. With logic blocks inserted between the stages, the delay of the logic has to be modeled with a delay element, which is difficult to do accurately. Figure 3 shows the so-called capture-pass latches of Sutherland's original proposal. These are event-controlled and difficult to implement, which has led researchers to use four-phase handshaking and level-sensitive latches instead [12]. Yun et al. [13] propose the use of Svensson-style double edge-triggered D-flip-flops. Brunvand [14] gives several alternative architectures to improve latency.
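To make the elastic blocking behavior concrete, the following behavioral sketch (plain Python, our own abstraction; the two-phase request/acknowledge transitions are collapsed into a per-stage occupancy flag rather than modeled as signal events) mimics a Micropipeline FIFO: words ripple forward as far as free stages allow, and the input side is refused once the pipeline is full.

```python
# Behavioral sketch of an elastic FIFO with Micropipeline-like blocking.
# No circuit timing is modeled; stage occupancy stands in for the
# unacknowledged request/acknowledge events.

class MicropipelineFIFO:
    def __init__(self, depth):
        self.data = [None] * depth       # contents of the capture-pass latches
        self.full = [False] * depth      # True while a stage holds an unconsumed word

    def push(self, word):
        """Producer side (R(in)/A(in)): accepted only if stage 0 is free."""
        if self.full[0]:
            return False                 # pipeline backed up: no acknowledge
        self.data[0], self.full[0] = word, True
        return True

    def pop(self):
        """Consumer side (R(out)/A(out)): take the word held by the last stage."""
        last = len(self.data) - 1
        if not self.full[last]:
            return None                  # no request pending at the output
        word, self.full[last] = self.data[last], False
        return word

    def settle(self):
        """Let words ripple forward: a stage hands its word on as soon as the
        following stage is empty, as the C-elements would allow."""
        for i in range(len(self.data) - 2, -1, -1):
            if self.full[i] and not self.full[i + 1]:
                self.data[i + 1], self.full[i + 1] = self.data[i], True
                self.full[i] = False

fifo = MicropipelineFIFO(depth=3)
for word in ("w0", "w1", "w2", "w3"):
    print(word, "accepted" if fifo.push(word) else "blocked")
    fifo.settle()
print("output:", fifo.pop())             # -> w0 after it has rippled through
```

Pushing a fourth word into the full three-stage FIFO is refused, just as the real circuit would withhold the next A(in) event.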

4. ARCHITECTURE OF ASYNCHRONOUS WAVE PIPELINES

In the preceding sections the pros and cons of wave pipelines and Micropipelines were discussed. We are now in a position to propose asynchronous wave pipelines that combine very high throughput with asynchronous operation. We emphasize that the scheme eliminates the difficult delay-balancing issue pertinent to the design of traditional wave pipelines as well as the cumbersome storage elements and the performance limitations of Micropipelines. The proposed solution combines the best of both by taking a step back towards synchrony, but only locally. Of course, we cannot expect to gain all the advantages without accepting any shortcomings: we have to sacrifice the elasticity of Micropipeline operation and instead manage an aggressive timing.

Figure 4: Basic asynchronous wave pipeline scheme

In the following, we restrict ourselves to simple feed-forward pipelines having n inputs x1, ..., xn and m outputs y1, ..., ym, i.e. a general Boolean mapping F : {0,1}^n → {0,1}^m implemented as a combinational net. Figure 4 shows the basic architecture. Along with the logic net there is a request line which carries a pulse in parallel with every data word. There is no acknowledge signal, and thus the protocol is inelastic; it represents just the minimal effort needed for data integrity. We would like the logic to be wave pipelined without any storage elements, but at the same time we need the pulse on the request line to stay coherent with its associated datum. It turns out that the datum and its pulse, which are semantically linked, have to be physically linked as well (indicated by the dashed lines), because propagating the data and its request pulse independently of each other poses a balancing problem even worse than in traditional wave pipelining. The good news is that if one takes on the burden of establishing a link between the request line and the logic, this solves the problem of balancing the combinational logic at no extra effort. The idea is simply to tap the request line and to use the taps to control some sort of dynamic latch. Now one may object: if it uses latches, then it is not wave pipelining! The answer is that if the granularity is just one transistor or gate, then pipelining is necessarily wave pipelining, since the fine granularity keeps the waves coherent.

At the gate level, the concepts of traditional pipelining and wave pipelining coincide. Of course, for the scheme to work it is mandatory that the latches be very simple in order not to compromise the throughput. This is detailed in the next section.
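As a behavioral illustration of this discipline (our own sketch, not the authors' design flow; stage delays, pulse width and the checking rules are made-up first-order quantities), the following script checks the two conditions every latch tap must satisfy: the pulse may open a latch only after its stage has settled, and it must close the latch before the next wave arrives.

```python
# Behavioral check of the asynchronous wave pipeline timing discipline.
# Each latch stage i is characterized by the settling delay of the logic
# feeding it and by the delay of the request-line buffers up to its tap.
#   long-path:    the pulse must arrive after the logic has settled
#   race-through: the latch must close before the next wave's data arrives

def check_timing(logic_delay, req_delay, pulse_width, launch_period):
    """logic_delay[i], req_delay[i]: per-stage delays in arbitrary units;
    launch_period: separation between successive request pulses at the input."""
    assert len(logic_delay) == len(req_delay)
    data_t = pulse_t = 0.0
    for i, (dl, dr) in enumerate(zip(logic_delay, req_delay)):
        data_t += dl        # wave 0's data has settled at latch i's input
        pulse_t += dr       # wave 0's pulse opens latch i
        if pulse_t < data_t:
            return f"stage {i}: long-path violation (latch opens too early)"
        if pulse_t + pulse_width > data_t + launch_period:
            return f"stage {i}: race-through violation (latch open into next wave)"
    return "ok"

# Three latch stages; request buffers sized slightly slower than the logic
# they shadow; a new wave launched every 1.0 time unit.
print(check_timing(logic_delay=[0.6, 0.7, 0.6],
                   req_delay=[0.7, 0.75, 0.65],
                   pulse_width=0.3,
                   launch_period=1.0))    # -> ok
```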

5. AN ASYNCHRONOUS WAVE PIPELINED ADDER

This section summarizes the design of a 4-bit adder according to the principles given above. Since the emphasis is on throughput, a simple ripple-carry adder architecture is used for demonstration purposes; when latency matters as well, one of the fast adder schemes should be used. A transmission-gate full adder similar to the one described in [15] is employed.

It computes

\[
\mathrm{SUM} = (A \oplus B)\,\overline{C} + \overline{(A \oplus B)}\,C ,
\qquad
\mathrm{CARRY} = \overline{(A \oplus B)}\,B + (A \oplus B)\,C .
\]

The adder can be implemented with transmission-gate multiplexers. The logic has three stages, namely the input inverters, the multiplexer stage computing A ⊕ B as well as its complement, and the output multiplexer stage. Thus there are two latch stages to insert. Figure 5 shows the pipelined version of the full adder together with the request line.
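A quick truth-table check (our own, purely for illustration) confirms that these multiplexer equations reproduce the usual full-adder functions:

```python
# Check that the multiplexer equations above reproduce the textbook
# full-adder functions sum = a ^ b ^ c and carry = majority(a, b, c).
from itertools import product

for a, b, c in product((0, 1), repeat=3):
    p = a ^ b                                   # A xor B ("propagate")
    sum_mux   = (p & (1 - c)) | ((1 - p) & c)   # (A xor B)*~C + ~(A xor B)*C
    carry_mux = ((1 - p) & b) | (p & c)         # ~(A xor B)*B + (A xor B)*C
    assert sum_mux == a ^ b ^ c
    assert carry_mux == (a & b) | (b & c) | (a & c)

print("multiplexer equations match the full-adder truth table")
```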

Figure 5: Pipelined transmission-gate full adder

Figure 6: The complete 4-bit adder

Transmission gates enclosed between inverters are used as dynamic latches. The inverters are necessary since we use transmission gates in the logic as well and cannot tolerate the delay of series transmission gates. Every path between two latch stages must exhibit the same delay, and thus feed-through paths have a delay element inserted; permanently turned-on transmission gates sized to match the delay of an inverter are used for this purpose. The buffers in the request line consist of six inverters each, with all transistors having L = 0.8 µm and Wp/Wn = 120/48, 120/48, 120/48, 160/80, 200/100, 250/100. The operation is as follows. The latches of a stage are closed until the preceding logic has fully evaluated. Then the pulse arrives and opens the latches long enough for the next logic stage to take over the data (long-path constraint), but not so long that the data overruns the next stage (race-through constraint). A careful design of the request line is important for correct operation. However, unlike in synchronous systems, we only have to manage the request line locally within the unit, not over the whole chip. Figure 6 shows the complete 4-bit adder. The sequential nature of the ripple adder manifests itself in the population of the diagonal only; all other paths have to be padded with delay elements.

Figure 7 shows pre-layout HSpice simulations.(1) The traces shown (from the top, through all three diagrams) are cin, a0, b0, a1, b1, a2, b2, a3, b3, req_in, req_out, s0, s1, s2, s3, cout, req_out. As can be seen from the req_in stimulus, the operation is asynchronous. Computation takes place only when a pulse on req_in indicates data. The data at the output is valid when accompanied by a pulse on req_out, specifically at the end of the req_out high phase. The maximal throughput of this example circuit is 1 Giga-word per second.

(1) We have not produced a layout due to time pressure and because Europractice phases out ES2 0.7 µm CMOS in June. Layout will be done in MIETEC 0.5 µm CMOS; the adder will then be made wider as well.
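In first-order terms (our notation, not the paper's: B_i is the cumulative request-buffer delay up to the tap of latch stage i, D_i the cumulative settling delay of the logic feeding that stage, w the pulse width, and T the separation between successive input pulses), the two constraints quoted above read approximately

\[
B_i \;\ge\; D_i \quad \text{(long path)},
\qquad
B_i + w \;\le\; D_i + T \quad \text{(race-through)},
\]

i.e. the request buffers have to track the logic delays to within a slack of T − w at every tap; this is the aggressive but purely local timing that the request-line sizing above must guarantee.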

6. CONCLUSION AND FUTURE RESEARCH

This paper has presented asynchronous wave pipelines for high-throughput datapaths. The architecture combines advantages of traditional wave pipelines and of asynchronous operation while avoiding some of the problems pertinent to each. Throughput in the gigahertz range is achievable. While this work is experimental in character and the critical request line was sized manually, an optimal and automated design of the critical portions remains to be investigated. Another open problem is maintaining the exact shape of the pulses. Finally, while the discussion in this paper was restricted to simple feed-forward pipelines, a generalization to systems with feedback seems interesting.

7. REFERENCES

[1] L. W. Cotten, "Maximum-rate pipeline systems," 1969 AFIPS Proc. Spring Joint Computer Conf., vol. 34, Montvale, NJ: AFIPS Press, pp. 581-586, May 1969.
[2] Ivan E. Sutherland, "Micropipelines," Communications of the ACM, 32(6):720-738, June 1989.
[3] Debabrata Ghosh and S. K. Nandy, "Design and Realization of High-Performance Wave-Pipelined 8x8 Multiplier in CMOS Technology," IEEE Trans. on VLSI, vol. 3, no. 1, pp. 36-48, March 1995.
[4] X. Zhang and R. Sridhar, "CMOS Wave Pipelining Using Transmission Gate Logic," Proceedings of the IEEE International ASIC Conference and Exhibit, Rochester, NY, 1994.
[5] W. Lien and W. Burleson, "Wave-Domino Logic: Theory and Applications," IEEE Trans. on Circuits and Systems, vol. 42, no. 2, February 1995.
[6] Sanu Mathew and Ramalingam Sridhar, "Efficient Clocking of a Wave-Domino Pipeline," Proceedings IEEE Intern. Symp. on Circuits and Systems, pp. 1832-1835, 1997.
[7] F. Klass and M. Flynn, "A 16x16-bit Static CMOS Wave-Pipelined Multiplier," Proceedings IEEE Intern. Symp. on Circuits and Systems, pp. 143-146, 1994.
[8] D. Wong, G. De Micheli, and M. Flynn, "Designing High-Performance Digital Circuits Using Wave Pipelining: Algorithms and Practical Experiences," IEEE Trans. on CAD, vol. 12, no. 1, January 1993.
[9] Kevin J. Nowka, "High-Performance CMOS System Design Using Wave Pipelining," PhD Thesis, Computer Systems Laboratory, Stanford University, August 1995.
[10] S. Hauck, "Asynchronous Design Methodologies: An Overview," Proceedings of the IEEE, 83(1), 1995.
[11] O. Hauck, H. Sauerwein, and S. A. Huss, "Asynchronous VLSI Architectures for Huffman Codecs," to appear in Proceedings IEEE Intern. Symp. on Circuits and Systems, 1998.
[12] Paul Day and J. Viv. Woods, "Investigation into Micropipeline Latch Design Styles," IEEE Trans. on VLSI, vol. 3, no. 2, June 1995.
[13] Kenneth Y. Yun, Peter A. Beerel, and Julio Arceo, "High-Performance Asynchronous Pipeline Circuits," Proceedings ASYNC'96, pp. 17-28, March 1996.
[14] Erik Brunvand, "Low Latency Self-Timed Flow-Through FIFOs," Proceedings 16th Conference on Advanced Research in VLSI, pp. 76-90, March 1995.
[15] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, second edition, Addison-Wesley, 1993, p. 526.

Figure 7: Pre-layout simulation results of the 4-bit adder
