Asynchronous Pipeline Design using GaAs PDLL Logic and new CMOS dynamic techniques. Sam S. Appleton, Shannon V. Morton, & Michael J. Liebelt.
Internal Report: HPCA-ECS-96/04, version I, June 17, 1997
Abstract: We explore the potential for extremely high asynchronous logic performance in CMOS and GaAs dynamic logic structures. By using a new class of GaAs dynamic logic, Pseudo-Dynamic Latched Logic, we develop asynchronous control structures capable of high-speed operation. We show how these techniques can be used for CMOS design, and give performance estimates.
I Introduction
Asynchronous approaches have long been touted as the ideal replacement for synchronous designs when clock skew problems and power consumption become unmanageable. However, the lack of acceptance of asynchronous methods seems to indicate that these views are not shared by the design community as a whole. This is primarily due to the lower performance offered by asynchronous solutions in a given technology compared with that achievable by a synchronous system. In this paper, we explore the potential for exploiting very high speed operation of CMOS and GaAs dynamic logic structures using a two-phase signalling, bounded-delay engineered approach called Event Controlled Systems (ECS). We have demonstrated the superiority of ECS pipelines in the bounded-delay environment compared to other two- and four-phase approaches[1, 2]. The work in this paper was spurred by the development of Pseudo-Dynamic Latched Logic (PDLL) by Lopez[3] and his visit to Adelaide in September 1996.
II ECS and PDLL methods
In this section we briefly introduce critical ECS and PDLL terms, notation and constructs. For more exhaustive details, the reader is referred elsewhere[1, 2, 3].
II-A ECS
Figure 1: ECS State Pipeline. (a) Pipeline circuit; (b) pipeline signal flow graph.
ECS is a bounded-delay, two-phase signalling approach for asynchronous design developed at the University of Adelaide. We will not explain operators or gates here; instead we introduce the fundamental ECS pipeline structure, the state pipeline, shown in Figure 1. The control flow of the pipeline structure is illustrated in Figure 1(b). The send element is a transparent, initialisable latch. The UNTIL element is implemented as a standard XOR element. The output drive ability of the XOR element must be optimised in order to drive the latch line Lt in a reasonable time. Timing constraints[2] apply, but are minimal and "highly meet-able".
Fig. 2 shows the operating waveforms for a 3-stage state pipeline with 8 latches connected to each stage. The gates were completely unoptimised but the system still operates at a rate of approximately 1.5 GHz. The send and XOR gates thus represent the most necessary control path elements for implementation of a system controlled using ECS. We have demonstrated the efficiency of using such a structure[1].
Figure 2: DCFL Waveforms
The pipeline can readily be extended to the case where we do not wish to use a delay model for the computation; although delay models are fine in CMOS, they can be quite undesirable in GaAs. The delay element is changed to a send, which is controlled by the ComputationDone signal, as shown in Figure 3. Note that there is a performance bottleneck here: if there is zero margin between @req in and DataIn, then data arrives at the computation inputs after Tpl if the latch line Lt is high. However, the control signal initiating the computation, StartComp, does not arrive until much later, creating a "logic dead-time" of
$t_{dead} = T_{send} + T_{until} + T_{\uparrow StartComp} - T_{pl}$
as the logic cannot begin evaluation until the StartComp signal goes high. Later we will investigate techniques for eliminating this logic dead-time. Note that the addition of margin between @req out and DataIn only worsens this situation.
Figure 3: State Pipe with Completion Control
If we are using dynamic logic this penalty can be alleviated to a certain extent because we can overlap the input latching action with the precharge of the stage logic. However, there is one penalty which creates dead-time that cannot be removed in an asynchronous approach: the output handshake. The problem is illustrated in Figure 4. When a request, @req out, is generated it must go through the next stage's handshake logic before returning the @ack out signal, and then the input latch must be opened. This presents a minimum logic dead-time for the stage, since we cannot even perform precharge during this interval: we must hold the output of this stage at a steady value until it has been latched (indicated by the @ack out signal).
Figure 4: Output Handshake Action
II-B Fast ECS Pipelines
We can exploit the generation of the ComputationDone signal, shown in Fig. 3, in the control of the subsequent stage by using the Morton Precharge Pipeline (MPP), shown in Figure 5. In each MPP stage, the completion signal (ComputationDone is here called SetOut) is forwarded to the subsequent stage. The arriving completion signal, called SetIn, causes evaluation to commence if the stage is in the free state (i.e. Lt is high) and is precharged. Concurrently, the @req in event arrives to control the latching of the data, but has no impact on the time to commence evaluation. When the stage's SetOut signal goes high, the next stage begins evaluation while the current stage completes output handshaking. The arrival of the @ack out event causes precharge to commence, and also sets Lt high to allow new data into the stage. This is safe, as the result of the previous computation has been passed out and the stage is presently precharging. When precharge completes and SetIn is high, the stage re-commences evaluation.
We modified the existing MPP structure[1] to exploit the parallelism of the input handshake and precharge action; the result is named the modified MPP (mMPP). The input handshake (and thus the latching action) is now concurrent with the precharge phase, with no control overhead. The timing of one mMPP stage is shown in Fig. 5(b). Note that the pipelines we have discussed so far are general pipelines that expect the dynamic logic to have an evaluation signal. Many of the logic structures we will examine in this paper have no inherent evaluation signal, and we will describe new control architectures to take advantage of these logic families.
Figure 5: Morton Precharged Pipeline. (a) Circuit; (b) timing diagram.
II-C PDLL GaAs Logic Family
The PDLL (Pseudo-Dynamic Latched Logic) logic style[3] is a new GaAs logic style for implementing pseudo-dynamic logic functions in GaAs technology. Implementing dynamic logic functions in GaAs is quite difficult because of the high gate leakage current caused by the Schottky diodes formed at the MESFET gate junction. Other approaches to GaAs dynamic logic have been proposed[5, 6], but PDLL gives better performance when considering the combination of power, speed, and area[3]. In addition, the design of PDLL gates is considerably easier, and requires no special elements, such as capacitors, to be implemented. The basic form of a PDLL gate is shown in Figure 6(a). The logic family possesses a great deal of similarity to CMOS domino logic, shown in Figure 6(b). Note the addition of the weak inverter to the GaAs PDLL logic: this gate replaces charge lost through gate leakage when the node A is floating, but provides no impediment to the switching characteristics of the node A during node switching. Note that the inputs to both styles of gates must go valid during precharge, or be zero during precharge and only make a 0 → 1 transition when precharge is off.
Thus many techniques we develop for dealing with the structure of GaAs PDLL logic are directly applicable to the design of asynchronous domino CMOS structures also. However, in this paper we will mainly be concerned with GaAs PDLL structures. Note that in both logic structures of Fig. 6 we can create arbitrary pull-down trees between the node A and ground. However, the logic style is inherently non-inverting, since we cannot usually use inverted signals as inputs to a dynamic gate. This creates some difficulties in creating XOR type structures, which we will examine later in this paper.
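The precharge/evaluate behaviour described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not a circuit-level description: the class name DynamicGate and its methods are hypothetical, and gate leakage, the weak inverter and all timing are ignored. It shows why inputs should only make 0 → 1 transitions once precharge is off: after node A has discharged, it cannot recover until the next precharge.

```python
class DynamicGate:
    """Behavioural model of a precharged (domino-style) gate.

    Hypothetical illustration: node A is precharged high, a pull-down
    tree may discharge it during evaluation, and the output driver
    inverts node A, so the gate as a whole is non-inverting.
    """

    def __init__(self, pull_down):
        # pull_down: function of the input bits, True when the
        # pull-down tree connects node A to ground.
        self.pull_down = pull_down
        self.node_a = 1

    def precharge(self):
        """Restore node A to the precharged (high) state."""
        self.node_a = 1

    def evaluate(self, *inputs):
        """Apply inputs with precharge off; node A can only fall."""
        if self.pull_down(*inputs):
            self.node_a = 0  # stays low until the next precharge
        return self.output

    @property
    def output(self):
        return 1 - self.node_a  # output inverter/driver


and_gate = DynamicGate(lambda a, b: bool(a and b))
and_gate.precharge()
first = and_gate.evaluate(1, 1)   # pull-down tree conducts, output rises
held = and_gate.evaluate(0, 0)    # inputs return low, output is unchanged
and_gate.precharge()
after_precharge = and_gate.output
```

In this model the output stays high after the inputs return to zero, and only a precharge returns it to zero.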
III PDLL Gates Structures
The design of the various control and datapath gates required for implementation of asynchronous units in PDLL needs to be detailed. The most critical elements to obtain are:
- a fast Send element (basically an initialisable latch)
- a fast XOR element
- a transparent latch structure
Other gates that we might require are developed as they are needed.
III-A Send Gate
The Send gate is really an initialisable latch, and is required to build almost any ECS pipeline. Therefore, we need a fast design so that the effect of the control on overall cycle time is minimised. One approach would be to use a series of DCFL gates, giving high speed, but then the gate may not operate correctly at lower voltages (PDLL logic circuits can operate at a lower voltage than DCFL gates).
III-B XOR Gate
The XOR element is mostly used to do two-to-four phase conversions to control logic path elements, and is extensively used in control structures. Some proposed designs[4] are shown in Fig. 8.
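The two-to-four phase conversion performed by an XOR can be sketched in a few lines. This is an illustrative model only: the function name xor_levels is hypothetical, and which pair of event wires feeds the XOR, and with what polarity, is an assumption here rather than something taken from the ECS pipeline circuit. Each two-phase event is a single transition (rising or falling) on a wire; the XOR of the two wires is a level that is high between a request event and the matching acknowledge event.

```python
def xor_levels(events):
    """Return the XOR level of the req/ack wires after each event.

    events: sequence of "req" or "ack" strings, each standing for one
    transition (of either direction) on that wire.
    """
    req = ack = 0
    levels = []
    for name in events:
        if name == "req":
            req ^= 1   # a two-phase event toggles the wire
        elif name == "ack":
            ack ^= 1
        else:
            raise ValueError(f"unknown event {name!r}")
        levels.append(req ^ ack)
    return levels


# Two complete handshakes: the wires end where they started, and the
# XOR output has pulsed once per handshake.
trace = xor_levels(["req", "ack", "req", "ack"])
```

The level-based output is what a latch enable or precharge line needs, whereas the handshake wires themselves only carry transitions.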
III-C Transparent Latch
Transparent latches are faster than their edge-clocked counterparts, and ECS makes considerable use of transparent latches in pipeline structures like that of Fig. 1. There are two latch solutions that we shall explore, shown in Figure 9. Fig. 9(b) is appropriate in situations where a normal transparent latch structure is required. However, the non-inverting dynamic nature of PDLL logic means that by deactivating the inputs to PDLL gates we can begin precharge sooner. Thus, precharging the latch as well means we can precharge sooner and reduce power consumption slightly (since the inputs to the appropriate logic blocks are all zeroed). This precharged latch structure is shown in Fig. 9(a); note that the cost of this latch is also lower than that of the other, non-precharged version. There may be some slightly higher control costs, as we must now generate precharge and latch signals, as opposed to just latch and complementary latch signals. Note also that we can integrate PDLL functions into the latch structure of Fig. 9 relatively easily, since there is only one extra transistor which determines when to propagate the input to the output (when precharge is not active and Enable goes high).
Figure 6: Dynamic GaAs PDLL and CMOS Gates. (a) PDLL basic gate; (b) CMOS domino basic gate.
Figure 7: GaAs Send Gate Implementations. (a) Send gate in DCFL logic; (b) Send gate in PDLL.
III-D Logic Structures & Inversions
Since PDLL is an inherently non-inverting logic family, if we wish to use some inverting functions in a single logic stage we must either
1. ensure that the inverted signal to the gate becomes valid some time margin before other gate inputs are determined (i.e. before they can make a 0 → 1 transition),
2. stop the evaluation of the stage which uses the inverted signal until we know all inputs (including the inversion) are valid, or
3. separate the logic by a stage boundary.
Option #1 is only possible in certain, restricted logic structures and is not a totally general solution, although we can convert logic structures to this form if required. Option #3 is not desirable since it increases the number of stages, and thus stage control logic and latches, every time we need a gate output inverted as the input to another gate. Option #2 is the solution that is most general and desirable: we stop the evaluation of the logic gate in question until the inputs are all determined. Note that, instead of having an evaluate transistor, we could equally continue to activate the gate's precharge signal until all inputs are determined (saving a transistor), although the gate's precharge signal is then no longer under control of the stage's control logic (and this may not be such a bad thing, as we will see). The two possibilities are shown in Figure 10. Thus the gate can use inverted inputs as long as we do not evaluate the gate output until all the inputs are properly determined. There are two techniques we consider for generating the local control signal, Matched Delay and Matched Path.
Figure 8: XOR gates in PDLL GaAs. (a) XOR solution #1; (b) XOR solution #2.
III-D.1 Matched Delay Technique
The Matched Delay Technique evaluates the relevant gate(s) after a specified delay from the evaluation of a previous gate. We make this delay element asymmetric so that the second gate enters precharge slightly after the first gate enters precharge. The concept is shown in Figure 11. This involves
1. identification of a suitable 'A' gate to control the timing of the 'B' gate's evaluation/precharge signal, if possible, and
2. design of the delay element to match the path delay from 'A' to 'B'.
The structuring of the control logic affects the ease with which we can find a suitable 'A' gate, and design of the delay element, while possible in GaAs, is probably quite error prone because of parameter variations that we have seen in fabricated GaAs systems.
Figure 9: PDLL Transparent Latch Implementations. (a) Latch with precharge; (b) latch without precharge.
III-D.2 Matched Path Technique
The Matched Path Technique is the option we prefer in practice, even though it may involve a higher area overhead than the previous option. As PDLL is a dynamic logic, we can use a path matched to the worst-case signal path from the 'A' node to the 'B' node, shown in Figure 12. The Matched A → B path mirrors almost exactly the worst-case logic path through A, and controls the evaluation of B (where B is presumed to depend on A in some manner). This eliminates the need to design the delay element (and thus the problems due to parameter and voltage/temperature fluctuations), albeit at the expense of possibly more power consumption due to a larger number of gates. We will see how this technique operates in practice when we come to the design of a real system.
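The difference between the two techniques under parameter variation can be illustrated with a toy timing check. Everything in this sketch is an assumption made for illustration: the delay figures, the 10% margin, the scale factors and the function names are hypothetical, and a real matched path will not track the logic perfectly. The point is only that a replica of the logic path scales with the logic, whereas a separately designed delay element may not.

```python
NOMINAL_PATH = 100.0  # ps, hypothetical worst-case A -> B logic delay
MARGIN = 1.1          # hypothetical 10% design margin on the control path


def evaluation_is_safe(data_path_delay, control_delay):
    """B may evaluate only once A's worst-case output is valid."""
    return control_delay >= data_path_delay


def matched_delay_control(delay_scale):
    # Separate delay element: scales with its own variation factor.
    return NOMINAL_PATH * MARGIN * delay_scale


def matched_path_control(logic_scale):
    # Replica of the logic path: assumed to scale exactly as the logic does.
    return NOMINAL_PATH * MARGIN * logic_scale


logic_scale = 1.3  # logic is 30% slower than nominal on this die
delay_scale = 1.0  # the delay element happened not to slow down

data_delay = NOMINAL_PATH * logic_scale
delay_ok = evaluation_is_safe(data_delay, matched_delay_control(delay_scale))
path_ok = evaluation_is_safe(data_delay, matched_path_control(logic_scale))
```

With these made-up numbers the matched delay enables 'B' too early, while the matched path keeps its margin because it slows down together with the logic it mirrors.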
Figure 10: Control of PDLL Gates with inverted inputs. (a) Gate with local evaluation control and global precharge control; (b) gate with local precharge control only.
Figure 11: Operation of Matched Delay Technique
Figure 12: Operation of Matched Path Technique
IV Asynchronous Dynamic Pipeline Controllers
In this section we explore a novel pipeline controller designed explicitly to take advantage of the special dynamic characteristics of PDLL and domino CMOS logic blocks. There are a number of observations to make about the operation of these dynamic logic blocks that are important in developing control architectures:
- we can precharge PDLL gates and then leave precharge turned off in preparation for an evaluation, as long as we ensure all inputs are precharged as well (i.e. low);
- once a gate output has been determined, we can take its inputs back to the precharged (i.e. low) state without affecting the output logic state of the gate.
With this in mind, the pipeline structure of Figure 13 was developed. The latch and both logic elements (which we separate into Phase-1 and Phase-2) are initially precharged, and wait for input data to arrive via the latch, which is open as select is high. When data arrives at the input latch, it propagates through to the Phase-1 logic, where it automatically commences evaluation (as the PDLL logic blocks are essentially domino blocks). The outputs of Phase-1 propagate to Phase-2, where evaluation commences as well. When the Matched Path element for Phase-1 goes high,
- the @ack in event passes through to @go phi2, and
- the @ack in event starts precharging the latch and the Phase-1 logic.
Note that this implies that Phase-2 must only use the outputs of Phase-1 in the first few logic stages of Phase-2, because they will soon become invalid due to Phase-1 precharging. We can insert dummy Phase-2 gates to hold Phase-1 or latch outputs if necessary so that Phase-2 operates correctly. Once Phase-2 completes operation, the Matched Path element for Phase-2 causes the @go phi2 event to propagate to @req out, initiating output handshaking. When output handshaking completes, the returning @ack out event sets the latch select line high and, if precharge is complete, both the latch and the Phase-1 logic re-commence evaluation. Phase-2 commences precharging, and stays in the precharge state until precharge is complete (controlled by the speed of the buffer between PrechPhase2 and the Send gate that sets PrechPhase2 low).
Figure 13: Main Pipeline Structure for Asynchronous PDLL Logic Control
We chose this structure because the precharge time is typically much less than the evaluation time in a logic block, and thus precharging Phase-1 and latch logic while Phase-2 evaluates means we can re-commence evaluation much sooner. In addition, Phase-2 should be back in evaluation mode once the GoPhi2 signal goes high, so that the pipe operates with a higher throughput. Note that the latch and Phase-1 logic are essentially domino logic stages and thus commence evaluation automatically and very quickly after the @ack out event. We will use this pipeline structure throughout the paper. However, we will consider carefully what is required to design complex logic in both Phase-1 and Phase-2, and any modifications to the control required to facilitate implementation.
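The benefit of overlapping precharge with evaluation can be estimated with a first-order model. This is a rough sketch, not a result from this report: the two cycle-time formulas, the function names and all the delay values are assumptions, and control overheads such as latch and Send gate delays are ignored.

```python
def serial_cycle(eval1, eval2, handshake, prech1, prech2):
    """Both phases precharge together, only after the output handshake."""
    return eval1 + eval2 + handshake + max(prech1, prech2)


def overlapped_cycle(eval1, eval2, handshake, prech1, prech2):
    """Phase-1 precharges while Phase-2 evaluates, and Phase-2
    precharges while Phase-1 evaluates the next data item."""
    return max(eval1, prech2) + max(eval2, prech1) + handshake


# Hypothetical delays in picoseconds; precharge is shorter than evaluation.
delays = dict(eval1=300, eval2=300, handshake=150, prech1=100, prech2=100)
t_serial = serial_cycle(**delays)
t_overlapped = overlapped_cycle(**delays)
saving = t_serial - t_overlapped
```

Under these assumptions the precharge time disappears from the cycle whenever each phase's precharge is no longer than the other phase's evaluation, which is the situation the structure of Figure 13 is designed for.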
IV-A Inputs to Phase-2 logic
V Design of a PDLL Asynchronous Multiplier
We chose a multiplier as an application of PDLL as it is a well-understood problem, it is amenable to pipelining, and it is not too complex, thus making it feasible for fast design and fabrication.
V-A Multiplier Algorithm
The design of the multiplier uses a radix-4 modified Booth recoding scheme with Wallace-tree reduction[10].
Using a 16-bit wordlength, this results in 8 17-bit partial products, to be reduced using Wallace reduction techniques, viz.
- 9 PPs to 6 (see Fig. 17)
- 6 PPs to 4 (see Fig. 18)
- 4 PPs to 3 (see Fig. 19)
- 3 PPs to 2 (see Fig. 20)
- the final two partial products are added using a fast adder, such as a CLA[10].
The reduction uses 3:2 or 4:3 compressors. Compressors take a number of bits that need to be added, and reduce them to fewer bits with shifted significances. Thus compressors can reduce the number of partial products we need to add without needing to propagate carry signals across the word; in this case the partial product tree is reduced from 9 rows to two. The function performed by the compressors is shown in Figure 15. The complete scheme is shown in Figure 16. The reduction technique, along with associated correction bits, is shown in Figures 17-20. The symbols represent the partial product bits generated by the Booth recoding scheme on the operand, and x/y signifies a particular partial product bit from a particular partial product "level" (used wherever the source of the PP bit is not obvious). The reduction technique uses 3:2 and 4:3 compressors to reduce the height of the partial product tree from 9 to 2 in 4 stages.
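The recoding and the reduction schedule above can be checked with a short sketch. This is the textbook radix-4 modified Booth recoding, not necessarily the exact variant used in this design; the function names are hypothetical, correction bits and sign extension are handled implicitly by Python integers, and the compressor used at each stage is an assumption chosen to be consistent with the heights quoted above.

```python
def booth_radix4_digits(multiplier, bits=16):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Returns bits/2 digits, least significant first, each in -2..2.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    y = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(a, b, bits=16):
    """Sum the partial products a * digit * 4**k for a signed b."""
    digits = booth_radix4_digits(b, bits)
    return sum(a * d * 4 ** k for k, d in enumerate(digits))


def reduction_heights(rows, schedule):
    """Apply (inputs, outputs) compressor steps to a row count."""
    heights = [rows]
    for n_in, n_out in schedule:
        groups, rest = divmod(heights[-1], n_in)
        heights.append(groups * n_out + rest)
    return heights


digits = booth_radix4_digits(12345)
product = booth_multiply(1234, -567)
# Assumed schedule: 3:2, 3:2, 4:3, 3:2 compressors in the four stages.
heights = reduction_heights(9, [(3, 2), (3, 2), (4, 3), (3, 2)])
```

A 16-bit multiplier yields eight Booth digits, one per partial product, and the assumed schedule reproduces the 9, 6, 4, 3, 2 height sequence.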
V-B Pipelining the Algorithm
V-C Gate Design
We will now consider the structure and design of the gates required to implement the multiplier in PDLL.
Figure 14: Input Control for Phase-2 logic. (a) Use of latch output in Phase-2 logic; (b) transformation of latch output into Phase-2 logic.
Figure 15: Compressor functions (3:2 compressor and 4:3 compressor, before and after).
Figure 16: Full Multiplier Partial Product Tree with Booth recoding and correction bits
Figure 17: First Partial Product Reduction using Wallace Trees. (a) First tree reduction; (b) result of first tree reduction.
Figure 18: Second Partial Product Reduction using Wallace Trees. (a) Second tree reduction; (b) result of second tree reduction.
Figure 19: Third Partial Product Reduction. (a) Third tree reduction; (b) result of third tree reduction.
Figure 20: Fourth Partial Product Reduction & Final Reduction. (a) Fourth tree reduction.
V-C.1 Compressors
The 3:2 and 4:3 compressors are essentially full-adders. However, implementing dynamic XOR gates that can domino is very difficult, and requires that we control the arrival of the A and B signals. Alternatively, we could directly implement the truth tables of the compressors.
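The truth tables mentioned above are small enough to enumerate. The sketch below assumes that each compressor outputs the count of ones among its equal-weight inputs as a binary number (weights 1, 2 and, for the 4:3 compressor, 4); this output coding is an assumption, as are the function names.

```python
from itertools import product


def compress(bits):
    """Count the ones in equal-weight input bits and return the count
    as output bits, least significant first."""
    total = sum(bits)
    width = len(bits).bit_length()  # 2 outputs for 3 inputs, 3 for 4
    return tuple((total >> k) & 1 for k in range(width))


def truth_table(n_inputs):
    """Map every input combination to its compressor outputs."""
    return {inputs: compress(inputs)
            for inputs in product((0, 1), repeat=n_inputs)}


table_3_2 = truth_table(3)  # 3 inputs -> (sum, carry): a full adder
table_4_3 = truth_table(4)  # 4 inputs -> 3 output bits
```

Each output column of such a table is a sum-of-products function, which may be easier to map onto a non-inverting pull-down tree than a cascade of XOR gates, although some product terms still need complemented inputs.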
V-C.2 Final Adder
The last adder that combines the two partial products should be very fast so that we obtain the sum, and thus the result, very quickly.
V-D Comments
VI ECS PDLL Integer Divider
VII Conclusion
We have described some of the basic features of ECS and PDLL that enable very high performance arithmetic systems to be constructed. More work is needed. The author apologises for the incomplete status of this report.
References
[1] Appleton, S.S., Morton, S.V., & Liebelt, M.J., "High Performance Two-Phase Asynchronous Pipelines", IEICE ED Transactions, March 1997.
[2] Appleton, S.S., Morton, S.V., & Liebelt, M.J., "Two-Phase Asynchronous Pipeline Control", submitted to Third International Asynchronous Circuits & Systems Symposium, April 1997.
[3] Lopez, J.F., Sarmiento, R., Nunez, A., & Eshraghian, K., "High Performance GaAs Pseudo Dynamic class of Logic", 1996 International ASIC Conference, 1996.
[4] Lopez, J.F., "XOR & Transparent Latch Implementation Alternatives in PDLL", Unpublished Communication, University of Adelaide, September 1996.
[5] Nary, K.R., & Long, S.I., "Two-Phase Dynamic FET Logic: An Extremely Low Power, High Speed Logic Family for GaAs VLSI", 1991 GaAs IC Symposium, pp. 83-86, 1991.
[6] Hoe, D.H.K., & Salama, C.A.T., "Dynamic GaAs Capacitively Coupled Domino Logic (CCDL)", IEEE Journal of Solid-State Circuits, 26(6):844-849, June 1991.
[7] Basmaji, Jean H., et al., "Digital's High-performance CMOS ASIC", Digital Technical Journal, Vol. 7, No. 1, 1995.
[8] Franzon, P.D., Stanaski, A., Tekmen, Y., & Banerjia, S., "System Design Optimisation for MCM", in Proc. 1995 IEEE MCM Conference.
[9] Svensson, Christer, & Yuan, Jiren, "High-Speed CMOS circuit technique", IEEE Journal of Solid-State Circuits, 27(3):382-388, March 1992.
[10] Burgess, Neil, "Fundamentals of Arithmetic Microarchitectures", to be published, 1997.