Asynchronous Pipeline Design using GaAs PDLL

Asynchronous Pipeline Design using GaAs PDLL Logic and new CMOS dynamic techniques
Sam S. Appleton, Shannon V. Morton, & Michael J. Liebelt
Internal Report: HPCA-ECS-96/04 version I, June 17, 1997

Abstract: We explore the potential for extremely high asynchronous logic performance in CMOS and GaAs dynamic logic structures. By using a new class of GaAs dynamic logic, Pseudo-Dynamic Latched Logic, we develop asynchronous control structures capable of high-speed operation. We show how these techniques can be used for CMOS design, and give performance estimates.


I Introduction

Asynchronous approaches have long been touted as the ideal replacement for synchronous designs when clock skew problems and power consumption become unmanageable. However, the lack of acceptance of asynchronous methods seems to indicate that these views are not shared by the design community as a whole. This is primarily due to the lower performance offered by asynchronous solutions in a given technology compared with that achievable by a synchronous system. In this paper, we explore the potential for exploiting very high speed operation of CMOS and GaAs dynamic logic structures using a two-phase signalling, bounded-delay engineered approach called Event Controlled Systems (ECS). We have demonstrated the superiority of ECS pipelines in the bounded-delay environment compared to other two- and four-phase approaches[1, 2]. The work in this paper was spurred by the development of Pseudo-Dynamic Latched Logic (PDLL) by Lopez[3] and his visit to Adelaide in September 1996.

II ECS and PDLL methods

In this section we briefly introduce critical ECS and PDLL terms, notation and constructs. For more exhaustive details, the reader is referred elsewhere[1, 2, 3].

II-A ECS

Figure 1: ECS State Pipeline. Panels: (a) Pipeline Circuit; (b) Pipeline Signal Flow Graph.

ECS is a bounded-delay, two-phase signalling approach for asynchronous design developed at the University of Adelaide. We will not explain operators or gates here and instead introduce the fundamental ECS pipeline structure, the state pipeline, shown in Figure 1. The control flow of the pipeline structure is illustrated in Figure 1(b). The send element is a transparent, initialisable latch. The UNTIL element is implemented as a standard XOR element. The output drive ability of the XOR element must be optimised in order to drive the latch line Lt in a reasonable time. Timing constraints[2] apply, but are minimal and "highly meet-able". Fig. 2 shows the operating waveforms for a 3-stage state pipeline with 8 latches connected to each stage. The gates were completely unoptimised but the system still operates at a rate of approximately 1.5GHz. The send and XOR gates thus represent the most necessary control path elements for implementation of a system controlled using ECS. We have demonstrated the efficiency of using such a structure[1].

Figure 2: DCFL Waveforms

The pipeline can readily be extended to the case where we do not wish to use a delay model for the computation; although delay models are fine in CMOS, they can be quite undesirable in GaAs. The delay element is changed to a send, which is controlled by the ComputationDone signal, as shown in Figure 3. Note that there is a performance bottleneck here: if there is zero margin between @req in and DataIn, then data arrives at the computation inputs after Tpl if the latch line Lt is high. However, the control signal initiating the computation, StartComp, does not arrive until much later, creating a "logic dead-time" of

$$t_{dead} = T_{send} + T_{until} + T_{\uparrow StartComp} - T_{pl}$$

as the logic cannot begin evaluation until the StartComp signal goes high. Later we will investigate techniques for eliminating this logic dead-time. Note that the addition of margin between @req out and DataIn only worsens this situation.

Figure 3: State Pipe with Completion Control

If we are using dynamic logic this penalty can be alleviated to a certain extent because we can overlap the input latching action with the precharge of the stage logic. However, there is one penalty which creates dead-time which cannot be removed in an asynchronous approach: the output handshake. The problem is illustrated in Figure 4. When a request, @req out, is generated it must go through the next stage's handshake logic before returning the @ack out signal, and then the input latch must be opened. This presents a minimum logic dead-time for the stage, since we cannot even perform precharge during this interval; we must hold the output of this stage at a steady value until it has been latched (indicated by the @ack out signal).

Figure 4: Output Handshake Action
[Figure annotations: the stage that has data ready is sending it to the next stage, which is receiving data. The controller's internal handshake action must complete in the receiving stage before the sending stage's latch line can reset, allowing new data to enter and a new request to begin processing.]

II-B Fast ECS Pipelines

We can exploit the generation of the ComputationDone signal, shown in Fig. 3, in the control of the subsequent stage by using the Morton Precharge Pipeline (MPP), shown in Figure 5. In each MPP stage, the completion signal (ComputationDone is here called SetOut) is forwarded to the subsequent stage. The arriving completion signal, called SetIn, causes evaluation to commence if the stage is in the free state (i.e. Lt is high) and is precharged. Concurrently, the @req in arrives to control the latching of the data, but has no impact on the time to commence evaluation. When the stage's SetOut signal goes high, the next stage begins evaluation while the current stage completes output handshaking. The arrival of the @ack out event causes precharge to commence, and also sets Lt high to allow new data into the stage. This is safe as the result of the previous computation has been passed out, and the stage is presently precharging. When precharge completes and SetIn is high, the stage re-commences evaluation.

We modified the existing MPP structure[1] to exploit the parallelism of the input handshake and precharge action; the result is named the modified MPP. The input handshake (and thus the latching action) is now concurrent with the precharge phase, with no control overhead. The timing of one modified MPP stage is shown in Fig. 5(b). Note that the pipelines we have discussed so far are general pipelines that expect the dynamic logic to have an evaluation signal. Many of the logic structures we will examine in this paper have no inherent evaluation signal and we will describe new control architectures to take advantage of these logic families.

Figure 5: Morton Precharged Pipeline. Panels: (a) Circuit; (b) Timing Diagram.

II-C PDLL GaAs Logic Family

The PDLL (Pseudo-Dynamic Latched Logic) logic style[3] is a new GaAs logic style for implementing pseudo-dynamic logic functions in GaAs technology. Implementing dynamic logic functions in GaAs is quite difficult because of the high gate leakage current caused by the Schottky diodes formed at the MESFET gate junction. Other approaches to GaAs dynamic logic have been proposed[5, 6], but PDLL gives better performance when considering the combination of power, speed, and area[3]. In addition, the design of PDLL gates is considerably easier, and requires no special elements, such as capacitors, to be implemented. The basic form of a PDLL gate is shown in Figure 6(a). The logic family possesses a great deal of similarity to CMOS domino logic, shown in Figure 6(b). Note the addition of the weak inverter to the GaAs PDLL logic: this gate replaces charge lost through gate leakage when the node A is floating, but provides no impediment to the switching characteristics of the node A during node switching. Note that the inputs to both styles of gates must
• go valid during precharge, or
• be zero during precharge and only make a 0 → 1 transition when precharge is off.

Thus many techniques we develop for dealing with the structure of GaAs PDLL logic are directly applicable to the design of asynchronous domino CMOS structures also. However, in this paper we will mainly be concerned with GaAs PDLL structures. Note that in both logic structures of Fig. 6 we can create arbitrary pull-down trees between the node A and ground. However, the logic style is inherently non-inverting since we cannot usually use inverted signals as inputs to a dynamic gate. This creates some difficulties in creating XOR type structures, which we will examine later in this paper.
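The input rule for precharged gates can be illustrated with a small behavioural model. The sketch below is not taken from the report: the class `DominoGate`, its methods, and the AND pull-down function are hypothetical names chosen for illustration, and the model ignores leakage, the weak inverter, and all analogue timing. It only captures the fact that node A, once discharged during evaluation, stays discharged until the next precharge.

```python
class DominoGate:
    """Behavioural sketch of a non-inverting precharged gate (assumed model)."""

    def __init__(self, pull_down):
        # pull_down(*inputs) returns True when the pull-down tree conducts.
        self.pull_down = pull_down
        self.node_a = 1  # dynamic node A starts precharged high

    def precharge(self):
        # Recharge node A; the gate output returns to 0.
        self.node_a = 1

    @property
    def output(self):
        # The output driver inverts node A, so the gate is non-inverting overall.
        return 1 - self.node_a

    def evaluate(self, *inputs):
        # Node A can only be discharged here; nothing recharges it until precharge().
        if self.pull_down(*inputs):
            self.node_a = 0
        return self.output


and_gate = DominoGate(lambda a, b: bool(a and b))
and_gate.precharge()
first = and_gate.evaluate(1, 1)   # inputs rise 0 -> 1, node A discharges
held = and_gate.evaluate(0, 0)    # inputs return low, output is held
```

In this model an input that is high too early and then falls leaves the output stuck at 1 until the next precharge, which is why inputs must either be valid during precharge or make only a 0 → 1 transition afterwards.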

Figure 6: Dynamic GaAs PDLL and CMOS Gates. Panels: (a) PDLL Basic Gate; (b) CMOS Domino Basic Gate.

III PDLL Gates Structures

The design of the various control and datapath gates required for implementation of asynchronous units in PDLL needs to be detailed. The most critical elements to obtain are
• a fast Send element (basically an initialisable latch),
• a fast XOR element, and
• a transparent latch structure.

Other gates that we might require are developed as they are needed.

III-A Send Gate

The Send gate is really an initialisable latch, and is required to build almost any ECS pipeline. Therefore, we need a fast design so that the effect of the control on overall cycle time is minimised. One approach would be to use a series of DCFL gates, giving high speed, but then the gate may not operate correctly at lower voltages (PDLL logic circuits can operate at a lower voltage than DCFL gates).

Figure 7: GaAs Send Gate Implementations. Panels: (a) Send Gate in DCFL Logic; (b) Send Gate in PDLL.

III-B XOR Gate

The XOR element is mostly used to do two-to-four phase conversions to control logic path elements, and is extensively used in control structures. Some proposed designs[4] are shown in Fig. 8.

Figure 8: XOR gates in PDLL GaAs. Panels: (a) XOR Solution # 1; (b) XOR Solution # 2.

III-C Transparent Latch

Transparent latches are faster than their edge-clocked counterparts, and ECS makes considerable use of transparent latches in pipeline structures like that of Fig. 1.

There are two latch solutions that we shall explore, shown in Figure 9. Fig. 9(b) is appropriate in situations where a normal transparent latch structure is required. However, because PDLL is a non-inverting dynamic logic, deactivating (zeroing) the inputs to PDLL gates allows precharge to begin sooner. Thus, precharging the latch as well means we can precharge sooner and reduce power consumption slightly (since the inputs to the appropriate logic blocks are all zeroed). This precharged latch structure is shown in Fig. 9(a); note that the cost of this latch is also lower than that of the non-precharged version. There may be some slightly higher control costs as we must now generate precharge and latch signals, as opposed to just latch and complementary latch signals. Note also that we can integrate PDLL functions into the latch structure of Fig. 9 relatively easily, since there is only one extra transistor which determines when to propagate the input to the output (when precharge is not active and Enable goes high).

III-D Logic Structures & Inversions

Since PDLL is an inherently non-inverting logic family, if we wish to use some inverting functions in a single logic stage we must either
1. ensure that the inverted signal to the gate becomes valid some time margin before other gate inputs are determined (i.e. before they can make a 0 → 1 transition),
2. stop the evaluation of the stage which uses the inverted signal until we know all inputs (including the inversion) are valid, or
3. separate the logic by a stage-boundary.

Option #1 is only possible in certain, restricted logic structures and is not a totally general solution, although we can convert logic structures to this form if required. Option #3 is not desirable since it increases the number of stages, and thus stage control logic and latches, every time we need some gate output inverted as the input for another gate. Option #2 is the solution that is most general and desirable: we stop the evaluation of the logic gate in question until the inputs are all determined. Note that, instead of having an evaluate transistor, we could equally continue to activate the gate's precharge signal until all inputs are determined (saving a transistor), although the gate's precharge signal is then no longer under control of the stage's control logic (and this may not be such a bad thing, as we will see). The two possibilities are shown in Figure 10. Thus the gate can use inverted inputs as long as we do not evaluate the gate output until all the inputs are properly determined. There are two techniques we consider for generating the local control signal, Matched Delay and Matched Path.

III-D.1 Matched Delay Technique

The Matched Delay Technique evaluates the relevant gate(s) after a specified delay from the evaluation of a previous gate. We make this delay element asymmetric so that the second gate enters precharge slightly after the first gate enters precharge. The concept is shown in Figure 11. This involves
1. identification of a suitable `A' gate to control the timing of the `B' gate's evaluation/precharge signal, if possible, and
2. design of the delay element to match the path delay from `A' to `B'.

The structuring of the control logic affects the ease with which we can find a suitable `A' gate. Design of the delay element, while possible in GaAs, is probably quite error prone because of the parameter variations that we have seen in fabricated GaAs systems.
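As a rough illustration of the asymmetric delay element, the sketch below derives an EvalB waveform from an EvalA waveform by delaying rising edges by the matched A-to-B delay and falling edges by a much smaller amount. The function name, the edge-list representation, and the picosecond values are assumptions made for this example only; it also assumes every pulse is longer than both delays, so edges never reorder.

```python
def asymmetric_delay(edges, rise_delay, fall_delay):
    """Delay rising and falling edges of a signal by different amounts.

    edges is a time-sorted list of (time, value) pairs with value 0 or 1.
    Rising edges (value 1) are shifted by rise_delay and falling edges
    (value 0) by fall_delay. Times are integers in hypothetical picoseconds.
    """
    return [(t + (rise_delay if v == 1 else fall_delay), v) for t, v in edges]


# EvalA rises at t = 0 and falls (gate A enters precharge) at t = 500.
eval_a = [(0, 1), (500, 0)]
# Gate B evaluates 120 ps after A, but enters precharge only 10 ps after A.
eval_b = asymmetric_delay(eval_a, rise_delay=120, fall_delay=10)
```

The long rise delay is what has to track the A-to-B logic path, which is the part the text identifies as sensitive to parameter variation.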

Figure 9: PDLL Transparent Latch Implementations. Panels: (a) Latch with precharge; (b) Latch without precharge.

III-D.2 Matched Path Technique

The Matched Path Technique is the option we prefer in practice, even though it may involve a higher area overhead than the Matched Delay Technique. As PDLL is a dynamic logic, we can use a path matched to the worst-case signal path from the `A' node to the `B' node, shown in Figure 12. The Matched A→B path mirrors almost exactly the worst-case logic path through A, and controls the evaluation of B (where B is presumed to depend on A in some manner). This eliminates the need to design the delay element (and thus the problems due to parameter and voltage/temperature fluctuations), albeit at the expense of possibly more power consumption due to a larger number of gates. We will see how this technique operates in practice when we come to the design of a real system.
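Choosing which gates the matched path must mirror amounts to finding the worst-case (longest) path from `A' to `B'. The sketch below shows one way this could be computed on a gate-level graph; the graph encoding, gate names, and delay values are hypothetical and are not taken from the report. The delay of `B' itself is excluded, since the matched path only has to cover the time until B's inputs are valid.

```python
from functools import lru_cache


def worst_case_delay(graph, delays, src, dst):
    """Longest accumulated gate delay from src up to (not including) dst.

    graph maps each gate name to the gates it drives and must be acyclic.
    delays maps each gate name to its propagation delay.
    Returns None if dst cannot be reached from src.
    """

    @lru_cache(maxsize=None)
    def longest(node):
        if node == dst:
            return 0
        best = None
        for nxt in graph.get(node, ()):
            sub = longest(nxt)
            if sub is not None and (best is None or sub > best):
                best = sub
        return None if best is None else delays[node] + best

    return longest(src)


# Hypothetical logic between gate A and gate B, delays in picoseconds.
graph = {"A": ["g1", "g2"], "g1": ["g3"], "g2": ["B"], "g3": ["B"]}
delays = {"A": 30, "g1": 40, "g2": 25, "g3": 35, "B": 30}
matched = worst_case_delay(graph, delays, "A", "B")  # path A -> g1 -> g3
```

A matched path built from copies of A, g1 and g3 would then track the same process, voltage and temperature variations as the logic it guards.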

Figure 10: Control of PDLL Gates with inverted inputs. Panels: (a) Gate with local Evaluation Control & Global Precharge Control; (b) Gate with local Precharge Control only.

Figure 11: Operation of Matched Delay Technique

Figure 12: Operation of Matched Path Technique

IV Asynchronous Dynamic Pipeline Controllers

In this section we explore a novel pipeline controller designed explicitly to take advantage of the special dynamic characteristics of PDLL and domino CMOS logic blocks. There are a number of observations to make about the operation of these dynamic logic blocks that are important in developing control architectures:
• we can precharge PDLL gates and then leave precharge turned off in preparation for an evaluation, as long as we ensure all inputs are precharged as well (i.e. low),
• once a gate output has been determined, we can take its inputs back to the precharged (i.e. low) state without affecting the output logic state of the gate.

With this in mind, the pipeline structure of Figure 13 was developed. The latch and both logic elements (which we separate into Phase-1 and Phase-2) are initially precharged, and wait for input data to arrive via the latch, which is open as select is high. When data arrives at the input latch, it propagates through to the Phase-1 logic where it automatically commences evaluation (as the PDLL logic blocks are essentially domino blocks). The outputs of Phase-1 propagate to Phase-2 where evaluation commences as well. When the Matched Path element for Phase-1 goes high,
• the @ack in event passes through to @go phi2, and
• the @ack in event starts precharging the latch and the Phase-1 logic.

Note that this implies that Phase-2 must only use the outputs of Phase-1 in the first few logic stages of Phase-2, because they will soon become invalid due to Phase-1 precharging. We can insert dummy Phase-2 gates to hold Phase-1 or latch outputs if necessary so that Phase-2 operates correctly. Once Phase-2 completes operation, the Matched Path element for Phase-2 causes the @go phi2 event to propagate to @req out, initiating output handshaking. When output handshaking completes,
• the returning @ack out event sets the latch select line high, and if precharge is complete, both the latch and the Phase-1 logic will re-commence evaluation, and
• Phase-2 commences precharging, and stays in the precharge state until precharge is complete (controlled by the speed of the buffer between PrechPhase2 and the Send gate that sets PrechPhase2 low).

Figure 13: Main Pipeline Structure for Asynchronous PDLL Logic Control

We chose this structure because the precharge time is typically much less than the evaluation time in a logic block, and thus precharging Phase-1 and latch logic while Phase-2 evaluates means we can re-commence evaluation much sooner. In addition, Phase-2 should be back in evaluation mode once the GoPhi2 signal goes high, so that the pipe operates with a higher throughput. Note that the latch and Phase-1 logic are essentially domino logic stages and thus commence evaluation automatically and very quickly after the @ack out event. We will use this pipeline structure throughout the paper. However, we will consider carefully what is required to design complex logic in both Phase-1 and Phase-2, and any modifications to the control required to facilitate implementation.
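The throughput argument can be made concrete with a first-order timing model. The two functions below are an assumed simplification, not a result from the report: they treat each phase as a single evaluation delay and a single precharge delay, lump all handshaking into one term, and ignore control-gate delays. All parameter names and the picosecond values are hypothetical.

```python
def overlapped_cycle(t_eval1, t_eval2, t_pre1, t_pre2, t_hs):
    """Cycle time when precharge is hidden behind the other phase's evaluation.

    Phase-1 and latch precharge (t_pre1) overlaps Phase-2 evaluation (t_eval2);
    Phase-2 precharge (t_pre2) overlaps the next Phase-1 evaluation (t_eval1).
    """
    return max(t_eval1, t_pre2) + max(t_eval2, t_pre1) + t_hs


def sequential_cycle(t_eval1, t_eval2, t_pre1, t_pre2, t_hs):
    """Baseline: evaluate both phases, handshake, then precharge everything."""
    return t_eval1 + t_eval2 + t_hs + max(t_pre1, t_pre2)


# Hypothetical delays in picoseconds, with precharge shorter than evaluation.
params = dict(t_eval1=300, t_eval2=400, t_pre1=100, t_pre2=120, t_hs=150)
fast = overlapped_cycle(**params)
slow = sequential_cycle(**params)
```

Under these assumptions the precharge terms vanish from the overlapped cycle whenever each precharge is shorter than the evaluation it hides behind, which is the condition the text relies on.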

IV-A Inputs to Phase-2 logic

V Design of a PDLL Asynchronous Multiplier

We chose a multiplier as an application of PDLL as
• it is a well-understood problem,
• it is amenable to pipelining, and
• it is not too complex, thus making it feasible for fast design and fabrication.

V-A Multiplier Algorithm

The design of the multiplier uses a radix-4 modified Booth recoding scheme with Wallace-tree reduction[10].

Using a 16-bit wordlength, this results in 8 17-bit partial products which, together with the Booth correction bits, form a tree of height 9. The tree is reduced using Wallace reduction techniques, viz.
• 9 PPs to 6 (see Fig. 17)
• 6 PPs to 4 (see Fig. 18)
• 4 PPs to 3 (see Fig. 19)
• 3 PPs to 2 (see Fig. 20)

The final two partial products are added using a fast adder, such as a CLA[10]. The reduction uses 3:2 or 4:3 compressors; compressors take a number of bits that need to be added, and reduce them to fewer bits with shifted significances. Thus compressors can reduce the number of partial products we need to add without needing to propagate carry signals across the word; in this case we reduce the nine rows to two. The function performed by the compressors is shown in Figure 15. The complete scheme is shown in Figure 16. The reduction technique, along with associated correction bits, is shown in Figures 17-20. The • symbols represent the partial product bits generated by the Booth recoding scheme on the operand, and x/y signifies a particular partial product bit from a particular partial product "level" (used wherever the source of the PP bit is not obvious). The reduction technique uses 3:2 and 4:3 compressors to reduce the height of the partial product tree from 9 to 2 in 4 stages.
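The arithmetic behind the scheme can be checked with a short software sketch. The code below uses the textbook radix-4 modified Booth recoding (digit = y[2i] + y[2i-1] - 2·y[2i+1]) and a simple row-count rule for 3:2 compression; it is an illustration under those assumptions, not the authors' gate-level design, and all function names are hypothetical. It models digit values only, not the sign-extension or correction-bit layout of Figure 16.

```python
def booth_radix4_digits(y, width=16):
    """Radix-4 modified Booth digits of a two's-complement value, LSD first.

    Each digit lies in {-2, -1, 0, 1, 2}; a width-bit operand gives width/2 digits.
    """
    assert width % 2 == 0
    assert -(1 << (width - 1)) <= y < (1 << (width - 1))
    # Python's >> is arithmetic, so this yields two's-complement bits.
    bits = [(y >> i) & 1 for i in range(width)]
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, width, 2):
        digits.append(bits[i] + prev - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def booth_multiply(x, y, width=16):
    """Sum the Booth partial products d_i * x * 4**i."""
    digits = booth_radix4_digits(y, width)
    return sum((d * x) << (2 * i) for i, d in enumerate(digits))


def wallace_heights(rows):
    """Row counts after each level when every group of 3 rows becomes 2."""
    heights = [rows]
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3
        heights.append(rows)
    return heights


digits = booth_radix4_digits(12345)
product = booth_multiply(-123, 456)
levels = wallace_heights(9)
```

For a 16-bit operand the recoding gives 8 digits, matching the 8 partial products in the text, and the 3:2 row rule alone reproduces the 9, 6, 4, 3, 2 sequence of Figures 17-20.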

V-B Pipelining the Algorithm

V-C Gate Design

We will now consider the structure and design of the gates required to implement the multiplier in PDLL.

Figure 14: Input Control for Phase-2 logic. Panels: (a) Use of latch output in Phase-2 logic; (b) Transformation of latch output into Phase-2 logic.

Figure 15: Compressor functions (3:2 Compressor and 4:3 Compressor, before and after).

Figure 16: Full Multiplier Partial Product Tree with Booth recoding and correction bits

Figure 17: First Partial Product Reduction using Wallace Trees. Panels: (a) First Tree Reduction; (b) Result of First Tree Reduction.

Figure 18: Second Partial Product Reduction using Wallace Trees. Panels: (a) Second Tree Reduction; (b) Result of Second Tree Reduction.

Figure 19: Third Partial Product Reduction. Panels: (a) Third Tree Reduction; (b) Result of Third Tree Reduction.

Figure 20: Fourth Partial Product Reduction & Final Reduction. Panel: (a) Fourth Tree Reduction.

V-C.1 Compressors

The 3:2 and 4:3 compressors are essentially full-adders. However, implementing dynamic XOR gates that can domino is very difficult, and requires that we control the arrival of the A and B signals. Alternatively, we could directly implement the truth tables of the compressors.
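One way to see why the sum output is the awkward part is to test each compressor output for monotonicity: a non-inverting dynamic gate can only realise functions whose output never falls when an input rises. The sketch below checks this for a 3:2 compressor described by its truth table. The helper names are hypothetical and the check is a software illustration, not a circuit from the report.

```python
from itertools import product


def is_monotone(f, n):
    """True if raising any single input from 0 to 1 never lowers f's output."""
    for bits in product((0, 1), repeat=n):
        for i in range(n):
            if bits[i] == 0:
                raised = bits[:i] + (1,) + bits[i + 1:]
                if f(*bits) > f(*raised):
                    return False
    return True


def carry_bit(a, b, c):
    # Majority function: the weight-2 output of a 3:2 compressor.
    return int(a + b + c >= 2)


def sum_bit(a, b, c):
    # Parity function (three-input XOR): the weight-1 output.
    return (a + b + c) % 2


carry_ok = is_monotone(carry_bit, 3)
sum_ok = is_monotone(sum_bit, 3)
```

In this check the carry output is monotone and so maps directly onto a pull-down tree, while the sum output is not, which is consistent with the need for inverted inputs or controlled arrival times noted above.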

V-C.2 Final Adder

The last adder that combines the two partial products should be very fast so that we obtain the sum, and thus the result, very quickly.
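For reference, the carry-lookahead idea mentioned earlier as a candidate for this adder can be sketched in software. The function below uses the generic textbook formulation, in which every carry is written directly in terms of generate and propagate signals and the carry-in, rather than in terms of the previous carry. It is an assumed illustration for unsigned operands, with a hypothetical function name, and says nothing about how the adder would be partitioned or built in PDLL.

```python
def cla_add(a, b, width=32, carry_in=0):
    """Unsigned carry-lookahead addition; returns (sum mod 2**width, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    carries = [carry_in]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]..p[1]g[0] | p[i]..p[0]c0
        c = 0
        run = 1  # running product p[i] & p[i-1] & ... & p[j+1]
        for j in range(i, -1, -1):
            c |= run & g[j]
            run &= p[j]
        c |= run & carry_in
        carries.append(c)
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]


result, carry_out = cla_add(1234, 4321, width=16)
```

The inner loop makes explicit that each carry depends only on the operand bits and the carry-in, which is what removes the ripple dependency in hardware.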

V-D Comments

VI ECS PDLL Integer Divider

VII Conclusion

We have described some of the basic features of ECS and PDLL that enable very high performance arithmetic systems to be constructed. More work is needed. The author apologises for the incomplete status of this report.

References

[1] Appleton, S.S., Morton, S.V., & Liebelt, M.J., "High Performance Two-Phase Asynchronous Pipelines", IEICE ED Transactions, March 1997.
[2] Appleton, S.S., Morton, S.V., & Liebelt, M.J., "Two-Phase Asynchronous Pipeline Control", submitted to Third International Asynchronous Circuits & Systems Symposium, April 1997.
[3] Lopez, J.F., Sarmiento, R., Nunez, A., & Eshraghian, K., "High Performance GaAs Pseudo Dynamic class of Logic", 1996 International ASIC Conference, 1996.
[4] Lopez, J.F., "XOR & Transparent Latch Implementation Alternatives in PDLL", Unpublished Communication, University of Adelaide, September 1996.
[5] Nary, K.R., & Long, S.I., "Two-Phase Dynamic FET Logic: An Extremely Low Power, High Speed Logic Family for GaAs VLSI", 1991 GaAs IC Symposium, pp. 83-86, 1991.
[6] Hoe, D.H.K., & Salama, C.A.T., "Dynamic GaAs Capacitively Coupled Domino Logic (CCDL)", IEEE Journal of Solid-State Circuits, 26(6):844-849, June 1991.
[7] Basmaji, Jean H., et al., "Digital's High-performance CMOS ASIC", Digital Technical Journal, Vol. 7, No. 1, 1995.
[8] Franzon, P.D., Stanaski, A., Tekmen, Y., & Banerjia, S., "System Design Optimisation for MCM", in Proc. 1995 IEEE MCM Conference.
[9] Svensson, Christer, & Yuan, Jiren, "High-Speed CMOS circuit technique", IEEE Journal of Solid-State Circuits, 27(3):382-388, March 1992.
[10] Burgess, Neil, "Fundamentals of Arithmetic Microarchitectures", to be published, 1997.