Average-Case Optimized Transistor-Level Technology Mapping of Extended Burst-Mode Circuits∗

Kevin W. James and Kenneth Y. Yun

Electrical and Computer Engineering Department, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0407 USA
{kwjames,kyy}@UCSD.EDU

Abstract

We describe an automated method (3D-map) for determining near-optimal decomposed generalized C-element (gC) implementations of extended burst-mode asynchronous controllers. Average-case optimization is performed so that frequent paths are accelerated, possibly at the expense of less frequent paths. The overall effect, as quantified using Elmore delay analysis, is a circuit that has near-optimal performance for the average or common case.

1 Introduction

Asynchronous circuits are inherently event-driven. In other words, they react to changes in the environment as they occur; they do not need to wait for a clock tick. Therefore, it makes little sense to measure the performance of asynchronous circuits in the worst-case scenario. Instead, it should be measured for the average case.

In this paper, we discuss an automated method (3D-map) that maps synthesized logic for extended burst-mode asynchronous controllers [8] into average-case latency optimized generalized C-element (gC) implementations. 3D-map starts with a gC implementation [9] whose set and reset functions are described in hazard-free minimal sum-of-products form, selects a signal ordering that minimizes average-case latency, creates a representation that shares common subexpressions of products, and decomposes the gC elements using static gates, if necessary or advantageous.

This work was motivated by the Intel Asynchronous Instruction Length Decoder project, in which many asynchronous controllers were manually translated from synthesized logic equations to transistor-level netlists. Furthermore, no automated tools were available to assist the designers with optimizing the design for the average case. An ensuing research effort to close this gap led to precursors of this work by Beerel et al. [2, 1], which were devoted to optimizing average-case cycle time of gate-level burst-mode controllers. In this paper, we extend the work by Beerel et al. to small-to-medium size controllers designed in custom circuits at the transistor level, a prevalent design style for controllers used in the Intel Instruction Length Decoder as well as in many other practical asynchronous designs [10, 4]. Our tool optimizes the design for average-case latency of gC implementations of extended burst-mode controllers. In many practical applications, we found that the settling times of control circuits after generating output changes are negligible relative to the response times of the circuits’ environment. Hence, we decided that latency is a more important parameter to optimize than cycle time.

Burns presented a general method for decomposing state-holding elements, such as generalized C-elements, in [3]. However, Burns’s method is limited to speed-independent circuits, and hence is not suitable for generalized fundamental-mode circuits [8] synthesized by the 3D tool. Furthermore, his approach does not filter out decompositions that may induce transient short-circuit conditions (see section 3.3.2).

3D-map uses an efficient heuristic to weed out potentially suboptimal circuits. It selects a signal ordering that optimizes the average-case delay, and removes all other orderings from consideration in determining the final circuit topology. When compatible with the selected signal ordering, common transistors are shared among several products, which reduces both the input loading and the area. In order to minimize charge sharing effects, 3D-map only allows sharing of transistors at the bottom of the stack; i.e., near ground/Vdd. Sharing transistors at the top of the stack causes the node that fans out to unshared branches to have relatively large capacitance. This may lead to a significant charge-sharing problem if shared transistors turn on without a current path to ground/Vdd.

3D-map re-maps or decomposes the gC circuit with ordered inputs, using gCs with maximum stack length of 4 and static gates with maximum fanin of 3. Elmore delay is used to evaluate the performance of each potential decomposition. 3D-map improves average-case latency by up to 22.6%, according to our experiments.

A key reason for developing 3D-map was the need for an automated means of shortening long series stacks. Long series stacks can cause a circuit to fail due to charge sharing. 3D-map finds an implementation that minimizes the average latency within a constraint that limits the stack length to an acceptable level. Even small designs are very tedious to decompose and verify by hand, and the process is error-prone. Failure to consider a large number of options is also liable to yield solutions with significantly lower performance than those chosen by 3D-map.

The rest of the paper is organized as follows. Section 2 gives an overview of extended burst-mode circuits and our average-case optimized technology mapping procedure for the gC circuits synthesized by the 3D tool. Section 3 discusses our approach to technology mapping in detail. Experimental results are presented in section 4. Section 5 contains some concluding remarks.

∗ This research is supported in part by NSF CAREER Award MIP-9625034 and by UC MICRO 96-193 and 97-220 with a matching fund from Intel Corp.

2 Overview

This section gives an overview of extended burst-mode circuits and of the average-case optimized technology mapping, using a simple example.


2.1 Extended Burst-Mode Circuits

Figure 1 describes an extended burst-mode state machine (biu-fifo2dma) with 4 inputs (ok, cntgt1, fain, dackn) and 2 outputs (frout, dreq). Signals enclosed in angle brackets, such as ⟨cntgt1⟩, represent conditional or level signals. ⟨cntgt1+⟩ and ⟨cntgt1−⟩ denote the conditional clauses “if cntgt1 is high” and “if cntgt1 is low.” Signals not enclosed in angle brackets, such as ok, fain, and dackn, are edge signals. Edge inputs ending with + or − are trigger signals; the ones ending with ∗ are directed don’t cares. If a state transition is labeled with a directed don’t care a∗, then the following state transition must be labeled with a∗, a+ or a−. A trigger signal a+ denotes a 0 → 1 transition of a if a was initially 0, and no transition at all if a was initially 1. A sequence of state transitions labeled with a∗ and terminated with a+ represents a single 0 → 1 transition of a at any point in the sequence. If a state transition is not labeled with a level signal, the signal may change freely during the transition. However, if an edge signal is not mentioned in a transition, it is not allowed to change.

Circuit Operation. Initially, or after completion of the previous output/state variable transitions, the machine waits for new input transitions to arrive. When the machine detects all of the trigger transitions, it generates new output/state variable transitions. As in the 3D implementation of extended burst-mode circuits, no fed-back output or state variable change arrives at the gate input until all of the specified edges in the output and state burst have appeared at the gate output. These conditions are met by inserting delays in the feedback paths as necessary.

GC Implementation. The 3D-gC synthesis tool produces hazard-free two-level AND-OR circuits for both set logic ($f_{set}$) and reset logic ($f_{reset}$). The N stack of the generalized C-element in Figure 2 is simply the N stack of the fully complementary complex AND-OR-NOT gate that implements $f_{set}$; the P stack of the generalized C-element is the P stack of the fully complementary complex AND-OR-NOT gate that implements $f_{reset}$.

2.2 Statistical Analysis

Figure 1: Biu-fifo2dma specification.

Figure 2: Biu-fifo2dma circuit (z0 and z1 gates not shown).

Statistical analysis plays a crucial role in selecting the average-case optimized circuit. We use Markov chain analysis to derive the relative frequency of each state transition. This information is used to order the signals in order of frequency when creating the series stacks, as well as to evaluate the optimality of the decomposed circuits. Our analysis is similar to that used in the gate-level average-case optimized technology mapping of burst-mode circuits [2]. We model the state transitions of the circuit with a stochastic process, assuming that the probability of each state transition depends only on the current state, which is the key assumption of Markov chain analysis. The long term frequency of a transition $t$, denoted $\Pi_t$, is the unique non-negative solution to the equations:

$$\Pi_t = \sum_{t' :\, \mathrm{sink}(t') = \mathrm{source}(t)} \Pi_{t'} \cdot \Pr(t) \tag{1}$$

$$\sum_{t \in \delta} \Pi_t = 1, \tag{2}$$

where $\Pr(t)$ is the conditional probability of state transition $t$ and $\delta$ is the set of state transitions in the burst-mode machine.
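Equations (1) and (2) can be solved numerically with a few lines of code. The sketch below is illustrative only: the function name `long_term_frequencies`, the dictionary encoding of the state graph, and the damped fixed-point iteration are our assumptions, not a description of how 3D-map computes the statistics. The transition graph is our reading of the ALU1 specification used in Section 2.3, with the 10%/90% branch probabilities shown there.

```python
def long_term_frequencies(pr, sweeps=2000):
    """Approximate the solution of equations (1)-(2).

    pr maps each transition (source, sink) to its conditional probability Pr(t);
    the probabilities leaving each state are assumed to sum to 1.
    """
    states = sorted({s for t in pr for s in t})
    # occupancy[s] stands for the sum of Pi_t' over transitions t' whose sink is s.
    occupancy = {s: 1.0 / len(states) for s in states}
    for _ in range(sweeps):
        # Averaging the old and the propagated vector damps oscillation on
        # cyclic graphs without changing the fixed point.
        nxt = {s: 0.5 * occupancy[s] for s in states}
        for (src, dst), p in pr.items():
            nxt[dst] += 0.5 * occupancy[src] * p
        occupancy = nxt
    # Equation (1): Pi_t = (frequency of entering source(t)) * Pr(t).
    return {t: occupancy[t[0]] * p for t, p in pr.items()}


# ALU1 state graph as we read it from the specification (an assumption).
alu1 = {
    (0, 1): 1.0, (1, 2): 1.0,
    (2, 0): 0.1, (2, 3): 0.9,
    (3, 4): 1.0, (4, 5): 1.0, (5, 6): 1.0,
    (6, 0): 0.1, (6, 3): 0.9,
}
freq = long_term_frequencies(alu1)
print(round(freq[(6, 3)], 3), round(freq[(6, 0)], 3))
```

Under these assumptions the sketch gives about 0.208 for the transition from state 6 to state 3 and about 0.023 for the transition from state 6 to state 0, close to the 0.206 and 0.024 quoted in Section 2.3.1; the small differences may be due to rounding in the figure or to our reading of the graph.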

2.3 Technology Mapping Example

We present an overview of the technology mapping procedure, 3D-map, using an example. Figure 3 shows a burst-mode specification ALU1 with conditional probabilities annotated. For example, the conditional probability $\Pr(t_{2\to0})$ of the environment steering the circuit to make a transition from state 2 to state 0 is 10%, whereas the conditional probability of state transition $t_{2\to3}$, $\Pr(t_{2\to3})$, is 90%. We describe the method by showing how a circuit that implements LU− is derived. The process is as follows: (1) the state machine is annotated with long term frequencies as shown in Figure 4; (2) the optimal signal ordering is determined based on the relative frequencies of the trigger signals that enable LU− (highlighted in the figure); (3) a baseline circuit (without decomposition) is created based on the signal ordering; (4) the weighted Elmore delay of every legal decomposition of the baseline circuit (with the same signal ordering) is computed, in order to select the circuit with the optimal average-case delay.

Figure 3: ALU1 specification with conditional probabilities annotated.

2.3.1 Computing Transition Statistics

Markov chain analysis is used to determine long term frequencies (annotated inside the parentheses). For example, the long term frequency, $\Pi_{t_{6\to0}}$, of transition $t_{6\to0}$ from state 6 to state 0 is 0.024 and that of $t_{6\to3}$ is 0.206. The next step is to select all transitions that enable LU−, ($t_{6\to3}$, $t_{6\to0}$), and normalize the long term frequencies of these transitions: $\hat{\Pi}_{t_{6\to3}} = 0.206 / 0.23 = 0.9$ and $\hat{\Pi}_{t_{6\to0}} = 0.024 / 0.23 = 0.1$. These normalized values are used to weight the delay values obtained in the following analysis.

2.3.2 Signal Ordering

Once the normalized long term frequencies of the transitions involving LU− are computed, we find all trigger signals that enable LU−, (start, EvDone, M1A). For each trigger signal in $t_{6\to3}$ and $t_{6\to0}$, we sum the long term frequencies of the transitions in which the trigger signal appears.

start:   $\Pi_{t_{6\to0}}$ = 0.024
EvDone:  $\Pi_{t_{6\to3}} + \Pi_{t_{6\to0}}$ = 0.23
M1A:     $\Pi_{t_{6\to3}}$ = 0.206
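The normalization of Section 2.3.1 and the per-signal sums above are easy to reproduce mechanically. The following is a minimal sketch, not 3D-map's implementation: the dictionary layout and the function names `normalized_frequencies`, `trigger_credits`, and `signal_order` are hypothetical. Sorting by descending sum reflects the rule that signals with larger sums are placed closer to the output; how ties are broken is not stated in the text, so the sketch simply keeps insertion order.

```python
from collections import defaultdict

# Long term frequencies and trigger signals of the two transitions that
# enable LU-, as quoted in Sections 2.3.1 and 2.3.2.
lu_minus_transitions = {
    "t6->3": {"freq": 0.206, "triggers": ["EvDone", "M1A"]},
    "t6->0": {"freq": 0.024, "triggers": ["EvDone", "start"]},
}


def normalized_frequencies(transitions):
    """Scale the frequencies of the selected transitions so they sum to 1."""
    total = sum(t["freq"] for t in transitions.values())
    return {name: t["freq"] / total for name, t in transitions.items()}


def trigger_credits(transitions):
    """Credit each trigger signal with the frequency of every transition it appears in."""
    credits = defaultdict(float)
    for t in transitions.values():
        for signal in t["triggers"]:
            credits[signal] += t["freq"]
    return dict(credits)


def signal_order(transitions):
    """Order trigger signals from closest to the output to farthest from it."""
    credits = trigger_credits(transitions)
    return sorted(credits, key=lambda s: credits[s], reverse=True)


print(normalized_frequencies(lu_minus_transitions))
print(signal_order(lu_minus_transitions))
```

For the ALU1 numbers this yields the ordering EvDone, M1A, start, and normalized frequencies of roughly 0.9 and 0.1.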

These sums are compared to determine the optimal trigger signal ordering: signals with larger sums are placed closer to the output. In this example, the trigger signals are ordered (EvDone, M1A, start). Below these signals in the stacks are trigger signals that enable LU+ but not LU−, followed by all other signals (e.g. level signals) used in the circuit (there are none in this case).

Figure 4: ALU1 specification with long term frequencies annotated.

2.3.3 Creating the First Cut

The first cut of the circuit is created by mapping the sum of products to a series/parallel network, attempting to share the transistors starting at the bottom of the stack (closest to ground or Vdd) while maintaining the trigger signal ordering determined above. For example, abd + bcd = (ab + bc)d, for signal ordering (a, b, c, d). The circuit for LU− is shown in Figure 5. Note that transistors are sized so that the effective resistance (or drive) of each stack is equal to that of an nMOS transistor with W/L = 4λ : 2λ.

Figure 5: The first cut of ALU1 LU circuit: C1 = C2 = 19.2 fF, Cout = 63.6 fF; W and R of all pMOS transistors are 16λ and 2.5 kΩ respectively.

2.3.4 Decomposition

We then explore all possible decompositions of the first cut, limiting the longest stack size to 4 and static gate fanins to 3. One possible decomposition is depicted in Figure 6. In order to determine the best decomposition, we compute delay values for all transitions that cause LU−. If there is more than one trigger signal per transition, we compute the delay for each trigger signal, assuming that it is the last to switch, and use the mean of the delays.

Figure 6: Decomposed ALU1 LU circuit: C1 = C2 = 19.2 fF, C3 = 9.6 fF, Cout = 58.8 fF; resistance of pMOS transistors with W = 16λ, pMOS with W = 8λ, and nMOS with W = 8λ are 2.5 kΩ, 5 kΩ, and 2.5 kΩ respectively.


$t_{6\to3}$: Trigger signals for this transition are EvDone and M1A. The delay from EvDone− to x+ is the time it takes to charge $C_{out}$ through $M_1$ and $M_2$, which is 2 × 2.5 × 58.8 = 294 ps. The delay from M1A− to x+ is the sum of the time it takes to charge $C_1$ through $M_1$ and the time it takes to charge $C_{out}$ through $M_1$ and $M_2$, which amounts to 2.5 × 19.2 + 5 × 58.8 = 342 ps. Since we do not know which trigger signal actually switches last, we average the two delays: (294 + 342) / 2 = 318 ps.

$t_{6\to0}$: Trigger signals for this transition are EvDone and start. The delay from EvDone+ to x+ is the sum of the static NAND gate delay and the delay to charge $C_{out}$ through $M_3$, which is 5 × 19.2 + 5 × 58.8 = 390 ps. The delay from start+ to x+ is slightly higher because start is placed farther away from the output of the NAND gate: (2.5 × 9.6 + 5 × 19.2) + 5 × 58.8 = 414 ps. Again, since we do not know which trigger signal actually switches last, we average the two delays: (390 + 414) / 2 = 402 ps.
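The arithmetic for both transitions, together with the frequency weighting applied in Section 2.3.5, can be checked with a short script. This is only a sketch: the helper `rc_sum` and the data layout are hypothetical, and the (R, C) pairs are copied from the products written out in the text rather than derived from a netlist. Resistances are in kΩ and capacitances in fF, so each product is in ps.

```python
def rc_sum(terms):
    """Sum of R*C products; kOhm times fF gives ps."""
    return sum(r * c for r, c in terms)


# One list of delays per transition, one entry per trigger signal assumed
# to switch last (products as written in the text).
delays = {
    "t6->3": [
        rc_sum([(2 * 2.5, 58.8)]),                       # EvDone last
        rc_sum([(2.5, 19.2), (5.0, 58.8)]),              # M1A last
    ],
    "t6->0": [
        rc_sum([(5.0, 19.2), (5.0, 58.8)]),              # EvDone last
        rc_sum([(2.5, 9.6), (5.0, 19.2), (5.0, 58.8)]),  # start last
    ],
}

# Normalized long term frequencies from Section 2.3.1.
weights = {"t6->3": 0.9, "t6->0": 0.1}

# Mean over trigger signals, since the last-switching trigger is unknown.
per_transition = {t: sum(d) / len(d) for t, d in delays.items()}

# Frequency-weighted delay of this decomposition.
weighted = sum(weights[t] * per_transition[t] for t in delays)

print(per_transition, weighted)
```

The per-transition means come out at about 318 ps and 402 ps, and the weighted delay at about 326.4 ps, which the text rounds to 326 ps.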

2.3.5 Computing Weighted Delay

Now that we have delay values for the two transitions, we can compute the weighted delay using the normalized long term frequency values computed earlier: $\hat{\Pi}_{t_{6\to3}} \times 318 + \hat{\Pi}_{t_{6\to0}} \times 402 = 326$ ps. This value is compared to the weighted delay for each decomposition, in order to select the decomposition with the minimum weighted delay. The decomposition shown in Figure 6 yields the minimum weighted delay in this example. Note that this solution uses a static gate, which slows down $t_{6\to0}$ (infrequent path) considerably. However, reducing the stack length to 1 in this infrequent path enables us to reduce the diffusion contact capacitance of $M_3$. This, in turn, reduces $C_{out}$, which enables the frequent path to be sped up enough to minimize the weighted delay.

3 Technology Mapping

Technology mapping is a process of translating a technology independent logic equation into a technology dependent netlist. In our case, sum-of-products representations of $z_{set}$ and $z_{reset}$ are mapped to a decomposed generalized C-element that implements z. For clarity of exposition, we limit our discussion to deriving a circuit for z+; the same process is used, independently, to create z−.

3.1 Signal Ordering

To minimize input-to-output delay, late-arriving signals should drive transistors closest to the output. In that way, the capacitance on the nodes below will already have been discharged by the time the late input changes occur at the transistors closest to the output. To determine which signals should be placed closest to the output, we make use of the statistics derived earlier. For each output transition, z+, we determine a set of transitions $\{\ldots, t_i, \ldots\}$ in which z+ is enabled. For each transition $t_i$, we examine the trigger inputs for the transition. Each trigger input is “credited” with the long term frequency of every transition ($t_i$) it enables. These cumulative long term frequencies determine the trigger signal ordering: the signal with the highest value is placed at the top of the stack. Signal ordering, once determined, remains fixed for all decompositions. Fixing the signal ordering in this way drastically reduces the search space, which otherwise would be prohibitively large, without sacrificing the optimality of the solution significantly. Remaining signals are ordered to facilitate sharing at the bottom of the stack. Signals that trigger z− transitions but not z+ are added immediately below the trigger signals for z+, followed by all other signals at the bottom of the stack.

3.2 Sharing Transistors

In order to reduce the area and the input loading of the circuit, 3D-map shares common variables amongst product terms. However, 3D-map allows sharing transistors at the bottom of the stack only, closest to ground/Vdd, because sharing at the top of the stack may introduce charge-sharing problems. Note that satisfying both the signal ordering requirement and the requirement to share transistors only at the bottom of the stack disallows certain sharing. For example, z = ab + ac cannot be re-mapped to z = a(b + c) when ordering (a, b, c) is imposed, because a would then be a shared transistor at the top of the stack.

3.3 Decomposition and Selection

We need to consider decomposing the stacks in order to limit stack length. Long series stacks are unattractive for two reasons: (1) the worst-case delay of the stack increases quadratically with the stack size; (2) the charge sharing problem may become significant enough to cause the circuit to fail. Although it is possible to reduce series resistance using larger transistors, doing so increases the gate capacitance, which slows down the input transitions. Furthermore, it increases the capacitance between series transistors ($C_i \propto 2W_n$) as well, which aggravates the charge sharing problem. By decomposing, we eliminate long series stacks, thus alleviating charge-sharing problems.
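A short derivation shows where the quadratic growth comes from. It rests on a simplifying assumption of ours, not on the paper's sizing rule: all $n$ series transistors have the same resistance $R$ and every node along the stack has the same capacitance $C$. In the worst case the transistor nearest the supply rail ($M_1$) switches last, so no node has been discharged in advance, and the Elmore delay of equation (5) with $k = 1$ becomes

$$D_1 = \sum_{i=1}^{n} \Big( \sum_{j=1}^{i} R \Big) C = RC \sum_{i=1}^{n} i = \frac{n(n+1)}{2}\, RC .$$

Under this assumption, doubling the stack length from 2 to 4 raises the worst-case delay from $3RC$ to $10RC$.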

3.3.1 AND Decomposition Any series of transistors in a gC stack may be represented as an AND function. An AND function can be implemented using a series of transistors. On the other hand, the same AND function can be implemented as a single nMOS (pMOS) transistor driven by a static AND (NAND) gate. In general, parts of series stacks may be implemented with a transistor driven by a static gate, while the remainder is implemented as series transistors. We use either NOR or NAND static gates as depicted in Figure 7.

Figure 7: (a) Four possible decompositions of a 3-input N-stack; (b) Four possible decompositions of a 3-input P-stack.

3.3.2 Checking Legality of Decompositions

Short Circuits. The decomposition step must not create a circuit that generates excessive short-circuit current during transitions. No signal that enables a gC stack, say an N-stack, to turn on in any state transition can be used to disable a P-stack through a static gate [9]. That is, if any signal is critical in disabling a P-stack when an N-stack turns on, it must drive a transistor in the P-stack directly, not through a static gate. Otherwise, that static gate would delay the P-stack from turning off until some time after the N-stack turns on, creating a short-circuit condition.

Figure 8: Short-circuit condition caused by decomposition: (a) a simple example using a static AND gate and one transistor in place of two series transistors; (b) conditions under which using the static gate allows short-circuit current to flow during a transition.

Without loss of generality, assume that a is a trigger signal and a+ causes $z_{set}$ to rise (an N-stack to turn on) and $z_{reset}$ to fall (some P-stacks to turn off). Then the necessary and sufficient condition for a short circuit is: there exist $i, j$ such that $p_{i,j} = aX$, where $X \neq 1$ and $z_{reset} = \sum_i \prod_j p_{i,j}$. 3D-map performs this check automatically.
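The condition can be expressed as a simple structural test on a candidate decomposition. The sketch below is one possible encoding, assumed here for illustration and not taken from 3D-map: the opposing stack is a list of products, each product a list of factors $p_{i,j}$, and each factor a set of signal names. A one-element set stands for a transistor driven directly by that signal; a larger set stands for a transistor driven by a static gate combining those signals. Signal polarity is ignored, and the function name `causes_short_circuit` is hypothetical.

```python
def causes_short_circuit(opposing_stack, enabling_triggers):
    """Return True if some trigger that turns on one stack reaches the
    opposing stack only through a static gate (p = a*X with X != 1)."""
    for product in opposing_stack:       # z_reset is a sum of products
        for factor in product:           # each factor drives one transistor
            goes_through_gate = len(factor) > 1
            if goes_through_gate and factor & enabling_triggers:
                return True
    return False


# Trigger a drives its transistor directly; b and c share a static gate.
legal = [[frozenset({"a"}), frozenset({"b", "c"})]]
# Trigger a is folded into a static gate together with b.
illegal = [[frozenset({"a", "b"}), frozenset({"c"})]]

print(causes_short_circuit(legal, {"a"}))    # False
print(causes_short_circuit(illegal, {"a"}))  # True
```

A decomposition enumerator could call such a test once per candidate and discard those for which it returns True.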

Hazards. The decompositions used in our method do not create hazards, because recursively decomposing an AND gate with smaller AND gates is hazard-free [5, 6]. Because the 3D tool creates hazard-free realizations, the circuits decomposed by our tool do not contain hazards.

3.3.3 Evaluating Decompositions

For every legal decomposition, 3D-map calculates the average-case delay of the z+ portion of the circuit. 3D-map assigns transistor widths, from which resistance and capacitance values can be derived, and then computes Elmore delays for all transitions in which z+ is enabled.

The first step is to select transistor widths, the key parameter used in modeling the circuit. We assign widths so that the effective resistance along each path from gC output to Vdd/ground is approximately equal to that of a single nMOS transistor with W = 4λ. We assume that pMOS devices have twice the resistance of nMOS devices. Clearly, adding static gates to reduce the stack size or adjusting transistor widths to normalize the drive of each stack causes the loading of input buffers to vary. 3D-map assumes that both true and complement forms of inputs are available through separate buffers (drivers) and the drive of each buffer is sufficiently large so that small perturbations in the loading have little effect on input buffer delay.

Resistance of a transistor is parameterized in terms of $R_\square$, the resistance of one square of transistor channel (L = W). L is fixed at 2λ, so the number of squares is $L/W = 2/W$. Gate and diffusion contact capacitance ($C_g$, $C_d$) are parameterized in terms of transistor width in units of λ. We assume that diffusion contacts are not shared. In other words, capacitance between two series transistors is the sum of the diffusion contact capacitances of both transistors.

$$R = \frac{2}{W} R_\square \tag{3}$$

$$C_g = C_d = W \cdot C_{DIFF} \tag{4}$$

The total capacitance on node x (see Figure 9) is $C_x = C_d + C_{load}$, where $C_d$ is the diffusion contact capacitance of the transistor at the top of the stack to be decomposed and $C_{load}$ is the sum of the capacitance contributed by output inverter pairs as well as diffusion contacts from other stacks. Because not all of these values are known a priori, 3D-map assumes that $C_{load}$ is equivalent to approximately 6 INV1 (W = 8λ : 4λ).
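Equations (3) and (4) translate directly into code. In the sketch below the function names are hypothetical, and the process constants are not given in the text: the per-square resistances (10 kΩ for nMOS, 20 kΩ for pMOS) and $C_{DIFF}$ = 0.6 fF/λ are values we back-computed from the captions of Figures 5 and 6, so they should be read as assumptions.

```python
L_LAMBDA = 2.0  # channel length, fixed at 2 lambda in the text

# Assumed constants, inferred from the Figure 5 and Figure 6 captions.
R_SQUARE = {"nmos": 10.0, "pmos": 20.0}  # kOhm per square
C_DIFF = 0.6                             # fF per lambda of transistor width


def resistance(width, kind):
    """Equation (3): R = (L / W) * R_square, with L = 2 lambda."""
    return L_LAMBDA / width * R_SQUARE[kind]


def contact_capacitance(width):
    """Equation (4): Cg = Cd = W * C_DIFF."""
    return width * C_DIFF


def internal_node_capacitance(width_below, width_above):
    """Diffusion contacts are not shared, so the node between two series
    transistors carries the contact capacitance of both."""
    return contact_capacitance(width_below) + contact_capacitance(width_above)


print(resistance(16, "pmos"), resistance(8, "pmos"), resistance(8, "nmos"))
print(internal_node_capacitance(16, 16))
```

With these constants the sketch reproduces the 2.5 kΩ, 5 kΩ, and 2.5 kΩ resistances and the 19.2 fF internal-node capacitance quoted for the ALU1 example.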

Figure 9: An RC model of a gC element: (a) An N-stack of gC element (inverters omitted); (b) Corresponding RC model.

A simple RC model of the circuit is created by substituting every transistor with a resistor and adding a capacitor on every node, as shown in Figure 9. We compute the delay from the latest arriving input to output using the Elmore delay model [7]. Assuming that $M_k$ ($k \in 1 \ldots n$) is the last transistor to turn on, the delay from $a_k$ to x is computed as below:

$$D_k = \sum_{i=k}^{n} \left[ \sum_{j=1}^{i} R_j \right] C_i, \tag{5}$$

where $n$ is the number of resistors in the stack. All other inputs are assumed to have stabilized in advance. Transistor chains between the trigger point and ground are modeled as a series resistor network; capacitors $C_1$ through $C_{k-1}$ are assumed to have been discharged in advance.

Example: Three examples of Elmore delay calculation are depicted in Figure 10. Figures 10(a) and 10(b) illustrate input transitions a+ and b+ turning on a simple 2-input N-stack. If a+ occurs later than b+, then the delay from a+ to out− is $R_1 C_1 + (R_1 + R_2) C_2$. Otherwise, the delay from b+ to out− is $(R_1 + R_2) C_2$. Figure 10(c) shows a 3-input N-stack decomposed into a NOR gate and a 2-input N-stack. The delay from a− to out−, assuming that a is the latest arriving trigger signal, is $2 R_0 C_0 + R_1 C_1 + (R_1 + R_2) C_2$.

Figure 10: Elmore delay model: (a) a+ arrives late; (b) b+ arrives late; (c) a is the latest arriving trigger signal.

3.3.4 Decomposition Selection

To select a decomposition, we use statistically weighted Elmore delay values. For z+, we select the set of state transitions $\{\ldots, t_a, \ldots\}$ in which z+ is enabled. We then calculate the delay for each transition $t_a$ and compute the average-case delay value using statistics derived earlier.

In order to find a decomposition that yields the minimum average-case delay, this process is repeated for every legal decomposition. For each state transition $t_a$ in which z+ is enabled, and for each trigger signal $a_i$ in $t_a$, we calculate the delay from $a_i$ to output, assuming that $a_i$ is the last trigger input to switch. Since we do not have sufficient information about the environment to determine which trigger signal switches last, we take the arithmetic mean of the delay values obtained for all trigger signals in $t_a$.
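Equation (5) and the averaging over trigger signals can be written compactly. The following is a minimal sketch with hypothetical function names; it handles a single series stack only (no static gates, no parallel branches), with index 1 denoting the transistor nearest the supply rail, as in the equation.

```python
def elmore_delay(R, C, k):
    """Equation (5): delay when transistor M_k (1-based) is the last to turn on.

    R[j-1] and C[i-1] hold R_j and C_i; nodes C_1 .. C_(k-1) are assumed to
    have been discharged in advance and therefore do not contribute.
    """
    n = len(R)
    return sum(sum(R[:i]) * C[i - 1] for i in range(k, n + 1))


def transition_delay(R, C, trigger_positions):
    """Arithmetic mean of the delays obtained by assuming, in turn, that each
    trigger input of the transition is the last to switch."""
    delays = [elmore_delay(R, C, k) for k in trigger_positions]
    return sum(delays) / len(delays)


# Two-transistor stack from the ALU1 example: R in kOhm, C in fF, delay in ps.
R = [2.5, 2.5]
C = [19.2, 58.8]
print(elmore_delay(R, C, 1), elmore_delay(R, C, 2), transition_delay(R, C, [1, 2]))
```

For the two-transistor stack of the ALU1 example this gives about 342 ps and 294 ps for the two triggers and a mean of about 318 ps, matching the hand calculation in Section 2.3.4.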

Figure 11: Don’t-cares in state transitions ([a, b, c] = [∗, 1, 1]): (a) a = 0, hence left side of circuit is OFF; (b) a = 1, hence left side of circuit is ON, reducing resistance to $R \parallel R = R/2$.

Although don’t care signals can never enable an output to switch, they may alter the resistance of certain paths. Hence they may introduce multiple possible delay values, as shown in Figure 11. Again we have a practical limitation on the amount of information we can request from the designer. Therefore we assume that it is equally probable for a don’t care signal to assume 0 or 1. We calculate the delay for every combination of don’t care assignments and use the arithmetic mean of these values. The average-case delay of z+ is the expected value of the delay in each state transition, $D_{t_i}$:

$$\text{Average-case Delay} = \sum_{i:\, z+ \in t_i} \hat{\Pi}_{t_i} \cdot D_{t_i}. \tag{6}$$

The circuit implementation with the minimal average-case delay is selected.
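The treatment of don’t cares amounts to averaging a delay function over all 0/1 assignments. The sketch below illustrates this with a hypothetical helper `dont_care_mean`; the example delay function mimics the situation of Figure 11, where a = 1 halves the path resistance, and its values R = 4 kΩ and C = 10 fF are made up for illustration.

```python
from itertools import product


def dont_care_mean(delay_fn, dont_cares):
    """Average delay_fn over every 0/1 assignment of the don't-care signals,
    treating all assignments as equally probable."""
    assignments = [
        dict(zip(dont_cares, bits))
        for bits in product((0, 1), repeat=len(dont_cares))
    ]
    return sum(delay_fn(a) for a in assignments) / len(assignments)


R, C = 4.0, 10.0  # illustrative values only (kOhm, fF)


def figure11_like_delay(assignment):
    """Path resistance is R when a = 0 and R/2 (two equal paths in parallel)
    when a = 1."""
    path_resistance = R / 2 if assignment["a"] else R
    return path_resistance * C


print(dont_care_mean(figure11_like_delay, ["a"]))  # 30.0
```

The resulting per-transition delay is what enters equation (6) as $D_{t_i}$.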

Design            No. of Out   Cases Improved / Cases   Max Speedup (%)   Avg Speedup (%)
alu1              7            4 / 14                   9.5               6.2
alu2              9            14 / 18                  18.3              8.8
binary-counter    7            9 / 14                   9.8               6.4
biu-dma2fifo      5            6 / 10                   15.1              8.0
biu-fifo2dma      4            3 / 8                    17.0              8.7
dff               2            4 / 4                    8.5               6.3
div-by-3          3            0 / 6                    −                 −
dramc             6            1 / 12                   17.1              17.1
fifocellctrl      3            0 / 6                    −                 −
jkff              2            4 / 4                    8.5               6.3
pe-send-ifc       5            7 / 10                   9.6               5.8
ram-read-sbuf     5            3 / 10                   8.5               5.8
sbuf-ram-write    6            0 / 12                   −                 −
sbuf-read-ctl     4            2 / 8                    5.1               4.8
scsi-init-rcv     6            10 / 12                  17.0              8.3
scsi-init-send    6            6 / 12                   12.8              7.6
scsi-targ-rcv     7            10 / 14                  22.6              8.8
scsi-targ-send    6            6 / 12                   11.4              6.9
select            2            4 / 4                    15.1              11.8
selmerge          3            6 / 6                    15.1              9.2
OVERALL                                                 22.6              8.0

Table 1: Experimental Results. The No. of Out column lists the number of all primary outputs and state variables in the synthesized design. The Cases Improved / Cases column shows the number of circuits improved by decomposition vs. the total number of cases. There are two cases per output: z+ and z−. Max Speedup is the maximum speedup of all cases in the design, where Speedup = (slow-delay / fast-delay − 1) × 100%. Average Speedup is the average of speedup for improved cases.

4 Experimental Results

Table 1 shows our experimental results. The baseline delay is calculated using the original circuit, with transistors shared amongst product terms when possible, but without decomposition. The minimum average-case delay of all legal decompositions is computed by 3D-map. We compare these numbers to obtain speedup for each output, for both rising and falling transitions. Table 1 reports the maximum speedup for the entire design. Also, we specify the number of cases that were improved by decomposition, and the average speedup for the cases for which improvements were made. Note that the optimization is performed on each output independently of other outputs.

For the results presented, state transition probabilities were assigned according to the expected behavior of the environment when such information was available (e.g., alu2, dramc, and scsi-targ-rcv). Otherwise (e.g., dff), the probabilities were assigned uniformly. In general, the assignment of state transition probabilities reflects the typical behavior of the environment. In some cases, state transition probabilities may be data-dependent. For example, the packet size for a SCSI controller may change from several bytes to several thousand bytes, depending on whether the packet comprises a simple message or raw data. Thus, the annotated probabilities are based on the typical behavior of the environment. Furthermore, the sensitivity of the average-case delay with respect to the state transition probability is negligible for those cases in which the probability is severely biased in one direction: e.g., in the SCSI controllers, remaining in the steady-state loop is typically 1000 times more likely than leaving it.

Often, it is not possible to improve a circuit by decomposing it: for example, if the circuit consists of a single stack of length n, of which n − 1 transistors are driven by trigger signals. In such cases, any static gate added would be in the critical path. There are other cases, however, in which the speedup is impressive: up to 22.6%.
Most circuits, even the ones with stack length less than five, benefit from decomposition. Typically, one or two cases in a design may experience 15-20% speedup, which makes using 3D-map worthwhile.

Table 2 shows detailed results for ALU2. The ALU2 design contains circuits for 9 outputs, two of which are state variables. In two cases, the original stack length was one, hence no decomposition was possible.

Output    Speedup, rising (%)   Speedup, falling (%)   Stack len, rising, from/to   Stack len, falling, from/to
Prech     2.8                   12.1                   4/3                          4/2
LX        4.5                   0.0                    3/2                          2/2
LY        11.1                  0.0                    5/4                          1/1
A2M       16.2                  7.5                    5/4                          4/2
EndP      8.5                   0.0                    6/4                          1/1
seldx     4.5                   0.0                    3/2                          2/2
selym2    4.5                   7.5                    3/2                          4/2
zzz00     8.1                   18.3                   5/4                          5/3
zzz01     10.0                  7.9                    5/4                          3/2

Table 2: Mapping results for ALU2

Experimental results show that decomposition selection is more subtle than it may appear to be. We initially assumed that the fastest circuit can be realized simply by factoring out only the non-trigger signals to static gates. However, it turned out in some cases that factoring out trigger signals in infrequently activated stacks can improve the latency as well. It is often the case that one stack in a gC element is statistically much more relevant than others. As a consequence, an appreciable speedup can result from decomposing a stack of limited statistical significance, which may actually slow down some infrequent transitions but speed up a statistically important one.

Circuits derived in this manner often appear counter-intuitive. The key is to realize that there may be long stacks of transistors whose existence is justified only by very infrequently occurring transitions. These stacks increase the delay of frequent transitions, because of larger diffusion contacts required for larger transistors used in longer stacks. In other words, frequent transitions may be penalized because of long stacks that support only infrequent transitions. Instead, we may choose to include static gates in the critical path of an infrequent transition to reduce the length of a stack, which results in smaller capacitance on the output node, and shorter delays for frequent transitions. Less important paths may be slower than they otherwise would have been, but the circuit performs better overall. For example, the LU circuit shown in Figure 6 factors out a statistically insignificant stack (start EvDone) into a static gate, in order to improve the average latency of LU−.

Because the total number of decompositions is substantially reduced by a priori signal ordering, and stack length and static gate fanin limits, the run-times of 3D-map are very low. For the cases presented (representing a cross-section of existing asynchronous designs), wall-clock run times are on the order of 2 seconds, on a 200 MHz Pentium Pro system with 64 MB of memory.

5 Conclusion

Improving average-case delay is an important goal; however, the initial reason for developing this tool was the need to decompose long series stacks so that the circuit functions correctly. Because many legal decompositions were possible, we needed a criterion for the “best” decomposed circuit. We selected average-case latency as the metric, computed from designer-supplied state transition statistics and the Elmore delay model. In many cases, decomposition was necessary not only for optimizing speed but also for finding equivalent circuits with reasonable stack sizes that can function reliably. That is, decomposition may be necessary to eliminate charge sharing problems, not just to improve performance.

It should also be noted that the manual decomposition process is very tedious and error-prone. It is not feasible to manually derive the results obtained by 3D-map even for fairly small circuits, for the following reasons: (1) in order to derive average-case delay, statistics must be calculated; (2) picking a legal decomposition is a non-trivial task, because the circuit operation must be examined to ensure that the decomposition does not create a short-circuit during any transition; (3) the number of delay calculations is exponential in the number of trigger signals and don't-cares. Overall, tens or hundreds of calculations must be performed for each rising and falling output transition.

3D-map uses only a single level of decomposition (static gate outputs drive gC transistors directly, not additional levels of static gates). This limits the size and complexity of the gC elements that can be decomposed within the maximum legal stack length. For practical gC implementations, this constraint is not deemed significant.
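As a rough illustration of why stack length matters under the Elmore delay model mentioned above, the following sketch computes the Elmore delay of a series transistor stack modeled as an RC ladder. The function name `elmore_stack_delay` and the unit resistance and capacitance values are assumptions made for this example only; in particular, it ignores the transistor sizing and diffusion capacitance effects that an actual mapper would have to model.

```python
def elmore_stack_delay(resistances, node_caps):
    """Elmore delay at the output of a series stack modeled as an RC ladder.

    resistances[i] is the on-resistance of the i-th transistor counted from
    the supply rail; node_caps[i] is the capacitance of the node just above
    that transistor. The last node is the output node. Each node contributes
    its capacitance times the total resistance between it and the rail.
    """
    if len(resistances) != len(node_caps):
        raise ValueError("need one node capacitance per transistor")
    delay = 0.0
    path_resistance = 0.0
    for r, c in zip(resistances, node_caps):
        path_resistance += r          # resistance from the rail up to this node
        delay += path_resistance * c  # this node's Elmore contribution
    return delay


# Assumed unit values: R = 1 per transistor, C = 1 per internal node,
# C = 2 on the output node.
short_stack = elmore_stack_delay([1, 1], [1, 2])
long_stack = elmore_stack_delay([1, 1, 1, 1], [1, 1, 1, 2])
print(short_stack, long_stack)
```

Under these assumptions the delay grows faster than linearly with stack length, which is consistent with the motivation for shortening stacks by decomposition.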

Acknowledgment

Many thanks to Steven M. Burns, Intel Corp., for his assistance. Although we ultimately decided against using his HSE package to determine potential decompositions, his work [3] helped guide our efforts.

References

[1] P. A. Beerel, K. Y. Yun, and W. C. Chou. A heuristic covering technique for optimizing average-case delay in the technology mapping of asynchronous burst-mode circuits. In Proc. European Design Automation Conference (EURO-DAC), September 1996.
[2] P. A. Beerel, K. Y. Yun, and W. C. Chou. Optimizing average-case delay in technology mapping of burst-mode circuits. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer Society Press, March 1996.
[3] S. M. Burns. General condition for the decomposition of state holding elements. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer Society Press, March 1996.
[4] A. Davis, B. Coates, and K. Stevens. The Post Office experience: Designing a large asynchronous chip. In Proc. Hawaii International Conf. System Sciences, volume I, pages 409–418. IEEE Computer Society Press, January 1993.
[5] S. M. Nowick and D. L. Dill. Exact two-level minimization of hazard-free logic with multiple-input changes. IEEE Transactions on Computer-Aided Design, 14(8):986–997, August 1995.
[6] S. H. Unger. Asynchronous Sequential Switching Circuits. Wiley-Interscience, John Wiley & Sons, Inc., New York, 1969.
[7] N. Weste and K. Eshraghian. Principles of CMOS VLSI Design (Second Edition). Addison-Wesley, 1992.
[8] K. Y. Yun. Synthesis of Asynchronous Controllers for Heterogeneous Systems. PhD thesis, Stanford University, August 1994.
[9] K. Y. Yun. Automatic synthesis of extended burst-mode circuits using generalized C-elements. In Proc. European Design Automation Conference (EURO-DAC), pages 290–295, September 1996.
[10] K. Y. Yun, P. A. Beerel, V. Vakilotojar, A. E. Dooply, and J. Arceo. The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 140–153, April 1997.
