IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. XX, NO. Y, MONTH 2002
100
A Technique For Improving Dual-Output Domino Logic Sumant Ramprasad, Ibrahim N. Hajj, Farid N. Najm Abstract— We present a technique, termed clockgenerating (CG) domino, for improving dual-output domino logic that reduces area, clock load, and power without increasing the delay. A delayed clock, generated from certain dual-output gates, is used to convert other dual-output gates to single output. Simulation results with ISCAS 85 benchmark circuits indicate an average reduction in area, clock load, and power of 17%, 20%, and 24% respectively over dual-output domino and a 48% power reduction for the largest circuit. Keywords— Domino Logic, Delayed Clocks, Low-power
I. Introduction
D
YNAMIC logic circuits [5] are used in highperformance circuits due to their speed and area advantage over static CMOS circuits. One well-known dynamic logic family is the domino CMOS family [2], which, however, suffers from its inability to perform inversions. Various methods have been proposed to overcome this restriction. One such method is the dual-output domino logic family [1]. In the standard dual-output domino logic gate [7] shown in Figure 1 each dual-output gate consists of two standard domino logic gates, producing the output, R and its complement, R. The possible outputs of a dual-output clk
clk
R = a AND b
R = a OR b
a a b
b clk
clk
Fig. 1. Standard Dual-Output Domino Logic AND2 Gate
domino logic gate are given in Table I. Note that R and R are both 0 when the gate is in precharge or is waiting for its inputs. The advantage of the dual-output domino TABLE I Possible outputs of a dual-output domino logic gate
clk 0 1 1 1
R 0 0 1 0
R 0 0 0 1
Comments Precharge Waiting for inputs Gate has evaluated to 1 Gate has evaluated to 0
S. Ramprasad is with Mindspeed Technologies Inc., Newport Beach CA, USA. I. N. Hajj is with the Coord. Sc. Lab., U. of Illinois, Urbana IL, USA. Email:
[email protected]. F. N. Najm is with the Dept. of Elect. and Comp. Engg., University of Toronto, Toronto, Canada. E-mail:
[email protected]. This work was supported by NSF award MIP-97-10235
CMOS family is its completeness, but at the cost of higher area and higher power dissipation. In a combinational logic block implemented using domino CMOS, only the fanin cones of inverters have to be dual-output. In this paper, by dual-output domino CMOS, we mean that only those gates whose both outputs are needed are dual-output. For instance, in the circuit in Figure 2, taken from [3], only G7, G3, G8, G12, G13, and G14 have to be dual-output. The remaining gates are single output. I1 I2 I3 I4
I6 I7 I8 I9
G7 G3
G8 I5
INV1 O1 G4
G1 G5
G13 G12 I10 G14
I11 I12
INV2 G10
O2 G6
G2
G9
G11
Fig. 2. Example circuit
In the clock-delayed (CD) domino logic style [6], the clock to a gate is delayed using delay elements attached to a gate till the inputs are known to have reached their final value. The general CD domino scheme is faster than static CMOS but typically consumes more area. The disadvantage of CD domino is that delay elements are required to be attached to gates. CD domino logic may also be slower than dual-output domino since a margin is added to the delay elements by making the delay element slower than the gate it is attached to. One method to overcome this problem is to use dual-output domino for the critical path and CD domino logic for gates not on the critical path. In our paper, this idea is further extended in the clock-generating (CG) domino logic scheme by recognizing that a delayed clock can be generated for gates not on the critical path from the dual-output gates that are on the critical path. This eliminates the delay elements required by CD domino. CG domino reduces area, power dissipation, and clock load of dual-output domino CMOS without increasing the delay. Simulation results with ISCAS 85 benchmark circuits indicate an average reduction in area, power, and clock load of 17%, 24%, and 20% respectively over dual-output domino. The performance results are typically better for larger circuits with a 48% power reduction for the largest circuit. The rest of this paper is organized as follows. In section 2, CG domino is presented along with a detailed example.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. XX, NO. Y, MONTH 2002
In section 3, the synthesis algorithm is described and in section 4 experimental results are presented followed by the conclusion. II. Clock-generating (CG) domino In CG domino, given a circuit, a dual-output gate Q like the one shown in Figure 1 is first located. It is then replaced with the modified dual-output domino logic gate Q shown in Figure 3, in which there is one gate to produce the output Q. The complemented output, Q, is generated by inverting DUAL OUTPUT GATE R
NAND GATE
MODIFIED DUAL OUTPUT GATE Q
clk R
a
R
a
Q
Q
c
b
b
d delayed clk
Fig. 3. Clock-generating (CG) domino
Q and using it as input to an AND1 (i.e., buffer) domino gate clocked by a delayed clock. The delayed clock is a delayed version of the precharge evaluate clock and is 1 only after Q has evaluated. If Q has low fanout (1 or 2), then the buffer in Figure 3 can be eliminated by using the delayed clock for each fanout gate, P , of Q as shown in Figure 4. The scheme in Figure 4 is not used if Q has high fanout because then the fanout of the delayed clock would be increased. DUAL OUTPUT GATE R
NAND GATE
OPTIMIZED GATE Q
R
a
Q
R
P
c a
b
FANOUT GATE P OF Q delayed clk
clk
b
101
The generation of the delayed clock is discussed next. A. Generation of the delayed clock From Table I, we see that a gate has evaluated if and only if (R OR R) is 1. To generate the delayed clock for a gate Q, the circuit is analyzed to locate another dualoutput gate R in the circuit satisfying 4 conditions listed later in this sub-section. The delayed clock is then derived by ORing the outputs of R (or, equivalently, NANDing the outputs of R before the inverters). The generation of the delayed clock is shown in Figure 3. For a dual-output gate R to supply a delayed clock gate to another dual-output gate Q, it must satisfy the following conditions, 1. There should not be a path from Q to R or R. This condition is to ensure that a cycle is not introduced. 2. The delayed clock generated from R and R must arrive after the positive output of Q (we included a margin of 10% to be conservative). This condition ensures that the delayed clock arrives after the positive output of Q and hence there is at most one rising transition at Q. 3. The delayed clock generated from R and R must arrive before the latest required time for any input to the gate generating Q (we, again, included a margin of 10% to be conservative). This condition ensures that there is no increase in the delay through the combinational block. 4. The gate R must not be supplying the clock to more than a certain number of gates (this limit was set at 8). This condition is to avoid violating any fanout constraints. Note that the modified dual-output gate Q in Figure 3 can also be used to generate the delayed clock signal for other gates since the possible outputs of that gate are also given by Table I. It is, however, not possible to use the gate Q in Figure 4 to generate a delayed clock.
e d
Fig. 4. CG domino when gate Q has low fanout
The modified gates in Figure 3 and Figure 4 have several advantages over the gate in Figure 1, First, since fewer transistors are needed, the area and power dissipation are reduced. Second, since the complemented inputs c, d, . . . are not needed, the gates in the fanin cone of Q do not have to be dual-output. Third, it is possible to choose whether to implement Q or Q depending on their effect on power, area, and delay. In Figure 3 and Figure 4, the dual outputs of gate R are used to generate the delayed clock using the NAND gate. The delayed clock is used to optimized gate Q in Figure 3 and Figure 4. The delayed clock is also used to optimize the fanout gates of gate Q in Figure 4. In order to improve reliability, the NAND gate in Figure 3 and Figure 4 should be placed close to gate R whose dynamic outputs the NAND gate is using as inputs. The precharge pfets and evaluate nfets in gate R and gate Q will have to be sized up in order to charge the additional gate capacitance at the output of these gates.
B. Example In this sub-section, the circuit in Figure 2 is used as an example to illustrate CG domino. The inverters IN V 1 and IN V 2 cannot be propagated towards the primary inputs, due to which gates G3, G7, G8, G12, G13, and G14 have to be dual-output. Assuming all the primary inputs arrive at the same time, G12 can supply a delayed clock to G3 since it can be shown that G12 and G3 satisfy the 4 conditions listed in sub-section 2.1. Since G3 has a fanout of 1, the optimization in Figure 4 is used due to which G3, G7, and G8 no longer have to be dual-output and can now be single output. The delayed clock obtained by NANDing the outputs of G12 before the inverters is used as the clock for G1. Layouts of the circuit in Figure 2 using standard dual-output domino and CG domino are shown in Figure 5 and Figure 6 respectively. A SPICE file was extracted from both layouts and simulated. In the CG domino circuit, the delay of the path G7-G3-IN V 1-G1 is increased from 2.62ns to 2.90ns. This increase does not, however, slow down the circuit since it is less than the critical path delay of 5.7ns through the critical path G13-G12-G10-G9G6-G2. The critical path is unchanged in the CG domino circuit and hence the critical path delays are the same in the two circuits. The number of transistors is reduced from
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. XX, NO. Y, MONTH 2002
Fig. 5. Circuit layout using standard dual-output domino
102
boolean function Logic synthesis using standard library gate-level netlist Convert to AND/OR/NOT/XOR gate-level netlist Minimize Inverters gate-level netlist Mark gates as dual or single output gate-level netlist (baseline circuit) Convert as many dual o/p gates as possible to single o/p CG domino gate-level netlist Fig. 8. Synthesis of CG domino circuits
Fig. 6. Circuit layout using CG domino
121 to 107, the power dissipation is reduced from 0.824 mW to 0.737 mW, and the capacitance of the clock network is reduced from 399fF to 338fF. One timing diagram from a simulation of the CG domino circuit is shown in Figure 7. When clk goes high, either G12 or G12 is 1 after a certain time. This results in delayed clk becoming 1 due to which G1 evaluates. Note that one input, G3, to G1 is a falling input, i.e., G3 makes at most a single 1 to 0 transition during the evaluate phase. This is acceptable since the clock to G1 is delayed until after G3 has reached its final value. There is a short spike on G1 after the second evaluate because delayed clk goes low after G3 becomes 1. This spike does not cause an error because it occurs during precharge and because it is short. If needed, the spike can be eliminated by ensuring that delayed clk goes low before G3 becomes 1. 4.0V CLK
0V 4.0V G12
−0.1V
output of SIS is a gate-level netlist containing both inverting and non-inverting gates. This netlist was converted to a netlist comprising of only AND, OR, NOT, and XOR2 gates. The number of NOT gates was minimized by propagating them towards the primary inputs, wherever possible. Inverters at the primary inputs and primary outputs were deleted since these inverters can be absorbed into the registers at the input and output of the combinational circuit block. At this point, we have a domino logic circuit in which only gates whose both outputs were required were duplicated. This was the baseline circuit used for comparison. The next step was to do a timing analysis on the circuit to determine the arrival times at each node. In our experiments, the timing analysis was done pre-layout. If a more accurate analysis is desired, the timing analysis and the improvements can be done post-layout. The following two improvements were implemented on this baseline circuit based on the timing analysis. The first improvement was to use the timing information to convert some of the dual-output gates to single output. The idea behind this improvement is that a domino AND gate can accept falling inputs if there is a rising input guaranteed to arrive later. This improvement, typically, converted a small amount of gates from dual-output to single output. The second improvement was to use CG domino logic by modifying dualoutput gates by finding another dual (or modified-dual) output gate that will supply a delayed clock.
4.0V G12
IV. Experimental Results
−0.1V 4.0V DELAYED CLK
−0.1V 4.0V G3
−0.1V 4.0V G1
−0.1V 0ns
4ns
8ns
12ns
16ns
20ns
24ns
Fig. 7. Timing diagram for circuit using CG domino
III. Synthesis of CG domino In this section, the algorithm for synthesis of CG domino circuits is described. The synthesis procedure is shown in Figure 8. The input is a boolean function which is mapped to a standard library using the SIS logic synthesis tool [4]. The
The synthesis procedure described in section 3 was implemented in a combination of C and Perl. The ISCAS 85 benchmark circuits were used to measure the improvement achieved using CG domino. We measured the total number of transistors, which is a measure of area, and the total number of transistors driven by the precharge-evaluate clock, which is a measure of the clock load. In addition, a zero-delay simulation was performed for 500 clocks with random input vectors to measure the average switching activity at each node. A zero-delay model was used since there are no glitches in domino logic. Assuming the capacitances of the gates are equal, the average switching activity at each node was summed up to give a measure of the total power dissipation of the entire circuit. A unitdelay model was assumed in order to estimate the arrival times since most of the gates are 2-5 input AND/OR gates
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. XX, NO. Y, MONTH 2002
TABLE II Experimental results for ISCAS 85 benchmark circuits
103
[6]
G. Yee and C. Sechen, “Dynamic Logic Synthesis,” Custom Integrated Circuits Conference, pp. 345–348, Santa Clara CA, May
[7] Area Clock Load Power Circuit (Transistor count) (Transistor count) (Switching Activity) Dual CG % Dual CG % Dual CG % OutputdominoRedn. OutputdominoRedn. Outputdomino Redn. domino domino domino c432 1121 1086 3 328 308 3 85 86 0 c499 2192 2192 0 584 584 0 145 146 0 c880 2554 2164 15 830 660 20 144 108 25 c1355 4664 4164 11 1540 1316 15 296 240 19 c1908 4541 4281 6 1410 1288 9 313 286 9 c2670 7392 5955 19 2334 1862 20 577 404 30 c3540 11391 9179 19 3594 2748 24 889 687 23 c5315 14529 11177 23 4550 3308 27 989 631 36 c6288 21930 20948 4 7310 6934 5 2279 2192 4 c7552 24372 17403 29 7790 5244 33 1853 966 48 Total 94686 78549 17 30266 24252 20 7570 5746 24
J. D. Yetter, “Functionally complete family of self-timed dy-
1997.
with fanouts of 1-5. CG domino is not limited to using a unit-delay model and any other delay model can also be used. Our main aim in conducting the experiments was to estimate the relative benefit over dual-output domino CMOS. The results are shown in Table II from which it can be seen that the results are typically better for larger circuits with many levels of logic. For instance, there is a reduction in power of 48% for the largest circuit (c7552). The improvements took less than a minute to perform for each circuit on a Sun Sparc Ultra-1. V. Conclusion In this paper a technique, termed clock-generating (CG) domino, for improving dual-output domino CMOS gates was presented. CG domino reduces the number of dualoutput gates by analyzing the circuit to generate a delayed clock. Simulation results with ISCAS 85 benchmark circuits indicate an average reduction in area, power, and clock load of 17%, 24%, and 20% respectively over dualoutput domino. References [1]
K. Bernstein, K. M. Carrig, C. M. Durham, P. R. Hansen, D. Hogenmiller, E. J. Nowak, N. J. Roher, “High Speed CMOS Design Styles,” Kluwer Academic, Dordrecht Netherlands, 1998.
[2]
no. 3, pp. 614–619, June 1982. R. Puri, A. Bjorksten, T. E. Rosser, “Logic Optimization by Output Phase Assignment in Dynamic Logic Synthesis,” International Conference on Computer-Aided Design, pp. 2–7, San Jose CA, November 10–14 1996. [4]
E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. Sangiovanni-Vincentelli, “SIS: A System for Sequential Circuit Synthesis,” Memorandum No. UCB/ERL M92/41, University of California at Berkeley, May 1992.
[5]
Sumant Ramprasad received the Bachelor of Technology (B.Tech.) degree in computer science and engineering in 1988 from the Indian Institute of Technology, Bombay, India. He received the M.S. degree in computer and information sciences from the Ohio State University, Columbus, in 1990. From 1991 to 1996 he worked at the Chicago Corporate Research Laboratories of Motorola Inc. He received the Ph.D. degree in computer science at the University of Illinois at Urbana-Champaign. Currently, he is working on designing digital signal processors at Mindspeed Technologies, Newport Beach CA.
Ibrahim N. Hajj (S’64-M’70-SM’82-F’90) is a Professor of Electrical & Computer Engg. and a Research Professor at the Coordinated Science Laboratory, Univ. of Illinois, UrbanaChampaign, Illinois. He received his B.E. degree (with distinction) from the American University of Beirut, the M.S. degree from the Univ. of New Mexico, Albuquerque, and the Ph.D. degree from the Univ. of California at Berkeley, all in Electrical Engineering. Before joining the University of Illinois, he was with the Dept. of Electrical Engg. at the University of Waterloo, Ontario, Canada. His current research interests include computer-aided design of VLSI circuits, design for reliability and low-power, synthesis, physical design, and testing. He has published over 160 journal and conference papers and book chapters on these subjects. He is a co-author of a book, ”Switch-Level Timing Simulation of MOS VLSI Circuits,” Kluwer Academic Publishers, 1989. In 1992 he was a co-recipient of the IEEE Transactions on Computer-Aided Design Best Paper Award. Dr. Hajj is a Fellow of IEEE, and currently serves on the Board of Governors of the IEEE Circuits and Systems Society. He has served as an Associate Editor of the IEEE Transactions on Circuits and Systems, and an Associate Editor of the IEEE Circuits and Systems Magazine. He is a member of the Computer-Aided Network Design (CANDE), a member of ACM, and a member of Sigma Xi.
R. Krambeck, C. Lee, and H. Law, “High speed compact circuits with CMOS,” IEEE Journal of Solid-State Circuits, vol. SC-17,
[3]
namic logic circuits,” U.S. Patent 5,208,490, May 1993.
N. Weste and K. Esraghian, “Principles of CMOS VLSI design: a systems perspective,” 2nd Edition, Addison Wesley, Reading MA, 1993.
Farid N. Najm received the B.E. degree in electrical engineering from the American University of Beirut (AUB) in 1983, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 1986 and 1989, respectively. He worked as Electronics Engineer with AUB from 1983 to 1984. In 1989, he joined Texas Instruments as Member of Technical Staff with the Semiconductor Process and Design Center. In August 1992, he became an Assistant Professor with the Electrical and Computer Engineering Department at the University of Illinois at Urbana-Champaign, and was promoted to Associate Professor with Tenure in 1997. In 1999, he joined the Electrical and Computer Engineering Department at the University of Toronto, where he is currently Professor. Dr. Najm received the IEEE Transactions on CAD Best Paper Award in 1992, the NSF Research Initiation Award in 1993, and the
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. XX, NO. Y, MONTH 2002
NSF CAREER Award in 1996. He has served as General Chairman for the 1999 International Symposium on Low-Power Electronics and Design (ISLPED-99), as Technical Program Co-Chairman for ISLPED-98, and as Technical Program Chairman for the 1997 Great Lakes Symposium on VLSI. He has also served on the technical committees of ICCAD, DAC, CICC, and ISLPED and is a Senior Member of the IEEE. He is Associate Editor for the IEEE Transactions on Computer-Aided Design, Associate Editor for the IEEE Transactions on VLSI, and has served as column editor for the IEEE Circuits and Devices Magazine. He has also served as guest editor for the VLSI Design journal, and for the High Speed Electronics and Systems journal. Dr. Najm has co-authored the text “Failure Mechanisms in Semiconductor Devices.” His research interests are in the general area of CAD tool development for VLSI circuits, including power estimation, modeling, and optimization, power grid design and verification, low-power design, reliability prediction, synthesis of low-power and reliable VLSI, timing analysis and verification, test generation, and circuit and timing simulation.
104