Enhanced 32-bit Carry Lookahead Adder using Multiple Output Enable-Disable CMOS Differential Logic

Mario C.B. Osorio, Carlos A. Sampaio, Andre I. Reis, Renato P. Ribas
Instituto de Informática - UFRGS
Caixa Postal 15064, CEP 91501-970 - Porto Alegre - RS - Brasil
Tel. +55 (51) 3316-6810
rpribas@inf.ufrgs.br

ABSTRACT
This paper presents an enhanced 32-bit carry look-ahead (CLA) adder implemented using the multi-output enable/disable CMOS differential logic (MOECDL) style. The proposed MOECDL structure represents a promising technique for iterative networks and self-timed circuits. The recursive property of the CLA algorithm has been efficiently exploited to demonstrate the advantages of multiple-output structures. The 32-bit MOECDL CLA circuit has been designed in a standard 0.5 μm CMOS technology. A comparison with the known DCVS style is presented through electrical simulation.

Categories and Subject Descriptors
B.2.4 [Arithmetic and Logic Structures]: High-speed Arithmetic - cost/performance.

General Terms
Design, Performance.

Keywords
Digital circuits, adder, ECDL, CMOS

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SBCCI'04, September 7-11, 2004, Pernambuco, Brazil. Copyright 2004 ACM 1-58113-947-0/04/0009...$5.00.

1. INTRODUCTION
Addition is probably the most commonly used arithmetic operation, and it is often also the speed-limiting element in making faster VLSI microprocessors. As the demand for higher-performance processors with increased sophistication grows, there is a continuing need to improve the performance and to reduce the area overhead while increasing the functionality of the arithmetic units contained within them. The carry lookahead (CLA) principle remains the dominant method for implementing fast adders. CMOS technology is widely used due to its overall merits in terms of speed, noise margin, power dissipation and fabrication cost. Therefore, circuit techniques that exploit the ultimate performance of this kind of functional block in CMOS technology represent a fundamental challenge.

The differential and dynamic CMOS logic families offer several advantages with respect to standard CMOS logic circuits. The logic function complexity is easily increased in a single gate, which provides both the output signal and its complement. These dual-rail topologies work in two phases or states: the disable state (output pre-charging or pre-discharging phase) and the enable state (logic evaluation phase). Dual-rail coding allows their application in self-timed circuits, since detection of logic processing completion is easily provided. There are several dual-rail topologies, including:
Differential Cascade Voltage Switch Logic - DCVS [1];
Differential Current Switch Logic - DCSL [2];
Sample-Set Differential Logic - SSDL [3]; and,
Enable/disable CMOS Differential Logic - ECDL [4].

The ECDL approach, in particular, is very promising because a single ECDL gate can handle more levels (transistors in series) in a logic tree than the most popular DCVS structure.

Recently, multi-output CMOS logic styles, with lower transistor count and faster speed compared to single-output circuits, have been investigated in order to exploit the recursive property of logic circuits such as carry lookahead adders. Static and dynamic CMOS structures have been proposed in the literature [5]-[9], with significant CLA improvement in terms of performance and circuit area. A static CMOS topology with multiple outputs was first proposed in [5], while the dynamic multi-output CMOS logic style was first presented for the domino structure: Multiple-Output Domino Logic - MODL [6]. MODL is, by construction, less susceptible to charge sharing than standard domino gates, since internal nodes of the logic tree are also pre-charged for functional purposes [6][7]. The multi-output differential cascade voltage switch logic (MODCVS) was then shown for self-timed circuit applications [8][9].

In this paper, we propose a new logic style, the multiple-output ECDL (MOECDL) structure. A 32-bit CLA adder was designed in order to evaluate the potential of the proposed logic. Since the MOECDL mainly targets asynchronous applications, a 32-bit MODCVS CLA version is used as the performance reference.

This paper is organized as follows. In Section 2, the CLA principle is briefly reviewed. The MOECDL style is described in Section 3. The MOECDL CLA is then discussed in Section 4, while SPICE simulation results and analysis are presented in Section 5. Finally, in Section 6, the conclusions are outlined.

[G3 G2 G1 G0 - P3 P2 P1 P0] = [0100-0101] in Fig. 2, where both G2 and P2 present the logic value '1', the G1,0 node is discharged through the transistors controlled by these signals.
2. CLA PRINCIPLE
Carry lookahead (CLA) adders consist basically of three distinct stages, as illustrated in Fig. 1. The preliminary stage provides the bit-level generates (Gs) and propagates (Ps). The second stage is the carry chain building stage, which gives the carry signals for all bit positions. The third stage is the sum generation stage, responsible for the output sums of the adder. Among these stages, the second usually contains many sub-stages and corresponds to the most expensive part of the CLA architecture. For this reason, it is frequently emphasized in the literature.
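The three stages can be illustrated with a small behavioral model. The sketch below is only an illustration in Python, not the circuit of this paper: the function name `cla_add` and the zero-based bit indexing are choices made here, the generate and propagate signals are assumed to be the AND and the exclusive-OR of the input bits, and the carry chain is evaluated sequentially, whereas a real lookahead circuit computes the carries in parallel.

```python
def cla_add(a: int, b: int, c0: int = 0, width: int = 32):
    """Behavioral model of the three CLA stages (bits are indexed from 0 here)."""
    # Stage 1: bit-level generate (AND) and propagate (XOR) signals.
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Stage 2: carry chain. c[i] is the carry entering bit i, c[0] is the input carry.
    c = [c0 & 1]
    for i in range(width):
        c.append(g[i] | (p[i] & c[i]))

    # Stage 3: each sum bit is the propagate signal XOR the incoming carry.
    s = 0
    for i in range(width):
        s |= (p[i] ^ c[i]) << i

    return s, c[width]  # (sum modulo 2**width, carry out)
```

For instance, `cla_add(0xFFFFFFFF, 1)` returns `(0, 1)`: the carry generated at bit 0 is propagated through all remaining positions.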
Figure 2. Multi-output Manchester carry chain used in MODL and MODCVS.
Figure 1. Block diagram of carry lookahead adder.

In a CLA, if $C_{i-1}$ is the input carry of the $i$-th stage, and $A_i$ and $B_i$ are the $i$-th bits of the input data, then the output carry $C_i$ can be expressed as:

$$C_i = G_i + P_i \cdot C_{i-1} \tag{1}$$

where the propagate ($P_i$) and generate ($G_i$) signals are obtained as follows:

$$G_i = A_i \cdot B_i \quad \text{and} \quad P_i = A_i \oplus B_i \tag{2}$$

Expanding this yields:

$$C_i = G_i + P_i \cdot G_{i-1} + \dots + P_i \cdot P_{i-1} \cdots P_1 \cdot C_0 \tag{3}$$

The sum is generated by:

$$S_i = C_{i-1} \oplus A_i \oplus B_i = C_{i-1} \oplus P_i \tag{4}$$

In practice, the number of lookahead stages is limited to four and the term $C_4$ is expressed as:

$$C_4 = G_4 + P_4 \cdot \{G_3 + P_3 \cdot [G_2 + P_2 \cdot (G_1 + P_1 \cdot C_0)]\} \tag{5}$$

This equation shows the recursive relationship among $C_4$, $C_3$, $C_2$ and $C_1$, and it can be efficiently implemented through a recurrent logic structure, such as the multi-output Manchester carry chain used in the MODL and MODCVS logic styles, depicted in Fig. 2 [6][9]. Although the propagate ($P_i$) signal in equation (2) results from an exclusive-OR function, it can logically be implemented through a simple OR gate ($P_i = A_i + B_i$), which also gives the correct logic value in expression (1). However, in the multi-output Manchester chain, the generate and propagate terms need to be mutually exclusive, in the sense that both cannot be '1' simultaneously. This is due to the presence of sneak paths, i.e., current flowing in the wrong direction through a transistor, causing false discharge. For instance, for input vector

3. MULTIPLE OUTPUT ECDL STRUCTURE
Similarly to the SSDL structure and distinctly from the most popular DCVS one, the ECDL has no theoretical limitation on the number of series transistors in the logic network. This is because, during the evaluation phase, one of the pre-charged output voltages is not discharged through the network, but by the latch structure. As discussed before, a low differential voltage is enough to set the latch. In practice, when the number of transistors in series is too large (more than 10), the voltage difference created is not large enough to guarantee the latch transition to the right side. This would lead to a random decision and, consequently, to a wrong result. This might also occur when the switched parasitic capacitances (internal and external) are very different on the direct and complementary outputs. In terms of performance, DCVS presents a better power-delay product than ECDL for logic functions with few series transistors in the logic tree [10].

In addition, ECDL is fully static. This means that there is no minimum clocking frequency requirement. As long as the enable state is maintained, the outputs remain unchanged. In the DCVS approach, this stability is obtained using an additional weak feedback PMOS transistor in the output nodes, which has a negative effect on circuit delay. Moreover, like other differential logic families, ECDL provides the ternary signaling (or dual-rail code) adopted by self-timed circuits for signal processing completion detection. These features result in a speed improvement of micropipelines [11].
On the other hand, the ECDL gate requires its inputs to be ready before the evaluation phase starts. If this condition is not respected, the output latch will not be able to settle to the correct logic value. This means that the ECDL structure is suitable for the speed-independent category of asynchronous circuits, but not for the delay-insensitive one.
The proposed multi-output ECDL style presents the same features as single-output ECDL gates, such as the low dependence on the number of series transistors in the logic tree and the latched outputs. In addition, the pre-discharging of the internal nodes used as output signals significantly reduces charge-sharing problems. The performance gains of MOECDL gates are expected to become more evident as the complexity of the logic tree and the number of recursive functions increase.
It is important to observe that in the ECDL CLA implementation the clock signal of one stage must be provided by the previous stage. This guarantees that the related inputs are available when the evaluation phase starts. The generation of the intermediate clock signals is discussed by Lu in [11].
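The idea of deriving one stage's clock from the previous stage can be pictured with a dual-rail completion check. The following Python sketch is an illustration under stated assumptions, not the clock generation scheme of [11]: it assumes that both rails of an output are low in the disable (pre-discharged) state and that exactly one rail is high after evaluation, and the names `RESET`, `evaluation_done` and `next_stage_enable` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Rail = Tuple[int, int]  # (direct output, complementary output)

RESET: Rail = (0, 0)    # assumed disable state: both outputs pre-discharged


def encode(bit: int) -> Rail:
    """Dual-rail code of a valid logic value."""
    return (1, 0) if bit else (0, 1)


def decode(rail: Rail) -> Optional[int]:
    """Return the logic value, or None while the gate is still in reset."""
    if rail == RESET:
        return None
    if rail == (1, 1):
        raise ValueError("both rails high is not a valid dual-rail code")
    return rail[0]


def evaluation_done(outputs: Sequence[Rail]) -> bool:
    """Completion detection: every output pair has exactly one rail high."""
    return all((q ^ qb) == 1 for q, qb in outputs)


def next_stage_enable(previous_outputs: Sequence[Rail]) -> int:
    """Hypothetical enable for the following stage, raised only once the
    previous stage has finished evaluating all of its outputs."""
    return int(evaluation_done(previous_outputs))
```

With this model, a stage whose outputs are `[encode(1), RESET]` keeps the next stage disabled, while `[encode(1), encode(0)]` enables it.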
group propagate (Pij), group generate (Gij) and group kill (Kij), and the C4 signal generation using single-output complex gates;
calculation of the intermediate carries C8, C12 and C16, and the group propagate/generate related to the 17th bit level (P17,29/G17,29, P17,26/G17,26 and P17,23/G17,23) using two three-output MOECDL gates;
sum blocks including the exclusive-OR functions for sum generation, as presented in Equation (4), in the multi-output structure used to provide the required intermediate carries.
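The role of the group signals can be illustrated with a generic two-level lookahead model. The Python sketch below is a simplification under stated assumptions: it does not reproduce the gate partitioning of this adder or the organization of [7], it omits the kill signals, and the helper names `group_gp` and `block_carries` are hypothetical. It only shows how a group generate/propagate pair lets the carry at a block boundary be obtained from the carry entering the block.

```python
def group_gp(g, p):
    """Combine bit-level (g, p) lists, LSB first, into one group (G, P) pair."""
    G, P = 0, 1
    for gi, pi in zip(g, p):
        G = gi | (pi & G)  # carry generated inside the group up to this bit
        P = P & pi         # the group propagates only if every bit propagates
    return G, P


def block_carries(a: int, b: int, c0: int = 0, width: int = 32, block: int = 4):
    """Return {bit position: carry entering that position} at block boundaries."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    carries = {0: c0 & 1}
    c = c0 & 1
    for lo in range(0, width, block):
        G, P = group_gp(g[lo:lo + block], p[lo:lo + block])
        c = G | (P & c)    # same form as the bit-level carry recurrence
        carries[lo + block] = c
    return carries
```

For example, `block_carries(0xF, 0x1)` gives a carry of 1 at position 4 and 0 at position 8.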
Figure 3. ECDL basic structure.
4. MOECDL CLA ADDER
Multiple-output logic structures are very suitable for logic functions with a recursive relationship, due to their recurrent property. A representative example of recursive expressions is observed in the carry lookahead (CLA) adder algorithm. For this reason, the CLA adder has been extensively investigated for multiple-output structure application [5]-[9].
Figure 4. Single-gate 4-bit MOECDL CLA carry unit.
Initially, a single-gate 4-bit MOECDL CLA carry unit is proposed, as illustrated in Fig. 4. As mentioned before, sneak paths must be avoided to guarantee correct functionality. This is obtained through a complementary Manchester carry chain. The direct logic tree is built using propagate and kill signals, while propagate and generate signals are used in the complementary tree. The kill signal is given by a NOR of the Ai and Bi inputs. Moreover, the generate, propagate and kill signals have to be mutually exclusive. This rule is respected by implementing the propagate signals as an exclusive-OR logic function.
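The mutual-exclusion rule can be checked exhaustively for one bit position. The Python sketch below is a logic-level illustration only (it says nothing about the transistor-level sneak paths themselves), and the helper names `gpk`, `overlapping_inputs` and `carry_out` are hypothetical. It compares the exclusive-OR propagate with an OR-based propagate: both yield the same carry, but only the exclusive-OR form keeps generate, propagate and kill one-hot.

```python
from itertools import product


def gpk(a: int, b: int, xor_propagate: bool = True):
    """Bit-level generate, propagate and kill signals for input bits a and b."""
    g = a & b
    p = (a ^ b) if xor_propagate else (a | b)
    k = 1 - (a | b)  # NOR of the inputs
    return g, p, k


def overlapping_inputs(xor_propagate: bool):
    """Input pairs for which more than one of g, p, k is '1'."""
    return [(a, b) for a, b in product((0, 1), repeat=2)
            if sum(gpk(a, b, xor_propagate)) > 1]


def carry_out(a: int, b: int, c_in: int, xor_propagate: bool) -> int:
    """Carry recurrence C = G + P.C_in for a single bit position."""
    g, p, _ = gpk(a, b, xor_propagate)
    return g | (p & c_in)
```

Here `overlapping_inputs(True)` is empty, while `overlapping_inputs(False)` returns `[(1, 1)]`: with an OR-based propagate, generate and propagate are both '1' when Ai = Bi = 1.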
Figure 5. Single-gate MOECDL CLA carry-sum circuit.
Since the dual Manchester structure provides both direct and complementary intermediate carries, the sum stage can be easily included in the single-gate 4-bit carry unit, as suggested in Fig. 5. This reduces the number of gates, and consequently of output latches, leading to a reduction of the total power dissipation.
The 32-bit CLA organization is based on the version proposed in [7]. The block diagram is illustrated in Fig. 6. Four parts can be identified in the circuit:
individual propagate (Pi), generate (Gi) and kill (Ki) calculations;

5. RESULTS AND ANALYSIS
The 32-bit MOECDL CLA circuit proposed in this work was designed in a standard 3-metal 0.5 μm CMOS process provided by the MOSIS prototyping service [12]. In order to evaluate the expected gains of this enhanced adder, a DCVS CLA version was also implemented as the performance reference circuit for design comparison. The DCVS logic approach considers the use of single-output domino gates when complementary output signal generation is not required, as occurs in the first and second stages presented in Fig. 6. Moreover, the third and fourth stages of the CLA circuit exploit the benefits of the MODCVS structure, as proposed by Ruiz [8][9]. Furthermore, since self-timed circuits are the target applications, a weak PMOS transistor is applied in feedback configuration in order to compensate for parasitic discharging of internal nodes due to leakage currents [13].

Figure 6. Block diagram of the 32-bit CLA adder.

The multi-output 4-bit CLA complex gate was initially evaluated, and the electrical simulation results are summarized in Tables 1 and 2. The analysis has been done by applying a fanout equivalent to 3 for each output. The delay propagation and power dissipation savings of the ECDL circuit are clearly verified in these tables. Moreover, it is interesting to observe that during the reset or pre-discharging phase of the ECDL gate there is no current flowing from the power source (power consumption is approximately zero). This is because the connection to Vdd is cut during the pre-discharge (with a=1), as may be seen in the basic ECDL structure presented in Fig. 3.

Table 1.

| | MODCVS | MOECDL |
|---|---|---|
| Delay propagation, evaluation phase (td-lh) | 988 ps | 355 ps |
| Delay propagation, pre-(dis)charge phase (td-hl) | 399 ps | 136 ps |
| Power dissipation (@50 MHz), evaluation phase (Pd-eval) | 225 μW | 425 μW |
| Power dissipation (@50 MHz), pre-(dis)charge phase (Pd-reset) | | |
| Figure of merit td-lh x Pd-eval | 222 fJ | 151 fJ |
| Figure of merit td-hl x Pd-reset | 109 fJ | |
| Transistor count | 50 | 58 |

Table 2. Simulation results of 4-bit CLA part, including carry unit and sum generation.

| | MODCVS | MOECDL |
|---|---|---|
| Delay propagation, evaluation phase (td-lh) | 1.24 ns | 0.68 ns |
| Delay propagation, pre-(dis)charge phase (td-hl) | | |
| Power dissipation (@50 MHz), evaluation phase (Pd-eval) | 300 μW | 725 μW |
| Power dissipation (@50 MHz), pre-(dis)charge phase (Pd-reset) | | |
| Figure of merit td-lh x Pd-eval | 372 fJ | 493 fJ |
| Figure of merit td-hl x Pd-reset | 175 fJ | |
| Transistor count | 66 | 74 |

The simulation results obtained for the 32-bit CLA are presented in Table 3. The output signals, i.e., the sums and the most significant carry-out signal, were loaded with a fanout equal to 3. The significant gains with single gates, presented in Tables 1 and 2, were not verified in the same proportion. This is due to the fact that intermediate clock signals must be generated between the adder stages, resulting in a speed reduction. However, the results show significant performance enhancements in terms of propagation delay, with small penalties in transistor count and power consumption.

Table 3.

| | MODCVS | MOECDL |
|---|---|---|
| Delay propagation - worst case | 3.7 ns | 2.6 ns |
| Power dissipation (@50 MHz) | 17 mW | 20 mW |
| Delay x dissipation | 63 pJ | 52 pJ |
| Transistor count | 2005 | 2245 |

Self-timed circuits are being prepared to include such 32-bit CLA approaches, in order to obtain the actual average signal propagation and power consumption.

7. ACKNOWLEDGMENTS
This project is supported by the CNPq and FAPERGS Brazilian agencies.

REFERENCES
[1] K. M. Chu, and D. Pulfrey, Design procedures for differential cascode voltage switch circuits, IEEE Journal of Solid-State Circuits, vol. SC-21, no. 6, Dec. 1986, pp. 1082-1087.
[2] D. Somasekhar, and K. Roy, Differential current switch logic: a low power DCVS logic family, IEEE Journal of Solid-State Circuits, vol. 31, no. 7, July 1996, pp. 981-991.
[3] T. A. Grotjohn, and B. Hoefflinger, Sample-set differential logic (SSDL) for complex high-speed VLSI, IEEE Journal of Solid-State Circuits, vol. SC-21, no. 2, Apr. 1986, pp. 367-369.
[4] S.-L. Lu, Implementation of iterative networks with CMOS differential logic, IEEE Journal of Solid-State Circuits, vol. 23, no. 4, Aug. 1988, pp. 1013-1017.
[5] Y.-T. Lee, et al. Design of compact static CMOS carry look-ahead adder using recursive output property, Electronic Letters, vol. 29, no. 9, Apr. 1993, pp. 794-796.
[6] I. S. Hwang, and A. Fisher, Ultrafast 32-bit CMOS adders in multiple-output domino logic, IEEE Journal of Solid-State Circuits, vol. 24, no. 2, Apr. 1989, pp. 358-369.
[7] Z. Wang, et al. Fast adders using enhanced multiple-output domino logic, IEEE Journal of Solid-State Circuits, vol. 32, no. 2, Feb. 1997, pp. 206-214.
[8] G. A. Ruiz, Compact four bit carry look-ahead CMOS adder in multi-output DCVS logic, Electronic Letters, vol. 32, no. 17, Aug. 1996, pp. 1556-1557.
[9] G. A. Ruiz, Evaluation of three 32-bit CMOS adders in DCVS logic for self-timed circuits, IEEE Journal of Solid-State Circuits, vol. 33, no. 4, Apr. 1998, pp. 604-613.
[10] P. Ng, P. T. Balsara, and D. Steiss, Performance of CMOS differential circuits, IEEE Journal of Solid-State Circuits, vol. 31, no. 6, June 1996, pp. 841-846.
[11] S.-L. Lu, Implementation of micropipelines in enable/disable CMOS differential logic, IEEE Trans. VLSI System, vol. 3, no. 2, June 1995, pp. 338-341.
[12] MOSIS Service. WEB site: http://www.mosis.org
[13] M. Renaudin, U. El Hassan, and A. Guyot, A new asynchronous pipeline scheme: application to the design of a self-timed ring divider, IEEE Journal of Solid-State Circuits, vol. 31, no. 7, July 1996, pp. 1001-1013.
[14] R. P. Brent and H. T. Kung, A regular layout for parallel adder, IEEE Trans. Computer, vol. C-31, no. 3, Mar. 1983.
[15] P. M. Kogge and H. S. Stone, A parallel algorithm for the efficient solution of a general class recurrence equations, IEEE Trans. Computer, no. 8, Aug. 1973.