IMPLEMENTATION OF BIT-SERIAL ADDERS USING ROBUST DIFFERENTIAL LOGIC

Magnus Karlsson, Mark Vesterbacka, Lars Wanhammar
Department of Electrical Engineering, Linköping University, Sweden
Telephone: +46 (0) 13-28 40 59, Fax: +46 (0) 13-13 92 82

In this paper two bit-serial carry-save adders are implemented using a recently proposed differential logic style. The clocking scheme uses a single clock phase with non-precharged stages of logic that may be merged with the latches or the flip-flops. A novel flip-flop structure is used in one of the adders, which significantly lowers the number of clocked transistors. Because all logic nets are purely NMOS, the logic style suits high-speed and low-power operation in both bit-serial and bit-parallel implementations. The logic style is also robust against slow clock slopes and yields a data noise margin equal to Vdd/2. The adders reached a maximum clock frequency of 300 MHz in a 0.8 µm process with a 3.0 V supply voltage.

1. INTRODUCTION
In recent years, several differential logic styles have been proposed. In [1] the circuits use only NMOS transistors in the logic trees but require precharging and a large number of clocked transistors (four per latch). In [2] the circuit is precharged and combined with a set/reset NAND pair, and the number of clocked transistors is large. A similar circuit is proposed in [3]; it uses precharging, both PMOS and NMOS transistors in the logic nets, and a large number of clocked transistors. The recently proposed Single-Transistor-Clocked (STC) latches [4] are interesting because they are not precharged and have a minimal clock load of a single clocked transistor per latch. The latches use a novel flip-flop concept: the non-transparent input state of the N latch shown in Fig. 1a is exploited, and the outputs of the P latch in Fig. 1b are allowed to go low in its latched state. This leaves only one low-to-high transition in the evaluation phase, which results in high speed. However, it is difficult to incorporate logic into these latches because of charge sharing, although they serve well as flip-flops. Charge sharing in the STC N latch arises from node A, which is common to the two NMOS branches: a charged internal node in one branch can discharge into the other branch through node A above the clocked transistor. The STC P latch shown in Fig. 1b has the same problem if the two branches of logic share transistors, e.g., a simplified XOR net that shares nodes between both NMOS branches. When a charged internal node of the logic nets is connected to an output node, charge sharing may ruin the low output state. Such charge sharing can cause static power dissipation in the next logic stage, which makes the circuit unsuitable for low-power implementations and unreliable because of possible faults in the logic evaluation. Another problem with the STC latches, also caused by the shared node A, has been pointed out by Blair [5].
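The charge-sharing failure described above can be estimated to first order with charge conservation between two isolated capacitive nodes. This is a sketch only: the capacitances, the internal-node voltage (taken as Vdd minus an NMOS threshold) and the 0.8 V threshold of the next stage are hypothetical values chosen for illustration, not figures from this paper.

```python
VDD = 3.0     # supply voltage used in the paper
VT_N = 0.8    # hypothetical NMOS threshold voltage of the next logic stage

def charge_shared_voltage(v_out, c_out, v_int, c_int):
    """Voltage after an output node and an internal node, initially isolated,
    are connected through a conducting transistor (ideal charge conservation,
    no other current paths)."""
    return (c_out * v_out + c_int * v_int) / (c_out + c_int)

# Hypothetical values: a 20 fF output node at 0 V is connected to a 15 fF
# internal node that is still charged to VDD - VT_N.
v_low = charge_shared_voltage(v_out=0.0, c_out=20e-15, v_int=VDD - VT_N, c_int=15e-15)
print(f"disturbed low level: {v_low:.2f} V")
if v_low > VT_N:
    # The next stage's NMOS transistors begin to conduct: static current and
    # a risk of faulty evaluation, as described in the text.
    print("low level exceeds the assumed NMOS threshold")
```

With these assumed numbers the nominally low output rises to about 0.94 V. The disturbance scales with C_int/(C_int + C_out), which is why letting internal nodes of one branch reach the other branch is harmful.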
In another recently proposed logic style, presented in [6], the bottleneck is the P latch, which has PMOS transistors in the logic as shown in Fig. 2a. Since the logic nets are connected to a cross-coupled NMOS transistor pair, this realization suffers from a severe ratio problem. To gain high speed, the PMOS transistors must be much larger than the pull-down NMOS transistors, which results in a large load and high power consumption even when the NMOS transistors are made weak and undersized. The large PMOS clock transistor also imposes a large clock load. The N latch shown in Fig. 2b is also ratio sensitive, but less severely, since its logic is implemented with NMOS transistors, which can easily be made stronger than the two cross-coupled pull-up PMOS transistors. Charge sharing is not a problem in these latches since there is no direct path from the internal nodes of the logic to the outputs. Precharging is not required in these latches because of the two cross-coupled pull-up or pull-down transistors.
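The ratio problem can be made concrete with the first-order square-law model I_D = (k'/2)(W/L)(V_GS - V_T)^2. For the PMOS logic to pull an output up against a fully-on cross-coupled NMOS transistor, the PMOS current must exceed the NMOS current at the switching instant. The sketch assumes that both devices are saturated with |V_GS| = Vdd, that a stack of n equal PMOS devices acts like one device with W/L divided by n, and uses hypothetical process values (k'_n/k'_p = 3, V_Tn = 0.8 V, |V_Tp| = 0.9 V); it ignores velocity saturation and body effect.

```python
def min_pmos_to_nmos_ratio(vdd, vtn, vtp_abs, kn_over_kp, pmos_stack=1):
    """Minimum (W/L)_p / (W/L)_n for which a PMOS pull-up path out-drives a
    fully-on NMOS pull-down (first-order square-law model, both saturated)."""
    overdrive_ratio = ((vdd - vtn) / (vdd - vtp_abs)) ** 2
    # n equal PMOS devices in series behave roughly like one device with W/L / n.
    return kn_over_kp * overdrive_ratio * pmos_stack

# Hypothetical 3.0 V process values, not taken from the paper.
for stack in (1, 2):
    r = min_pmos_to_nmos_ratio(vdd=3.0, vtn=0.8, vtp_abs=0.9, kn_over_kp=3.0, pmos_stack=stack)
    print(f"{stack} PMOS in series: (W/L)_p must exceed {r:.1f} x (W/L)_n")
```

Under these assumptions a single PMOS must be about 3.3 times wider (relative to its length) than the NMOS, and about 6.6 times for a two-transistor PMOS stack. This is only the bound for switching at all; high speed needs additional margin, which is what makes the load and the power consumption large.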

Fig. 1a. Single Transistor Clocked N latch. Fig. 1b. Single Transistor Clocked P latch.

Fig. 2a. Robust P latch with merged P logic. Fig. 2b. Robust N latch with merged N logic.

The restrictions on the clock slope are mild for these latches [6], i.e., the latching does not depend on the clock slope. When the clock makes a high-to-low transition for the N latch, or a low-to-high transition for the P latch, the data has already been latched by the cross-coupled transistor pair and is kept after the clock transition. This makes these latches robust against slow clock transitions and allows area and power consumption to be reduced by using a smaller clock driver. This logic style will be referred to as the P-N logic style. In the following, it is shown how a new P latch without the ratio problem can replace the P latch in the P-N logic style. The new P latch is further combined with the N latch to form a new differential flip-flop with a low transistor count. A bit-serial carry-save adder is used as the implementation example for both the original logic style and the new flip-flop structure.

2. THE DIFFERENTIAL LOGIC STYLE
The logic style used to implement the adders in this work aims at overcoming the bottlenecks in [1, 2, 3, 4, 6]. The logic and the latches should be merged without charge-sharing problems. Merging increases the speed significantly, and it decreases the power consumption because there are fewer switching nodes and less glitching. In bit-serial implementations, which are important in algorithm-specific DSP applications, most of the logic can be merged with the latches; in common bit-parallel implementations, parts of the logic can be merged. Only the faster NMOS transistors should be used to implement the logic. The NMOS transistors can be made small, which reduces the load and thereby the power consumption. It also allows more complex logic inside a single latch, so that most of the logic can be merged with the latches. Precharging should be avoided since it yields higher switching activity than non-precharged logic.

By using complementary outputs, no additional inverters are required, which decreases the number of gate delays and switching nodes, and fewer gates need to be designed (AND/NAND and OR/NOR are the same circuit with only the inputs or outputs switched). XOR functions, which are used extensively in full adders for arithmetic, clearly benefit from the availability of complementary input signals. The latches should be insensitive to the clock slope, so that lower power consumption can be obtained by using a smaller clock driver. Not only the number of clocked transistors but also their sizes should be minimized. The new logic style is a combination of the N latch shown in Fig. 2b and the novel P latch presented below. It is also possible to use a static variant of the N latch [6], where the cross-coupled PMOS transistors are replaced by two cross-coupled inverters. The static N latch and the novel P latch form a semi-static logic style, which is an important feature in low-power circuits, since the clock can then be idled in its low phase when the circuit is not used. The P latch may also be merged with the N latch, forming a new flip-flop that requires fewer transistors.
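Two claims in this section can be checked with a small functional model of dual-rail signals: that one AND/NAND circuit also provides OR/NOR by swapping rails, and that precharging increases switching activity. A signal is represented as a (true, complement) pair. The precharge model is an assumption for illustration: in precharged dual-rail logic one rail is discharged in every evaluation and must be recharged in the next precharge phase, whereas in non-precharged logic a rail rises only when the logic value changes.

```python
from typing import List, Tuple

DualRail = Tuple[int, int]   # (true rail, complement rail); exactly one is 1

def encode(x: int) -> DualRail:
    return (x, 1 - x)

def invert(s: DualRail) -> DualRail:
    # With complementary outputs an inverter is only a swap of the two rails.
    return (s[1], s[0])

def and_nand(a: DualRail, b: DualRail) -> DualRail:
    # Functional model of one differential gate: true rail = AND, complement = NAND.
    t = a[0] & b[0]
    return (t, 1 - t)

def or_nor(a: DualRail, b: DualRail) -> DualRail:
    # The same AND/NAND circuit with swapped input and output rails (De Morgan).
    return invert(and_nand(invert(a), invert(b)))

def rises_non_precharged(values: List[int]) -> int:
    # One rail rises only when the logic value changes between cycles.
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)

def rises_precharged(values: List[int]) -> int:
    # Assumed model: every cycle, the rail discharged in the previous evaluation
    # is precharged high again, whatever the data.
    return max(len(values) - 1, 0)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, or_nor(encode(x), encode(y)))

data = [0, 1, 1, 0, 0, 1]
print(rises_non_precharged(data), rises_precharged(data))   # 3 5
```

For uncorrelated random data the logic value changes in half of the cycles on average, so under these assumptions the non-precharged style has about half as many rail transitions as the precharged one.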

2.1. The Novel P Latch
The novel P latch is built on an ordinary Cascode Voltage Switch Logic (CVSL) gate, shown in Fig. 3a. The CVSL gate consists of two complementary NMOS switch networks connected to a pair of cross-coupled PMOS pull-up transistors. Because the cross-coupled PMOS transistors act as pull-ups, precharging is not needed. When the inputs switch, either node b or node b̄ is pulled low, and positive feedback through the cross-coupled PMOS pull-up transistors makes the gate switch. The order of input switching is important: both complementary inputs must first go low, since otherwise both b and b̄ are pulled low, which results in a short circuit. Only after that may one of the complementary inputs go high. The outputs of the N latch in Fig. 2b switch in this order. The logic trees may be further minimized from the full differential form; e.g., a two-input XOR gate can be reduced from 8 transistors in the complementary NMOS nets of the full differential form to only 6 NMOS transistors. To form a P latch, which latches when the clock is high and evaluates when the clock is low, two clocked PMOS transistors are added to the CVSL gate as shown in Fig. 3b. These clocked PMOS transistors prevent the outputs Q and Q̄ from switching low-to-high before the clock φ switches high-to-low. This also solves the charge-sharing problem, since a low-to-high transition cannot occur at the outputs before the clock goes low. Charged internal nodes in the NMOS nets even make the transition faster, as they discharge to the cross-coupled PMOS transistors. The restrictions on the clock slope for the P latch become mild, since the latching works in the same fashion as for the N latch. A problem with the gate in Fig. 3b is the threshold-voltage (Vt) loss during pull-down. This problem is solved by adding two NMOS transistors as shown in Fig. 4a.
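The shared two-input XOR net and the input-ordering rule can both be checked with a simple switch-level model in which each NMOS transistor is an ideal switch that conducts when its gate signal is 1. The netlist below is one common 6-transistor shared topology (gate signals a, b and complements an, bn; internal nodes x and y). It is an assumption for illustration, and the exact net in the paper's figures may differ. Node out_b is pulled low for a·b + ā·b̄, so it carries XOR(a, b), while out_bn carries XNOR(a, b).

```python
from itertools import product

# Hypothetical 6-transistor shared XOR/XNOR pull-down net: (gate, node, node).
XOR2_SHARED = [
    ("a",  "gnd", "x"),
    ("an", "gnd", "y"),
    ("b",  "x",   "out_b"),    # a AND b     pulls out_b low
    ("bn", "y",   "out_b"),    # a' AND b'   pulls out_b low
    ("bn", "x",   "out_bn"),   # a AND b'    pulls out_bn low
    ("b",  "y",   "out_bn"),   # a' AND b    pulls out_bn low
]

def pulled_low(net, inputs, node):
    """True if `node` has a path of conducting switches to gnd."""
    reached = {"gnd"}
    changed = True
    while changed:
        changed = False
        for gate, n1, n2 in net:
            if inputs[gate] and (n1 in reached) != (n2 in reached):
                reached.update((n1, n2))
                changed = True
    return node in reached

# Valid dual-rail inputs: exactly one output node is pulled low.
for a, b in product((0, 1), repeat=2):
    ins = {"a": a, "an": 1 - a, "b": b, "bn": 1 - b}
    print(a, b, "out_b low:", pulled_low(XOR2_SHARED, ins, "out_b"),
          "out_bn low:", pulled_low(XOR2_SHARED, ins, "out_bn"))

# Wrong ordering: a rises before an has fallen, so both rails are briefly 1.
overlap = {"a": 1, "an": 1, "b": 1, "bn": 0}
print("both outputs low during overlap:",
      pulled_low(XOR2_SHARED, overlap, "out_b") and pulled_low(XOR2_SHARED, overlap, "out_bn"))
```

If instead both complementary inputs first go low, neither output node has a path to ground and the cross-coupled PMOS pair simply holds the previous state until the new rail rises.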

Fig. 3a. Cascode Voltage Switch Logic (CVSL) gate. Fig. 3b. DCVSL gate with clocked PMOS transistors.

When the inputs cause the CVSL gate to switch, the clocked PMOS transistors prevent output Q or Q̄ from switching low-to-high, while the added NMOS transistors make one output switch high-to-low, i.e., the non-transparent input state of the following N latch is used [4]. Hence, the preceding N latch must be fast enough to have pulled down the output node before the input is turned off. This is easily achieved by designing both the P and the N latch with equal rise and fall times. The complete P latch in Fig. 4a is redrawn in Fig. 4b to show the novel concept of connecting NMOS logic nets between the PMOS transistors. In the novel P latch the ratio problem is much less severe, since the logic is in NMOS and the pull-up PMOS transistors can be kept at minimum size. The two clocked transistors can also be kept at minimum size, yielding a small clock load. In fact, all transistors in the latch can be minimum size; only the logic net must be sized if stacked NMOS transistors are required. Simulations show that the new logic style can be combined with the STC latches with one restriction: the novel P latch does not work well with an STC N latch as its load, because of the coupling (gate-source) capacitance between the gate and the common node A of the STC N latch. When the clock switches low-to-high, node A is discharged to ground, and the gate-source capacitance makes the output of the novel P latch follow node A and drop. This effect has also been observed in simulations of a flip-flop constructed from the STC latches [5].
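The output droop caused by an STC N latch load can be estimated, to first order, as a capacitive divider. Assume that the P latch output node is not actively driven while node A falls by ΔV_A, that C_gs is the gate-source capacitance of the STC N latch input transistor, and that C_out is all other capacitance on the output node (symbols introduced here for illustration). Charge conservation on the output node, C_gs(ΔV_out - ΔV_A) + C_out·ΔV_out = 0, then gives:

```latex
\Delta V_{\mathrm{out}} \approx \Delta V_A \,\frac{C_{gs}}{C_{gs} + C_{\mathrm{out}}}
```

The droop therefore grows as C_gs becomes a larger fraction of the total output capacitance, which matches the observation that the output follows node A downward.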

Fig. 4a. The novel robust P latch with merged N logic. Fig. 4b. Redrawn P latch.

2.2. The Novel Flip-Flop
The novel flip-flop shown in Fig. 5 is constructed by adding two clocked NMOS transistors to the novel P latch in Fig. 4a. The flip-flop is negative-edge-triggered, and the logic nets are merged with the flip-flop in the same manner as with the latches. The construction can also be viewed as a merger of the P and the N latch that simply share the cross-coupled PMOS transistors. The resulting flip-flop is similar to the flip-flop in [4], but that flip-flop suffers from the same problems as the STC N latch. With the novel flip-flop these problems are removed.
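At the behavioral level, a negative-edge-triggered flip-flop updates its output only at high-to-low clock transitions. The sketch below models only this external behavior on sampled 0/1 waveforms, not the transistor-level circuit of Fig. 5; taking the data value sampled just before the edge is an assumption that stands in for the setup requirement.

```python
from typing import List

def negative_edge_ff(clock: List[int], data: List[int], q0: int = 0) -> List[int]:
    """Output of an ideal negative-edge-triggered D flip-flop, one value per sample.
    q changes only when the clock goes from 1 to 0."""
    q = q0
    out = [q]
    for i in range(1, len(clock)):
        if clock[i - 1] == 1 and clock[i] == 0:
            q = data[i - 1]   # assumed: the data value just before the falling edge
        out.append(q)
    return out

clock = [0, 1, 1, 0, 0, 1, 1, 0]
data = [1, 1, 1, 1, 0, 0, 0, 0]
print(negative_edge_ff(clock, data))   # [0, 0, 0, 1, 1, 1, 1, 0]
```

The data change at the fifth sample happens while the clock is low, so it reaches the output only at the next falling edge.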

Fig. 5. Novel flip-flop with merged N logic.

3. ROBUSTNESS
Noise appears at both the data and the clock inputs, giving two different noise margins: the data noise margin and the clock noise margin. Assuming a supply voltage of 3.0 V, the noise margins at the data inputs are 2.0 V and 1.8 V for the P latch and the N latch, respectively. These are larger than Vdd/2 because of hysteresis, which shows that the NMOS nets are somewhat weak. The clock noise margins are 0.85 V and 1.8 V for the P latch and the N latch, respectively. The clock noise margin of the N latch is similar to its data noise margin, since the clocked transistors compete with the pull-up PMOS transistors in the same way as the logic does. The clock noise margin of the P latch equals Vt, because the clock transistor only works as a switch in this case. Overall, the data noise margin is significantly improved compared to dynamic single-rail logic. For the novel flip-flop, the data noise margin equals that of the N latch, since the input stage functions in the same way. Its clock noise margin equals Vt when the clock is high and equals the data noise margin when the clock is low.

To examine the clock-slope sensitivity, a divide-by-two counter was implemented. The counter consists of one P latch and one N latch in series, with the outputs of the N latch fed back to the P latch inputs. The circuit was simulated in HSPICE with a triangular clock; the result is shown in Fig. 6, where the outputs of both the P and the N latch are plotted. When the clock slope is 25 ns, the output of the N latch deviates by 10 percent from the high level during a falling clock edge. This is caused by the output of the P latch rising before the N latch has finished its evaluation phase. The problem may be solved by making the P latch slower.
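The divide-by-two test circuit can be described with an ideal phase-level model in which the P latch is transparent while the clock is low and the N latch while it is high. For the circuit to toggle, the fed-back N latch outputs must reach the P latch inputs with their rails swapped (a free inversion in differential logic); this connection and the initial latch states are assumptions of the sketch. Such an ideal model cannot reproduce the deviation seen in Fig. 6, which depends on the timing of the P latch output relative to the N latch evaluation during a slow clock edge.

```python
from typing import List

def divide_by_two(clock: List[int], p_init: int = 0, n_init: int = 0) -> List[int]:
    """N latch output for a P latch (transparent when clock == 0) followed by an
    N latch (transparent when clock == 1), with the N latch output fed back
    inverted (assumed rail swap) to the P latch input. One entry per clock phase."""
    p_q, n_q = p_init, n_init
    n_out = []
    for clk in clock:
        if clk == 0:
            p_q = 1 - n_q      # P latch follows the inverted feedback; N latch holds
        else:
            n_q = p_q          # N latch follows the P latch; P latch holds
        n_out.append(n_q)
    return n_out

clock = [0, 1] * 4             # four clock periods, one entry per phase
print(divide_by_two(clock))    # [0, 1, 1, 0, 0, 1, 1, 0]
```

The output pattern repeats every four phases, i.e., every two clock periods, which is the intended frequency division.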

Fig. 6. Divide-by-two circuit.

4. BIT-SERIAL ADDER IMPLEMENTATIONS
A bit-serial carry-save adder is used as an example circuit to compare two structures: one consisting of the novel P latch and the N latch, and one using the novel flip-flop. The solution with separate P and N latches is shown in Fig. 7. It also uses an STC P latch in one place in order to reduce the number of clocked transistors. This bit-serial adder requires 9 clocked transistors and 38 non-clocked transistors, of which only 11 are PMOS. The highest number of stacked NMOS transistors is 3. Efficient realizations, in which transistors in the logic nets are shared, are used for the two-input XOR gates and the two-input multiplexer; only 6 transistors are required in the logic nets of each XOR gate and of the multiplexer. The carry loop is cleared before the start of a new computation by raising the clr input to the AND gate.
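The function of the bit-serial carry-save adder can be summarized with a behavioral model: operands arrive least significant bit first, one bit per clock cycle, and each cycle produces a sum bit (a three-input XOR) and a carry (the majority function) that is stored in a flip-flop and fed back. The sketch assumes that the clear acts as an AND gate forcing the fed-back carry to zero during the first bit of each word; the helper names are hypothetical.

```python
from typing import List

def bit_serial_add(a_bits: List[int], b_bits: List[int], clr_bits: List[int]) -> List[int]:
    """Sum bits of a bit-serial adder; inputs are LSB-first bit streams."""
    carry = 0                                   # state of the carry flip-flop
    sums = []
    for a, b, clr in zip(a_bits, b_bits, clr_bits):
        c = carry & (1 - clr)                   # assumed AND gate clearing the carry loop
        sums.append(a ^ b ^ c)                  # sum: 3-input XOR
        carry = (a & b) | (a & c) | (b & c)     # carry: majority, stored for the next bit
    return sums

def to_bits(x: int, width: int) -> List[int]:
    return [(x >> i) & 1 for i in range(width)]        # LSB first

def from_bits(bits: List[int]) -> int:
    return sum(bit << i for i, bit in enumerate(bits))

# Two back-to-back 4-bit words: 12 + 7 (overflows) followed by 1 + 1.
a = to_bits(12, 4) + to_bits(1, 4)
b = to_bits(7, 4) + to_bits(1, 4)
clr = [1, 0, 0, 0] * 2                          # clr raised on the first bit of each word
s = bit_serial_add(a, b, clr)
print(from_bits(s[:4]), from_bits(s[4:]))       # 3 2
```

Without the clear (clr held at 0), the carry left over from the first word would make the second result 3 instead of 2.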

Fig. 7. Bit-serial adder realized with separate latches (blocks: p-xor, n-xor, STC p-latch, nmux, p-and; inputs a, b, clr).

The bit-serial carry-save adder based on the novel flip-flop is shown in Fig. 8. This solution uses 8 clocked transistors and 30 non-clocked transistors, of which only 4 are PMOS. In this adder the highest number of stacked NMOS transistors is 4. An efficient XOR realization is also used here: a 3-input XOR gate is realized by sharing parts of its complementary nets and requires only 10 transistors in the logic net. The low number of stacked transistors in the carry net allows an AND function to be integrated with the carry function, so that the carry loop can be cleared before the start of a new computation. With this solution the number of switching output nodes is reduced to only 4, compared to 10 in the adder with separate latches.
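One way to see how sharing the complementary nets keeps a three-input XOR at 10 transistors is a parity-tree construction: after each input the net keeps one node for even and one for odd parity of the inputs seen so far, so the first input costs 2 transistors and every further input 4, giving 2 + 4(n - 1). Without any sharing, XOR and XNOR are each a sum of 2^(n-1) products of n literals, i.e., n·2^n transistors in total. This topology is an illustrative assumption that happens to reach the counts quoted in the text (6 versus 8 for two inputs, 10 for three); the paper does not state that its net is built this way. The sketch builds the net and checks it with an ideal switch model.

```python
from itertools import product

def shared_xor_net(n):
    """Hypothetical shared pull-down net for an n-input differential XOR.
    Transistors are (gate, node, node); inputs x1..xn, complements x1n..xnn.
    Node e<k> (o<k>) is pulled low when x1..xk has even (odd) parity, so the
    voltage on e<n> is XOR and the voltage on o<n> is XNOR."""
    net = [("x1", "gnd", "o1"), ("x1n", "gnd", "e1")]
    for k in range(2, n + 1):
        x, xn = f"x{k}", f"x{k}n"
        net += [
            (xn, f"e{k-1}", f"e{k}"), (x, f"e{k-1}", f"o{k}"),  # from even parity
            (x, f"o{k-1}", f"e{k}"), (xn, f"o{k-1}", f"o{k}"),  # from odd parity
        ]
    return net

def reachable(net, inputs):
    """Nodes connected to gnd through conducting switches."""
    reached = {"gnd"}
    changed = True
    while changed:
        changed = False
        for gate, n1, n2 in net:
            if inputs[gate] and (n1 in reached) != (n2 in reached):
                reached.update((n1, n2))
                changed = True
    return reached

def unshared_count(n):
    # XOR and XNOR each as 2**(n-1) series stacks of n transistors.
    return n * 2 ** n

def check(n):
    """Verify the net for all input combinations and return its transistor count."""
    net = shared_xor_net(n)
    for bits in product((0, 1), repeat=n):
        inputs = {}
        for i, v in enumerate(bits, start=1):
            inputs[f"x{i}"], inputs[f"x{i}n"] = v, 1 - v
        r = reachable(net, inputs)
        odd = sum(bits) % 2 == 1
        # Exactly one of the two top nodes is pulled low, selected by parity.
        assert (f"o{n}" in r) == odd and (f"e{n}" in r) == (not odd)
    return len(net)

for n in (2, 3):
    print(n, "inputs:", check(n), "shared vs", unshared_count(n), "unshared transistors")
```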

Fig. 8. Bit-serial adder using the novel flip-flop (blocks: carry, sum; inputs a, b, clr).

5. COMPARISON
A power-delay product (PDP) comparison of the adders in Fig. 7 and Fig. 8 was performed using simulations on layouts in a 0.8 µm AMS process. HSPICE [7] was used for the simulations. Table 1 shows the results together with the area, the power consumption, and the maximum clock frequency. The adder with the new flip-flop structure requires 30 percent less area than the adder with separate latches. This reduction is caused by the smaller number of PMOS transistors in the flip-flop structure and the smaller number of gates, which reduces the internal routing. The maximum clock frequency is 300 MHz for both implementations, although it is limited by different effects: in the adder using the flip-flop structure by the larger number of stacked transistors in the logic nets, and in the adder with separate latches by the larger number of cascaded gates. The power consumption of the clock driver is reduced by 10 percent in the adder with the flip-flop structure, owing to the reduced number of clocked transistors. The power consumption of the logic is reduced by 37 percent because of the reduced number of switching output nodes and the smaller internal routing. The power-delay product is thereby reduced by about 30 percent for the adder with the new flip-flop structure.

Table 1. Comparison of the bit-serial adders.

Bit-serial adder    Area [µm²]   Fmax [MHz]   Power, clock [mW]   Power, logic [mW]   PDP [pJ]
Separate latches    3700         300          0.31                0.57                1.5
Flip-flop           2600         300          0.28                0.36                1.1
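The relative savings quoted in the text can be recomputed from Table 1. The short script below does only that arithmetic; the dictionary keys are labels chosen for the sketch.

```python
# Values transcribed from Table 1: (separate latches, flip-flop).
TABLE_1 = {
    "area":        (3700, 2600),
    "clock power": (0.31, 0.28),
    "logic power": (0.57, 0.36),
    "PDP":         (1.5, 1.1),
}

def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

for name, (separate, flip_flop) in TABLE_1.items():
    print(f"{name:12s} {reduction_percent(separate, flip_flop):5.1f} % lower")

total_sep = TABLE_1["clock power"][0] + TABLE_1["logic power"][0]
total_ff = TABLE_1["clock power"][1] + TABLE_1["logic power"][1]
print(f"{'total power':12s} {reduction_percent(total_sep, total_ff):5.1f} % lower")
```

This gives about 29.7 % for the area, 9.7 % for the clock power, 36.8 % for the logic power, 27.3 % for the total power and 26.7 % for the PDP, in line with the rounded figures given in the text.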

6. CONCLUSION
It was shown how efficient bit-serial carry-save adders can be implemented using a recently proposed differential logic style. For the XOR gates and multiplexers used in the adders, parts of the complementary logic nets could be shared, yielding savings in transistor count. A new flip-flop was also presented and used in the realization of one of the adders. The layouts of the two adders were simulated using HSPICE and the results were compared. The adder that used the novel flip-flop structure required 30 percent less area and had about 30 percent lower power consumption than the adder realized with separate latches.

REFERENCES
[1] Hong-Yi Huang: True-Single-Phase All-N-Logic Differential Logic (TADL) for Very High Speed Complex VLSI, Proc. IEEE ISCAS-96, Vol. 4, pp. 296-299, Atlanta, USA, 1996.
[2] Huang C. G.: Implementation of true single-phase clock D flipflops, IEE Electronics Letters, Vol. 30, pp. 1373-1374, Aug. 1994.
[3] Renshaw D. and Choon How Lau: Race-free clocking of CMOS pipelines using a single global clock, IEEE J. Solid-State Circuits, Vol. SC-25, pp. 766-769, June 1990.
[4] Yuan J. and Svensson C.: New Single-Clock CMOS Latches and Flipflops with Improved Speed and Power Savings, IEEE J. Solid-State Circuits, Vol. 32, No. 1, pp. 62-69, Jan. 1997.
[5] Blair G. M.: Comment on new differential flipflops from Yuan and Svensson, IEE Electronics Letters, Vol. 32, No. 23, pp. 2125-2126, Nov. 1996.
[6] Afghahi M.: A Robust Single Phase Clocking for Low Power, High Speed VLSI Applications, IEEE J. Solid-State Circuits, Vol. SC-31, No. 2, pp. 247-254, Feb. 1996.
[7] Karlsson M., Vesterbacka M., and Wanhammar L.: A Robust Differential Logic Style with NMOS Logic Nets, Proc. IEE IWSSIP, pp. 61-64, Poland, May 1997.