IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS


Extended Sequential Logic for Synchronous Circuit Optimization and its Applications

Pramod Kumar Meher, Senior Member, IEEE

Abstract—In this paper, we present a new approach for extending the sequential logic functionality of a D flip-flop so that it performs an additional Boolean function simultaneously with its usual bit-storage function. We show that a combinational function of the form (a · b), (a + b), (a + b̄) or (ā · b), which occurs frequently in a feed-forward path with a D flip-flop, could be implemented efficiently by a D flip-flop with RESET or SET provision. Similarly, (a ⊕ b) or ((a · b) ⊕ c) in the feedback loop with a D flip-flop could be implemented by a T flip-flop by suitable modification of the clock. The use of such extended sequential logic is found to result in a significant reduction in critical-path and a saving in area-complexity over the direct implementation. Moreover, we present a simple approach for the construction of a CMOS T flip-flop by modification of the clock signal of a D flip-flop, which is found to be more efficient than the T flip-flop derived from a JK flip-flop. The extended sequential logic is used for the implementation of finite field multiplication over GF(2^m) and carry-save addition of real numbers. In both these cases, the use of extended logic is found to offer a substantial saving in area and time-complexity over the conventional implementations.

Index Terms—Sequential logic, combinational logic, digital arithmetic, finite field arithmetic, carry-save addition, logic optimization.

Manuscript submitted on July 29, 2008. Revised on October 25, 2008. The author is with the School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, E-mail: [email protected]. URL: http://www3.ntu.edu.sg/home/aspkmeher/ Copyright (c) 2008 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to [email protected].

I. INTRODUCTION

Synchronous digital circuits mostly consist of combinational components interspersed with sequential components, e.g., D flip-flops and T flip-flops. While D flip-flops are popularly used as input/output registers, pipelining latches, and feedback elements in synchronous systems, T flip-flops are used in counters and clock divider circuits. In this paper, we examine the possibilities and advantages of combining the functions of basic logic elements like AND, OR, and XOR gates with that of a D flip-flop. Let us name such an extension of the functionality of sequential elements to incorporate combinational Boolean logic as “extended sequential logic.” The extended logic could be of two basic types:

(i) the combinational logic in feed-forward path with the sequential logic; and
(ii) the combinational logic in feedback loop with the sequential logic.

The combinational logic in a feed-forward path with a sequential element like a D flip-flop usually serves the same function whether the combinational function precedes the sequential one or vice versa. The basic Boolean function in a feed-forward path could be (a · b), (a + b), (a ⊕ b), (a + b̄) or (ā · b). Except for (a ⊕ b), none of these functions appears to be practically relevant in the feedback loop with a D flip-flop, since the output gets stuck at a fixed state under particular input conditions. Some typical forms of extended sequential logic of practical relevance, which we have investigated and aim at presenting in this paper, are:

• (a · b) followed by D flip-flop
• (a + b) followed by D flip-flop
• (a + b̄) or (ā · b) followed by D flip-flop
• (a ⊕ b) followed by D flip-flop
• (a ⊕ b) in feedback loop with D flip-flop
• ((a · b) ⊕ c) in feedback loop with D flip-flop

where a, b and c are Boolean variables.

In a recent paper, we have shown that an XOR gate in the feedback loop with a D flip-flop could be replaced by a T flip-flop for reducing the critical-path and area-complexity in a finite field accumulator [1]. But we do not find any specific design for CMOS implementation of a T flip-flop, and the existing CAD tools do not provide a library cell for the T flip-flop. Conventionally, the T operation is performed either by a JK flip-flop or by a D flip-flop. The JK flip-flop-based implementation of a T flip-flop usually involves much higher area and more propagation delay, while the conventional derivation of a T flip-flop from a D flip-flop is not suitable for applications where the state is required to be toggled by a control input. Therefore, we present a different derivation of the T flip-flop from the CMOS implementation of a D flip-flop. The proposed construction could be used to design a library cell for a T flip-flop whose output toggles in each clock period when the control input is 1.

We show here that the extended sequential logic could be used not only for reducing the critical-path, but also for reducing the overall circuit area of a chip. To demonstrate its advantages, we have used it in carry-save addition of real numbers and in finite field multipliers over GF(2^m). The extended sequential logic is found to offer a substantial saving in area and time-complexities over the conventional implementations.

The rest of this paper is organized as follows. The proposed extension of sequential logic is discussed in Section II, and the CMOS construction of the T flip-flop is discussed in Section III. The complexity of implementation of extended sequential logic and its relative advantages over the conventional realizations are discussed in Section IV. The application of the proposed techniques in carry-save addition and multiplication in GF(2^m), along with the resulting advantages, is discussed in Section V. Conclusions are placed in Section VI.

State-transition-truth table of Fig. 1(b):

clock        a   b   NS
rising       0   0   0
rising       0   1   0
rising       1   0   0
rising       1   1   1
not rising   x   x   NC

Fig. 1. (a) Logic symbol of AND followed by D flip-flop. (b) State-transition-truth table. (c) Extended sequential logic implementation of an AND gate followed by a D flip-flop. NS and NC stand for “next state” and “no change”, respectively.

State-transition-truth table of Fig. 2(b):

clock        a   b   NS
rising       0   0   0
rising       0   1   1
rising       1   0   1
rising       1   1   1
not rising   x   x   NC

Fig. 2. (a) Logic symbol of OR followed by a D flip-flop. (b) The state-transition-truth table. (c) Extended sequential logic implementation of an OR gate followed by a D flip-flop.

State-transition-truth table of Fig. 3(b):

clock        a   b   NS
rising       0   0   1
rising       0   1   0
rising       1   0   1
rising       1   1   1
not rising   x   x   NC

Fig. 3. (a) Logic symbol of (c = a + b̄) followed by a D flip-flop. (b) The state-transition-truth table. (c), (d) Extended sequential logic implementations of (c = a + b̄) followed by a D flip-flop.

State-transition-truth table of Fig. 4(b):

clock        a   b   NS
rising       0   0   0
rising       0   1   1
rising       1   0   1
rising       1   1   0
not rising   x   x   NC

Fig. 4. (a) Logic symbol of XOR followed by a D flip-flop. (b) The state-transition-truth table. (c) Simplified realization of an XOR followed by a D flip-flop.

II. SEQUENTIAL LOGIC EXTENSION

We discuss here the proposed scheme for integrating a basic Boolean operation with a neighboring sequential element. The key idea is that when a combinational function of the form a∗b is followed by the storage of the output bit c = a∗b in a D flip-flop on a feed-forward path, the SET and RESET signals of the flip-flop could be utilized to perform some combinational functions. Synchronous SET and RESET should, however, preferably be used because they are simpler to implement and do not impose significant area/delay overhead. Besides, if an asynchronous SET/RESET is used and it arrives long after the clock, the output may appear at the next cycle. In most practical situations, the flip-flop is not required to be set or reset during normal operation. The set or reset is required mostly for register initialization, and could be performed by the initialization of inputs a and b of the combinational logic. Similarly, when (a ⊕ b) or ((a · b) ⊕ c) is in the feedback loop with a D flip-flop, the clock of the flip-flop could be utilized to obtain simpler realizations.

A. Feed-Forward Sequential Logic Extension

1) (a · b) Followed by a D flip-flop: The logic symbol of an AND operation (c = a · b) and subsequent storage of the output bit c by a D flip-flop is shown in Fig.1(a). A combination of the truth table of the AND gate and the state-transition table of a positive-edge-triggered D flip-flop is shown as a state-transition-truth table in Fig.1(b). The same function as that of an AND gate followed by a D flip-flop can be obtained by using a as the D input and b as the active-low synchronous RESET of a D flip-flop, as shown in Fig.1(c). It is easy to verify that the arrangement of Fig.1(c) has the same truth table as that in Fig.1(b).

2) (a + b) Followed by a D flip-flop: The logic symbol of c = (a + b) followed by storage of the output bit c in a D flip-flop is shown in Fig.2(a). The state-transition-truth table for this extended logic is shown in Fig.2(b). The same function as that of c = (a + b) followed by a D flip-flop can be performed when b is used as the active-high SET signal of the D flip-flop and a is fed as the D input, as shown in Fig.2(c).

3) (a + b̄) or (ā · b) Followed by a D flip-flop: The logic symbol of c = (a + b̄) followed by storage of the output bit c in a D flip-flop is shown in Fig.3(a). The corresponding state-transition-truth table is shown in Fig.3(b). The same function as that of (a + b̄) followed by a D flip-flop can be performed in two other ways, as shown in Fig.3(c) and (d). One of them is derived by feeding b as the active-low RESET and ā as the D input (Fig.3(c)). It can be found that the Q̄ output of the D flip-flop of Fig.3(c) is identical to the Q output of the D flip-flop of Fig.3(a). (ā · b) followed by a D flip-flop could be derived from the Q output of the same flip-flop. The other way to realize (a + b̄) or (ā · b) followed by a D flip-flop is shown in Fig.3(d), where a is fed as the D input and b is used as the active-low SET of the D flip-flop.

4) (a ⊕ b) Followed by a D flip-flop: The logic symbol of c = (a ⊕ b) followed by the storage of bit c in a D flip-flop is shown in Fig.4(a). The corresponding state-transition-truth table is shown in Fig.4(b). (a ⊕ b) cannot be incorporated within a D flip-flop alone for the extended logic implementation, but using a NAND gate and a NOR gate at the D input and RESET of the D flip-flop, it is possible to realize the same function as that of (a ⊕ b) followed by a D flip-flop, as shown in Fig.4(c). Here the NAND output, i.e., the complement of (a · b), is used as the D input, and the NOR output, i.e., the complement of (a + b), is used as the RESET.

Characteristic table of Fig. 5(b):

Q = a   b   Q+   remark
0       0   0    NC
1       0   1    NC
0       1   1    toggle
1       1   0    toggle

Fig. 5. (a) Logic symbol of an XOR gate in feedback loop with a D flip-flop. (b) Characteristic table. (Q: present state and Q+: next state)

Characteristic table of Fig. 6:

clock        T = b   Q   Q+   remark
rising       1       0   1    toggle
rising       1       1   0    toggle
rising       0       0   0    NC
rising       0       1   1    NC
not rising   x       x   x    NC

Fig. 6. Modification of D flip-flop for T operation using control input T = b.

B. Sequential Logic Extension with Feedback Loop

1) (a ⊕ b) in Feedback Loop with D flip-flop: The logic symbol of c = (a ⊕ b) in the feedback loop with a D flip-flop is shown in Fig.5(a). The characteristic table for this extended logic (shown in Fig.5(b)) is identical to that of a T flip-flop, where b is taken as the input of the T flip-flop.
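The equivalences of Section II-A, and the toggle behaviour of the XOR feedback loop, can be checked exhaustively with a small behavioural model. The following Python sketch is illustrative only: the function names `dff_edge`, `check_feed_forward` and `xor_feedback_next` are hypothetical, the model captures only the next state on a rising clock edge, and the Fig. 3(c) check assumes that the D input is the complement of a.

```python
from itertools import product


def dff_edge(d, reset_n=1, set_hi=0, set_n=1, reset_hi=0):
    """Next state Q+ of a D flip-flop on a rising clock edge.

    reset_n / set_n are active-low and reset_hi / set_hi are active-high
    synchronous controls (hypothetical parameter names). At most one
    control is asserted in the cases checked below.
    """
    if reset_n == 0 or reset_hi == 1:
        return 0
    if set_n == 0 or set_hi == 1:
        return 1
    return d


def check_feed_forward():
    """Compare each extended-logic arrangement with gate-plus-flip-flop."""
    for a, b in product((0, 1), repeat=2):
        # Fig. 1(c): D = a, active-low RESET = b  ->  a AND b
        assert dff_edge(a, reset_n=b) == (a & b)
        # Fig. 2(c): D = a, active-high SET = b  ->  a OR b
        assert dff_edge(a, set_hi=b) == (a | b)
        # Fig. 3(d): D = a, active-low SET = b  ->  a OR (NOT b)
        assert dff_edge(a, set_n=b) == (a | (1 - b))
        # Fig. 3(c), assuming D = NOT a and active-low RESET = b:
        # the complemented output equals a OR (NOT b)
        assert 1 - dff_edge(1 - a, reset_n=b) == (a | (1 - b))
        # Fig. 4(c): D = NAND(a, b), active-high RESET = NOR(a, b) -> a XOR b
        nand, nor = 1 - (a & b), 1 - (a | b)
        assert dff_edge(nand, reset_hi=nor) == (a ^ b)
    return True


def xor_feedback_next(q, b):
    """Fig. 5(a): D = Q XOR b, so the state toggles exactly when b = 1."""
    return dff_edge(q ^ b)


if __name__ == "__main__":
    print(check_feed_forward())
    print([xor_feedback_next(q, b) for b in (0, 1) for q in (0, 1)])
```

Because every arrangement has only two data inputs, four input combinations cover each case completely; timing effects such as setup time are outside the scope of this model.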
An XOR gate in the feedback loop with a D flip-flop, therefore, could be realized by a T flip-flop with b as its control input [1]. T flip-flops are traditionally derived from JK flip-flops (for J = K), but a relatively simpler input modification of the D flip-flop for T operation, with T as control input, can be obtained as shown in Fig.6. For T = 0, the clock input to the D flip-flop remains fixed at zero so that the state of the flip-flop does not change, while for T = 1 the clock becomes available as usual, and the output toggles on the rising edge of the clock. The clock modification in Fig.6, therefore, makes the D flip-flop behave like a T flip-flop where the T input is used as a control.

2) ((a · b) ⊕ c) in Feedback Loop with D flip-flop: The logic symbol of ((a · b) ⊕ c) having the XOR in the feedback loop with a D flip-flop is shown in Fig.7(a). The characteristic table for this extended logic is shown in Fig.7(b). A straightforward simplification of this logic could be achieved by replacing the XOR in the feedback loop with a D flip-flop by a T flip-flop, as shown in Fig.7(c). Another variation of this implementation, which might be more useful in some applications, is shown in Fig.7(d). One of the inputs is fed as the control input to the T flip-flop while the other is ANDed with the clock. The output of the flip-flop toggles on arrival of the positive edge of the clock if and only if both the inputs are in state 1.

The implementations of Figs.6 and 7(d) would be particularly useful in situations where a single input bit is used to control multiple flip-flops. Apart from the area and time saving, they would help in two other ways in such situations. Firstly, an additional wire for the input bit would no longer be required. Secondly, when the input b = 0, they would perform automatic clock-gating for power minimization.

Characteristic table of Fig. 7(b):

clock        a   b   c = Q   Q+   remark
rising       0   0   0       0    NC
rising       0   0   1       1    NC
rising       0   1   0       0    NC
rising       0   1   1       1    NC
rising       1   0   0       0    NC
rising       1   0   1       1    NC
rising       1   1   0       1    toggle
rising       1   1   1       0    toggle
not rising   x   x   x       x    NC

Fig. 7. Implementation of ((a · b) ⊕ c) in feedback loop with D flip-flop. (a) Logic diagram. (b) Characteristic table. (c) Simplified implementation using T flip-flop. (d) Alternate simplified implementation using T flip-flop.

III. CMOS T FLIP-FLOP CONSTRUCTION

Instead of using basic logic gates like NAND or NOR gates, as in the case of BJT-based flip-flop realizations, transmission gates are used in CMOS technology to control the connections and necessary data movement for the construction of flip-flops. Besides, CMOS flip-flops are typically constructed as D flip-flops since this leads to a simpler and more efficient implementation through transmission gates [2]. CMOS JK flip-flops are derived from D flip-flops by suitable input modifications. T flip-flops are conventionally derived either from a JK flip-flop (where J = K = 1) or from a D flip-flop. For conversion of a D flip-flop for T operation, the complement output Q̄ is connected back as input to the D flip-flop. The T flip-flop derived accordingly from a D flip-flop has the same complexity as that of the D flip-flop, and works well as a clock divider circuit or an oscillator. But it cannot accept a control input like that of the JK flip-flop-based implementation. The T operation by a JK flip-flop, on the other hand, involves more area and more propagation delay than the direct realization by a D flip-flop.

Fig. 8. (a) Functional schematic of CMOS realization of a D flip-flop. (b) Clock derivation for the D flip-flop. (c) Modification of a D flip-flop for CMOS realization of T flip-flop. (d) Clock derivation for the T flip-flop using enable-controlled inverter. (e) Clock derivation for the T flip-flop using NAND gate. The TBUFI cell of the TSMC 0.18 µm process library [3], which performs logical inversion of its input with an active-high control, acts as an enable-controlled inverter.

A simple input modification of the D flip-flop for T operation with T as control input is shown in Fig.6. The T flip-flop construction of Fig.6 is relatively simpler (and may involve less area and time) compared with the JK flip-flop-based implementation. But its complexity is still higher than that of a D flip-flop due to the additional AND gate on the clock line. For further improvement in using a D flip-flop for T operation, we can modify the clock derivation circuit in the CMOS realization of the D flip-flop instead of using the additional AND gate. The functional schematic of the CMOS realization of a D flip-flop is shown in Fig.8(a) and its clock derivation circuit is shown separately in Fig.8(b). The clock derivation circuit uses two inverters to derive a balanced pair of mutually complementary clock inputs c and cn. The modified form of a D flip-flop for T operation is shown in Figs.8(c), 8(d) and 8(e).
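The T operation described above (the complement output fed back to the D input, with the clock gated by the control input) can be sketched behaviourally as follows. This is a minimal Python model under stated assumptions: the class name `GatedClockTFF` and the helper `run_cycles` are hypothetical, inputs are assumed to change only while the clock is low, and the optional second gating input models the ANDed clock of Fig. 7(d).

```python
class GatedClockTFF:
    """D flip-flop with Q-bar fed back to D and the clock ANDed with T."""

    def __init__(self, q=0):
        self.q = q
        self._prev_gated = 0  # previous level of the gated clock

    def step(self, clk, t, enable=1):
        """Apply one clock level; return the state Q after it."""
        gated = clk & t & enable  # 'enable' is the extra AND input of Fig. 7(d)
        if gated == 1 and self._prev_gated == 0:
            # Rising edge of the gated clock: D = NOT Q is loaded,
            # so the stored state toggles.
            self.q ^= 1
        self._prev_gated = gated
        return self.q


def run_cycles(ff, t_values, a_values=None):
    """Apply one full clock period per control value.

    Inputs are changed only while clk = 0, so the gate on the clock
    cannot create a spurious edge in this model.
    """
    outputs = []
    for i, t in enumerate(t_values):
        a = 1 if a_values is None else a_values[i]
        ff.step(0, t, a)
        outputs.append(ff.step(1, t, a))
    return outputs


if __name__ == "__main__":
    # Toggle only in the periods where T = 1.
    print(run_cycles(GatedClockTFF(), [1, 0, 1, 1, 0]))
    # Fig. 7(d) style: toggle only when both inputs are 1.
    print(run_cycles(GatedClockTFF(), [1, 1, 1, 1], [1, 0, 1, 0]))
```

The model holds its state whenever the gated clock stays low, which is the behavioural counterpart of the automatic clock-gating mentioned in Section II-B.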
The output Q̄ is taken as the D input of the D flip-flop in Fig.8(c), and the clock derivation circuit is controlled by the T input as shown in Figs.8(d) and 8(e). The circuits in Figs.8(c) and 8(d) or 8(e) result in the same transition table as that of Figs.5 and 6. Besides, this modification of the clock-derivation circuit involves only a very small increase in the area and propagation delay of the D flip-flop, and offers a significant saving over the modification according to Fig.6. The clock derivation of Figs.8(d) or 8(e) could also be used in the D flip-flop of Fig.8(a) to obtain a controlled-register operation, as shown in the table of Fig.9. The D input in this case could be transferred to the output state, or the output state could be kept unchanged, whenever required. This could be used in controlled registers (or shift-registers) where the register is required to deliver the output (or shift the content) at a specific clock period or under certain conditions through a single-bit control.

clock        T   D   Q   Q+   remark
rising       1   0   x   0    Q+ ← D
rising       1   1   x   1    Q+ ← D
rising       0   x   1   1    output unchanged
rising       0   x   0   0    output unchanged
not rising   x   x   1   1    output unchanged
not rising   x   x   0   0    output unchanged

Fig. 9. Transition table of modified D flip-flop using the controlled clock derivation circuit of Fig.8(d).

IV. COMPLEXITY CONSIDERATIONS

We estimate and compare here the area-complexities and propagation delays of conventional implementations of combinational logic in the feed-forward path and feedback loop with a D flip-flop, and those of the corresponding extended sequential logic. Using the cell areas and propagation delays of different gates and D flip-flops, with and without SET and RESET, of the TSMC 0.18 µm process 1.8-Volt SAGE-X™ standard cell library [3] for different drive-strengths, we have obtained the complexities of the conventional and the proposed extended sequential logic implementations. The complexities of feed-forward extended logic and conventional realizations are compared in Tables I, II and III, while those of XOR in the feedback loop are compared in Table IV. Delay is estimated as the sum of the intrinsic worst-case propagation delay and the maximum setup time. Area and delay are in sq.µm and ns, respectively. FF in Tables I, II and III refers to a D flip-flop without SET/RESET provision. The DFFHQ and DFFTR cells of the TSMC 0.18 µm process 1.8-Volt SAGE-X™ library [3] are, respectively, taken for the D flip-flop without SET/RESET and the D flip-flop with RESET in these three tables.

TABLE I
COMPARISON OF PROPOSED IMPLEMENTATION OF (a · b) FOLLOWED BY D FLIP-FLOP AND CONVENTIONAL IMPLEMENTATION

drive      AND + FF          extended logic    savings
strength   area     delay    area     delay    area      delay
XL         66.53    0.49     56.55    0.41     17.65%    20.06%
X1         66.53    0.53     56.55    0.50     17.65%    6.84%
X2         79.83    0.50     69.85    0.47     14.29%    5.62%
X4         89.81    0.48     83.16    0.39     8.00%     21.21%

TABLE II
COMPARISON OF PROPOSED IMPLEMENTATION OF (a + b̄) OR (ā · b) FOLLOWED BY D FLIP-FLOP WITH CONVENTIONAL IMPLEMENTATION

drive      NOT + OR + FF     extended logic    savings
strength   area     delay    area     delay    area      delay
XL         73.18    0.52     56.55    0.41     29.41%    26.43%
X1         73.18    0.56     56.55    0.50     29.41%    11.94%
X2         89.81    0.52     69.85    0.47     28.57%    10.45%
X4         106.44   0.50     83.16    0.39     28.00%    26.46%

TABLE III
COMPARISON OF PROPOSED IMPLEMENTATION OF (a ⊕ b) OR THE COMPLEMENT OF (a ⊕ b) FOLLOWED BY D FLIP-FLOP WITH CONVENTIONAL IMPLEMENTATION

drive      XOR + FF          proposed logic    savings
strength   area     delay    area     delay    area      delay
XL         79.83    0.59     76.51    0.46     4.35%     26.43%
X1         79.83    0.55     76.51    0.55     4.35%     -0.24%
X2         99.79    0.54     103.12   0.52     -3.23%    3.15%
X4         129.73   0.50     129.73   0.44     0.00%     12.74%

TABLE IV
AREA-COMPLEXITY AND PROPAGATION DELAY OF PROPOSED T FLIP-FLOP CONSTRUCTION AND CONVENTIONAL IMPLEMENTATIONS

drive      XOR + D flip-flop (1)   JK flip-flop (2)   proposed construction (3)   saving of (3) over (1)   saving of (3) over (2)
strength   area      delay         area     delay     area      delay             area      delay          area      delay
XL         106.44    0.75          86.49    0.68      79.83     0.55              33.33%    38.18%         8.33%     25.02%
X1         106.44    0.75          93.14    0.74      79.83     0.59              33.33%    25.86%         16.67%    24.83%
X2         119.75    0.76          93.14    0.74      86.49     0.61              38.46%    24.25%         7.69%     21.52%
X4         159.67    0.66          116.42   0.63      103.12    0.52              54.84%    27.55%         12.90%    21.09%

The JKFFR and DFFNR cells of the TSMC 0.18 µm 1.8-Volt SAGE-X™ library [3] are, respectively, taken for the JK and D type flip-flops.

A. Complexity of Feed-Forward Sequential Logic Extension

The areas and the worst-case propagation delays of the extended logic and the conventional implementations of AND followed by a D flip-flop for four different drive-strengths are listed in Table I for comparison. The extended logic involves nearly 14% less area and 13% less delay, on average over the different drive-strengths, than the conventional implementation. The extended logic of (a + b) followed by a D flip-flop is found to have (not shown in the tables) similar advantages in area- and time-complexity over the conventional implementation as the extended logic realization of AND followed by a D flip-flop. The complexities of the conventional and the proposed implementations of (a + b̄) or (ā · b) followed by a D flip-flop are listed in Table II. The extended logic is found to involve nearly 29% less area and 19% less delay, on average over the different drive-strengths, than the conventional one.

The extended logic implementation of (a ⊕ b) followed by a D flip-flop requires a D flip-flop with active-high RESET along with a NAND gate and a NOR gate, while conventionally it needs an XOR gate followed by a D flip-flop. We do not find a library cell for a D flip-flop with active-high RESET. But that could be realized by using a NOR gate in place of the input NAND gate of a D flip-flop with active-low RESET, e.g., the DFFTR cell of the TSMC 0.18 µm process 1.8-Volt SAGE-X™ standard cell library [3]. Accordingly, we have obtained the areas and the delays of the proposed and the conventional circuits (listed in Table III). Compared with the conventional one, the proposed circuit involves either less or the same area except for drive strength X2, and less delay except for drive strength X1. On average, it involves more than 1% less area and nearly 11% less delay than XOR followed by a D flip-flop without RESET.

B. Complexity of XOR in Feedback Loop with D flip-flop

As discussed in Section II, (a ⊕ b) in the feedback loop with a D flip-flop could be implemented by a T flip-flop. Conventionally, it could be implemented either by a JK flip-flop or by an XOR gate along with a D flip-flop. It is shown in Section III that the complexity of the proposed CMOS construction of the T flip-flop is almost the same as that of the equivalent D flip-flop, except that it uses a NAND gate instead of an inverter in the clock derivation circuit. Since the complexity of a NAND gate could be assumed to be almost the same as that of an inverter, the complexity of the T flip-flop would accordingly be assumed to be the same as that of a D flip-flop.
Based on the areas and worst-case delays of the XOR gate, D flip-flop and JK flip-flop (for the TSMC 0.18 µm process) for different drive-strengths, we have obtained the complexities of the conventional and the proposed implementations, as listed in Table IV for comparison. The proposed implementation involves nearly 40% less area and 29% less propagation delay, on average over the different drive-strengths, than the conventional implementation using a D flip-flop with an XOR gate. It involves nearly 11% less area and 23% less delay on average than the conventional implementation using a JK flip-flop. The proposed implementation of ((a · b) ⊕ c) in the feedback loop with a D flip-flop also has similar advantages over the conventional one, since it has the additional complexity of only one AND gate over the implementation of (a ⊕ b) in the feedback loop with a D flip-flop in each of the cases.
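The percentage figures quoted in this section can be re-derived from the tabulated values. The short Python sketch below uses the Table I entries; it assumes that each saving is expressed relative to the proposed (smaller) value, which is inferred here from the listed numbers rather than stated in the text, and the names `saving_percent` and `mean` are illustrative.

```python
def saving_percent(conventional, proposed):
    """Saving relative to the proposed value (an inferred convention)."""
    return 100.0 * (conventional - proposed) / proposed


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Areas in sq. um from Table I, for drive strengths XL, X1, X2, X4.
AND_FF_AREA = [66.53, 66.53, 79.83, 89.81]
EXTENDED_AREA = [56.55, 56.55, 69.85, 83.16]

# Area savings recomputed from the two area columns.
area_savings = [
    round(saving_percent(c, p), 2)
    for c, p in zip(AND_FF_AREA, EXTENDED_AREA)
]

# Delay savings (%) copied as listed in Table I.
delay_savings = [20.06, 6.84, 5.62, 21.21]

if __name__ == "__main__":
    print(area_savings)
    print(round(mean(area_savings), 1), round(mean(delay_savings), 1))
```

Under that convention the recomputed area savings agree with the listed column, and the two averages are consistent with the "nearly 14%" and "13%" figures quoted for Table I. The delay column is copied rather than recomputed because the tabulated delays are rounded to two decimals.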

Fig. 10. Conventional implementation of serial-parallel multiplier over GF(2^m). (a) Structure of the multiplier, consisting of a reduction section (inputs a_0 to a_{m-1}, reduction cells RC and D flip-flops) and an AND-accumulate section (input b_j, D flip-flops, outputs c_0 to c_{m-1}) followed by an output register. (b) Function of the ith reduction cell: if q_i = 1, X_out ← X_in ⊕ Y_in; else X_out ← X_in.

V. APPLICATIONS OF EXTENDED SEQUENTIAL LOGIC

There could be many possible applications of the extended sequential logic, but to show the advantages of extended logic over the conventional implementation of synchronous digital circuits, we discuss here its application in two simple but popular cases of arithmetic circuits, namely the finite field multiplier over GF(2^m) and the carry-save adder.

A. Finite Field Multiplication over GF(2^m)

Multiplication over GF(2^m) is a basic field operation which is frequently encountered in elliptic curve cryptography (ECC) and error control coding [4], [5]. Multiplication in polynomial basis is relatively simpler, offers scalability for fields of higher orders, and does not require a basis conversion [6]. A large number of architectures have been proposed in the literature for efficient polynomial basis multiplication over GF(2^m) in dedicated hardware platforms [7]–[12]. Serial-parallel polynomial-basis multipliers are well-suited for small embedded systems since the cost and size of hardware and bandwidth are major constraints in such systems [7]–[9]. We discuss here the application of the proposed extended sequential logic for efficient implementation of a serial-parallel polynomial-basis multiplier for GF(2^m).

1) Mathematical Formulation: Let the finite field GF(2^m) be defined, in general, by an irreducible polynomial of degree m, given by

$$Q(z) = z^m + q_{m-1} z^{m-1} + \cdots + q_2 z^2 + q_1 z + 1 \qquad (1)$$

where $q_j \in GF(2)$ for $1 \le j \le m-1$. Let the polynomial basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, where $\alpha$ is a root of $Q(z)$, be used to represent the field elements, so that any two arbitrary elements A and B in GF(2^m) can be represented in the form of polynomials of degree (m − 1) as

$$A = \sum_{j=0}^{m-1} a_j \alpha^j \quad \text{and} \quad B = \sum_{j=0}^{m-1} b_j \alpha^j \qquad (2)$$

where $a_j, b_j \in GF(2)$ for $j = 0, 1, \ldots, m-1$.

The product of elements A and B over GF(2^m) is given by

$$C = A \cdot B \bmod Q(z) = \sum_{i=0}^{m-1} b_i \cdot A_i \qquad (3)$$

where $A_i = [\alpha^i \cdot A \bmod Q(z)]$, $A_0 = A$, and $A_{i+1}$ can be obtained from $A_i$ recursively as

$$A_{i+1} = \alpha \cdot A_i \bmod Q(z) \qquad (4)$$

By polynomial expansion of the right side of (4), we can find

$$A_{i+1} = \left[ a_0^i \alpha + a_1^i \alpha^2 + \cdots + a_{m-2}^i \alpha^{m-1} + a_{m-1}^i \alpha^m \right] \bmod Q(z) \qquad (5)$$

where $A_i = \sum_{j=0}^{m-1} a_j^i \alpha^j$.

Since $\alpha$ is a root of $Q(z)$ given by (1), one can have

$$\alpha^m = q_{m-1} \alpha^{m-1} + \cdots + q_2 \alpha^2 + q_1 \alpha + 1 \qquad (6)$$

Substituting (6) in (5), $A_{i+1}$ can be obtained as

$$A_{i+1} = a_0^{i+1} + a_1^{i+1} \alpha + \cdots + a_{m-1}^{i+1} \alpha^{m-1} \qquad (7a)$$

where

$$a_j^{i+1} = a_{j-1}^i \oplus \left( q_j \cdot a_{m-1}^i \right) \quad \text{for } 1 \le j \le m-1 \qquad (7b)$$

$$a_0^{i+1} = a_{m-1}^i \qquad (7c)$$
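The recursion above maps directly onto a few lines of software. The following Python sketch is a bit-level model of the reduction step and the accumulation of (3); it is not the hardware structure itself. The function names `reduce_step` and `gf2m_multiply`, and the list-of-bits representation with index j holding the coefficient of α^j, are assumptions made for illustration.

```python
def reduce_step(a, q):
    """One step A_{i+1} = alpha * A_i mod Q(z).

    a: list of m bits [a_0, ..., a_{m-1}] of A_i.
    q: list of m bits where q[j] is the coefficient q_j of Q(z) for
       1 <= j <= m-1; q[0] is the constant term and is not used here
       because the constant term of Q(z) is always 1.
    """
    m = len(a)
    msb = a[m - 1]
    nxt = [msb]  # a_0^{i+1} = a_{m-1}^i
    for j in range(1, m):
        # a_j^{i+1} = a_{j-1}^i XOR (q_j AND a_{m-1}^i)
        nxt.append(a[j - 1] ^ (q[j] & msb))
    return nxt


def gf2m_multiply(a, b, q):
    """C = sum over i of b_i * A_i, with addition over GF(2) (XOR)."""
    m = len(a)
    c = [0] * m
    a_i = list(a)
    for i in range(m):
        if b[i]:
            c = [ck ^ ak for ck, ak in zip(c, a_i)]
        a_i = reduce_step(a_i, q)
    return c


if __name__ == "__main__":
    # GF(2^3) with Q(z) = z^3 + z + 1, i.e. q_1 = 1 and q_2 = 0.
    q = [1, 1, 0]
    # alpha * alpha^2 = alpha^3 = alpha + 1
    print(gf2m_multiply([0, 1, 0], [0, 0, 1], q))
    # (1 + alpha) * (alpha + alpha^2) = 1
    print(gf2m_multiply([1, 1, 0], [0, 1, 1], q))
```

The small field GF(2^3) is used only because its products are easy to verify by hand; the same functions apply to any m once the coefficient list of the field polynomial is supplied.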

2) Conventional Implementation of Multiplier over GF(2^m) and its Optimization: The finite field multiplication could be performed in two stages of recursive operations, where modular reduction is performed according to (7) in the first stage and the AND-accumulate operation is performed according to (3) in the second stage. A conventional implementation of the serial-parallel multiplier over GF(2^m) is shown in Fig.10. It consists of a reduction section and an AND-accumulate section. The reduction section consists of m reduction cells (RC) and m D flip-flops to perform successive reductions of operand A in every cycle according to (7). The D flip-flops of the reduction section are initialized by the bits of input word A, which get shifted from one D flip-flop to the next in each cycle across the reduction section. The function of each RC is depicted in Fig.10(b). The ith RC consists of an XOR gate if q_i = 1; otherwise the RC could be removed. The AND-accumulate section consists of m AND-accumulate nodes. One such node is shown in the gray oval. Each AND-accumulate node performs successive AND-accumulate operations (corresponding to the bits b_j, for 0 ≤ j ≤ m − 1, of input word B) according to (3). After m cycles the AND-accumulate section generates the product word C and transfers it to the output register.

The ((a · b) ⊕ c) operation in the feedback loop with a D flip-flop, to be performed by each AND-accumulate node, could be implemented efficiently by a T flip-flop where the input b is fed as the control input and the clock input of the T flip-flop is derived by ANDing the clock with input a, as shown in Fig.7(d) in Section II. The AND-accumulate section of the conventional implementation of the bit-serial multiplier (Fig.10) could therefore be implemented by m T flip-flops, where the input b_j is ANDed with the clock input and fed to the T flip-flops as clock, as shown in Fig.11.

Fig. 11. Proposed implementation of serial-parallel multiplier for trinomial-based GF(2^m) using extended sequential logic. The reduction section is as in Fig. 10; the AND-accumulate section uses T flip-flops with outputs c_0 to c_{m-1}.

3) Complexity Considerations: The reduction section is identical in the conventional and optimized implementations of Figs.10 and 11, respectively. It consists of m D flip-flops and a maximum of (m − 1) 2-input XOR gates (if each q_i for 1 ≤ i ≤ m − 1 is 1). The AND-accumulate section of Fig.10 consists of m D flip-flops, m 2-input XOR gates and m 2-input AND gates. The AND-accumulate section of Fig.11, on the other hand, needs m T flip-flops and one AND gate (used for ANDing with the clock). Both structures perform a field multiplication over GF(2^m) in m cycles, but they have different clock periods due to the difference in their critical-paths.
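The equivalence of the two AND-accumulate sections can be illustrated with a cycle-level software model. In the Python sketch below, field elements are held as integer bit masks (bit k is the coefficient of α^k), `q_mask` holds the low m coefficients of Q(z) including the constant 1, and all three function names are hypothetical. The model abstracts away timing and only compares the register contents after m cycles.

```python
def alpha_times(a, q_mask, m):
    """Reduction-section update: alpha * A mod Q(z) on a bit mask."""
    msb = (a >> (m - 1)) & 1
    a = (a << 1) & ((1 << m) - 1)
    return a ^ q_mask if msb else a


def conventional_accumulate(a, b, q_mask, m):
    """Fig. 10 style: AND and XOR gates drive D flip-flops every cycle."""
    acc = 0
    for j in range(m):
        bj = (b >> j) & 1
        d = (a if bj else 0) ^ acc  # D input = (a_k AND b_j) XOR c_k
        acc = d                     # D flip-flops load on each clock edge
        a = alpha_times(a, q_mask, m)
    return acc


def toggle_accumulate(a, b, q_mask, m):
    """Fig. 11 style: T flip-flops clocked only when b_j = 1."""
    acc = 0
    for j in range(m):
        if (b >> j) & 1:   # the gated clock reaches the T flip-flops
            acc ^= a       # flip-flop k toggles iff its T input a_k is 1
        a = alpha_times(a, q_mask, m)
    return acc


if __name__ == "__main__":
    m, q_mask = 3, 0b011  # Q(z) = z^3 + z + 1
    same = all(
        conventional_accumulate(a, b, q_mask, m)
        == toggle_accumulate(a, b, q_mask, m)
        for a in range(1 << m)
        for b in range(1 << m)
    )
    print(same)
```

In the toggle-based version no XOR or per-bit AND appears in the accumulation path; the only gating is the single condition on b_j, which corresponds to the one shared AND gate on the clock line.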
The critical-path of conventional structure (Fig.10) is TCC = TA + TX + TF F , where TA , TX and TF F are the delays of a 2-input AND gate, 2-input XOR gate and D flip-flop, respectively. The critical-path of the optimized structure (Fig.11) is TCO = max{TX , TF F , TT F }, where TT F is the delay of a T flip-flop. Since TX is less than TF F and TT F , while TT F could be assumed to be the same as that of TF F as discussed in Section II, we can have TCO = TF F . In Table V, we have listed the hardware requirements of the proposed structure, as well as, the existing structures. All the architectures listed in Table V have the same throughput per

TABLE V
HARDWARE- AND TIME-COMPLEXITIES OF THE BIT-SERIAL/PARALLEL MULTIPLIERS FOR $GF(2^m)$ BASED ON GENERAL FIELD POLYNOMIALS

| designs | AND | XOR | flip-flop | cycle time |
|---|---|---|---|---|
| design of [7] | $2m$ | $2m-1$ | $4m+2$ | $T_{A+X} + T_{FF}$ |
| design of [8] | $2m$ | $2m-1$ | $4m$ | $T_{A+X} + T_{FF}$ |
| proposed (Fig.10) | $m$ | $2m-1$ | $3m$ | $T_{A+X} + T_{FF}$ |
| proposed (Fig.11) | $1$ | $m-1$ | $3m$ | $T_{FF}$ |

Latency is $(m+1)$ cycles and throughput per cycle is $(1/m)$ in all cases. Fig.11 corresponds to the multiplier optimized by the proposed extended logic. $T_{A+X} = T_A + T_X$ is the sum of the delays of an AND gate and an XOR gate.

TABLE VI
COMPARISON OF AND-ACCUMULATE SECTIONS OF CONVENTIONAL AND OPTIMIZED DESIGNS OF SERIAL-PARALLEL MULTIPLIER OVER $GF(2^m)$

| drive strength | conventional design: area | conventional design: CP | optimized design: area | optimized design: CP | % of saving: area | % of saving: CP |
|---|---|---|---|---|---|---|
| XL | 119.75m | 0.87 | 79.83 | 0.55 | 50.00 | 58.83 |
| X1 | 119.75m | 0.88 | 79.83 | 0.59 | 50.00 | 48.90 |
| X2 | 133.06m | 0.86 | 86.49 | 0.61 | 53.85 | 42.19 |
| X4 | 176.30m | 0.79 | 103.12 | 0.52 | 70.97 | 51.23 |

The DFFNR cell of the TSMC 0.18 µm process 1.8-Volt SAGE-X™ library [3] is taken for the D flip-flop. CP stands for the critical-path and is given in nanoseconds. Area is in sq.µm.

cycle and the same latency in terms of number of cycles, but among the existing designs, the structure of [8] involves the minimum area and the minimum cycle time. The proposed conventional structure of Fig.10 has the same time complexity as the structure of [8], with significantly fewer gates and flip-flops than the existing designs. The optimized structure of Fig.11, however, involves nearly one-third the number of gates and has a cycle time nearly half that of the proposed conventional design. The conventional design and the optimized design differ only in the implementation of their AND-accumulate sections. The critical-path of each of these designs is the same as the critical-path of its AND-accumulate section. Moreover, the major part of the gate and register complexity of the conventional design is contributed by the AND-accumulate section. The area and the critical-path of the AND-accumulate sections of the conventional and the optimized implementations of the serial-parallel finite field multiplier are listed in Table VI. The AND-accumulate section of the optimized design is found

to have less than half of the area and half the cycle time, on average, of the conventional one.

B. Carry-Save Addition

Carry-save adders (CSA) find wide application as 3-to-2 counters, for the addition of partial products in fast multiplication, and in pipelined additions [2], [13]. The structure of a 4-bit carry-save adder is shown in Fig.12. It consists of four nodes, one of which is shown by the gray oval in Fig.12. Each node consists of a full-adder (FA) where the sum $S_i$ is fed back as one of the inputs at the next cycle through a D flip-flop of the ith node, and the carry $C_i$ is transferred to the next node on its right through a D flip-flop. The critical-path of the CSA is $T_{FA} + T_{FF}$, where $T_{FA}$ and $T_{FF}$ are the propagation delays of a full-adder and a D flip-flop.

Fig. 12. 4-bit carry-save adder.

1) Optimization of Carry-Save Addition using Extended Sequential Logic: The function of each node could be represented by either of the pair of circuits shown in Figs.13(a) and 13(b). The circuit in Fig.13(a) corresponds to the implementation of a full-adder of bits x, y and s according to the logic relations [14], [15]:

$$\mathrm{carry} = a + b \tag{8a}$$

$$\mathrm{sum} = r \oplus s \tag{8b}$$

where

$$a = r \cdot s, \quad b = x \cdot y \quad \text{and} \quad r = x \oplus y. \tag{8c}$$

The implementation of Fig.13(a) consists of three main blocks enclosed by gray boxes along with a NAND gate. Box 1 consists of an XOR gate in the feedback loop with a D flip-flop, which could be implemented by a T flip-flop as discussed in Section II. Box 2 is a half adder, and Box 3 has $a + b$ followed by a D flip-flop, which could be implemented by a D flip-flop with a RESET as shown in Fig.3(c). The only difference of the circuit of Fig.13(b) from that of Fig.13(a) is in the implementation of the logic expression of a, given by

$$a = (x \cdot s) + (y \cdot s) = (x + y) \cdot s. \tag{9}$$

Instead of the NAND gate of Fig.13(a), it therefore requires Box 4 for realization of a.

Fig. 13. Modification of the nodes of CSA using T flip-flop in place of XOR gate in feedback loop with D flip-flop; and A + B followed by D flip-flop being replaced by D flip-flop with RESET. Panels (a) and (b) show the two node circuits, with Box 1 (T FF), Box 2 (half adder), Box 3 (D FF with RESET) and, in (b), Box 4.
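The logic relations (8a)-(8c) and the alternative expression (9) for a can be checked exhaustively against ordinary full-adder arithmetic. The sketch below is only a combinational check with hypothetical function names: "+" is read as OR and "·" as AND, and the flip-flops, the NAND gate and the RESET mechanism of Fig.13 are not modeled.

```python
from itertools import product


def full_adder_reference(x, y, s):
    """Reference full adder: returns (sum, carry) from integer addition."""
    total = x + y + s
    return total & 1, total >> 1


def node_relations_8(x, y, s):
    """Relations (8a)-(8c): a = r AND s, b = x AND y, r = x XOR y."""
    r = x ^ y
    a = r & s
    b = x & y
    return r ^ s, a | b  # (sum, carry)


def node_relations_9(x, y, s):
    """Same as above, but with a = (x OR y) AND s as in (9)."""
    r = x ^ y
    a = (x | y) & s
    b = x & y
    return r ^ s, a | b  # (sum, carry)


def relations_hold():
    """True if both formulations match the reference for all 8 input triples."""
    return all(
        node_relations_8(x, y, s) == full_adder_reference(x, y, s)
        and node_relations_9(x, y, s) == full_adder_reference(x, y, s)
        for x, y, s in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    print(relations_hold())
```

Both formulations give the same carry because the two expressions for a differ only when x = y = 1, and in that case b is already 1.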

2) Complexity of the Carry-Save Adder Optimized by Extended Sequential Logic: The implementation of CSA with nodes of Fig.13(a) has a critical-path of $\max\{(T_X + T_{NAND} + T_{FF}), (T_X + T_{TF})\} = (T_X + T_{NAND} + T_{FF})$, and that of Fig.13(b) is $\max\{(T_{OR} + T_{NAND} + T_{FF}), (T_X + T_{TF})\}$. The circuit of Fig.13(b) leads to a smaller critical-path at the cost of one OR gate per node. The area and critical-path of the implementations of CSA based on extended sequential logic and of the conventional one (for the TSMC 0.18 micron process) are listed in Table VII. The proposed design of Fig.13(a) for CSA nodes using extended sequential logic involves nearly 12% less area and 15% less critical-path, on average over the different drive-strengths, than the conventional implementation. The proposed circuit in Fig.13(b) offers a saving of nearly 8.5% in area and 19.7% in delay, on average, over the conventional one for the lower drive-strengths. At drive-strength 4, however, it involves nearly 6.6% more area and 22.2% less critical-path than the conventional one. The proposed design based on Fig.13(a) involves a slightly longer critical-path but requires less area than that of Fig.13(b). It should, therefore, be preferred if area is to be minimized, while that of Fig.13(b) could be used for time minimization.

VI. CONCLUSIONS

A new approach for extending the sequential logic functionality of the D flip-flop is suggested, so that it performs an additional Boolean function simultaneously with its usual bit-storage function. It is shown that combinational functions of


TABLE VII
COMPARISON OF COMPLEXITY OF EACH NODE OF CONVENTIONAL AND OPTIMIZED IMPLEMENTATION OF CARRY-SAVE ADDER

| drive strength | conventional design (1): area | conventional design (1): CP | optimization of Fig.13(a) (2): area | optimization of Fig.13(a) (2): CP | optimization of Fig.13(b) (3): area | optimization of Fig.13(b) (3): CP | saving of (2) over (1): area | saving of (2) over (1): CP | saving of (3) over (1): area | saving of (3) over (1): CP |
|---|---|---|---|---|---|---|---|---|---|---|
| XL | 186.28 | 0.79 | 159.67 | 0.66 | 172.97 | 0.62 | 16.67% | 19.49% | 7.69% | 27.40% |
| X1 | 189.60 | 0.77 | 162.99 | 0.69 | 176.30 | 0.67 | 16.33% | 11.06% | 7.55% | 13.82% |
| X2 | 252.81 | 0.73 | 216.22 | 0.66 | 229.52 | 0.62 | 16.92% | 10.91% | 10.14% | 17.78% |
| X4 | 282.74 | 0.68 | 282.74 | 0.58 | 302.70 | 0.55 | 0.00% | 17.97% | -6.59% | 22.24% |

ADDFH, ADDH and DFFTR cells [3] are used for the full adder, half adder and D flip-flops. CP stands for clock period. Area and CP are in sq.µm and nanoseconds, respectively.

the form $(a * b)$ in a feed-forward path with a D flip-flop could be implemented by a D flip-flop with a RESET or SET provision. Such an integrated implementation is more efficient than the conventional one because the additional area and delay involved in incorporating SET and RESET in a D flip-flop (due to the optimization of logic blocks inside it) are much less than those of implementing the combinational elements outside the flip-flop. Some other functions like $(a \oplus b)$ and $((a \cdot b) \oplus c)$ in the feedback loop with a D flip-flop could be implemented by a T flip-flop by a simple modification of its clock input, where the complexity of the XOR gate is saved. The use of extended sequential logic is found to provide a substantial saving in critical-path and area over the conventional implementations. We have also presented a simple approach for the construction of a CMOS T flip-flop, which is more efficient than the T flip-flop derived from a JK flip-flop.

To show the advantage of extended sequential logic, it is used for optimizing a finite field multiplier over $GF(2^m)$ and a carry-save adder of real numbers. Using the extended logic in the case of the finite field multiplier, the number of gates is reduced to nearly one-third, and the clock period is reduced to nearly half that of the conventional design. In the case of the carry-save adder also, the use of extended logic is found to result in a significant saving in area and/or time-complexity over the conventional design. The extended logic provides substantial benefit when the sequential elements are heavily interspersed with the combinational elements and the full pipeline is covered by this simple optimization. In particular, when the logic maps into the clock pin, it not only provides area and time savings but can also eliminate the wiring for an input bit-line and perform auto clock-gating (by fixing the input at zero) for power minimization.
It would be useful to have a family of library cells for the extended sequential logic for optimal realization of synchronous digital systems.

REFERENCES

[1] P. K. Meher, “On efficient implementation of accumulation in finite field over $GF(2^m)$ and its applications,” IEEE Transactions on Very Large Scale Integration Systems, accepted for future publication.
[2] N. H. E. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Boston: Pearson/Addison-Wesley, 2005.
[3] “TSMC 0.18 µm Process 1.8-Volt SAGE-X™ Standard Cell Library Databook,” Release 4.1, Artisan Components, September 2003.
[4] I. Blake, G. Seroussi, and N. P. Smart, Elliptic Curves in Cryptography. London Mathematical Society Lecture Note Series: Cambridge University Press, 1999.
[5] National Institute of Standards and Technology, “FIPS 186-2, Digital Signature Standard (DSS), Federal Information Processing Standards Publication 186-2,” 2000.
[6] I. S. Hsu, T. K. Truong, L. J. Deutsch, and I. S. Reed, “A comparison of VLSI architecture of finite field multipliers using dual, normal, or standard bases,” IEEE Trans. Computers, vol. 37, no. 6, pp. 735–739, June 1988.
[7] L. Song and K. K. Parhi, “Efficient finite field serial/parallel multiplication,” in 1996 Int. Conf. on Application Specific Systems, Architectures and Processors, ASAP 96, Aug. 1996, pp. 72–82.
[8] M. A. Garcia-Martinez, R. Posada-Gomez, G. Morales-Luna, and F. Rodriguez-Henriquez, “FPGA implementation of an efficient multiplier over finite fields $GF(2^m)$,” in 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2005), Sept. 2005.
[9] P. K. Meher, “Systolic formulation for low-complexity serial-parallel implementation of unified finite field multiplication over $GF(2^m)$,” in Proc. 18th IEEE Int. Conf. on Application-specific Systems, Architectures and Processors, ASAP 07, Montreal, July 2007, pp. 134–139.
[10] S. V. Bharathwaj and K. L. Narasimhan, “An alternate approach to modular multiplication for finite fields [$GF(2^m)$] using Itoh Tsujii algorithm,” in The 3rd IEEE-NEWCAS Conference, 2005, pp. 7103–105.
[11] S. K. Jain, L. Song, and K. K. Parhi, “Efficient semisystolic architectures for finite-field arithmetic,” IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1, pp. 101–113, Mar. 1998.
[12] P. K. Meher, “Systolic and super-systolic multipliers for finite field $GF(2^m)$ based on irreducible trinomials,” IEEE Transactions on Circuits & Systems-I: Regular Papers, vol. 55, no. 4, pp. 1031–1040, May 2008.
[13] L. Dadda and V. Piuri, “Pipelined adders,” IEEE Transactions on Computers, vol. 45, no. 3, pp. 348–356, Mar. 1996.
[14] A. R. Omondi, Computer Arithmetic Systems: Algorithms, Architectures, and Implementations. New York: Prentice-Hall, 1994.
[15] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. New York: Oxford University Press, 2000.

Pramod Kumar Meher (SM’03) received the first class degrees of B.Sc. (Honours) and M.Sc. in Physics, and Ph.D. in Science from Sambalpur University, Sambalpur, India in 1976, 1978, and 1996, respectively. He has a wide scientific and technical background covering Physics, Electronics and Computer Engineering. Currently, he is a Senior Fellow in the School of Computer Engineering, Nanyang Technological University, Singapore. He was a Professor at Utkal University, Bhubaneswar, India during 1997-2002, a Reader in Electronics at Berhampur University, Berhampur, India during 1993-1997, and a Lecturer in Physics in various Government Colleges (in India) during 1981-1993. His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal processing, image processing, secure communication and bioinformatics. He has published more than 100 technical papers in various reputed journals and conference proceedings. Currently, he is serving as Associate Editor of the IEEE Transactions on Circuits and Systems-II: Express Briefs and The Journal of Circuits and Systems for Signal Processing. Dr. Meher was conferred with the Samanta Chandrasekhar Award for excellence in research in Engineering & Technology for the year 1999. He is a Chartered Engineer of the Engineering Council of United Kingdom, a Senior Member of IEEE, a Fellow of The Institution of Electronics and Telecommunication Engineers (IETE) of India, and a Fellow of the Institution of Engineering and Technology (IET) (formerly known as the Institution of Electrical Engineers), UK.