Completion Detecting Carry Select Addition Alessandro De Gloria and Mauro Olivieri
Abstract: We present the logic analysis, circuit implementation and verification of a novel self-timed adder scheme based on Carry Select (CS) logic. The preliminary analysis of the variable-time behavior of CS logic justifies the design of self-timed CS adders, and identifies the best choice for the block size to optimize the average performance. We then describe the logic design and full-custom circuit implementation of a completion detecting CS adder by means of precharged CMOS logic. We verify the correct asynchronous operation of the circuit by means of layout level Spice simulation referring to a 0.35 µm CMOS process. The worst case addition time is comparable with that of the fastest existing fixed-time adders, which is a considerable result for a completion detecting technique. The hardware overhead can be limited to 23% over a conventional CS adder. Spice simulation lets us estimate an average detected addition time of 1.6 ns for a 64 bit adder, including the pre-charge time.
NOTE: This version may contain errors with respect to the final published paper. Please refer to the final publication on IEE Proceedings
Prof. Mauro Olivieri Dept. of Electronic Engineering - Univ. of Rome “La Sapienza” Via Eudossiana 18 - 00184 Rome - Italy. Tel. +39 06 44585475 Fax +39 06 4742647 Email
[email protected]
1. Introduction

Advances in asynchronous logic design [8,16] have given new impulse to the study of variable-time arithmetic units. A variable-time unit, also called completion detecting or self-timed [18,1], is capable of generating a completion signal when the result of its operation is ready. By contrast, fixed-time units guarantee a valid result after a worst-case delay. Self-timed arithmetic units aim at a performance improvement by relying on the average response time rather than the worst case [18,16]. The efficiency of a variable-time arithmetic unit rests on showing that its average completion time is considerably less than the completion time of an equivalent fixed-time unit [19,9,10]. In the past, self-timed addition techniques have been studied from this point of view, leading in particular to results on the average response time achieved by self-timed ripple carry chain adders [4,5,17].
Experience has shown, however, that fixed-time addition techniques usually outperform variable-time implementations due to completion detection time and area overhead [13,2,11,3,7,12]. Interesting results have been reached in self-timed Carry Lookahead adders (SCLA) [6,9]. The SCLA principle is based on the statistical properties of a chain of CLA blocks, acting as a ripple carry chain of higher radix adders.

Carry Select (CS) addition [2] is based on a different mechanism from the CLA. In CS addition (Fig. 1), each block in the carry chain produces the sum and the local carry for both a ‘1’ input carry and a ‘0’ input carry. The blocks compute their potential sum and carry values in parallel, then a fast selection produces the correct result for the whole addition. The statistical properties of the CS variable addition time depart from the case of CLA. In CS addition, three variable-time contributions determine the total time to complete the addition:

1. The time taken by each block to set up the two potential carry values for selection. This time is variable as it is the time for producing the final carry of a ripple adder.
2. The time taken by each block to set up the two potential sum values to be selected by the output carry of the preceding block when it is ready. This time is variable, as it is the addition time of a ripple carry adder [5,4], and is generally not the same as the time for producing the potential carries.
3. The time taken by the selection chain to propagate the carries. This time is variable due to the possible presence of anticipated carries along the chain after the operand evaluation.

Hence, a correct model of the variable completion time of a CS addition cannot be based on the simple average longest carry chain length in a ripple adder, yet a quantitative general model is essential to identify the potential effectiveness of a completion detecting CS scheme.

In the following, we first provide a model of the dynamic operation of CS logic. The model is suitable for evaluating the nominal (i.e. gate level) average completion time of a CS addition. The numerical results clearly quantify the potential effectiveness of variable-time CS addition and identify the optimal block size for a completion detecting implementation. We then develop a logic architecture of the completion detecting mechanism and its implementation at circuit level by means of pre-charged CMOS logic.

In principle, one may obtain a completion detecting version of any logic function by means of dual rail encoding, as in dual rail domino circuits [21,9]. However, research in the field has shown that high performance specific blocks such as adders deserve dedicated architecture design to reduce both time and area overhead [20,6].
The hardware overhead for our architecture can be limited to 23% over a conventional CS adder. Moreover, the reduced complexity allows a faster operation of the completion detecting mechanism. Having designed the completion detecting CS architecture, we develop a layout level timing characterization, referring to a real 0.35 µm CMOS process. We analyze in detail the critical timing conditions for the circuit to guarantee safe asynchronous operation. The performance evaluation shows that the delay overhead of completion detection is around one gate delay only, and the worst case addition time is nominally the same as a conventional fixed-time CS adder delay. This is a considerable result, as the detection delay overhead and the poor worst-case performance can be a drawback of completion detecting implementations. Layout level Spice simulations yield an estimated
average delay of 1.6 ns for 64 bit addition, including the pre-charge time.

2. Dynamic analysis of Carry Select addition

2.1 Definitions and preliminary concepts

We consider a CS architecture with N bit operand length. Fig. 1 shows an architecture with M bit fixed block size. Another common architecture employs growing block sizes, so that the first block is M bit wide, the second M+1, and so on. The latter strategy optimizes the fixed-time worst-case carry propagation [2,7] by reducing the number of blocks, at the cost of a less regular layout. The analysis developed in this sub-section is valid for either architecture, and we generally refer to the number of blocks as H. In the following, max_k(y_k), for k = 1...K, is the maximum among a set of numbers y_1 ... y_K, while max(y, z) is the maximum between y and z.

Definition 1: In a CS adder with operand length N, given two operands a = a_0 ... a_{N-1} and b = b_0 ... b_{N-1}, we define the propagate and generate bit vectors as p = p_0 ... p_{N-1}, g = g_0 ... g_{N-1}, where
$$p_k = a_k \oplus b_k, \qquad g_k = a_k \wedge b_k.$$
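As an illustration of Definition 1 and of the CS mechanism described in the Introduction, the following is a minimal functional sketch (no timing is modeled). The function name `carry_select_add`, its parameters and the returned `anticipated` list are hypothetical names introduced here, and the operand length is assumed to be a multiple of the block size.

```python
def carry_select_add(a: int, b: int, n: int = 8, m: int = 4):
    """Untimed model of an n-bit carry-select adder built from m-bit blocks.

    Returns (sum including carry-out, p, g, anticipated), where
    anticipated[c] is True when both potential carries of block c agree.
    """
    assert n % m == 0, "sketch assumes n is a multiple of m"
    mask = (1 << m) - 1
    p = a ^ b  # propagate bits p_k = a_k XOR b_k (Definition 1)
    g = a & b  # generate bits  g_k = a_k AND b_k (Definition 1)
    carry = 0  # input carry of the least significant block
    total = 0
    anticipated = []
    for c in range(n // m):
        x = (a >> (c * m)) & mask
        y = (b >> (c * m)) & mask
        s0, s1 = x + y, x + y + 1      # potential results for carry-in 0 and 1
        cp0, cp1 = s0 >> m, s1 >> m    # the two potential output carries
        anticipated.append(cp0 == cp1)
        chosen = s1 if carry else s0   # selection by the incoming carry
        total |= (chosen & mask) << (c * m)
        carry = chosen >> m
    return total | (carry << n), p, g, anticipated
```

For example, `carry_select_add(0x0F, 0x00)` reports the first block as not anticipated (all its propagate bits are 1), while the second block anticipates its carry.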
The carry produced by each CS block can be anticipated with respect to the arrival of the preceding block’s output carry. This happens when the two potential carry bits have the same value. For instance, in case of fixed size M=4, if p7=0 the second block generates an anticipated carry. In the following, when block c produces an anticipated carry, we say it is the leading block of the carry sub-chain c, where c ∈ [ 1, H ] , and when the block cannot produce an anticipated carry we say it is unresolved. Definition 2: In a CS adder with block size M and operand length N, given two operands,
• the time taken by the computation of p and g is the evaluation of the operands, indicated as P.
• in any block c, the time required to produce the two potential sum values, after the evaluation of the operands, is the sum setup Ac.
• in any sub-chain c, the time required between the evaluation of the operands and the starting instant of the carry selection is the sub-chain setup Sc. The sub-chain setup allows for the anticipated carry to be ready and for the two potential carries of each unresolved block to be ready when the selection reaches the block.
• in any sub-chain c, the time needed to select the carry of all the unresolved blocks is the sub-chain length Lc.
• the time required to select the final sum, after all the potential sum pairs and the selecting carries are ready, is the final sum selection Q.

The addition result comes out of two distinct concurrent processes: the carry chain construction and the block sum construction. The block sum construction is completed when all the blocks have valid potential sum values, so that the block sum construction time is max_c(A_c). The carry chain construction is complete when all the unresolved blocks have selected their carry. The unresolved sub-chains start the selection at different instants, i.e. after their own setup Sc. As a consequence, the completion time of the carry chain is not simply determined by the longest carry sub-chain, but by the longest time resulting from the sum of a sub-chain length and its setup time. The carry chain construction time is therefore max_c(L_c + S_c). Fig. 2 illustrates two possible behaviors of a CS adder, depending on the input operands. In one case the carry chain construction dominates the overall addition time, while in the other case the block sum dominates. As a general result, we can state that the addition time of CS is
$$t_{CS} = P + \max\bigl(\max_c(A_c),\ \max_c(L_c + S_c)\bigr) + Q. \tag{1}$$
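Eq. (1) can be written as a small function. This is only a sketch: the per-block lists `A`, `L` and `S` are hypothetical inputs supplied by the caller, and the defaults P = Q = 1 are the unit gate delay assumption introduced in the text that follows.

```python
def t_cs(A, L, S, P=1, Q=1):
    """Addition time of eq. (1) from per-block values A_c, L_c and S_c."""
    block_sum_time = max(A)                                 # max_c(A_c)
    carry_chain_time = max(l + s for l, s in zip(L, S))     # max_c(L_c + S_c)
    return P + max(block_sum_time, carry_chain_time) + Q
```

With `A = [3, 2]`, `L = [1, 0]`, `S = [1, 2]` the block sum construction dominates; with `A = [2, 2]`, `L = [3, 0]`, `S = [2, 1]` the carry chain construction does.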
In the following, we consider the implementation of a CS block as a pair of multiplexer-based Manchester chains (Fig. 3). The gates used in the architecture are the multiplexer, the XOR gate and the AND gate. The goal of this preliminary analysis is to evaluate the inherent variable-time behavior of CS logic. We can reach a single number evaluation of the average tCS by using a nominal 2-input gate delay as the time unit for the subsequent computations. Thus, by inspecting the schematic diagram in Fig. 3, we see that P is an XOR gate delay, Q is a multiplexer delay, and therefore P = Q = 1 time unit (see footnote 1). The expression of tCS is
$$t_{CS} = 2 + \max\bigl(\max_c(A_c),\ \max_c(L_c + S_c)\bigr) \tag{2}$$
Circuit level timings will be introduced in the verification of the correct operation of completion detecting structures and in their speed evaluation.

Footnote 1: Actually, there is a special case of operand values in which Q must not be taken into account [14]. The very low statistical relevance of this occurrence makes it negligible for our evaluation purpose.

2.2 Special cases relevant for comparison

We recall some well known properties on addition time in a form that will be used later as a reference for comparisons. According to the definitions given in the previous section, the fixed-time worst-case CS addition time is represented by the occurrence of a pair of operands for which S1 = A1 = M, L1 = H. In that case the nominal delay of the CS addition is
$$t_{CS}^{worst} = 2 + M + H \tag{3}$$

where for fixed size CS blocks we have $H = \lceil N/M \rceil$, while for variable size blocks we have $H = \left\lceil \left( -2M + 1 + \sqrt{(2M-1)^2 + 8N} \right) / 2 \right\rceil$, considering M as the size of the first block [2].
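The worst-case delay of eq. (3) can be tabulated with a few lines of code. This is a sketch under one assumption: the number of blocks H is rounded up to an integer, which the extracted formula does not show explicitly. For growing blocks, H is obtained by counting blocks of size M, M+1, ... until N bits are covered, which is equivalent to rounding up the closed form. The function names are hypothetical.

```python
def blocks_fixed(n: int, m: int) -> int:
    """Number of blocks H for fixed block size m (rounded up)."""
    return -(-n // m)

def blocks_growing(n: int, m: int) -> int:
    """Number of blocks H when block sizes grow as m, m+1, m+2, ..."""
    h, covered, size = 0, 0, m
    while covered < n:
        covered += size
        size += 1
        h += 1
    return h

def worst_case_delay(n: int, m: int, growing: bool = False) -> int:
    """Nominal worst-case delay 2 + M + H of eq. (3), in unit gate delays."""
    h = blocks_growing(n, m) if growing else blocks_fixed(n, m)
    return 2 + m + h
```

Under this assumption, `worst_case_delay(64, 8)` gives 18 and `worst_case_delay(64, 2, growing=True)` gives 14, the values listed for N = 64 in Table 1.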
Table 1 reports the addition time of fixed-time worst-case CS adders corresponding to the optimal value of M. Table 1 is only valid in the approximation of unit gate delay and is used later for comparison. The particular case of block size M = 1 corresponds to a binary ripple adder, which does not benefit from the carry selecting mechanism and can be realized as a single N-bit Manchester chain. It is well known that for uniformly distributed operands the average completion time of a binary ripple adder is slightly less than log_2 N + 1/2 [4,10] with one-bit carry propagation as the time unit, including the setup of the anticipated carries (after the evaluation of the operands), and the completion of the unresolved carry sub-chains. The evaluation of the operands is the same as in the CS adder, while the final sum selection is replaced by an XOR operation generating the sum bits when all the carries are ready. So the nominal average delay of a ripple adder is approximately
$$E\{t_{RCA}\} = \log_2 N + \tfrac{1}{2} + P + Q = \log_2 N + \tfrac{5}{2} \tag{4}$$
where we have assumed one gate delay for one-bit carry propagation, as in a multiplexer based Manchester carry chain. This result will be used for comparison with CS addition, in Table 2.

2.3 Analysis of the carry chain construction

Fig. 4 gives a symbolic representation of the adder configuration after the evaluation of the operands and the setup of all the sub-chains, in case of fixed block size M. Each vertical M bit string is the sequence of the propagate bits in a block, while the horizontal H bit string reflects the selection chain starting configuration: the unresolved blocks have a ‘0’ in the string, while the leading ones have a ‘1’. Furthermore, each “bit cell” in the string corresponds to a multiplexer in the real circuit. For instance, since M=4 and p7=0, the second block generates the anticipated carry after two multiplexer delays, and the potential sums after 3 multiplexer delays, because p6=p5=p4=1 (all this after the evaluation of the operands). Here we derive a general expression for Sc and Lc focusing on fixed block size M.

Definition 3: In a CS adder, given two operands, for each block c we define the carry setup configuration σc as one unit plus the number of consecutive ‘1’ propagation bits in block c, starting from the most significant (see Fig. 4).

We observe that σc is the number of multiplexer delays after which the two potential carries of a block are valid (either unresolved or leading). In particular, in case block c is unresolved, for a fixed block size M we have σc = M+1.

Lemma 1: Using the fixed M-bit size CS block architecture in Fig. 3, having assumed unit gate delay, the setup of a sub-chain c is
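$$S_1(\sigma_1, \sigma_2) = \begin{cases} M & \text{if } \sigma_1 < M \text{ and } \sigma_2 = M + 1 \\ M & \text{if } \sigma_1 = M + 1 \\ \sigma_1 & \text{otherwise} \end{cases}$$

$$S_c(\sigma_c, \sigma_{c+1}) = \begin{cases} M & \text{if } \sigma_c < M \text{ and } \sigma_{c+1} = M + 1 \\ 0 & \text{if } \sigma_c = M + 1 \\ \sigma_c + 1 & \text{otherwise} \end{cases} \qquad c \neq 1,\ c \neq H \tag{5}$$

$$S_H(\sigma_H) = \begin{cases} 0 & \text{if } \sigma_H = M + 1 \\ \sigma_H + 1 & \text{otherwise} \end{cases}$$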
The proof of Lemma 1 is reported in [14]. Lemma 2: Using the fixed M-bit size CS block architecture in Fig. 3, having assumed unit gate delay, the sub-chain length of an unresolved sub-chain c is
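$$L_c(\sigma_1 \dots \sigma_H) = \begin{cases} 0 & \text{if } \sigma_c = M + 1 \\ m & \text{if } \sigma_c \neq M + 1,\ \sigma_{c+1} = \dots = \sigma_{c+m} = M + 1,\ \sigma_{c+m+1} \neq M + 1 \end{cases} \tag{6}$$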
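Definition 3 and Lemma 2 translate directly into code. The sketch below assumes the propagate vector is given as a list `p` with `p[0]` = p_0 (least significant bit) and a length that is a multiple of M; blocks are indexed from 0 in the returned lists, and the function names are hypothetical.

```python
def carry_setup_configuration(p, m):
    """sigma_c for each block: 1 + run of '1' propagate bits from the block MSB."""
    sigmas = []
    for start in range(0, len(p), m):
        block = p[start:start + m]
        run = 0
        for bit in reversed(block):  # start from the most significant bit
            if bit != 1:
                break
            run += 1
        sigmas.append(1 + run)
    return sigmas

def sub_chain_lengths(sigmas, m):
    """L_c of eq. (6): unresolved blocks that directly follow a leading block."""
    unresolved = m + 1
    lengths = []
    for c, sigma in enumerate(sigmas):
        if sigma == unresolved:
            lengths.append(0)        # an unresolved block leads no sub-chain
            continue
        run = 0
        for following in sigmas[c + 1:]:
            if following != unresolved:
                break
            run += 1
        lengths.append(run)
    return lengths
```

For a 12-bit vector with M = 4 whose first block has p3 = 0 and whose second and third blocks are all ones, the configuration is σ = [1, 5, 5] and the first block leads a sub-chain of length 2.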
The proof of Lemma 2 is reported in [14].

2.4 Analysis of the local sum construction

Referring to Fig. 3, the local production of the two potential sum values is obtained by a ripple carry addition realized itself as a pair of multiplexer chains, each implementing a binary ripple adder. Hence, inside a block we have an internal carry propagation, with anticipated carries and sub-chains. When the propagation bit pk is null, the kth multiplexer anticipates an internal carry.

Definition 4: In a CS adder with block size M, considering a single block c, the number of consecutive non-null propagation bits, starting from bit pk towards the most significant, is the internal-sub-chain length lk, where k ∈ [(c − 1)M, (c − 1)M + M − 1] (Fig. 4).
Lemma 3: Using the fixed M-bit size CS block architecture in Fig. 3, having assumed unit gate delay, the sum setup Ac of a single block c is given by

$$A_c = \max_k(l_k) + 2 \qquad \text{for } k \in [(c-1)M,\ (c-1)M + M - 1] \tag{7}$$
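A minimal sketch of Definition 4 and of eq. (7) as printed above. It assumes the propagate bits of one block are passed as a list with the least significant bit first, and that the run of propagate bits is counted inside the block only, as Definition 4 states. The function names are hypothetical.

```python
def internal_sub_chain_lengths(block_p):
    """l_k for one block: run of '1' propagate bits starting at bit k, toward the MSB."""
    lengths = [0] * len(block_p)
    run = 0
    # Scan from the most significant bit down so each run is built incrementally.
    for k in range(len(block_p) - 1, -1, -1):
        run = run + 1 if block_p[k] == 1 else 0
        lengths[k] = run
    return lengths

def sum_setup(block_p):
    """A_c = max_k(l_k) + 2, following eq. (7)."""
    return max(internal_sub_chain_lengths(block_p)) + 2
```

For a block with p4 = p5 = p6 = 1 and p7 = 0, `internal_sub_chain_lengths([1, 1, 1, 0])` returns `[3, 2, 1, 0]`.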
The proof of Lemma 3 is reported in [14].

2.5 Calculation of E{tCS}

The terms Lc and Sc are directly obtainable from the carry chain configuration σc, while the term Ac is obtainable from the internal-sub-chain lengths lk, k ∈ [(c − 1)M, (c − 1)M + M − 1]. The variables σc and lk are in turn defined as a function of the propagation bit vector p. It is possible to compute the average value of the addition completion time, assuming uniform operand distribution, by means of a “direct” approach, i.e. calculating tCS for all the possible configurations of the operands a and b. In fact, the propagate bit vector p is sufficient for evaluating (5), (6) and (7), so we have:
$$E\{t_{CS}\}(M, N) = 2 + \frac{1}{2^N} \cdot \sum_{p=0}^{2^N - 1} \max\bigl(\max_c(A_c(p)),\ \max_c(L_c(p) + S_c(p))\bigr) \tag{8}$$
The computation of the above formula is only feasible for limited sizes of the operands (see footnote 1). To evaluate the expression of E{tCS}(M, N) for values of N over 24, we performed a Monte-Carlo analysis by means of random p vectors. This approximate technique also has the advantage of permitting the use of non-uniform operand distributions, such as one produced by a real benchmark. Table 2 shows the results of the evaluation of (8) for a comprehensive set of values of N and M. The development of a computable upper bound for E{tCS} can be found in [14].

A practically relevant outcome of Table 2 is that the optimal configuration for minimizing the average completion time is always the lowest value of M allowed for a CS adder, i.e. M = 2 (differently from the fixed-time fixed-block CS adders, and similarly to the fixed-time variable-sized-block CS adders, illustrated in Table 1). The case M = 1, corresponding to a pure ripple adder, always has a longer average completion time than the CS addition with M = 2. So M = 2 turns out to be the optimal choice for completion detecting fixed-block CS adders.

From Tables 1 and 2, we note over 50% improvement of the average completion time over fixed-time CS adders (including those with variable block size), similarly to the result reported in [6] on CLA.
Footnote 1: For each p the algorithm has to consider H values of c, among which most are leading blocks [14] and therefore have non-zero Sc values. Thus for N = 32 and M = 4, the algorithm has to evaluate Sc a number of times close to 2^32 ⋅ 8 ≈ 3.4 ⋅ 10^10.
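The two evaluation strategies of Section 2.5, exhaustive enumeration of all 2^N propagate vectors and Monte-Carlo sampling of random p vectors, can be compared on a small example. The sketch below is not the authors' program: the metric `longest_propagate_run` is a simple stand-in for the per-vector term of eq. (8), chosen only because it is easy to check by hand, and all function names are hypothetical. Any function of p, including one built from eqs. (5) to (7), could be passed in its place.

```python
import random
from itertools import product

def longest_propagate_run(p):
    """Stand-in metric: longest run of consecutive '1' propagate bits in p."""
    best = run = 0
    for bit in p:
        run = run + 1 if bit else 0
        best = max(best, run)
    return best

def exhaustive_mean(metric, n):
    """'Direct' approach: average the metric over all 2**n propagate vectors."""
    total = sum(metric(p) for p in product((0, 1), repeat=n))
    return total / 2 ** n

def monte_carlo_mean(metric, n, samples, seed=0):
    """Monte-Carlo approach: average the metric over random propagate vectors."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        p = [rng.getrandbits(1) for _ in range(n)]
        total += metric(p)
    return total / samples
```

The exhaustive loop costs 2^N metric evaluations, which is why it is restricted to small N, while the sampling cost is set by the number of samples and is independent of N. Replacing the uniform bit generator with bits derived from recorded operands would give the non-uniform case mentioned in the text.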
3. Logic design and timing verification

3.1 Logic level hardware implementation

The proposed CS completion detecting circuitry relies on the following arithmetic property:

Theorem 1: In a CS adder, let cp1 and cp0 be the potential carry produced by a block’s internal adder with fixed input carry 1 and 0, respectively. Then, in steady state conditions the situation cp0=1 and cp1=0 cannot occur.

Proof: The two local adders in a block add the same pair of M bit numbers, say x and y. An adder produces a ‘1’ output carry if x + y + input carry is greater than 2^M − 1. The situation cp0=1 and cp1=0 corresponds to x + y + 0 > 2^M − 1 and x + y + 1 ≤ 2^M − 1, respectively. Combining the two inequalities gives 2^M − 1 < x + y ≤ 2^M − 2, which reduces to 1 < 0, so in steady state conditions the ripple adders never produce the considered output. Q.E.D.
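Theorem 1 is easy to confirm by brute force for small block sizes. The following sketch enumerates every pair of M-bit operands and collects the (cp0, cp1) pairs that occur; the function names are hypothetical.

```python
def potential_carries(x: int, y: int, m: int):
    """Output carries of an m-bit block for input carry 0 and input carry 1."""
    cp0 = (x + y) >> m
    cp1 = (x + y + 1) >> m
    return cp0, cp1

def reachable_carry_pairs(m: int):
    """Set of (cp0, cp1) pairs produced by all pairs of m-bit operands."""
    pairs = set()
    for x in range(1 << m):
        for y in range(1 << m):
            pairs.add(potential_carries(x, y, m))
    return pairs
```

For every M tried, the reachable set is {(0, 0), (0, 1), (1, 1)}: the pair (1, 0) never appears, and (0, 1) occurs exactly when x + y = 2^M − 1, i.e. when all the propagate bits of the block are 1.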
As a result, the bits cp0 and cp1 provide a sort of inherent dual rail encoding to detect anticipated carries. When the carry is anticipated, they both have the same value 1 or 0. When the carry is not anticipated, we have cp0=0 and cp1=1, while the opposite situation does not occur. The top level view of the logic implementation of a completion detecting CS adder (Self-timed Carry Select, SCS) is shown in Fig. 5. The completion of the local potential sums is flagged by a set of fc signals, c = 0...H; the completion of the carry selection is flagged by a set of Fc signals, c = 0...H. The signal Fall flags the completion of the whole addition according to the logic equations
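$$F_{all} = \overline{F_{carryall} \wedge F_{sumall}}, \qquad F_{carryall} = \overline{F_1 \vee \dots \vee F_H}, \qquad F_{sumall} = \overline{f_1 \vee \dots \vee f_H} \tag{9}$$

The complements in (9) follow the NOR and NAND gates of the implementation: Fcarryall and Fsumall are low while any Fc or fc is still high, and Fall falls when both are high. A gate level implementation of the completion detecting CS block is shown in Fig. 6. All the multiplexers are properly initialized by the Start signal. The logic definition of the Fc signals, c = 0 ... H, is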
$$F_c = \overline{cp_0} \wedge cp_1 \wedge F_{c-1} \tag{10}$$
We assume that before the evaluation of the operands the variables in (10) are in the state cp0 = 0, cp1 = 1, Fc = 1, which can be obtained by means of proper pre-charged logic implementation. According to Theorem 1, either Fc gets low because Fc-1 = 0 (i.e. the preceding selecting carry is ready), or Fc gets low because cp1 = 0 or cp0 = 1 (i.e. the current block c anticipates the carry).
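The behavior just described can be checked on a truth table. The sketch below reads eq. (10) with cp0 complemented, which is an assumption: it is the reading required by the stated initial state (cp0 = 0, cp1 = 1, Fc = 1) and by the conditions under which Fc falls. The function name is hypothetical.

```python
def f_carry(cp0: int, cp1: int, f_prev: int) -> int:
    """F_c under the assumed reading of eq. (10): (NOT cp0) AND cp1 AND F_{c-1}."""
    return int((not cp0) and cp1 and f_prev)
```

`f_carry(0, 1, 1)` is 1 (pre-charged, unresolved, still waiting), while it drops to 0 as soon as the preceding flag falls (`f_carry(0, 1, 0)`) or the block anticipates its carry (`f_carry(0, 0, 1)` or `f_carry(1, 1, 1)`).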
Correspondingly, the circuit and layout implementation must meet the following two timing constraints: Constraint 1 (for unresolved blocks): Fc falls only after it is sure that the two potential carries cp0 and cp1 do not change their value, possibly changing the selected carry value. Furthermore, Fc falls after the selection of the carry Cc by the preceding carry Cc-1 has occurred. Constraint 2 (for leading blocks): No glitches occur on cp0 and cp1 before they reach their final status, i.e. it is sure that the status cp1 = 0 or cp0 = 1 is not transient but it is the actual detection of an anticipated carry. Furthermore, Fc falls after the value of cp1 and cp0 (both high or both low) has been conveyed to the corresponding carry Cc. Coming to the local sum detection, the logic definition of the fc signal is
$$f_c = r_1 \vee r_2 \vee \dots \vee r_M, \qquad r_i = p_i \wedge r_{i-1}, \qquad r_0 = 0 \tag{11}$$
We assume that after the evaluation of the operands the variables in (11) are in the state r1=r2=...=rM=1, fc=1, thanks to pre-charged logic implementation. The signal fc gets low only after all ri are low, flagging that the potential sums are ready. The hardware implementation must satisfy the following timing constraint: Constraint 3: When all ri fall, fc falls after an additional time which is sufficient for the XOR gates to produce the potential sum. The operation of the logic architecture in Fig. 5 and 6 is composed of four phases: 1. The operands are presented at the adder input and the start signal is lowered to determine the following initialization:
$$r_i \equiv 1; \quad c_0(i) \equiv 0; \quad c_1(i) \equiv 1 \qquad \forall i \in [1, M],\ \forall c \in [1, H] \tag{12}$$
As a result, the Fc signals rise, the fc signals rise, Fcarryall falls, Fsumall falls, and Fall rises.
2. The start signal rises, each block computes the potential carries and the unresolved sub-chains are completed by propagation. As each carry bit Cc becomes valid, the corresponding Fc falls.
3. Concurrently with phase 2, each block computes the potential sums. As a local carry chain is completed, the corresponding fc falls.
4. When the whole carry selection chain is completed and all the blocks have computed the potential sums, the final sum selection produces the addition result. Concurrently, according to (9), Fall gets low, signaling that the sum is valid.
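The flag levels listed in the four phases can be reproduced with a small combinational model. This sketch assumes that Fcarryall and Fsumall are produced by NOR gates and that Fall is the NAND of the two, which is how the phase description behaves (Fcarryall and Fsumall fall when the Fc and fc signals rise, and Fall rises in turn). The function name and the list-based interface are hypothetical.

```python
def global_flags(F, f):
    """Return (F_carryall, F_sumall, F_all) from the per-block flag lists."""
    f_carry_all = int(not any(F))   # NOR of the carry flags F_c
    f_sum_all = int(not any(f))     # NOR of the sum flags f_c
    f_all = int(not (f_carry_all and f_sum_all))  # NAND: low when both are done
    return f_carry_all, f_sum_all, f_all
```

After initialization (all flags high) the model gives `(0, 0, 1)`; once every Fc and fc has fallen it gives `(1, 1, 0)`, i.e. Fall low signals a valid sum.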
3.2 Timing verification of the detection mechanism

The lowest part of Fig. 5 shows the logic simulation of a 64 bit adder for two different pairs of input operands. However, asynchronous operation demands a careful circuit level analysis of the timing issues to be taken into account by the layout design. We designed the 0.35 µm CMOS layout implementation of the dynamic cells involved in an SCS architecture. Transistor level schematics of the most relevant cells are reported in Fig. 7. Table 3 reports the delay characterization data of the cells, obtained by layout-level Spice simulation taking into account real in-circuit load capacitances including interconnects.

To satisfy the initial conditions assumed for correct operation, high- and low-precharged multiplexer cells are respectively used in the two internal adders of a block. The multiplexer chain generating Fc is implemented with high-precharged multiplexers. All the remaining gates are implemented as static CMOS. The final NOR gates producing Fsumall and Fcarryall are implemented in pseudo-NMOS style [22], similarly to [6], sacrificing power consumption in favor of speed.

The total load capacitance of the Start signal per block can be estimated as 3M + 1 multiplexers. So the pre-charge load capacitance is 3N + H cells, leading to a worst case of 2.66 pF (224 precharged cells) for 64 bit operands with 2 bit blocks. Interconnect capacitance is not included in this estimate, but in our specific circuit simulation we used a layout level estimate of the precharge load.

By inspecting Table 3, it is possible to observe that the proposed general implementation satisfies the timing constraints defined in the previous section. Specifically, Fig. 8 shows the Spice simulation of a 2-bit SCS block circuit. The two input patterns correspond to critical timing conditions.
In the upper diagram (00₂ + 01₂), the block anticipates its carry: the signal cp1 gets low with no glitches at time 0.71 ns, causing Cc to get low at time 0.86 ns and Fc at time 1.04 ns. The operation meets constraint 2. In the lower diagram (00₂ + 11₂), the unresolved block receives Fc-1 low at time 0.64 ns, flagging that Cc-1 is valid. Fc gets low only at 0.98 ns, to ensure that cp1 and cp0 remain unchanged, meeting constraint 1. We finally observe that for M=2 the shortest possible time for Fc to fall (when the block anticipates the carry) is 3 gate delays (0.44 ns after pre-charge in Fig. 8, addition 00₂ + 01₂), while the potential sums are valid after 2 gate delays (actually 0.38 ns). As a result fc is not needed, so that the OR gate producing fc (Fig. 6) can be omitted.
4. Performance evaluation

The nominal detection overhead of the proposed architecture (Fig. 5 and 6) is noticeably low: fc is nominally as fast as the potential sum completion (at least for small M); Fc is at most one gate delay slower than the carry Cc. The pre-charge phase is overlapped with the evaluation of the operands, and the NOR gates producing Fall operate concurrently with the final sum selection. In the worst case the SCS adder behaves like a fixed-time fixed-block CS adder. For a 64-bit operand, 2-bit block implementation, Spice simulation reports a worst case detection time of 5.54 ns, with a corresponding worst case addition time of 5.42 ns.

A circuit-level estimate of the global average performance cannot be obtained by circuit level simulation, because of the impractical number of random input patterns required together with the relatively high transistor count (>1000). We can reach an estimate of the detection time tdetect by multiplying the nominal delays Ac, Lc, and Sc by the circuit level delay of the gates involved in the sum construction and carry construction. Thus, conservatively assuming that fc falls one OR gate delay after the sum construction and Fc falls one multiplexer delay after the selection of the carry, we obtain (symbols defined in Table 3):
$$t_{detect} = \Delta_{prec} + \max\bigl(\max_c(A_c \Delta_{muxadd} + \Delta_{or}),\ \max_c(L_c \Delta_{muxsel} + S_c \Delta_{muxadd} + \Delta_{muxFc})\bigr) + \max(\Delta_s, \Delta_c) + \Delta_a \tag{13}$$

from which, by inspecting Table 3 and applying simple algebraic transformations, we can obtain the upper bound

$$t_{detect} < \Delta_{prec} + \max\bigl(\max_c(A_c),\ \max_c(L_c + S_c)\bigr) \cdot \Delta_{muxsel} + \Delta_{or} + \max(\Delta_s, \Delta_c) + \Delta_a \tag{14}$$

where we observe that $\max(\max_c(A_c), \max_c(L_c + S_c))$ is equivalent to $(t_{CS} - 2)$ according to eq. 2, so that we can use the results in Table 2 to evaluate tdetect. In a 2-bit block SCS adder, where the fc signals are not present, the total completion of the addition is directly flagged by Fcarryall (Fig. 5), so that we have

$$t_{detect} < \Delta_{prec} + (t_{CS} - 2) \cdot \Delta_{muxsel} + \Delta_{muxFc} + \Delta_c \qquad \text{only for } M = 2. \tag{15}$$
According to Tables 2 and 3, we can estimate that a 64 bit SCS architecture with 2-bit block size, implemented in 0.35 µm CMOS technology, has an average detected addition time of 1.63 ns, including pre-charge time, for uniformly distributed operands.
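The bounds of eqs. (14) and (15) are plain arithmetic once the nominal tCS (Table 2) and the cell delays (Table 3) are known. The sketch below only encodes the two formulas; the function and parameter names are hypothetical, and the delay values must be supplied by the caller from the tables in a consistent time unit.

```python
def t_detect_bound(t_cs, d_prec, d_muxsel, d_or, d_s, d_c, d_a):
    """Upper bound of eq. (14) for a general block size."""
    return d_prec + (t_cs - 2) * d_muxsel + d_or + max(d_s, d_c) + d_a

def t_detect_bound_m2(t_cs, d_prec, d_muxsel, d_muxfc, d_c):
    """Upper bound of eq. (15), valid only for 2-bit blocks (no f_c signals)."""
    return d_prec + (t_cs - 2) * d_muxsel + d_muxfc + d_c
```

Both functions take the nominal value of tCS in unit gate delays and return a time in the unit used for the cell delays.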
For a 64-bit adder with 8-bit blocks we obtain an estimated 2.37 ns average detected addition time and a 3.02 ns worst case. The architecture with M>2 can therefore be employed if a faster worst-case performance is required or if M>2 is statistically better for the application.

A general reliable evaluation of the area overhead of the SCS completion detection is not easy, as it should be done at layout level. For a 64 bit operand, 2 bit block implementation, we obtain 5413 transistors, the completion detection overhead being 1249 transistors, i.e. 23% of the total transistor count. Block sizes greater than 2 would imply a slightly higher overhead due to the fc signals. This is an interesting result for over 50% improvement of the CS average performance, as conventional improvements of the carry select approach often involve higher area overhead [12].

5. Conclusions

We have presented the theoretical justification, the design and the circuit level verification of a completion detecting CS adder architecture. Gate level performance evaluation has shown that the nominal improvement over a fixed-time CS adder is more than 50%, even when comparing the new architecture with a fixed-time variable-block size CS adder, which comes at the cost of a less regular layout. Furthermore, the worst case detection time is comparable with existing fixed-time CS adders, which is a considerable result because completion detecting techniques that are fast on average may have poor worst case performance. Other approaches, such as speculative dual-rail ripple adders [20] and SCLA [6], though simpler to design, suffer from a slow worst-case response compared to medium-high-speed fixed-time adders. The area overhead can be limited to 23% for the fastest architecture configuration. At the circuit level, a conservative performance estimation of a 64-bit 0.35 µm CMOS implementation shows an average addition time of 1.6 ns and a worst-case addition time of 4.96 ns.

6. References

[1] M. Afghahi and C. Svensson, "Performance of Synchronous and Asynchronous Design Scheme for VLSI Systems", IEEE Trans. on Computers, 41(7):838-872, July 1992.
[2] O.J. Bedrij, "Carry Select Adder", IRE Trans. on Electronic Comp., 11:340-346, 1962.
[3] R.P. Brent and H.T. Kung, "A regular layout for parallel adders", IEEE Trans. on Comp., C-31:260-264, 1982.
[4] B. Briley, "Some new results on average worst case carry", IEEE Trans. on Computers, C-22:459-463, 1973.
[5] A.W. Burks, H. Goldstine and J. von Neumann, "Preliminary discussion on the logical design of an electronic computing instrument", Tech. Report, The Institute of Advanced Study, Princeton, NJ, 1947.
[6] A. DeGloria and M. Olivieri, "Statistical Carry Lookahead Adders", IEEE Trans. on Comp., 45(3):340-347, Mar. 1996.
[7] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publ., Palo Alto, CA, 1990.
[8] IEEE Design and Test of Computers, Special Issue on Asynchronous Circuit and Systems, 11(2), Summer 1994.
[9] D. Johnson and V. Akella, "Design and Analysis of Asynchronous Adders", IEE Proceedings - Comp. and Dig. Tech., 145(1), 1998.
[10] D.J. Kinniment, "An Evaluation of Asynchronous Addition", IEEE Trans. on VLSI Systems, 4(1):137-140, Mar. 1996.
[11] M. Lehman and N. Burla, "Skip Techniques for High Speed Carry propagation in Binary Arithmetic Units", IRE Trans. on Electronic Comp., 10:691-698, 1961.
[12] T. Lynch and E.E. Swartzlander, "A Spanning Tree Carry Lookahead Adder", IEEE Trans. on Computers, 41(8):931-939, Aug. 1992.
[13] O.L. MacSorley, "High speed arithmetic in binary computers", IRE Proc., 49:67-91, 1961.
[14] M. Olivieri, "Analitic model of the average addition time of Carry Select addition", Technical Report, Univ. of Rome 1, Italy. Available at ftp://ss1.ing.uniroma1.it/pub/reports/olivieri2.pdf
[15] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, NY, 1987.
[16] Proceedings of the 4th International Symposium on Advanced Research in Asynchronous Circuits and Systems, IEEE Computer Society Press, San Diego, CA, April 1998.
[17] G.W. Reitwiesner, "The determination of carry propagation length for binary addition", IRE Trans. on Electronic Comp., 9:35-38, 1960.
[18] C.L. Seitz, "System Timing", in C. Mead and L. Conway, Introduction to VLSI Systems Design, pp. 218-262, Addison Wesley, 1980.
[19] J. Sklansky, "An evaluation of Several Two-summand Binary Adders", IRE Trans. on Elect. Computers, EC-9:213-226, 1960.
[20] Yun Y. K., Nowick S.M., Beerel P. A., Dooply A. E., "Speculative Completion for the Design of High Performance Asynchronous Dynamic Adders", Proceedings of ASYNC 97, Eindhoven, The Netherlands, 1997.
[21] Yun Y. K., Chou W., Beerel P. A., Ginosar R., Kol R., Myers C.J., Rotem, Stevens, K., "Average case optimized technology mapping for one-hot dual-rail domino circuits", Proceedings of ASYNC 98, San Diego, CA, April 1998.
[22] N.H. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison Wesley, 1993.
Table 1 - Fixed-time CS adder delay for homogeneous (fixed) block size and variable block size. For variable block size, M is intended as the first block's size.

N     Scheme     Optimal M          Add. time
8     Hom. CS    2                  8
8     Var. CS    2 (first block)    7
16    Hom. CS    4                  10
16    Var. CS    2 (first block)    9
24    Hom. CS    4                  12
24    Var. CS    2 (first block)    10
32    Hom. CS    4                  14
32    Var. CS    2 (first block)    11
40    Hom. CS    5                  15
40    Var. CS    2 (first block)    12
48    Hom. CS    6                  16
48    Var. CS    2 (first block)    13
56    Hom. CS    7                  17
56    Var. CS    2 (first block)    14
64    Hom. CS    8                  18
64    Var. CS    2 (first block)    14
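As a rough cross-check of the homogeneous (Hom. CS) entries of Table 1, the sketch below assumes a unit-delay model in which an N-bit adder split into blocks of M bits takes M + N/M + 2 delay units. This formula is not stated in the table; it is an assumption chosen only because it matches the listed values. The function names and the tie-breaking rule (smallest M among the block sizes that divide N) are likewise hypothetical.

```python
from typing import Tuple


def hom_cs_delay(n: int, m: int) -> int:
    """Delay of a homogeneous CS adder under the ASSUMED model M + N/M + 2.

    The model is an illustration fitted to Table 1, not a formula taken
    from the paper.
    """
    if n % m != 0:
        raise ValueError("block size m must divide the word length n")
    return m + n // m + 2


def best_block_size(n: int) -> Tuple[int, int]:
    """Return (optimal M, delay) for an n-bit homogeneous CS adder.

    Only block sizes that divide n are considered; ties are resolved in
    favour of the smallest M (a hypothetical rule).
    """
    candidates = [m for m in range(1, n + 1) if n % m == 0]
    best_m = min(candidates, key=lambda m: (hom_cs_delay(n, m), m))
    return best_m, hom_cs_delay(n, best_m)


if __name__ == "__main__":
    for n in range(8, 65, 8):
        m, delay = best_block_size(n)
        print(f"N={n:2d}  optimal M={m}  delay={delay}")
```

Under these assumptions the sketch gives the same optimal M and delay as the Hom. CS rows for N = 8 to 64. It does not model the variable block size scheme.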
Table 2 - Statistical evaluation of the average value of tCS referring to fixed block size.

N     M=16    M=14    M=12    M=10    M=8     M=6     M=4     M=2     M=1 (tRCA)
8     ---     ---     ---     ---     ---     ---     5.962   5.215   5.500
16    ---     ---     ---     ---     7.091   6.821   6.474   5.562   6.500
24    ---     ---     7.737   7.692   7.538   7.301   6.708   5.826   7.084
32    8.183   8.163   8.142   8.097   7.847   7.473   6.835   6.031   7.500
40    8.438   8.420   8.351   8.238   8.077   7.605   6.916   6.199   7.821
48    8.655   8.614   8.559   8.367   8.266   7.910   6.976   6.339   8.084
56    8.832   8.779   8.696   8.512   8.387   8.288   7.000   6.456   8.307
64    9.002   8.923   8.819   8.673   8.549   8.399   7.035   6.535   8.500
Table 3 - Delay characterization of the dynamic cells used in the SCS circuit.

Delay meaning                  symbol     type    delay [ns]
pre-charge                     ∆prec      -       0.60
high-prec. mux (buffered)      ∆muxadd    fall    0.11
low-prec. mux (buffered)       ∆muxadd    rise    0.12
high-prec. mux (unbuffered)    ∆muxadd    fall    0.04
low-prec. mux (unbuffered)     ∆muxadd    rise    0.05
Cc select. mux                 ∆muxsel    fall    0.15
                                          rise    0.15
Fc mux                         ∆muxFc     fall    0.18
sum sel. mux                   Q          fall    0.14
NAND                           P          fall    0.15
                                          rise    0.11
XOR                            P          fall    0.26
                                          rise    0.26
fc OR (>2 in.)                 ∆or        fall    >0.28
Fsumall NOR (32 inputs)        ∆s         rise    0.18
Fcarryall NOR (32 inputs)      ∆c         rise    0.18
Fall NAND                      ∆a         rise    0.14
[Figure: CS blocks, each built from M-bit RCA pairs evaluated with carry-in 0 and carry-in 1; signals Cin, C0, C1, Cn-2, Cout.]
Fig. 1 - Global diagram of a carry select adder
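The block organization of Fig. 1 can be illustrated with a small bit-level functional model. This is a minimal sketch of the generic carry select principle, not the circuit described in this paper: every block computes two ripple carry sums, one per possible carry-in, and the incoming block carry selects between them. The function names, the default sizes and the fact that the first block is also duplicated are choices made for the illustration.

```python
def ripple_carry_add(a_bits, b_bits, carry_in):
    """Add two little-endian bit lists; return (sum_bits, carry_out)."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def carry_select_add(a, b, n=16, m=4, carry_in=0):
    """Functional model of an n-bit carry select adder with m-bit blocks.

    Returns (sum modulo 2**n, carry_out). For simplicity every block,
    including the first, evaluates both carry-in hypotheses.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result = 0
    carry = carry_in
    for start in range(0, n, m):
        blk_a = a_bits[start:start + m]
        blk_b = b_bits[start:start + m]
        # Potential sum and carry pairs for carry-in 0 and carry-in 1.
        sum0, cout0 = ripple_carry_add(blk_a, blk_b, 0)
        sum1, cout1 = ripple_carry_add(blk_a, blk_b, 1)
        # The incoming carry acts as the multiplexer select signal.
        chosen, carry = (sum1, cout1) if carry else (sum0, cout0)
        for i, bit in enumerate(chosen):
            result |= bit << (start + i)
    return result, carry


if __name__ == "__main__":
    total, cout = carry_select_add(0xC83B, 0xD83B, n=16, m=4)
    print(hex(total), cout)
```

In hardware the two sums of every block are computed in parallel, so only the selection step depends on the carry coming from the previous block; the sequential loop above reproduces the result, not the timing.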
[Figure: two timelines with intervals P, maxc(Ac), maxc(Sc+Lc), Q. First case: propagate bit vector ready, potential sum pairs ready, potential carry pairs ready, final sum ready. Second case: propagate bit vector ready, potential carry pairs ready, potential sum pairs ready, final sum ready.]
Fig. 2 - Symbolic timing diagram of CS addition operation.
[Figure: block schematic with inputs a0..a3 and b0..b3, generate and propagate signals g0..g3 and p0..p3, sum outputs s0..s3, potential carries cp0 and cp1, incoming carry Cc-1 and outgoing carry Cc.]
Fig. 3 - Structure of a single block of a CS adder for M=4
[Figure: example with M=4, H=4. Propagate bits: p0=1, p1=0, p2=0, p3=0; p4=1, p5=1, p6=1, p7=0; p8=1, p9=1, p10=1, p11=1; p12=0, p13=0, p14=1, p15=1. Annotations: σ1=1, σ2=1, σ3=5, σ4=3.]
Fig. 4 - Symbolic representation of the CS adder configuration before starting the carry selection
[Figure: chain of SCS blocks with signals start, Cin, C0, C1, C2, Cout, F0..F3, f0..f3, Fsumall, Fcarryall and Fall. Logic level waveforms of a[0:63], b[0:63], start and F_all for two cases: a = 0000000000000000, b = 0000000000000000; and a = 00000000C84DC83B, b = 00000000CBB9D83B.]
Fig. 5 - Global view of the completion detecting CS adder architecture with its logic level behavior for two different addition cases
[Figure: block schematic with signals start, g0..g3, p0..p3, s0..s3, cp0, cp1, fc, F'c, Fc-1, Fc, Cc-1 and Cc.]
Fig. 6 - Completion detecting CS single block architecture (gates generating g0..g3 and p0..p3 are omitted).
[Figure: transistor level schematics of a multiplexer with high-precharged output (precharge input), a multiplexer with low-precharged output (~precharge input), and multi-input gates with inputs a..g; supply Vcc.]
Fig. 7 - Transistor level schematics of some relevant components of the SCS block circuit. The noninverting buffers depicted with dashed lines are inserted every two multiplexers in the chain.
[Figure: waveforms of start, cp0, cp1, Cc-1, Cc, Fc-1 and Fc for the cases 00 + 01 with Cc-1 = 0 and 00 + 11 with Cc-1 = 1. Vertical axes: voltage [V], from -0.5 to 3.5. Horizontal axis: time [ns], from 0.00 to 1.40. The end of precharge is marked.]
Fig. 8 - Spice simulation of a single 2-bit SCS block with two different input operands. The Start signal is probed before the pre-charge buffering circuitry. The supply is 3.3V.
List of Figure Captions
Fig. 1 - Global diagram of a carry select adder
Fig. 2 - Symbolic timing diagram of CS addition operation.
Fig. 3 - Structure of a single block of a CS adder for M=4
Fig. 4 - Symbolic representation of the CS adder configuration before starting the carry selection
Fig. 5 - Global view of the completion detecting CS adder architecture with logic level view of the behavior in two different addition cases
Fig. 6 - Completion detecting CS single block architecture (gates generating g0..g3 and p0..p3 are omitted)
Fig. 7 - Transistor level schematics of some relevant components of the SCS block circuit.
Fig. 8 - Spice simulation of a single 2-bit SCS block with two different input operands. The Start signal is probed before the pre-charge buffering circuitry. The supply is 3.3V.