A New Algorithm for Carry-Free Addition of Binary Signed-Digit Numbers

2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines

Klaus Schneider and Adrian Willenbücher
Embedded Systems Group, University of Kaiserslautern, Kaiserslautern, Germany
{schneider, willenbuecher}@cs.uni-kl.de

Abstract: Signed-digit (SD) numbers generalize traditional radix numbers by allowing negative digits within a certain range. Typically, this leads to redundant number representations that can be used to avoid the carry propagation problem of addition of radix numbers. Unfortunately, as proved by Avizienis, the standard algorithm for carry-free addition of SD numbers does not work for the binary case. In this paper, we therefore construct a special algorithm for the carry-free addition and subtraction of binary SD numbers, i.e., addition and subtraction of $n$-digit numbers are performed with circuits of depth $O(1)$ and size $O(n)$. This is possible by computing, in addition to the transfer digits used by the standard algorithm, one additional bit that allows us to distinguish relevant cases to avoid propagation of dependencies. The additional bit and the transfer digit used to compute the sum digit at position $i$ depend only on the summands' digits at positions $i$ and $i-1$ so that all sum digits can be computed with a hardware circuit of a depth that is independent of the number of digits. We first explain the basics of the standard addition algorithm to derive the additional information needed to fix the algorithm for the binary case. After proving the correctness of our algorithm, we present experimental results that show that our implementation clearly outperforms two's complement addition even for small numbers, and saves 50% of the required chip area compared to other carry-free implementations.

Although there are many other number systems, simple radix numbers [...] are still popular in computer arithmetic. An $n$-digit radix-$B$ number is thereby given as a sequence of digits $[x_{n-1}, \dots, x_0]$ with $x_i \in \{0, \dots, B-1\}$ that denotes the following natural number:

$$[x_{n-1}, \dots, x_0]_B := \sum_{i=0}^{n-1} x_i \cdot B^i$$

It is well known that the addition of radix-$B$ numbers suffers inherently from carry propagation: In the worst case, a carry is generated when adding the least significant digits $x_0$ and $y_0$, and is then propagated from the rightmost digits $x_0, y_0$ to the leftmost digits $x_{n-1}, y_{n-1}$. As a consequence, simple carry-ripple adders have depth $O(n)$, where the depth of a circuit is the length of the longest path from inputs to outputs. Even though this can be reduced to a depth of $O(\log(n))$, e.g., by carry-lookahead adders [1], the depth still grows with the number of digits.

Circuits with a depth depending on the number of digits $n$ limit the clock speed of synchronous circuits in terms of $n$. For radix-$B$ numbers, it is not difficult to see that addition, subtraction, multiplication, division and comparison operations of $n$-digit numbers all require a depth of at least $O(\log(n))$ since the digits of the results depend on all digits of the operands. For all basic operations, optimal $O(\log(n))$ algorithms are known, even though these sometimes require substantial mathematical effort [2]–[4].

Since this minimal $O(\log(n))$ depth cannot be improved for radix-$B$ numbers, one has to consider non-conventional number systems for improvements. For example, residue number systems (RNS) [5], [6] encode a number $x$ by its moduli $(x_1, \dots, x_n) := ((x \bmod p_1), \dots, (x \bmod p_n))$ that are unique for numbers $x \in \{0, \dots, (\prod_{i=1}^{n} p_i) - 1\}$ for pairwise relatively prime numbers $p_i$. Addition, subtraction, and multiplication can be done in parallel on the moduli, and thus, with a depth $O(1)$. Division can only be done by iterative methods like Newton-Raphson or Goldschmidt iteration which lead again to a depth of $O(\log(n))$. The main problems for RNS numbers are, however, that comparison ( [...]
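To make the carry propagation problem of radix-$B$ addition concrete, the following minimal Python sketch adds two digit sequences in carry-ripple fashion and reports the longest run of consecutive positions that produce a carry. The function names `ripple_add` and `value` are illustrative assumptions and do not appear in the paper; the sketch models the arithmetic only, not a circuit.

```python
def value(digits, B):
    """Value of an MSB-first digit list [x_{n-1}, ..., x_0] in radix B."""
    v = 0
    for d in digits:
        v = v * B + d
    return v


def ripple_add(x, y, B):
    """Add two equal-length MSB-first radix-B digit lists.

    Returns (sum_digits, longest_carry_chain). sum_digits has one extra
    leading digit that holds the final carry.
    """
    assert len(x) == len(y)
    carry, chain, longest = 0, 0, 0
    out = []
    # Walk from the least significant (rightmost) digit to the left.
    for xi, yi in zip(reversed(x), reversed(y)):
        total = xi + yi + carry
        out.append(total % B)
        carry = total // B
        # Count how many consecutive positions hand a carry to the left.
        chain = chain + 1 if carry else 0
        longest = max(longest, chain)
    out.append(carry)
    return out[::-1], longest


# Worst case for n = 8 bits: 255 + 1 ripples through every position.
digits, chain = ripple_add([1] * 8, [0] * 7 + [1], 2)
```

In this worst-case input the carry generated at position 0 travels through all eight positions, which is the behavior that forces a depth growing with $n$.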


[...] such that the sum digit $s_i$ can be computed as $s_i = t_i + w_i$. Our algorithm computes an additional condition $\ell_i$ that stores some important information to define the transfer and sum digits. Our transfer digits $t_i$ depend on the operand digits $x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$ and the additional condition $\ell_i$ depends on $x_{i-1}, y_{i-1}$ only, so that our algorithm still has depth $O(1)$. We implemented our algorithm on FPGAs and compared its speed and area requirements with previous approaches to SD addition and also with a carry-lookahead adder. It turned out that our algorithm is faster than a hybrid carry-lookahead/carry-ripple adder for more than 24 bits on our hardware platform, and requires just about 50% of the chip area of other SD addition circuits.

Our paper is organized as follows: In Section II, we discuss Avizienis' algorithm for adding SD numbers. In Section III, we first analyze why that algorithm does not work for the case of binary numbers, and then develop a solution for this problem in Section III-B. To demonstrate the efficiency of our algorithm, we present experimental results in Section IV.

II. PREVIOUS WORK

In this section, we review known results about signed-digit numbers. To this end, we provide new proofs that allow us to discuss in the next section where the difficulties in defining a carry-free addition for binary SD numbers come from.

A. Signed-Digit Numbers

Avizienis introduced in [7] the following SD numbers to a radix $B > 1$ and a digit set $\{-D, \dots, +D\}$:

Definition 1: Given some number $D$ and a radix $B > 1$, a sequence $[x_{n-1}, \dots, x_0]$ of digits $x_i \in \{-D, \dots, +D\}$ encodes the following integer:

$$[x_{n-1}, \dots, x_0]_{D,B} := \sum_{i=0}^{n-1} x_i \cdot B^i$$

There may be several SD representations of the same number. For example, for $B = 3$ and $D = 2$, the value 5 can be encoded as $[2, -1]$, $[1, 2]$ or $[1, -1, -1]$. To understand the different redundant representations of a number, we list the following well-known theorem without proof:

Theorem 1 (Uniqueness of Division with Remainder): For all integers $x, y \in \mathbb{Z}$ with $y \neq 0$, there are uniquely defined numbers $q, r \in \mathbb{Z}$ with $x = q \cdot y + r$ and $0 \le r < |y|$. We therefore write $q := (x \ \mathrm{div}\ y)$ and $r := (x \bmod y)$.

By the above theorem, we conclude the following result:

Lemma 1 (SD Number Representations): $x = [x_{n-1}, \dots, x_0]_{D,B} = [x'_{n-1}, \dots, x'_0]_{D,B}$ implies $x'_0 = x_0 + k \cdot B$ for some $k \in \mathbb{Z}$.

Proof: Using $y_1 := [x_{n-1}, \dots, x_1]_{D,B}$ and $y'_1 := [x'_{n-1}, \dots, x'_1]_{D,B}$, we obviously have $x = y_1 \cdot B + x_0 = y'_1 \cdot B + x'_0$, and therefore $x'_0 - x_0 = (y_1 - y'_1) \cdot B$ holds. Hence, $x'_0 - x_0$ is a multiple of $B$, so that the proposition holds with $k := y_1 - y'_1$.

Due to the redundant representations of a number, it is not possible to reduce equality testing to checking the equality of the corresponding digits. However, due to the (constant depth) reduction $x = y \Leftrightarrow x - y = 0$, checking equality can be reduced to checking whether the result is zero. This is possible with depth $O(\log(n))$ if zero has a unique representation (i.e., all digits being zero). To be able to check equality of SD numbers, Avizienis therefore imposed that $D < B$ must hold because of the following result:

Theorem 2 (Unique Representation of Zero): The number 0 has a unique representation as SD number $[x_{n-1}, \dots, x_0]_{D,B}$ if and only if $D < B$ holds.

Proof: For any $n$, we have $[x_{n-1}, \dots, x_0]_{D,B} = 0$ for $x_i = 0$. For any other representation $[x'_{n-1}, \dots, x'_0]_{D,B} = 0$ with $x'_0 \neq x_0$, we would have $x'_0 - x_0 = x'_0 = k \cdot B$ with $k \neq 0$ by the previous lemma. However, this is impossible iff $x'_0 \in \{-D, \dots, +D\} \subseteq \{-(B-1), \dots, B-1\}$ holds. Hence, we see that $x_0 = 0$ is uniquely determined for $x = 0$ if and only if $D < B$ holds. Then, we have $[x_{n-1}, \dots, x_1]_{D,B} = 0$, and the same argument applies to the next digit $x_1$, and so on.

For example, we would have $[1, -B]_{D,B} = [-1, B]_{D,B} = 0$ if we allowed $D = B$. Hence, we always assume $D < B$ in the following to ensure the unique representation of 0. This uniqueness result can be generalized to other least significant digits $x_0$: Assume first that $B \le 2 \cdot D$ (and thus $B - D \le D$) holds, so that we can partition the legal digits $\{-D, \dots, +D\}$ into the following intervals:

$$\{-D, \dots, D-B\}, \qquad \{D-B+1, \dots, B-D-1\}, \qquad \{B-D, \dots, D\}$$

By Lemma 1, the digits $\mathcal{D}_{-1} := \{-D, \dots, D-B\}$ and $\mathcal{D}_{+1} := \{B-D, \dots, D\}$ can be mapped to each other by either adding or subtracting $B$, while for the digits $\mathcal{D}_0 := \{D-B+1, \dots, B-D-1\}$ no legal digits are obtained this way. Thus, digits in $\mathcal{D}_0$ are uniquely determined, while digits in either $\mathcal{D}_{-1}$ or $\mathcal{D}_{+1}$ have exactly one alternative. Choosing the alternative, we have to either increment or decrement the next digit $x_{i+1}$, and then the same discussion can be repeated for $x_{i+1}$. However, if $B > 2 \cdot D$ holds, then there are no alternatives left for the digits (since $x_0 - B \le D - B < D - 2 \cdot D = -D$). Hence, to ensure redundancy, we have to impose as second constraint $B \le 2 \cdot D$ (in addition to $D < B$) to obtain the following result:

Lemma 2 (Redundancy of SD Representations): For any SD number $x = [x_{n-1}, \dots, x_0]_{D,B}$ with $D < B \le 2 \cdot D$, the following holds:
• If $(x \bmod B) \in \{0, \dots, B-D-1\}$, then $x_0$ is uniquely defined as $x_0 := (x \bmod B)$.
• If $(x \bmod B) \in \{B-D, \dots, D\}$, then either $x_0 = (x \bmod B)$ or $x_0 = (x \bmod B) - B$ holds, thus there are exactly two solutions for $x_0$.
• If $(x \bmod B) \in \{D+1, \dots, B-1\}$, then $x_0$ is uniquely defined as $x_0 := (x \bmod B) - B$.
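The redundancy of SD representations can be explored with a small brute-force sketch. The helper names `sd_value`, `sd_representations` and `last_digit_choices` are assumptions made for illustration; digit lists are written most significant digit first, as in the text.

```python
from itertools import product


def sd_value(digits, B):
    """Integer encoded by an MSB-first SD digit sequence in radix B."""
    v = 0
    for d in digits:
        v = v * B + d
    return v


def sd_representations(x, n, D, B):
    """All n-digit SD representations of x with digits in {-D, ..., +D}."""
    digit_set = range(-D, D + 1)
    return [list(ds) for ds in product(digit_set, repeat=n)
            if sd_value(ds, B) == x]


def last_digit_choices(x, D, B):
    """Legal least significant digits of x: (x mod B) or (x mod B) - B."""
    r = x % B  # for B > 0, Python guarantees 0 <= r < B
    return sorted(d for d in (r, r - B) if -D <= d <= D)


# The example from the text: B = 3, D = 2, value 5, padded to three digits.
reps = sd_representations(5, 3, 2, 3)
```

For three digits, the enumeration yields the three encodings mentioned in the text (with a leading zero where needed) plus $[1, -2, 2]$, and the least significant digits that occur are exactly the two values $(5 \bmod 3) = 2$ and $2 - 3 = -1$.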


The constraint $D < B$ is added to ensure the unique representation of zero (to ensure that we can check equality of SD numbers) while the second constraint $B \le 2 \cdot D$ is added to ensure a minimal redundancy that can be exploited for a carry-free addition as explained below. Note that Avizienis imposed a stronger second constraint $B < 2 \cdot D$ that then excludes the case $B = 2$. We will see in the following discussion why he did so and why we will not be that strict.

The above lemma is the key to constructing a carry-free addition algorithm: If two SD numbers $[x_{n-1}, \dots, x_0]_{D,B}$ and $[y_{n-1}, \dots, y_0]_{D,B}$ have to be added, we may first consider the expression $[u_{n-1}, \dots, u_0]_{D,B}$ with $u_i := x_i + y_i$. Since each $x_i$ and each $y_i$ are legal digits, we have $-2 \cdot D \le u_i \le 2 \cdot D$. According to Avizienis, each $u_i$ is decomposed into an outgoing transfer digit $t_{i+1}$ and an interim sum digit $w_i$ so that $x_i + y_i = u_i = t_{i+1} \cdot B + w_i$ holds. Due to $-2 \cdot B < -2 \cdot D \le x_i + y_i \le 2 \cdot D < 2 \cdot B$, it follows that $t_{i+1} \in \{-1, 0, +1\}$ holds for all such decompositions. Note that a particular choice $t_{i+1} \in \{-1, 0, +1\}$ determines the range of $u_i = t_{i+1} \cdot B + w_i$, so that we can easily prove the following lemma (note that $D < B \le 2 \cdot D$ implies $-B-D < -2D < -D \le -B+D < 0 < B-D \le D < 2D < B+D$):

Lemma 3: For given digits $x_i, y_i \in \{-D, \dots, +D\}$ with $D < B \le 2 \cdot D$, the number $u_i = x_i + y_i$ can be decomposed as $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D, \dots, +D\}$ and $t_{i+1} \in \{-1, 0, +1\}$ as shown in Table I.

The proof is easily obtained by checking the cases mentioned in Table I.

Table I. Possible decompositions $u_i = x_i + y_i = t_{i+1} \cdot B + w_i$ with $x_i, y_i, w_i \in \{-D, \dots, +D\}$, assuming $D < B \le 2 \cdot D$.

| range of $u_i$ | possible decomposition $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D, \dots, +D\}$ |
|---|---|
| $u_i \in \{-2D, \dots, -D-1\}$ | $(t_{i+1}, w_i) = (-1, B + u_i)$ with $w_i \in \{B-2D, \dots, B-D-1\} \subseteq \{-D+1, \dots, D-1\}$ |
| $u_i \in \{-D, \dots, -B+D\}$ | $(t_{i+1}, w_i) = (-1, B + u_i)$ with $w_i \in \{B-D, \dots, D\} \subseteq \{-D+1, \dots, D\}$ or $(t_{i+1}, w_i) = (0, u_i)$ with $w_i \in \{-D, \dots, -B+D\} \subseteq \{-D, \dots, D-1\}$ |
| $u_i \in \{-(B-D-1), \dots, B-D-1\}$ | $(t_{i+1}, w_i) = (0, u_i)$ with $w_i \in \{-(B-D-1), \dots, B-D-1\} \subseteq \{-D+1, \dots, D-1\}$ |
| $u_i \in \{B-D, \dots, D\}$ | $(t_{i+1}, w_i) = (0, u_i)$ with $w_i \in \{B-D, \dots, D\} \subseteq \{-D+1, \dots, D\}$ or $(t_{i+1}, w_i) = (+1, -B + u_i)$ with $w_i \in \{-D, \dots, -B+D\} \subseteq \{-D, \dots, D-1\}$ |
| $u_i \in \{D+1, \dots, 2D\}$ | $(t_{i+1}, w_i) = (+1, -B + u_i)$ with $w_i \in \{D-B+1, \dots, 2D-B\} \subseteq \{-D+1, \dots, D-1\}$ |

The final step of the computation now consists of computing the sum digits $s_i := w_i + t_i$ by means of the transfer and interim sum digits. We have to make sure that these additions will not produce a carry. For this reason, Avizienis demanded that $w_i \in \{-D+1, \dots, +D-1\}$ must hold, which is also possible according to the following lemma:

Lemma 4: For given digits $x_i, y_i \in \{-D, \dots, +D\}$ with $D < B < 2 \cdot D$, the number $u_i = x_i + y_i$ can be decomposed as $u_i = t_{i+1} \cdot B + w_i$ with $w_i \in \{-D+1, \dots, +D-1\}$ as shown in Table I.

Proof: The proof is easily obtained by checking all the cases mentioned in Table I. Note that the cases with $u_i \in \{-D, \dots, -B+D\}$ and $u_i \in \{B-D, \dots, D\}$ allow two different decompositions and for each case, there is one $u_i$ that produces an interim sum $w_i \notin \{-D+1, \dots, +D-1\}$. In that case, however, we use the other possible decomposition and can therefore ensure $w_i \in \{-D+1, \dots, +D-1\}$.

Since it is possible to find a decomposition with $w_i \in \{-D+1, \dots, +D-1\}$, it is now possible to compute the final sum digits $s_i := w_i + t_i$ without producing a carry! However, the reader might have noted that we had to strengthen the constraint $D < B \le 2 \cdot D$ used before to $D < B < 2 \cdot D$ to make this possible. Based on the above lemma, the carry-free addition due to Avizienis is now as follows:

Theorem 3 (Carry-Free Addition by Avizienis): The addition of SD numbers $x = [x_{n-1}, \dots, x_0]_{D,B}$ and $y = [y_{n-1}, \dots, y_0]_{D,B}$ with $D < B < 2 \cdot D$ can be computed in depth $O(1)$ with $O(n)$ work (gates) as follows:
1) for $i \in \{0, \dots, n-1\}$, compute $u_i := x_i + y_i$
2) for $i \in \{0, \dots, n-1\}$, compute
$$t_{i+1} := \begin{cases} +1 & \text{if } u_i \ge +D \\ -1 & \text{if } u_i \le -D \\ 0 & \text{if } -D < u_i < +D \end{cases}$$
3) for $i \in \{0, \dots, n-1\}$, compute digits $s_i := t_i + \underbrace{u_i - t_{i+1} \cdot B}_{=: w_i}$ with $t_0 := 0$

The final sum is then the SD number $s = [t_n, s_{n-1}, \dots, s_0]_{D,B}$. Each of the above steps can be performed in parallel, so that the sum can be computed in three steps. Moreover, in the cases $u_i \in \{-D, \dots, -B+D\}$ and $u_i \in \{B-D, \dots, D\}$ the algorithm prefers the decomposition with $t_{i+1} = 0$ except for the cases $u_i = \pm D$, where the other possible decomposition is used. This way, we always have $w_i \in \{-D+1, \dots, +D-1\}$ and therefore, the final addition $s_i := t_i + w_i$ produces a legal digit.

Other operations can be implemented as follows:
• Subtraction of $x$ and $y$ can be simply performed by addition of $x$ and $-y = [-y_{n-1}, \dots, -y_0]_{D,B}$ which can also be done with depth $O(1)$ and work $O(n)$ (footnote 2).
• Checking equality of $x$ and $y$ is reduced to checking whether $x - y = 0$ holds. The subtraction can be done with depth $O(1)$ and work $O(n)$, but checking that all obtained digits are zero requires depth $O(\log n)$ and work $O(n)$.
• Comparing $x < y$ is reduced to testing for $x - y < 0$. The subtraction can be done with depth $O(1)$ and work $O(n)$, but checking the sign may require depth $O(\log(n))$ since some of the leading digits can be zero (the sign of the first non-zero digit determines the sign).
• Multiplication can be obtained by adding the partial products $x \cdot y_i \cdot B^i$ which can be arranged with a depth of $O(\log(n))$ and work $O(n^2)$ [14], [15].
• Division can be implemented by multiplication with the integer reciprocal, requiring depth $O(\log(n))$ and work $O(n^2)$ [2].

Hence, SD numbers are an interesting number representation that leads to efficient arithmetic algorithms.
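The three steps of Theorem 3 can be sketched in Python as follows. This is a software model under stated assumptions, not the authors' implementation: the names `avizienis_add` and `sd_value` are hypothetical, digit lists are most significant digit first, and the list comprehensions stand in for what a circuit would compute for all positions in parallel.

```python
def sd_value(digits, B):
    """Integer encoded by an MSB-first SD digit sequence in radix B."""
    v = 0
    for d in digits:
        v = v * B + d
    return v


def avizienis_add(x, y, D, B):
    """Carry-free SD addition for D < B < 2*D (digit lists are MSB first)."""
    assert D < B < 2 * D and len(x) == len(y)
    n = len(x)
    xs, ys = x[::-1], y[::-1]  # after reversal, index i is digit position i
    # Step 1: position-wise sums u_i.
    u = [xs[i] + ys[i] for i in range(n)]
    # Step 2: transfer digits t_{i+1}, each depending on u_i only; t_0 := 0.
    t = [0] + [1 if ui >= D else (-1 if ui <= -D else 0) for ui in u]
    # Step 3: interim sums w_i and final digits s_i = t_i + w_i.
    w = [u[i] - t[i + 1] * B for i in range(n)]
    s = [t[i] + w[i] for i in range(n)]
    return [t[n]] + s[::-1]


# 56 + 65 with B = 10 and D = 6 (so that D < B < 2*D holds).
result = avizienis_add([5, 6], [6, 5], 6, 10)  # -> [1, 2, 1]
```

Since each $t_{i+1}$ is computed from $u_i$ alone, no value travels further than one position, which is the property that gives constant depth.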

B. Binary SD Numbers

Avizienis already noted that his algorithm does not work for binary SD numbers for the reasons we explained in the previous section. Using the weaker constraints $D < B \le 2 \cdot D$, we can reconsider Table I that reduces for $B = 2$ and $D = 1$ to the following decompositions:

| $u_i$ | $(t_{i+1}, w_i)$ |
|---|---|
| -2 | (-1, 0) |
| -1 | (0, -1) or (-1, +1) |
| 0 | (0, 0) |
| +1 | (0, +1) or (+1, -1) |
| +2 | (+1, 0) |

As can be seen, there is no decomposition that always achieves $w_i \in \{-D+1, \dots, +D-1\} = \{0\}$. For this reason, it was widely accepted that there is no carry-free addition for general binary SD numbers.

One possible solution is to consider a radix $B = 2^k$ and to represent the digits $x_i$ as two's complement numbers with $k+1$ bits. The disadvantage is that the depth is increased to $O(\log(k))$ (due to the addition of two's complement numbers with $k$ bits), as considered in [11]. Since small numbers $k$ can be chosen, this may still be a practical solution. Many papers also consider variants of these SD number representations, e.g., using asymmetric digit sets [12].

As another solution, Parhami [8] suggested recoding a given binary SD number $x$ of length $n$ to an equivalent SD number $x'$ of length $n+1$ such that there are no two neighboring digits $x'_{i+1}$ and $x'_i$ with $x'_{i+1} \cdot x'_i = 1$. Unfortunately, the output of his addition algorithm does not satisfy this condition, so that it has to be recoded again before another addition takes place. This not only increases the required chip area, but also adds further latency to each addition. Other works on recoding SD numbers are discussed in [13].

We therefore considered whether it is possible to construct a direct algorithm for the addition of binary SD numbers despite the problems with the decomposition mentioned in the previous section. As we report in the next section, it turns out that there is indeed such an algorithm, and it can be efficiently implemented in hardware.

III. OUR ALGORITHM FOR CARRY-FREE ADDITION OF BINARY SD NUMBERS

A. Analyzing the Problem

Below, we first analyze the problem for base $B = 2$ and then construct a carry-free binary SD addition algorithm. We have to add two digits $x_i$ and $y_i$ of given numbers plus the transfer digit $t_i$ that comes from the neighboring digits to the right. All of $x_i$, $y_i$, and $t_i$ belong to the digit set $\{-1, 0, +1\}$, and we have to define transfer digits $t_{i+1}$, an interim sum $w_i$, and the final sum digit $s_i$ such that the following constraints hold:
1) $x_i + y_i = 2 \cdot t_{i+1} + w_i$
2) $s_i = t_i + w_i$
3) $t_{i+1}$, $w_i$ and $s_i$ are digits from $\{-1, 0, +1\}$
4) $t_{i+1}$ is defined independently of $t_i$ (to avoid a propagation chain)

To this end, consider Table II: The first three columns list the possible inputs for $x_i$, $y_i$ and $t_i$. The next two columns are values for $t_{i+1}$ and $w_i$ that were computed by the algorithm of the previous section, i.e.,

$$t_{i+1} := \begin{cases} +1 & \text{if } x_i + y_i \ge +1 \\ -1 & \text{if } x_i + y_i \le -1 \\ 0 & \text{otherwise} \end{cases}$$

and $w_i := x_i + y_i - 2 \cdot t_{i+1}$ and $s_i := x_i + y_i + t_i - 2 \cdot t_{i+1}$. As can be seen, the algorithm sometimes computes values for $s_i$ that are not in the allowed range. The symbol "*" in the rightmost column marks these rows (where the algorithm fails) and we have colored these rows in dark gray. It is not difficult to see that a correct result would have been possible, since $[t_{i+1}, s_i]_{1,2} = [-1, +2]_{1,2} = 0 = [0, 0]_{1,2}$ and $[t_{i+1}, s_i]_{1,2} = [+1, -2]_{1,2} = 0 = [0, 0]_{1,2}$ hold.
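The failure of the standard rule for $B = 2$ and $D = 1$ can be reproduced by enumerating all 27 input triples. The sketch below is illustrative only; `standard_step` and `t_independent_choices` are hypothetical helper names. The second helper asks which transfer digits, chosen from $x_i$ and $y_i$ alone, would keep both $w_i$ and $s_i$ legal for every possible incoming $t_i$.

```python
from itertools import product

DIGITS = (-1, 0, 1)


def standard_step(x, y, t):
    """Standard SD rule instantiated for B = 2, D = 1: returns (t_next, w, s)."""
    u = x + y
    t_next = 1 if u >= 1 else (-1 if u <= -1 else 0)
    w = u - 2 * t_next
    return t_next, w, t + w


# Input triples (x, y, t) for which the sum digit s leaves {-1, 0, +1}.
failing = [(x, y, t) for x, y, t in product(DIGITS, repeat=3)
           if standard_step(x, y, t)[2] not in DIGITS]


def t_independent_choices(x, y):
    """Transfer digits t_next that work for all t without looking at t."""
    ok = []
    for t_next in DIGITS:
        w = x + y - 2 * t_next
        if w in DIGITS and all(t + w in DIGITS for t in DIGITS):
            ok.append(t_next)
    return ok


critical = [(x, y) for x, y in product(DIGITS, repeat=2)
            if not t_independent_choices(x, y)]
```

The enumeration finds four failing triples, and the input pairs without any $t_i$-independent choice are exactly those where one operand digit is zero and the other is not, i.e., where $x_i + y_i$ is odd.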

Footnote 2: The work of a parallel algorithm is the number of executed operations, i.e., the number of gates of the corresponding circuit.


Table II. Values of $t_{i+1}$ and $s_i$ for standard SD addition.

| $x_i$ | $y_i$ | $t_i$ | $t_{i+1}$ | $w_i$ | $s_i$ | |
|---|---|---|---|---|---|---|
| -1 | -1 | -1 | -1 | 0 | -1 | |
| -1 | -1 | 0 | -1 | 0 | 0 | |
| -1 | -1 | +1 | -1 | 0 | +1 | |
| -1 | 0 | -1 | -1 | +1 | 0 | + |
| -1 | 0 | 0 | -1 | +1 | +1 | + |
| -1 | 0 | +1 | -1 | +1 | +2 | * |
| -1 | +1 | -1 | 0 | 0 | -1 | |
| -1 | +1 | 0 | 0 | 0 | 0 | |
| -1 | +1 | +1 | 0 | 0 | +1 | |
| 0 | -1 | -1 | -1 | +1 | 0 | + |
| 0 | -1 | 0 | -1 | +1 | +1 | + |
| 0 | -1 | +1 | -1 | +1 | +2 | * |
| 0 | 0 | -1 | 0 | 0 | -1 | |
| 0 | 0 | 0 | 0 | 0 | 0 | |
| 0 | 0 | +1 | 0 | 0 | +1 | |
| 0 | +1 | -1 | +1 | -1 | -2 | * |
| 0 | +1 | 0 | +1 | -1 | -1 | + |
| 0 | +1 | +1 | +1 | -1 | 0 | + |
| +1 | -1 | -1 | 0 | 0 | -1 | |
| +1 | -1 | 0 | 0 | 0 | 0 | |
| +1 | -1 | +1 | 0 | 0 | +1 | |
| +1 | 0 | -1 | +1 | -1 | -2 | * |
| +1 | 0 | 0 | +1 | -1 | -1 | + |
| +1 | 0 | +1 | +1 | -1 | 0 | + |
| +1 | +1 | -1 | +1 | 0 | -1 | |
| +1 | +1 | 0 | +1 | 0 | 0 | |
| +1 | +1 | +1 | +1 | 0 | +1 | |

Table III. Alternative values of $t_{i+1}$ for ($x_i = -1$ and $y_i = 0$).

| $x_i$ | $y_i$ | $t_i$ | $t_{i+1}$ | $s_i$ | $t'_{i+1}$ | $s'_i$ |
|---|---|---|---|---|---|---|
| -1 | 0 | -1 | 0 | -2 | +1 | -4 |
| -1 | 0 | 0 | 0 | -1 | +1 | -3 |
| -1 | 0 | +1 | 0 | 0 | +1 | -2 |

However, we cannot simply change these rows in the table to correct the outputs $t_{i+1}$ and $s_i$, since the computation of $t_{i+1}$ must be independent of $t_i$, and should only depend on $x_i$ and $y_i$. Hence, changing the value of $t_{i+1}$ in a row forces us to make the same change in all rows where $x_i$ and $y_i$ have the same values. We therefore say that two input triples $(x_i, y_i, t_i)$ and $(x'_i, y'_i, t'_i)$ are equivalent iff $x_i = x'_i \wedge y_i = y'_i$ holds. The symbol "+" denotes the rows that are equivalent in this sense to another input that leads to wrong results, and we have colored these rows in a lighter gray. We therefore see that we have four critical input classes $(x_i, y_i, t_i) = (-1, 0, *)$, $(x_i, y_i, t_i) = (0, -1, *)$, $(x_i, y_i, t_i) = (0, +1, *)$, and $(x_i, y_i, t_i) = (+1, 0, *)$ that refer to the decomposition cases in Table I where two decompositions are possible.

Since we have to define a decomposition $2 \cdot t_{i+1} + w_i = x_i + y_i$ independent of $t_i$, there is no solution by the information given in this table. For example, consider the critical input class $(x_i, y_i, t_i) = (-1, 0, *)$: Using $t_{i+1} = -1$ as computed by the algorithm leads to value $s_i = +2$ for $t_i = +1$. Using $t_{i+1} = 0$ instead leads to value $s_i = -2$ for $t_i = -1$, and using $t_{i+1} = +1$ leads to forbidden values of $s_i$ for all values of $t_i$ (see Table III). Thus, it is not possible to define a decomposition for $t_{i+1}$ that only depends on $x_i$ and $y_i$, as remarked by Avizienis!

B. Solution

Our algorithm uses additional information that solves the problem explained in the previous section. As the algorithm describes a hardware circuit, we make use of an encoding of the digits $\{-1, 0, +1\}$ by a pair of booleans $(x.0, x.1)$. There are many encodings of the digits $\{-1, 0, +1\}$, but the following two are the most popular ones:

| Value | sign-value | neg-pos |
|---|---|---|
| -1 | (true, true) | (true, false) |
| 0 | (false, false) | (false, false) |
| +1 | (false, true) | (false, true) |

We choose the neg-pos encoding for our algorithm because it lends itself well to a concise description of the logic equations below; in addition, it makes negating a value a simple swap of the pair's elements.

The key idea of our solution is to choose different decompositions $x_i + y_i = 2 \cdot t_{i+1} + w_i$ in the critical cases (with gray color) of Table II. Since we cannot do this based on $x_i$ and $y_i$ only, and since we are not allowed to consider $t_i$, we introduce a new input $\ell_i$ such that

$$(t_i = +1 \to \neg\ell_i) \wedge (t_i = -1 \to \ell_i)$$

holds, and we generate an output $\ell_{i+1}$ that maintains this property as an invariant

$$(t_{i+1} = +1 \to \neg\ell_{i+1}) \wedge (t_{i+1} = -1 \to \ell_{i+1}) \qquad (1)$$

that is forwarded to the full adder that receives $x_{i+1}$ and $y_{i+1}$ as inputs, while $\ell_i$ is provided in addition to $t_i$ by the full adder for $x_{i-1}$ and $y_{i-1}$. Using $\ell_i$, we can then decide whether we use the one or the other possible decomposition in the critical cases (with gray color) of Table II. Note that $\ell_i$ does not hold the full information of $t_i$, since it is not determined for $t_i = 0$. To establish the above invariant, we define $\ell_{i+1} := x_i.0 \vee y_i.0$ which means that $\ell_{i+1}$ holds if and only if at least one of the digits $x_i$, $y_i$ is $-1$.

We prove that equation (1) holds by inspecting Table IV, where the solution computed by our algorithm is given as the three rightmost columns, and we can also verify that the important equation $x_i + y_i + t_i = 2 \cdot t_{i+1} + s_i$ holds, and that all computed values are legal digits. Note that the inputs in Table IV are arbitrary, but input $\ell_i$ must respect the invariant mentioned above. We use '*' in case its value is a don't care (i.e., if $t_i = 0$). As can be seen, in case of non-critical inputs (those that are not given in gray color), the decomposition of $x_i + y_i = u_i$ into $t_{i+1} \cdot 2 + w_i$ depends only on $x_i$ and $y_i$, while in the critical cases, it also depends on $\ell_i$. Using the information of $\ell_i$, it is possible to choose a decomposition where legal digits are always obtained for $t_{i+1}$ and $s_i$ without generating a carry digit.

It is interesting to note that $\ell_i$ and $\ell_{i+1}$ have strong relationships to $t_i$ and $t_{i+1}$ due to the mentioned invariants. However, $\ell_{i+1}$ only depends on the digits $x_i$ and $y_i$, while $t_{i+1}$ depends on $\ell_i$, but not on $t_i$. This is very important: In principle, we could replace $\ell_i$ by $(t_i = -1)$ without making the equations incorrect. However, the hardware circuit would then suffer from carry propagation since $t_{i+1}$ would then depend on $t_i$.

Figure 1 defines a full adder using the Quartz language [16] that can be cascaded to obtain a carry-free binary SD adder. Inputs are declared by ? while outputs are declared with !. The inputs tin, x, and y are thereby pairs of booleans that encode digits $\{-1, 0, +1\}$ via the neg-pos encoding, i.e., $\varepsilon(x.0, x.1) = (x.0 \Rightarrow -1 \mid 0) + (x.1 \Rightarrow +1 \mid 0)$ maps a pair of booleans to the corresponding digit. The module also makes use of local boolean variables w1, w2, w3, w4, w, u1, and u0. w is thereby defined such that it holds if and only if one of the critical input cases is given (the gray shaded ones in Table IV). Variables u1 and u0 are used to define some common subexpressions.

Table IV. Values of $t_{i+1}$ and $s_i$ for our SD addition algorithm.

| $x_i$ (x) | $y_i$ (y) | $t_i$ (tin) | $\ell_i$ (lin) | $\ell_{i+1}$ (lout) | $t_{i+1}$ (tout) | $s_i$ (s) |
|---|---|---|---|---|---|---|
| -1 | -1 | -1 | T | T | -1 | -1 |
| -1 | -1 | 0 | * | T | -1 | 0 |
| -1 | -1 | +1 | F | T | -1 | +1 |
| -1 | 0 | -1 | T | T | -1 | 0 |
| -1 | 0 | 0 | T | T | -1 | +1 |
| -1 | 0 | 0 | F | T | 0 | -1 |
| -1 | 0 | +1 | F | T | 0 | 0 |
| -1 | +1 | -1 | T | T | 0 | -1 |
| -1 | +1 | 0 | * | T | 0 | 0 |
| -1 | +1 | +1 | F | T | 0 | +1 |
| 0 | -1 | -1 | T | T | -1 | 0 |
| 0 | -1 | 0 | T | T | -1 | +1 |
| 0 | -1 | 0 | F | T | 0 | -1 |
| 0 | -1 | +1 | F | T | 0 | 0 |
| 0 | 0 | -1 | T | F | 0 | -1 |
| 0 | 0 | 0 | * | F | 0 | 0 |
| 0 | 0 | +1 | F | F | 0 | +1 |
| 0 | +1 | -1 | T | F | 0 | 0 |
| 0 | +1 | 0 | T | F | 0 | +1 |
| 0 | +1 | 0 | F | F | +1 | -1 |
| 0 | +1 | +1 | F | F | +1 | 0 |
| +1 | -1 | -1 | T | T | 0 | -1 |
| +1 | -1 | 0 | * | T | 0 | 0 |
| +1 | -1 | +1 | F | T | 0 | +1 |
| +1 | 0 | -1 | T | F | 0 | 0 |
| +1 | 0 | 0 | T | F | 0 | +1 |
| +1 | 0 | 0 | F | F | +1 | -1 |
| +1 | 0 | +1 | F | F | +1 | 0 |
| +1 | +1 | -1 | T | F | +1 | -1 |
| +1 | +1 | 0 | * | F | +1 | 0 |
| +1 | +1 | +1 | F | F | +1 | +1 |

```
module SgnFullAdd((bool*bool) ?tin, ?x, ?y, bool ?lin,
                  (bool*bool) !tout, !s, bool !lout) {
    bool w1, w2, w3, w4, w, u1, u0;
    // define the critical input cases:
    w1 = !x.0 & !x.1 & y.1;   // x==0 & y==+1
    w2 = !x.0 & !x.1 & y.0;   // x==0 & y==-1
    w3 = !y.0 & !y.1 & x.1;   // y==0 & x==+1
    w4 = !y.0 & !y.1 & x.0;   // y==0 & x==-1
    w  = w1 | w2 | w3 | w4;
    u1 = !lin & w;            // tin!=-1 & critical input
    u0 = lin & w;             // tin!=+1 & critical input
    // determine lout := x==-1 | y==-1
    lout = x.0 | y.0;
    // tout.0 holds iff x==y==-1 | tin!=+1 & x+y==-1
    tout.0 = x.0 & y.0 | lin & (w2 | w4);
    // tout.1 holds iff x==y==+1 | tin!=-1 & x+y==+1
    tout.1 = x.1 & y.1 | !lin & (w1 | w3);
    // determine sum digit
    s.0 = tin.0 & !u0 | u1 & !tin.1;
    s.1 = tin.1 & !u1 | u0 & !tin.0;
}
```

Figure 1. Implementation of a Full Adder for Binary SD Numbers

As can be seen, tout only depends on x, y, lin; s depends on tin, lin, x, y; and lout depends on x, y. Therefore, there is no dependency from tin to tout and neither is there one from lin to lout. Dependencies between neighboring full adder modules are shown in Figure 2. As can be seen, a sum digit $s_i$ depends on $x_i, y_i, x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$, while $\ell_i$ depends on $x_{i-1}, y_{i-1}$, and $t_i$ on $x_{i-1}, y_{i-1}, x_{i-2}, y_{i-2}$.

Figure 2. Dependencies of the Variables in SgnFullAdd
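As a cross-check of the full adder logic, the module of Figure 1 can be transliterated into Python and cascaded over all digit positions. This is a minimal sketch under assumptions: the names `sgn_full_add`, `check_full_adder` and `sd_add` are hypothetical, `&` is assumed to bind tighter than `|` in the Quartz equations, and the software loop visits positions one after another although a circuit evaluates all of them in parallel.

```python
from itertools import product

# neg-pos encoding: digit -> (x.0, x.1)
ENC = {-1: (True, False), 0: (False, False), 1: (False, True)}
DEC = {pair: digit for digit, pair in ENC.items()}


def sgn_full_add(tin, x, y, lin):
    """Transliteration of SgnFullAdd; returns (tout, s, lout)."""
    w1 = (not x[0]) and (not x[1]) and y[1]   # x == 0 and y == +1
    w2 = (not x[0]) and (not x[1]) and y[0]   # x == 0 and y == -1
    w3 = (not y[0]) and (not y[1]) and x[1]   # y == 0 and x == +1
    w4 = (not y[0]) and (not y[1]) and x[0]   # y == 0 and x == -1
    w = w1 or w2 or w3 or w4                  # critical input case
    u1 = (not lin) and w
    u0 = lin and w
    lout = x[0] or y[0]
    tout = ((x[0] and y[0]) or (lin and (w2 or w4)),
            (x[1] and y[1]) or ((not lin) and (w1 or w3)))
    s = ((tin[0] and not u0) or (u1 and not tin[1]),
         (tin[1] and not u1) or (u0 and not tin[0]))
    return tout, s, lout


def check_full_adder():
    """Enumerate all inputs that satisfy the invariant on (tin, lin)."""
    for tin, x, y in product(ENC.values(), repeat=3):
        for lin in (False, True):
            if (lin and tin[1]) or ((not lin) and tin[0]):
                continue  # input violates the invariant, skip it
            tout, s, lout = sgn_full_add(tin, x, y, lin)
            assert tout in DEC and s in DEC           # legal digits
            assert not (lout and tout[1])             # lout -> tout != +1
            assert not ((not lout) and tout[0])       # !lout -> tout != -1
            assert DEC[x] + DEC[y] + DEC[tin] == 2 * DEC[tout] + DEC[s]
    return True


def sd_add(x, y):
    """Add two MSB-first binary SD digit lists by cascading full adders."""
    tin, lin = ENC[0], False  # t_0 = 0, for which lin is a don't care
    digits = []
    for xd, yd in zip(reversed(x), reversed(y)):
        tin, s, lin = sgn_full_add(tin, ENC[xd], ENC[yd], lin)
        digits.append(DEC[s])
    digits.append(DEC[tin])   # final transfer digit t_n
    return digits[::-1]


total = sd_add([1, 1], [0, 1])  # 3 + 1 -> [1, 0, 0]
```

Because `tout` is computed from `x`, `y` and `lin` only, and `lout` from `x` and `y` only, the loop could be unrolled into three layers (all `lout`, then all `tout`, then all `s`), which mirrors the dependency structure described for Figure 2.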

chains, resulting in a combination of carry-lookahead and carry-ripple adders. This method is the fastest and the smallest carry-based addition for all but very high bit-width numbers. For Parhami’s method, we chose the signed-value encoding, since it was the one they focused on in [8]. Our benchmarks were set up as follows: • To measure latency, we registered the inputs and outputs of the respective adder implementation. The synthesis and implementation tools were set to optimize for clock frequency, and the given latencies are the minimum clock periods which were still routable. • For area, the design was solely comprised of the adder circuit, with the FPGA’s pins serving as the inputs and the outputs of the adder. The tools were set to optimize for area, and the area is measured in occupied lookup tables (LUTs). For our benchmarks, we assumed that the inputs are given as signed-digit numbers. This is necessary in order to ensure that the input is as general as possible so that the synthesis tools are not able to optimize the circuit unrealistically by exploiting don’t-care conditions. We measured the following benchmarks: • add2: addition circuit with two n-digit inputs and an (n + 1)-digit output • add3: addition circuit with three n-digit inputs and an (n + 2)-digit output

It is not difficult to prove that the following theorem holds where ε(x) maps the pair of booleans x = (x.0, x.1) to a digit {−1, 0, +1} according to the neg-pos encoding: Theorem 4 (Correctness of SgnFullAdd): If x, y, tin are pairs of booleans that encode digits {−1, 0, +1}, and if lin is a boolean such that condition (lin → ¬tin.1) ∧ (¬lin → ¬tin.0) holds, then the following holds for module SgnFullAdd shown in Figure 1: • tout and s encode signed binary digits {−1, 0, +1} • (lout → ¬tout.1) ∧ (¬lout → ¬tout.0) • ε(x) + ε(y) + ε(tin) = 2 ∗ ε(tout) + ε(s) Proof: The proof can be made by an exhaustive enumeration of all cases, which has been performed by means of the Averest tool set. Thus, all bits i , then all transfer digits, and then all sum digits are computed in three parallel steps, thus requiring time O(1). Hence, we obtained a carry-free addition of binary SD numbers without the need to re-encode the inputs. The crucial fact used here is that we can extract enough information from the next less-significant digits to distinguish the cases where forbidden digits for si would be computed within the critical inputs. Note that i does not have the complete information to determine ti since that would lead to a dependency between ti+1 and ti that would introduce a carry chain. C. Conversion to/from Binary Numbers Converting radix-2 or two’s complement numbers to binary SD numbers does not require any logic resources. For a radix-2 number x = [xn−1 , . . . , x0 ], the equivalent SD number x in neg-pos encoding is x .0 := [0, . . . , 0] and x .1 := [xn−1 , . . . , x0 ]; for a two’s complement number x = [xn−1 , . . . , x0 ], an equivalent SD number is x .0 := [xn−1 , 0, . . . , 0] and x .1 := [0, xn−2 , . . . , x0 ]. The correctness of this can be easily seen from the equation n−2 [xn−1 , . . . , x0 ]2C = −xn−1 · 2n−1 + i=0 xi · 2i , where x2C denotes the two’s complement interpretation of a bitvector x. 
To convert an SD number x back to a radix-2 or a two’s complement number, the bitvector [xn−1 .0, . . . , x0 .0] is interpreted as a radix-2 number and subtracted from the radix2 number [xn−1 .1, . . . , x0 .1] (since [xn−1 , . . . , x0 ]1,2 = n−1 i i=0 (xi .1 − xi .0) · 2 ). This requires a single n-bit subtraction which needs time O(log(n)) and returns an (n + 1)-bit radix-2/two’s complement number.

B. Results Table V shows the latency and the maximum frequency of the two-input and the three-input adder for our new addition algorithm and compares it to Parhami’s adder. The values were determined for an input width of n = 64, but they are actually independent of n (with very small deviations due to slight variations in the LUT array and the routing network of the FPGA). We included the values for a 64-bit native addition as a reference. As can be seen, our algorithm is more than 40 % faster than Parhami’s SD addition. It also tends to achieve a frequency which is 50 % higher than a 64-bit native FPGA addition. This is to be expected, since our algorithm has a constant O(1) latency, while the best latency which any carry-based addition can achieve is O(log n). In fact, our algorithm is so efficient that the breakeven point is at n = 24, for which native addition has a latency of 2.11 ns. Interestingly, Parhami’s adder is actually slower than native FPGA addition for the three-input case, even though it is faster for the two-input case. In Table VI, we show the area requirements for the different algorithms. For all of them, the occupied area is proportional to their input width, hence we give the number of LUTs per input digit (measured for n = 64). For example, our method requires 3 LUTs per digit, so adding two 32digit numbers requires 96 LUTs. As expected, three-input

IV. BENCHMARK RESULTS

A. Setup

We implemented our addition algorithm in hardware on a Xilinx Virtex 5 FPGA, along with Parhami’s algorithm [8] and a simple addition of two’s complement numbers for comparison. On these FPGAs, simple addition is implemented using a dedicated carry logic and fast carry


Table V. Latency of two-input and three-input adders in nanoseconds, resp. maximum frequency in MHz.

| Adder | add2 (ns / MHz) | add3 (ns / MHz) |
|---|---|---|
| Our adder | 2.02 / 495 | 3.14 / 318 |
| Parhami’s adder | 2.88 / 347 | 4.97 / 201 |
| Simple adder (64-bit) | 3.19 / 313 | 4.71 / 212 |
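The relative speed claims in the text follow directly from the latencies in Table V. The short sketch below recomputes them; the dictionary keys and function names are illustrative choices, not identifiers from the paper, and the maximum frequency is assumed to be simply the reciprocal of the latency.

```python
# Latencies in nanoseconds, copied from Table V (keys are illustrative names).
LATENCY_NS = {
    "ours":    {"add2": 2.02, "add3": 3.14},
    "parhami": {"add2": 2.88, "add3": 4.97},
    "native":  {"add2": 3.19, "add3": 4.71},
}


def max_frequency_mhz(latency_ns: float) -> float:
    """Assumed relation: f_max in MHz is 1000 divided by the latency in ns."""
    return 1000.0 / latency_ns


def speedup(baseline: str, adder: str, op: str) -> float:
    """Latency of the baseline divided by the latency of the adder."""
    return LATENCY_NS[baseline][op] / LATENCY_NS[adder][op]


if __name__ == "__main__":
    for op in ("add2", "add3"):
        print(op,
              "ours vs. Parhami: %.2fx" % speedup("parhami", "ours", op),
              "ours vs. native: %.2fx" % speedup("native", "ours", op),
              "Parhami vs. native: %.2fx" % speedup("native", "parhami", op))
```

A ratio above 1 means the second adder is faster than the baseline. For add2, the ratio of Parhami’s latency to ours is 2.88 / 2.02 ≈ 1.43, which matches the "more than 40 % faster" statement, while for add3 the ratio of native latency to Parhami’s falls below 1, reflecting that Parhami’s three-input adder is slower than native addition.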

REFERENCES

[1] P. Kogge and H. Stone, “A parallel algorithm for the efficient solution of a general class of recurrences,” IEEE Transactions on Computers (T-C), vol. 22, pp. 786–793, 1973.
[2] P. Beame, S. Cook, and H. Hoover, “Log depth circuits for division and related problems,” in Foundations of Computer Science (FOCS). West Palm Beach, Florida, USA: IEEE Computer Society, 1984, pp. 1–6.
[3] B. Parhami, Computer Arithmetic – Algorithms and Hardware Designs. Oxford University Press, 2000.
[4] M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann, 2003.
[5] H. Garner, “The residue number system,” IRE Transactions on Electronic Computers, vol. 8, pp. 140–147, June 1959.
[6] H. Garner, R. Arnold, B. Benson, C. Brockus, R. Gonzalez, and D. Rozenberg, “Residue number systems for computers,” University of Michigan, Technical Report 61-483, October 1961.
[7] A. Avizienis, “Signed-digit number representations for fast parallel arithmetic,” IRE Transactions on Electronic Computers, vol. 10, no. 3, pp. 389–400, September 1961.
[8] B. Parhami, “Carry-free addition of recoded binary signed-digit numbers,” IEEE Transactions on Computers (T-C), vol. 37, no. 11, pp. 1470–1476, 1988.
[9] B. Parhami, “Generalized signed-digit number systems: A unifying framework for redundant number representations,” IEEE Transactions on Computers (T-C), vol. 39, no. 1, pp. 89–98, January 1990.
[10] S.-H. Shieh and C.-W. Wu, “Asymmetric high-radix signed-digit number systems for carry-free addition,” Journal of Information Science and Engineering, vol. 19, no. 6, pp. 1015–1039, 2003.
[11] G. Jaberipur and M. Ghodsi, “High radix signed digit number systems: Representation paradigms,” Scientia Iranica, vol. 10, no. 4, pp. 383–391, 2003.
[12] S. Gorgin and G. Jaberipur, “A family of high radix signed digit adders,” in Symposium on Computer Arithmetic (ARITH). Tübingen, Germany: IEEE Computer Society, 2011, pp. 112–120.
[13] M. Joye and S.-M. Yen, “Optimal left-to-right binary signed-digit recoding,” IEEE Transactions on Computers (T-C), vol. 49, no. 7, pp. 740–748, 2000.
[14] C. Koc and S. Johnson, “Multiplication of signed-digit numbers,” Electronics Letters, vol. 30, no. 11, pp. 840–841, 1994.
[15] C. Hung and B. Parhami, “Generalized signed-digit multiplication and its systolic realizations,” in Circuits and Systems. Detroit, Michigan, USA: IEEE Computer Society, 1993, pp. 1505–1508.
[16] K. Schneider, “The synchronous programming language Quartz,” Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany, Internal Report 375, December 2009.
[17] S. Arno and F. Wheeler, “Signed digit representation of minimal Hamming weight,” IEEE Transactions on Computers (T-C), vol. 42, no. 8, pp. 1007–1010, August 1993.
[18] A. Booth, “A signed binary multiplication technique,” Quarterly Journal of Mechanics and Applied Mathematics (QJMAM), vol. 4, no. 2, pp. 236–240, 1951.
[19] D. Phatak, T. Goff, and I. Koren, “Constant-time addition and simultaneous format conversion based on redundant binary representations,” IEEE Transactions on Computers (T-C), vol. 50, 2001.
[20] G. Reitwiesner, Advances in Computers. Academic Press, 1960, ch. Binary Arithmetic.

Table VI. Area requirements of two-input and three-input adders in LUTs per input bit.

| Adder | add2 | add3 |
|---|---|---|
| Our adder | 3.0 | 6.0 |
| Parhami’s adder | 7.3 | 14.7 |
| Two’s complement adder | 1.0 | 2.0 |
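Because the occupied area is proportional to the input width, the LUT count for any width can be estimated from the per-digit figures in Table VI. The following sketch does this; the names are illustrative, and the linear scaling is taken from the text rather than measured for widths other than n = 64.

```python
# LUTs per input digit, copied from Table VI (keys are illustrative names).
LUTS_PER_DIGIT = {
    "ours":    {"add2": 3.0, "add3": 6.0},
    "parhami": {"add2": 7.3, "add3": 14.7},
    "native":  {"add2": 1.0, "add3": 2.0},
}


def lut_count(adder: str, op: str, n_digits: int) -> float:
    """Estimated LUTs for an n-digit adder, assuming area linear in n."""
    return LUTS_PER_DIGIT[adder][op] * n_digits


def area_ratio(adder: str, baseline: str, op: str) -> float:
    """Area of the adder relative to the baseline (width cancels out)."""
    return LUTS_PER_DIGIT[adder][op] / LUTS_PER_DIGIT[baseline][op]


if __name__ == "__main__":
    print(lut_count("ours", "add2", 32))                      # 32-digit add2
    print(round(area_ratio("ours", "parhami", "add2"), 2))    # vs. Parhami
    print(round(area_ratio("ours", "native", "add2"), 2))     # vs. native
```

With these figures, a 32-digit two-input adder of our design comes to 3.0 · 32 = 96 LUTs, its area is about 0.41 of Parhami’s adder, and it is three times that of the native adder.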

adders need twice the area of two-input adders, since they are just two adders in sequence. Our method requires three times as much area as the native addition algorithm, and less than half the area of Parhami’s algorithm. Note that in an ASIC implementation, our algorithm would likely perform even better relative to a two’s complement adder, since the latter benefits from the dedicated carry-propagation chain on the FPGA, an advantage that would not exist on an ASIC.

V. CONCLUSION

We developed an algorithm for adding binary SD numbers that does not require the recoding step of previous approaches [8]. Our algorithm makes use of an additional input l_i to determine suitable transfer and interim sum digits and thereby avoids carry generation. By implementing our addition algorithm on an FPGA, we showed that our method is approximately 40 % faster and needs less than half as much area as previous approaches to binary SD addition. It has a lower latency than even the fastest carry-based two’s complement addition for input widths as low as 24 bits, allowing it to be used as a replacement in many practical, latency-critical hardware designs.

