Block Save Addition with Threshold Logic

S. Vassiliadis

S. Cotofana

J. Hoekstra



Electrical Engineering Dept., Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands.

Abstract

In this paper we investigate small depth linear threshold element networks for multi-operand addition. We consider depth-2 linear threshold element networks and block save addition. We improve the overall cost of the block save addition, both in terms of gates and of wires (the latter computed as the sum, over all gates, of each gate's number of inputs), by including the telescopic sums proposed by Minnick together with a minimization technique based on gate sharing. We show that previously proposed schemes require about twice the number of linear threshold gates for common operand lengths. Furthermore, we show that the number of wires required by an implementation of previously proposed schemes is also about two times higher than the number of wires required by the scheme we describe, for commonly architected operand sizes.

1 Introduction

A linear threshold gate computing a Boolean function F(X) is defined by the following:

F(X) = sgn(𝓕(X)) = 1 if 𝓕(X) ≥ 0, and 0 if 𝓕(X) < 0

with 𝓕(X) = Σ_{i=1}^{n} ω_i x_i + ψ. The threshold element comprises a set of input variables X = (x_1, x_2, …, x_{n−1}, x_n) and a set of weights Ω = (ω_1, ω_2, …, ω_{n−1}, ω_n). Additionally, the element includes a threshold value ψ, a summation device Σ computing 𝓕(X), and a threshold element T computing F(X) = sgn(𝓕(X)).

The addition of multi-operand matrices is central to the hardware implementation of higher order arithmetic operations such as multiplication [1, 2, 3, 4]. Block save addition (BSA) is a technique that has been used in conjunction with linear threshold gates for the reduction of multi-operand addition matrices into two rows [5, 6, 7, 8]. It has been shown [5] that block save addition can be implemented with depth-2 linear threshold element networks. Lauwereins and Bruck [6] have introduced two minimizations that substantially reduce the overall cost of the block save realization. It has been suggested in [6, 7] that the cost of a depth-2 network, in terms of the total number of wires per gate, is in the order of O(N^3 log N), that the number of linear threshold gates required for the network is in the order of O(N^2), and that the maximum fan-in is in the order of O(N log N). In [8] we reduced the implementation costs, both in threshold gates and in wires, by using a technique introduced by Minnick [9], denoted the telescopic sum, which substantially reduces the cost of depth-2 linear threshold element networks.

In this paper we propose a scheme for block save addition and multiplication which incorporates Minnick's telescopic sums and a minimization based on gate sharing. The scheme substantially reduces the overall cost for the block save addition and for the entire multi-operand addition network. In this approach we assume the same order of magnitude for the fan-in and the weight requirements as in [6, 8], i.e., a maximum fan-in in the order of O(N log N) and weight values in the order of O(N). Our scheme provides the same asymptotic bounds for the network's cost as the scheme presented in [6, 7], but for practical cases we show that the Lauwereins/Bruck scheme, depending on the size of the matrices, requires between 2.153 and 2.729 times more gates and between 1.921 and 3.016 times more wires.

The discussion is organized as follows: First we introduce the new minimization techniques and estimate the corresponding cost of the network for block save addition and for the reduction of the entire multi-operand matrix into two rows. We proceed by evaluating our new scheme via comparisons with the results reported in [6] for some common length multi-operand additions. We conclude the discussion with some remarks.

Email: {stamatis,sorin,jan}@duteca.et.tudelft.nl

2 Block Save Addition with Minimization and Telescopic Sums

It is known, see for example [5], that the block save addition (effectively the sum of N words of length log N) is a generalized Boolean symmetric function¹ of N log N inputs with weights of size at most 2^{log N − 1}. Boolean symmetric functions can be implemented with a depth-2 linear threshold gate network using telescopic sums [9]. The telescopic sum that implements a Boolean symmetric function F(x_1, …, x_n), defined to be 1 when Σ_{i=1}^{n} x_i ∈ [q_j, Q_j] for j = 1, …, r and 0 otherwise, is given by the following:

F = sgn[ Σ_{i=1}^{n} x_i − ( t_0 + Σ_{j=1}^{r} t_j u_j ) ]    (1)

where:

t_0 = q_1
t_j = q_{j+1} − q_j, for j = 1, …, r − 1
t_r = n + 1 − q_r if Q_r ≠ n, and t_r = 0 if Q_r = n
u_j = sgn[ Σ_{i=1}^{n} x_i − (Q_j + 1) ]

¹ A Boolean symmetric function is a Boolean function whose output value depends entirely on the sum of its input values. The exclusive-or (parity) of n variables is an example of a symmetric Boolean function. A generalized symmetric Boolean function is a non-symmetric Boolean function that can be transformed into a symmetric Boolean function by trivial transformations.

Using this technique we have shown in [8] that block save addition can be implemented with sizable reductions of the network costs. In this presentation we assume telescopic sums and a new minimization for block save addition networks. The minimization is exemplified by the following. Consider the BSA(8,3,6) (i.e., a block save addition with 8 rows and 3 columns, producing a 6-bit result) with the partial truth table and the schematic diagram shown in Figure 1. Note that the inputs are denoted by the letters R, T, U, V, X, Y, Z, and W, and that the six-bit output sum of the schematic is represented by S5, S4, S3, S2, S1, S0. Furthermore, assume that the linear threshold gates produce both the positive and the negative polarity output signals. In case this is not true, the discussion assumes the employment of inverters; this last assumption increases the delay of the implementation by the delay of an inverter in the critical path. It can be observed in the Figure that:

- Certain values determining the intervals in which the function is defined to be equal to 1 appear multiple times (in dashed line boxes) and thus need not be implemented more than once.

- Certain values determining the intervals in which the function is defined to be equal to 1 are almost common (also in the dashed line boxes); thus they can be re-used with the possible introduction of some negligible delay. This can be achieved by observing that (γ + 1)^{≥} = ¬(γ^{≤}), where γ is an integer required for the computation of one of the intervals in which the function is defined to be equal to 1, and ^{≥} and ^{≤} denote respectively "greater than or equal to" and "less than or equal to".

Decimal Sum | s5 s4 s3 s2 s1 s0
 0          |  0  0  0  0  0  0
 1          |  0  0  0  0  0  1
 2          |  0  0  0  0  1  0
 3          |  0  0  0  0  1  1
 4          |  0  0  0  1  0  0
 5          |  0  0  0  1  0  1
 6          |  0  0  0  1  1  0
 7          |  0  0  0  1  1  1
 8          |  0  0  1  0  0  0
 9          |  0  0  1  0  0  1
10          |  0  0  1  0  1  0
11          |  0  0  1  0  1  1
12          |  0  0  1  1  0  0
13          |  0  0  1  1  0  1
14          |  0  0  1  1  1  0
15          |  0  0  1  1  1  1
16          |  0  1  0  0  0  0
17          |  0  1  0  0  0  1
18          |  0  1  0  0  1  0
19          |  0  1  0  0  1  1
20          |  0  1  0  1  0  0
21          |  0  1  0  1  0  1
22          |  0  1  0  1  1  0
23          |  0  1  0  1  1  1
24          |  0  1  1  0  0  0
25          |  0  1  1  0  0  1

[Schematic diagram with inputs R2 R1 R0, T2 T1 T0, U2 U1 U0, V2 V1 V0, X2 X1 X0, Y2 Y1 Y0, Z2 Z1 Z0, W2 W1 W0 and outputs S5 S4 S3 S2 S1 S0, together with the dashed and solid line boxes marking shared interval values, not reproducible in text.]

Figure 1: Truth Table for Block Save Addition with Minimizations (partial)

- Certain values determining the intervals in which the function is defined to be equal to 1, while they may appear to be common or almost common (in the solid line boxes), cannot be excluded from the implementation. This is because they imply the sharing of elements that have been created to reflect only part of the bit positions of interest. For example, in Figure 1, it would appear that the linear threshold element corresponding to S0 for the interval [3], computing the logical quantity 3^{≤}, could be shared for the computation of the quantity 3^{≤} for S1. This, however, is not possible because that quantity involves only the bits at position 0, while S1 requires bit positions 0 and 1.

The essence of the previous discussion is that there are elements that can be shared and thus need not be created more than once. It should be noted, however, that only the elements that have been built by considering inputs from common bit positions are legitimate candidates for sharing. Specifically, no elements can be shared for the least significant bit positions, from bit position 0 to bit position log N − 1, while linear threshold elements can be shared from bit position log N to 2 log N − 1. It is interesting to note that none of the elements required for the high order bits, from log N to 2 log N − 1, need be implemented when both minimizations are taken into account. Roughly half of the elements are shared without inversion and half require the inversion logic. The implication is that the upper half of the sum bits requires no extra linear threshold gates for the implementation. An example design of the sum bits (that is, S5, S4, S3, S2) with the minimizations discussed previously

is found in Figure 2. The network in Figure 2 has been designed using telescopic sums and one of the minimizations.

[Schematic of the network omitted in this text version.]

Figure 2: Block Save Addition with Minimization and Minnick's Sum

It is interesting to note that, because telescopic sums require one linear threshold element to be built per interval, the network in the Figure uses only gates that can be shared directly, with no need for gates producing both output polarities. In essence, using the telescopic sum design there is no need to build special linear threshold elements or to penalize the delay with an inverter delay.
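The telescopic sum construction of Eq. (1) can be sketched in code as follows (our own Python; the gate model is the one defined in the Introduction, and the interval list in the usage example is invented, not taken from the paper):

```python
# Depth-2 telescopic-sum network for a symmetric function that is 1 exactly
# when the number of ones among the inputs falls in one of the [q_j, Q_j]
# intervals; coefficients t_0..t_r and gates u_j follow the defs under Eq. (1).
def sgn(v):
    return 1 if v >= 0 else 0

def telescopic(x, intervals):
    n, s = len(x), sum(x)
    q = [lo for lo, _ in intervals]
    Q = [hi for _, hi in intervals]
    r = len(intervals)
    # t_0 = q_1;  t_j = q_{j+1} - q_j;  t_r = n + 1 - q_r unless Q_r = n
    t = [q[0]] + [q[j + 1] - q[j] for j in range(r - 1)]
    t.append(0 if Q[-1] == n else n + 1 - q[-1])
    # first level: u_j = sgn(sum x_i - (Q_j + 1))
    u = [sgn(s - (Q[j] + 1)) for j in range(r)]
    # second level: Eq. (1)
    return sgn(s - (t[0] + sum(t[j + 1] * u[j] for j in range(r))))

# Example: F = 1 iff the 7 inputs contain 1-2 or 5-6 ones.
for ones in range(8):
    x = [1] * ones + [0] * (7 - ones)
    assert telescopic(x, [(1, 2), (5, 6)]) == (1 if ones in (1, 2, 5, 6) else 0)
```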

2.1 Network's Implementation Costs

In this subsection we estimate the cost of the network when introducing, in addition to the telescopic sums, the minimization discussed earlier. In order to shorten the presentation we do not include all the mathematical derivation steps in the proofs; they can be found in [10].

Theorem 1 For the BSA receiving N log N bits and producing a 2 log N-bit result, denoted S_0, S_1, …, S_{2 log N − 1}, all the intervals, except at most one, in which the BSA bits from log N to 2 log N − 1 are defined to be 1 are "covered" by some intervals in which the sum bit log N − 1 is defined to be 1. Here "covering" relates to an interval determined by two integers called limits, and it means that the upper limit of the interval is exactly given by the upper limit of another interval, and the lower limit of the interval is given by the upper limit of another interval (not necessarily the interval that gives the upper limit) plus one.

Proof: Because the resulting sum cannot be larger than a certain value Smax = N(2^{log N} − 1), it is represented by ⌈log Smax⌉ bits. Each bit S_i, i = 0, 1, …, ⌈log Smax⌉, has its value given by a symmetric function that is defined to be 1 inside the intervals [2^i + k 2^{i+1}, Σ_{j=0}^{i} 2^j + k 2^{i+1}], for k = 0, 1, …, ⌈Smax / 2^{i+1}⌉ − 1, and 0 otherwise.

For a bit i, i = 0, 1, …, ⌈log Smax⌉, the upper limits are given by Σ_{j=0}^{i} 2^j + k 2^{i+1}, for k = 0, 1, …, ⌈Smax / 2^{i+1}⌉ − 1. Because Σ_{j=0}^{i} 2^j = 2^{i+1} − 1, the position of the upper limits can also be expressed as 2^{i+1} − 1 + k 2^{i+1}. For the next sum bit i + 1 the lower limits are given by 2^{i+1} + k 2^{i+2} and the upper limits by 2^{i+2} − 1 + k 2^{i+2}, for k = 0, 1, …, ⌈Smax / 2^{i+2}⌉ − 1. Suppose that k is even, i.e., k = 2l, l = 0, 1, …, ⌈Smax / 2^{i+2}⌉ − 1. The upper limits for bit i are in this case 2^{i+1} − 1 + 2l·2^{i+1} = 2^{i+1} − 1 + l·2^{i+2}. These equal the lower limits for bit i + 1, i.e., 2^{i+1} + k 2^{i+2}, minus one, and the range of l is exactly the range of k in the case of bit i + 1, i.e., 0, 1, …, ⌈Smax / 2^{i+2}⌉ − 1. If k is odd, i.e., k = 2l + 1, l = 0, 1, …, ⌈Smax / 2^{i+2}⌉ − 1, the upper limits for bit i are 2^{i+1} − 1 + (2l + 1)2^{i+1} = 2^{i+1} − 1 + 2l·2^{i+1} + 2^{i+1} = 2^{i+2} − 1 + l·2^{i+2}. These equal the upper limits for bit i + 1, i.e., 2^{i+2} − 1 + k 2^{i+2}, and the range of l is exactly the range of k in the case of bit i + 1, i.e., 0, 1, …, ⌈Smax / 2^{i+2}⌉ − 1.

In the context of BSA, the first sum bit that is defined on the entire domain [0, Smax] is the bit log N − 1. The bits between 0 and log N − 2 are defined on restrictions of the domain [0, Smax] to [0, N(2^{i+1} − 1)], and because of this they cannot be considered in the covering process. As a consequence of what was proved in the previous paragraph, the intervals of the bit log N − 1 cover all the intervals of the bits from log N to 2 log N − 1 when the value of the bit log N − 1 is 1 for Smax, and at most one upper limit is left uncovered if the bit log N − 1 has value 0 for Smax. □
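The covering claim can be checked empirically for one block size (our own sketch; N = 8, so the BSA sums 8 words of log N = 3 bits and Smax = 8 · 7 = 56; all names are ours, and interval limits are clipped to [0, Smax]):

```python
# Empirical check of Theorem 1 for N = 8.
from math import log2

N = 8
logN = int(log2(N))
Smax = N * (2**logN - 1)

def intervals(i):
    """[lo, hi] ranges of the input sum on which sum bit i equals 1."""
    out, k = [], 0
    while 2**i + k * 2**(i + 1) <= Smax:
        lo = 2**i + k * 2**(i + 1)
        hi = min(2**(i + 1) - 1 + k * 2**(i + 1), Smax)
        out.append((lo, hi))
        k += 1
    return out

uppers = {hi for _, hi in intervals(logN - 1)}  # upper limits of bit log N - 1
uncovered = set()
for i in range(logN, 2 * logN):                 # high-order sum bits
    for lo, hi in intervals(i):
        assert (lo - 1) in uppers               # every lower limit is covered
        if hi not in uppers:
            uncovered.add(hi)                   # upper limit not covered
print(len(uncovered))                           # Theorem 1 predicts at most 1
```

For N = 8 the single uncovered upper limit is Smax = 56, matching the "at most one" of the theorem.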

Corollary 1 The total cost of the linear threshold network performing the block save addition, designed with telescopic sums and with the minimization included, in terms of linear threshold gates is at most:

G = Σ_{i=0}^{log N − 1} ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ + 2 log N + 1

Proof: The number of linear threshold gates required for the implementation of any sum bit i, 0 ≤ i ≤ 2 log N − 1, using telescopic sums is equal to the number of intervals in which the bit i is defined to be equal to 1, plus one. If 0 ≤ i ≤ log N − 1, there are at most ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ intervals for every sum bit S_i. This leads to an overall cost for the bits between 0 and log N − 1 of Σ_{i=0}^{log N − 1} ( ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ + 1 ). For log N ≤ i ≤ 2 log N − 1, as a consequence of Theorem 1, all the upper limits of the intervals, except one in the worst case, are covered by the intervals of the sum bit log N − 1. In the context of the implementation with telescopic sums this means that, for the sum bits between log N and 2 log N − 1, the linear threshold gates for the first level are already present in the implementation of the sum bit log N − 1, except at most one in the case that the sum bit log N − 1 has the value 0 for N(N − 1). Thus for the sum bits log N to 2 log N − 1 the implementation cost is given by the linear threshold gates in the second level, i.e., Σ_{i=log N}^{2 log N − 1} 1 = log N gates, plus at most one gate. These facts lead almost directly to the number of gates stated in the corollary's enunciation. □
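Instantiating the bound for the BSA(8,3,6) of Figure 1 (our own transcription of the formula, using Σ_{j=0}^{i} 2^j = 2^{i+1} − 1; variable names are ours):

```python
# Gate count from Corollary 1 for one BSA(8, 3, 6) block (N = 8).
from math import ceil, log2

N = 8
logN = int(log2(N))
G = sum(ceil(N * (2**(i + 1) - 1) / 2**(i + 1)) for i in range(logN)) \
    + 2 * logN + 1
print(G)  # 4 + 6 + 7 intervals for S0..S2, plus 2 log N + 1 = 7: 24 gates
```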

Theorem 2 Assuming block save addition, the computation of the sum bit position i, S_i, for all bit positions such that 0 ≤ i ≤ 2 log N − 1, requires at most

T = Σ_{i=0}^{log N − 1} ((i+1)N + 1) ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ + Σ_{i=log N}^{2 log N − 1} ⌈ N(2^{log N} − 1) / 2^{i+1} ⌉ + (3/2) N log N (log N + 1)

inputs, summed over all linear threshold gates, if designed with telescopic sums and with the minimization included.

Proof: For all i such that 0 ≤ i ≤ log N − 1, there are ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ linear threshold elements in the first level. Given that for the first level of elements the number of inputs for a sum bit i is equal to (i+1)N, there are in total (i+1)N ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ inputs. For the second level, to produce the sum bits, the required fan-in is given by the number of intervals plus the number of bits participating in the sum, which makes the total number of inputs ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ + (i+1)N. For all i such that log N ≤ i ≤ 2 log N − 1, as a consequence of Theorem 1, the first level does not exist, except for at most one linear threshold gate with N log N inputs. For the sum bits to be produced in the second level, the number of intervals contributes to the fan-in, plus the bits participating in the addition. This implies that the total number of inputs for the second level is ⌈ N(N − 1) / 2^{i+1} ⌉ + N log N. Using these facts, the number of inputs stated in the theorem's enunciation can easily be derived. □

We conclude this section by providing an estimate of the overall cost for the multi-operand addition. The following corollary estimates the cost in terms of gates and total number of inputs.
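The input (wire) bound can likewise be instantiated for the BSA(8,3,6) block of Figure 1 (our own transcription of the three terms; names are ours):

```python
# Input count from Theorem 2 for one BSA(8, 3, 6) block (N = 8).
from math import ceil, log2

N = 8
logN = int(log2(N))
first = sum(((i + 1) * N + 1) * ceil(N * (2**(i + 1) - 1) / 2**(i + 1))
            for i in range(logN))               # first level + interval wires
second = sum(ceil(N * (2**logN - 1) / 2**(i + 1))
             for i in range(logN, 2 * logN))    # high-order interval wires
third = 3 * N * logN * (logN + 1) // 2          # data bits into second level
print(first + second + third)                   # 313 + 7 + 144 = 464 inputs
```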

Corollary 2 The total cost for the reduction of the entire matrix to two rows using block save addition realized with telescopic sums and with the minimization included, in terms of linear threshold gates (G) and total number of inputs (I), is at most:

G = ⌈M / log N⌉ ( Σ_{i=0}^{log N − 1} ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ + 2 log N + 1 )

I = ⌈M / log N⌉ ( Σ_{i=0}^{log N − 1} ((i+1)N + 1) ⌈ N Σ_{j=0}^{i} 2^j / 2^{i+1} ⌉ + Σ_{i=log N}^{2 log N − 1} ⌈ N(N − 1) / 2^{i+1} ⌉ + (3/2) N log N (log N + 1) )

Proof: Trivial from Corollary 1, Theorem 2, and the division of the matrix into ⌈M / log N⌉ carry save blocks. □

From the costs of the reduction of the entire matrix into two rows given in Corollary 2 we derived (see [10] for the computation details) that the number of linear threshold gates is in the order of O(N^2) and the number of inputs is in the order of O(N^3 log N). These asymptotic bounds are equal to the asymptotic bounds corresponding to the networks proposed in [6, 11]. However, as we show in the next section, in practice our scheme provides sizable cost improvements.
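The Corollary 2 bounds can be evaluated numerically (our own transcription; the function name and the reading of an n×n multiplication as N = n rows and M = 2n columns are assumptions). Under this reading the formulas reproduce the gate counts of Table 2 in Section 3 exactly, and the wire counts for the three smaller sizes; our 32×32 wire count comes out at 196768, slightly below the 196898 reported there, a small discrepancy we cannot resolve from the text:

```python
# Total gate and wire bounds from Corollary 2 for an n x n multiplication.
from math import ceil, log2

def bsa_costs(n):
    N, M = n, 2 * n                 # N rows per block, M-column matrix
    logN = int(log2(N))
    per_block_gates = sum(ceil(N * (2**(i + 1) - 1) / 2**(i + 1))
                          for i in range(logN)) + 2 * logN + 1
    first = sum(((i + 1) * N + 1) * ceil(N * (2**(i + 1) - 1) / 2**(i + 1))
                for i in range(logN))
    second = sum(ceil(N * (N - 1) / 2**(i + 1))
                 for i in range(logN, 2 * logN))
    per_block_wires = first + second + 3 * N * logN * (logN + 1) // 2
    blocks = ceil(M / logN)         # number of carry save blocks
    return blocks * per_block_gates, blocks * per_block_wires

for n in (4, 8, 16, 32):
    print(n, bsa_costs(n))
```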

3 Comparisons

Regarding depth-2 neural networks for multi-operand addition, there are three recent proposals, described in [5, 6, 11]. The scheme presented in [6] is an optimization (with substantial cost reductions) of the direct implementation of the scheme described in [5]. Furthermore, it has been suggested in [6] that the multi-operand addition described in [11] requires about 2.5 times the cost of the scheme presented in [6]. In evaluating the networks incorporating telescopic sums we therefore concentrate on the scheme presented in [6], as it appears to be the most economical. In particular, we compare the cost of the implementation for some common size multiplication matrices. The results of the estimation in [6] are reported in Table 1.

Size    Wires   Gates  Fan-in  Depth
4x4       584      88       8      2
8x8      5880     310      24      2
16x16   62128    1264      64      2
32x32  593892    4966     160      2

Table 1: Results of the Lauwereins/Bruck Block Addition

In the Table there are five columns. The Size column describes the various matrices. It is noted that the sizes correspond to multiplication operand sizes. Consequently, an N×N multiplication (i.e., the multiplication of two N-bit numbers) results in a 2N×N multi-operand addition matrix, which is reduced to two operands². The estimations in [6] and in this section assume that the matrix is rectangular. This assumption constitutes a worst-case scenario for the cost of the multi-operand addition for the multiplication matrix, because not all of the 2N bits of the N rows are different from zero; in fact, for every row only N bits are different from zero. Clearly, additional reductions can be achieved for both approaches. For the comparison we assume here, as in [6], that the resulting matrix is of size 2N×N. The second column reports the "wires" of the implementation, computed by summing the number of inputs over all gates; this cost more precisely reflects the number of devices needed for the design and the number of wires present in the circuit. The third column reports the number of threshold elements ("gates"), the fourth the maximum fan-in of the threshold gates, and the fifth the depth of the network in terms of the levels of threshold elements required to build the circuit.

² The two operands are then reduced to a single operand, representing the product of the multiplication, via a high-speed adder.

Using the theorems presented previously, we estimated the overall cost of an implementation by computing the number of wires (actually the number of inputs per gate) and the number of gates required for the linear threshold network. We assume that the design incorporates telescopic sums, and we also consider the savings due to the sharing of common elements. We assume that the inputs are weighted rather than replicated. The fan-in and the delay associated with the designs are the same for both schemes and thus are not considered. Our

estimations are reported in Table 2; they cover the overall cost for the 2N×N multi-operand matrix addition.

         Total            Ratio
Size    Wires  Gates    Wires  Gates
4x4       304     40    1.921  2.200
8x8      2784    144    2.112  2.153
16x16   21504    464    2.889  2.724
32x32  196898   1820    3.016  2.729

Table 2: Estimation of Costs and Ratios with Minimizations

The Table has five columns. The first reports the matrix size; the second the total number of "wires" required for the device; the third the number of neural (linear threshold) gates required for the network. The fourth and fifth columns report the wire and gate ratios between the Lauwereins/Bruck scheme and the scheme discussed earlier using telescopic sums and gate sharing. Table 2 clearly suggests that the proposed minimizations, coupled with the telescopic sums, produce substantial reductions in both size and wires when compared to the scheme presented in [6]. In particular, it suggests that the Lauwereins/Bruck scheme, depending on the size of the matrices, requires between 2.153 and 2.729 times more gates and between 1.921 and 3.016 times more wires.
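The ratio columns can be cross-checked directly from the raw counts of Tables 1 and 2 (values copied from the tables, not re-derived; dictionary names are ours):

```python
# Ratios of the Lauwereins/Bruck costs (Table 1) to ours (Table 2).
lb = {"4x4": (584, 88), "8x8": (5880, 310),
      "16x16": (62128, 1264), "32x32": (593892, 4966)}   # Table 1
ours = {"4x4": (304, 40), "8x8": (2784, 144),
        "16x16": (21504, 464), "32x32": (196898, 1820)}  # Table 2
for size in lb:
    w_ratio = round(lb[size][0] / ours[size][0], 3)      # wire ratio
    g_ratio = round(lb[size][1] / ours[size][1], 3)      # gate ratio
    print(size, w_ratio, g_ratio)
```

The extremes over the four sizes are 1.921 to 3.016 for wires and 2.153 to 2.729 for gates, matching the ranges quoted in the text.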

4 Conclusions

In this paper we have investigated small depth networks for multi-operand addition. In particular, we considered depth-2 neural networks and block save addition to reduce the multi-operand matrix into two rows. We have proposed a scheme that substantially improves the overall cost for the block save addition and for the entire multi-operand addition network. We assumed the same order of magnitude for the fan-in and the weight requirements as in [6] and proposed a new scheme based on a technique introduced by Minnick [9] (denoted the telescopic sum), which substantially reduces the cost of depth-2 linear threshold element networks, together with additional minimizations. We have shown that the proposed scheme has the same asymptotic bounds for the network's cost as the schemes in [6], but that in practice it provides sizable cost improvements when compared with the scheme presented in [6], the most cost-effective scheme for depth-2 networks using block save addition for the reduction of multi-operand matrices.

References

[1] L. Dadda. Some Schemes for Parallel Multipliers. Alta Frequenza, 34:349-356, May 1965.
[2] S. Waser and M. J. Flynn. Introduction to Arithmetic for Digital Systems Designers. CBS, 1982.
[3] S. Vassiliadis, E. M. Schwarz, and B. M. Sung. Hard-wired Multipliers with Encoded Partial Products. IEEE Transactions on Computers, 40(11):1181-1197, Nov. 1991.
[4] S. Vassiliadis, E. M. Schwarz, and D. J. Hanrahan. A General Proof for Multi-bit Scanning Multiplications. IEEE Transactions on Computers, 33(2):172-183, Feb. 1989.
[5] K. Y. Siu and J. Bruck. Neural Computation of Arithmetic Functions. Proceedings of the IEEE, 78(10):1669-1675, Oct. 1990.
[6] R. Lauwereins and J. Bruck. Efficient Implementation of a Neural Multiplier. IBM Technical Report RJ 8138, May 1991.
[7] R. Lauwereins and J. Bruck. Efficient Implementation of a Neural Multiplier. Proc. 2nd Intern. Conf. on Microelectronics for Neural Networks, pages 217-230, Oct. 1991.
[8] S. Vassiliadis, J. Hoekstra, and S. Cotofana. Block Save Addition with Telescopic Sums. In 21st Euromicro Conference, pages 701-707, Sept. 1995.
[9] R. Minnick. Linear Input Logic. IEEE Transactions on Electronic Computers, EC-10:6-16, March 1961.
[10] S. Vassiliadis, S. Cotofana, and J. Hoekstra. Block Save Addition with Threshold Gates. Technical Report 1-68340-44(1995)04, TU Delft, June 1995.
[11] T. Hofmeister, W. Hohberg, and S. Kohling. Some Notes on Threshold Circuits and Multiplication in Depth 4. Information Processing Letters, 39:219-225, 1991.
