Hardware Implementation of the Backpropagation without Multiplication

Jocelyn Cloutier
Departement d'Informatique et de Recherche Operationnelle, Universite de Montreal, C.P. 6128, Succ. Centre-Ville, Montreal, H3C 3J7, Canada

Patrice Y. Simard
AT&T Bell Laboratories, 4G-326, 101 Crawfords Corner Road, Holmdel, NJ 07733, USA
Abstract
The backpropagation algorithm has been modified to work without any multiplications and to tolerate computations with a low resolution, which makes it more attractive for a hardware implementation. Numbers are represented in floating-point format with a 1-bit mantissa and 2 bits of exponent for the states, and a 1-bit mantissa and 4-bit exponent for the gradients, while the weights are 16-bit fixed-point numbers. In this way, all the computations can be executed with shift and add operations. Large networks with over 100,000 weights were trained and demonstrated the same performance as networks computed with full precision. An estimate of a circuit implementation shows that a large network can be placed on a single chip, reaching more than 1 billion weight updates per second. A speedup is also obtained on any machine where a multiplication is slower than a shift operation.

1 Introduction

One of the main problems for implementing the backpropagation algorithm in hardware is the large number of multiplications that have to be executed. Fast multipliers for operands with a high resolution require a large area; hence the multipliers are the elements dominating the area of a circuit. Many researchers have tried to reduce the size of a circuit by limiting the resolution of the computation. Typically, this is done by simply reducing the number of bits used for the computation. For the forward pass, a reduction to just a few (4 to 6) bits often degrades the performance very little, but learning requires considerably more resolution. Requirements ranging anywhere from 8 bits to more than 16 bits were reported to be necessary to make learning converge reliably [1, 3, 4]. But there is no general theory of how much resolution is enough; it depends on several factors, such as the size and architecture of the network as well as on the type of problem to be solved. Several researchers have tried to train networks where the weights are limited to powers of two [5, 6, 7]. In this way all the multiplications can be reduced to shift operations, which can be implemented with much less hardware than multiplications. But restricting the weight values severely impacts the performance of a network, and it is tricky to make the learning procedure converge. In fact, some researchers keep weights with full resolution off-line and update these weights in the backward pass, while the weights with reduced resolution are used in the forward pass [7]. Similar tricks are usually used when networks implemented in analog hardware are trained: weights with a high resolution are stored in an external digital memory, while the analog network with its limited resolution is used in the forward pass. If a high-resolution copy is not stored, the weight update process needs to be modified. This is typically done by using a stochastic update technique, such as weight dithering [8] or weight perturbation [9].

We present here an algorithm that, instead of reducing the resolution of the weights, reduces the resolution of all the other values, namely the states, gradients, and learning rates, to powers of two. This eliminates multiplications without affecting the learning capabilities of the network. Therefore we obtain the benefit of a much more compact circuit without any compromise on the learning performance. Simulations of large networks with over 100,000 weights show that this algorithm performs as well as standard backpropagation computed with 32-bit floating-point precision. Some of these results have been published in [11]. In this paper a new hardware architecture is proposed which efficiently implements the algorithm.
2 The Algorithm

The forward propagation for each unit j is given by the equation:

x_j = f_j( Σ_i w_{ji} x_i )     (1)
where f is the unit function, w_{ji} is the weight from unit i to unit j, and x_i is the activation of unit i. The backpropagation algorithm is robust with regard to the unit function as long as the function is nonlinear, monotonically increasing, and has a derivative (the most commonly used function is depicted in Figure 1, left). A saturated ramp function (see Figure 1, middle), for instance, performs as well as the differentiable sigmoid. The binary threshold function, however, is too much of a simplification and results in poor performance. Our choice of function is dictated by the fact that we would like to have only powers of two for the unit values. This function is depicted in Figure 1, right. It gives performance comparable to the sigmoid or the saturated ramp. Its values can be represented by a 1-bit mantissa (the sign) with a 2- or 3-bit exponent (negative powers of two). The derivative of this function is a sum of Dirac delta functions, but we take instead the derivative of a piecewise linear ramp function (see Figure 1); one could actually consider this a low-pass filtered version of the real derivative.
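For illustration, a minimal C sketch of such a saturated power-of-two activation could look as follows (the function name and the use of doubles are illustrative; the thresholds are the midpoints between successive powers of two, matching the exact 3-bit hardware encoding given later in section 5.4, with boundary cases simplified):

#include <math.h>

/* Saturated power-of-two activation: round the weighted sum to the nearest
 * signed power of two in {1, 1/2, 1/4, 1/8}.  Thresholds 3/4, 3/8, 3/16 are
 * the midpoints between successive powers of two (cf. q_f in section 5.4). */
static double pow2_activation(double sum)
{
    double sign = (sum < 0.0) ? -1.0 : 1.0;
    double a = fabs(sum);
    if (a > 0.75)   return sign * 1.0;    /* saturate at +-1            */
    if (a > 0.375)  return sign * 0.5;    /* +-2^-1                     */
    if (a > 0.1875) return sign * 0.25;   /* +-2^-2                     */
    return sign * 0.125;                  /* +-2^-3, smallest magnitude */
}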
After the gradients of all the units have been computed using the equation

g_i = f'_i(sum_i) Σ_j w_{ji} g_j     (2)

we discretize these values to powers of two (with sign). This introduces noise into the gradient, and its effect on learning has to be considered carefully; this problem is discussed in section 4. The backpropagation algorithm can now be implemented with additions (or subtractions) and shifts only. The weight update is given by the equation:

Δw_{ji} = −η g_j x_i     (3)

Since both g_j and x_i are powers of two, the weight update also reduces to additions and shifts.
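For illustration, a C sketch of this shift-and-add weight update on 16-bit fixed-point weights could look as follows (the function name, the sign/exponent parameterization, and the 12-bit fractional format anticipate section 5 and are illustrative assumptions):

#include <stdint.h>

#define FRAC 12   /* fractional bits of the 16-bit fixed-point weight format */

/* delta_w = -eta * g_j * x_i, with eta, g_j and x_i all signed powers of two:
 * the product is again a signed power of two, so the update is one shifted add. */
static int16_t weight_update(int16_t w,
                             int e_eta,            /* eta = 2^(-e_eta)        */
                             int sign_g, int e_g,  /* g_j = sign_g * 2^(-e_g) */
                             int sign_x, int e_x)  /* x_i = sign_x * 2^(-e_x) */
{
    int shift = e_eta + e_g + e_x;                 /* exponents add; no multiply */
    int32_t delta = (FRAC - shift >= 0) ? (1 << (FRAC - shift)) : 0; /* underflow -> 0 */
    /* sign of delta_w is -(sign_g * sign_x) */
    int32_t nw = (sign_g * sign_x > 0) ? (int32_t)w - delta : (int32_t)w + delta;
    if (nw >  0x7FFF) nw =  0x7FFF;                /* saturate to the 16-bit range */
    if (nw < -0x8000) nw = -0x8000;
    return (int16_t)nw;
}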
3 Results

A large structured network with five layers and overall more than 100,000 weights was used to test this algorithm. The application analyzed is the recognition of handwritten character images. A database of 800 digits was used for training and 2000 handwritten digits were used for testing. A description of this network can be found in [10].

Figure 2 shows the learning curves on the test set for various unit functions and discretization processes. First, it should be noted that the results given by the sigmoid function and the saturated ramp with full precision on unit values, gradients, and weights are very similar; this is a well-known behavior. The surprising result is that reducing the precision of the unit values and the gradients to a 1-bit mantissa does not reduce the classification accuracy and does not even slow down the learning process. During these tests the learning process was interrupted at various stages to check that both the unit values (including the input layer, but excluding the output layer) and the gradients (all gradients) were restricted to powers of two. It was further confirmed that only 2 bits were sufficient for the exponent of the unit values (from 2^0 to 2^{-3}) and 4 bits were sufficient for the exponent of the gradients (from 2^0 to 2^{-15}).

To test whether there was any asymptotic limit on performance, we ran a long-term experiment (several days) with our largest network (17,000 free parameters) for handwritten character recognition. The training set (60,000 patterns) was made of 30,000 patterns from the original NIST training set (easy) and 30,000 patterns from the original NIST testing set (hard). Using the most basic backpropagation algorithm (with a guessed constant learning rate), we got the raw training error rate down to under 1% in 20 epochs, which is comparable to our standard learning time. Performance on the test set was not as good with the discrete network (it took twice as long to reach equal performance). This was attributed to the unnecessary discretization of the output units: since the output units are not multiplied by anything, there is no need to use a discrete activation function for them.

These results show that gradients and unit activations can be discretized to powers of two with negligible loss in performance and convergence speed! The next section presents theoretical explanations for why this is at all possible and why it is generally the case.
Figure 1: Left: sigmoid function with its derivative. Middle: piecewise linear function with its derivative. Right: Saturated power of two function with a power of two approximation of its derivative (identical to the piecewise linear derivative).
4 Discussion

Discretizing the gradient is potentially very dangerous: convergence may no longer be guaranteed, learning may become prohibitively slow, and the final performance after learning may be too poor to be interesting. We will now explain why these problems do not arise for our choice of discretization.

Let g_i(p) be the error gradient at weight i for pattern p. Let μ_i and σ_i be the mean and standard deviation of g_i(p) over the set of patterns. The mean μ_i is what drives the weights to their final values; the standard deviation σ_i represents the amplitude of the variations of the gradient from pattern to pattern. In batch learning, only μ_i is used for the weight update, while in stochastic gradient descent each g_i(p) is used for the weight update. If the learning rate is small enough, the effects of the noise (measured by σ_i) of the stochastic variable g_i(p) are negligible, but the frequent weight updates of stochastic gradient descent result in important speedups.

To explain why the discretization of the gradient to a power of two has a negligible effect on the performance, consider that in stochastic gradient descent the noise on the gradient is already so large that it is minimally affected by rounding (of the gradient) to the nearest power of two. Indeed, asymptotically, the gradient average (μ_i) tends to be negligible compared
to its standard deviation (σ_i), meaning that from pattern to pattern the gradient can undergo sign reversals. Rounding to the nearest power of two, in comparison, is a change of at most 33%, and never a change of sign. This additional noise can therefore be compensated by a slight decrease of the learning rate, which will hardly affect the learning process. The histogram of g_i(p) after learning, for the last experiment described in the results section, is shown in Figure 3 (over the training set of 60,000 patterns). It is easy to see in the figure that μ_i is small with respect to σ_i (in this experiment μ_i was one to two orders of magnitude smaller than σ_i, depending on the layer). We can also see that rounding each gradient to the nearest power of two will not significantly affect the variance σ_i, and therefore the learning rate will not need to be decreased to achieve the same performance.

We will now evaluate the effect of rounding to the nearest power of two more quantitatively. The standard deviation of the gradient for any weight can be written as

σ_i² = (1/N) Σ_p (g_i(p) − μ_i)² = (1/N) Σ_p g_i(p)² − μ_i² ≈ (1/N) Σ_p g_i(p)²     (4)

This approximation is very good asymptotically (after a few epochs of learning). For instance, if |μ_i| < σ_i/10, the above formula gives the standard deviation to within 1%.

Figure 2: Training and testing error during learning. The filled squares (resp. empty squares) represent the points obtained with vanilla backpropagation and a sigmoid function (resp. a piecewise linear function) used as the activation function. The circles represent the same experiment done with a power-of-two function used as the activation function, and with all unit gradients discretized to the nearest power of two.

Rounding the gradient g_i to the nearest power of
two (while keeping the sign) can be viewed as the effect of a multiplicative noise n_i in the equation

g'_i = 2^k = g_i (1 + n_i)   for some k     (5)
where g'_i is the nearest power of two to g_i. It can easily be verified that this implies that n_i ranges from −1/3 to 1/3. From now on, we will view n_i as a random variable which models the effect of the discretization as noise. To simplify the computation we will assume that n_i has a uniform distribution. The effect of this assumption is depicted in figure 3, where the bottom histogram has been assumed constant between any two powers of two. To evaluate the effect of the noise n_i in a multilayer network, let n_{li} be the multiplicative noise introduced at layer l (l = 1 for the output, and l = L for the first layer above the input) for weight i. Let us further assume that there is only one unit per layer (a simplified diagram of the network architecture is shown in figure 4); this is the worst-case analysis. If there are several units per layer, the gradients will be summed at units in a lower layer. The gradients within a layer are correlated from unit to unit (they all originate from the same desired values), but the noise introduced by the discretization can only be less correlated, not more. The summation of the gradients in a lower layer can therefore only decrease the effect of the discretization. The worst case is therefore when there is only one unit per layer, as
depicted in figure 4 (right). We will further assume that the noise introduced by the discretization in one layer is independent of the noise introduced in the next layer. This is not really true, but it greatly simplifies the derivation. Let μ'_i and σ'_i be the mean and standard deviation of g_i(p)'. Since n_{li} has zero mean, μ'_i = μ_i, and μ'_i is negligible with respect to g_i(p). In the worst case, when the gradient has to be backpropagated all the way to the input, the standard deviation is:

σ'_i² = (1/N) Σ_p (3/2)^L ∫_{−1/3}^{1/3} ··· ∫_{−1/3}^{1/3} ( g_i(p) Π_l (1 + n_{li}) − μ'_i )² Π_l dn_{li}
      ≈ (1/N) Σ_p g_i(p)² Π_l ( (3/2) ∫_{−1/3}^{1/3} (1 + n_{li})² dn_{li} )
      = σ_i² (1 + 1/27)^L     (6)

As learning progresses, the minimum average distance of each weight to the weight corresponding to a local minimum becomes proportional to the variance of the noise on that weight, divided by the learning rate. Therefore, asymptotically (which is where most of the time is spent), for a given convergence speed,
the learning rate should be inversely proportional to the variance of the noise on the gradient. This means that to compensate for the effect of the discretization, the learning rate should be divided by

σ'_i / σ_i = ( √(1 + 1/27) )^L ≈ 1.02^L     (7)

Even for a 10-layer network this value is only 1.2 (σ'_i is 20% larger than σ_i). The assumption that the noise is independent from layer to layer tends to slightly underestimate this number, while the assumption that the noise from unit to unit within a layer is completely correlated tends to overestimate it. All things considered, we do not expect that the learning rate should be decreased by more than 10 to 20% for any practical application. In all our simulations it was actually left unchanged!

Two useful tricks help to further improve the performance. First, it should be noticed that the last-layer activations do not need to be discretized to the nearest power of two. As a matter of fact, there is no need to even have an activation function for the output units, since the target activations can be chosen accordingly. This observation is extremely important when decisions are made from the values of the output units, as is the case when computing confidence and rejection rates. The second observation is that the unit gradients should be discretized after the corresponding weights have been updated. The algorithm then still remains multiplication-free, but one level of corruption of the gradients is avoided (L should be replaced by L − 1 in equation 7).

Figure 3: Top: histogram of the gradients of one output unit after more than 20 epochs of learning over a training set of 60,000 patterns. Bottom: the same histogram assuming that the distribution is constant between powers of two.

Figure 4: Simplified network architectures for the noise effect analysis. Best case: the noise is uncorrelated and all weights are equal. Worst case: the noise is correlated or the weights are unequal.
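As a numerical check of the constants used above (the 1/27 term of equation (6) and the factor of equation (7), about 1.2 for L = 10), the following small C program, written for illustration only, evaluates them in closed form and by Monte-Carlo:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    /* closed form: for n uniform on [-1/3, 1/3], Var(n) = (2/3)^2/12 = 1/27 */
    printf("closed form  E[(1+n)^2] = 1 + 1/27 = %f\n", 1.0 + 1.0 / 27.0);

    /* Monte-Carlo confirmation of the same expectation */
    double acc = 0.0;
    int trials = 1000000;
    for (int t = 0; t < trials; t++) {
        double n = ((double)rand() / RAND_MAX - 0.5) * 2.0 / 3.0;  /* U(-1/3,1/3) */
        acc += (1.0 + n) * (1.0 + n);
    }
    printf("Monte-Carlo  E[(1+n)^2] = %f\n", acc / trials);

    /* equation (7): sigma'/sigma = sqrt(1 + 1/27)^L, about 1.2 for L = 10 */
    printf("equation (7) factor for L = 10: %f\n", pow(sqrt(1.0 + 1.0 / 27.0), 10.0));
    return 0;
}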
5 Hardware Implementation

We present in this section a description of a multiprocessor for the parallel execution of the backpropagation algorithm without multiplication. We have selected the 1D systolic architecture with a shared bus for its scalability and its flexibility. An example of this architecture with 4 processing elements (PEs) is shown in figure 5.
5.1 Architecture

Multiprocessors implementing artificial neural networks are usually composed of many parallel processing elements (PEs) that have similar functionality. Their cost and performance depend mainly on the complexity of the interconnection network used for communicating between PEs. Two important criteria for selecting an architecture are its scalability and its flexibility, in relation to its performance. Scalability represents the cost of adding more PEs to one chip or
the cost of implementing a multi-chip board. The scalability of an architecture with more communication channels is usually lower. The flexibility of an architecture may be measured by its performance when simulating different neural network topologies. Following those two criteria, we have selected the 1D systolic ring architecture with a shared bus. It has the same basic properties as the shared-bus architecture, but more connections are needed: each PE is connected not only to the shared bus but also to two other PEs in order to create a ring (figure 5 shows an architecture with 4 PEs). Such an architecture is able to simulate at maximal speed neural networks with fully connected, 1D-convolution, and 1D-subsampling topologies. This is not the case for the simple shared-bus architecture. We believe that the two latter types of topologies are as important as the fully connected one, and that architectures that cannot simulate them efficiently will not perform well in many applications (such as character recognition [2]). In this paper we do not discuss those topologies further and refer the interested reader to [12, 13].
Usually, the most costly part of the hardware implementation of a PE is the multiplier. The presented algorithm is well suited for hardware integration since it allows the multiplier to be replaced by a shifter, which is faster and smaller. Let n be the number of bits of the result of a multiplication or shift operation. The multiplier has a delay in O(n) time and an area in O(n²), while the shifter has a delay in O(log n) time and an area
Figure 5: 1D systolic ring with shared bus
in O(n log n). Also, the algorithm uses powers of two to represent the state, gradient, and learning-rate values. This reduces the amount of memory necessary to store those values and, more importantly, allows a smaller and faster implementation of the saturated unit function and its derivative. The backpropagation algorithm without multiplication is not tailored to a particular architecture, and many different ones could benefit from it. In the rest of this section we present the hardware implementation of a PE based on a 1D systolic ring architecture with a shared bus.
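To illustrate where the O(log n) shifter delay comes from, the following small software model (not the fabricated circuit) applies an arbitrary shift amount of 0 to 15 in log2(16) = 4 conditional stages, each shifting by a fixed power of two:

#include <stdint.h>

/* Barrel-shifter model: stage k conditionally shifts by 2^k, controlled by
 * bit k of 'amount'; 4 stages cover any shift of a 16-bit word.            */
static uint16_t barrel_shift_right(uint16_t x, unsigned amount /* 0..15 */)
{
    if (amount & 1u) x >>= 1;
    if (amount & 2u) x >>= 2;
    if (amount & 4u) x >>= 4;
    if (amount & 8u) x >>= 8;
    return x;
}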
5.2 Implemented Equations
For the hardware implementation, we consider the following resolutions and formats for the neural-network variables. The weights and undiscretized gradients are 16-bit fixed-point numbers with a 12-bit fractional part. The states and the discretized gradients are respectively 3- and 5-bit floating-point values with a 1-bit mantissa (m) and a 2- and 4-bit exponent (e); the corresponding real numbers are (−1)^m 2^{−e} for states and (−1)^m 2^{3−e} for discretized gradients. The learning rate is a positive floating-point number with a 4-bit exponent (e) and value 2^{−e}. Each neuron has the same unit and derivative functions as those presented in section 2. The unit function of neuron i is noted q_f(sum_i) and its derivative is h(x_i) = f'(f^{−1}(x_i)), which approximates f'(sum_i) without having to store the value of sum_i; only x_i is needed, which must be kept anyway to update the weights. Finally, we define the function q_l(d_j), which transforms an undiscretized gradient into its discretized version. These functions are presented in detail in section 5.4.

Let us define the shift operation (a ⊛ b), where a is a fixed-point number with the same resolution as the weights and b = (m, e) is a floating-point number. We have (a ⊛ b) = (−1)^m a 2^{−e} when b is a state value and (a ⊛ b) = (−1)^m a 2^{3−e} when b is a gradient value.
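For illustration, the shift operation can be modelled in C as follows (function names are ours; truncation of the discarded low-order bits and the saturation of left shifts are simplifying assumptions, since in the actual datapath it is the adder/subtracter that saturates):

#include <stdint.h>

typedef int16_t fix16;   /* 16-bit fixed point, 12 fractional bits (Q4.12) */

/* (a (*) b) with b a state value: multiply by (-1)^m 2^(-e) is a sign and a right shift */
static fix16 shift_op_state(fix16 a, int m, int e)
{
    int32_t mag = (a < 0) ? -(int32_t)a : (int32_t)a;
    int neg = (a < 0) ^ (m != 0);
    int32_t r = mag >> e;                 /* divide by 2^e; rounding toward zero */
    return (fix16)(neg ? -r : r);
}

/* (a (*) b) with b a gradient value: multiply by (-1)^m 2^(3-e) */
static fix16 shift_op_grad(fix16 a, int m, int e)
{
    int32_t mag = (a < 0) ? -(int32_t)a : (int32_t)a;
    int neg = (a < 0) ^ (m != 0);
    int32_t r = (e <= 3) ? (mag << (3 - e)) : (mag >> (e - 3));
    if (r > 0x7FFF) r = 0x7FFF;           /* keep the result in the 16-bit range */
    return (fix16)(neg ? -r : r);
}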
We may now give the implemented neural-network equations. For hidden unit j we have:

x_j = q_f( Σ_i (w_{ji} ⊛ x_i) )     (8)

d_j = −g_j = ( Σ_i (w_{ij} ⊛ q_l(d_i)) ) ⊛ h(x_j)     (9)

and for output unit j:

x_j = Σ_i (w_{ji} ⊛ x_i)     (10)

d_j = −g_j = T_j^{−1} − x_j     (11)

where T_j^{−1} = f^{−1}(T_j), with T_j the target value of output unit j. The weight update function is:

Δw_{ji} = −η g_j x_i = (d_j ⊛ (η · x_i))     (12)

As proposed in section 4, the undiscretized gradient d_j is used during the weight update. Note that the multiplication in (η · x_i) may be implemented by a 4-bit addition of the two exponents while keeping the mantissa of the activation value x_i, since η is always positive.

PARALLEL FOR each PE s
  FOR each layer p (from next-to-input to output)
    Ω_s := bias_{s,p}
    FOR each connection j
      Ω_s += (R_s ⊛ w_{j,s,p})
      IF (j is not the last unit connection) THEN
        Send/Receive R_s   (R_s := R_{(s+1) mod N})
    Store R_s in state memory
    IF (p is a hidden layer) THEN
      R_s := q_f(Ω_s)
    ELSE
      Ω_s (= sum_s) is kept in PE s

Figure 6: Forward pass algorithm for unshared weights.

PARALLEL FOR each PE s
  /** Gradient Evaluation **/
  D_s := T_s^{−1} − Ω_s
  FOR each layer p (from output to next-to-input)
    Ω_s := 0
    FOR each connection j
      IF (j is not the last connection) THEN
        Ω_s += (w_{j,s,p} ⊛ q_l(D_s))
        Send/Receive Ω_s   (Ω_s := Ω_{(s−1) mod N})
      ELSE
        Ω_s := Ω_s + (w_{j,s,p} ⊛ q_l(D_s))
        D'_s := (Ω_s ⊛ h(R_s))
    /** Weight Update **/
    bias_{s,p} += (η ⊛ D_s)
    Restore R_s from state memory
    FOR each connection j
      w_{j,s,p} += (D_s ⊛ (η · R_s))
      Send/Receive R_s   (R_s := R_{(s−1) mod N})
    D_s := D'_s

Figure 7: Backward pass algorithm for unshared weights.
5.3 Implemented Algorithms

This section gives the algorithms of the forward (figure 6) and backward (figure 7) passes. We make the simplifying assumptions that every layer is fully connected and that each layer has exactly N units, where N is the number of PEs. We also consider the weights to be local and unshared. The implemented algorithms are more general, but we do not have enough space in this paper to present them; we refer the interested reader to [14] for more details. Depending on the number of connections to a unit, a 1D-convolution or a 1D-subsampling layer may be computed without any loss in performance.
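To make the ring schedule of figure 6 concrete, the following C sketch (illustrative only, using plain floats in place of the fixed-point shift operation) simulates one fully connected layer on N PEs: each PE accumulates with whatever activation currently sits in its ring register R, then the R values rotate by one position:

#define N 4   /* number of PEs; hypothetical value for the sketch */

/* One fully connected layer on the 1D systolic ring of figure 6.
 * w[s][i] is the weight from previous-layer unit i to current-layer unit s;
 * PE s stores row s locally, so its j-th local weight is w[s][(s+j)%N].     */
static void forward_layer(const float w[N][N], const float bias[N],
                          const float x_prev[N], float x_next[N])
{
    float R[N], Omega[N];
    for (int s = 0; s < N; s++) { R[s] = x_prev[s]; Omega[s] = bias[s]; }

    for (int j = 0; j < N; j++) {
        /* every PE accumulates with the activation currently in its R register
         * (a shift-add in hardware, an ordinary multiply-add here)            */
        for (int s = 0; s < N; s++)
            Omega[s] += w[s][(s + j) % N] * R[s];
        /* rotate the ring: R_s := R_((s+1) mod N) */
        float r0 = R[0];
        for (int s = 0; s < N - 1; s++) R[s] = R[s + 1];
        R[N - 1] = r0;
    }
    /* a hidden layer would now apply q_f(); the output layer keeps the sums */
    for (int s = 0; s < N; s++) x_next[s] = Omega[s];
}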
5.4 Datapath

The implemented datapath is presented in figure 8. We give implementation details about the functions q_f(), q_l() and h(), the shifter, and the adder/subtracter. The function h() is the simplest of them. According to the derivative of the unit function presented in figure 1
(right), the function h() is defined by:

h(x_i) = 000b = 1        if −1 < x_i < 1
         011b = 2^{−3}   otherwise

The function q_f() is implemented according to the saturated power-of-two unit function shown in figure 1 (right):

q_f(sum_i) = 100b = −1       if sum_i ≤ −3/2²
             101b = −2^{−1}  if −3/2² < sum_i ≤ −3/2³
             110b = −2^{−2}  if −3/2³ < sum_i ≤ −3/2⁴
             111b = −2^{−3}  if −3/2⁴ < sum_i ≤ 0
             011b = 2^{−3}   if 0 < sum_i ≤ 3/2⁴
             010b = 2^{−2}   if 3/2⁴ < sum_i ≤ 3/2³
             001b = 2^{−1}   if 3/2³ < sum_i ≤ 3/2²
             000b = 1        if sum_i > 3/2²
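The two definitions above translate directly into a small C model (illustrative; the 3-bit state code packs the mantissa in bit 2 and the exponent in bits 1–0, and the handling of sums falling exactly on a threshold or at 0 is simplified):

#include <stdint.h>

typedef uint8_t state3;   /* bit 2 = mantissa m, bits 1..0 = exponent e; value (-1)^m 2^-e */

/* h(x): 1 when |x| < 1 (exponent e > 0), 2^-3 otherwise; returned here as
 * the right-shift amount applied by the shifter.                            */
static unsigned h_shift(state3 x)
{
    return ((x & 0x3u) != 0u) ? 0u : 3u;
}

/* q_f(sum): round a Q4.12 accumulator to the nearest signed power of two,
 * saturated at +-1.  Thresholds 3/4, 3/8, 3/16 are 3<<10, 3<<9, 3<<8 in Q4.12. */
static state3 q_f(int16_t sum)
{
    uint8_t m = (sum < 0);
    int32_t a = (sum < 0) ? -(int32_t)sum : (int32_t)sum;
    uint8_t e;
    if      (a > (3 << 10)) e = 0;   /* |sum| > 3/4  -> +-1    */
    else if (a > (3 << 9))  e = 1;   /* |sum| > 3/8  -> +-2^-1 */
    else if (a > (3 << 8))  e = 2;   /* |sum| > 3/16 -> +-2^-2 */
    else                    e = 3;   /*              -> +-2^-3 */
    return (state3)((m << 2) | e);
}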
Before giving implementation details about the function q_l(), we describe the shifter and the adder/subtracter. From the backpropagation algorithms given in the previous section, the adder/subtracter must be able to perform the operations (a + b) and (a − b), where a and b are its left and right inputs. The operation (a + b) is used in the forward and backward passes of the backpropagation algorithm, while (a − b) is only used when computing the gradient of an output unit (i.e., T_s^{−1} − Ω_s). The operation (b − a) may also be performed by the adder/subtracter in order to simplify the shifter hardware. The shifter does not take the complement of the 16-bit fixed-point value when the mantissa of the floating-point value is negative; instead, the operation (b − a) is computed by the adder/subtracter. Depending on the processed instruction, the operands of the shifter come from the bigbus, the smallbus, register D, the function q_l(D), and the constant η, as shown in figure 8. Its outputs are a 16-bit fixed-point number and 1 control bit selecting the operation performed by the adder/subtracter ((a + b) or (b − a)).

Figure 8: Schematic diagram of the datapath.

The adder/subtracter receives data inputs a and b and 2 control signals selecting which operation to perform among (a − b), (a + b), and (b − a). Its output is a 16-bit value. When a positive (negative) overflow occurs, it returns the maximal (minimal) value 7FFF hex (8000 hex).

The function q_l(d) generates a gradient value in a format that can be used directly by the shifter. Instead of generating a single exponent value, it has two outputs, l (2 bits) and r (4 bits), representing respectively a positive exponent (l) and the absolute value of the negative exponent (r). To be consistent with the equations presented in section 5.2, the exponent e of the discretized gradient should be

e = 3 − l   if r = 0
e = 3 + r   if l = 0

The maximal values of l and r are 3 and 12 respectively (recall that a 16-bit fixed-point number has a 12-bit fractional part). The (l, r) coding is not optimal since at least one of the two values is always equal to 0; however, it simplifies the interface to the shifter, which can use those two numbers directly to perform a left or right shift. The function q_l(d) is defined by:

q_l(d) = m = 1b, l = 00b,                r = | round(log₂|d|) |   if d < 0 and round(log₂|d|) < 0
         m = 1b, l = round(log₂|d|),     r = 0000b                if d < 0 and round(log₂|d|) ≥ 0
         m = 0b, l = 00b,                r = 1111b                if d = 0
         m = 0b, l = 00b,                r = | round(log₂ d) |    if d > 0 and round(log₂ d) < 0
         m = 0b, l = round(log₂|d|),     r = 0000b                if d > 0 and round(log₂|d|) ≥ 0

where m is the 1-bit mantissa. When d = 0 the function returns r = 15, for which the shifter returns an output of 0.

The operation T_s^{−1} − Ω_s is performed by passing the value of T_s^{−1} through the shifter without modifying it. The bias_{s,p} value is stored in the weight memory.
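For illustration, q_l(d) can be modelled in C as follows (a floating-point log2/round is used for clarity, whereas the actual datapath would use a priority encoder on the fixed-point bits; the clamping of l and r to their maxima of 3 and 12 is made explicit):

#include <stdint.h>
#include <math.h>

typedef struct { uint8_t m, l, r; } grad_code;   /* mantissa, left shift, right shift */

/* q_l(d): discretize a Q4.12 gradient d to the nearest signed power of two,
 * returned as the (m, l, r) triple consumed by the shifter.                  */
static grad_code q_l(int16_t d)
{
    grad_code c = { 0u, 0u, 0u };
    if (d == 0) { c.r = 15u; return c; }             /* r = 15 makes the shifter output 0 */
    c.m = (d < 0);
    double mag = fabs((double)d) / 4096.0;           /* back to a real value (12 frac bits) */
    long k = lround(log2(mag));                      /* nearest power of two is 2^k */
    if (k >= 0) c.l = (uint8_t)(k > 3 ? 3 : k);      /* positive exponent, at most 3  */
    else        c.r = (uint8_t)(-k > 12 ? 12 : -k);  /* negative exponent, at most 12 */
    return c;
}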
5.5 Controlpath

The local controlpath is shown in figure 9. The controlpath is responsible for the sequencing of the datapath (not shown here) and for loading the parameters of each PE. Those parameters are the number of connections per layer, the number of layers, the target values, and the initial values of the weights. The number of connections to a unit of one layer is stored in the C-memory, and the address of the first weight of a unit for one layer is stored in the First_Weight_To memory. This permits the simulation of general neural-network topologies by allowing different weight addresses for different PEs. The η-memory stores the learning rate of each layer (which may differ from layer to layer). This is important for units that have shared weights, since their learning rate is usually different from that of other units. Register p stores the current layer number and is used to address the different memories; it is also used for loading the neural-network parameters.
Figure 9: Schematic diagram of the control path
5.6 Implementation Results

A 4-PE processor was designed using the Cadence CAD tool environment with 1.5 μm CMOS standard cells (two levels of metal). The processor was successfully simulated and is under fabrication, so no results are available yet on the actual processor speed; however, it was evaluated with the available simulation and timing tools. Each PE has a weight RAM of 64 bytes (32 weights) and a state RAM of 16 bytes (32 states). Its datapath and controlpath are composed of 648 and 237 gates respectively. This is modest compared with a PE using a 16-bit fixed-point multiplier, which would have about 900 more gates. The core area (without I/O pads) of the 4 PEs with their memory and the main control is 12.7 mm². Since the main difference between our implementation and those of others is the use of a shifter instead of a multiplier, we have compared the speed of the two: a 16-bit multiplier is about 8 times slower than our shifter (41 ns versus 6 ns). Recall that those times are for a relatively slow CMOS technology and could be decreased with a faster one; the ratio, however, should remain unchanged. Conservative estimates indicate that the processor will run at a speed of 33 MHz with one connection evaluation per cycle. Three cycles are used for a connection update (one for the gradient evaluation and two for the weight update). The top speed of the processor should then be 33N MCPS (million connection evaluations per second) and 11N MCUPS (million connection updates per second), where N is the number of PEs. We estimate that by using a 0.8 μm CMOS technology, with a better RAM module generator, we should be able to put 64 PEs and their memory (256 weights per PE) on one chip. This is enough memory to provide for 4 fully connected layers of 64x64 connections. The speed of such a chip should be greater than ours, and we estimate that it could run at a frequency of at least 66 MHz. Such a chip would have a peak performance of 4224 MCPS and 1408 MCUPS. Since the architecture is scalable, it is possible to connect many chips to form a larger ring, and the peak performance increases linearly with the number of chips. However, an actual use of such a hardware simulator may only achieve a fraction of the peak performance. This is due to the rigidity of hardware design and to the 1D architecture, which does not scale well to 2D problems such as image recognition [2].

6 Conclusion

In this paper we have described a new algorithm which implements backpropagation without using any
multiplication. Surprisingly, removing all the multiplication operations from the original process can be done at an almost negligible cost in performance and convergence time. A new hardware architecture has been proposed which takes full advantage of the algorithm. The main considerations in our design were flexibility and scalability. The implemented controlpath is relatively simple, since our initial goal was to demonstrate the advantage of a hardware implementation of backpropagation without multiplication. However, the datapath is general and may be used, for example, in a 2D systolic architecture, which is well tailored for image recognition applications. We are also considering the development of a neural network hardware compiler. The compiler would optimize the generated circuits depending on the size and topology of the neural network (1D or 2D convolution, 1D or 2D subsampling, fully connected layers). The basic building block of the generated circuit is the datapath presented here, which permits fast simulation while using a small area.
Acknowledgements

We express our gratitude to Hans Peter Graf for his valuable contributions and support. This work was partially funded by NSERC grant OGPIN-007.
References

[1] S. Sakaue, T. Kohda, H. Yamamoto, S. Maruno and Y. Shimeki, "Reduction of Required Precision Bits for Back-Propagation Applied to Pattern Recognition", IEEE Trans. Neural Networks, Vol. 4, pp. 270-275, 1993.
[2] E. Sackinger, B.E. Boser, J. Bromley, Y. LeCun and L.D. Jackel, "Application of the ANNA Neural Network Chip to High-Speed Character Recognition", IEEE Trans. Neural Networks, Vol. 3, No. 3, pp. 498-505, 1992.
[3] K. Asanovic, N. Morgan and J. Wawrzynek, "Using Simulations of Reduced Precision Arithmetic to Design a Neuro-Microprocessor", J. VLSI Signal Processing, Vol. 6, pp. 33-44, 1993.
[4] L.M. Reyneri and E. Filippi, "An Analysis on the Performance of Silicon Implementations of Backpropagation Algorithms for Artificial Neural Networks", IEEE Trans. Computers, Vol. 40, pp. 1380-1389, 1991.
[5] H.K. Kwan and C.Z. Tang, "Multiplierless Multilayer Feedforward Neural Network Design Suitable for Continuous Input-Output Mapping", Electronics Letters, Vol. 29, pp. 1259-1260, 1993.
[6] B.A. White and M.I. Elmasry, "The Digi-Neocognitron: A Digital Neocognitron Neural Network Model for VLSI", IEEE Trans. Neural Networks, Vol. 3, pp. 73-85, 1992.
[7] M. Marchesi, G. Orlando, F. Piazza and A. Uncini, "Fast Neural Networks without Multipliers", IEEE Trans. Neural Networks, Vol. 4, pp. 53-62, 1993.
[8] J.M. Vincent and D.J. Myers, "Weight Dithering and Wordlength Selection for Digital Backpropagation Networks", BT Technology J., Vol. 10, pp. 124-133, 1992.
[9] M.A. Jabri and B. Flower, "Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks", IEEE Trans. Neural Networks, Vol. 3, pp. 154-157, 1992.
[10] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, "Handwritten Digit Recognition with a Back-Propagation Network", Neural Information Processing Systems, Vol. 2, pp. 396-404, Morgan Kaufmann, 1990.
[11] Patrice Y. Simard and Hans Peter Graf, "Backpropagation without Multiplication", Neural Information Processing Systems, Vol. 6, pp. 232-239, Morgan Kaufmann, 1994.
[12] Paolo Ienne, "Architecture for Neuro-Computers: Review and Performance Evaluation", Technical Report 93/21, Swiss Federal Institute of Technology, January 1993.
[13] J. Cloutier, "Architecture for Neuro-Computers: Performance Evaluation for Different Neural Network Topologies", Technical Report #1012, Universite de Montreal, Departement d'informatique, June 1994.
[14] J. Cloutier, "Architecture and Algorithms for a Neural Network Chip with Learning Capability", Technical Report #1013, Universite de Montreal, Departement d'informatique, June 1994.