On-Chip Backpropagation Training Using Parallel Stochastic Bit Streams

Kuno Köllmann, Karl-Ragmar Riemschneider, Hans Christoph Zeidler
Universität der Bundeswehr Hamburg
Holstenhofweg 85, D-22043 Hamburg, Germany

Abstract

It is proposed to use stochastic arithmetic for all arithmetic operations of training and processing backpropagation nets. In this way it is possible to design simple processing elements which fulfil all the requirements of information processing using values coded as independent stochastic bit streams. Combining such processing elements, silicon-saving and fully parallel neural networks of variable structure and capacity become available, supporting the complete implementation of the error backpropagation algorithm in hardware. A sign-considering method of coding is proposed which allows a homogeneous implementation of the net without separating it into an inhibitory and an excitatory part. Furthermore, parameterizable nonlinearities based on stochastic automata are used. Comparable to the momentum (pulse term) and improving the training of a net, there is a sequential arrangement of adaptive and integrative elements influencing the weights, implemented stochastically, too. Experimental hardware implementations based on PLDs/FPGAs and a first silicon prototype have been realized.

1: Bit Stream Solutions

1.1: State of the Art

Today the error backpropagation algorithm, as described and improved so far, allows supervised learning of multilayer and complex structured neural nets. It has been implemented mostly by programming general purpose computers with one or several conventional processing elements. This means a sequential arrangement of memory and arithmetic units, though, in principle, the algorithms allow operations in parallel. Therefore, strong efforts are being made to design special hardware using the inherent parallelism of neural methods. For expense reasons, and because of the unavailability of appropriate simple arrangements with sufficient arithmetic accuracy, mostly only the processing phase of the backpropagation algorithm is implemented in (fully) parallel hardware. In this case typically serial fixed-point arithmetic units of low accuracy or analogue arithmetic units (e.g. Intel 80170NX [3]) are used. Although the computation to modify the weights (learning) is the most time consuming procedure, in general it is done in a serial way using conventional processors. The chip-in-the-loop training strategy can be used to compensate for the accuracy properties of the chip performing recall processing. Murray, Tarassenko et al. [15, 4] show aspects of the precision issues for training multilayer perceptrons and the influence of stochastic variations and noise.

One of the main problems preventing a one-to-one mapping of the complete backpropagation algorithm into conventional digital hardware is the large number of multiplications that have to be executed. The reason is that fast multipliers for operands with sufficiently high resolution require a lot of chip area. So far there are only a few approaches implementing pulse/bit stream computation in hardware. Some proposals use analogue pulses resp. analogue integrated circuits [2, 14, 15, 24], others, which are closer to the proposal presented here, use digital and stochastically independent bit streams [22, 20, 21]. The latter normally perform a transformation into binary probabilities using unsigned coding. In this way a trivial multiplication using an AND gate is feasible, but two weightings (inhibitory and excitatory) for a relation between neurons are necessary. Consequently the error backpropagation algorithm had to be modified due to the separation of the net into an inhibitory and an excitatory part. Thus a training mode is demanded which usually cannot be implemented by stochastic processing elements; instead it has to be run externally in software. Nevertheless, Eguchi et al. [5] have mentioned an on-chip training based on the separated processing modes by Tomlinson [22].

1.2: Improved Approach - Signed Coding

In contrast to many stochastic training strategies (like Boltzmann machines or Hopfield nets), the goal of the bit stream approach is to realize the deterministic operations given by the backpropagation algorithm as simply as possible while permitting randomness and, to some extent, uncertainty. The aim remains to ensure the complete net functionality of the recall mode and the training mode in a parallel implementation that is as homogeneous as possible. Despite the progress of VLSI technology, this is still feasible only by using decisively simpler components. One of the solutions is to use components, and connections between them, whose information carriers are not digital values but probabilities in binary representation¹. Such components are known as stochastic processing elements². Performing the training of multilayer perceptrons, it could be shown that the exponential dependency [6] between accuracy and time - frequently inconvenient in other applications - can be kept under control. In contrast to solutions already known, the proposed arrangement is characterized by using a signed, zero-symmetric coding leading to a very simple sign-considering multiplication, and by automata with nonlinear transmission functions.

¹For simplification, and without any restriction of generality, the probability of a binary "1" is always considered.
²Independent of neural nets, stochastic processing elements have been known since the end of the sixties [7, 12, 13]; at that time the technological situation was a strong motivation for looking for components as simple as possible.

Figure 2. Averaging addition of two values using the random switch (multiplexor): example with limited observing time (left) and theoretical results with unlimited observing time (right)

2: Implementation of the Algorithm

2.1: Coding

In order to transform a deterministic variable into a stochastic bit stream it is necessary to have random sources available. Such bit streams are characterized by their statistical nature. The most essential properties are their stochastic spatial and temporal independence and the probability of the appearance of '1' as the carrier of information. In this way the observation time is a measure of the certainty with which the state of such a system can be determined from the inputs and outputs of its components. Signed values are transformed into a probability using a coder. For this, first the greatest possible unsigned amount within the interval of values has to be estimated. Then the value to be coded is divided by this greatest possible amount, the result is incremented by one, and then divided by two. In this way the probability is obtained.
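For illustration, a minimal software sketch of this coding (Python; the function names here are chosen freely and are not part of the hardware): a signed value x from [-x_max, +x_max] is mapped to the probability p = (x/x_max + 1)/2, an independent bit stream with this probability is drawn, and the value is recovered from the observed relative frequency.

import random

def to_probability(x, x_max):
    # Code a signed value from [-x_max, +x_max] as a probability in [0, 1]:
    # divide by the greatest possible amount, add one, divide by two.
    return (x / x_max + 1.0) / 2.0

def bit_stream(p, length, rng=random.Random(0)):
    # Independent stochastic bit stream whose probability of '1' is p.
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode(bits, x_max):
    # Estimate the signed value from an observed stream; the certainty
    # of the estimate grows with the observation time (stream length).
    p_est = sum(bits) / len(bits)
    return (2.0 * p_est - 1.0) * x_max

p = to_probability(0.5, x_max=1.0)           # p = 0.75
print(decode(bit_stream(p, 100), 1.0))       # coarse estimate of 0.5
print(decode(bit_stream(p, 100000), 1.0))    # much closer to 0.5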

Figure 1. Signed multiplication of two values using the equivalence: example with limited observing time (left) and theoretical results with unlimited observing time (right)

2.2: Arithmetic Operations

As the most frequent and, conventionally, limiting operation during propagation and backpropagation, the signed multiplication of two bit streams is very simple to implement by combining both streams in a digital equivalence gate (XNOR; Fig. 1). In contrast to the multiplication, it cannot be guaranteed that during an addition the interval of [-1...+1] will not be left. Therefore the input values are scaled using additional disjunctive dummy sequences in such a way that each operand is divided by the number of input values to be added; instead of an addition, the arithmetic mean is calculated. To meet this requirement a hardware element has been designed which produces a randomly alternating connection between the output and the inputs to be summarized (Fig. 2). For the same reasons as above, the subtraction of one or more signed coded values (formed of one value or the sum of various signed coded values) is possible by an averaging as well: after modifying the signs of the values to be subtracted by a negation, an averaging addition of the values themselves must be performed. These basic operations have already been treated in the literature.
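For illustration, both operations can be sketched in software as follows (a functional model only, not the hardware): with the signed coding p = (x + 1)/2, the bitwise equivalence (XNOR) of two independent streams codes the product of the two values, and a random multiplexer that passes one input stream per clock codes their arithmetic mean.

import random

rng = random.Random(1)

def bit_stream(p, length):
    return [1 if rng.random() < p else 0 for _ in range(length)]

def xnor_multiply(stream_a, stream_b):
    # Signed multiplication: bitwise equivalence (XNOR) of two independent streams.
    return [1 - (a ^ b) for a, b in zip(stream_a, stream_b)]

def averaging_add(streams):
    # Averaging addition: a random switch (multiplexer) passes one of the
    # input streams in every clock cycle, so the result codes the mean.
    length = len(streams[0])
    return [streams[rng.randrange(len(streams))][t] for t in range(length)]

a, b = 0.5, -0.8
sa = bit_stream((a + 1) / 2, 100000)
sb = bit_stream((b + 1) / 2, 100000)
prod = xnor_multiply(sa, sb)
mean = averaging_add([sa, sb])
print(2.0 * sum(prod) / len(prod) - 1.0)   # close to a * b = -0.4
print(2.0 * sum(mean) / len(mean) - 1.0)   # close to (a + b) / 2 = -0.15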

Figure 3. Nonlinearity within propagation mode

Figure 4. Nonlinearity within backpropagation mode
2.3: Sigmoid-Similar Nonlinearity

To realize the nonlinear sigmoidal transmission at the output of a neuron, the propagation flow is led through an automaton which makes use of the distribution of run lengths in the bit streams. After a fixed number of n samples with consecutively identical values resulting from the averaging addition, the output of the neuron will be switched. If the new value is identical to the previous one, it remains fixed (Fig. 3). The occurrence of less than n identical and consecutive samples has no effect.

When observing input and output probabilities for a longer time, a nonlinear transfer characteristic is obtained which is similar to a sigmoid transmission function ("squashing"). Its gradient is given by the number n of identical and consecutive values (the switching-relevant run length). In the case of stationary input and unlimited observation time the automaton can be described as a Markov process:

    s_{t+1} = M(X, n) \cdot s_t

with
    s_t        vector of the automaton's state probabilities at time t
    s_i        component of the vector s
    M(X, n)    matrix of transition probabilities
    f(X, n)    resulting transmission function, Y = f(X, n)
    X          input probability
    z          signed coded input value
    Y          output probability
    y          signed coded output value
    n          switching-relevant run length of input bits

The equation describing the nonlinearity can be evaluated as a transmission function of probabilities

    Y = \frac{X^{n}(1-X)\,\bigl(1-(1-X)^{n}\bigr)}{X^{n}(1-X)\,\bigl(1-(1-X)^{n}\bigr) + X\,(1-X)^{n}\,\bigl(1-X^{n}\bigr)}

which corresponds to a function of signed coded values

    y = \frac{z\,\bigl((1-z)(z+1)\bigr)^{n} - 2^{\,n-1}\bigl((1-z)^{n}(z+1) + (z-1)(z+1)^{n}\bigr)}{2^{\,n-1}\bigl((1-z)^{n}(z+1) + (1-z)(z+1)^{n}\bigr) - \bigl((1-z)(z+1)\bigr)^{n}}
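The stationary transfer function can be cross-checked by a small Monte-Carlo model of the automaton (a functional sketch only; as described above, the model assumes that the output switches to the current input value once n consecutive identical samples have occurred):

import random

def sigmoid_like(X, n):
    # Stationary transfer function of the run-length automaton (see above).
    a = X**n * (1.0 - X) * (1.0 - (1.0 - X)**n)
    b = X * (1.0 - X)**n * (1.0 - X**n)
    return a / (a + b)

def simulate(X, n, steps=200000, rng=random.Random(2)):
    # Output switches to the input value after n consecutive identical inputs;
    # otherwise it keeps its state.
    out, prev, run, ones = 0, None, 0, 0
    for _ in range(steps):
        bit = 1 if rng.random() < X else 0
        run = run + 1 if bit == prev else 1
        prev = bit
        if run >= n:
            out = bit
        ones += out
    return ones / steps

for X in (0.2, 0.5, 0.8):
    # Closed form vs. simulation, e.g. for n = 4; both agree up to Monte-Carlo noise.
    print(X, round(sigmoid_like(X, 4), 3), round(simulate(X, 4), 3))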

2.4: Sigmoid Derivative-Similar Nonlinearity

As part of the backpropagation flow, an automaton is also needed which generates the nonlinear transmission of the averaging addition at the propagation inputs. It switches a random value with identical probabilities for "0" and "1" to the output if a fixed number of n samples with consecutively identical values resulting from the averaging addition is reached (Fig. 5); otherwise there is a "1" at the output constantly. The resulting transfer characteristic is roughly similar to the first derivative of the transmission function in the propagation flow. The shape of its characteristic can again be modified via the length n of the sequence. The equation of the nonlinearity can be developed in the same way as before, considering the stationary case (with an equivalent meaning of the variables):

    Y = \frac{2\,(X^{2}-X+1) - (1-X)^{\,n+1} - X^{\,n+1}}{2\,(X^{2}-X+1)}

Using signed coded values one obtains:

    y = \frac{2^{\,n-1}(z^{2}+3) - (1-z)^{\,n+1} - (z+1)^{\,n+1}}{2^{\,n-1}(z^{2}+3)}

2.5: Integrative and Adaptive Elements

During the training of the backpropagation net, in each clock cycle resp. learning step the values are modified (increased or decreased) by very small amounts. The integrative element (Fig. 5) of the stochastic arithmetic is the simplest arrangement fulfilling this requirement [7, 12, 13]. It is not necessary to calculate the modification of the weights directly by a multiplication of the error and the propagated values resp. the weight; in order to improve the backpropagation algorithm, a so-called momentum (pulse term) is used which corresponds to a moving average of the result of the connection mentioned above. This effect is achieved by an adaptive element connected in series to the integrative element⁵.

⁵Also adaptive digital element = ADDIE [7]

Figure 5. Adaptive and integrative stochastic elements used to compute and store the values in the synapses
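For illustration, the behaviour of such elements can be modelled in software along the lines of the classic ADDIE [7] (a functional sketch only, not the circuit of Fig. 5): a limited counter with a value-to-bit-stream coder tracks a moving average of the incoming bit stream; the integrative element can be modelled as the same kind of limited counter, driven by the comparatively rare up/down events of the weight modification.

import random

rng = random.Random(3)

class LimitedCounter:
    # m-bit saturating counter with a coder that re-emits its value as a bit stream.
    def __init__(self, m):
        self.max = (1 << m) - 1
        self.count = self.max // 2      # corresponds to the signed value 0
    def code(self):
        # Value-to-bit-stream coder: P('1') = count / max.
        return 1 if rng.randrange(self.max + 1) < self.count else 0
    def up(self):
        if self.count < self.max:
            self.count += 1
    def down(self):
        if self.count > 0:
            self.count -= 1

def addie_step(counter, in_bit):
    # Adaptive element (ADDIE): nudge the counter toward the input stream, so its
    # coded output follows a moving average of the input probability.
    out_bit = counter.code()
    if in_bit and not out_bit:
        counter.up()
    elif out_bit and not in_bit:
        counter.down()
    return out_bit

# Demo: the adaptive element's estimate converges toward the input probability 0.8;
# its output stream could then drive an integrative (weight) counter in the same way.
adaptive = LimitedCounter(m=8)
for _ in range(20000):
    addie_step(adaptive, 1 if rng.random() < 0.8 else 0)
print(adaptive.count / adaptive.max)   # close to 0.8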

2.6: Scalable Structure of the Net

For larger silicon implementations and board solutions, scalability must be guaranteed. Often it is believed that fully connected layers underlie a 'connectionistic explosion' in a parallel implementation of processing and transferring the values. With regard to the proposed solution, the scalability proof can be reduced to the derivation that no pinning problem is foreseeable for local synaptic processing. The expense of synaptic processing grows with the square of the number of neurons per layer and has the most important impact in larger nets (e.g. shown in Fig. 6). Using recursively defined synaptic matrices including the accumulation function of weighted activities and errors, Figure 7 demonstrates that the number of outward connection lines grows only with the square root of the number of active elements. The accumulation (averaging addition plus parallel transfer of the values) can be performed by distributed multiplexors on the self-similar stages of the recursion. That is why in this implementation powers of two are preferred as the numbers of neurons per layer. If it is ensured that there is no pinning problem related to the active silicon area for the design of small synaptic matrices, then no problems arise when combining them to larger matrices.

Figure 6. Implementation of the operations in propagation and backpropagation mode (n is the number of neurons per layer)

Figure 7. Scalable recursively defined synaptic matrix using distributed averaging addition

2.7: Cooperating Flows During the Training

Often the behaviour of the propagation (forward) flow of the net is described apart from the backpropagation one. Following the approach described here, both flows will be active in parallel while training the weights of the net, comparable to a counter-stream process. While applying a definite training, the propagation and the backpropagation flows cooperate - with respect to a simple pulse - for a definite time in such a way that both occur in opposite directions. Thereby single streams are combined with each other in the synapses, and bundles of them in the neurons. Modifying the parameters in an appropriate way, a training behaviour is attained which is typical for the backpropagation algorithm and is also known from nets calculated conventionally. These parameters are the applied period and the cycle of the input and target patterns, the constants responsible for the timing behaviour of the adaptive and the integrative elements, and the run length of the nonlinear transmissions. In addition to the analogy between the procedures reported here and the conventional operations of a deterministic method, it can be advantageous for attaining global error minima during the training phase if all internal net values are always influenced stochastically, too.

3: Results

3.1: Simulations and Hardware Experiments Using PLDs/FPGAs

First an object-oriented simulation system has been implemented which allowed various net structures to be modelled. Due to the time-consuming calculations, net magnitudes of typical applications - like vision - are difficult to reach in a simulation; therefore VLSI implementations have been striven for. Two experimental hardware implementations using programmable logic arrays (AMD's MACH circuits) and field programmable gate arrays (Xilinx's Series 4) have been successfully tested. Classical neural learning problems - like XOR and m-n-m encoder - have been solved in 10^5...10^7 hardware clocks by applying some hundred sequences of patterns (Fig. 9). In general, the learning speed of this kind of implementation is independent of the size of the net. The expense per synapse including the momentum term - which is essential for larger nets because it depends quadratically on the capacity of the layers - can be estimated at 85 flip-flops and about 120 two-input gates.

Figure 8. Functions of the neuron are reduced to nonlinearities and multiplication

3.2: Integrating Large Amounts of Random Sources

For VLSI implementations, one of the most important aspects of the investigations is the generation of numerous bit streams [23]. In this context an optimal way to integrate a large amount of random or noise sources with sufficient quality and independence had to be found. Multiple independent thermal or semiconductor noise sources must be strongly amplified and give rise to problems when implemented on a digital chip. The best noise results are obtained by using the avalanche effect of Zener diodes, but this requires a process with several differently doped layers, like the expensive BiCMOS technology. In experiments with a CMOS process and external Zener diodes, only frequencies up to 1.5 MHz have been reached. It is hardly feasible to integrate many amplifiers/comparators without on-chip capacities, which consume a large amount of silicon.

Figure 10. Cascaded pseudorandom registers which generate random bits in a decentralized way by partial parities


Figure 9. Training example XOR performed by the proposed hardware

A more promising approach is, e.g., to use the laser speckle effect and small local optical on-chip sensors [11]. In a standard CMOS process, a vertical bipolar transistor can be used as a photosensitive element to deliver distributed optical sensors on digital chips and so satisfy the demand for random sources. Electronic problems of very low noise energy can be reduced by increasing the laser power. The problematic delay time of the CMOS phototransistor can also be overcome by increasing the laser power or the base current. In this context, own experiments will be reported in detail [8].

3.3: Silicon Prototype

A first silicon prototype has been implemented in a 1 micron CMOS technology (ES2). It contains 3150 low-complexity standard cells (gates, flip-flops, 4-to-1 multiplexors, 4 bit registers) and realizes 16 synapses including the momentum and 4 complete neurons, as well as the random sources needed. The neurons have adjustable nonlinearities and expandable inputs for both directions. With respect to the learning/adaptation rate and accuracy, the synapses get the needed flexibility by switching the counter/coder length, and they can also run in a non-learning mode. The prototype chip has optional inputs for physical noise sources to act standalone or in combination with the cascaded pseudorandom sources. Hence, 54 additional input pins (of 119 pins in total) are needed, which enlarged the size of the die.

The silicon implementation reported here uses a decentralised pseudorandom generator which is based on the principle of shifting the turn-around code by parities formed on partial stages of feedback shift registers. The original principle has been developed by Alspector [1] and was extended by a cascading mechanism [16]. To reduce the broadcasting problem of a central pseudorandom generator, one 28 bit shift register with 52 four-stage parities per 16 synapses and 4 neurons has been implemented. Only the first shift register (the master) has a combinatorial feedback to the first stage. All other identical generators are driven by a chain connecting an additional lower shifting parity to the first stage of the next (slave) register. This cascading chain needs only a single wire.
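The principle can be sketched in software as follows (a functional model only; the 28 bit register length, the four-stage parities and the single-wire cascade follow the description above, whereas the concrete feedback and tap positions are chosen freely here and are not those of the chip):

from functools import reduce

def parity(bits):
    return reduce(lambda a, b: a ^ b, bits)

class PRRegister:
    # 28-bit shift register; pseudorandom bits are taken as parities of a few stages.
    def __init__(self, seed):
        self.s = [(seed >> i) & 1 for i in range(28)]
    def tap(self, stages):
        return parity([self.s[i] for i in stages])
    def shift(self, in_bit):
        self.s = [in_bit] + self.s[:-1]

master = PRRegister(0x5A5A5A5)   # feedback taps below are illustrative only
slave = PRRegister(0x1234567)    # driven by one parity handed over from the master

stream_a, stream_b = [], []
for _ in range(1000):
    master.shift(master.tap([27, 24]))        # combinatorial feedback (master only)
    slave.shift(master.tap([2, 9, 16, 23]))   # cascading chain: a single wire
    stream_a.append(master.tap([0, 7, 14, 21]))   # four-stage partial parities used
    stream_b.append(slave.tap([3, 10, 17, 24]))   # as decorrelated pseudorandom bits

print(sum(stream_a) / 1000.0, sum(stream_b) / 1000.0)   # each roughly 0.5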

The core needs only a space of 3.5 x 2.8 mm2 of silicon in 31 rows of cells (Fig. 11). Verilog simulations showed that a clock frequency of at least 16 MHz is attainable; in first tests the chips reached 25 MHz.

No frequency problems arise as pseudorandom generators can be driven by the clock of the digital circuit.

Chips can be combined to larger nets in blocks of 4 x 4 synapses and multiples of 4 neurons per layer.

This is also ensured for a board design if cascaded averaging addition units (tree-like connected distributed external multiplexors) are used.

During the design, finding the correct register stages for driving the combinatorial nets in order to produce the parities took a lot of time in a direct searching process. Therefore additional software with a more complex strategy was developed to look for stages with a sufficient shifting in the turn-around code of a long register. Moreover, it was possible to minimize the combinatorial nets and to reduce the fan-out of the register flip-flops by shifting the group of stages per parity along the register.

Figure 11. First silicon prototype

If the actual standard cell design is shrunk down to a 0.25 micron process, one chip could contain 4 K synapses and 64 neurons on a 160 mm2 silicon core. Using an optimized full custom design, it is estimated that 16 K synapses and 128 neurons can be placed on the same core. This can easily be managed due to the fact that the design of the adaptive and integrative elements is based on counter/coder bitslices containing about 80 % of the cells. Due to the regular structure of the design, the optimization of a specifically designed type of bitslice cell brings a considerable improvement. Keeping in mind the problematic comparison of measurements in CUPS, already one of the 10 mm2 / 1 micron / 25 MHz prototype chips delivers a theoretical performance of 400 MCUPS using 11...12 bits weight length and 14...15 bits momentum length. Performance measures in size and speed (CUPS) can hardly be compared because existing neural hardware differs substantially; an engineering approach should rather comprise the comparison of functional quality, speed and costs of an application. Considering the actual results, a basis is given to implement very fast on-chip backpropagation for training-time-critical and medium-sized applications.
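As a rough plausibility check of this figure (assuming all 16 synapses of the prototype perform one connection update in every clock cycle): 16 updates x 25 * 10^6 cycles/s = 400 * 10^6 updates/s, which matches the quoted 400 MCUPS.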


Acknowledgements

The authors would like to acknowledge the contributions and the support of H. Brandt, G. Feuchter, R.-R. Grigat⁶, H. Groninga, R. Hein, D. Nahrgang, A. Maeder⁷, I. Martiny⁸, T. Merten, M. Paskowski, R. Pleßmann, D. Reinicke, and M. Rübel.

⁶Technische Universität Hamburg-Harburg, Technische Informatik I
⁷Universität Hamburg, Fachbereich Informatik

References

[1] Alspector, Joshua; Gannett, Joel W.; Haber, Stuart; Parker, Michael B.; Chu, Robert; VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks, IEEE Transactions on Circuits and Systems, Vol. 38, No. 1, pp. 109-123, 1991

[2] Beerhold, J.; Jansen, M.; Eckmiller, R.; Pulse Processing Neural Net Hardware with Selectable Topology and Adaptive Weights and Delays, IEEE Intern. Joint Conf. on Neural Networks, pp. II 569-574, San Diego 1990

[3] Brauch, Jeff; Tam, Simon M.; Holler, Mark A.; Shmurun, Arthur L.; Analog VLSI Neural Networks for Impact Signal Processing, IEEE Micro Mag., Dec. 1992

[4] Cairns, Graham; Tarassenko, Lionel; Precision Issues for Learning with Analogue Neural VLSI Multilayer Perceptrons, IEEE Micro Mag., June 1995

[5] Eguchi, H.; Futura, T.; Horiguchi, H.; Oteki, S.; Neural Network LSI Chip with On-chip Learning, IEEE Int. Joint Conf. on Neural Networks I, pp. 453-456, Singapore 1991

[6] Eguchi, H.; Stork, D.G.; Wolff, G.; Precision Analysis of Stochastic Pulse Encoding Algorithms for Neural Networks, IEEE Int. Joint Conf. on Neural Networks I, pp. 395-400, Baltimore 1992

[7] Gaines, B. R.; Stochastic Computing Systems, Advances in Information Systems Science (ed. Tou, Julius T.), Vol. 2, Plenum Press, New York 1969

[8] Hein, R.; Köllmann, K.; Martiny, I.; Riemschneider, K.-R.; Zeidler, H.Ch.; Backpropagation Hardware Based on Bit-Stream Coding Using Amounts of Parallel Random Sources (submitted to Neurap, Marseille)

[9] Holler, Mark A.; VLSI Implementations of Learning and Memory Systems: A Review, Proc. Conf. on Neural Information Processing Systems III, pp. 993-1000, San Mateo CA, Morgan Kaufmann 1991

[10] Holt, Jordan L.; Hwang, Jenq-Neng; Finite Precision Error Analysis of Neural Network Hardware Implementations, IEEE Trans. on Computers, Vol. 42, No. 3, pp. 281-290, 1993

[11] Madani, K.; Garda, P.; Devos, F.; Lalanne, P.; Richard, H.; Rodier, J.C.; Chavel, P.; Taboury, J.; 2-D Generation of Random Numbers by Multimode Fiber Speckle for Silicon Arrays of Processing Elements, Optics Communications, Vol. 76, No. 5-6, pp. 387-394, 1990

[12] Massen, Robert; Zur Problematik des Puidischen Rauschens und seiner Anwendung in der Stochastischen Rechentechnik, Diss., RWTH Aachen 1974

[13] Massen, Robert; Stochastische Rechentechnik - Eine Einführung in die Informationsverarbeitung mit zufälligen Pulsfolgen, Carl Hanser, München 1977

[14] Murray, Alan F.; Pulse Arithmetic in VLSI Neural Networks, IEEE Micro Mag., Dec. 1989, pp. 64-74; reprinted in Sánchez-Sinencio, Edgar; Lau, Clifford; Artificial Neural Networks, IEEE Press 1992

[15] Murray, Alan F.; Tarassenko, Lionel; Analogue Neural VLSI - A Pulse Stream Approach, Chapman & Hall, 1994

[16] Pleßmann, Ralf; Cascading Pseudo Random Sources, personal communication, 1994

[17] Rojas, Raúl; Theorie der neuronalen Netze: Eine systematische Einführung, Springer, Berlin 1993

[18] Riemschneider, K.-R.; Zeidler, H.Ch.; Parallel Bit-Stream Hardware Implementation of Backpropagation, IEEE Symposium on Parallel and Distributed Processing, Add. II-8, San Antonio, Texas, 1995

[19] Rückert, Ulrich; Spaanenburg, Lambert; Anlauf, Joachim; Hardwareimplementierung Neuronaler Netze, atp, Vol. 35, No. 7, pp. 414-420, 1993

[20] Shawe-Taylor, John; Jeavons, Pete; van Daalen, Max; Probabilistic Bit Stream Neural Chip: Theory, Connection Science 3 (3), pp. 317-328, Abingdon, Oxfordshire 1991

[21] Shawe-Taylor, John; Jeavons, Pete; van Daalen, Max; Probabilistic Bit Stream Neural Chip: Implementation, Proc. VLSI for Artificial Intelligence and Neural Networks (ed. Delgado-Frias, J.G.; Moore, W.R.), Plenum Press, New York 1991

[22] Tomlinson, Max Stanford jr.; Implementing Neural Networks, Dissertation Thesis, Univ. of California, San Diego 1988

[23] van Daalen, Max; Jeavons, Pete; Shawe-Taylor, John; Cohen, D.; Device for Generating Binary Sequences for Stochastic Computing, Electronics Letters 29 (1), pp. 80-81, Jan. 1993

[24] Zaghloul, Mona E.; Meador, Jack L.; Newcomb, Robert W.; Silicon Implementation of Pulse Coded Neural Networks, Kluwer Academic Publishers, 1994
