Backpropagation Hardware Based on Bit-Stream Coding Using Amounts of Parallel Random Sources
Rüdiger Hein, Kuno Kollmann, Ingo Martiny, Karl-Ragmar Riemschneider, Hans Christoph Zeidler
Universität der Bundeswehr Hamburg - Technische Informatik - Holstenhofweg 85, D-22043 Hamburg
MAZ Mikroelektronik Anwendungszentrum, Harburger Schloßstraße 6-12, D-21071 Hamburg
Abstract
The realization of massively parallel neural hardware requires simple arithmetic units. Using stochastic bit streams to represent numerical information leads to elements which consume only small areas when integrated on silicon. In this context, the optimal way to integrate a large number of random or noise sources, needed to generate the numerous stochastically independent bit streams, is discussed. Semiconductor noise (avalanche effect), the laser speckle effect and pseudo random sources were investigated and tested experimentally using specially designed chips. As a result, a first digital silicon prototype realizing a complete backpropagation network capable of on-chip learning is presented.

Résumé
Realizing neural hardware requires simple arithmetic elements. When pulse streams are used to represent numerical information, such elements can be integrated on a chip while occupying only little area. In this context, the best method of integrating the large number of random sources needed to generate the pulse streams is discussed. Electrical noise (avalanche effect), laser speckle and pseudo random sources were investigated, and experiments were carried out using specially designed chips. As a result, we present the prototype implementation of a complete neural network based on the backpropagation algorithm and capable of on-chip training.

Zusammenfassung
For reasons of area, massively parallel neural hardware requires low-complexity arithmetic units for its numerous operations. Representing the numerical values in the net as bit streams permits the use of stochastic arithmetic units, which can be integrated at high density. In this context, approaches are shown for integrating a large number of statistically independent random sources together with the digitally realized net. Semiconductor noise (avalanche effect), the laser speckle effect and pseudo random sources were investigated with the help of specially designed chips. As a result, a prototype circuit is presented which realizes the backpropagation method completely as on-chip training.
1 Necessity of Multiple Random Sources
Neural methods offer the capability of parallelism. However, they consume much processing power because of their low information density. In general, the speed problems cannot be solved until neural hardware takes advantage of this inherent parallelism. In this paper, stochastic arithmetic is proposed as the basis for a hardware implementation of the backpropagation algorithm. This allows the design of drastically simpler (stochastic) processing elements which fulfil all the information processing requirements of the nets. Combining such processing elements (see figure 1) leads to silicon-saving, fully parallel neural networks of variable structure and capacity.

To avoid having to split the net weights into excitatory and inhibitory ones (when the probability of the state '1' in a bit stream is used to represent digital numbers), a sign-considering procedure has been presented which allows for a homogeneous implementation. It is based on a method which maps normalized digital values in [-1, +1] to a probability in [0, 1]. Generating a large number of such stochastic bit streams requires, of course, that a correspondingly large number of independent random sources is available.

Figure 1: Implementation of the operations in propagation and backpropagation mode (n is the number of neurons per layer)

The signed multiplication of two bit streams, the most frequent and conventionally limiting operation during propagation and backpropagation, is very simple to implement: both streams are combined in a digital equivalence (XNOR) gate. In contrast to multiplication, an addition cannot be guaranteed to stay within the interval [-1, +1]. Therefore the input values are scaled using additional randomly distributed sequences, and the arithmetic mean is calculated instead of the sum. In hardware this may be realized by a randomly alternating connection between the output and the inputs to be summed. Furthermore, it is demonstrated that stochastic automata can be used to produce parameterizable nonlinearities (sigmoid function and its derivative). While training the net, the weights are modified by very small amounts only in each clock cycle, i.e. in each learning step. To improve the speed of convergence of the backpropagation algorithm, the well-known momentum term is introduced by an appropriate series connection of adaptive¹ and integrative elements. A more detailed discussion is found in [RZ95].
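The bit-stream arithmetic described above can be illustrated in software. The following is a minimal sketch, not the authors' circuit: it assumes the mapping p = (x + 1) / 2 between a value x in [-1, +1] and the probability of a '1', and it uses Python's software random generator in place of the hardware random sources. The function names `encode`, `decode`, `multiply` and `mean` are hypothetical.

```python
import random


def encode(x, n, rng):
    """Return n bits whose probability of '1' is (x + 1) / 2, for x in [-1, +1]."""
    p = (x + 1.0) / 2.0  # assumed mapping from [-1, +1] to [0, 1]
    return [1 if rng.random() < p else 0 for _ in range(n)]


def decode(bits):
    """Estimate the represented value from the relative frequency of '1'."""
    return 2.0 * sum(bits) / len(bits) - 1.0


def multiply(a, b):
    """Signed multiplication: bitwise equivalence (XNOR) of two independent streams."""
    return [1 - (x ^ y) for x, y in zip(a, b)]


def mean(streams, rng):
    """Arithmetic mean: in each clock a random selector connects one input to the output."""
    n = len(streams[0])
    return [streams[rng.randrange(len(streams))][t] for t in range(n)]
```

With independent streams, the XNOR output is '1' with probability (1 + xy) / 2, which decodes to the product xy. The random selector yields the mean rather than the sum, so the result stays inside [-1, +1]. For example, decoding `multiply(encode(0.6, n, rng), encode(-0.5, n, rng))` for a long stream gives a value close to -0.3, with an error that shrinks as the stream length n grows.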
2 Semiconductor Noise
For VLSI implementations, one of the most important aspects of the investigations was the generation of the numerous bit streams [vDa93]. In this context it must be identified which is the optimal way to integrate a large number of random or noise sources of sufficient quality and independence. Multiple independent thermal or semiconductor noise sources must be strongly amplified, so the electrical interference from digital switching components has to be reduced by shielding. The best results are obtained by using the avalanche effect of Zener diodes, but this requires a process with several differently doped layers, such as the expensive BiCMOS technology. In experiments with a CMOS process and external Zener diodes, frequencies of only up to 1.5 MHz were reached. It is hardly feasible to integrate the amplifiers/comparators without on-chip capacitors, which waste a large amount of silicon area (see figure 2). Therefore semiconductor noise does not seem to be viable for this purpose.

Figure 2: CMOS implementation of the amplifier/comparator including the capacitor and resistor needed for lowpass filtering (labelled parts: comparator, inverter, on-chip capacitor, on-chip resistor)

¹ Also: adaptive digital element = ADDIE [Gai69].

3 Laser-Speckle Effect
A more promising approach is to use the laser speckle effect and small local optical on-chip sensors. The electronic problems of very low noise energy can be reduced by increasing the laser power. Madani et al. [Mad90] use a randomly distributed speckle pattern created by a multimode fiber into which continuous laser light is coupled. The modes leaving the exit face of the fiber are projected onto an array of photodetectors. The relative phase fluctuations of these modes, due to the bending caused by ultrasonic excitation, produce a fast random sequence of pattern movements. This approach, using distributed optical sensors on digital chips, fulfils the demand for random sources.

Figure 4: Photo of an infrared speckle pattern

In a standard CMOS process a vertical bipolar transistor can be used as a photosensitive element. The optical sensitivity of such a CMOS phototransistor is shown in figure 5, where the best, median and worst case of eight transistors are plotted. The sensitivity curves do not cross; therefore a correction of the sensitivity variations by scaling factors and offsets seems feasible. Without cooling the transistors, a contrast ratio of 10^6 is attainable. This is the range from moonlight (incident optical power 1 mW/m²) via a rainy day (1 W/m²) up to brilliant sunshine (1 kW/m²).

Figure 6 shows the simulated delay time for a single phototransistor. In the experiments a speckle pattern with a mean optical power of 10 W/m² was used (see figure 6, third curve from the left). The curves show the amplitudes after switching off the light source. The abscissa is the logarithm of the time; the ordinate is the output current normalized to its starting amplitude. If the curves are plotted with their unnormalized amplitudes instead, they coincide when the starting times are chosen properly, e.g. the starting point of the 100 mW curve is identical to the 10 % point of the 1 W curve.

The delay time of the phototransistor can be cut by increasing the laser power or the base current. The latter may be done by a bias current (as shown in figure 7), because only a small range of the optical sensitivity of the phototransistor is used in this application. The schematic in figure 7 also includes the threshold, using the bias current as a reference for the digital output. This has the same effect as raising the minimum level of the optical power.
Figure 3: Speckle pattern created by an excited multimode fiber (labelled parts: coupling of continuous laser light, multimode fiber in several windings, ultrasonic excitation, projection, integrated sensor array)

The circuit's output is a digital signal depending on the incoming power. Using a well-adapted bias current, the phototransistor speed was increased by a factor of 10. Experiments on a circuit which adapts itself to the mean optical power level are in progress. Using external pin photodiodes with discrete amplifiers instead of integrated phototransistors, a maximum noise frequency of about 2.5 MHz was achieved. An essential disadvantage of these photodiodes, however, is their low sensitivity.
4 Decentralized Scalable Pseudo Random Generators
The silicon implementation of the backpropagation net uses decentralized pseudo random generators. They are based on the principle that parities formed over subsets of the stages of a feedback shift register yield shifted versions of its sequence. The original principle was developed by [Als91] and has been extended by a cascading mechanism [Ple94]. To reduce the broadcasting problem of a central pseudo random generator, one 28 bit shift register with 52 four-stage parities has been implemented per 16 synapses and 4 neurons. Only the first shift register (the master) has a combinatorial feedback to its first stage. All other, identical generators are driven by a chain which connects an additional lower shifting parity to the first stage of the next register using one single wire (see figure 8). No frequency problem arises, as pseudo random generators can be driven by the clock of the digital circuit.
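The chained generator structure can be sketched in software as follows. This is an illustration under stated assumptions, not the implemented circuit: the register length of 28 stages is taken from the text, whereas the feedback taps, the chain parity taps and the choice of four-stage output parities are hypothetical placeholders, since the text does not specify them. The names `parity`, `output_taps` and `run_chain` are likewise hypothetical.

```python
from itertools import combinations, islice
import random

N_STAGES = 28                # register length stated in the text
FEEDBACK_TAPS = (27, 24)     # assumption: master feedback from stages 28 and 25
CHAIN_TAPS = (3, 9, 14, 20)  # hypothetical parity that drives the next register


def parity(state, taps):
    """XOR of the selected stages of a register given as a list of bits."""
    p = 0
    for t in taps:
        p ^= state[t]
    return p


def output_taps(n_outputs):
    """Hypothetical choice of distinct four-stage tap sets for the output parities."""
    return list(islice(combinations(range(N_STAGES), 4), n_outputs))


def run_chain(n_registers, n_outputs, n_clocks, seed=1):
    """Clock a master register and its slaves; return streams[register][output]."""
    rng = random.Random(seed)
    taps = output_taps(n_outputs)
    # Every register starts in a non-zero state.
    regs = [[1] + [rng.randint(0, 1) for _ in range(N_STAGES - 1)]
            for _ in range(n_registers)]
    streams = [[[] for _ in range(n_outputs)] for _ in range(n_registers)]
    for _ in range(n_clocks):
        # Only the master has feedback; each slave is fed over one wire
        # by a parity of the previous register.
        inputs = [parity(regs[0], FEEDBACK_TAPS)]
        for r in range(1, n_registers):
            inputs.append(parity(regs[r - 1], CHAIN_TAPS))
        for r in range(n_registers):
            for k, tp in enumerate(taps):
                streams[r][k].append(parity(regs[r], tp))
            regs[r] = [inputs[r]] + regs[r][:-1]  # synchronous shift
    return streams
```

Calling `run_chain(3, 4, 4000)` returns four bit sequences for each of three registers. In this sketch only one bit per clock passes from one register to the next, which mirrors the single-wire chaining described above.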
5 Results
A first silicon prototype of a backpropagation net has been implemented in a 1 µm CMOS technology. It contains 3150 low-complexity standard cells (gates, flipflops, 4-to-1 multiplexers, 4 bit registers) and realizes 16 synapses including the momentum term, 4 complete neurons, and the random sources needed. The neurons have adjustable nonlinearities and an option to add inputs in both directions. The synapses are flexible with respect to the learning/adaption rate and the accuracy by switching the counter/coder length, and can also work in a non-learning mode. The prototype chip has optional inputs for physical noise sources, which can act standalone or in combination with the cascaded pseudo random sources. The core needs only 3.5 × 2.8 mm² of silicon in 31 rows of cells. First tests show that a clock frequency of at least 25 MHz is attainable with these chips. The chips can be combined to larger nets in blocks of 4 × 4 synapses and multiples of 4 neurons per layer. Although measures such as connection updates per second (CUPS) are problematic to compare, a single 10 mm² / 1 µm / 25 MHz prototype chip already reaches a theoretical performance of 400 MCUPS using a weight length of 11-12 bits and a momentum length of 14-15 bits.

Figure 5: Photoelectric sensitivity of some identical phototransistors (axes: static photocurrent [µA] versus optical power [mW/mm²]; best case and worst case curves)

Figure 6: Delay time of a phototransistor (axes: amplitude [%] versus time [s]; one curve per optical power level in W/m²)

Figure 7: Schematic of a biased phototransistor (labels: U_out, I_bias)
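The theoretical performance figure can be checked with a short calculation. The sketch below assumes that every synapse performs one connection update per clock cycle, which is our reading of the statement that the weights are modified in each clock cycle; the function name `peak_cups` is hypothetical.

```python
SYNAPSES_PER_CHIP = 16  # from the text: 16 synapses per prototype chip
CLOCK_HZ = 25e6         # from the text: 25 MHz clock frequency


def peak_cups(synapses, clock_hz):
    """Peak connection updates per second, assuming one update per synapse and clock."""
    return synapses * clock_hz


# 16 synapses * 25 MHz = 400 million connection updates per second
print(peak_cups(SYNAPSES_PER_CHIP, CLOCK_HZ) / 1e6, "MCUPS")
```

Under this assumption the product reproduces the 400 MCUPS quoted for a single chip.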
6 Problems & Perspectives
Recently an object-oriented simulation system has been implemented which allows modelling various topologies of nets based on stochastic arithmetic. Because of the time-consuming calculations, net sizes typical of applications such as vision are difficult to reach in software simulation. Therefore parallel simulations are being pursued.
Figure 8: Chain of decentralized pseudo random generators (a master followed by slave 1 to slave n, linked by code shifts; each generator delivers m bit sequences, in total n * m bit sequences)

In first hardware experiments, classical neural learning problems such as XOR and the m-n-m encoder have been shown to be solvable in 10^5 to 10^7 hardware clocks by applying some hundred sequences of patterns. In principle, the learning speed of this kind of implementation is independent of the size of the net. In the synaptic matrix the number of connections grows only with the square root of the number of active elements, so there is no pinning problem concerning scalability. Considering a 160 mm² core, 1K synapses using a 0.5 µm process are estimated to be feasible for integration. In the future about 16K synapses, using a 0.25 µm process and an optimized layout design, seem to be attainable on the same die.
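To give a feeling for the reported learning effort, the clock counts can be converted into wall-clock time. This is only a rough sketch: it assumes the 25 MHz clock reported for the prototype chip, and the function name `learning_time_seconds` is hypothetical.

```python
CLOCK_HZ = 25e6  # assumption: the 25 MHz clock reported for the prototype


def learning_time_seconds(clocks, clock_hz=CLOCK_HZ):
    """Wall-clock time needed for a given number of hardware clocks."""
    return clocks / clock_hz


# Range stated in the text: 10^5 to 10^7 hardware clocks
for clocks in (1e5, 1e7):
    print(clocks, "clocks ->", learning_time_seconds(clocks), "s")
```

Under this assumption, 10^5 clocks correspond to 4 ms and 10^7 clocks to 0.4 s.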
In this way the results of recent tests have established the basis of very fast on-chip backpropagation learning for time-critical, medium-sized applications.

The authors would like to acknowledge the contributions and the support of H. Brandt, G. Feuchter, R.R. Grigat², H. Groninga, D. Nahrgang, A. Maeder³, T. Merten, M. Paskowski, R. Plemann, D. Reinicke, and M. Rubel.

² Technische Universität Hamburg-Harburg, Technische Informatik I
³ Universität Hamburg, Fachbereich Informatik
References
[Als91] Alspector, Joshua; Gannett, Joel W.; Haber, Stuart; Parker, Michael B.; Chu, Robert: VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks, IEEE Transactions on Circuits and Systems, Vol. 38, No. 1, pp. 109-123, 1991
[Bee90] Beerhold, J.; Jansen, M.; Eckmiller, R.: Pulse Processing Neural Net Hardware with Selectable Topology and Adaptive Weights and Delays, IEEE Intern. Joint Conf. on Neural Networks, pp. II 569-574, San Diego 1990
[Egu91] Eguchi, H.; Futura, T.; Horiguchi, H.; Oteki, S.: Neural Network LSI chip with on-chip learning, IEEE Int. Joint Conf. on Neural Networks I, pp. 453-456, Singapore 1991
[Egu92] Eguchi, H.; Stork, D.G.; Wolff, G.: Precision analysis of stochastic pulse encoding algorithms for neural networks, IEEE Int. Joint Conf. on Neural Networks I, pp. 395-400, Baltimore 1992
[Gai69] Gaines, B. R.: Stochastic Computing Systems, Advances in Information Systems Science (ed. Tou, Julius T.), Vol. 2, Plenum Press, New York 1969
[Hol91] Holler, Mark A.: VLSI Implementations of Learning and Memory Systems: A Review, Proc. Conf. on Neural Information Processing Systems III, pp. 993-1000, San Mateo CA, Morgan Kaufmann 1991
[Mad90] Madani, K.; Garda, P.; Devos, F.; Lalanne, P.; Richard, H.; Rodier, J.C.; Chavel, P.; Taboury, J.: 2-D Generation of random numbers by multimode fiber speckle for silicon arrays of processing elements, Optics Communications, vol. 76, no. 5,6, pp. 387-394, 1990
[MG95] Martiny, Ingo; Grigat, Rolf-Rainer: Adaptive Microsystems with Optical Line Sensors for Measurement and Quality Control, SPIE's 1995 Int. Symp. on Optical Science, Engineering and Instrumentation, San Diego, 1995
[Mas77] Massen, Robert: Stochastische Rechentechnik - Eine Einführung in die Informationsverarbeitung mit zufälligen Pulsfolgen, Carl Hanser, München 1977
[Mur89] Murray, Alan F.: Pulse Arithmetic in VLSI Neural Networks, IEEE Micro Mag., Dec. 1989, pp. 64-74. Reprint in Sanchez-Sinencio, Edgar; Lau, Clifford: Artificial neural networks, IEEE Press 1992
[Mur94] Murray, Alan F.: Analogue neural VLSI - A pulse stream approach, Chapman & Hall, 1994
[Ple94] Plemann, Ralf: Cascading pseudo random sources, personal communications, 1994
[RZ95] Riemschneider, Karl-Ragmar; Zeidler, Hans Christoph: Parallel Bit-Stream Hardware Implementation of Backpropagation, IEEE Symposium on Parallel and Distributed Processing, San Antonio, Oct. 1995 (accepted)
[Sha91a] Shawe-Taylor, John; Jeavons, Pete; van Daalen, Max: Probabilistic Bit Stream Neural Chip: Theory, Connection Science 3 (3), pp. 317-328, Abingdon, Oxfordshire 1991
[Sha91b] Shawe-Taylor, John; Jeavons, Pete; van Daalen, Max: Probabilistic Bit Stream Neural Chip: Implementation, Proc. VLSI for Artificial Intelligence and Neural Networks (ed. Delgado-Frias, J.G.; Moore, W.R.), Plenum Press, New York 1991
[Tom88] Tomlinson, Max Stanford jr.: Implementing Neural Networks, Dissertation Thesis, Univ. of California, San Diego 1988
[vDa93] van Daalen, Max; Jeavons, Pete; Shawe-Taylor, John; Cohen, D.: Device for Generating Binary Sequences for Stochastic Computing, Electronics Letters 29 (1), pp. 80-81, Jan. 1993
[Zag94] Zaghloul, Mona E.; Meador, Jack L.; Newcomb, Robert W.: Silicon implementation of pulse coded neural networks, Kluwer Academic Publishers, 1994