RAN²SOM: A Reconfigurable Neural Network Architecture Based on Bit Stream Arithmetic
Michael Gschwind, Valentina Salapura, Oliver Maischberger fmike,vanja,
[email protected]
Institut für Technische Informatik, Technische Universität Wien, Treitlstraße 3-182-2, A-1040 Wien, AUSTRIA
Abstract

We introduce the RAN²SOM (Reconfigurable Architecture Neural Networks with Serially Operating Multipliers) architecture, a neural net architecture with a reconfigurable interconnection scheme based on bit stream arithmetic. RAN²SOM nets are implemented using field-programmable gate array logic. By conducting the training phase in software and executing the actual application in hardware, conflicting demands can be met: training benefits from a fast edit-debug cycle, and once the design has stabilized, a hardware implementation results in higher performance. While neural nets have been implemented in hardware in the past, larger digital nets have not been possible due to the real-estate requirements of single neurons. We present a bit-serial encoding scheme and computation model which allows space-efficient computation of the sum of weighted inputs, thereby facilitating the implementation of complex neural networks.
1 Introduction

Conventional computer hardware is not optimized for simulating neural networks. Therefore, several hardware implementations for neural nets have been suggested ([MS88], [CB92], [vDJST93], [SGM94]). While the functions of neural networks are comparatively simple, their hardware implementation is expensive. A neuron $n_i$ can be modeled as

$$n_i(x_1, \ldots, x_{m_i}) = a_i\left(\sum_{1 \le j \le m_i} w_{ji}\, x_j\right),$$
where $x_j$ are the input signals, $w_{ji}$ the weights and $a_i$ the activation function. Neural performance stems from the simultaneous execution of many neurons and their interconnections, not the performance of any single neuron. Thus, the main focus of our project was space efficiency to allow for large nets, not performance per neuron. Another requirement was to not have a fixed interconnection scheme in order to support a wider range of neural net applications.

We decided to use field-programmable gate arrays (FPGAs) to develop a prototype of our net. Xilinx FPGAs consist of two components, CLBs (several user-programmable lookup tables and flip-flops) and user-configurable routing resources. Custom hardware can be generated by loading a configuration bit stream into a configuration RAM [Xil93]. Since the configuration data is stored in RAM, chips can be reconfigured by downloading a new configuration bit stream. Using field-programmable gate array technology, different design choices can be evaluated in a short time. Also, customized designs can be generated for each neural net application. This design methodology also enabled us to keep overall system cost at a minimum.

Early on we decided on off-chip training, with the fully trained configuration being downloaded into hardware. This reduced real estate consumption for two reasons:

- No hardware is necessary to conduct the training phase.
- Instead of general purpose operational units, specialized instances can be generated. These require less hardware, as they do not have to handle all cases. This applies especially to the multiplication unit, which is a major resource hog in neural net implementation.

As training occurs only once in the lifetime of a neural net application, namely at the beginning of its lifetime, this off-chip training scheme does not present a limitation to a net's functionality. Our choice of FPGAs as implementation technology proved beneficial in this respect, as each net can be trained on the workstation and is then downloaded to the FPGA for operational use.

The FPGA technology here matches a definite need. A silicon implementation of our neural net architecture would have to model part of the FPGA functionality: a silicon implementation needs parameterization RAMs to store weights, thresholds and neuron interconnection information. It is not a viable solution to implement fully trained neural networks in silicon. This would require a new production and validation cycle for every newly trained neural network.

The neurons as modeled in our prototype can be used for any network architecture. They operate as binary threshold units with synaptic weights in the range [-1, 1]. The neuron fires if the sum of weighted inputs exceeds a predetermined threshold. A neuron only takes two states: active (1) or inactive (0).
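The neuron model and the binary threshold behavior described above can be illustrated with a minimal Python sketch. The function names `hard_limiter` and `neuron` are hypothetical, and treating the boundary case (sum exactly equal to the threshold) as inactive is an assumption based on the wording "exceeds".

```python
def hard_limiter(threshold):
    """Build an activation function a_i that fires (1) when the weighted
    sum exceeds `threshold`; the strict comparison is an assumption."""
    def activate(total):
        return 1 if total > threshold else 0
    return activate


def neuron(inputs, weights, activation):
    """Evaluate a_i(sum_j w_ji * x_j) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input")
    return activation(sum(w * x for w, x in zip(weights, inputs)))


# Binary inputs (0 or 1) and weights in [-1, 1], as in the prototype.
fire = hard_limiter(0.5)
state_a = neuron([1, 0, 1], [0.5, -1.0, 0.25], fire)  # weighted sum 0.75
state_b = neuron([1, 1, 1], [0.5, -1.0, 0.25], fire)  # weighted sum -0.25
```

Here `state_a` is active and `state_b` is inactive, since the negative (inhibiting) weight pulls the second sum below the threshold.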
2 Data Representation Issues

In neural net implementation, the most complex, involved and space-consuming feature is the weighting of inputs by multiply units. As every neuron has several inputs, several of these operations have to be performed concurrently for a neuron to be fully functional. Thus, we concentrated our optimization efforts on this unit. By performing the training phase off-chip, considerable space can be saved over an on-chip training scheme, as general-purpose multiplication units can be replaced by units optimized towards multiplying by a specific constant. Unfortunately, this improvement did not prove sufficiently space-conserving to allow the efficient implementation of non-trivial nets.

To further reduce the amount of hardware required to perform the operations of a neuron (multiplication and accumulation), we investigated the impact of different data representations on the hardware requirements. There are two basic strategies for encoding input/output values and synaptic weights: bit streams and digital values (i.e. conventional binary number representation). While a digital representation of numbers affords fast, parallel operation of the neural net and higher accuracy of the values represented, fully parallel (both operands in parallel) $n$-bit multipliers require $O(n^2)$ area. Thus, massive replication of neurons based on a parallel operation comes at a prohibitive cost.

With bit stream encoding, bits are processed serially, thus re-using the same functional unit for all bits in the number representation. This encoding essentially trades execution time for reduced circuit complexity. Since neural net performance is basically a function of the number of neurons which can be processed simultaneously, putting a large number of neurons on one chip leads to high performance. In our design, therefore, we chose a bit stream encoding for both the input values and weights. Using this particular bit stream encoding, multiplication of two values can be implemented efficiently by a process called digital chopping. We call the resulting architecture a RAN²SOM (Reconfigurable Architecture Neural Networks with Serially Operating Multipliers) architecture.

Bit stream encoding comes at the expense of the accuracy of value representation, as the accuracy is a linear function of the bit stream length (as opposed to the exponential function for binary representation). Since computational speed is a direct function of the bit stream length, using long bit streams is impractical. However, Cox and Blanz [CB92] report that while accuracy has a strong impact during training, reduced accuracy does not have a significant impact during the operative phase of a neural net.
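The linear versus exponential accuracy trade-off can be made concrete with a small Python sketch. It assumes that a stream value is the fraction of set bits, so that consecutive representable values differ by one bit in the stream; the function names are hypothetical.

```python
def stream_precision(length):
    """Step between consecutive values of a bit stream of `length` bits,
    where the value is the fraction of set bits."""
    return 1.0 / length


def binary_bits_for(length):
    """Word width a conventional binary number needs for the same step."""
    return (length - 1).bit_length()


# Halving the step doubles the stream length (and the computation time),
# while a binary word only grows by one bit.
comparison = {n: (stream_precision(n), binary_bits_for(n)) for n in (64, 128, 256)}
```

For a stream of 256 bits this gives a step of 1/256, which a binary word reaches with only 8 bits.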
3 Multiplication Using Digital Chopping

Using the time division multiply technique [Ern60], two analog values can be multiplied by intersecting the two signals, provided that one signal has a considerably higher frequency than the other. This process is called chopping and is restricted to multiplication by values from the range [0, 1]. A similar technique can be applied to multiplication of digital values encoded as bit streams.

Figure 1: Chopping signals and multiplication: (a) chopping signals for a bit stream of length 32; (b) generation of chopping weights by the union of the basic chopping signals; (c) chopping multiplication by intersection of two signals.

For digital operation, both the input values and the weights are represented as bit streams (see figure 1), in which the fraction of set bits in the total bit stream is used to encode the value. Multiplication is performed by intersecting the two streams bit by bit. Thus, a weight of 0.5 is represented by a chopping bit stream which has half of its bits set.

In digital operation, the frequency disparity necessary for correct operation is not easy to attain. Since we use hard limiter activation functions, the state of any neuron is restricted to active (1) or inactive (0). These values are encoded by bit streams consisting entirely of either 1s or 0s (signals which have a frequency of 0). This guarantees a sufficient frequency disparity for correct operation.

For the input neurons, values $v$ in the range [0, 1] are encoded by a bit stream where the probability that a bit is set is $v$. The set bits have to be uniformly distributed over the entire bit stream to ensure correct operation. This requirement is a result of the limited frequency spectrum available in digital circuits. Thus, correctness of operation has to be ensured by probabilistic methods.
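Digital chopping can be sketched in Python as a bitwise AND of two streams. The exact shape of the basic chopping signals in figure 1 is not recoverable here, so the construction below is an assumption: signal k is set at every position whose lowest set bit is bit k-1, which yields disjoint signals with 1/2, 1/4, 1/8, ... of the bits set. All function names are hypothetical.

```python
import random


def chopping_signal(k, length):
    """Basic chopping signal k (k >= 1): length / 2**k evenly spaced set bits.
    Signals with different k never overlap (assumed construction)."""
    return [1 if t % (1 << k) == (1 << (k - 1)) else 0 for t in range(length)]


def weight_stream(numerator, length):
    """Chopping weight numerator/length as the union of basic signals."""
    bits = length.bit_length() - 1
    if length != 1 << bits or not 0 <= numerator < length:
        raise ValueError("length must be a power of two, 0 <= numerator < length")
    stream = [0] * length
    for k in range(1, bits + 1):
        if numerator & (1 << (bits - k)):  # binary digit worth 2**-k
            stream = [a | b for a, b in zip(stream, chopping_signal(k, length))]
    return stream


def chop_multiply(input_stream, weight):
    """Multiply two stream-encoded values by intersecting them bit by bit."""
    return [a & b for a, b in zip(input_stream, weight)]


def value_of(stream):
    """Decode a stream: fraction of set bits."""
    return sum(stream) / len(stream)


def random_input(v, length, rng):
    """Input stream in which each bit is set with probability v."""
    return [1 if rng.random() < v else 0 for _ in range(length)]


# An active neuron (all ones) times a weight of 24/32 = 0.75.
exact = value_of(chop_multiply([1] * 32, weight_stream(24, 32)))

# A probabilistic input of about 0.5 times a weight of 0.5: only approximate.
rng = random.Random(0)
approx = value_of(chop_multiply(random_input(0.5, 4096, rng),
                                weight_stream(2048, 4096)))
```

With constant (all 1s or all 0s) inputs the product is exact, while for a probabilistic input the result only approaches the true product as the stream grows, which matches the remark that correctness is ensured by probabilistic methods.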
4 Neuron Operation

A neuron weighs its input values in the synapses, accumulates the post synaptic signals, and applies the activation function. As digital chopping is a simple operation, all inputs can be operated on in parallel by separate synapse instances. The accumulation of post synaptic signals is performed by time multiplexing to reduce hardware complexity. After one processing run, the threshold activation function determines the output value of the neuron.

Figure 2 shows a complete neuron with eight inputs: each bit is processed by a synapse and fed to a multiplexer. This multiplexer selects each post synaptic signal and feeds it to a counter which accumulates the post synaptic signals. A sign function determines whether a specific input is inhibiting or activating. This determines the signal on the counter up/down signal, thereby circumventing the restriction of chopping multiplication to the range [0, 1]. The counter contains the accumulated value of all post synaptic values in a biased representation: the value 0 is represented by the bit pattern $100000000000_2$.

The counter is also used to implement the threshold activation function of the neuron. At the beginning of each computation cycle (i.e. when a new bit stream is fed to the neuron), the counter is initialized by loading the negated threshold value in biased representation, i.e. $2048 - \theta$ for a threshold $\theta$. Thus, to implement a neuron with threshold 128, the counter will be initialized to 2048 - 128 = 1920. This value will then be hardwired and not consume any resources. The activation state of a neuron can then be concluded from the counter's most significant bit.¹

As we use bit streams of length 256, and accumulation of the post synaptic signals requires eight cycles per bit (one cycle for each input channel), a new computation cycle is started every 2048 cycles. This condition is checked by a global counter, and distributed to all neurons. Upon receiving this signal, the neurons will latch their output state and reload the counter with the biased threshold to restart a new computation [Mai94].
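The counter-based accumulation and threshold test can be simulated with a short Python sketch. It is a behavioral model under stated assumptions, not the hardware description: weights are given as signed bit counts out of 256, the helper `spread_bits` is a hypothetical way of spreading set bits evenly, and a plain Python integer stands in for the 12-bit register (so the sketch assumes the counter never leaves the range 0 to 4095).

```python
STREAM_LENGTH = 256  # bits per stream, as stated in the text
BIAS = 2048          # bit pattern 1000 0000 0000 represents the value 0


def spread_bits(count, length=STREAM_LENGTH):
    """Hypothetical helper: a stream with `count` set bits spread evenly."""
    return [((t + 1) * count) // length - (t * count) // length
            for t in range(length)]


def run_neuron(states, weights, threshold):
    """Simulate one computation cycle of a neuron.

    states:    0 or 1 per input (encoded as all-zero or all-one streams)
    weights:   signed bit counts; weight value = count / STREAM_LENGTH
    threshold: firing threshold in the same units as the weight counts
    Returns (output_state, final_counter_value).
    """
    inputs = [[s] * STREAM_LENGTH for s in states]
    chops = [spread_bits(abs(w)) for w in weights]
    counter = BIAS - threshold              # negated threshold, biased
    for t in range(STREAM_LENGTH):
        # One cycle per input channel: the multiplexer picks each synapse.
        for x, chop, w in zip(inputs, chops, weights):
            if x[t] & chop[t]:              # chopping multiplication
                counter += 1 if w > 0 else -1   # sign drives up/down
    return (counter >> 11) & 1, counter     # most significant of 12 bits


fired, final = run_neuron([1] * 8, [128] * 8, threshold=128)
```

With eight active inputs of weight 0.5 and threshold 128, the counter starts at 1920 and each input contributes 128 increments, so the most significant bit ends up set.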
¹ As $2048 - \theta + \sum_j w_j x_j \ge 2048 \Rightarrow \sum_j w_j x_j \ge \theta$.

Figure 2: Schematic diagram of a neuron

5 Network Architecture and Performance

The design of the neurons is such that any neural architecture can be assembled from single neurons. Users can choose an optimal interconnection pattern for their particular application, as these interconnections are performed using FPGA routing. Thus, our design can be used to implement, inter alia, Hopfield [Hop82], feed-forward and recursive networks, and Kohonen self-organizing feature maps [Koh90]. Figure 3 shows a feed-forward network with four neurons in the input layer, four neurons in the hidden layer and two neurons in the output layer. A global counter is used to synchronize the neurons.

Figure 3: Example architecture: a feed-forward network with ten neurons.

The RAN²SOM design can be scaled to meet varying performance and precision requirements and to support an application-specific number of input signals (see table 1). The performance of a complete net (interconnections per second) is a function of the firing rate of a single neuron (UPS) and the particular network implemented (number of neurons and interconnections), and is limited only by the design size.

    precision   8 inputs   4 inputs
    0.0039      16000+     32000+
    0.0078      32000+     64000+
    0.0156      64000+     128000+

Table 1: Updates per second (UPS) of one neuron (at 33 MHz), as a function of the number of inputs and value representation precision. Precision is defined as the delta between two consecutive representable values.

Several neurons can be put on one FPGA. The exact number of neurons fitting on one FPGA depends on the exact FPGA type, the number of inputs and the required precision, and can be anywhere from eight to 128. By using multiple FPGAs, arbitrarily large, complex neural nets can be designed cheaply and efficiently. Having neurons as indivisible functional units allows absolute freedom in choosing any topology required.
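The figures in table 1 are consistent with one update taking (stream length) x (number of inputs) clock cycles, as described for the 256-bit, eight-input neuron. The following Python sketch recomputes them under that assumption; the clock is taken as exactly 33,000,000 Hz and the function name is hypothetical.

```python
CLOCK_HZ = 33_000_000  # assumed exact value for "33 MHz"


def updates_per_second(stream_length, num_inputs, clock_hz=CLOCK_HZ):
    """One update needs one cycle per input channel for every stream bit."""
    cycles_per_update = stream_length * num_inputs
    return clock_hz / cycles_per_update


# Precision is 1 / stream_length: 256 -> 0.0039, 128 -> 0.0078, 64 -> 0.0156.
table = {
    (length, inputs): updates_per_second(length, inputs)
    for length in (256, 128, 64)
    for inputs in (8, 4)
}
```

For example, a 256-bit stream with eight inputs needs 2048 cycles per update, which gives slightly more than 16000 updates per second.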
6 Design Process

The process of designing a new net is simple and automated. Using a neural network simulator, a network architecture is defined and the net trained, yielding a set of new weights and biases. The internal representation of the neural net simulator is then translated to an intermediate file (see figure 5), on which all back end tools operate. The standard intermediate file format contains all relevant information about the neural net and allows different front end tools to be used. To use a new front end simulator, only a simple filter which translates the simulator file format to the intermediate file format is needed.

The new connections, weights and biases are then mapped to the logic of the FPGAs. Each newly designed neural net has to be translated to an FPGA configuration bit stream, a process which is fully automated in our design environment: the network topology with trained weights and biases is described in the intermediate file, from which design optimization and translation is performed automatically, without further user intervention. The resulting configuration bit stream is then used to initialize the Xilinx FPGAs [Xil93]. Figure 4 shows the phase model for the design of a new net from training to the hardware model.

The same intermediate file syntax can be used for all our different neuron architectures ([GMS94], [SGM94], [Sal94]), thus presenting a unified interface to the user. When translating the intermediate file to an FPGA configuration, the neuron type is specified and a net based on a particular neuron architecture is generated. This makes it possible to explore different implementation strategies for a particular application.

The intermediate file contains all parameters needed. For illustration, a simple input file is shown in figure 5. At the beginning of the file, inputs are specified (denoted with I), assigning a name to every input. Then, the neurons are described. The order of neurons in the file is irrelevant. Every neuron is defined with four parameters. Firstly, a name is assigned to every unit. Then, the bias value assigned to the unit is specified. After that, the connections are specified: for each of the eight neuron inputs, the name of an input port to the network or the name of the neuron with which to connect is given. Finally, the synaptic weights are given. At the end of the file, the neural network output ports are specified.

    I I1
    I I2
    N S0N0 0
    C I1
    W 126 0 0 0 0 0 0 0
    N S0N1 0
    C I2
    W 126 0 0 0 0 0 0 0
    N S1N0 192
    C S0N0 S0N1
    W 63 63 0 0 0 0 0 0
    N S2N0 64
    C S0N0 S0N1
    W 63 63 -126 0 0 0 0 0
    O S2N0

Figure 5: Example intermediate file: a feed-forward network with four neurons.
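The "simple filter" idea can be illustrated with a small Python reader for such a file. The record layout used here is an assumption made for illustration: one record per line, tagged I (input), N (neuron name and bias), C (connections), W (weights) and O (output). The class `Neuron`, the function `parse_net` and the sample text are hypothetical and do not describe the authors' actual tools.

```python
from dataclasses import dataclass, field


@dataclass
class Neuron:
    name: str
    bias: int
    connections: list = field(default_factory=list)
    weights: list = field(default_factory=list)


def parse_net(text):
    """Parse an intermediate file (assumed one tagged record per line)."""
    inputs, neurons, outputs = [], [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        tag, args = fields[0], fields[1:]
        if tag == "I":
            inputs.extend(args)
        elif tag == "O":
            outputs.extend(args)
        elif tag == "N":
            neurons.append(Neuron(name=args[0], bias=int(args[1])))
        elif tag in ("C", "W"):
            if not neurons:
                raise ValueError(f"'{tag}' record before any neuron")
            if tag == "C":
                neurons[-1].connections = args
            else:
                neurons[-1].weights = [int(w) for w in args]
        else:
            raise ValueError(f"unknown record tag: {tag}")
    return inputs, neurons, outputs


SAMPLE = """\
I I1
I I2
N S0N0 0
C I1
W 126 0 0 0 0 0 0 0
N S1N0 192
C S0N0 I2
W 63 63 0 0 0 0 0 0
O S1N0
"""

net_inputs, net_neurons, net_outputs = parse_net(SAMPLE)
```

Because each neuron is a self-contained group of records, the order of neurons in the file does not matter to such a reader, in line with the description above.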
Figure 4: Phase model of net development (diagram labels: simulation, training, trained net, hardware design, FPGA, test phase, success, failure)
7 Related Work
7.1 Digital Data Representation

GANGLION [CB92] is an implementation of a simple three layer feed-forward net. GANGLION uses a highly parallel, pipelined architecture, with the aim of providing high performance. This design uses two Xilinx XC3090 FPGAs and a 2Kx8 PROM per neuron, trading high area consumption for high performance. Ferrucci [Fer94] shows how the GANGLION approach can be extended to self-adapting networks.

Marchesi et al. [MOPU93] reduce hardware complexity by restricting weights to numbers of the form $2^n + 2^m$; thus two shift registers and an adder suffice to perform the multiplication by the synaptic weight. However, this design cannot be used with conventional training algorithms.

In [SGM94], we propose a neuron design using a digital representation which can be implemented with a fraction of the logic required for GANGLION, by using only one multiplier which is shared by all synaptic weight computations.
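The weight restriction of the form $2^n + 2^m$ mentioned above can be illustrated with a short Python sketch of shift-and-add multiplication. It only demonstrates the arithmetic identity; the integer (fixed-point) operand and the function names are assumptions, and negative exponents are modeled as right shifts, which truncate.

```python
def shift(x, exponent):
    """Multiply integer x by 2**exponent using a shift (right shift truncates)."""
    return x << exponent if exponent >= 0 else x >> -exponent


def multiply_pow2_sum(x, n, m):
    """Compute x * (2**n + 2**m) with two shifts and one addition."""
    return shift(x, n) + shift(x, m)


scaled_up = multiply_pow2_sum(12, 3, 1)      # 12 * (8 + 2)
scaled_down = multiply_pow2_sum(64, -1, -3)  # 64 * (0.5 + 0.125)
```

No general multiplier is involved: each call uses only shifts and a single add, which is why two shift registers and an adder are enough in hardware.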
7.2 Bit Stream Representation

Murray and Smith [MS88] demonstrate an analog implementation of a neural net. This implementation uses pulse stream encoding to represent values. This facilitates an efficient implementation of multiplication by intersecting the input signal with a high-frequency chopping signal.

van Daalen et al. [vDJST93] use a stochastic approach to neural net implementation. They represent values $v$ in the range [-1, 1] by stochastic bit streams in which the probability that a bit is set is $(v + 1)/2$. Their input representation and architecture restrict this approach to fully interconnected feed-forward nets. The non-linear behavior of this approach requires that new training methods be developed.

In [Sal94], we explore the use of delta encoded binary sequences to represent input values and synaptic weights. The resulting delta arithmetic units only require one-bit full adders and D flip-flops.
8 Conclusion

RAN²SOM (Reconfigurable Architecture Neural Networks with Serially Operating Multipliers) is a space-efficient, scalable neural network architecture, which can be adapted to support a wide range of performance, precision and input requirements. Starting from an optimized, freely interconnectable neuron, various neural network models can be realized.

By using bit stream encoding of values and digital chopping, the hardware required for implementing a single neuron can be reduced considerably. This allows for the massive replication of neurons to build complex neural nets. FPGAs are used as the hardware platform, facilitating the implementation of arbitrary network architectures and the use of an off-chip training scheme.

Future projects include the development of a wider variety of neurons (with different activation functions, performance/real estate trade-offs etc.) and the complete automation of the design flow from the network architecture definition phase and training to the finished hardware implementation.
9 Acknowledgement

We wish to thank Alexander Jaud for his help with the Workview and the Xilinx design environments.

References

[CB92] Charles E. Cox and W. Ekkehard Blanz. GANGLION - a fast field-programmable gate array implementation of a connectionist classifier. IEEE Journal of Solid-State Circuits, 27(3):288-299, March 1992.

[Ern60] Dietrich Ernst. Elektronische Analogrechner - Wirkungsweise und Anwendung. R. Oldenbourg, München, Deutschland, 1960.

[Fer94] Aaron Ferrucci. A field-programmable gate array implementation of a self-adapting and scalable connectionist network. Master's thesis, University of California, Santa Cruz, January 1994.

[GMS94] Michael Gschwind, Oliver Maischberger, and Valentina Salapura. Die Implementation komplexer, frei konfigurierbarer neuraler Netze in Hardware. Journal der Österreichischen Gesellschaft für Artificial Intelligence, 12(3-4):10-14, January 1994.

[Hop82] John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the Academy of Sciences USA, volume 79, pages 2554-2558, April 1982.

[Koh90] Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, September 1990.

[Mai94] Oliver Maischberger. Die Implementation Neuraler Netze mittels FPGAs. Technical report, Technische Universität Wien, Wien, Österreich, 1994. (to be published).

[MOPU93] Michele Marchesi, Gianni Orlando, Francesco Piazza, and Aurelio Uncini. Fast neural networks without multipliers. IEEE Transactions on Neural Networks, 4(1):53-62, January 1993.

[MS88] Alan F. Murray and Anthony V. W. Smith. Asynchronous VLSI neural networks using pulse-stream arithmetic. IEEE Journal of Solid-State Circuits, 23(3):688-697, March 1988.

[Sal94] Valentina Salapura. Neural networks using bit stream arithmetic: A space efficient implementation. In Proceedings of the IEEE International Symposium on Circuits and Systems, London, UK, June 1994.

[SGM94] Valentina Salapura, Michael Gschwind, and Oliver Maischberger. A fast FPGA implementation of a general purpose neuron. In Proc. of the Fourth International Workshop on Field Programmable Logic and Applications, Prag, Czech Republic, September 1994.

[vDJST93] Max van Daalen, Peter Jeavons, and John Shawe-Taylor. A stochastic neural architecture that exploits dynamically reconfigurable FPGAs. In IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, April 1993. IEEE CS Press.

[Xil93] Xilinx. The Programmable Logic Data Book. Xilinx, San Jose, CA, 1993.