Simulation of a Digital Neuro-Chip for Spiking Neural Networks

T. Schoenauer, S. Atasoy, N. Mehrtash and H. Klar
Institute of Microelectronics, Technical University of Berlin, Einsteinufer 17, Sekr. EN-4, D-10587 Berlin, Germany
Phone: +49 30 314-22640, Fax: +49 30 314-24597, E-mail: [email protected]

ABSTRACT: Conventional hardware platforms are far from reaching the real-time simulation requirements of complex spiking neural networks (SNN). Therefore we designed an accelerator board with a neuro-processor chip, called the NeuroPipe-Chip. In this paper, we introduce two new concepts at chip level to speed up the simulation of SNN. The concepts are implemented in the architecture of the NeuroPipe-Chip. We present the hardware structure of the NeuroPipe-Chip, which is modelled at register-transfer level (RTL) in the hardware description language VHDL. We evaluate the performance of the NeuroPipe-Chip in a system simulation, where the rest of the accelerator board is modelled in behavioral VHDL. For a simple SNN for image segmentation, the NeuroPipe-Chip operating at 100 MHz shows an improvement of more than two orders of magnitude over a 500 MHz Alpha workstation and approaches real-time requirements for SNN on the order of 10^6 neurons. Hence, such an accelerator would allow real-time simulations of complex SNN for image processing. Currently, the implementation of the NeuroPipe-Chip in a 0.35 µm digital CMOS technology is being investigated.

1. Introduction

Spiking (or pulse-coded or pulsed) neurons are complex integrate-and-fire neurons. Synchronized firing of neuronal assemblies could serve the brain as a code for feature binding, pattern segmentation and figure/ground separation [Mal81, Eck88]. Spiking neural networks (SNN) are able to model such synchronization since they take into account the precise timing of spike events. They are therefore the subject of investigations for biology-inspired image processing applications [Wei97, Lin98]. However, employing SNN for image processing requires complex networks on the order of 10^6 spiking neurons [Sho97]. Simulating large networks of complex neuron models is a computationally expensive task and leads to long simulation times even on high-performance workstations [Jah97]. Furthermore, solving real-world tasks requires real-time performance of complex networks, which can only be achieved by supercomputers or dedicated hardware. The NeuroPipe-Chip is part of such a dedicated hardware accelerator system. The main objective of this paper is to present the RTL-model of the NeuroPipe-Chip, which employs two new concepts for better computational performance, as well as simulation results of the accelerator system, which also allow an evaluation of the NeuroPipe-Chip's performance. For a better understanding, parts 2 and 3 of this paper give general aspects of SNN and their simulation, and a review of our accelerator system.

2. Simulation of Spiking Neural Networks

Spiking neuron models are biophysical models which account for properties of real neurons without descending to the level of ionic currents, but by modelling the integrated signal flow of incoming action potentials through parts of the neuron, in particular the synapses, dendrites, soma and axon. As opposed to rate-coded models, spiking neuron models encode their information in the exact timing of a neuron's firing events, not in the firing frequency. Interneuron communication therefore takes place solely via action potentials (also called spikes or pulses). A time-discrete model of a generic spiking neuron with q feeding dendrites (excitatory or inhibitory), modulated multiplicatively by a linking dendrite, and with a dynamic threshold is shown in Fig. 1. Each dendrite as well as the dynamic threshold is modelled by a leaky integrator. We refer to the state value of a leaky integrator as a dendrite potential (DP). Furthermore, delay elements at the output of the model represent axonal delays. We assume spiking neural networks to be sparsely connected. The networks are composed of several layers

where each layer consists of neurons with equal properties. Neurons may be connected laterally (within one layer) as well as to other layers. Connections are usually structured in perceptive fields. The network activity is the average number of spikes per time-slice divided by the total number of neurons. Network activity of SNN is typically low, i.e. < 1%. The computation of an SNN by digital hardware is performed in discrete time. For real-time simulations, a time interval (also called a time-slice) of 1 ms is considered sufficient for the entire network data to be updated.

Figure 1: Time-discrete representation of a generic spiking neuron model. [The figure shows the synapse/dendrite stages: feeding dendrites with weights W_f1..W_fq and decay factors r_f1..r_fq, and a linking dendrite with weight W_l and decay factor r_l, each realized as a leaky-integrator EPSP/IPSP function; the soma, which combines the dendrite potentials additively and multiplicatively; a dynamic threshold function with decay factor r_t, spike feedback -eta0 and a comparator against the static threshold; and axonal delay elements d_1..d_k producing the outputs y_1(nT)..y_k(nT) in [0,1] from inputs X(nT) in [0,1].]
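As a hedged illustration of the time-discrete model of Fig. 1, the following Python sketch updates one neuron for a single time-slice. All names and parameter values (r_f, r_l, r_t, eta0, the static threshold) are illustrative assumptions, not taken from the chip:

```python
# Hedged sketch (not the authors' code) of the generic spiking neuron of Fig. 1.

def leaky_integrator(dp, r, stimulus):
    """One time-slice of a leaky integrator: decay, then add the input."""
    return r * dp + stimulus

def neuron_step(state, feed_in, link_in, T_static=1.0, eta0=5.0,
                r_f=0.9, r_l=0.8, r_t=0.95):
    """state holds the feeding DPs, the linking DP and the dynamic threshold.
    Returns 1 if the neuron spikes in this time-slice, else 0."""
    feeding = [leaky_integrator(dp, r_f, x)
               for dp, x in zip(state["feeding"], feed_in)]
    linking = leaky_integrator(state["linking"], r_l, link_in)
    theta = leaky_integrator(state["theta"], r_t, 0.0)
    # Feeding DPs are modulated multiplicatively by the linking dendrite.
    u = sum(feeding) * (1.0 + linking)
    spike = 1 if u > T_static + theta else 0
    if spike:
        theta += eta0  # refractory period via the dynamic threshold
    state.update(feeding=feeding, linking=linking, theta=theta)
    return spike
```

A strong feeding input makes the neuron spike, after which the raised dynamic threshold suppresses an immediate second spike, mimicking refractoriness.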

3. Accelerator System for Spiking Neural Networks: MASPINN


In order to achieve real-time simulation of very complex SNN, we proposed an accelerator system called MASPINN (Memory optimized Accelerator for Spiking Neural Networks), which is connected as an accelerator board to a host computer via a PCI-bus [Shö98]. MASPINN is based on previously proposed concepts such as a spike event list, a sender-oriented connectivity list and the tagging of dendrite potentials [Fra95, Jah96]. We refer to dendrite potentials (DP) as the state values in the leaky integrators of the neuron model (Fig. 1). Two additional concepts are associated with the MASPINN architecture: weight caching and a compressed DP-memory. All of these concepts either reduce the amount of computation to a minimum or make the required computation more suitable for a hardware realization. As shown in Fig. 2, a MASPINN board consists of three main units: a Neuron Unit, which computes the neuron model; a Spike Event List, which stores the addresses of source neurons (neurons emitting a spike); and a Connection Unit, which determines target neurons (neurons receiving a spike) and the corresponding weight values. MASPINN uses Weight Caches in the Connection Unit to accumulate the weights during a time-slice. The accumulated weights then serve as target-neuron input during the next time-slice.
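The weight-caching idea can be sketched as follows; the dictionary-based, sender-oriented connectivity list and all names here are hypothetical, serving only to illustrate how per-target weights are accumulated during one time-slice:

```python
# Illustrative sketch of MASPINN-style weight caching: weights of all spikes
# are summed per target neuron during slice n; the sums become the targets'
# input in slice n+1 (names and data layout are assumptions).
from collections import defaultdict

def accumulate_weights(spike_events, connectivity):
    """spike_events: addresses of source neurons that fired this slice.
    connectivity: sender-oriented list {source: [(target, weight), ...]}."""
    cache = defaultdict(float)
    for src in spike_events:
        for target, weight in connectivity.get(src, []):
            cache[target] += weight  # accumulate instead of updating targets now
    return dict(cache)               # handed to the Neuron Unit next slice

conn = {0: [(1, 0.5), (2, 0.25)], 3: [(1, 0.5)]}
print(accumulate_weights([0, 3], conn))  # {1: 1.0, 2: 0.25}
```

Accumulating in a cache decouples spike delivery from neuron-state updates, which is what guarantees that all input data for a time-slice is complete before the Neuron Unit starts.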

Figure 2: Basic structure of a MASPINN board. [The figure shows the Neuron Unit containing the NeuroPipe-Chip with its Net-Parameter, DP-Tag and DP memories; the Connection Unit containing the Connection Chip, the Weight Memory and Weight Caches 0 and 1; the Spike Event List storing source-neuron addresses; and the PCI interface to the host computer. The circled part is computed by the NeuroPipe-Chip.]

Due to the low network activity, only some of the DPs in an SNN receive an input. Therefore many DPs decay to zero and are irrelevant: they do not need to be accessed or filtered in a leaky integrator. This can be achieved by using tags which mark the relevance of a DP with a single bit. We refer to storing only the relevant DPs in consecutive order (from the first neuron in the first layer to the last neuron in the last layer) as a compressed DP-Memory.
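The tag-and-compress scheme can be sketched in a few lines; the threshold `eps` and the function names are illustrative assumptions:

```python
# Sketch of the compressed DP-Memory idea: a one-bit tag per neuron marks
# whether its DP is relevant; only relevant DPs are stored, in layer order.

def compress(dps, eps=1e-3):
    """Split a dense DP list into a bit-tag stream and a packed DP-memory."""
    tags = [1 if abs(dp) > eps else 0 for dp in dps]    # 1 bit per DP
    packed = [dp for dp in dps if abs(dp) > eps]        # compressed DP-memory
    return tags, packed

def decompress(tags, packed):
    """Reconstruct the dense DP list; untagged DPs are taken as zero."""
    it = iter(packed)
    return [next(it) if t else 0.0 for t in tags]

tags, packed = compress([0.0, 0.7, 0.0, 0.0, 1.2])
# tags = [0, 1, 0, 0, 1]; only the two relevant DPs are stored and processed
```

With a typical relevance of around 12% (Section 6), memory traffic and leaky-integrator work shrink accordingly, since untagged DPs are never touched.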

4. RTL-Model of the NeuroPipe-Chip

The main task of the NeuroPipe-Chip is to simulate a spiking neuron as depicted in the circled part of Fig. 2. In order to allow different kinds of neuron models to be simulated, the NeuroPipe-Chip computes a programmable neuron model: with a program code the user specifies the number of DPs per neuron and how each DP contributes to the membrane potential (e.g. excitatory, inhibitory, multiplicative). All neurons of one layer are assumed to have the same characteristics. For each time-slice the neuron simulation consists of the following sequence of steps. Decay: DPs are decayed ('leaky integration'). Propagate: stimuli from other neurons in the network or from the input layer ('accumulated weights') are added to the corresponding DPs. Output: DPs are combined into a membrane potential according to a program code; when the membrane potential exceeds a threshold, the neuron spikes and its address is written to the Spike Event List. Thanks to the concept of weight caching, the complete set of data needed to compute the neuron model is available at the beginning of a time-slice: all DPs and accumulated weights. Therefore, starting with the first relevant DP of the first layer, DPs are consecutively fed into a pipeline performing the steps decay, propagate and output. In addition to a fully pipelined datapath, several such datapaths in parallel achieve a further speed-up. The NeuroPipe-Chip was designed with two datapaths in parallel. A top-level view of the RTL-model is given in Fig. 3a. Per clock cycle, each datapath receives the data of one relevant DP from the DP-Memory via the Neuron-Memory (NM) Interface (see Fig. 3a). The Neuron-Memory encloses the DP-, the DP-Tag- and the Neuron-Parameter-Memory. From the DP-tag-stream, the corresponding DP-address is generated by a Tag-to-Address-Converter. DP data and address then enter a pipeline stage where the DP is multiplied with a decay factor r ('Decay-Stage').
If the result is below a user-defined threshold, the DP is regarded as irrelevant and is removed from further processing. In the next pipeline stage, accumulated weights from the Weight Cache are combined with the DPs ('Propagate-Stage'): accumulated weights and DPs with equal addresses are summed; accumulated weights with no corresponding DP become a new DP and are inserted into the DP-stream so that the consecutive order of DPs is maintained. Since DP-removal in the Decay-Stage and DP-insertion in the Propagate-Stage might cause pipeline stalls, First-In-First-Out (FIFO) memories have been added in front of and behind these pipe stages to buffer data irregularities.

Figure 3: a) Basic blocks of the NeuroPipe-Chip RTL-model. b) On-Chip Inhibition Unit.

At the output of the Propagate-Stage the DP is now completely updated and a copy of the data is written back to the DP-Memory. The corresponding DP-address is converted back into a '1' at the proper place in the DP-tag-stream by an Address-to-Tag-Converter. From the Propagate-Stage a copy of the DP is also delivered to a subsequent pipe stage, the Higher-Order-Filter-Stage, where DPs are cascaded to model higher-order filter functions. A Sorting Stage then maps all DPs of the same neuron to one of two parallel Processor Elements (PE). A PE computes the membrane potential and determines the spike activity of the neuron. Optimal load balancing of the parallel PEs is achieved by the Sorting Stage: it uses the half-full flags of the FIFOs in front of each PE to decide which PE the next neuron will be mapped to. In case a neuron is active, its address is written to the Spike Event List.
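The decay/propagate/output sequence of the datapath can be sketched, much simplified (one DP per neuron, a single datapath, no parallel PEs), as follows; the decay factor, thresholds and names are illustrative assumptions:

```python
# Hedged sketch of one NeuroPipe time-slice: Decay, Propagate, Output.
# This is a behavioral toy model, not the chip's pipelined RTL.

def time_slice(dps, acc_weights, r=0.9, eps=1e-3, theta=1.0):
    """dps: {address: DP}; acc_weights: {address: cached accumulated weight}.
    Returns the updated DPs and the addresses written to the Spike Event List."""
    # Decay-Stage: leaky integration; sub-threshold DPs are dropped (untagged).
    dps = {a: r * v for a, v in dps.items() if abs(r * v) > eps}
    # Propagate-Stage: add cached weights; unmatched weights become new DPs.
    for a, w in acc_weights.items():
        dps[a] = dps.get(a, 0.0) + w
    # Output: membrane potential vs. threshold (single-DP neurons for brevity).
    spikes = [a for a, v in dps.items() if v > theta]
    return dps, spikes
```

In the real chip these three steps run concurrently on a stream of DPs, with FIFOs absorbing removals and insertions; the toy model only shows the data dependencies between the stages.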

5. Novel Concepts of the NeuroPipe-Chip

The architecture of the NeuroPipe-Chip was designed to take advantage of MASPINN concepts such as weight caching and a compressed DP-Memory. In addition, two novel concepts in the NeuroPipe design allow a further increase in system performance: an On-Chip Inhibition Unit and data pre-analysis. On-Chip Inhibition Unit: in SNN for image processing, inhibition is commonly used to control the network activity and to implement a winner-take-all mechanism, e.g. to separate objects in time in an SNN for image segmentation [Rei93, Wei97]. Such an inhibition module receives spikes from all neurons or from a large portion of the network. It then applies equally distributed negative feedback to these neurons. Typically, the connections to and from the inhibition module have similar strength. Therefore, an inhibition module may be implemented conveniently on-chip: only a few parameters are required for its characterization; moreover, by placing the On-Chip Inhibition Unit within a processor element, resource sharing of arithmetic modules, such as multipliers, minimizes the area overhead. The basic structure of the On-Chip Inhibition Unit is shown in Fig. 3b. The left part of Fig. 3b shows registers holding the additional neuron parameters required. The middle part represents the required arithmetic elements. Since the computation takes only a few clock cycles and occurs only once per time-slice, the arithmetic elements of a PE may be reused. The right part shows the main elements of the On-Chip Inhibition Unit: only a counter, registers for the inhibition data and a controller are necessary. In the case of global inhibition it works as follows. All spikes of a layer during a time-slice increment a counter. At the end of layer processing, the counter value is multiplied with an inhibition weight, buffered and added to an inhibitory accumulator (see Fig. 3b). Meanwhile the counter is reset.
Hence, at the end of a time-slice the value in the inhibitory accumulator represents an equivalent of the network activity of the previous time-slice. It may be used during the next time-slice to inhibit the network. For example, the accumulated value is added to a so-called inhibitory potential, which is also decayed each time-slice. This inhibitory potential may substitute an inhibitory DP of a neuron. By computing the inhibitory potential in the On-Chip Inhibition Unit only once per time-slice, instead of computing an inhibitory DP for each neuron of the network, memory, I/O bandwidth and computation time are saved. Since a global inhibitory potential is equal for all neurons, instead of taking it into account for each neuron of a layer during the membrane potential calculation, it may be computed once at the beginning of layer processing and added to the static threshold (see Fig. 3b). In the case that the inhibition is not global but local to a few layers, several inhibition units are required. In the NeuroPipe-Chip, we implemented two On-Chip Inhibition Units, one in each PE. Data pre-analysis: computational resources can furthermore be saved by analyzing which DPs need to be taken into account for the membrane potential calculation. For example, if a neuron has no DPs with an additive influence on the membrane potential, there is no point in calculating the ones with a multiplicative or subtractive influence: in any case the neuron will not emit a spike. In the NeuroPipe architecture, the Sorting Unit keeps track of the types of DPs in the datapath pipeline. In case no additive ones have occurred and none will occur, all multiplicative or subtractive ones are neglected. Thereby the load on the PEs is reduced.
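The bookkeeping of the On-Chip Inhibition Unit for global inhibition can be sketched as below; the class, its parameter values and method names are illustrative assumptions about the behavior described above, not the chip's implementation:

```python
# Sketch of the On-Chip Inhibition Unit for global inhibition: a counter
# tallies the layer's spikes; once per time-slice the count is scaled by an
# inhibition weight and folded into a decaying inhibitory potential.

class InhibitionUnit:
    def __init__(self, inhib_weight, decay):
        self.counter = 0         # incremented by every spike of the layer
        self.accu = 0.0          # inhibitory accumulator
        self.potential = 0.0     # inhibitory potential, decayed each slice
        self.w = inhib_weight
        self.r = decay

    def on_spike(self):
        self.counter += 1

    def end_of_slice(self):
        """Multiply the spike count by the inhibition weight, accumulate it,
        reset the counter, and update the decaying inhibitory potential."""
        self.accu = self.counter * self.w
        self.counter = 0
        self.potential = self.r * self.potential + self.accu
        return self.potential    # e.g. added to the static threshold
```

One such object per PE replaces an inhibitory DP in every neuron of the layer, which is where the memory, bandwidth and computation savings come from.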

6. System Simulation including the NeuroPipe-Chip-RTL-Model

In the design flow of an ASIC (Application-Specific Integrated Circuit), the register-transfer level (RTL) of a chip represents an intermediate stage between an algorithmic description and the actual circuit layout. At RT-level all registers and intermediate logic are specified but not yet mapped to a gate equivalent of a technology. In the process of designing the NeuroPipe-Chip, we realized an RTL-model of the chip in the hardware description language VHDL. Furthermore, we implemented the MASPINN system in behavioral (algorithmic) VHDL code and used it as a testbench for the NeuroPipe RTL-model. In order to validate the functioning and to evaluate the performance of the NeuroPipe-Chip, we took a simple SNN for image processing [Rei93] as a benchmark. This benchmark was chosen because it represents a data structure typical of SNN for image processing while being simple to program and yielding reasonable simulation times. As shown in Fig. 4b, the network consists of an output layer and a global inhibition neuron. The neuron model of the benchmark network is shown in Fig. 4a. It has three types of inputs: a feeding, a linking and an inhibitory input. The refractory period of the neuron is modelled by an additional leaky integrator referred to as the dynamic threshold. The input to the network is given by a binary image where each pixel is associated with the feeding dendrite of an output-layer neuron. Besides these feeding connections there are linking connections organized in receptive fields. They connect neurons of the output layer laterally with their 9x9 nearest neighbors. A global inhibition neuron receives input from all output-layer neurons and is connected to the inhibitory dendrite of all output-layer neurons.

Figure 4: a) Neuron model. b) Network topology [Rei93]. [Panel a shows a neuron with feeding, linking and inhibitory dendrites, a dynamic and a static threshold; panel b shows the input layer with feeding connections to the output layer, lateral linking connections, and the inhibitory connections of the global inhibition neuron.]
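The lateral linking connectivity of the benchmark can be sketched as follows; the grid layout, address scheme and function name are assumptions made for illustration:

```python
# Sketch of the benchmark's linking connectivity: each output-layer neuron on
# a width x height pixel grid links laterally to its 9x9 nearest neighbours.

def linking_targets(x, y, width, height, field=9):
    """Addresses of the 9x9 receptive field around (x, y), clipped at the
    image borders; the neuron itself is excluded."""
    half = field // 2
    targets = []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            tx, ty = x + dx, y + dy
            if 0 <= tx < width and 0 <= ty < height and (tx, ty) != (x, y):
                targets.append(ty * width + tx)  # row-major neuron address
    return targets

# An interior neuron has 9*9 - 1 = 80 linking targets; border neurons fewer.
```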

Figure 5: Results of a MASPINN system simulation using a simple benchmark network (1000 neurons) [Rei93]. [The figure shows the input stimulus and the spike activity of the network in time bins 35-39 and 45-47.]

Fig. 5 shows the result of a system simulation run for a network of 1000 neurons and two input stimulus objects: a 'plus' and a 'square'. The SNN performs an image segmentation: neurons associated with the 'plus' spike during a certain time interval (bin 35 - bin 38), while neurons associated with the 'square' spike with a phase shift in another time interval (bin 46 - bin 47). For a typical simulation run, the network activity was 0.4% and 12% of the DPs were relevant (excluding inhibitory DPs).

Number of Neurons | Number of Dendrite Potentials | Alpha 500 MHz | NeuroPipe 100 MHz (no On-Chip Inh. Unit) | NeuroPipe 100 MHz (On-Chip Inh. Unit)
10^3              | 4*10^3                        | 5 ms          | 10.9 µs                                  | 6.4 µs*
10^6              | 4*10^6                        | 650 ms*       | 10.9 ms                                  | 6.4 ms

Conditions: 4 DPs per neuron; net activity = 0.4%; relevant DPs = 12% (excl. inhib. DPs). Times marked with * are simulated; all others are estimated.

Tab. 1: Performance of the NeuroPipe-Chip and an Alpha workstation for the SNN benchmark [Rei93].

The RTL-model of the NeuroPipe-Chip was designed for an estimated 100 MHz operation in a 0.35 µm CMOS technology. The computational performance of the NeuroPipe-Chip at 100 MHz is shown in Tab. 1 in comparison to a high-performance workstation. Employing the On-Chip Inhibition Unit, the NeuroPipe-Chip yielded a performance improvement by a factor of 1.7 over the NeuroPipe-Chip without it. Compared to a 500 MHz Alpha workstation, a simulation time reduction of two orders of magnitude is achieved for this benchmark. Note, however, that due to the algorithmic description, delays introduced by the surrounding MASPINN system are not yet considered. Several MASPINN boards in parallel, or a NeuroPipe-Chip with more parallel datapaths, could lead to real-time simulations of SNN on the order of 10^6 neurons.

7. Conclusion

We presented a neuro-processor chip as part of an accelerator board which approaches the real-time simulation requirements for SNN on the order of 10^6 neurons. Two new concepts at chip level were introduced which lead to an improved performance of the NeuroPipe-Chip. In the process of designing the NeuroPipe-Chip in a CMOS technology, a VHDL RTL-model of the chip was implemented. With an algorithmic description of the surrounding accelerator system in behavioral VHDL, a system simulation was performed. For a simple SNN benchmark network for image segmentation, the simulation suggested about two orders of magnitude faster simulation than a 500 MHz Alpha workstation. Currently, the implementation of the NeuroPipe-Chip in a 0.35 µm digital CMOS technology is being investigated.

8. References

[Eck88] R. Eckhorn, R. Bauer, W. Jordan, M. Brosch, W. Kruse, M. Munk, H.J. Reitböck, "Coherent oscillations: A mechanism of feature linking in the visual cortex?", Biol. Cyb. 60:121-130, 1988.
[Fra95] G. Frank, G. Hartmann, "An Artificial Neural Network Accelerator for Pulse-Coded Model Neurons", Proc. of the International Conference on Neural Networks ICNN95, Perth, Australia, Vol. 4, pp. 2014-2018, 1995.
[Jah96] A. Jahnke, U. Roth, H. Klar, "A SIMD/Dataflow Architecture for a Neurocomputer for Spike-Processing Neural Networks (NESPINN)", MicroNeuro'96, pp. 232-237, 1996.
[Jah97] A. Jahnke, T. Schoenauer, U. Roth, K. Mohraz, H. Klar, "Simulation of Spiking Neural Networks on Different Hardware Platforms", Proc. of ICANN'97, Springer, Berlin, pp. 1187-1192, 1997.
[Jah98] A. Jahnke, U. Roth, T. Schoenauer, "Digital Simulation of Spiking Neural Networks", In: Pulsed Neural Networks, W. Maas and C.M. Bishop (Eds.), MIT Press, 1998.
[Lin98] T. Lindblad, J.M. Kinser, Image Processing using Pulse-Coupled Neural Networks, Springer, 1998.
[Mal81] C.v.d. Malsburg, "The correlation theory of brain function", In: Models of Neural Networks II, Domany et al. (Eds.), Springer, pp. 95-119, 1994.
[Rei93] H.J. Reitboeck, M. Stoecker, C. Hahn, "Object Separation in Dynamic Neural Networks", Proc. of the International Conference on Neural Networks ICNN93, pp. 638-641, 1993.
[Shä99] M. Schäfer, G. Hartmann, "A Flexible Hardware Architecture for Online Hebbian Learning in the sender-oriented Neurocomputer Spike 128K", Proc. of the 7th International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems, Granada, Spain, pp. 316-323, 1999.
[Sho97] U. Schott, R. Eckhorn (Philipps-Universität Marburg), internal communication, 1997.
[Shö98] T. Schoenauer, N. Mehrtash, A. Jahnke, H. Klar, "MASPINN: Novel Concepts for a Neuro-Accelerator for Spiking Neural Networks", VIDYNN'98, 1998.
[Wei97] L. Weitzel, K. Kopecz, C. Spengler, R. Eckhorn, H.J. Reitboeck, "Contour segmentation with recurrent neural networks of pulse-coding neurons", In: Sommer et al. (Eds.), Computer Analysis of Images and Patterns, CAIP Kiel, Springer, 1997.
[Wol99] C. Wolff, G. Hartmann, U. Rückert, "ParSPIKE - A Parallel DSP-Accelerator for Dynamic Simulation of Large Spiking Neural Networks", Proc. of the 7th International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems, Granada, Spain, pp. 324-331, 1999.