Simulation of Spiking Neural Networks on Different Hardware Platforms

A. Jahnke, T. Schönauer, U. Roth, K. Mohraz and H. Klar
Institute of Microelectronics, Technical University Berlin, Germany
E-mail: [email protected], [email protected]

Abstract: Substantial evidence indicates that the time structure of neuronal spike trains is relevant in neuronal signal processing. Bio-inspired spiking neural networks take these results into account. Applying these networks to low-level vision problems, e.g. segmentation, requires that the simulation of large-scale networks be performed in a reasonable time. On this basis, we investigated the achievable performance of existing hardware platforms for the simulation of spiking neural networks with sizes from 8k neurons up to 512k neurons/50M synapses. We present results for workstations (Sparc-Ultra), digital signal processors (TMS-C8x), neurocomputers (CNAPS, SYNAPSE), small- and large-scale parallel computers (4xPentium, CM-2, SP2) and discuss the specific implementation issues. According to our investigation, only supercomputers like the CM-2 can match the performance requirements for the simulation of very large-scale spiking neural networks. Therefore, there is still a need for low-cost hardware accelerators.

1 Introduction

Substantial evidence indicates that the time structure of neuronal spike trains is relevant in neuronal signal processing [1]. Furthermore, experimental results [2][3] together with theoretical studies [4][5] suggest that temporal correlation of activity might be used by the brain as a code to bind features to one object and to segregate one object from others. This mechanism could also be useful for machine vision, where robust scene segmentation is still a difficult and intricate problem in a real-world environment. In order to tackle these low-level vision problems with large-scale spiking neural networks, the simulation of the networks must be performed in a reasonable time. Hence, we studied the performance of various hardware platforms for the simulation of spiking neural networks: workstations, digital signal processors, neurocomputers, and small- and large-scale parallel computers. This study discusses different mapping and programming methods and shows their impact on the performance of different hardware architectures. In the next section we survey spiking neurons and their implementation issues. Section 3 then presents the results of our performance study for various hardware platforms.

2 Spiking Neural Networks

2.1 Spiking Neurons

Let us first survey the basic properties of spiking neurons [2][6][7]:
• Neurons communicate with each other only via delayed spikes.
• Incoming spikes are weighted and induce a postsynaptic potential according to an impulse response function.
• Impulse response functions are constructed of one or several leaky integrators, which are described in a discrete version by IP(n+1) = r * IP(n), where IP is an internal potential and r is a decay factor (see the sketch after this list).
• Internal potentials (IPs) are combined in output functions containing addition, subtraction and/or multiplication and at least a threshold function.
• Networks may consist of some 10^5 neurons and can be divided into layers of neurons with equal properties. Each neuron is connected with up to 10^4 other neurons.
• The connectivity structure of a network can be divided into two groups: 1.) regular connections (RC), which follow some simple deterministic rules (e.g. receptive fields), and 2.) nonregular connections (NRC), i.e. sparse, random connections.
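As a minimal sketch of these dynamics, the discrete update of a single neuron with one leaky integrator could look as follows. The single-integrator impulse response, the reset after firing and all parameter values are illustrative assumptions, not the exact model used later in the paper.

```python
# Minimal sketch of a spiking neuron with one leaky integrator.
# Decay factor, threshold and the reset rule are illustrative assumptions.

class SpikingNeuron:
    def __init__(self, decay=0.9, threshold=1.0):
        self.ip = 0.0            # internal potential IP
        self.decay = decay       # decay factor r
        self.threshold = threshold

    def receive(self, weight):
        # Input: an incoming spike is weighted and added to the internal potential.
        self.ip += weight

    def step(self):
        # Output: the threshold function decides whether a spike is emitted.
        spike = self.ip >= self.threshold
        if spike:
            self.ip = 0.0        # simple reset after firing (assumption)
        # Decay: IP(n+1) = r * IP(n)
        self.ip *= self.decay
        return spike


# Usage example: two weighted input spikes, then one basic time step.
n = SpikingNeuron()
n.receive(0.6)
n.receive(0.5)
print(n.step())   # True: the combined potential 1.1 exceeds the threshold
```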

2.2 Implementation Issues

When implementing such networks on digital computers, time t proceeds in discrete basic time units T (usually T = 1 ms). The simulation of one basic time step for the whole network is called a time slice (TS). Computing a time slice in less than one millisecond will be called real-time simulation. The processing in each time slice involves three operations:
Input: spikes have to be distributed and the spike-receiving neurons have to increment their corresponding IPs.
Output: each neuron has to combine its internal potentials and to decide whether to emit a spike or not.
Decay: the internal potentials have to be decayed for each neuron.
It should be noted that, due to the required storage of IPs, the computation of these networks is i/o-bounded (No. of I/Os > No. of operations). We can apply several methods in order to speed up the simulation (a sketch combining them follows this list):
• The network activity (average number of spikes per TS / number of neurons) is usually quite low. A very efficient communication scheme for such networks is the event-list protocol: only the addresses of spiking neurons are registered in a spike event-list, which is then used to distribute the spikes during the input operation [8].
• Typically, spiking neural networks are not fully connected. One method of representing this sparse connectivity is the use of lists, one for each neuron ni. The items in the lists are datasets consisting of weights and addresses. The addresses aj in a list may denote the neurons nj to which ni sends a spike or from which ni receives a spike. The former represents sender-oriented, the latter receiver-oriented connectivity [9]. For low network activity the sender-oriented method is advantageous: only spike-receiving neurons have to compute the input function [10].
• For regular connectivity we can compute the connections of a neuron in dependence on its spatial position. On-line computation of the connections drastically reduces the required amount of storage for connection lists. Assuming that the weight vectors of all neurons are similar, we only need to store one weight vector for the whole network.
• Instead of using floating-point arithmetic, fast fixed-point arithmetic can be used. According to the results of our resolution analysis [11], accuracies from 8b (weights) up to 18b (internal potentials) are sufficient for low-level vision problems.
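The following sketch shows how a spike event-list, sender-oriented connection lists and the decay operation fit together in one (sequential) time slice. All data structures, sizes and parameter values are illustrative assumptions rather than the implementation used in the study; floating-point is used here where a hardware implementation would use fixed-point arithmetic.

```python
# Sketch of one time slice with a spike event-list and sender-oriented connectivity.
# All names, sizes and parameter values are illustrative assumptions.

import numpy as np

N = 8 * 1024                    # number of neurons (an 8kN example)
r = 0.9                         # decay factor of the leaky integrator
theta = 1.0                     # firing threshold
ip = np.zeros(N)                # internal potentials

# Sender-oriented connectivity: for each neuron, the addresses of the neurons it
# sends spikes to and the corresponding weights (random sparse connections here).
rng = np.random.default_rng(0)
targets = [rng.integers(0, N, size=16) for _ in range(N)]
weights = [rng.uniform(0.1, 0.3, size=16) for _ in range(N)]

def time_slice(ip, event_list):
    # Input: only the neurons registered in the spike event-list distribute spikes,
    # and only the spike-receiving neurons update their internal potentials.
    for src in event_list:
        np.add.at(ip, targets[src], weights[src])
    # Output: the threshold function decides which neurons spike in this time slice.
    new_events = np.flatnonzero(ip >= theta)
    ip[new_events] = 0.0        # reset after firing (an assumption of this sketch)
    # Decay: IP(n+1) = r * IP(n) for every neuron.
    ip *= r
    return new_events

# Usage: seed some initial spikes and simulate a few 1 ms time slices.
events = rng.integers(0, N, size=100)
for _ in range(5):
    events = time_slice(ip, events)
```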

Concerning the simulation on a parallel computer, we will distinguish between three methods of network mapping, which can be combined with each other (see the sketch below):
• n(euron)-parallel: all synapses of a neuron are mapped to the same PE. Different neurons are processed in parallel, but their synapses are processed serially.
• s(ynapse)-parallel: the synapses of a neuron are distributed to different PEs. Now the synapses of one neuron are processed in parallel and the neurons serially.
• p(attern)-parallel: the PEs compute the response of a net segment to different patterns in parallel, while the different net segments are processed serially.
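The difference between the three mappings can be made concrete with a small sketch that assigns work items to PEs; the simple modulo-based assignment is an illustrative assumption, not a prescription for any of the machines discussed below.

```python
# Sketch of the three mapping schemes for a machine with n_pe processing elements.
# The modulo-based assignment is an illustrative assumption.

def n_parallel(neuron, synapse, n_pe):
    # n-parallel: all synapses of a neuron live on the same PE;
    # different neurons run in parallel, their synapses serially.
    return neuron % n_pe

def s_parallel(neuron, synapse, n_pe):
    # s-parallel: the synapses of one neuron are spread over different PEs;
    # synapses run in parallel, neurons serially.
    return synapse % n_pe

def p_parallel(pattern, n_pe):
    # p-parallel: each PE handles one input pattern for the current net segment;
    # the net segments themselves are processed serially.
    return pattern % n_pe

# Usage: synapse 3 of neuron 10 on a 4-PE machine.
print(n_parallel(10, 3, 4), s_parallel(10, 3, 4))   # 2 3
```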

3 Performance Study

As the model network for the study we chose the neural network presented by Reitboeck, Stoecker et al. [12]. The network performs a basic segmentation task. It consists of a two-dimensional layer of neurons, each of which receives an input spike from the corresponding pixel of the input image. Each neuron is connected to its 90 nearest neighbors and recurrently to an inhibitory neuron. For our comparative studies we used different network sizes varying from 8kN up to 512kN. Further distinctions will be made between RC and NRC networks. This is of particular interest regarding hardware with limited on-chip memory capacity (the connectivity has to be stored locally).
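For such a regularly connected layer, the connection addresses can be generated on-line from the spatial position of a neuron, as discussed in Section 2.2. The following sketch generates a circular neighborhood around a pixel position; the exact receptive-field shape, radius and border handling of the model network [12] are assumptions made for illustration.

```python
# Sketch: on-line computation of regular (receptive-field) connections from the
# spatial position of a neuron. Neighborhood shape, radius and border handling
# are illustrative assumptions.

def regular_connections(x, y, width, height, radius=5):
    """Yield the linear addresses of the neighbors of neuron (x, y)."""
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue                       # no self-connection
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                if dx * dx + dy * dy <= radius * radius:
                    yield ny * width + nx      # linear neuron address

# Usage: connections of the neuron at pixel (10, 10) in a 128x128 layer.
neighbors = list(regular_connections(10, 10, 128, 128))
```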

3.1 Single PE Workstations

Our implementation of the model network on single PE workstations is spike-event-driven and uses sender-oriented connectivity and FP arithmetic. We have shown that the processing of the decay function accounts for the major workload on single PE workstations [10]. The results for the SUN Ultra-1 (166 MHz) indicate that recent single PE computers do not offer enough performance for real-time simulation of spike-processing neural networks.

3.2 Neurocomputers

The CNAPS/256 (Adaptive Solutions) is a SIMD parallel computer consisting of 256 16b PEs (50 MHz) with 4KB local memory each, communicating via two 16b buses [13]. The implementable network size is limited to nmax_rc ~ 32kN for RC networks and to nmax_nrc ~ 500N for NRC networks. For the latter, the whole connectivity has to be stored in local memory. Overcoming these limits with extensive use of external memory leads to an unacceptable decrease of performance and a speedup « 1, since 256 PEs would need to communicate with the external memory via two 16b buses only. Furthermore, the bottleneck of inter-PE communication requires that the connectivity of all neurons mapped to one PE must be either stored or computed directly on this PE. Therefore a sender-oriented connectivity is not useful, since the whole network topology would need to be stored on each PE, leading to an early exhaustion of memory resources. Only for RC networks can the connections be computed locally instead of being stored.

In order to avoid purely receiver-oriented connections and to achieve an equal distribution of spike-receiving neurons over all PEs, we used the following network mapping scheme (a schematic version is sketched at the end of this subsection). The network has been divided into blocks of 256 adjacent neurons. Each neuron of one block is mapped to a different PE, where the number of the PE denotes the x/y position of the neuron in the block. Neurons on a PE are distinguished by their block number. Thus, each neuron is unambiguously defined by its block and PE number. The identification code of a spike-emitting neuron still has to be broadcast (-> receiver-oriented), but each PE then uses on-line sender-oriented connectivity by simply computing the block numbers of the spike-receiving neurons relevant for the particular PE. In order to evaluate the speedup through parallelization of this algorithm, we measured computation times while varying network size and/or nPE (number of used PEs). Our results indicate that the speedup is approximately linear in nPE as long as inter-PE communication is not a dominating factor. This is the case for a network activity < 1% and for a number of neurons per PE > 4. Staying within these limits, a speedup of up to 15 over single PE computers has been measured (see fig. 1). The performance could be further increased by using microcoding. Thereby, real-time requirements could be fulfilled for RC networks with up to 16kN. However, the use of the CNAPS/256 for real-time applications is limited by the low network complexity, as indicated by the values of nmax_rc and nmax_nrc.

The SYNAPSE (Siemens AG) is a systolic array processor optimized for matrix-matrix operations. SYNAPSE achieves a very high performance for computing highly connected, conventional neural networks using s- and p-parallel mapping [14]. The simulation of conventional, static neural networks in general constitutes a compute-bounded problem. However, since spiking neurons are based on a dynamic model with the need to store internal potentials, their simulation is rather an i/o-bounded problem than a compute-bounded one. With barely any on-chip memory and no sufficient bandwidth to communicate with external memory, systolic array processors are inadequate for the simulation of spiking neural networks.
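Returning to the block/PE mapping used on the CNAPS: a schematic version of the identification code and of the on-line computation of spike targets might look as follows. The 16x16 block geometry, the square receptive field and all names are assumptions made for illustration only.

```python
# Sketch of the CNAPS mapping scheme: the neuron layer is divided into blocks of
# 256 adjacent neurons; within a block, the PE number encodes the x/y position of
# the neuron, and neurons on one PE are distinguished by their block number.
# The 16x16 block geometry and all names are illustrative assumptions.

BLOCK_W = BLOCK_H = 16          # 256 neurons per block

def neuron_id(x, y, width):
    """Map the image position (x, y) to the (block number, PE number) code."""
    blocks_per_row = width // BLOCK_W
    block = (y // BLOCK_H) * blocks_per_row + (x // BLOCK_W)
    pe = (y % BLOCK_H) * BLOCK_W + (x % BLOCK_W)
    return block, pe

def receiving_blocks(block, pe, my_pe, width, height, radius=5):
    """After the code (block, pe) of a spiking neuron has been broadcast, each PE
    computes on-line the block numbers of its own spike-receiving neurons."""
    blocks_per_row = width // BLOCK_W
    # Reconstruct the sender's image position from its identification code.
    x = (block % blocks_per_row) * BLOCK_W + pe % BLOCK_W
    y = (block // blocks_per_row) * BLOCK_H + pe // BLOCK_W
    hits = []
    for dy in range(-radius, radius + 1):        # square receptive field (assumption)
        for dx in range(-radius, radius + 1):
            nx, ny = x + dx, y + dy
            if (dx or dy) and 0 <= nx < width and 0 <= ny < height:
                b, p = neuron_id(nx, ny, width)
                if p == my_pe:
                    hits.append(b)
    return hits

# Usage: PE 85 determines which of its neurons (identified by block number)
# receive the spike of the neuron at (40, 40) in a 128x128 layer.
blk, pe = neuron_id(40, 40, 128)
print(receiving_blocks(blk, pe, my_pe=85, width=128, height=128))
```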

3.3 Parallel Computers

The TMS320C80 (Texas Instruments) is a digital signal processor comprising four MIMD 32b PEs with ~12kB local memory each and one 32b PE for vector processing. All PEs communicate via a crossbar structure. All network parameters of RC networks up to 20kN could be stored in local memory. Using spike event-lists, on-line sender-oriented connectivity and n-parallel mapping, a performance similar to that of the CNAPS could be achieved (see fig. 1). On the one hand, the CNAPS can profit from a higher number of PEs. On the other hand, the MIMD architecture of the TMS320C80 allows the parallelization of input & decay. Compared to the CNAPS, the crossbar structure also enables better inter-PE communication, the need for which is further reduced by the larger on-chip memory per PE of the TMS320C80. Simulating larger RC networks or NRC networks > 1kN requires extensive use of external memory. Even the high memory bandwidth (400 MB/s) of the TMS320C80 is not sufficient, and the simulation times approach those of single PE computers.

The 4xP90 is a MIMD parallel computer with four Pentium P90 processors and a shared memory architecture. Concerning the simulation of spiking neural networks, the problem arises that more than one PE needs access to the shared memory in a given time period.

The SP2 is a MIMD parallel computer with up to 256 RS/6000 PEs, a local memory architecture and high speed 16-to-16 switches for inter-PE communication. N-parallel mapping and sender-oriented connectivity have been used for the implementation [15]. On the one hand, a single PE exhibits significantly less performance than a P90, which may be a result of unsuitable compiler settings and requires further investigation. On the other hand, we noticed a speedup of only 1.12/1.23/1.25 using 2/4/8 PEs over a single PE implementation. Hence, the performance is mainly limited by inter-PE communication (computation and communication could not be done in parallel on the SP2), which results in the observed poor PE scaling behavior.

Fig. 1. Computation times (in ms) per time slice versus network size (8kN to 512kN) for Ultra-1, TMS-C8x, 4xP90, CNAPS/256, SP-2 and CM-2 (logarithmic scale, 0.001 to 1000 ms).

The Connection Machine CM-2 is a SIMD parallel computer with 16k 1b PEs and a hypercube architecture. It has been shown by E. Niebur et al. that networks with up to 4M simple spike-processing neurons can be simulated efficiently on the CM-2 [16]. While n-parallel mapping with sender-oriented connectivity is most efficient for decay/output, only a few neurons are active at a given time step, so during input most of the PEs are idle. Niebur et al. presented a more efficient algorithm using s-parallel mapping and receiver-oriented connectivity. The s-parallel mapping is not as efficient for decay/output, but for the CM-2 with its powerful communication this does not matter (however, it would matter on bus-oriented parallel computers like the CNAPS!). On the basis of the results of Niebur et al. we estimated the simulation times in fig. 1. Even for large networks of spike-processing neurons the real-time requirements are met.

4 Conclusion

As figure 1 shows, only supercomputers like the CM-2 exhibit enough performance for real-time simulation of large-scale spiking neural networks. The simulation times indicate that the main reason for the poor performance of the other parallel computers is their limited i/o bandwidth, which matters because of the i/o-bounded character of computing spiking networks. This motivated the development of dedicated hardware for the simulation of spike-processing neural networks, which has been presented recently [17].

5 Acknowledgment

This work has been supported in part by the BMBF under Grant No. 01 M 3013 A 8 and by the Deutsche Forschungsgemeinschaft (DFG) under Grant No. Kl 918/1-2.

References

1. A. Aertsen (ed.), "Brain Theory: Spatio-Temporal Aspects of Brain Function", Elsevier, 1993.
2. R. Eckhorn, H. J. Reitboeck, M. Arndt, P. Dicke, "Feature linking via stimulus-evoked oscillations: Experimental results from cat visual cortex and functional implication from a network model", Proc. ICNN I: 723-730, 1989.
3. C. M. Gray, W. Singer, "Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex", Proc. Natl. Acad. Sci. USA 86: 1698-1702, 1989.
4. C. von der Malsburg, W. Schneider, "A neural cocktail-party processor", Biol. Cybern. 54: 29-40, 1986.
5. F. Crick, C. Koch, "Towards a neurobiological theory of consciousness", Seminars in the Neurosciences 2: 263-275, 1990.
6. W. Gerstner, R. Ritz, J. L. van Hemmen, "A biologically motivated and analytically soluble model of collective oscillations in the cortex", Biol. Cybern. 68: 363-374, 1993.
7. W. Maass, "Lower Bounds for the Computational Power of Networks of Spiking Neurons", Neural Computation 8(1): 1-40, 1996.
8. J. Lazzaro, J. Wawrzynek, "Silicon Auditory Processors as Computer Peripherals", NIPS 5: 820-827, 1993.
9. G. Frank, G. Hartmann, "An Artificial Neural Network Accelerator for Pulse-Coded Model Neurons", Proc. ICNN'95, Perth, Australia, 1995.
10. A. Jahnke, U. Roth, H. Klar, "Towards Efficient Hardware for Spike-Processing Neural Networks", Proc. World Congress on Neural Networks, 460-463, 1995.
11. U. Roth, A. Jahnke, H. Klar, "Hardware Requirements for Spike-Processing Neural Networks", Proc. IWANN'95, 720-727, 1995.
12. H. J. Reitböck, M. Stöcker, C. Hahn, "Object Separation in Dynamic Neural Networks", Proc. ICNN II: 638-641, 1993.
13. D. Hammerstrom, "A VLSI Architecture for High-Performance, Low-Cost, On-Chip Learning", Proc. IJCNN, 537-543, 1990.
14. U. Ramacher, J. Beichter, N. Brüls, "Architecture of a General Purpose Neural Signal Processor", Proc. IJCNN I: 443-446, 1991.
15. K. Mohraz, "Parallel Simulation of Pulse-Coded Neural Networks", accepted, IMACS 1997.
16. E. Niebur, D. Brettle, "Efficient Simulation of Biological Neural Networks on Massively Parallel Supercomputers with Hypercube Architecture", NIPS 6: 904-910, 1993.
17. A. Jahnke, U. Roth, H. Klar, "A SIMD/Dataflow Architecture for a Neurocomputer for Spike-Processing Neural Networks (NESPINN)", Proc. MicroNeuro'96, 232-237, 1996.
