GPU-Based Simulation of Spiking Neural Networks with Real-Time Performance & High Accuracy Dmitri Yudanov, Muhammad Shaaban, Roy Melton, Leon Reznik
Department of Computer Engineering Rochester Institute of Technology United States WCCI 2010, IJCNN, July 23
Agenda
Motivation
Neural network models
Simulation systems of neural networks
Parker-Sochacki numerical integration method
CUDA GPU architecture
Implementation: software architecture, computation phases
Verification
Results
Conclusion and future work
Q&A
Motivation
Other works: accuracy and verification problem
J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul. 2009.
A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," Application-Specific Systems, Architectures and Processors, IEEE International Conference on, vol. 0, pp. 137-144, 2009.
J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," 2009, pp. 754-759.
To provide scalable accuracy
To perform direct verification
Based on:
R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-33, Aug. 2009.
Neuron Models: IF, HH, IZ
[Figure: example spiking responses of the IF, HH, and IZ models]
IF: simple, but has poor spiking response
HH: rich response, but complex
IZ: simple, rich response, but phenomenological
System Modeling: Synchronous Systems
Aligned events: good for parallel computing
Time quantization error introduced by dt
Smaller dt is more precise, but computationally hungry
May result in missed events; STDP-unfriendly
Order of computation per second of simulated time (see the sketch below):
N – network size, F – average firing rate of a neuron, p – average target neurons per spike source
R. Brette, et al.
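The cost expression itself did not survive extraction; a reconstruction (ours) in the spirit of Brette et al., with c_U and c_P as our labels for per-update and per-event costs:

\[ C_{\mathrm{sync}} \approx c_U \frac{N}{dt} + c_P\, F N p \]

Every neuron is updated each step (N/dt updates per simulated second), while propagation work scales with the event rate F N p.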
System Modeling: Asynchronous Systems
Small computation order
Events are unique in time: no quantization error, more accurate, STDP-friendly
Events are processed sequentially
More computation per unit-time
Spike predictor-corrector: excessive re-computation
Assumes an analytical solution is available
Order of computation per second of simulated time (see the sketch below):
N – network size, F – average firing rate of a neuron, p – average target neurons per spike source
R. Brette, et al.
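Again the expression is missing; a reconstruction (ours) following Brette et al., with c_Q as our label for per-event queue overhead:

\[ C_{\mathrm{async}} \approx (c_U + c_P + c_Q)\, F N p \]

Work scales only with the event rate; there is no N/dt term, hence the small computation order.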
System Modeling: Hybrid Systems
Refreshes every dt: more structured than event-driven, good for parallel computing
Events are unique in time: no quantization error, more accurate, STDP-friendly
Does not require an analytical solution
Events are processed sequentially
Largest possible dt is limited by the minimum delay and the fastest possible transient
Order of computation per second of simulated time (see the sketch below):
N – network size, F – average firing rate of a neuron, p – average target neurons per spike source
R. Brette, et al.
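A reconstruction (ours) of the hybrid cost, combining the two previous forms:

\[ C_{\mathrm{hybrid}} \approx c_U \frac{N}{dt} + c_P\, F N p \]

The synchronous update term returns, but events carry exact timestamps within each step, so the quantization error of the purely synchronous scheme is avoided.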
Choice of Numerical Integration Method
Motivation: need to solve an initial value problem (IVP)
Euler: compute next y based on tangent to current y
Modified Euler: predict with Euler, correct with average slope
Runge-Kutta 4th Order: evaluate the slope at four points and take a weighted average
Bulirsch–Stoer: modified midpoint method with evaluation and error tolerance check using extrapolation with rational functions. Adaptive order. Generally more suited for smooth functions.
Parker-Sochacki: express the IVP solution as a power series. Adaptive order
Parker-Sochacki Method
A typical IVP:
Assume that the solution function can be represented by a power series; then, by the properties of Maclaurin series, its derivative is:
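The equations on this slide were lost in extraction; a standard reconstruction (ours), following Parker and Sochacki:

\[ y' = f(y), \qquad y(0) = y_0 \]
\[ y(t) = \sum_{k=0}^{\infty} y_k t^k \;\Rightarrow\; y'(t) = \sum_{k=0}^{\infty} (k+1)\, y_{k+1}\, t^k \]

Matching coefficients of t^k on both sides yields the recurrence y_{k+1} = [f(y)]_k / (k+1), where [f(y)]_k is the k-th Maclaurin coefficient of f(y).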
Parker-Sochacki Method
If f is linear:
Shift it to eliminate the constant term; as a result, the equation becomes:
With finite order N:
Evaluation is amenable to parallel reduction
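A reconstruction (ours) of the linear case: with f(y) = a y + b, the shift z = y + b/a removes the constant term, so z' = a z and

\[ z_{k+1} = \frac{a\, z_k}{k+1} \;\Rightarrow\; z_k = \frac{a^k}{k!}\, z_0 \]

The coefficients are mutually independent, so the truncated series \( z(t) \approx \sum_{k=0}^{N} z_k t^k \) can be evaluated with a parallel reduction.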
Parker-Sochacki Method
If f is quadratic:
Shift it to eliminate the constant term; as a result, the equation becomes:
The quadratic term can be converted with series multiplication (Cauchy product):
Parker-Sochacki Method
and the equation becomes:
With finite order N:
Loop-carried circular dependence on the series coefficients
Only partial parallelism possible
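A reconstruction (ours) of the quadratic case, which covers the Izhikevich voltage equation: when f(y) contains a y^2 term, series multiplication gives the Cauchy product

\[ (y^2)_k = \sum_{j=0}^{k} y_j\, y_{k-j}, \qquad y_{k+1} = \frac{[f(y)]_k}{k+1} \]

Each new coefficient depends on all previous ones, which is the loop-carried dependence noted above; only the inner Cauchy sum can be computed in parallel.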
Parker-Sochacki Method
The local Lipschitz constant determines the number of iterations needed to reach a given error tolerance
Power series representation enables adaptive order and error tolerance control
Limitation: the Cauchy product reduces parallelism (see the sketch below)
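A minimal sketch (ours, not the authors' kernel) of one Parker-Sochacki step for an Izhikevich neuron, v' = 0.04 v^2 + 5 v + 140 - u + I and u' = a (b v - u), using the recurrences above; ps_step, MAX_ORDER, and tol are hypothetical names, and spike detection/reset as well as the coefficient scaling used by Stewart and Bair are omitted:

#include <math.h>

#define MAX_ORDER 32

/* Advance (v, u) by one step dt with adaptive-order Parker-Sochacki.
 * Returns the power-series order actually used. */
__host__ __device__ int ps_step(double *v, double *u, double I,
                                double a, double b, double dt, double tol)
{
    double vc[MAX_ORDER + 1], uc[MAX_ORDER + 1];
    vc[0] = *v;
    uc[0] = *u;

    int k;
    for (k = 0; k < MAX_ORDER; k++) {
        /* Cauchy product (v^2)_k = sum_j v_j * v_{k-j}: the loop-carried
         * dependence that limits parallelism. */
        double v2 = 0.0;
        for (int j = 0; j <= k; j++)
            v2 += vc[j] * vc[k - j];

        /* Constant terms (140 and I) contribute only at order k = 0. */
        double cst = (k == 0) ? (140.0 + I) : 0.0;
        vc[k + 1] = (0.04 * v2 + 5.0 * vc[k] + cst - uc[k]) / (k + 1);
        uc[k + 1] = a * (b * vc[k] - uc[k]) / (k + 1);

        /* Adaptive order: stop once the next term's contribution at
         * t = dt falls below the error tolerance. */
        if (fabs(vc[k + 1]) * pow(dt, (double)(k + 1)) < tol)
            break;
    }

    /* Evaluate the truncated series at t = dt with Horner's rule. */
    int order = (k < MAX_ORDER) ? k + 1 : MAX_ORDER;
    double vn = vc[order], un = uc[order];
    for (int j = order - 1; j >= 0; j--) {
        vn = vc[j] + dt * vn;
        un = uc[j] + dt * un;
    }
    *v = vn;
    *u = un;
    return order;
}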
CUDA: SW
Kernel: separate code, task division
Thread Block (1D, 2D, 3D)
Grid (1D, 2D)
Divide computation based on IDs (see the sketch below)
Granularity: bit level (after warp broadcast access)
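A minimal CUDA sketch (ours; the kernel and names are hypothetical) of dividing computation based on IDs: each thread derives a global index from its block and thread IDs and processes one array element:

/* Scale n elements in parallel, one element per thread. */
__global__ void scale(float *data, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
    if (i < n)          /* guard: the grid may overshoot n */
        data[i] *= k;
}

/* Launch with enough 256-thread blocks to cover n elements:
 * scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n); */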
CUDA: HW
Scheduling: parallel and sequential
Scalability requires blocks to be independent
Warp = 32 threads
Warp divergence
Warp-level synchronization
Active blocks and threads:
Active threads / SM: maximum 1024
Goal: full occupancy = 1024 active threads per SM
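A worked example (ours, not from the slides): with the 1024-thread/SM limit and 16 KB of shared memory per SM, full occupancy with 256-thread blocks means four resident blocks per SM, so each block may use at most 4 KB of shared memory; shared-memory footprint therefore directly caps occupancy.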
Software Architecture
Update Phase
Stewart and Bair
Adaptive order p according to required error tolerance
Can be processed in parallel for each neuron (see the kernel sketch below)
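A sketch (ours, not the authors' code) of how the update phase could map to CUDA, reusing the hypothetical ps_step routine from the Parker-Sochacki slides; a and b are per-neuron Izhikevich parameters:

/* One thread integrates one neuron over the current step dt. */
__global__ void update_phase(double *v, double *u, const double *I,
                             const double *a, const double *b,
                             double dt, double tol, int num_neurons)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= num_neurons) return;

    /* Each neuron adapts its own PS order to the error tolerance, so
     * threads within a warp may diverge in iteration count. */
    ps_step(&v[n], &u[n], I[n], a[n], b[n], dt, tol);
}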
Propagation Phase
Translate spikes to synaptic events: global communication is required
Encoded spikes are written to the global memory: bit mask + time values
A propagation block reads and filters all spikes, decodes them, fetches synaptic data, and distributes events into time slots (see the sketch below)
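A hypothetical sketch (ours; names and data layout are assumptions, not the paper's exact format) of the bit-mask spike encoding: one bit per neuron in a packed global mask, plus the exact in-step spike time:

/* spike_time[n] holds the in-step spike time, or a negative value
 * if neuron n did not spike during this step. */
__global__ void encode_spikes(const float *spike_time,
                              unsigned int *spike_mask,
                              float *spike_times,
                              int num_neurons)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= num_neurons) return;

    if (spike_time[n] >= 0.0f) {
        /* Set this neuron's bit in the mask (32 neurons per word). */
        atomicOr(&spike_mask[n >> 5], 1u << (n & 31));
        spike_times[n] = spike_time[n];   /* exact spike time */
    }
}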
Sorting Phase
Satish et al.
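The implementation relies on the Satish et al. radix sort (available through CUDPP, listed in the bibliography); as a stand-in illustration only, not the authors' code, Thrust's sort_by_key performs the same key-value sort of synaptic events by delivery time:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

/* Sort synaptic events by delivery time so each time slot can be
 * consumed in order during the following update phases. */
void sort_events(thrust::device_vector<unsigned int> &delivery_time,
                 thrust::device_vector<int> &target_neuron)
{
    thrust::sort_by_key(delivery_time.begin(), delivery_time.end(),
                        target_neuron.begin());
}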
Results: Verification
Input Conditions
Random parameter allocation
Random connectivity
Zero PS error tolerance
GPU Device: GTX 260 – 24 streaming multiprocessors (SMs), 16 KB shared memory per SM, 938 MB global memory, 1.3 GHz clock rate
CPU Device: AMD Opteron 285 – dual core, 1 MB L2 cache per core, 4 GB RAM, 2.6 GHz clock rate
Output
Membrane potential traces
Passed the test for equality between GPU and CPU
Results: Simulation Time vs. Network Size
[Plot: simulation time (sec.) vs. network size (×1000 neurons, 2–8) for GPU and CPU at 2%, 4%, 8%, and 16% connectivity]
Conditions: 80% excitatory / 20% inhibitory synapses, zero tolerance, 10 sec of simulation, initially excited by 0 – 200 pA current.
Results: GPU simulation is 8–9 times faster; real-time performance for 2–4%-connected networks of 2048–4096 neurons.
Major limiting factors: shared memory size, number of SMs
Results: Simulation Time vs. Event Throughput
[Plot: simulation time (sec.) vs. mean event throughput (×1000 events/(sec. × neuron), 0–10) for GPU and CPU at 2%, 4%, 8%, and 16% connectivity]
Conditions: increasing excitatory / inhibitory ratio from 0.8/0.2 to 0.98/0.02, network of 4096 neurons, zero tolerance, 10 sec of simulation, initially excited by 0 – 200 pA current.
Results: GPU simulation is 6–9 times faster, handling up to 10,000 events per second per neuron; real-time performance for 0–2%-connected networks of 2048–4096 neurons.
Major limiting factors: shared memory size, number of SMs
Results: Comparison with Other Works

Metric                  | This Work                | Other works | Reason
Increase in speed       | 6–9×, RT                 | 10–35×, RT  | GPU device, complexity of computation, numerical integration methods, simulation type, time scale
Network size            | 2K–8K                    | 16K–200K    | (as above)
Connectivity per neuron | 100–1.3K                 | 100–1K      | (as above)
Accuracy                | Full single-precision FP | Undefined   | Numerical integration method
Verification            | Direct                   | Indirect    | Numerical integration method
Conclusion
Implemented a highly accurate PS-based hybrid simulation of spiking neural networks with IZ neurons on the GPU
Directly verified the implementation
Future Work
Add an accurate STDP implementation
Characterize accuracy in relation to signal processing, network size, network speed, and learning
Provide an example application
Port to OpenCL
Further optimization
Q&A
Essential Bibliography
R. Brette, et al., "Simulation of networks of spiking neurons: A review of tools and strategies," Journal of Computational Neuroscience, vol. 23, no. 3, pp. 349-398, 2007.
R. Stewart and W. Bair, "Spiking neural network simulation: numerical integration with the Parker-Sochacki method," Journal of Computational Neuroscience, vol. 27, no. 1, pp. 115-133, Aug. 2009.
G. E. Parker and J. S. Sochacki, "Implementing the Picard iteration," Neural, Parallel Sci. Comput., vol. 4, pp. 97-112, 1996.
E. M. Izhikevich, "Simple model of spiking neurons," Neural Networks, IEEE Transactions on, vol. 14, pp. 1569-1572, 2003.
N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore GPUs," 2009, pp. 1-10.
(2010, Apr.) CUDA Data Parallel Primitives Library. [Accessed online 04/30/2010]. http://code.google.com/p/cudpp/
(2008) NVIDIA CUDA Programming Guide 2.3. [Accessed online 04/30/2010]. http://developer.nvidia.com
Other Works
J. Nageswaran, N. Dutt, J. Krichmar, A. Nicolau, and A. Veidenbaum, "A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors," Neural Networks, Jul. 2009.
A. K. Fidjeland, E. B. Roesch, M. P. Shanahan, and W. Luk, "NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs," Application-Specific Systems, Architectures and Processors, IEEE International Conference on, vol. 0, pp. 137-144, 2009.
J.-P. Tiesel and A. S. Maida, "Using parallel GPU architecture for simulation of planar I/F networks," 2009, pp. 754-759.