Wafer-scale VLSI implementations of pulse coupled neural networks
Matthias Ehrlich1, Christian Mayr1, Holger Eisenreich1, Stephan Henker1, André Srowig2, Andreas Grübl2, Johannes Schemmel2, René Schüffny1
1 Endowed Chair for Neural Circuits and Parallel VLSI-Systems, Technische Universität Dresden, ETIT/IEE, D-01062 Dresden, Germany. e-mail: (ehrlich, mayr, eisenrei, henker, schueffn)@iee.et.tu-dresden.de
2 Kirchhoff Institute for Physics (KIP), University of Heidelberg, Im Neuenheimer Feld 227, D-69120 Heidelberg, Germany. e-mail: (srowig, gruebel, schemmel)@kip.uni-heidelberg.de
Abstract— In this paper, we present a system architecture currently under development that will allow very large (>10^6 neurons, >10^9 synapses) reconfigurable networks to be built, in the form of interlinked dies on a single wafer. Reconfigurable routing and complex adaptation/plasticity across several timescales in neurons and synapses allow for the implementation of large-scale biologically realistic neuronal networks and behaviors. Compared to biology, the network dynamics will be about 10^4 to 10^5 times faster, so that developmental plasticity can be studied in detail.
Keywords— neural VLSI hardware, system simulation, mapping algorithms, parallelization.
I. INTRODUCTION

VLSI implementations of pulse coupled neural networks, aimed at exploring various computational aspects of biological neural networks, have so far mainly followed two avenues. On the one hand, networks have been created with simple dynamics and fixed, repetitive network structures with very little flexibility, but relatively high neural element count [1]. The processing functions of these networks have been determined a priori, emulating cut-outs from biological structures and operations, either to analyze and understand these functions or to use them in technical applications. The limited flexibility of these designs relegates them to large-scale proofs of concept of a certain functionality, while further exploration of this functionality necessitates an IC redesign. The other avenue of exploration consists of ICs employing variable topologies and complex neuron/synapse dynamics with attendant large configuration memories. These have been relegated to small networks (10^0 to 10^3 neurons) [2]. For networks significantly beyond this scale (>10^5 neurons, >10^7 synapses), some compromise has to be achieved between the reconfigurability of these networks and the VLSI hardware constraints. The work presented in [3] is a step toward this kind of system, achieving a synthesis of both approaches, with
large element count and hardware-constrained, yet flexible configuration. Current research aims to enhance the neural dynamics of said system and to make it capable of realizing very large networks by linking individual ICs directly on their wafer. The complete system will be used as a research tool for exploring various computational paradigms as postulated from neurobiological evidence. We will first present an overview of the target system design of the hardware side of the research project Fast Analog Computing with Emergent Transient States (FACETS) [4]. Next, we will show current results for all of the subtasks associated with the hardware design, such as prototypes for the single ICs, the state of technology development, and various algorithms used in implementing the system.

II. FACETS HARDWARE SYSTEM

System design is generally constrained by the available resources, especially chip area. Even a small hardware implementation of 10^3 neurons with 10^3 synapses each would require a crossbar with 10^3 x 10^6 switches and a configuration memory of 10 Mbit to allow full flexibility (a short sketch reproducing this estimate follows the list below). Already this small example is not feasible, and the proposed hardware implements orders of magnitude more neural elements. Hardware constraints in general encompass the following:
1) There is not enough configuration memory for individual neuro-elements, so this memory has to be shared, with synapses/neurons having similar parameters grouped around this shared memory.
2) There is not enough configuration memory and IC space for electronic axons and dendrites, so full network connectivity cannot be achieved.
3) The dynamics of neurons and synapses have to be achieved using less IC space, which inherently speeds up the operation of these elements (e.g. smaller integration capacitors), so any analysis circuits and the supply/biasing backbone of such an IC have to be correspondingly faster.
4) The increased speed of these elements also leads to increased communication bandwidth (both intra-chip and off-chip, for analysis and for linking of ICs).
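As a purely illustrative back-of-the-envelope check of the crossbar example above (not part of any FACETS tool), the following Python sketch reproduces the numbers; the 10 bit source address per synapse is our assumption for how the 10 Mbit figure arises.

```python
import math

# Illustrative sketch only: reproduces the feasibility estimate in the text.
N_NEURONS = 10**3                      # neurons in the small example network
SYN_PER_NEURON = 10**3                 # synapses per neuron
total_synapses = N_NEURONS * SYN_PER_NEURON          # 10^6 synapses

# Full flexibility: any synapse connectable to any neuron output,
# i.e. a 10^3 x 10^6 crossbar.
switches = N_NEURONS * total_synapses                # 10^9 switches

# Configuration memory if each synapse stores its source neuron address
# (our assumption for how the 10 Mbit figure arises):
addr_bits = math.ceil(math.log2(N_NEURONS))          # 10 bit per synapse
config_mbit = total_synapses * addr_bits / 1e6       # = 10 Mbit

print(f"{switches:.0e} switches, {config_mbit:.0f} Mbit configuration memory")
```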
The FACETS hardware platform will implement approximately 10^6 neurons on several interconnected wafers. The wafers are not diced but are used as a whole, the individual reticles on each wafer being connected by wafer-scale interconnects. These reticles, so-called Analog Network Cores (ANCs), form the basic element of the FACETS architecture, containing the neurons, the synapses and the main part of the communication network. Each neuron in this system connects on average to 1k other neurons via plastic synaptic links. Additional communication, configuration and general infrastructure is provided by a PCB backplane situated above the wafer. This backplane contains dedicated ASICs designed for interfacing with the ANCs, so-called Digital Network Chips (DNCs). Figure 1 provides an overview of the concept.
Table 1. Layer distinction by communication type

Layer 0 (continuous-time, analog direct connection): Dendritic compartments and their respective synapse groups are linked via configurable analog current connections to form compartmental neurons.
Layer 1 (continuous-time, multiplexed): Main backbone of pulse communication, i.e. multiplexed asynchronous digital pulse communication. Load-independent due to statically routed connections, with a slight chance of pulse loss for colliding events. Connections to adjacent ANCs within a certain range of die 'hops' can be routed over this layer.
Layer 2 (discrete, packet-based): The number of connections is limited by the load capacity of the channels; here the activity of the routed connections determines the number of routable channels. Also used for external communication with, and analysis of, the network.
III. CURRENT STATE OF DEVELOPMENT

In the following sections we present sample results from hardware and system level research. The research is partitioned into chronologically consecutive deliverables, which can be found in the project proposal of the integrated project FACETS (Nr. 15879).

A. Analog floating gate memory
Fig. 1. The FACETS hardware platform, shown schematically
The logical architecture of the FACETS system communication is a resource-constrained, three-layered structure. The smallest configuration units of the system are synapse-neuron groups, directly linked together with so-called layer 0 connections. Around the analog core elements, the communication layer 1 of multiplexed continuous-time connections is implemented as a ring bus structure. Layer 1 connections can be directed over die boundaries to adjacent ANCs and to the interface of communication layer 2. Layer 2 will be used for long-range connections to distant ANCs using a time-discretized, events-within-a-packet based protocol. The three layers can be distinguished by their communication paradigms and different constraints, as shown in table 1. The number of routings via a specific layer is constrained by a limited number of connections per layer and by the load dependency of those connections.
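As a minimal illustration of this layer hierarchy, the following Python sketch assigns a single connection to a layer; the hop threshold is our assumption, since the paper leaves the layer 1 range open.

```python
# Minimal sketch, not FACETS code: assigning one connection to a layer.
MAX_L1_HOPS = 2   # assumed layer 1 range in die 'hops'; not fixed in the paper

def select_layer(same_group: bool, anc_hops: int) -> int:
    """Return the communication layer for one synaptic connection."""
    if same_group:
        return 0          # analog direct link inside a synapse-neuron group
    if anc_hops <= MAX_L1_HOPS:
        return 1          # multiplexed continuous-time layer 1 bus
    return 2              # packet-based layer 2 link to a distant ANC

print(select_layer(False, 1), select_layer(False, 5))   # -> 1 2
```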
The FACETS neural network hardware will require a large set of parameters for the configuration of the various analog subcircuits as well as for the definition of a biologically relevant topology. Suitable parameter representations comprise not only analog voltages and currents, but also variable timing delays. An analog floating gate memory in a standard 0.18 µm single-poly CMOS process is under development at the University of Heidelberg, which allows analog values to be stored directly and multiple parameters to be provided simultaneously, without the need for a digital representation and A/D conversion. In consequence, the area requirements for the parameter storage are drastically reduced with respect to standard implementations. The prototype of the analog floating gate memory consists of an array of active storage cells, a set of high-voltage drivers which handle the signals applied to the memory array during read and write, an address decoder, and a controller coordinating the individual units. A major design challenge is that all memory cells need to be operated simultaneously to provide their parameters via individual outputs while the address circuitry is idle. This is fundamentally different from a digital flash memory, where a cell is only active when addressed. At the same time, the address logic needs to be operated at high voltage levels in order to induce tunneling currents during programming. Therefore, high-voltage switches were designed using triple well technology, allowing the isolation of n-MOS transistors from the substrate and switching of up to 12 V. Triple well transistors use much more area than standard devices, so a special programming scheme has been designed that requires triple well devices only in the driver circuitry outside the memory array. By this means, the area of one memory cell could be reduced to 6 µm x 10 µm. The use of two control gates per memory cell allows one control gate of all cells in a row to be connected to a common row driver with triple well switches, while all other control gates are connected to common line drivers. A special programming sequence allows all cells to be programmed individually to an arbitrary voltage within the programming range, despite their high interconnectivity.

For test purposes, a complete set of 2048 floating gate memory cells of one prototype circuit was programmed to voltages between 0.10 V and 1.13 V. Figure 2 shows 64 memory words with 32 entries each; the individual words are shifted by 0.02 V for better visibility. Figure 3 presents the deviation of each memory cell from its specified programming value. The measurement accuracy is limited by the test setup. The confirmed accuracy is in the order of 8 bit, which fulfills the requirements for the FACETS hardware.

Fig. 2. Measured output values of a floating gate cell array

Fig. 3. Absolute error of the read-out values

One major goal of FACETS is to achieve wafer-scale integration of a neural network circuit. This requires processing steps additional to the standard wafer production in order to interconnect the single units, i.e. reticles, on the wafer into a common circuit. These reticles have a targeted size of 20 mm x 20 mm and will be equipped with pad arrays along the reticle edge that allow the connection of multiple units with metal lines. Laser lithography and subsequent galvanic metalization will be used to create interconnection patterns with a line pitch of 4 µm. Approximately 20,000 direct connections per reticle can be achieved with this technology. The basic idea of this post-processing step is to bypass the space limitation at the die borders by placing the contacts as shown in fig. 4, a close-up of an interconnection area.

Fig. 4. Layout close-up of a wafer-scale interconnection area

B. D20 first test ASIC MiniLink

1) Architecture: The layout shown in fig. 5 represents a first DNC test ASIC, serving as a concept study of the layer 2 connections via Low Voltage Differential Signaling (LVDS) links, of floating gate memory structures, and of the JTAG access and configuration logic for the DNC prototype.
Fig. 5. D20 MiniLink layout view without metal fill
A rough block outline of the ASIC can be seen in fig. 6, showing the six high-speed LVDS links at the bottom side, separated from the PLL pads. Two full-custom PLLs, residing above their pads, can be found: on the left-hand side a 2 GHz version, on the right side a 4 GHz version. On the ASIC's left border, high-voltage analog pads lead to the adjacent Floating Gate (FG) array. The bias current bank and its control circuitry below the FG array generate the currents for the analog parts. Full-custom serializer/deserializer structures for the JTAG interface are located between the PLLs and the current bank. Pads on the right side form the I/O for the digital parts such as JTAG and the RAM blocks, which are covered by metal routing, while the analog pads frame the top side of the MiniLink.
Fig. 6. Design blocks of the MiniLink ASIC
Access to the chip, for configuring the full- and semi-custom parts and for controlling the test functionality, is possible via a JTAG interface [5]. This functionality covers writing and reading of the receive and transmit memories, initialization of transactions, and observation of the connections by reading out status information. Serializer and deserializer have been designed as full-custom macros to allow clock frequencies in the gigahertz range, as the available standard cells are limited to 500 MHz. The serializer structure is realized with two parallel high-speed shift registers using a parallel input and a double-data-rate output stage. The deserializer consists of high-speed shift registers for serial-parallel conversion and a selection tree for dynamic stream synchronization. Both macros are designed to operate with a 3 GHz clock and use C²MOS dynamic latches with embedded logic, which show reduced power consumption in comparison to standard CMOS circuits. For comparison and characterization, six LVDS channels have been integrated, representing the communication links for the inter-wafer communication network and the on-wafer inter-DNC network. They are controlled, tested and monitored via the JTAG interface. The point-to-point LVDS connections, each channel containing 4 lines, are implemented with two different connection speeds, 2 Gbit/s and 3.2 Gbit/s, following the ANSI/TIA/EIA-644-A standard. Additionally, the short circuit current of the LVDS pads is adjustable to extend the range of the link [6], [7], [8]. The DNC prototype provides error detection and error handling mechanisms utilizing CRC error detection. The applied 8 bit polynomial allows the safe detection of up to 3 bit errors per packet. With optional DC balance coding and protocol functionality for error handling, simulation shows that under best case conditions the time from transmitter input to receiver output for 64 bit of pulse data is less than 100 ns.
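The CRC mechanism can be illustrated with a generic bitwise CRC-8 in Python. The paper does not name the polynomial; 0x2F is chosen here only because it is known to detect all 3-bit errors at this packet length, so it is purely our assumption.

```python
# Minimal sketch of CRC-8 error detection over a 64 bit pulse-data packet.
POLY = 0x2F  # x^8 + x^5 + x^3 + x^2 + x + 1 (assumed, not from the paper)

def crc8(data: bytes, poly: int = POLY) -> int:
    """Bitwise CRC-8 over the packet payload (MSB first)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

packet = (0x0123456789ABCDEF).to_bytes(8, "big")   # 64 bit pulse data
print(hex(crc8(packet)))                           # checksum appended on transmit
```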
Two fully integrated (i.e. without external loop filter) integer-N PFD/charge-pump PLLs provide the LVDS clock signals [9]. Key data for the two on-chip PLLs, gained through simulations, are shown in table 2.

Table 2. On-chip PLL simulation results

Parameter                PLL1            PLL2
reference frequency      500 MHz         400 MHz
VCO frequency            2 GHz           3.2 GHz
output frequency         2 GHz, 1 GHz    1.6 GHz, 0.8 GHz
avg. power dissipation   6.7 mW          11.9 mW
area                     110 x 120 µm²   125 x 125 µm²
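As a quick consistency check of table 2 (purely illustrative), the divider values below are our inference from the tabulated frequencies, not published design data.

```python
# Sanity check of the table 2 frequency plan (divider values inferred).
plls = {
    "PLL1": {"f_ref": 500e6, "f_vco": 2.0e9, "f_out": (2.0e9, 1.0e9)},
    "PLL2": {"f_ref": 400e6, "f_vco": 3.2e9, "f_out": (1.6e9, 0.8e9)},
}
for name, p in plls.items():
    n_div = p["f_vco"] / p["f_ref"]                        # integer-N feedback divider
    out_divs = [round(p["f_vco"] / f) for f in p["f_out"]]  # output clock dividers
    print(f"{name}: N = {n_div:.0f}, output dividers = {out_divs}")
# PLL1: N = 4, output dividers = [1, 2]
# PLL2: N = 8, output dividers = [2, 4]
```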
The ASIC also contains an additional test circuit, an advanced version of the above-mentioned full-custom floating gate memory array, for parameter characterization purposes. The technology is the UMC 0.18 µm CMOS process with triple well option. The design was automatically routed using a commercial router utilizing analog macro cells. The circuit is currently in fabrication and expected to be shipped in February 2007.

C. D21 Software Framework
Fig. 7. D21 software architecture
In the following we introduce the D21 software framework for the FACETS system, as shown in fig. 7.
1) Testbenches, benchmarks: Because it is obviously too ambitious to build a generic, unspecialized hardware architecture, a solution has to be found that fits given realistic networks (coming as input from biological research). Every benchmark consists of a stimulus, a topology of the neural network to be simulated, including model parameters, and information on how the simulation performance should be evaluated. Stimuli are normally given as lists of spikes, generated either via statistical processes or via a model of the human retina from input images or video sequences [10]. Network topologies can be described either statistically or as a netlist. Output evaluation includes spike statistics, synaptic weight analysis for learning tasks, and receptive field reconstruction. Benchmark tasks are mainly provided by FACETS partners. They have to be converted and integrated into the system simulation environment. Where software for stimulus generation, network definition and output analysis is not directly available from FACETS partners or is difficult to integrate, custom programs have to be developed to perform the required tasks. Benchmarks from FACETS partners include the liquid computing networks from the TU Graz group [11] and the deterministic models by the INRIA group [12]. In addition, indigenous benchmarks have been developed.
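As an illustration of the spike-list stimulus format described above, a statistically generated stimulus could be produced as follows. This is a minimal sketch; the rate, duration and the (time, input) tuple format are example assumptions of ours, not the FACETS file format.

```python
import numpy as np

rng = np.random.default_rng(42)

def poisson_spike_list(n_inputs: int, rate_hz: float, t_max_s: float):
    """Return (spike time, input index) pairs sorted by spike time."""
    spikes = []
    for idx in range(n_inputs):
        t = 0.0
        while True:
            t += rng.exponential(1.0 / rate_hz)   # i.i.d. inter-spike intervals
            if t > t_max_s:
                break
            spikes.append((t, idx))
    return sorted(spikes)

stimulus = poisson_spike_list(n_inputs=16, rate_hz=10.0, t_max_s=1.0)
```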
2) Mapping tool: The major challenge arises with the mapping of any given neural network onto the later FACETS hardware, i.e. placing neurons and synapses in a configurable neuronal STDP array and routing the structure of the network. Furthermore, we have to (1) optimize the synaptic connections to minimize the routing effort and (2) come as close to a given delay as possible. This is due to the fact that different connection routings influence the faithfulness of the biology reproduction, as well as the overall number of routable connections. To meet those requirements, a mapping software is under development which will map the network to the hardware and reduce the routing costs. The routing cost optimization is done by heuristic manipulation of the ordered index vector of the connection matrix, which represents the routed neural network with each synaptic link as a non-zero element. The mapping and optimization process is done in three steps with increasingly local granularity. Please note that these steps are not necessarily correlated with the three different routing layers.
a) Step 1 - Hyper Global: In the first step the mapping focuses on the wafer level of the FACETS system. The mapping follows a 'first come, first served' ordered location distribution scheme according to the provided benchmark netlists, whereas the subsequent optimization concentrates as many synapses as possible on single wafers and thus reduces the inter-wafer communication. This results in a diagonalization of the above-mentioned connection matrix, in the form of squares with a border length of neurons/wafer on the main diagonal. Synaptic connections outside those squares have to be realized via packet-based communication links. The number of vector variations for the mapping is the number of permutations, without recurrences, of possible index vector configurations for a given number of neurons in a neural network onto the number of physically available neurons. This number of variations grows exponentially with linearly increasing network size and thus expands the search space for the optimization algorithm dramatically. The complexity of the problem makes it obviously impossible to probe all ordering variations of the connection matrix, so different heuristic reordering algorithms were tested, which yield up to 10 % improvement in relative connection density for global routing on a network generated with a uniform distribution of synaptic connections at 5 to 15 % connection density. On networks with a more regular structure, however, the same effort results in significantly higher optimization, as can be seen in table 3 for a semi-structured V1 network [11]. This benchmark is based on a generic wafer-scale architecture, with 2 wafers, 4 ANCs/wafer and 125 neurons/ANC, realizing a biological network with 1000 neurons and 135651 synapses. A toy sketch of this step 1 reordering follows after table 3.
Table 3. Mapping performance for a semi-structured V1 network in [%] of total synapses

                 Layer 1    wafer-scale    Layer 2
Initial Net      12.5       37.5           50.0
After Mapping    21.2       48.2           30.6
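To make the index-vector manipulation concrete, the following toy Python sketch permutes neuron positions so that more synapses fall into the block-diagonal 'wafer squares' of the connection matrix. The network size, density and the plain random-swap hill climber are our assumptions for illustration, not the project's actual heuristics.

```python
import numpy as np

# Toy sketch of the step 1 idea, not the project's optimizer.
rng = np.random.default_rng(0)
N, PER_WAFER = 200, 50                       # toy sizes, i.e. 4 'wafers'
C = rng.random((N, N)) < 0.10                # ~10 % connection density

def on_wafer_fraction(pos: np.ndarray) -> float:
    """Fraction of synapses whose neurons are placed on the same wafer."""
    w = pos // PER_WAFER                     # wafer index of every neuron
    same = w[:, None] == w[None, :]          # do pre and post share a wafer?
    return (C & same).sum() / C.sum()

pos = np.arange(N)                           # 'first come, first served' placement
best = on_wafer_fraction(pos)
for _ in range(2000):                        # greedy random-swap optimization
    i, j = rng.integers(N, size=2)
    pos[i], pos[j] = pos[j], pos[i]
    frac = on_wafer_fraction(pos)
    if frac > best:
        best = frac
    else:
        pos[i], pos[j] = pos[j], pos[i]      # revert a swap that did not help
print(f"on-wafer synapses: {best:.1%}")      # rises from ~25 % for random placement
```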
b) Step 2 - Global: In the second step the matrix is split up into sub-blocks representing the single wafers, in order to assign neurons to individual ANCs on each wafer. The communication paths between dies can no longer be treated as equal, as was the case for the inter-wafer paths. Adjacent ANCs are interconnected via the post-processed wafer connects. Packet-based links between ANCs will be utilized to reach ANCs outside the range of the layer 1 network, which allows connecting dies within a hop count range that is not yet defined. Considering the system of layer 1 buses connecting the ANCs on the wafer, it is evident that a direct connection between two adjacent ANCs uses fewer bus resources than an extended connection crossing at least one ANC, which in turn costs less than a layer 2 routing. Furthermore, it has to be taken into account that the ordering of the ANCs with respect to the packet-based network also becomes an issue. This leads to more complex cost functions for evaluating the quality of a routing (a minimal sketch of such a cost function follows after step 3). By examining the correlation between different cost functions, representing different networks, and the optimization reached for the routing, we are able to explore the design parameter search space to find a suitable parameter set for the ANC network.
c) Step 3 - Local: The third step, the local optimization, concentrates on the configuration of the ANCs. The step 2 connection matrix is therefore again split up into sub-blocks representing single ANCs. The available routing resources consist of layer 1 and layer 0 connections. Prior to optimization, a first run selects connections with minimum costs until the available resources are exhausted. The subsequent iterated runs use different algorithms to reduce the local costs and, after convergence, return to the global mapping either with a successful mapping or with connections that could not be mapped. The global optimization then relocates those connections for the next local optimization run. A successful local mapping also generates the hardware configuration data, which can be loaded onto the final hardware or into the system simulation to create the hardware representation of the input network.
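A minimal sketch of such a cost function is given below; the ordering (direct layer 1 connect cheaper than extended connect, cheaper than layer 2 routing) follows the text, while the concrete weights and the layer 1 hop limit are placeholder assumptions of ours.

```python
# Minimal sketch of a routing cost function; weights are not FACETS parameters.
L1_BASE, L1_PER_HOP, L2_COST = 1.0, 1.0, 8.0
MAX_L1_HOPS = 2                     # assumed layer 1 range (left open in the paper)

def connection_cost(anc_hops: int) -> float:
    """Relative bus-resource cost of one inter-ANC connection."""
    if anc_hops <= MAX_L1_HOPS:
        # a direct connect (1 hop) is cheaper than an extended connect
        return L1_BASE + L1_PER_HOP * (anc_hops - 1)
    return L2_COST                  # route via the packet-based layer 2

routing_cost = sum(connection_cost(h) for h in (1, 1, 2, 5))   # example route set
print(routing_cost)                                            # -> 12.0
```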
3) Parallelization: A rough estimate of the amount of data to be handled points to the problems to be solved in future work. Taking as an example a simple matrix that contains only the information on existing synaptic connections, with a single bit representing one connection, a complete matrix will occupy approx. 1.2 GByte. As an alternative, a list representation storing only the non-zero elements can be chosen; with the above-given 10^5 neurons x 10^3 synapses/neuron and, accordingly, 17 bit per x-y matrix index, the list needs approx. 0.4 GByte (the sketch below reproduces this estimate). Although the latter reduces the memory requirements by 2/3, the data handling becomes more difficult. Utilizing an a priori classification of command types and their costs in terms of clock cycles, the effort of accessing one element of the matrix can be compared with that of its list representation. This comparison shows that a list access takes approx. 20 times as long as a matrix access. Taking the above into account, we decided on the list representation because of the memory limitations, accepting the longer processing time. In order to complete this optimization task in a reasonable time, a way has to be found to parallelize the search algorithm. First tests using a hybrid parallelization model, with Pthreads on the node level and the Message Passing Interface (MPI) for inter-node communication and a master process controlling several helper processes, have been conducted.
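The memory estimate can be reproduced with a few lines of Python (a sketch only; it restates the arithmetic above):

```python
import math

# Reproduces the matrix-vs-list memory estimate from the text.
neurons = 10**5
synapses = neurons * 10**3                    # 10^8 synaptic connections

matrix_bytes = neurons * neurons / 8          # one bit per possible connection
index_bits = math.ceil(math.log2(neurons))    # 17 bit per x/y matrix index
list_bytes = synapses * 2 * index_bits / 8    # two indices per non-zero element

print(f"matrix: {matrix_bytes / 2**30:.1f} GByte, "
      f"list: {list_bytes / 2**30:.1f} GByte")
# -> matrix: 1.2 GByte, list: 0.4 GByte
```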
Fig. 8. Structure of the simulation library
4) Executable system specification: The executable specification will help to explore the system architecture at an early stage of the FACETS design process. The library that will be used to conduct the simulation is shown in fig. 8. It assembles design block models of different abstraction levels, complexity and accuracy, i.e. from different design stages. Components of the library are then used to construct a Virtual Functional Prototype (VFP) combining Transaction Level Modeling (TLM) with Transaction Based Verification (TBV) methodologies.

D. HICANN

The High Input Count Analog Neural Network (HICANN), as the upcoming FACETS development stage, will contain in principle all necessary routing and building block components that will form the final system. However, most effort is concentrated on the layer 1 routing as well as on neuron and synapse prototypes. The layer 1 routing network has three different sources/sinks: the primary wafer-scale layer 1 buses, the DNC interface, and the Analog Neuronal Network Core (ANNCORE). The proposed routing architecture for this layer is focused on wafer-scale route-through capability. In that way, every incoming primary wafer-scale layer 1 bus can easily be routed through the whole reticle to several outgoing layer 1 buses towards other ANC reticles. The overall objective is to take maximum advantage of the high-density wafer-scale connections to realize neuronal networks of a biologically worthwhile size and connectivity.
Fig. 9. Schematic of the proposed HICANN architecture
A route-through-friendly routing architecture increases the range of the layer 1 routing. This eases the mapping of dense networks with realistic pulse delays. Figure 9 shows how this will be achieved. Thin arrows represent a 7 bit wide layer 1 bus, thick arrows 32 bundled layer 1 buses. Every incoming layer 1 bus is directly connected to two other sides of the ANC. The so-called "route-through switches" situated there make it possible to configure to which primary wafer-scale layer 1 bus output it is connected. One half of all layer 1 buses is routed on the left side and the other half on the right side of the ANNCORE. The two blocks of "select switches" are used to configure which of these buses will be used as inputs for the synapses of the ANNCORE.

IV. CONCLUSION

The current development stage of a configurable wafer-scale neural ASIC system for exploratory neural network research has been presented. It will allow the combined implementation of network sizes, dynamics and topologies considerably ahead of today's state of the art. A first communication prototype has been designed and is currently under fabrication. It contains state-of-the-art components for the later DNC, such as clocking resources, prototype layer 2 connections, etc. During the next project steps we will gain precise information on the technology constraints and limits through research carried out on the hardware side. The upcoming HICANN ASIC will integrate all components necessary to build the FACETS system and can be used in conjunction with the DNC prototype to realize a PCB-level
version of the final wafer-scale system, to gain operational experience and to test software tools such as the mapping. The conducted benchmarks ensure that the hardware design fits the biological parameters gained through biological research. A mapping software has been described which closes the gap between the hardware-constrained wafer-scale system and biology-derived neural networks, matching up constraints such as (biological) axonal delays and (hardware) pulse routing delays, thus ensuring a faithful reproduction of network behavior. More effort will be expended on the optimization algorithms, especially on the application of multi-objective genetic algorithms to the step 1 Hyper Global search, which can be parallelized following the island model in [13]. The mapping will be extended to concentrate on further optimization parameters besides the communication, e.g. delays, pulse loss probability, etc.

ACKNOWLEDGMENT

The research project FACETS is financed by the European Union as Integrated Project (Nr. 15879) in the framework of the Information Society Technologies program.

REFERENCES

[1] T. Morie, M. Nagata, and A. Iwata. Design of a Pixel-Parallel Feature Extraction VLSI System for Biologically-Inspired Object Recognition Methods. Proc. International Symposium on Nonlinear Theory and its Applications NOLTA'01, 371-374, 2001.
[2] G. Indiveri, E. Chicca, and R. Douglas. A VLSI reconfigurable network of integrate and fire neurons with spike-based learning synapses. Proc. European Symposium on Artificial Neural Networks ESANN'2004, 405-410, 2004.
[3] J. Schemmel, K. Meier, and E. Mueller. A New VLSI Model of Neural Microcircuits Including Spike Time Dependent Plasticity. Proc. 2004 International Joint Conference on Neural Networks IJCNN'04, 1711-1716, 2004.
[4] K. Meier. Fast Analog Computing with Emergent Transient States in Neural Architectures. Integrated project proposal, FP6-2004-IST-FET Proactive, Part B. Kirchhoff Institut für Physik, Ruprecht-Karls-Universität Heidelberg, September 2004.
[5] IEEE Std 1149.1-2001: Test Access Port and Boundary-Scan Architecture. The Institute of Electrical and Electronics Engineers, Inc., New York, 2001.
[6] A. Boni, A. Pierazzi, and D. Vecchi. LVDS I/O Interface for Gb/s-per-Pin Operation in 0.35-µm CMOS. IEEE Journal of Solid-State Circuits 36(4):706-711, 2001.
[7] Y. Yan and T. H. Szymanski. Low power high speed I/O interfaces in 0.18 µm CMOS. Proc. IEEE International Conference on Electronics, Circuits and Systems ICECS 2003, Sharjah, United Arab Emirates, December 2003.
[8] C.-C. Wang, J.-M. Huang, and J.-F. Huang. 1.0 Gb/s LVDS transceiver design for LCD panels. Proc. Asia-Pacific Conference on Advanced System Integrated Circuits AP-ASIC2004, 236-239, August 2004.
[9] C. S. Vaucher. Architectures for RF Frequency Synthesizers. Kluwer Academic Publishers, US, 1st edition, 2002.
[10] A. Wohrer, P. Kornprobst, and T. Vieville. A Biologically-Inspired Model for a Spiking Retina. INRIA research report, http://www.inria.fr/rrrt/rr-5848.html, 2006.
[11] S. Haeusler and W. Maass. A Statistical Analysis of Information-Processing Properties of Lamina-Specific Cortical Microcircuit Models. Cerebral Cortex 17(1):149-162, 2007.
[12] T. Vieville and P. Kornprobst. Modeling Cortical Maps with Feed-Backs. Proc. International Joint Conference on Neural Networks (IJCNN), 110-117, 2006.
[13] Y. Chen, Z. Nakao, and X. Fang. A Parallel Genetic Algorithm Based on the Island Model for Image Restoration. Proc. 1996 IEEE Signal Processing Society Workshop, 109-118, 1996.