Visual Strategies for Sparse Spike Coding
Laurent Perrinet Manuel Samuelides ONERA/DTIM, 2, av. Belin, 31055 Toulouse, France
[email protected],
[email protected]
Abstract

We explore visual spike coding strategies in a neural layer in order to build a dynamical model of primary vision. A strictly feed-forward architecture is compared to a strategy accounting for lateral interactions, which yields a sparse spike coding of the image as is observed in the primary visual areas [1]. This transform is defined over a neural layer according to a greedy matching pursuit scheme, and we show that for natural images its coefficients may simply be carried by the rank of spike arrival. This model is easily extended to multiple layers and, despite its simplicity, has applications to image compression comparable to industrial standards, but also suggests novel strategies for pattern detection.
1 Introduction

Faced with the light influx from the outside world, what are the plausible coding strategies in the visual system to extract the relevant features necessary to a given goal? The particular efficiency of a strategy, e.g. for an animal to categorize preys and predators as quickly as possible, is a main constraint on the visual system, and a key challenge is to determine the "visual code". In fact, experiments on ultra-rapid categorization [2] in humans and monkeys showed that the visual system can categorize a high-level class in as little as 150 ms, urging us to move away from the analogy of the primary visual system with classical image processing strategies and towards a dynamical neural network model. In fact, signals may be conveyed by analog mechanisms locally, but over longer distances, as in the optic nerve, information has to be transformed into a spatiotemporal pattern of spikes. In this light, we will first concentrate on the problem of transmitting images (we will restrict ourselves, as in [2], to flashed gray-scale static images) through the optic nerve to the Lateral Geniculate Nucleus (LGN), and rate the model by its reconstruction quality. We will explore codes that convert the analog intensities into a 'wave' of spikes along the optic nerve. It appears immediately that, to take advantage of the speed of processing, a new goal appears for the retinal code: the analog image should be temporally transformed so that the generated spike train progressively transmits the most information within the shortest latency. We will first present the retinal network as a roughly simplified one-pass feed-forward layer
http://laurent.perrinet.free.fr
of neurons that transforms the image into a pattern of spikes. These spikes are converted back to an activation level in the LGN. Though not fully biological, this algorithm is quite general and provides a conceptual construction framework that will be extended to multiple layers. Then, guided by neurophysiological and computational evidence, we will explore sparse spike codes and evaluate their validity, performance and applications.
2 Spike coding in the Retina

As described in [3], let us define our model retina as a set of neurons, the ganglion cells (GCs), sensitive at different spatial scales to the local contrast of the image intensity detected at the photo-receptors (see Fig. 1-left). First, the dendrite of a neuron $i$ may be characterized by its weight vector $\phi_i$ over its receptive field (RF), defined by its position and scale. These weight vectors will be defined as dilated, translated and sampled Mexican Hat (or DOG) filters (see [4, pp. 77]). The neurons' RF centers are set uniformly so that the retinal neuronal layer uniformly fills the spatial/frequency space and therefore forms a wavelet-like transform [4] of the retinal image. In fact, we generally write as in [5] the activity at the soma of neuron $i$:
$$ a_i = \sum_{\vec{x} \in \mathcal{R}_i} \phi_i(\vec{x}) \, I(\vec{x}) \qquad (1) $$
where $I(\vec{x})$ is the light intensity at pixel $\vec{x}$ and $\mathcal{R}_i$ is the receptive field (i.e. support) of neuron $i$. Also, instead of differentiating ON and OFF cells, we consider for simplicity that each neuron is assigned a polarity $p_i$, either $+1$ or $-1$, so that $a_i = p_i \, |a_i|$.

Figure 1: Architecture of our retinal model, from the photo-receptors (PhRs) through the ganglion cells (GCs) and the optic fibers to the LGN. A sample neuron is highlighted in red, showing its ON-center, OFF-surround RF, its axon in the optic fiber, its synapse to LGN neurons and the corresponding emitting field to the LGN activity layer.

Each neuron integrates the image's intensity information at its soma until it reaches a threshold: it then emits a spike that propagates along the axon, and the activity is reset. Classically, this generates a train of spikes whose frequency constitutes the image's code. But, theoretically, it may equivalently be replaced by the spiking time (or latency) of the first spike. In the following we will consider solely this first spike, so that the code consists exactly of this latency for each of the different fibers; it is inversely proportional to the neuron's excitation current, i.e. to the correlation of the input, clipped to the RF, with the weight vector. The latency of each neuron may be controlled relatively to the other neurons by the norm of the weight vector, which is set to 1. Finally, the fibers make synapses with neurons from the LGN, so that the spike may trigger activity in different neurons of the LGN (its emitting field, see Fig. 1-right).
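As a minimal numerical sketch of Eq. (1) and of the unit-norm constraint on the weight vectors, here is one way to build a Mexican-hat (DOG) weight vector and compute the somatic activity by correlation; function names and parameter values are ours, for illustration only:

```python
import numpy as np

def dog_filter(size, sigma_c, sigma_s):
    """Mexican-hat (Difference-of-Gaussians) weight vector on a square RF,
    normalized to unit norm as in the text (this sets the relative latency)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    center = np.exp(-r2 / (2 * sigma_c**2)) / (2 * np.pi * sigma_c**2)
    surround = np.exp(-r2 / (2 * sigma_s**2)) / (2 * np.pi * sigma_s**2)
    w = center - surround
    return w / np.linalg.norm(w)

def activity(image, w, x0, y0):
    """Eq. (1): correlation of the image, clipped to the RF at (x0, y0),
    with the weight vector."""
    s = w.shape[0]
    patch = image[x0:x0 + s, y0:y0 + s]
    return float(np.sum(w * patch))
```

Since the filter has unit norm, its activity on a patch equal to itself is exactly 1, the maximal response among unit-energy patches.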
3 Compact spike coding using wavelet-like architecture

A compact coding strategy is achieved by considering GCs on dyadic scales (i.e. powers of 2) and by considering the whole spatio-temporal pattern of first spikes corresponding to the wavelet coefficients [4]. In fact, the reconstruction of the image from the wavelet coefficients is not assured in general, justifying the use of adjoint filters or other image processing techniques. However, if carefully chosen, the filters may verify an admissibility condition which permits us to write an estimate for the activity at the LGN after the arrival of $N$ spikes:
$$ \hat{I}_N(\vec{x}) = \sum_{k=1}^{N} \lambda \, a_{i_k} \, \phi_{i_k}(\vec{x}) \qquad (2) $$

where $(i_k)_{1 \le k \le N}$ is the ordered list of spikes and $\lambda$ is a normalizing factor depending on the filters. This relation gives a simple neural reconstruction scheme compatible with our architecture if there exists a way to infer the absolute contrast value. In fact, VanRullen and Thorpe [3] found that the normalized absolute contrast value obeys a certain regularity over natural images (see Fig. 3-Wav): the activity may therefore be reconstructed to within a normalization factor by adding up ON or OFF patches corresponding to the GCs' RFs, weighted by a modulation function $m$ of the rank of arrival of the spikes. If we set $m(k)$ to the mean absolute contrast observed at rank $k$, the reconstruction becomes

$$ \hat{I}_N(\vec{x}) = \sum_{k=1}^{N} \lambda \, p_{i_k} \, m(k) \, \phi_{i_k}(\vec{x}) \qquad (3) $$
Comparison with other candidate codes on this architecture has been extensively studied in [3], which showed that the rank order coding reconstruction scheme outperforms spike-frequency coding and other temporal codes. Moreover, simulations with neurophysiological parameters show that the image is efficiently coded under the constraints we set up in Sec. 1, so that this strategy could effectively be used in the retina. However, it should be pointed out that since the filters overlap in their selectivity, some information is redundantly transmitted, and trying to remove this cross-talk impairs the reconstruction: this is known in wavelet theory as the compromise between sparsity and quality of reconstruction, and leads at best to a compact code. When searching for a sparse code, new strategies of dynamical spike coding should therefore be explored.
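The rank-order reconstruction scheme above can be sketched as follows: each spike carries only an address and a polarity, and the decoder weights the corresponding patch by a modulation look-up table indexed by rank alone. All names are ours, and the normalizing factor is folded into the table:

```python
import numpy as np

def rank_order_reconstruct(spike_list, patches, modulation):
    """Rank-order reconstruction in the spirit of Eq. (3): add up ON/OFF
    patches weighted by a modulation function of the spike's rank only.

    spike_list: ordered list of (neuron index, polarity) pairs
    patches:    (n_neurons, n_pixels) array of unit-norm RF patches
    modulation: modulation[k] = mean absolute contrast expected at rank k
    """
    image = np.zeros(patches.shape[1])
    for rank, (i, polarity) in enumerate(spike_list):
        image += polarity * modulation[rank] * patches[i]
    return image
```

Note that the decoder never sees the analog contrast values: only the order of arrival matters, which is what makes the scheme robust to a global change of image contrast.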
4 Sparse coding and Overcomplete Representation

In fact, as in natural languages, messages get sparser as the vocabulary gets bigger. Consequently, sparse coding [1] suggests that the code could consist of a relatively small number of active spiking neurons if their spatial weighting functions form an overcomplete basis of the input space. This could be an economical way for neurons to represent information, since only a few neurons are active; but most importantly, this overcomplete representation aims at coding the essence of the signal by parsing it into elementary "words" from a large dictionary reflecting the causes of image generation. In particular, in contrast to the compact coding presented above, the dictionary will typically contain several times more elements than the dimension of the image space. This coding has been applied to classical neural networks in its original version, but has so far failed to use the temporal aspect of neuronal processing. Following [1], we assume that an approximation of an image may be calculated as a linear sum of different 'patches' from a given dictionary of spatial weighting functions $\phi_j$. In our framework, we want this approximation to be dynamic, that is, at the $N$-th spike:

$$ \hat{I}_N(\vec{x}) = \sum_{k=1}^{N} a_{i_k} \, \phi_{i_k}(\vec{x}) \qquad (4) $$

where $(i_k, a_{i_k})_{1 \le k \le N}$ is the spike list. Our goal is thus to find a dynamical neural mechanism for building this optimal list with its corresponding coefficients from Eq. 4. But minimizing the residual error $\| I - \hat{I}_N \|$ at any time leads to a combinatorial explosion in the choice of the subset of filters and of the coefficient values (it is in fact an NP-hard problem [4]).
5 Sparse Spike coding using a Matching Pursuit method

A solution is to use a Matching Pursuit (MP) [4, pp. 412–9] method, derived from a statistical estimation algorithm which was applied to wavelet analysis. The MP algorithm decomposes a signal from a Hilbert space over a dictionary of elementary signals, by iteratively choosing the best match and virtually removing this match from the input. As with the wavelet transform, it may similarly be translated to a spike coding scheme. Let us write $R^0 = I$ for the initial residual image. First, we determine the first neuron in the layer to fire as

$$ i_1 = \mathrm{ArgMax}_j \; |\langle R^0, \phi_j \rangle| \qquad (5) $$

and for this index $i_1$, we define the extremal contrast value $a_{i_1} = \langle R^0, \phi_{i_1} \rangle$. Actually, we found the best match in the sense of the projection on the dictionary, so we subtract this projection from $R^0$: the residual becomes $R^1 = R^0 - a_{i_1} \phi_{i_1}$, and the contrast of any neuron $j$ becomes $\langle R^1, \phi_j \rangle = \langle R^0, \phi_j \rangle - a_{i_1} \langle \phi_{i_1}, \phi_j \rangle$. Iterating these steps in time, our algorithm is simply, for $k \ge 1$ and given the initialization $R^0 = I$:

$$ i_k = \mathrm{ArgMax}_j \; |\langle R^{k-1}, \phi_j \rangle| , \qquad a_{i_k} = \langle R^{k-1}, \phi_{i_k} \rangle \qquad (6) $$

$$ R^k = R^{k-1} - a_{i_k} \, \phi_{i_k} \qquad (7) $$

where $p_{i_k}$ is the sign of $a_{i_k}$ (i.e. its ON or OFF polarity). This algorithm is exactly equivalent to MP for normalized filters ($\|\phi_j\| = 1$) and presents the same computational complexity and properties [4, pp. 412–419]. The image code is given by the spike list $(i_k, p_{i_k}, |a_{i_k}|)$, with, for each spike, its polarity and extremal contrast value. Finally, the algorithm is stopped at the $N$-th spike, when the absolute contrast falls below a given threshold. To reconstruct the image we derive immediately, at the $N$-th spike:

$$ \hat{I}_N = \sum_{k=1}^{N} p_{i_k} \, |a_{i_k}| \, \phi_{i_k} \qquad (8) $$

which was our goal in Eq. 4. Theoretical results proved the convergence of MP [6] and numerical simulations show that this algorithm fairly decomposes the data, even if it is suboptimal compared to Basis Pursuit. In particular, it may be optimized by setting the norms of the filters according to the input's statistics.
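The sparse spike coding scheme of this section can be sketched as follows. The dictionary layout (one unit-norm filter per row), the stopping parameters and the function names are illustrative assumptions; the contrast update uses the precomputed overlaps between filters, which plays the role of the lateral interactions:

```python
import numpy as np

def matching_pursuit(image, dictionary, threshold=1e-3, max_spikes=100):
    """Sketch of the sparse spike coding of Sec. 5 (Eqs. 5-7).

    dictionary: (n_neurons, n_pixels) array of unit-norm filters phi_j.
    Returns the spike list as (index, polarity, absolute contrast) triples,
    plus the final residual.
    """
    residual = image.astype(float).ravel().copy()
    contrasts = dictionary @ residual          # <R^0, phi_j> for every neuron
    gram = dictionary @ dictionary.T           # overlaps <phi_i, phi_j>
    spikes = []
    for _ in range(max_spikes):
        i = int(np.argmax(np.abs(contrasts)))  # Eq. (5): first neuron to fire
        a = contrasts[i]                       # extremal contrast value
        if abs(a) < threshold:                 # stopping criterion
            break
        spikes.append((i, int(np.sign(a)), abs(a)))
        residual -= a * dictionary[i]          # Eq. (7): remove the match
        contrasts -= a * gram[i]               # Eq. (6) update, via lateral overlaps
    return spikes, residual

def reconstruct(spikes, dictionary):
    """Eq. (8): sum of matched patches weighted by polarity * contrast."""
    image = np.zeros(dictionary.shape[1])
    for i, p, c in spikes:
        image += p * c * dictionary[i]
    return image
```

On an orthonormal dictionary this recovers the input exactly after as many spikes as there are non-zero coefficients; on an overcomplete dictionary the residual energy decreases monotonically, as guaranteed by [6].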
6 Results

We applied our method to different patches drawn from natural images, using the same methods as in [1], and defined a dictionary of filters similar to Sec. 3, but whose scale grows geometrically (i.e. several layers per octave) over the range of scales. We may easily compute the maximum number of bits necessary to describe the spike list as $N \times (\log_2 M + \log_2 Q)$, where $N$ is the number of spikes, $M$ the number of neurons in the dictionary (which grows with the number of pixels in the image) and $Q$ the number of quantization levels for the contrast; this gives the information needed per spike in this simulation. In fact, if we assume that objects may appear uniformly at every scale and if we set all the filters' norms to 1, larger objects will appear first; by whitening the gain of the different scales, we force a fair competition between the different scales. As described in [7], the norms of the filters were adjusted according to natural image statistics by the use of a decorrelating filter [5]. This process may also be described as a maximization of the entropy of the spikes' addresses, since there will be no a priori knowledge of the scale and position of the first spike. This also inspired the use of a Weighted Mean-Squared Error (WMSE) criterion, based on the Mahalanobis distance, that better corresponds to a subjective distance between images.

Figure 2: Reconstruction quality as a function of the file size (in bits) for Matching Pursuit (MP) and quantized Matching Pursuit (qMP). (Left) Standard MSE, (right) Weighted MSE (see text for details).

As in [3], the reconstructed image is recognizable after only a few spikes (Fig. 5-top), and the convergence is equally fast both with the Mean Square Error (MSE) and the WMSE (see Fig. 2). This coding strategy provides a sparse representation of the signal: in comparison with the dyadic decomposition [3] (see Fig. 3-Wav), the output coefficients and the MSE decrease more rapidly (see Fig. 3-LUTb), since correlated activity is progressively removed. It transmits the most active spikes first, and therefore leads to a very good information transfer rate even at very short latencies. It is of importance now to know how this reconstruction scheme could be implemented in the LGN reconstruction neural layer.
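The bit-budget computation above can be sketched as follows; rounding each term up to whole bits is our assumption, which makes this an upper bound:

```python
from math import ceil, log2

def spike_list_bits(n_spikes, n_neurons, n_levels):
    """Upper bound on the code size in bits: each spike costs an address
    (log2 of the number of neurons in the dictionary) plus a quantized
    contrast value (log2 of the number of quantization levels)."""
    return n_spikes * (ceil(log2(n_neurons)) + ceil(log2(n_levels)))
```

For instance, with a dictionary of 1024 neurons and 16 contrast levels, each spike costs 14 bits, so 100 spikes fit in 1400 bits; with rank order coding (Sec. 7) the contrast term disappears entirely.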
7 Rank Order Coding
Figure 3: Look-Up Tables for the diverse methods described in the text, as a function of the ratio of the number of spikes fired to the total number of neurons (log scale). (Wav) Wavelet coding; Matching Pursuit: (LUT) incremental learning, (LUTb, dotted) batch learning.
In fact, as in [3], we observed the behavior of the absolute contrast for different natural images, using the same setup as the overcomplete architecture described above. Similarly, we observed regularities across images from the database (see Fig. 3-LUTb), up to a normalization factor. Using the mean absolute coefficients as a look-up table, we thus built a mechanism of reconstruction at the LGN; but as opposed to [3], this algorithm is adaptive, and the error may therefore be compensated dynamically. This is evidenced by evaluating the reconstruction error (see Fig. 2), which shows that the loss due to the quantization is paid back by the gain in the information necessary to code the spikes' addresses. This behavior turned out to be stable, and we therefore used an incremental rule, which has the advantage of being more biologically plausible and of enabling on-line learning (see Fig. 3-LUT). This rule takes the form of a Hebbian rule - or rather a stochastic algorithm - updating the look-up table $m$ used as a modulation function, so that after the $k$-th spike of each coded image:

$$ m(k) \leftarrow m(k) + \eta \, \left( |a_{i_k}| - m(k) \right) \qquad (9) $$

where $\eta$ is the stochastic learning gain. This could be implemented in a biologically plausible model by using an analog coding (such as the rate of firing over longer periods). Practically, it shows similar results both for the LUT and for the reconstruction error. It remains to be determined which macroscopic parameters may influence the shape of this modulation function, and particularly its speed of decrease; but due to the adaptability of the Matching Pursuit algorithm, it seems reasonable to choose a mean upper bound of the absolute coefficient value to cover all possible images. As a consequence, we found a biologically plausible algorithm of sparse coding in a neural layer, extensively using rapid lateral interactions, as is observed in most vertebrates' neural layers. Unlike other models of lateral interactions, this method exactly accounts for the fired spikes.
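The incremental rule of Eq. (9) amounts to a stochastic running average of the absolute contrast at each rank; a minimal sketch, with names of our choosing:

```python
def update_lut(lut, abs_contrasts, eta=0.05):
    """Stochastic (Hebbian-like) update of the look-up table, Eq. (9):
    after coding one image, nudge each entry m(k) towards the absolute
    contrast |a_{i_k}| observed at rank k, with learning gain eta."""
    for k, a in enumerate(abs_contrasts):
        lut[k] += eta * (a - lut[k])
    return lut
```

Repeated over many images, each entry converges to the mean absolute contrast at its rank, i.e. the batch-learned table (Fig. 3-LUTb), while remaining adaptive to slow changes in the input statistics.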
Moreover, it should be pointed out that the substrate of the visual code is in this case the rank order of spiking in the layer. This temporal code has proved to be particularly efficient in rapid image processing tasks [8], so that it could effectively be used by neurons. Moreover, some neuronal mechanisms, such as shunting inhibition, may act as a modulation function in neural layers and support its biological implementation.
8 Applications
This system may easily be extended to another layer. In fact, since the activity at the synapses of LGN neurons may be incrementally constructed as $\hat{I}_k = \hat{I}_{k-1} + p_{i_k} \, m(k) \, \phi_{i_k}$, we can directly compute the activity $b_j$ of a neuron $j$ of the next layer incrementally as

$$ b_j^{k} = b_j^{k-1} + p_{i_k} \, m(k) \, \langle \phi_{i_k}, \psi_j \rangle \qquad (10) $$

where $\psi_j$ is the weight vector of neuron $j$. By defining an appropriate threshold, we build a spiking mechanism and a lateral inhibition similar to Matching Pursuit, so that at the firing of a neuron $l$ of this layer, the activity is updated as:

$$ b_j \leftarrow b_j - b_l \, \langle \psi_l, \psi_j \rangle \qquad (11) $$
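A minimal sketch of this incremental two-layer propagation (Eqs. 10 and 11), assuming the overlaps $\langle \phi_i, \psi_j \rangle$ and $\langle \psi_l, \psi_j \rangle$ are precomputed as matrices; all names are ours:

```python
import numpy as np

def propagate_spike(b, overlap, spike, modulation_value):
    """Eq. (10): an incoming spike (index i, polarity p) from the first
    layer increments the second layer's activities through the overlaps
    overlap[i, j] = <phi_i, psi_j>."""
    i, p = spike
    return b + p * modulation_value * overlap[i]

def fire_and_inhibit(b, gram, threshold):
    """Eq. (11): if the most driven neuron crosses threshold it fires, and
    laterally inhibits its neighbours in proportion to gram[l, j] =
    <psi_l, psi_j>. Returns (fired index or None, updated activities)."""
    l = int(np.argmax(np.abs(b)))
    if abs(b[l]) < threshold:
        return None, b
    return l, b - b[l] * gram[l]
```

This is exactly the matching pursuit step of Sec. 5, except that the input is no longer the image but the spike train arriving from the previous layer, so that the second layer can start firing before the first has finished.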
Subsequently, we applied our method to different images drawn from the Olivetti Face Database [9] and used another layer of orientation-sensitive neurons. We define the weight vectors of these neurons as dilated, translated and sampled DOG and Gabor filters (see [4, pp. 160]). The scale grows geometrically (i.e. 5 layers per octave) over the range of scales, and the directions are sampled circularly, including the non-oriented DOG. The LUTs were tabulated in the same manner, so that the information rate per spike in this layer may be computed as before.
Figure 4: Reconstruction error as a function of the file size (in bits), compared to JPEG at different qualities. (A) Standard MSE, (B) Whitened MSE.

Convergence is quicker and the code is sparser than for the simple retinal model. The face is recognizable after only a few spikes (see Fig. 5-Top Row), and the spikes match particular features (like the eyes or the mouth). In particular, lines are extracted at multiple scales. Though very simple, this technique compares to JPEG at low qualities (see Fig. 4) and suggests that it could be used in a dynamical pattern recognition system.

Figure 5: Reconstruction with a two-layered matching pursuit. (Left Column) Original image. (Columns 2 to 6) Reconstruction after the integration of resp. 10, 50, 100, 450 and 750 spikes. (Top Row) Matching Pursuit for image 1 from subject 11. (Middle Row) Attentive centered mechanism for the face image centered on a white-noise background. (Bottom Row) Autonomous attentive mechanism for the face image randomly placed on a white-noise background.

In particular, it is possible to modify the spatio-temporal spike pattern by modifying the precedence of some spikes over others [10], that is, here, by modifying the norms of the neurons in space (and not simply in scale, as for the whitening stage). In fact, by giving more weight to the central part of the image, we force the central spikes to fire first (see Fig. 5-Middle Row). This process may also be triggered by an attentive process that enhances the firing of spikes in a large area around the center of the last firings. For instance, at a given step of the matching pursuit, instead of choosing the best match over all possible neurons, the match should be chosen as
$$ i_k = \mathrm{ArgMax}_j \left\{ |\langle R^{k-1}, \phi_j \rangle| \; e^{- d(j, i_{k-1}) / \lambda} \right\} \qquad (12) $$

where $\lambda$ is a regularization constant and $d$ is a distance measure between two neurons², typically the distance between their RF centers. Results show that this simple heuristic presents interesting properties for pattern recognition (see Fig. 5-Bottom Row). This can also be extended by introducing a priori knowledge of the dictionary elements that may be selected, a method resembling fractal coding.
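A sketch of this attentive choice; the exponential form of the spatial penalty and the Euclidean distance are our assumptions, since the text only specifies a regularization constant and a distance between neurons:

```python
import numpy as np

def attentive_argmax(contrasts, positions, last_pos, lam=1.0):
    """Eq. (12) sketch: bias the matching-pursuit choice towards neurons
    whose RF centers are close to the position of the last firing, by
    damping distant contrasts with an exponential spatial penalty."""
    d = np.linalg.norm(positions - last_pos, axis=1)
    return int(np.argmax(np.abs(contrasts) * np.exp(-d / lam)))
```

With a large `lam` this reduces to the plain ArgMax of Eq. (5); with a small `lam`, successive spikes cluster around the last firing, producing the autonomous attentive behavior of Fig. 5-Bottom Row.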
Conclusion

We have shown that we may define a code based on a dictionary of primary visual weight vectors and that this code is efficient and sparse, leading to a dynamical overcomplete representation of the image. We have also shown that the coefficients could be quantized by the polarity and the rank of the spikes, that this algorithm could be implemented in a biologically plausible network, and that it leads to very efficient coding strategies which compare at high compression ratios with image processing standards like JPEG. This sparse strategy may thus be used in the retina and primary visual areas, especially for low-bit-rate compression and fast image transmission. But we also saw that this coding strategy provides a way to efficiently and progressively represent the image. In fact, we showed that a simple heuristic in the choice of the best match could lead to useful properties, such as an autonomous attentive process, for image processing and particularly for pattern recognition. We are now extending this strategy to take into account adaptive mechanisms on the relative gains of the neurons, but also on the weight vectors. It is also being implemented for more effective image processing performance using a pyramidal architecture, which will be realized on a retina chip.
References

[1] Bruno Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
[2] Simon J. Thorpe, Denis Fize, and Catherine Marlot. Speed of processing in the human visual system. Nature, 381:520–522, 1996.
[3] Rufin Van Rullen and Simon J. Thorpe. Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Computation, 13(6):1255–1283, 2001.
[4] Stéphane Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1998.
[5] Joseph J. Atick and A. Norman Redlich. What does the retina know about natural scenes? Neural Computation, 4(2):196–210, 1992.
[6] Stéphane Mallat and Zhifeng Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3414, 1993.
[7] Laurent Perrinet and Manuel Samuelides. Sparse image coding using an asynchronous spiking neural network. In Proceedings of ESANN, pages 313–318, 2002.
[8] Rufin Van Rullen, Jacques Gautrais, Arnaud Delorme, and Simon J. Thorpe. Face processing using one spike per neuron. BioSystems, 48:229–239, 1998.
[9] Ferdinando Samaria and Andy Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, FL, December 1994.
[10] Rufin Van Rullen, Arnaud Delorme, and Simon J. Thorpe. Object recognition using spiking neurons II: Spatial attention explained by temporal precedence of information. 2000.
² Using relative addresses will also compact the code's size.