Sparse coding of natural images produces localized, oriented, bandpass receptive fields

Bruno A. Olshausen and David J. Field
Department of Psychology, Uris Hall, Cornell University, Ithaca, New York 14853
([email protected], [email protected])

Technical Report CCN-100-95
November 9, 1995

Abstract
The images we typically view, or natural scenes, constitute a minuscule fraction of the space of all possible images. It seems reasonable that the visual cortex, which has evolved and developed to effectively cope with these images, has discovered efficient coding strategies for representing their structure. Here, we explore the hypothesis that the coding strategy employed at the earliest stage of the mammalian visual cortex maximizes the sparseness of the representation. We show that a learning algorithm that attempts to find linear sparse codes for natural scenes will develop receptive fields that are localized, oriented, and bandpass, much like those in the visual system. These receptive fields produce a more efficient image representation for later stages of processing because sparseness reduces the entropies of individual outputs, which in turn reduces the redundancy due to complex statistical dependencies among unit activities.
The spatial receptive fields of simple cells in mammalian striate cortex have been reasonably well described physiologically and can be characterized as being localized, oriented, and bandpass: Each cell responds to visual stimuli within a restricted and contiguous region of space that is organized into excitatory and inhibitory subfields elongated along a particular direction, and the spatial frequency response is generally bandpass with bandwidths in the range of 1-2 octaves [1, 2, 3, 4]. We seek to provide a functional explanation of these spatial response properties in terms of an efficient coding strategy for natural images. An image, I(x,y), is modeled as a linear superposition of (not necessarily orthogonal) basis functions, \phi_i(x,y):

I(x,y) = \sum_i a_i \phi_i(x,y) .    (1)
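As a concrete illustration of this generative model, the following sketch builds a random basis and synthesizes a patch according to equation (1); the patch size, number of basis functions, and random values are arbitrary placeholders, not settings used in our simulations.

```python
import numpy as np

# Illustrative sizes only (not taken from the paper): 8x8 patches, 64 basis functions.
patch_size = 8
n_pixels = patch_size * patch_size      # each image patch is flattened over (x, y)
n_basis = 64

rng = np.random.default_rng(0)

# phi: each column holds one basis function phi_i(x, y), flattened over (x, y).
phi = rng.standard_normal((n_pixels, n_basis))
phi /= np.linalg.norm(phi, axis=0)      # normalize each basis function

# a: one coefficient a_i per basis function; these change from image to image.
a = rng.standard_normal(n_basis)

# Equation (1): the image is the linear superposition sum_i a_i * phi_i(x, y).
I = (phi @ a).reshape(patch_size, patch_size)
```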
The image code is set by the choice of basis functions, \phi_i. The coefficients, a_i, are dynamic variables that change from one image to the next. The goal is to find a set of \phi_i that forms a complete code (i.e., spans the input space) and results in an efficient representation in which the values of the coefficients are as statistically independent as possible over an ensemble of natural images. Achieving statistical independence is desirable because it makes explicit the structure in sensory signals [5, 6]. One line of approach to this problem is based on principal components analysis [7, 8, 9], in which the goal is to find a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data and for which the coefficients are pairwise decorrelated, \langle a_i a_j \rangle = \langle a_i \rangle \langle a_j \rangle. The receptive fields that result from this process are not localized, however, and the first few principal components appear oriented only because they are low-frequency (Fig. 1).
Figure 1: Principal components calculated on 8x8 image patches extracted from natural scenes using Sanger's rule [8]. The resulting basis functions are not localized, and the vast majority do not at all resemble any known cortical receptive fields. The first few principal components appear "oriented" only by virtue of the fact that they are composed of a small number of low frequency components (since the lowest spatial frequencies account for the greatest part of the variance in natural scenes [13]), and reconstructions based solely on these functions will merely yield blurry images.
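For comparison with the sparse coding results presented below, the following sketch computes principal components of image patches using a batch eigendecomposition of the patch covariance (rather than Sanger's online rule used for Figure 1); the array `images`, the patch size, and the number of sampled patches are placeholders.

```python
import numpy as np

def principal_components(images, patch_size=8, n_patches=50000, seed=0):
    """Estimate principal components of image patches by a batch
    eigendecomposition of the patch covariance matrix."""
    rng = np.random.default_rng(seed)
    H, W = images.shape[1:]
    patches = np.empty((n_patches, patch_size * patch_size))
    for k in range(n_patches):
        img = images[rng.integers(len(images))]
        r = rng.integers(H - patch_size)
        c = rng.integers(W - patch_size)
        patches[k] = img[r:r + patch_size, c:c + patch_size].ravel()
    patches -= patches.mean(axis=0)               # remove the mean patch
    cov = patches.T @ patches / len(patches)      # pixel covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # sort by variance explained
    return eigvecs[:, order], eigvals[order]      # columns are the components
```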
Principal components analysis is appropriate for capturing the structure of data that are well described by a Gaussian cloud, or in which the linear pairwise correlations are the major form of statistical dependence in the data. However, natural scenes contain many higher-order forms of statistical structure, and there is good reason to believe they form an extremely non-Gaussian shape that is not at all well captured by orthogonal components [10]. Intuitively, one can see this in the fact that oriented structures such as lines and edges will
produce three-point and higher correlations [11]. In addition, curved, fractal-like edges tend to produce local alignments in the phase spectrum over about 1-2 octaves in spatial frequency [12, 11], and this fact will also be missed by the linear, pairwise correlations. In order to capture these important forms of structure in natural images, we need to develop a linear modeling framework that can take into account higher-order statistical dependencies in the data.

Complete statistical independence among a set of variables is satisfied when the joint entropy is equal to the sum of individual entropies: H(a_1, a_2, ..., a_n) = \sum_i H(a_i). If there exist statistical dependencies, then the sum of individual entropies will be greater than the joint entropy. Assuming we have some way of ensuring that information in the image is preserved, then, our best strategy will be to lower the individual entropies, H(a_i), as much as possible. (In Barlow's terms [5], we seek a minimum entropy code.) We conjecture that natural images have "sparse structure" (that is, any given image can be represented in terms of a small number of descriptors out of a large set [13, 10]), and so we shall seek a specific form of low-entropy code in which the probability distribution of each coefficient's activity is uni-modal and peaked around zero. The search for a sparse code may be formulated as an optimization problem by constructing the following cost functional to be minimized:

E = -[preserve information] - \lambda [sparseness of a_i] ,    (2)

where \lambda is a positive constant that weights the second term relative to the first. The first term measures how well the code describes the image, which we choose as the mean square of the error between the actual image and the reconstructed image:

[preserve information] = - \sum_{x,y} [ I(x,y) - \sum_i a_i \phi_i(x,y) ]^2 .    (3)
The second term assesses the sparseness of the code for a given image by assigning a cost depending on how activity is distributed among the coefficients: those representations in which activity is spread over many coefficients should incur a higher cost than those in which only a few coefficients carry the load. The cost function we have constructed to meet this criterion takes the sum of each coefficient's activity passed through a non-linear function S(x):

[sparseness of a_i] = - \sum_i S( (a_i - \mu_i) / \sigma_i ) ,    (4)
where \mu_i = \langle a_i \rangle and \sigma_i^2 = \langle (a_i - \mu_i)^2 \rangle. The choices for S(x) that we have experimented with include -e^{-x^2}, \log(1 + x^2), and |x|, and all yield qualitatively similar results (described below). The reasoning behind these choices is that they will favor, among activity states with equal variance, those with the fewest number of non-zero coefficients. This is illustrated in geometric terms in Figure 2.

Learning is accomplished by performing gradient descent on the total cost functional, E. For each image presentation, the a_i evolve along the gradient of E until a minimum is reached:

\dot{a}_i = \eta [ b_i - \sum_j C_{ij} a_j - (\lambda/\sigma_i) S'( (a_i - \mu_i)/\sigma_i ) ] ,    (5)

where b_i = \sum_{x,y} \phi_i(x,y) I(x,y), C_{ij} = \sum_{x,y} \phi_i(x,y) \phi_j(x,y), and \eta is a rate constant.
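The following sketch illustrates how this inference step could be carried out for a single image patch; it uses plain fixed-step gradient descent rather than the conjugate gradient method employed in our simulations, assumes \mu_i = 0 and \sigma_i = 1 for simplicity, and the step size, iteration count, and \lambda value are illustrative.

```python
import numpy as np

def sparsify(I, phi, lam=0.1, eta=0.01, n_steps=200, S_prime=None):
    """Infer coefficients a for one flattened image patch I by gradient descent
    on a cost of the form of eq. (2)-(5): squared reconstruction error plus
    lam * sum_i S(a_i), with constant factors absorbed into eta and lam.
    For simplicity mu_i = 0 and sigma_i = 1 are assumed here."""
    if S_prime is None:
        # S(x) = log(1 + x^2)  =>  S'(x) = 2x / (1 + x^2)
        S_prime = lambda x: 2.0 * x / (1.0 + x * x)
    b = phi.T @ I                    # feedforward term b_i = sum_xy phi_i * I
    C = phi.T @ phi                  # overlap matrix C_ij = sum_xy phi_i * phi_j
    a = np.zeros(phi.shape[1])
    for _ in range(n_steps):
        # Equation (5): a_dot = eta * (b - C a - lam * S'(a))
        a += eta * (b - C @ a - lam * S_prime(a))
    return a
```

In the network interpretation described below, b plays the role of the feedforward input term b_i, the product C a the recurrent interactions, and S' the self-inhibiting nonlinearity.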
Figure 2: The cost function for sparseness, plotted as a function of the joint activity of two coefficients, a_i and a_j. In this example, S(x) = \log(1 + x^2). An activity vector that points towards a corner, where activity is distributed equally among both coefficients, will incur a higher cost than a vector with the same length that lies along one of the axes, where the total activity is loaded onto one coefficient. The gradient tends to "sparsify" activity by differentially reducing the value of low-activity coefficients more than high-activity coefficients.
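The geometric point of Figure 2 can be checked directly; the following sketch compares the sparseness cost of two activity vectors of equal length under S(x) = \log(1 + x^2) (the particular vectors are illustrative):

```python
import numpy as np

S = lambda x: np.log(1.0 + x * x)          # sparseness cost per coefficient

axis_vec = np.array([np.sqrt(2.0), 0.0])   # all activity loaded onto one coefficient
corner_vec = np.array([1.0, 1.0])          # same vector length, activity split equally

print(np.linalg.norm(axis_vec), np.linalg.norm(corner_vec))  # equal lengths
print(S(axis_vec).sum())     # log(3)   ~ 1.10
print(S(corner_vec).sum())   # 2 log(2) ~ 1.39  -> the corner incurs a higher cost
```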
After a number of trials have been computed this way, the \phi_i are updated by making an incremental step along the gradient of \langle E \rangle:

\Delta \phi_i(x_m, y_n) = w \langle [ I(x_m, y_n) - \hat{I}(x_m, y_n) ] a_i \rangle ,    (6)

where \hat{I} is the reconstructed image, \hat{I}(x_m, y_n) = \sum_i a_i \phi_i(x_m, y_n), and w is the learning rate. The vector length (gain) of each basis function, \phi_i, is adapted over time so as to maintain equal variance on each coefficient.

Note that there is a simple network interpretation of this system: The value of each output unit, a_i, is determined from a combination of a feedforward input term, b_i, a recurrent term, \sum_j C_{ij} a_j, and a nonlinear self-inhibition term, S', that differentially pushes activity toward zero. The output values a_i are then fed back through the functions \phi_i to form a reconstructed image, and the weights evolve by doing Hebbian learning on the residual signal. We may roughly interpret each basis function as a "receptive field," since it is the feedforward weighting function that contributes to a unit's output.

The learning rule was tested on several artificial datasets containing controlled forms of sparse structure, and the results of these tests (Fig. 3) confirm that the network is indeed capable of discovering sparse structure in input data, even when the sparse components are non-orthogonal.

The result of training the network on 16x16 image patches extracted from natural scenes is shown in Figure 4a. The vast majority of receptive fields are well localized within each array (with the exception of the low frequency functions, which occupy a larger spatial extent). Moreover, the receptive fields are oriented and broken into different spatial frequency bands. In some sense, this result should not come as a surprise, because it simply reflects the fact that natural images contain localized, oriented structures with limited phase alignment across spatial frequency [12, 11]. The entire set of basis functions forms a complete image code that spans the joint space of spatial position, orientation, and spatial frequency (Fig. 4b) in a manner similar to wavelet codes, which have previously been shown to form sparse representations of natural images [13, 10, 14]. The resulting histograms have sparse distributions (Fig. 4c) and reduced entropy (2.93 nats vs. 3.19 nats before training) for a mean square reconstruction error that is 4% of the image variance.

These results demonstrate that localized, oriented, bandpass receptive fields emerge when only two global objectives are placed on a linear coding of natural images: that information be preserved, and that the representation be sparse. No other constraints are required to obtain these receptive field properties. The fact that cortical cells show receptive field properties similar to those developed here suggests that evolutionary and developmental forces are following a similar objective. An important prediction of this coding strategy that is not supported by available evidence, however, is that one would expect to find the vast majority of cells tuned to the highest spatial frequency band, and substantially fewer cells at lower spatial frequencies (the exact proportions depend upon bandwidth spacing: a factor of four decrease in number would be expected for an octave decrease in spatial frequency). Much of the disagreement, though, may be due to inadequate assay methods [15, 16], and it will be important to resolve this issue if wavelet-like codes are to be taken seriously as models of the early visual representation.
Figure 3: Test cases (a: sparse pixels; b: sparse gratings; c: sparse Gabors).
In a, b, and c, representative training images are shown at left, and the resulting basis functions that were learned from these examples are shown at right. In a, images were composed of sparse pixels: each pixel was activated independently according to an exponential distribution, P(x) = e^{-|x|}/Z. In b, images were composed similarly to a, except using gratings instead of pixels (i.e., by generating "sparse pixels" in the Fourier domain). In c, images were composed of sparse Gabor patches using the method described by Field [10]. In all cases, the basis functions were initialized to random initial conditions. The learned basis functions successfully recover the sparse components from which the images were composed. The form of the sparseness cost function was S(x) = -e^{-x^2}, but other choices (see text) yield the same results.
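For the first test case, "sparse pixels" training images can be generated directly from the stated distribution; in the sketch below, the image size and number of images are illustrative, and numpy's Laplacian sampler realizes P(x) = e^{-|x|}/Z:

```python
import numpy as np

def sparse_pixel_images(n_images=1000, size=8, seed=0):
    """Each pixel is drawn independently from P(x) = exp(-|x|)/Z,
    i.e. a zero-mean Laplacian, giving images with sparse structure."""
    rng = np.random.default_rng(seed)
    return rng.laplace(loc=0.0, scale=1.0, size=(n_images, size, size))
```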
Figure 4: Results from training a network of 192 output units on 16x16
pixel image patches extracted from natural scenes.
The scenes were 20 512x512 images of natural surroundings in the American northwest, preprocessed by filtering with the zero-phase whitening/lowpass filter R(f) = f e^{-(f/f_0)^4}, f_0 = 200 cycles/picture. Whitening counteracts the fact that the rms error preferentially weights low frequencies for natural scenes, while low-pass filtering prevents diagonal spatial frequencies from extending higher than the horizontal and vertical spatial frequencies (due to the rectangular sampling lattice). Qualitatively similar results are obtained without these preprocessing steps, but with slower learning of the high frequency components and some anisotropy in orientation tuning. Gradient descent for the a_i was performed using the conjugate gradient method, and was halted within 10 iterations or when the change in E was less than 1% (whichever came first). The \phi_i were initialized to random values and were updated every 100 image presentations. A stable solution was arrived at after approximately 4000 updates (roughly 5 days of execution time on an SGI Indy workstation running a 100 MHz R4600 CPU). Parameters used were w = 2.0, with the other two constants set to 0.14. The form of the sparseness cost function was S(x) = \log(1 + x^2), as this choice seemed most adept at dealing with sparse components of unequal variance (despite the "whitening" filter, there is still an uneven distribution of variance among frequency components). The activity statistics \mu_i and \sigma_i were collected by moving averages. a, The learned receptive fields are localized, oriented, and bandpass in spatial frequency. Each function is scaled so that its dynamic range fills the grey scale, but with zero always represented by the same grey-level (black is negative, white is positive). b, The organization of the learned receptive fields in space, orientation, and spatial frequency. Receptive fields were subdivided into high, medium, and low spatial-frequency bands (in octaves), according to the peak frequency in their power spectra, and their spatial location was plotted within the corresponding plane. Orientation preference is denoted by line orientation. c, Activity histograms averaged over all coefficients for the learned basis functions (solid line) and for random initial conditions (dashed line). In both cases, \lambda was set to yield a mean square error that was 4% of the image variance. Thus, the learned basis functions can tolerate a higher degree of sparsification for the same mean squared error. The width of each bin in the histogram is 0.04.
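The overall procedure can be summarized in the following sketch of the outer learning loop: coefficients are inferred for a batch of patches (the `sparsify` sketch above), an incremental Hebbian step is taken on the residual as in equation (6), the basis-function gains are renormalized, and the coefficient entropy is estimated from a pooled histogram as in Figure 4c. The batch size, learning rate, number of updates, and the unit-norm gain control are simplifying assumptions rather than the exact procedure used here.

```python
import numpy as np

def train_basis(patches, n_basis=64, n_updates=200, batch=100, w=0.1, seed=0):
    """Sketch of the learning loop: infer coefficients with `sparsify` (above),
    then take an incremental Hebbian step on the residual, as in eq. (6)."""
    rng = np.random.default_rng(seed)
    n_pixels = patches.shape[1]
    phi = rng.standard_normal((n_pixels, n_basis))
    phi /= np.linalg.norm(phi, axis=0)
    for _ in range(n_updates):
        dphi = np.zeros_like(phi)
        for _ in range(batch):
            I = patches[rng.integers(len(patches))]
            a = sparsify(I, phi)                 # inference step, eq. (5)
            residual = I - phi @ a               # I - I_hat
            dphi += np.outer(residual, a)        # Hebbian learning on the residual
        phi += w * dphi / batch                  # eq. (6), averaged over the trials
        phi /= np.linalg.norm(phi, axis=0)       # gain control: unit-norm proxy for
                                                 # equalizing coefficient variance
    return phi

def coefficient_entropy(patches, phi, bin_width=0.04):
    """Entropy (in nats) of the pooled coefficient histogram, as in Fig. 4c."""
    coeffs = np.concatenate([sparsify(I, phi) for I in patches])
    bins = np.arange(coeffs.min(), coeffs.max() + bin_width, bin_width)
    counts, _ = np.histogram(coeffs, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))
```

For example, `train_basis(sparse_pixel_images(2000).reshape(2000, -1))` would be expected to recover roughly pixel-like basis functions on the sparse-pixel test set of Figure 3a.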
In recent years there has been a growing body of work on learning algorithms that attempt to find sparse codes [17, 18] or causal structure in data [21, 20, 19]. All of these methods would seem to have the potential to arrive at qualitatively similar results to those we have shown here, and indeed Bell's recent algorithm [21] has shown the ability to duplicate our results when a sparse probability distribution is imposed on the outputs (personal communication). In addition, the BCM learning rule [22], which attempts to find multi-modal distributions, has shown the capability of developing somewhat localized receptive fields when trained on natural images [23], but the complete set of receptive fields required for reconstruction was not determined, so it is difficult to assess the image code as a whole. The major features that distinguish our algorithm from these alternative schemes are that the coding is sparse, analog, allows for overcompleteness, and provides a biologically plausible account of
the response properties of cortical simple cells.
Acknowledgements

Formulation of the sparse coding model benefited from discussions with Mike Lewicki. We also thank Chris Lee, Carlos Brody, and George Harpur for useful input. Supported by NIMH, F32-MH11062 (BAO).
References

[1] Hubel DH, Wiesel TN (1968) Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology, 195: 215-244.
[2] De Valois RL, Albrecht DG, Thorell LG (1982) Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22: 545-559.
[3] Jones JP, Palmer LA (1987) An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58: 1233-1258.
[4] Parker AJ, Hawken MJ (1988) Two-dimensional spatial structure of receptive fields in monkey striate cortex. Journal of the Optical Society of America A, 5: 598-605.
[5] Barlow HB (1989) Unsupervised learning. Neural Computation, 1: 295-311.
[6] Atick JJ (1992) Could information theory provide an ecological theory of sensory processing? Network, 3: 213-251.
[7] Linsker R (1988) Self-organization in a perceptual network. Computer, pp. 105-117.
[8] Sanger TD (1989) An optimality principle for unsupervised learning. In: Advances in Neural Information Processing Systems 1, Touretzky D, ed., pp. 11-19.
[9] Hancock PJB, Baddeley RJ, Smith LS (1992) The principal components of natural images. Network, 3: 61-72.
[10] Field DJ (1994) What is the goal of sensory coding? Neural Computation, 6: 559-601.
[11] Olshausen BA, Field DJ (1995) Natural image statistics and efficient coding. Submitted to Network (Proceedings of the Workshop on Information Theory and the Brain, September 4-5, 1995, University of Stirling, Scotland). Available electronically as ftp://redwood.psych.cornell.edu/pub/papers/stirling.ps.
[12] Field DJ (1993) Scale-invariance and self-similar 'wavelet' transforms: an analysis of natural scenes and mammalian visual systems. In: Wavelets, Fractals, and Fourier Transforms, Farge M, Hunt J, Vascillicos C, eds., Oxford University Press, pp. 151-193.
[13] Field DJ (1987) Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4: 2379-2394.
[14] Daugman JG (1989) Entropy reduction and decorrelation in visual coding by oriented neural receptive fields. IEEE Transactions on Biomedical Engineering, 36: 107-114.
[15] Olshausen BA, Anderson CH (1994) A model of the spatial frequency organization in primate visual cortex. In: The Neurobiology of Computation, Bower JM, ed., Kluwer, pp. 275-280.
[16] Olshausen BA, Anderson CH, Van Essen DC (1995) A multiscale dynamic routing circuit for forming size- and position-invariant object representations. Journal of Computational Neuroscience, 2: 45-62.
[17] Foldiak P (1990) Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64: 165-170.
[18] Zemel RS (1993) A minimum description length framework for unsupervised learning. Ph.D. Thesis, University of Toronto, Dept. of Computer Science.
[19] Saund E (1995) A multiple cause mixture model for unsupervised learning. Neural Computation, 7: 51-71.
[20] Hinton GE, Dayan P, Frey BJ, Neal RM (1995) The "wake-sleep" algorithm for unsupervised neural networks. Science, 268: 1158-1161.
[21] Bell AJ, Sejnowski TJ (1995) An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7: 1129-1159.
[22] Bienenstock EL, Cooper LN, Munro PW (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2: 32-48.
[23] Law CC, Cooper LN (1994) Formation of receptive fields in realistic visual environments according to the Bienenstock, Cooper, and Munro (BCM) theory. Proceedings of the National Academy of Sciences USA, 91: 7797-7801.