Published in Network, 6(3), pp 1-8, 1995

A Learning Rule for Extracting Spatio-Temporal Invariances

James Stone & Alistair Bray

School of Cognitive and Computing Sciences, University of Sussex, Sussex, BN1 9QH, England. Email: [email protected], [email protected]

Abstract

The inputs to photoreceptors tend to change rapidly over time, whereas physical parameters (e.g. surface depth) underlying these changes vary more slowly. Accordingly, if a neuron codes for a physical parameter then its output should also change slowly, despite its rapidly fluctuating inputs. We demonstrate that a model neuron which adapts to make its output vary smoothly over time can learn to extract invariances implicit in its input. This learning consists of a linear combination of Hebbian and anti-Hebbian synaptic changes, operating simultaneously upon the same connection weights but at different time scales. This is shown to be sufficient for the unsupervised learning of simple spatio-temporal invariances.

We present a learning rule, based upon temporal correlations, that allows a single model neuron to extract important information from its input concerning temporal parameters (i.e. invariances). Primary sensory areas such as striate cortex tend to use a place-coding of input features (such as edge-orientation) in which the identity of the neuron responding may be more important than its response magnitude. Such a coding demands many neurons. We show that a smaller number of neurons, adapting in accord with the learning rule we define, could alter their synaptic connections to transmit the same amount of information. This is achieved by use of a frequency-coding in which response magnitude has high information content.

Many Hebbian-learning models exploit spatial correlations in their inputs, whereas only a few exploit temporal correlations [1, 2, 3]. These temporal models update synaptic strengths between neurons according to exponentially weighted time-averages of post-synaptic (or pre-synaptic) activity over the recent past. Using these exponential traces with Hebbian synaptic modification allows each neuron to make strong connections to others with which it is temporally correlated. Essentially, these invariance-seeking model neurons simply compute a logical OR on a subset of their inputs. For example, a neuron that learns to be selective for edge-orientation, whilst invariant to edge-position, forms uniformly strong connections to the subset of all input neurons having the appropriate orientation selectivity, regardless of spatial position, because activity within this subset is temporally correlated. These models are limited because they require an output neuron to represent each subset of input neurons. If there are input neurons responsive to n ranges of orientation, there must be n output neurons to maintain orientation discrimination; the result is a place-code for orientation in which the output-value of the responding neuron carries only a small amount of information.

We describe below a learning rule by which an alternative coding could be acquired. In the above example, position-invariant edge-orientation information could be transmitted (theoretically) by a single neuron. The connection-strength between such an output-neuron and any input-neuron would specify the orientation preference of that input-neuron, and the output-neuron would code by its value, rather than by its identity. Although it is unlikely that neural systems use such non-redundant coding as in this extreme case, our learning rule provides a general means of moving from highly redundant place-codes (such as achieved with simple OR connections) to value-codes using fewer neurons. Information is preserved, and redundancy is reduced; a toy illustration of this place-to-value conversion is sketched below.

The rule is based upon a general assumption concerning perception: the inputs to sensory receptors tend to change rapidly and discontinuously over time, whereas the physical parameters underlying these changes vary more slowly, and more smoothly. If a neuron codes for a physical parameter then its output should also change slowly and smoothly, despite its rapidly fluctuating inputs. Therefore, a neuron that adapts to make its output vary, but to vary smoothly over time, can learn to code the invariances underlying its input.
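As a toy illustration of this conversion (not taken from the paper), the fragment below shows how a one-hot place-code over n orientation channels can be read out as a single value by a unit whose weight to each input equals that input's preferred orientation; the variable names and the choice of n = 8 are assumptions made purely for concreteness.

    import numpy as np

    # Hypothetical toy example: n orientation-selective inputs, place-coded as a one-hot vector.
    n = 8
    preferred = np.linspace(0.0, 180.0, n, endpoint=False)   # preferred orientation (degrees) of each input

    def value_code(place_code, weights=preferred):
        # Read a one-hot place-code out as a single value. With each weight equal to the
        # corresponding input's preferred orientation, one linear unit's response magnitude
        # carries the information that the n place-coded inputs carried by their identity.
        return float(place_code @ weights)

    x = np.zeros(n)
    x[3] = 1.0                      # the input tuned to 67.5 degrees is active
    print(value_code(x))            # -> 67.5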

(Footnote: Stone is a joint member of the Schools of Biological and Cognitive Sciences.)


The output can be made to reflect both smoothness and variability by forcing it to have a small short-term variance and a large long-term variance. This strategy can be implemented for a single linear output unit u. The output of u at time t is y^t. The trace ỹ^t is a short-term average of the outputs of u (i.e. an exponentially weighted sum of recent outputs y^t), and ȳ^t is a similar long-term average. We can obtain the desired behaviour in y by altering the connection weights between u and its input units such that y has a large long-term variance V and a small short-term variance U. Maximising V/U maximises the variance of y over long intervals, whilst simultaneously minimising its variance over short intervals. The output of u is y^t = Σ_j w_j x_j^t, where w_j is the value of a weighted connection from input x_j to u. A merit function F can be defined as:

    F = (1/2) log(V/U) = (1/2) log [ Σ_{t=1}^{T} (y^t − ȳ^t)² / Σ_{t=1}^{T} (y^t − ỹ^t)² ]    (1)

In order to maximise F we require its derivative with respect to each weight w_j. Omitting temporal superscripts, F can be re-written:

    F = (1/2) log Σ (y − ȳ)² − (1/2) log Σ (y − ỹ)²

and its derivative:

    ∂F/∂w_j = (1/V) Σ (y − ȳ)(∂y/∂w_j − ∂ȳ/∂w_j) − (1/U) Σ (y − ỹ)(∂y/∂w_j − ∂ỹ/∂w_j)

Given that y is a linear function of its input:

    ∂F/∂w_j = (1/V) Σ (y − ȳ)(x_j − x̄_j) − (1/U) Σ (y − ỹ)(x_j − x̃_j)

which can be specified as:

    (1/V) ⟨(y − ȳ)(x_j − x̄_j)⟩ − (1/U) ⟨(y − ỹ)(x_j − x̃_j)⟩

This yields the direction of steepest ascent, and determines the weight changes that maximise F. This rule for synaptic modification combines Hebbian and anti-Hebbian learning (the HAH-rule), operating simultaneously at two different time-scales. The Hebbian adaptation is:

    Δ^h w_j = (α/V)(y − ȳ)(x_j − x̄_j)

where α is the learning rate. The anti-Hebbian adaptation is:

    Δ^{ah} w_j = −(α/U)(y − ỹ)(x_j − x̃_j)

The interpretation of this rule is as follows. If V is small relative to U then learning is principally Hebbian, which has the effect of increasing the variability of outputs over long periods. That is, it prevents the output of the unit being constant. However, if V is large relative to U then learning is principally anti-Hebbian. This has the effect of decreasing the variability of outputs over short periods. The net effect of these changes is to generate an output which has a large range, but which varies smoothly over time.
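As an illustration (not the authors' code), the following minimal NumPy sketch implements one step of the HAH-rule. The helper names hah_step and ewma, the dictionary used for trace bookkeeping, and the default decay constants are assumptions made here; the exponentially weighted traces follow the update formulae given with Figure 1 below.

    import numpy as np

    def ewma(trace, value, decay):
        # Exponentially weighted average: trace <- decay * trace + (1 - decay) * value.
        return decay * trace + (1.0 - decay) * value

    def hah_step(w, x, traces, alpha=0.001, decay_long=0.999, decay_short=0.95):
        # One Hebbian/anti-Hebbian (HAH) step on the weight vector w for input x.
        # traces holds running long-term averages (y_bar, x_bar, V) and short-term
        # averages (y_tilde, x_tilde, U) of the output and input.
        y = float(w @ x)
        t = traces
        # Hebbian term (scaled by 1/V) raises long-term output variance;
        # anti-Hebbian term (scaled by 1/U) lowers short-term output variance.
        dw = (alpha / t['V']) * (y - t['y_bar']) * (x - t['x_bar']) \
           - (alpha / t['U']) * (y - t['y_tilde']) * (x - t['x_tilde'])
        w = w + dw
        # Update the variance estimates using the previous averages, then the averages,
        # mirroring the trace formulae given with Figure 1.
        t['V'] = ewma(t['V'], (y - t['y_bar']) ** 2, decay_long)
        t['U'] = ewma(t['U'], (y - t['y_tilde']) ** 2, decay_long)
        t['y_bar'] = ewma(t['y_bar'], y, decay_long)
        t['y_tilde'] = ewma(t['y_tilde'], y, decay_short)
        t['x_bar'] = ewma(t['x_bar'], x, decay_long)
        t['x_tilde'] = ewma(t['x_tilde'], x, decay_short)
        return w, y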


We demonstrate this rule in three computer simulations that convert a place-coding to a value-coding. For the first simulation (see Figure 1), consider a vector whose elements are all zero except one, which is unity. Over time, the position of the non-zero element oscillates about the centre, from one end of the vector to the other, in simple harmonic motion. Using the above rule, and taking this temporally changing vector as input, the unit learns to code the position of the non-zero element along the vector. The second simulation (see Figure 2) is a repeat of the first, except that the motion of the non-zero element is far less smooth than previously: the element's position is calculated by adding a large degree of noise to the underlying sinusoidal signal. In this situation the unit reliably learns to code for position with little loss of precision. For the final simulation (see Figure 3), we use the two-dimensional analogue of the first. The non-zero element moves independently in two dimensions; two units, with a single decorrelation weight between them, learn to code separately for motion in two orthogonal directions.

In none of these simulations would the previously mentioned synaptic modification rules [1, 2, 3] have been successful. Standard Hebbian/anti-Hebbian rules would ignore the temporal correlations in the input. A network similar to Foldiak's could only provide another place-coding of reduced resolution: with two outputs (as above) the network would transmit a maximum of 2 bits of information. The BCM rule [4, 5] and the ABS rule [6, 7, 8] provide computational and biological support (respectively) for combining Hebbian and anti-Hebbian synaptic modification to acquire stimulus selectivity. However, these rules do not exploit the local temporal correlations in the input: adaptation is a function of long-term post-synaptic activity and the current pre-synaptic activity. By ignoring the recent temporal history of the neuron's pre- and post-synaptic activity, such rules permit neurons to learn spatial, but not temporal, correlations.

In conclusion, we have derived a learning rule that allows a linear unit to compute a value-coding of linear invariances that vary smoothly over time in its input. Such a unit is capable of converting a place-coding defined over a set of neurons to a value-coding on the output of far fewer neurons. A place-coding may be the necessary `expensive' result of extracting non-linear statistics (e.g. binocular disparity) from data (e.g. using competitive/Kohonen nets) in a non-temporal manner; however, once this extraction has been achieved, the HAH-rule provides a robust means of extracting linear invariances in these statistics, which are themselves non-linear invariances in the original data. It has been shown elsewhere that one such non-linear invariance (stereo disparity) can be extracted using a multi-layer network which exploits the same temporal smoothness assumptions as the method described in this paper [9].

Acknowledgements

Thanks to Peter Foldiak for comments on this paper. Jim Stone is supported by a JCI/MRC grant awarded to J Stone, T Collett and D Willshaw. Alistair Bray is supported by a SERC grant.

References

[1] Foldiak P. Learning invariance from transformation sequences. Neural Computation, 3(2):194–200, 1991.

[2] Wallis G, Rolls E T and Foldiak P. Learning invariant responses to the natural transformations of objects. International Joint Conference on Neural Networks, pages 1087–1090, 1993.

[3] Barrow H G and Bray A J. A model of adaptive development of complex cortical cells. In Aleksander I and Taylor J, editors, Artificial Neural Networks II: Proceedings of the International Conference on Artificial Neural Networks, pages 881–884. Elsevier Publishers, 1992.

[4] Law C C and Cooper L N. Formation of receptive fields in realistic visual environments according to the Bienenstock, Cooper, and Munro (BCM) theory. Proceedings of the National Academy of Sciences USA, 91:7797–7801, 1994.

[5] Bienenstock E L, Cooper L N and Munro P W. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2:32–48, 1982.

[6] Artola A and Singer W. Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. TINS, 16(11):480–487, 1993.

[7] Artola A, Brocher S and Singer W. Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature, 347:69–72, 1990.

[8] Stanton P K and Sejnowski T J. Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339:215–218, 1989.

[9] Stone J V. Learning spatio-temporal invariances. In Smith L S and Hancock P J B, editors, Neural Computation and Psychology, pages 75–85. Springer Verlag: Workshops in Computing Series, 1995.


Figure 1

Extracting one invariance.

The value of an invariance is coded by the position of activation on a set of inputs; the output unit adapts such that its value correlates with the invariance. At each time t a linear unit has input elements x_j (1 ≤ j ≤ 101). All input elements are zero except x_j′ = 1, where j′ = round(51 + 50·sin(t·360/λ)), λ = 450 and angles are measured in degrees. Weighted connections between the output unit and its inputs are initialised by choosing values uniformly from the range −1…1, and then normalising so that Σ_{j=1}^{101} w_j² = 1. At each time-step the weight vector adapts according to the HAH-rule (temporal superscripts are omitted for clarity):

    Δw_j = (α/V)(y − ȳ)(x_j − x̄_j) − (α/U)(y − ỹ)(x_j − x̃_j)

where α = 0.001 and w_j^{t+1} = w_j^t + Δw_j. A value for the trace ȳ, with a half-life of h_l time-steps, is computed using the formula:

    ȳ^{t+1} = η_l ȳ^t + (1 − η_l) y^{t+1},   where η_l = 0.5^{1/h_l}

Similar formulae are used for x̄, where the long-term averages have half-life h_l = 2λ, and for ỹ and x̃, where the short-term averages have half-life h_s = λ/31. Values for V and U are approximated using similar formulae:

    V^{t+1} = η_l V^t + (1 − η_l)(y^{t+1} − ȳ^t)²
    U^{t+1} = η_l U^t + (1 − η_l)(y^{t+1} − ỹ^t)²

Both V^t and U^t are computed using the same half-life h_l = 2λ; their values are used as an approximation of the V and U defined in (1). All temporal averages are initialised by running data through the network for 4λ time steps without any adaptation. When adaptation commences, no weight normalisation is required since the merit function F is independent of the magnitude of y.

[a]. The Connectivity. A single output unit has weighted connections to 101 binary-valued inputs; at any time, only a single input is non-zero and its position oscillates over time.

[b]. The Output. After 5,000 time steps (≈ 11λ) the magnitude of the correlation r between y and the position j′ of the non-zero element is greater than 0.9. When the network is fully converged, the value of y is plotted against time for 5λ (2,250) time-steps. The correlation between y and j′ is almost perfect (|r| > 0.999).

[c]. The Weights. The value of w_j is plotted against position j. The straight line indicates almost perfect correlation between w_j and j (|r| > 0.999). If h_l is sufficiently large then the weights converge to a stable value; otherwise, there may be a small amount of drift.
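A rough, self-contained driver for this simulation is sketched below. It is our illustration rather than the authors' program: the random seed, the crude initialisation of the traces (instead of the 4λ burn-in described above), and the variable names are assumptions; the decay constants are derived from the half-lives h_l = 2λ and h_s = λ/31 via 0.5^(1/h).

    import numpy as np

    n, lam, alpha = 101, 450.0, 0.001
    d_long = 0.5 ** (1.0 / (2.0 * lam))        # decay for a half-life of 2*lambda time-steps
    d_short = 0.5 ** (31.0 / lam)              # decay for a half-life of lambda/31 time-steps

    rng = np.random.default_rng(0)
    w = rng.uniform(-1.0, 1.0, n)
    w /= np.linalg.norm(w)                     # squared weights sum to one

    # Running traces: long-term (bar) and short-term (tilde) averages, plus variance estimates.
    y_bar = y_tilde = 0.0
    V = U = 1.0                                # crude initialisation instead of the 4*lambda burn-in
    x_bar = np.zeros(n)
    x_tilde = np.zeros(n)

    positions, outputs = [], []
    for t in range(5000):
        # Place-coded input: one unit element whose position oscillates sinusoidally.
        j = int(round(51 + 50 * np.sin(np.deg2rad(t * 360.0 / lam))))
        x = np.zeros(n)
        x[j - 1] = 1.0
        y = float(w @ x)
        # HAH update: Hebbian term at the long time-scale, anti-Hebbian term at the short one.
        w = (w + (alpha / V) * (y - y_bar) * (x - x_bar)
               - (alpha / U) * (y - y_tilde) * (x - x_tilde))
        # Trace updates, mirroring the formulae in the caption above.
        V = d_long * V + (1 - d_long) * (y - y_bar) ** 2
        U = d_long * U + (1 - d_long) * (y - y_tilde) ** 2
        y_bar = d_long * y_bar + (1 - d_long) * y
        y_tilde = d_short * y_tilde + (1 - d_short) * y
        x_bar = d_long * x_bar + (1 - d_long) * x
        x_tilde = d_short * x_tilde + (1 - d_short) * x
        positions.append(j)
        outputs.append(y)

    # After learning, the output should track position (the paper reports |r| > 0.9 within 5,000 steps).
    print(abs(np.corrcoef(positions[-2250:], outputs[-2250:])[0, 1]))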

Figure 2

Extracting one invariance in the presence of noise.

This simulation is identical to that in Figure 1 except that the position of the non-zero element in the vector is a function of a noisy version of a sinusoidal signal. The position of the non-zero element is given by

    j′ = round(51 + 50·sin((t·360/λ) + 300·random()))

where the function random() returns random numbers between zero and unity, drawn from a uniform distribution. As before, λ = 450.

[a]. The Output. After 100,000 time steps (≈ 222λ) the magnitude of the correlation between y and the position j′ of the non-zero element is greater than 0.9. When the network is fully converged, j′ is plotted against time for 2λ (900) time-steps; in the graph below that, y is plotted for the same period. In this simulation (and all other trials with similar noise levels) the magnitude of the correlation between y and j′ is extremely high (|r| > 0.985).

[b]. The Weights. The value of w_j is plotted against position j. The straight line indicates almost perfect correlation between w_j and j (|r| > 0.98).
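Only the input generator changes in this noisy variant. A minimal sketch, under the same assumptions and naming conventions as the previous fragment:

    import numpy as np

    rng = np.random.default_rng(1)

    def noisy_position(t, lam=450.0):
        # Position of the unit element: a sinusoid whose phase is jittered by up to
        # 300 degrees of uniform noise, as in the caption above.
        phase = t * 360.0 / lam + 300.0 * rng.uniform()
        return int(round(51 + 50 * np.sin(np.deg2rad(phase))))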


Figure 3

Extracting two invariances.

Two invariances are coded by the position of activation on a set of inputs; two output units adapt such that their values correlate with these invariances¹. Two linear units receive the same inputs x_ij (1 ≤ i, j ≤ 51); at time t all inputs are zero except x_i′j′ = 1, where

    i′ = round(26 + 25·sin((t·360/λ) + φ)),   j′ = round(26 + 25·sin((t·360/λ) − φ)),

λ = 450 and φ = (t·17)/360. Weighted connections between each unit and its inputs are initialised by choosing values uniformly from the range −1…1, and then normalising so that Σ_{i=1}^{51} Σ_{j=1}^{51} w_ij² = 1. There is a single weight w_ah that acts to decorrelate the two outputs y_1 and y_2; it is defined to be the negative of the correlation coefficient between y_1 and y_2 over the last τ time-steps, where τ = min(20λ, t). The outputs are defined as:

    y_1 = Σ_{i=1}^{51} Σ_{j=1}^{51} w¹_ij x_ij
    y_2 = Σ_{i=1}^{51} Σ_{j=1}^{51} w²_ij x_ij + k·w_ah·y_1

where k = 10. At each time-step the weight vectors are adapted as in the previous simulation, with α = 0.001. In computing the traces, the half-lives are h_l = 2λ and h_s = λ/31.

[a]. The Connectivity. Two output units each have weighted connections to the same grid of 51×51 binary-valued inputs; at any time, only a single input is non-zero and its position oscillates independently in the two spatial dimensions over time. The non-symmetrical anti-Hebbian connection ensures that the output of the second unit is decorrelated from that of the first.

[b]. The Output. After learning, the values of y_1 and y_2 are plotted against time for 5λ time-steps. The value of y_1 correlates with the position i′ of the non-zero element along the i dimension, and the value of y_2 correlates with its position j′ along the j dimension. In both cases, the correlation is almost perfect (|r| > 0.975).

[c]. The Weights. The two weight vectors w¹ and w² are shown as 2D grey-scale arrays (where intensity is a linear function of weight value, and weights vary from −0.35…0.35 in the left array and −0.6…0.6 in the right array). Each array is a 2D ramp with weight values rising linearly in one direction but remaining constant in the orthogonal direction. The directions of weight increase for the two units are approximately orthogonal to one another.

¹ This simulation is analogous to Foldiak's if one invariance is considered to be edge-orientation and the other edge-position: one unit becomes spatially invariant, the other orientation invariant.
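The distinctive part of this simulation is the decorrelating weight between the two outputs. The fragment below is an illustrative sketch rather than the authors' code: the history window, the guard against zero variance, and the function names are assumptions; each weight array would still adapt by the same HAH update used in the one-dimensional simulations.

    import numpy as np
    from collections import deque

    k, lam = 10.0, 450.0
    window = int(20 * lam)                      # history length used to estimate the decorrelating weight
    hist1, hist2 = deque(maxlen=window), deque(maxlen=window)

    def decorrelating_weight():
        # w_ah is minus the correlation coefficient of the two outputs over recent time-steps.
        if len(hist1) < 2:
            return 0.0
        a, b = np.asarray(hist1), np.asarray(hist2)
        if a.std() == 0.0 or b.std() == 0.0:
            return 0.0
        return -float(np.corrcoef(a, b)[0, 1])

    def two_unit_outputs(w1, w2, x):
        # y1 is a plain linear output; y2 receives the extra anti-Hebbian term k * w_ah * y1,
        # which pushes the second unit to code an invariance decorrelated from the first.
        y1 = float(np.sum(w1 * x))
        y2 = float(np.sum(w2 * x)) + k * decorrelating_weight() * y1
        hist1.append(y1)
        hist2.append(y2)
        return y1, y2

    # Each 51x51 weight array would then adapt by the same HAH update as in the 1-D simulation,
    # treating (x - x_bar), (x - x_tilde) and the weight arrays as 51x51 arrays.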


[Figure 1 (graphics): (a) connectivity of the output unit y to inputs 1…101 via weights w_1…w_101, with the active element undergoing oscillating motion; (b) output y plotted against time t; (c) weight w_j plotted against input position j.]

[Figure 2 (graphics): (a) position j′ of the non-zero element and output y plotted against time t; (b) weight w_j plotted against input position j.]

[Figure 3 (graphics): (a) connectivity of the two output units y_1 and y_2, linked by the decorrelating weight w_ah, to the input grid, with independent motion along i′ and j′; (b) outputs y_1 and y_2 plotted against time t; (c) the two weight arrays displayed as grey-scale images.]