Models of sensory coding

Peter Földiák

Churchill College, Cambridge

September 1991

This thesis is submitted for the degree of Doctor of Philosophy.
Preface

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. It is not substantially the same as any I have submitted for a degree or diploma or other qualification at any other University.

Signed                                  Date
Contents

Abstract
Chapter 1. Introduction
  1.1 The goal of sensory processing
  1.2 Regularities in natural stimuli and redundancy
Chapter 2. Adaptive decorrelation
  2.1 Lateral interactions in the cerebral cortex
  2.2 Decorrelation by anti-Hebbian learning
  2.3 A formal model
  2.4 Decorrelation between several variables
  2.5 Decorrelation and single unit adaptation
  2.6 Relation to Kohonen's 'novelty filter'
  2.7 Information theoretic properties
  2.8 Biological relevance
  2.9 Pattern adaptation and aftereffects
  2.10 Decorrelation and associative memory
  2.11 Decorrelation and supervised learning
Chapter 3. Optimal linear feature extraction
  3.1 Data reduction
  3.2 Principal component analysis
  3.3 Oja's algorithm
  3.4 Simulation
  3.5 Transmitted information
  3.6 Assumptions and biological relevance
  3.7 Comparison to related algorithms
Chapter 4. Sparse coding
  4.1 The Hebb unit as suspicious coincidence detector
  4.2 Competitive learning
  4.3 Sparse coding
  4.4 Combination of Hebbian and anti-Hebbian mechanisms
  4.5 Example 1: Learning lines
  4.6 Example 2: Learning the alphabet
  4.7 Comparison with other competitive algorithms
  4.8 Assumptions and biological relevance
  4.9 Hebbian and anti-Hebbian mechanisms in development
Chapter 5. Learning invariances
  5.1 Shift invariance
  5.2 Hebbian rule with trace mechanism
  5.3 Simulation
  5.4 Assumptions and biological plausibility
Chapter 6. Conclusions
References
Acknowledgements

I would like to thank my supervisor, Professor Horace Barlow, for a lot of help, encouragement and support during this work, and for a lot of interesting discussions. I am also grateful to Dr Graeme Mitchison for his useful comments and for the contribution of many interesting ideas that greatly influenced my work. I very much appreciate the comments of Dr Arthur and of Pete on a very late version of this thesis, and the interesting discussions with many other people in and outside Cambridge.

This work was supported by an Overseas Research Studentship and a research studentship from Churchill College, Cambridge.
Abstract

This work is aimed at understanding some of the general computational principles of sensory information processing in the brain. The sensory system detects physical quantities of the environment and transforms them into internal representations on which behavioural decisions are based. The properties that make a representation of the environment useful are discussed, and ways in which the sensory system may form such representations from a complex array of receptor signals are considered. The brain needs extensive knowledge about the statistical structure of the sensory signals for the interpretation and modelling of the environment. The acquisition and use of such knowledge are studied using models consisting of networks of simple processing units with properties that are believed to be functionally essential in biological neurons. As the acquisition of information about the input distribution is due to the adaptive, activity-dependent modification of the modifiable synaptic connections between the units, the rules governing the modification of these connections, and the representational capacity of the resulting networks, are studied. One class of such 'learning rules', local learning rules, are particularly important for understanding the nervous system. Specific hypotheses are studied in four 'unsupervised' learning tasks, in which the goal is not to implement a mapping between a predefined set of input and output patterns, but to discover statistical structure in the input without external guidance or supervision:

1 - An 'anti-Hebbian' synaptic modification rule is demonstrated to be able to adaptively form an uncorrelated representation of the correlated input signal. An uncorrelated, equal variance representation is information-theoretically optimal, as the mechanism can match the distribution of input patterns to the actual signalling space of the units. Such a signal also makes optimally efficient, least-mean-square error correcting associative learning on noisy units possible.

2 - A combination of Hebbian and anti-Hebbian mechanisms is demonstrated to implement a form of Principal Component Analysis, which reduces the dimensionality of a noisy Gaussian signal while maximising the information content of the representation. Such a representation is advantageous even when the units themselves are noisy.

3 - A similar arrangement of biologically more plausible, nonlinear units is shown to be able to adaptively code inputs into a sparse representation, substantially reducing the higher-order statistical redundancy of the representation without considerable loss of information. Such a representation is advantageous if it is to be used in further stages of processing.

4 - A Hebbian learning rule modified by a trace mechanism is studied that allows units to learn to respond in a way which is invariant with respect to commonly occurring transformations of the input signal.
Chapter 1. Introduction
Understanding the computational principles of brain function is one of the most interesting and difficult scientific problems. Although there has been an explosion of discoveries over the last several decades concerning the structure of the brain at several different levels, we do not yet sufficiently understand how the nervous system enables us to see, hear, to learn skills, remember events, to plan actions and make decisions. Work in the field of computer vision, robotics and other branches of artificial intelligence demonstrated that even the simplest tasks performed so effortlessly by human and animal brains, such as object recognition or motor coordination, are in fact hard computational problems. The essential failure of so far all man-made machines to emulate or even approach human performance at most of these tasks was found largely to be due to the fact that a large amount of background knowledge is needed to solve even such seemingly simple problems. The ability to encode, learn and efficiently access knowledge about the environment may be a key factor that allows the brain to perform well at many of these tasks.

The recently named field of 'computational neuroscience' (Sejnowski, Koch & Churchland, 1988) addresses the question of how physical signals (electrical and chemical) are used in the brain to represent and process information. It uses simplifying, formal models of neurons and of networks of neurons to capture, at some level of abstraction, the computational principles governing processing in nervous systems. The main advantage of formal models is that they are well-specified, internally consistent, and can be treated analytically or tested using computer simulation. Simplifying models are widely used in all branches of science; an example of a formal model that helped to capture the basic computational principles of computers is the Turing machine, which provided a conceptual framework for theoretical work in the field. Simplifying models of the nervous system are similarly useful for isolating the important computational problems, and it is hoped that the principles revealed using these models will be general enough to help us understand some of the computational processes taking place in the brain.

Neurobiological models differ greatly in the degree to which they make simplifying assumptions. An example of a realistic model is the Hodgkin-Huxley model of the action potential (Hodgkin & Huxley, 1952) in the squid giant axon, in which the dynamics of the membrane was modelled by a set of nonlinear differential equations. One interesting aspect of this model was that the ionic channels underlying the voltage-dependent conductances postulated in the model to account for the data were only discovered later, with the introduction of a new technique. In general, we should expect models to make testable hypotheses and predictions that go beyond the immediately available experimental data.
It may seem desirable to start from our anatomical and physiological knowledge and to build models by gradually incorporating more and more detail. When faced with even a moderately complicated system, however, this is extremely difficult, and it is certainly impossible to understand a complex system without first grasping some of its fundamental principles, as for instance in the case of an integrated circuit. The most promising approach seems to be to try to identify the computational goals of the system, consider mechanisms for achieving them, and then test the agreement between the model and reality, resorting to speculations and assumptions only when the facts run out. This 'reverse engineering' may be necessary even when we know the basic principles of the system's operation. The model may differ from the actual system in many respects; nonetheless, to maximise the chance of finding a match between model and reality, modelling and experimental testing should proceed simultaneously. We will need to make a large number of simplifying assumptions, either because of the limitations of the computers used in the simulations or, even more importantly, because of the lack of detailed knowledge of the system's operation. But even in cases in which these two considerations are not limiting, a more complicated model may be preferable to a simpler one only if the additional details add to the explanatory power of the model. A model with fewer parameters may account for the data less precisely than a complete simulation, but it also has a greater chance of generalising correctly to new observations. The Turing machine, for instance, may be a better model of a computation than a complete simulation of the actual computer hardware, which is much worse in both these respects. The models presented in the following chapters, as all models do, make many simplifying assumptions; in order to avoid confusion, an effort is made to make these assumptions explicit.
The nervous system can be modelled at several different levels: molecules and ions, synapses, dendritic trees, neurons, local circuits, maps, inter-regional circuits, or behavioural systems. In the following models, attention is focused on the single-unit and local circuit levels, on neurons and the connections between them. The behaviour of single neurons is especially interesting, as during the last few decades single-unit recording techniques were spectacularly successful in revealing a great deal of knowledge about the response properties of individual nerve cells, on which most of our current knowledge of brain function is based. It is the precise pattern of activity across a large population of neurons with a wide range of response properties that carries the information necessary for the discrimination and recognition of complex patterns.

Neural network models (also called 'connectionist models' or 'parallel distributed processing models') address the issue of how networks of highly interconnected simple units can perform nontrivial computations. Typically they consist of a large number of units with some neuron-like properties: the units receive their inputs on axon-like lines connecting them to other units through synapse-like, modifiable connection weights, and send their outputs on to further units. The units themselves carry out a relatively simple computation by combining (usually taking a weighted sum of) the inputs they receive to give a single scalar value as an output, which is transmitted on the connection lines to other units. Despite the simplicity of the units, such networks can perform elaborate computations due to the large number of units and the representation of information as patterns of activity distributed across a large population. Unlike most conventional algorithms, neural networks help us think about the way in which specific connections transform one pattern of activity into another.

Neural networks also have an inherent ability to learn. Learning takes place by the modification of the connections between the units, and learning algorithms exist that specify how the connections should be modified. By means of such algorithms a network can be trained - as opposed to being programmed - to perform complex tasks. These algorithms are interesting in their own right as they formalise problems related to pattern processing and recognition, which are hard to program, and they can therefore be used to solve a wide range of complex engineering problems. For instance, a network can be trained to generate speech from text (Sejnowski & Rosenberg, 1987), or to read postal codes from envelopes (LeCun, Boser, Denker, Henderson, Howard, Hubbard, et al., 1989), and networks can be used in areas including speech recognition and robotic control, where they often compete with, or even surpass, the performance of conventional computer programs. An additional advantage is that once they are trained, they are extremely fast: the calculation of unit activations is performed almost instantaneously. The processing is 'memory-based', supported by storing examples and interpolating between data points, as opposed to most conventional, 'computation-based' computer programs.

If we want to use neural networks as models of processing in the brain, however, we must examine the assumptions they make about the nervous system, and consider whether these assumptions can be supported or ruled out by anatomical and physiological evidence. Many of these assumptions are currently hypotheses which are neither definitely proved nor disproved, and our present knowledge still leaves room for a large number of alternatives.
One of the most widely used assumptions in network models is that neurons linearly sum their inputs. Based on the complexity of the dendritic tree, it has been suggested that each neuron should instead be considered as a sophisticated processor calculating a complex logical function of its inputs (Durbin & Rumelhart, 1989; Koch & Poggio, 1983; Swindale, 1981). These models require that parts of the dendritic tree be electrotonically decoupled from one another and from the soma. Analyses of morphologically identified cortical pyramidal neurons, however, have shown that their dendritic compartmentalisation is not complete, so the potential for strictly decoupled logical compartments is severely constrained (Douglas, Martin & Whitteridge, 1988; Koch, Douglas & Wehmeier, 1990; Stratford, Larkman, Mason, Major & Jack, 1990). This does not completely rule out some more local interaction, which may take place within a small number of dendritic spines receiving dual synapses (Koch & Poggio, 1983), though only about 10% of spines receive such dual input (Beaulieu & Colonnier, 1985; Dehay, Douglas, Martin & Nelson, 1991). The assumption of a high degree of local, nonlinear interaction is therefore questionable, and it does mean that models requiring strictly decoupled logical compartments rest on an unsuitable assumption. The linear summation rule, on the other hand, agrees well with the measured responses of visual cells within the classical receptive field (Enroth-Cugell & Robson, 1966; Movshon, Thompson & Tolhurst, 1978), and it is therefore not unreasonable as a first approximation to the function computed by a single neuron.

Another common assumption of most models is that units transmit information to other units as a single, scalar value corresponding to the firing frequency of the neuron. It has been suggested that the fine temporal structure of neural firing, such as the precise timing of individual action potentials or the relative timing between neurons, may constitute an important part of the neural code (Cattaneo, Maffei & Morrone, 1981; Dayhoff & Gerstein, 1983; Sherry, Barrow & Klemm, 1982; Young, Tanaka & Yamane, 1991). The evidence to support this hypothesis is still questionable, however, and it is not clear whether this strategy is generally used in the brain. To make use of the fine temporal structure, local, nonlinear interactions on decoupled dendrites or dendritic spines would be required. Moreover, as the time constant of cortical neurons is about 10-20 ms (Douglas & Martin, 1990; Stratford et al., 1990), any information contained in interspike-interval frequencies above about 15 Hz would likely be attenuated, distorted and lost by the time it reached later, secondary neurons. It is therefore not unreasonable, as an approximation, to consider the output of a neuron to be one scalar value corresponding to its firing frequency.
Perhaps the most crucial assumption that neural network models make is the form of the rules governing synaptic modification. In many neural network learning algorithms (e.g. back-propagation (Rumelhart, Hinton & Williams, 1986; Parker, 1985; Tesauro, 1990)), the information about the weight changes is not locally present at the site of the synapse, and the computation of the weight changes would require a rather complicated and computationally powerful mechanism, which is not thought to be present in the nervous system. However useful such methods may be, they are not likely candidates as models of synaptic plasticity in the brain. Networks intended as models of the nervous system must take the constraint of locality seriously. Despite their computational limitations, local learning rules are not only important for biological plausibility, but locality may also be essential for efficient physical, electronic hardware implementation. Local learning mechanisms have been successfully used in models of associative memory (Kohonen, 1984; Willshaw, Buneman & Longuet-Higgins, 1969), in models of unsupervised learning and self-organisation (Amari, 1980; Grossberg, 1976a; Kohonen, 1982; Linsker, 1986; Malsburg, 1973; Miller, Keller & Stryker, 1989; Rumelhart & Zipser, 1986), as well as in models of structural development in the nervous system (Willshaw & Malsburg, 1976). There is also accumulating evidence about the biophysical machinery underlying the local, activity-dependent modification of synapses in the nervous system (Brown, Kairiss & Keenan, 1990).

Learning in general is the process of the formation of a mapping from examples. Methods of supervised learning require either a 'teacher' that provides the desired output for each input pattern, or a reinforcer that reports whether the output generated for a given input was appropriate or not. These methods are often exceedingly slow when faced with problems of realistic size; they usually require a very large number of trials, and the number of labelled training examples needed does not scale well with the size of the problem. This is in sharp contrast with the ability of animals and people to learn from a single example or from a relatively small number of examples. An animal, for instance, may have only a severely restricted number of 'supervised' learning opportunities, and being able to learn about a poisonous food or a predator from very few examples can be a great advantage. In many cases, the complexity of the mapping to be learnt is largely due to the complexity of the representation of the input. This is especially true in problems involving perception; it is much easier to learn a mapping from a suitable symbolic representation (of e.g. 'tiger' to 'run') than to map an array of pixels to the appropriate response. Unsupervised learning methods, in contrast, exploit the regularities available in the large amount of data arriving through the sensory channels, without requiring a teacher or any obvious form of instruction or reinforcement (Becker, 1991; Hinton, 1989; Lippmann, 1987).
1.1. The goal of sensory processing

One of the most puzzling questions about the brain is how it makes sense of the vast array of signals, carried by millions of sensory channels, constantly bombarding our senses. What we are interested in, and what we experience, is not this stream of rapidly changing data but rather the objects, situations and events of the environment. Craik (1943) suggested that the brain constructs a 'working model' of the environment, which allows animals to make predictions about the world and to use the constantly arriving stream of sensory information to make behavioural decisions. In higher animals such a 'world model' is thought to be located mainly in the cerebral cortex. The evolutionary success of the cortex may be due to the large degree of adaptability of this working model: it can solve a far wider range of problems (such as driving a car, playing a piano, or reading a book) than those that fueled its rapid expansion during phylogeny. It is unlikely that we will be able to understand the most sophisticated higher cognitive functions without first deciphering the fundamental principles of the ways in which sensory information is coded and represented in the brain.

Although the cortex is divided into functionally specialised areas, its anatomical structure is characteristically uniform (Hubel & Wiesel, 1974; Rockel, Hiorns & Powell, 1980). The fact that the neocortex has expanded so rapidly during phylogeny suggests that its wiring is relatively easy to replicate, and that this expansion requires surprisingly few additional genetic instructions. The cell types and the basic local connectivity are similar in different cortical areas, and differences between the areas are often quantitative rather than qualitative. For instance, the layers receiving sensory projections are thicker in sensory areas, while the layers from which most outputs originate are more pronounced in motor areas. The operation of a cortical area depends not only on its anatomy but also on its inputs, and the cortex shows uniformity in many aspects of its development as well. By artificially rerouting retinal projections in ferrets, cells in auditory cortex were found to be visually responsive, with orientation and direction selectivity similar to that of cells in area 17 (Sur, Garraghty & Roe, 1988). In neonatal hamsters with retinal projections rerouted to the somatosensory pathway (induced by early lesions), a percentage of cells in somatosensory cortex have responses to visual stimuli, with orientation and directional selectivity similar to those found in area 17 (Metin & Frost, 1989). Late embryonic rat visual cortex transplanted to somatosensory cortex was found to contain 'barrels', a structure normally unique to somatosensory cortex (Schlaggar & O'Leary, 1991). The main anatomically detectable difference between cortical areas seems to be that they receive inputs from and send outputs to different sets of other areas. There is a body of evidence suggesting that the same mechanisms of operation may be used, for example, in the visual and somatosensory cortices, or that the different areas at least share some operational principles (Gilbert, Bourgeois, Eckhorn, Goldman-Rakic, Jones, Krüger, et al., 1988).

If there is some common principle of operation behind the great variety of operations taking place in different cortical areas, what may this principle be? How does it relate to the task the brain has to carry out? How do we make sense of a particular stimulus when the number of possible patterns of stimulation is so large that we are very unlikely to ever experience the same pattern twice? How do we transform high dimensional patterns of, e.g., retinal intensity values into the symbolic representations that form an important part of our internal model of the environment? The survival of an animal (or of a robot, for that matter) is related to the objects and events of the environment rather than to the directly detectable physical signals, such as light intensity values at certain retinal locations. The role of sensory processing must therefore be to transform the patterns of neural activity caused by the detected physical signals into representations more closely related to the behaviourally relevant, meaningful entities and events of the environment, available for further stages of processing.

Classical and instrumental conditioning, where associations are formed between stimuli, or between a stimulus and a response, assume that the behaviourally significant stimuli can be detected. Models of Pavlovian conditioning (Dickinson, 1980; Mackintosh, 1983; Pavlov, 1927; Rescorla & Wagner, 1972; Sutton & Barto, 1981) explain how an association is formed between the representation of the conditioned stimulus and that of the unconditioned stimulus or the response, but they rely on 'labelled line' representations, consisting of pathways carrying signals labelled e.g. 'CS1' and 'CS2' for 'conditioned stimuli' and 'US' for 'unconditioned stimulus'; they do not specify how these labelled signals are formed. In lower animals, simple physical features of the stimulus, such as a plain tone or light, may be detected by simple, genetically specified mechanisms. More frequently, however, the appropriate behaviour will depend on complex features of a situation, e.g. on the detection of an object or of the face of a particular individual, and conditioning cannot then be based on such pre-wired labelled lines at all (Tesauro, 1990). This raises the question of what pattern of activity should be chosen to represent a particular stimulus. The raw input signals must first be transformed into representations of the behaviourally relevant entities of the environment, in terms of meaningful categories, before they can be used for conditioning; associations to these objects and to other biologically significant events, and causal relationships between entities of the environment, can then be established.

The information providing evidence for the existence of an object in an image, for instance, is not tied to the activation of any particular photoreceptor; its presence can only be inferred by looking for the complicated interdependencies between spatially distributed patterns of activation of the retina. Object detection is therefore fundamentally a task involving the detection of statistical regularities of the incoming signal caused by objects, and the visual system should be looking for such statistical structure.
1.2. Regularities in natural stimuli and redundancy

Sensory stimuli from the natural environment are far from random: there is a large amount of statistical redundancy introduced by different kinds of regularities (Attneave, 1954; Field, 1987). This fact is demonstrated, for example, by our ability to predict missing parts of natural images (Attneave, 1954; Kersten, 1987). Natural images occupy only a very small fraction of the space of all possible images, and we are therefore very unlikely to arrive at anything resembling a natural image just by generating random dot patterns. The following are some of the regularities present in natural images:

- Image intensity is almost everywhere continuous: points close to each other within small regions of an image tend to have similar intensity values. This is a second order statistical regularity, as the joint distribution of the intensities of pairs of points is different from the product of the distributions of the single points, and it is characterised by the autocorrelation function. Other properties, such as colour, texture and stereoscopic depth, also tend to be similar within small regions of an image.

- Images contain edges and lines. This is a third order statistical regularity, expressed by the fact that the probability of any three points being equally active is greater if the three points are colinear than if they are not.

- As objects move with finite speed, the image of an object at one moment is likely to be close to its position a moment earlier. Most local parameters of images therefore tend to have non-zero autocorrelation in time.

- Most objects are approximately rigid, which causes their parts to move in the same direction.

- Lines and edges are not positioned randomly in space; their position is determined by the shape of the objects giving rise to them. Objects can cause several of the above properties, as well as more abstract ones, to appear together significantly above chance level.

- Objects tend to have surfaces of nearly constant reflectance.

These are some of the causes of redundancy in natural images, and similar regularities can be found in other modalities, such as hearing, taste, smell and touch. An important property of all these regularities, also called 'sensory cliches' or 'feature constellations' (Barlow, 1972; Barlow, 1990a), is that the component events are statistically not independent: the probability of the joint event (A&B) is significantly greater than the product of the probabilities of the component events (A, B):

P(A&B) >> P(A) P(B).

Notice that this expression is just the opposite of the defining condition of statistical independence ( P(A&B) = P(A) P(B) ). It is these non-independent, redundant events, the 'suspicious coincidences', that define objects, features, concepts and categories (Barlow, 1959; Barlow, 1961b; Barlow, 1969b; Barlow, 1972; Watanabe, 1960; Watanabe, 1985). For instance, the existence of dogs causes 'dog-features' (such as 'furry', 'has tail', 'barks', 'shorter-than-a-metre', etc.) to appear together much more often than if these properties were independent. The defining features, the 'suspicious coincidences', are what we should be detecting. In fact, the receptive field properties of sensory neurons are well matched to such redundant patterns of stimulation.
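A minimal sketch of how the 'suspicious coincidence' criterion can be applied to data (Python with NumPy; the binary feature matrix, the 'dog' generator and the threshold are illustrative assumptions, not part of the original text): joint probabilities of feature pairs are estimated and compared with the product of the marginals.

import numpy as np

def suspicious_coincidences(X, threshold=3.0):
    """X: (n_samples, n_features) binary array of feature occurrences.
    Returns pairs (i, j) for which P(A&B) > threshold * P(A) * P(B)."""
    n = X.shape[0]
    p = X.mean(axis=0)                     # marginal probabilities P(A)
    joint = (X.T @ X) / n                  # joint probabilities P(A&B)
    ratio = joint / np.outer(p, p)         # P(A&B) / (P(A) P(B))
    pairs = []
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if ratio[i, j] > threshold:
                pairs.append((i, j, ratio[i, j]))
    return pairs

# Toy example: features 0 and 1 ('furry', 'has tail') co-occur whenever a
# 'dog' is present; feature 2 is an independent background feature.
rng = np.random.default_rng(0)
dog = rng.random(10000) < 0.1
X = np.column_stack([dog | (rng.random(10000) < 0.05),
                     dog | (rng.random(10000) < 0.05),
                     rng.random(10000) < 0.1]).astype(float)
print(suspicious_coincidences(X))          # reports the (0, 1) pair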
Orientation specificity is the most prominent feature of cells in the primary visual cortex; such cells are good detectors of edges and lines. Directional and speed selective cells are good detectors of the redundancy caused by object motion. Disparity and curvature selectivity also correspond to redundant image features. Some cells in the visual cortex show selectivity to patterns growing or diminishing in size, others to patterns rotating in three dimensional space, and highly selective cells, e.g. face and hand detectors, correspond to redundant constellations of moving image components (Bruce, Desimone & Gross, 1981). These properties are well matched to the constraints imposed by the existence of objects in the environment.

The 'suspicious coincidence' detection principle can be applied at different levels of sensory processing: departures from randomness in the physical stimuli can be used to define useful primitive features, coincidences of these primitives can define appropriate object representations, whereas coincidences of objects with reinforcement can be used to derive conditioned stimuli (Barlow, 1991). While the components of objects can be highly correlated, separate objects are relatively independent of one another. Sub-patterns that are very highly correlated with each other, e.g. the right- and left-hand sides of faces, are usually not considered as separate objects. Objects could therefore be defined as conjunctions of highly correlated sets of components that are relatively independent from other such conjunctions.

The environment (or the system to be modelled, giving rise to the signal) can be considered as a parametric statistical model, with the state of the environment corresponding to the current parameter values. An adaptive encoder can use the redundancy of the signal to update its model of the environment, and to signal the difference between what is expected and what is unusual. This kind of model is not one that imitates the environment (e.g. Marr, 1970); it is rather an inverse of it, a recoding of the input in terms of relatively independent parameters that best describe the state of the environment, so that the redundancy of the recoded representation itself is reduced. Statistical independence between the representation units will therefore be considered as the goal of redundancy reduction. Two basic mechanisms, which may interact, will be considered here for redundancy reduction:

- Cells at the same level in the processing hierarchy may interact so as to reduce redundancy. A mechanism for removing correlations between cells at the same level is discussed in the next chapter.

- The pattern of synaptic connections between cells at different levels may be such that the cell at the higher level responds to a redundant pattern of activity of cells at the lower level. This kind of mechanism is discussed in subsequent chapters.
Chapter 2. Adaptive decorrelation

2.1. Lateral interactions in the cerebral cortex

In what way can units interact in order to reduce redundancy? Inhibition is known to play an important role in the brain, and it has been suggested that inhibitory mechanisms serve to reduce redundancy (Barlow, 1961a). The clearest example is the centre-surround, 'Mexican hat' organisation of retinal ganglion cell receptive fields. Such a receptive field organisation is well matched to the statistical structure of natural images: the resulting operation reduces redundancy by flattening or 'whitening' the power spectrum, that is by high-pass filtering the image, removing the pairwise, second-order correlations between ganglion cell activities. These centre-surround effects are caused by unspecific lateral inhibition; such a constant, non-specific pattern of inhibition, the strength of which depends only on the distance between the cells, may be set up relatively easily by ontogenetic mechanisms. Centre-surround receptive field organisation is present in other modalities as well: cells carrying information about touch over a certain region of the skin are inhibited by stimulation of the area surrounding that region.

More complicated interactions can be observed in the cerebral cortex. Several lines of evidence indicate that inhibition in the visual cortex is more specific than the unspecific type of inhibition found at lower levels, and depends on the pattern of stimulation and on the selective properties of the cells. Inhibition between orientation tuned cells is known to be maximal between cells tuned to similar orientations, and some experiments show that the strength of inhibition depends on the difference between the optimal orientations of the cells (Blakemore & Tobin, 1972; Bonds, 1989; DeAngelis, Robson, Ohzawa & Freeman, 1991; Ferster, 1986; Fries, Albus & Creutzfeldt, 1977; Nelson & Frost, 1978). Cross-correlation techniques also reveal inhibition between cells with the same or similar orientation tuning, or between cells whose optimal orientations differed by not more than 30 degrees (Toyama, Kimura & Tanaka, 1981).

Direction and velocity tuned cells found in the middle temporal visual area of the cerebral cortex respond most effectively to local motion within the classical receptive field and have an antagonistic surround. In many of these cells, motion in the surround reduces the response most strongly when the direction and velocity of the surround motion are identical to those of the centre (Allman, Miezin & McGuinness, 1985). These cells therefore respond to local, 'unexpected' motion, but not to global motion of extended textures or of the entire image caused by the common motion of the background or by the viewer's own movement. Such interactions reduce redundancy, and these results seem to suggest a general computational strategy: neurons in the cortex that are tuned to similar stimulus parameters and located within a small distance from each other inhibit each other more than other cells in the same region.

This raises the question of how the neural interactions that determine the highly selective response properties of cortical cells can themselves be specified. Can the precise pattern of interaction between the cells be specified purely genetically, without the need for activity-dependent mechanisms? This seems unlikely, as the classical response properties themselves are greatly influenced by experience, and it would be impossible to genetically specify highly specific interactions between cells whose response properties have not even been fully determined. Further evidence comes from experiments in which the effect of the main inhibitory transmitter, GABA, is blocked by the application of bicuculline (Sillito, 1984). The response properties of cortical cells are drastically altered in the absence of GABA-mediated inhibition: cells lose their discrete 'on' and 'off' subregions, simple cells come to resemble complex cells, cells lose their direction and orientation selectivity, and some also change their ocular dominance. These results suggest that rather than playing just a tonic role as a uniform regulator of cortical excitability, inhibition in the cortex is cell specific and stimulus-specific, and it plays an important role in the determination of neuronal response properties.

Many neural network models, such as the multi-layer perceptron, assign the task of learning purely to the 'feed-forward' connections and neglect lateral interactions. In competitive learning models (Feldman & Ballard, 1982; Grossberg, 1976b; Kohonen, 1982; Lippmann, 1987; Rolls, 1989b), lateral connections are considered, for instance when they perform 'normalisation', 'contrast enhancement' or 'winner selection', but they are often assigned constant, unmodifiable values. The constant values of these recurrent connections would store no information about the features of the input. One of the most prominent features of the anatomy of cortical connections, however, is the abundance of recurrent lateral connections: anatomists find that thalamic axons contribute no more than about 20% of the excitatory synapses even in layer 4 of the visual cortex, the main layer of termination of the thalamic projections, the majority of synapses coming from other cortical cells in the form of recurrent local collateral projections, the extent of which has been revealed by intracellular staining (White, 1989). An alternative hypothesis, explored in the following sections, is that these abundant recurrent connections are themselves modifiable by synaptic plasticity.
2.2. Decorrelation by anti-Hebbian learning

The conventional Hebbian learning rule, as formulated by Donald Hebb (Hebb, 1949), states that synapses that are successful in exciting the postsynaptic cell become more effective at exciting that cell in the future. If yi and yj denote the activities of the pre- and postsynaptic cells, and wij the effectiveness of the synapse between them, then some version of the conventional Hebb rule could be formulated as: Δwij = α yi yj, where α is a small positive constant determining the rate of learning. The rule studied here is the opposite of that suggested by Hebb. It follows the form of Barlow's 'law of repulsion' (Barlow, 1990b), which states that a 'repulsion' builds up between units whose activities are correlated, and it is also similar to the rule used in Kohonen's 'novelty filter' (Kohonen & Oja, 1976), the relation to which is discussed in a later section. Analogously to the Hebb rule, the simplest mathematical form of the repulsion law would be (Barlow & Földiák, 1989):

Δwij = - α yi yj ,   if i ≠ j,      (1)

(or possibly dwij/dt = - α yi yj in continuous time), where yi and yj denote the activities of the two cells, and wij the net synaptic weight between them. This is similar to Hebb's rule in that the weight change is proportional to the pre- and postsynaptic activities, but its direction is opposite: simultaneous pre- and postsynaptic activity results in a net decrease in the value of the synaptic weight. Such a net decrease could either be due to an increase in the effectiveness of inhibitory synapses, or to a decrease in the effectiveness of excitatory ones. It is easy to see that if the two variables were originally positively correlated, then negative weights will gradually build up between the corresponding units, making it harder for the units to fire together, thereby reducing, and eventually eliminating, the correlation. This is why this rule could also be called the 'decorrelating' learning rule.
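A minimal sketch of equation 1 as a single update step (Python with NumPy; the learning rate and the choice of collecting the updates in a weight matrix with a zero diagonal are illustrative assumptions):

import numpy as np

def anti_hebb_step(W, y, alpha=0.01):
    """One application of the decorrelating rule (equation 1):
    delta w_ij = -alpha * y_i * y_j for i != j; no self-connections."""
    dW = -alpha * np.outer(y, y)
    np.fill_diagonal(dW, 0.0)
    return W + dW

Applied repeatedly to the activities of the feedback network introduced in the next section, steps of this form drive the weights towards values that remove the correlations between the outputs.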
2.3. A formal model

Let us consider a set of n simple, linear processing units. Each of the units receives external inputs, as well as feedback from the other units. There is a single number or 'weight' associated with each connection. The unit multiplies each input by the corresponding weight and adds these values.

Figure 1. [Model 1: n units with external inputs x, outputs y, and feedback connections between the units.]

Model 1

Figure 1 shows an arrangement in which each unit receives only one external input, as well as feedback from the other units. The simplest model of the activity dynamics is:

τ dyi/dt = - yi + xi + Σj wij yj   (j = 1, ..., n),

where x is held constant for each input pattern. The above differential equation has a stable equilibrium if I - W is positive definite. At equilibrium:

yi = xi + Σj wij yj ,   or   y = x + W y ,   so   y = (I - W)^-1 x.

Let T = (I - W)^-1, and let Cxx be the correlation matrix of the input ([Cxx]ij = <xi xj>); then the correlation matrix of the output is

Cyy = <y y^T> = T Cxx T^T,

where < > denotes taking the expected value over the distribution of input patterns, and superscript T denotes the matrix transpose.

Equation 1 gives the 'on-line' version of the decorrelating learning rule, in which the input signal is treated as a succession of pattern presentations, and the weights are modified after the network has reached a stable state for each input pattern. In the following calculations we will avoid having to deal with the effect of individual input vectors on the weights by using the 'off-line' version of the learning rule, replacing the right-hand side of equation 1 by its expected value. This will have no effect on the result if the environment is ergodic and the step size is chosen small enough. The anti-Hebbian rule thus takes the form:

Δwij = - α <yi yj> ,   if i ≠ j,   or   ΔW = - α offdiag(Cyy),

where offdiag() is an operator that sets the diagonal elements of a matrix to zero. (Alternatively, in continuous time, dW/dt = - α offdiag(Cyy).)
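The equilibrium response and the off-line update can be written down directly; the following is a minimal Python (NumPy) sketch, assuming a weight matrix W with zero diagonal and a given input correlation matrix Cxx (the learning rate is an illustrative value):

import numpy as np

def equilibrium_outputs(W, x):
    """Equilibrium of the linear feedback network: y = (I - W)^-1 x."""
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - W, x)

def output_correlation(W, Cxx):
    """Cyy = T Cxx T^T with T = (I - W)^-1."""
    T = np.linalg.inv(np.eye(W.shape[0]) - W)
    return T @ Cxx @ T.T

def offline_update(W, Cxx, alpha=0.1):
    """One step of the off-line rule: delta W = -alpha * offdiag(Cyy)."""
    Cyy = output_correlation(W, Cxx)
    dW = -alpha * Cyy
    np.fill_diagonal(dW, 0.0)              # diagonal is left unchanged (no self-connections)
    return W + dW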
Let us consider two units with symmetric weights (w) between them, and input correlation matrix

Cxx = ( σ1²     ρσ1σ2 )
      ( ρσ1σ2   σ2²   ),

where ρ is the correlation coefficient. Then

T = (I - W)^-1 = 1/(1 - w²) ( 1  w )
                            ( w  1 ),

Cyy = 1/(1 - w²)² ( σ1² + 2wρσ1σ2 + w²σ2²           ρσ1σ2 (w² + 1) + (σ1² + σ2²) w )
                  ( ρσ1σ2 (w² + 1) + (σ1² + σ2²) w   w²σ1² + 2wρσ1σ2 + σ2²          ),

so the anti-Hebb rule reaches equilibrium (Δw12 = 0) when the cross-correlation term

c(w) = ρσ1σ2 w² + (σ1² + σ2²) w + ρσ1σ2 = 0,

which gives the fixed points

wf = ( -(σ1² + σ2²) ± sqrt( (σ1² + σ2²)² - 4ρ²σ1²σ2² ) ) / (2ρσ1σ2),

or, when σ1 = σ2 = σ,

c(w) = ρσ² w² + 2σ² w + ρσ² = 0,
wf = ( -1 ± sqrt(1 - ρ²) ) / ρ   if ρ ≠ 0,   and wf = 0 if ρ = 0.

The stability of the fixed points is determined by the sign of the derivative c'(w) at the fixed points:

c'(w) | w=wf = 2σ² (ρ wf + 1) = ± 2σ² sqrt(1 - ρ²),

implying that only wf = ( -1 + sqrt(1 - ρ²) ) / ρ is a stable fixed point of the weight dynamics.
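A quick numerical check of this analysis (a Python/NumPy sketch; the learning rate, the number of pattern presentations, the random seed and ρ = 0.975 are illustrative assumptions): simulating the on-line rule with a symmetric feedback weight drives w towards the stable fixed point wf = (-1 + sqrt(1 - ρ²))/ρ ≈ -0.8, the value used in figure 3.

import numpy as np

rng = np.random.default_rng(1)
rho, alpha, w = 0.975, 0.01, 0.0
Cxx = np.array([[1.0, rho], [rho, 1.0]])
L = np.linalg.cholesky(Cxx)

for _ in range(20000):
    x = L @ rng.standard_normal(2)              # correlated zero-mean input pair
    W = np.array([[0.0, w], [w, 0.0]])
    y = np.linalg.solve(np.eye(2) - W, x)       # equilibrium response of the network
    w += -alpha * y[0] * y[1]                   # on-line anti-Hebbian update (equation 1)

print(w, (-1 + np.sqrt(1 - rho**2)) / rho)      # both close to -0.8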
The feedback weight necessary to remove the correlation is plotted as a function of the input correlation coefficient in figure 2.

Figure 2. Lateral feedback weight between two units receiving inputs with correlation coefficient ρ and equal mean squared values, plotted as a function of ρ.

Figure 3 shows the effect of the anti-Hebbian feedback between two zero mean Gaussian variables with correlation matrix

Cxx = ( 1      0.975 )
      ( 0.975  1     ),

for which wf = -0.8. Besides removing the correlation between the variables, the lateral connections have the side effect of decreasing the variance of the individual units.

Figure 3. (a) 500 samples from the correlated normal distribution; (b) the output of the network, with the correlations between the variables removed by the feedback connections.
2.4. Decorrelation between several variables

To demonstrate the effect of the anti-Hebbian rule between more than two variables, the following iteration was carried out:

Cyy(t) = (I - W(t))^-1 Cxx ((I - W(t))^-1)^T
W(t+1) = W(t) - α offdiag(Cyy(t)),

where W, Cxx, Cyy and I are n x n matrices. This iteration could be interpreted as the numerical integration of the continuous form of the adaptation rule by the Euler method, and a better integration method, such as the Runge-Kutta method, could be applied. However, the modelled adaptation process itself is more naturally expressed in discrete time. Notice that linear decorrelation is driven only by the pairwise, second order statistics, i.e. the correlation matrix of the input.
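The simulation reported in figure 4 can be reproduced along the following lines (a Python/NumPy sketch; n, α, the number of iterations and the random seed are illustrative choices). Random input correlation matrices are generated as Cxx = L Λ L^T from a random orthogonal matrix L and exponentially distributed eigenvalues, as described below, and the Euclidean norm of offdiag(Cyy) is tracked over the iterations.

import numpy as np

def random_correlation_matrix(n, rng):
    """Cxx = L Lambda L^T: L random orthogonal (QR factor of a Gaussian matrix,
    used here in place of the Gram-Schmidt procedure), eigenvalues exponential."""
    L, _ = np.linalg.qr(rng.standard_normal((n, n)))
    lam = rng.exponential(1.0, size=n)
    return L @ np.diag(lam) @ L.T

def decorrelate(Cxx, alpha=0.1, steps=20):
    """Off-line anti-Hebbian iteration; returns ||offdiag(Cyy)|| at each step."""
    n = Cxx.shape[0]
    W = np.zeros((n, n))
    norms = []
    for _ in range(steps):
        T = np.linalg.inv(np.eye(n) - W)
        Cyy = T @ Cxx @ T.T
        off = Cyy - np.diag(np.diag(Cyy))
        norms.append(np.sqrt((off ** 2).sum()))
        W = W - alpha * off
    return norms

rng = np.random.default_rng(0)
print(decorrelate(random_correlation_matrix(10, rng)))   # the norm decays towards zero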
Figure 4. Logarithm of the Euclidean norm of the off-diagonal elements of the output correlation matrix as a function of the number of iterations (t). Average, standard deviations, minimal and maximal values over a set of 100 randomly generated input correlation matrices. (n = 10, α = 0.1)

The result of the simulation for 10 units is shown in figure 4. The quantity plotted is ||offdiag(Cyy(t))||, where ||C|| = sqrt(Σij Cij²) is the Euclidean norm of a matrix. The average of this norm over a set of 100 randomly generated input correlation matrices is plotted on a logarithmic scale as a function of the number of iterations (t). The random input correlation matrices were generated as Cxx = L Λ L^T, where L is a random orthogonal matrix, generated by Gram-Schmidt orthogonalizing a matrix containing random numbers from a standard normal distribution, and Λ is a diagonal matrix containing the eigenvalues. As correlation matrices have to be positive semidefinite, the eigenvalues were taken from an exponential distribution of parameter 1, so in the examples given here Cxx always had full rank. If some of the eigenvalues are chosen to be 0, the variance of a corresponding number of outputs goes to 0.

The algorithm scales well with network size: the slope of the adaptation curve does not significantly change with the number of units in the network, as illustrated in figure 5.

Figure 5. Adaptation as a function of the number of units in Model 1 (n = 2, 5, 10, 20, 50, 100). Each curve is an average over 10 random input correlation matrices. (α = 0.01)
2.5. Decorrelation and single unit adaptation

As demonstrated in the previous section, the anti-Hebbian adaptation rule successfully removes correlations between several variables, i.e. it eliminates the off-diagonal elements of the output correlation matrix. In Model 1, however, the diagonal elements of the correlation matrix, i.e. the mean squared values of the units, are not constrained in any useful way. To adjust the range of actual values to the physical signalling range of the units, the conventional 'single-unit' adaptation or 'gain control' mechanism needs to be incorporated into the model. In the case of more than three variables this is necessary even if the input variables are normalised at earlier stages, as the interaction necessary to eliminate correlations between pairs of units may cause a reduction in the variances of the different units themselves. Models 2 and 3 are two alternative modifications of Model 1 incorporating single-unit adaptation.

Model 2

Model 2 is an analytically simple modification of Model 1 incorporating single-unit adaptation, in which the units are also allowed to receive input from their own output. The goal here is not only to eliminate the off-diagonal elements of the output correlation matrix, but also to make the diagonal elements approach the same value (unity, for simplicity). The output correlation matrix should therefore be equal to the identity matrix:

(I - W)^-1 Cxx ((I - W)^-1)^T = I.

The matrix W satisfying this equation can be calculated (for a symmetric W) as

W1 = I - Cxx^(1/2).

The symmetric square root Cxx^(1/2) always exists, as Cxx is positive semidefinite. While in Model 1 no self-connections were allowed, here the learning rule for the self-connections contains an additional constant term, Δwii = α (1 - <yi²>), causing the feedback gain to be adjusted so as to achieve unit mean square value:

W(0) = 0,
Cyy(t) = (I - W(t))^-1 Cxx ((I - W(t))^-1)^T,
W(t+1) = W(t) + α ( I - Cyy(t) ),

or dW/dt = α ( I - Cyy ) in continuous time. A convergence proof for this algorithm is given by Plumbley and Dayan (Dayan, 1991; Plumbley, 1991). Figure 6 shows the convergence of the correlation matrix of the outputs to the identity matrix. Convergence to the identity can only be expected here for full-rank input correlation matrices, as otherwise the condition of uncorrelatedness and unit mean square values of the n variables cannot be met.
Figure 6. Logarithm of the Euclidean norm of the difference between the correlation matrix of the outputs of Model 2 and the unit matrix, as a function of the number of iterations (t), for 100 random full-rank input correlation matrices. The mean square values of the outputs are normalised and the correlations between them are eliminated simultaneously. (α = 0.05)

W in Model 2 approaches the symmetric solution W1 = I - Cxx^(1/2) for a symmetric starting matrix W(0) (W(0) = 0 in this case), as demonstrated in figure 7. The symmetric square root was calculated as Cxx^(1/2) = L Λ^(1/2) L^-1, where Cxx = L Λ L^-1 and Λ is diagonal.

Figure 7. Average, standard deviation, minima and maxima of log10 ||W - W1|| over 10 runs, plotted as a function of t. (α = 0.05)
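Both the analytic solution W1 and the behaviour of the iterative rule can be checked numerically. The following Python/NumPy sketch (the test input, learning rate, step count and random seed are illustrative assumptions) computes the symmetric square root by eigendecomposition, verifies that W1 = I - Cxx^(1/2) makes the output correlation the identity, and runs the Model 2 iteration from W(0) = 0.

import numpy as np

def symmetric_sqrt(C):
    """Symmetric square root of a positive semidefinite matrix:
    C^(1/2) = L diag(sqrt(lambda)) L^T from the eigendecomposition of C."""
    lam, L = np.linalg.eigh(C)
    return L @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ L.T

def model2_iteration(Cxx, alpha=0.05, steps=200):
    """W(t+1) = W(t) + alpha (I - Cyy(t)); returns the final W."""
    n = Cxx.shape[0]
    W = np.zeros((n, n))
    for _ in range(steps):
        T = np.linalg.inv(np.eye(n) - W)
        Cyy = T @ Cxx @ T.T
        W = W + alpha * (np.eye(n) - Cyy)
    return W

rng = np.random.default_rng(0)
L, _ = np.linalg.qr(rng.standard_normal((10, 10)))
Cxx = L @ np.diag(rng.uniform(0.5, 2.0, 10)) @ L.T    # well-conditioned test input
W1 = np.eye(10) - symmetric_sqrt(Cxx)                  # analytic solution of Model 2
T = np.linalg.inv(np.eye(10) - W1)
print(np.allclose(T @ Cxx @ T.T, np.eye(10)))          # True: Cyy = I
print(np.abs(model2_iteration(Cxx) - W1).max())        # small: the iteration approaches W1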
Model 3

An alternative to Model 2 is one in which the gain is adjusted not by self-connections, but by explicitly multiplying each output by a gain factor (gi). The outputs are expressed as:

yi = gi ( xi + Σj wij yj ),

and therefore

y = (I - G W)^-1 G x,

where G is a diagonal matrix containing the gains (gi) of the individual units as its diagonal elements. No self-connections are allowed here, as in Model 1. The update equations are:

W(0) = 0,
Cyy(t) = (I - G(t) W(t))^-1 G(t) Cxx ((I - G(t) W(t))^-1 G(t))^T,
W(t+1) = W(t) - α offdiag(Cyy(t)),
G(t+1) = G(t) + γ diag( I - Cyy(t) ),

where diag() sets all non-diagonal elements of a matrix to 0.
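Model 3 can be sketched in the same style (Python/NumPy; α, γ, the step count, the initial unit gains and the test input are assumed, illustrative values): the lateral weights follow the anti-Hebbian rule while a separate gain for each unit is adapted towards unit output variance.

import numpy as np

def model3(Cxx, alpha=0.05, gamma=0.1, steps=300):
    """Model 3: anti-Hebbian lateral weights plus explicit multiplicative gains.
    Returns the final output correlation matrix Cyy (it should approach I)."""
    n = Cxx.shape[0]
    W = np.zeros((n, n))
    g = np.ones(n)                                   # gains of the individual units
    for _ in range(steps):
        G = np.diag(g)
        T = np.linalg.inv(np.eye(n) - G @ W) @ G     # y = (I - GW)^-1 G x
        Cyy = T @ Cxx @ T.T
        off = Cyy - np.diag(np.diag(Cyy))
        W -= alpha * off                             # decorrelating update
        g += gamma * (1.0 - np.diag(Cyy))            # gain adaptation towards unit variance
    return Cyy

rng = np.random.default_rng(0)
L, _ = np.linalg.qr(rng.standard_normal((10, 10)))
Cxx = L @ np.diag(rng.uniform(0.5, 2.0, 10)) @ L.T
print(np.round(model3(Cxx), 2))                      # close to the identity matrix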
Figure 8. Logarithm of the Euclidean norm of the difference between the correlation matrix of the outputs of Model 3 and the unit matrix, as a function of the number of iterations (t). (α = 0.05, γ = 0.1)

Figure 8 shows the convergence of the output correlation matrix to the identity matrix, while figure 9 shows the convergence of the overall transfer matrix T = (I - G(t) W(t))^-1 G(t) to Cxx^(-1/2).

Figure 9. Logarithm of the average norm of the difference between the overall transfer matrix T and Cxx^(-1/2), plotted as a function of the number of iterations, over 10 random input correlation matrices. (n = 10, α = 0.05, γ = 0.1)
2.6. Relation to Kohonen's 'novelty filter'

The network and modification rules described here are similar to those of the 'novelty filter' proposed by Kohonen and Oja (Kohonen, 1984; Kohonen & Oja, 1976). The novelty filter is trained on a smaller input set containing a number of patterns, to each of which its 'habituating' connections gradually adapt; it then projects its input into the subspace complementary to the one spanned by the training set, so that the output goes to zero for any 'familiar' input pattern. Necessarily, the filter can only work in this way for a training set smaller than the number of units in the network, and only if the training patterns are linearly independent; for a larger training set, the outputs would tend to zero for any input pattern. One minor difference is that in the models described here self-connections are absent; in the novelty filter, self-connections, for instance, would not reach a stable value for any positive signal variance.

Decorrelating networks, in contrast, are exposed to a large, potentially infinite ensemble of inputs. Instead of learning a specific set of patterns, they learn the average, 'expected' statistical relations, i.e. the covariance matrix of the input ensemble. Unlike the novelty filter, a decorrelating network does not have separate training and operational modes: the adaptation does not have well defined training phases, and training can proceed in parallel with the decorrelating operation itself, tracking changes in the statistics of the input.
2.7. Information theoretic properties

Neurons are not ideal signalling devices. They are affected by noise, and the spike-rate or pulse-code signalling itself has inherent quantisation error: four reliably distinguishable activity levels may even be an optimistic estimate of the signalling range of cortical neurons (Barlow, Kaushal, Hawken & Parker, 1987). It is therefore vital to utilise the available signalling space optimally. The method of adjusting an amplification or gain factor to match the range of the signal to the range of the signalling device, known in engineering as automatic gain control, is used in the sensory system as well. One of the earliest steps of image coding where such adaptation is known to take place is light adaptation in the retina (Barlow, 1969a; Laughlin, 1989). This step is necessary because luminance values in the environment vary over several orders of magnitude, e.g. between a sunny and a night scene, while the response range of neurons is strictly limited. The first step is to adjust the middle of the response range to the mean luminance present in the image; the second step is to match the width of the response range to the width or variance of the distribution of luminance values, by adjusting a gain factor. The goal of such adaptation is to maximise the signal to noise ratio while avoiding saturation. A similar phenomenon, known as contrast adaptation, is present in neurons of the primary visual cortex of cats and monkeys (Maffei, Fiorentini & Bisti, 1973; Movshon & Lennie, 1979; Ohzawa, Sclar & Freeman, 1985; Sclar, Lennie & DePriest, 1989; Vautin & Berkeley, 1977).

However, even if each neuron individually adapts its response range to the range of its input, there are other ways in which the available representational capacity can be wasted. Suppose that, for whatever reason, two neurons always respond together, so that their activities are correlated. If each of the two neurons can reliably discriminate four activity levels, the joint response space of the two neurons could be in any of 16 different states, and the most information would be gained by observing the pair if each of these 16 states occurred with approximately equal probabilities. When the two activities are very nearly always correlated, however, only the states on or near the diagonal of the response space are used (figure 10), and a fraction of the discriminable states is wasted, even when each single unit is in itself optimally adapted. As automatic gain control, or 'single-unit' adaptation, only matches the response range of each individual neuron to the range of values of a single variable, a different adaptation mechanism is needed to adapt to the correlations between the neurons: decorrelation.

Figure 10. Correlated responses of two neurons: only the states on or near the diagonal of the joint response space are used.

A decorrelated signal has optimal resistance to noise on the output, in that it maximises the mutual information between the input and the noisy output.
A communication channel is characterized by the mutual information I(A,B) between its input, A, and its output, B. If the n-dimensional continuous input and output of the network have probability density functions pA(x) and pB(y), respectively, and joint probability density pA,B(x,y), then I(A;B) is defined as:

I(A;B) = H(A) + H(B) − H(A,B) = H(A) − H(A|B) = H(B) − H(B|A),

where

H(A) = − ∫ pA(x) log(pA(x)) dx,    H(B) = − ∫ pB(y) log(pB(y)) dy

are the entropies of the input and the output, respectively, and

H(A,B) = − ∫∫ pA,B(x,y) log(pA,B(x,y)) dx dy,
H(A|B) = − ∫∫ pA,B(x,y) log(pA|B(x|y)) dx dy,
H(B|A) = − ∫∫ pA,B(x,y) log(pB|A(y|x)) dx dy

are the joint and conditional entropies (Jones, 1979; Shannon & Weaver, 1949). The entropy of an n-dimensional normal distribution with covariance matrix C is

H = ½ log((2πe)^n det(C)).

The entropy is a measure of uncertainty, and the conditional entropy H(A|B) is the amount of uncertainty about the input that remains after having observed the output; it can be non-zero because of noise in the channel. In the case of the network, to maximise the mutual information we should therefore maximise the decrease in our uncertainty about the input brought about by observing the noisy output.
Figure 11. Additive noise on the network output.

Assuming that the signal on the output of the network can be approximated by a normal distribution with covariance matrix Cyy, and that the uncorrelated, equal variance noise added to the output has covariance matrix CNN (figure 11), the mutual information is given by

I(X; Y+N) = H(Y+N) − H(Y+N | X) = H(Y+N) − H(N)
          = ½ log((2πe)^n det(Cyy + CNN)) − ½ log((2πe)^n det(CNN)).

As CNN is constant, the information is maximised by maximising det(Cyy). The finite dynamic range of the units corresponds to a constrained variance, [Cyy]ii = 1; a constraint on the total power (Σi [Cyy]ii) would result in a similar condition. Wegner's theorem states that for any positive definite matrix A, det(A) ≤ Πi Aii, with equality only when A is diagonal (Bodewig, 1956). A decorrelated, normalised signal (Cyy = I) therefore maximises det(Cyy), and with it the mutual information transmitted in the presence of noise. Linsker's suggestion that above some noise level it is better to have correlated signals differs from the result reported here, as he uses a less plausible constraint: instead of constraining the output signal variance, he constrains the length of the weight vectors (Linsker, 1988).
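These quantities are easy to check numerically. The short sketch below (my own illustration, not part of the thesis) computes the Gaussian channel information ½ log(det(Cyy + CNN)/det(CNN)) for a correlated and a decorrelated output with the same unit variances; the particular covariance values and noise level are arbitrary.

    import numpy as np

    def gaussian_channel_info(Cyy, noise_var=0.1):
        """Mutual information (nats) for a Gaussian signal with covariance Cyy
        and independent additive output noise of variance noise_var."""
        n = Cyy.shape[0]
        Cnn = noise_var * np.eye(n)
        return 0.5 * (np.log(np.linalg.det(Cyy + Cnn)) - np.log(np.linalg.det(Cnn)))

    C_corr = np.array([[1.0, 0.8],
                       [0.8, 1.0]])    # correlated, unit-variance outputs
    C_decorr = np.eye(2)               # decorrelated, unit-variance outputs

    print(gaussian_channel_info(C_corr))    # smaller
    print(gaussian_channel_info(C_decorr))  # larger: det(Cyy) is maximal when Cyy = I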
2.8. Biological relevance
The above models make several assumptions that have no direct biological interpretation and that depart from the known physiology and anatomy of the cortex. Units were allowed to take both positive and negative output values, and the variables were assumed to be Gaussian and centred on zero. As these are properties not observed in real neurons, only a more indirect interpretation is possible: a negative value of a unit could, for instance, be interpreted as the firing rate of a neuron below its spontaneous rate, or each unit could be represented by a complementary pair of neurons signalling excitation and inhibition. The units were furthermore assumed to be linear. On the whole, these models should be considered a simple demonstration of the principle of anti-Hebbian decorrelation rather than realistic models of cortical adaptation; their main justification is that they are analytically tractable and have the desired correlational properties. A more plausible, nonlinear version of the network will be described in Chapter 4.

Models 2 and 3 differ in the way gain control is implemented in them. Model 2 has an analytically simpler form, though it is not a biologically plausible mechanism: it specifies self-connections governed by a rule different from that governing all other connections, and there is no evidence at all for such 'habituating' self-connections on cortical neurons. Model 3, which specifies a mechanism of adaptation within single units, is a more likely possibility; functionally it is very similar to Model 2, and a corresponding form of single-unit adaptation is indicated by contrast adaptation in the primary visual cortex (Sclar et al., 1989).

The fundamental assumption, anti-Hebbian modification of the connections between output neurons, raises several questions. The recurrent connections between cortical pyramidal cells are excitatory, so a decrease in the effectiveness of the interaction between two such cells would have to be implemented through inhibitory interneurons.
As only 15-25% of cortical synapses are inhibitory (Beaulieu & Somogyi, 1990; Gabbott, Martin & Whitteridge, 1987; Kisvárday, Martin, Freund, Maglóczky, Whitteridge & Somogyi, 1986), it is not obvious how the required anti-Hebbian inhibition between pairs of excitatory cells could be implemented. One elegant possibility was proposed by Baddeley and Plumbley (Baddeley, 1991; Plumbley, 1991), in which the inhibitory interneurons themselves are coupled to the excitatory cells by Hebbian modification: each interneuron would pick up the strongest remaining correlation among the excitatory cells (the eigenvector of largest variance) and inhibit those same cells, so that a relatively small number of interneurons would be enough. Another alternative is that the net decrease in the effectiveness of an excitatory connection is achieved not by weakening the excitatory synapse itself, but by strengthening an inhibitory synapse in its vicinity; this does not require an inhibitory synapse to be uniquely assigned to each excitatory one. Anatomical evidence also suggests that neighbouring cortical neurons form more tightly coupled groups, so the anti-Hebbian interaction may take place not between individual neurons but between such groups of cells.

The only direct evidence for an activity-dependent synaptic modification of the anti-Hebbian kind comes from the cerebellum (Ito, 1989; Ito & Kano, 1982), where the nearly simultaneous arrival of signals from a climbing fibre and from some parallel fibres to a Purkinje cell leads to a long-lasting depression of transmission from those parallel fibres to the Purkinje cell. This does not fully match the modification assumed here, as the depression takes place on excitatory synapses and the interaction is mediated by a different input fibre type; nevertheless it demonstrates that anti-Hebbian learning does in fact exist in the brain. There is also related evidence from the retina, where the inhibitory surround of the centre-surround receptive field structure, the main mechanism mediating decorrelation there, is present only at higher light levels (Barlow, FitzHugh & Kuffler, 1957), as would be expected if the inhibition is adjusted to eliminate redundancy only when the signal is reliable.
Is it reasonable to assume that the firing of cortical neurons is uncorrelated with the firing of other neurons nearby? Initially the opposite would seem to be the case. The majority (up to 85%) of the excitatory synapses that cortical neurons receive derive from the recurrent collaterals of other cortical neurons, and anti-Hebbian synapses on these connections would be very effective in reducing the large amount of positive feedback they generate. It is known, for instance, that orientation tuned neurons in the primary visual cortex are arranged in maps where nearby cells tend to have similar orientation preference ('columns') (Hubel & Wiesel, 1962). Stimulating with simple stimuli, such as sinusoidal gratings, in the usual experimental paradigm is a highly redundant form of stimulation and may indeed give rise to large populations of neurons responding together in a highly correlated way. However, neurons in the visual cortex show selectivity to several parameters of the stimulus, such as position, orientation, spatial frequency, phase, direction of motion, stereoscopic disparity, ocularity and size, and they are tuned to a relatively narrow band of each; a neuron responds optimally only when each of these parameters is within the restricted range characteristic of that particular cell. Outside such experiments, in an animal behaving in its natural environment, all these stimulus parameters vary in complex ways. Under these circumstances individual cells may be only slightly correlated, or not correlated at all, with cells that share their tuning along only one dimension of the stimulus. To fully represent the stimulus, the space defined by the selective properties must be covered, and as this space is very large, the visual system cannot afford to use a highly redundant representation.

The assumption of full connectivity within a local region of cortical tissue also needs qualification: the interactions are limited by the number of synapses on a neuron and by the size of the axonal and dendritic trees. This limitation would be even stronger if the anti-Hebbian interactions were mediated by inhibitory interneurons, whose projections are thought to be more limited in space than those of excitatory neurons. Full decorrelation would therefore only be an appropriate approximation within a small local region; decorrelation between variables represented at greater distances from each other in the primary visual cortex could be achieved in secondary visual areas, where the spatial mapping of response properties is organised according to different principles (Barlow, 1986).
2.9. Pattern adaptation and aftereffects

Anti-Hebbian recurrent adaptation may account for some of the psychophysical after-effects of adaptation to patterned stimuli. A whole variety of such adaptation phenomena and aftereffects can be observed psychophysically, wherever one looks for them in the visual system: the orientation ('tilt') aftereffect, the motion ('waterfall') aftereffect, spatial frequency adaptation, curvature adaptation, colour contingent aftereffects (e.g., the McCollough effect) and many others (Barlow, 1990b; Harris, 1980; Held, 1980; Mollon, 1977; Stromeyer; Sutherland, 1961).

Such aftereffects are conventionally explained by the 'fatigue' of selective neurons or channels: a passive decrease of the sensitivity of the intensely stimulated populations of cells. The perceived tilt of a vertical line to the left after an exposure to lines slightly tilted to the right, for instance, would be explained by the imbalance between populations of orientation selective neurons caused by the decreased sensitivity of the population tuned just to the right of vertical (Blakemore, 1973). Explanations involving the modification of inhibitory interactions between channels have also been proposed (Blakemore, Carpenter & Georgeson, 1971; Dealey & Tolhurst, 1974; Tolhurst, 1972; Wilson, 1975) as an explanation of spatial frequency adaptation.

A different account of pattern adaptation was proposed by Barlow and Földiák (Barlow & Foldiak, 1989). According to this account, adaptation is not just 'fatigue' or a gain adjustment of individual neurons, but an active process by which the interpretative mechanism tracks the statistics of the sensory environment: the relationships and contingencies between variables are precisely what this mechanism needs to know about in order to provide valid interpretations, and adaptation keeps the interpretive mechanism optimal for the naturally occurring contingencies, the expectations upon which the interpretations of images depend. Aftereffects, on this view, are caused by the modification of inhibitory connections between populations of neurons rather than by changes in the sensitivity of individual neurons, and illusory aftereffects are the result of adaptation to the abnormal contingencies introduced by the adapting stimulus.
The difference between the two explanations can most easily be seen in the case of a contingent after-effect such as the McCollough effect. After looking at a red vertical grating and a green horizontal grating for a few minutes, a black and white vertical grating will appear tinted with green, while the horizontal one with pink. The conventional explanation of the contingent effect would have to assume the adaptation of cells tuned to both of the contingencies, i.e. cells that are both colour-selective and orientation-tuned. If, however, the McCollough effect is due to an alteration of the interaction strength between separate colour-selective and orientation-tuned populations of cells, we need not assume cells tuned to both. In fact it is believed that the processing of colour and form in the visual system are kept rather well segregated (Hubel & Livingstone, 1985; Shipp & Zeki, 1985), and there may not be a significant number of cells tuned both to colour and to orientation. According to the explanation proposed here, the anti-Hebbian inhibitory connections between the 'red' and 'vertical' units and between the 'green' and 'horizontal' units would be strengthened during the adapting phase by the correlated firing of those units. These altered connections would then cause an imbalance towards 'green' when viewing the vertical black and white grating, and towards 'red' when viewing the horizontal one. Further support for an explanation based on synaptic modification comes from the long time course of this after-effect, which can last several hours, days, or even longer (Jones & Holding, 1975; MacKay & MacKay, 1974; McCollough, 1965; Riggs, White & Eimas, 1974; Shute, 1979), and from the dependence of the decay on the conditions of stimulation: in the absence of visual stimulation after the adapting phase, the aftereffect usually persists considerably longer.

The same mechanism may also explain adaptation effects that are not usually considered as contingent adaptation, such as spatial frequency adaptation. A very convincing argument is provided by electrophysiological recording from neurons undergoing adaptation (Movshon & Lennie, 1979). They adapted a neuron with one of two gratings that had different spatial frequencies but elicited a similar response from the recorded neuron, and found a reduced response to the adapting but not to the other grating. This result is a strong indication that adaptation involves more than just a change in the sensitivity of the recorded cell. As the two gratings activate largely non-overlapping populations of neurons in the visual cortex, the decorrelation model would predict exactly this: a diminished response due to the increased inhibition between the adapted population and the recorded cell, and a much smaller change in the inhibition between it and the population activated by the other grating.

Decorrelation may also explain results on colour adaptation obtained by Webster and Mollon (Webster & Mollon, 1991). Adaptation to a display whose colour oscillates along a line in the colour plane causes the largest decrease in sensitivity along the adapting direction, even if that direction is not aligned with one of the 'cardinal' directions of colour space. One possible explanation would be to postulate a large number of adaptable channels, each tuned to a different direction of colour space. Alternatively, only two channels corresponding to the 'cardinal' directions, together with an anti-Hebbian interaction between them, would give a similar result: adaptation along a diagonal of the space defined by the two cardinal channels gives rise to a positive correlation between them, leading to increased inhibition between the two channels. In the test phase this inhibition would diminish sensitivity along the adapting diagonal direction, while not affecting sensitivity along the other diagonal.
2.10. Decorrelation and associative memory

One of the most thoroughly studied functions of neural networks is that of associative memory (Amari, 1972; Anderson, 1968; Anderson, 1972; Caianiello, 1961; Gardner-Medwin, 1976; Grossberg, 1969; Hopfield, 1982; Kohonen, 1972; Little, 1974; Marr, 1969; Palm, 1980; Steinbuch, 1961; Taylor, 1956; Willshaw et al., 1969). There are two basic classes of associative memory models: hetero- and auto-associative. In heteroassociative memories the goal is to store a set of input-output pattern pairs, so that the presentation of one of the input patterns retrieves the desired output pattern; it is usually also a requirement that a slightly altered version of an input pattern should still evoke the appropriate output. In auto-associative memories the goal is to store a set of single patterns so that a partial or modified pattern presented as input retrieves the complete stored vector: the memory should 'complete' the pattern by regenerating the missing or altered parts, and a modified pattern should be completed to the stored vector that is closest to it, usually in Hamming distance.

The learning rules used in autoassociative memories are Hebbian, or variants of the Hebb rule. The model proposed by Hopfield (Hopfield, 1982) has been very influential, as it made possible a thorough analytical treatment of recurrent autoassociative networks, with statistical mechanics providing the analogy and some of the mathematical machinery (Amit, 1989). Hopfield defined a Lyapunov or 'energy' function on which the activation dynamics of the units perform a gradient descent: a network started from an activation state corresponding to a partial or faulty pattern descends to a local minimum of the energy function. The Hebbian learning rule generates valleys in this function with basins of attraction around the stored patterns, so that states in the region of state space near a learned pattern all lead to the same minimum.

But simply completing patterns does not in itself make any progress towards recognition. Is it the infamous 'homunculus' who is looking at the completed pattern? If the output of 'pattern completion' is identical to the raw input pattern with the missing parts filled in, how can we still discriminate between a complete pattern and a partial or faulty one? It is true that we can recognise objects from partial cues, but how could we be so sensitive to even small deviations from well known, 'standard' patterns if all such deviations were eliminated by pattern completion? Perceptual learning phenomena in animal learning also point the other way: the discriminability of frequently experienced patterns is enhanced rather than diminished (Gibson, 1969; Hall & Honey, 1989; McLaren, Kaye & Mackintosh, 1989; Todd & Mackintosh). The Lyapunov function makes the opposite prediction: valleys corresponding to frequent patterns generate larger basins, attracting input patterns from a larger region of state space, resulting in more generalisation, and therefore poorer discrimination, near frequent patterns than near infrequent ones. A further problem with this approach is the need for a yet unspecified mechanism for selecting the relatively small number of 'valid' states for storage; the division of states into 'valid' and 'invalid' ones introduces an additional source of redundancy instead of reducing it, as is the goal here.

The Hebbian autoassociative and the anti-Hebbian decorrelating theories assign opposite roles to the local recurrent connections and are in this sense incompatible: the autoassociative memory is supposed to learn and exploit the correlations between units, while decorrelation aims to remove them, and if decorrelation were complete there would be nothing left for a Hebbian autoassociative memory to learn. The computational goal proposed here is also different. The goal is not the completion of a fixed, small set of patterns, but the interpretation of a potentially infinite set of stimuli. Instead of making the already uneven distribution of patterns in representation space even more uneven by digging valleys in the energy surface around frequent patterns, we should try to flatten the distribution, expanding the part of the representation space available to the dense regions of pattern space so that frequent, only slightly different patterns become better separated; this is in accordance with the perceptual learning phenomena mentioned above.
The tasks of categorisation, generalisation and fault tolerance can instead be performed by feed-forward connections. In high dimensional spaces most random vectors are almost orthogonal, so a simple feed-forward unit with a threshold can discriminate a given pattern from unrelated patterns and signal its presence with high confidence, even when a significant part of the pattern is missing or faulty (Kanerva, 1988). A Hopfield-type recurrent net, in contrast, is not even guaranteed to converge to the stored vector that is closest in Hamming distance to the input. Feed-forward competitive networks that address this task will be discussed in the following chapters.

Hebbian and anti-Hebbian mechanisms can also be combined by applying them to different sets of connections: anti-Hebbian learning to the local recurrent connections, and Hebbian learning to feed-forward connections between layers, as in the networks of the following chapters, or to longer-range heteroassociative connections, such as those applied in Chapter 5. On this view, associations between stimuli need not be formed by strengthening links between fixed representations of the two associated stimuli, but by changing the representations themselves, so that associated stimuli come to be represented by overlapping sets of units whose activity can then be used as input to the process generating the desired output. As the decorrelating net itself generates the representations, this suggests a radically new way of forming associations, and it may give rise to more useful forms of generalisation than the rather limited Hamming-distance generalisation of conventional associative nets (e.g., (Lippmann, 1987)).

Hebbian modification of longer-range connections is also consistent with cortical anatomy. The collateral ramifications of the axons of pyramidal cells are extremely widespread, often travelling as far as 6 mm within a single cortical area, and their terminal branches are grouped into distinct clusters given off at discrete intervals (Toyama et al., 1981; Ts'o & Gilbert, 1988; Ts'o, Gilbert & Wiesel, 1986). The distribution of these long-range horizontal connections corresponds well with the observed clustering of response properties: they connect cells that share stimulus specificity, for example for orientation or wavelength, locally as well as between different cortical areas, and they are likely to be involved in the specification and maintenance of such associations. A group of locally linked cells may be considered as a 'super-unit'; the number of potential excitatory interactions between such 'super-units' is much less limited by anatomical constraints than that between individual neurons, and the 'super-units' themselves would be fault tolerant. The anti-Hebbian decorrelating interactions discussed here may well coexist with Hebbian modification on these longer-range connections.

Finally, decorrelation may serve as a useful preprocessing step for an associative memory. The capacity of most models of associative memory is limited by the crosstalk between the stored patterns and is best for uncorrelated pattern vectors. By reducing the correlations between units, decorrelation increases the average distance between stored pattern vectors, stretching the dense regions of pattern space and thereby decreasing the interference between correlated patterns.
2.11. Decorrelation and supervised learning

The eventual success of supervised or reinforcement learning (Stone, 1986; Sutton & Barto, 1981; Widrow & Hoff, 1960) in producing a desired behaviour is determined to a large extent by how efficiently the sensory signals have been processed before the association with the desired behaviour has to be learnt. Decorrelation by unsupervised methods may therefore also play a useful role as a preprocessing stage for supervised learning.

The simplest and analytically most tractable form of supervised learning is error-correction learning on an adaptive linear combiner. The goal is analogous to that of linear regression: find the weight vector so that for each input the output is the best estimate of the target value, i.e. the one that minimises the mean squared difference between the target and the actual output. A well known solution to this problem is given by the LMS (least-mean-square) algorithm, also known as the 'delta rule', the 'adaline rule' or the 'Widrow-Hoff rule' (Kohonen, 1984; Widrow & Hoff, 1960). This algorithm is particularly important because of its simplicity and ease of computation. It is an 'on-line' method, not requiring off-line estimation of correlations or repetitions of data: all that is required is that the input pattern and the target output be available at each time step. The possibility of a local implementation makes it especially relevant to neural modelling (Mitchison, 1989), and other error-correcting learning rules, such as backpropagation for nonlinear and multi-layer networks, are extensions of this simple rule.

If the output of a single adaptive linear combiner is y = v · x, and the target output is t, then for each input-target pair the LMS rule specifies a weight change

Δv = 2μ (t − v · x) x.

The mean-square error is a quadratic function of the weights with a single global minimum:

E((t − y)²) = E(t²) + vᵀ Cxx v − 2 Cxtᵀ v = E(t²) − Cxtᵀ v* + (v − v*)ᵀ Cxx (v − v*),

where v* = Cxx⁻¹ Cxt is the optimal (Wiener) solution.
The LMS rule performs a gradient descent on an estimate of this error surface. For a sufficiently small μ, convergence to v* is guaranteed, while larger μ's cause divergence. The choice of μ, which determines the speed of convergence, is constrained by the eigenvalues of the input correlation matrix: μ must satisfy 1/λmax > μ > 0, where λmax is the largest eigenvalue of Cxx (Widrow & Stearns, 1985). Correlations in the input make the error surface elliptic, with the eigenvectors of Cxx along the axes of the surface. As the time constant of the adaptation of each eigencomponent is 1/(4 μ λi), the final slope of the learning curve is determined by the smallest eigenvalue, while the learning rate is limited by the largest one; this corresponds to a slow descent along the long, shallow axis of a narrow 'valley' in the error surface, while the weight vector adapts much more quickly along the steep axes (see figure 12.a). It would clearly be advantageous to choose a separate adaptation rate for each component of the input, but this is not possible within the LMS algorithm, in which a single μ applies to all components.

By applying the LMS rule to the output of a decorrelating network with variance equalisation (Model 2 or 3), it is possible to do just that. The decorrelation makes the correlation matrix of the signal approach the identity matrix, so all the eigenvalues approach 1, the error surface becomes circular, and the adaptation rate can be chosen so that it is optimal for all components, all with the same time constant. The learning curve also becomes steeper, as it is a sum of components with equal rather than widely spread time constants. It has been shown that the efficiency of LMS-type adaptive algorithms is highest when the eigenvalues of the input correlation matrix are equal or nearly equal (Widrow & Walach, 1984).
To demonstrate the effect, consider a three-dimensional Gaussian random vector with zero mean and correlation matrix

C = ( 1     0.55  0.59
      0.55  1     0.9
      0.59  0.9   1 ).

Its first component is used as the target value (t), and the last two components as the input (x) to an adaptive linear combiner, so that

Cxx = ( 1    0.9
        0.9  1 ),      Cxt = ( 0.55
                               0.59 ).

The goal is to find the weight vector giving the best estimate of the target based on the input. Figure 12.a and c show the result of simple LMS learning on the input; figure 12.b and d show the result of LMS learning on the output of the decorrelating network (Model 2), with the decorrelating connections W adapting at the same time as the LMS weights v. In the simple LMS case the weight vector starts out in the same direction as in the decorrelated case, but it then descends along the shallow 'valley' of the elliptic error surface, so that initially the fast eigencomponent of the error decreases quickly and later the slow component dominates, making further learning relatively ineffective. With the decorrelating network, the adaptation of W quickly starts to make the error surface defined on the effective weights circular, and the weight vector takes a more direct route towards the optimum (the error surface defined on v itself is not shown).

Figure 12. a. The effect of LMS learning (Δv = 2μ(t − v·x)x) on the weight vector during the presentation of 500 random input-target pairs. b. The effective weight vector u = (I − W)⁻¹ v during LMS learning on the output of the decorrelating network. c. The mean squared error e = (v − v*)ᵀ Cxx (v − v*) during simple LMS learning; its slope stays roughly constant during adaptation. d. The error between the target and the network output during LMS learning combined with decorrelating adaptation; unlike in c, the slope of the error curve follows two different time constants. v(0) = 0, W(0) = 0, μ = 0.0025, α = 0.01. For each input-target pair presentation y = (I − W)⁻¹x, Δv = 2μ(t − v·y)y, ΔW = α(I − y yᵀ). The apparently larger fluctuations at lower errors in d are due to the logarithmic plotting of the error.
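The comparison is easy to reproduce. The following short sketch is my own illustration (not code from the thesis) of plain LMS and LMS on the output of a decorrelating stage, run side by side on the toy problem above; the number of presentations and the learning rates simply follow the figure caption.

    import numpy as np

    rng = np.random.default_rng(1)
    C = np.array([[1.0, 0.55, 0.59],
                  [0.55, 1.0, 0.9],
                  [0.59, 0.9, 1.0]])
    mu, alpha, steps = 0.0025, 0.01, 500

    v_plain = np.zeros(2)           # LMS weights on the raw input
    v_dec = np.zeros(2)             # LMS weights on the decorrelated signal
    W = np.zeros((2, 2))            # decorrelating weights (Model 2 style, diagonal included)

    for _ in range(steps):
        s = rng.multivariate_normal(np.zeros(3), C)
        t, x = s[0], s[1:]

        # plain LMS: delta-v = 2 mu (t - v.x) x
        v_plain += 2 * mu * (t - v_plain @ x) * x

        # LMS on the output of the decorrelating network
        y = np.linalg.solve(np.eye(2) - W, x)
        v_dec += 2 * mu * (t - v_dec @ y) * y
        W += alpha * (np.eye(2) - np.outer(y, y))   # decorrelate and normalise variance

    v_opt = np.linalg.solve(C[1:, 1:], C[1:, 0])    # Wiener solution v* = Cxx^-1 Cxt
    u_eff = np.linalg.solve(np.eye(2) - W, v_dec)   # effective weights (W is symmetric here)
    print(v_plain, u_eff, v_opt)                    # u_eff ends up closer to v* than v_plain

The effective weight vector u is compared with v* because the decorrelating stage and the LMS weights together implement the mapping u · x, as in the figure caption.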
Note that in figure 12 faster convergence was achieved with the decorrelating network even though the same adaptation rate was used; the faster convergence is due to the smaller value of the longest time constant. In the example the eigenvalues of Cxx are 0.1 and 1.9, while the decorrelated variables have eigenvalues of 1 and 1. For normalised input variables (diag(Cxx) = I) we always have λmax ≥ 1, with equality only when Cxx = I, so for the decorrelated signal it is always possible to choose a faster adaptation rate, giving an even faster convergence.

The choice of the adaptation rate is not determined only by the stability requirement, but also by the need to keep small the error caused by the random fluctuations of the final weights around the optimal weights. This 'misadjustment' can be expressed as M = μ tr(C), which makes it necessary to choose μ smaller than what the stability condition alone would require (Widrow & Stearns, 1985); this is especially important at high noise levels, as in the example presented above. As decorrelation with variance normalisation does not change the trace of the correlation matrix, the same misadjustment can be obtained with faster convergence, or a smaller misadjustment with the same or faster convergence. Moreover, when the input signal is present before the target signal, or when the target is available less frequently than the input, the decorrelating process can proceed even in the absence of the target, so that near-optimal adaptation is possible right at the beginning of LMS learning.
This use of decorrelation is similar to the algorithms known in adaptive signal processing as the 'LMS/Newton' algorithm and the 'sequential regression' (SER) algorithm (Ahmed, Soldan, Hummels & Parikh, 1977; Compton, 1988; Graupe, 1972; Monzingo & Miller, 1980; Orfanidis, 1988; Widrow & Stearns, 1985). The LMS/Newton algorithm replaces the LMS rule by

Δv = 2μ λav Cxx⁻¹ (t − v · x) x,

where λav is the average of the eigenvalues of the input correlation matrix; this leads to a descent directly towards the optimal weight vector. This, of course, assumes that Cxx is known. The sequential regression algorithm is a version in which an estimate of Cxx⁻¹ is updated from the data as learning proceeds. The main difference between these algorithms and the method proposed above is that they require the explicit computation and storage of Cxx⁻¹ and have no natural local implementation, while here the equivalent preprocessing is performed by a neural network using only local learning rules.

The problem of slow convergence along the elongated directions of a 'valley-like' error surface is not unique to single-layer linear networks; it has been recognised that similar problems slow down gradient descent in nonlinear networks trained with the backpropagation algorithm as well. One proposed solution was to give the weight modification 'momentum' by adding a fraction of the previous weight change to the current step (Plaut, Nowlan & Hinton, 1986). This has the effect that the error components oscillating along the 'narrow' directions of the surface get diminished, while the slow components along the elongated directions get amplified. A similar, but biologically less plausible, scheme using explicit Gram-Schmidt orthogonalisation has also been used to accelerate learning (Orfanidis, 1990). A possible alternative to these methods would be to force the units of the network to be uncorrelated by an additional decorrelating mechanism of the kind proposed here.
Chapter 3. Optimal linear feature extraction

3.1. Data reduction

The basic goal of pattern recognition is to derive a concise, symbolic description from a large array of physical signals. The amount of data in a signal, for instance from a retinal (or television) image, is extremely large, and during recognition the number of variables, or the amount of information, needs to be reduced. One way of reducing the dimensionality of the input is to extract a smaller number of features. Exactly which features would be most useful in a given problem may depend on the desired input-output relationship of the particular application and on the details of the eventual categoriser. However, the desired categories may not be known in advance, and information about them may not be available at the physical location where the feature extraction takes place. In this case the selection of features must be based on a more general-purpose criterion that is independent of the desired output, such as an information-theoretic measure (Linsker, 1988): the features should be chosen so that the reduction in the number of variables causes only a minimal loss of information about the regularities present in the input data-set. Among linear methods, Principal Component Analysis has such optimal properties.
3.2. Principal component analysis

Principal Component Analysis (PCA) is a statistical method for extracting features from high dimensional data distributions (see for instance (Devijver & Kittler, 1982)). It is also known in signal processing as the Karhunen-Loève Transform (KLT), or as the Hotelling Transform in image processing, and it is related to Factor Analysis and to Singular Value Decomposition (SVD) (Gerbrands, 1981). It is a linear, orthogonal transformation (rotation) of the coordinate system of a distribution into a coordinate system in which the coordinates are uncorrelated and the maximal amount of variance is concentrated on only a small number of coordinates. The basis vectors of the new coordinate system are the eigenvectors of the covariance matrix of the distribution. In this transformed space the number of variables can be reduced from m to n dimensions by leaving out the coordinates with small variances, and the loss of variance is minimised by keeping the subspace spanned by the n eigenvectors with the largest eigenvalues; the projection given by PCA is therefore the optimal linear projection in the sense of minimal loss of variance. The information content of a normally distributed variable depends on its variance, so by maximizing the preserved variance PCA also maximizes the amount of information carried by the n variables.
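As a concrete illustration (my own sketch, not part of the thesis), the projection described above can be computed directly from the eigendecomposition of the sample covariance matrix; the data dimensions and the number of retained components below are arbitrary.

    import numpy as np

    def pca_project(X, n):
        """Project the rows of X onto the n principal components with largest variance."""
        Xc = X - X.mean(axis=0)                      # centre the data
        C = np.cov(Xc, rowvar=False)                 # m x m covariance matrix
        evals, evecs = np.linalg.eigh(C)             # eigenvalues in ascending order
        order = np.argsort(evals)[::-1][:n]          # indices of the n largest
        P = evecs[:, order].T                        # n x m matrix of components as rows
        return Xc @ P.T, P

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))   # correlated 8-dimensional data
    Y, P = pca_project(X, 3)
    print(np.round(np.cov(Y, rowvar=False), 2))      # nearly diagonal: projected coordinates are uncorrelated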
3.3. Oja's algorithm

Oja proposed a model consisting of a single linear unit with a local, Hebb-type modification rule, and showed that the unit extracts the principal component of a stationary input distribution, i.e. the eigenvector of the input covariance matrix with the largest eigenvalue (Oja, 1982). The output of the unit, y, is the sum of the inputs xj weighted by the connection strengths qj (figure 13):

y = Σj qj xj.

Figure 13. The output of Oja's linear unit, trained on a stationary sequence of input vectors, converges to the largest principal component.

The unit is trained on a sequence of vectors from the input distribution, and the rule for the modification of the connections during each training step is

Δqj = μ (xj y − qj y²),

where xj y is the Hebbian term that makes the connection stronger when the input and the output are correlated, i.e. when they are active simultaneously. The weight-decay term, −qj y², is necessary to prevent instability; it makes Σi qi² approach 1. After training, the unit maximizes the variance of its output subject to the constraint that Σi qi² = 1. This, however, is not a full principal component analysis, because the unit finds only one component, the one with the largest variance. If more than one unit is trained by the same rule with no interaction between them, their output values will of course be identical, so the information transmitted by several such units is no more useful than the value of a single unit; and if noise is added to the outputs, the transmitted information will be significantly less than what could be achieved by PCA.

Several alternative algorithms have been proposed to change connection strengths in linear connectionist networks so as to extract more than one principal component (Oja, 1985; Sanger, 1989; Williams, 1985), but these are non-local rules: they rely on the calculation and backward propagation of errors or values between layers, which makes their operation and biological implementation more complicated. Here a combination of two local rules, Oja's rule and the decorrelating anti-Hebbian rule, will be shown to achieve the same goal.
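A minimal sketch of the single-unit rule (my own illustration; the learning rate and input statistics are arbitrary) shows the weight vector converging to the leading eigenvector of the input covariance matrix.

    import numpy as np

    rng = np.random.default_rng(2)
    m = 5
    A = rng.normal(size=(m, m))
    Cxx = A @ A.T / m                             # covariance of the input distribution

    q = 0.1 * rng.normal(size=m)                  # initial weights
    mu = 0.005
    for _ in range(20000):
        x = rng.multivariate_normal(np.zeros(m), Cxx)
        y = q @ x
        q += mu * (x * y - q * y * y)             # Oja's rule: Hebbian term minus decay

    evals, evecs = np.linalg.eigh(Cxx)
    e1 = evecs[:, -1]                             # eigenvector with the largest eigenvalue
    print(abs(q @ e1) / np.linalg.norm(q))        # close to 1: q aligns with e1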
Model 4

Model 4 combines Oja's network with the decorrelating network of Chapter 2 (figure 14), as first proposed in (Foldiak, 1989). A large number (m) of inputs connect to a smaller number (n) of output units by Hebbian connections (q), and anti-Hebbian feedback connections (w) between the output units keep the outputs uncorrelated.

Figure 14. The combined Hebbian - anti-Hebbian network. White circles indicate Hebbian, black circles anti-Hebbian connections.

When an input is presented to the network, the units settle to a stable state for which

yi = Σ_{j=1..m} qij xj + Σ_{j=1..n} wij yj,    i.e.    y = Qx + Wy,    y = (I − W)⁻¹ Q x.

Initially the qij's are set to random values and wij = 0. The modification rules for the connection strengths are

Δwij = −α yi yj   if i ≠ j   (wii = 0),
Δqij = β (xj yi − qij yi²).

The training is unsupervised: there is no need for a teaching or error signal. The modification rules are also purely local, as all the information necessary for the modification of a connection strength is available locally at the site of the connection; there is no need for the propagation of values from other units.
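The combined rules fit in a few lines of code. The following is a rough pattern-by-pattern sketch of Model 4 (my own illustration; the thesis itself simulates the expected updates on the covariance matrix, as described in the next section), with arbitrary sizes and learning rates.

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 20, 4
    A = rng.normal(size=(m, m))
    Cxx = A @ A.T / m                              # input covariance

    Q = rng.uniform(-0.5, 0.5, size=(n, m))        # Hebbian feed-forward weights
    W = np.zeros((n, n))                           # anti-Hebbian lateral weights (w_ii = 0)
    alpha, beta = 0.01, 0.01

    for _ in range(50000):
        x = rng.multivariate_normal(np.zeros(m), Cxx)
        y = np.linalg.solve(np.eye(n) - W, Q @ x)          # settled output y = (I - W)^-1 Q x
        dW = -alpha * np.outer(y, y)                        # anti-Hebbian decorrelation
        np.fill_diagonal(dW, 0.0)
        W += dW
        Q += beta * (np.outer(y, x) - (y * y)[:, None] * Q) # Oja-type rule for each output unit

    T = np.linalg.inv(np.eye(n) - W) @ Q
    print(np.round(T @ Cxx @ T.T, 2))   # roughly diagonal: uncorrelated outputs in the PCA subspace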
3.4. Simulation

The input was assumed to have a normal distribution characterized by its covariance matrix Cxx. Because of the linearity of the transformation performed by the network, the output distribution is also normal, with a covariance matrix that can be calculated directly from Cxx. The operation of the network was therefore simulated on a slow time scale: neither individual input vectors nor the stable outputs for individual patterns were generated. Instead, taking α and β small, the modification rules were approximated in each cycle t by their expected values over the input distribution. If the transformation performed by the network is T(t) = (I − W(t))⁻¹ Q(t), the output covariance matrix is

Cyy(t) = T(t) Cxx T(t)ᵀ,

and the expected updates are

W(t+1) = W(t) − α offdiag(Cyy(t)),
Q(t+1) = Q(t) + β (T(t) Cxx − diag(Cyy(t)) Q(t)),

where diag() is an operator that sets the off-diagonal elements of a matrix to zero, and offdiag() sets the diagonal elements to zero.

For each run the qij(0)'s were chosen from a uniform distribution over the interval [−0.5, 0.5]. Cxx was generated as in Models 1-3 by Cxx = L Λ Lᵀ, where Λ is the diagonal matrix containing the eigenvalues and the rows of L are the eigenvectors, obtained by Gram-Schmidt orthogonalisation of a matrix with elements chosen from a uniform, zero-mean distribution on the interval [−0.5, 0.5]. The network was simulated with these rules (α = β = 0.02) for several different network sizes.

For each input covariance matrix, a measure of the overlap between the subspace of the network outputs and the PCA subspace was calculated and plotted (figure 15). After convergence, the row vectors of T, the transformation performed by the network, will not be individually aligned with the principal components, nor are they orthogonal to each other, but jointly they span the principal component subspace. Geometrically, the overlap can be measured by transforming the rows of T into the coordinate system of the principal components and looking at the ratio between the length of each row's projection into the PCA subspace and its total length. If the columns of L are the eigenvectors, arranged so that the ith column corresponds to λi and λ1 > λ2 > ... > λm, then the rows of M = T L are the transformed rows of T. Averaged over the rows, the length of a row within the subspace (lp) and its total length (l) are

lp = sqrt( Σ_{i=1..n} Mji² ),    l = sqrt( Σ_{i=1..m} Mji² ),

and lp / l is a measure of the overlap of the two subspaces: it is 0 only if the rows of T are all orthogonal to the PCA subspace, and 1 only if they all lie within it. The final subspace, unlike the final T, is independent of the initial random choice of the feed-forward matrix Q.
Figure 15. The convergence of the subspace of network outputs to the PCA subspace. The quantity lp / l is a measure of the overlap between the two subspaces; m is the number of inputs, n the number of outputs (panels: m = 50, n = 5; m = 100, n = 10; m = 200, n = 20; m = 400, n = 40; horizontal axes: training cycles). Each curve is an average of 10 runs with different random input covariance matrices and initial forward connections.
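The overlap measure lp / l used in the figure is straightforward to compute. The sketch below is my own illustration of the geometric description given above, assuming the eigenvectors are sorted by decreasing eigenvalue.

    import numpy as np

    def pca_subspace_overlap(T, Cxx, n):
        """Average ratio lp / l over the rows of T: length of each row within the span
        of the n leading eigenvectors of Cxx, divided by its total length."""
        evals, evecs = np.linalg.eigh(Cxx)
        L = evecs[:, np.argsort(evals)[::-1]]       # columns sorted by decreasing eigenvalue
        M = T @ L                                    # rows of T in the eigenvector coordinates
        lp = np.sqrt((M[:, :n] ** 2).sum(axis=1))    # length inside the PCA subspace
        l = np.sqrt((M ** 2).sum(axis=1))            # total length
        return np.mean(lp / l)

For a converged Model 4, pca_subspace_overlap((I − W)⁻¹Q, Cxx, n) should approach 1.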
random
3.5.
Transmitted
information
Oja’s algorithm
reaches the maximum
single unit case if the inputs contain and similarly
for algorithms
To demonstrate
of mutual
uncorrelated
information
set by PCA for the
noise of equal variance
that yield the PCA subspace (Plumbley
that the mutual
H(Y) - H(Y
I X) was calculated
the network
input
information
is maximised
in each iteration
& Fallside,
entropy
1988), 1988).
in the present model
step. If uncorrelated
(figure 16) then the conditional
(Linsker,
noise is added to
of the output
is
H(Y I X) = H(YN), where
YN is the noise transformed
normal
distribution
pendent
so the covariance
of the noise on the output
The mutual
information
matrix
containing Further,
input.
=
- i log( (2ne)” det(CN))
should be compared
IpCA = $log((2ne)”
and Cp = PCxxPT.
is I. The
in this case is
i log( (2rce)” det(+))
where P is the matrix
to be inde-
of the noise on the input
noise in the network
I(X;Y) = H(Y)-H(YN)
This quantity
is assumed
to have
is CN = TTT.
Figure 16. Additive
transformation,
The noise is also assumed
with zero mean, and the noise on each input
and of unit variance,
covariance
to the output.
with the mutual
det(Cp))
- + log((2rre)”
the n largest normalised
we know that det(PPT)
so
- 59 -
= i log T (detf TT ,> information
given by PCA:
det(PPT)), principal
components
as rows,
= 1, because PCA is an orthogonal
IpCA = 5 lo@ det(Cp) where
hl, h2,...hn are the n largest Figure
training
cycles
17 shows
eigenvalues
the mutual
approaching
) = $ log( hlh2*** &I ), of Cxx.
information
the maximal
of the proposed
network
as a function
of
value set by PCA.
1 ’ 1’ PC/~ 0.R
0.6
0.6
m=SO,n=5 04
0.2
1I
,
0
ib
m=lOO,n=lO 0.4
03
I IW
I 150
I 2al
c
I 250
I
I
SO
IW
I
150
cycles
I 200 cycles
I 2.50
I 260
I 250
I
0.8
0.6
m=4OO,n=40
m=2OO,n=20
0.4
0.4
0.2
0.2
I 0
; 0
I 50
Figure curve
I loo
I Is0
17. The mutual is an average
I 200
cycles
information
of 10 runs with
I)--
I 250
SO
I IW
I 150
cycles
of the network different
I
0
reaches
random
-6O-
the maximum
input covariance
set by PCA. The matrices.
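The two quantities compared in the figure can be computed directly from T and Cxx. The fragment below is a small sketch of that calculation (my own illustration, under the same unit-variance input-noise assumption).

    import numpy as np

    def network_info(T, Cxx):
        """I(X;Y) = 0.5 * log(det(Cyy) / det(T T^T)) for unit-variance input noise."""
        Cyy = T @ Cxx @ T.T
        return 0.5 * (np.linalg.slogdet(Cyy)[1] - np.linalg.slogdet(T @ T.T)[1])

    def pca_info(Cxx, n):
        """I_PCA = 0.5 * log of the product of the n largest eigenvalues of Cxx."""
        evals = np.sort(np.linalg.eigvalsh(Cxx))[::-1]
        return 0.5 * np.sum(np.log(evals[:n]))

    # After training Model 4, network_info((I - W)^-1 Q, Cxx) approaches pca_info(Cxx, n).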
Model 5

By adding gain control to the output units, for instance by the mechanism suggested in Model 2, the outputs are not only uncorrelated and in the principal component subspace, but they are also normalised in variance. Figure 18.a shows the convergence of the output subspace to the PCA subspace, and figure 18.b plots the convergence of the output correlation matrix to the identity matrix.

Figure 18. a. The overlap of the subspace of the network output and the PCA subspace. b. log10 ||I − Cyy|| decreases as the correlation matrix of the output approaches the identity matrix. Plots are averages, with minima and maxima, over ten random input correlation matrices. (α = 0.1, β = 0.04)
Figure 19. Additive noise on the network input and output.

The normalisation in Model 5 makes the network optimally resistant not only to noise on the input, as demonstrated in the previous section, but also to noise added to the output units (figure 19). The effect of input noise depends on the correlations of the input and on the feed-forward connections, which transform it into the output subspace, while the effect of independent output noise depends only on the variances of the outputs, so the two can be treated quite separately.

3.6. Assumptions and biological relevance

In addition to the assumptions of the previous models, Models 4 and 5 assume Hebbian modification of the feed-forward synapses. Hebbian synaptic modification was first observed experimentally in the form of long-term potentiation (LTP) in the hippocampus (Bliss & Gardner-Medwin, 1973; Bliss & Lømo, 1973; Levy & Steward, 1979), and was subsequently demonstrated to occur in the central and peripheral nervous system of vertebrates and invertebrates, including the mammalian visual cortex (Artola & Singer, 1987; Bindman, Murphy & Pockett, 1988; Brown et al., 1990; Madison et al., 1991). As the pure Hebb rule only specifies a weight increase, it is inherently unstable. The rule proposed by Oja also specifies a decrease in synaptic strength when the postsynaptic cell is active in the absence of presynaptic firing; in this respect it is similar to the rule proposed by Stent and Singer (Rauschecker & Singer, 1979; Stent, 1973) and to the modification rule used in a class of associative memory models analysed by Willshaw and Dayan (Dayan & Willshaw, 1991; Willshaw & Dayan, 1990). The effects of changes of the kind specified by this rule have been observed experimentally in the hippocampus (Lopez, Burger, Dickstein, Desmond & Levy, 1990) as well as in the development of the visual cortex (Rauschecker & Singer, 1979; Rauschecker & Singer, 1981).
3.7. Comparison to related algorithms

Several other algorithms related to principal component analysis have been proposed. The back-propagation algorithm (Ackley, Hinton & Sejnowski, 1985; Rumelhart et al., 1986) was applied to data compression by training an auto-encoding network to reproduce a set of input patterns on its output units through a 'bottleneck' of a smaller number of hidden units, and this method was used for image and speech compression (Cottrell, Munro & Zipser, 1987; Elman & Zipser, 1988). It was found that the hidden units of such networks come to span the principal component subspace (Baldi & Hornik, 1989; Bourlard & Kamp, 1988), and it was also demonstrated that nonlinear hidden units do not improve performance on the auto-encoding problem (Bourlard & Kamp, 1988). However, as pointed out earlier, the back-propagation algorithm does not seem to be biologically realistic.
To get more than just the largest principal component, Oja extended his network to several output units (Krogh & Hertz, 1990; Oja, 1985; Oja, 1989). Sanger proposed an algorithm quite closely related to Gram-Schmidt orthogonalisation: the first unit finds the largest principal component, its contribution is then subtracted from the input, and by repeating this sequence over n units the n largest principal components can be obtained (Sanger, 1989). The learning rules used in these models are non-local as well, and the biological implementation of the sequential subtraction scheme, which involves back-propagating values from the output units to the inputs, seems implausible, as does an alternative implementation involving constraints on the back-projecting weights.

An algorithm similar to Model 4 (Foldiak, 1989) has also been proposed and analysed (Kühnel & Tavan, 1990; Rubner & Tavan, 1989; Rubner, Schulten & Tavan, 1990), but with hierarchically arranged, strictly asymmetrical anti-Hebbian connections between the output units: connections are specified from each output unit only to units of higher index, and not vice versa, corresponding to a triangular W matrix. It was found that the first unit extracts the first principal component, while the asymmetrical anti-Hebbian connections force the higher units to be uncorrelated with the previous ones, so that they extract the remaining components. There are, however, two fundamental problems with this arrangement. One is that there is no evidence in the brain for the strictly hierarchical, asymmetrical arrangement of collateral connections that it would require. The other is that as the units extract the principal components themselves, the difference between the variances of the units is maximised: the variance (and the entropy) is largest for the first unit and smallest for the last. This is clearly disadvantageous if a physical device with a finite dynamic range and noise is to be used to represent the components, as a large amplification, resulting in additional noise, would be required for the components with small variance, and a large attenuation for the first; normalisation could still be applied, but only at this extra cost. As Model 4 extracts linear combinations of the principal components rather than the components themselves, the variances of its output units are more uniform. The convergence properties of Model 4 and the conditions on the learning parameters were also studied by Baldi and Hornik, and by Leen using bifurcation analysis (Baldi & Hornik, 1990; Leen, 1991).
Chapter 4. Sparse coding

4.1. The Hebb unit as suspicious coincidence detector

The computational capabilities of any number of linear units are clearly very limited, as a linear network of several layers can always be replaced by a single layer. Real neurons are known to have nonlinear properties, and the elementary nonlinear model of a nervous cell is that of a binary threshold unit (McCulloch & Pitts, 1943), which takes a weighted sum of its inputs and gives a positive output when this sum exceeds its threshold value (t):

y = 1   if Σj qj xj ≥ t,
y = 0   otherwise.

Due to their low spontaneous firing rate, neurons cannot signal negative values, and the output is also limited by a maximal firing rate, so the response of a real neuron is nonlinear in this sense as well.

If the weights and the inputs are thought of as binary patterns, then the weighted sum is maximal when the input pattern matches the weight vector precisely. The unit will also respond to patterns that differ from the weight vector only in a small number of bits, so the unit can be said to generalise according to Hamming distance. Depending on the value of the threshold, this kind of pattern matching can be made more or less strict, and the unit can be made into a detector of a suspicious coincidence of its inputs. If the connections from the active inputs are strengthened whenever the unit fires, according to the Hebb rule, the unit adjusts its weight vector closer to the patterns that activate it, so that it will respond to those patterns even better in the future. In this way the frequently occurring patterns are more likely to be detected than the infrequent ones. To use several of these units, a mechanism is needed to prevent them all from detecting the same feature. One method of achieving this is competitive learning.
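A minimal sketch of such a unit (my own illustration; the threshold, learning rate and the way the weights are renormalised are arbitrary choices) makes the coincidence-detector behaviour explicit.

    import numpy as np

    class HebbThresholdUnit:
        """Binary threshold unit with a simple Hebbian update on binary inputs."""
        def __init__(self, m, threshold, lr=0.1, rng=None):
            rng = rng or np.random.default_rng()
            self.q = rng.uniform(0, 1, size=m)
            self.q /= self.q.sum()               # keep the total weight bounded (one possible constraint)
            self.t, self.lr = threshold, lr

        def respond(self, x):
            return 1 if self.q @ x >= self.t else 0

        def learn(self, x):
            if self.respond(x):                  # Hebbian: strengthen connections from active inputs
                self.q += self.lr * x
                self.q /= self.q.sum()           # renormalise so the weights do not grow without bound
            return self.respond(x)

Repeated presentation of a frequent pattern pulls the weight vector towards that pattern, so the unit responds to it ever more reliably.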
4.2. Competitive learning

Competitive learning (Amari, 1982; Feldman & Ballard, 1982; Grossberg, 1976a; Rumelhart & Zipser, 1985; von der Malsburg, 1973) in its simplest version activates only the single output unit that fits the input pattern best, by selecting the unit with the largest weighted sum and suppressing the outputs of all other units. Such a 'winner-take-all' mechanism can be implemented by strong, constant inhibitory connections between the competing units. As only the winner adjusts its weight vector towards the current input, the units divide the input space among themselves and learn to respond selectively in the regions of the space that are densely populated by input patterns. The resulting representation is a local, 'grandmother-cell' representation: a single unit gets activated for each input pattern. As the output patterns are disjoint, there is no interference at all between the stored patterns, and a subsequent layer can associate outputs to the stored patterns in a single trial, simply by turning on the connections from the winner unit to the units that are active in the desired output. However, this kind of local representation is very limited in the number of discriminable states it can code, as well as in its ability to generalise: an output unit gets activated only when the input pattern is within a certain Hamming distance from the weight vector of the unit.
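A compact sketch of the simplest winner-take-all scheme (my own illustration; the learning rate and weight normalisation are arbitrary):

    import numpy as np

    def competitive_learning(X, n_units, lr=0.05, epochs=10, rng=None):
        """Winner-take-all competitive learning on the rows of X."""
        rng = rng or np.random.default_rng()
        W = rng.uniform(0, 1, size=(n_units, X.shape[1]))
        W /= W.sum(axis=1, keepdims=True)
        for _ in range(epochs):
            for x in X:
                winner = np.argmax(W @ x)            # unit with the largest weighted sum wins
                W[winner] += lr * (x - W[winner])    # only the winner moves towards the input
        return W

After training, each row of W sits in one densely populated region of the input space, and classifying a new input amounts to finding the winning row.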
coding
It would be much more desirable each unit representing
one component
capacity
to code each input
or property
results in a significant
also give rise to desirable
of the pattern.
increase in the number
of such a distributed
state by a set of active units, Since the combinatorial
of discriminable
code is high. Distributed
effects like generalisation
states, the
representations
between overlapping
patterns,
noise
of units
for each input
resistance.
On the other pattern,
into disjoint
states that it can code, as well as in its ability to a particular
and damage
between the competing
to each other, there is no interference kind
only the
mechanism
representation
unit to the active output
associated
representational
1985) activates
of space that are densely
local, ‘grandmother-cell’
output
use of units
connections
space among themselves
are orthogonal
input
& Zipser,
1976a;
the one with the largest weighted
layer to associate outputs
from the winner
local representation stored
inhibitory
finer discrimination
supervised
1982; Grossberg,
of all other units. Such a ‘winner-take-all’
vectors. The resulting
subsequent
& Ballard,
(Rumelhart
best by selecting
by strong constant
a selectively
1982; Feldman
version
pattern
this way, the units divide
pattern
(Amari,
the mapping
hand,
when
a large
to be implemented
number
by a subsequent
- 66 -
The capacity of an associative memory network, i.e. the number of input-output pairs that can be stored using a highly distributed representation, is significantly lower than optimal, as memory capacity is limited by the interference between the stored, non-orthogonal patterns. Learning may become extremely slow, and the rules for adjusting the connections may become complicated and hard to implement by simple neuron-like units (e.g., (Rumelhart et al., 1986)). Sparse coding is a compromise between a local and a completely distributed representation. In a sparse code, each input pattern is represented by the activity of only a small proportion of the available units, while each unit takes part in the representation of many patterns. By choosing the proportion of active units, one can control the tradeoff between representational capacity and the amount of interference, as well as that between the capacity of a subsequent associative memory and the complexity of the input-output functions that it can learn. The advantages of sparse representations, which combine some of the advantages of both local and completely distributed codes, were pointed out by several authors
& Touretzky,
one can control
that are relatively
require
&
et al., 1969; Zetzsche,
in terms of features or conjunctions
In a hypothetical
the competitive
1969; Nadal
by the activity
here aims at coding inputs
finds a representation
correlated
are represented
is an unsupervised
& Wilczek,
1990; Rosenfeld
1981; Willshaw
and memory
and the complexity
The mechanism
The mechanism
patterns
rep-
1988; Gardner-Medwin
1970; Marr,
1988; Rolls & Treves,
units. By choosing
representational
1987; Gardner,
by
out by several
1987; Baum,Moody,
& Buneman,
1988; Willshaw,
connections
distributed
were pointed
& Sompolinsky,
stored
can be combined
local and completely
1988; Gardner,
1980; Palm,
et al., 1969).
et al., 1986)).
representations
of sparse representations
& Schulten,
(Willshaw
the non-orthogonal
(e.g., (Rumelhart
& Recce, 1987; Longuet-Higgins,Willshaw, Toulouse,
between
between
1989; Amit,Gutfreund,
of an associative
slow, and the rules for adjusting
is a compromise
The advantages
(Amari,
than optimal
of both local and distributed
which
The capacity
pairs that can be stored using a highly
lower
and hard to implement
The advantages
units.
along
one feature
or
aspect of the input. the unit
in the type group,
large Hamming
4.4. Combination of Hebbian and anti-Hebbian mechanisms

Model 6

In this model the detection of suspicious coincidences is performed by Hebbian feed-forward connections, while the 'winner-take-all' mechanism of conventional competitive learning is replaced by anti-Hebbian, decorrelating feedback connections between the output units and by an additional mechanism of threshold adjustment. The decorrelation is performed by anti-Hebbian connections as proposed in the previous chapter (the network was first presented in (Foldiak, 1990)), but the same architecture is used here with nonlinear units. Because the units are nonlinear, the network can learn not only about the second-order statistics of the input, i.e. pairwise correlations, but also about higher-order statistical dependencies between the input elements and features. In order to keep its probability of firing close to a fixed value, each unit gradually adjusts its own threshold: a unit that has been inactive for a long time gradually lowers its threshold (i.e. decreases its selectivity), while a frequently active unit becomes more selective by raising its threshold.

The network has m inputs, x_j, j = 1...m, and n representation units, y_i, i = 1...n. Because the output of each unit is influenced by the feedback from the other units, the final output cannot be calculated in a single feed-forward step as in the previous models. Provided that the feedback is symmetric (w_ij = w_ji), the network is guaranteed to settle into a stable state after an initial transient (Hopfield, 1984). The network was simulated by numerically solving the following differential equation for each input pattern:

dy*_i/dt = f( Σ_j q_ij x_j + Σ_j w_ij y*_j - t_i ) - y*_i ,

where q_ij is the weight of the connection from x_j to y_i, w_ij is the connection between units y_i and y_j, t_i is the threshold of unit i, and the nonlinearity is f(u) = 1/(1 + exp(-λu)). The initial value of y*_i for each input pattern was y*_i(0) = f( Σ_j q_ij x_j - t_i ).
The outputs y_i are then calculated by rounding the values of y*_i in the stable state to 0 or 1 (y_i = 1 if y*_i > 0.5, y_i = 0 otherwise). The feed-forward weights are initially selected at random from a uniform distribution on [0,1] and normalised to unit length (Σ_j q²_ij = 1), and the feedback weights are 0. On each learning trial, after the output has been calculated, the connections and the thresholds are modified according to the following rules:

anti-Hebbian rule:          Δw_ij = -α (y_i y_j - p²)      (if i = j or w_ij > 0 then w_ij := 0)

Hebbian rule:               Δq_ij = β y_i (x_j - q_ij)

threshold modification:     Δt_i = γ (y_i - p).

Here α, β and γ are small positive constants and p is the specified bit probability. The Hebbian rule contains a weight decay term in order to keep the feed-forward weight vectors bounded. The anti-Hebbian rule is inherently stable, so no such normalising term is necessary. Note that these rules only contain terms related to the units that the weight connects, so all the information necessary for the modification is locally available at the site of the connection.
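The following is a minimal numerical sketch of these dynamics and learning rules (Python; the Euler step size, the number of settling iterations and the default parameter values are assumptions made for illustration, not taken from the text):

import numpy as np

def f(u, lam=10.0):
    """Sigmoid nonlinearity f(u) = 1 / (1 + exp(-lambda * u))."""
    return 1.0 / (1.0 + np.exp(-lam * u))

class SparseCodingNetwork:
    """Sketch of the nonlinear Hebbian/anti-Hebbian network described above."""

    def __init__(self, m, n, p=0.1, alpha=0.1, beta=0.02, gamma=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.q = rng.uniform(0.0, 1.0, (n, m))                     # feed-forward weights q_ij
        self.q /= np.linalg.norm(self.q, axis=1, keepdims=True)    # sum_j q_ij^2 = 1
        self.w = np.zeros((n, n))                                  # feedback weights start at 0
        self.t = np.zeros(n)                                       # thresholds t_i
        self.p, self.alpha, self.beta, self.gamma = p, alpha, beta, gamma

    def settle(self, x, dt=0.1, n_steps=100):
        """Integrate dy*_i/dt = f(q x + w y* - t) - y*_i, then round to binary."""
        ystar = f(self.q @ x - self.t)                             # initial value y*_i(0)
        for _ in range(n_steps):
            ystar += dt * (f(self.q @ x + self.w @ ystar - self.t) - ystar)
        return (ystar > 0.5).astype(float)                         # outputs y_i

    def learn(self, x):
        y = self.settle(x)
        # anti-Hebbian rule, keeping w_ij <= 0 and w_ii = 0
        self.w -= self.alpha * (np.outer(y, y) - self.p ** 2)
        self.w[self.w > 0] = 0.0
        np.fill_diagonal(self.w, 0.0)
        # Hebbian rule with decay term for the feed-forward weights
        self.q += self.beta * y[:, None] * (x[None, :] - self.q)
        # threshold modification keeps the firing probability near p
        self.t += self.gamma * (y - self.p)
        return y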
In the next two sections, some aspects of the model will be demonstrated on two simple, artificially generated pattern distributions.

4.5. Example 1: Learning lines

Patterns in this example consist of random horizontal and vertical lines. The example was chosen for comparison with that given by Rumelhart and Zipser (Rumelhart & Zipser, 1985) to demonstrate simple competitive learning. The important difference is that the patterns here consist of combinations of lines rather than of single lines. On an 8x8 grid, each of the 16 possible lines is drawn with a fixed probability (1/8), independently of all the others (figure 20). Pixels that are part of a drawn line have the value 1, all the others are 0. This pattern set is highly redundant, as only 2^16 of the 2^64 possible input patterns are allowed.

Figure 20. A random sample of the input patterns presented to the network.
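A short sketch of how input patterns of this kind could be generated (Python; assuming the 8x8 grid and the 1/8 line probability described above; the function name is my own):

import numpy as np

def random_line_pattern(rng, size=8, p_line=1.0 / 8.0):
    """One 8x8 pattern: each of the 16 horizontal/vertical lines is drawn
    independently with probability p_line; pixels on a drawn line are 1,
    all others 0. Returned flattened to a 64-element binary vector."""
    grid = np.zeros((size, size))
    for i in range(size):
        if rng.random() < p_line:
            grid[i, :] = 1.0          # horizontal line i
        if rng.random() < p_line:
            grid[:, i] = 1.0          # vertical line i
    return grid.reshape(-1)

rng = np.random.default_rng(0)
patterns = [random_line_pattern(rng) for _ in range(1000)]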
The network has 16 representation units, so the network size is well matched to the number of independent components in the input. The feed-forward connections developed so that the units became detectors of individual lines (figure 21). The code generated by the network preserves all the information present in the input, as each input pattern is coded by the combination of the active line detectors, and all the redundancy in the input is removed by the network, as the outputs are statistically independent. Of course, the code is optimal in this sense only because of the simplicity of the artificial distribution, the fact that its components (the line positions) are statistically independent, and the fact that the number of units was well matched to the number of components.

4.6. Example 2: Learning the alphabet

A slightly more realistic example is considered in this section, in which the statistical structure of the input is more complicated. This example was chosen for comparison with the one presented by Barlow et al. (Barlow, Kaushal & Mitchison, 1989).
In that study, methods were considered for uniquely assigning binary strings of a fixed length to a set of patterns so as to minimise the higher-order redundancy of the strings. If A_j is the probability of string j, b_ij denotes the i-th bit of the code for the j-th string, and the probability of the i-th bit being 1 is p_i, then higher-order redundancy can be defined as (Barlow et al., 1989):

R = |e(A, b) - E(A)| / E(A),

where

e(A, b) = - Σ_i [ p_i log p_i + (1 - p_i) log (1 - p_i) ]

is the sum of the individual entropies of the bits of the string, and

E(A) = - Σ_j A_j log A_j

is the entropy of the set of strings. The sum of the bit entropies is never smaller than the entropy of the strings, and they are equal only when the bits are independent.
in this example
on an 8 x 15 raster. During
the same probabilities
were constructed normalised (Barlow
from the standard
to unit length.
training,
of letters
letters were presented in a piece of English
system font of a Sun workstation
The same letter
frequencies
presented in random
text. Input
in a order
vectors
and vectors were
were used as in Barlow
et al.
et al., 1989).
only a small number infrequent
of a large number between constraint,
bit probability
(p), the resulting
of l’s (table 1). Frequent
ones, as otherwise
inhibition explicit
consist of images
as they appeared
Due to the prescribed
firing
than the
of the strings, and they are equal only when the bits are independent. The input
with
is never smaller
introduced
by the frequent
force the decorrelating
the active units. Another is that no two frequent
contain
connections
simultaneous to increase
feature of the code, which is not due to an letters are assigned
while the code is not completely
reversible,
information
(table 2). This is significantly
present in the input
patterns
letters tend to have fewer active bits than
the correlations
of cells would
output
it preserved
- 71 -
the same output,
a large proportion
so that
(97%) of the
better than the amount
of
information
retained
by an untrained
random
network,
which
in this example
is less
than 50%. input patterns
[Table 1: one 16-bit output code word per input pattern; the input characters (letters, digits and punctuation of the training text) are ordered approximately by decreasing frequency: e, t, i, o, a, n, s, r, h, m, c, u, f, b, p, y, w, v, ...]

Table 1. The code generated by the network after the presentation of 8000 letters. The rows indicate the output of the 16 units for the input patterns indicated on the right-hand side (α = 0.01, β = 0.001, γ = 0.01, λ = 10, p = 0.1).
                            input (8x15)      output
number of units             120               16
entropy (E)                 4.34 bits         4.22 bits (97% of input)
sum of bit entropies (e)    24.14 bits        5.86 bits
redundancy (R)              456%              39%
bit probabilities           high              low
representation              distributed       sparse

Table 2. Some properties of the code in Example 2.
A property of the code, which is important from the point of view of generalisation, is its smoothness: similar input patterns tend to get mapped to similar output patterns, as in the case of the letters e and o, and even in the confusion of 0, O, Q, U and 9 in table 1. The receptive fields of the units reflect the properties of features in the input patterns (figure 22). Some of the units become highly selective and detect one of the most frequent letters, while other units are less selective and their receptive fields consist of different combinations of a set of features.

In both of the above examples the network performs a smooth, information preserving, redundancy reducing transformation of the input, representing it by the sparse activity of a statistically relatively uncorrelated set of units. In contrast with the 'winner-take-all' mechanism implemented in competitive learning by fixed inhibitory connections, these modifiable anti-Hebbian connections allow more than one unit to be active for each input pattern, while keeping the activity of the units approximately uncorrelated, or at least not highly correlated.

What implications does such a code have for generalisation in a subsequent supervised layer? It can be observed that frequent patterns tend to get coded by combinations of a smaller number of units: their representation tends to be more localized, and they are coded more specifically. The infrequent, 'unknown' patterns are represented as sets of more frequent, 'known' features. Generalisation in a subsequent layer therefore works best for the infrequent patterns, which are coded in terms of known features, while the more frequent patterns can be associated with their outputs more specifically, without interference from other associations.
4.7. Comparison with other competitive algorithms

The mechanism proposed here is an unsupervised learning algorithm, and it is natural to compare it to 'winner-take-all' competitive learning and to the methods of cluster analysis known as 'clumping' and to vector quantisation (Everitt, 1974; Gray, 1984). Assigning a single class to each input pattern corresponds to dividing the pattern space into discrete, disjoint categories, which, as was pointed out above, is often not desirable, although it may be useful to consider forms of clustering with overlapping classes. The network proposed here instead recodes input vectors by combinations of components; a hierarchical arrangement of such subnetworks, each layer extracting a different form of redundancy, possibly over different spatial scales, could also be considered.

To compare the performance of the proposed algorithm to that of simple competitive learning, a modified version of Model 6 was simulated in which the 'winner-take-all' dynamics of competitive learning was implemented: learning and threshold adaptation took place only on the unit with the maximal activation, while the other parameters were the same as in Model 6. The results on the alphabet example are shown in Table 3, together with the result of Model 6 itself and that of the untrained, random network. The results were obtained after 8000 learning cycles, with the exception of the random (untrained) column, where there were no training cycles, only the 100 initial threshold adaptation steps, as in all the other runs. As can be seen in the table, the anti-Hebbian network gave a lower redundancy as well as a lower rate of information loss than the modified competitive algorithm. The random net gave a code analogous to the random recoding used by Marr and Kanerva (Marr, 1969; Kanerva, 1988), which performed significantly worse with this non-random input distribution.
                            entropy (E)     sum of bit entropies (e)    redundancy (R)
input                       4.34            24.14                       456%
random (untrained)          1.91 (44%)      3.94                        106%
modified competitive        3.67 (85%)      6.88                        87%
anti-Hebbian competitive    4.22 (97%)      5.86                        39%

Table 3. Comparison of the anti-Hebbian competitive network with alternative coding schemes. The percentages in the entropy column give the proportion of the input entropy preserved by each code.
One of the possible reasons why simple competitive learning performs worse than anti-Hebbian competition is that while the anti-Hebbian mechanism tends to assign a sparser representation to frequent patterns and a more distributed one to infrequent ones, it is exactly the opposite in the case of simple competitive learning: as dense regions of pattern space attract most of the weight vectors, frequent input patterns evoke a response from a much larger number of units than infrequent ones, increasing the redundancy of the code (figure 23).

Figure 23. Simple competitive learning assigns more weight vectors to dense regions of the pattern space. With the winner-take-all mechanism disabled, patterns in the dense regions cause activation in a larger number of units than ones in sparse regions of pattern space.

Other 'soft' competitive schemes have also been proposed in which multiple or more distributed outputs are allowed instead of a complete 'winner-take-all' competition, such as Grossberg's 'contrast enhancement' (Grossberg, 1976b, 1982), a scheme in which competitors adapt in proportion to the relative probability that the input came from a Gaussian centred on their weight vector (Nowlan, 1990), and related mechanisms studied by Rolls et al. (Rolls, Littlewort, et al., 1991). As these models have continuous outputs, it is not easy to compare them to Model 6 on the basis of entropy measures. Carlson proposed a nonlinear version of the Rubner-Schulten model, similar in form to Model 6 (Carlson, 1990), and used it to code one- and two-dimensional continuous inputs into binary outputs; the arguments mentioned in chapter 3 against such a hierarchical scheme also apply to this case.
4.8. Assumptions and biological relevance

The assumptions of Model 6 are more realistic than those of the previous models: the units here take only positive activation values, and their outputs are a nonlinear function of their total input. In the previous models, individual connections were allowed to take both positive and negative values, and sometimes to change sign; such reversals are not believed to be characteristic of real synapses. In Model 6, all feed-forward connections had positive values, as would be appropriate for the excitatory synapses of pyramidal cells, while all the feedback connections were negative. This assumption still needs an interpretation, as the main cortical cell type, the pyramidal neuron, is excitatory. As mentioned in Chapter 2, there are several possible ways to achieve a functionally negative interaction. One possibility is that the inhibition is mediated indirectly by an inhibitory interneuron, or a small group of interneurons; such 'learning interneurons' with adaptive connections were demonstrated to have an effect in linear networks very similar to that of direct negative connections (Baddeley, 1991; Plumbley, 1991), and they would also be biologically plausible in nonlinear networks. An alternative would be an unspecific, fixed baseline of inhibition mediated by interneurons, with the excitatory connections onto the interneurons modified so that the total effective connection changes in the desired, anti-Hebbian direction.

Note that in Model 6 the neural activations had continuous values, and the rounding to binary values was used only in the learning stages and in the evaluation of the performance of the network. Rounding results in a nonlinear, discontinuous learning rule, which may in fact be more plausible than a rule linear in the unit activations: the cellular mechanisms underlying long-term synaptic potentiation are believed to be highly nonlinear, caused either by the voltage-dependent unblocking of the NMDA receptor channels or by the saturation of the calcium buffering system in the post-synaptic spine heads (Holmes & Levy, 1990). Rounding also keeps the set of output patterns relatively simple; a larger or different output set size would have made it impossible to calculate the entropy, by which the performance of the network was evaluated, and would have made the simulation much harder to carry out.

An essential additional mechanism in Model 6 is that of threshold adaptation. This mechanism was necessary to keep the units firing at or near the predefined low level of activity; a similar mechanism has been used to avoid the problem, known from simple competitive networks, of units that never win the competition (DeSieno, 1988; Grossberg, 1976b). The 'firing probability' interpretation of the adaptive threshold is biologically not implausible: neurons could regulate their own level of excitability as a function of the history of their activation.
There is experimental evidence for such phenomena: neurons increase their excitability after a longer period of inactivity, and decrease their excitability following periods of repetitive firing. These effects are relatively short-term, and are thought to involve the after-hyperpolarisation of the cell (Hotson & Prince, 1980; Lancaster & Adams, 1986; Douglas & Martin, 1990); longer term changes in excitability, found for example in the hippocampus (Levy, Colbert & Desmond, 1990), could also be considered. Whether such changes in intrinsic excitability, as opposed to changes in synaptic strength, are involved in keeping the firing probability of cortical neurons at a relatively low level in the way assumed in the model remains to be determined.

There are two further questions of concern relating to the model. The first concerns the temporal dynamics of the network. The output of the network is the stable state that the activations settle into after an initial transient, and reaching this state requires time for the recurrent, anti-Hebbian interactions to take effect. Results on response latencies in the visual system leave little room for such extensive recurrent processing in the initial generation of the response: cells in the macaque temporal cortex selective for head views have response latencies ranging from approximately 69-213 ms from the onset of the visual stimulus, and they show highly discriminative responses within the first 5 ms of their response (Oram & Perrett, 1991), while the shortest latencies of cells in the primary visual cortex are at least 40 ms and the route to the temporal cortex requires a minimum of 4 further feed-forward steps. For the cells with the shortest latencies this leaves almost no unaccounted time for the generation of the response by recurrent lateral or top-down interactions.
In a fully adapted network relying only on feed-forward connections, the correct response could be generated quickly, with the recurrent connections not playing a significant role in the initial generation of the response, only making minor corrections. These latency results therefore seem to limit the role of recurrent, lateral or top-down interactions in the generation of the response itself; they could, however, still be important in the development of the feed-forward connections, in the guidance of plasticity, and in longer term changes in response to the statistics of the input. This question concerns not only the present model but all models of this class (including auto-associative memory models) in which the output is the result of settling through recurrent interactions: simulations suggest that settling into a stable state can take 10-15 times the time necessary for the corresponding feed-forward operations.

The other question raised by the present model concerns convergence. The convergence of the code generated by the network is not guaranteed by the learning rules. It was observed in the simulations that after extended learning the code changes only slowly, but it is not completely stable: as long as the plasticity mechanisms are active, further changes in the input statistics may still change the code. This stability problem is one associated with most competitive and self-organising algorithms as well (Carpenter & Grossberg, 1987). The existence of critical periods of plasticity, or additional mechanisms such as a threshold on the amount of synaptic or cellular change, may be necessary to obtain a completely stable code.
4.9. Hebbian and anti-Hebbian mechanisms in the visual system

Perhaps the most important property of neurons is that they can connect to a large number of other cells in a specific way, and can therefore perform well-defined processing by combining the signals they receive from previous stages of feature extraction. The stimulus parameters to which neurons of the visual cortex are selective vary greatly, but it can be said that they are highly selective, responding only to a very small proportion of the vast range of possible stimuli (Barlow, 1975; Hubel & Wiesel, 1962). Much of the anatomical structure of the visual system and of the selectivity of its neurons is determined genetically (Hubel & Wiesel, 1977; Hubel, Wiesel & LeVay, 1977). However, there is also plenty of evidence that experience of the normal sensory environment is necessary for the proper development of receptive field properties (Hubel & Wiesel, 1970; Movshon & Van Sluyters, 1981; Rauschecker & Singer, 1981; Sherman & Spear, 1982; Fregnac & Imbert, 1984; Kleinschmidt, Bear & Singer, 1987; Shatz, 1990), and it has been demonstrated that the experience-dependent modification of synaptic strengths giving rise to these properties depends on the temporal correlation between pre- and postsynaptic neural activity (Fregnac, Shulz, Thorpe & Bienenstock, 1988; Brown et al., 1990). Visual deprivation experiments point out the significant role of activity-dependent, self-organizing mechanisms in the development of the visual system. Functional modifications of receptive field properties have also been inferred by comparing the distributions of receptive field parameters in populations of neurons in animals raised under artificially altered statistical structures of the visual environment. Kittens raised in environments with many vertical but no horizontal lines, or without the experience of continuous motion (e.g., in a stroboscopically lit environment), have more cells responding to the features that were present in the environment than to the missing ones, and other studies gave similar results for other visual parameters to which cortical neurons are sensitive, such as orientation and directional selectivity.
These results suggest the existence of experience-dependent mechanisms that allow the developing neurons to become tuned to the statistical structure of their particular environment. Such mechanisms would, on the one hand, relax the requirements on the genetic specification of the developmental rules and of the connections of individual neurons, while on the other hand they would allow the selectivity of the neurons to be best suited to the statistics of the environment in which the animal lives. Many models have been proposed in which the development of receptive field structure, orientation selectivity and ocular dominance is explained purely by local, Hebbian modification of the feed-forward connections, driven by the structure of the activity arriving from the retina (Malsburg, 1973; Bienenstock, Cooper & Munro, 1982; Linsker, 1986; Barrow, 1987; Miller et al., 1989; Yuille & Kammen, 1989). The models presented here also use local learning rules, but they differ from the previous ones in that they concentrate on the representation of information rather than on the spatial arrangement and precision of the receptive fields, and in that Hebbian feed-forward learning is used in combination with anti-Hebbian mechanisms: decorrelating inhibitory connections between units, including those with similar feed-forward selectivity, can help explain the complexity and sparseness of the neural representation.

Sparse coding appears to be used widely in the sensory system. In the visual system, sparseness is increased as information passes through successive cortical areas, where an extremely high level of selectivity is achieved: some highly pattern-selective cells in the temporal cortex respond preferentially to hand-like shapes or to faces (Gross, Bender & Rocha-Miranda, 1969; Gross, Rocha-Miranda & Bender, 1972; Perrett, Rolls & Caan, 1982). Cells in these areas respond best to faces, with some cells showing selectivity even to the spatial arrangement of the features of the face or to the identity of the face, while other cells are selective to combinations of parameters such as velocity and direction of motion. Such extremely high selectivity for arbitrary stimuli may not be important, and as faces are exceptionally important stimuli for monkeys, special mechanisms may exist for their processing. Nevertheless, it is also unlikely that face processing is completely determined genetically, as demonstrated by the effect of experience on the responses of face-selective cells (Rolls, Baylis, Hasselmo & Nalwa, 1989). A similar principle is used in the taste system, where cells at successively higher levels are more selective and become more narrowly tuned (Rolls, 1989a).
Chapter 5. Learning invariances

How can we consistently recognise objects when changes in the viewing position, distance, angle, eye position, size, orientation, or deformations of the object itself (e.g. of a newspaper or a gymnast) can change their retinal projections so significantly? The visual system must contain knowledge about such transformations in order to be able to generalise correctly. Part of this knowledge is probably determined genetically, but it is also likely that the visual system learns from its sensory experience, which contains plenty of examples of such transformations. Simple model neurons, such as the ones studied in the previous chapters, take weighted sums of their inputs, so these simple units respond similarly to patterns only if the patterns are close in Hamming distance, and they show a form of generalisation which clusters 'similar' patterns into the same category. Any simple transformation, however, such as a shift in position or a rotation, or even a relatively small distortion of shape, can cause a great difference in Hamming distance, so the transformed image generates a very different output, while patterns that differ on only a small number of input lines generate similar outputs. 'Similar' as judged by Hamming distance is therefore very different from our subjective judgement of similarity, and simple units are poorly suited to generalising across such transformations. The solution to this problem is likely to require either a more complex model of a neuron, or a network of simple units.

5.1. Shift invariance

Fukushima proposed a solution to the positional invariance problem (Fukushima, 1980; Fukushima & Miyake, 1982; Fukushima, 1990) using a hierarchical network consisting of alternating layers of feature detectors ('S' or simple) and invariance ('C' or complex) units. Feature detectors in the 'S' layers are replicated in many different positions, while the outputs of detectors of the same feature from different positions are pooled by units in the 'C' layers. The presence of the feature in any position within a region can therefore activate the appropriate 'C' unit.
& Miyake,
1982)
by a hierarchical
and invariance in many
are pooled
(‘C’ or
different from
posi-
different
positions region
in the ‘C’ layers. can therefore
The presence
activate
the appropriate
This idea is consistent
with
(Hubel
& Wiesel,
1962; Spitzer
receive
their
inputs
same
major
orientation
detecting to more
in different
selectivity
are completely
invariance,
but some
has successfully
been applied
al., 1989). LeCun using
be adjusted.
Fukushima’s
is also
not
a global
model.
invariance tasks
There is
is limited
(Nazir
modeling assumptions
way,
weight
plausibility
vector
one of the ‘simple’ weights fied
However,
neural
(except
units
rise
sensitive only
shift
1980; LeCun
et
(zip codes)
by
that need to
the operation
limited
units
This architecture
digits
on perceptual
networks
achieving
in accordance that
with
translation
discrimination
way
learning
- 83 -
to ‘complex’
mechanism),
This operation
in
essential
the question
by LeCun units
et al.,
in a fixed
unit have the same
whenever
to the same complex
sharing’).
from
used
to a complex Therefore
of their
Apart
algorithm
(e.g., by a Hebbian
also be helpful
of some
are connected
in position).
units connected (‘weight
may
seems very difficult.
that are connected
are modified
the same
by these
detectors
for a shift
of all the other simple
in exactly
not
(Fukushima,
the implementation
networks
that the feature
units
system
way,
psychophysically
of the backpropagation
and that all the simple
input
visual
are still
of free parameters
is spatially
demonstrated
presented
system.
in biological
assume
of handwritten
but
giving
1990).
of the principles
of the biological both models
been
in the human
the visual
recognition
transform
they
In this
that
for the
of feature
gradually
while
problems
evidence
cells
selective
is also achieved.
the number
cortex
In the top layer,
recognition
to reduce
visual
the pair
way,
invariance.
tolerance
psychophysical
It has
& O’Regan,
Some
reliable
constraints
subunits
of its components.
in pattern
a limited
that complex
model,
of the pattern,
of distortion
et al. achieve
such architectural
invariance
position
degree
assume
in a hierarchical
of positional
to the position
relative
1985) which
In Fukushima’s
range
within
cells in the primary
cells or simple-cell-like
is repeated
and a larger
to the approximate
of complex
positions.
layers
indifferent
models
simple
in any position
‘C’ unit.
& Hochstein,
from
and invariance
of the feature
the weights
of
the corresponding unit need to be modiis non-local
for
the
synapses
of all the units
signal broadcast either,
is not available
version
the same complex unit
by letting
(a ‘family’)
simple
adaptively
units
Instead
would
would
detectors
of requiring
simple
in an identical
independently
unit (form
‘clubs’).
simple-to-complex
units way,
permanently
and then allowing A learning
connections.
A simple
Hebbian
patterns
If the input
to the simple
layer contains
between
detectors
tions would
Model
of that feature
prevent
in different
positions.
to to a
ones to connect needed
to specify
rule, which
depends
only detect overlapping an example
of the feature
never be significant
The absence
of positive
overlap correla-
to the same output.
7
in which
solution
(first
the modification
to a time
average
(a ‘trace’)
to keep the weight
published
of the synaptic
not to the pre- and postsynaptic
order
then there will
those units being connected
The proposed rule,
at any moment
connected
similar
rule is therefore
here as it would
position
all connect
the same goal can be achieved
does not work
one spatial
for local learning,
of the same feature
activations,
at only
not solve this problem
be necessary
only on instantaneous in the input.
A ‘learn now’
mechanism
in which
to develop develop
to a complex
these modifiable
trace
is needed
unit.
which
units
modified.
units.
rule with
An arrangement
originally
unit to all its simple
of the input,
to the simple
Hebbian
complex
for the one that was
by the complex
as the shifted
5.2. The trace rule

The proposed solution (Model 7, first published in (Foldiak, 1991)) is a modified Hebbian rule, in which the modification of the synaptic strength at time step t is proportional to the presynaptic activity (x) and to a trace of the postsynaptic activity (ȳ), i.e. a temporal average of past activity, instead of its instantaneous value. A decay term is added in order to keep the weight vector bounded:

Δw_ij(t) = α ȳ_i(t) [ x_j(t) - w_ij(t) ],

where the trace is updated according to

ȳ_i(t) = (1 - δ) ȳ_i(t-1) + δ y_i(t).

A similar trace mechanism has been proposed by Klopf (Klopf, 1972) and used in models of classical conditioning by Sutton and Barto (Sutton & Barto, 1981). The trace is a low-pass temporal filtering of the unit's activity, and it has the effect that activity at one moment can influence learning at a later moment. The modified Hebbian rule is still local.
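A minimal sketch of one step of this trace rule (Python; array shapes and parameter values are illustrative):

import numpy as np

def trace_hebbian_step(w, y_trace, x, y, alpha=0.02, delta=0.2):
    """One update of the modified Hebbian rule with a postsynaptic trace.

    w       : (n_complex, n_simple) connection weights w_ij
    y_trace : (n_complex,) traces ybar_i(t-1) of the complex-unit activities
    x       : (n_simple,) current presynaptic activities x_j(t)
    y       : (n_complex,) current postsynaptic activities y_i(t)
    """
    y_trace = (1.0 - delta) * y_trace + delta * y            # ybar_i(t)
    w = w + alpha * y_trace[:, None] * (x[None, :] - w)      # dw_ij = alpha ybar_i (x_j - w_ij)
    return w, y_trace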
5.3. Simulation

The development of the connections between the simple and complex units was simulated in an example in which the goal is to learn shift invariance. In the simple layer there are position-dependent oriented line detectors, one unit for each of 4 orientations at each of the 64 positions on an 8 x 8 grid. There are only 4 units in the complex layer, and the simple layer is fully connected to the complex layer by modifiable connections; the activity of the simple units is the input to the complex units. During training, oriented lines covering the whole array are selected at random from the four orientations and two directions of motion, and are swept across the visual field, giving rise to activation of the simple units of the appropriate orientation at successive positions at successive moments in time (figure 24). Simple units of only one orientation are active during a sweep. If an active simple unit succeeds in exciting one of the complex units, the trace of that complex unit gets enhanced and keeps its value for a period of time comparable to the duration of the sweep. Therefore all the connections from the simple units that get activated during the rest of that sweep get strengthened according to the modified Hebb rule, causing simple units of only one orientation to connect to the given complex unit. As the trace depends on the activity of the complex unit at an earlier moment, this learning rule embodies the assumption that the desired features are stable in the environment.

Figure 24. Five consecutive frames from one of the sequences used as input. The four segments at each position represent the activation of a simple unit of the corresponding orientation and position. Thick segments are active (x_j = 1), thin ones are inactive (x_j = 0). The trace is maintained between sweeps.

To prevent more than one complex unit from responding to the same orientation, some kind of competitive, inhibitory interaction is necessary between the complex units. For the sake of clarity, the simplest 'winner-take-all' scheme (Rumelhart & Zipser, 1985) was used in the simulation described here. (The use of decorrelating connections would also be possible.) Each unit took a sum of its inputs weighted by the connection strengths, and the output y_k of the unit with the maximal weighted sum was set to 1, while the outputs of the rest of the units were set to 0:

y_k = 1 if argmax_i ( Σ_j w_ij x_j ) = k, and y_k = 0 otherwise.

Figure 25.a shows the initially random connections between the simple and the complex units, while figure 25.b shows the connections after training with 500 sweeps across the retina. The result and the speed of convergence are relatively insensitive to δ, the constant determining the decay rate of the trace. Too short time constants (δ near 1), however, are not enough to link the different phases of the transformation and result in connection patterns similar to those demonstrated in figure 25.c. On the other hand, too long time constants (small δ) lead to confusion between the subsequent sweeps, as the trace does not decay between transformations.
Figure 25. Connection patterns of the four complex units (a) before training and (b) after training on 500 line sweeps across the visual field. The length of each segment indicates the strength of the connection from the simple unit of the corresponding orientation and position to the complex unit. Initial weights were chosen from a uniform distribution on [0, 0.1]; α = 0.02, δ = 0.2. (c) The result of training without a trace (δ = 1).

An alternative version of the algorithm achieves the same goal, and results in connectivity very similar to that given in the previous simulation. The learning rule in this version contains a trace of the simple units (a presynaptic trace), rather than of the complex ones:

x̄_j(t) = (1 - δ) x̄_j(t-1) + δ x_j(t).

This version is physiologically more plausible, as will be pointed out below.
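A compact sketch of the whole training procedure described above, combining the winner-take-all output with the trace rule (Python; the sweep generator is a simplified stand-in for the line sweeps described in the text, and α, δ and the initial weight range follow the figure 25 caption, while everything else is an illustrative assumption):

import numpy as np

def line_sweep(positions, size=8, n_orient=4, orient=0):
    """Yield simple-layer activity vectors for a line swept across the grid.
    Simple units are indexed as (orientation, row, column), flattened.
    Simplification: for every orientation index the line is stepped through
    the 8 positions in the given order."""
    for pos in positions:
        x = np.zeros(n_orient * size * size)
        x[orient * size * size + pos * size + np.arange(size)] = 1.0
        yield x

rng = np.random.default_rng(0)
n_simple, n_complex = 4 * 8 * 8, 4
w = rng.uniform(0.0, 0.1, (n_complex, n_simple))     # initial weights on [0, 0.1]
alpha, delta = 0.02, 0.2
trace = np.zeros(n_complex)                          # maintained between sweeps

for sweep in range(500):
    orient = rng.integers(4)                                         # one of 4 orientations
    positions = range(8) if rng.random() < 0.5 else range(7, -1, -1)  # two directions
    for x in line_sweep(positions, orient=orient):
        y = np.zeros(n_complex)
        y[np.argmax(w @ x)] = 1.0                    # winner-take-all output
        trace = (1 - delta) * trace + delta * y      # postsynaptic trace
        w += alpha * trace[:, None] * (x[None, :] - w)   # modified Hebb rule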
The application of this algorithm to more complicated invariances (e.g., three-dimensional rotations or deformations) may be even more interesting than it is in the case of positional invariance, as it is even harder to see how such invariance properties could be wired in without some kind of learning; in Fukushima's or LeCun's algorithm, the way the transformation-dependent detectors should be specified and connected is much less obvious for such complex or abstract invariances. All that would be required by the proposed algorithm is that the environment should generate sequences of the transformation, from which the transformation-dependent detectors could learn to connect to the appropriate invariance units.

5.4. Assumptions and biological plausibility

Electrophysiological experiments suggest that important properties of visual perception are due to the receptive field characteristics of individual cells in the visual system. Complex cells in the primary visual cortex exhibit approximate position invariance within a limited range (Hubel & Wiesel, 1962), while cells in higher visual areas in the temporal cortex exhibit more complex forms of invariance, e.g. to the size, colour, contrast and retinal location of the stimulus, and they also have much larger receptive fields (Gross & Mishkin, 1977; Bruce et al., 1981; Perrett et al., 1982; Miyashita & Chang, 1988; Azzopardi & Rolls, 1989; Perrett, Harries, Bevan, Thomas, Benson, Mistlin, et al., 1989; Gross, 1991). The model presented above to explain such properties makes three fundamental assumptions.

The first one is about the form of the synaptic modification rule: it is assumed that a synapse is strengthened not only when the activations of the pre- and post-synaptic cells are coincident, but also when they are both activated within a relatively short period of time, with the presynaptic activation possibly preceding the postsynaptic one. The trace mediating this memory for recent activity could have many possible biological implementations.
As cortical pyramidal cells may have much longer membrane time constants than previously estimated (possibly greater than 100 ms) (Stratford et al., 1989), the simplest possibility is that the trace is kept in the form of an electrical state, the activation of the neuron itself, independent of synaptic mechanisms. It is more likely, however, that the trace is implemented by a chemical state that is only gradually changed: the running average of past activity may be retained in the concentration of a chemical in the cell, or in the state of the receptors. One possible implementation would involve the NMDA receptors, which are believed to be involved in long-term synaptic potentiation and which have relatively slow kinetics: the transmitter released by presynaptic activity at an earlier moment may still be bound to the receptor at the time of a later postsynaptic activation, allowing potentiation to occur (Holmes & Levy, 1990). Presynaptic activity is required to precede postsynaptic activity for this kind of potentiation in the hippocampus (Levy & Steward, 1983), and as it is a presynaptic trace that is in fact required in the alternative version of the learning rule presented in the previous section, such a mechanism would be available to implement it.
The second assumption concerns the connectivity of the network: the complex units are assumed to receive their input from simple cells with the same orientation tuning but with different position and phase tuning, so that the orientation selectivity of the complex unit is preserved across the transformation. While complex cells in the primary visual cortex are generally thought to combine inputs from simple cells, some complex cells are known to receive direct thalamic input from the LGN (Hoffmann & Stone, 1971; Stone, 1972; Gilbert, 1977; Bullier & Henry, 1979), so the strictly hierarchical scheme may not hold for all of them; for those complex cells that do combine the outputs of simple cells, the above mechanism could be used to learn which simple cells should connect to a given complex cell. Another example of transformation-independent responses is that of the face-selective neurons found in the monkey temporal cortex. Some of these face-selective cells respond selectively to only one view of the head (e.g. the right profile), while others respond to all views, invariant to head rotation (Perrett, Hietanen, Oram & Benson, 1991); the view-independent cells could learn to combine the responses of a smaller number of view-dependent cells by the mechanism presented in the previous section.
Response latencies of the view-independent face-selective cells are only slightly longer than those of the view-dependent ones (Perrett et al., 1991), which is consistent with the view-dependent cells providing enough direct input to drive the invariant cells. The view-dependent cells also show a relatively broad (approximately 60 degree) tuning to head rotation, so an excessively large number of transformation-phase sensitive cells is not required to cover all views; more generally, the need for a large number of transformation-sensitive units can be avoided if their transformation-phase tuning is sufficiently broad, so that successive phases of the transformation activate them without discontinuities, and shifts of only a fraction of the width of the receptive field occur between moments. This does seem to be the case with visual transformations, which are relatively smooth, although the problem is especially obvious when more than one pattern in the visual field is transforming simultaneously.

The third assumption is that the environment provides the transformation sequences necessary for training the invariance. In the case of learning shift invariance, object motion and eye movements constantly generate the necessary transformations; motion in three dimensional space, for instance, would provide ideal training examples for size invariance, as forward motion towards an object changes its retinal size, while rotation provides examples for learning invariance to perspective transformations. The necessary activations may even be present during early development, before the start of normal visual experience, as there is evidence that sweeping waves of activity are generated in the developing mammalian retina (Meister, Wong, Baylor & Shatz, 1991), and such waves would be able to train shift invariance. Other kinds of invariances appear to require more specific training examples.
inputs’
& Zipser,
involve
more
(Bienenstock
learning
&
in a Hopfield
19871, or the computation
1986). A biologically
to the present algorithm
(Rumelhart
which
matching
19891, Hebbian (Hinton,
the present algorithm
models,
such as graph
& Kuijk,
representations
of motion
sweeping
waves of activity
would
areas as ‘unsupervised
invariant
generated
of
or eye
stimulation
a more
or slower
or addition
visual
ideal training
1987; Buhmann,Lange,
network
motion
space, for instance, would provide
As none of these three assumptions to be more
object
examples
The necessary
(Meister,Wong,Baylor,
necessary for their training
appears
provides
invariance,
before the start of normal
retina
in three dimensional
shift
the necessary transformations.
as there is evidence mammalian
transformations
is that the environment
In the case of learning
movements
towards
range
cells, whose receptive
cells also show a relatively
important
the transformation.
higher
total
there are
to head rotation. The third
inner
the
when invariance
if their transformation-phase
field with a large number
sensitive
cover
is whether
The need for an excessively
This does seem to be the case with visual simple smooth.
consideration
more
plausible
be to use backprojections to bias Hebbian
of
from
mechanisms
1985). Such a mechanism
could
Figure 26. Invariances: position, orientation, size/depth, shape, view.