Models of sensory coding

0 downloads 0 Views 6MB Size Report
units themselves carry out a relatively simple computation by combining (usually taking a weighted .... algorithms the information about the weight changes is not locally present at the site of the synapse ...... was to give weights 'momentum' ...... Studying such simple models, of course, does not answer our original question.
Peter Fijldigk

Churchill

College,

September,

Models

Cambridge

1991

of sensory

coding

This thesis is submitted for the degree of Doctor of Philosophy.

Preface

This dissertation outcome submitted

Signed

is the result of my own work and includes

of work done in collaboration. for a degree or diploma

It is not substantially

or other qualification

Date

nothing

which is the

the same as any I have

at any other University.

Contents ............................................................................................................................. 1. introduction ........................................................................................................ processing.. ........................................................................ 1.1 The goal of sensory reduction ................................ 1.2 Regularities in natural stimuli and redundancy decorrelation.. ................................................................................. Chapter 2. Adaptive 2.1 Lateral interactions in the cerebral cortex ....................................................... learning.. ......................................................... 2.2 Decorrelation by anti-Hebbian 2.3 A formal model ............................................................................................... between several variables ......................................................... 2.4 Decorrelation 2.5 Decorrelation and single unit adaptation.. ....................................................... 2.6 Relation to Kohonen’s ‘novelty filter’............................................................... 2.7 Information theoretic properties.. .................................................................... 2.8 Biological relevance ........................................................................................ 2.9 Pattern adaptation and aftereffects.. ............................................................... 2.10 Decorrelation and associative memory.. ......................................................... 2.11 Decorrelation and supervised learning ........................................................... Chapter 3. Optimal linear feature extraction.. ................................................................. 3.1 Data reduction ................................................................................................. 3.2 Principal component analysis.. ......................................................................... 3.3 Oja’s algorithm ............................................................................................... 3.4 Simulation ....................................................................................................... 3.5 Transmitted information.. ................................................................................ 3.6 Assumptions and biological relevance.. ............................................................ ................................................................. 3.7 Comparison to related algorithms.. Chapter 4. Sparse coding.. ................................................................................................ 4.1 The Hebb unit as suspicious coincidence detector.. ............................................ learning.. ..................................................................................... 4.2 Competitive 4.3 Sparse coding.. ................................................................................................. mechanisms.. ................................. 4.4 Combination of Hebbian and anti-Hebbian .I .......................................... 4.5 Example 1: Learning lines.. ................................... 4.6 Example 2: Learning the alphabet.. ................................................................. with other competitive algorithms.. ............................................ 4.7 Comparison and biological relevance.. ............................................................ 4.8 Assumptions ................................. 4.9 Hebbian and anti-Hebbian mechanisms in development.. Chapter 5. Learning invariances.. ..................................................................................... 5.1 Shift invariance .............................................................................................. 5.2 Hebbian rule with trace mechanism ................................................................ 5.3 Simulation.. ..................................................................................................... 5.4 Assumptions and biological plausibility.. ........................................................ Chapter 6. Conclusions.. .................................................................................................... References.. ........................................................................................................................ Abstract.. Chapter

5 7 .13 .16 .19 .19 .21 .22 .26 .28 32 .33 .37 .40 .42 .46 .52 .52 .52 .53 .56 .59 .62 .63 .65 .65 .66 .66 ,68 69 .70 .74 .76 .79 .82 .82 .84 .85 .88 .92 96

Acknowledaements

I would encouragement work. discussions. version

like to thank and support

my supervisor,

and for the lot of interesting

I am also grateful

to Dr Graeme

I very much appreciate

of this

thesis,

Professor Horace Barlow, for a lot of help,

Mitchison

the useful comments

and the contribution

of many

ideas that greatly influenced for many of Dr Arthur other

people

very

my

interesting

Pete on a very late in and

outside

Cambridge. This studentship

work

was supported

from Churchill

College,

by an Overseas Cambridge.

-4-

Research

Studentship,

a research

This

work

computational

is aimed

principles

system detects physical representations

quantities

processing

of the environment

behavioural

decisions

some

of the general

in the brain.

and transforms

The sensory

them into internal

are based. The properties

that make

from a complex

array of receptor signals are considered.

The brain needs extensive knowledge for the interpretation

about the statistical

structure

of sensory signals. The acquisition

of the sensory

and use of such

knowledge

are studied using models consisting

of networks of simple processing units with

properties

that are believed

essential in biological

to be functionally

As the information modifiable

connections

modification learning

between

about

input

distribution

the form

are studied.

important

without

patterns,

for understanding

form

an uncorrelated

mechanism

can match the distribution units,

An uncorrelated,

square error correcting

modification

representation

achieving

the nervous

between

statistical

rules’,

local

system. Specific learning

a predefined

structure

set of

in the input

rule is demonstrated of the correlated

of input patterns information-theoretically

equal variance learning

the activity-dependent

or supervision:

synaptic

adaptively

is due to the adaptive,

in four ‘unsupervised’

a mapping

but to discover

external guidance

neurons.

One class of such ‘learning

of these rules are studied

1 - An ‘anti-Hebbian’

units.

the rules governing

the goal is not to implement

and output

the representation

capacity of these networks

the units,

of these connections

tasks, in which given

processing

rules, are particularly

hypotheses

a

useful are discussed, and ways in which the sensory system may form such

representations

environment

and modelling

of sensory information

on which

representation

at understanding

-5-

input

signal.

to the actual signalling optimal

signal also makes optimally

possible.

to be able to

signal

efficient

This

space of on noisy least-mean-

2 - A combination implement

a form

of Hebbian

of the statistical

reduces the dimensionality

the higher-order of information. associative

occurring

units

of biologically

redundancy

Such a representation

is demonstrated

Component

Analysis,

signal while maximising

is advantageous

which

the information

are noisy.

more plausible,

of the representation

to

nonlinear

units is shown

substantially without

reducing

considerable

loss

if it is to be used in further

stages.

4 - A Hebbian processing

of Principal

code inputs into a sparse representation,

statistical

learning

connections

even when the units themselves

arrangement

to be able to adaptively

method

of a noisy Gaussian

content of the representation,

3 - A similar

and anti-Hebbian

rule

to respond

transformations

modified

by a trace mechanism

in a way which

of the input signal.

-6-

is invariant

is studied, with

respect

that allows to commonly

Chapter

1. Introduction

Understanding

the computational

problems.

of brain function

interesting

and difficult

discoveries

over the last several decades concerning

different

scientific

principles

levels, we do not yet sufficiently

to see, hear, to learn skills, remember Work intelligence

brains so effortlessly, computational emulate

vision,

The essential human

and other branches

or motor

failure

by human

coordination,

performance

of artificial and animal

are in fact hard

of so far all man-made

machines

to

at most of these tasks was found largely

of background

is needed to solve even such seemingly

simple problems.

efficiently

of the brain at several

tasks performed

to be due to the fact that a large amount

access knowledge

of

how the nervous system enables us

robotics

such as object recognition

or even approach

the structure

understand

that even the simplest

problems.

there has been an explosion

events, to plan actions and make decisions.

in the field of computer demonstrated

Although

is one of the most

knowledge

about the environment

The ability

to encode, learn and

may be a key factor that allows the brain to perform

well

at many of these tasks. The

field

Churchland, signals

models

governing models

analytically

the question

particularly

to capture,

networks

successful

of neurons

science is the abstraction treatment

the computational

systems. The main

internally

and chemical)

It uses simplifying,

of abstraction,

in nervous

consistent,

advantage

of

and can be treated

used in all branches of science. An example

that helped

of the Turing

of theoretical

(electrical

&

simulation.

are widely

model

(Sejnowski,Koch,

and process information.

at some level

or tested using computer models

neuroscience’

of how physical

is that they are well-specified,

Simplifying

nervous

‘computational

are used in the brain to represent

principles

a formal

named

1988) addresses

computer-based

formal

recently

to capture

machine.

problems

system are useful for isolating

important

It provided

in the field.

the basic computational

-7-

principles

of a

in computer

a conceptual

framework

Simplifying

models

problems,

for

of the

and it is hoped

that

the principles

understand

revealed

using

these models

some of the computational

Neurobiological simplifying

models

assumptions.

of the action potential

differ

An example

One interesting

aspect of this model

conductances

later with

make

hypotheses

testable

postulated

greatly

in the degree

of a realistic

in models

differential

model

to help

to which

they

is the Hodgkin-Huxley

equations

us

model

(Hodgkin

& Huxley,

1952).

the voltage-

to account for the data were discovered

of a new technique.

that go beyond

make

of the membrane

was that the ionic channels underlying in the model

the introduction

In general,

immediate

we should

experimental

expect to

data if they make

predictions. It may seem desirable

and build

models

speculations however,

to start from our anatomical

by gradually

and assumptions is extremely

incorporating

case of an integrated

difficult

circuit.

the model

possible

with

respects, nonetheless first grasping

approach

it is certainly

and reality,

impossible principles.

systems

simulations,

or, even more importantly,

the system.

But even in cases in which

add to the explanatory

may differ

To maximise

system and in the

the computational

from

the system

a complex

of

in many

system without

the chance of finding

of simplifying

because of the limitations

power

approach,

matches

and testing should proceed simultaneously.

complicated

may be preferable

to

them, and then test the agreement

to understand

We will need to make a larger number

model

resorting

as for instance

seems to be to try to identify

The model

modeling

either

detail,

is much worse in both these respects with

for achieving

system.

some fundamental

between model

knowledge,

complicated

of the system’s operation,

As the situation

mechanisms

the actual

and more

faced with even a moderately

the brain, the most promising goals, consider

more

and physiological

when the facts run out. This ‘reverse engineering’

even when we know the basic principles

simpler

enough

in the squid giant axon, in which the dynamics

by a set of nonlinear

much

be general

processes taking place in the brain.

was modelled

dependent

will

assumptions

of the computers

because of the lack of our detailed these two considerations

to a more complex of the model.

-8-

with

more

used in the knowledge

of

are not limiting,

a

one if the additional

A model

to model

details

fewer parameters

do not may

account

for the data less precisely

correctly

to new observations.

a computation

The Turing

than a complete

presented

in the following

assumptions.

but it also has a greater machine,

simulation

chance of generalising

for instance, may be a better model

of the actual computer

chapters,

as all

In order to avoid confusion,

models

hardware.

do, make

of

The models

many

simplifying

an effort is made to make these assumptions

explicit. The nervous molecules

system can be modelled

and ions, synapses, dendritic

circuits

or behavioural

neural

and local circuit

systems.

at several different

trees, neurons, local circuits,

In the following

levels. The behaviour

models,

of single neurons

single-unit

recording

the last few decades

successful

in revealing

individual

nerve cells, on which most of our current knowledge

a great deal of knowledge

pattern

that carry the information

of neurons

maps, inter-regional

attention

as during

It is the precise activity

levels, at the level of

is especially

techniques about

is focused

interesting

were spectacularly

the response

properties

of brain function

with a wide range of response

necessary for the discrimination

on the

of

is based. properties

and recognition

of complex

patterns. Neural distributed

network processing

interconnected

simple

models models’)

a weighted

sum of) the inputs

computations

synapse-like simple

to other units. Despite

due to the large number

the representation

of information

another.

Unlike

networks

of how

nontrivial

their simplicity,

as patterns

algorithms,

-9-

neural

of activity

properties,

by combining

‘parallel of

highly

Typically

axon-like

on the connection

lines

lines. The

(usually

taking

scalar value as an output they can perform

networks

through

elaborate which they

help us think

about

across a large population

transformations neural

or

computations.

specific connections

Formal

as synapse-mediated

most conventional

weights

computation

of highly

and send their outputs.

and computations

issue

they receive to give a single

receive their inputs

units,

models’

of units with some neuron-like

carry out a relatively

that is transmitted

the

‘connectionist

units can perform

the units and modifiable

units themselves

called

address

processing

they consist of a large number connecting

(also

networks

from

one pattern

have an inherent

of to

ability

to learn. Learning and learning

takes place by the modification

algorithms

exist that specify

means of such algorithms perform

complex

of the connections

how connections

be modified.

By

a network can be trained - as opposed to being programmed

- to

are interesting

or even surpass

the performance

problems

related

to pattern

formalise

and therefore

in their own right

of conventional

processing

to program.

speech from text (Sejnowski

and recognition,

For instance,

& Rosenberg,

solve a wide range of complex

An additional

advantage

they

extremely

fast;

calculation

computer

networks

between data points,

supported

examine

of networks of

as opposed

the

is hard to to generate

speech recognition

is that once they are trained, unit

activations

is

‘memory-based’,

to most conventional,

system, however,

which are currently Many

neither

hypotheses

as models

the assumptions

or ruled out by anatomical

about the nervous

definitely

they make,

and physiological

almost

storing

and

‘computation-based’

linearly

sum their

suggested,

however,

calculating

a complex

proved

that each neuron

1983; Koch & Poggio,

the dendritic

about the potential

function

should

whether

of hypotheses,

(Durbin

cortical

of a

models is that neurons tree, it has been

& Rumelhart,

neurons,

processor

1989; Koch &

require

from one another

pyramidal

power

as a sophisticated

1983). These models

decoupled

-lO-

processing

in network

be considered

1985; Swindale,

identified

they are

Our present knowledge

of the dendritic

of its inputs

tree must be electrotonically

of morphologically

and consider

or disproved.

Based on the complexity

logical

processes in the

still leaves room for a large number

have been made

inputs.

of computational

evidence.

single neuron. One of the most widely used assumptions

Analyses

in

algorithms.

we must

Poggio,

can be trained

in areas including

are fundamentally

If we want to use neural networks brain,

where the solution

a network

problems

control.

Neural

especially

et al., 1989), and they can be used to

and robotic

instantaneously.

programs,

with,

19871, or to read postal codes from envelopes

engineering

the

as they often compete

computer

(LeCun,Boser,Denker,Henderson,Howard,Hubbard,

interpolating

should

tasks.

These algorithms

are

between the units,

that parts of and the soma.

however,

have

shown

that

their

dendritic

compartmentalisation Koch,Douglas, does

not

(Douglas

& Wehmeier, completely

compartments, strictly

is

& Martin,

& Colonnier,

out

some

local

More

local

interaction

1983), though

also agrees well

function

interaction

may

a

take

with

& Nelson,

the measured

Cugell

& Robson,

1966; Movshon,Thompson,

appear

completely

unreasonable

visual

a small

of 1988;

number

by a single

place

receive

of

neuron

on dendritic

is

spines

dual synapses

(Beaulieu

1991). The assumption

of input

responses

& Tolhurst,

as a first

degree

& Jack, 1990). This

computed

still

high

& Whitteridge,

within

only about 10% of spines

1985; Dehay,Douglas,Martin,

summation

for

1990; Douglas,Martin,

but it does mean that logical

& Poggio,

unsuitable

1990; Stratford,Larkman,Mason,Major,

rule

constrained.

(Koch

structure

of cortical

cells (Enroth-

1978), and therefore

approximation

to the function

does not

computed

by a

single neuron. Another value was

common

to other

units,

suggested

information

correspond structure

& Morrone,

1981; Dayhoff of cortical

in the higher

on decoupled

questionable Yamane, timing

whether 1991).

this

strategy

evidence

of individual

action

potentials

importance

relative

an approximation the output

to consider

were

to the firing one scalar

be distorted

structure spines

would

likely

and

and lost (Douglas

and it is

(Young,Tanaka,

hypothesis,

to carry

only

It is therefore

&

local, nonlinear

be required,

this

value corresponding

&

& Martin,

be attenuated,

in the brain

to support

frequency.

a

of neural firing,

used

is still

may constitute

are lo-20 ms (Douglas

would

is generally

It

1983; Sherry,Barrow,

above about 15 Hz would

or dendritic

a single, scalar

of the neurons.

intervals

& Gerstein,

neurons

frequencies

if later

secondary

Even

dendrites

frequency

in the interspike

et al., 19901, any frequencies contained

is that units transmit

to the firing

1991). To make use of the fine temporal

interactions

&

the precise information

of

not unreasonable

to the firing

as

frequency

as

is about

the

of a neuron.

Perhaps form

would

in most models

1982). As the time constant

1990; Stratford

Martin,

which

that fine temporal

code (Cattaneo,Maffei, Klemm,

assumption

of the rules

the most

crucial

governing

hypothesis

synaptic

neural

modification.

- ll-

network In many

models neural

make network

learning

algorithms

the information

the synapse weight

about the weight changes is not locally

(e.g., (Rumelhart,Hinton,

changes

would

require

& Williams,

a rather

computationally as models

powerful

of synaptic

must take the constraint only an important for efficient

electronic

computational

limitations,

associative

memory

unsupervised Zipser,

Willshaw

system

1982; Grossberg,

(Kohonen,

& Malsburg,

mechanisms

successfully

models

local learning

learning

used

in

rules,

is not

despite

functional

1976a; Malsburg,

evidence

their

models

of

19691, and

1973; Rumelhart

&

in the development

of

1986; Miller,Keller,

in the nervous

models

Locality

& Longuet-Higgins,

1976). There is also accumulating

underlying

as biological

of self-organisation

1984; Linsker,

candidates

but it may also be essential

Local

1984; Willshaw,Buneman,

(Amari,

(e.g., (Parker,

seriously.

neural networks,

been

1985), as well as in structural

the nervous

intended

modification

implementation.

have

(Kohonen,

learning

of synaptic

for biological

hardware

machinery

may be, they are not likely

in the brain. Networks

of locality

constraint

physical

of the

to be present in the nervous system. However

some of these methods

plasticity

1986)). The local computation

complicated

1985; Tesauro, 1990)), which is not thought

present at the site of

& Stryker, for

1989;

the biophysical

system (Brown,Kairiss,

& Keenan,

1990). Learning Methods

in general is the process of the formation

of supervised

the desired

output

learning

or a reinforcer

priate or not. These methods exceedingly

require

either

a ‘teacher’

of realistic

training

animals

and people

to learn from single or a relatively

An animal,

have few ‘supervised’

examples.

for instance,

of the input.

learning

of labelled

small

number

examples

about a poisonous

they become

require with

was appro-

a very large the ability

of trials,

are often

of

which

severely

food or a predator

may

opportunities.

In many cases, the complexity complexity

generated

This is in sharp contrast

as the number

learning

for each input

size, and usually

of labelled

restricted.

that provides

do not scale well with the size of the problem,

number

advantage

from examples.

that reports whether the output

slow faced with problems

can be a great

of a mapping

of the mapping

This is especially

to be learnt is largely

true in problems

- 12-

involving

due to the

perception;

it is

much easier to learn a mapping ‘run’

than

to map

an array

methods

in general

exploit

available

data arriving

form of instruction

1.1.

from a suitable of pixels

The goal of sensory

One of the most puzzling

questions

working higher

model animals,

but rather

(Craik, 1943) suggested animals

form

the fundamental

coded and represented

functionally Wiesel,

the cortex

specialised

that further

during

a car, playing

basic local connectivity

decisions.

of sensory model’

of

Such a

messages.

is thought

higher

range of problems

a piano, or reading cognitive

functions

of the ways in which

is characteristically

areas, its anatomical & Powell,

philogeny

expansion

model

In

to be

than those a book). It is without

first

sensory information

is

in the brain.

1974; Rockel,Horns,

so rapidly

principles

a working

of the

success of the cortex may be due to such a

that it can solve a far wider

(such as driving

is not this

and events

and behavioural

of this ‘world

that we will be able to understand

Although

situations,

that the brain constructs

to make predictions

the most sophisticated

that fueled its evolution

deciphering

in and what we experience

the objects,

located in the cerebral cortex. The evolutionary

unlikely

1987).

must also be necessary for the interpretation

large degree of adaptability

or any other

of sensory channels about the stream of data constantly

vast array of signals,

that allows

of readily

about the brain is how it makes sense of the

our senses. What we are interested

the world

amount

processing

bombarding

Craik

to

Unsupervised

to reinforcement,

1989; Lippmann,

carried by millions

environment.

in the large

relationship

information

changing

of e.g. ‘tiger’

representation.

regularities

any obvious

(Becker, 1991; Hinton,

representation

to the symbolic

the statistical

without

symbolic

divided

structure

patterns

a large

is surprisingly

is relatively

few additional

genetic

instructions.

are similar

in different

cortical

- 13-

number

uniform

1980). The fact that the neocortex

suggests that its wiring

requires

into

of

(Hubel

&

has expanded

easy to replicate,

and

The cell types and areas. Differences

are

often quantitative

rather than quantitative,

the areas. For instance,

layers receiving

from which most outputs

originate

not only on its anatomy,

artificially

rerouting

responsive,

orientation

(Sur,Garraghty,

1988).

animals,

selectivity

similar

embryonic

rat visual cortex transplanted

contain

with

& O’Leary,

of cells

normally

with

similar

between cortical

to different

principles.

selectivity

units unique

operation ticular

and

& ,Frost,

direction

1989).

Late

cortex was shown to

to somatosensory

cortex

on

cortical

common

algorithm

may be used, for example,

in the visual and somatosensory principle

a

suggesting

et al., 1988), or at least share

The same mechanism

behind

to generate

cortices.

the great variety

areas, what may this principle

of operations

taking

be? How does this cortical

relate to the task the brain has to carry out? How do we make sense of a parstimulus

when the number

of possible patterns

are very unlikely

to ever experience

high dimensional

patterns

our internal

of the environment?

model

As the survival

values

at certain

representations

is so large that we

that form an important

(or of a robot, for that matter)

rather than the directly retinal

of stimulation

the same pattern twice? How do we transform

into symbolic

of an animal

to objects in the environment intensity

to

areas seems to be that they receive input

operate

may

If there is some common place in different

orientation

somatosensory

(Gilbert,Bourgeois,Eckhorn,Goldman-Rakic,Jones,Kriiger,

directional

cortex

projections

other areas. There is a body of evidence

areas

cortical

some operational

in auditory

retinal

in area 17 (Metin

functional

visually

to those of cells in area 17

showing

to neonatal

identified

by early lesions),

By

1991).

difference

from and send outputs all

found

anatomically

The main

that

a percentage

to that

‘barrels’,

(Schlaggar

hamsters

cortex have responses to visual stimuli

of normal

aspects of its development.

cells were found

from

of

in sensory areas, while layers

in ferrets (induced

Results

functions

in motor areas. The cortex shows

but also in many

projections

to the different

are thicker

and direction-selective

& Roe,

somatosensory

inputs

are more pronounced

uniformity

retinal

and are related

locations,

- 14 -

detectable

the role of sensory

is causally

signals,

these part of

related

such as light

processing

must

be

therefore

to transform

patterns

of neural

relevant,

meaningful

further

the detected

activity

complex conditioning

assume pathways

& Wagner, available

1972; Sutton

stimulus,

however,

simple

individual.

Classical

between

models

the ‘labelled

conditioned

response,

they are formed

line’

be detected

of Pavlovian

such as plain

behaviour

of conditioning

explain

representation

stimulus’,

of what pattern

The physical

1927; Rescola may

stimuli.

on complex

be

More

features of the

or the face of a particular how an association sensory

should

signals must first be represented

is formed

stimulus

and the

signals are formed,

are not represented

of activity

‘CSl’ and

such pathways

tone or light

of a complex

1990). If stimuli

between

and instrumental

1983; Pavlov,

will depend a situation,

significant

cannot be based on

should

animals,

of an object,

at all (Tesauro,

stimulus.

models

model’

but they do not specify how these labelled

it raises the question

particular

A ‘working

1981). In lower

the appropriate

e.g. the detection

the behaviourally

1980; Mackintosh,

stimuli,

of

are to be used by

e.g. ‘US for ‘unconditioned

(Dickinson,

& Barto,

for extremely

frequently,

labelled

stimuli’

between

and causal relationships Classical

consisting

to these objects and other biologically These representations

need to be established.

of the environment.

‘CS2’ for ‘conditioned

related

where associations

signals; associations

entities

signals into representations

in the environment.

stages of processing,

the raw input

lines’,

more closely

entities

events of the environment

physical

locally

if

on ‘labelled

be chosen to represent

a

in terms of meaningful

categories before they can be used for conditioning. The information instance, only retina.

providing

evidence for the existence of an object in an image, for

is not tied to the activation

be inferred

by looking

The visual

interdependencies fundamentally

system

of any particular

for spatially should

distributed

therefore

be looking

photoreceptor, patterns

task involving

the detection

signal.

- 15-

of activation

for the complicated

of sensory signals caused by objects. Object statistical

its presence can

detection

of regularities

over the statistical

is therefore

of the incoming

a

1.2.

high

Regularities

in natural

Sensory stimuli

from the natural

statistical

Field,

redundancy

stimuli

and redundancy

environment

are far from random.

introduced

by different

kinds

1987). This fact is demonstrated,

for example,

by our ability

parts of natural anything

images

resembling

(Attneave,

a natural

; Kersten,

The following

of regularities

random

continuous

similar

intensity

distribution

almost values.

of intensity

single points,

within

Images expressed greater

This

points

is a second

pairs is different

and it is characterised

Other properties, be similar

everywhere;

Image intensity

order

statistical

from the product

by the autocorrelation

tend to have

regularity,

as the

of the distributions

of

function.

such as colour, texture, and stereoscopic

contain

edges and lines.

by the fact that the probability

depth also tend to

are colinear

This is third

order

statistical

close to where it was a moment tend to have non-zero

active is

than if they are not.

earlier.

Most local parameters

autocorrelation

objects are approximately

regularity

of any three points being equally

As objects move with finite speed, their image at one moment

Most

is

small regions of an image.

if the three points

position

images:

close to each other

rigid,

is likely

of images

to be

at a given

in time. which

causes their parts to move in

the same direction. Lines determined

and edges are not positioned by the shape of the objects giving

Objects

randomly

significantly

is

as well as more abstract

above chance level. For instance,

- 16-

position

rise to them.

can cause several of the above properties,

ones, to appear together

in space; their

at

as natural

images.

in natural

;

missing

to arrive

dot patterns,

Objects tend to have surfaces of nearly constant reflectance. therefore

to predict

of the space of all possible

are some of the causes of redundancy

They have (Attneave,

1987). We are very unlikely

image just by generating

images occupy only a very small fraction

reduction

the existence

of dogs

causes

‘barks’,

‘dog-features’

etc.) to

statistically

appear

together

regularities

and touch.

event (A&B)

the component

in other

property

cliches’,

1990a), is that the component

were

is significantly

greater

also called

(Barlow,

not independent:

than the product

taste, ‘feature

1972; Barlow,

the probability

of

of the probabilities

of

events (A, B) :

independence

( P(A&B)

>> P(A) P(B).

is just the opposite

redundant

define objects, features, concepts, categories, 1959; Barlow,

1961b; Barlow,

1985). In fact, the receptive patterns

of the defining

condition

of statistical

= P(A) P(B) >.

It is these non-independent,

redundant

such as hearing,

coincidences’

events are statistically

that this expression

(Barlow,

modalities,

of all these regularities,

or ‘suspicious

P(A&B) Notice

‘shorter-than-a-metre’,

more often than if these properties

can be found

The defining

‘sensory

constellations’,

the joint

much

‘has tail’,

independent.

Similar smell

(such as ‘furry’,

features, the ‘suspicious

and these are what we should

1969b; Barlow,

field properties

coincidences’

1972; Watanabe,

of sensory neurons

that

be detecting

1960; Watanabe,

are well matched

to such

of stimulation.

Orientation

specificity

is the most prominent

feature

of cells in the primary

visual cortex. Such cells are good detectors of edges and lines. Directional components.

Disparity

Some

growing

correspond

Highly straints imposed

and curvature

cells in the visual

others to patterns properties

and speed selective cortical selectivity cortex

or diminishing

to redundancy

cells are good detectors also correspond

show selectivity

to redundant

to rotating

in size (Bruce,Desimone,

caused by object motion

selective cells, e.g. face and hand detectors,

of moving

image

features.

patterns,

while

& Gross, 1981). These

in three dimensional are well matched

space. to the con-

by the existence of objects in the environment.

The ‘suspicious of sensory processing: define useful primitive

coincidence’ departures

detection

principle

from randomness

features, coincidences

-17-

can be applied

in the physical

of these primitives

at different

stimuli

levels

can be used to

can define appropriate

object representations, derive conditioned While objects

whereas coincidences stimuli

components

are relatively

correlated,

(Barlow,

independent

and left-hand

redundant

patterns

representation

the input of

be defined

independent

stimulation,

the

A recoding

of the signal

(or the environment

output

corresponding

to the current parameter

encoder can use the redundancy

the difference

between what is exp&ed

Two basic mechanisms

features redundancy

rise to the signal),

values. An adaptive

This kind

removal correlation

of correlations

of model

and ~--~ what is unusual

step towards

the

to the resulting

as a parametric with

the recoded reducing

so that the parameters is not one that imitates model,

showing

in the environment. __w--

here for redundancy

between the representation

hierarchy

statistical

reduction: may interact

independence

units. A mechanism

so

is the

for removing

is discussed in the next chapter. - The pattern

of synaptic connections

may be such that the cell at the higher lower level. This kind of mechanism

level responds

between cells at different to a redundant

is discussed in subsequent

- 18-

as

correlated

redundancy

1970), it is rather an inverse

will be considered

An important

of

system can be considered

- Cells at the same level in the processing as to reduce redundancy.

of highly

corresponding

of the signal to update its model,

(e.g. Marr,

not considered

from other such conjunctions.

giving

the state of the environment.

the system to be modelled

that are very highly

as conjunctions

statistical

model

best describe

Sub-patterns

in terms of independent

itself is reduced.

with each other, separate

sides of faces, are usually

separate objects. Objects could therefore

By recoding

correlated

of one another.

that are relatively

can be used to

1991).

of objects can be highly

e.g. the right-

sets of components

of objects with reinforcement

chapters.

levels

feature of the

up

Chapter 2.1.

2. Adaetive

Lateral

decorrelation

interactions

in the cerebral

In what way can units interact Inhibition

is known

that inhibitory

to play an important

mechanisms

The clearest retinal

ganglion

fields.

of natural

high-pass

filtering

spectrum

and removing

images.

as well. Cells carrying by stimulation

effects are caused by unspecific

lateral inhibition,

between

the simple

may be set up relatively More experiments

interactions

show that inhibition

on the pattern depend

on the difference

indicate

that

Freeman,

maximal (Blakemore

1991; Ferster,

is known

to be

that region.

of which depends

fields. Such a constant,

non-specific

Such

only on inhibition

mechanisms.

can be observed

in the cerebral

cortex.

Several

in the visual cortex is more specific

type of inhibition

found at lower levels, and depends

of the cells. Inhibition

between

the optimal

orientations,

inhibition

can be observed

& Tobin,

1972; Bonds,

1986; Fries,Albus,

between orientation

between

& Creutzfeldt,

also reveal inhibition

orientation

or between cells whose orientation

tuning

1988; Toyama,Kimura,

1981).

& Tanaka,

- 19-

tuned cells

and some experiments cells

tuned

to similar

1989; DeAngelis,Robson,Ohzawa,

techniques

degrees (Toyama,

between ganglion

about touch over a certain

the strength

Cross-correlation tuning

the power

between neurons

centre-surround

selective properties

field organisation

information

by changes in

or ‘whitening’

correlations

of

is well matched

is carried mostly

of the area surrounding

easily by ontogenetic

complicated

than the unspecific

orientations

receptive

hat” organisation

operation

by flattening,

receptive

region of the skin are inhibited

and it has been suggested

filtering

As information

centre-surround

redundancy?

1961a).

or decreasing pairwise, second-order

Such simple

the statistical

“Mexican

The resulting

reduces redundancy

present in other modalities

the distance

(Barlow,

is the centre-surround,

cell receptive

intensity,

to reduce

role in the brain,

reduce redundancy

example

to the autocorrelation

cell activities.

in order

cortex

mostly

1977; Nelson between

&

& Frost, 1978).

cells with the same

differed by not more than 30

Other direction

cells found in the middle

and velocity

McGuinness,

tuned

1985). In many of these cells, motion when the direction

of the centre.

These cells therefore

receptive

receptive

field.

regions

visual area of the cerebral

and have an antagonistic

most effectively

classical

temporal

field, but not to global

Such interactions

of the image

computational

strategy:

cortex tuned

to similar

to local,

motion

motion.

located within

stimulus

parameters

a small

inhibit

motion

textures,

to that

within

motion

seem to suggest

distance

the

or of the entire

caused by the common

These results

neurons

is identical

‘unexpected’

of extended

&

reduces the response

of the background

reduce redundancy

or the viewer’s

(Allman,Meizin,

in the surround

and velocity respond

surround

cortex are

of

a general

from each other in the

each other more than other cells in

the same region. The question

existence

of how the neural

precise

interaction

properties

between

be specified

mechanisms?

are greatly

determine

the highly

properties

themselves

Further experiments blocked

cortical

complicated,

without

as the classical by experience.

regulator determination

highly

the need

It would

response

for activity-dependent of cortical

be impossible

between

the

Can the

selective

response properties

interaction

for

highly

specific

the effect of the main of bicuculline

(Sillito,

altered

inhibitory

cortical

cells

cells

to genetically whose

inhibition neuronal

in the cortex

interactions

response

inhibitory

come

transmitter,

1984). The response

their direction

properties.

is cell specific,

excitability,

of response properties.

- 20 -

and it plays

from

GABA

is

properties

in the absence of inhibition:

and also change their ocular dominance

of cortical

raises

have not even been fully determined.

cells are drastically

mediated

effects

rise to them can be specified.

cells with

stimulus-specific

surround

cells lose their discrete ‘on’ and ‘off’ subregions,

selectivity, GABA

giving

genetically

influenced

by the application

complex

cortical

purely

evidence

in which

stimulus-specific

interactions

This seems unlikely,

themselves

visual

of these highly

simple

of and

and orientation

These results suggest that rather

than

an important

just role

a tonic in the

Many neural network learning

networks,

connections

has revealed

cortex,

the extent

the

competitive

main

enhancement’,

a uniform,

of the

Intracellular

anatomists

unmodifiable

or ‘winner

about

modifiable

the form of

recurrent

thalamic

axons

(Feldman

become

alternative

synaptic

plasticity

more

effective

at exciting

(White,

Decorrelation

The rule studied

filter’

by anti-Hebbian

the function

(Kohonen

(Barlow,

variables.

of ‘contrast

cells, and wij the effectiveness

conventional

Hebb constant

rule

could

form of the repulsion

the

rate

at all.

by Donald

Hebb

the postsynaptic

cell in the future.

An

sections.

1990b),

by Hebb.

and is also similar

It originates to Kohonen’s

law states that a ‘repulsion the activity

of the synapse

be formulated

determining

connections,

some version of

as formulated

If yi and yj denote

postsynaptic

at all,

1982; Grossberg,

often follow

of that suggested

& Oja, 1976). Barlow’s

correlated

In

learning

here is the opposite

‘law of repulsion’

increases between

mathematical

1989).

are considered

& Ballard,

the postsynaptic

to this rule is discussed in the following

Barlow’s

that thalamic

would store no information connections

staining

synapses even in layer 4 of

value, serving

selection’

find

19491, which states that synapses that are successful in exciting

neuron

positive

connections.

projections:

of cortical connections,

containing

the hypothesis

‘novelty

lateral

or input

features of the

1987; Rolls, 1989b). The constant values of these recurrent

the majority

Models

from

to the ‘feed-forward’,

for instance, when lateral connections

‘normalisation’

1976b; Lippmann,

2.2.

of recurrent

of termination

models,

and competitive

One of the most prominent

of local colateral

layer

they are often assigned

(Hebb,

interactions.

purely

perceptron

no more than about 20% of the excitatory

learning

constituting

lateral

is the abundance

axons contribute

such as the multi-layer

assign the task of learning

and neglect

cortex, however,

models,

constant’

of the pre- and

between

them

then

the

as: Awij = cx yi yj, where a is a small of learning.

Analogously,

law would be (Barlow & Foldibk,

-21-

the 1989):

simplest

if i f j,

Awij = - o yi yj , (or possibly

dwij/dt

= - cxyi yj in continuous

two cells, and wij the net synaptic in that the weight direction

simultaneous

decrease in the value of the synaptic an increase in the effectiveness of excitatory

time), where yi and yj denote the activity

weight between them. This is similar

change is proportional

is opposite:

to the pre- and postsynaptic pre- and postsynaptic

weight.

activities,

activity

results

rule

but its in a net

of inhibitory

synapses, or a decrease in the effectiveness

ones.

weight

building

the units to fire together

up between

thereby reducing,

A formal

external

number

or ‘weight’

the corresponding

positively

and eliminating learning

correlated

make it harder

the correlation.

for

This is why

rule.

model

Let us consider receives

were originally

the units will gradually

this rule could also be called the ‘decorrelating’

2.3.

to Hebb’s

of

Such a net decrease could either be due to

It is easy to see that if the two variables then negative

(1)

a set of n simple,

inputs,

linear

as well as feedback

associated

from

with each connection.

processing

units.

the other units. The unit multiplies

weight and adds these values.

X

Y

n

Figure 1.

- 22 -

Each of the units

n

There

is a single

each input

by

Model

1 Figure

input,

1 shows an arrangement

as well as feedback

dynamics

from

in which each unit

the other units.

receives only one external

The simplest

model

of the activity

is: n

dyi

Z dt

= - Yi

+

Xi

+ C

Wij

Yj,

j=l or in matrix

form

where x is held constant stable equilibrium

for each input

if I - W is positive

pattern.

definite.

The above differential

equation

has a

At equilibrium:

n Yi=Xi+CWijYj, j=l or y=x+wy,

y = (I - W)-1 x. Let T = (I - W)-I, ),

and Cxx be the correlation

then the correlation

of the output

matrix

of the input

([Cxxlij

=

is

C w = = TC,,TT, where < > denotes taking superscript

T denotes the matrix

Equation the input

after

1 gives the ‘on-line’

the network

calculations

version of the decorrelating

reached

a stable

we will avoid having

side of equation

result, if the environment Hebbian

patterns,

and

rule, in which

and the weight vector

state for the input

version

1 by its expected

is ergodic,

presentations,

learning

pattern.

In the

to deal with the effect of individual

vectors on the weights by using the ‘off-line’ right-hand

of input

transpose.

signal is treated as a succession of pattern

is modified following

the expected value over the distribution

value.

of the learning This will

rule, replacing

- 23 -

the

have no effect on the

and the step size is chosen small enough.

rule thus takes the form:

input

The anti-

Awij = - o ,

if i f j,

AW = - a offdiag(C&

,

or

where

offdiag()

(Alternatively,

is an

operator

in continuous

that

sets diagonal

time dW/dt

matrix

=-

where p is the correlation

-____ 1 - (&1)7-

The uncorrelated,

(l+w2)



matrix cxx =

%y

zero.

weights (w) between them.

T = (I - WY1 = correlation

to

= - a offdiag(Cyy).)

Let us consider two units with symmetric

and input

elements

coefficient.

(

.12 PO102 PO102 .22 1 ’

Then

012+2wp0l02+w2022 pcqq (w2+1) + (012+,22) (

anti-Hebb

rule

reaches

so the cross-correlation polo2

polo2 (w2+1) + (012+022) w20~2+2wp0*02+022

w

equilibrium

(Aw 12 = 0) when

w

the units

>

*

are

term:

w2 + (q2+022)

w + polo2

= 0,

which gives fixed points: q2+o$ wf = - 2pqo2

or when 01 = 02 = CT c(w) = po2 w2 + 202 w + pc9 = 0,

wf=-

1 *d l-p2 P



if p f 0,

and wf=Oifp=O. The stability

of the fixed points

is determined

by the sign of the derivative

f’(w) at the fixed points: c’(w)

1 w=wf=2wP+2

1 w=wf=dl-P2/

- 24 -

implying

that only

is a stable fixed point

of the weight dynamics.

The feedback weight as a function

Figure

2. Lateral

inputs

with correlation

Figure Gaussian

weights

of the input correlation

necessary to remove

coefficient

correlation

is plotted

between

in figure 2.

two units receiving

p and equal mean squared values.

3 shows the effect of the anti-Hebbian

variables

with

correlation

Besides removing

matrix:

correlations

have the side effect of decreasing

feedback between two zero mean 1 0.975 Cxx = ( 0.975 1 > . Wf = -0.8, c yy =

between the variables,

the variance

- 25 -

of the individual

the lateral units.

connections

-2.. . . .

-3 1

2.4.

Decorrelation

To demonstrate variables

the following

.

(b)

Figure 3. (a) 500 samples from a correlated distribution

.

-3-

(a)

(b) output

. .

with correlations

between

normal

removed

several

distribution,

C&t)

@+I)

= W(t)

rule between

cxx ((I _ &)jl

- a offdiag(Cyy

-26-

calculated

variables

was carried out:

= (I - &-l

to the network

by the feedback connections

the effect of the anti-Hebbian iteration

the input

)T

0)) ,

more

than

two

where W, Cxxl Cyyr and I are n x n matrices. numerical

integration

and a better However, Notice

by the Euler method

integration

method,

the modelled

that linear

i.e. the correlation

matrix

could be interpreted

of the continuous

is more

is driven

as the

form of the adaptation

such as the Runge-Kutta

process itself

decorrelation

This iteration

method,

naturally

could

expressed

only by the pairwise,

rule,

be applied.

in discrete

time.

second order statistics,

of the input.

. .

0.25

.

t

. t

.

0.

.

.

'0,

-0.25 log10 II offdiag(Cyy)

=

.

=

.

.

.

.

.

'

.

II -0.5

.

.

.

.

qj

-1. 5. 4. Logarithm

diagonal

elements

of the Euclidian

deviations,

minimal

and maximal

input

correlation

matrices.

(n = 10, a = 0.1)

The result Euclidean

norm of the output

is the Euclidean correlation iterations correlation

containing

matrix

matrices

is plotted

matrices

on a logarithmic

matrices

matrix,

of Cxx. L was generated random

numbers

is 1)offdiag(Cyy)

by Gram-Schmidt

-27-

matrix

normal

in figure

4. The

input

of the number the random

of

input

Cxx = L A LT, where matrix

containing

orthogonalizing

distribution,

values

11, where 11C II = 4X Cij2

semidefinite matrix:

with

generated

over a set of random

and A is the diagonal

from a standard

20.

(t). Average

scale as a function

a diagonal

1 . .

1( -1t

correlation

(n = 10) is shown

have to be positive by rotating

_

.

of iterations

The average of this norm

were generated

orthogonal

matrix

.

over a set of 100 randomly

for 10 units

correlation

norm.

(0. As correlation

L is a random eigenvalues

of the simulation

values

.

1 5..

of the output

of the number

with

’ I . .

10.

norm

set to 0 as a function

.

"tt .

-0.75

Figure

.

the

a matrix

while the eigenvalues

were taken from an exponential

distribution

of parameter

Cxx always had full rank. If some of the eigenvalues a corresponding

1. In the examples

given here

are chosen to be 0, the variance

of

number of outputs go to 0.

The algorithm

scales well with network

does not significantly

change with the number

size; the slope of the adaptation of units in the network,

curve

as illustrated

in

figure 5.

n = 100 n=50 n = 20 n = 10 n=5

-1.5*-

n=2

-2.Figure 5. Adaptation

as a function

average over 10 random

2.5.

and single

As demonstrated

in the previous

removes

elements

between

of the output

correlation

control’

signalling

amount

adaptation

section,

the anti-Hebbian

several variables. matrix,

In Model

adaptation

law

1, however,

the

i.e. the mean squared

range of the units, the conventional

this is even necessary if the input

different

1. Each curve is an

values of the

in any useful way. To adjust the range of actual values to the

needs to be incorporated

as the interaction

of units in Model

(a = 0.01)

unit

correlations

units are not constrained physical

matrices.

Decorrelation

successfully diagonal

correlation

of the number

into the model. variables

necessary to eliminate of reduction

or ‘gain

are normalised

at earlier

stages,

between pairs of units may cause a

in the variance of different

- 28 -

adaptation

In case of more than three variables,

themselves

correlations

‘single-unit’

units.

Models single-unit

Model

2 and 3 are two alternatives

modifications

to Model

1 incorporating

adaptation.

2

Model unit adaptation, output.

2 is an analytically in which

simple

the units

modification

are also allowed

The goal here is not only to eliminate

matrix

of the output

(unity,

for simplicity).

identity

but also to make The output

of Model

1 incorporating

to receive

input

off-diagonal

the diagonal

correlation

elements

from

singletheir own

of the correlation

elements

approach

the same value

should

therefore

be equal to the

matrix

matrix: cxx ((I - w j1 lT = I.

(I - wjl The matrix

W satisfying

this equation

can be calculated

(for a symmetric

W) as

WI = I - cxx1/2. The symmetric While

square root Cxx1j2 always exists as Cxx is positive in Model

self-connections

1 no self connections

contains

an additional,

feedback gain to be adjusted

were allowed,

constant

term:

semidefinite.

here the learning Awii = 1 - ,

rule for the causing

the

so as to achieve unit mean square value. w(O) = 0

cyyW = (I - w(t))-1 cxx ((I _ w(t))-1 )T w(t+l) or dW/dt

= a ( I - Cw) in continuous A convergence

1991; Plumbley, outputs correlation

matrices,

time.

proof for this algorithm

1991). Figure

to the identity

= w(t) + a (I - cyy(t) ),

matrix. otherwise

is given by Plumbley

6 shows the convergence Convergence

and Dayan (Dayan,

of the correlation

matrix

can only be expected here for full-rank

the condition

cannot be met.

- 29 -

for uncorrelatedness

of the input

of the n variables

0.75 1 = 0.5 0.25

m

4

:?t' .

. .

log10 III- cyyII

.

, 1

5.'

. Ilp.

. JO.

l

-0.25

.

.

.

LU. vt

.

.

-0.5

.

t

.

.

.

.

.

-0.75 Figure

6. Logarithm

of the Euclidian

matrix

of the output

of Model

iterations

(t) for 100 random

of the

outputs

simultaneously.

are

is diagonal

of the difference

2 and the unit

full-rank

normalised

matrix

input correlation and

between

as a function matrices.

correlations

the correlation

of the number

of

The mean square values

between

them

are eliminated

(a = 0.05)

W in Model symmetric

norm

2 approaches

Wl = I - Cxxli2

square root was calculated then for a symmetric .

-0 .5 .-

, as demonstrated

in figure

7. The

as Cxx1/2 = L A1’2 L-l, where Cxx = L A L-l, and A

W(O) (W(O) = 0 in this case).

. .

+ . 0.. f

.

60. po.. . . . .

1 .I-

80. n .

.

n

. . .

2. .5 ..-2.5

/:::i::Il,(((, 1 J .I . . .

-3. -3.5 t Figure

7. Average,

standard

log10 II W - WI 11is plotted

deviation,

as a function

minima

of t. (a = 0.05)

- 30 -

and

maxima

over

. 10 runs,

Model

3

An alternative connections,

to Model

but by explicitly

2 is a one in which

multiplying

the gain is adjusted

each output

not by self-

by a factor (gi). The outputs

are

expressed as: yi = gi (xi + .$wij

Yj),

and therefore y = (I - G W)-’ G x. where

G is a diagonal

diagonal

elements.

matrix,

containing

No self-connections

the gains (gi) of the individual

are allowed

here, as in Model

units

as the

1.

w(O) z20

C&t)

= (I - G(‘)W(t)jl($‘) W(t+l)

Cxx ((I _ &)W(t))-‘G(f))T

= W(t) - a offdiag(Cyy(f))

G(t+l) = G(t)+ y diag(1 - Cyy(t)), where diag0

sets all non-diagonal

0.75.-

elements

to 0.

.

log10 II I - Cyy 11

5.

-0.25.-

-

l

l 1p.

-0.5,.

. . .

-0.75,Figure

8. Logarithm

of the Euclidian

matrix

of the output

of Model

iterations

norm

of the difference

3 and the unit matrix

(t). (a = 0.05, y = 0.1)

-31-

between

as a function

the correlation

of the number

of

Figure identity

8 shows the convergence

matrix,

while figure 9 shows the convergence

9. The overall

C xx -li2.

transfer

Logarithm

over 10 random

2.6.

Relation

novelty

patterns.

Necessarily,

filter

would

of units

tend

T and CxxS1j2 are

The filter

to the ‘novelty

absent,

is trained

on a small

then projects

set containing

in the network.

its input

to zero for any input

a number

pattern.

by

set, gradually

into the subspace going to zero for

are linearly

For a larger

is that

or are governed

set, with the output patterns

described

difference

are either

if the training

independent,

of input

patterns

set, the outputs

of the

The anti-Hebbian

self-

for instance, would not reach a stable value for any positive

- 32 -

filter’

& Oja, 1976). One minor

filter

filter can only work for a training

than the number

connections,

rules are similar

novelty

pattern.

between

approaches

filter’

to the one spanned by the training

the ‘familiar’ the novelty

‘novelty

1984; Kohonen

The

rules.

of the network

of the difference

here, self-connections

to each input

complementary

to the

matrices. (n = 10, a = 0.05, y = 0.1)

and modification

proposed

learning

‘habituating’

of the outputs

of T to CxxS1j2.

T = (I - GW(f))-lG(t)

covariance

and Oja (Kohonen,

in the models

smaller

input

to Kohonen’s

The network

different

matrix,

of the average norm

plotted

by Kohonen

matrix

cxx-l-/2

log10 II -I- -

Figure

of the correlation

signal variance.

Decorrelating of inputs,

networks

decorrelating

patterns.

Decorrelating

ensemble

of inputs.

input.

Training

perform

networks

are exposed to a large, potentially

networks

learn the average, ‘expected’

For a linear network,

Unlike

and operational

the novelty

modes.

Training

2.7.

theoretic

Information

Neurons

are not ideal

or pulse-code

distinguishable

signalling

itself

range

has inherent

signal

automatic

gain control. steps of image

simple

light

luminances

The method

to noise ratio while

earliest

adaptation actually

This method coding

avoiding

known

to take place already

step is to match distribution

of luminance

operation,

training tracking

by noise, and the

itself.

of the dynamic

signal

saturation

As four reliably range of cortical

to utilise

to the signalling

of adjusting

system

this range

space defined

an amplification is known

is light

gain

in engineering

adaptation.

range of neurons

This step is necessary

and quantised.

range to the mean luminance

the width

need not be

by to as

The goal of

to the range

as luminance

of

values

e.g. between a sunny and a night scene, while the

response range of neurons is strictly limited, of the response

error

1987), it is vital

the response

in the image.

vary over several orders of magnitude,

middle

of the

is used in the sensory system as well. One of the

in the visual

is to adjust

present

with

They are affected

estimate

the range of the input

maximise

patterns

of

a large

matrix

does not have separate

quantisation

& Parker,

of neurons.

between

to the correlation

in parallel

devices.

levels may be an optimistic

by adjusting

number

properties

(Barlow,Kaushal,Hawken,

the dynamic

decorrelation

a specific set

infinite

relations

phases, training

can proceed

of the input.

optimally

this corresponds

filter,

changes in the statistics

neurons

task. Instead of learning

here does not have well defined

kept constant.

spike-rate

a different

in the retina

(Barlow,

of the response

range

values in the environment

- 33 -

The first step is to adjust the

value. Such adaptation 1969a; Laughlin, to the width by adjusting

is indeed

1989). The second or variance

of the

a gain factor. A similar

phenomenon, neurons

known

as contrast

in the primary

1973; Movshon DePriest,

visual

& Lennie,

1989; Vautin However

& Berkeley,

Suppose

the available

utilized

even if each neuron individually

the two neurons

& Freeman,

could

discriminate

adapted

capacity

decorrelation

range

gain control,

of individual adapts

to correlations

then the output representational

- 34 -

the output However,

as

only the states on or near the

of the discriminable

adaptation

to the range between

of

adaptation

neuron correctly.

only a fraction

or ‘single-unit’

neurons

levels

is wasted, even when a single-unit

signals utilise

always

10) If each of the two

equal probabilities.

are correlated,

Figure

response

&

will not be properly

gained by observing

with approximately

the response range of each individual

As automatic

very nearly

states. The maximal

process adjusted

10. Correlated

(figure

and the most information

10, when the two activities

are used and channel

& Bisti,

response range of neurons

four activity

be in any of 16 different

would be utilized,

in the

1985; Sclar,Lennie,

reason, two neurons

is in itself optimally

if each of these 16 states occurred

diagonal

(Maffei,Fiorentini,

response space of the two neurons

can reliably

was found

1977).

that, for whatever

together;

shown in figure

gain control,

cortex of cats and monkeys

1979; Ohzawa,Sclar,

respond

possibilities

or contrast

there are other ways in which the available

can be wasted.

neurons

adaptation

mechanism

of values

variables.

states.

matches

the

in the environment,

A decorrelated

signal

has

optimal

resistance to noise on the output

information

in that it maximises

the mutual

between the input and the noisy output.

A communication between its input, has probability density

of the network,

channel

is characterized

A, and its output, density

function

functions

by the mutual

B. If the n dimensional p,(x),

continuous

and pB(y), respectively,

~A,B (x ,y ) , then I(A; B) is defined

I(A; B) = H(A) + H(B) - H(A,B)

information, input

I(A, B) and output

and joint

probability

as:

= H(A) - H(A

I B) = H(B) - H(B I A),

where H(A) = H(B) = is the entropy

I I

PA(x) lo&$x))

dx,

p&d

dy

of the input and the output,

H(A,B)

= -

H(AIB)=-

I s

H(B 1A) = are the joint

and conditional

1949). The entropy

s

and

p A,B(XIY) lOg(p~,~(X~y))

dx dy,

p A,B(x,y) lfd;“)

dx dy,

PA,B(X,J’) log

dx dy

entropies,

of an n-dimensional

log(pB(y))

respectively normal

(Jones, 1979; Shannon

distribution

with covariance

& Weaver, matrix

C is:

H = l log ((2rce)” det(C)). The entropy the amount

of uncertainty

of the channel. output maximise

is a measure of uncertainty,

about the input that remains

In the case of the network,

to maximise the mutual

and the conditional

the decrease information.

we should

in our uncertainty Here the conditional

of noise in the channel.

- 35 -

after having

entropy observed

like the observation about entropy

H(A

the input,

I B) is

the output of the noisy

i.e. it should

can be non-zero

because

Xp-jY

&I:+N

Figure 11. Additive

Assuming output

that the signal can be approximated

has covariance

to the output

noise on the network output.

matrix

Cyy, and that the uncorrelated,

has covariance

matrix

; log ((he)”

maximised

is constant,

diagonal,

det($).

Mathematically

[Cyylii

= 1. A constraint

equal

variances

matrix

det(Cyy+CNN))

by maximising

of the units.

I Iii Aii,

Wegner’s

with equality signal

in the presence of noise. suggestion

theorem

information

- H(N)

that the

is given by

=

(27’Ee)” det(CNN))

correspond

=

elements,

1(X; Y+N)

by the finite dynamic

to a constrained

variance,

states that for any positive

maximises

(Bodewig,

det(Cyy)

is

range

would also result in a condition

only when A is diagonal

(Cyy = I) therefore

information

e.g. for

definite 1956). A

and the mutual

that above some noise level it is better to have correlated

from the result reported

he uses a less plausible

lOg(

on the total power (C[Cyy]ii)

decorrelated

signals differs

-;

det(Cyy) is constrained

normalised,

Linsker’s

IX) = H(Y+N)

and has equal diagonal

this would

on the units.

A, det(A)

- H(Y+N

distribution,

equal variance noise added

CNN (figure ll), the mutual

1(X; Y+N) = H(Y+N)

As CNN

by a normal

constraint

here, as instead

of constraining

signal

on the length of the weight vectors (Linsker,

-36-

variance, 1988).

2.8.

Biological

relevance

The above models

make

from the known physiology were allowed

to have

excitation

and inhibition.

functions

of Gaussian

real neurons, negative

and anatomy both

variables

biological

interpretation,

and whether

merit

adaptation.

output

not observed

in

is possible.

A

rate of a neuron

by a complementary be given

is that they provide More plausible

2 and 3 differ in the way gain control simpler

form, though

such connections

more complicated, self-connections

governed

likely

pair of

a straightforward

a simple

demonstration

versions of the network

model

correlated

possibility

is that

will

synapse. The main

in them.

form.

Model

3 is analytically

2. As there is no evidence

for

all other connections,

The gain control mechanism neurons,

2

at all for self-connections,

from that governing

in single

Model

as indicated

in Model by contrast

visual cortex (Sclar et al., 1989). anti-Hebbian

units, this

to Model

mechanism.

of adaptation

The fundamental, positively

have the desired

very similar

plausible

is implemented

there is no evidence

by a rule different

in the primary

it specifies

assumption

cortical

output

neurons,

inhibitory

be inhibitory

raises several questions.

a decrease

is due to an increase

between them would require to

would

but functionally

2 is not a biologically

thought

as the firing

cannot

to both

linear

of these assumptions

these models

connections

corresponding

to implement

could be interpreted

their main

justification

in Chapter 4.

has an analytically

adaptation

values,

assumed

interpretation

however,

of anti-Hebbian

Models

3 is a more

negative

rate, or each unit could be represented

On the whole,

be considered

and

centred on zero. As these are properties

value of a unit, for instance,

of the principle

that have no direct

of the cortex. For instance, recurrent

were furthermore

only a more indirect

neurons.

are

positive

Units

below the spontaneous

Model

several assumptions

interneurons.

-37-

&

interaction

in the effectiveness

pyramidal

(Beaulieu

in total

Between two strength.

of an inhibitory

cells are excitatory,

so inhibition

As only 15-25% of cortical Somogyi,

1990;

One

synapses

Gabbott,Martin,

&

Whitteridge,

1987;

Whitteridge,

would

proposed

coupled

by Baddeley

interneurons

themselves

cells by Hebbian

receptive

field

1957). Another

only

alternative

weakening

fibres

of anti-Hebbian

between

two

modification

different on recurrent

modification Anatomical

input

synapses

neurons

the firing

light

levels

the main

of signals

assumed of cells,

from

(Ito,

depression

the theory

Nevertheless

structure

of the & Kuffler,

is achieved

1989; Ito & Kano, fibres

of transmission

constitute

would

by

for an activity-

some parallel

here, as this depression

while

comes

is evidence

cell. This does not fully

the

inhibition

strength

in the cerebellum

arrival

excitatory

eliminating

(Barlow,FitzHugh,

There

solution

variance).

in connection

synapse.

synapses

connections.

evidence

on cortical

instability

discouraging

at higher

small

evidence

and from

for the

has been observed require

it demonstrates

anti-Hebbian

that anti-Hebbian

does in fact exist in the brain.

in the vicinity.

inherent

between

is

1991), in which

the centre-surround

cell leads to a long-lasting

types

elegant

change of synaptic

mediating

simultaneous

interaction

but between

1991; Plumbley,

of largest

inhibitory

One possibility

alternative,

correlations

the eigenvector

to the Purkinje

neuron.

those same cells, each interneuron

of excitatory

fibre to a Purkinje

the parallel

(Baddeley,

be enough

neurons,

cells. Another

of an excitatory

of nearly

not

individual

is that the net decrease

dependent

seem

for an activity-dependent

is present

in the efficacy

synaptic

(along

inhibition

a decrease

a climbing

not

to each excitatory

neighbouring

where

1982). Repetition

does

uniquely

and inhibit

correlation

the retina,

KisvArday,Martin,Freund,Magl&zky,

pick up the strongest

The only evidence

kind

there

and Plumbley

learning,

remaining

from

1986),

1988;

take place not between

of more tightly

largest

al.,

for one to be assigned

that decorrelation

was

et

& Somogyi,

interneurons

groups

Gilbert

suggests

neurons

derive

Anti-Hebbian

in a system

persistent

joint firing

Is it reasonable

to assume

of other neurons

nearby?

that the majority from

synapses with

such

(up to 85%) of the excitatory

the recurrent would

a large

collaterals

be very amount

of other

effective of positive

cortical

in reducing

the

feedback

by

of neurons. that the firing Initially

of cortical

the opposite

- 38 -

neurons

would

is uncorrelated

with

seem to be the case. It is

known,

for instance,

arranged

that orientation

tuned

neurons

in the primary

in maps where nearby cells tend to have similar

Wiesel,

1962). Stimulating

sinusoid

gratings,

giving

a highly

stimuli

reveal

Neurons

may indeed

correlated only

the same time, narrow

position,

response. However,

stereoscopic

disparity

spatial

parameters

vary in complex

To fully

frequency,

behaving

cells may be only slightly

visual

(Hubel

stimuli,

(‘columns’)

properties

of cells, all

in its natural

of these cells. of the stimulus

correlated,

or not correlated

the stimulus,

of motion, restricted

environment,

these circumstances

redundant

at

is within

They are tuned

a highly

&

such as

using highly

neuron.

phase, direction Outside

ways. Under

represent

tuning

to several parameters

of the particular

of the stimulus.

in an animal

field

cortex are

only when each of these parameters

band characteristic

paradigm,

neurons.

of the receptive

optimally

size, orientation,

simple

such experiments,

cortex show selectivity

responding

orientation

give rise to large populations

one dimension

in the visual

relatively

the cortex by the extremely

visual

a

to the

ocularity

and

experimental

all these

stimulus

the response of individual

at all with the response of other

the space defined

by the selective

properties

must be covered, and as this space is very large, the visual system cannot afford to use a highly

redundant

representation.

The assumption

of full connectivity

within

a small

neuron

and the size of the axonal

stronger

if the anti-Hebbian

projections

are thought

Decorrelation primary mapping

local region

visual

of cortical

between variables cortex could

of response

tissue, limited

and dendritic

interaction to be more

limited

by the number

by inhibitory

approximation of synapses on a

trees. This limitation

were mediated

would

interneurons,

in space than those of excitatory

represented

be achieved

properties

would only be an appropriate

be even whose neurons.

at greater distances from each other in the in secondary

is organised

1986).

- 39 -

according

visual

areas, where

to different

principles

the spatial (Barlow,

2.9.

Pattern

adaptation

Anti-Hebbian

recurrent

after-effects

of adaptation

aftereffects

can be observed

the orientation

(‘tilt’)

colour

contingent

1990b;

Harris,

wherever

may account for some of the psychophysical stimuli.

psychophysically,

aftereffect,

spatial

A whole

frequency

effect), and many

1961), pattern

channels. sensitivity

perceived

tilt of a vertical

the right

would

neurons

of an intensely

to the right

(Blakemore,

(Wilson,

not explained

to the relationships

between

adaptation

mechanism

which

interpretative

the

environment. keeping interpretive

mechanism

For instance,

optimal

the

tilted

to

orientation

synapses

than

tracks

was proposed

just

‘fatigue’

or a gain

an active

process by

the statistics

which valid

by

result from adaptation

this mechanism

needs to know about to provide

1974; Tolhurst,

adaptation.

1989). It involves

contingencies,

- 40 -

from

to account

of individual

& Tolhurst,

here, after-effects

rather

adaptively

occurring

adaptation

for spatial frequency

sensory environment

of the naturally

caused by

of neurons were considered

of inhibitory

& Foldiak,

system

activity

to lines slightly

of cells with

1971; Dealey

presented

variables

(Barlow

In the normal

track

modification

to the hypothesis

1980; Mollon, of pattern

of neurons.

by the gain control

& Georgeson,

synaptic

such aftereffects

1973).

1975) as an explanation

According

firing

(Barlow,

‘fatigue’

of neural

population

between populations

(Blakemore,Carpenter,

1972). Anti-Hebbian

stimulated

others

1980; Held,

are due to a passive

by the weaker

mechanisms

phenomena

(Harris,

by the imbalance

adaptation,

1978).

line to the left after an exposure

be explained

Inhibitory for several

aftereffects

They are explained

decreased

slightly

explanation

aftereffect,

curvature

seems to be true that one finds

to the conventional

of adaptation

(‘waterfall’)

adaptation,

one looks for them in the visual system (Stromeyer,

selective

variety

e.g. the motion

(e.g., the McCollough

1980). It generally

1977; Sutherland,

Wilson

connection to patterned

adaptation

According

tilted

and aftereffects

of the

sensory

serves the purpose is precisely

expectations

what

of the

upon which

interpretations the result

of images depend.

of the adaptation

According

to this explanation,

to the abnormal

contingencies

illusory

introduced

aftereffects

are

by the adapting

stimulus. The difference of a contingent green

after-effect,

horizontal

appear

between the two explanations

tinted

grating

for a few minutes,

explanation

of the contingent

case of the McCollough interaction

effect. After looking a vertical

with green, while the horizontal

The conventional conjunction

the McCollough

strength

black

and white

have to assume

colour-selective,

effect. If, however,

at a red vertical

and a

grating

will

one with pink.

would

stimuli,

can most easily be seen in the case

orientationally

modification

between separate colour-selective

cells selective tuned

for the

cells in the

is due to an alteration and orientation-tuned

of the

populations

of cells, we need not assume cells tuned to both of the contingencies.

In fact it is believed

that processing

in the visual

of colour

and form are kept rather well segregated

(Hubel

& Livingstone,

number

of cells tuned both to colour and orientation.

anti-Hebbian

1985; Shipp

inhibitory

connections

the ‘red’ and ‘vertical’

black

1985), and there may not be a significant According

would be strengthened

and white

would cause an imbalance grating,

and towards

to this explanation,

during

units and the ‘green’ and ‘horizontal’

phase. These altered connections the vertical

& Zeki,

system

the correlated

units during

the

firing of

the adapting

towards ‘green’ when viewing

‘red’ when viewing

the horizontal

grating. Further the long

time

conditions

support

MacKay

can last several

& MacKay,

considered

convincing

argument

(Movshon

& Lennie,

and the dependence

In the absence of visual stimulation hours,

1974; McCollough,

The same mechanism usually

based on synaptic modification

course of this after-effect,

of stimulation.

the aftereffect

for an explanation

as contingent

days, or even longer 1965; Riggs,White,

may also explain

other

effects, such as spatial

comes from electrophysiological 1979) from neurons

undergoing

-41-

on the

after the adapting

phase,

(Jones & Holding,

1975;

adaptation

1974; Shute, 1979). effects that are not

frequency

adaptation.

by

of the decay

& Eimas,

recording

is provided

adaptation.

by Movshon They adapted

A very

and Lennie a neuron

with

one of two gratings

that had different

spatial

frequencies

but elicited

a similar

response from the recorded neuron. They found a reduced response to the adapting but not to the other grating.

This result is a strong indication

more than just a change in sensitivity. overlapping predict

population

of neurons

a diminished

population

response

and the recorded

As the two gratings

that adaptation

would

cell, and a much

inhibition

smaller

involves

activate a largely

in the visual cortex, the decorrelation

due to the increased

grating

model

between

inhibition

nonwould

the adapted

between

it and the

other population. Decorrelation Mollon,

may

1991). Adaptation

also explain a display

plane causes

maximal

aligned

one of the ‘cardinal’

would

with

in colour

channels

corresponding

curve. Activation

directions

a large number

to the ‘cardinal’

cardinal

directions

leading

to increased

inhibition

would

diminish

sensitivity

along

the other diagonal.

2.10.

would

channels

between

associative

memory

(Anderson, 1969; Hopfield,

1980; Steinbuch,

1961; Taylor, memory

directions

result in a similar

along a diagonal

studied

and

- 42 -

only two sensitivity by the

the two channels

In the test phase, this

direction,

while not affecting

memory

of neural

1972; Caianiello,

1982; Kohonen, 1956; Willshaw

between

adaptation.

functions

between

of the space defined

correlation

them during

hetero-

to a different

interaction

1968; Anderson,

models:

each tuned

explanation

an anti-Hebbian

and associative

thoroughly

1976; Grossberg,

of associative

channel,

would

&

line, even if it is not

space. One possible

the response along the adapting

Decorrelation

One of the most

in colour

(Webster

along a line in the colour

along the adapting

give rise to positive

inhibition

and Mollon

oscillates

of adaptable

space. Alternatively,

of the colour

by Webster

whose colour

decreases in sensitivity

be to postulate

direction

results

1972; Little,

networks

is that of

1961; Gardner-Medwin, 1974; Marr,

1969; Palm,

et al., 1969). There are two basic classes auto-associative.

In heteroassociative

memories,

the goal

presentation usually

is to store a set of input-output

of one of the input

also a requirement

output

store a set of single

pattern

stored vectors should

‘complete’

associative

small set of patterns.

at the completed

nets can perform completing

completed

discriminate

between a complete

even small

deviations

were eliminated learning

autoassociative

1990),

Lyapunov gradient pattern Hebbian

rules used in autoassociative

analytical

or ‘energy’

input

does

‘homunculus’

pattern,

who

parts when

is identical

for the

how can we still

How could we be so sensitive

if all deviations

to

from the ‘standard’

learning

phenomena

in animal

of frequently

experienced

patterns

& Mackintosh,

as would

function,

is

1989; Todd

be predicted

possible

(Amit,

started from an activation

rule generates

corresponding valleys

corresponding

by

to the regions

are usually

proposed

with 1989).

dynamics

the

function

some

by Hopfield

statistical

mechanics

Hopfield

defined

a

of the units perform

state corresponding

to the ‘standard’

in the energy

-43-

memories

memory

as its analogy

on which the activation

descends to a local minimum

with basins of attraction

recurrent

of autoassociative

treatment

descent. A network

learning

pattern

theory.

1982) has been very influential

thorough

for a fixed, relatively

of a network

Perceptual

diminished

parts of the same

the raw input

1989; McLaren,Kaye,

than

of the Hebb rule. The model

(Hopfield, made

rather

the goal is to

we need to ‘fill in’ missing

pattern?

patterns

completion’.

1969; Hall & Honey,

The learning variants

and a partial

should

or faulty patterns.

Is it the infamous

and a complete

the other way: discriminability

(Gibson,

& Mackintosh,

pattern

from well known

by ‘pattern

also point

enhanced

partial

pattern

It is

of a part of one of the

completion

cues? If the output

the

pattern.

memories,

the missing

or regenerating

recognition.

so that

output

of the input

altered

pattern

pattern ? Why would

successfully

version

by retrieving

also restore slightly

we can recognise objects from partial

the desired

In auto-associative

the pattern

But simply

of patterns,

so that the presentation

not make any progress towards pattern is looking

altered

pattern.

vectors,

vector. It should

Recurrent

retrieves

that a slightly

still evoke the appropriate

stored pattern

patterns

pairs

pattern.

to an input A modified

near learned

of state space which

patterns

lead to the

same minimum. function,

Unfortunately,

attracting

generalisation

input

patterns

near frequent

our ability

function

they define

patterns

an additional

and ‘invalid’

need for a yet unspecified memory

patterns

generate

larger

basins

from a larger region of state space, resulting

known

introduce

‘valid’

input

patterns than near infrequent

to discriminate

Lyapunov

frequent

better

mechanism

than novel

problem

for selecting

in more

ones. This again contrasts with

source of redundancy,

states. A further

in this

ones. Valleys

of this

instead of reducing with

the relatively

this approach small number

it, as is the

of ‘valid’

states for storage. The Hebbian

to local recurrent memories

autoassociation connection,

are supposed

representation

and the decorrelation

and are therefore

to learn

there would

learn.

The computational

them.

The goal is not the completion

of the neural of stimuli.

activity

Instead

already

uneven

the probability

patterns

by raising

of having

the space, better

accordance

with perceptual

learning

can easily be performed

but the formation infinite

is achieved

an

try to flatten

in dense regions

states’, we should

to the distribution

set

concentrate

space, we should

of ‘valid

to

but to interpret

a potentially

in the energy function

number

simple

of the input around

Hamming-distance

frequent

generalisation

by feed-forward

connections.

vectors are almost orthogonal,

a perceptron

discriminate

a given pattern

converge

patterns,

memories

of

match

the

signal.

By

states in

phenomena.

above the level required

is missing

‘hills’

discriminability

only slightly

pattern

in representation

a small

The task of categorisation,

spaces most random

in a decorrelated

autoassociative

that best represent

for representation

expanding

tolerance

but

roles

autoassociative

valleys in the energy surface to even further

distribution

space available

left for Hebbian

themselves

of patterns

whole

units,

of a fixed, small set of patterns,

distribution

space. Instead

between

Hebbian

goal is not just to store arbitrary

of making

pattern

incompatible.

correlations

be nothing

theories assign opposite

to prevent

with high confidence,

or faulty (Kanerva,

to the stored vector which

signalling

is closest in Hamming

- 44 -

As in high dimensional unit with a threshold to unrelated

even when a significant

1988). As Hopfield

and fault

networks

set

patterns

will

part of the

are not guaranteed

distance

to the input,

to

feed-

forward

competitive

respect

as well.

networks

Hebbian

recurrent

anti-Hebbian

chapters.

Learning

applied

feed-forward

connections, methods,

to feed-forward

that

strengthening

activity

heteroassociative

patterns.

in another

between

stimuli

in the following

in Chapter

5, can be

associations.

are formed

but by changing

Two associated

the desired output,

stimuli

not

It is by the

the representations

can be represented

by

can then be used as input giving

process could only be applied

limitations,

Hebbian

way. The collateral often

travelling

ramifications

are grouped

These longer

range connections,

the stimulus

(Toyama,

more generalisation

and

specification

extent

to a

between

connections.

similar

stimulus

number

of potential less limited

preference,

within

a single

mechanisms

cortical

for example,

et al., 1981; Ts’o & Gilbert,

connections

of these long-range

may be considered

than that between

cells are extremely area.

Their

terminal

Decorrelation individual

-45-

cortical areas, connect

for the orientation

and range of these interactions

local connections.

may also coexist

clusters, which are given off at discrete intervals.

A group of these units linked

specificity

in the cerebral cortex

of the axons of pyramidal

of the horizontal

and maintenance

locally

as well as those between different

1988; Toyama

1986). The distribution

clustering

6 mm

into distinct

that share stimulus

and anti-Hebbian

branches

widespread,

much

discussed

with

stimuli.

due to anatomical

Hebbian

be combined

new way of forming

These representations

net generating

this

generalisation.

a radically

themselves.

As the decorrelation

Wiesel,

properties

of links between fixed representations,

overlapping

columns

distance

associations

stimuli

can naturally

from

to give rise to more useful forms of generalisation

also implies

long-term

of the individual

associated

connections

such as the one to be introduced

Hamming

This approach

1987)) may be preferable

with desirable

connections

than the rather limited

suggested

(e.g., (Lippmann,

or wavelength

1988; Ts’o,Gilbert,

correspond

observed

connections by excitatory

as a ‘super-unit’,

anatomically. are likely connections with

&

well with the The

to involve and with

a much

between these ‘super-units’ neurons.

of

These ‘super-units’

larger

would be would

be

fault

tolerant,

associated

i.e. they

with a completely

Decorrelation associative input

Memory

non-redundant

capacity

more overlap

best for uncorrelated

of pattern

2.11.

Decorrelation

necessary

the eventual

the representation output methods

patterns,

distribution

memory

reduces correlations

between pattern

vectors.

and supervised

of

capacity

is

between units by

the average

distance

learning

it may be to process the sensory signals learning

by unsupervised

is determined

by how efficiently

behaviour.

If the appropriate

a desired

be specified

and analytically

most tractable

on an adaptive

linear

& Hoff, 1960). The goal is analogous

best estimates

for each input,

of the target

between

the target

adaptive

signal

by supervised

or reinforcement

processing

is given

as the ‘delta rule’, ‘adaline

1960). This algorithm

ease of computation.

It is an ‘on-line’

method,

-46-

& Stearns,

given input

the mean solution

squared

important not requiring

1985;

patterns are the

difference

algorithm,

rule’ (Kohonen,

1970; Widrow

is

to this problem

(least-mean-square)

rule’ or ‘Widrow-Hoff

is particularly

learning

vector so that the outputs

A well known

by the LMS

(Widrow

regression:

one that minimises

and the actual output.

of supervised

combiner

find the weight

values;

form

to linear

1984; Stone, 1986; Sutton & Barto, 1981; Widrow, & Hoff,

input

increasing

it must eventually

learning

and a target output

known

assume a random

of learning.

correction

Widrow

stage for a feed-forward

space, thereby

can be used for generating

The simplest, error

a problem

by the crosstalk between the stored patterns.

success of unsupervised

has to be learnt,

memory

between correlated

the average correlations

avoiding

representation.

is limited

and decreasing

methods,

to cell death,

vectors. Decorrelation

the dense regions

However

resistance

Most models of associative

As there is much

stretching

show

may play a useful role as a preprocessing

memory.

patterns.

usually

would

in also

1977; Kohonen,

& Stearns, 1985; Widrow

because of its simplicity off-line

gradient

and

estimation,

or repetitions response

of data. All

be available

makes

at each time

it especially

correcting

relevant

learning

backpropagation

rules

is that the input

step. The possibility

to neural

suitable

are extensions

If the output output

that is required

modelling

for nonlinear

of this simple

of a single

adaptive

is t, then for each input-target

vector

and the desired

of a local

(Mitchison,

implementation

1989).

and multi-layer

Other

networks,

error such as

rule.

linear

combiner

is y = v * x, and the target

pair, the LMS rule specifies a weight change: Av=2u(t-v.x).

The mean-square

error is a quadratic

function

of the weights with a single global

minimum: E((t - Y)~) = E(t2) + vT Cxx v - 2 CxtT v = E(t2) - CxtT v* + (v - v*)~ Cxx (v - v”) where

is the op timal

v* = Cxx -I C,t

gradient

descent

sufficiently

on the estimate

small u, convergence

matrix

larger

p’s causing

eigenvalues.

divergence.

Ideally,

of this error

determines

of the input

l/h max > I-L > 0, where Lax

constraint

has to be applied

components.

of the weight with

the LMS

to u, causing eigenvalues

to converge

As the time constant

make the error surface elliptic,

narrow

‘valley’

is limited

by the

be chosen

so that

algorithm,

cause a spread

the most along

the learning

with

restrictive

the eigenvectors

is l/(4

the high

lt hi), the final

eigenvalue

this corresponds

of the

be adapted

much slower than along

by the smallest

while

& Stearns, 1985),

(see figure

to a slow descent

rate is limited

by the

axes of the surface (see Fig 12.a). It would be clearly advantageous

surface.

and a

vector should

for each component

12.1~). As correlations

axes of the error

signal

the components

curve will be determined

the wide

input,

of C xx (Widrow

in the input

slope of the learning

along

rule specifies

For stationary

In fact, the lo must

Correlations

CL. As this is not possible

eigenvalue

surface.

is the largest eigenvalue

each eigencomponent

to the smaller

vector. The LMS

the speed of convergence,

distribution.

a separate

corresponding

weight

to v* is guaranteed.

The choice of CL, which correlation

(Wiener)

By applying

decorrelation

with

to be able to do LMS learning variance

-47-

equalisation

on a circular error

(Model

2 or 3), it is

possible

to do just that. As the decorrelation

identity

matrix,

all the eigenvalues

approach

can then be chosen so that it is optimal much

makes the output a common

efficiency adaptive

of the LMS algorithms

equal (Widrow

algorithm

To demonstrate random

variable

the proposed

algorithm,

with zero mean and correlation

c

its first component

and later

dominate. decorrelating estimate,

network

becomes

Notice more

(Model

and cross-correlation

the best estimate

as the decorrelating

Gaussian

network

as input

matrices

that

the error where

of the target based on the

on the input initially

the second

of LMS learning

in figure

12.b. The weight

LMS, following

time

ineffective.

starts to make the error surface circular,

the weight

on the modified

- 48 -

relatively

constant

on the output the input

starts to of the and the

vector starts out in the

the error surface defined

is initially

weight, descending

vectors after 500

decreases

2). The effective weight vector between

as in the case of simple

the optimal

are equal or nearly

, and C,t =

shallow

u = (I - W>-l v, is plotted

route towards

for

1 1 0.9

(x), with correlation

12.b and d shows the result

network

same direction input,

presentations.

Figure

limit

matrix

Figure 12. a and c shows the result of LMS learning

quickly,

matrix

The

0.55 0.59

The goal is to find the weight vector giving

vector

a theoretical

as the target value (t), and the last two components

linear combiner

random

rate

curve will be

take a three-dimensional

0.59 0.55 0.9 1

Cxx = ( 0!9”;”

input.

to approach

the

1984).

c =

to an adaptive

~1, the adaptation

The learning

of the correlation

1

Consider

approach

all with the same time constants.

has been shown

when the eigenvalues

& Walach,

value.

for all components.

steeper, as it will be a sum of components

covariance

After

by the

the decorrelating

vector takes a more direct error surface (not shown).

0.1

0.2

0.3

0.4

0.5

a.

b. In e

In e

-6:

J. -. .* -. .I.. -: $0.

-7: 100.

200.

300.

d.

C.

Figure 12. a. The effect of LMS learning the

presentation

e = (v - v*jT output,

of 500 random

( Av = 2 u (t - v . x) ) on the weight vector during

input-target

Cxx (v - v*). b. The effective

u = (I - W)-l

v during

two different

time

input-target

pair presentation

apparently

larger fluctuations

weight

LMS learning

d. The slope of the error function follows

400.

pairs.

vector

on the output

stays constant during

constants.

c. The

mean

between

squared

the input

of the decorrelating

adaptation,

unlike

error. and the network.

in c. where it

v(O) = 0, W(O) = 0, u = 0.0025, a = 0.01. For each

y = (I - W)-l

x, Av = 2 u (t - v . y), AW = a (I - y yT). The

at lower errors in d are due to the logarithmic

the error.

-49-

plotting

of

t

Note that in figure 12.~ faster convergence rate was used. For normalised equality

input

was achieved

variables

( diag(Cxx)

only when C xx = I, so for the decorrelated

a faster

adaptation

eigenvalues 1. With

rate, giving

even when the same adaptation = I ) we have h,,,

signal it is always possible

an even faster

convergence.

of Cxx are 0.1 and 1.9, while the decorrelated

the same adaptation

2 1, with

In the example,

variables

rate, the faster convergence

to choose the

have eigenvalues

1,

is due to the smaller

value of

determined

stability

the longest time constant. The

choice

of adaptation

rate

is not

only

by

requirements,

but also by the need to keep the error due to the random

final weights

close to the optimal

noise levels, as in the example

weights.

presented

This is especially

M = u tr(C), which makes it necessary to chose u smaller by the stability

condition

(Widrow

the trace for a normalised convergence, a smaller

or equal convergence

adaptation the input

less frequently

available

beginning

be required

does not change

can be achieved

with

with smaller m&adjustment

signal is present before the target signal, than the input,

of the target.

faster

by chasing

the decorrelating

In this case optimal

or when the target is

process can proceed

adaptation

is possible

even in

right

at the

of LMS learning. This algorithm

as ‘LMS/Newton’

is similar algorithm

(Ahmed,Soldan,Hummels, Miller,

can be expressed as

& Stearns, 1985). As decorrelation

can be obtained

in case of high

than what would

the same misadjustment

of the

rate.

When

the absence

input,

fluctuations

important

above. This misadjustment

the

1980; Orfanidis,

to the algorithms and

the

known

‘sequential

CL Parikh,

1977; Compton,

1988; Widrow

& Stearns,

in adaptive regression’

1988; Graupe,

signal processing algorithm

(SER)

1972; Monzingo

1985). The LMS/Newton

&

algorithm

replaces the LMS rule by Av = 2 u hav Cxx-1 (t - v . x), where hav is the average of the eigenvalues to a descent directly

towards

the optimal

of the input weight

-5o-

correlation

matrix,

which leads

vector. This, of course, assumes that

Cxx is known. algorithm

The sequential

in which The main

the former requires

Cxx-l

previous

weights

time

surface

alternative

preprocessing

neural

descent

learning

‘momentum’

network

with

of the LMS/Newton

the weight

adaptation.

proposed

implementation,

to this method

would

on an elongated,

by adding weight

while

as well. a fraction

modification components

the slow

above is that

while

less plausible

has also been used to accelerate

1990).

- 51 -

error

that similar

A solution

latter

the ‘narrow’

only

learning

slow

to solve this at the 1986).

directions

of the

A possible

by an additional

using explicit

in nonlinear

is not

& Hinton,

get amplified.

to be uncorrelated scheme

problems

modification

(Plaut,Nowlan, along

surface

proposed

of the weight

components

be to force units

but biologically

‘valley-like’

It has been recognised algorithm

that the oscillating

get diminished,

term. A similar,

simultaneously

the SER and the algorithm

network.

step to the current

This has the effect

error

of gradient

the backpropagation

was to give

is a modification

rules.

to single layer linear

down

error

between

does not have a natural

The problem

algorithm

is approximated

difference

local learning

unique

regression

networks

Gram-Schmidt (Orfanidis,

Chapter 3.1.

3. Optimal

Data

linear

description

from

as a retinal

(or television)

a large array

the resolution

There are well

and many useful

in a given

in designing advance,

of physical

better

methods

problem

may depend

a more general-purpose about

criterion

that they

desired

input-output

regularities

be most

in the input

useful

data-set.

measures.

(Linsker,

Analysis

has such optimal

properties.

Principal

Principal features

from

high

dimensional

1982)). It is also known the Hotelling

Transform

may

in most

situations,

will

be. Selection

features

will

could be simply

to

be lost this extraction,

The features

of the categoriser.

most

However,

may not be known

in

not be available

at the physical

of features

be based

must

independent

of what

of features

of a set of features

1988). Among

component

component

output

of

and feature

categories

such

the amount

would

application.

the correct

The quality

loss of information

recognition, of variables

output

symbolic

of data in a signal

for data compression

on the desired

Good

meaningful,

a lot of information

In this case selection

minimal

3.2.

the number

of a particular

desired

relationship

information-theoretic

large. During

classifier

extractor.

a concise,

The amount

but obviously

rely on the details

of the feature

signals.

is extremely

of the input,

or information

location

is to derive

One way of reducing

known

of these

recognition

image

data needs to be reduced.

way.

extraction

reduction

The basic goal of pattern

decrease

feature

reduce linear

on the

exactly

the

only

on

can rely

can be determined

dimensionality methods

with

Principal

by

only

a

Component

analysis

analysis

WA)

data distributions

in signal processing in image

is a statistical

(see for instance

as the Karhunen-Lo&e

processing,

-52-

method

and it is related

for

(Devijver Transform

to Factor

extracting & Kittler, (KLT)

or as

Analysis

and

Singular

Value

transformation

(rotation)

coordinates

(SVD)

(Gerbrans,

of a distribution

are uncorrelated

is concentrated reduce

Decomposition

and maximal

amount

on only a small number

the number

concentrated

of variables

and minimise

variances.

The basis

covariance

matrix

by taking

of this new

the

The optimal

the subspace

of the n eigenvectors

with

information

of a normally

variance,

content

so by maximizing

variances

in which

of the original

on which

system

distribution

the variance with

are given

the

is

small

are the eigenvectors

coordinates

the

space we can

out the coordinates

m to n dimensions

largest

orthogonal

In this transformed

by leaving

on these

from

system

only the coordinates

eigenvalues.

The

projection

of variance

coordinate

variance

It is a linear,

a coordinate

of coordinates.

the loss of variance

vectors and

into

1981).

of the

corresponding

by PCA

is therefore

eigenvalues. distributed

PCA also maximizes

variable the amount

depends

on its

of information

on

the n variables.

3.3.

Oja’s

algorithm

Oja proposed modification principal

rule of the input component

eigenvalue)

qj (figure

The unit modification

consisting

input

of a single linear

connections,

(the single

of a stationary

The output strengths,

a model

and showed

eigenvector vector

unit

a local, Hebb-type

that the unit extracts

of the covariance

sequence

with

matrix

with

the largest the largest

(Oja, 1982).

of the unit, y is the sum of the inputs

xj weighted

by the connection

13):

is trained

on vectors

of the connections

during

from

an input

each training

- 53 -

distribution

step is:

and the rule

for the

Figure 13. The output

of Oja’s linear unit trained on a stationary

converges to the largest principal

sequence of input

vectors

component.

Aqj = p (xjy - qjy2), where

xjy is the Hebbian

the output

term that makes the connection

are correlated,

constraint

training

the unit

the variance

noise is added

to the outputs

approach

then if they follow

then their output

less than what could be achieved

Several strengths distribution

alternative

in linear connectionist

algorithms networks

have

of errors and backward

which makes their operation

and implementation

of two local rules, Oja’s rule and decorrelation

- 54 -

to the

analysis

be-

If there is

the same rule and no

values will of course be identical, information

which will be

by PCA. been

proposed

to change

to get more than one principal

(Oja, 1985; Sanger, 1989; Williams,

rely on the calculation

subject

component

is no more useful than the value of a single unit. The transmitted significantly

1.

the one with the largest variance.

for signalling,

and

The weight-decay

of its output

is not a full principal

cause the unit finds only one component, available

This term makes Cqi’

maximizes

that Cqi 2 = 1. This, however,

more than one unit

when the input

i.e. when they are active simultaneously.

term, -qjy2 is necessary to prevent instability. After

stronger

component

19851, but these are non-local propagation

connection

rules; they

of values between

more complicated.

of a

layers,

Here a combination

will be shown to achieve the same goal.

Model

4 Model

Chapter output

4 combines

2 (figure units

Oja’s network

14). A large number

by Hebbian

connections

between the units keep the outputs

X

Figure Hebbian,

with the decorrelating

(m) of inputs

connect to a smaller

(q), and anti-Hebbian

uncorrelated,

network

feedback

as first proposed

proposed number

(n) of

connections

in (Foldiak,

in

(w)

1989).

m

14. The

combined

Hebbian

black circles anti-Hebbian

When

an input

- anti-Hebbian

network.

White

circles

indicate

connections.

is presented

to the network,

the units settle to a stable state for

which n Yi = jglqijXj + .C wijYj, m

J=l

or

y=Qx+Wy,

y = (I - W)-lQ

Initially the connection

the qij’s are set to random

x.

values and wll.. - 0. The modification

strengths are: Awij = - o yiyj

ifi#j

and Aqij = PCxjyi - qijyi2).

- 55 -

rules for

The training The

is unsupervised,

modification

modification

rules

i.e. there is no need for a teaching

are purely

of connection

strength

all

is available

there is no need for the propagation

3.4.

local:

the

information

locally

necessary

convergence

at the site of the connection,

The

characterized the output

input

and

matrix,

output

is also normal

performed

scale, i.e. neither

for individual

to have normal

input

distribution

directly

with covariance matrix

similarly

diagonal

matrix

the eigenvalues,

generated

by Gram-Schmidt

containing

distribution

over the set of input matrices

were calculated

of a matrix

with elements

and keeping

cyy(t)

= (I _ W(t))-1 =

T(t)

Cxx

Q(t), T(t)T,

= Wtt) - a offdiag(Cyyft)),

-56-

by taking

a and p small

in each cycle t for different

T(t)

W(t+l)

here the rotation

rules were approximated

patterns

If the of the

over the interval

matrix

L was

chosen from a

i-0.5, 0.51. The rows of L are the eigenvectors

modification

mean,

1-3 by Cxx = LALT, where A is the

but

orthogonalisation

on the interval

The proposed

as in Models

were

= T Cxx TT.

For each run qil(O) ‘s were chosen from an even distribution Cxx was generated

matrix.

is T , then the distribution

Cyy =

i-0.5,0.51.

zero

the

of the network

from this covariance

by the network

vectors

with

Cxx, and because of the linearity

could be calculated

of the transformation

on a slow time

nor the stable outputs

by its covariance distribution

was simulated

was assumed

matrix

following

the

of values from other units.

operation

of the outputs,

calculated.

values

for

Simulation

The network’s

uniform

or error signal.

their

of Cxx. expected

(a = p = 0.02). The

network

sizes:

- diag(Cyy(t))Q(t)),

Qct+l) = Qct) + b(T(kxx

where diag0

offdiagf)

is an operator

sets off-diagonal

which

elements

of the overlap

was plotted performed

(figure

principal

15). After

neither

component

components

length

convergence, will

not

PCA subspace was calculated,

and

outputs

the row vectors of T, the transformation

be individually

orthogonal

of the network system,

aligned

with

but

they jointly

to each other,

can be measured

and looking

in the PCA subspace to the total length. have the transformed

to hi, and 2.1 > h2 > ... > hm.

the principal span the

by transforming

at the ratio If columns

components

between

rows of T into the length

of L are the eigenvectors,

as rows if the ith column

Geometrically

of

averaged

of L

over the rows, the

of the rows in the subspace (I$ and the total length (1) are

l&m Ip / 1 is a measure all orthogonal unlike

to zero, while

subspace.

coordinate

then M = T L will corresponds

the n dimensional

are they

The convergence the eigenvector

elements

between this subspace and the subspace of the network

by the network,

components,

matrix

to zero.

For each input distribution a measure

sets diagonal

of the overlap

l=J~

i-1 i=l

i=l i=l

of the two subspaces, and it is 0 only if rows of T are

to, and 1 only if they are all in the PCA subspace. The final subspace,

the final T, is independent

of the initial

-57-

random

choice of feed-forward

matrix

Q.

0.6 -

m=lOO,n=lO

m=SO.n=5 0.4 -

0.2 -

0 0

I SO

I loo

I IS0

I

I 250

0

; 0

I 50

I IW

I IS0

I 200

1 250

cycles

I lpl

I

0.8

0.8

0.6

0.6

m=2OO,n=20

m=4OO,n=40

0.4

0.4

0.2

0.2

I 50

I 1w

I

150

I

200

I

cycles

I so

250

I 19)

I 200

.

I 80

cycles.-- .-

Figure 15. The convergence of subspace of network quantity

I IW

outputs to the PCA subspace. The

lp / 1 is a measure of the overlap between the two subspaces. m is the number of

inputs, n the number of outputs. The curve is an average of 10 runs with different input covariance matrices and initial forward connections.

- 58 -

random

3.5.

Transmitted

information

Oja’s algorithm

reaches the maximum

single unit case if the inputs contain and similarly

for algorithms

To demonstrate

of mutual

uncorrelated

information

set by PCA for the

noise of equal variance

that yield the PCA subspace (Plumbley

that the mutual

H(Y) - H(Y

I X) was calculated

the network

input

information

is maximised

in each iteration

& Fallside,

entropy

1988), 1988).

in the present model

step. If uncorrelated

(figure 16) then the conditional

(Linsker,

noise is added to

of the output

is

H(Y I X) = H(YN), where

YN is the noise transformed

normal

distribution

pendent

so the covariance

of the noise on the output

The mutual

information

matrix

containing Further,

input.

=

- i log( (2ne)” det(CN))

should be compared

IpCA = $log((2ne)”

and Cp = PCxxPT.

is I. The

in this case is

i log( (2rce)” det(+))

where P is the matrix

to be inde-

of the noise on the input

noise in the network

I(X;Y) = H(Y)-H(YN)

This quantity

is assumed

to have

is CN = TTT.

Figure 16. Additive

transformation,

The noise is also assumed

with zero mean, and the noise on each input

and of unit variance,

covariance

to the output.

with the mutual

det(Cp))

- + log((2rre)”

the n largest normalised

we know that det(PPT)

so

- 59 -

= i log T (detf TT ,> information

given by PCA:

det(PPT)), principal

components

as rows,

= 1, because PCA is an orthogonal

IpCA = 5 lo@ det(Cp) where

hl, h2,...hn are the n largest Figure

training

cycles

17 shows

eigenvalues

the mutual

approaching

) = $ log( hlh2*** &I ), of Cxx.

information

the maximal

of the proposed

network

as a function

of

value set by PCA.

1 ’ 1’ PC/~ 0.R

0.6

0.6

m=SO,n=5 04

0.2

1I

,

0

ib

m=lOO,n=lO 0.4

03

I IW

I 150

I 2al

c

I 250

I

I

SO

IW

I

150

cycles

I 200 cycles

I 2.50

I 260

I 250

I

0.8

0.6

m=4OO,n=40

m=2OO,n=20

0.4

0.4

0.2

0.2

I 0

; 0

I 50

Figure curve

I loo

I Is0

17. The mutual is an average

I 200

cycles

information

of 10 runs with

I)--

I 250

SO

I IW

I 150

cycles

of the network different

I

0

reaches

random

-6O-

the maximum

input covariance

set by PCA. The matrices.

(

Model

5

By adding suggested principal

gain control

in Model

2, the outputs

component

subspace,

shows the convergence convergence

Il. .

for instance

by the mechanism

the uncorrelated

but they are also normalised

components

in variance.

Figure

of the l&a

subspace to the PCA subspace. Figure 18.b plots the

correlation

.

units,

are not only

of the output

of the output

-0.25.-

to the output

matrix

to the identity

matrix.

.

9 ,40.

I 170:

I

;

60.

80.

100.

l

-0.5.-

1%10III- cyy 11 -0.75,-l:-1.25,. -1.5.-

Figure

.

18. a. The overlap

of the network

11I - Cyy 11tends to 0 as the correlation matrix. input

Plots

are averages,

correlation

matrices.

standard

matrix

deviations,

(a = 0.1, p = 0.04)

- 61-

output,

and the PCA subspace. b. log10

of the output minima

approaches

and maxima

the identity

over ten random

Figure 19. Additive

Normalisation only in the input, units (figure be treated

in model

the network

in the previous

quite

separately.

depends

on the correlation

Assumptions

In addition

experimentally hippocampus

noise depends

connections.

Hebbian

modification

potentiation

analysed _m----Willshaw

for

by Oja specifies

including

& Pockett,

a decrease

a class of associative

was first observed (LTP)

in the

1973; Levy & Steward,

the mammalian

matrix

visual

nervous

cortex (Artola

1988; Brown et al., 1990; Madison, unstable.

in the synaptic

by Stent and Singer (Rauschecker

& Dayan,

noise only

to occur in the central and peripheral

cell is active in the absence of presynaptic

rule proposed

the

4 and 5 assume

of synapses

the Hebb rule only specifies a weight increase, it is inherently

synaptic

on whether

Models

1973; Bliss & Lsmo,

demonstrated

and invertebrates,

models,

synaptic

of long-term

& Singer, 1987; Bindman,Murphy,

introduced

only

can

relevance

of the previous

(Bliss & Gardner-Medwin,

1979), and was subsequently

term

and the output

subspace, while the effect of output

and biological

in the form

system of vertebrates

to noise not

of the output.

to the assumptions

feed-forward

resistant

noise on the input

The effect of input

into the appropriate

optimally

section, but also to noise on the output

19). In fact, the effect of independent

transforms

Hebbian

5 makes

as demonstrated

network

3.6.

noise on the network input and output.

firing.

& Singer,

memories __-

The normalising when

the post-

Oja’s rule is similar

to the

1979; Stent, 1973), which (Dayan

1990). The effects of the changes specified

- 62 -

weight

1991). As

& Willshaw, ~__~~

is

1991;

by this rule have been

observed

experimentally

hippocampus

Several compression.

neural

a set of input

algorithms

algorithms and output

hidden

& Zipser,

applied

recognition

network

1990; Rauschecker

units.

have

units through

It was found

1987; Rumelhart

to image (Harrison,

been

& Singer,

that the hidden

& Sejnowski,

(CottrelLMunro,

improve

performance

in the auto-encoding

pointed

out earlier,

the back-propagation

problem algorithm

layer comes

1989) and hidden

& Kamp,

to

& Hornik,

1985). This method

that nonlinear (Boulard

data

of a smaller

1985; Baldi

& Zipser,

1987). It was also demonstrated

for

the identity

the ‘bottleneck

et al., 1986; Williams,

compression

proposed

was used to implement

to the PCA subspace (Ackley,Hinton,

1989; Elman

as well as in the

algorithms

The back-propagation

of intermediate

correspond

also

to related

other

between

number

& Levy,

cortex,

& Singer, 1981).

Comparison

mapping

of the visual

(Lopez,Burger,Dickstein,Desmond,

1979; Rauschecker

3.7.

in the development

was

speech

units do not

1988). However,

as

does not seem to be biologically

realistic. To get more than just the largest principal to get PCA subspace, (Krogh

& Hertz,

related

to Gram-Schmidt

component,

but the learning

orthogonalisation,

(Sanger, 1989). By repeating

implausible,

An arranged

output

implementation

on the back-projecting

units

an algorithm

finding

similar

his network

are non-local

as well

quite closely

the largest

principal

from the input the resulting

this sequence over n units the n largest principal

The biological

algorithm

involving

and subtracting

as does an alternative

with constraints

Oja extended

rules used in this model

1990; Oja, 1985; Oja, 1989). Sanger proposed

and then back-propagating

can be obtained.

component,

of this sequential

implementation

involving

output

components

algorithm

seems

a large number

of units

weights.

to Model

4 (Foldiak,

has also been proposed

- 63 -

1989), but

and analysed

with

(Kiihnel

hierarchically & Tavan,

1990;

Rubner,Schulten, connections

& Tavan,

were specified

higher

index

found

that the first

1990; Rubner from an output

output

unit

extracts

units force them

two fundamental for the strictly

asymmetrical

the first

the difference

disadvantageous

vector

device

a large attenuation

linear

for the last, probably

combinations

resulting

of the principal

units,

components,

while

connections

W matrix.

It was

units

the previous

However,

of collateral

a

as expected,

the higher from

outputs.

there are at all

connections

in the

of the units is maximised

is minimised

the units, amplification

with

is that as the units find the principal

the components.

involve

to a unit

component,

between the variances

if a physical

anti-Hebbian

One is that there is no evidence

noise is to be used to represent but it would

principal

arrangement

variance

index

triangular

to the previous

The other problem

of the normalised

is clearly

to a lower

with this algorithm.

be required.

themselves,

(the entropy This

hierarchical,

that would

components

with a lower

as the anti-Hebbian

to be uncorrelated

problems

1989). Asymmetrical

by any of the other output

extract the rest of the components,

brain

unit

but not vice versa, corresponding

because it is not influenced

output

& Tavan,

with

(Devijver finite

Normalisation

& Kittler,

)).

range

and

dynamic

could still be applied

for the first component

in additional

noise.

to

and a large

As Model

4 takes

the variances

of the output

units

for the learning

parameters

in Model

are

more uniform. The convergence were also studied Hornik,

properties

and conditions

by Baldi and Hornik,

and by Leen, using bifurcation

1990; Leen, 1991).

-f34-

analysis

4

(Baldi &

Chapter 4.1.

4. Sparse

The Hebb

coding

unit as suspicious

The computational any number nervous

of linear

system,

properties.

capabilities

especially

(McCullocl~

& Pitts,

Such a unit

performs

and the inputs

pattern

matches

the weight

by allowing

exceeds

y=o

otherwise. kind

patterns

nonlinear

in the response

nonlinear

values,

model

of a

sum of its inputs, value,

as

and

its threshold

(t)

to patterns

pattern

then the weighted Depending

so the unit will respond

is needed

for the solution

fires,

patterns

vector closer to themselves to prevent

If you think

sum is maximal

up to a limiting

according

the weights

to that pattern or pattern

Hamming

is competitive

- 65 -

from

distance.

coincidence rule.

the active

de-

If on the inputs

even better in the future.

are

In this

are able to adjust

the

ones. To use several of these units,

them from detecting

of this problem

when the

vector only in a small

to the Hebb

components

than the infrequent

of the

on the value of the threshold,

can be made into a suspicious

to change

the unit

occurring

matching.

that differ from the weight

matcher

its connections

2 t,

of pattern

vector precisely.

strengthened,

suggested

a given

if C qjXj

of a pattern

a mechanism

rate. A simple

y=l

presentation

way, the frequently

to have

takes a weighted

of bits, so this unit can be said to generalise This elementary

weight

firing

sum

a simple

as binary

the unit will also respond

tector

this

very limited,

rate, they cannot signal negative

unit which

when

are clearly

1943):

weights

number

by a maximal

threshold

output

firing

detector

by a single one. Neurons

are known

in the cortex,

is also limited

a positive

networks

layers can always be replaced

cell is that of a binary gives

of linear

Due to their low spontaneous

and the signal

coincidence

the same feature. learning.

One method

4.2.

Competitive

Competitive Malsburg,

learning

learning

1973) in its simplest

unit that fits the input and suppressing implemented

the output

connections

the input

patterns.

discriminable

in the regions

However,

this

pattern

is within

a certain Hamming

4.3.

Sparse

competitive

unit

can be units. In giving

populated

by

can be used by a turning

units. As the pattern

is very limited

gets activated

sum

regions,

in a single trial by simply

of storage

on the

vectors in a at all between

in the number

of

to generalise.

An

only when the input

distance from the weight vector of the unit.

coding

It would be much more desirable each unit representing

one component

capacity

to code each input

or property

results in a significant

also give rise to desirable

of the pattern.

increase in the number

of such a distributed

state by a set of active units, Since the combinatorial

of discriminable

code is high. Distributed

effects like generalisation

states, the

representations

between overlapping

patterns,

noise

of units

for each input

resistance.

On the other pattern,

into disjoint

states that it can code, as well as in its ability to a particular

and damage

between the competing

to each other, there is no interference kind

only the

mechanism

representation

unit to the active output

associated

representational

1985) activates

of space that are densely

local, ‘grandmother-cell’

output

use of units

connections

space among themselves

are orthogonal

input

& Zipser,

1976a;

the one with the largest weighted

layer to associate outputs

from the winner

local representation stored

inhibitory

finer discrimination

supervised

1982; Grossberg,

of all other units. Such a ‘winner-take-all’

vectors. The resulting

subsequent

& Ballard,

(Rumelhart

best by selecting

by strong constant

a selectively

1982; Feldman

version

pattern

this way, the units divide

pattern

(Amari,

the mapping

hand,

when

a large

to be implemented

number

by a subsequent

- 66 -

are active

layer becomes more complicated

and harder memory

to implement

network,

distributed

by simple

i.e. the number

representation

neuron-like

of input-output

is significantly

Memory

capacity

is limited

patterns.

Learning

may become extremely

become complicated

sparse coding, resentations. authors

by the interference

1988; Buhmann,Divko,

1990; Palm,

1988; Tsodyks

& Feigel’man,

1990). In a sparse code, the input portion

of the available

between amount

tation.

capacity

of generalisation Competitive

learning

highly

proposed

sets of lower-level

other such conjunctions. instance, combination while

learning

of car type and colour

if car types and colours

above scheme represent

problem

scheme would (e.g., ‘yellow

are not significantly

learn to code colour

a particular

car as a combination

unit).

capacity,

of a subsequent

components

could

‘Volkswagen’

this proportion

method

Generalisation

may

in a small pro-

output

the

function.

of forming

a local represen-

into a sparse representation. corresponding independent

of coding cars of different a separate

Volkswagen correlated

of activity

the tradeoff

as well as that between

to from

colour,

for

unit to code each

detector’

(Harris,

1980)),

in the environment,

and type on separate

the

sets of units,

and to

in those units (a ‘yellow’

and a

then occur specifically

-67-

& Touretzky,

one can control

that are relatively

require

&

et al., 1969; Zetzsche,

in terms of features or conjunctions

In a hypothetical

the competitive

1969; Nadal

by the activity

here aims at coding inputs

finds a representation

correlated

are represented

is an unsupervised

& Wilczek,

1990; Rosenfeld

1981; Willshaw

and memory

and the complexity

The mechanism

The mechanism

patterns

rep-

1988; Gardner-Medwin

1970; Marr,

1988; Rolls & Treves,

units. By choosing

representational

1987; Gardner,

by

out by several

1987; Baum,Moody,

& Buneman,

1988; Willshaw,

connections

distributed

were pointed

& Sompolinsky,

stored

can be combined

local and completely

1988; Gardner,

1980; Palm,

et al., 1969).

et al., 1986)).

representations

of sparse representations

& Schulten,

(Willshaw

the non-orthogonal

(e.g., (Rumelhart

& Recce, 1987; Longuet-Higgins,Willshaw, Toulouse,

between

between

1989; Amit,Gutfreund,

of an associative

slow, and the rules for adjusting

is a compromise

The advantages

(Amari,

than optimal

of both local and distributed

which

The capacity

pairs that can be stored using a highly

lower

and hard to implement

The advantages

units.

along

one feature

or

aspect of the input. the unit

in the type group,

large Hamming

4.4.

An output

distance

in the previous is performed Hebbian

proposed

but with

by conventional

inhibitory

about

(first

chapter,

feedback

creases

statistical sparse

colours

get connected

even when

to

it has a

mechanisms

while

in a single

i.e. pairwise

are nonlinear,

value

active

input

is assumed:

by adjusting lowers

unit gradually

elements,

of the input.

mechanism

close to a fixed

by anti-

they can learn not

between

and features

an additional

coincidences

on units connected

correlations

dependencies

xj, j=l...m,

each unit

its own

its threshold becomes

from

the network

differential

This

equation

dy*i dt=

of the units,

models,

the other

is guaranteed

1984).

and n representation

and the nonlinearity

step as in the previous

(Hopfield,

following

As the units

a frequently

of suspicious

weights

for a long time gradually

has m inputs:

by the feedback =wji>

units. The detection

is used as

more

thresh(i.e. deselective

its threshold.

of the feedback

enced

of firing

1990)), the same architecture

feed-forward

coding,

that has been inactive

The network Because

nonlinear

statistics,

to achieve

its selectivity),

by raising

sient

to other

would

and anti-Hebbian

connections.

to keep its probability

old. A unit

(wij

generalise

in (Foldiak,

Hebbian

the second-order

In order

lated

‘Volkswagen’

the original.

of Hebbian

but also about higher-order

tries

only with

6 In this model

only

and it could

from

Combination

Model

correlated

units.

Provided

to settle

transient

was

for each input

f(jglqijxj

because

the output

that the feedback

by

pattern:

+ jilwijY*j

- 68 -

- ti ) - y*j

yi, i=l...n.

cannot

the final output

into a stable simulated

units:

be calcu-

here is influis symmetric

state after

an initial

tran-

numerically

solving

the

where qij is the weight of the connection

from xj to yi, wij is the connection

between units

yi

the

the

and yj

and

the

nonlinearity

f(u) = 1 /(l+exp(-Au)). ffcqijxj

of

The initial

- ti). The outputs

value

units

is represented

of y *i for each input

are then calculated

by rounding

by pattern

was y”i(0)

The feed-forward

initially

on [O,ll and normalised

length

selected from a uniform

distribution

=

the values of y*i in the stable

state to 0 or 1 fyi = 1 if y*i > 0.5, yi = 0 otherwise). random,

function

weights

are

to unit

(Cj q2ij = l), and the feedback weights are 0. On each learning

thresholds

trial, after the output

are modified

anti-Hebbian

according

has been calculated,

to the following

the connections

and

rules:

rule AWij = - OZ(YiYj - P*)

( if i = j or wij Hebbian

> 0 then wij := 0 >

rule Aqij = P yi Cxj - qij)

threshold

modification

Ati = Y (yi - p>,

Here a, p and y are small ity. The Hebbian weight

rule contains

vectors bounded.

positive

constants

and p is the specified

a weight decay term in order to keep the feed-forward

The anti-Hebbian

rule is inherently

ising term is necessary. Note that these rules only contain the weight locally

connects,

so all the information

at the site of the connection.

will be demonstrated

4.5.

Example

Patterns network.

on two simple,

1: Learning

consisting

This example

bit probabil-

of random

stable so no such normal-

terms related

to the units that

necessary for the modification

is available

In the next two sections some aspects of the model artificially

generated

distributions.

lines

horizontal

and vertical

was chosen for comparison

- 69 -

lines were presented

with that given

by Rumelhart

to the and

Zipser

(Rumelhart

difference

& Zipser,

1985) to demonstrate

is that the patterns

here consist

of the 16 possible

lines are drawn

the others

20). Pixels

(figure

are 0. This pattern are allowed.

Figure

20. A random

(figure

21). Patterns

activity

of the simplicity well matched

4.6.

correlated

to the number

Example

A slightly statistical comparison

structure with

more

realistic

of the input that presented

independently

line have the value

from

all

1, all others

264 input

to the network

patterns

in Example

so that the units became detectors the suspicious

coincidences

of lines were

independent.

is optimal

distribution

Of course

of the

of the set: lines of

in the sense that it

is removed

by the net-

this is the case only because

and the fact that (line positions)

1.

coded by a combination

and all the redundancy

of components

2: Learning

(l/8)

in this example

in the input,

of the artificial

of lines. On an 8x8 grid, each

probability

presented

of combinations

The code generated

The important

units.

developed

components,

are statistically

learning.

as only 216 of the possible

of the patterns

connections

all the information

as the outputs

redundant,

sample

consisting

in the units.

preserves work

highly

a fixed

has 16 representation

The feed-forward most common,

of combinations

that are part of a drawn

set is highly

The network

with

competitive

the network

size was

in the input.

the alphabet

example

is considered

is more

complicated.

by Barlow

in this This

example

et al. (Barlow,Kaushal,

- 70 -

section,

in which was

chosen

& Mitchison,

the for 1989)

OOZT 008 OOP 0

/ 0009T: El 0008

ooot

where methods

were considered

a set of probabilities is the probability probability

for uniquely

so as to minimise

assigning

the higher

order redundancy

of the strings. If Aj

of string j, bij denotes the ith bit of the code for the jth string and the

of the ith bit being 1 is pi, then higher

(Barlow

binary strings of a fixed length to

order redundancy

can be defined

as

et al., 1989): R = IetA, b) - E(A)1 / E(A),

where e(A,b) = - C[pi i is the sum of the individual

entropies

log pi + (l-pi)

log (l-pi)]

of the bits of the string, and

E(A) = - CAj log Aj j is the entropy entropy

of the set of strings. The sum of the bit entropies

fixed position

patterns

in this example

on an 8 x 15 raster. During

the same probabilities

were constructed normalised (Barlow

from the standard

to unit length.

training,

of letters

letters were presented in a piece of English

system font of a Sun workstation

The same letter

frequencies

presented in random

text. Input

in a order

vectors

and vectors were

were used as in Barlow

et al.

et al., 1989).

only a small number infrequent

of a large number between constraint,

bit probability

(p), the resulting

of l’s (table 1). Frequent

ones, as otherwise

inhibition explicit

consist of images

as they appeared

Due to the prescribed

firing

than the

of the strings, and they are equal only when the bits are independent. The input

with

is never smaller

introduced

by the frequent

force the decorrelating

the active units. Another is that no two frequent

contain

connections

simultaneous to increase

feature of the code, which is not due to an letters are assigned

while the code is not completely

reversible,

information

(table 2). This is significantly

present in the input

patterns

letters tend to have fewer active bits than

the correlations

of cells would

output

it preserved

- 71 -

the same output,

a large proportion

so that

(97%) of the

better than the amount

of

information

retained

by an untrained

random

network,

which

in this example

is less

than 50%. input patterns

network output 0000000000000000 0000010010000000 1000000000000000 0000100000000000 0100010000000000 0000000010000000 0001000000000001 0000000000000010 0000010000000000 0001000000000000 0000110000000000 0000000000100000 0001000000001000 0001011000000000 1000010000100000 0100010000010000 0110000000000000 0001010010000000 0000000100000100 0011010000000100 0000000000001000 0000000000001100 0010000000001000 0001010000000000 0000010000010000 0010010000001000 0000001101000000 0001000010001000 0010011000000000 0000011100000000 0000001100010000 0000011000001000 0010000000000000 0000000000000100 0010001000001000 0000011000011000 0010010000000000 0010010000010000 0000010000011000 0011010000000000 0010000000011000 0010000001000000 0000011000000000 0000010000000100 0000001000000000 0010011100000000 0010000100000000 0010011000001000

Table 1. The code generated rows indicate

the output

hand side (a=O.Ol,

by the network

e t

i! 0 a n s r ( - u x hm 1Tl c u a. f b P 4 Y w v . 1CG NH kBRF 14 1 X

q ) w

V P S8 A:< ; 2 OOQU9 j/'=>% EK D 6 M L z 3 5 + [ ? # 7 J

after the presentation

of the 16 units for the input

p=O.OOl, ~0.01,

h = 10, p=O.l).

- 72 -

patterns

of 8000 letters.

indicated

The

on the right-

input

output

number of units

120

entropy

4.34 bits

4.22 bits

24.14 bits

5.86 bits

456%

39%

bit probabilities

high

low

representation

distributed

sparse

(E)

sum of bit entropies redundancy

(e)

(R)

(8x15)

Table 2. Some properties

A property

of the code in Example

of the code, which is important

is its smoothness:

patterns

as in the case of letter e and o and even in the confusion

patterns

tend to get mapped

fields of the units reflect the properties

detect one of the most frequent

other units are less selective

letters and become

and their receptive

of features in the input patterns

to similar

output

of 0, 0, Q, u and 9 in

preserving,

redundancy

reducing

the network

transformation

sparse activity

all’ mechanism

implemented

by the fixed inhibitory

representing

it by a statistically

What implications pervised

allow

the activity

Generalisation

therefore

implemented

while

many

combinations

a smooth, input

of units. In contrast connections

more than one unit

uncorrelated

or not highly

in both examples

of a smaller

number

works best for infrequent,

- 73 -

patterns

in competitive

into an

learning,

to be active for each pattern, correlated

that frequent

of units

information

with the ‘winner-take-

does such a code have for generalisation

layer ? It can be observed into

selective,

of the distributed

uncorrelated,

connections

highly

fields consist of different

approximately

these modifiable

of the code. Some of the

(figure 22).

In both of the above examples

coded

of view of general-

1. The receptive

units

input

(97% of input)

2.

from the point

isation,

table

similar

16

than

‘unknown’

set of features. in a subsequent patterns

tend to get

the infrequent patterns

su-

ones.

that are repre-

sented as sets of more frequent, representation

‘known’

tends to be more localized,

more specifically, Unlike

without

interference

arrangement

redundancy

present in the environment,

with

the input

performs

a recoding

other

that corresponds

out above, is often not desirable.

recodes input

vectors by combinations

of the proposed

modified

so that learning

threshold

and activation

simulation

of this modified

network

to the methods

algorithm

The method

there were no training

spatial scales.

as in Model

competitive

of clustering

implemented

(Everitt,

and vector vector, as in Model

‘winner-take-all’

learning,

were the same as in Model

Model

as well as the result from the untrained,

6, with the exception

cycles, only the 100 initial

after 8000 learning of the random

threshold

adaptation

the 6 was

activation.

6. The result

6

to the

1974). To compare

took place only on the unit with the maximal

algorithm,

a

forms of

categories, which corresponds

analysis with

different

a single class to each input

are shown in Table 3. The results were obtained

the same parameters

to them

algorithms

of overlapping

dynamics

the

to consider

classes, ‘winner-take-all’

1974; Gray, 19841. Assigning

in cluster

be useful

over different

competitive

was pointed

performance

patterns,

can be associated

each layer extracting

space into discrete

as ‘clumping’

it may

or redundancy

[ Everitt,

known

patterns

network,

quantisation

method

For more frequent

from other associations.

of such subnetworks,

Comparison

By dividing learning

so output

in the case of a linear

hierarchical

4.7.

components.

The of the

random

cycles with

column,

where

steps, as in all

the other runs. As can be seen in the table,

the anti-Hebbian

competitive

redundancy

as well as a lower rate of information

loss. The random

the random

recoding

(Kanerva,

significantly

worse with this non-random

used by Marr

and Kanerva

input distribution.

- 74 -

; Marr,

net gave a lower code, analogous 1969), performed

to

input

random (untrained)

modified competitive

anti-Hebbian competitive

entropy

4.34

sum of bit entropies redundancy

24.14 456%

1.91 (44%) 3.94 106%

3.67 (85%) 6.88 87%

4.22 (97%) 5.86 39%

Table 3. Comparison

of the anti-Hebbian

competitive

network

with alternative

coding

schemes.

One of the possible Hebbian

competition

representation

reasons why competitive

is that while

to frequent

ones, it is exactly

patterns

the opposite

weakening

the ‘winner-take-all’

the simple

competitive

of units than infrequent

categories.

enhancement’, 1989b;

competitors Gaussian

and

& Payne,

Carlson

to Model

6 (Rubner into binary

such a hierarchical

centred

outputs,

measures.

inputs

Grossberg

adapt in proportion

have continuous

studied

representations

a sparser

to infrequent

By switching

one can also obtain

multiple

of pattern

off or

outputs

space attract

the redundancy

from

most of

(figure 23.)

schemes have also been proposed

instead of a complete

distribution

to assign

competition.

But as dense regions

ones, increasing

Rolls,Littlewort,

tends

worse than anti-

input patterns evoke a response from a much larger number

Other ‘soft’ competitive multiple

and more distributed

mechanism,

performs

learning

in the case of simple

network.

the weight vectors, frequent

anti-Hebbian

learning

Rolls

et al. studied

mechanisms

‘winner-take-all’

competition

1991). Nowlan

proposed

to the relative on the weight

probability

a nonlinear

that the input

1982; Rolls, in which

1990). As these models

them to Model

version of the Rubner-Schulten

1990). The arguments

scheme also apply

to this case.

- 75 -

all

came from a

6 based on entropy model

very similar

et al., 1990), and used it to code one- and two-dimensional form (Carlson,

into

of ‘contrast

(Grossberg, a scheme

vector (Nowlan,

it is not easy to compare

to classify inputs

mentioned

in chapter

continuous 3 against

Figure

23. Simple

the pattern

competitive

space. With

regions cause activation

learning

assigns more weight vectors to dense regions

the winner-take-all

mechanism

in a larger number

disabled,

patterns

of

in the dense

of units than ones in sparse regions of pattern

space.

4.8.

Assumptions

and biological

The assumptions

of Model

units here take only positive total input.

models

connections

and negative

values,

with

and

vice

versa.

Such

biological

excitatory,

while

assumption

still needs an indirect

all the feedback

mediated ‘Learning adaptive

a functionally

negative

by an interneuron interneurons’ inhibitory

of the

on the same unit were allowed

to take

connections

were demonstrated

to of

possible

This

pyramidal ways to

is that inhibition

is

neuron, or a small group of neurons.

to have a very

in linear networks -76-

One possibility

values.

neurons,

2, there are several

value.

negative

were positive,

inhibitory

as the main cortical

in Chapter connection

from

to be characteristic

had negative,

assigned to the excitatory

connections

changing

6, all feed-forward

connections

models:

function

are not believed

interpretation,

As mentioned

are a nonlinear

values sometimes

reversals

synapses. In Model

cells are excitatory.

than those of the previous

values, and their outputs

individual

achieve

6 are more realistic

In the previous

both positive positive,

relevance

(Baddeley,

similar

effect

1991; Plumbley,

to direct 1991), and

they would

also be biologically

plausible

adaptation

in non-linear

unspecific,

fixed baseline of inhibition

modified

networks.

in the anti-Hebbian

would be modified Note

candidates

An alternative mediated

direction

that in Model

rounding

results in a nonlinear,

plausible

than a rule linear

unblocking

of the NMDA

or by the saturation

of the calcium

& Levy, 1990). Rounding patterns

a relatively

An essential

additional

the

simple

network

competitive

that neurons of activity

Tomasulo,Levy,

of inactivity,

(Prince

underlying

caused either by the potentiation, spine heads

to calculate

impossible

in Model

the entropy.

was evaluated A much

(DeSieno,

realistic

to carry out.

the ‘firing

adaptive

1988; Grossberg,

adaptation.

probability’

threshold

of the

mechanism

1976b). There

their own level of excitation

& Martin, ). Neurons

1990; increase

of the cell. Evidence

short-term,

after a period and are thought

for such phenomena

-77-

1990;

following

longer

of repetitive

firing.

to involve

after-

was found

1988), as well as in the hippocampus

in a

of their

& Desmond,

excitability

has

is experimental

as a function

Levy,Colbert, their

A

is that

more

6 is that of threshold

keeping

whether

caused by units that never win the competition

are relatively

& Huguenard,

more

in synaptic

be considered.

and decrease their excitability

of these effects

hyperpolarisation neurons

& Steward,

involved

of the network

in the model

can regulate (Douglas

is biologically

mechanism

nonlinear,

in order

low level. A similar

problem

Such

was also necessary to determine

could

assumption

of the output.

system in the post-synaptic

or different

serves the purpose

been used to avoid

Some

channels

in the model

set of patterns

units at or near the predefined

evidence

The cellular

of the way the performance simple

values, and binary

rule, which

to be highly

set size would have made the simulation

This mechanism

had continuous

learning

buffering

were identical

consequence

periods

strength

in the unit activations.

voltage-dependent

history

ensure that the total connection

discontinuous

are believed

stimulus

synapses

stages, and in the evaluation

potentiation

only

be to have a rather

and excitatory

activations

long term synaptic

further

would

by interneurons,

could

6 the neural

was used only in the learning

two output

solution

of anti-Hebbian

in the desired direction.

rounding

(Holmes

for the mediation

in neocortical

(Hotson

& Prince,

1980; Lancaster be required

& Adams,

to explain

There relates

are two

of concern

even

after

full

instability

decreasing

schemes

adaptation

Perrett,

short

temporal

cortex

helps to avoid

response

stimulus,

carrying

As the visual

to an input

in Model

constants

an uncomfortable time

response

latencies

highly

responses

from

ranging

discriminative

between

of cells in the primary

shortest These

approximately

latencies, results

development generation

seem

to limit

of feed-forward of the response

visual cortex

5 ms, this leaves almost

and seems

to rule

out significant

the role of recurrent connections;

they could

in case of a fully adapted

- 78 -

relying

recurrent

to a

and Perrett

They

the onset

&

namely observe

of the visual

5 ms of their response.

no unaccounted

latency

of 4 steps,

of

each

time in case of the

or top-down

not playing

that of

in the macaque

a minimum

connections

the

(Oram

of this class,

have a minimum

lateral

to

that

lo-15 times

operations.

in the first

network.

minor

or top-

show

for head views

requires

to only

this

the time necessary

for all models

cortex

that

on lateral

Simulations

69-213 ms from

information

stable

this problem.

by Oram

cells selective

at least 40 ms, and the route to the temporal step requiring

these

6, as

corresponding

fast, at most

Results

question

to complete

latencies

pattern.

a

and the associated

and concerns

6 are relatively

of the units.

leads

of plasticity,

all models

models),

is always

simulations

patterns

certainly

auto-associative

with

input

rate in models,

is enough

extremely

by the individual

One

to be completely

often

periods

time

there

the code, and most

(including

still

1987). In Model

during

of critical

1991) still raise

whether

however,

with

which

& Grossberg,

observed,

is one associated

may

of the model.

by the the network,

The existence

in

form

It was

of the activations

the characteristic

by the present

the code is not guaranteed

change

mechanisms

threshold.

(Carpenter

problem

convergence

convergence

or synaptic

as well,

generated

measures.

interactions

obtain

algorithms

not completely

The other

raised

of the code generated

in the outputs

performance

questions

convergence.

does

changes

down

further

in competitive

in other competitive

cellular

longer term changes in response

to the stability

matter

1986). Additional

interactions.

to the guidance a significant

of the

role in the

4.9.

Hebbian

Perhaps large number taking

and anti-Hebbian

the most important

combinations

The properties

proportion

Wiesel,

is that they can connect

of signals

they receive

selective

to stimulus

of the vast range of possible

feature

from

extraction

previous

visual

responding

stimuli

by

stages of

visual cortex vary greatly,

parameters,

to a

but it can

only to a very

(Barlow,

1975; Hubel

&

1962). Much

genetically

of the anatomical (Hubel

evidence

LeVay,

structure

& Wiesel,

that experience

development

receptive

Wiesel,

stimulus fields

1977; Kleinschmidt,Bear, & Hahn,

It

modification

of synaptic

postsynaptic

neural

has been strengths

activity

rise to these properties there is also plenty environment

(Fregnac

& Singer,

1987; Rauschecker

1982).

giving

1970). However,

in the normal

of sensory

Rauschecker 1982;

of neurons

of neurons in the primary

be said that they are highly small

property

in development

of other cells in a specific way. They can perform

well-defined

processing.

mechanisms

of experimental

is necessary for the proper

& Imbert,

1984; Hubel,Wiesel,

1987; Movshon

& Van Sluyters,

& Singer, 1981; Shatz, 1990; Sherman demonstrated

that

depend on the temporal

(Brown

is determined

the

& 1981;

& Spear,

experience-dependent

correlation

between pre- and

et al., 1990; Fregnac,Shulz,Thorpe,

& Bienenstock,

1988). Visual

deprivation

experiments

point

dependent

self-organizing

mechanisms

in the development

Functional

modifications

comparing

distributions

in animals

raised

artificially

altered

lines, or without

of receptive

under

different

statistical continuous

more cells responding missing

linked

ones. Other

to visual

the significant

experience

field properties conditions.

structure motion

out

Kittens

(e.g., with

role

of the visual also been

in different

populations

of neurons

raised

in environments

with

many

in a stroboscopically

gave similar

vertical,

inferred

by

an

but no horizontal

lit environment)

results for other visual

- 79 -

system.

have

to the features that were present in the environment studies

of activity

parameters

have many than to the to which

cortical

neurons

orientation

are sensitive,

selectivity.

such as directional

These results

suggest

selectivity,

the existence

which on the one hand would relax the requirements the developmental of individual Many

results

have been proposed

purely

contrast

(Barrow,

structure

maximal

the representation

Bender, achieved

between velocity

and direction

throughout

1972; Perrett,Rolls, for arbitrary

and special

mechanisms

the role of

rules, but they of information

visual

from retinal

neurons

tuning,

as

to hand-like

as faces are exceptionally

may exist for their processing.

the processing

receptors

to temporal

Some highly shapes,

through

cortical

areas,

pattern-selective while

other

even to the spatial

& Rocha-Miranda,

- 80 -

orientation

passes through

is achieved.

preferentially

connections

the sensory system. The sparseness of

cells

orientation

1969; Gross,Rocha-Miranda,

& Caan, 1982). Such extremely

stimuli,

playing

is

tuning.

as information

high level of selectivity

of the face (Gross,Bender,

concentrate

feed-forward

cells with similar

best to faces, with some cells showing selectivity

or identity

only

on the development

with Hebbian

and secondary

in these areas respond

respond

feedback

system, sparseness is increased

cells, primary

1986;

and the role of learning

here also use local learning

in combination

inhibition

learning

1982; Linsker,

1989). These models

fields,

with

based on Hebbian

of the

structure.

seems to be increased

where an extremely cells

presented

is used widely

In the visual

ganglion

connections

those with similar

Sparse coding

retinal

of the receptive

Several models

& Munro,

& Kammen,

than that of spatial feedback

selectivity

and ocular dominance

ones in that they concentrate

Anti-Hebbian

well as between

allow the sensory system

mechanisms.

1987; Bienenstock,Cooper,

The models

rather

feature

learning

et al., 1989; Yuille

differ from the previous

hierarchy.

selectivity

to feed-forward

can help explain

of neural

by local, Hebbian

enhancement.

representation

of

in the sensory

on the spatial

assigned

and precision

on the development

1973; Miller

mostly

mechanism

environment.

of orientation

Malsburg,

for the complexity

and

to be best suited to the statistics of their particular

system can be explained development

dominance

of a learning

process, while on the other hand it would

animals

ocular

high selectivity

important Nevertheless

stimuli

&

may not be for monkeys,

it is also unlikely

that face processing of experience

on the response

1989). A similar successively

is completely

principle

higher

determined

genetically,

of face-selective

cells (Rolls,Baylis,Hasselmo,

is used in the taste system,

levels, and become more narrowly

-81 -

as demonstrated

where

by the effect & Nalwa,

cells are more selective

tuned (Rolls,

1989a).

at

Chapter

5. Learnina

invariances

How can we consistently position,

distance,

recognise objects when changes in the viewing

size, orientation,

(e.g. of a newspaper

relative

or a gymnast)

position

can change their retinal

The visual system must

contain knowledge

to generalise

correctly.

Part of this knowledge

is also likely

that the visual

plenty

of examples Simple

take weighted

model

number

The solution

of input

are only

‘similar’

from our subjective

difference

the transformed

is probably

so significantly? in order to be able

determined

genetically,

but it

which

image

lines generate

patterns

contains

judgement

in Hamming

similar

outputs.

into the same category.

very differently is therefore

of similarity.

or a relatively

distance,

small

patterns

Competitive

distance,

which

is

transformation,

distortion

of shape can

units tend to respond

poorly

to require

which

For these units,

Any simple

so these simple

and generalise likely

chapters, in which

if they are close in Hamming

or a rotation,

to this problem

in the previous

show a form of generalisation

clusters ‘similar’

such as a shift in position cause a great

about such transformations

such as the ones studied

sums of their inputs,

patterns

very different

projections

system learns from its sensory experience,

neurons,

for instance,

however,

of the object itself

of such transformations.

that differ on only a small learning,

or deformations

angle, eye

to

across the transformation.

either a more complex

model

of a neuron, or a network of simple units.

5.1.

Shift

invariance

Fukushima proposed network

consisting

complex) tions,

a simple

(Fukushima, solution

1980; Fukushima, to the positional

of alternating

layers. Feature detectors

while

the outputs

feature detector

1990; Fukushima invariance

problem

(3’ or simple)

in the ‘s’ layer are replicated

of detectors

of the same feature

- 82 -

& Miyake,

1982)

by a hierarchical

and invariance in many

are pooled

(‘C’ or

different from

posi-

different

positions region

in the ‘C’ layers. can therefore

The presence

activate

the appropriate

This idea is consistent

with

(Hubel

& Wiesel,

1962; Spitzer

receive

their

inputs

same

major

orientation

detecting to more

in different

selectivity

are completely

invariance,

but some

has successfully

been applied

al., 1989). LeCun using

be adjusted.

Fukushima’s

is also

not

a global

model.

invariance tasks

There is

is limited

(Nazir

modeling assumptions

way,

weight

plausibility

vector

one of the ‘simple’ weights fied

However,

neural

(except

units

rise

sensitive only

shift

1980; LeCun

et

(zip codes)

by

that need to

the operation

limited

units

This architecture

digits

on perceptual

networks

achieving

in accordance that

with

translation

discrimination

way

learning

- 83 -

to ‘complex’

mechanism),

This operation

in

essential

the question

by LeCun units

et al.,

in a fixed

unit have the same

whenever

to the same complex

sharing’).

from

used

to a complex Therefore

of their

Apart

algorithm

(e.g., by a Hebbian

also be helpful

of some

are connected

in position).

units connected (‘weight

may

seems very difficult.

that are connected

are modified

the same

by these

detectors

for a shift

of all the other simple

in exactly

not

(Fukushima,

the implementation

networks

that the feature

units

system

way,

psychophysically

of the backpropagation

and that all the simple

input

visual

are still

of free parameters

is spatially

demonstrated

presented

system.

in biological

assume

of handwritten

but

giving

1990).

of the principles

of the biological both models

been

in the human

the visual

recognition

transform

they

In this

that

for the

of feature

gradually

while

problems

evidence

cells

selective

is also achieved.

the number

cortex

In the top layer,

recognition

to reduce

visual

the pair

way,

invariance.

tolerance

psychophysical

It has

& O’Regan,

Some

reliable

constraints

subunits

of its components.

in pattern

a limited

that complex

model,

of the pattern,

of distortion

et al. achieve

such architectural

invariance

position

degree

assume

in a hierarchical

of positional

to the position

relative

1985) which

In Fukushima’s

range

within

cells in the primary

cells or simple-cell-like

is repeated

and a larger

to the approximate

of complex

positions.

layers

indifferent

models

simple

in any position

‘C’ unit.

& Hochstein,

from

and invariance

of the feature

the weights

of

the corresponding unit need to be modiis non-local

for

the

synapses

of all the units

signal broadcast either,

is not available

version

the same complex unit

by letting

(a ‘family’)

simple

adaptively

units

Instead

would

would

detectors

of requiring

simple

in an identical

independently

unit (form

‘clubs’).

simple-to-complex

units way,

permanently

and then allowing A learning

connections.

A simple

Hebbian

patterns

If the input

to the simple

layer contains

between

detectors

tions would

Model

of that feature

prevent

in different

positions.

to to a

ones to connect needed

to specify

rule, which

depends

only detect overlapping an example

of the feature

never be significant

The absence

of positive

overlap correla-

to the same output.

7

in which

solution

(first

the modification

to a time

average

(a ‘trace’)

to keep the weight

published

of the synaptic

not to the pre- and postsynaptic

order

then there will

those units being connected

The proposed rule,

at any moment

connected

similar

rule is therefore

here as it would

position

all connect

the same goal can be achieved

does not work

one spatial

for local learning,

of the same feature

activations,

at only

not solve this problem

be necessary

only on instantaneous in the input.

A ‘learn now’

mechanism

in which

to develop develop

to a complex

these modifiable

trace

is needed

unit.

which

units

modified.

units.

rule with

An arrangement

originally

unit to all its simple

of the input,

to the simple

Hebbian

complex

for the one that was

by the complex

as the shifted

5.2.

except

activity,

strength

but instead

of the postsynaptic

vector

in (Foldiak,

1991)) is a modified at time

y. A decay

bounded:

Awij(t)

= a Yi(t) [ xj(t) - wij(t)

where

- 84 -

step t is proportional

to the presynaptic

activity

1,

Hebbian

activity term

(x) and

is added

in

Yict) = (1-S) yi(t-I)

A similar

trace mechanism

of classical

conditioning

effect that

activity

temporal

has been proposed by Sutton

filtering

will

the modified

5.3.

in the 64 positions

During

simple

training,

moment.

This

that the desired

on the activity

positions

of only one

line detectors,

units

is

In the simple

one unit for each of 4 orienlayer,

units. lines covering

of the simple

at different

moments

to the network.

the whole array are selected at random

These lines are swept across the visual units in time

of the appropriate (figure

of time comparable

to the duration

units that get activated

orientation

- 85 -

one

for a period

all the connections

units of only one orientation

in

of these

unit succeeds in exciting

the rest of that sweep get strengthened

sweep, causing simple units of only one orientation

field, but

The activation

unit gets enhanced

of the sweep. Therefore

during

Hebb rule. Simple

24).

If an active simple

units, then the trace of that complex

to the modified

and complex

on an 8 x 8 grid. There are only 4 units in the complex

and two directions.

units is the input

the simple

the goal is to learn shift invariance. oriented

moving

rise to activation

between

of the four complex

simple

at a later

the assumption

As the trace depends

in which

to the simple

from four orientations

different

learning

embodies

of the connections

in an example

fully connected

giving

& Barto, 1981). A trace has the

Simulation

layer there are position-dependent tations

1972) and used in models

rule is still local.

The development simulated

influence

of the activity

features are stable in the environment. unit,

by Klopf (Klopf,

and Barto (Sutton

at one moment

low-pass

+ 6 yi(t).

from the according

get activated

to connect to the given complex

during unit.

a

Figure

24. Five consecutive

segments

at each position

corresponding inactive

orientation

units

(Xj

frames from one of the sequences in the figure

represent

and position.

Thick

the activation

segments

= 0). The trace is maintained

used as input. of a simple

The four unit of the

are active (Xi = l), thin ones are

between sweeps.

-To prevent

more than one complex

some kind of competitive, For the sake of clarity

inhibitory

would

also be possible.)

strengths,

possible

is necessary between the complex competitive

described

to the same orientation,

scheme (Rumelhart

here. (The use of decorrelating

units.

& Zipser, connections

Each unit took a sum of its inputs

weighted

by the connection

yk of the unit with the maximal

weighted

sum was set to 1,

and the output

while the outputs

interaction

the simplest

1985) was used in the simulation

unit from responding

of the rest of the units were set to 0:

1 if argmaxif

Figure 25.a shows the initially

random

Cj wijxj ) = k,

connections

between the simple

units, while figure 25.b shows the connections

after training

retina. The result and the speed of convergence

is relatively

and the complex

with 500 sweeps across the insensitive

to 6, the constant

determining

the decay rate of the trace. Too short time constants (6 near l), however,

not enough

to link different

those demonstrated lead to confusion two different

phases of the transformation

and result in results similar

in figure 25.~. On the other hand, too long time constants between the subsequents

(small

are to 6)

sweeps, as the trace does not decay between

transformations.

- 86 -

Figure 25: Connection training

patterns

of the four complex

units (a) before training

on 500 line sweeps across the visual field. The length

the strength orientation

of the connection to the complex

from the simple

unit. Initial

weights

An alternative

version

connectivity

very similar

this version

contains

were chosen from a uniform

of the algorithm

without

simulation.

This version

is physiologically

(1-S)

units (presynaptic

ZjCtT1)

more plausible,

-87-

and

distribution

The learning

+

6

Xjtt).

as will be pointed

rule in

trace), rather than of the

where =

position

trace (6 = 1).

complex ones:

xj(t)

indicates

achieves the same goal, and results in

to that given in the previous

a trace of the simple

of each segment

unit of the corresponding

on [O,O.l]. a = 0.02, 6 = 0.2. (c) The result of training

and (b) after

out below.

The application

of this algorithm

(e.g., three-dimensional

rotations

to more complicated

or deformations)

even harder to see how they could be specified in which

such invariance

the case of positional required

properties

invariance

by the proposed

detectors

should

environment

should

generate

from

and biological

Electrophysiological perception

experiments

are due to the receptive

system.

Complex

position

within

a limited

contrast,

and retinal

Mishkin,

fields (Azzopardi

1977; Miyashita

et al., 1989; Perrett

that

are coincident,

that

the

the activation

of

of time.

of individual

more complex

of

cells in the visual invariance

cells in higher

forms of invariance

of the stimulus,

properties

approximate

1962), while

to

visual

to the size, colour,

and they also have much

& Rolls, 1989; Bruce et al., 1981; Gross, 1991; Gross &

& Chang,

1988; Perrett,Harries,Bevan,Thomas,Benson,Mistlin, presented

above to explain

these properties

assumptions.

The first one is about the form of synaptic synapse is strengthened

feature and

the invariance

cortex exhibit

& Wiesel,

et al., 1982). The model

makes three fundamental

causing

be

plausibility

visual

location

than it is in

All that would

of processing,

a short period

suggest

range (Hubel

cortex exhibit

orientation

stages

field characteristics

cells in the primary

areas in the temporal

larger receptive

previous

detectors within

Assumptions

models.

the way

the transformation-dependent

sequences of the transformation

these transformation-dependent

5.4.

or LeCun’s

is that

as it is

some kind of learning;

could be wired in is much less obvious

algorithm

invariances

may be even more interesting without

in Fukushima’s

be available

or abstract

learning

not only when the activations

but also when they are both activated

of time.

The trace mediating

biological

implementations.

membrane

time constants

this memory As cortical than previously

estimated

- 88 -

that a

of the pre- and post-synaptic within

for recent activity pyramidal

rule: it is assumed

cells (possibly

a relatively

short period

could

have many

may

have

greater

cells

much

possible longer

than 100 ms), the

simplest

possibility

synaptic

activity

independent kept

is that (Stratford

from

channels.

potentiation released

of synapses,

have

the

(Levy state

to occur.

& Steward,

in

activations. (Bullier

the

network,

so that

input

cells combine

inputs

connect

to a complex

unit

tuning

specific cortex.

from simple

only

spatial

invariant

may come from

rotation,

independent

and

while

cells are slightly

(Perrett,Hietanen,Oram,

responses

learn

direct

input

input

longer

found

number

of the face.

- 89 -

from

suggests

are their

the LGN

from

simple could

cell.

If

be used to and spatial

Another

example

of transformationin monkey

to only one view of others

Response

than that of view-dependent 1991), which

phases

cells do not receive

and phase tuning.

selectively

of a smaller

rule

to combine

by the combination

respond

in the

learning

the same orientation

neurons

to all views

& Benson,

receive

neurons

the response

respond

positions

for

is preserved

transformation

can

to receive

cells with

would

section.

the face-selective

Some of these face-selective profile),

which

the alternative

cell

and

at the

activation

cells, the above mechanisms

the simple

different

transformation

detectors

(e.g. right head

but with

et al., 1988),

transmitter

for LTP in the hippocampus

for different

‘complex’

the

to the receptor

1971; Stone, 1972), layer 2 complex

(Gilbert

complex

in the previous

the

in the long term

postsynaptic

activity

cells are known

1979; Hoffman,

thalamic

for achieving

the

An

of synaptic

1990). Such a mechanism

implement

is that detectors

some complex

& Henry,

would

trace presented

assumption

While

frequency

precede

is

may be

activity.

constants

Therefore

was in fact required

this mechanism

the presynaptic

available

to

& Levy,

average

to be involved

kinetics.

1983). As it is the presynaptic

The second

direct

activation

time

of past

that the trace

by neural

may still be bound

(Holmes

This condition

of the receptors,

involving

slow

activation

activation

presynaptic

potentiation

relatively

likely

changed

are believed

trace

The running

the activation

which

presynaptic

an electrical

it is more

is gradually

involve

receptors,

time of the post-synaptic

retain

of the neuron.

that

may

NMDA

at an earlier

require

activation

concentration

implementation

receptor

could

et al., 1990). However,

the electrical

by a chemical

alternative

the membrane

temporal of faces

are invariant

latencies

to

of view-

ones (130 vs. 130 ms)

that the view-dependent

ones

may provide enough

input

to the invariant

transformation-phase

transformations.

simple

sensitive

This problem

transformation

is required

units

cells. A further cells

is especially

to

obvious

simultaneously.

can be avoided

A receptive

to shifts of only

dependent

face selective

tuning

a fraction

tuning

the

to more than one large number

of

broad.

fields are fairly

would

of the receptive broad

of

is sufficiently

of discontinuities

of the width

assumption

be much field.

more

The view-

(approximately

constantly

activations

may even be present

early development the developing motion

provide

60 degree)

invariance

property,

while

invariance

to perspective

for intrinsically

complex,

non-local

Malsburg,

(Coolen geometric

alternative

visual

general

transformations

plausible

form

& Shatz,

during in

1990). Forward for the size

be able to train

of objects. Other kinds of invariances,

and the

are shown in figure 26. appear unreasonable,

than previously algorithms

proposed

& Malsburg,

1989), backpropagation

pattern

relations

(Glunder,

would

teaching

- 90 -

inputs’

& Zipser,

involve

more

(Bienenstock

learning

&

in a Hopfield

19871, or the computation

1986). A biologically

to the present algorithm

(Rumelhart

which

matching

19891, Hebbian (Hinton,

the present algorithm

models,

such as graph

& Kuijk,

representations

of motion

sweeping

waves of activity

would

areas as ‘unsupervised

invariant

generated

of

or eye

stimulation

a more

or slower

or addition

visual

ideal training

1987; Buhmann,Lange,

network

motion

space, for instance, would provide

As none of these three assumptions to be more

object

examples

The necessary

(Meister,Wong,Baylor,

necessary for their training

appears

provides

invariance,

before the start of normal

retina

in three dimensional

shift

the necessary transformations.

as there is evidence mammalian

transformations

is that the environment

In the case of learning

movements

towards

range

cells, whose receptive

cells also show a relatively

important

the transformation.

higher

total

there are

to head rotation. The third

inner

the

when invariance

if their transformation-phase

field with a large number

sensitive

cover

is whether

The need for an excessively

This does seem to be the case with visual simple smooth.

consideration

more

plausible

be to use backprojections to bias Hebbian

of

from

mechanisms

1985). Such a mechanism

could

Invariances position

orientation

00

size/depth

0

shape

ViWd

0 /\

l!@j

/'

I

I

'9

Suggest Documents