
If visual saliency predicts search, then why? Evidence from normal and gaze-contingent search tasks in natural scenes

Tom Foulsham1* and Geoffrey Underwood2

1Department of Psychology, University of British Columbia, Canada
2School of Psychology, University of Nottingham, UK

*CORRESPONDING AUTHOR:
Department of Psychology, 2136 West Mall, Vancouver, BC, Canada V6T 1Z4
Tel: 604 8220069
Email: [email protected]

RUNNING HEAD: Saliency in normal and gaze-contingent search

KEYWORDS: Attention, Scene perception, Saliency map models, Eye movements, Visual search


Abstract

The Itti and Koch (2000) saliency map model has inspired a wealth of research testing the claim that bottom-up saliency determines the placement of eye fixations in natural scenes. Although saliency seems to correlate with (although not necessarily cause) fixation in free-viewing or encoding tasks, it has been suggested that visual saliency can be overridden in a search task, with saccades being planned on the basis of target features rather than being captured by saliency. Here, we find that target regions of a scene that are salient according to this model are found more quickly than control regions (Experiment 1). However, this effect does not seem to be altered by filtering features in the periphery using a gaze-contingent display (Experiment 2), and a deeper analysis of the eye movements made suggests that the saliency effect is instead due to the meaning of the scene regions. Experiment 3 supports this interpretation, showing that scene inversion reduces the saliency effect. These results suggest that saliency effects on search may have nothing to do with bottom-up saccade guidance.
 


Introduction

Almost as soon as researchers could record where people were looking, this technique was applied to pictures of scenes [1, 2]. The results of these early experiments were clear: people tended to fixate some locations more than others, and these locations tended to be the places that were most useful for the task at hand. How do observers choose these locations? In considering this question, this paper will focus on one particular task—searching for something in a scene—and one particular model of how fixation locations might be chosen—the bottom-up saliency map model, and in particular the implementation proposed by Itti and Koch [3].
 
Saliency and eye guidance in natural scenes
The origins of saliency as a theoretical predictor of attention in natural scenes can be considered in the context of two complementary areas of research. First, in psychophysical studies of attention, a large amount of research has gone into investigating the factors that determine how people search for a target amidst an array of simple stimuli [4]. These studies normally use reaction time and “search slope” (the degree to which search time suffers as the number of to-be-searched items increases) as measures of the way in which covert attention moves among the different items. Theories of this process (such as Feature Integration Theory, [4]) often distinguish between “bottom-up” attention—which explains how some targets, such as a red line amongst green lines, “pop out” from a display and are found quickly regardless of set size—and “top-down” attention—which describes the volitional movement between items which are harder to find. In order to represent, preattentively, the relative bottom-up strength of different items, models of attention often propose a 2D representation: a “master map” or, according to Koch and Ullman [5], a “saliency map”.
 


In natural scenes, which are more complex than simple visual search arrays, and where eye movements tend to accompany shifts of attention, it has been more difficult to determine how attention is controlled. Eye movements are rapid and serial, but do bottom-up stimulus features determine their destinations? If they do, then we should be able to measure the statistics of images at the locations where people fixate and find that there are visual features that mark out these locations. Several researchers suggested that this was indeed the case [6, 7]. For example, fixated locations tend to have higher contrast and higher edge density than places that are not looked at [8]. Combining the types of features (colour, brightness, orientation) that were known from psychophysics to attract attention also proved difficult in complex natural scenes. The saliency map model aimed to resolve this problem and was proposed by Itti and colleagues as a robust way to combine low-level, linear filter responses into an explicit representation of bottom-up saliency for natural images [3, 9]. It combines features across several spatial scales in order to compute the centre-surround contrast of each point in the image. If overt attention moves in the same way as attention in simple visual search (a claim supported by observations of the image statistics at fixation), then people should move their eyes to points that are represented as highly salient on the saliency map.




Initial tests of the predictiveness of the saliency map model in natural scenes were promising. In particular, in free-viewing tasks where participants were just told to inspect the image, Parkhurst et al. [10] and Peters et al. [11] showed that saliency predicted fixations much better than chance. However, subsequent studies have questioned this claim. For example, Tatler et al. [8] suggested that much of the correlation between salient features and fixation arises because both fixations and features are biased towards the centre of the display. Henderson et al. [12] argued that saliency was a poor predictor of fixation, and that instead participants were fixating regions that were semantically meaningful, regardless of their saliency. Foulsham and Underwood [13] found that although salient regions were fixated more often than chance, this difference was reduced if the chance baseline was adjusted to take into account systematic tendencies in the way people move their eyes. Moreover, saliency was completely unable to model the sequential patterns (scanpaths) made by observers. Although the predictions of the saliency model may be more accurate while people watch videos [14], this finding has also recently been contradicted by Dorr, Gegenfurtner and Barth [15], who argue that correlations between fixations and visual features could just be a result of systematic scanpaths operating on natural images with non-randomly distributed features.
 
 
Saliency in naturalistic visual search
The evidence for the role of saliency in free viewing, some of which has been reviewed above, is mixed. However, it is sometimes forgotten that the Itti and Koch [3] model was introduced as a model of orienting in search. Indeed, that paper showed that the saliency model performed similarly to, or better than, human searchers looking for oriented lines amongst distractor lines, or for a camouflaged tank in a natural environment. It is particularly useful to consider the performance of the saliency model in search, not just because that is the situation for which it was designed, but because by defining a task for participants we can consider how bottom-up saliency interacts with top-down control. The task that observers are performing is critical to where people fixate [2, 16], but how top-down and bottom-up guidance are combined remains an important question.
 


Since Itti and Koch [3], several researchers have explored how saliency affects visual search in natural scenes. In one of the first of these studies, we found that, although objects defined as salient by the model were fixated early and often in a memory-encoding task (where people were simply asked to remember the scene), these regions were only rarely fixated when participants were searching for a non-salient target [17]. We have termed the process by which top-down control overcomes bottom-up saliency “cognitive override”. We found a similar result when the fixation of “medium” and “low” saliency target objects was compared [18]. Medium targets were fixated more often than low targets in an encoding task, but there was very little difference in a search task. These findings have since been confirmed by Henderson et al. [19], who used the finding that the most salient region is neglected in favour of a completely non-salient target as support for their “cognitive relevancy” model of search. According to this account, when searching, regions are selected based on a “flat” landscape: a partially parsed layout of the scene that identifies possible locations and objects without ranking their saliency. In a search task where target and distractor objects were arranged in arrays, Chen and Zelinsky [20] also demonstrated that eye movements were immediately drawn to the target object, regardless of whether highly salient distractors were present in the array. On the other hand, Foulsham and Underwood [21] showed that response times were moderated by the saliency of a distractor object adjacent to the target, even when that distractor was not fixated.
 


Of course, the baseline saliency map model is unable to explain even simple search tasks where top-down control is necessary (i.e., where targets are not salient according to bottom-up features; this was acknowledged in Itti and Koch [3]). Instead, several attempts have been made to combine visual saliency with top-down guidance. For example, Navalpakkam and Itti [22] proposed a model whereby the saliency map was combined with a “task-relevance” map derived from comparing points in the image with a stored representation of the target. An alternative is to combine bottom-up information with knowledge about where objects are likely to occur, and this is implemented in the contextual guidance model of Torralba et al. [23]. In Cutsuridis’ model [24], bottom-up saliency is but one of three parts of the gaze orienting system, with this initial visual processing combining with neuromodulation and top-down recognition to select targets. Although these and other models combine saliency and target knowledge, it is not known whether target saliency still plays a role (as it would do if saliency and target knowledge were combined), or whether the saliency map is effectively reduced to zero and overridden, leaving a flat landscape over which top-down selection occurs.
 
 


The present research
In this paper we present three experiments investigating the role of target saliency in search within natural scenes. Where this has been attempted previously (e.g. [18]), objects have been selected or modified on the basis of their saliency. A possible drawback of such studies is that these manipulations might interfere with the naturalness of the scene, introducing further confounds. Instead, in the present studies we did not manipulate scenes but used the predictions of the saliency model to select target regions that were either salient or non-salient.


In Experiment 1, we looked to see whether response times and eye movements prioritized salient regions. This would be predicted by any model where saliency is combined with target knowledge: salient and non-salient regions will both receive a top-down boost, but salient regions will still be represented by a higher “peak”. If there are other saliency peaks competing for attention then at least some of the time these might be higher than a non-salient target region, particularly if they are also similar to the target, and so it will take longer for the visual system to select the target region and to respond. If such an effect occurs, it is important to consider why, and the saliency map model makes a clear prediction: salient regions should be responded to more quickly because eye movements made to peripheral locations in the saliency map select them more rapidly. Even in a weighted saliency map with one winning target, a higher underlying peak should be quicker to evolve or reach a threshold than a smaller, less salient one.
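As an informal illustration of this prediction (and not part of the saliency model itself), a toy accumulator in which activity grows at a rate proportional to the underlying saliency reaches a selection threshold sooner for a more salient peak; the rate, noise and threshold values below are arbitrary assumptions.

```python
# Toy accumulate-to-threshold illustration of the prediction above; the rate,
# noise and threshold values are arbitrary and not taken from the saliency model.
import random

def time_to_threshold(salience, threshold=1.0, noise=0.02, dt=0.001):
    """Simulate one noisy accumulator and return its time to reach threshold."""
    activity, t = 0.0, 0.0
    while activity < threshold:
        activity += max(salience + random.gauss(0.0, noise), 0.0) * dt
        t += dt
    return t

# A higher-saliency peak (0.8) should, on average, win earlier than a lower one (0.3).
print(time_to_threshold(0.8), time_to_threshold(0.3))
```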
 
 
Experiment 1

Method

Participants
A group of sixteen students took part in exchange for payment. All had normal or corrected-to-normal vision, and their inclusion was contingent on achieving a good calibration with the eye tracker. All participants gave their informed consent.
 
Stimuli and design
The stimuli for this experiment were 90 colour photographs. All the photographs were realistic scenes showing outdoor environments (landscapes, houses and other buildings) and interiors. These images were sourced from a commercially available collection and selected from a larger set following screening with the saliency map model. To produce model simulations we processed each image using the saliency map algorithm described in detail by Itti and Koch [3]. The saliency model has been refined and updated, although for the most part the core assumptions (and predictions) of the model remain the same. The experiments reported here use a version of the model compiled from source code available at http://ilab.usc.edu/ and downloaded in May 2004. This model sub-samples the image at 8 spatial scales and extracts intensity, colour and orientation features at each scale. These features are combined across scales in a centre-surround fashion in order to compute the feature contrast—the degree to which locations stand out from their background—for each point in the image. The different “feature maps” are then normalised and summed, producing a map showing the overall conspicuity of different regions.
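In outline, this stage of the computation can be sketched as follows. The sketch is a simplified Python illustration rather than the compiled iLab code: only the intensity channel is shown, the dyadic pyramid is approximated with Gaussian blurs, and the particular centre and surround scales are assumptions.

```python
# Minimal sketch of centre-surround feature contrast for one (intensity) channel,
# approximating the multi-scale pyramid with progressively blurred copies of the image.
import numpy as np
from scipy.ndimage import gaussian_filter

def intensity_conspicuity(image, centre_scales=(2, 3, 4), deltas=(3, 4)):
    """Return a normalised conspicuity map from multi-scale intensity contrast."""
    grey = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    pyramid = [gaussian_filter(grey, sigma=2 ** s) for s in range(9)]
    feature_maps = []
    for c in centre_scales:                          # "centre" scales
        for d in deltas:                             # surround = centre + delta
            contrast = np.abs(pyramid[c] - pyramid[c + d])
            feature_maps.append(contrast / (contrast.max() + 1e-9))  # normalise
    # Sum the normalised feature maps into a single conspicuity map.
    return sum(feature_maps) / len(feature_maps)
```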



Although this map indicates which regions stand out the most, the model goes beyond this to produce explicit predictions of the shifts of attention that will occur. Specifically, a winner-take-all network chooses the most salient peak on the saliency map, which is then suppressed, allowing the next peak to be fixated, and so on. Figure 1 shows the first five locations selected by the model for one stimulus, along with the raw saliency map from which the model produces its output.
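This selection stage can be sketched as follows, under the simplifying assumption that inhibition of return zeroes a fixed-radius disc around each winner rather than operating through the dynamic network used in the model.

```python
# Sketch of winner-take-all peak selection with a simple inhibition-of-return rule.
import numpy as np

def predicted_fixations(saliency_map, n_fixations=5, suppress_radius=40):
    """Return the first n_fixations (x, y) peaks, suppressing each winner in turn."""
    working = saliency_map.astype(float).copy()
    ys, xs = np.indices(working.shape)
    chosen = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(working), working.shape)
        chosen.append((x, y))
        # Inhibition of return: zero out a disc around the winning location.
        working[(ys - y) ** 2 + (xs - x) ** 2 <= suppress_radius ** 2] = 0.0
    return chosen
```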
For the present experiment, screening ensured that all of the stimuli had five non-contiguous regions as the first five locations (that is, none of these regions were re-selected in the first 5 shifts of attention). In order to test attention to these locations, three regions were identified within each image. Based on the model predictions, the first location chosen by the model (hereafter, “high saliency”) and the fifth most salient region (“low saliency”) were selected as regions of interest. This allowed a large number of potential stimuli to be generated without experimenter manipulation of objects or other interference. In each case the regions were squares of 200 by 200 pixels (approx. 6˚ square) centred on the model-generated location. The size of patch was chosen following pilot observations showing that using smaller patches led to too many patches that were highly similar. These regions corresponded to areas that the model predicts should be potent at attracting fixations. If the model is able to predict dynamic eye movements across several fixations then high saliency regions should be selected preferentially to low saliency regions. However, in some cases, this difference might be small (or even zero, as the model will still decide between such regions due to small amounts of random noise which are added to the maps in the model). If the model is only useful in predicting the first few fixations, low saliency regions might not be fixated preferentially at all.

To test the ability of the model to predict which regions will be inspected, a further set of control regions was generated to give a baseline indication of the allocation of attention to regions selected by chance. For each image a random (x, y) coordinate was produced and a region defined around it in the same way as with the salient regions. The final 90 stimuli were selected from a larger set with one further criterion: that none of the test regions (high, low and random control) overlapped, ensuring that fixation of each was mutually exclusive. An example of each type of region is included in Figure 1.
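In outline, the region definitions and the overlap criterion amount to the following sketch; the image size and helper names are illustrative assumptions rather than details taken from the stimulus-preparation scripts.

```python
# Sketch of how the three 200 x 200 pixel test regions could be defined and
# screened for overlap; the image size and helper names are illustrative.
import random

REGION_SIZE = 200  # pixels (approximately 6 degrees square)

def region_from_point(x, y):
    """Return a (left, top, right, bottom) square centred on the given point."""
    half = REGION_SIZE // 2
    return (x - half, y - half, x + half, y + half)

def regions_overlap(a, b):
    """True if two (left, top, right, bottom) rectangles share any area."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def screen_stimulus(model_locations, image_size=(1024, 768)):
    """Accept a stimulus only if the high, low and random control regions are disjoint."""
    half = REGION_SIZE // 2
    high = region_from_point(*model_locations[0])    # first location chosen by the model
    low = region_from_point(*model_locations[4])     # fifth location chosen by the model
    control = region_from_point(random.randint(half, image_size[0] - half),
                                random.randint(half, image_size[1] - half))
    regions = [high, low, control]
    disjoint = not any(regions_overlap(a, b)
                       for i, a in enumerate(regions)
                       for b in regions[i + 1:])
    return disjoint, regions
```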
 
[INSERT FIGURE 1 ABOUT HERE]
 
A problem with selecting random regions as controls is that centrality is often confounded with saliency [8, 13]. If salient regions tend to be central, then comparing them to fully random control regions might be misleading, as higher fixation rates would be expected not because salient regions are more potent at drawing attention, but because they coincide with a tendency to concentrate gaze in the centre of an image. It was therefore important in the current study that salient regions were not closer to the centre of the screen than control regions. In fact, control regions were closer to the centre (8.5˚ away on average) than high or low saliency regions (10.22˚ and 10.37˚ respectively).
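For reference, a region's eccentricity can be converted from pixels to visual degrees using the display geometry given in the Apparatus section; the pixel resolution in the sketch below is an assumption, as it is not reported here.

```python
# Eccentricity of a region centre in visual degrees, assuming a 1024 x 768 display
# spanning roughly 34 x 27 degrees (see Apparatus); the resolution is an assumption.
import math

SCREEN_PX = (1024, 768)
SCREEN_DEG = (34.0, 27.0)

def eccentricity_deg(cx_px, cy_px):
    """Distance of a region centre from the screen centre, in visual degrees."""
    px_per_deg = SCREEN_PX[0] / SCREEN_DEG[0]      # roughly 30 pixels per degree
    dx = cx_px - SCREEN_PX[0] / 2
    dy = cy_px - SCREEN_PX[1] / 2
    return math.hypot(dx, dy) / px_per_deg
```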


For the search task, regions were cropped and displayed in the centre of a blank screen with the instructions to search for this target region in the following image. Each stimulus was paired with a test region. In half of these cases the region came from the same image (target-present trials), whilst the other half were filler trials where the region came from a different image in the same general category (for example, both showed houses). In the 45 target-present stimuli the three types of region were equally represented.
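The resulting design can be summarised by the following sketch of the trial list (45 target-present trials with the three region types represented equally, plus 45 same-category fillers); the field names are illustrative assumptions.

```python
# Sketch of the trial list implied by the design: 45 target-present trials
# (15 high-saliency, 15 low-saliency, 15 control targets) and 45 filler trials
# whose probe region comes from a different image of the same category.
import random

REGION_TYPES = ["high", "low", "control"]

def build_trials(present_images, absent_pairs):
    """present_images: 45 image ids; absent_pairs: 45 (image, foil_image) pairs."""
    trials = []
    for i, image in enumerate(present_images):
        trials.append({"image": image,
                       "target_present": True,
                       "region_type": REGION_TYPES[i % 3]})
    for image, foil in absent_pairs:
        trials.append({"image": image,
                       "target_present": False,
                       "probe_from": foil})
    random.shuffle(trials)      # trials were presented in a randomized order
    return trials
```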

 
Apparatus
Stimuli were presented on a 19” CRT monitor positioned 60 cm from participants and subtending a visual angle of approximately 34˚ x 27˚. Participants’ eye movements were monitored using an EyeLink II head-mounted eyetracker, which samples eye position at 500 Hz based on a video image of the pupil. Fixations and saccades were parsed by the eyetracker using displacement, velocity and acceleration thresholds of 0.1˚, 30˚/s and 8000˚/s² respectively. Manual responses were collected by a gamepad.
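The quoted thresholds correspond to a velocity- and acceleration-based parse of the 500 Hz gaze samples, as sketched below; this is an illustration of the general approach and not the EyeLink's proprietary algorithm, and it assumes samples already expressed in degrees.

```python
# Sketch of saccade detection from 500 Hz gaze samples (in degrees), using the
# velocity and acceleration thresholds reported in the text; the EyeLink parser
# itself is more sophisticated (e.g. it also applies a displacement criterion).
import numpy as np

SAMPLE_RATE = 500.0      # Hz
VEL_THRESH = 30.0        # degrees per second
ACC_THRESH = 8000.0      # degrees per second squared

def saccade_samples(x_deg, y_deg):
    """Return a boolean array marking samples whose motion exceeds the thresholds."""
    dt = 1.0 / SAMPLE_RATE
    vx = np.gradient(np.asarray(x_deg, dtype=float), dt)
    vy = np.gradient(np.asarray(y_deg, dtype=float), dt)
    speed = np.hypot(vx, vy)
    accel = np.abs(np.gradient(speed, dt))
    return (speed > VEL_THRESH) | (accel > ACC_THRESH)
```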
 
Procedure
The procedure is depicted in Figure 2. Following a short calibration and practice, participants were presented with a series of 90 trials in a randomized order. Each trial began with a target region (which could be a high saliency, low saliency or control region) presented in the centre of a blank screen for 2 seconds. This was followed by a drift-correct dot, at which point the experimenter confirmed that fixation was steady and in the centre of the screen by pressing a key. A full-size image then appeared on the screen and participants had to respond as quickly as possible to indicate whether the image contained the target region by pressing one of two keys on the gamepad. The search display was terminated by their response. At this point, if the participant responded to indicate the target was present, a further screen asked them to indicate in which of the four quadrants the region was located. This test was added after a pilot study in order to make sure that participants were not guessing based on the gist of the image and that they were actually trying to locate the target region.
 
[INSERT FIGURE 2 ABOUT HERE]



 
Analysis and results

In this experiment we were interested in seeing whether search was quicker or more efficient for targets that were more salient, as would be predicted by the saliency map model. We first considered manual reaction time (RT), which is the typical measure of processing in visual search, before looking at the eye movements made in more detail. Accuracy was 87% on average and did not vary with different types of target (F(2,30) = 1.10, p = .35). As a result, incorrect trials were excluded and subsequent measures are based only on correct target-present trials, unless otherwise noted. The accuracy with which people were able to subsequently choose where the target had been was also high, and as this research is not concerned with localization this measure will not be analyzed further. The measures taken are summarized in Table 1. Means for each type of target were compared with repeated-measures ANOVA and pairwise comparisons.
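This comparison corresponds to a one-way repeated-measures ANOVA with target type as the within-subjects factor, for example as sketched below using statsmodels; the column names describe an assumed long-format data layout rather than the analysis script actually used.

```python
# Sketch of a one-way repeated-measures ANOVA comparing mean RT across the three
# target types; the column names are assumptions about the data layout.
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def rt_anova(df: pd.DataFrame):
    """df: one row per participant x target_type, holding the mean correct-trial RT."""
    result = AnovaRM(data=df, depvar="rt", subject="participant",
                     within=["target_type"]).fit()
    print(result)      # F and p values for the effect of target type
    return result
```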
 
Reaction Time
Participants took an average of approximately 2.5 seconds from the time the search image appeared until making their response. However, there was a reliable effect of target saliency on RT (F(2,30) = 10.7, p