Neural Networks 24 (2011) 665–677


2011 Special Issue

Modeling eye movements in visual agnosia with a saliency map approach: Bottom–up guidance or top–down strategy?

Tom Foulsham a,∗, Jason J.S. Barton a,b,c, Alan Kingstone a, Richard Dewhurst d, Geoffrey Underwood e

a Department of Psychology, University of British Columbia, Canada
b Department of Medicine (Neurology), University of British Columbia, Canada
c Department of Ophthalmology and Vision Sciences, University of British Columbia, Canada
d Humanistlaboratoriet, Lund University, Sweden
e School of Psychology, University of Nottingham, UK

∗ Corresponding address: Department of Psychology, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK. Tel.: +44 1206 874159. E-mail address: [email protected] (T. Foulsham). doi:10.1016/j.neunet.2011.01.004

Keywords: Visual attention; Neuropsychology; Visual saliency; Object recognition; Eye movements

Abstract

Two recent papers (Foulsham, Barton, Kingstone, Dewhurst, & Underwood, 2009; Mannan, Kennard, & Husain, 2009) report that neuropsychological patients with a profound object recognition problem (visual agnosic subjects) show differences from healthy observers in the way their eye movements are controlled when looking at images. The interpretation of these papers is that eye movements can be modeled as the selection of points on a saliency map, and that agnosic subjects show an increased reliance on visual saliency, i.e., on low-level stimulus features such as brightness and contrast. Here we review this approach and present new data from our own experiments with an agnosic patient that quantify the relationship between saliency and fixation location. In addition, we consider whether the perceptual difficulties of individual patients might be modeled by selectively weighting the different features involved in a saliency map. Our data indicate that saliency is not always a good predictor of fixation in agnosia: even for our agnosic subject, as for normal observers, the saliency–fixation relationship varied as a function of the task. This means that top–down processes still have a significant effect on the earliest stages of scanning in the setting of visual agnosia, indicating severe limitations for the saliency map model. Top–down, active strategies, which are the hallmark of our human visual system, play a vital role in eye movement control, whether we know what we are looking at or not.

1. Introduction

Visual agnosia is a rare neuropsychological impairment in which patients are unable to recognise objects (Farah, 1990). Thus these patients fail to identify real objects or line drawings of objects even though their performance on many other visual tasks is normal, and their ability to identify objects by sound or by touch is intact. Despite many case reports, and some elegant experiments looking at what visual agnosic subjects can and cannot see, much of what we know about this intriguing disorder is based on their perception of simplified drawings of objects in isolation. However, researchers have recently attempted to model the active visual processing of complex scenes and pictures by agnosic subjects as an extreme case of a neurobiologically inspired account of human visual attention: the saliency map model (Foulsham et al., 2009; Mannan et al., 2009). In this paper we consider how well the saliency map model predicts eye movements in complex images in visual agnosia, and what the implications of this might be for the brain mechanisms involved. In the first section we review research investigating visual agnosia, introduce the computational saliency map model, and consider recent attempts to investigate saliency-based vision in agnosic subjects. We then describe a new experiment, which shows that the relationship between saliency and fixation in visual agnosia may be less straightforward than previously described. In the last section we outline what this means for the control of eye movements in both the injured and the healthy brain.

1.1. Visual agnosia

Visual agnosia is an impairment in which the patient is unable to recognise visual objects, despite relatively preserved visual acuity, intact recognition through other sensory modalities, and intact semantic knowledge about objects (Farah, 1990; Humphreys & Riddoch, 1987; Riddoch & Humphreys, 2003). Thus it must be shown that the patient's failure to recognise objects is not due to poor visual acuity or other elementary visual problems; that it is
not just a disorder of naming (hence patients are still impaired when a non-verbal response is required, such as pantomiming the use of a tool shown to them); and that it is distinct from a lack of knowledge about objects in general (because patients can identify objects using different sensory modalities). A particularly striking clinical sign of the disorder is the ability of some of these patients to accurately copy line drawings in which they cannot recognise the depicted object (Rubens & Benson, 1971). Researchers have described a range of different impairments under the label of visual agnosia, including deficits in the perception of simple visual features (apperceptive agnosia); a failure to combine or organise features into coherent objects (integrative agnosia); and a disconnection between object percepts and their semantic representations (associative agnosia). It is thought that these deficits reflect a spectrum of impairments at the different stages of object recognition (Behrmann, 2003; Riddoch & Humphreys, 1987). Visual object agnosia can result from lesions to a range of areas, but in general is associated with brain damage that can be quite diffuse but is greater posteriorly, involving the occipital, temporal and parietal lobes.

The research on visual agnosia raises interesting questions about how these patients inspect a complex image, such as the scenes to which humans are exposed in daily life. There is some evidence that additional information such as colour improves the recognition performance of agnosic subjects. For example, they may be better at recognising real objects and photographs than line drawings (Behrmann, 2003). Some patients can also use contextual scene information to improve their recognition of objects (Riddoch & Humphreys, 1987). For these reasons it might be that agnosic subjects are less impaired at viewing realistic scenes than at identifying line drawings of isolated objects. This article investigates how patients with visual agnosia move their eyes when looking at images, and the next section reviews some of the research examining this in normal participants.

1.2. Top–down and bottom–up influences on eye movements in natural scenes

When people look at the world around them, their viewing is divided into a series of fixations directed at certain features of the scene. The locations of these fixations are important: they determine the parts of the visual field that will be examined with the highest spatial resolving power, and a failure to fixate a certain position may mean that important details are missed. The control of these eye movements, and of attention in general, is often attributed to one of two causal processes (Henderson, 2003). First, in top–down control, the eyes are moved according to some property of either the observer or the task that they are performing. For example, Yarbus (1967) showed that different people, or the same people doing different tasks, will produce different patterns of eye movements for the exact same visual stimulus. In laboratory tasks, top–down control is inferred when participants voluntarily orient towards a target that would otherwise be inconspicuous (Underwood, Foulsham, van Loon, Humphreys, & Bloyce, 2006), when eye movements change systematically according to the task (Henderson, Weeks, & Hollingworth, 1999), or when experts show a different pattern of scanning from novices (Underwood, Foulsham, & Humphrey, 2009).
In the brain, top–down control is associated with the modulation of visual areas and of the eye movement control system by higher-level brain regions (Bar et al., 2006; Treue, 2003). Second, where we fixate might be determined only by properties of the items being looked at. This "bottom–up" control is seen in "pop-out" effects in visual search, where conspicuous items are selected quickly and automatically (Treisman & Gelade, 1980; Wolfe, 2005). In complex scenes, there is also good evidence that certain types of visual features tend to be selected by eye movements. Fixated regions tend to have high luminance contrast and to contain edges and second-order features such as corners (Reinagel & Zador, 1999; Zetzsche, 2005). Moreover, the selection of these features at various levels of the visual system is relatively well understood. For example, retinal ganglion cells represent feature contrast via the opponent centre-surround structure of their receptive fields, and representations in visual cortex continue to emphasise discontinuities. Features that stand out from their background are pre-attentively processed and can have a rapid effect on where we attend and ultimately move our eyes.

While the psychological distinction between top–down and bottom–up attention has been discussed since James (1890), it is only recently that these processes have been modeled computationally, particularly in a way that can make useful predictions about complex stimuli such as scenes. Central to the new breed of computational model is the concept of a saliency map: a topographical map that compresses all the different information from the visual field into a series of peaks (Itti & Koch, 2001; Koch & Ullman, 1985). Crucially, this mechanism allows a single attentional focus to be simply controlled by combining many features (e.g. intensity, color, motion) into a single representation of saliency at each point in space. Saliency maps are common in many computational models of visual attention and eye guidance (Cutsuridis, 2009; Findlay & Walker, 1999; Itti & Koch, 2000), and as a control mechanism they could be used for combining both bottom–up and top–down inputs. However, we will concentrate on the Itti and Koch (2000, 2001) saliency map model because it is fully implemented and can be applied to any arbitrary image. This model's performance has also been extensively compared to that of healthy observers and, recently, of visual agnosic subjects.

1.3. The bottom–up saliency map as a model of eye movements

The Itti and Koch saliency model takes as its starting point a pixel-based image, similar to that which would fall on the retina (although the original version of the model did not simulate a fovea and thus had high resolution throughout the visual field). This image is then processed in a centre-surround manner by linear filtering in three feature channels: intensity, color (using red–green and yellow–blue opponency) and orientation (using oriented Gabor filters to simulate the impulse response functions of neurons in visual cortex). Additional channels coding motion and flicker have also been implemented for dynamic stimuli (Itti, 2005). Computationally, centre-surround processing and spatial competition are achieved by combining filter responses from the multiple spatial scales of a Gaussian pyramid (Burt & Adelson, 1983). The feature maps that result from this process emphasise areas of feature contrast: regions of a single color or of uniform brightness will be suppressed, while areas with features that stand out from their background are enhanced. This spatial competition for salience is further encouraged by simulating known "non-classical" long-range interactions, which cause a neuron's firing rate to be affected by activity far beyond the classical receptive field. Once the within-feature "conspicuity maps" have been computed, they are combined into a single saliency map that represents a good approximation of bottom–up visual saliency.
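To make this pipeline concrete, here is a minimal, illustrative sketch in Python (NumPy/SciPy). It is not the authors' implementation: it computes only intensity and colour-opponency channels over a small Gaussian pyramid, and it omits the orientation (Gabor) channel and the iterative normalization that the full Itti and Koch model applies.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(img, levels=6):
    """Repeatedly blur and downsample by 2 to get a multi-scale pyramid."""
    pyr = [img.astype(float)]
    for _ in range(1, levels):
        pyr.append(gaussian_filter(pyr[-1], sigma=1.0)[::2, ::2])
    return pyr

def center_surround(pyr, centers=(1, 2), deltas=(2, 3)):
    """Across-scale differences: a fine 'centre' level minus a coarse
    'surround' level, upsampled to full resolution and accumulated."""
    h, w = pyr[0].shape
    up = lambda m: zoom(m, (h / m.shape[0], w / m.shape[1]), order=1)
    fmap = np.zeros((h, w))
    for c in centers:
        for d in deltas:
            fmap += np.abs(up(pyr[c]) - up(pyr[c + d]))  # feature contrast
    return fmap

def toy_saliency_map(rgb):
    """Toy saliency: intensity plus red-green and blue-yellow opponency."""
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    channels = [(r + g + b) / 3.0, r - g, b - (r + g) / 2.0]
    conspicuity = [center_surround(gaussian_pyramid(ch)) for ch in channels]
    # Rescale each conspicuity map to [0, 1] so no channel dominates the sum
    norm = [(m - m.min()) / (np.ptp(m) + 1e-9) for m in conspicuity]
    return sum(norm) / len(norm)
```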
In this map, places in the original image that stand out from their surroundings are represented with higher values: they are more salient and should be more potent at attracting bottom–up attention. Itti and Koch propose a winner-take-all mechanism which will move fixation to the highest peak in the map, followed by transient inhibition of this location, ensuring that the eyes will move to the next most salient location and so on.
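The winner-take-all selection with inhibition of return just described can likewise be sketched in a few lines. The number of fixations and the inhibition radius below are arbitrary illustrative choices, and a real implementation would let inhibited activity recover over time rather than zeroing it permanently:

```python
import numpy as np

def scanpath(sal_map, n_fixations=5, ior_radius=32):
    """Generate fixations by repeatedly taking the highest peak in the
    saliency map (winner-take-all), then zeroing a disc around it (a crude
    stand-in for transient inhibition of return) so the next most salient
    location wins on the following step."""
    sal = sal_map.copy()
    ys, xs = np.mgrid[0:sal.shape[0], 0:sal.shape[1]]
    fixations = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        fixations.append((x, y))
        sal[(ys - y) ** 2 + (xs - x) ** 2 <= ior_radius ** 2] = 0.0
    return fixations
```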


A large number of studies have been carried out to see how effectively such a saliency map model predicts where people look in natural scenes. Parkhurst, Law, and Niebur (2002) compared the strength of activity in a saliency map (the model-predicted saliency) at the fixated locations of four subjects with that expected by a uniform fixation distribution. The resulting "chance-adjusted salience" was significantly positive, indicating a greater than chance probability of fixating regions with high salience. There were two other noteworthy findings. First, the chance-adjusted saliency was higher for artificial fractal images than for natural scenes, implying a greater correlation of saliency with fixations for artificial stimuli. Second, over time the relationship seemed to decrease, such that saliency was highest at the locations of the first few fixations and declined as viewing went on. Peters, Iyer, Itti, and Koch (2005) used a slightly different calculation to compare the eye movements of 12 participants with the activation in the saliency map. The authors normalised the saliency map for each image (so that the map values had a mean of zero and a standard deviation of one), and they then calculated the average of those values at fixated locations (what they call the "normalised scanpath salience"). Using this calculation, a system that selects random locations will result in a normalised scanpath salience of zero, whereas values greater than this would indicate that fixations were selecting points of high saliency. The standard saliency model resulted in a normalised scanpath salience that was significantly greater than zero for a range of complex images. A convenient way to express the success of the saliency map model in this experiment is as a percentage of the inter-observer relationship (i.e. as a proportion of how well the fixations of any one individual could be predicted by those of all other observers). Peters et al. report that the saliency map model explains around 50% of this variation. In contrast to these positive findings, though, a review of other studies examining the relationship between model-predicted saliency and fixation demonstrates that, in fact, visual saliency provides a rather poor account of viewing behaviour. The findings that count against the model can be divided into two main themes. First, and because of the mainly correlational nature of the foregoing analyses, several researchers have pointed out that some of the predictive power of the saliency map model may come about not because visual saliency causes fixation, but because of coincidental similarities in the distribution of saliency and fixations. For example, Tatler, Baddeley, and Gilchrist (2005) demonstrate that, because salient items tend to be in the centre of the image (e.g., because of a photographer bias), and because eye fixations are also biased to the centre, at least some of the predictive power of the saliency map model could be coincidental. This can also account for why Parkhurst et al. (2002) found a decreasing trend for fixations to correlate with saliency over time: viewing often becomes less centralised, and so saliency-at-fixation will decrease if only because of fewer fixations at the centre. Henderson, Brockmole, Castelhano, and Mack (2007) found that bottom–up saliency could not account for the sequence of eye movements made by observers viewing scenes, and suggested that although salient regions were fixated more often, these regions were also more meaningful (e.g. they were likely to be objects) than control patches. Foulsham and Underwood (2008) qualified some of the issues regarding the central bias, demonstrating that if estimates of chance took into account the image-independent biases in fixation placement, the model showed a significant (but weak) correspondence to spatial fixation patterns. Foulsham and Underwood also demonstrated that the saliency model could not predict the sequential order of fixations: the scanpaths that are to some extent reproduced within and across observers viewing the same image. The upshot of these and other demonstrations is that if bottom–up saliency can tell us anything about where people look, its effect in healthy observers is likely to be quite small.
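For concreteness, the normalised scanpath salience measure described above reduces to a few lines of code. This sketch assumes a 2-D saliency map array and a list of fixations as pixel coordinates:

```python
import numpy as np

def normalised_scanpath_salience(sal_map, fixations):
    """Peters et al. (2005)-style measure: z-score the saliency map (mean 0,
    SD 1), then average the z-values at the fixated pixels. Zero is chance;
    positive values mean fixations landed on relatively salient points."""
    z = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-9)
    return float(np.mean([z[int(y), int(x)] for x, y in fixations]))
```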


The second theme emerging from recent studies is that, where top–down guidance is available, it dominates or completely overrides bottom–up visual saliency. The list of such studies is long, so we will only mention a few at this point. In naturalistic search tasks, people are easily able to look straight to a target, seemingly regardless of its saliency or the saliency of distractor objects (Chen & Zelinsky, 2006; Einhauser, Rutishauser, & Koch, 2008; Foulsham & Underwood, 2007; Henderson, Malcolm, & Schandl, 2009). Eye movements in realistic, active tasks are made toward task-relevant objects at systematic times in the course of an action, again regardless of the visual saliency of these objects (Ballard & Hayhoe, 2009). The effects of visual saliency are also reduced or disappear entirely when in competition with people's tendency to look at eyes, faces and social stimuli (Birmingham, Bischof, & Kingstone, 2009), and when domain experts look at pictures relevant to their expertise (Underwood et al., 2009). These studies and others have triggered a move toward combining bottom–up saliency with top–down constraints within a saliency map framework, for example by modeling observers' representation of target identity (Navalpakkam & Itti, 2005; Zelinsky, 2008) or scene gist (Torralba, Oliva, Castelhano, & Henderson, 2006). Of course, even in their original proposal of the model, Itti and Koch were candid about the need to include top–down modulation, but it has not proven straightforward to determine experimentally exactly how the two inputs combine. It is within this landscape that the relationship between saliency and fixation in visual agnosia becomes particularly pertinent.

1.4. Saliency and fixation in visual agnosia

As we have seen, patients with visual agnosia are unable to discern the meaning of objects, despite normal early visual processing. In active scene viewing, healthy observers can avoid looking at salient locations in favour of fixating objects and regions that are meaningful in the context of the scene and the task. These propositions lead us to a straightforward prediction regarding the relationship between fixation and saliency in visual agnosia, and the rest of this paper will explore this hypothesis. Given that agnosic subjects suffer from impaired object recognition (and hence have difficulty in guiding top–down processes that depend upon meaning and scene interpretation), but should still possess brain mechanisms for computing feature contrast, they might be more biased towards bottom–up saliency than normal viewers. In other words, while healthy observers significantly modulate bottom–up saliency with top–down knowledge about a scene, its objects and their relevance to a task, perhaps agnosic subjects will more closely resemble a default, bottom–up saliency computation. Confirming this prediction would support some of the assumptions of the saliency map model (that bottom–up saliency is computed early and independently from object recognition, but overridden by top–down knowledge in healthy observers) and could also point to the brain regions that are necessary for bottom–up and top–down computations. Two studies, performed independently, have recently addressed this prediction. In Foulsham et al. (2009) we reported extensive analysis of the eye movements of a visual agnosic patient, CH, while she searched for real objects (pieces of fruit) in a series of photographs of natural scenes.
Fig. 1. Agnosic eye movements and saliency in a search of natural scenes. CH's fixations (yellow circles) and saccades (connecting lines) are shown for two scenes (top). In each case the target is a piece of fruit in the bottom right corner of the scene, and CH fails to fixate the target. Instead she often fixated bright or colorful items that coincided with peaks on a saliency map. Saliency maps, with bright points representing salient parts of the image, are shown for both scenes (bottom). Source: Data from Foulsham et al. (2009).

We measured several different aspects of her eye movement behaviour while she viewed these scenes, with the task being to decide whether there was a piece of fruit in the scene (see Fig. 1). There were several noteworthy findings. First, CH was impaired on the task, as expected given that she could not recognise the fruit targets. She both missed the target and reported finding it when it was absent on many more trials than controls, and she also took much longer to respond. Second, while the performance of healthy controls was associated with relatively
few fixations and often a long saccade to the target, the eye movements made by CH were quite different. She made many more fixations, and these had a longer average duration than normal observers. She also made much shorter saccades, both over the trial as a whole, and when only the first few eye movements in viewing were considered. Hence we found that a deficit in object recognition was associated with disordered eye movements. What of their relationship to visual saliency? The targets in our search task were not highly salient, and we evaluated the fixation of the target, as well as of the most salient region in the scene (according to Itti & Koch’s computational model). Control participants fixated the target on almost every trial and fixated the most salient region relatively rarely (young controls did so on 25% of trials, age-matched controls on 36%). In contrast, CH fixated the target on only 25% of trials and looked at the most salient region on 67% of trials, far more than controls did. Thus, not only was CH impaired at finding and fixating the target, her fixations were selectively biased towards the most salient region. To compare eye movements to the computational model more directly, we also calculated the saliency map value at each fixation (in a similar way to Parkhurst et al., 2002). Sure enough, CH’s fixations were targeted at locations corresponding to higher bottom–up saliency than control participants. Therefore, this experiment provided important evidence that the scanning fixations of patients with visual agnosia do indeed match the predictions of a saliency map more closely than those of healthy controls. The examples in Fig. 1 also serve to illustrate some of the different aspects of top–down control involved when searching efficiently within scenes. This top–down control might involve the enhanced recognition of contextually congruent objects (in an office scene a computer might be recognised more easily than a piece of fruit), as well as the prioritizing of a likely object (such as a tabletop; for examples of contextual effects on the fixation of objects see Henderson et al., 1999; Underwood & Foulsham, 2006). It is not known which of these processes are deficient in visual agnosia. Anecdotally, in the experiment reported in Foulsham et al. (2009), CH was sometimes able to ascertain the scene category

(she knew she was looking at an office) and she also searched appropriate locations at least some of the time (e.g. looking at the tabletop in Fig. 1, top right panel). Such contextual influences have been modeled for some search tasks (Torralba et al., 2006), and this is one way of quantifying the top–down signal available for search. Mannan et al. (2009) also investigated the fixations of patients with visual agnosia looking at complex images. Data from two patients were reported: SF, a woman with atrophy of posterior regions, and JJ, who had experienced posterior cerebral hemorrhages leading to damage of occipital, temporal and parietal regions. The researchers reported that the first locations looked at by these patients were similar to those fixated by healthy control participants. Interestingly, this similarity only occurred at the beginning of scene viewing: as time progressed, the two groups diverged. Mannan et al. attributed this divergence over time to the fact that the fixations of patients were determined by bottom–up saliency for the entire viewing period, whereas in controls the initial impact of stimulus saliency was supplanted by top–down oculomotor control processes later on in the viewing sequence. These results are interesting for several reasons. On the one hand, the findings of Mannan et al. (2009) agree with our results from Foulsham et al. (2009) in that they suggest pure bottom–up saliency might be a better model for agnosic eye movements than for the scanpaths of healthy subjects. On the other hand, in our task the patient's eye movements were quite different from controls, while those in Mannan et al. (2009) were similar between the agnosic and healthy subjects. This could be related to stimulus and task differences: while we asked subjects to find an object, their task merely asked subjects to describe the scenes. As we have seen in normal participants, the predictive power of the saliency map model varies according to whether participants are searching for an object or merely trying to examine the details of a scene (Foulsham & Underwood, 2007). However, a closer inspection of Mannan et al. (2009) reveals that the evidence that the fixations of agnosic subjects are being driven by bottom–up saliency in the view-and-describe task is actually rather minimal. The main result—that normal and agnosic
eye movements are similar early in viewing but diverge after the first few fixations—is clear, but explicit comparisons of eye movements and saliency are limited. What one needs to show is that the saliency model accurately predicts everyone's fixations at the start, but only those of the agnosic subjects for the whole viewing period. Mannan et al. compared the fixation patterns of healthy and agnosic observers with the ones predicted by the saliency map model, using their linear distance metric. They report that the locations of fixations made by healthy observers, particularly the initial fixation, were quite close to the saliency model-predicted fixation. In addition, they report a null effect, namely that the "initial fixation locations chosen by patients were not significantly different from those predicted for controls by this model" (Mannan et al., 2009, p. 248). The interpretation made was that, because healthy controls and agnosic subjects made similar fixations early in viewing, and then diverged, and because these early fixations made by control participants were reliably similar to saliency map predictions, agnosic viewing must be tied to saliency throughout, and it must be healthy viewing patterns that are diverging from bottom–up control. Unfortunately, the support for this is not compelling. The authors report a single t statistic as evidence that the first fixation for healthy individuals was significantly similar to the saliency model. However, as has been shown by work elsewhere (e.g. Foulsham & Underwood, 2008; Tatler et al., 2005), it is not clear what the baseline similarity here should be, or how close a fixation needs to be to be considered "significantly" related to a model-predicted salient location. Furthermore, there is no equivalent statistical test reported showing that the patients' behaviour, even their very first fixation, was similar to the saliency model. Mannan et al.'s Fig. 1H also contradicts the pattern of data reported in their text, suggesting that agnosic fixations are less similar to saliency than controls'. Thus, if saliency provides an account of anyone's fixation data, it appears that it provides an account of the first fixation of the healthy observers, and no account of the patients' fixations. After the first fixation, saliency seems to provide a progressively worsening account of everyone's data. To summarise, the evidence that agnosic eye movements are better modeled by a bottom–up saliency map than healthy observers' fixations is therefore mixed. Furthermore, research in normal participants urges caution regarding whether saliency actually causes fixation or whether it merely co-occurs with other features of the stimulus such as objects (i.e. is spatially similar, which is what Mannan et al. tested). To investigate this further, the following section describes an additional experiment conducted with our patient CH, who is selectively impaired at recognizing objects, performing an unconstrained "view-and-remember" task (see Foulsham & Underwood, 2008, for full details of the task). CH and two groups of control participants viewed a large set of images for a short period, and they then saw them again, mixed with novel images, with the task being to identify which images they had seen before. By looking in detail at the relationship between saliency and fixation in a task different from Foulsham et al.
(2009), and with no defined target, we can check whether the saliency map model can actually predict agnosic eye movements. This task also provides a test of CH's old–new recognition performance for natural scenes, and her eye movement behaviour is assessed to see whether the patterns observed in search (such as longer fixations and shorter saccades) hold for a different task. In order to explore the impact of image meaning on eye movements in this task, the same procedure was followed with computer-generated patterns (fractals) as stimuli. Several authors have used these stimuli because they have similar spatial frequency content to realistic images, and some have argued that saliency is a better predictor of eye movements in fractals than in realistic scenes, presumably because in scenes people are more
likely to use their top–down knowledge about the layout of the environment to distribute their attention (Parkhurst et al., 2002). In support, we have also recently demonstrated some differences in the eye movement biases seen in viewing scenes and fractals (Foulsham & Kingstone, 2010). With this research in mind, several predictions are possible. First, if scene semantics are beneficial for encoding and recognising images, normal observers should be better at recognising scenes than fractals. CH, on the other hand, might be impaired (relative to controls) on scenes, but less so on fractals, as accessing semantic details from visual features is less important for fractals than for scenes. In terms of saliency, Parkhurst et al. (2002) would predict a higher correlation between saliency and fixation position with fractals than with scenes. If CH is more susceptible to bottom–up guidance than controls, then this correlation should be similar for her in both classes of stimuli, and greater than that of controls, particularly for scenes, and possibly for fractals.

2. Eye movements during scene encoding in visual agnosia

2.1. Method

2.1.1. Case description

CH is a right-handed woman who was 63 years old at the time of testing and had reported slowly progressive visual difficulties, including difficulty with reading and with recognizing the faces of friends. She could still write. She had problems finding objects around the house and later had trouble navigating. She reported mostly normal colour vision, but confused navy with black. She had full visual fields (confirmed by Goldmann perimetry) and good visual acuity (20/25 for single letters), and made normal saccades and pursuit eye movements. Further neuropsychological evaluation showed that her general knowledge, vocabulary, abstract verbal reasoning, comprehension and writing from dictation were all normal. Her reading was extremely impaired, even for single words, and she deciphered words one letter at a time. Tests showed average performance in auditory verbal learning and memory, and a digit span of 6 forwards and 4 backwards. Testing of her visual abilities revealed randomly distributed errors on line cancellation and search tasks, but no systematic errors in line bisection. There was no evidence of systematic neglect of one side of space, and CH was able to make eye movements to all areas of the screen (e.g. when calibrating the eye tracker). Performance on the Boston Naming Test was very poor, and line drawings were often unidentified or incorrectly named. In all cases object recognition was slow and required a lot of effort. When naming line drawings she frequently failed or misidentified objects, often focusing on small details (key = "see the O", chair = "grass... something to sit on"). She correctly named a square and a circle, but had difficulty naming three-dimensional shapes (cylinder = "cone", cube = "all these angles, octagon", pyramid = "triangles in it, a tent"). She interpreted the Cookie Theft picture as "a woman washing dishes, something spilling, it must be a restaurant". CH scored 20% on the Benton line orientation test (in the severely deficient range) but she performed normally on a test of curvature discrimination (Barton, Cherkasova, Press, Intriligator, & O'Connor, 2004). She was only 56% accurate at judging the spatial configuration of 2-item dot patterns and 22% accurate with 4 dots (chance = 33% correct; Barton, 2009). On a test requiring subjects to judge whether a triangle was symmetric, on which controls are 100% accurate, CH's performance was at chance (56% correct). A cerebral perfusion scan with technetium injection at 4 years after onset showed marked hypoperfusion of the posterior parietal and temporal lobes. A CT scan 6 years after onset of
symptoms revealed some general sulcal prominence with occipital predominance of enlargement of the lateral ventricles, consistent with a diagnosis of posterior cortical atrophy (Benson, Davis, & Snyder, 1988). CH had previously taken part in another eye-tracking study (Foulsham et al., 2009) and she was rested and alert at the time of testing.

2.1.2. Control participants

Data for healthy young participants performing the encoding and recognition task with natural scenes have previously been reported (Foulsham & Underwood, 2008). There were 21 of these younger controls, and an additional 11 young controls were recruited to perform the task with computer-generated fractal stimuli. Young controls had an age range of 19–25. Two groups of healthy age-matched controls also took part: 5 (3 females) who viewed scenes and 4 (2 females) who viewed fractals. Age-matched controls ranged in age from 58 to 76, with a mean age of 65. All control participants had normal or corrected-to-normal vision. A questionnaire confirmed that none of the age-matched controls reported problems with scene or object recognition, and all had regular examinations by an optician.

2.1.3. Stimuli, design and apparatus

Two groups of 90 colour images were used. Half the stimuli were natural scenes, as used in Foulsham and Underwood (2008) and sourced from a commercially available CD-ROM collection. The scenes showed a selection of outdoor scenes (buildings, cityscapes, park scenes) and interiors (e.g. a kitchen). The remaining stimuli were computer-generated fractals from the Spanky fractal database (http://spanky.triumf.ca/). Fig. 2 shows examples of both types of stimuli. All images were presented at 1024 × 768 pixels, which subtended a visual angle of 34° × 27° with our apparatus. The two types of stimuli were presented in separate sessions. In each case, 45 of the stimuli were presented at encoding and then re-presented at recognition, interleaved with the remaining 45 stimuli. Eye movements were recorded using the head-mounted Eyelink II system (for controls) and the desk-mounted Eyelink 1000 (for CH). Both systems are video-based eye trackers that record point-of-regard from the pupil image; the latter system was used because it was more comfortable for the patient. The Eyelink II recorded at 500 Hz, while the Eyelink 1000 recorded at 1000 Hz. The participants used a chin rest, ensuring a constant viewing distance of 60 cm from the screen, and they were instructed not to move their head during tracking. Our analyses are based on fixations and saccades, which were parsed by the eye-tracking system using velocity and acceleration thresholds (30°/s and 8000°/s²).
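The tracker's parsing stage can be approximated with a simple velocity-threshold detector. The sketch below is an illustration, not the Eyelink algorithm: it applies only the 30°/s velocity criterion (the 8000°/s² acceleration criterion is omitted) to gaze samples assumed to be NumPy arrays already converted to degrees, with timestamps in milliseconds.

```python
import numpy as np

def parse_fixations(x_deg, y_deg, t_ms, vel_thresh=30.0):
    """Label a sample as saccadic when eye velocity exceeds vel_thresh
    (deg/s); each maximal run of sub-threshold samples is a fixation,
    returned as (start time, end time, mean x, mean y)."""
    dt = np.diff(t_ms) / 1000.0                      # seconds between samples
    vel = np.hypot(np.diff(x_deg), np.diff(y_deg)) / dt
    is_saccade = np.concatenate([[False], vel > vel_thresh])
    fixations, start = [], None
    for i, sac in enumerate(is_saccade):
        if not sac and start is None:
            start = i
        elif sac and start is not None:
            fixations.append((t_ms[start], t_ms[i - 1],
                              x_deg[start:i].mean(), y_deg[start:i].mean()))
            start = None
    if start is not None:
        fixations.append((t_ms[start], t_ms[-1],
                          x_deg[start:].mean(), y_deg[start:].mean()))
    return fixations
```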
2.1.4. Procedure

At the start of the experiment the eye tracker was calibrated to a high level of accuracy, and this was repeated where necessary. The experimental procedure was the same for both scenes and fractals. During the encoding phase, 45 images were presented in a random order and the participant was instructed to memorise them. Each picture was displayed for 3 s and preceded by a central drift-correct marker. Following encoding, 90 images (half from the encoding phase, the remainder novel pictures) were randomly presented, again for 3 s (following a drift-correct marker). This was followed by a screen prompting the participant for a response, and this procedure ensured that presentation in both parts was equivalent. At this prompt participants were asked to verbally identify the stimulus as old or new, a response the experimenter logged using the keyboard. A short practice session of 18 pictures
(6 to be encoded followed by 12 at test) was presented before both the scene and the fractal experiments, and these pictures were not presented again in the actual experiment. Patient CH performed the scene condition first, followed by the fractal condition; she was rested and motivated and was given a break between the two conditions. In the control groups, the factor of stimulus type was manipulated between groups.

2.1.5. Modeling

We used a version of Itti and Koch's (2000) saliency map model to predict the scene regions which should attract attention and eye fixations. Specifically, we used a version of the model compiled from source code available at http://ilab.usc and downloaded in May 2004. For further modeling (see Section 2.3.1) we used the Saliency Toolbox in MATLAB (Walther & Koch, 2006). In each case, the standard parameters were used for the model: images were filtered at 8 spatial scales, the focus of attention (over which inhibition of return occurs) was 1/16th of the width of the image, and iterative normalization processes functioned to enhance the saliency of a few peaks (see Itti & Koch, 2000, for more parameter details). Fig. 3 shows the saliency computation for one of the images in the experiment.

2.2. Results

The analysis focused on quantifying CH's behaviour, relative to the control groups, on both scenes and fractals. Hits, false alarms, and overall sensitivity are given as measures of task performance. Following this, global eye movement behaviour was assessed to see if CH's eye movements were different from those made by controls. In terms of saliency, we measured how likely CH was to land on highly salient regions, and we computed the normalised saliency at fixation. We compared CH to the distribution of control participants (both young controls and age-matched controls) using a z-score.

2.2.1. Recognition performance

CH reported that she found the viewing duration very brief but that she did recognise some scenes when she saw them at test. She occasionally made general verbal reports of the contents of the pictures (e.g. "this is outdoors", "this is a house", "a building... looks like a school"), although these were not always correct. The proportions of hits and false alarms and a d′ measure of sensitivity for all groups are shown in Table 1. The control groups were relatively good at recognizing items, as shown by a large proportion of hits, few false alarms and a relatively high d′. In scenes, CH made fewer hits than controls (young controls, z = 2.70; age-matched controls, z = 2.43; both p < 0.01), more false alarms (young controls, z = 7.67; age-matched controls, z = 13.86; both p < 0.00001) and had a d′ of zero, indicating no discriminative power (significantly different from young controls, z = 4.18; age-matched controls, z = 4.16, both p < 0.0001). In fractals, CH made fewer hits than young controls (z = 1.96; p < 0.05) but was within the normal range for this group on false alarm rate (z = 0.048) and d′ (z = 1.29). When compared to the age-matched controls, CH made around the same number of hits (z = 0.40) but many more false alarms (z = 3.17, p < 0.001) and with a lower sensitivity (z = 4.41, p < 0.0001). Thus CH was impaired on the recognition task, but less so on the fractals. There were only slight differences between the control groups, and given the small number of age-matched controls these were not analysed further. Instead, the control groups were combined to explore the effect of stimulus type on overall sensitivity.
A one-way ANOVA showed a significant effect, F (1, 39) = 28.6, MSE = 0.32, p < 0.0001; control subjects’ performance was better for scenes than fractals. CH showed no such benefit and in fact was slightly better for fractals.
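For reference, the two statistics used throughout this section are easy to reproduce. The sketch below is a minimal illustration (the rate clipping is our own choice, not taken from the paper); with CH's scene data, d_prime(0.58, 0.58) returns the d′ of zero reported above. For very small control groups, a Crawford-style modified t test would be a more conservative alternative to the simple z.

```python
import numpy as np
from scipy.stats import norm

def d_prime(hit_rate, fa_rate, eps=1e-3):
    """Sensitivity d' = z(hits) - z(false alarms), with rates clipped away
    from 0 and 1 so the inverse-normal transform stays finite."""
    h = min(max(hit_rate, eps), 1 - eps)
    f = min(max(fa_rate, eps), 1 - eps)
    return norm.ppf(h) - norm.ppf(f)

def case_vs_controls(case_score, control_scores):
    """z for a single case against the control distribution, as in the text."""
    controls = np.asarray(control_scores, dtype=float)
    z = (case_score - controls.mean()) / controls.std(ddof=1)
    return z, norm.sf(abs(z))  # one-tailed p under a normal approximation

print(d_prime(0.58, 0.58))  # CH, scenes: 0.0
```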


Fig. 2. Natural scenes (top) and fractals (bottom) that were used in the study.

Fig. 3. Generating a saliency map using the Itti and Koch (2000) model. A complex image (top row) is filtered in three feature channels (colour, intensity and orientation; middle row), which are then combined to give an overall saliency map (bottom row). Regions of high conspicuity are shown as brighter points on the saliency map.


Table 1
Recognition accuracy for CH and the control groups. Control data show means with standard deviations in parentheses.

                  Scenes                                                Fractals
                  CH          Young controls   Age-matched controls    CH          Young controls   Age-matched controls
Hits              58%*,**     82% (9%)         84% (11%)               58%*        82% (12%)        61% (8%)
False alarms      58%*,**     10% (6%)         16% (3%)                44%**       43% (20%)        13% (10%)
d′                0.00*,**    2.33 (0.56)      2.03 (0.49)             0.34**      1.20 (0.67)      1.52 (0.27)

* CH different from young controls at p < 0.05.
** CH different from age-matched controls at p < 0.05.

Table 2
Global eye movement measures for all participants viewing scenes and fractals at encoding. Means with standard deviations in parentheses.

                                 Scenes                                                   Fractals
                                 CH               Young controls   Age-matched controls  CH               Young controls   Age-matched controls
Number of fixations per trial    10.4 (1.5)       10.6 (1.4)       10.7 (0.6)            10.3 (1.5)       9.4 (1.41)       10.0 (2.1)
Mean fixation duration (ms)      248 (52)         278 (57)         247 (31)              231 (51)         314 (77)         303 (115)
Mean saccadic amplitude (°)      3.50 (0.97)*,**  6.2 (0.54)       6.47 (1.03)           2.64 (1.05)*,**  7.8 (1.71)       5.92 (1.81)

* CH different from young controls at p < 0.05.
** CH different from age-matched controls at p < 0.05.

2.2.2. Global eye movement measures

Due to differences in responses between participants at test, we confined our analysis of eye movements to the fixations and saccades made during encoding, when all participants were performing the same task. The number of fixations, the average fixation duration and the mean saccadic amplitude were computed for each trial, and these are summarised in Table 2. With both fractals and scenes, CH did not differ from young or age-matched controls in number of fixations or fixation duration. The most striking observation, which is consistent with that seen in Foulsham et al. (2009), is that CH made far smaller saccades than the control subjects, for both scenes and fractals (all zs > 1.8, p < 0.05).

2.2.3. Fixation of salient regions

The remaining analysis concentrated on comparing the fixations of CH and the control groups to the predictions of the saliency map model. Several methods have previously been used to compare scanpaths, both between individual observers, and between experimental and model-predicted data. For example, Foulsham and Underwood (2008) compared the scanpath similarity within and between individuals using a string-edit distance comparison (which is particularly sensitive to the temporal sequence of fixations) and a linear distance metric (from Mannan, Ruddock, & Wooding, 1995, which is more sensitive to spatial proximity). Mannan et al. (2009) also used a modified version of the linear distance metric to document similarity between agnosic and normal eye movements, but our aim here is to clarify how these eye movements are related to the saliency map. Le Meur, Le Callet, and Barba (2007) discuss four methods for quantifying this relationship using fixated locations and fixation density maps, and they show similar results for all four. We opted to use two straightforward measures that have been used in the previous literature and that require few extra-model assumptions: the proportion of fixations within the most salient image regions and the normalised saliency at fixation (Foulsham & Underwood, 2008; Foulsham et al., 2009; Henderson et al., 2007; Peters et al., 2005). Other measures would likely produce very similar results, and thus these methods will be sufficient for determining whether the model provides a better fit to agnosic than control data.

We first calculated the proportion of fixations that landed in one of the five most salient regions of the image. These were defined as areas of radius 1° around the first five points selected by the saliency model, and by design we only presented images where these regions did not overlap. Together, all five regions comprised 10.4% of the image area, and thus a completely random fixation-generation process would select these regions about this fraction of the time. In scenes, an average of 18.6% (SD = 3.5%) of fixations in young controls and 18.0% (SD = 2.3%) of fixations in age-matched controls were located in one of the five most salient regions. This was lower for CH (M = 14.9%), although still within the normal range (z = 1.1, p = 0.15 and z = 1.4, p = 0.08 when compared to young controls and age-matched controls respectively). In fractals, control participants spent a similar proportion of fixations in the five most salient areas (young controls, M = 17.4%, SD = 2.7%; age-matched controls, M = 17.6%, SD = 0.7%). CH had fewer fixations in salient areas (M = 14.3%; vs. young controls, z = 1.13, p = 0.13; vs. age-matched controls, z = 4.6, p < 0.0001). An independent-samples t test compared the proportion of control participants' fixations on salient regions in scenes and in fractals and found no significant difference (t(39) = 1.1, p = 0.3). These findings therefore do not support the contention that fixations in agnosia are driven by saliency to a greater degree than those in healthy viewing.
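The first of these measures might be computed along the following lines. This is a sketch under the assumption that the model's first five attended points and the fixations are available as (x, y) pixel coordinates, with the 1° radius converted to pixels for the viewing geometry used here:

```python
import numpy as np

def prop_fixations_on_salient(fixations, salient_peaks, radius_px):
    """Proportion of fixations landing within radius_px of any of the
    first five points selected by the saliency model."""
    peaks = np.asarray(salient_peaks, dtype=float)  # shape (5, 2), (x, y)
    hits = sum(
        1 for fx, fy in fixations
        if np.hypot(peaks[:, 0] - fx, peaks[:, 1] - fy).min() <= radius_px
    )
    return hits / len(fixations)
```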


2.2.4. Saliency at fixation

The previous analysis looked only at the most salient regions, and as we have seen this included only a small proportion of all the fixations made. An alternative way to look at the relationship with the model is to calculate the saliency map value at the location of each fixation. Following Foulsham et al. (2009) and Peters et al. (2005), we took the normalised saliency at fixation and averaged across all fixations following the first saccade, in both the scenes and the fractals. Specifically, we normalised the saliency map of each image so that map values had a mean of 0 and a standard deviation of 1. We then extracted the normalised saliency at each fixated location, excluding the very first fixation, which was necessarily in the centre of the image. This method is desirable as it takes any variability in the saliency distribution of different images into account, and it gives a scale-free parameter that reflects the association between fixation locations and peaks in the saliency map. A value of zero would indicate no relationship between fixations and saliency, whilst a value of +1 would suggest that fixated locations were 1 standard deviation above the average saliency for that image. If participants were selectively fixating non-salient regions, a negative value would be expected.

The mean normalised saliency at fixation for CH and the control groups is shown in Fig. 4 (top). In contrast to the results from Foulsham et al. (2009), and those implied by Mannan et al. (2009), CH actually showed lower saliency at fixation than the control groups. She was outside the normal range when compared to both young controls (scenes, z = 3.3, p < 0.001; fractals, z = 3.3, p < 0.001) and age-matched controls (scenes, z = 1.7, p = 0.04; fractals, z = 2.1, p < 0.05). This analysis confirms, therefore, that in this task agnosia actually led to less guidance by saliency, and this was equally true in fractals, despite their lack of easily attributable meaning. Combining the control groups, fixations in fractals showed a reliably higher normalised saliency (t(38) = 3.2, p < 0.005). This is consistent with what was observed by Parkhurst et al. (2002).

Mannan et al. (2009) focus their analysis on the change in similarity between patients and controls over multiple fixations, and they show a general decline in the relationship between saliency and eye movements over time. In their study, observer and model-predicted scanpaths were most similar on the first saccade, and became less similar as time progressed. We computed the normalised saliency at the locations targeted by each of the first 8 saccades (there were too few trials with more saccades than this to include eye movements beyond these first 8). The results are shown in Fig. 4 (bottom), collapsed across scenes and fractals, because the pattern was very similar in both types of stimuli. In healthy young controls, normalised saliency was highest on the first saccade (i.e. the first free fixation on the image). However, the slight decrease on subsequent saccades was not reliable (no within-subjects effect of saccade number, F(7, 210) = 1.2, p = 0.26). In age-matched controls there was no consistent decrease and no reliable effect of saccade number (F(7, 56) < 1). The normalised saliency at locations fixated by CH was consistently lower than that at controls' fixation locations, and it approached chance (zero) on the second and fifth saccades. Interestingly, the normalised saliency in CH is not highest on the initial saccade but on later eye movements (particularly the 4th and 8th saccades, the only times when CH is above the level of controls). This is perhaps because CH, unlike young controls, returned to look at salient items later in viewing.

Fig. 4. Top: the mean normalised saliency at fixation, for different types of stimuli in the recognition task. Data points for the control groups (YCs = young controls, AMCs = age-matched controls) show the mean (with standard error bars) across participants. Bottom: the normalised saliency as a function of saccade number, averaged across both scenes and fractals.

2.3. Modeling the degradation of specific features with a saliency map approach

The structure of the saliency map model, and its inspiration in the hierarchical organization of the visual system, raises an interesting additional possibility: if some agnosias arise from a selective impairment in a particular type of feature, then their attentional behaviour might be better accounted for by adjusting the weights of the different features in the model. Simulating the disruption of different cognitive abilities by "lesioning" connectionist models has proven a useful tool in neuropsychology and artificial intelligence (see Olson & Humphreys, 1997, for a review), and here we sought similar insights by selectively tuning the saliency model based on observed behavioural deficits. As a proof of principle, we reanalyzed CH's fixations on scenes during the view-and-remember task, varying the contribution of the three different feature channels that went into calculating visual saliency. Would some feature combinations perform better than others? In the case of CH, although her low-level vision was good, she was more impaired in orientation perception than in the perception of colour or luminance. Therefore, her case description led us to predict that a model with a reduced contribution of oriented edges would provide a better account of her data.

2.3.1. Modeling

Examples of the different model variants tested are shown in Fig. 5 for the same example stimulus as Fig. 3. We generated saliency maps for each model variation, and calculated the normalised saliency at fixation in the same way as previously, for fixations in scenes only. Six different model variants were tested, in addition to the standard, three-feature model already presented. These models represent all possible combinations of the individual features. First, we tested three models with each feature in isolation and all others weighted to zero: colour only, orientation only and intensity only. Then, we tested each combination of two features (weighted equally), with the remaining feature excluded: colour + intensity (i.e. with orientation weighted to zero), orientation + intensity and colour + orientation. Here, we consider only these six cases, although it should be noted that an infinite number of different weightings could be chosen according to hypotheses about the relative importance of different features (e.g. colour might be weighted twice as strongly as the other features). In each case, conspicuity maps are computed for each feature and then weighted accordingly. These maps are then summed, with the iterative normalization algorithm described by Itti and Koch (2000), which has the result of suppressing points of weak activation and emphasizing a few salient peaks, in whichever feature channels are available.
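In code, this reweighting amounts to a weighted sum over the per-feature conspicuity maps. The sketch below is a simplified illustration rather than the Saliency Toolbox procedure (it omits the iterative normalization step), with channel names and weights chosen to mirror the variants just described:

```python
def weighted_saliency(conspicuity, weights):
    """Combine per-feature conspicuity maps under explicit channel weights.
    `conspicuity` maps a channel name to a 2-D array; a zero weight removes
    that channel, as in the 'lesioned' model variants."""
    total = sum(weights.values())
    return sum(w * conspicuity[name] for name, w in weights.items()) / total

# e.g. the colour + intensity variant (orientation "lesioned"):
# sal = weighted_saliency(maps, {"colour": 1, "intensity": 1, "orientation": 0})
```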


Fig. 5. Saliency maps generated from one of the scenes in the experiment, using six model variants comprising different combinations of features: colour only, orientation only, intensity only, colour + intensity, orientation + intensity, and colour + orientation. Note that these final saliency maps have been through several iterations of between- and within-feature competition, making a few salient peaks stand out. Some objects (such as the red flowers in the bottom left of the image) are salient in terms of some features (in this case colour) but not others.

2.3.2. Results

The normalised saliency at the locations of CH's fixations is plotted in Fig. 6 for each of the model variants, alongside the same data for the control groups. As before, a larger positive normalised saliency indicates a better fit between fixation and saliency. We tested model performance by comparing the normalised saliency at fixation to zero, using one-sample t tests across participants (for control groups) and across individual trials (for CH). Looking first at the data from control participants, there is relatively little difference in the performance of the different model variants. In young controls, the best models were those that used colour information, although all models were reliably above zero (one-sample t tests, all p < 0.0001). This was also the case in age-matched participants (all models greater than zero, p < 0.05). However, model predictions for CH's data were more variable. The models with the highest normalised saliency at agnosic fixation locations were the colour-only model and the colour + intensity model, and these variants outperformed the standard, all-features model. Moreover, the models with the worst performance were those that included orientation features (particularly orientation-only and orientation + intensity). Comparing the results from CH's fixations across all 45 images from encoding, the normalised saliency at fixation was greater than zero only in those models without orientation features (colour-only and colour + intensity, both p < 0.05). It is also interesting that these models, and the intensity-only model, are the only variants that result in equivalent model performance for CH and the control groups.
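The per-variant test reported here is a standard one-sample t test against zero; for illustration, with made-up values standing in for CH's 45 encoding trials:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical normalised-saliency-at-fixation values for 45 trials
nss_per_trial = np.random.default_rng(0).normal(0.15, 0.4, size=45)
t, p = ttest_1samp(nss_per_trial, popmean=0.0)  # is NSS above chance (0)?
```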

These analyses, therefore, confirm our prediction that CH, who was particularly impaired on a neuropsychological test of orientation perception, made fixations that were less correlated with orientation saliency than with other features such as colour. Furthermore, they suggest that investigating the weighting of different features in the saliency map could be a useful tool in exploring attention in agnosia and other disorders of vision.

3. General discussion

This experiment built on the findings from Foulsham et al. (2009) and Mannan et al. (2009) and investigated the eye movements of a visual agnosic patient while viewing complex images. We asked whether CH could recognise scenes and fractals, whether her allocation of fixations was anomalous when encoding these images, and whether these fixations could be modeled by a saliency map approach. There were several interesting findings.

Despite performing normally on pretests of short- and long-term memory, CH showed much poorer visual recognition than control subjects. Her inability to identify the objects in the natural scenes may have made it much more difficult to encode and later recognise these scenes. The control subjects were better at recognizing previously seen pictures when they were meaningful scenes than when they were fractals. This might be because they could encode the scenes more efficiently by interpreting what they could see. As predicted, CH showed no such benefit and in fact performed slightly better at recognizing meaningless fractals.


Fig. 6. Normalised saliency at fixation for the different model variants. Data are from all fixations in scenes made by CH (left) and the control groups (bars show the mean with standard error across participants; YCs = young controls, AMCs = age-matched controls). Data from the standard model with all features weighted equally (see Fig. 4) are included for comparison.

Agnosic oculomotor behaviour was characterised by much shorter saccades than those of healthy observers. This is consistent with what we found in a search task (Foulsham et al., 2009). CH’s eye movements to simple targets (such as the eye tracking calibration markers) were normal, confirming that she could make large saccades if necessary. Instead, we interpret her shorter saccades as a reflection of her inability to recognise items in the scene quickly, and a tendency instead to re-fixate several times within an object.

3.1. Is bottom–up saliency a good model for scanning in visual agnosia?

Based on previous research, we predicted that fixations in visual agnosia would more closely match a saliency map than those in healthy viewing, particularly in fractals, which have no conventional semantic interpretation. The rest of this discussion will consider whether the saliency map model is indeed a good model for fixations in agnosia. In Foulsham et al. (2009), we found that CH was biased towards salient objects that were not the target during a search task. In Mannan et al. (2009), the scanpaths of two patients during a visual inspection task were similar to those of controls, but this similarity decreased over time, which may have been because normal observers gradually shifted from fixating bottom–up salient features towards inspecting meaningful scene elements.

In the present study, the saliency model did seem to tell us something about where people were likely to fixate. For example, 15%–20% of all participants’ fixations landed on the five most salient regions, which comprised only 10% of the image area. The normalised saliency at fixation was positive, demonstrating that people fixated points of higher-than-average saliency, according to the model. However, the values from these analyses show that the correlation between saliency and fixation was actually rather small (see also Foulsham & Underwood, 2008, for further investigations of saliency in this task). The normalised saliency values we found are at the lower end of those reported by Peters et al. (2005). Moreover, in contrast to previous results, in the present study CH actually looked at salient regions less than control participants, a result that we confirmed with analysis of the normalised saliency at fixated locations. This finding argues against the proposition that a bottom–up saliency map is a good model for scanning in visual agnosia.

There are several possible explanations for the discrepancy between CH’s fixations in search (where she tended to look at salient items; Foulsham et al., 2009) and those in the view-and-remember task used here (where she looked at salient items less than control participants). One possibility is that agnosics truly are more driven by bottom–up features, but that the Itti and Koch (2000) model needs to be modified in order to predict this difference. To this end, it will be important to measure bottom–up attention in different ways within the same patients in order to provide converging evidence beyond a single model. It is possible that other bottom–up models, perhaps those emphasizing different features, might account for agnosic behaviour better (see, for example, the model by Bruce & Tsotsos, 2009). On the other hand, in additional modeling we found that even when some features were excluded, the correlation between CH’s fixations and saliency did not exceed that of control participants. Another possibility concerns the task: the present study required images to be scanned for a relatively short and fixed period of 3 s, whereas in the search task viewing was self-paced and continued until a decision was made about the target. This difference in viewing time may have encouraged a change in scanning by CH, although Mannan et al. (2009) also used a trial duration of 3 s. Saliency at fixation was greater in fractals for all participants, but even in these images, where there was no systematic structure or meaning, CH looked at salient regions less often than controls. It is possible that our participants did in fact extract meaning from these random patterns, and further testing with other control stimuli, such as white noise patterns, would be helpful. Where such patterns varied only in features that are accurately perceived in agnosia, we predict less of a discrepancy between healthy and agnosic scanning.

We also analysed trends in the relationship between saliency and fixation over multiple fixations. The saliency map model predicts that viewers will fixate locations in order of decreasing saliency, and therefore normalised saliency should be highest on the initial saccade and decrease on subsequent eye movements. We did not find a significant decrease in either healthy or agnosic observers, which is further evidence that the standard saliency model does not capture scanning in this task very well. Mannan et al. (2009) also suggest that the similarity between observed and model-predicted scanpaths decreases over time. However, the linear-distance metric used in that study may not have been able to capture the relation of a dynamic scanpath to the underlying saliency map representation. For example, if agnosic patients keep refixating the same salient region (which is possible given CH’s small saccades), the linear-distance measure will compare these fixations with points later in the model-predicted scanpath, resulting in decreasing similarity. In the measure used in our study, a tendency to refixate the most salient location will lead to a consistently high normalised saliency at fixation. This is not what we found with CH, but she did fixate salient points later in viewing, perhaps reflecting a tendency to occasionally leave and then come back to salient items.
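To make the ordinal analysis concrete, the sketch below groups normalised saliency values by fixation index and fits a linear trend; a strict saliency-ranking account predicts a reliably negative slope. The function and the toy values are illustrative, not the analysis reported above.

```python
import numpy as np

def saliency_by_fixation_index(trials, max_index=4):
    """trials: one array per trial of normalised saliency values in
    fixation order. Returns the mean value at each ordinal position."""
    means = []
    for i in range(max_index):
        vals = [t[i] for t in trials if len(t) > i]
        means.append(np.mean(vals) if vals else np.nan)
    return np.array(means)

# Toy data: two trials of normalised saliency, ordered by fixation.
curve = saliency_by_fixation_index([np.array([0.40, 0.30, 0.35, 0.20]),
                                    np.array([0.50, 0.45, 0.30, 0.25])])
slope = np.polyfit(np.arange(len(curve)), curve, 1)[0]
print(curve, slope)  # a negative slope would favour ranked selection
```

Unlike the linear-distance metric, this measure stays high if the same salient peak is refixated, which is why it can separate the two accounts discussed above.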


Our finding that CH fixated salient regions less often than controls argues for a different interpretation of the processes involved in guiding eye movements, and we believe that this interpretation can also offer a parsimonious account of the data reported by Mannan et al. (2009).

3.2. An alternative interpretation

When we asked our agnosic patient to first view images and then to identify which ones she had seen previously, the patient actually showed lower saliency at fixation than controls, and she was less likely to look at the regions identified by the model as being most salient. This is discordant with the hypothesis that agnosia leads to more guidance by saliency in all circumstances, but it can be reconciled with previous results by considering the relationship between the stimuli and the task at hand. In the search task used in Foulsham et al. (2009), the target objects were not highly salient, and therefore healthy participants had a top–down set to avoid salient regions. Without recognition of the targets, the patient was more likely to look elsewhere, and this tended to include salient areas, perhaps because she anticipated that targets would be salient. The requirements in the memory task were quite different: in this task participants had to make their own judgment as to what details might be important for remembering the images. Often these were objects and distinctive scene regions that would be more likely to contain visual discontinuities. In this task it would have been an effective strategy to look at salient items. The patient, however, seems to have adopted a different strategy, resulting in a lower correlation between fixation and saliency. This strategy may have been an attempt to pick out object-like forms, at least in the scenes, perhaps driven particularly by the colour of the different items. CH’s difficulties may also have been exacerbated by the fixed viewing period.

Why, then, did Mannan et al. (2009) report that similarity between controls and agnosic subjects diverges over time? One can take as a starting position the assumption that agnosic subjects’ eye movements must be driven by stimulus saliency because they have an object recognition problem. From there it is a simple line of reasoning that takes one to the conclusions that (a) the initial similarity between patient and healthy observers reflects the fact that the eye movements of both populations are being driven initially by salience, and (b) the later divergence between patient and healthy observers reflects the fact that healthy observers are invoking top–down control of their eye movements (which cannot be done by agnosic subjects, given the starting assumption). In contrast, we believe that for both populations top–down processes have a significant effect on the earliest stages of scanning. It is not known what strategies controls and agnosic patients may have used in the view-and-describe task reported by Mannan et al., but it seems likely that these scanning routines would interact with the gradual acquisition of object details and therefore change over time. Their two agnosic patients may simply have struggled to decode relevant objects and move on, whereas healthy participants would have exhausted the scene content and, perhaps, moved on to more idiosyncratic scanpaths. The finding that the fixation–saliency relationship changes with task is confirmed by our other work with healthy participants (Foulsham & Underwood, 2007), and many other recent papers have reported immediate top–down effects on viewing (e.g. Einhauser et al., 2008).

Mannan and colleagues see agnosic subjects’ behaviour as evidence for the role of saliency in normal, early viewing of a scene. A more parsimonious account is that the fixation patterns for both populations reflect shared top–down processes initially, and that these top–down processes diverge as time progresses. Further research into the similarity, or divergence, between healthy and disordered scanpaths is needed, perhaps in combination with different metrics for computing this similarity. However, this similarity alone cannot be confidently attributed to shared bottom–up or top–down processes without further experimental and modeling investigations such as the ones here.

3.3. Further directions

We have reviewed the saliency map approach as a model of fixations in complex scene viewing, in healthy and agnosic observers. Bottom–up visual saliency does not seem to invariably control where people look, even in visual agnosia, but our findings point to other ways in which a saliency map representation might be useful for modeling the effects of neuropsychological disorders on scene viewing.

First, because the bottom–up version of the model is built from hierarchical feature channels inspired by visual cortex, it provides an implemented test bed for exploring the effects of degrading these features. While the fit between the model and patient fixations was rather poor, consistent with a large role for top–down strategy, it was improved by down-weighting the role of orientation features. The worst model performance was found when saliency was computed from orientation alone, which makes sense if CH’s low-level perception of these features is impaired. When saliency was computed from intensity contrast alone, model performance was also poor, and this may be because oriented edges and large variations in intensity tend to co-occur. CH’s clinical data suggest that her colour vision is relatively preserved, and indeed the model performed best when saliency was calculated from colour contrasts. Adapting the architecture of the model in this way would be useful for other patients who have been thoroughly evaluated, and in concert with what is known about the brain regions that support the processing of different visual features it could also lead to insights about where damage is located. Alternatively, by evaluating model predictions from a range of architectures, it might be possible to diagnose different featural deficits from an active visual task.

There are other, finer-grained ways in which the saliency map model could be adjusted to take into account particular classes of features (such as weighting red hues or horizontal orientations more heavily), and these might be used in cases where there is evidence that a patient is selectively impaired on such a subfeature. It would also be useful to explore the inhibition-of-return parameter, which suppresses the currently fixated location and promotes the movement of attention. CH made small saccades, so it is possible that her sequential eye movements could be modeled by reducing this inhibitory component and making refixations more likely (sketched below). Such adjustments may be particularly important in single case studies or investigations with only a small group of patients. While computational models often make only normative predictions, and struggle to account for individual differences in heterogeneous patient populations, this is an elegant way in which these differences could be studied. Given past and present distinctions between subtypes of agnosia (e.g. based on the extent to which the object recognition deficit is due to impaired perception of primitive visual features), this model might also help to clarify which features, if any, are perceived and able to guide attention in a particular patient.

Second, given our interpretation that both normal and agnosic viewing is driven by top–down strategies, more research is needed to quantify these strategies. This could be achieved by varying tasks and task knowledge more precisely, for example by measuring and modeling patient performance in a search task like that in Foulsham et al. (2009).
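To illustrate the inhibition-of-return adjustment suggested above, here is a toy winner-take-all loop in Python in which the strength of inhibition is a free parameter. It is a deliberately simplified stand-in for the full model dynamics, and the function name and parameter values are ours.

```python
import numpy as np

def simulate_scanpath(saliency, n_fixations=5, ior_strength=1.0, ior_radius=3):
    """Select successive maxima from the map, damping a disc around each
    fixated point by ior_strength (1.0 = full suppression, as in the
    standard model; values near 0 make refixations increasingly likely)."""
    s = saliency.copy()
    rows, cols = np.indices(s.shape)
    path = []
    for _ in range(n_fixations):
        r, c = np.unravel_index(np.argmax(s), s.shape)
        path.append((int(r), int(c)))
        disc = (rows - r) ** 2 + (cols - c) ** 2 <= ior_radius ** 2
        s[disc] *= (1.0 - ior_strength)  # weakened inhibition of return
    return path

# A map with one dominant peak plus weak background noise.
yy, xx = np.mgrid[0:60, 0:80]
smap = np.exp(-((yy - 30) ** 2 + (xx - 40) ** 2) / 50.0)
smap += 0.05 * np.random.rand(60, 80)

print(simulate_scanpath(smap, ior_strength=1.0))  # five distinct locations
print(simulate_scanpath(smap, ior_strength=0.2))  # revisits the dominant peak
```

With weakened inhibition the damped peak can remain the global maximum, so the simulated path dwells on one region, qualitatively resembling the small saccades and refixations described for CH.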
Much recent work has focused on how bottom–up saliency may be moderated by top–down factors in search, such as the expected features of the target (Navalpakkam & Itti, 2005) and semantic knowledge of the gist of a scene and the distribution of objects within that context (Torralba et al., 2006).
Using these models it should be possible to obtain a clearer picture of what knowledge is available to guide the eyes in agnosia, and what top–down information is missing. For example, it might be that CH can ascertain that she is looking at a street scene, and that when looking for a person she would know that people tend to be at street level, towards the bottom of the image, but that her representation of the features in images of people is degraded. We have anecdotal evidence that CH can get the gist of a scene (based on her verbal reports during our experiments), and Steeves et al. (2004) suggest that their agnosic patient DF does so on the basis of preserved colour information. Exploring these top–down aspects of eye guidance in visual agnosia has the potential to reveal new insights into what capabilities are spared in disorders of visual recognition.

Of course, the work reported here is based on a single case study. Although such studies can be important, we remain cautious about the degree to which the results can be generalised to other patients and patient groups.

In closing, we have presented evidence that a completely bottom–up saliency map does not necessarily guide eye movements in visual agnosia. However, modeling both bottom–up and top–down guidance within a saliency map provides a useful tool for exploring the interaction of visual features, task and strategy in both the healthy and the injured brain.

Acknowledgement

TF is supported by a Commonwealth Postdoctoral Fellowship from the Government of Canada.

References

Ballard, D. H., & Hayhoe, M. M. (2009). Modelling the role of task in the control of gaze. Visual Cognition, 17(6–7), 1185–1204.
Bar, M., Kassam, K. S., Ghuman, A. S., Boshyan, J., Schmidt, A. M., Dale, A. M., et al. (2006). Top–down facilitation of visual recognition. Proceedings of the National Academy of Sciences of the United States of America, 103(2), 449–454.
Barton, J. J. S. (2009). What is meant by impaired configural processing in acquired prosopagnosia? Perception, 38(2), 242–260.
Barton, J. J. S., Cherkasova, M. V., Press, D. Z., Intriligator, J. M., & O’Connor, M. (2004). Perceptual functions in prosopagnosia. Perception, 33(8), 939–956.
Behrmann, M. (2003). Neuropsychological approaches to perceptual organization. In M. A. Peterson, & G. Rhodes (Eds.), Perception of faces, objects and scenes: analytic and holistic processes (pp. 295–334). Oxford: Oxford University Press.
Benson, D. F., Davis, R. J., & Snyder, B. D. (1988). Posterior cortical atrophy. Archives of Neurology, 45(7), 789–793.
Birmingham, E., Bischof, W. F., & Kingstone, A. (2009). Saliency does not account for fixations to eyes within social scenes. Vision Research, 49(24), 2992–3000.
Bruce, N. D. B., & Tsotsos, J. K. (2009). Saliency, attention, and visual search: an information theoretic approach. Journal of Vision, 9(3), 1–24.
Burt, P. J., & Adelson, E. H. (1983). The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4), 532–540.
Chen, X., & Zelinsky, G. J. (2006). Real-world visual search is dominated by top–down guidance. Vision Research, 46(24), 4118–4133.
Cutsuridis, V. (2009). A cognitive model of saliency, attention, and picture scanning. Cognitive Computation, 1, 292–299.
Einhauser, W., Rutishauser, U., & Koch, C. (2008). Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli. Journal of Vision, 8(2), 1–19.
Farah, M. J. (1990). Visual agnosia: disorders of object recognition and what they tell us about normal vision. Cambridge, MA: MIT Press.
Findlay, J. M., & Walker, R. (1999). A model of saccade generation based on parallel processing and competitive inhibition. Behavioral and Brain Sciences, 22(4), 661–721.
Foulsham, T., Barton, J. J. S., Kingstone, A., Dewhurst, R., & Underwood, G. (2009). Fixation and saliency during search of natural scenes: the case of visual agnosia. Neuropsychologia, 47(8–9), 1994–2003.
Foulsham, T., & Kingstone, A. (2010). Asymmetries in the direction of saccades during perception of scenes and fractals: effects of image type and image features. Vision Research, 50(8), 779–795.
Foulsham, T., & Underwood, G. (2007). How does the purpose of inspection influence the potency of visual saliency in scene perception? Perception, 36, 1123–1138.

Foulsham, T., & Underwood, G. (2008). What can saliency models predict about eye movements? Spatial and sequential aspects of fixations during encoding and recognition. Journal of Vision, 8(6), 1–17.
Henderson, J. M. (2003). Human gaze control during real-world scene perception. Trends in Cognitive Sciences, 7(11), 498–504.
Henderson, J. M., Brockmole, J. R., Castelhano, M. S., & Mack, M. L. (2007). Visual saliency does not account for eye movements during visual search in real-world scenes. In R. van Gompel, M. Fischer, W. Murray, & R. W. Hill (Eds.), Eye movements: a window on mind and brain (pp. 537–562). Amsterdam: Elsevier.
Henderson, J. M., Malcolm, G. L., & Schandl, C. (2009). Searching in the dark: cognitive relevance drives attention in real-world scenes. Psychonomic Bulletin & Review, 16(5), 850–856.
Henderson, J. M., Weeks, P. A., & Hollingworth, A. (1999). The effects of semantic consistency on eye movements during complex scene viewing. Journal of Experimental Psychology: Human Perception and Performance, 25(1), 210–228.
Humphreys, G. W., & Riddoch, M. J. (1987). To see but not to see: a case study of visual agnosia. Hillsdale, NJ: Erlbaum.
Itti, L. (2005). Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. Visual Cognition, 12(6), 1093–1123.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40(10–12), 1489–1506.
Itti, L., & Koch, C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3), 194–203.
James, W. (1890). The principles of psychology. New York: Holt.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Le Meur, O., Le Callet, P., & Barba, D. (2007). Predicting visual fixations on video based on low-level visual features. Vision Research, 47(19), 2483–2498.
Mannan, S., Kennard, C., & Husain, M. (2009). The role of visual salience in directing eye movements in visual object agnosia. Current Biology, 19(6), R247–R248.
Mannan, S., Ruddock, K., & Wooding, D. (1995). Automatic control of saccadic eye movements made in visual inspection of briefly presented 2-D images. Spatial Vision, 9(3), 363–386.
Navalpakkam, V., & Itti, L. (2005). Modeling the influence of task on attention. Vision Research, 45(2), 205–231.
Olson, A. C., & Humphreys, G. W. (1997). Connectionist models of neuropsychological disorders. Trends in Cognitive Sciences, 1(6), 222–228.
Parkhurst, D., Law, K., & Niebur, E. (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107–123.
Peters, R. J., Iyer, A., Itti, L., & Koch, C. (2005). Components of bottom–up gaze allocation in natural images. Vision Research, 45(18), 2397–2416.
Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the centre of gaze. Network: Computation in Neural Systems, 10(4), 341–350.
Riddoch, M. J., & Humphreys, G. W. (1987). A case of integrative visual agnosia. Brain, 110, 1431–1462.
Riddoch, M. J., & Humphreys, G. W. (2003). Visual agnosia. Neurologic Clinics, 21(2), 501–520.
Rubens, A. B., & Benson, D. F. (1971). Associative visual agnosia. Archives of Neurology, 24(4), 305–316.
Steeves, J. K. E., Humphrey, G. K., Culham, J. C., Menon, R. S., Milner, A. D., & Goodale, M. A. (2004). Behavioral and neuroimaging evidence for a contribution of color and texture information to scene classification in a patient with visual form agnosia. Journal of Cognitive Neuroscience, 16(6), 955–965.
Tatler, B. W., Baddeley, R. J., & Gilchrist, I. D. (2005). Visual correlates of fixation selection: effects of scale and time. Vision Research, 45(5), 643–659.
Torralba, A., Oliva, A., Castelhano, M. S., & Henderson, J. M. (2006). Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 113(4), 766–786.
Treisman, A., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12, 97–136.
Treue, S. (2003). Visual attention: the where, what, how and why of saliency. Current Opinion in Neurobiology, 13(4), 428–432.
Underwood, G., & Foulsham, T. (2006). Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Quarterly Journal of Experimental Psychology, 59(11), 1931–1949.
Underwood, G., Foulsham, T., & Humphrey, K. (2009). Saliency and scan patterns in the inspection of real-world scenes: eye movements during encoding and recognition. Visual Cognition, 17(6–7), 812–834.
Underwood, G., Foulsham, T., van Loon, E., Humphreys, L., & Bloyce, J. (2006). Eye movements during scene inspection: a test of the saliency map hypothesis. European Journal of Cognitive Psychology, 18(3), 321–342.
Walther, D., & Koch, C. (2006). Modeling attention to salient proto-objects. Neural Networks, 19(9), 1395–1407.
Wolfe, J. M. (2005). Guidance of visual search by preattentive information. In L. Itti, G. Rees, & J. K. Tsotsos (Eds.), Neurobiology of attention (pp. 101–104). San Diego, CA: Academic Press/Elsevier.
Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum.
Zelinsky, G. J. (2008). A theory of eye movements during target acquisition. Psychological Review, 115(4), 787–835.
Zetzsche, C. (2005). Natural scene statistics and salient visual features. In L. Itti, G. Rees, & J. K. Tsotsos (Eds.), Neurobiology of attention. London: Elsevier.