How a Part of the Brain Might or Might Not Work: A New Hierarchical Model of Object Recognition by
Maximilian Riesenhuber
Diplom-Physiker, Universität Frankfurt, 1995
Submitted to the Department of Brain and Cognitive Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computational Neuroscience at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2000
© Massachusetts Institute of Technology 2000. All rights reserved.

Author: Department of Brain and Cognitive Sciences, May 2, 2000
Certified by: Tomaso Poggio, Uncas and Helen Whitaker Professor, Thesis Supervisor
Accepted by: Earl K. Miller, Co-Chair, Department Graduate Committee
How a Part of the Brain Might or Might Not Work: A New Hierarchical Model of Object Recognition by Maximilian Riesenhuber Submitted to the Department of Brain and Cognitive Sciences on May 2, 2000, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computational Neuroscience
Abstract

The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated representations, extending in a natural way the model of simple to complex cells of Hubel and Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to explore the biological feasibility of this class of models to explain higher level visual processing, such as object recognition in cluttered scenes. We describe a new hierarchical model, HMAX, that accounts well for this complex visual task, is consistent with several recent physiological experiments in inferotemporal cortex and makes testable predictions. Key to achieving invariance and robustness to clutter is a MAX-like response function of some model neurons, which selects (an approximation to) the maximum activity over all the afferents, with interesting connections to “scanning” operations used in recent computer vision algorithms.

We then turn to the question of object recognition in natural (“continuous”) object classes, such as faces, which recent physiological experiments have suggested are represented by a sparse distributed population code. We performed two psychophysical experiments in which subjects were trained to perform subordinate level discrimination in a continuous object class — images of computer-rendered cars — created using a 3D morphing system. By comparing the recognition performance of trained and untrained subjects we could estimate the effects of viewpoint-specific training and infer properties of the object class-specific representation learned as a result of training. We then compared the experimental findings to simulations in HMAX, to investigate the computational properties of a population-based object class representation. We find experimental evidence, supported by modeling results, that training builds a viewpoint- and class-specific representation that supplements a pre-existing representation with lower shape discriminability but greater viewpoint invariance.

Finally, we show how HMAX can be extended in a straightforward fashion to perform object categorization and to support arbitrary class hierarchies. We demonstrate the capability of our scheme, called “Categorical Basis Functions” (CBF), with the example domain of cat/dog categorization, and apply it to study some recent findings in categorical perception.
Thesis Supervisor: Tomaso Poggio Title: Uncas and Helen Whitaker Professor
Acknowledgments

Thanks are due to quite a few people who have contributed in different ways to the gestation of this thesis. First, of course, are my parents. I am extremely grateful for their untiring support and encouragement over the years — especially to my father for convincing me to major in physics. Second is Hans-Ulrich Bauer, who advised my Diplom thesis at the Institute for Theoretical Physics of the University of Frankfurt. Without Hans-Ulrich and his urging to get my PhD in computational neuroscience in the US I would never have applied to MIT and would now probably be a bitter physics PhD working for McKinsey. To him this thesis is dedicated.

At MIT, I first want to thank my advisor, Tommy Poggio, for warning me about the smog at Caltech, and for being the best advisor I could imagine: providing a lot of independence while being very encouraging and supportive. Being exposed to his way of doing science I consider one of the biggest assets of my PhD training. Then there are the people on my thesis committee, all of whom have provided valuable input and guidance along the way. Special thanks are due to Peter Dayan for a fine collaboration during my first year that introduced me to quite a few new ideas. To Earl Miller for being so open to collaborations with wacky theorists. To Mike Tarr for a nice News & Views commentary [96] on our paper [82], and a very stimulating visit which might lead to an even more stimulating collaboration. . . To Pawan Sinha and Gadi Geiger for introducing me to the wonderful world of psychophysics and providing invaluable advice when I ran my first experiment. To David Freedman and Andreas Tolias for introducing me to the wonderful world of monkey physiology and being great collaborators. To Christian Shelton for the “mother of all correspondence algorithms” that was so instrumental in many parts of this thesis and beyond. To Valerie Pires, without whose help the psychophysics in chapter 4 would have been nothing more than “suggestions for future work”, and to Mary Pat Fitzgerald, “the mother of CBCL”, whose help in dealing with the subtleties of the MIT administration was (and continues to be) greatly appreciated. Then there is our fine department that really has it all “under one roof”. I could not have imagined a better place to get my PhD.

Last but not least I want to gratefully acknowledge the generous support provided by a Gerald J. and Marjorie J. Burnett Fellowship (1996–1998) and a Merck/MIT Fellowship in Bioinformatics (1998–2000) that enabled me to pursue the studies described in this thesis.
Contents

1 Introduction

2 Hierarchical Models of Object Recognition in Cortex
  2.1 Introduction
  2.2 Results
  2.3 Discussion
  2.4 Methods
  2.5 Acknowledgments

3 Are Cortical Models Really Bound by the “Binding Problem”?
  3.1 Introduction: Visual Object Recognition
  3.2 Models of Visual Object Recognition and the Binding Problem
  3.3 A Hierarchical Model of Object Recognition in Cortex
  3.4 Binding without a problem
    3.4.1 Recognition of multiple objects
    3.4.2 Recognition in clutter
  3.5 Discussion
  3.6 Acknowledgments

4 The Individual is Nothing, the Class Everything: Psychophysics and Modeling of Recognition in Object Classes
  4.1 Introduction
  4.2 Experiment 1
    4.2.1 Methods
    4.2.2 Results
    4.2.3 Discussion
  4.3 Modeling: Representing Continuous Object Classes in HMAX
    4.3.1 The HMAX Model of Object Recognition in Cortex
    4.3.2 View-Dependent Object Recognition in Continuous Object Classes
    4.3.3 Discussion
  4.4 Experiment 2
    4.4.1 Methods
    4.4.2 Results
    4.4.3 The Model Revisited
    4.4.4 Discussion
  4.5 General Discussion

5 A note on object class representation and categorical perception
  5.1 Introduction
  5.2 Chorus of Prototypes (COP)
  5.3 A Novel Scheme: Categorical Basis Functions (CBF)
    5.3.1 An Example: Cat/Dog Classification
    5.3.2 Introduction of parallel categorization schemes
  5.4 Interactions between categorization and discrimination: Categorical Perception
    5.4.1 Categorical Perception in CBF
    5.4.2 Categorization with and without Categorical Perception
  5.5 COP or CBF? — Suggestion for Experimental Tests
  5.6 Conclusions

6 General Conclusions and Future Work
List of Figures

2-1 Invariance properties of one IT neuron
2-2 Sketch of the model
2-3 Illustration of the highly nonlinear shape tuning properties of the MAX mechanism
2-4 Response of a sample model neuron
2-5 Average neuronal responses to scrambled stimuli
3-1 Cartoon of the Poggio and Edelman model of view-based object recognition
3-2 Sketch of our hierarchical model of object recognition in cortex
3-3 Recognition of two objects simultaneously
3-4 Stimulus/background example
3-5 Model performance: Recognition in clutter
4-1 Natural objects, and artificial objects used in previous object recognition studies
4-2 The eight prototype cars used in the 8 car system
4-3 Training task for Experiments 1 and 2
4-4 Illustration of match/nonmatch pairs for Experiment 1
4-5 Testing task for Experiments 1 and 2
4-6 Average performance of the trained subjects on the test task of Experiment 1
4-7 Average performance of untrained subjects on the test task of Experiment 1
4-8 Our model of object recognition in cortex
4-9 Recognition performance of the model on the eight car morph space
4-10 Dependence of average (one-sided) rotation invariance, r, on SSCU tuning width
4-11 Dependence of invariance range on the number of afferents to each SSCU
4-12 Dependence of invariance range on the number of SSCUs
4-13 Effect of addition of noise to the SSCU representation
4-14 Cat/dog prototypes
4-15 The “Other Class” effect
4-16 Car object class-specific features
4-17 Performance of the two-layer model
4-18 The 15 prototypes used in the 15 car system
4-19 Average performance of trained subjects in Experiment 2
4-20 Average performance of untrained subjects in Experiment 2
5-1 Cartoon of the CBF categorization scheme
5-2 Illustration of the cat/dog stimulus space
5-3 Response of the cat/dog categorization unit
5-4 Sketch of the model to explain the influence of experience with categorization tasks on object discrimination
5-5 Average responses over all morph lines for the two networks
5-6 Comparison of Euclidean distances of activation patterns
5-7 Output of the categorization unit trained on the cat/dog categorization task
5-8 Same as Fig. 5-7, but for a representation based on 30 units chosen by k-means
Chapter 1
Introduction

Tell your hairdresser that you are working on vision and he will likely say “Vision? But that’s easy!” Indeed, the apparent ease with which we perform object recognition even in cluttered scenes and under difficult viewing conditions belies the amazing complexity of the visual system. This became apparent with the groundbreaking studies of Hubel and Wiesel [36, 37] in the primary visual cortices of cats and monkeys in the late 50s and 60s. Subsequently, many other visual areas were discovered, with recent surveys listing over 30 visual areas linked in an intricate and still unresolved pattern [22, 35]. This complex connection scheme can be coarsely divided up into two pathways, the “What” pathway (the ventral stream) running from primary visual cortex, V1, via V2 and V4 to inferotemporal cortex (IT), and the “Where” pathway (dorsal stream) from V1 to V2, V3, MT, MST and other parietal areas [103]. In this framework, the “What” pathway is specialized for object recognition whereas the “Where” pathway is concerned with spatial vision.

Looking at the ventral stream in more detail, Kobatake et al. [41] have reported a progression of the complexity of cells’ preferred features and receptive field sizes as one moves along the stream. While neurons in V1 are tuned to oriented bars and have small receptive fields, cells in IT appear to prefer more complex visual patterns such as faces (for a review on face cells, see [15]), and respond over a wide range of positions and scales, pointing to a crucial role of IT cortex in object recognition (as confirmed by a great number of physiological, lesion and neuropsychological studies [50, 94]). These findings naturally prompted the question of how cells tuned to views of complex objects and showing invariance to size and position changes could arise in IT from small bar-tuned receptive fields in V1, and how ultimately this neural substrate could be used to perform object recognition.

In humans, psychophysical experiments had given rise to two main competing theories of object recognition: the structural description and the view-based theory (see [98] for a review of the two theories and their experimental evidence). The former approach, its main protagonist being the recognition-by-components theory of Biederman [5], holds that object recognition proceeds by
decomposing an object into a view-independent part-based description, while in the view-based theory object recognition is based on the viewpoints in which objects had actually appeared. Experimental evidence ([8, 48, 49, 97, 100], see also chapter 2) and computational considerations [19] appear to favor the view-based theory. However, several challenges for view-based models remained; Tarr and Bülthoff [98] very recently listed the following problems:

1. tolerance to changes in viewing condition — while the system should allow fine shape discriminations, it should not require a new representation for every change in viewing condition;

2. class generalization and representation of categories — the system should generalize from familiar exemplars of a class to unfamiliar ones, and also be able to support categorization schemes.

(They also listed the need for a simple mechanism to measure perceptual similarity, in order “to generalize between exemplars or between views” ([98], p. 9), which thus appears to be a corollary of solving the two problems listed above.)

This thesis presents a simple hierarchical model of object recognition in cortex (chapter 5 shows how the model can be extended to object categorization in a straightforward way), HMAX [96], that addresses these challenges in a biologically plausible system. In particular, chapter 2 (a reprint of [82], copyright Nature America Inc., reprinted by permission) introduces HMAX and compares the visual properties of view-tuned units in the model, especially with respect to translation, scaling and rotation in depth of the visual scene, to those of neurons in inferotemporal cortex recorded from in various experiments. Chapter 3 (a reprint of [81], copyright Cell Press, reprinted by permission) shows that HMAX can even perform recognition in cluttered scenes, without having to resort to special segmentation processes, which is of special interest in connection with the so-called “Binding Problem”. While the first two papers focus on “paperclip” objects that have been used extensively in psychophysical [8, 48, 91], physiological [49] and computational studies [68], this object class has several disadvantages (such as not being “nice” [107]) that make it unsuitable as a basis for investigating recognition in natural object classes — the topic of chapter 4 (a reprint of [84]) — where objects have similar 3D shape, such as faces. Instead, in chapters 4 and 5, stimuli for model and experiment were generated using a novel 3D morphing system developed by Christian Shelton [90] that allows us to generate morphed objects drawn from a space spanned by a set of prototype objects, for instance cars (chapter 4), or cats and dogs (chapters 4 and 5). We show that the recognition results obtainable in natural object classes represented using a population code, where the activity over a group of units codes for the identity of an object (as suggested by recent physiology studies [112, 115]), are quite comparable to those for individual objects represented by “grandmother” units (in chapter 2), that is, the performance of HMAX does not appear to be special to a certain object
class. In addition, simulations show that a population-based object representation provides several computational advantages over a “grandmother” representation. We further present experimental results from a psychophysical study in which we trained subjects using a discrimination paradigm to build a representation of a novel object class. This representation was then probed by examining how discrimination performance was affected by viewpoint changes. We find experimental evidence, supported by the modeling results, that training builds a viewpoint- and class-specific representation that supplements a pre-existing representation with lower shape discriminability but greater viewpoint invariance. Chapter 5 (a reprint of [83]) finally shows how HMAX can be extended in a straightforward way to perform object categorization, and to support arbitrary object categorization schemes, with interesting opportunities for interactions between discrimination and categorization as observed in categorical perception.
Chapter 2
Hierarchical Models of Object Recognition in Cortex

Abstract

The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated representations, extending in a natural way the model of simple to complex cells of Hubel and Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to explore the biological feasibility of this class of models to explain higher level visual processing, such as object recognition. We describe a new hierarchical model that accounts well for this complex visual task, is consistent with several recent physiological experiments in inferotemporal cortex and makes testable predictions. The model is based on a novel MAX-like operation on the inputs to certain cortical neurons which may have a general role in cortical function.
2.1 Introduction

The recognition of visual objects is a fundamental cognitive task performed effortlessly by the brain countless times every day while satisfying two essential requirements: invariance and specificity. In face recognition, for example, we can recognize a specific face among many, while being rather tolerant to changes in viewpoint, scale, illumination, and expression. The brain performs this and similar object recognition and detection tasks fast [101] and well. But how?

Early studies [7] of macaque inferotemporal cortex (IT), the highest purely visual area in the ventral visual stream thought to have a key role in object recognition [103], reported cells tuned to views of complex objects such as a face, i.e., the cells discharged strongly to the view of a face but very little or not at all to other objects. A hallmark of these cells was the robustness of their firing
to stimulus transformations such as scale and position changes. This finding presented an interesting question: How could these cells show strongly differing responses to similar stimuli (as, e.g., two different faces), that activate the retinal photoreceptors in similar ways, while showing response constancy to scaled and translated versions of the preferred stimulus that cause very different activation patterns on the retina?

This puzzle was similar to one faced by Hubel and Wiesel on a much smaller scale two decades earlier when they recorded from simple and complex cells in cat striate cortex [36]: both cell types responded strongly to oriented bars, but whereas simple cells exhibited small receptive fields with a strong phase dependence, that is, with distinct excitatory and inhibitory subfields, complex cells had larger receptive fields and no phase dependence. This led Hubel and Wiesel to propose a model in which simple cells with their receptive fields in neighboring parts of space feed into the same complex cell, thereby endowing that complex cell with a phase-invariant response. A straightforward (but highly idealized) extension of this scheme would lead all the way from simple cells to “higher order hypercomplex cells” [37].

Starting with the Neocognitron [25] for translation invariant object recognition, several hierarchical models of shape processing in the visual system have subsequently been proposed to explain how transformation-invariant cells tuned to complex objects can arise from simple cell inputs [64, 111]. Those models, however, were not quantitatively specified or were not compared with specific experimental data. Alternative models for translation- and scale-invariant object recognition have been proposed, based on a controlling signal that either appropriately reroutes incoming signals, as in the “shifter” circuit [2] and its extension [62], or modulates neuronal responses, as in the “gain-field” models for invariant recognition [78, 88]. While recent experimental studies [14, 56] have indicated that in macaque area V4 cells can show an attention-controlled shift or modulation of their receptive field in space, there is still little evidence that this mechanism is used to perform translation-invariant object recognition or that a similar mechanism applies to other transformations (such as scaling) as well.

The basic idea of the hierarchical model sketched by Perrett and Oram [64] was that invariance to any transformation (not just image-plane transformations as in the case of the Neocognitron [25]) could be built up by pooling over afferents tuned to various transformed versions of the same stimulus. Indeed it was shown earlier [68] that viewpoint-invariant object recognition was possible using such a pooling mechanism. A (Gaussian RBF) learning network was trained with individual views (rotated around one axis in 3D space) of a complex, paperclip-like object to achieve 3D rotation-invariant recognition of this object. In the network the resulting view-tuned units fed into a view-invariant unit; they effectively represented prototypes between which the learning network interpolated to achieve viewpoint-invariance.

There is now quantitative psychophysical [8, 48, 95] and physiological evidence [6, 42, 49] for
the hypothesis that units tuned to full or partial views are probably created by a learning process, and also some hints that the view-invariant output is in some cases explicitly represented by (a small number of) individual neurons [6, 49, 66]. A recent experiment [48, 49] required monkeys to perform an object recognition task using novel “paperclip” stimuli the monkeys had never seen before. Here, the monkeys were required to recognize views of “target” paperclips rotated in depth among views of a large number of “distractor” paperclips of very similar structure, after being trained on a restricted set of views of each target object. Following very extensive training on a set of paperclip objects, neurons were found in anterior IT that selectively responded to the object views seen during training.

This design avoided two problems associated with previous physiological studies investigating the mechanisms underlying view-invariant object recognition: First, by training the monkey to recognize novel stimuli with which the monkey had not had any visual experience, instead of objects (e.g., faces) with which the monkey was quite familiar, it was possible to estimate the degree of view-invariance derived from just one object view. Moreover, the use of a large number of distractor objects made it possible to define view-invariance with respect to the distractor objects. This is a key point, since only by being able to compare the response of a neuron to transformed versions of its preferred stimulus with the neuron’s response to a range of (similar) distractor objects can the VTU’s (view-tuned unit’s) invariance range be determined — just measuring the tuning curve is not sufficient.

The study [49] established (Fig. 2-1) that after training with just one object view there are cells showing some degree of limited invariance to 3D rotation around the training view, consistent with the view-interpolation model [68]. Moreover, the cells also exhibit significant invariance to translation and scale changes, even though the object was previously presented at only one scale and position. These data put in sharp focus and in quantitative terms the question of the circuitry underlying the properties of the view-tuned cells. While the original model [68] described how VTUs could be used to build view-invariant units, it did not specify how the view-tuned units themselves could come about. The key problem is thus to explain, in terms of biologically plausible mechanisms, the VTUs’ invariance to translation and scaling obtained from just one object view, which arises from a trade-off between selectivity to a specific object and relative tolerance (i.e., robustness of firing) to position and scale changes.

Here, we describe a model that conforms to the main anatomical and physiological constraints, reproduces the invariance data described above and makes predictions for experiments on the view-tuned subpopulation of IT cells. Interestingly, the model is also consistent with recent data from several other experiments regarding recognition in context [54], or the presence of multiple objects in a cell’s receptive field [89].
[Figure 2-1, panels (a)-(d): spike rate vs. rotation around the y-axis; spike rate for the 10 best distractors (distractor ID); target response relative to the mean of the best distractors vs. stimulus size in degrees of visual angle; and vs. azimuth and elevation of the stimulus position (x = 2.25 degrees).]
Figure 2-1: Invariance properties of one neuron (modified from Logothetis et al. [49]). The figure shows the response of a single cell found in anterior IT after training the monkey to recognize paperclip-like objects. The cell responded selectively to one view of a paperclip and showed limited invariance around the training view to rotation in depth, along with significant invariance to translation and size changes, even though the monkey had only seen the stimulus at one position and scale during training. (a) shows the response of the cell to rotation in depth around the preferred view. (b) shows the cell’s response to the 10 distractor objects (other paperclips) that evoked the strongest responses. The lower plots show the cell’s response to changes in stimulus size, (c) (asterisk shows the size of the training view), and position, (d) (using the 1.9° size), resp., relative to the mean of the 10 best distractors. Defining “invariance” as yielding a higher response to transformed views of the preferred stimulus than to distractor objects, neurons exhibit an average rotation invariance of 42° (during training, stimuli were actually rotated by 15° in depth to provide full 3D information to the monkey; therefore, the invariance obtained from a single view is likely to be smaller), and translation and scale invariance on the order of 2° and 1 octave around the training view, resp. (J. Pauls, personal communication).
[Figure 2-2 diagram: hierarchy from simple cells (S1) through complex cells (C1), “composite feature” cells (S2), and “complex composite” cells (C2) up to view-tuned cells; solid lines denote a weighted sum, dashed lines the MAX operation.]
Figure 2-2: Sketch of the model. The model is a hierarchical extension of the classical paradigm [36] of building complex cells from simple cells. It consists of a hierarchy of layers with linear (“S” units in the notation of Fukushima [25], performing template matching, solid lines) and non-linear operations (“C” pooling units [25], performing a “MAX” operation, dashed lines). The non-linear MAX operation — which selects the maximum of the cell’s inputs and uses it to drive the cell — is key to the model’s properties and is quite different from the basically linear summation of inputs usually assumed for complex cells. These two types of operations respectively provide pattern specificity and invariance (to translation, by pooling over afferents tuned to different positions, and to scale (not shown), by pooling over afferents tuned to different scales).
2.2 Results

The model is based on a simple hierarchical feedforward architecture (Fig. 2-2). Its structure reflects the assumption that invariance to position and scale on the one hand and feature specificity on the other hand must be built up through separate mechanisms: to increase feature complexity, a suitable neuronal transfer function is a weighted sum over afferents coding for simpler features, i.e., a template match. But is summing over differently weighted afferents also the right way to increase invariance?

From the computational point of view, the pooling mechanism should produce robust feature detectors, i.e., measure the presence of specific features without being confused by clutter and context in the receptive field. Consider a complex cell, as found in primary visual cortex, whose preferred stimulus is a bar of a certain orientation to which the cell responds in a phase-invariant way [36]. Along the lines of the original complex cell model [36], one could think of the complex cells as receiving input from an array of simple cells at different locations, pooling over which results in
the position-invariant response of the complex cell. Two alternative idealized pooling mechanisms are: linear summation (“SUM”) with equal weights (to achieve an isotropic response) and a nonlinear maximum operation (“MAX”), where the strongest afferent determines the response of the postsynaptic unit. In both cases, if only one bar is present in the receptive field, the response of a model complex cell is position invariant. The response level would signal how similar the stimulus is to the afferents’ preferred feature.

Consider now the case of a complex stimulus, such as a paperclip, in the visual field. In the linear summation case, complex cell response would still be invariant (as long as the stimulus stays in the cell’s receptive field), but the response level now would not allow one to infer whether there actually was a bar of the preferred orientation somewhere in the complex cell’s receptive field, as the output signal is a sum over all the afferents. That is, feature specificity is lost. In the MAX case, however, the response would be determined by the most strongly activated afferent and hence would signal the best match of any part of the stimulus to the afferents’ preferred feature. This ideal example suggests that the MAX mechanism is capable of providing a more robust response in the case of recognition in clutter or with multiple stimuli in the receptive field (cf. below). Note that a SUM response with saturating nonlinearities on the inputs seems too brittle since it requires a case-by-case adjustment of the parameters, depending on the activity level of the afferents.

Equally critical is the inability of the SUM mechanism to achieve size invariance: Suppose that the afferents to a “complex” cell (which now could be a cell in V4 or IT, for instance) show some degree of size and position invariance. If the “complex” cell were now stimulated with the same object but at successively increasing sizes, an increasing number of afferents would become excited by the stimulus (unless the afferents showed no overlap in space or scale) and consequently the excitation of the “complex” cell would increase along with the stimulus size, even though the afferents show size invariance (this is borne out in simulations using a simplified two-layer model [79])! For the MAX mechanism, however, cell response would show little variation even as stimulus size increased, since the cell’s response would be determined just by the best-matching afferent.

These considerations (supported by quantitative simulations of the model, described below) suggest that a sensible way of pooling responses to achieve invariance is via a nonlinear MAX function, that is, by implicitly scanning (see Discussion) over afferents of the same type that differ in the parameter of the transformation to which the response should be invariant (e.g., feature size for scale invariance), and then selecting the best-matching of those afferents. Note that these considerations apply to the case where different afferents to a pooling cell, e.g., those looking at different parts of space, are likely to be responding to different objects (or different parts of the same object) in the visual field (as is the case with cells in lower visual areas with their broad shape tuning). Here, pooling by combining afferents would mix up signals caused by different stimuli. However, if the afferents are specific enough to only respond to one pattern, as one expects in the
[Figure 2-3: (a) experimentally observed responses (“expt.”); (b) model responses under MAX and SUM pooling; y-axis: response, 0–1.]
Figure 2-3: Illustration of the highly nonlinear shape tuning properties of the MAX mechanism. (a) Experimentally observed responses of IT cells obtained using a “simplification procedure” [113] designed to determine “optimal” features (responses normalized so that the response to the preferred stimulus is equal to 1). In that experiment, the cell originally responds quite strongly to the image of a “water bottle” (leftmost object). The stimulus is then “simplified” to its monochromatic outline which increases the cell’s firing, and further to a paddle-like object, consisting of a bar supporting an ellipse. While this object evokes a strong response, the bar or the ellipse alone produce almost no response at all (figure used by permission). (b) Comparison of experiment and model. Green bars show the responses of the experimental neuron from (a). Blue and red bars show the response of a model neuron tuned to the stem-ellipsoidal base transition of the preferred stimulus. The model neuron is at the top of a simplified version of the model shown in Fig. 2-2, where there are only two types of S1 features at each position in the receptive field, tuned to the left and right side of the transition region, resp., which feed into C1 units that pool using a MAX function (blue bars) or a SUM function (red bars). The model neuron is connected to these C1 units so that its response is maximal when the experimental neuron’s preferred stimulus is in its receptive field.
final stages of the model, then pooling by using a weighted sum, as in the RBF network [68], where VTUs tuned to different viewpoints were combined to interpolate between the stored views, is advantageous. MAX-like mechanisms at some stages of the circuitry appear to be compatible with recent neurophysiological data. For instance, it has been reported [89] that when two stimuli are brought into the receptive field of an IT neuron, that neuron’s response appears to be dominated by the stimulus that produces a higher firing rate when presented in isolation to the cell — just as expected if a MAX-like operation is performed at the level of this neuron or its afferents. Theoretical investigations into possible pooling mechanisms for V1 complex cells also support a maximum-like pooling mechanism (K. Sakai & S. Tanaka, Soc. Neurosci. Abs., 23, 453, 1997). Additional indirect support for a MAX mechanism comes from studies using a “simplification procedure” [113] or “complexity reduction” [47] to determine the preferred features of IT cells, i.e., the stimulus components that are responsible for driving the cell. These studies commonly find a highly nonlinear tuning of IT cells (Fig. 2-3 (a)). Such tuning is compatible with the MAX response function (Fig. 2-3 (b), blue bars). Note that a linear model (Fig. 2-3 (b), red bars) cannot reproduce this strong response change for small changes in the input image. In our model of view-tuned units (Fig. 2-2), the two types of operations, scanning and template
matching, are combined in a hierarchical fashion to build up complex, invariant feature detectors from small, localized, simple cell-like receptive fields in the bottom layer which receive input from the model “retina.” There need not be a strict alternation of these two operations: connections can
skip levels in the hierarchy, as in the direct C1 → C2 connections of the model in Fig. 2-2.

The question remains whether the proposed model can indeed achieve response selectivity and invariance compatible with the results from physiology. To investigate this question, we looked at the invariance properties of 21 view-tuned units in the model, each tuned to a view of a different, randomly selected paperclip, as used in the experiment [49]. Figure 2-4 shows the response of one model view-tuned unit to 3D rotation, scaling and translation around its preferred view (see Methods). The unit responds maximally to the training view, with the response gradually falling off as the stimulus is transformed away from the training view. As in the experiment, we can determine the invariance range of the VTU by comparing the response to the preferred stimulus to the responses to the 60 distractors. The invariance range is then defined as the range over which the model unit’s response is greater than to any of the distractor objects. Thus, the model VTU shown in Fig. 2-4 shows rotation invariance of 24°, scale invariance of 2.6 octaves and translation invariance of 4.7° of visual angle. Averaging over all 21 units, we obtain average rotation invariance over 30.9°, scale invariance over 2.1 octaves and translation invariance over 4.6°. Units show invariance around the training view, of a range in good agreement with the experimentally observed values. Some units (5/21), an example of which is given in Fig. 2-4 (d), show tuning also for pseudo-mirror views (obtained by rotating the preferred paperclip by 180° in depth, which produces a pseudo-mirror view of the object due to the paperclips’ minimal self-occlusion), as observed in some experimental neurons [49].

While the simulation and experimental data presented so far dealt with object recognition settings in which one object was presented in isolation, this is rarely the case in normal object recognition settings. More commonly, the object to be recognized is situated in front of some background or appears together with other objects, all of which are to be ignored if the object is to be recognized successfully. More precisely, in the case of multiple objects in the receptive field, the responses of the afferents feeding into a VTU tuned to a certain object should be affected as little as possible by the presence of other “clutter objects.” The MAX response function posited above for the pooling mechanism to achieve invariance has the right computational properties to perform recognition in clutter: If the VTU’s preferred object strongly activates the VTU’s afferents, then it is unlikely that other objects will interfere, as they tend to activate the afferents less and hence will not usually influence the response due to the MAX response function. In some cases (such as when there are occlusions of the preferred feature, or one of the “wrong” afferents has a higher activation) clutter, of course, can affect the value provided
[Figure 2-4, panels (a)-(d): response (0–1) vs. stimulus size, viewing angle (rotation in depth), x/y translation (deg), and viewing angle for pseudo-mirror views, respectively.]
Figure 2-4: Responses of a sample model neuron to different transformations of its preferred stimulus. The different panels show the same neuron’s response to (a) varying stimulus sizes (inset shows response to 60 distractor objects, selected randomly from the paperclips used in the physiology experiments [49]), (b) rotation in depth and (c) translation. Training size was 64 × 64 pixels, corresponding to 2° of visual angle. (d) shows another neuron’s response to pseudo-mirror views (cf. text), with the dashed line indicating the neuron’s response to the “best” distractor.
by the MAX mechanism, thereby reducing the quality of the match at the final stage and thus the strength of the VTU response. It is clear that to achieve the highest robustness to clutter, a VTU should only receive input from cells that are strongly activated by its preferred stimulus (i.e., that are relevant to the definition of the object). In the version of the model described so far, the penultimate layer contained only 10 cells corresponding to 10 different features, which turned out to be sufficient to achieve invariance properties as found in the experiment. Each VTU in the top layer was connected to all the afferents and hence robustness to clutter is expected to be relatively low. Note that in order to connect a VTU to only the subset of the intermediate feature detectors it receives strong input from, the number of afferents should be large enough to achieve the desired response specificity. The straightforward solution is to increase the number of features. Even with a fixed number of different features in S1, the dictionary of S2 features can be expanded by increasing the number and type of afferents to individual S2 cells (see Methods).

In this “many feature” version of the model, the invariance ranges for a low number of afferents are already comparable to the experimental ranges — if each VTU is connected to the 40 (out of 256) C2 cells that are most strongly excited by its preferred stimulus, model VTUs show an average scale invariance over 1.9 octaves, rotation invariance over 36.2° and translation invariance over 4.4°. For the maximum of 256 afferents to each cell, cells are rotation invariant over an average of 47°, scale invariant over 2.4 octaves and translation invariant over 4.7°.

Simulations show [81] that this model is capable of performing recognition in context: Using as inputs displays that contain the neuron’s preferred clip as well as another, distractor, clip, the model is able to correctly recognize the preferred clip in 90% of the cases (for 40/256 afferents to each neuron; the maximum rate is 94% for 18 afferents, dropping to 55% for 256/256 afferents, compared to 40% in the original version of the model with 10 C2 units), i.e., the addition of the second clip interfered with the activation caused by the first clip alone so much that in 10% of the cases the response to the two-clip display containing the preferred clip fell below the response to one of the distractor clips. This reduction of the response to the two-stimulus display compared to the response to the stronger stimulus alone has also been found in experimental studies [86, 89].

The question of object recognition in the presence of a background object was explored experimentally in a recent study [54], where a monkey had to discriminate (polygonal) foreground objects irrespective of the (polygonal) background they appeared with. Recordings of IT neurons showed that for the stimulus/background condition, neuronal response on average was reduced to a quarter of the response to the foreground object alone, while the monkey’s behavioral performance dropped much less. This is compatible with simulations in the model [81] that show that even though a unit’s firing rate is strongly affected by the addition of the background pattern, it is still in most cases well above the firing rate evoked by distractor objects, allowing the foreground
[Figure 2-5: (a) example of a scrambled stimulus; (b) average response vs. number of tiles (1, 4, 16, 64, 256).]
Figure 2-5: Average responses of neurons in the many-feature version of the model to scrambled stimuli. (a) Example of a scrambled stimulus. The images (128 × 128 pixels) were created by subdividing the preferred stimulus of each neuron into 4, 16, 64, and 256, resp., “tiles” and randomly shuffling the tiles to create a scrambled image. (b) Average response of the 21 model neurons (with 40/256 afferents, as above) to the scrambled stimuli (solid blue curve), in comparison to the average normalized responses of IT neurons to scrambled stimuli (scrambled pictures of trees) reported in a very recent study [108] (dashed green curve).
object to be recognized successfully. Our model relies on decomposing images into features. Should it then be fooled into confusing a scrambled image with the unscrambled original? Superficially, one may be tempted to guess that scrambling an image in pieces larger than the features should indeed fool the model. Simulations (see Fig. 2-5) show that this is not the case. The reason lies in the large dictionary of filters/features used that makes it practically impossible to scramble the image in such a way that all features are preserved, even for a low number of features. Responses of model units drop precipitously as the image is scrambled into progressively finer pieces, as confirmed very recently in a physiology experiment [108] of which we became aware after obtaining this prediction from the model.
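The contrast between SUM and MAX pooling developed at the beginning of this section can be made concrete with a small numerical sketch (a toy illustration with made-up afferent activations, not code from the actual model simulations): a pooling unit receives the responses of afferents tuned to the same feature at different positions, and we compare how the two pooling rules behave when clutter drives additional afferents.

```python
import numpy as np

def pool_sum(afferents):
    # Equal-weight linear pooling (normalized sum): the response mixes all afferents.
    return np.mean(afferents)

def pool_max(afferents):
    # MAX pooling: the best-matching afferent alone determines the response.
    return np.max(afferents)

# Hypothetical activations of position-tuned afferents (0 = no match, 1 = perfect match).
preferred_only = np.array([0.0, 0.9, 0.1, 0.0])   # preferred feature present at one position
with_clutter   = np.array([0.3, 0.9, 0.1, 0.4])   # same, plus clutter driving other afferents

for name, pool in [("SUM", pool_sum), ("MAX", pool_max)]:
    print(name,
          "isolated:", round(pool(preferred_only), 2),
          "clutter:",  round(pool(with_clutter), 2))
# MAX returns 0.9 in both cases (the feature's presence is still signaled),
# whereas the SUM response changes with the clutter, so feature specificity is lost.
```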
2.3 Discussion

We briefly outline the computational roots of the hierarchical model we described, how the MAX operation could be implemented by cortical circuits, and remark on the role of features and invariances in the model.

A key operation in several recent computer vision algorithms for the recognition and classification of objects [87, 92] is to scan a window across an image, through both position and scale, in order to analyze at each step a subimage – for instance by providing it to a classifier that decides whether the subimage represents the object of interest. Such algorithms have been successful in achieving invariance to image plane transformations such as translation and scale. In addition, this brute force scanning strategy eliminates the need to segment the object of interest before recognition: segmentation, even in complex and cluttered images, is routinely achieved as a byproduct of
recognition. The computational assumption that originally motivated the model described in this paper was indeed that a MAX-like operation may represent the cortical equivalent of the “window of analysis” used in machine vision to scan through and select input data. Unlike a centrally controlled sequential scanning operation, a mechanism like the MAX operation that locally and automatically selects a relevant subset of inputs seems biologically plausible. A basic and pervasive operation in many computational algorithms — not only in computer vision — is the search for and selection of a subset of data. Thus it is natural to speculate that a MAX-like operation may be replicated throughout the cortex.

Simulations of a simplified two-layer version of the model [79] using soft-maximum approximations to the MAX operation (see Methods), where the strength of the nonlinearity can be adjusted by a parameter, show that its basic properties are preserved and structurally robust. But how is an approximation of the MAX operation realized by neurons? It seems that it could be implemented by several different, biologically plausible circuitries [1, 13, 17, 32, 44]. The most likely hypothesis is that the MAX operation arises from cortical microcircuits of lateral, possibly recurrent, inhibition between neurons in a cortical layer. An example is provided by the circuit proposed for gain control and relative motion detection in the visual system of the fly [76], based on feedforward (or recurrent) shunting presynaptic (or postsynaptic) inhibition by “pool” cells. One of its key elements, in addition to shunting inhibition (an equivalent operation may be provided by linear inhibition deactivating NMDA receptors), is a nonlinear transformation of the individual signals due to synaptic nonlinearities or to active membrane properties. The circuit performs a gain control operation and — for certain values of the parameters — a MAX-like operation. “Softmax” circuits have been proposed in several recent studies [34, 45, 61] to account for similar cortical functions. Together with adaptation mechanisms (underlying very short-term depression [1]), the circuit may be capable of pseudo-sequential search in addition to selection. Our novel claim here is that a MAX-like operation is a key mechanism for object recognition in the cortex.

The model described in this paper — including the stage from view-tuned to view-invariant units [68] — is a purely feedforward hierarchical model. Backprojections – well known to exist abundantly in cortex and playing a key role in other models of cortical function [59, 75] – are not needed for its basic performance but are probably essential for the learning stage and for known top-down effects — including attentional biases [77] — on visual recognition, which can be naturally grafted onto the inhibitory softmax circuits (see [61]) described earlier.

In our model, recognition of a specific object is invariant over a range of scales (and positions) after training with a single view at one scale, because its representation is based on features invariant to these transformations. View invariance on the other hand requires training with several views [68], because individual features sharing the same 2D appearance can transform very differently under 3D rotation, depending on the 3D structure of the specific object. Simulations show that the
model’s performance is not specific to the class of paperclip object: recognition results are similar for e.g., computer-rendered images of cars (and other objects). From a computational point of view the class of models we have described can be regarded as a hierarchy of conjunctions and disjunctions. The key aspect of our model is to identify the disjunction stage with the build-up of invariances and to do it through a MAX-like operation. At each conjunction stage the complexity of the features increases and at each disjunction stage so does their invariance. At the last level – of the C2 layer in the paper – it is only the presence and strength of individual features and not their relative geometry in the image that matters. The dictionary of features at that stage is overcomplete, so that the activities of the units measuring each feature strength, independently of their precise location, can still yield a unique signature for each visual pattern (cf. the SEEMORE system [52]). The architecture we have described shows that this approach is consistent with available experimental data and maps it into a class of models that is a natural extension of the hierarchical models first proposed by Hubel and Wiesel.
2.4 Methods

Basic model parameters. Patterns on the model “retina” (of 160 × 160 pixels — which corresponds to a 5° receptive field size (the literature [41] reports an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first filtered through a layer (S1) of simple cell-like receptive fields (first derivative of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard deviations of 1.75 to 7.25 pixels in steps of 0.5 pixels). S1 filter responses were rectified dot products with the image patch falling into their receptive field, i.e., the output $s_1^j$ of an S1 cell with preferred stimulus $w_j$ whose receptive field covers an image patch $I_j$ is

$$s_1^j = |w_j \cdot I_j|.$$

Receptive field (RF) centers densely sample the input retina. Cells in the next (C1) layer each pool S1 cells (using the MAX response function, i.e., the output $c_1^i$ of a C1 cell with afferents $s_1^j$ is $c_1^i = \max_j s_1^j$) of the same orientation over eight pixels of the visual field in each dimension and all scales. This pooling range was chosen for simplicity — invariance properties of cells were robust for different choices of pooling ranges (cf. below). Different C1 cells were then combined in higher layers, either by combining C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different orientations, or to yield C2 cells responding to the same feature as the C1 cells but with bigger receptive fields. In the simple version illustrated here, the S2 layer contains six features (all pairs of orientations of C1 cells looking at the same part of space) with a Gaussian transfer function ($\sigma = 1$, centered at 1), i.e., the response $s_2^k$ of an S2 cell receiving input from C1 cells $c_1^m$, $c_1^n$ with receptive fields in the same location but responding to different orientations is

$$s_2^k = \exp\!\left(-\frac{(c_1^m - 1)^2 + (c_1^n - 1)^2}{2\sigma^2}\right),$$

yielding a total of 10 cells in the C2 layer.
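For concreteness, the three response functions just defined can be written out in a few lines. The following is a minimal sketch in Python (not the original simulation code; array shapes and names are illustrative only):

```python
import numpy as np

def s1_response(w, patch):
    # S1: rectified dot product of the oriented filter w with the image patch
    # in its receptive field, s1 = |w . I|
    return abs(float(np.dot(w.ravel(), patch.ravel())))

def c1_response(s1_afferents):
    # C1: MAX over S1 afferents of the same orientation at different
    # positions and scales
    return float(np.max(s1_afferents))

def s2_response(c1_m, c1_n, sigma=1.0):
    # S2: Gaussian tuning (centered at 1) to the co-activation of two C1 cells
    # of different orientation at the same location
    return float(np.exp(-((c1_m - 1.0) ** 2 + (c1_n - 1.0) ** 2) / (2.0 * sigma ** 2)))
```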
Here, C2 units feed into the view-tuned units, but in principle, more layers of S and C units are possible. In the version of the model we have simulated, object specific learning occurs only at the level of the synapses on the view-tuned cells at the top. More complete simulations will have to account for the effect of visual experience on the exact tuning properties of other cells in the hierarchy.
Testing the invariance of model units. View-tuned units in the model were generated by recording the activity of units in the C2 layer feeding into the VTUs to each one of the 21 paperclip views and then setting the connecting weights of each VTU, i.e., the center of the Gaussian associated with the unit, resp., to the corresponding activation. For rotation, viewpoints from 50° to 130° were tested (the training view was arbitrarily set to 90°) in steps of 4°. For scale, stimulus sizes from 16 to 160 pixels in half-octave steps (except for the last step, which was from 128 to 160 pixels), and for translation, independent translations of 112 pixels along each axis in steps of 16 pixels (i.e., exploring a plane of 112 × 112 pixels) were used.
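A view-tuned unit can thus be sketched as a Gaussian radial basis function whose center is the C2 activation pattern recorded for the training view. The sketch below assumes this reading; the width sigma is a free parameter and the value shown is only a placeholder:

```python
import numpy as np

def make_vtu(c2_training_view, sigma=0.2):
    # Store the C2 activation elicited by the training view as the center of
    # a Gaussian; the unit's response to any stimulus is then a function of
    # the distance between that stimulus' C2 pattern and the stored center.
    center = np.asarray(c2_training_view, dtype=float)
    def vtu(c2_pattern):
        d2 = np.sum((np.asarray(c2_pattern, dtype=float) - center) ** 2)
        return float(np.exp(-d2 / (2.0 * sigma ** 2)))
    return vtu

# Invariance testing then amounts to evaluating vtu() on the C2 patterns
# obtained for rotated, rescaled, or translated versions of the training view.
```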
“Many feature” version. To increase the robustness to clutter of model units, the number of features in S2 was increased: Instead of the previous maximum of two afferents of different orientation looking at the same patch of space as in the version described above, each S2 cell now received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, giving a total of 4⁴ = 256 different S2 types and finally 256 C2 cells as potential inputs to each view-tuned cell (in simulations, top level units were sparsely connected to a subset of C2 layer units to gain robustness to clutter, cf. Results). As S2 cells now combined C1 afferents with receptive fields at different locations, and features a certain distance apart at one scale change their separation as the scale changes, pooling at the C1 level was now done in several scale bands, each of roughly a half-octave width in scale space (filter standard deviation ranges were 1.75–2.25, 2.75–3.75, 4.25–5.25, and 5.75–7.25 pixels, resp.) and the spatial pooling range in each scale band chosen accordingly (over neighborhoods of 4 × 4, 6 × 6, 9 × 9, and 12 × 12 pixels, respectively — note that system performance was robust with respect to the pooling ranges; simulations with neighborhoods of twice the linear size in each scale band produced comparable results, with a slight drop in the recognition of overlapping stimuli, as expected), as a simple way to improve scale-invariance of composite feature detectors in the C2 layer. Also, centers of C1 cells were chosen so that RFs overlapped by half a RF size in each dimension. A more principled way
would be to learn the invariant feature detectors, e.g., using the trace rule [23]. The straightforward connection patterns used here, however, demonstrate that even a simple model shows tuning properties comparable to the experiment.
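The combinatorial rule for the “many feature” version is easy to make explicit: every assignment of one of the four orientations to each position of a 2 × 2 C1 neighborhood defines one S2 feature type. A hedged sketch (illustrative shapes and names only, not the original code):

```python
import itertools
import numpy as np

N_ORIENTATIONS = 4

# 4 orientations at each of 4 positions -> 4**4 = 256 S2 feature types
s2_types = list(itertools.product(range(N_ORIENTATIONS), repeat=4))
assert len(s2_types) == 256

def s2_response(c1_patch, feature_type, sigma=1.0):
    # c1_patch: C1 activities of shape (2, 2, N_ORIENTATIONS) for one 2x2
    # neighborhood; feature_type: the orientation selected at each position.
    vals = np.array([c1_patch[i // 2, i % 2, ori]
                     for i, ori in enumerate(feature_type)])
    # Gaussian tuning around full activation of the four selected afferents
    return float(np.exp(-np.sum((vals - 1.0) ** 2) / (2.0 * sigma ** 2)))
```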
Softmax approximation. In a simplified two-layer version of the model [79] we investigated the effects of approximations to the MAX operation on recognition performance. The model contained only one pooling stage, C1, where the strength of the pooling nonlinearity could be controlled by a parameter, $p$. There, the output $c_1^i$ of a C1 cell with afferents $x_j$ was

$$c_1^i = \frac{\sum_j x_j \exp(p\,|x_j|)}{\sum_k \exp(p\,|x_k|)},$$

which performs a linear summation (scaled by the number of afferents) for $p = 0$ and the MAX operation for $p \to \infty$.
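A few lines suffice to check the two limiting cases of this pooling function (a sketch, assuming the form of the equation reconstructed above):

```python
import numpy as np

def softmax_pool(x, p):
    # Weighted sum with weights exp(p*|x_j|): p = 0 gives the sum divided by
    # the number of afferents, and large p approaches the MAX of the afferents.
    x = np.asarray(x, dtype=float)
    w = np.exp(p * np.abs(x))
    return float(np.sum(x * w) / np.sum(w))

x = [0.2, 0.9, 0.4]
print(softmax_pool(x, 0.0))    # 0.5  (mean of the afferents)
print(softmax_pool(x, 50.0))   # ~0.9 (approaches max(x))
```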
2.5 Acknowledgments

Supported by grants from ONR, Darpa, NSF, ATR, and Honda. M.R. is supported by a Merck/MIT Fellowship in Bioinformatics. T.P. is supported by the Uncas and Helen Whitaker Chair at the Whitaker College, MIT. We are grateful to H. Bülthoff, F. Crick, B. Desimone, R. Hahnloser, C. Koch, N. Logothetis, E. Miller, J. Pauls, D. Perrett, J. Reynolds, T. Sejnowski, S. Seung, and R. Vogels for very useful comments and for reading earlier versions of this manuscript. We thank J. Pauls for analyzing the average invariance ranges of his IT neurons and K. Tanaka for the permission to reproduce Fig. 2-3 (a).
Chapter 3
Are Cortical Models Really Bound by the “Binding Problem”?

Abstract

The usual description of visual processing in cortex is an extension of the simple to complex hierarchy postulated by Hubel and Wiesel — a feedforward sequence of more and more complex and invariant features. The capability of this class of models to perform higher level visual processing such as viewpoint-invariant object recognition in cluttered scenes has been questioned in recent years by several researchers, who in turn proposed an alternative class of models based on the synchronization of large assemblies of cells, within and across cortical areas. The main implicit argument for this novel and controversial view was the assumption that hierarchical models cannot deal with the computational requirements of high level vision and suffer from the so-called “binding problem”. We review the present situation and discuss theoretical and experimental evidence showing that these perceived weaknesses of hierarchical models are unfounded. In particular, we show that recognition of multiple objects in cluttered scenes, arguably among the most difficult tasks in vision, can be done in a hierarchical feedforward model.
3.1 Introduction: Visual Object Recognition

Two problems make object recognition difficult:

1. The segmentation problem: Visual scenes normally contain multiple objects. To recognize individual objects, features must be isolated from the surrounding clutter and extracted from the image, and the feature set must be parsed so that the different features are assigned to the correct object. The latter problem is commonly referred to as the “Binding Problem” [110].

2. The invariance problem: Objects have to be recognized under varying viewpoints, lighting conditions, etc.

Interestingly, the human brain solves both problems quickly and with ease. Thorpe et al. [101] report that visual processing in an object detection task in complex visual scenes can be achieved in under 150 ms, which is on the order of the latency of the signal transmission from the retina to inferotemporal cortex (IT), the highest area in the ventral visual stream thought to have a key role in object recognition [103]; see also [72]. This impressive processing speed presents a strong constraint for any model of object recognition.
3.2 Models of Visual Object Recognition and the Binding Problem

Hubel and Wiesel [37] were the first to postulate a model of visual object representation and recognition. They recorded from simple and complex cells in the primary visual cortices of cats and monkeys and found that while both types preferentially responded to bars of a certain orientation, the former had small receptive fields with a phase-dependent response while the latter had bigger receptive fields and showed no phase-dependence. This observation led them to hypothesize that complex cells receive input from several simple cells. Continuing this model in a straightforward fashion, they suggested [36] that the visual system is composed of a hierarchy of visual areas, from simple cells all the way up to “higher order hypercomplex cells.” Later studies [7] of macaque inferotemporal cortex (IT) described neurons tuned to views of complex objects such as a face, i.e., the cells discharged strongly to a face seen from a specific viewpoint but very little or not at all to other objects. A key property of these cells was their scale and translation invariance, i.e., the robustness of their firing to stimulus transformations such as changes in size or position in the visual field. These findings inspired various models of visual object recognition such as Fukushima’s Neocognitron [25] or, later, Perrett and Oram’s [64] outline of a model of shape processing, and Wallis and
Rolls’ VisNet [111], all of which share the basic idea of the visual system as a feedforward processing hierarchy where invariance ranges and complexity of preferred features grow as one ascends through the levels. Models of this type prompted von der Malsburg [109] to formulate the binding problem. His claim was that visual representations based on spatially invariant feature detectors were ambiguous: “As generalizations are performed independently for each feature, information about neighborhood relations and relative position, size and orientation is lost. This lack of information can lead to the inability to distinguish between patterns that are composed of the same set of invariant features. . . ” [110]. Moreover, as a visual scene containing multiple objects is represented by a set of feature activations, a second problem lies in “singling out appropriate groups from the large background of possible combinations of active neurons” [110]. These problems would manifest themselves in various phenomena such as hallucinations (the feature sets activated by objects actually present in the visual scene combine to yield the activation pattern characteristic of another object) and the figure-ground problem (the inability to correctly assign image features to foreground object and background), leading von der Malsburg to postulate the necessity of a special mechanism, the synchronous oscillatory firing of ensembles of neurons, to bind features belonging to one object together. One approach to avoid these problems was presented by Olshausen et al. [62]: Instead of trying to process all objects simultaneously, processing is limited to one object in a certain part of space at a time, e.g., through “focussing attention” on a region of interest in the visual field, which is then routed through to higher visual areas, ignoring the remainder of the visual field. The control signal for the input selection in this model is thought to be provided in the form of the output of a “blob-search” system that identifies possible candidates in the visual scene for closer examination. While this top-down approach to circumvent the binding problem has intuitive appeal and is compatible with physiological studies that report top-down attentional modulation of receptive field properties (see the article by Reynolds & Desimone in this issue, or the recent study by Connor et al. [14]), such a sequential approach seems to be difficult to reconcile with the apparent speed with which object recognition can proceed even in very complex scenes containing many objects [72, 101], and is also incompatible with reports of parallel processing of visual scenes, as observed in pop-out experiments [102], suggesting that object recognition does not depend on explicit top-down selection in all situations. A more head-on approach to the binding problem was taken in other studies that have called into question the assumption that representations based on sets of spatially invariant feature detectors are inevitably ambiguous. Starting with Wickelgren [114] in the context of speech recognition, several studies have proposed how coding an object through a set of intermediate features, made up of local arrangements of simpler features (e.g., using letter pairs, or higher order combinations,
instead of individual letters to code words — for instance, the word “tomaso” could be confused with the word “somato” if both are coded by the sets of letters they are made up of; this ambiguity is resolved, however, if they are represented through letter pairs) can sufficiently constrain the representation to uniquely code complex objects without retaining global positional information (see Mozer [58] for an elaboration of this idea and an implementation in the context of word recognition). The capabilities of such a representation based on spatially-invariant receptive fields were recently analyzed in detail by Mel & Fiser [53] for the example domain of English text. In the visual domain, Mel [52] recently presented a model to perform invariant recognition of a high number (100) of objects of different types, using a representation based on a large number of feature channels. While the model performed surprisingly well for a variety of transformations, recognition performance depended strongly on color cues, and did not seem as robust to scale changes as experimental neurons [49]. Perrett & Oram [65] have recently outlined a conceptual model based on very similar ideas of how a representation based on feature combinations could in theory avoid the “Binding Problem”, e.g., by coding a face through a set of detectors for combinations of face parts such as eye-nose or eyebrow-hairline. What has been lacking so far, however, is a computational implementation quantitatively demonstrating that such a model can actually perform “real-world” subordinate visual object recognition to the extent observed in behavioral and physiological experiments [48, 49, 54, 89], where effects such as scale changes, occlusion and overlap pose additional problems not found in an idealized text environment. In particular, unlike in the text domain where the input consists of letter strings and the extraction of features (letter combinations) from the input is therefore trivial, the crucial task of invariant feature extraction from the image is nontrivial for scenes containing complex shapes, especially when multiple objects are present. We have developed a hierarchical feedforward model of object recognition in cortex, described in [82], as a plausibility proof that such a model can account for several properties of IT cells, in particular the invariance properties of IT cells found by Logothetis et al. [49]. In the following, we will show that such a simple model can perform invariant recognition of complex objects in cluttered scenes and is compatible with recent physiological studies. This is a plausibility proof that complex oscillation-based mechanisms are not necessarily required for these tasks and that the binding problem seems to be a problem for only some models of object recognition.
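The text-domain intuition can be captured in a couple of lines; the sketch below uses the “tomaso”/“somato” example from above (sets of letters are ambiguous, sets of letter pairs are not):

```python
def letter_set(word):
    return set(word)                  # spatially invariant "single letter" features

def letter_pairs(word):
    return set(zip(word, word[1:]))   # intermediate features: adjacent letter pairs

a, b = "tomaso", "somato"
print(letter_set(a) == letter_set(b))      # True: a bag of letters cannot tell them apart
print(letter_pairs(a) == letter_pairs(b))  # False: the pair code retains local arrangement
```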
3.3 A Hierarchical Model of Object Recognition in Cortex

Studies of receptive field properties along the ventral visual stream in the macaque, from primary visual cortex, V1, to anterior IT, report an overall trend of an increase of average feature complexity and receptive field size throughout the stream [41]. While simple cells in V1 have small localized
receptive fields and respond preferentially to simple shapes like bars, cells in anterior IT have been found to respond to views of complex objects while showing great tolerance to scale and position changes. Moreover, some IT cells seem to respond to objects in a view-invariant manner [6, 49, 66]. Our model follows this general framework.

Figure 3-1: (a) Cartoon of the Poggio and Edelman model [68] of view-based object recognition. The gray ovals correspond to view-tuned units that feed into a view-invariant unit (open circle). (b) Tuning curves of the view-tuned (gray) and the view-invariant units (black).

Previously, Poggio and Edelman [68] presented a model of how view-invariant cells could arise from view-tuned cells (Fig. 3-1). However, they did not describe any model of how the view-tuned units (VTUs) could come about. We have recently developed a hierarchical model that closes this gap and shows how VTUs tuned to complex features can arise from simple cell-like inputs. A detailed description of our model can be found in [82] (for preliminary accounts refer to [79, 80], and also to [43]). We briefly review here some of its main properties. The central idea of the model is that invariance to scaling and translation and robustness to clutter on one hand, and feature complexity on the other hand, require different transfer functions, i.e., mechanisms by which a neuron combines its inputs to arrive at an output value: While for the latter a weighted sum of different features, which makes the neuron respond preferentially to a specific activity pattern over its afferents, is a suitable transfer function, increasing invariance requires a different transfer function that pools over different afferents tuned to the same feature but transformed to different degrees (e.g., at different scales to achieve scale invariance). A suitable pooling function (for a computational justification, see [82]) is a so-called MAX function, where the output of the neuron is determined by the strongest afferent, thus performing a “scanning” operation over afferents tuned to different positions and scales. This is similar to the original Hubel and Wiesel model of a complex cell receiving input from simple cells at different locations to achieve phase-invariance. In our model of object recognition in cortex (Fig. 3-2), the two types of operations, selection and template matching, are combined in a hierarchical fashion to build up complex, invariant feature detectors from small, localized, simple cell-like receptive fields in the bottom layer. In particular,
patterns on the model “retina” (of 160 × 160 pixels — which corresponds to a 5° receptive field size ([41] report an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first filtered through a layer (S1, adopting Fukushima’s nomenclature [25] of referring to feature-building cells as “S” cells and pooling cells as “C” cells) of simple cell-like receptive fields (first derivative of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard deviations of 1.75 to 4.75 pixels in steps of 0.5 pixels). S1 filter responses are absolute values of the image “filtered” through the units’ receptive fields (more precisely, the rectified dot product of the cells’ receptive field with the corresponding image patch). Receptive field centers densely sample the input retina. Cells in the next layer (C1) each pool S1 cells of the same orientation over a range of scales and positions. Filters were grouped in four bands each spanning roughly 0.5 octaves; sampling over position was done over patches of linear dimensions of 4, 6, 9, and 12 pixels, respectively (starting with the smallest filter band); patches overlapped by half in each direction to obtain more invariant cells responding to the same features as the S1 cells. Different C1 cells were then combined in higher layers — the figure illustrates two possibilities: either by combining C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different orientations, or to yield C2 cells responding to the same feature as the C1 cells but with bigger receptive fields (i.e., the hierarchy does not have to be a strict alternation of S and C layers). In the version described in this paper, there were no direct C1 → C2 connections, and each S2 cell received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, yielding a total of 4⁴ = 256 different S2 cell types. S2 transfer functions were Gaussian (σ = 1, centered at 1). C2 cells then pooled inputs from all S2 cells of the same type, producing invariant feature detectors tuned to complex shapes. Top-level view-tuned units had Gaussian response functions and each VTU received inputs from a subset of C2 cells (see below).

This model had originally been developed to account for the transformation tolerance of view-tuned units in IT as recorded by Logothetis et al. [49]. It turns out, however, that the model also has interesting implications for the binding problem.
3.4 Binding without a problem

To correctly recognize multiple objects in clutter, two problems must be solved: i) features must be robustly extracted, and ii) based on these features, a decision has to be made about which objects are present in the visual scene. The MAX operation can perform robust feature extraction (cf. [82]): A MAX pooling cell that receives inputs from cells tuned to the same feature at, e.g., different locations, will select the most strongly activated afferent, i.e., its response will be determined by the afferent with the closest match to its preferred feature in its receptive field. Thus, the MAX mechanism effectively isolates the feature of interest from the surrounding clutter. Hence, to achieve
Figure 3-2: Diagram of our hierarchical model [82] of object recognition in cortex. It consists of layers of linear units that perform a template match over their afferents (blue arrows), and of non-linear units that perform a “MAX” operation over their inputs, where the output is determined by the strongest afferent (green arrows). While the former operation serves to increase feature complexity, the latter increases invariance by effectively scanning over afferents tuned to the same feature but at different positions (to increase translation invariance) or scale (to increase scale invariance, not shown). In the version described in this paper, learning only occurred at the connections from the C2 units to the top-level view-tuned units.
robustness to clutter, a VTU should only receive input from cells that are strongly activated by the VTU’s preferred stimulus (i.e., those features that are relevant to the definition of the object) and thus less affected by clutter (which will tend to activate the afferents less and will therefore be ignored by the MAX response function). Also, in such a scheme, two view-tuned neurons receiving input from a common afferent feature detector will tend to both have strong connections to this feature detector. Thus, there will only be little interference even if the common feature detector only responded to one (the stronger) of the two stimuli in its receptive field due to its MAX response function. Note that the situation would be hopeless for a response function that pools over all afferents through, for example, a linear sum function: The response would always change when another object is introduced in the visual field, making it impossible to disentangle the activations caused by the individual stimuli without an additional mechanism such as, for instance, an attentional sculpting of the receptive field or some kind of segmentation process. In the following two sections we will show simulations that support these theoretical considerations, and we will compare them to recent physiological experiments.
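The argument can be illustrated with a toy computation. The sketch below (illustrative numbers only, not the actual C2 representation) idealizes the effect of MAX pooling by assuming that each feature detector's response to a two-object display is the larger of its responses to either object alone; a VTU wired only to the afferents its preferred object drives strongly then changes little when a second object is added:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 256
resp_pref  = rng.random(n_features)   # stand-in responses to the preferred object alone
resp_other = rng.random(n_features)   # stand-in responses to a second object alone

# Idealized MAX behavior: the response to the two-object display is the
# stronger of the two single-object responses, feature by feature.
resp_both = np.maximum(resp_pref, resp_other)

# Connect the VTU only to the afferents driven most strongly by its preferred object.
afferents = np.argsort(resp_pref)[-40:]
center = resp_pref[afferents]

def vtu(c2, sigma=0.3):
    return float(np.exp(-np.sum((c2 - center) ** 2) / (2.0 * sigma ** 2)))

print(vtu(resp_pref[afferents]))  # 1.0: preferred object alone
print(vtu(resp_both[afferents]))  # close to 1.0: the selected afferents barely change
```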
3.4.1 Recognition of multiple objects

The ability of the model neurons to perform recognition of multiple, non-overlapping objects was investigated in the following experiment: 21 model neurons, each tuned to a view of a randomly selected paperclip object (as used in theoretical [68], psychophysical [8, 48], and physiological [49] studies on object recognition), were each presented with 21 displays consisting of that neuron’s preferred clip combined with each of the 21 preferred clips (in the upper left and lower right corner of the model retina, resp., see Fig. 3-3 (a)), yielding 21² = 441 two-clip displays. Recognition performance was evaluated by comparing the neuron’s response to these displays with its responses to 60 other, randomly chosen “distractor” paperclip objects (cf. Fig. 3-3). Following the studies on view-invariant object recognition [8, 48, 82], an object is said to be recognized if the neuron’s response to the two-clip displays (containing its preferred stimulus) is greater than to any of the distractor objects. For 40 afferents to each view-tuned cell (i.e., the 40 C2 units excited most strongly by the neuron’s preferred stimulus; this choice produced top-level neurons with tuning curves similar to the experimental neurons [82]), we find that on average in 90% of the cases recognition of the neuron’s preferred clip is still possible, indicating that there is little interference between the activations caused by the two stimuli in the visual field. The maximum recognition rate is 94% for 18 afferents, dropping to 55% if each neuron is connected to all 256 afferents. Fig. 3-3 (c) plots the recognition rate as a function of the number of afferents to each VTU: the rate climbs in the beginning as discriminability of different clips increases with the number of afferents, and then falls again, as the presence of the second object in the visual field increasingly interferes with the input to the VTU
caused by the first object (the probability that another object activates a feature detector connected to the VTU more strongly than the preferred object increases as the VTU receives input also from feature detectors that are excited only weakly by its preferred object).

Figure 3-3: Model neuron responses and average recognition rate for the case of two objects in the visual field. (a) Example stimuli. (b) Response curve of one neuron to its 21 two-clip stimuli (the first four being the stimuli shown in (a)), with dashed line showing response to best distractor; inset shows response to 60 distractors. (c) Dependence of average recognition rate (over 21 model neurons) on the number of afferents to each VTU.

These simulation results have an interesting experimental counterpart in the work of Sato [89], who studied the responses of neurons in macaque IT to displays consisting of one or two simultaneously appearing stimuli within the IT cell’s receptive field. He defines a “summation index”, SmI, as

$$\mathrm{SmI} = \frac{R_{A+B} - \max(R_A, R_B)}{\min(R_A, R_B)},$$

with $R_A$ the IT neuron’s response to one stimulus, A, $R_B$ the neuron’s response to another stimulus, B, and $R_{A+B}$ the neuron’s response to both stimuli presented simultaneously in its receptive field. Neurons performing a linear summation would have an SmI of 1 while MAX neurons would show an SmI of 0. For a fixation task, Sato reports a mean
SmI of −0.18 (σ = 0.5, N = 70, both stimuli in same hemifield).
From these data, the response
of real IT neurons appears to have strong MAX characteristics. In fact, a reduction of the response to the two-stimulus display compared to the response to the stronger stimulus alone, implied by the negative SmI and also found in an experiment by Rolls and Tovee [86], is compatible with the response reduction observed in the two-clip simulation shown in Fig. 3-3 (b). Interestingly, for a visual discrimination task, Sato reports [89] very similar average SmI values, suggesting that the same bottom-up driven MAX response mechanism might be operating in both cases.
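The summation index and its two reference values are easy to verify numerically (a sketch using the definition reconstructed above):

```python
def summation_index(r_a, r_b, r_ab):
    # SmI = (R_{A+B} - max(R_A, R_B)) / min(R_A, R_B)
    return (r_ab - max(r_a, r_b)) / min(r_a, r_b)

r_a, r_b = 0.7, 0.3
print(summation_index(r_a, r_b, r_a + r_b))      # 1.0: linear summation
print(summation_index(r_a, r_b, max(r_a, r_b)))  # 0.0: MAX response
```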
3.4.2 Recognition in clutter

So far, we have examined the model’s performance for two well-separated objects in the visual field. What about the case of two overlapping stimuli, e.g., when the object of interest is in front of a background object? This stimulus configuration was used in a physiology experiment by Missal et al. [54]. They trained a monkey on a paired-associate task involving 30 polygonal shapes, followed by recordings of shape selective cells in IT to the training stimuli and, among others, to displays consisting of the training stimuli superimposed on randomly generated backgrounds (other polygons in outline), which were selected so as not to drive the cells (Fig. 3-4a). In this condition, the monkey’s behavioral performance decreased slightly (from 98% to 89%), but the average neuronal response dropped precipitously to 25%. How could the monkey still do the task so well in the face of such a drastic change in neuronal response? Furthermore, do we see a similar behavior in the model?

Figure 3-4: (a) Example stimulus (green) and outline background (red) from the experiment by Missal et al. [54] (redrawn from [54]). (b) Example stimulus for the corresponding experiment with the model (see text). The foreground clip in (b) was correctly recognized by the corresponding model neuron (which was the same as in Fig. 3-3).

Simulation of the experimental paradigm with our model is straightforward: Foreground stimuli were the 21 clips used in the simulations described previously. Backgrounds were randomly generated polygons consisting of eight edges, chosen so that each corner was at a distance from the center of at least 45% of the stimulus size. Following [54] we only chose backgrounds that did not drive the model’s cells, here defined as generating an input to the VTU more than two standard deviations away from the preferred stimulus. Taking the 21 view-tuned cells described above (with 40 afferents out of 256 C2 cells each) and testing each neuron’s response to an input image consisting of that neuron’s preferred stimulus superimposed on a randomly generated polygonal background, responses on average (over 10 trials and 21 model neurons each) drop to
49% of the response to the stimulus alone. However, average responses to the best distractor (out of 60) are even lower (42%) (note that the response level of the neurons (but not the recognition rates) depends on the standard deviation of their Gaussian response function, which is a free parameter and was set to σ = 0.16 in all simulations, producing tuning curves qualitatively similar to those observed experimentally [82] — σ = 0.12, for instance, would give average responses of 33% to the stimulus-background combination and 23% to the best distractor). This leads to an average recognition rate of 65% in this condition (even in the absence of color cues that were present in the original Missal et al. experiment and which would improve recognition in the model, too, if features were color selective). The maximum average recognition rate was 74% for 100 afferents; the maximum average rate for one trial (over 21 neurons) was 90% at 105 afferents. Model parameters were not specially tuned in any way to achieve this performance, so higher recognition rates (for instance
also through pooling the responses of several neurons tuned to the same object but receiving inputs from different afferents) are very likely achievable. This demonstrates that the ability of the MAX response function to ignore non-relevant information, in this case the background, together with an object definition based on its salient components is sufficient to perform recognition in clutter.

Figure 3-5: (a) Average recognition rates (over 21 cells, 10 runs each with different, randomly selected background) for non-overlap (blue) and overlap conditions (cf. text). (b) Average response levels in the two conditions, compared to the average response to the best distractor, as a function of the number of afferents to each view-tuned model neuron.
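The recognition criterion used in these simulations can be stated compactly; the sketch below (hypothetical helper names, not the original code) counts a trial as correct when the VTU's response to its preferred stimulus on a background exceeds its response to every distractor:

```python
import numpy as np

def recognized(vtu, c2_pref_on_background, c2_distractors):
    # Correct if the response to the cluttered display containing the preferred
    # stimulus is larger than the response to each of the distractor objects.
    r_target = vtu(c2_pref_on_background)
    return all(r_target > vtu(d) for d in c2_distractors)

def recognition_rate(vtu, displays, distractor_sets):
    # Average over trials (e.g., 10 random backgrounds per model neuron).
    hits = [recognized(vtu, disp, dists)
            for disp, dists in zip(displays, distractor_sets)]
    return float(np.mean(hits))
```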
3.5 Discussion

Like most existing theories of the brain, our model is likely to be incomplete at best and quite possibly wrong altogether. It provides, however, a plausibility proof that biologically plausible models do in fact exist that do not suffer from the binding problem in performing difficult recognition tasks. This is of course just an explicit demonstration of a known but often neglected fact — that the binding problem is not a fundamental computational problem, like, for instance, the correspondence problem in vision. Instead, the binding problem arises only from the limitations of certain specific computational architectures. Our model shows that there are natural, old-fashioned cortical models that do agree with available data, do not suffer from the binding problem and do not need oscillations or synchronization mechanisms. In models like ours, recognition can take place without an explicit segmentation stage. The key is to ignore non-relevant information. At the level of the C cells, this is done through the MAX response function that allows a unit to scan over the image and pick best matches. At the level of the final view-tuned cells (for instance), this is achieved by restricting the afferents to the VTU to those that correspond to the relevant/salient features for the object. This in turn requires an earlier overcomplete set of “feature” selective cells which may roughly correspond to the dictionary of shapes described by Tanaka [93]. Subsets of this dictionary are inputs to each of several VTU units.
Many approaches to solving the binding problem do not use oscillation or synchronization mechanisms but instead rely on top-down attentional mechanisms. In fact, it has been argued that top-down control might help in “binding” features together by focussing attention on a region of interest (see Reynolds and Desimone, and Wolfe). However, we can perfectly well perform very complex object recognition tasks (e.g., determining whether an image contains a certain object) without focussing attention on a specific part of space, cf. [101]. Our model is bottom-up and does not require an explicit top-down signal but is consistent with its use in certain situations. To explain the latter point, we will briefly describe a possible approximate implementation of the MAX operation in terms of cortical microcircuits of lateral, possibly recurrent, inhibition between neurons in a cortical layer. A specific example is a circuit based on feedforward (or recurrent) shunting presynaptic (or postsynaptic) inhibition by “pool” cells [70]. The circuit performs a gain control operation and – for certain values of the parameters – a softmax operation (an approximation to the MAX operation in which the degree of nonlinearity is controlled by a parameter): each of the $N$ signals $x_i$ (the activations of the afferents) undergoes a softmax operation as

$$y_i = \frac{x_i^p}{C + \sum_j x_j^q}.$$

Thus for large $p$ and for $q = p - 1$, we have $y_i \approx x_i$ if $x_i = \max_j x_j$ and $y_i \approx 0$ otherwise. “Softmax”
circuits have been proposed by Nowlan & Sejnowski [61] and others [34, 45] to account for several cortical functions. Circuits of this type may perform an operation between a simple sum and a MAX on the inputs of a layer of cells under the control of a single variable and thus may form the basis in cortex, at one extreme, for normalization of signals, and, at the other, for a MAX-like operation [13]. Thus, in the context of this hypothetical circuitry for the MAX operation, an intriguing possibility is that the same softmax mechanism might be used in both situations, either predominantly driven by bottom-up information or using top-down signals that may control a parameter (equivalent to locally raising q or C) which switches off the “competition” between inputs in locations outside the “focus of attention”. Several experiments suggest that the visual system uses a MAX or softmax operation to select bottom-up among different inputs: for instance, there is evidence that a MAX-like operation is used in tasks involving object recognition in context [89]. As discussed by Nowlan and Sejnowski [61], the same active selection mechanism underlying preattentive perceptual phenomena may also be used by top-down overt attentional signals, for instance, when focussing attention on a specific part of visual space [45, 77]. The MAX mechanism performs an input-driven selection (and possibly scanning) operation over its inputs which might have interesting implications for the pop-out effect [102] (cf. the article by Wolfe): As the MAX operation is performed in parallel over many neurons, detection of stimuli does not require an attention-controlled “focussed” search as described above, if surrounding stimuli do not interfere with the VTU’s preferred object. Therefore, for objects that activate different features (such as a square amidst circles), recognition is possible without sequential search — the stimuli “pop-out”. However, in the case of interference, as in a display consisting of many similar
paperclips, detection might require “focussing attention” (as discussed by Reynolds & Desimone) to reduce the influence of competing stimuli. In this case, there would be no pop-out but rather sequential search would be required to perform successful recognition. The observed invariance ranges of IT cells after training with one view are reflected in the architecture used in our model: One of its underlying ideas is that invariance and feature specificity have to grow in a hierarchy so that view-tuned cells at higher levels show sizeable invariance ranges even after training with only one view, as a result of the invariance properties of the afferent units. The key concept is to start with simple localized features — since the discriminatory power of simple features is low, the invariance range has to be kept correspondingly low to avoid the cells being activated indiscriminately. As feature complexity and thus discriminatory power grows, the invariance range, i.e., the size of the receptive field, can be increased as well. Thus, loosely speaking, feature specificity and invariance range are inversely related, which is one of the reasons the model avoids a combinatorial explosion in the number of cells — while there are more different features in higher layers, there do not have to be as many neurons responding to these features as in lower layers since higher-layer neurons have bigger receptive fields and respond to a greater range of scales. Notice also that the cells in the model are not binary but have continuous response functions, greatly increasing the representational power of the system (which is why “hallucinations” do not occur). This hierarchical buildup of invariance and feature specificity greatly reduces the overall number of cells required to represent additional objects in the model: The first layer contains a little more than one million cells
(160 × 160 pixels, at four orientations and 12 scales each — for simplicity, dense sampling was used at all scales). Connections in higher levels are in principle subject to learning, driven by the input ensemble and the requirements of the recognition task at hand. As described, we did not investigate learning in the model but rather focussed on demonstrating that invariant recognition in clutter is possible using a simple hierarchical feedforward architecture. Hence, except for the C2 → VTU connections, all connections were preset by picking a simple
pooling scheme in C1 (described above, resulting in 46,000 C1 cells) and a combinatorial rule to create S2 features from C1 inputs (yielding close to three million S2 cells), which were then pooled over to yield the final 256 complex composite feature detectors in C2. The actual number of cells required to perform the tasks is likely to be lower — in fact, a model with bigger pooling ranges in C1 resulting in about one fourth the number of S2 cells has been shown to have very similar recognition rates. The crucial observation is that if additional objects are to be recognized irrespective of scale and position, the addition of only one unit, in the top layer, with connections to the (256) C2 units, is required (in the case of a distributed code, where individual neurons participate in the coding of several objects, requirements are likely to be even less). This does not appear to be specific to the class of paperclip objects: The exact same model described in this paper has already been
applied successfully (with the only difference being the appropriate setting of the weights from the C2 units to the VTUs) to the recognition of computer-rendered images of cars (Riesenhuber & Poggio, unpublished observations). Thus, the recognition of different classes of objects would only require the addition of more view-tuned units in the top layer of the network.
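The claim that an additional object costs only one additional top-level unit can be made concrete. In the sketch below (a simplification for illustration, not the authors' code), the shared S1–C2 hierarchy is left implicit, and "learning" a new object is nothing more than storing its 256-dimensional C2 activation as the center of one new Gaussian unit:

```python
import numpy as np

class ViewTunedLayer:
    """Top layer of the network: one Gaussian unit per learned object view."""
    def __init__(self, sigma=0.2):
        self.sigma = sigma
        self.centers = []            # one stored C2 pattern per learned view

    def add_object(self, c2_pattern):
        # Adding a new object: store its C2 activation; the rest of the
        # hierarchy (S1 through C2) is untouched.
        self.centers.append(np.asarray(c2_pattern, dtype=float))

    def responses(self, c2_pattern):
        x = np.asarray(c2_pattern, dtype=float)
        return [float(np.exp(-np.sum((x - c) ** 2) / (2 * self.sigma ** 2)))
                for c in self.centers]
```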
3.6 Acknowledgments

Supported by grants from ONR, Darpa, NSF, ATR, and Honda. M.R. is supported by a Merck/MIT Fellowship in Bioinformatics. T.P. is supported by the Uncas and Helen Whitaker Chair at the Whitaker College, MIT.
Chapter 4
The Individual is Nothing, the Class Everything: Psychophysics and Modeling of Recognition in Object Classes

Abstract

Most psychophysical studies of object recognition have focussed on the recognition and representation of individual objects subjects had previously explicitly been trained on. Correspondingly, modeling studies have often employed a “grandmother”-type representation where the objects to be recognized were represented by individual units. However, objects in the natural world are commonly members of a class containing a number of visually similar objects, such as faces, for which physiology studies have provided support for a representation based on a sparse population code, which permits generalization from the learned exemplars to novel objects of that class. In this paper, we present results from psychophysical and modeling studies intended to investigate object recognition in natural (“continuous”) object classes. In two experiments, subjects were trained to perform subordinate level discrimination in a continuous object class — images of computer-rendered cars — created using a 3D morphing system. By comparing
the recognition performance of trained and untrained subjects we could estimate the effects of viewpoint-specific training and infer properties of the object-class specific representation learned as a result of training. We then compared the experimental findings to simulations, building on our recently presented HMAX model of object recognition in cortex, to investigate the computational properties of a population-based object class representation as outlined above. We find experimental evidence, supported by the modeling results, that training builds a viewpoint- and class-specific representation that supplements a pre-existing representation with lower shape discriminability but possibly greater viewpoint invariance.
4.1 Introduction

Object recognition performance crucially depends on previous visual experience. For instance, a number of studies have shown that recognition memory for unfamiliar faces is better for faces of one’s own race than for faces of other races (the “Other-Race Effect”) [46, 51]. Recognition performance for objects belonging to less familiar object classes such as “Greeble” objects [28] is rather poor without special training. These differences in performance are presumably the result of the differing degree of visual experience with the respective object classes: Extensive experience with an object class builds a representation of that object class that generalizes to unseen class members and facilitates their recognition. Previous object recognition studies involving training on novel stimuli have mostly focussed on training subjects to recognize individual isolated objects [8, 28, 48, 95], usually by either familiarizing the subjects with a single object and then testing how well this object could be recognized among distractors [8, 48], or by training subjects on a small number of objects (e.g., by having the subjects learn names for them [28, 95, 99]) and then testing how well these objects could be recognized among distractors. Thus, the objective of training was not to learn an object class (like faces) from which arbitrary novel exemplars (unfamiliar faces) could be drawn during testing — rather, subjects had to re-recognize the exact same objects as used in training. An additional problem of several of the aforementioned studies was that the stimuli used belonged to rather artificial object classes such as “cube objects”, “paperclips”, or “amoebas” (Fig. 4-1b), differing from naturally occurring object classes — such as, e.g., faces, human bodies, cats, dogs, or cars (Fig. 4-1a) — in that the objects did not share a common 3D structure (making them “not nice” object classes in the terminology of Vetter et al. [107]). Even in studies where objects of a more natural appearance were used (such as the “Greeble” family of objects [28]), subjects were still trained to recognize individual representatives (e.g., by
naming them) whose recognition was later tested under various transformations [28, 60, 99]. Similarly, computational models of object recognition in cortex have almost exclusively focussed on the recognition of individual objects that had been learned explicitly [25, 62, 68, 81, 111]. These computational studies [25, 68, 82, 111] commonly feature an object representation where for each stimulus to be recognized, a unique “grandmother”-type unit is trained to respond to this individual object. While such a scheme (with one or more “grandmother” units per object [107]) may actually be used to represent highly overtrained objects [49] in situations where the subject has to recognize (a small number of) individual objects among a great number of similar distractor objects [8, 48, 49], the inefficiency and inflexibility of such a scheme makes it highly unlikely to be used in cortex to represent natural object classes. A different possibility to represent objects is a scheme where a group of units, broadly tuned to representatives of the object class, code for the identity of a particular object by their combined activation pattern. There exists some experimental evidence that is compatible with such a representation: Recordings from neurons in inferotemporal cortex (IT), a brain area believed to be essential for object recognition [50, 94], suggest that facial identity is represented by such a sparse, distributed code [115]. This is further supported by an optical imaging study in IT [112] that indicated an area of neurons selective for face stimuli. Few studies, experimental or theoretical, have investigated viewpoint-dependent recognition in a principled way for the more general (and natural) case of object classes, where training objects are used to build a distributed class representation that is then probed during testing using randomly chosen objects from the same class.† Edelman [18] in a recent study used simple classes (Gaussian blobs in parameter space) of geon-based “dog” and “monkey” stimuli. However, the focus of that study was object categorization.

† In a recent study, Tarr and Gauthier [99] trained subjects (in a naming task) to recognize a small number of individual objects seen from a single viewpoint. Subjects were then trained on additional viewpoints for a subset of the training objects. Subsequently, it was tested how recognition performance for rotated views transferred to the training objects that had only been seen at one viewpoint during training (the “cohort” objects). As the number of “cohort” and training objects was rather small (4–6 objects), however, it is unclear whether subjects actually learned a representation of the whole object class. Furthermore, two of the three experiments in [99] used “cube” objects, which, as mentioned above, are not a good model for natural object classes.

Figure 4-1: Natural objects, and artificial objects used in previous object recognition studies. (a) Members of natural object classes, such as pedestrians (not shown) and cars, usually share a common 3D structure, whereas stimuli popular in previous psychophysical studies of object recognition (from [9]), (b), do not.
For the purpose of this paper we informally define a continuous object class as a set of visually similar objects in terms of 3D structure, that span a multidimensional space, i.e., there is a continuous parametric representation of that class and objects can have arbitrarily similar shapes. Vetter and Blanz [106] (see also [39]), for instance, have shown that human faces can be well represented in such a way. This definition is related to the “nice” object classes of Vetter et al. [107]. Here, we stress the shape similarity of objects in the class, where two members can be arbitrarily similar to each other, which is of primary interest for recognition and discrimination. The aim of this paper is to investigate if and how the results on view-dependent object recognition obtained for the individual object case and a “grandmother” representation transfer to continuous object classes represented through a distributed population code. We will first present results from a psychophysical study intended to investigate this question, in which subjects were trained to perform subordinate level discrimination in a continuous object class — images of computerrendered cars — created using a 3D morphing system [90]. By comparing the recognition performance of trained and untrained subjects we can estimate the effects of viewpoint-specific training and infer properties of the object-class specific representation learned as a result of training. We will then compare the experimental findings to simulations, building on our recently presented HMAX model of object recognition in cortex [82, 83, 96], to investigate the computational properties of a population-based object class representation as outlined above. In the process, we will demonstrate that the recognition performance of HMAX previously demonstrated for the class of paperclip objects is not special to this class of objects but also transfers to other object classes. A second experiment was designed to test the model predictions and to investigate the viewpointdependency of object recognition in more detail. The results of this study will be compared to simulation results in the last section.
4.2 Experiment 1

Several psychophysical studies have reported above-chance recognition rates for up to 45° (and beyond [18]) viewpoint differences between sample and test object after presenting the sample object at a single viewpoint, for paperclip objects [8, 48] as well as geon-based dog and monkey stimuli [18]. However, these experiments controlled target/distractor similarity — which strongly influences recognition performance [18, 60, 99] — only very roughly (in two levels, [18]) or not at all ([8, 48]). Even more crucially, these studies did not compare the recognition performance of trained subjects
to naive subjects. Hence, it is unclear how much of the recognition performance was due to training and how much was due to a pre-existing representation not specific to the class of training objects. The aim of Experiment 1 was to train subjects on a recognition task involving stimuli chosen from a precisely defined continuous object class, presenting objects always at the same viewpoint, and then to probe this representation by testing recognition performance for varying viewpoints and match/nonmatch object similarities. The results of the trained group are compared to the performance of a naive group that did not receive any training on the object class prior to testing.
4.2.1 Methods

4.2.1.1 A Continuous Object Class: Morphed Cars

Stimuli for both experiment and modeling were generated using a novel automatic, 3D, multidimensional morphing system developed by Christian Shelton in our lab [90]. With this system we were able to create a large set of “intermediate” objects, made by blending characteristics of the different prototype objects (Viewpoint DataLabs, UT) spanning the class. This was done by specifying how much of each prototype the object to be created should contain, naturally defining a vector space over the prototype objects. Correspondences have been calculated for a system based on eight car prototypes (the “8 car system”) and subsequently for 15 car prototypes (the “15 car system”). Thus, an advantage of the morphing system is that it allows multidimensional morphing, i.e., the creation of objects that are made up of mixtures of several 3D prototypes. Moreover, as the prototypes are three-dimensional graphics objects, morphed objects can be freely transformed, e.g., through viewpoint or illumination changes. In the initial morphing studies, we used the 8 car system, whose prototypes are shown in Fig. 4-2. While the prototypes are available as color models we chose to render all objects as “clay” models by setting the colors to gray values and decreasing surface reflectance (C. Shelton, personal communication). Objects were rendered with a lighting source located above the camera and equally strong ambient lighting, and normalized in size. This procedure was designed to reduce the influence of possibly confounding color and size cues in the experiment.

Stimulus space. Stimuli in Experiment 1 were drawn from a subspace of the 8 car system, a two-dimensional space spanned by the three marked prototypes shown in Fig. 4-2 (the space spanned by three prototypes is two-dimensional since coefficients have to sum to one). The advantage of using a well-defined object class spanned by few prototypes is that the class can be exhaustively covered during training and its extent is well known, which is not the case,
e.g., for the class of human faces.

Figure 4-2: The eight prototype cars used in the 8 car system. The cars marked with an asterisk show the prototypes spanning the morph space used in Experiment 1. (As monochrome printers produce gray values by dithering, these and the other grayscale images print best on a color printer.)
4.2.1.2 Psychophysical Paradigm

Figures 4-3 and 4-5 illustrate the training and testing tasks, respectively. They follow a two alternative forced-choice (2AFC) design, with the two choices presented sequentially in time. The advantage of using such a task is that subjects only have to decide which of the two choices resembles the previously encountered sample stimulus more closely, thereby eliminating the influence of biases on the decision process that are associated with a yes/no task in which only one choice stimulus is presented. An additional important advantage of the 2AFC paradigm is that it transfers to simulations in a straightforward way.

Subjects sat in front of a computer monitor at a distance of about 70 cm, with the stimuli subtending about 3 degrees of visual angle (128 × 128 pixels). Each trial was initiated by the appearance of the outline of a blue square (about 5° of visual angle) on the screen, at which time subjects had to push a button on a computer mouse to initiate the trial. Immediately after the button push, a randomly selected (see below) car appeared on the screen for 500 ms, followed by a mask consisting of a randomly scrambled car image, presented for 50 ms. After a delay of 2000 ms, the first choice car appeared in the same location as the sample car, for 500 ms, followed by a 500 ms delay and the presentation of the second choice car. After the presentation of the second car, the outline of a green square appeared, cueing subjects to make a response (by pressing a mouse button), indicating whether the first (left button) or the second (right button) choice car was equal to the sample car. In the training task (in Experiment 1 as well as Experiment 2, see below), subjects received auditory feedback on incorrect responses.

In the training task, sample and test objects were all presented at the same 225° viewpoint on all trials, a 3/4 view (termed the training view, TV). New, randomly chosen target (sample) and distractor objects were chosen on each trial by picking coefficient vectors from a uniform distribution followed by subsequent coefficient sum normalization. The purpose of the training task was to
induce subjects to build a detailed viewpoint-specific representation of the object class. The (Euclidean) distance d in morph space between target and distractor (nonmatch) objects was decreased over the course of training: Initially, distractor objects were chosen to be very dissimilar to the target objects, d = 0.6, making the task comparatively easy (Fig. 4-4, top). Subjects performed trials at this level of task difficulty until performance reached 80% correct. Then d was decreased by 0.1 and the training repeated with new stimuli, down to d = 0.4 (Fig. 4-4, bottom). At the time they were tested, each subject in the trained group performed > 80% correct on the d = 0.4 set (on a block of 50 match and 50 nonmatch trials, randomly interleaved).

Figure 4-3: Training task for Experiments 1 and 2. Shown are an example each of the two different kinds of trials: match and nonmatch trials. In both, a sample car is followed by a mask and a delay. Then, in a match trial (upper branch), the match car appears as the first choice car, and a distractor car as the second choice car. For a nonmatch trial (lower branch), the order is reversed. The example objects shown here are from the 15 car system used in Experiment 2 (see section 4.4). Subjects had to make a response after the offset of the second choice car and received auditory feedback on the correctness of their response.

After subjects in the training group reached the performance criterion on the training task, they were tested in a task similar to the training task but in which the viewpoint of match and nonmatch choice stimuli differed by 45°, corresponding to a rotation of the car towards the viewer (as shown in Fig. 4-5 for a 22.5° rotation, as used in Experiment 2). This direction of rotation was chosen arbitrarily. For each viewpoint and distance combination, subjects were tested on 30 match and 30 nonmatch trials, for a total of 240 trials (with 120 unique match/nonmatch pairs), which were presented in random order. The high number of trials was chosen to mitigate possible effects of morph space anisotropy with respect to subjects’ perceptual similarity judgments. Subjects received no feedback on their performance in the testing task.
Figure 4-4: Illustration of match/nonmatch object pairs for Experiment 1. The top shows a pair a distance d = 0.6 apart in morph space, while the lower pair is separated by d = 0.4.
Figure 4-5: Testing task for Experiments 1 and 2. The task is identical to the training task from Fig. 4-3 except for the absence of feedback and the fact that the viewpoint at which the choice cars were presented could vary between trials (the example shows a viewpoint difference of φ = 22.5°, as used in Experiment 2).
4.2.2 Results

Subjects were 14 members of the MIT community who were paid for participating in the experiment, plus the first author. Seven subjects and the first author were assigned to a "trained" group that received training sessions until the performance criterion described above was reached, upon which they performed the testing task as described above. One subject, whose initial performance on the easiest (d = 0.6) training set was at chance, was excluded from the training group. For the remaining subjects, criterion was reached after one or two training sessions of one hour each (average 1.75 sessions). Another seven subjects, the "untrained" group, did not receive any training on the stimuli but were run only on the testing task.

As mentioned above, this comparison to an untrained group is essential: The visual system is very well able to perceive novel objects even without training, i.e., there is a baseline discrimination performance for any novel object class. This is also expected for the car objects used in our study, as their shapes make them similar to real cars subjects have some visual experience with (in agreement with our objective to use a natural object class to investigate the learning of natural object classes). However, as the degree of similarity between match and nonmatch objects is continuously increased during training, subjects have to learn to perceive fine shape differences among the morphed cars used in the experiment. It is this learning component we are interested in, as it allows us to investigate how class-specific training on one viewpoint transfers to other viewpoints.

Figure 4-6 shows the averaged performance of the subjects in the trained group on the test task. A repeated measures ANOVA (using SPSS 8.0 for Windows) with the factors of viewpoint and distance in morph space between match and nonmatch objects revealed highly significant main effects of both viewpoint difference and distance in morph space (F(1,6) = 155.224 and F(1,6) = 21.305, resp., p < 0.005) on recognition rate, with a non-significant interaction (F(1,6) = 0.572, p > 0.4) between the two factors. Interestingly, performance even for the 45° viewpoint difference is significantly above chance (p < 0.001 for both distances, t-test).

The performance of the untrained subjects is shown in Fig. 4-7. The ANOVA here again revealed significant main effects of viewpoint and distance (for the main effect of distance, F(1,6) = 7.814, p < 0.05; for viewpoint, F(1,6) = 14.994, p < 0.01; no significant interaction, p > 0.4). Comparing the average recognition rates for the trained (Fig. 4-6) and untrained (Fig. 4-7) groups, it is apparent that recognition rates for the trained view are higher in the trained group than in the untrained group, whereas performance of the two groups seems to be equal for the φ = 45° view. Examining the different performances in the two groups in more detail, t-tests (one-tailed, assuming that training improves performance) on the two populations for the different conditions revealed that the performance of the trained group was significantly better than that of the untrained group for 0° viewpoint difference for both d = 0.6 and d = 0.4 (p < 0.05), while the difference was not significant for the 45° viewpoint difference (p ≥ 0.3).
Figure 4-6: Average performance of the trained subjects (N = 7) on the test task of Experiment 1. The z-axis shows performance, the x-axis the viewpoint difference φ between sample and test objects (φ ∈ {0°, 45°}), and the y-axis the morph space distance d between match and nonmatch objects (d ∈ {0.4, 0.6}). The height of the bars at the data points shows the standard error of the mean. The numbers above the data points show the corresponding numerical scores.

Note that the observed performance differences are unlikely to be due to the untrained subjects' lower familiarity with the 2AFC paradigm, as performance differed only on a subset of conditions, namely those where sample and test objects were presented at the same viewpoint. As for the trained group, recognition in the untrained group for a viewpoint difference of 45° was significantly above chance (p < 0.002 for both distance levels). Thus, the data show the following:

1. For the trained as well as the untrained subject groups, recognition performance decreases with increasing target/distractor similarity and with viewpoint difference.

2. Both trained and untrained subjects perform above chance at the 45° view.

3. Training subjects with randomly chosen cars at the 0° view improves recognition of class members at the 0° view but does not affect recognition performance if the viewpoint difference between sample and test objects is 45°.
4.2.3 Discussion

The results of Experiment 1 indicate that while there is a recognition benefit from training on the 0° view, it does not transfer to the 45° view. However, even for a viewpoint difference of 45°, recognition is still significantly above chance. These results are especially interesting with respect to a recent study by Edelman [18] that, for a categorization task involving geon-based stimuli, reported two different performance regimes depending on the degree of viewpoint difference between sample and test object.
Figure 4-7: Average performance of untrained subjects (N = 7) on the test task of Experiment 1. Axis labeling as in Fig. 4-6.
He surmised that this might be the manifestation of two recognition mechanisms, one at work for small viewpoint differences and another one for larger ones. The results of Experiment 1 suggest the following interpretation: While training at a single viewpoint does not transfer to a viewpoint difference of 45° between sample and test object, recognition in this case might rely on features that are robust to rotation (like the roof shape of the car) and that do not depend on object class-specific learning. Similar non-specific features can be used in the untrained group to perform recognition also for the unrotated viewpoint, but they are not sufficient for fine-grained discrimination, as evidenced by the lower recognition performance for the training view. Training lets subjects build a detailed class- and viewpoint-specific representation that supplements the existing system: Subtle shape discriminations require sufficiently finely detailed features that are more susceptible to 3D rotation, whereas coarser comparisons can likely also be performed with cruder features or with more view-invariant representations optimized for different objects (see general discussion).

Over which range of viewpoints would we expect training at a single viewpoint to have an effect? To answer this question, we performed simulations in HMAX, presented in the next section.
4.3 Modeling: Representing Continuous Object Classes in HMAX

Our investigation of object recognition in continuous object classes is based on our recently presented HMAX model [82], which has been extensively tested on the representation of individual "paperclip" objects. After a brief review of HMAX, we shall demonstrate how the same model can easily be applied to the representation of natural object classes (the use of such a representation to perform object categorization is described in [83]).
4.3.1 The HMAX Model of Object Recognition in Cortex

Figure 4-8 shows a sketch of our model of object recognition in cortex [81, 82], which provides a theory of how view-tuned units (VTUs) can arise in a processing hierarchy from simple-cell-like inputs. As discussed in [81, 82], the model accounts well for the complex visual task of invariant object recognition in clutter and is consistent with several recent physiological experiments in inferotemporal cortex. In the model, feature specificity and invariance are gradually built up through different mechanisms. Key to achieving invariance and robustness to clutter is a MAX-like response function of some model neurons, which selects the maximum activity over all the afferents, while feature specificity is increased by a template match operation. By virtue of combining these two operations, an image is represented through a set of features which themselves carry no absolute position information but code the object through a combination of local feature arrangements. At the top level, view-tuned units (VTUs) respond to views of complex objects with invariance to scale and position changes. (Footnote: To perform view-invariant recognition, VTUs tuned to different views of the same object can be combined, as demonstrated, e.g., in [68].) In all the simulations presented in this paper we used the "many feature" version of the model as described in [81, 82].
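Since these two operations are central to everything that follows, it may help to see them written out. The following minimal Python sketch (our own illustration, not the original implementation; numpy is assumed) contrasts the Gaussian template-match response of an "S" unit with the MAX-pooling response of a "C" unit; in the full model these operations are applied over afferents tuned to different features, positions, and scales.

    import numpy as np

    def s_unit_response(afferents, template, sigma=1.0):
        """'S' unit: Gaussian template match of the afferent activity pattern to a
        stored template, increasing feature specificity."""
        x = np.asarray(afferents, dtype=float)
        return float(np.exp(-np.sum((x - template) ** 2) / (2.0 * sigma ** 2)))

    def c_unit_response(afferents):
        """'C' unit: MAX over afferents tuned to the same feature at different
        positions and scales, providing invariance."""
        return float(np.max(afferents))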
4.3.2 View-Dependent Object Recognition in Continuous Object Classes

As mentioned in the introduction, various studies have provided support that "natural" object classes, in particular faces [112, 115], are represented by a population of units broadly tuned to representatives of this object class. Other physiological studies have provided evidence that neuronal tuning in IT can be changed as a result of training [6, 42, 49, 55]. A population-based representational scheme is easily implemented in HMAX through a group of VTUs (the stimulus space-coding units, SSCUs, which can also provide a basis for object categorization [83]) tuned to representatives of the object class. (Footnote: Note that the receptive fields of the SSCUs do not have to respect class boundaries, as long as they adequately cover the input space [83].) Discrimination between different objects proceeds by comparing the corresponding activation patterns over these units.

To investigate the properties of such a representation, we created a set of car objects using the eight car system. In particular, we created lines in morph space connecting each of the eight prototypes to all the other prototypes, for a total of 28 lines through morph space, with each line divided into 10 intervals. This created a set of 260 unique cars. Each car was rendered from 13 viewpoints around the 225° training view (TV), spanning the range from 180° to 270° in steps of 7.5°, which yielded a total of 3380 images.
[Figure 4-8 schematic: a hierarchy from simple cells (S1) through complex cells (C1), "composite feature" cells (S2), and "complex composite" cells (C2) to view-tuned cells; solid lines denote weighted sum, dashed lines the MAX operation.]

Figure 4-8: Our model of object recognition in cortex (from [81]). The model is an hierarchical extension of the classical paradigm [36] of building complex cells from simple cells. It consists of a hierarchy of layers with linear ("S" units in the notation of Fukushima [25], performing template matching, solid lines) and nonlinear operations ("C" pooling units [25], performing a "MAX" operation, dashed lines). The nonlinear MAX operation — which selects the maximum of the cell's inputs and uses it to drive the cell — is key to the model's properties and is quite different from the basically linear summation of inputs usually assumed for complex cells. These two types of operations respectively provide pattern specificity and invariance (to translation, by pooling over afferents tuned to different positions, and scale (not shown), by pooling over afferents tuned to different scales).
Figure 4-9: Recognition performance of the model on the eight car morph space. The x-axis shows the viewpoint φ of the nonmatch object, the y-axis the match/nonmatch distance d (in steps along the morph line the sample object lies on) in morph space, and the z-axis the model's discrimination performance for all (φ, d) stimulus pairs in the sample set. Model parameters were n_SSCU = 16, σ = 0.2, c = 256. The two subplots show the same graph from two different viewpoints, to show positive rotations (i.e., toward the front, so that the front of the car is turning towards the viewer, as used in the psychophysical experiments), left plot, and negative rotations (i.e., towards the back, so that the side of the car faces the viewer), right plot.
We then defined a set of SSCUs tuned to representatives of the car class. The representatives were chosen by performing k-means clustering on the set of 260 cars shown at the training view (results shown are for individual k-means runs — repeated runs tended to produce quantitatively similar results). To examine the viewpoint dependence of object recognition in the car class, we then performed trials in which each of the TV cars was presented to the model (the "sample car"), causing an activation pattern a_sample over the SSCUs. Then a "match" and a "nonmatch" (distractor) object were chosen. The former was the same car shown from a different viewpoint φ' = 225° + φ, −45° ≤ φ ≤ 45°, as described above, while the latter was a different car a distance d away from the sample car along the same morph line, shown at the same viewpoint φ' as the match car. The two choice cars caused activation patterns a_match and a_nonmatch, respectively. Recognition of the rotated sample car was said to be successful if

    ||a_sample − a_match|| < ||a_sample − a_nonmatch||        (4.1)

using an (unweighted) Euclidean metric, i.e., if the SSCU activation pattern caused by the sample object was more similar to the match object's activation pattern than to the activation pattern caused by the nonmatch object. This paradigm is equivalent to a two-alternative forced-choice task and has the advantage that modeling of the decision process is straightforward. Recognition performance for each (φ, d) combination was tested for all possible sample/distractor car pairs.
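The simulated trial logic of Eq. 4.1 is straightforward to implement. The sketch below is a simplification under our own naming (scipy's k-means stands in for the clustering step, C2 activation patterns are assumed to be precomputed row vectors, and the Gaussian SSCU tuning follows the view-tuned-unit response function of the model); it selects SSCU centers from the training-view cars and scores a single match/nonmatch trial.

    import numpy as np
    from scipy.cluster.vq import kmeans

    def train_sscus(c2_training_patterns, n_sscu=16):
        """Pick SSCU centers by k-means over the C2 patterns of the training-view cars."""
        centers, _ = kmeans(np.asarray(c2_training_patterns, dtype=float), n_sscu)
        return centers

    def sscu_activations(c2_pattern, centers, sigma=0.2):
        """Gaussian tuning of each SSCU around its preferred C2 activation pattern."""
        d2 = np.sum((centers - c2_pattern) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def trial_correct(sample_c2, match_c2, nonmatch_c2, centers):
        """2AFC decision rule of Eq. 4.1: the choice whose SSCU activation pattern
        is closer (Euclidean) to the sample's pattern wins."""
        a_sample = sscu_activations(sample_c2, centers)
        a_match = sscu_activations(match_c2, centers)
        a_nonmatch = sscu_activations(nonmatch_c2, centers)
        return np.linalg.norm(a_sample - a_match) < np.linalg.norm(a_sample - a_nonmatch)

Recognition performance for a given (φ, d) condition is then simply the fraction of such trials scored correct over all sample/distractor pairs.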
Figure 4-9 shows recognition performance as a function of d and φ for a representation based on n_SSCU = 16 SSCUs (selected by k-means as described above), each with a tuning width of σ = 0.2 and c = 256 connections to the preceding C2 layer (i.e., fully connected to all 256 C2 units [81]). We see that the general trend observed in the experiment also holds in the simulations: Discrimination performance drops with increasing target/distractor similarity and increasing viewpoint difference between sample and choice objects. In particular, for the positive rotation direction investigated in Experiment 1 (and also Experiment 2, see below), performance reaches chance for rotations of 30°, while it is still robustly above chance for viewpoint differences of 22.5°. (Footnote: For a few parameter sets, performance at φ = 30° was still slightly above chance.)

In order to investigate how discrimination performance varies with the number of SSCUs in the representation, the tuning width of the individual SSCUs, and the number of afferents to each SSCU, we shall in the following plot the average (one-sided) invariance range as a function of these parameters, limiting ourselves to the positive rotations also used in the experiment. The average one-sided invariance range, r, for a given set of model parameters and a given match/nonmatch
distance d in morph space (in steps along the morph line the sample object lies on) is calculated by summing up the above-chance performance values, p'_i = 2(p_i − 0.5), for viewpoint difference φ_i, obtained from the raw performance scores p_i shown, e.g., in Fig. 4-9. Then,

    r = Σ_{i=1}^{n−1} (p'_i − p'_{i+1}) φ_i + p'_n φ_n        (4.2)

with n = 5 and φ_i = {0°, 7.5°, 15°, 22.5°, 30°}. This definition assumes a monotonic drop in performance with increasing φ, i.e., that if an object can be discriminated for a certain φ it can also be discriminated for all φ' < φ. This condition was met in the great majority of cases.
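Written out in code, Eq. 4.2 amounts to summing the above-chance performance over the tested viewpoint steps (algebraically, the sum telescopes into Σ_i p'_i (φ_i − φ_{i−1}) with φ_0 = 0). A minimal sketch, assuming the raw performance scores p_i are given in order of increasing viewpoint difference:

    import numpy as np

    def invariance_range(p, phi=(0.0, 7.5, 15.0, 22.5, 30.0)):
        """Average one-sided invariance range r of Eq. 4.2.

        p   : raw performance scores p_i at the viewpoint differences in phi
        phi : tested viewpoint differences in degrees (n = 5 in the text)
        """
        p_prime = 2.0 * (np.asarray(p, dtype=float) - 0.5)   # above-chance performance p'_i
        phi = np.asarray(phi, dtype=float)
        return float(np.sum((p_prime[:-1] - p_prime[1:]) * phi[:-1]) + p_prime[-1] * phi[-1])

    # Example: perfect discrimination out to 15 degrees, chance beyond that -> r = 15.
    print(invariance_range([1.0, 1.0, 1.0, 0.5, 0.5]))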
Dependence of rotation invariance on tuning width of SSCUs. The dependence of the average rotation invariance on the tuning width of the SSCUs, i.e., the σ of the Gaussian SSCUs, is shown in Fig. 4-10 (all other parameters as before). The invariance range seems to be rather independent of the tuning width of the SSCUs. This is due to the high precision and dynamic range of the model neurons, whose response function is Gaussian: Even if the stimulus rides on the tail of a sharply tuned Gaussian unit, the unit's response still depends monotonically on the match of the afferent activity to the unit's preferred stimulus. For a more realistic response function with a narrower dynamic range we would expect a stronger effect of tuning width, in particular a drop of performance as tuning becomes too narrow. Note that the average one-sided invariance range for cars is very comparable to that obtained for paperclip objects, which was on the order of 30°/2 = 15° [82].
Dependence on number of afferents to each SSCU. In the studies of recognition in clutter using HMAX [81], it was found that robustness to clutter can be increased in the model by having view-tuned units receive input only from a subset of units in the C2 layer, namely the n most strongly activated ones. The invariance range, on the other hand, was found to increase with the number of afferents.
Figure 4-10: Dependence of the average (one-sided) rotation invariance, r (z-axis), as a function of the tuning width, σ, of the SSCUs (x-axis). The y-axis in this and Figs. 4-11 to 4-13 shows the distance in steps between match and nonmatch objects along the morph line the sample (match) object lies on. Other parameters as in Fig. 4-9.
Figure 4-11: Dependence of the invariance range on the number of afferents to each SSCU (x-axis), left plot; the number of afferents was varied from having only the 10 most strongly activated C2 units feed into the respective SSCU up to a fully connected network with 256 afferents. Other parameters as in Fig. 4-9. The right plot shows the average rotation invariance range for a "grandmother"-like representation where an individual neuron is allocated for each sample stimulus, and recognition performance is based just on this unit.

Figure 4-11 shows the dependence of the average invariance range on the number of strongest afferents to each SSCU (left plot) for a representation based on n_SSCU = 16, compared to a "grandmother" representation (right plot) where a dedicated "grandmother" unit was allocated for each sample stimulus and match and nonmatch objects were discriminated based on which of the two stimuli caused a greater excitation of the "grandmother" unit. This is identical to the representation used in the recognition experiments with paperclip objects [82]. Interestingly, while the invariance range shows a strong dependence on the number of afferents in the "grandmother" case, with the invariance range asymptoting at about c = 100 (similar to what had been found for paperclips [82]), there seemed to be a much weaker dependence on the number of afferents in the population-based representation.

Dependence on number of SSCUs. Figure 4-12 shows the average rotation invariance as a function of the number of SSCUs. While a representation consisting of just one SSCU (the average) shows expectedly poor rotation invariance, the invariance is already sizeable for n_SSCU = 2 and grows only weakly with the number of SSCUs for n_SSCU > 2.
Figure 4-12: Dependence of invariance range on the number of SSCUs (x-axis). Other parameters as in Fig. 4-9.
Thus, it may seem that a representation based on more than two units offers only marginal benefits. However, this picture changes dramatically if noise is added to the representation, as studied next.

Robustness to Noise. The above simulations all assumed completely deterministic model neurons whose firing rates are quasi-continuous variables of very high precision. Real neurons, however, are likely to show a more stochastic response of limited precision to repeated presentations of the same stimulus. We can qualitatively simulate the implications of such behavior in the model by adding noise to the model neurons and examining how noise of a given amplitude affects performance for the different model configurations. Figure 4-13 shows how additive Gaussian noise of varying amplitude (which was added only to the activity pattern caused by the sample stimulus) affects invariance ranges for representations based on varying numbers of SSCUs. We see that while performance for zero noise is similar, the representation based on n_SSCU = 2 is much less robust to noise than a representation based on n_SSCU = 16. Thus, increasing the number of units in a representation increases its robustness to noise (at least for the case of independent additive noise, as used here).
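The noise test can be sketched directly on top of the trial logic given earlier: independent Gaussian noise is added to the sample's SSCU activation pattern only, and the 2AFC decision is re-evaluated. A minimal sketch (we take "amplitude" to be the standard deviation of the noise, an assumption on our part; function and variable names are again illustrative):

    import numpy as np

    def noisy_trial_correct(a_sample, a_match, a_nonmatch, noise_amplitude, rng=np.random):
        """2AFC decision with additive Gaussian noise on the sample's SSCU pattern only."""
        noisy = a_sample + rng.normal(0.0, noise_amplitude, size=a_sample.shape)
        return np.linalg.norm(noisy - a_match) < np.linalg.norm(noisy - a_nonmatch)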
The "Other Class" Effect. An analog of the "Other Race" effect [46, 51] mentioned in the introduction can be modeled in quite a straightforward fashion if we replace the SSCU representation tuned to cars with one tuned to a different object class. Here we chose six units tuned to prototypical cats and dogs (as used in separate physiological and modeling studies of object categorization [24, 83]), rendered as clay models and size-normalized, shown in Fig. 4-14.
Figure 4-13: Effect of the addition of noise to the SSCU representation for different numbers of SSCUs in the representation. The x-axis shows the amplitude of the Gaussian noise that was added to each SSCU in the representation. The plot on the left shows the performance of a representation based on n_SSCU = 2, the right one for n_SSCU = 16.
Figure 4-14: Cat/dog prototypes.

Figure 4-15 shows the "Other Class" effect obtained when using these six cat/dog SSCUs to perform the car discrimination task from the previous sections: While performance in the no-noise condition is only somewhat worse than with the specialized representation (left plot), even noise of very low amplitude reduces performance to chance, as the cat/dog-specific SSCUs respond only weakly to the car objects, thus making the activation pattern highly sensitive to noise. This underlines the influence of a specialized class representation on discrimination/recognition performance.

Feature learning. The C2 features used in HMAX are based on combinations of difference-of-Gaussians (DOG) filters of different orientations that might not be optimal for object discrimination in the car class used in the experiments. Can performance be improved with a more specific feature set?
Figure 4-15: The "Other Class" effect with six SSCUs tuned to the cat/dog prototypes (σ = 0.5; for σ = 0.2, performance was even lower and numerical underflows occurred; other parameters as in Fig. 4-9). The left plot shows the no-noise condition, the right plot a noise amplitude of 0.05. Compare to Fig. 4-13.
Figure 4-16: Car object class-specific features obtained by clustering the image patches of sample cars.

Figure 4-17: Performance of the two-layer model using the features shown in Fig. 4-16. Parameters were n_SSCU = 16, σ = 0.2, 200 afferents to each SSCU. Axes as in Fig. 4-9.
No learning algorithm for feature learning in the HMAX hierarchy has been presented so far. However, we can investigate the effect of a class-specific feature set in a two-layer version of HMAX [79] in which S1 filters are not limited to DOGs but can take on arbitrary shapes, and C1 units pool (using the MAX function) over all S1 cells at different positions and scales tuned to the same feature, with the C1 units feeding directly into VTUs, without S2 or C2 layers. Invariance properties of the two-layer version of HMAX using a set of 10 features consisting of bars and corners are comparable to those of the full model [79, 82].

We can obtain a feature set specific to the car object class by performing clustering on the set of image patches created by dividing each sample car into small 12 x 12 pixel patches. Figure 4-16 shows the features obtained by clustering the sample car image patch space (containing only those patches that had at least 10% of their pixels set) into 200 clusters using k-means. Figure 4-17 shows the performance of the two-layer model using these features. In comparison to standard HMAX (Fig. 4-9), performance of the two-layer model is somewhat worse for positive rotations, with performance dropping to chance already for rotations of 22.5°. On the other hand, performance for negative rotations (i.e., those that turn the side of the car towards the viewer) is somewhat better. Both effects could be due to the fact that many of the patches shown in Fig. 4-16 contain car parts that are likely to change under positive rotation, like the wheels, but which are more stable under negative rotation.
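The feature-learning step is plain k-means on raw image patches. A sketch under our own naming (cars is assumed to be a list of 2D grayscale arrays; "pixels set" is taken to mean nonzero pixels, and the images are tiled into non-overlapping 12 x 12 patches):

    import numpy as np
    from scipy.cluster.vq import kmeans

    def learn_car_features(cars, patch=12, n_features=200, min_set_fraction=0.1):
        """Cluster 12x12 image patches of the sample cars into class-specific features."""
        patches = []
        for img in cars:
            for i in range(0, img.shape[0] - patch + 1, patch):
                for j in range(0, img.shape[1] - patch + 1, patch):
                    p = img[i:i + patch, j:j + patch]
                    if np.mean(p > 0) >= min_set_fraction:   # keep patches with >= 10% pixels set
                        patches.append(p.ravel())
        centers, _ = kmeans(np.asarray(patches, dtype=float), n_features)
        return centers.reshape(-1, patch, patch)              # each row: one learned feature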
4.3.3 Discussion

The simulations point to several interesting properties of a population-based representation:

1. Invariance ranges for a population-based representation, where object identity is encoded by the distributed activity over several units (SSCUs) tuned to representatives of that class, are comparable to those of a representation based on "grandmother" cells, where recognition is based on a dedicated unit for each object to be recognized;

2. invariance ranges are already high for a representation based on a low number of SSCUs, but robustness to noise grows with the number of SSCUs in the representation;

3. even if each SSCU is only connected to a low number of afferents (the n most strongly activated) from the C2 layer, invariance based on the population representation is high.

The last point is especially intriguing, as it might point to a way to obtain robustness to clutter together with high invariance to rotation in depth, avoiding the trade-off found for a "grandmother" representation [81]. Further, the simulations suggest an additional advantage of a population-based representation over a representation based on the C2 features directly: Suppose a certain object (e.g., "my car") is to be remembered. If a car-specific representation has been learned, it suffices to store the low-dimensional activation pattern over the SSCUs, whereas in the absence of a specialized representation it will be necessary to store the activity pattern over a much higher number of C2 units to achieve comparable specificity. (Footnote: If there are also SSCUs tuned to objects from other classes, it would suffice to store the activation pattern over the most strongly activated SSCUs to achieve sufficient specificity, as simulations have indicated (not shown). Thus, it is not necessary for the SSCUs to carry class labels (cf. [83]).)

In the context of Experiment 1, the simulations suggest that the advantage of viewpoint- and class-specific training should only extend to roughly between 22.5° and 30° of viewpoint difference between training and testing viewpoint. This confirms the interpretation put forward in the discussion of Experiment 1 that performance there was due to the combination of a class- and viewpoint-specific representation and a pre-existing, less specific but more view-invariant representation. The class-specific representation is capable of fine shape discrimination but only over a limited range of viewpoints, while the more general one uses features that are less optimized for the novel object class but show greater tolerance to rotation. For small φ, the two representations can complement each other, while for larger viewpoint differences the unspecific features still allow recognition in some cases.
4.4 Experiment 2

Experiment 1 suggested that the view-invariance range derived from one object view extends less than 45° from the training view. The modeling work presented in the previous section predicted that an effect of training should only extend to between 22.5° and 30° of viewpoint difference. The purpose of Experiment 2 was to investigate the degree of view-invariance more finely and at the same time examine how the training effect observed in Experiment 1 carried over to a broader object class. The latter modification was chosen because the small size of the object class in Experiment 1 implied that discrimination hinged on a very limited number of features. In Experiment 2 we therefore increased the size of the class significantly (to 15 prototypes instead of 3) to make the discrimination task harder, in the hope of increasing the learning effect. Further, we added an intermediate viewpoint difference, 22.5°, in the test task, as the simulations presented in the previous section suggested that the effect of training should start to drop off beyond this viewpoint difference.
4.4.1 Methods

4.4.1.1 Stimulus Generation

Stimuli in Experiment 2 were drawn from a morph space spanned by the 15 car prototypes shown in Fig. 4-18. Objects were rendered in the same way as for Experiment 1. Coefficient vectors in morph space were now generated by first randomly picking which coefficients should be different from zero, with a probability of p = 0.25. Those coefficients were then set to random values between 0 and 1 picked from a uniform distribution, and the whole coefficient vector was subsequently sum-normalized to one. This algorithm was introduced to increase the diversity of the training stimuli, as randomly setting all coefficients with subsequent normalization tended to produce a rather homogeneous set of objects visually similar to the average.

In test trials, distractors D6 for the d = 0.6 trials were selected by picking a coefficient vector at the appropriate Euclidean distance from the sample stimulus. The vector was chosen by appropriately modifying as many coefficients as were different from zero in the sample vector, observing the constraint that the (linear) sum over all coefficients had to stay constant. The distractor for the d = 0.4 trials was selected to lie on the line connecting the sample and D6. Moreover, the same objects were chosen for the different φ trials. This made performance comparisons across the different conditions easier, as it decreased the effects of morph space anisotropy (which was due to the different visual similarities of the prototypes). For each d ∈ {0.4, 0.6}, φ ∈ {0°, 22.5°, 45°} combination, subjects were tested on 30 match and 30 nonmatch trials, for a total of 360 trials. All conditions were randomized.
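The stimulus-generation rules just described amount to drawing sparse random coefficient vectors and perturbing them by a controlled amount for the distractors. The sketch below is a simplification under stated assumptions: the thesis does not specify how the distance was distributed over the modified coefficients, so here a zero-sum perturbation is simply rescaled to the requested Euclidean length (which keeps the coefficient sum constant but, unlike the original procedure, does not restrict which coefficients are modified).

    import numpy as np

    def random_car(n_prototypes=15, p_nonzero=0.25, rng=np.random):
        """Sparse random coefficient vector, sum-normalized to one (Experiment 2 training stimuli)."""
        mask = rng.random_sample(n_prototypes) < p_nonzero
        if not mask.any():                          # ensure at least one nonzero coefficient
            mask[rng.randint(n_prototypes)] = True
        c = np.where(mask, rng.random_sample(n_prototypes), 0.0)
        return c / c.sum()

    def distractor_at_distance(sample, d, rng=np.random):
        """Perturb the sample by a zero-sum vector of Euclidean length d, so that the
        (linear) coefficient sum stays one."""
        delta = rng.standard_normal(len(sample))
        delta -= delta.mean()                       # zero-sum perturbation
        delta *= d / np.linalg.norm(delta)
        return sample + delta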
Figure 4-18: The 15 prototypes used in the 15 car system.

4.4.1.2 Subject Training

Training for subjects in the trained group in Experiment 2 proceeded in a similar fashion as in Experiment 1. Subjects started out with match/nonmatch pairs a distance d = 0.6 apart in morph space and then performed blocks of 100 trials (50 match, 50 nonmatch) until performance on the block exceeded 80%. When their performance reached criterion, d was decreased by 0.1, down to a final d_final = 0.4. To make sure that subjects' performance was not specific to the training set, performance was tested on another set of d = 0.4 stimuli after they reached criterion. In all cases, subjects' performance on the second d = 0.4 set was comparable to their performance on the first set.
4.4.2 Results

Subjects were 24 members of the MIT community who were paid for participating in the experiment, all of whom were naive to the purpose of the experiment and had not participated in Experiment 1. Seven subjects were randomly assigned to a "trained" group that received training sessions until the performance criterion described above was reached. Six further subjects started to receive training but did not complete it, due to initial chance performance on the easiest (d = 0.6) training stimuli (N = 2), failure to reach criterion after five training sessions (N = 1, at which point the subject's performance had asymptoted at 75% on the d = 0.4 training set for the previous three sessions), or failure to come to training sessions (N = 3). For one of the subjects who completed training, the data collection program malfunctioned during testing, and the subject was excluded from further analysis. Another trained subject was not available for testing. Eleven subjects did not receive any training on the stimuli but were run only on the testing task.
Figure 4-19: Average performance of trained subjects (N = 5) in Experiment 2. Axes labeling as in Fig. 4-6.
One subject of that group, whose performance was highly anomalous (at chance for 0° viewpoint difference, probably due to subject sleepiness), was excluded from further analysis. Another subject was excluded due to a program malfunction during the experiment.

The training sessions already revealed that discrimination in the larger morph space was harder than in the three-prototype space from Experiment 1: Subjects on average required four hours of training (range: 3–5 hours), more than twice as much as in Experiment 1.

Figure 4-19 shows the averaged performance of the subjects in the trained group on the test task. A repeated measures ANOVA with the factors of viewpoint and distance in morph space between match and nonmatch objects revealed a highly significant main effect of viewpoint difference (F(2,3) = 33.151, p ≤ 0.005) on recognition rate, but no significant effect of distance (F(1,4) = 3.259, p = 0.145) and a non-significant interaction (F(1,6) = 0.492, p > 0.5) between the two factors.

The averaged performance over the 9 untrained subjects is shown in Fig. 4-20. There are significant effects of both viewpoint difference (F(2,7) = 31.169, p ≤ 0.001) and distance (F(1,8) = 22.478, p ≤ 0.001), with no significant interaction between the two (F(2,7) = 0.715, p > 0.5).

Comparing the trained and untrained groups, t-tests show that recognition performance in the trained group is significantly (p < 0.02) higher at the training view for d = 0.6 and d = 0.4, and also for the φ = 22.5° view and d = 0.4 (p < 0.01), but does not reach significance for φ = 22.5° and d = 0.6 (p = 0.11). Recognition performance is not significantly different for the two groups at φ = 45° for both distances (p ≥ 0.2). For both groups and distances, performance at φ = 45° is significantly above chance (p < 0.02, one-tailed t-tests).
Figure 4-20: Average performance of untrained subjects (N = 9) in Experiment 2. Axes labeling as in Fig. 4-6.
4.4.3 The Model Revisited

What recognition performance would we expect from the model for the stimuli used in Experiment 2? To investigate this question, we used the training stimuli (400 cars from the d = 0.6, 0.5, and the two d = 0.4 training files) and performed k-means clustering on them to obtain a class-specific representation as subjects might have learned it as a result of training. To investigate the discrimination performance of this representation, we used an SSCU representation with the exact same parameters as in Fig. 4-9, i.e., n_SSCU = 16, σ = 0.2, c = 256. We then evaluated the performance of this representation for the sample, match, and nonmatch triples from the testing task as described in section 4.4.1.1. Performance at φ = 45° was at chance, as expected, but for φ = 22.5°, performance was 65% correct for d = 0.6 and 63% correct for d = 0.4, compatible with the results obtained for the eight car class (Fig. 4-9).
4.4.4 Discussion

Increasing the size of the object class to be learned increased task difficulty considerably, as evidenced by the longer training times compared to Experiment 1. As expected, this correlated with a more significant effect of training on recognition performance (even with a smaller group of trained subjects than in Experiment 1). While the effect of training was highly significant for the training view, we only observed a significant training effect for φ = 22.5° at d = 0.4, with the difference at d = 0.6 just failing to reach significance (p = 0.11). This effect could be interpreted, in the "fine/specific" and "coarse/unspecific" dual framework of object recognition proposed in the context of Experiment 1, as an indication that the benefits of the class- and viewpoint-specific representation learned in the training phase do not extend much farther than φ = 22.5°, as suggested by the simulations presented in section 4.3. The performance of the untrained subjects can be interpreted as indicating that the coarse, class-unspecific representation performs roughly as well as the viewpoint-specific representation at φ = 22.5°, d = 0.6, i.e., there is a balance between the class- and viewpoint-specific representation and the coarse but more view-invariant representation, whereas for the finer shape discrimination required at d = 0.4 the specific representation still provides a significant advantage. Interestingly, the small effect of match/nonmatch distance in morph space for the trained group at φ = 22.5° is paralleled in the model, where there is only a 2% performance difference between the d = 0.6 (65%) and d = 0.4 (63%) conditions at this viewpoint difference.
4.5 General Discussion

The psychophysical and modeling results presented in this paper point to an interesting answer to the initial question of object recognition in continuous object classes: Viewpoint-specific training builds a viewpoint-specific representation for that object class. While this representation supports fine shape discriminations for viewpoints close to the training view, its invariance range is rather limited. However, there is a less specific, pre-existing object representation that cannot discriminate shapes as finely as the trained class-specific representation but shows greater tolerance to viewpoint changes.

It is instructive to compare these observations to an earlier paper by Moses et al. [57], in which it was found that generalization ability for changes in viewpoint and illumination was much greater for upright than for inverted faces, suggesting that prior experience with upright faces extended to the novel faces even though the novel faces had been encountered at one view only. Based on our simulation results, we would expect a similar behavior, i.e., limited invariance around the training view with high sensitivity to shape changes, in addition to a coarser but more invariant system, also for other transformations, such as varying illumination. It will be very interesting to test this hypothesis by training subjects as presented in this paper, then varying, for instance, illumination angle, and comparing trained and untrained groups and model performance. This would also allow us to make inferences about the invariance properties of the feature channels feeding into the SSCUs (which determine the invariance range of the learned class- and viewpoint-specific representation).

Another interesting, more theoretical, question concerns the properties of the "pre-existing" representation: Can experience with rotated objects of a certain class provide greater viewpoint invariance, albeit with coarse shape resolution, also for novel objects belonging to a different class? Poggio and Vetter [71] (see also [107]) proposed the idea that class-specific view-invariant features could be learned from examples and then used to perform recognition of novel objects of the same class given just one view. Jones et al. [40] (see also [4]) presented a computational implementation of this proposal, showing how class-specific learning could facilitate perceptual tasks. If such a mechanism transfers also to sufficiently similar members of a novel object class (for instance, from real cars, which have been seen from many viewpoints, to the morphed cars), then it would provide a suitable candidate for the less specific but more view-invariant representation found in this experiment. Some simulations along these lines have been performed in ([20], pp. 131), but the performance of such a scheme, for instance with respect to view-invariant recognition, was never tested. It will be very exciting to explore this issue in future work.
Acknowledgements

Thanks to Valerie Pires for running the subjects and for expert technical assistance, to Gadi Geiger and Pawan Sinha for stimulating discussions, advice on the experimental paradigm and data analysis, and comments on an earlier version of this manuscript, and to Christian Shelton for the morphing system used to generate the stimuli.
Chapter 5
A note on object class representation and categorical perception

Abstract

We present a novel scheme ("Categorical Basis Functions", CBF) for object class representation in the brain and contrast it to the "Chorus of Prototypes" scheme recently proposed by Edelman [20]. The power and flexibility of CBF is demonstrated in two examples. CBF is then applied to investigate the phenomenon of Categorical Perception, in particular the finding by Bülthoff et al. [10] of categorization of faces by gender without corresponding Categorical Perception. Here, CBF makes predictions that can be tested in a psychophysical experiment. Finally, experiments are suggested to further test CBF.
5.1 Introduction

Object categorization is a central yet computationally difficult cognitive task. For instance, visually similar objects can belong to different classes, and conversely, objects that appear rather different can belong to the same class. Categorization schemes may be based on shape similarity (e.g., "human faces"), on conceptual similarity (e.g., "chairs"), or on more abstract features (e.g., "Japanese cars", "green cars"). What are possible computational mechanisms underlying categorization in the brain?
Edelman has recently presented an object representation scheme called “Chorus of Prototypes” (COP) [20] where objects are categorized by their similarities to reference shapes, or “prototypes”. While this categorization scheme is of appealing simplicity, the reliance on a single metric in a global shape space imposes severe limitations on the kinds of categories that can be represented. We will discuss these shortcomings and present a more general model of object categorization along with a computational implementation that demonstrates the scheme’s capabilities, relate the model to recent psychophysical observations on categorical perception (CP), and discuss some of the model’s predictions.
5.2 Chorus of Prototypes (COP)

In COP, "the stimulus is first projected into a high-dimensional measurement space, spanned by a bank of [Gaussian] receptive fields. Second, it is represented by its similarities to reference shapes" ([20], p. 112, caption to Fig. 5.1). The categorization of novel objects in COP proceeds as follows (ibid., p. 118):

1. A category label is assigned to each of the training objects ("reference objects"), for each of which an RBF network is trained to respond to the object from every viewpoint;

2. a test object is represented by the activity pattern it evokes over all the output units of the reference object RBF networks (i.e., the "similarity to reference shapes" above);

3. categorization is performed using the activity pattern and the labels associated with the output units of the reference object RBF networks. Categorization procedures explored were winner-take-all and k-nearest-neighbor using the training views (this time taking the prototypes to be not the objects but the object views), i.e., the centers of individual RBF units in each network, with the class label in this case based on the label of the majority of the k closest stored views to the test stimulus.

The appealingly simple design of COP also seems to be its most serious limitation: While a representation based solely on shape similarities seems to be suited for the taxonomy of some novel objects (cf. Edelman's example of the description of a giraffe as a "cameleopard" [20]), such a representation appears too impoverished when confronted with objects that can be described on a variety of levels: A car, for instance, can look like several other cars (and also unlike many other objects), but it could also be described as a "cheap" car, a "green" car, a "Japanese" car, an "old"
car, etc. — different qualities that are not simply or naturally summarized by shape similarities to individual prototypes but nevertheless provide useful information to classify or discriminate the object in question from other objects of similar shape. (Footnote: The idea of representing color through similarities to prototype objects seems especially awkward considering that it first requires the build-up of a library of objects of a certain color with the sole purpose of allowing to "average out" object shape.) The fact that an object can be described in such abstract categories, and that this information appears to be used in recognition and discrimination, as indicated by the findings on categorical perception (see below), calls for an extension of Chorus to permit the use of several categorization schemes in parallel, to allow the representation of an object within the framework of a whole dictionary of categorization schemes that offers a more natural description of an object than one global shape space.

While Edelman ([20], p. 244) suggests a refinement of Chorus where weights are assigned to different dimensions driven by task demands, it is not clear how this can happen in one global shape space if two objects can be judged as very similar under one categorization scheme but as rather different under another (as, for instance, a chili pepper and a candy apple in terms of color and taste, resp.). Use of different categorization schemes appears to require a reversible temporary warping of shape space depending on which categorization scheme is to be used, which runs counter to the notion of one general representational space.
5.3 A Novel Scheme: Categorical Basis Functions (CBF)

In CBF, the receptive fields of stimulus-coding units in measurement space are not constrained to lie in any specific class — unlike in COP, there are no class labels associated with these units. The input ensemble drives the unsupervised, i.e., task-independent, learning of receptive fields. The only requirement is that the receptive fields of these stimulus space-coding units (SSCUs) cover the stimulus space sufficiently to allow the definition of arbitrary classification schemes on the stimulus space (in the simplest version, "learning" just consists in storing all the training examples by allocating an SSCU to each training stimulus). These SSCUs in turn serve as inputs to units that are trained on categorization tasks in a supervised way — in fact, if each training stimulus is represented by one SSCU, then the network would be identical to a standard radial basis function (RBF) network. Figure 5-1 illustrates the CBF scheme.

Novel stimuli in this framework evoke a characteristic activation pattern over the existing categorization units (as well as over the SSCUs). In fact, CBF can be seen as an extension of COP: instead of a representation based on similarity in a global shape space alone (as in "the object looks like x, y, z", where x, y, z can be objects for which individual units have been learned), abstract features, which are the result of prior category learning, are equally valid for the description of an object (as in "the object looks expensive/old/pink"). Hence, an object is not only represented by expressing its similarity to learned shapes but also by its membership in learned categories, providing a natural basis for object description.

In the proof-of-concept implementation described in the following, SSCUs are identical to the view-tuned units from the model by Riesenhuber and Poggio [82] (in reality, when objects can appear from different views, they could also be view-invariant — note that the view-tuned units are already invariant to changes in scale and position [82]). For simplicity, the unsupervised learning step is done using k-means, or just by storing all the training exemplars, but more refined unsupervised learning schemes that better reflect the structure of the input space, such as mixture-of-Gaussians or other probability density estimation schemes, or learning rules that provide invariance to object transformations [111], are likely to improve performance. Similarly, the supervised learning scheme used (Gaussian RBF) can be replaced by more biologically plausible or more sophisticated algorithms (see discussion).
5.3.1 An Example: Cat/Dog Classification

To illustrate the capabilities of CBF, the following simulation was performed: We presented the hierarchical object recognition system (up to the C2 layer) of Riesenhuber & Poggio [82] with 144 randomly selected morphed animal stimuli, as used in a very recent monkey physiology experiment [24] (see Fig. 5-2). A view-tuned model unit was allocated for each training stimulus, yielding 144 view-tuned units (results were similar if the 144 stimuli were clustered into 30 units using k-means; see appendix). The activity patterns over the 144 units in response to each of the 144 stimuli were used as inputs to train a Gaussian RBF output unit, using the class labels 1 for cat and -1 for dog as the desired outputs. The categorization performance of this unit was then tested with the same test stimuli as in the physiology experiment (which were not part of the training set). More precisely, the testing set consisted of the 15 lines through morph space connecting each pair of prototypes, each subdivided into 10 intervals, with the exclusion of the stimuli at the mid-points (which in the case of lines crossing the class boundary would lie right on the class boundary, with an undefined label), yielding a total of 126 stimuli.

Figure 5-3 shows the response of the categorization unit to the stimuli on the category boundary-crossing morph lines, together with the desired label. A categorization was counted as correct if the sign of the network output was identical to the sign of the class label. Performance on the training set was 100% correct; performance on the test set was 97%, comparable to monkey performance, which was over 90% [24]. The four categorization errors the model makes lie right at the class boundary.
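The supervised stage is a standard Gaussian RBF network with one basis function per training stimulus (its SSCU activity pattern) and output weights obtained by solving the resulting linear system for the ±1 class labels. A self-contained sketch (our own minimal implementation; the ridge term is added for numerical stability and, like the default σ, is an illustrative choice, not a value taken from the thesis):

    import numpy as np

    class RBFCategorizer:
        def __init__(self, centers, labels, sigma=0.7, ridge=1e-6):
            self.centers = np.asarray(centers, dtype=float)   # SSCU patterns of the training stimuli
            self.sigma = sigma
            G = self._design(self.centers)                    # basis-function matrix on the training set
            y = np.asarray(labels, dtype=float)               # +1 = cat, -1 = dog
            self.w = np.linalg.solve(G + ridge * np.eye(len(G)), y)

        def _design(self, X):
            d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-d2 / (2.0 * self.sigma ** 2))

        def response(self, X):
            """Continuous output of the categorization unit (cf. Fig. 5-3)."""
            return self._design(np.atleast_2d(np.asarray(X, dtype=float))) @ self.w

        def classify(self, X):
            """Class decision: the sign of the output (+1 cat, -1 dog)."""
            return np.sign(self.response(X))

Scoring the 126 test stimuli then reduces to comparing the sign of response() with the sign of the desired label.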
[Figure 5-1 schematic: stimulus space-covering units (SSCUs; unsupervised training), corresponding to the view-tuned units of the Riesenhuber & Poggio model, feed task-related units for tasks 1 through n (supervised training); solid lines denote weighted sum, dashed lines MAX.]

Figure 5-1: Cartoon of the CBF categorization scheme, illustrated with the example domain of cars. Stimulus space-covering units (SSCUs) are the view-tuned units from the model by Riesenhuber & Poggio [82]. They self-organize to respond to representatives of the stimulus space so that they "cover" the whole input space, with no explicit information about class boundaries. These units then serve as inputs to task-related units that are trained in a supervised way to perform the categorization task (e.g., to distinguish American-built cars from imports, or compacts from sedans, etc.). In the proof-of-concept implementation described in this paper, the unsupervised learning stage is done via k-means clustering, or just by storing all the training exemplars, and the supervised stage consists of an RBF network.
Figure 5-2: Illustration of the cat/dog stimulus space. The stimulus space is spanned by six objects, three “cats” and three “dogs”. Our morphing software [90] allows us to generate 3D objects that are arbitrary combinations of the six prototypes. The lines show possible morph directions between two prototypes each, as used in the test set.
Figure 5-3: Response of the categorization unit (based on 144 SSCUs, 256 afferents to each SSCU, σ_SSCU = 0.7) along the nine class boundary-crossing morph lines. All stimuli in the left half of the plot are "cat" stimuli, all on the right-hand side are "dogs" (the class boundary is at 0.5). The network was trained to output 1 for a cat and -1 for a dog stimulus. The thick dashed line shows the average over all morph lines. The solid horizontal line shows the class boundary in response space.
5.3.2 Introduction of parallel categorization schemes

To demonstrate how different classification schemes can be used in parallel within CBF, we also trained a second network to perform a different categorization task on the same stimuli. The stimuli were resorted into three classes, each based on one cat and one dog prototype. For this categorization task, three category units were trained (on a training set of 180 animal morphs, taken from the training sets of an ongoing physiology project), each one to respond at a level of 1 for stimuli belonging to "its" class and at a level of -1 for stimuli from the other two classes. (Footnote: Multi-class classification is a challenging and as yet unsolved computational problem — the scheme employed here was chosen for its simplicity.) Each category unit received input from the same 144 SSCUs as the cat/dog category unit described above.

As mentioned, it is an open question how best to perform multi-class classification. We evaluated two strategies: (i) categorization is said to be correct if the maximally activated category unit corresponds to the true class (the "max" case); (ii) categorization is correct if the signs of the three category units are equal to the correct answer (the "sign" case). Performance on the training set in the "max" as well as in the "sign" case was 100% correct. On the testing set, performance using the "max" rule was 74%, whereas performance for the "sign" rule was 61% correct; the lower numbers on the test set as compared to the cat/dog task reflect the increased difficulty of the three-way categorization. We are currently training a monkey on the same categorization task, and it will be very interesting to compare the animal's performance on the test set to the model's performance.
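The two multi-class readout rules compared above are one-line decisions over the three category-unit outputs. A minimal sketch (the category units themselves are assumed to be trained as in the cat/dog example; names are illustrative):

    import numpy as np

    def max_rule(outputs):
        """'max' rule: the most strongly activated category unit determines the class."""
        return int(np.argmax(outputs))

    def sign_rule_correct(outputs, true_class):
        """'sign' rule: correct only if every unit's sign matches the target pattern
        (+1 for the true class, -1 for the other classes)."""
        target = -np.ones(len(outputs))
        target[true_class] = 1.0
        return bool(np.all(np.sign(outputs) == target))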
5.4 Interactions between categorization and discrimination: Categorical Perception

When discriminating objects, we commonly do not rely only on simple shape cues but also take more complex features into account. For example, we can describe a face in terms of its expression, its age, its gender, etc., providing additional information that can be used to discriminate this face from other faces. This suggests that training on categorization tasks could be of use also for object discrimination.

The influence of categories on perception is expected to be especially strong for stimuli in the vicinity of a class boundary: In the cat/dog categorization task described in the previous paragraph, the goal was to classify all members of one class the same way, irrespective of their shape. Hence, when presented with two stimuli from the same class, the categorization result will ideally not allow one to discriminate between the two stimuli. On the other hand, two stimuli from different
classes are labelled differently. Thus, one would expect greater accuracy in discriminating stimulus pairs from different classes than for pairs belonging to the same class (note that in this paper we are not dealing with the discrimination process itself — while several mechanisms have been proposed, such as a representation based directly on the SSCU activation pattern, or one based on the activity pattern over prototypes such as view-invariant RBF units [20, 68], in this section we only discuss how prior training on categorization tasks can provide additional information to the discrimination process, without regard to how the latter might be implemented computationally). This phenomenon, called Categorical Perception [33], in which linear changes in a stimulus dimension are associated with nonlinear perceptual effects, has been observed in numerous experiments, for instance in color or phoneme discrimination. A recent experiment by Goldstone [31] investigated Categorical Perception (CP) in a task involving training subjects on a novel categorization. In particular, subjects were trained on a combined task that first required them to categorize stimuli (rectangles) according to size or brightness or both and then to discriminate stimuli from the same set in a same-different design. The study found evidence for acquired distinctiveness, i.e., cues (size and brightness, resp.) that were task-relevant became perceptually salient even during other tasks. The task-relevant interval of the task-relevant dimension became selectively sensitized, i.e., discrimination of stimuli in this range improved (local sensitization at the class boundary — the classical Categorical Perception effect), but dimension-wide sensitization was, to a lesser degree, also found (global sensitization). Less sensitization occurred when subjects had to categorize according to size and brightness, indicating competition between those dimensions.
5.4.1 Categorical Perception in CBF

The CBF scheme suggests a simple explanation for category-related influences on perception: When confronted with two stimuli differing along the stimulus dimension relevant for categorization, the different respective activation levels of the categorization unit provide additional information on which to base the discrimination, and thus discrimination across the category boundary is facilitated, as compared to the case where no categorization network has been trained. Fig. 5-4 illustrates this idea: The (continuous) output of the categorization unit(s) provides additional input to the discrimination network in a discrimination task. In a categorization task, the output of the category unit is thresholded to arrive at a binary decision, as is the output of the discrimination network in a yes/no discrimination task. In particular, global sensitization would be expected as a side effect of training the categorization unit if its response is not constant within the classes, which is just what was observed in the simulations shown above (Fig. 5-3): The “catness” response level of the categorization unit decreases as
[Figure 5-4 diagram: stimulus space-covering units feed both a discrimination network (same/different decision) and task-related unit(s) for task n (category decision); the task-related unit(s) also provide input to the discrimination network.]
Figure 5-4: Sketch of the model to explain the influence of experience with categorization tasks on object discrimination, leading to global and local (Categorical Perception) sensitization. Key is the input of the category-tuned unit(s) to the discrimination network (which is shown here for illustrative purposes as receiving input from the SSCU layer, but this is just one of several alternatives), shown by the thick horizontal arrow.
stimuli are morphed from the cat prototypes to cats at the class boundary and beyond. Its output is then thresholded to arrive at the categorization decision, which determines the class by the sign of the response (cf. above). Local sensitization (Categorical Perception) occurs as a result of a stronger response difference of the categorization unit for stimulus pairs crossing the class boundary than for pairs where both members belong to the same class. In agreement with the experiment by Goldstone [31], we would expect competition between different dimensions in CBF when class boundaries run along more than one dimension (e.g., two, as in the experiment), as compared to a class boundary along one dimension only: For the same physical change in one stimulus property (one dimension), the response of the categorization unit should change more in the one-dimensional than in the two-dimensional case, since in the latter case crossing the class boundary requires a change of the input in both dimensions.
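A minimal numerical sketch of this argument is given below (the tanh response profile, the weighting of the category cue, and all parameter values are assumptions made for illustration, not the representations used in the simulations). A graded category response that falls off steeply at the class boundary adds a large cue for boundary-crossing pairs (local sensitization) and a smaller, nonzero cue for within-class pairs (global sensitization).

    import numpy as np

    def category_response(x, steepness=10.0):
        # illustrative graded "cat vs. dog" response along the morph line:
        # near +1 for cats (x close to 0), near -1 for dogs (x close to 1),
        # steepest around the class boundary at x = 0.5
        return np.tanh(steepness * (0.5 - x))

    def discrimination_signal(x1, x2, sscu_distance, w_cat=1.0):
        # distance between the SSCU activation patterns plus the extra cue
        # contributed by the category unit's response difference
        return sscu_distance + w_cat * abs(category_response(x1) - category_response(x2))

    # same physical separation (0.2 morph units) and equal SSCU distances:
    print(discrimination_signal(0.1, 0.3, sscu_distance=1.0))  # within-class pair, ~1.04
    print(discrimination_signal(0.4, 0.6, sscu_distance=1.0))  # boundary-crossing pair, ~2.52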
5.4.2 Categorization with and without Categorical Perception

Bülthoff et al. have recently reported [10] that discrimination between faces is not better near the male/female boundary, i.e., they did not find evidence for CP in their study, even though subjects could clearly categorize face images by gender. Such categorization without CP can be understood within CBF: Following the simulations described above, CP in CBF is expected if the response of the category unit shows a stronger drop across the class boundary than within a class, for the same distance in morph space. Suppose now the slope of the categorization unit's response is uniform across the stimulus space, from the prototypical exemplars of one class (e.g., the “masculine men”) to the prototypical exemplars of the other class (e.g., the “feminine women”). If the subject is forced to make a category decision, e.g., using the sign of the category unit's response, as above, the stimulus ensemble would be clearly divided into two classes (noise in the category unit's response would lead to a smoothed-out sigmoidal categorization curve). However, in a discrimination task, the difference of the category unit's response values for two stimuli across the boundary would not be different from the difference for two stimuli within the same class (if the within-pair distance for both pairs with respect to the category-relevant dimension is the same). Hence, no Categorical Perception, or, more precisely, no local sensitization would be expected. In CBF, the slope of a category unit's response curve is influenced by the extent of the training set with respect to the class boundary. To demonstrate this, we trained a cat/dog category unit as described above using four training sets differing in how close the representatives of each class were allowed to get to the class boundary (which was again defined by equality of the summed cat and dog coefficients). Introducing the “crossbreed coefficient”, c, of a stimulus belonging to a certain class (cat or dog) as the coefficient sum of its corresponding vector in morph space
[Figure 5-5 plot: average categorization unit response vs. position on morph line, from 0 (cat) to 1 (dog); legend: c = 0.1 (100,93): 0.9 and c = 0.4 (100,94): 1.6.]
Figure 5-5: Average responses over all morph lines for the two networks (parameters as in Fig. 5-3) trained on data sets with c = 0.1 and c = 0.4, respectively. The legend shows in parentheses the performance (on the training set and on the test set, resp.); the number after the colon shows the average change of response across the morph line (absolute value of the response difference at positions 0.4 and 0.6) divided by the response difference for that morph line averaged over all other stimulus pairs 0.2 units apart.
over all prototypes of the other class (dog or cat, resp.), training sets differed in the maximum value of c, ranging from 0.1 to 0.4 in steps of 0.1 (c values of stimuli in each training set were chosen uniformly within the permissible interval, and training sets contained an equal number of stimuli, i.e., 200). The first case, c = 0.1, thus contained stimuli that were very close to the prototypical representatives of each class, whereas the c = 0.4 set contained cats with strong dog components and dogs with strong cat components, resp.
Fig. 5-5 shows how the average response along the morph lines differs for the two cases c = 0.1 and c = 0.4. The legend shows in parentheses the performance on the training set and on the test set, resp.; the number after the colon shows the average change of response across the morph line (absolute value of the response difference at positions 0.4 and 0.6) relative to the response difference for that morph line averaged over all other stimulus pairs 0.2 units apart. While categorization performance in both cases is very similar (93% vs. 94% correct on the test set), the relative change across the class border is much greater in the c = 0.4 case than in the c = 0.1 case, where the response drops almost linearly from position 0.2 to position 0.9 on the morph line (incidentally, the relative drop of 1.6 in the c = 0.4 case is very similar to the drop observed in prefrontal cortical neurons of a monkey trained on the same task [24] with the same maximum c value).
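One way to generate training sets with a bounded crossbreed coefficient is sketched below. Only the definition of c as the coefficient mass placed on the prototypes of the other class follows the text above; the prototype indexing, the Dirichlet sampling of the within-class mixture, and the set size are illustrative assumptions, not the procedure actually used.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_stimulus(own_idx, other_idx, c):
        # morph coefficients over the six prototypes, with exactly mass c on the other class
        coeffs = np.zeros(6)
        coeffs[own_idx] = (1.0 - c) * rng.dirichlet(np.ones(len(own_idx)))
        coeffs[other_idx] = c * rng.dirichlet(np.ones(len(other_idx)))
        return coeffs

    def make_training_set(c_max, n_per_class=100):
        cats, dogs = [0, 1, 2], [3, 4, 5]
        stimuli, labels = [], []
        for _ in range(n_per_class):
            c_cat, c_dog = rng.uniform(0.0, c_max, size=2)   # c chosen uniformly below c_max
            stimuli.append(sample_stimulus(cats, dogs, c_cat)); labels.append(+1)   # a "cat"
            stimuli.append(sample_stimulus(dogs, cats, c_dog)); labels.append(-1)   # a "dog"
        return np.array(stimuli), np.array(labels)

    X_close, y_close = make_training_set(c_max=0.1)   # stimuli close to the prototypes
    X_far, y_far = make_training_set(c_max=0.4)       # stimuli reaching toward the class boundary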
Thus, CBF predicts that the amount of categorical perception is related to the extent of the training set with respect to the class boundary: If the training set for a categorization task is sparse around the class boundary (as is the case for face gender classification, where usually most of the training exemplars clearly belong to one or the other category, with a comparatively lower number of androgynous faces), a lower degree of CP would be expected than in the case of a training set that extends to the class boundary. It will be interesting to test this hypothesis experimentally by training subjects on a categorization task where different groups of subjects are exposed to subsets of the stimulus space differing
[Figure 5-6 plots: three surface plots of the distance between activity patterns as a function of morph-line positions n (0 to 1) and n2 (0 to 1), for c = 0.1 (left), c = 0.4 (middle), and their difference (c=0.1) − (c=0.4) (right).]
Figure 5-6: Comparison of Euclidean distances of activation patterns (over 144 SSCUs, as used in the previous simulations) for stimuli lying at two different positions on morph lines, for the cases c = 0.1 and c = 0.4. The left panel shows the average Euclidean distance between the activity pattern for a stimulus at position n (y-axis) and a stimulus on the same morph line at position n2 (x-axis), for the network trained on the data set with c = 0.1 (note that there were no stimuli at the 0.5 position). The middle panel shows the corresponding plot for the network trained on c = 0.4, while the right panel shows the difference between the two plots: Differences between the two networks are usually quite low in magnitude (note the different scaling on the z-axes), suggesting that discrimination performance in the c = 0.1 case should be close to the c = 0.4 case.
in how close the training stimuli come to the boundary. Category judgment can then be tested for (randomly chosen) stimuli lying on lines in morph space passing through the class boundary. In a second step, subjects would be switched to a discrimination task to look for evidence of CP. The prediction would be that while subjects in all groups would divide the stimulus space into categories (not necessarily in the same way or with the same degree of certainty, as there would be uncertainty regarding the exact location of the class boundary that increases for groups that were only trained on stimuli far away from the boundary), the degree of CP should increase with the closeness of the training stimuli to the true class boundary. Naturally, the categorization scheme used in this task should be novel for the subjects to avoid confounding influences of prior experience. Hence, a possible alternative to the cat/dog categorization task described above would be to group car prototypes (randomly) into two classes and then train subjects on this categorization task. One issue to be addressed is whether the fact that subjects are trained on different stimulus sets will influence discrimination performance (even in the absence of any categorization task). For the present case, simulations indicate only a small effect of the different training sets on discrimination performance (see Fig. 5-6), but it is unclear whether this transfers to other stimulus sets. However, while the different training groups might differ in their performance on the untrained part of the stimulus space due to the different SSCUs learned, the prediction is still that the area of improved discriminability should coincide with the subjects’ location of the class boundary rather than with the extent of the training set. To avoid range and anchor effects [11] (see footnote below), stimuli should be chosen from a continuum in morph space, e.g., a loop.
Why has no CP been found for gender classification while other studies have found evidence for CP in emotion classification using line drawings [21] as well as photographic images of faces [11]?‡ For the case of emotions, subjects are likely to have had experience not just with the “prototypical” facial expression of an emotion but also with varying combinations and degrees of expressions, and to have learned to categorize them appropriately, corresponding to the case of high c values in the cat/dog case described above, where CP would be expected.

‡ CP has also been claimed to occur for facial identity [3], but the experimental design appears flawed, as stimuli in the middle of the continuum were presented more often than the ones at the extremes, and prototypes were easily extracted from the discrimination task, biasing subjects' discrimination responses towards the middle of the continuum [73].
5.5 COP or CBF? — Suggestion for Experimental Tests

It appears straightforward to design a physiological experiment to elucidate whether COP or CBF better models actual category learning: A monkey is trained on two different categorization tasks using the same stimuli (for example, the cat/dog stimuli used in the simulations above). The responses of prefrontal cortical neurons (which have been shown in a preliminary study using these stimuli [24] to carry category information) to the test stimuli are then recorded while the monkey is passively viewing the test stimuli (e.g., during a fixation task). In CBF, we would expect to find neurons showing tuning to either categorization scheme, whereas COP would predict that cell tuning reflects a single metric in shape space. In the former case, it will be interesting to compare neural responses to the same stimuli while the monkey is performing the two different categorization tasks, to look for response enhancement/suppression of neurons involved in the different categorization tasks.
5.6 Conclusions

We have described a novel model of object representation that is based on the concurrent use of different categorization schemes using arbitrary class definitions. This scheme provides a more natural basis for classification than the “Chorus of Prototypes” with its notion of one global shape space. In our framework, called “Categorical Basis Functions” (CBF), the stimulus space is represented by units whose receptive fields self-organize without regard to any class boundary. In a second, supervised stage, categorization units receiving input from the stimulus space-covering units (SSCUs) come to learn different categorization task(s). Note that this just describes the basic framework — one could imagine, for instance, the addition of slow time-scale top-down feedback to the SSCU layer, analogous to the GRBF networks of Poggio and Girosi [69], that could enhance categorization performance by optimizing the receptive fields of SSCUs. Similarly, the algorithms
used to learn SSCUs (k-means clustering or simple storage of all training examples) and the categorization units (RBF) should just be taken as examples. For instance, (a less biological version of) CBF could also be implemented using Support Vector Machines [105]. In this case, a categorization unit would only be connected to a sparse subset of SSCUs, paralleling the sparse connectivity observed in cortex. A final note concerns the advantages of CBF for the learning and representation of class hierarchies: While the simulations presented in this paper limited themselves to one level of categorization, it is easily possible to add additional layers of sub- or superordinate level units receiving inputs from other categorization units. For instance, a unit learning to classify a certain breed of dog could receive input not only from the SSCUs but also from a “generic dog” unit, or a “quadruped” unit could be trained receiving inputs from units selective for different classes of four-legged animals, in both cases greatly simplifying the overall learning task.
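As an illustration of this two-stage structure, the sketch below finds SSCU centers by k-means without using class labels and then trains a category unit on the resulting Gaussian activations with a ±1 target. The choice of k-means, the regularized linear readout, the parameter values, and the use of scikit-learn are examples in the spirit of the remark above, not the implementation used in the simulations of this chapter.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import Ridge

    def fit_sscus(c2_patterns, n_units=30, seed=0):
        # stage 1 (unsupervised): SSCU centers found without regard to any class boundary
        km = KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(c2_patterns)
        return km.cluster_centers_

    def sscu_activations(c2_patterns, centers, sigma=0.7):
        # Gaussian tuning of each SSCU around its center
        sq_dist = ((c2_patterns[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))

    def fit_category_unit(activations, labels):
        # stage 2 (supervised): a unit trained to output +1 / -1 on the SSCU activations
        return Ridge(alpha=1.0).fit(activations, labels)

    # usage with made-up C2 activation patterns (256 features, as in the simulations above)
    rng = np.random.default_rng(0)
    X = rng.random((200, 256))
    y = np.where(np.arange(200) < 100, 1.0, -1.0)
    centers = fit_sscus(X)
    unit = fit_category_unit(sscu_activations(X, centers), y)
    predicted = np.sign(unit.predict(sscu_activations(X[:5], centers)))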
Acknowledgements Thanks to Christian Shelton for k-means and RBF MATLAB code, and for the morphing programs used to generate the stimuli [90]. Special thanks to Prof. Eric Grimson for help with taming the “AI Lab Publications Approval Form”.
Appendix: Parameter Dependence of Categorization Performance for the Cat/Dog Task

[Figure 5-7 plots: categorization unit response vs. position on morph line, from 0 (cat) to 1 (dog), for four parameter combinations; panel titles: n=144 a=40 sig=0.2 (100,66): 0.48; n=144 a=256 sig=0.2 (100,94): 1; n=144 a=40 sig=0.7 (100,58): 0.85; n=144 a=256 sig=0.7 (100,97): 2.1.]
Figure 5-7: Output of the categorization unit trained on the cat/dog categorization task from section 5.3.1, for 144 SSCUs (where each SSCU was centered at a training example) and two different values for the σ of the SSCUs and for the number of afferents to each SSCU (choosing either all 256 C2 units or just the 40 strongest afferents, cf. [82]). The numbers in parentheses in each plot title refer to the unit's categorization performance on the training and on the test set, resp. The number on the right-hand side is the average response drop over the category boundary relative to the average drop over the same distance in morph space within each class (cf. section 5.4.2). Note the poor performance on the test set for a low number of afferents to each unit, which is due to overtraining. The plot in the lower right shows the unit from Fig. 5-3.
[Figure 5-8 plots: same format as Fig. 5-7; panel titles: n=30 a=40 sig=0.2 (98,91): 0.78; n=30 a=256 sig=0.2 (99,91): 0.77; n=30 a=40 sig=0.7 (99,88): 0.95; n=30 a=256 sig=0.7 (100,94): 1.2.]
Figure 5-8: Same as Fig. 5-7, but for a SSCU representation based on just 30 units, chosen by a k-means algorithm from the 144 centers in the previous example.
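The parameter variation explored in Figs. 5-7 and 5-8 can be summarized in a small sketch: each SSCU responds with a Gaussian, of width sigma, centered on a stored C2 activation pattern, and receives either all 256 C2 afferents or only the 40 strongest afferents of its center. The code is an assumption for illustration, not the original implementation; only the selection-by-strength rule and the parameter names n (number of SSCUs), a (afferents), and sig follow the plot titles.

    import numpy as np

    def sscu_response(c2_input, center, sigma=0.7, n_afferents=None):
        # Gaussian response over all afferents, or over the n_afferents strongest
        # afferents of the stored center (cf. the a=40 vs. a=256 panels)
        if n_afferents is not None and n_afferents < len(center):
            idx = np.argsort(center)[-n_afferents:]
            c2_input, center = c2_input[idx], center[idx]
        sq_dist = np.sum((c2_input - center) ** 2)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))

    rng = np.random.default_rng(0)
    center = rng.random(256)                              # a stored training example (C2 pattern)
    stimulus = center + 0.05 * rng.standard_normal(256)   # a nearby test stimulus
    print(sscu_response(stimulus, center, sigma=0.2, n_afferents=40))
    print(sscu_response(stimulus, center, sigma=0.7, n_afferents=256))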
Chapter 6
General Conclusions and Future Work

There are two main objectives in the recognition of isolated objects (i.e., in the absence of clutter): invariance to object transformations — be they image-based (e.g., translation, scaling) or depending on the 3D structure of the object (e.g., illumination, rotation in depth) — and specificity (i.e., the ability to discriminate between similar objects). In HMAX, both invariance and specificity are gradually built up in a hierarchy, allowing us to proceed from small and simple receptive fields (i.e., fields that show only little invariance and specificity) to “big” (i.e., invariant to scaling and translation) and complex receptive fields. Note that only invariance to stimulus scale and position, and not, for instance, to rotation in depth, is increased over the hierarchy — to increase invariance to transformations that depend on the 3D structure of an object, the object's 3D structure must be known: a given 2D image patch can be the projection of an infinite number of different 3D shapes. Thus, invariance to transformations that depend on the 3D structure of the object can only be increased after the object's structure has been constrained sufficiently to decrease the number of candidates with the same 2D projection but different 3D structure, e.g., at the VTU level [68]. This gradual and parallel increase in HMAX is crucial, as increasing only invariance properties first would lead to invariant but unspecific feature detectors that are not suitable to represent complex objects, while the converse, i.e., increasing only feature complexity first, would cause an exponential explosion of the number of units required for invariant recognition of a certain object, along with an inability to generalize over position and scale, in contrast to experimental data [49]. The two goals of increasing specificity and invariance are subserved by two different computational mechanisms in HMAX, namely a template match to increase feature specificity while
maintaining invariance, and a MAX function (or its softmax approximation [82]) to increase invariance through (nonlinear) pooling while maintaining feature specificity. This MAX function, i.e., the ability of neurons to select the strongest among their inputs (which, as shown in chapter 3, also permits recognition in clutter by isolating relevant information from distractors), is the crucial component of the model and one of its predictions. How realistic is it? Interestingly, very recent work by Giese et al. [29] has identified several neurally plausible candidate circuits to compute (an approximation to) a MAX function. The next step is to test the MAX hypothesis experimentally. The prime candidates for such an experiment would be complex cells in primary visual cortex, i.e., the lowest level at which the MAX function is posited to be realized, as the preferred features of the (simple) cells feeding into the complex cells are comparatively well known, which is not the case for higher areas [16, 27, 41]. A possible experimental design is straightforward and would entail comparing the responses of a complex cell to two individual stimuli presented separately with the response to the combined display. If the two stimuli excite different populations of afferents (otherwise interaction between the two objects would be expected, cf. chapter 3), a MAX or softmax function would predict the complex cell's response to be dominated by the stronger of the two individual stimuli.
Finally, we can use the model as a basis to explore further some of the issues raised in the previous chapters:
1. In the simulations described in this thesis we have confined ourselves to view-specific representations. While HMAX provides invariance to scaling and translation, invariance to transformations that depend on an object's 3D structure, such as viewpoint and illumination changes, has to be learned from transformed views of the object by combining several VTUs tuned to different views of the same object, as demonstrated in [68] for the case of rotation in depth. As discussed in chapter 4 in the context of recognition in object classes, an interesting question is how much view-invariance (and/or invariance to other transformations such as changes of illumination direction) for novel objects increases if the class is represented not by a set of view-tuned units but by view-invariant (illumination-invariant) units that have each been trained to respond in a view-invariant (illumination-invariant) fashion to a certain member of the class, and further, how these invariance properties extend to another object class with similar objects (like, for example, from real cars to the morphed cars of chapter 4). The experiments of Moses et al. [57] and our experimental results from chapter 4 suggest that there is some benefit in terms of view-invariance for novel objects belonging to the same class.
2. In chapter 3, we demonstrated that HMAX can to some degree perform recognition in clutter, without any special segmentation process. However, there was a trade-off between the robustness to clutter of a single VTU and its invariance range/specificity: More afferents (C2
features) to a VTU provide greater invariance to rotation, but at the cost of increased sensitivity to clutter. Interestingly, this seemed to be less of a problem for a population-based representation, as shown in chapter 4. Here, the invariance range was wide even for a low number of afferents to each SSCU, as object identity was not represented by just one unit but redundantly (cf. also the increased robustness to noise) by a set of units. This might provide a way to achieve robustness to clutter without compromising invariance range.
3. The simulations in chapter 5 have demonstrated that CBF units can generalize within their class and also permit fine category distinctions among similar objects. It will be interesting to see how this performance extends to even more naturalistic object classes where between- and within-class variabilities are greater, such as the photographic images of actual cats and dogs as used in studies of object categorization in infants [74]. Moreover, infants seem to be able to build basic level categories even without an explicit teaching signal [74], possibly exploiting a natural clustering of the two classes in feature space. We are currently exploring how category representations can be built in such a paradigm.
4. HMAX is a purely feedforward model and thus fits well with the tight timing constraints found in recent EEG studies of object detection tasks [101, 104], which have provided evidence that the visual system can determine within 150 ms whether a novel picture contains a member of a certain object class (e.g., an animal). This fast response time is on the order of the response latency of IT neurons, estimates for which range from over 100 ms [85] to over 200 ms [26]. Thus, these data suggest that at least some object detection tasks can be performed in a time interval roughly equal to a single feed-forward pass through the ventral stream. This does not rule out a role for feedback connections, however. In fact, one could imagine that top-down signals “tune” the perceptual system [12] for the task at hand, e.g., to perform detection of objects belonging to a certain class (“animals”). Future work will investigate how such top-down tuning signals can be incorporated into HMAX and how they can be used to increase performance in detection tasks, especially in the presence of clutter.
5. The connection pattern of HMAX up to the C2 layer is simple, albeit hard-wired. Learning so far has only been explored for the C2-VTU connections, but not for lower layers. A challenge here lies in finding a learning scheme that describes how input stimuli drive the development of features at lower levels (S1, S2), while at the same time assuring that features of the same type are pooled over in an appropriate fashion in the C layers. Hyvärinen and Hoyer [38] have recently presented a learning rule whose aim is to decompose an image into independent feature subspaces. The learning rule is similar to the independence maximization rule with a sparsity prior used by Olshausen and Field [63], with the difference that here the independence between the norms of projections on linear subspaces is maximized. With this learning
rule, Hyvärinen and Hoyer are able to learn shift- and phase-invariant features similar to complex cells (or units in the C1 layer). It remains to be seen whether a hierarchical version of such a scheme could also be used in HMAX.
6. As discussed in chapter 2, the MAX operation performs a scanning and selection operation over a range of inputs, which is a central element of many computational algorithms. Neurons performing a weighted sum followed by a Gaussian nonlinearity, the other element of HMAX, can in principle learn any kind of input/output mapping [69]. A hierarchical model consisting of MAX and template match units thus seems to be a powerful general framework for computation in the brain (a minimal sketch of these two operations is given below). Is HMAX therefore also a good model of processing for other parts of cortex? Very recent results by Giese [30], who applied a hierarchical model very similar to HMAX in structure (featuring a combination of MAX and template match operations) to model the recognition of complex motion patterns in cortex, are very encouraging. Maybe not a theory of “how the brain might work” [67], these studies will help us decide whether HMAX is at least a theory of how a part of the brain might work...
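As a brief illustration of the two computational elements just mentioned, the sketch below writes the MAX (and a softmax approximation) and the Gaussian template match as plain functions; the softmax form and all parameter values are generic textbook choices rather than the exact formulation of [82]. The last lines spell out the prediction for the proposed complex-cell experiment: the response to the combined display should be dominated by the stronger of the two individual stimuli.

    import numpy as np

    def max_pool(afferents):
        # idealized MAX response: the unit follows its strongest input
        return float(np.max(afferents))

    def softmax_pool(afferents, q=4.0):
        # soft approximation: an exponentially weighted sum dominated by the
        # strongest afferents; large q approaches the hard MAX
        a = np.asarray(afferents, dtype=float)
        w = np.exp(q * a)
        return float(np.sum(w * a) / np.sum(w))

    def template_match(afferents, template, sigma=1.0):
        # Gaussian tuning around a stored template increases feature specificity
        d2 = np.sum((np.asarray(afferents, dtype=float) - template) ** 2)
        return float(np.exp(-d2 / (2.0 * sigma ** 2)))

    stim_a = np.array([0.9, 0.1])          # afferent activities evoked by stimulus A alone
    stim_b = np.array([0.1, 0.4])          # afferent activities evoked by stimulus B alone
    combined = np.maximum(stim_a, stim_b)  # assumed afferent activities for the combined display
    print(max_pool(stim_a), max_pool(stim_b), max_pool(combined))   # 0.9 0.4 0.9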
Bibliography

[1] Abbott, L., Varela, J., Sen, K., and Nelson, S. (1997). Synaptic depression and cortical gain control. Science 275, 220–224.
[2] Anderson, C. and van Essen, D. (1987). Shifter circuits: a computational strategy for dynamic aspects of visual processing. Proc. Nat. Acad. Sci. USA 84, 6297–6301.
[3] Beale, J. and Keil, F. (1995). Categorical effects in the perception of faces. Cognition 57, 217–239.
[4] Beymer, D. and Poggio, T. (1996). Image representations for visual learning. Science 272, 1905–1909.
[5] Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psych. Rev. 94, 115–147.
[6] Booth, M. and Rolls, E. (1998). View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cereb. Cortex 8, 510–523.
[7] Bruce, C., Desimone, R., and Gross, C. (1981). Visual properties of neurons in a polysensory area in the superior temporal sulcus of the macaque. J. Neurophys. 46, 369–384.
[8] Bülthoff, H. and Edelman, S. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proc. Nat. Acad. Sci. USA 89, 60–64.
[9] Bülthoff, H., Edelman, S., and Tarr, M. (1995). How are three-dimensional objects represented in the brain? Cereb. Cortex 3, 247–260.
[10] Bülthoff, I., Newell, F., Vetter, T., and Bülthoff, H. (1998). Is the gender of a face categorically perceived? Invest. Ophthal. and Vis. Sci. 39(4), 812.
[11] Calder, A., Young, A., Perrett, D., Etcoff, N., and Rowland, D. (1996). Categorical perception of morphed facial expressions. Vis. Cognition 3, 81–117.
[12] Carr, T. and Bacharach, V. (1976). Perceptual tuning and conscious attention: Systems of input regulation in visual information processing. Cognition 4, 281–302.
[13] Chance, F., Nelson, S., and Abbott, L. (1999). Complex cells as cortically amplified simple cells. Nat. Neurosci. 2, 277–282.
[14] Connor, C., Preddie, D., Gallant, J., and van Essen, D. (1997). Spatial attention effects in macaque area V4. J. Neurosci. 17, 3201–3214.
[15] Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. J. Cogn. Neurosci. 3, 1–8.
[16] Desimone, R. and Schein, S. (1987). Visual properties of neurons in area V4 of the macaque: sensitivity to stimulus form. J. Neurophys. 57, 835–868.
[17] Douglas, R., Koch, C., Mahovald, M., Martin, K., and Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science 269, 981–985.
[18] Edelman, S. (1995). Class similarity and viewpoint invariance in the recognition of 3D objects. Biol. Cyb. 72, 207–220.
[19] Edelman, S. (1997). Computational theories of object recognition. Trends Cog. Sci. 1, 296–304.
[20] Edelman, S. (1999). Representation and Recognition in Vision. MIT Press, Cambridge, MA.
[21] Etcoff, N. and Magee, J. (1992). Categorical perception of facial expressions. Cognition, 227–240.
[22] Felleman, D. and van Essen, D. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47.
[23] Földiák, P. (1991). Learning invariance from transformation sequences. Neural Comp. 3, 194–200.
[24] Freedman, D., Riesenhuber, M., Shelton, C., Poggio, T., and Miller, E. (1999). Categorical representation of visual stimuli in the monkey prefrontal (PF) cortex. In Soc. Neurosci. Abs., volume 29, 884.
[25] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cyb. 36, 193–202.
[26] Fuster, J. (1990). Inferotemporal units in selective visual attention and short-term memory. J. Neurophys. 64, 681–697.
[27] Gallant, J., Connor, C., Rakshit, S., Lewis, J., and van Essen, D. (1996). Neural responses to polar, hyperbolic, and cartesian gratings in area V4 of the macaque monkey. J. Neurophys. 76, 2718–2739.
[28] Gauthier, I. and Tarr, M. (1997). Becoming a “Greeble” expert: exploring mechanisms for face recognition. Vis. Res. 37, 1673–1682.
[29] Giese, M., Yu, A., and Poggio, T. (2000). Neural mechanisms for the realization of maximum operations. Unpublished manuscript.
[30] Giese, M. (2000). Learning-based neural model for the recognition of biological motion. In Proceedings of ARVO 2000. In press.
[31] Goldstone, R. (1994). Influences of categorization on perceptual discrimination. J. Exp. Psych.: General 123, 178–200.
[32] Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks 1, 17–61.
[33] Harnad, S. (1987). Categorical perception: The groundwork of cognition. Cambridge University Press, Cambridge, UK.
[34] Heeger, D. (1992). Normalization of cell responses in cat striate cortex. Vis. Neurosci. 9, 181–197.
[35] Hilgetag, C., O'Neill, M., and Young, M. (1996). Indeterminate organization of the visual system. Science 271, 776–777.
[36] Hubel, D. and Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Phys. 160, 106–154.
[37] Hubel, D. and Wiesel, T. (1965). Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. J. Neurophys. 28, 229–289.
[38] Hyvärinen, A. and Hoyer, P. Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. To appear in Neural Comp.
[39] Jones, M. and Poggio, T. (1996). Model-based matching by linear combinations of prototypes. AI Memo 1583, CBCL Paper 139, MIT AI Lab and CBCL, Cambridge, MA.
[40] Jones, M., Sinha, P., Vetter, T., and Poggio, T. (1997). Top-down learning of low-level visual tasks. Curr. Biol. 7, 991–994.
[41] Kobatake, E. and Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophys. 71, 856–867.
[42] Kobatake, E., Wang, G., and Tanaka, K. (1998). Effects of shape-discrimination training on the selectivity of inferotemporal cells in adult monkeys. J. Neurophys. 80, 324–330.
[43] Koch, C. and Poggio, T. (1999). Predicting the visual world: Silence is golden. Nat. Neurosci. 2, 9–10.
[44] Koch, C. and Ullman, S. (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Hum. Neurobiol. 4, 219–227.
[45] Lee, D., Itti, L., Koch, C., and Braun, J. (1999). Attention activates winner-take-all competition among visual filters. Nat. Neurosci. 2, 375–381.
[46] Lindsay, D., Jack Jr., P., and Christian, M. (1991). Other-race face perception. J. App. Psychol. 76, 587–589.
[47] Logothetis, N. (1998). Object vision and visual awareness. Curr. Op. Neurobiol. 8, 536–544.
[48] Logothetis, N., Pauls, J., Bülthoff, H., and Poggio, T. (1994). View-dependent object recognition by monkeys. Curr. Biol. 4, 401–414.
[49] Logothetis, N., Pauls, J., and Poggio, T. (1995). Shape representation in the inferior temporal cortex of monkeys. Curr. Biol. 5, 552–563.
[50] Logothetis, N. and Sheinberg, D. (1996). Visual object recognition. Ann. Rev. Neurosci. 19, 577–621.
[51] Malpass, R. and Kravitz, J. (1969). Recognition for faces of own and other race. J. Pers. Soc. Psychol. 13, 330–334.
[52] Mel, B. (1997). SEEMORE: combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Comp. 9, 777–804.
[53] Mel, B. and Fiser, J. (2000). Minimizing binding errors using learned conjunctive features. Neural Comp. 12, 247–278.
[54] Missal, M., Vogels, R., and Orban, G. (1997). Responses of macaque inferior temporal neurons to overlapping shapes. Cereb. Cortex 7, 758–767.
[55] Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature 335, 817–820.
[56] Moran, J. and Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science 229, 782–784.
[57] Moses, Y., Edelman, S., and Ullman, S. (1993). Generalization to novel images in upright and inverted faces. Technical Report CS93-14, Weizmann Institute of Science, Israel.
[58] Mozer, M. (1991). The perception of multiple objects: a connectionist approach. MIT Press, Cambridge, MA.
[59] Mumford, D. (1992). On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol. Cyb. 66, 241–251.
[60] Newell, F. (1998). Stimulus context and view dependence in object recognition. Perception 27, 47–68.
[61] Nowlan, S. and Sejnowski, T. (1995). A selection model for motion processing in area MT of primates. J. Neurosci. 15, 1195–1214.
[62] Olshausen, B., Anderson, C., and van Essen, D. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci. 13, 4700–4719.
[63] Olshausen, B. and Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609.
[64] Perrett, D. and Oram, M. (1993). Neurophysiology of shape processing. Img. Vis. Comput. 11, 317–333.
[65] Perrett, D. and Oram, M. (1998). Visual recognition based on temporal cortex cells: viewer-centred processing of pattern configuration. Z. Naturforsch. 53c, 518–541.
[66] Perrett, D., Oram, M., Harries, M., Bevan, R., Hietanen, J., Benson, P., and Thomas, S. (1991). Viewer-centred and object-centred coding of heads in the macaque temporal cortex. Exp. Brain Res. 86, 159–173.
[67] Poggio, T. (1990). A theory of how the brain might work. In Proceedings of Cold Spring Harbor Symposia on Quantitative Biology, volume 4, 899–910 (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY).
[68] Poggio, T. and Edelman, S. (1990). A network that learns to recognize 3D objects. Nature 343, 263–266.
[69] Poggio, T. and Girosi, F. (1989). A theory of networks for approximation and learning. AI Memo 1140, CBIP paper 31, MIT AI Lab and CBIP, Cambridge, MA.
[70] Poggio, T., Reichardt, W., and Hausen, W. (1981). A neuronal circuitry for relative movement discrimination by the visual system of the fly. Naturwissenschaften 68, 443–466.
[71] Poggio, T. and Vetter, T. (1992). Recognition and structure from one 2D model view: Observations on prototypes, object classes and symmetries. AI Memo 1347, CBIP Paper 69, MIT AI Lab and CBIP, Cambridge, MA.
[72] Potter, M. (1975). Meaning in visual search. Science 187, 565–566.
[73] Poulton, E. (1975). Range effects in experiments on people. Am. J. Psychol. 88, 3–32.
[74] Quinn, P., Eimas, P., and Rosenkrantz, S. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception 22, 463–475.
[75] Rao, R. and Ballard, D. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2, 79–87.
[76] Reichardt, W., Poggio, T., and Hausen, K. (1983). Figure-ground discrimination by relative movement in the visual system of the fly - II: Towards the neural circuitry. Biol. Cyb. 46, 1–30.
[77] Reynolds, J., Chelazzi, L., and Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J. Neurosci. 19, 1736–1753.
[78] Riesenhuber, M. and Dayan, P. (1997). Neural models for part-whole hierarchies. In Advances in Neural Information Processing Systems, Mozer, M., Jordan, M., and Petsche, T., editors, volume 9, 17–23 (MIT Press, Cambridge, MA).
[79] Riesenhuber, M. and Poggio, T. (1998). Just one view: Invariances in inferotemporal cell tuning. In Advances in Neural Information Processing Systems, Jordan, M., Kearns, M., and Solla, S., editors, volume 10, 167–194 (MIT Press, Cambridge, MA).
[80] Riesenhuber, M. and Poggio, T. (1998). Modeling invariances in inferotemporal cell tuning. AI Memo 1629, CBCL Paper 160, MIT AI Lab and CBCL, Cambridge, MA.
[81] Riesenhuber, M. and Poggio, T. (1999). Are cortical models really bound by the “Binding Problem”? Neuron 24, 87–93.
[82] Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025.
[83] Riesenhuber, M. and Poggio, T. (1999). A note on object class representation and categorical perception. AI Memo 1679, CBCL Paper 183, MIT AI Lab and CBCL, Cambridge, MA.
[84] Riesenhuber, M. and Poggio, T. (2000). The individual is nothing, the class everything: Psychophysics and modeling of recognition in object classes. AI Memo 1682, CBCL Paper 185, MIT AI Lab and CBCL, Cambridge, MA.
[85] Rolls, E., Judge, S., and Sanghera, M. (1977). Activity of neurones in the inferotemporal cortex of the alert monkey. Brain Res. 130, 229–238.
[86] Rolls, E. and Tovee, M. (1995). The responses of single neurons in the temporal visual cortical areas of the macaque when more than one stimulus is present in the receptive field. Exp. Brain Res. 103, 409–420.
[87] Rowley, H., Baluja, S., and Kanade, T. (1998). Neural network-based face detection. IEEE PAMI 20, 23–38.
[88] Salinas, E. and Abbott, L. (1997). Invariant visual responses from attentional gain fields. J. Neurophys. 77, 3267–3272.
[89] Sato, T. (1989). Interactions of visual stimuli in the receptive fields of inferior temporal neurons in awake monkeys. Exp. Brain Res. 77, 23–30.
[90] Shelton, C. (1996). Three-dimensional correspondence. Master's thesis, MIT, Cambridge, MA.
[91] Sinha, P. and Poggio, T. (1996). The role of learning in 3-D form perception. Nature 384, 460–463.
[92] Sung, K. and Poggio, T. (1998). Example-based learning for view-based human face detection. IEEE PAMI 20, 39–51.
[93] Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science 262, 685–688.
[94] Tanaka, K. (1996). Inferotemporal cortex and object vision. Ann. Rev. Neurosci. 19, 109–139.
[95] Tarr, M. (1995). Rotating objects to recognize them: A case study on the role of viewpoint dependency in the recognition of three-dimensional objects. Psychonom. Bull. & Rev. 2, 55–82.
[96] Tarr, M. (1999). News on views: pandemonium revisited. Nat. Neurosci. 2, 932–935.
[97] Tarr, M. and Bülthoff, H. (1995). Is human object recognition better described by geon structural descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993). J. Exp. Psych. Hum. Percep. Perf. 21, 1494–1505.
[98] Tarr, M. and Bülthoff, H. (1998). Image-based object recognition in man, monkey and machine. Cognition 67, 1–20.
[99] Tarr, M. and Gauthier, I. (1998). Do viewpoint-dependent mechanisms generalize across members of a class? Cognition 67, 73–110.
[100] Tarr, M., Williams, P., Hayward, W., and Gauthier, I. (1998). Three-dimensional object recognition is viewpoint-dependent. Nat. Neurosci. 1, 275–277.
[101] Thorpe, S., Fize, D., and Marlot, C. (1996). Speed of processing in the human visual system. Nature 381, 520–522.
[102] Treisman, A. and Gelade, G. (1980). A feature-integration theory of attention. Cog. Psychol. 12, 97–136.
[103] Ungerleider, L. and Haxby, J. (1994). 'What' and 'where' in the human brain. Curr. Op. Neurobiol. 4, 157–165.
[104] VanRullen, R. and Thorpe, S. The time course of visual processing: From early perception to decision-making. Submitted to Nat. Neurosci.
[105] Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
[106] Vetter, T. and Blanz, V. (1998). Estimating coloured 3D face models from single images: An example based approach. In Proceedings of the European Conference on Computer Vision ECCV'98 (Freiburg, Germany).
[107] Vetter, T., Hurlbert, A., and Poggio, T. (1995). View-based models of 3D object recognition: invariance to imaging transformations. Cereb. Cortex 3, 261–269.
[108] Vogels, R. (1999). Categorization of complex visual images by rhesus monkeys. Part 2: single-cell study. Eur. J. Neurosci. 11, 1239–1255.
[109] von der Malsburg, C. (1981). The correlation theory of brain function. Technical Report 81-2, Dept. of Neurobiology, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.
[110] von der Malsburg, C. (1995). Binding in models of perception and brain function. Curr. Op. Neurobiol. 5, 520–526.
[111] Wallis, G. and Rolls, E. (1997). A model of invariant object recognition in the visual system. Prog. Neurobiol. 51, 167–194.
[112] Wang, G., Tanaka, K., and Tanifuji, M. (1996). Optical imaging of functional organization in the monkey inferotemporal cortex. Science 272, 1665–1668.
[113] Wang, G., Tanifuji, M., and Tanaka, K. (1998). Functional architecture in monkey inferotemporal cortex revealed by in vivo optical imaging. Neurosci. Res. 32, 33–46.
[114] Wickelgren, W. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psych. Rev. 76, 1–15.
[115] Young, M. and Yamane, S. (1992). Sparse population coding of faces in the inferotemporal cortex. Science 256, 1327–1331.