Artificial Intelligence and Simulation of Behaviour, April 1993, Birmingham, England.
Computer Vision: What Is The Object?

James V. Stone
Department of Computer Science, University of Wales, Aberystwyth, Wales.
Abstract. Vision consists of a multiplicity of tasks, of which object identification is only one. We in the computer vision community have concentrated our efforts on object identification, and have thereby ensured that the formulation of the problem of vision provides methods which are not of general utility for vision. Ironically, one consequence of this is that computer vision may not even be of use for object identification. An analysis of why computer vision has become synonymous with object identification is presented, and its implications both for object identification and for interpreting neurophysiological evidence in terms of `feature detectors' are discussed. A formulation of the problem of vision in terms of spatio-temporal characteristics is proposed.
1 Object Identification in Human and Computer Vision

The hardest part of any scientific investigation is not solving a particular problem, but formulating questions that focus attention on important aspects of the phenomena under investigation. This paper attempts to step back from conventional formulations of the `problem of vision' by making explicit the unspoken assumptions upon which conventional formulations of the problem are based.

Historically, the primary goal of computer vision has been to identify objects. One of the most influential books [1] on computer vision states in its preface: "Computer vision is the construction of explicit, meaningful descriptions of physical objects from images." Ballard and Brown, page xiii, 1982 [1]. There is no hint here that computer vision may consist of more than object identification.1 And, from a psychophysiological perspective: "The brain's task then, is to extract the constant, invariant features of objects from the perpetually changing flood of information it receives from them." Zeki, page 43, 1992 [2]. (Italics added). "The goal of vision is to inform us of the identity of objects in view and their spatial positions", Cavanagh, page 261, 1989 [3]. (Italics added).

The cited texts are more general than these quotes might suggest, but these quotes demonstrate a prevalent view of the role of both human and computer vision. More recently there has been a move away from traditional formulations of the problem of vision [4, 5]. These approaches, whilst commendable in many respects, do not make explicit the central role of the notion of `object'. In Ballard's [4] excellent account of the advantages of animate vision there is no reference made as to why objects might be a useful way to represent the visual world, nor why the particular representations of objects used (colour histograms) might be formed by an animate vision system.

Formulations of the problem of computer vision in terms of objects may have arisen because the most obvious concomitant of vision in humans is their ability to identify objects. However, most of human and animal vision has little to do with identifying objects. More typically, vision is used to guide limbs, to track motion, to detect changes in motion/lighting/colour/depth, to estimate relative depths of surfaces using parallax/stereo/motion/texture. Whilst some of these tasks may require the detection of objects, none of these tasks requires that those objects be recognised or identified. Moreover, these tasks involve computation of quantities (such as position, depth, velocity, colour, gradient) which are not necessarily required in order to compute the identity of an object.

1 I use the term `object identification' instead of the more usual `object recognition' because I utilise a distinction between recognition (e.g. familiarity with an object) and identification (i.e. classifying or naming an object).
It is tempting to use easily quantifiable tasks, such as object identification, in order to compare the performance of a seeing machine to that of a human. A machine that can identify objects provides a tangible demonstration that it can do what humans do. It is tempting to suppose that if a machine can identify objects then it can `see' as humans do. However, the fact that humans can identify objects does not imply that this is the only task humans use vision for; and a demonstration that both humans and machines can identify objects is not a demonstration that a machine can `see' in the sense normally associated with seeing humans. It is less easy to measure how well a human uses vision to aid walking/climbing/reaching/grasping, even though there is ample evidence that vision is essential for these tasks.

The conventional formulation of the problem of object recognition implies that objects consist of well defined features, and that objects can be identified by first extracting these features and then matching them to stored representations. If we accept that vision includes object identification, but that the visual mechanisms we possess evolved in order to perform many other visual tasks, then the conventional formulation of the problem of vision appears not only simplistic, but also peculiarly biased toward a task (object identification) that is an important, but relatively small, part of what vision is for.

It seems likely that our sophisticated object recognition ability is a relatively recent evolutionary development. Like most evolutionary innovations, the ability to recognise objects was probably synthesised from pre-existing computational mechanisms. Consequently, if a mechanism is useful for object identification only, then it is unlikely that it forms a part of the solution implemented by human visual systems. Conversely, if a mechanism subserves other forms of visually guided behaviour as well as object recognition, then it is likely that that mechanism forms part of the human visual system. Within computer vision we pride ourselves on the correspondence between our methods and the computational mechanisms observed in the human visual system. However, if we wish to achieve object recognition by modelling the computational properties of the human visual system then we could do so by paying more attention to the types of tasks that our visual systems evolved to deal with.
1.1 Object Identification: A Conceptual Analysis

The notion of `object' is so deeply ingrained in our language that its logical status is rarely questioned. If the question does arise then it is usually addressed by invoking sub-objects, or `features'. For example, a face may be defined in terms of `features' such as nose, eye and mouth. However, such sub-objects are logically indistinguishable from the objects of which they are a part. (A similar type of dilemma is yet to be recognised in the connectionist literature, where the notion of `micro-feature' is currently used as an explanatory concept). If we accept that objects can be defined in terms of sub-objects or features (and that this recursive process is finite) then it should be possible to differentiate two objects on the basis of their respective features. However, it is possible for two different types of object to be specified by a single feature-based description. An example of this is the letter `O' and the number `0'.

If features are clearly insufficient to distinguish between two objects then the notion of context and/or function is often invoked. Invoking context to disambiguate two physically similar objects has the effect of moving the nub of the problem from the structure of the object to the structure of its spatio-temporal neighbourhood. However, objects and contexts are usually defined using similar types of primitives. Consequently, invoking context usually resolves the immediate problem without addressing its underlying cause, whilst creating a set of similar problems (such as how to identify a given context). As with `context', it might be supposed that two physically similar objects can be disambiguated by appealing to their functional attributes. In order to provide a functional method for distinguishing between a pillow and a cushion, the latter might be described as `used to support the back when sitting', whereas a pillow might be described as `used to support the head while sleeping'. Such descriptions clearly create more problems than they solve. What does it mean to support? How are back and head defined, in terms of still more features?

Of course, a list of features does not constitute an adequate description of an object. It is important to consider the relationships between different features. However, choosing a set of relations-between-features is analogous to the problem of choosing features, and is subject to the same types of pitfalls [6] (p. 376). Moreover, the type of problem described above with respect to classifying objects according to features arises with respect to relations-between-features.
Several issues are raised by the fact that a well defined set of primitive descriptors for objects is not available, and that there does not seem to be a principled method for obtaining such a set. Does the problem of object identification have a robust set of descriptors which can be used to recognise objects? Or is object identification an inappropriate formulation of the problem represented by vision? By formulating a more general definition of the problem of vision, conventional formulations of the problem of object recognition may become irrelevant, not because object identification isn't required, but because it is addressed as part of a more general computational problem. The effect of this is to de-emphasise object identification as the primary objective in computer vision. Object identification is an integral, but subsidiary, part of visual behaviour, and can be realised as part of a solution to a more general formulation of the problem of computer vision.
1.2 Features and `Grandmother Cells'

The tendency to describe objects in terms of features exists in several related fields: psychology, artificial intelligence, computer vision and neurophysiology. The last is particularly interesting because, unlike the others, it aspires to making direct contact with the computational machinery responsible for perception. Yet even here, neurons quite close to the retinal input are described as feature detectors. Indeed, early attempts to account for these findings proposed that the function of these neurons is to signal the presence of these features [7, 8]. However, simulations using an artificial neural network (ANN) to perform a simple shape from shading task [9] have demonstrated that the types of feature detectors observed in the retina and in the primary visual cortex (V1) can arise (in an ANN) in the absence of corresponding `retinal' features. The `edge detectors' identified in this ANN developed in the absence of contrast edges in the shaded images used to train the ANN.

Additionally, `feature detection' theories predict the existence of increasingly response-specific neurons. With few exceptions such neurons have not been identified. The exceptions involve neurons that respond to ethologically relevant stimuli such as faces [10], but evidence for the existence of neurons responding to other types of stimuli is not compelling (see [11]). In particular, the reductionist approach adopted by Fujita et al. suggests that such complex stimuli may not be the optimal stimuli for neurons in the inferotemporal cortex (and even if they are, this might tell us little about the function of such neurons; see below). Fujita et al. defined the optimal response properties of neurons in the anterior inferotemporal cortex ("the final station of the visual cortical stream crucial for object recognition", [11], p. 343) in terms of spatially defined features of simple geometric objects.2 By progressively simplifying `optimal' stimuli, Fujita et al. found that neurons responded selectively to line drawings of simple shapes, intensity or colour contrasts, or luminance gradations. Neurons in the same column shared the same optimal stimulus, and adjacent neurons usually shared similar response properties.

Whilst the importance of these results cannot be over-emphasised, the interpretation placed upon them by the authors and by an accompanying review of the paper (in the same edition as [11]) is consistent with the assumption that these neurons are used only for object recognition. Indeed, this assumption appears to be implicit in the title, "Columns for visual features of objects in monkey inferotemporal cortex" (italics added). Whilst there seems little doubt that these neurons are involved in the perception of form, this more general characterisation of their function admits a larger set of visual tasks than is implied by a characterisation in terms of object recognition alone. Just as `edge detecting' units were observed in an ANN performing a shape from shading task [9], so it may be that the feature detecting neurons observed by Fujita et al. have much to do with form perception, but are not uniquely associated with object recognition. Indeed, the thesis of [9] is that the role of neurons which respond selectively to visual inputs cannot be deduced from the nature of those inputs.
Results from psychophysical experiments on the role of oriented receptive fields suggest that "oriented filters are not `orientation detectors', but are precursors to a more subtle stage that locates and represents spatial features" [12] (p. 235). Together, these data suggest that the function of Fujita et al.'s feature detecting neurons may not be to detect certain features. Instead, the function of those neurons may be similar in type to the units and neurons described in [9] and [12], respectively.

In addition to the objections raised above against labelling response-specific neurons as feature detectors, this class of theory creates several obvious, but fundamental, problems. First, if a single cell codes for a particular retinal feature then the death of that cell would eliminate the ability to recognise that feature. Second, there are not enough neurons to code for every possible combination of features. However, Hinton [13] and Ballard [14] have proposed the use of coarse coding ANN units and feature subspaces, respectively, to ameliorate this combinatorial problem (a minimal illustration of coarse coding is sketched below).
2 Of the set of objects tested, these were found to elicit maximal responses.
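To make the coarse coding idea concrete, the following sketch (in Python, which of course is not part of [13]) represents a scalar stimulus by the graded activity of a small population of units with broad, overlapping Gaussian receptive fields, in the spirit of Hinton's proposal. The unit count, tuning width and centre-of-mass decoding rule are illustrative assumptions, not details taken from [13].

    import numpy as np

    # Coarse coding sketch: a scalar stimulus in [0, 1] is represented by
    # the graded activity of a few units with broad, overlapping Gaussian
    # receptive fields. All parameter values are illustrative assumptions.
    n_units = 8
    centres = np.linspace(0.0, 1.0, n_units)  # preferred stimulus of each unit
    width = 0.25                              # broad tuning, so fields overlap

    def encode(x):
        """Population activity evoked by stimulus x."""
        return np.exp(-((x - centres) ** 2) / (2.0 * width ** 2))

    def decode(activity):
        """Centre-of-mass estimate of the stimulus from population activity."""
        return float(np.dot(activity, centres) / activity.sum())

    a = encode(0.37)
    print(decode(a))   # approximately recovers 0.37 from only 8 coarse units
    a[3] = 0.0         # silence one unit ("cell death")
    print(decode(a))   # the estimate degrades gracefully, not catastrophically

Because the stimulus is carried by the whole population rather than by any single `grandmother cell', deleting one unit shifts the estimate only slightly, and a handful of broadly tuned units can jointly distinguish far more stimulus values than the same number of narrowly tuned detectors.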
Although workers in neurophysiology state that `feature detector' is only a convenient term to describe their findings, the implication that the function of each neuron is to signal the presence of a single feature is pervasive. It is pervasive to the extent that researchers using ANNs confidently refer to `feature detecting' units, yet removing instances of such units usually has a minimal effect on the ability of the ANN to utilise information associated with the putative feature. Whereas most units contribute to the final output of an ANN, the functional significance of each unit in responding to the presence of certain features is not known. Whilst there seems little doubt that units in ANNs and neurons in the primary visual cortex respond to contrast edges, and that neurons in the inferotemporal cortex respond selectively to visual configurations, this functionally neutral description is often not the language used to describe the behaviour of such units and neurons.

I am not arguing that neurons do not code for visual features. I am proposing that the strong bias amongst vision workers from many fields to identify vision as being synonymous with object recognition results in an interpretation of data which assumes that the response characteristics of neurons can only be explained in terms of their role in object recognition. Given our evolutionary history, and our skill in performing different types of visually guided behaviours, it seems unparsimonious to propose that neurons which respond to features that are parts of objects are involved only in object recognition.
2 Linguistic Anthropomorphism in Computer Vision

Linguistic descriptions of the physical world tend to be expressed in terms of objects and their associated properties and processes. An example is: `The green stone is sinking'. Here the primary descriptor is the stone, with green (property) and sinking (process) being `attached' to the stone. However, this linguistic description belies a bias which may bear little relation to the type of quantities computed by a visual system. Within the visual system the motion, colour and spatio-temporal integrity of the stone are not necessarily subsidiary to each other; they are simply computable attributes of a physical scenario. I believe that this linguistic bias has influenced much work in computer vision, and has retarded progress by identifying computer vision with object identification, rather than with what vision is used for in biological vision systems.

More recently, the apparent intractability of computer vision as object identification, and the consequent lack of practical use of much computer vision work, has led to a re-appraisal of what vision is for. This move away from conventional computer vision and toward animate vision [4] is motivated by the modest practical success of computer vision systems. It is now accepted by some [4] that computer vision was asking the wrong questions. Rather than asking "How can we get a machine to name objects?", perhaps it should have been asking "What is vision for?". The `new' computer vision recognises what went wrong with computer vision, but not why it went wrong, nor why the error was repeated by successive generations of researchers. By making the problem explicit we may be able to avoid mistakes of this type in the future.

As computer vision researchers we accept that the conventional formulation of the problem of object identification represents a tractable problem. We therefore implicitly accept the status of objects as the primary descriptors of a given visual scenario. Moreover, the compelling psychological importance of objects suggests that each object can be recognised on the basis of a purely spatial parameterisation of that object. The sentence `The green stone is sinking' makes sense to us because it is consistent with our own perceptions. It is tempting to model our own perceptual capabilities using primitives (object, property and process) such that the computational precedence of each matches our own linguistic precedence. Although linguistic descriptions are consistent with experience, they are not necessarily determined by such experience. Instead of modelling human capabilities in terms of objects with properties and processes, we could equally well describe the world in terms of processes with objects and properties; e.g. `The sinking is green stone'. Here, the process of `sinking' is the hook upon which the `stone' object and `green' property hang. In such a linguistic world, objects and properties would be subsidiary to processes. If our language were organised in this manner then computer vision would probably consist of recognising entities such as `sinking' and `blowing', and objects would be treated as attributes of these primary descriptors.

There are no logical reasons for partitioning the world into objects with subsidiary properties and processes, though there may be sound computational reasons for doing so. Whorf [15] would argue that such a partitioning determines how we perceive the physical world.
However, I am less concerned here with why we partition the world in this way than with the consequences of any particular partitioning. The point is that partitioning the world along dimensions of process rather than objects is logically
indistinguishable from a conventional partitioning if each provides the same amount of information about the physical world. From an ethological perspective there is little point in knowing that a predator is present if it is not known in which direction the predator is moving. Conversely, there is little point in knowing in which direction an object is moving if it is not known what type of object (predator/prey) it is. Both the object type and the processes associated with an object are required. How we choose to describe such physical scenarios with language is immaterial provided that both types of information are communicable within the language.

I am not proposing that scenes should be described with processes as primary descriptors. The point of the preceding discussion is that language imposes precedence, or hierarchy, on our descriptions of the visual world. How elements are ordered in this hierarchy matters less than which types of entities constitute the elements of the hierarchy. I believe that the current formulation of problems in computer vision (and AI) reflects much about the structure of the linguistic hierarchy of the English language, and little about the underlying computational processes which generate the elements of the hierarchy.3

3 Whilst there is no empirical evidence that the linguistic precedence used in languages is reflected in the computational organisation of the brain, there is evidence that `what' and `where' attributes are computed in different parts of the brain [16, 17]. It appears that the linguistic distinction between `what' and `where' (but not its precedence) has a corresponding functional distinction in the brain.
3 Spatial and Spatio-Temporal Characteristic Views

3.1 Spatial Characteristic Views

It is only by observing how a thing appears to change that its invariant properties can be gauged. Rotating a cup does not alter the cup, but it does alter the cup's appearance. If it is known which properties characterise a cup, as viewed from any angle, then the cup may be recognised from any viewpoint. Marr [18] suggested that objects can be recognised by making use of characteristic views of those objects. These are views which are relatively stable with respect to rotation. This approach has been developed and implemented in [19], and more recently in [20]. In support of Marr, there is evidence [21] that object identification in humans makes use of sets of characteristic views. This evidence suggests that recognition of an object presented at a particular orientation occurs by matching to a view which is interpolated across several stored views of that object. The questions of how it is decided which set of stored views (object) to use for interpolation, and how the interpolation process is executed, remain unanswered. A computer demonstration of the utility of this approach for objects defined as sets of 3D points is given in [22].
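The essence of recognition by interpolation over stored views can be conveyed with a short sketch (Python is used here purely for illustration; it is not part of [22]). An object is a fixed set of 3D points, a view is their 2D projection after rotation, and the recognition score for a novel view is a superposition of Gaussian units, each centred on one stored characteristic view. The number of points, the number of stored views and the unit width are illustrative assumptions rather than details of [22].

    import numpy as np

    # Recognition by interpolation over stored characteristic views, in the
    # spirit of Poggio and Edelman [22]. A "view" is the 2D orthographic
    # projection of a fixed set of 3D feature points after rotation.
    rng = np.random.default_rng(0)
    object_pts = rng.normal(size=(6, 3))          # six 3D feature points

    def view(points, angle):
        """Orthographic projection after rotation about the vertical axis."""
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
        return (points @ rot.T)[:, :2].ravel()    # 2N-dimensional view vector

    stored_angles = np.linspace(0, np.pi / 2, 5)  # five characteristic views
    stored_views = np.array([view(object_pts, a) for a in stored_angles])

    def recognition_score(v, sigma=1.0):
        """Superposition of Gaussian units, one centred on each stored view."""
        d2 = ((stored_views - v) ** 2).sum(axis=1)
        return float(np.exp(-d2 / (2.0 * sigma ** 2)).sum())

    novel = view(object_pts, 0.3)                 # same object, unfamiliar angle
    other = view(rng.normal(size=(6, 3)), 0.3)    # a different object
    print(recognition_score(novel), recognition_score(other))

A novel view of the same object falls near the stored views and so receives a high score, whereas a view of a different object does not; interpolation over a handful of stored views thus stands in for an explicit 3D model.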
3.2 Spatio-Temporal Characteristic Changes of View

Note: Throughout this section the term `object' is used for the sake of brevity only. The following discussion is intended to apply to any configuration of points in 3D space (e.g. rock faces, the surface of a path, the surface of turbulent water) for which it is desirable to compute some attribute (e.g. motion, distance, orientation, hardness) of the set.

Whereas characteristic views can be used to recognise configurations of points by interpolating over those views [22], spatio-temporal characteristic views can be used for recognition by interpolating over a set of stored spatio-temporal characteristic views. For example, the set of retinal changes induced by the rotation of a set of 3D points is sufficient to specify not only the rotation, but also the relative positions of the points in 3D space. In short, whereas characteristic views can be used to specify particular 3D spatial relations between points which characterise a given object, spatio-temporal characteristic views can be used to specify spatio-temporal relations between points which characterise a given object and the parameters associated with its motion. The relative retinal motions of those points not only specify the relations between the corresponding 3D points, but also their collective motion in 3-space. A simple example of recognition of a particular type of motion (not of an object) via spatio-temporal cues is that of the motion of wavelets on a river surface.

To paraphrase from the previous section: Rotating a cup does not alter the cup, but it does alter the changes over time in the cup's appearance. If it is known which changes characterise the rotating cup then the cup may be recognised if it is rotating.

It might be mistakenly thought that the above is no more than an interpretation of obtaining
`structure from motion'. Using a set of spatio-temporal characteristic views for recognition is different from using `structure from motion'. The latter uses motion to infer the atemporal structure of a 3D scene, whereas the former uses the effects of motion as a cue for recognition. These cues are not necessarily interpreted in terms of the 3D structure of a scene. In support of this argument, three examples are described in which the response to visual stimuli cannot be explained only in terms of the spatial structure of the stimuli.

Example 1. In investigating the specificity of the reactions of young geese to birds of prey flying overhead, a model was constructed with symmetric anterior and posterior wing edges, and with a `head' at each end of the body [23]. One head had a short `neck', and the other head had a long `neck'. This model elicited an escape response in young birds only when it was moved in the direction of the short `neck'. When the model was moved in this direction it gave the impression of a hawk in flight; when it was moved in the opposite direction it gave the impression of a goose. Thus it was not the atemporal shape that determined the responses, because the spatial structure was common to both directions; instead, the response was determined by shape in relation to the direction of movement (i.e. the spatio-temporal structure of the stimulus).

Example 2. The ability of mosquito-hunting dragonflies to recognise their prey does not depend upon the shape of the prey [23]. Instead, dragonflies react specifically to the type of motion associated with
flying mosquitoes.

Example 3. With regard to the case made for identification via dynamic cues, Johansson has provided ample evidence [24] that observers can identify moving objects (humans) for which the sequential process of extracting spatially defined features, followed by identification, appears to be impossible. Johansson's experiments consisted of showing observers a film of a person walking in the dark with a light attached to each major joint of each limb. Under these conditions a single frame is usually insufficient to evoke the perception of a person. However, only a few frames allow the observer to perceive a moving person. It is the relative motion of points (lights), and not their static configuration in any single frame, that evokes the perception of a person. Certainly observers are familiar with the relative positions of major joints on a static human body, but this is insufficient to evoke the perception of a person. However, observers are also familiar with the sets of changes in the relative positions of major joints associated with walkers. Unlike purely positional information, derivatives of positional information with respect to time are sufficient to unambiguously specify the figure of a person. Johansson claims that such changes over time are sufficient to adequately specify a person, and even the identity of a person:

"I do know him by his gait; He is a friend." Julius Caesar, W. Shakespeare.

Mather et al. investigated the cues responsible for perceiving Johansson figures. In support of the proposal (above) that the effects of motion can be used directly as a cue for recognition, without first interpreting such motion in terms of the 3D structure of a scene, they conclude: "The visual system may rely heavily on detecting such [wrist and ankle] characteristic movement patterns during recognition of moving images, rather than on constructing a full structured representation of the body." Mather et al., page 155, 1992 [25].

Evidence that neurons in the temporal cortex respond selectively to walking humans is provided in [26]. Each of these neurons responded only to the image of a human walking forward. Images of a human walking backwards in the same direction (as the forward walker) did not evoke a response. About 40% of the neurons that responded to human walkers also responded to Johansson movies of human walkers.

Proposing that spatio-temporal characteristic views should be used for computer vision does not specify how this could be accomplished. For Johansson figures this could be implemented by modifying the ANN described in [22] so that each unit is associated with a particular set of contiguous views of a Johansson figure. Each unit specifies the extent to which the inputs match its own preferred input. The final output is obtained by interpolating over several sets of contiguous views (that is, by forming a superposition of unit outputs). As in [22], the `preferred' set of views of each unit can be adapted so as to optimise performance of the ANN; a sketch of this scheme is given below. As a researcher engaged in the construction of computational models of vision I am acutely aware that such suggestions are easier to propose than to implement. In defence, I refer the reader to the first sentence of this paper. The purpose of this paper is to propose an approach (which is inevitably more nebulous than a computational theory) as a first step in the process of constructing computational theories that are consistent with it.
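The following sketch extends the static interpolation scheme above to spatio-temporal characteristic views, in the spirit of the modification of [22] just described (again in Python, with a toy dot figure and all parameter values as illustrative assumptions). Each unit is centred on a short contiguous sequence of views of a moving dot pattern; a novel sequence of the same motion scores highly, whereas the identical frames presented in reverse temporal order (an echo of the backward walkers in [26]) do not.

    import numpy as np

    # Spatio-temporal characteristic views: each unit stores a short
    # contiguous sequence of views of a moving dot pattern, and a novel
    # sequence is scored by a superposition of Gaussian units, as in the
    # static sketch above. The dot "figure" and its linear drift are toy
    # assumptions standing in for a Johansson figure and its gait.
    rng = np.random.default_rng(1)
    joints = rng.normal(size=(8, 2))        # 2D "joint" positions
    gait = 0.1 * rng.normal(size=(8, 2))    # per-frame displacement pattern

    def sequence(phase, n_frames=4, reverse=False):
        """Concatenated views of the moving dots over n_frames time steps."""
        frames = [(joints + (phase + t) * gait).ravel()
                  for t in range(n_frames)]
        if reverse:                          # same views, wrong temporal order
            frames = frames[::-1]
        return np.concatenate(frames)

    # Units are centred on stored sequences that begin at different phases.
    stored = np.array([sequence(p) for p in np.linspace(0.0, 2.0, 6)])

    def score(seq, sigma=1.0):
        d2 = ((stored - seq) ** 2).sum(axis=1)
        return float(np.exp(-d2 / (2.0 * sigma ** 2)).sum())

    print(score(sequence(0.7)))                # familiar motion, novel phase: high
    print(score(sequence(0.7, reverse=True)))  # identical frames reversed: low

Because each unit is tuned to a trajectory rather than to a static configuration, the identical set of frames fails to excite the network when their temporal order is reversed, which is the behaviour required of a detector of spatio-temporal, rather than merely spatial, structure.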
4 Conclusion

I have argued that object identification is not an adequate characterisation of the problem of vision, and that too much emphasis has been placed on object identification within computer vision. The problem of vision consists of a multiplicity of tasks, of which object identification is only one. Others include visually guided behaviours such as walking, grasping and climbing. By continuing to concentrate efforts on object identification there is a danger that the formulation of the problem (of object identification) will provide methods which are not of general utility for vision, and which may not even be of use for object identification.

Object identification is an integral part of the general problem of vision. Consequently it is likely that solutions to the problem of vision and object identification share a common set of computational mechanisms. A broader definition of computer vision ensures that computer vision will be useful for general visually guided behaviours, as well as for identification of objects. I have suggested that one way to usefully broaden the definition of the problem of vision is to consider the use of spatio-temporal cues, not only as a means of estimating the atemporal structure of a scene, but directly as cues for accomplishing particular visual tasks which include (but are not necessarily uniquely associated with) object identification.

Acknowledgements: Thanks to Stephen Isard for comments on longer versions of this paper, and for suggesting Tinbergen's experiment as an example of the use of a spatio-temporal stimulus. Thanks to Raymond Lister and Helen Peddington for useful discussions on drafts of this paper. Thanks also to Mark Lee, Marcus Rodrigues, David Cliff and Inman Harvey for comments on an earlier draft of this paper. This work was undertaken as part of an MRC/JCI grant awarded to Mark Lee at the Department of Computer Science, University of Wales, Aberystwyth.
References
[1] DH Ballard and CM Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, New Jersey, 1982.
[2] S Zeki. The visual image in mind and brain. Scientific American, 267(3), September 1992.
[3] P Cavanagh. Multiple analyses of orientation in the visual system. In Neural Mechanisms of Visual Perception, Lam, D and Gilbert, C (Eds), pages 261-279, 1989.
[4] DH Ballard. Animate vision. Artificial Intelligence, 48:57-86, 1991.
[5] RA Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2:14-23, 1986.
[6] S Watanabe. Knowing and Guessing. J Wiley and Sons, 1969.
[7] DH Hubel and TN Wiesel. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106-154, 1962.
[8] HB Barlow. Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1:371-394, 1972.
[9] SR Lehky and TJ Sejnowski. Neural network model of visual cortex for determining surface curvature from images of shaded surfaces. Proceedings of the Royal Society of London (B), 240:251-278, 1990.
[10] DI Perrett, E Rolls, and W Caan. Visual neurones responsive to faces in the monkey temporal cortex. Experimental Brain Research, 47:329-342, 1982.
[11] I Fujita, K Tanaka, M Ito, and K Cheng. Columns for visual features of objects in monkey inferotemporal cortex. Nature, 360:343-346, 1992.
[12] MA Georgeson. Human vision combines oriented filters to compute edges. Proceedings of the Royal Society of London (B), 249:235-245, 1992.
[13] GE Hinton. Shape representation in parallel systems. In Proceedings of the 7th IJCAI, Vancouver BC, pages 1088-1096, 1981.
[14] DH Ballard. Parameter nets. Artificial Intelligence, 22:235-267, 1984.
[15] BL Whorf. Language, Thought, and Reality. MIT Press, New York, 1956.
[16] JHR Maunsell and WT Newsome. Visual processing in monkey extrastriate cortex. Annual Review of Neuroscience, 10:363-401, 1987.
[17] M Mishkin, LG Ungerleider, and KA Macko. Object vision and spatial vision: Two cortical pathways. Trends in Neurosciences, 6:414-417, 1983.
[18] D Marr. Vision. Freeman, New York, 1982.
[19] D Lowe. Perceptual Organisation and Visual Recognition. Kluwer Academic Publishers, Boston MA, 1985.
[20] AJ Bray. Recognising and Tracking Polyhedral Objects. PhD thesis, School of Cognitive and Computing Sciences, University of Sussex, UK, 1990.
[21] S Edelman and HH Bülthoff. Viewpoint-specific representations in three-dimensional object recognition. MIT AI Memo No. 1239, 1990.
[22] T Poggio and S Edelman. A network that learns to recognize three-dimensional objects. Nature, 343:263-266, 1990.
[23] N Tinbergen. The Study of Instinct. Clarendon Press, Oxford, 1951.
[24] G Johansson. Visual perception of biological motion and a model for its analysis. Perception and Psychophysics, 14:201-211, 1973.
[25] G Mather, K Radford, and S West. Low-level visual processing of biological motion. Proceedings of the Royal Society of London (B), 249:149-155, 1992.
[26] D Perrett, M Harries, AJ Mistlin, and AJ Chitty. Three stages in the classification of body movements by visual neurons. In Images and Understanding, Barlow, H, Blakemore, C, and Weston-Smith, M (Eds), pages 95-107, 1990.