Attentive Visual Recognition for Scene Exploration

Luiz Pessoa, Sergio Exel, Alexandre Roque, and Ana Leitão
COPPE Sistemas e Computação, Universidade Federal do Rio de Janeiro
Rio de Janeiro, RJ
{pessoa, exel, roque, [email protected]
Abstract
Vision is an active process in which behaviorally important information is selectively gathered. We present a model of scene exploration in which the high-resolution fovea is deployed to interesting regions by selective attention processes. The system switches between an exploratory mode and a recognition mode: in exploration mode, regions of interest are investigated; in recognition mode, a given region is carefully examined as recognition is attempted for that particular region of the scene. Model simulations illustrate its behavior.
1 Introduction

Vision is an active process. A central ingredient of active vision is the use of an attention mechanism that helps decide where in a scene to search for a particular object, or where within an object to gather additional information. The visual guidance provided by attention is crucial given the space-variant mapping between retina and cortex in the mammalian visual system (Schwartz, 1977). Attention, coupled with rapid (saccadic) eye movements, compensates for the decrease in resolution at the periphery and provides a powerful basis for visually guided behavior.

In most models of recognition, the task at hand consists in determining the correct label associated with an arbitrary image given a set of stored objects, or models. Typically the input image contains a single object of interest, which is a transformed (e.g., rotated) version of one of the stored models. Human visual recognition, on the other hand, is commonly confronted with "input images" that contain a multitude of objects, among which some may be familiar and others not. One important visual behavior is scene exploration, during which entire scenes are sequentially explored. In this process, known objects are recognized, and unknown objects are labeled as such, as the high-resolution fovea is deployed across the scene. This behavior should be distinguished from visual search, in which a given object or feature is actively searched for in a scene. Although top-down information, such as expectations for human faces, may be at play during scene exploration, no explicit priming of low-level features (such as horizontal edges or green regions) should be assumed. Nevertheless, the sequential deployment of the fovea across the scene is not random. A key issue is then establishing the factors that determine the scenic scan paths.

In this paper we develop an attentive visual recognition model for scene exploration. Our objective at this point is not to replicate human scan paths in given situations, but to investigate how a high-resolution fovea can be effectively deployed across a scene to recognize objects of interest and ignore less relevant ones. For concreteness we apply the model to the task of face recognition in scenes, although it was not especially designed for this domain.

2 Scene Exploration Model

Figure 1 shows the macroscopic model diagram. The input scene is initially represented as a space-variant complex-cell map, where oriented edge information is made available. The entire process of scene exploration is built upon a pyramid structure of complex-cell (oriented edge) responses. The model includes two main components: a recognition system and an exploration system, which interact to determine how scenic regions are investigated. System operation switches between an exploration mode and a recognition mode. In exploration mode, regions of interest are investigated. In recognition mode, a given region is carefully examined as recognition is attempted for that particular region of the scene.

Initially, the entire scene is used to generate an interest map that determines the regions of interest to be investigated. Each region of interest is then visited and recognition attempted. In our system, single-object recognition is not a monolithic process (see Exel and Pessoa (1998) for details). Instead, it engages an attentive system that selectively deploys a high-resolution fovea (with its accompanying periphery) across the image through small saccadic movements, called recognition saccades. The foveation process gradually builds up information until recognition is attained; that is, the system attempts to recognize the object given partial information. When a decision criterion is met, the object is recognized, leading to potential behavioral actions. As long as the available information is insufficient to support recognition, the foveation process is engaged in order for the system to gather new information. The determination of the next foveation utilizes both bottom-up and top-down information from the set of stored models, and corresponds to the process of selective attention within the model. The foveation process continues until the object is recognized (the process may be interrupted after a large number of cycles, in which case the object is not recognized). Once recognition succeeds for a single object, the system switches back to the exploratory mode. The whole process of exploration and recognition iterates until there are no further regions of interest with high activation in the interest map.

Figure 1: Attentive scene exploration model. A complex-cell space-variant representation is employed by a recognition and an exploration system. The large black disks represent regions of interest in the scene specified by the interest map. The long arrow connecting them is a large exploratory saccade. The small white circular areas represent partial information gathered by recognition foveations in order to incrementally recognize objects of interest. The short white arrows connecting them are small recognition saccades.

Below we specify the model components in detail.

Space-Variant Representation. The representation adopted herein is based on cell responses of the early visual system. The image is initially filtered with center-surround (unoriented) filters that resemble retinal ganglion cells. Unoriented responses are subsequently processed by elongated filters sensitive to contour orientation. Filters sensitive to direction-of-contrast (light-dark and dark-light), resembling cortical simple cells, are applied. Contrast-insensitive responses, such as those produced by cortical complex cells, are generated by pooling light-dark and dark-light simple-cell responses at all positions. We also pool across orientations to generate a final complex-cell map sensitive to image contours at all orientations (see Pessoa et al. (1995), Grossberg & Pessoa (1998), or Pessoa and Exel (1998) for further details). In recent years, such early filters have been applied successfully in a number of computer vision problems, notably in object recognition (Brunelli & Poggio, 1993; Rao & Ballard, 1996).

A key property of the organization of the visual system is the space-variant representation it utilizes. In the present system, a simplified space-variant representation was employed. An image pyramid was constructed in which the above (single-scale) computations were replicated, where needed, for 4 scales (see Figure 2). A space-variant complex-cell map was then conveniently obtained by utilizing only the high spatial frequency information for the foveal region, and employing lower spatial frequency information for three rings around the fovea, each ring with correspondingly lower spatial frequency oriented edge information. The activities of this space-variant complex-cell map formed the basis for the feature vector employed by the recognition process (see below).

2.1 Exploration System
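As a concrete sketch of the space-variant sampling that both systems build on, the following illustrative function assembles features from a four-level pyramid, taking high-resolution samples inside the fovea and progressively coarser samples in three surrounding rings. The function name, the use of square rings, and the ring radii are assumptions for illustration, not the authors' implementation.

```python
def space_variant_features(pyramid, cx, cy, ring_radii=(2, 4, 8, 16)):
    """pyramid: list of 4 complex-cell maps (2D lists), finest first;
    each level halves the resolution of the previous one.
    Returns (level, x, y, response) tuples around the fixation (cx, cy):
    level 0 inside the fovea, levels 1-3 for the three surrounding rings."""
    features, seen = [], set()
    r_max = ring_radii[-1]
    for dy in range(-r_max, r_max + 1):
        for dx in range(-r_max, r_max + 1):
            r = max(abs(dx), abs(dy))                  # square "rings" for simplicity
            level = next(l for l, rad in enumerate(ring_radii) if r <= rad)
            # sample the chosen level at the correspondingly reduced coordinates
            x, y = (cx + dx) >> level, (cy + dy) >> level
            m = pyramid[level]
            if (level, x, y) not in seen and 0 <= y < len(m) and 0 <= x < len(m[0]):
                seen.add((level, x, y))
                features.append((level, x, y, m[y][x]))
    return features
```

The resulting list plays the role of the space-variant feature vector: dense sampling at the fovea, sparse sampling toward the periphery.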
Determining Regions of Interest. In the real world, a scene contains not only a set of objects of potential interest, but also a host of irrelevant information, such as the background. A key problem for scene exploration is therefore determining the regions of interest to be investigated. Indeed, effective exploration depends on the efficient determination of candidate regions, that is, regions with a high likelihood of containing objects of interest, while minimizing the regions that contain irrelevant information. Regions of interest are determined by an interest map computed on the basis of both bottom-up and top-down information. The complex-cell map generated by oriented filtering comprises the bottom-up component. The top-down component is based on the priming of expected shapes (object outlines). One class of shape that commonly occurs in scenes is faces, and in this paper the priming of faces comprises the top-down component. Complex-cell map activities for a typical face determine a prototype mask that is then used to search the scene for similar shape outlines. Here we take advantage of the pyramid structure employed: instead of searching at the highest resolution (the original image resolution), we search at a lower resolution (the next-to-last level in Figure 2).

Exploratory Saccades. The interest map determines the eligible regions to be investigated. The higher the activation value in the interest map, the higher the precedence given to the investigation of that region. After recognition is attempted at one location, a new scenic region is visited. This region will necessarily be positioned some distance away from previously investigated locations, since after recognition, long-lasting inhibition of nearby locations ensues. In this way, large exploratory saccades are generated and other scenic regions are visited.
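The two steps above can be sketched as follows: a prototype mask is correlated with the coarse complex-cell map to form the interest map, and regions are then visited in decreasing order of activation, with inhibition of the neighborhood of each visited location. Names, the inhibition radius, and the threshold are illustrative assumptions, not the authors' code.

```python
def correlate(coarse_map, prototype):
    """Valid-mode 2D cross-correlation of a prototype mask over the coarse map."""
    H, W = len(coarse_map), len(coarse_map[0])
    h, w = len(prototype), len(prototype[0])
    out = [[0.0] * (W - w + 1) for _ in range(H - h + 1)]
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            out[y][x] = sum(coarse_map[y + j][x + i] * prototype[j][i]
                            for j in range(h) for i in range(w))
    return out

def exploratory_saccades(interest, inhibit_radius=2, threshold=0.0):
    """Yield fixation targets in decreasing order of interest, inhibiting
    locations near already-visited ones (long-lasting inhibition)."""
    interest = [row[:] for row in interest]          # work on a copy
    while True:
        v, x, y = max((v, x, y) for y, row in enumerate(interest)
                      for x, v in enumerate(row))
        if v <= threshold:
            return
        yield (x, y)
        for j in range(-inhibit_radius, inhibit_radius + 1):
            for i in range(-inhibit_radius, inhibit_radius + 1):
                if 0 <= y + j < len(interest) and 0 <= x + i < len(interest[0]):
                    interest[y + j][x + i] = 0.0     # inhibit the neighborhood
```

Local maxima of the correlation output would serve as the activation values of the interest map; successive fixations are then necessarily some distance apart, producing large exploratory saccades.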
2.2 Recognition System
Model-to-Scene Alignment. Once regions of interest (or candidate regions) are determined, the recognition process can be initiated. The representation adopted in this work is topographic (oriented responses at given spatial locations). Scene and model have different sizes and origins, defining a scene space and a model space. Hence it is necessary to align the models to the scene. How can we align scene and model without solving the very problem we set out to solve, namely, recognizing the object at the position of interest in the scene? Aligning by sliding all models across the part of the scene indicated by the current point of interest is unfeasible given its high computational cost. Our suggestion is to make use of the pyramid structure adopted in order to greatly reduce computation: sliding is performed at the lowest resolution of the pyramid. Moreover, only a small window around the point of interest is used to determine the correlation between scene and model. In our simulations we have obtained reliable results with 3x3 windows. What is the result of alignment? Alignment results in a list of locations in candidate models (stored models). The list of locations is then visited in order of correlation score as the system attempts to recognize the object. In all, the model-to-scene alignment attempts to quickly generate a list of potential candidate models that can then trigger a more extensive recognition process.

Categorization Module. The categorization module is responsible for determining whether an incoming image matches one of its model objects. It comprises a Fuzzy Adaptive Resonance Theory neural network (Carpenter et al., 1991) extended to allow for incremental feature extraction (Aguilar & Ross, 1994); in this way, partial feature vectors are incrementally built up. The space-variant complex-cell map is used as a feature vector, which is compared to stored models in the category layer. Each category node stands for a model object, and the node with the highest activation value indicates the category of the input.
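The category-choice step can be illustrated with the standard Fuzzy ART choice function of Carpenter et al. (1991). This is a minimal sketch of category choice only, with no learning, no complement coding, and an assumed parameter value; it is not the authors' extended incremental-feature network.

```python
def fuzzy_art_choice(I, weights, alpha=0.001):
    """Fuzzy ART choice function (Carpenter et al., 1991):
    T_j = |min(I, w_j)| / (alpha + |w_j|), with |v| the L1 norm.
    I: input feature vector; weights: one weight vector per category node."""
    def l1(v):
        return sum(v)
    return [l1([min(a, b) for a, b in zip(I, w)]) / (alpha + l1(w))
            for w in weights]

def categorize(I, weights, alpha=0.001):
    """The category node with the highest choice value wins."""
    scores = fuzzy_art_choice(I, weights, alpha)
    return max(range(len(scores)), key=scores.__getitem__), scores
```

In the full model, the winner's activation would additionally have to exceed the runner-up's by a fixed margin before recognition is accepted.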
Similar activation values at the category layer indicate that the input is consistent with several stored models. This condition indicates that more information is needed, and the attentional recognition saccade system is engaged. Recognition is successfully terminated only when the highest activation value exceeds the second highest by a certain fixed amount; otherwise another image patch must be considered.

Recognition Saccade System. Single-object recognition proceeds by use of partial information that is incrementally built up. At a given moment, only part of a scene is used for comparison with the stored models. What are the critical features that can most aid the recognition process? Some researchers have employed fixed regions to provide informative features, such as the eyes (Brunelli and Poggio, 1993; Tistarelli, 1995), while others have suggested that important features are those that have high energy in operator response (Alpaydin, 1996). Our approach has been to combine both bottom-up and top-down information: positions with the greatest discrepancy between complex-cell responses and model activations are favored. The recognition saccade system builds a saliency map as follows:

    T_pq = sum_i y_i ||c_pq - m_pq^i||    (1)

where y_i is the activation of category i, c_pq is the complex-cell edge response at position pq, and m_pq^i is the complex-cell response of stored model i at the same position. The values T_pq specify a space-variant saliency map such that the highest activity determines the center of the subsequent foveation. The foveation system also employs inhibition of return: once a position is visited, it should be prevented from being revisited, since the associated local data have already been incorporated into the input feature vector. Hence, after foveation a region is inhibited in the saliency map.
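A minimal sketch of the saliency computation of Eq. (1), together with inhibition of return and the margin-based stopping criterion, might look as follows. The function names, the scalar treatment of complex-cell responses (the equation uses a norm over response vectors), and the margin value are illustrative assumptions.

```python
def saliency_map(y, c, models, visited):
    """T[p][q] = sum_i y[i] * |c[p][q] - models[i][p][q]|, with visited
    positions inhibited (set to -inf) so they are not refixated."""
    P, Q = len(c), len(c[0])
    T = [[0.0] * Q for _ in range(P)]
    for p in range(P):
        for q in range(Q):
            if (p, q) in visited:
                T[p][q] = float("-inf")      # inhibition of return
            else:
                T[p][q] = sum(yi * abs(c[p][q] - m[p][q])
                              for yi, m in zip(y, models))
    return T

def next_foveation(T):
    """The position of highest saliency drives the next recognition saccade."""
    return max(((T[p][q], p, q) for p in range(len(T))
                for q in range(len(T[0]))))[1:]

def recognized(y, margin=0.2):
    """Recognition terminates only when the top category activation exceeds
    the runner-up by a fixed amount (the value here is an assumption)."""
    top = sorted(y, reverse=True)
    return len(top) < 2 or top[0] - top[1] > margin
```

Each foveation adds the local responses to the feature vector, the category activations are recomputed, and the loop repeats until the margin criterion is met or a cycle limit is reached.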
3 Simulations
Twelve images of faces on a uniform white background were initially stored by the model (the images were obtained from the MIT Eigenfaces database used by Turk and Pentland (1991)). To test the model, the original images were pasted at arbitrary positions on richly textured background images. A typical simulation is shown in Figure 2. The original image is initially represented as a space-variant complex-cell map; in practice this was done by constructing a four-level image pyramid. Level 3 (next to last) was then used to determine the interest map. The interest map specifies candidate regions that should be explored. In the present version, top-down priming of faces determined the activations at the interest map. Priming was accomplished by computing the correlation between a prototype complex-cell map of a face and the image (at Level 3). Local maxima of the correlation measure were then determined and specified the centers of candidate regions. For the simulation shown, 8 candidate regions were obtained (marked with a cross or a circle). The 8 candidate regions were then visited according to the activation values at the interest map. In the process, 4 faces (marked by circles) were recognized and one was missed (cross). Three of the recognized faces were recognized with the very first saccade, while one (the lower left face) triggered 2 small recognition saccades; for this face, recognition saccades on the background also gathered face information. Note also that one of the faces (the rightmost face) was recognized even though the marked point of interest was positioned slightly above the face. In this case, the complex-cell information of the region centered around the point was deemed sufficient for reliable recognition (at the same criterion levels as the other faces). The remaining background positions visited were rejected at the outset, such that no small recognition saccades were produced. It is our hope that in the future, more robust priming methods and a more elaborate computation of the interest map can eliminate these uninteresting background candidate regions. Nevertheless, in general, there will always be false positives that have to be checked if one is not to miss important objects.

Acknowledgements. This research was supported in part by CNPq/Brazil grant 520419/96-0.
References
Aguilar, M. & Ross, W.D. (1994). Incremental ART: A neural network system for recognition by incremental feature extraction. Proceedings of the World Congress on Neural Networks (WCNN-94).

Alpaydin, E. (1996). Selective attention for handwritten digit recognition. In Advances in Neural Information Processing Systems 8 (NIPS'95), D. Touretzky, M. Mozer & M. Hasselmo (Eds.). MIT Press.

Brunelli, R. & Poggio, T. (1993). Face recognition: Features versus templates. IEEE PAMI, 15, 1042-1052.

Carpenter, G., Grossberg, S., & Rosen, D. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759-771.

Exel, S. & Pessoa, L. (1998). Attentive visual recognition. To appear in International Conference on Pattern Recognition, Brisbane, 16-20 August.

Grossberg, S. & Pessoa, L. (in press). Texture segregation, surface representation, and figure-ground separation. To appear in Vision Research.

Pessoa, L., Mingolla, E., & Neumann, H. (1995). A contrast- and luminance-driven multiscale network model of brightness perception. Vision Research, 35, 2201-2223.

Rao, R.P.N. & Ballard, D. (1996). An active vision architecture based on iconic representations. Artificial Intelligence (Special Issue on Vision), 78, 461-505.

Schwartz, E. (1977). Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25, 181-194.

Tistarelli, M. (1995). Active/space-variant object recognition. Image and Vision Computing, 13(3), 215-226.

Turk, M. & Pentland, A. (1991). Face recognition using eigenfaces. Journal of Cognitive Neuroscience, 3, 71-86.
Figure 2: Space-variant representation: The first 4 images show the complex-cell image pyramid employed. Note that while responses are shown for all positions at all levels, they need not all be computed; their computation can be determined by the interest and saliency maps. Scene exploration: Of the set of 8 candidate regions explored, 4 faces are recognized (marked by circles), one is missed (cross), and three are correctly rejected (crosses).