Attention as Selection-for-Action: A Scheme for Active Perception Christian Balkenius, Nils Hulth Lund University Cognitive Science Kungshuset Lundagård S-222 22 Lund, Sweden
[email protected] [email protected]
Abstract
We propose three principles for attentional control of actions in autonomous robots. (1) Attention as action suggests that attentional shifts and the selection of the focus of attention should be seen as actions rather than as purely sensory processes. (2) Selection-for-action suggests that actions should be implicitly controlled by the current focus of attention. (3) Deictic reference is a method of referring to external objects without explicitly representing all of their properties. The three principles are illustrated in two examples: first for a mobile robot and second for a visually controlled manipulator. In the second example we also report two learning experiments in which a robot picks out the correct focus of attention for a task using reinforcement learning.
1 Introduction
While attentional mechanisms are often seen as a necessary evil that has to be included because of insufficient computational power, we want to suggest that selective processing is not only a matter of limited resources but an effective way of processing sensory information. The scheme we present is based on a number of principles that have been suggested as aspects of human cognition. The first is that attentional shifts should be seen as actions in their own right and not primarily as sensory processing. The second is that actions use the current attentional window as an implicit argument: actions are performed on or toward the object or location in the focus of attention. The final principle is the use of deictic reference (Ballard et al., 1997). In agreement with Allport (1990) we call the overall scheme selection-for-action. Together these principles allow for a very efficient implementation of perception and action for a robot. They allow for a control system that combines top-down and bottom-up processing in an efficient way to handle many complicated situations.
1.1 What is Attention?
Attention can be seen as a filter that decides how sensory signals should be processed. For an image from a video camera, an attentional system can be used to select only a part of the image for processing. For auditory processing, it can be used to filter the sound with respect to a specific type of source or location. The goal of the attentional system is to let through only those signals that are of current use to the robot and to make the processing of sensory data easier. Although the output from the attentional system will be easier to process than the totality of the sensory signals, it is wrong to consider attentional processing merely as a way of reducing the computational burden of the whole system. While this is true in visual processing when only a part of the input image is selected for processing, the situation is sometimes the opposite. In sound processing, for example, it is much more computationally expensive to filter out signals from one specific source than to let everything through.
Filtering can be based either on the specific content of the source or on its spatial location, that is, the focus of attention can be either feature related or spatial. In the first case, a specific object or sensory pattern is selected. In the second case, it is the spatial location that is attended to. In most cases, attention is not purely feature related or spatial but a combination of both. Attending to a coffee cup on the table focuses on both its spatial location and its perceptual features, such as its color and shape. The cues used to control attention can also be of these two types (figure 1). A spatial cue can be used to direct sensory processing and select the object at that location. A feature cue can be used to find the location of an object that matches it. For example, selecting the correct location on the table will place the coffee cup in the focus of attention. The same attentional state will result if the cup shape is used as the reference.
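As a minimal illustration of the two types of cue, the following sketch selects a region of a toy scene either from a spatial cue or from a feature cue. The grid representation and all names are expository assumptions only; the point is that either cue results in the same kind of attentional state: a location together with the features found there.

import numpy as np

# Toy "scene": each cell of a 4x4 grid holds a small feature vector
# (for instance a coarse color histogram) standing in for the sensor data.
rng = np.random.default_rng(0)
scene = rng.random((4, 4, 3))

def attend_spatial(scene, location):
    """Spatial cue: select the region at the cued location and return
    its location together with the features found there."""
    return location, scene[location]

def attend_feature(scene, feature_cue):
    """Feature cue: find the location whose features best match the cue
    and return that location together with its full feature vector."""
    rows, cols, _ = scene.shape
    distances = np.linalg.norm(scene.reshape(rows * cols, -1) - feature_cue, axis=1)
    best = int(np.argmin(distances))
    location = (best // cols, best % cols)
    return location, scene[location]

# Either cue opens the same kind of channel: a location plus its features.
loc, features = attend_spatial(scene, (2, 1))
loc2, _ = attend_feature(scene, features)   # cueing with the features finds the same region
assert loc == loc2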
[Figure 1 diagram: a location cue or a feature cue enters the attentional system, which selects the matching location and feature information from the sensors.]
Figure 1 The attentional system uses spatial or feature cues to direct the focus of attention to a region that matches the cue. It subsequently opens up a sensory channel that includes both spatial information and sensory features.
The attentional state is a function of both the internal cues and the external world. The selected attentional state is usually much more specific than the cue used to set it up. If the coffee cup is the only blue object on the table, simply using blue as a cue will be sufficient to attend to the cup. This is in essence what Ballard et al. (1997) call a deictic code. A deictic code is used as an implicit reference that points at an object. The cue blue references the coffee cup and can be used to retrieve all the sensory properties of the cup although it does not in itself represent all of its properties. One way to view the attentional system is as a content addressable memory. Just as a part of a stored record can be used to look up data in memory, a cue sent to the attentional system is used to look up its referent in the external world.
Figure 2 Three spaces with different attentional systems: body space, reaching space and locomotor space.
1.2 Attentional Spaces
The focus of attention described above always has a spatial component and it is useful to consider what types of spatial coordinate systems are relevant to a robot.
In the cognitive literature one finds references to three types of coordinate systems or spaces (figure 2). The first is called body or personal space and is the system used for interaction with the body, for example in grooming. The second is called reaching space or extra-personal space and consists of a volume centered at the body within reach of the hands. Last, there exists a larger space around the body, called locomotor space, that can only be reached by moving the body from one place to another. While the first type of space may not be of much use to current mobile robots, the latter two correspond to the volume reachable by the manipulators of the robot and the area that can be reached by navigation. The attentional state of a robot can reference either of these spaces. Below we will give examples of attentional systems both for locomotion (section 3) and for reaching (section 4).
These spaces can further be divided into areas that can be probed with the current directions of the sensors and areas that require an active redirection of the sensors, or a movement of the robot, before they can be perceived. We call the first part of the different spaces visible and the other hidden. In descriptions of human attention, a shift of attention that redirects the sensors, that is, causes eye movements, is called overt, while a change of attention that is not visible from the outside is called covert. To view a hidden part of a space, an overt attentional shift is always necessary, while for the visible part of the spaces an overt attentional shift is not strictly necessary although it can still be desirable. In humans, for example, overt attentional shifts are used in the already visible space to move the high resolution area of the retina (the fovea) to the current area of interest.
1.3 Intrinsic and Extrinsic Cues
It is also useful to distinguish between the two types of attention called extrinsic and intrinsic. Extrinsic attention is triggered by external signals such as a sudden motion or sound. One way to define this type of attention is as attentional shifts that appear after the attended event has occurred. Intrinsic attention, on the other hand, is characterized by an attentional shift that occurs in anticipation of an event or in search of an object and is thus internally controlled. Relating these two types of cues to the different types of spaces described above, we see that extrinsic attention can only be caused by events within the visible part of the attentional spaces, while the target of intrinsic attention can be located anywhere in the different spaces.
1.4 Attentional Priming
The last aspect of attention that we need to consider is the role of top-down processing in the control of attention. There are various forms of top-down effects in attention,
from the very specific, as in visual search, to the more diffuse, such as contextual priming. More efficient processing of sensory information will result if the attentional shifts are part of a schema (Johnston and Dark, 1986). Here, a schema is a representational structure that states the context for a sequence of actions. A schema can be used to prime the attentional system to favor certain sensory patterns rather than others. Schema-based attention lies behind the ability to quickly attend to stimuli that are of relevance to our current goals (Johnston and Dark, 1986). This ability is based on a priming mechanism that can enhance attention to stimuli that are of use to the currently selected task and inhibit attention to irrelevant or already attended cues. In the first case, the priming is said to be positive. In the second case it is negative (Johnston and Dark, 1986).
1.5 Attentional Fixation and Shift
The control of attention can be divided into two phases: (1) shift and (2) fixation. An attentional shift occurs when a new cue, intrinsic or extrinsic, is delivered to the attentional system or when the priming is changed. This will cause the attention to change in the way directed by the new cues or primes. Fixation is an active process that locks the attentional system to the currently attended stimulus or location. Fixation has two roles. The first is to block out stimuli other than the attended one in order to control what is let through the attentional filter. The second is to track changes in the attended stimulus or its location. In the case of a visually attended object, fixation will cause the visual system to track the object if it moves.
1.6 Summary
Figure 3 summarizes the influences on attention described above. A spatial cue directs attention to a certain location. This location can be a specific point or a whole region of space. A feature cue draws attention to a specific set of features. If the feature cue is very specific, the result will be a search. When these two cues are combined, a search for a set of features at a specific location or region of an attentional space will be performed. The current goals and active schemas will influence the process through priming. Different objects will be positively primed in relation to how useful they are for the current task. The attentional system will thus be more likely to attend to relevant stimuli than to irrelevant ones. It is also possible for negative priming to occur, which will block attention to stimuli that have been learned to be irrelevant but that would otherwise command attention because of their high salience.
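A rough sketch of how these influences could be combined is given below. The multiplicative weighting of cue matches and primes, and all the names, are illustrative assumptions only, not the mechanism used in the implementations described in the following sections.

import numpy as np

def attention_map(feature_match, spatial_match, positive_prime, negative_prime):
    """Combine bottom-up and top-down influences into a single map over
    candidate regions. Each argument holds one score per region: how well
    the region matches the feature cue, how well it matches the spatial cue,
    and how strongly it is positively or negatively primed."""
    salience = feature_match * spatial_match          # both cues must agree
    salience = salience * (1.0 + positive_prime)      # task-relevant regions are boosted
    salience = salience * (1.0 - negative_prime)      # known-irrelevant regions are suppressed
    return salience

def select_focus(salience):
    """The focus of attention is the region with the highest combined score."""
    return int(np.argmax(salience))

# Example: four candidate regions; region 3 is salient but negatively primed,
# so the moderately salient, positively primed region 1 wins.
feature_match  = np.array([0.2, 0.6, 0.3, 0.9])
spatial_match  = np.array([1.0, 1.0, 0.1, 1.0])
positive_prime = np.array([0.0, 0.5, 0.0, 0.0])
negative_prime = np.array([0.0, 0.0, 0.0, 0.9])
print(select_focus(attention_map(feature_match, spatial_match, positive_prime, negative_prime)))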
2. Attention and Action
Selective attention is a valuable method for controlling the actions of a robot. The general scheme is to consider
attentional shifts and fixations as actions that are executed in sequence with ordinary motor commands, which are themselves controlled by the current attentional state.
Figure 3. Signals influencing the focus of attention. Spatial and feature cues are internally generated and used to direct attention. Priming selects targets relevant to the current goals of the system.
We want to propose three important principles for attentional control of actions. These are (1) attention as action, (2) selection-for-action and (3) deictic reference. The three principles are illustrated in the following example. Consider a mobile robot that is about to move toward the door of a room. This would minimally correspond to the following two actions:
1 attend-to(door)
2 approach
The first line exemplifies the principle of attention as action since shifting the attention is considered an action and not primarily a sensory process. Since attending is an action, it can be ordered in sequence with other actions and depend on the results of previous actions. The second line shows the principle of selection-for-action since the currently attended item is used as an implicit argument. The object selected by attention is used as the goal object of the action. To execute the action, this implicit argument can be used either when the action is initiated or to actively control the whole action. An example of the first would be a robot arm that reaches for a location that has first been selected by attention. An example of the latter was presented in Balkenius and Kopp (1996a, b, 1997) and Balkenius (1998), where the target is actively tracked while it is being approached by a mobile robot. A view of object reference related to selection-for-action has been described by Agre and Chapman (1987). In their computer program Pengi, which played the arcade game Pengo, they made extensive use of task-related variables that referred to an external object of immediate importance to the game, such as the-bee-that-is-chasing-me or the-block-I'm-pushing. The example also implicitly demonstrates the third principle of deictic reference. The attentional action sets up a reference to an external object, in this case the door.
Since this door will be actively processed by the sensory system as long as it is attended, there is no need for any complex internal representation of it. As long as the attentional channel is kept open all perceptual properties of the door can be derived directly from the sensors. The world is thus used as its own best model (Brooks, 1991). As pointed out by Ballard et al (1997) this has some important consequences for the internal representation of the environment. If all properties of an object can be collected from the environment, the internal representation can be minimally specific as long as it is sufficient to guide the attention toward the object. If the door is the most salient green object in the room, the 'door' argument could be equivalent to 'green'. In this case, 'green' is used as a deictic reference to the door since 'attend-to(green)' would give the same result as 'attend-to(door)' (Ballard et al. 1997). One important aspect of deictic representations is that they do not try to model the external objects. They are therefore simple to implement. Another significant property is that they are grounded in the sensory abilities of the robot (cf. Balkenius 1998). Very little translation between the internal representation and the sensory system is needed.
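The sketch below shows one way the first two principles, together with deictic reference, might be expressed in code. The robot interface (find_matching_object, move_toward) and the color-based cue are hypothetical stand-ins for whatever sensors and effectors a particular robot provides; the point is only that approach takes no explicit target but acts on the current focus of attention.

# Minimal sketch of attention-as-action and selection-for-action.
# The attentional state is the only thing shared between the two actions:
# 'approach' takes no explicit target, it acts on whatever is attended.

class Robot:
    def __init__(self):
        self.focus = None                    # current focus of attention

    def attend_to(self, cue):
        """Attentional shift as an action: a deictic cue (e.g. 'green') is used
        to look up an object in the world; only the cue is stored internally,
        the object's other properties stay out in the world."""
        self.focus = self.find_matching_object(cue)

    def approach(self):
        """Selection-for-action: the current focus is the implicit argument."""
        if self.focus is not None:
            self.move_toward(self.focus)

    # -- hypothetical sensor/effector interface --------------------------
    def find_matching_object(self, cue):
        print(f"searching the image for something matching '{cue}'")
        return cue                           # stand-in for a tracked image region

    def move_toward(self, target):
        print(f"moving toward the attended region ({target})")

robot = Robot()
robot.attend_to("green")                     # 'green' acts as a deictic reference to the door
robot.approach()                             # approaches the door without representing it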
3. Selection for Action by a Mobile Robot
One implementation of selection-for-action in a mobile robot has been presented by Balkenius and Kopp (1996a, b, 1997). In this implementation, elastic templates for visual landmarks are used to control the navigation of the robot. Navigation from one place to another is divided into a sequence of approach actions where landmarks are approached in turn until the goal is reached. A preference order defined over all visual templates is used to select the currently visible template that is most desired, that is, the one that leads most rapidly to the goal (Balkenius 1998). This preference order is essentially a schema in the sense described above and the selection of the most preferred template is a form of schema-based attention. The following example shows the basic structure of the navigation.
while(not-at-goal)
1 attend-to(preferences)
2 approach
The robot continuously selects the most preferred visible target in accordance with its current preferences and tries to approach it. Note that positive priming is the only mechanism used to select the target for attention. The architecture also includes mechanisms for searching for visual templates that are not currently in view but are expected to be visible after turning the camera. This makes it possible for the robot to approach locations that are in the hidden area of locomotor space. The longer example below demonstrates this aspect of the system.
while(not-at-goal)
1 attend-to(preferences)
2 fixate
3 approach
4 attend-to(expected-template-location)
5 saccade
6 attend-to(preferences)
7 if(more-preferred-template-found) fixate
The different steps have the following roles. (1) As before, the robot first attends to the most preferred target currently in view. (2) It then fixates the currently selected target to keep it in focus and track it while the robot is moving. (3) The approach behavior is selected and activated and stays on until explicitly turned off or superseded by another behavior. This will make the robot approach the location of the currently selected target in the focus of attention. (4) An intrinsic attentional shift initiates a search outside the visible area for a target more preferred than the current one. The argument to attend-to is the spatial cue for an expected location of a visual template closer to the goal. These expectations are derived from the currently attended template. The focus of attention is thus again used as an implicit argument to the action. (5) A saccade, that is, a turn of the camera, is made toward the attended location. (6) After the anticipatory saccade, the robot searches for the most preferred template in the new image. (7) If the template found is better than the previous one, a decision is made to fixate this new template instead. If no better template is found, the robot will turn the camera back and continue to track the original target. The actions above demonstrate an important aspect of the system. Actions like fixate, approach and saccade are not limited to a fixed time slot but are ways of controlling the system dynamics. When the approach behavior has been activated it will stay on and control locomotion until it is shut off or replaced by some other behavior. A more interesting case occurs after the saccade. The new position of the camera is only upheld for a limited time if no new target is fixated. This will make the camera return to its initial position automatically if the saccade did not find a more preferred target. By letting system dynamics such as these take care of the details of behavioral execution, the control schema becomes very simple.
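A rough, runnable paraphrase of a single pass through this loop is sketched below. All of the primitives are hypothetical stubs standing in for the vision and control system of Balkenius and Kopp; in the real system they modulate ongoing dynamics rather than run for a fixed time slot.

# Rough paraphrase of one pass through the navigation loop above.
class NavRobot:
    def __init__(self, templates):
        self.templates = templates           # visible templates, higher value = more preferred
        self.fixated = None

    def attend_to_preferences(self):
        """Select the most preferred currently visible template."""
        return max(self.templates, key=self.templates.get)

    def fixate(self, template):
        print("fixating", template)
        self.fixated = template

    def approach(self):
        print("approaching", self.fixated)   # stays active until superseded

    def saccade_to_expected(self):
        """Turn the camera toward where a better template is expected and
        report what becomes visible there (a stub for the real search)."""
        print("saccade toward expected location of a better template")
        return {"door": 0.9}                 # hypothetical newly visible template

    def step(self):
        current = self.attend_to_preferences()
        self.fixate(current)
        self.approach()
        new_view = self.saccade_to_expected()
        best_new = max(new_view, key=new_view.get)
        if new_view[best_new] > self.templates[current]:
            self.fixate(best_new)            # otherwise the camera drifts back on its own

NavRobot({"corner": 0.4, "poster": 0.6}).step()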
4. Selection for Action by a Manipulator
Selection-for-action is not only useful for mobile robots. Perhaps its role is even clearer in a robot that has to coordinate a vision system with a manipulator. In this case, the attentional system is used to select an object in the environment before the manipulation is performed on
it. Ballard et al. (1997) describe a number of actions that are useful for a robot that manipulates blocks, such as grasp, lift-up and put-down. They suggest that humans use fairly rough visual features of the target objects as references to them. These features are not used to describe the objects but only as cues for the attentional system. This idea is based on studies of the human attentional system, which appears to work in a similar way (Treisman and Gormican, 1988). In this section, we report experiments with attentionally driven manipulation in a reinforcement learning situation. When the action sequence for object manipulation is controlled by reinforcement learning, it would be useful if both the sensory and the motor aspects of the task could be controlled in the same way. Here we investigate how a robot can learn both appropriate feature cues and appropriate spatial cues using reinforcement learning. The idea is that reinforcement should be able to control actions as well as attention. We have performed experiments on two tasks. The first is to learn the part of reaching space where an object can be located. If, for example, the task for the robot is to move things from left to right on a table, it will have to learn that the action lift-up is appropriate only for objects to the left and the action put-down is appropriate only to the right. This can be accomplished through attention controlled by reinforcement. In the second experiment, we studied how a robot could learn the appropriate visual cues to use as deictic references to objects in the scene.
4.1 Experiment A: Selection of Spatial Cues
In this experiment the robot was presented with a scene on a table. To simplify the task, the robot was not allowed to move its camera. It was thus only able to learn spatial cues in the visible part of reaching space. The scene was divided into 64 discrete regions as shown in figure 4 and the target area consisted of the 16 regions to the right.
Figure 4. The scene and the target area.
At each trial, targets were placed in some of the 64 regions and the robot suggested one of them. The robot then received positive reinforcement if the selected region was part of the target area and negative reinforcement otherwise. To speed up the process, the multi-scale learning approach introduced by Balkenius (1996) was used. Instead of testing each location individually, which
would have required 64 tests given that no noise was present, a hierarchical division of the regions was used. Each time a region received positive reinforcement, the reinforcement was generalized to regions close to it. Since the probability that a region would be selected was proportional to the reinforcement it had received, the robot mainly searched close to regions that had already been found to be good. It should be noted that the selection of regions was not a blind process since it too was controlled by the attentional system. That is, the robot did not select regions where no object was present. This made the learning time depend mainly on the number of distractors in the scene, and in simple cases learning was instantaneous. The system also correctly handled the case when no spatial cues had to be used, that is, when all regions were correct.
4.2 Experiment B: Selection of Feature Cues
In our second experiment we tested the ability of a robot to use simple visual feature histograms as deictic references to objects. A number of objects were placed on a table and the robot could direct its attention to different regions of the scene and ask whether the attended region contained a target object. When the robot selected a correct target region, it was rewarded. If the region in the focus of attention did not contain a target, the reward was omitted. The mechanism for selecting targets in the scene used a feature prototype derived from the previously selected correct regions. Each time the robot received a reward, this prototype vector was updated to the mean of the histograms of the rewarded regions (Mel, 1997). An estimate of the standard deviation for the prototype was also calculated. To select candidate regions, the robot calculated an attention map for the entire scene which was subsequently used to bias the selection of an attention region. The probability of selecting a region was proportional to the similarity between the features in the region and the prototype vector. To control the exploitation of the learned prototype relative to the exploration of new regions, the calculated probabilities were scaled using a constant ξ. A higher value of ξ would make the robot less likely to explore regions very different from the prototype. Figure 5 shows the learning curve for five similar targets in a context with many distractors. The percentage of correct attentional selections is plotted as a function of time. The solid curve shows the learning with a high ξ, and the dashed curve shows the learning with a lower ξ. In both cases, learning starts with random exploration of the scene but stabilizes on a level of correct selections that depends on ξ. As can be seen, the number of correct choices of regions is higher with increased ξ, as expected. The initial time before the curves start to rise is random since it depends on how quickly the robot selects the first correct region. Before that, it cannot possibly learn anything about the target histogram.
The pie charts show the relative time spent focusing on each of the five targets for the two levels of exploration. The pie to the left shows the distribution of attention over the five correct regions with a low level of exploitation. The pie to the right shows the effect of a higher level of exploitation. With a higher level of exploitation the distribution is more uneven since the robot tends to stick to the first target discovered.
[Figure 5 plot: success (y-axis) per iteration (x-axis) for exploitation levels Expl = 1 and Expl = 15.]
Figure 5 The learning curve for a scene with five similar targets and many distractors. The solid line shows the learning with little exploration. The dashed line shows learning with a higher degree of exploration. The pie-charts show the relative time spent focusing on each of the five targets for the two levels of exploration.
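To make the mechanism of experiment B concrete, the sketch below reproduces its main loop on simulated data: the prototype is the running mean of the histograms of rewarded regions, and the selection probability grows with the similarity to the prototype, scaled by ξ. The similarity measure, the exact form of the scaling, the simulated scene and all names are our assumptions here, and the standard deviation estimate is omitted.

import numpy as np

rng = np.random.default_rng(1)

# Toy scene: 20 regions, each described by an 8-bin feature histogram.
# Regions 0-4 are targets drawn around a common histogram, the rest are distractors.
target_shape = rng.random(8)
regions = np.vstack([target_shape + 0.05 * rng.standard_normal(8) for _ in range(5)]
                    + [rng.random(8) for _ in range(15)])
is_target = np.array([True] * 5 + [False] * 15)

def similarity(histograms, prototype):
    """Similarity of each region's histogram to the current prototype
    (an inverse-distance measure; the exact choice is an assumption)."""
    return 1.0 / (1.0 + np.linalg.norm(histograms - prototype, axis=1))

def learn(xi, trials=200):
    prototype = np.full(8, 0.5)                    # uninformative initial prototype
    rewarded = []                                  # histograms of rewarded regions so far
    correct = 0
    for _ in range(trials):
        p = similarity(regions, prototype) ** xi   # xi trades exploitation against exploration
        p = p / p.sum()
        focus = rng.choice(len(regions), p=p)      # attention biased toward the prototype
        if is_target[focus]:
            correct += 1
            rewarded.append(regions[focus])
            prototype = np.mean(rewarded, axis=0)  # prototype = mean of rewarded histograms
    return correct / trials

print("xi = 1 :", learn(xi=1))
print("xi = 15:", learn(xi=15))                    # higher xi: more correct but less even choices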
Note that this experiment only makes use of features and ignores spatial location, which means that the robot can still correctly locate the targets if they are moved. It would be an interesting extension to combine the two types of learning described here. This would make it possible for the robot to learn both the locations and the features of the target object or objects.
5. Discussion
We have described a number of attentional mechanisms that are useful to autonomous robots. These attentional mechanisms can be seen as a filter that is controlled both internally and externally by intrinsic or extrinsic cues. Attentional shifts can be either overt or covert depending on whether or not the sensors have to be redirected. We have further made the distinction between three coordinate spaces useful for attention: body, reaching and locomotor space. The suitable type of coordinates depends on what actions are admissible. The central claim of the paper is that attentional shifts can be seen as actions and that action itself should be implicitly controlled by the current focus of attention, that is, the principle of selection-for-action.
We have further suggested two mechanisms. The first is the use of deictic reference, that is, references to external objects that do not attempt to model those features of the objects that are directly accessible by the sensors, except for the purpose of finding the referenced object in the scene. The second mechanism, which we have only briefly touched on, is the idea of system dynamics as a means of simplifying control. For example, the action fixate will lock on to an object, but system dynamics will take care of tracking it if it moves. The concepts of positive and negative priming have also been described as mechanisms that attract or inhibit attention to objects that are either useful for a task or known to be of no importance. The overall scheme thus includes both top-down and bottom-up mechanisms. In the bottom-up path, the world suggests attentional possibilities that lead to attentional selection and finally to action. In the top-down path, the current goals of the robot activate relevant perceptual subgoals which prime the attentional system when it selects targets in the world. The experiments reported above are very simplistic and would have to be extended in a number of ways before they are useful to a robot in a real task situation. First, it would be interesting to let the system select individual features rather than a whole prototype vector. For example, to refer to blue balls it would be sufficient to select the feature blue to identify the balls, provided that there are no other blue objects in the scene. Such a representation is less prone to misclassification since it is not influenced by the variation of irrelevant features. Second, exploration could be improved by making regions that have recently been in focus less likely to be focused on again. Such a mechanism appears in the human cognitive system and is called inhibition of return. Third, learning could also be enhanced by only looking at regions that contain a reasonable amount of features. This would make the robot spend less time looking at the table and make it more likely to attend to relevant objects. The learning experiments we have reported above are only small parts of a learning system that would also have to include the learning of action sequences as well as of the cues for attention. This is an area for further research.
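As an illustration of the inhibition-of-return extension suggested above, the following sketch suppresses recently attended regions when the next focus is drawn from a salience map. The decay and strength constants and all names are purely illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

def select_with_ior(salience, inhibition, decay=0.5, strength=1.0):
    """Pick a focus of attention from a salience map while suppressing
    recently attended regions (inhibition of return). 'inhibition' is
    updated in place; decay and strength are illustrative constants."""
    inhibition *= decay                              # old inhibition fades away
    biased = np.clip(salience - inhibition, 1e-6, None)
    focus = rng.choice(len(salience), p=biased / biased.sum())
    inhibition[focus] += strength                    # this region is less likely to be chosen again soon
    return focus

salience = np.array([0.9, 0.8, 0.3, 0.2])
inhibition = np.zeros_like(salience)
print([select_with_ior(salience, inhibition) for _ in range(10)])  # attention spreads over the salient regions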
Acknowledgements This research was supported by the Swedish Foundation for Strategic Research.
References
Agre, P. E. & Chapman, D. (1987). Pengi: an implementation of a theory of activity. In Proceedings of AAAI-87. Seattle, WA, 268-272.
Allport, A. (1990). Visual attention. In Posner, M. I. (Ed.) Foundations of Cognitive Science. Cambridge, MA: MIT Press.
Balkenius, C. (1996). Generalization in instrumental learning. In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J. & Wilson, S. W. (Eds.) From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior. Cambridge, MA: The MIT Press/Bradford Books.
Balkenius, C. (1998). Spatial learning with perceptually grounded representations. Robotics and Autonomous Systems, 7 (1).
Balkenius, C. & Kopp, L. (1996a). Visual tracking and target selection for mobile robots. In Jörg, K.-W. (Ed.) Proceedings of EUROBOT '96. IEEE Press.
Balkenius, C. & Kopp, L. (1996b). The XT-1 vision architecture. In Linde, P. & Sparr, G. (Eds.) Symposium on Image Analysis. Lund, 1996.
Balkenius, C. & Kopp, L. (1997). Elastic template matching as a basis for visual landmark recognition and spatial navigation. In Nehmzow, U. & Sharkey, N. (Eds.) Proceedings of the AISB workshop on "Spatial reasoning in mobile robots and animals", Technical Report Series, Department of Computer Science, Report number UMCS-97-4-1. Manchester: Manchester University.
Ballard, D. H., Hayhoe, M. M., Pook, P. K. & Rao, R. P. N. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20 (4), 723-767.
Brooks, R. A. (1991). Intelligence without reason. Proceedings of IJCAI-91, 569-595.
Johnston, W. A. & Dark, V. J. (1986). Selective attention. Annual Review of Psychology, 37, 43-75.
Mel, B. W. (1997). SEEMORE: combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9, 777-804.
Treisman, A. & Gormican, S. (1988). Feature analysis in early vision: evidence from search asymmetries. Psychological Review, 95, 15-48.