A Behavioural Vision System for Search and Motion Tracking

Daniel Livingstone, Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, UK.

Libor Spacek, Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, UK.

Abstract

The focus of this paper is the combination of search and pursuit behaviours with an attentional mechanism. The paper describes a behavioural vision system, implemented on a robot head, which utilises pre-attentive attentional mechanisms. The system is capable of controlling the camera to follow a moving target or to 'search' an area, without making use of high-level control and image processing procedures. Motion pursuit is performed by a simple attentional mechanism based on some of the low-level processing carried out in the human visual system. Similarly, during search, pre-attentive vision is used to determine where the head will look next. Inhibition-of-return is implemented to prevent the head repeatedly scanning one or two 'interesting' points.

Keywords: active vision, search, motion tracking, pursuit

Introduction

In recent years an increasing amount of machine vision research has been conducted within the active, or animate, vision paradigm. This espouses vision systems built from closed-loop mechanisms, generally low-level mechanisms analogous to those found in human or animal vision systems, which are adaptive and which exert some control over the visual process itself. A key aspect, and the common denominator of most active vision research, has been gaze control, declared by Ballard [1] to be the central asset of animate vision. Examples include gaze control for vergence, for motion [2], [11], for search [5], [13], intentional gaze control to achieve some goal [8], and gaze control to obtain a useful set of images for a recognition task [10]. Studies have also been carried out on attentional models and mechanisms for gaze control, such as Culhane and Tsotsos [4], who use an attentional mechanism to choose areas within static frames for more detailed processing, and Leavers [6], who uses information from the attentional mechanism to help classify particles by size. The advantages of the active vision paradigm are well reported by researchers, Ballard in particular, and will not be repeated here. The claimed and demonstrated reductions in the required levels of cognitive processing have given much impetus to active vision research.

Since active, or animate, vision considers the visual system to be a behavioural system, our approach to developing an artificial visual system is similar to the behavioural AI approach to developing intelligent animate behaviour: using closed-loop control systems for reflex-like actions, and adding layers of behaviour over initially simple behaviours to create intelligent and hopefully useful behaviour. Wheeler [12] further discusses why perception, including vision, should be considered a behavioural task. The literature cited above describes systems developed to follow motion, to scan images according to attentional stimuli, or to achieve other tasks. The aim of our work was to develop a behavioural vision system capable of both following motion with a moving camera and searching an area, with both tasks achieved using an attentional mechanism. The ability to switch between tasks, depending on action around the camera, was also to be developed. We make use of similar attention-based mechanisms to control both the tracking and search behaviours, adding inhibition-of-return for the visual search and a trivial mechanism for switching between behaviours. Despite the simplicity of our implementation, the results are promising. First, however, the prototype robot head mechanism on which our work is based is described below.

The Experimental Robot Head

The prototype robot head system used in this work (developed by Dr. Paul Chernett at Essex) is shown in figure 1.

Fig. 1: The experimental system. Left: the robot head. Right: block diagram of the head and controller, comprising a VxWorks host machine (equipped with framegrabber and interface board) connected over the LAN to a NeXTstep Unix workstation, and the robot head with its camera and position servo motors.

Programs can be downloaded onto the 680x0 VxWorks board from any workstation over the network, and the program running under VxWorks can control the attached robot head. At the University there are several robots similarly equipped with 680x0 processor boards running VxWorks, and the process for running a program on the robots is the same. Two identical head systems exist: the one pictured, which can be fitted to a desktop host machine, and one fixed to a robot.


The work described in this paper was carried out using the VxWorks-based machines with a desk-mounted robot head and camera. Currently the robot head is fitted with only the left camera of a binocular pair; hardware and software changes to accommodate the second camera are not expected to prove troublesome. The system used is very low cost, and apart from its slow speed it performs well. The frame rate is currently around three seconds per frame, a low speed due to the low-cost frame grabbers and the slow host system architecture currently installed. By the time of publication, performance is likely to have improved considerably.

The Attentional Mechanism

The visual process can be split into two distinct stages: pre-attentive and attentive vision. During the pre-attentive stage, processing of limited complexity occurs over the whole visual array; in the attentive stage, further processing is carried out, concentrated on areas within the field of view selected by the attentional mechanism [14]. Visual attention thus works as a filter, selecting stimuli to concentrate on while other stimuli are attenuated, as in Treisman's Attenuator Theory [9]. In Treisman's theory it is supposed that attenuation of less important stimuli is required because the brain lacks the cognitive power to process all stimuli equally, hence the need for selection. More generally, attentional mechanisms are deployed to control the perceptual process itself, as well as to select a subset of stimuli for further processing. Attention can be guided by low-level stimuli in pre-attentive vision and by higher-level information from processes in late vision or cognition.

One aspect of attention is the focus-of-attention. In binocular visual systems this is often the same as the fixation point, the point at which both eyes are aimed. Fixating both eyes on a common point achieves several things, such as easing the correspondence problem and allowing relative positional information to be determined by parallax. It also places the point of interest within the foveal regions of both eyes. Though the fixated point is not always the point at which visual attention is deployed, this is generally the case, and is assumed in this work. This does not preclude the further processing of points in the peripheral vision area, or the selection of such points as the focus of attention.

For truly adaptive behaviours, a system where attention is guided by both the high- and low-level processes will be necessary. The envisioned approach is to make use of symbolic AI at the higher levels and behavioural AI at the lower levels. Cases where gaze is governed by late vision or by intent, e.g. focusing gaze at a set point irrespective of incoming stimuli, are not covered in this work. Our work thus concentrates on the closed-loop, stimulus-driven nature of the attentional mechanisms.


Some allowance is still made for higher-level control: for switching between behaviours and for inhibition-of-return. The low-level attention mechanism implemented on the prototype is derived from those of Culhane and Tsotsos and of Leavers, though greatly simplified. The Culhane and Tsotsos implementation is concerned with using attention as a guiding heuristic in selecting areas of a static image for processing, and operates recursively until the entire image has been processed.

The attentional mechanism is a hierarchical WTA (winner-take-all) system. Receptive fields (RFs) covering the visual array return a response depending on the level of presence, or absence, of corresponding features in the area covered by that RF. For simplicity, all RFs are implemented as covering rectangular areas. The RF generating the greatest response 'wins' and is selected by the attentional mechanism as the next focus of attention. In our work, with a camera mounted on a robot head, the camera can then be moved to centre the point of interest.

In Culhane and Tsotsos' implementation, RFs of a variety of sizes exist, at every possible size at every position in the visual array. The responses are easily made independent of RF size by dividing the response of an RF by its size. The competition between this large number of RFs would necessitate a large number of neural computations and connections, exaggerated where the attentional mechanism is to work on a number of different features simultaneously. We feel that this could be prohibitive in a neural net implementation, as well as being unnecessary. Thus, we use RFs of fixed sizes scattered less densely over the visual array, while still covering the array completely. In our implementation all RFs are of the same size, though, using the same technique as above, different sizes of RF could be made to coexist. The RFs are spread out with only small amounts of overlap (see figure 2); the overlap increases slightly towards the centre of the array. Using the rule for making responses size-independent, we could also have made the RFs smaller in the foveal region and increased the overlap there, to further favour responses from the foveal region.

The RFs take as input integer values representing the stimuli over the area of the visual array covered. Individual RFs simply average the stimulus values to generate the required size-independent response. To detect the greatest source of any selected type of stimulus, the appropriate input must be selected, such as motion difference images or grey-level images; the attentional mechanism itself need not change for each different input. For example, to use the attention mechanism to detect the brightest spot in view, grey-level values would be used as inputs. Individual colour responses can be used, allowing easy detection of areas with a uniquely high presence of particular colours. Stimuli for motion detection and edge-based search can also be input to the array of RFs, as described below.

In cases where two or more RFs generate equal winning responses, some arbitration is necessary. This could be done by using further attentional stimulus responses, or by arbitration on characteristics determined by later vision. Such arbitration is undefined in the current implementation, and is carried out arbitrarily in the code.
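As a concrete illustration, the following sketch (in Python, which is not the implementation language of the original system; the RF size, overlap and threshold values are illustrative, not taken from the paper) shows how a single-feature, fixed-size-RF version of the winner-take-all selection described above might look. Each RF averages the stimulus over its rectangular area, giving the size-independent response, and the largest response above a threshold wins; ties fall to scan order, standing in for the undefined arbitration noted above.

import numpy as np

def select_focus(stimulus, rf_size=32, overlap=4, threshold=0.0):
    """Winner-take-all over fixed-size rectangular receptive fields.

    stimulus: 2-D array of non-negative stimulus values (grey levels,
    a motion difference image, edge responses, ...). Returns the centre
    (row, col) of the winning RF, or None if nothing beats the threshold.
    """
    h, w = stimulus.shape
    step = rf_size - overlap                    # RF spacing: small overlap
    best_response, best_centre = threshold, None
    for y in range(0, h - rf_size + 1, step):
        for x in range(0, w - rf_size + 1, step):
            response = stimulus[y:y + rf_size, x:x + rf_size].mean()
            if response > best_response:        # ties resolved by scan order
                best_response = response
                best_centre = (y + rf_size // 2, x + rf_size // 2)
    return best_centre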


Fig. 2: Distribution of receptive fields over the visual array. (a) A single receptive field, (b) four closely overlapping RFs and (c) an array of loosely overlapping RFs.

Motion Detection

To use the described attentional mechanism for motion tracking, difference images were used to provide the stimulus to the mechanism. Only positive values are fed to the mechanism, so the modulus of the difference is used. This works under the simple assumption that wherever motion is most significant in the image area, the intensities in the difference image will be greatest in that area. This is a very low-level mechanism, providing the minimum of information about the moving object: only its position within the visual array. Velocity and shape information is not determined. Time-sequence difference images may well provide information useful in later cognitive processes, though this has not yet been considered.

With fixed-size RFs, where the motion covers an area of the image array greater than the RF size, the best that can be hoped for is that the winning RF partially contains the motion. In practice, even a limited mechanism with only uniformly sized RFs is successful in capturing motion and selecting points for the next focus of attention. Small errors are quickly recovered from, as long as the moving object remains within view. Even large errors, rare in trials except where the speed of motion greatly exceeds the sampling rate or where errors have occurred in frame grabbing, typically do not result in losing the moving object. Due to lighting effects, the response of an RF is unlikely to be zero even when no motion has occurred, so a threshold was set to ignore very small responses; this was set to around 1% of the maximum response.

Pursuit was successfully performed by the prototype system using only the motion-sensitive attentional mechanism (see figure 3). The prototype is a slow system, with a very low frame rate (seconds per frame rather than frames per second). Accordingly, the pursuit mechanism is easily defeated by rapid motion and can lose objects which move too fast. At higher frame rates, towards 25 or 50 Hz, a marked increase in performance could be expected.
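Continuing the earlier sketch, the motion stimulus is simply the absolute frame difference, fed to the same selection routine with the roughly 1% threshold mentioned above. The frame sizes and patch values below are made up for illustration; in the real system the frames would come from the head's frame grabber.

import numpy as np

def motion_stimulus(frame_prev, frame_curr):
    """Modulus of the frame difference: only positive values reach the
    attentional mechanism, and stronger motion gives larger responses."""
    return np.abs(frame_curr.astype(int) - frame_prev.astype(int))

# Usage with two synthetic 8-bit frames.
prev = np.zeros((240, 320), dtype=np.uint8)
curr = prev.copy()
curr[100:130, 150:180] = 200                    # a 'moving' bright patch
focus = select_focus(motion_stimulus(prev, curr), threshold=0.01 * 255)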


Fig. 3: Examples of motion pursuit. (a) Tracking movement across a room. (b) Tracking movement down and along under a desk. (c) Tracking hand movements.

Where the gaze changes significantly between successive frames, large responses could occur in areas of the visual field corresponding to stationary objects. These false motion stimuli can be attenuated by image stabilisation, another task performed in the human visual system (HVS). The simplest approach here is to hold the camera steady for two frames; the more complex human system only needs to perform stabilisation in the foveal region. Where the frame rate is high, the cost of stabilisation after infrequent large saccades is low.

This approach to motion detection is fast and non-iterative, though limited. For example, a single target cannot be followed selectively out of many: whichever target creates the highest response will automatically win attention. However, in a subsumption architecture further control could be exerted by post-attentive visual processes. This could be used to achieve more complex goals, such as following a selected target when several are present.
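One possible shape for the pursuit loop, including the hold-steady stabilisation just described, is sketched below. The names grab_frame and move_camera are hypothetical stand-ins for the head's frame grabber and servo interface, and select_focus and motion_stimulus are the earlier sketches; none of this is the paper's actual code.

def pursue(grab_frame, move_camera, steps=100):
    """Pursuit loop: after each saccade the camera is held steady while a
    fresh reference frame is grabbed, so no difference image spans the
    camera motion and stationary objects cannot masquerade as motion."""
    prev = grab_frame()
    for _ in range(steps):
        curr = grab_frame()
        focus = select_focus(motion_stimulus(prev, curr),
                             threshold=0.01 * 255)
        if focus is not None:
            move_camera(focus)     # centre the winning RF in the image
            prev = grab_frame()    # re-grab from the new, steady position
        else:
            prev = curr            # no motion: keep differencing in place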

Search

There have been a number of studies of human performance on visual search, and these provided both a good baseline against which to compare our active vision system's performance and a basis for the implementation of our system. While it is clear that people use knowledge to perform systematic searches of areas to achieve the best performance, this ability is not fully developed until the age of 6 or 7 [3]; younger children, and adults faced with tasks to which their knowledge does not apply, cannot use knowledge to guide searching. Low-level attentional stimuli are also used to guide search [14]. Our system uses low-level edge information and an attentional mechanism similar to that used above to guide its search. The search is knowledge-free, and in the context of a subsumption architecture can be considered as modelling only the lower levels of processing. As with motion, control from above could be added later, as occurs in the development of human visual search.

Where control is exerted only from below, a winning stimulus source will remain the focus of attention until some change in the environment causes another source to win over it. Inhibition-of-return (IoR) is the mechanism which prevents gaze constantly returning to a single point more apparently interesting than others in the field of view, or to a small set of such points. While it is believed that the HVS possesses such a mechanism, it is clear from the studies of Yarbus [15] that this mechanism still allows more than one visit to each area in the field of view; more than one visit may be required for fuller interpretation. Thus, IoR seems to rely partly on feedback from the higher levels of the perceptual process, not considered in this paper. It is likely that inhibition-of-return also relies on recognising, without interpreting, areas in peripheral vision to prevent gaze returning to them, again beyond the scope of this study.

We use an IoR mechanism to prevent pathological behaviour: in our system, without any form of inhibition-of-return, the gaze could easily bounce between the two most interesting points in the field of view. Consider each gaze position, represented by pan and tilt angles, to be a search 'stop'. For our inhibition-of-return implementation we make the assumption that any world point which has projected onto the fovea on any previous stop will have been adequately interpreted, and will not be chosen as the point to aim the gaze at on any future stop. Approximate calculations to compute this can easily be carried out, deriving the angular field of view from the known fovea and total image dimensions. Any selected search point can then be checked against the list of visited areas; if previously visited, the next gaze is centred where the second, third, etc., greatest response is generated. A sketch of this check appears below. The inhibition-of-return suggested by Culhane and Tsotsos [4] is conceptually nicer, but does not allow for a moving camera or changing stimuli.

The goal of our system is to 'search' every point of its environment that can be viewed by changing the camera position. The IoR mechanism has obvious weaknesses, though it was sufficient for experimental purposes. The current prototype gaze control system uses a simple inhibition-of-return mechanism, defensible chiefly on grounds of simplicity and ease of use for a robot head with only rotational degrees of freedom; further work is required to develop IoR for fully mobile robots. In the trial reported here, motion detection was inhibited and a search based on edge points only was performed. Some images from this trial are shown in figure 4. The results correspond well with expectations: the system is most interested in the relatively busy desktop area, and only once this area has been largely explored does attention wander to the less busy areas above and below the desk.
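The following sketch shows one way the visited-stop check might be computed. The function and parameter names are hypothetical, and a linear pixel-to-angle mapping is assumed, consistent with the paper's description of "approximate calculations" from the fovea and image dimensions.

def candidate_stop(focus_px, image_size, fov_deg, pan, tilt):
    """Map a winning RF centre (pixels) to approximate world pan/tilt
    angles, assuming a linear pixel-to-angle mapping."""
    fy, fx = focus_px
    h, w = image_size
    return (pan + (fx - w / 2) * fov_deg[0] / w,     # horizontal FOV share
            tilt - (fy - h / 2) * fov_deg[1] / h)    # vertical FOV share

def inhibited(stop, visited, fovea_deg):
    """True if this stop falls within the foveal angular radius of any
    previous stop, and so is assumed to be already interpreted."""
    p, t = stop
    return any(abs(p - vp) < fovea_deg and abs(t - vt) < fovea_deg
               for vp, vt in visited)

# When the winning point is inhibited, the point with the second,
# third, ... greatest RF response is tried instead.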


Fig. 4: An example of search using attention and inhibition-of-return.

The searches can be plotted by recording all the camera stop positions; such a plot is shown in figure 5, beside a plot from a visual search carried out by a person on a static image [15].

Fig. 5: Search plots. (a) Plot of camera positions during search on the prototype head system. The initial position is pan = 0.0°, tilt = 0.0°; the pan angle increases as the camera scans right to left, the tilt angle as the camera scans up (pan plotted over roughly -30° to 50°, tilt over -30° to 30°). (b) Plot of eye positions while a person views an image, from Yarbus [15].

Recognition

As stated, recognition and other late visual tasks are not carried out in the prototype system. Tistarelli [10] is an example of the use of viewpoint selection to aid recognition, and a range of benefits to recognition exist in active vision systems. By selecting areas within the visual array on which to concentrate visual processing resources, more informative recognition cues can be processed earlier. Additional cues can be actively searched for, allowing better recognition and rejection. By moving the camera, more information on 3D shape becomes available. By providing the visual sensor with the ability to reposition itself, the ability to recognise known objects is improved dramatically.


System Behaviour

Within our system there are only two modes of behaviour: motion pursuit and area search. Motion is assumed to have precedence over search, and a search is automatically postponed when motion is detected. When motion is no longer detected, the system waits a number of cycles for motion to continue, looking where motion was last detected, the assumption being that the pursued object may soon move again; only when satisfied that there will be no more motion does it resume the search. Areas viewed while pursuing motion are not kept in the inhibition-of-return visit list: it is assumed that later visual resources would be concentrating on the target, not on searching the area, and that a moving target may move objects or otherwise affect the properties of locations in view. We consider this to be sensible behaviour for an animate system capable of head motions though otherwise immobile, perhaps an animate sentry. A sketch of the switching logic follows. Later visual processes could easily be provided with additional information, the derived attentional stimuli, to assist in image interpretation. The pre-attentive processes we have implemented are also suitable for fast parallel implementations.
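Under the same hypothetical interfaces as the earlier sketches (grab_frame, move_camera), and with read_gaze and next_search_stop standing in for the head's pan/tilt readout and the edge-based search with inhibition-of-return, the two-mode controller might look like this; the patience value is illustrative, not taken from the paper.

def behave(grab_frame, move_camera, read_gaze, next_search_stop,
           patience=5):
    """Two-mode controller: motion pursuit pre-empts search; search
    resumes only after `patience` quiet cycles at the last detection."""
    visited = []                       # IoR list of (pan, tilt) search stops
    quiet = 0
    prev = grab_frame()
    while True:
        curr = grab_frame()
        focus = select_focus(motion_stimulus(prev, curr),
                             threshold=0.01 * 255)
        if focus is not None:          # motion always wins over search
            move_camera(focus)
            prev, quiet = grab_frame(), 0
            continue                   # pursuit stops are NOT recorded
        quiet += 1
        if quiet < patience:
            prev = curr                # hold gaze: motion may resume
            continue
        # Edge-based search: best RF centre whose stop is not inhibited.
        focus = next_search_stop(curr, visited, read_gaze())
        if focus is not None:
            move_camera(focus)
            visited.append(read_gaze())   # record the new (pan, tilt) stop
        prev = grab_frame()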

Conclusions

We have demonstrated that attentional stimuli are sufficient for motion pursuit. Although such pursuit is considerably simpler than one using, say, affine motion models [11], there is no reason to believe that, having found a target with an attentional mechanism, a biological visual system would then ignore that mechanism. This supports our use of attention for motion pursuit. We have also shown how a similar mechanism can be used to form the basis of a visual search behaviour. The pursuit and search behaviours have been integrated into a system capable of switching between behaviours appropriately. The attentional mechanisms also provide information which could certainly aid other mechanisms for pursuing motion and similar tasks; would a system as opportunistic as the HVS discard this? Attention-based search and pursuit, as described in this paper, should be considered as basic building blocks for more complex search and pursuit systems.

Acknowledgements

The robot head and its control code were designed by Paul Chernett. Thanks also go to Vic Callaghan; to Rodney and Malcolm for assembling the head and answering many questions; to fellow Brooker lab students Ronnie, Simon and David; and finally to Dave, Jeremy, Chris and Bronwen.


Bibliography

[1] Ballard, D. H. (1991). "Animate vision." Artificial Intelligence 48: 57-86.
[2] Bradshaw, K. J., P. F. McLauchlan, et al. (1994). "Saccade and pursuit on an active head/eye platform." Image and Vision Computing 12(3): 155-163.
[3] Coren, S., L. M. Ward, et al. (1994). Sensation and Perception. Harcourt Brace.
[4] Culhane, S. M. and J. K. Tsotsos (1992). "An attentional prototype for early vision." ECCV '92, The Second European Conference on Computer Vision, Santa Margherita Ligure, Italy. Springer-Verlag.
[5] Granlund, G. H., H. Knutsson, et al. (1994). "Issues in robot vision." Image and Vision Computing 12(3): 131-148.
[6] Leavers, V. F. (1994). "Preattentive computer vision: towards a two-stage computer vision system for the extraction of qualitative descriptors and the cues for focus of attention." Image and Vision Computing 12(9): 583-599.
[7] Livingstone, D. J. (1995). Visual Guidance for Autonomous Robots. MSc project dissertation, University of Essex, Colchester, UK.
[8] Remagnino, P., J. Illingworth, et al. (1995). "Intentional control of camera look direction and viewpoint in an active vision system." Image and Vision Computing 13(2): 79-88.
[9] Sanford, A. J. (1985). Cognition and Cognitive Psychology. Lawrence Erlbaum Associates Ltd.
[10] Tistarelli, M. (1995). "Active/space variant object recognition." Image and Vision Computing 13(3): 215-226.
[11] Uhlin, T., P. Nordlund, et al. (1995). Towards an Active Visual Observer. Computational Vision and Active Perception Laboratory, Royal Institute of Technology, Stockholm, Sweden.
[12] Wheeler, M. (1994). Active Perception in Meaningful Worlds. School of Cognitive and Computing Sciences, University of Sussex, Brighton, UK.
[13] Wixson, L. E. (1994). Gaze Selection for Visual Search. Computer Science Department, University of Rochester, USA.
[14] Wolfe, J. M. and K. R. Cave (1990). "Deploying visual attention." In A. Blake and T. Troscianko (eds.), AI and the Eye. John Wiley and Sons.
[15] Yarbus, A. L. (1967). Eye Movements and Vision. Plenum Publishing Company, New York.
