Incremental Focus of Attention for Robust Visual Tracking

Kentaro Toyama and Gregory D. Hager
Department of Computer Science, Yale University, P.O. Box 208285, New Haven, CT 06520

October 22, 1995

Abstract. We present an incremental focus of attention (IFA) architecture for adding robustness to software-based, real-time, motion trackers. The framework provides a structure which, when given the entire camera image to search, efficiently focuses the attention of the system into a narrow set of configurations that includes the target configuration. IFA offers a means for automatic tracking initialization and reinitialization when environmental conditions momentarily deteriorate and cause the system to lose track of its target. Systems based on the framework degrade gracefully as various assumptions about the environment are violated. In particular, when constructed with multiple tracking algorithms of varying precision, the failure of a single algorithm causes another, less precise algorithm to take over, thereby allowing the system to return approximate information on feature location or configuration.
1 Introduction

Robustness is a hallmark of intelligent behavior and a desirable property of any system. Yet, the abundance of black velvet in vision laboratories testifies to the extreme brittleness of current visual tracking technology. Visual events such as changes in the background, foreground occlusion, ambient light changes, and unanticipated motion of the camera or the target conspire to produce undetected, irrecoverable failure in many existing visual trackers. Although there have been attempts to make trackers more robust with respect to certain disturbances (e.g., [11, 16, 17]), little has been done toward developing an effective, generalizable computational framework for dealing with these problems. In this article, we present the Incremental Focus of Attention (IFA) architecture. IFA is a framework for adding robustness to most existing real-time feature-based tracking algorithms. The core of IFA is a structure which, when given an area of a camera image to search, efficiently focuses the "attention" of the tracking system onto a small set of configurations which contains the target feature configuration. When visual perturbations temporarily cause the tracking system to lose its target, the IFA framework provides the possibility of reinitialization. If an IFA-based system is composed of multiple trackers of varying precision, failure of a single tracker simply means that another, less precise tracker takes over. In this way, the information provided by IFA-based tracking algorithms gracefully degrades as various assumptions about the environment are violated. In the pursuit of robustness, we have been inspired in part by the subsumption architecture of robotics [1, 2]. We, too, believe that robustness comes from decomposing a system based on complete tasks, as opposed to partitioning a complex task into functionally abstract modules.
We also feel that artificial constraints on a system's environment provide a crutch on which we may never stop leaning; instead, we must construct system components that monitor themselves for failure, and we must provide appropriate fallback procedures for those cases. Our approach, however, differs from the subsumption architecture in several
significant ways: first, we ignore the philosophical implications of having to build an entire perception-action machine, since a strictly visual application cannot necessarily be said to "act" upon a real environment; second, the flow of information between our layers is fairly uniform and an integral part of the framework, in contrast to the thin communication lines connecting subsumption layers; and lastly, our layers do not run in parallel, but run serially, passing control back and forth as appropriate. The ideas behind IFA are also similar to pop-out and foveated vision. Pop-out is an effect found in animals, where certain locally identifiable cues, such as motion or color, rapidly focus visual attention. In computer vision, pop-out translates to algorithms which, because they perform image processing only on locally confined areas of an image, could theoretically be performed rapidly in parallel [14]. The IFA framework might be considered similar to repeated applications of pop-out algorithms, but it differs in that the algorithms need not be parallelizable: while pop-out might characterize some parts of our framework, it does not properly describe others. Foveation is another computer vision concept inspired by nature with which the IFA framework shares conceptual traits. In foveated vision, the resolution of an image is inversely proportional to the distance from its center, or fovea [15, 17]. Most often, foveation is used to provide both a region of high accuracy and a wide field of view for a reasonable computational cost. Our framework also takes advantage of the speed gained by processing images of varying resolution. Unlike foveation, however, the IFA architecture channels attention based on visual events instead of relying on a predetermined geometrical scheme.
If we view tracking as a large search problem, IFA can even claim precursors in classic artificial intelligence applications such as STRIPS [7] and Hearsay [6], which cull search spaces by examining various preconditions. A layer of an IFA tracker culls the search space for the next layer by checking whether regions marked by certain visual cues warrant further attention. For example, if the target object is a falling apple, one layer of the framework may eliminate
regions of the image which are not red, and the next layer may throw out regions of the image which do not exhibit motion. As with standard search techniques, processing does not continue in regions which do not satisfy the particular preconditions. Yet, while we do formulate tracking as a search problem, the differences between AI search algorithms and IFA are obvious: the domains, the speed of operation, and the combinatorial aspects of the problems differ significantly. The IFA architecture is, ultimately, a close cousin of focus-of-attention algorithms [5, 18, 19]. These algorithms use layered processing to implement both parallel and inherently serial procedures for narrowing attention. The intention is to create fast, robust vision systems. There are, however, a few theoretical differences between our approach and that of most focus-of-attention research. First, we do not restrict ourselves to only positional information about features in an image. Instead, the search ranges over configuration spaces that may include rotation, scaling, shear, or even full 3D pose information. Second, instead of a network model with neural components arranged in layers, our layers are heterogeneous input-output devices. Third, during each tracking cycle, IFA processing occurs only in a single layer of the framework rather than in all layers. Finally, and what is perhaps the root cause of the previous differences, much focus-of-attention work has been concerned with computational models of biological visual systems, whereas we are primarily concerned with performing efficient, robust tracking on common computing hardware. As a result, we have been able to implement complete systems which track moving objects robustly in real time in real-world scenes. The remainder of this article is organized as follows. Section 2 lays down the formulation for trackers and the IFA framework. Section 3 offers experimental evidence for the robustness of IFA systems.
Section 4 discusses issues in robustness and explains why systems based on our framework degrade gracefully. Section 5 lists the directions in which we are currently expanding the system.
2 Motivation and Theory

The goal in visual motion tracking is to return information which characterizes the configuration or, more generally, the state, of a target object* from the data contained in a stream of images. The images themselves may be monochrome or color, monocular or stereo. The state information to be computed might include the target's position, orientation, shape, velocity, or a variety of other geometric or photometric qualities. Under ideal conditions, a motion tracker should be able to return complete state information for the target object, but ideal conditions rarely hold in the real world. Even we humans, with our well-developed visual systems, are unable to track objects with high precision under adverse conditions, such as when the objects are in our visual periphery, when they are moving too quickly, when they are wholly or partially occluded, or when ambient lighting becomes dim. However, even in these circumstances human beings are nevertheless able to do two things: gain partial state information about an object that cannot be precisely tracked, and recover an object that is temporarily lost from view. Thus, athletes can still catch, hit, or kick a ball by knowing its approximate position even when its orientation is rapidly changing, and drivers can continue to track the vehicle ahead of them, even after a quick glance in the mirrors. In biological vision systems, this kind of robustness is taken for granted, yet in computer vision it is an issue that has received relatively little attention. The IFA framework begins to fill this gap by providing a general architecture for creating robust motion tracking systems. Several issues govern any attempt to make a system robust. First, a robust system must be aware of its own success or failure. In tracking, this means that the system must know when a particular procedure ceases to track correctly. A robust tracker should rarely, if ever, track something other than its intended target.
Secondly, there must be fallback mechanisms in the case of failure. Either there must be procedures for recovery, or there must be substitute procedures which can take over for the failing component to some degree. For a tracker, both of these fallback procedures would add robustness: when tracking is lost, reinitialization of the lost object would allow continued tracking, and substitute trackers would allow the system to return partial output while reinitialization takes place. Finally, a robust system should be able to recover quickly: a motion tracker which computes the precise pose of an object as it was several minutes ago might be of no use to applications which require up-to-date visual input.

* We use the term "object" to generically denote anything from simple features such as edge segments to complex three-dimensional objects.

Figure 1. The Incremental Focus of Attention (IFA) framework. (The figure shows a pyramid of layers: a low-resolution tracker/selector spanning the full configuration space at the bottom, intermediate-level trackers and selectors in the middle, and high-precision tracking of the target configuration at the top.)

These constraints suggest a layered fallback scheme (refer to Figure 1). After a tracking cycle, control passes from layer to layer, up or down the framework, depending upon the success of tracking during the previous cycle. The topmost layer performs high-precision
tracking. Under ideal conditions processing should take place solely in this layer. As environmental conditions deteriorate, however, control is relinquished to lower layers which perform less accurate, but hopefully more robust, tracking. Lower layers continuously attempt to pass control to the higher layers so that when better conditions return, so will better tracking. Clearly, the crucial questions have to do with the transitions between layers: how layers detect their own failure and give up tracking to a lower layer, and how tracking becomes more precise as favorable conditions return.
2.1 Trackers and Selectors

We now describe how layers are structured in an IFA-based system and how layers are combined into a single tracking system. First, we define X as the space of all possible configurations assumable by a target object. Configurations may be physical states of the object (such as 3D position and orientation) or image-based states of the projection of the object onto the camera image. We also define the visible configuration set, X_v, to be the set of all configurations which would cause part or all of the object to project onto the camera image. Layers in the IFA framework are either selectors or trackers. Both narrow an input configuration set, x_in ⊆ X, to a smaller output configuration set, x_out ⊆ X, such that x_out ⊆ x_in. Both trackers and selectors can be characterized by their output accuracy, which depends on the size of their output sets. Less accurate trackers or selectors return large output sets compared with their more accurate counterparts. As described in the next section, layers of the IFA framework are arranged in order of increasing output accuracy, with the least accurate layer at the bottom taking the entire visible configuration set as its input and the most accurate layer at the top returning relatively precise object state information. What distinguishes trackers from selectors is that trackers must be designed to meet
the filter criterion: if the output set of a tracker is non-empty, then it contains the actual configuration of the target object. In other words, trackers must not produce false positives. Selectors, on the other hand, may return subsets of their input set which do not contain the target configuration. Instead, the primary role of selectors is to partition the output of the tracker below into sets with size and shape appropriate as input to the tracker immediately above. For example, many conventional trackers for point features search a rectangular region of an image about a nominally predicted location. The search space for such a tracker may be restricted to a predefined size in order to assure reasonable real-time performance. On the other hand, some window-based line segment trackers (such as those described in [10]) are designed to find lines within a band of orientations, lengths, and image locations. In order to connect a point-feature tracker to a line tracker, some translation and partitioning of output sets is required. Selectors also focus the attention of the system by favoring certain output sets over others. In particular, heuristics may be used to determine which part of a selector's input set is likely to contain the target state. These can depend on image information. For instance, a selector may prefer to return output sets which include configurations near detected motion, if a target object is known to be frequently in motion. Because selectors do not guarantee that their output sets will contain a target configuration even when their input sets do, selectors must necessarily fulfill another requirement: given the same input set, a selector must return a different output set each time it is called. The union of output sets over time equals the input set (this is similar to the reasoning behind inhibition of input areas in focus-of-attention work [5]).
This restriction ensures that the system as a whole will not fall into a loop where a selector insistently produces a set that doesn't include the target state. Lastly, a selector's input set may become exhausted, in which case it must be reset so that other lower-level algorithms can take over. The following are a few examples of trackers and selectors:
Polygon Tracker: A feature-based polygon tracker tracks polygonal shapes as multiple corners, which are in turn tracked as two intersecting line segments (a rectangle tracker is described in [10]). The polygon tracker will narrow a "compact" set of input configurations into a smaller set of configurations which corresponds to the image projection of the polygon, together with some allowance for noise. Mistracking is detected when the line segments of adjacent corners are out of alignment or if distinct corners merge together.
3-D Model Tracker: A 3-D model tracker (such as the ones described in [4, 8, 12]) tracks the 3-D pose of an object with respect to the camera. The input configuration set would be the approximate pose of the object with an error margin, and the output set could be a smaller set of poses. Mistracking is detected when the image does not provide enough correspondences with the model.
Motion-biased Selector: Given any input set, a motion-biased selector will first return output sets that include configurations whose camera projections intersect with regions of non-zero image differences between two recent time instances. Simply put, it initially focuses its attention on regions where motion is detected. As the same selector is repeatedly applied without resetting, the motion bias is dropped, and other subsets of the input set will be returned as output. Eliminating the motion bias after futile attempts ensures that stationary targets will eventually be found.
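To make the tracker/selector formalism concrete, the following sketch (our illustration, not part of the original systems) models configuration sets as Python frozensets. `ChunkSelector` and `NearTracker` are hypothetical toy layers: the selector cycles through its input so that the union of its outputs over time equals the input set, and the tracker obeys the filter criterion by returning the empty set rather than a false positive.

```python
from abc import ABC, abstractmethod

class Layer(ABC):
    """One IFA layer: narrows an input configuration set to a subset."""
    @abstractmethod
    def process(self, x_in):
        """Return x_out, a subset of x_in."""

class Tracker(Layer):
    """Filter criterion: a non-empty output set must contain the true
    target configuration; the empty set signals mistracking, so no
    false positives are ever reported."""

class Selector(Layer):
    """May miss the target, but repeated calls on the same input must
    cycle through different outputs whose union equals the input."""
    def reset(self):
        pass

class ChunkSelector(Selector):
    """Toy selector: returns a different fixed-size chunk of its input
    on each call, covering the whole input over successive calls."""
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.cursor = 0
    def process(self, x_in):
        ordered = sorted(x_in)
        chunk = frozenset(ordered[self.cursor:self.cursor + self.chunk_size])
        self.cursor += self.chunk_size
        return chunk
    def reset(self):
        self.cursor = 0

class NearTracker(Tracker):
    """Toy tracker: keeps configurations within `radius` of a measured
    value; returns the empty set (mistrack) when none qualify."""
    def __init__(self, measurement, radius):
        self.measurement, self.radius = measurement, radius
    def process(self, x_in):
        return frozenset(c for c in x_in
                         if abs(c - self.measurement) <= self.radius)

full = frozenset(range(10))
sel = ChunkSelector(chunk_size=4)
covered = set()
for _ in range(3):
    covered |= sel.process(full)
assert covered == set(full)          # union over time equals the input

trk = NearTracker(measurement=7, radius=1)
assert trk.process(full) == frozenset({6, 7, 8})
assert trk.process(frozenset({0, 1})) == frozenset()   # signals mistrack
```

The choice of frozensets is purely expository; in a real system the configuration sets would be parameterized regions (windows, pose ranges) rather than enumerated elements.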
2.2 Framework Construction

Given several tracking algorithms which don't conform to the standards prescribed in §2.1, we must adapt them before we can construct an IFA system. Some tracking algorithms are more easily adapted into selectors, and others are better converted into IFA trackers. The decision is based partly on whether or not the tracker satisfies the filter criterion and partly on what assumptions can be made about the visual environment.
Tracking algorithms which almost always satisfy the filter criterion under a minimal set of assumptions about the environment should become trackers in the IFA framework. A few modifications are necessary: first, the trackers must be altered to detect mistracking, i.e., they must return the empty set if tracking fails. There are usually geometric constraints or thresholds which can be set so that a tracker will report failure when appropriate. For example, as described in §2.1, a tracker based on an object model might report failure when a certain percentage of its model components don't find correspondences in the image. For the robustness of the system as a whole, it is better for trackers to be conservative in their own estimation of success. In other words, a tracker which sometimes signals failure even during successful tracking is preferred over one which occasionally mistracks without knowing. Tracking algorithms which are likely to miss a target should be converted into selectors. For example, an algorithm which tracks an object by searching the whole image for its predominant intensity value may find itself tracking the wrong parts of the image when ambient lighting changes. Were this tracking algorithm incorporated as a low-level tracker in an IFA system, altered lighting conditions would cause the entire system to enter a loop where high-level trackers vainly seek the target in input sets that do not include it. However, as a selector, the same tracking algorithm assumes the role of a heuristic that efficiently narrows the search space when lighting conditions are as expected. In order to ensure reasonable behavior, the system must also be able to disregard this heuristic. For this reason, a candidate selection algorithm is adapted to the framework by incorporating it into a generic selector procedure which initially attends to the areas suggested by the candidate.
Over time, however, the selector switches to exhaustively returning its input set, independent of the nested tracker. In the case mentioned above, when ambient illumination changes, the selector initially tries its heuristic, but as higher-layer tracking algorithms continually fail, the selector gives up on its heuristic and switches to an exhaustive search. After all tracking algorithms have been appropriately adapted, they are sorted by output
Figure 2. Selectors are required between trackers when the input set of one is smaller or of different shape than the output set of the layer below it. (The figure shows a stack of trackers between the full configuration space at the bottom and the target configuration at the top, with selectors inserted where needed; the topmost layer is a high-precision tracker.)
accuracy. Additional selectors are inserted wherever a layer's output is not of the shape expected by the layer immediately above it (see Figure 2). These selectors may be specially designed for the system and the environment, or they may be brute-force search algorithms which systematically return different output sets. In the event that the bottommost layer does not take the full visible configuration set as its input, a final global selector is added at the bottom. Given a collection of configured and sorted trackers and selectors, they are labeled from 0 to n - 1 such that the topmost layer is n - 1 and the bottommost is 0. We then associate a frustration index, frus_i, and a frustration threshold, frus_i^thresh, with each layer i from 0 to n - 1.
Initialization: Reset all selectors, set all frustration indices to 0, set i to 0, and set x_0^in := X_v, the entire visible configuration set.

Loop: Compute x_i^out and evaluate the following conditional:

1. If layer i is a tracker and x_i^out = ∅ or frus_i > frus_i^thresh (signaling mistracking), set frus_i to 0 and decrement i if it is greater than 0.
2. Else, if layer i is a selector and frus_i > frus_i^thresh, then reset frus_i to 0, reset the selector, and decrement i if it is greater than 0.
3. Else, if layer i is a selector or if layer i is a tracker with i < n - 1, set x_{i+1}^in := x_i^out, increment frus_i, and increment i.
4. Else, i = n - 1 and tracking must be successful. Reset all selectors, set all frustration indices to 0, and set x_i^in := x_i^out.

Figure 3. The IFA algorithm.
The purpose of these values is to put a limit on how many times a layer may be called before reverting to a lower layer (see §2.3). Monitoring of the frustration index at any layer allows for detection of repeated, unsuccessful attempts to track in an empty region of the configuration space.
2.3 Algorithm

With this framework in place, we now describe the algorithm (see Figure 3) used to control computation within the IFA framework. We employ the following nomenclature: i represents the currently active layer; x_i^in and x_i^out denote the input and output sets of layer i, respectively; and frus_i and frus_i^thresh represent the frustration index and frustration threshold for layer i, as described above. Once initialized, the algorithm spends its time in the main loop, repeatedly executing the currently active layer and evaluating the results. If a layer mistracks or is frustrated
(steps 1 and 2), it is reset and the system moves down a layer. Step 3 moves the system up a layer, either because tracking at that level was successful and there are higher layers, or because a selector is not yet frustrated. Step 4 represents the case of successful top-layer tracking. In this case, all selectors and frustration indices are reset. The net result is that the algorithm moves the tracking system down layers when tracking is uncertain, thereby broadening the search region, and ascends layers as trackers succeed, narrowing the set of configurations until the top layer is reached.
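The control loop of Figure 3 can be sketched as follows. This is an illustrative rendering under our own assumptions about the layer interface (layers as callables over configuration sets, with an optional reset method, and the top layer a tracker); it is not the authors' implementation.

```python
def ifa_loop(layers, thresholds, X_v, max_cycles=1000):
    """Sketch of the IFA control algorithm of Figure 3.

    `layers` is a list of (kind, layer) pairs ordered from bottom
    (index 0) to top (index n-1); kind is 'selector' or 'tracker',
    and each layer is a callable mapping an input configuration set
    to an output set. The top layer is assumed to be a tracker.
    Returns the sequence of successful top-layer output sets."""
    n = len(layers)
    frus = [0] * n                      # frustration indices
    x_in = [frozenset()] * n
    x_in[0] = frozenset(X_v)            # bottom layer searches everything
    i = 0
    results = []
    for _ in range(max_cycles):
        kind, layer = layers[i]
        x_out = layer(x_in[i])
        if kind == 'tracker' and (not x_out or frus[i] > thresholds[i]):
            # Step 1: mistracking; descend a layer.
            frus[i] = 0
            i = max(i - 1, 0)
        elif kind == 'selector' and frus[i] > thresholds[i]:
            # Step 2: frustrated selector; reset it and descend.
            frus[i] = 0
            if hasattr(layer, 'reset'):
                layer.reset()
            i = max(i - 1, 0)
        elif kind == 'selector' or i < n - 1:
            # Step 3: pass the narrowed set upward.
            x_in[i + 1] = x_out
            frus[i] += 1
            i += 1
        else:
            # Step 4: successful top-layer tracking.
            results.append(x_out)
            frus = [0] * n
            for _, l in layers:
                if hasattr(l, 'reset'):
                    l.reset()
            x_in[i] = x_out
    return results

# Toy demonstration (our own, illustrative): layer 0 exhaustively
# returns 4-element chunks; layer 1 keeps configurations within 1 of
# a simulated measurement of the true configuration, 7.
class ChunkSel:
    def __init__(self, size=4):
        self.size, self.cursor = size, 0
    def __call__(self, x_in):
        ordered = sorted(x_in)
        chunk = frozenset(ordered[self.cursor:self.cursor + self.size])
        self.cursor = (self.cursor + self.size) % max(len(ordered), 1)
        return chunk
    def reset(self):
        self.cursor = 0

near = lambda x_in: frozenset(c for c in x_in if abs(c - 7) <= 1)
found = ifa_loop([('selector', ChunkSel()), ('tracker', near)],
                 thresholds=[3, 3], X_v=range(10), max_cycles=20)
assert found and all(7 in s for s in found)
```

In the demonstration, the tracker rejects the selector's first chunk (forcing a descent) and then locks on once the chunk containing configuration 7 is passed up, after which step 4 repeats at the top layer.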
3 Examples

We demonstrate the IFA framework with several implemented systems which robustly track various objects. These systems were implemented on a Sun Sparc 20 equipped with a standard framegrabber and a single-chip color camera. Since there is significant overlap in the set of trackers and selectors used by the systems, we briefly describe each tracker and selector separately:
Linear exhaustive selector: This selector takes any set of configurations as its input and returns a set that is appropriate for the tracker above it. Repeated application of this selector to the same input set will return different output sets such that the entire visible configuration space of the target object is covered within a finite number of applications.
Random selector: This selector takes any set of configurations as its input and randomly returns a set that is appropriate for the tracker above it. Repeated application of this selector returns different output sets such that in the limit of infinite applications, the entire visible configuration space is covered. Either of the selectors above provides the "last resort" mechanism for any heuristic-based selector, and both are used as the basis for the selectors described below.
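This "last resort" behavior can be sketched as a generic selector that wraps a candidate heuristic and degrades to exhaustive chunked search once the heuristic has been tried a few times, as §2.2 prescribes. All names and the `patience` constant are our own illustrative choices, not from the paper:

```python
class HeuristicSelector:
    """Generic selector sketch (our illustration): try a candidate
    heuristic first, then fall back to exhaustive chunked search so
    the union of outputs over time still covers the input set."""
    def __init__(self, heuristic, chunk_size=4, patience=2):
        self.heuristic = heuristic      # callable: x_in -> candidate subset
        self.chunk_size = chunk_size
        self.patience = patience        # heuristic tries before giving up
        self.reset()

    def reset(self):
        self.calls = 0
        self.cursor = 0

    def select(self, x_in):
        self.calls += 1
        if self.calls <= self.patience:
            candidate = set(self.heuristic(x_in)) & set(x_in)
            if candidate:
                return frozenset(candidate)
        # Last resort: systematically walk the whole input set.
        ordered = sorted(x_in)
        chunk = frozenset(ordered[self.cursor:self.cursor + self.chunk_size])
        self.cursor += self.chunk_size
        return chunk

# A heuristic invalidated by, say, a lighting change picks nothing
# useful; the selector then degrades to exhaustive search.
sel = HeuristicSelector(lambda x_in: set(), chunk_size=5, patience=2)
full = frozenset(range(10))
first, second = sel.select(full), sel.select(full)
assert first | second == set(full)

# When the heuristic does fire, its candidate is returned directly.
sel2 = HeuristicSelector(lambda x_in: {3}, chunk_size=5, patience=2)
assert sel2.select(full) == frozenset({3})
```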
Motion-based selector: See the description given in §2.1.

Position-based selector: This type of selector narrows an input set according to preset geometric constraints, which may be relative to the camera coordinate system or absolute with respect to a physical coordinate system. If, for example, it is known a priori that there will be two objects of similar appearance and that the target is the one to the left, the selector begins by returning output configurations which are consistent with this information. Position-based selectors may also incorporate prediction models such that selection occurs based on where the object is expected to be. As with any selector, multiple calls to this selector will eventually cover the entire input set.
Color-based selector: A color-based selector marks the regions in an image which contain a certain range of hues (similar to the way different regions are categorized by color in [3]). A full-frame, color-based selector initialized for a target object justifies its output heuristic based on the assumptions that the target object is within a certain range of hues, that ambient light is not biased toward any hue, that noise with respect to hue is small, that the object is at least partially in view, and that the object is near enough to the camera that it projects onto a finite number of pixels in the camera image. Given these assumptions, the color-based selector will find the pixels in the image which are within the range of desired hues and will initially return a set of possible configurations for its target which correspond to configurations of the target object that are consistent with what is projected onto the image.
Intensity-based selector: The intensity-based selector performs exactly like the color-based selector, except that its image processing occurs on the intensity values of pixels rather than on hue. See Figure 9 for an example of an intensity-based selector choosing only parts of the image which coincide with the target object's intensity values.
Combination selector: A combination selector nests some of the search heuristics described above. For example, a color- and motion-based selector would initially return output sets whose configurations correspond to projections which match a given hue and exhibit motion.

Many of the selectors above are appropriate for processing an entire image. We employ them on subsampled, low-resolution images, which take only a fraction of the time it would take to grab and process complete images. Because these selectors produce relatively low-accuracy output, we have found that applying them to low-resolution images does not significantly impact their performance.
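As one concrete illustration of the hue test underlying the color-based selector above, the following sketch (ours; the paper does not specify an implementation) computes a boolean pixel mask from a precomputed hue image, handling hue ranges that wrap around the color circle. The range values are illustrative:

```python
import numpy as np

def hue_mask(hue_img, lo=340.0, hi=20.0):
    """Boolean mask of pixels whose hue (degrees in [0, 360)) lies in
    [lo, hi), handling ranges that wrap past 360 (e.g. reds)."""
    if lo <= hi:
        return (hue_img >= lo) & (hue_img < hi)
    return (hue_img >= lo) | (hue_img < hi)    # wrapped range

frame = np.array([[350.0, 100.0],
                  [ 10.0, 200.0]])             # toy 2x2 hue image
mask = hue_mask(frame)                         # red-ish pixels survive
assert mask.tolist() == [[True, False], [True, False]]
```

On a subsampled frame, the connected regions of this mask would then be translated into candidate configuration sets for the layer above.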
Homogeneous region tracker: If the target object has pixel values (hue or intensity) which are predominantly within a certain range, then the homogeneous region tracker will produce a set of configurations which correspond roughly to the "center" of the region. Tracking is performed by mutually repelling elements which strive to stay within a region where the pixel values are consistent with one another. As the distance between elements (the diameter) expands, the size of the output configuration set is correspondingly reduced.
Edge-of-polygon tracker: Intuitively, this tracker tracks a single edge of a polygon. In terms of the tracking formalism, it expects a set of target configurations which would project the target object onto a small region of the image, but for which no orientation information is known. By tracking a single edge, the output configuration set is culled to include only those configurations of the polygon which would be consistent with the object projecting the tracked edge onto the camera image.
Feature-based rectangle tracker: A specific instance of the polygon tracker described in §2.1, where four corners are used to track an approximately rectangular shape.
SSD-based pattern tracker: The SSD-based pattern tracker matches a stored template to a region of the current image and computes a residue value. It returns an output set whose elements are configurations that correspond to affine transformations of the template which match the current camera image. It fails if the computed sum-of-squared-difference (SSD) residue is above a given threshold (see [10]). The residue of an SSD computation is the actual pixel-by-pixel sum of squared differences. Variations of this tracker may perform the same function with a subset of affine transformations.
SSD-based multi-pattern tracker: The same tracker as above, but which tracks successfully if the current image contains a match to any of several stored images.
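The SSD residue used by the two trackers above is straightforward to state in code. The sketch below (arrays and the threshold value are illustrative) shows the pixel-by-pixel sum of squared differences and the thresholded success test that distinguishes a match from a mistrack:

```python
import numpy as np

def ssd_residue(template, window):
    """Pixel-by-pixel sum of squared differences between a stored
    template and an equally-sized image window."""
    d = template.astype(float) - window.astype(float)
    return float(np.sum(d * d))

def ssd_match(template, window, threshold):
    """True only when the residue falls below the threshold; otherwise
    the tracker signals mistracking."""
    return ssd_residue(template, window) < threshold

tpl = np.array([[10, 20], [30, 40]])
win = np.array([[11, 19], [30, 42]])
assert ssd_residue(tpl, win) == 6.0      # 1 + 1 + 0 + 4
assert ssd_match(tpl, win, threshold=10)
assert not ssd_match(tpl, win, threshold=5)
```

The two-layer face tracker of §3.2 can be read as calling `ssd_match` twice with the same template but a high threshold in the lower layer and a low threshold in the upper one.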
3.1 Robust Disk Tracking

In the past, we have often used 3.5 inch diskettes as a convenient target for various experiments with visual servoing [9]. These experiments, while successful in demonstrating hand-eye coordination, were nevertheless extremely sensitive to disturbances. Distractions in the background, whole or partial occlusion, sudden motion of the camera, and various other "unexpected" perturbations of the visual environment often resulted in undetected mistracking, which led to incorrect and even dangerous motions by the robot arm as the visual servoing algorithm computed robot velocities based on unreliable tracking information. To make this system more robust to visual disturbances, we have developed a robust disk tracker which consists of the following components (in order of increasing output accuracy): a color-, motion-, intensity-, and position-based combination selector (layer 0)*, a homogeneous region tracker (layer 1), an edge-of-polygon tracker (layer 2), and a feature-based rectangle tracker (layer 3).

* This layer is actually numbered 0.0 through 0.8 in Figures 6 and 7 because several different heuristics are nested within one another. For example, sublayer 0.8 returns output sets which simultaneously satisfy color, motion, intensity, and position criteria, but sublayer 0.4 allows output sets which only satisfy the color-based selection. Internal frustration indices cause lower sublayers to take over when higher ones repeatedly fail.

The selector at the bottom rapidly focuses the attention of the system on areas which match the color of the disk, which exhibit motion, and which are positionally near the predicted location of the disk. The color and intensity heuristic is valid because ambient lighting conditions normally don't change significantly (although if they did, the selector would not insist on particular colors indefinitely). The use of motion and positional prediction is justified by the fact that if the system reverts to this layer, it is most likely due to overly fast motion of the disk relative to the camera. The homogeneous region tracker helps to quickly discard regions of the input set in which the disk is unlikely to be by conservatively signaling a mistrack if it is unable to grow beyond a certain diameter. In other words, it checks whether the output configuration set of the selector projects onto a reasonably-sized patch of the camera image. Finally, the edge-of-polygon tracker finds an edge of the disk, and the rectangle tracker tracks the four corners. The robustness of the system is evident in its behavior under various conditions. Tracking the disk proceeds perfectly as long as the disk moves below a certain speed, there are no distractions, and the disk remains wholly in view. When one of the corners is occluded or if there are significant distractions in the background, the top-layer tracker signals a mistrack, and tracking oscillates between homogeneous region tracking and rectangle initialization based on edge-of-polygon tracking. Homogeneous region tracking remains stable if only the corners are occluded, but the rectangle tracker keeps mistracking as long as the occlusion remains. During this time, even though the tracker is unable to recover the corner coordinates of the rectangle, it still maintains a rough approximation of the position of the disk because of the region tracker.
If the occlusion is temporary, tracking resumes as the rectangle tracker recovers its target. If the occlusion persists and the region tracker becomes frustrated, or if the occlusion becomes complete and the region tracker signals a mistrack, or if sudden fast motion occurs and the region tracker signals a mistrack, then the system must start again at the bottom-layer selector. At this point, if the disk is completely occluded, tracking will fail. If there is another disk of the same color in the field of view, tracking may continue on the impostor, assuming the user has not explicitly incorporated a way to distinguish between disks. If the disk reappears, IFA will reinitialize the disk at its new location, and the system will continue tracking.
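The climb-and-fall behavior described above can be sketched as a small state machine. The following Python sketch is purely illustrative and not the actual implementation: we assume each layer is a function taking a configuration set and returning a (success, narrowed_set) pair, and we adopt a simple one-step climb-on-success, fall-on-mistrack policy (the real system also uses frustration indices within the selector sublayers).

```python
class IFATracker:
    """Illustrative sketch of IFA layer switching (names and the step
    interface are our assumptions, not the original implementation)."""

    def __init__(self, layers):
        self.layers = layers   # coarse selector first, precise tracker last
        self.level = 0         # index of the currently executing layer

    def step(self, config_set):
        """Run one cycle of the current layer.

        On success, climb one layer toward the precise tracker and return
        the narrowed configuration set; on mistrack, fall back one layer
        and return the set unchanged."""
        ok, narrowed = self.layers[self.level](config_set)
        if ok:
            if self.level < len(self.layers) - 1:
                self.level += 1              # climb toward precise tracking
            return narrowed
        self.level = max(self.level - 1, 0)  # revert to a coarser layer
        return config_set
```

Under this sketch, a persistent mistrack at the top layer walks the system back down to the layer-0 selector, which mirrors the oscillation between region tracking and rectangle initialization described above.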
3.2 Robust Surface Tracking

Approximately planar surfaces of various kinds can be tracked robustly using an IFA system similar to the disk tracker, but substituting a sequence of SSD trackers where the disk tracker uses homogeneous region tracking and rectangle tracking. For concreteness, we describe the actual technique we use to robustly track a face, although the same system can be used with minimal alteration to track almost any roughly planar surface. We again list the trackers and selectors in order of increasing output accuracy: a color-, motion-, and position-based combination selector (layer 0), an SSD tracker with a high residue threshold (layer 1), and an SSD tracker with a low residue threshold (layer 2). The selector acts as usual, narrowing the visual field to those areas which exhibit motion and are of the target person's skin tone. The lower SSD tracker, with the high residue threshold, tries to make a rough match to its template given a region to search. If, during its tracking phase, it is unable to reduce the residue below its threshold, it signals a mistrack and reverts to the selector. If the residue after a tracking phase is below its threshold, then tracking proceeds to the top-layer tracker, which is the same SSD tracker with a lower threshold. The lower threshold is chosen so that the tracker will succeed only when the examined region of the image is similar to the template up to ambient image noise and minor illumination changes. Repeated experiments show that this three-layered IFA tracker quickly recovers track of a target face after perturbations like those described for disk tracking (§3.1) and continues tracking accurately under the assumptions made when configuring the bottom-layer tracker. Figure 4 shows face tracking at different layers.
Figure 4. Face tracking at different layers: layer 2 (left) and layer 0 (right). On the right, tracking has temporarily reverted to layer 0, since the stored image does not match the rotated head.
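The two SSD layers differ only in how strict a match they demand. The sketch below illustrates this residue-threshold idea; the function names and threshold values are invented for illustration, and where the actual trackers search for the best match by steepest descent [10, 13], we simply score a single candidate patch.

```python
import numpy as np

def ssd_residue(template, patch):
    """Mean squared intensity difference between a template and a patch."""
    diff = template.astype(float) - patch.astype(float)
    return float(np.mean(diff ** 2))

def make_ssd_layer(template, threshold):
    """Build a layer that succeeds iff the match residue beats its threshold."""
    def layer(patch):
        r = ssd_residue(template, patch)
        return r < threshold, r
    return layer

# Two layers sharing one template, differing only in threshold (values invented):
template = np.zeros((8, 8))
coarse_ssd = make_ssd_layer(template, threshold=500.0)  # layer 1: rough match
fine_ssd = make_ssd_layer(template, threshold=50.0)     # layer 2: precise match
```

A moderately corrupted patch would pass the coarse layer but fail the fine one, which is exactly the situation in which the system holds at layer 1 and reports approximate position only.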
3.3 Robust Object Tracking

Finally, we describe a generic object tracker which can be seen as a precursor to a robust pose-estimating 3-D object tracker. The generic object tracker is based on the idea of tracking an object by matching against multiple views. The implementation is identical to that for robust surface tracking, except that the SSD trackers are replaced by SSD-based multi-pattern trackers, and the color-based heuristic for the bottom-layer selector seeks color matches based on the union of the predominantly occurring colors in all stored images. In Figure 5, we see how two images were used to track an envelope. The faces of the envelope were initialized as the images to track. In the left and right figures, tracking proceeded at the top layer, since one of the envelope's faces was completely visible. However, while the envelope was being flipped over, no match was made to either image; thus, tracking temporarily reverted to a lower layer where color-based selection takes place.
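The multi-pattern extension can be sketched as an SSD layer that scores a candidate against every stored view and succeeds if any view matches well enough. As before, the helper names and the threshold are illustrative assumptions, not the original code.

```python
import numpy as np

def ssd_residue(template, patch):
    """Mean squared intensity difference between a template and a patch."""
    diff = template.astype(float) - patch.astype(float)
    return float(np.mean(diff ** 2))

def make_multiview_layer(templates, threshold):
    """Succeed iff the best-matching stored view beats the residue threshold."""
    def layer(patch):
        best = min(ssd_residue(t, patch) for t in templates)
        return best < threshold, best
    return layer

# Two stored views standing in for the two faces of the envelope:
front = np.zeros((8, 8))
back = np.full((8, 8), 100.0)
envelope_layer = make_multiview_layer([front, back], threshold=50.0)
```

While the object is mid-flip and resembles neither stored view, the layer signals a mistrack and control falls to color-based selection, matching the behavior shown in Figure 5.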
3.4 Analysis

Figures 6 and 7 show where the tracking systems spend their time during some example runs. The horizontal axis represents time and the vertical axis represents the IFA layer executed
Figure 5. Tracking an envelope as two different sides: front face (left), no match (center), back face (right). In the center image, the envelope is tilted too much to match either stored image, so tracking has reverted to color selection.
by the system. Time instants marked by an `a' indicate where temporary partial occlusions or slight taps to the camera occurred. In these instances, top-layer tracking was disturbed, causing the system to revert momentarily to an intermediate tracker, but as the occlusion ceased or the camera resettled, the system quickly climbed back up to the top-layer tracker. At times marked by `b', the camera was jerked away so that tracking was completely lost. While the camera was readjusted with the target object at a different location in the field of view, the system oscillated back and forth among the low layers as it tried to find the target object among various candidates. Eventually, the selector chose an output set containing the target object, and tracking settled back up to the top layer of the framework. Finally, at times marked `c', temporary full occlusion occurred. In all such instances, tracking again reverted to the bottommost selector, since intermediate trackers could not find their targets. However, unlike the case where tracking was lost without occlusion, the bouncing between low layers here occurred because the system sought a target object that was not visible. As the occlusion ended, precise tracking resumed.

Differences in the shape of the graphs in Figures 6 and 7 can be explained by the differences in the underlying layers of the respective systems. The SSD-based system spends much of its time in layer 1, especially following complete occlusion. Layer 1 corresponds
Figure 6. Disk tracking layers and recovery time. Low clutter (top), medium clutter (middle), and high clutter (bottom). (Each plot shows the executing layer against time in seconds; instants marked `a', `b', and `c' correspond to the disturbances described in the text.)
to the coarse SSD tracker. Because our SSD tracking implementation (see [10, 13]) is based on a steepest-descent algorithm, this tracker requires a relatively long tracking phase; the full phase must be endured before the tracker can recognize whether the object being tracked is the actual target. This time could perhaps be shortened by other techniques (such as color histogramming) which might distinguish between targets and non-targets more quickly. In contrast, layer 1 of the disk tracker has a short tracking cycle; thus, it quickly either proceeds to a higher layer or reverts to layer 0, without remaining in layer 1 for long.

Figure 8 shows the average time for the system to recover track at the top layer following a complete occlusion (the entire field of view was temporarily covered). Because our top-level selectors use heuristics similar to pop-out for color and motion, the more cluttered the background is with respect to these attributes, the greater the time required for tracking to recover. The maximum recovery time over more than 100 trials was 15.3 seconds, in a heavily cluttered environment in which both foreground and background objects were in motion. In all instances, however, tracking eventually resumed when the target object returned to full view. Figure 9 shows the backgrounds used for the different levels of clutter.

Perhaps the best argument for IFA systems is the subjective experience of watching them in action. Where previously, entering the camera field of view was taboo during experiments which use tracking, now it is almost a pleasure to deliberately cause mistracking, just to watch the system recover its target. Tracking applications can be left running even as people walk in front of the camera, lights are turned off, or target equipment is moved, because soon after the environment settles again, tracking will have resumed. In a word, tracking is tenacious.
Finally, one of the advantages of the IFA framework from a programmer's point of view is the relative ease with which existing trackers can be incorporated into it. The bottom layers, with some variation, suit almost all tracking applications. Then, given an existing
Figure 7. SSD tracking layers and recovery time. Low clutter (top), medium clutter (middle), and high clutter (bottom). (Each plot shows the executing layer against time in seconds; instants marked `a', `b', and `c' correspond to the disturbances described in the text.)
Figure 8. Average recovery times (sec.) after complete occlusion with different degrees of background clutter (low, medium, high) for experiments with disk and face tracking.
Figure 9. Experimental setup for low, medium, and high levels of clutter. Regions which pass an intensity-based selector are marked. Note that as clutter increases, the size of the candidate configuration set also increases.
precision tracker, it is usually a simple matter to decide when it mistracks based on geometric and visual constraints, especially because there is room to err on the side of being overly conservative. Putting the pieces together is then a matter of nesting tracking loops with entrance and exit conditions that depend on the layers; placing the fine tracker in the innermost loop fashions the top of the framework, and added robustness is the result. In this way, we have been able to easily adapt the framework to provide robustness for trackers of the X Vision system [10], which offers a set of modular tools for feature-based motion tracking.
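One illustrative reading of this nesting, under the same assumed (success, narrowed_set) layer interface as before, is a combinator that runs the fine layer only on the coarse layer's output and falls back to the coarse result when the fine layer mistracks. The combinator name and interface are our assumptions, sketched for exposition only.

```python
def nest(coarse, fine):
    """Compose two layers: the fine layer runs only on the coarse layer's
    narrowed output; a fine-layer mistrack degrades to the coarse result
    instead of failing outright."""
    def layer(config_set):
        ok, narrowed = coarse(config_set)
        if not ok:
            return False, config_set             # coarse layer mistracked
        fine_ok, fine_out = fine(narrowed)
        return True, (fine_out if fine_ok else narrowed)  # degrade gracefully
    return layer

# Placing the finest tracker innermost fashions the top of the framework, e.g.:
#   stack = nest(selector, nest(region_tracker, rectangle_tracker))
```

When the innermost tracker fails, the composite still returns the coarser layer's output set, which is the graceful-degradation behavior the framework is built around.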
4 Towards a Notion of Robustness

As stated at the outset, our goal is to provide "robust" tracking through incremental focus of attention. As is often the case, however, pinning down a precise definition of robustness which matches our intuitive expectations is difficult. In the case of IFA-based algorithms, we would like eventually to make two claims. First, we would like to claim that given a tracking algorithm which operates reliably and accurately within a certain range of environmental conditions, and given any number of other, less accurate trackers and selectors which can be used to incrementally narrow the search space for this tracking algorithm, combining them into an IFA-based system will result in a tracker which performs no worse than the top-level tracking algorithm. Second, we would like to show that the system degrades gracefully when assumptions are violated. Unfortunately, formally establishing such claims would require a far more formal understanding of IFA-based tracking than we currently have. However, the structure of the algorithms makes it possible to sketch some of the analysis that would be required.

The first assertion, that an IFA-based system will perform at least as well as its top-layer tracker when conditions permit, depends only on proving that the IFA algorithm will eventually move to the top level under those conditions. Since selectors are guaranteed to eventually exhaust the entire target configuration space, the only way this could fail to happen is if an
intermediate-level tracker failed consistently, even when provided with an input set which contained the target. Thus, this claim can be established for a given system provided it can be shown that, under equivalent imaging conditions, if a tracker with fine output accuracy succeeds on an input set x_ini, then any less accurate tracker would succeed on any input set which includes x_ini.

To show the second assertion, that the IFA tracker degrades gracefully as assumptions are violated, we must first define what it means to degrade gracefully. With respect to the tracking problem, we might define graceful degradation in terms of the following qualities of a tracking system:
1. The system will operate as expected given that a maximal set of assumptions remains valid. The maximal set of assumptions comprises those which are required for a tracker to return an output set with the precision expected by the user.

2. The system will continue (possibly faulty) operation given that a minimal set of assumptions holds. The minimal set of assumptions for a tracker can be thought of as the reasonable circumstances without which no tracker can be expected to perform (e.g., no hardware or operating system failures, some ambient light, etc.).

A system degrades gracefully if, starting from the minimal set, adding more assumptions allows tracking to improve. Tracking is "better" if one or more of the following criteria are satisfied to greater degrees:

3. The system is able to reliably report that the final output configuration set has less precision than is required by the user, or to indicate when tracking is completely unsuccessful.

4. The system will recover a lost target whenever the maximal set of assumptions becomes valid.
5. The output set is likely to contain the target configuration.

6. The output set is small.

7. The system can quickly return an output set.

We note that these qualities can occur in differing degrees and do not define a total ordering on the "gracefulness" of a tracking system. For instance, a tracker which returns a relatively large output set that is highly likely to contain the target configuration cannot easily be compared with another tracker which returns a small output set that is less likely to contain the target configuration. An IFA-based system generally satisfies properties 1 through 3 by construction. The conditions for 4 were established above. Satisfying criteria 5 through 7 depends on the presence of intermediate-level trackers which can operate when conditions are not ideal. Thus, to the extent that such trackers exist and can be proven to exhibit the required properties, it can be said that graceful degradation is an inherent quality of the IFA framework.
5 Conclusion

The IFA framework is a general scheme for adding robustness to almost any tracking system. By explicitly constructing trackers which recognize their own failure and by layering fallback mechanisms, the IFA architecture endows a system with the quality of degrading gracefully with respect to environmental assumptions. We do not claim to solve the tracking problem under complete occlusion or when the lights are off, but IFA provides a framework such that whenever reasonable conditions return, so, too, will tracking.

Further work with IFA proceeds along two main avenues: improving selectors and trackers, and creating dynamic IFA systems which swap layers in and out depending upon environmental conditions. The former is ongoing research by the vision community as a whole; improved trackers and pop-out algorithms will only add to the power of IFA systems.
The latter is work in which we are currently engaged, in an attempt to make IFA systems adapt to their environment by dynamically altering component selectors and trackers. Finally, we are proceeding to formalize the notions of "robustness" and "graceful degradation" with the goal of providing a formal framework for the dynamic analysis of IFA-based tracking algorithms.
Acknowledgments

This research was supported by ARPA grant N00014-93-1-1235, Army DURIP grant DAAH04-95-1-0058, National Science Foundation grant IRI-9420982, and funds provided by Yale University.
References

[1] R. A. Brooks. A robust layered control system for a mobile robot. IEEE J. of Rob. and Autom., 1(1):24-30, March 1986.

[2] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139-159, 1991.

[3] J. D. Crisman and C. E. Thorpe. SCARF: a color vision system that tracks roads and intersections. IEEE Trans. Rob. and Autom., 9(1):49-58, February 1993.

[4] J. L. Crowley, P. Stelmaszyk, T. Skordas, and P. Puget. Measurement and integration of 3-D structures by tracking edge lines. Int'l Journal of Computer Vision, 8(1):29-52, 1992.

[5] S. M. Culhane and J. K. Tsotsos. An attentional prototype for early vision. In Computer Vision - ECCV '92, pages 551-560, Italy, May 1992.
[6] L. D. Erman, F. Hayes-Roth, V. R. Lesser, and D. R. Reddy. The HEARSAY-II speech understanding system: Integrating knowledge to resolve uncertainty. Computing Surveys, 12(2):213-253, 1980.

[7] R. E. Fikes and N. J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189-208, 1971.

[8] D. B. Gennery. Visual tracking of known three-dimensional objects. Int'l Journal of Computer Vision, 7(3):243-270, 1992.

[9] G. D. Hager. Calibration-free visual control using projective invariance. In Proceedings of the ICCV, pages 1009-1015, 1995. Also available as Yale CS-RR-1046.

[10] G. D. Hager. The "X Vision" system: A general purpose substrate for real-time vision-based robotics. In Proceedings of the Workshop on Vision for Robots, pages 56-63, 1995. Also available as Yale CS-RR-1078.

[11] A. Kosaka and G. Nakazawa. Vision-based motion tracking of rigid objects using prediction of uncertainties. In Proc. 1995 IEEE Conf. on Rob. and Autom., pages 2637-2644, Nagoya, Japan, May 1995.

[12] D. G. Lowe. Robust model-based motion tracking through the integration of search and estimation. Int'l Journal of Computer Vision, 8(2):113-122, 1992.

[13] J. Shi and C. Tomasi. Good features to track. In Computer Vision and Patt. Recog., pages 593-600. IEEE Computer Society Press, 1994.

[14] H. Tagare and D. McDermott. Model-based edge selection for 2-D object recognition. DCS RR-1044, Yale University, August 1994.

[15] D. Terzopoulos and T. F. Rabie. Animat vision: Active vision in artificial animals. In Int'l Conf. on Comp. Vision, pages 801-808, June 1995.
[16] K. Toyama and G. D. Hager. Keeping your eye on the ball: Tracking occluding contours of unfamiliar objects without distraction. In Proc. 1995 Int'l Conf. on Intel. Rob. and Sys., pages 354-359, Pittsburgh, PA, August 1995.

[17] K. Toyama and G. D. Hager. Tracker fusion for robustness in visual feature tracking. In SPIE Int'l Sym. Intel. Sys. and Adv. Manufacturing, volume 2589, Philadelphia, PA, October 1995.

[18] J. K. Tsotsos. An inhibitory beam for attentional selection. In Spatial Vision for Humans and Robots. Cambridge University Press, 1993.

[19] J. K. Tsotsos. Towards a computational model of visual attention. In T. Papathomas, C. Chubb, A. Gorea, and E. Kowler, editors, Early Vision and Beyond, pages 207-218. MIT Press, 1995.