Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head, USA, 2000
Multifeature object tracking using a model-free approach
Hieu T. Nguyen and Marcel Worring
Intelligent Sensory Information Systems, University of Amsterdam, Faculty of WINS
Kruislaan 403, NL-1098 SJ, Amsterdam, The Netherlands
[email protected] [email protected]
Abstract
In this contribution we introduce a new model-free method for object tracking. The tracking is posed as a segmentation problem which we solve using the watershed algorithm. A framework is defined to compute the required topographic surface from distances to the predicted contour, intensity edges and motion edges. This multifeature tracking approach yields accurate results in the presence of object corners, image clutter, and camera motion. Results on real sequences confirm the stability and robustness of the method. Objects are tracked over long sequences and in the presence of fast object motion.
1. Introduction
In digital video analysis, which has attracted increasing attention in recent years, tracking objects is of great importance for segmentation tasks as well as for content-based video indexing. Due to the diversity of objects and events appearing in video, a tracking scheme should be:
• able to track objects of arbitrary shape.
• able to cope with non-rigid motion.
• robust to camera movement.
• robust to image clutter.
In this paper, we focus on tracking of the object contour. In most existing methods, contours are approximated by a parametric model. For example, in [2, 6], the contour is approximated by B-splines. The methods first fit a B-spline curve to the intensity edges and then use a Kalman filter to track the B-spline coefficients. Several other tracking algorithms are based on the model of active contours, or snakes, with an associated energy [11, 10].
The contour energy commonly comprises two parts: the external energy, which represents the image forces attracting the contour to image features, and the internal energy, which imposes smoothness on the contour. In [11], Terzopoulos et al. in addition defined a kinetic energy taking the motion of the contour into account. The Euler-Lagrange equation describing the minimization of the total energy is the basis for the construction of a Kalman filter. Although this equation is defined in the continuous domain, the implementation requires the contour to be approximated by a polygon with a fixed number of vertices.
The approximation of the contour by models with a fixed number of parameters has the drawback that objects of arbitrary shape cannot be tracked. The methods also need to impose smoothness constraints on the contour, which causes inaccuracies at corners. While most of the methods use a model to approximate the object contour, a few methods [5, 9] are model-free. In [5], tracking is performed by means of motion segmentation: the watershed algorithm is employed to segment the 3D optic flow. This method lacks techniques to cope with image clutter and does not impose consistency between the tracking results of successive frames. In [9], Paragios et al. apply the level-set approach to minimize the contour energy. Unlike the snake-based methods mentioned earlier, this approach does not require a polygonal approximation of the contour. However, the method does not work when the camera is moving.
In the existing methods the determination of the object contour relies on the gradient of the intensity. We remark, however, that the use of the gradient has two drawbacks. First, if the initial contour, or a part of it, lies in a uniform area, the gradient information cannot tell the algorithm where to move the contour. Second, this approach always attracts the contour to the areas of highest gradient, corresponding to the strongest edges. This is a problem because irrelevant edges, i.e. those inside objects as well as those in the background, may have higher contrast than the object boundary. The problem gets worse when the gradient at the contour changes as the object moves into a region with a different intensity level.
In this contribution we develop a new model-free method for tracking contours. The paper is structured as follows. In Section 2 we develop the framework for contour tracking using edges, based on the watershed algorithm. Section 3 discusses the detection method for each kind of edge used. Tracking results are shown in Section 4.
2. Contour tracking using edges
Every iteration of a tracking procedure consists of updating the object contour based on measurements in the current frame, taking into account the results from previous iterations. Finding the contour should be based on edges of object features such as intensity, color or motion. In this section we first develop a new method for the detection of the object contour in an edge map of an arbitrary type. We then extend the method to the case where several kinds of edges are used in combination.
2.1. Finding the contour in an edge map using the watershed algorithm
For the moment, consider the problem of detecting the object contour in one (observed) edge map Θ. Denote the expected contour by Γ. The following two requirements should be met:
• Γ should be composed of pixels with the highest chance of being a contour point.
• Γ should not be too different from the tracking result of the previous frame.
To measure the chance of a pixel x being a contour point, consider the a posteriori probability density p(x ∈ Γ|Θ). According to Bayes' rule:

p(x ∈ Γ|Θ) ∝ p(x ∈ Γ) p(Θ|x ∈ Γ)    (1)
Assuming no prior information is available, i.e. p(x ∈ Γ) is constant over the entire image, we can evaluate p(x ∈ Γ|Θ) via the likelihood p(Θ|x ∈ Γ). We assume further that the edge detector is good enough to detect the entire contour, so that any contour point x corresponds to one point in the edge map, denoted c(x), while the rest of the edge map is assumed to be unrelated to x. This implies p(Θ|x ∈ Γ) ∝ p(c(x)|x ∈ Γ). We define c(x) as the edge point closest to x. The point c(x) is considered a measurement of x and may not coincide with x, since the image signal is corrupted by noise. Under the assumption that the measurement error is unbiased and normally distributed with variance σ², we finally obtain:

p(x ∈ Γ|Θ) ∝ p(c(x)|x ∈ Γ) ∝ exp[−(1/σ²) D(x, Θ)²]    (2)
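To make equation (2) concrete, the following sketch (our illustration, not the authors' code) computes the unnormalized likelihood on a binary edge map. It uses SciPy's exact Euclidean distance transform as a stand-in for the chamfer distance of [3], and the value of σ is an assumed noise scale.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def contour_likelihood(edge_map, sigma=2.0):
    """Unnormalized p(x in Gamma | Theta) from equation (2).

    edge_map : boolean array, True at edge pixels of Theta
    sigma    : assumed standard deviation of the edge localization error
    """
    # D(x, Theta): distance from every pixel to the nearest edge pixel.
    D = distance_transform_edt(~edge_map)
    return np.exp(-(D ** 2) / sigma ** 2)
```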
where D(x, Θ) is the distance from x to the edge map. This distance can be computed using the algorithm developed in [3]. Considering (2), the first requirement means that the expected contour should be composed of pixels which are as close as possible to edges. Note that in p(x ∈ Γ|Θ) we have ignored the influence of the edge strength, which is related to the gradient, because of the drawbacks mentioned in the introduction. Our method uses the distance to the edge map as an image force attracting the contour. The contour is thus always forced to move to the nearest edge points, so the method works even if the initial contour is far from the actual contour. Moreover, all edges with high enough strength have an equal influence on the contour.

We now consider the second requirement. Let Θp be the contour predicted from the previous frame; its derivation is described in Section 3.1. We restrict ourselves to finding Γ within a narrow band around Θp, denoted C. This band is obtained by the homotopic thickening of Θp. The latter operation preserves connectivity, so if Θp is a simple contour, i.e. non-self-intersecting, then C divides the rest of the image into two disjoint parts: the interior Int(C) and the exterior Ext(C). The size r of the thickening should be large enough that Int(C) can be assumed with high confidence to belong to the object, while Ext(C) can be considered to belong to the background. The task now is to reclassify the pixels in C as either object or background.

We do this using the watershed algorithm from mathematical morphology [1]. Recall that the watershed algorithm was developed to segment the image into influence zones of given marker regions. The algorithm grows the marker regions by successively adding their neighboring pixels until the remaining area of the image is filled up. Crucially, during this process the neighboring pixels of the markers are visited according to a priority function f(x). The resulting regions define the influence zones of the original markers. This resembles flooding water from the markers over the topographic surface −f(x). The boundary between the influence zones is called the watershed line and corresponds to the crest lines on the topographic surface.

Let us now return to our contour detection problem. We use Int(C) and Ext(C) as the two markers for the watershed. The priority for visiting the pixels remaining in C should be such that pixels that are less likely to be boundary points are removed from the band first. Following the discussion of the first requirement, we define f(x) = D(x, Θ). Thus, we determine the object contour by applying the watershed algorithm to the topographic surface −D(x, Θ), the negated distance transform of the edge map. Pixels far away from edges have higher priority and are therefore assigned to one of the two markers first, whereas pixels on edges, which are the most likely boundary points, remain until the end. The obtained watershed line is again a simple contour. One iteration of the process is illustrated in Figure 1.
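The following Python sketch reconstructs one such contour update with standard SciPy/scikit-image routines. It is our own illustration, not the authors' implementation: the object is represented by a binary mask whose boundary plays the role of Θp, the band C is built from plain distance thresholds rather than a true homotopic thickening, and the radius r is a free parameter.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt
from skimage.segmentation import watershed

def track_step(edge_map, prev_mask, r=4):
    """One contour update: reclassify the band C around the predicted
    contour into object/background by flooding from two markers.

    edge_map  : boolean edge map Theta (True at edge pixels)
    prev_mask : boolean mask of the predicted object region (boundary = Theta_p)
    r         : thickening radius of the band C, in pixels
    """
    # Band C: pixels within distance r of the predicted contour, on either side.
    d_in = distance_transform_edt(prev_mask)    # inside pixels: distance to the boundary
    d_out = distance_transform_edt(~prev_mask)  # outside pixels: distance to the boundary
    band = (prev_mask & (d_in <= r)) | (~prev_mask & (d_out <= r))

    # Markers: 1 = Int(C) (object), 2 = Ext(C) (background).
    markers = np.zeros(edge_map.shape, dtype=np.int32)
    markers[prev_mask & ~band] = 1
    markers[~prev_mask & ~band] = 2

    # Priority f(x) = D(x, Theta); flooding the surface -D assigns pixels far
    # from edges first, so edge pixels end up on the watershed line.
    D = distance_transform_edt(~edge_map)
    labels = watershed(-D, markers)
    return labels == 1                          # updated object mask
```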
Figure 1. One iteration of the tracking method: the contour Θp predicted from the previous frame is thickened into the band C, whose interior and exterior serve as markers 1 and 2; applying the watershed within C yields the updated contour Γ.
2.2. Combined use of several kinds of edges
The algorithm developed in the previous subsection for a single edge map is easily extended to take several kinds of edges into account. Such a combined use of edges increases the tracking stability. In general, the object contour coincides with intensity edges, so the intensity (or color) edge map should certainly be used. Moreover, the predicted contour Θp should also be used (as an edge map) to reduce the sensitivity of the tracker to errors in edge detection: in case of missing edges, the predicted contour is kept. Finally, we can also exploit motion edges, which indicate changes in the optic flow. Thus, in this paper we use three kinds of edges: the predicted contour Θp, the intensity edge map ΘI and the motion edge map Θm. If the edge maps are obtained independently, the overall probability density for a pixel to be a contour point is the product of the densities obtained with the individual edge types. The priority for pixel removal from the band C then becomes a weighted sum of the squared distance transforms of the individual edge maps:

f(x) = αp D(x, Θp)² + αI D(x, ΘI)² + αm D(x, Θm)²    (3)

where αp + αI + αm = 1 and αt ∝ 1/σt². The watershed line obtained lies in between the three edge maps. Setting one of the weighting coefficients to a high value attracts the watershed line to the edges of the corresponding type. In particular, with a high value of αp the tracker tends to keep the predicted contour and produces smooth changes of the tracking results, but this also prevents tracking the non-rigid motion of the contour.

So far, we have developed a general framework for tracking the object contour using several kinds of edges as input data. Clearly, the performance of the tracking algorithm depends on the accuracy of the detection of these edges. These issues are considered in the next section.
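A minimal sketch of equation (3), under the same assumptions as the earlier watershed sketch (binary edge maps, SciPy's Euclidean distance transform); the example weights are those used for the Bike sequence in Section 4.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def combined_priority(theta_p, theta_I, theta_m, alpha=(0.45, 0.5, 0.05)):
    """Priority f(x) of equation (3): a weighted sum of squared distance
    transforms of the predicted contour, intensity edges and motion edges."""
    f = np.zeros(theta_p.shape, dtype=float)
    for theta, a in zip((theta_p, theta_I, theta_m), alpha):
        D = distance_transform_edt(~theta)   # D(x, Theta_t)
        f += a * D ** 2
    return f

# The watershed of Section 2.1 is then run on -f instead of -D, e.g.
# labels = watershed(-combined_priority(theta_p, theta_I, theta_m), markers)
```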
3. Detection of edges
3.1. Object motion estimation and contour prediction
To obtain Θp, we first estimate the object motion between two frames It−1 and It using a parametric motion model, and then warp the previously estimated contour Γt−1 to the current frame It. Depending on the specific application, various kinds of motion model can be used: translation, affine or quadratic. In many situations the translation model is adequate, since the predicted contour will be refined further, which corrects the inaccuracy of the motion estimation. The dominant translation vector vp is estimated by minimizing the motion-compensated prediction error:

vp = arg min_{v ∈ Ω} Σ_{x ∈ Rt−1} [It−1(x) − It(x + v)]²    (4)

where Rt−1 is the image region occupied by the object at time t − 1 and Ω is the velocity space. The contour Θp is then obtained by translating Γt−1 over vp. In practice, we can restrict the two components vx and vy of v to integer values, so vp can be found by an exhaustive search over the rectangle Ω = [−vxmax, vxmax] × [−vymax, vymax], where vxmax and vymax are set according to the maximum expected speed of the object. Since the exhaustive search finds the global minimum of (4), the object motion is estimated robustly even in the case of fast translations.
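A direct transcription of the exhaustive search of equation (4); the function and argument names are ours, and the error is averaged over the pixels that stay inside the frame, a minor deviation from the plain sum so that displacements pushing part of the region out of view compare fairly.

```python
import numpy as np

def dominant_translation(I_prev, I_curr, region_mask, v_max=7):
    """Estimate the dominant translation v_p of equation (4).

    I_prev, I_curr : grayscale frames I_{t-1} and I_t (float arrays)
    region_mask    : boolean mask of the object region R_{t-1}
    v_max          : search range; vx, vy are integers in [-v_max, v_max]
    """
    ys, xs = np.nonzero(region_mask)
    h, w = I_curr.shape
    best_err, best_v = np.inf, (0, 0)
    for vy in range(-v_max, v_max + 1):
        for vx in range(-v_max, v_max + 1):
            yy, xx = ys + vy, xs + vx
            ok = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
            if not ok.any():
                continue
            diff = I_prev[ys[ok], xs[ok]] - I_curr[yy[ok], xx[ok]]
            err = np.mean(diff ** 2)
            if err < best_err:
                best_err, best_v = err, (vx, vy)
    return best_v
```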
3.2. Intensity edge detection and background edge removal
We employ the Canny edge detector to detect the intensity edges. The derivatives of It are computed by convolving it with the derivatives of a Gaussian kernel of scale σc. An edge point is then defined as a pixel whose gradient magnitude attains a local maximum along the gradient direction. We keep only the significant edges, eliminating those whose gradient magnitude is lower than a given threshold Tc.
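As a rough stand-in for this detector, one can call scikit-image's Canny implementation on a float grayscale frame; setting both hysteresis thresholds to Tc approximates the single gradient-magnitude threshold described here.

```python
from skimage.feature import canny

def intensity_edges(frame, sigma_c=3.0, T_c=1.0):
    """Boolean intensity edge map Theta_I (approximation of Section 3.2)."""
    # Gaussian-derivative gradients, non-maximum suppression and a single
    # magnitude threshold (low == high disables the hysteresis).
    return canny(frame, sigma=sigma_c, low_threshold=T_c, high_threshold=T_c)
```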
The contour detection may be affected by irrelevant edges, especially those in the background which were previously occluded. When the object motion is approximately known, we can detect and remove the background edges from the edge map, which greatly reduces the negative impact of image clutter. Let ΘI^(t−1) be the edge map detected in frame It−1. We warp ΘI^(t−1) into frame It by translating it over the motion vector vp obtained in Section 3.1. The resulting edge map, denoted ΘI^w, coincides with ΘI at the foreground edges. At background edges, which have a different motion, ΘI^w and ΘI differ. In the ideal case, the background edges could be removed from ΘI by the intersection operation ΘI ∩ ΘI^w. In practice we proceed differently: we first compute the distance transform of ΘI^w and then remove those pixels of ΘI for which this distance exceeds a threshold Td. The idea is illustrated in Figure 2 and the result for real data is shown in Figure 3. Note, however, that isolated points where ΘI^w and ΘI coincidentally intersect, as well as edges parallel to the motion direction, will not be removed; these have little influence on the tracking performance.

Figure 2. Removal of background edges. The circle is the object contour and the diamond is a background edge.
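A sketch of this background-edge removal, again with illustrative names; the previous edge map is warped by the integer translation vp with scipy.ndimage.shift, and edges of ΘI lying farther than Td from any warped edge are discarded.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, shift

def remove_background_edges(theta_I, theta_I_prev, v_p, T_d=2):
    """Keep only the edges of theta_I consistent with the object motion.

    theta_I      : boolean edge map of frame I_t
    theta_I_prev : boolean edge map Theta_I^(t-1) of frame I_{t-1}
    v_p          : dominant translation (vx, vy) from Section 3.1
    T_d          : distance threshold
    """
    vx, vy = v_p
    # Warp the previous edge map by translating it over v_p (nearest neighbour).
    theta_I_w = shift(theta_I_prev.astype(np.uint8), (vy, vx), order=0) > 0
    # Distance from every pixel to the nearest warped edge pixel.
    dist = distance_transform_edt(~theta_I_w)
    # Background edges moved differently and therefore lie far from the warped edges.
    return theta_I & (dist <= T_d)
```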
3.3. Motion edge detection
To compute motion edges, the optic flow field must be computed first. We employ the method of Lucas and Kanade [7]. Let v(x, y) = (u(x, y), v(x, y)) be the vector field obtained. The detection of optic flow edges is similar to that of color edges: both can be treated as edge detection in a multivalued image, a technique which has been applied to color images in [12] and is summarized below. At every pixel (x, y), the rate of change of v(x, y) in the direction (dx, dy) is measured by the squared norm of its differential [12]:

|dv|² = (dx dy) [ gxx gxy ; gxy gyy ] (dx dy)^T    (5)

where gij = ∂v/∂i · ∂v/∂j, i, j ∈ {x, y}. The two eigenvectors of the matrix [gij] indicate the directions in which the change of v(x, y) is maximal and minimal, respectively; the change rates are given by the corresponding eigenvalues. The maximal eigenvalue λ+ can be considered as the gradient of the multivalued image v(x, y):

λ+ = ½ [gxx + gyy + Δ]    (6)

where Δ = √((gxx − gyy)² + 4gxy²). The direction of maximal change is given by:

tan θ+ = 2gxy / (Δ + gxx − gyy)    (7)
A pixel is then considered an edge point if λ+ attains a local maximum along the direction θ+. Furthermore, only edge points with λ+ exceeding a threshold Tm are retained. Since the motion edges obtained usually have low accuracy, it is advantageous to carry out the above process at a low resolution level to save computation time; the edges obtained are then simply interpolated to the full resolution. An example is shown in Figure 4.

Figure 4. Detected motion edges (marked by white lines).
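For illustration, the eigenvalue computation of equations (5)-(7) can be written directly on a dense flow field (u, v); this sketch uses simple central differences via numpy.gradient and assumes the flow has already been estimated, e.g. with a Lucas-Kanade implementation.

```python
import numpy as np

def motion_edge_strength(u, v):
    """Largest eigenvalue lambda_+ and its direction theta_+ of the matrix
    [g_ij] built from the flow field (u, v), equations (5)-(7)."""
    uy, ux = np.gradient(u)          # du/dy, du/dx
    vy, vx = np.gradient(v)          # dv/dy, dv/dx
    gxx = ux ** 2 + vx ** 2          # dv/dx . dv/dx
    gyy = uy ** 2 + vy ** 2          # dv/dy . dv/dy
    gxy = ux * uy + vx * vy          # dv/dx . dv/dy
    delta = np.sqrt((gxx - gyy) ** 2 + 4 * gxy ** 2)
    lam_plus = 0.5 * (gxx + gyy + delta)
    theta_plus = np.arctan2(2 * gxy, delta + gxx - gyy)
    # Motion edges: local maxima of lam_plus along theta_plus with lam_plus > T_m.
    return lam_plus, theta_plus
```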
4. Experiments
In Figures 5, 6 and 7, tracking results are shown for the well-known test video sequences Bike, Son and Table Tennis, respectively. The following parameters of the algorithm need to be set: the size r of the contour thickening; the coefficients αp, αI, αm for the combination of the distance transforms; the parameters vxmax, vymax of the object motion estimation; the parameters σc, Tc for intensity edge detection and Td for background edge removal; and the threshold Tm for motion edge detection.
Figure 3. Detected intensity edges: a) the original frame; b) result of the Canny detector, σc = 5, Tc = 1.0; c) after the removal of background edges.
Figure 5. Tracking results for the Bike sequence: a) frame 0; b) frame 20; c) frame 80. αp = 0.45, αI = 0.5, αm = 0.05, r = 4 pixels, vxmax = vymax = 7.
Figure 6. Tracking results for the Son sequence: a) frame 0; b) frame 2; c) frame 15. Parameters as in Figure 5.
Figure 7. Tracking fast motion: results of tracking the tennis ball (marked by black lines) in the Table Tennis sequence: a) frame 79; b) frame 80; c) frame 85. αp = 0.5, αI = 0.5, αm = 0, r = 2, vxmax = vymax = 35.
For all sequences we have used σc = 3.0, Tc = 1.0, Td = 2 and Tm = 0.001. The other parameters are indicated in the figure captions. The motion edges are computed at level 2 of the Gaussian pyramid [4], in which both image dimensions are reduced by a factor of 4. Note that since the motion edges usually have lower accuracy, αm takes much smaller values than αp and αI. In the experiments, the initial contour at frame 0 was obtained by performing a color segmentation of the scene [5], followed by merging regions undergoing similar motion [8].

The robustness of the algorithm to image clutter can be observed from the results. For example, in the Bike sequence, while tracking the bike from frame 0 to frame 20, the algorithm successfully passed over the strong edge in the background between the light sky and the dark tree. In the Son sequence, although a part of the initial contour got stuck to a strong background edge segment, it recovered two frames later. In the Table Tennis sequence, the algorithm was applied to track the ball over 80 frames; we lost the ball only once, when the player grabbed it. Motion edges were not used in this sequence, i.e. αm = 0, as the object is uniform inside. At some moments, the displacement of the ball between two frames was as large as 30 pixels. The algorithm could nevertheless estimate such fast motion successfully, thanks to the exhaustive search. Demo video clips are available at http://carol.wins.uva.nl/~tat/demo/track.html.
5. Conclusion and future work
In this paper we introduced a new model-free method for object tracking. It uses the watershed algorithm to find an optimal separation between the object and its background, given the contour predicted from the previous iteration and the intensity and motion edges observed in the current frame. As no model is used to describe the contour, objects of arbitrary shape can be tracked, and as no smoothness is imposed, corners are tracked accurately as well. The removal of background edges and the integration of multiple features make the method robust against image clutter. Constraining the solution to a band around the predicted contour prevents dramatic changes in object shape between frames. Furthermore, it follows from the watershed algorithm that the resulting contour is never self-intersecting. These are great advantages over common snake-based methods, in which complicated internal energy terms composed of first- and second-order contour properties are incorporated to achieve the same. Formalizing the relation between snakes and our approach is a topic of further research.
Acknowledgements The authors thank the anonymous reviewers for their valuable comments. We also would like to thank Dr. R. Van den Boomgaard for his helpful discussions. This research is supported by the Netherlands Organization for Scientific Research (NWO).
References
[1] S. Beucher and F. Meyer. The morphological approach to segmentation: the watershed transformation. In E. Dougherty, editor, Mathematical Morphology in Image Processing, chapter 12. Marcel Dekker, New York, 1992.
[2] A. Blake, R. Curwen, and A. Zisserman. A framework for spatio-temporal control in the tracking of visual contours. Int. J. Computer Vision, 11(2):127–145, 1993.
[3] G. Borgefors. Distance transformations in digital images. Comp. Vision, Graph. and Image Proc., 34:344–371, 1986.
[4] P. Burt and E. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. on Comm., 31:532–540, 1983.
[5] C. Gu. Multivalued morphology and segmentation-based coding. PhD thesis, Ecole polytechnique federale de Lausanne, 1995.
[6] M. Isard and A. Blake. CONDENSATION - conditional density propagation for visual tracking. Int. J. Computer Vision, 29(1):5–28, 1998.
[7] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. DARPA Image Understanding Workshop, pages 121–130, 1981.
[8] H. Nguyen, M. Worring, and A. Dev. Motion statistics based region merging in video sequences. In Proc. of the 6th IEEE Int. Conf. on Multimedia Sys., volume 1, pages 762–766, Florence, Italy, 1999.
[9] N. Paragios and R. Deriche. A PDE-based level set approach for detection and tracking of moving objects. In Proc. Inter. Conf. on Computer Vision, 1998.
[10] N. Peterfreund. Robust tracking of position and velocity with Kalman snakes. IEEE Trans. on PAMI, 21(6):564–569, 1999.
[11] D. Terzopoulos and R. Szeliski. Tracking with Kalman snakes. In A. Blake and A. Yuille, editors, Active Vision, pages 3–20. MIT Press, Cambridge, 1992.
[12] S. Di Zenzo. Gradient of multi-images. Comp. Vision, Graph. and Image Proc., 33:116–125, 1986.