2010 International Conference on Pattern Recognition. DOI 10.1109/ICPR.2010.1112
Combining Foreground/Background Feature Points and Anisotropic Mean Shift for Enhanced Visual Object Tracking
Sebastian Haner (A,B), Irene Yu-Hua Gu (A)
(A) Dept. of Signals and Systems, Chalmers University of Technology, Sweden
(B) Center for Mathematical Sciences, Lund University, Sweden
Email: [email protected], [email protected]
Abstract
This paper proposes a novel visual object tracking scheme that exploits both local point feature correspondences and global object appearance through an anisotropic mean shift tracker. Using a RANSAC cost function incorporating the mean shift motion estimate, motion smoothness and model complexity terms, an optimal feature point set for motion estimation is found even when a high proportion of outliers is present. The tracker dynamically maintains sets of both foreground and background features, the latter providing information on object occlusions. The mean shift motion estimate is further used to guide the inclusion of new point features in the object model. Our experiments on videos containing long-term partial occlusions, object intersections and cluttered backgrounds with colors close to the object's have shown more stable and robust tracking performance in comparison to three existing methods.
Keywords: visual object tracking, video surveillance, mean shift, SIFT, SURF, RANSAC, dynamic maintenance
I. Introduction
Robust tracking of moving objects in video with complex scenarios, especially when sequences contain multiple objects with occlusions, intersections and background clutter, is a challenging task. Among the many appearance-based techniques developed, mean shift (MS) has drawn much attention. Many variants have been introduced since the pioneering MS work in [1], e.g. level-set asymmetric kernel MS [2] and anisotropic MS tracking [3]. While MS tracking performs relatively well, it may drift or fail when object occlusion or intersection is present, or when background colors are similar to those of the foreground object. On the other hand, point feature correspondences from SIFT [4] or SURF [5] have been found effective in relating salient
features between images. Since individual point features are sensitive to noise and image variations, a set of correspondences under a given motion model is usually considered. To remove false feature matches, RANdom SAmple Consensus (RANSAC) [6] is a standard approach: using RANSAC, one may choose a subset of feature point correspondences satisfying the model. To tackle the singularity problem in RANSAC, which occurs when the number of consensus points drops below the minimum required to estimate the model parameters, [7] proposes a variant, "RAMOSAC", that uses a set of models ranging over projective, similarity, affine and translational transformations. Recently, [8] proposed an EM algorithm that integrates SIFT features and MS, with good performance reported. The method, however, does not track scale or orientation. In [9], affine structure was estimated from SIFT point correspondences to achieve box size robustness, while the object box center was estimated from MS. To further improve on this, [10] proposed an integrated tracking scheme combining local feature correspondences and global appearance from isotropic MS with a fully tunable bounding box of 5 degrees of freedom (DoF). The method, however, cannot predict occlusions from outside the bounding box, and it is a two-step sequential process with little interaction between the local and global schemes. Motivated by the above, we propose a new tracking scheme that employs local feature point correspondences from both the foreground and the surrounding background, as well as global object appearance via anisotropic mean shift with a fully parameterized bounding box. Compared with [7], [10], the proposed method introduces several novelties (summarized in the abstract) aimed at significantly enhancing tracking performance. Experimental results and comparisons on videos containing complex scenarios show that the proposed tracker is more stable and robust than several existing algorithms.
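The consensus idea behind RANSAC, as used here to remove false feature matches, can be illustrated with a minimal, self-contained sketch. This is our own Python with a translation-only motion model and synthetic data (function and variable names are ours, not from [6] or [7]):

```python
import numpy as np

def ransac_translation(src, dst, n_iter=200, thresh=2.0, seed=0):
    """Minimal RANSAC: hypothesise a 2-D translation from one random
    correspondence, keep the hypothesis with the largest consensus
    (inlier) set, then refit on those inliers.  Illustrative only."""
    rng = np.random.default_rng(seed)
    best_t, best_in = None, np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        i = rng.integers(len(src))           # one sample fixes a translation
        t = dst[i] - src[i]                  # hypothesised model
        inliers = np.linalg.norm(src + t - dst, axis=1) < thresh
        if inliers.sum() > best_in.sum():
            best_t, best_in = t, inliers
    best_t = (dst[best_in] - src[best_in]).mean(axis=0)  # refit on consensus
    return best_t, best_in

# synthetic correspondences: true shift (5, -3), first 15 matches corrupted
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, (50, 2))
dst = src + np.array([5.0, -3.0]) + rng.normal(0, 0.3, (50, 2))
dst[:15] = rng.uniform(0, 100, (15, 2))
t_est, inl = ransac_translation(src, dst)
```

A translation needs only one correspondence per hypothesis; richer models (similarity, affine, projective), as in RAMOSAC, require more samples per draw but follow the same hypothesise-and-score pattern.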
II. Tracking Based on Anisotropic Mean Shift and Point Feature Correspondences
This section briefly reviews two tracking methods related to our work: one based on 5 DoF anisotropic MS tracking, the other based on local feature point correspondences.
Anisotropic mean shift-based tracking: In [10], the target is described by a rectangular bounding box parameterized by center location, width, height and orientation, (y_1, y_2, W, H, θ). The Bhattacharyya coefficient
ρ(p, q) = Σ_{u=1..m} sqrt( p_u(y, Σ) q_u )
is used to measure the similarity between a candidate and a reference object area, where p_u(y, Σ) and q_u are the pdf estimates for a candidate object at time t and for the reference object,
p_u = (c / |Σ|^{1/2}) Σ_i k( (y − x_i)^T Σ^{−1} (y − x_i) ) δ[b(x_i) − u]
and m is the total number of histogram bins, Σ is the kernel bandwidth matrix, b(x_i) is the bin index assigned to the pixel at x_i, i is summed over pixels under the kernel (i.e. the selected region), c is a normalization constant, u is the bin index, k is the kernel profile, and y the kernel center. ρ(p, q) is maximized iteratively for a 5 DoF anisotropic mean shift, where the center of the kernel (or bounding box), ŷ, and the γ-normalized bandwidth matrix Σ̂ of the kernel are estimated by [10]
ŷ = [ Σ_{i=1..n} x_i w_i g( (y − x_i)^T Σ^{−1} (y − x_i) ) ] / [ Σ_{i=1..n} w_i g( (y − x_i)^T Σ^{−1} (y − x_i) ) ]
Σ̂ = (2 / (1 − γ)) [ Σ_{i=1..n} w_i g( ỹ_i^T Σ^{−1} ỹ_i ) ỹ_i ỹ_i^T ] / [ Σ_{i=1..n} w_i k( ỹ_i^T Σ^{−1} ỹ_i ) ]
where w_i = Σ_{u=1..m} δ(b(x_i) − u) sqrt( q_u / p_u(y, Σ) ), g(x) = −k′(x), ỹ_i = (ŷ − x_i), and γ is empirically determined. Once Σ is estimated, the eigen-decomposition Σ = V Λ V^{−1} is applied, where V = [v_11 v_12; v_21 v_22] and Λ = diag(λ_1, λ_2). If the orientation of the region is defined as the angle θ between the long axis of the bandwidth matrix and the horizontal axis, and the height H and width W of the region as twice the kernel bandwidth along the long and short axes, it follows that Σ, θ, W and H are related by Σ = R^T(θ) diag( (H/2)², (W/2)² ) R(θ), where R(θ) = [cos θ sin θ; −sin θ cos θ]. The orientation, height and width of the object are estimated by θ̂ = tan^{−1}(v_21 / v_11), Ĥ = 2 sqrt(λ_1), Ŵ = 2 sqrt(λ_2), where v_11 and v_21 are the components of the eigenvector associated with the largest eigenvalue.
Feature point correspondence-based tracking: Tracking using local point features has been found useful, especially in cases of partial occlusion. In [10], [7], a set of maximum-consensus feature point correspondences that are invariant to affine or a set of transformations is used for tracking. Scale-invariant feature correspondences are computed by SIFT [4] or SURF [5], followed by finding the maximum number of corresponding feature points that agree with a selected type of transformation using RANSAC [6]. The object bounding box can then be inferred. For successful tracking, dynamic maintenance of point features has been found to be very important. In [7], new points that agree with the selected transformation are added, existing points proven unstable or irrelevant are pruned, and the updating process is frozen when partial object occlusion is suspected.
III. The Proposed Method
This section proposes a novel tracking scheme aimed at increasing tracking robustness, especially when objects are partially occluded or when objects and background have similar colors. The basic idea is to exploit both local point features and global appearance, and to select the final tracking result by minimizing a given cost function. To minimize the impact of occlusions, the dynamic updating of point features in [7] is improved upon by using occlusion information extracted from the areas surrounding the tracked object. The main novelties of the proposed method are:
• The kernel of the MS tracker is tuned to tightly fit the object of interest, in order to aid the selection of new foreground feature points.
• A RANSAC cost function, based in part on the transformation distance of consensus point correspondences, the complexity of the selected transformation, and the MS tracking result, is minimized to obtain the final tracking result.
• Two sets of feature points are extracted and dynamically maintained, one for the foreground object, another for the surrounding background. The latter is used to prevent known background features from being introduced into the object model.
The method is detailed as follows.
The cost function: we propose the following cost function for extracting consensus feature points:
s = Σ_{i∈C} ||x_i − T(x′_i)||² + (|P| − |C|) η² + ε D_m + (β/4) Σ_{c=1..4} ||u_c − v_c||² + (λ/4) Σ_{i=1..4} ||v_i − T(v_i)||²   (1)
where x_i and x′_i are the matched points in the current frame t and the previous frame (t − 1), T is the selected transformation (or model) between the corresponding points in the two frames, with D_m degrees of freedom, v_i the vertices of the tracker bounding box at (t − 1), |P| and |C| the numbers of matched feature points and matched inlier points, respectively, η the inlier/outlier classification threshold, and ε, λ and β are weights determining the relative contributions of each term, and
are determined empirically. The u_c are the vertices of a rectangle with the same dimensions and orientation as the MS tracker estimate. Eq. (1) combines the contributions of the point correspondences, the transformation complexity, the mean shift result and the hypothesized object motion: the first two terms form an ordinary RANSAC cost function, the third term penalizes more complex transformations, the fourth term is the distance between the four corners of the rectangle derived from the anisotropic mean shift and the corners of the hypothesized bounding box, and the fifth term is the distance between the corners of the hypothesized bounding box at t and the bounding box at (t − 1).
The mean shift tracker: The MS tracker is used both as an input to the RANSAC process and for selecting new point features; only those falling within the intersection of the MS ellipse and the tracked bounding box are considered for inclusion in the object model. To minimize the risk of including background features, the initial MS ellipse is set smaller than the bounding box (scaled down by a factor φ = 0.7 in our tests). The parameter γ is tuned to ensure a tight fit to the target. The Bhattacharyya coefficient is computed for the MS tracked area as described in Section II. If the coefficient is low, the MS result is considered unreliable, and the MS tracker is reinitialized at the final tracker box location at time t.
Extracting foreground and background feature points: Feature points are extracted from a region surrounding the location of the object in the previous frame (black rectangles in Fig. 1). The features are matched against both the foreground object features and the background feature set, and the motion is found via the RANSAC procedure.
Dynamically adding new feature points: RANSAC splits the new features into three sets, {I, O, U}, where I contains the inlier points, O the outlier points and U the unmatched points.
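To make the role of each term in (1) concrete, the following is a small hypothetical sketch (our own Python, not the authors' code) of scoring two candidate transformations. The fourth term follows the textual description above, comparing the mean shift corners with the hypothesized box T(v); all helper names are ours:

```python
import numpy as np

def cost(T, x, x_prev, eta, eps, beta, lam, D_m, u_ms, v_prev):
    """Cost (1) for one hypothesised transformation T mapping frame
    (t-1) points to frame t.  Returns the cost and the consensus set."""
    r = np.linalg.norm(x - T(x_prev), axis=1)      # residuals of all matches
    C = r < eta                                    # consensus (inlier) set
    v = T(v_prev)                                  # hypothesised box corners at t
    s = (np.sum(r[C] ** 2)                         # transformation distance
         + (len(x) - C.sum()) * eta ** 2           # non-consensus penalty
         + eps * D_m                               # model-complexity penalty
         + beta / 4 * np.sum(np.linalg.norm(u_ms - v, axis=1) ** 2)    # vs. MS box
         + lam / 4 * np.sum(np.linalg.norm(v_prev - v, axis=1) ** 2))  # smoothness
    return s, C

# two candidate translations (D_m = 2): identity vs. the true shift (2, 1)
x_prev = np.array([[0., 0.], [4., 0.], [0., 4.], [9., 9.]])
x = x_prev + np.array([2.0, 1.0])
x[3] = [0.0, 0.0]                                  # one corrupted match
v_prev = np.array([[0., 0.], [2., 0.], [2., 2.], [0., 2.]])
T1 = lambda p: p
T2 = lambda p: p + np.array([2.0, 1.0])
u_ms = T2(v_prev)                                  # mean shift agrees with T2 here
s1, C1 = cost(T1, x, x_prev, 1.0, 7.0, -0.1, -0.5, 2, u_ms, v_prev)
s2, C2 = cost(T2, x, x_prev, 1.0, 7.0, -0.1, -0.5, 2, u_ms, v_prev)
```

In this toy setup T2 attains the lower cost, since it explains three of the four matches and agrees with the mean shift estimate.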
If the Bhattacharyya coefficient of the mean shift tracked area is high and the number of inlier points is not too small (|I| ≥ κ), features in the unmatched set U are then dynamically added to the foreground set F or the background set B in the following manner: a) New feature points in U are added to the foreground point set F, if they fall within the intersection of the tracker bounding box and the mean shift ellipse and do not match any of the features in B. Allowing only features from the intersection decreases the risk of erroneously including background features in the object description. b) New feature points in U that were not added to F are added to the background set B. Dynamically pruning feature points: A pruning process is performed for both the background point set
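Steps (a)-(b) can be sketched as follows (hypothetical Python; the three predicate arguments stand in for the real ellipse/box geometry tests and the descriptor matching against B):

```python
def add_new_features(U, F, B, in_ms_ellipse, in_track_box, matches_background):
    """Distribute unmatched features U between the foreground set F and
    the background set B, following rules (a)-(b) above."""
    for f in U:
        in_intersection = in_ms_ellipse(f) and in_track_box(f)
        if in_intersection and not matches_background(f, B):
            F.append(f)   # (a) safe to treat as foreground
        else:
            B.append(f)   # (b) everything else becomes background
    return F, B

# toy demo: unit-circle "ellipse", 4x4 "box", exact-match "descriptor" test
F, B = [], [(0.5, 0.0)]
in_ellipse = lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0
in_box = lambda p: abs(p[0]) <= 2 and abs(p[1]) <= 2
match_bg = lambda p, bg: p in bg
F, B = add_new_features([(0.0, 0.0), (3.0, 3.0), (0.5, 0.0)],
                        F, B, in_ellipse, in_box, match_bg)
```

Only the point inside the ellipse/box intersection that does not match a known background feature enters F; the rest go to B.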
B and the foreground feature point set F, in order to maintain sets of reasonable size. A score s_i is assigned to each new feature and is updated at every time step: s_i^t = s_i^{t−1} + 2 for matched inlier points in I, s_i^t = s_i^{t−1} − 1 for matched outlier points in O, and no change for unmatched points in U. For the foreground point set F, pruning is performed as follows: if the total number of points in F exceeds a predefined threshold m_f, the points with the lowest scores are removed. For the background point set B, points are sorted according to their time (or frame) history: if the total number of points in B exceeds a predefined threshold m_b, the oldest points are removed.
Updating SURF feature descriptors: Feature point descriptors in the foreground object set F are also updated when new observations become available. When an observed point with descriptor d_obs is matched and later classified as an inlier, the corresponding point descriptor d_i in F is updated by d_i = (α d_obs + (1 − α) d_i) / ||α d_obs + (1 − α) d_i|| (α = 0.5 was used in our tests). The idea is to allow features which change appearance over time to be tracked longer.
Integration of the two trackers: The mean shift and point feature trackers are integrated into the proposed scheme according to (1). Table I summarizes the algorithm.
Table I. Summary of the proposed tracking scheme.
1. Specify a target object bounding box in the 1st frame (t = 1).
2. Approximate the box by an ellipse, shrunk by φ.
3. Compute q̂ from the ellipse area, initialize the MS tracker.
4. Extract feature points from a region surrounding the object; add points within the boundary to F, and the rest to B.
5. For subsequent frames t = 2, 3, ..., n do:
  5.1 Run the anisotropic MS tracker, starting from its (t−1) position.
  5.2 Extract local features from a region surrounding the tracking window, and match them to the object database.
  5.3 Estimate the object transformation T for the feature points using (1). Classify the extracted points into 3 sets: I, O and U.
  5.4 Calculate ρ[p̂, q̂] at the current MS tracker state.
  5.5 If ρ > τ1 and |I| ≥ κ, then:
    a) Match all points in U to the background point set B.
    b) Add unmatched points in U that fall within the intersection of the MS ellipse and the tracker boundary to F.
    c) Add the remainder of U to B.
  5.6 Else if ρ < τ2, go to Step 2 to reinitialize the ellipse, using the current tracker boundary.
  5.7 Update scores s_i for points in F.
  5.8 Update inlier descriptors d_i.
  5.9 If |F| > m_f, remove the (|F| − m_f) lowest-scoring points from F.
  5.10 If |B| > m_b, remove the (|B| − m_b) oldest points from B.
END For
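The score bookkeeping, foreground pruning and descriptor update rules (steps 5.7-5.9) can be sketched as follows (hypothetical Python helpers; names are ours):

```python
import numpy as np

def update_scores(scores, inlier_ids, outlier_ids):
    """+2 for matched inliers, -1 for matched outliers; unmatched
    points keep their score (the per-frame rule described above)."""
    for i in inlier_ids:
        scores[i] += 2
    for i in outlier_ids:
        scores[i] -= 1
    return scores

def prune_foreground(features, scores, m_f):
    """Keep only the m_f highest-scoring foreground features,
    preserving their original order."""
    if len(features) <= m_f:
        return features
    keep = sorted(sorted(range(len(features)),
                         key=lambda i: scores[i], reverse=True)[:m_f])
    return [features[i] for i in keep]

def update_descriptor(d_i, d_obs, alpha=0.5):
    """Blend a stored descriptor with a new inlier observation and
    renormalise (SURF descriptors are unit-length vectors)."""
    d = alpha * d_obs + (1 - alpha) * d_i
    return d / np.linalg.norm(d)
```

The renormalisation in `update_descriptor` mirrors the formula above: the blended vector is divided by its own norm so the result stays a valid unit descriptor.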
IV. Experiments and Results
The proposed scheme was tested on several videos captured by a dynamic camera that contain complex scenarios, e.g. intersections, long-term partial occlusion of objects, complex background clutter easily confused with the objects, and objects with changing velocity. For all test videos, the following parameters were used: φ = 0.7, τ1 = 0.8, τ2 = 0.4, κ = 3, m_f = m_b = 1000, ε = 7, α = 0.5, β = −0.1, λ ∈ [−0.6, −0.4].
Results: As an example, Fig. 1 shows the tracking results from the 'street' sequence using the proposed method. This video is considered rather difficult for tracking, partly because of the highly textured background, the dynamic camera, and background colors close to those of the object. Many trackers discussed in Section II fail quickly on this sequence.
Figure 1. Tracking results from the 'street' sequence using the proposed scheme. From left to right, top to bottom: frames 10, 80, 140, 200, 230, 260, 320, 380 and 440. The parameters used were λ = −0.6, β = −0.1, ε = −7. Black rectangles: the search area (note: the area between the black and blue rectangles belongs to the background); blue rectangles: final tracking result; green ellipses: from the mean shift tracker.
Comparisons: To further evaluate the method, comparisons were made to three other methods: tracker-1, from [7] (denoted RAMOSAC); tracker-2, from [10] (denoted SIFT + mean shift); and tracker-3 (mean shift only). Fig. 2 shows the results for the video 'steps'.
Figure 2. Tracking results (some critical moments) from the 'steps' sequence. Left to right, top to bottom: frames 5, 20, 35, 45, 55 and 60. Tracker-1 (red), tracker-2 (cyan), tracker-3 (magenta) and the proposed tracker (blue); green is the MS tracker ellipse.
Observing the results, one can see that trackers 1-3 fail and start tracking the people in the background, while the proposed algorithm, though temporarily lost for a few frames, recaptures and successfully continues tracking the target throughout the sequence. To further compare the tracking performance, the tracking errors (defined as the distance between the 4 corners of a ground-truth box and the tracked box) as a function of frame number, obtained from two video sequences, are shown in Fig. 3 for the proposed tracker and the 3 existing trackers.
Figure 3. Tracking errors (normalised distance vs. frame number) for the proposed tracker and the 3 existing trackers (RAMOSAC, mean shift, SIFT + mean shift), on the videos 'steps' (top) and 'street' (bottom).
These results clearly show the enhanced performance of the proposed scheme. However, disadvantages exist; for example, some parameters (e.g. in (1)) need to be tuned empirically.
V. Conclusion
By incorporating MS into the RANSAC cost function, using background features to detect partial occlusions, using the intersection between the MS tracker ellipse and the tracker boundary for feature updates, and dynamically maintaining two sets of feature points, the proposed scheme demonstrates in our experiments increased robustness, especially during intersections of similar objects, during partial object occlusions, and when tracking over a cluttered background with confusing colors. Comparisons with three existing trackers show a marked improvement. Future work includes further tests to determine suitable parameter ranges and tests on more image sequences.
References
[1] D. Comaniciu, V. Ramesh, P. Meer, "Real-time tracking of non-rigid objects using mean shift", Proc. IEEE Conf. CVPR, vol. 2, pp. 142-149, 2000.
[2] A. Yilmaz, "Object tracking by asymmetric kernel mean shift with automatic scale and orientation selection", Proc. IEEE Conf. CVPR, 2007.
[3] S. Qi, X. Huang, "Hand tracking and gesture recognition by anisotropic kernel mean shift", Proc. IEEE Conf. NNSP, 2008.
[4] D. Lowe, "Distinctive image features from scale-invariant keypoints", Int. Journal of Computer Vision, vol. 60, pp. 91-110, 2004.
[5] H. Bay, T. Tuytelaars, L. Van Gool, "SURF: speeded up robust features", Proc. ECCV, LNCS vol. 3951, pp. 404-417, 2006.
[6] M. Fischler, R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography", Comm. of the ACM, vol. 24, pp. 381-395, 1981.
[7] P. Strandmark, I.Y.H. Gu, "Joint random sample consensus and multiple motion models for robust video tracking", LNCS, vol. 5575, pp. 450-459, 2009.
[8] H. Zhou, Y. Yuan, C. Shi, "Object tracking using SIFT features and mean shift", Int. Journal of Computer Vision, 2008.
[9] C. Zhao, A. Knight, I. Reid, "Target tracking using mean-shift and affine structure", Proc. Int'l Conf. Pattern Recognition, 2008.
[10] Z. Khan, I.Y.H. Gu, T. Wang, A. Backhouse, "Joint anisotropic mean shift and consensus point feature correspondences for object tracking in video", Proc. IEEE Conf. ICME, 2009.