
Learning Adaptive Metric for Robust Visual Tracking

Nan Jiang, Wenyu Liu, Member, IEEE, Ying Wu, Senior Member, IEEE

Abstract—Matching the visual appearances of the target over consecutive image frames is the most critical issue in video-based object tracking. Choosing an appropriate distance metric for matching determines its accuracy and robustness, and thus significantly influences the tracking performance. Most existing tracking methods employ fixed pre-specified distance metrics. However, this simple treatment is problematic and limited in practice, because a pre-specified metric is unlikely to guarantee that the closest match is the true target of interest. This paper presents a new tracking approach that incorporates adaptive metric learning into the framework of visual object tracking. Collecting a set of supervised training samples on-the-fly in the observed video, this new approach automatically learns the optimal distance metric for more accurate matching. The design of the learned metric ensures that the closest match is very likely to be the true target of interest based on the supervised training. Such a learned metric is discriminative and adaptive. This paper substantiates this new approach in a solid case study of adaptive-metric differential tracking, and obtains a closed-form analytical solution to motion estimation and visual tracking. Moreover, this paper extends the basic linear distance metric learning method to a more powerful nonlinear kernel metric learning method. Extensive experiments validate the effectiveness of the proposed approach, and demonstrate the improved performance of the proposed new tracking method.

Index Terms—Visual tracking, metric learning, supervised, adaptive, discriminative.

I. INTRODUCTION

The most critical issue in visual target tracking is to match the visual appearances of the target over consecutive image frames. This process can be generally formulated as:

min_{∆x} D( f(x, t), f(x + ∆x, t + ∆t) )   (1)

where f(x, t) is the visual feature of the target at location x at time t, and D(·, ·) is the distance metric in the feature space. Similar to many other computer vision problems such as object recognition, this process is largely stipulated by the interplay of two factors: the visual features that characterize the target in a feature space, and the distance metric that is used to determine the closest match in that feature space. Different from those recognition tasks, a uniqueness of this tracking task is that the matching is constructed between two consecutive frames over ∆t, and thus it requires a computationally more efficient solution.

Copyright (c) 2010 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected]. N. Jiang and W. Liu are with the Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, P. R. China (e-mail: [email protected]; [email protected]). Y. Wu is with the Electrical Engineering and Computer Science Department, Northwestern University, Evanston, IL 60208-3118 USA (e-mail: [email protected]).

If we can identify strong features, e.g., those that are invariant to lighting change and local deformation, and those that are discriminative from the distracters (or false positive matches), good results can be expected even if we only use the Euclidean distance metric for simple nearest-neighbor matching. But when strong features cannot be easily specified, the choice of the distance metric largely influences the matching performance. Thus, finding an appropriate metric becomes critical. The scope of this paper is not finding new image features. Instead, assuming a given set of specified image features, this paper concentrates on the handling of the distance metric for visual tracking.

Most existing tracking methods employ a fixed pre-specified metric. For example, besides the Euclidean metric, many other choices have been explored, including the Matusita metric [1], the Bhattacharyya coefficient [2], the Kullback-Leibler divergence [3], information-theoretic similarity measures [4], and a combination of those [5]. However, simply using a pre-specified metric is problematic and limited in practice. One phenomenon we often observe is that the closest match under a predefined metric in a given feature space may not be the true target of interest. In this case, it leads to a false positive match, which in turn fails the tracker. One reason behind this phenomenon is that the target of interest is in general a semantic object, and thus its appearance variations can induce quite complex uncertainties. For example, as illustrated in Fig. 1, when we want to track a particular face, the closest neighbor in a certain given feature space can be another person when the target face is subject to significant variations, e.g., lighting or pose changes.

Unless these uncertainties can be modeled or learned in advance, it is risky to use a pre-specified metric regardless of the appearance uncertainties. Second, as the target is represented by its visual features, some of them may be more discriminative than others in telling the target apart from the background and distracters. A predefined distance metric is not guaranteed to emphasize those discriminative features. Moreover, in the tracking scenario, as the background can be dynamic and the distracters may keep changing, using a fixed predefined metric is not sensible. Therefore, a good metric for tracking must be learned, discriminative, and adaptive. One good method in the literature is based on learning discriminative features [6]. This method implements a ranking mechanism to evaluate a finite set of pre-specified features that


are linear combinations of three likelihood images in the RGB space. During the tracking process, this method dynamically selects features over time. Although this method works well in some cases, it also has limitations. While it looks for discriminative features, the linear color features it finds may not be powerful enough to separate the target from distracters. In addition, this method performs an exhaustive check over a limited finite set of candidate color features. Thus, this approach is not generally applicable to other features, as the exhaustive search implementation is plausible neither for large candidate sets nor for continuous features.

Fig. 1. Illustration of distance metrics in matching. The closest match can be different due to the metric we choose.

Different from the methods in the literature, our work targets a new learned and adaptive metric for tracking. It achieves a better separability of the target from the distracters through nonlinear classification and nonlinear features. It is not limited to color features, but is more generally applicable. The proposed method provides an accurate metric for better tracking, because the closest match under our learned metric is most likely to be the target, due to its design that minimizes the misclassification rate. The learned Mahalanobis metric weights features differently, implying the identification of discriminative features. This is especially useful when tracking the target in a cluttered and distractive environment. The proposed method is naturally adaptive, as the collection of the training data is performed on-the-fly. In addition, the solution to the motion estimation is in closed form, which enables a computationally efficient solution to target tracking.

The novelty of the proposed method includes the following four aspects. (1) It integrates adaptive metric learning into visual tracking. (2) It provides a closed-form solution to differential tracking under adaptive metric learning. (3) It extends the metric learning method from a linear metric to a nonlinear one. (4) It allows dimension reduction by embedding the high-dimensional feature space into a lower-dimensional space, which largely enhances the computational efficiency.

The organization of this paper is as follows: after a brief description of related work in Sec. II, the basic formulation of metric learning for tracking is introduced in Sec. III, and its integration with differential tracking is presented in Sec. IV. The kernel metric learning for tracking is presented in Sec. V. The experiments are reported and discussed in Sec. VI, followed by the conclusion.

II. RELATED WORK

Feature selection and distance metrics are two closely related concepts, and both can contribute to improving tracking performance. On one hand, feature selection and the distance metric emphasize different aspects of matching. Feature selection concentrates on finding a discriminative and robust target description for tracking, while the distance metric focuses on measuring the distances (or similarities) between the target and the matching candidates. Basically, feature selection methods can be divided into two categories: selecting and combining. The selecting approach tries to pick the best features based on some general criteria, e.g., those features that have the largest margin from the background features [6]. The combining approach aims to identify stronger features based on the combination of weaker ones [7]. When the visual features of the target are fixed, a distance metric is used to measure the similarity between the target and its candidates. For robust and discriminative features, simple distance metrics may report good matching results. However, in practice, robust and discriminative features vary across scenarios, and cannot simply be specified in advance. In such cases, sophisticated distance metrics are needed for accurate matching. On the other hand, feature selection and distance metrics are closely related, and the two can be unified in some special cases. For example, feature selection and linear distance metric learning can both be regarded as a linear transform of the original feature space, R* = AR, where R ∈ R^m is the original feature, A represents the linear transform, and R* ∈ R^d is the new feature. A does not have to be a square matrix. To select discriminative features, as in [8], [6], we can require that each row of A be an indicator row vector, i.e., all but one element of the vector are zeros.
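To make this unification concrete, here is a small NumPy sketch (with illustrative values of our own, not from the paper) showing that indicator-row selection and diagonal weighting are both special cases of R* = AR:

```python
import numpy as np

# A toy 4-dimensional feature vector R (hypothetical values).
R = np.array([0.2, 0.7, 0.1, 0.9])

# Feature *selection* as a linear transform: each row of A is an
# indicator vector, so A @ R simply picks out features 1 and 3.
A_select = np.array([[0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])
R_selected = A_select @ R          # picks features 1 and 3

# Feature *weighting* (as in discriminative combination) is the
# special case where A is diagonal.
A_weight = np.diag([0.5, 2.0, 0.0, 1.0])
R_weighted = A_weight @ R

# A general metric-learning matrix A is unconstrained (and need not
# be square), so both cases above are instances of R* = A R.
```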
The method proposed in [9] integrates different kinds of features by distributing weights according to their discrimination power. This can be regarded as using a diagonal matrix A. Many discriminative learning methods, e.g., support vector machines [10] and boosting [11], can then be employed to identify useful features before combining them. The feature identification process can also be treated as learning the diagonal matrix A (in either batch or sequential processing). In other words, dynamic feature selection methods, discriminative learning methods, and linear distance metric learning methods share the same mathematical structure. Since dynamic feature selection and discriminative learning generally assume a sparse transformation matrix in advance, they are special cases of the general metric learning method proposed in this paper. Moreover, using different discriminative learning methods in metric learning leads to different performances. Note that the distance metric learning method employed in this paper is able to handle multi-class classification tasks, while most discriminative learning methods are designed for two-class classification, such as the one proposed in [12]. General distance metric learning methods can be divided into two categories: unsupervised and supervised approaches. Unsupervised distance metric learning aims to learn an underlying low-dimensional manifold where the relationships among the


observed data are preserved. Classic linear unsupervised distance metric learning methods include PCA [13] and MDS [14]. In the past decade, significant research has been performed on unsupervised nonlinear distance metric learning. Representative works include LLE [15], which preserves the local configuration of the data in both the embedded space and the original space, and ISOMAP [16], which preserves the geodesic distances among data points in a low-dimensional manifold. Supervised distance metric learning approaches aim to learn appropriate distance metrics for better class discrimination. Given similar and dissimilar data pairs as the supervised information, the methods in [17] and [18] explicitly learn to reduce the distances between similar data pairs and to enlarge the distances between dissimilar data pairs. Given the training data and the known labels, the neighborhood component analysis (NCA) algorithm [19] learns a Mahalanobis distance metric that achieves the best performance for the k-NN classifier by maximizing the expected classification accuracy in leave-one-out cross validation. This is achieved by adopting a soft-max formulation for nearest-neighbor classification. Another method that does not use soft-max is the large margin nearest neighbor algorithm (LMNN) [20]. It enforces the data from the same class to be nearest neighbors while the data from different classes are separated by a large margin. A very recent piece of work related to our method was proposed in [21]. The TUDAMM method described in [21] integrates appearance modeling and motion estimation in an Expectation-Maximization (EM)-like algorithm. Its distance metric learning largely follows the concept proposed by Globerson and Roweis [22].
Globerson and Roweis' method constructs a convex optimization problem that aims to find a metric by attempting to collapse all the examples of the same class to a single point and to push the examples of other classes infinitely far away. However, mapping all the examples of the same class to a single point is a much stronger constraint than the minimization task in the soft-max formulation of NCA [19]. This paper proposes a new method that integrates NCA metric learning and differential tracking. Compared with other metric learning methods, we choose NCA for the following reasons. First of all, NCA, as a supervised metric learning algorithm, ensures that the distance metric is adaptive and discriminative. Secondly, NCA is intuitive. Unlike the algorithms that require two kinds of constraints [17], [18], or those that need to identify the support vectors first [20], the soft-max formulation in NCA not only enables an analytical solution, but also provides a natural way to represent the relations among training data: it implicitly pulls the data of the same class together and pushes the data with different labels away. Finally, NCA is analytical, as it can easily be solved by a gradient descent procedure. It is also able to achieve dimension reduction and provides a low-dimensional embedding of the data, which may lead to computationally efficient tracking methods. All of the above properties make NCA a good choice as a metric learner for target tracking.

Fig. 2. Distance metric learning on synthetic data. The left figure indicates the training samples in original feature space, and the right figure shows these samples in the transformed feature space through metric learning.

III. FORMULATION AND A GENERAL APPROACH

A. Formulation

Selecting an appropriate distance measure is fundamental to the tracking problem. In this paper, we first focus on the basic Mahalanobis distance metric, and then extend it to a nonlinear kernel metric. The Mahalanobis distance metric is parameterized by a matrix A ∈ R^{d×m} that linearly transforms the original feature space R^m to a new space R^d (where d ≤ m in general). The Mahalanobis distance metric is written as:

D(xi, xj) = (Axi − Axj)^T (Axi − Axj) = (xi − xj)^T Q (xi − xj)   (2)

where Q = A^T A ∈ R^{m×m}. Using this Mahalanobis distance metric for matching, we have:

min_{∆x} ∥A f(x, t) − A f(x + ∆x, t + ∆t)∥²   (3)

or

min_{∆x} [f(x, t) − f(x + ∆x, t + ∆t)]^T Q [f(x, t) − f(x + ∆x, t + ∆t)]

Learning the linear transform A from training data in specific scenarios, we can increase the discriminative power of the tracker so as to better separate the target from the distracters in the transformed feature space. This new space shrinks the distances among objects of the same identity, and tolerates their appearance variations. When A is an identity matrix, the Mahalanobis distance degenerates to the Euclidean distance. Fig. 2 illustrates the main idea behind distance metric learning on a synthetic data set. Before applying the transform A, the data points in the original feature space are mixed together, as shown on the left of Fig. 2. The transformation A pulls the data points of the same class together, and pushes the other data points away, as shown on the right of Fig. 2. By design, this transform is expected to largely improve the matching accuracy, and thus the tracking accuracy in target tracking applications. Fig. 2 shows an example where incorrect matching in the original input space can be corrected by learning an appropriate linear transformation. Within this formulation, the method proposed in [6] can be treated as a special case. In [6], only three parameters in the transform A are adjustable, and it employs an exhaustive
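As a quick numerical check of Eq. (2), the following NumPy sketch (our own illustration, with arbitrary values) verifies that the distance under a rectangular transform A equals the quadratic form with Q = A^T A, and that A = I recovers the Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=3), rng.normal(size=3)

# A rectangular transform A (d=2, m=3): the metric can also embed
# the features into a lower-dimensional space, as discussed above.
A = rng.normal(size=(2, 3))
Q = A.T @ A                       # Q = A^T A, as in Eq. (2)

d_transformed = np.sum((A @ xi - A @ xj) ** 2)
d_mahalanobis = (xi - xj) @ Q @ (xi - xj)
assert np.isclose(d_transformed, d_mahalanobis)

# With A = I the metric degenerates to the squared Euclidean distance.
d_euclid = (xi - xj) @ np.eye(3) @ (xi - xj)
assert np.isclose(d_euclid, np.sum((xi - xj) ** 2))
```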


search to identify the best parameters. As a more general treatment, our formulation allows a much more flexible adjustment of the metric by optimizing A without predefining a selection pool. As there are d × m parameters in A to be optimized, it is impossible to perform an exhaustive search as done in [6]; thus a more computationally efficient solution is needed. Moreover, d is the dimensionality of the transformed feature space, and m is the dimensionality of the input feature vectors. Instead of performing the matching in R^m, the matching can be done in the projected lower-dimensional space R^d. In other words, the transformation matrix A need not be a square matrix. As d ≪ m, it can also be regarded as a low-dimensional embedding. In the following sections, we address two important issues:
• How can we learn an appropriate linear transformation A for a higher matching accuracy?
• How can we determine the motion ∆x under the new metric?

B. Basic metric learning

Our metric learning method in the visual tracking scenario is supervised, and we need both positive and negative data. Fig. 3 illustrates this training sample collection process. Once we have localized the target at a time instant, we collect the image regions of the target, extract their feature vectors, and label them as positive training samples. Image regions in its very close vicinity (e.g., shifted by 1 or 2 pixels) are also treated as positive samples. Nearby image regions that are not close enough (i.e., that create false positive matches) are collected and used as negative samples, as shown in Fig. 3(a). This training sample collection process is applicable to every video frame. The training data collected in this manner are quite informative for discriminative learning, because they are located fairly close to the true classification boundary, as shown in Fig. 3(b), and thus the learned classification is expected to be accurate.
Therefore, based on the training data collected in this way, we are able to differentiate true matches of the target from false positive matches. In other words, the closest match is very likely to be the true target of interest based on the learning over such a training set. As we can localize the target in more video frames over time, we can easily collect many such training samples on-the-fly. We can either set a forgetting factor or use a sliding time window to maintain an active training database for discriminative learning. In addition, this training sample generation process can naturally be extended to multiple targets. In such a case of multiple targets, we collect a labeled data set consisting of feature vectors {x1 , x2 , . . . xn } in Rm and their corresponding class labels {c1 , c2 , . . . cn }. Our basic metric learning method is performed on this set of labeled training data. Following the concept in [19] for metric learning, we aim to find a linear transform A ∈ Rd×m (where d ≤ m) that maximizes the k-NN classification accuracy in the projected space. The k-NN classifier predicts the class label of an input data point by the consensus of its k-nearest neighbors under

Fig. 3. Illustration of the training sample collection process. (a) Training samples collected in a video frame. (b) Training samples collected in the feature space.

a given distance metric. This is a quite simple classification scheme. However, it is not analytical and not differentiable with respect to A. Because the set of neighbors of a data point may undergo discrete changes in response to smooth changes in A, any objective function based on the nearest neighbors must be piecewise constant, and hence not differentiable. Fortunately, this difficulty can be alleviated by using a soft representation of the nearest neighbors. As proposed in [19], we can consider the entire transformed data set as probabilistic ("soft") nearest neighbors. To measure the neighborhood relationship between a pair of feature vectors xi and xj in the training set, pij is defined as a soft-max over the Euclidean distances in the transformed space:

pij = exp(−∥Axi − Axj∥²) / Σ_{k≠i} exp(−∥Axi − Axk∥²)   (4)

In other words, each feature vector xi treats another feature vector xj as its neighbor with probability pij, and inherits the class label of the feature vector it selects. Denote the set of input data in the same class as i by Ci = {j | ci = cj}. Then, the objective of the metric learning is to maximize the expected number of input data that are correctly classified:

A* = arg max_A g(A) = arg max_A Σ_i log( Σ_{j∈Ci} pij )   (5)
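Equations (4) and (5) can be sketched directly in NumPy. The snippet below is our own illustration on toy two-class data (the function names and data are ours, not the paper's); on well-separated clusters, g(A) is negative but close to zero:

```python
import numpy as np

def nca_probs(A, X):
    """Soft-max neighbor probabilities p_ij of Eq. (4) (with p_ii = 0)."""
    Z = X @ A.T                                   # transformed data, n x d
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                  # exclude k = i
    e = np.exp(-d2)
    return e / e.sum(axis=1, keepdims=True)

def nca_objective(A, X, c):
    """g(A) of Eq. (5): sum_i log( sum_{j in C_i} p_ij )."""
    p = nca_probs(A, X)
    same = (c[:, None] == c[None, :]) & ~np.eye(len(c), dtype=bool)
    return np.sum(np.log(np.sum(p * same, axis=1)))

# Toy two-class data (hypothetical): two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]])
c = np.array([0, 0, 1, 1])
A = np.eye(2)

p = nca_probs(A, X)
g = nca_objective(A, X, c)
```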

This choice of objective function is preferable as it is differentiable with respect to A. This optimization problem can be approached by any gradient-based technique, as the


gradient of the objective function can be obtained in closed form:

∂g/∂A = 2A Σ_i ( Σ_k pik xik xik^T − ( Σ_{j∈Ci} pij xij xij^T ) / ( Σ_{j∈Ci} pij ) )   (6)

where xij = xi − xj. In our implementation, we use a conjugate gradient-descent method to iteratively maximize the objective function and obtain an optimal solution of A.

C. General Approach

The general approach of integrating adaptive metric learning in tracking is described in Fig. 4. This approach has three major components:
• Training sample collection. Once the target region is located at a time instant, as described in Sec. III-B, we collect n training samples, where half of them are positive training samples extracted in the close vicinity of the target, and the other half are negative training samples extracted within a larger region surrounding the target region. Although various features can be applied here, as a specific implementation we extract a color histogram to represent the target appearance. A training sample is denoted by xi as in Sec. III-A. Over time, with the accumulation of more samples, this training data set is updated on-the-fly for adaptation.
• Adaptive metric learning. Based on this training set, metric learning is performed by the method described in Sec. III-B. Different from the exhaustive search method proposed in [6], we design a gradient descent method to achieve a more efficient optimization. Moreover, by updating the transform matrix with new training samples, the learned metric is naturally adaptive over time.
• Motion analysis and target localization. We employ the learned metric to measure the distances between the target and the candidates. Based on the closed-form results of motion analysis presented in Sec. IV-B, the location of the target in the current frame is estimated by Eq. (17).

IV. LINEAR METRIC FOR DIFFERENTIAL TRACKING

Many tracking methods can be used within the general framework described in Sec. III-C. In this paper, we focus on a solid case study that integrates metric learning with differential tracking. Differential tracking methods, such as mean-shift tracking [23] and contextual flow [24], assume small motion between two consecutive image frames. Based on a linear approximation of the two frames, differential tracking methods give a closed-form solution to the motion, thus allowing very computationally efficient algorithms for real-time tracking.

A. Kernel-based Tracking

Kernel-based tracking has attracted much attention in the literature because of its computational efficiency, simplicity, and robustness [25], [1], [26], [27]. In the kernel-based method, the target is represented by its feature histogram (e.g., a color histogram), q = [q1, q2, ..., qm]^T ∈ R^m, whose u-th bin is

qu = (1/C) Σ_i K(yi − c) δ(b(yi), u)   (7)

where b(yi) maps the feature at pixel location yi into a histogram bin u, K is a homogeneous spatial weighting kernel centered at c, δ(·) is the delta function, and the constant C is used for normalization. Rewriting Eq. (7) in matrix form as in [1]:

q(c) = U^T K(c)   (8)

where

U = [ δ(b(y1), u1) ... δ(b(y1), um) ; ... ; δ(b(yt), u1) ... δ(b(yt), um) ] ∈ R^{t×m}   (9)

K = (1/C) [ K(y1 − c), ..., K(yt − c) ]^T ∈ R^t   (10)

Denoting by p(c + ∆c) the candidate's color histogram and by D(q(c), p(c + ∆c)) the objective function for matching, visual tracking is to estimate the best displacement ∆c that optimizes:

∆c* = arg min_{∆c} D(q, p(c + ∆c))   (11)
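As an illustration of Eqs. (7)-(10), the following sketch builds the kernel-weighted histogram q(c) = U^T K(c) for a hypothetical 1-D "image" (the paper only requires a homogeneous spatial kernel; the Gaussian below is our own assumption, as are all the toy values):

```python
import numpy as np

# Hypothetical 1-D "image": t pixels, each with a quantized bin index
# b(y_i) in {0, ..., m-1}, and a spatial weighting kernel centered at c.
t, m = 7, 3
y = np.arange(t, dtype=float)            # pixel locations y_i
b = np.array([0, 0, 1, 1, 1, 2, 2])      # bin assignments b(y_i)
c = 3.0                                  # kernel center

K = np.exp(-0.5 * (y - c) ** 2)          # spatial weights K(y_i - c)
K /= K.sum()                             # the constant C normalizes the kernel

# U in R^{t x m}: U[i, u] = delta(b(y_i), u), as in Eq. (9).
U = np.zeros((t, m))
U[np.arange(t), b] = 1.0

q = U.T @ K                              # q(c) = U^T K(c), Eq. (8)
```

Since each pixel contributes its full spatial weight to exactly one bin, q is a normalized histogram, and bins near the kernel center dominate.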

In this paper, we start from the Matusita metric, noting that the proposed approach can easily be extended to other distance measures. With the Matusita metric, the distance between two distributions is written as:

D(∆c) = ∥√q − √p(c + ∆c)∥²   (12)

The motion estimate is obtained in [28]:

∆c = M† (√q − √p(c))   (13)

where M† is the pseudo-inverse of the matrix M, ∆c ∈ R^r, and

M = (1/2) diag(p(c))^{−1/2} U^T J_K(c) ∈ R^{m×r}   (14)

J_K = [ ∇_c K(y1 − c) ; ... ; ∇_c K(yt − c) ]   (15)

The Matusita metric minimizes the sum of squared differences between two distributions. As it disregards the contextual information of the object, it does not have a strong discriminative power over the visual measurements. Meanwhile, when a fixed Matusita metric is specified in advance and used throughout, the tracker lacks the ability to adapt to scene changes or to the appearance changes of the object itself. This is likely to lead to tracking drift or failure. Moreover, other commonly used distance metrics share the same problems as the Matusita metric. Fortunately, this problem can be overcome by the approach proposed in this paper, as follows.
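The closed-form estimate of Eq. (13) is a linear least-squares solution via the pseudo-inverse. The sketch below fabricates toy data for which the linearized model holds exactly by construction (random M and q, not real tracking data), and checks that the pseudo-inverse recovers the displacement:

```python
import numpy as np

rng = np.random.default_rng(1)

m, r = 6, 2                              # m histogram bins, r motion dims
M = rng.normal(size=(m, r))              # stand-in for the matrix of Eq. (14)

# Hypothetical target histogram and a known true displacement.
q = rng.dirichlet(np.ones(m))
true_dc = np.array([0.4, -0.2])

# Fabricate a candidate whose sqrt-histogram differs from sqrt(q) by
# exactly M @ true_dc, i.e. the linearization behind Eq. (13) is exact.
sqrt_p = np.sqrt(q) - M @ true_dc

dc = np.linalg.pinv(M) @ (np.sqrt(q) - sqrt_p)   # Eq. (13)
```

With M of full column rank, the pseudo-inverse gives the unique least-squares displacement, so `dc` matches `true_dc` here.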


Fig. 4. The framework of metric learning in visual tracking. It has three major components: training sample collection, adaptive metric learning, and motion analysis and target localization.
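The three components of Fig. 4 amount to a per-frame loop. The following skeleton is our own schematic of that pipeline, with the components passed in as stub callables; it omits the threshold test the paper uses to decide when to re-learn the metric:

```python
def track(frames, init_loc, collect_samples, learn_metric, localize):
    """Schematic of the Fig. 4 pipeline: collect -> learn -> localize.

    `collect_samples`, `learn_metric`, and `localize` are placeholders
    for the components described in Sec. III; in the paper, the metric
    is re-learned only when the objective g(A) crosses a threshold,
    which this sketch omits for brevity.
    """
    loc, dataset, trajectory = init_loc, [], []
    for frame in frames:
        dataset.extend(collect_samples(frame, loc))   # 1) training samples
        A = learn_metric(dataset)                     # 2) adaptive metric
        loc = localize(frame, loc, A)                 # 3) motion estimation
        trajectory.append(loc)
    return trajectory
```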

B. Differential Tracking under a New Metric

Projecting the original data into the new feature space, the distance between two data points is given by:

D(∆c) = ∥A√q − A√p(c + ∆c)∥²

We approximate √p(c + ∆c) linearly by its Taylor expansion, dropping the higher-order terms. Following the notation of Sec. IV-A, the new objective function for tracking is:

∆c* = arg min_{∆c} ∥A√q − A√p(c) − AM∆c∥²   (16)

The optimal displacement ∆c of this objective function is obtained by:

∆c = (M^T A^T A M)^{−1} M^T A^T A (√q − √p(c))   (17)

Let

S = (M^T A^T A M)^{−1} M^T A^T A

Then we can rewrite Eq. (17): the optimal displacement ∆c under the new metric is obtained by:

∆c = S (√q − √p(c))   (18)

Please note that in this paper we treat the kernel-based tracking method described in this section as a solid case study; the general metric learning-based tracking approach proposed in this paper is applicable to many other tracking methods.

V. KERNEL METRIC FOR DIFFERENTIAL TRACKING

Although in practice the above linear metric learning method works well in many cases, there are cases where such a linear metric is insufficient. In these cases, no matter what linear projection we use, the projected data are not locally linearly separable, leading to unsatisfactory k-NN classification results. Therefore, we cannot find a plausible linear metric in such cases. Fig. 5(a) shows a simplified example where the positive and negative training samples cannot be linearly separated. However, as shown in Fig. 5(b), projecting those training samples into a higher-dimensional space by the kernel method can make them linearly separable. For these nonlinear cases, we design a kernel metric learning method. Kernel metric learning approaches the problem by mapping the original data xi to a higher-dimensional, or even infinite-dimensional, feature space, and then finding a linear metric in the new space that achieves our objective. This learned kernel metric is more powerful for class discrimination. In addition, since the mapping can be quite general, e.g., not necessarily linear, the kernel metric learning method is quite general as well.

Fig. 5. The benefit of using a kernel metric. The red dots represent the positive training samples, and the blue dots represent the negative training samples. (a) Training samples that are not linearly separable in the original feature space. (b) Projecting the original training samples into a higher-dimensional space can make them linearly separable.
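A minimal numerical analogue of Fig. 5 (our own toy construction, not the paper's data): 1-D points labeled by magnitude cannot be separated by any linear threshold on x, but the polynomial-style lift ϕ(x) = (x, x²) makes them linearly separable:

```python
import numpy as np

# 1-D points labeled by magnitude: class 0 = {-2, 2}, class 1 = {0}.
# No single linear threshold on x separates them (a 1-D analogue of
# the situation in Fig. 5(a)).
x = np.array([-2.0, 0.0, 2.0])
labels = np.array([0, 1, 0])

# Nonlinear (polynomial-kernel-style) lift phi(x) = (x, x^2): in the
# lifted space the classes are split by the hyperplane x^2 = 2.
phi = np.stack([x, x ** 2], axis=1)
w, bias = np.array([0.0, 1.0]), -2.0     # a hypothetical separating plane
scores = phi @ w + bias                  # positive for class 0, negative for class 1
```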


A. Kernel distance metric learning In the previous section we have described an algorithm that integrates the linear distance metric learning in tracking. We now describe how to “kernelize” this method in order to compute the non-linear mapping of the input data to optimizes our objective function. The kernel method we propose here learns a Mahalanobis distance metric in a high, possibly infinite, dimensional feature space, by a nonlinear mapping ϕ. We restrict our analysis to nonlinear maps ϕ for which there exists a kernel function k that can be used to compute the inner product without carrying out the nonlinear mapping explicitly, such that k(xi , xj ) = ϕTi ϕj . Let Φ = [ϕ(x1 ), ϕ(x2 ), · · · , ϕ(xn )]T ∈ Rn×l , where n is the number of training samples, and l is the dimension of the projected feature space that l might be infinite. We consider the parameterization of A in the form A = ΛΦ, where A ∈ Rd×l is a linear combination of feature vectors in high(or infinite) dimensional feature space. Define by ki = Φϕ(xi ) = [k(x1 , xi ), . . . , k(xn , xi )]T , and kij = ki − kj . By substituting input vector xi by its non-linear mapping ϕ(xi ) in Eq.(2), the Mahalanobis distance in the nonlinearly transformed space is: ϕ

d (xi , xj ) =

kTij ΛT Λkij

(19)

This means that we do not need to explicitly identify the non-linear transform ϕ(x) before computing dϕ . Even if ϕ(x) may project the original data into an infinite dimensional space, as Λd×n is finite, it is still feasible to compute dϕ . Therefore, we now need to optimize the k-NN performance over Λ instead of A. By using the Mahalanobis distance in the nonlinear kernel space, we can rewrite the soft-max of metric learning in the kernel space as: exp(−kTij ΛT Λkij ) T T k̸=i exp(−kik Λ Λkik )

pϕij = ∑

(20)

The optimal linear transformation for metric learning in the kernel space is:

Λ* = arg max_Λ g(Λ) = arg max_Λ Σ_i log( Σ_{j∈C_i} p^ϕ_ij )    (21)

The gradient of the objective function is:

∂g/∂Λ = 2Λ Σ_i ( Σ_{k≠i} p^ϕ_ik k_ik k_ik^T − ( Σ_{j∈C_i} p^ϕ_ij k_ij k_ij^T ) / ( Σ_{j∈C_i} p^ϕ_ij ) )    (22)

B. Kernel distance metric learning for tracking

For kernel metric learning, the feature vectors are mapped to a high- or infinite-dimensional space. In this regard, linking with Eq. (3), the kernel-based objective function for tracking is:

D(∆c) = ‖Aϕ(√q) − Aϕ(√p(c + ∆c))‖²
      = ‖Λ[k(x_1, √q), · · · , k(x_n, √q)]^T − Λ[k(x_1, √p(c + ∆c)), · · · , k(x_n, √p(c + ∆c))]^T‖²    (23)

where the x_i are the training samples. According to the Taylor expansion, we have:

k(x_i, √p(c + ∆c)) = k(x_i, √p(c)) + k′(x_i, √p(c))∆c    (24)

Suppose we employ the RBF kernel; then:

k′(x_i, √p(c)) = (1 / (√(2π)σ³)) exp(−‖x_i − √p(c)‖² / (2σ²)) [x_i − √p(c)]^T M    (25)

Denote by:

F = [k′(x_1, √p(c)); k′(x_2, √p(c)); . . . ; k′(x_n, √p(c))]    (26)

Rewriting the objective function of kernel metric learning for tracking:

O(∆c) = ‖Λk_√q − Λk_√p(c) − ΛF∆c‖²    (27)

where we use k_√q to denote [k(x_1, √q), · · · , k(x_n, √q)]^T and k_√p(c) to denote [k(x_1, √p(c)), · · · , k(x_n, √p(c))]^T. Based on the kernel convolution, the optimal displacement ∆c is obtained by:

∆c = L(k_√q − k_√p(c))    (28)

where L = (F^T Λ^T Λ F)^{−1} F^T Λ^T Λ.

VI. EXPERIMENTS

This section reports our experiments that validate the effectiveness of the proposed metric learning methods for robust tracking. These experiments include: a comparison of our method with a differential tracker without metric learning, a comparison of our method with a dynamic feature selection method, a comparison of the linear metric learning method with the kernel metric method, the adaptation of the learned metric, examples of tracking multiple targets, and a convergence study of the proposed metric learning methods.

In all of our experiments, the tracker is initialized through a manual selection of the image region of interest on the target. The visual appearance of the target is represented by a normalized RGB color histogram with 225 bins. For each frame, we collect 50 training examples, and they are accumulated over time. We evaluate the performance of the tracker for every frame. When the value of the objective function g(A) in Eq. (5) is greater than a pre-defined threshold, we re-run metric learning to obtain a new metric that adapts to the new video scene. This threshold is 10^−5 in all of our experiments.
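The closed-form displacement of Eq. (28) is the least-squares solution of the linearized objective O(∆c) in Eq. (27). A minimal sketch, assuming Λ, the derivative matrix F, and the kernel vectors k_√q and k_√p(c) have already been computed (the function name is ours); `np.linalg.lstsq` gives the same answer as the normal-equation form L = (F^T Λ^T Λ F)^{−1} F^T Λ^T Λ whenever ΛF has full column rank:

```python
import numpy as np

def displacement(Lam, F, k_q, k_p):
    # Eq. (28): dc minimizes || Lam k_q - Lam k_p - Lam F dc ||^2,
    # i.e. dc = (F^T Lam^T Lam F)^{-1} F^T Lam^T Lam (k_q - k_p).
    A = Lam @ F                  # effective Jacobian in the metric space
    r = Lam @ (k_q - k_p)        # residual in the metric space
    dc, *_ = np.linalg.lstsq(A, r, rcond=None)
    return dc
```

Note that the solve happens in the d-dimensional metric space, so its cost is governed by the (small) projected dimension rather than by n.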

A. Effectiveness of Metric Learning in Tracking

1) Comparing with a Tracker without Metric Learning: To validate and demonstrate the effectiveness of metric learning for tracking in real video, we compare our proposed method with the basic differential tracking method without metric learning proposed in [1]. We select the Sailing sequence for this investigation for two reasons. First, in this testing video, the target (i.e., the sailing boat) is subject to significant illumination changes due to the sunset. Second, the target itself has low contrast with the background, so its color appearance is quite similar to the background. Because of these two issues, the target can hardly be well separated from the background by pre-specified color features. Fig. 6 shows our comparison on this testing video. The first row in Fig. 6 shows the closest matches found by the baseline method [1]. False positives are closest matches that are not the target we want to track. Fortunately, by having a supervised metric learning component in tracking, our method is able to adaptively learn a better distance metric based on the supervised training data. As this step significantly enhances the discrimination power of the metric, it largely improves the tracking accuracy even using only simple color features. As shown in the third row of Fig. 6, the proposed tracker smoothly tracks the sailing boat through the low contrast and the illumination changes, while the method without metric learning loses track of the target from the beginning. This example well demonstrates the effectiveness of using metric learning in tracking.

2) Comparing with Dynamic Feature Selection: Using the same Sailing sequence, we also compare our method with the dynamic feature selection method proposed in [6]. As shown in the second row of Fig. 6, our proposed method gives much more robust results than the method in [6].
The method proposed in [6] combines a color-based tracker with LDA, where the pool of feature candidates is predefined and has a limited size. This method cannot give satisfactory results for this testing sequence. In addition, as discussed in Sec. II, the feature candidates evaluated in [6] are a subset of those proposed in this paper. As a more general case of [6], our method is more flexible, since more parameters in the transformation matrix A are adjustable. This flexibility does not incur a high computational cost, because our method achieves the optimal solution using a more efficient gradient-based method rather than the exhaustive search done in [6]. As shown in Fig. 6, the method proposed in [6] fails to track the target in the Sailing sequence, while our method accurately tracks the target throughout the entire sequence, which clearly shows the superiority of our proposed method.

3) A Quantitative Comparison: For a quantitative comparison, we obtain the ground truth of the testing sequences by manually labeling all the frames. For the Sailing sequence, we compare our method with [6] and [1], and show the comparison of the tracking error (in pixels) in Fig. 7. This quantitative evaluation is consistent with the subjective evaluation shown in Fig. 6. In other words, the proposed method outperforms [1] and [6] in both objective and subjective evaluations.

Fig. 7. A comparison of the tracking error of the proposed method with the methods in [6] and [1].


Fig. 9. Metric learning results of our method and TUDAMM on the Texture sequence. (a) One sample frame of the test video, (b) the metric learning result of TUDAMM, and (c) the metric learning result of the proposed method.

4) Comparing with TUDAMM: We compare our method with a very recent method called TUDAMM [21] that uses a probabilistic metric learning technique in tracking. To compare the performances, we created a testing video sequence in which the target is a paper segment and the background is a large paper with an identical texture. Tracking such a target is extremely difficult because the target is almost indistinguishable without motion, even for human eyes. We name this test sequence Texture, and we use a white bounding box to indicate the true location of the target. The tracking results of TUDAMM are shown in red bounding boxes in Fig. 8, and our results are in blue. For this testing sequence, the key to successful tracking is the quality of the metric that differentiates the subtle differences between the target and the background. The comparison is shown in Fig. 8, where the top row is the result of TUDAMM and the bottom row is the result of our method. This experiment clearly shows that TUDAMM is unable to handle this kind of sequence: it loses track of the target right at the beginning of the sequence, and is completely off and out of the image boundary by the 19th frame. On the contrary, our method is able to follow the target quite well. This is not surprising, because our method can learn a more discriminative and powerful metric to separate the target from the background, as shown in Fig. 9. Compared with the metric learning result of TUDAMM shown in Fig. 9(b), it is clear that our method, as shown in Fig. 9(c), can correctly separate all training points in this test.

5) Other Examples: We have tested our method extensively on many other challenging sequences. Due to the page limit, this section only gives the results on some of the testing sequences to show the effectiveness of our method in handling several major difficulties: heavy background clutter, spatial distracters, and fast motion.


Fig. 6. A comparison of the baseline method without metric learning [1] (top), the dynamic feature selection method [6] (middle), and our method (bottom) on the Sailing sequence.

Fig. 8. Sample frames of the comparison between our method and TUDAMM on the Texture sequence. The top row shows the tracking result of TUDAMM, and the bottom row shows the tracking result of our method.

Fig. 10. Sample frames of the result of our method on the Running sequence, which presents a cluttered background and large scale changes.

One example of a cluttered background is given in Fig. 10. The object being tracked in this Running sequence is a man in blue, who is subject to a large scale change. This sequence is quite challenging because of the heavily cluttered background. Our tracking algorithm correctly estimates the positions and scales of the target throughout the entire sequence.

In Fig. 11, we present the tracking result on the Zebra sequence. This example shows how our method handles spatial distracters. In this sequence, the two zebras are almost identical in their visual appearances. Thus, when we track one zebra, the other zebra generates a very faithful false-positive match that can easily distract the tracker. This situation confounds most existing tracking methods. Fortunately, by learning a discriminative metric, our tracking method is able to identify the nuance between the two similar objects and distinguish the target from the distracter by their subtle differences, comfortably giving a promising result. Another example of handling spatial distracters is shown in Fig. 12 on the Soccer sequence, where there are many small targets wearing uniforms. Our tracking algorithm tracks the soccer player robustly.

The kernel-based tracking method is only a case study in this paper. The metric learning we propose is generally applicable to many other tracking methods. Since the kernel-based tracking method assumes small motion, it may not


Fig. 11. Sample frames of the result of our method on the Zebra sequence, which presents distracters.

Fig. 12. Sample frames of the result of our method on the Soccer sequence, which presents small objects.

be able to handle fast-motion sequences. Here, we employ template matching along with an exhaustive search to tackle fast motion. By enlarging the search region, our method easily handles the fast-motion sequence shown in Fig. 13.
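To illustrate the enlarged-region search described above, the following sketch scans every offset in a window around the previous location and keeps the candidate whose (flattened) patch minimizes a learned Mahalanobis distance to the template. Everything here (the function name, raw-pixel patches instead of the paper's histogram features, the metric matrix M, and the search radius) is an illustrative assumption:

```python
import numpy as np

def exhaustive_search(frame, template, M, center, radius):
    # Scan every offset within `radius` of `center` (top-left corner) and
    # return the location minimizing (f - t)^T M (f - t), where f and t are
    # the flattened candidate patch and template.
    h, w = template.shape
    t = template.ravel()
    cy, cx = center
    best, best_d = center, np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue                       # skip out-of-bounds candidates
            f = frame[y:y + h, x:x + w].ravel()
            d = (f - t) @ M @ (f - t)
            if d < best_d:
                best, best_d = (y, x), d
    return best
```

Enlarging `radius` trades computation for tolerance to larger inter-frame motion, which is the point of the exhaustive search here.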


B. Tracking Multiple Targets

The metric learning in our method is not limited to the 2-class case. Because it is based on the k-NN classifier, it can naturally handle multiple classes; thus, one of the main advantages of the proposed tracking method is that it can naturally handle multiple targets. We validate this benefit of our tracking method on the Racing sequence, as shown in Fig. 14. In this testing video, we track three targets, each of which suffers from distracters in the cluttered background as well as appearance variations due to camera view changes. Unlike the 2-class classifiers proposed in [12], [10], the k-NN classifier employed in our method is easily extended to the multi-class scenario. By collecting and labeling the positive samples for each target of interest, the proposed metric learning method pulls together the samples in the same class and pushes away all the samples in different classes. This leads to the class discrimination in our method. As shown in Fig. 14, our tracking algorithm smoothly tracks all the targets and prevents the tracker from drifting. This example clearly shows that our method handles multiple objects quite well.

C. Kernel Metric Learning for Tracking

1) Comparing with the Basic Metric Learning Method: Although the proposed linear metric learning works well on most of our testing videos, it is confronted by cases where the positive and negative training samples are not linearly separable. The proposed kernel metric learning method aims to handle this difficulty. Our experiments also validate that the kernel method proposed in this paper performs better than the linear method.

Fig. 15 gives a subjective comparison on the Animal sequence. The top row of Fig. 15 shows the tracking results of the linear metric learning method, and the bottom row the
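The multi-class pull/push behavior described above can be illustrated with an NCA-style soft-neighbor objective, which is maximized when each sample's probability mass falls on same-class (same-target) neighbors. This is a sketch under our own naming, not the paper's exact code; the labels y may take any number of values, so multiple targets need no special treatment:

```python
import numpy as np

def soft_neighbor_objective(A, X, y):
    # sum_i log( sum_{j in C_i} p_ij ), with
    # p_ij = exp(-||A x_i - A x_j||^2) / sum_{k != i} exp(-||A x_i - A x_k||^2).
    Z = X @ A.T                                       # projected samples
    D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D, np.inf)                       # exclude self-matches
    P = np.exp(-D)
    P /= P.sum(axis=1, keepdims=True)
    same = (y[:, None] == y[None, :])                 # same-class mask C_i
    np.fill_diagonal(same, False)
    return np.sum(np.log(np.sum(P * same, axis=1)))
```

The objective is always at most 0 and approaches 0 exactly when, for every sample, nearly all soft-neighbor probability lands on samples of its own class.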


Fig. 16. A comparison of the tracking error (in pixels, y-axis) between the basic and kernel methods over the frame number (x-axis).

kernel method proposed in Sec. V. We can see that the linear metric learning method does not work very well on this Animal sequence, because the positive and negative data are heavily mixed. This sequence is particularly difficult because the motion of the target is dramatic. In addition, the target presents an appearance similar to the background. In this situation, it is very difficult, if not impossible, to find a linear projection A under which the positive and negative data are clearly separated in the transformed space. Fortunately, as shown in Fig. 15, the proposed kernel method copes with this difficulty very well and tracks the object successfully throughout the whole sequence. The objective comparison of the tracking error (in pixels) between the kernel method and the linear method is given in Fig. 16. Both subjective and objective evaluations show that the proposed kernel method works better than the basic linear metric learning method. More testing results of the proposed kernel-based distance metric learning method for tracking are shown in Fig. 17.

2) Adaptation in Metric Learning: In our proposed method, the learned metric adapts over time. This is necessary for robust tracking because of the dynamic changes in the appearance of both the target and the background. For example, in the Animal sequence, the target often shares a similar appearance with the background. A fixed metric cannot work well in all situations, and thus the metric should adapt to the dynamic background.

In Fig. 18, the first row shows the key frames of the video, the second row shows the positive and negative data distribution before metric learning, and the third row shows the


Fig. 13. Sample frames of the result of our method on the Sprite sequence, which presents fast motion: the 15th, 16th, 17th, 18th, 19th, and 20th frames.

Fig. 14. Sample frames of the result of our method on tracking multiple targets in the Racing sequence.

Fig. 15. A comparison of the basic metric learning method (top) and the kernel method proposed in this paper (bottom).

Fig. 17. Extensive results of the kernel method proposed in this paper.

positive and negative data distribution after metric learning. From Fig. 18, it is clear that the positive and negative data that are mingled together become clearly separated after metric learning. In addition, the margin between the positive and negative training data becomes larger after metric learning, meaning that the distance metric learned here is robust and discriminative. Moreover, although the ways the positive and negative samples mix together differ from time to time, our method is able to find good projections to separate them over time. In summary, our method robustly learns an adaptive and discriminative distance metric.

D. Convergence Study of our Metric Learning

The issue of convergence is important for the gradient-based optimization in the proposed metric learning methods. Meanwhile, the convergence of the proposed metric learning method is tightly related to the dimension reduction in our method, so we analyze them together here. Fig. 19 gives an example of convergence and dimension reduction. The first row in Fig. 19 shows the test video frames. The second row presents the convergence result of the basic linear metric learning method, and the third row shows the convergence result of the kernel metric learning


Fig. 18. An example of the adaptive metric over time on the Animal sequence. The first row shows sample frames of the video, the second row shows the training samples before metric learning, and the third row shows the samples after metric learning.

method. In the bottom two rows, the x-axis represents the number of iterations, and the y-axis represents the value of the objective function g(A) and g(Λ), respectively. In this study, we make three observations. First, in all cases, our method converges after a finite number of iterations. Second, our algorithm converges rapidly: in most cases, within 4 iterations. Third, the convergence is not influenced much by the optimization techniques we used. In addition, the different colored lines in Fig. 19 indicate the different numbers of dimensions we chose for metric learning. In our experiments, we tested dimensions from 3 to 30, which is indeed a large range for this parameter. The results are quite encouraging: the reduction of dimensionality does not have much influence on the speed of convergence. This indicates that our proposed metric learning methods have a nice convergence property, as they are not affected much by the dimensionality of the embedded space.

VII. CONCLUSION

Matching the visual appearances of the target is the most important issue in visual tracking, and finding an optimal distance metric is critical in solving this matching problem. In this paper, we propose a new method for matching based on metric learning, and integrate this metric learning with differential tracking. This paper describes a discriminative and adaptive distance metric learning method that learns a Mahalanobis distance metric to maximize the performance of the k-NN classifier in the projected feature space. To achieve better discrimination power, a nonlinear kernel metric learning method is also proposed. Based on these, we integrate distance metric learning into visual tracking, and obtain robust and promising tracking performance in our extensive experiments.

ACKNOWLEDGMENT

This work is supported in part by the National Natural Science Foundation of China (grants No. 60873127 and 60903172), and in part by National Science Foundation grants IIS-0347877 and IIS-0916607.

REFERENCES

[1] G. D. Hager, M. Dewan, and C. V. Stewart, "Multiple kernel tracking with SSD," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. I-790–I-797, 2004.
[2] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, pp. 564–577, May 2003.
[3] A. Elgammal, R. Duraiswami, and L. S. Davis, "Probabilistic tracking in joint feature-spatial spaces," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. I-781–I-788, 2003.
[4] P. Viola and W. M. Wells, "Alignment by maximization of mutual information," in Proc. IEEE Int. Conf. Computer Vision, p. 16, 1995.
[5] C. Yang, R. Duraiswami, and L. Davis, "Efficient mean-shift tracking via a new similarity measure," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 176–183, 2005.
[6] R. T. Collins, Y. Liu, and M. Leordeanu, "Online selection of discriminative tracking features," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, pp. 1631–1643, Oct. 2005.
[7] H. Grabner, M. Grabner, and H. Bischof, "Real-time tracking via on-line boosting," 2006.
[8] B. Han and L. Davis, "Object tracking by adaptive feature extraction," in Proc. Int. Conf. Image Processing, vol. 3, pp. 1501–1504, 2004.
[9] Z. Yin, F. Porikli, and R. T. Collins, "Likelihood map fusion for visual object tracking," in Proc. IEEE Workshop on Applications of Computer Vision, pp. 1–7, 2008.
[10] S. Avidan, "Support vector tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, pp. 1064–1072, Aug. 2004.
[11] S. Avidan, "Ensemble tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, pp. 261–271, Feb. 2007.
[12] B. Babenko, M.-H. Yang, and S. Belongie, "Visual tracking with online multiple instance learning," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 983–990, 2009.
[13] M. J. Black and A. D. Jepson, "EigenTracking: Robust matching and tracking of articulated objects using a view-based representation," Int. J. Comput. Vision, vol. 26, no. 1, pp. 63–84, 1998.
[14] T. F. Cox and M. A. A. Cox, Multidimensional Scaling. London: Chapman & Hall, 1994.
[15] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323–2326, 2000.


Fig. 19. A study of convergence on different sequences and an analysis of dimension reduction. The first row shows the test videos, the second row presents the convergence result of the basic metric learning method, and the third row presents the convergence result of the kernel metric learning method.

[16] J. B. Tenenbaum, V. Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, Dec. 2000.
[17] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, "Distance metric learning, with application to clustering with side-information," in Advances in Neural Information Processing Systems 15, pp. 505–512, MIT Press, 2002.
[18] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall, "Learning distance functions using equivalence relations," in Proc. Int. Conf. Machine Learning, pp. 11–18, 2003.
[19] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, "Neighborhood component analysis," in Advances in Neural Information Processing Systems, 2005.
[20] K. Weinberger, J. Blitzer, and L. Saul, "Distance metric learning for large margin nearest neighbor classification," in Advances in Neural Information Processing Systems, 2006.
[21] X. Wang, G. Hua, and T. X. Han, "Discriminative tracking by metric learning," in Proc. European Conf. Computer Vision, 2010.
[22] A. Globerson and S. Roweis, "Metric learning by collapsing classes," Advances in Neural Information Processing Systems, vol. 18, pp. 451–458, 2006.
[23] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 142–149, 2000.
[24] Y. Wu and J. Fan, "Contextual flow," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 33–40, 2009.
[25] R. T. Collins, "Mean-shift blob tracking through scale space," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. II-234–II-240, 2003.
[26] C. Shen, J. Kim, and H. Wang, "Generalized kernel-based visual tracking," IEEE Trans. Circuits and Systems for Video Technology, vol. 20, pp. 119–130, Jan. 2010.
[27] A. Yilmaz, "Object tracking by asymmetric kernel mean shift with automatic scale and orientation selection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1–6, 2007.
[28] Z. Fan, M. Yang, Y. Wu, G. Hua, and T. Yu, "Efficient optimal kernel placement for reliable visual tracking," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 658–665, 2006.

Nan Jiang received the B.S. degree in electronics and information engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2006. She is currently pursuing the Ph.D. degree in the Department of Electronics and Information Engineering, HUST. Her research interests include computer vision and pattern recognition.

Wenyu Liu (M'08) received the B.S. degree in computer science from Tsinghua University, Beijing, China, in 1986, and the M.S. and Ph.D. degrees in electronics and information engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1991 and 2001, respectively. He is now a Professor and Associate Dean of the Department of Electronics and Information Engineering, HUST. His current research areas include multimedia information processing and computer vision.

Ying Wu (SM'06) received the B.S. degree from Huazhong University of Science and Technology, Wuhan, China, in 1994, the M.S. degree from Tsinghua University, Beijing, China, in 1997, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, in 2001. From 1997 to 2001, he was a Research Assistant at the Beckman Institute for Advanced Science and Technology, UIUC. During the summers of 1999 and 2000, he was a research intern with Microsoft Research, Redmond, WA. In 2001, he joined the Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, as an Assistant Professor. He is currently an Associate Professor of electrical engineering and computer science at Northwestern University. His current research interests include computer vision, image and video analysis, pattern recognition, machine learning, multimedia data mining, and human-computer interaction. Dr. Wu serves as an associate editor for IEEE TRANSACTIONS ON IMAGE PROCESSING, the SPIE Journal of Electronic Imaging, and the IAPR Journal of Machine Vision and Applications. He received the Robert T. Chien Award at UIUC in 2001 and the National Science Foundation CAREER Award in 2003.