This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON CYBERNETICS
Online State-Based Structured SVM Combined With Incremental PCA for Robust Visual Tracking Yingjie Yin, De Xu, Senior Member, IEEE, Xingang Wang, and Mingran Bai
Abstract—In this paper, we propose a robust state-based structured support vector machine (SVM) tracking algorithm combined with incremental principal component analysis (PCA). Different from the current structured SVM for tracking, our method directly learns and predicts the object's states and not the 2-D translation transformation during tracking. We define the object's virtual state to combine the state-based structured SVM and incremental PCA. The virtual state is considered as the most confident state of the object in every frame. The incremental PCA is used to update the virtual feature vector corresponding to the virtual state and the principal subspace of the object's feature vectors. In order to improve the accuracy of the prediction, all the feature vectors are projected onto the principal subspace in the learning and prediction process of the state-based structured SVM. Experimental results on several challenging video sequences validate the effectiveness and robustness of our approach.

Index Terms—Incremental PCA, object tracking, state space, structured SVM.
I. INTRODUCTION

VISUAL TRACKING is one of the most important problems in computer vision and is widely used in intelligent video surveillance, human–computer interaction, visual navigation, and so on. It is also a challenging problem due to appearance variations, illumination changes, background clutter, occlusions, viewpoint changes, etc. As a result, a robust tracker must be able to learn the appearance model of the object or improve the classifier performance on-the-fly.

Adaptive tracking-by-detection methods are widely used to design robust trackers in computer vision. These trackers can be divided into two categories: 1) generative trackers and 2) discriminative trackers. The incremental visual tracker (IVT) [1] is the most representative generative tracker. The IVT extends the sequential Karhunen–Loeve (SKL) algorithm [2] to propose a new incremental PCA algorithm
Manuscript received March 24, 2014; revised July 29, 2014 and October 10, 2014; accepted October 10, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61227804 and Grant 61421004 and in part by the National Defense Science and Technology Innovation Fund Project of the Chinese Academy of Sciences under Grant CXJJ-14-M08. This paper was recommended by Associate Editor B. Zhang. The authors are with the Research Center of Precision Sensing and Control, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2363078
that correctly updates the mean and the principal subspace of the object's feature vectors. Besides the IVT, Bao et al. [3] described the object appearance using a sparse representation [4]–[6] over a template set and proposed a new minimization model and a much faster numerical solver for the l1 tracker [7], [8]; Kwon and Lee [9], [10] constructed appropriate sampled trackers using the accepted samples and integrated the constructed trackers to track the object. Other generative methods were proposed in [11]–[14]. Although these online generative tracking algorithms have demonstrated success, several problems such as insufficient accepted samples and the drift problem remain to be solved.

Instead of focusing only on the accepted samples of the object, the discriminative tracker exploits both the appearance model of the object and the information from the background. Traditional discriminative trackers treat object tracking as a binary classification problem and aim to find a decision boundary that best separates the object from the background. Grabner et al. [15] and Grabner and Bischof [16] adopted the online AdaBoost classifier [17] to distinguish the object from the background and used the surrounding background as negative examples in the update. Collins et al. [18] proposed a method to adaptively select the color features that best discriminate the object from the current background. Babenko et al. [19], [20] proposed a multiple instance learning method that learns a classifier from training bags for object tracking. Avidan [21] proposed support vector tracking (SVT), which integrates the SVM classifier into an optic-flow-based tracker; the SVT maximizes the SVM classification score instead of minimizing an intensity difference function between successive frames. Hare et al. [22] presented a framework for adaptive visual object tracking based on structured output prediction. Unlike a binary classifier, the classifier in [22] directly outputs the 2-D translation transformation. Other discriminative methods were proposed in [23]–[27].

Another important issue for discriminative trackers is updating the classifier online, because data arrive in a consecutive sequence during tracking. Recently, several successful methods have been proposed to address the online learning of a classifier. Bordes et al. [28] proposed a novel online algorithm, called LASVM, that reaches competitive accuracies after performing a single pass over the training set. Orabona et al. [29] proposed online independent support vector machines to reduce time and space requirements at the price of a negligible loss in accuracy with
2168-2267 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
new observations. Wang et al. [30] proposed an online core vector machine classifier with adaptive minimum-enclosing-ball adjustment to deal with very large data sets efficiently by using an efficient redundant-sample deletion technique. Wang et al. [31] proposed an online SVM based on convex hull vertices selection to update the classifier in less time without reducing its classification performance.

Both generative and discriminative methods may suffer from the drift problem due to incorrect updates of the object's appearance model and the classifier. Some tracking methods therefore combine the generative and discriminative approaches. Zhang et al. [32] proposed an effective and efficient tracking algorithm with an appearance model based on features extracted in the compressed domain. The tracking algorithm is generative in that the object can be well represented by the features extracted in the compressive domain, and it is also discriminative because these features are used to separate the object from the surrounding background via a naive Bayes classifier. Xie et al. [4] combined the discriminative with the generative models and introduced the maximum margin projection into object tracking.

In this paper, we propose a robust tracking algorithm based on the structured SVM and incremental PCA to handle the drift problem. The direct motivation for the state-based structured SVM model and incremental PCA is to combine the discriminative method with the generative method to generate a robust tracker. The state-based structured SVM can be considered the discriminative component because the structured SVM is used to distinguish the object from the background, and the surrounding background is used as negative examples when training the structured SVM. The incremental PCA can be considered the generative component because the incremental PCA is used to update the mean and the principal subspace of the object's feature vectors.
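For concreteness, the generative component can be sketched as follows. This is a simplified SKL-style incremental PCA update in the spirit of [1] and [2]; the function and variable names, the default subspace size, and the omission of a forgetting factor are our own assumptions, not the authors' implementation:

```python
import numpy as np

def incremental_pca_update(mean, U, S, n_seen, new_obs, k=16):
    """Update the mean and principal subspace (U, S) given a block of new
    observations (columns = feature vectors), SKL-style. Simplified sketch:
    no forgetting factor; assumes the residual block has full column rank."""
    m = new_obs.shape[1]
    new_mean = new_obs.mean(axis=1)
    total = n_seen + m
    # Updated overall mean as a weighted combination of old and new means.
    mean_upd = (n_seen * mean + m * new_mean) / total
    # Centered new data plus one correction column accounting for the mean shift.
    correction = np.sqrt(n_seen * m / total) * (mean - new_mean)
    B = np.hstack([new_obs - new_mean[:, None], correction[:, None]])
    # Split B into the part explained by U and the orthogonal residual.
    B_proj = U.T @ B
    B_res = B - U @ B_proj
    Q, _ = np.linalg.qr(B_res)
    # Small matrix whose SVD re-diagonalizes the augmented decomposition.
    R = np.vstack([
        np.hstack([np.diag(S), B_proj]),
        np.hstack([np.zeros((Q.shape[1], S.size)), Q.T @ B_res]),
    ])
    U_r, S_r, _ = np.linalg.svd(R, full_matrices=False)
    U_new = np.hstack([U, Q]) @ U_r
    return mean_upd, U_new[:, :k], S_r[:k], total
```

A single SVD of the small matrix `R` thus updates both the mean and the principal subspace without revisiting earlier frames, which is what makes the update suitable for online tracking.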
Different from the current structured output tracker (Struck) [22], which predicts the 2-D translation transformation during tracking, the proposed state-based structured SVM tracker learns and predicts the object's states directly. In the state-based structured SVM tracker, we introduce the concept of the object's virtual state. With the aid of the virtual state, the state-based structured SVM tracker is combined with the incremental PCA. As in [1], the incremental PCA is used to update the mean and the principal subspace of the object's feature vectors. The virtual state is considered as the most confident state of the object in every frame. The mean updated by the incremental PCA is used as the virtual feature vector corresponding to the virtual state. We make the basic assumption that principal components with larger associated variances represent interesting dynamics, while those with lower variances represent noise [33]. In order to reduce the interference from the noise and improve the accuracy of the prediction, all the feature vectors are projected onto the principal subspace in the learning and prediction process of the state-based structured SVM.

The main contributions of this paper are as follows.

1) It proposes a state-based structured SVM tracking framework to learn and predict the object's states directly. The state provides an important interface for the requirements of the user. If the change of the object in
the frames is mainly its position variation, the state can be described as the object's 2-D position in the frame. If the changes of the object in the frames include variations of scale, position, posture, and so on, the state of the object can be defined as a more sophisticated model.

2) It introduces the concept of the object's virtual state to achieve a natural combination of the discriminative model (structured SVM) and the generative model (incremental PCA).

3) It uses the principal subspace generated by the incremental PCA as the space of the feature vectors to reduce the interference from the noise and improve the accuracy of the prediction.

This paper is structured as follows. Section II describes the state-based structured SVM model, whose online optimization process is presented in Section III. In Section IV, we present the virtual state and the combination of the state-based structured SVM and the incremental PCA. In Section V, the generation of the sampling states and the definition of the loss function are introduced. The summary of the tracking algorithm is presented in Section VI and experimental results are shown in Section VII. Finally, Section VIII is devoted to the conclusions.

II. ONLINE STATE-BASED STRUCTURED SVM MODEL

In the state-based structured SVM model, the object is described by its state. The object's state can be defined as its scale, position, posture, or a more sophisticated model. The state-based structured SVM aims at obtaining an effective discriminant function which is used to find the object's state accurately in every frame during tracking.

A. Discriminant Function

Rather than learning a prediction function to estimate the object's 2-D translation transformation between frames as in [22], we propose learning a prediction function $f: \chi \to S$ to directly estimate the object's state in every frame. $S$ is a state space and $\chi$ is a feature vector set space composed of feature vector sets $\{X_1, X_2, \ldots, X_i, \ldots\}$, where $X_i = \{x_i^{s_i^1}, \ldots, x_i^{s_i^j}, \ldots, x_i^{s_i^m}\}$, $x_i^{s_i^j} \in X_S$, $X_S$ is a feature vector space, and $m$ is the number of feature vectors (also called the sampling number) in each set $X_i$. Each feature vector $x_i^{s_i^j} \in X_S$ corresponds to a state $s_i^j \in S$. The output of the prediction function $f$ is thus the object's state instead of the binary labels $\pm 1$ of a traditional discriminative classifier or the 2-D translation transformation in [22].

A labeled training example (obtained from image sequences) is a pair $(x^s, s)$. A virtual state $s^u$ (introduced in Section IV) is defined in every frame; we assume that the virtual state $s^u$ is unique and cannot be obtained by a limited sampling strategy in every frame. The actual value of the virtual state $s^u$ is not of concern; however, the feature vector $x^{s^u}$ corresponding to the virtual state $s^u$ is very important because it is the link point of the state-based structured SVM and incremental PCA learning. The prediction function $f$ is learned in a
structured output SVM framework [34]–[36]. A discriminant function $F: X_S \to \mathbb{R}$, where $X_S$ is the feature vector space and $\mathbb{R}$ is the real space, is used for prediction according to

$$s_t = f(X) = \arg\max_{s \in \bar{S}} F(x^s) \quad (1)$$

where $X = \{x^{s^1}, \ldots, x^{s^j}, \ldots, x^{s^m}\}$, $X \in \chi$, $\bar{S} = \{s^1, \ldots, s^j, \ldots, s^m\}$, $\bar{S} \subset S$, and $m$ is the sampling number. The discriminant function $F$ is used to measure the degree of similarity between $x^s$ and $x^{s^u}$. A high score is given to an $x^s$ which is close to $x^{s^u}$. The form of the discriminant function $F$ is

$$F(x^s) = \langle w, \Phi(x^s) \rangle \quad (2)$$

where $w$ is the parameter vector and $\Phi(x^s)$ is a kernel map which maps $x^s$ into another suitable feature space and is implicitly defined by a joint kernel function

$$K(x^s, \bar{x}^{\bar{s}}) = \langle \Phi(x^s), \Phi(\bar{x}^{\bar{s}}) \rangle. \quad (3)$$

B. Optimization Problem

The virtual state $s_i^{u_i}$ is considered as the most confident state of the object in every frame. The discriminant function $F$ shown in (2) can be considered an evaluation function for measuring the degree of similarity between a feature vector $x^s$ (corresponding to a sampling state $s$) and the feature vector $x^{s^u}$ (corresponding to the virtual state $s^u$), and it gives a high score to an $x^s$ which is close to $x^{s^u}$. This amounts to a partial order relationship on the elements of the set $X_i = \{x_i^{s_i^1}, \ldots, x_i^{s_i^j}, \ldots, x_i^{s_i^m}, x_i^{s_i^{u_i}}\}$, $X_i \in \chi$, corresponding to the sampling state set $\bar{S}_i = \{s_i^1, \ldots, s_i^j, \ldots, s_i^m\}$. This partial ranking can be expressed by the constraints

$$\forall i,\ \forall s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i:\quad F\big(x_i^{s_i^{u_i}}\big) - F\big(x_i^{s_i}\big) \ge \Delta\big(x_i^{s_i^{u_i}}, x_i^{s_i}\big) \quad (4)$$

where $\Delta(x_i^{s_i^{u_i}}, x_i^{s_i})$ is the loss function between $x_i^{s_i^{u_i}}$ and $x_i^{s_i}$. Slack variables $\xi_i$ are introduced for the potential violation of (4) according to the standard SVM derivation, and the parameter vector $w$ can be calculated by minimizing the convex objective function

$$\min_{w}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i \quad (5)$$
$$\text{s.t.}\quad \forall i:\ \xi_i \ge 0$$
$$\forall i,\ \forall s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i:\ \big\langle w,\ \Phi\big(x_i^{s_i^{u_i}}\big) - \Phi\big(x_i^{s_i}\big) \big\rangle \ge \Delta\big(x_i^{s_i^{u_i}}, x_i^{s_i}\big) - \xi_i$$

where $C$ is a penalty coefficient. The discriminant function $F$ can be obtained by solving the optimization problem (5).

The convex objective function (5) can be converted into its equivalent form by using standard Lagrangian duality techniques, as shown in

$$\max_{\alpha}\ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\ \sum_{s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i}\ \sum_{s_j \ne s_j^{u_j},\ s_j \in \bar{S}_j} \alpha_i^{s_i}\alpha_j^{s_j}\big\langle \delta\Phi_j(s_j), \delta\Phi_i(s_i) \big\rangle + \sum_{i=1}^{n}\ \sum_{s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i} \alpha_i^{s_i}\Delta\big(x_i^{s_i^{u_i}}, x_i^{s_i}\big) \quad (6)$$
$$\text{s.t.}\quad \forall i,\ \forall s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i:\ \alpha_i^{s_i} \ge 0;\qquad \forall i:\ \sum_{s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i} \alpha_i^{s_i} \le C$$

where $\delta\Phi_i(s_i) = \Phi(x_i^{s_i^{u_i}}) - \Phi(x_i^{s_i})$. The discriminant function can be expressed as

$$F(x^s) = \sum_{i=1}^{n}\ \sum_{s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i} \alpha_i^{s_i}\big\langle \delta\Phi_i(s_i), \Phi(x^s) \big\rangle. \quad (7)$$

Let

$$\beta_i^{s_i} = \begin{cases} -\alpha_i^{s_i} & \text{if } s_i \ne s_i^{u_i},\ s_i \in \bar{S}_i \\ \sum_{\bar{s}_i \ne s_i^{u_i},\ \bar{s}_i \in \bar{S}_i} \alpha_i^{\bar{s}_i} & \text{if } s_i = s_i^{u_i} \end{cases} \quad (8)$$

then the dual of the objective function (5) can be simplified to

$$\max_{\beta}\ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\ \sum_{s_i \in \{\bar{S}_i \cup s_i^{u_i}\}}\ \sum_{s_j \in \{\bar{S}_j \cup s_j^{u_j}\}} \beta_i^{s_i}\beta_j^{s_j}\big\langle \Phi\big(x_j^{s_j}\big), \Phi\big(x_i^{s_i}\big) \big\rangle - \sum_{i=1}^{n}\ \sum_{s_i \in \{\bar{S}_i \cup s_i^{u_i}\}} \beta_i^{s_i}\Delta\big(x_i^{s_i^{u_i}}, x_i^{s_i}\big) \quad (9)$$
$$\text{s.t.}\quad \forall i,\ \forall s_i \in \{\bar{S}_i \cup s_i^{u_i}\}:\ \beta_i^{s_i} \le \delta\big(s_i, s_i^{u_i}\big)C;\qquad \forall i:\ \sum_{s_i \in \{\bar{S}_i \cup s_i^{u_i}\}} \beta_i^{s_i} = 0$$

where $\delta(s, \bar{s}) = 1$ if $s = \bar{s}$ and $0$ otherwise. The discriminant function (7) is simplified to the function

$$F(x^s) = \sum_{i=1}^{n}\ \sum_{s_i \in \{\bar{S}_i \cup s_i^{u_i}\}} \beta_i^{s_i}\big\langle \Phi\big(x_i^{s_i}\big), \Phi(x^s) \big\rangle. \quad (10)$$

III. ONLINE OPTIMIZATION

As in the case of classification SVMs, the kernel map $\Phi(x^s)$ in the optimization function (9) and the discriminant function (10) only occurs in the form of inner products $\langle \cdot, \cdot \rangle$, so it can be implicitly defined by a joint kernel function (3). We refer to an $x_i^{s_i}$ for which $\beta_i^{s_i} \ne 0$ as a support vector, and $V^{\mathrm{sup}}$ is the set of all the support vectors.
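To make the role of the support vectors concrete, the kernelized discriminant function (10) and the prediction rule (1) can be sketched as follows; the Gaussian kernel and all names here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Joint kernel K(x, y) = <Phi(x), Phi(y)>; a Gaussian is one common choice.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def discriminant(x, support):
    # F(x): weighted kernel sum over support vectors, cf. (10);
    # `support` is a list of (beta, feature_vector) pairs with beta != 0.
    return sum(beta * gaussian_kernel(sv, x) for beta, sv in support)

def predict_state(states, features, support):
    # Prediction (1): return the sampled state whose feature vector
    # maximizes the discriminant function.
    scores = [discriminant(x, support) for x in features]
    return states[int(np.argmax(scores))]
```

Only the pairs with nonzero coefficients contribute, so the evaluation cost grows with the support vector set rather than with all samples ever seen.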
Algorithm 1 SMOSTEP
Input: a serial number $i$ of a support pattern set, the coefficients $\beta_i^{s^+}, \beta_i^{s^-}$, and the support gradient set $G$ corresponding to the support vector set $V^{\mathrm{sup}}$
Output: the new coefficients $\beta_i^{s^+}, \beta_i^{s^-}$ and the new support gradient set $G$ corresponding to $V^{\mathrm{sup}}$
1. Compute $k_{00} = \langle \Phi(x_i^{s^+}), \Phi(x_i^{s^+}) \rangle$, $k_{11} = \langle \Phi(x_i^{s^-}), \Phi(x_i^{s^-}) \rangle$, $k_{01} = \langle \Phi(x_i^{s^+}), \Phi(x_i^{s^-}) \rangle$.
2. Compute $\lambda^u = \big(g_i(\beta_i^{s^+}) - g_i(\beta_i^{s^-})\big) / (k_{00} + k_{11} - 2k_{01})$.
3. Let $\lambda = \max\big(0, \min\big(\lambda^u,\ C\delta(s^+, s_i^{u_i}) - \beta_i^{s^+}\big)\big)$.
4. Update the coefficients $\beta_i^{s^+} = \beta_i^{s^+} + \lambda$, $\beta_i^{s^-} = \beta_i^{s^-} - \lambda$.
5. Update the support vector set $V^{\mathrm{sup}}$ and the gradient set $G$:
6.   if $\beta_i^{s^+} \ne 0$ then
7.     $V^{\mathrm{sup}} = V^{\mathrm{sup}} \cup x_i^{s^+}$, $G = G \cup g_i(\beta_i^{s^+})$.
8.   end if
9.   if $\beta_i^{s^-} \ne 0$ then
10.    $V^{\mathrm{sup}} = V^{\mathrm{sup}} \cup x_i^{s^-}$, $G = G \cup g_i(\beta_i^{s^-})$.
11.  end if
12. Update the gradients in the gradient set $G$:
13.   for $g_j(\beta_j^{s_j}) \in G$ do
14.     $g_j(\beta_j^{s_j}) = g_j(\beta_j^{s_j}) - \lambda\big(\langle \Phi(x_j^{s_j}), \Phi(x_i^{s^+}) \rangle - \langle \Phi(x_j^{s_j}), \Phi(x_i^{s^-}) \rangle\big)$.
15.   end for
Return the new coefficients $\beta_i^{s^+}, \beta_i^{s^-}$ and the new support gradient set $G$ corresponding to the new support vector set $V^{\mathrm{sup}}$.
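The SMOSTEP update above can be sketched in code as follows; the dictionary-based bookkeeping, the `kernel` argument, and all names are our own assumptions, and this simplified sketch updates all stored gradients rather than maintaining the support vector set explicitly:

```python
def smo_step(i, s_plus, s_minus, beta, grad, x, kernel, C, s_u):
    """One SMOSTEP (cf. Algorithm 1): move beta[i][s_plus] and
    beta[i][s_minus] by opposite amounts, then refresh the gradients."""
    k00 = kernel(x[i][s_plus], x[i][s_plus])
    k11 = kernel(x[i][s_minus], x[i][s_minus])
    k01 = kernel(x[i][s_plus], x[i][s_minus])
    # Unconstrained optimum of the dual along this coordinate pair.
    lam_u = (grad[i][s_plus] - grad[i][s_minus]) / (k00 + k11 - 2.0 * k01)
    # Clip so the box constraint beta[i][s] <= delta(s, s_u[i]) * C holds.
    cap = (C if s_plus == s_u[i] else 0.0) - beta[i][s_plus]
    lam = max(0.0, min(lam_u, cap))
    beta[i][s_plus] += lam
    beta[i][s_minus] -= lam
    # g_j(s) -= lam * (<Phi(x_j^s), Phi(x_i^{s+})> - <Phi(x_j^s), Phi(x_i^{s-})>)
    for j in grad:
        for s in grad[j]:
            grad[j][s] -= lam * (kernel(x[j][s], x[i][s_plus]) -
                                 kernel(x[j][s], x[i][s_minus]))
    return lam
```

Because the two coefficients move by opposite amounts, the equality constraint that the coefficients of each support pattern sum to zero is preserved by construction.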
The set $X_i^{\mathrm{sp}} = \{x_i^{s_i^1}, \ldots, x_i^{s_i^j}, \ldots, x_i^{s_i^m}, x_i^{s_i^{u_i}}\}$ which includes at least one support vector is called a support pattern set. $g_i(\beta_i^{s_i})$ is the gradient of the following objective with respect to the $\beta_i^{s_i}$ of $x_i^{s_i} \in V^{\mathrm{sup}}$, and the support gradient set $G$ is the set of all gradients $g_i(\beta_i^{s_i})$:

$$f(\beta) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\ \sum_{s_i \in \{\bar{S}_i \cup s_i^{u_i}\}}\ \sum_{s_j \in \{\bar{S}_j \cup s_j^{u_j}\}} \beta_i^{s_i}\beta_j^{s_j}\big\langle \Phi\big(x_j^{s_j}\big), \Phi\big(x_i^{s_i}\big) \big\rangle - \sum_{i=1}^{n}\ \sum_{s_i \in \{\bar{S}_i \cup s_i^{u_i}\}} \beta_i^{s_i}\Delta\big(x_i^{s_i^{u_i}}, x_i^{s_i}\big). \quad (11)$$

As in [22], the sequential minimal optimization (SMO) algorithm [34] is used as an elementary step that monotonically improves (9) with respect to a pair of coefficients $\beta_i^{s^+}, \beta_i^{s^-}$. From (8) we get the constraint $\sum_{s_i \in \{\bar{S}_i \cup s_i^{u_i}\}} \beta_i^{s_i} = 0$, so the coefficients $\beta_i^{s^+}, \beta_i^{s^-}$ subject to $\{s^+, s^-\} \subset \{\bar{S}_i \cup s_i^{u_i}\}$ must be modified by opposite amounts, $\beta_i^{s^+} = \beta_i^{s^+} + \lambda$, $\beta_i^{s^-} = \beta_i^{s^-} - \lambda$. The selection strategies of $\beta_i^{s^+}, \beta_i^{s^-}$ in [22], [30], and [32] are used for the elementary SMOSTEP shown in Algorithm 1.

PROCESSNEW selects the coefficient $\beta_i^{s_i^{u_i}}$ of the virtual feature vector $x_i^{s_i^{u_i}}$ in the new frame as $\beta_i^{s^+}$, and $\beta_i^{s^-}$ is found according to $\beta_i^{s^-} = \arg\min_{\beta_i^{s_i},\ s_i \in \{\bar{S}_i \cup s_i^{u_i}\}} g_i(\beta_i^{s_i})$. PROCESSOLD selects $\beta_i^{s^+} = \arg\max