Object Tracking via Temporal Consistency Dictionary Learning

Xu Cheng, Member, IEEE, Yifeng Zhang, Member, IEEE, Jinshi Cui, and Lin Zhou
Abstract—Sparse representation-based methods have been successfully applied to visual tracking. However, complex and inefficient optimization limits their deployment in practical tracking scenarios. In this paper, we propose a temporal consistency dictionary learning tracking algorithm to enable efficient dictionary learning and tracking execution. First, we present an objective function that introduces a fixed dictionary and a variance dictionary to reconstruct the object's appearance. In particular, the proposed method takes temporal consistency into account by adding a regularization term to the objective function that constrains the difference of object appearance between adjacent frames. The resulting optimization problem is solved in an iterative manner. Moreover, the proposed method encodes the object's local structural information, and the local patches from the same candidate are combined into a global appearance representation. Second, we develop an effective observation likelihood function based on the proposed model. It takes the influence of patches with large reconstruction errors into consideration, thereby alleviating drift of the object. Finally, we present an appearance updating strategy that adapts to the object's appearance variations through online dictionary learning. Experimental evaluations on the TB50 and TB100 datasets show that the proposed tracking method outperforms sparse representation-based trackers as well as other state-of-the-art tracking methods.

Index Terms—Dictionary learning, object tracking, sparse representation, surveillance, temporal consistency.
I. INTRODUCTION

Object tracking plays an important role in the computer vision field, and numerous tracking approaches have been proposed in the past few decades with demonstrated success. Some focus on generative models [1]–[3], some turn to discriminative models [4]–[10], and some combine both generative and discriminative models [11], [12].
Manuscript received July 7, 2016; accepted October 1, 2016. Date of publication November 4, 2016; date of current version March 24, 2017. This work was supported in part by the Natural Science Foundation of Jiangsu Province under Grant BK20151102, in part by the Ministry of Education Key Laboratory of Machine Perception, Peking University under Grant K-2016-03, and in part by the Natural Science Foundation of China under Grant 61201345 and Grant 61571106. This paper was recommended by Associate Editor K. Huang. (Corresponding author: Xu Cheng.)

X. Cheng is with the Nanjing Marine Radar Institute, Nanjing 210003, China, and also with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: [email protected]).

Y. Zhang and L. Zhou are with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: [email protected]; [email protected]).

J. Cui is with the Ministry of Education Key Laboratory of Machine Perception, Peking University, Beijing 100871, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMC.2016.2618749
Generative model-based methods formulate object tracking as a template matching problem, searching for the image regions most similar to the object template. Ross et al. [1] proposed an incremental subspace learning scheme that is robust to illumination variations (IVs). Kwon and Lee [2] decomposed the object state into multiple observation models and motion models to capture a wide range of appearance changes. Wang et al. [3] proposed a probability continuous outlier model to cope with partial occlusion (OCC) via a holistic object template. However, these methods do not make full use of the surrounding background information to track the object, so tracking performance degrades.

In discriminative models, object tracking is cast as a binary classification problem. Avidan [4] combined a set of online learned weak classifiers into a strong one to separate the object from the background. Recently, machine learning-based methods have also been proposed for visual tracking, such as the online boosting tracker [5], the tree structure model [6], multiple-instance learning (MIL) [7], [8], tracking-learning-detection (TLD) [9], and transfer learning [10], [13]. This mechanism is shown to work well in some challenging scenes. Zhang et al. [14] exploited a mixture of experts based on an entropy minimization criterion to fuse the tracking outputs of multiple classifiers. Hare et al. [15] proposed a joint structured output learning scheme to estimate the object state based on the spatial distribution of samples. In addition, strategies that combine generative and discriminative models are used complementarily for different scenarios [11], [12]. Cui et al. [18] integrated low- and high-dimensional trackers to overcome their respective weaknesses.

The contributions of this paper can be summarized as follows. First, we formulate object tracking as a temporal consistency dictionary learning (TCDL) optimization problem. In particular, the temporal consistency scheme is incorporated into the objective function by using an l2 norm to constrain the difference of object appearance between two consecutive frames, which effectively improves the tracking performance. This optimization problem is then solved using an iterative algorithm. Second, the object's appearance, capturing both local and global information of the object, is reconstructed based on a fixed dictionary and a variance dictionary. This scheme is discriminative and helps in dealing with distracters present in challenging scenarios. In addition, a novel updating strategy is presented: the initial object information from the first frame is incorporated into the template updating process.
Fig. 1. Flowchart of the proposed tracking algorithm.
The variance dictionary is learned online based on a cumulative probability scheme. Third, an observation model is developed for accurate object localization. Finally, experimental results show that the proposed tracker performs favorably against state-of-the-art trackers on the TB50 [16] and TB100 [17] benchmark datasets.

The remainder of this paper is organized as follows. In Section II, work related to our tracking algorithm is reviewed. Section III introduces the details of the proposed tracking algorithm. Detailed experimental comparisons between the proposed tracker and state-of-the-art trackers are reported in Section IV, and we conclude this paper in Section V.

II. RELATED WORKS

In this section, we discuss the tracking methods most relevant to this paper and analyze our advantages over them. Readers are referred to [19] for more details.

Recently, sparsity-based tracking algorithms have been successfully applied to visual tracking [20]. The object appearance is sparsely represented by a linear combination of object templates and trivial templates. Later, several methods improved the efficiency of the original l1 tracker by using the accelerated proximal gradient method [21]. Wang and Lu [22] exploited the PCA subspace to represent the object appearance under the l1 framework. Liu et al. [23] employed the histograms of sparse coding of patches to learn a compact dictionary for object tracking. Zhong et al. [11] proposed a sparsity-based collaborative model and used sparse linear regression to select discriminative features that can distinguish the object from cluttered background. Zhang et al. [24] constructed a sparse measurement matrix that satisfies the restricted isometry property in compressive sensing to represent the object appearance. Wang et al. [25] proposed a non-negative dictionary learning strategy for updating the object templates via the Huber loss function. Jia et al. [26] presented an effective alignment-pooling scheme to represent the object appearance. However, the above-mentioned methods do not consider the relationship between candidate states. In [27], object tracking is regarded as
a multitask learning problem, and the correlations between candidates are mined by using joint group sparsity. A similar method is presented in [28], in which tracking is cast as a low-rank matrix learning problem: during tracking, good candidates resemble each other, while the appearances of bad candidates are encouraged to be diverse through mixed-norm constraints. In [29], a discriminative reverse sparse tracker via weighted multitask learning is proposed. Different from [27], the templates are reversely represented via the candidates, and the representation of each positive template is viewed as a single task. In [30], a least soft-threshold squares method is proposed to handle appearance changes and outliers simultaneously with a Gaussian-Laplacian distribution. Wang et al. [31] proposed an inverse sparse representation tracking method with a locally weighted distance metric. Xie et al. [32] proposed a robust tracking algorithm based on local sparse coding with discriminative dictionary learning and a keypoint matching scheme, which is more robust than traditional methods at rejecting incorrect matches and eliminating outliers. In [33], a temporally restricted reverse-low-rank learning algorithm is proposed for visual tracking, which exploits the low-rank structure among consecutive target observations and enforces the temporal consistency of the target at a global level.
III. PROPOSED TRACKING METHOD

The original l1 tracker regards tracking as finding a sparse representation over the dictionary templates; the representation is then used to track the object under the particle filter framework. However, it is prone to failure in challenging scenarios such as full OCC, deformation (DEF), and IVs, because the l0- or l1-norm is used to solve for the sparse representation coefficients. In other words, l0- or l1-norm sparsity regularization makes the tracking inefficient and incurs a heavy computational burden. In the following, we first formulate our problem and then introduce our tracking algorithm. An overview of the proposed algorithm is shown in Fig. 1.
A. Problem Formulation
Inspired by [20], our aim is to learn a compact dictionary to distinguish the object from cluttered background. The original l1 tracking problem is formulated by solving the l1 minimization

$$\mathbf{s} = \arg\min_{\mathbf{s}} \|x - D\mathbf{s}\|_2^2 + \lambda \|\mathbf{s}\|_1 \tag{1}$$

where $\lambda$ is a parameter to balance the reconstruction error and sparsity, $D \in \mathbb{R}^{d \times l}$ is a dictionary, and $\mathbf{s} \in \mathbb{R}^{l}$ denotes the reconstruction coefficients of the object state $x$. However, the dictionary in the original l1 tracker is not very discriminative when the object suffers from OCC. To represent the object's variation under OCC, illumination changes, motion blur (MB), etc., we construct a variance dictionary $V$. First, we obtain the tracking results of the first ten frames of a video by running the IVT tracker [1] and normalize them to 32 × 32 pixel images. Then we apply an overlapped sliding window with a shift of 8 pixels on each normalized image with a spatial layout to obtain a set of overlapped local image patches. Each patch is characterized as a column feature vector and used to encode a part of the object. All these patches form the original variance dictionary $V = [V_1, V_2, \ldots, V_K] \in \mathbb{R}^{d \times (10 \times K)}$, which encodes the structural local information of the object. The $i$th patches from the different tracking results are grouped as $V_i \in \mathbb{R}^{d \times 10}$ $(i = 1, 2, \ldots, K)$, where $d$ denotes the dimension of one patch vector and $K$ is the number of patches within one object state. Different from the variance dictionary, we randomly sample $n$ patches of 16 × 16 pixels from different object locations in the normalized tracking results of the first ten frames. All these patches form a fixed dictionary $D \in \mathbb{R}^{d \times n}$ and therefore contain complex variations of the object. The fixed dictionary is not updated during tracking. With the help of $D$ and $V_i$, the $i$th patch of a candidate $y_i \in \mathbb{R}^{d}$ can be reconstructed by

$$y_i = D\alpha_i + V_i\beta_i + E_i \tag{2}$$

where $\alpha_i$, $\beta_i$, and $E_i$ are the corresponding reconstruction coefficients and errors, respectively.

Moreover, temporal consistency is an important factor for visual tracking. The object appearance exhibits visual consistency during tracking, which implies that the representation of the object appearance should not change too much between two adjacent frames. Therefore, we incorporate this characteristic into the optimization function by enforcing an l2-norm constraint on the appearance difference. During tracking, the patches of candidates should ideally be reconstructed by the corresponding fixed dictionary and variance dictionary. First, it is desirable that nonzero coefficients only appear at the places corresponding to the object these patches belong to. Second, the reconstruction errors should be as small as possible. In addition, considering the temporal consistency that widely exists during tracking, we add a regularization term to the objective function to constrain the appearance difference between two consecutive frames. Therefore, the optimization problem can be written as

$$\min_{V_i^t, \alpha_i, \beta_i} \sum_{i=1}^{K} \left( \left\| y_i - D\alpha_i - V_i^t \beta_i \right\|_F^2 + \lambda_1 \left\| \begin{bmatrix} \alpha_i \\ \beta_i \end{bmatrix} \right\|_1 \right) + \gamma \left\| (V^t)^T D \right\|_F^2 + \eta \sum_{i=1}^{K} \left\| V_i^t - V_i^{t-1} \right\|_F^2 \quad \text{s.t. } \left\| V^t(:, j) \right\|_2 = 1, \ \forall j \tag{3}$$

where $\lambda_1$ and $\gamma$ denote the corresponding parameters and $\eta$ is a positive regularization parameter. The first two terms mean that each patch of the object should be sparsely reconstructed by the fixed dictionary and variance dictionary. The term $\|(V^t)^T D\|_F^2$ means that the variance dictionary $V$ should be incoherent with $D$. For each candidate, after obtaining the reconstruction coefficients of each patch, we concatenate the coefficients of all the patches to represent the candidate's appearance.
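For concreteness, a minimal Python sketch of the dictionary construction described above is given below. It is illustrative only and not the implementation used in our experiments; in particular, the 16 × 16 size of the sliding-window patches is an assumption (the text specifies only the 8-pixel shift and the 16 × 16 size of the fixed-dictionary patches), and the helper names are hypothetical.

```python
# Minimal sketch (not the authors' code) of building the fixed and variance dictionaries
# from the first ten normalized tracking results: overlapping 16x16 patches on an 8-pixel
# grid for the variance dictionary (3 x 3 = 9 patches per 32 x 32 image, matching K = 9),
# and randomly located 16x16 patches for the fixed dictionary D.
import numpy as np

def extract_patch_grid(image, patch=16, shift=8):
    """Return a (d, K) matrix of vectorized overlapping patches from a normalized image."""
    cols = []
    for y in range(0, image.shape[0] - patch + 1, shift):
        for x in range(0, image.shape[1] - patch + 1, shift):
            cols.append(image[y:y + patch, x:x + patch].ravel())
    return np.stack(cols, axis=1)

def build_dictionaries(frames, n_fixed=50, patch=16, rng=np.random.default_rng(0)):
    """frames: list of ten 32x32 normalized tracking results (numpy arrays)."""
    grids = [extract_patch_grid(f, patch) for f in frames]           # one (d, K) grid per frame
    K = grids[0].shape[1]
    # Variance dictionary: group the i-th patch of every frame into V_i (d x 10).
    V = [np.stack([g[:, i] for g in grids], axis=1) for i in range(K)]
    # Fixed dictionary: n randomly located patches from the same frames.
    D = []
    for _ in range(n_fixed):
        f = frames[rng.integers(len(frames))]
        y, x = rng.integers(0, f.shape[0] - patch + 1, size=2)
        D.append(f[y:y + patch, x:x + patch].ravel())
    D = np.stack(D, axis=1)
    # Normalize columns to unit length, consistent with the constraint in (3).
    return D / np.linalg.norm(D, axis=0), [Vi / np.linalg.norm(Vi, axis=0) for Vi in V]
```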
B. Optimization Scheme

The objective function in (3) is not convex; the task is to optimize it with respect to both the sparse coefficients and the variance dictionary. When $V$ is fixed, (3) reduces to a standard l1-regularized problem, and we use the Lasso method to obtain the optimal sparse coefficients $\alpha$ and $\beta$. When $\alpha$ and $\beta$ are fixed, the objective function in (3) can be modified as

$$\min_{V^t} \sum_{i=1}^{K} \left\| c_i - V_i^t \beta_i \right\|_F^2 + \gamma \left\| (V^t)^T D \right\|_F^2 + \eta \sum_{i=1}^{K} \left\| V_i^t - V_i^{t-1} \right\|_F^2 \quad \text{s.t. } \left\| V^t(:, j) \right\|_2 = 1, \ \forall j \tag{4}$$

where $c_i = y_i - D\alpha_i$. Each column of $V$ is updated following the work of Kong and Wang [34]: all the remaining columns are fixed while a given column is updated. Therefore, by setting the derivative of (4) to zero, we obtain the update

$$V(:, j) = \left( \gamma D D^T + \|\beta(j, :)\|_2^2 I + \eta I \right)^{-1} \left( c_j \beta(j, :)^T + \eta V^{t-1}(:, j) \right). \tag{5}$$

Therefore, we alternately optimize $\alpha$, $\beta$, and $V$ during the tracking. The iteration process is terminated once a stopping criterion is satisfied, such as a maximal number of iterations or a small enough difference of the objective values between consecutive steps. The solution of the objective function in (3) converges after no more than six iterations. However, to improve the tracking efficiency, we only update one subdictionary at each frame based on a random probability scheme, which will be introduced in Section III-C.
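The following Python sketch illustrates one alternating-optimization step. It is not our implementation: scikit-learn's Lasso stands in for the Lasso solver (its penalty scaling differs slightly from $\lambda_1$ in (3)), and the residual/coefficient layout of the column update is an assumption made for the sketch.

```python
# Minimal sketch (not the authors' code) of one alternating step from Section III-B:
# with V fixed, each patch is coded over the joint dictionary [D, V_i]; with the codes
# fixed, a dictionary column is refreshed with the closed-form update (5) and then
# renormalized to unit length as required by the constraint in (4).
import numpy as np
from sklearn.linear_model import Lasso

def code_patch(y_i, D, V_i, lambda_1=0.05):
    """Sparse coefficients (alpha_i, beta_i) of one patch over the joint dictionary."""
    B = np.hstack([D, V_i])
    lasso = Lasso(alpha=lambda_1, fit_intercept=False, max_iter=2000)
    lasso.fit(B, y_i)
    return lasso.coef_[:D.shape[1]], lasso.coef_[D.shape[1]:]

def update_column(j, C, B, D, V_prev, gamma=0.1, eta=0.8):
    """Closed-form update of column j of a variance (sub)dictionary, cf. (5).

    C: (d, N) residuals c_i = y_i - D @ alpha_i stacked as columns (assumed layout);
    B: (m, N) beta coefficients, with row j multiplying column j of V."""
    d = D.shape[0]
    b_j = B[j, :]
    A = gamma * (D @ D.T) + (b_j @ b_j) * np.eye(d) + eta * np.eye(d)
    v = np.linalg.solve(A, C @ b_j + eta * V_prev[:, j])
    return v / (np.linalg.norm(v) + 1e-12)     # enforce ||V(:, j)||_2 = 1
```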
C. Update Strategy

Tracking with a fixed template is prone to failure, yet frequent and straightforward updates may gradually degrade the tracking performance. In this paper, we update the object template once every several frames.

For the object template updating, to reduce object drift, the object template from the first frame is retained and incorporated into the updating process. Thus, the new template $T_t$ is computed by

$$T_t = u T_1 + (1 - u) T_p \tag{6}$$

where $T_t \in \mathbb{R}^{(10 \times K)}$ is the updated object template at frame $t$; $T_1 \in \mathbb{R}^{(10 \times K)}$ denotes the initial object template, which concatenates the sparse coefficients of all the patches within the object in the first frame; $T_p \in \mathbb{R}^{(10 \times K)}$ is the last updated template; and $u$ is a constant factor. A patch is not updated if it is corrupted.
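A minimal sketch of the template update in (6) follows. It is illustrative only; the boolean `corrupted` mask is an assumption used to express "a patch is not updated if it is corrupted," and the array layout is hypothetical.

```python
# Minimal sketch (not the authors' code) of the template update in (6): a convex blend
# of the first-frame template T1 and the last updated template Tp, applied only to
# patches that are not flagged as corrupted by the reconstruction-error test in (10).
import numpy as np

def update_template(T1, Tp, corrupted, u=0.9):
    """T1, Tp: (P, K) sparse-coefficient templates, one column per patch.
    corrupted: (K,) boolean flags for corrupted patches."""
    T_new = u * T1 + (1.0 - u) * Tp          # blend toward the first-frame template
    T_new[:, corrupted] = Tp[:, corrupted]   # corrupted patches are left unchanged
    return T_new
```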
For the variance dictionary updating, to reduce the computational complexity, we propose an updating scheme that assigns a different probability to each subdictionary $V_i$ $(i = 1, 2, \ldots, K)$. A cumulative probability sequence $R$ is generated by

$$R = \left\{ 0,\ \frac{2^{1}}{\sum_{k=1}^{K} 2^{k-1} - 1},\ \frac{\sum_{k=1}^{2} 2^{k}}{\sum_{k=1}^{K} 2^{k-1} - 1},\ \ldots,\ 1 \right\}. \tag{7}$$

Then, a random number $r$ is generated from the uniform distribution on the unit interval [0, 1]. By determining which interval of $R$ the random number lies in, the corresponding subdictionary is selected for updating. The subdictionaries whose cumulative probabilities are greater than $r$ are used to reconstruct the corresponding patches of the object. Then, we exploit (5) to update the subdictionary that has the smallest reconstruction error for its patch, while the other subdictionaries are not updated.
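The selection mechanism can be sketched as follows. This is not our implementation: the geometric-style weights below are only one plausible reading of (7); what the sketch reproduces is the mechanism described in the text (cumulative sequence $R$, uniform draw $r$, interval lookup).

```python
# Minimal sketch (not the authors' code) of the random, probability-driven choice of
# which subdictionary to refresh (Section III-C).
import numpy as np

def build_cumulative_sequence(K):
    """Return R = [0, r_1, ..., r_{K-1}, 1], a nondecreasing cumulative sequence."""
    weights = 2.0 ** np.arange(K)          # assumed per-subdictionary weights
    return np.concatenate(([0.0], np.cumsum(weights) / weights.sum()))

def pick_subdictionary(R, rng):
    """Draw r ~ U[0, 1] and return the index of the interval [R[i], R[i+1]) containing it."""
    r = rng.uniform(0.0, 1.0)
    return int(np.searchsorted(R, r, side="right") - 1)

rng = np.random.default_rng(0)
R = build_cumulative_sequence(K=9)
j = pick_subdictionary(R, rng)             # index of the subdictionary to update via (5)
print(R.round(3), j)
```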
D. Tracking Procedure

Object tracking is implemented within the particle filter framework. Given the observation set $O_{1:t} = \{o_1, o_2, \ldots, o_t\}$, the state $x$ can be inferred by the Bayesian posterior estimation

$$p(x_t \mid O_{1:t}) \propto p(o_t \mid x_t) \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid O_{1:t-1})\, dx_{t-1} \tag{8}$$

where $p(x_t \mid x_{t-1})$ and $p(o_t \mid x_t)$ denote the dynamic model and the observation model, respectively.

The dynamic model describes the temporal correlation between two consecutive frames. We use an affine transformation with six parameters to represent the object's motion state $x_t$, namely the $x$, $y$ translations, rotation angle, scale, aspect ratio, and skew. During the tracking, the object's candidates are generated from a Gaussian distribution $p(x_t \mid x_{t-1}) = N(x_t; x_{t-1}, \Sigma)$, where $\Sigma$ is a diagonal covariance matrix.

The observation model plays an important role in visual tracking. The formulation in this paper is

$$p\left(o_t \mid x_t^{p}\right) \propto \exp\left( - \sum_{k=1}^{K} \left\| w_k \odot \left( \tilde{T}_t^{k} - T_t^{p,k} \right) \right\|^{2} \right) \tag{9}$$

where $\tilde{T}_t^{k}$ is the $k$th patch's object template at frame $t$; $T_t^{p,k}$ denotes the sparse coefficients of the $k$th patch of the $p$th candidate sample at frame $t$; $\odot$ denotes element-wise multiplication; and $w_k$ $(k = 1, 2, \ldots, K)$ is an indicator vector defined as

$$w_k = \begin{cases} 1, & \varepsilon_k < \varepsilon_0 \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

where $\varepsilon_k$ is the reconstruction error of the $k$th patch and $\varepsilon_0$ denotes a threshold that determines whether the patch is corrupted. A patch with a large error is considered noise and the corresponding weight vector $w_k$ is set to zero. Finally, the tracking result is obtained based on the majority votes of all the patches. The entire tracking algorithm is summarized in Algorithm 1.

Algorithm 1 TCDL Tracking Algorithm
Input: Video frames {I_t}, t = 1, 2, ...
Output: Tracking results {S_t}, t = 1, 2, ...
1  Initialization:
2    Model the object template in the first frame;
3    Construct the fixed dictionary and the variance dictionary using the first ten frames of the video;
4  Tracking:
5  for t = 11 : end of the video
6    Sample candidates based on the tracking result S_{t-1} of the last frame;
7    Solve the sparse coefficients of all the patches of the candidates;
8    Update the variance dictionary according to (5);
9    Compute the observation likelihood score of each candidate by (9);
10   The candidate with the largest observation score is taken as the tracking result;
11   if (a certain frame interval is reached && occlusion does not occur)
12     Update the object template by (6);
13   end if
14 end for
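One tracking step can be sketched as below. This is illustrative only, not our implementation: `code_candidate_patches` is a hypothetical placeholder for the per-patch sparse coding of Sections III-A and III-B, and all parameter values and array shapes are assumptions.

```python
# Minimal sketch (not the authors' code) of one tracking step: candidates are drawn from
# a diagonal-Gaussian dynamic model around the previous affine state, and each candidate
# is scored with the masked likelihood of (9)-(10).
import numpy as np

def sample_candidates(x_prev, sigma, n_particles, rng):
    """x_prev: (6,) affine state; sigma: (6,) diagonal standard deviations."""
    return x_prev + rng.standard_normal((n_particles, 6)) * sigma

def likelihood(T_tilde, T_cand, errors, eps0=0.5):
    """T_tilde, T_cand: (P, K) template / candidate coefficients; errors: (K,) patch errors."""
    w = (errors < eps0).astype(float)                      # indicator weights from (10)
    diff = (T_tilde - T_cand) * w                          # mask corrupted patches, as in (9)
    return np.exp(-np.sum(diff ** 2))

def code_candidate_patches(frame, state, P=60, K=9, rng=np.random.default_rng(1)):
    """Placeholder for the per-patch sparse coding (returns dummy data for this sketch)."""
    return rng.standard_normal((P, K)), rng.uniform(0, 1, K)

def track_step(x_prev, T_tilde, frame, sigma, rng, n_particles=500):
    states = sample_candidates(x_prev, sigma, n_particles, rng)
    scores = [likelihood(T_tilde, *code_candidate_patches(frame, s)[:1],
                         code_candidate_patches(frame, s)[1]) for s in states]
    best = int(np.argmax(scores))
    return states[best], scores[best]
```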
IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Parameter Settings

Our tracker is implemented in MATLAB 2012b and runs at about 4.5 frames/s on a PC with an Intel i7-3770 CPU (3.4 GHz) and 32 GB memory. To make a fair comparison, performance evaluations are carried out on the TB50 [16] and TB100 [17] benchmark datasets, which cover almost all the challenging scenarios. In our experiments, the parameters are empirically set as follows. The size of the warped image is set to 32 × 32 pixels. The learning rate u in (6) is set to 0.9 and the template is updated every 8 frames. Five hundred particles are used. λ1, γ, η, and ε0 are set to 0.05, 0.1, 0.8, and 0.5, respectively. K is 9 in this paper; the maximal number of iterations and the tolerance on the difference of the objective function values between consecutive steps are 6 and 0.1, respectively. All the experimental parameters are fixed during the tracking.

1) Baselines: We evaluate our tracker against popular state-of-the-art trackers, including weighted MIL (WMIL) [8], TLD [9], the sparse collaborative model (SCM) [11], the TGPR tracker [13], the MEEM tracker [14], the Struck tracker [15], accelerated proximal gradient l1 (APGL1) [21], the multitask learning tracker [27], the compressive tracker (CT) [24], adaptive structural local sparse appearance (ASLA) [26], the context tracker (CXT) [35], and kernelized correlation filters (KCFs) [36]. The source codes of these trackers are kindly released by the authors. For all the comparison methods, we use the default parameters given in the codes to obtain the tracking results.
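The parameter values listed above can be collected into a single configuration, as in the sketch below; the key names are illustrative and are not taken from our code.

```python
# Parameter settings reported in Section IV-A, gathered into one configuration dict.
TCDL_CONFIG = {
    "warped_image_size": (32, 32),   # normalized object size in pixels
    "learning_rate_u": 0.9,          # template blending factor in (6)
    "template_update_interval": 8,   # frames between template updates
    "n_particles": 500,              # particle filter samples per frame
    "lambda_1": 0.05,                # sparsity weight in (3)
    "gamma": 0.1,                    # incoherence weight in (3)
    "eta": 0.8,                      # temporal consistency weight in (3)
    "epsilon_0": 0.5,                # corrupted-patch threshold in (10)
    "K": 9,                          # number of local patches per object state
    "max_iterations": 6,             # alternating optimization cap
    "objective_tolerance": 0.1,      # stopping tolerance on objective change
}
```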
2) Evaluation Methodology: We evaluate the performance of the different trackers both qualitatively and quantitatively. Video sequences in the TB50 and TB100 benchmarks are annotated with 11 challenging attributes, which include OCC, IV, scale variation (SV), fast motion (FM), MB, low resolution (LR), DEF, background clutter (BC), in-plane rotation (IPR), out-of-plane rotation (OPR), and out-of-view (OV). For quantitative evaluation, we use tracking precision (TP) and tracking success rate (TSR) to measure the performance of the trackers. We first define the center location error (CLE), the Euclidean distance between the center location of the tracked object and the ground truth. TP denotes the percentage of frames whose CLE is smaller than a certain threshold. The other criterion, TSR, is based on the overlap rate between the estimated object location and the ground truth. For qualitative evaluation, tracking results of some key frames for the different tracking algorithms are selected to validate the superiority of our tracking algorithm.

B. Qualitative Comparison With Existing Methods

In this section, we qualitatively evaluate our tracker on sequences with the 11 challenging attributes. Only some representative sequences are presented due to limited space. We show the results of seven popular trackers in Figs. 2 and 3 for clarity.

Fig. 2. Some representative results of selected sequences on different attributes are shown to demonstrate the effectiveness of our tracker. (a) Faceocc1. (b) DavidIndoor. (c) CarScale. (d) Jumping. (e) Tiger2. (f) Singer1. (g) Deer. (h) Shaking.

Fig. 3. Some representative results of selected sequences on different attributes are shown to demonstrate the effectiveness of our tracker. (a) Board. (b) Bird2. (c) Skating1. (d) Liquor. (e) Jogging. (f) Skiing.

1) Occlusion: Partial or full OCC is one of the most challenging tasks in the visual tracking field. In the Faceocc1 sequence, all the comparison trackers can track the face accurately due to the static object and background. In the Jogging sequence, heavy OCC causes most tracking methods to fail. In our tracker, the object appearance model pauses updating when the object is occluded; when the object reappears, the TCDL tracker is able to recover it even after a period of OCC.

2) Illumination and Scale Variations: Illumination and SVs are the most familiar attributes during tracking. When illumination or SV occurs, the resulting appearance changes make the object difficult to follow. In the Singer1 sequence, illumination varies frequently from frames 67 to 90. Most trackers fail at the beginning, while TCDL, MEEM, KCF, and SCM perform well. The results on the CarScale sequence show the superiority of the TCDL tracker: TCDL obtains an accurate scale, while the others fail before frame 171.

3) Cluttered Background: Here, background regions have a similar appearance or texture to the object. The similar areas become distracters, which may lead the tracker to follow an object with a similar appearance. In the Liquor sequence, several liquor bottles with similar color, texture, and shape appear throughout the sequence. The APGL1 tracker fails at frame 212 due to frequent disturbances. Color attribute-based trackers hardly locate the object, and distracters with similar colors nearby defeat them. The SCM and TCDL trackers keep tracking the object.

4) Fast Motion and Motion Blur: Figs. 2 and 3 report representative tracking results. It can be seen that the TCDL tracker and some recent trackers keep in touch with the object of interest all the time. In the Deer sequence, the LR and MB caused by FM make it difficult to extract the object's features. In the Jumping sequence, some trackers lose the object and recapture it only by chance after failure, due to the repetitive motion of the object. However, the TCDL tracker performs well in terms of accuracy. The selected results demonstrate the effectiveness of our tracker under the FM and MB scenarios.

5) Deformation: In the benchmarks [16], Bird2, Bolt, Basketball, Jogging, DavidIndoor, Skating1, Tiger1, Tiger2, Walking, and other sequences are labeled with the DEF attribute. In Figs. 2 and 3, we can see that some comparison methods track the object only with a major error under DEF, while the proposed TCDL tracker works well when DEF occurs.

6) Rotation: In-plane and out-of-plane rotation is also a very important challenge. When an object rotates, we assume that the object's motion between two consecutive frames will not change much. Fig. 3(a) and (b) reports tracking results of selected sequences with the rotation attribute. In the Board sequence, the object undergoes IPR, and only the SCM and TCDL trackers can catch up with the object. The other trackers fail after frame 39.
Fig. 4. Average precision plots and success plots for the OPE, TRE, and SRE on the TB50 dataset. The performance score for the precision plot is reported at a threshold of 20 pixels, while the performance score for the success plot is the area under curve (AUC) score.
7) Low Resolution: The Skiing sequence is shown in Fig. 3(f). Compared with the other trackers, our tracker can track the object accurately throughout the whole sequence.

C. Quantitative Comparison With Existing Methods

The overall quantitative comparison results are reported in Figs. 4 and 5. We evaluate the tracking algorithms using three protocols: 1) one-pass evaluation (OPE); 2) temporal robustness evaluation (TRE); and 3) spatial robustness evaluation (SRE), measured by precision and success rates. For conciseness, only the top nine trackers with respect to the ranking score are shown in each plot. In the one-pass precision and success plots, the precision (or success) score shown in the square brackets is based on a precision threshold of 20 pixels (success rate threshold of 0.6). The TCDL tracker achieves about 68.6% precision and 52.9% success rate, respectively. In addition, the TRE and SRE evaluation schemes also reflect the merits of the proposed method. Overall, the TCDL tracker achieves competitive performance in terms of accuracy.

Fig. 5. Average precision plots and success plots for the OPE, TRE, and SRE on the TB100 dataset. The performance score for the precision plot is reported at a threshold of 20 pixels, while the performance score for the success plot is the AUC score.

Moreover, we report evaluation results for the 11 challenging attributes in Tables I and II. These attributes are useful for evaluating the tracking performance in different aspects. We can see that the KCF tracker works well with overall success in OCC (51.4%), illumination (49.3%), FM (45.9%), and DEF (53.4%), while our tracker achieves success rates of 61.4%, 49.5%, 47.4%, and 54.1%, respectively. In terms of LR, the Struck tracker achieves a precision of 54.5%, while the TCDL tracker works well with a precision of 58.1%. However, we also note that the TCDL tracker has some failure cases in handling the IPR and OPR challenges and achieves a lower precision (Table I).
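For reference, a minimal sketch of the CLE, TP, and TSR measures used in the quantitative evaluation is given below. It is not the benchmark evaluation code; boxes are assumed to be given as (x, y, w, h) arrays, one row per frame, and the thresholds are those stated in the text.

```python
# Minimal sketch (not the authors' evaluation code) of the metrics described in the
# evaluation methodology: center location error (CLE), tracking precision (TP) at a
# 20-pixel threshold, and tracking success rate (TSR) from the bounding-box overlap.
import numpy as np

def center_location_error(pred, gt):
    pc = pred[:, :2] + pred[:, 2:] / 2.0       # predicted box centers
    gc = gt[:, :2] + gt[:, 2:] / 2.0           # ground-truth box centers
    return np.linalg.norm(pc - gc, axis=1)     # per-frame CLE

def precision(pred, gt, threshold=20.0):
    """TP: fraction of frames whose CLE is below the threshold."""
    return float(np.mean(center_location_error(pred, gt) < threshold))

def overlap(pred, gt):
    """Per-frame intersection-over-union of the two boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0]);  y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def success_rate(pred, gt, threshold=0.6):
    """TSR: fraction of frames whose overlap exceeds the threshold."""
    return float(np.mean(overlap(pred, gt) > threshold))
```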
TABLE I. Average TP scores on different attributes: OCC, IV, SV, FM, MB, DEF, BC, LR, IPR, OPR, and OV. The bold red and bold blue fonts denote the best and the second best results, respectively.

TABLE II. Average TSR scores on different attributes: OCC, IV, SV, FM, MB, DEF, BC, LR, IPR, OPR, and OV. The bold red and bold blue fonts denote the best and the second best results, respectively.
Fig. 6. Comparisons of baseline methods on TB50. B1 removes the variance dictionary and uses a fixed dictionary in the tracking. B2 adds the incoherence constraint term for visual tracking. B3 utilizes temporal consistency to constrain the difference of object appearance between two consecutive frames. B4 adopts the dictionary update scheme of l1 tracker. Our TCDL tracker includes all the modules.
During the tracking, the object may also move completely out of the frame. The reason is that the TCDL tracker cannot handle rotation when the object undergoes large appearance changes; in this case, the TCDL tracker obtains a low observation likelihood score.

D. Analysis of TCDL Tracker

To verify the effectiveness of the proposed TCDL tracker, it is important to understand the contribution of each module in our framework. We first remove or change a certain module to obtain a baseline and compare our tracker with these baseline trackers. In B1, the variance dictionary is removed from the objective function in (3). In B2, the incoherence constraint term is removed to evaluate its effect. In B3, the temporal consistency constraint term is removed to measure how tracking is affected. To evaluate the random probability-based update approach, we replace it with the dictionary update scheme of the l1 tracker in B4. Fig. 6 shows the comparison of precision and success plots of our tracker and the baselines on the TB50 benchmark dataset. We can clearly see the contribution of each module; all the modules jointly make our tracker effective.
TABLE III. Comparison with the state-of-the-art trackers on the TB50 and TB100 benchmark datasets for CLE and running speed. The first three highest values are highlighted in bold.
E. Computational Cost

The TCDL tracker runs on a PC. In practice, we search for the object state in a limited image region based on the tracking result of the last frame. The number of particles and the number of patches within an object state region have an important impact on the computational complexity. In addition, the speed of the trackers also depends on the texture and the object size; for example, in some sequences it takes nearly one second per frame to process a large object. Table III shows an average computational speed comparison for the different trackers on the TB50 and TB100 benchmark datasets. The KCF, CT, WMIL, TLD, and MEEM trackers have higher frame rates than the TCDL tracker, while the TCDL tracker achieves better accuracy in terms of CLE.

V. CONCLUSION

In this paper, we propose the TCDL tracking algorithm. Temporal consistency is incorporated into the proposed tracking framework to constrain the difference of the object appearance. The iteration process includes the solving of the sparse coefficients and the updating of the variance dictionary. The observation model takes the spatial information of patches into consideration with a mask scheme. In addition, the initial object template is incorporated into the template updating process, and the variance dictionary is updated based on a random number. Numerous experiments on challenging sequences demonstrate that the TCDL algorithm performs favorably against several state-of-the-art algorithms.

ACKNOWLEDGMENT

The authors would like to thank Y. Wu and M.-H. Yang at the University of California, Merced, for providing the data (Visual Tracker Benchmark, http://www.visual-tracking.net) and ground-truth labels. The authors would also like to thank the anonymous reviewers for their useful and constructive comments, which helped to improve the quality of this paper.

REFERENCES

[1] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, “Incremental learning for robust visual tracking,” Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 125–141, 2008.
[2] J. Kwon and K. M. Lee, “Visual tracking decomposition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., San Francisco, CA, USA, Jun. 2010, pp. 1269–1276.
[3] D. Wang, H. Lu, and M.-H. Yang, “Online object tracking with sparse prototypes,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 314–325, Jan. 2013.
[4] S. Avidan, “Ensemble tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 261–271, Feb. 2007.
[5] H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via online boosting,” in Proc. Brit. Mach. Vis. Conf., Edinburgh, U.K., 2006, pp. 47–56.
[6] X. Zhang et al., “Human pose estimation and tracking via parsing a tree structure based human model,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 44, no. 5, pp. 580–592, May 2014.
[7] B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619–1632, Aug. 2011.
[8] K. Zhang and H. Song, “Real-time visual tracking via online weighted multiple instance learning,” Pattern Recognit., vol. 46, no. 1, pp. 397–411, 2013.
[9] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[10] Q. Wang, F. Chen, J. Yang, W. Xu, and M.-H. Yang, “Transferring visual prior for online object tracking,” IEEE Trans. Image Process., vol. 21, no. 7, pp. 3296–3305, Jul. 2012.
[11] W. Zhong, H. Lu, and M.-H. Yang, “Robust object tracking via sparsity-based collaborative model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 1838–1845.
[12] X. Cheng, N. Li, T. Zhou, L. Zhou, and Z. Wu, “Object tracking via collaborative multi-task learning and appearance model updating,” Appl. Soft Comput., vol. 31, pp. 81–90, Jun. 2015.
[13] J. Gao, H. Ling, W. Hu, and J. Xing, “Transfer learning based visual tracking with Gaussian processes regression,” in Proc. Eur. Conf. Comput. Vis., Zürich, Switzerland, 2014, pp. 188–203.
[14] J. Zhang, S. Ma, and S. Sclaroff, “MEEM: Robust tracking via multiple experts using entropy minimization,” in Proc. Eur. Conf. Comput. Vis., Zürich, Switzerland, 2014, pp. 188–203.
[15] S. Hare, A. Saffari, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” in Proc. IEEE Int. Conf. Comput. Vis., Barcelona, Spain, 2011, pp. 263–270.
[16] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, Jun. 2013, pp. 2411–2418.
[17] Y. Wu, J. Lim, and M.-H. Yang, “Object tracking benchmark,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, Sep. 2015.
[18] J. Cui, Y. Liu, Y. Xu, H. Zhao, and H. Zha, “Tracking generic human motion via fusion of low- and high-dimensional approaches,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 43, no. 4, pp. 996–1002, Jul. 2013.
[19] S. Zhang, H. Yao, X. Sun, and X. Lu, “Sparse coding based visual tracking: Review and experimental comparison,” Pattern Recognit., vol. 46, no. 7, pp. 1772–1788, 2013.
[20] X. Mei and H. Ling, “Robust visual tracking and vehicle classification via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, Nov. 2011.
[21] C. Bao, Y. Wu, H. Ling, and H. Ji, “Real time robust L1 tracker using accelerated proximal gradient approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 1830–1837.
[22] D. Wang and H. Lu, “Object tracking via 2DPCA and l1-regularization,” IEEE Signal Process. Lett., vol. 19, no. 11, pp. 711–714, Nov. 2012.
[23] B. Liu, J. Huang, L. Yang, and C. Kulikowski, “Robust tracking using local sparse appearance model and k-selection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Colorado Springs, CO, USA, 2011, pp. 1313–1320.
[24] K. Zhang, L. Zhang, and M.-H. Yang, “Real-time compressive tracking,” in Proc. Eur. Conf. Comput. Vis., Florence, Italy, 2012, pp. 864–877.
[25] N. Wang, J. Wang, and D.-Y. Yeung, “Online robust non-negative dictionary learning for visual tracking,” in Proc. IEEE Int. Conf. Comput. Vis., Sydney, NSW, Australia, 2013, pp. 657–664.
[26] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 1822–1829.
[27] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via structured multi-task sparse learning,” Int. J. Comput. Vis., vol. 101, no. 2, pp. 367–383, 2013.
[28] T. Zhang, S. Liu, N. Ahuja, M.-H. Yang, and B. Ghanem, “Robust visual tracking via consistent low-rank sparse learning,” Int. J. Comput. Vis., vol. 111, no. 2, pp. 171–190, 2015.
[29] Y. Yang, W. Hu, W. Zhang, T. Zhang, and Y. Xie, “Discriminative reverse sparse tracking via weighted multi-task learning,” IEEE Trans. Circuits Syst. Video Technol., to be published.
[30] D. Wang, H. Lu, and M.-H. Yang, “Robust visual tracking via least soft-threshold squares,” IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 9, pp. 1709–1721, Sep. 2016.
[31] D. Wang, H. Lu, Z. Xiao, and M.-H. Yang, “Inverse sparse tracker with a locally weighted distance metric,” IEEE Trans. Image Process., vol. 24, no. 9, pp. 2646–2657, Sep. 2015.
[32] Y. Xie et al., “Discriminative object tracking via sparse representation and online dictionary learning,” IEEE Trans. Cybern., vol. 44, no. 4, pp. 539–553, Apr. 2014.
[33] Y. Yang, W. Hu, Y. Xie, W. Zhang, and T. Zhang, “Temporal restricted visual tracking via reverse-low-rank sparse learning,” IEEE Trans. Cybern., vol. 47, no. 2, pp. 485–498, Feb. 2017.
[34] S. Kong and D. Wang, “A dictionary learning approach for classification: Separating the particularity and the commonality,” in Proc. Eur. Conf. Comput. Vis., Florence, Italy, 2012, pp. 186–199.
[35] T. B. Dinh, N. Vo, and G. Medioni, “Context tracker: Exploring supporters and distracters in unconstrained environments,” in Proc. CVPR, Colorado Springs, CO, USA, 2011, pp. 1177–1184.
[36] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
Xu Cheng was born in Taiyuan, China, in 1983. He received the B.E. and M.E. degrees in information engineering from the Taiyuan University of Technology, Taiyuan, China, in 2007 and 2010, respectively, and the Ph.D. degree in information and communication engineering from Southeast University, Nanjing, China, in 2015. He is currently a Senior Engineer with the Nanjing Marine Radar Institute, Nanjing. His research interests include computer vision, object tracking, and pattern recognition.
Yifeng Zhang was born in Wuhu, China. He received the B.E. degree in electrical engineering from Southeast University, Nanjing, China, in 1984, the M.E. degree in computer engineering from the Harbin Institute of Technology, Harbin, China, in 1989, and the Ph.D. degree in electrical engineering from Southeast University in 1999. From 1999 to 2001, he was a Post-Doctoral Fellow with the Department of Radio Engineering, Southeast University. In 2009, he joined Southeast University, where he is currently an Associate Professor with the School of Information Science and Engineering. He has published an academic book in Chinese. His current research interests include visual tracking, watermarking and information hiding, chaotic neural information processing, and machine learning. Dr. Zhang was a recipient of the Best Paper Award from the IEEE Asia Pacific Conference on Circuits and Systems.
Jinshi Cui received the B.E., M.E., and Ph.D. degrees in computer science and technology from Tsinghua University, Beijing, China, in 1999, 2001, and 2004, respectively. She is currently an Associate Professor with the School of Electronics Engineering and Computer Science, Peking University (PKU), Beijing. She is conducting research on computer vision-based children's behavior analysis, together with colleagues from the Psychology Department, PKU. She has authored over 40 papers in respected CV, robotics, and ITS conferences and journals. Her current research interests include computer vision (CV) and intelligent systems.
Lin Zhou was born in Jiangsu Province, China, in 1978. She received the B.E. degree in information engineering and the Ph.D. degree in signal and information processing from Southeast University, Nanjing, China, in 2000 and 2005, respectively. She is currently an Associate Professor with the School of Information Science and Engineering, Southeast University. Her research interests include speech processing and spatial hearing.