Date of publication xxxx 00, 0000; date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2018.2813374
Adaptive Online Learning Based Robust Visual Tracking

Weiming Yang 1,2, Meirong Zhao 1, Yinguo Huang 1, and Yelong Zheng 1

1 State Key Laboratory of Precision Measuring Technology and Instruments, Tianjin University, Tianjin 300072, China
2 College of Electronic Information and Automation, Tianjin University of Science and Technology, Tianjin 300222, China

Corresponding author: Yelong Zheng (e-mail: [email protected]).
This work was supported in part by the Natural Science Foundation of Tianjin under Grant 17JCTPJC54200.
ABSTRACT Accurate location estimation of a target is a classical and very popular problem in visual object tracking, for which Correlation Filters have proven highly effective in real-time scenarios. However, the great variation of the target's appearance and of the surrounding background throughout a video sequence can cause model drift and, ultimately, tracking failure in trackers based on Correlation Filters. In this paper, we present a simple and fast method to improve the robustness of the model based on Sum of Template and Pixel-wise Learners (Staple). On the one hand, a confidence regression model is established to adaptively adjust the online learning rate of the model and thereby alleviate model drift. On the other hand, the scale with the maximal posterior probability, instead of the maximal likelihood, is selected as the target scale to obtain a more accurate estimate. Extensive experimental results demonstrate that the proposed approach performs favorably against several state-of-the-art algorithms on large-scale challenging benchmark datasets at speeds in excess of 42 frames per second.

INDEX TERMS Adaptive online learning, Correlation Filters, discriminative classifier, visual tracking
I. INTRODUCTION
Visual object tracking is a classical and very popular problem in computer vision, with a wide range of applications including vehicle navigation, surveillance, robotics, and so on. A typical scenario of visual tracking [1] is to localize an object of interest throughout an image sequence, given only an initial object annotation, e.g. an axis-aligned bounding box, in the first frame. Despite the significant progress [1-7] made in recent years, designing a generic object tracker that can deal with all real-world phenomena, such as abrupt motion, deformation, partial or full occlusion, and so on, at a speed suitable for real-time applications remains very challenging. As the target's appearance can vary greatly throughout an image sequence, a fixed model trained on the first frame alone would eventually lead to tracking failure. Most state-of-the-art trackers [8-11] therefore maintain an object model and update it online, so as to account for the appearance changes of the target as each new frame arrives. One widespread approach is to treat the prediction of the object in a new frame as a positive pattern for updating the model. However, background noise and deformation of the target can lead to poor target estimates. Under this
situation, a model updated by conventional methods drifts slowly; as small errors accumulate, tracking eventually fails. The key challenge in tracking an unfamiliar object in a video sequence is that the tracker should be highly robust to changes of appearance [12], that is to say, it must update the target model online adaptively and accurately. In this paper, we build upon a Correlation Filters (CF) [6, 8] based tracker popularly known as the Sum of Template and Pixel-wise Learners (Staple) [13] tracker. Two complementary image patch representations are used to characterize the model. One representation is the CF based contextual template [10], which learns online to distinguish the target object from its surrounding background. This representation is highly robust to motion blur and illumination changes, but notoriously sensitive to deformation. The other representation is a color statistics based histogram [11], which can cope well with variations in shape. By exploiting the inherent structure of each representation, two independent ridge-regression models are adopted to solve the model. Experimental results show that Staple outperforms many more complex trackers on multiple benchmarks [1, 7] at real-time speed. Staple uses a heuristic update rule, which is a simple linear
weighted combination of all the previous models, to incorporate historical tracking information [14]. This update strategy is sub-optimal when the training pattern is poor. In order to improve the performance of Staple, two methods are adopted. Firstly, based on CF, we establish a target appearance regression model whose maximal response is used to control the online learning rate of the two complementary image patch representation models. When the appearance of the target itself changes, the model is updated using the normal learning rate used in reference [13]. In the case of occlusion, we use a lower, but nonzero, learning rate to update the model. Although the model may incorporate some background information, the target location estimated with this model will not be far from the ground truth. This adaptive online learning (AOL) strategy improves robustness in two respects: the CF based contextual classifier adapts its translation estimation even under significant occlusion, and the color statistics histogram model can effectively deal with deformation. Therefore, the improved algorithm can adapt to appearance changes and alleviate the risk of drifting. Secondly, we use conditional probability to solve the problem of scale estimation. The current target scale is obtained by maximizing the posterior probability [15], based on the hypothesis that the prior over scales follows a Gaussian distribution centered at the scale of the previous frame. By integrating the two improvements, we propose an adaptive online learning based tracker and obtain more accurate tracking results, as illustrated in Fig. 1. Extensive results on large-scale benchmark datasets [1, 7] show that our tracker is superior to the conventional Staple and several state-of-the-art trackers in terms of efficiency, accuracy and robustness.
Figure 1. Comparison of our proposed approach (AOL) with two recent state-of-the-art trackers in challenging situations of significant occlusion and deformation on the Lemming and Sylvester sequences [7]. Our proposed approach using adaptive online learning achieves better tracking results.
II. RELATED WORK
Visual object tracking has been widely studied in recent years, with numerous applications. More and more
excellent tracking algorithms have been proposed, which usually construct a target appearance model from the observed image information using either generative [4, 16, 17] or discriminative [5, 6, 8, 9, 18-20] representations. Generative approaches adopt an appearance model to represent the target and search for the region most similar to the model as the target estimate, e.g. the fragment-based tracker [16]. Discriminative approaches, which incorporate machine learning techniques, estimate a decision boundary between an object image patch and the surrounding background. Examples of discriminative trackers include multiple instance learning tracking [21], ensemble tracking [2], and all the CF based trackers [6, 8-11, 13, 18]. For more comprehensive literature reviews, readers are referred to [1, 3, 7]. In this section, we only elaborate on the methods closely related to this work: (i) Correlation Filters and (ii) schemes to reduce model drift.

Correlation Filters: As early as 1980, Hester and Casasent [22] did the seminal work of using CF for object detection. In 2010, Bolme et al. [8] introduced CF into the visual tracking field and put forward the Minimum Output Sum of Squared Error (MOSSE) filter. By reducing all the necessary convolution computations in the spatial domain to elementary additions and multiplications in the frequency domain, MOSSE achieved state-of-the-art performance on the tracking benchmark [1] at a speed reaching 600 fps. CF based algorithms have since been popularized in the tracking community, and several significant improvements have been proposed. Zhang et al. [9] formulated the spatio-temporal relationships between the object and its immediate background within a Bayesian framework. Henriques et al. [6] proposed the CSK tracker, which exploits dense sampling of training patterns generated by the circular shifts of a given window. The CSK algorithm was built on illumination intensity features; these were later improved by well-engineered features such as multi-channel HOG features [23] in KCF [10] and more complex color features (color attributes) in CN [18]. CF based trackers have demonstrated accurate target localization in many different challenging scenarios in real time. Li and Zhu [19] addressed the fast scale estimation problem using a multi-resolution extension of a kernelized correlation translation filter (SAMF), in which the translation filter is applied at several resolutions; it achieves higher tracking precision at the cost of higher computational complexity. In contrast to [19], Danelljan et al. [24] proposed a method that directly learns the appearance changes induced by scale variations, realized by a multi-scale template based on a scale pyramid representation. CF based trackers are inherently confined to the problem of learning a rigid template. To achieve robustness to deformation, a representation insensitive to shape variation should be adopted. Image histograms have this
property, because they discard the position of each pixel. One tracker based on color histograms is CSR-DCF [25], in which the channel and spatial reliability concepts were introduced, but it only reaches 13 fps. Combining the advantages of the template and the color statistics histogram, Bertinetto et al. [13] proposed the Staple tracker, which outperforms far more sophisticated trackers on multiple benchmarks at speeds above 90 fps. However, the critical problem of online model update was not addressed by these methods.

Schemes to reduce model drift: Model drift often occurs when models are updated online from poor training patterns resulting from tracking failures, occlusions, misalignment of training samples, and so on. Several works aim to prevent model deterioration during updates with poor patterns [25, 26]. Others exploit machine learning approaches that are robust to poor patterns [21]. The TLD tracker [27] utilizes two experts to alleviate drift: each expert encodes rules for additional supervision based on optical flow, focuses on identifying a particular type of classifier error, and the two compensate for each other. In [20], the MEEM tracker maintains a collection of past models as an expert ensemble and uses a multi-expert restoration scheme to address model drift, in which the best expert is selected as the current tracker based on a minimum entropy criterion. A multimodal target detection technique was proposed in LMCF [26] to improve the target localization precision and prevent model drift. In [25], Lukezic et al. adopted the channel and spatial reliability concepts in CF tracking and provided a novel learning algorithm for the filter update and the tracking process. The Distractor-Aware Tracker [11] uses adaptive thresholds and explicit suppression of regions with similar colors. Our algorithm improves the updating components of Staple: in this work, we introduce an appearance confidence regression model learned from the most reliably tracked region, rather than a heuristic metric.

III. ADAPTIVE ONLINE LEARNING BASED ROBUST VISUAL TRACKING

A. Problem formulation
In this paper, we consider the tracking-by-detection framework [27], in which the tracking problem is treated as a detection task and information about the target is learned online from each detection. Our aim is to locate the target and to update an online discriminant function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that adapts to the changes of the target without being prone to drift, where $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space. The basic flow of the tracking system is that the tracker sequentially maintains an estimate of the target's position $p_t$, a 2-D bounding box, in frame $t$, where $t = 1, \ldots, T$ is the time index. When the next frame $x_{t+1}$ arrives, the cyclic shifts of the image patch within $p_t$ are considered as the candidate targets. Each candidate is given by $x_{t+1}^{p_t}(w, h)$, $(w, h) \in \{0, \ldots, W-1\} \times \{0, \ldots, H-1\}$, where $W$ and $H$ are the width and the height of the search area, respectively. The discriminant function $f(\cdot)$ is applied to these candidate objects, which are ranked according to

$$(\hat{w}, \hat{h}) = \arg\max_{(w, h)} f\big(x_{t+1}^{p_t}(w, h)\big). \qquad (1)$$

The function $f(x_{t+1}^{p})$ assigns a score to the rectangular window $p$ in image $x_{t+1}$. The target's position is updated as $p_{t+1} = p_t \oplus (\hat{w}, \hat{h})$, i.e., $p_t$ shifted by the maximizing offset. A standard approach [28] to learning the discriminant function $f$ is to minimize a loss function $L(f(X_t); y)$ that depends on the previous target patches and the labels of the targets in those images, $X_t = \{(x_i, p_i)\}_{i=1}^{t}$:
$$f = \arg\min_{f \in \mathcal{F}} \big\{ L(f(X_t); y) + \lambda R(f) \big\}, \qquad (2)$$
where $\mathcal{F}$ denotes the space of model parameters, $R(f)$ stands for a regularization term, and the constant $\lambda \geq 0$ controls the relative weight between the regularization term and the empirical risk to prevent over-fitting. We adopt a discriminant function that is a linear combination of a template-based function $f_{tmpl\_AOL}(x)$ and a histogram-based function $f_{hist\_AOL}(x)$, as shown in Eq. (3). It is similar to Staple [13], but we use two adaptive online learning based template and histogram functions:

$$f(x) = \alpha_{tmpl} f_{tmpl\_AOL}(x) + \alpha_{hist} f_{hist\_AOL}(x). \qquad (3)$$
We set $\alpha_{tmpl} = 1 - b$ and $\alpha_{hist} = b$ to form a convex combination of the two discriminant functions; the parameter $b$ is chosen on a validation set. In this way, the magnitudes of the two discriminant functions are made compatible, which makes the linear combination more effective.
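To make the fusion step concrete, the following Python/NumPy sketch shows one way to combine the two response maps and pick the new position according to Eqs. (1) and (3); the function names (fuse_responses, locate_target) and the bounding-box convention are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def fuse_responses(resp_tmpl, resp_hist, b=0.3):
    """Convex combination of the template (CF) and histogram responses, Eq. (3):
    f = (1 - b) * f_tmpl + b * f_hist, with the merge factor b chosen on a validation set."""
    return (1.0 - b) * resp_tmpl + b * resp_hist

def locate_target(resp, p_t):
    """Rank the cyclic-shift candidates and take the argmax, Eq. (1)."""
    dh, dw = np.unravel_index(np.argmax(resp), resp.shape)
    # Shifts beyond half the search window wrap around (they come from cyclic shifts).
    if dh > resp.shape[0] // 2:
        dh -= resp.shape[0]
    if dw > resp.shape[1] // 2:
        dw -= resp.shape[1]
    x, y, w, h = p_t                       # previous bounding box (x, y, width, height)
    return (x + dw, y + dh, w, h)          # p_{t+1}: previous box shifted by the best offset
```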
B. Adaptive Online Learning of the Discriminative Function

Write the adaptive online learning based template discriminant function as $f_{tmpl\_AOL}(x) = \varphi(x) \cdot w$, where $\varphi(x)$ is the kernel-induced mapping of an image patch $x$ into another space, and $w$ is the model parameter vector learned from all the circular shifts of $x$ together with Gaussian-shaped label values $y$. Let $\Phi_x$ denote the mapping of all the circular shifts of $x$; the structural risk function is obtained by Regularized Least Squares (RLS):

$$w = \arg\min_{w} \|\Phi_x w - y\|_2^2 + \lambda \|w\|_2^2. \qquad (4)$$
The Representer Theorem [29] states that the solution can be expanded as a linear combination of the inputs: $w = \Phi_x^{T} a$. The objective function is then expressed as

$$\arg\min_{a} \|\Phi_x \Phi_x^{T} a - y\|^2 + \lambda \|\Phi_x^{T} a\|^2. \qquad (5)$$
The coefficient $a$ in the dual form has the simple closed-form solution
$$a = (K + \lambda I)^{-1} y, \qquad (6)$$

where $K$ is the kernel matrix defined by $K = \Phi_x \Phi_x^{T}$, which evaluates the kernel on all pairs of the circular shifts of $x$. As shown in [6], if the kernel $\kappa(\cdot,\cdot)$ is permutation invariant, the kernel matrix $K$ is circulant. Thus we can compute the dual solution $a$ efficiently using its DFT diagonalization property, as follows:

$$a^{*} = F^{-1}\big((\hat{K} + \lambda)^{-1} \hat{y}\big), \qquad (7)$$

where $a^{*}$ denotes the complex conjugate of the signal $a$, $F^{-1}$ is the inverse FFT, and $\hat{K}$, $\hat{y}$ refer to the FFT of the kernel matrix $K$ and of the target regression labels $y$, respectively.
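For illustration, a minimal sketch of this Fourier-domain training step, following the standard KCF formulation [6, 10] with a Gaussian kernel, is given below; it is a generic example under those assumptions rather than the authors' actual code.

```python
import numpy as np

def gaussian_correlation(xf, yf, sigma):
    """Gaussian kernel correlation over all cyclic shifts, computed in the Fourier domain
    (KCF-style): exp(-(||x||^2 + ||y||^2 - 2 F^{-1}(x_hat * conj(y_hat))) / (sigma^2 N))."""
    N = xf.size
    xx = np.real(np.vdot(xf, xf)) / N            # ||x||^2 via Parseval's theorem
    yy = np.real(np.vdot(yf, yf)) / N
    xy = np.real(np.fft.ifft2(xf * np.conj(yf)))
    return np.exp(-np.clip(xx + yy - 2.0 * xy, 0, None) / (sigma ** 2 * N))

def train_dual(x, y, sigma=0.5, lam=1e-4):
    """Dual coefficients in the Fourier domain; Eq. (7) additionally applies the inverse FFT."""
    xf = np.fft.fft2(x)
    k = gaussian_correlation(xf, xf, sigma)      # kernel auto-correlation k^{xx}
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)
```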
Taking account of the changes of the target's appearance, existing works either use a "censorship mechanism" in which an update of $\hat{x}^{t+1}, \hat{\alpha}^{t+1}$ is prevented if certain criteria are met [30] (or not met), or use a constant learning rate chosen from experience [9, 10, 13]. Different from them, we train a confidence regression model $R_f$ based on CF from the most reliably tracked target region, as shown in Fig. 2. We adopt the maximal response value $R_f^t$ as a confidence metric to adaptively adjust the online learning rate $\eta_{tmpl}$ frame by frame, as shown in (8), using a pre-defined threshold $\theta$: if $R_f^t \geq \theta$, then $\eta_{tmpl} = \eta_1$; otherwise, $\eta_{tmpl} = \eta_2$:

$$\hat{x}^{t+1} = (1 - \eta_{tmpl})\,\hat{x}^{t} + \eta_{tmpl}\, x^{t+1}, \qquad \text{(8a)}$$
$$\hat{\alpha}^{t+1} = (1 - \eta_{tmpl})\,\hat{\alpha}^{t} + \eta_{tmpl}\, \alpha^{t+1}. \qquad \text{(8b)}$$

The model $\hat{x}_{Rf}, \hat{\alpha}_{Rf}$ of $R_f$ is learned online with the same learning rate $\eta_{tmpl}$ frame by frame as

$$\hat{x}_{Rf}^{t+1} = (1 - \eta_{tmpl})\,\hat{x}_{Rf}^{t} + \eta_{tmpl}\, x_{Rf}^{t+1}, \qquad \text{(9a)}$$
$$\hat{\alpha}_{Rf}^{t+1} = (1 - \eta_{tmpl})\,\hat{\alpha}_{Rf}^{t} + \eta_{tmpl}\, \alpha_{Rf}^{t+1}. \qquad \text{(9b)}$$
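A minimal sketch of the adaptive update of Eqs. (8) and (9) is shown below, assuming the template model is stored as a feature template and its dual coefficients; the helper names and the numeric defaults are placeholders, not the authors' implementation or necessarily the values of Table 1.

```python
def select_learning_rate(conf, theta, eta_high, eta_low):
    """Choose the frame's learning rate from the confidence R_f^t (max response of R_f)."""
    return eta_high if conf >= theta else eta_low

def linear_update(model, new, eta):
    """Interpolation used in Eqs. (8) and (9): model <- (1 - eta) * model + eta * new."""
    return (1.0 - eta) * model + eta * new

def update_template_model(x_model, alpha_model, x_new, alpha_new, conf,
                          theta=0.35, eta1=0.02, eta2=0.005):
    """Adaptive online update of the translation template, Eq. (8).

    x_model, alpha_model : current template features and dual coefficients.
    x_new, alpha_new     : template and coefficients estimated in frame t+1.
    conf                 : maximal response of the confidence model R_f in frame t+1.
    """
    eta = select_learning_rate(conf, theta, eta1, eta2)
    return linear_update(x_model, x_new, eta), linear_update(alpha_model, alpha_new, eta)
```

The confidence model of Eq. (9) can be updated with the same two helper functions, using the learning rate selected for the current frame.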
Figure 2. Flowchart of the adaptive online learning algorithm. When tracking in frame $x_{t+1}$, the template-related and histogram-related models represent the target's appearance; the position $p_{t+1}$ of the target is estimated at the peak of the merged responses of the two models (Eq. (3)). The template-related confidence model represents the most reliably tracked target appearance, and its maximum response is used to adaptively adjust the online learning rate. When training on frame $x_{t+1}$, the two template-related models and the histogram-related model are updated using the adaptive online learning rate (Eqs. (8), (9), and (12)).
The same online learning scheme is used in the histogram based discriminative function $f_{hist\_AOL}(x)$. To retain the speed and efficiency of CF, our histogram model is learned by applying linear regression to each feature pixel independently over the rectangular object region $O$ and its surrounding background region $B$ in $x^t$. The Bayesian framework [11] is used to obtain the object likelihood $P$ at location $(w, h)$ as

$$P(x_{w,h} \in O \mid O, B, b_{w,h}) = \frac{P(b_{w,h} \mid x_{w,h} \in O)\, P(x_{w,h} \in O)}{\sum_{\Omega \in \{O, B\}} P(b_{w,h} \mid x_{w,h} \in \Omega)\, P(x_{w,h} \in \Omega)}, \qquad (10)$$

where $b_{w,h}$ stands for the bin $b$ assigned to the color components of $x_{w,h}$. In particular, we directly estimate the likelihood term from color histograms; then (10) can be simplified to

$$P(x_{w,h} \in O \mid O, B, b_{w,h}) \approx \frac{H_O(b_{w,h})}{H_O(b_{w,h}) + H_B(b_{w,h})}, \qquad (11)$$

where $H_O(b)$ and $H_B(b)$ denote the $b$-th bin of the non-normalized histogram $H$ computed over the regions $O$ and $B$, respectively (a value of 0.5 is used for bins that are empty in both histograms). In the online version [13], the numerator and denominator model parameters in (11) are updated respectively as

$$\hat{H}_O^{t+1} = (1 - \eta_{hist})\,\hat{H}_O^{t} + \eta_{hist}\, H_O^{t+1}, \qquad \text{(12a)}$$
$$\hat{H}_B^{t+1} = (1 - \eta_{hist})\,\hat{H}_B^{t} + \eta_{hist}\, H_B^{t+1}. \qquad \text{(12b)}$$

We also select the maximal response value $R_f^t$ as a confidence metric to adaptively adjust the histogram online learning rate $\eta_{hist}$ frame by frame. The same threshold $\theta$ is used: if $R_f^t \geq \theta$, then $\eta_{hist} = \eta_3$; otherwise, $\eta_{hist} = \eta_4$.
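The per-pixel likelihood of Eq. (11) and the adaptive histogram update of Eq. (12) can be sketched as follows; the 32-bins-per-channel quantization, the function names, and the numeric defaults are assumptions made only for illustration.

```python
import numpy as np

def pixel_likelihood(patch_bgr, H_O, H_B, n_bins=32):
    """Per-pixel object likelihood, Eq. (11): H_O(b) / (H_O(b) + H_B(b)).

    patch_bgr : HxWx3 uint8 color patch.
    H_O, H_B  : non-normalized n_bins^3 color histograms (counts) over the
                object region O and the surrounding background region B.
    """
    q = patch_bgr.astype(int) // (256 // n_bins)               # quantize each channel
    idx = (q[..., 0] * n_bins + q[..., 1]) * n_bins + q[..., 2]
    num = H_O.ravel()[idx].astype(float)
    den = num + H_B.ravel()[idx].astype(float)
    out = np.full(idx.shape, 0.5)                              # unseen bins -> 0.5
    np.divide(num, den, out=out, where=den > 0)
    return out

def update_histograms(H_O_model, H_B_model, H_O_new, H_B_new, conf,
                      theta=0.35, eta3=0.04, eta4=0.01):
    """Adaptive online histogram update, Eq. (12); defaults are placeholders."""
    eta = eta3 if conf >= theta else eta4                      # learning rate from R_f^t
    blend = lambda model, new: (1.0 - eta) * model + eta * new
    return blend(H_O_model, H_O_new), blend(H_B_model, H_B_new)
```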
C. Scale Estimation

Another improvement concerns the estimation of the target's scale. A popular scale estimation approach (DSST) was proposed by Danelljan et al. [24], which learns a distinct multi-scale template for scale search using a 1-D CF. Our scale filter is updated using the same scheme as the template learned for translation. When a new frame arrives, we first search for the target in translation and subsequently estimate its scale. In order to obtain a more accurate scale estimate, the posterior probability, instead of the likelihood, is used in the scale search according to the Bayesian framework [15, 31]. We maximize the posterior probability by the following equations:

$$s_i^{*} = \arg\max_{i} P(s_i \mid y) \qquad (13)$$
$$\phantom{s_i^{*}} = \arg\max_{i} P(y \mid s_i)\, P(s_i), \qquad (14)$$

where $s_i$ refers to the $i$-th scale and $P(y \mid s_i)$ is the likelihood estimate. The prior $P(s_i)$ is assumed to follow a Gaussian distribution whose maximum is located at the scale of the previous frame. Under the assumption that the target's scale does not change significantly between consecutive frames, smoother changes between scales are detected by this algorithm. More stable detections are obtained by this strategy using the posterior probability rather than the likelihood.
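A sketch of this maximum-a-posteriori scale selection (Eqs. (13)-(14)) is given below, assuming the per-scale likelihood is taken as the maximal CF response at each candidate scale; the Gaussian prior width sigma is an illustrative parameter not specified in the text.

```python
import numpy as np

def select_scale(responses, scales, prev_scale, sigma=0.05):
    """MAP scale selection, Eqs. (13)-(14): argmax_i P(y | s_i) P(s_i).

    responses  : list of CF response maps, one per candidate scale.
    scales     : array of candidate scale factors (e.g. a DSST-style pyramid).
    prev_scale : scale estimated in the previous frame (center of the Gaussian prior).
    """
    scales = np.asarray(scales, dtype=float)
    likelihood = np.array([r.max() for r in responses])                 # P(y | s_i)
    prior = np.exp(-0.5 * ((scales - prev_scale) / sigma) ** 2)         # Gaussian P(s_i)
    posterior = likelihood * prior                                      # proportional to P(s_i | y)
    return scales[np.argmax(posterior)]
```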
IV. EXPERIMENTS

In this section, we give the implementation details of our proposed algorithm, AOL, and perform extensive experiments on the recent and popular benchmarks OTB50 [1] and OTB100 [7] to test our approach. Both benchmarks pose challenging problems to object tracking, including scale variation (SV), occlusion (OCC), illumination variation (IV), motion blur (MB), deformation (DEF), fast motion (FM), out-of-plane rotation (OPR), background clutter (BC), out-of-view (OV), in-plane rotation (IPR), and low resolution (LR). Tracking quality is characterized by success and precision plots [1]. The trackers are ranked by the area under the curve (AUC), which is the average of the success rates over the sampled overlap thresholds.
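For reference, these metrics can be computed as in the following sketch, assuming per-frame center errors and bounding-box overlaps are already available; this mirrors the OTB protocol [1] but is not the benchmark toolkit code.

```python
import numpy as np

def distance_precision(center_errors, threshold=20.0):
    """DP: fraction of frames whose center location error is within `threshold` pixels."""
    return float(np.mean(np.asarray(center_errors) <= threshold))

def overlap_success(overlaps, threshold=0.5):
    """OS: fraction of frames whose bounding-box overlap (IoU) exceeds `threshold`."""
    return float(np.mean(np.asarray(overlaps) > threshold))

def success_auc(overlaps, n_thresholds=21):
    """AUC of the success plot: average success rate over uniformly sampled overlap thresholds."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([overlap_success(overlaps, t) for t in thresholds]))
```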
A. Implementation Details

We report experimental results on OTB50 and OTB100 in addition to the baselines that are part of the benchmarks, using the authors' own results to ensure a fair comparison. Therefore, for each evaluation, we only compare against methods whose own results are available. Table 1 lists the values of the most important parameters, which are selected on a validation set. The template-related model learning rate $\eta_1$ and the histogram-related model learning rate $\eta_3$ are selected when the confidence value $R_f^t$ is greater than the threshold $\theta$; otherwise, $\eta_2$ and $\eta_4$ are selected, respectively. Apart from the parameters listed in Table 1, all other parameters are set to the values commonly used in the literature [13], and all parameters are kept constant throughout all experiments.

TABLE 1. Parameters of AOL.
Parameter   $\eta_1$   $\eta_2$   $\eta_3$   $\eta_4$   $\theta$
Value       0.23       0.02       0.005      0.04       0.35
B. Results
Besides the 29 trackers in the tracking benchmark datasets [1, 7, 16], we evaluate the proposed algorithm AOL against another 9 recently proposed state-of-the-art trackers, including MEEM [20], Staple [13], LCT [32], DSST [24], LMCF [26], SAMF [19], CSR-DCF [25], TGPR [33], and DLSSVM [34]. Fig. 3 shows the results of one-pass evaluation (OPE) using precision and success plots on OTB50 and OTB100. For clarity, we only plot the top 10 ranked trackers. In addition, we present quantitative comparisons of distance precision (DP) at 20 pixels and overlap success (OS) rate [1] at an overlap threshold of 0.5 on OTB100 in Table 2.
Table 2 shows that our algorithm performs better than the other 10 state-of-the-art trackers in both DP and OS. Among the trackers from the literature, the conventional Staple method achieves the best DP with an average of 67.55%, and SRDCF achieves the best OS with an average of 72.39%. Our algorithm performs best overall, with a DP of 69.48% and an OS of 73.64%.
TABLE 2. Comparison with state-of-the-art trackers on the OTB100 benchmark sequences. Our approach performs favorably against existing methods in DP at a threshold of 20 pixels and in OS rate at an overlap threshold of 0.5. The best two results in each row are highlighted in bold and underline.
Tracker   AOL (Ours)  Staple  SRDCF  LCT    DSST   LMCF   MEEM   KCF    SAMF   Struck  DLSSVM
DP (%)    69.48       67.55   66.20  63.89  58.03  67.37  58.27  55.55  64.16  49.36   61.62
OS (%)    73.64       70.88   72.39  70.06  59.98  71.77  62.05  55.05  67.17  51.68   62.69
Figure 3. The precision plots of OPE (left column) and the success plots of OPE (right column) on OTB50 (top row) and OTB100 (bottom row). The numbers in the legend indicate the average AUC scores; the legend also shows the years and original sources of these trackers.
Fig. 3 illustrates the success and precision plots of the top ten trackers on the two benchmarks, OTB50 and OTB100. AOL ranks first on both OTB50 and OTB100, except for being slightly lower than LMCF in the precision plot on OTB50. On OTB50, AOL achieves a ranking score of 0.768 in the precision plot, slightly lower than LMCF, which reaches 0.771. On OTB100, AOL achieves a ranking score of 0.601 in the success plot and 0.742 in the precision plot, higher than LMCF, which reaches 0.579 and 0.720. This shows that AOL is more generic than LMCF. Compared with the conventional Staple, both success and precision scores increase by 3%, which indicates that AOL can alleviate model drift. Struck [5] performed best, with averages of 0.460 and 0.586 respectively, when the first benchmark [1] came out; AOL significantly improves upon Struck, by an average of 30% and 26% in the AUC scores of the precision and success plots, respectively.
C. Attribute-based evaluation
The image sequences in the benchmark datasets OTB50 and OTB100 are annotated with 11 attributes that describe the different challenges a tracker must face. By analyzing the performance of the trackers on these attributes, we can study the strengths and weaknesses of the trackers. We evaluate AOL against these trackers on the challenging attributes of OTB50 and OTB100, as shown in Fig. 4 and Fig. 5. The figures demonstrate that AOL performs best on only four attributes on OTB50, namely scale variation, in-plane rotation, out of view, and illumination variation, whereas it performs best on six attributes on OTB100, namely scale variation, out of view, out-of-plane rotation, occlusion, in-plane rotation, and deformation. This indicates that AOL is more robust.
Figure 4. The precision plots and success plots on OTB50 for eight challenging attributes: scale variation, in-plane rotation, out of view, occlusion, illumination variation, out-of-plane rotation, low resolution, and background clutter. The proposed AOL performs best or second best on most of the attributes.
Figure 5. The precision plots and success plots on OTB100 for eight challenging attributes: scale variation, out of view, occlusion, in-plane rotation, deformation, fast motion, low resolution, and background clutter. The proposed AOL performs best on most of the attributes.
D. Efficiency
Tracking speed is an important factor in many real-world applications. The proposed AOL is implemented in Matlab, and all experiments are conducted on a regular desktop PC with an Intel [email protected] and 8 GB RAM. Our Matlab prototype runs at approximately 42 fps, whereas Staple runs at 68 fps on the same hardware. Without losing real-time performance, AOL improves the tracking performance by about 6% in precision rate and 5% in success rate, as shown in Fig. 3. AOL thus makes a better trade-off between tracking accuracy and running speed.

V. CONCLUSIONS
In this paper, we propose a new adaptive online learning based tracker that improves Staple in two aspects. Firstly, we establish a high-confidence appearance model to reduce the model drift caused by poor training patterns in the update process. Secondly, we improve target scale estimation by maximizing the posterior probability, which improves robustness to gradual scale changes. Extensive experimental results on OTB50 and OTB100 demonstrate that AOL outperforms many complex trackers and validate the effectiveness of adaptive online learning in the tracking process.

REFERENCES
[1] Y. Wu, J. Lim, and M. H. Yang, "Online Object Tracking: A Benchmark," in CVPR, Portland, Oregon, USA, 2013, pp. 2411-2418.
[2] S. Avidan, "Ensemble Tracking," PAMI, vol. 29, no. 2, pp. 261-271, 2008.
[3] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys, vol. 38, no. 4, pp. 1-45, 2006.
[4] J. Kwon and K. M. Lee, "Visual tracking decomposition," in CVPR, San Francisco, California, USA, 2010, pp. 1269-1276.
[5] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in ICCV, Barcelona, Spain, 2011, pp. 263-270.
[6] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the Circulant Structure of Tracking-by-Detection with Kernels," in ECCV, Firenze, Italy, 2012, pp. 702-715.
[7] Y. Wu, J. Lim, and M. H. Yang, "Object Tracking Benchmark," PAMI, vol. 37, no. 9, pp. 1834-1848, 2015.
[8] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in CVPR, San Francisco, California, USA, 2010, pp. 2544-2550.
[9] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. H. Yang, "Fast Visual Tracking via Dense Spatio-temporal Context Learning," in ECCV, Zurich, Switzerland, 2014, pp. 127-141.
[10] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-Speed Tracking with Kernelized Correlation Filters," PAMI, vol. 37, no. 3, pp. 583-596, 2015.
[11] H. Possegger, T. Mauthner, and H. Bischof, "In defense of color-based model-free tracking," in CVPR, Boston, MA, USA, 2015, pp. 2113-2120.
[12] N. Wang, J. Shi, D. Y. Yeung, and J. Jia, "Understanding and Diagnosing Visual Tracking Systems," in ICCV, Santiago, Chile, 2015, pp. 3101-3109.
[13] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, "Staple: Complementary Learners for Real-Time Tracking," in CVPR, Las Vegas, USA, 2016, pp. 1401-1409.
[14] A. Bibi and B. Ghanem, "Multi-template Scale-Adaptive Kernelized Correlation Filters," in ICCV Workshop, 2015, pp. 50-57.
[15] A. Doucet and A. M. Johansen, "A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later," in Handbook of Nonlinear Filtering, Oxford, UK: Oxford University Press, 2011, ch. 7, sec. 2, pp. 656-704.
[16] A. Adam, E. Rivlin, and I. Shimshoni, "Robust Fragments-based Tracking using the Integral Histogram," in CVPR, New York, NY, USA, 2006, pp. 798-805.
[17] D. Ross, J. S. Lim, R. Lin, and M. H. Yang, "Incremental Learning for Robust Visual Tracking," IJCV, vol. 77, no. 1, pp. 125-141, 2008.
[18] M. Danelljan, F. S. Khan, M. Felsberg, and J. V. D. Weijer, "Adaptive Color Attributes for Real-Time Visual Tracking," in CVPR, Columbus, Ohio, USA, 2014, pp. 1090-1097.
[19] Y. Li and J. Zhu, "A Scale Adaptive Kernel Correlation Filter Tracker with Feature Integration," in ECCV Workshop, 2014, pp. 254-265.
[20] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust Tracking via Multiple Experts Using Entropy Minimization," in ECCV, Zurich, Switzerland, 2014, pp. 188-203.
[21] B. Babenko, M. H. Yang, and S. Belongie, "Visual tracking with online Multiple Instance Learning," in CVPR, San Francisco, CA, USA, 2009, pp. 983-990.
[22] C. F. Hester and D. Casasent, "Multivariant technique for multiclass pattern recognition," Appl. Opt., vol. 19, no. 11, pp. 1758-1761, 1980.
[23] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in CVPR, San Diego, CA, USA, 2005, pp. 886-893.
[24] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Discriminative Scale Space Tracking," PAMI, vol. 39, no. 8, pp. 1561-1575, 2017.
[25] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan, "Discriminative Correlation Filter with Channel and Spatial Reliability," in CVPR, Honolulu, Hawaii, USA, 2017, pp. 4847-4856.
[26] M. Wang, Y. Liu, and Z. Huang, "Large Margin Object Tracking with Circulant Feature Maps," in CVPR, Honolulu, Hawaii, USA, 2017.
[27] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-Learning-Detection," PAMI, vol. 34, no. 7, pp. 1409-1422, 2012.
[28] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online learning with kernels," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2165-2176, 2004.
[29] C. K. I. Williams, "Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond," Publications of the American Statistical Association, vol. 98, no. 462, pp. 489-490, 2003.
[30] W. Zhong, H. Lu, and M. H. Yang, "Robust object tracking via sparsity-based collaborative model," in CVPR, Providence, RI, USA, 2012, pp. 1838-1845.
[31] W. M. Yang and M. R. Zhao, "Auto-adjust lag particle filter smoothing algorithm for non-linear state estimation," Acta Physica Sinica, vol. 65, no. 2, pp. 32-38, 2016.
[32] C. Ma, X. Yang, C. Zhang, and M. H. Yang, "Long-term correlation tracking," in CVPR, Boston, MA, USA, 2015, pp. 5388-5396.
[33] J. Gao, H. Ling, W. Hu, and J. Xing, "Transfer Learning Based Visual Tracking with Gaussian Processes Regression," in ECCV, Zurich, Switzerland, 2014, pp. 188-203.
[34] J. Ning, J. Yang, S. Jiang, L. Zhang, and M. H. Yang, "Object Tracking via Dual Linear Structured SVM and Explicit Feature Map," in CVPR, Las Vegas, USA, 2016, pp. 4266-4274.