
High Performance Visual Tracking with Extreme Learning Machine Framework

Chenwei Deng, Senior Member, IEEE, Yuqi Han, Student Member, IEEE, and Baojun Zhao

Abstract—In real-time applications, a fast and robust visual tracker should generally have the following important properties: (1) a feature representation of the object that is not only efficient but also has good discriminative capability; (2) an appearance model that can quickly adapt to variations of the foreground and background. However, most existing tracking algorithms cannot achieve satisfactory performance in both aspects. To address this issue, in this paper we advocate a novel and efficient visual tracker by exploiting the excellent feature learning and classification capabilities of an emergent learning technique, i.e., the Extreme Learning Machine (ELM). The contributions of the proposed work are as follows: (1) motivated by the simplicity and learning ability of the ELM autoencoder (ELM-AE), an ELM-AE based feature extraction model is presented, and this model can efficiently provide a compact and discriminative representation of the inputs; (2) owing to the fast learning speed of the ELM classifier, an ELM-based appearance model is developed for feature classification, thereby quickly distinguishing the object of interest from its surroundings. In addition, in order to cope with visual changes of the target and its background, the online sequential ELM (OS-ELM) is used to incrementally update the appearance model. Extensive experiments on challenging image sequences demonstrate the effectiveness and robustness of the proposed tracker.

Index Terms—Robust visual tracking, extreme learning machine (ELM), online sequential ELM (OS-ELM), ELM autoencoder (ELM-AE), feature learning, feature classification.

I. INTRODUCTION

Visual tracking is a very challenging problem in computer vision and related fields. It has a wide range of applications, including visual surveillance, human-computer interaction, medical imaging, and traffic monitoring, to name a few. While numerous algorithms have been proposed in recent decades ([1]–[9]), it is still an open problem to design a robust and efficient tracker that copes with appearance changes caused by illumination changes, pose variation, partial occlusion, cluttered backgrounds, etc. A discriminative tracker typically involves three components: a feature representation model, which extracts useful information for describing the object appearance; an appearance model (classifier), which aims to find the decision boundary for separating the target from its surroundings; and a motion model for estimating the likely movements of the target over time.

Manuscript received June 15, 2018; revised Oct 6 and Dec 2, 2018; accepted Dec 11, 2018. This work was partially supported by the National Natural Science Foundation of China under Grant 91438203. (Corresponding author: Chenwei Deng) C. Deng, Y. Han and B. Zhao are with the School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China. (Email: [email protected])

Generally, the first two components dominate the performance of a visual tracker [7] and need to be considered carefully. During the tracking process, feature extraction plays an important role. A good feature representation should model the object appearance both effectively and efficiently. In recent studies, various visual features (e.g., color histograms, Haar-like features, histograms of oriented gradients, local binary patterns, etc.) have been utilized for tracking. However, these handcrafted features are somewhat empirically designed and have been shown to lack discriminability when modeling dynamic object appearances. To address the limitations above, Lan et al. [10]–[12] combine complementary features for target representation, which has proved effective in enhancing tracking performance. However, since all of the high-dimensional features need to be stored and computed during fusion and selection, such trackers are computationally expensive. Apart from these handcrafted features, another strategy is to design a learning-based feature extraction model. The most representative feature-learning strategies are dictionary learning and deep learning methods. In dictionary learning [13], image patches are encoded with respect to an over-complete dictionary. However, the online updating of such methods often requires alternating iterations, making the tracker inefficient. Meanwhile, deep learning based feature representation [14]–[16] has been extensively studied and has achieved promising progress. However, deep learning based methods have their limitations. First, with multiple parameters to be iteratively tuned, the performance of deep networks relies heavily on offline training data and is prone to over-fitting with the limited online samples. Moreover, even though some recent works [17], [18] have demonstrated state-of-the-art performance, they rely heavily on powerful computing platforms and are difficult to deploy in resource-constrained environments.

Besides a robust feature representation, an accurate appearance model needs to be built efficiently for effective tracking. Technically speaking, appearance modeling amounts to constructing a feature classifier, in terms of both efficiency and effectiveness. The most representative classifiers recently used in tracking are the naive Bayes classifier [2] and the support vector machine (SVM) [4], [19]. The naive Bayes model is a simple probabilistic classifier based on Bayes' theorem. With the prior probability and the class-conditional probability, the naive Bayes classifier can be trained via maximum-likelihood estimation. However, estimating the class-conditional probability requires a large set of training samples, which may be difficult to obtain in tracking tasks. Besides, the naive independence assumption among image features is not reasonable, leading to inferior performance for online tracking. SVM aims to learn a hyperplane by


maximizing the minimal margin on the training set. Avidan [19] first applied the SVM classifier to visual tracking, but its performance depends on prior training with substantial data. Furthermore, SVM-based models have to solve a quadratic programming problem when retraining the classifier, resulting in a heavy computational burden. Instead of the inequality constraints in the standard SVM, equality constraints are used in the least squares SVM (LS-SVM), and the resulting implementation reduces the computational burden. Therefore, LS-SVM is applied for feature classification in [4], [13]. From the above analysis, we find that most existing tracking algorithms cannot achieve optimal performance in terms of both accuracy and speed. Generally speaking, a good discriminative tracker can be attributed to the following aspects: (1) the feature representation should be discriminative enough for appearance changes while being processed efficiently; (2) the classifier adopted in the appearance model should have a low computational burden and adapt quickly to the visual changes of the foreground and background.

In this paper, we attempt to develop an efficient and robust visual tracker by exploiting the strong feature learning and classification capabilities of the Extreme Learning Machine (ELM). ELM, proposed by Huang et al. [20], is a novel machine learning framework. Compared with traditional learning methods (neural networks or SVM), the hidden node weights of ELM are randomly generated, and only the output weights between the hidden layer and the output layer are computed. ELM not only has a much faster learning speed but also obtains better generalization. Moreover, the online sequential ELM (OS-ELM) [21] makes it possible to update the model without retraining the whole network, which is highly preferred in online vision tasks. Motivated by such advantages, we apply this emerging learning technique to visual tracking and develop an effective and efficient tracker.

Even though similar ELM-based trackers have been presented in [22], [23], the proposed method differentiates itself from the existing ELM tracking schemes in both the target feature representation and the appearance model classification. In [22], the authors employ a co-training framework to optimize two handcrafted feature extractors, while the work in [23] adopts a sparse learning representation strategy. Such feature extraction and optimization obviously increase the time cost, and thus may not be suitable for time-critical tracking applications. Furthermore, [22] assumes that the ELM classifiers with different features should give the same label for the same sample, and a fusion strategy that multiplies the probabilities of each feature is constructed. We argue that such an assumption seems idealized: since different features have different representation capabilities across different scenarios, how should the class label be determined if the two features cannot reach an agreement? Besides, the strategy of multiplying the two likelihoods of different features seems unreasonable, since any small feature likelihood may lower the joint probability. To this end, we believe that using ELM-AE based features followed by a single ELM classifier can yield better performance than the ensemble scheme in [22]. To the best of our knowledge, this is the first time that ELM theories are used to construct a comprehensive framework (including feature extraction, appearance modeling, and template updating) for the visual tracking application. For the related theory and concepts of ELM, please refer to the supplementary materials. The main contributions are summarized as follows:

1) A fast learning-based image feature extractor is developed using ELM-AE. According to the ELM theories [24], [25], ELM-AE [26] with random feature mapping (with almost any nonlinear piecewise activation function) can provide universal approximation capability. That is, unlike the deep neural network based feature extraction methods [14], [15], where the hidden neuron parameters need to be iteratively fine-tuned, the proposed method has a random feature mapping process, which provides a simple but effective feature learning capability. As training samples arrive, the object/image features can be extracted efficiently by randomly mapping the input data into the feature extraction network and then computing the output weights. It is worth noting that despite the simplicity of our feature extractor, it provides a compact and discriminative notion of the "thing" being tracked, which facilitates the tracking task.

2) An accurate appearance model is efficiently established by exploiting ELM/OS-ELM for feature classification and updating. As mentioned above, compared with SVM, ELM can achieve better generalization performance with a much faster learning speed [25]. Moreover, it has also been proved that OS-ELM trained on new input data achieves the same learning accuracy as batch learning with all of the training data, which facilitates online and incremental updating of the built model [21]. Thus, the discriminative property of the ELM-based appearance model is expected to be better than that of existing SVM-based ones [4], [19], and to adapt to visual variations more accurately and quickly.

Fig. 1. The diagram of the proposed ELM-based visual tracker.

II. PROPOSED VISUAL TRACKER

In this section, we develop a novel ELM-based visual tracker; the overall architecture of the proposed framework is illustrated in Fig. 1. Several candidate samples around the current object location are first generated, and then ELM-AE is performed to extract the informative features of these samples. With the learned features, an ELM classifier is trained for object/non-object classification. The final object state is estimated by computing the confidence scores of the candidate samples. According to the tracking results, some positive and negative samples are collected for updating the classifier using OS-ELM. For tracking in the next frame, new candidate samples are generated around the object being tracked, and the same procedure as above is applied.


Fig. 2. The architecture of ELM-AE based feature extraction.

Fig. 3. The output weights β of ELM-AE on the Dudek sequence.

A. ELM-AE based Feature Extraction

As a new and efficient unsupervised learning technique, ELM-AE can learn image features faster than existing methods (e.g., the deep Boltzmann machine) [26], and has achieved favorable performance in image classification [26], [27]. The success of ELM-AE feature representation has inspired us to extend it to visual tracking. The proposed feature extraction model is schematically shown in Fig. 2; it mainly contains three steps: ELM random mapping, ELM learning, and ELM-AE feature mapping.

Suppose that in an ELM-AE with L hidden nodes, the input data set is X = {x_1, x_2, ..., x_N}, where x_i \in R^d is the i-th input data vector. For ELM random mapping, we randomly generate the input parameters w and b, which project the input data x_i \in R^d into an L-dimensional ELM random space as follows:

    H_i = \{G(w_j, b_j, x_i)\}_{j=1}^{L}, \quad i = 1, \ldots, N    (1)

where G(·) is the activation function, and G(w_j, b_j, x) is the j-th output of the ELM random feature space. Then, ELM learning is performed on the randomly mapped image data. At this stage, the main objective is to reconstruct the input X from the ELM random feature space H by solving the following problem:

    \hat{\beta} = \arg\min_{\beta} \left\{ \|X - H\beta\|_2^2 + \mu \|\beta\|_p \right\}    (2)

where β are the output weights to be obtained and µ is a regularization parameter. The setting of p controls the characteristics of ELM learning. In our previous work [27], sparser and more meaningful hidden features are obtained when p is the ℓ1 norm. Following a solution similar to [27], we utilize the ℓ1 sparsity constraint to solve Eq. (2), and learn the discriminative and informative features of the object.

Fig. 4. Evaluation of the proposed feature extraction model. (a) 9 candidate samples from Car4 sequence. (b) The plot of Euclidean distances between the mean of feature vectors of object samples and those of candidate samples.

The final step of the proposed feature extraction model is ELM-AE feature mapping. As described in [26], the parameters β have the ability to represent the input data, and can be viewed as learned image bases describing the distribution of the input data. Fig. 3 shows the semantic meaning of β trained on the Dudek sequence, from which we can see that β reflects the input data distribution. Thus, similar to the coding strategy in dictionary feature learning [13], the inputs X and the learned bases β are multiplied to provide a compact feature representation of the input data:

    C = G(X \beta^T)    (3)

where G(·) is the same as in Eq. (1); C = {c_1, ..., c_N}, and c_i \in R^L is the learned feature vector of the input data x_i \in R^d. For visual tracking, the feature dimension L is often smaller than the input data dimension d. Although the resulting feature representation is compact, it still has good discriminative capability. To illustrate the discriminative ability of the proposed ELM-AE based feature extraction, a simple evaluation is carried out on a collection of image patches from the Car4 sequence. In Fig. 4(a), a group of candidate sample patches is listed. Using ELM-AE feature extraction, we obtain the learned feature vectors. The Euclidean distances between the mean of the feature vectors of the object samples and those of the candidate samples are calculated and shown in Fig. 4(b). Although the Euclidean distance is a rough measure of similarity, it demonstrates that our feature extraction model has good discriminative ability. In our experiments (see Table II), we also compare its performance with five other feature extractors, which further validates the feature extraction capability of ELM-AE. It should be emphasized that, due to the random mapping aspect of ELM, the feature representation performance fluctuates slightly; however, the stability of the results can be statistically guaranteed (as also demonstrated in Figs. 8 and 9). Even though the input hidden neuron parameters (w and b) are randomly assigned, ELM satisfies the universal approximation capability [24]. That is, there exist appropriate β which can perfectly represent the input samples with overwhelming probability. Practically, we obtain the optimal β by solving the problem of Eq. (2).
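As a concrete illustration of these three steps, the following NumPy sketch performs the random mapping of Eq. (1), solves a variant of Eq. (2), and applies the feature mapping of Eq. (3). Note the assumptions: an ℓ2 (ridge) penalty replaces the ℓ1 constraint of [27], purely so that β has a one-line closed-form solution; the sigmoid activation, array shapes, and patch size are also chosen for illustration. This is not the authors' released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_ae_features(X, L=100, mu=0.1, seed=0):
    """Sketch of ELM-AE feature extraction (Eqs. 1-3).

    X : (N, d) array of vectorized image patches (one patch per row).
    Returns (C, beta): (N, L) learned features and (L, d) output weights.
    An l2 penalty stands in for the l1 constraint of the paper.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.standard_normal((d, L))        # random input weights w_j
    b = rng.standard_normal((1, L))        # random biases b_j
    H = sigmoid(X @ W + b)                 # Eq. (1): ELM random feature space, (N, L)
    # Eq. (2) with p = 2: beta = argmin ||X - H beta||^2 + mu ||beta||^2
    beta = np.linalg.solve(H.T @ H + mu * np.eye(L), H.T @ X)   # (L, d)
    C = sigmoid(X @ beta.T)                # Eq. (3): compact features, (N, L)
    return C, beta

# toy usage: 200 vectorized 32x32 patches
X = np.random.default_rng(1).random((200, 1024))
C, beta = elm_ae_features(X)
print(C.shape)   # (200, 100)
```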


Updating of feature extraction model: During tracking, the feature extraction network should be updated in order to maintain a good representation of the target as well as to prevent the tracker from drifting to the background. If the feature extraction network is updated too frequently, it will lead to unstable performance as well as low efficiency. To address this issue, we advocate a simple but effective updating strategy. Specifically, the ELM-AE is fully retrained every 10 frames, including both the random input parameters w, b and the output parameters β. The high speed is still maintained, since the input parameters w and b are randomly generated without any tuning, and the closed-form solution of the output parameters β can be obtained directly. Furthermore, the appearance of the object generally does not change significantly over a small period of time. With the guarantee of the universal approximation capability of ELM, the extracted features provide a robust and compact representation of the target, which naturally supports the incremental updating of the appearance model. Motivated by this, with the trained β, we compute the mean of the training errors of the positive (object) samples:

    \Lambda = \frac{1}{N_p} \sum_{i=1}^{N_p} \left\| x_i^p - H\beta \right\|_2^2    (4)

where x_i^p is the i-th positive training sample, and N_p is the number of positive training samples. Suppose that we have a tracking result χ_t; the learning error of χ_t is calculated as Λ_t = \|χ_t - Hβ\|_2^2. If Λ_t is above 1.5Λ, the feature extraction network is updated. Note that the training samples used for updating the feature extraction model are the same as those used for the ELM classifier updating (see Section II-C). A small sketch of this retraining trigger is given below.
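A minimal sketch of the trigger, assuming the reconstruction error of Eq. (4) is computed from the hidden-layer outputs of the ELM-AE; the 1.5 factor follows the text, while the function names and toy numbers are illustrative assumptions.

```python
import numpy as np

def recon_error(x, h_x, beta):
    """Squared reconstruction error ||x - h(x) beta||_2^2 of one sample (cf. Eq. 4)."""
    return float(np.sum((x - h_x @ beta) ** 2))

def should_retrain(pos_errors, track_error, factor=1.5):
    """Trigger a full ELM-AE retraining when the error of the current
    tracking result exceeds 1.5 times the mean positive-sample error."""
    lam = float(np.mean(pos_errors))       # Lambda in Eq. (4)
    return track_error > factor * lam

# toy usage with made-up error values
print(should_retrain([0.8, 1.1, 0.9], 2.0))   # True: 2.0 > 1.5 * 0.933...
```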

B. Appearance Modeling

After image feature extraction, a discriminative appearance model (classifier) is used to maximize the separability among the obtained features. In this section, an ELM classifier is utilized to separate the object of interest from the background. From the perspective of classification stability, the classification capability of ELM ensures that, given any random mapping h(x), h(x)β can approximate the label function f(x) for any input sample x. For visual tracking, the ELM classifier uses only a single output node, and therefore the label function f(x) is a binary function.

Generating initial training samples: Generally, the object state is manually or automatically located in the first frame. We assume that the initial center of the object is J_0. As shown in Fig. 5, the initial positive samples are randomly generated within a circular area defined by \|J_0 - J_i\| < φ, where J_i is the center of the i-th sampled patch. Meanwhile, we draw the initial negative samples from an annular region defined by φ < \|J_0 - J_j\| < ϖ (φ and ϖ are the inner and outer radii, respectively). With the initial training samples, we extract the corresponding features by ELM-AE. Let C = \{c_1^p, ..., c_{N_p}^p, c_1^n, ..., c_{N_n}^n\} \in R^{N \times L_f} denote the extracted feature vectors, and T = \{t_1^p, ..., t_{N_p}^p, t_1^n, ..., t_{N_n}^n\} \in R^{N \times 1} be the corresponding class labels. Here, N_p and N_n are the numbers of positive and negative samples, respectively (N = N_p + N_n); L_f is the hidden node number of the ELM-AE, i.e., the dimension of the extracted image features.

ELM training: Let the hidden node number of the ELM classifier be L_c. The input hidden parameters w_j and b_j (j = 1, ..., L_c) are randomly assigned. Then, the hidden layer output matrix is calculated as H_i = \{h_j(c_i)\}_{j=1}^{L_c} (i = 1, ..., N), where h_j(c) = G(w_j, b_j, c) is the j-th hidden node output, and G(·) is the activation function. We utilize a regularized ELM with equality constraints, whose stability and generalization performance have been studied in detail in [25]:

    \min_{\beta} \; \frac{1}{2}\|\beta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{N} \|e_i\|^2
    \quad \text{s.t.} \quad h(c_i)\beta = t_i - e_i, \; i = 1, \ldots, N    (5)

Based on the KKT theorem [28], Eq. (5) is equivalent to solving the following optimization problem:

    L_{ELM} = \frac{1}{2}\|\beta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{N} \|e_i\|^2 - \sum_{i=1}^{N} \alpha_i \left( h(c_i)\beta - t_i + e_i \right)    (6)

In order to find the optimal solution of Eq. (6), we have the following KKT optimality conditions:

    \frac{\partial L_{ELM}}{\partial \beta} = 0 \;\rightarrow\; \beta = H^T \alpha
    \frac{\partial L_{ELM}}{\partial e_i} = 0 \;\rightarrow\; \alpha_i = \lambda e_i, \; i = 1, \ldots, N
    \frac{\partial L_{ELM}}{\partial \alpha_i} = 0 \;\rightarrow\; h(c_i)\beta = t_i - e_i, \; i = 1, \ldots, N    (7)

If the number of training data is larger than the number of hidden neurons (N > L_c), the closed-form solution for β is

    \beta^* = \left( H^T H + \frac{I}{\lambda} \right)^{-1} H^T T    (8)

Then, we have the final output decision function:

    f(x) = h(x)\beta^* = h(x) \left( H^T H + \frac{I}{\lambda} \right)^{-1} H^T T    (9)

If the number of training data is less than the number of hidden neurons (L_c > N), the closed-form solution for β is

    \beta^* = H^T \left( H H^T + \frac{I}{\lambda} \right)^{-1} T    (10)

The corresponding output decision function is:

    f(x) = h(x)\beta^* = h(x) H^T \left( H H^T + \frac{I}{\lambda} \right)^{-1} T    (11)
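The following sketch shows how both closed-form cases of Eqs. (8)–(11) might be implemented. The sigmoid activation, ±1 labels, and the default of 400 hidden nodes are taken from the experimental setup or assumed for illustration; the class is a sketch, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ELMClassifier:
    """Minimal regularized ELM with one output node (a sketch of Eqs. 8-11)."""
    def __init__(self, n_hidden=400, lam=1.0, seed=0):
        self.L, self.lam, self.rng = n_hidden, lam, np.random.default_rng(seed)
        self.W = self.b = self.beta = None

    def _h(self, C):
        return sigmoid(C @ self.W + self.b)            # hidden-layer output H

    def fit(self, C, t):
        N, d = C.shape
        self.W = self.rng.standard_normal((d, self.L))
        self.b = self.rng.standard_normal((1, self.L))
        H, I = self._h(C), np.eye
        if N > self.L:   # Eq. (8): beta = (H^T H + I/lam)^{-1} H^T T
            self.beta = np.linalg.solve(H.T @ H + I(self.L) / self.lam, H.T @ t)
        else:            # Eq. (10): beta = H^T (H H^T + I/lam)^{-1} T
            self.beta = H.T @ np.linalg.solve(H @ H.T + I(N) / self.lam, t)
        return self

    def decision(self, Y):
        return self._h(Y) @ self.beta                   # Eqs. (9)/(11): f(y) = h(y) beta

# toy usage: features in R^100, labels in {+1, -1}
rng = np.random.default_rng(1)
C = rng.standard_normal((50, 100)); t = np.sign(rng.standard_normal(50))
scores = ELMClassifier(n_hidden=400).fit(C, t).decision(C)
print(scores.shape)   # (50,)
```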

Object locating: For visual tracking, only the object state in the first frame is available. Therefore, there are not sufficient training samples in the first frame. On the other hand, in order to achieve good generalization performance, the number of hidden nodes of the ELM classifier should be larger than the number of training samples. Thus, for the first few frames, we apply Eq. (11) to train the ELM classifier. Let the candidate sample features be Y = \{y_1, ..., y_M\} \in R^{M \times L_f}, where M is the number of candidate samples.


Fig. 5. The proposed online appearance model.

With the same hidden weights w and b used in ELM training, we calculate the outputs of the hidden neurons as h(y_i) = G(w, b, y_i). Using the resulting β^* from Eq. (10), we obtain the confidence score p_i of the i-th candidate sample as:

    p_i = f(y_i) = h(y_i)\beta^*    (12)

The position state of the i-th candidate sample is denoted s_i. Based on the confidence scores, the candidate sample states are ranked in descending order as {s_1, s_2, ..., s_M}, with corresponding confidence scores {p_1, p_2, ..., p_M}. To reduce the effect of the inevitable classification error, we compute the final object state using a small set of top-ranked candidate samples rather than only the single sample with the highest score. In practice, we choose one twentieth of the candidate samples, and the weight of the i-th sample is

    w_{s_i} = p_i \Big/ \sum_{j=1}^{0.05 \times M} p_j    (13)

Thus, the final estimated object state \hat{s} is obtained as:

    \hat{s} = \sum_{i=1}^{0.05 \times M} s_i \times w_{s_i}    (14)
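A small sketch of this weighted localization step (Eqs. 12–14), assuming the confidence scores are positive; the function name and toy states are illustrative only.

```python
import numpy as np

def locate_object(states, scores, top_frac=0.05):
    """Estimate the object state from candidate states and their ELM
    confidence scores (Eqs. 12-14): average the top 5% of candidates,
    weighted by their normalized scores."""
    states = np.asarray(states, dtype=float)       # (M, state_dim)
    scores = np.asarray(scores, dtype=float)       # (M,) from Eq. (12)
    k = max(1, int(round(top_frac * len(scores))))
    top = np.argsort(scores)[::-1][:k]             # indices of the k best candidates
    w = scores[top] / scores[top].sum()            # Eq. (13)
    return (w[:, None] * states[top]).sum(axis=0)  # Eq. (14)

# toy usage: 200 candidates with 2-D position states
rng = np.random.default_rng(0)
states = rng.random((200, 2)) * 100
scores = rng.random(200)
print(locate_object(states, scores))
```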

Re-training: Considering that the number of initial training samples from the first frame is not sufficient, the trained classifier may not have good adaptability. Based on the estimated object state \hat{s} during the first few frames, we adopt the same strategy as in the first frame to collect new training data, thereby enriching the initial training samples. In the experiments, we uniformly collect 500 positive and 500 negative training samples from the initial five frames. Note that, since the number of expanded training samples is often larger than the number of hidden nodes, Eq. (9) should be used as the retrained classifier's output decision function, and the output weights β^* from Eq. (8) are used for object locating in the following frames.

C. Updating of Appearance Model

During tracking, we use a particle filter to obtain candidate samples.


Due to the changes of both the target and the environment, the discriminative tracker should adaptively update its appearance model (classifier).

OS-ELM updating: For the ELM classifier, the number of hidden neurons is manually defined, and the input hidden parameters (w and b) are randomly generated. Once the tracking task begins, these parameters are fixed and never changed. To adapt to the visual changes of the object and background, we update the output weights β. A simple way to update β is to retrain the classifier using the whole training set (including the old samples). In this case, the updating process re-trains the appearance model with an ever-increasing amount of training data, which is memory-wasting and time-consuming. For tracking efficiency, sequential learning is preferable, and in our work we adopt OS-ELM to update the classifier of the appearance model. Suppose that we already have training sample features ℵ_0 = \{c_i, t_i\}_{i=1}^{N_0}. As mentioned in the ELM re-training stage, N_0 is larger than the number of hidden neurons. The initial output weights β_0 can be calculated by

    \beta_0 = \left( \frac{I}{\lambda} + H_0^T H_0 \right)^{-1} H_0^T T_0 = P_0^{-1} H_0^T T_0    (15)

where H_0 denotes the initial hidden layer output matrix, and T_0 is the initial class label set. Let the new training samples be ℵ_1 = \{c_i, t_i\}_{i=N_0+1}^{N_0+N_1}, where N_1 denotes the number of new samples. With the calculated partial hidden-layer output matrix H_1 and the new class label set T_1, the output weights can be updated by

    \beta_1 = P_1^{-1} [H_0; H_1]^T [T_0; T_1]    (16)

where P_1 = \frac{I}{\lambda} + [H_0; H_1]^T [H_0; H_1] = P_0 + H_1^T H_1, and [H_0; H_1] denotes the row-wise stacking of H_0 and H_1. In [21], according to Eqs. (15) and (16), the incremental formulation of the output weights β is obtained as follows:

    P_1^{-1} = P_0^{-1} - P_0^{-1} H_1^T \left( I + H_1 P_0^{-1} H_1^T \right)^{-1} H_1 P_0^{-1}    (17)

    \beta_1 = \beta_0 + P_1^{-1} H_1^T \left( T_1 - H_1 \beta_0 \right)    (18)

From the derivation above, one can see that the online updating of β achieves the same result as batch learning with the whole training data. A sketch of this recursion, together with a check against batch training, is given below.
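The following NumPy sketch implements Eqs. (15), (17), and (18) and verifies numerically that the sequential solution matches the batch solution on the pooled data. The chunk sizes, λ = 1, and random data are illustrative assumptions, not the tracker's actual inputs.

```python
import numpy as np

def os_elm_init(H0, T0, lam=1.0):
    """Initial batch solution (Eq. 15): P0 = I/lam + H0^T H0, beta0 = P0^{-1} H0^T T0."""
    L = H0.shape[1]
    P0 = np.eye(L) / lam + H0.T @ H0
    beta0 = np.linalg.solve(P0, H0.T @ T0)
    return np.linalg.inv(P0), beta0               # keep P0^{-1} for the recursion

def os_elm_update(P_inv, beta, H1, T1):
    """Incremental update with a new chunk of samples (Eqs. 17-18)."""
    K = np.eye(H1.shape[0]) + H1 @ P_inv @ H1.T
    P_inv = P_inv - P_inv @ H1.T @ np.linalg.solve(K, H1 @ P_inv)   # Eq. (17)
    beta = beta + P_inv @ H1.T @ (T1 - H1 @ beta)                   # Eq. (18)
    return P_inv, beta

# sanity check: the sequential result matches batch training on all data
rng = np.random.default_rng(0)
H0, T0 = rng.random((500, 50)), rng.random(500)
H1, T1 = rng.random((100, 50)), rng.random(100)
P_inv, beta = os_elm_init(H0, T0)
P_inv, beta = os_elm_update(P_inv, beta, H1, T1)
H, T = np.vstack([H0, H1]), np.concatenate([T0, T1])
beta_batch = np.linalg.solve(np.eye(50) + H.T @ H, H.T @ T)   # lam = 1
print(np.allclose(beta, beta_batch))   # True
```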


Selection of the updating samples: The selection of updating samples is significant for classifier updating, as "bad" samples will introduce noise into the appearance model and the tracker may drift away. We propose a simple yet effective selection scheme for the updating samples, which consist of positive samples and negative ones, as described below. It should be mentioned that, unlike other tasks (detection or segmentation) with a clearly defined target category, in tracking the reliable target information is only given in the first frame, which is quite limited. A robust tracker should fully exploit the reliable initial target information while also handling target appearance changes. To this end, we employ two different kinds of data in the set of positive samples: fixed samples and dynamic samples. The fixed ones are collected from the samples in the first frame, denoted as Γ_1 = [χ_1^f, χ_2^f, ..., χ_{N_f}^f]. During the tracking process, we keep Γ_1 unchanged to prevent the tracker from being dominated by recent training samples with noise. Meanwhile, to adapt to the visual changes, we denote the dynamic samples as Γ_2 = [χ_1^s, χ_2^s, ..., χ_{N_s}^s], which are updated with the latest tracking results. Let the t-th located object result be χ_t, and its confidence score be ς. If ς > ς_0 (ς_0 is a pre-defined value), we assume that χ_t agrees well with the overall positive samples. Then, we calculate

    disc = \tau \times dis(\chi_t, \chi_1^f) + (1 - \tau) \times dis(\chi_t, \chi_{N_s}^s)    (19)

where dis(·) denotes a distance function (e.g., Euclidean), χ_1^f is the initial tracking result, and χ_{N_s}^s is the latest updating sample of Γ_2. τ is a positive parameter that controls the balance between χ_1^f and χ_{N_s}^s, defined as

    \tau = \frac{1}{5}\left( 1 + \frac{1}{25} \cdot \frac{e^{-t/100}}{1 - e^{-t/100}} \right)

where t denotes the current frame index. One can see that τ is close to 1 at the beginning. As time passes, τ becomes smaller, indicating that the initial frame has less impact on the tracking result. If disc ≤ t_r (t_r is also a pre-defined value), we draw some positive samples near the result χ_t for updating Γ_2. The updating of Γ_2 follows a "first in, first out" rule, which means that the "oldest" sample is dropped when Γ_2 is updated. For the negative sample set Γ_3, we draw negative samples far away from the center of the "selected" tracking result in the most recent frame. Since the backgrounds of recent successive frames are similar, the Γ_3 set can well represent the non-object patches among the candidate samples. From the analysis above, the proposed selection scheme considers both the latest and the original observations. Thus, it is expected to adapt to appearance changes effectively and to alleviate the drifting problem. A sketch of this selection rule is given below.
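A compact sketch of this selection rule. The confidence threshold 0.7, distance threshold 0.4, and the capacity of 90 dynamic samples are taken from the experimental setup in Section III-A; the constants inside tau() follow our reading of the definition above and should be treated as an assumption.

```python
import numpy as np

def tau(t):
    """Time-decaying balance weight: close to 1 in early frames and decaying
    towards 0.2 as t grows (our reading of the definition of tau above)."""
    e = np.exp(-t / 100.0)
    return 0.2 * (1.0 + (1.0 / 25.0) * e / (1.0 - e))

def accept_for_update(chi_t, score, chi_first, chi_latest, t,
                      score_thr=0.7, dist_thr=0.4):
    """Decide whether the tracking result chi_t of frame t is a reliable
    positive updating sample (confidence test, then the Eq. 19 distance test)."""
    if score <= score_thr:                            # varsigma_0 in the text
        return False
    chi_t, chi_first, chi_latest = (np.asarray(a, dtype=float)
                                    for a in (chi_t, chi_first, chi_latest))
    w = tau(t)
    disc = (w * np.linalg.norm(chi_t - chi_first)
            + (1.0 - w) * np.linalg.norm(chi_t - chi_latest))
    return disc <= dist_thr                           # t_r in the text

def push_dynamic(gamma2, chi_t, max_size=90):
    """FIFO update of the dynamic positive set Gamma_2."""
    gamma2.append(chi_t)
    if len(gamma2) > max_size:
        gamma2.pop(0)                                 # drop the oldest sample
    return gamma2

print(round(tau(1), 3), round(tau(1000), 3))          # ~0.996 at the start, ~0.2 later
```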

D. Discussions

It should be noted that the essence of our proposed tracker is the application of the ELM technique to the tracking framework. From the learning point of view, unlike traditional learning models, ELM has better generalization performance with a much faster learning speed. On the other side, visual tracking needs an efficient learning tool in real-time applications. Thus, for efficiency and effectiveness, the ELM model is preferred in visual tracking, as discussed below.

Difference from traditional learning-based features: Several feature learning schemes exist in recent visual trackers: the object is represented by hashed binary codes in [4], and dictionary learning is used to extract image features in [13]. In our work, we utilize a novel and efficient feature learning model, i.e., ELM-AE, and achieve better discriminative capability, as demonstrated in the experiments. Apart from the above feature extractors, the autoencoder is an emergent feature learning technique. Mathematically, ELM-AE has a structure similar to traditional autoencoders, e.g., the denoising autoencoder (DAE) [14]: the input x is mapped into a latent representation ψ through a nonlinear function ψ = h_{W,b}(x), and the resulting ψ is used to reconstruct the original input via x = h_{W',b'}(ψ). The autoencoder usually uses a back-propagation (BP) method for network training. Unlike the existing autoencoders [14], [15], ELM-AE randomly generates the hidden neuron parameters {W, b} to obtain the latent representation ψ, and approximates the input x with optimization methods. Without the BP-based parameter tuning procedure, ELM-AE is simple and efficient for feature learning. Besides, according to the ELM theories [20], ELM-AE also has the universal approximation capability and is free from the curse of local minima. Moreover, the sparsity constraint on the output weights can prune insignificant hidden units while retaining the more meaningful hidden features. This architecture helps to reduce the number of neural nodes and thus further improves the feature mapping efficiency.

Difference from the LS-SVM classification model: The performance of a discriminative tracker heavily depends on its feature classification capability. LS-SVM is one of the most popular classifiers in visual tracking [4], [13], due to its closed-form solution. The optimization of LS-SVM is formulated as:

    \min_{w,b} \; \frac{1}{2}\|w\|^2 + \frac{\lambda}{2} \sum_{i=1}^{N} \|e_i\|^2
    \quad \text{s.t.} \quad t_i \left( w \cdot \phi(x_i) + b \right) = 1 - e_i, \; i = 1, \ldots, N    (20)

Here, φ(x) is the implicit mapping of the input x. Based on the KKT theorem [28], the above problem is equivalent to:

    L_{LS\text{-}SVM} = \frac{1}{2}\|w\|^2 + \frac{\lambda}{2} \sum_{i=1}^{N} \|e_i\|^2 - \sum_{i=1}^{N} \alpha_i \left( t_i \left( w \cdot \phi(x_i) + b \right) - 1 + e_i \right)    (21)

In order to find the optimal solution of Eq. (21), we have

    \frac{\partial L_{LS\text{-}SVM}}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i t_i \phi(x_i)
    \frac{\partial L_{LS\text{-}SVM}}{\partial e_i} = 0 \;\rightarrow\; \alpha_i = \lambda e_i
    \frac{\partial L_{LS\text{-}SVM}}{\partial \alpha_i} = 0 \;\rightarrow\; t_i \left( w \cdot \phi(x_i) + b \right) = 1 - e_i
    \frac{\partial L_{LS\text{-}SVM}}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i t_i = 0    (22)

The optimization of LS-SVM can then be represented as

    \begin{bmatrix} 0 & T^T \\ T & \frac{I}{\lambda} + Z Z^T \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}    (23)

where T = [t_1, ..., t_N]^T denotes the class labels, and Z = [t_1 φ(x_1), ..., t_N φ(x_N)]^T. Unlike LS-SVM, the proposed tracking framework adopts ELM for feature classification, which performs much better than LS-SVM. Firstly, in the LS-SVM model, the mapping function φ(·) is often implicit and unknown. Thus, it is difficult to study the universal approximation capability of LS-SVM without knowing φ(·) [25], and it often suffers from the curse of local minima. In contrast, all of the parameters in the ELM feature mapping h(·) are randomly generated, and h(x) is explicitly known to the user. The universal approximation and classification capabilities of ELM theoretically guarantee a better classification result. Secondly, as mentioned above, the bias term b can be removed in the ELM classifier with the guarantee of its universal approximation capability.


To this end, by comparing the KKT optimality conditions (Eqs. 7 and 22), we can see that LS-SVM has one additional condition, \sum_{i=1}^{N} \alpha_i t_i = 0, which is the partial derivative with respect to the bias term b. That is to say, for the LS-SVM based tracker, the hyperplane parameters α_i are more strongly dominated by the class labels t_i. However, the tracking application cannot be treated as a pure binary classification problem, due to the existence of class label errors. Slight inaccuracies during tracking may introduce label errors, and these errors will further affect the classification precision of the LS-SVM hyperplane, leading to further tracking inaccuracy. In contrast, ELM does not have such a hyperplane constraint, since its Lagrange multipliers α_i are not directly related to the class labels t_i. Thus, ELM can yield better tracking performance than the LS-SVM model in the presence of label errors. Thirdly, the main computational cost of LS-SVM comes from the matrix inversion for computing the parameters α_i in Eq. (23), which involves Z Z^T \in R^{N \times N}. However, in most cases, the hidden node number L can be smaller than the number of training samples, L < N. For instance, during the ELM re-training stage, the number of selected training samples is larger than the number of hidden nodes, and in this case we apply Eq. (8) to compute the output weights β, which uses H^T H \in R^{L \times L}. Thus, the computational cost of ELM is reduced dramatically. As for LS-SVM, the fixed-size LS-SVM [29] has been proposed to reduce the computational cost. However, the fixed-size LS-SVM only uses a subset of the training data, while the ELM solution of Eq. (8) can use all the training data. Compared with fixed-size LS-SVM, ELM can therefore achieve better performance.
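To make the cost argument concrete, the toy comparison below contrasts the L×L system of Eq. (8) with an N×N LS-SVM-style dual system built on the same hidden features. The dual here is deliberately simplified (no bias/equality constraint, linear kernel on the hidden features) purely to contrast the sizes of the two linear systems; the sizes, timings, and data are illustrative assumptions.

```python
import numpy as np, time

rng = np.random.default_rng(0)
N, L = 3000, 400                        # many samples, moderate hidden-layer size
H = rng.standard_normal((N, L))         # hidden-layer outputs h(c_i)
T = np.sign(rng.standard_normal(N))     # binary labels
lam = 1.0

t0 = time.time()                        # ELM primal form, Eq. (8): an L x L solve
beta = np.linalg.solve(H.T @ H + np.eye(L) / lam, H.T @ T)
t_elm = time.time() - t0

t0 = time.time()                        # simplified LS-SVM-style dual: an N x N solve
K = H @ H.T                             # linear kernel on the same features
alpha = np.linalg.solve(K + np.eye(N) / lam, T)
t_dual = time.time() - t0

print(f"L x L solve: {t_elm:.3f}s   N x N solve: {t_dual:.3f}s")
```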

Fig. 6. The OPE performance for different activation functions.

Fig. 7. The performance of different parameters on 51 sequences. (a) Average precision rate at the 20-pixel error threshold. (b) Average frames per second (Fps).

III. EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we evaluate the performance of the proposed tracker on 51 challenging sequences provided by the Online Tracking Benchmark, which involve 11 common challenging attributes. Our tracker is compared with 16 representative state-of-the-art tracking methods (Struck [5], ASLA [30], SCM [31], TLD [3], L1APG [32], M-ELM [22], LS-SVM [4], CSK [33], KCF [34], MEEM [35], FCNT [36], LSDM-CNN [37], ECO [17], Dsiamese [18], DLT [15], and CNT [16]). For all the compared trackers, we use the original parameters and source code provided on OTB or the authors' websites. It should be mentioned that MEEM follows an ensemble framework, while FCNT, LSDM-CNN, ECO, Dsiamese, DLT, and CNT are equipped with deep features. Thus, we follow the manner of the MEEM tracker and also implement a Multi-Experts ELM tracker for fair comparison. Specifically, four ELM classifiers are employed to predict the target position in each frame. All the experiments are conducted in MATLAB R2014a on a 2.20 GHz CPU, 2 GB RAM, Win7 x86 system.

A. Experimental Setup

In this work, the activation function G(·) is set to be the sigmoid function. For the ELM-based feature extraction, we set the number of hidden nodes to 100. In Eq. (2), the regularization parameter µ is 0.1. Considering that the Online Tracking Benchmark contains both color and grayscale sequences, we convert all RGB images to grayscale before feeding them into the ELM-AE network. Besides, all the patches are rescaled to 32 × 32 to speed up the tracker. As for the ELM classifier parameters, the number of hidden nodes needs to be set manually; in our experiments, we found 400 to be appropriate. For the classifier initialization, during the initial five frames we uniformly select 500 positive samples and 500 negative samples to train the ELM classifier. For the updating sample selection, the confidence score threshold ς_0 is 0.7, and the other updating threshold t_r is 0.4. The appearance model is incrementally updated every ten frames via OS-ELM. The total number of positive updating samples is 100, of which 10 are fixed and 90 are dynamic. In addition, the number of negative updating samples is also 100. For object locating, 200 candidate samples are generated in each frame for the single ELM-based tracker, while 400 samples are generated for the Multi-Experts ELM tracker to boost the localization precision. To cope with possible rotation, we model the image sample state via the six parameters of an affine transformation, {θ_1, θ_2, θ_3, θ_4, θ_5, θ_6}, where {θ_1, θ_2} denote the 2-D position changes, and {θ_3, θ_4, θ_5, θ_6} are the rotation angle, scale, aspect ratio, and skew, respectively. We fix the affine parameters to [4, 4, 0.1, 0, 0, 0] for all the videos for fair comparison. A sketch of the candidate sampling is given below.
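A minimal sketch of the candidate sampling, assuming that the six affine parameters are perturbed with independent Gaussian noise whose standard deviations follow the fixed setting above; the Gaussian perturbation itself and the toy state are assumptions for illustration.

```python
import numpy as np

def sample_candidates(state, n=200, std=(4, 4, 0.1, 0, 0, 0), seed=0):
    """Draw candidate states around the current one by perturbing the six
    affine parameters (x, y, rotation, scale, aspect ratio, skew) with
    independent Gaussian noise; std follows the fixed setting in the text."""
    rng = np.random.default_rng(seed)
    state = np.asarray(state, dtype=float)               # (6,)
    return state + rng.standard_normal((n, 6)) * np.asarray(std, dtype=float)

# toy usage: current state at (x, y) = (120, 80), unit scale parameters
cands = sample_candidates([120, 80, 0.0, 1.0, 1.0, 0.0])
print(cands.shape)   # (200, 6)
```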

B. Parameter Analysis

Many activation functions can be used to compute the hidden layer neurons' outputs, such as the sigmoid, sine, Gaussian, tanh, and hard-limiting functions. According to the ELM theory, any bounded non-constant piecewise continuous function used as the activation function satisfies the ELM universal approximation capability theorems.


Fig. 8. The success plots and precision plots of OPE. The proposed trackers are compared with 16 representative trackers on 51 challenging sequences, grouped by tracker category. The success and precision scores are the values shown in the legend.

We conduct experiments using different activation functions, as shown in Fig. 6. One can see that the proposed method obtains similar performance with different activation functions. In our simulations, the sigmoid function is chosen as the activation function. In addition, the proposed tracker has two significant parameters: the number of hidden nodes in the ELM-AE based feature extraction and the number of hidden nodes in the ELM-based feature classification. To avoid confusion, we denote the former by L_f and the latter by L_c. We conduct 30 experiments with L_f ∈ {50, 100, 200, 300, 400, 600} and L_c ∈ {100, 200, 400, 800, 1000}. In Fig. 7, the average precision rate and the average Fps are used to measure the performance of the different parameter settings on the sequences provided by OTB50 [6]. Note that L_f denotes the dimension of the extracted image feature, while L_c relates to the capacity (Vapnik–Chervonenkis dimension) of the ELM classifier. Technically speaking, there is no principled way to set the values of L_f and L_c, so they are determined by a trial-and-error procedure. Generally, a larger L_f (or L_c) leads to lower real-time performance, as shown in Fig. 7(b). Considering both accuracy and efficiency, we found that L_f = 100 gives a good result; in this case, our tracker achieves the optimal result when L_c = 400.

C. Quantitative Comparisons

In this subsection, we measure the tracking accuracy of the proposed algorithm against the other trackers. Two metrics (the success plot and the precision plot [6]) are utilized to evaluate the performance of the different trackers. The success plot is based on the overlap ratio S = Area(B_T ∩ B_G)/Area(B_T ∪ B_G), where B_T is the tracking result and B_G denotes the ground truth. The success plot shows the percentage of frames with S > t over all thresholds t ∈ [0, 1]. The area under the curve (AUC) of each plot serves as the measure to rank the trackers. Meanwhile, the precision plot illustrates the percentage of frames whose tracked locations are within a given threshold (20 pixels) of the ground truth. A sketch of these two metrics is given below.
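A minimal sketch of the two evaluation metrics (overlap-based success and 20-pixel precision); the (x, y, w, h) box format and the toy example are assumptions for illustration.

```python
import numpy as np

def overlap(bt, bg):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    x1, y1 = max(bt[0], bg[0]), max(bt[1], bg[1])
    x2 = min(bt[0] + bt[2], bg[0] + bg[2])
    y2 = min(bt[1] + bt[3], bg[1] + bg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = bt[2] * bt[3] + bg[2] * bg[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(overlaps, thresholds=np.linspace(0, 1, 101)):
    """Fraction of frames with overlap S > t for each threshold t; the area
    under this curve (AUC) is the score used to rank the trackers."""
    overlaps = np.asarray(overlaps, dtype=float)
    return np.array([(overlaps > t).mean() for t in thresholds])

def precision_at(center_errors, thr=20.0):
    """Fraction of frames whose center location error is within thr pixels."""
    return float((np.asarray(center_errors) <= thr).mean())

# toy usage
print(overlap((0, 0, 10, 10), (5, 5, 10, 10)))   # 25 / 175 = 0.1428...
```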

Fig. 9. Stability validation for the proposed ELM trackers on OTB-50.

It should be noted that we aim at constructing a tracker that is effective and efficient at the same time, while some trackers focus more on accuracy than on tracking efficiency. We therefore argue that a direct comparison would be unfair. Thus, we compare the single ELM-based tracker with the traditional trackers, while the Multi-Experts ELM tracker is compared with the deep trackers and the ensemble tracker. In the experiments, we repeat our simulations 50 times and report the average performance, due to the slight fluctuations in each tracking run. Fig. 8 shows the overall performance of the proposed single-ELM and Multi-Experts ELM based trackers against the other tracking algorithms. The fluctuations of the proposed tracking results (over the 50 runs) are depicted in Fig. 9 to further demonstrate the robustness and stability of our trackers. According to Fig. 8, the single ELM-based tracker ranks first in terms of both success rate and precision against the other traditional trackers. The Multi-Experts ELM tracker yields competitive performance against the Dsiamese and FCNT trackers, while performing inferiorly to the ECO tracker. However, even though ECO and Dsiamese achieve state-of-the-art performance, they still suffer from slow running speed when implemented on a CPU platform like the other trackers, as shown in Table I. On the other hand, the proposed Multi-Experts ELM tracker outperforms all the other trackers while running at 10 frames per second.


We also report an attribute-based evaluation to further demonstrate the effectiveness of the proposed tracker; please refer to the supplementary materials for more details.

D. Key Components Validation

In this subsection, we validate the key components of the proposed tracker: ELM-AE based feature extraction and ELM based feature classification. First, to verify the capability of ELM-AE based feature extraction, we compare it with five other feature extractors, i.e., raw pixels, Haar-like features, HOG+LBP as in [22], DAE [15], and CNN [16]. Note that the only difference lies in the feature extractors, while the classifier and updating strategy of the proposed tracker are applied to all these variants for fair comparison. Table II shows the performance of the compared trackers (denoted as "Proposed-S", "Proposed-M", "Raw", "Haar", "HOG+LBP", "DAE", and "CNN"). For our proposed feature extraction, the more important hidden features are retained by the sparsity constraint on the output weights. Thus, the resulting features are more discriminative and informative than raw pixels, which improves the robustness (in terms of success and precision rates) of our trackers. Besides, due to the pruning of insignificant hidden features, the dimension of our learned features can be smaller than that of the original raw pixels; therefore, the Fps of our tracker ("Proposed-S") is higher than that of the one using raw pixels. For Haar-like features, it is not straightforward to add the affine motion model in tracking, and thus the performance on some sequences involving rotation (e.g., Boy, Freeman3) is even worse than using raw pixels with the affine motion model. In addition, the Haar-like features are sensitive to occlusion, as in the Jogging2 and Suv sequences. Combining different handcrafted features (HOG+LBP) [22] can improve the robustness of feature extraction, at the price of high computation (only 10 Fps), which cannot satisfy the requirement of real-time tracking. Since the iterative fine-tuning of the DAE or CNN model is complicated, online DAE or CNN is difficult for real-time tracking. Different from the offline DAE model [15], the proposed feature extractor is trained on the current object and updated online when necessary, retaining meaningful hidden features and thereby leading to a better result than the offline DAE model. In the CNT tracker [16], to accelerate feature learning, unsupervised clustering is adopted to generate online local templates for image patch convolution. However, without training on a large amount of auxiliary data, the resulting features in CNT may not be able to adapt to significant appearance changes during tracking.

Afterwards, in order to verify the performance of ELM-based feature classification, we implement trackers with the same ELM-AE features but different classifiers, including the single ELM, Multi-Experts ELM, SVM, LS-SVM, and Online Multi-Experts SVM, denoted as "Proposed-S", "Proposed-M", "SVM", "LS-SVM", and "Online-SVM" in Table II. The SVM and LS-SVM follow the manner in [4], [13], and are either re-trained every four frames or updated incrementally to deal with the visual changes of the target, for fair comparison.


Furthermore, we also compare the proposed Multi-Experts ELM with the online Multi-Experts SVM extracted from [35] for completeness. Theoretically, the computation of SVM is more expensive than that of LS-SVM; therefore, the Fps of LS-SVM is higher than that of SVM, at the price of lower tracking accuracy. As discussed in Section II-D, the solution of LS-SVM/SVM suffers from local minima and is more sensitive to possible label errors than the ELM model. Due to these two points, the LS-SVM/SVM models show only moderate performance, especially on sequences with label noise. Compared with LS-SVM/SVM, the ELM model has more stable convergence with respect to the number of training samples. Thus, the overall computational burden of ELM is lower than those of the LS-SVM/SVM models, leading to faster tracking. According to Table II, the proposed Multi-Experts ELM tracker also outperforms the Online-SVM tracker in terms of both performance and running speed.

E. Qualitative Evaluation

To better visualize the tracking performance of the proposed method, we provide a qualitative comparison of our approach with several discriminative trackers, including Struck [5], ASLA [30], SCM [31], TLD [3], L1APG [32], M-ELM [22], LS-SVM [4], CSK [33], KCF [34], DLT [15], and CNT [16]. Several video sequences containing a variety of challenging attributes are selected from OTB50 to present the tracking performance of the different trackers, as shown in Fig. 10.

1) Heavy Occlusion: Occlusion is a big challenge for visual tracking, as it destroys the holistic appearance of the target. As shown in Fig. 10(e, h), we test two sequences (Jogging2, Suv) with severe occlusion. Compared with other trackers, the proposed method achieves excellent results due to the following two aspects: (1) the ELM-AE features have good discriminative capability; (2) the sample selection in the updating process avoids degradation of the classification performance by rejecting inappropriate samples. In contrast, the existing features (e.g., Haar-like, LBP) adopted in other trackers are less effective when similar objects occlude each other. Thus, those trackers easily lose the tracked object under occlusion by a similar object. At frame #73 of Jogging2, Struck, SCM, CNT, and the proposed tracker lock onto the object, while the other trackers drift away at the traffic light. Even though TLD is able to re-locate the target after occlusion, it is sensitive to similar object disturbance (e.g., Jogging2 #84). For Suv, when the occlusion is removed, most of the compared trackers lose the tracked object (see Suv #573). Since the original observations are always considered in the updating samples, our tracker alleviates this drifting problem.

2) Illumination Change and Background Clutter: For visual tracking, illumination change is the most common challenge. Drastic illumination variation affects features based on pixel intensity (e.g., color histograms, Haar-like features). Meanwhile, background clutter makes it very difficult to determine the accurate state of the moving object. Fig. 10(b, g) presents the tracking results on two sequences (Singer1, Car4) to evaluate whether the proposed tracker can deal with these two issues. In Car4, we track a moving car with drastic illumination change, scale change, and background noise.

Only our tracker and the CNT tracker successfully track the object throughout the entire sequence. When the car passes under a bridge (see Car4 #198), the other trackers cannot cope with the illumination changes as well as the scale changes; the accumulating tracking errors make them lose the accurate state of the object. For Singer1, there are large changes in environmental illumination and object scale. L1APG loses the object when drastic illumination changes occur (e.g., Singer1 #145). In contrast, Struck, KCF, M-ELM, ASLA, and CSK keep tracking the object, but cannot cope with the scale changes. With the discriminative ELM-AE features and the effective updating of the appearance model via OS-ELM, the proposed tracker deals with the above challenges and performs better than the other trackers.

3) Fast Motion and Motion Blur: When the tracked object undergoes abrupt motion, it is difficult to accurately locate its position and deal with the subsequent motion blur. Fig. 10(a, f) presents the tracking results on these sequences (Boy and Jumping). It can be seen that the proposed tracker performs better than the other methods. The object in Boy undergoes fast motion and serious motion blur. Likewise, in Jumping, the face moves quickly between frames, and the background clutter makes it difficult to distinguish the face from the background. Struck, CNT, KCF, and the proposed tracker perform more stably than the other trackers. Even though the DLT tracker learns features offline, it cannot adapt to the fast movement of the target. The proposed method is better when serious motion blur occurs (e.g., Jumping #54). This can be attributed to the fact that ELM performs better than SVM-like models in image classification, especially in the presence of label noise.

4) Scale Change and Background Clutter: Fig. 10(d) shows the tracking results on the Freeman3 sequence, which contains large scale change and background clutter. In Freeman3, ASLA, SCM, DLT, and the proposed method are able to deal with the scale changes, while most of the other trackers drift as label errors accumulate (Freeman3 #341, #392, #450). Meanwhile, KCF cannot deal with scale variation due to its fixed-size bounding box. CNT drifts from the target at #392, as its features, obtained by convolution with online local templates, lack robustness to fast scale changes.

TABLE I
PERFORMANCE COMPARISON OF TRACKING EFFICIENCY (IN TERMS OF FPS) AGAINST 12 REPRESENTATIVE TRACKERS ON A CPU PLATFORM.

Tracker: Proposed-S | Proposed-M | Struck | ASLA | SCM | TLD | L1APG | MEEM | LSDM | FCNT | DLT | CNT | ECO | Dsiamese
Fps:     22         | 10         | 4      | 8.5  | 0.5 | 18  | 2     | 8.5  | 1    | 1    | 1.5 | 5   | 0.8 | 2.4

TABLE II
THE VALIDATION OF KEY TECHNICAL COMPONENTS OF THE PROPOSED TRACKER IN TERMS OF OVERALL SUCCESS RATE AND PRECISION. THE LAST ROW SHOWS THE RESULTS OF TRACKING EFFICIENCY (IN TERMS OF FPS).

Metric       | Proposed-S | Proposed-M | Raw  | Haar | HOG+LBP | DAE  | CNN  | SVM  | LS-SVM | Online-SVM
Success rate | 0.59       | 0.65       | 0.42 | 0.48 | 0.55    | 0.52 | 0.57 | 0.55 | 0.53   | 0.57
Precision    | 0.76       | 0.86       | 0.53 | 0.60 | 0.70    | 0.72 | 0.74 | 0.71 | 0.67   | 0.84
Fps          | 22         | 10         | 19   | 13   | 10      | 8    | 6    | 10   | 13     | 8.5

5) In-plane Rotation and Out-of-plane Rotation: During the tracking process, rotation results in shape deformation and visual changes. Fig. 10(a, c) shows the tracking results on two sequences (Boy, Freeman1) with rotation variation, pose changes, and scale changes. We can see that the proposed tracker works well in these cases. Among the other trackers, L1APG, SCM, and ASLA cannot adapt to the severe rotation and gradually drift away (see Boy #270, #387). In Freeman1, SCM and the proposed tracker can cope with the frequent rotation (e.g., #277), while the other trackers fail to locate the object even though they use online updating to learn the different appearances of the target. CNT drifts at #138, since the current target appearance differs greatly from the initial one, and target features extracted only from the first frame can hardly handle such challenging appearance changes. On the other hand, with the affine motion model, new object observations with rotation variations can still be tracked and incorporated into the appearance model via online classifier updating. It can be proved that OS-ELM updating with new observations is equivalent to ELM training with the whole set of observations. Thus, for online tracking, the proposed tracker has strong adaptability to visual changes. In addition, the selection of updating samples always considers the original observation, preventing the tracker from being dominated by recent updating samples with noise.

IV. CONCLUSION

This paper presents an efficient and robust tracking algorithm by exploiting the learning and classification capabilities of ELM. First, a novel ELM feature representation method is advocated for extracting meaningful image features. We have shown that the extracted ELM-AE features are informative and discriminative, thereby facilitating the tracking performance. Second, to perform object/non-object classification, the ELM classifier is utilized to build an efficient and effective appearance model. Furthermore, the OS-ELM technique is used to update the appearance model, which can quickly adapt to the visual changes of the object appearance. Moreover, the discussions on the adopted ELM feature representation and classification fully show the insights of our tracker. Numerous experiments conducted on the benchmark have demonstrated the effectiveness and robustness of the proposed tracker.

Fig. 10. Representative tracking results on some challenging sequences: (a) Boy with fast motion and in-plane rotation; (b) Car4 with illumination variation and scale variation; (c) Freeman1 with in-plane and out-of-plane rotation; (d) Freeman3 with scale variation; (e) Jogging2 with severe occlusion and deformation; (f) Jumping with fast motion and motion blur; (g) Singer1 with scale variation and illumination variation; (h) Suv with severe occlusion and out-of-view.

REFERENCES

[1] B. Babenko, M.-H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619–1632, 2011.
[2] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Proc. ECCV. Springer, 2012, pp. 864–877.
[3] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, pp. 1409–1422, 2012.
[4] X. Li, C. Shen, A. Dick, and A. Hengel, "Learning compact binary codes for visual tracking," in Proc. IEEE CVPR, 2013, pp. 2419–2426.
[5] S. Hare, A. Saffari, and P. H. Torr, "Struck: Structured output tracking with kernels," in Proc. IEEE ICCV, 2011, pp. 263–270.
[6] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in Proc. IEEE Conf. CVPR, 2013, pp. 2411–2418.
[7] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. V. D. Hengel, "A survey of appearance models in visual object tracking," ACM Trans. Intell. Systems and Technol., vol. 4, no. 4, p. 58, 2013.
[8] B. Ma, L. Huang, J. Shen, and L. Shao, "Discriminative tracking using tensor pooling," IEEE Trans. Cyber., vol. 46, no. 11, 2017.
[9] L. Wang, H. Lu, and M. H. Yang, "Constrained superpixel tracking," IEEE Trans. Cyber., vol. PP, no. 99, pp. 1–12, 2017.
[10] X. Lan, A. J. Ma, P. C. Yuen, and R. Chellappa, "Joint sparse representation and robust feature-level fusion for multi-cue visual tracking," IEEE Transactions on Image Processing, vol. 24, no. 12, p. 5826, 2015.
[11] X. Lan, S. Zhang, P. C. Yuen, and R. Chellappa, "Learning common and feature-specific patterns: A novel multiple-sparse-representation-based tracker," IEEE Transactions on Image Processing, vol. PP, no. 99, pp. 1–1, 2018.
[12] X. Lan, A. J. Ma, and P. C. Yuen, "Multi-cue visual tracking using robust feature-level fusion based on joint sparse representation," in Computer Vision and Pattern Recognition, 2014, pp. 1194–1201.
[13] F. Liu, C. Shen, I. Reid, and A. Hengel, "Online unsupervised feature learning for visual tracking," arXiv preprint arXiv:1310.1690, 2013.
[14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proc. ACM Int. Conf. Machine Learning, 2008, pp. 1096–1103.
[15] N. Wang and D.-Y. Yeung, "Learning a deep compact image representation for visual tracking," in Advances in Neural Information Processing Systems, 2013, pp. 809–817.
[16] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang, "Robust visual tracking via convolutional networks without training," IEEE Trans. Image Process., vol. 25, no. 4, pp. 1779–1792, 2016.
[17] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "Eco: Efficient convolution operators for tracking," pp. 6931–6939, 2016.
[18] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, "Learning dynamic siamese network for visual object tracking," in IEEE International Conference on Computer Vision, 2017, pp. 1781–1789.
[19] S. Avidan, "Support vector tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1064–1072, 2004.
[20] G. Huang, G.-B. Huang, S. Song, and K. You, "Trends in extreme learning machines: A review," Neural Networks, pp. 32–48, 2015.
[21] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Trans. Neural Networks, vol. 17, pp. 1411–1423, 2006.
[22] H. Liu, F. Sun, and Y. Yu, "Multitask extreme learning machine for visual tracking," Cognitive Computation, vol. 6, pp. 391–404, 2014.
[23] B. Wang, L. Tang, J. Yang, B. Zhao, and S. Wang, "Visual tracking based on extreme learning machine and sparse representation," Sensors, vol. 15, no. 10, pp. 26877–26905, 2015.
[24] G.-B. Huang, L. Chen, and C.-K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Trans. Neural Networks, vol. 17, no. 4, pp. 879–892, 2006.
[25] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, pp. 513–529, 2012.
[26] C. L. L. Kasun, H. Zhou, G.-B. Huang, and C.-M. Vong, "Representational learning with ELMs for big data," IEEE Intelligent Systems, vol. 28, no. 6, pp. 30–59, 2013.
[27] J. Tang, C. Deng, and G.-B. Huang, "Extreme learning machine for multilayer perceptron," IEEE Trans. Neural Networks and Learning Systems, vol. 27, no. 4, pp. 809–821, 2016.
[28] R. Fletcher, Practical Methods of Optimization: Volume 2 Constrained Optimization. New York: Wiley, 1981.

[29] K. De Brabanter, J. De Brabanter, J. A. Suykens, and B. De Moor, "Optimized fixed-size kernel models for large data sets," Comput. Statist. Data Anal., vol. 54, no. 6, pp. 1484–1504, 2010.
[30] X. Jia, H. Lu, and M.-H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in CVPR, 2012.
[31] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparse collaborative appearance model," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356–2368, 2014.
[32] X. Mei and H. Ling, "Robust visual tracking using l1 minimization," in ICCV, 2009, pp. 1436–1443.
[33] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in Computer Vision–ECCV 2012. Springer, 2012, pp. 702–715.
[34] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015.
[35] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," vol. 8694, pp. 188–203, 2014.
[36] L. Wang, W. Ouyang, X. Wang, and H. Lu, "Visual tracking with fully convolutional networks," in IEEE International Conference on Computer Vision, 2015, pp. 3119–3127.
[37] S. Hong, T. You, S. Kwak, and B. Han, "Online tracking by learning discriminative saliency map with convolutional neural network," in International Conference on Machine Learning, 2015, pp. 597–606.

Chenwei Deng (M’09-SM’15) received the Ph.D. degree in signal and information processing from Beijing Institute of Technology, Beijing, China, in 2009. Since 2012, he has been an associate professor and then a full professor at the School of Information and Electronics, Beijing Institute of Technology. Prior to this, he was a Post-doctoral Research Fellow with the School of Computer Engineering, Nanyang Technological University, Singapore. He has authored or co-authored over 50 technical papers in refereed international journals and conferences, and co-edited one book. His current research interests include video coding, quality assessment, perceptual modelling, feature representation, object recognition, and tracking.

Yuqi Han (S’16) received the B.Eng. degree from the School of Information and Electronics, Beijing Institute of Technology, Beijing, China, in 2015. He also received the B.Sc. degree from the National Development School, Peking University, Beijing, China, in the same year. Currently, he is pursuing the Ph.D. degree with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China. His research interest mainly focuses on visual tracking, object detection and machine learning.

Baojun Zhao received the Ph.D. degree in electromagnetic measurement technology and equipment from Harbin Institute of Technology (HIT), Harbin, China, in 1996. From 1996 to 1998, he was a postdoctoral fellow at Beijing Institute of Technology (BIT), Beijing, China. Currently, he is a full professor, Vice Director of Laboratory and Equipment Management and Director of the National Signal Acquisition and Processing Professional Laboratory. He has authored or co-authored over 100 publications and received 5 provincial/ministerial-level scientific and technological progress awards in these fields. His main research interests include image/video coding, image recognition, infrared/laser signal processing, and parallel signal processing.