3rd International Conference on Pattern Recognition and Image Analysis (IPRIA 2017) April 19-20, 2017
Superpixel tracking using Kalman filter

Mohammad Faghihi, Mehran Yazdi, Sara Dianat
Signal and Image Processing Lab, Department of Electrical Engineering, Shiraz University, Shiraz, Iran
[email protected], [email protected], [email protected]
Abstract— In this paper, we propose an algorithm for tracking moving objects in video sequences. Our method uses a Kalman filter to predict the location of the target and exploits a superpixel-based tracking algorithm to find the true position of the target in a search region surrounding the predicted location. Motion dynamics and equations from mechanics are used to design a Kalman filter under the assumption of constant-acceleration motion. This Kalman filter enables our method to handle long-lasting occlusions. We have also devised a scheme that helps the tracker recover the target after long-lasting occlusions.

Keywords—image processing; Kalman filter; visual tracking

I. INTRODUCTION
In recent years, with the advancement of technology and the resulting decrease in the price of electronic devices, cameras and other image and video acquisition equipment have become widespread among people, organizations, and governments. Such equipment has many applications, ranging from recording personal memories to video surveillance and crime detection. These devices produce an enormous amount of data (i.e. the image and video output of cameras) every day, containing considerable amounts of information. However, without a system for extracting information from videos and images, much of this data might become useless. As a result, image and video processing have recently attracted the attention of many engineers and researchers. Visual tracking is one of the most interesting and most challenging fields of image and video processing. There are many challenging scenarios in visual tracking, and devising a robust algorithm that overcomes all of them under any condition is still an unsolved problem.

Some of the challenges in visual tracking are: illumination variations, changes in the appearance of the moving object, abrupt changes in the object's motion, partial and complete occlusions, the existence of similar objects, complex backgrounds, sequence quality (e.g. noise, low frame rate, etc.) and a moving camera.

A video sequence can exhibit one or many of the above challenges. Fig. 1 depicts examples of them; for further information about the challenging cases in popular datasets we refer to [1]. Many works try to devise schemes to overcome these challenges. In [2], an algorithm based on a 2D-cepstrum approach is proposed to track objects under illumination variations. The authors of [3] present a tracking method that incrementally learns a low-dimensional subspace representation and efficiently adapts online to changes in the appearance of the target. A Hamiltonian Markov Chain Monte Carlo (MCMC) based tracking algorithm is proposed in [4] for handling abrupt motion; its authors integrate Hamiltonian Dynamics (HD) into the traditional MCMC tracking method and use HD properties to construct the MCMC updates. Considering the occlusion case, Pan and Hu [5] propose an algorithm that progressively analyzes the occlusion situation using spatiotemporal context information, checked against reference target and motion constraints; their tracker distinguishes the target effectively under occlusion. The authors of [6] introduce adjustable polygons as a novel active contour model that performs well against complex backgrounds. Bibby and Reid [7] derive a probabilistic framework for robust, real-time tracking of objects seen by a moving camera.

Our proposed method is based on Kalman filter prediction and simple motion equations from mechanics. It also adopts the appearance model and some other properties of the superpixel tracking (SPT) method [8]. We have made some
¹ Markov Chain Monte Carlo  ² Hamiltonian Dynamics
978-1-5090-6454-0/17/$31.00 © 2017 IEEE
modifications to the original SPT [8] algorithm in order to make it suitable for our application. We have also focused on the challenging scenario of occlusion and devised a scheme that enables the algorithm to recover from occlusions.

Fig. 1. Some challenging cases in visual tracking: a) illumination variations, Cannons dataset³; b) changes in the appearance model, skating sequence from the VTD dataset [14]; c) abrupt motion, animal sequence from the Kwon dataset [14]; d) occlusion, woman sequence from FragTrack [13]; e) appearances similar to the target, Bolt sequence from the Zhang dataset⁴; f) degradation (motion blur), Kwon 2009 dataset⁵; g) cluttered background, soccer sequence from the Kwon dataset [14]; h) cluttered background, Singer 2 sequence from the Kwon dataset [14].

The rest of this paper is organized as follows. In Section II, a review of previous work on visual tracking is given. In Section III, our proposed method, the Kalman filter, and their parameters are described. In Section IV, the performance of our tracker is evaluated and compared with some state-of-the-art methods. Finally, conclusions and a guideline for future improvements are given in Section V.

II. RELATED WORK
There are numerous works that address the tracking problem in various and novel ways. The kernel-based tracker of [9] models the foreground object as a blob and selects discriminative features in order to distinguish the background from foreground objects. The method proposed in [10] uses adaptive mixture models and is able to deal with appearance changes. The ensemble tracker [11] treats tracking as a binary classification of pixels, but has limited ability to handle severe occlusions. Another proposed tracker is the incremental visual tracker (IVT) [12], which adopts an adaptive appearance model to overcome the challenges of appearance change and illumination variations. The fragment-based (Frag) tracker [13] tries to handle occlusions by using histograms of local patches; however, it cannot handle appearance changes well, as it does not update the template whose local patches' matching votes are combined. The authors of visual tracking decomposition (VTD) [14] exploit the particle filter framework and multiple models for motion and observation in order to adapt to appearance, scale, and illumination variations. From a different point of view, tracking can be treated as a detection problem. As an example of such methods, we can refer to the PROST tracker [15], which is able to handle some drifts and shape deformations. In the structured output tracking (Struck) method [16], a structured output support vector machine is adopted in order to avoid ambiguity in labeling the samples used to update the classifier; however, this method uses simple low-level features and its performance drops when dealing with scale changes or occlusions.

There are also methods that exploit mid-level visual cues instead of high-level appearance models or low-level features. Mid-level visual cues have been shown to be effective representations that carry sufficient information about the structure of the target. Superpixels, as a kind of mid-level cue, have been used in applications such as image segmentation and object recognition [17], [18], and some tracking methods have been developed based on them. In [19], tracking is considered a target/background segmentation problem; however, this method has very high computational complexity, as it processes every frame entirely. In [20], a superpixel-based tracker is proposed that performs tracking using only background cues and is shown to have good overall performance. This method is based on the algorithm proposed in [8], called robust superpixel tracking (SPT); its authors have shown that SPT outperforms many state-of-the-art algorithms.

There are also papers that apply the Kalman filter to the problem of visual tracking. We can refer to [21], which combines Kalman filtering and mean shift for real-time eye tracking and works well under various illuminations and face orientations. In another work [22], an adaptive Kalman filter is used to track a moving object; it is shown to estimate the position of the target in many video sequences containing real-world situations, and it can handle long-lasting occlusion and track the object successfully after its reappearance in the video
³ Visual tracking datasets of York University, http://www.cse.yorku.ca/vision/research/visual-tracking
⁴ Zhang dataset, http://www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm
⁵ Tracking of a non-rigid object project website, http://www.cv.snu.ac.kr/research/~bhmctracker/index.html
sequence. However, in some cases a wrong prediction drifts the tracker away from the target. There are many more novel and interesting methods that apply different techniques and solutions to perform robust tracking; however, covering all of them is beyond the scope of this paper. For more information about visual tracking, proposed algorithms, their performance and their pros and cons, we refer to [23] and [24].

III. PROPOSED ALGORITHM

A. Assumptions
Here, we assume that the initial position and scale of the target are known in the first frame and are fed to the algorithm. We also assume that the target can be represented by a rectangle around it. More complex shapes are possible, but the rectangular box is the most common among various algorithms.

In our method, we use a rectangle as the bounding box (template) of the target. The center of this box represents the center of the target, and the dimensions of the rectangle represent the width and height of the target (the target's scale).

Using this information about the target's bounding box, we can define the target at time t (i.e. in the t-th frame) as in (1):

X_t = (x_t, y_t, w_t, h_t)    (1)

in which x_t and y_t are the center locations of the target's bounding box in the horizontal and vertical directions, and w_t and h_t are the lengths of the rectangular box sides along the horizontal and vertical directions of the frame, respectively.

By this representation, we can think of visual tracking as tracking a moving point (i.e. the center of the bounding box) in a two-dimensional plane (i.e. a frame of the video sequence) while estimating the lengths of the box in the horizontal and vertical directions. In this sense, the Kalman filter seems very suitable for visual tracking.

We also assume that the target object has constant acceleration. This assumption holds when the frame rate is fairly high in comparison to the target's speed; by this we mean that the changes in the velocity of the target (in both the vertical and horizontal directions) between consecutive frames of a small time slot are relatively equal. So even if the acceleration of the object (in the vertical, the horizontal, or both directions) changes gradually, our assumption still holds, which is the case in many real-world video sequences. However, a sudden change in the acceleration or the direction of motion violates this assumption; in that case the target would be missed for several frames, and we can define a large search area to look for the target and continue tracking.

B. Designing the Kalman filter
The Kalman filter was first introduced by R. E. Kalman in 1960. It is a set of mathematical equations that provides a recursive solution to the least-squares method. For more information about the Kalman filter, refer to [25].

The simple form (without inputs) of the Kalman filter model assumes that the true state at time t, denoted by s_t, evolves from the last state s_{t-1} at time t-1 according to (2) and (3):

s_t = F_t s_{t-1} + w_t    (2)

w_t ~ N(0, Q_t)    (3)

where F_t is the state transition model and w_t is the process noise, which is assumed to be drawn from a zero-mean normal distribution with covariance matrix Q_t. Please note that the Kalman state s_t is different from the target state X_t defined in (1).

The relation between the observation z_t at time t and the state at that time is stated in (4):

z_t = H_t s_t + v_t    (4)

v_t ~ N(0, R_t)    (5)

In the above equations, H_t is the observation model that maps the true state space into the observation space, and v_t is the observation noise, which is assumed to be zero-mean white Gaussian noise with covariance matrix R_t. The Kalman filter is a two-stage filter and at every iteration performs two tasks: 1) a prediction step and 2) an update step. The equations of these steps are given below. (The hat in (6) to (12) denotes the estimate of a parameter or matrix.)

1. Prediction step
   i. Predicted (a priori) state:

      ŝ_{t|t-1} = F_t ŝ_{t-1|t-1}    (6)

   ii. Predicted estimate covariance:

      P_{t|t-1} = F_t P_{t-1|t-1} F_t^T + Q_t    (7)

      In (7), P is the estimate covariance and F_t^T is the transpose of F_t.

2. Update step
   i. Observation residual:

      ỹ_t = z_t − H_t ŝ_{t|t-1}    (8)

   ii. Residual covariance:

      S_t = H_t P_{t|t-1} H_t^T + R_t    (9)

   iii. Optimal Kalman gain:

      K_t = P_{t|t-1} H_t^T S_t^{-1}    (10)

      In (10), S_t^{-1} is the inverse of the residual covariance matrix.

   iv. Updated state estimate:

      ŝ_{t|t} = ŝ_{t|t-1} + K_t ỹ_t    (11)

   v. Updated estimate covariance:

      P_{t|t} = (I − K_t H_t) P_{t|t-1}    (12)
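As an illustration, the prediction and update steps (6)–(12) can be written in a few lines of NumPy. This is a minimal sketch; the function and variable names are ours, not part of the paper's implementation, which is in MATLAB.

```python
import numpy as np

def kf_predict(s, P, F, Q):
    """Prediction step: a priori state (6) and covariance (7)."""
    s_pred = F @ s                  # (6)
    P_pred = F @ P @ F.T + Q        # (7)
    return s_pred, P_pred

def kf_update(s_pred, P_pred, z, H, R):
    """Update step: residual (8), residual covariance (9),
    Kalman gain (10), corrected state (11) and covariance (12)."""
    y = z - H @ s_pred                              # (8)
    S = H @ P_pred @ H.T + R                        # (9)
    K = P_pred @ H.T @ np.linalg.inv(S)             # (10)
    s_new = s_pred + K @ y                          # (11)
    P_new = (np.eye(len(s_pred)) - K @ H) @ P_pred  # (12)
    return s_new, P_new
```

In the proposed tracker, each frame would call `kf_predict`, run the superpixel search around the predicted center, and feed the found center back through `kf_update`.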
For the tracking problem, we want to estimate the next location of the target's center based on its current and previous locations. Using this definition and the target state from (1), we define our measurement vector as in (13):

z_t = (x_t, y_t)    (13)

and we take the state of our Kalman filter as in (14):

s_t = (x_t, y_t, ẋ_t, ẏ_t)    (14)

In this definition, ẋ_t and ẏ_t are the first-order derivatives of the location in the horizontal and vertical directions. Intuitively, these two derivatives can be replaced by the motion vectors of the center of the target between subsequent frames. This information is not available at the start, so we initialize these values in the training phase of our algorithm.

From (4), (13) and (14) we can conclude that the observation matrix is:

H = | 1 0 0 0 |
    | 0 1 0 0 |    (15)

The transition matrix F is related to the equations that define the transition from one state to the next. Since we have assumed constant-acceleration motion, we can use the following equations from mechanics:

L = L_0 + v·Δt + (a/2)·Δt²    (16)

v = v_0 + a·Δt    (17)

where L is the location, v is the velocity, a is the acceleration and Δt is the elapsed time. Since we are processing frame by frame, we can set Δt = 1. From (2), (14), (16) and (17) we can write:

| x_t |   | 1 0 1 0 | | x_{t-1} |   | a_x/2 |
| y_t | = | 0 1 0 1 | | y_{t-1} | + | a_y/2 |
| ẋ_t |   | 0 0 1 0 | | ẋ_{t-1} |   | a_x   |
| ẏ_t |   | 0 0 0 1 | | ẏ_{t-1} |   | a_y   |

where the acceleration terms are absorbed into the process noise. So we have:

F = | 1 0 1 0 |
    | 0 1 0 1 |
    | 0 0 1 0 |
    | 0 0 0 1 |    (18)

The process noise (the Q matrix) controls the adaptability of our filter. If the target motion has a large variance, the values in Q should be greater; however, smaller values of Q produce smoother results. As a result, choosing a correct value for Q is very important. Based on the motion equations of mechanics and by trial and error, we found the value in (19) a suitable choice:

Q = a² × | 0.25 0    0.5  0   |
         | 0    0.25 0    0.5 |
         | 0.5  0    1    0   |
         | 0    0.5  0    1   |    (19)

in which a = a_x = a_y is the value assumed as the acceleration. Choosing a small value for a makes the algorithm perform well even in cases of constant-velocity motion.

The matrix R is set as:

R = r × | 1 0 |
        | 0 1 |    (20)

where r is the parameter that controls the values of R.

The initial state of the filter is set from the training phase, with the initial velocities given by the extracted motion vectors:

ŝ_{0|0} = (x_0, y_0, ẋ_0, ẏ_0)    (21)

Finally, we should set P, the estimate covariance matrix. The values of this matrix are a measure of the accuracy of our estimation. As this parameter is adjusted during the update process, we initially choose a simple value for it. The matrix is given by (22):

P = p × | 1 0 0 0 |
        | 0 1 0 0 |
        | 0 0 1 0 |
        | 0 0 0 1 |    (22)

Initially we must choose p ≫ 0.

C. Training the tracker
Given the above assumptions, the position of the target is known in the first frame. However, in order to initialize the motion vectors, we must have information about more than one frame at the beginning of the sequence, so our algorithm needs a training phase. We have exploited the robust superpixel tracker [8] to implement our tracker. We have adapted its appearance model, which assumes that the target and the background can be represented by superpixels without destroying the boundaries between them. It also requires some prior information about the superpixels of the target and the background. The authors of [8] implement a training phase for their algorithm, which is also useful for initializing our Kalman filter.

Here, we briefly describe the training phase proposed by [8] and adapted by us. In this method, a simple tracker is used for the first m frames. After finding the location of the target in these frames, a square area surrounding the target (this area must be larger than the target itself; for more details refer to [8]) is segmented into N_t superpixels. Then the features of these superpixels (normalized histograms in the HSI color space) are extracted and accumulated over all m training frames. These features are then clustered using the mean shift algorithm, and a superpixel-based discriminative appearance model is obtained. We also extract the motion vectors in the horizontal and vertical directions to initialize our tracker.
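The filter design of Section III-B — the constant-acceleration F of (18), the H of (15), the Q of (19), and the diagonal R and P of (20) and (22) — can be assembled as follows. This is a sketch; the default parameter values shown here are placeholders, not the ones used in our experiments.

```python
import numpy as np

def build_tracking_filter(a=0.1, r=1.0, p=1000.0):
    """Matrices for the constant-acceleration Kalman filter.
    a: assumed acceleration (a small a also covers constant velocity),
    r: observation-noise scale, p: initial covariance scale (p >> 0)."""
    F = np.array([[1., 0., 1., 0.],      # (18), with dt = 1:
                  [0., 1., 0., 1.],      # position += velocity
                  [0., 0., 1., 0.],
                  [0., 0., 0., 1.]])
    H = np.array([[1., 0., 0., 0.],      # (15): only the center
                  [0., 1., 0., 0.]])     # (x, y) is observed
    Q = a**2 * np.array([[0.25, 0.,   0.5, 0. ],   # (19)
                         [0.,   0.25, 0.,  0.5],
                         [0.5,  0.,   1.,  0. ],
                         [0.,   0.5,  0.,  1. ]])
    R = r * np.eye(2)                    # (20)
    P = p * np.eye(4)                    # (22)
    return F, H, Q, R, P
```

The initial state (x_0, y_0, ẋ_0, ẏ_0) of (21) would then come from the locations and motion vectors extracted in the training frames, as described above.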
D. Tracking
In the tracking phase, we use the information from the training stage to initialize the Kalman filter, and then use the filter to predict the location of the target in the next frame. We could use this prediction directly as the center of the target, but that would not be robust. So, once the location of the target is predicted, we choose a square region centered at this location. The length of this square's sides is given by (23):

l_t = λ × |(w_{t-1}, h_{t-1})|    (23)

where |(w_{t-1}, h_{t-1})| is the size of the target's bounding box in the last frame and λ is a parameter that controls the size of the square area. As in SPT [8], we segment this region of the newly arrived frame into superpixels and extract their features. We then follow the tracking method of the SPT [8] algorithm to find the best estimate of the location and scale of the target in this region, and use this information to update our Kalman filter.

E. Occlusion handling
SPT [8] is able to detect and handle some occlusions. But it always assumes the search region to be centered at the position of the target's center in the last frame, and as a result it loses track of the target during heavy and long-lasting occlusions. Our method, in contrast, has an estimate of the next location of the target based on its motion dynamics and keeps predicting the location of the occluded target during long-lasting occlusions; if the motion dynamics of the target do not change greatly during this time, our tracker is able to find the target again and continue the tracking task. We also add another scheme to our algorithm in order to recover from long-lasting occlusions. To do so, we count the number of consecutive occluded frames, and for every 10 such frames we add 0.2 to the parameter λ from (23). This enlarges the search area that contains the candidates of the target. When the target is found again, λ is decreased frame by frame (in case no other occlusion happens) until it reaches the initial value it was set to. Please note that in a very long-lasting occlusion this method would lead to a search region the same size as the entire frame; in this case the target would still be found, but the computational complexity would increase heavily, so we can set an upper bound for λ.

IV. EXPERIMENTAL RESULTS
We have implemented our algorithm in MATLAB R2016b on a laptop with a 2.6 GHz Core i7 CPU and 6 GB of memory. The features and parameters of the superpixel tracking parts are the same as in [8], except for λ, which is set to 2 in our algorithm. The upper bound for λ is set to 3.75. The remaining parameters are set to 10, 7 and 0.1; these values were found empirically. Table I shows the results of our tracker in comparison with other tracking methods; the results of the other methods are adapted from [8]. To count the number of successfully tracked frames, the criterion from [20] is used, and a frame is counted as successfully tracked when this score exceeds 0.6.

TABLE I. NUMBER OF SUCCESSFULLY TRACKED FRAMES

Sequence    MS [26]  Frag [13]  IVT [12]  PF [28]  MIL [27]  VTD [14]  SPT [8]  Our Method
Singer 1       64        96       328       87        87       471       347       346
Bolt           15       172         5       33        12       199       231       305
Bird1           1         6         4       44       118         7        84        93
Bird2          36        19         9       44        86         9        90        98
Surfing 1      36        16        24       28        10        24        98       150

Fig. 2. Results of our tracking algorithm on the bolt sequence.
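The success counts in Table I can be reproduced with a per-frame overlap test. The sketch below uses a standard intersection-over-union score with the 0.6 threshold; the precise criterion of [20] may differ in detail, so this is an illustrative assumption on our part.

```python
def overlap(box_a, box_b):
    """Area overlap of two (x, y, w, h) boxes: intersection over union."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def count_tracked(pred_boxes, gt_boxes, thresh=0.6):
    """Number of frames whose overlap with the ground truth exceeds thresh."""
    return sum(1 for p, g in zip(pred_boxes, gt_boxes)
               if overlap(p, g) > thresh)
```

Here `pred_boxes` and `gt_boxes` are per-frame tracker outputs and ground-truth annotations in (x, y, w, h) form; both names are ours.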
Fig. 3. Results of our tracking algorithm recovering from heavy occlusion, surfing1 sequence.

It can be concluded from the results that our method performs very well in sequences whose motion dynamics are similar to our assumptions. In the bolt and surfing1 sequences, which contain motions with nearly constant acceleration, our results are very good. It must be noted that in the surfing1 sequence our tracker was able to recover from heavy occlusion and find the target again. Fig. 2 shows some results of our tracking algorithm for the bolt sequence. Fig. 3 depicts the capability of our tracker to follow the target during severe occlusion and to find it again to recover from the occlusion.

V. CONCLUSIONS
In this paper, we have developed a tracking algorithm by combining the superpixel tracking method and the Kalman filter. We have shown that the addition of the Kalman filter improves the performance of the superpixel tracking algorithm; however, the performance depends on the real motion model of the target. The Kalman filter of our method is designed under the assumption of constant-acceleration motion. In cases where the Kalman filter parameters fit the real motion model well (e.g. a human or a vehicle moving with constant acceleration), the algorithm performs very well and achieves outstanding results, as in the surfing1 and bolt sequences. But when the target has abrupt motions or sudden changes in acceleration or direction of movement, the performance of the proposed algorithm decreases. Our algorithm is also able to recover from severe occlusions, especially when the target has a real motion model similar to what we have set for the Kalman filter parameters. For further improvement, we propose developing an extended Kalman filter (EKF), which seems able to track more complex forms of motion by exploiting more advanced motion dynamics.

REFERENCES

[1] Dubuisson, Séverine, and Christophe Gonzales. "A survey of datasets for visual tracking." Machine Vision and Applications 27.1 (2016): 23-52.
[2] Cogun, Fuat, and A. Enis Cetin. "Object tracking under illumination variations using 2D-cepstrum characteristics of the target." Multimedia Signal Processing (MMSP), 2010 IEEE International Workshop on. IEEE, 2010.
[3] Ross, David A., et al. "Incremental learning for robust visual tracking." International Journal of Computer Vision 77.1-3 (2008): 125-141.
[4] Wang, Fasheng, and Mingyu Lu. "Hamiltonian Monte Carlo estimator for abrupt motion tracking." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.
[5] Pan, Jiyan, and Bo Hu. "Robust occlusion handling in object tracking." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.
[6] Delagnes, Philippe, Jenny Benois, and Dominique Barba. "Active contours approach to object tracking in image sequences with complex background." Pattern Recognition Letters 16.2 (1995): 171-178.
[7] Bibby, Charles, and Ian Reid. "Robust real-time visual tracking using pixel-wise posteriors." European Conference on Computer Vision. Springer Berlin Heidelberg, 2008.
[8] Yang, Fan, Huchuan Lu, and Ming-Hsuan Yang. "Robust superpixel tracking." IEEE Transactions on Image Processing 23.4 (2014): 1639-1651.
[9] Collins, Robert T., Yanxi Liu, and Marius Leordeanu. "Online selection of discriminative tracking features." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.10 (2005): 1631-1643.
[10] Jepson, Allan D., David J. Fleet, and Thomas F. El-Maraghi. "Robust online appearance models for visual tracking." IEEE Transactions on Pattern Analysis and Machine Intelligence 25.10 (2003): 1296-1311.
[11] Avidan, Shai. "Ensemble tracking." IEEE Transactions on Pattern Analysis and Machine Intelligence 29.2 (2007).
[12] Lim, Jongwoo, et al. "Incremental learning for visual tracking." NIPS. Vol. 17. 2004.
[13] Adam, Amit, Ehud Rivlin, and Ilan Shimshoni. "Robust fragments-based tracking using the integral histogram." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 1. IEEE, 2006.
[14] Kwon, Junseok, and Kyoung Mu Lee. "Visual tracking decomposition." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[15] Santner, Jakob, et al. "PROST: Parallel robust online simple tracking." Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.
[16] Hare, Sam, et al. "Struck: Structured output tracking with kernels." IEEE Transactions on Pattern Analysis and Machine Intelligence 38.10 (2016): 2096-2109.
[17] Ren, Xiaofeng, and Jitendra Malik. "Learning a classification model for segmentation." ICCV. Vol. 1. 2003.
[18] Mori, Greg, et al. "Recovering human body configurations: Combining segmentation and recognition." Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 2. IEEE, 2004.
[19] Ren, Xiaofeng, and Jitendra Malik. "Tracking as repeated figure/ground segmentation." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.
[20] Li, Annan, and Shuicheng Yan. "Object tracking with only background cues." IEEE Transactions on Circuits and Systems for Video Technology 24.11 (2014): 1911-1919.
[21] Zhu, Zhiwei, et al. "Combining Kalman filtering and mean shift for real time eye tracking under active IR illumination." Pattern Recognition, 2002. Proceedings. 16th International Conference on. Vol. 4. IEEE, 2002.
[22] Weng, Shiuh-Ku, Chung-Ming Kuo, and Shu-Kang Tu. "Video object tracking using adaptive Kalman filter." Journal of Visual Communication and Image Representation 17.6 (2006): 1190-1208.
[23] Yilmaz, Alper, Omar Javed, and Mubarak Shah. "Object tracking: A survey." ACM Computing Surveys (CSUR) 38.4 (2006): 13.
[24] Smeulders, Arnold W. M., et al. "Visual tracking: An experimental survey." IEEE Transactions on Pattern Analysis and Machine Intelligence 36.7 (2014): 1442-1468.
[25] Welch, Greg, and Gary Bishop. "An introduction to the Kalman filter." (1995).
[26] Comaniciu, Dorin, and Peter Meer. "Mean shift: A robust approach toward feature space analysis." IEEE Transactions on Pattern Analysis and Machine Intelligence 24.5 (2002): 603-619.
[27] Babenko, Boris, Ming-Hsuan Yang, and Serge Belongie. "Visual tracking with online multiple instance learning." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
[28] Nummiaro, Katja, Esther Koller-Meier, and Luc Van Gool. "An adaptive color-based particle filter." Image and Vision Computing 21.1 (2003): 99-110.