Feature Evaluation by Particle Filter for Adaptive Object Tracking

Zhenjun Han, Qixiang Ye, Yanmei Liu, Jianbin Jiao*
Graduate University of Chinese Academy of Sciences, Beijing, China, 100049

ABSTRACT
An online combined feature evaluation method for visual object tracking is proposed in this paper. First, a feature set is built by combining color histogram (HC) bins with histogram of oriented gradients (HOG) bins to capture the color representation and the contour representation of an object, respectively. Then a feature confidence evaluation approach in a Particle Filter framework is proposed so that features with larger confidence play more important roles in the instantaneous tracking, ensuring that the tracking can adapt to appearance changes of either the foreground or the background. In this way, we extend the traditional filter framework from modeling motion states to modeling feature evaluation. The temporal consistency of the particles also ensures that the evolution of feature confidence remains smooth. Examples are presented to illustrate how the method adapts to changing appearances of both the tracked object and the background. Experiments and comparisons demonstrate that object tracking with evaluated combined features is highly reliable even when objects move across complex backgrounds.

Keywords: Feature Evaluation, Object Tracking, Particle Filter
1. INTRODUCTION
Visual tracking aims to automatically locate the same object in successive frames of a video sequence once it has been initialized. It is a prerequisite for understanding video content and a crucial step in visual surveillance applications, human-computer interaction systems, and robotics [1]. Tracking has therefore drawn increasing attention from researchers over the last decades, and numerous powerful tracking algorithms have been developed, including Mean-shift [14] and the Kalman filter [13]. Tracking methods for various objects, such as humans and vehicles, have also been studied extensively. Although tracking has been widely investigated in the past years, several difficulties remain, including changes of the object's motion state, appearance changes of either the object or the background, and long-duration occlusions. To date, researchers have investigated various approaches, mainly concerning tracked-object representation, motion modeling, and search methods [5].

A motion model predicts an object's location in a new frame of an image sequence using its motion history and other known movement characteristics. This prediction, often performed by temporal filters, can improve tracking stability and allow the tracking to survive short occlusions. Linear filters, such as the traditional Kalman filter, assume that an object undergoes only linear translational or affine motion. Non-linear models, such as the Particle Filter, can handle non-linear and non-Gaussian motion; however, their parameters are more difficult to estimate and are sometimes more sensitive to noise.

On the other hand, robust search methods are also indispensable for successful tracking. Existing search methods use various strategies to find an object's new location in the next frame, given a candidate area determined by a filter or by an empirical method. In this process, the block whose image appearance is most similar to the appearance of the tracked object is taken as the tracking result. In addition, when the tracked object varies in size, the search algorithm needs to determine the scale parameter in scale space. In the early years, the Lucas-Kanade algorithm [15], which relies on feature matching by optimization, was the most representative method. More recently, more efficient and effective search algorithms, such as Mean-shift [14], have been developed to compute the best local match for a rigid or non-rigid object.

A better motion model or a better search algorithm can help avoid many difficulties during the tracking process in some cases, but they are not always effective, because object tracking is more closely related to adaptive object appearance modeling. Object representation is the cornerstone of tracking.
Color histogram (HC) features have been widely used in tracking [1-4] for their effectiveness and efficiency. The color histogram is a simple statistic of color occurrence probabilities that discards the spatial distribution of the colors; the HC feature is therefore robust to small object deformations, scale variations, and even some rotation. This robustness allows the HC representation to succeed in many tracking situations, especially when the appearances of both the object and the background are stable. However, HC does not work well when the object has a color similar to that of its background. As an improvement over traditional HC features, Chen et al. [4] use region color histogram features for object tracking. The main advantage is that the region color features capture not only the color properties of the object but also the spatial distribution of colors on the object, which compensates for the disadvantages of HC to some extent.

Recently, the histogram of oriented gradients (HOG), derived from the scale-invariant feature transform (SIFT) [6], has been applied to object detection and tracking. HOG captures the edge and gradient structures of an image, which characterize local contour and shape, and it is therefore insensitive to color variation. In [5], Dalal et al. showed that the representation ability of HOG is almost as strong as that of the SIFT [6] descriptor at a fixed scale. However, like other contour and texture features, HOG is less efficient to compute than HC, and it cannot effectively represent objects and backgrounds with large smooth regions, whose contours are indistinct. Research on these two types of features inspired us to integrate them, because they are mutually complementary. The combination of HC and HOG produces a statistical feature set for the tracked object.

A characteristic of object tracking is that the foreground and background appearances change continuously during the tracking process. Consequently, the most effective features for discriminating an object from its background also keep changing, which makes it difficult to track objects correctly. A countermeasure is to evaluate the features online. In the existing literature, there are several approaches to feature evaluation for object tracking. Collins et al. [2] evaluate features according to the two-class variance ratio between the object and its background. Liang et al. [1] calculate the confidence of a feature based on the Bayes error rate between the object and its background. Wang et al. [3] select Haar wavelet features online by Adaboost learning. Chen et al. [4] use a hierarchical Monte Carlo algorithm to learn region confidences for object tracking; the basic idea is that the color features of some regions on the object play more important roles than others during tracking, and by calculating region confidences the approach adapts to object appearance changes and even to some occlusions. Despite their advantages, a main drawback of these approaches is that most of them select or evaluate the features in a feature set separately for each frame and seldom consider the temporal consistency of the selection or evaluation, which may degrade tracking stability when there is noise in the tracking process.
Moreover, to our knowledge, most existing feature evaluation approaches operate on a single feature type, e.g., pure color, texture, or contour features. It is not guaranteed that one type of feature can always discriminate the object from its background effectively; evaluation over a combined feature set needs further exploration to obtain more stable tracking results. In this paper, we use a combined feature set of HC and HOG, called HOGC, for object tracking. Although each feature alone has certain drawbacks, the combined feature set builds on color histograms, edge orientation histograms [6], and SIFT descriptors [7]. For the combined feature set, we evaluate the confidence of each feature based on its foreground-background discriminative power. The evaluated features are then re-combined for tracking instantaneously, which makes the tracking adapt to background or object changes. The evaluation is formulated in a Particle Filter framework, so the temporal consistency of the particles ensures that the evaluation is smooth along the time axis. Our adaptive appearance models can accommodate unstable lighting, pose variations, scale changes, viewpoint changes, and camera noise. In short, our approach to feature evaluation balances adaptation to new observations against resistance to noise.
2. THE PROPOSED TRACKING ALGORITHM
The flow chart of the proposed tracking method is shown in Fig. 1. In the method, HC features (feature set A1) and HOG features (feature set A2) are extracted frame by frame to build the combined feature set. In the figure, feature sets A1~An denote the individual feature sets, which implies that the combined feature set can be extended to more than two feature types. A feature evaluation procedure is carried out frame by frame to compute the confidence of each feature (each feature is a histogram bin in this paper) in the combined set, and the tracking is then completed by a search algorithm over the evaluated features together with a tracking filter. A high-level sketch of this per-frame loop is given after Fig. 1.
Fig. 1. Flow chart of the proposed tracking algorithm.
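The following is a minimal sketch of the per-frame loop in Fig. 1. The helper functions (extract_hogc, evaluate_confidences, search_best_candidate) are placeholders standing in for the stages described in Sections 2.1 and 2.2, not functions defined in the paper.

```python
def track(video_frames, init_box, extract_hogc, evaluate_confidences, search_best_candidate):
    """Per-frame tracking loop corresponding to Fig. 1; the three callables are
    placeholders for feature extraction, confidence evaluation, and search."""
    box = init_box                 # (x, y, w, h) of the tracked object
    confidences = None             # per-bin feature confidences, updated each frame
    for frame in video_frames:
        obj_feat, bg_feat = extract_hogc(frame, box)                         # Sec. 2.1
        confidences = evaluate_confidences(obj_feat, bg_feat, confidences)   # Sec. 2.2
        box = search_best_candidate(frame, box, confidences)                 # step 3 of Fig. 4
        yield box
```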
During the tracking process, in each video frame we first define an object region and its background region in order to extract the features, and then compute the confidence of each feature bin instantaneously. If the object is a rectangle of h × w pixels, its background is defined as the surrounding rectangle of 2h × 2w pixels, as shown in Fig. 2. The object features {F^i(x, y)}, i = 0, 1, ..., N, are extracted from the pixels of the object area centered at (x, y), and the background features {B^i(x, y)}, i = 0, 1, ..., N, are extracted from the pixels of the background area.
Fig. 2. Object and its background regions.
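As a concrete illustration, the sketch below crops the object patch and the 2h × 2w surrounding region of Fig. 2. The function name is hypothetical, the regions are assumed to lie fully inside the image, and the choice to exclude the object pixels from the background statistics is an assumption rather than something the paper specifies.

```python
import numpy as np

def object_and_background_regions(image, x, y, h, w):
    """Crop the h x w object patch centered at (x, y) and the 2h x 2w outer
    rectangle; background pixels are those of the outer rectangle with the
    object area removed (both regions assumed inside the image)."""
    def crop(ch, cw):
        top, left = y - ch // 2, x - cw // 2
        return image[top:top + ch, left:left + cw]

    obj = crop(h, w)
    outer = crop(2 * h, 2 * w)
    mask = np.ones(outer.shape[:2], dtype=bool)
    mask[h // 2:h // 2 + h, w // 2:w // 2 + w] = False   # exclude the object area
    return obj, outer[mask]                              # object patch, background pixels
```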
2.1 Feature extraction
We use HC in RGB color space and HOG on gray-scale image data to construct the combined feature set HOGC. These features are chosen because: 1) they can be computed efficiently and are easy to evaluate in our method; 2) they are mutually complementary for representing an object. The combined feature set contains 12 feature sub-sets in total, corresponding to the histograms of the R, G, and B color components in RGB color space and the HOG features of 9 image blocks.

2.1.1 HC feature extraction
To calculate the color histogram, the RGB color space is chosen for its efficiency. Each color component (R, G, and B) is linearly quantized into 16 levels, and 16 histogram bins are extracted for that component. In total, we obtain a color histogram (HC) of 48 dimensions from the three color components.

2.1.2 HOG feature extraction
We follow the idea of [5] to extract histogram of oriented gradients (HOG) features from gray-scale image data. A histogram of 72 dimensions is extracted to describe the gradient orientations of an object. The details of HOG feature extraction are as follows.
Fig. 3. HOG feature extraction illustration. (a) Mask for gradient calculation; (b) orientations; (c) an image window partitioned into 9 blocks for gradient orientation calculation.
The gradient is calculated in gray-scale space without gamma correction. We first resize the rectangular object region to an image window of fixed size, say 32x32 pixels; we then divide the object window into spatial cells of 8x8 pixels, and 4 (2x2) adjacent cells are grouped into a block in a sliding fashion, so that neighboring blocks overlap. Different from the method in [5], each block consists of only an 8-bin HOG without local normalization. For each pixel in a block, the orientation ori(h, w) of the gradient element centered on it is computed and voted into an orientation histogram channel. The mask used to compute ori(h, w) is shown in Fig. 3a, and the computation is given in Eq. (1):

$$
\begin{aligned}
I &= G(\sigma, 0) * I_0 \\
dy &= I(h+1, w) - I(h-1, w) \\
dx &= I(h, w+1) - I(h, w-1) \\
ori(h, w) &= \operatorname{atan2}(dy, dx), \quad ori \in [-\pi, \pi]
\end{aligned}
\tag{1}
$$

G(σ, 0) is a Gaussian smoothing function and σ is a scale parameter determined empirically. The votes of ori(h, w) within a block are accumulated into 8 orientation histogram bins, shown in Fig. 3b, to form the HOG feature of the block. The HOGs of all blocks are then concatenated to obtain a 72-dimensional feature (9 blocks in the object window, as shown in Fig. 3c) for the whole object.
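The sketch below illustrates how the 48-dimensional HC and the 72-dimensional HOG described above could be computed and concatenated into the HOGC vector. The function names, the use of NumPy/SciPy, and details such as the order of resizing and smoothing are assumptions for illustration; only the bin layout (16 bins per color channel, 9 overlapping blocks of 8 unnormalized orientation bins) follows the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def hc_feature(patch_rgb):
    """48-dim color histogram: 16 bins for each of the R, G, B channels."""
    hists = []
    for c in range(3):
        hist, _ = np.histogram(patch_rgb[..., c], bins=16, range=(0, 256))
        hists.append(hist / max(hist.sum(), 1))
    return np.concatenate(hists)

def hog_feature(patch_gray, sigma=1.0):
    """72-dim HOG: 9 overlapping 16x16 blocks (2x2 cells of 8x8 pixels) in a
    32x32 window, each block an unnormalized 8-bin orientation histogram."""
    img = zoom(patch_gray.astype(float), (32.0 / patch_gray.shape[0],
                                          32.0 / patch_gray.shape[1]))   # resize to 32x32
    img = gaussian_filter(img, sigma)                                    # I = G(sigma) * I_0
    dy = np.zeros_like(img)
    dx = np.zeros_like(img)
    dy[1:-1, :] = img[2:, :] - img[:-2, :]                               # central differences, Eq. (1)
    dx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    ori = np.arctan2(dy, dx)                                             # orientation in [-pi, pi]
    blocks = []
    for by in range(3):                                                  # 3 x 3 = 9 sliding blocks
        for bx in range(3):
            block = ori[by * 8: by * 8 + 16, bx * 8: bx * 8 + 16]
            hist, _ = np.histogram(block, bins=8, range=(-np.pi, np.pi))
            blocks.append(hist.astype(float))
    return np.concatenate(blocks)

def hogc_feature(patch_rgb):
    """Concatenate HC (48-dim) and HOG (72-dim) into the 120-dim HOGC vector."""
    return np.concatenate([hc_feature(patch_rgb),
                           hog_feature(patch_rgb.mean(axis=2))])
```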
2.2 Feature evaluation during the tracking process
The extracted HOGC features are evaluated during the tracking process under the following constraints: (1) the confidence, or discriminative power, of a feature is a real value in (0.0, 1.0); (2) after the evaluation, features of higher discriminative power have larger confidences, and vice versa.

2.2.1 Definition of the discriminative power
The discriminative power of a feature bin in the combined set reflects its ability to discriminate the object from its background. After the features of the object and background regions are extracted, the discriminative power of each feature bin in the current video frame is measured by the log likelihood ratio between the object feature and the background feature. The discriminative power of a feature in the t-th frame is defined as

$$
S_t^i = \max\left(0,\; \min\left(1,\; \log \frac{\max(F_t^i(x, y), \delta)}{\max(B_t^i(x, y), \delta)}\right)\right), \quad i = 1 \ldots M,
\tag{2}
$$

$$
S_t^i = \frac{S_t^i}{\sum_{i=1}^{N} S_t^i}, \quad i = 1 \ldots N,
\tag{3}
$$

where F_t^i(x, y) is the i-th feature bin of the object at frame t, B_t^i(x, y) is the i-th feature bin of the background at frame t, and M is the dimension of the feature sub-set (M is 16 for each HC sub-set and 8 for each HOG sub-set). Intuitively, when the i-th object feature F_t^i(x, y) is larger than the i-th background feature B_t^i(x, y), the ratio log(F_t^i(x, y)/B_t^i(x, y)) takes a larger value, and the discriminative power becomes larger. δ is set to 0.005 in our work.
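A minimal sketch of Eqs. (2)-(3), assuming the object and background histograms are given as NumPy arrays of equal length; the final normalization runs over all bins as in Eq. (3).

```python
import numpy as np

def discriminative_power(obj_bins, bg_bins, delta=0.005):
    """Per-bin discriminative power (Eq. 2) followed by normalization (Eq. 3)."""
    ratio = np.log(np.maximum(obj_bins, delta) / np.maximum(bg_bins, delta))
    s = np.clip(ratio, 0.0, 1.0)          # clamp into [0, 1] as in Eq. (2)
    return s / max(s.sum(), 1e-12)        # normalize over all bins, Eq. (3)
```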
2.2.2 Feature evaluation and object tracking
The Particle Filter is a technique for implementing a recursive temporal Bayesian filter by Monte Carlo simulation. Its key idea is to represent the required posterior state of a moving object by a set of random samples with associated confidences [7]. In object tracking, the Particle Filter is widely used for motion modeling; in this paper we extend its role from motion state modeling to feature confidence evaluation. The core idea of our approach is to define one Particle Filter per feature sub-set. We therefore define 12 such filters in total, corresponding to
the histograms of the R, G, and B color components and the HOG features of the 9 blocks, respectively. Suppose that at frame t there is a set of particle filters (sample sets) {p_t(j)} = {(j^i, w_t(j^i))}, j = 1, 2, ..., 12, where j indexes the R, G, or B histogram or the HOG of a block, j^i denotes the i-th feature bin (particle) of feature sub-set j, and w_t(j^i) is the confidence of j^i at frame t. The feature evaluation procedure can then be described as follows.
1. Initialize the confidences {w_0(j^i) | i = 1, 2, ..., M} of the j-th particle filter with equal values.
2. For the j-th particle filter in {p_t(j)}, j = 1, 2, ..., 12, construct the new sample set as follows:
   2.1 Calculate the cumulative probability of each particle by c_t^0 = 0, c_t^i = c_t^{i-1} + w_t(j^i), i = 1, 2, ..., M, which yields Set_t(j) = (j^i, w_t(j^i), c_t^i | i = 1, 2, ..., M).
   2.2 Select k particles (repetition allowed) from Set_t(j) (re-sampling); each particle i' is selected as follows:
       (a) generate a uniformly distributed random number r ∈ [0, 1];
       (b) find the smallest i for which c_t^i ≥ r;
       (c) set i' = i.
   2.3 Predict the re-sampled particles {i' | i' = 1, 2, ..., M} by sampling from p(i | i'). In our approach, the transition probability p(i | i') is defined as a normal distribution over the feature bins, p(i | i') ~ N(0, δ), with δ set to 2M in our experiments.
   2.4 Weight each new particle i by w_{t+1}(j^i) based on the measurement p(S_{t+1}^i | i). In our approach, we use the discriminative power of the feature bin (Eq. (2)) as the measurement p(S_{t+1}^i | i).
   2.5 Normalize so that Σ_i w_{t+1}(j^i) = 1 for each feature sub-set.
3. Track the object's position (x, y)_{t+1} at frame t + 1 by

$$
\min_{c} \left( \sum_{j} \sum_{i=1}^{M} w_{t+1}(j^i) \times \left| F_t^i\bigl((x, y)_t\bigr) - F_{t+1}^i\bigl((x, y)_{t+1}^c\bigr) \right| \right),
$$

   where (x, y)_{t+1}^c denotes the c-th candidate position in frame t + 1. The tracking can also be performed by a classic particle filter by integrating the evaluated feature confidences into its observation model.
4. Set t = t + 1; go to step 2 or end the tracking loop.
Fig. 4. Feature evaluation with Particle Filter during tracking process.
The iteration of this process along the time axis ensures that the evolution of feature confidence is temporally consistent, even when the variation of feature confidence is non-linear and non-Gaussian.
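The sketch below gives one plausible reading of steps 2.1-2.5 of Fig. 4 for a single feature sub-set: particles are bin indices, re-sampled by their confidences, jittered by Gaussian transition noise, and re-weighted by the measured discriminative power. The function name and the default value of sigma are assumptions; the paper does not prescribe this exact implementation.

```python
import numpy as np

def update_subset_confidences(weights, disc_power, rng, sigma=2.0):
    """One Particle Filter update of the bin confidences of a single feature
    sub-set (an interpretation of steps 2.1-2.5 in Fig. 4).

    weights    : current confidences w_t(j^i), one per bin, summing to 1
    disc_power : discriminative power S_{t+1}^i of each bin (Eq. 2), used as measurement
    sigma      : std. dev. of the bin-index transition noise (assumed value)
    """
    M = len(weights)
    # 2.1-2.2: re-sample M bin indices in proportion to their current confidences
    cum = np.cumsum(weights)
    resampled = np.minimum(np.searchsorted(cum, rng.uniform(size=M)), M - 1)
    # 2.3: predict by jittering each re-sampled bin index with Gaussian noise
    predicted = np.clip(np.round(resampled + rng.normal(0.0, sigma, size=M)),
                        0, M - 1).astype(int)
    # 2.4: weight each predicted particle by the measured discriminative power
    new_weights = np.zeros(M)
    for idx in predicted:
        new_weights[idx] += disc_power[idx]
    # 2.5: normalize so the confidences of the sub-set sum to 1
    return new_weights / max(new_weights.sum(), 1e-12)

# Usage: one update per sub-set (12 in total); concatenating the resulting
# confidences gives the weights w_{t+1}(j^i) used in the matching of step 3.
# rng = np.random.default_rng(0)
# w_next = update_subset_confidences(w_current, s_measured, rng)
```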
3. EXPERIMENTS
In this section, experiments with comparisons are carried out to validate the proposed method. The experimental videos are selected from three video datasets: the VIVID tracking video set [10], the CAVIAR tracking video set [11], and our SDL tracking video set [12], most of which are publicly available. Most of the selected videos are challenging. The test videos cover a variety of cases, including occlusion between tracked objects, lighting changes, scale variation, object rotation, and complex backgrounds. Some of the videos were captured from a moving platform, and in some of them the appearances of both the objects and their backgrounds keep changing during almost the entire tracking process. Nevertheless, encouraging results are obtained in our experiments.

In the first video clip, the appearances of both the object (a red and white jeep) and its background change greatly, which changes their features; tracking errors easily occur in such cases. Fig. 5a shows the stable tracking results obtained with the proposed algorithm. In the second video clip, we track an object (a small car) whose color is similar to that of its background. Fig. 5b presents the tracking results, which demonstrate the robustness of the method.
Fig. 5. Two examples of tracking experiments using the proposed approach: sequences (a) and (b) are shown at frames 50, 300, 700, 1000, 1300, and 1600.
To quantitatively evaluate the proposed method, we define the relative displacement error rate (DER) as

$$
\mathrm{DER} = \frac{\text{displacement error between the tracked object position and the ground truth}}{\text{size of the object}}.
$$

The lower the DER, the better the tracking performance. In the experiments we use the average DER over the video clips to reflect the performance of each method. The DER during the tracking processes is shown in Fig. 6. It can be seen from the figure that the DER of our method (about 0.05 to 0.2) is smaller than that of the other three methods throughout the whole tracking process, and the small variation of the DER also indicates the stability of the proposed tracking method.
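A minimal sketch of the DER computation, assuming bounding boxes given as (x, y, w, h) and the object size taken as the diagonal of the ground-truth box; the paper does not specify the exact size measure, so that choice is an assumption.

```python
import math

def displacement_error_rate(tracked_box, gt_box):
    """DER = displacement between tracked and ground-truth centers / object size."""
    tx, ty, tw, th = tracked_box
    gx, gy, gw, gh = gt_box
    displacement = math.hypot((tx + tw / 2) - (gx + gw / 2),
                              (ty + th / 2) - (gy + gh / 2))
    object_size = math.hypot(gw, gh)      # assumed size measure (box diagonal)
    return displacement / object_size
```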
Fig. 6. Displacement Error Rate (DER) over the frame index for our method (HOGC-TRACK), HC-based tracking (HC-TRACK), HOGC-based tracking without confidence evaluation (HOGC1-TRACK), and Collins's method [2] (CMU-TRACK).
Tracking efficiency is another criterion for evaluating a tracking approach. In the experiments, we find that our method runs nearly in real time (about 20 frames per second) on a computer with a Pentium IV 2.4 GHz CPU, which is almost as fast as the CMU method.
4. CONCLUSIONS AND FUTURE WORK
Online automatic feature evaluation is important for improving the adaptability of visual object tracking. In this paper, we propose a novel combined feature evaluation approach for long-duration adaptive object tracking and implement it within a Particle Filter framework. Experimental results with comparisons validate both the combined feature set and the feature evaluation approach; they also demonstrate that object tracking with evaluated combined features is highly reliable when the background is complex or when there are large appearance changes of both the object and the background. A prototype system we developed verifies the feasibility of the tracking method in practical applications.

A known limitation of the current method is that it does not focus on improving the detailed algorithms of the Particle Filter itself, so our tracking performance may in some cases be worse than that of existing tracking methods. Another limitation is that our feature set cannot handle large scale changes well, which can lead to tracking failure when the scale of the tracked object changes greatly. The tracking template updating [9] issue is not considered in our method either. In future work, more effective features that are insensitive to lighting and scale changes should be investigated and integrated into the combined feature set, and existing template updating algorithms could also be integrated to improve the usability of the proposed method in practical applications.
5. ACKNOWLEDGEMENT
This work is partly supported by the National Science Foundation of China under grants No. 60672147 and No. 60872143, and by the Science 100 Program of the Chinese Academy of Sciences.
REFERENCES
[1] D.W. Liang, Q.M. Huang, W. Gao, and H.X. Yao, "Online Selection of Discriminative Features Using Bayes Error Rate for Visual Tracking," 7th Pacific-Rim Conference on Multimedia, pp. 547-555, 2006.
[2] R. Collins and Y. Liu, "Online Selection of Discriminative Tracking Features," Proceedings of the Ninth IEEE ICCV, vol. 1, pp. 346-352, 2003.
[3] J. Wang, X. Chen, and W. Gao, "Online Selecting Discriminative Tracking Features Using Particle Filter," Proceedings of IEEE CVPR, vol. 2, pp. 1037-1042, 2005.
[4] D. Chen and J. Yang, "Robust Object Tracking via Online Spatial Bias Appearance Model Learning," IEEE Trans. PAMI, vol. 29, pp. 2157-2169, 2007.
[5] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proceedings of IEEE CVPR, 2005.
[6] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," IJCV, pp. 91-110, 2004.
[7] E. Cuevas, D. Zaldivar, and R. Rojas, "Particle filter for vision tracking," Technical Report B, Fachbereich Mathematik und Informatik, Freie Universität Berlin, 2005.
[8] Y. Qian, G. Medioni, and I. Cohen, "Multiple target tracking using spatial-temporal Markov chain Monte Carlo data association," Proceedings of IEEE CVPR, pp. 1-8, 2007.
[9] T.F. Cootes, G.J. Edwards, and C.J. Taylor, "Active appearance models," IEEE Trans. PAMI, vol. 23, pp. 681-685, 2001.
[10] VIVID Tracking Evaluation Web Site: http://www.vividevaluation.ri.cmu.edu/datasets/datasets.html.
[11] CAVIAR Test Case Scenarios: http://homepages.inf.ed.ac.uk/rbf/CAVIAR.
[12] SDL tracking video set: [email protected] or [email protected].
[13] E. Cuevas, D. Zaldivar, and R. Rojas, "Kalman filter for vision tracking," Technical Report B, Fachbereich Mathematik und Informatik, Freie Universität Berlin, 2005.
[14] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift," Proceedings of IEEE CVPR, pp. 142-149, 2000.
[15] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," Proceedings of DARPA Image Understanding Workshop, pp. 121-130, 1981.