Adaptive Object Tracking with Online Statistical Model Update

KaiYeuh Chang and Shang-Hong Lai
Dept. of Computer Science, National Tsing Hua University, Hsinchu 300, Taiwan
{kaiyeuh, lai}@cs.nthu.edu.tw

Abstract. In this paper, we propose a statistical model-based contour tracking algorithm built on the Condensation framework. The models include a novel object shape prediction model and two statistical object models; the latter consist of a grayscale histogram model and a contour shape PCA model computed from the previous tracking results. With the incremental singular value decomposition (SVD) technique, these three models are learned and updated very efficiently during tracking. Experiments show that the proposed shape prediction model outperforms an affine predictor. Experimental results also show that the proposed contour tracking algorithm is very stable in tracking human heads in real videos with object scaling, rotation, partial occlusion, and illumination changes.

1 Introduction

Visual tracking has been a main focus of research in video analysis and processing. With the rapid growth of digital video in consumer electronics and video surveillance, reliable visual tracking techniques are in strong demand. Previous methods for visual tracking can be divided into model-based [1-3] and non-model-based [4-6] approaches. For objects with well-defined models, the corresponding tracking problem is easier. However, the requirement of constructing object models beforehand limits the practical feasibility of this approach, especially when the object may undergo a wide variety of motions, including 3D rigid and nonrigid motions. Non-model-based approaches treat object tracking as an optimization problem. They normally track objects through image sequences by using the latest tracking result as the reference for the object. This approach is sensitive to drift, i.e., error accumulation: once the object is lost during tracking, it may not be found again.

Many modifications of the particle filter have been proposed. Rui and Chen [8] modified the computation of the posterior probability by considering the current image in the prediction phase. Recently, Maggio and Cavallaro [9] combined the particle filter and mean shift techniques for refined object tracking. Okuma et al. [10] proposed an algorithm that integrates the particle filter and AdaBoost techniques for tracking multiple targets. Like mean shift, Nummiaro et al. [11] employed the Bhattacharyya coefficient on color distributions for object tracking in a particle filter framework.


In addition, Jepson et al. [2] model the appearance of the tracked object with three components. The wandering model estimates the parameters for rapid temporal variations over short temporal histories, the stable model captures the behavior of temporally stable image observations, and the last component accounts for data outliers. Using an online EM algorithm, they update the models adaptively to achieve robust object tracking.

In this paper, we propose a visual tracking algorithm based on the framework of the Condensation algorithm [1] for tracking object contours, with online object model generation and dynamic prediction model update. The Condensation, or particle filter, framework consists of a prediction phase and a measurement phase. The sample contours at the current frame are predicted during the prediction phase; in the measurement phase, the probability of each sample is computed from the image information at the current frame. In the original Condensation technique, the pre-trained model and the prediction matrix (or motion model) are learned from the best results obtained with Kalman filter tracking. This two-pass procedure (one pass for learning and the other for tracking) is inconvenient for tracking general objects in practice. Recently, Lim et al. [3] proposed a method that constructs and updates the model online for appearance-based object tracking. Contour tracking, however, provides a more detailed tracking result: not only the position, rotation, and scale, but also the object shape deformation. In most appearance-based tracking algorithms, the object is represented by a rectangle or an ellipse that can be aligned by a simple transformation, and the entire information inside the rectangle or ellipse can be exploited to determine the tracking result. Nonetheless, it is difficult to use the entire image region information for deformable contour tracking, since this requires establishing point-to-point correspondences between two deformable regions; this is especially problematic for the particle filter, which uses many random samples to approximate the probability distribution.

In this paper, we propose an object contour tracking algorithm based on the particle filter framework. It only needs an initial contour at the first frame; the object models and the prediction matrix are then constructed online and automatically from the previous contour tracking results. In the proposed algorithm, we build two online models for the target object: one is the shape model and the other is the grayscale histogram model. The grayscale histogram simply records the grayscale information inside the object contour region. Each of these two models is represented by a mean vector and several principal components, which are adaptively computed with the incremental singular value decomposition technique [3, 7].

1.1 Condensation Framework

Here we briefly describe the Condensation framework for visual tracking. In Condensation tracking, the prediction phase can be represented by the probability term

$p(\mathbf{x}_t \mid X_{t-1}) = p(\mathbf{x}_t \mid \mathbf{x}_{t-1})$, (1)

where $\mathbf{x}_t$ is the predicted state, $\mathbf{x}_{t-1}$ is the state at the previous time instant t-1, and $X_{t-1}$ denotes all the previous states. The left-hand side means that we use all the previous states to predict the current state; for simplicity, we take only the state at the previous time instant to predict the current state, as expressed by the conditional probability on the right-hand side of the above equation. On the other hand, the measurement phase can be represented by the following function:

$p(\mathbf{z}_t \mid \mathbf{x}_t)$, (2)

where $\mathbf{z}_t$ is the current observed information. This means that we check whether the predicted state matches the current observation. The object tracking problem can be thought of as using all the image and object model information currently available to find the object, which is given by the following conditional probability:

$p(\mathbf{x}_t \mid Z_t)$, (3)

where $Z_t$ denotes all the information that we currently have. To compute the probability function (3), we employ the Bayes rule as follows:

$p(\mathbf{x}_t \mid Z_t) = k_t\, p(\mathbf{z}_t \mid \mathbf{x}_t)\, p(\mathbf{x}_t \mid Z_{t-1})$, (4)

where

$p(\mathbf{x}_t \mid Z_{t-1}) = \int_{\mathbf{x}_{t-1}} p(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, p(\mathbf{x}_{t-1} \mid Z_{t-1})\, d\mathbf{x}_{t-1}$

and $k_t$ is a normalization constant. These two equations show how the prediction and measurement phases work together in the Condensation tracking algorithm.

Now we define several symbols for explaining the Condensation algorithm. The points of each sample contour are arranged into a vector $\mathbf{s} = [x_1\ y_1 \ldots x_i\ y_i \ldots x_n\ y_n]^T$, where n is the total number of points along the contour and $(x_i, y_i)$ is the coordinate of the i-th point. The symbol $\mathbf{s}_{i,t}$ denotes the i-th sample contour at the t-th frame, and $\pi_{i,t}$ denotes the probability of $\mathbf{s}_{i,t}$. Similarly, $\mathbf{s}_{i,t}^{sampled}$ denotes the i-th random sample at the t-th frame, drawn according to the probabilities of all sample contours at the (t-1)-th frame. We denote the predicted contour of $\mathbf{s}_{i,t}^{sampled}$ as $\mathbf{s}_{i,t}^{pred}$. The Condensation tracking algorithm [1] is given as follows: given N contour samples and their corresponding probabilities $\{\mathbf{s}_{i,t-1}, \pi_{i,t-1}, i = 1, \ldots, N\}$ at the (t-1)-th frame, we want to find N contour samples and their corresponding probabilities $\{\mathbf{s}_{i,t}, \pi_{i,t}, i = 1, \ldots, N\}$ at the t-th frame based on the following procedure:

1. Sample N contours $\mathbf{s}_{i,t}^{sampled}$, $i = 1, \ldots, N$, from the previous sample contours $\mathbf{s}_{i,t-1}$, $i = 1, \ldots, N$, according to their associated probabilities $\pi_{i,t-1}$.
2. Apply the prediction function $p(\mathbf{x}_t \mid \mathbf{x}_{t-1} = \mathbf{s}_{i,t}^{sampled})$ to predict $\mathbf{s}_{i,t}^{pred}$ from $\mathbf{s}_{i,t}^{sampled}$.
3. Find the contour $\mathbf{s}_{i,t}$ around the predicted contour $\mathbf{s}_{i,t}^{pred}$ at the t-th frame.
4. Compute $\pi_{i,t} = p(\mathbf{z}_t \mid \mathbf{x}_t = \mathbf{s}_{i,t})$.
5. The expected contour $\mathbf{s}_t$ at the t-th frame is computed as

$\mathbf{s}_t = \sum_i \pi_{i,t}\, \mathbf{s}_{i,t}$. (5)

At steps 3 and 4, we may find points with large edge responses around the predicted contour in the current frame; for simplicity, we call the resulting contour the image contour. With a shape model, a fitting algorithm can be applied to the image contour to check whether it is suitable to be the target object's boundary.


In Condensation [1], the image contour is a reference result while the fitted contour is taken as $\mathbf{s}_{i,t}$. When $\mathbf{s}_{i,t}$ is close to the image contour, it is very likely to be the object's boundary. Taking the fitted contour as $\mathbf{s}_{i,t}$ can reduce the noise from the image and at least guarantees that $\mathbf{s}_{i,t}$ is a reasonable contour. However, when the model is unreliable or lacks information, the fitted contour may not be a good choice; moreover, building the model requires object information in the first place. Thus, we take the image contour as $\mathbf{s}_{i,t}$ and the fitted contour as the reference result. This decision carries considerable risk, so choosing the contour points from the image becomes very important.
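To make the overall flow concrete, the following is a minimal sketch of the resampling-prediction-measurement loop in NumPy. The helper callables predict_contour, find_image_contour, and observation_prob are hypothetical placeholders for the components developed in Sections 2-4.

```python
import numpy as np

def condensation_step(samples, weights, frame, rng,
                      predict_contour, find_image_contour, observation_prob):
    """One Condensation iteration (a sketch of steps 1-5 above).

    samples : (N, 2n) array, each row a contour [x1, y1, ..., xn, yn]
    weights : (N,) array of probabilities pi_{i,t-1} summing to one
    The three callables stand in for the prediction function (Secs. 2-3),
    the contour refinement (Sec. 4), and the observation model (Eq. 12).
    """
    N = len(samples)
    # Step 1: resample contours according to their previous probabilities.
    idx = rng.choice(N, size=N, p=weights)
    sampled = samples[idx]
    # Steps 2-3: predict each contour, then refine it against the frame.
    predicted = np.array([predict_contour(s, rng) for s in sampled])
    refined = np.array([find_image_contour(p, frame) for p in predicted])
    # Step 4: measure each refined contour and normalize (Eq. 12).
    w = np.array([observation_prob(s, frame) for s in refined])
    w /= w.sum()
    # Step 5: the expected contour is the weighted mean (Eq. 5).
    expected = w @ refined
    return refined, w, expected
```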

2 Visual Tracking Based on Condensation Algorithm

The Condensation algorithm requires three basic components. The first is the prediction function (step 2), the second is the method for finding $\mathbf{s}_{i,t}$ around $\mathbf{s}_{i,t}^{pred}$ according to the information of the current frame (step 3), and the last is the way to compute the probability $\pi_{i,t} = p(\mathbf{z}_t \mid \mathbf{x}_t = \mathbf{s}_{i,t})$ (step 4). Assume that we already have the prediction matrix, the object models, and a way to find the image contour. We show here how the prediction function works and how to measure the probability; the determination of the prediction matrix, the object models, and the image contour will be described in later sections. Our prediction function is given as follows:

$\mathbf{s}_{i,t}^{pred} = P \begin{bmatrix} \mathbf{s}_{i,t}^{sampled} \\ 1 \end{bmatrix}$, (6)

where P is the prediction matrix and the 1 below $\mathbf{s}_{i,t}^{sampled}$ is used to model the global translation of the contour. In our experiments, the performance of the prediction matrix degrades when the contour undergoes a larger change in shape or position than the previous ones. Thus, we combine $\mathbf{s}_{i,t}^{sampled}$ and $\mathbf{s}_{i,t}^{pred}$ to obtain a more stable prediction result:

$\mathbf{s}_{i,t}^{pred} \leftarrow C_{pred}\, \mathbf{s}_{i,t}^{pred} + (1 - C_{pred})\, \mathbf{s}_{i,t}^{sampled}$, (7)

where $C_{pred}$ is the weight of the prediction result. Then we add Gaussian noise to the rotation, scale, and translation parameters, i.e.,

$\begin{bmatrix} \mathbf{s}_{i,t}^{pred}(2j) \\ \mathbf{s}_{i,t}^{pred}(2j+1) \end{bmatrix} \leftarrow RS \begin{bmatrix} \mathbf{s}_{i,t}^{pred}(2j) \\ \mathbf{s}_{i,t}^{pred}(2j+1) \end{bmatrix} + \mathbf{t}$, (8)

where $(\mathbf{s}_{i,t}^{pred}(2j), \mathbf{s}_{i,t}^{pred}(2j+1))$ is the coordinate of the j-th point of the contour, R and S are the random rotation and scaling matrices, and $\mathbf{t}$ is the random translation vector.
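As an illustration, the prediction step of Eqs. (6)-(8) can be sketched as follows. The blending weight and noise levels are assumed values, not parameters reported in the paper, and rotating about the contour centroid is an implementation choice.

```python
import numpy as np

def predict_contour(sampled, P, rng, c_pred=0.8,
                    sigma_rot=0.02, sigma_scale=0.02, sigma_trans=1.0):
    """Prediction step, Eqs. (6)-(8). The weight c_pred and the sigma_*
    noise levels are illustrative assumptions.

    sampled : (2n,) contour vector [x1, y1, ..., xn, yn]
    P       : (2n, 2n+1) prediction matrix from Section 3
    """
    # Eq. (6): linear prediction with an appended 1 for global translation.
    pred = P @ np.append(sampled, 1.0)
    # Eq. (7): blend the prediction with the sampled contour for stability.
    pred = c_pred * pred + (1.0 - c_pred) * sampled
    # Eq. (8): perturb with a random rotation, scale, and translation.
    theta = rng.normal(0.0, sigma_rot)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    S = (1.0 + rng.normal(0.0, sigma_scale)) * np.eye(2)
    t = rng.normal(0.0, sigma_trans, size=2)
    pts = pred.reshape(-1, 2)
    # Rotate/scale about the contour centroid (an implementation choice).
    c = pts.mean(axis=0)
    pts = (pts - c) @ (R @ S).T + c + t
    return pts.ravel()
```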

To compute the observation probability, we build the grayscale histogram and shape models for the target object and update the object models online. The grayscale histogram is normalized by the total number of pixels inside the contour region, thus representing the occurrence frequency of each bin. For each contour $\mathbf{s}_{i,t}$, we compute the grayscale histogram $\mathbf{h}_{i,t}$ inside it and align the contour to the mean shape; the aligned result is $\mathbf{s}_{i,t}^{aligned}$. We project the contour $\mathbf{s}_{i,t}^{aligned}$ and the corresponding grayscale histogram $\mathbf{h}_{i,t}$ onto the object model, which consists of the grayscale histogram and shape PCA models, to compute the reconstructed contour $\mathbf{s}_{i,t}^{recon}$ and grayscale histogram $\mathbf{h}_{i,t}^{recon}$, respectively. Note that the PCA models for the grayscale histogram and the object shape can be adaptively updated from the previous object tracking results by using the incremental singular value decomposition [3, 7]. This PCA reconstruction is used to measure how well the observation fits the object model. The discrepancies of the model fitting are given as follows:

$dh_{i,t} = \lVert \mathbf{h}_{i,t} - \mathbf{h}_{i,t}^{recon} \rVert = \sum_j \lvert \mathbf{h}_{i,t}(j) - \mathbf{h}_{i,t}^{recon}(j) \rvert$, (9)

and

$ds_{i,t} = \lVert \mathbf{s}_{i,t}^{aligned} - \mathbf{s}_{i,t}^{recon} \rVert = \sqrt{\sum_j \left( \mathbf{s}_{i,t}^{aligned}(j) - \mathbf{s}_{i,t}^{recon}(j) \right)^2}$. (10)

Thus, the observation probability function is defined by

$\pi_{i,t} = p(\mathbf{z}_t \mid \mathbf{x}_t = \mathbf{s}_{i,t}) \propto \exp(-C_s\, ds_{i,t} - C_h\, dh_{i,t})$, (11)

where $C_s$ and $C_h$ are two constants representing the weights of the two factors. Thus, we compute the normalized conditional probability $\pi_{i,t} = p(\mathbf{z}_t \mid \mathbf{x}_t = \mathbf{s}_{i,t})$ as follows:

$\pi_{i,t} = p(\mathbf{z}_t \mid \mathbf{x}_t = \mathbf{s}_{i,t}) = \dfrac{\exp(-C_s\, ds_{i,t} - C_h\, dh_{i,t})}{\sum_j \exp(-C_s\, ds_{j,t} - C_h\, dh_{j,t})}$. (12)
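A sketch of this measurement step is given below, assuming the PCA means and bases are maintained by the incremental SVD of [3, 7]; the constants c_s and c_h are illustrative values.

```python
import numpy as np

def pca_residual(x, mean, basis):
    """Residual between x and its reconstruction from a PCA model.
    mean : (d,) model mean;  basis : (d, k) principal components."""
    coeff = basis.T @ (x - mean)
    recon = mean + basis @ coeff
    return x - recon

def observation_weights(contours_aligned, hists,
                        shape_mean, shape_basis, hist_mean, hist_basis,
                        c_s=1.0, c_h=1.0):
    """Eqs. (9)-(12): reconstruction errors of shape and histogram,
    turned into normalized sample weights. c_s and c_h are assumed."""
    ds = np.array([np.linalg.norm(pca_residual(s, shape_mean, shape_basis))
                   for s in contours_aligned])                  # Eq. (10)
    dh = np.array([np.abs(pca_residual(h, hist_mean, hist_basis)).sum()
                   for h in hists])                             # Eq. (9)
    logw = -c_s * ds - c_h * dh                                 # Eq. (11)
    w = np.exp(logw - logw.max())   # subtract max for numerical stability
    return w / w.sum()                                          # Eq. (12)
```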

3 Shape Prediction Matrix

Consider two object contours $\mathbf{s}_{t-1}$ and $\mathbf{s}_t$ at two consecutive frames t-1 and t. We employ a prediction matrix P to describe their relationship as follows:

$P \begin{bmatrix} \mathbf{s}_{t-1} \\ 1 \end{bmatrix} \approx \mathbf{s}_t$. (13)

When considering N consecutive frames, estimating the prediction matrix P amounts to minimizing the following energy function:

$\sum_{i=2}^{N} \left\lVert \mathbf{s}_i - P \begin{bmatrix} \mathbf{s}_{i-1} \\ 1 \end{bmatrix} \right\rVert^2$. (14)

Let $\tilde{S}_{i,j} = \begin{bmatrix} \mathbf{s}_i & \mathbf{s}_{i+1} & \cdots & \mathbf{s}_j \\ 1 & 1 & \cdots & 1 \end{bmatrix}$ and $S_{i,j} = [\mathbf{s}_i\ \mathbf{s}_{i+1}\ \cdots\ \mathbf{s}_j]$. Then we can compute the matrix P by least-squares estimation, leading to

$P = S_{2,N}\, \tilde{S}_{1,N-1}^{T} \left( \tilde{S}_{1,N-1}\, \tilde{S}_{1,N-1}^{T} \right)^{-1}$. (15)
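As a reference, the batch estimate of Eq. (15) can be obtained with an ordinary least-squares solver; this is a sketch of the batch form, before the incremental machinery described next.

```python
import numpy as np

def estimate_prediction_matrix(contours):
    """Batch estimate of P from Eq. (15).

    contours : (N, 2n) array of tracked contours s_1, ..., s_N.
    Returns P of shape (2n, 2n+1).
    """
    S = contours.T                                  # columns are contours
    S_tilde = np.vstack([S, np.ones((1, S.shape[1]))])  # append a row of 1s
    A = S_tilde[:, :-1]                             # S~_{1,N-1}
    B = S[:, 1:]                                    # S_{2,N}
    # Solve P A = B in the least-squares sense; lstsq solves A^T P^T = B^T.
    # Equivalent to Eq. (15) whenever A A^T is invertible.
    X, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return X.T
```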


Note that the prediction matrix P needs to be updated dynamically during object tracking to better describe the shape deformation. However, we do not need to store all the previous contours to compute the prediction matrix. Instead, we can simply store and update the two matrices $S_{2,N}\, \tilde{S}_{1,N-1}^{T}$ and $\tilde{S}_{1,N-1}\, \tilde{S}_{1,N-1}^{T}$. When new contours at M frames become available for the prediction matrix update, these two matrices become

$S_{2,N+M}\, \tilde{S}_{1,N+M-1}^{T} = S_{2,N}\, \tilde{S}_{1,N-1}^{T} + S_{N+1,N+M}\, \tilde{S}_{N,N+M-1}^{T}$ (16)

and

$\tilde{S}_{1,N+M-1}\, \tilde{S}_{1,N+M-1}^{T} = \tilde{S}_{1,N-1}\, \tilde{S}_{1,N-1}^{T} + \tilde{S}_{N,N+M-1}\, \tilde{S}_{N,N+M-1}^{T}$. (17)

If each contour is composed of n points, we need to store two matrices of sizes 2n×(2n+1) and (2n+1)×(2n+1), respectively. However, computing the inverse of the matrix $\tilde{S}_{1,N-1}\, \tilde{S}_{1,N-1}^{T}$ is computationally expensive for large n, and the problem is even worse when frequent prediction matrix updates are required in practice. Here we propose an efficient way to update the prediction matrix P by using the incremental SVD technique, as described below. Considering N frames for estimating the prediction matrix P, we arrange all N contours column by column as follows:

$P\, \tilde{S}_{1,N-1} = S_{2,N}$. (18)

Using singular value decomposition (SVD) to decompose the matrix $\tilde{S}_{1,N-1}$ yields

$\tilde{S}_{1,N-1} = U_{1,N-1}\, \Sigma_{1,N-1}\, V_{1,N-1}^{T}$. (19)

Then the prediction matrix P can be computed by

$P = S_{2,N}\, V_{1,N-1}\, \Sigma_{1,N-1}^{-1}\, U_{1,N-1}^{T}$. (20)

The size of the matrix $V_{1,N-1}$ grows with the number of frames. By combining $S_{2,N}$ and $V_{1,N-1}$ into a single matrix $S_{2,N} V_{1,N-1}$, we only need to maintain the three matrices $U_{1,N-1}$, $\Sigma_{1,N-1}$, and $S_{2,N} V_{1,N-1}$. If each contour consists of n points and the k largest singular values with their corresponding singular vectors are kept, then the three matrix sizes are (2n+1)×k, k×k, and 2n×k, respectively. The incremental SVD technique can easily generate and update the matrices $U_{1,N-1}$ and $\Sigma_{1,N-1}$. The computational complexity of maintaining $S_{2,N} V_{1,N-1}$ depends only on the total number of singular values kept. The incremental SVD produces a matrix V after each update step, and this matrix V is used to update $V_{1,N-1}$. Assume that M new data vectors arrive and that we keep $k_{1,N-1}$ and $k_{1,N+M-1}$ singular values before and after the model update; the size of V is then $(k_{1,N-1}+M) \times k_{1,N+M-1}$.

We can divide V into two parts, i.e. $V = [V_{up}^{T}\ V_{bottom}^{T}]^{T}$, where the sizes of $V_{up}$ and $V_{bottom}$ are $k_{1,N-1} \times k_{1,N+M-1}$ and $M \times k_{1,N+M-1}$, respectively. The matrix $V_{1,N+M-1}$ becomes $V_{1,N+M-1} = [(V_{1,N-1} V_{up})^{T}\ V_{bottom}^{T}]^{T}$. Thus, we update $S_{2,N} V_{1,N-1}$ as follows:

$S_{2,N+M}\, V_{1,N+M-1} = [S_{2,N} V_{1,N-1}]\, V_{up} + S_{N+1,N+M}\, V_{bottom}$. (21)
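A minimal sketch of this update in the style of the sequential Karhunen-Loeve algorithm [3, 7] is given below; the forgetting factor used in Section 5 is omitted for clarity, and all names are illustrative.

```python
import numpy as np

def update_prediction_model(U, sig, S2N_V, new_cols, new_next, k):
    """One incremental SVD update step (a sketch in the spirit of [3, 7]).

    U, sig   : left singular vectors / singular values of S~_{1,N-1}
    S2N_V    : the maintained product S_{2,N} V_{1,N-1}
    new_cols : M new columns appended to S~ (shape (2n+1, M))
    new_next : corresponding new columns of S_{2,N} (shape (2n, M))
    k        : number of singular values kept after the update
    """
    # Project the new data onto the current subspace; QR the residual.
    proj = U.T @ new_cols
    Q, R = np.linalg.qr(new_cols - U @ proj)
    # Re-diagonalize the small middle matrix.
    K = np.block([[np.diag(sig), proj],
                  [np.zeros((R.shape[0], sig.size)), R]])
    Up, sp, VpT = np.linalg.svd(K, full_matrices=False)
    V = VpT.T                              # the matrix called V in Eq. (21)
    V_up, V_bottom = V[:sig.size], V[sig.size:]
    # Eq. (21): update the product without storing the old contours.
    S2N_V_new = S2N_V @ V_up + new_next @ V_bottom
    U_new = np.hstack([U, Q]) @ Up
    return U_new[:, :k], sp[:k], S2N_V_new[:, :k]

def prediction_matrix(U, sig, S2N_V):
    """Eq. (20): P = S_{2,N} V Sigma^{-1} U^T."""
    return (S2N_V / sig) @ U.T
```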


4 Contour Refinement

The image contour is composed of several nodal points. The main idea for finding the nodal points is to search for a large gradient along the normal direction of each point in $\mathbf{s}_{i,t}^{pred}$, since there is usually a large gradient at the object boundary. In addition, several criteria are considered for the nodal points. Firstly, the directions of the gradient and the normal line should be as closely aligned (or opposed) as possible. Secondly, we compute an average distance according to the large-gradient criterion; the distance between each nodal point and the corresponding point in the predicted contour is assumed to be close to this average distance. Thirdly, if more than one point meets the above two criteria, we keep all of them as candidates for the nodal point. Thus, the nodal points are selected based on the following score function:

$Score(p^{feature}) = \begin{cases} random(minScale, maxScale) \times \lvert \mathbf{n} \cdot \mathbf{g}^{feature} \rvert \exp\left( -\dfrac{\bigl| \lVert p^{feature} - p^{pred} \rVert - avgDis \bigr|}{C_{var}} \right), & \text{if } \dfrac{\lvert \mathbf{n} \cdot \mathbf{g}^{feature} \rvert}{\lVert \mathbf{g}^{feature} \rVert} > C_{angle}, \\ 0, & \text{otherwise,} \end{cases}$ (22)

where $p^{pred}$ is one of the points of $\mathbf{s}_{i,t}^{pred}$, $\mathbf{n}$ is a unit vector along the corresponding normal direction, $p^{feature}$ is one of the points located on the normal line through $p^{pred}$, and $\mathbf{g}^{feature}$ is the image gradient at the location $p^{feature}$. Thus, $\mathbf{n} \cdot \mathbf{g}^{feature}$ is the amount of gradient projected onto the normal direction. The condition $\lvert \mathbf{n} \cdot \mathbf{g}^{feature} \rvert / \lVert \mathbf{g}^{feature} \rVert > C_{angle}$ expresses the first criterion, with $C_{angle}$ the threshold on the minimum cosine of the angle between $\mathbf{n}$ and $\mathbf{g}^{feature}$. The function random(minScale, maxScale) returns a random number between the positive values minScale and maxScale; this randomization implements the third criterion. The exponential term implements the second criterion, where the parameter $C_{var}$ controls the penalty on distance deviations. Let the n-th point of $\mathbf{s}_{i,t}^{pred}$ be denoted by $p_n^{pred}$, let $p_{m,n}^{feature}$ be the m-th candidate point for $p_n^{pred}$, and let $p_n^{feature}$ be the n-th nodal point of the contour $\mathbf{s}_{i,t}$. Then $p_n^{feature}$ is determined by maximizing the score function:

$p_n^{feature} = \arg\max_{p_{m,n}^{feature}} Score(p_{m,n}^{feature})$. (23)
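A sketch of this nodal point search is given below; the search radius, thresholds, and scale range are assumed values, and the nearest-pixel gradient lookup is an implementation choice.

```python
import numpy as np

def refine_point(p_pred, normal, grad_x, grad_y, avg_dis, rng,
                 search_radius=10, c_angle=0.5, c_var=4.0,
                 min_scale=0.9, max_scale=1.1):
    """Pick the nodal point along the normal of p_pred, Eqs. (22)-(23).
    All thresholds and ranges here are illustrative assumptions.

    p_pred : (2,) predicted point;  normal : (2,) unit normal at p_pred
    grad_x, grad_y : image gradient maps
    """
    best_score, best_point = 0.0, p_pred
    for d in range(-search_radius, search_radius + 1):
        p = p_pred + d * normal
        r, c = int(round(p[1])), int(round(p[0]))   # nearest-pixel lookup
        if not (0 <= r < grad_x.shape[0] and 0 <= c < grad_x.shape[1]):
            continue
        g = np.array([grad_x[r, c], grad_y[r, c]])
        g_norm = np.linalg.norm(g)
        if g_norm < 1e-8:
            continue
        proj = abs(normal @ g)                      # gradient along normal
        if proj / g_norm <= c_angle:                # first criterion
            continue
        dist_term = np.exp(-abs(np.linalg.norm(p - p_pred) - avg_dis) / c_var)
        score = rng.uniform(min_scale, max_scale) * proj * dist_term  # Eq. (22)
        if score > best_score:                      # Eq. (23)
            best_score, best_point = score, p
    return best_point
```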

5 Experimental Results

In this section, we present experimental results on video tracking using the modified Condensation tracking algorithm with online model adaptation. For comparison with the proposed shape prediction matrix update scheme, we also report results obtained with affine motion prediction. The affine motion model can be represented by the following equation:


$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} s_{x,t}(j) \\ s_{y,t}(j) \end{bmatrix} + \begin{bmatrix} e \\ f \end{bmatrix} = \begin{bmatrix} s_{x,t+1}(j) \\ s_{y,t+1}(j) \end{bmatrix}$, (24)

where $(s_{x,t}(j), s_{y,t}(j))$ is the coordinate of the j-th point of the contour at the t-th frame. By assuming the affine motion to be constant over a short period, we can estimate the affine motion parameters by the least-squares solution as follows:

$\sum_{\substack{t \in \text{previous frames} \\ j \in \text{contour points}}} \begin{bmatrix} A & C & 0 & 0 & D & 0 \\ C & B & 0 & 0 & E & 0 \\ 0 & 0 & A & C & 0 & D \\ 0 & 0 & C & B & 0 & E \\ D & E & 0 & 0 & F & 0 \\ 0 & 0 & D & E & 0 & F \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \end{bmatrix} = \sum_{\substack{t \in \text{previous frames} \\ j \in \text{contour points}}} \begin{bmatrix} G \\ H \\ I \\ J \\ K \\ L \end{bmatrix}$, (25)

where $A = s_{x,t}(j)^2$, $B = s_{y,t}(j)^2$, $C = s_{x,t}(j)\, s_{y,t}(j)$, $D = s_{x,t}(j)$, $E = s_{y,t}(j)$, $F = 1$, $G = s_{x,t}(j)\, s_{x,t+1}(j)$, $H = s_{y,t}(j)\, s_{x,t+1}(j)$, $I = s_{x,t}(j)\, s_{y,t+1}(j)$, $J = s_{y,t}(j)\, s_{y,t+1}(j)$, $K = s_{x,t+1}(j)$, and $L = s_{y,t+1}(j)$.

In our implementation, the total number of samples in the particle filter is 90. The shape prediction matrix used in our tracking algorithm is initialized to be an identity matrix and is updated every 3 frames. For ease of computation, we employ the forgetting factor scheme of the incremental SVD technique [3]. The forgetting factor is set to (n - 3)/n, where n is the total number of frames used for estimating the shape prediction matrix, which is set to 40 empirically. For the affine motion estimation, we use a small number of frames in the least-squares estimation because there are only 6 affine motion parameters. In our experiments, the affine predictor is less stable when a certain degree of error is present in the tracking result. In contrast, our prediction matrix accounts for temporal shape variations across more frames, which makes the shape prediction more stable, as is evident from Figures 1 and 2.

In addition, we show the performance of our contour tracking algorithm on two sequences. The total number of samples in the particle filter is set to 90, and we use 40 previous frames for estimating the shape prediction matrix. The amounts of data used for updating the shape and grayscale histogram models are both 100. The first test video is the Dudek sequence [2], about 38 seconds at a frame rate of 15 fps. In this sequence, we track the contour of the head, which undergoes different scales, poses, translations, and partial occlusions. Some of the tracking results are depicted in Figure 3. The background of the video is cluttered and contains many different objects; the results show that our algorithm generally provides quite reliable tracking performance. The second test video is about 51 seconds at 15 fps. It contains a person moving in a room, and the contour of his head is our target object. The main difficulty is the large illumination change from dark to bright conditions, which tests our grayscale histogram model. The results show that the proposed algorithm tracks the head contour well over the entire sequence, as depicted for some frames in Figure 4.
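For reference, instead of assembling the normal equations of Eq. (25) by hand, the affine parameters can be obtained from the equivalent stacked least-squares problem; this is a mathematically equivalent sketch, not necessarily the authors' implementation.

```python
import numpy as np

def estimate_affine(pts_prev, pts_next):
    """Least-squares affine parameters (a, b, c, d, e, f) of Eq. (24).

    pts_prev, pts_next : (m, 2) arrays of corresponding contour points
    gathered over the previous several frames.
    """
    m = len(pts_prev)
    x, y = pts_prev[:, 0], pts_prev[:, 1]
    zeros, ones = np.zeros(m), np.ones(m)
    # Two rows per correspondence: one for x', one for y'.
    A = np.vstack([np.column_stack([x, y, zeros, zeros, ones, zeros]),
                   np.column_stack([zeros, zeros, x, y, zeros, ones])])
    b = np.concatenate([pts_next[:, 0], pts_next[:, 1]])
    # Solving this system is equivalent to the normal equations of Eq. (25).
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params   # a, b, c, d, e, f
```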


Fig. 1. Predictor comparison I: (a) tracking results using the affine predictor for frames 148-152, with the previous 2 frames used to estimate the affine matrix; (b) tracking results using the affine predictor for frames 148-152, with the previous 4 frames used; (c) tracking results using the proposed shape prediction matrix for frames 148-152, with 40 frames used for estimating the shape prediction matrix.

Fig. 2. Predictor comparison II: tracking results using (a) the affine predictor (2 previous frames) and (b) the proposed shape prediction matrix (40 previous frames) for frames 148-152.

Fig. 3. The tracking results of the Dudek face sequence [2] with different scales, poses, translations and partial occlusions

Fig. 4. The tracking results of the proposed algorithm on the second test video sequence with different scales, poses, translations, and significant illumination changes from dark to bright conditions

6 Conclusion

In this paper, we propose an adaptive contour tracking algorithm based on the Condensation algorithm, with online updating of the shape prediction matrix and object models. The novel shape prediction model is very flexible and accounts for temporal shape deformation. The object model consists of the grayscale histogram and contour PCA models, adaptively computed from previous tracking results by using the incremental SVD technique. The proposed Condensation tracking algorithm with online model update is computationally efficient due to the use of incremental SVD for updating both the object and shape prediction models. Thanks to the online model update capability and the flexible shape prediction model, the proposed tracking algorithm is very stable, since the most recent tracking results are used for the model update. Experimental results show that the proposed algorithm tracks human heads very reliably in cluttered environments and under large lighting variations on several real videos.

References

1. Isard, M., Blake, A.: CONDENSATION - conditional density propagation for visual tracking. International Journal of Computer Vision, Vol. 29. (1998) 5-28
2. Jepson, A.D., Fleet, D.J., El-Maraghi, T.F.: Robust online appearance models for visual tracking. IEEE Conf. Computer Vision and Pattern Recognition, Vol. 1. (2001) 415-422
3. Lim, J., Ross, D., Lin, R.-S., Yang, M.-H.: Incremental learning for visual tracking. Neural Information Processing Systems 17 (NIPS 2004)
4. Liu, T.-L., Chen, H.-T.: Real-time tracking using trust-region methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 26. (2004) 397-402
5. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. Proc. Conf. Computer Vision and Pattern Recognition, Vol. 2. (2000) 142-149
6. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25. (2003) 564-575
7. Levy, A., Lindenbaum, M.: Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Transactions on Image Processing, Vol. 9. (2000) 1371-1374
8. Rui, Y., Chen, Y.: Better proposal distributions: object tracking using unscented particle filter. IEEE Conf. Computer Vision and Pattern Recognition. (2001) 786-793
9. Maggio, E., Cavallaro, A.: Hybrid particle filter and mean shift tracker with adaptive transition model. Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). (2005) 19-23
10. Okuma, K., Taleghani, A., de Freitas, N., Little, J.J., Lowe, D.G.: A boosted particle filter: multi-target detection and tracking. European Conference on Computer Vision. (2004) 28-39
11. Nummiaro, K., Koller-Meier, E., Van Gool, L.: An adaptive color-based particle filter. International Journal of Image and Vision Computing, Vol. 21. (2003) 99-110
