Scale-invariant visual tracking by particle filtering

Arie Nakhmani*^a, Allen Tannenbaum^{a,b}

^a Dept. of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel
^b Schools of Electrical and Computer and Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0250

ABSTRACT
Visual tracking is an important task that has received much attention in recent years. Robust generic tracking tools are of major interest for applications ranging from surveillance and security to image-guided surgery. In these applications, the objects of interest may be translated and scaled. We present here an algorithm that uses scaled normalized cross-correlation matching as the likelihood within the particle filtering framework; our algorithm requires neither color nor contour cues. Experimental results with constant rectangular templates show that the method is reliable in noisy and cluttered scenarios, and provides accurate and smooth trajectories in cases of target translation and scaling.

Keywords: Tracking, cross-correlation, CONDENSATION algorithm, scale-invariant, surveillance
1. INTRODUCTION

In this note, we investigate the problem of tracking arbitrary targets in video sequences. Many of the available algorithms tend to be application-specific, are appropriate only for a very limited class of video sequences, and assume strong prior information about the tracked target (e.g., shape, texture, size, color, camera dynamics, or motion constraints). On the other hand, a number of more generic visual tracking algorithms search for distinctive features that can be followed from frame to frame. For these reasons, any progress on trackers for general, arbitrary targets (without distinctive features) will be of interest for active vision, recognition, and surveillance applications.

In the present work, we propose a video tracking framework for non-articulated (blob-like) targets that lack prominent features. The proposed algorithm works in a variety of scenarios, and deals naturally with clutter and noise in the scenes, target scaling, and low-contrast targets. The most important assumption is that the target motion and scaling are smooth, without abrupt changes. We suppose that the target of interest is selected (by a human operator or by an automatic detection algorithm) in the first frame of the video sequence. Tracking is performed by acquiring the trajectory of the target's centroid in a given bounding box.

We should note that this problem formulation is not new, and a large literature is available on the topic; we mention here only a few of the works most relevant to the approach taken in this paper. A comprehensive survey of visual tracking methods can be found in the paper by Yilmaz et al. [1]. A deep analysis of particle filters is provided in [2], where rigorous theory and applications of particle filters are presented. A powerful application of particle filters to image sequences (the CONDENSATION algorithm) can be found in the paper by Isard and Blake [3]. Possible solutions to scale-invariant template matching are presented in [4-6]; see these works and the references therein. Although several attempts to combine area template matching with particle filtering have been made previously [7, 8], they used adaptive and learning schemes, which makes them different from the algorithm given in this paper.

The remainder of this paper is organized as follows. Section 2 explains the scale-invariant template-matching problem: we briefly discuss classical template matching with the normalized cross-correlation coefficient (NCC), and define the concept of scaled normalized cross-correlation (SNCC). In Section 3, we consider the general problem of tracking with particle filters, and present the algorithm, whose measurement steps are based on the SNCC. In Section 4, we test our algorithm on three video sequences that illustrate some of its key features. Finally, in Section 5, we summarize our research and present our conclusions; we also discuss several problems that remain to be solved, and propose future directions for the research.
2. SCALE-INVARIANT TEMPLATE MATCHING

Let I(m,n) denote the intensity value of the image (or the search region), and P(i,j) denote the intensity value of the template patch. We assume that the size of I is M_x × M_y, and the size of P is N_x × N_y; clearly, we assume that I is larger than P. It is known that a noisy version of the patch is placed somewhere in the image I, and our goal is to determine the most probable position of the patch in I. The standard approach to this problem is to compute the coordinates of the maximum of the normalized cross-correlation coefficient (NCC) between the image and the template; these coordinates represent the location of the best match. The normalized cross-correlation coefficient is defined for any pixel (m,n) by:

\mathrm{NCC}(m,n) = \frac{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} \left(I(i+m-1,\, j+n-1) - \bar{I}(m,n)\right)\left(P(i,j) - \bar{P}\right)}{\sqrt{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} \left(I(i+m-1,\, j+n-1) - \bar{I}(m,n)\right)^2 \; \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} \left(P(i,j) - \bar{P}\right)^2}}    (1)

where the mean intensities are defined by:

\bar{P} = \frac{1}{N_x N_y} \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} P(i,j),    (2)

\bar{I}(m,n) = \frac{1}{N_x N_y} \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} I(i+m-1,\, j+n-1),    (3)

m = 1, 2, \ldots, M_x - N_x + 1, \qquad n = 1, 2, \ldots, M_y - N_y + 1.    (4)
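For concreteness, Eqs. (1)-(4) translate directly into code. The following is a minimal, unoptimized Python/NumPy sketch; the function and variable names are ours, not the paper's. In practice, one would use a fast FFT-based implementation such as the one described in [9], or a library routine such as OpenCV's cv2.matchTemplate with the TM_CCOEFF_NORMED mode, which computes the same quantity.

```python
import numpy as np

def ncc_map(image: np.ndarray, patch: np.ndarray) -> np.ndarray:
    """Normalized cross-correlation map: a direct (slow) transcription
    of Eqs. (1)-(4). `image` is I (Mx x My), `patch` is P (Nx x Ny)."""
    Mx, My = image.shape
    Nx, Ny = patch.shape
    P_zero = patch - patch.mean()              # Eq. (2): subtract patch mean
    P_norm = np.sqrt((P_zero ** 2).sum())
    out = np.zeros((Mx - Nx + 1, My - Ny + 1))  # valid positions, Eq. (4)
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            window = image[m:m + Nx, n:n + Ny]
            W_zero = window - window.mean()    # Eq. (3): local image mean
            denom = np.sqrt((W_zero ** 2).sum()) * P_norm
            out[m, n] = (W_zero * P_zero).sum() / denom if denom > 0 else 0.0
    return out  # values in [-1, 1]; the argmax gives the best match
```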
The values of NCC(m,n) are between -1 and 1 (1 for a perfect match, and 0 for "no correlation"). The technique presented here is used in many practical applications, and has demonstrated robustness to noise and intensity variations [9]. Unfortunately, it fails in the case of a scaling (zoom) of the desired target in the image I. The straightforward solution to this problem is to find the location of the maximum of the scaled normalized cross-correlation function (SNCC):

\mathrm{SNCC}(m,n,s) = \frac{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} \left(J - \bar{J}(m,n)\right)\left(P(i,j) - \bar{P}\right)}{\sqrt{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} \left(J - \bar{J}(m,n)\right)^2 \; \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} \left(P(i,j) - \bar{P}\right)^2}}    (5)

where s > 0 is the scaling factor, J = I(m + s(i-1), n + s(j-1)) (if the indices are not integers, they should be rounded, or the value of J should be interpolated from the closest neighbors), \bar{P} is defined in (2), and

\bar{J}(m,n) = \frac{1}{N_x N_y} \sum_{i=1}^{N_x}\sum_{j=1}^{N_y} I(m + s(i-1),\, n + s(j-1)).    (6)
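A direct NumPy sketch of Eqs. (5)-(6), in the same style as before, follows. It uses the rounding option mentioned above for non-integer coordinates (nearest-neighbor sampling); the function name and the clipping at the image borders are our additions.

```python
import numpy as np

def sncc(image: np.ndarray, patch: np.ndarray, m: int, n: int, s: float) -> float:
    """Scaled NCC of Eqs. (5)-(6) at position (m, n) and scale s > 0.
    Non-integer sample coordinates are rounded to the nearest pixel,
    one of the two options mentioned in the text."""
    Nx, Ny = patch.shape
    i = np.arange(Nx)[:, None]    # plays the role of (i - 1) in the 1-based notation
    j = np.arange(Ny)[None, :]
    rows = np.clip(np.rint(m + s * i).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.rint(n + s * j).astype(int), 0, image.shape[1] - 1)
    J = image[rows, cols]         # scaled window J of Eq. (5)
    J_zero = J - J.mean()         # Eq. (6): mean of the scaled window
    P_zero = patch - patch.mean() # Eq. (2)
    denom = np.sqrt((J_zero ** 2).sum() * (P_zero ** 2).sum())
    return float((J_zero * P_zero).sum() / denom) if denom > 0 else 0.0
```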
In other words, the template patch is compared to a scaled version of the image I, and the best match is found. Since the number of possible scalings is infinite, even an approximate solution obtained by gridding the scale axis can be very computationally demanding, and hence inappropriate for real-time applications. We propose to overcome this problem by assuming that the scale does not change abruptly, so that it can be modeled as a simple Markov process; e.g., for frame k:

s_k = s_{k-1} + v_k; \qquad v_k \sim N(0, \sigma^2); \qquad s_0 = 1    (7)
Remarks: 1) One should make sure that s_k remains positive in each frame. 2) If prior knowledge about changes in scale is available, it can be incorporated into the model by modifying the distribution of v_k. For example, if we suppose that most of the time the scale will not change, we should choose a truncated normal distribution mixed with a delta distribution at v_k = 0 (no scale change). This definition fits well into the particle filtering framework, and makes the problem tractable. Furthermore, we are interested only in non-negative values of SNCC; thus we use the half-wave rectified scaled cross-correlation, in which the negative values are replaced by zeros. In the next section, we combine the advantages of the SNCC and particle filtering techniques.
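To illustrate the scale model (7) and the two remarks, the sketch below samples the next scale. The values of sigma and p_hold are illustrative tuning choices we assume, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_scale(s_prev: float, sigma: float = 0.05, p_hold: float = 0.0) -> float:
    """Sample s_k from the random-walk model of Eq. (7).
    Remark 1: redraw until the scale stays positive (a truncated normal).
    Remark 2: with probability p_hold keep the scale unchanged, mimicking
    the delta component suggested for a mostly-constant scale."""
    if rng.random() < p_hold:
        return s_prev                      # delta component: no scale change
    s = s_prev + rng.normal(0.0, sigma)
    while s <= 0.0:                        # Remark 1: keep s_k > 0
        s = s_prev + rng.normal(0.0, sigma)
    return s
```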
3. PARTICLE FILTERING

3.1 Particle filtering

Our tracker is based on the CONDENSATION algorithm proposed by Isard and Blake [3]. In this section, we give a short overview of the algorithm and present its application to scale-invariant tracking; the algorithm uses the SNCC as the likelihood for determining the target's position. We refer the reader to reference [2] for a complete background on particle filtering.

In general, the goal of particle filtering is to estimate the sequence of hidden state parameters Xk based only on the observed data Zk. These estimates follow from the posterior distribution P(Xk|Z0, Z1, ..., Zk). It is assumed that the state and the observations are first-order Markov processes, and that each Zk depends only on Xk. The particle filter estimates the distribution P(Xk|Z0, Z1, ..., Zk) without requiring any linearity or Gaussianity assumptions on the model: it generates a set of N samples (particles) that approximate the filtering distribution. For the k-th frame, we denote the state vector by Xk = (x1, x2, ...). For example, the state can be the top-left corner coordinates of the desired target (x1 = x, x2 = y) in the frame, and its scaling (x3 = s); additionally, the state can include the velocity and acceleration of the target. The state estimate is recursively obtained as follows:
P(X_k \mid Z_0, Z_1, \ldots, Z_k) \propto P(X_{k-1} \mid Z_0, Z_1, \ldots, Z_{k-1}) \, \alpha_k \, P(X_k \mid X_{k-1})    (8)

where

\alpha_k = P(Z_k \mid X_k) = P^{(SNCC)}(Z_k \mid X_k) \propto \mathrm{SNCC}    (9)
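Combining Eq. (9) with the half-wave rectification introduced in Section 2, the likelihood of a single particle might be evaluated as in the sketch below. It relies on the sncc() sketch from Section 2, and the (x, y, s) state layout is one of the examples given above, not something mandated by the paper.

```python
def likelihood(image, patch, particle) -> float:
    """Particle likelihood per Eq. (9): proportional to the SNCC at the
    hypothesized position and scale, half-wave rectified as in Section 2.
    Assumes the particle stores (x, y, s) and reuses sncc() from above."""
    x, y, s = particle
    return max(sncc(image, patch, int(x), int(y), s), 0.0)
```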
The prediction step, which corresponds to the distribution P(X_k | X_{k-1}), is governed by the system's dynamical state equations. For example, if the state is assumed to evolve smoothly in time, and there is no additional information about the target dynamics, then the simplest model,

X_k = X_{k-1} + v_k, \qquad v_k \sim N(0, \Sigma),    (10)

is often appropriate. The mean of X_k over all the particles approximates the actual value of X_k.
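A one-line NumPy version of the random-walk model (10), applied to an (N, d) array of particles, is sketched below; the per-component noise levels are a tuning choice we assume, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(particles: np.ndarray, noise_std: np.ndarray) -> np.ndarray:
    """Random-walk prediction of Eq. (10): X_k = X_{k-1} + v_k, v_k ~ N(0, Sigma).
    `particles` is (N, d); `noise_std` holds per-component standard deviations
    (a diagonal Sigma), e.g. np.array([2.0, 2.0, 0.05]) for an (x, y, s) state."""
    return particles + rng.normal(0.0, noise_std, size=particles.shape)
```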
3.2 The algorithm

The state estimation is carried out by updating weighted particles according to (8). The following steps summarize the algorithm.
INITIALIZATION: The N particles X_0^{(n)}, n = 1, \ldots, N, are drawn from the uniform distribution, or selected by the operator. For each video frame (the k-th frame), we perform the following steps:

STEP 1: Using the particles from the previous frame, predict the new state by sampling from:

X_k^{(n)} \sim P(X_k \mid X_{k-1}^{(n)}).    (11)

STEP 2: Measure and weight the new position in terms of the measured features Z_k:

\alpha_k^{(n)} = P^{(SNCC)}(Z_k \mid X_k^{(n)}), \qquad w_k^{(n)} = w_{k-1}^{(n)} \alpha_k^{(n)}.    (12)

STEP 3: Resample the particles X_k^{(n)}, n = 1, \ldots, N, according to the weights w_k^{(n)}.

STEP 4: Compute the state estimate from:

\hat{X}_k = \frac{1}{N} \sum_{n=1}^{N} X_k^{(n)},    (13)

and repeat Steps 1-4 for the next video frame. The result of this algorithm is the estimated state \hat{X}_k, which includes the information about the position and scaling of the tracked target in every video frame.
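Putting the pieces together, one frame of the tracking loop (Steps 1-4) might look as follows. This sketch reuses the predict() and likelihood() sketches above; since resampling resets the weights to uniform, the w_{k-1} factor of Eq. (12) becomes a constant and is omitted, which is a standard simplification of sequential importance resampling rather than something stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def track_frame(image, patch, particles, noise_std):
    """One CONDENSATION iteration over an (N, 3) array of (x, y, s) particles."""
    particles = predict(particles, noise_std)             # Step 1, Eq. (11)
    weights = np.array([likelihood(image, patch, p)       # Step 2, Eq. (12)
                        for p in particles])
    total = weights.sum()
    weights = weights / total if total > 0 else \
        np.full(len(particles), 1.0 / len(particles))
    idx = rng.choice(len(particles), size=len(particles), # Step 3: resample
                     p=weights)
    particles = particles[idx]
    x_hat = particles.mean(axis=0)                        # Step 4, Eq. (13)
    return particles, x_hat
```

With 60 particles, as reported in Section 4, one would call track_frame once per frame and read the position and scale estimate from x_hat.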
4. EXPERIMENTAL RESULTS AND DISCUSSION

We tested the proposed algorithm in various situations, including highly cluttered exterior scenes with shadows and partial occlusions, with a high rate of success. A single template was used for each video, we chose the simplest motion model (10), and we selected the target manually in the first video frame. We tracked the targets with 60 particles. The video resolution is 240×320 pixels, and the frame rate is 25 frames per second.
Figure 1: Maneuvering vehicle sequence with the tracking results.
4.1 Sequence 1: Maneuvering Vehicle

In the first sequence, we want to track a vehicle. Despite the significant zoom and a moving camera, our tracker manages to follow the target (see Figure 1). This video represents a challenging scenario for tracking in outdoor conditions.
4.2 Sequence 2: Boat

In the second sequence, a boat is tracked. The contrast of the boat with the background is so low that following the boat is hard even for a human observer (see Figure 2). Additionally, the scene is very noisy (water glare and the plume behind the boat). The tracker manages to overcome these problems. Although in frame 798 the tracker has a wrong estimate of the scale (because of bad measurements), the algorithm re-establishes the correct estimate after a few frames.
Figure 2: The boat sequence with the tracking results.
4.3 Sequence 3: A Crowded Party

In this sequence, we want to track a single person in a large crowd. The results of tracking are shown in Figure 3. In frame 83, the person is tracked despite variations in shape and a partial occlusion. In frames 115-125, a full occlusion occurs; at frame 123, our tracker temporarily loses the track and the scaling is wrong. Nevertheless, the tracker finds the right position after the person reappears. We note that for all sequences we used a simple target dynamics model and a constant template, and we assumed that no additional information about the target is given besides the template. With learned higher-order models and a smoothly changing adaptive template, we expect even better results with the same algorithm.
Figure 3: Crowded party sequence with the tracking results.
5. CONCLUSION

In this paper, we presented an algorithm for tracking scaled and translated targets in video sequences without the need for adaptation and learning mechanisms. Using a rather low-dimensional state space, we achieve robust tracking results on many complicated and cluttered real-world video sequences, including sequences with a moving camera. The combination of the particle filter with a correlation tracker makes it possible to obtain smooth target trajectories. The algorithm can cope with translations, and with moderate deformations of the tracked target when the deformations affect only a small portion of the pixels in the template; it is also appropriate for small targets with low contrast. The algorithm is time efficient, and should be suitable for real-time applications. The disadvantage of our approach is that it is not capable of tracking targets subjected to large rotations; the problems of partial and full occlusions should be addressed as well. The next step in our research will be to add rotation states to the particle filter definition, and to choose good dynamic models for rotation, in order to achieve rotation-invariant tracking. In addition, other types of correlation measures should be tested. Finally, the algorithm should be extended to multiple target tracking.
REFERENCES

[1] Yilmaz, A., Javed, O., and Shah, M., "Object Tracking: A Survey," ACM Computing Surveys, Vol. 38(4), (2006).
[2] Doucet, A., de Freitas, N., and Gordon, N., Sequential Monte Carlo Methods in Practice, Springer, (2001).
[3] Isard, M., and Blake, A., "CONDENSATION—Conditional Density Propagation for Visual Tracking," International Journal of Computer Vision, Vol. 29(1), pp. 5-28, (1998).
[4] Cahn von Seelen, U.M., and Bajcsy, R., "Adaptive Correlation Tracking of Targets with Changing Scale," Reconnaissance, Surveillance, and Target Acquisition for the Unmanned Ground Vehicle, Morgan Kaufmann Publishers, San Francisco, CA, pp. 313-322, (1997).
[5] Zhao, F., Huang, Q., and Gao, W., "Image Matching by Normalized Cross-Correlation," ICASSP Proceedings, (2006).
[6] Ooi, J., and Rao, K., "New Insights Into Correlation-Based Template Matching," Proceedings of SPIE, Vol. 1468, pp. 740-751, (1991).
[7] Mei, X., Zhou, S.K., and Porikli, F., "Probabilistic Visual Tracking via Robust Template Matching and Incremental Subspace Update," IEEE International Conference on Multimedia and Expo, pp. 1818-1821, (2007).
[8] Zhou, S., Chellappa, R., and Moghaddam, B., "Appearance Tracking Using Adaptive Models in a Particle Filter," Proc. of Asian Conf. on Computer Vision, (2004).
[9] Lewis, J.P., "Fast Normalized Cross-Correlation," Vision Interface, Quebec, Canada, pp. 120-123, (1995).