Occlusion robust adaptive template tracking

Hieu T. Nguyen, Marcel Worring, and Rein van den Boomgaard
Intelligent Sensory Information Systems
University of Amsterdam, Faculty of Science
Kruislaan 403, NL-1098 SJ, Amsterdam, The Netherlands
Email: {tat, worring, rein}@science.uva.nl

Abstract

We propose a new method for tracking rigid objects in image sequences using template matching. A Kalman filter is used to make the template adapt to changes in object orientation or illumination. This approach is novel, since the Kalman filter has so far been used in tracking mainly for smoothing the object trajectory. The performance of the Kalman filter is further improved by employing a robust and adaptive filtering algorithm. Special attention is paid to occlusion handling.

1. Introduction

This paper is concerned with tracking rigid objects in video using template matching. The region occupied by a rigid moving object in any image of a sequence can be obtained from an object template via a transformation of the coordinates. The parameters of this transformation characterize the current position of the object. They are obtained by matching the active object template with a region in the current frame. Template matching is inaccurate in the following situations:

1. the template does not represent the current object appearance, due to changes of object orientation or illumination conditions;
2. the object is only partly visible, or not visible at all, due to occlusion.

A tracking algorithm should be able to handle both situations. In the first, the tracker should update the template to accommodate the changed object appearance. In the second, the tracker should detect the occlusion and recapture the object when the occlusion ends.

A number of tracking methods based on template matching have been developed in computer vision. Ignoring methods using domain-specific knowledge, there are three approaches to acquiring the template: from the first frame in the sequence [5, 8], from the preceding frame [7, 10, 2], and from the entire sequence up to the current point in time [11]. The template is a memory of the object. Trackers of the first approach always "remember" the object appearance in the first frame; tracking therefore becomes unstable when the object is tracked for a long time. Trackers of the second approach remember only the most recent object appearance, forgetting all images before it. In this case, the tracker can easily adopt a wrong template due to partial occlusions or due to the accumulation of errors from previous tracking steps. The third approach is the most appropriate, as all images in the past can contribute to the construction of the template. This kind of template is used in [11], where the new template is computed as a weighted average of the previous template and the current image.

To achieve stability and robustness to occlusions, several tracking methods use the Kalman filter [7, 3, 10]. In all of these methods, the Kalman filter keeps track of the object's position and velocity. The object position is predicted for the current frame. To verify the prediction, an independent image processing technique is then employed to detect the object, usually based on edge detection [3] or on minimizing an intensity matching error, defined as the sum of squared differences (SSD) between the image and a template [7, 10]. Finally, the Kalman filter yields a trade-off between the predicted position and the detected position. In such an approach, the benefit of using the Kalman filter is merely smoothing the object trajectory. The filter has very little effect on improving the object detection part, which is crucial for achieving robustness of the tracker.

This paper aims to achieve robust template matching by developing a new method for constructing the template. The template should be robust against occlusions and, at the same time, able to adapt to changes in object orientation or illumination conditions.

The paper is structured as follows. Section 2 is the main section; it describes the tracking of the template intensities using the Kalman filter, and also develops methods for template matching and occlusion handling. The tracking algorithm is given in section 3. Section 4 presents tracking results and a performance comparison of the proposed method with previous approaches.

2. Tracking template intensity using a robust adaptive Kalman filter

2.1. Template matching based tracking

Let us assume that the image region Ω(t), occupied by the tracked object at any time moment t, is obtained from a template region R via a coordinate transformation ϕ : R → Ω(t). This implies that every point x = (x, y) in Ω(t) is obtained from a corresponding point p = (p_x, p_y) in R as follows (see Figure 1):

x = ϕ(p; a(t))    (1)

where a(t) denotes the parameter vector of the transformation, which is specific for Ω(t). When the object is rigid, ϕ can be given in closed form with a reasonable number of parameters; examples are translational, affine, or quadratic transformations [1]. The deformation of Ω(t) between two consecutive moments in time can usually be modeled by the same type of transformation, but with a different parameter vector v(t) characterizing the object motion. For a complete representation of the object, the grey values inside Ω(t) should be taken into account as well. We therefore define, for each point p in R, a template intensity g(p, t), which equals the intensity at the corresponding point x given in eq. (1). In practice, we obtain only an estimate ĝ(p, t) of this intensity.
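To make the mapping concrete, the following is a minimal NumPy sketch, not taken from the paper, of how eq. (1) can be evaluated for the translation-plus-scaling transformation introduced later in eq. (4). The function names, the nearest-neighbour sampling, and the clamping at the image border are all illustrative choices.

```python
import numpy as np

def warp_points(P, a):
    """Eq. (1) with the translation-plus-scaling model of eq. (4):
    x = (1 + a3) * p + (a1, a2).  P is an (N, 2) array of template
    coordinates p = (px, py); a = (a1, a2, a3)."""
    a1, a2, a3 = a
    return (1.0 + a3) * P + np.array([a1, a2])

def sample_template(I, P, a):
    """Estimate of the template intensities g(p, t): read the image I at
    the warped positions, with nearest-neighbour rounding for simplicity."""
    X = np.rint(warp_points(P, a)).astype(int)
    # Clamp to the image domain so points warped outside the frame
    # do not raise an indexing error.
    x = np.clip(X[:, 0], 0, I.shape[1] - 1)
    y = np.clip(X[:, 1], 0, I.shape[0] - 1)
    return I[y, x]
```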

[Figure: template region R with a point p; transformations with parameters a(t) and a(t+1) map p to points x in the object regions Ω(t) and Ω(t+1); v(t) denotes the motion between the two regions.]

Figure 1. Illustration of the template-to-object transformation.

Let I(x, t) denote the intensity of pixel x at time t. The position of the object in the current frame is determined by a(t). This vector is estimated by matching the template ĝ(p, t′), obtained at some earlier point in time t′ < t, with the current image I(x, t). Usually the previous template is used, i.e. t′ = t − 1; during an occlusion, t′ is the moment where the occlusion was detected.

In order to make the matching robust against partial occlusions, a robust error function ρ is employed in the definition of the matching error. Thus, a(t) is estimated as:

a(t) = arg min_a Σ_{p∈R} ρ( [I(ϕ(p; a), t) − ĝ(p, t′)] / r̄ )    (2)

where r̄ is an estimate of the standard deviation of the residual I(ϕ(p; a), t) − ĝ(p, t′). We use Huber's function [6], although other functions in [12] can be used as well:

ρ(e) = e²/2            if |e| < c
ρ(e) = c(|e| − c/2)    otherwise    (3)

where c = 1.345. For reliability and efficiency of the parameter estimation, we consider only the two kinds of motion encountered most frequently in video: translation and scaling. The parameter vector a(t) then comprises three components a₁(t), a₂(t), a₃(t), and the transformation from template to object in eq. (1) can be rewritten as:

(x, y)ᵀ = (1 + a₃(t)) (p_x, p_y)ᵀ + (a₁(t), a₂(t))ᵀ    (4)

where a₁(t), a₂(t) specify the translations along the x- and y-axes respectively, and a₃(t) specifies the scaling magnitude. In conclusion, template matching is described by eq. (2), once methods for computing ĝ(p, t) and r̄ are given.
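A sketch of how eqs. (2)-(4) fit together, again under illustrative assumptions: `huber` implements eq. (3), and the minimization is the exhaustive search over a quantized parameter grid that section 3 describes. `sample_template` is the helper sketched above; the grid ranges are arbitrary examples.

```python
import numpy as np

def huber(e, c=1.345):
    """Huber's robust error function rho(e), eq. (3)."""
    e = np.abs(e)
    return np.where(e < c, 0.5 * e**2, c * (e - 0.5 * c))

def matching_error(I, P, g_hat, a, r_bar):
    """Matching cost of eq. (2) for one candidate parameter vector a."""
    residual = sample_template(I, P, a) - g_hat
    return huber(residual / r_bar).sum()

def match_template(I, P, g_hat, r_bar, grid):
    """Exhaustive search over a quantized space of a (section 3);
    returns (minimum cost, best parameter vector)."""
    return min(((matching_error(I, P, g_hat, a, r_bar), a) for a in grid),
               key=lambda pair: pair[0])

# Example grid: translations of +/-8 pixels in steps of 2, three scales.
grid = [(a1, a2, a3)
        for a1 in range(-8, 9, 2)
        for a2 in range(-8, 9, 2)
        for a3 in (-0.05, 0.0, 0.05)]
```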

2.2. Kalman filter for tracking intensity

We propose to use the Kalman filter for estimating g(p, t). This is a novel approach, as traditional tracking methods [7, 3, 10] use the Kalman filter to track a(t) and v(t) (see Figure 2). The advantage of the new approach is that the Kalman filter provides an adaptive and optimal template, allowing the matching procedure to identify the object in the scene accurately.

One could use a Kalman filter to track all three vectors a(t), v(t) and g(p, t) together. However, since the object intensity changes independently of the object position and motion, we can employ the simplified approach that tracks g(p, t) only; the position and motion are obtained as a result of the template matching procedure. A further simplification assumes that the intensities g(p, t) for different pixels p are independent, so that they can be tracked by individual Kalman filters.

Let us describe the prediction model and the observation model for the filters. The model for the state prediction is:

g(p, t) = g(p, t − 1) + ε_w(p, t − 1)    (5)

where ε_w(p, t − 1) denotes the state noise process. This noise models changes in object intensities due to factors such as a change of the illumination conditions or the object orientation. As is common in Kalman filtering, ε_w(p, t − 1) is assumed to be Gaussian and, furthermore, to have the same variance σ_w² for all p.

[Figure: a) object detection yields a measured position (x, y) that feeds position prediction and position updating, producing the predicted and updated positions; b) object detection yields a measured intensity that feeds intensity prediction and intensity updating, producing the predicted and updated intensities.]

Figure 2. a) Illustration of the traditional position tracking; b) illustration of the proposed intensity tracking.

To obtain measurements for the filters, the previous template ĝ(p, t − 1) is matched with the current frame. The result of this matching is the parameter vector a(t). Then, I(ϕ(p; a(t)), t) is used as the measurement for g(p, t):

I(ϕ(p; a(t)), t) = g(p, t) + ε_I(p, t)    (6)

where ε_I(p, t) models the noise in the image signal. Again, we assume ε_I(p, t) to have an identical Gaussian distribution for all p, with variance σ_I². This implies that all filters share the same parameters. This assumption is usually valid due to the similarity in motion between points on the object. Although the previous template is used to obtain the measurements I(ϕ(p; a(t)), t), the assumption required for the Kalman filter of independence between the measurement noise and the state noise is not violated: the previous template ĝ(p, t − 1) is used only to help the matching procedure find the true value of the measurements.

We now derive the equations for the Kalman filters constructed from eqs. (5) and (6). Let ĝ(p, t⁻) be the prediction of g(p, t) at time t, and ĝ(p, t) the estimate of g(p, t) after the filter takes the current measurement I(ϕ(p; a(t)), t) into account. Let σ̂_g²(t⁻) and σ̂_g²(t) be the variances of ĝ(p, t⁻) and ĝ(p, t) respectively. Let r(p, t) be the residual, i.e. the difference between prediction and measurement at pixel p:

r(p, t) = I(ϕ(p; a(t)), t) − ĝ(p, t⁻)    (7)

The template intensities are updated as follows:

ĝ(p, t⁻) = ĝ(p, t − 1)    (8)

σ̂_g²(t⁻) = σ̂_g²(t − 1) + σ_w²    (9)

ĝ(p, t) = ĝ(p, t⁻) + [σ̂_g²(t⁻) / (σ̂_g²(t⁻) + σ_I²)] r(p, t)
        = [σ_I² ĝ(p, t⁻) + σ̂_g²(t⁻) I(ϕ(p; a(t)), t)] / (σ̂_g²(t⁻) + σ_I²)    (10)

σ̂_g²(t) = σ̂_g²(t⁻) σ_I² / (σ̂_g²(t⁻) + σ_I²)    (11)
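Because all pixels share the same variances, the per-pixel filters of eqs. (8)-(11) reduce to two scalar variance recursions plus a vectorized intensity update. A minimal sketch, with the same naming assumptions as before:

```python
import numpy as np

def kalman_predict(g_hat, var_g, var_w):
    """Prediction step, eqs. (8)-(9): the intensity estimate carries over
    unchanged while the state variance grows by the state-noise variance."""
    return g_hat, var_g + var_w

def kalman_update(g_pred, var_pred, measurement, var_I):
    """Update step, eqs. (10)-(11), written with the Kalman gain
    K = var_pred / (var_pred + var_I)."""
    K = var_pred / (var_pred + var_I)
    residual = measurement - g_pred                    # r(p, t) of eq. (7)
    g_new = g_pred + K * residual                      # eq. (10)
    var_new = var_pred * var_I / (var_pred + var_I)    # eq. (11)
    return g_new, var_new, residual
```

Here `measurement` is the vector of intensities I(ϕ(p; a(t)), t) sampled at the matched position.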

The Kalman filter described requires the following parameters to be known: the initial state variance σ_g²(0), the state noise variance σ_w², and the measurement noise variance σ_I². Several existing template tracking methods can be obtained from the above filter by setting these parameters to special values. Constant-template trackers correspond to the setting σ_g²(0) = 0 and σ_w² = 0. On the other hand, setting σ_w² = ∞ yields a tracker based on matching two consecutive frames. As noted, both these approaches have drawbacks due to the inappropriate parameters. The Kalman filter with fixed σ_w² and σ_I² resembles, at convergence, the averaging of image and template with fixed coefficients; this simple averaging is sensitive to outliers. Furthermore, in dynamic scenes, σ_w² and σ_I² vary over time. The tracking algorithm should take these factors into account.

2.3. Robust adaptive filtering

When the noise in eqs. (5) and (6) is Gaussian, eq. (10) provides the optimal estimates of the template intensities g(p, t). In practice, due to occlusions or imperfections of the motion model used, the measurement error in eq. (6) may deviate too much from the Gaussian distribution. To keep the assumptions of the Kalman filter satisfied, these outliers should be removed from the filter state estimation. We therefore use a robust estimation scheme [4]. When the residual exceeds a certain multiple of its standard deviation r̄, the measurement is rejected and the state g(p, t) is not updated. On the other hand, to prevent the possibility that g(p, t) is never updated again, we do not allow the algorithm to declare a pixel an outlier for too long. For each pixel p, a counter n_o(p) is introduced that counts the number of successive frames in which p is declared an outlier. When n_o(p) exceeds a maximally allowed value n_o^max, the template intensity g(p, t) is forced to accept the value I(ϕ(p; a(t)), t). Thus, eq. (10) is replaced by:

ĝ(p, t) = ĝ(p, t⁻) + [σ̂_g²(t⁻) / (σ̂_g²(t⁻) + σ_I²)] r(p, t)   if |r(p, t)| ≤ α r̄
ĝ(p, t) = I(ϕ(p; a(t)), t)                                      if n_o(p) ≥ n_o^max
ĝ(p, t) = ĝ(p, t⁻)                                              otherwise    (12)

where α is a predefined coefficient.
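A sketch of eq. (12) layered on top of the update above. The reset policy for the counter n_o(p) after a forced acceptance is not spelled out in the text; resetting it to zero is an assumption here, as are the values of α and n_o^max.

```python
import numpy as np

def robust_update(g_pred, var_pred, measurement, var_I, r_bar, n_o,
                  alpha=3.0, n_o_max=10):
    """Robust template update, eq. (12).  n_o is the per-pixel counter of
    successive frames in which the pixel was declared an outlier."""
    K = var_pred / (var_pred + var_I)
    residual = measurement - g_pred
    inlier = np.abs(residual) <= alpha * r_bar
    forced = n_o >= n_o_max       # outlier for too long: accept the image value

    g_new = np.where(inlier, g_pred + K * residual,   # normal Kalman update
            np.where(forced, measurement,             # forced acceptance
                     g_pred))                         # reject: keep the prediction
    # Assumption: the counter resets for inliers and forced pixels,
    # and increments for rejected pixels.
    n_o = np.where(inlier | forced, 0, n_o + 1)
    return g_new, n_o
```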

From now on, whenever the updating of the template is mentioned, it refers to eq. (12). Eq. (12) makes the template robust against short-time occlusions.

We now consider the proper parameter settings. Among the filter parameters mentioned in section 2.2, the state noise σ_w² and the measurement noise σ_I² are the most critical. In practice they are seldom known, and they are not even constant in time. One would therefore like to estimate these parameters simultaneously with the states; such filters are called adaptive or self-tuning [9, ch. 10]. The input data for the estimation of the noise parameters is the residual sequence. We use the covariance matching method, which compares the estimated variance of the residuals with their theoretical variance. Let R′ be the subset of R without outliers, and N′ the number of pixels in R′. The variance of the residuals is estimated by averaging over R′ and over the last K frames:

r̄² = (1/K) Σ_{i=t−K+1}^{t} r²(i)    (13)

where

r²(t) = β (1/N′) Σ_{p∈R′} r²(p, t)    (14)

Here, β is the coefficient needed to correct the variance for the truncation of the residuals. To make the above estimate of the variance consistent with the case where the residuals are true Gaussian noise, we choose

β = [ (1/√(2π)) ∫_{−α}^{α} t² e^{−t²/2} dt ]⁻¹

The value of r̄², given by eq. (13), is used in eqs. (2) and (12). By comparing r̄² with the theoretical variance of r(p, t), which is σ̂_g²(t⁻) + σ_I², one of the two noise parameters can be readjusted if the other one is known beforehand. Tuning one parameter is usually sufficient for the filter to adapt to changes of object orientation or illumination; furthermore, simultaneous estimation of both noise parameters is in general not reliable [9]. Let us assume the measurement noise σ_I² is known; the state noise is then estimated as:

σ_w² = r̄² − σ_I² − σ_g²(t − 1)    (15)

This re-estimation of σ_w² is especially useful when the object orientation or the illumination conditions change. In these cases, the object intensities change faster, leading to an increase of r̄², and hence of σ_w² as well. The higher value of σ_w² weights the measurement in eq. (12) more heavily, and therefore keeps the template up to date with the object appearance.

It remains to specify σ_I² and the initial values for σ_w² and σ_g². They are set such that initially the states and the measurements have equal weights:

σ_I² = 0.5 r²(1),   σ_w²(0) = 0,   σ_g²(0) = 0.5 r²(1)    (16)

Using eqs. (15) and (16), all noise parameters are set automatically.
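A sketch of the covariance-matching recipe of eqs. (13)-(16). The integral in β is evaluated numerically here with scipy; clipping the re-estimated σ_w² at zero is an assumption the text leaves implicit (a variance cannot be negative), and the value of K is an illustrative choice.

```python
import numpy as np
from collections import deque
from scipy.integrate import quad

def beta(alpha):
    """Truncation-correction coefficient for eq. (14):
    beta = [ (1/sqrt(2*pi)) * integral_{-alpha}^{alpha} t^2 e^{-t^2/2} dt ]^-1."""
    integral, _ = quad(lambda t: t**2 * np.exp(-t**2 / 2), -alpha, alpha)
    return np.sqrt(2 * np.pi) / integral

def frame_residual_variance(inlier_residuals, alpha):
    """r^2(t) of eq. (14): corrected mean squared residual over R'."""
    return beta(alpha) * np.mean(inlier_residuals**2)

def update_state_noise(r_bar_sq, var_I, var_g_prev):
    """Eq. (15), clipped at zero (assumption)."""
    return max(r_bar_sq - var_I - var_g_prev, 0.0)

# Eq. (13) as a running mean over the last K frames:
history = deque(maxlen=5)    # K = 5, an illustrative choice
history.append(frame_residual_variance(np.array([0.1, -0.2, 0.05]), alpha=3.0))
r_bar_sq = sum(history) / len(history)
```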

2.4. Occlusion handling

The rejection of outliers, described in eq. (12), makes the template robust against short-time occlusions. When the object is occluded (even completely), the filter does not accept the new appearance outright, remembering the old appearance of the object for a while. An occlusion can then be declared when the fraction of outliers exceeds a predefined percentage γ:

(N − N′) / N > γ    (17)

where N is the number of pixels in R and, as before, N′ is the number of pixels in R′. During the occlusion, the template and the parameters are not updated.

An important point is how to detect the end of the occlusion. One could declare the end of the occlusion when the percentage of outliers drops below a given threshold. However, the object appearance often changes considerably during the occlusion, due to a change of object orientation, which makes this result sensitive to the choice of the threshold. The end of the occlusion can nevertheless be detected more reliably if we know that the maximal duration of an occlusion is limited to L frames. Let t_o be the time the occlusion is detected. The template is then matched with the frames from t_o to t_o + L, and the end of the occlusion is the frame yielding the minimum cost in eq. (2). In general it is not necessary to visit all L frames. To reduce computation, the search for the end of the occlusion is done in two stages, as sketched below: in the first stage, the frames are visited with a large step; the second search is then performed only around the frame with the minimum cost found in the first stage.
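The detection test of eq. (17) and the two-stage search can be sketched as follows, reusing the `match_template` helper from the section 2.1 sketch; the coarse step size is an illustrative choice.

```python
def occlusion_declared(n_outliers, n_pixels, gamma):
    """Eq. (17): declare an occlusion when the outlier fraction exceeds gamma."""
    return n_outliers / n_pixels > gamma

def find_occlusion_end(frames, P, g_hat, r_bar, grid, t_o, L, coarse_step=5):
    """Two-stage search over frames t_o .. t_o + L for the frame with the
    minimum matching cost of eq. (2).  frames[i] is the image at time i."""
    last = min(t_o + L, len(frames) - 1)
    cost = lambda i: match_template(frames[i], P, g_hat, r_bar, grid)[0]
    # Stage 1: visit the frames with a large step.
    best = min(range(t_o, last + 1, coarse_step), key=cost)
    # Stage 2: refine only around the coarse minimum.
    lo = max(t_o, best - coarse_step + 1)
    hi = min(last, best + coarse_step - 1)
    return min(range(lo, hi + 1), key=cost)
```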

3. Algorithm

The full algorithm is presented in Algorithm 1 below. In the code, template_matching(g, I(t), a(t), err, fr) denotes the routine matching the template g with the image I at time t; it returns the parameters a(t) of the best matching region, the matching error err, and the percentage of outliers fr. We minimize eq. (2) by an exhaustive search in a quantized space of a. When a₃ is too large, the template needs to be resized to avoid warping errors. Note that when the object motion is known to be pure translation without scaling, the use of the translation-only model yields much more robust results than when scaling is also taken into account.

Algorithm 1.
1. Initialization();            /* set initial values for σ_g, σ_I, σ_w, g */
   t := 0;
2. repeat {
       σ_g² := σ_g² + σ_w²;                         /* eq. (9) */
       template_matching(g, I(t), a(t), err, fr);
       if fr > γ then {                             /* occlusion handling */
           /* the listing is truncated in the source from this point on;
              the remainder is a reconstruction following section 2.4 */
           for (i := t + 1; i <= t + L; i := i + 1)
               template_matching(g, I(i), a(i), err(i), fr(i));
           t := the frame i with minimum err(i);    /* end of the occlusion */
       }
       update g and σ_g² by eqs. (12) and (11);
       update r̄² and σ_w² by eqs. (13)-(16);
       t := t + 1;
   } until the end of the sequence;

4. Results

We have tested the algorithm on 8 movie clips containing 1063 frames and 8 complete occlusions in total. The objects were human faces and bodies. Templates were initialized
