Real-Time Feature Tracking and Outlier Rejection with Changes in Illumination

Hailin Jin†    Paolo Favaro†    Stefano Soatto†‡

† Department of Electrical Engineering, Washington University, St. Louis, MO 63130
‡ Department of Computer Science, UCLA, Los Angeles, CA 90095

Technical Report UCLA CSD-200035
December 4, 2000

Keywords: Tracking, deformable templates, image deformations, structure from motion

Abstract

We develop an efficient algorithm to track point features supported by image patches undergoing affine deformations and changes in illumination. The algorithm is based on an explicit geometric-photometric model that is used both to track features and to detect outliers in a hypothesis testing framework. The algorithm runs in real time on a personal computer and will be made available to the public.

1 Introduction

Tracking the deformation of image regions has proved to be an essential component of vision-based systems in a variety of applications, ranging from control systems [9, 13, 5] to human-computer interaction [3, 4, 6], medical imaging [2, 14] and mosaicing [12, 15], just to mention a few. In visual tracking the main interest is in establishing region correspondences between images obtained from a moving camera. Once correspondences have been established, the temporal evolution of the trajectories can be used, for instance, as a combined measurement of the motion and structure of the scene. To this and other purposes it is important for features to be tracked as long and as reliably as possible: the longer the baseline and the smaller the error, the more accurate the structure reconstruction [1].

A popular technique for visual tracking on unstructured images is to minimize the sum of squared differences of image intensities, usually referred to as SSD matching. Many works have been based on this principle, starting with the pioneering work of Lucas and Kanade [11], which establishes matching between frames adjacent in time. As Shi and Tomasi [16] note, inter-frame matching is not adequate for applications where the correspondence of a finite-size image patch over a long time span is needed. Indeed, inter-frame tracking is prone to cumulative error when trajectories are integrated over time. On the other hand, when matching over long time spans, the geometric deformations become significant and more complex models are necessary. Shi and Tomasi show that the affine transformation is a good tradeoff among model complexity, speed and robustness. However, SSD or correlation-based tracking algorithms usually assume that changes in scene appearance are due to geometric deformations only. Thus, when changes in illumination are relevant, these approaches perform poorly. Hager and Belhumeur [8, 7] describe an SSD-based tracker that compensates for illumination changes. Their approach is divided into three phases: first a target region is defined, then a basis of reference templates for the illumination change is acquired and stored, and finally tracking is performed based on the prior knowledge of the templates.

As computers become more and more powerful, a growing range of applications can be implemented in real time, with all the benefits that follow. For instance, structure from motion algorithms have been implemented at rates as high as 30 frames per second [10]. Such systems use feature tracking as their measurement, and hence speed is of paramount importance. Furthermore, visual trackers that do not rely on prior knowledge of the structure or motion of the scene, and that are insensitive to environmental changes such as illumination, open up a wide variety of applications. In this paper we propose a system that performs real-time visual tracking in the presence of brightness and contrast changes. No off-line computation is required, and no prior assumption is made on either the structure or the illumination of the scene. Finally, we provide a principled method to reject outliers that do not fit the model, for instance at occluding boundaries or T-junctions.

2 An image deformation model

In this section we discuss models of the deformation undergone by the image irradiance function as a consequence of relative motion between a rigid scene and a camera, and of illumination changes. Our task is to design a model that trades off generality with computational efficiency, leading to a real-time implementation.

2.1 Geometry

Let X be a point on a surface S in the scene, and let x = π(X), where π is a central projection, for instance π(X) = X/‖X‖. We will not make distinctions between the homogeneous coordinates x = [x y 1]^T and the 2-D coordinates [x y]^T. I(x, t) denotes the intensity value at the location x of an image acquired at time t. Away from discontinuities in S, generated for example by occluding boundaries, the image deformations of a surface S can be described as image motion:

    I(x, 0) = I(g(x), t)    ∀ x ∈ W    (1)

where W is the region of interest in the image, and g(·) is, in general, a nonlinear time-varying function which depends on an infinite number of parameters (the surface S):

    g(x) = π(R_t X + T_t)    with X such that π(X) = x, X ∈ S    (2)

and (R_t, T_t) ∈ SE(3) is a rigid change of coordinates between the inertial reference frame, which we choose to coincide with the camera reference frame at time 0, and the moving frame (at time t).

Clearly, having real-time operation in mind, we need to restrict the class of deformations to a finite-dimensional one that can be computed easily. The most popular feature tracking algorithms rely on a purely translational model:

    g(x) = x + d    ∀ x ∈ W_T    (3)

where W_T is a window of a certain size and d is the 2-D displacement of x on the image plane. This model results in very fast algorithms [11], although for it to be valid one has to restrict the size of the window, thereby losing the averaging power of the model. Typical window sizes range from 3 × 3 to 7 × 7, beyond which the model is easily violated after a few frames. Therefore, a purely translational model is only valid locally in space and time. A slightly richer model can be obtained by considering Euclidean transformations of the plane, i.e. g(x) = e^θ̂ x + d, where e^θ̂ describes a rotation of the plane. However, a slightly richer model still, where the linear term is not restricted to be a rotation, turns out to be more appropriate:

    g(x) = A x + d    ∀ x ∈ W_A    (4)

where A ∈ GL(2) is a general linear transformation of the plane coordinates, d is the 2-D displacement, and W_A is the window of interest. This affine model has been proposed and tested by Shi and Tomasi [16].

Because of image noise, equation (1) in general does not hold exactly. If the motion function g(·) can be approximated by a finite set of parameters α, then the problem reduces to determining the α̂ such that:

    α̂ = arg min_α ∫_W ‖I(x, 0) − I(g(x), t)‖ dx    (5)

for some choice of norm ‖·‖, where α = {d} and α = {A, d} in the translational and affine models respectively. Notice that the residual to be minimized is computed in the space of irradiance functions.
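For concreteness, the following minimal sketch evaluates the weighted SSD objective (5) under the affine model (4). It is our own illustration, not the report's implementation; the helper name ssd_residual and the use of SciPy bilinear resampling are assumptions.

    # Sketch: evaluate the SSD objective (5) for a candidate affine warp (4).
    import numpy as np
    from scipy.ndimage import map_coordinates

    def ssd_residual(I0, It, A, d, xs, ys, w=None):
        """Weighted SSD between frame 0 and frame t warped by g(x) = A x + d.

        I0, It: 2-D grayscale images; xs, ys: 1-D arrays of window coordinates.
        """
        pts = np.stack([xs, ys])                 # 2 x N window points
        g = A @ pts + d.reshape(2, 1)            # apply the affine warp
        # map_coordinates indexes (row, col) = (y, x); order=1 is bilinear.
        It_g = map_coordinates(It, [g[1], g[0]], order=1)
        I0_x = map_coordinates(I0, [ys, xs], order=1)
        r2 = (I0_x - It_g) ** 2
        return np.sum(r2 if w is None else w * r2)

For the translational model (3), A is fixed to the identity and only d is varied.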

2.2 Illumination

All the models outlined above are intrinsically sensitive to changes in the illumination of the scene. Indeed, in real environments brightness and contrast changes are unavoidable phenomena that cannot always be controlled. It follows that modeling light changes is necessary for visual trackers to operate in general situations. Consider a light source L in 3-D space and suppose we are observing a smooth surface S. As figure 1 explains, the intensity value of each point on the image plane depends on the portion of incoming light from the source L that is reflected by the surface S, as described by the Bidirectional Reflectance Distribution Function (BRDF). When the light source L is far enough from the surface S, the light direction is approximately constant. In a similar manner, if the observer is far enough from the surface S, the viewing angle for each point on the surface can be approximated with a constant. Finally, we assume that for a point p on the smooth surface S it is possible to consider a neighborhood U around p such that the normal vectors to S do not change within U, i.e. in U the surface is a plane. Let E be the albedo on S; then the image generated by U can be written as:

    I(x, 0) = λ_E E + δ_E    ∀ x ∈ W_U    (6)


where W_U = π(U) and λ_E and δ_E are constant.

[Figure 1: Image formation process when illumination is taken into account: the intensity value at p on the image plane depends in general on the position of the light source L, the observer position, the normal N to the tangent plane at P, and the albedo of the surface S.]

Under the previous assumptions, the effects due to directional light sources are captured by λ_E; δ_E, instead, compensates for diffuse light sources and reflections. In short, λ_E and δ_E can be thought of as parameters representing, respectively, the contrast and brightness changes of the image. When either the camera or the scene moves, these parameters change according to the new geometric setting, and hence λ_E and δ_E vary over time. We define the following as our model for illumination changes:

    I(x, 0) = λ(t) I(x, t) + δ(t)    ∀ x ∈ W_U, t ≥ 1    (7)

where we define λ(t) and δ(t) as:

    λ(t) = λ_E(0)/λ_E(t),    δ(t) = δ_E(0) − (λ_E(0)/λ_E(t)) δ_E(t).
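As a sanity check (our addition), these expressions follow directly from writing model (6) at times 0 and t and eliminating the constant albedo E:

    % Model (6) at the two time instants:
    %   I(x,0) = \lambda_E(0) E + \delta_E(0),
    %   I(x,t) = \lambda_E(t) E + \delta_E(t).
    % Solving the second for E and substituting into the first:
    I(x,0) = \underbrace{\frac{\lambda_E(0)}{\lambda_E(t)}}_{\lambda(t)} I(x,t)
           + \underbrace{\delta_E(0)
             - \frac{\lambda_E(0)}{\lambda_E(t)}\,\delta_E(t)}_{\delta(t)}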

2.3 Computing image motion and illumination parameters

Since we are interested in tracking over long time spans, we choose the affine model for the geometric deformations. Combining the geometric and illumination deformation models gives the following:

    I(x, 0) = λ(t) I(A x + d, t) + δ(t)    ∀ x ∈ W_U    (8)

Because of image noise, and because both the affine motion model and the affine illumination model are approximations, equation (8) is in general not satisfied exactly. Therefore, we pose the problem as an optimization: find the parameters A, d, λ and δ that minimize the following discrepancy:

    ε = ∫_W [ I(x, 0) − (λ I(A x + d, t) + δ) ]² w(x) dx    (9)

where w(·) is a weight function. In the simplest case, w(x) = 1. In general, however, the shape of w(·) depends on the application; for instance, it can be a Gaussian-like function to emphasize the center of the window. To carry out the minimization, we approximate the modeled intensity using a first-order Taylor expansion:

    λ(t) I(A x + d, t) + δ(t) ≐ λ I(x, t) + δ + ∇I y_u u    (10)

where ∇I is the gradient of the image intensity computed at x, y = A x + d, u = [d11 d12 d21 d22 d1 d2]^T, and y_u is the derivative of y with respect to u. Rewriting equation (10) in matrix form, we have

    I(x, 0) = f(x, t)^T z    (11)

where f(x, t) = [xIx yIx xIy yIy Ix Iy I 1]^T and z = [d11 d12 d21 d22 d1 d2 λ δ]^T collects all the parameters of interest. Multiplying equation (11) by f(x, t) on both sides and integrating over the whole window W with the weight function w(·), we obtain the following linear 8 × 8 system:

    S z = a    (12)

where

    a = ∫_W I(x, 0) f(x, t) w(x) dx    (13)

and

    S = ∫_W f(x, t) f(x, t)^T w(x) dx.    (14)

If we take pixel quantization into account, the integral becomes a summation. We write S in block-matrix form:

    S = Σ_{x ∈ W} w(x) [ T    U
                         V^T  W ]    (15)

where

    T = [ x²Ix²    xyIx²    x²IxIy   xyIxIy   xIx²    xIxIy
          xyIx²    y²Ix²    xyIxIy   y²IxIy   yIx²    yIxIy
          x²IxIy   xyIxIy   x²Iy²    xyIy²    xIxIy   xIy²
          xyIxIy   y²IxIy   xyIy²    y²Iy²    yIxIy   yIy²
          xIx²     yIx²     xIxIy    yIxIy    Ix²     IxIy
          xIxIy    yIxIy    xIy²     yIy²     IxIy    Iy²  ]    (16)

    U = V = [ xIxI   xIx
              yIxI   yIx
              xIyI   xIy
              yIyI   yIy
              IxI    Ix
              IyI    Iy ]    (17)

and

    W = [ I²  I
          I   1 ].    (18)

T is the matrix computed in the Shi-Tomasi algorithm, which is based on geometry only; W comes from our model of illumination; U and V are the cross terms between geometry and illumination.
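As an illustration of equations (11)-(15), the following sketch (ours, not the authors' released code) assembles f(x, t) over a rectangular window and accumulates the weighted sums S and a; image gradients are approximated with finite differences via numpy.gradient.

    # Sketch: assemble the 8x8 system S z = a of (12)-(14) over a window.
    import numpy as np

    def build_system(I0_patch, It_patch, xs, ys, w=None):
        """I0_patch, It_patch: 2-D intensities over the window; xs, ys: grids."""
        Iy, Ix = np.gradient(It_patch)           # np.gradient: d/drow, d/dcol
        I, one = It_patch, np.ones_like(It_patch)
        # f(x,t) = [x Ix, y Ix, x Iy, y Iy, Ix, Iy, I, 1]^T, stacked per pixel
        f = np.stack([xs*Ix, ys*Ix, xs*Iy, ys*Iy, Ix, Iy, I, one]).reshape(8, -1)
        w = np.ones(f.shape[1]) if w is None else np.ravel(w)
        S = (f * w) @ f.T                        # S = sum_k w_k f_k f_k^T  (14)
        a = (f * w) @ I0_patch.ravel()           # a = sum_k w_k I(x_k,0) f_k (13)
        return S, a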

By solving equation (12) it is possible to compute all the parameters. However, this only gives a rough approximation of z, even when the model is correct, because of the first-order approximation in equation (10). The solution step can therefore be repeated iteratively around the previous solution until all the parameters converge; this is known as a Quasi-Newton method. The advantage of using Quasi-Newton is that one does not need to compute the second derivative of the measurement, i.e. the Hessian of the image, which improves the speed and stability of the optimization algorithm.
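One plausible way to organize this iteration is sketched below, reusing build_system from the sketch above. The rule for composing each correction with the running estimate of (A, d, λ, δ) is our assumption; the report does not spell it out.

    # Sketch: warp and photometrically compensate the current frame with the
    # running estimate, solve the linearized system (12) for a correction,
    # and compose the correction with the estimate.
    import numpy as np
    from scipy.ndimage import map_coordinates

    def track_patch(I0, It, x0, y0, half=3, iters=20, tol=1e-4):
        ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
        I0_patch = map_coordinates(I0, [ys + y0, xs + x0], order=1)
        A, d = np.eye(2), np.array([x0, y0], dtype=float)
        lam, delta = 1.0, 0.0
        for _ in range(iters):
            pts = A @ np.stack([xs.ravel(), ys.ravel()]) + d[:, None]
            J = lam * map_coordinates(It, [pts[1], pts[0]],
                                      order=1).reshape(xs.shape) + delta
            S, a = build_system(I0_patch, J, xs, ys)   # from the sketch above
            z = np.linalg.solve(S, a)
            dA, dd = z[:4].reshape(2, 2), z[4:6]
            d = d + A @ dd                       # compose with running estimate
            A = A @ (np.eye(2) + dA)
            lam, delta = z[6] * lam, z[6] * delta + z[7]
            if np.linalg.norm(np.r_[z[:6], z[6] - 1.0, z[7]]) < tol:
                break                            # corrections are negligible
        return A, d, lam, delta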

3 Outlier rejection

To monitor the quality of the tracked features, the tracker monitors the residuals between the first and the current frame. As we indicated in section 2.2, performing rejection without taking changes in illumination into account is inadequate. Another issue one has to consider is the variation in intensity of the patch of interest: a patch with high intensity variation tends to give a high residual because of pixel quantization and interpolation during matching. We take this into account by normalizing the residual by the intensity variance. Let d_i be the sum of squared intensity differences for patch i at time t, and σ_i the intensity variance of patch i at time t_0; our rejection rule simply compares d_i/σ_i across all the tracked points. Another practical issue is the information contained in the image patch. When a patch deforms significantly away from its initial appearance, it carries little information. For this reason we introduce another monitoring scheme: let W_i(t) be the transformed window at time t for patch i; we compare the area of W_i(t) with a threshold, and if it falls below the threshold the tracker cannot precisely establish the correspondence.
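A minimal sketch of the two tests follows. The function name and threshold constants are illustrative; in particular, the report compares the normalized residuals across all tracked points rather than against fixed values.

    # Sketch of the two monitoring tests of this section.
    import numpy as np

    def keep_feature(I0_patch, J_patch, A, res_thresh=2.0, area_thresh=0.5):
        """J_patch: current patch, warped and illumination-compensated."""
        sigma_i = np.var(I0_patch)               # intensity variance at time t0
        d_i = np.sum((I0_patch - J_patch) ** 2)  # compensated SSD residual
        normalized = d_i / (sigma_i * I0_patch.size)
        # |det A| measures how much the window W_i(t) has shrunk or sheared.
        return normalized < res_thresh and abs(np.linalg.det(A)) > area_thresh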

4 Real-time implementation

We implemented our algorithm under Microsoft Windows on a PC (dual CPU at 1 GHz, 2 GB memory). It can track 100 patches of size 7 × 7 pixels within 5 ms. This leaves enough time for other feature-based algorithms to run in real time. We have code for both Matrox and Imagenation frame grabbers. Figure 2 shows the interface of our real-time program.

[Figure 2: Interface of the real-time feature tracker.]

5 Experiments

Figure 3 shows one image of a sequence about 400 frames long. The scene consists of a box rotating about the vertical axis. The box first rotates from left to right, then comes back to its initial position. As it rotates, the illumination changes substantially because the object is Lambertian and the light is stationary. Figure 5 shows the residuals of our algorithm versus the Shi-Tomasi algorithm. The Shi-Tomasi algorithm gives a high residual as the box moves to the right in the sesame sequence; with our algorithm, instead, the residual remains low. This is crucial for doing outlier rejection based on the residual. Figure 4 shows the evolution of a selected patch. The top eight images are the views at different time instants; the sequence in the middle is the reconstruction of the patch at time 0 using the geometric parameters estimated by the Shi-Tomasi algorithm; the bottom eight images are the reconstructed patches based on our estimation. Note that not only is the appearance estimated correctly, but the change in illumination is compensated for as well. Figure 6 shows the estimated λ (image contrast) and figure 7 the estimated δ (image brightness). Both estimates come back to their initial values, as expected. The Shi-Tomasi tracker cannot track this patch after approximately 200 frames: due to the illumination changes, the optimization scheme cannot find the correct match.

We also show an experiment using our algorithm to perform outlier rejection based on the residual. Figure 8 shows one image with an occluded patch. Figure 9 shows the evolution of the residuals for 5 selected patches using both the Shi-Tomasi and the illumination-invariant algorithm. As one can see, some residuals grow for different reasons. In particular, feature [3] is correctly tracked along the two boxes sequence by the Shi-Tomasi tracker, but its residual is comparable to that of feature [4], which is an outlier. The usual outlier rejection rule would therefore also discard features that are in fact valid. When a selected patch does not satisfy the assumptions (see feature [5], for instance), the illumination estimation is not reliable (see figures 10 and 11).

[Figure 3: One snapshot from the sesame sequence. The superimposed square shows the region of interest.]

6 Conclusions

We have presented an extension of the Shi-Tomasi tracker that takes changes in illumination and reflection into account. The estimation of the parameters is done using a Quasi-Newton type method. The computational cost is low and the algorithm has been implemented in real time. We have made our real-time implementation (both code and documentation) available to the public (after the completion of the double-blind review). We have tested our algorithm on real images. The experimental results show that our model can keep tracking under changes in illumination where other trackers would perform poorly. The computed residual is invariant to changes in illumination, which makes it possible to monitor point features and correctly reject outliers. Taking illumination changes into account is desirable in a variety of applications: in real scenes it is not always the case that these environmental parameters can be controlled, and sometimes this is made even harder by the use of off-the-shelf cameras with automatic gain control.

Acknowledgements

This research is supported in part by NSF grant IIS-9876145, ARO grant DAAD19-99-1-0139 and Intel grant 8029.

[Figure 4: Some snapshots of the region of interest from the sesame sequence: (a) the original sequence as it evolves in time; (b) the reconstruction of the initial patch using the Shi-Tomasi algorithm; (c) the reconstruction of the initial patch using the illumination-invariant tracker. As can be seen, the illumination change prevents the Shi-Tomasi tracker from converging properly, while our method keeps the appearance of the patch constant throughout tracking.]

[Figure 5: Evolution of the SSD residuals. After 196 steps the Shi-Tomasi tracker cannot compensate for the changes in illumination. From frame 197 to frame 392 the sesame box comes back to its original position and the patch can be matched exactly (i.e. the residual goes to zero), as expected.]

[Figure 6: Evolution of λ (image contrast). The image contrast increases until frame 197 and then goes back to 1 at frame 392, when the box is back to its original position.]

[Figure 7: Evolution of δ (image brightness). The image brightness decreases as the sesame box moves, until frame 197, and then returns to 0 as the patch comes back to its original position.]

[Figure 8: One snapshot from the two boxes sequence. In this sequence features are chosen according to the selection algorithm of Shi-Tomasi. Feature [4] has been chosen to show the behavior of the residual when there is partial occlusion. Feature [5] has been chosen to show the behavior of the algorithm when the planar assumption no longer holds. Feature [3] is an example of a point feature whose residual (when light changes are not accounted for) is high, comparable to the residual of occluded features (see feature [4]).]

[Figure 9: Evolution of the SSD residuals for the two boxes sequence. The residuals of features [4] and [3] are comparable; thus, the usual monitoring for occluded features becomes inadequate when the environmental light changes. The other features have residuals that are comparable under both the Shi-Tomasi and the illumination-invariant tracker. Features [1] and [2] do not suffer from light changes. Feature [5] does not satisfy the planar assumption. Feature [3] suffers from the checkerboard reflections, while feature [4] becomes partially occluded.]

[Figure 10: Evolution of the estimated λ (image contrast). As shown above, the image contrast of feature [5] cannot be estimated correctly.]

[Figure 11: Evolution of the estimated δ (image brightness). As shown above, the image brightness of feature [5] cannot be estimated correctly.]

References

[1] A. Azarbayejani and A.P. Pentland. Recursive estimation of motion, structure, and focal length. IEEE Trans. Pattern Analysis and Machine Intelligence, 17(6):562-575, June 1995.
[2] E. Bardinet, L.D. Cohen, and N. Ayache. Tracking medical 3D data with a deformable parametric model. In European Conference on Computer Vision, pages I:317-328, 1997.
[3] A.F. Bobick and A.D. Wilson. A state based technique for the summarization and recognition of gesture. In International Conference on Computer Vision, pages 382-388, 1997.
[4] T.J. Darrell, B. Moghaddam, and A.P. Pentland. Active face tracking and pose estimation in an interactive room. In IEEE Computer Vision and Pattern Recognition, pages 67-72, 1996.
[5] E.D. Dickmanns and V. Graefe. Dynamic monocular machine vision. Machine Vision and Applications, 1:223-240, 1988.
[6] D.M. Gavrila and L.S. Davis. Tracking humans in action: A 3D model-based approach. In Image Understanding Workshop, pages 737-746, 1996.
[7] G.D. Hager and P.N. Belhumeur. Efficient region tracking with parametric models of geometry and illumination. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(10):1025-1039, October 1998.
[8] G.D. Hager and P.N. Belhumeur. Real time tracking of image regions with changes in geometry and illumination. In IEEE Computer Vision and Pattern Recognition, pages 403-410, 1998.
[9] S. Hutchinson, G.D. Hager, and P.I. Corke. A tutorial on visual servo control. IEEE Trans. Robotics and Automation, 12(5):651-670, October 1996.
[10] H. Jin, P. Favaro, and S. Soatto. Real-time 3-D motion and structure of point features: Front-end system for vision-based control and interaction. In IEEE Computer Vision and Pattern Recognition, pages 778-779, 2000.
[11] B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, pages 674-679, 1981.
[12] S.K. Nayar and A. Karmarkar. 360 x 360 mosaics. In IEEE Computer Vision and Pattern Recognition, pages II:388-395, 2000.
[13] N.P. Papanikolopoulos, P.K. Khosla, and T. Kanade. Visual tracking of a moving target by a camera mounted on a robot: A combination of control and vision. IEEE Trans. Robotics and Automation, 9:14-35, 1993.
[14] P. Shi, G. Robinson, T. Constable, A. Sinusas, and J.S. Duncan. A model-based integrated approach to track myocardial deformation using displacement and velocity constraints. In International Conference on Computer Vision, pages 687-692, 1995.
[15] R. Szeliski. Image mosaicing for tele-reality applications. Technical report, DEC Cambridge Research Lab, 1994.
[16] J. Shi and C. Tomasi. Good features to track. In IEEE Computer Vision and Pattern Recognition, pages 593-600, 1994.