A Newton method for pose refinement of 3D models

Arthur Pece and Anthony Worrall
Computational Vision Group, Department of Computer Science
University of Reading, P.O. Box 225
Reading RG6 6AY, England

Abstract. Different optimization methods can be employed to optimize a numerical estimate of the match between an instantiated object model and an image. In order to take advantage of gradient-based optimization methods, perspective inversion must be used in this context. We show that convergence can be very fast by extrapolating to the maximum goodness-of-fit with Newton's method. This approach is related to methods which either maximize a similar goodness-of-fit measure without use of gradient information, or else minimize distances between projected model lines and image features. Newton's method combines the accuracy of the former approach with the speed of convergence of the latter.

1 Introduction

Following pioneering work on iconic evaluation ([1]; see also [10]), our approach to fitting 3D models to visible objects is based on projecting the model onto the image and finding the object pose that maximizes some measure of the goodness-of-fit between model projection and image. The original iconic-evaluation method [1] optimizes the model pose without inverting the perspective projection. This procedure is computationally expensive because it cannot make use of gradient information in the optimization procedure.

An alternative method for fitting 3D models to images is based on detecting simple image features (e.g. edges or high-contrast points), matching these features to geometrical components of the object (e.g. the closest projected model lines), and minimizing the distances between image features and model features by perspective inversion [2, 3, 4, 5, 9, 11]. This method offers much faster convergence at the price of a smaller radius of convergence. The main problem with this method is that it is strongly dependent on accurate matches between image features and model features.

The Newton method described in this paper resembles the latter approach in making use of perspective inversion. However, no image features are matched to model lines: instead, the inverse perspective is used to project into parameter space the gradient of an evaluation function similar to the function employed for iconic evaluation [10]. By avoiding commitment to particular correspondences between image features and model lines, this method is more robust than other active search methods, while maintaining the speed advantage over passive search methods.

2 Formulation of the problem

Model-based vision works by incorporating prior knowledge, specifically geometric knowledge, into the visual inference process. The particular application for which our method was developed is tracking vehicles in traffic. The relevant prior knowledge is a 3D model of the car and a definition of the ground plane (relative to the camera) to which the car is assumed to be constrained (see e.g. [10]).


2.1 Relationship between pose parameters and image coordinates

The pose of the car on the ground plane is defined by two translational and one rotational degrees of freedom. The discrepancy between the model pose and the actual pose of the car can be expressed by the 3-vector Δp = (ΔX, ΔY, Δθ). This discrepancy will induce a discrepancy between model lines (projected onto the image plane) and image edges. Only the normal component of the discrepancy between lines and edges is detected by our method. Given n points on the model lines, it is possible to compute the relationship between changes of the 3 pose parameters and the normal components δ of the image-coordinate changes of the n line-points, where δ is an n-vector. By using a small-angle approximation, and assuming that distances within the object are negligible compared to the distance between object and camera, this relationship can be summarized by a Jacobian:

\delta = J \, \Delta p   (1)

where J is an n × 3 matrix of derivatives of image coordinates with respect to pose parameters of a rigid object. The Jacobian is derived in Appendix A.
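As a minimal numerical illustration of Eq. 1, the following sketch applies a small pose change to a placeholder Jacobian; the values are invented for illustration only and are not taken from the paper (the Jacobian itself would be built as in Appendix A).

```python
import numpy as np

# Eq. 1: normal displacements of n line-points induced by a small pose change.
# J (n x 3) would be built as in Appendix A; these entries are placeholders.
J = np.array([[0.8, 0.1, -2.5],
              [0.2, 0.9,  1.7],
              [0.5, 0.5,  0.3]])      # n = 3 line-points
dp = np.array([0.05, -0.02, 0.01])    # (dX [m], dY [m], dtheta [rad])

delta = J @ dp                        # normal displacement of each line-point (pixels)
print(delta)
```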

2.2 The evaluation function

Cars and their visual background can vary widely in colour and texture: all that can be predicted about the image, from a generic model of a car with a given pose, is that there will be some change of image intensity where the model lines of the car are projected onto the image plane. Allowing for some positional error, the change of intensity will take place within a window centered on the projected model line. These considerations lead to the choice, as the evaluation function, of the sum of absolute values of differences of grey levels, in the direction normal to the projected model line. This sum is evaluated within a window. With this choice, the evaluation function at the location indexed by i is given by:

e_i = e(\mu_i) = \sum_{\nu} |I'(\nu)| \, w\!\left( \frac{\nu - \mu_i}{\sigma} \right)   (2)

where ν is the distance along the normal v_i from the projected model line; I'(ν) is the discrete derivative, in the direction v_i, of the grey-level values (obtained by bilinear interpolation from pixel values); w(a) is a window function and σ is a scaling factor, which is equal to 4 pixels in our implementation. The window function should be smooth, even, non-negative, integrable and monotonically decreasing away from the origin. We use a Gaussian window:

w(a) = \exp\!\left( -\frac{a^2}{2} \right)   (3)

Note that this measure of goodness-of-fit is indifferent not only to the direction of contrast, but also, to some extent, to the type of image feature, i.e. whether the boundary between car and background is a step or ramp edge, an image line (two edges of opposite polarity), or a combination of such features.
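As an illustration of Eqs. 2-3, the following sketch computes one point evaluation by sampling grey levels along the normal with bilinear interpolation. The unit sampling step, the truncation of the window at three standard deviations, and the central-difference derivative are implementation choices assumed here; they are not specified in the paper.

```python
import numpy as np

def bilinear(image, x, y):
    """Grey level at a sub-pixel location (x, y) by bilinear interpolation."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * image[y0, x0] +
            dx * (1 - dy) * image[y0, x0 + 1] +
            (1 - dx) * dy * image[y0 + 1, x0] +
            dx * dy * image[y0 + 1, x0 + 1])

def point_evaluation(image, point, normal, mu=0.0, sigma=4.0, half_width=12):
    """Eq. 2: sum of |grey-level derivatives| along the normal, weighted by the
    Gaussian window of Eq. 3 centred at distance mu from the projected line-point.
    The window is truncated at +/- 3 sigma (an assumed implementation choice)."""
    e = 0.0
    for nu in range(-half_width, half_width + 1):
        # discrete derivative along the normal, by central differences on
        # bilinearly interpolated grey levels (one possible discretization)
        p_plus = point + (nu + 0.5) * normal
        p_minus = point + (nu - 0.5) * normal
        dI = bilinear(image, p_plus[0], p_plus[1]) - bilinear(image, p_minus[0], p_minus[1])
        a = (nu - mu) / sigma
        e += abs(dI) * np.exp(-0.5 * a * a)   # w(a) = exp(-a^2 / 2)
    return e
```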

2.3 Probabilistic framework

Rather than trying to find the most probable pose given the image (the MAP estimate of the pose) or the pose that maximizes the probability of the image (the ML estimate), we try to find the pose that gives evaluation functions of lowest prior probability, i.e. the pose which minimizes the probability of accidental matches between model lines and image edges. The three statistical approaches are related, indeed they are equivalent under some assumptions, such as uniform prior probability of poses. However, the probability distribution of "scores" (values of the sum of evaluation functions over a line), given that the model pose is correct, is difficult to estimate directly; on the other hand, the probability distribution of scores, given that the model pose is not correct, can be estimated by Monte Carlo methods.

Fig. 1: Statistical distributions of normalized line scores, plotted as log-probabilities vs. normalized scores. Continuous curve: score for a single point evaluation (normalized by multiplication by 2√2). Dashed curve: score for a line length of 16 pixels (4 point evaluations spaced by 4 pixels). Dotted curve: score for a line length of 64 pixels (8 point evaluations spaced by 8 pixels). Statistical distributions for lines of length between 4 and 9 pixels are not shown because they overlap the curve for the single point evaluation (with an appropriate normalization factor and spacing between evaluations).

Assuming that the scores for the different model lines are statistically independent, minimizing the probability of the scores for all model lines is equivalent to minimizing the sum of log-probabilities of the scores. A Monte Carlo estimate of these log-probabilities (Fig. 1) shows that their dependence on the score is approximately linear (i.e. the probability of a given score depends exponentially on the score over a wide range). The linear relationship breaks down in the range of low scores, which have lower probability than predicted by a linear extrapolation; however, this range can be neglected since low scores do not provide evidence for a model line. The slopes of the linear portions of the curves can be made approximately equal by a simple normalization procedure, which is based only on the length of the projected model line. Therefore, under the assumption of statistical independence for different model lines, maximizing the sum of normalized scores is equivalent to minimizing the log-probability of these scores being obtained by accidental matches between model lines and image edges. The (negative) sum of normalized scores is defined as the energy E:

E = E_0 - \sum_i \frac{e_i}{N(L_i)}   (4)

where E_0 is a constant needed to ensure that the probabilities add up to unity, L_i is the length of the model line including point i, and the normalization factor N(L_i) is chosen to ensure that the slope in Fig. 1 is the same for all line lengths. This can be achieved by setting N(L_i) to be proportional to the square root of line length, except at very short lengths. In practice, it is computationally efficient to increase the spacing between point evaluations, rather than increasing the normalization factor. For lines more than 8 pixels long, we set the number of equally-spaced point evaluations to be equal to the square root of line length, and therefore the spacing between point evaluations is also equal to the square root of line length: the effect of increasing the number of point evaluations is compensated by the decrease of correlation between point evaluations. For lines between 4 and 8 pixels long, only 2 point evaluations are carried out and the normalization factor is N(4) = 1/√2. Lines shorter than 4 pixels have only one point evaluation and a normalization factor N(1) = 1/(2√2).
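The sampling and normalization rules above can be summarized in a short routine. The paper gives N(1), N(4) and the proportionality of N(L) to √L, but not the constant of proportionality for long lines, the rounding of √L, or the spacing used for 4-8 pixel lines; the choices below are assumptions of this sketch.

```python
import math

def line_sampling(L):
    """Point-evaluation schedule for a projected model line of length L pixels,
    following Section 2.3: returns (number of evaluations, spacing, N(L))."""
    if L < 4:
        return 1, 0.0, 1.0 / (2.0 * math.sqrt(2.0))          # N(1) = 1/(2*sqrt(2))
    if L <= 8:
        return 2, L / 2.0, 1.0 / math.sqrt(2.0)               # N(4) = 1/sqrt(2); spacing assumed
    n = max(2, round(math.sqrt(L)))                           # ~sqrt(L) evaluations, spaced ~sqrt(L) apart
    N = math.sqrt(L / 8.0) / math.sqrt(2.0)                   # ~sqrt(L); constant chosen to be continuous at L = 8 (assumption)
    return n, L / n, N
```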

3 Minimization of log-probability

Having derived an estimate of the log-probability of a given set of pose parameters, our method differs from the previous iconic evaluation method [10] in minimizing the log-probability by Newton's method, rather than by the downhill simplex method. The principle is as follows:

1. The evaluation function of one point on a model line, as a function of translation of the line-point in the direction normal to the line, is approximated by the first three terms of its Taylor expansion:

e(\mu_i) \approx e_i(0) + \left. \frac{d e_i}{d \mu_i} \right|_0 \mu_i + \frac{1}{2} \left. \frac{d^2 e_i}{d \mu_i^2} \right|_0 \mu_i^2   (5)

2. The evaluation of the model as a function of the pose parameters on the ground plane is approximated by the first three terms of its Taylor expansion:

E(\Delta p) \approx E(0) + (\nabla E|_0)^T J \, \Delta p + \frac{1}{2} \Delta p^T J^T \, \nabla^2 E|_0 \, J \, \Delta p   (6)

where ∇E is an n-vector whose elements are the derivatives of the normalized point evaluations with respect to the displacements of the line-points along the normals:

(\nabla E)_i = \frac{-1}{N(L_i)} \frac{d e_i}{d \mu_i}   (7)

and ∇²E is an n × n diagonal matrix whose diagonal elements are the second derivatives of the normalized point evaluations with respect to the line-point displacements:

(\nabla^2 E)_{ii} = \frac{-1}{N(L_i)} \frac{d^2 e_i}{d \mu_i^2}, \qquad (\nabla^2 E)_{ij} = 0 \ (i \neq j)   (8)

3. The evaluation function can be minimized by Newton's method, by solving for the pose parameters at which the gradient is zero:

\Delta p = -H^{-1} J^T \nabla E|_0   (9)

where the Hessian H is given by:

H = J^T \, \nabla^2 E|_0 \, J   (10)

If the Hessian is positive definite at the extremum found in this way, then the extremum is indeed a minimum (in parameter space) of the evaluation function, rather than a maximum or a saddle point.
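A compact sketch of the Newton step of Eqs. 9-10 is given below; it assumes that the gradient (Eq. 7) and the diagonal of ∇²E (Eq. 8) have already been computed from the point evaluations, and that the projected Hessian is invertible.

```python
import numpy as np

def newton_pose_step(J, grad_E, hess_E_diag):
    """One Newton step (Eqs. 9-10).  J: n x 3 Jacobian; grad_E: n-vector of first
    derivatives (nabla E, Eq. 7); hess_E_diag: n-vector forming the diagonal of
    nabla^2 E (Eq. 8).  Illustrative only."""
    H = J.T @ (hess_E_diag[:, None] * J)      # H = J^T (nabla^2 E) J, Eq. 10
    dp = -np.linalg.solve(H, J.T @ grad_E)    # dp = -H^{-1} J^T nabla E, Eq. 9
    return dp
```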

3.1 Approximating the Hessian

The method, as outlined above, has a fundamental flaw that has been found to render it unusable in practice: Newton's method can only be used for minimization when the Hessian is positive definite at every iteration. In our case, the problem arises from the second derivative of e with respect to μ_i:

\frac{d^2 e}{d \mu_i^2} = \frac{1}{\sigma^2} \sum_{\nu} |I'(\nu)| \, w\!\left( \frac{\nu - \mu_i}{\sigma} \right) \left[ \frac{(\nu - \mu_i)^2}{\sigma^2} - 1 \right]   (11)

(see Appendix B). Like the expression for e_i, this is a weighted sum of image derivatives, but the weights in Eq. 11 can be positive, negative or zero: Newton's method is a good approximation only if most of the significant image derivatives are weighted with negative weights over all model line-points.

One possible solution would be to employ methods such as conjugate-gradient ([8], pp. 420-425), variable-metric ([8], pp. 425-430) or Levenberg-Marquardt ([8], pp. 683-688), which do not rely on an accurate estimate of the Hessian. However, these methods require sub-iterations involving several evaluations of the objective function. In pose refinement, these evaluations are the most computationally expensive step of the iteration: each sub-iteration would become as expensive as one iteration of the simple Newton's method or of constant-rate gradient descent.

As a faster alternative, we replace the formally-correct expression for the Hessian with the following approximation:

\tilde{H} = J^T \mathcal{E} J   (12)

where \mathcal{E} is an n × n diagonal matrix whose diagonal elements are the normalized point evaluations for each line-point:

\mathcal{E}_{ii} = \frac{e_i}{\sigma^2 N(L_i)}, \qquad \mathcal{E}_{ij} = 0 \ (i \neq j)   (13)

This approximation preserves the local minima of the evaluation function (see Appendix B). The method as implemented can be described as an approximate Newton (or quasi-Newton) method [8].

Even with this approximation, it is necessary to ensure that the parameter change Δp goes downhill in the energy, i.e. that the parameter change is opposite to the local energy gradient. This can be done by reversing the sign of any element of Δp whenever it is the same as the sign of the corresponding element of J^T (∇E)|_0.
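The approximate step of Eqs. 12-13, together with the sign safeguard just described, could be implemented along the following lines; this is a sketch under the stated assumptions, not the authors' code.

```python
import numpy as np

def approx_newton_step(J, e, N_L, grad_E, sigma=4.0):
    """Approximate Newton step of Section 3.1.  e: n-vector of point evaluations;
    N_L: corresponding normalization factors; grad_E: n-vector of Eq. 7.
    Diagonal weights follow Eq. 13: E_ii = e_i / (sigma^2 N(L_i))."""
    E_diag = e / (sigma**2 * N_L)
    H_tilde = J.T @ (E_diag[:, None] * J)            # Eq. 12
    dp = -np.linalg.solve(H_tilde, J.T @ grad_E)
    # Safeguard: force the step downhill in energy by flipping any component
    # whose sign matches that of the projected gradient J^T (nabla E).
    g = J.T @ grad_E
    dp = np.where(np.sign(dp) == np.sign(g), -dp, dp)
    return dp
```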

4 Results

Fig. 2 shows the results of convergence tests comparable to those shown in [11, 12]: an optimal pose for a car model is determined by eye to fit a car image, and the method is applied to the car model after it has been displaced from the optimal pose by fixed amounts, ranging from -1 m to 1 m along both the x and y axes and from -25 to 25 degrees of rotation around the z axis. The histograms show the density of final poses of the model on the ground plane: each bin has widths of 0.2 meters and 5 degrees. The top image is the same as used in previous research and our results compare favourably with those reported for the least-squares method [11, 12], which is the most effective method developed to date. The second image shows that convergence is not as good when the car is seen from the side, rather than from the top. The third image shows that spurious edges in the background further degrade convergence. Finally, the bottom image shows that partial occlusion leads to the loss of a single optimal pose; however, the two distinct peaks of the histogram are sufficiently close to merge into a single optimal pose if occlusion is removed in subsequent frames.

5 Conclusions

5.1 Comparison to previous methods

The main difference between the method introduced in this paper and a previously-introduced inverse method [11] is in the expression for e(μ_i) (Eq. 2): in the previous (least-squares) method, the point evaluation function is given by the squared distance between the line-point and the highest image derivative along the normal to the line-point (see also Appendix B). The relative advantages and disadvantages of the two methods are:

- with Newton's method, local gradients depend on the local contrast, while with the least-squares method, local gradients depend only on distances between projected lines and maximum values of image derivatives, independently of the values of these maxima;
- Newton's method avoids commitment to particular features by weighting features according to the absolute value of the image derivative, while the least-squares method only considers the local maxima of image derivatives and loses information on the other derivatives along the normals;
- the Gaussian window increases the robustness of Newton's method by gradually decreasing the weight of distant features;
- in the least-squares method, the exact form of the Hessian can be used, instead of the approximation in Eq. 12; however, this is a questionable advantage, since the minimization that is carried out by the least-squares method does not have a clear statistical justification.

The image-gradient method [4] is closely related to our Newton method. Again, the main difference is in the form of the point evaluation function: by summing (over a Gaussian window) norms of the image gradient, rather than directional image derivatives, the image-gradient method loses the information contained in the local direction of the image gradient. The main disadvantage of the image-gradient method, however, is its heavy computational cost.

As pointed out, the new method is most closely related to iconic evaluation [10], from which it differs mainly by the use of the gradient of the evaluation function in the optimization procedure. A minor difference is in the use of a Gaussian (rather than triangular) window in the point evaluation function (Eq. 3).

Fig. 2: Left: images used for the convergence tests. Right: histograms of poses after 40 iterations of Newton's method. The two axes of each histogram represent Euclidean distance from the optimal pose and Δθ (angle difference) with respect to the optimal pose. The width of the bins is 0.2 meters on the distance axis and 5 degrees on the angle axis. The heights of the highest peaks in each histogram are (from top to bottom): 742 poses, 390 poses, 201 poses, and 220 poses.

5.2 Possible extensions of the method

The method described above can be used with different evaluation functions (Eq. 2), and these functions can themselves have different parameters, most notably the width σ of the window. The width parameter opens the possibility of applying the method over multiple scales. In a previous paper [7], we described the application of the method with a slightly different evaluation function, in which the squared image derivatives are used instead of the absolute values. The advantages of using absolute values are slightly greater linearity of the log-probability curves (Fig. 1) and a greater similarity to the evaluation function used in iconic evaluation [1]. However, convergence is practically the same with either evaluation function. Two other applications for which the new method is suitable are shape recovery of deformable models [11] and recovery of extrinsic camera parameters. The only modification that is needed to the algorithm is the form of the Jacobian. Preliminary research suggests that the method can perform quite well in the recovery of camera parameters [6].

Acknowledgements

The research described in this paper was supported by the TMR program SMART II.

Appendix A: Derivation of the Jacobian

It is convenient to introduce camera-centered (x_c) and model-centered (x_m) 3D coordinates, as well as 2D image coordinates (u). Note that the model-centered coordinates are defined by the object model, not by the actual position of the object in the scene. The discrepancy between the model pose and the actual object pose is due to the translation and rotation of the car on the ground plane, which can be decomposed into a translation vector (ΔX, ΔY) and a rotation angle Δθ (for rotation around the z axis). Using a small-angle approximation, the discrepancy for the coordinates of each model-point is given by

\Delta x_m = \begin{pmatrix} \Delta X - y_m \Delta\theta \\ \Delta Y + x_m \Delta\theta \\ 0 \end{pmatrix}   (14)

This relationship can be expressed as

\Delta x_m = C \, \Delta p, \qquad C = \begin{pmatrix} 1 & 0 & -y_m \\ 0 & 1 & x_m \\ 0 & 0 & 0 \end{pmatrix}   (15)

where Δp = (ΔX, ΔY, Δθ). The transformation from model to camera coordinates can be decomposed into a rotation followed by a translation:

x_c = R \, x_m + t   (16)

where R is a rotation matrix that aligns the model-centered coordinates with the camera-centered coordinates and t is a vector expressing (in camera-centered coordinates) the translation from the focal point of the camera to the center of the model coordinate system. The transformation from camera to image coordinates is given by the well-known relation:

u = \frac{1}{z_c} \mathbf{F} \, x_c, \qquad \mathbf{F} = F \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}   (17)

where F is the focal length of the camera. Putting together the above expressions, and making the simplifying assumption that the distances within the object are negligible compared to the distance between the object and the camera, we obtain the relationship between changes of pose parameters and changes in image coordinates:

\Delta u = \frac{1}{t_z} \mathbf{F} R C \, \Delta p   (18)

Only the component of Δu normal to the model line under consideration can be detected by looking for image derivatives along normals to the model line. This component can be obtained from the inner product of Δu with the unit normal v_i to the model line:

\delta_i = \frac{1}{t_z} v_i^T \mathbf{F} R C \, \Delta p   (19)

One such expression can be obtained for each visible model point. If n model points are visible, then the n expressions can be arranged into the n × 3 Jacobian matrix J whose i-th row is

J_i = \frac{1}{t_z} v_i^T \mathbf{F} R C_i   (20)

where C_i is the matrix of Eq. 15 evaluated at the model coordinates (x_m, y_m) of line-point i, and the unit normals v_i are the columns of the 2 × n matrix V of normals to the n line-points.
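For concreteness, the construction of the Jacobian in Eq. 20 can be sketched as follows; the function name and data layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def pose_jacobian(model_points, normals, R, t, F):
    """Build the n x 3 Jacobian of Eq. 20.  model_points: n x 3 model-centred
    coordinates (x_m, y_m, z_m); normals: n x 2 unit normals v_i to the projected
    model lines; R, t: model-to-camera rotation and translation (Eq. 16);
    F: focal length."""
    Fmat = F * np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])       # Eq. 17 projection matrix
    t_z = t[2]
    J = np.zeros((len(model_points), 3))
    for i, ((x_m, y_m, _), v) in enumerate(zip(model_points, normals)):
        C = np.array([[1.0, 0.0, -y_m],
                      [0.0, 1.0,  x_m],
                      [0.0, 0.0,  0.0]])         # Eq. 15
        J[i] = v @ (Fmat @ R @ C) / t_z          # row i of Eq. 20
    return J
```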

Appendix B: Approximating the evaluation function

It is desirable that the window function w be a smooth function of its argument, so that the gradient depends smoothly on the model line positions in the image. However, this smoothness generates a problem: in order for w to decrease smoothly to zero, it must have a positive second derivative for large values of its argument, which makes Newton's method inapplicable. Consider the Gaussian window, used in our implementation:

e(\mu) = \sum_{\nu} |I'(\nu)| \exp\!\left( -\frac{(\nu - \mu)^2}{2\sigma^2} \right)   (21)

\frac{\partial e}{\partial \mu} = \sum_{\nu} |I'(\nu)| \exp\!\left( -\frac{(\nu - \mu)^2}{2\sigma^2} \right) \frac{\nu - \mu}{\sigma^2}   (22)

\frac{\partial^2 e}{\partial \mu^2} = \sum_{\nu} |I'(\nu)| \exp\!\left( -\frac{(\nu - \mu)^2}{2\sigma^2} \right) \left[ \frac{(\nu - \mu)^2}{\sigma^4} - \frac{1}{\sigma^2} \right] = \sum_{\nu} |I'(\nu)| \exp\!\left( -\frac{(\nu - \mu)^2}{2\sigma^2} \right) \frac{(\nu - \mu)^2}{\sigma^4} - \frac{1}{\sigma^2} e(\mu)   (23)

The absolute values of the image derivatives are weighted with positive, negative or zero weights in Eq. 23, depending on the distance from the model line: only if most of the significant image derivatives are within a distance σ from the model line, for all model lines, can Newton's method be reliably applied.

This problem can be avoided by neglecting the first term on the right-hand side of Eq. 23. In the following, we justify this approximation. Rather than maximizing the sum of (absolute values of) image derivatives by moving the window, let us minimize the sum of squared distances from the image derivatives under the current window. This can be achieved by substituting the following evaluation function:

\tilde{e}(\mu) = -\frac{1}{2} \sum_{\nu} |I'(\nu)| \, w\!\left( \frac{\nu - \mu'}{\sigma} \right) \left( \frac{\nu - \mu}{\sigma} \right)^2   (24)

where μ' is considered as a distinct variable for the purpose of differentiation, but is actually equal to μ. Differentiating Eq. 24 with respect to μ while keeping μ' constant is equivalent to differentiating the evaluation function with respect to the position of the model line, while keeping the window at its original position. The derivatives of the new evaluation function are:

\frac{\partial \tilde{e}}{\partial \mu} = \sum_{\nu} |I'(\nu)| \, w\!\left( \frac{\nu - \mu'}{\sigma} \right) \frac{\nu - \mu}{\sigma^2}   (25)

\frac{\partial^2 \tilde{e}}{\partial \mu^2} = -\frac{1}{\sigma^2} \sum_{\nu} |I'(\nu)| \, w\!\left( \frac{\nu - \mu'}{\sigma} \right)   (26)

If w is Gaussian, then we have (with μ' = μ):

\frac{\partial \tilde{e}}{\partial \mu} = \frac{\partial e}{\partial \mu}   (27)

\frac{\partial^2 \tilde{e}}{\partial \mu^2} = -\frac{1}{\sigma^2} e   (28)

In this case, the two evaluation functions (e, Eq. 2, and ẽ, Eq. 24) have the same first derivatives and therefore the same stable points. Define the energy Ẽ as the sum of normalized ẽ:

\tilde{E} = \tilde{E}_0 - \sum_i \frac{\tilde{e}_i}{N(L_i)}   (29)

It is easy to see that the gradient of Ẽ is the same as the gradient of E, while the Hessian of Ẽ is H̃ (Eq. 12). Therefore, the method as implemented minimizes Ẽ at each iteration; at the next iteration, a new Ẽ is constructed (since the μ' have changed) and must be minimized again. The method converges when the gradient of Ẽ is zero, at which point the gradient of E must also be zero. This stable point can be a minimum, a maximum or a saddle point of E. In practice, using the new evaluation function (while updating the value of μ' at each iteration) normally leads to a good visual match between projected model lines and images; therefore we do not check that the pose at convergence is indeed a minimum of E.
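The identities in Eqs. 27-28 are easy to verify numerically; the following sketch uses synthetic |I'(ν)| values and finite differences (the test data and step size are arbitrary assumptions).

```python
import numpy as np

# Numerical check of Eqs. 27-28: at mu' = mu the surrogate e~ has the same first
# derivative as e, and second derivative -e/sigma^2.
rng = np.random.default_rng(0)
nu = np.arange(-12, 13, dtype=float)          # sample positions along the normal
absI = rng.random(nu.size)                    # |I'(nu)|, arbitrary test data
sigma, mu, h = 4.0, 0.7, 1e-4

def e(m):
    return np.sum(absI * np.exp(-(nu - m)**2 / (2 * sigma**2)))

def e_tilde(m, m_prime):
    w = np.exp(-((nu - m_prime) / sigma)**2 / 2)
    return -0.5 * np.sum(absI * w * ((nu - m) / sigma)**2)

de = (e(mu + h) - e(mu - h)) / (2 * h)
de_tilde = (e_tilde(mu + h, mu) - e_tilde(mu - h, mu)) / (2 * h)
d2e_tilde = (e_tilde(mu + h, mu) - 2 * e_tilde(mu, mu) + e_tilde(mu - h, mu)) / h**2

print(np.isclose(de, de_tilde))                               # Eq. 27
print(np.isclose(d2e_tilde, -e(mu) / sigma**2, rtol=1e-4))    # Eq. 28
```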

References

1. K. S. Brisdon, Hypothesis verification using iconic matching. Ph.D. thesis, University of Reading, 1990.
2. C. Harris, Tracking with rigid models. In: Active Vision (A. Blake, A. Yuille, eds), pp 59-73. MIT Press, 1992.
3. D. Koller, K. Daniilidis, H.-H. Nagel, Model-based object tracking in monocular image sequences of road traffic scenes. Int. J. Comp. Vis. 10(3): 257-281, 1993.
4. H. Kollnig, H.-H. Nagel, 3D pose estimation by fitting image gradients directly to polyhedral models. Proc. 5th ICCV, pp 569-574, IEEE Computer Soc. Press, 1995.
5. D. G. Lowe, Fitting parametrized 3-D models to images. IEEE Trans. PAMI 13(5): 441-450, 1991.
6. A. E. C. Pece, G. D. Sullivan, Model-based control of an active camera head. Proc. EU-HCM SMART workshop, Lisbon, April 1995.
7. A. E. C. Pece, A. D. Worrall, A statistically-based Newton method for pose refinement. Image and Vision Computing, in press (1998).
8. W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes in C (2nd edition). Cambridge University Press, 1992.
9. R. S. Stephens, Real-time 3D object tracking. Proc. Alvey Vision Conf., pp 85-90, 1989.
10. G. D. Sullivan, Visual interpretation of known objects in constrained scenes. Phil. Trans. R. Soc. Lond. B 337: 361-370, 1992.
11. A. D. Worrall, J. M. Ferryman, G. D. Sullivan, K. D. Baker, Pose and structure recovery using active models. Proc. 6th BMVC, pp 137-146, 1995.
12. A. D. Worrall, G. D. Sullivan, K. D. Baker, Pose refinement of active models using forces in 3D. Proc. 3rd ECCV, pp 341-352, Springer-Verlag, 1994.
