Numerical Methods for Model-Based Pose Recovery

Rodrigo L. Carceroni

Christopher M. Brown

The University of Rochester
Computer Science Department
Rochester, New York 14627

Technical Report 659 (Revised)
August 1997

Abstract

In this paper we review and compare several techniques for model-based pose recovery (extrinsic camera calibration) from monocular images. We classify the solutions reported in the literature as analytical perspective, affine and numerical perspective. We also present reformulations for two of the most important numerical perspective solutions: Lowe's algorithm and Phong-Horaud's algorithm. Our improvement to Lowe's algorithm consists of eliminating some simplifying assumptions on its projective equations. A careful experimental evaluation reveals that the resulting fully projective algorithm has superexponential convergence properties for a wide range of initial solutions and, under realistic usage conditions, it is up to an order of magnitude more accurate than the original formulation, with arguably better computation-time properties. Our extension to Phong-Horaud's algorithm is, to the best of our knowledge, the first method for independent orientation recovery that actually exploits the theoretical advantages of point correspondences over line correspondences. We show that, in the context of a specific real-life application (visual navigation), it is either more accurate than other similar techniques with the same computational cost, or more efficient with the same accuracy.

The University of Rochester Computer Science Department supported this work.  This material is based on work supported by CAPES process BEX-0591/95-5, NSF IIP grant CDA-9401142, NSF grant IRI-9306454 and DARPA grant DAAB07-97-C-J027.

Contents

1 Introduction
  1.1 Paper Outline
  1.2 Notational Conventions

2 Analytical Perspective Solutions
  2.1 Point Correspondences
  2.2 Line Correspondences
  2.3 Angle Correspondences
  2.4 Dealing with Ambiguity

3 Weaker Camera Models
  3.1 A Taxonomy of Camera Models
  3.2 Weak-Perspective Pose Recovery
  3.3 Using Multiple Camera Models

4 Numerical Perspective Solutions
  4.1 Strong-from-Weak Pose Methods
  4.2 Derivative-Based Methods
    4.2.1 The Classic Approach: Lowe's Algorithm
    4.2.2 Using Other Orientation Representation Schemes

5 Experimental Evaluation
  5.1 Analytical Approaches for Accuracy Evaluation
  5.2 Evaluating Our Improvements to Lowe's Algorithm
    5.2.1 Experimental Methodology
    5.2.2 Convergence in the General Case
    5.2.3 Convergence with Rough Alignment
    5.2.4 Execution Times
    5.2.5 Sensitivity to Depth in Object Center Position
    5.2.6 Sensitivity to Translational Error in Initial Solution
    5.2.7 Sensitivity to Rotational Error in Initial Solution
    5.2.8 Sensitivity to Additive Noise
    5.2.9 Accuracy in Practice
    5.2.10 Discussion
  5.3 Evaluating Our Improvements to Phong-Horaud's Algorithm
    5.3.1 A Real-Life Application: Visual Navigation
    5.3.2 Experimental Methodology
    5.3.3 Convergence in the General Case
    5.3.4 Execution Times
    5.3.5 Sensitivity to the Number of Visible Landmarks
    5.3.6 Sensitivity to the Height Distribution of the Visible Terrain
    5.3.7 Sensitivity to Calibration Bias and Measurement Noise
    5.3.8 Discussion

6 Conclusion and Future Directions

Chapter 1

Introduction

The recovery of the relative orientation and position (pose) of the camera in a single scene, given the resulting 2D image, is a central problem in computer vision, also known as extrinsic (external) camera calibration. The applications in which this problem arises are numerous: cartography, tracking, object recognition, hand-eye coordination, augmented reality and visual navigation. In many cases, a description of the three-dimensional geometry of the scene is available a priori, allowing the use of model-based techniques. These techniques in general exploit known correspondences between model features such as vertices, edges and angles, and their respective images, in order to create a set of constraints that can be used to invert the 3D-to-2D projective transformation performed by the camera.

In some of these applications efficiency is a crucial requirement. A typical example is tracking. In several real-life tasks, ranging from automatic assembly of structures in orbit and satellite retrieval to traffic monitoring on a highway, it is necessary to determine the pose and the linear and angular velocities of one or more objects moving in the visual field of the camera. The different schemes proposed to deal with this problem can be classified as active or passive. In the active vision approaches, the camera initially performs a quick saccade to foveate the object of interest and then it pursues the object in a relatively smooth way during a certain time interval. The major advantage of this approach (in addition to being biologically inspired) is that the foveation of the objects of interest increases the resolution, allowing more accurate recognition and pose recovery. On the other hand, active foveation is a very hard control problem, because of the substantial delays involved, the intrinsically discrete nature of the measured data (30 frames per second, if a commercial camera is used), and the relatively tight constraints involved in keeping the objects of interest in the visual field. For a detailed discussion of these problems and some possible solutions to them, we refer the reader to [54; 12]. The passive approaches, on the other hand, use static cameras [31]. So, the visual control problem can be avoided, but the objects of interest must in general occupy a smaller part of the visual field and thus less information about them is available.

Regardless of which of these schemes is chosen, the task of tracking a single target can be divided into two parts: acquisition and tracking proper [18]. Acquisition involves the localization (possibly via motion detection and segmentation) and identification of the target, as well as a rough initial estimation of its pose, velocities and possibly some other state variables of interest. This phase of tracking is particularly hard in active vision systems, due to the need to perform capture saccades towards the moving target [49]. The information obtained in the acquisition phase can then be used to initiate the tracking proper phase, in which the target state variables are refined and updated at a relatively high frequency. This phase can be divided into several subtasks, as suggested by Donald Gennery [18]:

1. Prediction: Given a multivalued time series with the history of (noisy) measurements of the target state variables performed so far, it is necessary to predict the values of these variables at the next sampling instant, so that the tracking system can always restrict its search for the target to a relatively small part of the scene. Traditionally, this extrapolation of the values of the state variables is done recursively, for efficiency reasons (a minimal sketch of such a recursive prediction step is given after this list). In other words, at any point in time, the tracker keeps only a vector of current estimates for the true (as opposed to measured) values of the state variables and some of the higher-order moments of the multivalued time series (typically the covariance matrix). Then, given the new measurements, all these variables are extrapolated for some future instant and the whole process can be repeated as soon as another set of measurements is available [10; 22].

2. Projection: Then, given the predicted pose of the target and a certain model for the 3D-to-2D transformation performed by the camera, it is necessary to determine the appearance of the target in the scene. Typically, this task is formulated as the determination of positions, orientations or apparent angles, and the visibility analysis, for a set of distinctive target features.

3. Measurement: The next step is to search for the expected visible features in the image. The problem of identifying which image features correspond to each model feature, known as the correspondence problem, is the hardest aspect of this task [39; 31; 30]. However, this problem can be avoided (or at least ameliorated) if the features are distinctive enough to be uniquely distinguished regardless of the pose of the target (via color information, for instance). Another useful trick is to make the other three steps of the tracking proper phase tolerant to false matches in the solution of the correspondence problem, via the use of robust statistics [40; 61; 48; 17].

4. Backprojection: Finally, it is necessary to compute the discrepancy between the actual image measurements and the image measurements that would be expected given the analysis performed in the Projection step. Of course, it is also necessary to determine how this discrepancy affects the current estimate for the state of the target. Ultimately, some sort of backprojection from the 2D image plane to the 3D scene space is needed to perform this task. So, this is the central problem that we discuss throughout this paper.
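For concreteness, a minimal sketch of such a recursive prediction step (a constant-velocity model in Python; the state layout and noise level are illustrative assumptions, not prescriptions):

```python
import numpy as np

def predict(x, P, dt, q=1e-3):
    """Constant-velocity prediction: x = [positions, velocities], P = covariance."""
    n = len(x) // 2
    F = np.eye(2 * n)
    F[:n, n:] = dt * np.eye(n)        # positions advance by velocity * dt
    Q = q * np.eye(2 * n)             # simple process-noise model
    return F @ x, F @ P @ F.T + Q
```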

1.1 Paper Outline

Many different techniques for the problem of model-based pose recovery from known feature correspondences have appeared in the literature recently. They can be grouped according to the nature of the camera models and mathematical techniques employed. In the following three chapters, we discuss and compare the three main classes of solutions according to this classification: analytical perspective solutions, affine solutions and numerical perspective solutions. The analytical solutions, discussed in Chapter 2, are relatively easy to implement, are computationally cheap, and work even in scenes with significant perspective effects. However, they work only for a limited number of features, are ambiguous and have very poor error propagation properties. In Chapter 3, we review the techniques based on linearized camera models. These are also simple and efficient and, contrary to the analytical solutions, they work for scenes with arbitrarily many features. However, they are not able to cope with significant perspective distortion. Chapter 4 is the core of the paper. The solutions discussed in that chapter compute successive approximations for the camera pose, until an approximation with the desired precision is obtained. These solutions are very general and accurate, but they are more complex and computationally demanding. We discuss ways of tuning them for real-time usage and present reformulations that make two of them either more efficient (without loss of accuracy) or more accurate (at no extra computational cost). This is the main original contribution of the present work. In Chapter 5, we present the empirical evaluation of the two reformulations proposed. Finally, in Chapter 6, we summarize the contributions of our own work in this area and present some future perspectives.

1.2 Notational Conventions

Throughout this paper, we denote scalar constants and variables (such as lengths and distances) by lower-case letters in italics (angles are denoted by Greek letters), points in three-dimensional space by upper-case letters in italics, vectors by lower-case letters in bold face (angle vectors are denoted by Greek letters), matrices by upper-case letters in bold face (the letter I is reserved for the identity matrix), quaternions by lower-case letters in Sans Serif font and dual quaternions by lower-case letters in Sans Serif font with a hat (^) superscript. Prime superscripts (' and ") and letter superscripts between parentheses are always used to denote that a variable is described either in an alternative reference frame or in a different step of an algorithm (they are never used to denote derivatives). Dot product and cross product operations involving two vectors are denoted by · and × respectively. For both of them, if the two operands are row vectors, the result is assumed to be a row vector. Otherwise, the result is assumed to be a column vector. The capital T superscript is used to denote the transpose of a matrix, the -1 superscript to denote the inverse of a matrix, and the † (dagger) superscript to denote the pseudo-inverse of a matrix. In general, constant and variable names are not changed across different parts of the text and, whenever this happens, explicit notice is provided.

Chapter 2

Analytical Perspective Solutions

The general idea of the analytical solutions is to work with a fixed number of correspondences and then to express image properties such as feature positions, orientations or apparent angles as a function of a predefined set of pose parameters. By matching the resulting expressions against the corresponding actual image measurements, one can derive a set of polynomial equations involving the pose parameters. Finally, if the number of correspondences is big enough, these equations can be combined algebraically, yielding the desired pose. Analytical solutions are traditionally classified according to the nature of the geometrical constraints used. Three major classes have been identified in the literature: solutions that use constraints derived from point correspondences (i.e. direct feature positions), solutions that use constraints derived from line correspondences (i.e. orientation only) and solutions that use relationships between actual and apparent angles. In the following sections, we review some classic techniques in each of these classes.

2.1 Point Correspondences

The first solution to the problem of model-based pose recovery with a full-perspective camera model published in the modern computer vision literature is probably due to Fischler and Bolles [17]. The authors used the term Perspective-n-Point (PnP) to designate the problem of determining the pose of a rigid object with respect to the camera, given a set of n correspondences between points in the image and points in a 3D model of the object of interest. The P3P problem, in particular, is illustrated in Fig. 2.1. Each correspondence of this type provides two constraints: one for the u axis and another for the v axis, in the image plane. So, at least three of them are needed to constrain all the six DOF of the problem and thus the P1P and P2P problems admit an infinity of solutions. The PnP problems with n ≥ 3, on the other hand, are more interesting in the sense that there is enough information to constrain the number of possible solutions down to a finite amount.

In Fig. 2.1, O represents the optical center of the camera and I1, I2 and I3 represent the projections on the image plane of the model points P1, P2 and P3, respectively.

Figure 2.1: Geometry of the Perspective-3-Point (P3P) problem.

The distances between the three model points, denoted by e12, e23 and e31 (with obvious meanings), are known in advance. Furthermore, each image point constrains the position of the corresponding model point to lie on a known line-of-sight with respect to the camera. The unit vectors along these lines-of-sight, whose coordinates in the camera frame can be determined directly from the image coordinates of each point, are denoted by l1, l2 and l3. The angles between each pair of these unit vectors (θ12, θ23 and θ31) can thus be obtained with a single extra dot-product operation. So, the only additional information that is needed to trivialize the problem are the distances between the optical center O and each of the model points P1, P2 and P3. Denote these unknowns by d1, d2 and d3, respectively. The solution proposed by Fischler and Bolles [17] amounts to using the Law of Cosines for each pair of lines-of-sight, in order to get a set of polynomial equations with second-degree terms involving the unknowns. The resulting system is shown below:

\[
\begin{aligned}
d_1^2 + d_2^2 - 2\,d_1 d_2 \cos\theta_{12} &= e_{12}^2, \qquad (2.1)\\
d_2^2 + d_3^2 - 2\,d_2 d_3 \cos\theta_{23} &= e_{23}^2, \qquad (2.2)\\
d_3^2 + d_1^2 - 2\,d_3 d_1 \cos\theta_{31} &= e_{31}^2. \qquad (2.3)
\end{aligned}
\]

The authors point out the fact that n polynomial equations in n unknowns can have no more solutions than the product of their respective degrees. So, the system above can have at most eight solutions. Furthermore, as all non-constant terms of the system are of second degree, for every real positive solution there is a geometrically isomorphic negative solution. Obviously, the negative solutions are of no relevance for the problem at hand and can be discarded. So, the upper bound on the number of possible solutions for any instance of the P3P problem with the points in general position and distinct projections in the image plane is four. Indeed, after some algebraic manipulation, two of the three unknowns in the system above can be eliminated, yielding an eighth-degree polynomial equation with terms of even order only, as shown by Linnainmaa et al [34]. By solving this equation and substituting back for the two other unknowns, one can recover the positions of the three vertices in the three-dimensional camera frame.
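As a minimal numerical illustration (a Python sketch with made-up image points and depths, using a generic root finder rather than the closed-form reductions discussed next), the system of Eqs. (2.1)-(2.3) can be solved directly for d1, d2 and d3:

```python
import numpy as np
from scipy.optimize import fsolve

def line_of_sight(u, v, f=1.0):
    """Unit vector along the line of sight of image point (u, v)."""
    l = np.array([u, v, f], dtype=float)
    return l / np.linalg.norm(l)

# Hypothetical image points; the ground-truth depths are used only to
# synthesize consistent inter-point distances e_ij for the example.
l1, l2, l3 = (line_of_sight(u, v) for u, v in [(0.10, 0.05), (-0.08, 0.12), (0.02, -0.15)])
d_true = np.array([2.0, 2.3, 1.8])
P1, P2, P3 = d_true[0] * l1, d_true[1] * l2, d_true[2] * l3
e12, e23, e31 = np.linalg.norm(P1 - P2), np.linalg.norm(P2 - P3), np.linalg.norm(P3 - P1)
c12, c23, c31 = l1 @ l2, l2 @ l3, l3 @ l1   # cosines of the angles between the rays

def residuals(d):
    """Residuals of Eqs. (2.1)-(2.3) for candidate depths d = (d1, d2, d3)."""
    d1, d2, d3 = d
    return [d1**2 + d2**2 - 2 * d1 * d2 * c12 - e12**2,
            d2**2 + d3**2 - 2 * d2 * d3 * c23 - e23**2,
            d3**2 + d1**2 - 2 * d3 * d1 * c31 - e31**2]

# Different starting points may converge to different ones of the (up to four)
# admissible solutions, so the result still has to be verified against the image.
print(fsolve(residuals, x0=[1.0, 1.0, 1.0]))
```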

Actually, the P3P problem was the subject of several studies published between 1841 and 1949. Both Fischler and Bolles and Linnainmaa et al were unaware of this previous work (in great part published in German) when they proposed their own solutions. But Haralick and Lee [21] presented a nice summary and an empirical comparison of all the major solutions known for the P3P problem by 1991. According to their survey paper, the solution to the P3P problem with the best error propagation properties, in the general case, is the one due to Finsterwalder (1903), which was summarized by Finsterwalder and Scheufele (1937). Geometrically, the idea is the same: to use the Law of Cosines to get the system of Eqs. (2.1) to (2.3). The difference lies in the algebraic manipulation that follows. Rather than trying to derive a single polynomial equation for one of the unknowns, Finsterwalder initially reduces the 3 × 3 system to an equivalent 2 × 2 non-linear system by means of a change of variables. He multiplies one of the resulting equations by an additional parameter λ (artificially introduced in the system) and then expresses one of the transformed variables as a function of the other transformed variable and λ. Then, he tries to determine a value for λ that makes the resulting relation between the two transformed variables linear. This amounts to solving a cubic equation and is a valid step because multiplying an equation of a system by an arbitrary non-null constant does not change the solutions. A substitution of the resulting linear relation into either of the equations of the 2 × 2 transformed system yields a quadratic equation involving only one of the transformed variables, trivializing the problem. So, again there are four possible solutions: after the choice of a value for λ, one is left with two quadratic equations, each of which yields two possibly different and even possibly valid solutions. But notice that Finsterwalder's solution amounts to solving a cubic and two quadratic equations, rather than solving a quartic equation, and thus seems to be naturally more stable than the solutions found in the computer vision literature of the 1980s.

However, it is important to make clear that the comparison performed by Haralick and Lee [21] took into account only the loss of accuracy caused by rounding errors, with single and double precision arithmetic. They did not take into account the effect of quantization noise in the images, nor did they consider cases where the matching between model and image edges was not properly done. Furthermore, they showed that the precision of all the methods involved can vary by up to three orders of magnitude depending on the order in which the correspondences are used. So, they suggested some heuristics to determine the best order, which seem to work very well in most cases.

A natural question to be asked at this point is: is there some instance of the P3P problem that actually has four real positive solutions? Unfortunately, the answer is yes, as shown by Fischler and Bolles [17]. One such instance happens when the three image points are the vertices of an equilateral triangle centered at the optical axis, and the three-dimensional model is also an equilateral triangle. In this case, there are four possible poses for the object. It may be on a plane orthogonal to the optical axis, so that the distances between the optical center and the three vertices are all equal, as shown in Fig. 2.2(a), or any one of the vertices may be at a distance four times smaller than the other two (with respect to the optical center), as shown in Fig. 2.2(b).

Figure 2.2: Multiple solutions for a deceptive instance of the P3P problem.
But in any case, once a certain description of each model vertex in the camera frame is selected, the determination of the translation and the rotation between the camera frame and the object frame is quite straightforward. The trick to recover the translation between the origin of the camera frame (O) and the origin of the model frame (Om), for instance, is to represent the vector that describes the position of one of the model vertices in the model frame (let's say Om P1) as a linear combination of the two "edges" that intersect each other at that vertex (P1 P2 and P1 P3) and their cross product:

\[
\mathbf{O_m P_1} = \alpha\, \mathbf{P_1 P_2} + \beta\, \mathbf{P_1 P_3} + \gamma\, (\mathbf{P_1 P_2} \times \mathbf{P_1 P_3}). \qquad (2.4)
\]

As stated in Linnainmaa et al [34], in this case the coefficients α, β and γ can be readily obtained from the resulting 3 × 3 linear system. Then, as the coordinates of P1, P2 and P3 in the camera frame have just been discovered, the description of Om P1 in the same system can be found trivially. Subtracting the resulting vector from O P1, one can finally recover the translation O Om. The same kind of technique can then be used to recover the rotation. In this case, the trick is to express each of the three axes of the object frame as a linear combination of the same pair of edges and their cross product. This yields the recovery of the coordinates of the model frame axes in the camera frame and thus the recovery of the 3 × 3 rotation matrix between those two coordinate systems.

However, choosing the right solution when more than one is found is still a problem. In order to avoid it, Fischler and Bolles [17] showed that four correspondences involving coplanar model points can be used to determine a unique solution to the perspective pose recovery problem. The technique employed in this case consists of using the four correspondences to compute a 3 × 3 collineation matrix that maps every point in the object plane to a point in the image plane, as well as the inverse transformation.

Mapping the ideal line [0 0 1]T through these two transformations then yields the vanishing line of each plane, projected on the other. The distance between the vanishing line projected on the image plane and the point of intersection between that plane and the optical axis can be used to recover the tilt angle between the two planes. Similarly, the distance between the vanishing line on the object plane and the point where that plane is intersected by the optical axis can be used to recover the pan angle between the two planes. After this, the recovery of the remaining pose parameters is straightforward.

A completely different solution to the P4P problem with coplanar points was proposed by Abidi and Chandra [1]. These authors wanted to perform pose recovery with an imaging system that included zoom control. So, they developed a technique that allows one to recover not only the relative translation and rotation between the camera and the object, but also the unknown camera focal length (f). Furthermore, contrary to the methods previously proposed, their basic technique does not need any trigonometric operation. It works only with additions, subtractions, multiplications, divisions and square root operations. The central idea of their solution is to notice that, if the four model points P1, P2, P3 and P4 are coplanar, then any three of them taken together with the camera's optical center form a tetrahedron whose height (h) may be defined as the distance between the optical center and the plane in which the object is located. One of the four tetrahedrons defined as explained above is illustrated in Fig. 2.3. The area of the base (relative to h) of this tetrahedron can be computed directly from the model. So, the tetrahedron's volume can be obtained in two different ways: either as a product of its base's area by the height h, or as the scalar triple product of the vectors that describe each model point belonging to its base in the camera frame (denoted by p1, p2 and p3), as shown in the following equations:

\[
V = \frac{hA}{3} \qquad \text{or} \qquad V = \frac{1}{6}\, \mathbf{p}_1 \cdot (\mathbf{p}_2 \times \mathbf{p}_3). \qquad (2.5)
\]

The unit vectors parallel to p1, p2 and p3 can all be obtained from the unknown focal length f and the image coordinates of the corresponding model points, because the ray that connects the optical center to an arbitrary model point obviously contains the image of that model point too. So, denoting the coordinates of the image of P1 by u1 and v1, p1 is parallel to the vector [u1 v1 f] in the camera frame. Of course the coordinates of the four image points are known. So, by equating the two alternative formulas for h on any tetrahedron, one can express h as a function of the unknown focal length and three of the four unknown distances between the optical center and the model points (d1, d2, d3 and d4). As we already saw, the recovery of these distances (and the focal length, in this particular case) basically solves the pose recovery problem.

Figure 2.3: Geometry of Abidi and Chandra's solution to the P4P problem with coplanar points and unknown focal length.

By equating expressions for h originated by different tetrahedrons, one can express any three of the four unknown distances as a function of the fourth unknown distance (let's say d1) and the focal length f. So, this reduces the number of unknowns in the problem down to two. Now consider the two model "edges" that share the vertex whose distance to the optical center is one of the unknowns left (P1 P2 and P1 P3). The length of P1 P2, for instance, can be expressed as a function of f, d1 and d2. So, it can also be expressed as a function of f and d1 only, and the same reasoning is valid for the length of P1 P3. Fortunately, for our purposes, when one of the resulting expressions for the edge length is divided by the other, the factors that depend on d1 cancel each other out and we are left with an expression that depends only on f. Of course, the ratio between any two edge lengths can be obtained directly from the model, allowing the recovery of the value of f. Substituting this value in the formulas previously derived, we can then solve the whole problem.

Special care must be taken in the cases where the object plane is almost parallel to the image plane. In those situations, the perspective projection of the object onto the image plane is nearly an orthogonal projection followed by a scaling operation, and thus the camera focal length cannot be recovered independently from the distance between the two planes. Another infelicity is that (similarly to the generic P3P problem) P4P instances and even P5P instances in which all the model points are in general position may admit more than one solution. Fischler and Bolles [17] presented deceptive instances (with two solutions each) for these two classes of problems.

With six or more correspondences involving model points in general position, it is finally possible to guarantee the uniqueness of the solution. In this case, the number of constraints is enough to solve a linear system involving the twelve coefficients of the 3 × 4 matrix that maps the three-dimensional object space to the image plane in homogeneous coordinates. However, notice that the pose recovery problem itself has only six DOF, as usual. So, if the pose is recovered directly in a twelve-DOF matrix form, some additional steps will be needed to ensure the orthonormality constraints, and thus reduce the number of DOF back to six.
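A minimal sketch of this linear, six-or-more-point estimate (in Python; the point arrays are placeholders, and the final orthonormalization shown is one standard way of enforcing the constraints just mentioned):

```python
import numpy as np

def estimate_projection_matrix(model_pts, image_pts):
    """model_pts: (n, 3) 3D points, image_pts: (n, 2) image points, n >= 6.
    Returns the 3x4 matrix mapping homogeneous model points to image points,
    recovered up to scale from a homogeneous linear system."""
    rows = []
    for (X, Y, Z), (u, v) in zip(model_pts, image_pts):
        P = np.array([X, Y, Z, 1.0])
        rows.append(np.concatenate([P, np.zeros(4), -u * P]))
        rows.append(np.concatenate([np.zeros(4), P, -v * P]))
    A = np.vstack(rows)
    _, _, Vt = np.linalg.svd(A)          # smallest right singular vector = solution
    return Vt[-1].reshape(3, 4)

def nearest_rotation(A):
    """Project the 3x3 rotation block onto the orthonormal matrices,
    bringing the twelve-DOF estimate back to six DOF."""
    U, _, Vt = np.linalg.svd(A)
    return U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
```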

2.2 Line Correspondences

Line correspondences are constraints that are weaker than point correspondences. If n point correspondences are known, the lines obtained by grouping them in pairs can be used as input to any method based on line correspondences. But if an arbitrary set of line correspondences is available, then it may be impossible to cast the problem of recovering the pose based on them into an equivalent problem involving only point correspondences, because these lines may not have pairwise intersections in 3D space. On the other hand, lines are in general easier to locate in digitized images than points. The extraction of point features in general can only be performed reliably for distinctive features that can be correlated with predefined templates (but this type of feature is not always present in the intended scene). The identification of lineal features, on the other hand, is based on the extraction of intensity gradients, which can be done in much more generic situations.

Horaud et al [24] introduced a method to solve the pose recovery problem when the image correspondences for three non-coplanar lines with a common intersection point are known. The authors defined their work as a solution for the generic P4P problem, because this problem can be reduced to pose recovery from a pencil of three lines. But actually, the only point whose position is needed in their technique is the origin of the line pencil. Furthermore, their solution uses the same ideas as other solutions based on line correspondences. So, here it will be classified as a solution to a restricted form of the Perspective-3-Line (P3L) problem, rather than a general solution for P4P.

The fundamental idea of this solution, as well as of most solutions based on line correspondences, is to exploit the power of a geometrical entity known as the interpretation plane. If the imaging process is modeled as a perspective transformation, then a model edge can be the right correspondence for a given image edge if and only if it lies on the plane defined by the image edge and the optical center. This plane, represented as a dark surface in Fig. 2.4, is called the interpretation plane. It can also be defined as the geometrical locus of all possible model edge positions that result in a certain image edge, hence the name. The nice property of this geometrical entity is that its description (and thus the description of its normal vector) in the camera frame does not depend on the model at all. So, the problem of recovering the pose can be cast into the equivalent (but easier) problem of determining the transformation needed to make each model edge lie in a direction perpendicular to the normal vector of its interpretation plane.

Figure 2.4: The interpretation plane of a generic image edge.

Another basic idea exploited by Horaud et al [24] is the use of an intermediate coordinate system "between" the camera frame and the object frame. This third reference system is called the image frame. The basic geometry of the problem is exhibited in Fig. 2.5. The camera frame is centered at the optical center O and its three axes x, y and z are defined to be, respectively, parallel to the two image axes and to the camera optical axis. The object frame is centered at the origin of the pencil of lines that constitutes the model (P0). The xm axis is aligned with an arbitrary reference edge, and ym and zm are defined to complete the right-handed system. Finally, the image frame is centered at I0, the projection of P0 in the image plane. The axis ximg is parallel to the line O P0, the axis yimg is normal to the interpretation plane of the reference edge and the axis zimg is defined to complete the right-handed system.
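A minimal sketch of the interpretation-plane construction described above (in Python, assuming a pinhole camera with a known focal length f):

```python
import numpy as np

def interpretation_plane_normal(i1, i2, f=1.0):
    """Unit normal of the plane through the optical center and the image edge
    with endpoints i1 = (u1, v1) and i2 = (u2, v2), in the camera frame."""
    r1 = np.array([i1[0], i1[1], f])   # viewing ray through the first endpoint
    r2 = np.array([i2[0], i2[1], f])   # viewing ray through the second endpoint
    n = np.cross(r1, r2)
    return n / np.linalg.norm(n)

# A candidate pose (R, t) is consistent with a line correspondence only if it
# maps the model edge direction into this plane, i.e. n . (R l0) = 0.
```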
So, the recovery of the object pose can be decomposed into two steps: obtaining the transformation between the camera frame and the image frame, and obtaining the transformation between the image frame and the object frame. The camera-image transformation is straightforward, because it does not depend on the model at all. The description of ximg in the camera frame can be obtained by a simple normalization of the image coordinates of I0. Similarly, yimg can be obtained as a cross product of the normalized descriptions of the two extreme points of the image of the reference edge, and zimg is given by the cross product of ximg and yimg. These three descriptions, in addition to the coordinates of I0, form the desired transformation matrix.

The image-object transformation, on the other hand, is the composition of the translation I0 P0, a rotation about yimg (whose unknown magnitude is denoted by θ) that aligns ximg with xm, and then a rotation about xm (whose unknown magnitude is denoted by φ) that aligns the two other pairs of axes. The algebraic trick used to recover the orientation is to parameterize the descriptions of the two non-reference model edges as a function of θ and φ. Then, the constraints that each of those two edges must be aligned with its interpretation plane can be used to obtain a 2 × 2 non-linear system involving θ and φ. Denoting the unit vectors normal to the interpretation planes by n1, n2 and n3 (with obvious meanings), these constraints can be written as:

\[
\mathbf{P_1 P_2} \cdot \mathbf{n}_1 = 0, \qquad (2.6)
\]
\[
\mathbf{P_1 P_3} \cdot \mathbf{n}_2 = 0. \qquad (2.7)
\]

This system can then be analytically reduced to a single fourth-degree polynomial equation involving the cosine of one of the unknown angles. In the special case of three coplanar edges, this equation can be simplified to a form with terms of even order only. Similarly, if any two edge images are coplanar or the model includes a right angle between two edges, the problem can be reduced to a quadratic equation in that cosine.

Figure 2.5: Geometry of Horaud's solution to the restricted P3L problem.

In any case, after the values of θ and φ have been discovered, the recovery of the translation I0 P0 is straightforward, yielding the image-object transformation. The composition of this transformation with the camera-image transformation previously found can then be carried out to solve the complete pose recovery problem at hand.

A general solution to the P3L problem, which can be used even in cases where the lines with known correspondences don't intersect each other, was presented by Dhome et al [16]. Actually, this solution is very similar to the restricted solution presented above. Again, an intermediate (image) frame is used in addition to the camera frame (defined in the traditional sense) and the model or object frame. Actually, two different object coordinate systems, one generic and one partially aligned with certain object features, are used. The former can be neglected, however, because the model can always be preprocessed so as to align the desired features with its coordinate system. The intermediate frame is defined so that the plane determined by its coordinate axes ximg and yimg is the interpretation plane of an arbitrary image edge with known correspondence, which will again be called the reference edge. The ximg axis is set to be parallel to the image of this reference edge. Notice that the descriptions of these vectors in the primary camera frame depend only on the image of the reference edge. So, the transformation between the primary camera frame and the intermediate frame is once more completely independent of the model.

Also as in the restricted solution, the remaining rotational transformation between the intermediate frame and the partially aligned object frame can be parameterized as a function of two unknown angles. The alignment restrictions for the two non-reference edges are then used to derive a non-linear system. However, due to the absence of a common intersection point between the edges, the analytical resolution of this system results in an eighth-degree polynomial equation. So, in spite of being less general, the method proposed by Horaud et al [24] should be used whenever it is applicable (for instance, in problems of the type P4P), due to its greater simplicity. But Dhome et al [16] also show that the eighth-degree polynomial equation involved in the general solution can be simplified to degree four if either the three edges with known correspondences are coplanar or they have a common intersection point.

After the rotational transformation is finally known, the recovery of the translation is quite straightforward, because the rotation can be applied to any arbitrary point belonging to one of the three model edges with known correspondences. Notice that those points don't even need to be visible in the image. The resulting "synthetic" points, after being subjected to the unknown translation, must lie in the interpretation planes of the corresponding model edges. So, for each correspondence available this yields a linear relation that constrains one degree of freedom of the translation vector. Using three correspondences, one gets a 3 × 3 linear system that can be readily solved for the three components of the translation vector.

An upper bound for the number of possible solutions to the P3L problem (with lines in general position) was determined by Navab and Faugeras [51]. They denote the model-frame description of the unit vector parallel to a generic model edge by l0, the model-frame description of a generic point on the line defined by this edge by p0, and the image-frame description of the vector normal to the corresponding interpretation plane by n. Then, they observe that the determination of the relative rotation between the model and camera frames (described by a 3 × 3 matrix R) amounts to solving a set of equations of the form:

\[
\mathbf{n}^T \mathbf{R}\, \mathbf{l}_0 = 0. \qquad (2.8)
\]

After that, the translation vector t is obtained by solving a linear system whose equations have the form:

\[
\mathbf{n}^T (\mathbf{R}\, \mathbf{p}_0 + \mathbf{t}) = 0. \qquad (2.9)
\]

They start with the problem of orientation determination, which involves only Eq. (2.8). Their analysis initially assumes that there are two cameras, rotated with respect to each other by a generic non-unit 3 × 3 orthonormal matrix, such that Eq. (2.8) is satisfied in both of them for a generic 3D line l0. Then, they verify the implications of this assumption on the set of possible positions for l0. They show that the existence of a second distinct solution for Eq. (2.8) constrains l0 to lie in a three-dimensional structure called a line complex, which is embedded in the four-dimensional space of all possible 3D lines. If the existence of a third distinct solution for Eq. (2.8) is also imposed, l0 is then constrained to a two-dimensional structure called a line congruence. Four distinct solutions result in a one-dimensional ruled surface. Finally, five distinct solutions reduce the geometrical locus of l0 to the empty set. So, there is an infinite number of instances of the orientation-from-3-line-correspondences problem that admit four solutions, but there is no instance of this problem that admits five distinct geometrically sound solutions, provided that the lines involved are in general position.

A similar analysis is also carried out for the problem of full pose recovery, which involves solving a system with several instances of both Eq. (2.9) and Eq. (2.8). In this case, the dimensionality of each possible set of feasible instances of l0 is reduced by one. In other words, the existence of two solutions for the system now constrains the geometrical locus of l0 to a line congruence, three solutions constrain it to a ruled surface and four solutions constrain it to the empty set. So, the maximum number of solutions for an arbitrary instance of the P3L problem with lines in general position is three. Notice that this does not conflict with the existence of four solutions to an instance of the P3P problem, because in the equivalent P3L problem the lines are not in general position.
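A minimal sketch of the translation step based on Eq. (2.9) (in Python, assuming the rotation R has already been recovered):

```python
import numpy as np

def translation_from_lines(R, normals, points):
    """Solve the stack of constraints n_i^T (R p0_i + t) = 0 for t.
    normals: interpretation-plane unit normals n_i (camera frame);
    points:  arbitrary points p0_i on the corresponding model lines (model frame)."""
    N = np.asarray(normals, dtype=float)                       # rows are n_i^T
    b = np.array([-n @ (R @ p0) for n, p0 in zip(normals, points)])
    t, *_ = np.linalg.lstsq(N, b, rcond=None)                  # exact for 3 lines, least squares for more
    return t
```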

2.3 Angle Correspondences

Another type of geometrical constraint that can be used in model-based pose recovery are matches involving angles between model edges in 3D space and the corresponding apparent angles in the image plane. Shakunaga and Kaneko [55] presented a solution to a restricted form of the Perspective-3-Angle (P3A) problem. More specifically, this solution requires at least two of the model angles involved to be right angles. Initially, the authors concentrate on the instances of this restricted problem in which the three angles are part of a unique trihedral vertex (or, in other words, they share a common vertex). The geometrical insight that leads to the solution is to notice that, in this case, one of the model edges can be thought of as the image of the vector normal to a plane defined by the two other edges. The image of the common vertex and the image of this "normal" edge can then be used to build an intermediate coordinate system "between" the camera and the object frame, which is neither viewer-centered nor object-centered and thus makes the problem of recovering relative orientation easier.

The definition of this auxiliary frame, denominated First Perspective Moving Coordinate (FPMC), is illustrated in Fig. 2.6. The origin of the system (I0) is the image of the shared vertex (P0). The FPMC's zimg axis is aligned with the ray that passes through both the camera's optical center (O) and I0. So, the transformation between the camera frame and the FPMC can be seen as the simulation of a camera rotation that "foveates" the common vertex. This is a fundamental concept in most approaches for pose recovery based on angle correspondences. Actually, for non-trivial cases, there is an infinite number of transformations that can be used to accomplish this foveation step, each of which creates a new "virtual image" centered on the projection of the shared vertex. But regardless of which of those transformations is chosen, the description (in the camera frame) of the projection of the model's normal edge on the resulting virtual image is always the same. The ximg axis is then chosen to be parallel to it, and yimg is chosen to complete a right-handed orthonormal system.

Figure 2.6: Definition of the First Perspective Moving Coordinate (FPMC) in Shakunaga and Kaneko's solution to the restricted P3A problem.

The nice property of the FPMC system is that, in it, there is a very simple relation involving the actual 3D angle between the two non-normal model edges (α), the corresponding apparent angle in the virtual image (β) and the inclination of the plane defined by these two edges with respect to the zimg axis (γ). This relation, called the primary Perspective Angle Transform (primary PAT), follows:

\[
\tan\alpha = \tan\beta\, \tan\gamma. \qquad (2.10)
\]

This equation can actually be used in two distinct ways, after the FPMC is constructed. From the point of view of a model-based pose recovery algorithm, it can be used to recover the angle γ, given α (which can be extracted directly from the object model). But the same relation could also be used in a method designed to construct models of unknown objects in structured environments. In this case a known value for γ would be used as input to the equation, which would then yield the value of α. Unfortunately, the FPMC system is not unique, in the sense that, if the sign of its ximg axis is inverted, the same primary PAT is still valid, resulting in two mirror solutions for the pose recovery problem (Necker's ambiguity).

Another problem of the FPMC is that it requires the three edges involved to have a common intersection point. But Shakunaga and Kaneko [55] also showed how to avoid this problem, through the definition of another intermediate coordinate system, denominated the Second Perspective Moving Coordinate (SPMC). The definition of the SPMC is quite similar to that of the FPMC, because the zimg axis is defined so that the intersection (actual or virtual) of a pair of edges is foveated. However, contrary to the definition of the FPMC, in the SPMC the other two axes are not dependent on the image of the model edge that is normal to the surface defined by the other two.
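As a tiny numerical illustration of the pose-recovery use of Eq. (2.10) (in Python, with the angle names used above and arbitrary sample values):

```python
import numpy as np

def inclination_from_pat(alpha, beta):
    """Solve tan(alpha) = tan(beta) * tan(gamma) for the plane inclination gamma."""
    return np.arctan2(np.tan(alpha), np.tan(beta))

alpha = np.deg2rad(55.0)   # actual angle between the two non-normal edges (from the model)
beta = np.deg2rad(40.0)    # apparent angle measured in the virtual image
print(np.rad2deg(inclination_from_pat(alpha, beta)))
```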

The SPMC cannot be used to derive a direct relation between actual and apparent angles, such as the primary PAT. But it can be used to constrain the possible positions of the vanishing point of the normal model edge in the virtual image to a quartic curve known as the general PAT. If the additional constraint that the third angle is also right is imposed, then the general PAT can be reduced to a hyperbola in the virtual image plane. In this case, given the images of three mutually orthogonal model edges that do not intersect each other in 3D space, one can generate the SPMC on the virtual intersection of two of them in the image plane and then constrain the possible locations of the vanishing point of the third line to a hyperbola. By intersecting the resulting curve with the third line itself, one gets a finite set of possible positions for the desired vanishing point, which can then be checked for consistency. Once one of these positions is chosen, the orientation of the corresponding edge in the SPMC (and thus in the camera frame) can be determined.

A solution for the P3A problem with arbitrary angles in a trihedral vertex was presented by Wu et al [69]. Again, the central idea is to create a virtual image in which the shared vertex is foveated. However, in Wu's solution, one particular transformation to obtain the intermediate image frame is arbitrarily chosen. It consists of a rotation about the camera frame's y axis, followed by a rotation about its x axis. If the coordinates of the shared vertex in the original image are u and v, then the transformation that foveates this point (as described above) is given by:

\[
\mathbf{R} = \begin{bmatrix}
\frac{1}{d_1} & 0 & -\frac{u}{d_1} \\[4pt]
-\frac{u v}{d_1 d_2} & \frac{d_1}{d_2} & -\frac{v}{d_1 d_2} \\[4pt]
\frac{u}{d_2} & \frac{v}{d_2} & \frac{1}{d_2}
\end{bmatrix}, \qquad (2.11)
\]

where:

\[
d_1 = \sqrt{u^2 + 1}, \qquad d_2 = \sqrt{u^2 + v^2 + 1}.
\]

Once the virtual image is created, non-linear constraints involving actual and apparent angles can be derived. Let α12, α23 and α31 denote the actual angles between model edges, let β1, β2 and β3 denote the apparent angles in the virtual image plane (with respect to the ximg axis), and let ψ1, ψ2 and ψ3 denote the actual inclinations of the model edges with respect to the zimg axis, as shown in Fig. 2.7 (with obvious meanings). The unit vectors along the directions of the model edges (e1, e2 and e3) can be expressed as a function of those angles. For instance:

\[
\mathbf{e}_1 = [\sin\psi_1 \cos\beta_1 \;\; \sin\psi_1 \sin\beta_1 \;\; \cos\psi_1]^T. \qquad (2.12)
\]

Then, the fact that the dot product of two unit vectors is the cosine of the angle between them can be used to derive the following set of non-linear equations:

\[
\begin{aligned}
\sin\psi_1 \sin\psi_2 \cos(\beta_1 - \beta_2) + \cos\psi_1 \cos\psi_2 &= \cos\alpha_{12}, \qquad (2.13)\\
\sin\psi_1 \sin\psi_3 \cos(\beta_1 - \beta_3) + \cos\psi_1 \cos\psi_3 &= \cos\alpha_{13}, \qquad (2.14)\\
\sin\psi_2 \sin\psi_3 \cos(\beta_2 - \beta_3) + \cos\psi_2 \cos\psi_3 &= \cos\alpha_{23}. \qquad (2.15)
\end{aligned}
\]

Notice that the values of the αs and βs are known. The system can then be analytically solved for the cosine of any of the three unknowns, yielding a fifth-degree equation.

Figure 2.7: Geometry of the P3A problem with a trihedral vertex.

In particular cases, however, the degree of this equation can be reduced. If either the three edges are coplanar or at least two of the actual angles between them are right, the resulting system can be reduced to a fourth-degree equation with even-order terms only. In case there is one right angle in the virtual image, the resulting polynomial equation is cubic and, finally, if the virtual images of two edges are collinear, a generic equation of fourth degree arises. In any case, however, Necker's ambiguity is present, as in the restricted solution for two or more right angles.
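A minimal numerical sketch of Eqs. (2.13)-(2.15) (in Python, with made-up measurements and a generic root finder in place of the analytical reduction; different starting points may converge to different roots of the underlying fifth-degree equation):

```python
import numpy as np
from scipy.optimize import fsolve

def p3a_residuals(psi, beta, cos_alpha):
    """Residuals of Eqs. (2.13)-(2.15) for candidate inclinations psi = (psi1, psi2, psi3)."""
    p1, p2, p3 = psi
    b1, b2, b3 = beta
    return [np.sin(p1) * np.sin(p2) * np.cos(b1 - b2) + np.cos(p1) * np.cos(p2) - cos_alpha[0],
            np.sin(p1) * np.sin(p3) * np.cos(b1 - b3) + np.cos(p1) * np.cos(p3) - cos_alpha[1],
            np.sin(p2) * np.sin(p3) * np.cos(b2 - b3) + np.cos(p2) * np.cos(p3) - cos_alpha[2]]

# Made-up apparent angles and ground-truth inclinations, used only to build a
# consistent set of model angles alpha_ij for the example.
beta = np.deg2rad([10.0, 130.0, 250.0])
psi_true = np.deg2rad([70.0, 55.0, 40.0])
cos_alpha = np.array(p3a_residuals(psi_true, beta, np.zeros(3)))

psi = fsolve(p3a_residuals, x0=np.deg2rad([60.0, 60.0, 60.0]), args=(beta, cos_alpha))
print(np.rad2deg(psi))   # one of the admissible solutions (may differ from psi_true)
```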

2.4 Dealing with Ambiguity

A problem that is common to all the perspective pose recovery techniques discussed so far is that the non-linear nature of the constraints employed usually results in the generation of multiple candidate solutions to the problem at hand. Of course, only one such solution corresponds to the actual pose of the object in the scene, and some sort of verification is needed in order to identify it. The most straightforward type of verification consists of using each candidate pose to reproject the object features back onto the image plane, so as to build a synthetic image of the scene. Then, the discrepancy between the appearance of each feature in the reprojected image and the appearance of the same feature in the actual image can be used to determine whether the pose used in the reprojection is likely to be the desired solution or not. For instance, Dhome et al [16] proposed the use of this type of verification in their solution to the P3L problem. The constraints that they use to recover the relative orientation between the object and the camera frames theoretically guarantee that the reprojected edge images will be parallel to the corresponding actual edge images, but they do not guarantee that these images will be coincident. So, the perpendicular distance between matched pairs of reprojected and actual edge images can be used to determine the feasibility of any recovered pose estimate.
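A schematic version of this reprojection check for point features (in Python, assuming unit focal length; edge-based verification would use point-to-line distances instead):

```python
import numpy as np

def reprojection_error(R, t, model_pts, image_pts, f=1.0):
    """Mean image-plane distance between observed points and model points
    reprojected with the candidate pose (R, t)."""
    errors = []
    for p_model, (u, v) in zip(model_pts, image_pts):
        p_cam = R @ np.asarray(p_model) + t
        u_hat, v_hat = f * p_cam[0] / p_cam[2], f * p_cam[1] / p_cam[2]
        errors.append(np.hypot(u - u_hat, v - v_hat))
    return float(np.mean(errors))

# Among the candidate poses, the one with the smallest error (and with all
# reprojected points in front of the camera, p_cam[2] > 0) is retained.
```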

This type of verification can be strengthened with the use of visibility constraints. A trivial example is the elimination of poses in which the translational component along the camera's optical axis is negative (because in this case the object is "behind" the camera). Furthermore, at least for convex polyhedral objects, it is simple to determine which features are visible in the image, given a certain pose. So, poses that would result in the occlusion of features that were actually detected in the real image can be discarded. For instance, if the position of a model point (P) in the camera frame is given by a vector p and the outward normal to the object's surface at that point is denoted by n (still in camera coordinates), then, assuming that the object is convex and is not occluded by anything else in the scene, P is visible if and only if:

\[
\mathbf{p} \cdot \mathbf{n} < 0. \qquad (2.16)
\]

In some cases, however, simple consistency checks may not be enough to eliminate all the ambiguity created by non-linear constraints. Fortunately, more elaborate schemes for disambiguation can be found in the literature. For instance, if one is using a method that requires n correspondences but the actual number of features matched in the image is m > n, then a Hough transform can be employed to determine the most likely solution. The idea is to group all the available matches between model and image features into (possibly non-disjoint) subsets that contain just the minimum number of constraints required by the pose recovery scheme selected (3 points in the case of a P3P technique, for example). Then the analytical pose recovery algorithm of choice can be applied independently to each individual subset, yielding a finite number of possible pose estimates. Each such pose estimate can be thought of as a point in a multidimensional space composed of the pose parameters. So, the pose estimates generated by different subsets can be grouped in clusters (for instance through a discretization of the pose space) and the cluster with the biggest number of elements can be used to determine the most likely solution.

Linnainmaa et al [34] suggest this type of approach to disambiguate their solution to the generic P3P problem. But they notice that the discretization of a 6D space, such as the one formed by all the free parameters in a generic pose recovery problem, can be a quite expensive step both in terms of time and in terms of space. So, they initially work only with the three-dimensional space composed of the translational parameters. Then, after the best translation is selected, the solutions that contain it are used in another Hough transform that yields the desired rotation parameters. Actually, Linnainmaa's disambiguation scheme also includes some traditional consistency checks (for instance, visibility checks) that are used to eliminate obviously unfeasible candidates before the Hough transform is applied. The resulting system seems to work very well for a few test cases, but a more systematic evaluation of it is necessary.
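A rough sketch of the translation-clustering stage of such a scheme (in Python, with an arbitrary bin size; this is a simplification, not Linnainmaa's implementation):

```python
from collections import Counter
import numpy as np

def cluster_translations(candidate_ts, bin_size=0.05):
    """candidate_ts: candidate translation vectors, one per minimal subset of matches.
    Returns the mean of the candidates falling into the most voted grid cell."""
    def cell(t):
        return tuple(np.floor(np.asarray(t) / bin_size).astype(int))
    votes = Counter(cell(t) for t in candidate_ts)
    best_cell, _ = votes.most_common(1)[0]
    members = [t for t in candidate_ts if cell(t) == best_cell]
    return np.mean(members, axis=0)
```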


Chapter 3

Weaker Camera Models

The analytical solutions for the perspective pose recovery problem presented in the previous chapter suffer from some serious weaknesses. One of them, obviously, is ambiguity. Some alternatives to deal with it have been proposed, but the traditional consistency checks based on the reprojection of the model features, for instance, are in many cases not restrictive enough to yield a unique solution. On the other hand, the techniques based on Hough transforms are in general so computationally intensive that their application under real-time constraints is at least questionable. Furthermore, the analytical methods discussed so far have very poor error propagation properties. For instance, in the comparison between P3P algorithms performed by Haralick and Lee [21], all the techniques tested were found to produce relative errors of at least 0.1% in the actual 3D positions of the object features, just as a result of the propagation of rounding errors with single precision arithmetic. Of course, this problem becomes much more serious if we take into account the effect of quantization noise in the imaging process, for instance. Furthermore, these methods are based only on a very small number of correspondences and thus they can produce completely wrong results if some incorrectly matched feature is eventually used.

The source of most of these problems is the intrinsic non-linearity of the geometrical constraints that arise when the imaging process is modeled as a perspective transformation. Fortunately, under certain special conditions, an imaging transformation that is actually perspective (or even more complex, if we take into account lens distortion, for instance) can be reasonably approximated with much simpler models. So, if the scene to be analyzed allows the use of a model simple enough so that correspondences can be expressed with linear constraints, then all the problems generated by the non-linear nature of analytical perspective solutions can be avoided.

3.1 A Taxonomy of Camera Models

Represent an arbitrary point P in the 3D model space by the homogeneous coordinates [px py pz 1]T and the corresponding image point I in the 2D image plane by [u v 1]T.

A generic camera model can be defined as a 3 × 4 transformation matrix M that specifies how each possible 3D point in the field of view of the camera is mapped to a point in the image plane:

\[
\lambda I = \mathbf{M} P, \qquad (3.1)
\]

where λ is an arbitrary scalar constant. For convenience, this matrix can be decomposed in the following way:

\[
\mathbf{M} = \mathbf{K}\, \mathbf{I}_p\, \mathbf{G} =
\begin{bmatrix}
k_{11} & k_{12} & k_{13} \\
k_{21} & k_{22} & k_{23} \\
0 & 0 & k_{33}
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\begin{bmatrix}
g_{11} & g_{12} & g_{13} & g_{14} \\
g_{21} & g_{22} & g_{23} & g_{24} \\
g_{31} & g_{32} & g_{33} & g_{34} \\
g_{41} & g_{42} & g_{43} & g_{44}
\end{bmatrix}. \qquad (3.2)
\]

In Eq. (3.2), the matrix G maps points from a 3D object-centered coordinate system to a 3D camera-centered coordinate system. So, this matrix encodes the pose of the object with respect to the camera, and its degrees of freedom are also known as the extrinsic parameters of the camera. The matrix K, on the other hand, encodes the intrinsic parameters of the camera, such as the origin of the image plane, the aspect ratio and the focal length.

The most generic model according to this conceptualization is the projective camera model, in which the matrix M can assume any arbitrary configuration. Thus, this model has eleven DOF, because in homogeneous coordinates the overall scaling factor is irrelevant. In many situations, however, some restrictive assumptions can be imposed in order to obtain more tractable models. For instance, in general, it is very reasonable to assume that the intrinsic matrix can be written as:

\[
\mathbf{K}_{calib} = \begin{bmatrix}
f_x & 0 & o_x \\
0 & f_y & o_y \\
0 & 0 & 1
\end{bmatrix}, \qquad (3.3)
\]

where fx and fy encode both the aspect ratio and the focal length of the camera, and ox and oy encode the origin of the image plane. Furthermore, in many applications the values of these four intrinsic parameters are fixed and can be determined through some camera calibration performed prior to the pose recovery itself [63; 60; 46]. Finally, another very realistic restriction is to require the mapping G between the model and the camera frames to be a Euclidean transformation, that is, a composition of rotations and translations.

The result of all these assumptions is known as the perspective camera model. This is the model used in all the pose recovery techniques that we discussed up to this point. As we already saw, it has six DOF. Its extrinsic matrix can be written as:

\[
\mathbf{G}_{persp} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix}, \qquad (3.4)
\]

where R = [rx ry rz]T is a 3 × 3 orthonormal matrix representing the relative rotation between the model and the camera frames, and t = [tx ty tz]T is a 3 × 1 vector representing the relative translation.

Assume, only for clarity purposes, that K, the encoding of the intrinsic parameters, is an identity matrix. Also, let p be the vector that denotes the position of an arbitrary point P with respect to the origin of the camera frame. One can verify that, in the perspective camera model, the coordinates of the image of P are given by:

\[
u_{persp} = \frac{\mathbf{p} \cdot \mathbf{r}_x + t_x}{\mathbf{p} \cdot \mathbf{r}_z + t_z}, \qquad (3.5)
\]
\[
v_{persp} = \frac{\mathbf{p} \cdot \mathbf{r}_y + t_y}{\mathbf{p} \cdot \mathbf{r}_z + t_z}. \qquad (3.6)
\]

So, the problem with this model is that the denominators in Eqs. (3.5) and (3.6) depend not only on the relative z translation between the camera and the object frames (tz), but also on the z offset of each particular model point with respect to the origin of the object frame, as expressed (in image coordinates) by p · rz. In order to avoid this type of non-linearity, one can impose the restriction that the transformation matrix must be of the form:

\[
\mathbf{M}_{aff} = \begin{bmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
0 & 0 & 0 & 1
\end{bmatrix}. \qquad (3.7)
\]

The general model that conforms to this restriction is denominated the affine camera model. If a model of this kind is used, then the image coordinates for the projection of an arbitrary model point (u and v) can be expressed as a linear function of the corresponding model coordinates (px, py and pz). The linearization implied by the affine model is a somewhat strong assumption that is in general valid only if the dimensions of the object are relatively small compared to the distance between the object and the optical center of the camera [50]. But again, this general model is unnecessarily complex for many applications (eight DOF) and some reasonable restrictive assumptions can be used to obtain more tractable specializations. Similarly to what was done for the generic projective model, assume that the intrinsic matrix K is in the form expressed in Eq. (3.3), that the resulting four intrinsic parameters are known a priori, and that the transformation between the object and the camera frames is Euclidean. There are two different models that conform to all of those restrictions, and both of them are linearized approximations of the perspective camera model. The difference is that one of them, known as the weak-perspective camera model, is an approximation of order zero, while the other, known as the paraperspective camera model, is a first-order approximation.

In order to understand more clearly the relationship between the perspective camera model and its two affine approximations, we factor tz, which is constant for all points in the object space, out of the denominators of Eqs. (3.5) and (3.6). Then, upersp and vpersp can be expressed as a function of p by:

\[
u_{persp} = \frac{\frac{\mathbf{p} \cdot \mathbf{r}_x}{t_z} + \frac{t_x}{t_z}}{1 + \varepsilon}, \qquad (3.8)
\]
\[
v_{persp} = \frac{\frac{\mathbf{p} \cdot \mathbf{r}_y}{t_z} + \frac{t_y}{t_z}}{1 + \varepsilon}, \qquad (3.9)
\]

where:

\[
\varepsilon = \frac{\mathbf{p} \cdot \mathbf{r}_z}{t_z}. \qquad (3.10)
\]

The weak-perspective camera model consists of approximating the non-linear factor 1/(1 + ε) by the constant 1, resulting in:

\[
u_{weak} = \frac{\mathbf{p} \cdot \mathbf{r}_x + t_x}{t_z}, \qquad (3.11)
\]
\[
v_{weak} = \frac{\mathbf{p} \cdot \mathbf{r}_y + t_y}{t_z}. \qquad (3.12)
\]

So, it is equivalent to an orthogonal projection of all the model points into a plane parallel to the image plane, followed by an uniform scaling of the resulting image by the factor t1 . This type of transformation is also known as scaled orthography. The paraperspective camera model, on the other hand, uses the rst order term of the expansion of 1+1  around the point  = 0, in addition to the term of order zero. So, it assumes that 1+1   1 ? . The expressions for u and v obtained after this linearization are 2 then approximated again by neglecting the terms that involve pt2 , as shown by Christy and Horaud [11], yielding: p  (rx ? tt rz ) + tx upara = ; (3.13) t z

z

x z

z

vpara =

p  (ry ? tt rz ) + ty y z

tz

:

(3.14)

The geometrical semantics of the paraperspective camera model is illustrated and compared with the perspective and weak{perspective models in Fig. 3.1. As usual, the optical center is denoted by O, the camera frame axes are denoted by x, y and z, the origin of the object frame is denoted by Om , and the perspective projection of Om in the image plane is denoted by Oimg . The paraperspective camera model corresponds to projecting and each model point P along the line parallel to OOm . The resulting \intermediate image", formed in the plane that contains Om and is parallel to the image plane, is then uniformly scaled by the factor t1 . In spite of their linearity, both the weak{perspective and the paraperspective camera models still have six DOF. So, they may be too general for some applications in which the position or orientation of the objects with respect to the camera is restricted by some sort of fundamental constraint. Wiles and Brady [66], for instance, propose some simpler camera models for the important problem of smart vehicle convoying on highways. In their analysis, they assume that a camera rigidly attached to a certain trailing vehicle is used to estimate the structure of a leading vehicle, in such a way that the paths traversed by these two vehicles are composed exclusively by a series of translations and rotations parallel to a unique \ground plane". In spite of the focus of their research being the recovery of structure from motion, many of their observations and suggestions can be generalized to the analogous problem of pose recovery. Clearly, the application{speci c constraints reduce the number of DOF in the relative pose of the leading vehicle to three. Because the camera does not undergo any rotation z

22

P

x

I weak I para I persp Om Oimg

O

z

Optical Axis

Image Plane

Figure 3.1: Geometry of the perspective, paraperspective and weak perspective camera models. about its optical axis, the x axis of the camera frame can be de ned so as to be parallel to the ground plane. Furthermore, the tilt angle between the camera's optical axis and the ground plane ( ) is xed and can be measured in advance. In this situation, the general projective camera can be specialized to a model called projective Ground Plane Motion (GPM) camera, whose projection matrix (Mproj gpm ) has the following form: 2 Mproj gpm = 64

3

m11 a b m13 m14 c m31 1 c m33 c m34 75 ; m31 b m33 m34

(3.15)

where a, b and c are three arbitrary scalar constants. Notice that this is an uncalibrated camera model with six DOF. So, one can again impose the additional restrictions that the intrinsic component of the camera transformation must be of the form indicated in Eq. (3.3), with known values for the remaining parameters, and that the motion must be rigid. Let denote the the angle between the z axis in the object frame and the camera's optical axis, and let t = [tx 0 tz ]T denote the relative translation between the origins of the object and camera frames. The matrix shown in Eq. (3.15) can then be simpli ed to the following form: 2 3 d cos ? a sin a b d sin + a cos d tx + a tz 1 c cos c tz 75 ; (3.16) Mpersp gpm = 64 c (? sin ) ? sin b cos tz where a, b and c and d are known scalar constants that depend on the intrinsic camera parameters and on the xed angle . The resulting model, called the perspective GPM camera, has only three DOF but is still non{linear. 23

Finally, Wiles and Brady [65; 67] also introduced a series of uncalibrated camera models for special cases in which the possible motion of the observed objects is constrained to a plane normal to the optical axis of the camera. They designated these models planar projective, planar ane and translation and scaling. Unfortunately, their analysis of the conditions necessary for the usage of each one of these models is a bit super cial, and their presentation does not cover the details of any speci c technique for either structure or pose recovery. To summarize this section, we show how the camera models presented can be organized hierarchically (Fig. 3.2). In this diagram, an arrow pointing downwards indicates that the model at the top subsumes the model at the bottom, or alternatively, that the set of cameras that conform to the model at the bottom is properly contained in the set of cameras that conform to the model at the top. Linear

Non-linear Projective (11 DOF) Planar Projective (8 DOF) Uncalibrated

Projective GPM (6 DOF)

Affine (8 DOF)

Planar Affine (6 DOF)

Translation and Scaling (3 DOF)

Perspective (6 DOF) Paraperspective Weak perspective (6 DOF)

Calibrated Perspective GPM (3 DOF)

Figure 3.2: A General hierarchy of camera models.

3.2 Weak{Perspective Pose Recovery As shown in in Fig. 3.2, the weak{perspective camera is the simplest model that can be used in applications in which the pose of the observed objects with respect to the camera is unconstrained (6 DOF). It basically approximates a perspective projection by assuming that all the points in a 3D object are roughly at the same distance from the image plane 24

of the camera. So, in situations where the distance between the object and the camera is relatively big if compared to the dimensions of the own object, this model can yield precise pose estimates at a low computational cost. Furthermore, solutions based on it are in general much simpler to analyze and much easier to implement than their perspective counterparts, and its linearity eliminates the necessity of calibration of certain intrinsic camera parameters. Due to these advantages, the weak{perspective and other ane camera models have been widely used for object recognition [28; 26; 33; 62], structure and motion estimation [57; 56; 47; 8; 58], and augmented reality [32]. The problem of pose recovery with this type of model has also been extensively studied. Before we discuss the techniques designed to work with arbitrary targets, let's consider a very simple and ecient trick that involves some pre{engineering of the desired target. Suppose that the four extreme points of a trihedral vertex with three right angles and unit edges belonging to the object can be located in the image, as shown in Fig. 3.3 (with the usual naming conventions). In practice, this can be accomplished by proper positioning, in the object, of devices that emit some electromagnetic signal (possibly outside of the visible spectrum) that is not absorbed by any part of the scene. Image Plane

P3 P2 l=1 l=1 l=1 P0

O

P1

v u

Figure 3.3: Engineered target, with three right angles between edges of unit length. But, anyway, if the correspondences of these four specially arranged points can be established, then their images can be thought of as projections of the unit vectors that de ne the coordinate system associated with the model space. Then, the key idea is to notice that the transformation that projects the points from the 3D model space to the image plane, according to the weak{perspective formulation, is equivalent to a composition of rotations and translations, that preserve the orthonormality of the trihedral vertex; followed by an scaling operation, that changes the overall orthonormality only by a single multiplicative factor; followed by the elimination of the z coordinates. But if any two rows or columns of a 3  3 orthonormal matrix are known, then the third row can be determined with a cross product operation. 25

So, let the two{dimensional coordinates of the four images of the extreme points in the trihedral system be denoted by [u0 v0 ]T , [u1 v1 ]T , [u2 v2 ]T and [u3 v3 ]T , where [u0 v0 ]T is the image of the intersection between the edges. Then, if the object coordinate system is identi ed with the edges of the trihedral angle, two of the three rows of the 3  3 matrix that represents the relative rotation between this system and the camera frame can be recovered up to a scale factor with subtractions only, as shown bellow: "

#

#

"

r0 x = u1 ? u0 u2 ? u0 u3 ? u0 ; v1 ? v 0 v2 ? v 0 v3 ? v 0 r0y

(3.17)

By normalizing the resulting vectors and using cross product operations to enforce orthonormality, one can then recover the values of the rows rx , ry and rz themselves, and thus the whole rotation matrix Rweak :

rx = jrr0xj x 0 rz = rx  jrr0y j y ry = r z  r x 0

(3.18) (3.19) (3.20)

For many interesting applications, however, this type of target engineering approach is impractical. Fortunately, even in the general case of pose recovery with three points in arbitrary spacial positions, the weak{perspective model yields very ecient algorithms. Huttenlocher and Ullman [25] suggested what seems to be the most standard solution to this problem. As we already mentioned, the weak{perspective camera is the composition of an Euclidean transformation and a scaling operation. Huttenlocher and Ullman notice that the resulting image is the same, regardless of the order in which these two components are applied to the object. So, they assume that the scaling, parameterized by an unknown s, is performed rst. After that, the partially transformed model can be aligned with the image through the composition of a translation and a series of rotations about known axes. The z translation between the object and the camera frames is immaterial, because the projection is parallel to the optical axis of the camera. On the other hand, the two{DOF translation parallel to the image plane that properly aligns one of the three model points can be computed directly from the coordinates of the point's image. The same thing is true for the rst rotation, performed about the optical axis, that aligns one of three edges with its image. The two nal rotations needed to complete the overall alignment can then be parameterized by unknown angles  and . Fortunately, after the rst rotation, one is left with exactly three geometrical constraints that can be used to derive a 3  3 system involving s,  and . One of the two non{aligned points already lies on an aligned edge, and thus its complete alignment constrains one DOF only. The other point is totally unaligned, and so, imposing the condition that it must match its image constrains two additional DOF. After some algebraic manipulation, the system generated by these constraints can be solved for s, yielding a fourth{degree polynomial equation with even terms only. Obviously, scaling by a negative factor has no feasible geometrical interpretation. So, one is left with at most two solutions, which is already better than the bound of four possible solutions 26

for the P3P problem, demonstrated in Fig. 2.2. Indeed, Huttenlocher and Ullman showed in a later work that with three point correspondences in general position, it is possible to recover a unique weak{perspective transformation corresponding to the pose, up to a re ection [26]. However, we are not going to discuss this proof in detail, because Alter [2] achieved a stronger result in a simpler and more elegant way. Actually, he derives the same biquadratic equation proposed by Huttenlocher and Ullman from a completely di erent analysis of the problem. In addition, he interprets the two possible solutions for the scaling factor s, in order to demonstrate that only one of them is algebraically and geometrically feasible. So, contrary to the P3P problem, there is actually no ambiguity in pose recovery from three correspondences, if a weak{perspective camera is assumed. The geometry of Alter's solution is illustrated in Fig. 3.4. As usual, the model points are denoted by P0 , P1 and P2 . The corresponding two{dimensional image points are denoted by I0 , I1 and I2 , respectively, and the other variables refer to edge lengths, with obvious meanings. P2

l12 P1

h2 l02

h1

l01 P0 Scale by s P’2

P’1

h’2 d02 I 2 I0

d01

d12

h’1 I1

Figure 3.4: Geometry of Alter's solution to the problem of weak{perspective pose recovery with three point correspondences. Alter initially notices that the pose recovery problem as a whole can be reduced to determining the scaling factor s, and the distances between any two of the three model points and the plane parallel to the image plane that contains the third point (h1 and h2 ). This can be done very easily if one observes that the scaled tetrahedron at the bottom Fig. 3.4 contains two right{angle triangles (I0 I1 P10 and I0 I2 P20 ). A third right{angle triangle can be obtained by considering the di erence in the heights h1 and h2 , with respect to the 27

image plane. These three triangles yield the following set of polynomial equations:

h21 + d201 = (s l01 )2 h22 + d202 = (s l02 )2 (h1 ? h2 )2 + d212 = (s l12 )2

(3.21) (3.22) (3.23)

After some algebraic manipulation one arrives at p a biquadratic equation for s. So, the possible solutions for s2 have the general form bpa  . After some careful analysis, Alter shows that only the solution with the positive  is geometrically feasible. The other solution corresponds to inverting the problem, or in other words, scaling the image and then projecting it orthogonally to the plane de ned by the three model points, as illustrated in Fig. 3.5. After the correct values of s, h1 and h2 have been recovered, the computation of the transformation matrix can be performed by expressing the axes of the object frame as a parametric function of the two model edges and their dot product, as usual (Eq. (2.4)). P2

l12 h2

l02

l01

P1 h1

P0

Figure 3.5: Geometrically infeasible solution to the problem of weak{perspective pose recovery with three point correspondences.

3.3 Using Multiple Camera Models The camera models presented in this chapter entail di erent restrictive assumptions about the nature of the application intended. So, any system based exclusively in one of them is guaranteed to be valid only if those restrictions are met. If for any reason some of the assumed constraints is violated, the resulting systems can fail catastrophically. So, systems that rely on a unique simpli ed camera model tend to work well just for very speci c applications in structured environments. On the other hand, replacing a simple camera model that works in most cases with a more complex model that works always may not help too much. An obvious disadvantage is the increased computational cost of the pose recovery techniques based on more complex camera models. Furthermore, algorithms based on more general models are more prone to su er from numerical conditioning problems. For instance, many of the perspective pose recovery methods discussed in the last chapter are subject to singularities [21] that can be eliminated if the camera is assumed to be ane. In addition, camera models with more degrees of freedom require more correspondences between model and edge features, which increases the complexity of the low{level stages of vision. And nally, the pose 28

parameters of simpler camera models have in general a clearer interpretation than those of more complex models. For instance, the geometrical meaning of the single rotational DOF of the perspective Ground Plane Motion (GPM) camera is obvious. On the other hand, there are several di erent ways of expressing the three rotational DOF of the traditional perspective camera and none of them is obviously better than all the others. Due to all of these reasons, Wiles and Brady [67] suggest that the best way to build a robust, portable, ecient and clear image understanding system is to consider several camera models organized in a hierarchy of increasing complexity. The resulting system would then periodically evaluate the appropriateness of each one of these models, in order to try to choose \the simplest applicable camera model that accurately models the data". They suggest that a good measure of appropriateness for camera models should combine three components: an estimate of the accuracy of the model, an estimate of its clarity and an estimate of its eciency. Unfortunately, they do not address more speci cally the problem of estimating the eciency of a camera model and their measure of clarity is designed speci cally for the problem of computing structure from motion, not for pose recovery. Their measure of accuracy, on the other hand, consists basically of reprojecting the model using the solutions computed by the image understanding system and then counting the number of features whose reprojections miss the correspondences in the actual image by more than a constant factor times an estimate of the standard deviation of the noise involved in the imaging and feature extraction processes. Beveridge and Riseman [6] suggest that using both weak perspective and full perspective is a good way of combining eciency with accuracy. They studied a speci c application, namely indoor robot navigation, in which the determination of the correct correspondences between model features and images is the central problem. So, they could not rely on the methods discussed so far, which assume that the correspondences have been established a priori. Their approach to deal with this complex problem is to perform a probabilistic search in the space of all possible sets of matches between model and image edges. Any set of correspondences generated during this search must then be veri ed for consistency with the image. This amounts to estimating a transformation that conforms to the camera model chosen, and reprojects all the model features back to the image optimally, in a least squares sense. If a weak{perspective model is used, than there is a closed from solution to the least squares tting needed to verify any chosen set of matches. On the other hand, a perspective camera, due to its non{linearity, requires the use of numerical methods to nd the least squares solution to the pose recovery problem. So, the idea of the authors is that one can initially use a weak{perspective approximation to roughly check the consistency of the selected sets of matches. So, the sets that are obviously inconsistent can be discarded immediately. The sets that are promising can then be checked more carefully, with an expensive but accurate numerical method for overdetermined perspective pose recovery. 
The authors performed an experimental evaluation of the resulting hybrid system and found that it is up to an order of magnitude faster than the alternative of using only full perspective, but still, in all the cases studied, \both algorithms found exactly the same optimal correspondence", while the weak perspective alternative failed in several situations. 29

Chapter 4

Numerical Perspective Solutions As discussed in the last chapter, the use of linearized camera models is an e ective way of ameliorating problems that arise with the closed{form perspective solutions (such as ambiguity and ill{conditioned error propagation). Furthermore, linearized models allow one to deal with overconstrained problems, in which the number of restrictions imposed by the correspondences between model and image features is possibly much bigger than the number of free parameters in the scene. So, in applications for which a big number of feature correspondences can be reliably established, solutions based on linearized models tend to be more robust, because the measurement errors and the image quantization noise in individual features are averaged out. Unfortunately, linearized camera models are approximations whose validity is sometimes questionable in practice. So, an ideal pose recovery algorithm should combine the generality of a perspective camera model with the robustness of the schemes based on ane approximations. Indeed, it is possible to satisfy this requirement in practice by casting the problem of pose recovery into an equivalent multivariate numerical optimization problem. There are at least two distinct ways in which this can be done. The most traditional, straightforward, and widely used approach consists of de ning a global measure of the discrepancy between the actual image and the image that would be expected given a perspective camera model and an arbitrary estimate for the unknown pose. Then, by replacing the chosen error measure (which is a non{linear function of the pose parameters) with a local linear or low{degree{polynomial approximation on the point corresponding to the current pose estimate, one can compute a correction that in general yields a better pose estimate. This process can be iterated until (ideally) the error function is locally minimized and the current pose estimate converges to the actual pose with the desired precision. Due to the fact that the computation of the local approximations involves evaluating rst and possibly higher{order derivatives of the error function, we call this general approach derivative{based. Another alternative consists of rewriting the perspective projection Eqs. (3.5) and (3.6) as a function of a set of parameters (one for each feature) that explicitly denote the discrepancy between the image predicted by the perspective model itself and the image predicted by some ane approximation. Then, by arti cially setting all these parameters to zero, one 30

can compute the pose based on the linearized model. The result is used to compute better values for the discrepancy parameters, which can be in turn used to compute a new pose and so on. Again, ideally, after some iterations of this process, the discrepancy parameters will converge to their actual values, and (what is more important) the estimated pose will converge to the actual perspective pose, within a certain tolerance margin. We denominate these as strong{from{weak pose methods. In spite of this second class of techniques having appeared more recently in the literature, we start our discussion with them, because they are more directly related to the material covered in the previous chapter.

4.1 Strong{from{Weak Pose Methods The idea of re ning an ane (weak) model numerically in order to obtain a full{perspective pose estimate was rst proposed by DeMenthon and Davis [15]. Their algorithm starts with a weak perspective camera model that is then re ned in order to obtain a solution to a restricted form of the PnL problem (n  3) in which the model edges have a common intersection point (this is, of course, also a solution to the generic P(n +1)P problem). Let the point of intersection of the n model edges be denoted by P0 . Since we are dealing with an arbitrary number of features, we use the subscript i to denote an arbitrary i{th feature. So, Pi denotes an arbitrary model point and [ui vi ]T denotes the corresponding image. Initially, recall from the previous chapter (Eqs. (3.8) and (3.9)) that the perspective equations can be manipulated so as to isolate the non{linearity of the model in a term of the form 1+1  . Actually,  is clearly dependent on the particular coordinates of each individual model point. So, for notational clarity, we write it with a subscript i too. As we are interested in the images of the edges ei = P0 Pi , rather than the images of individual points, we rewrite Eqs. (3.8) and (3.9) as:

ei  r0x = ui (1+ i ) ? u0 (1  i  n); ei  r0 y = vi (1+ i ) ? v0 (1  i  n);

(4.1) (4.2)

where: i = ei  r0 z ; T [r0 x r0 y r0 z ]T = [rx rty rz ] = R tz : z

(4.3) (4.4)

Notice that this formulation allows one to separate completely the recovery of the rotation matrix R and the depth component of the translation (tz ), from the recovery of the translation components parallel to image plane (tx and ty ), a typical simpli cation used in methods based on line correspondences. But the crucial property of Eqs. (4.1) and (4.2) is that, if the values of all parameters i are assumed to be xed, then these equations yield a linear system involving the six unknown values for the elements of r0 x and r0 y . In particular, if we replace each i with 0 ( 1+1  1), then the resulting equations: i

ei  r0 x = ui ? u0 (1  i  n); ei  r0y = vi ? v0 (1  i  n); 31

(4.5) (4.6)

are exactly the imaging equations of a weak{perspective camera model. If the actual imaging process were (at least reasonably approximated by) a weak{ perspective transformation, and the number of edge correspondences (n) were exactly three, one could just solve the system of Eqs. (4.5) and (4.6), in order to recover r0 x and r0 y , the rst two rows of the rotation matrix scaled by a constant. Since R is an orthonormal matrix, a straightforward normalization of these vectors would then yield rx and ry (and thus tz ). Finally, the cross product of rx and ry would yield rz . Notice that this process is very simlilar to the solution for the problem of weak{perspective pose recovery with engineered targets discussed in the last chapter (Eqs. (3.18) to (3.20)). However, DeMenthon and Davis [15] opted for obtaining ry directly from r0 y , instead of using Eq. (3.20). So, their solution does not enforce the orthonormality of the recovered rotation matrix. Of course, the interesting cases from our point of view here are those in which the weak{ perspective solutions recovered as described above are unacceptable as a nal solutions, but can still serve as a rough initial estimate for the full{perspective pose. For these cases, DeMenthon and Davis [15] suggested the following algorithm: start by assuming that 8i i = 0, and compute the weak{perspective pose, as described above. Then, substitute the values estimated for rz and tz in Eq. (4.4), in order to recover r0 z . Substitute r0 z in Eq. (4.3), in order to recover the values of all the parameters i . Finally, use the new (presumably improved) values of the parameters i to compute a new pose estimate (by solving the system of Eqs. (4.1) and (4.2)). Repeat this cycle until either the current pose estimate converges to the desired solution or a prede ned number of iterations is exceeded (in which case it is assumed that the method diverges). After R and tz have been recovered this way, the computation of tx and ty is trivial. Geometrically, the meaning of the step involving the linear system resolution in the procedure described above is quite clear: it corresponds to computing a pose estimate assuming a certain weak perspective camera model. The computation of the new values for the parameters i has a more subtle interpretation though: it is equivalent to deforming the object, in such a way that every model point is shifted along a plane with constant depth (determined by the current pose estimate), so as to be included in the line{of{sight of the corresponding image point. Then, this deformed object is used to compute another weak{perspective pose and so on, until all the model points converge to a unique depth (and thus the di erence between the weak and the full{perspective poses disappears). Or, in a more intuitive way: each weak{perspective pose computation allows one to estimate what are the individual depths of the model features relative to the camera. Then these depths can be used to correct the model to compensate the fact a linearized camera model is going to be used to recover the pose, and thus more precise depth estimates can be computed. The cycle continues until all depth estimates converge to their actual values. If the pose recovery problem is overdetermined (n > 3), the only di erence is that the system of Eqs. (4.1) and (4.2) will have to be solved in a least{squares sense. Notice that this system can be rewritten as: E r0 x = du ; (4.7) E r 0 y = dv ; (4.8) where: E = [e1 : : : en ]T ; 32

du = [u1 ? u0 : : : un ? u0]T ; dv = [v1 ? v0 : : : vn ? v0]T : As long as the extreme points Pi of at least three model edges are non{coplanar, the rank of the matrix E is three, and the system expressed in Eqs. (4.7) and (4.8) can be solved in a least{squares sense through the computation of the pseudoinverse of matrix E (denoted Ey):

r0 x = Ey du r0y = Ey dv

(4.9) (4.10)

The computation of Ey is a relatively expensive operation. However, the matrix E itself depends only on the model for the target, which is assumed to be known a priori. It does not depend at all on the pose estimates or even on the values of the i parameters. So, Ey can actually be computed o {line for each possible target used in the process of pose recovery, resulting in an very fast inner loop in the numerical routine. Notice that, for improved eciency, this preprocessing step can be carried out even if the problem at hand is not overconstrained. This seems to be the main advantage of the methods based on the re nement of weaker models compared to the more traditional perspective pose recovery techniques based on the computation of derivatives for an error measure, which will be described in the next section. A problem with the method described so far is that, when the target is a planar object (or at least, all the edges used in the pose recovery process are coplanar), the rank of matrix E drops to two and it is no longer possible to compute a unique solution for the system of Eqs. (4.7) and (4.8). This fact can be con rmed with a geometrical analysis of Eqs. (4.1) and (4.2). Let's consider that the tail of the unknown vector r0 x is placed at the reference point P0 . Then, each of the n instances of Eq. (4.1) constrains the head of r0 x to certain plane orthogonal to the edge ei , because it states that the projection of r0 x along ei is equal to a known constant. If at least three edges are in general position, then the restrictive planes de ned by their constraints will have a common intersection point, which will be the location of the head of r0 x in the absence of additional constraints. On the other hand, if all the model edges are coplanar, then their pairwise intersections will be a set of parallel lines and the position of the head of r0 x will be (at least in principle) undetermined. Actually, even if the model edges are only nearly coplanar, the proximity of this singularity can lead to problems of numerical instability, resulting either in the divergence or in a very slow convergence of the pose recovery algorithm. This kind of situation arises very frequently in cartography applications, in which the variance on the height of a set of landmarks is much smaller than the average pairwise distance between these landmarks. Fortunately, Oberkampf, working in cooperation with DeMenthon and Davis [52], proposed an extension to the basic full{from{weak{perspective pose recovery algorithm described so far, which handles the cases where the rank of E is equal to two as well. They notice that, in those cases, the pseudoinverse solution for Eq. (4.7) is such that r0 x is parallel to the plane that contains the target, and its head is located on the point Pmin with minimal distances (in a least{squares sense) from the restrictive planes de ned by the n instances 33

of Eq. (4.1) (again, assuming that the tail of r0 x is located at P0 ). Of course, this solution is not unique, since any point located on the line that contains Pmin and is normal to the object plane will also be at a minimal distance from the restrictive planes, as illustrated in Fig. 4.1. Normal

r’x n

Projections of the Restriction Planes Object Plane

P2 e2

Pmin r’0x e1

P0

P1

Figure 4.1: Geometry of Oberkampf's solution to the perspective pose recovery problem with planar targets. Analogous reasoning is valid for r0 y . So, a generic solution to the undetermined system de ned by Eqs. (4.7) and (4.1) can be parameterized with two unknowns  and , as follows:

r0x = r00x +  n; r0 y = r00y +  n;

(4.11) (4.12)

 = ?r0 x  r0 y ; 2 + 2 = r0 2x ? r0 2y :

(4.13) (4.14)

where r0 0x and r0 0y are the pseudoinverse solutions for Eqs. (4.9) and (4.10), respectively, and n is the unit vector normal to the object plane. Actually, in order to extend this analysis to the cases where the target is only nearly planar, n can be de ned as the unit vector of the null space of the matrix E, which can be computed from an o {line singular value decomposition (SVD) of E. Then, one can use the restrictions that r0 x and r0 y must be perpendicular and must have the same norm, in order to derive the following system, from Eqs. (4.11) and (4.12):

An algebraic resolution of this system yields two alternative pairs of values of  and  (and thus to r0 x and r0 y ), due to Necker's ambiguity. Oberkampf et al suggest that this ambiguity is actually desirable because it is (geometrically) present when a weak{perspective camera is used, and thus it should be taken into account whenever weak perspective is a 34

good approximation for full perspective. However, the exploration of both solutions in an iterative procedure would result in an exponential growth of the execution time with respect to the number of iterations. The authors propose an heuristic that avoids this problem, by discarding the worse pose estimates, whenever the total number of possible solutions is bigger than two. Another problem with the original solution proposed by DeMenthon and Davis [15] is that it tends to have convergence problems if the angle between the optical axis and the line{of{sight of the origin of the edges (P0 ) is relatively large. The problem is that in this situation, the weak{perspective model is in general not a good approximation for full perspective. In order to ameliorate this, Horaud et al [23] suggested an alternative technique that starts with a paraperspective (rather than weak{perspective) camera model that is then numerically re ned until it approximates full perspective with the desired precision. Recall from the previous chapter that using a paraperspective camera model is equivalent to replacing the factor 1+1 , in the full{perspective projection equations, with its rst order 2 approximation 1 ? i , and then neglecting the resulting terms that depend only on pt2 . If we apply this transformation directly to Eqs. (4.1) and (4.2), we obtain the following equations for the images of edges ei (rather than points) under a paraperspective projection: i

z

ei  r0x ? u0 i = ui ? u0 (1  i  n); (4.15) 0 ei  r y ? v0 i = vi ? v0 (1  i  n); (4.16) where i is de ned by Eq. (4.3) and r0 x and r0 y are de ned by Eq. (4.4), as usual.

Comparing the expressions above with the weak{perspective model de ned in Eqs. (4.5) and (4.6), one can notice that the only di erence is that the paraperspective model introduces the extra term ?u0 i on the left{hand{side of the projection equations. If we add this term to both sides of the perspective Eqs. (4.1) and (4.2), the result will still be a perspective camera model model:

ei  r0x ? u0 i = ui (1 + i) ? u0 ? u0 i (1  i  n); ei  r0 y ? v0 i = vi (1 + i) ? v0 ? v0 i (1  i  n):

(4.17) (4.18)

Now, we substitute the expression that dictates the correct value for i (Eq. (4.3)) only on the left{hand{side of Eqs. (4.17) and (4.18), and rearrange the resulting formula as:

ei  r00x = (ui ? u0 ) (1 + i) (1  i  n); ei  r00 y = (vi ? v0 ) (1 + i) (1  i  n); where: r00 x = rx ?t u0 rz ; z r ? v 0 rz : y 00 ry = t z

(4.19) (4.20) (4.21) (4.22)

Notice that if the correct values for i are used on the right{hand{side, Eqs. (4.19) and (4.20) still de ne a full{perspective camera model. But, on the other hand, if we arti cially 35

impose that 8i i = 0 on the right{hand{side, then the system above will be equivalent to Eqs. (4.15) and (4.16), which de ne a paraperspective model. So, we can apply the same basic iterative procedure suggested by DeMenthon and Davis [15], using Eqs. (4.19) and (4.20) instead of Eqs. (4.1) and (4.2). Of course, now the recovery of the original rotation matrix (R) and the depth component of the translation (tz ) will be performed according to by Eqs. (4.21) and (4.22), rather than Eq. (4.4), which adds some extra algebra to the method, but the basic idea is still the same. We refer the reader to the sections 3 to 5 of [23] for a detailed explanation of the computation of R and tz , given r00 x and r00 y . However, we should stress here that, contrary to DeMenthon and Davis [15], Horaud et al [23] do suggest a (very elegant) way of ensuring the orthonormality of the matrix R. Finally, their solution is also easily generalizable to over and under{constrained instances. The work of Horaud et al is certainly very interesting theoretically, as an attempt to clarify the links between paraperspective, weak perspective and full perspective. However, the algorithm that they propose seems to be overkill for dealing targets positioned at lines{ of{sight whose angles with respect to the optical axis are relatively wide. A much simpler solution, in our opinion, would be to generate a virtual scene in which the object is foveated using Eq. (2.11), then compute the pose in this virtual scene using the original algorithm proposed by DeMenthon and Davis [15], and nally, premultiply the 3  4 matrix encoding the virtual pose by the inverse of the foveation matrix obtained from Eq. (2.11), in order to recover the original pose. In addition to being conceptually more simple (and thus easier to implement), we conjecture that this alternative may also be more ecient. The problem with Horaud's solution is that the extra complexity of the rst order terms in the paraperspective model is present in the inner loop of the numerical pose recovery routine. On the other hand, in our alternative approach, all the extra work needed to create the virtual scene and then obtain the original pose from the recovered virtual pose is performed outside of the numerical inner loop of DeMenthon and Davis's algorithm, So, our claim is that, if the number of iterations of this inner loop is suciently long, our alternative scheme will execute faster than Horaud's full{from{para perspective algorithm. Of course, a careful timing and analysis of the number of iterations of those two techniques is necessary in order to decide under which circumstances each one of them is better than the other. So, we state these ideas as an open conjecture of this paper:

Conjecture 4.1 (open) The use of virtual foveation plus a full{from{weak{perspective

pose recovery algorithm yields solutions with the same precision than the direct application of a full{from{para perspective technique, with a smaller computational cost.

4.2 Derivative{Based Methods It is dicult to determine where did the idea of using derivative{based numerical optimization techniques for model{based pose recovery rst appears in the literature. But the most in uential and widely used technique exploiting this possibility seems to be the classic pose recovery algorithm proposed by David Lowe [38; 37; 36]. 36

4.2.1 The Classic Approach: Lowe's Algorithm Conceptually, Lowe's method is quite simple, yet extremely elegant and general. Contrary to the techniques discussed in the previous section, it assumes from the beginning that the imaging transformation is a perspective projection. The key idea to overcome the non{ linerity of this camera model is that, given a certain guess (initial solution) for the unknown scene parameters (s) | including the six pose parameters, possibly some intrinsic camera parameters such as the focal length f , and even, if desired, some internal degrees of freedom in the scene | one can reproject the known 3D scene model into the image plane, using the perspective Eqs. (3.5) and (3.6), and then compute a vector of errors (distances) d between the positions, orientations and apparent angles of the reprojected features, and the corresponding actual measurements. Rather than solving directly for the vector of parameters s in a non{linear system, Lowe's algorithm applies Newton's method to compute a series of correction vectors s0 that are successively subtracted from the current estimate for s, until either this estimate converges to a point that minimizes the error vector d locally, or the maximum number of iterations allowed is exceeded. If s(i) is the pose estimate on iteration i, then s(i+1) is obtained by: s(i+1) = s(i) ? s0 (i) : (4.23) So, given a vector of error measurements d, we want to solve for an s0 which eliminates this error. If the error function were linear in the elements of s, then a solution to this problem could be computed in a straightforward way, and that is basically what Newton's method does. It assumes that the error function can be approximated locally by a linear form and then uses this linear form to compute s0 . Based on this assumption of local linearity, the e ect of each parameter correction si on an error measurement is si multiplied by the partial derivative of the error with respect to that parameter. Therefore, we can determine s0 by solving the following equation:

J s0 = d;

@di : where: Jij = @s j

(4.24)

When the system represented in Eq. (4.24) is overdetermined, a pseudoinverse solution that minimizes the error function in a least{squares sense can be computed [38], in a similar way to the techniques presented in the last section. However, in the case of gradient{based methods, the matrix J is not constant along di erent iterations of the numerical optimization procedure | it clearly depends on the current value of s. So, in this case, it is not possible to compute the pseudoinverse matrix o {line, which is a major performance drawback with respect to the strong{from{weak pose techniques. But on the other hand, the strong{ from{weak techniques can not be generalized to deal with intrinsic camera parameters or internal degrees of freedom in the target. So, there is a tradeo between eciency and generality involved in the choice between these two paradigms. Lowe's original algorithm [38; 37; 36], in particular, has an additional weakness: the approach that it suggests for computation of the Jacobian matrix J. 37

Taking into account the fact that now the focal length f is possibly non{constant (and thus we can not simply normalize all the distances with respect to it, as we did before), we rewrite the perspective projection Eqs. (3.5) and (3.6) for a generic model point p as: [x y z ]T = R (p ? t); (4.25) fy ]; [u v] = [ fx (4.26) z z where R is the unknown rotation matrix and t is the unknown translation vector, as usual. Lowe suggested that, in order to achieve greater eciency, one should reparameterize the translational components of the pose, so as to express them directly in image plane coordinates, rather than in a three{dimensional Euclidean space. More speci cally, in his simpli ed projective model, the image coordinates of an arbitrary feature (u and v) are expressed as a function of the corresponding model{space coordinates (p) by the following equations: [x0 y0 z 0 ]T = R p; 0 0 0 fy + t0 ): + t [u v] = ( z 0fx + t0z x z 0 + t0z y

(4.27) (4.28)

The variables R and f remain the same as in Eqs. (4.25) and (4.26), but vector t has been replaced by the parameters t0x , t0y and t0z . The two transforms are equivalent when:

t = R?1

"

#

0 (z 0+t0 ) t0y (z 0+t0z ) 0 T t z x ? f ? f ? tz :

(4.29)

According to Lowe, \in the new parameterization, t0x and t0y simply specify the location of the object on the image plane and t0z speci es the distance of the object from the camera". To compute the partial derivatives of the error with respect to the rotation angles (x , y and z are the rotation angles about the xed x, y and z axes, respectively, in the camera frame) it is necessary to calculate the partial derivatives of x, y and z with respect to these angles. Table 4.1 gives these derivatives for all combinations of variables.

x

x 0 y z 0 z ?y0

y ?z0 0

x0

z y0 ?x0 0

Table 4.1: Partial derivatives of x, y and z with respect to counterclockwise rotations  (in radians) about the coordinate axes of the camera reference system. Newton's method is carried out by calculating the optimum correction rotations x , y and z to be made about the camera{centered axes. Given Lowe's parameterization, the partial derivatives of u and v with respect to each of the seven parameters of the imaging model (including the focal length f ) are given by Table 4.2. 38

t0x t0y t0

u

v

?fc2x0 ?fc2x0y0

?fc2y0

1 0

z x y z

0 1

fc (z 0 + c x0 2) ?fc y0 f c x0

?fc (z0 + c y02 ) fc2 x0 y0 fc x0 c y0

Table 4.2: Partial derivatives of u and v with respect to each of the camera viewpoint parameters and the focal length, according to Lowe's original approximation; here c = z0 +1 t0 . z

Lowe then notes that each iteration of the multi{dimensional Newton's method solves for a vector of corrections i

h

s0 = t0x t0y t0z x y z T :

(4.30)

Lowe's algorithm dictates that for each point in the model matched against some corresponding point in the image, we rst project the model point into the image using the current parameter estimates and then measure the error in the resulting position with respect to the given image point. The u and v components of the error can be used independently to create separate linearized constraints. Making use of the u component of the error, du , we create an equation that expresses this error as the sum of the products of its partial derivatives times the unknown error{correcting values:

@u @u @u 0 @u 0 @u 0 @u @t0x tx + @t0y ty + @t0z tz + @x x + @y y + @z z = du :

(4.31)

The same point yields a similar equation for its v component. Thus each point correspondence yields two equations. As Lowe says: \from three point correspondences we can derive six equations and produce a complete linear system which can be solved for all six camera{model corrections". The problem with Lowe's formulation is that it assumes that t0x and t0y are constants to be determined by the iterative procedure, when in fact they are not constants at all | they depend on the location of the points being imaged. Denote the rows of R by rx , ry and rz , as usual. Then, using the projective transformation formulated in Eqs. (4.25) and (4.26) the rede ned translational parameters t0x , t0y , t0z are given by: t0z = ?t  rz ; and then: (4.32)

t0x = ?f p tr r+x t0 ; z z 0t = ?f t  ry : y p  r + t0 z

39

z

(4.33) (4.34)

Notice that t0z is dependent only on the object pose parameters, but t0x and t0y are also a function of each point's coordinates in the object coordinate frame. It is therefore in general impossible to nd a single consistent value either for t0x or for t0y . In the general case both these parameters will depend on the position of each individual object feature. They are not constants | they are only the same for those points for which p  rz has the same value. Therefore we can not use t0x and t0y as de ned in Eqs. (4.32) to (4.34). The assumption that is implicit in Lowe's algorithm as published is that the corrections needed for the translation are much larger than those due to rotation of the object. However, if no restrictions are imposed, the coordinates of the points in the object coordinate frame (p) can have a large variance. Even if they do not, the term p  rz may change signi cantly (due to the object's own geometry) and a ect the estimation process. Ishii et al [27] observed the problems described above and proposed an alternative formulation for the problem. However their algorithm also contains some questionable simplifying assumptions. Image formation is again speci ed by Eqs. (4.25) and (4.26). However, the alternative translation vector [t0x t0y t0z ]T is now rede ned to be represented in the camera frame, rather than the image frame: [t0x t0y t0z ]T = R t:

(4.35)

In this approximation the computation of the partial derivatives is performed using the coordinates of the points in the object coordinate frame, ignoring the e ect of rotation, which is their central simpli cation. The partial derivatives of u and v with respect to each of the seven parameters of the camera model are given by Table 4.3.

t0x t0y t0z

u ?fc

v

0

0 ?fc fa c2 fb c2 x ?fa c2py ?fc (pz + b c py ) y fc (pz + a c px ) fb c2 px z ?fc py fc px f ac bc Table 4.3: Partial derivatives of u and v with respect to each of the camera viewpoint parameters and the focal length according to Ishii's approximation; here [a b c] = [px ? t0x py ? t0y p ?1t0 ]. z

z

Hoping to improve the accuracy of those two approximations, we proposed in an earlier work [4; 5], another reformulation for Lowe's original algorithm that uses a full{perspective imaging model. Initially, we de ned both the image formation process and the alternative translation vector [x0 y0 z 0 ]T exactly as in Lowe's formulation (Eqs. (4.25) to (4.27)). Then, we eliminated the approximations of Lowe and Ishii by de ning: [t00x t00y t00z ] = [t  rx t  ry t  rz ] 40

(4.36)

In this case the image coordinates of each point are given by: # " 0 00 y0 + t00y x + t x (4.37) [u v] = f z 0 + t00 f z 0 + t00 : z z The partial derivatives of u and v with respect to each of the six pose parameters and the focal length are given by Table 4.4.

t00x t00y t00

u fc

v

0

fc 2 ?fa c ?fb c2 z 2 0 x ?fa c y ?fc (z0 + b c y0 ) y fc (z 0 + a c x0 ) fb c2 x0 0 z ?fc y fc x0 f ac bc Table 4.4: Partial derivatives of u and v with respect to each of the camera viewpoint parameters and the focal length according to our fully projective solution; here [a b c] = [x0 + t00x y0 + t00y z0+1t00 ]. 0

z

As in Lowe's formulation, the translation vector is computed using Eq. (4.29), with t00x , t00y and t00z as de ned in Eq. (4.36). This translation vector is de ned in the object coordinate frame. The minimization process yields estimates of t00x, t00y and t00z , which are the result of the product of the rotation matrix by the translation vector. A numerically equivalent but conceptually more elegant way of looking at this solution is through a rede nition of the image formation process, so that rotation and translation are explicitly decoupled, and the translation vector is de ned in the camera coordinate frame. So we rede ne: [x y z ]T = R p + t; (4.38) (4.39) then: [t00x t00y t00z ]T = t; and Eqs. (4.36) and (4.37) are substituted by:  0 0 + ty  y x + t x (4.40) [u v] = f z 0 + t f z 0 + t : z z In this case, the least{squares minimization procedure gives the estimates of the translation vector directly. Our claim, established through a careful experimental analysis [4; 5] that will be summarized in Chapter 5, is that removing the mathematical approximations present on Lowe's and Ishii's formulations yields better solutions in practice, without any signi cant increase in the computational cost of the numerical optimization procedure. So, we state this as another conjecture:

Conjecture 4.2 (validated empirically) The use of a full{perspective imaging model in

Lowe's algorithm yields a much more accurate numerical technique, with superexponential convergence for a wide range of initial conditions, and even arguably better computation{ time properties.

41

4.2.2 Using Other Orientation Representation Schemes As we brie y mentioned in the previous subsection, Lowe's method assumes that the relative orientation between the camera and the target is represented by three angles measured about coordinate axes xed with respect to the camera. This representation system is known as roll, pitch and yaw (RPY) angles. Possibly, the major reason for using it is the fact that it is nonredundant, that is, it uses only three parameters to represent three DOF. Of course, this is a very desirable property if one is using a numerical technique that searches the space of possible poses for an acceptable solution. The smaller is the number of parameters used to encode this space, the smaller tends to be the computational cost of exploring it. Unfortunately, the RPY representation has some serious shortcomings. To start with, it contains a singularity: when the pitch angle is equal to 2 radians, roll and yaw rotations combined span only a one{dimensional space and can not be separated from each other [14]. RPY angles are also ambiguous: there is more than one way to represent certain orientations. Finally, they complicate the composition of two rotations and the physical meaning of the orientations represented with them is not intuitive. The same observations are valid for Euler angles, another nonredundant scheme that represents rotations as the sequential composition of three rotations about the axes of a frame that moves in space with each rotation. Gennery [18] noticed those shortcomings and suggested that an ideal system for representing orientations should have the following properties: 1. Nonredundancy: the number of free parameters should be exactly equal to the number of degrees of freedom in the space of possible orientations; 2. Continuity: a continuous motion of the object whose orientation is represented should always result in continuous changes in all parameters; 3. Absence of singularities: the partial derivatives of all parameters with respect to any di erential rotation, at any orientation, should be always nite. In 2D space a single angle meets these three criteria (but it is multivalued). However there is no such a set of parameters for the group SO(3), the special orthogonal group of order 3, that corresponds to all possible orientations of a 3D object [18]. Euler's theorem (as presented and proved by Kanatani [29]), states that \by a 3D rotation, all points move around a common axis, maintaining a xed distance from it." So, another possibility for representing orientations of 3D objects with just three parameters is through a \3{vector" !, whose orientation speci es the rotation axis (with respect to a reference frame), and whose norm speci es the angle about this axis. A problem with this representation is that, if the angle is a multiple of 2 radians, then the direction of ! is unde ned. The only way to avoid this is to restrict the angle to the interval [? ], but in this case a discontinuity is introduced. Furthermore, in the general case ! is not a physical vector, and for this reason, the analytical computation of derivatives with respect to its components is quite complex. However, for di erential rotations, ! is a physical vector and computing derivatives with respect to it is extremely simple. 42

Gennery [18] suggests that, if one starts with a relatively close description of the current pose, then the correction needed to obtain a true solution can be represented as a differential rotation vector with three free parameters. He models the imaging process as a fully projective transformation. In a slightly simplified version (for clarity purposes), his model is equivalent to Eqs. (3.5) and (3.6):

p'_i = R p_i,    (4.41)

p''_i = p'_i + t,    (4.42)

[u_i  v_i] = [ (p''_i · x)/(p''_i · z)    (p''_i · y)/(p''_i · z) ],    (4.43)

where p_i is the model-frame description of an arbitrary feature, R and t are the rotation matrix and the translation vector, respectively (as usual), and x, y and z are the axes of the camera reference system. Then, he uses the crucial observation that if the target is rotated by an infinitesimal rotation ω, a feature whose 3D position (in the object frame) is described by p_i will be shifted by a (differential) vector ω × p'_i. The effect of this on the u component can then be calculated using the chain rule:

δu_i = (ω × p'_i) · ∂u_i/∂p''_i = (p'_i × ∂u_i/∂p''_i) · ω.

So, the partial derivatives of u_i with respect to ω are given by:

∂u_i/∂ω = p'_i × ∂u_i/∂p''_i.    (4.44)

The partial derivatives of u_i with respect to p''_i can be derived through a differentiation of Eq. (4.43). In fact, since the Jacobian of p''_i with respect to the translation t is a unit matrix, this also yields the partial derivatives of u_i with respect to the translation vector t, completing the extraction of all first-order derivatives needed for u_i:

∂u_i/∂t = ∂u_i/∂p''_i.    (4.45)

The determination of the partial derivatives of v_i can then be carried out in an analogous way. So, at this point one may think that it is possible to proceed in the same way that Lowe did. However, the approximations used in order to determine the derivatives for ω are valid only for very small rotations. The solution proposed by Gennery in order to overcome this problem is to insert his derivative-based pose recovery algorithm inside the real-time loop of a tracker that uses temporal filtering (a Kalman-filter-like algorithm) to predict the pose of the target. So, if the motion of the target is smooth enough, the error in the predictions of the tracker will be so small that a differential treatment will yield a precise correction for the pose parameters.
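A minimal sketch of this derivative computation (ours; the function and variable names are hypothetical) is given below. It evaluates ∂u_i/∂ω and ∂u_i/∂t for a single feature, taking the camera axes x, y, z of Eq. (4.43) to be the canonical basis, so that u_i = p''_x / p''_z.

```python
import numpy as np

def u_derivatives(p_model, R, t):
    """Partial derivatives of u = x''/z'' with respect to a differential rotation
    vector omega and to the translation t, following Eqs. (4.41)-(4.45)."""
    p_prime = R @ p_model              # p'  = R p      (Eq. 4.41)
    p_dprime = p_prime + t             # p'' = p' + t   (Eq. 4.42)
    x, _, z = p_dprime
    du_dppp = np.array([1.0 / z, 0.0, -x / z**2])   # du/dp''  (differentiating Eq. 4.43)
    du_domega = np.cross(p_prime, du_dppp)          # du/domega = p' x du/dp''  (Eq. 4.44)
    du_dt = du_dppp                                 # du/dt = du/dp''           (Eq. 4.45)
    return du_domega, du_dt

# tiny numerical check of du/dt against a finite difference (arbitrary pose)
R, t, p = np.eye(3), np.array([0.1, -0.2, 4.0]), np.array([0.5, 0.3, 1.0])
eps = 1e-6
u = lambda tt: (R @ p + tt)[0] / (R @ p + tt)[2]
print(u_derivatives(p, R, t)[1][0], (u(t + [eps, 0, 0]) - u(t)) / eps)   # nearly equal
```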

Another problem that Gennery had to face is the fact that the differential rotation can not be used to encode the pose of the target itself, since there is no bound on the total rotation angle with respect to the camera reference frame. So, it is necessary to use a separate representation for the pose. The basic requirement on this actual orientation representation is that both the conversion from the differential representation to it and the composition of two rotations represented with it must be performed efficiently. Gennery's choice in this case was to use the Euler-Rodrigues parameterization. In order to explain more clearly what exactly this representation is, it is interesting to introduce initially another parameterization for the group SO(3), known as the conical representation. The conical representation consists of a (unit) 3-vector a specifying the unique rotation axis (with respect to a reference frame), and a scalar θ specifying the angle of the rotation about a. Notice that this representation is related to the differential rotation vector ω by:

ω = θ a.    (4.46)

The Euler-Rodrigues representation identifies the four parameters in the conical representation with the elements of a quaternion q, an extension proposed by Hamilton to the concept of complex number that has one real and three imaginary components [3]. More specifically, q is related to the conical parameters by:

q = [ cos(θ/2)   sin(θ/2) a ].    (4.47)

For a more detailed and formal discussion of the properties of these schemes and other parameterizations for SO(3), we refer the reader to [9; 29; 3]. In order to convert the corrections from the differential representation to the Euler-Rodrigues form, Gennery uses the small-angle approximations sin(θ) ≈ θ and cos(θ) ≈ 1. By substituting these approximations and Eq. (4.46) into Eq. (4.47), one gets the following conversion formula:

q = [ 1   (1/2) ω ].    (4.48)

The composition of this correction with the old orientation is then carried out with a single quaternion multiplication. However, the quaternion generated by the conversion indicated in Eq. (4.48) in general does not have unit norm. So, an additional normalization is needed in order to complete the recovery of the new pose estimate.

Actually, Gennery's technique is not a numerical algorithm for pose recovery in a strict sense, since the calculation of corrections to the pose estimates is not directly iterated. It is performed just once for each scene, and the (assumed) temporal coherence of a sequence of scenes is used to keep the errors smaller than a certain desired limit. So, one of the most serious weaknesses of his approach is the fact that it is not as general as the other methods discussed here: it can not be used in applications in which all that is available is just a single (or a small number of) static image(s). Furthermore, the strong coupling between pose recovery and temporal smoothing makes his technique quite difficult to understand and implement. In our opinion it is much more reasonable to implement the numerical pose algorithm and the temporal filter as two completely separate modular components of


a tracker. For instance, ideally one should be able to replace the Kalman filter with a lattice filter, if necessary for a certain application, without having to change the code for the pose recovery component.

An actual numerical pose recovery algorithm that also uses the Euler-Rodrigues parameterization to represent orientations was proposed by Phong, Horaud et al [53]. Actually, they propose two different pose recovery schemes: one based on line correspondences and another based on point correspondences. One of their main contributions with respect to the work of Lowe [38; 37; 36] is that they suggest the use of a second-order trust-region optimization technique in the process of pose recovery, which makes their algorithms less dependent on the quality of the initial solutions. However, here we will focus only on the geometrical aspects of their work. For a detailed description of their trust-region technique and an intuitive comparison between it and Levenberg-Marquardt's method, we refer the reader to [53].

Their algorithm for line correspondences is based on the concept of interpretation plane, introduced in Section 2.2 and illustrated in Fig. 2.4. So, recall the definition of this concept: the interpretation plane for a certain image edge e_i is the plane that contains e_i and the optical center of the camera (O). So, by definition, the description of any model edge in the camera reference system must be contained in the interpretation plane of its corresponding image edge. As in the analytical techniques described in Section 2.2, this constraint is partially enforced by requiring e_i to be orthogonal to the unit vector n_i, normal to the interpretation plane of its image. Notice, however, that this merely guarantees that the edge i is parallel to (but not that it is actually contained in) the interpretation plane. Phong, Horaud et al [53] strictly enforce the inclusion constraint with the additional requirement that the camera-frame description p_i of an arbitrary point contained in e_i must also be orthogonal to n_i. This ensures the inclusion because the origin of the camera frame is assumed to be the optical center O, which is contained in any interpretation plane. So, the basic constraints of Phong-Horaud's line-based algorithm can be written directly in vector and matrix form, as:

n · (R e) = 0,    (4.49)

n · (R p + t) = 0.    (4.50)

Eq. (4.49) is equivalent either to Eq. (2.6) or to Eq. (2.7). Furthermore, Eq. (4.49) does not depend on t at all. So, this formulation allows one to decouple the recovery of the rotation from the recovery of the translation, as do the analytical techniques based on line correspondences. Theoretically, one could solve for the unknown pose encoded by R and t by minimizing the global error (distortion) functions d_r^(line) and d_t^(line), respectively, in the following sequence:

d_r^(line)(R) = Σ_{i=1}^{n} (n_i · (R e_i))²,    (4.51)

d_t^(line)(t) = Σ_{i=1}^{n} (n_i · (R p_i + t))²,    (4.52)

where n is the number of line correspondences.
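For concreteness, the sketch below (ours; the argument names are hypothetical) evaluates these two naive error functions, given a candidate pose and lists of interpretation-plane normals, model edge directions, and model points lying on the edges.

```python
import numpy as np

def line_errors(R, t, normals, edge_dirs, edge_points):
    """Naive line-correspondence error functions of Eqs. (4.51) and (4.52)."""
    d_r = sum(float(np.dot(n, R @ e)) ** 2 for n, e in zip(normals, edge_dirs))
    d_t = sum(float(np.dot(n, R @ p + t)) ** 2 for n, p in zip(normals, edge_points))
    return d_r, d_t
```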

The problem with this naive approach is that the number of elements of R is nine, and thus the resulting numerical optimization procedure would be hopelessly slow and probably inaccurate if compared, for instance, to Lowe's method, which works in a 6D space to recover both rotation and translation. For this reason, Phong, Horaud et al represent the rigid transformation resulting from the composition of R and t as a screw. The screw is a concept that originated in the work of the French mathematician Chasles, who showed that "a convenient parameterization of the displacement group (i.e. rigid transformations) is a rotation about a unique axis not passing through the origin and a translation along the same axis" [53]. In order to specify a screw unambiguously, it is necessary to define: the screw axis, the angle of the rotation about it (θ), and the magnitude of the translation along it (t). Phong, Horaud et al specify the screw axis using two vectors: a unit vector r representing its orientation, and a vector l locating the point of the axis that is closest to the origin (such that l · r = 0). The interesting fact about this definition scheme is that it can be encoded as a dual quaternion q̂ = r + ε s, as suggested by Walker et al [64], where r and s are quaternions and ε² = 0. The relationship between the quaternions r and s and the original screw parameters is dictated by:

r = [ cos(θ/2)   sin(θ/2) r ],    (4.53)

s = [ −(t/2) sin(θ/2)   (t/2) cos(θ/2) r + sin(θ/2) (l × r) ].    (4.54)

Notice that Eq. (4.53) is a mere rewriting of Eq. (4.47). Furthermore, by definition r has unit norm and is orthogonal to s. At this point, the advantages of this representation over the direct use of the screw parameters r, l, θ and t are probably not clear. One such advantage is that the conversions between the screw representation and the original pose representation (R and t) can be performed in a much simpler way if one uses the dual quaternion encoding. Given an arbitrary quaternion r = [r_0 r_x r_y r_z]^T, we define the two matrices Q(r) and W(r):

Q(r) = [ r_0  −r_x  −r_y  −r_z
         r_x   r_0  −r_z   r_y
         r_y   r_z   r_0  −r_x
         r_z  −r_y   r_x   r_0 ],    (4.55)

W(r) = [ r_0  −r_x  −r_y  −r_z
         r_x   r_0   r_z  −r_y
         r_y  −r_z   r_0   r_x
         r_z   r_y  −r_x   r_0 ].    (4.56)

Using these definitions, R and t can be expressed as a function of r and s by:

[ 1  0^T
  0  R   ] = W(r)^T Q(r),    (4.57)

[ 0
  t ] = 2 W(r)^T s.    (4.58)
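The sketch below (ours, in Python/NumPy; the test values are arbitrary) builds Q(r) and W(r) as defined in Eqs. (4.55)-(4.56) and recovers R and t from a dual quaternion through Eqs. (4.57)-(4.58).

```python
import numpy as np

def Q(r):
    """Left quaternion-product matrix of r = [r0, rx, ry, rz] (Eq. 4.55)."""
    r0, rx, ry, rz = r
    return np.array([[r0, -rx, -ry, -rz],
                     [rx,  r0, -rz,  ry],
                     [ry,  rz,  r0, -rx],
                     [rz, -ry,  rx,  r0]])

def W(r):
    """Right quaternion-product matrix of r (Eq. 4.56)."""
    r0, rx, ry, rz = r
    return np.array([[r0, -rx, -ry, -rz],
                     [rx,  r0,  rz, -ry],
                     [ry, -rz,  r0,  rx],
                     [rz,  ry, -rx,  r0]])

def pose_from_dual_quaternion(r, s):
    """Recover R and t from the dual quaternion (r, s) via Eqs. (4.57)-(4.58)."""
    M = W(r).T @ Q(r)            # block structure [[1, 0], [0, R]] for unit r
    R = M[1:, 1:]
    t = (2.0 * W(r).T @ s)[1:]   # first component of the quaternion [0, t] is zero
    return R, t

# quick check: a 90-degree rotation about z and an arbitrary translation
theta = np.pi / 2
r = np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
t_true = np.array([0.5, -1.0, 2.0])
s = 0.5 * W(r) @ np.concatenate(([0.0], t_true))   # s chosen so that 2 W(r)^T s = [0, t]
R, t = pose_from_dual_quaternion(r, s)
print(np.round(R, 3), np.round(t, 3))
```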

If we treat arbitrary 3-vectors v as quaternions whose first component is null (v = [0 v^T]^T), then, using the equations above, the conditions for alignment of the target edges with their corresponding interpretation planes (originally expressed in Eqs. (4.49) and (4.50)) can be restated in quaternion form:

n · (R e) = n^T (W(r)^T Q(r)) e = 0,    (4.59)

n · (R p + t) = n^T (W(r)^T Q(r) p + 2 W(r)^T s) = 0.    (4.60)

Notice that in the equations above, the pose quaternions r and s are "trapped" inside the matrices Q and W. Since the numerical algorithm is going to work directly with these quaternions, it is a good idea to "isolate" them, so that the computation of the derivatives of the rotation and translation error functions is simplified. Fortunately, it is easy to check from the definitions of Q and W that, for two arbitrary quaternions a and b:

Q(a) b = W(b) a.    (4.61)

This "commutative" property, in addition to the associativity of matrix multiplication, allows one to separate the parameters r and s in Eqs. (4.59) and (4.60). Finally, before rewriting the equations for the error functions in quaternion form, it is necessary to notice that now there are eight parameters in the pose encoding, but of course the problem itself has only six DOF, as usual. Phong, Horaud et al [53] suggest that this problem can be fixed by introducing extra terms in the error functions, so as to penalize solutions where the quaternion r either does not have unit norm (r^T r ≠ 1) or is not orthogonal to the quaternion s (r^T s ≠ 0). Taking this into account, the error functions previously defined in Eqs. (4.51) and (4.52) are now given by:

d_r^(line)(r) = Σ_{i=1}^{n} (r^T A_i r)² + λ (r^T r − 1)²,    (4.62)

d_t^(line)(s) = Σ_{i=1}^{n} (r^T B_i r + r^T C_i s)² + λ (r^T s)²,    (4.63)

where:

A_i = Q(n_i)^T W(e_i),
B_i = Q(n_i)^T W(p_i),
C_i = 2 Q(n_i)^T.

The parameter λ in the equations above is used to regulate the relative strength between the alignment constraints, derived from the known geometry of the target, and the consistency constraints, which eliminate the two redundant DOF in the dual quaternion representation. According to Phong, Horaud et al [53], λ "must be taken very large in order to guarantee that the penalization constraints are satisfied."

The motivation for separating the recovery of rotation from the recovery of translation, as done in Eqs. (4.62) and (4.63), is that the computational cost of the numerical optimization procedure necessary to minimize the error functions does not scale very well with

respect to the number of parameters used to encode the space of all possible poses. So, reducing an optimization problem with eight DOF to two separate problems with four DOF each normally results in a much smaller overall computational cost. But, on the other hand, the fact that r is treated as a constant in Eq. (4.63) implies that the values recovered for the rotation parameters are only optimized to guarantee that each model edge is parallel to the interpretation plane of the corresponding image edge, but not to guarantee that the model edges are actually included in the appropriate planes. This means that the solutions recovered with this scheme, where rotation and translation are decoupled, tend to be (slightly) less accurate than those obtained when all the available constraints are used simultaneously in a six-DOF numerical optimization procedure. So, if one wants maximum accuracy at the expense of increasing the computational cost of the pose recovery process, a simple solution is to work with a coupled error (discrepancy) function d_c^(line)(r, s) = d_r^(line)(r) + d_t^(line)(s).

Unfortunately, in some cases, even this more expensive alternative is still relatively inaccurate. The problem is that, depending on the geometry of the object and on the actual pose, constraints based on line correspondences alone do not provide enough information to uniquely define a solution, regardless of how many features are used. More specifically, consider the case where the object is planar and the plane that contains it also includes the optical center of the camera, as shown in Fig. 4.2 (with the usual naming conventions). In this case, the interpretation plane for any edge in the object is the same: the object plane itself. So, any pose that aligns the object with this unique interpretation plane (but does not necessarily align all the points in the object in their proper positions) will make all the dot products involving the edge descriptions and the unit vector normal to the interpretation plane (n) equal to zero, minimizing the error functions d_r^(line) and d_t^(line). As a result of this, any pose recovery algorithm based uniquely on line correspondences will be unable to recover the angle of the object about the axis n.

Phong, Horaud et al [53] also proposed an algorithm that exploits the additional geometrical constraints generated by point correspondences. In this case, the basic restriction for correct alignment is that the coordinates of the image of any given feature (u_i and v_i, 1 ≤ i ≤ n) must agree with the imaging equations of the selected camera model (Eqs. (3.5) and (3.6)). The clever algebraic trick needed to express this in the dual quaternion setting is to rewrite these two equations in a form that involves only vector and matrix additions and multiplications:

[1 0 −u_i] (R p) + [1 0 −u_i] t = 0,    (4.64)

[0 1 −v_i] (R p) + [0 1 −v_i] t = 0.    (4.65)

Then, the translation scheme formalized in Eqs. (4.57) and (4.58) is used to express these constraints as a function of the two components of a dual quaternion (r and s), as before. Finally, in order to complete the conversion to pure quaternion form, the remaining 3-vectors [1 0 −u_i] and [0 1 −v_i] are written as purely imaginary quaternions. Some additional algebraic manipulation allows the pose recovery problem at hand to be cast as the optimization of a function d^(point)(r, s), as before. Again, a penalization parameter λ is needed to constrain the two redundant degrees of freedom.
(We refer the reader to [53] for a careful derivation.)
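Before the conversion to quaternion form, the two constraints per point are just linear expressions in R and t. The fragment below (ours, with hypothetical names) evaluates their residuals for one correspondence; both vanish exactly when u_i = x''/z'' and v_i = y''/z''.

```python
import numpy as np

def point_residuals(R, t, p_model, u, v):
    """Residuals of Eqs. (4.64) and (4.65) for a single point correspondence."""
    p_cam = R @ p_model + t
    return np.array([1.0, 0.0, -u]) @ p_cam, np.array([0.0, 1.0, -v]) @ p_cam
```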


Figure 4.2: Occurrence of a singularity in Phong-Horaud's pose recovery algorithm with rotation decoupled from translation.

The important point here, which can be seen directly from Eqs. (4.64) and (4.65), is that in this solution the recovery of the rotation parameters is strictly coupled to the recovery of their translation counterparts, and it is no longer possible to break the resulting 8D optimization problem into two much easier 4D ones for greater efficiency. In principle, this limitation seems to be inherent to the use of point correspondences, because alignment constraints for individual points do not make sense unless the translation is taken into account. However, we claim that it is possible to make use of the strong geometrical constraints provided by point correspondences and still keep the decoupling between rotation and translation. So, we state this as another conjecture:

Conjecture 4.3 (validated analytically) It is possible to recover the rotational components of the pose of a rigid object in a way that is completely independent of the associated translational components and still uses constraints provided by point correspondences in order to avoid singularities that arise when only line correspondences are used.

Our idea for an algorithm that satisfies these requirements is to represent the (real) point features implicitly, as the intersection of two (virtual) model edges. Then, the interpretation planes of these edges can be used to define a basic set of alignment constraints (as before), and the fact that these edges have pairwise intersections in 3D space can be used to derive some additional constraints. Notice that we can not use the imaging equations directly for the intersection points, because their actual locations depend on the unknown translation. In order to simplify our analysis, we initially focus on the non-overconstrained P3P problem. The resulting solution can then be easily extended to overconstrained cases. So, assume that the correct correspondences for three model points in general position (P_0, P_1 and P_2) are known. In this case, each of these points is constrained to lie in a line-of-sight

defined by the optical center and the corresponding image point. Let l_0, l_1 and l_2 be the unit vectors along the lines-of-sight defined by the images of P_0, P_1 and P_2, respectively, as shown in Fig. 4.3. Then, the description of point P_i (0 ≤ i ≤ 2) in the camera frame must be a vector parallel to l_i. We express these constraints in parametric form as: p_i = λ_i l_i. Notice that the unit vectors l_i can be obtained with a simple normalization of the coordinates of the corresponding image points.


Figure 4.3: Geometry of our improvement to Phong-Horaud's pose recovery algorithm with rotation decoupled from translation.

Each pair of points [P_i P_j], i ≠ j, defines a virtual edge that we use to derive alignment constraints. As we already saw, one such constraint is the inclusion in the interpretation plane of its (virtual) image. However, up to this point we still have not used the fact that the length of any model edge is known a priori. The reason for this is very simple: if we consider only one individual correspondence, then, unless we know the translation parameters, the ratio between the length of the model edge and the length of its image provides very little

useful information about its true orientation, because different combinations of values for the translation and the angle about the normal to the interpretation plane can generate the same size ratio. For instance, if an edge were closer than it actually is but were viewed more obliquely, its apparent size could still be the same. However, if we consider two edges that have a common intersection point, then the relationship between the ratios of their actual and apparent sizes gives us some valuable information about the orientation of the target, even if the translation is completely unknown.

Consider the edges e_01 = P_0 P_1 and e_02 = P_0 P_2 in Fig. 4.3, for instance. Since the extreme point P_0, which is common to both of them, is located at a distance λ_0 from the origin, one might be tempted to say that the size ratios for both edges must be equal if the object is properly aligned. However, this is not the case, because the distance λ_1 is not necessarily equal to λ_2. Indeed, it is possible to derive a more subtle alignment condition using the fact that each size ratio "induces" a value for the distance λ_0. So, we can use the difference between the values induced by edges e_01 and e_02 as a measure of alignment error.

In order to come up with a mathematical formalization of this idea, we start with a set of simple alignment conditions: we require the difference between the parametric descriptions (in terms of the line-of-sight unit vectors) of the extreme points of each edge to be equal to the description of the respective edge itself, in the camera reference system. Notice that the description of an edge is a vector and thus is invariant with respect to translations of the object. So, mapping the edge description from the model to the camera reference system involves only a premultiplication by the rotation matrix R, and we can write the alignment conditions for the pair e_01 and e_02 as:

R e_01 = λ_1 l_1 − λ_0 l_0,    (4.66)

R e_02 = λ_2 l_2 − λ_0 l_0.    (4.67)

Focus on Eq. (4.66). We want to solve it for λ_0, given an estimate of R, and thus we need to eliminate the extra parameter λ_1. In order to do this, we define a new intermediate coordinate system "between" the object and the camera frames and express the three-dimensional alignment restriction at hand in terms of the coordinate axes of this new system, which we denominate the Interpretation Reference Frame (IRF). One of the axes of the IRF for edge e_01 is defined to be the unit vector l_0, another axis (n_01) is defined to be normal to the interpretation plane of the corresponding edge in the image, and the remaining axis (m_01) is defined as the cross product of the previous two, as shown in Fig. 4.3. We express Eq. (4.66) in terms of these axes as follows:

l_0 · (R e_01 + λ_0 l_0 − λ_1 l_1) = 0,
m_01 · (R e_01 + λ_0 l_0 − λ_1 l_1) = 0,
n_01 · (R e_01 + λ_0 l_0 − λ_1 l_1) = 0.

Then, we can use the fact that the three axes of the IRF form an orthonormal basis in order to simplify the system above to the following form:

l_0 · (R e_01) = −λ_0 + λ_1 (l_0 · l_1),
m_01 · (R e_01) = λ_1 (m_01 · l_1),
n_01 · (R e_01) = 0.

Notice that the last equation in the system above is equivalent to Eq. (4.59). So, the alignment constraints of our decoupled solution actually subsume those of Phong-Horaud's algorithm. The other two equations can be solved for λ_0 in the following way:

λ_1 = a_01^T (R e_01),
λ_0 = b_01^T (R e_01),    (4.68)

where:

a_01 = m_01 / (m_01 · l_1),
b_01 = (l_0 · l_1) a_01 − l_0 = cot θ_01 m_01 − l_0,    (4.69)

and θ_01 is the angle between l_0 and l_1 (as shown in Fig. 4.3), in radians. Eq. (4.67) can be solved for λ_0 in an analogous way, yielding:

λ_0 = b_02^T (R e_02),    (4.70)

b_02 = cot θ_02 m_02 − l_0.    (4.71)

Finally, we use the assumption that edges e_01 and e_02 intersect each other at point P_0. That is, the alignment expressed by R is correct if and only if the values for λ_0 generated by Eqs. (4.68) and (4.70) are the same. So, we equate these two expressions in order to obtain:

b_01^T (R e_01) − b_02^T (R e_02) = 0.    (4.72)

Using the dual quaternion representation for the pose parameters, Eq. (4.72) can be rewritten as:

b_01^T (W(r)^T Q(r)) e_01 − b_02^T (W(r)^T Q(r)) e_02 = 0.

Then, using the "commutative" property stated in Eq. (4.61), as well as the associativity and distributivity of matrix multiplication, the pose parameters can be "isolated" in the equation above, yielding:

r^T ( Q(b_01)^T W(e_01) − Q(b_02)^T W(e_02) ) r = 0.    (4.73)

This equation and Eq. (4.49) are the constraints that we use to guarantee the proper alignment of the three distinct model points with the lines-of-sight of their respective image points.

Now, we return our attention to the case of (possibly) overconstrained problems. Notice that, according to the analysis performed so far, each (virtual) edge intersection yields two constraints in the form of Eq. (4.59) and one additional constraint in the form of Eq. (4.73). So, if the problem at hand consists of recovering the pose based on a set of n point correspondences, one possibility is to organize these points in a ring. Then, for each point i such that 1 ≤ i ≤ n, we use the "edges" to the predecessor of i, p(i), and to the successor of i, s(i), in the ring, in order to derive the alignment constraints mentioned above. So,

the orientation recovery problem can be solved through the minimization of the following function:

d_r^(ring)(r) = Σ_{i=1}^{n} (r^T A_(i) r)² + β Σ_{i=1}^{n} (r^T B_(i) r)² + λ (r^T r − 1)²,    (4.74)

where:

A_(i) = Q(n_(i)(p(i)))^T W(e_(i)(p(i))),    (4.75)

B_(i) = Q(b_(i)(p(i)))^T W(e_(i)(p(i))) − Q(b_(i)(s(i)))^T W(e_(i)(s(i))),    (4.76)

where β and λ are empirically defined weighting factors that control the relative strength of the different types of constraints, all the quaternions with indices (i)(p(i)) are obtained from the virtual edge connecting the i-th point to its predecessor in the ring, and all the quaternions with indices (i)(s(i)) are obtained from the virtual edge connecting the i-th point to its successor in the ring. After that, the translation can be recovered with a (possibly overconstrained) linear system resolution.

Notice, however, that the approach described above increases the amount of work per iteration with respect to the original decoupled algorithm based on point correspondences. Another alternative, which does not increase the computational cost per iteration, is to group the object points with known correspondences into triangles and then apply our P3P solution individually to each triangle. Assume, without loss of generality, that triangle i consists of points 3i−2, 3i−1 and 3i. Then, the function that must be minimized to obtain the unknown rotation is:

d_r^(tri)(r) = Σ_{i=1}^{n/3} (r^T A_(3i−1) r)² + β Σ_{i=1}^{n/3} (r^T B_(3i−1) r)² + Σ_{i=1}^{n/3} (r^T C_(3i−1) r)² + λ (r^T r − 1)²,    (4.77)

where:

A_(i) = Q(n_(i)(i−1))^T W(e_(i)(i−1)),    (4.78)

B_(i) = Q(b_(i)(i−1))^T W(e_(i)(i−1)) − Q(b_(i)(i+1))^T W(e_(i)(i+1)),    (4.79)

C_(i) = Q(n_(i)(i+1))^T W(e_(i)(i+1)).    (4.80)
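To make the triangle-based construction concrete, the sketch below (ours; all coordinates are made up, and the helper functions mirror Eqs. (4.55), (4.56), (4.69) and (4.71)) assembles the three matrices for a single synthetic triangle, using P_0 as the shared vertex, and verifies that the corresponding quadratic forms all vanish at the true rotation.

```python
import numpy as np

def Q(q):   # left quaternion-product matrix, Eq. (4.55)
    a, b, c, d = q
    return np.array([[a, -b, -c, -d], [b, a, -d, c], [c, d, a, -b], [d, -c, b, a]])

def W(q):   # right quaternion-product matrix, Eq. (4.56)
    a, b, c, d = q
    return np.array([[a, -b, -c, -d], [b, a, d, -c], [c, -d, a, b], [d, c, -b, a]])

def pure(v):        # embed a 3-vector as a quaternion with null first component
    return np.concatenate(([0.0], v))

def unit(v):
    return v / np.linalg.norm(v)

def b_vec(l0, lk):  # Eqs. (4.69)/(4.71)
    n = unit(np.cross(l0, lk))
    m = np.cross(n, l0)
    return np.dot(l0, lk) * (m / np.dot(m, lk)) - l0

# synthetic P3P instance (hypothetical numbers)
r_true = unit(np.array([0.9, 0.1, -0.3, 0.2]))       # unit quaternion encoding the rotation
R_true = (W(r_true).T @ Q(r_true))[1:, 1:]           # Eq. (4.57)
t_true = np.array([0.3, -0.2, 5.0])
model = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.2, 0.1]), np.array([0.1, 1.1, -0.2])]
cam = [R_true @ p + t_true for p in model]           # camera-frame points
l = [unit(p) for p in cam]                           # lines of sight (from the image)

e01, e02 = model[1] - model[0], model[2] - model[0]  # model-frame virtual edges
n01, n02 = unit(np.cross(l[0], l[1])), unit(np.cross(l[0], l[2]))

A = Q(pure(n01)).T @ W(pure(e01))                                   # Eq. (4.78)-style matrix
B = Q(pure(b_vec(l[0], l[1]))).T @ W(pure(e01)) \
    - Q(pure(b_vec(l[0], l[2]))).T @ W(pure(e02))                   # Eq. (4.79)-style matrix
C = Q(pure(n02)).T @ W(pure(e02))                                   # Eq. (4.80)-style matrix

for M in (A, B, C):      # all three quadratic forms vanish at the true rotation
    print(round(float(r_true @ M @ r_true), 10))
```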

This new formulation explicitly guarantees only the alignment of two edges of each triangle with their respective interpretation planes and the agreement on the induced value for the distance λ_0. However, these conditions necessarily entail the alignment of the third edge with respect to its own interpretation plane, and thus the complete alignment of the triangle:

Theorem 4.1 The alignment of two edges of a triangle with the interpretation planes of their correspondences, together with the agreement on the value induced for the distance between their intersection and the optical center, necessarily implies the alignment of the third edge with the interpretation plane of its image.

In order to prove this claim, again consider an individual triangle P_0 P_1 P_2, as illustrated in Fig. 4.3. Assume, without loss of generality, that the two aligned edges are e'_01 = R e_01

and e'_02 = R e_02. Keeping the notation used so far, these alignment constraints can be expressed as:

e'_01 = λ_1^(01) l_1 − λ_0^(01) l_0,
e'_02 = λ_2^(02) l_2 − λ_0^(02) l_0,

where λ_k^(ij) is the distance (with respect to the optical center) that the actual-to-apparent size ratio of edge ij induces for the vertex k (k may be either i or j). Notice that, in principle, we are not assuming anything about the agreement on λ_0. So, we use two different variables for it, one in each equation. Using the equations above, we express the fact that the third edge, e'_12 = R e_12, is equal to the difference between the other two (because they form a triangle):

e'_12 = e'_02 − e'_01 = (λ_0^(01) − λ_0^(02)) l_0 − λ_1^(01) l_1 + λ_2^(02) l_2.    (4.81)

Now we can finally throw in the assumption that there is agreement on the value of λ_0. The result of this, in Eq. (4.81), is that the term involving l_0 vanishes:

λ_0^(01) = λ_0^(02)  ⟹  e'_12 = −λ_1^(01) l_1 + λ_2^(02) l_2.    (4.82)

So, e'_12 can be written as a linear combination of l_1 and l_2, which is equivalent to saying that the third edge is also aligned with the interpretation plane of its image, as desired. □

Furthermore, it is also possible to show that the converse of Theorem 4.1 is not true; in other words, our additional alignment constraints effectively add strength to the formulation of Phong, Horaud et al:

Theorem 4.2 If the images of the three vertices of a triangle are all distinct, then the alignment of its three edges with the interpretation planes of their correspondences does not necessarily entail the agreement on any of the values induced for the distances between the model vertices and the camera's optical center.

We start our proof from Eq. (4.81), since all that was assumed up to that point was the alignment of two edges with their respective interpretation planes. Since by hypothesis the lines-of-sight of the three vertices are all distinct, any one of these lines-of-sight can be written as a linear combination of the other two and their cross product:

l_2 = c_0 l_0 + c_1 l_1 + c_2 (l_0 × l_1).

Then, we substitute this in Eq. (4.81), in order to obtain:

e'_12 = e'_02 − e'_01 = (λ_0^(01) − λ_0^(02) + c_0 λ_2^(02)) l_0 + (−λ_1^(01) + c_1 λ_2^(02)) l_1 + c_2 λ_2^(02) (l_0 × l_1).

Finally, we use the assumption that the third edge is also aligned with its interpretation plane to write it as a linear combination of the unit vectors along the lines-of-sight of its extreme points. Expressing l_2 as a linear combination of l_0, l_1 and their cross product (again), we get the following equation:

e'_12 = λ_2^(12) l_2 − λ_1^(12) l_1 = c_0 λ_2^(12) l_0 + (−λ_1^(12) + c_1 λ_2^(12)) l_1 + c_2 λ_2^(12) (l_0 × l_1).

Equating the two expressions for e'_12 derived above, we get the following system:

λ_0^(01) − λ_0^(02) + c_0 λ_2^(02) = c_0 λ_2^(12),
−λ_1^(01) + c_1 λ_2^(02) = −λ_1^(12) + c_1 λ_2^(12),
c_2 λ_2^(02) = c_2 λ_2^(12).

There are two main possibilities here. The first is c_2 ≠ 0, or in other words, the plane defined by the triangle and the interpretation planes for the three edges are not coincident. In this case, the last equation in the system above implies that λ_2^(02) = λ_2^(12), which, combined with the first equation, implies that λ_0^(01) = λ_0^(02), and, combined with the second equation, implies that λ_1^(01) = λ_1^(12). So, in this case our additional constraints do not add any power to the original formulation. However, there is also the case where c_2 = 0. Notice that in this case both c_0 and c_1 must be non-zero, or otherwise the hypothesis that the images of the three edges are distinct would be violated. So, any solution such that:

λ_2^(12) − λ_2^(02) = (1/c_1) (λ_1^(12) − λ_1^(01)) = (1/c_0) (λ_0^(01) − λ_0^(02)),    (4.83)

satisfies all the constraints imposed. Obviously, in this case the differences λ_2^(12) − λ_2^(02), λ_1^(12) − λ_1^(01) and λ_0^(01) − λ_0^(02) do not need to be zero. They can vary arbitrarily, as long as the proportionality 1 : c_1 : c_0 is kept. Thus, our additional constraints do add power to the original formulation. □

Notice that we not only showed that our technique is more powerful than Phong-Horaud's decoupled solution, but we also characterized exactly the problem instances in which this superiority arises. Furthermore, notice that such situations (namely, planar scenes) are exactly those in which no technique based on line correspondences can guarantee complete alignment. This, along with the observation that our technique can include arbitrary line-based constraints via the use of Eq. (4.49), shows that it is actually strictly more powerful than any method based uniquely on line correspondences. In the next chapter, we provide strong empirical evidence that this theoretical superiority makes a difference in practice.

As a final observation in this chapter, recall that Gennery's technique [18] makes use of a minimal representation system but, due to its differential rotation assumption, it does not seem to be powerful enough to deal with a large space of initial solutions. On the other hand, the use of a second-order optimization technique makes Phong-Horaud's technique [53] quite robust with respect to the initial solutions, but the use of a non-minimal scheme for rotation representation seems to cause some convergence problems, as Horaud reported in a later work [23]: "whenever the number of correspondences is between 3 and 10, then the trust-region minimization method either requires a large number of iteration or doesn't converge towards the correct solution." Furthermore, both techniques use the Euler-Rodrigues

parameterization to represent the accumulated pose resulting from the composition of the initial solution with all the corrections computed iteratively. So, we speculate that it is possible to come up with a hybrid scheme that makes use of Phong-Horaud's scheme until the uncertainty on the values of the pose parameters is relatively small, and then switches into a mode that uses Gennery's differential representation in order to speed up the convergence towards the desired solution:

Conjecture 4.4 (open) The combined use of Gennery's and Phong-Horaud's techniques (with our improvements incorporated into the latter) yields a pose recovery algorithm with relatively fast convergence and relatively low sensitivity to the initial solutions used.


Chapter 5

Experimental Evaluation

As we showed in the three previous chapters, there are many available alternatives for solving the pose recovery problem. We already discussed some of the advantages and disadvantages of several of these alternatives, but a more careful evaluation of the accuracy and the computational cost of each one of them is needed in order to determine which one is the best for specific practical applications.

5.1 Analytical Approaches for Accuracy Evaluation

Grimson et al [20] suggest an analytical approach to determine conservative bounds on the uncertainty of the values recovered for the pose parameters, given the individual uncertainties in all the image measurements used in the recovery process. More specifically, they use an affine camera model and assume that the pose is recovered from a set of point positions in the image plane. Initially, they express each of these positions as a function of a vector of pose parameters (s). They consider that each individual pose parameter s_i (1 ≤ i ≤ 6) can vary by up to a certain unknown amount. Then, given a set of predefined bounds for the errors in the image measurements, and a set of specific ground-truth values of the pose parameters s_i, the resulting system of equations can be solved for each of these unknown amounts, because of the linearity of the imaging model adopted. So, the method outlined above allows one to estimate the accuracy of the pose recovery process for each individual actual pose of the scene with respect to the camera. The main drawbacks of this approach are that it can not be extended to perspective camera models (because of the non-linearity of the corresponding imaging equations), and that it tends to greatly overestimate the uncertainties of the pose parameters.

Madsen [41] suggested a more elegant and generic approach which also works for other types of image measurements and camera models. In his analysis, the set of all the image measurements is expressed as a vector m, and it is assumed that the covariance matrix Λ_m of this vector is known a priori. The goal is to determine the covariance matrix Λ_s of the vector of pose parameters (s). Madsen proposes a local linearization of the function that maps the space of all possible sets of image measurements to the space of all pose

parameters. With this first-order approximation, the covariance matrix Λ_s is given simply by:

Λ_s = (∂s/∂m) Λ_m (∂s/∂m)^T,    (5.1)

where ∂s/∂m is the Jacobian matrix of the pose parameters with respect to the image measurements. This analysis allows one to determine which ground-truth viewpoints are more problematic in terms of error propagation, given a certain pose recovery scheme. By sampling the space of all possible viewpoints one can then compute an estimate of the expected "general-case" accuracy of any given pose recovery scheme, without the need to actually implement and execute the whole pose recovery process. However, this type of approach does not take into account certain factors that are crucial to determine the actual accuracy of numerical pose recovery schemes, such as the convergence properties of the optimization techniques employed.

So, in order to evaluate the gains obtained with the improvements to Lowe's and Phong-Horaud's algorithms proposed in the previous chapter, we performed detailed simulations of the behavior of the original and the modified techniques, using synthetic data. In Section 5.2, we describe the tests relative to Lowe's algorithm, which were performed in a very general application domain. Phong-Horaud's algorithm, on the other hand, was tested in the context of a more specific real-life application. We describe this application as well as the tests performed with it in Section 5.3.
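A minimal sketch of this first-order propagation (ours; the Jacobian below is a random stand-in rather than one derived from an actual pose recovery scheme) is:

```python
import numpy as np

def pose_covariance(ds_dm, cov_m):
    """First-order propagation of measurement covariance to pose covariance (Eq. 5.1)."""
    return ds_dm @ cov_m @ ds_dm.T

# hypothetical sizes: 6 pose parameters, 16 image measurements (8 points, u and v each)
rng = np.random.default_rng(1)
J = 0.1 * rng.standard_normal((6, 16))       # stand-in for ds/dm at some ground-truth viewpoint
cov_m = (1e-3) ** 2 * np.eye(16)             # isotropic image noise, std 1e-3 focal lengths
print(np.sqrt(np.diag(pose_covariance(J, cov_m))))   # predicted std of each pose parameter
```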

5.2 Evaluating Our Improvements to Lowe's Algorithm

In order to compare Lowe's original algorithm [38; 37; 36], Ishii's reformulation of it [27] and our improvement to the previous two [4; 5], we performed extensive experiments with synthetic data. Our goal is to estimate the relative accuracy and convergence speed of each algorithm for a number of useful situations. So, in the tests we control a few parameters explicitly and sample all the others uniformly, hoping to cover important cases while keeping the amount of data down to a manageable level. In Lowe's approximation, we use the depth of the center of the object in the camera frame as the multiplicative factor that yields the values of the "redefined" pose parameters t'_x and t'_y (see Section 4.2.1). All the methods are tested with exactly the same poses and initial conditions [5].

5.2.1 Experimental Methodology

Unless explicitly stated otherwise, all the experiments described in this section take the imaged object to be the eight corners of a cube, with edge lengths equal to 25 times the focal length of the camera (for a 20 mm lens, for instance, this corresponds to a half-meter wide, long and deep object). The parameters explicitly controlled, in general, were the depth of the object's center with respect to the camera frame (z_true), measured in focal lengths, and the magnitudes of the translation (t_diff) and the rotation (r_diff) needed to align

the initial solution with the true pose. z_true is always measured in focal lengths, and t_diff and r_diff are measured as a relative error with respect to z_true and as an absolute error in π radians, respectively. A formal definition of these parameters and of the whole sampling methodology is given in [5]. Unless stated otherwise, three average values are chosen for each of those parameters (Table 5.1). For each average value v, the corresponding parameter is then sampled uniformly in the region [3v/4, 5v/4].

Param     Avg 1   Avg 2   Avg 3
z_true    50      500     5,000
t_diff    0.1     0.01    0.001
r_diff    0.2     0.02    0.002

Table 5.1: General average sampling values used in most tests for the controlled parameters of Lowe's algorithm and its reformulations.

The other nine pose and initial solution parameters are in general sampled uniformly over their whole domain. The true object position is constrained to lie in the interior of the infinite pyramid whose origin is the optical center and whose faces are the semi-planes z = |x| and z = |y|, z ≥ 0. For each test we compute two global image-space error measures, assuming known correspondence between image and model features. The first, called Norm of Distances Error (NDE), is the norm of the vector of distances between the positions of the features in the actual image and the positions of the same features in the reprojected image generated by the estimated pose. The second, called Maximum Distance Error (MDE), is the greatest absolute value of the vector of error distances. Both measures are always expressed using the focal length as length unit. NDE and MDE do not necessarily indicate how close the estimated pose is to the true pose. We also record individual errors for six different pose parameters: the errors in the x, y and z coordinates of the estimate for the actual object translation vector, measured as relative errors with respect to the actual depth of the object's center (z_true), and the absolute errors in the estimates for the roll, pitch and yaw angles of the object frame with respect to the camera, measured in units of π radians. Although all these metrics were computed, here we show only results for the NDE and the x-translation error: they are faithfully representative of both image-space error metrics and the three translation and three rotation error metrics. For each of those eight different error measures, we compute the average, the standard deviation, the averages and standard deviations excluding the 1%, 5% or 25% smallest and largest absolute values, and the median. Statistics that leave out the tails of the error distributions are included to be fair to a method (if any) that underperforms in a few exceptional situations but is better "in general": for instance, one that occasionally violently diverges but usually gives better results. Here we only present the average error and its standard deviation, and the results with the exclusion of the upper and lower 25% of the errors. For more error measures and more statistics see [5].
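The two image-space metrics are straightforward to compute; the sketch below (ours, with hypothetical argument names, and with the focal length taken as 1) reprojects the model with the true and the estimated poses and returns the NDE and MDE.

```python
import numpy as np

def project(points_cam):
    """Perspective projection with unit focal length: (x/z, y/z), one point per row."""
    return points_cam[:, :2] / points_cam[:, 2:3]

def nde_mde(model_points, R_true, t_true, R_est, t_est):
    """Norm of Distances Error and Maximum Distance Error between the actual image
    and the image reprojected with the estimated pose (known correspondences)."""
    actual = project(model_points @ R_true.T + t_true)
    reproj = project(model_points @ R_est.T + t_est)
    dists = np.linalg.norm(actual - reproj, axis=1)     # per-feature image-plane distances
    return np.linalg.norm(dists), dists.max()
```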

5.2.2 Convergence in the General Case

Initially, we tried to compare the speed of convergence and final accuracy of each method with arbitrary poses and initial conditions. The statistics for the NDE, based on 13,500 executions per method, are plotted in Fig. 5.1. They show that for most poses Lowe's original approximation converges to a very high global error level, and Ishii's approximation only improves the initial solutions in its first iteration and diverges after that. Our fully projective solution, on the other hand, converges at a superexponential rate to an error level roughly equivalent to the relative rounding error of double precision, which is about 1.11 × 10^-16.

[Figure 5.1 (like the convergence and sensitivity figures that follow) consists of four log-scale panels — mean with all data, standard deviation with all data, mean without the 25% extremes, standard deviation without the 25% extremes — each plotting the error against the iteration number.]

Figure 5.1: Convergence of an image-space error metric, the Norm of Distances Error (see Section 5.2.1), with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line). Tests performed with a cube, rotated by arbitrary angles with respect to the camera frame.

Even taking into account the worst data, our approximation still converges superexponentially to this maximum precision level; the bad cases only slow convergence a bit. But in this case Lowe's original algorithm and (especially) Ishii's approximation tend to diverge, yielding some solutions worse than the initial conditions. The statistics for the errors in the individual pose parameters make the superiority of the fully projective approach even more clear. Fig. 5.2 exhibits the relative errors in the value of the x translation. Both Lowe's and Ishii's algorithms diverge in most situations, while the fully projective solution keeps its superexponential convergence. Due to their simplifications, Lowe's and Ishii's methods in those cases are not able to recover the true rotation of the object. They tend to make corrections in the translation components to fit the erroneously rotated models to the image in the least-squares sense, generating very imprecise values for the parameters themselves. This problem is especially acute with Ishii's approximation, which tends to translate the object as far away from the camera as possible, so that the reprojected images of all points are collapsed into a single spot that minimizes the mean of the squared distances with respect to the true images. Similar results were obtained for the other five parameter-space errors. To assure that the results did not depend on symmetries in the cubical imaged object, we repeated the same tests with an asymmetric object whose eight points were all uniformly


sampled in the space [−1, 1]³ and then scaled for a maximum edge size of 25 focal lengths. All the results were almost identical to those obtained with the cube [5].

Figure 5.2: Convergence of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line). Tests performed with a cube, rotated by arbitrary angles with respect to the camera frame.

5.2.3 Convergence with Rough Alignment

For some relevant practical applications, our initial assumption that all the attitudes of the object with respect to the camera happen with equal probability is too general. For instance, in vehicle following applications it is reasonable to assume that poses in which the object frame is roughly aligned with the camera frame happen with much larger probability than poses in which the object frame is rotated by large angles. We therefore performed some tests in which the rotation component of the initial solutions was represented by a quaternion whose axis was sampled uniformly on a unit semi-sphere with z ≥ 0, but whose angle was constrained to the region [−π/5, π/5]. The NDE statistics, plotted in Fig. 5.3, show that in this case the accuracy of Ishii's approximation is much improved (predictably, given its semantics). Instead of diverging, now it converges exponentially towards the rounding error lower bound. So, even in this favorable scenario, Ishii's approximation is still much less efficient than the fully projective solution, which converges superexponentially (in about 5 iterations) for the NDE, as shown, and also for all other error metrics tested.
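One straightforward way to draw such rough-alignment disturbances (a sketch of ours, not necessarily the exact sampler described in [5]) is:

```python
import numpy as np

rng = np.random.default_rng(0)

def rough_rotation_quaternion(max_angle=np.pi / 5):
    """Quaternion with axis uniform on the z >= 0 unit hemisphere and
    angle uniform in [-max_angle, max_angle]."""
    axis = rng.standard_normal(3)
    axis /= np.linalg.norm(axis)         # uniform direction on the unit sphere
    axis[2] = abs(axis[2])               # fold onto the z >= 0 hemisphere
    angle = rng.uniform(-max_angle, max_angle)
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

print(rough_rotation_quaternion())
```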

5.2.4 Execution Times

Lowe's and Ishii's simplifications do not result in a significant inner-loop performance gain with respect to the fully projective solution. We hand-optimized the three algorithms, with common subexpression factorization, loop vectorization and static pre-allocation of all matrices. After that, the internal loop (in Matlab) for Lowe's method (which is the simplest of the three) contained only four fewer floating-point operations than the internal loop of the fully projective solution.


Figure 5.3: Convergence of an image-space error metric, the Norm of Distances Error (see Section 5.2.1), with respect to the number of iterations of Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line). Tests performed with a cube, rotated by angles of at most π/5 radians with respect to the camera frame.

We measured the execution times of 20 iterations of each method (details in [5]). The statistics shown in Fig. 5.4 were gathered from a set of 13,500 runs per method, performed with the same sampling techniques employed in the convergence experiments.


Figure 5.4: Execution times (in seconds) for 20 iterations of each method, computed over all data and with elimination of the 1%, 5% and 25% error extremes.

The fully projective solution's average times are 2.99% to 4.21% longer than those of Lowe's original method, but the standard deviations of the execution times for Lowe's solution are between 6% and 130% bigger than those of the fully projective one. Thus, the fully projective approach may be more suitable for hard real-time constraints, due to its smaller sensitivity to ill-conditioned configurations. The problem is that Lowe's original method is much more likely to face singularity problems in the resolution of the system described in Eq. (4.24), resulting in the execution of slower built-in Matlab routines. The fully projective approach

looks even better when compared to Ishii's solution. The explanation is that a careful subexpression factorization can save us the work that Ishii's simplifications are designed to save, so we pay no time penalty for a solution that is less sensitive to the proximity of singularities [5].

5.2.5 Sensitivity to Depth in Object Center Position

We also performed some experiments to check the sensitivity of the techniques to individual variations in each one of the three controlled parameters. First, we varied the average value of z_true (object depth) logarithmically between 25 and 51,200 focal lengths (corresponding, respectively, to 50 cm and 1,024 m with a 20 mm lens). The statistics for the NDE, plotted in Fig. 5.5 for each of the twelve values chosen for z_true, show that our method is almost always much more accurate than both Lowe's and Ishii's. The only exception happens at the distance of 25 focal lengths.


Figure 5.5: Sensitivity of an image-space error metric, the Norm of Distances Error (see Section 5.2.1), with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

The problem is that in this situation some individual object points may get as close as 5 focal lengths from the zero-depth plane of the camera frame, due to the errors in the initial conditions. In this case, our method tends to behave like Ishii's, shifting the object as far away from the camera as it can (so as to collapse the image into a single point), instead of aligning it. This can be confirmed by the analysis of the errors for the x translation (Fig. 5.6). But even in this extreme situation, our method, unlike Lowe's and Ishii's, still converges in most cases. The results for the error angles also support these observations.

5.2.6 Sensitivity to Translational Error in Initial Solution

Using the same sampling methodology as in the previous experiment, we also studied the effect of changing the relative error in the translational component of the initial pose estimates. Fifteen values for the relative initial translational error t_diff, ranging from 0.025 to 0.5, were chosen. The statistics for the NDE, depicted in Fig. 5.7, show that our method is once again much more accurate in general. However, when the average magnitude of the translational error


is greater than 30% of the actual depth of the object's center, our method has convergence problems for the worst 1% of the data and its overall reprojection accuracy drops to a level close to that of Lowe's original approximation.

Figure 5.6: Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).


Figure 5.7: Sensitivity of an image-space error metric, the Norm of Distances Error (see Section 5.2.1), with respect to the ratio between the magnitude of the translational disturbance in the initial solution and the actual depth of the object's center, for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

An analysis of the statistics for the x translation (Fig. 5.8; other translation and pose angle results are similar) shows that in these cases no divergence towards infinite depth happens, but merely a premature convergence to false local minima. It is interesting to notice that the accuracy of Lowe's method stays at these same high error levels even with much better initial conditions, which indicates that Lowe's algorithm (as well as Ishii's, which performs even worse) usually (and not only in extreme cases) gets stuck in local minima.


Figure 5.8: Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the ratio between the magnitude of the translational disturbance in the initial solution and the actual depth of the object's center, for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

5.2.7 Sensitivity to Rotational Error in Initial Solution

Using the same sampling strategy once more, we selected ten average values for the absolute rotational error r_diff, ranging from π/10 to π radians. The statistics for the NDE, exhibited in Fig. 5.9, show again the superiority of our approach with relatively small errors. Similarly, with errors bigger than 3π/10 radians, our method starts having convergence problems and its reprojection accuracy approaches that of Lowe's.


Figure 5.9: Sensitivity of an image-space error metric, the Norm of Distances Error (see Section 5.2.1), with respect to the magnitude of the rotational disturbance in the initial solution (in π radians), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

The errors in x-translation recovery (Fig. 5.10) and in the pose angles show that large errors in initial rotation, unlike those in initial translation, make our method diverge towards infinite depth. This causes its accuracy in terms of pose parameter values to drop to levels comparable to (in some cases even worse than) those of Ishii's. However, in this situation Lowe's original method also diverges. A solution with a relative translational error of 10^10,

10^{5}, or even 10^{1} is not much more useful in practice than another solution with a relative translational error of 10^{20}. The problem in this case is the intrinsically downhill nature of Newton's method, which is the core of all the techniques studied here. We believe that the only way to overcome this limitation would be to use a method based on an optimization technique with better global convergence properties, such as trust region optimization.
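To illustrate the kind of replacement we have in mind, a trust-region nonlinear least-squares solver could be dropped in roughly as follows (a hedged sketch, not the implementation used in this report; reproject is a hypothetical routine mapping a 6-vector pose to the stacked predicted image coordinates, and scipy's 'trf' method is merely one example of a trust-region implementation):

    import numpy as np
    from scipy.optimize import least_squares

    def refine_pose(pose0, model_pts, image_pts, reproject):
        # Minimize the stacked reprojection residuals with a trust-region method ('trf')
        # instead of plain Newton/Gauss-Newton steps.
        def residuals(pose):
            return reproject(pose, model_pts) - image_pts.ravel()
        result = least_squares(residuals, pose0, method='trf')
        return result.x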

[Figure 5.10 plot: panels "Mean with all data", "Mean without 25% extremes", "Std with all data", and "Std without 25% extremes"; vertical axes: Error; horizontal axes: Rotational disturbance.]

Figure 5.10: Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the magnitude of the rotational disturbance in the initial solution (in π radians), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

5.2.8 Sensitivity to Additive Noise

In this experiment, Gaussian noise with zero mean and controlled standard deviation was added to the coordinates of the features in the image. 2,700 executions of each method were performed for each of the fifteen values of the noise standard deviation chosen in the range of 2^{-15} to 2^{-1} focal lengths. The statistics for the NDE, plotted in Fig. 5.11, show that in this case the accuracy of our solution is always limited by the noise level, while the other two approaches get stuck on higher error levels even when the noise level is very small. For an error level of about 10^{-3} focal lengths (which corresponds roughly to the quantization noise with a sensing array of 1K × 1K pixels), there is still a considerably wide gap in accuracy (about one order of magnitude) between our technique and Lowe's, the second most accurate method. The analysis of the effect on the x translation errors (Fig. 5.12) shows that divergence towards infinite depth is a problem again for relatively high noise levels (greater than 10^{-3} in the worst cases). However, the roll angle errors, displayed in Fig. 5.13, illustrate the fact that the degradation in the estimate for the rotation provided by our method happens smoothly. Our technique remains significantly more precise, at least for rotation recovery, for noise levels of up to 10^{-1} focal lengths. This is quite impressive given the fact that the restrictions in the view angle constrain the images to a 2 × 2 window (in focal lengths) on the image plane, where the noise was added.
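As an illustration, the noise injection used in this experiment could be sketched as follows (a minimal sketch with our own variable names; the report only states the range of noise levels, so the log spacing below is an assumption, and the image coordinates are taken to be expressed in focal lengths):

    import numpy as np

    rng = np.random.default_rng(0)

    # Fifteen noise levels covering 2^-15 to 2^-1 focal lengths (log-spaced here by assumption).
    noise_levels = 2.0 ** np.linspace(-15.0, -1.0, 15)

    def perturb_features(image_pts, sigma, rng):
        # Add zero-mean Gaussian noise with standard deviation sigma (in focal lengths)
        # to an n x 2 array of image feature coordinates.
        return image_pts + rng.normal(0.0, sigma, size=image_pts.shape)

    # Sketch of the experiment loop: 2,700 runs of each pose-recovery method per noise level.
    # for sigma in noise_levels:
    #     for run in range(2700):
    #         noisy_pts = perturb_features(true_image_pts, sigma, rng)
    #         ...  # run Lowe's, Ishii's, and the fully projective method; record the NDE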

[Figure 5.11 plot: panels "Mean with all data", "Mean without 25% extremes", "Std with all data", and "Std without 25% extremes"; vertical axes: Error; horizontal axes: Imaging noise std.]

Figure 5.11: Sensitivity of an image-space error metric, the Norm of Distances Error (see Section 5.2.1), with respect to the standard deviation of the noise added to the image (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

[Figure 5.12 plot: panels "Mean with all data", "Mean without 25% extremes", "Std with all data", and "Std without 25% extremes"; vertical axes: Error; horizontal axes: Imaging noise std.]

Figure 5.12: Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the standard deviation of the noise added to the image (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

5.2.9 Accuracy in Practice

Finally, we also wanted to compare the three methods in a realistic situation, in order to check whether the better accuracy properties of our approach would make a difference in practice. The introduction of noise in the experiments was a first step in this direction, but up to this point we have not addressed the question of what would be realistic initial conditions. One possibility for applications such as tracking would be to create reasonably precise initial estimates of the pose with a smoothing filter. But this approach is very dependent on application-specific parameters, such as the sampling rate of the camera, the bandwidth of the image processing system as a whole, and the positional depth, linear speed and angular speed of the tracked object. A more general approach, which we follow here, is to use a weaker camera model to generate an initial solution for the problem analytically, and then use the projective iterative solution(s) to refine this initial estimate.

[Figure 5.13 plot: panels "Mean with all data", "Mean without 25% extremes", "Std with all data", and "Std without 25% extremes"; vertical axes: Error; horizontal axes: Imaging noise std.]

Figure 5.13: Sensitivity of the error on the estimated roll angle (measured in π radians), with respect to the standard deviation of the noise added to the image (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line).

This approach follows the same line as the numerical method suggested by DeMenthon and Davis [15] (described in the previous chapter). The weak-perspective solution proposed in that paper is summarized in Eqs. (4.5) and (4.6):

$$\mathbf{e}_i \cdot \mathbf{r}'_x = u_i - u_0 \quad (1 \le i \le n), \qquad \mathbf{e}_i \cdot \mathbf{r}'_y = v_i - v_0 \quad (1 \le i \le n).$$

A normalization of the recovered vectors r'_x and r'_y yields the first two rows of the rotation component of the transformation that describes the object frame in the camera coordinate system. The third row can then be obtained with a single cross product operation. After that, the recovery of the translation is straightforward. However, this simple weak-perspective approximation introduces errors that increase proportionally not only to the inverse depth of the object, but also to its "off-axis" angle (the angle of its center with respect to the optical axis, as viewed from the optical center). In order to avoid this last problem, we first preprocessed the image to simulate a rotation that puts the center of the object's image at the intersection of the optical axis with the image plane. Let the center of the object image be described by [u v]. Then this transformation, as suggested in [69], is given by Eq. (2.11):

$$R = \begin{bmatrix} \dfrac{1}{d_1} & 0 & -\dfrac{u}{d_1} \\[4pt] -\dfrac{uv}{d_1 d_2} & \dfrac{d_1}{d_2} & -\dfrac{v}{d_1 d_2} \\[4pt] \dfrac{u}{d_2} & \dfrac{v}{d_2} & \dfrac{1}{d_2} \end{bmatrix}, \qquad \text{where } d_1 = \sqrt{u^2 + 1}, \quad d_2 = \sqrt{u^2 + v^2 + 1}.$$

After this preprocessing, we applied the technique described by Eqs. (4.5) and (4.6), in order to recover the "foveated" pose. Then, we premultiplied the resulting transformation by the inverse of the matrix defined in Eq. (2.11), in order to recover the original weak-perspective pose, which was used as the initial solution for the iterative techniques being compared.
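As a concrete illustration, this initialization could be sketched roughly as follows (a minimal sketch with our own function and variable names, not the original implementation; model_pts is an n × 3 array of object-frame points, image_pts an n × 2 array of image points in focal-length units, and the translation recovery follows the standard weak-perspective construction, which the text above leaves implicit):

    import numpy as np

    def foveation_matrix(u, v):
        # Rotation of Eq. (2.11): takes the viewing direction (u, v, 1) onto the optical axis.
        d1 = np.sqrt(u * u + 1.0)
        d2 = np.sqrt(u * u + v * v + 1.0)
        return np.array([[1.0 / d1,           0.0,     -u / d1],
                         [-u * v / (d1 * d2), d1 / d2, -v / (d1 * d2)],
                         [u / d2,             v / d2,   1.0 / d2]])

    def weak_perspective_pose(model_pts, image_pts):
        # Solve Eqs. (4.5)-(4.6) in the least-squares sense for the scaled rotation rows.
        e = model_pts[1:] - model_pts[0]              # edges e_i relative to reference point 0
        rx = np.linalg.lstsq(e, image_pts[1:, 0] - image_pts[0, 0], rcond=None)[0]
        ry = np.linalg.lstsq(e, image_pts[1:, 1] - image_pts[0, 1], rcond=None)[0]
        s = 0.5 * (np.linalg.norm(rx) + np.linalg.norm(ry))       # weak-perspective scale
        r1, r2 = rx / np.linalg.norm(rx), ry / np.linalg.norm(ry)
        R = np.vstack([r1, r2, np.cross(r1, r2)])                  # not re-orthogonalized (sketch only)
        t = np.array([image_pts[0, 0], image_pts[0, 1], 1.0]) / s  # position of reference point 0
        return R, t

    def initial_pose(model_pts, image_pts):
        # Foveate the image, recover the weak-perspective pose, then undo the foveation.
        u0, v0 = image_pts.mean(axis=0)
        Rf = foveation_matrix(u0, v0)
        h = np.hstack([image_pts, np.ones((len(image_pts), 1))]) @ Rf.T
        fov_pts = h[:, :2] / h[:, 2:3]                # reprojected ("foveated") image points
        R, t = weak_perspective_pose(model_pts, fov_pts)
        return Rf.T @ R, Rf.T @ t                     # premultiply by the inverse of Eq. (2.11)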

The only controlled parameter left was the actual depth of the object's center (z_true). We chose nine average values for it, growing exponentially from 25 to 6,400 focal lengths. The noise standard deviation was set at 0.002 focal lengths (corresponding roughly to a 512 × 512 spatial quantization). The number of iterations of each method per run was set at 2, allowing a real-time execution rate of about 100 Hz. For each average value of z_true, 2,500 independent runs of each technique were performed. The statistics for the NDE, depicted in Fig. 5.14, show that our fully projective solution was up to one order of magnitude more accurate than the other two methods in most cases in which the distance was smaller than 1,000 focal lengths (about 20 m, with a typical focal length of 20 mm). For distances bigger than that, the accuracy of the weak-perspective initial solution alone was already limited only by the noise, and so the three techniques performed equally well. Analysis of the results for the x translation error (Fig. 5.15) and the other five parameter-space errors shows the interesting fact that all the techniques exhibit parameter-space accuracy peaks in the range of 50 to 400 focal lengths. The explanation for that is the fact that when the object is too close, the quality of the initial weak-perspective solution degrades quickly. But on the other hand, when the object is too far away, the noise gradually overpowers the information about both the distance (via observed size) and the orientation of the object, since all the feature images tend to collapse into a single point. Of course, in practice, the exact location of these peaks depends on the dimensions of the actual object(s) whose pose is being recovered.

[Figure 5.14 plot: panels "Mean with all data", "Mean without 25% extremes", "Std with all data", and "Std without 25% extremes"; vertical axes: Error; horizontal axes: Object center depth.]

Figure 5.14: Sensitivity of an image-space error metric, the Norm of Distances Error (see Section 5.2.1), with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line). Tests performed with initial solutions generated by a weak-perspective approximation.

In the case of our technique, the accuracy peak happened clearly at distances of 50 to 100 focal lengths (1 to 2 m with a 20 mm lens). Similar results were obtained when the number of iterations for each run was raised to 5. This suggests that our solution may be very well suited for indoor applications in which it is possible to keep a safe distance between the objects of interest and the camera.

[Figure 5.15 plot: panels "Mean with all data", "Mean without 25% extremes", "Std with all data", and "Std without 25% extremes"; vertical axes: Error; horizontal axes: Object center depth.]

Figure 5.15: Sensitivity of the ratio between the error on the estimated x translation and the actual depth of the object's center, with respect to the actual depth of the object's center (in focal lengths), for Lowe's (solid line), Ishii's (dotted line), and our fully projective solution (dash-dotted line). Tests performed with initial solutions generated by a weak-perspective approximation.

5.2.10 Discussion

Lowe's approximation was previously criticized by McIvor [44]. He states that assuming that t'_x and t'_y (see Section 4.2.1) are constants amounts to an affine approximation. This is true for the parameters t'_x and t'_y themselves, but the affine approximation does not extend through the whole formulation: in Eq. (4.28) the denominators use z' + t'_z instead of just t'_z. If a constant value had been used for those denominators, then the formulation would be purely affine. Without implementing other formulations, McIvor speculates (correctly) that the use of full perspective would improve the accuracy of the viewpoint, perhaps at the expense of decreased numerical stability. But as we showed in this section, the fully projective formulation is actually more stable except in situations that break the other two formulations tested as well.

Worrall et al. [68] compare their algorithm for perspective inversion with Lowe's algorithm. They claim that their technique outperforms both Lowe's original method and a reformulation of it using full-perspective projection, in terms of speed of convergence in simulations performed with a cube. This work sounds similar to ours, but [68] provides no detail on the perspective projection version of Lowe's algorithm used in the comparison. They also do not present any discussion or comparison between the two different implementations of Lowe's algorithm that they mention. Finally, they only report concrete experimental results for their own inversion method, which is based on line (rather than point) correspondences. No comparative evaluation of the two variants of Lowe's algorithm was presented.

Our experiments indicate that a straightforward reformulation of the perspective imaging equations removes mathematical approximations that limit the precision of Lowe's and

Ishii's formulations. The fully projective algorithm has better accuracy with a minimal increase in computational cost per iteration. The fully projective solution is also very stable for a wide range of actual object poses and initial conditions. In some particularly extreme scenarios, our approach does suffer from numerical stability problems, but in these situations the accuracy of Lowe's and Ishii's approximations is also unacceptable, with errors of one or more orders of magnitude in the values of the pose parameters. We believe that this type of problem is a consequence of Newton's method and can only be overcome with the use of more powerful numerical optimization techniques, such as trust region methods. In scenarios that may realistically arise in applications such as indoor navigation, with the use of reasonable (weak-perspective) initial solutions and taking into account the effect of additive Gaussian noise in the imaging process, the fully projective formulation outperforms both Lowe's and Ishii's approximations by up to an order of magnitude in terms of accuracy, with practically the same computational cost.

5.3 Evaluating Our Improvements to Phong-Horaud's Algorithm

In the previous chapter we also presented a solution for the problem of independent orientation recovery from point correspondences, largely inspired by the work of Phong, Horaud et al. [53]. We showed that our solution is theoretically superior to the previously existing methods for independent orientation recovery, but up to this point we have not discussed the implications of this theoretical superiority for any specific real-life application. This section provides some evidence that the theoretical advantages of our solution do make a difference in practice.

Initially, we stress the fact that our technique does not need actual edges to work. It is based on point correspondences, and the edges e_ij mentioned in its formulation are only virtual edges defined by pairs of points. Furthermore, the camera's optical center does not need to be located exactly at the object plane and, in fact, the object does not need to be exactly planar in order for our solution to produce accuracy gains with respect to solutions based uniquely on line correspondences. If the smallest singular value in the SVD of the 3 × n matrix composed of the image points is much smaller than the other two, and the optical center is relatively close to the plane that fits the model points in a least-squares sense, then the proximity of a singularity will affect the line-based algorithms, and the additional alignment constraints derived by our solution will have a significant impact on the overall convergence speed and accuracy of the numerical pose recovery.
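To make the near-planarity condition above concrete, it could be checked numerically along the following lines (a rough sketch with our own names and an arbitrary threshold; pts is an n × 3 array of 3D points, centered before the SVD):

    import numpy as np

    def near_planar(pts, ratio_thresh=0.05):
        # True when the smallest singular value of the centered 3 x n point matrix is much
        # smaller than the other two, i.e. the configuration is close to planar.
        centered = pts - pts.mean(axis=0)
        s = np.linalg.svd(centered.T, compute_uv=False)
        return s[2] < ratio_thresh * s[1]

    def fitted_plane(pts):
        # Least-squares plane through the points: returns (centroid, unit normal).
        centroid = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - centroid)
        return centroid, vt[2]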

5.3.1 A Real-Life Application: Visual Navigation

A real-life scenario where these conditions arise is the off-road navigation of a mobile robot using landmarks located at relatively distant positions (in comparison to the maximum height variation of the visible terrain). An interesting aspect of this application is the fact that, in many situations, accurate models of the terrain are available, allowing the use of model-based techniques. For instance, the United States Geological Survey offers Digital

Elevation Maps (DEMs) with different resolutions, commercially and for free download on the Internet.

The traditional model-based approaches for recovering the position of off-road vehicles from images of relatively distant visual landmarks consist of performing exhaustive searches either in a space of all possible interpretations of the scene, or in the space of all possible renderings of the model (map), or in a discretized space of all possible poses of the camera [13]. In scenes that include distant features such as mountain peaks, this type of approach seems to be suitable for position recovery, in spite of its elevated computational cost. However, even in these situations, the relative orientation of the vehicle with respect to the features of interest can presumably change relatively fast during an obstacle avoidance operation, or simply as a result of changes in the inclination of the terrain. So, the techniques mentioned above are certainly not suited for full 6D pose recovery in real time. We conjecture that they are useful for periodically establishing the right correspondences between pairs of map (model) and image features, but a tracking technique based on some perspective inversion algorithm is needed to perform real-time orientation recovery. Furthermore, some of those traditional exhaustive techniques rely on mosaicing to obtain a reasonable number of feature correspondences, and it seems to us that a more practical alternative is to use wide-angle lenses. But of course, this requires the ability to deal with very significant perspective distortion, which the traditional techniques (as well as affine pose recovery methods) do not possess.

5.3.2 Experimental Methodology

In order to test the impact of our contribution in this type of scenario, we applied the three versions of Phong-Horaud's original method and the two versions of our decoupled solution for point correspondences to the same synthetic scenes and then computed error statistics for the final pose estimates obtained by each algorithm.

The process of scene generation involved initially the choice of the number of visible landmarks. Assuming independence in the probability of occurrence of individual landmarks in the visual field, this number was drawn from a Poisson distribution with mean three. All the underconstrained scenes (two visible features or less) were immediately discarded. In the remaining scenes, the positions of individual features along the ground plane were uniformly sampled on a 1 × 1 km square initially centered at the camera's optical center. The heights of the features with respect to the ground plane were chosen from a normal distribution with mean zero and standard deviation explicitly controlled by a parameter h_s. The same number of scenes was generated with eight possible values for h_s, ranging exponentially from 1 to 128 m.

Each of the resulting scenes was then rotated about the axis normal to the ground plane by an arbitrary angle, sampled uniformly in the interval [0, 2π) radians, and translated along the optical axis of the camera by a distance drawn from a uniform distribution on the interval [0.9, 1.1) km, so that all the landmarks were placed in a visual field about 103.5 degrees wide on the horizontal image axis (in the worst case). Next, the resulting scenes were translated along the axis normal to the ground plane, to account for the height of the camera

(using the same distribution employed in the generation of the heights of the landmarks). Finally, each scene was rotated about an individual axis uniformly sampled on a unit sphere, by an angle drawn from a normal distribution with mean zero and standard deviation equal to 10^{-1} radians. This final step was performed in order to account for (relatively small) inclinations of the terrain and errors in the "foveation" of the horizon line.

The imaging simulation was performed with a "noisy" pinhole camera model. On each image, a random multiplicative bias was used to model inaccuracies in the calibration of the focal length and the aspect ratio, and a random additive bias was used to model inaccuracies in the calibration of the intersection of the optical axis with the image plane. Furthermore, Gaussian noise was added to the coordinates of all image features, in order to model measurement errors in the low-level stages of vision (feature detection and 2D localization). More specifically, if the description of an arbitrary point i in 3D camera coordinates is given by p_i = [x_i y_i z_i]^T, we compute the coordinates of the corresponding image point as:

$$[u_i \;\; v_i] = \left[\, (1 + \varepsilon_m)\,\frac{f x_i}{z_i} + \varepsilon_a + \xi_i \qquad (1 + \varepsilon_m)\,\frac{f y_i}{z_i} + \varepsilon_a + \eta_i \,\right], \qquad (5.2)$$

where f is the camera focal length; ε_m and ε_a are the multiplicative and additive biases, respectively, both unique for the whole image and drawn from normal distributions with zero mean and controlled standard deviations b_m and b_a (respectively); and ξ_i and η_i are Gaussian additive noise terms (individually generated for each feature) with zero mean and controlled standard deviation n_a.
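The "noisy" pinhole model of Eq. (5.2) could be sketched as follows (a minimal sketch with our own names; pts_cam is an n × 3 array of camera-frame points, and the biases and noise are expressed in the same units as in the text):

    import numpy as np

    def image_scene(pts_cam, f, b_m, b_a, n_a, rng):
        # Project camera-frame points with the noisy pinhole model of Eq. (5.2).
        eps_m = rng.normal(0.0, b_m)                 # multiplicative bias, one draw per image
        eps_a = rng.normal(0.0, b_a)                 # additive bias, one draw per image
        x, y, z = pts_cam[:, 0], pts_cam[:, 1], pts_cam[:, 2]
        u = (1.0 + eps_m) * f * x / z + eps_a + rng.normal(0.0, n_a, size=len(z))
        v = (1.0 + eps_m) * f * y / z + eps_a + rng.normal(0.0, n_a, size=len(z))
        return np.column_stack([u, v])

    # Example usage with one of the parameter settings quoted below:
    # rng = np.random.default_rng(0)
    # uv = image_scene(pts_cam, f=1.0, b_m=0.02, b_a=0.02, n_a=0.002, rng=rng)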

Unless stated otherwise, the experiments were performed with three different values for b_m (0.01, 0.02 and 0.04), for b_a (0.01, 0.02 and 0.04 focal lengths), and for n_a (0.001, 0.002 and 0.004 focal lengths).

For each possible combination of values of the parameters h_s, b_m, b_a and n_a such that b_a = f b_m, several scenes were generated in a completely independent way. For each one of them, a unique initial solution, to be used by all the five methods under evaluation, was computed. This initial solution was obtained simply by multiplying the angles and distances used in the generation of the original scene by a random disturbance factor with mean 1 and standard deviation 0.1 (the only exception was the axis of the final rotation, which was added to a disturbance vector sampled uniformly on a unit sphere, and renormalized), and then repeating the same steps taken originally.

Initially, we tested the accuracy and efficiency of the techniques at hand in the general case. These techniques were implemented in Matlab (release 4.2) and then translated to ANSI C by the Matlab compiler (with optimization options -ri). The resulting code was then compiled with gcc (optimization level -O2) and executed on a Sun SPARCstation 4 running SunOS. For each of the 72 different combinations of the controlled parameters, we generated and tested 500 different scenes, resulting in a total of 36,000 independent executions of each method in study. On each of these executions, the maximum number of iterations of the trust-region algorithm was fixed at 20, and per-iteration traces of four different error measures were computed.

Two of these measures use the values recovered for the camera orientation and position to estimate the individual 3D positions of all the model points and compare these estimates to the respective actual 3D positions. In the other two, this process is repeated

with the true value for the camera position, instead of the recovered value. The goal of these alternative measures is to evaluate the accuracy of orientation recovery alone. In practice this is important because accurate position estimates can be obtained directly from a GPS system. For each of these two possibilities, one of the error measures computed was the square root of the average squared positional error over all individual features, and the other was the maximum individual positional error. Here we show only the results based on the measure that uses the recovered position and the squared individual errors (which we call the Actual Average Squared Distance, AASD), because the results obtained with the other three measures were qualitatively identical. As a measure of efficiency, we used the elapsed times of orientation recovery, since in this particular application the camera position only needs to be estimated at a much coarser temporal scale. Then, each measure considered (that is, the errors on each iteration and the total execution time) had its average and standard deviation computed for the whole set of 36,000 executions.
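For concreteness, the AASD measure could be computed along these lines (a sketch with our own names; model_pts holds the landmark coordinates in the model frame, and each (R, t) pair maps model coordinates into the camera frame):

    import numpy as np

    def aasd(model_pts, R_est, t_est, R_true, t_true):
        # Actual Average Squared Distance: root-mean-square 3D error over all landmarks.
        est = model_pts @ R_est.T + t_est     # landmark positions implied by the recovered pose
        act = model_pts @ R_true.T + t_true   # actual landmark positions in the camera frame
        sq = np.sum((est - act) ** 2, axis=1) # squared positional error of each landmark
        return np.sqrt(sq.mean())

The orientation-only variant mentioned above is obtained by passing the true translation in place of the recovered one, so that only the rotation estimate is evaluated.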

5.3.3 Convergence in the General Case

The evolution of the global error as a function of the iteration number is displayed in Fig. 5.16. In this and all the other error plots presented here, the solid and dotted lines at the top represent the results obtained with Phong-Horaud's decoupled and coupled solutions for line correspondences, respectively; the dashed and dash-dotted lines correspond to our decoupled solutions for point correspondences organized in triangles and in a ring, respectively; and the solid line at the bottom represents the results obtained with Phong-Horaud's coupled solution for point correspondences.

[Figure 5.16 plot: panels "Mean" and "Standard deviation"; vertical axes: AASD Error (km); horizontal axes: Iteration number.]

Figure 5.16: Convergence of a 3D-space error measure (AASD error) with respect to the number of iterations of the trust-region optimization. Top solid, dotted, dashed, dash-dotted and bottom solid lines represent the results for decoupled recovery from lines, coupled recovery from lines, decoupled recovery from triangles, decoupled recovery from a ring of points and coupled recovery from points, respectively.

It can be observed that the three solutions that make use of point correspondences have very similar convergence properties. Phong-Horaud's solution has a smaller final standard deviation, indicating that our algorithms may have certain convergence problems in very

particular problem instances. However, the convergence of the average AASD error with our algorithms is slightly faster. For instance, if the maximum number of iterations had to be limited to two, in order to cope with real-time constraints, our solutions would actually yield more accurate answers on average, with minimal differences in terms of standard deviation. Furthermore, in all three cases, the final values achieved for this error are very similar: 28.1 m for our decoupled version with triangles, 27.7 m for our decoupled version with the ring of features, and 25.7 m for Phong-Horaud's coupled solution. On the other hand, the two solutions based on line correspondences have clear convergence problems after a few iterations, even in terms of the average errors. The final AASD errors for the decoupled and coupled solutions are 93.9 m and 86.1 m, respectively. This is a strong empirical confirmation of our theoretical claim that point correspondences are inherently more powerful than line correspondences.

5.3.4 Execution Times

[Figure 5.17 bar chart: mean and standard deviation of the execution times (s) for Dec Lines, Coup Lines, Dec Triangles, Dec Ring and Coup Points.]

The elapsed times are displayed in Fig. 5.17. The central aspect of this chart, which we want to stress here, is the fact that the coupled solutions are considerably slower than their decoupled counterparts. If we compare the average times plus standard deviations (since these techniques are intended to be used under real-time conditions), the coupled solution for line correspondences takes about 74% longer than the decoupled one. With point correspondences, Phong-Horaud's original solution takes about 61% longer than our fastest decoupled version (the one based on grouping the features in triangles).

Figure 5.17: Execution times for orientation recovery.

Another important fact is that our solution with features organized in a ring takes approximately 21% longer than the one that uses the triangles, in spite of both having virtually the same accuracy. So, as Theorem 4.1 predicted, the triangle organization allows one to reduce the overall execution time without any major impact in terms of the quality of

the final solutions. Finally, a quite unexpected result was the fact that our fastest solution was even faster than the original decoupled solution based on line correspondences. This happens because the extra complexity of our scheme is only reflected in the computation of the coefficients for the error function, which is performed just once and not for every single iteration. So, the faster convergence of our technique can easily outweigh this additional initial overhead.

5.3.5 Sensitivity to the Number of Visible Landmarks

We also performed some experiments to compare the sensitivity of the different algorithms with respect to changes in scene conditions such as the number of visible landmarks, the elevation variance in the visible terrain, the levels of bias introduced by inaccuracies in the camera calibration process, and the level of noise in the 2D localization of the image features. In all these experiments, the number of iterations of the trust-region algorithm was reduced to 5, because as shown in Fig. 5.16, only minor additional improvements are achieved after that point (if any at all). Also, rather than computing per-iteration traces of the different error measures, we recorded only the best solutions found throughout the optimization process. On the other hand, in each experiment we performed independent executions and computed global statistics for a series of different settings of the scene condition whose impact on the pose recovery was being evaluated.

For instance, the variation of the AASD error with respect to the number of visible features in the scene is displayed in Fig. 5.18. There is a clear general tendency of reduction in the error when the number of features is increased, which is quite natural, because the measurement errors for individual features are averaged out when many of them are used. The important fact is that for all the 10 different values of the number of landmarks tested (3,600 different scenes each), the greater accuracy of the methods based on point correspondences is unquestionable.

[Figure 5.18 plot: panels "Mean" and "Standard Deviation"; vertical axes: AASD Error (km); horizontal axes: Number of Visible Landmarks.]

Figure 5.18: Error sensitivity with respect to the number of visible landmarks in the scene. Top solid, dotted, dashed, dash-dotted and bottom solid lines represent the results for decoupled recovery from lines, coupled recovery from lines, decoupled recovery from triangles, decoupled recovery from a ring of points and coupled recovery from points, respectively.

5.3.6 Sensitivity to the Height Distribution of the Visible Terrain

The sensitivity measurements with respect to the variation of h_s, the standard deviation of the normal distribution of the terrain height, are plotted in Fig. 5.19 (10 values between 1.25 and 640 m, with 3,600 scenes per value). It can be seen that the accuracy of the algorithms based on point correspondences is roughly invariant with respect to the value of h_s. The algorithms based uniquely on line correspondences, however, are quite sensitive to this parameter, because of the singularity when the scene is perfectly planar. In the case of scenes with features uniformly distributed in a spherical region, for instance, this difference should vanish. But the important fact is that the reduction in the accuracy gap between line-based and point-based techniques happens gradually, rather than in a sharp fashion. So, even in very generic scenarios with big height variations some modest gains can be obtained if a point-based algorithm is used.

[Figure 5.19 plot: panels "Mean" and "Standard Deviation"; vertical axes: AASD Error (km); horizontal axes: Height Standard Deviation (km).]

Figure 5.19: Error sensitivity with respect to the standard deviation of the terrain's height distribution. Top solid, dotted, dashed, dash-dotted and bottom solid lines represent the results for decoupled recovery from lines, coupled recovery from lines, decoupled recovery from triangles, decoupled recovery from a ring of points and coupled recovery from points, respectively.

Now we contrast these results with those obtained by Yuan [70]. Although the use of different experimental setups prevents a direct quantitative comparison, a qualitative analysis is enough to show the benefits of our approach. Yuan's algorithm is, to the best of our knowledge, the only alternative solution for independent orientation recovery from point correspondences available in the literature; Liu et al. [35] suggest a trivial way of creating generic line correspondences from point correspondences, but this obviously does not add any strength to the traditional line-based algorithms. The key idea of Yuan's method is to preprocess an (n − 1) × 3 matrix that encodes the known structure of the n three-dimensional model points in order to obtain a minimal structural representation with only r rows, where r is the rank of the original structure matrix. This minimal structural representation is then used to derive a parametric description of the nine elements that compose the 3 × 3 matrix representing the unknown pose. Finally, the free parameters in this description are determined through a numerical optimization that guarantees the orthonormality of the resulting 3 × 3 pose matrix.

The major drawback of the approach outlined above is that the exact format of the parametric descriptions for the pose matrix's elements depends both on the number of feature points (n) and on the rank of the structure matrix (r). For instance, if the right correspondences for exactly four model points in general position are known, the numerical optimization in Yuan's method is carried out in a four-dimensional space. However, if only three point correspondences in general position are available (and thus the target is a 2D object), a six-dimensional optimization is needed. There is no middle ground between planar and non-planar scenes. The determination of the type of structure available is carried out prior to the optimization phase, and the numerical behavior in each possible case is completely distinct. So, if one decides to be strict about ruling scenes as planar, in quasi-planar scenes the proximity of a singularity will generate numerical instability. But on the other hand, if a more liberal planarity criterion is adopted, then the utilization of a procedure designed specifically to work with exactly planar scenes may result in a drastic accuracy reduction on instances that are actually only quasi-planar.

Yuan performed experiments in which the structure of synthetically generated scenes was disturbed by zero-mean Gaussian noise. The errors in orientation estimates obtained with planar targets were up to an order of magnitude larger than the errors in orientation estimates obtained with points in general position, under the same experimental conditions. Compare this with the results summarized in Fig. 5.19. Our technique is actually more accurate in quasi-planar cases than in cases where the structure is completely general. And, more important, the transition between these two types of instances happens in a continuous, gradual way, since there is no singularity in the planar case. Hence we claim that the technique introduced here is, to the best of our knowledge, the first method for independent orientation recovery that fully exploits the superiority of point correspondences over line correspondences. Furthermore, our technique is much simpler and more intuitive than Yuan's method, because the number of parameters used in the numerical optimization is fixed (four) and the meaning of each of these parameters is very clear, since they directly encode the rotational components of the unknown pose.

5.3.7 Sensitivity to Calibration Bias and Measurement Noise

Finally, the results obtained for the variations of the biases b_m, b_a and of the additive noise n_a are qualitatively very similar. So, here we show only the former, in Fig. 5.20 (10 values between 1.25 × 10^{-3} and 0.64, with 3,600 scenes per value). It can be seen that when the bias (noise) is increased to relatively high levels, the average accuracies of all the different techniques tend to collapse into a single function, as the image signal is gradually overpowered. But on the other hand, when the bias (noise) is reduced beyond a certain critical level, the accuracies of the different types of techniques converge to very distinct "intrinsic" error levels and the superiority of the versions based on point correspondences becomes clear.

5.3.8 Discussion

In the previous chapter, we performed a careful theoretical comparison between Phong-Horaud's original work and our improvements to it. This was especially important as a warning for the users of methods based on line correspondences that these techniques may face certain singularities in real-life applications.

[Figure 5.20 plot: panels "Mean" and "Standard Deviation"; vertical axes: AASD Error (km); horizontal axes: Multiplicative Bias Standard Deviation.]

Figure 5.20: Error sensitivity with respect to the standard deviation of the multiplicative calibration bias. Top solid, dotted, dashed, dash-dotted and bottom solid lines represent the results for decoupled recovery from lines, coupled recovery from lines, decoupled recovery from triangles, decoupled recovery from a ring of points and coupled recovery from points, respectively.

From a more practical point of view, one of the main reasons why line correspondences are so popular is the well-known fact that they allow the complete separation between orientation and position recovery. This trick cuts the dimensionality of the problem in half and usually yields much more efficient techniques. The results that we have just presented here show that our extension of this separation to point correspondences yields solutions that combine efficiency with accuracy. More specifically, we suggested an example of a real-life application that benefits from our technique, namely outdoor visual navigation of a mobile robot. Extensive evaluation with synthetic data showed that, for this particular application, our proposed algorithm is significantly more accurate and even a bit faster than Phong-Horaud's line-based decoupled solution and, on the other hand, it is significantly faster than Phong-Horaud's coupled solution for point correspondences, while retaining roughly the same accuracy.

We formulate our technique as an extension of Phong-Horaud's solution for line correspondences in order to make use of the robustness of their trust-region optimization algorithm and the efficiency of their dual quaternion pose representation. But actually, the alignment constraints that we derive are not tied in any particular way to these two aspects of Phong-Horaud's algorithm, and one might perfectly well use our geometrical formulation with other optimization techniques, such as a simple first-order iterative gradient method, and other pose representation schemes, such as Euler angles and a 3-element translation vector.


Chapter 6

Conclusion and Future Directions

In the past fifteen years or so, several different solutions (both analytical and numerical) have been proposed to the classic problem of model-based pose recovery. However, not much work has been devoted to comparing the relative strengths and weaknesses of these solutions in real-life applications. Unfortunately, some of these techniques have certain potentially dangerous shortcomings, and this fact has not been stressed in the literature. For instance, some researchers [42; 7] have used Lowe's method in practical applications without discussing the appropriateness of his approximation at all. In addition, some of the most recent important advances towards a definite answer to the pose recovery problem (namely, Phong-Horaud's [53] and DeMenthon-Davis's [15] techniques) rely at least partially on alignment constraints derived uniquely from line correspondences, which (as we showed) are subject to certain singularities.

In the present work, we discussed the relative strengths and weaknesses of several techniques for model-based pose recovery from known correspondences. We also proposed reformulations for two of them and showed that these reformulations make a significant difference in practice. More specifically, we showed that Lowe's original pose recovery algorithm uses certain critical simplifying assumptions in its formulation. We proposed a reparametrization of the pose description that allows these simplifications to be removed in a straightforward way, yielding an algorithm with superexponential convergence for a wide range of initial conditions and arguably better computation-time properties. We also studied Phong-Horaud's algorithm for independent orientation recovery from line correspondences, and we showed that the intrinsic weaknesses of this type of geometrical constraint cause accuracy problems in practice. We presented an alternative solution that keeps the decoupling between orientation and position recovery but exploits constraints derived from point correspondences in order to achieve more precise orientation estimates.

In terms of future directions, an obvious alternative is to explore the two conjectures that we left open in Chapter 4. Conjecture 4.1 is especially promising, since the techniques involved (namely DeMenthon-Davis's [15] and Horaud's [23] algorithms) are extremely well suited for real-time usage (due to the off-line computation of the pseudo-inverse model matrix) and are not very sensitive to inaccuracies in the intrinsic camera parameters that are more difficult to calibrate (for instance, the intersection of the optical axis with the image

plane), because they are based on successive refinements of initial affine camera models. So, our impression is that an empirical study using real images obtained with our mobile platform would be much more effective for these two techniques than for Lowe's or Phong-Horaud's algorithms.

Another promising direction is to study ways of combining different tracking techniques, so that we can benefit from more generic techniques in relatively unknown and unstructured domains, but can still make use of specially tailored algorithms in situations that are well understood and occur relatively often. For instance, as we discussed in Chapter 3, affine pose recovery techniques are extremely efficient and thus suitable for use in a real-time tracker, but they are not appropriate for images with significant perspective distortion. On the other hand, the numerical techniques discussed in Chapter 4 do not have this weakness, but they are less efficient and their accuracy depends on the quality of the initial pose estimates available. So, it would be interesting to have a tracking system that could select the most appropriate of these schemes on the fly, or even combine different results obtained by different methods, based on how well each one of them performed in the (recent) past. Gil et al. [19] propose the use of several methods for combining estimators (from the Statistics and Machine Learning literature) in the context of position estimation. However, they did not address the problem of orientation recovery. Actually, the focus of their work is on combining different methods for the two-dimensional problem of identifying the target in individual images, rather than on performing three-dimensional backprojection in multiple ways.

With respect to temporal filtering, some applications can be successfully tackled with traditional techniques, such as Kalman filters estimating position and orientation directly. However, the use of problem-specific knowledge to reduce the state-space dimensionality seems to lead to considerably better results. Along this line, Soatto and Perona [59] have suggested that keeping the target on the optical axis of a single camera and exploiting the resulting additional geometrical constraints can allow a significant increase in pose prediction accuracy. Another strong trend is to use some sort of non-holonomic model of vehicle dynamics, in order to perform tracking in the vehicle-parameter space (steering angle plus speed magnitude, for instance) [31; 43]. While those approaches seem to lead to more precise tracking systems under controlled conditions, they also seem to lead to much more specialized (and thus less robust) trackers. So, combining them with traditional Kalman-filter-based techniques in an adaptive way can be a natural answer to the apparent conflict between precision and generality.

Finally, with the recent advent of uncalibrated methods of scene reconstruction, one of the major current trends in computer vision is the use of techniques that incorporate the computation of the three-dimensional scene structure into the process of real-time pose and motion estimation, instead of working from a predefined scene model. We refer the reader to [45] for a review of some of the fundamental advances in this area. However, it is also known that these techniques have certain limitations when applied to the control of robotic systems (for instance) [46].
So, an interesting future direction would be to explore these uncalibrated techniques and compare them to the calibrated algorithms that we have studied so far, in order to verify in which circumstances each approach is more effective and

even, if possible, try to combine their individual strengths into a unique framework.


References

[1] M. A. Abidi and T. Chandra. A new efficient and direct solution for pose estimation using quadrangular targets: Algorithm and evaluation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5):534-538, May 1995.
[2] T. D. Alter. 3-D pose from 3 points using weak-perspective. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(8):802-808, Aug 1994.
[3] Simon L. Altmann. Rotations, Quaternions and Double Groups. Clarendon Press, 1986.
[4] H. Araujo, R. L. Carceroni, and C. M. Brown. A fully projective formulation to improve the accuracy of Lowe's pose-estimation algorithm. To appear in Computer Vision and Image Understanding.
[5] H. Araujo, R. L. Carceroni, and C. M. Brown. A fully projective formulation for Lowe's tracking algorithm. Technical Report 641, University of Rochester Computer Science Dept., Nov 1996.
[6] J. R. Beveridge and E. M. Riseman. Hybrid weak-perspective and full-perspective matching. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 432-438, 1992.
[7] Alistair J. Bray. Tracking objects using image disparities. Image and Vision Computing, 8(1):4-9, Feb 1990.
[8] C. M. Brown and D. Terzopoulos, editors. Real-Time Computer Vision. Cambridge University Press, 1994.
[9] Christopher M. Brown. Some computational properties of rotation representations. Technical Report 303, University of Rochester Computer Science Dept., Aug 1989.
[10] Christopher M. Brown. Tutorial on filtering, restoration, and state estimation. Technical Report 534, University of Rochester Computer Science Dept., Jun 1995.
[11] S. Christy and R. Horaud. Euclidean reconstruction: from paraperspective to perspective. In Proc. 4th European Conf. on Computer Vision, volume 2, pages 129-140, 1996.
[12] P. I. Corke. Dynamics of visual control. Technical report, CSIRO Division of Manufacturing Technology, 1994.
[13] F. Cozman and E. Krotkov. Position estimation from outdoor visual landmarks for teleoperation of lunar rovers. In Proc. 3rd IEEE Workshop on Applications of Computer Vision, pages 156-161, Sarasota, Florida, Dec 1996.
[14] John J. Craig. Introduction to Robotics: Mechanics and Control. Addison-Wesley, 1989.

[15] D. F. DeMenthon and L. S. Davis. Model-based object pose in 25 lines of code. Int. J. of Computer Vision, 15:123-141, 1995.
[16] M. Dhome, M. Richetin, J-T. Lapreste, and G. Rives. 3-D pose from 3 points using weak-perspective. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(12):1265-1278, Dec 1989.
[17] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, Jun 1981.
[18] Donald B. Gennery. Visual tracking of known three-dimensional objects. Int. J. of Computer Vision, 7(3):243-270, 1992.
[19] S. Gil, R. Milanese, and T. Pun. Combining multiple motion estimates for vehicle tracking. In Proc. 4th European Conf. on Computer Vision, volume 2, pages 307-320, 1996.
[20] W. E. L. Grimson, D. P. Huttenlocher, and T. D. Alter. Recognizing 3D objects from 2D images: An error analysis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 316-321, 1992.
[21] R. M. Haralick and C. Lee. Analysis and solutions of the three point perspective pose estimation problem. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 592-598, 1991.
[22] Simon S. Haykin. Modern Filters. Macmillan Pub. Co., 1989.
[23] R. Horaud, S. Christy, F. Dornaika, and B. Lamiroy. Object pose: Links between paraperspective and perspective. In Proc. 5th IEEE Int. Conf. on Computer Vision, pages 426-433, 1995.
[24] R. Horaud, B. Conio, O. Leboulleux, and B. Lacolle. An analytic solution for the perspective 4-point problem. Computer Vision, Graphics, and Image Processing, 47:33-44, 1989.
[25] D. P. Huttenlocher and S. Ullman. Object recognition using alignment. In Proc. 1st Int. Conf. on Computer Vision, pages 102-111, 1987.
[26] D. P. Huttenlocher and S. Ullman. Recognizing solid objects by alignment with an image. Int. J. of Computer Vision, 5(2):195-212, 1990.
[27] M. Ishii, S. Sakane, M. Kakikura, and Y. Mikami. A 3-D sensor system for teaching robot paths and environments. Int. J. of Robotics Research, 6(2):45-59, 1987.
[28] D. W. Jacobs. Optimal matching of planar models in 3-D scenes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 269-274, 1991.
[29] Kenichi Kanatani. Group-Theoretical Methods in Image Understanding. Springer-Verlag, 1990.

[30] Y. C. Kim and K. Price. Refinement of noisy correspondence using feedback from 3-D motion. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 836-838, 1992.
[31] D. Koller, K. Danilidis, and H.-H. Nagel. Model-based object tracking in monocular image sequences of road traffic scenes. Int. J. of Computer Vision, 10(3):257-281, 1993.
[32] K. Kutulakos and J. Vallino. Affine object representations for calibration-free augmented reality. In Proc. IEEE Virtual Reality International Symposium, 1996.
[33] Y. Lamdan, J. T. Schwartz, and H. J. Wolfson. Object recognition by affine invariant matching. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 335-344, 1988.
[34] S. Linnainmaa, D. Harwood, and L. S. Davis. Pose determination of a three-dimensional object using triangle pairs. IEEE Trans. on Pattern Analysis and Machine Intelligence, 10(5):634-647, Sep 1988.
[35] Y. Liu, T. S. Huang, and O. D. Faugeras. Determination of camera location from 2-D to 3-D line and point correspondences. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(1):28-37, Jan 1990.
[36] David G. Lowe. Solving for the parameters of object models from image descriptions. In Proc. ARPA Image Understanding Workshop, pages 121-127, Apr 1980.
[37] David G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31(3):355-395, Mar 1987.
[38] David G. Lowe. Fitting parameterized three-dimensional models to images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(5):441-450, May 1991.
[39] David G. Lowe. Robust model-based motion tracking through the integration of search and estimation. Int. J. of Computer Vision, 8(2):113-122, 1992.
[40] C.-P. Lu, E. Mjolsness, and G. D. Hager. Online computation of exterior orientation with application to hand-eye calibration. Mathl. Comput. Modelling, 24(5/6):121-143, 1996.
[41] C. B. Madsen. Viewpoint variation in the noise sensitivity of pose estimation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 41-46, 1996.
[42] K. Martin, C. V. Stewart, and R. Hammond. Real time tracking of boroscope tip pose. In Proc. 3rd IEEE Workshop on Applications of Computer Vision, pages 123-128, Sarasota, Florida, Dec 1996.
[43] S. J. Maybank, A. D. Worrall, and G. D. Sullivan. A filter for visual tracking based on a stochastic model for driver behaviour. In Proc. 4th European Conf. on Computer Vision, volume 2, pages 540-549, 1996.

[44] Alan McIvor. An analysis of Lowe's model-based vision system. In Proc. 4th Alvey Vision Conference, pages 73-77, University of Manchester, UK, Aug 1988.
[45] P. F. McLauchlan and D. W. Murray. A unifying framework for structure and motion recovery from image sequences. In Proc. 5th IEEE Int. Conf. on Computer Vision, pages 314-320, 1995.
[46] P. F. McLauchlan and D. W. Murray. Active camera calibration for a head-eye platform using the variable state-dimension filter. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(1):15-21, Jan 1996.
[47] P. F. McLauchlan, I. D. Reid, and D. W. Murray. Recursive affine structure and motion from image sequences. In Proc. 3rd European Conf. on Computer Vision, pages 217-224, 1994.
[48] P. Meer, D. Mintz, and A. Rosenfeld. Robust regression methods for computer vision: A review. Int. J. of Computer Vision, 6(1):59-70, 1991.
[49] D. W. Murray, K. J. Bradshaw, and P. F. McLauchlan. Driving saccade to pursuit using image motion. Int. J. of Computer Vision, 16(3):205-228, 1995.
[50] V. S. Nalwa. A Guided Tour of Computer Vision. Addison-Wesley, 1993.
[51] N. Navab and O. Faugeras. Monocular pose determination from lines: Critical sets and maximum number of solutions. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 254-260, Jun 1993.
[52] D. Oberkampf, D. F. DeMenthon, and L. S. Davis. Iterative pose estimation using coplanar feature points. Computer Vision and Image Understanding, 63(3):495-511, May 1996.
[53] T. Q. Phong, R. Horaud, and P. D. Tao. Object pose from 2-D to 3-D point and line correspondences. Int. J. of Computer Vision, 15:225-243, 1995.
[54] H. P. Rotstein and E. Rivlin. Optimal servoing for active foveated vision. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 177-182, 1996.
[55] T. Shakunaga and H. Kaneko. Perspective angle transform: Principle of shape from angles. Int. J. of Computer Vision, 3:239-254, 1989.
[56] L. S. Shapiro, A. Zisserman, and M. Brady. 3D motion recovery via affine epipolar geometry. Int. J. Computer Vision, 16:147-182, 1995.
[57] Larry S. Shapiro. Affine Analysis of Image Sequences. Cambridge University Press, 1995.
[58] S. Soatto and P. Perona. Dynamic rigid motion estimation from weak perspective. In Proc. 5th IEEE Int. Conf. on Computer Vision, pages 321-328, 1995.
[59] S. Soatto and P. Perona. Motion from fixation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 817-824, 1996.

[60] D. E. Stevenson and M. M. Fleck. Robot aerobics: Four easy steps towards a more flexible camera calibration. In Proc. 5th IEEE Int. Conf. on Computer Vision, pages 34-39, 1995.

[61] Charles V. Stewart. Minpran: A new robust estimator for computer vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(10):925-938, Oct 1995.
[62] D. W. Thompson. Three-dimensional model matching from an unconstrained viewpoint. In Proc. IEEE Int. Conf. on Robotics and Automation, pages 208-220, 1987.
[63] Roger Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. of Robotics and Automation, RA-3(4):323-344, Aug 1987.
[64] M. W. Walker and L. Shao. Estimating 3-D location parameters using dual number quaternions. CVGIP: Image Understanding, 54(3):358-367, Nov 1991.
[65] C. Wiles and M. Brady. Closing the loop on multiple motions. In Proc. 5th IEEE Int. Conf. on Computer Vision, pages 308-313, 1995.
[66] C. Wiles and M. Brady. Ground plane motion camera models. In Proc. 4th European Conf. on Computer Vision, volume 2, pages 238-247, 1996.
[67] C. Wiles and M. Brady. On the appropriateness of camera models. In Proc. 4th European Conf. on Computer Vision, volume 2, pages 228-237, 1996.
[68] A. D. Worrall, K. D. Baker, and G. D. Sullivan. Model based perspective inversion. Image and Vision Computing, 7(1):17-23, 1989.
[69] Y. Wu, S. S. Iyengar, R. Jain, and S. Bose. A new generalized computational framework for finding object orientation using perspective trihedral angle constraint. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(10):961-975, Oct 1994.
[70] Joseph S.-C. Yuan. A general photogrammetric method for determining object position and orientation. IEEE Trans. on Robotics and Automation, 5(2):129-142, Apr 1989.
