Robust Estimation of Camera Translation Between Two Images Using a Camera With a 3D Orientation Sensor

Takayuki Okatani and Koichiro Deguchi
Graduate School of Information Sciences, Tohoku University, Aoba-Campus 01, Sendai 980-8579, Japan
E-mail: okatani, [email protected]
Abstract

Establishing correspondences of feature points and estimating view relations for multiple images of a static scene remains a difficult problem when the images have large disparities. In this paper we explore the possibility of applying a cheap, general-purpose 3D orientation sensor to improve the robustness of matching two such images. We attach a 3D orientation sensor to a camera and use the combined system to acquire the images. The camera orientation is obtained from the sensor. Assuming known intrinsic parameters of the camera, only the camera translation between the two views remains to be estimated. Owing to the small number of parameters to be estimated, it becomes possible to apply a voting method. We show that the voting method is more robust than methods based on random sampling, especially for pairs of images that are difficult to match. In addition, using the known camera orientation, the images can be rectified before candidate matches are searched for, so that it is as if they were taken by parallel cameras. This helps find as many correct matches as possible for pairs of images that include rotation around the camera axis. Experimental results for synthetic as well as real images are shown.
1. Introduction

Reconstructing the 3D structure of a scene from images taken by a camera moving freely in the scene has been one of the main themes of computer vision. In many methods for 3D reconstruction and image synthesis, for example shape from silhouette, space carving, and light field rendering, the poses of the camera must be accurately determined. This paper presents a method for obtaining camera pose using a camera with a 3D orientation sensor. The orientation of the camera is obtained from the orientation sensor rigidly attached to it. The goal is to robustly establish correspondences between images and to determine camera poses using the combination of the camera and the orientation sensor.
Establishing correspondences of feature points across multiple images remains difficult, especially when there are large disparities between the images. Several methods based on random sampling have been proposed [3, 2, 6]. They perform well in some cases but can fail even for pairs of images that are seemingly not ill-conditioned. Several possible reasons, and corresponding solutions, have been proposed so far. We think the main cause lies in the nature of random sampling: a two-view relation with many degrees of freedom has to be estimated from only a small number of point correspondences. To cope with this, we attach a 3D orientation sensor to a camera and assume known intrinsic parameters of the camera. The translation of the camera is then the only parameter to be estimated. We argue that the difficulties in the problem could be resolved if we gave up estimating all of the camera parameters from the images alone and instead obtained as much information as possible from sources outside the images. For example, recent digital still cameras record the focal length of the lens at the time of image acquisition, and this could be used even when the focal length varies over the images. Many cheap 3D orientation sensors have recently become available. A 3D orientation sensor reports the orientation of the sensor in space as three angles such as roll, pitch, and yaw. It is usually a combination of accelerometers, gyro sensors, and a magnetometer. Attaching one to a camera does not spoil the portability of the camera; the camera with the sensor can be used in the same way as an ordinary camera. There have been several attempts to use a 3D orientation sensor in vision problems: a method for eliminating the ambiguity in the factorization of structure and motion, and methods for improving the accuracy of structure and motion for an image sequence [1] and for a pair of stereo images. All of them assumed known point correspondences across the images, however. Unlike these previous methods, we aim at robustly matching feature points by exploiting the advantages of knowing the camera orientation. First, because of the
known orientation of the camera, it becomes possible to rectify the images so that it is as if they were taken by parallel cameras. This helps the search for candidate matches by image correlation to find as many correct matches as possible. Secondly, since only a few parameters need to be estimated, we can apply a voting method, which performs better than methods based on random sampling when there are a large number of mismatched points. Both are expected to improve the robustness of the method.
2. Image matching when camera orientation is known

Figure 1. Algorithm for estimating the translation by voting:
1. Define the area for voting in the $t_2$-$t_3$ space and initialize it.
2. For each candidate match, execute the following:
   2.1 Compute $\mathbf{a}$ for the given candidate.
   2.2 Vote for all $\mathbf{t}$ satisfying $\mathbf{a}^{\top}\mathbf{t} = 0$; that is, draw a line in the voting space in an additive manner.
3. Detect peaks in the voting space and recover $\mathbf{t}$.
2.1. Relation between two images

This section briefly summarizes the relation between two images taken by a camera of known orientation. Let $X$ be a vector containing the homogeneous coordinates of some world point represented in the world coordinate system. Also let $\mathbf{x}$ and $\mathbf{x}'$ be vectors containing the homogeneous coordinates of its image points for the first and the second camera, respectively. These two vectors satisfy

$$\mathbf{x}'^{\top} F \mathbf{x} = 0. \qquad (1)$$

In the case of uncalibrated cameras and unknown camera poses, the problem is to estimate $F$ as well as to search for matches of feature points. We use $R$ and $\mathbf{t}$ to denote the rotation matrix and translation vector from the first camera coordinate system to the second, and $A$ and $A'$ to denote the $3 \times 3$ matrices containing the intrinsic camera parameters for the two views, respectively. Then we have $\mathbf{x}' \propto A' [R \;\; \mathbf{t}] X$. The fundamental matrix $F$ is given by $F = A'^{-\top} T R A^{-1}$, where $T$ is a $3 \times 3$ matrix such that $T\mathbf{p} = \mathbf{t} \times \mathbf{p}$ for an arbitrary 3-vector $\mathbf{p}$. If the intrinsic parameters $A$ and $A'$ are known, and the rotation $R$ between the two views is also known, rewriting Eq. (1) yields a constraint on the unknown translation $\mathbf{t}$ of the form

$$\mathbf{a}(\mathbf{x}, \mathbf{x}'; A, A', R)^{\top}\, \mathbf{t} = 0. \qquad (2)$$

The scale of the translation $\mathbf{t}$ cannot be determined because of the scaling ambiguity; thus $\mathbf{t}$ has only two degrees of freedom. By imposing an additional constraint, e.g., $\|\mathbf{t}\| = 1$, $\mathbf{t}$ can be determined from two point matches.
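As a concrete illustration, the following NumPy sketch computes one such coefficient vector. The closed form $\mathbf{a} = (R A^{-1}\mathbf{x}) \times (A'^{-1}\mathbf{x}')$ is our reading of Eq. (2), obtained from the scalar triple product; the function name is ours, not from the paper.

```python
import numpy as np

def constraint_vector(x1, x2, A1, A2, R):
    """Sketch of the coefficient vector a of Eq. (2).

    From x2^T A2^{-T} [t]x R A1^{-1} x1 = 0, writing u = A2^{-1} x2 and
    v = R A1^{-1} x1 gives u . (t x v) = (v x u) . t, so a = v x u.
    x1, x2: homogeneous pixel coordinates (3-vectors) of a candidate match.
    """
    v = R @ np.linalg.solve(A1, x1)   # ray of the first view, rotated into view 2
    u = np.linalg.solve(A2, x2)       # ray of the second view
    return np.cross(v, u)             # a^T t = 0 holds for the true translation t
```

Each candidate match contributes one such vector; the true translation lies in the null space of the vectors produced by correct matches.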
2.2. Search for candidate matches

We extract corners from each of the two images using the Harris corner detector and use them as feature points. For these feature points, we search for candidate matches between the two images. For a point in the first image, candidate matches are determined based on similarity between the images, measured by the normalized correlation of the image intensity. Since we know the camera orientation, the images can be rectified before computing the image correlation so that it is as if they were taken by parallel cameras. This greatly helps in finding correct matches, especially when the images are rotated relative to each other, since image rotation can be a source of failure in the search for candidate matches. The rectification is done as follows. First we define an appropriate plane that is intermediate between the two image planes of the two camera poses. Using $R$ to denote the rotation matrix between the two camera orientations, we obtain such a plane by rotating the image plane of the first view by a rotation $\tilde{R}$ that has the same rotation axis as $R$ but half its rotation angle. The images are then rectified using homographies, e.g., $H = B \tilde{R} A^{-1}$, where $B$ is the camera matrix for the rectified image. This is done for both views. The resulting images are expected to have only a small relative image rotation, as illustrated by the sketch below.
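A minimal sketch of these homographies, assuming SciPy's axis-angle conversion for the half rotation; $B$ and the function name are illustrative choices, not prescribed by the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rectifying_homographies(A1, A2, R, B):
    """Sketch of the half-rotation rectification of Sec. 2.2.

    R maps first-camera coordinates to second-camera coordinates.
    R_half shares R's axis with half its angle, so R = R_half @ R_half,
    and both views are warped onto the intermediate image plane.
    B is a camera matrix chosen for the rectified images.
    """
    rotvec = Rotation.from_matrix(R).as_rotvec()           # axis * angle
    R_half = Rotation.from_rotvec(0.5 * rotvec).as_matrix()
    H1 = B @ R_half @ np.linalg.inv(A1)    # first view -> intermediate plane
    H2 = B @ R_half.T @ np.linalg.inv(A2)  # second view; R_half @ R.T = R_half.T
    return H1, H2

# The warp itself can then be applied with, e.g., OpenCV:
# rect1 = cv2.warpPerspective(img1, H1, (w, h))
```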
2.3. Estimation of camera translation by voting

This section describes the voting method. Given a correct point match, we have one equation in the unknown $\mathbf{t}$: $\mathbf{a}(\mathbf{x}, \mathbf{x}'; A, A', R)^{\top}\mathbf{t} = 0$. Because of the scaling ambiguity, we set $\mathbf{t} = (1, t_2, t_3)^{\top}$. The above equation then describes a line in the $t_2$-$t_3$ plane. For a given set of multiple matches, multiple lines can be drawn in the same way. If the matches are all correct, the lines must cross at a single point whose coordinates are the true values $(\bar{t}_2, \bar{t}_3)$ of $(t_2, t_3)$. Even if there are incorrect matches and wrong lines are drawn for them, it is expected that the largest number of votes accumulates around $(\bar{t}_2, \bar{t}_3)$. Thus we can estimate $\mathbf{t}$ by the algorithm shown in Fig. 1. The voting process can also be understood geometrically. We are given the camera rotation and intrinsic parameters. Imagine moving both camera coordinate systems so that their projection centers coincide, as shown in Fig. 2.
Figure 2. Illustration of the voting method: the two camera coordinate systems are moved so that their projection centers coincide; image points $x_i$, $x_i'$, $x_j$, $x_j'$ and the translation $\mathbf{t}$ are shown. See text.
Then, for an arbitrary world point, its epipolar plane passes through the coinciding projection centers as well as the corresponding image points in the two images. Thus, each point match gives one epipolar plane. The camera translation to be estimated is the line in which the epipolar planes generated by different point matches intersect. Calculating an epipolar plane corresponds to drawing a line in the voting space, and selecting the intersection line corresponds to peak detection in the voting space. A minimal sketch of the voting loop is given below.
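The following Python sketch implements the accumulation of Fig. 1 on a discrete grid; the grid bounds, resolution, and peak detection by a simple argmax are our simplifications, not the paper's exact choices.

```python
import numpy as np

def estimate_translation_by_voting(a_vectors, t_range=2.0, bins=200):
    """Sketch of the voting algorithm of Fig. 1.

    With t = (1, t2, t3), each constraint a^T t = 0 reads
    a0 + a1*t2 + a2*t3 = 0: a line in the t2-t3 plane.  One vote is
    cast along the line of every candidate match, and the peak of the
    accumulator gives the estimate (up to scale).
    """
    acc = np.zeros((bins, bins))
    t2 = np.linspace(-t_range, t_range, bins)
    cell = 2.0 * t_range / bins
    for a0, a1, a2 in a_vectors:
        if abs(a2) < 1e-12:
            continue  # nearly vertical line; a full version rasterizes over t3 too
        t3 = -(a0 + a1 * t2) / a2              # sample the line along the t2 axis
        j = np.round((t3 + t_range) / cell).astype(int)
        ok = (j >= 0) & (j < bins)             # keep samples inside the grid
        acc[np.nonzero(ok)[0], j[ok]] += 1     # additive voting
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return np.array([1.0, t2[i], -t_range + cell * j])
```

This is a Hough-style accumulation: every candidate match of every feature point votes, and only the peak needs to be correct, so a large fraction of wrong candidates can be tolerated.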
2.4. Comparison with random sampling

In methods using random sampling, once correspondences of points are determined they are fixed and never changed during the sampling process; the method merely tests whether or not each correspondence is correct. Thus, a method using random sampling cannot deal with the possibility that a point might correspond to a point other than its fixed match. In the voting method, on the other hand, we can consider multiple candidates at the same time: for a point in the first image, multiple points in the second image can be chosen as candidate matches, and all of them used for voting. The more the camera moves in space, the more difficult it is to find the single correct match in the second image for a given point in the first image. Even in such a case, it is much easier to find a set of points in the second image that includes the correct match as an element. Using this set of candidate matches, the voting method can still estimate the correct parameters. The voting method is therefore expected to handle difficult pairs of images for which methods using random sampling would fail.
3. Experimental results

3.1. Results for synthetic data

In order to test the robustness of our method, we conducted experiments using synthetic data. For 200 feature points
lying on two perpendicular planes, two images are generated by changing the camera pose. Uniform noise in the range $[-\sigma, \sigma]$ is added to the image coordinates of the feature points in both images, and candidate matches are randomly generated. As a measure of the error of the estimate, we use the angle between the estimated translation and the true translation $\bar{\mathbf{t}}$. The distribution of this error is calculated over 100 trials, in each of which the image coordinates are generated with uniform noise and the candidate matches are randomly selected. In this experiment, 20 points in the second image are selected as candidate matches for each point in the first image, such that the 20 chosen points always include the correct match. This corresponds to the case of 95% mismatches. Figure 3 shows the results of our method and those of a method using random sampling. In the method using random sampling, 5000 samples are used, more than the roughly 1200 samples required to achieve at least a 95% chance of success. It can be seen that the voting method produces better results than the method using random sampling.

Figure 3. Upper row: results by a random-sampling-guided method. Lower row: results by our method. Left column: σ = 0.5. Right column: σ = 2.0. (Horizontal axes: error in radians.)
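As a check on the figure of roughly 1200 samples quoted above, the standard sample-count bound for random sampling, with $s = 2$ matches per minimal sample and inlier ratio $w = 1/20 = 0.05$, gives:

```latex
% N trials yield, with probability p, at least one all-inlier sample
% of size s when  N >= log(1-p) / log(1 - w^s):
\[
N \;=\; \frac{\log(1-p)}{\log\!\left(1-w^{s}\right)}
  \;=\; \frac{\log(1-0.95)}{\log\!\left(1-0.05^{2}\right)}
  \;\approx\; 1197 .
\]
```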
3.2. Results for real images

We applied the proposed method to real images. We used a Datatec GU-3011 3D orientation sensor and a NIKON D1 still camera. Images were resized to 500 × 328. The output of the sensor is not the camera orientation but the orientation of the sensor itself. In order to convert the sensor output into the camera orientation, we use the hand/eye calibration techniques developed in robot vision research [4]. To obtain the intrinsic camera parameters we use Zhang's method [5]. After estimating the camera translation, we apply a second rectification to the images, using the estimated translation, so that the epipolar lines become the horizontal lines of the images. This helps subsequent dense stereo matching; here the rectification is used just to check whether the estimation is correct.
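One possible reading of this conversion step is sketched below. The exact frame conventions depend on the sensor and the calibration, so this is only a hedged sketch: `R_sc`, the fixed sensor-to-camera rotation assumed to be recovered once by hand/eye calibration [4], and the function name are ours.

```python
import numpy as np

def camera_rotation_between_views(R_sensor1, R_sensor2, R_sc):
    """Sketch: two sensor readings -> relative camera rotation.

    R_sensor1, R_sensor2: sensor orientations (sensor frame -> world frame)
    reported by the 3D orientation sensor at the two views.
    R_sc: fixed sensor-to-camera rotation from hand/eye calibration [4].
    Returns R, the rotation from the first camera frame to the second.
    """
    # Relative sensor rotation between the views (the world frame cancels).
    R_rel_sensor = R_sensor2.T @ R_sensor1
    # Conjugate by the mounting rotation to express it between camera frames.
    return R_sc @ R_rel_sensor @ R_sc.T
```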
Figure 4. Result of estimating the camera translation for images having rotation. The upper row shows the original images and the lower row shows the final rectified images.

Figure 4 shows a result. The original two images are of books on a table. This pair of images presents several difficulties because of which conventional methods using an ordinary camera would fail. One is the problem of degeneracy: since the feature points are mostly distributed on a planar surface, it is difficult to accurately determine the fundamental matrix. Another is that the images are considerably rotated relative to each other. Our method is free from the degeneracy problem, and rectifying the images before taking the image correlation solves the problem of the image rotation. The small square shown in the lower left corner is a plot of the voting space.

Figure 5. Result for images with large disparities.

Figure 5 shows another result, demonstrating the robustness of our method when the pair of images has large disparities and occlusions.

Figure 6. Result for images including moving objects.

Figure 6 shows yet another result. There are moving cars in the images, and the method nevertheless finds the correct solution.

4. Summary

We have presented a method for robustly estimating the camera translation between two images using a camera with a 3D orientation sensor. Since the parameters to be estimated have only two degrees of freedom, we apply a voting method to estimate the correspondences of points across the images and to determine the camera translation. Together with the rectification of the images using the known camera orientation, this improves the robustness of matching. The experiments with real images yield promising results.

References

[1] T. Mukai and N. Ohnishi. The recovery of object shape and camera motion using a sensing system with a video camera and a gyro sensor. In Proceedings of the IEEE International Conference on Computer Vision, 1999.
[2] P. H. S. Torr and C. Davidson. IMPSAC: Synthesis of importance sampling and random sample consensus. In Proceedings of the European Conference on Computer Vision, 2000.
[3] P. H. S. Torr and D. W. Murray. The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision, 24(3):271–300, 1997.
[4] R. Y. Tsai and R. K. Lenz. A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. IEEE Transactions on Robotics and Automation, 5(3):345–358, 1989.
[5] Z. Zhang. A flexible new technique for camera calibration. Technical Report MSR-TR-98-71, Microsoft Research, 1998.
[6] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 78:87–119, 1994.