Homography-Based Monocular Visual Odometry

Guillaume Caron
Master Research STIC - URCA
[email protected]
Abstract— In robotics, the problem of robot localization has been studied for a long time. One approach is to use cameras, whatever their kind: through the images provided by the sensor, the mobile robot obtains information about the environment in front of it. Although vision is often used only to correct the drift of classical odometry, we want to rely on vision alone. The purpose of this document is to present a new method for estimating the trajectory of a mobile robot in an indoor environment, so that the robot can localize itself. Using a perspective camera acquiring images at a rate of 24 images/sec, interest points are extracted and matched for consecutive pairs of images. This matching is used to estimate the displacement of the camera between two successive shots. After an overview of the classical approaches to this old, recurrent problem, the new homography-based method is presented, followed by first results.
I. INTRODUCTION
Fig. 1. Representation of our working conditions: a mobile robot moves in an industrial plant environment and its camera plane is parallel to the floor and to the ceiling.

Robot navigation is a complex task that merges many functionalities such as robot displacement, mission supervision and environment modeling. Robot localization is one of these tasks and can be defined as the capability of estimating the position of the robot. Several methods make robot self-localization possible. Infrared devices are found in closed environments (industrial plants for instance): the robot is equipped with sensors able to receive information from waypoints along the robot's path in order to guide it to its destination. These methods are effective and used in industry, but their drawbacks are the cost of the devices and, above all, the need to install devices in the environment of the robot.

The idea is therefore to use one or more cameras to allow the robot to see its surrounding environment. Too many laboratories work on this research theme to list them all, but labs such as LAAS-CNRS, IRISA-INRIA and CREA are active on these subjects. Standard perspective cameras, which provide an image, are the classical way to obtain visual information about what a robot sees. In our system, we want to use such a single monocular camera without equipping the environment. The camera is placed on the robot in a particular manner: its optical axis is perpendicular to the ground and the ceiling, so that the image plane is parallel to these two planes (Figure 1).

Our proposition is to detect interest primitives (points) in successive images and match them to estimate the projective transformation that links the two primitive sets. The camera displacement is then estimated, and repeating this process along the displacement of the robot yields its successive positions, i.e. its trajectory.

Here, it is necessary to describe the geometric transformation between two images of the same scene. A camera displacement can be characterized by the geometric transformation of its center with respect to a point lying on a scene plane (Figure 2). This type of transformation is a planar homography (see [9] for more details). It links two primitives, such as interest points, expressed in homogeneous coordinates:

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix},$$

which we can write compactly as $P' = HP$.
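To make the notation concrete, here is a minimal numerical sketch in Python with NumPy of applying a planar homography to a point in homogeneous coordinates; the matrix entries are made up for illustration only.

```python
import numpy as np

# Illustrative values only: a small rotation plus a translation.
H = np.array([[0.98, -0.17, 12.0],
              [0.17,  0.98, -3.5],
              [0.0,   0.0,   1.0]])

p = np.array([120.0, 85.0, 1.0])  # point (x, y) of image 1, homogeneous
p_prime = H @ p                   # P' = H P
p_prime /= p_prime[2]             # normalize the homogeneous scale
print(p_prime[:2])                # corresponding point (x', y') in image 2
```

The division by the third coordinate recovers the Euclidean image point in the general projective case; for the affine homographies used later in this paper, $h_7 = h_8 = 0$ and that scale is already 1.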
In the next section, we briefly review related work on robot self-localization based on visual odometry; two methods based on the same tools as ours will be presented. In Section III, we develop the new method proposed in this paper. The results are then presented in Section IV, before the document concludes.

II. STATE OF THE ART

Generally speaking, methods for mobile robot self-localization using vision can be divided into two categories. First, we find methods based on image indexation. This approach is especially designed for known environments, which would not be a problem in our case since the purpose of our work is to develop a localization method for an industrial plant environment; our work, however, belongs to the second category, trajectory estimation. Many papers deal with estimating the camera displacement with respect to the ground. Pears and Liang [5] work under the assumption that their system moves indoors over a planar structure. They work with interest points and match these primitives in two consecutive images of a video sequence, i.e. images acquired successively during the displacement of the robot. Their work is interesting because they do not merely assume that the interest points lie on a plane: they verify it. In earlier work [4], they select the subset of points that best satisfies the coplanarity constraint to estimate the projective transformation between two images. This method is effective: after a test movement, an error of only four degrees is noted. To compute the rotation angles, they use an eigenvalue decomposition of the homography. This is also interesting since, in contrast to the methods of the INRIA reports [1] and [2] and to our work, it does not require extracting the intrinsic camera parameters from the homography.
Fig. 2. Transformation of the camera position by translation and rotation with respect to a point lying on a scene plane.
Xu De et al. [6] work on a system composed of a mobile robot and a monocular camera looking at a wall, i.e. a plane parallel to the camera plane and perpendicular to the ground plane. However, the camera rotates around its X axis, by an angle defined by the Z axes of the two successive camera positions. Moreover, although the homography is mentioned as a theoretical basis, the rotation/translation pair is computed only through manipulations of pixel coordinates, since their system is a particular case. These two works are close to ours, but some differences are noticeable. They are interesting since, like our proposition, they work with interest points and planar homographies, which supports the interest of our approach.

III. TRAJECTORY ESTIMATION

From related works such as the ones we presented, from stereovision visual odometry [7], and in fact from many other works, the general process of camera displacement estimation between two frames of the same scene is: primitive extraction from the images, matching of the images by their primitives, and estimation of the geometric transformation between the images, from which the transformation of the camera center is deduced (a minimal sketch of this loop is given below). Note this is possible only if the displacement between two shots is small enough for the images to share a common part. This condition is satisfied by our system since the camera acquisition rate is 24 images/sec and the robot speed is low.
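As an illustration of this three-step loop, here is a minimal sketch using off-the-shelf OpenCV components (ORB features, brute-force matching and RANSAC homography estimation) as stand-ins; the paper's own choices, Harris points, ZNCC matching and the Gold Standard algorithm, are detailed in the following subsections.

```python
import cv2
import numpy as np

def displacement_homography(img1, img2):
    """One iteration of the visual odometry loop (OpenCV stand-ins)."""
    orb = cv2.ORB_create()
    k1, d1 = orb.detectAndCompute(img1, None)   # 1. primitive extraction
    k2, d2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(d1, d2)             # 2. primitive matching

    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    H, inliers = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    return H                                    # 3. geometric transformation
```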
A. Interest points extraction

In the literature, we find different image primitives to detect, and [10] is a good review of the most developed detection-matching-transformation estimation pipelines. We choose the Harris corner detector [8] since it is fast, often used in the literature, and has good repeatability rates [13]. The detection of interest points by the Harris method relies on the computation of the so-called structure matrix, built from the partial derivatives of the intensity function $I$ over a window $W$ around the point:

$$M = \sum_{W} \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$$
Next, it is possible to compute the eigenvalues of this matrix and to threshold the lowest one to detect a corner, i.e. a point with high derivatives in more than one direction. But it is also possible to threshold a more precise measure based on the determinant and trace of the structure matrix:

$$R = \det(M) - k\,\mathrm{trace}^2(M)$$
with the constant k fixed at 0.001 in our case by experimentation. Due to the non-constant illumination of the surrounding environment of the mobile robot during its displacements, the threshold cannot be constant. We therefore make it dynamic by computing the maximum corner strength in the image and taking around 10% of it (also fixed experimentally) as the threshold.
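A possible implementation of this detector with the dynamic threshold is sketched below; the Sobel derivatives, the Gaussian window of σ = 1 and the 5×5 non-maximum suppression are assumptions, as none of these details are specified in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, sobel

def harris_points(img, k=0.001, ratio=0.10, sigma=1.0):
    """Harris corners with the dynamic threshold described above (sketch)."""
    img = img.astype(np.float64)
    Ix = sobel(img, axis=1)            # partial derivatives of the
    Iy = sobel(img, axis=0)            # intensity function

    # Entries of the structure matrix, summed over a local window W
    # (here a Gaussian window, an assumption).
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)

    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    R = det - k * trace ** 2           # corner strength measure

    threshold = ratio * R.max()        # dynamic threshold: ~10% of the max
    local_max = (R == maximum_filter(R, size=5))
    ys, xs = np.nonzero((R > threshold) & local_max)
    return np.stack([xs, ys], axis=1)  # (x, y) interest point coordinates
```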
B. Images matching

Now that interest points are detected in two successive images, they must be matched. There are many ways to match points, based on their intensity, their neighborhood, a correlation measure or, less classically, on vectors of invariants, particularly in image indexation methods [13]. In our case, the objective is to produce a real-time system, so a correlation measure was chosen to match points. It has the advantage of being quick and accurate. The correlation measure is the ZNCC: Zero-mean Normalized Cross-Correlation. For a point p of the first image, a list of matching candidates is formed by the points of image 2 included in a window around p. Between p and the candidates, we compute the ZNCC in the following steps:
- for each point, its neighborhood is centered, so that its mean is zero, and normalized;
- next, the neighborhood of p is multiplied by those of the candidates, and the highest value determines the match.
Of course, if there is no candidate, p is eliminated, and if no correlation score between p and its matching candidates is greater than 80%, it means p cannot be retrieved in the second image. A sketch of this matching step follows.
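In the sketch below, the 15×15 neighborhood and the 30-pixel search window are assumptions, while the 80% minimum score follows the text.

```python
import numpy as np

def zncc_match(img1, img2, pts1, pts2, half=7, search=30, min_score=0.8):
    """Match interest points by ZNCC within a search window (sketch)."""
    size = 2 * half + 1

    def patch(img, x, y):
        # Extract, center (zero mean) and normalize the neighborhood.
        w = img[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
        if w.shape != (size, size):
            return None  # too close to the image border
        w = w - w.mean()
        n = np.linalg.norm(w)
        return w / n if n > 0 else None

    matches = []
    for (x1, y1) in pts1:
        w1 = patch(img1, x1, y1)
        if w1 is None:
            continue
        best_score, best_pt = -1.0, None
        for (x2, y2) in pts2:
            # Candidate list: points of image 2 inside a window around p.
            if abs(x2 - x1) > search or abs(y2 - y1) > search:
                continue
            w2 = patch(img2, x2, y2)
            if w2 is None:
                continue
            score = float((w1 * w2).sum())   # ZNCC score, in [-1, 1]
            if score > best_score:
                best_score, best_pt = score, (x2, y2)
        if best_score >= min_score:          # reject weak correlations (< 80%)
            matches.append(((x1, y1), best_pt))
    return matches
```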
C. Homography estimation

Now that we have correspondences, we can compute the transformation. Homography computation has been chosen because we want to follow a plane, and planar homographies are particularly effective in this case. Moreover, pure rotations can occur during the robot displacement, so methods based on the fundamental matrix cannot be used in our case. To estimate the planar homography from the correspondences, we use the Gold Standard algorithm from [9], based on SVD, which yields an affine planar homography. But nothing proves our points are coplanar: even if the camera points at the ceiling, structures can appear in the field of view. So, to make sure the homography is computed from correspondences lying on a scene plane, we use the homography matrix to eliminate outliers by minimizing the reprojection error defined by:

$$\epsilon = \sum_i \left\| p'_i - H\,p_i \right\|^2$$
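The following sketch illustrates this estimate-then-reject loop with a plain least-squares fit of the affine homography; the actual Gold Standard algorithm of [9] additionally normalizes the data and performs a maximum-likelihood refinement, and the inlier tolerance used here is an assumption.

```python
import numpy as np

def estimate_affine_homography(pts1, pts2, inlier_tol=2.0, n_iter=3):
    """Least-squares affine homography with reprojection-based
    outlier rejection (sketch)."""
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    idx = np.arange(len(pts1))
    H, err = np.eye(3), np.zeros(len(pts1))
    for _ in range(n_iter):
        p1, p2 = pts1[idx], pts2[idx]
        # Solve [x y 1] X = [x' y'] in the least-squares sense.
        M = np.column_stack([p1, np.ones(len(p1))])
        X, *_ = np.linalg.lstsq(M, p2, rcond=None)
        H = np.eye(3)
        H[:2, :2] = X[:2].T            # linear part A
        H[:2, 2] = X[2]                # translation t
        # Reprojection error of every correspondence under H.
        proj = (H @ np.column_stack([pts1, np.ones(len(pts1))]).T).T
        err = np.linalg.norm(proj[:, :2] - pts2, axis=1)
        # Keep only correspondences consistent with the dominant plane.
        idx = np.nonzero(err < inlier_tol)[0]
        if len(idx) < 3:
            break                      # not enough points left to refit
    return H, err
```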
When this error is under a threshold, the homography matrix is considered trustworthy. Moreover, since there is always a dominant scene plane (the plant ceiling) parallel to the image plane, this is the plane that will be selected, which leads to interesting simplifications in the camera displacement computation.

D. Camera displacement

From the estimated planar homography, there are several ways to decompose the movement of the camera between two frames. Tsai et al. [11] and Zhang et al. [12] developed methods based on SVD to extract the camera displacement. However, we are in a particular case that allows us to simplify this process. Since the camera plane is parallel to the floor, the camera displacement lies in its own plane. In addition, the observed plane is parallel to the image plane too. So, from the movement of the observed plane between two images, we can directly deduce the movement of the camera, and hence the movement of the robot, since it moves in a parallel plane. In practice, however, the estimated affine homography is noisy and its linear part does not contain only the plane rotation information. [9] presents different kinds of transformation matrices; the affine matrix has the particularity of containing rotation, translation and also deformation information. The deformation information is mixed with the rotation, and we want to separate them. From the definition of H:

$$H = \begin{pmatrix} A & t \\ 0^\top & 1 \end{pmatrix}, \qquad A \in \mathbb{R}^{2\times 2},\; t \in \mathbb{R}^2,$$
A has to be split to recover only the rotation information. A can be decomposed by SVD, and the plane rotation θ can be retrieved from it:

$$A = U D V^\top = (U V^\top)(V D V^\top),$$

where $U V^\top$ is the rotation factor (the deformation being carried by $V D V^\top$), so that $R(\theta) = U V^\top$ and $\theta = \operatorname{atan2}(r_{21}, r_{11})$.
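Put together, this decomposition and the pose update amount to the sketch below; the sign conventions and the accumulation of per-frame motions into a trajectory are illustrative choices of ours, and `homographies` stands for the per-frame estimates.

```python
import numpy as np

def motion_from_affine_homography(H):
    """Split the affine homography into plane rotation and translation."""
    A = H[:2, :2]                      # linear part: rotation + deformation
    t = H[:2, 2]                       # translation part
    U, D, Vt = np.linalg.svd(A)
    R = U @ Vt                         # rotation factor: R(theta) = U V^T
    if np.linalg.det(R) < 0:           # guard against a reflection
        R = U @ np.diag([1.0, -1.0]) @ Vt
    theta = np.arctan2(R[1, 0], R[0, 0])
    return theta, t

# Accumulating per-frame motions into a trajectory (illustrative sign
# conventions; `homographies` stands for the per-frame estimates).
pose_theta, pose_xy, trajectory = 0.0, np.zeros(2), []
for H in homographies:
    dtheta, dt = motion_from_affine_homography(H)
    c, s = np.cos(pose_theta), np.sin(pose_theta)
    pose_xy = pose_xy + np.array([[c, -s], [s, c]]) @ dt  # dt in world frame
    pose_theta += dtheta
    trajectory.append((pose_xy.copy(), pose_theta))
```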
We now have the rotation and translation information of the camera and can easily compute the viewpoint of the second image from that of the first one.

IV. RESULTS

Experiments were carried out on virtual images generated with POV-Ray and on real images acquired by a mobile robot in real conditions. Figures 3 and 4 show the trajectory estimation results in the two cases. Note the low error rates in the virtual images case: an MSE below 3 cm for positions and below 4 degrees for robot orientation.

Fig. 3. Trajectory estimation from virtual images. The green curve represents the real positions and the blue one the estimated positions.

Fig. 4. Trajectory estimation from real images. The upper part of the figure presents the "real" trajectory and the lower part the estimation.
V. CONCLUSION

We presented the problem of self-localization in robot navigation and a way to solve it by visual odometry. The method developed in this paper, and the future work on it, are designed for indoor localization, particularly in an industrial plant, since that is the purpose of our project. A method based on primitive extraction, matching and homography estimation was presented, and the results are encouraging. Future work will focus on improving precision.

REFERENCES

[1] E. Malis, F. Chaumette and S. Boudet, 2D 1/2 visual servoing, Rapport de recherche de l'INRIA, RR-3387, 1998.
[2] A. Bartoli, P. Sturm and R. Horaud, A Projective Framework for Structure and Motion Recovery from Two Views of a Piecewise Planar Scene, Rapport de recherche de l'INRIA, RR-4070, 2000.
[3] M. Pressigout, Approches hybrides pour le suivi temps-réel d'objets complexes dans des séquences vidéo, Thèse, IRISA, 2006.
[4] N. Pears and B. Liang, Ground plane segmentation for mobile robot visual navigation, Proceedings of the 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1513-1518, 2001.
[5] B. Liang and N. Pears, Visual navigation using planar homographies, ICRA'02, pp. 205-210, 2002.
[6] Xu De, Tu Zhi-Guo and Tan Min, Study on visual positioning based on homography for indoor mobile robot, Acta Automatica Sinica, 31(3): 464-469, 2005.
[7] D. Nistér, O. Naroditsky and J. Bergen, Visual Odometry, Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (1), pp. 652-659, 2004.
[8] C. Harris and M.J. Stephens, A combined corner and edge detector, Alvey Vision Conference, pp. 147-152, 1988.
[9] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, second edition, Cambridge University Press, 2003.
[10] P. K. Jain and C. V. Jawahar, Homography Estimation from Planar Contours, Third International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), North Carolina, June 2006.
[11] R. Y. Tsai, T. S. Huang and W.-L. Zhu, Estimating three-dimensional motion parameters of a rigid planar patch, II: Singular value decomposition, IEEE Transactions on Acoustics, Speech and Signal Processing, (ASSP-30:4), pp. 525-534, 1982.
[12] Z. Zhang and A. R. Hanson, Scaled Euclidean 3D reconstruction based on externally uncalibrated cameras, ISCV'95, p. 37, 1995.
[13] C. Schmid, Appariement d'images par invariants locaux de niveaux de gris, Thèse, INPG, July 1996.