Fusion of Discrete and Continuous Epipolar Geometry for Visual Odometry and Localization

David Tick
Computer Science Dept., University of Texas at Dallas, Dallas, Texas 75252
Email: [email protected]

Jinglin Shen
Electrical Engineering Dept., University of Texas at Dallas, Dallas, Texas 75252
Email: [email protected]

Dr. Nicholas Gans
Electrical Engineering Dept., University of Texas at Dallas, Dallas, Texas 75252
Email: [email protected]
Abstract—Localization is a critical problem in building mobile robotic systems capable of autonomous navigation. This paper describes a novel visual odometry method that improves the accuracy of localization when a camera views a piecewise planar scene. Discrete and continuous Homography Matrices are used to recover position, heading, and velocity from images of coplanar feature points. A Kalman filter fuses the pose and velocity estimates to increase their accuracy. Simulation results are presented to demonstrate the performance of the proposed method.
I. INTRODUCTION

Building mobile robotic systems that are capable of real-time autonomous navigation is a complex, multi-faceted problem. One of the primary aspects of this problem is the task of localization. Localization, or pose estimation, refers to the task of estimating the velocity (both angular and translational), position, and orientation of the robot at any given instant. There are many established ways to approach the task of localization, including wheel odometry [1], inertial sensors [1], GPS [1], sonar [2], and IR/laser-based range finding sensors [3]. There has also been significant development of localization techniques that are solely vision-based.

Visual odometry is a method of localization that uses one or more cameras to continually capture images or video frames of a scene [4]. The frames are analyzed sequentially using various computer vision techniques. This analysis estimates the angular and translational velocities of the camera between each time step. These velocities can then be integrated to estimate how the camera has moved over time. Using this technique, an estimate of a mobile robot's speed, position, and heading (orientation) can be calculated and maintained over time. A pose estimate produced in this manner can be further enhanced by combining visual odometry with traditional wheel odometry and other sensor devices. This can be achieved through signal processing techniques collectively referred to as sensor fusion. The Kalman filter is particularly useful for fusing signals from multiple sensors and removing errors in localization that occur due to factors such as sensor noise, quantization, flawed process modeling, and sensor bias or drift [1], [5].

Some vision-based localization techniques are designed to calculate the pose of the camera relative to some well known
reference object that appears in each frame [6], [7]. These techniques are usually dependent on accurate prior knowledge of the geometric properties of the scene or a reference object. In many situations, there is little accurate prior knowledge regarding a scene and the objects therein. However, in such cases, two frames can be compared with one another based on a set of feature points that exists in both frames. Once a set of feature points is identified in both frames, a homogeneous mapping of the feature points from one frame onto the next must exist. This mapping can be described as a linear transformation and encapsulates the rotation and translation of the camera that occur between the capture of the two images. In practice, the Essential Matrix or Euclidean Homography Matrix is used to estimate a camera's pose in terms of a set of rotational and translational transformations [8], [9].

Over the years, many control systems have been designed that utilize Essential or Homography Matrices in vision-based robotic tasks [10]–[13]. The Homography Matrix, in its discrete form, has been widely used for many years by the community [14]–[18]. Of particular interest is the application of the Euclidean Homography Matrix as a means of estimating velocity [19]. However, there has not been as much work regarding applications of the Homography Matrix in its continuous form. The Euclidean Homography Matrix can be used in both a continuous and a discrete form [9]. If one performs a discrete homography-based estimation of the camera position and orientation (i.e., pose), then one can also integrate the continuous estimate of the velocity at each time step to extract the pose. The pose estimate obtained by integrating the continuous homographic velocity estimate must agree with the discretely estimated pose of the camera.

In this paper, we propose a method that uses a single six degree-of-freedom (6DOF), or "eye-in-hand", camera to perform homography-based visual odometry. This system captures video frames and identifies sets of coplanar feature points in a scene. It estimates the change in pose of the camera by using the discrete form of the Euclidean Homography Matrix. The system also utilizes the continuous form of the Euclidean Homography Matrix to estimate the velocity of the camera over time. Our system utilizes the well-known Kalman
filter to fuse these two estimates and remove error that accrues over time due to integration, noise, and quantization [5].

Section II clarifies the terminology used in this paper and formally defines and models the problem. Section III explains the proposed approach to homographic visual odometry. Section IV presents simulation results that illustrate the effectiveness of the proposed system. Finally, Section V presents our conclusions.

II. BACKGROUND

While certain formal conventions exist with respect to the terminology used in vision-based localization, there is no official standard. Thus, it is necessary to define the terminology used in this paper.

A. Formal Definition and Terminology

Fig. 1 illustrates a camera, with attached reference frame, moving over time. The lowercase subscript attached to each axis label indicates the frame of reference to which that axis belongs. Navigation sensors, including cameras, measure quantities in terms of the moving body frame, formally called Fb. These measurements are then rotated to obtain localization with respect to the world frame, formally called Fw. The z-axis of Fb is oriented along the optical axis, the x-axis is oriented along the horizontal direction of the image plane, and the y-axis is oriented parallel to the vertical direction of the image.

The orientation and position (collectively referred to as pose) of a body frame in terms of the world frame is expressed as a function of time by including the index number for a given time step in parentheses. For example, in Fig. 1, Fb(t0) describes the pose of the body frame Fb at time index t0, while Fb(tk) describes the pose of Fb at time tk as measured from Fb(t0). The changes in the pose of Fb that occur over the time interval [t0, t1, ..., tk−1, tk] are described in terms of a translation vector Tk ∈ R^3 and a rotation matrix Rk ∈ SO(3), formally written as (Tk, Rk), where SO(3) denotes the Special Orthogonal Group of order three. At any given time tk, the instantaneous linear and angular velocities of the body frame Fb are described as a pair of vectors (vk, ωk), where vk ∈ R^3 and ωk ∈ R^3.

B. Pinhole Camera Model

Fig. 2 illustrates a camera taking two images from two different poses Fb∗ and Fb(t). Fb∗ is considered a static reference frame, such as the pose at time t = 0. Fb(t) is considered a moving or current frame. The changes between the two poses, as stated above, are encapsulated by (T, R). Fig. 2 also shows N ≥ 4 feature points that all lie in the plane πs. The 3D coordinates of each feature point as measured from the reference frame of the pose Fb(t) are defined in terms of a vector mj(t) ∈ R^3. Similarly, the 3D coordinates of each feature point as measured from
Fig. 1. Translation and Rotation (Tk, Rk) of the Body/Camera Frame (Fb)

Fig. 2. Planar Scenes and Feature Points
the reference frame of the pose Fb∗ are defined in terms of a vector m∗j ∈ R^3. Formally, these vectors are given as

m∗j = [x∗j, y∗j, z∗j]^T, ∀j ∈ {1, . . . , N}
mj(t) = [xj(t), yj(t), zj(t)]^T, ∀j ∈ {1, . . . , N}.
Two different images are captured by the camera at the two poses Fb∗ and Fb (t). This is modeled by projecting the feature points onto the 2D image planes πb∗ and πb (t). The coordinates of the feature points in these 2D planes are expressed as a normalized set of 3D coordinates, where depth along the z-axis is set equal to one. The normalized image plane coordinates that result from this projection, as measured
from the reference frame of the pose Fb(t) are defined in terms of a vector mj(t) ∈ R^3. Similarly, the normalized image plane coordinates of each feature point as measured from the reference frame of the pose Fb∗ are defined in terms of a vector m∗j ∈ R^3. These vectors are expressed as

mj(t) = [xj(t)/zj(t), yj(t)/zj(t), 1]^T, ∀j ∈ {1, . . . , N}
m∗j = [x∗j/z∗j, y∗j/z∗j, 1]^T, ∀j ∈ {1, . . . , N}.
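As a concrete illustration of this projection, the following Python sketch (illustrative only; the numeric point coordinates are hypothetical and not taken from the paper) computes normalized image coordinates from 3D feature points expressed in the camera frame.

```python
import numpy as np

def normalize_points(points_3d):
    """Project 3D points (rows of an N x 3 array, camera-frame coordinates)
    onto the normalized image plane z = 1 by dividing by the depth z_j."""
    points_3d = np.asarray(points_3d, dtype=float)
    z = points_3d[:, 2:3]                      # depths z_j, shape (N, 1)
    return np.hstack([points_3d[:, :2] / z,    # (x_j / z_j, y_j / z_j)
                      np.ones((points_3d.shape[0], 1))])

# Four hypothetical coplanar feature points in the camera frame (meters)
pts = np.array([[ 0.2,  0.1, 2.00],
                [-0.3,  0.2, 2.10],
                [ 0.1, -0.2, 1.90],
                [-0.1, -0.1, 2.05]])
m = normalize_points(pts)   # each row is [x/z, y/z, 1]
```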
C. Euclidean Homography

In this work we focus on the planar homography case, that is, all 3D feature points lie on the same plane. Consider the two images of the points on the plane πs mentioned above; the transformation between the two views is given by [9]

mj = R m∗j + T.    (1)

The relationship in (1) can be rewritten in terms of image points as [8]

mj = (R + (1/d∗) T n∗T) m∗j

where n∗ = [n∗x, n∗y, n∗z]^T is the constant unit normal vector of the plane πs measured in Fb∗, and d∗ is the constant distance from the optical center of the camera (i.e., the origin of Fb∗) to the plane πs. We assume that d∗ is known in this initial investigation. The matrix

Hd = R + (1/d∗) T n∗T

defines what is known as the discrete Homography Matrix. By using the four-point algorithm, the Homography Matrix Hd(t) can be solved to recover the translation vector T(t) and the rotation matrix R(t) [8].

In the continuous case, the image point mj(t) and its optical flow ṁj(t) are measured instead of the image pair m∗j and mj(t). The time derivative of mj(t) satisfies

ṁj = ω̂ mj + v    (2)

where ω̂(t) is the skew-symmetric matrix of ω(t). The relationship in (2) can be rewritten in terms of image points as [9]

ṁj(t) = (ω̂ + (1/d∗) v n∗T) mj(t).

The matrix

Hc = ω̂ + (1/d∗) v n∗T

is defined as the continuous Homography Matrix. Similar to the discrete form, a four-point algorithm gives the solution for the linear velocity v(t) and the angular velocity ω(t) of the camera.
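To make the four-point computation concrete, the following Python sketch (an assumed implementation for illustration, not the authors' code) estimates Hd from N ≥ 4 normalized correspondences by stacking the constraint mj × (Hd m∗j) = 0 and solving the resulting linear system in the least-squares sense; the recovered matrix can then be decomposed into R, T, and n∗ following [8], [9].

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def estimate_discrete_homography(m_star, m):
    """Least-squares estimate of Hd from N >= 4 normalized image point
    correspondences (rows of m_star and m), using m_j x (Hd m*_j) = 0."""
    rows = []
    for ms, mc in zip(m_star, m):
        # kron places m*_j in the columns matching each row of vec(Hd)
        rows.append(skew(mc) @ np.kron(np.eye(3), ms.reshape(1, 3)))
    A = np.vstack(rows)                          # (3N x 9) coefficient matrix
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)                     # null-space solution, up to scale
    H /= np.linalg.svd(H, compute_uv=False)[1]   # fix scale: middle singular value = 1
    if sum(mc @ H @ ms for ms, mc in zip(m_star, m)) < 0:
        H = -H                                   # resolve the sign ambiguity
    return H   # decompose into R, T/d*, n* with the method of [8], [9]
```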
III. APPROACH

As described in Section II, the translation and rotation of a camera can be recovered by applying the Homography Matrix method in its discrete form. Similarly, the linear and angular velocities of the camera can be obtained using the continuous form. However, the estimates can be noisy due to sensor noise from the camera and numerical error produced during computation of the Homography Matrix. Integration of the continuous velocity estimate over time should agree with the discrete pose estimate, thus fusing the velocity and pose estimates will reduce the effect of noise. A Kalman filter is built to perform the sensor fusion, including integration, and to generate the final pose and velocity estimates.

The reference feature points m∗j are taken as the points at the initial time t0. At each time step tk, the feature points m∗j and mj(tk) are used to calculate the Homography Matrices Hd(tk) and Hc(tk). The Homography Matrices are decomposed to solve for Rk, Tk, vk, ωk. The states in the system are the pose and velocity of the camera. Roll, pitch, and yaw angles (denoted r(tk), p(tk), w(tk)) are used to represent camera orientation instead of the recovered rotation matrix. The relation between the angle rates ṙ(tk), ṗ(tk), ẇ(tk) and the angular velocity ωk = [ωx(tk), ωy(tk), ωz(tk)]^T is given by [20]

ṙ = (ωx sin r + ωy cos r) tan p + ωz    (3)
ṗ = ωx cos r − ωy sin r    (4)
ẇ = (ωx sin r + ωy cos r) sec p.    (5)
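For reference, equations (3)-(5) translate directly into code. The following Python sketch performs this conversion from body angular velocity to Euler angle rates.

```python
import numpy as np

def euler_rates(r, p, omega):
    """Convert body angular velocity omega = [wx, wy, wz] into the roll,
    pitch, and yaw rates (r_dot, p_dot, w_dot) using equations (3)-(5)."""
    wx, wy, wz = omega
    r_dot = (wx * np.sin(r) + wy * np.cos(r)) * np.tan(p) + wz   # eq. (3)
    p_dot = wx * np.cos(r) - wy * np.sin(r)                      # eq. (4)
    w_dot = (wx * np.sin(r) + wy * np.cos(r)) / np.cos(p)        # eq. (5)
    return r_dot, p_dot, w_dot
```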
The system model used in the Kalman filter is given by

xk = Fk · xk−1 + qk    (6)
yk = Ck · xk + rk.    (7)
In (6), the state vector is given by

xk = [x, y, z, p, w, r, vx, vy, vz, ωx, ωy, ωz]^T ∈ R^12

and qk ∈ R^12 is a normally distributed random process with zero mean and covariance matrix Qk ∈ R^{12×12}. The state transition matrix Fk ∈ R^{12×12} and the process covariance matrix Qk can be found in the Appendix. A random walk process is used to determine Qk. In Fk and Qk, the term ∆t is the sampling period of the camera (the inverse of the frame rate), and the various σ terms are scaling terms for the states, indicated as subscripts. In (7), the measurement matrix Ck ∈ R^{12×12} is an identity matrix, since all components of the state vector are measured. The sensor noise rk ∈ R^12 is a normally distributed random process with zero mean and covariance matrix Rk ∈ R^{12×12}, a diagonal matrix whose diagonal elements are chosen according to the estimated measurement noise of each state. Given the terms in (6) and (7), the Kalman filter is designed and updated in the typical manner [5]. Estimates from the update step provide the pose and velocity estimates that are in turn used for localization of the camera.
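A minimal sketch of one predict/update cycle of such a filter is given below, using the standard Kalman filter equations from [5]; the function and variable names are illustrative, and Fk, Qk are built as described in the Appendix.

```python
import numpy as np

def kalman_step(x, P, y, F, Q, C, R):
    """One predict/update cycle of the Kalman filter in (6)-(7).
    x, P : previous state estimate and covariance
    y    : measurement from the homography decomposition
           [position, angles, linear velocity, angular velocity]
    F, Q : state transition and process covariance matrices (see Appendix)
    C, R : measurement matrix (identity here) and measurement covariance"""
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new
```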
Fig. 3. Position estimation when no camera noise added

Fig. 6. Angular velocity estimation when no camera noise added
IV. RESULTS

Fig. 4. Orientation estimation when no camera noise added
Fig. 5. Linear velocity estimation when no camera noise added
In this section, simulation results of the proposed homographic visual odometry method are presented. A single camera observes four static, coplanar feature points. The camera motion is generated so that the four points always stay in the field of view. The linear velocity is sinusoidal along all degrees of freedom. The angular velocity is zero about the x-axis and sinusoidal about the other two axes. The frame rate of the camera is 30 frames/second. The discrete and continuous Homography Matrices are used to obtain the estimated camera position, heading, and velocity, which are then fed into the Kalman filter to obtain the final estimates. The estimation results are then compared with the actual camera motion.

In the first simulation, we assume that there is no sensor noise, that is, the camera observes the positions of the feature points perfectly. Fig. 3 to Fig. 6 show the estimated camera position, heading, linear velocity, and angular velocity, compared to the actual motion that was generated. The dashed trajectories represent the actual values, and the solid trajectories show the estimated ones. It can be seen that, given a random initial state, the estimated values converge toward the actual trajectory quickly and thereafter the estimation error remains small. Therefore, camera motion can be recovered accurately using the proposed Kalman filter method in the absence of sensor noise. The diagonal values of Rk used are [0.001, 0.001, 0.001, 0.00001, 0.00001, 0.00001, 0.001, 0.001, 0.001, 0.00001, 0.00001, 0.00001], chosen according to the measurement noise of each state.

In the second simulation, sensor noise is added to the feature point coordinates. The noise used in this simulation is Gaussian with zero mean and 0.05 pixel variance. Fig. 7 to Fig. 10 show the results in the noisy scenario. It can be seen that the estimated trajectories still track the actual values, but a lag is noted between the true signal and the estimate. This is caused by increasing the diagonal elements of the measurement covariance matrix Rk.
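A sketch of how such a simulation could be set up is given below. The frame rate, noise variance, and the sinusoidal structure of the velocities follow the description above, while the specific amplitudes, frequencies, and phases are assumptions for illustration only.

```python
import numpy as np

dt = 1.0 / 30.0                    # camera frame period (30 frames/second)
t = np.arange(0.0, 6.0, dt)        # a 6-second trajectory, as in the figures

# Hypothetical sinusoidal velocity profiles consistent with the description:
# linear velocity sinusoidal on all three axes; angular velocity zero about
# the x-axis and sinusoidal about y and z.
v = 0.2 * np.column_stack([np.sin(2 * np.pi * 0.5 * t),
                           np.sin(2 * np.pi * 0.5 * t + np.pi / 3),
                           np.sin(2 * np.pi * 0.5 * t + 2 * np.pi / 3)])
omega = np.column_stack([np.zeros_like(t),
                         0.1 * np.sin(2 * np.pi * 0.5 * t),
                         0.1 * np.cos(2 * np.pi * 0.5 * t)])

# Zero-mean Gaussian noise on the observed feature coordinates (second run)
sigma = np.sqrt(0.05)              # 0.05 pixel variance, as stated in the text

def add_pixel_noise(pixels):
    """Corrupt pixel coordinates with zero-mean Gaussian measurement noise."""
    return pixels + np.random.normal(0.0, sigma, pixels.shape)
```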
Fig. 7. Position estimation when camera noise added

Fig. 10. Angular velocity estimation when camera noise added
Fig. 8. Orientation estimation when camera noise added
Fig. 9. Linear velocity estimation when camera noise added
Larger values in Rk make the estimated trajectories more accurate, but increase the lag of the Kalman filter. There is a trade-off between the accuracy of the estimation and the sensitivity of the system. In this simulation, diagonal values of [0.1, 0.1, 0.1, 0.001, 0.001, 0.001, 0.1, 0.1, 0.1, 0.001, 0.001, 0.001] are used to balance the responsiveness and accuracy of the system.

V. CONCLUSION

In this paper, we present a novel visual odometry method for localization of a mobile robotic system. Continuous and discrete forms of the Euclidean Homography Matrix are used to recover position and velocity information from camera images. A Kalman filter is used to fuse the velocity and pose estimates, which reduces the effects of noise and improves the estimates. Simulations were performed to explore the performance of the proposed method. When sensor noise is not considered, the estimated positions and velocities show little error from the true signals. When noise is introduced to the camera model, the parameters of the Kalman filter must be chosen carefully to balance responsiveness and mitigate the effects of noise.

There are several avenues open for future work. In the short term, experiments will be performed with the presented method. This will require accurate and fast detection and tracking of features. Low levels of tracking error will be important to mitigate sensor noise. In the longer term, a method must be developed to handle large-scale movements of the robot and camera, such that feature points can be allowed to enter and leave the camera field of view without disrupting the estimation. Finally, this method will be incorporated with other localization and odometry methods, including inertial measurement units and wheel encoders.

REFERENCES
[1] G. Dudek and M. Jenkin, "Inertial sensors, GPS, and odometry," in Springer Handbook of Robotics, 2008, pp. 477–490.
[2] L. Kleeman and R. Kuc, "Sonar sensing," in Springer Handbook of Robotics, 2008, pp. 491–519.
[3] R. B. Fisher and K. Konolige, "Range sensors," in Springer Handbook of Robotics, 2008, pp. 521–542.
[4] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), vol. 1, 2004, pp. 652–659.
[5] R. G. Brown, Introduction to Random Signal Analysis and Kalman Filtering. John Wiley & Sons, 1983.
[6] D. DeMenthon and L. S. Davis, "Model-based object pose in 25 lines of code," in European Conf. on Computer Vision, 1992, pp. 335–343.
[7] L. Quan and Z.-D. Lan, "Linear N-point camera pose determination," IEEE Trans. Pattern Anal. Machine Intell., vol. 21, no. 8, pp. 774–780, 1999.
[8] O. D. Faugeras and F. Lustman, "Motion and structure from motion in a piecewise planar environment," Int. J. Pattern Recog. and Artificial Intell., vol. 2, no. 3, pp. 485–508, 1988.
[9] Y. Ma, S. Soatto, J. Košecká, and S. Sastry, An Invitation to 3-D Vision. Springer, 2004.
[10] E. Malis, F. Chaumette, and S. Boudet, "2-1/2D visual servoing," IEEE Trans. Robot. Automat., vol. 15, no. 2, pp. 238–250, 1999.
[11] C. Taylor and J. Ostrowski, "Robust vision-based pose control," in Proc. IEEE Int. Conf. Robotics and Automation, 2000, pp. 2734–2740.
[12] Y. Fang, D. Dawson, W. Dixon, and M. de Queiroz, "Homography-based visual servoing of wheeled mobile robots," in Proc. IEEE Conf. on Decision and Control, 2002, pp. 2866–2871.
[13] Y. Fang, W. E. Dixon, D. M. Dawson, and P. Chawda, "Homography-based visual servo regulation of mobile robots," IEEE Trans. Syst., Man, Cybern., vol. 35, no. 5, pp. 1041–1050, 2005.
[14] Z. Zhang and A. Hanson, "3D reconstruction based on homography mapping," in Proc. ARPA Image Understanding Workshop, Palm Springs, CA, 1996.
[15] N. Daucher, M. Dhome, J. Laprest, and G. Rives, "Speed command of a robotic system by monocular pose estimate," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 1997, pp. 55–43.
[16] C. Baillard and A. Zisserman, "Automatic reconstruction of planar models from multiple views," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1999, pp. 559–565.
[17] E. Malis and F. Chaumette, "2 1/2D visual servoing with respect to unknown objects through a new estimation scheme of camera displacement," Int. J. Computer Vision, vol. 37, no. 1, pp. 79–97, 2000.
[18] K. Okada, S. Kagami, M. Inaba, and H. Inoue, "Plane segment finder: algorithm, implementation and applications," in Proc. IEEE Int. Conf. Robotics and Automation, 2001, pp. 2120–2125.
[19] V. Chitrakaran, D. M. Dawson, W. E. Dixon, and J. Chen, "Identification of a moving object's velocity with a fixed camera," Automatica, vol. 41, no. 3, pp. 553–562, 2005.
[20] D. H. Titterton and J. L. Weston, Strapdown Inertial Navigation Technology, 2nd ed. Institution of Engineering and Technology, 2004.
APPENDIX
KALMAN FILTER MATRICES

The state transition matrix Fk ∈ R^{12×12} integrates the velocity states into the pose states over one sample period ∆t, using the Euler angle rate relation in (3)-(5):

Fk = [ I3   0    ∆t·I3   0
       0    I3   0       ∆t·E(r, p)
       0    0    I3      0
       0    0    0       I3 ]

where I3 is the 3×3 identity matrix and

E(r, p) = [ cos(r)           −sin(r)            0
            sin(r)/cos(p)     cos(r)/cos(p)     0
            sin(r) tan(p)     cos(r) tan(p)     1 ]

maps the angular velocity [ωx, ωy, ωz]^T to the angle rates [ṗ, ẇ, ṙ]^T.

The process covariance matrix Qk ∈ R^{12×12} follows a random walk model that couples each pose state with its corresponding velocity state. For each pair (e.g., x and vx with scaling term σx), the entries are

Qk(x, x) = σx ∆t^3/3,   Qk(x, vx) = Qk(vx, x) = σx ∆t^2/2,   Qk(vx, vx) = σx ∆t,

and similarly for the pairs (y, vy) and (z, vz) with σy and σz, and for the angle/angular-rate pairs with σp, σw, and σr. All remaining entries of Qk are zero.
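For completeness, the following Python sketch assembles Fk and Qk from ∆t and the σ scaling terms according to the block structure above; the pairing of each angle with an angular-velocity component follows the state ordering and is an assumption of this sketch.

```python
import numpy as np

def build_Fk(r, p, dt):
    """State transition matrix Fk for the state
    [x, y, z, p, w, r, vx, vy, vz, wx, wy, wz] (see the Appendix)."""
    F = np.eye(12)
    F[0:3, 6:9] = dt * np.eye(3)               # positions integrate velocities
    # Angle rows follow the Euler-rate relation (3)-(5), discretized over dt
    E = np.array([[np.cos(r),             -np.sin(r),             0.0],   # p row
                  [np.sin(r) / np.cos(p),  np.cos(r) / np.cos(p), 0.0],   # w row
                  [np.sin(r) * np.tan(p),  np.cos(r) * np.tan(p), 1.0]])  # r row
    F[3:6, 9:12] = dt * E
    return F

def build_Qk(sigmas, dt):
    """Random-walk process covariance Qk; sigmas = [sx, sy, sz, sp, sw, sr]
    pairs each pose state with its corresponding rate state."""
    Q = np.zeros((12, 12))
    for i, s in enumerate(sigmas):
        Q[i, i] = s * dt**3 / 3.0              # pose variance
        Q[i, i + 6] = Q[i + 6, i] = s * dt**2 / 2.0
        Q[i + 6, i + 6] = s * dt               # rate variance
    return Q
```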