Recursive estimation of ego-motion and scene structure from a moving platform

Stefan Carlsson
Computational Vision and Active Perception Laboratory
Royal Institute of Technology
S-100 44 Stockholm, Sweden
Email: [email protected]
Introduction

A sequence of images of an object from different viewpoints can be used to compute the shape of the object and its motion relative to the camera. In order to get robust and stable estimates of shape and motion we would like to integrate information over a whole sequence of image frames. In this work we concentrate on the case where the camera is moving on a planar ground. The objective is to find the motion of the camera relative to the environment and the position in the scene of feature points extracted from the image. The typical application for this kind of problem is visual navigation and obstacle detection. General motion of a rigid object relative to a camera can be modeled as a translation and a rotation relative to the point of projection of the camera, taken as the origin of the camera coordinate system.

A natural approach to the problem of motion and structure estimation using several frames over extended time is to formulate it in terms of sequential state estimation, where techniques such as Kalman filtering can be used. Due to the non-linear nature of the problem the so called extended Kalman filter, in which the equations are linearised around estimated values, has been used [1, 2, 3, 5, 6, 8]. The extended Kalman filter is attractive since it is relatively straightforward to determine given a model of the temporal evolution and measurement process of the variables to be estimated. Ideally we would like to use the linear Kalman filter, since it has well established convergence properties and is the minimum variance linear estimator in uncorrelated observation noise. The main idea of this work is therefore to estimate the state transition matrix, in our case the rotation matrix, separately, and to use this estimate in a linear Kalman filter for estimating the state consisting of the 3-D coordinates of the points and the components of the translational motion.
Modelling rigid motion of point-sets
Introducing the vector $S_n^{(i)}$ for the 3-D coordinates $(X_n^{(i)}, Y_n^{(i)}, Z_n^{(i)})$ of point $i$, $i = 1 \ldots p$, the matrix $R_n$ for the rotation matrix and $D_n$ for the translation vector, the motion of a point is described by:

$$S_{n+1}^{(i)} = R_n S_n^{(i)} + D_n \qquad (1)$$
If the image plane is at unit distance from the center of projection, the equations relating the observed image coordinates $x_n^{(i)}, y_n^{(i)}$ to the 3-D coordinates via a perspective transformation are:

$$x_n^{(i)} = \frac{X_n^{(i)}}{Z_n^{(i)}} + w_{x,n}^{(i)} \qquad y_n^{(i)} = \frac{Y_n^{(i)}}{Z_n^{(i)}} + w_{y,n}^{(i)}$$

where $w_{x,n}^{(i)}$ and $w_{y,n}^{(i)}$ are the observation errors of the image coordinates, which we assume to be uncorrelated with zero mean. This can be written in the linear form:

$$\begin{pmatrix} 1 & 0 & -x_n^{(i)} \\ 0 & 1 & -y_n^{(i)} \end{pmatrix} S_n^{(i)} + Z_n^{(i)} w_n^{(i)} = 0 \qquad (2)$$
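As a concrete illustration of eqs. (1) and (2), the following numpy sketch (all numeric values and variable names are illustrative assumptions, not taken from the paper) moves a few points rigidly, projects them, and verifies that the noise-free image coordinates satisfy the linear constraint:

```python
import numpy as np

rng = np.random.default_rng(0)

# p points in front of the camera (rows are (X, Y, Z)), depths 10..50 m
S = rng.uniform([-5, -2, 10], [5, 2, 50], size=(4, 3))

# Small yaw rotation about the vertical (Y) axis plus forward translation
theta = np.deg2rad(0.5)
R = np.array([[ np.cos(theta), 0, np.sin(theta)],
              [ 0,             1, 0            ],
              [-np.sin(theta), 0, np.cos(theta)]])
D = np.array([0.0, 0.0, -1.0])        # camera moving forward: depths shrink

S_next = S @ R.T + D                  # eq. (1): S_{n+1} = R_n S_n + D_n

# Perspective projection onto an image plane at unit distance
x = S_next[:, 0] / S_next[:, 2]
y = S_next[:, 1] / S_next[:, 2]

# Noise-free coordinates satisfy eq. (2): [1 0 -x; 0 1 -y] S = 0
for (xi, yi), Si in zip(zip(x, y), S_next):
    C = np.array([[1, 0, -xi], [0, 1, -yi]])
    assert np.allclose(C @ Si, 0)
```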
Defining $S_n$, $\mathcal{R}_{n,p}$ and $H_n$ for the stacked state vector and the block diagonal matrices in the state and observation equations respectively, we can write these as:

$$\begin{pmatrix} S_{n+1} \\ D_{n+1} \end{pmatrix} = \begin{pmatrix} \mathcal{R}_{n,p} & I_{3,p} \\ 0 & I_3 \end{pmatrix} \begin{pmatrix} S_n \\ D_n \end{pmatrix} + \begin{pmatrix} 0 \\ u_n \end{pmatrix} \qquad (3)$$

where $\mathcal{R}_{n,p}$ denotes the block diagonal matrix with $p$ copies of $R_n$ and $I_{3,p}$ denotes $p$ stacked $3 \times 3$ identity blocks.
$$H_n S_n + W_n' = 0 \qquad (4)$$

where $u_n$ is process noise modelling the non-constant translational motion and $W_n'$ is the stacked state dependent measurement noise vector. Looking at eq. (3) we see that we have a linear system describing the dynamics of evolution of the state variables $S_n$ and $D_n$, and we have a linear constraint, eq. (4), relating the state variables and the measured image coordinates at each time instant. It is therefore natural to apply standard Kalman filtering. Note that in the Kalman filtering formulation, the measurement of eq. (4) will actually correspond to the zero of the right hand side, while the actual observations $x_n^{(i)}, y_n^{(i)}$ will be elements of the observation matrix. An important point to note is also that in the linear Kalman filtering formulation the observation noise vectors will be $Z_n^{(i)} w_n^{(i)}$, i.e. we have state dependent observation noise. The Kalman filter is however still the linear minimum variance state estimator, provided that the measurement noise covariance is modified by multiplication with the expected second order moment of $Z_n^{(i)}$ [9, 10].
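To fix ideas, here is a minimal sketch of how the stacked quantities of eqs. (3) and (4) can be assembled. The helper names and the explicit loop layout are my own assumptions; the paper does not prescribe an implementation:

```python
import numpy as np

def transition_matrix(R, p):
    """Block matrix of eq. (3): diag(R,...,R) acting on the p points,
    identity blocks coupling each point to the translation D."""
    A = np.zeros((3 * (p + 1), 3 * (p + 1)))
    for i in range(p):
        A[3*i:3*i+3, 3*i:3*i+3] = R          # rotation acts on each point
        A[3*i:3*i+3, 3*p:3*p+3] = np.eye(3)  # every point receives D_n
    A[3*p:3*p+3, 3*p:3*p+3] = np.eye(3)      # D_{n+1} = D_n + u_n
    return A

def observation_matrix(x, y):
    """Block diagonal H_n of eq. (4), one 2x3 block per measured point."""
    p = len(x)
    H = np.zeros((2 * p, 3 * p))
    for i in range(p):
        H[2*i:2*i+2, 3*i:3*i+3] = [[1, 0, -x[i]],
                                   [0, 1, -y[i]]]
    return H

# Example: identity rotation, two points
A = transition_matrix(np.eye(3), p=2)
H = observation_matrix([0.1, -0.2], [0.0, 0.05])
```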
Kalman filtering for estimation of 3-D coordinates and translational motion

Equations (3) and (4) represent the linear state transition equation and the linear observations respectively. The linear minimum variance estimator for this problem is realised by the Kalman filter as:

$$\begin{pmatrix} \hat S_{n+1} \\ \hat D_{n+1} \end{pmatrix} = \left( I_{3(p+1) \times 3(p+1)} - K_n \, [\, H_{n+1} \mid 0 \,] \right) \begin{pmatrix} \mathcal{R}_{n,p} & I_{3,p} \\ 0 & I_3 \end{pmatrix} \begin{pmatrix} \hat S_n \\ \hat D_n \end{pmatrix} \qquad (5)$$
It is important to note that the Kalman filter never improves any absolute estimates of position and velocity, due to the well known depth scale ambiguity. Only relative magnitudes of velocities and positions can be estimated, corresponding to the direction of the state vector $(S \mid D)^T$. In our implementation the gain matrix was computed from the error covariance matrix, involving rather lengthy calculations. We believe however that simplifications of the full implementation, e.g. constant gain and decoupled states, are possible without sacrificing performance too much.
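For concreteness, the following is a sketch of one recursion of eq. (5) in textbook covariance form. The function and variable names are hypothetical, and scaling the measurement covariance by the squared current depth estimates is one plausible reading of the modification cited from [9, 10]:

```python
import numpy as np

def kalman_update(s, P, A, H, Q, sigma_w2, z_hat):
    """One step of eq. (5).  State s = (S | D), covariance P.
    A: transition matrix of eq. (3); H: the observation matrix of
    eq. (4) padded with a zero block for D, i.e. [H_{n+1} | 0];
    Q: process noise covariance; sigma_w2: image noise variance;
    z_hat: current depth estimates (length p), used to scale the
    state-dependent measurement noise."""
    # Prediction through the (estimated) state transition matrix
    s_pred = A @ s
    P_pred = A @ P @ A.T + Q
    # Measurement covariance modified by the second moment of Z
    # (state-dependent noise, cf. [9, 10]); two rows per point
    R = sigma_w2 * np.diag(np.repeat(z_hat**2, 2))
    # Gain; the 'measurement' of eq. (4) is identically zero,
    # so the innovation is 0 - H s_pred
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    s_new = s_pred - K @ (H @ s_pred)
    P_new = (np.eye(len(s)) - K @ H) @ P_pred
    return s_new, P_new
```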
Combined state and rotational parameter estimation
The Kalman filter eq. (5) assumes knowledge of the rotation matrix. In general this will not be available, so it has to be estimated from the data. In principle both translational and rotational motion could be estimated separately from image data [4, 5, 7]. From the estimation theoretic point of view we are faced with the problem of unknown parameters in the state transition matrix, i.e. our problem is that of simultaneous state and parameter estimation. In order to estimate the rotation matrix $\hat{\mathcal R}_n$ we take the estimated state variables and compute the prediction of the 3-D positions of the points:

$$\tilde S_{n+1} = (\hat{\mathcal R}_{n,p} \mid I_{3,p}) \begin{pmatrix} \hat S_n \\ \hat D_n \end{pmatrix} \qquad (6)$$
Ideally, if the state estimates are correct, the correct choice of the rotation matrix would be that which makes this prediction project to the observed image coordinates $x_{n+1}^{(i)}, y_{n+1}^{(i)}$ of each point $i$. In order to estimate the rotation we therefore minimise the prediction error with respect to the rotation matrix:

$$P_e = \min_{\mathcal R_n} \left\| H_{n+1} \tilde S_{n+1} \right\| = \min_{\mathcal R_n} \left\| H_{n+1} \, (\mathcal R_{n,p} \mid I_{3,p}) \begin{pmatrix} \hat S_n \\ \hat D_n \end{pmatrix} \right\| \qquad (7)$$
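Since the platform moves on a planar ground, the rotation reduces in practice to a single yaw angle, so the minimisation in eq. (7) becomes one-dimensional. The sketch below solves it by a coarse grid search; this is my own simplification standing in for the least squares solution of the paper, and all names and the search range are assumptions:

```python
import numpy as np

def yaw(theta):
    """Rotation about the vertical axis (planar ego-motion)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def estimate_rotation(S_hat, D_hat, H_next,
                      thetas=np.linspace(-0.05, 0.05, 2001)):
    """Minimise the prediction error of eq. (7) over a scalar yaw angle.
    S_hat: (p,3) estimated point coordinates, D_hat: (3,) translation,
    H_next: block diagonal observation matrix built from the new image
    coordinates (shape (2p, 3p), cf. observation_matrix above)."""
    best = None
    for th in thetas:
        S_pred = (S_hat @ yaw(th).T + D_hat).ravel()   # eq. (6)
        err = np.linalg.norm(H_next @ S_pred)          # eq. (7)
        if best is None or err < best[0]:
            best = (err, th)
    return yaw(best[1])
```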
The combined algorithm for rotation parameter estimation and Kalman filtering of the 3-D coordinates and translational motion is illustrated by the block diagram in fig. 1. The two steps of parameter identification and state estimation work like this:

1. At time $n+1$ the measured image coordinates $x_{n+1}^{(i)}, y_{n+1}^{(i)}$, $i = 1 \ldots p$, are used to construct the matrix $H_{n+1}$. Together with the state estimates $\hat S_n, \hat D_n$ it is then used to find the least squares estimate $\hat{\mathcal R}_n$ acc. to eq. (7).

2. The estimated rotation $\hat{\mathcal R}_n$ is then fed to the Kalman filter eq. (5), where the updated state estimates $\hat S_{n+1}$, $\hat D_{n+1}$ are computed using the image coordinates $x_{n+1}^{(i)}, y_{n+1}^{(i)}$, $i = 1 \ldots p$, contained in the matrix $H_{n+1}$.
[Fig. 1 block diagram: the new image coordinates $x_{n+1}, y_{n+1}$ enter the prediction error minimisation, which outputs $\hat{\mathcal R}_n$ to the Kalman filter; the filter outputs $\hat S_{n+1}, \hat D_{n+1}$, which are fed back through a delay.]
Fig. 1. Block diagram of combined state and parameter estimation.
Scene coordinate estimation - theoretical performance

Given the exact state transition matrix determined by the rotation, the Kalman filter of eq. (5) performs a recursive weighted least squares estimation of the state vector. A lower bound in terms of error variance can be computed assuming perfect knowledge of the translational motion $D_n$. With no a priori information, the conditional covariance of the scene coordinate vector $S$ given measurements up to time $N$ is:

$$P_{N|N} = E[(\hat S_N - S_N)(\hat S_N - S_N)^T] = \left[ \sum_n H_n^T R_n^{-1} H_n \right]^{-1} \qquad (8)$$
For the special case of motion on a planar surface with only horizontal measurements, we have $D_X = D_Y = 0$, $H_n = (-1, x_n)$ and $R_n = E[(W_n')^2] = Z_n^2 \sigma_w^2$, where $\sigma_w^2$ is the variance of the noise in the measured horizontal image coordinate. For the covariance matrix of the error of the initial scene coordinate estimate we then get:

$$P_{N|N} = \sigma_w^2 \begin{pmatrix} \displaystyle\sum_n \frac{1}{(Z_0 + nD_Z)^2} & -\displaystyle\sum_n \frac{x_n}{(Z_0 + nD_Z)^2} \\[2ex] -\displaystyle\sum_n \frac{x_n}{(Z_0 + nD_Z)^2} & \displaystyle\sum_n \frac{x_n^2}{(Z_0 + nD_Z)^2} \end{pmatrix}^{-1} \qquad (9)$$

Using as an approximation the exact values of the measured noisy image coordinates:

$$x_n = \frac{X_0}{Z_0 + nD_Z}, \qquad Z_n = Z_0 + nD_Z \qquad (10)$$
we can evaluate the mean squared error of the scene coordinate $(X_N, Z_N)$ as:

$$E[(\hat X_N - X_N)^2 + (\hat Z_N - Z_N)^2] = \mathrm{Trace}(P_{N|N}) \qquad (11)$$
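This bound is straightforward to evaluate numerically. The sketch below accumulates the information matrix of eq. (8) for a single point in the planar special case and returns the r.m.s. form of eq. (11); the scenario constants follow fig. 2, while the code organisation and grid of test points are my own:

```python
import numpy as np

def rms_bound(X_N, Z_N, D_Z=-1.0, N=40, sigma_w=1/720):
    """Trace of P_{N|N} from eqs. (9)-(11) for a single point.
    D_Z is the per-frame forward displacement (25 m/s at 25 frames/s
    is 1 m per frame toward the camera); (X_N, Z_N) is the final
    position of the point after N frames."""
    # Back out the initial depth so the point ends at Z_N after N frames
    Z0 = Z_N - N * D_Z
    n = np.arange(N + 1)
    Zn = Z0 + n * D_Z
    xn = X_N / Zn                      # eq. (10): X constant, x_n = X/Z_n
    # Information matrix: sum of H^T R^{-1} H with H_n = (-1, x_n)
    # and R_n = Z_n^2 sigma_w^2 (eq. (8) and the planar special case)
    J = np.zeros((2, 2))
    for z, x in zip(Zn, xn):
        H = np.array([[-1.0, x]])
        J += H.T @ H / (z**2 * sigma_w**2)
    P = np.linalg.inv(J)               # eq. (9)
    return np.sqrt(np.trace(P))        # r.m.s. form of eq. (11)

# Anisotropy: points near the focus of expansion (X close to 0)
# have a far larger error bound than points off to the side
for X, Z in [(30.0, 40.0), (10.0, 40.0), (1.0, 40.0)]:
    print(f"X={X:5.1f} Z={Z:4.1f}  rms bound = {rms_bound(X, Z):.3f} m")
```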
Fig. 2 illustrates the root mean square estimation error for various final scene coordinates $(X_N, Z_N)$ when the camera is moving forward with speed $D_Z = -25$ m/sec (90 km/h) and estimation is carried out over 40 frames with frame rate 25 frames/sec, i.e. total estimation time is 1.8 sec and total forward movement is 45 m. The root mean square noise in the image coordinate measurements was 1/720. The most notable fact from fig. 2 is the very large anisotropy of the estimation error with respect to the direction from the optical axis, which in this case corresponds to the direction of the focus of expansion. This is to be expected since in the direction of the F.O.E. there is no motion parallax, which implies that the possibility of depth estimation disappears.

[Fig. 2: contour plot of the r.m.s. error, contour levels 0.1 to 10 m, over final positions X in [-40, 40] m and Z in [0, 90] m.]

Fig. 2. Root mean square error (m) in estimated position of scene coordinate (X, Z) using linear Kalman filtering with known translational velocity and no rotation. Axes denote the final position of the coordinate (X, Z) after forward movement at 25 m/sec for 1.8 sec. Frame rate is 25 frames/sec giving a total of 40 frames. Noise in the image coordinate has r.m.s. value 1/720.
Experimental results and discussion

Real image data was collected by mounting a video camera on top of a car. Images were sampled at 25 frames/sec and 720x576 pixels/frame. In order to extract point features from this sequence the Canny-Deriche edge detector was applied to each frame. Along a horizontal line all edge points were collected, giving the horizontal image coordinates. By selecting edge points from vertically oriented edges only, sufficiently clean trajectories of horizontal coordinates over time could be generated. Fig. 3 illustrates horizontal edge coordinates evolving over time from the image sequence of fig. 4. The basic idea behind this kind of selection is that if the edge point is projected from a vertical edge in the scene and the car is moving on a planar surface, it is sufficient to consider only the horizontal coordinate of the edge, since this is the same for all edge points on the edge. This gives a relatively robust way of extracting features compared to e.g. corner extraction. The edge points were matched over time using a simple predictor in the image plane (a possible form is sketched at the end of this section). After this matching the coordinate trajectories were edited manually and only the longest were retained.

Fig. 4 shows the original grey value frame 48 with edge points superposed, and the estimated positions of the points in the camera X-Z system. The depths are initialised to the same value at frame 0. The motion of the camera relative to the scene consists of a dominant translational motion in the direction of the optical axis with a slight rotation, due to the fact that the car is changing lane from right to left in order to avoid the road-work going on in the right lane, about 100 m in front of the car in the first frame. After about 40 frames, a stable configuration of scene points has been generated by the algorithm. Unfortunately no quantitative ground truth was available for this sequence. From frame 48 it can be concluded, however, that the algorithm computes qualitatively the correct depth of the points, provided these are not too close to the direction of motion. As was stated in the theoretical performance part, the convergence properties of the depth estimation depend to a large extent on the distance of the points in the image from the focus of expansion. The two poles at the road side closest to the camera seem to get the correct depth, while the third is too close to the center to converge. The relative positions of the poles to the left and the road sign and the trees to the right are estimated qualitatively correctly.
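The paper does not spell out the image plane predictor used for the matching. A plausible minimal version, assumed here purely for illustration, is a constant image velocity prediction with nearest neighbour association inside a gate:

```python
import numpy as np

def match_edges(prev2, prev1, current, gate=5.0):
    """Associate horizontal edge coordinates across frames.
    prev2, prev1: matched coordinates from the two previous frames,
    current: unmatched detections in the new frame (pixel units).
    Predicts x + (x - x_prev) for each trajectory and takes the
    nearest detection within a gate; returns the matched coordinate
    or NaN if the trajectory is lost."""
    predicted = 2 * prev1 - prev2            # constant image velocity
    matched = np.full(len(prev1), np.nan)
    for i, xp in enumerate(predicted):
        d = np.abs(current - xp)
        j = np.argmin(d)
        if d[j] < gate:
            matched[i] = current[j]
        # else: trajectory dropped (the paper edits these manually)
    return matched
```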
Acknowledgements

This work was supported by the Swedish Prometheus project. We would like to thank Magnus Andersson and Peter Nordlund of the Royal Institute of Technology for work with the data from the real world sequence.
References

[1] T.J. Broida and R. Chellappa, Estimation of object motion parameters from noisy images, T-PAMI 8, 1986, 90-99.
[2] T.J. Broida and R. Chellappa, Performance bounds for estimating three-dimensional motion parameters from a sequence of noisy images, JOSA A6, 1989, 879-889.
[3] T.J. Broida, S. Chandrachekhar and R. Chellappa, Recursive 3-D motion estimation from a monocular image sequence, IEEE Trans. on Aerospace and Electronic Systems.
[4] C.G. Harris and J.M. Pike, 3D positional integration from image sequences, IVC 6, 1988, 87-90.
[5] J. Heel, Dynamic motion vision, Proc. IUW, 1989, pp. 702-713.
[6] J. Heel, Direct dynamic motion vision, Proc. IEEE Conf. on Robotics and Automation, 1990, pp. 1142-1147.
[7] A. Kalldahl, Simultaneous estimation of motion and shape from a sequence of feature point projections, Linköping Studies in Science and Technology, Thesis no. 194, Linköping, Sweden, October 1989.
[8] L. Matthies, T. Kanade and R. Szeliski, Kalman filter-based algorithms for estimating depth from image sequences, IJCV 3, 1989, 209-236.
[9] P.K. Rajasekaran, N. Satyanarayana and M.D. Srinath, Optimum linear estimation of stochastic signals in the presence of multiplicative noise, IEEE Trans. on Aerospace and Electronic Systems, Vol. AES-7, May 1971.
[10] J.K. Tugnait, Stability of optimum linear estimators of stochastic signals in white multiplicative noise, IEEE Trans. on Automatic Control, Vol. AC-26, pp. 757-761, June 1981.
Fig. 3. Trajectories over time of horizontal edge coordinates of frames 1-48 from the image sequence of fig. 4.
Fig. 4. Features and estimated X-Z positions after 48 frames: the original frame with features (t = 48) and the estimated positions (white) (t = 48).