SCIENCE CHINA Information Sciences
• RESEARCH PAPER •
April 2013, Vol. 56 042301:1–042301:14 doi: 10.1007/s11432-012-4649-9
Image-based self-position and orientation method for moving platform
LI DeRen1, LIU Yong1,2*, YUAN XiuXiao3
1 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China;
2 School of Electronic Information, Wuhan University, Wuhan 430079, China;
3 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
Received June 11, 2011; accepted December 9, 2011; published online September 14, 2012
* Corresponding author (email: [email protected])
Abstract  In low-altitude surveying, mapping and remote sensing and in land-based mobile mapping systems, the position and orientation of a moving platform depend mainly on the global positioning system (GPS) and the inertial navigation system (INS). However, the GPS signal is unavailable in applications such as deep space exploration and indoor robot control. In such circumstances, image-based methods are very important for the self-position and orientation of a moving platform. This paper therefore first reviews the state of the art of the image-based self-position and orientation method (ISPOM) for moving platforms from the following aspects: 1) a comparison among the major image-based methods (i.e., visual odometry, structure from motion, simultaneous localization and mapping) for position and orientation; 2) types of moving platform; 3) integration schemes of the image sensor with other sensors; 4) calculation methodology and quantity of image sensors. The paper then proposes a new ISPOM scheme for mobile robots that depends merely on image sensors. It takes the advantages of both monocular vision and stereo vision, and estimates the relative position and orientation of the moving platform with high precision and high frequency. In a word, ISPOM will gradually move from research to application, and play a vital role in deep space exploration and indoor robot control.
Keywords  self-position and orientation, ISPOM, visual odometry, structure from motion, simultaneous localization and mapping
Citation  Li D R, Liu Y, Yuan X X. Image-based self-position and orientation method for moving platform. Sci China Inf Sci, 2013, 56: 042301(14), doi: 10.1007/s11432-012-4649-9
1  Introduction
Self-position and orientation of a moving platform is the process of estimating the platform's position and orientation with onboard sensors. Whether the platform moves along a predefined path or moves freely, a self-position and orientation method that is either independent of off-board sensors or relies only on sensors receiving broadcast signals (e.g., the GPS signal) is crucial to many applications in computer vision, machine vision, and robot control. Currently, the most common method for estimating the position and orientation of an outdoor moving platform is based on GPS and INS, and high accuracy can be achieved with a high-cost GPS/INS under a good GPS signal environment. For indoor moving platforms (e.g., humans), current methods
based on A-GPS positioning in cellular mobile networks, Bluetooth positioning and so on, can only obtain an approximate position. In recent years, a new vision-based self-position and orientation method, developed from computer vision technology, has arisen; combined with global positioning sensors it can also achieve absolute position and orientation. Thus, this paper presents a critical review of current image-based methods and proposes a new ISPOM scheme for moving platforms.
2  A comparison of image-based methods for self-position and orientation
In the past decades, research on image-based methods for position and orientation has appeared in the fields of computer vision and robot vision, namely structure from motion, visual odometry, vision-based simultaneous localization and mapping, and vision-based egomotion estimation. Their main properties are as follows:
• Structure from motion (SFM). Structure from motion refers to the process of recovering the three-dimensional structure of a scene or an object from the image sequence or video captured by a moving image sensor. In this case, the three-dimensional structure generally refers to a point cloud reconstruction of the viewed object or scene. During the recovering process, the position and orientation of the image sensor at each capturing moment can be obtained. Moreover, the process is a post-processing stage and does not need to output results instantly while the image sensor is moving.
• Visual odometry (VO). Visual odometry refers to realizing the odometry function with an image sensor. Odometry estimates the change in position of a wheeled robot or other moving platform over time, relative to its starting location, from wheel encoder counts. When both the left and right wheels are equipped with odometers, the change of moving orientation can also be obtained via dead reckoning from the difference of the pulse counters (see the dead-reckoning sketch below). As relative localization with odometry fails to accurately position a moving platform due to systematic error [1], random error, and wheel slippage, image sensors have been utilized for positioning to overcome these shortcomings of traditional odometry.
• Vision-based simultaneous localization and mapping (vSLAM). Vision-based simultaneous localization and mapping refers to realizing SLAM with an image sensor. In the robotics and intelligent vehicle communities, SLAM is the problem of estimating the position of a robot/vehicle while reconstructing the structure of an unknown scene, or updating the structure of a known scene (generally, a two-dimensional map). It is the premise of behavior estimation and autonomous driving in robot control. Traditional SLAM depends on sensors such as laser range finders, laser scanners, sonar, ultrasonic sensors or inertial sensors. Presently, more attention has been focused on the image sensor, which has given rise to vision-based SLAM. The purpose of vSLAM is simultaneous processing, localization and mapping; obviously, it has high demands on real-time computing, and all images are involved in the computation.
• Vision-based egomotion estimation (vEE). Vision-based egomotion estimation refers to estimating the motion parameters of a moving platform with an image sensor. The motion parameters include the instantaneous translating and rotating directions; if a timing system is available, translation and rotation speed can also be obtained. So vEE focuses mainly on finding motion parameters, not on location and orientation relative to the scene.
The image-based self-position and orientation method proposed in this paper aims to recover the position and orientation of the moving platform with the onboard image sensor. The recovered position and orientation can be used to register other sensor data into a relative frame of reference. For example, the distance values obtained from an onboard laser scanner can be registered into a point cloud, which is the three-dimensional reconstruction of the viewed scene.
So ISPOM has one main feature: the obtained position and orientation should have a high frequency so that it can match the high-frequency output of other sensors. Table 1 compares the methods discussed above. In summary, ISPOM is the general case, and the others are particular cases under special conditions.
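To make the wheel-odometry dead reckoning mentioned in the VO item above concrete, the following is a minimal differential-drive sketch in Python; the encoder resolution, wheel radius and wheel base are illustrative values, not figures taken from any system discussed here.

```python
import math

def dead_reckon(pose, left_ticks, right_ticks,
                ticks_per_rev=2048, wheel_radius=0.05, wheel_base=0.30):
    """Propagate a 2D pose (x, y, heading) from one pair of encoder increments.

    All geometric parameters are illustrative; a real robot supplies its own.
    """
    x, y, theta = pose
    # Convert encoder ticks to travelled arc length for each wheel.
    per_tick = 2.0 * math.pi * wheel_radius / ticks_per_rev
    d_left = left_ticks * per_tick
    d_right = right_ticks * per_tick
    # Differential-drive model: mean distance and heading change.
    d_center = 0.5 * (d_left + d_right)
    d_theta = (d_right - d_left) / wheel_base
    # Integrate along the (approximately straight) step.
    x += d_center * math.cos(theta + 0.5 * d_theta)
    y += d_center * math.sin(theta + 0.5 * d_theta)
    theta += d_theta
    return (x, y, theta)

# Example: accumulate the pose over a stream of encoder increments.
pose = (0.0, 0.0, 0.0)
for dl, dr in [(120, 118), (125, 130), (90, 140)]:
    pose = dead_reckon(pose, dl, dr)
print(pose)
```

The systematic and random errors and the wheel slippage mentioned in the VO item enter exactly through the tick counts and the geometric constants in such an integration, which is why vision is used to correct it.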
Table 1  Comparison of some studies
SFM. Outputs: dense point cloud, relative pos. and ori.; DOF of pos. and ori.: 6; processing mode: post-processing; involved images: selected key images; application purpose: 3D reconstruction of viewed scene.
VO (2D). Outputs: relative pos. and ori. in 2D plane; DOF of pos. and ori.: 3; processing mode: real-time processing; involved images: all images; application purpose: recovery of moving path in 2D plane.
VO (3D). Outputs: relative pos. and ori. in 3D space; DOF of pos. and ori.: 6; processing mode: real-time processing; involved images: all images; application purpose: recovery of moving path in 3D space.
vSLAM. Outputs: pos. and ori., map; DOF of pos. and ori.: 3; processing mode: real-time processing; involved images: all images; application purpose: guiding the behavior of robot or intelligent vehicle.
vEE. Outputs: relative direction and speed of translation and rotation; DOF of pos. and ori.: /; processing mode: real-time processing; involved images: all images; application purpose: close-loop control.
ISPOM. Outputs: relative pos. and ori. in 3D space; DOF of pos. and ori.: 6; processing mode: post-processing; involved images: all images; application purpose: space datum for registering other sensor data.
Figure 1  The relation of the research fields of SFM, VO, vSLAM, vEE and ISPOM.
Figure 1 shows the relation among these studies. Every ellipse represents a research field, and the overlapping areas represent common research problems; clearly there are many common points among them. The characteristics are: 1) SFM, VO, vSLAM, vEE and ISPOM all depend on the image sensor; 2) ISPOM is the kernel of this research, and studies far from the kernel have lower requirements on position and orientation and serve more specialized applications; 3) studies in the odometry direction focus on relative positioning and heading, studies in the SLAM direction pay more attention to self-position relative to the scene and to reconstructing the scene structure, and studies in the vEE direction emphasize estimating the motion parameters; 4) besides the image sensor, other choices of sensors exist.
3  The types of moving platform
The estimated position and orientation directly describe the position and orientation of the image sensor (i.e., camera). However, the image sensor cannot move alone in 3D space; it is always rigidly fixed on a platform, and the two move together. So the result implicitly describes the position and orientation of the moving platform. In general applications, e.g. navigating with an image
sensor, the position and orientation of the image sensor are treated as those of the moving platform. But if high precision is required, they should be transformed to the reference center of the platform. This transformation is rigid, and its translation vector and rotation matrix can be obtained through a calibration process [2]. There are many types of moving platform. Theoretically, any object is a moving platform if it can move itself and carry the image sensor and the other capturing, transmitting and processing equipment. Because there should be detectable or trackable features in the images, or a stably obtainable optical flow field, some moving objects (e.g., a ship) cannot serve as a moving platform for image-based position and orientation. According to the space in which the platform moves, platforms can be categorized as flying platforms, land mobile vehicles and underwater vehicles. The following are the moving platforms reported in recent literature.
• Flying platform. Flying platforms refer to artificial ones (so far, there have been no attempts at self-position and orientation of natural flying objects, e.g. birds). Three types of flying platform are reported: satellite, missile/rocket and unmanned aerial vehicle (UAV). Satellite orientation can be determined by a star sensor, which uses images of high-brightness stars (e.g., the sun or other stars) for star identification. Once the stars in an image have been identified, orientation can be inferred from which stars are in view and how they are arranged. Missiles guided by a stellar guidance system can use it to improve the accuracy of the inertial guidance system: the stellar guidance system identifies stars in images and calculates the missile's position and orientation. The onboard INS of an unmanned aerial vehicle, especially a low-altitude and miniature UAV [3,4], does not provide high precision in angle measurement because of its low payload capacity, low cost and limited energy supply. Moreover, in urban canyon environments GPS signals are often blocked, which degrades the positioning accuracy of the onboard GPS. In these cases, an image-based method is a choice to improve the accuracy of self-position and orientation.
• Land mobile vehicle. Land mobile vehicles can move indoors or outdoors. For outdoor moving, the automobile is a typical land vehicle, while another land mobile vehicle, the mobile robot, can move both indoors and outdoors. The automobile is a common land moving platform that moves along fixed routes with wheels [5,6], and its high payload capacity allows multi-sensor integration and fusion. Those sensors include the position and orientation system (POS, consisting of GPS and IMU), CCD camera, laser scanner, etc. Brad Grinstead et al. [7] applied the mobile scanning system developed at the University of Tennessee to compare pose estimation from POS and from images. On the one hand, their experiment shows that when the GPS drops out, the image-based method is an auxiliary pose estimator and the provided localization is accurate enough until the GPS system's accuracy comes back into acceptable bounds. On the other hand, the accumulated errors become increasingly large over time, and incorporating a bundle adjustment module into the image-based localization procedure can alleviate these effects. D. I. B. Randeniya et al.
[2] examined the calibration of integrated vision and inertial sensing, and pointed out that effective integration can be used to overcome the gyro drift problem of inertial systems over time. The image-based method is also a very important self-position and orientation method for mobile robots [8–10]. The reasons are: for one thing, a wheeled robot tends to slip and slide on slippery surfaces, loose soil or steep slopes, and traditional odometry fails under such circumstances; for another, a non-wheeled robot cannot use traditional odometry at all. Furthermore, images can provide more information about the environment.
• Planet/moon rover. To date, there is no global positioning system on the moon or other planets, so satellite-based positioning cannot work for a planet or moon rover. For self-position and orientation of such rovers, sensors such as the IMU, odometry, star sensor and image sensor are often used. In this situation, ISPOM is a vital method, as it can correct the drift of the IMU and overcome the slippage that occurs when moving on loose soil or steep slopes [11–14].
Table 2  Reported integration schemes
Scheme 1: image sensor + GPS + INS
Scheme 2: image sensor + GPS + IMU
Scheme 3: image sensor + GPS + DMI
Scheme 4: image sensor + GPS
Scheme 5: image sensor + INS/IMU + wheel odometry
Scheme 6: image sensor + IMU
Scheme 7: image sensor + wheel odometry + laser scanner/range finder
Scheme 8: image sensor + wheel odometry
Scheme 9: image sensor only
• Underwater vehicle. Similarly, there is no report of an image-based method for the self-position and orientation of a ship. One reason is that detectable or trackable features in the camera's field of view do not exist, especially when the ship sails on the ocean. Recent studies only focus on self-position and orientation of underwater vehicles with image sensors, for purposes such as automatic control of remotely operated underwater vehicles, inspection of underwater structures or other objects of interest, underwater measurement and underwater cable tracking [15,16].
• Human and other moving platforms. In some areas, such as visual augmented reality, egocentric wearable vision and vision-assisted obstacle avoidance for people with visual disabilities, a human is treated as a moving platform, and self-position and orientation is achieved with a hand-held or wearable image sensor [17,18]. Some equipment carrying an image sensor, e.g. a mobile phone, is a moving platform in some situations; Jari et al. [19] estimate a smart phone's motion to control a user interface when large objects that do not fit into the display need to be scrolled and zoomed by the user.
4  Integration schemes of image sensor with other sensors
Self-position and orientation of a moving platform can work with the image sensor alone. It can also work with a combination of the image sensor and other sensors, i.e., GPS, IMU/INS, distance measuring instrument (DMI), wheel odometry, and laser scanner/range finder. Such integration takes advantage of all sensors and makes the position and orientation more stable and more accurate. Table 2 lists the integration strategies reported in the references.
Schemes 1–3: image sensor + GPS + INS (IMU)/DMI. These schemes are used for self-position and orientation of outdoor moving platforms. Because GPS/INS (IMU) can fully achieve absolute position and orientation, the image-based method becomes a supplementary method. There are many choices of GPS, e.g. real-time differential GPS (DGPS), carrier-phase differential GPS, and low-cost GPS [4,20]; the global navigation satellite system (GNSS) is another choice [21]. The authors have experimented with computing the exterior orientation elements of orientation images by adjustment computation and those of the other linear-array images by interpolation [22], and with compensating the systematic error of POS by bundle block adjustment [23].
Scheme 4: image sensor + GPS. This is a low-cost scheme in which low-precision GPS is compensated by the image-based method. Since a dead-reckoning sensor (e.g. odometry or IMU) for moving-distance computation does not exist, overlapping stereo images are used to estimate the absolute moving distance and to calculate attitude. Such a method has been applied by the authors in aerial photogrammetric mapping without ground control points through GPS-supported aerial triangulation [24–30].
Scheme 5: image sensor + INS/IMU + odometry. Odometry can overcome the INS drift over time and correct the moving distance, and the image-based method can improve the orientation precision of the INS [31]. This scheme is often applied to wheeled robots or automobiles because of the odometry.
Scheme 6: image sensor + IMU. The image sensor of this scheme is used to overcome the IMU drift over time [4,32]. Compared with scheme 5, scheme 6 has no constraint of moving with wheels, so it can be applied when a human is the moving platform [17].
Scheme 7: image sensor + odometry + laser scanner. The approximate position of the moving platform is obtained from odometry; the accurate position is then achieved from the scene structure reconstructed from both the laser scanner and stereo images. Kai et al. [9] use an image-based method to improve the position and orientation precision. Their experiment proved that the feature-based approach for localization with infinite lines does an excellent job when the robot moves in structured or semi-structured surroundings.
Scheme 8: image sensor + odometry. This is a common scheme for relative position and orientation, and the two types of sensors are both low cost. On the one hand, the image sensor can provide abundant information and compensate the accumulated orientation and position error when odometry fails due to wheel slip; on the other hand, odometry can provide the moving distance between two consecutive images [33,34]. When features in the images are not available, the position and orientation are obtained by dead reckoning from the odometry.
Scheme 9: image sensor only. This scheme depends only on the image sensor (see Section 5), so the system complexity is reduced. Orientation from a single image sensor has no scale problem, but position from it is determined only up to an unknown scale factor [6]; position and orientation from stereo cameras have no such problem [35]. Up to now, few studies have reported on this scheme.
5  Calculation methodology and quantity of image sensors
The camera is the image sensor that records a real scene in images; these images may be still photographs or moving images such as videos. Furthermore, the camera is generally a digital one, which is convenient for grabbing, transmitting and processing images. Mouragnon et al. [36] divide cameras into three types: central cameras with a unique optical center, axial cameras with collinear centers, and non-axial cameras. By quantity, camera setups can be distinguished as single camera, two cameras and multi-cameras. The following summarizes the calculation methodology of self-position and orientation for these three types of camera setup.
5.1  Self-position and orientation from single camera
Basically, there are three types of methods for self-position and orientation from a single camera: discrete methods, regional methods and auxiliary feature methods; in addition, hybrid methods are derived from the first two types.
Discrete methods use a set of corresponding features in images to estimate the position and orientation. A general processing procedure includes four steps: feature detection, correspondence matching, robust estimation of position and orientation, and bundle adjustment refinement of the estimated position and orientation. These features include corner features (e.g., Harris corner, SUSAN corner, SIFT/SURF feature) and edge features (e.g., lines).
Regional methods are also named area-correlation methods, which means the whole region (or a subregion) of the image is involved in the computation. Regional methods can be classified into three categories: direct methods, differential methods and image registration-based methods. Direct methods use the temporal and spatial gradients of scalar fields (i.e., intensity images and depth fields) to estimate the position and orientation without explicitly calculating the optical flow [8]. Differential methods directly use the optical flow [32], which approximately represents the image velocities (i.e., motion in images). Image registration-based methods calculate the position and orientation by registering input images against images captured beforehand; Kourogi et al. [37] register video frames into pre-captured panoramic images.
Compared to regional methods, discrete methods allow two consecutive images to have a large displacement from each other. But when this displacement is small, discrete methods tend to present serious problems in triangulation: 1) the accuracy of depth estimation degrades very fast; 2) the image difference induced by camera displacement cannot be distinguished from that induced by camera rotation. On the other hand, regional methods use optical flow, which can be estimated with reasonable accuracy only when the difference between images is small.
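As a rough illustration of the four-step discrete pipeline (with the final bundle adjustment refinement omitted), the following sketch uses OpenCV ORB features, brute-force matching, and RANSAC estimation of the essential matrix. The calibration matrix K and the commented-out file names are assumed placeholders, and the recovered translation is defined only up to scale.

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Estimate relative rotation R and unit-scale translation t between two frames."""
    orb = cv2.ORB_create(2000)                       # 1) feature detection
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)  # 2) matching
    p1 = np.float32([k1[m.queryIdx].pt for m in matches])
    p2 = np.float32([k2[m.trainIdx].pt for m in matches])
    # 3) robust estimation: essential matrix with RANSAC rejects outlier matches
    E, mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC, threshold=1.0)
    # 4) decompose E with a cheirality check to obtain R, t (t has unknown scale)
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=mask)
    return R, t

# Hypothetical usage with an assumed pinhole calibration matrix K.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
# img_a = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
# img_b = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
# R, t = relative_pose(img_a, img_b, K)
```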
Hybrid methods integrate both discrete methods and regional methods. For example, Andrew et al. [38] use the standard Shi-Tomasi feature tracker to track features, and then obtain the camera motion through a cost function over the image intensity neighborhood surrounding the features.
Auxiliary feature methods estimate the self-position and orientation from the transformation of specially designed artificial features detected in the images [32]. When the absolute locations of those features are known, auxiliary feature methods can provide absolute position and orientation, as sketched below.
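A minimal sketch of such an auxiliary-feature case: when the absolute 3D positions of a few artificial markers are known, the absolute camera pose can be recovered with a perspective-n-point solver. The marker coordinates, pixel measurements and calibration below are invented for illustration only.

```python
import cv2
import numpy as np

# Known 3D positions of four artificial markers in the world frame (assumed values, metres).
object_points = np.array([[0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [1.0, 1.0, 0.0],
                          [0.0, 1.0, 0.0]], dtype=np.float64)
# Their detected pixel coordinates in the current image (assumed values).
image_points = np.array([[320.0, 240.0],
                         [480.0, 238.0],
                         [478.0, 390.0],
                         [322.0, 392.0]], dtype=np.float64)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                      # assume negligible lens distortion

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)              # rotation: world frame -> camera frame
camera_position = -R.T @ tvec           # absolute camera centre in the world frame
```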
5.2  Self-position and orientation from two cameras
Self-position and orientation from two cameras can be classified into four types: 3D reconstruction-based methods, non-3D-reconstruction-based methods with stereo cameras, non-stereo camera methods, and small-overlap stereo camera methods.
3D reconstruction-based methods reconstruct the 3D structure of the viewed scene before computing position and orientation. Firstly, the corresponding points between the left and right images are matched. Then their 3D space coordinates are obtained by triangulation, and the corresponding 3D points between consecutive stereo images are tracked or matched. Finally, the relative position and orientation are obtained from those corresponding 3D points through a rigid transformation [35,39]; a triangulation sketch is given below.
Non-3D-reconstruction-based methods use the differential image, the trifocal tensor or the quadrifocal tensor instead of creating a 3D reconstruction of the scene [40,41].
Non-stereo camera methods mean that there is no overlapping region between the two cameras, and the two cameras serve to augment the field of view (FOV) [8]. Obviously, the larger the FOV is, the higher the accuracy of self-position and orientation that can be achieved.
Small-overlap stereo camera methods take the advantages of both stereo cameras and non-stereo cameras: the small overlapping region provides the absolute displacement, and the non-overlapping regions augment the FOV [10].
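The triangulation stage of a 3D reconstruction-based method can be sketched as follows with OpenCV; the rectified stereo geometry (focal length, principal point, 0.12 m baseline) and the matched pixel coordinates are assumed values, not taken from any system cited here.

```python
import cv2
import numpy as np

def triangulate(P_left, P_right, pts_left, pts_right):
    """Triangulate matched left/right pixel coordinates into 3D points.

    P_left, P_right: 3x4 projection matrices of the calibrated stereo pair.
    pts_left, pts_right: Nx2 arrays of corresponding pixel coordinates.
    """
    pts4d = cv2.triangulatePoints(P_left, P_right,
                                  pts_left.T.astype(np.float64),
                                  pts_right.T.astype(np.float64))
    return (pts4d[:3] / pts4d[3]).T          # de-homogenise to Nx3

# Assumed rectified geometry: focal 700 px, principal point (320, 240), baseline 0.12 m.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.12], [0.0], [0.0]])])

pts_left = np.array([[300.0, 250.0], [400.0, 220.0]])
pts_right = np.array([[280.0, 250.0], [385.0, 220.0]])
cloud = triangulate(P_left, P_right, pts_left, pts_right)
```

The rigid-transformation step between two such consecutive point clouds is sketched later, in Subsection 6.2.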
5.3  Self-position and orientation from multi-cameras
Self-position and orientation from multi-cameras can be classified into two types: overlapping-region multi-cameras and non-overlapping-region multi-cameras. The former can provide the absolute displacement from the overlapping region and augment the field of view with multiple cameras. The latter has the following advantages. 1) It has a larger FOV, which provides more constraints than a single- or two-camera setup, because camera displacement and camera rotation are hard to distinguish when the FOV is small [42]. 2) It can focus on regions of interest, compared with an omnidirectional camera. 3) The cameras need not be fixed at the center of the moving platform; they can be fixed separately at any locations on the platform. Table 3 gives a brief summary of the quantity of applied image sensors and the calculation methodology.
6  A scheme of self-position and orientation depending only on image sensors
This paper proposes a new scheme for self-position and orientation of a moving platform that depends only on image sensors. The scheme is an integration of a single camera and stereo cameras, and it takes the advantages of both video and stereo image sequences: images from video are characterized by a short interval (i.e., high temporal resolution), and stereo images can recover the absolute displacement between each two consecutive image pairs. It thus profits from high-resolution stereo cameras and from a low-resolution but high-sampling-rate video camera. Table 4 lists the characteristics of pose estimation from a single camera, from stereo cameras, and from the integration of both. Figure 2 illustrates the data flow of the designed scheme, in which only image sensors are involved. After the three calibrations illustrated in Figure 2 (i.e., calibration of the single video camera's interior parameters, calibration of the stereo cameras, and relative pose calibration between the single camera and the stereo cameras), all three image sensors can be treated as one whole position and orientation system. There are three aspects of the computation of position and orientation in Figure 2.
Table 3  The quantity of image sensors and calculation methodology
Single camera:
  Discrete methods. Point features: Harris corner, SUSAN corner, SIFT point, SURF point. Line features: Hough line; Canny edge, then line.
  Regional methods. Direct methods: intensity image, depth field. Differential methods: optical flow field. Image registration-based methods: register input image into pre-captured images.
  Hybrid methods. Hybrid methods of both discrete methods and regional methods: feature points and their surrounding image intensity neighborhood.
  Auxiliary feature methods. Specially designed features in images: estimation of position and orientation from the transformation of features.
Two cameras:
  Stereo cameras. 3D reconstruction-based methods: computation from the corresponding points of two point cloud sets. Non-3D-reconstruction-based methods: intensity image; trifocal or quadrifocal tensor.
  Non-stereo camera methods. Using the methods of single camera: augmentation of FOV.
  Small overlapping stereo camera methods. Integrated methods of both stereo cameras and non-stereo cameras: both 3D reconstruction of scene and augmentation of FOV.
Multi-cameras (augmentation of FOV):
  360 degree panorama image: 5, 6 or 8 cameras.
  Local image: other setups.

Table 4  The characteristics of three camera setups
Single camera. Spatial resolution: low; temporal resolution: high; accuracy of self-position and orientation: low; sensitivity to moving due to sampling rate: high; requirement for moving: fails at pure rotation, ill-conditioned at small displacement; scale of displacement estimation: non-unified scale.
Stereo cameras. Spatial resolution: high; temporal resolution: low; accuracy of self-position and orientation: high; sensitivity to moving due to sampling rate: low; requirement for moving: suitable for any movement; scale of displacement estimation: unified scale.
Integration of the both. Spatial resolution: high; temporal resolution: high; accuracy of self-position and orientation: high; sensitivity to moving due to sampling rate: high; requirement for moving: suitable for any movement; scale of displacement estimation: unified scale.
1) Position and orientation from the single camera. The video frames (i.e., the image sequence) are used to estimate pose. At first, key frames are selected from the abundant frames; then the position and orientation of the key frames are estimated; finally, the position and orientation of all frames are recovered through interpolation and bundle adjustment.
2) Position and orientation from the stereo cameras. The corresponding points between the left and right images serve to reconstruct 3D points. The corresponding points between consecutive left/right images are used to find the corresponding 3D points between consecutive point clouds, from which the rigid transformation can be recovered.
3) Integrated position and orientation. By interpolation at the exposure moments of the single camera's shutter, position and orientation with a high sampling rate are obtained. In this case, the video camera and the stereo cameras share the same time base.
Figure 2  Data flow of the designed scheme (video camera and stereo camera calibration, feature point tracking, key frame selection, correspondence between left and right images and between consecutive images, point clouds, pose estimation, interpolation, and bundle adjustment).
Figure 3  The demand of feature points using a single camera.
6.1  Position and orientation from single camera
Feature points are tracked through the whole video sequence. Next, key frames which have large displacement but still retain enough feature points are selected, so that the projections of a 3D point appear in three images (illustrated in Figure 3). The dashed line denotes the single camera moving consecutively from position C1 to position Cn. Ci (i = 1, 2, . . . , n) is the camera center, and the trajectory of the moving camera can be approximately denoted by the line segments connecting each two positions Ci and Ci+1. Ci also denotes the image that the camera captures at that position, and the point set {Pi−1,i,i+1} is the set of 3D space points that can be viewed from positions Ci−1, Ci and Ci+1. Moreover, {Pi−1}, {Pi} and {Pi+1} are the projections of the point set {Pi−1,i,i+1} in the three images. The arrowed line from Ci to {Pi−1,i,i+1} denotes the forward intersection, which determines the points' 3D coordinates, and the arrowed line from {Pi−1,i,i+1} to Ci denotes the resection, which recovers the relative position and orientation of Ci. Taking the first four images as an example, the computation process is as follows. First, the position and orientation of C2 are calculated (the origin of the reference frame is at C1, with no rotation). The corresponding points {P1} and {P3} serve to calculate the relative translation T1,3 and orientation R1,3 between C1 and C3. Second, the 3D point set {P̂1,2,3} is recovered based on C1 and C3 (shown in Figure 4). With the 3D point set {P̂1,2,3} and its corresponding image points {P2} known, the camera projection matrix PC2 of C2 can be determined.
Figure 4  Scale transfer in self-position and orientation from a single camera (||T2,3|| = ||T′2,3||).
Figure 5  The demand of feature points using stereo cameras.
Through decomposition of the matrix PC2, a rotation R′1,2 and translation T′1,2 are obtained. Because {P̂1,2,3} is located in a reference frame determined by C1, the orientation and translation of C2 are R1,2 = R′1,2 and T1,2 = T′1,2. The three translation vectors T1,2, T2,3 and T1,3 must satisfy T1,2 + T2,3 = T1,3. Moreover, T1,2 and T1,3 have been calculated, with ||T1,3|| = 1, so T2,3 is obtained as T2,3 = T1,3 − T1,2. Similarly, from the triple of images 2, 3 and 4, a translation T′2,3 and orientation R2,3 of C3 can be obtained. Because C2 and C3 on the left and right of Figure 4 are the same images, the length of T′2,3 should equal the length of T2,3. Therefore, T″2,3 denotes the new T′2,3 after its scale is changed according to T″2,3 = T′2,3 · ||T2,3|| / ||T′2,3||. The translation T2,4 should have the same scale as T2,3, so T″2,4 = T′2,4 · ||T2,3|| / ||T′2,3|| is satisfied. After T″2,3 and T″2,4 are obtained, T3,4 follows explicitly as T3,4 = T″2,4 − T″2,3, and the obtained T3,4 is still under the condition ||T1,3|| = 1. From the above computation, we can see that three images are involved each time: the position and orientation of the first image come from the previous triple, those of the second image come from the current triple, and the moving distance between the last two images transfers the unified scale to the next triple of images.
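A minimal sketch of the scale bookkeeping described above, using the vector relation T1,2 + T2,3 = T1,3; the numeric translations are toy values and the function name is ours, not the paper's.

```python
import numpy as np

def transfer_scale(T_shared_prev, T_shared_new, T_next_new):
    """Bring the translations of the current image triple into the unified scale.

    T_shared_prev : translation of the segment shared with the previous triple,
                    already in the unified scale (e.g. T_{2,3}).
    T_shared_new  : the same segment as recovered inside the current triple
                    (e.g. T'_{2,3}), in that triple's arbitrary scale.
    T_next_new    : another translation of the current triple (e.g. T'_{2,4}).
    Returns the rescaled versions of T_shared_new and T_next_new.
    """
    s = np.linalg.norm(T_shared_prev) / np.linalg.norm(T_shared_new)
    return s * T_shared_new, s * T_next_new

# Toy example: the first triple fixed ||T_{1,3}|| = 1 and gave T_{1,2}, hence T_{2,3}.
T_13 = np.array([1.0, 0.0, 0.0])                 # unit scale by convention
T_12 = np.array([0.45, 0.02, 0.0])
T_23 = T_13 - T_12                               # from T_{1,2} + T_{2,3} = T_{1,3}
# The second triple (images 2, 3, 4) recovers its own arbitrary-scale translations.
T_23_new = np.array([0.9, -0.03, 0.0])
T_24_new = np.array([1.6, 0.05, 0.0])
T_23_u, T_24_u = transfer_scale(T_23, T_23_new, T_24_new)
T_34 = T_24_u - T_23_u                           # now expressed in the unified scale
```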
6.2  Position and orientation from stereo cameras
Self-position and orientation from stereo cameras is explicit; the real moving distance between each two positions can be obtained without scale transfer. As illustrated in Figure 5, CC,i denotes the center of the stereo cameras (generally, the center of the baseline). {Pi−1,i} is the point set that can be viewed by the stereo pairs from positions i − 1 and i. Ri−1,i and Ti−1,i are the rotation and translation, respectively, which denote the rigid transformation between positions i − 1 and i. Because the translation Ti−1,i carries the real moving distance, only two consecutive stereo pairs need to be computed each time.
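The rigid transformation between two consecutive triangulated point clouds can be recovered in closed form with an SVD-based (Kabsch/Horn-style) least-squares fit, as in the following sketch; the synthetic cloud and motion at the end are only a self-check, not data from the experiment.

```python
import numpy as np

def rigid_transform(P_prev, P_curr):
    """Least-squares rotation R and translation t with P_curr ≈ R @ P_prev + t.

    P_prev, P_curr: Nx3 arrays of the same 3D points triangulated at two
    consecutive stereo exposures (correspondences already established).
    """
    c_prev = P_prev.mean(axis=0)
    c_curr = P_curr.mean(axis=0)
    H = (P_prev - c_prev).T @ (P_curr - c_curr)      # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = c_curr - R @ c_prev
    return R, t

# Self-check: rotate a small synthetic cloud by 10 degrees about Z and shift it.
theta = np.radians(10.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.05])
P_prev = np.random.rand(50, 3)
P_curr = P_prev @ R_true.T + t_true
R, t = rigid_transform(P_prev, P_curr)      # recovers R_true and t_true
```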
6.3  Integrated position and orientation
Figure 6 illustrates the geometry between the single camera and the stereo cameras. C1 and C2 are the left and right camera centers, respectively. CC is the center of the baseline C1C2 and also denotes the stereo cameras' center. A reference frame CC − XC YC ZC located at CC is constructed, and the recovered position and orientation are relative to it.
Figure 6  Geometry between the single camera and the stereo cameras.
Figure 7  The spatial-temporal relationship between the two trajectories from the single camera and the stereo cameras.
A video camera is fixed at this point, and its center is denoted as CS. This camera constructs a reference frame CS − XS YS ZS. Obviously, CS and CC cannot be located at exactly the same position, and the three axes of the two reference frames cannot be exactly parallel; the rigid transformation between the two reference frames is calibrated in advance. The spatial-temporal relationship between the single camera and the stereo cameras is illustrated in Figure 7. CC,j (j = 1, 2, . . . , n) is a moving trajectory point (the center of the baseline) recovered from the stereo image sequences, and there is a local reference system CC,j − XC,j YC,j ZC,j at every CC,j. Generally, the grab frequency of the stereo cameras is lower than that of the single camera, so there are CS,j,i (i = 1, 2, . . . , m) images between two consecutive points CC,j and CC,j+1. CS,j,i also represents the optical center of the single camera, and there is a corresponding local reference system CS,j,i − XS,j,i YS,j,i ZS,j,i. C′C,j and C′C,j+1 are the points of the single-camera moving trajectory corresponding to CC,j and CC,j+1. Due to the different grab frequencies, C′C,j and C′C,j+1 may not coincide with any CS,j,i. If the same time base is adopted by the grabbing procedures of the single camera and the stereo cameras, these two sequences of trajectory points can be integrated and give a better result than either individual result. The three steps of the integration of the video camera and the stereo cameras are as follows:
1) Computation of the position and orientation of C′C,j and C′C,j+1. They may not be concurrent with any CS,j,i. Taking C′C,j as an example, it is located between the two points CS,j−1,m and CS,j,1, which are points of the moving trajectory recovered from the video camera; these three points are arranged in time sequence, so C′C,j can be interpolated in the time domain. The translation vector can be obtained through linear interpolation or B-spline interpolation: the former lies on the line segment and the latter is derived from a smooth trajectory. For orientation interpolation, quaternion interpolation (also known as spherical linear interpolation) is used, which is characterized by constant-velocity transformation.
2) C″C,j, C″S,j,i and C″C,j+1 (the corresponding positions of C′C,j, CS,j,i and C′C,j+1 in the stereo cameras' reference frame) are obtained under the constraint that the vector from C″C,j to C″C,j+1 has the same length and direction as the vector from CC,j to CC,j+1. Because the directions of the two vectors cannot be exactly collinear, and the single-camera trajectory drifts over time due to error accumulation, both the direction and the length transformations are treated as optimal results. The computation steps are as follows: take C″C,j as the reference point, concurrent with CC,j; the position of each C″S,j,i is obtained by rescaling the relative lengths between each two consecutive points by the length of CC,jCC,j+1; then, taking C″C,j as the center of rotation, the rotation angle and axis between C″C,j+1 and CC,j+1 are obtained, and each C″S,j,i is rotated to its destination position.
3) The orientation of each C″S,j,i is obtained with the orientation constraint of the start and end points of this line segment. The orientation of C″C,j is made collinear with the orientation of CC,j, so that the orientation of each point C″S,j,i is collinear with the orientation of CS,j,i after a sequence of transformations. Since the orientation of C″C,j+1 cannot be exactly collinear with the orientation of CC,j+1, C″S,j,i is transformed to its destination orientation through a step similar to step 2).
After these three steps, the integration of the single camera and the stereo cameras is achieved: the position and orientation of the single camera have been integrated into those of the stereo cameras. The recovered moving trajectory has not only the real-distance character of the stereo cameras but also the high frequency of the video camera; furthermore, it is highly sensitive to displacement and orientation changes thanks to that high frequency.
Figure 8  Artificial data and comparison of recovered trajectories. (a) Artificial 3D scene and camera trajectory; (b) recovered trajectories by different methods.
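Step 1 above interpolates the video-camera trajectory at the stereo exposure times: linear (or B-spline) interpolation for translation and spherical linear interpolation for orientation. A minimal sketch follows, with quaternions in (x, y, z, w) order and invented timestamps; it is not the paper's implementation.

```python
import numpy as np

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions q0 and q1, u in [0, 1]."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:                      # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:                   # nearly parallel: fall back to linear blend
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    omega = np.arccos(dot)
    return (np.sin((1.0 - u) * omega) * q0 + np.sin(u * omega) * q1) / np.sin(omega)

def interpolate_pose(t, key_t0, key_t1, p0, p1, q0, q1):
    """Pose at time t between two keyed video-camera poses (p, q), by timestamp."""
    u = (t - key_t0) / (key_t1 - key_t0)
    position = (1.0 - u) * p0 + u * p1          # linear interpolation of translation
    orientation = slerp(q0, q1, u)              # constant angular velocity
    return position, orientation

# Example: a stereo exposure at t = 0.35 s falls between video frames at 0.30 s and 0.40 s.
pos, ori = interpolate_pose(0.35, 0.30, 0.40,
                            np.array([0.0, 0.0, 0.0]), np.array([0.1, 0.0, 0.02]),
                            np.array([0.0, 0.0, 0.0, 1.0]),
                            np.array([0.0, 0.0, np.sin(0.05), np.cos(0.05)]))
```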
6.4  Experimental results
A simulation experiment has been used to test our method. In this experiment, an artificial 3D scene and three virtual cameras are created. The scene has the following features: a large and complex scene can simulate a complex and long camera moving trajectory; abundant texture provides detectable or trackable features between consecutive images; and the ground truth of the moving trajectory is available for comparison with the recovered trajectory. Figure 8(a) shows the artificial 3D scene and the camera moving path, and Figure 8(b) shows the moving trajectories recovered by different methods from the image sequence of the single camera. In this figure, Line A is the ground truth; Line B is the trajectory recovered by the method introduced in Subsection 6.1; Line C is the trajectory recovered under the condition of an invariant scale; and Line D is the trajectory of Line B refined by the local bundle adjustment method. On the one hand, Line B matches the ground truth very well at the beginning, but converges to a point rapidly. On the other hand, the reliability of the recovered pose is confirmed by Line C because the trajectories match well. Therefore, accumulated displacement error is the main cause of Line B's convergence. Line D proves that local bundle adjustment can enhance the reliability and accuracy of the recovered trajectory, although the accumulated error of this kind of open trajectory is still significant. Figure 9 shows the comparison between the ground truth and the trajectory recovered from the stereo cameras, seen from different viewpoints. In this figure, Line A is the ground truth and Line B is the recovered trajectory; obviously, these two lines match very well, which proves the high reliability and accuracy obtained from the stereo cameras in the artificial 3D scene. When the trajectory integration from the single camera to the stereo cameras is completed, the final recovered trajectory is just like Line B: the obtained position and orientation have higher frequency and sensitivity than those from the stereo cameras alone, and higher precision than that from the single camera alone.
Figure 9  Recovered trajectory from stereo cameras. (a) Top view; (b) side view.
7  Conclusion
This paper has summarized the state-of-the-art development of the image-based self-position and orientation
method for moving platforms, although more analysis is still needed in this area. For example, regarding the dependence between the image-based method and other sensors, there are two types of integration mode: loose integration [7] and tight integration [2]. In the former, the image-based system is independent and does not rely on GPS/INS (IMU); when the confidence of the GPS-based system falls below a specified threshold, the image-based system takes over, and otherwise the GPS-based system estimates the position and orientation. In the latter, the image-based system is not independent, as it needs the GPS-based system to work at all times, and the obtained position and orientation is a weighted average of both systems. Consequently, this paper proposes a new scheme depending only on image sensors. This scheme takes the advantages of both the high-resolution stereo image sequence and the high-sampling-rate video frames, providing a high-precision and high-frequency estimate of position and orientation. On the whole, the image-based self-position and orientation method for moving platforms has been improved in both theory and application. In the future, more research is required in the following areas: improving precision, strengthening stability, enhancing real-time computation, and promoting its application, especially in deep space exploration and indoor robot control.
Acknowledgements  This work was supported by National Key Developing Program for Basic Sciences of China (Grant No. 2012CB719902), and National Natural Science Foundation of China (Grant Nos. 41021061, 40971219).
References
1 Nakju D, Howie C, Wan K C. Accurate relative localization using odometry. In: IEEE International Conference on Robotics and Automation. Taipei, 2003. 1606–1612
2 Randeniya D I B, Gunaratne M, Sarkar S, et al. Calibration of inertial and vision systems as a prelude to multi-sensor fusion. Transport Res C: Emer, 2008, 16: 255–274
3 Iván F M, Pascual C, Carol M, et al. Omnidirectional vision applied to unmanned aerial vehicles (UAVs) attitude and heading estimation. Robot Auton Syst, 2010, 58: 809–819
4 Farid K, Kenzo N. A visual navigation system for autonomous flight of micro air vehicles. In: IEEE International Conference on Intelligent Robots and Systems. Saint Louis, 2009. 3888–3893
5 Sergio A R F, Vincent F, Philippe B, et al. An embedded multi-modal system for object localization and tracking. In: IEEE Symposium on Intelligent Vehicles. San Diego, 2010. 211–216
6 Friedrich F, Davide S, Marc P. A constricted bundle adjustment parameterization for relative scale estimation in visual odometry. In: IEEE International Conference on Robotics and Automation. Anchorage, 2010. 1899–1904
7 Brad G, Andreas K, Mongi A A. A comparison of pose estimation techniques: hardware vs. video. In: Proceedings of SPIE Unmanned Vehicle Technology VII. Orlando, 2005. 166–173
8 Munir Z. High resolution relative localisation using two cameras. Robot Auton Syst, 2007, 55: 685–692
9 Kai O A, Nicola T, Björn T J, et al. Multisensor on-the-fly localization: precision and reliability for applications. Robot Auton Syst, 2001, 34: 131–143
10 Kim J H, Chung M J, Choi B T. Recursive estimation of motion and a scene model with a two-camera system of divergent view. Pattern Recogn, 2010, 43: 2265–2280
11 Yang C, Mark M, Larry M. Visual odometry on the Mars exploration rovers. In: IEEE International Conference on Systems, Man and Cybernetics. Hawaii, 2005. 903–910
12 Di K C, Xu F L, Wang J, et al. Photogrammetric processing of rover imagery of the 2003 Mars exploration rover mission. ISPRS J Photogramm, 2008, 63: 181–201
13 Li R X, Di K C, Matthies L H, et al. Rover localization and landing site mapping technology for the 2003 Mars exploration rover mission. Photogramm Eng Rem S, 2004, 70: 77–90
14 Li R X, Squyres S W, Arvidson R E, et al. Initial results of rover localization and topographic mapping for the 2003 Mars exploration rover mission. Photogramm Eng Rem S, 2005, 71: 1129–1144
15 Jochen K, Jan A, Frank K. A robust vision-based hover control for ROV. In: IEEE Kobe Techno-Ocean. Kobe, 2008. 1–7
16 Dennis K, Homayoun N. Development of visual simultaneous localization and mapping (VSLAM) for a pipe inspection robot. In: International Symposium on Computational Intelligence in Robotics and Automation. Jacksonville, 2007. 344–349
17 Castle R O, Klein G, Murray D W. Combining monoSLAM with object recognition for scene augmentation using a wearable camera. Image Vision Comput, 2010, 28: 1548–1556
18 Liu J J, Cody P, Kostas D. Video-based localization without 3D mapping for the visually impaired. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. San Francisco, 2010. 23–30
19 Jari H, Pekka S, Janne H. Vision-based motion estimation for interaction with mobile devices. Comput Vis Image Und, 2007, 108: 188–195
20 Arvind R, Anning C, Jay A F. Observability analysis of INS and lever-arm error states with CDGPS-camera aiding. In: IEEE Symposium on Position Location and Navigation. Indian Wells, 2010. 1197–1230
21 Mattia D A, Andrea L, Francesco N, et al. GIMPhI: a novel vision-based navigation approach for low cost MMS. In: IEEE Position Location and Navigation Symposium. Indian Wells, 2010. 1238–1244
22 Li D R, Zhao S M, Lu Y H, et al. Combined block adjustment for airborne three-line CCD scanner images. Acta Geod Cartogr Sin, 2007, 36: 245–250
23 Yuan X X. A novel method of systematic error compensation for a position and orientation system. Prog Nat Sci, 2008, 18: 953–963
24 Li D R, Yuan X X. GPS-supported bundle block adjustment–an empirical result from test field Taiyuan. Acta Geod Cartogr Sin, 1995, 24: 1–7
25 Li D R, Yuan X X. Some investigation for GPS supported aerotriangulation. Acta Geod Cartogr Sin, 1997, 26: 14–19
26 Li D R, Yuan X X, Wu Z C. GPS supported automatic aerotriangulation. J Rem Sens, 1997, 1: 306–310
27 Li D R, Yuan X X. Airborne mapping system with GPS-supported aerotriangulation. In: Proceedings International Workshop on Mobile Mapping Technology. Bangkok, 1999. 351–358
28 Yuan X X, Li D R. GPS-supported determination method for interior orientation of aerial camera. Acta Geod Cartogr Sin, 1995, 24: 103–196
29 Yuan X X, Li D R. GPS-supported aerial triangulation in China. In: Spatial Information Science, Technology and its Applications–RS, GPS, GIS, their Integration and Applications. Wuhan: Technical University of Surveying and Mapping Press, 1998. 671–681
30 Yuan X X, Li D R. Application of GPS-supported aerotriangulation in the establishment of the Hainan land resource foundation information system. J WTUSM, 1999, 24: 38–42
31 Wang W, Wang D. Land vehicle navigation using odometry/INS/vision integrated system. In: IEEE International Conference on Cybernetics and Intelligent Systems. Chengdu, 2008. 754–759
32 Zhang T G, Kang Y, Markus A, et al. Autonomous hovering of a vision/IMU guided quadrotor. In: International Conference on Mechatronics and Automation. Changchun, 2009. 2870–2875
33 Sagüés C, Guerrero J J. Visual correction for mobile robot homing. Robot Auton Syst, 2005, 50: 41–49
34 Wang P L, Shi S D, Hong X W. A SLAM algorithm based on monocular vision and odometer. Comput Simul, 2008, 25: 172–175
35 García G R, Sotelo M A, Parra I, et al. 2D visual odometry method for global positioning measurement. In: IEEE International Symposium on Intelligent Signal Processing. Xiamen, 2007. 1–6
36 Mouragnon E, Lhuillier M, Dhome M, et al. Generic and real time structure from motion using local bundle adjustment. Image Vision Comput, 2009, 27: 1178–1193
37 Kourogi M, Kurata T, Sakaue K. A panorama-based method of personal positioning and orientation and its real-time applications for wearable computers. In: Proceedings of 5th IEEE International Symposium on Wearable Computing. Zurich, 2001. 107–114
38 Andrew E J, Yang C, Larry H M. Machine vision for autonomous small body navigation. In: IEEE Aerospace Conference. Big Sky, 2000. 661–671
39 Wang X H, Fu W P, Su L. Research on binocular vision SLAM with odometer in indoor environment. J Xi'an Univ Technol, 2009, 25: 466–471
40 Comport A I, Malis E, Rives P. Accurate quadrifocal tracking for robust 3D visual odometry. In: IEEE International Conference on Robotics and Automation. Roma, 2007. 40–45
41 Bernd K, Andreas G, Henning L. Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In: IEEE Symposium on Intelligent Vehicles. San Diego, 2010. 487–492
42 Cornelia F, Yiannis A. Observability of 3D motion. Int J Comput Vision, 2000, 37: 43–63