Stereo-inertial odometry using nonlinear optimization

Jianzhu Huai, Charles K. Toth and Dorota A. Grejner-Brzezinska
The Ohio State University
BIOGRAPHIES

Jianzhu Huai received his M.S. degree in geomatics from Beihang University in China in 2012. Since 2013, he has been in the geodetic engineering Ph.D. program in the Department of Civil, Environmental and Geodetic Engineering at The Ohio State University (OSU). His current research interest is multi-sensor real-time navigation.

Charles K. Toth is a Research Professor in the Department of Civil, Environmental and Geodetic Engineering at The Ohio State University. He received an M.S. in Electrical Engineering and a Ph.D. in Electrical Engineering and Geo-Information Sciences from the Technical University of Budapest, Hungary. His research expertise covers broad areas of 2D/3D signal processing, spatial information systems, high-resolution imaging, surface extraction, modeling, integration and calibration of multi-sensor systems, multi-sensor geospatial data acquisition systems, and mobile mapping technology. He is ISPRS Technical Commission I President for 2012-2016.

Dorota A. Grejner-Brzezinska is the Lowber B. Strange Endowed Professor and Chair of the Department of Civil, Environmental and Geodetic Engineering, and director of the Satellite Positioning and Inertial Navigation (SPIN) Laboratory at The Ohio State University. Her research interests cover GPS/GNSS algorithms, GPS/inertial and other sensor integration for navigation in GPS-challenged environments, sensors and algorithms for indoor and personal navigation, and Kalman and nonlinear filtering. She has published over 300 peer-reviewed journal and proceedings papers, numerous technical reports, and five book chapters on GPS and navigation, and has led over 30 sponsored research projects. She is a Fellow of the ION and the RIN, and the recipient of the 2005 ION Thomas Thurlow Award and the 2005 and 2015 United States Geospatial Information Foundation (USGIF) Academic Research Awards. She is the President of the ION.
ABSTRACT

The complementary nature of visual and inertial sensors makes their fusion suitable for bridging GNSS gaps in challenged environments. Filtering-based approaches once dominated the integration of visual and inertial cues, but nonlinear optimization approaches are becoming more prevalent. To explore the potential of nonlinear optimization for real-time, large-scale navigation, this paper presents a motion estimation approach that tightly fuses data from stereo cameras and a consumer-grade inertial measurement unit (IMU). The proposed framework includes a tracking frontend and a mapping backend. In the frontend, point features are tracked in a circular manner between two consecutive frames of the stereo rig (each frame has two synchronized images). Also, features that have been triangulated in a number of selected frames (keyframes) are tracked in the incoming frame by feature matching. Given the tracked point features and the readings from the IMU, two windows of constraints are built: a spatial window and a temporal window. The spatial window contains constraints between the poses of keyframes and the observed points. Besides such pose-point constraints, the temporal window also has constraints relating the consecutive poses of recent frames through the readings from the IMU. With these constraints, all variables (aka states) involved in both windows are refined with a nonlinear optimization. This optimization is done recursively as new frames and IMU readings stream in and keyframes move out of the double window. In the backend, the map is routinely maintained once a new keyframe is inserted. In addition, features in the keyframes make large-scale appearance-based loop closure possible. The current implementation also supports the online removal of redundant keyframes and spurious points, which enhances compact scene representation and ensures extended operation. Tests on the Tsukuba stereo and KITTI datasets verified that this approach achieved higher accuracy than a sliding window smoother and an incremental stereo odometry method. Thus, the benefits of using keyframes and fusing inertial data in motion estimation were verified. It is our goal to make the final implementation fully available as open source.
INTRODUCTION

Visual-inertial navigation systems (VINS) have become quite popular, as it is often desirable to extend navigation capability to GNSS-compromised areas. Complementary to dead reckoning based on a proprioceptive sensor (e.g., an IMU or even a motion model), exteroceptive visual sensors (e.g., a monocular camera, a stereo rig, or a Kinect) observe static and dynamic features in the surrounding environment. Static features tracked across image frames serve as observations that constrain the motion estimated with the proprioceptive sensor. Despite different combinations of visual and inertial sensors at the hardware level, typical algorithms to process the sensor data can be roughly grouped into recursive filtering and batch nonlinear optimization approaches [Leutenegger et al., 2014]. While filtering approaches have achieved high precision at a low computation load, they are more susceptible to linearization errors [Dong-Si and Mourikis, 2011]. In contrast, batch processing methods benefit from the repeated linearization of nonlinear error terms at the expense of a higher computation load. For real-time operation, batch processing approaches with less intensive computation have been proposed. Among them, keyframe-based approaches estimate egomotion by registering incoming images to a sparse or dense model (aka map) of the scene, while refining the model with these images. Old variables are marginalized or removed from the optimization window while preserving the sparse structure of the original problem. As a result, the computation cost is kept bounded and a balance between accuracy and efficiency is achieved. In addition, many keyframe-based methods are adept at handling loop closures. Based on the model concept used in keyframe-based approaches, the nonlinear optimization idea of [Leutenegger et al., 2014], and the loop closure work of [Mur-Artal et al., 2015], we designed a keyframe-based stereo-inertial navigation system. Given a local map, its frontend tracks the camera motion by a nonlinear optimization over two windows of states. The backend maintains the map routinely every time a keyframe is inserted. Loop closure detection and optimization are also conducted in the backend to ensure global map consistency. In view of the multitude of VINS systems, the distinctive properties of our approach are summarized as follows. (1) Instead of using salient features detected over pyramid levels, we adopted the fast feature detection and matching
strategy proposed in [Geiger et al., 2011] to extract and match point features and to roughly estimate the incremental motion. (2) A gravity-aligned ORB [Rublee et al., 2011] feature descriptor [Kurz and Benhimane, 2011] is used to match features between the map and the incoming image; it is expected to improve feature matching performance. (3) Inspired by [Leutenegger et al., 2014], we defined a double window structure to organize the variables for optimization in the frontend. It has one window of spatially separated keyframes and another window of temporally consecutive frames. Frames in the double window are correlated as they observe common points in the map, while frames in the temporal window are also related by the inertial readings. (4) The map maintenance and loop closure [Mur-Artal et al., 2015] run parallel to the stereo-inertial motion estimation. Redundant keyframes and spurious points in the map can be safely removed, putting long-term tracking and consistent mapping within reach. (5) To ensure that the stereo-inertial system does not gain information along unobservable directions, first estimate Jacobians are used in conjunction with the double window structure during optimization. (6) Using a range of simulated and real data, we demonstrated that keyframe-based stereo-inertial nonlinear optimization performed better than sliding window smoothing and incremental stereo odometry.

This paper starts with a review of related research efforts. Next, the method for processing stereo and inertial data is described. Then, simulated and real test results are presented and analyzed in comparison to other methods. Finally, a discussion of the proposed method concludes this paper.

RELATED WORK

This section briefly reviews recent developments in simultaneous localization and mapping (SLAM) and VINS. Approaches that use visual sensors and/or inertial sensors are covered, as they often share common techniques. Ideally, better localization and mapping may be achieved if all available information is considered. Based on this idea, direct methods have obtained impressive performance in SLAM, e.g., dense visual odometry (DVO) [Kerl et al., 2013] and large-scale direct monocular SLAM (LSD-SLAM) [Engel et al., 2014]. These approaches produce a dense local map, but it is difficult to take advantage of the observations across the image
frames. In contrast, sparse approaches can leverage point features tracked across frames with a local nonlinear optimization. Sparse approaches often base motion estimation on tracking sparse features of a local map in the incoming frame. These approaches may be divided into two groups in terms of the processing technique: filtering and nonlinear optimization. Besides their low computation complexity, many filtering approaches [Taylor et al., 2011; Kelly and Sukhatme, 2011] achieved accurate motion estimation in a small area. For exploratory tracking, one notable approach is the multi-state constraint Kalman filter (MSCKF) proposed by Mourikis and Roumeliotis [2007]. Unlike conventional extended Kalman filter (EKF) SLAM methods that include feature points in the state vector, the MSCKF instead includes the poses of the frames of a sliding window. States are propagated with inertial readings and corrected with measurements derived from tracks of point features. Over the years, the MSCKF has evolved to maintain desired properties [Li and Mourikis, 2013; Huang et al., 2014]. Filtering approaches suffer from accumulated linearization errors, which can be reduced in nonlinear optimization approaches thanks to repeated relinearization. Strasdat et al. [2010] showed that an optimization approach achieved higher accuracy than a filtering approach given similar computation resources. Leutenegger et al. [2014] again verified the accuracy advantage of nonlinear optimization over filtering, provided the sparsity of the problem is preserved. To execute nonlinear optimization in real time, three types of approaches have been proposed [Nerurkar et al., 2014]: incremental approaches, sliding window smoothing, and keyframe-based approaches. Incremental approaches, e.g., [Kaess et al., 2011], struggle with loop closures. Sliding window smoothing, e.g., [Dong-Si and Mourikis, 2011], does not maintain the problem's sparse structure. In contrast, with proper marginalization, keyframe-based approaches (e.g., scalable visual SLAM (ScaViSLAM) [Strasdat et al., 2011] and constrained keyframe-based localization and mapping (C-KLAM) [Nerurkar et al., 2014]) can keep the sparsity while maintaining the accuracy. Also, keyframes provide a simple way to construct the map; thus, loop closure is easy to achieve. Among keyframe-based approaches for vision-based sparse SLAM, ScaViSLAM [Strasdat et al., 2011] employed a double window structure consisting of the current frame and previous keyframes for nonlinear optimization, achieving local accuracy and global consistency of the map. However, its feature tracking in the frontend was liable to break down when few features were observed. For
Kinect-style cameras, [Henry et al., 2012] used the keyframe concept in loop closure detection and offline global optimization. For a monocular camera, ORB-SLAM [Mur-Artal et al., 2015] tracked the current frame against a local map of keyframes and maintained the map by optimizing the local map, pruning redundant keyframes, and closing loops. Among keyframe-based approaches for VINS, FrameSLAM [Konolige and Agrawal, 2008] achieved real-time optimization with stereo images by marginalizing redundant frames and keeping a skeleton of frames. It also incorporated the inertial data in a loosely coupled fashion. However, tightly coupled approaches may be more accurate, as they can better handle the correlation between states [Leutenegger et al., 2014]. C-KLAM [Nerurkar et al., 2014] computed constraints between keyframes by marginalizing non-key poses and non-key landmarks while retaining the sparse structure of the original problem. It achieved accuracy close to bundle adjustment, though it ran offline. Our work presented here is close to that of [Leutenegger et al., 2014], which described a real-time, keyframe-based, tightly coupled approach to integrate inertial and visual cues for both monocular and stereo cameras. Using well-synchronized visual-inertial data, it obtained better accuracy than the MSCKF. However, loop closure and map maintenance are not considered there. Besides the above techniques to improve the accuracy of a recursive VINS algorithm, consistency also has a major impact. Li and Mourikis [2013] and Dong-Si and Mourikis [2011] pointed out the inconsistency issue in filtering-based and sliding window smoothing VINS, respectively. To maintain consistency, both used a first estimate Jacobian (FEJ) approach, which was also adopted by Leutenegger et al. [2014]. Here, we also use this approach in linearizing old variables. That is, for variables that move out of the optimization and those related to them by some constraint, their estimates are fixed in computing future Jacobians, but kept floating otherwise.
Figure 1. Stereo-inertial odometry system overview.
METHOD

The presented framework has three modules (or threads) running in parallel: tracking, mapping, and loop closing, as shown in Fig. 1. The tracking module estimates egomotion and selects keyframes. The mapping module adds new keyframes to the map and removes spurious map points and redundant keyframes. The loop closing module detects loops and optimizes the global pose graph.
Tracking

This section reviews each stage of the tracking module. Next, the techniques used in the double window optimization are explained.

(1) Detect and match features: Given a stereo image pair, feature candidates in both images, i.e., blob max, blob min, corner max, and corner min, are extracted using the technique in StereoScan [Geiger et al., 2011]. Then, these feature points in the current frame are matched to the previous frame by circular matching (a minimal sketch follows Step (2)). That is, starting from a feature in the current left image, its best match is searched for in the current right image, then in the previous right image, next in the previous left image, and finally in the current left image again. If the last matched feature coincides with the starting one, a quad match is accepted. In addition, feature points are also matched from the current left image to the current right image and back, resulting in stereo matches. The matching steps for quad matches and stereo matches are illustrated in Fig. 2.

(2) Estimate the incremental motion between the current frame and the previous frame: Given quad matches, the inter-frame motion can be estimated by a RANSAC procedure as done in [Geiger et al., 2011]. Alternatively, it can be obtained from state propagation with the IMU readings. With the initial motion estimate, inliers can be sieved from the quad matches and used to associate map points with features in the current frame if some features in the previous frame have associated map points.
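The circular matching rule of Step (1) is compact enough to sketch in code. The Python fragment below is a minimal illustration, not the authors' implementation; `best_match`, a nearest-descriptor search, is a hypothetical helper of this sketch.

```python
def circular_match(cur_left, cur_right, prev_left, prev_right, best_match):
    """Collect quad matches by circular matching, after [Geiger et al., 2011].

    Each *_left/*_right argument is a list of feature descriptors;
    best_match(f, feats) returns the index of the best descriptor match
    of f in feats (a hypothetical matching helper, not from the paper).
    """
    quads = []
    for i0 in range(len(cur_left)):
        i1 = best_match(cur_left[i0], cur_right)    # current left  -> current right
        i2 = best_match(cur_right[i1], prev_right)  # current right -> previous right
        i3 = best_match(prev_right[i2], prev_left)  # previous right -> previous left
        i4 = best_match(prev_left[i3], cur_left)    # previous left  -> current left
        if i4 == i0:                                # the circle closes: accept
            quads.append((i0, i1, i2, i3))
    return quads
```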
(3) Build the double window: A double window structure is used to organize the variables to be optimized and the related observations. Our double window structure is reminiscent of that presented by Strasdat et al. [2011], who used double windows in the backend to organize keyframes related by pose-point and pose-pose constraints. In contrast, our double window structure is employed in the frontend and composed of a temporal window and a spatial window. The temporal window includes the T most recent frames, which are related by pose-point constraints and IMU readings. Note that a frame in the temporal window can be a keyframe. The spatial window includes S keyframes that are not in the first window and are topologically close to a reference keyframe. These keyframes are related by pose-point constraints. The reference keyframe is in general the latest keyframe inserted into the map. However, in loopy motion of the sensor assembly, it may be the keyframe that observes the maximum number of points in common with the current frame. The double window also includes the map points that have been observed by at least two keyframes of the double window. In sum, the double window is essentially a local map.
Figure 2. A quad match occurs between the four images of two stereo frames, while a stereo match only involves the two images of one stereo frame.

(4) Track and optimize the local map: ORB feature descriptors are extracted for the stereo matches obtained in Step (1). To improve feature matching, the ORB descriptor is oriented along the projected gravity direction in the 2D image, which is known thanks to the IMU readings [Kurz and Benhimane, 2011]. Assume a camera projects world points following a pinhole model with projection matrix $\mathbf{K}$ and possibly with distortion. $\mathbf{K}$ is composed of the focal lengths in the horizontal and vertical directions $(f_u, f_v)$ and the principal point coordinates $(p_u, p_v)$ in pixels:

$$\mathbf{K} = \begin{bmatrix} f_u & 0 & p_u \\ 0 & f_v & p_v \\ 0 & 0 & 1 \end{bmatrix} \quad (1)$$

Also assume the gravity vector in the current camera frame is known as $\mathbf{g}^c = [g_x, g_y, g_z]^T$, where the superscript c refers to the camera frame and the subscripts x, y, z denote the three axes of the camera frame. For a feature point $\tilde{\mathbf{p}} = [\tilde{u}, \tilde{v}, 1]^T$ of homogeneous pixel coordinates (the accent ~ means that a variable contains noise), we first undistort it to get the undistorted pixel coordinates $\mathbf{p} = [u, v, 1]^T$. Thus, we can virtually create a rectified point along the gravity direction, $\mathbf{p}' = \mathbf{p} + \mathbf{K}\mathbf{g}^c$, in homogeneous coordinates; distorting it leads to an actual point in the image, $\tilde{\mathbf{p}}' = [\tilde{u}', \tilde{v}', 1]^T$. As a result, the projected 2D gravity direction of this feature is:

$$\mathbf{d} = \tilde{\mathbf{p}}' - \tilde{\mathbf{p}} = [\tilde{u}' - \tilde{u}, \tilde{v}' - \tilde{v}, 0]^T \quad (2)$$
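As an illustration of Eqs. (1)-(2), the following sketch computes the projected 2D gravity direction used to orient the ORB descriptor. The `undistort` and `distort` functions stand in for the actual lens model and are assumptions of this sketch.

```python
import numpy as np

def projected_gravity_direction(p_tilde, K, g_c, undistort, distort):
    """Projected 2D gravity direction d of a feature, per Eqs. (1)-(2).

    p_tilde: noisy homogeneous pixel [u~, v~, 1]; K: 3x3 projection matrix
    of Eq. (1); g_c: gravity vector in the current camera frame; undistort
    and distort are placeholder lens-model functions (assumptions of this
    sketch) mapping homogeneous pixels to homogeneous pixels.
    """
    p = undistort(p_tilde)            # undistorted pixel p = [u, v, 1]
    p_prime = p + K @ g_c             # rectified point along gravity: p' = p + K g^c
    p_prime = p_prime / p_prime[2]    # normalize the homogeneous coordinates
    p_prime_tilde = distort(p_prime)  # map back into the actual (distorted) image
    return p_prime_tilde - p_tilde    # Eq. (2): d = [u~' - u~, v~' - v~, 0]
```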
For each point in the local map, we look for its best match among the stereo matches of the current frame, as illustrated in Fig. 3. Then, a nonlinear optimization is performed on the local map. Not only does it refine the pose estimate of the current frame, but it also refines the poses of frames and points in the double window.

Figure 3. A match between a map point and a feature in the current frame. The map point needs to match both the left and right descriptors of the stereo match corresponding to the feature.

(5) Select a new keyframe: If the map points tracked in the current frame amount to less than 60% of the map points observed in its reference keyframe, and the current frame tracks at least 50 points, a new keyframe is spawned from the previous frame (see the sketch after Fig. 4). New map points are triangulated from the quad matches of Step (1). The relation between quad matches and the map points observed in a frame is illustrated in Fig. 4. The procedure to create new map points is controlled by a grid to ensure a uniform distribution of observations of the map points in the new keyframe. Alternatively, a quad tree can be used to control this procedure [Strasdat et al., 2011], but we find a simple fixed grid as effective for this purpose. In ORB-SLAM, new map points are created between the new keyframe and the existing ones in the mapping thread; our system additionally creates new map points for a new frame based on quad matches in the tracking thread. As a result, the tracking module depends much less on the keyframes processed by the mapping module, and failures are obviated when the mapping thread is jammed by many waiting new keyframes. In particular, a grid of fixed cell size is created over the dimensions of the left image. Each cell accumulates the number of observed map points. The maximum number across all cells is used as a threshold to control new point creation. For all cells, new map points are created from quad matches until the number of observations reaches the threshold. A triangulated point has to have enough parallax and small reprojection errors before it is accepted as a map point. Once finished, the new keyframe is sent to the mapping thread for further maintenance.

Figure 4. In a frame, quad matches and observed map points are subsets of the stereo matches. To create a keyframe from a frame, new map points are triangulated from quad matches that are not yet map points. Note that quad matches involve the current and previous frames, while map points are anchored in previous keyframes.
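The keyframe decision and the grid control of Step (5) reduce to a few comparisons. The sketch below uses the 60% and 50-point thresholds from the text; the `cell_of` helper and the data layout are assumptions of this illustration.

```python
def need_new_keyframe(n_tracked, n_ref_observed):
    """Keyframe rule of Step (5): spawn a new keyframe when the current
    frame tracks fewer than 60% of the map points observed in its
    reference keyframe, yet still tracks at least 50 points."""
    return n_tracked < 0.6 * n_ref_observed and n_tracked >= 50

def grid_controlled_point_creation(quad_matches, cell_of, counts):
    """Grid control of new map point creation.

    quad_matches: quad matches not yet associated with map points;
    cell_of(m): grid cell index of a match (an assumed helper);
    counts: dict mapping each cell to its number of observed map points.
    Points are added per cell until the count reaches the maximum count
    over all cells, promoting uniformly distributed observations.
    """
    threshold = max(counts.values(), default=0)
    candidates = []
    for m in quad_matches:
        c = cell_of(m)
        if counts.get(c, 0) < threshold:
            candidates.append(m)              # still subject to the parallax and
            counts[c] = counts.get(c, 0) + 1  # reprojection checks before acceptance
    return candidates
```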
Nonlinear stereo-inertial optimization

The nonlinear optimization involved in the tracking module is explained in this section. It covers inertial dead reckoning, the stereo-inertial formulation, and first estimate Jacobians.

IMU kinematics

For low-cost inertial navigation, the earth rotation is relatively small with respect to the gyro error sources and can be safely ignored in short-term inertial navigation [Leutenegger et al., 2014]. Suppose a world frame is attached to a fixed point on earth, and the biases of the accelerometers and gyros ($\mathbf{b}_a$ and $\mathbf{b}_g$, respectively) are random walk processes; then the following equations governing the IMU motion can be obtained. This motion model simplifies that of Shin [2005].

$$\begin{aligned} \dot{\mathbf{x}}_s^w &= \mathbf{v}^w \\ \dot{\mathbf{v}}^w &= \mathbf{R}_s^w\left(\tilde{\mathbf{a}}^s - \mathbf{b}_a - \mathbf{n}_a\right) + \mathbf{g}^w \\ \dot{\mathbf{R}}_s^w &= \mathbf{R}_s^w\left[\tilde{\boldsymbol{\omega}}_s^{ws} - \mathbf{b}_g - \mathbf{n}_g\right]_\times \\ \dot{\mathbf{b}}_a &= \mathbf{n}_{ba}, \qquad \dot{\mathbf{b}}_g = \mathbf{n}_{bg} \end{aligned} \quad (3)$$

To ease notation, the subscript s of a variable implies that the variable refers to the IMU sensor; the superscript w implies that the variable is expressed in the world frame; and the operator $[\cdot]_\times$ generates the skew-symmetric matrix of a 3-vector. Thus, in the equation group (3), $\mathbf{x}_s^w$ and $\mathbf{v}^w$ are the IMU sensor position and velocity in the world frame, $\mathbf{g}^w$ is the gravity at location $\mathbf{x}_s^w$, $\mathbf{R}_s^w$ is the rotation from the sensor frame to the world frame, $\tilde{\mathbf{a}}^s$ and $\tilde{\boldsymbol{\omega}}_s^{ws}$ are the accelerometer and gyro measurements, and $\mathbf{n}_a$, $\mathbf{n}_g$, $\mathbf{n}_{ba}$, and $\mathbf{n}_{bg}$ are the white Gaussian noise processes driving $\tilde{\mathbf{a}}^s$, $\tilde{\boldsymbol{\omega}}_s^{ws}$, $\mathbf{b}_a$, and $\mathbf{b}_g$, respectively.
We define the attitude error $\boldsymbol{\psi}^w$ relating the estimated and true rotations such that $\hat{\mathbf{R}}_s^w = \left(\mathbf{I} - [\boldsymbol{\psi}^w]_\times\right)\mathbf{R}_s^w$. Note that $\boldsymbol{\psi}^w$ is defined in the world frame, which is fixed to the earth. Representing the orientation error in a world frame was reported to lead to better consistency than representing it in the IMU sensor frame [Li and Mourikis, 2013].
We define the errors in position and velocity as the difference between the estimated and true values, $\delta\mathbf{x}_s^w := \hat{\mathbf{x}}_s^w - \mathbf{x}_s^w$ and $\delta\mathbf{v}^w := \hat{\mathbf{v}}^w - \mathbf{v}^w$, and thus obtain the error state vector:

$$\delta\boldsymbol{\mathcal{X}} := \left[\delta\mathbf{x}_s^{w\,T}, \delta\mathbf{v}^{w\,T}, \boldsymbol{\psi}^{w\,T}, \mathbf{b}_a^T, \mathbf{b}_g^T\right]^T \quad (4)$$

Putting all Gaussian white noise terms into a noise vector,

$$\mathbf{w} := \left[\mathbf{n}_a^T, \mathbf{n}_g^T, \mathbf{n}_{ba}^T, \mathbf{n}_{bg}^T\right]^T \quad (5)$$
the continuous transition equations can be written as:

$$\delta\dot{\boldsymbol{\mathcal{X}}} = \mathbf{F}\,\delta\boldsymbol{\mathcal{X}} + \mathbf{G}\,\mathbf{w} \quad (6)$$

where:

$$\mathbf{F} = \begin{bmatrix} \mathbf{0}_{3\times3} & \mathbf{I}_3 & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \left[\mathbf{R}_s^w(\tilde{\mathbf{a}}^s - \mathbf{b}_a)\right]_\times & -\mathbf{R}_s^w & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{R}_s^w \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \end{bmatrix}, \quad \mathbf{G} = \begin{bmatrix} \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \\ \mathbf{R}_s^w & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & -\mathbf{R}_s^w & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{I}_3 & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{I}_3 \end{bmatrix}$$

And the covariance function of $\mathbf{w}$ is:

$$\mathrm{E}\left[\mathbf{w}(t_1)\mathbf{w}(t_2)^T\right] = \begin{bmatrix} \sigma_a^2\mathbf{I}_3 & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \sigma_g^2\mathbf{I}_3 & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \sigma_{ba}^2\mathbf{I}_3 & \mathbf{0}_{3\times3} \\ \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \mathbf{0}_{3\times3} & \sigma_{bg}^2\mathbf{I}_3 \end{bmatrix}\,\delta(t_2 - t_1) \quad (7)$$
where σ is the noise density (i.e., the square root of the power spectral density) of the respective white Gaussian noise, and $\delta(\cdot)$ is the Dirac delta function. The covariance of the error state vector, $\mathbf{P}(\delta\boldsymbol{\mathcal{X}})$, can be propagated as done in [Shin, 2005], p. 70.
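For concreteness, a first-order discretization of Eqs. (6)-(7) can propagate this covariance between IMU epochs with Φ ≈ I + FΔt and Q_d ≈ G Q_c G^T Δt. The following sketch reflects our reconstruction of F and G above, not necessarily the exact implementation of [Shin, 2005].

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]x of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def propagate_covariance(P, R_ws, a_meas, b_a, sig, dt):
    """One propagation step for the covariance P of the 15-dimensional
    error state [dx, dv, psi, b_a, b_g] of Eqs. (4)-(7).

    R_ws: rotation sensor->world; a_meas: accelerometer reading; b_a: its
    bias estimate; sig: noise densities with keys 'a', 'g', 'ba', 'bg'.
    """
    F = np.zeros((15, 15))
    F[0:3, 3:6] = np.eye(3)                    # position error driven by dv
    F[3:6, 6:9] = skew(R_ws @ (a_meas - b_a))  # attitude error couples into dv
    F[3:6, 9:12] = -R_ws                       # accelerometer bias into dv
    F[6:9, 12:15] = R_ws                       # gyro bias into attitude error
    G = np.zeros((15, 12))
    G[3:6, 0:3] = R_ws                         # accelerometer white noise n_a
    G[6:9, 3:6] = -R_ws                        # gyro white noise n_g
    G[9:12, 6:9] = np.eye(3)                   # bias random walk noise n_ba
    G[12:15, 9:12] = np.eye(3)                 # bias random walk noise n_bg
    Qc = np.diag(np.r_[[sig['a'] ** 2] * 3, [sig['g'] ** 2] * 3,
                       [sig['ba'] ** 2] * 3, [sig['bg'] ** 2] * 3])
    Phi = np.eye(15) + F * dt                  # first-order transition matrix
    Qd = G @ Qc @ G.T * dt                     # discretized process noise
    return Phi @ P @ Phi.T + Qd
```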
Formulation of stereo-inertial optimization

Based on the double window structure, a nonlinear optimization is performed for every incoming frame in the tracking thread. For clarity of formulation, the following notation is used. The states, i.e., the variables to be estimated, include static landmarks (aka environment point features) $\mathbf{x}_L$ and rover states $\mathbf{x}_R$ associated with each image frame. For a particular frame of index k, $\mathbf{x}_R^k$ includes an SE(3) transformation from the world frame to the left camera frame $\mathbf{T}_w^{c_0}$, the velocity of the IMU sensor in the world frame $\mathbf{v}_s^w$, the accelerometer bias $\mathbf{b}_a$, and the gyro bias $\mathbf{b}_g$. Thus, it is represented as:

$$\mathbf{x}_R := \left\{\mathbf{T}_w^{c_0}, \mathbf{v}_s^w, \mathbf{b}_a, \mathbf{b}_g\right\} \quad (8)$$

To simplify the implementation, it is split into two parts: pose states $\mathbf{x}_T := \left\{\mathbf{T}_w^{c_0}\right\}$ and speed/bias states $\mathbf{x}_{sb} := \left[\mathbf{v}_s^{w\,T}, \mathbf{b}_a^T, \mathbf{b}_g^T\right]^T$.
Throughout our implementation, the general graph optimization package g2o [Kummerle et al., 2011] is used to optimize the objective function. The optimization process requires defining the perturbation Δx of the states. In accord with the notation of g2o, the perturbation around the estimate of a variable is defined by $\mathbf{x} = \hat{\mathbf{x}} \boxplus \Delta\mathbf{x}$; that is, applying Δx to the estimate of a variable brings about its true value. In particular, for landmark and speed/bias states, the perturbation is $\Delta\mathbf{x} := \mathbf{x} - \hat{\mathbf{x}}$. For the camera pose, the perturbation $\boldsymbol{\epsilon}$ is defined as:

$$\boldsymbol{\epsilon} := \log\left(\mathbf{T}_w^c\left(\hat{\mathbf{T}}_w^c\right)^{-1}\right)^\vee \quad (9)$$

where $\log(\cdot)^\vee$ maps an SE(3) element to its minimal tangent-space representation in $\mathbb{R}^6$.
The Jacobians of the error terms (aka cost functions due to observations) relative to the perturbation are used frequently in the optimization and are computed in the fashion of g2o. That is,

$$\mathbf{J} = \left.\frac{\partial\,\mathbf{e}\left(\hat{\mathbf{x}} \boxplus \Delta\mathbf{x}\right)}{\partial\,\Delta\mathbf{x}}\right|_{\Delta\mathbf{x}=\mathbf{0}} \quad (10)$$
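The perturbation-based definition of Eq. (10) can be illustrated, and analytic Jacobians sanity-checked, with a central-difference approximation. Below is a sketch assuming a generic `boxplus` operator (vector addition for landmark and speed/bias states, the SE(3) update of Eq. (9) for poses); it is a development aid, not part of the estimator itself.

```python
import numpy as np

def numeric_jacobian(err, x_hat, boxplus, dim_dx, eps=1e-6):
    """Central-difference approximation of Eq. (10):
    J = d e(x_hat [+] dx) / d dx, evaluated at dx = 0.

    err: error function of the state; boxplus: perturbation operator of
    the state's parameterization; dim_dx: minimal perturbation dimension.
    """
    e0 = np.atleast_1d(err(x_hat))
    J = np.zeros((e0.size, dim_dx))
    for i in range(dim_dx):
        dx = np.zeros(dim_dx)
        dx[i] = eps
        e_plus = err(boxplus(x_hat, dx))    # perturb forward along axis i
        e_minus = err(boxplus(x_hat, -dx))  # perturb backward along axis i
        J[:, i] = (e_plus - e_minus) / (2 * eps)
    return J
```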
In computing Jacobians, analytic differentiation is preferred. However, if the analytic form is too involved, automatic differentiation rather than numerical differentiation is performed by Ceres Solver [Agarwal et al., 2010] with local parameterization.

Error terms in the objective function

Two types of errors are involved in formulating the objective function: the re-projection error and the IMU constraint error. These errors are assumed to be random variables characterized by their means and covariances. For a landmark $\mathbf{x}_L^j$ indexed j observed in camera i (in the stereo case, i = 0 denotes the left camera and i = 1 the right camera) of frame k, the pose-point constraint (aka re-projection error term) is:

$$\mathbf{e}_r^{i,j,k} = \mathbf{z}^{i,j,k} - \mathbf{h}_i\left(\mathbf{T}_{c_0}^{c_i}\,\mathbf{T}_w^{c_0(k)}\,\mathbf{x}_L^j\right) \quad (11)$$

where the subscript r denotes the re-projection error, the subscript L denotes the landmark, the subscript $c_0$ denotes the left camera frame, the subscript $c_i$ denotes the camera frame of index i, and the superscript $c_0(k)$ denotes the left camera frame at k. Thus, $\mathbf{T}_{c_0}^{c_i}\,\mathbf{T}_w^{c_0(k)}\,\mathbf{x}_L^j$ transforms a landmark expressed in the world frame into the camera i frame at k. Moreover, $\mathbf{z}^{i,j,k}$ is the actual observation of $\mathbf{x}_L^j$, and $\mathbf{h}_i$ projects a point coordinated in the i-th camera frame into the i-th camera image.
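Read literally, Eq. (11) is a one-line computation. In the sketch below, `h` stands for the projection function of camera i and the 4x4 homogeneous transforms are assumed to be given; the argument names are ours, not the paper's.

```python
import numpy as np

def reprojection_error(z, T_c0_ci, T_w_c0_k, x_L, h):
    """Pose-point error of Eq. (11) for landmark x_L observed as pixel z
    in camera i of frame k.

    T_c0_ci: left camera -> camera i; T_w_c0_k: world -> left camera at
    frame k (both 4x4); x_L: homogeneous landmark [X, Y, Z, 1] in the
    world frame; h: projection of camera i mapping a 3D point in its
    frame to a pixel.
    """
    x_ci = T_c0_ci @ T_w_c0_k @ x_L    # landmark in the camera i frame at k
    return z - h(x_ci[:3] / x_ci[3])   # Eq. (11): observed minus predicted
```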
For two consecutive frames at epochs $t_k$ and $t_{k+1}$ in the temporal window, the frames are also related by the IMU readings $\mathbf{z}_s^k$, as illustrated in Fig. 5. The error term can be defined as:

$$\mathbf{e}_s^k\left(\mathbf{x}_R^k, \mathbf{x}_R^{k+1}, \mathbf{z}_s^k\right) = \begin{bmatrix} \log\left(\hat{\mathbf{T}}_{c_0(k+1)}^{w}\,\mathbf{T}_w^{c_0(k+1)}\right)^\vee \\ \hat{\mathbf{x}}_{sb}^{k+1} - \mathbf{x}_{sb}^{k+1} \end{bmatrix} \in \mathbb{R}^{15} \quad (12)$$

where the subscript s denotes that the constraint is due to the IMU sensor readings. $\hat{\mathbf{T}}_{c_0(k+1)}^{w}$ is the predicted left camera pose at frame k+1 (hence the subscript $c_0(k+1)$), given the states $\mathbf{x}_R^k$ at frame k, the transformation from the left camera frame to the IMU sensor frame $\mathbf{T}_{c_0}^s$, and the IMU readings $\mathbf{z}_s^k$. $\hat{\mathbf{x}}_{sb}^{k+1}$ is the predicted speed and bias at frame k+1.

Figure 5. The IMU and camera data streams: IMU readings and camera images arrive at different frequencies and times. We index the two inertial measurements closest in time to camera frame k by $p_k - 1$ and $p_k$, such that $t_{p_k-1} < t_k \le t_{p_k}$. To predict the camera pose at frame k+1 from frame k, the inertial measurements in the range $\left[t_{p_k-1}, t_{p_{k+1}-1}\right]$ should be used; this set is denoted $\mathbf{z}_s^k$.
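The indexing convention of Fig. 5 can be written down directly. This sketch, assuming sorted IMU timestamps, returns the index range of the reading set z_s^k bridging two camera epochs.

```python
import bisect

def imu_readings_for_frame(t_imu, t_k, t_k1):
    """Indices [p_k - 1, ..., p_{k+1} - 1] of the IMU readings z_s^k
    bridging camera epochs t_k and t_{k+1} (Fig. 5).

    t_imu: sorted list of IMU timestamps; p_k is the first IMU index
    satisfying t_{p_k - 1} < t_k <= t_{p_k}.
    """
    p_k = bisect.bisect_left(t_imu, t_k)   # first index with t_imu[i] >= t_k
    p_k1 = bisect.bisect_left(t_imu, t_k1)
    return range(p_k - 1, p_k1)            # p_k - 1 through p_{k+1} - 1 inclusive
```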
In summary, the objective function over the double window is:

$$\chi^2 = \sum_{i=0}^{1}\sum_{k=1}^{S+T}\sum_{j\in\mathcal{J}(i,k)} \mathbf{e}_r^{i,j,k\,T}\,\mathbf{W}_r^{i,j,k}\,\mathbf{e}_r^{i,j,k} + \sum_{k=0}^{T-1} \mathbf{e}_s^{k\,T}\,\mathbf{W}_s^{k}\,\mathbf{e}_s^{k} \quad (13)$$

where S and T are the spatial and temporal window sizes, respectively. $\mathbf{W}_r^{i,j,k}$ is the information matrix for the point observations and can be obtained by assuming 1-pixel isotropic observation noise in an image, and $\mathbf{W}_s^k$ is the weight matrix for the IMU error term and can be approximated as follows:

$$\mathbf{R}_s^k = \left(\mathbf{W}_s^k\right)^{-1} = \frac{\partial\mathbf{e}_s^k}{\partial\,\delta\boldsymbol{\mathcal{X}}^{k+1}}\;\mathbf{P}\!\left(\delta\hat{\boldsymbol{\mathcal{X}}}^{k+1}\!\left(\mathbf{x}_R^k, \mathbf{z}_s^k\right)\right)\left(\frac{\partial\mathbf{e}_s^k}{\partial\,\delta\boldsymbol{\mathcal{X}}^{k+1}}\right)^T \quad (14)$$

where $\mathbf{P}\!\left(\delta\hat{\boldsymbol{\mathcal{X}}}^{k+1}\!\left(\mathbf{x}_R^k, \mathbf{z}_s^k\right)\right)$ is the propagated covariance of the predicted IMU error state vector $\delta\hat{\boldsymbol{\mathcal{X}}}^{k+1}$ at epoch $t_{k+1}$, and $\frac{\partial\mathbf{e}_s^k}{\partial\,\delta\boldsymbol{\mathcal{X}}^{k+1}}$ is computed by automatic differentiation with local parameterization.
First estimates to compute Jacobians

It has been shown that using the latest available state estimates to compute Jacobians causes inconsistency in EKF-SLAM [Li and Mourikis, 2013] and in fixed-lag smoothers [Dong-Si and Mourikis, 2011]. As suggested therein, we use the first available estimates of states to compute their related Jacobians. The first available estimates of states are determined based on a marginalization process. The basic idea is that once a variable goes out of the optimization window, it and the variables related to it by some error terms fix their estimates for computing future Jacobians. That is, if there are observations (in other words, error terms) that relate variables excluded from the optimization window to variables included in it, all of these variables are marked as having fixed linearization points. In practice, to preserve the sparse property of the problem, some observations can be discarded without loss of accuracy, so that the Jacobian computation is accomplished using the latest estimates of the related variables. Specifically, the marginalization process is done in the tracking module as follows (a bookkeeping sketch is given below). Recall that as an image frame arrives, a new double window is constructed. If a frame (possibly a keyframe) was the last frame of the old temporal window, i.e., it has been excluded from the new temporal window, its pose state and the pose and speed/bias states of its next frame in the temporal window are marked as having fixed linearization estimates. For instance, at the beginning of the VINS, an initial map is constructed from the quad matches between the first two frames while ensuring a uniform distribution. The second frame is taken as the first keyframe, and both frames are inserted into the temporal window. As frames stream in, the first marginalization occurs just after the temporal window is full while the spatial window is still empty. Prior to the next optimization, the first frame slips out of the temporal window, and the pose and speed/bias of the first keyframe are marked as having fixed linearization estimates. As a keyframe goes out of the spatial window, i.e., it was in the old spatial window but not in the new spatial window, the points observed in this keyframe and not observed by the frames in the double window are marked as having fixed linearization points.
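This bookkeeping amounts to marking states, as sketched below. The `State` class with a `fixed_linearization` flag is an assumption of this illustration, not the paper's data structure; once marked, Jacobians of related error terms are evaluated at the frozen first estimate rather than the latest one.

```python
class State:
    """A variable in the optimization with an optional frozen
    linearization point (first estimate Jacobian bookkeeping).
    value is assumed to be a numpy array."""
    def __init__(self, value):
        self.value = value                  # latest estimate, refined each run
        self.fixed_linearization = False
        self.lin_point = None               # estimate frozen for Jacobians

    def mark_fixed(self):
        if not self.fixed_linearization:
            self.fixed_linearization = True
            self.lin_point = self.value.copy()  # freeze the current estimate

    def linearization_value(self):
        """Value at which Jacobians of related error terms are computed."""
        return self.lin_point if self.fixed_linearization else self.value

def on_frame_leaves_temporal_window(last_frame, next_frame):
    """When a frame drops out of the temporal window, fix the linearization
    of its pose and of the pose and speed/bias of its successor."""
    last_frame.pose.mark_fixed()
    next_frame.pose.mark_fixed()
    next_frame.speed_bias.mark_fixed()
```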
In the optimization, if a point has been observed in a keyframe that is not included in the double window, that keyframe and the point's observation are also included in the optimization, but the keyframe remains fixed. At the beginning of the optimization, a few iterations of structure-only bundle adjustment (SBA) are performed; then all the variables in the double window are included and refined in the optimization. Unlike Leutenegger et al. [2014], we do not marginalize states exactly. Rather, all states that participate in the Gauss-Newton system of equations are selected according to the double window. Thus, the sparsity of the Hessian matrix is guaranteed even when the camera performs a loopy motion. Although exact marginalization is not performed, the mapping and loop closing threads further improve the localization accuracy.

Mapping and loop closing in the backend

Inspired by ORB-SLAM, the local mapping and loop closing threads are incorporated to maintain the global consistency of the map. Both threads take action when a new keyframe is inserted. They basically run the procedures detailed by [Mur-Artal et al., 2015], which are briefly discussed below. Given a new keyframe, the local mapping thread creates new map points between the new keyframe and its connected keyframes, updates the connections between keyframes due to co-visible map points, removes outlier map points, and removes a redundant keyframe if a majority of its observed points have been observed by other keyframes. In contrast to ORB-SLAM, the mapping thread does not perform any map optimization. For a new keyframe, the loop closing thread attempts to find candidate keyframes that may be revisited in the map by the technique of bags of words. If a candidate keyframe has an SE(3) transformation with respect to the new keyframe supported by enough matched features, we accept the candidate as a loop keyframe. Then, the map points observed in the loop keyframe and in the new keyframe are matched and fused. Finally, a pose graph optimization in SE(3) space is performed on the essential graph of keyframes, similar to [Mur-Artal et al., 2015]. When a pose graph optimization starts, the double window optimization in the frontend is suspended. Once the pose optimization finishes, not only is the map updated, but the states of the frames in the temporal window are also updated.

RESULTS

Validations of our system were performed on the simulated new Tsukuba CG stereo dataset [Martull et al., 2012] and on sequences of the KITTI odometry dataset [Geiger et al., 2012].
The Tsukuba dataset did not provide independent inertial measurements; therefore, the actual IMU readings had to be simulated based on the ground truth poses. Lovegrove et al. [2013] observed that SE(3) poses can be fitted closely by $C^2$-continuous B-splines. As a result, true inertial measurements can be sampled from the differentiated B-splines. Adding typical noise to these true values leads to the actual IMU readings. In particular, given a sequence of uniformly time-stamped poses representing transformations from the sensor frame to the world frame, $\mathbf{T}_{s_i}^w$, $i = 0, \ldots, N-1$, the continuous pose $\mathbf{T}_s^w(u)$ in the interval $[t_i, t_{i+1}]$ can be expressed as follows:

$$\mathbf{T}_s^w(u) = \mathbf{T}_{s_{i-1}}^w \prod_{j=1}^{3}\exp\left(\mathbf{B}(u)_j\,\boldsymbol{\Omega}_{i+j-1}\right) \quad (15)$$

where u is an intermediate variable related to the time t in $[t_i, t_{i+1}]$, defined as $u(t) := \frac{t - t_i}{t_{i+1} - t_i}$. $\boldsymbol{\Omega}_i$ represents the motion between epochs $i-1$ and $i$ and is defined as $\boldsymbol{\Omega}_i := \log\left(\left(\mathbf{T}_{s_{i-1}}^w\right)^{-1}\mathbf{T}_{s_i}^w\right)$. $\mathbf{B}(u)$ is the basis vector depending only on u, and $\mathbf{B}(u)_j$ is the 0-based j-th element of $\mathbf{B}(u)$:

$$\mathbf{B}(u) = \mathbf{C}\begin{bmatrix}1\\u\\u^2\\u^3\end{bmatrix} = \frac{1}{6}\begin{bmatrix}6 & 0 & 0 & 0\\5 & 3 & -3 & 1\\1 & 3 & 3 & -2\\0 & 0 & 0 & 1\end{bmatrix}\begin{bmatrix}1\\u\\u^2\\u^3\end{bmatrix} \quad (16)$$
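A sketch of Eqs. (15)-(16) follows, assuming generic SE(3) `se3_exp`/`se3_log` maps (e.g., from a Lie group library) and control poses stored as 4x4 matrices; these helpers are assumptions of the sketch.

```python
import numpy as np

# Cumulative cubic B-spline basis matrix C of Eq. (16).
C = np.array([[6.0, 0.0, 0.0, 0.0],
              [5.0, 3.0, -3.0, 1.0],
              [1.0, 3.0, 3.0, -2.0],
              [0.0, 0.0, 0.0, 1.0]]) / 6.0

def spline_pose(poses, i, u, se3_exp, se3_log):
    """Interpolated pose T^w_s(u) on [t_i, t_{i+1}] per Eq. (15).

    poses: list of 4x4 control poses T^w_{s_i}; u in [0, 1];
    se3_exp/se3_log: SE(3) exponential and logarithm maps (assumed).
    """
    B = C @ np.array([1.0, u, u ** 2, u ** 3])  # basis vector B(u), Eq. (16)
    T = poses[i - 1].copy()                     # leading control pose T^w_{s_{i-1}}
    for j in range(1, 4):                       # cumulative products of Eq. (15)
        # Omega_{i+j-1} = log((T^w_{s_{i+j-2}})^-1 T^w_{s_{i+j-1}})
        Omega = se3_log(np.linalg.inv(poses[i + j - 2]) @ poses[i + j - 1])
        T = T @ se3_exp(B[j] * Omega)
    return T
```

Differentiating this interpolant twice with respect to time yields the true angular rates and accelerations from which the IMU samples are drawn.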
The noise added to the true inertial measurements was generated according to the specifications of a Microstrain 3DM-GX3-35 MEMS IMU, shown in Table 1.

Table 1. IMU characteristics.
  Accelerometer noise density $\sigma_a$: 8.0e-4 $m/(s^2\sqrt{Hz})$
  Accelerometer bias noise density $\sigma_{ba}$: 1.3e-5 $m/(s^3\sqrt{Hz})$
  Gyro noise density $\sigma_g$: 5.2e-4 $rad/(s\sqrt{Hz})$
  Gyro bias noise density $\sigma_{bg}$: 2.9e-6 $rad/(s^2\sqrt{Hz})$
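The densities of Table 1 translate into discrete noise samples once a sampling rate is fixed. In the sketch below, the 100 Hz rate is an assumption; a white-noise sample has standard deviation σ√f (density times the square root of the rate), while a bias random walk accumulates increments with standard deviation σ_b√Δt.

```python
import numpy as np

def simulate_imu_noise(n, rate=100.0, sigma_a=8.0e-4, sigma_ba=1.3e-5,
                       sigma_g=5.2e-4, sigma_bg=2.9e-6, seed=0):
    """Generate additive accelerometer/gyro error sequences for n samples
    from the Table 1 densities (the rate in Hz is an assumption)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / rate
    # White noise: discrete sigma = density * sqrt(rate).
    na = rng.normal(0.0, sigma_a * np.sqrt(rate), (n, 3))
    ng = rng.normal(0.0, sigma_g * np.sqrt(rate), (n, 3))
    # Bias random walks: increments with sigma = density * sqrt(dt).
    ba = np.cumsum(rng.normal(0.0, sigma_ba * np.sqrt(dt), (n, 3)), axis=0)
    bg = np.cumsum(rng.normal(0.0, sigma_bg * np.sqrt(dt), (n, 3)), axis=0)
    return na + ba, ng + bg  # accelerometer and gyro error sequences
```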
We compared our keyframe-based double window framework with an incremental odometry method realized in libviso2 [Geiger et al., 2011] and with a sliding window smoothing approach. The sliding window approach tracks points across frames using the same quad matches as the keyframe-based approach, and these points' positions are iteratively refined with the sliding window smoother. The sliding window is set to contain the 30 most recent frames, and hence 29 segments of IMU readings. The point observations in the frames of the sliding window are chosen such that they are distributed evenly across these frames. As with the proposed keyframe-based approach, the sliding window smoother also employs g2o for optimization.
The error metric of Geiger et al. [2012] was used to evaluate the performance of the algorithms. To begin, a set of frames is sampled from all frames (one frame per second). Choosing one such frame as a starting point, the relative translations and rotations of frames at increasing distances are compared to those of the ground truth. Then, the discrepancies between the estimated and true relative translations and rotations are binned based on the distance traveled, resulting in the error statistics: rotation and translation errors with respect to distance intervals.
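A condensed sketch of this metric follows, assuming trajectories given as lists of 4x4 poses and a precomputed numpy array of cumulative traveled distances; errors are normalized by segment length, as in the KITTI convention, and the sampling stride is an assumption.

```python
import numpy as np

def relative_errors(gt, est, dists, lengths=(100, 200, 300, 400)):
    """Bin relative rotation/translation errors by traveled distance.

    gt, est: time-aligned lists of 4x4 poses; dists: cumulative distance
    per frame along the ground truth (numpy array); lengths: evaluation
    distances in meters. Returns {length: [(rot_err, trans_err), ...]}
    with errors normalized per meter.
    """
    out = {l: [] for l in lengths}
    for i in range(0, len(gt), 10):                   # sampled starting frames
        for l in lengths:
            j = int(np.searchsorted(dists, dists[i] + l))  # frame ~l m ahead
            if j >= len(gt):
                break
            d_gt = np.linalg.inv(gt[i]) @ gt[j]       # true relative motion
            d_est = np.linalg.inv(est[i]) @ est[j]    # estimated relative motion
            e = np.linalg.inv(d_est) @ d_gt           # residual transform
            rot = np.arccos(np.clip((np.trace(e[:3, :3]) - 1) / 2, -1, 1))
            trans = np.linalg.norm(e[:3, 3])
            out[l].append((rot / l, trans / l))
    return out
```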
All tests were performed on a virtual machine with an i7 2.4 GHz CPU and 8 GB RAM. In all tests, the world frame is assumed to be the frame of the left camera at the starting image frame. The current implementation processes frames at a rate of 7 Hz, as shown in Table 2.

Table 2. Average computation time of double window optimization (DWO) with stereo and inertial data on KITTI frames of size 1241x376.
  Tracking: 144.2 ms per frame
  Mapping: 13.6 ms per frame
  Loop closing: 7.8 ms per frame
Figure 7. The trajectories estimated by the sliding window smoother and DWO superimposed onto the ground truth of KITTI sequence 00. With online loop closure, our approach achieved a much more consistent trajectory.
Figure 6. The trajectories estimated by several odometry approaches on the Tsukuba stereo dataset, shown in the X-Z plane of the first left camera frame. Note that libviso2 and DWO with only stereo data jumped at the end because of a large moving portion in the images; DWO and the sliding window smoother that used stereo and inertial data avoided this blunder.
Figure 8. Rotation (a) and translation (b) errors vs. distance by different motion estimation approaches on the Tsukuba dataset.

Given stereo and inertial data, DWO achieved better results than the sliding window smoother, and both
approaches obtained trajectories closer to the ground truth than the incremental odometry, as shown in Figs. 6 and 7. This was also confirmed by the evaluation metric, as shown in Fig. 8, which also indicated the advantage of using inertial data, as the stereo-inertial approaches achieved better results than the approaches using only stereo data. Because of the IMU data, the velocity of the sensor assembly was obtained at a frequency of 100 Hz. Fig. 9 shows an example of the estimated velocity in the world frame superimposed onto the true velocity. In each axis, the estimated velocity agreed well with the truth. To verify the effectiveness of DWO, we compared the results obtained by our approach with only stereo data to those of the leading method SOFT [Cvišić and Petrović, 2015] on the KITTI odometry dataset, as shown in Fig. 10. Although DWO had a slightly larger translation error, it achieved better rotation estimates on the test sequences of the KITTI dataset.
Figure 10. Rotation (a) and translation (b) errors vs. traveled distance by our approach using only stereo data and by the SOFT method on KITTI test sequences 11 to 21. Our approach estimated rotation better but suffered somewhat in estimating translation.

CONCLUSION
Figure 9. The velocity profiles estimated by stereo-inertial DWO superimposed onto the true ones for KITTI sequence 00.
We present a tightly coupled stereo-inertial odometry system that estimates motion using data collected by a rigid assembly of stereo cameras and a consumer-grade IMU. Using a map of connected keyframes, loop closure detection and pose graph optimization are done efficiently. To improve localization accuracy, a double window structure is used to organize the local variables, which are optimized with every frame. It was verified that nonlinear optimization using stereo and inertial data achieves better results than nonlinear optimization with only stereo data in open-loop explorations. On the other hand, nonlinear optimization with stereo data alone achieves competitive performance on the KITTI odometry dataset, which implies the strength of the double window structure. Although circular tracking of rudimentary point features often results in fairly dense feature tracks, our system still relies on the assumption that the camera observes enough static features. Besides too few static features, another failure situation may occur at initialization if the stereo assembly remains static or performs pure rotation while most features are very distant. Currently, our implementation only triangulates points of sufficient parallax; thus, landmarks at infinity are not used in the optimization. These landmarks are useful in constraining rotations. In addition, an alternative parameterization of point positions, such as the anchored inverse depth representation [Strasdat et al., 2011], may improve numerical accuracy and hence tracking. Our stereo-inertial framework is quite general. It can be extended to inter-sensor calibration and to motion estimation combining alternative sensors, such as a single camera, a Kinect, or a GPS receiver.
ACKNOWLEDGMENTS

Jianzhu Huai acknowledges financial support from the Chinese Scholarship Council. We would like to thank Raul Mur-Artal for the source code of ORB-SLAM.

REFERENCES

Agarwal, Sameer, et al. "Ceres Solver" (2010). Available at: http://ceres-solver.org (accessed 1 June 2015).
Cvišić, Igor, and Ivan Petrović. "Stereo odometry based on careful feature selection and tracking." Proceedings of the European Conference on Mobile Robots (ECMR), 2015.
Dong-Si, Tue-Cuong, and Anastasios I. Mourikis. "Motion tracking with fixed-lag smoothing: Algorithm and consistency analysis." Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011.
Engel, Jakob, Thomas Schöps, and Daniel Cremers. "LSD-SLAM: Large-scale direct monocular SLAM." Computer Vision - ECCV 2014. Springer International Publishing, 2014. 834-849.
Geiger, Andreas, Julius Ziegler, and Christoph Stiller. "StereoScan: Dense 3d reconstruction in real-time." Intelligent Vehicles Symposium (IV), 2011 IEEE. IEEE, 2011.
Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? The KITTI vision benchmark suite." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
Henry, Peter, et al. "RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments." The International Journal of Robotics Research 31.5 (2012): 647-663.
Hosseinyalamdary, S., and A. Yilmaz. "Motion Vector Field Estimation Using Brightness Constancy Assumption and Epipolar Geometry Constraint." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences 1 (2014): 9-16.
Huang, Guoquan, Michael Kaess, and John J. Leonard. "Towards consistent visual-inertial navigation." Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014.
Kaess, Michael, et al. "iSAM2: Incremental smoothing and mapping using the Bayes tree." The International Journal of Robotics Research (2011).
Kelly, Jonathan, and Gaurav S. Sukhatme. "Visual-inertial sensor fusion: Localization, mapping and sensor-to-sensor self-calibration." The International Journal of Robotics Research 30.1 (2011): 56-79.
Kerl, Christian, Jürgen Sturm, and Daniel Cremers. "Dense visual SLAM for RGB-D cameras." Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, 2013.
Konolige, Kurt, and Motilal Agrawal. "FrameSLAM: From bundle adjustment to real-time visual mapping." Robotics, IEEE Transactions on 24.5 (2008): 1066-1077.
Kummerle, Rainer, et al. "g2o: A general framework for graph optimization." Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011.
Kurz, Daniel, and Selim Benhimane. "Gravity-aware handheld augmented reality." Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on. IEEE, 2011.
Leutenegger, Stefan, et al. "Keyframe-based visual-inertial odometry using nonlinear optimization." The International Journal of Robotics Research (2014).
Li, Mingyang, and Anastasios I. Mourikis. "High-precision, consistent EKF-based visual-inertial odometry." The International Journal of Robotics Research 32.6 (2013): 690-711.
Lovegrove, Steven, Alonso Patron-Perez, and Gabe Sibley. "Spline Fusion: A continuous-time representation for visual-inertial fusion with application to rolling shutter cameras." Proceedings of the British Machine Vision Conference. 2013.
Martull, Sarah, Martin Peris, and Kazuhiro Fukui. "Realistic CG stereo image dataset with ground truth disparity maps." ICPR workshop TrakMark2012. Vol. 111. No. 430. 2012.
Mur-Artal, Raul, J. M. M. Montiel, and Juan D. Tardos. "ORB-SLAM: a Versatile and Accurate Monocular SLAM System." arXiv preprint arXiv:1502.00956 (2015).
Nerurkar, Esha D., Kejian J. Wu, and Stergios I. Roumeliotis. "C-KLAM: Constrained keyframe-based localization and mapping." Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014.
Rublee, Ethan, et al. "ORB: an efficient alternative to SIFT or SURF." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
Shin, Eun-Hwan. "Estimation techniques for low-cost inertial navigation." UCGE report 20219 (2005).
Strasdat, Hauke, J. M. M. Montiel, and Andrew J. Davison. "Real-time monocular SLAM: Why filter?" Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010.
Strasdat, Hauke, et al. "Double window optimisation for constant time visual SLAM." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
Taylor, Clark N., et al. "Comparison of two image and inertial sensor fusion techniques for navigation in unmapped environments." Aerospace and Electronic Systems, IEEE Transactions on 47.2 (2011): 946-958.