Large-Scale Robotic SLAM through Visual Mapping
Christof Hoppe, Kathrin Pirker, Matthias Rüther and Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology, Austria
{hoppe, kpirker, ruether, bischof}@icg.tu-graz.ac.at
Abstract
Keyframe-based visual SLAM systems perform reliably and fast in medium-sized environments. Currently, their main weaknesses are robustness and scalability in large scenarios. In this work, we propose a hybrid, keyframe-based visual SLAM system which overcomes these problems. We combine visual features of different strength, add appearance-based loop detection and present a novel method to incorporate non-visual sensor information into standard bundle adjustment frameworks to tackle the problem of weakly textured scenes. On a standardized test dataset, we outperform EKF-based solutions in terms of localization accuracy by at least a factor of two. On a self-recorded dataset, we achieve a performance comparable to a laser scanner approach.
Figure 1. Reconstructed trajectory of a loop that is passed four times. The result of our approach is overlaid with the trajectory reconstructed by a laser range finder. The black dots show the map points of our map. The green dots are obstacles identified by the laser range finder.
1. Introduction
In Simultaneous Localization and Mapping (SLAM), one seeks to generate a map from noisy environment information while simultaneously localizing oneself within this map. This is a basic prerequisite for most autonomous robotic applications such as path planning or navigation. Typical SLAM systems need to work robustly within an area of several thousand square meters, while the accuracy of the map and of the reconstructed trajectory has to be within a few centimeters. Trajectory estimation and map generation have to be performed in real-time. Further, especially in aerial robotics, it is mandatory
to recover a full 6 Degrees of Freedom (DoF) pose, and the sensor has to provide three-dimensional information about the environment. Camera-based SLAM is able to retrieve this kind of information. Additionally, the sensor is cheap, lightweight and small. Existing SLAM methods can be classified into Bayesian and geometric approaches according to their underlying noise handling method. Most Bayesian methods are standard Extended Kalman Filter (EKF) systems which differ in their map representation, type of image features or feature initialization. Geometric approaches formulate SLAM as an optimization problem and use bundle adjustment for noise handling. EKF systems are inherently unable to cope with a large number of observations, because their complexity grows quadratically with the number of measurements. Geometric approaches usually solve the SLAM problem within a Structure-from-Motion framework over selected keyframes. Only a few systems [1, 10] have shown that they are able to build a map of several thousand square meters reliably. In this paper, we improve a keyframe-based stereo SLAM approach by addressing the following problems: (a) accurate pose estimation of keyframes, (b) reliable map extension, (c) loop detection and correction, and (d) handling of weakly textured environments. We combine features of different strength to achieve accurate pose estimation of keyframes. To robustly extend the map, we apply a simple criterion that triggers map updates when entering unknown terrain. By adding appearance-based loop detection, we are able to correct highly erroneous trajectory and map estimates. Since our system is designed for large indoor environments where featureless areas may occur, we propose a method that handles such areas by integrating odometry information into the bundle adjustment framework. We evaluate our system on two datasets: a hallway sequence and a standardized dataset of the Rawseeds project which comes with groundtruth information. Compared to other stereo SLAM solutions on this dataset, our approach shows superior accuracy while still being real-time capable. Figure 1 shows the reconstruction of the hallway sequence.
2. Related Work
When visual SLAM (VSLAM) emerged, most methods were based on the Extended Kalman Filter (EKF) [4]. Although the EKF is theoretically a maximum likelihood estimator of a Gaussian random variable, there are several practical problems: computational complexity grows quadratically with the number of observations, the method is not robust against data association errors, and the linearization of the current state is a source of error. Montiel et al. [8] tackled the linearization problem by changing the representation of landmarks from Euclidean coordinates to the unified inverse depth parameterization. This leads to better performance but requires six parameters per landmark instead of three. To decrease computational complexity, Paz et al. [10] used a divide-and-conquer algorithm to split the map into sub-maps. They also used the unified inverse depth parameterization to reconstruct a trajectory of 200 m. The authors expect that their system can be implemented in real-time. With increasing computational power, geometric optimization methods became popular for SLAM systems. One of the first systems was presented by Klein et al. [6]. Their system (PTAM) provides the pose of a monocular camera in an unknown environment to an augmented reality application in real-time. Klein et al. split tracking and map building into parallel threads with different timing requirements. While tracking has to be performed in real-time, the map is extended only from selected frames (keyframes) and can be more time-consuming. Their approach set a benchmark with respect to accuracy and speed, but it does not fulfill the prerequisites of robotic SLAM: it requires user input for map initialization, generates a map with arbitrary scale and introduces scale drift over time. Restricted to operation in small desktop environments, it comes without a loop correction mechanism.
Figure 2. (a) The convex hull and the recognized map points (green) used for pose estimation of this frame. The red crosses outside the convex hull mark the detected keypoints. (b) Weakly textured environment.
The demanding optimization of structure and pose prevents most SLAM algorithms from performing loop closing in real-time. Konolige et al. [1] reduced the number of camera poses and 3D points and showed that even with reduced data the same accuracy can be reached after bundle adjustment. In a follow-up paper [7], they present an optimized implementation of the underlying bundle adjustment that solves very large SLAM problems with several thousand poses in short time. Sibley et al. [12] introduced a new representation of map structure and camera poses: map points and camera poses are described by their relative positions, which prevents the propagation of large reprojection errors through the whole loop, so bundle adjustment converges in a few iterations. Strasdat et al. [14] proposed a loop closing algorithm for monocular SLAM that uses only camera poses as measurements. They optimize the relative similarity transform between camera poses, also taking the scale drift of monocular tracking into account. Their results in terms of accuracy are comparable to standard bundle adjustment. In contrast to the discussed systems, our method is designed for use on a robot and is able to build a map with true metric scale. Furthermore, we integrate non-visual sensor information into the VSLAM process without changing the underlying optimization algorithm, which increases robustness without special treatment of weakly connected keyframes.
3. Methodology
In our SLAM system, we combine features of different strength to estimate the pose of a keyframe precisely. We present two rules for selecting new keyframes. We perform loop detection using an appearance-based method, and we incorporate information from additional sensors into the standard bundle adjustment by introducing synthetic map points.
3.1. Accurate pose estimation of keyframes
As in previous approaches [3], the pose between two keyframes is estimated by visual odometry through active feature search. This reduces the effort of exhaustive feature matching between 3D map points and the image by using a pose prior. It also enables us to use patch-based correlation as a feature descriptor. A pose prior $C_t$ is calculated by moving the old pose $C_{t-1}$ with its current speed $v_{t-1}$:
$$C_t = C_{t-1} + \alpha \cdot v_{t-1} \cdot \delta t, \quad (1)$$
where $\delta t$ is the time difference between two frames and $\alpha$ a damping factor. To find correspondences between the set of map points $M = \{M_1, \ldots, M_n\}$ and the image, we reproject these points into the camera at its predicted pose:
$$m = H(C_t, M), \quad (2)$$
Figure 3. The 7 nearest neighbors (y-axis) of each keyframe (x-axis), classified as recent frames, loop closing candidates, and noise. Starting from keyframe 100, the robot enters the loop for a second time.
where $H(\cdot, \cdot)$ is the projective mapping of the camera at pose $C_t$. This results in a set $m$ that contains the reprojected positions of the map points which lie within the image boundaries. The feature descriptor of a valid map point $M_j$ is matched against image features within a fixed radius around $m_j$. The predicted pose $C_t$ is then refined by minimizing the reprojection error of the correspondences. The advantage of active search is the reduced search space for feature matching. It also avoids gross outlier matches. However, active search fails if the motion of the robot is not well predicted, resulting in a wrong motion estimate. To mitigate this problem, we add a separate feature matching step for every new keyframe $C_k$. We select a set $L$ of map points that are visible in the last $n = 5$ keyframes and project $L$ into $C_k$:
$$l = H(C_k, L), \quad (3)$$
where $l$ is the set of all projected positions of $L$ that lie within the image boundaries of $C_k$. We determine the correspondences between $L$ and the keypoints in $C_k$ using the SURF feature descriptor. The pose is re-calculated using these correspondences. Since the correspondence set may contain outliers, we use the Tukey M-estimator error function for pose optimization. Afterwards, sliding-window bundle adjustment is applied, taking the 10 most recent keyframes and their corresponding map points into account.
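To make the prediction and reprojection steps concrete, the following minimal Python sketch implements the constant-velocity prior of Eq. (1) and the visibility-filtered reprojection behind Eqs. (2) and (3) with plain numpy. It is illustrative only: the reduced pose representation (position only), the camera intrinsics and the synthetic example at the end are assumptions, not part of our implementation.

```python
import numpy as np

def predict_pose(t_prev, v_prev, dt, alpha=0.8):
    """Constant-velocity pose prior (Eq. 1): C_t = C_{t-1} + alpha * v_{t-1} * dt.
    For brevity, only the camera position is predicted here."""
    return t_prev + alpha * v_prev * dt

def reproject_map_points(K, R, t, points_3d, width, height):
    """Project 3D map points into the camera at the predicted pose (Eqs. 2/3)
    and keep only those that fall inside the image boundaries."""
    cam = R @ points_3d.T + t.reshape(3, 1)        # world -> camera frame
    in_front = cam[2] > 0                          # discard points behind the camera
    uv = (K @ cam)[:2] / cam[2]                    # perspective division
    inside = (in_front & (uv[0] >= 0) & (uv[0] < width)
              & (uv[1] >= 0) & (uv[1] < height))
    return uv.T[inside], np.flatnonzero(inside)

# Synthetic example: predict the pose, then reproject three map points.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
t_pred = predict_pose(np.zeros(3), np.array([0.7, 0.0, 0.0]), dt=0.13)
pts = np.array([[0.0, 0.0, 4.0], [1.0, -0.5, 6.0], [0.0, 0.0, -2.0]])
uv, idx = reproject_map_points(K, np.eye(3), -t_pred, pts, 640, 480)
print(uv, idx)   # the point behind the camera is rejected
```

The surviving reprojections then act as search centers for the patch-based (or, for keyframes, SURF-based) matching described above.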
3.2. Map extension trigger
A crucial point for keyframe-based VSLAM methods is deciding when to extend the map. We define two rules to determine whether a frame is used for map extension. The first rule accounts for the distance between the current camera pose $C_t$ and the last keyframe $C_{k-1}$. If the Euclidean distance $d = \|C_t - C_{k-1}\|$ exceeds a given threshold (e.g. 0.40 m), a map extension is triggered. This is reasonable because patch-based correlation is able to handle only a limited range of scale change. The second rule triggers map extension if the camera rotates. Given the set $m$ of matched keypoints used for visual odometry, we analyze their distribution in image space (see Figure 2(a)). If the area of the convex hull of $m$ is below a certain threshold (e.g. 40% of the image area), map extension is also triggered. This rule prevents track loss in tight turns of the robot.
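The two trigger rules can be written down in a few lines. The sketch below is for illustration only; it assumes matched keypoints given as pixel coordinates, uses SciPy's ConvexHull for the image-coverage test, and reuses the example threshold values from above.

```python
import numpy as np
from scipy.spatial import ConvexHull

def should_add_keyframe(C_t, C_last_kf, matched_px, image_area,
                        dist_thresh=0.40, hull_frac_thresh=0.40):
    """Rule 1: trigger when the camera has moved more than dist_thresh metres
    since the last keyframe.  Rule 2: trigger when the convex hull of the
    matched keypoints covers less than hull_frac_thresh of the image."""
    if np.linalg.norm(C_t - C_last_kf) > dist_thresh:
        return True
    if len(matched_px) < 3:
        return True                              # degenerate hull: almost no matches left
    hull_area = ConvexHull(matched_px).volume    # 'volume' is the area of a 2-D hull
    return hull_area / image_area < hull_frac_thresh

# Example: small displacement, but all matches clustered in one image corner.
matches = np.random.rand(50, 2) * 100            # keypoints confined to a 100x100 px patch
print(should_add_keyframe(np.zeros(3), np.array([0.1, 0.0, 0.0]),
                          matches, image_area=640 * 480))   # -> True (rule 2 fires)
```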
3.3. Loop closing
Because of the incremental nature of SLAM, small errors accumulate over time into a large overall error. Loop detection and correction decreases this overall error. To detect a loop closure, we compare the appearance of a new keyframe $C_k$ to all previous keyframes $C_1, \ldots, C_{k-1}$.
Figure 4. Sensor fusion. If only few map points are available, the camera is tracked by a relative pose estimation sensor, e.g. odometry or an IMU. We generate synthetic map points and reproject them into the new keyframe as well as into older ones.
To obtain a set $N$ of the most similar keyframes, we make use of the vocabulary tree approach proposed by Nister et al. [9]. The set of nearest neighbors $N$ can be classified into three groups: recently added keyframes, old keyframes and false matches. Since recently added keyframes are not relevant for loop closure, they are removed from $N$. To identify outlier keyframes in $N$, we perform SURF feature matching between $C_k$ and the keyframes remaining in $N$, followed by geometric validation. To guarantee robust loop detection, we require that $N$ contains at least three keyframes which are located close to each other. The keyframe structure is illustrated in Figure 3. For loop correction, we associate $C_k$ with map points that are also visible in $N$ using SURF features. The overall reprojection error is minimized by performing least-squares bundle adjustment on the whole map.
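The candidate filtering step can be illustrated as follows. This is a simplified sketch: it only checks the index gap to recent keyframes and the spatial consistency of the remaining candidates; the SURF matching and geometric validation stages are omitted, and the numeric thresholds are assumptions for illustration.

```python
import numpy as np

def filter_loop_candidates(current_idx, candidate_idxs, kf_positions,
                           min_idx_gap=20, min_cluster=3, cluster_radius=2.5):
    """Keep only vocabulary-tree candidates that (a) are not recently added
    keyframes and (b) form a spatially consistent cluster of at least
    min_cluster keyframes -- a guard against false matches."""
    old = [i for i in candidate_idxs if current_idx - i >= min_idx_gap]
    for i in old:
        cluster = [j for j in old
                   if np.linalg.norm(kf_positions[i] - kf_positions[j]) < cluster_radius]
        if len(cluster) >= min_cluster:
            return cluster          # consistent cluster of old keyframes -> loop hypothesis
    return []                        # no consistent cluster: reject the candidates

# Example: keyframes along a line; candidates 4, 5, 6 are old and close together,
# 108 is a recently added keyframe and 60 is an isolated false match.
positions = {i: np.array([float(i), 0.0, 0.0]) for i in range(120)}
print(filter_loop_candidates(110, [4, 5, 6, 60, 108], positions))   # -> [4, 5, 6]
```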
3.4. Sensor fusion
A serious problem for visual SLAM is weakly textured areas (see Figure 2(b)). If the robot is equipped with additional sensors such as wheel odometry, a laser range finder, GPS or an IMU, these can be used to keep track of the robot. However, a fallback to dead-reckoning breaks the visual connectivity between keyframes. We follow a different strategy by incorporating relative movement information into the visual domain through synthetic map points. A set of synthetic map points is used to connect a keyframe $C_k$ to the last $n$ keyframes $K = \{C_{k-1}, \ldots, C_{k-n}\}$. A synthetic map point $S_j$ consists of a 3D position $M_j$ and a set of image measurements $m = \{m_1, \ldots, m_l\}$, where $l \leq n$. A synthetic map point is not linked to any kind of real visual feature. Since it carries the information required by bundle adjustment, it can be used in the optimization like any other map point. Therefore, existing and highly optimized structure-from-motion techniques can be used without special treatment of breaks in visual connectivity. If the number of 3D-2D correspondences found in the tracking step drops below a given threshold (e.g. 30 correspondences), we use relative movement information to estimate the camera pose $C_t$. If the Euclidean distance or the rotational difference between $C_t$ and the last keyframe $C_k$ exceeds a threshold, a new keyframe $C_{k+1}$ is added. We reconnect $C_{k+1}$ by generating a set of synthetic map points $S = \{S_1, \ldots, S_l\}$. Each $S_j$ is assigned a random 3D position $M_j$ such that its reprojected position is visible in $C_{k+1}$ and in at least one keyframe of $K$. To obtain the measurement set $m$ of $S_j$, we reproject $S_j$ into $C_{k+1}$ and into the keyframes in $K$, and store the reprojected positions in $m$. The number of synthetic map points generated to reconnect $C_{k+1}$ is in the range of the average number of visible map points per keyframe in $K$. Figure 4 illustrates the synthetic map point generation.
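A minimal sketch of the synthetic map point generation is given below. It assumes poses are expressed as world-to-camera rotation/translation pairs, back-projects random pixels of the newest keyframe to random depths, and records the reprojections in the connected keyframes. The depth range, the centred principal point and the omission of the image-bounds check for older keyframes are simplifying assumptions for illustration.

```python
import numpy as np

def make_synthetic_points(K, poses, n_points=80, depth_range=(2.0, 8.0), seed=0):
    """Generate random 3D points in front of the newest keyframe (poses[-1]) and
    record their reprojections in all given keyframe poses.  Each returned
    (position, measurements) pair can be handed to bundle adjustment exactly
    like an ordinary visual map point."""
    rng = np.random.default_rng(seed)
    K_inv = np.linalg.inv(K)
    R_new, t_new = poses[-1]                       # newest keyframe, world-to-camera
    width, height = 2 * K[0, 2], 2 * K[1, 2]       # assumes a centred principal point
    synthetic = []
    for _ in range(n_points):
        # Sample a pixel and a depth in the newest keyframe, back-project to 3D.
        u, v = rng.uniform([0, 0], [width, height])
        depth = rng.uniform(*depth_range)
        X_cam = depth * (K_inv @ np.array([u, v, 1.0]))
        X_world = R_new.T @ (X_cam - t_new)        # camera frame -> world frame
        # Reproject into every connected keyframe (image-bounds check omitted).
        measurements = {}
        for idx, (R, t) in enumerate(poses):
            x = K @ (R @ X_world + t)
            if x[2] > 0:                           # keep only points in front of the camera
                measurements[idx] = x[:2] / x[2]
        if len(measurements) >= 2:                 # seen by the new and at least one old keyframe
            synthetic.append((X_world, measurements))
    return synthetic

# Example: two keyframes 0.4 m apart; roughly 80 synthetic points connect them.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
poses = [(np.eye(3), np.array([0.4, 0.0, 0.0])), (np.eye(3), np.zeros(3))]
print(len(make_synthetic_points(K, poses)))
```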
4. Experiments
To evaluate the performance of our approach, we applied it to two datasets: A hallway scene and a sequence of the Rawseeds [2] project. Whereas the hallway scene is well textured and spacious, the Rawseeds sequence consists of long, narrow corridors with little texture. To demonstrate the problems
of a state-of-the-art approach, we processed both sequences with a modified monocular, keyframe-based VSLAM system [6] (stereo PTAM). We extended this system with a calibrated stereo camera setup to achieve a map with true metric scale.
4.1. Datasets and accuracy measure
The hallway dataset is a trajectory of a 69 m loop that is passed four times in a row. We recorded 3000 stereo images of 640x480 pixels at a framerate of 7.5 Hz. The camera was mounted on a wheeled robot with a driving speed of 0.7 m/s. The scene is static, with artificial and natural lighting. This dataset was recorded in a well-textured environment (Figure 5(c)-(e)) and can be processed without the sensor fusion described in Section 3.4. The Rawseeds dataset is a 180 m subsequence of the Bicocca 2009-02-25b dataset provided by the Rawseeds project. The sequence comprises 5500 images with a resolution of 640x480 at a framerate of 15 Hz. The camera was also mounted on a wheeled robot and the average driving speed was 0.5 m/s. The environment is illuminated by artificial lighting and the robot is the only moving object. As shown in Figure 5(a), 5(b) and Figure 2(b), the trajectory of the Rawseeds dataset starts in a large hallway and leads into narrow corridors. In contrast to the well-textured hallway, the corridors are weakly textured and some parts do not contain visual features (Figure 2(b)). To quantify accuracy, we use the absolute trajectory error (ATE). Given two sets of pose estimates at synchronized time instances, $A = \{A_1, \ldots, A_n\}$ and $B = \{B_1, \ldots, B_n\}$, the ATE is defined as
$$\mathrm{ATE} = \sum_{k=1}^{n} d(A_k, B_k),$$
where $d(\cdot, \cdot)$ denotes the Euclidean distance. Since the two trajectories are defined in different coordinate systems, they are first aligned by minimizing the ATE in a least-squares sense. The specified error is the remaining error after sequence alignment. For the Rawseeds dataset, we compare our results with the provided groundtruth. The hallway dataset is compared to a trajectory reconstructed from a laser range finder using the GMapping algorithm proposed by Grisetti et al. [5].
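One possible realization of the alignment and error computation is sketched below: a standard least-squares rigid alignment (Kabsch) followed by the mean trajectory error. This is a generic illustration and not necessarily the exact alignment routine used for the numbers reported later.

```python
import numpy as np

def align_and_ate(A, B):
    """Rigidly align trajectory B (n x 3) to A in a least-squares sense
    (Kabsch), then return the mean ATE = (1/n) * sum_k ||A_k - B'_k||."""
    muA, muB = A.mean(axis=0), B.mean(axis=0)
    U, _, Vt = np.linalg.svd((A - muA).T @ (B - muB))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    B_aligned = (R @ (B - muB).T).T + muA
    return np.mean(np.linalg.norm(A - B_aligned, axis=1))

# Example: a trajectory compared against a rotated and shifted copy of itself.
A = np.cumsum(np.random.rand(100, 3), axis=0)
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
B = (Rz @ A.T).T + np.array([5.0, -2.0, 0.0])
print(align_and_ate(A, B))    # ~0 after alignment
```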
4.2. Accuracy analysis
On the hallway dataset, stereo PTAM fails after roughly 15 m because visual odometry is not able to handle the abrupt motion changes of the robot. A door sill causes a displacement of up to 40 px between two consecutive frames. Since the active search region is limited to 8 px, no valid 3D-2D image correspondences can be identified. Patch-based correlation for data association is sensitive to repetitive texture and prevents increasing the search region. Our system processes the hallway dataset completely without the need for sensor fusion.
Figure 5. Example images from the datasets. (a) and (b) are taken from the Rawseeds dataset; the other three images, (c)-(e), are from the hallway dataset.
Compared to the trajectory reconstructed by GMapping, the mean ATE is 0.22 m. The map contains 31,678 map points, i.e. 116 map points per meter. The result is shown in Figure 1.
Stereo PTAM reconstructs 83.36 m of the Rawseeds sequence with a mean ATE of 0.88 m. The map contains 31,450 map points, i.e. 377 points per meter. The system fails after 83 m because it faces a pure white wall as shown in Figure 2(b). The experiment shows that stereo PTAM is able to reconstruct a long trajectory with acceptable error, but it fails in the presence of abrupt motion changes and little feature information.
For processing the Rawseeds dataset with our approach, we use odometry information in case of too few image features. We fall back to odometry if fewer than 30 3D-2D correspondences are found by active search, which happened in three parts of the trajectory. To reconnect keyframes generated during odometry tracking, we generate 80 synthetic map points per keyframe on average. Before performing loop correction, the mean ATE was 0.60 m, which is reduced to 0.55 m after loop optimization. The reconstructed map contains 31,600 map points (166 map points per meter). Although the number of map points per meter is decreased by 56% compared to stereo PTAM, the accuracy is increased.
For the Rawseeds dataset, two visual SLAM solutions are available online. One solution is based on an EKF approach using lines as visual features (H-Trinocular SLAM [13]). It achieves a mean ATE of 2.55 m with a standard deviation of 1.14 m. The second result is generated by an EKF approach using SURF features (CI-Stereo Graph [11]). Its trajectory has a mean ATE of 1.12 m with a standard deviation of 0.51 m. Our approach, with a mean ATE of 0.55 m, outperforms both example solutions by at least a factor of two. Figure 6 shows both published trajectories compared to the groundtruth and to our approach.
5. Conclusion
Many state-of-the-art visual SLAM systems suffer from limited scalability, missing loop-closure capability and limited robustness in environments with weak visual information. In our work, we extended a keyframe-based VSLAM approach to achieve accurate map and trajectory reconstruction in large-scale environments. We have shown that our system explores and navigates over a trajectory of more than 200 m with an ATE below 1% of the trajectory length. It is capable of navigating through completely untextured scenes and manages map sizes of more than 3000 m². The increase in robustness is achieved by employing stronger feature descriptors, adding synthetic map points when visual information is insufficient, and performing appearance-based localization for loop closing. Compared to other keyframe-based systems like PTAM, our system creates more computational overhead, which is primarily due to SURF descriptor computation and matching. However, on a high-end PC equipped with a GPGPU the system is still real-time capable. The incorporation of synthetic map points allows adding non-visual sensor information into the bundle adjustment framework without changing existing, highly optimized implementations.
Figure 6. Published SLAM results on the Rawseeds dataset, including our approach.
One open aspect of our approach is the propagation of measurement uncertainty from external sensors to the synthetic map points. We currently follow the ad-hoc approach of adding random points with a covariance of 1. In the future, we seek to examine a point placement strategy which allows modeling uncertainty in a more realistic way.
Acknowledgments
This work was supported within the Austrian Research Promotion Agency's (FFG) program Research Studios Austria, Machine Vision Meets Mobility (#818633), and by the Austrian Federal Ministry of Economy, Family and Youth (bmwfj).
References
[1] M. Agrawal and K. Konolige. FrameSLAM: From bundle adjustment to real-time visual mapping. IEEE Transactions on Robotics, 24(5), 2008.
[2] S. Ceriani, G. Fontana, A. Giusti, D. Marzorati, M. Matteucci, D. Migliore, D. Rizzi, D. G. Sorrenti, and P. Taddei. Rawseeds ground truth collection systems for indoor self-localization and mapping. Autonomous Robots, 27:353–371, 2009.
[3] A. J. Davison. Active search for real-time vision. In Int. Conf. on Computer Vision, 1:66–73, 2005.
[4] A. J. Davison and N. Kita. Sequential localization and map-building for real-time computer vision and robotics. Robotics and Autonomous Systems, 36(4):171–183, 2001.
[5] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Transactions on Robotics, 23(1):34–46, 2007.
[6] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In International Symposium on Mixed and Augmented Reality, 2007.
[7] K. Konolige, G. Grisetti, R. Kummerle, W. Burgard, B. Limketkai, and R. Vincent. Sparse pose adjustment for 2D mapping. In IROS, Taipei, Taiwan, 2010.
[8] J. Montiel, J. Civera, and A. Davison. Unified inverse depth parametrization for monocular SLAM. In Proc. of Robotics: Science and Systems, 2006.
[9] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2006.
[10] L. M. Paz, P. Piniés, J. D. Tardós, and J. Neira. Large scale 6DOF SLAM with stereo-in-hand. IEEE Transactions on Robotics, 24(5):946–957, 2008.
[11] P. Piniés, L. M. Paz, and J. D. Tardós. CI-Graph: An efficient approach for large scale SLAM. In IEEE Int. Conf. on Robotics and Automation, 2009.
[12] G. Sibley, C. Mei, I. Reid, and P. Newman. Adaptive relative bundle adjustment. In Proc. of Robotics: Science and Systems, 2009.
[13] D. G. Sorrenti, M. Matteucci, D. Marzorati, and A. Furlan. Benchmark solution to the stereo or trinocular SLAM - Bicocca 2009-02-25b BP. http://www.rawseeds.org/rs/rawseeds/rs/assets/solutions_data/4adc84637e0a0/BS_milan_Bicocca_25b.pdf, last visited 2011-02-24, 2009.
[14] H. Strasdat, J. M. M. Montiel, and A. Davison. Scale drift-aware large scale monocular SLAM. In Proc. of Robotics: Science and Systems, 2010.