Feature-based or Direct: An Evaluation of Monocular Visual Odometry

arXiv:1705.04300v1 [cs.CV] 11 May 2017

Nan Yang, Rui Wang, Daniel Cremers
Computer Vision Group, Department of Computer Science, Technische Universität München
{yangn, wangr}@in.tum.de, [email protected]

Abstract

Monocular visual odometry (VO), which incrementally estimates camera poses and a local 3D map, is the key component of monocular simultaneous localization and mapping (SLAM). The two main branches of the VO family, namely feature-based and so-called direct methods, have seen tremendous improvements in accuracy, robustness and efficiency, and have grown enormously in popularity over recent years. Nevertheless, no comprehensive evaluation between them has been performed. In this paper, we compare the state-of-the-art feature-based and direct VO methods, aiming at an exhaustive evaluation under diverse real-world settings. For the first time, we evaluate their performance on images with and without photometric calibration, their camera motion bias, and their robustness to the rolling shutter effect. We discuss possible reasons for and give explanations of all our experimental results.

1 Introduction

Modern visual SLAM systems usually have two basic components: VO and global map optimization. While the VO component incrementally estimates camera poses and builds up a local map, small errors accumulate, and over time the estimated camera pose drifts away from its actual position. If a previously visited location is detected, the drift can be removed by the global optimization component using techniques like loop closure and pose graph optimization. Although a globally consistent map can be achieved this way, the accumulated errors cannot be totally removed from the system. In other words, the overall performance of any SLAM system is fundamentally determined by its VO accuracy. Besides its crucial role in SLAM, VO can also be applied on its own in a variety of applications such as path planning and obstacle avoidance for unmanned aerial vehicles (UAVs), where its accuracy and robustness are the deciding factors. During the past few years, the VO community has seen significant progress in improving algorithm accuracy, robustness and efficiency [4–6, 8, 9, 14, 16].

Depending on whether image feature extraction and matching are needed, these methods can be grouped into two categories: feature-based methods [14, 16] and direct methods [4–6, 8, 9, 17] (we consider Semi-Direct Visual Odometry [8, 9] a direct method). Feature-based methods extract a sparse set of key points from each image and match them across multiple frames. Camera poses and feature depths are estimated by minimizing the reprojection errors between feature pairs. As modern feature descriptors are invariant to some extent to illumination and viewpoint changes, feature-based methods are more robust than direct methods to brightness inconsistencies and large viewpoint changes between consecutive frames. However, feature extraction and matching bring additional computational overhead, which strongly limits the number of features that can be maintained in the system. As a result, the reconstructed 3D maps are rather sparse and often cannot be used directly in practice. Moreover, in low-texture environments where not enough features can be extracted, tracking easily gets lost. In [4], J. Engel et al. also found that feature-based methods are more sensitive to photometric noise such as image blur. Direct methods, in turn, neither match image features nor minimize a reprojection error. They use all image pixels (dense) [17], all pixels with sufficiently large intensity gradient (semi-dense) [5], or sparsely selected pixels (sparse) [4, 8, 9], and minimize a photometric error obtained by direct image alignment on the selected pixels. Camera poses and pixel depths are estimated by minimizing this photometric error with non-linear optimization algorithms (the two residual formulations are contrasted in the sketch after the list below). Since more image information can be used, direct methods are more robust in low-texture scenes and can deliver much denser 3D reconstructions. Conversely, due to the direct image alignment formulation, direct methods are very sensitive to unmodeled artifacts such as the rolling shutter effect and camera auto exposure and gain control. More crucially, the brightness constancy assumption does not always hold in practice, which drastically reduces the performance of direct methods in environments with rapid lighting changes.

To evaluate and compare the performance of feature-based and direct VO methods, we choose the state of the art in each group, namely ORB-SLAM [16] (to turn it into a pure VO system, we turn off its loop closure and global bundle adjustment functionalities) and Direct Sparse Odometry (DSO) [4]. Our evaluations mainly focus on the following aspects:

a) Photometric calibration. Pixels corresponding to the same 3D point may have different intensities across images due to optical vignetting and camera auto gain and exposure control. Performing photometric calibration approximately recovers the irradiance images and improves the performance of direct methods [4]. However, it is still unclear how it influences feature-based methods, which we address in Sec. 3.1.

b) Motion bias. Running the same sequence forward or backward can result in different VO performance [7]. In Sec. 3.2, we perform both quantitative and qualitative evaluations, and give an analytical explanation for the obtained results.


c) Rolling shutter effect. Feature-based methods have been considered more robust against the rolling shutter effect. In Sec. 3.3, we carry out experiments to get a more thorough understanding of this.
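To make the distinction between the two error formulations concrete, the following minimal sketch contrasts a reprojection residual with a photometric residual for a single point observation. It is purely illustrative: the 3x4 pose representation, the nearest-neighbor intensity lookup and all names are our simplifications, not code from either system.

```python
import numpy as np

def reprojection_residual(T, X, x_obs, K):
    """Feature-based residual: project 3D point X (3-vector) with camera
    pose T (3x4, world-to-camera) and intrinsics K, and compare with the
    matched feature location x_obs (2-vector, pixels)."""
    p = K @ (T[:, :3] @ X + T[:, 3])
    return p[:2] / p[2] - x_obs

def photometric_residual(T, X, K, I_ref, I_cur, x_ref):
    """Direct residual: intensity difference between pixel x_ref in the
    reference image and the reprojection of its 3D point in the current
    image (nearest-neighbor lookup here; real systems interpolate)."""
    p = K @ (T[:, :3] @ X + T[:, 3])
    u, v = np.round(p[:2] / p[2]).astype(int)
    return float(I_cur[v, u]) - float(I_ref[x_ref[1], x_ref[0]])
```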

2 Related Work

Apart from ORB-SLAM and DSO, we also ran our experiments on two other popular methods, LSD-SLAM [5] and SVO [8]. However, both of them failed on most of the sequences selected for our evaluation. For this reason, we focus only on ORB-SLAM and DSO in the rest of this paper.

2.1 Direct Sparse Odometry (DSO)

Unlike previous direct VO or SLAM algorithms that use dense or semi-dense formulations, DSO samples a sparse set of points across all image regions with sufficient intensity gradient, and thus combines advantages of both the direct and the sparse pipeline: the sampled points, although sparser, still deliver a more fine-grained representation of the 3D geometry than feature-based methods, while the reduced amount of data enables real-time windowed bundle adjustment (BA) that jointly optimizes all model parameters, including camera poses, depths, camera intrinsics and affine brightness transformation factors.

Photometric Calibration. To boost the performance of direct methods, J. Engel et al. [4, 7] proposed a photometric camera calibration pipeline based on the formulation

    I(x) = G(t V(x) B(x)),    (1)

where G is the non-linear camera response function, t is the exposure time, V : Ω → [0, 1] contains the pixel-wise attenuation factors of the vignetting, and B(x) is the irradiance image up to a scalar factor. Performing direct image alignment on B(x) instead of I(x) removes artifacts that violate the brightness constancy assumption and therefore significantly increases the tracking accuracy [4]. An example of photometric calibration is shown in Figure 1.
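Inverting the model of Eq. (1) amounts to a per-pixel lookup and division. The following is a minimal sketch assuming the inverse response G^{-1} is given as a 256-entry lookup table and the vignette as a dense attenuation map, roughly the format of the TUM Mono VO calibration files; the function and variable names are ours.

```python
import numpy as np

def recover_irradiance(img, inv_response, vignette, exposure):
    """Approximately recover the irradiance image B(x) from a raw 8-bit
    image I(x) by inverting I(x) = G(t * V(x) * B(x)) of Eq. (1).

    img          -- HxW uint8 raw image I
    inv_response -- 256-entry array tabulating G^{-1}
    vignette     -- HxW attenuation map V with values in (0, 1]
    exposure     -- exposure time t of this frame
    """
    energy = inv_response[img]             # G^{-1}(I(x)) = t * V(x) * B(x)
    return energy / (exposure * vignette)  # divide out exposure and vignetting
```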


[Figure 1: Example of photometric calibration. Camera response functions with gamma correction on (a) and off (b); images before (c) and after (d) photometric calibration. Axes of (a) and (b): pixel value I against irradiance U(I).]

Front-End. The front-end of DSO is in charge of initial image tracking, point sampling and frame management. Each new image is tracked by direct image alignment in a coarse-to-fine manner inspired by LSD-SLAM. Whether a new keyframe needs to be created depends on heuristics such as whether there is sufficient change in field of view or brightness. The system always keeps Nf keyframes in the active window (Nf = 7 by default), and old keyframes are culled by marginalization. When a new keyframe is created, pixels with high image gradient and an even spatial distribution are selected as candidates to be added to the system.

Back-End. When a new keyframe is added to the active window, the optimization back-end jointly optimizes all model parameters by minimizing the photometric error. The Gauss-Newton algorithm is used to achieve a good trade-off between speed and accuracy. To keep the Hessian matrix sparse, obsolete and redundant information is marginalized via the Schur complement [15]. To avoid inconsistency and preserve the observability of the system, the First Estimate Jacobians technique is applied in the non-linear optimization [12, 15].
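As an illustration of the marginalization step, the dense toy sketch below forms the Schur complement of the normal equations H dx = b; the index sets and names are ours, and a real implementation exploits the block sparsity rather than slicing dense matrices.

```python
import numpy as np

def marginalize(H, b, keep, marg):
    """Marginalize the variables indexed by `marg` out of the Gauss-Newton
    normal equations H dx = b via the Schur complement, keeping `keep`."""
    Haa = H[np.ix_(keep, keep)]
    Hab = H[np.ix_(keep, marg)]
    Hbb = H[np.ix_(marg, marg)]
    ba, bb = b[keep], b[marg]
    Hbb_inv = np.linalg.inv(Hbb)          # small block, inverted directly
    H_new = Haa - Hab @ Hbb_inv @ Hab.T   # reduced system over kept variables
    b_new = ba - Hab @ Hbb_inv @ bb
    return H_new, b_new
```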

2.2 ORB-SLAM

Developed by R. Mur-Artal et al., ORB-SLAM is a feature-based SLAM system that uses ORB features [18] for all tasks, including tracking, mapping, relocalization and loop closing. To evaluate its VO performance, we disable its loop closure and global BA functionalities in all our experiments. The VO component of ORB-SLAM can be considered as the integration of two parts, tracking and local mapping.

Tracking. To track a new frame, ORB features are first extracted from the image and matched to those of the reference frame. The feature correspondence search is facilitated by a constant velocity motion model. The initial pose is obtained using motion-only BA. To refine the pose estimate, all features in the local map are projected into the current frame, more correspondences are matched, and the pose optimization is performed again. Similar to DSO, a new keyframe is spawned based on several heuristics as detailed in [16].

Local Mapping. Every time a new keyframe is inserted, the system updates the covisibility graph, a data structure that improves system efficiency. New map points are triangulated based on the epipolar geometry between each pair of connected keyframes in the covisibility graph. The triangulated 3D points are then projected into the remaining connected keyframes, where their correspondences are searched for. Finally, a local BA is performed over all keyframes and map points within the current window. Unlike in DSO, old points and keyframes are culled directly from the active window without marginalization.
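The constant velocity motion model mentioned above can be sketched in a few lines; the 4x4 world-to-camera pose representation and the function name are our assumptions, not ORB-SLAM code.

```python
import numpy as np

def predict_pose(T_prev, T_curr):
    """Constant-velocity motion model used to seed the feature search:
    extrapolate the next camera pose from the last two poses, given as
    4x4 SE(3) world-to-camera matrices."""
    velocity = T_curr @ np.linalg.inv(T_prev)  # relative motion of the last step
    return velocity @ T_curr                   # apply it once more
```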

2.3 Datasets

To perform an exhaustive evaluation of ORB-SLAM and DSO, we use the following public datasets, which cover a variety of real-world settings, e.g., indoor/outdoor, textured/textureless, global/rolling shutter.


[Figure 2: Performance comparison of DSO and ORB-SLAM on sequences with/without photometric calibration (camera response function G, vignetting factors V; legend: DSO + V, G, t; DSO + V, G; DSO; ORB + V, G; ORB). The alignment error e_align, rotation drift e_r (in degrees) and scale drift e_s are shown in the corresponding subplots over the number of runs. For reference, we also show results of DSO using the camera exposure times t.]

The TUM Mono VO Dataset [7] contains 50 sequences captured with a global shutter camera and two different lenses. The camera response function, dense attenuation factors and per-image exposure times are provided for photometric calibration.

The EuRoC MAV Dataset [1] contains 11 sequences recorded by global shutter cameras mounted on a drone. It covers three different indoor environments. Some of the sequences are quite challenging, as they exhibit extremely unstable motion and strong brightness changes.

The ICL-NUIM Dataset [10] was extended by C. Kerl et al. [13] to provide both simulated rolling shutter and global shutter sequences of indoor environments, which we use to compare the performance of the selected VO methods on images with rolling/global shutters.

The Cityscapes Dataset [3] provides a long street-view sequence captured by industrial rolling shutter cameras in Frankfurt. We use it to evaluate how feature-based and direct methods cope with realistic rolling shutter effects.

We also recorded a sequence in an outdoor environment containing more than 20,000 frames and forming a large loop, which we use to compare the motion bias of DSO and ORB-SLAM.

3 Evaluation

3.1 Photometric Calibration

In the first experiment we evaluate how photometric calibration influences the performance of feature-based and direct methods. We use the 50 original sequences of the TUM Mono VO Dataset, as well as their photometrically calibrated counterparts, i.e., with the non-linear camera response function G and the pixel-wise vignetting factors V calibrated. Example images with/without photometric calibration can be found in Figures 1c and 1d. We run DSO and ORB-SLAM 10 times on each of these 100 sequences. The alignment error e_align, rotation drift e_r (in degrees) and scale drift e_s are calculated, and the corresponding numbers of runs are accumulated in Figure 2. It is worth noting that integrating the exposure times into the formulation of ORB-SLAM is not straightforward; we therefore do not use them for either ORB-SLAM or DSO. However, for reference we also show the results of DSO with all calibration information used, i.e., G, V and t.
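All trajectory errors are computed after aligning the estimated trajectory to the ground truth with a similarity transformation, since monocular VO has no absolute scale. The sketch below is a generic Umeyama-style alignment returning the RMSE; it illustrates the evaluation idea under our naming, not the exact benchmark code of [7].

```python
import numpy as np

def align_rmse(est, gt):
    """Align Nx3 estimated positions to Nx3 ground truth with the
    least-squares similarity transform (s, R, t) of Umeyama (1991)
    and return the RMSE of the aligned trajectory."""
    n = len(est)
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / n)      # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (E ** 2).sum() * n  # optimal scale
    t = mu_g - s * R @ mu_e
    aligned = (s * (R @ est.T)).T + t
    return np.sqrt(((aligned - gt) ** 2).sum(1).mean())
```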


[Figure 3: Left: Performance difference of ORB-SLAM on the TUM Mono VO Dataset; e^PC_align − e^nPC_align is shown for each sequence, where e^PC_align and e^nPC_align denote the alignment error with and without photometric calibration, respectively. The 6 sequences on which tracking fails completely are marked in red. Middle: An enlarged view of the left plot; sequences with performance differences larger than 50 are marked in red, and solid blue dots mark the 4 sequences shown in Figure 4. Right: Color-encoded errors of all runs. The x-axis of all plots is the sequence number (1-50).]

[Figure 4: Performance differences of ORB feature matching before and after photometric calibration on four frames (seq 01, frame 2085; seq 46, frame 2748; seq 21, frame 5369; seq 22, frame 818). Top: before calibration. Bottom: after calibration, with the number of matches given per image pair. The histogram of the left image is shown under each image pair. After photometric calibration, the number of matches decreases considerably on dark images, while it increases significantly on bright images.]

As can be seen in Figure 2, with G and V calibrated, the performance of DSO increases significantly, and using t boosts the tracking accuracy further. Interestingly, however, photometric calibration reduces the overall performance of ORB-SLAM. We give our explanations in the following.

DSO. As DSO is fundamentally built on the assumption that a 3D point has the same intensity across different images, it is straightforward to conclude that removing or reducing brightness inconsistencies among images improves DSO's performance. More evidence can be found in [4].

ORB-SLAM. To better understand the overall performance decline of ORB-SLAM in Figure 2, we further show its performance on each sequence in Figure 3, where the differences of the alignment errors with/without photometric calibration, e^PC_align − e^nPC_align, are shown on the left and in the middle, and the alignment errors of all runs are shown on the right. As can be seen there, both tracking robustness and accuracy decrease after photometric calibration: ORB-SLAM fails on 6 sequences and generally performs worse on the others. It is worth noting that the performance decline is not consistent over the sequences.
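For reference, the matching experiment behind Figure 4 can be reproduced along the following lines with OpenCV; the file names, feature budget and brute-force matcher settings are our assumptions, not the exact configuration used.

```python
import cv2

# Count ORB matches between two frames, before or after photometric
# calibration, as in the Figure 4 experiment.
img1 = cv2.imread("frame_a.png", cv2.IMREAD_GRAYSCALE)  # hypothetical files
img2 = cv2.imread("frame_b.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)
print(len(matches))
```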


[Figure 5: Failure cases on the 6 sequences marked red in the left plot of Figure 3, where ORB-SLAM lost tracking (seq 26, 35, 36, 40, 41, 50). Top: before photometric calibration. Bottom: after photometric calibration, where the number of successfully matched features drops to single digits. The features shown are those projected from the active local map into the current frame and successfully matched there. After photometric calibration, dark images become even darker and not enough features can be matched on them.]

To understand the inconsistency, we need to take another look at the inverse camera response function G^{-1} shown in Figure 1a. This non-linear function can be roughly divided into three linear parts, with pixel values I in [0, 90), [90, 190) and [190, 255]. Owing to the different slopes, applying G^{-1} compresses the intensities in [0, 90) and stretches those in [190, 255]. In other words, it reduces the contrast of dark areas while increasing it in bright areas. As image features like ORB generally work better on images with higher contrast (a more evenly distributed intensity histogram), we assume the performance decline of ORB-SLAM is mainly caused by dark frames. We select some example frames and perform ORB feature matching before and after photometric calibration. The numbers of feature matches and the image histograms are shown in Figure 4. The different results on bright and dark images verify our assumption. One thing to point out is that, although the drop in the number of matched features may sometimes not seem crucial (as in the second column of Figure 4), the effect accumulates over multiple frames. As a result, when the system projects all features of the current local map into the newest frame and searches for their correspondences, the number of matches decreases drastically, as shown in Figure 5. Consequently, only few features from the newest frame are considered inliers and added to the system. This is the main reason for the tracking failures in Figure 3.
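To make the contrast argument concrete, the toy piecewise-linear G^{-1} below (slopes chosen by us purely for illustration, not the calibrated response) shrinks the spread of dark pixel values and widens that of bright ones:

```python
import numpy as np

def inv_response(I):
    """Toy piecewise-linear stand-in for G^{-1} with slope < 1 on [0, 90)
    and slope > 1 on [190, 255], mimicking the shape of Figure 1a."""
    I = np.asarray(I, dtype=float)
    return np.piecewise(
        I,
        [I < 90, (I >= 90) & (I < 190), I >= 190],
        [lambda v: 0.5 * v,                      # dark: contrast is compressed
         lambda v: 45.0 + (v - 90.0),            # mid-range: roughly unchanged
         lambda v: 145.0 + 1.7 * (v - 190.0)])   # bright: contrast is stretched

dark, bright = np.array([20.0, 40.0]), np.array([200.0, 220.0])
print(np.diff(inv_response(dark)))    # [10.]: a 20-level gap shrinks to 10
print(np.diff(inv_response(bright)))  # [34.]: the same gap widens to 34
```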

3.2 Motion Bias

In this section we analyze the motion bias of VO algorithms, as pointed out by J. Engel et al. in [4, 7]. The term motion bias refers to the difference in VO performance when running the same sequence forward and backward. Experiments in [4, 7] show that ORB-SLAM performs better when running backward, while DSO does not suffer from such a bias. For a better understanding, we carry out two more experiments. In the first one, we run ORB-SLAM and DSO on the outdoor sequence captured by ourselves, where ideally the overall camera trajectory should form a large loop.


[Figure 6: Results on our recorded sequence (DSO backward/forward, ORB-SLAM backward/forward). Corresponding areas are marked in the same colors.]

[Figure 7: Performance differences of ORB-SLAM (top) and DSO (bottom) on the TUM Mono VO Dataset due to motion bias, at resolutions 320x240, 424x320, 640x480 and 960x720, each run forward (fwd) and backward (bwd); the subplots show e_align, e_r (in degrees) and e_s over the number of runs. While DSO delivers similar results under all settings, ORB-SLAM performs consistently better when running backward, and the performance gap increases at lower image resolutions.]

In the second experiment, we rerun ORB-SLAM and DSO on the TUM Mono VO Dataset, 10 times forward and 10 times backward at each of 4 different resolutions per sequence. The results are shown in Figures 6 and 7. Figure 6 shows that ORB-SLAM is not able to process the entire sequence, while DSO successfully runs both forward and backward with similar accuracy. Rerunning ORB-SLAM on a preselected segment (blue rectangle), we find that it suffers from large scale drift when running backward and fails completely at some point when running forward (red rectangle). The results in Figure 7 likewise show that DSO performs consistently when running forward and backward, while ORB-SLAM gives significantly better results running backward; this holds across all image resolutions used. Possible reasons are analyzed in the following.

For feature-based as well as direct methods, triangulation is a necessary step for estimating the depths of newly observed 3D points. According to [11], better depth estimates are achieved with a larger disparity between the image pair. For a 3D point, its images in two frames are related by

    x' = x + Kt/z,    (2)

where x and x' are the point's images in homogeneous coordinates, K is the camera intrinsic matrix, t is the relative translation between the two frames, and z is the point's depth in the first frame. Now assume the camera moves through one of the first 5 scenes shown in Figure 8. When it moves forward, new points emerge from the image center and have relatively small motion between consecutive frames.
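A quick numerical illustration of Eq. (2) under pure forward motion shows why these points are problematic; the intrinsics and the helper below are hypothetical and for illustration only.

```python
import numpy as np

def disparity(u, v, z, t, K):
    """Pixel displacement predicted by Eq. (2): x' = x + K t / z,
    with x = (u, v, 1) in homogeneous pixel coordinates."""
    x = np.array([u, v, 1.0])
    xp = x + K @ t / z
    xp = xp / xp[2]                      # dehomogenize
    return np.linalg.norm(xp[:2] - x[:2])

K = np.array([[500.0, 0.0, 320.0],       # assumed intrinsics
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
t_fwd = np.array([0.0, 0.0, 1.0])        # pure forward motion, 1 m

# A point emerging near the image center barely moves, while the same
# point seen far off-center yields a large, well-conditioned parallax:
print(disparity(330, 245, z=20.0, t=t_fwd, K=K))  # ~0.5 px
print(disparity(600, 420, z=20.0, t=t_fwd, K=K))  # ~16 px
```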

[Figure 8: Sample sequences from the TUM Mono VO Dataset on which ORB-SLAM shows the largest motion bias [7] (seq 17, 23, 29, 46, 47 and 31). The first 5 images show scenarios where motion bias can occur. The last image shows the end part of sequence 31; such high-frequency textures (leaves) are challenging for initialization when running backward.]

[Figure 9: Performance of ORB-SLAM run forward (fwd) and backward (bwd) on the EuRoC MAV Dataset with different settings for local loop closure. From left to right, points seen within the last 1 s, within the last 10 s, and without any time limit are used for local loop closure (x-axis: e_rmse; y-axis: number of runs).]

This pattern of optical flow introduces poorly initialized depths into the system. On the contrary, when moving backward, points close to the camera enter the field of view with large parallax, so their depths are better initialized. We claim this is the main reason for the improved performance of ORB-SLAM when running backward. To verify this, we take a closer look at sequences 31 and 44 of the TUM Mono VO Dataset, the main counterexamples, on which ORB-SLAM performs better running forward. The end part of sequence 31 contains a large amount of high-frequency texture (leaves), as shown in the last image of Figure 8, which prevents ORB-SLAM from initializing or makes it fail directly after initialization. In sequence 44, interestingly, the camera moves backward most of the time, which in fact confirms our reasoning. Moreover, we also run ORB-SLAM forward and backward on the EuRoC MAV Dataset, where all sequences are captured indoors and the camera motion is rather diverse, without any clear pattern. The results in Figure 9 show no significant performance difference, which further supports our conclusion. Nevertheless, one may argue that, according to the explanation above, DSO should also perform better when running backward. We list possible reasons for this inconsistency between ORB-SLAM and DSO in the following.

a) Depth representation. Instead of using depth directly like ORB-SLAM, DSO uses an inverse depth parametrization, which has proven advantageous for monocular visual SLAM in the sense that it copes better with distant features [2]. Moreover, DSO uses the filtering approach proposed in [6] for initial depth estimation, which takes better care of uncertain depth estimates by assigning them large variances. Summing up these two arguments, we claim that distant points, which are poorly initialized from the image center, have less impact on DSO than on ORB-SLAM.
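To see why inverse depth handles distant points gracefully, note that it maps the unbounded depth range onto a bounded interval, so a Gaussian over inverse depth can represent near-infinite depth with finite variance; a toy illustration with made-up numbers:

```python
import numpy as np

# Depths from 5 m to effectively "at infinity" map to the bounded interval
# (0, 0.2] in inverse depth, so uncertain, far-away points remain
# well-behaved parameters instead of diverging toward z -> infinity.
z = np.array([5.0, 50.0, 500.0, 5000.0])
rho = 1.0 / z
print(rho)  # [0.2, 0.02, 0.002, 0.0002]
```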

[Figure 10: Left: Results on the extended ICL-NUIM Dataset (DSO and ORB-SLAM, rolling shutter (r.sh.) vs. global shutter (g.sh.); x-axis: e_rmse, y-axis: number of runs). Middle and Right: Results of DSO on segments of the Frankfurt sequence of Cityscapes, plotted in meters (x[m], y[m]) together with GPS. Estimated poses are aligned to the GPS coordinates with a 7D similarity transformation. Note that the provided GPS coordinates are not accurate.]

b) Discretization artifact. When a new point is observed in DSO, its depth is initialized by searching for its correspondence in the reference image along the epipolar line with sub-pixel accuracy. In ORB-SLAM, however, a new ORB feature is matched to a previously observed feature, with both at discretized image locations. Thus ORB-SLAM suffers more from the pixel discretization artifact, especially when matching those distant features emerging from the image center when running forward. This conclusion is supported by the results in Figure 7, where the performance gap of ORB-SLAM between running forward and backward increases as the image resolution is reduced, while DSO appears robust to this artifact.
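A back-of-the-envelope calculation shows why discretization hurts most at small disparities: using the standard relation z = f b / d between depth, focal length, baseline and disparity (the numbers below are ours), a half-pixel quantization error changes the triangulated depth by roughly 20% at 2 px disparity but only about 2% at 20 px.

```python
# Relative depth error caused by a 0.5 px disparity quantization error.
f, b = 500.0, 0.5                 # assumed focal length (px) and baseline (m)
for d in (2.0, 20.0):             # small vs. large disparity
    z_true = f * b / d
    z_quant = f * b / (d + 0.5)   # disparity off by half a pixel
    print(d, abs(z_true - z_quant) / z_true)   # ~0.20 at 2 px, ~0.02 at 20 px
```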

3.3 Rolling Shutter Effect

In the first experiment of this section, we run ORB-SLAM and DSO 10 times on each of the 4 Living Room sequences of the ICL-NUIM Dataset, as well as on their simulated rolling shutter counterparts. The results are shown in the left plot of Figure 10. Although both methods are affected by the rolling shutter effect, the performance decline of DSO is clearly larger than that of ORB-SLAM, which supports the claim that feature-based methods are more robust to rolling shutter than direct methods.

In the second experiment we aim at a comparison on realistic images. As there is no dataset providing both real global shutter and real rolling shutter sequences, we can only compare the VO accuracy on rolling shutter sequences. For this purpose we use the Frankfurt sequence of Cityscapes and split it into smaller segments (each with around 6000 frames). To our surprise, ORB-SLAM always fails on the selected segments whenever the camera rotates strongly at corners. We therefore suspect these failures are not caused by the rolling shutter effect, and we only show the estimated camera trajectories of DSO in Figure 10 (middle and right). The trajectories are aligned to the GPS ground truth using a similarity transformation. The results show that, even with rolling shutter cameras with a high single-line readout time, DSO can still deliver satisfying results.
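The reason rolling shutter violates the models of both methods, and the direct one in particular, is that every image row is captured at a slightly different time, so each row sees the camera at a different pose; a toy calculation with assumed numbers:

```python
# Rolling shutter: row r of the image is exposed at t(r) = t0 + r * line_delay,
# so the first and last rows see different camera poses. Global shutter
# corresponds to line_delay = 0. All numbers below are assumptions.
line_delay = 30e-6                  # assumed single-line readout time (s)
rows = 480
speed = 10.0                        # assumed camera speed (m/s)
shift = speed * rows * line_delay   # pose difference across one frame
print(shift)                        # ~0.14 m between first and last row
```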


4 Conclusions

We have presented a thorough evaluation of ORB-SLAM and DSO with respect to photometric calibration, motion bias and the rolling shutter effect. Our main conclusions are: (1) With photometric calibration, the performance of DSO improves significantly, while for ORB-SLAM it helps only in bright environments and reduces performance in dark scenes. (2) Compared with DSO, ORB-SLAM shows a larger bias between forward and backward camera motion. (3) DSO is more sensitive to the rolling shutter effect, but can still deliver satisfying results when the effect is not too strong. (4) Across all performed experiments, we find DSO to be generally more robust than ORB-SLAM under the tested settings.

References

[1] Michael Burri, Janosch Nikolic, Pascal Gohl, Thomas Schneider, Joern Rehder, Sammy Omari, Markus W. Achtelik, and Roland Siegwart. The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research, 2016.

[2] Javier Civera, Andrew J. Davison, and J. M. Martinez Montiel. Inverse depth parametrization for monocular SLAM. IEEE Transactions on Robotics, 24(5):932-945, 2008.

[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[4] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. In arXiv:1607.02565, July 2016.

[5] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), September 2014.

[6] J. Engel, J. Sturm, and D. Cremers. Semi-dense visual odometry for a monocular camera. In IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, December 2013.

[7] J. Engel, V. Usenko, and D. Cremers. A photometrically calibrated benchmark for monocular visual odometry. In arXiv:1607.02555, July 2016.

[8] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA), 2014.

[9] Christian Forster, Zichao Zhang, Michael Gassner, Manuel Werlberger, and Davide Scaramuzza. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 2016.

[10] A. Handa, T. Whelan, J. B. McDonald, and A. J. Davison. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In IEEE Intl. Conf. on Robotics and Automation (ICRA), Hong Kong, China, May 2014.

[11] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.

[12] Guoquan Huang, Anastasios Mourikis, and Stergios Roumeliotis. A first-estimates Jacobian EKF for improving SLAM consistency. In Experimental Robotics, pages 373-382. Springer, 2009.

[13] C. Kerl, J. Stueckler, and D. Cremers. Dense continuous-time tracking and mapping with rolling shutter RGB-D cameras. In IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015.

[14] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'07), Nara, Japan, November 2007.

[15] Stefan Leutenegger, Simon Lynen, Michael Bosse, Roland Siegwart, and Paul Furgale. Keyframe-based visual-inertial odometry using nonlinear optimization. The International Journal of Robotics Research, 34(3):314-334, 2015.

[16] Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. CoRR, abs/1502.00956, 2015.

[17] Richard A. Newcombe, Steven Lovegrove, and Andrew J. Davison. DTAM: Dense tracking and mapping in real-time. In IEEE International Conference on Computer Vision (ICCV), pages 2320-2327, 2011.

[18] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In IEEE International Conference on Computer Vision (ICCV), pages 2564-2571, 2011.
