Collaborative SLAM with Crowdsourced Data
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By Jianzhu Huai Graduate Program in Civil Engineering
The Ohio State University 2017
Dissertation Committee: Dr. Dorota Grejner-Brzezinska, Advisor Dr. Charles Toth Dr. Alper Yilmaz Dr. Christopher Jekeli
Copyrighted by Jianzhu Huai 2017
Abstract
Navigation from one place to another involves estimating location and orientation. As such, navigation and location-based services are essential to our daily lives. Prevalent navigation techniques rely on integrating global navigation satellite systems (GNSS) and inertial navigation systems (INS), which do not provide reliable and accurate solutions when GNSS signals are degraded or unavailable, as in indoor environments. To remedy this deficiency, simultaneous localization and mapping (SLAM) approaches have been developed for navigation purposes. However, SLAM uses data from one device or one session, is limited in the amount of available information, and thus is prone to accumulating errors. By using data from multiple sessions or devices, a collaborative SLAM technique holds promise for reusing mapping information and enhancing localization accuracy. Such techniques gain further significance in view of widespread cloud computing resources and handheld mobile devices. With crowdsourced data from these consumer devices, collaborative SLAM is key to many location-based services, e.g., navigating a building for a group of people or robots. However, available solutions and the scope of research investigations in this field are somewhat limited. Therefore, this work focuses on solving the collaborative SLAM problem with crowdsourced data. This problem is approached from three aspects: collaborative SLAM with crowdsourced visual data, calibrating a camera-IMU (inertial measurement unit) sensor
system found on mobile devices, and collaborative SLAM with crowdsourced visual and inertial data. For the first aspect, a collaborative SLAM framework is proposed based on the client-server architecture. It has independent clients estimating the device motion using the visual data from mobile devices. The output from these clients is processed by the server to collaboratively map the environment and to refine estimates of device motion, which are transmitted to clients to update their estimates. The proposed framework achieves reuse of existing maps, robust loop closure, and real-time, collaborative, and accurate mapping. These properties of the framework were validated with experiments on a benchmark dataset and visual data crowdsourced by smartphones. For the second aspect, a calibration approach based on the extended Kalman filter (EKF) is developed for the camera-IMU system of a mobile device. Aimed at target-free, on-the-fly calibration, this approach estimates the intrinsic parameters of both the camera and the IMU, and the temporal and spatial relations between the two sensors. Simulation and experiments with a benchmark dataset and smartphone data validated the calibration performance. The calibration approach provides essential parameters for integrating visual and inertial data, which is the highlight of the third aspect. There the collaborative SLAM framework is extended to work with visual and inertial data. While keeping such features as reuse of existing maps and real-time collaborative mapping, the new framework makes the metric scale largely observable thanks to the inertial data. These features of the framework were validated with experiments on visual and inertial data collected by smartphones.
In summary, these research efforts have proved that accurate and effective collaborative SLAM is achievable with crowdsourced data at a level that has not been demonstrated before. This represents a step towards location-based services that harness the power of crowdsourcing, such as crowd-based gaming and emergency response.
Acknowledgments
First of all, I would like to thank my Ph.D. advisor, Dr. Grejner-Brzezinska, for her valuable mentorship. Over the years, she patiently guided me in both schooling and research, keeping my professional development a priority. By allowing me to explore topics of my own interest in the early years, she trusted me to master problem-solving skills and find a meaningful research topic. I am also much indebted to Dr. Toth, a long-term research collaborator of Dr. Grejner-Brzezinska, for his continuous support and guidance. He hosted the weekly discussions of our research group and provided constructive feedback on my research hurdles and dilemmas. My thanks also go to Dr. Yilmaz and Dr. Jekeli, who served on the committee of my candidacy exam and the defense of this dissertation. Both have been very helpful in solving many problems I encountered along the line of research. Moreover, I deeply appreciate the review comments on organization and grammar by Edwina Carreon, the instructor for the Education Teaching and Learning 6913 course during the 2016 fall term. I am very fortunate to be surrounded at work by many peers who have close research directions. Siavash HosseinyAlamdary provided me with deep insights through many discussions when I tried to integrate GPS and IMU data. Following his learning path in research, I learned to work in the Linux environment, which phenomenally sped up my
research progress. I also owe thanks to Zoltan Koppanyi and Grzegorz Jóźków, both of whom helped me solve many tricky problems. Omar E. Mora, Justin Crawford, Steven Ostrowski, Farhad Mirkazemi, and Yunming Shao maintained a hard-working and congenial atmosphere in our workplace. I am also very grateful that Jisun Lee and Grzegorz Jóźków graciously allowed me to use their smartphones to capture part of the data used in the experiments in this work. Finally, I would like to thank my wife, who is supportive of my initiatives in every way. As a homemaker, she largely shoulders the burden of household chores, yet she constantly lifted my spirits when I had bad times. Her positive outlook drives me to achieve the work presented here. Looking back further, I am also deeply grateful to my parents. They are hard-working role models for me and allowed me the freedom to pursue things meaningful to me.
Vita
2009...............................................B.S. Civil Engineering, Beihang University
2012...............................................M.S. Highway and Railway Engineering, Beihang University
2013 to present ...............................Graduate Student, Department of Civil, Environmental, and Geodetic Engineering, The Ohio State University
Fields of Study
Major Field: Civil Engineering
Geodetic and Geoinformation Engineering
Table of Contents
Abstract
Acknowledgments
Vita
List of Acronyms
List of Tables
List of Figures
Chapter 1: Introduction
    1.1 Background and Motivation
    1.2 Approach and Scope
    1.3 Contributions
    1.4 Structure of the Dissertation
Chapter 2: Mathematical Preliminaries and Sensor Models
    2.1 Notation
    2.2 Coordinate Frames and Transformations
        2.2.1 Coordinate Frames
        2.2.2 Quaternions
        2.2.3 Transformations: SO(3), SE(3), and Sim(3)
    2.3 Estimation Theories in SLAM
        2.3.1 Nonlinear Optimization on Euclidean Spaces
        2.3.2 Nonlinear Optimization on Manifolds
        2.3.3 Filtering
    2.4 IMU Models
    2.5 Camera Models
    2.6 Summary
Chapter 3: Collaborative SLAM with Crowdsourced Visual Data
    3.1 Related Work
    3.2 Overview of the Framework
    3.3 Operations on a Client
    3.4 Operations on the Server
        3.4.1 Keyframe Handler
        3.4.2 Loop Closing
        3.4.3 Stepwise Pose Graph Optimization
    3.5 Experiments
        3.5.1 Stepwise vs. Direct Pose Graph Optimization
        3.5.2 Collaborative Mapping with the KITTI Dataset
        3.5.3 Collaborative SLAM with Crowdsourced Visual Data
    3.6 Summary and Discussion
Chapter 4: Calibration of Consumer Camera and Inertial Sensors
    4.1 Related work
    4.2 Camera-IMU Calibration with MSCKF
        4.2.1 State Representation
        4.2.2 Propagate the State and Covariance
        4.2.3 Update with Image Observations
        4.2.4 State Management
    4.3 Experimental Results with Simulation
    4.4 Experimental Results on a Public Dataset
    4.5 Calibration of the Camera-IMU System on a Smartphone
    4.6 Summary and Discussion
Chapter 5: Collaborative SLAM with Crowdsourced Visual-Inertial Data
    5.1 Related Work
    5.2 Overview of the Framework
    5.3 Operations on a Client
    5.4 Operations on the Server
        5.4.1 States and Measurements
        5.4.2 Keyframe Handler with Visual and Inertial Data
        5.4.3 Loop Closing with Visual and Inertial Data
    5.5 Experiment Results
        5.5.1 VIO on a Client
        5.5.2 Collaborative Mapping on the Server
    5.6 Summary
Chapter 6: Conclusion
    6.1 Summary and Contributions
    6.2 Discussion and Future Work
References
Appendix A: Discrete Transition Model and Continuous Differential Model of the Error State in the MSCKF
Appendix B: Jacobians of a Keypoint Measurement in the MSCKF
Appendix C: Covariance and Jacobians for the IMU Error Term
List of Acronyms
BA: Bundle Adjustment
BRIEF: Binary Robust Independent Elementary Features
C-KLAM: Constrained Keyframe-based Localization And Mapping
C-SLAM: Collaborative Simultaneous Localization And Mapping
CCD: Charge-Coupled Device
CMOS: Complementary Metal-Oxide Semiconductor
CPU: Central Processing Unit
CoSLAM: Collaborative Simultaneous Localization And Mapping
CSfM: Collaborative Structure from Motion
DCM: Direction Cosine Matrix
DoF: Degrees of Freedom
DSO: Direct Sparse Odometry
DTAM: Dense Tracking And Mapping
EKF: Extended Kalman Filter
FAST: Features from Accelerated Segment Test
GBA: Global Bundle Adjustment
GNSS: Global Navigation Satellite System
GPS: Global Positioning System
IMU: Inertial Measurement Unit
INS: Inertial Navigation System
LSD-SLAM: Large Scale Direct Simultaneous Localization And Mapping
MAV: Micro Aerial Vehicle
MEMS: Micro-Electro-Mechanical System
monoSLAM: monocular SLAM
MSCKF: Multi-State Constraint Kalman Filter
OKVIS: Open Keyframe-based Visual-Inertial SLAM
ORB: Oriented FAST and Rotated BRIEF
ORB-VO: Visual Odometry based on ORB features
ORB-SLAM: Simultaneous Localization And Mapping based on ORB features
PSD: Power Spectral Density
PTAM: Parallel Tracking And Mapping
PTAMM: Parallel Tracking And Multiple Mapping
RAM: Random Access Memory
RANSAC: RANdom SAmple Consensus
RMSE: Root Mean Square Error
ROS: Robot Operating System
ROVIO: RObust Visual-Inertial Odometry
RTK: Real-Time Kinematic
SE(3): Special Euclidean group of the three-dimensional Euclidean space
se(3): Lie algebra of SE(3)
SfM: Structure from Motion
Sim(3): Similarity transformation group of the three-dimensional Euclidean space
sim(3): Lie algebra of Sim(3)
SLAM: Simultaneous Localization And Mapping
SO(3): Special orthogonal group of the three-dimensional Euclidean space
so(3): Lie algebra of SO(3)
SPIN Lab: Satellite Positioning and Inertial Navigation Laboratory
std. dev.: standard deviation
VIO: Visual-Inertial Odometry
VO: Visual Odometry
List of Tables
Table 2.1. Definitions of terms used in this work ............................................................. 43 Table 2.2. Symbols commonly used in this text ............................................................... 43 Table 3.1. States, initial estimates, and constraints involved in the direct pose graph optimization and two steps of the stepwise pose graph optimization. Here k enumerates poses of all N keyframes in the graph. A pose is represented by a similarity transformation consisting of a rotation from the global frame (subscript ‘g’) to the camera frame of the kth keyframe (superscript ‘k’), R kg, a translation meaning the coordinates of the g-frame’s origin in the k-frame, tkg, and a scale, skg, representing one unit length of the g-frame in k-frame’s length unit. Also, note ‘~’ identifies constraints or observations. ................................. 69 Table 3.2. Descriptions of datasets used in tests. Note the smartphone video frames were down sampled by a factor of 2 in processing ..................................................... 71 Table 3.3. Results of collaborative SLAM with the crowdsourced visual data. Note for the maximum error and RMSE, the ratio is the error divided by the traveled distance. .............................................................................................................. 79 Table 3.4. Processing times for threads in processing the two segments of the S6 dataset. ............................................................................................................................ 79 Table 4.1. The multi-state constraint Kalman filter (MSCKF) algorithm ........................ 90 Table 4.2. List of variables used in camera-IMU calibration problem ............................. 94 Table 4.3. Characteristics of noise and distributions of parameters of the camera-IMU system in simulation. For a particular IMU parameter vector, the same standard deviation applies to each of its dimensions. Similarly, the root power spectral density applies to every dimension of a particular IMU noise vector. So does the image noise. ...................................................................................................... 117 Table 4.4. RMSE of calibration parameters over 600 simulations at a few epochs. ...... 122
Table 4.5. Characteristics of noise and distributions of parameters of the camera-IMU system in MH_01_easy test. For an IMU parameter vector, the same standard deviation applies to each of its dimensions. Similarly, the root power spectral density applies to every dimension of an IMU noise vector. So does the image noise. ................................................................................................................. 125 Table 4.6 Nominal values for the calibration parameters and their estimated values by the MSCKF averaged over the last 5 seconds. ....................................................... 131 Table 5.1. Involved states and constraints (observations) in different least squares problems on the server for visual-inertial collaborative SLAM. ...................... 161 Table 5.2. Description of the crowdsourced smartphone dataset. Note every video frame is down sampled by a factor of 2 in processing. ............................................... 162 Table 5.3. Noise characteristics for the IMU on the Sumsung Galaxy S6 smartphone .. 163 Table 5.4. Error metrics of the VIO (visual inertial odometry) and collaborative mapping on the server after the rigid transformation for the first half of the S6 data. Note for the maximum error and RMSE (root mean square error), the ratio is the error divided by the traveled distance. ...................................................................... 167
List of Figures
Figure 1.1. Relation between methods and modules discussed in different chapters. For clarity, only one client enumerated by i (j) is shown for the collaborative monocular (visual-inertial) SLAM framework. In the diagram, SfM is short for structure from motion, MSCKF for multi-state constraint Kalman filter, SLAM for simultaneous localization and mapping, and IMU for inertial measurement unit. Note PhotoScan [46] is a software package developed by Agisoft LLC. .... 7 Figure 2.1. The tangent space at the identity of SO(3), denoted by so(3), approximates the SO(3) manifold structure. ................................................................................... 18 Figure 2.2. Parameterization of the misalignment effect depends on definition of the IMU body frame (identified by the ‘x-y-z’ axes triad). For the left graph, the body frame (b-frame) is defined relative to an external coordinate frame, therefore the misalignment between the b-frame and accelerometer triad (‘ax-ay-az’) has 6DoF and can be parameterized by, e.g., 6 small angles, α, β, γ, ζ, ψ, and ϕ. Note x’and y’-axis are on the x-y plane and z’-axis is on the y-z plane. In contrast, on the right graph, the body frame is defined relative to the accelerometer triad such that x-axis is along ax-axis and y-axis is on the plane spanned by ax- and ay-axis. As such, the misalignment has only 3DoF and can be parameterized by, e.g., 3 small angles, α, β, and γ. .................................................................................... 34 Figure 2.3. An overview of the coordinate frames and the camera model used in this work. The pinhole camera model projects a real-world point pi to a point zi on the image plane at focal length f, i.e., z=f. Notice the image plane is inverted for ease of calculation. The body frame is usually defined as the nominal IMU frame based on the camera frame. That is, the relative rotation between the body c frame and camera frame, R b , is assumed fixed to calibrated values obtained offline. The camera and IMU in a smartphone often have the shown relative orientation. Moreover, the pose of the sensor assembly relative to the global g frame is denoted by Tb. ....................................................................................... 38 Figure 2.4. In the electronic rolling shutter scheme, the sensor values are reset and read out row by row from top to bottom with regular delays. Pixels in a row are read out simultaneously. The sum of delays between rows is the frame time or readout time. ....................................................................................................... 41 xvii
Figure 2.5. Rolling shutter skew effect. As the red rectangular object moves fast horizontally in front of the sensor, a camera with a global shutter may capture its normal shape with motion blur, but a camera with a rolling shutter often captures a skewed shape. .................................................................................... 41 Figure 2.6. The relation between latencies of inertial data and images. The upper graph shows epochs at which the inertial data and images are sampled. After a latency of dimu, the inertial data are recorded and timestamped. Similarly, the images are timestamped with a latency of dc, as shown in the lower graph. Though both latencies are unobservable with only visual and inertial data, their difference td is. ........................................................................................................................ 42 Figure 3.1. The collaborative SLAM framework follows a client-server structure. Each client runs a monocular visual odometry algorithm. The collective outputs from n clients, i.e., keyframes messages, are processed on the server to create SLAM maps. Optionally, the server can transmit information to a client to update its pose estimate. ..................................................................................................... 51 Figure 3.2. Overview of the ORB (a type of image feature)-VO (visual odometry) algorithm deployed on a client. It employs a tracking thread to track the camera motion with respect to a map, and a mapping thread to refine the map. In the diagram, BA is short for bundle adjustment, N is the maximum number of keyframes kept in the VO, and F is the maximum number of relocalization failures until a new VO session is initialized. Steps of the workflow are explained in the following text. .......................................................................... 53 Figure 3.3. Overview of the workflow on the server. The server has a loop closing thread (left column) which detects loops and merges maps, and possibly many keyframe handler threads (only one is shown) which process keyframe messages sent in by clients performing monocular visual odometry (VO). Note BA is short for bundle adjustment, and in the shown handler thread, td is an adaptive time interval to make it work in real time. ................................................................. 58 Figure 3.4. Local neighborhood of the current keyframe used in the local bundle adjustment........................................................................................................... 61 Figure 3.5. Loop closure within a simulated map of keyframes (represented by triangles with different sizes indicating scale changes) and points (crosses) by the pose graph optimization. The global frame is aligned with the camera frame of the first keyframe, and the camera is assumed to follow a planar motion. For clarity, only a few points of interest are drawn, and fields of view of keyframes of interest are shown with black dashed lines. (a) True motion starts from the red circle and returns. (b) Estimated motion in the map on the server shows a growing scale, note the scale may decrease as well. As keyframes 19-21 are processed by the loop closing thread, keyframe 4, 2, 1, are the loop candidates xviii
(red lines) for keyframes 19, 20, 21, respectively. Because the three loop candidates view common scenes and thus are connected as neighbors, a loop is detected between the current keyframe 21 and the matched keyframe 1. (c) The detected loop is accepted (red solid line) after the RANSAC (random sample consensus) procedure. Matched points between two neighborhoods of the two keyframes are linked by red dashed lines. (d) Prior to the pose graph optimization, the current keyframe and its two neighbors are corrected with the estimated similarity transformation, and matched points are fused. (e) The map on the server after the pose graph optimization.................................................. 64 Figure 3.6. Loop closure between two simulated maps, i.e., merging two maps. Each map consists of keyframes (triangles with different sizes showing scale changes) and points (crosses). For clarity, only a few points of interest are drawn, and fields of view of keyframes of interest are shown with black dashed lines. (a) True motion of two cameras (blue and black) starting from the red circles. (b) Estimated motion in the two maps on the server with a loop accepted between the two keyframes (the matched keyframe and the current keyframe) linked by the red solid line. Matched points between two neighborhoods of the two keyframes are linked by red dashed lines. (c) The map on the server after the merging operation which transforms the original map of the current keyframe with the estimated similarity transformation, and fuses matched points. ........... 66 Figure 3.7. For KITTI sequence 00, the original pose graph (showing a large scale drift) generated by the monocular visual odometry (VO), and the pose graphs after direct optimization or stepwise optimization are superimposed onto the ground truth..................................................................................................................... 73 Figure 3.8. Snapshots of operations on the server for the two segments of the KITTI sequence 00. (a) Initially, two maps (beginning marked by circles) are maintained on the server by two handlers (blue and red triangles) assigned to each client. As a loop is detected (the red link) and accepted, (b) two maps are merged and the two handlers start working on the same map. Over time, (c) a loop is accepted (the red link) within the merged map which is shown in (d) after the loop closure. ......................................................................................... 74 Figure 3.9. Results from processing the first half of the S3 data with the proposed framework. The S3 camera first traversed the bottom loop three times, and then the top loop four times. (a) The motion of the camera is represented by positions of keyframes estimated by the monocular collaborative SLAM (C-SLAM) framework, also shown are the GPS positions for these keyframes. The obvious gap between the GPS and C-SLAM position at the beginning is partly due to the drift of timestamps in the S3 data. (b) The number of keyframes in the map on the server vs. number of frames processed by the monocular VO on the client. xix
The number of keyframes grows linearly when new areas are explored and remains roughly constant when locations are revisited. ..................................... 76 Figure 3.10. Positions of keyframes estimated by the collaborative SLAM (C-SLAM) framework in processing two segments of the S6 data were plotted against GPS reference positions. During the processing on the server, many loops were accepted and used to constrain the scale drift. ................................................... 77 Figure 3.11. Trajectories up to a scale of two clients calculated by the monocular visual odometry (VO) and by using pose and scale estimates from the server are superimposed onto the final merged map on the server. Obvious scale and position drifts exist in results of monocular VO. With pose and scale information from the server, clients achieve better localization relative to the map on the server. Note jumps (blue lines) in estimated poses for clients result from global bundle adjustment (GBA) invoked after loop closure within one map or map merge upon a loop between two maps. .......................................... 78 Figure 4.1. Anchored inverse depth parameterization for a feature. The feature’s coordinates in the global frame pgf decomposes into two parts: the position of the g
camera frame in which the feature is anchored, pca, and the feature’s coordinates ca
in the anchor camera frame, p f . ....................................................................... 106 Figure 4.2. Structure of the measurement Jacobian Hx for one feature fi. In this example, the current image of index 9 causes the state vector to be augmented with the body state 𝛑 b9 . Because fi is no longer observed in the current image, it is triangulated with respect to the anchor camera frame associated with the previous image of index 8. Then its observations {zi,5, zi,6, zi,7, zi,8} are used for the EKF update. The Jacobians with respect to the error state vector for each measurement stack into Hx. Here the empty grid cells represent blocks of 0s, whereas gray grid cells represent nonzero entries. ........................................... 112 Figure 4.3. Workflow of the multi-state constraint Kalman filter (MSCKF) algorithm. Note numbers in parenthesis identify relevant equations. ................................ 114 Figure 4.4. The coordinate frames in the simulation test. In the left graph, the line segment oq rotating in the global x-y plane has a constant length rxy, and pq of constant length rz is rotating about q in the vertical plane formed by the global zaxis and oq, which is also the x-z plane of the body frame. The x-axis of the body frame is aligned with pq. The right graph shows the entire trajectory during 300s simulation, and the feature points on six sides. ....................................... 116 Figure 4.5. RMSE histories of parameters of the camera-IMU system over 600 Monte Carlo simulations. Each subgraph shows the RMSE history for all dimensions of xx
a particular parameter vector: (a) bg, (b) ba, (c) Tg, (d) Ts, (e) Ta, (f) pcb, (g) (fx, fy, cx, cy) , (h) (k1, k2, p1, p2), (i) (td, tr). ................................................................. 119 Figure 4.6. Error history and reported ±3𝜎 (std. dev.) of calibration parameters in a sample trial. Each subgraph shows the error history and the ±3𝜎 envelope for all dimensions of a particular parameter vector: (a) bg, (b) ba, (c) Tg, (d) Ts, (e) Ta, (f) pcb, (g) (fx, fy, cx, cy) , (h) (k1, k2, p1, p2), (i) (td, tr). ................................. 122 Figure 4.7. Comparison of the MSCKF estimates and the ground truth (‘gt’ in graph legends) for the MH_01_easy test. (a) The trajectory estimated by MSCKF superimposed onto the ground truth, aligned by an SE(3) transformation. Estimates and ±3𝜎 bound obtained by MSCKF and the ground truth for (b) bg, (c) ba, and (d) pcb. (e) shows the estimates and ±3𝜎 bound obtained by MSCKF for td and tr, whose nominal values are zeros because of the global shutter camera used in the Euroc MAV dataset [91].................................................... 127 Figure 4.8. Calibration results for the camera-IMU system on Galaxy S3 with the MSCKF method. (a) positions of the body frame (i.e., nominal IMU frame) in the global frame which is associated with the first camera image, obtained by MSCKF (multi-state constraint Kalman filter) and ORB-VO (visual odometry), (b) ‘x-y-z’ components of quaternions from the b-frame to the g-frame obtained by MSCKF and ORB-VO, (c) gyroscope biases, (d) accelerometer biases, (e) position of the b-frame in the camera frame, (f) estimated focal length and principal point with their nominal values deducted, (g) estimated distortion parameters, k1, k2, (h) image readout time and time delay with regard to the inertial data. ...................................................................................................... 129 Figure 5.1. The client-server structure of the collaborative SLAM framework that works with crowdsourced visual-inertial data. ............................................................ 139 Figure 5.2. The sliding window in OKVIS consists of recent frames (solid wireframes), 6, 7, 8, 9, and keyframes (solid wireframes with red bases) in the spatial vicinity of the current frame 9, i.e., 1 and 4. The recent frames are constrained by inertial measurements (solid ticks). Over time, irrelevant frames (dashed wireframes) and old keyframes are marginalized and old inertial data (dashed ticks) are discarded. .......................................................................................................... 141 Figure 5.3. Overview of the workflow on the server for collaborative visual-inertial SLAM. A loop closing thread (left column) detects loops, computes loop constraints, merges maps, and optimizes maps using these loop constraints. Multiple keyframe handler threads (only one is shown) process keyframe messages sent in by clients which perform visual-inertial odometry. A keyframe handler tracks incoming keyframes to the map on the server, selectively sends keyframes to the loop closing thread, performs bundle adjustment (BA) in neighborhoods of retained keyframes, and culls redundant keyframes in the map. xxi
Note td is an adaptive time interval to make a handler work in real time. For comparison, dashed boxes identify operations different from the server of the collaborative monocular SLAM shown in Figure 3.3. ..................................... 143 Figure 5.4. Inerital data and keyframes processed by a keyframe handler on the server. The inertial data segments between two keyframes are denoted by z1,4, z4,6, z6,9. At the borders of these segments, the inertial samples are linearly interpolated from the two closest epochs. ............................................................................ 147 Figure 5.5. Keyframes and landmarks involved in the local bundle adjustment of a keyframe handler for collaborative SLAM with visual and inertial data. ........ 156 Figure 5.6. Snapshots of operations on the server during processing two streams of keyframe messages from two clients. (a) Two maps are created on the server by two handlers for each client; (b) A loop (red link) is detected between two maps; (c) The two maps are merged into one; (d) The two handlers concurrently work on the merged map. .......................................................................................... 164 Figure 5.7. Relations between different coordinate frames used in the experiment of S6 data. The six nodes represent origins of five coordinate frames, (1) e-frame, (2) the body frame of the reference sensor (br-frame), (3) br-frame at epoch t0 (br(t0)-frame) which is the initialization point of visual inertial SLAM or VIO, (4) the body frame of the camera-IMU system (b-frame), and (5) the global frame used in visual inertial integration (g-frame, identical to the b-frame at t0), and the phase center of the GPS antenna. A solid arrow identifies the rigid transformation between two coordinate frames or the translation between two nodes. A dashed arrow shows how an entity is obtained. For instance, for the transformation, Tbrb , from the b-frame to the br-frame, its translation part (lever arm) is manually measured and its rotation part (boresight) is estimated by aligning the angular rates from the IMU on a mobile device and the reference IMU sensor. The position of the antenna in the e-frame can be obtained either through nodes e-br-antenna, or through e-br(t0)-g-b-br-antenna. Note SLAM is short for simultaneous localization and mapping, VIO for visual inertial odometry, GPS for global positioning system, INS for inertial navigation system, IMU for inertial measurement unit. ..................................................... 166 Figure 5.8. (Left) Trajectories estimated by the VIO (visual inertial odometry) algorithm and the collaborative SLAM (simultaneous localization and mapping) on the server were aligned to the GPS (global positioning system) ground truth with a rigid transformation. (Right) Trajectories estimated by the VIO algorithm and the collaborative SLAM on the server were superimposed onto the GPS ground truth by using the measured boresight and lever arm parameters. ................... 168
Chapter 1: Introduction
As a vehicle or a person moves from one place to another, reliably navigating a potentially unknown environment is indispensable. Here navigation refers to the process of estimating the location and orientation of a platform, e.g., a handheld mobile device, a ground vehicle, or a micro aerial vehicle (MAV), with respect to a specific coordinate frame. These locations and orientations of a platform over time are also known as its ego-motion. Accurate ego-motion estimates can be provided by the simultaneous localization and mapping (SLAM) technique. Using data from sensors on a platform, a typical SLAM technique creates a specific map representing the structure of the environment (mapping) and localizes the platform relative to the map (localization). By sharing and reusing maps between platforms, collaborative SLAM promises better accuracy and enhanced robustness in estimating location and orientation. In addition, collaborative SLAM enables many applications, e.g., collaborative exploration of an unknown area [1], localizing a swarm of robots [2], and games in augmented reality with multiple players [3].

These potential applications of collaborative SLAM can be maximized with widely available commercial off-the-shelf mobile devices (e.g., smartphones and tablets) because they are mostly equipped with sensors useful for SLAM, e.g., a camera and an inertial measurement unit (IMU). That is, collaborative SLAM can use crowdsourced data from
these devices to the advantage of localization and mapping and related services. In view of such rich crowdsourced data, collaborative SLAM is required to process these data in real time, to create accurate and compact maps of the environment, and to share and reuse information between devices, e.g., a device can localize relative to a map created by another. So far, existing collaborative techniques usually employ customized sensors [4], and to the best of our knowledge, no technique with the abovementioned attributes has been proposed for crowdsourced data [5]. Furthermore, if both visual and inertial data are provided by a mobile device, collaborative SLAM needs to effectively integrate the two types of data, thus requiring proper calibration of these two sensors. Existing techniques are mainly designed for calibrating a customized visual and inertial sensor rig and are not suitable for sensors on a mobile device because they usually omit the rolling shutter effect of the camera and the scale and misalignment effect of the IMU. Therefore, this work investigates the following problems:
(1) Design a real-time collaborative SLAM technique for crowdsourced visual data, which are usually captured by cameras in a forward-looking position;
(2) Calibrate the camera-IMU system on a consumer mobile device;
(3) Design a real-time collaborative SLAM technique for crowdsourced visual and inertial data, with an emphasis on effectively integrating the inertial data.

1.1 Background and Motivation

In an outdoor environment with clear lines of sight to enough GNSS satellites, navigation has been well solved with the GNSS/INS (inertial navigation system) technology [6]
where GNSS determines positions and INS provides attitude angles. Because INS suffers from drifts in position and attitude, it relies on external position corrections from GNSS to calibrate errors in gyroscopes and accelerometers. However, GNSS-based positioning is subject to signal blockage, interference, and multipath effects, resulting in limited accuracy; it cannot be used in indoor environments and has limited use in urban canyons. In response, numerous sensor fusion technologies for navigation have been developed, which can be roughly divided into two categories: sensor fusion on a platform, and sensor fusion on a network. With sensors rigidly attached to a platform, the data from these sensors are integrated to determine the 6DoF (degrees of freedom) pose (i.e., 3D position and 3DoF orientation) of the platform. These sensors on the platform usually include exteroceptive sensors that observe environmental features, e.g., cameras and laser scanners, and proprioceptive sensors that monitor the movement, e.g., IMUs. As such, the sensor fusion can be solved with a SLAM or an odometry method. A SLAM method requires inputs from exteroceptive sensors for creating a specific map of the scene, and it estimates the pose of the platform relative to the map, e.g., SLAM with an RGB-D sensor and an IMU [7] and SLAM with a stereo camera and an IMU [8]. By contrast, an odometry method focuses on solving for poses and generally does not involve the mapping task, e.g., visual-inertial odometry (VIO) [9]. Sensor fusion with data from sensors on one platform is fragile in that it is difficult to recover from failures and relocalize to existing maps on the platform, especially in exploratory tasks.
On the other hand, sensor fusion on a network of platforms with communication capabilities utilizes the rich collective data for navigation. If these platforms are equipped with exteroceptive sensors, e.g., cameras, the sensor fusion can be solved with collaborative SLAM. Because data are shared between platforms, collaborative SLAM techniques have proved to improve localization accuracy and enhance robustness [2, 10] compared to independent SLAM on a platform. One obstacle in deploying a collaborative SLAM method is the availability of platforms. Existing collaborative SLAM approaches, e.g., [2, 11-13], often rely on customized sensors, which limit their application scope. If a collaborative SLAM method is to be widely used, it needs to rely on off-the-shelf mobile devices because they are (1) easily accessible, with 2 billion smartphone users worldwide in 2016, (2) designed with a powerful communication capability, e.g., live video chat, and (3) mounted with multiple useful sensors for navigation, e.g., cameras and IMUs. But these off-the-shelf sensors bring specific challenges: (1) these cameras are apt to generate a plethora of data through crowdsourcing, and (2) the camera-IMU systems are not well calibrated [14], i.e., their intrinsic parameters often have large uncertainties. For navigation of a group of robots or people with these consumer devices, real-time collaborative SLAM approaches should be designed with map reuse and concise mapping, thus reducing the demand on computational resources, and a proper calibration of the camera-IMU systems in these devices should be performed.
1.2 Approach and Scope

Salient visual landmarks in the scene project to corner features in images. These landmarks form a sparse representation of the structure of the scene. Observations of these landmarks in images serve as exteroceptive input to the collaborative SLAM problems and the calibration of a consumer camera-IMU system. If not too few, these landmarks are sufficient to drive these problems to meaningful solutions.

This work extensively uses estimation theories. The collaborative SLAM problems involve optimizing maps of landmarks and image poses. To improve accuracy and convergence, nonlinear optimization techniques are widely used, which rely on repeated linearization, i.e., linearly approximating a nonlinear function at a certain point several times. For calibrating the camera-IMU system, a classic filtering technique is used, which has proved sufficient for this purpose [14, 15].

Collaborative SLAM with a network of sensor platforms can be achieved with either a centralized or a decentralized computing scheme. In the centralized scheme, sensor platforms forward their data to a master which processes data from all platforms, e.g., [4, 16]. In contrast, clients in the decentralized scheme are responsible for processing all relevant data, e.g., [11, 17]. To balance the benefits of both schemes, the collaborative SLAM approaches in this work combine them in that both the clients and the master take computation workloads.

This work does not specify the quality of mobile devices required for acceptable collaborative SLAM or successful camera-IMU calibration. Obviously, with quite noisy data from very low-grade mobile devices, the methods developed here may give rather poor
performance. Quantitatively analyzing the noise effects of sensors in a variety of mobile devices remains future work.

1.3 Contributions

The following contributions are made by this work to the state of the art of collaborative SLAM.
(1) A client-server framework for collaborative SLAM is developed that works with image sequences crowdsourced by mobile devices. A client tracks the device motion with images of a single camera (i.e., monocular) on a mobile device and transmits compact messages to the server, where they are used to build maps of the viewed environment and to discover revisits. Loop constraints derived from revisits are used to achieve consistent maps by the proposed stepwise pose graph optimization technique, which is robust in correcting a large drift in the map scale, often occurring with images captured by a forward-looking camera. The server can also transmit positions and attitudes to clients so that they can localize relative to maps on the server. This framework is extensively evaluated with public benchmarks and crowdsourced smartphone video datasets. The server hosted on a consumer computer can process messages from up to four clients close to real time while achieving a positioning accuracy within 1% of the traveled distance.
(2) To calibrate the camera-IMU sensor rig on a consumer mobile device, an approach based on the multi-state constraint Kalman filter (MSCKF) [14] is developed. This approach estimates the time offset between the visual and inertial data, the spatial configuration of the camera-IMU system, and the intrinsic parameters of
both sensors, along with the motion of the sensor system. Its performance is validated with simulation and real-world data. (3) Using parameters obtained by the calibration approach, the collaborative SLAM framework is extended to work with crowdsourced images and inertial data. On the server, the pre-integrated inertial measurements are used in nonlinear optimizations to recover the metric scale of maps. The extended framework is validated with visual and inertial data crowdsourced by smartphones, demonstrating real-time operation and accurate mapping with a couple of clients. The workflows of the above discussed methods are drawn in Figure 1.1.
[Figure 1.1 block diagram: the i-th camera data stream is calibrated offline via SfM with PhotoScan and processed by client i (monocular odometry), which feeds the server performing collaborative SLAM up to a scale (Chapter 3); the j-th camera-IMU data stream is calibrated on the fly with the MSCKF (Chapter 4) and processed by client j (visual-inertial odometry), which feeds the server performing collaborative SLAM with visual-inertial data (Chapter 5).]
Figure 1.1. Relation between methods and modules discussed in different chapters. For clarity, only one client enumerated by i (j) is shown for the collaborative monocular (visual-inertial) SLAM framework. In the diagram, SfM is short for structure from motion, MSCKF for multi-state constraint Kalman filter, SLAM for simultaneous localization and mapping, and IMU for inertial measurement unit. Note PhotoScan [46] is a software package developed by Agisoft LLC.
1.4 Structure of the Dissertation

This dissertation is organized as follows. Chapter 2 reviews notation conventions, coordinate frames and related transformations, estimation theories in the context of localization and mapping, and IMU and camera sensor models. Chapter 3 presents the collaborative SLAM framework that works with crowdsourced image sequences or videos. The framework was tested with visual data from a public benchmark and from several smartphones. In Chapter 4, calibration of a camera-IMU system is solved with a filtering approach, which was verified with simulation and real-world data. Using the calibration results obtained with the method of Chapter 4, Chapter 5 extends the collaborative SLAM framework to work with crowdsourced visual and inertial data, featuring pre-integrated IMU measurements. The extended framework was validated by experiments on visual-inertial data crowdsourced by smartphones. Chapter 6 summarizes the contributions of this work and suggests future work. Several appendices provide the supporting derivations.
Chapter 2: Mathematical Preliminaries and Sensor Models
This chapter discusses mathematical prerequisites and sensor models for this work in an effort to make this dissertation self-contained. A review of the notation in this work is given in Section 2.1. Section 2.2 briefly reviews the three-dimensional Euclidean transformation and similarity transformation. Section 2.3 presents fundamentals of the nonlinear optimization and filtering techniques which are used in subsequent chapters to solve localization and mapping problems. For these problems, the observations are provided by IMUs and cameras. The observation models for the IMU and the camera typically found on mobile devices are discussed in Sections 2.4 and 2.5, respectively.

2.1 Notation

The notation used in this work follows the convention outlined in [18]. Scalars are denoted with lowercase italic letters, vectors with boldface lowercase letters, and matrices with boldface uppercase letters. For a variable 𝐱, its true value or state is denoted without any mark. The estimate of a variable 𝐱 is identified with the hat mark, '^', i.e., 𝐱̂. The error of a variable 𝐱 is defined as the difference between its true value and its estimate,

\[
\delta\mathbf{x} = \mathbf{x} - \hat{\mathbf{x}}
\]

This choice is trivially supported by the following example. For a random variable 𝑥 = 𝑛 ∈ ℝ¹ with 𝑛 denoting a zero-mean noise, its estimate is 𝑥̂ = 0 and its error is 𝛿𝑥 = 𝑥 − 𝑥̂ = 𝑛.
2.2 Coordinate Frames and Transformations

This study deals with such quantities as positions and orientations in the 3D Euclidean space. These quantities are specified in Cartesian coordinate reference frames. Each reference frame has an orthogonal right-handed triad of three axes realizable with points of known coordinates. Reference frames are related by transformations such as Euclidean transformations and similarity transformations. Reciprocally, these transformations also relate expressions of a point in different reference frames. The following subsections sequentially describe coordinate frames, quaternions for expressing orientations, and several types of transformations used in this text.

2.2.1 Coordinate Frames

Locations need to refer to a reference frame. The data used for localization usually determine what kind of reference frame needs to be defined or used. The GPS data refers to the Earth-Centered-Earth-Fixed (ECEF) frame (e-frame) [19]. The IMU data refers to the inertial frame (i-frame) and the IMU sensor frame [19]. A camera image is associated with a camera frame (c-frame). Commonly, the c-frame has its origin at the camera lens' optical center. When one looks from behind the camera, its x-axis is pointing to the right, its y-axis pointing down, and its z-axis pointing to the camera view. To describe the motion of a multi-sensor system, e.g., a camera-IMU assembly, a body frame (b-frame) affixed to the sensor system needs to be defined. In calibrating the camera-IMU system (Chapter 4), assuming the IMU has a three-axis accelerometer and a three-axis gyroscope, by definition the b-frame's origin is located at the intersection of the three axes of the accelerometer and its orientation is known and fixed relative to the
camera frame. If, for instance, the orientation of the b-frame is set as the nominal rotation between the camera and IMU sensor frame, then the b-frame is the nominal IMU sensor frame. However, on the server of a collaborative SLAM framework (Sections 3.2 and 5.2), the b-frame is chosen as the camera frame for easy comparison of results. Because the sensors used in this work, e.g., cameras and IMUs, can only localize in a relative manner, the motion of the sensor system needs to refer to a global reference frame (g-frame) which is observable by the sensors and used for navigation in a local area on the Earth. For collaborative SLAM with monocular cameras (Chapter 3), the g-frame is chosen as the camera frame of the first keyframe (Section 3.3). For calibration (Chapter 4) and collaborative SLAM (Chapter 5) with camera-IMU systems, the g-frame is defined such that it has the same origin as the b-frame associated with the first keyframe and its z-axis is pointing at the negative gravity direction. In a 3D coordinate frame, e.g., the a-frame, a position is expressed with a vector of three coordinates, 𝐩^a = [x, y, z]^T, where the superscript 'a' identifies the frame in which the position is expressed. The orientation of a coordinate frame with respect to another can be represented with a set of Euler angles, a direction cosine matrix (DCM), i.e., a rotation matrix, or a quaternion of four numbers. The Euler angle representation suffers from the gimbal lock issue at particular orientations, which roughly means that a particular orientation may correspond to many sets of different Euler angles. As a result, only the other two representations are used in this text. The quaternion representation is only used in implementation and reviewed in Section 2.2.2 for completeness. The 3×3 rotation matrix is always used in derivations pertinent to rotations.
A rotation matrix, in combination with a position vector, makes a 6DoF transformation in the 3D Euclidean space. Suppose there are two coordinate frames, a-frame and b-frame, and let 𝐑^a_b denote the orientation of the b-frame with respect to the a-frame, and 𝐩^a_b denote the coordinates of the b-frame's origin in the a-frame. Then, [𝐑^a_b, 𝐩^a_b] transforms the coordinates of a point in the b-frame, 𝐩^b, to coordinates in the a-frame, 𝐩^a. Because this transformation preserves the distance between every pair of points, it is also known as the rigid transformation.

\[
\mathbf{p}^a = \mathbf{R}^a_b\,\mathbf{p}^b + \mathbf{p}^a_b \tag{2.1}
\]
2.2.2 Quaternions A quaternion is a hyper-complex number of four components, concisely representing orientations in the 3D Euclidean space. Quaternions are mainly used in implementing algorithms involving rotations, potentially improving efficiency and numerical stability. Otherwise, they are often unnecessary for derivations because the rotation matrix suffices in this aspect. In this work the Hamilton convention [20] is adopted in defining and manipulating quaternions as it is widely used and implemented in the navigation and robotics community, e.g., [21, 22], Eigen [23], ROS (robot operating system) [24]. A quaternion 𝑞 is defined in a coordinate frame, e.g., a-frame. Let i, j, and k denote imaginary units associated with the x, y, and z axis of the a-frame, respectively, and they are subject to the following constraints. 𝑖𝑖 = −1, 𝑗𝑗 = −1, 𝑘𝑘 = −1, 𝑖𝑗 = 𝑘, 𝑗𝑘 = 𝑖, 𝑘𝑖 = 𝑗
12
(2.2)
Then, given a scalar 𝜁 and a unit vector 𝑏𝑖 + 𝑐𝑗 + 𝑑𝑘 satisfying 𝑏 2 + 𝑐 2 + 𝑑 2 = 1, the corresponding quaternion of four components is defined as
[𝑞𝑤
𝑞𝑥
𝑞𝑦
𝜁 𝜁 𝑞 ≜ cos + sin (𝑏𝑖 + 𝑐𝑗 + 𝑑𝑘) 2 2 = [1 𝑖 𝑗 𝑘][𝑞𝑤 𝑞𝑥 𝑞𝑦 𝑞𝑧 ]T 𝑞𝑧 ] ≜ [cos 𝜁 𝑏 sin 𝜁 𝑐 sin 𝜁 𝑑 sin 𝜁 ] 2 2 2 2
(2.3)
Similar to complex numbers, the conjugate of a quaternion, 𝑞 ∗ is defined as 𝜁 𝜁 𝑞 ∗ = cos − sin (𝑏𝑖 + 𝑐𝑗 + 𝑑𝑘) 2 2
(2.4)
The geometrical meaning of the above defined quaternion is that the quaternion triple product 𝑢 = 𝑞𝑣𝑞 ∗ rotates an arbitrary 3D vector 𝑣 = 𝑣1 𝑖 + 𝑣2 𝑗 + 𝑣3 𝑘 expressed in aframe about the direction 𝑏𝑖 + 𝑐𝑗 + 𝑑𝑘 by the angle 𝜁 (right-handed) to 𝑢 = 𝑢1 𝑖 + 𝑢2 𝑗 + 𝑢3 𝑘 expressed in a-frame. The product of multiplying two quaternions 𝑝 = 𝑝𝑤 + 𝑝𝑥 𝑖 + 𝑝𝑦 𝑗 + 𝑝𝑧 𝑘 and 𝑞 = 𝑞𝑤 + 𝑞𝑥 𝑖 + 𝑞𝑦 𝑗 + 𝑞𝑧 𝑘 using constraints specified in equation (2.2) can be written as follows. 𝑝𝑤 −𝑝𝑥 −𝑝𝑦 −𝑝𝑧 𝑞𝑤 𝑝𝑥 𝑝𝑤 −𝑝𝑧 𝑝𝑦 𝑞𝑥 𝑝𝑞 = [1 𝑖 𝑗 𝑘] 𝑝 𝑝𝑧 𝑝𝑤 −𝑝𝑥 𝑞𝑦 𝑦 𝑝 𝑥 𝑝 𝑤 ] [ 𝑞𝑧 ] [ 𝑝𝑧 −𝑝𝑦
(2.5)
Now let us look at the relation between a quaternion and a rotation matrix. Assume for a moment that the triple product 𝑞(⋅)𝑞 ∗ rotates all 3 unit vectors along a-frame’s axes to unit vectors along axes of another coordinate frame, e.g., b-frame, of the same origin. An equivalent way to describe the assumed operation is that each unit vector along b-frame’s axis transforms its coordinates expressed in b-frame to those expressed in a-frame. As a result, the assumption leads to the following relations. 13
[𝑖
𝑗
𝑘]𝐞𝑎𝑥𝑏 = 𝑞[𝑖
𝑗
𝑘][1 0
[𝑖
𝑗
𝑘]𝐞𝑎𝑦𝑏 = 𝑞[𝑖
𝑗
𝑘][0
[𝑖
𝑗
𝑘]𝐞𝑎𝑧𝑏 = 𝑞[𝑖
𝑗
𝑘][0 0 1]T 𝑞 ∗
0]T 𝑞 ∗
1 0]T 𝑞 ∗
where the unit vectors along b-frame’s axes expressed in a-frame are denoted by 𝐞𝑎𝑥𝑏 , 𝐞𝑎𝑦𝑏 , and 𝐞𝑎𝑧𝑏 . With these relations, it is easy to see that the assumed triple product 𝑞(⋅)𝑞 ∗ operating on a triplet of coordinates of a point expressed in b-frame, 𝐩𝑏 , converts the triplet 𝐩𝑏 to coordinates of the point expressed in a-frame 𝐩𝑎 . [𝑖 𝑗 𝑘]𝐩𝑎 = [𝑖
𝑗
𝑘][𝐞𝑎𝑥𝑏
𝐞𝑎𝑦𝑏
𝐞𝑎𝑧𝑏 ]𝐩𝑏 = 𝑞[𝑖 𝑗 𝑘]𝐩𝑏 𝑞 ∗
(2.6)
Then, using equation (2.1), it is easy to see that 𝐩𝑎 = 𝐑𝑎𝑏 𝐩𝑏 = [𝐞𝑎𝑥𝑏
𝐞𝑎𝑦𝑏
𝐞𝑎𝑧𝑏 ]𝐩𝑏
This means, the assumed triple product is actually equivalent to 𝐑𝑎𝑏 in effect. In this sense, the q involved in the assumed operation is designated as 𝑞𝑏𝑎 . Multiplying out equation (2.6), the relation between 𝐑𝑎𝑏 and 𝑞𝑏𝑎 is obtained. 2 𝑞𝑤 + 𝑞𝑥2 − 𝑞𝑦2 − 𝑞𝑧2 𝐑𝑎𝑏 = [ 2(𝑞𝑥 𝑞𝑦 + 𝑞𝑤 𝑞𝑧 ) 2(𝑞𝑥 𝑞𝑧 − 𝑞𝑤 𝑞𝑦 )
2(𝑞𝑥 𝑞𝑦 − 𝑞𝑤 𝑞𝑧 ) 2 𝑞𝑤 − 𝑞𝑥2 + 𝑞𝑦2 − 𝑞𝑧2 2(𝑞𝑦 𝑞𝑧 + 𝑞𝑤 𝑞𝑥 )
2(𝑞𝑥 𝑞𝑧 + 𝑞𝑤 𝑞𝑦 ) 2(𝑞𝑦 𝑞𝑧 − 𝑞𝑤 𝑞𝑥 ) ] 2 𝑞𝑤 − 𝑞𝑥2 − 𝑞𝑦2 + 𝑞𝑧2
(2.7)
For simpler notation, a quaternion 𝑞 = 𝑞𝑤 + 𝑞𝑥 𝑖 + 𝑞𝑦 𝑗 + 𝑞𝑧 𝑘 is often represented by a quadruple vector 𝐪. 𝐪 = [𝑞𝑤 𝑞𝑥 𝑞𝑦 𝑞𝑧 ]T 𝑞 = [1 𝑖 𝑗 𝑘]𝐪 The product of two quaternions 𝑝 = [1 𝑖
𝑗
𝑘]𝐩 and 𝑞 = [1 𝑖
(2.8) 𝑗
𝑘]𝐪 can be
represented with quadruple vectors by defining a ‘⊗’ operator following equation (2.5).
14
𝑝𝑤 −𝑝𝑥 −𝑝𝑦 −𝑝𝑧 𝑞𝑤 𝑝𝑥 𝑝𝑤 −𝑝𝑧 𝑝𝑦 𝑞𝑥 𝐩⊗𝐪= 𝑝 𝑝𝑧 𝑝𝑤 −𝑝𝑥 𝑞𝑦 𝑦 𝑝 𝑥 𝑝 𝑤 ] [ 𝑞𝑧 ] [ 𝑝𝑧 −𝑝𝑦
(2.9)
With equation (2.6), it follows that [
0 0 0 ∗ ] = [ 𝑎 𝑏 ] = 𝐪𝑎𝑏 ⊗ [ 𝑏 ] ⊗ 𝐪𝑎𝑏 𝐑𝑏 𝐩 𝐩𝑎 𝐩
which is often used in practice for rotating vectors with a quaternion. 2.2.3 Transformations: SO(3), SE(3), and Sim(3) This work involves nonlinear estimation with variables that do not belong to the Euclidean vector space (cf. Sections 3.4 and 5.4). If they are treated as quantities in the Euclidean space, issues arise in estimation. Take the example from [25], suppose a function, 𝑓(𝛚), of a 3D rotation parameterized by a vector 𝛚 ∈ ℝ3 , e.g., Euler angles, is to be minimized in the Euclidean space. min 𝑓(𝛚)
𝛚∈ℝ3
(2.10)
Then in each update step for 𝛚 (cf. equation (2.36)), a small increment 𝛅 ∈ ℝ3 is applied as 𝛚∗ ← 𝛚 + 𝛅. But the result of rotation by 𝛚 and rotation by 𝛅 is in general not equivalent to the rotation represented by their sum. That is, quantities like rotations do not concatenate like those in a Euclidean space. Such quantities and their concatenation are better modeled by Lie groups which are handy in photogrammetry and robotics. Basically, a Lie group is both a smooth manifold and a group with compatible group operations. A smooth manifold is Euclidean locally near each point, but its global structure is rather different. An illustrative smooth manifold is the surface of the Earth ellipsoid. In a local region, the Euclidean coordinates may be used for navigation and 15
surveying, but alternative coordinates are needed for long distances. On the other hand, as described in [25, 26], a group is an algebra structure defined on a set of elements, G, with a neutral element, 𝑖𝑑 ∈ 𝐺, and an operation ‘⊞’ that combines two elements in G into another element in G, i.e., ⊞: 𝐺 ⊞ 𝐺 → 𝐺 Due to the group operation ‘⊞’, every element 𝑎 ∈ 𝐺 has an unique inverse 𝑏 ∈ 𝐺 such that 𝑎 ⊞ 𝑏 = 𝑖𝑑 ∈ 𝐺. Since a Lie group is a smooth manifold, its structure near a point can be approximated to the first order by a tangent space. Intuitively, a tangent space is a Euclidean plane tangent to a point on the smooth manifold. The tangent space is closed under an algebraic operation, Lie bracket [26], therefore, the tangent space together with the operation is also known as the Lie algebra. The tangent space is often taken at the neutral element (the identity) because it exists in every Lie group. The significance of a tangent space at the identity is that it converts the group operation on the manifold close to the identity to a simple addition in a Euclidean space. Therefore, functions of quantities on a manifold can be linearly approximated at any point by functions of quantities in a Euclidean space. As such, existing methods of nonlinear estimation in Euclidean spaces (Section 2.3.1) can be generalized to quantities on manifolds. The Lie groups of interest in this text are matrix Lie groups whose elements can be represented with square matrices. For a matrix Lie group, the neutral element is the identity matrix, the group operation is the matrix multiplication, and the inverse of an element is the matrix inverse. The following gives a condensed review of three matrix 16
Lie groups used in this work, special orthogonal group SO(3), special Euclidean group SE(3), and similarity transformation group Sim(3) based on [25, 26]. All 3D rotation matrices make up the special orthogonal group SO(3) with the group operation being matrix multiplication and the inverse being matrix transpose. The smooth manifold formed by the group is associated with a tangent space at the identity denoted by so(3) whose elements are 3×3 skew symmetric matrices. A so(3) element has a minimal representation in ℝ3 . That is, a skew symmetric matrix is generated by a vector in ℝ3 with the hat operator, (∙)∧ . Conversely, the vee operator, (∙)∨ , maps a skew symmetric matrix to a 3D vector. 𝜔1 ∧ 0 𝛚 = [𝜔 2 ] = [ 𝜔 3 𝜔3 −𝜔2 (𝛚∧ )∨ = 𝛚 ∧
−𝜔3 0 𝜔1
𝜔2 −𝜔1 ] ∈ 𝑠𝑜(3) 0
(2.11)
The hat operator for a 3D vector is also denoted by a cross operator, ⌊∙⌋× , because the cross product of two 3D vectors, 𝐚, 𝐛 ∈ ℝ3 , is related to the skew symmetric matrix of 𝐚 as follows. 𝐚×𝐛 = ⌊𝐚⌋× 𝐛
(2.12)
The elements in so(3) and SO(3) are related by exponential and logarithm maps. The exponential map is the standard matrix exponential for matrix Lie groups. It takes an element of so(3) to SO(3). exp(𝛚∧ ) = 𝐈 +
sin(‖𝛚‖) ∧ 1 − cos(‖𝛚‖) ∧ 2 (𝛚 ) 𝛚 + ‖𝛚‖ ‖𝛚‖2
Its inverse is done by the logarithm map which takes a matrix 𝐑 in SO(3) to so(3).
17
(2.13)
𝟎3×3 log(𝐑) = { 𝐑 − 𝐑T 2√1 − 𝑑 2
𝑑=1 cos
−1 (𝑑)
𝑑 ∈ (−1,1)
with 𝑑 =
1 (trace(𝐑) − 1) 2
(2.14)
To make the logarithm map bijective, a common understanding is to limit its codomain to the open ball, i.e., ‖log(𝐑)∨ ‖ < 𝜋. Before looking into other matrix Lie groups, the importance of a tangent space at the identity is shown by the case of so(3). In the tangent space at the identity on the SO(3) manifold, the group operation close to the identity can be approximated with the addition (Figure 2.1). 𝐈 ⊞ exp(⌊𝛚⌋× ) = 𝐈 ⋅ exp(⌊𝛚⌋× ) ≈ 𝐈 + ⌊𝛚⌋×
(2.15)
where 𝛚 ∈ ℝ3 is assumed to have a small magnitude.
SO(3)
so(3)
Figure 2.1. The tangent space at the identity of SO(3), denoted by so(3), approximates the SO(3) manifold structure.
The rigid transformations in the 3D Euclidean space mentioned in Section 2.2.1 are actually members of a special Euclidean group in 3D space, SE(3). For the case of equation (2.1), the rigid transformation can be represented by an element, 𝐓𝑏𝑎 , of SE(3). 𝑏 𝐑𝑎𝑏 𝐩𝑎 𝑎 𝐩 𝑎 ( ) = 𝐓𝑏 ( ) with 𝐓𝑏 ≜ [ 𝟎1×3 1 1
18
𝐭 𝑎𝑏 ] 1
(2.16)
where the symbol of the translation part, 𝐭 𝑎𝑏 , is used interchangeably with 𝐩𝑎𝑏 , meaning coordinates of b-frame’s origin in a-frame. The group operation in SE(3) is common matrix multiplication and the inverse is the matrix inverse. For instance, the transformation between the c-frame and the b-frame, and that between the b-frame and the a-frames, can be concatenated.
[
𝐑𝑎𝑐 𝟎1×3
𝐓𝑐𝑎 = 𝐓𝑏𝑎 ⋅ 𝐓𝑐𝑏 ⇔ 𝐭 𝑎𝑐 𝐑𝑎𝑏 𝐭 𝑎𝑏 𝐑𝑏𝑐 ]=[ ][ 1 𝟎1×3 1 𝟎1×3
𝐑𝑎 𝐑𝑏 𝐭 𝑏𝑐 ]=[ 𝑏 𝑐 1 𝟎1×3
𝐑𝑎𝑏 𝐭 𝑏𝑐 + 𝐭 𝑎𝑏 ] 1
(2.17)
And the inverse of the transformation from the b-frame to the a-frame, 𝐓𝑏𝑎 , is as follows. (𝐓𝑏𝑎 )−1 = [
(𝐑𝑎𝑏 )T 𝟎1×3
−(𝐑𝑎𝑏 )T 𝐭 𝑎𝑏 ] 1
(2.18)
The associated tangent space at the identity, se(3), has elements that can be generated by vectors of ℝ6 with a hat operator. The vector in ℝ6 corresponding to a se(3) element is its minimal representation which can be written as a concatenation of two three-vectors, a translational and a rotational component, 𝛎 ∈ ℝ3 and 𝛚 ∈ ℝ3 , respectively. 𝛎 ∧ 𝛚∧ ( ) =[ 𝛚 𝟎1×3
𝛎 ] ∈ 𝑠𝑒(3) 0
(2.19)
The exponential map takes an element of se(3) to SE(3) as follows. 𝛎 ∧ exp(𝛚∧ ) exp (( ) ) = ( 𝛚 𝟎1×3 where 𝐕 = 𝐈 +
1−cos 𝜃 𝜃2
𝛚∧ +
𝜃−sin 𝜃 𝜃3
𝐕𝛎 ) ∈ 𝑆𝐸(3) 1
(2.20)
(𝛚∧ )2 and 𝜃 = ‖𝛚‖2 .
The SE(3) group can be generalized to the Sim(3) group by including a positive scale factor. An element of Sim(3), 𝐒𝑏𝑎 , can be viewed as the transformation between two reference frames, the a- and b-frame of different scales. It consists of a rotation, 𝐑𝑎𝑏 , a 19
translation, 𝐭 𝑎𝑏 , between the two frames as in 𝐓𝑏𝑎 ∈ 𝑆𝐸(3), and a scale factor, 𝑠𝑏𝑎 , meaning that a distance measured 1 in the b-frame is measured 𝑠𝑏𝑎 in the a-frame. 𝐒𝑏𝑎 ≜ [
𝑠𝑏𝑎 𝐑𝑎𝑏 𝟎1×3
𝐭 𝑎𝑏 ] 1
(2.21)
Its inverse is given by 𝐒𝑎𝑏 = [
(𝐑𝑎𝑏 )𝐓 /𝑠𝑏𝑎 𝟎
−(𝐑𝑎𝑏 )𝐓 𝐭 𝑎𝑏 /𝑠𝑏𝑎 ] 1
For completeness, the inverse of the scale 𝑠𝑏𝑎 is defined as 𝑠𝑎𝑏 = 1/𝑠𝑏𝑎 . The Sim(3) group is associated with a tangent space sim(3) at the identity. An element of sim(3) can be generated from its minimal representation, a vector of ℝ7 , with a hat operator. The minimal representation can be written as a concatenation of three parts, a translation vector, 𝛎 ∈ ℝ3 , a rotation vector, 𝛚 ∈ ℝ3 , and a scale, 𝜎 ∈ ℝ. 𝛎 ∧ 𝛚∧ + 𝜎𝐈3×3 (𝛚) = [ 𝟎1×3 𝜎
𝛎 ] ∈ 𝑠𝑖𝑚(3) 0
(2.22)
The exponential map that maps sim(3) to Sim(3) is given below as repeated from [25]. 𝛎 ∧ 𝑒 𝜎 exp(𝛚∧ ) 𝐖𝛎 exp ((𝛚) ) = ( ) ∈ 𝑆𝑖𝑚(3) 𝟎1×3 1 𝜎
(2.23)
where 𝑒𝜎 − 1 𝐴𝜎 + (1 − 𝐵)𝜃 ∧ 𝑒 𝜎 − 1 𝐴𝜃 + (𝐵 − 1)𝜎 𝐖=( )𝐈 + 𝛚 +( − 2 2 ) (𝛚∧ )2 𝜎 𝜃(𝜎 2 + 𝜃 2 ) 𝜎𝜃 2 𝜃 (𝜎 + 𝜃 2 ) with 𝐴 = 𝑒 𝜎 sin 𝜃, 𝐵 = 𝑒 𝜎 cos 𝜃, and 𝜃 = ‖𝛚‖2 . 2.3 Estimation Theories in SLAM Before looking into the estimation theories, the basics of differentiation on a Euclidean space is explained. Because a smooth manifold has a Euclidean tangent space at each 20
point, the differentiation on a manifold can be done in the tangent space with rules that apply to the Euclidean space. Given a differentiable, real-valued function 𝐹(𝑥1 , ⋯ , 𝑥𝑛 ) that maps a point 𝐱 = [𝑥1 , ⋯ , 𝑥𝑛 ]T in the ℝ𝑛 space to a scalar in ℝ, its gradient ∇𝐹 is the vector of 𝑛 first-order partial derivatives of 𝐹. 𝜕𝐹 𝜕𝐹 T ∇𝐹 ≜ ( ,⋯, ) 𝜕𝑥1 𝜕𝑥𝑛
(2.24)
The Jacobian generalizes the gradient for a vector-valued function 𝐟 which is a differentiable mapping between Euclidean spaces ℝ𝑛 and ℝ𝑚 . 𝑓1 (𝑥1 , ⋯ , 𝑥𝑛 ) ⋮ 𝐟=[ ] 𝑓𝑚 (𝑥1 , ⋯ , 𝑥𝑛 ) The Jacobian for the function is defined as 𝜕𝑓1 𝜕𝑥1 𝐉𝐟 ≜ ⋮ 𝜕𝑓𝑚 [ 𝜕𝑥1
𝜕𝑓1 𝜕𝑥𝑛 ⋱ ⋮ 𝜕𝑓𝑚 ⋯ 𝜕𝑥𝑛 ] ⋯
(2.25)
In this work, Jacobians are often required in the linearization step of a nonlinear estimation problem. The related functions are, in general, differentiable maps between two smooth manifolds. Whenever quantities on manifolds are involved during differentiation, the derivative is calculated in the tangent space for such quantities. If both the domain and codomain of a function are Euclidean spaces, the derivative can be calculated in the usual form because the tangent space coincides with the Euclidean space. 21
Jacobians are also involved in propagating the covariance of a vector variable. For two variables in Euclidean spaces, x and y, related by a nonlinear function, 𝐲 = 𝐟(𝐱), given the covariance of variable x, cov 𝐱, the covariance of 𝐲 can be approximated by the covariance propagation rule. cov 𝐲 = 𝐉 cov(𝐱) 𝐉 T with 𝐉 =
𝜕𝐟(𝐱) 𝜕𝐱
(2.26)
2.3.1 Nonlinear Optimization on Euclidean Spaces With the differentiation techniques described, the estimation theories are briefly reviewed. A problem in SLAM can often be formulated in the probabilistic state estimation manner: given the measurement vector 𝐳 and the likelihood function 𝑝(𝐳|𝐱), estimate the state represented by the parameter vector 𝐱. The most probable value of x maximizes the likelihood function, thus also known as the maximum likelihood estimate. Equivalently, it can be found by minimizing the negative log likelihood. arg max 𝑝(𝐳|𝐱) = arg min[− log(𝑝(𝐳|𝐱))] 𝐱
𝐱
(2.27)
In many cases, the likelihood function can be decomposed into many factors when observations in 𝐳 are conditionally independent. That is, an observation in 𝐳, given 𝐱, is independent of other observations. As a result, 𝑝(𝐳|𝐱) can be written as the product of 𝐾 likelihood distributions, 𝑝𝑘 (𝐳𝑘 |𝐱) , each distribution showing the likelihood of the observation 𝐳𝑘 given a subset of 𝐱. 𝐾
𝑝(𝐳|𝐱) = ∏ 𝑝𝑘 (𝐳𝑘 |𝐱) 𝑘
22
(2.28)
If these distributions, 𝑝𝑘 (𝐳𝑘 |𝐱), are assumed to be joint Gaussians which are often valid for problems in SLAM, the maximum likelihood problem (equation (2.27)) can be converted to a least squares problem with a cost function 𝑓(𝐱) as shown in equation (2.29). Here ‘ ∝ ’ means “is proportional to”, 𝚺𝑘 is the covariance matrix of the measurement 𝐳𝑘 , and given 𝐱, 𝐳̂𝑘 (𝐱) computes the predicted measurement (note it has no uncertainty). T
𝑝𝑘 (𝐳𝑘 |𝐱) ∝ exp −(𝐳𝑘 − 𝐳̂𝑘 (𝐱)) 𝚺𝑘−1 (𝐳𝑘 − 𝐳̂𝑘 (𝐱)) arg max 𝑝(𝐳|𝐱) = arg min 𝑓(𝐱) 𝐱
𝐱
𝐾
𝑓(𝐱) ≜ ∑(𝐳𝑘 −
(2.29) T 𝐳̂𝑘 (𝐱)) 𝚺𝑘−1 (𝐳𝑘
− 𝐳̂𝑘 (𝐱))
𝑘
In general, the least squares problem can be solved with the Gauss-Newton method or the Levenberg-Marquardt method [27] as shown below. Assume 𝐱 ∈ ℝ𝑛 , the least squares problem can be written as 𝐾
𝐾
arg min𝑛 𝑓(𝐱) = arg min𝑛 ∑ 𝑓𝑘 (𝐱) = arg min𝑛 ∑ 𝐞𝑘 (𝐱)T 𝚺𝑘−1 𝐞𝑘 (𝐱) 𝐱∈ℝ
𝐱∈ℝ
𝐱∈ℝ
𝑘
(2.30)
𝑘
where 𝐞𝑘 (𝐱) ≜ 𝐳𝑘 − 𝐳̂𝑘 (𝐱) is the error associated with each measurement 𝐳𝑘 , and thus has the same covariance, 𝚺𝑘 , as 𝐳𝑘 . Because 𝐞𝑘 (𝐱) is often nonlinear, the Gauss-Newton method actually minimizes 𝑓(𝐱) on a neighborhood of the current estimate, 𝐱̂, which is assumed to be close to the solution where 𝑓(𝐱) achieves a local minimum. arg min𝑛 𝑓(𝐱) → arg min𝑛 𝑓(𝐱̂ + δ𝐱) 𝐱∈ℝ
δ𝐱∈ℝ
(2.31)
Then the problem is linearized by approximating the error function with a first-order Taylor expansion at the current estimate 𝐱̂.
23
𝐞𝑘 (𝐱̂ + δ𝐱) ≈ 𝐞𝑘 (𝐱̂) + 𝐉𝑘 δ𝐱 Here 𝐉𝑘 ≜
𝜕𝐞𝑘 (𝐱̂+δ𝐱) 𝜕δ𝐱
|
δ𝐱=𝟎
(2.32)
. Thus, each cost function 𝑓𝑘 can be approximated as follows.
𝑓𝑘 (𝐱̂ + δ𝐱) = 𝐞T𝑘 (𝐱̂ + δ𝐱)𝚺𝑘−1 𝐞𝑘 (𝐱̂ + δ𝐱) ≈⏟ 𝐞T𝑘 𝚺𝑘−1 𝐞𝑘 + 2δ𝐱 T ⏟ 𝐉𝑘T 𝚺𝑘−1 𝐞𝑘 + δ𝐱 T ⏟ 𝐉𝑘T 𝚺𝑘−1 𝐉𝑘 δ𝐱 𝑐𝑘
𝐛𝑘
𝐇𝑘
= 𝑐𝑘 + 2δ𝐱 T 𝐛𝑘 + δ𝐱 T 𝐇𝑘 δ𝐱
(2.33)
In summary, the whole cost function can be approximated by 𝑓(𝐱̂ + δ𝐱) = ∑ 𝑓𝑘 (𝐱̂ + δ𝐱) 𝑘
≈ ∑ 𝑐𝑘 + 2δ𝐱 T ∑ 𝐛𝑘 + δ𝐱 T ∑ 𝐇𝑘 δ𝐱 ⏟𝑘 ⏟𝑘 ⏟ 𝑘 𝑐
𝐛
= 𝑐 + 2δ𝐱 T 𝐛 + δ𝐱 T 𝐇δ𝐱
(2.34)
𝐇
It can be minimized regarding δ𝐱 by the solution δ𝐱 ∗ to the following linear equation. 𝐇δ𝐱 ∗ = −𝐛
(2.35)
Here 𝐇 is the information matrix of the system which is the inverse of the covariance matrix as justified by equation (2.40). The value for 𝐱 which decreases the cost function from 𝑓(𝐱̂) is updated by 𝐱̂ ← 𝐱̂ + δ𝐱 ∗
(2.36)
This linearization and update procedure (equation (2.34-2.36)) is repeated several times in the Gauss-Newton algorithm (GNA) until some termination criteria are met. Usually the GNA requires that the iteration starts at an initial guess close to the solution of a local minimum. To make it robust to a poor initial guess, the Levenberg-Marquardt (LM) algorithm blends the GNA and gradient descent. The idea of gradient descent is to update 𝐱̂ in the negative gradient direction with a factor α controlling the step size. 𝐱̂ ← 𝐱̂ + δ𝐱 ∗ with δ𝐱 ∗ = −α∇𝑓(𝐱̂) 24
(2.37)
The blend of gradient descent and GNA (equation (2.38)) is done by introducing a damping factor 𝜆 into the linearized equation (2.35). A small 𝜆 makes the LM method behave like GNA, whereas large 𝜆 make it like the gradient descent. By adaptively change the damping factor in each iteration, the incremental step towards the solution is controlled. (𝐇 + 𝜆𝐈)δ𝐱 ∗ = −𝐛
(2.38)
Although robust in difficult situations, the LM method does not guarantee a global minimum just like GNA. More details of the LM method can be found in [27]. After the most probable estimate of 𝐱 is obtained, its covariance can be calculated with a first-order approximation assuming 𝐇 is full rank. A full rank 𝐇 means all the dimensions of 𝐱 is observable. In that case, the uncertainty of 𝐱̂ is captured by equation (2.35) up to a first-order approximation. Take covariance on both sides of equation (2.35), and use the definitions of 𝐇 , 𝐛 , and 𝚺𝑘 found in equations (2.34) and (2.30), the following is obtained. 𝐇cov(𝛿𝐱 ∗ )𝐇 T = ∑ 𝐉𝑘T Σ𝑘−1 cov(𝐞𝑘 )𝚺𝑘−1 𝐉𝑘 𝑘
= ∑ 𝐉𝑘T 𝚺𝑘−1 𝐉𝑘
(2.39)
𝑘
=𝐇 Notice that 𝐇 T = 𝐇, the covariance, 𝚺𝐱 , of 𝐱̂ is given by 𝚺𝐱 ≜ cov(𝐱̂) = cov(δ𝐱 ∗ ) = 𝐇 −1
25
(2.40)
2.3.2 Nonlinear Optimization on Manifolds The GNA procedure (equations (2.34-2.36)) assumes the sought-after solution 𝐱 is on a Euclidean space. A problem in SLAM (cf. Sections 3.4 and 5.4) often has a solution on a smooth manifold, ℳ. As a result, the minimization problem of equation (2.30) becomes 𝐾
arg min 𝑓(𝐱) = arg min ∑ 𝐞𝑘 (𝐱)T Σ𝑘−1 𝐞𝑘 (𝐱) 𝐱∈ℳ
𝐱∈ℳ
(2.41)
𝑘
To solve such least squares problems, the GNA procedure needs to be generalized to smooth manifolds. This is often done by a “lift-solve-retract” scheme. The application of this scheme in the Euclidean space is shown by equations (2.34-2.36) where the tangent space at a point 𝐱 coincides with the Euclidean space. For an optimization problem on a smooth manifold ℳ , the “lift” step involves a retraction, 𝑅𝐱̂ , which is a bijective mapping from an element in the tangent space at 𝐱̂ of a minimal representation, δ𝐱, to an element in the neighborhood of current estimate, 𝐱̂, on the manifold, 𝒩(𝐱̂). 𝑅𝐱̂ (δ𝐱) ∈ 𝒩(𝐱̂) ⊂ ℳ ̂ 𝑏𝑎 , its neighborhood on For instance, for an estimated rotation expressed with the DCM, 𝐑 the SO(3) manifold can be approximated by itself and the tangent space at the identity. ̂ 𝑏𝑎 can be expressed with 𝐑 ̂ 𝑏𝑎 and an element in An element, 𝐑𝑏𝑎 , in the neighborhood of 𝐑 the tangent space of a minimal representation, δ𝐱 ∈ ℝ𝟑 , i.e., ̂ 𝑏𝑎 ≈ (𝐈 + δ𝐱 ∧ )𝐑 ̂ 𝑏𝑎 𝐑𝑏𝑎 = 𝑅𝐱̂ (δ𝐱) = exp(δ𝐱 ∧ ) 𝐑 Because the tangent space locally resembles a Euclidean space, the optimization problem can be solved by the linearization technique in equation (2.35) which gives δ𝐱 ∗ that minimizes 𝑓(𝐱) in the neighborhood of 𝐱̂. 26
arg min 𝑓(𝐱) → arg min𝑛 𝑓(𝑅𝐱̂ (δ𝐱)) 𝐱∈ℳ
Δ𝐱∈ℝ
(2.42)
Finally, the “retract” step updates the current estimate by mapping the element in the tangent space of a minimal representation, δ𝐱 ∗ , to the neighborhood 𝒩(𝐱̂) on the manifold with the retraction function. 𝐱̂ ← 𝑅𝐱̂ (δ𝐱 ∗ )
(2.43)
The steps (equations (2.42) and (2.43)) are usually repeated a few times until a certain termination criterion is met as done in the classic Gauss-Newton framework. The retraction scheme used in equation (2.42), i.e., 𝐱 = 𝑅𝐱̂ (δ𝐱), can be viewed as a general definition of the error δ𝐱 on a smooth manifold given the true value 𝐱 and its estimated value 𝐱̂. In this sense, the “lift-solve-retract” scheme generalizes the error state model used in filtering techniques which is discussed in the following section. 2.3.3 Filtering If all the data are processed at once, i.e., batch processing, the Gauss-Newton or Levenberg-Marquardt method can be used to solve problems in the context of SLAM. However, batch approaches usually cannot operate in real time because the batch problem grows as new observations arrive. Therefore, approaches that can incorporate new observations close to real time are needed. To this end, sliding window filters, e.g., fixed-lag smoothers [28], incremental smoothing, e.g., iSAM2 [29], are typical solutions. Of them, the filtering approaches have been well studied. Therefore, its concept and the most popular variant, extended Kalman filter (EKF) is briefly reviewed. The evolution of a dynamic system can be described in the state space. A point in the state space represent one state of the system at a certain epoch which may have many 27
dimensions, hence called a state vector. States of a system are often not directly observable, but sequential states in time are often related by a known system process model and observations can be obtained for variables related to these states by a known observation model. As a result, with these observations, it may be possible to estimate relevant states. If a state at some epoch 𝑘, 𝐱 𝑘 is estimated with observations up to that epoch, 𝐳𝑖 , 𝑖 = 1, ⋯ , 𝑘 , this estimation procedure is often called filtering. In contrast, prediction refers to estimating a state beyond the time of the available observations, and smoothing means estimating a state with observations before and after its epoch. In terms of probabilistic estimation, filtering recursively estimates the posterior distribution or belief of the current state at 𝑘, 𝐱 𝑘 , given observations 𝐳𝑖 , 𝑖 = 1, ⋯ , 𝑘, received at each epoch up to 𝑘, i.e., 𝐳1:𝑘 . 𝑏𝑒𝑙(𝐱 𝑘 ) ≜ 𝑝(𝐱 𝑘 |𝐳1:𝑘 )
(2.44)
This problem can be converted to a recursive one by using the Bayesian rules and making the Markov assumption which specifies that future states are not influenced by (i.e., are independent of) previous states given the current state. Bayes
𝑏𝑒𝑙(𝐱 𝑘 ) ∝ 𝑝(𝐳𝑘 |𝐱 𝑘 , 𝐳1:𝑘−1 )𝑝(𝐱 𝑘 |𝐳1:𝑘−1 ) Markov = 𝑝(𝐳𝑘 |𝐱 𝑘 )𝑝(𝐱 𝑘 |𝐳1:𝑘−1 ) = 𝑝(𝐳𝑘 |𝐱 𝑘 )∫ 𝑝(𝐱 𝑘 |𝐱 𝑘−1 , 𝐳1:𝑘−1 )𝑝(𝐱 𝑘−1|𝐳1:𝑘−1)𝑑𝐱 𝑘−1 Markov = 𝑝(𝐳𝑘 |𝐱 𝑘 )∫ 𝑝(𝐱 𝑘 |𝐱 𝑘−1)𝑏𝑒𝑙(𝐱 𝑘−1 )𝑑𝐱 𝑘−1
(2.45)
If the prior distribution, 𝑏𝑒𝑙(𝐱 𝑘−1 ), is Gaussian, and the noises affecting 𝑝(𝐱 𝑘 |𝐱 𝑘−1) and 𝑝(𝐳𝑘 |𝐱 𝑘 ) are independent Gaussians, then 𝑏𝑒𝑙(𝐱 𝑘 ) is proportional to a Gaussian distribution shown in equation (2.46). For the following derivation in this section, 𝐱 𝑘 is assumed to be in a Euclidean space unless otherwise stated. 28
𝑏𝑒𝑙(𝐱 𝑘 )
Gaussian
∝
T
exp −(𝐳𝑘 − 𝐳̂𝑘 (𝐱 𝑘 )) 𝚺𝑘−1 (𝐳𝑘 − 𝐳̂𝑘 (𝐱 𝑘 )) T
−1 ⋅ exp −(𝐱 𝑘 − 𝚽(𝐱 𝑘−1 )) 𝐏𝑘|𝑘−1 (𝐱 𝑘 − 𝚽(𝐱 𝑘−1 ))
(2.46)
where 𝐳̂𝑘 (𝐱 𝑘 ) is the measurement function which predicts the measurement at k given current state 𝐱 𝑘 , 𝚺𝑘 is the covariance of measurement 𝐳𝑘 , 𝚽(𝐱 𝑘−1 ) is the system process function which predicts current state 𝐱 𝑘 from the previous one, and 𝐏𝑘|𝑘−1 is the predicted covariance of 𝐱̂ 𝑘 = 𝚽(𝐱 𝑘−1 ) given covariance of 𝐱 𝑘−1, 𝐏𝑘−1. As such, the maximum a posteriori (MAP) estimate of 𝐱 𝑘 can be found by minimizing the cost function shown below. T
−1 𝑓(𝐱 𝑘 ) ≜ (𝐱 𝑘 − 𝚽(𝐱 𝑘−1 )) 𝐏𝑘|𝑘−1 (𝐱 𝑘 − 𝚽(𝐱 𝑘−1 )) T
+ (𝐳𝑘 − 𝐳̂𝑘 (𝐱 𝑘 )) 𝚺𝑘−1 (𝐳𝑘 − 𝐳̂𝑘 (𝐱 𝑘 ))
(2.47)
Its first part keeps the current estimate close to its prediction by the system process model, the second part tries to minimize the error between predicted and actual observations. To solve for the MAP estimate for equation (2.47), the extended Kalman filter (EKF) follows exactly the procedure detailed in equations (2.34-2.36), i.e., the “liftsolve-retract” scheme. The first “lift” step linearizes the nonlinear problem in the state space to a linear problem in the tangent space at the current estimate. Specifically, the best initial guess of 𝐱 𝑘 is predicted from the state at k-1 by the process model, 𝐱̂ 𝑘 = 𝚽(𝐱 𝑘−1 ) . Let 𝐞1 (𝐱 𝑘 ) ≜ 𝐱 𝑘 − 𝚽(𝐱 𝑘−1 ) and 𝐞2 (𝐱 𝑘 , 𝐳𝑘 ) ≜ 𝐳𝑘 − 𝐳̂𝑘 (𝐱 𝑘 ) , following equation (2.32), their Jacobians 𝐉1 and 𝐉2 are given by 𝜕𝐞1 (𝐱̂ 𝑘 + δ𝐱) | =𝐈 𝜕δ𝐱 δ𝐱=𝟎 𝜕𝐞2 (𝐱̂ 𝑘 + δ𝐱) 𝐉2 ≜ | 𝜕δ𝐱 δ𝐱=𝟎 𝐉1 ≜
29
(2.48)
By definitions in equation (2.33), the residual vectors are 𝐛1 = 𝟎, and 𝐛2 = 𝐉2T 𝚺𝑘−1 𝐞2, and −1 the components of the information matrix are 𝐇1 = 𝐏𝑘|𝑘−1 , 𝐇𝑘 = 𝐉2T 𝚺𝑘−1 𝐉2. With these,
equation (2.35) becomes −1 (𝐏𝑘|𝑘−1 + 𝐉2T 𝚺𝑘−1 𝐉2 )δ𝐱 𝑘∗ = −𝐉2T 𝚺𝑘−1 𝐞2
(2.49)
where δ𝐱 𝑘∗ is the step taken to minimize the cost function 𝑓(𝐱 𝑘 ) in equation (2.47). Then the linearized problem is solved for the increment δ𝐱 𝑘∗ . Since the increment represents the difference between the estimated 𝐱 𝑘 and its true value, it is also known as the error state [19]. Its exact form can be obtained by the matrix inversion lemma [30]. δ𝐱 𝑘∗ = 𝐊𝐞2 −1 𝐊 ≜ −𝐏𝑘|𝑘−1 𝐉2T (𝚺𝑘 + 𝐉2 𝐏𝑘|𝑘−1 𝐉2T )
(2.50)
Here 𝐊 is the Kalman gain. With δ𝐱 𝑘∗ , the “retract” step updates the estimate 𝐱̂ 𝑘 as below. 𝐱̂ 𝑘 ← 𝐱̂ 𝑘 + δ𝐱 𝑘∗
(2.51)
Finally, in view of equation (2.40) and (2.49), the covariance of the state estimate 𝐱̂ 𝑘 , 𝐏𝑘 can be obtained by −1 𝐏𝑘 = (𝐏𝑘|𝑘−1 + 𝐉2T 𝚺𝑘−1 𝐉2 )
−1
= (𝐈 + 𝐊𝐉2 )𝐏𝑘|𝑘−1
(2.52)
The above procedure can be generalized to cases with 𝐱 𝑘 on a smooth manifold by replacing 𝐱̂ 𝑘 + δ𝐱 with a retraction function 𝑅𝐱̂ (δ𝐱) that maps an error state δ𝐱 to a state close to 𝐱̂ on the manifold. In contrast to nonlinear optimization techniques, e.g., GNA, that leverage repeated linearization, a filtering approach usually performs only one linearization in one update step using the latest observation (equations (2.49)-(2.52)). 30
2.4 IMU Models To aid localization and mapping, this work uses IMUs found in mobile devices. These sensors are usually built with the micro electro-mechanical system (MEMS) technology. The IMU models in this section are mainly concerned with such inertial sensors. An inertial sensor measures its movement relative to the inertial frame. Examples are accelerometers and gyroscopes. An accelerometer measures the linear acceleration due to real forces (also known as specific forces), e.g., support or traction force. A gyroscope measures the angular rotation rate in the inertial space. An IMU usually contains an orthogonal triad of three accelerometers sensing linear accelerations on each axis and an orthogonal triad of three gyroscopes sensing angular rates on each axis. Because the unit of sensors is often firmly mounted to a platform, e.g., a smartphone, this physical arrangement of the IMU and platform is called strapdown mechanization [19]. The strapdown sensors experience the whole range of dynamics of the platform, e.g., shock and vibration, compared to the stabilized configuration where the IMU sensors are isolated from the platform motion by a set of gimbals. However, it greatly reduces weight, cost, size, and maintenance, as a result, strapdown IMUs are widely used in commercial and military products. MEMS IMUs are affected by a range of errors [22]. The calibration done in factory usually provides a model written in hardware for the deterministic (repeatable) biases, scale factors, and misalignment (axes non-orthogonality cross-coupling effects), and temperature dependencies. For the stochastic parts including turn-on bias, in-run bias,
31
acceleration white noise, and angular rate white noise, the factory calibration provides their statistics, e.g., power spectral density (PSD) and standard deviation. In using inertial measurements, a common practice assumes that they are affected by biases and white noises, and the deterministic parts are well-calibrated, for instance, [31, 32]. In this case, assume the body frame is the perfect nominal IMU frame with the origin at the intersection of three accelerometer axes, the accelerometer measurement 𝐚𝑚 and gyroscope measurement 𝛚𝑚 by an IMU are composed by 𝐚𝑚 = 𝐚𝑏𝑠 + 𝐛𝑎 + 𝐧𝑎 𝛚𝑚 = 𝛚𝑏𝑖𝑏 + 𝐛𝑔 + 𝐧𝑔
(2.53)
where 𝐚𝑏𝑠 is the acceleration caused by the specific force (subscript ‘s’) in the body frame (superscript ‘b’), 𝛚𝑏𝑖𝑏 is the angular rate of the body frame (subscript ‘b’) relative to the inertial frame (subscript ‘i’) expressed in the body frame (superscript ‘b’), 𝐛𝑎 and 𝐧𝑎 are the bias and Gaussian white noise on three accelerometer axes (subscript ‘a’), and 𝐛𝑔 and 𝐧𝑔 are the bias and Gaussian white noise on three gyro axes (subscript ‘g’). These biases are often assumed to follow random walk processes driven by Gaussian white noises 𝐧𝑏𝑎 , 𝐧𝑏𝑔 for accelerometers and gyroscopes, respectively. 𝐛̇𝑎 = 𝐧𝑏𝑎 𝐛̇𝑔 = 𝐧𝑏𝑔
(2.54)
This straightforward model of errors, may be inadequate for low-grade inertial sensors on consumer mobile devices. For these devices, there may be conspicuous residual scale factor error and misalignment even after the factory calibration. Following the approach proposed in [14], these errors affecting accelerometers and gyroscopes can be modelled as follows. 32
Considering bias, scale factor error, and misalignment, the linear acceleration, 𝐚𝑚 , measured by the accelerometer triad can be expressed as 𝐚𝑚 = 𝐓𝑎 𝐚𝑏𝑠 + 𝐛𝑎 + 𝐧𝑎
(2.55)
From the right side, 𝐛𝑎 and 𝐧𝑎 are the bias and Gaussian white noise on three accelerometer axes, 𝐚𝑏𝑠 is the acceleration caused by the specific force in the body frame, and 𝐓𝑎 is a 3×3 matrix projecting 𝐚𝑏𝑠 onto the sensing directions of accelerometers. Its diagonal elements, roughly speaking, represent the scale factors on each axis, thus should be close to 1. On the other hand, its off-diagonal elements represent the misalignment effect, thus should be close to zero. Note if the body frame is defined to have a known constant orientation relative to an exteroceptive sensor, i.e., a camera, all 9 parameters in 𝐓𝑎 need to be calibrated for an accelerometer triad (Figure 2.2-left). Otherwise, for instance, if the body frame is defined to have its 𝑥-axis aligned with one accelerometer, and its 𝑦-axis in the plane spanned by this accelerometer and another, only 6 parameters in 𝐓𝑎 needs to be calibrated [33]. That is, 𝐓𝑎 only has 6DoF including 3DoF for scales and 3DoF for the misalignment effect (Figure 2.2-right).
33
z'
z' z
z
y
y y'
o Accelerometer triad
x'
o Accelerometer triad
x
x
Figure 2.2. Parameterization of the misalignment effect depends on definition of the IMU body frame (identified by the ‘x-y-z’ axes triad). For the left graph, the body frame (bframe) is defined relative to an external coordinate frame, therefore the misalignment between the b-frame and accelerometer triad (‘ax-ay-az’) has 6DoF and can be parameterized by, e.g., 6 small angles, α, β, γ, ζ, ψ, and ϕ. Note x’- and y’-axis are on the x-y plane and z’-axis is on the y-z plane. In contrast, on the right graph, the body frame is defined relative to the accelerometer triad such that x-axis is along ax-axis and y-axis is on the plane spanned by ax- and ay-axis. As such, the misalignment has only 3DoF and can be parameterized by, e.g., 3 small angles, α, β, and γ.
Considering bias, scale factor error, misalignment, and g-sensitivity, the angular rate measurement 𝛚𝑚 by the gyroscope triad is modeled as 𝛚𝑚 = 𝐓𝑔 𝛚𝑏𝑖𝑏 + 𝐓𝑠 𝐚𝑏𝑠 + 𝐛𝑔 + 𝐧𝑔
(2.56)
From the right side, 𝐛𝑔 and 𝐧𝑔 are the bias and Gaussian white noise on three gyro axes, 𝐚𝑏𝑠 is the acceleration caused by the specific force, and 𝐓𝑠 is a 3×3 matrix representing the dependence of the gyroscope measurement on the acceleration (i.e., g-sensitivity, hence subscript ‘s’), 𝛚𝑏𝑖𝑏 is the angular rate of the body frame relative to the inertial frame expressed in the body frame, and 𝐓𝑔 represents the scale factors and misalignment in the gyroscope triad, similarly to 𝐓𝑎 . 34
2.5 Camera Models This work uses image sequences or videos captured by rear cameras on consumer mobile devices, such as smartphones and tablets. These cameras often have a relatively narrow angle of view (~70°) which brings the benefit of reduced distortion in contrast to cameras with a wide-angle lens or fisheye lens. Therefore, the optical imaging process can be adequately modeled by the ideal pinhole perspective projection model, augmented by a distortion model. The following text describes both the pinhole model and distortion model following [34, 35]. The ideal pinhole projection is widely used in the computer vision community. It is depicted in Figure 2.3. In reality, the image plane refers to the imaging sensor chip behind the lens. An analog charge-coupled device (CCD) or an active complementary metal-oxide-semiconductor (CMOS) sensor are most commonly used to convert the received light to electrical signals. In Figure 2.3, the image plane is inverted for easy notation and calculation. The line passing through the lens’ optical center and perpendicular to the image plane is called the principal axis. The principal axis intersects the image plane at the principal point. The distance between the principal point and the optical center when the lens is focused at infinity is the well-known focal length of the lens. In this work the term “focal length” is generalized to refer to the distance between the image plane and the optical center, following [36]. This saves confusion with many similar terms, such as principal distance, effective focal length.
35
In the pinhole model, the optical center acts like a pinhole. It projects a world point expressed in the camera frame 𝐩𝑖 = [𝑥, 𝑦, 𝑧]T to the normalized point 𝐱 𝑛 on the normalized image plane at 𝑧 = 1. 𝑥𝑛 𝑥/𝑧 𝐱𝑛 = [ ] = [𝑦 ] 𝑦/𝑧 𝑛
(2.57)
A real camera lens deviates from the ideal perspective model with distortions, i.e., the normalized image point, 𝐱 𝑛 , is displaced. Two types of distortions, radial and tangential distortions, are often exhibited by camera lens. Of these, tangential distortion is often negligible. In view of the varying impact of the two distortion types on different lens, different approaches [37] have been proposed to account for the lens’ distortion. The most commonly used one is the inverse polynomial model, also called Brown’s model, which is implemented in [35, 38]. The distorted coordinates of the projected point on the normalized image plane, 𝐱 𝑑 , is the normalized image coordinates, 𝐱 𝑛 , augmented with radial and tangential distortions expressed by polynomials. 𝑥𝑑 𝑥𝑛 𝑥𝑛 𝐱 𝑑 ≜ [𝑦 ] = [𝑦 ] + (𝑘 ⏟ 1 𝑟 2 + 𝑘2 𝑟 4 + 𝑘3 𝑟 6 ) [𝑦 ] 𝑑 𝑛 𝑛 𝑟𝑎𝑑𝑖𝑎𝑙 𝑑𝑖𝑠𝑡𝑜𝑟𝑡𝑖𝑜𝑛
2𝑝 𝑥 𝑦 + 𝑝2 (𝑟 2 + 2𝑥𝑛2 ) +[ 1 𝑛 𝑛 ] ⏟2𝑝2 𝑥𝑛 𝑦𝑛 + 𝑝1 (𝑟 2 + 2𝑦𝑛2 )
(2.58)
𝑡𝑎𝑛𝑔𝑒𝑛𝑡𝑖𝑎𝑙 𝑑𝑖𝑠𝑡𝑜𝑟𝑡𝑖𝑜𝑛
where 𝑟 ≜ √𝑥𝑛2 + 𝑦𝑛2 , (𝑘1 , 𝑘2 , 𝑘3 ) and (𝑝1 , 𝑝2 ) are the radial and tangential coefficients for lower order distortion terms, respectively. The higher order terms are left out due to their negligible effects. In practice, 𝑘3 is often set to zero for narrow angle-of-view lens, e.g., those in cameras of common smartphones.
36
Once distortion is applied, the distorted image coordinates 𝐱 𝑑 is related to the actual pixel coordinates on the imaging sensor at 𝑧 = 𝑓 by the intrinsic parameters. The common intrinsic parameters in a pinhole model are the focal length (𝑓𝑥 , 𝑓𝑦 ) and coordinates of the principal point (𝑐𝑥 , 𝑐𝑦 ) in pixel units. Suppose the size of a pixel in the horizontal and vertical direction (Figure 2.3, 𝑢, 𝑣 axes, respectively) are 𝑑𝑥 , 𝑑𝑦 , then the focal length in 𝑓
𝑓
pixel units are (𝑓𝑥 = 𝑑 , 𝑓𝑦 = 𝑑 ). The intrinsic parameters may also include a skew 𝑥
𝑦
factor in case the pixels are not rectangles. But this is very uncommon with cameras on mobile devices. Together these intrinsic parameters map the distorted image coordinates (𝑥𝑑 , 𝑦𝑑 )T to the actual pixel coordinates 𝐳 = (𝑢, 𝑣)T . 𝑐𝑥 𝑓𝑥 𝑢 [ ] = [𝑐 ] + [ 0 𝑣 𝑦
0 𝑥𝑑 ][ ] 𝑓𝑦 𝑦𝑑
(2.59)
Written in the homogeneous form, these parameters go into the camera calibration matrix, 𝐊. 𝑢 𝑓𝑥 [𝑣 ] = [ 0 1 ⏟0
0 𝑓𝑦 0
𝑐𝑥 𝑥𝑑 𝑐𝑦 ] [𝑦𝑑 ] 1 1
(2.60)
𝐊
In this work, a visually distinct point in the world, i.e., a landmark, is often located as a keypoint in an image with some feature extraction method, leading to the measured projection of this point. This keypoint measurement is assumed corrupted by independent Gaussian noise at each dimension.
37
Figure 2.3. An overview of the coordinate frames and the camera model used in this work. The pinhole camera model projects a real-world point pi to a point zi on the image plane at focal length f, i.e., z=f. Notice the image plane is inverted for ease of calculation. The body frame is usually defined as the nominal IMU frame based on the camera frame. c That is, the relative rotation between the body frame and camera frame, Rb, is assumed fixed to calibrated values obtained offline. The camera and IMU in a smartphone often have the shown relative orientation. Moreover, the pose of the sensor assembly relative to g the global frame is denoted by Tb.
Another common property of cameras on mobile devices is the digital zoom. Cameras with zoom capabilities can zoom in and out to change the angle of view and hence level of details in the image. This zoom function is often realized with a group of lens, e.g., zoom lens, in many consumer cameras. That is a typical realization of the optical zoom where the lens may move in a large distance range. Such a large range is often unaffordable for thin mobile devices; therefore, they use the digital zoom for which zooming in just enlarges what is already captured and does not add information. The lack
38
of optical zoom ability of mobile devices actually removes the zoom dependence [39] for photogrammetric applications. These cameras on mobile devices also have many features that complicate the problem if their images are used for localization, notably auto focus, auto exposure, white balancing, and rolling shutter. About 40% of cameras on mobile devices can change their focus with the so-called auto focus technique [40]. The auto focus technique moves the lens along the optical axis, so that the real scene is focused clearly on the imaging sensor, but a varying focal length significantly complicates the motion estimation with images. Auto exposure varies the sensor sensitivity, aperture and/or shutter speed depending on lighting condition. It may exaggerate motion blur in dim areas. White balancing manipulates colors of pixels to make a picture look natural. It may adjust the pixel intensities strangely in light transition zones. Therefore, these capabilities of a camera are expected to be turned off when it is used to collect images for localization. Cameras on compact mobile devices often uses CMOS sensors which have lower fabrication cost and power consumption [41] compared to CCDs. A CMOS sensor commonly uses an electronic rolling shutter readout scheme for taking both still images and videos [41]. With the scheme, the sensor output is read out row by row from the top to the bottom. Empirically, a row aligns with the longer side of the sensor chip. The pixels in a row are read out simultaneously, and there is a regular delay, 𝑑, i.e., line delay, for reading a subsequent row, as shown in Figure 2.4. Before reading happens, the pixels are reset row by row with the same regular delay. After reset, as light falls on a pixel over the exposure time, it accumulates charge until the 39
reading. Because the pixel values in an image are obtained at different times, the relative motion between the camera and an object will cause distortion on the object’s image, see Figure 2.5. If such images are used for localization, this rolling shutter effect needs to be compensated. To this end, many techniques estimate the camera pose for each row by fitting between poses of camera frames [42], or by integrating the inertial data from an IMU [43]. Of these techniques, the first step is to obtain the timestamp for a pixel. Assume the line delay 𝑑 is constant, the time to read out an image frame of 𝑁 rows is roughly 𝑡𝑟 = 𝑁𝑑. And the epoch to read out a pixel at (𝑢, 𝑣) is 𝑁 𝑁 𝑡𝑟 𝑡 = 𝑡𝑜 + (𝑣 − ) 𝑑 = 𝑡𝑜 + (𝑣 − ) 2 2 𝑁
(2.61)
𝑁
where 𝑡𝑜 is the time to read out the 2 th row of the image. 𝑡𝑜 is related to the image timestamp 𝑡𝑖 by a latency 𝑑𝑐 , i.e., 𝑡𝑖 = 𝑡𝑜 + 𝑑𝑐 . Because 𝑑𝑐 cannot be estimated with only a camera, it may be assumed zero, e.g., [42]. However, with an external sensor, e.g., an IMU, a relative time delay between the timestamps of images and inertial data can be 𝑁
estimated as illustrated in Figure 2.6. For example, the inertial data indexed 2 and the 2 th row of the image indexed 𝑖1 are sampled at the same epoch 𝑡0 per the camera-IMU sensor clock. Upon logging, the inertial data indexed 2 is timestamped with 𝑡2 = 𝑡0 + 𝑑𝑖𝑚𝑢 due to a delay 𝑑𝑖𝑚𝑢 . A different latency 𝑑𝑐 occurs with the image 𝑖1 which is timestamped with 𝑡𝑖1 = 𝑡0 + 𝑑𝑐 . In general, these two latencies may be assumed constant, but they are unobservable with only the visual and inertial data. Fortunately, their difference 𝑡𝑑 is observable and can be estimated by correlation techniques, e.g., [44]. 𝑡𝑑 = 𝑑𝑖𝑚𝑢 − 𝑑𝑐 = 𝑡2 − 𝑡𝑖1 40
(2.62)
In other words, to make timestamps of both sensors consistent, the image timestamps need to be corrected by 𝑡𝑑 relative to timestamps of the inertial data. For the above example, the corrected time, 𝑡̂𝑖1 , for image 𝑖1 is obtained from its raw timestamp, 𝑡𝑖1 , and the relative time delay to the inertial data, 𝑡𝑑 . 𝑡̂𝑖1 = 𝑡𝑖1 + 𝑡𝑑
(2.63)
time reset and begin exposure
frame k
end exposure
frame k+2
frame k+1
Row 0 Row 1
sweep order time reset and begin exposure
sensor chip
Row 0 Row 1
frame Row N-1k
end exposure
frame k+2
frame k+1
exposure time
frame time row number
sweep order
sweep order
Figure 2.4. In the electronic rolling shutter scheme, the sensor values are reset and read out row by row from top to bottom with regular delays. Pixels in a row are read out simultaneously. The sum of delays between rows is the frame time or readout time. Row N-1 frame time sensor chip
exposure time
row number
sweep order
object horizontal motion relative to sensor
object horizontal motion relative to sensor
global shutter
global shutter
rolling shutter
rolling shutter
Figure 2.5. Rolling shutter skew effect. As the red rectangular object moves fast horizontally in front of the sensor, a camera with a global shutter may capture its normal shape with motion blur, but a camera with a rolling shutter often captures a skewed shape. 41
Sampling epochs
Inertial data 1 2 3
4
5
6
7
8
9
10
time Images 𝑖1
𝑖2
𝑑𝑖𝑚𝑢 1
𝑖3
Logging epochs 2
3
4
5
6
7
8
9
10
time
𝑑𝑐 𝑡𝑑 𝑖1
𝑖2
𝑖3
Figure 2.6. The relation between latencies of inertial data and images. The upper graph shows epochs at which the inertial data and images are sampled. After a latency of dimu, the inertial data are recorded and timestamped. Similarly, the images are timestamped with a latency of dc, as shown in the lower graph. Though both latencies are unobservable with only visual and inertial data, their difference td is.
2.6 Summary This chapter briefly reviews the notation convention, coordination frame transformations, estimation theories, and IMU and camera models. As this work involves terms that may have specific meanings in fields other than SLAM, for clarity, Table 2.1 lists their definitions observed by this text. Also, commonly used symbols are summarized in Table 2.2.
42
Table 2.1. Definitions of terms used in this work Point feature Landmark Keyframe
or A visually salient 3D point in the world A selected image frame in which new landmarks are triangulated so that enough landmarks are found in subsequent frames 2D measurement in an image of a landmark 3D position and 3DoF (degrees of freedom) attitude A collection of frames and observed landmarks representing the structure of the environment A graph composed of pose vertices and edges of pose constraints Distance between the imaging sensor and the lens’ optical center In this work, it refers to the electronic rolling shutter readout scheme [41] used in CMOS cameras of compact consumer products, e.g., smartphones In this work, an inertial measurement unit consists of at least a three-axis accelerometer and a three-axis gyroscope, or three orthogonal accelerometers and three orthogonal gyroscopes, thus giving 6DoF measurements. This is in contrast to a “reduced” IMU which has fewer sensing axes
Keypoint Pose Map Pose graph Focal length Rolling shutter
IMU
Table 2.2. Symbols commonly used in this text 𝑔
𝐑𝑘 𝒈
𝒒𝒌 𝒈 𝒈 𝐭 𝒌 or 𝐩𝒌 𝛚𝒄𝒂𝒃 𝒈
𝒈
𝐑 𝐭 = [ 𝒌 𝒌] 𝟎 𝟏 𝒈 𝒈 𝒈 𝒈 𝒔𝒌 𝐑 𝒌 𝐭 𝒌 𝐒𝒌 = [ ] 𝟎 𝟏 𝒈 𝐓𝒌
3DoF rotation that transforms a 3D vector expressed in the k-frame to that in the g-frame 𝒈 Quaternion representation of 𝐑 𝒌 Origin of the k-frame expressed in the g-frame, 𝒈 𝒈 note 𝐭 𝒌 and 𝐩𝒌 are used interchangeably The angular rate of b-frame relative to a-frame expressed in c-frame The rigid transformation from the k-frame to the g-frame The similarity transformation from the k-frame to the g-frame
43
Chapter 3: Collaborative SLAM with Crowdsourced Visual Data
This chapter studies collaborative SLAM with crowdsourced monocular data. The problem is, given monocular data collected by multiple mobile devices, how to build maps of the environment using the collective data (mapping) and estimate motion of each device with respect to these maps (localization). Consumer cameras used daily in high volumes have paved the way for crowdsourcing visual data, e.g., images and videos. Such a large amount of visual data is valuable for localization and mapping [36]. To this end, existing approaches often use the large-scale structure from motion (SfM) technique [45] to do offline processing. Offline SfM requires not only much processing power, but also an initial estimate of pose for every frame [45]. Alternatively, as the visual data are streamed from, for example, multiple cameras, they can be processed concurrently in real time [4] with the collaborative simultaneous localization and mapping (SLAM). The benefits of collaborative SLAM compared to SfM for crowdsourced visual data are as follows: (1) Maps are generated in real time. Here, a map refers to a collection of frames of 3D positions and 3D orientations and points of 3D positions representing the structure of the environment. (2) A user can localize by using maps built by others.
44
(3) SfM usually works with unordered image collections, e.g., from the internet. These images often have diverse camera intrinsic characteristics that are difficult to estimate. In contrast, collaborative SLAM works with image sequences or video clips. Because many images share the same camera model, it is easier to estimate the model and to obtain accurate poses of images. Moreover, collaborative SLAM generates pose estimates for images at a fraction of the time used by offline SfM methods, e.g., AgiSoft PhotoScan [46], and thus is suitable for real-time applications. Considering these benefits, a collaborative SLAM framework is proposed following the client-server paradigm; the client referring to the processing unit at the user, the server being the master that can be hosted in a cloud computing network. Besides these processing units, the framework also requires a communication capability between clients and the server. In execution of the framework, there can be several independent clients, each working with image frames captured by a camera. With these frames, each client estimates the motion of the camera and sends selected frames (i.e., keyframes) to the server. Initially, the server generates a map for each client. As overlapping scenes are found between maps of two clients, the two maps are merged, and afterwards, the merged map is updated using keyframes from both clients. If overlapping scenes are found within one map, the map is updated with a loop closure technique. Moreover, the server can send refined poses to clients so that these clients have better estimates of camera motion relative to the collaboratively built maps on the server.
45
The proposed framework to work with crowdsourced visual data has the following properties. (1) Visual data are crowdsourced by forward-looking cameras on mobile devices. Given sequential images from a camera, a monocular visual odometry (VO) algorithm deployed on a client estimates the motion of the camera up to a scale. The algorithm deals well with forward-looking cameras if there are enough point features in the scene. (2) Maps corresponding to individual clients can be merged if the same scene is observed by multiple clients. For the resulting merged map, these clients can access and extend it at the same time. (3) The size of the map grows only if a new area is explored. (4) As a previously visited location is recognized and a loop constraint is formed within a map, the loop closure technique is used to update the map. To this end, a robust step-by-step approach is proposed which optimizes different subsets of variables involved in the pose graph. To validate these properties, the proposed framework is evaluated with tests on various datasets. With a pose graph that exhibits a significant drift in the scale factor, the robustness of the stepwise loop closure is verified. Then, tests on the real world KITTI dataset [47] demonstrate that the proposed framework can concurrently process images collected by forward-looking cameras. Moreover, tests on outdoor datasets crowdsourced with smartphones show that the size of the incrementally created map is proportional to the explored area, and accurate collaborative SLAM can be done with crowdsourced data. 46
The rest of the chapter is organized as follows. Section 3.1 reviews existing research about the collaborative SLAM. Section 3.2 overviews the proposed framework. Its integral components, clients and the server, are illustrated in Section 3.3 and 3.4, respectively. Section 3.5 reports the experiments and results on real world datasets. Finally, Section 3.6 concludes the chapter with a summary. 3.1 Related Work Solving the collaborative monocular SLAM problem relies on many concepts and techniques developed in the SLAM community. Numerous monocular VO and SLAM approaches have appeared in the last decade. In general, VO focuses on estimating motion whereas SLAM considers both estimating motion and creating maps. At the beginning, monocular SLAM has been realized with filtering approaches using sparse image features, e.g., monoSLAM (monocular SLAM) [48]. As computation power grows, approaches based on iterative nonlinear optimization prevails, e.g., PTAM (parallel tracking and mapping) [49] and ORB-SLAM [50]. These approaches are referred to as indirect approaches because they use matched sparse image features as observations whereas direct approaches directly use image intensity as observations. Direct approaches for SLAM and VO have also been actively investigated, e.g., DTAM (dense tracking and mapping) [51], LSD-SLAM (large scale direct SLAM) [52], and DSO (direct sparse odometry) [53]. Though direct approaches often result in visually appealing dense 3D maps, as pointed out in [53] they may be outperformed by indirect approaches in case of strong geometric noise, e.g., due to the rolling shutter effect or coarse intrinsic calibration. Since such geometric noise often occurs with cameras on 47
commercial mobile devices, for them, indirect approaches may offer superior performance. Collaborative monocular SLAM takes monocular SLAM or VO one step further. As visual data are collected by cameras in different sessions, collaborative monocular SLAM concurrently processes the collective data to enhance the accuracy of localization and mapping. One early research effort close to this direction is PTAMM (parallel tracking and multiple mapping) [54]. It could create multiple maps of several places with independent cameras, and relocalize a camera into these maps as it revisited these places. Later, CoSLAM (collaborative SLAM) developed by [55] processed synchronized videos from multiple cameras to localize these cameras and to build a map including dynamic and static points. Its idea originated from the observation that a dynamic object observed by two cameras at the same time constrained their relative pose. As a result, the motion of a camera that observed mainly dynamic scenes or underwent nearly pure rotation could be reliably estimated if it had a sufficient overlapping view with another camera that mainly observed static objects. This method was limited in that it did not reuse the existing map when revisiting a place, and that the same scene had to be observed by all cameras at the beginning. Then, a collaborative SfM (CSfM) system was proposed by [4] for downward-facing cameras mounted on micro aerial vehicles (MAVs). Each MAV run a VO algorithm to estimate its own motion, and all the outputs from these MAVs were transmitted to and processed by a ground station that hosted the CSfM system. More recently, a cloud-based collaborative 3D online reconstruction framework was proposed by [56] which employed mobile phones (clients) and a cloud server. With 48
monocular images, a client estimated its own motion and created sparse point clouds with LSD-SLAM [52]. Its outputs, i.e., keyframes and pose constraints, were used by the server to build a session model (map) for that client. As overlaps were found, session models could be fused into a global model. The framework’s major feature was to adapt the computation and transmission workload on the client based on available resources, e.g., battery, connectivity, and CPU power. However, it did not support reusing existing maps on the server. In other words, the global model kept growing even when the same area was revisited. The proposed framework in this work is closely related to [4], as a similar client-server paradigm is used. The major differences are as follows: (1) On the client, this work uses a VO algorithm that works well with crowdsourced data of forward-looking cameras (Section 3.3). (2) On the server, this work uses a loop detection technique based on the ORB features [57], and a robust stepwise pose graph optimization (Section 3.4.3). Moreover, loop closure and global bundle adjustment (GBA) [36] are performed independent of processing keyframes from clients (Section 3.4.1), simplifying thread management upon loop closure. It is well-known that monocular VO is prone to failing with rapid rotations and few features in the scene [50]. Therefore, a failure recovery mechanism is necessary to ensure tracking continuity. This problem is often solved with a relocalization technique, e.g., [49], but it is not effective in exploratory tasks in which a location is rarely revisited. In such situations, the VO needs to be reinitialized. [58] kept the poses on both sides (before 49
and after the failure) topographically connected, assuming high uncertainty of the relative motion. Because this uncertainty is difficult to quantify, our proposed approach simply starts a new map with subsequent frames and leaves poses on both sides disconnected. The severed maps may be merged if loops are detected between them in the future. With this approach, very few data are left unused during failures of the VO. 3.2 Overview of the Framework The proposed collaborative SLAM framework is formulated in a client-server structure (Figure 3.1) which is designed to achieve a low bandwidth requirement, compact map representation, and real-time operation. As shown in Figure 3.1, several clients work on independent data input from multiple cameras. With the image frames coming from a camera, a client tracks the camera motion with respect to a map on the client by using a monocular VO algorithm. As tracking proceeds, the client chooses keyframes to extend the client map. Old keyframes are removed from the client map and transmitted to the server in the format of keyframe messages. A keyframe message contains the pose of the keyframe, and observations of point features in the keyframe.
50
Keyframe message Keyframe pose, features
Loop closing thread
Visual vocabulary
Monocular VO Client 1
…
Monocular VO Client n
Keyframe handler thread
…
Keyframe handler thread
Keyframe database
SLAM maps
Server
Figure 3.1. The collaborative SLAM framework follows a client-server structure. Each client runs a monocular visual odometry algorithm. The collective outputs from n clients, i.e., keyframes messages, are processed on the server to create SLAM maps. Optionally, the server can transmit information to a client to update its pose estimate.
As keyframes arrive at the server, they are processed by a handler thread associated with each client. A handler essentially maintains a map on the server for a client. Besides that, a handler also selectively keeps keyframes and sends them in sequence to the loop closing thread. The loop closing thread detects commonly viewed areas between incoming keyframes and existing ones in the maps on the server, using a visual vocabulary trained offline [59] and a keyframe database which records the occurrences of visual words of the vocabulary (i.e., image features) in keyframes. As revisits (loops) are found, this thread updates the related maps with the pose graph optimization [60] and GBA [36], which optimizes all variables at once with all available observations and constraints. In addition, the server can send relevant location and map information to the client. In this way, by using data contributed by all clients, a client obtains a better knowledge of its pose and environment.
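As a rough illustration of this division of labor, the server can be organized as one handler thread per client plus a single loop closing thread fed through a queue; the interfaces below are hypothetical, not the actual implementation.

```python
# Sketch of the server-side threading layout: one keyframe handler per client
# and one shared loop-closing thread fed through a queue (hypothetical interfaces).
import threading
import queue

class Server:
    def __init__(self, handlers, loop_closer):
        self.handlers = handlers            # one KeyframeHandler per client
        self.loop_queue = queue.Queue()     # keyframes selected for loop detection
        self.loop_closer = loop_closer

    def start(self):
        for h in self.handlers:
            threading.Thread(target=self.run_handler, args=(h,), daemon=True).start()
        threading.Thread(target=self.run_loop_closing, daemon=True).start()

    def run_handler(self, handler):
        while True:
            msg = handler.next_keyframe_message()   # blocks until the client sends one
            keyframe = handler.process(msg)         # track to the local map, extend the map
            if keyframe is not None:                # selectively forward keyframes
                self.loop_queue.put(keyframe)

    def run_loop_closing(self):
        while True:
            keyframe = self.loop_queue.get()
            loop = self.loop_closer.detect(keyframe)    # visual vocabulary + keyframe database
            if loop is not None:
                self.loop_closer.close(loop)            # pose graph optimization / GBA
```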
With the outline of the framework presented, the following sections describe the workflow and implementation details of a client and the server.

3.3 Operations on a Client

As frames from a camera arrive, the associated client estimates the motion of the camera up to an unknown scale factor. To this end, any VO method that has bounded computational complexity can be employed, for instance, the modified PTAM (parallel tracking and mapping) by [61]. This work adapted the ORB-SLAM by [50] because it worked well with datasets of forward-looking cameras. The adapted ORB-SLAM for VO (ORB-VO from now on) is briefly reviewed in the following, along with the changes to the ORB-SLAM. Except for these changes, the reader is referred to the original ORB-SLAM [50] for details of the components of ORB-VO. The ORB-VO estimates the camera motion by solving two connected problems in parallel, tracking and mapping (Figure 3.2). In the implementation, each problem is handled by a thread.
[Figure 3.2 diagram: client-side flowchart with a tracking thread (read frame, extract ORB features, initialize a map, track the last frame or relocalize, track the local map, select keyframes, create more points, output the front keyframe when #keyframes > N, reinitialize when #relocalization failures > F) and a mapping thread (cull recent points, create new points, local BA, cull local keyframes).]
Figure 3.2. Overview of the ORB (a type of image feature)-VO (visual odometry) algorithm deployed on a client. It employs a tracking thread to track the camera motion with respect to a map, and a mapping thread to refine the map. In the diagram, BA is short for bundle adjustment, N is the maximum number of keyframes kept in the VO, and F is the maximum number of relocalization failures until a new VO session is initialized. Steps of the workflow are explained in the following text.
The tracking thread has several tasks: initialize a map, track a frame (i.e., estimate camera motion), relocalize a frame, and select keyframes. At the beginning, the tracking thread attempts to initialize a map from two frames considering their geometrical configuration. Once a map is initialized, subsequent frames are tracked to the last frame if tracking has been successful. Otherwise, they are relocalized with respect to keyframes in the map. If
either relocalization or tracking to the last frame is successful, the current estimate of the camera pose is refined by tracking the current frame to a local map, that is, a few keyframes that have overlapping views with the current frame and their associated points. As the current frame moves away from the existing map and meets certain criteria (observing more than 15 points, but fewer than either 200 points or 90% of the points visible in the keyframe that shares the most points with the current frame), it is chosen as a new keyframe to extend the map. As new keyframes arrive, the mapping thread is responsible for triangulating new points, optimizing the map with bundle adjustment (BA), and culling bad points and redundant keyframes. For a new keyframe, more points will be created by triangulating matched feature observations between the keyframe under consideration and its neighboring keyframes that observe common points. Then, a robust bundle adjustment over this local neighborhood of the map is performed, and some observations of points may be marked as outliers and removed. Points left with too few observations are culled, and a keyframe is deemed redundant and culled if a large portion (over 90%) of its observed points are well observed in other keyframes. To meet the need for bounded computation and to reduce tracking failures, ORB-VO differs from ORB-SLAM in the following aspects: (1) No loop closing: The loop closing capability is disabled so that a client runs VO in real time. (2) Reinitialize VO upon too many tracking failures: As reviewed earlier (Section 3.2), relocalization is called when tracking fails. If the number of consecutive
failures exceeds a threshold, F (10 in tests), a new VO session will start, i.e., tracking and mapping are reinitialized. (3) Create points in the tracking thread: It was observed that in ORB-SLAM, tracking often fails when the mapping thread has many keyframes awaiting processing and old landmarks are no longer visible in recent frames. This often happens when the camera undergoes a rapid rotation or there are few features in the scene. To reduce occurrences of such tracking failures, some new points are created in the new keyframe as it is selected in the tracking thread [8]. These points are triangulated from the matched feature observations between the new keyframe and its reference keyframe, that is, the keyframe having the most commonly visible points with it. Naturally, this reference keyframe is skipped when creating points in the mapping thread. (4) Export keyframes as messages: To bound computation in the long run, if the number of keyframes in the map on the client exceeds a preset size N (100 in tests), the front keyframe (i.e., the oldest keyframe) will be removed from the client map and transmitted to the server. This keyframe, together with the culled keyframes and points, is then sent to a trash data structure which is cleared once every iteration of the tracking thread. The delayed clearance is needed because these data may still be in use in the tracking thread. The transmitted keyframe is in the format of a keyframe message. One message contains the following (see the sketch after this list):
- Position of the keyframe $k$ expressed in the global frame of the map on the client (identified by the superscript 'c'), $\mathbf{t}^c_k$, and orientation of the keyframe $k$ with respect to that global frame, $\mathbf{R}^c_k$. $[\mathbf{R}^c_k, \mathbf{t}^c_k]$ transforms a 3D point $j$ expressed in the coordinate frame of keyframe $k$ to the client global frame, i.e., $\mathbf{p}^c_j = \mathbf{R}^c_k \mathbf{p}^k_j + \mathbf{t}^c_k$. Transmitting a keyframe's pose in the client global frame instead of its relative pose to a previously transmitted keyframe (e.g., [4]) is robust to lost messages during disruptive communication.
- Keypoints in the keyframe.
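Below is a minimal sketch of what such a keyframe message might look like, together with the size estimate used in the next paragraph; the field names are illustrative rather than the actual wire format.

```python
# Illustrative keyframe message layout and size estimate (field names are
# hypothetical; the actual format follows OpenCV keypoints and ORB descriptors).
from dataclasses import dataclass
import numpy as np

KEYPOINT_BYTES = 28     # one OpenCV keypoint
DESCRIPTOR_BYTES = 32   # one ORB descriptor

@dataclass
class KeyframeMessage:
    R_ck: np.ndarray         # 3x3 rotation of keyframe k in the client global frame
    t_ck: np.ndarray         # 3x1 position of keyframe k in the client global frame
    keypoints: np.ndarray    # n x (keypoint fields), packed as in OpenCV
    descriptors: np.ndarray  # n x 32 ORB descriptors (uint8)

    def approx_size_bytes(self) -> int:
        n = len(self.keypoints)
        return n * (KEYPOINT_BYTES + DESCRIPTOR_BYTES)

# With 2000 features: 2000 * (28 + 32) = 120,000 bytes, i.e., roughly 120 KB per message.
```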
As such, message transmission requires only a low communication bandwidth. For example, if 2000 features are detected in a keyframe, each represented by a 28-byte OpenCV keypoint [62] and described by a 32-byte ORB feature descriptor [57], the size of a keyframe message is roughly (28 + 32) × 2000 = 120 KB. For a camera that captures frames at 30 Hz, keyframes are usually generated by ORB-VO at less than 10 Hz (the maximum rate in our experiments was 8.13 Hz). Even at 10 Hz, the required communication bandwidth is only 10 × 120 KB = 1.2 MB/s = 9.6 Mbps without any data compression (the wireless network in the SPIN (satellite positioning and inertial navigation) Lab has a speed greater than 50 Mbps).

3.4 Operations on the Server

Using the keyframe messages sent in by all clients, the server refines the poses of these keyframes, builds maps in a collaborative fashion, and sends pose feedback to the clients. Its workload is carried out by multiple keyframe handlers and a loop closing thread. A keyframe handler refines the poses of incoming keyframes, triangulates points from feature observations in selected keyframes, and optimizes a local neighborhood of the corresponding map. Optionally, a handler can provide pose feedback to the associated
client. The loop closing thread, on the other hand, detects visual overlaps between incoming keyframes and existing keyframes, which may lead to loops, and incorporates loop constraints into the maps through pose graph optimization.

3.4.1 Keyframe Handler

New keyframes from a client are processed by the keyframe handler responsible for that client. As handlers are associated with independent clients, multiple handler threads can run in parallel on the server, leveraging a multi-core processor. A handler carries out several tasks (Figure 3.3), many of which are common to the monocular VO on a client. To begin with, a handler thread is initialized when the first keyframe of a client arrives. As the second keyframe comes along, a map composed of the two keyframes and the points triangulated from the matched point observations in both is constructed for the handler on the server. This is straightforward because a keyframe already carries its pose estimated by the monocular VO. Another situation that causes a new map to be constructed for the handler is when a new session of VO begins on the client (Section 3.3). In that case, the handler disconnects from the existing map and starts working on the new map. With an existing map on the server, a subsequent keyframe sent to the handler can be tracked to the local map similarly to the ORB-VO (Figure 3.3 middle column). First, the pose of the current keyframe with respect to the map on the server is calculated, using the pose estimate in the keyframe message and the scale ratio (equation (3.4)). Then, a few keyframes that have overlapping views with the current keyframe are retrieved from the existing map. The 3D points observed in these keyframes, representing the local map, are projected to the current keyframe and matched to its 2D point features.
These 3D-2D point matches are used to optimize the pose of the current keyframe by a least squares method that minimizes the reprojection error of visible points in the keyframe [50].
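As a simplified sketch of this step, the pose can be refined by nonlinear least squares over the reprojection error, assuming a pinhole camera with known intrinsics and omitting the robust weighting used in practice (SciPy stands in for the actual solver):

```python
# Sketch of refining a keyframe pose from 3D-2D matches by minimizing the
# reprojection error (pinhole model, no robust weighting); an illustration,
# not the actual solver used in the framework.
import numpy as np
from scipy.optimize import least_squares

def rodrigues(w):
    """Axis-angle vector to rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def reprojection_residuals(x, pts3d, pts2d, K):
    """x = [axis-angle (3), translation (3)] mapping map points into the camera frame."""
    R, t = rodrigues(x[:3]), x[3:]
    pc = (R @ pts3d.T).T + t            # map points in the camera frame
    uv = (K @ pc.T).T
    uv = uv[:, :2] / uv[:, 2:3]         # perspective division
    return (uv - pts2d).ravel()

def refine_pose(x0, pts3d, pts2d, K):
    res = least_squares(reprojection_residuals, x0, args=(pts3d, pts2d, K))
    return rodrigues(res.x[:3]), res.x[3:]
```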
[Figure 3.3 diagram: server-side flowchart with a loop closing thread (detect loop, compute constraint, fuse map points / merge maps, optimize pose graph / global BA, suspend and wake up handlers) and a keyframe handler thread (track the local map, estimate the scale difference, keep keyframe?, add new map points, local BA and cull keyframes, sleep t_d), connected to the monocular VO clients, the SLAM maps, the keyframe database, and the visual vocabulary.]
Figure 3.3. Overview of the workflow on the server. The server has a loop closing thread (left column) which detects loops and merges maps, and possibly many keyframe handler threads (only one is shown) which process keyframe messages sent in by clients performing monocular visual odometry (VO). Note BA is short for bundle adjustment, and in the shown handler thread, td is an adaptive time interval to make it work in real time.
Now for the current keyframe, two estimates of its pose are available: one estimated by the monocular VO with respect to the client map, $[\mathbf{R}^c_k, \mathbf{t}^c_k]$, and shipped in the keyframe message; the other just estimated by the least squares method with respect to a map on the server, $[\mathbf{R}^s_k, \mathbf{t}^s_k]$ (the superscript 's' signifies the global frame of the considered map on the server). In general, the two estimates have different scales, because the motion of the camera and the structure of the environment are estimated from the monocular visual data up to an unknown and unobservable scale factor which drifts over time without loop closure [4]. This scale difference (called the scale ratio from now on) needs to be compensated when calculating the pose of an incoming keyframe with respect to the existing map on the server (equation (3.4)). To this end, the scale ratio estimation technique proposed by [4] is used (Figure 3.3 middle column). Briefly, the scale ratio is measured by comparing the relative translation in the map on the server with that in the client map. Then, this measurement is used to update the current estimate of the scale ratio.

Specifically, for the current keyframe $k$ and the last received keyframe $j$, their relative motion in the client map, $[\tilde{\mathbf{R}}^j_k, \tilde{\mathbf{t}}^j_k]$ (here $\tilde{\mathbf{R}}^j_k, \tilde{\mathbf{t}}^j_k$ represent the pose of keyframe $k$ in $j$, and the accent '~' signifies a variable calculated with values in the client map), is

$$\tilde{\mathbf{R}}^j_k = (\mathbf{R}^c_j)^{\mathrm T} \mathbf{R}^c_k, \qquad \tilde{\mathbf{t}}^j_k = (\mathbf{R}^c_j)^{\mathrm T} (\mathbf{t}^c_k - \mathbf{t}^c_j) \tag{3.1}$$

where $(\mathbf{R}^c_j)^{\mathrm T}$ transforms the relative translation $\mathbf{t}^c_k - \mathbf{t}^c_j$ into the camera frame of keyframe $j$. Their relative motion in the server map, $[\hat{\mathbf{R}}^j_k, \hat{\mathbf{t}}^j_k]$ (the accent '^' signifies a variable calculated with values in the server map), is

$$\hat{\mathbf{R}}^j_k = (\mathbf{R}^s_j)^{\mathrm T} \mathbf{R}^s_k, \qquad \hat{\mathbf{t}}^j_k = (\mathbf{R}^s_j)^{\mathrm T} (\mathbf{t}^s_k - \mathbf{t}^s_j) \tag{3.2}$$

The ratio of the two relative translations gives the scale ratio measurement, $\tilde{\lambda}_k = \|\hat{\mathbf{t}}^j_k\| / \|\tilde{\mathbf{t}}^j_k\|$. With this measurement, the scale ratio is updated from its last estimate $\lambda_j$ with a smoothing factor $\alpha$ (empirically set to 0.2) by

$$\lambda_k = \lambda_j + \alpha(\tilde{\lambda}_k - \lambda_j) \tag{3.3}$$

For the subsequent keyframe $l$, the estimated scale ratio $\lambda_k$ is used to calculate its pose $[\mathbf{R}^s_l, \mathbf{t}^s_l]$ with respect to the map on the server (equation (3.4)), which is then refined by the least squares method for pose optimization.

$$\mathbf{R}^s_l = \mathbf{R}^s_k \tilde{\mathbf{R}}^k_l, \qquad \mathbf{t}^s_l = \mathbf{t}^s_k + \lambda_k \mathbf{R}^s_k \tilde{\mathbf{t}}^k_l \tag{3.4}$$
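For illustration, the scale ratio update and pose propagation of equations (3.1)-(3.4) can be transcribed into a short NumPy sketch; the helper names are hypothetical and this is a simplification of the actual handler code.

```python
# Sketch of the scale ratio update (equations (3.1)-(3.3)) and pose propagation
# (equation (3.4)); a direct transcription, not the actual handler implementation.
import numpy as np

ALPHA = 0.2  # smoothing factor of equation (3.3)

def relative_motion(R_j, t_j, R_k, t_k):
    """Pose of keyframe k expressed in the frame of keyframe j."""
    R_jk = R_j.T @ R_k
    t_jk = R_j.T @ (t_k - t_j)
    return R_jk, t_jk

def update_scale_ratio(lam_j, client_j, client_k, server_j, server_k):
    """Equations (3.1)-(3.3): measure and smooth the client-to-server scale ratio.
    Each pose argument is a (R, t) pair in the respective map."""
    _, t_client = relative_motion(*client_j, *client_k)   # in the client map
    _, t_server = relative_motion(*server_j, *server_k)   # in the server map
    lam_meas = np.linalg.norm(t_server) / np.linalg.norm(t_client)
    return lam_j + ALPHA * (lam_meas - lam_j)

def propagate_pose(server_k, client_rel_kl, lam_k):
    """Equation (3.4): predict keyframe l's pose in the server map from keyframe k."""
    R_sk, t_sk = server_k
    R_kl, t_kl = client_rel_kl           # relative motion k -> l from the client map
    R_sl = R_sk @ R_kl
    t_sl = t_sk + lam_k * (R_sk @ t_kl)
    return R_sl, t_sl
```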
After the scale difference is updated, a keyframe is kept if the number of its visible points falls below 90% of those in its reference keyframe or below a threshold (50 in the experiments). In contrast to the monocular VO, even when no map point is found in the current keyframe, the keyframe is still kept because it has a good pose estimate from the monocular VO. For the preserved keyframe, new points will be triangulated between it and its neighboring keyframes to extend the map on the server. Meanwhile, it will be sent to the loop closing thread for detecting revisited areas and closing loops (Figure 3.3 middle column). At this point, one iteration of the handler thread is complete. If no keyframe is waiting to be processed by the handler thread and the map that the handler works on has not been bundle adjusted in the local neighborhood (i.e., not yet locally converged), a local bundle adjustment is performed to refine the variables in the local
neighborhood of the latest keyframe (Figure 3.4). Afterwards, redundant keyframes in the neighborhood are culled similarly to the ORB-VO (Figure 3.3 right column). Because keyframes are culled and the local map is used when tracking a keyframe in the handler, keyframes added to the map in previous visits are reused, which prevents adding redundant keyframes. As a result, the map grows only when new areas are explored.
[Figure 3.4 diagram: the local map of the current keyframe comprises core/local keyframes, peripheral keyframes, landmarks observed by the current keyframe and by the core keyframes, and their multiple feature observations.]
Figure 3.4. Local neighborhood of the current keyframe used in the local bundle adjustment.
As feedback, the server can inform a client of the refined pose estimate of its last sent keyframe $k$, $[\mathbf{R}^s_k, \mathbf{t}^s_k]$, and the scale ratio estimate $\lambda_k$. Using equations (3.1) and (3.4), the client can then estimate its pose with respect to the map on the server, which is more accurate. Notably, the client should not transform its own poses and map with the information from the server, since doing so would render equations (3.1)-(3.4) invalid for subsequent keyframes sent to the server. A similar problem was encountered in [63] and solved with a technique that did not transform the existing estimates and map.
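A minimal sketch of how a client might exploit this feedback, assuming it caches the server pose $[\mathbf{R}^s_k, \mathbf{t}^s_k]$ and scale ratio $\lambda_k$ of its last sent keyframe (the function and variable names are illustrative):

```python
# Sketch of using the server feedback on the client: the client keeps its own map
# untouched and only re-expresses its current pose in the server frame by combining
# the fed-back [R_sk, t_sk, lambda_k] with equations (3.1) and (3.4).
import numpy as np

def pose_in_server_frame(feedback, client_kf_k, client_pose_now):
    """feedback: (R_sk, t_sk, lam_k) for the last sent keyframe k (from the server).
    client_kf_k: (R_ck, t_ck) of that keyframe in the client map.
    client_pose_now: (R_c, t_c) of the current frame in the client map."""
    R_sk, t_sk, lam_k = feedback
    R_ck, t_ck = client_kf_k
    R_c, t_c = client_pose_now
    # Relative motion from keyframe k to the current frame, in the client map (eq. (3.1)).
    R_k_now = R_ck.T @ R_c
    t_k_now = R_ck.T @ (t_c - t_ck)
    # Re-express in the server frame using the fed-back scale ratio (eq. (3.4)).
    R_s = R_sk @ R_k_now
    t_s = t_sk + lam_k * (R_sk @ t_k_now)
    return R_s, t_s
```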
3.4.2 Loop Closing

Processing keyframes from all handlers in a sequential manner, the loop closing thread (Figure 3.3 left column) detects and closes loops in the maps as a place is revisited. It uses the technique presented in ORB-SLAM to detect visually similar keyframes and to compute the relative pose between a pair, as briefly reviewed here. In detecting a loop, every incoming keyframe (e.g., the current keyframe) queries the keyframe database to retrieve visually overlapping keyframes [59]. The visual overlap between two keyframes is measured by a similarity score which grows as the commonly seen area increases. The lowest similarity score $s_{min}$ between the current keyframe and its neighbors is used as a threshold to discard retrieved keyframes that have scores lower than $s_{min}$. The rest of the retrieved keyframes, excluding neighbors of the current keyframe, are considered loop candidates. If three consecutive loop candidates for the incoming keyframes are connected as neighbors, a loop is detected between the last loop candidate (called the matched keyframe from now on) and the current keyframe (Figure 3.5b). As the two keyframes observe many 3D points, coordinated in different camera frames, that actually correspond to the same points in the real world, a similarity transformation from the matched keyframe (subscript 'm') to the current one (superscript 'c'), $\tilde{\mathbf{S}}^c_m$, can be found with the Umeyama algorithm [64] embedded in the RANSAC (random sample consensus) scheme [65]. If there are enough inlier correspondences, the loop is accepted for loop closure (Figure 3.5c, Figure 3.6b).

To begin the loop closure, the keyframe handler threads that are working on the maps to which the two keyframes belong are suspended for safe operation on the maps. Then matching points are searched for and fused between the neighboring keyframes of the matched keyframe and those of the current keyframe (Figure 3.5d), effectively attaching the loop. If the matched keyframe and the current keyframe belong to two maps, the map of the current keyframe will be transformed with $\tilde{\mathbf{S}}^c_m$ and merged into that of the matched keyframe (Figure 3.6c). Otherwise, i.e., the loop is within one map, the pose graph optimization is performed over the graph of neighboring keyframes (Figure 3.5e) while incorporating the constraint of $\tilde{\mathbf{S}}^c_m$.

Because loop closure changes the scale of the map, the scale ratio estimates of the relevant handlers need to be updated so that the poses of subsequent keyframes sent to these handlers can be calculated properly (equation (3.4)). Specifically, in loop closure, the similarity transformation $\tilde{\mathbf{S}}^c_m$ from the matched keyframe to the current keyframe consists of a rotation, $\tilde{\mathbf{R}}^c_m$, a translation, $\tilde{\mathbf{t}}^c_m$, and a scale, $\tilde{s}^c_m$ (here the accent '~' signifies that these values are observations obtained by the loop closing thread):

$$\tilde{\mathbf{S}}^c_m = \begin{bmatrix} \tilde{s}^c_m \tilde{\mathbf{R}}^c_m & \tilde{\mathbf{t}}^c_m \\ \mathbf{0} & 1 \end{bmatrix} \tag{3.5}$$
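For illustration, the following sketch estimates such a similarity transformation from matched 3D points with a self-contained Umeyama solver wrapped in a basic RANSAC loop; the thresholds and iteration counts are placeholders, and the actual system additionally verifies the loop with feature correspondences before accepting it.

```python
# Sketch of estimating a similarity transformation (scale s, rotation R, translation t)
# between matched 3D points of two keyframes with the Umeyama method inside a basic
# RANSAC loop; a simplified stand-in for the actual routine.
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity aligning src to dst (both n x 3): dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # enforce a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_sim3(src, dst, iters=200, thresh=0.1, min_inliers=20,
                rng=np.random.default_rng(0)):
    best = None
    for _ in range(iters):
        idx = rng.choice(len(src), 3, replace=False)      # minimal sample
        s, R, t = umeyama(src[idx], dst[idx])
        err = np.linalg.norm((s * (R @ src.T)).T + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() >= min_inliers and (best is None or inliers.sum() > best[1].sum()):
            best = ((s, R, t), inliers)
    if best is None:
        return None                                       # loop rejected
    return umeyama(src[best[1]], dst[best[1]])            # refit on all inliers
```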
Figure 3.5. Loop closure within a simulated map of keyframes (represented by triangles with different sizes indicating scale changes) and points (crosses) by the pose graph optimization. The global frame is aligned with the camera frame of the first keyframe, and the camera is assumed to follow a planar motion. For clarity, only a few points of interest are drawn, and the fields of view of keyframes of interest are shown with black dashed lines. (a) True motion starts from the red circle and returns. (b) Estimated motion in the map on the server shows a growing scale; note the scale may decrease as well. As keyframes 19-21 are processed by the loop closing thread, keyframes 4, 2, and 1 are the loop candidates (red lines) for keyframes 19, 20, and 21, respectively. Because the three loop candidates view common scenes and thus are connected as neighbors, a loop is detected between the current keyframe 21 and the matched keyframe 1. (c) The detected loop is accepted (red solid line) after the RANSAC (random sample consensus) procedure. Matched points between the two neighborhoods of the two keyframes are linked by red dashed lines. (d) Prior to the pose graph optimization, the current keyframe and its two neighbors are corrected with the estimated similarity transformation, and matched points are fused. (e) The map on the server after the pose graph optimization.
Applying $\tilde{\mathbf{S}}^c_m$ in the map merging case (Figure 3.6) roughly means that the positions of the keyframes in the map of the current keyframe are divided by $\tilde{s}^c_m$, and the same should be done for subsequent keyframes sent to any handler working on this map. In the case of pose graph optimization (Figure 3.5), applying $\tilde{\mathbf{S}}^c_m$ roughly means that the positions of the keyframes close (time-wise) to the current keyframe are divided by $\tilde{s}^c_m$, and the same should be done for the subsequent keyframes sent to the handler that sent in the current keyframe. For both cases, the proper calculation of the pose of a subsequent keyframe to the relevant handler is accomplished by updating the handler's scale ratio $\lambda_k$ with its pseudo observation $\tilde{\lambda}_k = \lambda_k / \tilde{s}^c_m$:

$$\lambda_k \leftarrow \lambda_k + \beta(\tilde{\lambda}_k - \lambda_k) \tag{3.6}$$

Here, the smoothing factor $\beta$ is empirically set to 0.9 because $\lambda_k$ has to adjust quickly to the scale change caused by the loop closure.
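In code, the correction of equation (3.6) is a single smoothed update per affected handler; a trivial sketch:

```python
# Sketch of equation (3.6): after a loop closure with scale s_mc, each affected
# handler's scale ratio is pulled quickly toward its corrected value.
BETA = 0.9  # smoothing factor for the loop-closure correction

def correct_scale_ratio(lam_k: float, s_mc: float) -> float:
    lam_pseudo = lam_k / s_mc        # pseudo observation after the loop closure
    return lam_k + BETA * (lam_pseudo - lam_k)
```

With $\beta = 0.9$, the handler's scale ratio moves 90% of the way toward $\lambda_k / \tilde{s}^c_m$ in a single update.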
As the loop closure finishes, all the suspended handlers are signaled to resume. Meanwhile, the whole map is optimized with GBA on an independent thread. Because
two maps originally built by two individual handlers may merge, over time many handler threads may access and update one map. Therefore, its components, keyframes and points, are protected with mutually exclusive locks for safe multi-thread operations.
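As a rough illustration of this protection (a Python sketch with a single map-level lock, whereas the implementation may use finer-grained locks on keyframes and points):

```python
# Sketch of guarding shared map components with a lock so that several handler
# threads and the loop closing thread can safely read and modify one map.
import threading

class SharedMap:
    def __init__(self):
        self._lock = threading.Lock()
        self.keyframes = []
        self.points = []

    def add_keyframe(self, keyframe):
        with self._lock:
            self.keyframes.append(keyframe)

    def snapshot(self):
        """Return consistent copies of the keyframe and point lists."""
        with self._lock:
            return list(self.keyframes), list(self.points)
```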
Figure 3.6. Loop closure between two simulated maps, i.e., merging two maps. Each map consists of keyframes (triangles with different sizes showing scale changes) and points (crosses). For clarity, only a few points of interest are drawn, and fields of view of keyframes of interest are shown with black dashed lines. (a) True motion of two cameras (blue and black) starting from the red circles. (b) Estimated motion in the two maps on the server with a loop accepted between the two keyframes (the matched keyframe and the current keyframe) linked by the red solid line. Matched points between two neighborhoods of the two keyframes are linked by red dashed lines. (c) The map on the server after the merging operation which transforms the original map of the current keyframe with the estimated similarity transformation, and fuses matched points.
3.4.3 Stepwise Pose Graph Optimization

In optimizing a pose graph, it was observed that directly optimizing over the similarity transformations [60] sometimes does not converge, mainly because the initial guess is far from the true solution. To improve convergence, a stepwise pose graph optimization method is proposed. It first solves for the scales of the similarity transformations with a linear least squares algorithm, then refines the scales and translations of the similarity transformations with a nonlinear optimization method while fixing the rotations. The rationale for this approach is twofold. First, monocular VO empirically achieves relatively good estimates of rotation. Second, the linear least squares method guarantees an initial guess for the scales that is close to the true solution. This stepwise approach was demonstrated in this study to work with obvious scale drift and to deal with multiple loops in one pose graph at once.
Specifically, suppose a to-be-optimized pose graph has $N$ keyframes with indices $k \in [1, 2, \ldots, N]$. Their true poses (i.e., the states to be optimized) are expressed by similarity transformations from the global frame (subscript 'g') to the camera frame (superscript 'k'),

$$\mathbf{S}^k_g = \begin{bmatrix} s^k_g \mathbf{R}^k_g & \mathbf{t}^k_g \\ \mathbf{0} & 1 \end{bmatrix} \tag{3.7}$$

Initially, the estimated (hence the accent '^') poses are $\hat{\mathbf{S}}^k_g$, $k \in [1, 2, \ldots, N]$, which are constructed from the latest poses $\hat{\mathbf{T}}^k_g$ and assumed scales $\hat{s}^k_g = 1$ for $k \in [1, 2, \ldots, N]$. By the definitions of Sim(3) (equation (2.21)) and SE(3) (equation (2.16)),

$$\hat{\mathbf{S}}^k_g = \begin{bmatrix} \hat{s}^k_g \hat{\mathbf{R}}^k_g & \hat{\mathbf{t}}^k_g \\ \mathbf{0} & 1 \end{bmatrix}, \qquad \hat{\mathbf{T}}^k_g = \begin{bmatrix} \hat{\mathbf{R}}^k_g & \hat{\mathbf{t}}^k_g \\ \mathbf{0} & 1 \end{bmatrix} \tag{3.8}$$

These states to be optimized are constrained by two types of observations. One type of observation (equation (3.9)) occurs between two neighboring keyframes $i$ and $j$ with overlapping views and derives from their latest pose estimates. The other type (equation (3.5)) occurs between the current keyframe and the matched keyframe due to the accepted loop. All these observations are used in the stepwise or the direct pose graph optimization (Table 3.1).

$$\tilde{\mathbf{S}}^j_i \triangleq \begin{bmatrix} \tilde{s}^j_i \tilde{\mathbf{R}}^j_i & \tilde{\mathbf{t}}^j_i \\ \mathbf{0} & 1 \end{bmatrix} = \hat{\mathbf{S}}^j_g (\hat{\mathbf{S}}^i_g)^{-1} \tag{3.9}$$

The direct pose graph optimization refines all the states with all observations (Table 3.1). In contrast, the stepwise pose graph optimization first solves for the scales $s^k_g$, $k \in [1, 2, \ldots, N]$, with the two types of observations while fixing the rotations and translations, $\mathbf{R}^k_g, \mathbf{t}^k_g$, $k \in [1, 2, \ldots, N]$. An observation of the first type (equation (3.9)) then takes the following form, assuming fixed rotations and translations:

$$\begin{bmatrix} \tilde{s}^j_i & -1 \end{bmatrix} \begin{bmatrix} s^i_g \\ s^j_g \end{bmatrix} = n_1 \tag{3.10}$$
Here, the noise $n_1$ for all such observations is assumed to be independent and identically distributed.
Table 3.1. States, initial estimates, and constraints involved in the direct pose graph optimization and in the two steps of the stepwise pose graph optimization. Here $k$ enumerates the poses of all $N$ keyframes in the graph. A pose is represented by a similarity transformation consisting of a rotation from the global frame (subscript 'g') to the camera frame of the $k$th keyframe (superscript 'k'), $\mathbf{R}^k_g$; a translation, $\mathbf{t}^k_g$, meaning the coordinates of the g-frame's origin in the k-frame; and a scale, $s^k_g$, representing one unit length of the g-frame in the k-frame's length unit. Also note that '~' identifies constraints or observations.

Direct pose graph optimization
- States to be optimized or solved: $s^k_g, \mathbf{t}^k_g, \mathbf{R}^k_g$, $k \in [1, 2, \ldots, N]$
- Fixed states: none
- Constraints or observations: $\tilde{\mathbf{S}}^j_i = \begin{bmatrix} \tilde{s}^j_i \tilde{\mathbf{R}}^j_i & \tilde{\mathbf{t}}^j_i \\ \mathbf{0} & 1 \end{bmatrix} = \hat{\mathbf{S}}^j_g (\hat{\mathbf{S}}^i_g)^{-1}$ (equation (3.9)) and the loop constraint $\tilde{\mathbf{S}}^c_m$ (equation (3.5))
- Initial state values prior to optimization: $\hat{s}^k_g = 1$, $\hat{\mathbf{t}}^k_g$, $\hat{\mathbf{R}}^k_g$, $k \in [1, 2, \ldots, N]$

Stepwise, step 1 (scales)
- States to be optimized or solved: $s^k_g$, $k \in [1, 2, \ldots, N]$
- Fixed states: $\mathbf{R}^k_g, \mathbf{t}^k_g$, $k \in [1, 2, \ldots, N]$
- Constraints or observations: $\begin{bmatrix} \tilde{s}^j_i & -1 \end{bmatrix}\begin{bmatrix} s^i_g \\ s^j_g \end{bmatrix} = n_1$ (equation (3.10)) and the loop scale $\tilde{s}^c_m$ (equation (3.11))
- Initial state values prior to optimization: $\hat{s}^k_g = 1$, $k \in [1, 2, \ldots, N]$

Stepwise, step 2 (scales + translations)
- States to be optimized or solved: $s^k_g, \mathbf{t}^k_g$, $k \in [1, 2, \ldots, N]$
- Fixed states: $\mathbf{R}^k_g$, $k \in [1, 2, \ldots, N]$
- Constraints or observations: $\tilde{\mathbf{t}}^j_i = \mathbf{t}^j_g - \dfrac{s^j_g}{s^i_g}\mathbf{R}^j_g(\mathbf{R}^i_g)^{\mathrm T}\mathbf{t}^i_g + \mathbf{n}_2$ (equation (3.12))
- Initial state values prior to optimization: $\hat{s}^k_g$ (from step 1), $\hat{\mathbf{t}}^k_g$, $k \in [1, 2, \ldots, N]$
For an observation of the other type (equation (3.5)), the scale of the matched keyframe is assumed fixed and equal to 1. If there are multiple loops in offline processing, only the oldest matched keyframe is chosen to have a fixed scale. As a result, the scale constraint (equation (3.10)) between the matched keyframe, $m$, and another keyframe (e.g., the current keyframe $c$) becomes

$$\begin{bmatrix} \tilde{s}^c_m & -1 \end{bmatrix} \begin{bmatrix} 1 \\ s^c_g \end{bmatrix} = \tilde{s}^c_m - s^c_g = n_1 \tag{3.11}$$
The linear system formed by stacking equations (3.10)-(3.11) can be solved by a QR factorization [66], which gives the estimates for the scales, $\hat{s}^k_g$, $k \in [1, 2, \ldots, N]$.

Built upon the scales estimated in the first step, the second step of the stepwise pose graph optimization optimizes the scales and translations of all keyframes, i.e., $s^k_g, \mathbf{t}^k_g$, $k \in [1, 2, \ldots, N]$, while fixing their rotations, $\mathbf{R}^k_g$, $k \in [1, 2, \ldots, N]$, with the two types of observations. An observation of either type has the following form, derived from equation (3.9) assuming fixed rotations:

$$\tilde{\mathbf{t}}^j_i = \mathbf{t}^j_g - \frac{s^j_g}{s^i_g}\mathbf{R}^j_g (\mathbf{R}^i_g)^{\mathrm T} \mathbf{t}^i_g + \mathbf{n}_2 \tag{3.12}$$
Here, the noise $\mathbf{n}_2$ for all such observations is assumed to be independent and identically distributed. With these observations, $s^k_g, \mathbf{t}^k_g$, $k \in [1, 2, \ldots, N]$, are optimized starting from $\hat{s}^k_g, \hat{\mathbf{t}}^k_g$, $k \in [1, 2, \ldots, N]$, by a nonlinear optimizer which gives their new estimates $\hat{\hat{s}}^k_g, \hat{\hat{\mathbf{t}}}^k_g$, $k \in [1, 2, \ldots, N]$ (the accent '$\hat{\hat{(\cdot)}}$' identifies new estimates after optimization).

Once the two steps finish, the pose of each keyframe of index $k \in [1, 2, \ldots, N]$ is recovered as

$$\hat{\hat{\mathbf{T}}}{}^k_g = \begin{bmatrix} \hat{\mathbf{R}}^k_g & \hat{\hat{\mathbf{t}}}^k_g / \hat{\hat{s}}^k_g \\ \mathbf{0} & 1 \end{bmatrix}$$

To update the position of a point in the global frame, $\mathbf{p}^g$, the point is first projected to the keyframe $r$ that holds its first observation, $\hat{\mathbf{T}}^r_g \begin{bmatrix} \mathbf{p}^g \\ 1 \end{bmatrix}$; then the projection is normalized by the scale change of $r$, $\hat{\hat{s}}^r_g$; at last, the point is reprojected to the global frame by $\hat{\hat{\mathbf{T}}}{}^r_g$. The operations for updating $\mathbf{p}^g$ are summarized in equation (3.13), where $[\cdot]_{1:3}$ refers to the first three rows of a vector.

$$\mathbf{p}^g \leftarrow \left[ (\hat{\hat{\mathbf{T}}}{}^r_g)^{-1} \hat{\mathbf{T}}^r_g \begin{bmatrix} \mathbf{p}^g \\ 1 \end{bmatrix} / \hat{\hat{s}}^r_g \right]_{1:3} = \left[ \mathbf{p}^g + (\hat{\mathbf{R}}^r_g)^{\mathrm T} (\hat{\mathbf{t}}^r_g - \hat{\hat{\mathbf{t}}}^r_g) \right] / \hat{\hat{s}}^r_g \tag{3.13}$$
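To make the first step concrete, the following sketch stacks the scale constraints of equations (3.10) and (3.11) into a linear system and solves it in the least squares sense, using NumPy's `lstsq` in place of the QR routine of [66]; the indices, the soft pinning of the matched keyframe's scale, and the interface are simplifications rather than the actual implementation.

```python
# Sketch of the first step of the stepwise pose graph optimization: solve for the
# per-keyframe scales from the relative-scale constraints (3.10) and the loop
# constraint (3.11) by linear least squares (in place of the QR solver of [66]).
import numpy as np

def solve_scales(n_keyframes, edges, loop):
    """edges: list of (i, j, s_ij) relative-scale observations, 0-based indices.
    loop: (m, c, s_mc), the matched/current keyframe indices and the loop scale.
    The matched keyframe's scale is softly pinned to 1 (the text fixes it exactly)."""
    A, b = [], []
    for i, j, s_ij in edges:                 # s_ij * s_i - s_j = n1, eq. (3.10)
        row = np.zeros(n_keyframes)
        row[i], row[j] = s_ij, -1.0
        A.append(row)
        b.append(0.0)
    m, c, s_mc = loop
    row = np.zeros(n_keyframes)              # s_mc - s_c = n1, eq. (3.11)
    row[c] = -1.0
    A.append(row)
    b.append(-s_mc)
    row = np.zeros(n_keyframes)              # soft constraint pinning s_m = 1
    row[m] = 1.0
    A.append(row)
    b.append(1.0)
    scales, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return scales
```

The estimated scales then serve as the initial values of the second step, which refines the scales and translations with a nonlinear optimizer while the rotations stay fixed.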
3.5 Experiments

Several tests were conducted to verify the algorithms and capabilities of the proposed collaborative SLAM framework. First, the stepwise pose graph optimization algorithm is compared with the direct one, because it is used in the framework for the subsequent tests. Afterwards, the collaborative framework is tested on the outdoor KITTI datasets [47] and on smartphone datasets collected by the SPIN Lab at The Ohio State University (OSU) (Table 3.2). In collecting the smartphone data, the auto-focus and auto-exposure functions of the cameras were locked so that the focal length and the shutter speed remained constant.
Table 3.2. Descriptions of the datasets used in the tests. Note that the smartphone video frames were downsampled by a factor of 2 in processing. For each dataset the table lists the environment, the duration and frame rate (frames per second, FPS), the ground truth error (method), and the image size in pixels.

Monocular datasets: KITTI sequence 00 (cameras mounted on a car, suburban residential area); Samsung S6 (west campus parking lot of OSU); Samsung S3 (west campus parking lot of OSU).