RB²-PF: A Novel Filter-based Monocular Visual Odometry Algorithm
Yifan Zhou† and Simon Maskell†
Abstract— This paper proposes an improvement to FastSLAM. The approach is applicable when the dynamic model describing the motion of the camera has linear sub-structure. The core novelty of the proposed algorithm is to separate the consideration of the camera's dynamic model into two sub-models without constraining the two sub-models to have independent noise processes. In contrast to commonly-used FastSLAM algorithms, which use a particle filter to consider both these sub-models, a particle filter is used for one sub-model and a Kalman filter for the other. This tactic is Rao-Blackwellisation and is the same tactic that underpins the development of FastSLAM, where the focus was previously only on exploiting near-linear sub-structure related to the state of the landmarks. Comparisons with Unscented FastSLAM 2.0 indicate that the new approach improves estimation accuracy. Comparisons with cutting-edge SLAM algorithms also reflect the competitive nature of this approach as a solution to navigation problems. Future work will improve the processing of features and consider multi-sensor data input.

Index Terms— Particle filters, Monocular, SLAM, Visual Odometry, Rao-Blackwellisation, Navigation
I. Introduction
Simultaneous Localisation And Mapping (SLAM) is the process of using sensory data to estimate the current state of a vehicle while also estimating a map. Early SLAM algorithms used the Extended Kalman Filter (EKF) to achieve good results [1]. However, the dimensionality of the stacked state comprising the vehicle's state and the landmarks that parameterise the map is typically high. As a result, SLAM algorithms that use a single EKF are slow. FastSLAM [2] addressed this by exploiting the fact that, conditional on the path taken by the vehicle¹, the estimates that relate to each landmark are independent. This use of Rao-Blackwellisation makes it possible to run a particle filter to estimate the vehicle's state, with each particle also running an independent Kalman filter for each landmark. Further improvements to FastSLAM focused on the particle filter part: FastSLAM 2.0 extends FastSLAM by proposing particles in regions of the state space that are probable given the received sensory data and by using an Unscented Kalman Filter (UKF) as an efficient and robust alternative to an EKF [3].

† Yifan Zhou and Simon Maskell are with the School of Electrical Engineering, Electronics and Computer Science, University of Liverpool, Liverpool, UK. {sgyzhou7, smaskell}@liverpool.ac.uk
¹ FastSLAM also assumes that the landmarks are stationary and that the sensory data has been associated over time such that the data relating to each landmark can be identified.
We now describe the application of SLAM to visual sensory data and techniques used to enhance the accuracy of 3D mapping. State-of-the-art algorithms borrow an idea from off-line Structure from Motion approaches (which achieve extremely low estimation error). These approaches can be categorised into feature-based (e.g., PTAM [4], ORB-SLAM [5]) and direct (e.g., DTAM [6], LSD-SLAM [7]). Those algorithms make use of Bundle Adjustment to optimise a global cost function that considers all historic camera states. Since the processing load of such Bundle Adjustment is large, this process is only performed relatively infrequently. This makes the difference between SLAM and (visual) odometry estimation more obvious: unlike SLAM (which aims to maximise global mapping quality), Visual Odometry (VO) is a navigation method that aims to output accurate states at the time of each frame (without later correction). As explained in [8], a filter-based solution can also be appropriate when the uncertainty is pronounced and when computational resources are insufficient for Bundle Adjustment. Previous work has considered fusion of data from visual odometry and an Inertial Measurement Unit (IMU) (e.g., [9], [10]), though that work did not consider a filter-based solution. While such fusion is not considered in this paper, our research will progress to consider multi-sensor fusion in the future.

In this paper, we use FastSLAM in 3D contexts involving a monocular camera (our work is based on the description of FastSLAM in [11] and the mono-camera framework in [12]). In such contexts it is common to use a dynamic model for the vehicle's state that involves states that describe the Cartesian position and velocity as well as other states that describe the orientation and angular velocity. If the dynamic model for the vehicle's state includes velocities, the model contains conditionally linear sub-structure. Thus, using a similar logic to that underpinning the development of FastSLAM, we propose to use Rao-Blackwellisation [13] such that a Kalman filter is used to estimate the velocity while a particle filter estimates (only) the vehicle's position. This approach aims to decrease the dimensionality of the (sub-)problem that the particles need to inhabit. This idea of combining FastSLAM 2.0 and Rao-Blackwellisation has been previously proposed [14] (as part of wider research into understanding the utility of Rao-Blackwellisation [15]). However, our approach is more general than the previous work we are aware of in the context of FastSLAM: we allow for correlation between the noise processes considered by the sub-models (e.g., as encountered when considering velocities), whereas previous research did not (in [16], correlated noise was included, but SLAM/VO was not considered).
Since we use Rao-Blackwellisation twice (once for the landmarks and once for the velocity), we call the new algorithm "RB²-PF".

This paper is organised as follows: Section II reviews the basics of FastSLAM algorithms and the structure of the dynamic model considered. Section III is a description of the new RB²-PF algorithm. Section IV explains how the feature points are handled. Section V presents results before Section VI draws conclusions.

II. Previous Work
A. FastSLAM And Extensions
The core of FastSLAM [2], [3] is the following factorisation, in which the first factor is tackled with a particle filter and each factor in the product with an Extended Kalman filter:

  p(Θ, s_{1:t} | z_{1:t}, u_{1:t}) = p(s_{1:t} | z_{1:t}, u_{1:t}) · Π_n p(θ_n | s_{1:t}, z_{1:t}, u_{1:t})    (1)

where s_{1:t} denotes the vehicle (s)tate from time 1 to time t, z_{1:t} denotes the measurements, u_{1:t} denotes the controls, θ_n denotes the (static) location of the nth associated landmark and Θ is the set of such landmark locations.
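To make the factorisation concrete, the following minimal sketch (our illustration in Python; all names are hypothetical and not taken from the authors' implementation) shows the per-particle data structure it implies: each particle carries a sampled vehicle state and a weight, plus an independent Gaussian, maintained by an EKF/UKF, for every landmark.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class LandmarkEstimate:
    mean: np.ndarray   # estimated landmark location theta_n
    cov: np.ndarray    # landmark covariance, maintained by an EKF/UKF

@dataclass
class Particle:
    pose: np.ndarray                                # current sampled vehicle state s_t
    weight: float = 1.0                             # importance weight w_t
    landmarks: dict = field(default_factory=dict)   # landmark id -> LandmarkEstimate

# The particle filter handles p(s_{1:t} | z_{1:t}, u_{1:t}) over a set of particles,
# while each LandmarkEstimate handles p(theta_n | s_{1:t}, z_{1:t}, u_{1:t}).
particles = [Particle(pose=np.zeros(6)) for _ in range(8)]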
The resulting algorithm then involves sampling particles, updating the Kalman filter parameters associated with each landmark, updating the particles' weights and resampling. These steps are now briefly described.

FastSLAM 1.0 [2] considers a bootstrap particle filter such that samples are generated using:

  q(s_t | s_{1:t-1}, u_{1:t}, z_{1:t}) = p(s_t | s_{t-1}, u_t)

Such a bootstrap approach is easy to implement but can lack efficiency when considering data that is informative about where the particles should be sampled. FastSLAM 2.0 [3] uses an improved proposal for sampling which capitalises on informative data: samples are generated using q(s_t | s_{1:t-1}, u_{1:t}, z_{1:t}) ≈ p(s_t | s_{1:t-1}, u_{1:t}, z_{1:t}), where the approximation (necessary if the measurements are in polar co-ordinates) is calculated using an EKF or UKF. This approach is often referred to as using a near-optimal proposal in the broader context of particle filters, where it has found utility in a number of other application domains.

FastSLAM assumes that, for each particle, the uncertainty in each landmark position can be modelled as N(θ_n; θ̂_{n,t}, Σ̂_{n,t}), where N(x; μ, Σ) denotes a Gaussian with mean μ and covariance Σ. Given a sampled state sequence for the vehicle, the parameters of each landmark can be updated using an EKF or UKF (assuming polar, i.e. non-linear, measurements). After that, the weight of the particles can be updated:

  w_t = [ p(s_t | s_{t-1}, u_t) p(z_t | s_{1:t}, u_{1:t}, z_{1:t-1}) / q(s_t | s_{1:t-1}, u_{1:t}, z_{1:t}) ] · w_{t-1}    (2)

where we use the EKF or UKF to calculate a predicted measurement for the nth landmark, ẑ_{n,t}, and the associated covariance, Σ̂^z_{n,t}, such that we can approximate as follows:

  p(z_t | s_{1:t}, u_{1:t}, z_{1:t-1}) ≈ Π_n N(z_{n,t}; ẑ_{n,t}, Σ̂^z_{n,t})    (3)

where z_{n,t} is the measurement for the nth landmark².

One can improve on the original FastSLAM 2.0 by replacing the use of derivatives in the EKF with the use of an Unscented transformation (an approach that has been shown to improve estimation consistency [17] in other contexts). Unscented FastSLAM (UFastSLAM) [18], [19] adopts this approach and specifically uses the Scaled Unscented Transformation (SUT) [20], which we use to generate the results in Section V.

The final stage of the particle filter is resampling. This is used to alleviate the degeneracy phenomenon: without resampling, after a few iterations, a small number of particles have much larger weights than the other particles, with most particles contributing almost nothing to the particle filter's estimates (which are a weighted sum across particles). Resampling probabilistically discards the particles with low weights and replicates particles with high weights. The details are described elsewhere (e.g., in [21]).

² As an aside, we note that, to the best of our knowledge, most existing FastSLAM implementations choose to perform further approximations. The authors do not regard these as advantageous (in terms of either computational cost or estimation accuracy: all the terms in (2) will be readily accessible in a SLAM implementation). The approximations are:

  w_t ≈ p(z_t | s_{1:t-1}, u_{1:t}, z_{1:t-1}) · w_{t-1}    (4)
      ≈ Π_n N(z_{n,t}; z̃_{n,t}, Σ̃^z_{n,t}) · w_{t-1}    (5)

where z̃_{n,t} and Σ̃^z_{n,t} involve using the EKF or UKF to predict the measurement and associated covariance for the nth landmark given only s_{t-1} (and not also the sampled value of s_t).

B. Near-constant Velocity Model
A general nearly constant velocity model³ [22] can be described as:

  x_t = F · x_{t-1} + Q_t^{1/2} ω_t    (6)

where

  ω_t ∼ N(O_{2D×1}, I_{2D×2D})    (7)

and where I_{D×D} is an identity matrix of dimension D×D and O_{D×D} is a zero matrix of dimension D×D. We assume that x_t = [x_{t,1}^T, x_{t,2}^T]^T such that x_{t,1} and x_{t,2} are vectors representing position and velocity respectively. The dimension of x_{t,1} and x_{t,2} depends on the problem (e.g., whether the state is 2D or 3D). We assume we can decompose F as follows:

  F = [ I_{D×D}   ΔT · I_{D×D} ]
      [ O_{D×D}   I_{D×D}      ]    (8)

where ΔT is the (continuous) time between epochs t−1 and t. Q_t then parameterises the uncertainty of the position and the correlated uncertainty in velocity:

  Q_t = σ [ (1/3)ΔT³ · I_{D×D}   (1/2)ΔT² · I_{D×D} ]
          [ (1/2)ΔT² · I_{D×D}   ΔT · I_{D×D}       ]    (9)

where σ quantifies the amount of (white-noise) acceleration assumed. Please see [22] for more details.

³ We use x to denote the state vector, since it is general and can be Cartesian, angular, etc.
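For concreteness, a short sketch (ours; NumPy and the function name are assumptions) that assembles F of (8) and Q_t of (9) for a D-dimensional position/velocity state:

import numpy as np

def ncv_matrices(dt: float, sigma: float, D: int):
    """Build the transition matrix F of (8) and the process noise Q_t of (9)
    for a nearly constant velocity model with D-dimensional position."""
    I = np.eye(D)
    F = np.block([[I, dt * I],
                  [np.zeros((D, D)), I]])
    Q = sigma * np.block([[dt**3 / 3 * I, dt**2 / 2 * I],
                          [dt**2 / 2 * I, dt * I]])
    return F, Q

# Example: 3D position/velocity at 30 frames per second, with the position
# process-noise value reported in Section V-A (an assumption for illustration).
F, Q = ncv_matrices(dt=1.0 / 30.0, sigma=2.5**2, D=3)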
III. RB²-PF For Monocular Visual Odometry
For monocular visual odometry, we recall the camera motion model proposed in [12]. The state consists of: Cartesian position; Cartesian velocity; orientation, described using quaternions; and angular velocity, described using Euler angles. Since the position and orientation are not correlated in our case (for a hand-held camera), they will be described separately. This section first focuses on Cartesian position. We then go on to extend the filter to consider orientation. We then summarise the operation of a system that considers both position and orientation.

A. RB²-PF For Cartesian Position
Considering the Cartesian (v)ector only, we note that (6) is a special case of the following:

  [ v_{t,1} ]   [ F_{11}  F_{12} ] [ v_{t-1,1} ]   [ Q_{11}  Q_{12} ]^{1/2} [ ω_{t,1} ]
  [ v_{t,2} ] = [ F_{21}  F_{22} ] [ v_{t-1,2} ] + [ Q_{21}  Q_{22} ]       [ ω_{t,2} ]    (10)

and we assume the following non-linear measurement model:

  p(z_{n,t} | v_t) = N(z_{n,t}; h(v_t^{pf}, θ_{n,t}), R_t)    (11)

where, with the aim of aiding exposition, we now use v_t^{pf} and v_t^{kf} in place of v_{t,1} and v_{t,2} in the rest of the paper.

1) Particle Filter For Cartesian Position: We now describe how to sample the particles and perform the weight update. The other steps (e.g., data association and resampling) are unchanged relative to existing FastSLAM 2.0 implementations and are described elsewhere [23], [24].

a) Optimal Proposal For Sampling: The calculation of a near-optimal proposal is similar to FastSLAM 2.0 except for two things: the particles only sample the vehicle's position; and the uncertainty in velocity captured by the Kalman filter needs to be considered. To calculate the near-optimal proposal, we first calculate the predicted position and associated covariance as:

  v_{t|t-1}^{pf} = F_{11} v_{t-1}^{pf} + F_{12} v_{t-1}^{kf}    (12)
  Σ_{t|t-1}^{pf} = Q_{11} + F_{12} Σ_{t-1}^{kf} F_{12}^T    (13)

where v_{t-1}^{kf} and Σ_{t-1}^{kf} are the mean and covariance from the Kalman filter. Such that:

  p(v_t^{pf} | v_{1:t-1}^{pf}, z_{1:t-1}) = N(v_t^{pf}; v_{t|t-1}^{pf}, Σ_{t|t-1}^{pf})    (14)

This is then the prior, input to an EKF or UKF update step that processes z_t to generate the parameters of a Gaussian approximation (just as in FastSLAM 2.0):

  q(v_t^{pf} | v_{1:t-1}^{pf}, z_{1:t}) = N(v_t^{pf}; v_{t|t}^{pf}, Σ_{t|t}^{pf})    (15)
                                        ≈ p(v_t^{pf} | v_{1:t-1}^{pf}, z_{1:t})    (16)

from which a sample of v_t^{pf} is drawn.

b) Weight Update: The weight update is derived using a similar logic to that used in FastSLAM 2.0, albeit with u_t dropped from the notation:

  w_t = [ p(v_t^{pf} | v_{1:t-1}^{pf}, z_{1:t-1}) p(z_t | v_{1:t}^{pf}, z_{1:t-1}) / q(v_t^{pf} | v_{1:t-1}^{pf}, z_{1:t}) ] · w_{t-1}    (17)

using (14), (3)⁴ and (15).

2) Kalman Filter For Cartesian Velocity: Once we have obtained the samples of position, we need to update the parameters of the Kalman filter associated with the velocities. We pick a convenient square root of Q:

  Q^{1/2} = [ Q_{11}  Q_{12} ]^{1/2}   [ Ψ   Q_{12} Q_{22}^{-T/2} ]
            [ Q_{21}  Q_{22} ]       = [ 0   Q_{22}^{1/2}         ]    (18)

where

  Ψ = ( Q_{11} − Q_{12} Q_{22}^{-T/2} Q_{22}^{-1/2} Q_{12}^T )^{1/2}    (19)

such that, using (10):

  v_t^{pf} = F_{11} v_{t-1}^{pf} + F_{12} v_{t-1}^{kf} + Ψ ω_{t,1} + Q_{12} Q_{22}^{-T/2} ω_{t,2}    (20)

and

  v_t^{kf} = F_{22} v_{t-1}^{kf} + F_{21} v_{t-1}^{pf} + Q_{22}^{1/2} ω_{t,2}    (21)

where we denote Φ ≜ F_{22}, b_t ≜ F_{21} v_{t-1}^{pf} and Γ ≜ Q_{22}^{1/2}; these annotations (here and subsequently) will be useful later. It follows that:

  v_t^{pf} + (F_{12} F_{22}^{-1} F_{21} − F_{11}) v_{t-1}^{pf} = F_{12} F_{22}^{-1} v_t^{kf} + Ψ ω_{t,1} + ( Q_{12} Q_{22}^{-T/2} − F_{12} F_{22}^{-1} Q_{22}^{1/2} ) ω_{t,2}    (22)

where the left-hand side is denoted m_t, H ≜ F_{12} F_{22}^{-1} and Λ ≜ Q_{12} Q_{22}^{-T/2} − F_{12} F_{22}^{-1} Q_{22}^{1/2}, and

  v_{t-1}^{kf} = F_{22}^{-1} v_t^{kf} − F_{22}^{-1} F_{21} v_{t-1}^{pf} − F_{22}^{-1} Q_{22}^{1/2} ω_{t,2}    (23)

The result (given the values of v_{t-1:t}^{pf}) is of the following form:

  v_t^{kf} = Φ v_{t-1}^{kf} + b_t + Γ ω_{t,2}    (24)

and

  m_t = H v_t^{kf} + Ψ ω_{t,1} + Λ ω_{t,2}    (25)

such that we have linear dynamics and measurements⁵ with respect to the estimation of velocity and parameters that can be calculated. Note that ω_{t,2} appears in both (24) and (25), such that the noises are correlated. This demands a less standard, but well-established, variant of the Kalman filter (which we detail in the algorithmic summary that follows).

⁴ Albeit with some abuse of notation.
⁵ It is perhaps interesting to note that

  m_t = v_t^{pf} + (F_{12} F_{22}^{-1} F_{21} − F_{11}) v_{t-1}^{pf}    (26)

such that the measurement provided by the particle filter is conceptually a measurement of change in position.
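The quantities defined in (18)-(25) can be computed directly from the blocks of F and Q. The following is a sketch under our own naming conventions (a symmetric matrix square root stands in for the square roots in (18)-(19)); it is illustrative rather than the authors' implementation.

import numpy as np
from scipy.linalg import sqrtm

def rb2pf_matrices(F, Q, D):
    """Split F and Q into the blocks of (10) and form the quantities of (18)-(25)."""
    F11, F12 = F[:D, :D], F[:D, D:]
    F21, F22 = F[D:, :D], F[D:, D:]
    Q11, Q12 = Q[:D, :D], Q[:D, D:]
    Q22 = Q[D:, D:]

    Q22_half = np.real(sqrtm(Q22))                          # Q_22^{1/2}
    Q12_scaled = Q12 @ np.linalg.inv(Q22_half).T            # Q_12 Q_22^{-T/2}
    Psi = np.real(sqrtm(Q11 - Q12_scaled @ Q12_scaled.T))   # (19)

    F22_inv = np.linalg.inv(F22)
    H = F12 @ F22_inv                                        # annotation in (22)
    Lam = Q12_scaled - F12 @ F22_inv @ Q22_half              # annotation in (22)
    Phi, Gam = F22, Q22_half                                 # annotations in (21)
    return F11, F12, F21, F22, Phi, Gam, H, Psi, Lam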
B. Modifications For Orientation
In our system, the camera orientation still follows the near-constant velocity model, which means the derivation above is almost the same for filtering the orientation. Because the (o)rientation is represented using (q)uaternions, o^q, and the angular velocity is represented using (E)uler angles, ȯ^E, the transition matrix (8) and the process noise matrix (9) must be modified. Recall that quaternions are four-dimensional and Euler angles are three-dimensional.

The transition of the camera orientation, as expressed using quaternions, can be written as follows:

  o_{t+1}^q = A(o_t^q, Δo_t^q)    (27)

where o_t^q is the previous orientation, Δo_t^q is the orientation increment and A(·,·) is quaternion multiplication, defined as:

  A( [a_1, b_1, c_1, d_1]^T, [a_2, b_2, c_2, d_2]^T ) = [ a_1 a_2 − b_1 b_2 − c_1 c_2 − d_1 d_2 ]
                                                        [ a_1 b_2 + b_1 a_2 + c_1 d_2 − d_1 c_2 ]
                                                        [ a_1 c_2 − b_1 d_2 + c_1 a_2 + d_1 b_2 ]
                                                        [ a_1 d_2 + b_1 c_2 − c_1 b_2 + d_1 a_2 ]    (28)

Considering the angular velocity as described using Euler angles, we have:

  Δo_t^q = G(Δo_t^E)    (29)

where G(·) is the transformation function from Euler angles to quaternions, the details of which can be found in [25]. Note that the increment Δo_t^E = ȯ_t^E · ΔT. The transition matrix (which is a 7 × 7 matrix) for orientation is approximated as:

  F^O ≈ [ I_{4×4}   ξ_t     ]
        [ O_{3×4}   I_{3×3} ]    (30)

where ξ_t (which is a 4 × 3 matrix) is calculated as follows:

  ξ_t = ∂A(o_t^q, Δo_t^q)/∂Δo_t^q · ∂G(ȯ_t^E · ΔT)/∂ȯ_t^E    (31)

Note that this transition matrix is an approximation of the non-linear time update of the orientation. The approximation error is likely to be small for the small ΔT relevant to monocular VO (ΔT = 1/30 s in our experiments).
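The quaternion product A(·,·) of (28) can be implemented directly. The sketch below (ours) also includes a small-angle stand-in for G(·) of (29); the exact transformation used in the paper follows [25], so this G(·) is only an assumption for illustration.

import numpy as np

def quat_mult(q1, q2):
    """Quaternion product A(q1, q2) of (28), with q = [a, b, c, d] (scalar first)."""
    a1, b1, c1, d1 = q1
    a2, b2, c2, d2 = q2
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ])

def euler_increment_to_quat(d_euler):
    """An illustrative G(.) of (29): small-angle conversion of an Euler-angle
    increment to a unit quaternion (one common convention; see [25] for the
    exact form used in the paper)."""
    half = 0.5 * np.asarray(d_euler)
    q = np.concatenate(([1.0], half))   # small-angle approximation
    return q / np.linalg.norm(q)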
The process noise for orientation is calculated using:

  Q_t^O = τ_t · Q_t^E · τ_t^T    (32)

where Q_t^E is the process noise using Euler angles, which has the same form as in (9), and:

  τ_t = [ ξ̂_t      O_{4×3} ]
        [ O_{3×3}   I_{3×3} ]    (33)

where ξ̂_t is obtained from (31) with ΔT = 1 and τ_t is a 7 × 6 matrix.

C. Algorithm Summary
The camera (s)tate⁶ integrates position and orientation:

  s_t = [ s_t^{pf}   s_t^{kf} ]^T = [ v_t^{pf}   o_t^q   v_t^{kf}   ȯ_t^E ]^T    (34)

An RB²-PF is used to handle s_t since the sub-states are uncorrelated and follow the nearly constant velocity model. The operation is identical to that described in Section III-A (by simply substituting v_t with s_t), but with:

  F = [ F_{11}^P   O_{3×4}   F_{12}^P   O_{3×3}  ]
      [ O_{4×3}    F_{11}^O  O_{4×3}    F_{12}^O ]
      [ F_{21}^P   O_{3×4}   F_{22}^P   O_{3×3}  ]
      [ O_{3×3}    F_{21}^O  O_{3×3}    F_{22}^O ]    (35)

  Q = [ Q_{11}^P   O_{3×4}   Q_{12}^P   O_{3×3}  ]
      [ O_{4×3}    Q_{11}^O  O_{4×3}    Q_{12}^O ]
      [ Q_{21}^P   O_{3×4}   Q_{22}^P   O_{3×3}  ]
      [ O_{3×3}    Q_{21}^O  O_{3×3}    Q_{22}^O ]    (36)

where F^P is the position transition matrix (8), Q^P is the position process noise (9), and F^O and Q^O are from (30) and (32). A brief implementation of one iteration of the proposed algorithm is shown in Algorithm 1. Some of the steps (e.g., resampling) are identical to those in classical FastSLAM 2.0. The remaining details are described in the following section.

⁶ We use s_t for the system state, so it is consistent with Section II-A.
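Steps 5 to 7 of Algorithm 1 (below) amount to a Kalman prediction followed by an update in which the dynamics noise Γω_{t,2} and the pseudo-measurement noise Λω_{t,2} are correlated. A minimal sketch (ours, reusing the matrix names from the earlier sketch; not the authors' code):

import numpy as np

def velocity_predict(s_kf, P_kf, Phi, b_t, Gam):
    """Velocity prediction, (37)-(38): s_kf and P_kf are the Kalman mean/covariance."""
    s_pred = Phi @ s_kf + b_t
    P_pred = Phi @ P_kf @ Phi.T + Gam @ Gam.T
    return s_pred, P_pred

def velocity_update(s_pred, P_pred, m_t, H, Psi, Gam, Lam):
    """Correlated-noise Kalman update, (39)-(43): m_t is the pseudo-measurement
    formed from the sampled positions; the Gam and Lam terms share omega_{t,2}."""
    P_vm = P_pred @ H.T + Gam @ Lam.T                          # (40)
    P_mm = (H @ P_pred @ H.T + Psi @ Psi.T + Lam @ Lam.T
            + H @ Gam @ Lam.T + Lam @ Gam.T @ H.T)             # (41)
    K = P_vm @ np.linalg.inv(P_mm)
    s_upd = s_pred + K @ (m_t - H @ s_pred)                    # (42)
    P_upd = P_pred - K @ P_vm.T                                # (43)
    return s_upd, P_upd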
Algorithm 1 The Implementation of the Proposed RB²-PF VO
 1: for each particle i do
 2:   Perform State Prediction using (12) (or using the non-linear approach described in [12]).
 3:   Data Association (described in Section IV-B).
 4:   Sample s_t^{pf,(i)} using (15).
 5:   Perform Velocity Prediction using:

        ŝ_{t|t-1}^{kf,(i)} = Φ ŝ_{t-1|t-1}^{kf,(i)} + b_t    (37)
        Σ_{t|t-1}^{kf,(i)} = Φ Σ_{t-1|t-1}^{kf,(i)} Φ^T + Γ Γ^T    (38)

 6:   Calculate the following quantities:

        m_t = s_t^{pf,(i)} + (F_{12} F_{22}^{-1} F_{21} − F_{11}) s_{t-1}^{pf,(i)}    (39)
        P_{vm} = Σ_{t|t-1}^{kf,(i)} H^T + Γ Λ^T    (40)
        P_{mm} = H Σ_{t|t-1}^{kf,(i)} H^T + Ψ Ψ^T + Λ Λ^T + H Γ Λ^T + Λ Γ^T H^T    (41)

 7:   Perform Velocity Update using:

        ŝ_{t|t}^{kf,(i)} = ŝ_{t|t-1}^{kf,(i)} + P_{vm} P_{mm}^{-1} (m_t − H ŝ_{t|t-1}^{kf,(i)})    (42)
        Σ_{t|t}^{kf,(i)} = Σ_{t|t-1}^{kf,(i)} − P_{vm} P_{mm}^{-1} P_{vm}^T    (43)

 8:   Calculate weights using (17).
 9:   Update the feature points' Existence Score (described in Section IV-C).
10: end for
11: Resample (when necessary).

IV. Feature Points Initialisation And Maintenance
A. Features Initialisation
In common with other approaches, the image is divided into a grid and, in each cell, the corner detector's threshold is manipulated to ensure that a minimum number of corners are detected. The feature points are extracted by a FAST corner detector⁷ [26]. We calculate the BRISK feature descriptor [27] for each point in our system since it is relatively fast and precise [28]. For each feature, we store a distribution on inverse depth (represented by a Gaussian), an existence score and a flag indicating if it is a permanent map point. The subsequent subsections explain these stored quantities in more detail.

⁷ https://www.edwardrosten.com/work/fast.html
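As an illustration of the grid-based detection just described, the following sketch uses OpenCV's FAST detector and BRISK descriptor; the use of OpenCV, the grid size and all threshold values are our assumptions rather than the paper's settings.

import cv2
import numpy as np

def detect_grid_features(gray, rows=4, cols=4, min_per_cell=10,
                         start_thresh=40, min_thresh=5):
    """Per grid cell, lower the FAST threshold until a minimum number of corners
    is found, then compute BRISK descriptors (illustrative sketch only)."""
    h, w = gray.shape
    keypoints = []
    for r in range(rows):
        for c in range(cols):
            cell = gray[r * h // rows:(r + 1) * h // rows,
                        c * w // cols:(c + 1) * w // cols]
            thresh, kps = start_thresh, []
            while thresh >= min_thresh:
                fast = cv2.FastFeatureDetector_create(threshold=thresh)
                kps = fast.detect(cell, None)
                if len(kps) >= min_per_cell:
                    break
                thresh -= 5
            for kp in kps:  # shift cell coordinates back to image coordinates
                kp.pt = (kp.pt[0] + c * w // cols, kp.pt[1] + r * h // rows)
            keypoints.extend(kps)
    brisk = cv2.BRISK_create()
    keypoints, descriptors = brisk.compute(gray, keypoints)
    return keypoints, descriptors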
B. Feature Matching
After predicting the camera state, all the feature points are assessed to check whether they could appear in the image and whether their feature descriptors are likely to remain similar to their initial values. The assessment tests each of the following criteria (in order):
1) The feature's existence score is not lower than 0.4;
2) The angle between the current viewing ray and the original feature viewing ray should be smaller than 45 degrees;
3) The distance between the current camera position and the position when the feature was first seen should be smaller than 30% of the feature's depth from the position where the feature was first seen;
4) The feature point's position should project to a point inside the image.
This enables a subset of the feature points to be obtained. For each feature in this subset, the BRISK descriptors of its neighbours are calculated. Then, we compare all the neighbours' BRISK values with the feature's BRISK value. The neighbour which is most similar to the feature, with a Hamming distance smaller than 0.15 × 512, will be matched (where 512 is the dimension of the BRISK descriptor), as sketched below.
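A minimal sketch (ours; variable names are hypothetical) of this matching rule, assuming 64-byte binary BRISK descriptors:

import numpy as np

BRISK_BITS = 512                      # dimension of the BRISK descriptor
MAX_HAMMING = int(0.15 * BRISK_BITS)  # matching threshold used in Section IV-B

def match_feature(feature_desc, neighbour_descs):
    """Return the index of the most similar neighbour, or None if even the best
    neighbour exceeds the 0.15 x 512 Hamming-distance threshold."""
    if len(neighbour_descs) == 0:
        return None
    # Hamming distance between binary descriptors = popcount of the XOR.
    xor = np.bitwise_xor(neighbour_descs, feature_desc)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] < MAX_HAMMING else None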
The neighbourhood size was identified empirically and optimised for each of the datasets⁸. To minimise the computational cost, we search in a smaller range and enlarge it if there is no match.

C. Existence Score
The quality of the feature points is evaluated using a probabilistic existence score motivated by Probabilistic Data Association (PDA) tracking (details in [29]). Each feature point is initialised with a score P_{0|0} = 0.5. Any feature point with a score of less than 0.4 will be deleted, unless its depth uncertainty is small enough for it to be permanently preserved. Regarding the score prediction and update, assume that P_{t-1|t-1}^{(k)} is the probability score of feature k at time t−1. At time t, the probability score prediction, P_{t|t-1}^{(k)} (before receiving data at time t), is first calculated as follows:

  P_{t|t-1}^{(k)} = P_{22} · P_{t-1|t-1}^{(k)} + P_{12} · (1 − P_{t-1|t-1}^{(k)})    (44)
where P_{22} denotes the probability that the target stays observable and P_{12} is the probability that the target transfers to observable from unobservable. In this paper, we use P_{22} = 0.95 and P_{12} = 0.05.

After receiving the features at time t, assuming Z_t^{(k)} = {z_{t,1}^{(k)}, ..., z_{t,N_t^{(k)}}^{(k)}} is the set of detections associated to feature k, we calculate:

  δ_t^{(k)} = P_D P_G,                                                                     if N_t^{(k)} = 0
  δ_t^{(k)} = P_D P_G [ 1 − (V_{G_t}^{(k)} / N̂_t^{(k)}) · Σ_{i=1}^{N_t^{(k)}} Λ_{t,i}^{(k)} ],   otherwise    (45)

where P_D is the probability of detection (which is 0.75 in the experiments); P_G is the probability that the detection is inside the gate (which is 0.995 in the experiments); and V_{G_t}^{(k)} is defined as:

  V_{G_t}^{(k)} = π · √|S_t^{(k)}| · G    (46)

where S_t^{(k)} is the uncertainty of the feature point's position in the image, which is calculated via the SUT given the camera state uncertainty and the (static) feature point uncertainty, and G = 9.21 corresponds to the value of P_G. We also assume that:

  N̂_t^{(k)} = N_t^{(k)} − P_D P_G P_{t|t-1}^{(k)}    (47)

  Λ_{t,i}^{(k)} = (1/P_G) · (1 / ((2π)^{M/2} √|S_t^{(k)}|)) · e^{−(d_{t,i}^{(k)})²/2}
               = (1/P_G) · N(z_{t,i}^{(k)}; h([s_t^{pf}; o_t^q], θ_t^{(k)}), S_t^{(k)})    (48)
⁸ We use a larger neighbourhood, 9 × 9, enlarged to 17 × 17, for 'TUM-RGBD' because it contains many sudden movements. We use a 7 × 7 neighbourhood for 'ICL-NUIM'.
where z_{t,i}^{(k)} is the position of detection i, h(·) is the expected position of feature point k in the image, and (d_{t,i}^{(k)})² is the corresponding squared Mahalanobis distance. Finally, (49) is used to update the existence score:

  P_t^{(k)} = (1 − δ_t^{(k)}) · P_{t|t-1}^{(k)} / (1 − δ_t^{(k)} · P_{t|t-1}^{(k)})    (49)

Recall that, due to the method that we use for matching features, there is at most one detection for each feature point. As a result, N_t^{(k)} equals either 1 or 0.
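Putting (44)-(49) together for this at-most-one-detection case gives the following sketch (ours; variable names are hypothetical):

import numpy as np

P22, P12 = 0.95, 0.05        # observable -> observable, unobservable -> observable
PD, PG, G = 0.75, 0.995, 9.21

def existence_update(P_prev, detection=None, S=None, d2=None, M=2):
    """One cycle of the existence score, (44)-(49), for at most one associated
    detection; d2 is the squared Mahalanobis distance and S the covariance S_t^{(k)}."""
    P_pred = P22 * P_prev + P12 * (1.0 - P_prev)                          # (44)
    if detection is None:                                                 # N_t^{(k)} = 0
        delta = PD * PG                                                   # (45), first case
    else:
        V_G = np.pi * np.sqrt(np.linalg.det(S)) * G                       # (46)
        N_hat = 1.0 - PD * PG * P_pred                                    # (47) with N_t = 1
        Lam = (np.exp(-0.5 * d2)
               / (PG * (2 * np.pi) ** (M / 2) * np.sqrt(np.linalg.det(S))))  # (48)
        delta = PD * PG * (1.0 - (V_G / N_hat) * Lam)                     # (45), second case
    return (1.0 - delta) * P_pred / (1.0 - delta * P_pred)                # (49)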
V. Experiments
The proposed approach is tested on two public benchmarks: TUM-RGBD [30] and ICL-NUIM [31]. All the results for RB²-PF and RB-PF are the median values over 10 executions, and the results for ORB-SLAM [5], LSD-SLAM [7] and SVO [32] either come from the related papers or from runs of the publicly released code⁹.

'Freiburg3' of TUM-RGBD consists of several rectified image sequences captured by a hand-held Kinect sensor. The datasets include blurred images and there are also sudden rapid movements, which increase mis-matching and give rise to larger uncertainty in the motion estimation. Example images are shown in Figure 1 (a)-(c). The average camera motion velocities are shown in the fourth column of Table I. 'Living Room (with noise)' in ICL-NUIM is a set of synthetic image sequences with noise that simulates a real camera input (an example is shown in Figure 1 (d)). The main difference to 'Freiburg3' is that the images are of better quality (e.g., no blurring). However, the camera motion contains larger rotations, which leads to greater motion ambiguity. There are also more frames with little texture, which is quite challenging for point-feature-based SLAM/VO methods. While one could reduce the corner detection threshold, doing so causes large numbers of outliers to be generated.

In this experiment, the estimated trajectories are aligned with the ground-truth. All the algorithms are evaluated by Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The RMSE of ATE and RPE are calculated using the public tool¹⁰ provided by [30].

A. System Parameters
In order to assess the generality of the system, we test it using different datasets with almost identical configurations¹¹ (special cases are marked in Table I). The filters are configured as follows:
• Process noise: σ²_pos = 2.5² m² for the camera position and σ²_ori = 1² rad² for the camera orientation.
• Measurement noise: R_t = 2².
• Number of particles: 8; the threshold for resampling: 0.5 × 8.
• Initial inverse depth: 0.25 m with variance 0.15² m².

Fig. 1. Image examples from the TUM-RGBD and ICL-NUIM benchmarks: (a) fr3 struct tex near, (b) fr3 struct notex far, (c) fr3 sitting rpy, (d) Living Room kt0.

⁹ https://github.com/raulmur/ORB_SLAM2 and https://github.com/tum-vision/lsd_slam
¹⁰ https://vision.in.tum.de/data/datasets/rgbd-dataset/tools
¹¹ Our own experiments showed that particular parameters can increase the performance for different datasets.
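For reference, the ATE RMSE metric (after rigid alignment of the estimated trajectory to the ground-truth) can be sketched as below; this is only an illustration of the metric, and the reported numbers come from the public tool of [30].

import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of the Absolute Trajectory Error after a rigid (rotation + translation)
    alignment of the estimated positions to the ground-truth positions."""
    mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    E, Gt = est_xyz - mu_e, gt_xyz - mu_g
    U, _, Vt = np.linalg.svd(E.T @ Gt)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = Vt.T @ D @ U.T                  # rotation aligning estimate to ground-truth
    aligned = (R @ est_xyz.T).T + (mu_g - R @ mu_e)
    return float(np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1))))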
B. A Comparison of RB²-PF and RB-PF
The implementations of RB²-PF-based and RB-PF-based Visual Odometry are modular, with a large number of common modules, such that the results only depend on the filter model used. Table I shows the Relative Pose Error for both approaches. Figure 2 (a) and (b) displays some trajectories from RB²-PF together with the permanent feature points. It is obvious that RB²-PF outperforms RB-PF on all the datasets. Our general conclusion is that RB-PF is not suitable for this problem, at least in the configuration we have used. The reason for this difference is that, in a high-dimensional space, 8 particles are not enough to sample both position and velocity accurately. Handling the velocity using an extra Kalman filter reduces the dimensionality. It is possible that increasing the number of particles would improve the performance of the RB-PF, but the computational cost that results is likely to be prohibitive.

TABLE I
The Median Relative Pose Error Among 10 Executions, Comparing RB²-PF and RB-PF on TUM-RGBD [30].

Dataset (Freiburg3) | RPE Trans. RMSE [cm/s]: RB²-PF / RB-PF | RPE Rot. RMSE [deg/s]: RB²-PF / RB-PF | Avg. Velocity: Trans. [cm/s] / Rot. [deg/s]
sit halfsphere      | 5.59 / 11.22                           | 1.77 / 4.64                           | 18.0 / 19.1
str tex near        | 1.75 / 10.25                           | 0.97 / 3.71                           | 14.1 / 7.7
str tex far         | 2.14 / 9.55                            | 0.65 / 2.63                           | 19.3 / 4.3
str notex far       | 2.55 / 16.93                           | 0.74 / 4.69                           | 16.6 / 4.0
sit rpy             | 4.93 / 5.22                            | 1.76 / 2.98                           | 4.2 / 23.8

To test 'sit rpy', the process noise of both filters is changed to σ²_pos = 0.5² m². To test 'str notex far', the process noise of RB-PF only is changed to σ²_pos = 0.5² m² and σ²_ori = 0.5² rad²; otherwise RB-PF cannot produce any satisfactory result.

Fig. 2. The trajectories estimated by RB²-PF Visual Odometry vs. ground-truth: (a) fr3 struct tex near, (b) fr3 struct notex far, (c) lr kt2n.

C. A Comparison of RB²-PF and Other Algorithms
Table II shows the results of the proposed algorithm compared to the popular ORB-SLAM [5], LSD-SLAM [7] and SVO [32]. Results from 'sit halfsphere', 'str tex near' and 'str tex far' show that ORB-SLAM, with loop closure and Bundle Adjustment, can output the best camera trajectories. The reason is that our approach poses the problem as a filtering problem: if there is a blurred frame where only a few features are matched, the uncertainty will increase. Since the uncertainty is dependent on the unknown realisation of the process noise, the sudden error can be large. In contrast, ORB-SLAM and LSD-SLAM skip the blurred images and then apply the optimisation over the history, and can thereby easily handle this situation.

'str tex near' and 'str tex far' contain trajectories without a loop (shown in Figure 2 (a)). Without loop closure, the disparity between our proposed algorithm and ORB-SLAM is reduced, while our proposed algorithm performs better than LSD-SLAM.

'str notex far' is a textureless dataset. With a rather low corner detection threshold and so many outliers, ORB-SLAM cannot find enough features for initialisation. As a direct approach, LSD-SLAM works well during the first 370 frames but then loses track (out of a total of 814 frames). By decreasing the corner detection threshold, the proposed algorithm finishes the entire sequence accurately (trajectory shown in Figure 2 (b)).

'sit rpy' is challenging and contains only rotations, rapid movements and extreme blurring. Our algorithm outputs a relatively large ATE, but it estimates rotations well (see Table I) with a smaller process noise. ORB-SLAM cannot initialise successfully because there is no translation. LSD-SLAM loses track at an early stage.

The proposed algorithm does not perform well on 'lr kt0n' due to a long period of textureless images. SVO works well since it employs novel edge features to address this issue. ORB-SLAM loses track during the textureless period as well, but it recovers; its extremely good result only considers the keyframe positions (without the lost-track situations). 'lr kt1n' contains fewer textureless frames, which makes for a fairer comparison between SVO and RB²-PF. The result is that RB²-PF outperformed SVO. This shows that RB²-PF is better than the optimisation-based approach when estimating complex camera states. ORB-SLAM fails on initialisation, even though the textures are obvious in the images and the camera motion contains translation. The results for 'lr kt2n' still show the superiority of RB²-PF relative to the optimisation-based estimation method (the trajectory of RB²-PF is shown in Figure 2 (c)). ORB-SLAM performs best, due to its implementation of loop closure and full Bundle Adjustment.

TABLE II
The Median Localisation Error Among 10 Executions (Absolute Trajectory RMSE [cm]) Compared to ORB-SLAM, LSD-SLAM and SVO on the TUM-RGBD and ICL-NUIM [31] Benchmarks.

TUM-RGBD (Freiburg3) | RB²-PF | ORB-SLAM | LSD-SLAM
sit halfsphere       | 7.78   | 1.34     | 5.87
str tex near         | 4.19   | 1.58     | X
str tex far          | 4.43   | 0.77     | 7.95
str notex far        | 4.28   | X        | X
sit rpy              | 5.35   | X        | X

ICL-NUIM (Living Room) | RB²-PF | ORB-SLAM    | SVO
lr kt0n                | 14.15  | 0.25 (Lost) | 2
lr kt1n                | 5.06   | X           | 7
lr kt2n                | 8.87   | 3.97        | 10

X means the algorithm cannot initialise correctly or lost track on more than 50% of the total frames. Results of ORB-SLAM and LSD-SLAM on 'str notex far' and 'sit rpy' come from our own executions using the public code; we tried different configurations for ORB-SLAM. The remaining TUM-RGBD results are from [5]. Results of SVO are from [32]. Results of ORB-SLAM on ICL-NUIM are from our own execution using the public code with a smaller corner detection threshold.

VI. Conclusions
In this paper, an improvement to conventional FastSLAM is proposed and applied to a Monocular Visual Odometry problem. This improvement is based on splitting the vehicle state into two parts such that a subset of the state can be tackled using a Kalman filter. It is found that the proposed algorithm offers improved performance in the context of real benchmarks relative to one of the best pre-existing FastSLAM-based approaches. Our system also yields competitive performance against state-of-the-art SLAM frameworks and does so with no need for either loop closure or Bundle Adjustment.
References
[1] M. G. Dissanayake, P. Newman, S. Clark, H. F. Durrant-Whyte, and M. Csorba, "A solution to the simultaneous localization and map building (SLAM) problem," IEEE Transactions on Robotics and Automation, vol. 17, no. 3, pp. 229-241, 2001.
[2] M. Montemerlo, S. Thrun, D. Koller, B. Wegbreit, et al., "FastSLAM: A factored solution to the simultaneous localization and mapping problem," in AAAI/IAAI, pp. 593-598, 2002.
[3] M. Montemerlo and S. Thrun, "FastSLAM 2.0," in FastSLAM: A Scalable Method for the Simultaneous Localization and Mapping Problem in Robotics, pp. 63-90, 2007.
[4] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pp. 225-234, IEEE, 2007.
[5] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: a versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, 2015.
[6] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2320-2327, IEEE, 2011.
[7] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision, pp. 834-849, Springer, 2014.
[8] H. Strasdat, J. Montiel, and A. J. Davison, "Real-time monocular SLAM: Why filter?," in Robotics and Automation (ICRA), 2010 IEEE International Conference on, pp. 2657-2664, IEEE, 2010.
[9] L. Kneip, M. Chli, R. Siegwart, et al., "Robust real-time visual odometry with a single camera and an IMU," in BMVC, pp. 1-11, 2011.
[10] R. Mur-Artal and J. D. Tardós, "Visual-inertial monocular SLAM with map reuse," IEEE Robotics and Automation Letters, vol. 2, pp. 796-803, April 2017.
[11] E. Eade and T. Drummond, "Scalable monocular SLAM," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1, pp. 469-476, IEEE, 2006.
[12] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1052-1067, 2007.
[13] A. Doucet, N. de Freitas, K. Murphy, and S. Russell, "Rao-Blackwellised particle filtering for dynamic Bayesian networks," in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 176-183, Morgan Kaufmann Publishers Inc., 2000.
[14] T. Schön, R. Karlsson, D. Törnqvist, and F. Gustafsson, "A framework for simultaneous localization and mapping utilizing model structure," in Proceedings of the 10th International Conference on Information Fusion, pp. 1-8, IEEE, 2007.
[15] T. Schön, R. Karlsson, and F. Gustafsson, "The marginalized particle filter in practice," in 2006 IEEE Aerospace Conference, Big Sky, MT, USA, March 2006.
[16] S. Saha and F. Gustafsson, "Particle filtering with dependent noise processes," IEEE Transactions on Signal Processing, vol. 60, no. 9, pp. 4497-4508, 2012.
[17] R. van der Merwe, A. Doucet, N. de Freitas, and E. Wan, "The unscented particle filter," in NIPS, vol. 2000, pp. 584-590, 2000.
[18] C. Kim, R. Sakthivel, and W. K. Chung, "Unscented FastSLAM: a robust and efficient solution to the SLAM problem," IEEE Transactions on Robotics, vol. 24, no. 4, pp. 808-820, 2008.
[19] C. Kim, H. Kim, and W. K. Chung, "Exactly Rao-Blackwellized unscented particle filters for SLAM," in Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 3589-3594, IEEE, 2011.
[20] S. J. Julier, "The scaled unscented transformation," in Proceedings of the 2002 American Control Conference (IEEE Cat. No. CH37301), vol. 6, pp. 4555-4559, IEEE, 2002.
[21] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174-188, 2002.
[22] X. R. Li and V. P. Jilkov, "Survey of maneuvering target tracking: dynamic models," in AeroSense 2000, pp. 212-235, International Society for Optics and Photonics, 2000.
[23] M. Montemerlo and S. Thrun, "Simultaneous localization and mapping with unknown data association using FastSLAM," in Robotics and Automation, 2003. Proceedings. ICRA'03. IEEE International Conference on, vol. 2, pp. 1985-1991, IEEE, 2003.
[24] D. Hähnel, S. Thrun, B. Wegbreit, and W. Burgard, "Towards lazy data association in SLAM," in Robotics Research. The Eleventh International Symposium, pp. 421-431, Springer, 2005.
[25] S. W. Shepperd, "Quaternion from rotation matrix [four-parameter representation of coordinate transformation matrix]," Journal of Guidance and Control, 1978.
[26] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, pp. 430-443, Springer, 2006.
[27] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2548-2555, IEEE, 2011.
[28] O. Miksik and K. Mikolajczyk, "Evaluation of local detectors and descriptors for fast feature matching," in Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 2681-2684, IEEE, 2012.
[29] S. S. Blackman and R. Popoli, Design and Analysis of Modern Tracking Systems. Artech House Publishers, 1999.
[30] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct. 2012.
[31] A. Handa, T. Whelan, J. McDonald, and A. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in IEEE Intl. Conf. on Robotics and Automation, ICRA, (Hong Kong, China), May 2014.
[32] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semi-direct visual odometry for monocular and multi-camera systems," IEEE Transactions on Robotics and Automation, to appear, 2017.