Towards Bundle Adjustment with GIS Constraints for Online Geo-Localization of a Vehicle in Urban Center
Dorra Larnaout (1), Steve Bourgeois (1), Vincent Gay-Bellile (1), Michel Dhome (2)
(1) CEA, LIST, 91191 Gif-sur-Yvette, France.
(2) LASMEA-UMR 6602, Université Blaise Pascal/CNRS, 63177 Aubière Cedex, France.
Abstract
Vehicle geo-localization using a single camera is a challenging issue, motivated by the limitations that GPS suffers from in dense urban areas. While most existing solutions rely on viewpoint recognition techniques, the combination of Structure from Motion algorithms with the geometric information provided by a Geographic Information System (GIS) has emerged as a promising alternative. In this system paper, we introduce a monocular SLAM process that exploits both the "Digital Elevation Model" (DEM) layer and the "3D buildings model" layer of a GIS to reach an accurate and robust geo-localization. Our solution relies on a constrained bundle adjustment that exploits the DEM to correct the camera trajectory, while the 3D buildings model is used to constrain the reconstructed 3D point cloud. Moreover, our solution is designed to achieve on-line, real-time (i.e. about 30 Hz) localization, whereas most state-of-the-art solutions are off-line. In addition to evaluations on different synthetic and real large-scale sequences, an Augmented Reality application is presented to support these statements.

1. Introduction
Recently, the low cost and the miniaturization of video sensors have promoted their integration in many devices to develop innovative applications. In the automotive field, these sensors are used for applications such as parking assistance and pedestrian detection. However, there is currently no commercial system using them as an alternative to, or a way to improve, the localization provided by GPS (Global Positioning System) in dense urban areas, where its precision decreases greatly due to urban canyon problems (i.e. when the signal is obstructed by buildings). During the last decade, this challenging problem has motivated much research aiming to locate a single camera embedded in a vehicle. Existing solutions can be classified by the a priori information used to guide the localization.

A first class of methods is based on an appearance model of the scene. Such approaches rely on two main steps. First, a geo-referenced database containing several images is built during an off-line step. Then, indexing techniques are used to identify, on-line, the image most similar to the current view, see e.g. [2, 13, 17, 18]. To improve the accuracy of the localization, some solutions [4, 6] use another kind of database containing a 3D point cloud associated with 2D descriptors extracted from the images. In this case, the localization is estimated by matching 2D features of the current view with these 3D features. However, appearance-based approaches suffer from two main drawbacks. First, the localization process relies on a database that is expensive to build and hard to keep up to date. Second, the viewpoint recognition step is sensitive to both illumination conditions and viewpoint changes, as shown in [3], where Irschara demonstrates that SIFT features cannot cope with scene variation over seasons.

The second class of methods relies on a geometric model of the scene. The geometric information is mainly provided by a GIS (Geographic Information System), which contains several layers: satellite images, road model, coarse 3D buildings model, etc. In addition to the wide availability of GIS databases, these methods have two main advantages: the stability of the scene geometry over time and its insensitivity to lighting conditions and viewpoint changes. These advantages have promoted the emergence of this class of approaches. In general, these solutions try to combine geometric information provided by the 3D model [7, 9, 15, 16] or satellite images [5] with multi-view constraints in an off-line process, which makes it possible to reach high accuracy. However, due to the lack of constraints provided by the geometric model, mainly in straight lines, Lothe et al. [8] show that it is difficult to ensure an on-line localization with real-time performance.

In this paper, the first on-line and real-time solution using a single camera and based on a GIS database is proposed. It relies on a keyframe-based SLAM process that simultaneously estimates the camera motion (at each frame) and the 3D scene reconstruction, represented by a sparse point cloud (at each keyframe). While the classical approach refines the 3D reconstruction and the estimated motion using a bundle adjustment process, we propose to exploit the a priori knowledge provided by GIS information to constrain the camera trajectory (cf. Section 2) and the point cloud (cf. Section 3). The complementarity of these two constraints is explained in Section 4, where we also describe our complete framework, which makes it possible to activate, deactivate or merge each constraint according to the kinds of GIS data available along the trajectory followed by the camera. Evaluation on real large-scale sequences (i.e. longer than 1000 meters) and an Augmented Reality application demonstrate the robustness and the accuracy of the localization provided by the proposed solution (cf. Section 5).
Notation. In the following, we assume that an initial reconstruction of the scene (3D points and camera trajectory) has already been estimated with a monocular SLAM algorithm. We denote by $\{Q_i\}_{i=1}^{N}$ the set of reconstructed 3D points, where $N$ is the total number of points. In the same way, the set of cameras is represented by $\{C_j\}_{j=1}^{m}$, where $m$ is the total number of cameras used in the BA. Each camera $C_j$ is defined by its intrinsic parameters $K$ and extrinsic parameters $(R_j, t_j)$. Its projection matrix $P_j$ is given by $P_j = K R_j^T (I_3 \mid -t_j)$, where $I_3$ is the $3 \times 3$ identity matrix. Each 3D point $Q_i$ has a corresponding set of 2D observations $\{q_{i,j}\}_{j \in A_i}$, where $q_{i,j}$ is the observation of the 3D point $Q_i$ in the camera $C_j$ and $A_i$ is the set of indexes of the cameras observing $Q_i$. All entities are expressed in homogeneous coordinates, e.g. $q_{i,j} \approx (x_{i,j}, y_{i,j}, w_{i,j})^T$, where $^T$ denotes transposition and $\approx$ equality up to a non-zero scale factor. We assume that the 3D buildings model is defined by a set of planes $\{\pi_k\}_{k=1}^{p}$ ($p$ is the total number of planes constituting the model) and we denote by $M_k$ the transfer matrix between the coordinate frame of plane $\pi_k$ and the world coordinate frame. On the other hand, we model the DEM by a set of 3D edges $\{e_l\}_{l=1}^{r}$, where each edge $e_l$ represents the road axis ($r$ is the total number of edges constituting the road model). In the following, we assume that the tilt of the road along its width is negligible. The road plane $\psi_l$ corresponding to the edge $e_l$ is then simply computed by widening the latter. Finally, we denote by $L_j$ the transfer matrix between the world coordinate frame and the coordinate frame of the plane $\psi_j$ to which the camera $C_j$ has to belong.

2. Trajectory-constrained SLAM
Throughout its trajectory, the camera embedded in the vehicle keeps the same height relative to the road plane. This hypothesis, in spite of its simplicity, has two main advantages. First, unlike assumptions such as nonholonomic motion [12], it does not restrict the type of vehicle, so it can be applied to any kind of car. Second, it provides a new constraint on the estimated trajectory of the vehicle, and more specifically on the camera poses. This can avoid drifts due to the inaccurate estimation of the camera height and therefore provides more stability for the localization. In this section, we introduce a new bundle adjustment that takes into account the altitude constraint (Section 2.1). We then detail the additional steps required by the BA to guarantee a better convergence (Section 2.2). Finally, an evaluation on a synthetic sequence is presented in Section 2.3.

2.1. Bundle adjustment with DEM constraint
While the altitude of the camera varies in the world coordinate frame, it keeps the same value in the local coordinate frame attached to the road plane (in this frame, the X axis is aligned with the road axis while the Z axis is orthogonal to the road plane). This observation makes it possible to introduce the altitude constraint in the BA simply by using a new parametrisation of the camera pose, without adding any penalty term. For that, the camera is first associated with the appropriate road plane. Once its pose is expressed in the road coordinate frame, its coordinate along the normal axis of the road is fixed at the right height. Such a solution only requires computing the transfer matrix $L_j$ from the world coordinate frame to the road plane coordinate frame (it is computed beforehand, since the DEM is constant). Moreover, it reduces the number of degrees of freedom from 6 to 5 for each camera. With this new parametrisation, the re-projection of a 3D point $Q_i$, computed from the projection matrix $P_j^{\psi}$ of $C_j$ associated with the road plane $\psi_j$, is given by:

$$q_{i,j} \approx P_j^{\psi} L_j Q_i \approx K (R_j^{\psi})^T \left[ I_3 \mid -t_j^{\psi} \right] L_j Q_i, \tag{1}$$

where $R_j^{\psi}$ and $t_j^{\psi}$ are the extrinsic parameters of the camera expressed in the road plane coordinate frame. The cost function used in the bundle adjustment with altitude constraint is given by:

$$E\left(\{P_j^{\psi}\}_{j=1}^{m}, \{Q_i\}_{i=1}^{N}\right) = \sum_{i=1}^{N} \sum_{j \in A_i} \rho\left(d^2\left(q_{i,j}, P_j^{\psi} L_j Q_i\right), s_1\right), \tag{2}$$
where $d^2(q_0, q_1) = \|q_0 - q_1\|^2$ is the squared point-to-point distance and $\rho(r, s_1) = \frac{r^2}{r^2 + s_1^2}$ is the Geman-McClure M-estimator used to reject outliers. $s_1$ is the rejection threshold, estimated from the vector $r$ concatenating the re-projection errors as $s_1 = \mathrm{median}(r) + 5.2 \cdot \mathrm{MAD}(r)$, where MAD is the Median of Absolute Deviation. During the optimization step, only $(5 \times m + 3 \times N)$ parameters are estimated, so the total number of optimized parameters is reduced by $m$. Eq. 2 implies that each camera must already be associated with a road plane. This additional step (i.e. camera-to-road plane association) is very important to ensure a better convergence of the optimization process. In the following, the integration of the association step and of the BA with altitude constraint in the SLAM optimization process is described.
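As an illustration of the robust terms above, the following minimal Python sketch computes a rejection threshold of the form median + 5.2 * MAD and evaluates the Geman-McClure function on a few residuals. The function names and the sample residuals are hypothetical and chosen only for illustration; the sketch is not the authors' implementation.

```python
from statistics import median


def mad(values):
    """Median of the absolute deviations from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)


def rejection_threshold(residuals, c=5.2):
    """s = median(r) + c * MAD(r), with c = 5.2 as stated in the text."""
    return median(residuals) + c * mad(residuals)


def geman_mcclure(r, s):
    """rho(r, s) = r^2 / (r^2 + s^2): close to 0 for small r, saturates towards 1."""
    return (r * r) / (r * r + s * s)


# Hypothetical re-projection errors (pixels); the last one plays the outlier.
residuals = [0.5, 0.7, 0.6, 0.4, 0.8, 25.0]
s1 = rejection_threshold(residuals)
costs = [geman_mcclure(r, s1) for r in residuals]
```

Because the function saturates, the outlying residual contributes a cost close to 1 whatever its magnitude, whereas inliers well below the threshold contribute almost nothing.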
Figure 1. Performances of SLAM with altitude constraint. (a), (b), (c) Illustrations of the synthetic sequence. Red curves represent the results of the BA introducing the additional constraint, while blue ones are the results of the BA without constraint. (d) Localization error at each keyframe. (e) Evolution of the scale factor.
2.2. Integration in the SLAM process
Initially, the poses estimated by the SLAM process do not exactly match the desired altitude, since they are not yet associated with a road plane. To satisfy the condition required by Eq. 2, the altitude of each camera must be corrected. This can be done simply by matching each camera to the nearest road plane and then correcting its pose using the method described in Section 2.1. Nevertheless, due to the uncertainty of the localization, some wrong camera-to-road plane associations may occur (e.g. at crossroads), which can disturb the convergence of the BA. To handle this problem, we propose to re-evaluate the association after the BA. An iterative optimization process is then adopted, in which the association step and the BA are alternated until the association result remains constant. The steps of one iteration of the proposed method are summarized in Table 1.
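The alternation between association and optimization described in this section can be sketched as follows. This is a simplified Python illustration under several assumptions: only camera positions are handled (not full poses), the road model is a list of 3D edges with no lateral tilt and no vertical edge, and the constrained BA is replaced by a placeholder `refine` callback. All names are hypothetical.

```python
import math


def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, k): return tuple(x * k for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))


def road_frame(edge):
    """Frame of the road plane obtained by widening an edge:
    X follows the road axis, Z is normal to the plane (no lateral tilt assumed)."""
    a, b = edge
    x = normalize(sub(b, a))
    y = normalize(cross((0.0, 0.0, 1.0), x))  # horizontal; fails for a vertical edge
    z = cross(x, y)
    return a, x, y, z


def distance_to_edge(p, edge):
    a, b = edge
    ab = sub(b, a)
    t = max(0.0, min(1.0, dot(sub(p, a), ab) / dot(ab, ab)))
    closest = add(a, scale(ab, t))
    return math.sqrt(dot(sub(p, closest), sub(p, closest)))


def associate(p, edges):
    """Index of the nearest road edge (camera-to-road plane association)."""
    return min(range(len(edges)), key=lambda l: distance_to_edge(p, edges[l]))


def correct_altitude(p, edge, height):
    """Keep the in-plane coordinates, overwrite the coordinate along the normal."""
    origin, x, y, z = road_frame(edge)
    d = sub(p, origin)
    u, v = dot(d, x), dot(d, y)
    return add(origin, add(scale(x, u), add(scale(y, v), scale(z, height))))


def constrain_trajectory(positions, edges, height, refine=lambda ps: ps, max_iter=10):
    """Alternate association and refinement until the associations are stable."""
    labels = None
    for _ in range(max_iter):
        new_labels = [associate(p, edges) for p in positions]
        if new_labels == labels:
            break
        labels = new_labels
        positions = [correct_altitude(p, edges[l], height)
                     for p, l in zip(positions, labels)]
        positions = refine(positions)  # placeholder for the constrained BA
    return positions, labels


# Hypothetical road made of two edges, and two drifted camera positions.
edges = [((0.0, 0.0, 0.0), (10.0, 0.0, 0.0)),
         ((10.0, 0.0, 0.0), (10.0, 10.0, 1.0))]
cameras = [(5.0, 0.3, 2.7), (10.2, 6.0, 3.0)]
positions, labels = constrain_trajectory(cameras, edges, height=1.5)
```

In this toy example the loop stops at the second pass, since re-associating the corrected positions yields the same road planes.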
Table 1. One iteration of the optimization process using BA with altitude constraint.
1. For each camera $C_j$, $j \in \{1, \dots, m\}$:
   (a) Search for the plane $\psi_j$, computed from the DEM, nearest to the camera $C_j$.
   (b) Correct the camera $C_j$ in the $\psi_j$ coordinate frame.
   (c) Compute the projection matrix $P_j^{\psi}$ in the $\psi_j$ coordinate frame.
2. Compute the rejection threshold $s_1$.
3. Minimize the cost function (Eq. 2) using a Levenberg-Marquardt algorithm.

2.3. Evaluation on Synthetic Sequence
The following experiment aims to evaluate the contribution of the altitude constraint in the BA. For that, the SLAM algorithm constrained to a DEM is evaluated on the synthetic sequence illustrated in Fig. 1. This sequence simulates a video stream recorded by a camera embedded on a vehicle at an altitude of 1.5 meters above a flat ground. The camera passes through a corridor whose walls represent the geometric model of the scene shown in Fig. 2(a). The ground truth is represented in the same figure. Localization errors and scale factor evolution are measured along the sequence. Localization errors are computed by measuring the error between the estimated position of each camera and the corresponding one in the ground truth. The scale factor is computed as the ratio of the distance between two successive camera poses to the corresponding distance in the ground truth. We compare two algorithms: SLAM constrained to the DEM and the one described in [10], where no constraint is introduced. The evolution of localization errors and scale factor drift is shown in Fig. 1(d) and 1(e). Taking into account the altitude constraint clearly reduces both localization errors and scale factor drift. Indeed, the median error for SLAM constrained to the DEM is about 3.16 meters, while it exceeds 9 meters for the "classical" SLAM. The altitude constraint thus yields a localization about 3 times more accurate. Moreover, the scale factor evolution shows that, with the additional constraint, the camera motion is compressed by at worst 0.09%, while a compression of 40% is observed when no constraint is introduced. The improvement of the localization is highlighted in Fig. 2, where the 3D reconstructions obtained with the two algorithms are represented.

This experiment underlines the benefits of the altitude constraint in the bundle adjustment. Indeed, it prevents errors in altitude and reduces the scale factor drift. However, this constraint does not totally remove the localization drifts. The result remains dependent on the initial pose introduced in the optimization step. In other words, if the initial pose is too far from the optimal one, the camera-to-road plane association can be disturbed and therefore, during the optimization step, camera poses can get trapped in local minima. In the following Section 3, we deal with the challenging problem of scale factor drift by analysing the contribution of another geometric constraint, provided by the 3D buildings model.

Figure 2. Results of altitude constraint. The camera trajectory is represented with blue triangles, 3D points are represented in red and the geometric model is drawn with black lines. (a) The geometric model and the ground truth. (b) SLAM without geometric constraint. (c) SLAM constrained to DEM.

3. SLAM constrained to a 3D buildings model
In urban areas, buildings are the most prevalent structures. Having surfaces orthogonal to the ground, they provide geometric information that can constrain not only the camera pose but also the scene (i.e. the 3D point cloud). In addition, 3D buildings models are now widely available through GIS databases. These advantages have motivated recent research using this prior geometric knowledge to improve the SfM reconstruction. Most existing solutions share a common approach to introduce the constraints provided by building facades: aligning the SfM reconstruction on the 3D city model while respecting the geometric multi-view constraints, which then allows a satisfactory estimate of the scale factor. Unfortunately, these approaches, e.g. [5, 7, 14], generally handle this problem off-line. The first on-line solution was proposed by Lothe et al. in [8]. The motion of the camera is estimated using a SLAM process with a simple BA without additional constraint. Then, an ICP is used to align the local reconstruction with the buildings model. Nevertheless, since the ICP convergence is guaranteed only if the scene provides enough geometric constraints, this correction occurs exclusively after turns. This implies that the localization is not accurate enough during the rest of the trajectory, mainly in straight lines, which makes an Augmented Reality application impossible. Recently, Tamaazousti et al. have proposed in [16] a real-time BA introducing constraints provided by both the known (i.e. belonging to the geometric model) and unknown (e.g. belonging to the other elements of the scene) parts of the environment. Even if this solution is successfully applied, on-line, to track a small object in an unknown environment, its use for large-scale motion is restricted to an off-line process, due to a lack of robustness.

In the following, we propose to adopt the solution introduced in [16], summarized in Section 3.1, and to adapt it to our context. We then propose to introduce preliminary steps inspired by the urban context to ensure a better convergence of the BA (Section 3.2) and to provide more robustness.

3.1. Bundle adjustment with buildings model constraints
During its execution, the SLAM process provides a 3D map containing both the positions describing the camera trajectory and a 3D point cloud describing the observed scene. By analysing this set of points, two classes can be distinguished: 3D points associated with building facades, constituting the known part of the environment, and 3D points belonging to the other elements, modelling the unknown part of the environment (e.g. trees, parked cars, road signs). The position of the points belonging to the unknown part of the environment can be refined by applying a BA minimizing the following cost function:

$$E_E\left(\{P_j\}_{j=1}^{m}, \{Q_i\}_{i \in \mathcal{U}}\right) = \sum_{i \in \mathcal{U}} \sum_{j \in A_i} \rho\left(d^2\left(q_{i,j}, P_j Q_i\right), s\right), \tag{3}$$
where $\mathcal{U}$ is the set of indexes of the 3D points that constitute the unknown part of the environment. A 3D point $Q_i$ belonging to the first class introduced above (i.e. the known part of the environment) has an additional geometric property, since it has only two degrees of freedom. In fact, there exists in the 3D buildings model a plane $\pi_i$ to which $Q_i$ belongs. Consequently, in the coordinate frame attached to $\pi_i$, we can define $Q_i^{\pi} = (X_i^{\pi}, Y_i^{\pi}, 0, 1)^T$, where $Q_i = M_i Q_i^{\pi}$ (recall that $M_i$ is the transfer matrix between the coordinate frame of the plane $\pi_i$ and the world coordinate frame). This relation is used to optimize an SfM reconstruction taking into account the building constraint, by minimizing the following cost function:

$$E_M\left(\{P_j\}_{j=1}^{m}, \{Q_i^{\pi}\}_{i \in \mathcal{M}}\right) = \sum_{i \in \mathcal{M}} \sum_{j \in A_i} \rho\left(d^2\left(q_{i,j}, P_j M_i Q_i^{\pi}\right), s\right), \tag{4}$$

where $\mathcal{M}$ is the set of indexes of the 3D points associated with the known part of the environment, with $\mathrm{card}(\mathcal{M}) + \mathrm{card}(\mathcal{U}) = N$, and $s = \max(s_{\mathcal{U}}, s_{\mathcal{M}})$, where $s_{\mathcal{U}}$ and $s_{\mathcal{M}}$ are the rejection thresholds estimated for the unknown and known parts of the environment respectively. Thus, the complete cost function of the resulting bundle adjustment is a bi-objective one, given by:

$$E\left(\{P_j\}_{j=1}^{m}, \{Q_i\}_{i \in \mathcal{U}}, \{Q_i^{\pi}\}_{i \in \mathcal{M}}\right) = E_E\left(\{P_j\}_{j=1}^{m}, \{Q_i\}_{i \in \mathcal{U}}\right) + E_M\left(\{P_j\}_{j=1}^{m}, \{Q_i^{\pi}\}_{i \in \mathcal{M}}\right). \tag{5}$$
Once the optimization step is carried out, the 3D points $Q_i^{\pi}$ lie exactly on their corresponding building facades. To take into account the inaccuracy of the buildings model, these points are then re-triangulated.
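The two-degree-of-freedom parametrisation $Q_i = M_i Q_i^{\pi}$ used in this section can be illustrated with a small Python sketch. It assumes that the transfer matrix is built from an origin on the facade, two orthonormal in-plane axes and the plane normal; the helper names and the example wall are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def plane_to_world(origin, u, v, n):
    """4x4 transfer matrix M whose columns are the in-plane axes u and v,
    the normal n and the plane origin (plane frame -> world frame)."""
    return [[u[0], v[0], n[0], origin[0]],
            [u[1], v[1], n[1], origin[1]],
            [u[2], v[2], n[2], origin[2]],
            [0.0, 0.0, 0.0, 1.0]]


def matvec(M, q):
    return [sum(m * x for m, x in zip(row, q)) for row in M]


def to_plane_coords(Q, origin, u, v):
    """Orthogonal projection of a world point onto the plane: keep (X, Y), drop Z."""
    d = [Q[k] - origin[k] for k in range(3)]
    return (dot(d, u), dot(d, v))


def from_plane_coords(M, xy):
    """Q = M * (X, Y, 0, 1)^T: a point constrained to lie on the facade."""
    return matvec(M, [xy[0], xy[1], 0.0, 1.0])


# Hypothetical vertical facade located at x = 2 in the world frame.
origin, u, v, n = (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
M = plane_to_world(origin, u, v, n)
xy = to_plane_coords((2.3, 4.0, 1.2), origin, u, v)  # reconstructed point, 0.3 m off the wall
Q_on_plane = from_plane_coords(M, xy)
```

Only the pair `xy` would be optimized for such a point, which is why the constrained point has two degrees of freedom instead of three.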
3.2. Points cloud segmentation
To ensure a successful optimization, a point-to-building facade association is required. This assignment needs a preliminary segmentation of the 3D point cloud. To deal with this classification problem, [16] uses a simple ray tracing from the 2D observations $\{q_{i,j}\}_{j \in A_i}$. $Q_i$ is then classified as belonging to the known part of the environment if the corresponding ray intersects a plane of the buildings model, and it is therefore associated with this plane. However, this solution is unsuitable for the urban context, since it does not handle the occlusion problem, which occurs frequently. In fact, throughout its motion, the camera will often be surrounded by building facades; in this case, the described segmentation method implies that most points will be associated with the known part of the environment. This may disturb the localization. Since solutions based on building segmentation [1, 11] are time consuming, we propose to guide the segmentation with the buildings model. The main idea is to perform the segmentation and the association steps according to the distance $d$ between each point and its nearest building facade. Thus, a 3D point is matched to a plane if the measured distance is lower than a specific threshold. For more robustness, this threshold must compensate for both the inaccuracy of the buildings model and the drifts of the SLAM. We therefore estimate it dynamically at each keyframe from the distribution of $d$. Even if this distribution is not known a priori, it is possible to predict it from the a posteriori knowledge of the previous distribution computed at the last frame, where the 3D points belonging to the known part of the environment were already matched to their respective planes. Moreover, since the error evolution of the SLAM process is gradual, the distribution of the distance at the current keyframe can be approximated from those observed over the last $n$ keyframes.

Moreover, we note that the error on the 3D point positions is not isotropic (e.g. the uncertainty is greater along the axis of camera motion, see Fig. 3 and Table 2). Consequently, it seems inadequate to apply the same threshold to all 3D points.

Figure 3. Distance distribution. (a) The current view observed by the camera. (b) (resp. (c)) Distance distribution of the right (resp. left) lateral facade observed by the camera. (d) Distance distribution of the orthogonal facade observed by the camera.

Table 2. Median distance for each angular sector.
Facade orientation    Median distance (meters)
Lateral right         0.94
Lateral left          0.68
Orthogonal            3.53

The threshold must therefore vary according to the orientation of each building facade, computed in the camera coordinate frame. The idea is to divide the whole space into a finite number $F$ (see footnote 5) of angular sectors $\{\alpha_i\}_{i=0}^{F}$, in which each facade $\pi$ is classified according to its normal. The distance distribution is then computed for each $\alpha_i$. Since the distribution tends towards a Gaussian in each angular sector, we propose to estimate the threshold $\tau_i$, corresponding to the sector $\alpha_i$, using the median and the MAD (Median of Absolute Deviation) functions:

$$\tau_i = \begin{cases} +\infty & \text{if } \mathrm{card}(D_{\alpha_i}) \le \chi \\ \mathrm{med}(D_{\alpha_i}) + c \cdot \mathrm{mad}(D_{\alpha_i}) & \text{otherwise.} \end{cases} \tag{6}$$

$D_{\alpha_i}$ is a vector concatenating the distances between any 3D point belonging to the known part of the environment, observed over the last $n$ cameras, and the associated model plane having the same angular sector as $\pi_i$. $\chi$ is the minimal cardinality of $D_{\alpha_i}$ for which the existing distribution is considered sufficiently relevant. Finally, $c$ is a constant fixed to 5.2. As with the camera-to-road plane association, some wrong point-to-building facade associations may occur and therefore disturb the convergence of the BA. To ensure an optimal convergence, we propose to adopt an iterative optimization process in which the point associations and the BA are alternated.
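A minimal Python sketch of the per-sector threshold of Eq. 6 is given below. The uniform binning of facade normals into angular sectors is an assumption made only for illustration (the text does not specify how sectors are delimited), and the function names and sample distances are hypothetical.

```python
import math
from statistics import median


def mad(values):
    """Median of the absolute deviations from the median."""
    m = median(values)
    return median(abs(v - m) for v in values)


def sector_index(nx, ny, num_sectors=4):
    """Sector of a facade normal (nx, ny) expressed in the camera frame.
    Uniform binning of the angle is an assumption of this sketch."""
    angle = math.atan2(ny, nx) % (2.0 * math.pi)
    return int(angle / (2.0 * math.pi / num_sectors)) % num_sectors


def sector_threshold(distances, chi, c=5.2):
    """tau = +inf when too few samples are available, median + c * MAD otherwise."""
    if len(distances) <= chi:
        return math.inf
    return median(distances) + c * mad(distances)


def belongs_to_known_part(distance, tau):
    """A point is matched to its nearest facade when its distance is below tau."""
    return distance < tau


# Hypothetical point-to-facade distances (meters) collected over the last keyframes.
lateral = [0.8, 0.9, 1.0, 0.95, 0.94, 1.1, 0.7]
tau = sector_threshold(lateral, chi=5)
```

With too few samples the threshold is infinite, so every candidate point of that sector is accepted until the distribution becomes informative.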
Footnote 5: Experimentally, we notice that considering 4 sectors is sufficient.

3.3. Evaluation
In this section, we first evaluate the proposed segmentation method described in Section 3.2. Then we assess the buildings constraint in conditions similar to those encountered in real urban areas. The synthetic video described in Section 2.3 is used.

To evaluate the robustness and the precision of the segmentation method, several objects partially hiding the walls are inserted, see Fig. 4(a) and 4(b). These objects constitute the unknown part of the environment and can be likened to cars or trees in an urban context. In this experiment, the SLAM process with the BA introducing building constraints is used. We then compare the proposed segmentation method (Section 3.2) and the one detailed in [16]. Fig. 4(e) shows that the second segmentation method introduces many erroneous point-to-building plane associations (red dots located on the road). Since the number of wrong associations is too high, the rejection threshold is misjudged and the bundle adjustment diverges; consequently, the algorithm fails. In contrast, since the first segmentation method produces a low number of outliers, there is no convergence problem.

To evaluate the contribution of the building constraint, localization errors and scale factor evolution are represented in Fig. 5. These two quantities are computed as described in Section 2.3. We compare two algorithms: SLAM constrained to the buildings model and the one described in [10], where no constraint is introduced. This additional constraint reduces both localization errors and scale factor drift. Indeed, the median error for the constrained SLAM is about 0.14 meters, while it is equal to 1.24 meters for the "classical" SLAM. In other words, the buildings constraint reduces the localization error by a factor of more than 8. Moreover, we note that the associated scale factor evolution is centred on 1 and that drifts are very low.

These experiments show that an accurate point-to-building facade association improves the convergence of the BA and therefore the accuracy of the localization. However, for this step to succeed, the pose of the camera must be near the correct one. In fact, large errors on the camera altitude can easily disturb the result of the segmentation and therefore the quality of the global localization. In the following section, we propose to fuse these two constraints in order to ensure an optimal localization.

Figure 4. Segmentation method. (a), (b) Illustrations of the synthetic sequence. The camera trajectory is drawn with blue triangles. Red points constitute the known part of the environment, while green points belong to the unknown part of the environment. (c) The geometric model (red) and the added 3D objects (green). (d) Localization obtained with SLAM constrained to the buildings model and using the segmentation method described in [16]. (e) Localization obtained with SLAM constrained to the buildings model and using the segmentation method we propose.

Figure 5. Performances of SLAM with buildings constraints using the proposed segmentation method. (a) Localization error at each keyframe. (b) Evolution of the scale factor.

4. SLAM constrained to a complete GIS model
As detailed above, GIS models provide geometric information that can be introduced in the optimization step to reduce the different SLAM drifts. The use of the DEM yields more stable camera poses. However, the accuracy of the localization remains dependent on the initialization of the system to optimize, which is mainly disturbed by a wrong estimate of the scale factor. The 3D buildings model, on the other hand, provides other kinds of geometric constraints. These can be applied to the 3D point cloud representing the scene observed by the camera to allow a satisfactory scale factor estimation. Nevertheless, the result is extremely sensitive to altitude drift of the camera pose and remains dependent on the constraints made available by the 3D buildings model: for example, if the observed facades are situated on only one side of the camera, the scale factor will be wrongly estimated. To take advantage of these different kinds of constraints and to overcome the limitations of using each one separately, an elegant way to fuse them is introduced. We propose a new bundle adjustment that guarantees, at each keyframe, a coherent and accurate camera pose with respect to a 3D point cloud that respects the true geometry of the scene. The resulting cost function is given by:
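$$E\left(\{P_j^{\psi}\}_{j=1}^{m}, \{Q_i\}_{i \in \mathcal{U}}, \{Q_i^{\pi}\}_{i \in \mathcal{M}}\right) = E_E\left(\{P_j^{\psi}\}_{j=1}^{m}, \{Q_i\}_{i \in \mathcal{U}}\right) + E_M\left(\{P_j^{\psi}\}_{j=1}^{m}, \{Q_i^{\pi}\}_{i \in \mathcal{M}}\right), \tag{7}$$

where

$$E_E\left(\{P_j^{\psi}\}_{j=1}^{m}, \{Q_i\}_{i \in \mathcal{U}}\right) = \sum_{i \in \mathcal{U}} \sum_{j \in A_i} \rho\left(d^2\left(q_{i,j}, P_j^{\psi} L_j Q_i\right), s\right) \tag{8}$$

and

$$E_M\left(\{P_j^{\psi}\}_{j=1}^{m}, \{Q_i^{\pi}\}_{i \in \mathcal{M}}\right) = \sum_{i \in \mathcal{M}} \sum_{j \in A_i} \rho\left(d^2\left(q_{i,j}, P_j^{\psi} L_j M_i Q_i^{\pi}\right), s\right). \tag{9}$$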
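The chain of transformations inside Eq. 9 (facade frame, then world frame, then road frame, then image) can be checked numerically with the short Python sketch below. The intrinsic matrix, the camera orientation, the identity road transfer matrix and the facade are all hypothetical values chosen for illustration, under the convention $P = K R^T (I_3 \mid -t)$ given in the notation.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def projection_matrix(K, R, t):
    """P = K R^T [I | -t] (3x4)."""
    I_t = [[1.0, 0.0, 0.0, -t[0]],
           [0.0, 1.0, 0.0, -t[1]],
           [0.0, 0.0, 1.0, -t[2]]]
    return matmul(K, matmul(transpose(R), I_t))


def reproject_plane_point(P_psi, L, M, xy):
    """Pixel coordinates of q ~ P_psi L M (X, Y, 0, 1)^T."""
    Q_pi = [[xy[0]], [xy[1]], [0.0], [1.0]]
    q = matmul(P_psi, matmul(L, matmul(M, Q_pi)))
    return (q[0][0] / q[2][0], q[1][0] / q[2][0])


# Hypothetical intrinsics for a 640x480 image.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
# Camera axes (columns) in the road frame: optical axis along the road X axis,
# image x axis pointing right (-Y), image y axis pointing down (-Z).
R = [[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
t = (0.0, 0.0, 1.5)  # coordinate along the road normal fixed to the camera height
P_psi = projection_matrix(K, R, t)
# World frame taken equal to the road frame in this example.
L = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
# Facade parallel to the road, 4 m to the left: in-plane axes along X and Z.
M = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 4.0],
     [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
u_px, v_px = reproject_plane_point(P_psi, L, M, (10.0, 1.5))
```

The facade point located 10 m ahead at camera height projects onto the horizon row of the image and to the left of the principal point, as expected for a wall on the left of the road.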
As explained above, to ensure an optimal convergence, an iterative optimization process is adopted. In other words, points and camera associations and the BA are alternated. Table 3 resume the different steps of one iteration of the proposed method. This formalism implies that the altitude constraint is applied separately to cameras created at key-frames. Therefore, it is possible to mix altitude-constrained camera and unconstrained camera in the same bundle adjustment. It allows the system to deactivate the altitude constraint in some region where the DEM is not available (e.g. car park) or when incoherent motion is detected (e.g. hump).
(a)
(c)
(b)
(d)
(e)
Figure 6. The ”Versailles” sequence. (a) The real trajectories of the camera. (b) The buildings model (red), the road model (blue). (c), (d), (e), Illustration of the ”Versailles” sequence: buildings facades are observed on both side of road, a crossroad situation, buildings exist only on one side of the road.
5. Results on real sequences In this section, performances of the proposed solution are evaluated on real large scale sequences. They are 640x480 videos representing two long tours (one about 1500 meters, the second is about 1000 meters) recorded in the district of Versailles, France (see Fig. 6(a)). They have been recorded by a standard camera fixed on the roof if the vehicle, providing 30 fps with a field of view of 90 degrees. The distance between the camera and the ground is about 1.5 meters. Fig. 6(c), 6(d) and 6(e) are some illustrations of these sequences. No ground truth is available for these real sequence, evaluation is, then, based on the visual result of the localization (Section 5.1) and the augmented reality application (Section 5.2). A top view if the GIS 3D models, provided by the French National Institute of Geography, is presented in Fig. 6(b). Walls having less than 2 meters as height, are not represented. The roads model is represented by 3D edges modelling the road axis. The accuracy of both city and roads
(a)
(b)
(c)
Figure 7. Localization in urban area using SLAM constrained to a complete GIS model. (a) Localization using a basic SLAM process. (b) Localization using SLAM constrained only to buildings model [16]. (c) Localization using SLAM constrained to a complete GIS model.
models is about one meter.
5.1. Localization in urban area
Table 3. One iteration of optimization process using BA constrained to a complete GIS model
We compare in Fig. 7 results of three algorithms: the SLAM constrained to a complete GIS model, a keyFramebased SLAM process without any constraint [10] and SLAM constrained only to buildings model [16]. As expected the ”classical” SLAM algorithm without constraints(see Fig. 7(a)) suffers from scale factor drift and accumulation errors. Fig. 7(b) and 7(c) underline the benefits of the altitude constraint and the proposed point-to-building plane segmentation algorithm. Indeed, on straight line, the SLAM constrained only to buildings model is subject of important scale factor drift. This drift can be explained by both drifts relative to the camera altitude and segmentation errors. This problem is particularly noticeable in critical configurations namely where buildings facade are absent at one side of the street (areas A and B in the first sequence, C and D in the second one). For these reasons, SLAM constrained to a complete GIS model ensures successfully the localization
1. For each j ∈ Ai (a) Search the nearest plane ψj , computed from DEM, to the camera Cj . (b) Correct the camera Cj in ψj coordinate frames. (c) Compute the projection matrix Pjψ in ψj coordinate frames. 2. Segment {Qi }N i=1 in (Qi )i∈U and (Qi )i∈M . 3. For each i ∈ M (a) Project Qi on its correspondent plane πi . 4. Compute the rejecting threshold s. 5. Minimize the cost function introduced in Eq. 5 using a Levenberg Marquardt. 6. Triangulation of 3D points in (Qi )i∈M taking into account the updated camera poses.
4327
plete framework ensuring a hight accuracy of localization in dense urban area.
References (a)
[1] G. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008.
[2] M. Cummins and P. Newman. Appearance-only slam at large scale with fab-map 2.0. The International Journal of Robotics Research, 2010.
[3] A. Irschara. Scalable Scene Reconstruction and Image Based Localization. PhD thesis, 2012.
[4] A. Irschara, C. Zach, J. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In CVPR, pages 2599–2606. IEEE, 2009.
[5] R. Kaminsky, N. Snavely, S. Seitz, and R. Szeliski. Alignment of 3d point clouds to overhead images. In CVPR Workshop, 2009.
[6] Y. Li, N. Snavely, and D. Huttenlocher. Location recognition using prioritized feature matching. In ECCV, 2010.
[7] P. Lothe, S. Bourgeois, F. Dekeyser, E. Royer, and M. Dhome. Towards geographical referencing of monocular slam reconstruction using 3d city models: Application to real-time accurate vision-based localization. In CVPR, June 2009.
[8] P. Lothe, S. Bourgeois, E. Royer, M. Dhome, and S. N. Collette. Real-time vehicle global localisation with a single camera in dense urban areas: Exploitation of coarse 3d city models. In CVPR, June 2010.
[9] M. Maurer, M. Rumpler, A. Wendel, C. Hoppe, A. Irschara, and H. Bischof. Geo-referenced 3d reconstruction: Fusing public geographic data and aerial imagery. In ICRA, 2012.
[10] E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. Real time localization and 3d reconstruction. In CVPR, 2006.
[11] M. Recky, A. Wendel, and F. Leberl. Facade segmentation in a multi-view scenario. In 3DIMPT, 2011.
[12] D. Scaramuzza, F. Fraundorfer, M. Pollefeys, and R. Siegwart. Absolute scale in structure from motion from a single vehicle mounted camera by exploiting nonholonomic constraints. In ICCV, 2009.
[13] G. Schindler, M. Brown, and R. Szeliski. City-scale location recognition. In CVPR, 2007.
[14] H. Strasdat, J. M. M. Montiel, and A. Davison. Scale drift-aware large scale monocular slam. In Proceedings of Robotics: Science and Systems, June 2010.
[15] C. Strecha, T. Pylvanainen, and P. Fua. Dynamic and scalable large scale image reconstruction. In CVPR, 2010.
[16] M. Tamaazousti, V. Gay-Bellile, S. Naudet-Collette, S. Bourgeois, and M. Dhome. Nonlinear refinement of structure from motion reconstruction by taking advantage of a partial knowledge of the environment. In CVPR, 2011.
[17] A. Zamir and M. Shah. Accurate image localization based on google maps street view. In ECCV, 2010.
[18] W. Zhang and J. Kosecka. Image based localization in urban environments. In 3DPVT, 2006.
Figure 8. Augmented reality application. (a) Image extracted from the "Versailles" sequence. (b) The same image with augmented reality information.
for both sequences, while the one constrained only to the building facades fails at the second turn in both cases.
5.2. Augmented reality: Application to navigation aid
The 3D reconstructions obtained show that introducing both the constraints provided by the DEM and those provided by the buildings model greatly improves the accuracy of the localization. This result allowed us to integrate our algorithm into an augmented reality application, specifically a navigation aid. As shown in Fig. 8, a tube indicating the trajectory to follow is inserted into the image. In the same way, safety information (e.g. road signs, pedestrian crossings) that requires a high localization accuracy is added successfully just before each crossroad. Despite the jitter effect that remains to be corrected, we observe that the accuracy reached is sufficient to ensure safe and effective guidance: while GPS may have difficulty determining which road lies ahead, the proposed solution designates roads without any ambiguity. Moreover, since no ground truth is available, this application offers a way to assess the quality of the localization visually.
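As an illustration only, the overlay of a guidance path can be sketched as a pinhole projection of geo-referenced 3D waypoints using the camera pose estimated by the localization. This is a minimal sketch under stated assumptions, not the rendering pipeline used in the paper: the function names, the intrinsic parameters (fx, fy, cx, cy), the waypoint coordinates, and the camera convention (x right, y down, z forward) are all hypothetical.

```python
# Hypothetical sketch: project geo-referenced 3D waypoints into the image
# with a pinhole camera model. All names and numeric values are illustrative
# assumptions, not taken from the paper.

def world_to_camera(point, rotation, translation):
    """Apply X_cam = R * X_world + t, with R given as three rows."""
    return tuple(
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def overlay_path(waypoints, rotation, translation, fx, fy, cx, cy):
    """Return the pixel coordinates of the visible waypoints, in order."""
    pixels = []
    for wp in waypoints:
        uv = project(world_to_camera(wp, rotation, translation), fx, fy, cx, cy)
        if uv is not None:
            pixels.append(uv)
    return pixels

# Assumed pose: camera at the world origin with no rotation.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
# Assumed waypoints on the road, 1.5 m below the camera; the last one lies behind it.
path = [(0.0, 1.5, 5.0), (0.0, 1.5, 10.0), (1.0, 1.5, 20.0), (0.0, 1.5, -2.0)]
pixels = overlay_path(path, IDENTITY, (0.0, 0.0, 0.0), 800.0, 800.0, 320.0, 240.0)
print(pixels)
```

In such a scheme, any error or frame-to-frame noise in the estimated pose moves the projected pixels directly, which is consistent with the jitter being visible on the inserted elements.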
6. Conclusion
In this paper, we have presented an original real-time and mainly on-line framework based on monocular SLAM to ensure accurate localization in dense urban areas. In addition to multi-view constraints, we have proposed a bundle adjustment that exploits the constraints provided by both a coarse 3D buildings model and a digital elevation model. While most existing solutions need off-line steps to ensure accurate localization, the proposed approach reaches a high precision without any off-line step. An augmented reality application is presented, underlining the accuracy and the robustness of the localization. To improve the rendering and bring more stability to the elements inserted in augmented reality, it is necessary to handle the jitter effect. Future work will also deal with the challenging problem of initialization. This step can be performed successfully after the first turn, where the available geometric constraints are sufficient to locate the camera accurately, as demonstrated in [8]. Fusion with other sensors (e.g. odometer, GPS) is also conceivable to obtain a com-