Automatic Alignment of 3D Reconstructions using a Digital Surface Model

Andreas Wendel, Arnold Irschara, and Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology, Austria
{wendel, irschara, bischof}@icg.tugraz.at
Abstract

We present a novel technique for the automatic alignment of Structure from Motion (SfM) models, acquired at ground level or by micro aerial vehicles, to an overhead Digital Surface Model (DSM) using GPS information. An additional refinement step, based on the correlation of the DSM height map with the model height map, corrects for GPS localization uncertainties and results in precisely aligned models. Our approach successfully handles cases where previous methods had problems, including objects on the ground, unoccupied space, and models covering a small area. We conclude our work by presenting several applications of our approach, namely the fusion of detailed SfM model information into the original DSM, season-invariant matching using aligned models, and alignment for providing context in visualization.
1. Introduction

In recent years, the importance of image-based 3D reconstruction systems has grown considerably for several reasons. First, the availability of more powerful hardware for high-resolution acquisition and computation has enabled novel algorithms, resulting on the one hand in impressive, high-quality models [7], and on the other hand in very robust techniques which even allow reconstructing a scene from publicly available images [2]. While visualizations showing sparse 3D point clouds were often of interest only to insiders, recent dense reconstructions appeal to a much broader audience. Second, novel image acquisition tools such as micro aerial vehicles (MAVs) provide views of a scene which differ from what people are used to seeing. Finally, new applications for 3D models are emerging due to the two previous aspects. For instance, remote tourism is possible using a collection of automatically geo-referenced pictures [14]. In this context we see the need for an accurate alignment of small Structure-from-Motion (SfM) models in a world coordinate system.
Figure 1. Automatic fusion of two semi-dense 3D point clouds with an overhead, textured Digital Surface Model (DSM). The atrium model (red) has been reconstructed from ground-level images, whereas the images for the campus model (blue) originate from a micro aerial vehicle. Best viewed in color.
It would often be beneficial to generate a large 3D model of a scene, but the creation fails because feature points connecting different parts of the model are missing. This can happen due to vegetation, a building corner, or simply a change of illumination during image acquisition. In this paper, we introduce a method to fuse such partial reconstructions by projecting them into an overhead Digital Surface Model (DSM) using GPS information. An additional refinement step, based on the correlation of the DSM height map with a height map generated from the model, corrects for GPS localization uncertainties and results in precisely aligned models. In comparison to previous methods for aligning 3D point clouds to overhead images [10, 6], our approach is less prone to errors caused by objects on the ground, copes with unoccupied space without needing explicit free-space constraints, works with detailed models covering a small area, and is faster. We evaluate our approach using five outdoor SfM models with more than one million points each. Figure 1 shows a typical alignment of two models in a DSM. The model marked in red has been reconstructed from images taken at ground level, whereas the model marked in blue originates from aerial images acquired using an octo-rotor MAV.
We conclude our work by presenting several applications of our approach, including the integration of the SfM model information into the original DSM, the alignment of two models acquired in summer and winter for season-invariant matching, and the alignment of a construction site to a construction plan for providing context in visualization.
2. Related Work

The problem of aligning 2D images or 3D models to a 3D structure is well studied, especially in the context of large-scale city modeling. Frueh and Zakhor [6] present an algorithm to fuse close-range facade models acquired at ground level with a far-range DSM recorded by a plane. The models are created using both ground-based and airborne laser scanners, as well as digital cameras for texturing. Their approach is based on registering the edges of the DSM image to the horizontal scans of a ground model using Monte Carlo localization. Similarly, Strecha et al. [16] register facades segmented from a 3D point cloud to building footprints. Their approach combines various visual and geographical cues in a generative model, which allows robust treatment of outliers. However, both approaches are focused on large-scale city models with flat facades on both sides, resulting in fairly clean edges both on the ground and in the DSM. Our approach is designed to cope with all kinds of outdoor scenes (including, for instance, a park with a single landmark) and features an implicit visibility constraint which improves registration.

Wang and You [21, 22] tackle the problem of registering images of 2D optical sensors and 3D range sensors, without any assumption about initial alignment. Their approach is based on region matching between optical images and depth images using Shape Context [4]. They extract regions from an ortho projection of the scene using an adjusted segmentation step, and connected regions of the same height from a DSM. Again, this works well for large-scale city models, but would not work for partial SfM models. Additionally, 3D models created from ground level hardly show large regions which could be matched to a nadir view.

Another well-known method of aligning two point clouds is the Iterative Closest Point (ICP) algorithm [24]. ICP estimates a transform that minimizes the overall distance between points by iteratively assigning closest points as correspondences and solving for the best rigid transform. While ICP is mainly used for registering 3D laser scans, Zhao et al. [25] use it to align dense motion stereo from videos to laser scan data. However, 3D ICP can be very slow, and the dense representation is not suitable for our applications, such as interest point fusion (see Section 6).

Our work is most closely related to the approach of Kaminsky et al. [10]. They use 2D ICP to compute the optimal alignment of a sparse SfM point cloud to an overhead image using an objective function that matches 3D points
to image edges. Additionally, the objective function contains free-space constraints which prevent alignment to extraneous edges in the overhead image. While their approach is suitable for aligning many 3D models obtained from ground level, it would fail on the models acquired using our micro aerial vehicle, which include many points on the ground. Typical models presented in their paper show vertical walls which can easily be projected to the ground and fitted to the respective edges; however, when many extraneous edges are visible in the overhead image their method fails. Provided that a digital surface model and enough data to densify the sparse reconstruction are available, our approach is less prone to errors caused by objects on the ground, implicitly follows the free-space constraint, works with models covering a smaller area, and is faster.
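To make the mechanics of ICP concrete, the following is a minimal 2D sketch in Python/NumPy. It is a didactic illustration of the general algorithm, not the implementation used in [10] or [25]; the function name, the iteration count, and the use of a KD-tree for correspondence search are our own choices.

import numpy as np
from scipy.spatial import cKDTree

def icp_2d(src, dst, iters=50):
    """Minimal 2D ICP: align point set src (Nx2) to dst (Mx2).

    Iteratively pairs every source point with its nearest target point
    and solves for the best rigid transform in closed form (Kabsch).
    Returns the accumulated rotation R (2x2) and translation t (2,).
    """
    tree = cKDTree(dst)
    R, t = np.eye(2), np.zeros(2)
    cur = src.copy()
    for _ in range(iters):
        # Correspondence step: nearest neighbor in the target cloud.
        _, idx = tree.query(cur)
        matched = dst[idx]
        # Transform step: closed-form rigid fit via SVD.
        mu_s, mu_d = cur.mean(axis=0), matched.mean(axis=0)
        U, _, Vt = np.linalg.svd((cur - mu_s).T @ (matched - mu_d))
        if np.linalg.det(Vt.T @ U.T) < 0:  # guard against reflections
            Vt[-1] *= -1
        R_step = Vt.T @ U.T
        t_step = mu_d - R_step @ mu_s
        cur = cur @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
    return R, t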
3. Obtaining the 3D Models

Our approach requires not only accurate semi-dense 3D reconstructions of the environment, but also a geo-referenced digital surface model for alignment. In the following paragraphs, we first describe our image-based reconstruction pipeline, based on the approach introduced in [9, 23]. We then show how a DSM can be estimated from publicly available data if it is not available for a specific location.
3.1. Structure from Motion Reconstruction

For terrestrial model reconstruction we rely on a Structure from Motion (SfM) approach that is able to reconstruct a scene from unorganized image sets. This approach is widely applicable since no prior knowledge about the scene, i.e. no sequential ordering of the input images, has to be provided. In particular, our framework consists of three processing steps, namely feature extraction, matching, and geometry estimation. First, we extract SIFT features [12] from each frame. We then match the keypoint descriptors between each pair of images and perform geometric verification based on the five-point algorithm [13]. Since matches that arise from descriptor comparisons are often highly contaminated by outliers, we employ RANSAC [5] for robust estimation. The matching output is a graph structure denoted as the epipolar graph EG, consisting of a set of vertices V = {I_1, ..., I_N} corresponding to the images and a set of edges E = {e_ij | i, j ∈ V} that are pairwise reconstructions. Our SfM method follows an incremental approach [14] based on the epipolar graph EG. We initialize the geometry as proposed in [11]. Next, for every image I that is not yet reconstructed and has a potential overlap with the current 3D scene (estimated from the EG graph), 2D-to-3D correspondences are established. A three-point pose algorithm [8] inside a RANSAC loop is used to estimate the position of a new image. When a pose can be determined
Figure 2. Atrium scene reconstruction from 157 views. (a) Original view of the scene. (b) Sparse model and source cameras obtained by structure from motion reconstruction. (c) Semi-dense point model obtained by refining the sparse model with PMVS [7].
(i.e. a sufficient inlier confidence is achieved), the structure is updated with the new camera and all measurements visible therein. A subsequent procedure expands the current 3D structure by triangulating new correspondences. Bundle adjustment [19] is used to globally minimize the reprojection error over all measurements. The triangulated points are then back-projected and searched for in every image; to this end we utilize a 2D kd-tree for efficient correspondence search in a local neighborhood of the projections. This method ensures strong connections within the current reconstruction. Whenever N new images have been added (we use N = 10), bundle adjustment is used to simultaneously optimize structure and camera poses. The sparse reconstruction result can be seen in Figure 2(b). We rely on the publicly available patch-based multi-view stereo (PMVS) [7] reconstruction software to densify terrestrial reconstructions. The semi-dense PMVS reconstruction of the captured atrium scene is depicted in Figure 2(c).
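As an illustration of the matching and geometric verification step described above, the following Python/OpenCV sketch scores a single epipolar graph edge. The intrinsics K, the ratio-test factor, and the RANSAC threshold are assumptions for the example; the full pipeline of [9, 23] adds incremental pose estimation and bundle adjustment on top.

import numpy as np
import cv2

def verify_pair(img1, img2, K):
    """Geometric verification for one epipolar graph edge.

    Extracts SIFT features, matches descriptors with a ratio test, and
    verifies the matches with the five-point algorithm inside RANSAC.
    Returns the number of inliers, i.e. the support of the edge in EG.
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return 0

    # Lowe's ratio test discards ambiguous descriptor matches.
    raw = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [p[0] for p in raw
            if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
    if len(good) < 5:
        return 0

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Five-point essential matrix estimation inside RANSAC.
    E, mask = cv2.findEssentialMat(pts1, pts2, K,
                                   method=cv2.RANSAC, threshold=1.0)
    return 0 if mask is None else int(mask.sum())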
3.2. Surface Model Estimation from a Map

High-quality digital surface models may not be accessible to everyone, and they may not be available for some locations on earth. However, web-based map providers such as OpenStreetMap (http://www.openstreetmap.org) provide fairly accurate information not only regarding roads and directions, but also concerning buildings and vegetation. Figure 3(b) depicts a typical map. Semantic categories in the map are encoded by color, so it is straightforward to estimate a rough DSM by assigning a single, estimated height to all buildings. While the estimated and true building heights will obviously differ, the approximation is good enough for the error minimization to find the correct displacement when fitting accurate local models to this DSM. To automate this approach, we compute an area of interest as proposed by [10]. Given the relevant GPS coordinates, the OpenStreetMap image can then be downloaded using the ModestMaps API [1]. An estimated surface model can be seen in Figure 3(c).
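A minimal sketch of the height assignment in Python/NumPy follows; the building color and the single building height below are hypothetical placeholder values, not constants from our system.

import numpy as np

# Placeholder values: a typical OpenStreetMap building shade (RGB)
# and an assumed uniform building height in meters.
BUILDING_RGB = np.array([217, 208, 201])
BUILDING_HEIGHT_M = 10.0

def estimate_dsm(map_rgb, tol=20):
    """Estimate a rough DSM from a rendered map image (HxWx3, uint8).

    Pixels whose color is close to the building shade receive the
    estimated building height; everything else is treated as ground.
    The absolute heights are wrong, but the resulting height
    discontinuities suffice for the later correlation step.
    """
    diff = np.abs(map_rgb.astype(np.int16) - BUILDING_RGB)
    is_building = np.all(diff < tol, axis=-1)
    return np.where(is_building, BUILDING_HEIGHT_M, 0.0)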
4. Alignment Pipeline

Our alignment pipeline follows a three-step approach: First, we geo-reference the DSM image by estimating a transform between GPS and image coordinates. Second, we roughly align the acquired 3D point cloud to the DSM by exploiting the available GPS information. Finally, we project the dense 3D point cloud into DSM image coordinates and search for the best height map alignment using normalized cross-correlation.
4.1. Geo-referencing of the DSM

We start by geo-referencing the DSM, estimating a transform between GPS and image coordinates. While for the estimated DSM the GPS coordinates of the corner points are known, we currently need to manually select point correspondences for the high-quality DSM. Note that this manual intervention can easily be avoided by obtaining an ortho photo from ModestMaps [1]: its corner GPS points are again known, and it can be matched to the ortho photo that comes with the high-quality DSM using interest point descriptors [12]. The GPS coordinates, namely latitude κ and longitude λ, are transformed to a square coordinate system (x_gps, y_gps) using a Mercator projection [15],

    x_gps = λ,                            (1)
    y_gps = ln(tan(π/4 + κ/2)).           (2)

We then solve for a robust homography between (x_gps, y_gps) and the known points in the DSM image (x_dsm, y_dsm) using RANSAC [5]. This process has to be done only once for every DSM that is to be used. The estimated homography H_dsm is stored for transforming GPS coordinates of 3D models into the DSM.
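The following Python/OpenCV sketch illustrates Eqs. (1) and (2) together with the robust homography fit; converting degrees to radians before the projection and the RANSAC reprojection threshold of 3 pixels are our assumptions.

import numpy as np
import cv2

def mercator(lat_deg, lon_deg):
    """Mercator projection of GPS coordinates, following Eqs. (1)-(2).

    Latitude kappa and longitude lambda are assumed to be given in
    degrees and are converted to radians before projection.
    """
    kappa, lam = np.radians(lat_deg), np.radians(lon_deg)
    return lam, np.log(np.tan(np.pi / 4 + kappa / 2))

def georeference_dsm(gps_points, dsm_points):
    """Estimate the homography H_dsm mapping Mercator-projected GPS
    coordinates to DSM image coordinates, robust to outliers.

    gps_points: Nx2 array of (latitude, longitude) in degrees.
    dsm_points: Nx2 array of corresponding (x, y) DSM pixels.
    Requires at least four correspondences.
    """
    x, y = mercator(gps_points[:, 0], gps_points[:, 1])
    src = np.float32(np.column_stack([x, y]))
    dst = np.float32(dsm_points)
    H_dsm, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    return H_dsm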
4.2. Rough Alignment using GPS Information

We exploit the GPS information available for every camera in our 3D models for a rough alignment to the DSM. If the coordinates measured by the GPS system were correct, it
Figure 3. Surface Model Estimation from a Map. If a high-quality DSM (a) is not available, semantic information from OpenStreetMap (b) can be used to estimate a rough surface model (c).
Figure 4. Rotation for ground plane alignment based on the method of [17]. (a) SfM model in the coordinate system of the first model camera. (b) SfM model after performing the estimated rotation.
Figure 5. Similarity transform between camera centers (gray) and GPS coordinate centers (red). (a) Sequence taken at ground level. (b) Sequence taken by a micro aerial vehicle.
would be straightforward to estimate a 3D similarity transform. However, longitude and latitude are noisy, and especially the altitude estimate can be very inaccurate. Therefore, we solve for a 2D similarity transform between the camera positions (x_cam,i, y_cam,i) in the SfM model and the Mercator-transformed GPS coordinates (κ_cam,i, λ_cam,i). Neglecting the altitude is only possible once the ground planes are aligned.
We use the approach of Szeliski [17] to compute the ground plane normal and the corresponding rotation. The method assumes that the horizontal axis of every image coordinate system is approximately parallel to the ground plane, which is the case when taking upright photographs from the ground, but also when taking nadir pictures with a micro aerial vehicle. The ground plane rotation is visualized in Figure 4. A robust 2D similarity transform H_model can then be found using RANSAC [5] to cope with longitude/latitude outliers. The result is shown in Figure 5 for two scenes, the first acquired at ground level and the second acquired by a micro aerial vehicle.
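A sketch of the robust 2D similarity estimation between camera centers and Mercator-projected GPS positions follows; the closed-form fit uses Umeyama's method, and the RANSAC loop, thresholds, and function names are our own illustration of this step rather than our exact implementation.

import numpy as np

def fit_similarity_2d(src, dst):
    """Closed-form 2D similarity (scale s, rotation R, translation t)
    mapping src to dst (both Nx2), following Umeyama's method."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    s_c, d_c = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(s_c.T @ d_c / len(src))
    D = np.eye(2)
    if np.linalg.det(Vt.T @ U.T) < 0:  # guard against reflections
        D[1, 1] = -1
    R = Vt.T @ D @ U.T
    s = np.trace(np.diag(S) @ D) / s_c.var(axis=0).sum()
    return s, R, mu_d - s * R @ mu_s

def ransac_similarity_2d(src, dst, thresh=2.0, iters=500, seed=0):
    """Sample minimal two-point sets and keep the similarity transform
    with the largest inlier support, to cope with GPS outliers."""
    rng = np.random.default_rng(seed)
    best, best_inl = None, 0
    for _ in range(iters):
        idx = rng.choice(len(src), 2, replace=False)
        s, R, t = fit_similarity_2d(src[idx], dst[idx])
        err = np.linalg.norm(src @ (s * R).T + t - dst, axis=1)
        inl = int((err < thresh).sum())
        if inl > best_inl:
            best_inl, best = inl, (s, R, t)
    return best

Note that a two-point minimal sample fixes all four degrees of freedom of a 2D similarity; in practice, degenerate draws (nearly coincident cameras) would warrant an additional guard.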
4.3. Precise Alignment using Correlation

Given the rough alignment, we use correlation for precise alignment. We project the semi-dense 3D point cloud into the pixel grid of the DSM image, storing only the maximum height value per pixel. Pixel clusters with a radius r