Image-Based Mobile Mapping for 3D Urban Data Capture

Stefan Cavegn and Norbert Haala

Stefan Cavegn is with the Institute of Geomatics Engineering, FHNW University of Applied Sciences and Arts Northwestern Switzerland, Gruendenstrasse 40, 4132 Muttenz, Switzerland; and the Institute for Photogrammetry, University of Stuttgart, Geschwister-Scholl-Strasse 24, 70174 Stuttgart, Germany ([email protected]). Norbert Haala is with the Institute for Photogrammetry, University of Stuttgart, Geschwister-Scholl-Strasse 24, 70174 Stuttgart, Germany.

Photogrammetric Engineering & Remote Sensing, Vol. 82, No. 12, December 2016, pp. 925-933. doi: 10.14358/PERS.82.12.925

Abstract

Ongoing innovations in dense multi-view stereo image matching meanwhile allow for 3D data collection using image sequences captured from mobile mapping platforms, even in complex and densely built-up areas. However, the extraction of dense and precise 3D point clouds from such street-level imagery presumes high-quality georeferencing as a first processing step. While standard direct georeferencing solves this task in open areas, poor GNSS coverage in densely built-up areas and urban canyons frequently prevents sufficient accuracy and reliability. Thus, we use bundle block adjustment, which additionally integrates tie and control point information, for precise georeferencing of our multi-camera mobile mapping system. Subsequently, this allows the adaptation of a state-of-the-art dense image matching pipeline to provide a suitable 3D representation of the captured urban structures. In addition to the presentation of the different processing steps, this paper also provides an evaluation of the achieved image-based 3D capture in a dense urban environment.

Introduction

For a considerable period, 3D data capture from mobile mapping systems was completely (Puente et al., 2013) or primarily (Kersten et al., 2009) based on lidar sensors. Meanwhile, advances in Visual Odometry (Scaramuzza and Fraundorfer, 2011; Fraundorfer and Scaramuzza, 2012), Simultaneous Localization and Mapping (Cadena et al., 2016), Structure-from-Motion (Schönberger and Frahm, 2016), and Dense Image Matching (Remondino et al., 2014) alternatively enable the use of camera-based systems for highly efficient and accurate 3D mapping even in complex urban environments (Pollefeys et al., 2008; Gallup, 2011). Within this paper, we present the adaptation of the stereo matching software SURE (Rothermel et al., 2012) for the evaluation of image sequences collected by a multi-camera mobile mapping system. The use of multiple cameras potentially provides large redundancy during photogrammetric processing, which, for example, proved to be very beneficial for tasks like the extraction of Digital Surface Models (DSM) from airborne imagery (Haala, 2014). However, dense matching of street-level imagery as captured from mobile mapping systems is frequently more difficult than for airborne data collection, which can usually be handled by state-of-the-art software tools. One reason is the large variation in object distances from terrestrial viewpoints, which results in great variations of image scale that have to be considered during the different matching steps. Problems also arise from multiple occlusions due to the complex 3D geometry,
especially in dense urban areas. In contrast to the rather simple 2.5D processing during DSM generation from airborne views, this geometry also requires steps like filtering and data fusion to be implemented in true 3D space. Besides, more complex data structures like (meshed) 3D point clouds have to be generated to encode the complex geometry of such built-up environments; these are useful for subsequent tasks like the automatic reconstruction of building façades (Tutzauer and Haala, 2015).

The mobile mapping system used for our investigations alternatively aims at the collection of so-called geospatial 3D image spaces (Nebiker et al., 2015). These are georeferenced RGB-D images to be used for tasks like 3D monoplotting, where a user can accurately measure 3D coordinates at features of interest simply by clicking on a location within the 2D imagery. In order to link each pixel to a corresponding 3D coordinate, this of course presumes both image georeferencing and matching at the pixel level. Our mobile mapping system, which is described in more detail in the next section, features a ground sampling distance (GSD) of 1 cm at a distance of 23 m from the system. This reduces further to 2 to 6 mm for a more typical measurement range of 4 to 14 m.

Direct georeferencing as usually applied by mobile mapping systems allows for an image registration at the centimeter level in open areas which provide good GNSS conditions. As an example, in such an environment Burkhard et al. (2012) obtained absolute 3D point measurement accuracies of 4 to 5 cm on average for their stereovision mobile mapping system. However, our applications aim at image-based mobile mapping for 3D urban data capture. Therefore, our test scenarios presented in the following section mainly include densely built-up urban areas, where multipath effects and signal shading by trees and buildings aggravate this process. Thus, as discussed in the Image Orientation by Image-Based Georeferencing section, image-based georeferencing is required, which improves the results from direct georeferencing by a supplementary bundle block adjustment using additional tie and control point observations. These results then allow for a high-quality alignment of the respective image sequences as a prerequisite for the multi-view stereo matching presented in the Dense Multi-View Stereo Matching section. As demonstrated by our investigations, the accuracy, reliability, and completeness of products like 3D point clouds benefit considerably from the redundancy available during image-based mobile mapping for urban data capture.
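The quoted GSD figures follow from the simple pinhole relation GSD = pixel size × distance / focal length. A minimal sketch, using the forward camera parameters given in the next section (9 µm pixel size, 21 mm focal length), reproduces them:

```python
# Ground sampling distance (GSD) of a pinhole camera: a minimal sketch.
# Camera parameters are those of the forward stereo cameras described
# in the Mobile Mapping System section; results are approximations.

def gsd_m(pixel_size_m: float, focal_length_m: float, distance_m: float) -> float:
    """Pixel footprint in object space at the given distance."""
    return pixel_size_m * distance_m / focal_length_m

PIXEL_SIZE = 9e-6      # 9 um
FOCAL_LENGTH = 21e-3   # 21 mm

for d in (4.0, 14.0, 23.0):
    print(f"distance {d:5.1f} m -> GSD {gsd_m(PIXEL_SIZE, FOCAL_LENGTH, d) * 1000:.1f} mm")
# distance   4.0 m -> GSD 1.7 mm
# distance  14.0 m -> GSD 6.0 mm
# distance  23.0 m -> GSD 9.9 mm   (~1 cm, as stated above)
```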

Mobile Mapping Platform and Test Scenario

All data used for the investigations presented in this paper was captured by the multi-sensor stereovision mobile mapping system of the Institute of Geomatics Engineering (IVGI), University of Applied Sciences and Arts Northwestern Switzerland
(FHNW), which is presented in the first part of this section. For our evaluations of georeferencing and dense image matching quality, two mobile mapping campaigns incorporating different sensor constellations were carried out. These campaigns are introduced in the second part of this section.

Figure 1. (a) IVGI mobile mapping system with sensor configuration, (b) sensor configuration for the campaign in July 2014, and (c) sensor configuration for the campaign in August 2015.

Mobile Mapping System

As visible in Figure 1, the multi-sensor system features several industrial stereo cameras with CCD sensors as well as a GNSS/IMU positioning system. All sensors are mounted on a rigid platform to ensure stability of the relative orientations and offsets between them. This is not only a prerequisite for highly accurate mapping but also crucial in terms of efficiency, since the same calibration parameters can be used for multiple sessions. All sensors are synchronized by hardware trigger signals from a custom-built trigger box. As depicted in Figure 1b, there are stereo cameras pointing back-right and stereo cameras looking left. These stereo systems incorporate HD cameras with a resolution of 1,920 × 1,080 pixels, a pixel size of 7.4 µm, a focal length of 8 mm, and a field of view of 83° × 53°. While the back-right stereovision system with a base of 779 mm maps the closer sidewalk area and lower façade parts, the opposite sidewalk area is covered by the left stereovision system featuring a larger base of 949 mm.
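The quoted fields of view follow directly from the sensor geometry via FOV = 2·atan(sensor extent / 2f). A short sketch verifying both camera types from the parameters given in this section:

```python
# Field of view from sensor geometry: FOV = 2 * atan(n * pixel_size / (2 * f)).
import math

def fov_deg(n_pixels: int, pixel_size_m: float, focal_length_m: float) -> float:
    """Angular field of view along one sensor axis, in degrees."""
    return math.degrees(2 * math.atan(n_pixels * pixel_size_m / (2 * focal_length_m)))

# HD cameras: 1920 x 1080 px, 7.4 um pixels, 8 mm focal length
print(fov_deg(1920, 7.4e-6, 8e-3), fov_deg(1080, 7.4e-6, 8e-3))   # ~83.2 x 53.1 deg
# Forward 11 MP cameras (next subsection): 4008 x 2672 px, 9 um, 21 mm
print(fov_deg(4008, 9e-6, 21e-3), fov_deg(2672, 9e-6, 21e-3))     # ~81.3 x 59.6 deg
```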

Figure 2. Base map of the study area with overlaid projection centers of selected stereo image sequences, 3D reference points, terrestrial laser scanning (TLS) stations and point cloud patches (Source: Geodaten Kanton Basel-Stadt).


However, the georeferencing investigations presented in this paper relate to the main stereo system facing forward, which is depicted for two different configurations in the right part of Figure 1. This system consists of two 11 Megapixel (MP) cameras with a calibrated stereo base of 905 mm. The stereo cameras have a resolution of 4,008 × 2,672 pixels at a pixel size of 9 µm, a focal length of 21 mm, and resulting fields of view of 81° in the horizontal and 60° in the vertical direction. While Figure 1b shows the camera configuration for the first campaign in July 2014, a third HD camera was set up in the middle of this stereo configuration for the campaign in August 2015 (Figure 1c). In order to enable direct georeferencing of the imagery acquired at typically 5 fps, a NovAtel SPAN inertial navigation system is used. It consists of a tactical grade inertial measurement unit featuring fiber-optic gyros of the type UIMU-LCI and an L1/L2 GNSS kinematic antenna. In case of good GNSS coverage, these sensors provide accuracies of 10 mm horizontally and 15 mm vertically during post-processing (NovAtel, 2016). The accuracies of the attitude angles roll and pitch are specified as 0.005°, and of the heading as 0.008°. A GNSS outage of 60 seconds lowers the horizontal accuracy to 110 mm and the vertical accuracy to 30 mm.
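For this forward configuration, the depth precision to be expected from stereo matching can be approximated with the standard error propagation formula σ_Z ≈ Z²·σ_d / (B·f). This is a textbook estimate under idealized assumptions, not a figure from the paper:

```python
# Textbook stereo depth-precision estimate (not from the paper):
# sigma_Z ~= Z**2 * sigma_d / (B * f_px), with disparity precision sigma_d in px.

FOCAL_LENGTH_PX = 21e-3 / 9e-6   # ~2333 px for the forward cameras
BASE_M = 0.905                   # calibrated stereo base

def depth_precision_m(z_m: float, sigma_d_px: float = 0.5) -> float:
    """1-sigma depth uncertainty at distance z_m, assuming 0.5 px matching."""
    return z_m**2 * sigma_d_px / (BASE_M * FOCAL_LENGTH_PX)

for z in (5.0, 10.0, 20.0):
    print(f"Z = {z:4.1f} m -> sigma_Z ~ {depth_precision_m(z) * 100:.1f} cm")
# Z =  5.0 m -> sigma_Z ~ 0.6 cm
# Z = 10.0 m -> sigma_Z ~ 2.4 cm
# Z = 20.0 m -> sigma_Z ~ 9.5 cm
```

The quadratic growth with distance is one reason why the distance-dependent accuracy behavior reported later (e.g., for the left stereovision system) is to be expected.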

Test Scenario

The study area depicted in Figure 2 is located at a very busy junction of five roads in the city center of Basel, Switzerland. It includes three tramway stops, causing many overhead wires, as well as large and rather tall commercial properties, which create a very challenging environment for GNSS positioning. Besides, there are several moving objects such as pedestrians, cars, and tramways in the investigated region, which pose additional challenges for data processing. Three street sections of this test site featuring sidewalks were mapped three times, once in July 2014 and twice during one day in August 2015, i.e., with a time difference of 13 months (Table 1). In all nine cases, data acquisition was performed shortly before noon during good weather conditions. For our georeferencing investigations, we used 85 to 191 stereo image pairs from the forward facing stereovision system over sequence lengths between 108 m and 217 m. An along-track distance of 1 m between successive image exposures was targeted, but larger distances occurred at velocities higher than 18 km/h, since the maximum frame rate was 5 fps. Whereas the campaign in July 2014 was part of a complete survey of the city-state of Basel, the campaign in August 2015 was specifically performed for the investigations at our study area (Figure 3). In order to capture optimal trajectories, we acquired kinematic data according to best practice as specified by the manufacturer. First, a static initialization of approximately three minutes in an open sky area was carried out, followed by leveling until approaching the test site. After the first mapping of the test site, an additional loop was driven so that data could again be acquired in the study area.
Table 1. Characteristics of the Nine Selected Stereo Image Sequences x.y (where x corresponds to the street sections 1 to 3 shown in Figure 2, and y corresponds to the campaign; 0 = 24./27.7.2014, 1 = 20.8.2015 10:30-10:37, 2 = 20.8.2015 10:47-10:53)

Sequence  Date and Time   Image count  Length [m]  Along-track spacing [m]
                                                   Mean    Max.
1.0       24.7.14 10:20   246          164         1.34    1.97
1.1       20.8.15 10:30   322          173         1.08    1.25
1.2       20.8.15 10:47   312          175         1.13    1.48
2.0       27.7.14 11:53   314          173         1.11    1.60
2.1       20.8.15 10:34   342          212         1.25    2.06
2.2       20.8.15 10:50   382          217         1.14    1.93
3.0       27.7.14 11:57   170          108         1.29    1.49
3.1       20.8.15 10:37   232          141         1.23    1.73
3.2       20.8.15 10:53   192          146         1.54    2.37
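The mean spacing values in Table 1 are consistent with the image counts and sequence lengths. A small consistency check, assuming the image count lists single images (two per stereo pair) and mean spacing = length / (number of exposures - 1):

```python
# Consistency check for Table 1: mean along-track spacing derived from
# image count and sequence length. Assumes "image count" refers to single
# images, i.e., two per stereo pair (an interpretation, not stated explicitly).

sequences = {  # sequence: (image count, length in m)
    "1.0": (246, 164), "1.1": (322, 173), "1.2": (312, 175),
    "2.0": (314, 173), "2.1": (342, 212), "2.2": (382, 217),
    "3.0": (170, 108), "3.1": (232, 141), "3.2": (192, 146),
}

for seq, (images, length) in sequences.items():
    exposures = images // 2                      # stereo pairs = exposure events
    spacing = length / (exposures - 1)           # intervals between exposures
    print(f"{seq}: mean spacing ~ {spacing:.2f} m")
# e.g., 1.0: ~1.34 m and 3.2: ~1.54 m, matching the table
```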

Returning to the start area, imagery was captured on our outdoor calibration field for the purpose of boresight alignment (Burkhard et al., 2012). A further loop served for leveling, and a static observation of around four minutes was performed at the end nearby the FHNW building. The GNSS station on its roof, which is part of the Automated GNSS Network for Switzerland (AGNES), was defined as base station. The complete campaign resulted in a total trajectory length of 22.756 km and 12,220 stereo image pairs acquired on 20 August 2015 from 10:17:53 until 11:19:29 local time.

We captured reference data in March 2015, eight months after the first mobile mapping survey. Nonetheless, there were no significant changes for permanent objects such as buildings and roads in the study area; changes only occurred due to moving objects. We performed four terrestrial laser scans (TLS) with a Leica ScanStation P20 and determined 3D coordinates of 51 points, mainly on corners of road markings, with a Leica Nova MS50 total station. While the 3D accuracy of the TLS points ranges from 1 to 2 cm, the tachymetry points, which served either as ground control points (GCP) or checkpoints (CP) (Figure 2 and Figure 6), have an absolute 3D accuracy of better than 1 cm.

Figure 3. Trajectory of the campaign performed on 20 August 2015: high quality and low quality sections; study area: medium to low quality. The trajectory extent in east-west direction is approximately 4.750 km.

Image Orientation by Image-Based Georeferencing

Since one of the main features of our mobile mapping system is the application of multiple cameras for dense 3D data capture by multi-view stereo matching, both high-quality
calibration of the complete configuration and precise georeferencing are essential for the further geometric evaluation of the captured image sequences. While system calibration is accurate to the sub-pixel level, the accuracy of direct georeferencing is limited due to poor GNSS conditions for our applications in dense urban environments. However, the quality of these two initial steps, which are briefly discussed in the System Calibration and Direct Georeferencing section, can be improved considerably by image-based georeferencing, presented in the Image-Based Georeferencing section, as demonstrated by our accuracy evaluations in the Accuracy Investigations section.

System Calibration and Direct Georeferencing

Precise calibration is fundamental, since errors and offsets will be transferred in full to the following steps. Therefore, we calibrated all sensors mounted on the rigid frame of the mobile mapping system in an extensive and rigorous process. First, we determined interior as well as relative orientation parameters between all cameras by constrained bundle adjustment, exploiting imagery taken on different indoor calibration fields for the two campaigns. While the indoor calibration field for the campaign in July 2014 features a uniform 3D point distribution, the indoor calibration field for the campaign in August 2015 does not have any 3D points close to the ground. However, in both cases, many 3D points are signalized with coded targets. Second, we computed the lever arm and misalignment with respect to the left camera of the forward looking stereo system using imagery captured on our outdoor calibration field (Burkhard et al., 2012). We processed navigation data in tightly coupled mode using the GNSS and inertial post-processing software Inertial Explorer (version 8.60) from NovAtel. Furthermore, we performed processing in multi-pass directions and additionally smoothed the trajectories, which led to the trajectory quality depicted in Figure 3. By incorporating the previously computed boresight alignment as well as the relative orientation parameters of the forward facing stereovision system, we calculated directly georeferenced sensor orientations for all images captured by the left and right camera of the main stereovision system.
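Conceptually, direct georeferencing chains the navigation pose with the calibrated lever arm and boresight misalignment to obtain each camera pose. A minimal sketch of this chaining, assuming simple rotation-matrix conventions (illustrative only, not the authors' implementation):

```python
# Sketch: camera pose from a GNSS/IMU pose plus system calibration.
# R_body, t_body: body rotation and position from the navigation solution;
# lever_arm, R_boresight: calibration parameters (body frame, body-to-camera).
import numpy as np

def camera_pose(R_body: np.ndarray, t_body: np.ndarray,
                lever_arm: np.ndarray, R_boresight: np.ndarray):
    """Camera rotation and position in the mapping frame."""
    t_cam = t_body + R_body @ lever_arm   # lever arm rotated into mapping frame
    R_cam = R_body @ R_boresight          # chain boresight misalignment
    return R_cam, t_cam
```

The pose of the right camera then follows by chaining the calibrated relative orientation of the stereo rig in the same manner.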

Image-Based Georeferencing

We performed image-based georeferencing for each of the nine stereo image sequences using Agisoft PhotoScan (version 1.2.3), incorporating in the bundle adjustment the exterior orientation parameters from direct georeferencing, automatically determined image observations of tie points, and manually measured image observations of approximately 20 ground control points per sequence. Even though the input imagery was previously corrected for distortion and principal point offset based on the calibration parameters, significant radial distortion parameters were still estimated by bundle adjustment for the six sequences captured in August 2015 and were hence considered. The suboptimal point distribution in the calibration imagery could be a reason for these remaining distortion residuals of up to approximately 10 pixels in the image corners. Since we newly set the tie point precision to 0.3 pixel and the precision of image observations of ground control points to 0.5 pixel, the sequences acquired in July 2014 were also reprocessed, which led to slightly different results compared to Cavegn et al. (2015) and Nebiker et al. (2015). Overall RMSE values of 0.42 to 0.89 pixel were computed, 0.15 to 0.21 pixel for tie points and 0.81 to 1.08 pixel for ground control points (Table 2). The resulting mean reprojection error for GCP of approximately one pixel is plausible considering the samples depicted in Figure 4. Potential problems in 3D accuracy could be caused by the rather challenging identification and measurement of these natural ground control points, e.g., compared to signalized targets. Other issues are varying distances to the 3D points, mainly on corners of road markings, leading to different object resolutions, e.g., for white strips or crosswalks.
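In a least-squares bundle adjustment, such a priori precisions act as relative observation weights. A schematic sketch of the relation (illustrative only; PhotoScan's internal weighting scheme is not documented here):

```python
# Sketch: a priori observation precisions as least-squares weights,
# w = sigma_0^2 / sigma^2, plus the RMSE measure reported in Table 2.
import numpy as np

SIGMA_TIE = 0.3   # px, precision assigned to automatic tie point measurements
SIGMA_GCP = 0.5   # px, precision assigned to manual GCP image measurements

def weight(sigma_px: float, sigma_0: float = 1.0) -> float:
    return sigma_0**2 / sigma_px**2

print(weight(SIGMA_TIE), weight(SIGMA_GCP))   # ~11.1 vs 4.0

def rmse(residuals_px: np.ndarray) -> float:
    """Root mean square reprojection error, as reported in Table 2."""
    return float(np.sqrt(np.mean(residuals_px**2)))
```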


Table 2. Reprojection Errors and 3D Residuals of Ground Control Points (GCP) from Bundle Adjustment

Sequence  Overall RMSE [px]  Tie points RMSE [px]  GCP RMSE [px]  GCP 3D RMSE [mm]
1.0       0.50               0.15                  0.93           27
1.1       0.64               0.19                  0.86           17
1.2       0.62               0.20                  0.91           17
2.0       0.42               0.15                  0.81           31
2.1       0.75               0.19                  0.90           47
2.2       0.81               0.21                  0.93           26
3.0       0.44               0.15                  1.08           35
3.1       0.78               0.20                  1.03           18
3.2       0.89               0.21                  0.83           17
Mean      0.65               0.18                  0.92           26

Figure 6. Locations of ground control point groups as well as check points for stereo image sequence 2.1 (Source: Geodaten Kanton Basel-Stadt).

Figure 4. Image sections of all 23 GCPs for stereo image sequence 2.1 showing difficult identification.

Figure 5. RMSE values in mm for deviations of projection centers between direct and image-based georeferencing.

Whereas most of the GCP residuals for stereo image sequence 2.1 are smaller than 2 cm, the highest value amounts to 19 cm, i.e., 1.9 pixel (second sample in the top row of Figure 4), which partly contributes to the largest 3D RMSE value of 47 mm (Table 2). The mean tie point reprojection error of 0.18 pixel indicates relative orientations of high quality. Thus, for our scenario, the mean RMSE value of manually measured pixel coordinates at ground control points is larger by a factor of five compared to automatic tie point measurements, while standard applications frequently assume a factor of two. Since the following investigations also focus on the exploitation of imagery captured by the back-right and left stereovision systems, we used the offsets and rotations from the left camera of the forward facing system to the respective cameras, determined in the calibration process, in order to compute the corresponding exterior orientation parameters.

Accuracy Investigations

In order to assess the quality of directly georeferenced sensor orientations as well as its potential improvement by image-based georeferencing in a challenging urban environment with frequent GNSS degradations, we computed deviations of the projection centers between direct and image-based georeferencing for all nine sequences (Figure 5).


Table 3. RMSE Values for Check Point Residuals of Direct and Image-Based Georeferencing

                             Image-based,        Image-based,
                             2 GCP groups        4 GCP groups
Sequence  CP     Direct      Δ3D    Impr.        Δ3D    Impr.
          count  Δ3D [mm]    [mm]   factor       [mm]   factor
1.0       15     555         137    4.1          27     20.4
1.1       11     168         42     4.0          21     8.0
1.2       11     774         121    6.4          26     29.4
2.0       11     131         76     1.7          48     2.7
2.1       12     593         432    1.4          73     8.1
2.2       11     813         425    1.9          36     22.5
3.0       8      174         42     4.2          -      -
3.1       10     64          30     2.1          -      -
3.2       10     568         53     10.8         -      -
Mean             427         151    2.8          39     11.1

These 3D deviations, with a mean value of approximately 40 cm, range from 46 to 803 mm, and the height is the component with the largest residuals for all sequences except 3.2. Trajectories of stereo image sequences captured on the same street section at different times show differences of up to several decimeters. While we obtained small deviations for sequences 1.1 and 2.0, they are significantly larger for the other sequences of these two street sections. Direct sensor orientations of stereo image sequences 3.1 and 3.2 were determined accurately, since all deviations lie in the range of one decimeter. As described in Cavegn et al. (2016), we could also reveal nine trajectory discontinuities from direct georeferencing, predominantly caused by vehicle stops of several seconds, mainly in front of crosswalks. The detected 3D discontinuities mostly amount to a few centimeters, but they reach up to approximately 15 cm.

While comparing 3D coordinates of camera trajectories from direct georeferencing and bundle block adjustment gives some hints on the respective quality, accuracy investigations on 3D coordinates of measured image points are much more conclusive. For the computation of these 3D coordinates by spatial intersection, orientation parameters from either direct georeferencing or bundle adjustment can be used. Therefore, we established several groups of two, three or four ground control points (GCP) and defined approximately half of the previously used GCP as checkpoints (Figure 6). Then, we computed checkpoint (CP) residuals for two scenarios. In the first, only one GCP group at each end of a segment was defined. In the second, two additional GCP groups in-between and close to the corresponding sharp curve were established. For scenario one with two GCP groups, we obtained mean checkpoint residuals of around 15 cm, which is roughly three times better than the value of approximately 40 cm for direct georeferencing (Table 3). Scenario two, featuring four GCP groups, led to checkpoint residuals per sequence of 21 to 73 mm and thus improves the direct georeferencing accuracy by an order of magnitude.
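The checkpoint evaluation relies on the spatial intersection of image rays. A minimal sketch of one standard linear least-squares formulation (not necessarily the authors' implementation), intersecting an arbitrary number of oriented rays:

```python
# Sketch: 3D point by spatial intersection of rays X = c_i + t * d_i,
# minimizing the sum of squared orthogonal distances to all rays.
import numpy as np

def intersect(centers, directions) -> np.ndarray:
    """centers, directions: iterables of 3-vectors (projection center, ray)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projector onto plane orthogonal to ray
        A += P
        b += P @ c
    return np.linalg.solve(A, b)         # normal equations: (sum P_i) X = sum P_i c_i
```

With the two sets of orientation parameters (direct versus bundle adjusted), the same image measurements yield two 3D points per checkpoint, whose residuals against the tachymetric reference give the values in Table 3.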


Dense Multi-View Stereo Matching

The final step of our pipeline for 3D urban data capture by image-based mobile mapping aims at the evaluation of the captured image sequences by a suitable dense multi-view image matching pipeline. This step is realized by the software system SURE, which proved to perform well also for complex tasks like urban data collection from oblique aerial views (Cavegn et al., 2014) or close range terrestrial applications (Wenzel et al., 2014). Since we are aiming at the evaluation of street-level mobile mapping imagery captured in densely built-up urban environments, matching can be aggravated by potential difficulties. These arise from large variations in scale due to a higher depth of field, greater illumination changes, and multiple occlusions present in the imagery. A main feature of our mobile mapping system is the application of multiple cameras. Such a configuration provides high redundancy and thus helps to overcome potential problems during multi-view stereo matching for 3D point cloud generation. Especially if imagery is available at high redundancy, the selection of suitable combinations is an important preprocessing step. Hence, the Selection of Matching Configurations section presents different matching configurations as selected from the mobile mapping sequence. Some of these potential configurations also consist of stereo pairs captured from cameras oriented in moving direction at different timeframes. This in-sequence matching for such configurations from backward and forward looking cameras presumes adaptations of processing steps like the rectification of stereo image pairs, which is described in more detail by Cavegn et al. (2015). Point cloud processing and filtering by a modified version of the software system SURE is described in the Processing and Filtering of Point Clouds section, while the Analysis of Point Cloud Patches section analyzes the geometric quality of these data using different reference patches.

Figure 7. Selected matching configurations; Lt0 is the base image, the others are match images.

Selection of Matching Configurations

Figure 8. Filtered point clouds generated by incorporating forward stereo imagery of sequences 1.0 and 2.0 and by exploiting five match images per base image (configuration c4): (a) fold 2, and (b) fold 3.


For the investigations on point cloud patches derived from mobile mapping image sequences, the two matching configurations c1 and c4 depicted in Figure 7 were selected. Configuration c1 represents standard stereo matching with one base and one match image captured at the same point in time; for this purpose, a great number of algorithms exist (Scharstein and Szeliski, 2002; Menze and Geiger, 2015). In case of configuration c4, the base image is additionally matched with the two previous and the two following images, resulting in five neighboring match images. This redundancy should lead to an increase in completeness, reliability, and accuracy. Six images were also used by Vogel et al. (2014) for evaluating their powerful dense 3D scene flow method, which estimates both the depth and the 3D motion field of dynamic scenes.
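To make the two configurations concrete, the following sketch enumerates the match images for a given exposure index. The Lt/Rt naming and the exact choice of neighboring images for c4 are assumptions for illustration; the paper only labels the base image Lt0 in Figure 7:

```python
# Illustrative enumeration of matching configurations c1 and c4 for a
# forward stereo sequence. L{t}/R{t} denote left/right images at exposure t
# (hypothetical naming; the composition of c4's match set is an assumption).

def configuration_c1(t: int):
    """Standard stereo: base image matched with its synchronous partner."""
    return f"L{t}", [f"R{t}"]

def configuration_c4(t: int):
    """In-sequence matching: the synchronous partner plus images of the
    two previous and two following exposures, i.e., five match images."""
    matches = [f"R{t}"] + [f"L{t + dt}" for dt in (-2, -1, 1, 2)]
    return f"L{t}", matches

print(configuration_c1(10))   # ('L10', ['R10'])
print(configuration_c4(10))   # ('L10', ['R10', 'L8', 'L9', 'L11', 'L12'])
```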

Processing and Filtering of Point Clouds

We processed forward, back-right, and left stereo imagery of the three selected image sequences from the campaign in July 2014 using SURE (Rothermel et al., 2012) with standard parameters, exploiting configurations c1 and c4, which resulted in 18 point clouds. Each of these was filtered by the octree-based approach described in Wenzel et al. (2014). Aiming at a dense point cloud in object space, we set the filtering parameter fold to 2 (Figure 8a), in contrast to Cavegn et al. (2015) where fold 3 was used (Figure 8b). This resulted in a significantly higher density and particularly in a better point coverage on the sidewalks and façades for the forward stereovision system. However, it also caused more clutter, especially around the overhead wires, and a considerable number of points representing moving objects were not removed, like the tramway in the middle of Figure 8a.

Figure 9. Mobile mapping images [(a) forward, (b) back-right, (c) left] and point clouds generated by configuration c4 in the region of patch 27 [(d) forward, (e) back-right and forward, (f) left and forward].

Figure 9 depicts the benefit of exploiting imagery from the back-right as well as from the left stereovision system in addition to forward imagery. While incorporating back-right imagery leads to a significant increase in sidewalk points (Figure 9e), the left stereovision system covers a larger part of the road surface and is beneficial for lower façade points as well (Figure 9f).

Figure 10. Selected segments of the point clouds generated by forward and back-right stereo imagery defining the patches P21-P28.
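The effect of the fold parameter can be pictured as a redundancy threshold: a point survives the filter only if enough images support it. A strongly simplified sketch of this idea (the actual octree-based implementation in SURE/Wenzel et al. (2014) differs in data structures and consistency tests):

```python
# Simplified sketch of "fold" filtering: keep a 3D point only if at least
# `fold` images observe a consistent depth for it. With fold = 2 more points
# survive (denser, but more clutter) than with fold = 3, as in Figure 8.
import numpy as np

def fold_filter(points: np.ndarray, observation_counts: np.ndarray,
                fold: int = 2) -> np.ndarray:
    """points: (N, 3) coordinates; observation_counts: (N,) redundancy."""
    return points[observation_counts >= fold]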

Analysis of Point Cloud Patches

Visual inspection gives a first impression of the differences between selected point clouds. However, in order to draw meaningful conclusions in terms of completeness and accuracy, these impressions need to be verified by numerical values. Therefore, density and deviation values for several patches were computed and visualized using deviation patches and profiles. Although point clouds of complex 3D structures can be generated by image-based mobile mapping, their evaluation against terrestrial laser scanning is aggravated by the different viewpoints and measuring techniques. For simplification, and in order to allow a comparison with the five road patches P1-P5 defined in Cavegn et al. (2015), we determined eight predominantly planar patches in road and sidewalk regions, each defined by four points, i.e., two patches per side of the roads which were mapped in both directions (Figure 2). Each patch area
needed to be covered by point clouds generated by all three stereovision systems and by points captured by TLS. In most cases, no disturbing objects were present, and all patches include curbstones as well as their vertical plane. Since all patch borders lying entirely on the road surface are defined by the back-right stereovision system, the limitation on the opposite side is given by vertical objects like façades or walls. As can be seen in Figure 10, we chose patches with varying illumination conditions and different portions of road markings. In order to assess the eight selected patches, we used the evaluation procedure described by Cavegn et al. (2014). First, a TLS and three Dense Image Matching (DIM) point clouds per patch were extracted using Leica Cyclone, i.e., one DIM point cloud for each stereovision system. Second, all reference TLS point cloud patches were subsampled to a distance of 1 cm. TLS and DIM grids of 3 cm spacing were used for the computation of deviations, and deviations larger than 30 cm were disregarded for further processing. The left column of Figure 11 exemplarily shows the deviations of patch 27 for configurations c1 and c4 and for all three stereovision systems. White holes indicate sparse regions where deviations could not be computed. Deviations for forward and back-right are mainly positive, and the largest deviations occur along the curbstone edges of the sidewalk as well as on the upper part of the sidewalk. There are significantly more points for back-right than for forward, especially in case of c4.
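A sketch of this evaluation procedure, following the description above (grid-cell aggregation by mean height is an assumption for illustration, not specified in Cavegn et al. (2014)):

```python
# Sketch: compare a DIM point cloud against a TLS reference on a 3 cm grid
# and report RMSE / mean / standard deviation of the height deviations.
import numpy as np

CELL = 0.03       # 3 cm grid spacing
MAX_DEV = 0.30    # deviations larger than 30 cm are disregarded

def grid_heights(points: np.ndarray) -> dict:
    """Aggregate an (N, 3) point cloud into mean heights per XY grid cell."""
    cells = {}
    for x, y, z in points:
        key = (int(x // CELL), int(y // CELL))
        cells.setdefault(key, []).append(z)
    return {k: float(np.mean(v)) for k, v in cells.items()}

def deviation_stats(dim: np.ndarray, tls: np.ndarray) -> dict:
    dim_g, tls_g = grid_heights(dim), grid_heights(tls)
    common = set(dim_g) & set(tls_g)            # empty cells = "white holes"
    dev = np.array([dim_g[k] - tls_g[k] for k in common])
    dev = dev[np.abs(dev) <= MAX_DEV]           # discard outliers > 30 cm
    return {"rmse": float(np.sqrt(np.mean(dev**2))),
            "mean": float(np.mean(dev)),
            "sd": float(np.std(dev))}
```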


Figure 11. Deviations DIM-TLS and profiles of patch 27 for the (a) forward, (b) back-right, and (c) left stereovision systems.

Profiles of patch 27 are depicted in the right column of Figure 11. Cross- and along-track profiles of the point clouds from image sequence 2.0 (forward and back-right) reveal an offset of around 2 cm for both configurations compared to TLS. The vertical curbstone plane is modeled least accurately by the back-right stereovision system. Point clouds of c4 are less noisy than point clouds of c1 for forward and left. Mean density and deviation values for the mixed road and sidewalk patches P21-P28 are given in Table 4, and standard deviation values are depicted in more detail in Figure 12. While the size of the investigated patches ranges from 11 to 32 m², the mean size of 22 m² corresponds to around one-fifth of the mean value of the road patches P1-P5 defined in Cavegn et al. (2015). The highest mean density value of 23,268 points/m² was computed for forward c1, which is around three times higher than for forward c4, principally caused by the fold parameter of the SURE triangulation
module (fold 1 for c1, and fold 2 for c4). Both values are 10 to 20 times higher than the values determined for P1-P5, which is mainly due to a different filtering degree. Low density values are caused by difficult matching conditions, such as a large shadow area for patch 22 in case of the forward and back-right stereovision systems, resulting in rather high standard deviation values; this does not apply to the left stereovision system, for which only a small shadow area is present since its data was captured at a different date and time of day. In most cases, RMSE and standard deviation values are larger for c1 than for c4. Because of the lower filtering degree and the vertical curbstone plane in each patch, the standard deviation values are approximately twice as large as for the road patches P1-P5. While mean DIM-TLS values of 13 to 23 mm were computed for patches 27 and 28 for forward and back-right, they amount to just a few mm for the left stereovision system, whose data was captured from another trajectory (image sequence 2.0 versus 3.0).

Figure 12. Standard deviation values in mm for residuals between DIM and TLS point cloud patches (SD DIM-TLS).

Table 4. Mean Density and Deviation Values for All Road Patches (fw: forward, br: back-right, c1: stereo matching, c4: in-sequence stereo matching, P1-P5: Cavegn et al. (2015))

Patches          Patch size  Density       RMSE DIM-TLS  Mean DIM-TLS  SD DIM-TLS
                 [m²]        [points/m²]   [mm]          [mm]          [mm]
P21-P28 fw c1    22          23268         26            -7            19
P21-P28 fw c4    22          7623          22            -9            12
P21-P28 br c1    22          16639         26            -7            17
P21-P28 br c4    22          15699         22            -9            11
P21-P28 left c1  22          3884          33            -13           28
P21-P28 left c4  22          4183          27            -13           22
P1-P5 fw c1      104         1338          14            -7            8
P1-P5 fw c4      104         751           13            -5            8

The highest standard deviation value for forward c1 was determined for patch 24, which is due to a bicyclist who caused many non-road points that could not be removed in case of c1, but almost completely in case of c4. 3D point accuracy depends on the measuring distance, as shown by Burkhard et al. (2012) for both camera types used in the present investigations. As the average distance between the left stereovision system and the selected patches is around 10 m, the larger deviation values compared to the forward and back-right stereovision systems are not surprising. While patch 26, at a distance of 7 m, shows the best values for the left stereovision system, patch 24 has the second largest standard deviation values due to a distance of 13 m. The greatest RMSE and standard deviation values were computed for patch 23, which has the largest area. In terms of density, large values were computed for both c1 and c4 for the back-right stereovision system, in contrast to the forward stereovision system, where density values are approximately three times larger for c1 than for c4. However, density is highly dependent on the filtering parameters and especially on the filtering degree. Table 4 shows more accurate values for c4 than for c1, which is caused by the higher redundancy of c4. Mainly due to the large distances between the stereo cameras and the
respective patches, the left stereovision system provides the least accurate point cloud patches. Similar accuracies were determined for the back-right and forward patches, as the lower resolution of the back-right compared to the forward stereovision system is compensated by shorter distances to the patches.

Conclusions

Within this paper, we demonstrated the feasibility of a dense multi-view stereo matching pipeline for the collection of georeferenced 3D imagery using a camera-based terrestrial mapping platform. The achieved georeferencing accuracy at the sub-pixel level is sufficient for a subsequent dense multi-image matching aiming at a high-resolution reconstruction of complex 3D structures in urban environments. This also allows for image measurement precisions of 1 pixel at objects of interest, which fulfills urban mapping requirements. So far, our image-based georeferencing, including automatic tie point measurement and bundle adjustment, is based on a state-of-the-art photogrammetric software system using imagery from the forward facing stereovision system. Nevertheless, the multiple camera configuration of our system enables
combinations of different views during bundle block adjustment. In combination with a tighter integration of measurements from direct georeferencing, this will potentially increase the accuracy and robustness of the process and also limit the need for control point measurements. Thus, our current developments aim at automated tie point detection and matching in largely differing views. Further improvements are expected from an integrated georeferencing approach handling multiple large image sequences, aiming at image-based mobile mapping for 3D data capture on a city-wide scale.

Acknowledgments

Thanks are due to iNovitas AG (Baden-Dättwil, Switzerland) and to the City of Basel (Bau- und Verkehrsdepartement Kanton Basel-Stadt) for providing the mobile mapping data of the campaign performed in July 2014. Some of this work was co-financed by the Swiss Commission for Technology and Innovation CTI as part of the infraVIS project.

References

Burkhard, J., S. Cavegn, A. Barmettler, and S. Nebiker, 2012. Stereovision mobile mapping: System design and performance evaluation, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXII ISPRS Congress, 25 August - 01 September, Melbourne, Australia, Vol. XXXIX-B5, pp. 453-458.

Cadena, C., L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I.D. Reid, and J.J. Leonard, 2016. Past, present, and future of simultaneous localization and mapping: Towards the robust-perception age, URL: http://arxiv.org/abs/1606.05830 (last date accessed: 15 October 2016).

Cavegn, S., N. Haala, S. Nebiker, M. Rothermel, and P. Tutzauer, 2014. Benchmarking high density image matching for oblique airborne imagery, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, ISPRS Technical Commission III Symposium, 05-07 September, Zurich, Switzerland, Vol. XL-3, pp. 45-52.

Cavegn, S., N. Haala, S. Nebiker, M. Rothermel, and T. Zwölfer, 2015. Evaluation of matching strategies for image-based mobile mapping, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, ISPRS Geospatial Week, 28 September - 03 October 2015, La Grande Motte, France, Vol. II-3/W5, pp. 361-368.

Cavegn, S., S. Nebiker, and N. Haala, 2016. A systematic comparison of direct and image-based georeferencing in challenging urban areas, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXIII ISPRS Congress, 12-19 July 2016, Prague, Czech Republic, Vol. XLI-B1, pp. 529-536.

Fraundorfer, F., and D. Scaramuzza, 2012. Visual odometry, Part II: Matching, robustness, optimization, and applications, IEEE Robotics & Automation Magazine, 19(2):78-90.

Gallup, D., 2011. Efficient 3D Reconstruction of Large-Scale Urban Environments from Street-Level Video, Ph.D. dissertation, University of North Carolina, Chapel Hill, North Carolina, 123 p.


Haala, N., 2014. Dense Image Matching Final Report, EuroSDR Publication Series, Official Publication No. 64, pp. 115-145.

Kersten, T.P., G. Büyüksalih, I. Baz, and K. Jacobsen, 2009. Documentation of Istanbul historic peninsula by kinematic terrestrial laser scanning, The Photogrammetric Record, 24(126):122-138.

Menze, M., and A. Geiger, 2015. Object scene flow for autonomous vehicles, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 07-12 June, Boston, Massachusetts, pp. 3061-3070.

Nebiker, S., S. Cavegn, and B. Loesch, 2015. Cloud-based geospatial 3D image spaces - A powerful urban model for the smart city, ISPRS International Journal of Geo-Information, 4(4):2267-2291.

NovAtel, 2016. SPAN UIMU-LCI, URL: http://www.novatel.com/assets/Documents/Papers/IMU-LCI.pdf (last date accessed: 15 October 2016).

Pollefeys, M., D. Nistér, J.-M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S.-J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, and H. Towles, 2008. Detailed real-time urban 3D reconstruction from video, International Journal of Computer Vision, 78(2-3):143-167.

Puente, I., H. González-Jorge, J. Martínez-Sánchez, and P. Arias, 2013. Review of mobile mapping and surveying technologies, Measurement, 46(7):2127-2145.

Remondino, F., M.G. Spera, E. Nocerino, F. Menna, and F. Nex, 2014. State of the art in high density image matching, The Photogrammetric Record, 29(146):144-166.

Rothermel, M., K. Wenzel, D. Fritsch, and N. Haala, 2012. SURE: Photogrammetric surface reconstruction from imagery, Proceedings of the LC3D Workshop, December, Berlin, Germany.

Scaramuzza, D., and F. Fraundorfer, 2011. Visual odometry, Part I: The first 30 years and fundamentals, IEEE Robotics & Automation Magazine, 18(4):80-92.

Scharstein, D., and R. Szeliski, 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, International Journal of Computer Vision, 47(1-3):7-42.

Schönberger, J.L., and J.-M. Frahm, 2016. Structure-from-Motion revisited, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 26 June - 01 July, Las Vegas, Nevada, pp. 4104-4113.

Tutzauer, P., and N. Haala, 2015. Façade reconstruction using geometric and radiometric point cloud information, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, PIA15+HRIGI15, 25-27 March 2015, Munich, Germany, Vol. XL-3/W2, pp. 247-252.

Vogel, C., S. Roth, and K. Schindler, 2014. View-consistent 3D scene flow estimation over multiple frames, Proceedings of the 13th European Conference on Computer Vision (ECCV), 06-12 September 2014, Zurich, Switzerland, Part IV, LNCS 8692, pp. 263-278.

Wenzel, K., M. Rothermel, D. Fritsch, and N. Haala, 2014. Filtering of point clouds from photogrammetric surface reconstruction, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, ISPRS Technical Commission V Symposium, 23-25 June, Riva del Garda, Italy, Vol. XL-5, pp. 615-620.
