The Photogrammetric Record 25(132): 356–381 (December 2010)

ORIENTATION AND 3D MODELLING FROM MARKERLESS TERRESTRIAL IMAGES: COMBINING ACCURACY WITH AUTOMATION

Luigi Barazzetti ([email protected])
Marco Scaioni ([email protected])
Politecnico di Milano, Italy

Fabio Remondino ([email protected])
Bruno Kessler Foundation (FBK), Trento, Italy

Abstract

In this paper an automated methodology is presented (i) to orient a set of close-range images captured with a calibrated camera, and (ii) to extract dense and accurate point clouds starting from the estimated orientation parameters. The whole procedure combines different algorithms and techniques in order to obtain accurate 3D reconstructions in an automatic way. The exterior orientation parameters are estimated using a photogrammetric bundle adjustment with the image correspondences detected using area- and feature-based matching algorithms. Surface measurements are then performed using advanced multi-image matching techniques based on multiple image primitives. To demonstrate the reliability, precision and robustness of the procedure, several tests on different kinds of free-form objects are illustrated and discussed in the paper. Three-dimensional comparisons with range-based data are also carried out.

Keywords: 3D modelling, automation, image matching, image processing, orientation, surface reconstruction

Introduction

Reality-based 3D modelling is intended as the entire process of generating a digital 3D object from a set of images or range maps. Three-dimensional data can also be obtained using computer graphics methods or procedural modelling approaches (Ebert et al., 2003; Müller et al., 2006), although these procedures are not based on real measurements. Image-based modelling (Remondino and El-Hakim, 2006) uses passive sensors (digital cameras) and requires a mathematical formulation to transform 2D image coordinates into 3D information. Images contain all the information needed to derive geometry and texture for a 3D modelling application. But the reconstruction of detailed, accurate and photo-realistic 3D models from images is still a difficult task, particularly in the case of large and complex sites that have to be photographed with widely separated or convergent image blocks.

On the other hand, range technologies use active optical sensors (Blais, 2004) and can directly provide relatively accurate 3D data, often combined with colour information either from the sensor itself or from an external digital camera. Although still costly, normally bulky, of limited flexibility, difficult to use in certain places and strongly dependent on the surface characteristics, active sensors (such as terrestrial laser scanners or structured light systems) have reached high levels of acceptance in recent years and are often used in surveying and 3D modelling applications. Nowadays, the range-based modelling production process (Cignoni and Scopigno, 2008) is quite straightforward, although problems can arise in the case of very large data-sets and in the modelling of sharp edges. Recently, much research work and many practical projects have used a combination of these digital reconstruction techniques, coupled with survey information and maps for correct georeferencing and scaling, in order (i) to exploit the intrinsic potential and advantages of each technique, (ii) to compensate for the individual weaknesses of each method, and (iii) to achieve more accurate and complete surveying, modelling, interpretation and digital conservation results (Stumpfel et al., 2003; El-Hakim et al., 2004; Rönnholm et al., 2007; Stamos et al., 2008; Guidi et al., 2009; Remondino et al., 2009).

The aim of this paper is to report a reliable and precise image-based production pipeline to automatically reconstruct complex and detailed free-form objects from sets of unoriented and markerless terrestrial images. The focus is mainly on the tie point extraction phase for the successive stages of the orientation process, and on surface reconstruction based on advanced matching algorithms. The procedure was tested mainly on objects such as bas-reliefs, small statues and decorations (Fig. 1), as interactive 3D reconstruction is normally still the best and fastest approach for complete architectural objects such as buildings (El-Hakim, 2002; De Luca et al., 2006).

In recent years the computer vision (CV) community has developed several techniques for scene reconstruction at a very high level of automation. Examples have been reported by Beardsley et al. (1996), Pollefeys et al. (1999, 2004), Hao and Mayer (2003), Dick et al. (2004), Nister (2004a) and Agarwal et al. (2009), and can also be seen in Photosynth (http://photosynth.net) or the recent open-source software Bundler (http://phototour.cs.washington.edu). However, even in the best of cases (Goesele et al., 2007) an accuracy of only 1:400 has been reported, limiting the use of such techniques to applications requiring only aesthetically pleasing 3D models for simple visualisation, object-based navigation, annotation transfer or image browsing purposes. On the other hand, the accuracy of the final 3D model is of fundamental importance for the methodology developed here. Furthermore, the proposed method is distinguished from previous work by its robustness, speed, flexibility, the completeness of the achieved 3D models and its capability to process convergent images or multi-strip blocks at the original geometric resolution.

Fig. 1. Typical objects digitally reconstructed with the proposed automated methodology. From left to right: Mayan sculpture, Copan (UNESCO World Heritage Site), Honduras; "Mouth of Truth", Malesco, Italy; Peruvian Inca statuette; temple tower G1, My Son Sanctuary (UNESCO World Heritage Site), Vietnam.


Fig. 2. The flowchart for automated image orientation and surface measurement procedures. Example at right shows details from Pompeii Forum, Campania, Italy.

The methodology (Fig. 2) allows a set of terrestrial markerless images to be automatically oriented, and detailed and complete surface models to be subsequently derived. A calibrated camera should be used to capture the images, in order to avoid the need for self-calibration, which is generally not appropriate or reliable in practical 3D modelling projects with a weak image network (Remondino and Fraser, 2006). For the image orientation phase, the proposed method combines CV and photogrammetric algorithms for the robust detection of the image correspondences. After a standard photogrammetric bundle block adjustment to retrieve the exterior orientation (EO) parameters, a multi-image matching procedure extracts dense point clouds whose accuracy and density are similar to those achievable with range-based techniques (Remondino et al., 2008).


In the following sections the procedures for automated image orientation and surface reconstruction are described; examples and analyses of performance are then reported and commented on.

Tie Point Transfer and Image Orientation

Overview and Related Work

The complexity and diversity of image network geometry in close-range applications, with wide baselines, convergent images, illumination changes, occlusions and varying overlap, makes the identification of tie points more complex than in standard aerial photogrammetry. Automatic aerial triangulation (AAT) has reached a significant level of maturity and reliability, demonstrated by the numerous commercial software packages available on the market (Büyüksalih and Li, 2005). In close range photogrammetry, on the other hand, commercial solutions for the automated orientation of markerless sets of images are still lacking. A few commercial packages are available to automatically orient video sequences (for example, Boujou, 2D3 and MatchMover, RealViz), but these sequences are generally of very low geometric resolution and so are not really useful for detailed 3D modelling projects.

The automated orientation approaches in CV are generally named "structure from motion" (SfM), or "structure and motion", terms which refer to the simultaneous estimation of the image orientation parameters and a sparse 3D point cloud from a set of image correspondences. Some of these methods are termed "uncalibrated", meaning that all interior orientation parameters, including distortion coefficients, are initially unknown but are derived during the processing or from the image information given in exchangeable image file format (EXIF). These methods usually start with a robust identification of interest points in the images; then a subset of images (generally an image pair) is oriented and all the other images are progressively concatenated into the final bundle adjustment (Snavely et al., 2008). The number of images that these methods can analyse in a fully automatic way can be quite vast (Agarwal et al., 2009). However, a very low geometric resolution is employed, so they rarely find practical application in standard photogrammetric surveys. Moreover, the derived sparse 3D point clouds need extensive editing and structuring operations to produce a complete and textured 3D model, in particular for architectural 3D reconstructions. In addition, the lack of complete camera calibration and the absence of a perspective bundle adjustment, with all the statistical analysis of the recovered parameters, militate against an accurate reconstruction, which is a necessary requirement for most photogrammetric projects.

On the other hand, there is a lack of commercial photogrammetric packages capable of automatically determining the EO parameters in the case of markerless and convergent terrestrial images and then deriving a detailed and accurate surface model of the scene. For many years, commercial solutions have used coded targets for the calibration and orientation phases (Ganci and Handley, 1998; Cronk et al., 2006). Targets are automatically recognised, measured and labelled to solve the identification of the image correspondences. This solution is very useful and practical for camera calibration, but in many surveys targets cannot be placed on the object, and the detection of image correspondences is instead carried out with interactive measurements performed by an operator.
In this sense, the state of the art for image orientation in close range photogrammetric surveys still requires manual measurements if targets cannot be used. Some research solutions capable of automatically orienting a set of calibrated images were presented by Roncella et al. (2005), Läbe and Förstner (2006) and Remondino and Ressl (2006). These methods are based on the automatic extraction of interest points or scale-invariant features, the removal of wrong correspondences with a RANSAC solution (Fischler and Bolles, 1981), followed by a relative orientation between image pairs or triplets, or directly by a photogrammetric bundle adjustment (Granshaw, 1980).

A reliable and precise procedure for automated tie point extraction from any kind of terrestrial image sequence is presented here, developed to overcome this gap in close range photogrammetry. The flowchart of the orientation procedure is shown in Fig. 2. The whole procedure can be considered as a multi-step process in which several parameters are estimated or refined. This allows different sets of images, acquired under different image network configurations, to be handled. The user needs only to select some parameters at the beginning of the procedure and then to use the extracted image correspondences in the bundle adjustment to retrieve the EO parameters and an initial sparse point cloud of the analysed scene. The procedure can be tuned to deal more effectively with different image sets (Fig. 3), thanks to the optional choice between techniques and input parameters. In fact, while this paper focuses on the 3D reconstruction of cultural heritage artefacts, the method was also developed for other applications in the fields of engineering and natural sciences; an application in the geological field can be found in Alba et al. (2009).

The input elements for the automated tie point extraction and transfer are the images and the calibration parameters of the camera used. A generic block of n images can be considered as composed of (n² − n)/2 combinations of stereopairs, which are first analysed independently for the identification of the correspondences and then progressively combined. However, if the images form an ordered sequence, the number of image combinations to be worked upon becomes n − 2, with a significant reduction of computational time. Thus, two procedures for image matching were implemented, for ordered sequences and for sparse blocks (Fig. 3) respectively; a sketch of the corresponding combination counts follows the list below. A preliminary orientation based on SIFT (Lowe, 2004) or SURF (Bay et al., 2008) features is derived. Then the FAST operator (Rosten and Drummond, 2006) is employed, together with a final bundle adjustment, to improve the precision of the final EO and sparse 3D reconstruction. The automatically extracted pixel coordinates of homologous image points can be imported and used for image orientation and sparse geometry reconstruction in most commercial photogrammetric software (for example, Leica Photogrammetry Suite, Australis, iWitness, iWitnessPro, PhotoModeler) as well as in Bundler. The innovative aspects of the developed method are:

(a) effectiveness on a large variety of unorganised and fairly large pinhole camera image data-sets;
(b) capability to work with high-resolution images;
(c) accurate image measurements based on least squares matching (LSM);
(d) combination of feature-based and area-based operators.

Fig. 3. The image sequences are processed according to the geometric structure of the image block: ordered sequence (left) or sparse block (middle and right).


In the next sections, the different steps of the automated orientation workflow shown in Fig. 2 are presented and described.

Feature Detection

Several interest operators for feature detection and description have been developed over recent years (Remondino, 2006). Generally, the most valuable property of an operator is its repeatability, meaning the capability of finding the same point or feature under different viewing and illumination conditions. The developed methodology starts by extracting features (interest points and regions) with the SIFT and SURF operators. Both have a detector capable of finding interest points in the images and a descriptor that associates a vector of information with each detected point for further matching purposes. Valgren and Lilienthal (2007) reported the high repeatability of both operators in the case of terrestrial images.

The scale-invariant feature transform (SIFT) is a standard algorithm in many CV applications (such as object recognition, SfM and data registration) because it provides highly distinctive features that are invariant to image scaling and rotation. The operator can be split into four steps: scale-space extrema detection, keypoint localisation, orientation assignment and keypoint descriptor creation. For more details the reader is referred to Lowe (1999, 2004). As the original SIFT implementation provided by Lowe (http://www.cs.ubc.ca/~lowe/keypoints/) is not capable of working with high-resolution images, a suitable implementation was developed starting from the work of Nowozin (http://user.cs.tu-berlin.de/~nowozin/autopano-sift/). An interesting variant of SIFT, the gradient location and orientation histogram (GLOH), was proposed by Mikolajczyk and Schmid (2005), while Ke and Sukthankar (2004) proposed a PCA-based SIFT in which a principal component analysis is used for dimensionality reduction.

The speeded-up robust features (SURF) algorithm was designed to be a fast, distinctive point detector and descriptor. SURF gives results comparable to SIFT at a lower computational cost. The method uses a Hessian matrix-based measure for the detector and the distribution of first-order Haar wavelet responses for the descriptor, and generally returns fewer points than SIFT. In the present methodology, the original SURF implementation (available at http://www.vision.ee.ethz.ch/~surf/) was adopted.
Comparison of Feature Descriptors

SIFT and SURF associate a descriptor with each extracted image feature. A descriptor is a vector with a variable number of elements that describes the feature. Homologous points can be found by simply comparing the descriptors, without any preliminary information about the image network or epipolar geometry. In the present approach, the SIFT descriptor follows the standard one proposed by Lowe (2004), which is based on a vector of 128 elements. For the SURF operator a vector length of 128 is also used, although regular SURF and U-SURF (in which rotation invariance is not considered) have a descriptor length of 64.

Two strategies for comparing the descriptors are available: a quadratic matching (m²) procedure (slow but rigorous) and a kd-tree procedure (fast but approximate). The user has to select the detector–descriptor operator and the procedure used to compare the vectors and extract the correspondences. This choice depends on the number of images and extracted features, and the resulting computation times can differ greatly.

Given two images I and J, in which m and n features were detected with descriptors $D_m$ and $D_n$, the quadratic matching procedure compares all descriptors of image I with all those of image J, using the Euclidean distance between two descriptors as the measure of their difference. Moreover, a constraint between the first- and second-best candidates is added to make the matches more distinctive. The method can be summarised as follows:

(1) each descriptor $D_m$ is compared with all the descriptors $D_n$ by estimating the Euclidean distance $d_{mn} = \lVert D_m - D_n \rVert$;
(2) all distances $d_{mn}$ are sorted from the shortest $(d_{mn})_1$ to the longest $(d_{mn})_n$;
(3) an image correspondence is accepted if $(d_{mn})_1 < t \, (d_{mn})_2$.

The value of the threshold t generally varies from 0.5 to 0.8. This method ensures good robustness but is computationally expensive.

The second strategy for comparing the descriptors is based on a kd-tree approach (Beis and Lowe, 1997), widely used, for example, for panorama generation from unoriented images (Brown and Lowe, 2003). Two fast libraries are the approximate nearest neighbours (ANN) library (Arya et al., 1998) and the fast library for approximate nearest neighbours (FLANN) (Muja and Lowe, 2009). For panoramic images, points are extracted from each image and a single global tree is then built with all the points. Here this procedure cannot be applied directly, as points must be found between all combinations of image pairs. Thus, for two generic images I and J, a kd-tree is created with the descriptors $D_m$ of image J, and the descriptors $D_n$ of image I are compared by using the kd-tree. The control based on the Euclidean distances between the first two candidates is then applied. Both strategies are sketched in code below.

Table I shows some results of the presented strategies for comparing feature descriptors and matching homologous points. The image sizes range from 640 × 480 pixels to 5616 × 3744 pixels. When the number of detected features increases, the difference in terms of CPU time is remarkable (minutes versus hours).

Table I. Comparison between quadratic- and kd-tree-based matching strategies to transfer tie points from two sets of feature descriptors extracted with the SIFT operator.

Image pair   Image size     Features    Features    Kd-tree matching      Quadratic matching (m²)
                            (image 1)   (image 2)   matched   time (s)    matched   time (s)
 1           640 × 480          1391        1405        262      1.69         328      2.78
 2           720 × 576           342         314         28      1.33          25      1.38
 3           1500 × 1004        5636        5017        349      5.74         419     19.09
 4           1856 × 1392      13 451      14 401       1139     17.39        1603     20
 5           2560 × 1920      18 329      16 816        741     36.55        1135    442.3
 6           2816 × 2112      23 080      22 806       1204     39.22        1690    465.4
 7           3072 × 2304      21 343      20 691       2018     59.11        2936    698
 8           3872 × 2592      43 211      40 858       1716     64.28        2385   1633
 9           4500 × 3000      83 729      87 471     19 336    161.6       27 772   7900
10           5616 × 3744      94 208     127 359     25 968    338.5       34 230  22 580

Removal of Wrong Image Correspondences

Both automated strategies for the comparison of the feature descriptors retrieve a sufficient number of image correspondences, although some mismatches are often still present.


To remove these outliers, a robust estimation of the fundamental matrix F or of the essential matrix E is used (Longuet-Higgins, 1984) or, in the case of a planar object, of a homography (Hartley and Zisserman, 2004). Both 3 × 3 matrices have rank 2 and they encapsulate the epipolar geometry of a stereopair with known (E matrix) or unknown (F matrix) camera calibration parameters, respectively. Nister (2004b) proposed a method to estimate the E matrix using five image correspondences, although the concept had already been known within the photogrammetric community for more than 50 years (Thompson, 1959). The proposed methods to estimate the F matrix can be grouped into linear, iterative and robust methods, as reviewed by Armangué and Salvi (2003).

Given a set of image correspondences between two views, $\mathbf{x}_i = (x_i, y_i, 1)^T$ and $\mathbf{x}'_i = (x'_i, y'_i, 1)^T$, written in homogeneous coordinates, the condition

$\mathbf{x}'^{T}_i F \mathbf{x}_i = 0$    (1)

must be satisfied. This condition can easily be demonstrated by considering that the F matrix represents a connection between a point in the first image and the epipolar line in the second one, $\mathbf{l}'_i = F\mathbf{x}_i$ (and vice versa $\mathbf{l}_i = F^T\mathbf{x}'_i$), in which lines are also expressed by homogeneous vectors. Thus, the dot product between a point in the second image and the epipolar line of the corresponding point in the first one must be zero, because the point lies on the line ($\mathbf{x}'^{T}_i \mathbf{l}'_i = 0$).

A very popular method to estimate F is based on seven image correspondences, which represent the minimum number of observations needed to solve for the unknown matrix elements. Indeed, the F matrix has a scale ambiguity which, coupled with the singularity constraint det(F) = 0, reduces the number of independent elements to seven. Equation (1) leads to a system of equations of the form

$\mathbf{z}_i^T \mathbf{f} = [\,x'_i x_i \;\; x'_i y_i \;\; x'_i \;\; y'_i x_i \;\; y'_i y_i \;\; y'_i \;\; x_i \;\; y_i \;\; 1\,]\, \mathbf{f} = 0$    (2)

where $\mathbf{f}$ contains the elements of F ordered into a vector. If seven points are selected to form the data matrix $Z = [\mathbf{z}_1, \ldots, \mathbf{z}_7]^T$, the null space of $Z^T Z$ has dimension 2, barring degeneracy. Therefore, the solution is a two-dimensional space of the form $\alpha F_1 + (1 - \alpha)F_2$, which coupled with the determinant constraint gives $\det(\alpha F_1 + (1 - \alpha)F_2) = 0$. This last is a cubic polynomial equation in $\alpha$ that can easily be solved.
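The seven-point solution can be sketched in a few lines of NumPy; this is a generic textbook implementation for illustration (function name and structure are not from the authors' code). The two-dimensional null space is taken from the SVD of Z, and the cubic in α is recovered by sampling the determinant.

```python
import numpy as np

def seven_point_F(x1, x2):
    """Candidate fundamental matrices from exactly 7 correspondences.
    x1, x2: (7, 2) arrays of image coordinates in views I and J."""
    # Data matrix Z: one row z_i per correspondence (equation (2)).
    Z = np.array([[u2*u1, u2*v1, u2, v2*u1, v2*v1, v2, u1, v1, 1.0]
                  for (u1, v1), (u2, v2) in zip(x1, x2)])
    # Two-dimensional null space of Z (barring degeneracy): the last two
    # right singular vectors.
    _, _, Vt = np.linalg.svd(Z)
    F1, F2 = Vt[-1].reshape(3, 3), Vt[-2].reshape(3, 3)
    # det(a*F1 + (1 - a)*F2) = 0 is cubic in a: sample the determinant at
    # four points and fit the cubic exactly.
    ts = np.array([0.0, 1.0, 2.0, 3.0])
    dets = [np.linalg.det(t * F1 + (1 - t) * F2) for t in ts]
    coeffs = np.polyfit(ts, dets, 3)
    # Each real root yields one rank-2 candidate (up to three solutions).
    return [a.real * F1 + (1 - a.real) * F2
            for a in np.roots(coeffs) if abs(a.imag) < 1e-8]
```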


The solution for F (or E) needs to be sought with robust techniques, as they allow the detection of possible outliers in the observations. All the proposed methods are based on the analysis of several sets of image coordinates randomly extracted from the whole data-set. In the present procedure three popular techniques are included: RANSAC (Fischler and Bolles, 1981), least median of squares (Rousseeuw and Leroy, 1987) and MAPSAC (Torr, 2002). Robust methods are really necessary when the correspondences are extracted automatically, as a number of outliers can be present in the data, especially in the case of wide baselines. The idea of robustness in the estimation is to have safeguards against deviations from the assumptions. This is in contrast with diagnostics, whose purpose is to find and identify deviations from the model assumptions (Huber, 1991). Within robust estimators, gross errors are defined as observations which do not fit the stochastic model of the estimated parameters. Their efficiency depends on different factors, but primarily on the number (percentage) of outliers. Generally there is a lack of repeatability because of the random way in which the points are selected (random sampling). Compared to data-snooping techniques (a statistical test of the normalised residuals), robust estimators do not provide a measure or a judgement about the quality of the found (or rejected) outliers.

The random sampling consensus (RANSAC) algorithm estimates the number of trials and then extracts a minimum number of correspondences to calculate F (Torr and Murray, 1993). Then the distance between each point and its epipolar line is evaluated and, given a threshold T, the estimation of F is carried out by maximising the number of inliers. RANSAC is usually a fast estimation procedure and has a higher breakdown point, but a preliminary T needs to be fixed according to the data precision.

The least median of squares (LMedS) technique evaluates each solution by using the median of the symmetric epipolar distances to the data (Scaioni, 2000); the solution which minimises the median is chosen. Rousseeuw and Leroy (1987) proposed a minimum number of trials $m_s$ to obtain a good subsample with a given probability P and a percentage of outliers $\varepsilon$:

$m_s > \dfrac{\log(1 - P)}{\log(1 - (1 - \varepsilon)^p)}$

where p is the minimum number of correspondences needed, equal to the number of parameters to be estimated, seven in this case. For any subsample k of image coordinates an F matrix $F_k$ is computed; then the median of the squared residuals is estimated over the whole data-set of image coordinates with the relation $\mu_k = \mathrm{median}\,[\,d^2(\mathbf{x}'_i, F_k\mathbf{x}_i) + d^2(\mathbf{x}_i, F_k^T\mathbf{x}'_i)\,]$. A robust estimate of the standard deviation can be derived from the data with the relation

$\sigma_0 = c\left(1 + \dfrac{5}{n - p}\right)\sqrt{\mu_k}$    (3)

where $\mu_k$ is the minimal median and c = 1.4826. Then a weight $w_i$ based on $\sigma_0$ is determined for each correspondence and is used to detect outliers ($w_i = 0$):

$w_i = \begin{cases} 1 & \text{if } r_i^2 \le (2.5\,\sigma_0)^2 \\ 0 & \text{otherwise.} \end{cases}$
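As an illustration of the LMedS scheme just described, the following hedged sketch reuses the seven_point_F function from the earlier sketch. The trial count, the median scoring, the robust σ0 and the binary weights follow the formulas above; the 2.5σ0 cut-off is the conventional choice from Rousseeuw and Leroy (1987), and the assumed outlier fraction of 0.4 is illustrative.

```python
import numpy as np

def n_trials(P=0.99, eps=0.4, p=7):
    """Minimum number of subsamples m_s for success probability P given
    an assumed outlier fraction eps (Rousseeuw and Leroy, 1987)."""
    return int(np.ceil(np.log(1 - P) / np.log(1 - (1 - eps) ** p)))

def sym_epipolar_sq(F, x1, x2):
    """Squared point-to-epipolar-line distances, summed over both images."""
    h1 = np.column_stack([x1, np.ones(len(x1))])  # homogeneous points, image I
    h2 = np.column_stack([x2, np.ones(len(x2))])  # homogeneous points, image J
    l2 = h1 @ F.T                                 # epipolar lines l' = F x
    l1 = h2 @ F                                   # epipolar lines l = F^T x'
    d2 = np.sum(h2 * l2, axis=1) ** 2 / (l2[:, 0] ** 2 + l2[:, 1] ** 2)
    d1 = np.sum(h1 * l1, axis=1) ** 2 / (l1[:, 0] ** 2 + l1[:, 1] ** 2)
    return d1 + d2

def lmeds_F(x1, x2, seed=0):
    """Least median of squares estimation of F with binary outlier weights."""
    rng = np.random.default_rng(seed)
    best_F, best_med = None, np.inf
    for _ in range(n_trials()):
        sample = rng.choice(len(x1), size=7, replace=False)
        for F in seven_point_F(x1[sample], x2[sample]):
            med = np.median(sym_epipolar_sq(F, x1, x2))  # mu_k
            if med < best_med:
                best_F, best_med = F, med
    n, p = len(x1), 7
    sigma0 = 1.4826 * (1 + 5.0 / (n - p)) * np.sqrt(best_med)  # equation (3)
    w = (sym_epipolar_sq(best_F, x1, x2) <= (2.5 * sigma0) ** 2).astype(int)
    return best_F, w
```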

where lk is the minimal median and c = 1Æ4826. Then, a weight wi based on r0 is determined for each correspondence and is used to detect outliers (wi = 0):  1 ri2