[Fabio Remondino, Sabry F. El-Hakim, Armin Gruen, and Li Zhang]

Turning Images into 3-D Models [Developments and performance analysis of image matching for detailed surface reconstruction of heritage objects]

There exist several goals for three-dimensional (3-D) digitization and modeling of natural and cultural heritage objects, such as accurate detailed documentation, digital preservation and restoration, physical replicas, virtual tourism, and research or education. We focus on the detailed documentation aspect. Capturing the 3-D data requires a technique that is highly accurate; portable, due to accessibility problems; low cost, due to limited budgets; fast, due to the usually short time allowed on site so as not to disturb works or visitors; and flexible and scalable, due to the wide variety and sizes of sites and heritage objects. It is also important that the technique captures dense 3-D data on the necessary surface elements to guarantee a realistic experience close up or to monitor surface condition over time. Furthermore, the fine geometric details are needed because, even with rich texture maps, models without these details will exhibit too smooth and flat-looking surfaces or polygonized silhouettes that are easily detected by the human eye. Although image-based modeling techniques can produce accurate and realistic-looking models, practical systems are still highly interactive, since current fully automated methods are still unproven in real applications and may not guarantee accurate results. As an image-based measurement technique, photogrammetry has for many years been dealing with accurate 3-D reconstruction of objects. Even though it was often perceived as time-

Digital Object Identifier 10.1109/MSP.2008.923093

1053-5888/08/$25.00©2008IEEE

IEEE SIGNAL PROCESSING MAGAZINE [55] JULY 2008


consuming and complicated, the heritage community is starting to consider it for precise digital documentation as a promising, less costly, and practical alternative to range sensors. Thus, efforts to increase the level of automation are continuing. Dense matching of image elements over two or more images is necessary to capture the geometric details in 3-D, but it is a challenging and often ill-posed problem. However, there are several proven strategies—for example, limiting the search to the epipolar line, working with hierarchies of image pyramids, matching features and edges first to provide better approximations for successive area-based matching (ABM), using a region-growing algorithm from initial seed points, computing intermediate depth maps, and using more than two images for higher reliability. Subpixel precision is an important factor, as pixel-level precision alone creates a noticeable systematic stepping (quantization error) that gives undesirable visual effects. But in spite of all the advances, some problems still remain and user interaction is still needed. A fully automated, precise, and reliable method adaptable to different image sets and scene contents is not available, in particular for convergent and wide baseline images. However, it has been proven that a successful image matcher should do the following: 1) use accurately calibrated cameras and images with strong geometric configuration, 2) use local and global image information to extract all the possible matching candidates and obtain global consistency among them, 3) use constraints to restrict the search space, 4) consider an estimated shape of the object as a priori information, and 5) employ strategies to monitor the matching results.

With this in mind, in this article we propose a multistage image-based modeling approach that requires only a limited amount of human interactivity and is capable of capturing the fine geometric details with similar accuracy as close-range active range sensors. It can also cope with wide baselines using several advancements over standard stereo matching techniques. Our approach is sequential, starting from a sparse basic segmented model created with a small number of interactively measured points. This model, specifically the equation of each surface, is then used as a guide to automatically add the fine details. The following three techniques are used, each where best suited, to retrieve the details: 1) for regularly shaped patches such as planes, cylinders, or quadrics, we apply a fast relative stereo matching technique; 2) for more complex or irregular segments with unknown shape, we use a global multi-image geometrically constrained technique; and 3) for segments unsuited for stereo matching, we employ depth from shading (DFS). The goal is not the development of a fully automated procedure for 3-D object reconstruction from image data (e.g., structure from motion) or a sparse stereo approach; rather, we aim at the digital reconstruction of detailed and accurate surfaces from calibrated and oriented images for the practical daily documentation and digital conservation of a wide variety of heritage objects (Figure 1).

AN OVERVIEW OF IMAGE MATCHING TECHNIQUES
Since a large body of work on image-based 3-D modeling and image matching exists, and the techniques are definitely scene and application dependent, we focus here on those approaches that were practically applied to cultural heritage objects and sites. We also report only on methods for creating detailed geometric models. Thus, image-based rendering (IBR), which skips geometric modeling; surveying and computer-aided design (CAD) techniques; and visual hull, factorization, and other approaches that yield mainly a sparse set of 3-D points will not be covered.

[FIG1] (a) Typical cultural heritage objects requiring (b) detailed and accurate 3-D models for documentation, conservation, analysis, restoration, or replica purposes.


The fully automated structure from motion 3-D modeling procedures widely reported in the vision community [1], [2] require very short intervals between consecutive images to guarantee constant illumination and scale between successive images. This large number of closely spaced images can be a problem in an archaeological area, in large heritage sites, or in areas with limited accessibility. The short baseline also results in large depth errors unless the points are tracked over a long image sequence. Errors of up to 5% (one part in 20) have been reported, limiting the use of these methods to simple visualization applications. On the other hand, wide baselines between images give rise to image-scale variations and occlusion problems. Different strategies have been proposed for stereo matching with wide baselines [3]–[5], but a complete accuracy analysis is still lacking.
Image matching is generally defined as the establishment of correspondences between two or more images to reconstruct surfaces in 3-D. In order to extract these correspondences, the primitives to be matched must be defined. Afterwards, a similarity measure is computed and evaluated between primitive pairs, and then a disparity map or 3-D point cloud can be generated. Image matching remains an active area of research even after four decades of activity. The main reason is the difficulty in finding a unique match (multiple possible correct matches) or any match at all (a point may be occluded in one image or look very different due to lighting and geometric variation), which makes it a primarily ill-posed problem. Ill-posed problems can be converted into well-posed problems by introducing constraints. In general, there are three forms of constraints—unary, binary, and N-ary. A unary constraint (like the epipolar or collinearity constraints [6] or the similarity constraint [7]) represents the similarity or likelihood of individual matches.
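To make the unary epipolar constraint concrete, the following minimal sketch (in Python with NumPy; the rectified-pair fundamental matrix below is an illustrative toy, not taken from the article) measures how far a candidate match lies from the epipolar line of a point:

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F @ x in the second image for a point x
    (homogeneous pixel coordinates) in the first image."""
    l = F @ x
    return l / np.linalg.norm(l[:2])  # normalize so point-line distance is in pixels

def epipolar_distance(F, x, x_prime):
    """Perpendicular distance (pixels) of candidate x' from the epipolar line of x."""
    return abs(epipolar_line(F, x) @ x_prime)

# Toy fundamental matrix of a rectified stereo pair: corresponding points
# share the same row, so epipolar lines are horizontal (y' = y).
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x = np.array([10.0, 25.0, 1.0])
print(epipolar_distance(F, x, np.array([40.0, 25.0, 1.0])))  # on the line: 0.0
print(epipolar_distance(F, x, np.array([40.0, 28.0, 1.0])))  # 3 pixels off the line
```

Restricting candidates to those within a small epipolar distance turns a 2-D search into a 1-D one.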
The surface smoothness and uniqueness constraints can be seen as binary constraints. N-ary constraints are usually represented with mutually topological relations among matches. Recent overviews on stereo matching can be found in [8] and [9], while [10] compared multi-image matching techniques. References [8] and [10] classified the different stereo and multiview matching algorithms according to six fundamental properties: the scene representation, photoconsistency measure, visibility model, shape prior, reconstruction algorithm, and initialization requirements. On the other hand, we consider the two main classes of matching primitives, i.e., image intensity patterns (windows composed of grey values around a point of interest) and features (edges and regions), which are then transformed into 3-D information through a mathematical model (e.g., the collinearity model or camera projection matrix). According to these primitives, the resulting matching algorithms are generally classified as ABM or feature-based matching (FBM).

ABM
ABM, also called signal-based matching, is the more traditional approach. It is justified by the continuity assumption, which asserts that, at the level of resolution where image matching is performed, most of the image window depicts a portion of a continuous and planar surface element. Therefore, adjacent pixels in the image window will generally represent contiguous points in object space. In ABM, each point to be matched is the center of a small window of pixels (patch) in a reference image (template), which is statistically compared with equally sized windows of pixels in another (target) image. The measure of match is either a difference metric that is minimized, such as the RMS difference, or, more commonly, a similarity measure that is maximized. ABM is usually based on local squared or rectangular windows. In its oldest form, area-based image matching was performed with cross-correlation, using the correlation coefficient as a similarity measure. Cross-correlation works fast and well if the patches contain enough signal without too much high-frequency content (noise) and if geometrical and radiometric distortions are minimal. To overcome these problems, image reshaping parameters and radiometric corrections were introduced, leading to the well-known nonlinear least squares matching (LSM) estimation procedure [6]. The location and shape of the matched window are estimated with respect to some initial values and computed until the grey-level differences between the deformed patch and the template reach a minimum. That is, the goal function to be minimized is the L2-norm of the residuals of the least squares estimation, although several investigations have shown that the L1-norm, or least absolute deviation (LAD), can improve the accuracy of the estimation in the presence of outlier pixels in the data. References [6], [11], and [12] introduced multiphoto geometrically constrained (MPGC) matching and the use of additional constraints in the image matching and surface reconstruction process. Recently, the MPGC-based framework was used in a reformulated version in [4].
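The correlation coefficient used in the classical form of ABM can be sketched in a few lines; this is a generic textbook illustration, not the authors' implementation:

```python
import numpy as np

def ncc(template, window):
    """Normalized cross-correlation coefficient between two equally sized
    grey-level patches; +1 means a perfect match, 0 means uncorrelated."""
    t = template - template.mean()
    w = window - window.mean()
    denom = np.sqrt((t ** 2).sum() * (w ** 2).sum())
    return float((t * w).sum() / denom)

rng = np.random.default_rng(0)
patch = rng.random((9, 9))
print(ncc(patch, patch))            # identical patch: 1.0 (up to rounding)
print(ncc(patch, 2.0 * patch + 5))  # NCC is invariant to gain and offset
```

The gain/offset invariance shown in the last line is exactly why the plain coefficient fails once geometric distortion, rather than a radiometric one, separates the two patches—the motivation for the reshaping parameters of LSM.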
ABM was also generalized from image to object space, introducing the concept of “groundel” or “surfel” [13]. ABM, especially the LSM method with its subpixel capability, has a very high accuracy potential (up to 1/50 pixel) if well-textured image patches are used. Disadvantages of ABM are the need for a small search range for successful matching, the large data volume that must be handled, and, in the case of LSM, the requirement of good initial values for the unknown parameters, although this is not the case for other techniques such as graph cuts [8]. Problems occur in areas with occlusions, in areas with a lack of texture or with repetitive texture, or if the surface does not correspond to the assumed model (e.g., planarity of the matched local surface patch).

FBM
FBM determines the image correspondence using image features. It comprises two stages: 1) the detection of interesting features and their attributes in all images, and 2) the determination of the corresponding features using particular similarity measures. The two stages are related to each other in the sense that the feature extraction and attribute computation must be such that the correspondence determination is easy, precise, and not sensitive to scale and


orientation. A good feature for image matching should be distinct with respect to its neighborhood, invariant with respect to geometric and radiometric influences, stable with respect to noise, and unique with respect to other features. There are three types of features:
1) Interest Points [14], [15]: Interest point detectors are generally divided into contour-based methods, which search for maximal curvature or inflexion points along the contour chains; signal-based methods, which analyze the image signal and derive a measure indicating the presence of an interest point; and methods based on template fitting, which try to fit the image signal to a parametric model of a specific type of interest point (e.g., a corner). The main properties of a point detector are: 1) accuracy, the ability to detect a pattern at its correct pixel location; 2) stability, the ability to detect the same feature after the image undergoes some geometrical transformation (e.g., rotation or scale) or illumination changes; 3) sensitivity, the ability to detect feature points in low-contrast conditions; and 4) controllability and speed, the number of parameters controlling the operator and the time required to identify features.
2) Edges [16]: The key to edge extraction is the intensity change, which is captured by the gradient of the image. Edge detectors usually follow the same steps: smoothing, applying edge enhancement filters, applying a threshold, and edge tracing. The most widely used methods then involve edgel linking and segmentation. The Canny detector [17] is probably the most widely used edge detector and is very suitable due to its performance and low sensitivity to parameter variation. Lines (edgels) provide more geometric information than single points and are also useful in the surface reconstruction (e.g., as breaklines) to avoid smoothing effects on the object edges.
3) Regions [18]: Regions are homogeneous areas of the images with intensity variations below a certain threshold. Image regions should be invariant under certain transformations. Under a generic camera movement (e.g., translation), the most common transformation is an affinity, but scale-invariant detectors have also been proposed. Generally, an interest point detector is used to localize the points, and afterwards an invariant region is extracted around each point.
Features are first extracted and afterwards associated with attributes (“descriptors”) to characterize and match them [15]. A typical strategy to match characterized features is the computation of the Euclidean or Mahalanobis distance between the descriptor elements. Larger (or global) features are called structures and are usually composed of different local features. Matching with global features is also referred to as relational or structural matching [19]. It establishes a correspondence from the primitives of one structural description to the primitives of a second structural description. Besides the attributes of the local features, relations between these features are introduced to characterize global features and establish the correspondence—for example, 1) geometric relations (e.g., the angle between two adjacent polygon sides or the minimum distance between two edges), 2) radiometric relations (e.g., the differences in average grey value or grey value variance between two adjacent regions), and 3) topologic relations (e.g., the notion that one feature is contained in another).
Since relational matching techniques use not only image features but also geometrical or topological relations among the features to determine the correspondence, the image matching tasks can be fully automated without any initial estimates, or with very coarse ones. Relational matching can be approached in many ways, relying on graph searching, energy minimization, or relaxation labeling techniques [20].
FBM is often used as an alternative to ABM or combined with it. Compared to ABM, FBM techniques are more flexible with respect to surface discontinuities, less sensitive to image noise, and require less accurate approximate values. The accuracy of FBM is limited by the accuracy of the feature extraction process. Also, because of the sparse and irregularly distributed nature of the extracted features, the matching results are in general sparse point clouds, and postprocessing procedures such as interpolation need to be performed.

[FIG2] The modeling pipeline for object reconstruction from images: camera calibration, tie point extraction, and registration of the (possibly widely separated) input images; interactive segmentation with shape properties and seed points to build the initial basic model; fully automatic 3-D reconstruction of the fine details with constrained dense matching and DFS; and editing, hole filling, and texturing of the final model for rendering and visualization.
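As an illustration of the signal-based interest point detectors discussed above, a minimal Harris-style corner response can be sketched as follows (a textbook formulation, not one of the cited operators):

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner measure det(M) - k*trace(M)^2 from image gradients,
    where M is the structure tensor aggregated here with a crude 3x3 box
    sum (for brevity, instead of a Gaussian window)."""
    gy, gx = np.gradient(img.astype(float))
    def box(a):
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))
    ixx, iyy, ixy = box(gx * gx), box(gy * gy), box(gx * gy)
    return ixx * iyy - ixy ** 2 - k * (ixx + iyy) ** 2

# A single bright square on a dark background: the response is positive at
# the square's corners, negative along its edges, and zero in flat areas.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
r = harris_response(img)
peak = np.unravel_index(np.argmax(r), r.shape)
print(peak)  # near one of the corners of the square
```

The positive/negative/zero trichotomy of the response is what lets a single threshold separate corners (good match candidates) from edges and flat regions.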


DETAILED SURFACE RECONSTRUCTION METHODOLOGY
Our approach (Figure 2) is a step-wise method that has been under constant development and improvement for many years. The advances came mainly from its use on several heritage objects and sites (Figure 1) with a high degree of complexity, each providing new challenges that required solutions. We assume that the cameras are precalibrated, the images are captured with the same camera settings as for the calibration, and the images are oriented with subpixel accuracy. The user must then segment the scene to remove unwanted regions, such as the background, and divide the object or site into regions to improve the matching and modeling process. Indeed, scene segmentation [21] reduces processing time and helps in the modeling step, regardless of the object size and complexity.

CREATING THE INITIAL MODEL
Modeling architectural structures or archaeological finds and sites requires a sparse model to be created first, with the fine geometric details added in a second step. In the literature, this two-step approach is quite common. The first step, interactive or automated, produces a basic model using assumptions on the object's shape or camera interior parameters, while the second step uses dense stereo matching to add details. For architectural structures, we create basic models of surface elements such as planar walls, quadrics, and cylindrical shapes like columns, arches, doors, or windows using an approach initially developed in [22]. It is based on photogrammetric bundle adjustment and uses knowledge about surface shapes, such as being planar, cylindrical, parallel, or symmetric. For archaeological objects, we define the raw geometry of the object using some seed points located in the main discontinuity areas.
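For a planar segment of the basic model, the surface equation can, for example, be recovered from a handful of measured seed points by a least squares fit. The following sketch is illustrative only (it is not the bundle adjustment approach of [22]) and uses the SVD:

```python
import numpy as np

def fit_plane(points):
    """Least squares plane through a set of 3-D points.
    Returns (unit normal n, offset d) for the plane n . x = d."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The plane normal is the direction of least variance: the right
    # singular vector belonging to the smallest singular value.
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]
    return n, float(n @ centroid)

# Noisy samples of the plane z = 2 (normal along +/- z, offset +/- 2).
rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, size=(50, 2))
pts = np.column_stack([xy, np.full(50, 2.0) + 1e-3 * rng.standard_normal(50)])
n, d = fit_plane(pts)
print(n, d)  # approximately (0, 0, +/-1) and +/-2
```

The resulting surface equation is exactly the kind of per-segment guide that the detail-matching stage described next relies on.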
3-D MODELING OF THE FINE GEOMETRIC DETAILS
Using the initial sparse model (namely, the surface equations) as a guide and knowing the camera calibration and orientation parameters, we developed an automated procedure to extract the fine details with high-resolution meshes and achieve accurate documentation with photo-realistic visualization. The following three techniques are used, each where best suited:
1) a relative stereo matching technique for patches with regular shape, fitting an implicit function (e.g., plane, cylinder, or quadric)
2) a multi-image matching technique for irregular patches with unknown approximate function, using seed points
3) DFS for patches unsuited for stereo or multi-image matching (e.g., untextured patches).

STEREO MATCHING
Stereo matching works best when sufficient texture variations or localized features are present on the surface. Therefore, we first analyze the intensity level of the template window to select the areas where stereo matching will be applied. This

includes the mean, standard deviation, and second derivative of the grey levels of the pixels in the window. If these are higher than preset thresholds, the stereo matching proceeds; otherwise, we consider the region too uniform for stereo matching and switch to DFS, which works best on smoothly shaded surfaces. The relative stereo matching approach reduces the matching problems by using the basic model to narrow the search. The procedure is as follows for each segment with a known fitting function:
■ A high-resolution approximate mesh of triangulated 3-D points, which can be as dense as one vertex per pixel, is placed automatically on each segment according to its fitting function.
■ The coordinates of the approximate mesh from the basic model are replaced with the final coordinates from the stereo matching. In effect, the technique computes only the correction to the points.
The stereo matching minimizes the normalized sum of squared differences between the template and the search window. The search is done along the epipolar line, limited to a disparity range computed from the basic model. The window in the search image is resampled to take into account the difference in orientation between the two images and the surface orientation of the basic model. This accounts for the geometric variations between the two images and gives accurate and reliable results.

MULTIPHOTO GEOMETRICALLY CONSTRAINED (MPGC) MATCHING
The stereo matching approach, although fast and effective for relatively flat surfaces, requires an approximate surface shape. However, for irregular surfaces like archaeological finds and sculptures, the approximate shape is unknown. Therefore, an extended, albeit slower, more global approach that does not require knowledge of an approximate surface has been developed.
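The model-constrained epipolar search of the stereo matching step above can be sketched for a rectified pair as follows. This is a deliberately simplified illustration: the disparity bounds stand in for the range predicted by the basic model, and the window resampling step is omitted:

```python
import numpy as np

def ncc(a, b):
    a = a - a.mean(); b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum() + 1e-12))

def match_along_epipolar(left, right, row, col, half=3, d_min=0, d_max=15):
    """Search a rectified pair along the epipolar line (same row),
    restricted to the disparity range [d_min, d_max]."""
    tpl = left[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_s = None, -2.0
    for d in range(d_min, d_max + 1):
        c = col - d  # candidate column in the right image
        if c - half < 0:
            break
        s = ncc(tpl, right[row - half:row + half + 1, c - half:c + half + 1])
        if s > best_s:
            best_d, best_s = d, s
    return best_d, best_s

# Synthetic rectified pair: the right image is the left shifted by 5 pixels.
rng = np.random.default_rng(2)
left = rng.random((40, 60))
right = np.roll(left, -5, axis=1)  # true disparity = 5
print(match_along_epipolar(left, right, row=20, col=30))  # (5, ~1.0)
```

Tightening `d_min`/`d_max` from the basic model is what makes the per-segment search both fast and robust against repetitive texture.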
It is based on nonlinear least squares estimation and simultaneously uses more than two images, increasing precision and reliability by matching each point in all the images in which it appears. The multi-image matching approach was originally developed for the processing of very high-resolution linear array images [23] and afterwards modified to accommodate any linear array sensor [24], such as those used for satellite imaging. It has now been extended to process other image data, such as traditional aerial photos or convergent close-range images. The multi-image approach, based on the MPGC framework of [6], [11], and [12], uses a coarse-to-fine hierarchical solution with an effective combination of several image matching algorithms and automatic quality control. The approach performs three mutually connected steps:
1) Image Preprocessing: The set of available images is processed with an adaptive smoothing filter [25] to reduce the noise level and possible radiometric problems, such as strongly bright and dark regions, and to optimize the images for subsequent feature extraction and image matching. An enhancement filter [26] is also applied to strongly enhance


and sharpen the already existing texture patterns. The filter adjusts brightness values in local areas so that the local mean and standard deviation match user-specified target values (Figure 3). Finally, image pyramids are generated to work with several versions of the image having progressively changing spatial resolutions.

[FIG3] Examples of image preprocessing with adaptive smoothing and enhancement filter in dark areas and untextured regions.

2) Multiple Primitive Multi-Image (MPM) Matching: Multiple matching primitives (interest points, edges, and grid points) are used. Interest points are suitable to generate accurate surface models, but they suffer from noise, occlusions, and discontinuities. Edges generate coarser but more stable models as they have higher semantic information and are more tolerant to noise. A regular image grid is used in addition to overcome problems in low-texture areas. The MPM matching performs three operations at each pyramid level: 1) point and edge extraction and matching, 2) integration of matching primitives, and 3) initial mesh generation. Feature points are interest points extracted with the Lue operator [27] and the dominant points of the edges (extracted with the Canny operator [17]), computed through a polygon approximation algorithm. Within the pyramid levels, the feature matching is performed with an extension of the standard cross-correlation [geometrically constrained cross-correlation (GC3)] technique. The multi-image matching is guided from object space, knowing an approximate surface model, and allows reconstruction of 3-D object coordinates from all available images simultaneously. Having a set of images, the central image is chosen as reference and the others serve as the search images. The normalized cross-correlation (NCC) coefficient is used as the similarity measure between the image windows in the reference and one of the search images. Compared to the traditional cross-correlation method, NCC in the GC3 algorithm is computed with respect to the height value Z in object space rather than the disparity in image space, so that the NCC functions of all individual stereo image pairs can be integrated in a single framework. Then, following [28], instead of computing the correct match of a point P by evaluating the individual NCC functions between the reference and search images, we define the sum of NCC (SNCC) for a point P, with respect to Z, finding the Z value which maximizes the SNCC function.

[FIG4] Multiphoto matching example, with the template image, the resampled image patches at the end of the least squares estimation, and the search images with the epipolar line.

For the edge matching, a preliminary list of


candidates is built up based on the similarity measures and some constraints (the epipolar geometry as well as the approximated digital surface model derived from the higher level of the image pyramid) to restrict the number of possible matches. Then the GC3 algorithm is applied, with further similarity measures computed from the geometric and region attributes of the edges and consistency checking to remove wrong correspondences. The consistency checking is based on the figural continuity constraint; it is performed in a local neighborhood along the edges and solved by a probability relaxation method.
3) Refined Matching: At the original image resolution level, an MPGC LSM (Figure 4) and least squares B-spline (LSB) snakes [29] can be used as an option to achieve potentially subpixel accuracy matches and to identify some inaccurate and possibly false matches. With the LSB snakes, the edges in object space are represented with linear B-spline functions whose parameters are directly estimated, together with the matching parameters, in the image spaces of multiple images. The surface derived from the previous MPM step provides good enough approximations for the two matching methods and increases the convergence rate.
The main characteristics of the multi-image-based matching procedure are:
■ Truly multiple image matching: A point is matched simultaneously in all the images where it is visible, and, exploiting the collinearity constraint, the 3-D coordinates are directly computed together with their accuracy values. The multiple solution and occlusion problems are reduced with the multi-image approach, removing ambiguities with the multiple epipolar line intersections.
■ Matching with multiple primitives: The method takes advantage of both area-based and feature-based matching techniques and uses both local and global image information. In particular, it combines an edge matching method with a point matching method through a probability-relaxation-based relational matching process (Figure 5). Edges generate coarser but more stable models, as they have higher semantic information and are more tolerant to image noise. Feature points are instead suitable to generate dense and accurate surface models, but they suffer from problems caused by image noise, low texture, occlusions, and discontinuities. Therefore, the combination of feature points and grid points is necessary, since the grid points can be used to fill gaps in areas of poor or no texture. Moreover, their combination draws a compromise between the optimal requirement for precise and reliable matching and the optimal requirement for point distribution.
■ Self-tuning matching parameters: These parameters are automatically determined by analyzing the results of the higher-level image pyramid matching and using them at the current pyramid level. These parameters include the size of the correlation window, the search distance, and the threshold values. The adaptive determination of the matching parameters results in a higher success rate and fewer mismatches.
■ High matching redundancy: By exploiting the multi-image concept, highly redundant matching results are obtained. This also allows automatic blunder detection, and mismatches can be detected and deleted through analysis and consistency checking within a small neighborhood.

[FIG5] (a) Original image (out of three) of a bronze low-relief, (b) extracted 3-D edges, and derived surface model, shown in (c) shaded and (d) color-code mode.

[FIG6] Model of a relief at a Dresden site—dense matching reconstruction on five images, shown as (a) textured, (b) color-coded, and (c) shaded 3-D model.

[FIG7] The 3-D documentation of marble bas reliefs using widely separated images, visualized as color-coded, shaded, and textured model.
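The GC3/SNCC idea of correlating with respect to the object-space height Z rather than the image-space disparity can be illustrated with a deliberately simplified 1-D sketch, in which a toy projection function stands in for the collinearity equations:

```python
import numpy as np

def ncc(a, b):
    a = a - a.mean(); b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum() + 1e-12))

def sncc(template, search_images, project, z_values, half=4):
    """Sum of NCC (SNCC) over all search images, evaluated per candidate
    height Z in object space; returns the Z maximizing the sum.
    `project(i, z)` maps a height to a pixel position in search image i
    (a stand-in for the collinearity equations)."""
    best_z, best_s = None, -np.inf
    for z in z_values:
        s = 0.0
        for i, img in enumerate(search_images):
            c = project(i, z)
            s += ncc(template, img[c - half:c + half + 1])
        if s > best_s:
            best_z, best_s = z, s
    return best_z, best_s

# Toy setup: a 1-D texture signal observed in three "search images", each a
# shifted copy; the shift grows linearly with the (unknown) height Z.
rng = np.random.default_rng(3)
signal = rng.random(200)
true_z = 12
baselines = [1, 2, 3]
images = [np.roll(signal, -b * true_z) for b in baselines]
template = signal[100 - 4:100 + 5]
z_hat, _ = sncc(template, images, lambda i, z: 100 - baselines[i] * z,
                range(0, 25))
print(z_hat)  # 12
```

Because every stereo pair votes for the same Z axis, a pair that is ambiguous or occluded is simply outvoted by the others, which is the reliability gain the multi-image approach exploits.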

[FIG8] The 3-D modeling of a church apse from three convergent images, shown as shaded, color-coded, and textured model.

DFS
DFS is applied where the grey-level variations are not adequate for stereo/multiphoto matching or for areas appearing in only a single image. Standard shape from shading techniques, which compute surface normals, have lacked success in real applications due to an ill-posed formulation that requires unrealistic assumptions, such as a camera looking orthogonally at a Lambertian surface with a single light source located at infinity [30]. Our approach computes the depth directly, rather than the surface normal. It is applied to a work image: a grey-level version of the original with some preprocessing, such as noise removal filtering and editing of unwanted shades. Using the known depth and grey level at 8–10 interactively determined points, we form a curve describing the relation between grey level and depth variation from the basic model. The curve intersects the grey-level axis at the average intensity value of the points actually falling on the basic model. By adjusting the curve, the results can be instantly reviewed. We then adjust the coordinates of the grid points on the surface of the basic model segment according to shading, using this curve.

EXAMPLES AND PERFORMANCE EVALUATION
We have performed different tests retrieving 3-D models from several cultural heritage data sets. We show how our approach and methodology

[FIG9] DFS results on stone etching/carving.
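The grey-level-to-depth curve at the heart of the DFS step can be sketched as a piecewise-linear interpolation through the interactively measured control points (the sample values below are hypothetical, chosen only to illustrate the idea):

```python
import numpy as np

def build_dfs_curve(grey_samples, depth_samples):
    """Piecewise-linear curve relating grey level to the depth offset from
    the basic model, built from a handful of interactively measured
    control points (the article uses 8-10)."""
    order = np.argsort(grey_samples)
    g = np.asarray(grey_samples, float)[order]
    d = np.asarray(depth_samples, float)[order]
    return lambda grey: np.interp(grey, g, d)

# Hypothetical control points: darker pixels are deeper (negative offset);
# the curve crosses zero at the average grey value of points lying on the
# basic model (here assumed to be 128).
curve = build_dfs_curve([30, 60, 90, 128, 170, 210, 240],
                        [-2.0, -1.2, -0.6, 0.0, 0.4, 0.8, 1.0])
patch = np.array([[128, 30], [240, 128]], dtype=float)
print(curve(patch))  # per-pixel depth corrections for the grid points
```

Applying `curve` to a whole grey-level patch yields, in one vectorized step, the correction added to each grid point of the basic model segment.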


can accurately document heritages of different size, shape complexity, and types of detail (Figures 6–9). The different datasets contain both short baseline and wide baseline images, and textured and untextured surfaces.

We also performed a quantitative evaluation of the accuracy of our matching approach using several test objects, one of which is shown in Figure 10(a), in a controlled lab environment. The lab allowed us to compare the results with ground truth under the same measurement conditions. For ground truth, the objects were scanned with two highly accurate close-range laser scanners: the Surphaser HS25X (0.48 mm accuracy, phase-shift measurement principle) and the ShapeGrabber 502 (0.42 mm accuracy, triangulation based). The same objects were then modeled with both matching techniques described previously. To compare these models with ground truth data, we used PolyWorks Inspector software. Color-coded results of the comparison are illustrated in Figure 10(b). The standard deviation of the differences between the scanned model and the image-based model was, on average, 0.54 mm (Surphaser) and 0.52 mm (ShapeGrabber) for all data sets. The error distribution shows that the largest errors are consistently near boundaries and at sharp surface gradients. All the different methodologies, including the laser scanners used for ground truth, also have problems in those areas.

Another accuracy and performance evaluation test was performed with a small statue, about 15 cm high. The image-matched model, generated using 25 images, was compared with range data acquired with a Breuckmann stripe projection system (Opto-Top SE, feature accuracy of 50 μm). The color-coded comparison of the two results gives a standard deviation of 0.17 mm (Figure 11).

[FIG10] (a) Lab test object with the derived image-based model and (b) color-coded difference between scanned model and image-matched model.

CONCLUSIONS
Three-dimensional image-based modeling of heritages is a very interesting topic with many possible applications. In this article, we reported our methodology for detailed surface measurement and reconstruction. We presented a two-step procedure able to model complex objects of any size or shape and accurately retrieve all the fine geometric details from calibrated and oriented images. The fast stereo pair approach constrains the search of correspondences along the epipolar line, while the 3-D coordinates of points and matched edges are computed in a second phase using rejection criteria for the forward ray intersection. On the other hand, the multi-image approach is more reliable and precise but requires fairly accurate image orientation parameters to exploit the collinearity constraint within the LSM estimation. It uses points and edges to retrieve all the surface details, and it can be applied to short or wide baseline images. It can cope with scale and other geometry changes, different illumination conditions or repeated patterns, and occlusion, thanks to the improved reliability of multi-image matching. The accuracy evaluations (relative accuracy 1:1,000) were based on i) the standard deviations provided by the registration of the ground truth data with the photogrammetrically reconstructed surface and ii) the graphical display (color-coded) of the differences. A standard methodology for the performance evaluation of the results should be developed, like those available for traditional surveying or CMM.
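The forward ray intersection with a rejection criterion, as used in the stereo pair approach above, can be sketched as a linear (DLT-style) triangulation followed by a reprojection-error check. This is an illustrative sketch, not the authors' implementation: the projection matrices, the function name, and the one-pixel threshold are assumptions.

```python
import numpy as np

def triangulate(P1, P2, x1, x2, max_reproj_err=1.0):
    """Forward ray intersection: recover a 3-D point from a matched
    image pair by linear (DLT-style) triangulation, then reject the
    match if its reprojection error exceeds a pixel threshold."""
    # Homogeneous system A X = 0 built from x ~ P X in both images.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    X = np.linalg.svd(A)[2][-1]   # null-space (least-squares) solution
    X = X / X[3]                  # de-homogenize
    # Rejection criterion: reproject into both images and check error.
    for P, x in ((P1, x1), (P2, x2)):
        h = P @ X
        if np.linalg.norm(h[:2] / h[2] - x) > max_reproj_err:
            return None           # unreliable intersection: reject
    return X[:3]
```

The rejection step discards correspondences whose rays do not (nearly) intersect, which is how gross matching errors are filtered before surface reconstruction.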

[FIG11] A small statue (approximately 15 cm high and 9 cm wide) modeled with 25 images (12 megapixels each, taken with a 28-mm objective). A closer view of the reconstructed upper details is also shown, together with the 3-D comparison results (std = 0.17 mm).
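The standard deviations reported for Figures 10 and 11 come from a cloud-to-cloud comparison, here produced with PolyWorks Inspector. A minimal sketch of the underlying idea, under our own simplifying assumptions (unsigned distances, brute-force nearest neighbors; real tools use signed point-to-mesh distances and spatial indexing):

```python
import numpy as np

def cloud_deviation(test_pts, ref_pts):
    """Unsigned cloud-to-cloud deviation: distance from each vertex of
    the image-based model (test_pts, N x 3) to its nearest neighbor in
    the ground-truth scan (ref_pts, M x 3). Brute force O(N*M)."""
    # Pairwise distance matrix (N, M) via broadcasting.
    d = np.linalg.norm(test_pts[:, None, :] - ref_pts[None, :, :], axis=2)
    nearest = d.min(axis=1)       # one deviation per test point
    return nearest.mean(), nearest.std()
```

Color-coding each vertex by its `nearest` distance gives difference maps like the one in Figure 10(b); the returned standard deviation is the summary statistic quoted in the text.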


Apart from standards, comparative data and best practices are also needed to show not only the advantages but also the limitations of systems and algorithms. A good example is provided by the German VDI/VDE 2634 guideline for close-range optical 3-D vision systems (in particular for full-frame range cameras and single scans), while the American ASTM E57 committee is trying to develop standards for 3-D imaging systems. Nevertheless, our accuracy evaluations demonstrated the potential of the photogrammetric approach, and in particular of the proposed matching strategy, for the documentation of cultural heritage. Indeed, photogrammetry has all the potential to retrieve the same results (details) as active range sensors, but in a cheaper, faster, more portable, and simpler manner. Fully automated approaches that retrieve nice-looking 3-D data without the fine details are not of practical use in the accurate daily documentation and digital conservation of heritage objects. We believe that site managers, archaeologists, restorers, conservators, and the entire heritage community need simple and cost-effective methods to be able to accurately record and document objects and sites.

AUTHORS
Fabio Remondino ([email protected]) is a scientific researcher at the Institute of Geodesy and Photogrammetry (Chair of Photogrammetry and Remote Sensing) of ETH Zurich, Switzerland, and the Centre for Scientific and Technological Research of the B. Kessler Foundation in Trento, Italy. He received his master's degree in environmental engineering from the Polytechnic of Milan, Italy, and a Ph.D. in image-based modeling from ETH Zurich, Switzerland. He is the cochair of the International Society of Photogrammetry and Remote Sensing (ISPRS) working group on scene modeling and virtual reality.

Sabry El-Hakim ([email protected]) is a principal research officer at the Visual Information Technology Group in the Institute of Information Technology at the National Research Council (NRC) of Canada. He received his M.Sc. and Ph.D. degrees in photogrammetry from the University of New Brunswick, Canada, after which he joined the NRC. He is chair of the International Society of Photogrammetry and Remote Sensing (ISPRS) working group on scene modeling and virtual reality and a Fellow of SPIE. His research interests include image-based modeling, multisensor data integration, and virtual reality.

Armin Gruen ([email protected]) is professor and head of the Chair of Photogrammetry and Remote Sensing at the Institute of Geodesy and Photogrammetry of ETH Zurich, Switzerland. He graduated in geodetic sciences and obtained his doctorate in photogrammetry from TU Munich, Germany, in 1974. From 1981 to 1984, he was an associate professor with the Department of Geodetic Science and Surveying, Ohio State University, Columbus. He has published more than 375 articles and papers and edited and coedited 21 books and conference proceedings.

Li Zhang ([email protected]) received his master's degree in photogrammetry from Wuhan University, P.R. China, and a

Ph.D. from the Institute of Geodesy and Photogrammetry, ETH Zurich, Switzerland. He is now working at the Chinese Academy of Surveying and Mapping, Beijing, China.

