Incremental Structure from Motion for Large Ordered and Unordered Sets of Images
A Dissertation Presented to the Faculty of Electrical Engineering of the Czech Technical University in Prague in Partial Fulfillment of the Requirements for the Ph.D. Degree in Study Programme No. P2612 - Electrotechnics and Informatics, branch No. 3902V035 - Artificial Intelligence and Biocybernetics, by
Michal Havlena
Prague, June 2012
Thesis Advisor: Ing. Tomáš Pajdla, Ph.D.
Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Karlovo náměstí 13, 121 35 Prague 2, Czech Republic
fax: +420 224 357 385, phone: +420 224 357 465
http://cmp.felk.cvut.cz
Abstract

Structure from Motion (SfM) computation from large unordered image sets is dominated by image feature matching. This thesis proposes avoiding exhaustive pairwise image matching by sampling pairs of images and estimating visual overlap using the detected occurrences of visual words. Although this technique alone leads to a significant speedup of SfM computation, the efficiency of the reconstruction from redundant image sets, e.g. community image sets of cities with landmarks, can be further improved by the proposed image set reduction technique, which selects a small subset of the input images by computing the approximate minimum connected dominating set of a graph expressing image similarity. The efficiency of SfM computation can also be disrupted by spending too much time on a few difficult matching problems instead of exploring other, easier options first. We propose using a priority queue for interleaving different SfM tasks, as this facilitates obtaining reasonable reconstructions in limited time. The priorities of the individual tasks are again set according to the estimated visual overlap, but they are also influenced by the history of the computation. Image similarity estimated from co-occurring visual words proves its usability even for ordered image sets in the proposed sequence bridging technique. Geometrically verified loop candidates are added to the model as new constraints for bundle adjustment, which closes the detected loops as it enforces global consistency of camera poses and 3D structure in the sequence.

Several technical improvements are also proposed. First, triplets of images are used as the seeds of the reconstruction because 3D points verified in three views are more likely to be correct. Secondly, we demonstrate that the amount of translation w.r.t. the scene can be reliably measured for general as well as planar scenes by the dominant apical angle (DAA). By selecting only image pairs which have a sufficient DAA, one is able to select keyframes from image sequences and high quality seeds when reconstructing from unordered image data. Finally, a cone test is used instead of the widely used reprojection error to verify 2D-3D matches, which allows for accepting a correct match even if the currently estimated 3D point location is incorrect.

The proposed methods are validated by several experiments using both ordered and unordered image sets comprising thousands of images. City modeling is performed from both fish-eye lens images and equirectangular panoramas, and successful pedestrian detection is demonstrated on images generated using the proposed non-central cylindrical projection once the images are stabilized w.r.t. the ground plane using the estimated camera poses.
Acknowledgements

I would like to express my thanks to my colleagues at CMP who I had the pleasure of working with, especially to Akihiko Torii, who was my closest collaborator in the topics covered by this thesis, and to Jan Heller, who implemented the web-based interface to our SfM methods. I am grateful to my advisor, Tomáš Pajdla, for fruitful discussions and brilliant ideas which led me in my research towards the fulfillment of the Ph.D. degree. I would also like to thank my family and my friends for all their support that made it possible for me to finish this thesis.
I gratefully acknowledge EC project FP6-IST-027787 DIRAC, which supported my research. Partial support of Czech Science Foundation under Project 201/07/1136, Grant Agency of the Czech Technical University under project CTU0705913, and EC projects FP7-SPA-218814 PRoVisG and FP7-ICT-247525 HUMAVIPS is also acknowledged.
Contents

1 Introduction

2 State of the Art
  2.1 Ordered Image Set Processing
    2.1.1 City Reconstruction from Image Sequences
    2.1.2 Image Stabilization for Visual Object Recognition
  2.2 Unordered Image Set Processing
    2.2.1 Avoiding Exhaustive Pairwise Image Matching
    2.2.2 Reducing the Size of the Image Set
    2.2.3 Prioritizing Promising SfM Tasks

3 Contribution of the Thesis
  3.1 Thesis Outline

4 Fast City Modeling from Omnidirectional Stereo Rig
  4.1 The SfM Framework for an Omnidirectional Stereo Rig
    4.1.1 Omnidirectional Camera Calibration
    4.1.2 Features
    4.1.3 Initialization by 2D tracking
    4.1.4 Expansion of the Euclidean Reconstruction
    4.1.5 Bundle Adjustment
  4.2 Experimental Results
    4.2.1 Omnidirectional Stereo Rig
    4.2.2 SfM with the Stereo Rig Rigidity Constraint
    4.2.3 SfM without the Stereo Rig Rigidity Constraint
    4.2.4 Performance

5 City Modeling from Google Street View Equirectangular Images
  5.1 Sequential SfM from Equirectangular Images
    5.1.1 Calibration
    5.1.2 Generating Tracks by Concatenating Pairwise Matches
    5.1.3 Robust Initial Camera Pose Estimation
    5.1.4 Bundle Adjustment Enforcing Global Camera Pose Consistency
  5.2 Experimental Results

6 Omnidirectional Image Sequence Stabilization by SfM
  6.1 Robust Estimation of Relative Camera Motion
    6.1.1 Camera Calibration
    6.1.2 Detecting Features and Constructing Tentative Matches
    6.1.3 Epipolar Geometry Computation by RANSAC + Soft Voting
  6.2 Measuring the Amount of Camera Translation by DAA
    6.2.1 Too Small Motion Detection on Simulated Data
  6.3 Sequential Wide Baseline Structure from Motion
  6.4 Omnidirectional Image Stabilization
    6.4.1 Image Rectification Using Camera Pose and Trajectory
    6.4.2 Central and Non-central Cylindrical Image Generation
  6.5 Experimental Results
    6.5.1 Omnidirectional Image Sequences
    6.5.2 Unordered Omnidirectional Images
    6.5.3 Details of Experimental Settings and Computations

7 Modeling from Unordered Image Sets using Visual Indexing
  7.1 Randomized Structure from Motion
    7.1.1 Computing Image Similarity Matrix
    7.1.2 Constructing Atomic 3D Models from Camera Triplets
    7.1.3 Merging Partial Reconstructions
    7.1.4 Gluing Single Cameras to the Best Partial Reconstruction
  7.2 Experimental Results

8 Efficient Structure from Motion by Graph Optimization
  8.1 Image Set Reduction
    8.1.1 Image Similarity
    8.1.2 Minimum Connected Dominating Set
  8.2 3D Model Construction Using Tasks Ordered by a Priority Queue
    8.2.1 Creation of Atomic 3D Models
    8.2.2 Model Growing by Connecting Images
    8.2.3 Merging Overlapping Partial Models
  8.3 Experimental Results
    8.3.1 Image Set Reduction
    8.3.2 Sparse 3D Model Reconstruction

9 Conclusions

Bibliography
Keywords: Structure from Motion, Omnidirectional Vision, City Modeling
List of Algorithms

1 Construction of the initial camera poses by chaining EGs
2 Robust estimation of relative camera motion
3 Measuring camera motion by computing the dominant apical angle
4 Keyframe selection
5 Approximate minimum CDS computation [37]
Notation

h                                    scalar value
h                                    column vector
H                                    matrix
H                                    set, list
H                                    entity (image, camera, descriptor, model, ...)
h_i                                  scalar h with label/index i
h_i                                  vector h with label/index i
H_i                                  matrix H with label/index i
H_i                                  set/list H with label/index i
H_i                                  entity H with label/index i
h(.)                                 functions which map domain "." to scalars
h(.)                                 functions which map domain "." to vectors
H(i, j)                              [i, j]-th element of matrix H
h ∈ H                                h is an element of set/list H
0 = [0, ..., 0]^T                    vector of zeros
I = diag([1, ..., 1]^T)              identity matrix
|H|                                  number of elements of set/list H
‖h‖                                  Euclidean (L2) norm of vector h
[h]_× = [[0, −h_3, h_2], [h_3, 0, −h_1], [−h_2, h_1, 0]]   cross product matrix of vector h

Commonly used symbols and abbreviations

SfM           Structure from Motion
MSER          Maximally Stable Extremal Regions
SIFT          Scale-Invariant Feature Transform
SURF          Speeded Up Robust Features
LAF           Local Affine Frames
FLANN         Fast Library for Approximate Nearest Neighbors
RANSAC        Random Sample Consensus
PROSAC        Progressive Sample Consensus
EG            Epipolar Geometry
DAA           Dominant Apical Angle
LP            Linear Programming
SDP           Semidefinite Programming
CDS           Connected Dominating Set
I             image
X             3D point
u             image point (projection of a 3D point)
R             camera rotation (absolute or relative)
t             camera translation (absolute or relative)
P = [R | t]   calibrated camera projection matrix (absolute R, t)
E = [t]_× R   essential matrix (relative R, t)
τ             apical angle
M             similarity matrix
1 Introduction
People have been trying to develop efficient methods for measuring and modeling the world around them for ages. Photogrammetry, the ancestor of current computer vision methods, started to develop in the middle of the 19th century when the first focal-plane cameras appeared. Now, with the massive spread of digital cameras and photo sharing sites like Facebook and Flickr [107], the scalability of the scene modeling methods has become the main issue. In this thesis, we deal with the problem of modeling large, predominantly static scenes from ordered, i.e. sequential, and unordered, i.e. randomly sorted, image sets comprising thousands of images when no further a priori information about the captured scene is available, the only requirement being the presence of visually distinctive structure in the images. To solve such a task, we build upon the foundations of multiple view geometry [39], namely Structure from Motion.

The goal of Structure from Motion (SfM) is to recover both the positions of the 3D points of the scene (structure) and the unknown poses of the cameras capturing the scene (motion, external camera calibration). Actually, many applications of SfM treat the estimated "sparse" scene 3D points just as a by-product of camera pose computation because, knowing the camera poses, one can use multi-view stereo 3D surface reconstruction methods [30, 46] to construct compelling "dense" 3D models of the scene, see Figure 1.1. The camera poses themselves can also be successfully used for visual odometry computation on movable platforms endowed with cameras [96], e.g. robots and cars. On the other hand, several applications do make use of the reconstructed sparse 3D point cloud, e.g. robot localization w.r.t. a pre-computed model of the scene [41] or simple obstacle avoidance in robotics.

When there are just two images captured, the geometrical situation can be modeled either by a homography, when the camera undergoes a pure rotation or observes a single plane in the scene, or by epipolar geometry, when the movement of the camera is general and an articulated scene is observed, see Figure 1.2. Measurements in the images, namely the image pixel coordinates of the corresponding projections of unknown scene 3D points, can be used to form a system of equations according to the geometrical constraints of the appropriate model. Solving this system by minimizing the reprojection or epipolar error of the correspondences measured in the images yields the parameters of the model. In the case of epipolar geometry, scene 3D points can be triangulated from the corresponding rays [39]; in the case of homography, no 3D structure can be reconstructed because the depth of the scene is not perceived, which is, on the other hand, appreciated when creating single view-point image panoramas.
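To make the two-view case concrete, the following is a minimal sketch of calibrated two-view geometry estimation and triangulation using OpenCV and NumPy. It illustrates the textbook procedure for perspective cameras [39] rather than the omnidirectional pipeline developed in this thesis; the RANSAC threshold and confidence values are illustrative assumptions.

```python
# Minimal two-view sketch (assumes calibrated perspective cameras and OpenCV):
# estimate the essential matrix from tentative matches, recover the relative
# pose, and triangulate the inlier correspondences.
import cv2
import numpy as np

def two_view_reconstruction(pts1, pts2, K):
    """pts1, pts2: Nx2 float arrays of corresponding pixels, K: 3x3 calibration."""
    # Epipolar geometry from tentative matches; RANSAC rejects outliers.
    E, inl = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                  prob=0.999, threshold=1.0)
    # Decompose E into a relative rotation R and translation t (up to scale).
    _, R, t, pose_inl = cv2.recoverPose(E, pts1, pts2, K, mask=inl)
    # Triangulate inliers from the two calibrated camera matrices.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    m = pose_inl.ravel().astype(bool)
    X_h = cv2.triangulatePoints(P1, P2, pts1[m].T, pts2[m].T)
    X = (X_h[:3] / X_h[3]).T          # Euclidean 3D points
    return R, t, X
```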
Figure 1.1: (a) Sample images from an unordered image set comprising 316 images of Marseille Centre de la Vieille Charité. (b) External camera calibration and a sparse 3D point cloud obtained using method [91]. (c) 3D surface reconstruction computed by method [46] once external camera calibration was known.

As the systems of equations become more complicated for the cases where more than two cameras are involved, there is no closed-form solution for the SfM problem [76]. Furthermore, the problem is highly non-linear and the search for the optimal model parameters by minimizing the reprojection error is likely to get stuck in a local minimum. There are two major approaches to performing reconstructions from image sets comprising more than two images, the "linear" and the "incremental" methods.

The "linear" methods relax the problem by using the L∞ instead of the traditional L2 norm for measuring the error. Then, all the equations formed from image measurements can be put together in a large linear system and all camera and 3D point positions can be solved for once the camera rotations are known [60]. The obvious drawback of these methods is high sensitivity to noise in the measurements due to the relaxation.

The "incremental" methods use simple components such as epipolar geometry computation for a pair of images, 3D point triangulation, and camera resectioning (estimation of camera pose w.r.t. already triangulated 3D points) to build the resulting model incrementally, starting from one or more seed reconstructions and connecting additional cameras until the whole scene is reconstructed [90], all interleaved with the necessary non-linear optimization improving the quality of both camera poses and 3D point positions [56]. The methods described in this thesis fall into the group of the incremental methods. The art of such methods resides in the design of a computational "pipeline" which combines the aforementioned components in order to create a reconstruction procedure which is both robust and efficient for the input image data.

There is an additional issue connected with Structure from Motion computation which has been concealed from the reader in the previous lines. It is internal camera calibration transforming the directions of the rays cast by the camera to image pixel coordinates. For perspective cameras, this transformation is fully described by the five non-zero entries of the matrix K [39] and it is possible to combine the estimation of internal and external camera calibration in a single procedure. Such combined methods are sensitive to various degeneracies of the scene and specific actions have to be taken in order to prevent failures [18]. Therefore, most of the recent methods assume internally calibrated cameras or use a simplified internal calibration model, e.g. unknown focal length and optional radial distortion incorporated into the non-linear optimization loops [90].

The assumption of having internally calibrated cameras is also advantageous when non-perspective cameras are used. The incorporation of internal camera calibration estimation into SfM computation from such cameras is often much more complicated as the ray-to-pixel transformation cannot be expressed by matrix multiplication [67]. Due to these reasons, the methods described in this thesis assume internally calibrated cameras and are applicable to both perspective and non-perspective central cameras in return.

Figure 1.2: Epipolar constraint. Scene 3D point X is projected to the two images, giving rise to a correspondence (u1, u2). The camera projection centers C1 and C2 and the 3D point X form an epipolar plane which constrains the positions of u1 and u2 to the corresponding epipolar lines. Epipolar lines of different scene points intersect in the epipoles e1 and e2 in the first and second image, respectively.
2 State of the Art
Next, we summarize the state of the art of large scale incremental SfM methods. Note that the complexity of the task is quite different for ordered and unordered image sets, as image order gives a clue as to which pairs of images should have overlapping fields of view and are therefore suitable for processing. This information is completely missing when the image set is unordered, and one has to take additional measures to avoid the costly processing of all possible image pairs, which would in most cases merely reveal that the images have nothing in common. On the other hand, the methods for fast selection of promising image pairs can also be used for ordered image sets to improve the consistency of the resulting models via loop closing once some parts of the scene are revisited in the sequence.
2.1 Ordered Image Set Processing

First, we review the work done in SfM from image sequences. The source of an ordered image set can be either a video sequence or a sequence of images, the only requirement being the fact that consecutive images do have overlapping fields of view.
2.1.1 City Reconstruction from Image Sequences

3D scene modeling from images is an important problem of computer vision and photogrammetry. Despite the large progress made recently in understanding the key problems of geometry [39], optimization [101], and related algebra [73], the design of systems working on a large number of images is still an interesting engineering problem. Nevertheless, working or partially working solutions have already been introduced for some applications. For instance, the Boujou system [1] is capable of reconstructing the camera trajectory from a sequence containing several thousands of perspective images when the image sequences are acquired in a limited space and the camera does not make sharp turns.

City reconstruction was first addressed using aerial images [36, 9, 38, 58, 104, 105], which allowed reconstructing large areas from a small number of images. The resulting models, however, often lacked visual realism when viewed from the ground level since it was impossible to texture the facades of the buildings, an exception being the approach recently introduced by C3 Technologies and Saab AB. The declassified missile targeting technology, which consists of rapid 3D mapping and realistic rendering, has been used to construct and generate views of detailed 3D models covering whole cities [14].
Figure 2.1: The final 3D surface model output from a city modeling system using 8 perspective cameras. The estimation of geo-registered camera poses for each frame started by tracking 2D features in consecutive frames using a hierarchical KLT tracker. These 2D-2D matches were used to establish a Euclidean space for the cameras and to triangulate 3D points using the computed camera poses. Image courtesy of [3].

Alternatively, survey vehicles equipped with laser scanners and cameras were used to gather 3D depths and textures at ground level [28, 29, 92, 95]. These systems gave nice and accurate 3D models in some situations but they were complicated and expensive. In this thesis, we focus only on systems working purely with ground-level images.

Dense Image Sequences. A city modeling system working on dense image sequences acquired simultaneously by 8 perspective cameras has been designed in [3]. This can be considered a step towards using omnidirectional vision as 8 cameras allow for a much wider field of view. The off-line image processing entailed a sparse reconstruction, during which the geo-registered poses of the cameras were estimated, and a dense reconstruction, when a texture-mapped 3D model of the urban scene was computed from the video data and the results of the sparse step, see Figure 2.1.

Recently, a framework for city modeling from image sequences working in real-time has been developed in [20]. It uses SfM to reconstruct camera trajectories and 3D points in the scene, fast dense image matching (assuming that there is a single gravity vector in the scene and all the building facades are ruled surfaces parallel to it), and real-time texture mapping to generate visually correct models from a very large number of images. The system gives good results but two major problems have been reported. First, cars parked along streets were not correctly reconstructed since they did not lie in the ruled surfaces used for scene representation. This problem has been solved by recognizing car locations and replacing them by corresponding computer generated models [21]. Secondly, 3D reconstruction could not survive sharp camera turns when a large part of the scene moved away from the limited view field of the cameras. The second problem can be solved by using cameras with a larger field of view, as shown in Chapter 4.
Sparse Image Sequences. Short baseline SfM using simple image features [20], which performs real-time detection and matching, recovers camera poses and trajectory sufficiently well when all camera motions between consecutive frames in the sequence are small. On the other hand, wide baseline SfM based methods, which use richer features such as MSER [62], Laplacian-Affine, Hessian-Affine [69], SIFT [57], and SURF [7], are capable of producing feasible tentative matches under large changes of visual appearance between images induced by rapid changes of camera pose and illumination. Work [33] presented SfM based on wide baseline matching of SIFT features using a single omnidirectional camera and demonstrated the performance in indoor environments. Recently, the Google Street View Pittsburgh Research Data Set comprising thousands of equirectangular city panoramas has been released by Google. Work [66] demonstrated 3D modeling from perspective images exported from Google Street View using piecewise planar structure constraints. Another recent related work [96] demonstrated the performance of SfM which employs guided matching using epipolar geometries computed in previous frames, and robust camera trajectory estimation by computing camera orientations and positions individually for the calibrated perspective images acquired by the Point Grey Ladybug Spherical Digital Video Camera System [82]. We show a large scale sparse 3D reconstruction using the original equirectangular panoramic images in Chapter 5.

Loop Closing. An inevitable problem of sequential SfM is the accumulation of drift errors while proceeding along the trajectory. Loop closing [49, 84] is essentially capable of removing the drift errors since it brings global consistency of camera poses and 3D structures by providing additional constraints for the final refinement accomplished by bundle adjustment. In [49], loop closing is achieved by merging partial reconstructions of overlapping sequences which are extracted using an image similarity matrix [87, 50]. Work [84] finds loop endpoints by using the image similarity matrix and verifies the loops by computing the rotation transform between the pairs of origins and endpoints under the assumption that the positions of the origin and the endpoint of each loop coincide. Furthermore, they constrain the camera motions to a plane to reduce the number of parameters in bundle adjustment. Unlike [84], we aim at proposing a pipeline which recovers camera poses in 3D and tests the loops by solving camera resectioning [74] in order to accomplish large scale 3D modeling of cities, see Chapter 5.
2.1.2 Image Stabilization for Visual Object Recognition

Image stabilization using camera poses and the trajectory estimated by reliable Structure from Motion (SfM) plays an important role in 3D reconstruction [1, 39, 3, 20, 23, 106], self localization [33], and reducing the number of false alarms in detection and recognition of pedestrians, cars, and other objects in video sequences [44, 51, 52, 99].

The state of the art wide baseline SfM methods often work with perspective cameras because of the simplicity of their projection models and the ease of their calibration. On the other hand, due to the limited field of view of perspective cameras, occlusions and
sharp camera turns may cause consecutive frames to look completely different when the baseline becomes longer or the change of the view direction becomes larger. This makes image feature matching very difficult (or even impossible) and camera pose and trajectory estimation fails under such conditions. As stated earlier, these problems can be avoided if the SfM method uses omnidirectional cameras, e.g. fish-eye lens convertors [67], catadioptric cameras [32, 67], or compound cameras [84, 96]. A large field of view also facilitates the analysis of activities happening in the scene since moving objects can be tracked for longer time periods [52]. The most related SfM approach [96] employs guided matching using epipolar geometries computed in previous frames and estimates the camera trajectory robustly by computing camera orientations and positions individually. The performance of their SfM is demonstrated on sufficiently dense image sequences acquired by a car-mounted Ladybug 2 spherical camera [82].

Robust Estimation of Relative Camera Poses. The state of the art technique for finding relative camera poses from image matches first establishes tentative matches by pairing image points with mutually similar features and then uses RANSAC [26, 39, 16] to look for a large subset of the set of tentative matches which is, within a predefined threshold ε, consistent with an epipolar geometry (EG) [39]. Unfortunately, this strategy does not always recover the epipolar geometry generated by the actual camera motion, which has been observed in [54, 75, 100]. It has been demonstrated in [16] that ordering the tentative matches by their similarity may help to reduce the number of samples in RANSAC. A PROSAC sampling strategy has been suggested which allows sampling from the list of tentative matches ordered ascendingly by the distance of their descriptors. The promising samples are drawn first, which often leads to hitting a sufficiently large configuration of good matches early.

Often, there are several models which are supported by a large number of matches. Then, the chance that the correct model, even if it has the largest support, will be found by running a single RANSAC is small. Work [54] suggested generating models by randomized sampling as in RANSAC but using soft (kernel) voting for a physical parameter, the radial distortion coefficient in that case, instead of looking for the maximal support. The best model is then selected as the one with the parameter closest to the maximum in the accumulator space. This strategy works when the correct, or almost correct, models provide consistent values of the parameter while the incorrect models with high support generate different values. In Chapter 6, as in [75], we show that this strategy also works when used for voting in the space of motion directions.

To illustrate the problem, we shall now discuss two interesting examples of camera motions which have a gradually increasing level of difficulty. Figure 2.2(a) shows an easy pair of images which can be solved by a standard RANSAC estimation [39]. 57%, i.e. 1,400, of the tentative matches are consistent with the true motion. Figure 2.2(c) shows that there is a dominant peak in the data likelihood p(M|e) of the matches given the motion direction [75], meaning that there is only one motion direction which explains a large number of matches. Figure 2.2(b) shows the voting space for the
8
2.1 Ordered Image Set Processing
motion direction in the first image generated by 50 soft votes cast by the result of a 500-sample PROSAC, visualized on the image plane (top) and as a 3D plot (bottom). White represents a large number of votes. The peak corresponds to the green marker.

Figure 2.2: Easy camera motions. (a) The first (top) and the second (bottom) image. The red ◦ and the green marker show the true epipole and the epipoles computed by soft voting for the position of the epipole, respectively. Small dots show the matches supporting the green marker. (b) Voting space for the motion direction in the first image generated by 50 soft votes cast by the result of a 500-sample PROSAC, visualized on the image plane (top) and as a 3D plot (bottom). White color corresponds to a large number of votes. The peak corresponds to the green marker. (c) The maximal support for every possible epipole (i.e. CIF image from [75]). White color corresponds to high support. The image space has been uniformly sampled by 10,000 epipoles and the size of the support of the best model found by the 500-sample PROSAC has been recorded for each epipole.

Figure 2.3(a) shows a difficult pair of images since only 1.4%, i.e. 50, tentative matches are consistent with the true motion. There are very many wrong tentative matches on bushes where nearly all the local image features are small and green. Thus, many motion directions get high support from wrong matches. The true motion has the highest support but its peak is very sharp and thus difficult to find in limited time. Even this difficult example can be solved correctly by the technique presented in Chapter 6.

Robust SfM by Detecting Too Small Translations. When obtaining global camera poses by chaining pairwise epipolar geometries, one has to deal with the problem of
Figure 2.3: Difficult camera motions. (a) Image pair. (b) Voting space for the motion direction. (c) The maximal support for every possible epipole. See Figure 2.2 for detailed description. Notice that the true motion has the highest support but its peak is very sharp and thus difficult to find in limited time.
detecting too small translations, as the missing translation component of the motion disturbs the reconstruction and leads to unsuccessful 3D point triangulation. This has been addressed in [60] by considering camera motions to be pure rotations if at least 90% of the matches verified by an epipolar geometry were also verified by fitting a pure rotation. Another recent work [19] looks at a related problem of determining the scale of the motion of a stereo rig with non-overlapping fields of view.

In work [99], a method providing a reliable detection of too small camera translation from two images was proposed and it was demonstrated that such a capability enhances SfM and object recognition from a video sequence taken by a moving camera. Since the scale of the reconstruction cannot be determined from two images alone, the amount of camera translation can be measured only relatively w.r.t. the observed scene, e.g. by means of a dominant apical angle (DAA) of the 3D points reconstructed from the matches. Note that the apical angle of a 3D point X is the angle under which the camera centers are seen from the perspective of the point X. It is shown on simulated data that the dominant apical angle is a linear function of the length of the true translation for general as well as planar scenes and that it can be reliably estimated in the presence of outliers. In Chapter 6, we demonstrate in real experiments that the proposed measure enables robust computation of camera poses and trajectory even from sequences acquired with the presence of large changes of motion acceleration.
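The following short sketch only illustrates the apical-angle definition given above; it is not Algorithm 3 of the thesis. The camera centers C1 and C2, the triangulated points X, and the histogram resolution are assumed inputs, and the dominant apical angle is taken here simply as the mode of a histogram.

```python
# Illustrative sketch of the apical angle: the angle under which the two camera
# centers are seen from a reconstructed 3D point; the "dominant" apical angle
# is approximated by the mode of a histogram over all triangulated points.
import numpy as np

def apical_angles(X, C1, C2):
    """X: Nx3 triangulated points, C1, C2: camera centers (3-vectors)."""
    v1 = C1 - X
    v2 = C2 - X
    v1 /= np.linalg.norm(v1, axis=1, keepdims=True)
    v2 /= np.linalg.norm(v2, axis=1, keepdims=True)
    cosines = np.clip(np.sum(v1 * v2, axis=1), -1.0, 1.0)
    return np.arccos(cosines)

def dominant_apical_angle(X, C1, C2, bins=90):
    """Histogram mode of the apical angles, in radians."""
    tau = apical_angles(X, C1, C2)
    hist, edges = np.histogram(tau, bins=bins, range=(0.0, np.pi / 2))
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])
```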
Figure 2.4: Skeletal set construction. Starting from an image graph (left), a skeletal image set (middle) is computed by weighting graph edges by the estimated uncertainties of two-frame reconstructions between pairs of overlapping images and using a graph algorithm to select a subset of images that, when reconstructed (right), approximate the coverage and accuracy of the full image set. Image courtesy of [91].
2.2 Unordered Image Set Processing

Secondly, we review the work done in SfM using image sets without a given order. An unordered image set can be understood as a set of images of a scene taken with no strict requirements on the overlaps of the fields of view of consecutive images, or it can even be the result of an image search on photo sharing sites like Flickr [107]. Typical queries would be: "Paris, Arc de Triomphe", "Rome, Fontana di Trevi", "Prague, Castle", etc.
2.2.1 Avoiding Exhaustive Pairwise Image Matching

Most of the state of the art techniques for 3D reconstruction from unordered image sets [85, 11, 103, 60] start the computation by performing exhaustive pairwise image matching in order to reveal the structure and connectivity of the data. This becomes infeasible for image sets comprising thousands of images because the number of image pairs is quadratic in the number of images. Even Bundler [90], one of the best known 3D modeling systems for unordered image sets, uses exhaustive pairwise image feature matching and exhaustive pairwise epipolar geometry computation to create an image graph with vertices being images and edges weighted by the uncertainty of the pairwise relative position estimations, which is later used to guide the reconstruction. By finding the skeletal set [91] as a subgraph of the image graph having as few internal nodes as possible while keeping a high number of leaves and the shortest paths at most a constant factor longer, the reconstruction time improves significantly, but the time spent on image matching remains the same. An example of skeletal set construction can be seen in Figure 2.4.
A recent advancement of the aforementioned technique [2] abandons exhaustive pairwise image matching by using shared occurrences of visual words [77, 87] to match only the ten most promising images per input image. On the other hand, the number of computed image matchings still remains rather high for huge image sets. The presented computational speed is also achieved thanks to massive parallelization, which demands grid computing on 496 cores. It is worth mentioning that a similar approach was used in Google Maps to construct the models of several popular landmarks from all around the world using user-contributed Picasa and Panoramio photos [35]. The method presented in Chapter 7 builds upon the same principles [77, 87] as method [2] to achieve efficient reconstruction from unordered image sets. Notice that the original paper describing the method, [42], was published before [2].
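The principle of ranking image pairs by co-occurring visual words can be sketched as follows. This is a schematic, dense NumPy illustration of tf-idf weighting and cosine similarity, assuming that visual-word occurrence counts are already available for every image; a real system would use sparse vectors and an inverted file instead of a dense matrix.

```python
# Schematic tf-idf image similarity: images described by visual-word counts are
# weighted by inverse document frequency, L2-normalized, and compared by scalar
# products, yielding an image similarity matrix M.
import numpy as np

def tfidf_similarity(word_counts):
    """word_counts: (n_images, n_words) array of visual-word occurrence counts."""
    n_images = word_counts.shape[0]
    df = np.count_nonzero(word_counts, axis=0)             # document frequency per word
    idf = np.log(n_images / np.maximum(df, 1))             # inverse document frequency
    tf = word_counts / np.maximum(word_counts.sum(axis=1, keepdims=True), 1)
    tfidf = tf * idf                                       # tf-idf vectors, one per image
    tfidf /= np.maximum(np.linalg.norm(tfidf, axis=1, keepdims=True), 1e-12)
    return tfidf @ tfidf.T                                 # similarity matrix M
```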
2.2.2 Reducing the Size of the Image Set

Another possible approach to reducing the number of necessary pairwise image feature matchings lies in reducing the number of images to be processed, because the input image set may be highly redundant. A hybrid approach that combines the strengths of 2D recognition and 3D reconstruction, suitable for landmark reconstruction from large scale contaminated Internet image collections, has been presented in [55]. First, input images are clustered using the GIST [80] descriptor, which is based on 2D appearance. Secondly, the images in each cluster are tested for geometrical consistency and the image that is best connected to the rest of the images in a given cluster is selected as an "iconic image", see Figure 2.5. These images and the pairwise geometric relationships between them define an "iconic scene graph" that captures all the important aspects of the original image set. The iconic scene graph is then partitioned and used for efficient reconstruction of several partial 3D models which are later merged together to obtain the desired 3D model of the whole scene. In [27], the method has been re-implemented in order to be highly parallel and therefore suitable for GPU computing. A successful reconstruction of about 3 million images within the span of a day on a single high-end PC has been demonstrated. Although quite impressive, the resulting models do not reach the quality of real 3D surface models as they represent the objects from typical viewpoints only.

Compared to the aforementioned method, the technique presented in Chapter 7 focuses on image sets containing evenly distributed cameras, where one cannot reduce the number of images dramatically without losing a substantial part of the model. Also, the proposed atomic 3D models have a different purpose as they are primarily intended for geometrical verification of tentative image feature matches. In Chapter 8, we present an advancement of this technique which removes input images based on visual overlap measured by shared occurrences of visual words [87]. The method is more robust to viewpoint changes than [55] because it seeks images capturing the same 3D structure rather than images acquired from the same viewpoint, as demonstrated in [17]. Furthermore, the method works also for omnidirectional images where GIST often fails.
Figure 2.5: The selection of iconic images for the Statue of Liberty. The initial image set comprising 45,284 images is reduced to a set of 196 iconic images by 2D appearance-based clustering (GIST) followed by geometric verification of top cluster representatives. The iconic images are nodes in an iconic scene graph consisting of multiple connected components, each of which gives rise to a partial 3D model. Image courtesy of [55].
2.2.3 Prioritizing Promising SfM Tasks

Last but not least, the order of executing different SfM tasks also influences the quality of the resulting 3D model and the time spent reconstructing it. Furthermore, it is desirable to output partial 3D models as soon as possible, e.g. in applications involving user interaction. To the best of our knowledge, the method described in Chapter 8 is the first large scale SfM method which interleaves matching and reconstruction steps, as confirmed by a recent survey [88]. The computation becomes a dynamic process where matching aids reconstruction and vice versa.
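The interleaving of SfM tasks can be sketched with a standard binary heap; the task interface used here (an execute() method returning a cost and newly spawned tasks) is a hypothetical assumption made only for illustration, not the data structures of Chapter 8.

```python
# Schematic sketch of interleaving SfM tasks with a priority queue: cheaper or
# more promising tasks are popped first, and finished tasks may push new ones,
# so matching and reconstruction steps alternate within a fixed time budget.
import heapq
import itertools

def run_tasks(initial_tasks, time_budget):
    """initial_tasks: iterable of (priority, task); lower priority value runs first."""
    counter = itertools.count()                 # tie-breaker so tasks are never compared
    queue = [(p, next(counter), t) for p, t in initial_tasks]
    heapq.heapify(queue)
    spent = 0.0
    while queue and spent < time_budget:
        priority, _, task = heapq.heappop(queue)
        cost, new_tasks = task.execute()        # e.g. build an atomic model, connect an image
        spent += cost
        for new_priority, new_task in new_tasks:
            heapq.heappush(queue, (new_priority, next(counter), new_task))
```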
3 Contribution of the Thesis
The contribution of the thesis is related to large scale Structure from Motion (SfM) from both ordered and unordered image sets. The research on sequential SfM was conducted in collaboration with my colleague, Akihiko Torii, and the research on SfM from unordered data was done by me. The thesis presents the advancements in both areas because they are tightly bound and most of the research code written during my Ph.D. studies is used by both reconstruction pipelines. Specifically, my contributions are the following:

1. Use of visual indexing for image pair/triplet selection. We avoid the most time consuming step of large scale SfM from unordered image sets, the computation of all pairwise matches and geometries, by sampling pairs of images and estimating visual overlap using the detected occurrences of visual words. The evaluation of the similarity scores by computing scalar products of so-called tf-idf vectors [87] is also quadratic in the number of images in the set, but a scalar product is a much simpler operation than full feature matching, which leads to a significant speedup of SfM. Furthermore, we proposed to sample triplets of images instead of pairs for the seeds of the reconstruction because 3D points verified in three views are more likely to be correct. The constructed atomic 3D models are merged together to give the final large scale 3D model at later stages of the computation.

2. Image set reduction by applying a graph algorithm (CDS). The idea of using visual indexing for SfM was further extended in order to be able to efficiently reconstruct image sets with uneven image coverage, i.e. community image sets of cities with landmarks. A small subset of the input images is selected by computing the approximate minimum connected dominating set of a graph with vertices being the images and edges connecting the visually similar images, using a fast polynomial algorithm [37]. This kind of reduction guarantees, to some extent, that the removed images have visual overlap with at least some images left in the set and therefore can be connected to the resulting 3D model later, if needed.

3. Task ordering using a priority queue. We use task prioritization to avoid spending too much time on a few difficult matching problems instead of exploring other, easier options. Compared to our previous work having the computation split into several stages [42], the usage of a priority queue for interleaving different "atomic 3D model construction" and "image connection" tasks facilitates obtaining reasonable reconstructions in limited time. The priorities of the individual tasks are set according to image similarity and the history of the computation.
Joint contributions follow:

4. Computation of the dominant apical angle (DAA). When performing sequential SfM by chaining pairwise epipolar geometries [39], the reconstruction usually fails when the amount of translation between consecutive cameras is not sufficient. We demonstrate that the amount of translation can be reliably measured for general as well as planar scenes by the most frequent apical angle, the angle under which the camera centers are seen from the perspective of the reconstructed scene points. By selecting only image pairs which have a sufficient DAA, one is able to easily reconstruct even sequences with variable camera motion speed and/or a camera stopping for a while.

5. Sequence bridging by visual indexing. We extend the known concept of loop closing, e.g. [43], which tries to correct the trajectory of the camera once the same place is re-visited, by searching for all the trajectory loops at once based on co-occurring visual words. Geometrically verified loop candidates are added to the model as new constraints for bundle adjustment which closes the detected loops as it enforces global consistency of camera poses and 3D structure in the sequence.

6. Image stabilization using non-central cylindrical image generation. A new technique for omnidirectional image rectification based on stereographic image projection was introduced as an alternative to central cylindrical image generation. We show that non-central cylindrical images are suitable for people and car recognition with classifiers trained on perspective data, e.g. [25], once the images are stabilized w.r.t. the ground plane.

7. Using a cone test instead of the reprojection error. When verifying 2D-3D matches in a RANSAC loop [26], we do not rely on the widely used reprojection error but make use of the fact that the projections of the 3D point to all the related images are known and use a "cone test" instead (see the illustrative sketch below). Two pixels wide pyramids are cast through the corresponding pixel locations in the related cameras and an LP feasibility task is solved to decide whether the intersection of the "cones" is non-empty or not. This allows for accepting a correct match even if the currently estimated 3D point location is incorrect, without modeling the probability distribution of the location of the 3D point explicitly. The same principle is also used for verifying 3D-3D matches when the projections of both 3D points are known.

It is worth mentioning that the methods were implemented to work with the general central camera model, covering the most common special cases including (i) perspective cameras, (ii) fish-eye lenses, (iii) equirectangular panoramas, and (iv) cameras calibrated by the polynomial omnidirectional model. The reconstruction pipelines are accessible to registered users through our web-based interface, http://ptak.felk.cvut.cz/sfmservice, and were successfully used to reconstruct many challenging image sets.
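As a rough illustration of contribution 7, the sketch below formulates the cone test for a perspective camera as an LP feasibility problem using SciPy. The half-space construction, the pixel radius, and the use of linprog are assumptions made for this example only; the thesis itself works with general central cameras and two-pixel-wide pyramids.

```python
# Schematic cone test: each observation contributes a narrow pyramid of rays
# around its measured pixel; the match set is accepted if some 3D point X lies
# inside all pyramids, which is checked as an LP feasibility problem.
import numpy as np
from scipy.optimize import linprog

def pixel_cone_halfspaces(K, R, C, u, radius=1.0):
    """Half-spaces -n.X <= -n.C bounding the pyramid of rays cast through a
    square of side 2*radius pixels around pixel u (perspective camera K, R, C)."""
    Kinv = np.linalg.inv(K)
    corners = np.asarray(u, float) + radius * np.array([[-1, -1], [1, -1], [1, 1], [-1, 1]])
    rays = [R.T @ Kinv @ np.array([x, y, 1.0]) for x, y in corners]
    centre = R.T @ Kinv @ np.array([u[0], u[1], 1.0])
    A, b = [], []
    for i in range(4):
        n = np.cross(rays[i], rays[(i + 1) % 4])
        if n @ centre < 0:                  # orient the plane normal into the cone
            n = -n
        A.append(-n)                        # n.(X - C) >= 0  <=>  -n.X <= -n.C
        b.append(-n @ np.asarray(C, float))
    return np.array(A), np.array(b)

def cone_test(observations, radius=1.0):
    """observations: iterable of (K, R, C, u); True if all pixel cones intersect."""
    halfspaces = [pixel_cone_halfspaces(K, R, C, u, radius) for K, R, C, u in observations]
    A = np.vstack([h[0] for h in halfspaces])
    b = np.hstack([h[1] for h in halfspaces])
    res = linprog(np.zeros(3), A_ub=A, b_ub=b, bounds=[(None, None)] * 3, method="highs")
    return res.status == 0                  # feasible <=> a common 3D point exists
```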
3.1 Thesis Outline
The rest of the thesis is organized as follows:

• Chapter 4 deals with the problem of the automatic reconstruction and modeling of real cities from dense image sequences acquired by a pair of cameras mounted on a survey vehicle and presents an extension of the framework [20] for an omnidirectional stereo rig.

• Chapter 5 presents an extension of [98] capable of visual 3D modeling of large city areas using 360° field of view equirectangular panoramas, namely the Google Street View Pittsburgh Research Data Set. Loop closing is achieved by employing visual indexing [87].

• Chapter 6 introduces the dominant apical angle (DAA) [99] and presents an integrated pipeline for camera pose and trajectory estimation followed by image stabilization and rectification for dense as well as wide baseline omnidirectional image sequences acquired by a single hand-held camera. It is also shown that the proposed approach is capable of facilitating visual object recognition [22, 25] by using the images augmented by the computed camera trajectory [98].

• Chapter 7 proposes a novel SfM technique for unordered image sets using image pair similarity scores computed from the detected occurrences of visual words [77, 87] to avoid exhaustive pairwise image feature matching. The concept of atomic 3D models for improving the quality of the resulting 3D structure is also introduced.

• Chapter 8 presents an extension of the method demonstrated in Chapter 7. SfM is speeded up by selecting a subset of input images based on the approximate minimum connected dominating set computed on the graph constructed according to the visual overlap by a fast polynomial algorithm [37]. Furthermore, the strict division of the computation into steps is relaxed by introducing a priority queue which interleaves different reconstruction tasks in order to get a good scene covering reconstruction in limited time.
4 Fast City Modeling from Omnidirectional Stereo Rig
Automatic reconstruction and modeling of real cities from dense image sequences acquired by cameras mounted on a survey vehicle calls for the ability to process a very large number of images which span extended spaces and are acquired along trajectories containing large camera rotations. The processing must be done in, or at least close to, real-time. In this chapter, we present an extension of the framework [20] for an omnidirectional stereo rig. We focus on presenting the extensions to camera tracking and Structure from Motion and demonstrating the functionality of the modified SfM framework in experiments. It is also shown that using two omnidirectional cameras bound into a stereo rig prevents the undesirable drift in the estimation of camera poses.

Omnidirectional cameras have been used on cars and mobile platforms [8] mainly to estimate ego-motion of the vehicles or for simultaneous localization and motion planning [33, 24]. These works used catadioptric cameras with views optimized to see the complete surroundings of their vehicles in a limited resolution. They do not provide images of photographic quality needed for city modeling. We therefore use 180° fish-eye lenses which are compact and provide better image quality [67].

Vision with wide field of view was previously used also for city modeling to capture images with very large resolution. Panoramic mosaicing was preferred to using a fish-eye lens for recovering relative camera poses very accurately from a small number of images [4, 5] and to generate high resolution and high dynamic range images [97] from geo-referenced positions. This approach provides very detailed but large images and is not suitable for real-time processing. We use two compact 4 Mpixel omnidirectional cameras as images of such size can be processed in real-time. On the other hand, our images are extremely radially distorted and a special projection model is needed to process them.
4.1 The SfM Framework for an Omnidirectional Stereo Rig

Omnidirectional cameras differ from the perspective ones primarily in their image projection. This difference influences (i) camera calibration, (ii) feature extraction for image matching, and (iii) Structure from Motion computation. We shall next describe the extension of the SfM framework [20] to be able to use the omnidirectional stereo rig of cameras with fish-eye lens convertors shown in Figure 4.1.
Figure 4.1: Omnidirectional stereo rig with Kyocera Finecam M410R cameras and Nikon FC-E9 fish-eye lens convertors.
4.1.1 Omnidirectional Camera Calibration

We calibrate omnidirectional cameras off-line using the technique [6] and Mičušík's two-parameter model [67], which links the radius of the image point r to the angle θ of its corresponding ray w.r.t. the optical axis, see Figure 4.2, as

    θ = a r / (1 + b r²).    (4.1)

Projecting via this model provides good results even when a low quality fish-eye lens is used because the additional parameter b can compensate for improper lens manufacturing. All operations in the SfM framework that compute a projection of a world 3D point into the image or a ray cast through a pixel are using this lens model. The mapping from pixel positions to the corresponding rays is pre-computed and stored in a table to save time in actual computations.
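A minimal sketch of this pixel-to-ray mapping and of the pre-computed lookup table is given below, assuming the calibrated parameters a and b and the position of the view-field centre are known inputs.

```python
# Illustrative sketch of the two-parameter fish-eye model of Eq. (4.1): the
# pixel radius r is mapped to the ray angle theta w.r.t. the optical axis, and
# the pixel-to-ray directions are tabulated once for the whole image.
import numpy as np

def pixel_to_ray(u, v, centre, a, b):
    """Unit ray direction (camera frame, z = optical axis) for pixel (u, v)."""
    du, dv = u - centre[0], v - centre[1]
    r = np.hypot(du, dv)                    # radius of the image point
    theta = a * r / (1.0 + b * r**2)        # Eq. (4.1)
    s = np.sin(theta) / max(r, 1e-12)       # scale of the radial direction
    return np.array([s * du, s * dv, np.cos(theta)])

def build_ray_table(width, height, centre, a, b):
    """Pre-compute the pixel-to-ray mapping for the whole image."""
    table = np.zeros((height, width, 3))
    for v in range(height):
        for u in range(width):
            table[v, u] = pixel_to_ray(u, v, centre, a, b)
    return table
```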
4.1.2 Features

Images are matched by detecting, describing and tracking corner-like image features [20]. The green image channel is divided into sections of 8×8 pixels and at most one salient feature per section is used to limit the amount of computation. Feature saliency F is computed from a square region of pixels as

    F = |(M_UL + M_WR) − (M_UR + M_WL)|,    (4.2)

where M_UL, M_UR, M_WL, and M_WR are the average pixel values inside the upper-left, upper-right, lower-left, and lower-right quadrants.
Figure 4.2: Diagram (a) shows the equiangular projection of the Nikon FC-E9 lens convertor. Angle θ measured between the cast ray and the optical axis determines the radius r of a circle in the image circular view field where the pixel representing the value of the projected 3D point will lie. The Nikon FC-E9 lens convertor can be seen in (b).

These features were designed to detect corners of buildings and their windows and they work reliably for corners where horizontal and vertical lines meet. The detection becomes worse for rotated corners. Furthermore, objects captured in omnidirectional images are radially distorted as they come closer to the border of the circular view field. Feature saliency can therefore differ dramatically if computed on an object located in the center of the view field or on the same object when it appears close to the border. This can be remedied by local image rectification [64] but we observed that the difference is negligible when matching consecutive images of our dense image sequences. Figure 4.3 shows an input image and the detected feature points.
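A schematic implementation of the saliency measure of Eq. (4.2) and of the one-feature-per-section selection is sketched below; the quadrant size and the saliency threshold are illustrative assumptions, not the values used in the system.

```python
# Sketch of the corner-like saliency of Eq. (4.2) on the green channel, keeping
# at most one salient feature per 8x8 section of the image.
import numpy as np

def saliency(green, x, y, half=4):
    """F = |(M_UL + M_WR) - (M_UR + M_WL)| from quadrant means around (x, y)."""
    ul = green[y - half:y, x - half:x].mean()
    ur = green[y - half:y, x:x + half].mean()
    wl = green[y:y + half, x - half:x].mean()
    wr = green[y:y + half, x:x + half].mean()
    return abs((ul + wr) - (ur + wl))

def detect_features(green, section=8, half=4, min_saliency=5.0):
    """Return at most one (x, y, F) per section of the green channel."""
    h, w = green.shape
    features = []
    for sy in range(half, h - half, section):
        for sx in range(half, w - half, section):
            best = max(((saliency(green, x, y, half), x, y)
                        for y in range(sy, min(sy + section, h - half))
                        for x in range(sx, min(sx + section, w - half))), default=None)
            if best and best[0] > min_saliency:
                features.append((best[1], best[2], best[0]))
    return features
```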
4.1.3 Initialization by 2D Tracking

Camera tracking and the Structure from Motion computation have to be initialized by computing an initial 3D structure. Internal camera calibrations (held constant for the whole sequence) and a few initial camera poses are needed. Feature points are detected and tracked in 2D over several consecutive images and then triangulated into world 3D points using the known camera poses. Tracking in 2D is done by constructing tentative matches from pairs of feature points in consecutive images which have small differences in positions as well as in their saliencies. Images used for initialization should thus come from a slow camera motion without sharp turns. Next, pixel regions of the tentative feature point pairs are correlated and only the sufficiently and mutually most similar tentative matches are joined to construct tracks.
Figure 4.3: Input image (left) and detected feature points marked with colored squares around them (right). Black area around the circular view field is excluded from feature detection.

Only those tracks that are tracked during all frames of the initialization are used to triangulate cameras and compute the initial 3D structure. It is important to adjust the length of the initial sequence to retain a sufficient number of tracks corresponding to a sufficiently large camera motion. The initialization is done independently for the left and right camera, so two sets of world 3D points are computed.
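The 2D tracking step described above can be sketched as follows; the distance and saliency thresholds are illustrative assumptions, and the correlation scores of the candidate pairs are assumed to be computed elsewhere.

```python
# Schematic sketch of tentative matching between consecutive frames: candidate
# pairs must be close in position and saliency, and only mutually best
# correlated pairs are joined into tracks.
import numpy as np

def tentative_matches(feats_prev, feats_next, max_dist=20.0, max_dsal=10.0):
    """feats_*: list of (x, y, saliency); returns candidate index pairs (i, j)."""
    pairs = []
    for i, (x0, y0, s0) in enumerate(feats_prev):
        for j, (x1, y1, s1) in enumerate(feats_next):
            if np.hypot(x1 - x0, y1 - y0) < max_dist and abs(s1 - s0) < max_dsal:
                pairs.append((i, j))
    return pairs

def mutual_best(pairs, scores):
    """Keep pairs that are the best-scoring partner for both of their features."""
    best_prev, best_next = {}, {}
    for (i, j), s in zip(pairs, scores):
        if s > best_prev.get(i, (-np.inf, None))[0]:
            best_prev[i] = (s, j)
        if s > best_next.get(j, (-np.inf, None))[0]:
            best_next[j] = (s, i)
    return [(i, j) for i, (s, j) in best_prev.items() if best_next[j][1] == i]
```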
4.1.4 Expansion of the Euclidean Reconstruction

Once the Euclidean reconstruction is initialized, the next image pair in the stereo sequence is taken and the reconstruction is expanded using it. The expansion consists of several steps described below in detail.

First, the camera poses of the new stereo pair must be established. 3D points reconstructed in previous frames are projected into the new images using the last established camera poses. The feature points that could prolong the tracks connected with the projected 3D points are found in small neighbourhoods of the projections using the same tests as during the initialization. As can be seen in Figure 4.4, every reconstructed 3D point, e.g. X^R_{i,i+j} triangulated from feature point positions x^R_i and x^R_{i+j} or x^R_i and y^L_{i+j} (depending on whether or not it has been re-triangulated already), is projected into the right and the left images as π^R(X^R_{i,i+j}) and π^L(X^R_{i,i+j}). To prolong tracks, we establish tentative 3D-2D matches (X^R_{i,i+j}, x^R_{i+j+1}, y^L_{i+j+1}) between the 3D point X^R_{i,i+j}, the feature point x^R_{i+j+1} found in the neighbourhood of π^R(X^R_{i,i+j}) as the feature point whose saliency is most similar to the saliency of x^R_{i+j}, and the feature point y^L_{i+j+1}
Figure 4.4: 3D point X^R_{i,i+j} triangulated from x^R_i and x^R_{i+j} or x^R_i and y^L_{i+j} is projected into new images acquired by cameras C^L_{i+j+1} and C^R_{i+j+1} (left). Positions of the most similar feature points are denoted by y^L_{i+j+1} and x^R_{i+j+1}. 3D point X^R_{i,i+j} is refined into X^R_{i,i+j+1} using triangulation from x^R_i and y^L_{i+j+1} (right).
found in the neighbourhood of π^L(X^R_{i,i+j}) as the feature point whose saliency is most similar to the saliency of x^R_{i+j+1}. The tentative 3D-2D matches are used as the input to the RANSAC [26] robust estimation technique which estimates the camera poses and simultaneously rejects wrong tentative matches. The left camera pose can be computed from a minimal sample of three 3D-2D correspondences by Nister's algorithm [74] and the right camera pose is then obtained using the rigid left-right camera transformation computed from the known camera poses during the initialization. The main advantage of Nister's algorithm, originally designed for finding poses of non-central cameras, lies in the fact that the rays do not need to be concurrent and thus rays going through both the left and the right cameras can be combined together in one sample. The algorithm [73] leads to solving an 8-degree polynomial using Sturm sequences and bisection with a fixed number of iterations and gives accurate results in constant time.

The RANSAC stopping condition ensures stopping dependent on the probability of finding a better sample. As we are using samples of length three, RANSAC usually needs only tens of samples to meet the stopping condition. However, in order not to exceed the maximal processing time available, a threshold on the maximum number of samples has to be used. To save even more time, the test for inliers is performed gradually on partitions of the matches and the verification is terminated as soon as it is clear that the new hypothesis cannot be better than the best hypothesis known at the time. A similar idea is extended into a two-step evaluation procedure in [15] and further modified
specifically for on-line motion estimation in [72]. A match $(X^R_{i,i+j}, x^R_{i+j+1}, y^L_{i+j+1})$ is an inlier if and only if both matches $(X^R_{i,i+j}, x^R_{i+j+1})$ and $(X^R_{i,i+j}, y^L_{i+j+1})$ are inliers.

Two runs of the Levenberg-Marquardt non-linear optimization are used to refine the camera poses using the computed set of inliers. The first refinement uses the reprojection error as the cost function and finds the best solution according to the computed set. As this set can be computed incorrectly and can contain true outliers which might adversely influence the optimization, a fixed cost value is used when the reprojection error is larger than a threshold during the second refinement to suppress this influence. Again, reprojection errors in both the left and the right images are measured.

The tracks of the resulting inliers are prolonged and the 3D points connected with these tracks are refined by re-triangulation. The stereo rig rigidity constraint is enforced again when feature points $x^R_i$ and $y^L_{i+j+1}$ are used to triangulate the 3D point $X^R_{i,i+j+1}$. The rest of the tracks, i.e. the tracks of the outliers and the tracks that did not have a corresponding match, are ended. If the same feature point is detected later again, a new track with a new connected 3D point is created with no binding to the old one.

There are also tracks that do not have a 3D point connected with them because either they are too short or the angle between the two rays used for triangulation is not yet large enough. These tracks are prolonged using the following geometrical constraints derived from the established camera poses to restrict the set of possible locations of the feature points. First, a homography through a virtual plane at a fixed distance in front of the camera is used to get an estimate of the position of the feature point and a circular neighbourhood around this location is searched. This distance should be set to the expected average distance of the feature points. An additional condition is the proximity to the matching epipolar line. When using omnidirectional cameras, the residual distance is computed as the distance between the feature point position and the perpendicular projection of its ray onto the matching epipolar plane, projected back to the image.
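To make the interleaving of hypothesis generation and partial verification concrete, the sketch below shows a RANSAC loop over samples of three 3D-2D correspondences with the both-image inlier test and a chunked bail-out evaluation. The minimal solver `solve_pose`, the rig-transformation convention, and all thresholds are placeholders of this sketch, not the thesis implementation (which uses Nister's algorithm for non-central cameras and an adaptive stopping criterion).

```python
import numpy as np

def estimate_stereo_pair_pose(points_3d, rays_left, rays_right, solve_pose,
                              rig_R, rig_t, thr_rad=0.005,
                              max_samples=500, chunk=100, seed=0):
    """RANSAC pose estimation for a new stereo pair from 3D-2D matches.

    solve_pose(ids) stands in for a minimal absolute-pose solver returning a
    left-camera hypothesis (R, t) from the three sampled matches.
    (rig_R, rig_t) is the fixed left-to-right rig transformation; the
    convention X_right = rig_R @ X_left + rig_t is assumed here.
    A match counts as an inlier only if it agrees in BOTH images, and the
    inlier counting over chunks is aborted as soon as the hypothesis can no
    longer beat the best support found so far (bail-out test).
    """
    rng = np.random.default_rng(seed)
    n = len(points_3d)
    best_pose, best_support = None, -1

    def fits(R, t, X, ray):
        p = R @ X + t
        p = p / np.linalg.norm(p)
        return np.arccos(np.clip(p @ ray, -1.0, 1.0)) < thr_rad

    for _ in range(max_samples):
        hypothesis = solve_pose(rng.choice(n, size=3, replace=False))
        if hypothesis is None:
            continue
        Rl, tl = hypothesis
        Rr, tr = rig_R @ Rl, rig_R @ tl + rig_t      # right pose from the rig
        support = 0
        for start in range(0, n, chunk):
            stop = min(start + chunk, n)
            for k in range(start, stop):
                X = points_3d[k]
                if fits(Rl, tl, X, rays_left[k]) and fits(Rr, tr, X, rays_right[k]):
                    support += 1
            if support + (n - stop) <= best_support:
                break                                # cannot beat the best hypothesis
        if support > best_support:
            best_pose, best_support = (Rl, tl), support
    return best_pose, best_support
```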
4.1.5 Bundle Adjustment

The data computed from the image sequences during the expansion are divided into blocks, each of them holding information from 60 images. Unlike the on-line local bundle adjustment routine described in [70], our routine processes the already finished data blocks with no back coupling to the expansion. First, the positions of the 3D points are refined with fixed camera poses and then the camera poses are refined with fixed positions of the 3D points. Left and right cameras are rigidly bound using the left-right camera transformation and the 3D point reprojection errors in both the left and the right images are summed together in the cost function. The whole routine runs twice and a fixed cost value is used when the reprojection error is larger than a threshold during the second run to suppress the influence of outliers.

The main reason for running the bundle adjustment routine is to smooth the camera trajectories and to remove noise from the 3D point clouds, as only the tracks of feature points
visible in four or more frames are used for refinement and 3D points reconstructed from short tracks are thrown away because these tracks are considered to be less reliable.
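A minimal sketch of the point-refinement half of this alternation under a truncated cost is given below. The angular reprojection error, the SciPy solver, and the omission of the left-right rig binding, the 60-image blocks, and the camera-refinement half are all simplifications of this sketch, not the thesis implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def capped_angular_residuals(X, poses, observed_rays, cap=None):
    """Angular reprojection residuals of one 3D point in several cameras.

    poses is a list of (R, t) pairs mapping world points into camera frames,
    observed_rays the corresponding unit observation rays. When `cap` is
    given, any residual above it is replaced by the fixed value `cap`, which
    mimics the fixed cost used in the second bundle adjustment run."""
    res = []
    for (R, t), ray in zip(poses, observed_rays):
        p = R @ X + t
        p = p / np.linalg.norm(p)
        ang = np.arccos(np.clip(p @ ray, -1.0, 1.0))
        res.append(min(ang, cap) if cap is not None else ang)
    return np.asarray(res)

def refine_point(X0, poses, observed_rays, cap=None):
    """One half of the alternation: refine a 3D point with all poses fixed."""
    return least_squares(capped_angular_residuals, np.asarray(X0, float),
                         args=(poses, observed_rays, cap)).x
```

The camera-refinement half alternates analogously, keeping the 3D points fixed and optimizing the rotation and translation parameters of each stereo pair.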
4.2 Experimental Results

Next, we shall demonstrate Structure from Motion with an omnidirectional camera stereo rig. We shall first describe the stereo rig and then compare the results of motion computation with and without the modifications that enforce the stereo rig rigidity constraint described in Section 4.1.
4.2.1 Omnidirectional Stereo Rig

The important parameters of a camera rig are: view angle, resolution, image quality, frame rate, exposure synchronization, size and weight, and the length of the baseline. We have constructed a two-camera rig. Each camera of the rig is a combination of a Nikon FC-E9 lens mounted via a mechanical adaptor onto a Kyocera Finecam M410R digital camera, see Figure 4.1. The Nikon FC-E9 is a megapixel omnidirectional add-on convertor with a 183° view angle. It is designed to be mounted on top of the lenses of standard Nikon digital cameras. The lens is larger and heavier than the similar Nikon FC-E8 lens but it is designed for imagers with higher resolution than the FC-E8 and provides images of photographic quality. The Kyocera Finecam M410R delivers 2,272×1,704 images at 3 frames per second. Since the FC-E9 lens is originally designed for a different optical system, we used a custom made mechanical adaptor to fit it on top of the Kyocera lens. The resulting combination yields a circular view of a diameter slightly under 1,600 pixels in the image.

Since the FC-E9 lens is close to the equiangular projection [6], we obtain an angular resolution of 0.11 = 183/1600 degrees per pixel in the radial direction of the image. The tangential resolution depends on the distance from the view center; it improves from 0.11 degrees per pixel near the center to 0.036 degrees per pixel at the periphery. For comparison, consider that a 1,024×768 camera with a common angle of view of 40° yields an almost uniform resolution of 0.039 = 40/1024 degrees per pixel. The Kyocera cameras do not have external synchronization but we were able to connect an external signal to start the acquisition at the same moment. Figure 4.5 shows four cameras mounted on a survey vehicle. The two cameras with large fish-eye lenses form our stereo rig with a 0.95 m baseline.
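The quoted resolutions follow directly from the field of view and the pixel counts; the snippet below merely reproduces this arithmetic and is not part of the thesis pipeline.

```python
# Back-of-the-envelope check of the angular resolutions quoted above
# (values in degrees per pixel).
fisheye_radial = 183 / 1600   # FC-E9: 183 deg across a ~1,600 px view circle
perspective = 40 / 1024       # 40 deg field of view across 1,024 px
print(round(fisheye_radial, 3), round(perspective, 3))   # ~0.114 and ~0.039
```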
4.2.2 SfM with the Stereo Rig Rigidity Constraint

There are several ways to get the camera poses needed for the initialization. If the cameras are mounted on a vehicle riding at a constant known velocity with no changes in the direction of the movement during one second, starting camera poses for the left camera can be computed easily. If the relative camera pose of the right camera w.r.t. the left camera is known, starting camera poses for the right camera can be obtained by a simple transformation.
Figure 4.5: Kyocera Finecam M410R cameras with Nikon FC-E9 fish-eye lens convertors and two conventional perspective cameras mounted on a survey vehicle. The perspective cameras were not used in our experiments.

Another approach does not rely on a known stereo rig calibration but computes the starting camera poses directly. An extension of WBS Structure from Motion [62] to omnidirectional images can be used to get epipolar geometries between the first left and first right, first left and e.g. sixth left, and first right and sixth left cameras. These geometries can then be combined together to get a movement estimate fulfilling the stereo rig rigidity constraint.

Both approaches were tested and work well. The main advantage of the first approach lies in the fact that one needs no additional method to start the reconstruction. On the other hand, the second approach can be used even when the stereo rig calibration and/or the movement of the car are not known.

Figure 4.6 shows a city segment with several blocks of houses used for our experiments. We were driving our survey vehicle equipped with the camera rig slowly, following the path drawn in the map. The designed trajectory contains sharp turns to test the performance under difficult conditions and a closed loop which allows us to measure the accuracy of the reconstruction. The data were acquired under normal traffic conditions with cars and pedestrians moving in the streets. Our test sequence was 870 frames long and the first and the sixth image were used to initialize the SfM with more than 200 correct tracks for each camera reconstructed into world 3D points. The top view of the resulting reconstructed 3D model can be seen in Figure 4.7.

Straight street segments are quite easy: the support of the RANSAC winner is usually more than 60% and only a few tens of runs of the RANSAC loop are needed to find it. Segments with sharp turns are much more difficult; the support of the RANSAC winner and also the number of active tracks drop dramatically, see Figure 4.8. We hypothesize that this is caused mostly by inaccurate camera and/or stereo rig calibration because the world 3D points come closer to the cameras and start rotating, which causes
the errors in the estimation of their depths to become much more important than when these 3D points are distant and the movement is rotation-free. The shape of the reconstructed trajectory corresponds well to the actual one; we observe only small problems at the beginnings of the turns when the movement is still estimated as being forward although the car is just starting to turn. This is probably caused by finding a large number of feature points on the corner building and a lack of feature points in the other parts of the scene. These "corner building" feature points form a large set of inliers to an incorrect model which does not describe the whole scene well. The error in position estimation accumulated along the 420 meters long loop is less than 4.5 meters, which gives a relative position estimation error of 1.1%. This drift error could be eliminated by using loop closing techniques, as will be shown in Chapter 5.

Figure 4.6: Aerial view of the city segment used for the acquisition of our test sequence; the designed car trajectory is drawn with a white line. The trajectory contains several sharp turns and a round-trip around a block of houses.
Figure 4.7: SfM with the stereo rig rigidity constraint. Camera positions are represented by larger dots, smaller dots represent the reconstructed world 3D points. The loop is not closed, mostly because of the errors arising in the sharp turns where the number of active tracks drops dramatically. Note that the reconstruction nearly failed in the sharp turn at frame number 499.
4.2.3 SfM without the Stereo Rig Rigidity Constraint

During the adaptation of the original SfM into an omnidirectional one, we first adapted the geometry and RANSAC without enforcing the stereo rig rigidity constraint [40] in the reconstruction. Stereo information was used only in the RANSAC loop where the left camera pose was estimated from 3D-2D matches from both cameras and the right camera pose was computed using the stereo rig calibration.

SfM worked fine when using additional GPS/INS data but failed when these data were not used. The resulting model reconstructed for the same test sequence without enforcing the rigidity constraint can be seen in Figure 4.9. The number of active tracks
drops under 10 in the first sharp turn because the positions of the world 3D points were not estimated well as the scale of the reconstruction was gradually lost. A comparison with the original framework using perspective cameras was not performed but we hypothesize that the result would be even worse, not only because of the missing stereo rig rigidity constraint but also because of the lack of feature points caused by a very narrow field of view.

Figure 4.8: Variation of the number of active tracks for different frames in the sequence. Note that the number of active tracks drops dramatically in frames corresponding to sharp turns.
4.2.4 Performance

The original SfM framework is able to work in real-time and it would be exciting to achieve the same speed even with fish-eye cameras. Until now, we were interested more in functionality than in performance, and the actual speed of our C++ implementation on a standard 2GHz Intel Pentium 4 computer is about 1.3 frames per second. This is primarily caused by the size of the input images, which is 800×800 compared to the 360×288 used with perspective cameras. Working with smaller images makes it more difficult to detect and correctly describe enough feature points, and making the images much smaller will be possible only if an extension to feature extraction is proposed and implemented. This extension would describe the features on a locally unwarped image. As this unwarping would not be quick enough using the CPU, GPU programming techniques should be used via OpenGL.

On the other hand, it turned out that the 3 frames per second provided by our omnidirectional cameras are enough for the reconstruction from a moving vehicle because
feature points do not get lost from the image as quickly as when perspective cameras are used. That is why it is not necessary to achieve a 25 frames per second computational performance; 3 frames per second are enough for real-time processing.

Figure 4.9: SfM without the stereo rig rigidity constraint. The resulting 3D model from the top (left) and the side (right) views. The scale of the reconstruction is gradually lost so the cameras approach the ground plane although they were actually moving parallel to it. The reconstruction fails in the first sharp turn.
5 City Modeling from Google Street View Equirectangular Images
Large scale 3D models of cities built from video sequences acquired by car mounted cameras provide relatively rich 3D content. A virtual reality system covering the whole world could be created in the near future by embedding such 3D content into Google Earth or Microsoft Virtual Earth. In this chapter, we present an SfM pipeline for visual 3D modeling of such a large city area using 360° field of view omnidirectional images, see Figure 5.1. The main contribution of the presented method lies in demonstrating that one can achieve SfM from a single sparse omnidirectional sequence with only an approximate knowledge of calibration, as opposed to [20, 96] where the large scale models are computed from dense sequences and with precisely calibrated cameras. We present an experiment with the Google Street View Pittsburgh Research Data Set (provided and copyrighted by Google), which has denser images than the data freely available at Google Maps. Therefore, we processed every second image and could have processed even every fourth image with only a small degradation of the results.
5.1 Sequential SfM from Equirectangular Images

The proposed SfM pipeline is an extension of the work [98] which demonstrated the recovery of camera poses and trajectory on an image sequence acquired by a single fish-eye lens camera. See Chapter 6 for more technical details on each step of the presented pipeline.
5.1.1 Calibration

Assuming that the input omnidirectional images are produced by the equirectangular projection, see Figure 5.2, the transformation from image points to unit vectors of their rays can be formulated as follows. For an equirectangular image with dimensions $I_W$ and $I_H$, a point $u = [u_i, u_j]$ in image coordinates is transformed into a unit vector $p = [p_x, p_y, p_z]$ in spherical coordinates:

$p_x = \cos\varphi \sin\theta, \quad p_y = \sin\varphi, \quad p_z = \cos\varphi \cos\theta,$   (5.1)

where the angles $\theta$ and $\varphi$ are computed as

$\theta = \left(u_i - \frac{I_W}{2}\right) \frac{2\pi}{I_W},$   (5.2)

$\varphi = \left(u_j - \frac{I_H}{2}\right) \frac{2\pi}{I_W} = \left(u_j - \frac{I_H}{2}\right) \frac{\pi}{I_H}.$   (5.3)
Figure 5.1: Camera trajectory computed by SfM. (a) Camera positions (red circles) exported into Google Earth [34]. To increase the visibility, every 12th camera position in the original sequence is plotted. (b) The 3D model representing 4,799 camera positions (red circles) and 123,035 3D points (color dots).
Figure 5.2: Omnidirectional imaging. (a) Point Grey Ladybug Spherical Digital Video Camera System [82] used for acquiring the Street View images. (b) Omnidirectional image used as input data for SfM. (c) Transformation between a unit vector p on a unit sphere and a pixel u of the equirectangular image. The coordinates $p_x$, $p_y$, and $p_z$ of the unit vector p are transformed into the angles θ and φ. The column index $u_i$ is computed from the angle θ and the row index $u_j$ from the angle φ.
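For concreteness, the following sketch implements the image-to-ray mapping of Eqs. (5.1)-(5.3) and its inverse. The exact pixel-centre convention and the use of NumPy are assumptions of this sketch, not details taken from the thesis; cropping offsets applied to the raw panoramas would have to be added to the row index.

```python
import numpy as np

def equirect_pixel_to_ray(u_i, u_j, I_W, I_H):
    """Unit ray p = [px, py, pz] for pixel (u_i, u_j) of an I_W x I_H
    equirectangular image, following Eqs. (5.1)-(5.3)."""
    theta = (u_i - I_W / 2.0) * 2.0 * np.pi / I_W
    phi = (u_j - I_H / 2.0) * 2.0 * np.pi / I_W
    return np.array([np.cos(phi) * np.sin(theta),
                     np.sin(phi),
                     np.cos(phi) * np.cos(theta)])

def equirect_ray_to_pixel(p, I_W, I_H):
    """Inverse mapping of a unit vector p back to pixel coordinates."""
    theta = np.arctan2(p[0], p[2])
    phi = np.arcsin(np.clip(p[1], -1.0, 1.0))
    u_i = theta * I_W / (2.0 * np.pi) + I_W / 2.0
    u_j = phi * I_W / (2.0 * np.pi) + I_H / 2.0
    return u_i, u_j
```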
5.1.2 Generating Tracks by Concatenating Pairwise Matches

Tracks used for SfM are generated in several steps. First, up to thousands of SURF features [7] are detected and described in each of the input images. Secondly, sets of tentative matches are constructed between pairs of consecutive images. The matching is achieved by finding features with the closest descriptors between the pair of images, which is done for each feature independently. When conflicts appear, we select the most discriminative match by computing the ratio between the first and the second best match. We use the Fast Library for Approximate Nearest Neighbors (FLANN) [71] which delivers approximate nearest neighbours significantly faster than exact matching thanks to using several random kd-trees.

Thirdly, tentative matches between each pair of consecutive images are verified through the epipolar geometry (EG) computed by solving the 5-point minimal relative pose problem for calibrated cameras [73]. The tentative matches are verified with a RANSAC based robust estimation [26] which searches for the largest subset of the set of tentative matches consistent with the given epipolar geometry. We use PROSAC [16], a simple modification of RANSAC, which brings a good performance [83] by reducing the number of samples through ordered sampling [16]. The 5-tuples of tentative matches are drawn from the list ordered ascendingly by their discriminativity scores, which are the ratios between the distances of the first and the second nearest neighbours in the feature space.

Finally, the tracks are constructed by concatenating inlier matches. The pairwise matches obtained by epipolar geometry validation often contain incorrect matches lying on epipolar lines or in the vicinity of the epipoles, since such matches may support the epipolar geometry even though they do not correspond to the same 3D point. In practice, such incorrect matches can mostly be filtered out by selecting only the tracks having a sufficient length. We reject tracks containing fewer than three features.
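The two mechanical steps of this section, ratio-test matching and track concatenation, can be sketched as follows. A SciPy KD-tree stands in for FLANN, and the union-find bookkeeping is an illustrative choice under stated assumptions, not the thesis implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_pair(desc1, desc2, ratio=0.8):
    """Tentative matches between two descriptor sets via nearest neighbours
    and a first-to-second distance ratio test (a KD-tree stands in for FLANN)."""
    tree = cKDTree(desc2)
    dist, idx = tree.query(desc1, k=2)
    keep = dist[:, 0] < ratio * dist[:, 1]
    return [(i, idx[i, 0], dist[i, 0] / max(dist[i, 1], 1e-12))
            for i in np.flatnonzero(keep)]

def build_tracks(pairwise_inliers, min_length=3):
    """Concatenate verified pairwise matches into tracks with union-find.
    pairwise_inliers[k] holds (feature_in_image_k, feature_in_image_k+1)
    pairs; features are identified by (image_index, feature_index)."""
    parent = {}
    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    def union(a, b):
        parent[find(a)] = find(b)

    for k, matches in enumerate(pairwise_inliers):
        for f1, f2 in matches:
            union((k, f1), (k + 1, f2))

    tracks = {}
    for node in list(parent):
        tracks.setdefault(find(node), []).append(node)
    return [sorted(t) for t in tracks.values() if len(t) >= min_length]
```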
5.1.3 Robust Initial Camera Pose Estimation

Initial camera poses and positions in a canonical coordinate system are recovered by using the epipolar geometries of pairs of consecutive images computed in the stage of verifying tracks. The essential matrix $E_{ij}$, encoding the relative camera pose between frames i and j = i + 1, can be decomposed into $E_{ij} = [t_{ij}]_\times R_{ij}$. Although there exist four possible decompositions, the right one can be selected as that which reconstructs the largest number of 3D points in front of both cameras. Having the normalized camera matrix [39] of the i-th frame $P_i = [R_i \mid t_i]$, the normalized camera matrix $P_j$ can be computed by

$P_j = [R_{ij} R_i \mid R_{ij} t_i + \gamma\, t_{ij}],$   (5.4)

where γ is the scale of the translation between frames i and j in the canonical coordinate system. The scale γ can be computed from any 3D point seen in at least three consecutive frames but the precision depends on the uncertainty of the reconstructed 3D point. Therefore, a robust selection from the possible candidate scales has to be done while evaluating the quality of the computed camera position.
Algorithm 1 Construction of the initial camera poses by chaining EGs
Input: $E_{i,i+1}$, i = 1, . . . , n − 1 — EGs of pairs of consecutive images; $M_i$, i = 1, . . . , n − 1 — matches (tracks) supporting the epipolar geometries.
Output: $P_i$, i = 1, . . . , n — normalized camera matrices.
1: $P_1 := [I \mid 0]$ // Set the first camera to be the origin of the canonical coordinates.
2: for i := 1, . . . , n − 1 do
3:   Decompose $E_{i,i+1}$ and select the right rotation R and translation t where $\|t\| = 1$.
4:   $U_i$ := 3D points computed by triangulating the matches $M_i$ using R and t
5:   if i = 1 then
6:     $P_{i+1} := [RA \mid Rb + t]$ where $P_i = [A \mid b]$.
7:     $X := U_i$ // Update 3D points.
8:   else
9:     Find 3D points $U_{i-1,i+1} \subset U_i$ in the i-th camera coordinates seen in three images.
10:    Find 3D points $X_{i-1,i+1} \subset X$ in the canonical coordinates seen in three images.
11:    t := 0, $s_{max}$ := 0, $t_{max} := |X_{i-1,i+1}|$ // Initialization for RANSAC cone test.
12:    while t ≤ $t_{max}$ do
13:      t := t + 1 // New sample.
14:      Select the t-th $X_{i-1,i+1} \in X_{i-1,i+1}$ and $U_{i-1,i+1} \in U_{i-1,i+1}$.
15:      $\gamma := \|X_{i-1,i+1}\| / \|A (U_{i-1,i+1} - b)\|$ // The scale to be tested.
16:      $P_t := [RA \mid Rb + \gamma t]$ where $P_i = [A \mid b]$.
17:      $s_t$ := the number of matches $m \in M_i$ which are consistent with the motions $P_{i-1}$, $P_i$ and $P_t$.
18:      if $s_t > s_{max}$ then
19:        $P_{i+1} := P_t$ // The best motion with scale so far.
20:        $s_{max} := s_t$ // The maximum number of supports so far.
21:        Update the termination length $t_{max}$.
22:      end if
23:    end while
24:    Update X by merging $U_{i-1,i+1}$ and adding $U_i \setminus U_{i-1,i+1}$.
25:  end if
26: end for
The best scale is found by RANSAC maximizing the number of points that pass the "cone test" [42]. This test, being an alternative to the widely used test based on thresholding the reprojection error, checks the intersection of pixel ray cones in a similar way as the feasibility test of $L_1$- or $L_\infty$-triangulation [47, 48], see Algorithm 1. When performing the cone test, one pixel wide pyramids formed by four planes (up, down, left, and right) are cast around the matches and it is tested whether their intersection is empty or not using the LP feasibility test [59] or an exhaustive test [42], which is faster when the number of the intersected pyramids is smaller than four, see Chapter 7.
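The sketch below is only a simplified angular proxy for this test: it replaces the four-plane pyramids and the LP feasibility check of the thesis with circular cones around the measured rays, which is easier to state but not equivalent.

```python
import numpy as np

def cone_test(X, centers, rays, half_angle):
    """Simplified angular stand-in for the cone test: a candidate 3D point X
    passes if, for every camera, the ray towards X deviates from the measured
    unit ray by less than the cone half-angle. The thesis instead intersects
    one-pixel-wide four-plane pyramids via an LP feasibility or an exhaustive
    test; this circular-cone check is a conservative approximation."""
    for C, r in zip(centers, rays):
        v = X - np.asarray(C, dtype=float)
        norm = np.linalg.norm(v)
        if norm == 0.0:
            return False
        ang = np.arccos(np.clip((v / norm) @ r, -1.0, 1.0))
        if ang > half_angle:
            return False
    return True
```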
Figure 5.3: Results of SfM with loop closing. (a) Trajectory before bundle adjustment. (b) Trajectory after bundle adjustment with loop closing. Examples of the images used for the loop closing: (c) Frames 6,597 and 8,643. (d) Frames 6,711 and 6,895.
5.1.4 Bundle Adjustment Enforcing Global Camera Pose Consistency

Even though the Google Street View data are not primarily acquired by driving the same street several times, there are some overlaps suitable for constructing loops that can compensate the drift errors induced while processing the trajectory sequentially. We construct loops by searching for pairs of images observing the same 3D structure at different times in the sequence. Here we take advantage of having equirectangular images because, using this projection model, images taken when revisiting a place from a different direction are very similar to the original ones, which is not the case for the commonly used perspective images.
The knowledge of the GPS locations of the Street View images truly alleviates the problem of image matching for loop closing but does not completely solve it, since common 3D structures can be seen even among relatively distant images. We do not rely on GPS locations because image matching achieved by using the image similarity matrix is potentially capable of matching such distant images, and it is always important for the vision community to see that a certain problem can be solved entirely using vision.

Building Image Similarity Matrix. SURF descriptors of each image are quantized into visual words using the visual vocabulary containing 130,000 words computed from urban area omnidirectional images. Next, term frequency-inverse document frequency (tf-idf) vectors [87, 50], which weight words occurring often in a particular document and downweight words that appear often in the database, are computed for each image with more than 50 detected visual words. Finally, the image similarity matrix M is constructed by computing the image similarities, which we define as the cosines of the angles between the normalized tf-idf vectors, between all pairs of images.

Loop Finding and Closing. First, we take the upper triangular part of M to avoid duplicate search. Since the entries near the diagonal of M, which correspond to neighbouring frames in the sequence, necessarily have high scores, the 1st to 50th diagonals are zeroed in order to exclude very small loops. Next, for the image $I_i$ in the sequence, we select the image $I_j$ as the one having the highest similarity score in the i-th row of M. Image $I_j$ is a candidate for the endpoint of the loop which starts at $I_i$. Note that the use of an upper triangular matrix constrains j > i.

Next, the candidate image $I_j$ is verified by solving the camera resectioning [74]. Triplets of the tentative 2D-3D matches, constructed by matching the descriptors of the 3D points associated with the images $I_i$ and $I_{i+1}$ with the descriptors of the features detected in the image $I_j$, are sampled by RANSAC to find the camera pose having the largest support, evaluated by the cone test again. The image $I_{i+1}$, which is the successive frame of $I_i$, is additionally used for performing the cone test with three images in order to enforce geometric consistency in the support evaluation of RANSAC. Local optimization is achieved by repeated camera pose computation from all inliers [86] via SDP and SeDuMi [94]. If the inlier ratio is higher than 70%, camera resectioning is considered successful and the candidate image $I_j$ is accepted as the endpoint of the loop. The inlier matches are used to give additional constraints to the final bundle adjustment.

We perform this loop search for every image in the sequence and test only the pair of images having the highest similarity score. If one increased the number of candidates to be tested, our pipeline would approach SfM for unorganized image sets [60, 55, 65] based on exhaustive pairwise matching. Finally, very distant points, i.e. likely outliers, are filtered out and sparse bundle adjustment [56], modified in order to work with unit vectors (an approach similar to [53]), refines both points and cameras.
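The similarity-matrix construction and the candidate search can be sketched as below; the exact tf-idf variant and the NumPy formulation are assumptions of this sketch rather than the thesis implementation.

```python
import numpy as np

def tfidf_similarity(word_histograms):
    """Image similarity matrix from visual-word histograms (rows = images):
    tf-idf weighting followed by cosine similarity of the normalized vectors."""
    H = np.asarray(word_histograms, dtype=float)
    tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1e-12)
    df = (H > 0).sum(axis=0)
    idf = np.log(H.shape[0] / np.maximum(df, 1))
    V = tf * idf
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return V @ V.T

def loop_candidates(M, min_gap=50):
    """For every image i, pick the most similar image j with j - i > min_gap,
    mimicking the zeroing of the first 50 diagonals of the upper triangle."""
    S = np.triu(M.copy(), k=min_gap + 1)
    cands = []
    for i in range(M.shape[0]):
        j = int(np.argmax(S[i]))
        if S[i, j] > 0:
            cands.append((i, j, S[i, j]))
    return cands
```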
Figure 5.4: Resulting 3D model consisting of 2,400 camera positions (red circles) and 124,035 3D points (blue dots) recovered by our pipeline. (a) Initial estimation. (b) After bundle adjustment with loop closing.
5.2 Experimental Results

We used 4,799 omnidirectional images of the Google Street View Pittsburgh Research Data Set. Since the input omnidirectional images have large distortion at the top and bottom, we clipped the original images by cropping 230 pixels from the top and 410 pixels from the bottom to obtain 3,328 × 1,024 pixel large images, see Figure 5.2(b). Since the tracks are generated based on wide baseline matching, it is possible to save computation time by constructing initial camera poses and 3D structure from a sparser image sequence. Our SfM was run on every second image in the sequence, i.e. 2,400 images
were used to create a global reconstruction. The remaining 2,399 images were attached to the reconstruction in the final stage. The initial camera poses were estimated by computing epipolar geometries of pairs of successive images and chaining them by finding the global scale of the camera translation, see Algorithm 1. The resulting trajectory is shown in Figure 5.3(a). After estimating the initial camera poses and reconstructing 3D points, the pairs of images acquired at the same location at different times were searched for. The red lines in Figure 5.3(a) indicate links between the accepted image pairs. Figure 5.3(b) shows the camera trajectory after running bundle adjustment with the additional constraints obtained from loop closing. Figures 5.3(c) and (d) show examples of pairs of images used for closing the loops at frames 6,597 and 8,643, and 6,711 and 6,895, respectively. Furthermore, Figure 5.4 shows the camera positions and the 3D points of the initial recovery (a) and after the loop closing (b) in different views. In Figure 5.5, the recovered trajectory is compared to the GPS positions provided in the Google Street View Pittsburgh Research Data Set.

Figure 5.5: Comparison to the GPS provided in the Google Street View Pittsburgh Research Data Set. Camera trajectory by GPS (red line) and estimated camera trajectory by our SfM (blue line).
Procedure       time [h]
Detection           12.8
Matching             4.5
Chaining             1.0
Loop Closing         6.3
Bundle              14.5

Table 5.1: Computational time in hours. (Detection) SURF detection and description. (Matching) Tentative matching and computing EGs. (Chaining) Chaining EGs and computing scales. (Loop Closing) Searching and testing loops. (Bundle) Final sparse bundle adjustment.

The computational time spent in the different steps of the pipeline, implemented in MATLAB+MEX and running on a standard Core2Duo PC, is shown in Table 5.1. Since the method is scalable and therefore stores the intermediate results of the computation on a hard drive instead of in RAM, performance could be improved by using a fast SSD drive instead of a standard SATA drive. Finally, the remaining 2,383 camera poses were computed by solving camera resectioning in the same manner as used in the loop verification. Linear interpolation was used for the 16 cameras that could not be resectioned successfully. Figure 5.1(b) shows the 4,799 camera positions (red circles) and the 124,035 world 3D points (color dots) of the resulting 3D model.
6 Omnidirectional Image Sequence Stabilization by SfM
When dealing with the problem of image sequence stabilization using SfM, a reliable sequential SfM method needs to be proposed first. In contrast to existing SfM algorithms, which work in situations where the camera motion is small or once the 3D structure is initialized, we aim at a more general case when neither the relationship between the cameras nor the structure is available. Then, two-view camera matching and relative motion estimation is a natural starting point for camera tracking and Structure from Motion. This is the approach used by the state of the art wide baseline Structure from Motion algorithms, e.g. [10, 89, 60, 91, 65], that start with pairwise image matches and epipolar geometries which they next clean up and make consistent by a large scale bundle adjustment.

The main contribution of this chapter is an integrated pipeline for camera pose and trajectory estimation followed by image stabilization and rectification for dense as well as wide baseline omnidirectional images acquired by a single hand-held camera. Our wide baseline SfM is capable of recovering camera poses and trajectories from sequences having large and non-smooth camera motions between consecutive frames. Therefore, the recovery can be accomplished even from sequences in which some frames are contaminated by unexpected accidents, e.g. blurred images, an extreme change of the view direction, or a lack of features to match. Furthermore, we show that the proposed approach is capable of facilitating visual object recognition by using the stabilized and rectified images augmented by the computed camera trajectory and the 3D reconstruction of the detected feature points.

There are several essential issues for reliable camera pose and trajectory estimation:

• The choice of camera, its geometric projection model, and a suitable calibration technique (Section 6.1.1).
• Image feature detection, description (Section 6.1.2), and robust relative motion estimation (Section 6.1.3).
• Robust 3D structure computation (Sections 6.2 and 6.3).
• The choice of a suitable omnidirectional image stabilization and rectification method (Section 6.4).

Moreover, the pipeline naturally extends to unordered images, which can be regarded as a sequence after being ordered using an image indexing method based on visual words and a visual vocabulary [87, 50], as described in Section 6.5.2.
Figure 6.1: Kyocera Finecam M410R camera and Nikon FC-E9 fish-eye lens convertor.
6.1 Robust Estimation of Relative Camera Motion

Hereafter, we describe the details of our pipeline with some illustrative examples.
6.1.1 Camera Calibration

The setup used in this work is a combination of a Nikon FC-E9 lens, mounted via a mechanical adaptor, and a Kyocera Finecam M410R digital camera, which has already been introduced in Chapter 4, see Figure 6.1. The Nikon FC-E9 is a megapixel omnidirectional add-on convertor with a 183° view angle which provides high-quality images. The Kyocera Finecam M410R delivers 2,272×1,704 pixel images at 3 frames per second. The resulting combination yields a circular view of a diameter slightly under 1,600 pixels in the image.

The calibration of omnidirectional cameras is non-trivial but crucial for achieving a good accuracy of the resulting 3D reconstruction. As in Chapter 4, our camera is calibrated off-line using the state of the art technique [6] and Mičušík's two-parameter model [67], which links the radius of the image point r to the angle θ of its corresponding ray w.r.t. the optical axis as

$\theta = \frac{a r}{1 + b r^2}.$   (6.1)

After a successful calibration, we know the correspondence of the image points to the 3D optical rays in the coordinate system of the camera. The following steps aim at finding the transformation between the camera and the world coordinate systems, i.e. the pose of the camera in the 3D world, using 2D image matches.
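A minimal sketch of the pixel-to-ray mapping under the two-parameter model (6.1) follows. The placement of the distortion centre, the orientation of the image axes, and the assumption that a and b are calibrated so that θ comes out in radians are choices of this sketch, not details taken from the thesis.

```python
import numpy as np

def fisheye_pixel_to_ray(u, center, a, b):
    """Unit ray for an image point under the two-parameter model (6.1),
    theta = a*r / (1 + b*r**2), with theta measured from the optical axis."""
    d = np.asarray(u, dtype=float) - np.asarray(center, dtype=float)
    r = np.linalg.norm(d)
    theta = a * r / (1.0 + b * r * r)     # angle from the optical axis
    psi = np.arctan2(d[1], d[0])          # azimuth of the image point
    return np.array([np.sin(theta) * np.cos(psi),
                     np.sin(theta) * np.sin(psi),
                     np.cos(theta)])
```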
Figure 6.2: Wide baseline image matching. The colors of the dots correspond to the detectors (red) MSER-Intensity+ and (blue) MSER-Intensity−. (a) All detected features. (b) Tentative matches constructed by selecting pairs of features which have the mutually closest descriptors. (c) The epipole (green ) computed by maximizing the support. Note that the scene dominated by a single plane does not induce degeneracy in computing calibrated epipolar geometry by solving the 5-point minimal relative pose problem.
6.1.2 Detecting Features and Constructing Tentative Matches

For computing the 3D structure, a set of tentative matches is constructed by detecting image features. We have tested several feature detectors: Maximally Stable Extremal Regions (MSER) [62], Laplacian-Affine and Hessian-Affine [69], the Scale-Invariant Feature Transform (SIFT) [57], and Speeded Up Robust Features (SURF) [7]. We can conclude that the choice of the feature detector is not crucial for the resulting 3D models. We use MSER and SIFT since they have the potential to match features under large changes of view direction and are more efficient than the features from [69]. Parameters of the detectors were chosen to limit the number of regions to 1-2 thousand per image.

For MSER, the detected regions are assigned Local Affine Frames (LAF) [78] and transformed into the standard positions w.r.t. their LAFs. Discrete Cosine Transform (DCT) descriptors [79] are computed for each region in the standard position. For SIFT, keypoints are detected
based on the Difference of Gaussians (DoG) and SIFT keypoint descriptors are created from sets of histograms of the gradient information computed from the neighbourhoods of the keypoints. Finally, tentative matches are constructed by searching for the mutually closest descriptors between the given images. We use the Fast Library for Approximate Nearest Neighbors (FLANN) [71] which performs approximate nearest neighbour search based on random kd-trees. Figures 6.2(a) and (b) show two examples of feature detection and matching for pairs of wide baseline images.

When all camera motions between consecutive frames are small and moderate, short baseline matching using simpler image features [20], described in Chapter 4, can be used efficiently under assumptions on the proximity of the consecutive projections. However, in practical situations, some frames may be contaminated or lost due to unexpected accidents, e.g. an extremely fast camera movement, while acquiring a long sequence. The view point and direction can change a lot between the usable consecutive frames and the short baseline matching often fails. By using wide baseline matching, one can handle such situations as it is possible to make a link between the non-contaminated frames.
6.1.3 Epipolar Geometry Computation by RANSAC + Soft Voting

The 3D structure can be robustly computed by RANSAC [26] which searches for the largest subset of the set of tentative matches which is, within a predefined threshold ε, consistent with an epipolar geometry [39]. We use ordered sampling as suggested in [16] to draw 5-tuples from the list of tentative matches, which may help to reduce the number of samples in RANSAC. From each 5-tuple, the relative pose is computed by solving the 5-point minimal relative pose problem for calibrated cameras [73, 93]. Figure 6.2(c) shows the results of computing the epipolar geometry for two pairs of wide baseline images.

Ordered Randomized Sampling. Samples are drawn from tentative matches ordered ascendingly by the distance of their descriptors as suggested in [16]. On the other hand, we keep the original RANSAC stopping criterion [39] and limit the maximum number of samples to 1,000. We have observed that pairs which could not be solved in 1,000 samples almost never got solved even after many more samples. Using the stopping criterion from [16] often leads to ending the sampling prematurely since the criterion is designed to stop as soon as a large non-random set of matches is found. Our objective is, however, to find a globally good model and not to stop as soon as a local model having a sufficiently large support is found.

Orientation Constraint. A given essential matrix can be decomposed into four different camera and point configurations which differ by the orientations of the cameras and points [39]. Without enforcing the constraint that all points have to be observed in front of the cameras, some epipolar geometries may be supported by many matches even though it may not be possible to reconstruct all points correctly, i.e. in front of both cameras.
For omnidirectional cameras, the meaning of infrontness is a generalization of the classical infrontness for perspective cameras. With perspective cameras, a point X is in front of the camera when it has a positive z coordinate in the camera coordinate system. For omnidirectional cameras, a point X is in front of the camera if its coordinates can be written as a positive multiple of the direction vector which represents the half-ray by which X has been observed. In general, it is beneficial to use only the matches which generate points in front of the cameras. However, it takes time to verify this for all matches. On the other hand, it is fast to verify whether the five points in the minimal sample generating the epipolar geometry can be reconstructed in front of both cameras and to reject such epipolar geometries which do not allow it. Furthermore, the orientation constraint on average reduces the computational cost because it avoids evaluating the residuals corresponding to many incorrectly estimated camera motions.

Soft Voting. We vote in a two-dimensional accumulator for the estimated motion direction. However, unlike in [54, 75], we do not cast votes directly by each sampled epipolar geometry but by the best epipolar geometries recovered by the ordered sampling of PROSAC. This way the votes come only from the geometries that have a very high support. We can afford to compute more, e.g. 5, epipolar geometries since the ordered sampling is much faster than the standard RANSAC. Altogether, we need to evaluate at most 1,000 × 5 = 5,000 samples to generate 5 soft votes, which is comparable to running a standard 5-point RANSAC for the expected contamination by 77% of mismatches [39]. Yet, with our technique, we could go up to 98.5% of mismatches with a comparable effort. Finally, the relative camera pose with the motion direction closest to the maximum in the voting space is selected. The proposed robust estimation of relative camera motion is summarized as pseudo code in Algorithm 2 with the actual parameters used in the real experiments.
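The orientation constraint can be made concrete as follows: decompose the essential matrix into its four candidate motions and keep the one that reconstructs the most points in front of both cameras, where "in front" means positive ray scales in the sense of Eq. (6.2). The sign conventions are handled by testing all four candidates; the NumPy formulation is a sketch, not the thesis implementation.

```python
import numpy as np

def decompose_essential(E):
    """Four candidate (R, t) decompositions of an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t = U[:, 2]
    return [(U @ W @ Vt, t), (U @ W @ Vt, -t),
            (U @ W.T @ Vt, t), (U @ W.T @ Vt, -t)]

def in_front(R, t, x, xp):
    """Omnidirectional cheirality: with unit rays x, x', solve
    alpha' x' = alpha R x - t for the two ray scales and require both
    to be positive."""
    A = np.column_stack((R @ x, -xp))
    alpha, alpha_p = np.linalg.lstsq(A, t, rcond=None)[0]
    return alpha > 0 and alpha_p > 0

def pick_decomposition(E, matches):
    """Select the decomposition that reconstructs most points in front."""
    best, best_cnt = None, -1
    for R, t in decompose_essential(E):
        cnt = sum(in_front(R, t, x, xp) for x, xp in matches)
        if cnt > best_cnt:
            best, best_cnt = (R, t), cnt
    return best
```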
6.2 Measuring the Amount of Camera Translation by DAA

Consider a pair of calibrated cameras with the normalized camera matrices [39], $P = [I \mid 0]$ and $P' = [R \mid -t]$, and an image point correspondence given by a pair of homogeneous coordinates $(x, x')$ represented by unit direction vectors, i.e. $\|x\| = \|x'\| = 1$. There holds

$\alpha' x' = \alpha R\, x - t,$   (6.2)

with real $\alpha$, $\alpha'$, rotation R, and translation t. If there was no noise, pure camera rotation, i.e. t = 0, could be detected by finding out that $x' = R\, x$ holds true for all the correspondences. However, this does not occur even if the physical camera really rotates, due to noise in image measurements. Thus, in real situations, a non-zero essential matrix E can always be computed from noisy image matches, e.g. by the 5-point algorithm [73].
Algorithm 2 Robust estimation of relative camera motion
Input: $I_1$, $I_2$ — image pair; $v_{max} := 5$ — number of soft votes; $t_{max} := 1000$ — maximum number of random samples; $\varepsilon := 0.1°$ — tolerance for establishing matches; $\eta := 0.9999$ — RANSAC confidence; $\sigma := 0.4°$ — standard deviation of the Gaussian kernel for soft voting.
Output: $E^*$, $M^*$ — relative camera motion and its supports.
I. Detect and describe features, (MSER-INT±, LAF+DCT) [78, 79] and (SIFT) [57].
II. Construct the list M of m tentative matches with mutually closest descriptors. Order the list ascendingly by the distance of the descriptors.
III. Find the camera motion consistent with the tentative matches [100]:
1: Set D to zero. // Initialize the accumulator of camera translation directions.
2: for i := 1, . . . , $v_{max}$ do
3:   t := 0 // The counter of samples.
4:   while t ≤ $t_{max}$ do
5:     t := t + 1 // New sample.
6:     Select the 5 tentative matches $M_5 \subset M$ of the t-th sample [16].
7:     $E_t$ := the essential matrix obtained by solving the 5-point problem for $M_5$ [73, 93].
8:     if $M_5$ can be reconstructed in front of the cameras [39] then
9:       $s_t$ := the number of matches consistent with $E_t$, i.e. the number of all matches $m = [u_1, u_2] \in M$ for which $\max\big((u_1, E_t u_2), (u_2, E_t^{\top} u_1)\big) < \varepsilon$.
10:    else
11:      $s_t$ := 0
12:    end if
13:    $r_t := \log(\eta) / \log\big(1 - s_t^5 / m^5\big)$ // The termination length defined by the maximality constraint [39].
14:    $t_{max}$ := min($t_{max}$, $r_t$) // Update the termination length.
15:  end while
16:  $\hat{t} := \arg\max_{t = 1, \dots, t_{max}} s_t$ // The index of the sample with the highest support.
17:  $\hat{E}_i := E_{\hat{t}}$, $\hat{e}_i$ := the camera motion direction for the essential matrix $E_{\hat{t}}$.
18:  Vote in the accumulator D by the Gaussian with sigma σ and the mean at $\hat{e}_i$.
19: end for
20: $\hat{e} := \arg\max_{x \in \mathrm{domain}(D)} D(x)$ // Maximum in the accumulator.
21: $i^* := \arg\min_{i = 1, \dots, v_{max}} (\hat{e}, \hat{e}_i)$ // The motion closest to the maximum.
22: $E^* := \hat{E}_{i^*}$ // The "best" camera motion.
23: $M^* := \{m \in M : m$ is consistent with $E^*\}$ // Inlier matches supporting $E^*$.
IV. Return $E^*$ and $M^*$.
Figure 6.3: The apical angle τ at the point X reconstructed from the correspondence $(x, x')$ depends on the length of the camera translation $\|t\|$ and on the distances of X from the camera centers C and C'.

Having m matches $(x_i, x'_i)$ and the essential matrix E computed from them, we can reconstruct m 3D points $X_i$. Figure 6.3 shows a point X reconstructed from an image match $(x, x')$. For each point X, the apical angle τ, which measures the length of the camera translation from the perspective of the point X, is computed. If the cameras are related by pure rotation, all angles τ are equal to zero. The larger the camera translation, the larger the angles τ; the closer the point X to the midpoint of the camera baseline, the larger the corresponding τ. In fact, measuring the apical angles is equivalent to measuring disparities on a spherical retina, as the corresponding angle, i.e. the apical angle τ, is easily computed with the relative rotation R such that

$\tau = (R\, x, x').$   (6.3)
For a given E and m matches $(x_i, x'_i)$, one can select the decomposition of E into R and t which reconstructs the largest number of 3D points in front of the cameras. The apical angle $\tau_i$ corresponding to the match $(x_i, x'_i)$ is computed by solving the set of linear equations for the relative distances $\alpha_i$, $\alpha'_i$

$\alpha'_i x'_i = \alpha_i R\, x_i - t$   (6.4)

in the least squares sense and by using the law of cosines

$2\, \alpha_i \alpha'_i \cos(\tau_i) = \alpha_i^2 + {\alpha'_i}^2 - \|t\|^2.$   (6.5)

For a small translation w.r.t. the distance to the scene points, it is natural to use the approximation $\alpha'_i = \alpha_i$. Then, the apical angle $\tau_i$ becomes a linear function of $\|t\|$. This is instantly proven by using the approximated law of cosines

$\cos(\tau_i) = 1 - \frac{\|t\|^2}{2 \alpha_i^2}$   (6.6)

and the cosine series expansion

$\cos(\tau_i) = 1 - \frac{\tau_i^2}{2!} + O(\tau_i^4).$   (6.7)

If all matches were correct, the largest $\tau_i$ would best represent the amount of the translation. However, all matches are rarely correct and thus we need to find a robust measure of the translation. The distribution of the values of $\tau_i$ depends on the distribution of the points in the scene and on mismatches, if present. We have observed that for many general 3D as well as planar scenes, the distribution has a dominant mode

$\tau^* = \arg\max g(\tau_i),$   (6.8)

where $g(\tau_i)$ performs kernel voting with Gaussian smoothing [54], and that the mode $\tau^*$ predicts the length of the translation well. The pseudo code of the DAA computation is listed in Algorithm 3.

Algorithm 3 Measuring camera motion by computing the dominant apical angle
Input: E, M — relative camera motion and its m supports; $\sigma = 0.4°$ — standard deviation of the Gaussian kernel for soft voting.
Output: $\tau^*$ — dominant apical angle.
1: Decompose E into the rotation R and the translation t [39].
2: for i := 1, . . . , m do
3:   Compute the apical angle $\tau_i$ from the match $m_i \in M$, R, and t (see Section 6.2).
4: end for
5: $q^{10}$ := the 10th percentile of all $\tau_i$. // Lower bound on apical angles.
6: $q^{90}$ := the 90th percentile of all $\tau_i$. // Upper bound on apical angles.
7: for i := 1, . . . , m do
8:   if $q^{10} < \tau_i < q^{90}$ then
9:     Vote in accumulator B by the Gaussian with sigma σ and the mean at $\tau_i$.
10:  end if
11: end for
12: $\tau^* := \arg\max_{y \in \mathrm{domain}(B)} B(y)$ // Maximum in the accumulator.
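A compact sketch of this computation is shown below: apical angles are taken as the rotation-compensated ray disparities of Eq. (6.3), the 10th-90th percentile band is kept, and a Gaussian kernel vote over a 1D grid picks the dominant mode. The grid resolution and the NumPy formulation are choices of the sketch, not taken from the thesis.

```python
import numpy as np

def dominant_apical_angle(R, matches, sigma_deg=0.4, grid_deg=0.01):
    """Dominant apical angle (DAA) from the inlier matches of one epipolar
    geometry; `matches` holds pairs of unit rays (x, x') in the two images."""
    taus = []
    for x, xp in matches:
        c = np.clip((R @ x) @ xp, -1.0, 1.0)
        taus.append(np.degrees(np.arccos(c)))       # Eq. (6.3)
    taus = np.asarray(taus)
    lo, hi = np.percentile(taus, [10, 90])          # percentile filtering
    kept = taus[(taus > lo) & (taus < hi)]
    if kept.size == 0:
        return 0.0
    grid = np.arange(0.0, kept.max() + 5 * sigma_deg, grid_deg)
    votes = np.exp(-0.5 * ((grid[:, None] - kept[None, :]) / sigma_deg) ** 2).sum(axis=1)
    return float(grid[np.argmax(votes)])            # mode of the kernel vote
```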
6.2.1 Too Small Motion Detection on Simulated Data

Figures 6.4, 6.5, and 6.6 show the results of simulated experiments for three different scenes, different motion directions, and for the length of the translation increasing from zero to a large value. The amount of camera translation was computed by the method based on RANSAC [26] which is described in Algorithm 2. Notice that we use a combination of ordered sampling [16] with kernel voting to maximize the chance of recovering the correct epipolar geometry [100]. We also enforce the reconstructed points to be in front of both cameras before counting the support size in RANSAC.
Figure 6.4: Measuring the length of the camera translation for a general 3D scene. (a) Dominant apical angle. Noisy data: blue line with “+” markers – backward motion, red line with “×” markers – lateral motion; exact data: green lines with “◦” and “ ” – backward and lateral motions, respectively. (b) Camera rotation error. (c) Camera translation error.
Figure 6.4 shows an experiment with a general 3D scene consisting of 1,000 points uniformly distributed in a hemisphere with the center at [0, 0, 10] and radius 25. The first camera was placed at [0, 0, 0] looking towards the scene points. Two motions of the second camera were tested. The backward motion was constructed as $[0, 0, -s]^\top$, i.e. the camera was moving away from the scene. The sideways motion was constructed as $[s, 0, 0]^\top$. In both cases, s ranged from 0 to 5. 3D points were projected by normalizing their coordinate vectors, constructed w.r.t. the respective camera coordinate systems, to unit length. To simulate the imprecision due to image sampling during digitization and image measurement, Gaussian noise with a standard deviation σ = 3°, corresponding to 1.3 pixels in an 800×800 pixel image capturing a 180° field of view, was added to the normalized vectors.

Figure 6.4(a) shows the dominant apical angle (DAA) as a function of the length of the true translation. The DAA for the backward motion is shown by the blue line with “+” markers, whereas the DAA for the lateral motion is shown by the red line with “×” markers, both computed from noisy measurements. The green lines with “◦” and “ ” markers show the respective DAA of the backward and lateral motions computed from exact measurements. We see that the DAA is a linear function of the length of the true motion for translations longer than 0.25 meters. The slope of the lateral DAA is slightly larger (2.5°/meter) than the slope of the DAA for the backward motion (2.0°/meter) in this case. The DAA of the zero translation computed from noisy matches is slightly above zero due to the noise in image measurements.

Figure 6.4(b) shows the difference of the estimated camera rotation $R_{est}$ w.r.t. the true rotation R evaluated as the angle of rotation of $R_{est}^{-1} R$. Notice that the error is constant for all lengths of the translation, which shows that the rotation is computed correctly even if the direction of the translation, Figure 6.4(c), cannot be found reliably.

Figures 6.5 and 6.6 show the same experiment as above on a planar scene and on a 3D scene consisting of two planes. The results are comparable to those shown in Figure 6.4. In particular, we can see that we are able to measure the amount of translation in all three cases.
Figure 6.5: Measuring the length of the camera translation for a planar scene. (a) Dominant apical angle. Noisy data: blue line with “+” markers – backward motion, red line with “×” markers – lateral motion; exact data: green lines with “◦” and “ ” – backward and lateral motions, respectively. (b) Camera rotation error. (c) Camera translation error.
Figure 6.6: Measuring the length of the camera translation for the scene consisting of two planes. (a) Dominant apical angle. Noisy data: blue line with “+” markers – backward motion, red line with “×” markers – lateral motion; exact data: green lines with “◦” and “ ” – backward and lateral motions, respectively. (b) Camera rotation error. (c) Camera translation error.
It is interesting to notice that the error in rotation is constant for general 3D scenes, Figures 6.4(b) and 6.6(b), but grows linearly for the planar scene, Figure 6.5(b). This reflects the fact that the angle occupied by the scene points determines, to a large extent, the quality of rotation estimation from scenes with shallow depth. At the same time, we can see that the quality of estimating the amount of camera translation has not been affected.
6.3 Sequential Wide Baseline Structure from Motion

Camera poses in a canonical coordinate system are recovered by chaining the EGs of pairs of consecutive images in a sequence. The essential matrix $E_{ij}$ encoding the relative camera pose between frames i and j can be decomposed into $E_{ij} = [t_{ij}]_\times R_{ij}$. As stated earlier, there exist four possible decompositions and the right one can be selected as that which reconstructs the largest number of 3D points in front of both cameras. Having the normalized camera matrix [39] of the i-th frame $P_i = [R_i \mid t_i]$, the normalized camera matrix $P_j$ can be computed by

$P_j = [R_{ij} R_i \mid R_{ij} t_i + \gamma_{ij} t_{ij}],$   (6.9)

where $\gamma_{ij}$ is the scale of the translation between frames i and j in the canonical coordinate system. This scale can be computed from any 3D point seen in at least three consecutive frames but the precision depends on the uncertainty of the reconstructed 3D point. Therefore, a robust selection from the possible candidate scales has to be done while evaluating the quality of the computed camera pose. The best scale is found by RANSAC maximizing the number of points that pass the cone test, see Chapter 7, which checks the intersection of pixel ray cones. In this case, only quarter-pixel wide pyramids formed by four planes are cast around the matches and it is tested whether their intersection is empty or not, because high accuracy is demanded.

In contrast to standard sequential SfM techniques which compute camera translation and rotation from the estimated 3D point cloud, we compute camera rotation and translation from EG, as it has been shown in [96] that computing camera rotation from EG is more accurate. In contrast to the decoupled SfM technique presented in the aforementioned paper, the maximum number of samples in our scale selection is bounded by the number of 3D points seen in three consecutive frames because we do not need to draw pairs of 2D-3D matches to compute the camera translation.

Algorithm 4 Keyframe selection
Input: $I_i$, i = 1, . . . , n — input images; $\tau_{min} = 0.2°$–$2°$ — minimum amount of translation.
Output: $k_i$, i = 1, . . . , n — keyframe flags.
1: $k_1$ := TRUE, set $k_i$ := FALSE for i = 2, . . . , n.
2: i := 1
3: while i < n do
4:   j := 0, q := TRUE
5:   while (q = TRUE) ∧ (i + j < n) do
6:     j := j + 1
7:     Compute the relative motion $E_{i,i+j}$ between $I_i$ and $I_{i+j}$.
8:     m := number of supports of $E_{i,i+j}$.
9:     $\tau^*$ := DAA computed from $E_{i,i+j}$ and its supports.
10:    $\omega^*$ := sum of the weighted apical angles computed from $E_{i,i+j}$ and its supports.
11:    q := ($\tau^* < \tau_{min}$) ∧ ($\omega^* < m$)
12:  end while
13:  if q = FALSE then
14:    i := i + j
15:    $k_i$ := TRUE
16:  end if
17: end while
On the other hand, when recovering also the camera translation from EG, one must take care whether or not a particular EG contains a sufficient amount of translation, because EGs inaccurately computed from image pairs having too small translation disturb the chaining of camera poses and do not contribute by reconstructing new 3D points in the scene. It is also important that a sufficient number of 3D points with large apical angles exists in the pairwise reconstruction for obtaining accurate scales when chaining the EGs. For producing the EGs capable of stable recovery of camera poses and trajectory, we propose to use only the images that satisfy one of two quality scores computed between the pairs of consecutive frames. We will call such images keyframes.

One of the quality scores is the DAA $\tau^*$, which has been described in Section 6.2 already. Setting the minimum amount of DAA to 0.2°–2° enables detecting too small translations. The other quality score $\omega^*$ is the sum of the weighted scores computed based on apical angles. The weighted score $\omega_i$ for the apical angle $\tau_i$ of a 3D point $X_i$ is defined as

$\omega_i = q'_i + 4 q''_i + 20 q'''_i,$   (6.10)

$q'_i = 1$ if $\tau_i \geq 5°$, and $0$ otherwise,   (6.11)

$q''_i = 1$ if $\tau_i \geq 10°$, and $0$ otherwise,   (6.12)

$q'''_i = 1$ if $\tau_i \geq 15°$, and $0$ otherwise.   (6.13)
Quality score ω ∗ checks whether there is a sufficient number of 3D points with sufficiently large apical angles. The threshold value for ω ∗ is set to the number of the reconstructed 3D points, i.e. either all the reconstructed points must have the apical angles at least 5◦ or some of them are having even larger apical angles as the weighting constants are favouring such 3D points. The pseudo code of keyframe selection is summarized in Algorithm 4. Note that Algorithm 4 may not select the last frame of the sequence as a keyframe. In that case, we regard the last keyframe as the end of the sequence. After recovering the camera poses and 3D points using only the keyframes, the camera poses corresponding to the images not selected as the keyframes are estimated by solving the camera resectioning task [74]. Since every non-keyframe is interleaved between two keyframes, the tentative 2D-3D matches are efficiently constructed by extracting the 3D points associated to the two keyframes. RANSAC is used to find the camera pose having the largest support of the tentative 2D-3D matches evaluated by the cone test again. Local optimization is achieved by repeated camera pose computation from all the inliers [86] via SDP and SeDuMi [94]. If the inlier ratio is higher than 70%, camera resectioning is considered successful. In the final step, very distant points, i.e. likely outliers, are filtered out and sparse bundle adjustment [56] modified in order to work with unit vectors refines both points and cameras.
Figure 6.7: Transformation between a pixel u of the resulting cylindrical image and a unit vector p on a unit sphere. Column index ui is transformed into angle θ and row index uj into angle φ. These angles are then transformed into the coordinates px , py , and pz of a unit vector. (a) Central cylindrical projection. (b) Non-central cylindrical projection.
6.4 Omnidirectional Image Stabilization

The recovered camera poses and trajectory can be used to rectify the original images to the stabilized ones. Image stabilization is beneficial e.g. for facilitating visual object recognition where (i) objects can be detected in canonical orientations and (ii) ground plane position can further restrict feasible object locations.
6.4.1 Image Rectification Using Camera Pose and Trajectory

If there exists no constraint on camera motion in the sequence, the simplest way of stabilization is to rectify the images w.r.t. the up vector in the coordinate system of the first camera; all other images are then aligned with the first one. This requires that the first image is captured with care. When the sequence is captured by walking or driving on roads, the images can be stabilized w.r.t. the ground plane under the natural assumption that the motion direction is parallel to the ground plane. For the fixed gravity direction g and the motion direction t, we compute the normal vector of the ground plane as

d = (t × (g × t)) / ‖t × (g × t)‖.   (6.14)

Then, we construct the stabilization and rectification transform Rs = [a, d, b] for the image point represented as a unit 3D vector, where

a = ([0, 0, 1] × d) / ‖[0, 0, 1] × d‖   (6.15)

and

b = (a × d) / ‖a × d‖.   (6.16)
This rectification preserves yaw (azimuth) which is sufficient for producing panorama images having the same field of view as the original images.
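The construction of Rs from Equations (6.14)–(6.16) is compact enough to be sketched directly. The function names are illustrative, and whether the stabilized ray is obtained by applying Rs or its transpose depends on the convention used for image points, which is not fixed by this excerpt.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def stabilization_transform(g, t):
    """Build Rs = [a, d, b] from Equations (6.14)-(6.16).

    g is the gravity direction and t the motion direction; d is the ground
    plane normal, while a and b complete an orthonormal basis that preserves
    yaw (azimuth).
    """
    g, t = np.asarray(g, float), np.asarray(t, float)
    d = normalize(np.cross(t, np.cross(g, t)))               # (6.14)
    a = normalize(np.cross(np.array([0.0, 0.0, 1.0]), d))    # (6.15)
    b = normalize(np.cross(a, d))                            # (6.16)
    return np.column_stack([a, d, b])

# A stabilized ray is then obtained by applying Rs (or its transpose,
# depending on the adopted convention) to the unit vector of an image point.
```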
6.4.2 Central and Non-central Cylindrical Image Generation

Having perspective cutouts rectified w.r.t. the ground plane, an arbitrary object recognition routine designed to work with images acquired by perspective cameras can be used without any further modifications. Furthermore, some object recognition methods, e.g. [51], could benefit from image stabilization. On the other hand, as a true perspective image is able to cover only a small part of the available omnidirectional view field, we propose to use cylindrical images which can cover a much larger part of it. Knowing the camera and lens calibration, we represent our omnidirectional image as a part of the surface of a unit sphere, each pixel being represented by a unit vector. It is straightforward to project such a surface onto the surface of a unit cylinder surrounding the sphere using rays passing through the center of the sphere, see Figure 6.7. We transform the column index ui of a pixel of the resulting cylindrical image into angle θ and the row index uj into angle φ using

θ = (ui − IW/2) · θmax / IW,   (6.17)
φ = arctan((uj − IH/2) · θmax / IW),   (6.18)

where IW and IH are the dimensions of the resulting image and θmax is the horizontal field of view of the omnidirectional camera. These angles are then transformed into the coordinates px, py, and pz of a unit vector as

px = cos φ sin θ,  py = sin φ,  pz = cos φ cos θ.   (6.19)
Note that the top and bottom of the rectified image look rather deformed as the vertical field of view approaches π when the height of the resulting image IH is increased, see Figure 6.8. We propose to use a generalization of the stereographic projection which we call a non-central cylindrical projection. Projecting rays do not pass through the center of the sphere but are cast from points on its equator. The desired point is the intersection of the plane determined by the column of the resulting image and the center of the sphere with the equator of the sphere. The equation for angle θ remains the same but angle φ is now computed using

φ = 2 arctan((uj − IH/2) · θmax / (2 IW)).   (6.20)

When generating the images, bilinear interpolation is used to suppress the artifacts caused by image rescaling.

Figure 6.8: Omnidirectional image rectification. (a) Original omnidirectional image (equiangular). (b) Central cylindrical projection. (c) Perspective projection. (d) Non-central cylindrical projection. Note the large deformation at the borders of the perspective image and at the top and bottom borders of the central cylindrical image. The borders of the non-central cylindrical image are less deformed.
Figure 6.9: Camera trajectory of sequence CITY WALK. (a) The sequence contains moving objects occluding large parts of the view, rapid changes of illumination, and a natural complex environment. (b) A bird’s eye view of the city area used for the acquisition of our test sequence. The trajectory is drawn with a red line. (c) The bird’s eye view of the resulting 3D model. Red cones represent the keyframe camera poses recovered by our SfM. Blue cones represent the camera poses of the non-keyframes. Small dots represent the reconstructed world 3D points.
Figure 6.10: Results of image transformations in sequence CITY WALK. The images are stabilized w.r.t. the ground plane and transformed into panoramic images by (a) the central cylindrical projection and (b) the non-central cylindrical projection. Note that the pedestrians are less deformed when using the non-central cylindrical projection while it conveys a larger field of view than the central one.
6.5 Experimental Results

Next, the presented pipeline is tested using different ordered image sets and an additional step which allows for reconstructing easy unordered image sets is also demonstrated.
6.5.1 Omnidirectional Image Sequences

The experiments with real data demonstrate the use of the proposed image stabilization method. Five image sequences of a city scene captured by a single hand-held fish-eye lens camera are used as our input.

CITY WALK. Sequence CITY WALK is 949 frames long and the distance between consecutive frames is 0.2–1 meters. This sequence is challenging for recovering the camera trajectory due to sharp turns, objects moving in the scene, large changes of illumination, and a natural complex environment, see Figure 6.9(a). The camera motions are reasonably recovered by using the features detected on stationary rigid objects. Figure 6.9(c) shows the obtained camera poses and the world
Figure 6.11: Detection of “too small motion” demonstrated on sequence GO AND STOP. (a) and (b) Pair of images with too small motion. (c) and (d) Pair of images with a sufficiently large motion. Green shows the relative camera motion direction estimated from the pairs of images (a) and (b), and (c) and (d), respectively. The red and yellow dots are the tentative matches and the supports of the motion, respectively. It can be seen that the motion direction is estimated incorrectly when the motion is too small even though the size of the support is sufficiently large. (e) The DAA computed from pairs of consecutive images in the sequence. (f) and (g) The recovered camera poses and trajectory of keyframes (red cones) and non-keyframes (blue cones), and the world 3D points (color dots) from two different views.
3D points reconstructed by our SfM. The red cones represent the keyframe camera poses while the blue cones represent the non-keyframe camera poses computed by solving
camera resectioning. The reconstructed camera trajectory fits well the walking trajectory shown in Figure 6.9(b). Since the sequence was captured while walking along a planar street, all the images were stabilized w.r.t. the ground plane using the recovered camera poses and trajectory. Figure 6.10 shows the images generated using the central and the non-central cylindrical projections. It can be seen that the non-central cylindrical projection in Figure 6.10(b) successfully suppresses the deformation at the top and the bottom of the image and makes people standing close to the camera look much more natural.

GO AND STOP. Sequence GO AND STOP is 303 frames long and the distance between consecutive frames is 0–1 meters. The observer was standing still at a fixed spot in frames 1–14, 51–68, and 157–170, otherwise walking along a street. We can detect when the observer was standing by finding the “too small” DAA in the graph in Figure 6.11(e), which shows the DAAs between every pair of consecutive frames. In Figures 6.11(a)–(d), green shows the relative camera motion direction estimated from the pairs of images (a) and (b), and (c) and (d), respectively. The red and yellow dots are the tentative matches and the supports of the motion, respectively. It can be seen that the motion direction is estimated incorrectly when the motion is too small even though the size of the support is sufficiently large. Figures 6.11(f) and (g) show the camera poses and the world 3D points reconstructed by our SfM visualized from two different viewpoints. Again, the red cones represent the keyframe camera poses and the blue cones represent the non-keyframe camera poses.

ABNORMAL MOTION. Sequence ABNORMAL MOTION is 410 frames long and the distance between consecutive frames is 0.2–1 meters. The observer was walking along a street while performing abnormal motions three times, as marked in yellow in Figure 6.12(d). Figures 6.12(a), (b), and (c) show the frames 100–115, 333–348, and 381–396 respectively, where the abnormal motions were performed. Figure 6.12(d) shows a bird’s eye view of the city area used for the acquisition with the computed camera poses of the keyframes superimposed as red dots. Figure 6.12(e) shows the camera poses of the keyframes (red cones) and of the non-keyframes (blue cones), and the world 3D points (color dots). The significant utility of wide baseline SfM on large field of view images can be seen in the reliable recovery of a sequence containing abnormal motions, which are fatal for classic sequential SfM methods working under the assumption of limited motions.

FREE MOTION. Sequence FREE MOTION is 187 frames long and the distance between consecutive frames is 0.2–1 meters. This sequence is also challenging for recovering the camera poses and trajectory due to the large view changes caused by extreme camera rotation and translation. Figure 6.13(b) shows several examples of the original images in the sequence. Figure 6.13(a) shows the camera poses recovered by our SfM visualized from a bird’s eye view. Figure 6.13(c) shows the panoramic images generated by
Figure 6.12: Camera poses and trajectory of sequence ABNORMAL MOTION. The sequence was acquired with abnormal motions at frames (a) 100–115, (b) 333–348, and (c) 381–396 while walking along a street. (d) The recovered camera poses of the keyframes superimposed on a bird’s eye view map. (e) The recovered camera poses of the keyframes (red cones) and of the non-keyframes (blue cones), and the world 3D points (color dots) from a bird’s eye view. Detailed views of the recovered camera poses in (a), (b), and (c) are shown on the right side.
the non-central cylindrical projection. As the motion is not related to the ground plane in this sequence, all images are stabilized w.r.t. the gravity vector in the coordinate system of the first camera. Figure 6.13(d) shows the panoramic images stabilized using the recovered camera poses and trajectory. It can be seen clearly from this result that even large image rotations are successfully canceled using the recovered camera poses and trajectory.

PED DETECTION. Sequence PED DETECTION is 404 frames long and the distance between consecutive frames is 0–1 meters. The images are stabilized w.r.t. the ground plane using the estimated trajectory and rectified by the non-central cylindrical projection, see Figure 6.14. The multi-body pedestrian tracker [22, 25] is applied to the sequence of the stabilized cylindrical images and the results are shown in Figures 6.14(b)–(e). Thanks to proper image rectification, a pedestrian detector using Histograms of Oriented Gradients (HOG) [22] trained on perspective images could be used. The tracker greatly benefits from our ability to produce stabilized image sequences as it uses the ground plane position to reject false positive detections, which is otherwise possible only for sequences acquired by vehicle-mounted cameras.
6.5.2 Unordered Omnidirectional Images

We demonstrate the capability of our sequential SfM on unordered images by applying an image indexing method based on visual words and a visual vocabulary [87, 50]. These approaches are often used for selecting promising image pairs in methods reconstructing from unordered image sets, see Chapter 7, but here they are used just for ordering the images into a pseudo sequence. Image set CASTLE ENTRANCE originally consisted of three sequences acquired at different times. To demonstrate the ability of wide baseline SfM, we randomly selected 40 out of the 109 images of the whole data set to make the cameras sparser than in typical sequential images. Furthermore, 10 images of different locations were added as outliers, see Figures 6.15(a)–(d). For each image, a term frequency–inverse document frequency (tf-idf) vector [87, 50] was computed using a visual vocabulary containing 130,000 words trained from omnidirectional images of an urban area. Image similarities between pairs of images were computed as the cosines of the angles between the normalized tf-idf vectors. Then, a pseudo sequence was constructed by randomly selecting one image as the first frame and repeatedly concatenating the most similar image as the successive frame (a minimal sketch of this greedy ordering is given below). 33 camera poses were successfully recovered and none of the outlier frames were selected. See Figure 6.15(e) for the recovered camera poses of the keyframes (red cones), the non-keyframes (blue cones), and the world 3D points (color dots). To evaluate the accuracy of the camera pose estimation, the 109 camera poses of the same scene reconstructed by the state of the art randomized SfM method for unordered image sets, see Chapter 7, were used as the ground truth camera poses. We measured the error between the camera poses computed by our method and those computed by the randomized SfM method after pairing the corresponding image indices and finding the
Figure 6.13: Results of our image stabilization and transformation in sequence FREE MOTION. (a) The camera poses and the world 3D points reconstructed by our SfM visualized from a bird’s eye view. Red cones represent the keyframe camera poses recovered by our SfM. Blue cones represent the non-keyframe camera poses estimated by solving camera resectioning. (b) Original images. (c) Non-stabilized images. (d) Stabilized images w.r.t. the gravity vector in the first camera coordinates. Image rotations are successfully canceled and all images are stabilized using the recovered camera poses and trajectory.
Figure 6.14: Result of multi-body pedestrian tracking in sequence PED DETECTION on the cylindrical images stabilized w.r.t. the ground plane. (a) The camera poses and the world 3D points reconstructed by our SfM visualized from a bird’s eye view. In (b), (c), (d), and (e), the color boxes and curves indicate the positions of pedestrians and their trajectories estimated by using the previous frames.
Figure 6.15: Result of our SfM performed on artificially ordered omnidirectional images from image set CASTLE ENTRANCE. The image set consists of 40 typical landmark images of the entrance to the Prague Castle (a) and (b) and 10 images from other locations acting as outliers (c) and (d). (e) The camera poses and the world 3D points reconstructed from images ordered as a sequence by using image similarity computed based on visual words and visual vocabulary indexing. (f) Visualization of the camera poses (+) and trajectory (dashed line) estimated by our method and the camera poses (•) estimated by the randomized SfM, see Chapter 7.
similarity transform bringing the data from the two reconstructions into correspondence. The mean camera translation and rotation errors are 0.024 and 0.031, respectively, where the camera translation error is expressed as a fraction of the diameter of the smallest sphere containing all cameras and the camera rotation error is in radians. Both sets of camera poses can be seen in Figure 6.15(f).
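The greedy construction of the pseudo sequence mentioned above can be sketched as follows; the function name and the handling of the starting frame are assumptions of this illustration.

```python
import numpy as np

def build_pseudo_sequence(similarity, start=0):
    """Greedily order images into a pseudo sequence.

    similarity is the symmetric matrix of cosines between normalized tf-idf
    vectors. Starting from a (possibly randomly chosen) image, the most
    similar not-yet-used image is appended as the successive frame.
    """
    S = np.array(similarity, dtype=float)
    n = S.shape[0]
    used = np.zeros(n, dtype=bool)
    order = [start]
    used[start] = True
    for _ in range(n - 1):
        scores = np.where(used, -np.inf, S[order[-1]])  # ignore used images
        nxt = int(np.argmax(scores))
        order.append(nxt)
        used[nxt] = True
    return order
```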
Name               # Frm   # Key   τmin   Det.   Match   SfM   Res.
CITY WALK            949     503     1°    147      77     3    30
GO AND STOP          303      73     2°     41      30     2    35
ABNORMAL MOTION      410     198     1°     57      43     4    25
FREE MOTION          187     176   0.2°     32      18     3     2
PED DETECTION        404      74     2°     54      26     2    51
CASTLE ENTRANCE       50      31     2°      6       4   0.5   0.5
Table 6.1: Details of the experimental results for all sequences. (# Frm) The number of frames. (# Key) The number of keyframes selected and used in our wide baseline SfM. (τmin ) The minimum DAA, i.e. the minimum size of motion, in degrees. The rest is the computational time in different steps for each sequence in minutes. (Det.) Feature detection and description. (Match) Tentative match construction and EG computation. (SfM) Chaining EGs, scale estimation, and bundle adjustment. (Res.) Estimating camera poses of nonkeyframes.
6.5.3 Details of Experimental Settings and Computations

We used the same parameter values for all sequences except for the minimum DAA τmin, i.e. the minimum size of motion. The actual values used in the experiments are listed in Algorithms 2 and 3, and Table 6.1. For all sequences but FREE MOTION, there was no significant difference in the visual quality of the reconstruction as long as the minimum DAA τmin was set between 1° and 2°. In sequence FREE MOTION, we set τmin = 0.2°, which is smaller than in the other sequences, because larger values of the minimum DAA selected keyframes that were too sparse due to the lack of matches in consecutive frames, and thus the camera trajectory could not be recovered stably. The time spent in the different steps of the pipeline, having a MATLAB+MEX implementation running on a standard Intel Core2Duo PC, can be found in Table 6.1. The average computation time is about 18 seconds per frame and the performance can be further improved by using GPU implementations of feature detection and by speeding up the data storing processes which cache all the results used in the pipeline on a hard drive.
7 Modeling from Unordered Image Sets using Visual Indexing
The computational effort of large scale SfM from unordered image sets is dominated by image feature matching. In this chapter, we propose a novel SfM technique based on image pair similarity scores computed from the detected occurrences of visual words [77], which allows performing pairwise image feature matching only when it is likely to be successful. As the detection of visual words and the comparison of the constructed tf-idf vectors [87] is very fast, this leads to a significant speed-up while having only a small influence on the quality of the resulting model. Atomic 3D models constructed from camera triplets that share at least 100 points are used as the seeds which form the final large scale 3D model when merged together. Using three views instead of two allows us to reveal most of the outliers of pairwise geometries at an early stage of the process, preventing them from degrading the quality of the resulting 3D structure at later stages. Global optimization is replaced by faster, locally suboptimal optimization of partial reconstructions, which turns into a global technique when all parts are merged together. Cameras sharing fewer points are glued to the largest partial reconstruction during the final stage of the process. Our pipeline operates in the “easy first, difficult later” manner where pairwise matching and other computations are performed on demand. Therefore, it is possible to get a result close to optimal in the time available. Particular threshold values present at several places are the proposed values for obtaining a model whose quality is comparable to the results of the state of the art techniques using all pairwise matches. For easy data, there always exist many subsets of all pairwise matches that are sufficient for computing a reconstruction of reasonable quality, and using just a subset of pairwise matches instead of the whole set yields a much faster reconstruction. Our method can be viewed as a random selection of one of these subsets guided by the image similarity scores. Furthermore, unlike other state of the art techniques, our pipeline is able to work with both calibrated perspective and calibrated omnidirectional images, which broadens its usability.
7.1 Randomized Structure from Motion

The computation consists of four consecutive steps which are executed one after another: (i) computing image similarity matrix, (ii) constructing atomic 3D models from camera triplets, (iii) merging partial reconstructions, and (iv) gluing single cameras to the best partial reconstruction, see Figure 7.1. The input of the pipeline is an unordered set of
images acquired by cameras with known calibration. For perspective cameras, EXIF information can be used to obtain the focal length and we can assume the principal point to be in the middle of the image. Omnidirectional cameras have to be pre-calibrated according to the appropriate lens or mirror model [67].

Figure 7.1: Overview of the pipeline. Input images are described by SURF and an image similarity matrix is computed. Atomic 3D models are constructed from camera triplets, merged together into partial reconstructions, and finally single cameras are glued to the largest partial reconstruction.
7.1.1 Computing Image Similarity Matrix

First, up to thousands of Speeded Up Robust Features (SURF) [7] are detected and described in each of the input images. Image feature descriptors are quantized into visual words according to a vocabulary containing 130,000 words computed from urban area images [50]. Assignment is done by the Fast Library for Approximate Nearest Neighbors (FLANN) [71], searching for approximate nearest neighbours using a hierarchical k-means tree with branching factor 32 and 15 iterations. The parameters were obtained by the FLANN automatic algorithm configuration finding the best settings for obtaining nearest neighbours with 90% precision in the shortest time possible. Next, term frequency–inverse document frequency (tf-idf) vectors [87], which weight words occurring often in a particular document and downweight words that appear often in the database, are computed for each image with more than 50 detected visual words. Finally, the pairwise image similarity matrix MII containing the cosines of the angles between the normalized tf-idf vectors ti, tj of images Ii, Ij is computed as

MII(i, j) = ti · tj.   (7.1)
Images with fewer than 50 detected visual words are excluded from further computation.
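As an illustration of Equation (7.1), the following Python/NumPy sketch builds MII from visual word counts. The tf-idf weighting shown is a common variant; the exact weighting of [87] and the handling of excluded images are assumptions of this sketch.

```python
import numpy as np

def tfidf_similarity(word_counts, min_words=50):
    """Pairwise image similarity matrix MII from visual word counts.

    word_counts is an (n_images x vocabulary_size) array of visual word
    occurrences. Images with fewer than min_words detected words are
    excluded by zeroing their rows and columns.
    """
    H = np.asarray(word_counts, dtype=float)
    n_img = H.shape[0]
    valid = H.sum(axis=1) >= min_words
    tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1.0)   # term frequency
    df = np.maximum((H > 0).sum(axis=0), 1)                  # document frequency
    t = tf * np.log(n_img / df)                              # tf-idf vectors
    t /= np.maximum(np.linalg.norm(t, axis=1, keepdims=True), 1e-12)
    M = t @ t.T                         # cosines between normalized vectors
    M[~valid, :] = 0.0
    M[:, ~valid] = 0.0
    return M
```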
7.1.2 Constructing Atomic 3D Models from Camera Triplets

The image similarity matrix MII is used as a heuristic telling us which triplets of cameras are suitable for constructing atomic 3D models. As MII is symmetric with ones on the diagonal, we take the upper triangular part of MII, exclude the diagonal, and search for the maximum score. This gives us a pair of cameras Ci and Cj. Then, we find three “third camera” candidates Ck1, Ck2, and Ck3 such that min(MII(i, k1), MII(j, k1)) is
maximal, min(MII(i, k2), MII(j, k2)) is the second greatest, and min(MII(i, k3), MII(j, k3)) is the third greatest among all possible choices of the third camera. Atomic 3D models are constructed for each of the candidates as described below. The resulting models are ranked by the quality score and the model with the highest quality score is selected and passed to the next step of the pipeline. Denoting the third camera corresponding to the selected atomic 3D model as Ck, cameras Ci, Cj, and Ck are removed from future selections by zeroing rows and columns i, j, and k of MII. If the quality of all three 3D models is 0, no 3D model is selected and MII(i, j) is zeroed, preventing further selection of this pair of cameras. The whole procedure is repeated until the maximum score in MII is lower than 0.1.

Quality score. Each 3D point X triangulated from a triplet of cameras has three associated apical angles [99], one apical angle per camera pair: τij(X), τik(X), and τjk(X). The formula for computing the 3D model quality q is the following:

τ(X) = min(τij(X), τik(X), τjk(X)),   (7.2)

P1 = {X : τ(X) ≥ 5°},   q1 = |P1| if |P1| ≥ 10 and 0 otherwise,   (7.3)
P2 = {X : τ(X) ≥ 10°},  q2 = |P2| if |P2| ≥ 10 and 0 otherwise,   (7.4)
P3 = {X : τ(X) ≥ 15°},  q3 = |P3| if |P3| ≥ 10 and 0 otherwise,   (7.5)

q = q1 + 4q2 + 20q3.   (7.6)
Our formula for q, which is slightly more robust than the quality measure introduced in Chapter 6, checks whether there is a sufficient number of 3D points with large apical angles, as they ensure good relative camera pose estimation [99]. Constants 4 and 20 favour atomic models with 3D points having really large apical angles, as we seek atomic models with distant cameras, while the threshold value 10 ensures that the quality is not overestimated when only a few points have sufficiently large apical angles. As P1 ⊇ P2 ⊇ P3, points with τ(X) ∈ [10°, 15°) have five times greater weight than those with τ(X) ∈ [5°, 10°), and the same applies to points with τ(X) ∈ [15°, ∞) against those with τ(X) ∈ [10°, 15°).

Atomic 3D model construction. The atomic 3D model from a triplet of cameras is constructed in several steps. After each step, the construction is terminated if the number of reconstructed 3D points falls under 100 and the model quality score is set to 0. All intermediate results of the computation are stored into separate files and can be reused if needed, which speeds up the computation. The procedure is the following:

1. Image features, namely Maximally Stable Extremal Regions (MSER) [61] on intensity and saturation channels and Affine Invariant Interest Points (APTS) [68]
Laplacian-Affine and Hessian-Affine, are detected on the three input images (denoted as Ii, Ij, and Ik) and the assigned Local Affine Frames (LAF) [63] are described by the Discrete Cosine Transform (DCT).

2. Tentative matches between the three image pairs (Ii Ij, Ii Ik, and Ij Ik) are computed using FLANN [71] searching for approximate nearest neighbours with 4 random kd-trees, filtered to keep only the mutually best matches, and then further filtered into tentative matches among triplets by chaining the matches in all three images.

3. Homogeneous image coordinate vectors of the filtered tentative matches are normalized to unit direction vectors using the known camera calibration. Pairwise relative camera poses are obtained by soft voting for the epipole positions [54], see Chapter 6, using 5 votes from independent Progressive Sample Consensus (PROSAC) [16] runs with the 5-point algorithm [73].

4. Shared inliers of these geometries, i.e. final matches, together with three pairwise triangulations [39] are computed. The relative positions of the cameras and the common scale of all three reconstructions are found from one 3D point correspondence using RANSAC again.

5. 3D points are triangulated from the pair with the largest baseline for omnidirectional cameras, or obtained by optimal triangulation from three views [13] for perspective cameras.

6. Very distant points are filtered out and sparse bundle adjustment [56], modified similarly as in [53] by regarding non-perspective central cameras as a kind of generalized camera, refines both points and cameras.

Detecting multiple types of image features (ad. 1.) is favourable as they are usually located in different parts of an image: MSER features are found on uniform regions while APTS features fire on corners. To achieve high computation speed, tentative matches are found using an approximate technique with 80% precision, and the subsequent two-step filtering of the computed tentative matches (ad. 2.) decreases their contamination by mismatches, leading to a speed-up of the epipolar geometry estimation. As the ordered randomized sampling in PROSAC still involves randomness in selecting matches, the epipolar geometry resulting from a single run of PROSAC may differ from run to run, especially when the tentative matches are strongly contaminated by mismatches. To increase the chance of finding the correct model, we cast votes for the epipole positions, i.e. relative motion directions, of the best epipolar geometries recovered by several independent runs of PROSAC (ad. 3.). The best model is selected as the one with the epipole position closest to the maximum in the accumulator space. This strategy works when the correct, or almost correct, models provide consistent motions while the incorrect models with high support generate different ones, which is often the case. More details can be found in [100] and Chapter 6.
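The triplet chaining used in step 2 can be sketched as follows; the inputs are assumed to be dictionaries of mutually best matches already computed for the three image pairs, and the function name is illustrative.

```python
def chain_triplet_matches(m_ij, m_ik, m_jk):
    """Filter pairwise mutual-best matches into consistent triplet matches.

    m_ij, m_ik, m_jk map a feature index in the first image of the pair to
    its mutually best match in the second image. A triplet (a, b, c) is kept
    only when all three pairwise matches agree, which removes a large part
    of the remaining mismatches before epipolar geometry estimation.
    """
    triplets = []
    for a, b in m_ij.items():
        c = m_ik.get(a)
        if c is not None and m_jk.get(b) == c:
            triplets.append((a, b, c))
    return triplets
```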
It is a natural property of evaluating matches by epipolar geometry that incorrect matches lying on epipolar lines or in the vicinity of the epipoles are often regarded as inliers. However, they can be easily filtered out by finding the shared inliers of three views, as 3D points successfully verified in three views are unlikely to be incorrect. Therefore, the RANSAC obtaining the common scale of the three reconstructions (ad. 4.) is a good test of the quality of the pairwise geometries. To find which triplets of final matches generate a consistent 3D point, we use a “cone test” checking the existence of a 3D point that would project to the desired positions in all three matches after the scales were unified. During the cone test, four pixels wide pyramids, two pixels to each side, formed by four planes (up, down, left, and right) are cast around the final matches and the LP feasibility test [59] is used to test whether their intersection is empty or not. The decisive advantage of the cone test over the standard technique of checking the reprojection errors [39] lies in the fact that inaccurately reconstructed 3D points, e.g. those with small apical angles which have large uncertainties in depth estimates, do not affect the error measure. If one used the reprojection errors instead, which is equivalent to testing whether or not a given reconstructed 3D point lies in the intersection of the cast cones, some correct matches could be rejected due to the corresponding inaccurately reconstructed 3D points. Inaccurate 3D points triangulated from accepted matches do not cause any harm as they are later re-triangulated (ad. 5.) and bundled (ad. 6.). As an exhaustive test is faster than LP for three pyramids, LP is used only when intersecting a higher number of pyramids during merging and gluing, and not in this particular case. The exhaustive test constructs all candidates for the vertices of the convex polyhedron comprising the intersection of the pyramids as the intersections of triplets of planes. The intersection of the pyramids is empty iff none of these candidates lies in all 12 positive halfspaces formed by the planes. To reject atomic 3D models with low-quality pairwise geometries, the quality score is set to 0 if the inlier ratio of the cone test is under 80%. To ensure uniform image coverage by the projections of the reconstructed 3D points, a unit sphere surrounding the camera center, representing different unit vector directions, is tessellated into 980 triangles T using [12]. A triangle T is non-empty if there exists a reconstructed 3D point projecting into it, and empty otherwise. The image coverage measure cI of image I is defined as

To = {T ∈ T : T is non-empty},   cI = |To| / |T|.   (7.7)
If more than one image from the triplet has cI < 0.01, the quality score of the atomic 3D model is set to 0.
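The exhaustive intersection test can be sketched as follows, assuming every pyramid has already been converted into four halfspace constraints of the form a·x + d ≥ 0 expressed in a common coordinate frame; the function name and the numerical tolerance are ours.

```python
import numpy as np
from itertools import combinations

def pyramids_intersect(planes, tol=1e-9):
    """Exhaustive test for a non-empty intersection of viewing pyramids.

    planes is an (N x 4) array of halfspace constraints [a, b, c, d] meaning
    a*x + b*y + c*z + d >= 0, four planes per pyramid (N = 12 for three
    cameras). Candidate vertices of the intersection polyhedron are the
    intersections of plane triplets; the intersection is non-empty iff some
    candidate satisfies every constraint.
    """
    P = np.asarray(planes, dtype=float)
    for idx in combinations(range(len(P)), 3):
        A, b = P[list(idx), :3], -P[list(idx), 3]
        try:
            x = np.linalg.solve(A, b)      # vertex candidate
        except np.linalg.LinAlgError:
            continue                        # parallel or degenerate planes
        if np.all(P[:, :3] @ x + P[:, 3] >= -tol):
            return True
    return False
```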
7.1.3 Merging Partial Reconstructions

First, a new similarity matrix MTT containing similarity scores between the selected atomic 3D models is constructed. Having two atomic 3D models constructed from cameras with indices Cm = {i, j, k} and Cn = {i′, j′, k′} respectively, there are always nine pairs
of cameras such that the cameras are contained in different models. The similarity score between two atomic 3D models is computed as the mean of the similarity scores of those nine pairs as

MTT(m, n) = (1/9) Σ_{im ∈ Cm, jn ∈ Cn} MII(im, jn).   (7.8)
The matrix is again used as a heuristic telling us which pairs of atomic 3D models are suitable for merging. In the beginning, there is one partial reconstruction per accepted 3D model, each of them containing three cameras and the 3D points triangulated from them. Partial reconstructions are connected together during the merging step, forming larger partial reconstructions containing the union of the cameras and 3D points of the connected reconstructions. We take the upper triangular part of MTT, exclude the diagonal, and search for the maximum score. This gives us a pair of atomic 3D models Am and An. Next, we try to merge the two partial reconstructions Rp and Rq containing the models Am and An respectively. After a successful merge, elements MTT(p′, q′) are zeroed for all models Ap′ contained in partial reconstruction Rp and all models Aq′ contained in partial reconstruction Rq in order to prevent further merging between atomic 3D models which are both contained in the same partial reconstruction. If the merge is not considered successful, the partial reconstructions are not connected and MTT(m, n) is zeroed, preventing further selection of this pair of atomic models. Notice, however, that this is not a strict decision on the mergeability of partial reconstructions Rp and Rq as they can be connected later using a different pair of atomic models contained in them. The whole procedure is repeated until the maximum score in MTT is lower than 0.05.

Merging two atomic 3D models. The actual merge is performed in several steps. Given two atomic 3D models Am and An, first, tentative 3D point matches are found. Each 3D point X reconstructed from a triplet of cameras Ci, Cj, and Ck has three LAF+DCT descriptors DiX, DjX, and DkX connected with it. Having six sets of descriptors (Di, Dj, and Dk for 3D points from model Am and Di′, Dj′, and Dk′ for 3D points from model An), we find the mutually best matches between all nine pairs of descriptor sets (Di Di′, Di Dj′, etc.) independently. As particular descriptors of a single 3D point from model Am can be matched to descriptors of different 3D points in model An in the individual matchings, unique 3D point matches need to be constructed. The nine lists of 3D point matches output by the individual matchings are concatenated and sorted by the distance of the descriptors in the feature space. A unique matching is obtained in a greedy way by going through the sorted list and accepting only those 3D point matches whose 3D points are not contained in any of the 3D point matches accepted before. If there are fewer than 10 tentative 3D point matches, the merge is not successful, otherwise we try to find a similarity transform bringing model Am into the coordinate system of model An. As three 3D point matches are needed to compute the similarity transform parameters [102], RANSAC with samples of length three is used. A 3D point match is an inlier if the intersection of the three pyramids from the cameras contained in
model An and the three pyramids from the transformed cameras contained in model Am is non-empty. Local optimization is performed by repeating the similarity transform computation from all inliers. If the inlier ratio is higher than 60%, the merge is considered successful and the whole partial reconstructions Rp and Rq are merged according to this similarity transform computed from atomic 3D models Am and An only. Rq remains fixed and the 3D points and cameras of Rp are transformed; 3D point matches which were inliers are merged into a single point with the position being the mean of the former positions after transformation. Sparse bundle adjustment [56] is used to refine the whole partial reconstruction after a successful merge. The resulting partial reconstruction is then transformed to a normalized scale to allow easy visualization and to ease the next step of the pipeline.
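The closed-form similarity transform estimated inside this RANSAC can be sketched with an Umeyama-style least-squares solution; the thesis follows [102], which may differ in detail, and the function name is illustrative.

```python
import numpy as np

def similarity_from_matches(X, Y):
    """Least-squares similarity transform (s, R, t) with Y ≈ s * R @ X + t.

    X and Y are (3 x n) arrays of corresponding 3D points, n >= 3, i.e. the
    minimal sample of three 3D point matches used in the merging RANSAC.
    """
    mx, my = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Yc @ Xc.T)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:
        D[2, 2] = -1.0                      # enforce a proper rotation
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / np.sum(Xc ** 2)
    t = my - s * R @ mx
    return s, R, t
```

Inside RANSAC, the transform is computed from each sampled triplet of 3D point matches, inliers are counted with the pyramid intersection test, and the transform is finally re-estimated from all inliers as local optimization.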
7.1.4 Gluing Single Cameras to the Best Partial Reconstruction

The best partial reconstruction Rr is selected as the one containing the highest number of cameras. In this step, we are trying to find the poses of the cameras which are not contained in Rr. Another similarity matrix MTI, which contains similarity scores between the atomic 3D models contained in Rr and the cameras not contained in Rr, is constructed. The similarity score between an atomic 3D model constructed from cameras with indices Co = {i, j, k} and a camera not contained in the partial reconstruction is computed as the mean of the similarity scores of three pairs of cameras as

MTI(o, l) = (1/3) Σ_{io ∈ Co} MII(io, l).   (7.9)
We search for the maximum score in MTI and obtain atomic 3D model Ao and camera Cl. During the gluing step, we compute the pose of camera Cl using the 3D points contained in model Ao. If the gluing is successful, we zero column l of MTI in order to prevent further selection of already glued single cameras, otherwise only element MTI(o, l) is zeroed. The whole procedure is repeated until the maximum score in MTI is lower than 0.025.

Gluing a single camera. When performing the actual gluing, we find the mutually best tentative matches between three pairs of descriptor sets (Di Dl, Dj Dl, and Dk Dl) independently. Unique 2D-3D matches are obtained using the same greedy approach as when performing a merge. If the number of tentative matches is smaller than 20, the gluing is not successful. Otherwise, RANSAC sampling triplets of 2D-3D matches is used to find the camera pose [74] having the largest support, evaluated by the cone test again. Local optimization is achieved by repeated camera pose computation from all inliers [86] via SDP and SeDuMi [94]. If the inlier ratio is higher than 80%, the gluing is considered successful and camera Cl is added into the partial reconstruction Rr. Sparse bundle adjustment is used to refine the whole partial reconstruction and the reconstruction is transformed to a normalized
scale again because an improper scale of the reconstruction can influence the convergence of the SDP program.

Figure 7.2: Example input image data. Perspective images from image set DALIB (top row) and omnidirectional images from image set CASTLE (bottom row).
7.2 Experimental Results

We present results on two image sets. The first one consists of 64 images and the camera poses obtained by the exhaustive method computing matches between all pairs of cameras [60] are known. We consider them to be near the ground truth as their accuracy has been proven by a successful 3D surface reconstruction. For the second experiment, we use a set of 4,472 omnidirectional images captured while walking through Prague. Our method was able to find images sharing the views and reconstruct several landmarks present in them.

DALIB image set. Image set DALIB consists of 64 perspective images capturing a paper model of a house acquired by a camera with known calibration, see Figure 7.2. The pipeline selected 13 atomic 3D models out of 132 candidates (MII was sampled only 44 times for the best pair). It was sufficient to compute just 199 pairwise image matches compared to the 2,016 computed by the exhaustive method. All atomic models were successfully merged into a single partial reconstruction and the poses of the 25 missing cameras were obtained during gluing, resulting in the model shown in Figure 7.3. The time spent in the different steps of the pipeline, having a MATLAB+MEX implementation running on a standard Intel Core2Duo PC, can be found in Table 7.1.
Figure 7.3: Complete reconstruction of image set DALIB. A partial reconstruction containing all 39 cameras from selected atomic 3D models was extended with 25 missing cameras during gluing.
Figure 7.4: Measured errors of the camera pose estimation for image set DALIB. The camera translation error is expressed as a fraction of the diameter of a sphere containing all cameras, the camera rotation error is in radians. Note that all cameras but camera number 14 were estimated with an error smaller than 0.7%.

The total computation time was less than 45 minutes. Sparse bundle adjustment takes less than a second on average to run for an atomic 3D model and at most several seconds when refining larger partial reconstructions, because there are not so many constraints as we do not match all image pairs.
Figure 7.5: Visualization of the selected atomic 3D models and their merging of image set DALIB. Cameras computed by our method (denoted by •) contained in the same atomic 3D model are connected by a colored line, cameras glued to a given model share its color. Merging is shown by dashed grey lines. Cameras obtained by the exhaustive method are denoted by +.

Name     Similarity   Atomic 3D   Merge    Gluing
DALIB    2 min        37 min      2 min    2 min
CASTLE   6 hrs        257 hrs     18 hrs   19 hrs
Table 7.1: Time spent in the different steps of the pipeline while reconstructing image sets DALIB and CASTLE.

The computation time of the exhaustive method using a similar MATLAB+MEX implementation on the same hardware was around 90 minutes, most of the time being spent on computing pairwise image feature matches, see Table 7.2. When reconstructing the same image set using Bundler [89, 90], which also uses exhaustive pairwise image feature matching, the computation time went down to 8 minutes but the resulting camera poses were less accurate than those obtained by our method due to the lack of SIFT [57] image features on the paper model. Bundler is faster mainly for implementation reasons as it contains highly optimized C/C++ code.
Figure 7.6: The partitioning of the resulting 3D point cloud among 13 selected atomic 3D models of image set DALIB. Color coding is the same as for Figure 7.5.

Method         Features   Matching   Geometry
MATLAB+MEX     10 min     65 min     15 min
Bundler [90]   8 min (total, all steps)
Table 7.2: Time spent in the different steps of our exhaustive method and of Bundler [90] for image set DALIB. The Bundler time is the total time spent by the method.

After finding the similarity transform between the camera poses computed by our method and those computed by the exhaustive one, we were able to measure the error of the camera pose estimation. It showed that there is no significant loss of quality, see Figure 7.4. Both sets of cameras can be seen in Figure 7.5 together with the visualization of the atomic 3D models and their merging. Figure 7.6 shows the partitioning of the resulting 3D point cloud among the 13 selected atomic 3D models.

CASTLE image set. Our second image set CASTLE consists of 4,472 omnidirectional images captured by a 183° fish-eye lens camera with known calibration. The images were acquired in several sequences while walking in the center of Prague and around the Prague Castle, but they were input into the pipeline as an unordered set. The pipeline selected 652 atomic 3D reconstructions out of 100,410 candidates and only 58,961 pairwise image matches were computed while the number of all possible image
pairs is 9,997,156. Several partial reconstructions containing remarkable landmarks were obtained, see Figures 7.7, 7.8, and 7.9. The total computation time was around 12.5 days. As Bundler works with perspective images only, we could not compare its performance with the performance of the proposed method on this image set directly, but if we linearly extrapolated the computation time of Bundler using the number of all possible image pairs, it would be around 27.5 days. Minor merging and gluing errors caused by repetitive image structures and matching clouds can be found in some of the resulting partial reconstructions. As our current “winner takes all” approach is unable to recover from such errors, alternative ways of merging and a method evaluating their quality in order to discard the incorrect ones would need to be introduced.

Figure 7.7: Partial reconstruction #486 of image set CASTLE. The right part of the St. Vitus Cathedral and other buildings surrounding the square were reconstructed from 90 cameras, another 49 cameras were connected during gluing.
Figure 7.8: Partial reconstruction #407 of image set CASTLE. Part of the Old Town Square with the clock tower was reconstructed from 69 cameras, another 39 cameras were connected during gluing.
Figure 7.9: Partial reconstruction #471 of image set CASTLE. Entrance to the Prague Castle was reconstructed from 60 cameras, another 49 cameras were connected during gluing.
8 Efficient Structure from Motion by Graph Optimization
As the sizes of input image sets grow, the efficiency of SfM computation becomes crucial. In this chapter, we present an extension of the SfM method described in Chapter 7 which is able to deal with a much larger variety of image sets thanks to the proposed image set reduction procedure and the prioritization of the individual reconstruction tasks. We seek to reconstruct 3D scene structure and camera poses both from large collections of images downloaded from the web and from images taken by cameras mounted on moving vehicles as in Google Street View. This is a challenging task because unstructured web collections often contain a large number of very similar images of landmarks while, on the other hand, image sequences often have a very limited overlap between images. To speed up the reconstruction, it is desirable to select a subset of the input images in such a way that all the remaining images have a significant visual overlap with at least one image from the selected ones, so that the connectivity of the resulting model is not damaged. For selecting such a subset of input images, the approximate minimum connected dominating set can be computed by a fast polynomial algorithm [37] on the graph constructed according to the estimated visual overlap. The algorithm used is closely related to the maximum leaf spanning tree algorithm employed in [91] but the composition of the graph is quite different and less computationally demanding in our case (Section 8.1). The actual SfM pipeline uses the atomic 3D models constructed from camera triplets introduced in Chapter 7 as the basic elements of the reconstruction, but the strict division of the computation into steps is relaxed by introducing a priority queue which interleaves different reconstruction tasks in order to get a good scene-covering reconstruction in limited time (Section 8.2). Our aim here is to avoid spending too much time in a few difficult matching problems by exploring other easier options which lead to a comparable resulting 3D model in shorter computational time. We also introduce model growing by constructing new 3D points when connecting an image, which allows for sparser image sets than those which could be reconstructed by the method presented in Chapter 7.
8.1 Image Set Reduction

When performing SfM computation from user-input images, the input image set may often be highly redundant, such as photographs acquired by tourists at landmark sites. As it is not needed to use all such input images in order to get a 3D model covering
Algorithm 5 Approximate minimum CDS computation [37]

Input:  G = (V, E)   Unweighted undirected graph.
Output: S            List of vertices belonging to the minimum CDS of G.

I.  Label all vertices V ∈ V white.
II. Set D := {} and repeat until no white vertices are left:
    1: For all black vertices V ∈ V set cV := 0.
    2: For all gray and white vertices V set cV := number of white neighbours of V.
    3: Set V∗ := arg max_V cV.
    4: Label V∗ black and add it into D.
    5: Label all neighbours of V∗ gray.
III. Set S := D and connect the components of the subgraph of G induced by D by adding at most 2 vertices per component into S in a greedy way.
IV. Return S.
the scene captured in them, it is possible to speed up the reconstruction by using only a suitable subset of the input images. We seek a method that removes the unnecessary images from the input image set while affecting neither the quality nor the connectivity of the resulting 3D model much. The concept of visual words, which first appeared in [87], has been used successfully for matching images and scenes [77]. It proved its usefulness also for near duplicate image detection [17] when the scene is captured from different viewpoints or under different lighting conditions. Our aim is to (i) evaluate pairwise image similarity efficiently following [81, 2] and (ii) formulate the selection of the desired subset of input images as finding a suitable subgraph of the graph constructed according to image similarity.
8.1.1 Image Similarity

We use the bag-of-words approach to evaluate image similarity. In particular, we follow the method proposed in [81] to create the pairwise image similarity matrix MII containing the cosines of the angles between the normalized tf-idf vectors computed from the numbers of occurrences of the quantized SURF [7] image feature descriptors in the individual images. Next, we create an unweighted undirected graph GII expressing image similarity. The vertices of GII are the input images and we add five edges per vertex connecting it with the five most similar images according to the values of MII, which is close to the approach used in [2]. Edges are not added if the measured similarity falls under 0.05. Notice that there may (and often will) exist vertices with degree higher than five in the resulting graph as some images may be similar to many other images.
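Building GII from MII can be sketched as follows; the adjacency-dictionary representation and the function name are choices made for this illustration only.

```python
import numpy as np

def similarity_graph(MII, k=5, min_similarity=0.05):
    """Build the unweighted undirected graph GII as an adjacency dictionary.

    Every image is connected to its k most similar images according to MII;
    edges with similarity below min_similarity are skipped. Vertices can end
    up with degree higher than k because edges are added symmetrically.
    """
    n = MII.shape[0]
    adjacency = {v: set() for v in range(n)}
    for i in range(n):
        sims = np.array(MII[i], dtype=float)
        sims[i] = -np.inf                      # do not connect an image to itself
        for j in np.argsort(sims)[::-1][:k]:
            if sims[j] >= min_similarity:
                adjacency[i].add(int(j))
                adjacency[int(j)].add(i)       # undirected edge
    return adjacency
```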
Figure 8.1: Minimum CDS computation. Vertices belonging to the minimum dominating set D are labeled black, vertices added when connecting the components in order to get S are labeled blue. (a) General graph. (b) Graph being a singly connected line.
8.1.2 Minimum Connected Dominating Set

According to [37], the minimum connected dominating set (CDS) problem is defined as follows. Given a graph G = (V, E), find a minimum size subset of vertices S ⊂ V such that the subgraph induced by S is connected and S forms a dominating set in G. In a graph with a dominating set D ⊂ V, each vertex V ∈ V is either in the dominating set or adjacent to some vertex in the dominating set, V ∈ D ∨ ∃V′ ∈ D : (V, V′) ∈ E. The problem of finding the minimum CDS is known to be NP-hard [31] but [37] presents a fast polynomial algorithm with an approximation ratio of ln Δ + 3, Δ being the maximum vertex degree in the graph, see Algorithm 5. We use the aforementioned algorithm to find the minimum connected dominating set SII of the graph GII, see Figure 8.1(a), and only the images corresponding to the vertices in SII are further used for the sparse 3D model reconstruction. The edges of the subgraph of GII induced by D (Algorithm 5, Step III.) together with the edges connecting the components of this subgraph in order to get SII are used as the seeds of the reconstruction. The usage of the dominating set provides for connecting the removed images to the resulting 3D model reconstructed from the selected ones using camera resectioning [74] if required, as an image is removed only if it is similar to at least one image which remains in the selected subset, i.e. there exists a visual overlap between the resulting model and each of the removed images. Furthermore, the connectivity of the resulting 3D model is preserved by using the connected dominating set, which does not allow splitting the originally connected graph into components. For non-redundant image sets, e.g. when the graph expressing image similarity is a singly connected line, the method removes only the first and the very last images because removing more images would affect model connectivity, see Figure 8.1(b). On the other hand, the reduction of highly redundant image sets is drastic, as shown in Section 8.3.1.
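The greedy coloring of Algorithm 5 can be sketched on the adjacency dictionary from the previous sketch. Step III, which connects the components of the subgraph induced by the returned set by adding at most two vertices per component, is omitted here, and one small deviation from the pseudo code is noted in the comments.

```python
def approximate_dominating_set(adj):
    """Greedy coloring (Steps I-II of Algorithm 5) computing a dominating set.

    adj maps every vertex to the set of its neighbours in GII. A white vertex
    is counted as covering itself, a minor deviation from Algorithm 5 that
    guarantees termination for vertices without white neighbours.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}
    dominating = []
    while any(c == WHITE for c in color.values()):
        best_v, best_cover = None, -1
        for v in adj:
            if color[v] == BLACK:
                continue
            cover = sum(1 for u in adj[v] if color[u] == WHITE)
            if color[v] == WHITE:
                cover += 1
            if cover > best_cover:
                best_v, best_cover = v, cover
        color[best_v] = BLACK          # add the best covering vertex to D
        dominating.append(best_v)
        for u in adj[best_v]:
            if color[u] != BLACK:
                color[u] = GRAY        # its neighbours are now dominated
    return dominating
```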
8.2 3D Model Construction Using Tasks Ordered by a Priority Queue

The reduced image set is input into our 3D reconstruction pipeline which grows the resulting 3D model from several atomic 3D models. The computation is divided into tasks, each of which can either try to create a new atomic 3D model from three images, or try to connect one image to a given 3D model, see Figure 8.2. The order of the execution of the different tasks is determined by task priority keys set when adding them to the priority queue, which is the essential underlying data structure. Note that in our implementation of the priority queue, the task with the smallest priority key has the highest priority, i.e. it is always at the head of the queue. Our aim is to set the task priority keys in such a way that stopping the computation at any time would give a good scene-covering sparse 3D model for the time given, which is demanded e.g. by online SfM services. The state of the art SfM approaches [90, 91, 2] implement this priority queue implicitly in such a way that they may get stuck solving a difficult part of the reconstruction even when an easier path to the goal exists, as they greedily grow from a single seed. Using our approach, several seeds are grown in parallel so the easiest path is actively searched for.

First, the queue is filled with one candidate camera triplet for atomic 3D model construction per seed. The triplet is constructed from the two cameras Ci, Cj being the endpoints of the edge corresponding to the seed. The third camera Ck∗ is selected as

k∗ = arg max_k min(MII(i, k), MII(j, k))   (8.1)

and the priority key of this task is set to 1 − MII(i, j). Next, the task from the head of the priority queue is taken and executed. As we are just starting the computation, it will be an atomic 3D model creation task. If the atomic 3D model construction from a given candidate camera triplet is not successful, the camera triplet is rejected and another candidate camera triplet for the same seed is input into the queue with the priority key doubled. The new third camera accompanying cameras Ci and Cj is selected similarly as in Equation 8.1 by taking the camera Ck∗ with the s-th largest value of min(MII(i, k∗), MII(j, k∗)) and increasing s. After a successful atomic 3D model creation, the vicinity of the respective seed is searched for camera candidates suitable for connecting with the newly created atomic 3D model and tasks connecting the five most suitable cameras are input into the queue. We put the indices of the cameras contained in the atomic 3D model Am into the set Cm and the rest of the camera indices into Cn. Then, we search for a candidate camera Cj∗ to be connected to the atomic 3D model using

(i∗, j∗) = arg max_{(im, jn) ∈ Cm × Cn} MII(im, jn).   (8.2)

The priority key of this task is set to 1 − MII(i∗, j∗). The other four candidate cameras are selected similarly using the second, third, fourth, and fifth largest value of MII(i∗, j∗).
Figure 8.2: Schematic visualization of the computation. The task retrieved from the head of the priority queue can be either an atomic 3D model construction task (dark gray) or an image connection task (light gray). Unsuccessful atomic 3D model construction (–) inserts another atomic 3D model construction task with the priority key doubled into the queue, a successful one (+) inserts five image connection tasks. Unsuccessful image connection (–) inserts the same task again with the priority key doubled, a successful one (+) inserts a new image connection task. Merging of overlapping 3D models is called implicitly after every successful image connection if the overlap is sufficient.
Alternatively, the head of the priority queue may contain an image connection task. After a successful image connection, a task connecting another camera to the same partial 3D model is created using Equation 8.2 again, with a larger set Cm, and input into the queue in order to keep the number of image connection tasks at five per partial model. When the connection of an image to a given 3D model is unsuccessful, the task is input into the queue again with the priority key doubled because it may be successful if tried again after other images have been successfully connected. In order to keep the resulting reconstruction consistent and connected, grown 3D models are implicitly merged together when they share at least five images. If the merge is not successful, it will be tried again when the number of shared images increases further. The whole procedure is repeated until the priority queue is empty or the available time runs out. The following paragraphs describe particular parts of the pipeline in deeper detail.
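The scheduling skeleton of this computation can be sketched as follows; the interface of execute, the time budget, and the function names are assumptions of this illustration, while the actual task bodies (atomic model construction, image connection, implicit merging) are described in the following subsections.

```python
import heapq
import itertools
import time

def run_reconstruction(seed_tasks, time_budget_s, execute):
    """Interleave reconstruction tasks using a priority queue (smallest key first).

    seed_tasks is a list of (priority_key, task) pairs, e.g. one atomic-model
    construction task per seed with key 1 - MII(i, j). execute(task) performs
    one atomic 3D model construction or image connection and returns a list
    of follow-up (key, task) pairs; for an unsuccessful task this list is
    expected to contain the same task again with its key doubled.
    """
    counter = itertools.count()   # tie-breaker so task objects are never compared
    queue = [(key, next(counter), task) for key, task in seed_tasks]
    heapq.heapify(queue)
    deadline = time.time() + time_budget_s
    while queue and time.time() < deadline:
        key, _, task = heapq.heappop(queue)
        for new_key, new_task in execute(task):
            heapq.heappush(queue, (new_key, next(counter), new_task))
    return queue                  # remaining, not yet executed tasks
```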
8.2.1 Creation of Atomic 3D Models

Atomic 3D model construction, introduced in Chapter 7, has been improved and extended in several ways:

1. SIFT [57] and SURF [7] image feature detectors and descriptors have been added, as it turns out that a combination of several different detectors is needed for difficult image sets. On the other hand, for easy image sets it is possible to use only the fastest of them, which is SURF in our case.

2. Camera calibration does not need to be the same for all images in the set and can be obtained from the EXIF info of JPEG images.

3. The formula computing the quality score q has been simplified to

$$q = \bigl|\{X : \tau(X) \geq 5^\circ\}\bigr|, \tag{8.3}$$

τ(X) being the apical angle measured at the 3D point X. In contrast with the original formula, 3D points with even larger apical angles do not contribute more to the quality score, as we found that this does not bring any significant improvement over the simple formula.

We require a quality score of at least 20 to accept a given candidate camera triplet as suitable for reconstruction. Together with the remaining triplet quality pre-tests, the decision rule is the following: a given candidate camera triplet is accepted if and only if the pairwise epipolar geometries are consistent (the inlier ratio of the RANSAC finding the common scale is higher than 0.7), at least fifty 3D points have been reconstructed, at least twenty of them have apical angles larger than 5 degrees, and their projections cover a sufficiently large portion of the three respective viewfields.
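A minimal sketch of the quality score of Equation 8.3 and of the acceptance rule follows (Python, illustrative only). We assume here that τ(X) is evaluated as the largest apical angle over the three camera pairs of the triplet; the helper names are ours, not the pipeline's.

```python
import numpy as np

def apical_angle(X, C1, C2):
    """Angle (in degrees) at the 3D point X subtended by the camera centers C1 and C2."""
    u, v = C1 - X, C2 - X
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def triplet_quality(points, cam_centers, min_angle_deg=5.0):
    """Quality score of Equation 8.3: the number of 3D points whose apical angle
    (here: the largest one over the three camera pairs) reaches min_angle_deg."""
    q = 0
    for X in points:
        angles = [apical_angle(X, cam_centers[a], cam_centers[b])
                  for a in range(3) for b in range(a + 1, 3)]
        if max(angles) >= min_angle_deg:
            q += 1
    return q

def accept_triplet(scale_inlier_ratio, n_points, q, coverage_ok):
    """Decision rule quoted above: consistent pairwise geometries, enough points,
    enough well-conditioned points, and sufficient viewfield coverage."""
    return scale_inlier_ratio > 0.7 and n_points >= 50 and q >= 20 and coverage_ok
```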
8.2.2 Model Growing by Connecting Images

Connection of a new image to a given partial 3D model proceeds in two stages. First, the pose of the corresponding camera C_l with respect to the 3D model is estimated. Secondly, promising cameras from the vicinity of the newly connected one are used to create new 3D points.

Every 3D point already contained in the model has a descriptor which is transferred from one of the corresponding images during its triangulation. Thus it is easy to find 2D-3D matches between the reconstructed 3D points and the feature points detected and described in the candidate image being connected. To ensure reasonable speed even for large models with millions of points, we do one-way matching only, with a strict criterion on the first/second nearest neighbour distance ratio, setting it to 0.7 [57]. If the number of tentative matches is smaller than 20, the connection is not successful. Otherwise, a RANSAC sampling triplets of 2D-3D matches is used to find the camera pose [74] with the largest support, evaluated by the cone test [42]. Local optimization is achieved by repeated camera pose computation from all inliers [86] via SDP and SeDuMi [94].
We require the inlier ratio to be higher than 60% to consider the connection successful and continue. Next, we find the cameras already contained in the partial model which have some viewfield overlap with the newly connected camera by examining the projections of the inlier 3D points from the previous stage. We take the set C_o of the indices of all cameras which contain projections of at least 20 inlier 3D points and try to triangulate 3D points from the camera pairs (C_io, C_l), i_o ∈ C_o. Newly triangulated 3D points with apical angles larger than 5 degrees are accepted if they are projected to at least three cameras after being merged based on the shared 2D feature points in C_l. The cone test can further reject a 3D point if those projections are not consistent with any possible 3D point position. Finally, sparse bundle adjustment [56] is used to refine the whole partial reconstruction after adding the new 3D points and their projections.
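The 2D-3D matching stage can be sketched as follows (Python with SciPy, illustrative only). The descriptor matrices, the P3P RANSAC, and the cone test are assumed to be provided elsewhere; only the ratio test and the acceptance thresholds quoted above are shown.

```python
import numpy as np
from scipy.spatial import cKDTree

def tentative_2d3d_matches(model_desc, image_desc, ratio=0.7):
    """One-way matching of 3D point descriptors against the descriptors of the
    image being connected, with the first/second nearest neighbour ratio test."""
    tree = cKDTree(image_desc)
    dist, idx = tree.query(model_desc, k=2)              # two nearest neighbours per 3D point
    keep = dist[:, 0] < ratio * dist[:, 1]
    return [(p, int(idx[p, 0])) for p in np.flatnonzero(keep)]   # (3D point id, 2D feature id)

def connection_accepted(n_tentative, inlier_mask, min_matches=20, min_inlier_ratio=0.6):
    """The connection succeeds only if there are enough tentative matches and the
    pose found by RANSAC (P3P + cone test, not shown) is supported by > 60% of them."""
    return n_tentative >= min_matches and float(np.mean(inlier_mask)) > min_inlier_ratio
```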
8.2.3 Merging Overlapping Partial Models

When two partial 3D models share images, they usually also share 2D feature points which are the projections of some already triangulated 3D points. Therefore, we can avoid costly descriptor matching and create tentative 3D point matches between the two partial 3D models from pairs of 3D points which project to the same 2D feature points in both models. As the 2D-3D matching used when connecting new images is rather strict, it often fails to find correspondences between less distinctive regions, e.g. regions corresponding to repetitive scene structures, which leads to triangulating the same scene 3D point once more at a later stage. After connecting many images, scene 3D points may therefore have several triangulated copies in the model, which is why the tentative 3D point matches created for merging often form large connected components, each of them corresponding to a single scene 3D point. After splitting each of these components into two parts, one per partial model being merged, we use the cone test on each part to verify that the given 3D points can be merged into one. When this "internal merge" consolidating the partial models is finished, we continue with merging the two models using the collapsed tentative 3D point matches.

If there are fewer than 10 tentative 3D point matches, the merge is not successful; otherwise we try to find a similarity transform between the coordinate systems of the models. As three 3D point matches are needed to compute the similarity transform parameters [102], RANSAC with samples of length three is used. Inliers are evaluated by the cone test using image projections from both partial models, and local optimization is performed by repeating the similarity transform computation from all inliers. Camera poses corresponding to the images shared by the models are averaged (rotation and position separately) inside the RANSAC loop before the cone test, so similarity transforms which would lead to incorrectly averaged cameras are not accepted. We require the inlier ratio to be higher than 60% to consider the merge successful. Finally, the smaller model is transformed to the coordinate system of the larger one because transforming the smaller model is faster.
The 3D point matches which were inliers are merged into a single point, with its position being the mean of the former positions after the transformation, and duplicate image projections are removed. Sparse bundle adjustment [56] is used to refine the whole partial reconstruction after a successful merge.
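The similarity transform used for merging can be estimated in closed form from three or more 3D point matches [102]. The sketch below (Python, illustrative) shows the closed-form solution and a three-point RANSAC loop; note that the thesis scores inliers with the cone test on image projections from both models, whereas the sketch uses a plain Euclidean residual threshold only to stay self-contained.

```python
import numpy as np

def similarity_transform(A, B):
    """Closed-form least-squares similarity (s, R, t) mapping points A onto B [102].
    A and B are N x 3 arrays of corresponding 3D points."""
    muA, muB = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - muA, B - muB
    U, D, Vt = np.linalg.svd(B0.T @ A0 / len(A))
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                              # guard against a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / A0.var(axis=0).sum()
    t = muB - s * R @ muA
    return s, R, t

def merge_transform_ransac(ptsA, ptsB, n_iters=500, thresh=0.05):
    """RANSAC with minimal samples of three 3D point matches."""
    rng = np.random.default_rng(0)
    best, best_ratio = None, 0.0
    for _ in range(n_iters):
        idx = rng.choice(len(ptsA), size=3, replace=False)
        s, R, t = similarity_transform(ptsA[idx], ptsB[idx])
        residuals = np.linalg.norm((s * (R @ ptsA.T)).T + t - ptsB, axis=1)
        ratio = float(np.mean(residuals < thresh))
        if ratio > best_ratio:
            best, best_ratio = (s, R, t), ratio
    return best, best_ratio        # the merge is accepted only if best_ratio exceeds 0.6
```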
8.3 Experimental Results

We demonstrate the proposed method in three experiments. The first one shows the efficient reduction of a highly redundant image set using the approximate minimum connected dominating set of a graph constructed from the image similarity matrix; the latter two present the output of our 3D model reconstruction pipeline after 6 hours of computation for an omnidirectional and a perspective image set, respectively. All reported times were measured by running a MATLAB+MEX implementation on an Intel Core2Quad PC.
8.3.1 Image Set Reduction

Image set DITREVI consists of 2,545 images resulting from a Flickr photo sharing site [107] search for "di trevi" (April 2009). The image set is highly redundant and contaminated with images not capturing the Di Trevi Fountain, as it comprises pictures uploaded by hundreds of tourists visiting Rome. After detecting SURF image features and computing the image similarity matrix in 2 hours, the algorithm finding the approximate minimum connected dominating set of the corresponding graph returned 70 images in 5 seconds, see Figure 8.3. The selected images reasonably cover the different scene viewpoints while the image set size was reduced by more than 97%. Furthermore, the contamination ratio of the image set decreased from 17% to 7% after the reduction. We used Bundler [90], a publicly available SfM tool, to evaluate the suitability of the image selection done by CDS for 3D reconstruction. The model returned in 44 minutes contains 47 camera poses and 8,489 3D points, see Figure 8.4(a). We also ran Bundler on five randomly selected sets of 70 images out of the 2,545. Two of the runs did not return any result, two returned small fragments of the model with fewer than 5 camera poses, and one returned an incomplete 3D model having 32 camera poses and 3,355 3D points, as can be seen in Figure 8.4(b).

Figure 8.3: Images corresponding to the approximate minimum connected dominating set computed for image set DITREVI. The image set size has been reduced by 97%, from 2,545 to 70 images, and the contamination ratio of the image set decreased from 17% to 7%.
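For reference, a greedy approximation of the minimum connected dominating set in the spirit of Guha and Khuller [37] can be written in a few lines. This is only an illustrative Python sketch (the actual selection procedure is described earlier in the thesis), operating on an adjacency-set representation of the image similarity graph and assuming the graph is connected.

```python
def approx_connected_dominating_set(adj):
    """Greedy approximation of the minimum connected dominating set, in the spirit
    of Guha and Khuller [37]. `adj` maps a vertex to the set of its neighbours;
    the graph is assumed connected. Returns the set of selected (black) vertices."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}

    def white_degree(v):
        return sum(1 for u in adj[v] if color[u] == WHITE)

    def make_black(v):
        color[v] = BLACK
        cds.add(v)
        for u in adj[v]:
            if color[u] == WHITE:
                color[u] = GRAY

    cds = set()
    make_black(max(adj, key=white_degree))          # start from the most dominating vertex
    while any(c == WHITE for c in color.values()):
        # extend the connected set through the gray vertex covering the most white ones
        v = max((u for u in adj if color[u] == GRAY), key=white_degree)
        make_black(v)
    return cds
```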
8.3.2 Sparse 3D Model Reconstruction

Two city sequences with landmark areas visited several times are used to demonstrate sparse 3D model reconstruction, see Figure 8.5. Nevertheless, they were input into the pipeline as unordered image sets.

CASTLE image set. Omnidirectional image set CASTLE [42], captured by a 183° fish-eye lens camera with known calibration [67], consists of 4,472 omnidirectional images captured while walking in the center of Prague and around the Prague Castle. The obtained approximate minimum connected dominating set comprises 1,063 vertices, and its 1,359 edges are used as the seeds of the reconstruction. The image set reduction is not as drastic as for image set DITREVI because the images are more evenly distributed. We use MSER [61], SIFT, and SURF image features in order to create sufficiently many 3D points even when the image resolution is low. Several 3D models showing the important landmarks captured in the image set were obtained when the reconstruction time was limited to 6 hours, see Figure 8.6.
Figure 8.4: (a) 3D model computed by Bundler [90] from the 70 images selected from image set DITREVI by CDS. (b) The best of the 3D models returned by the five runs of Bundler on different random selections of 70 images from image set DITREVI.
Figure 8.5: Sample input images from image sets CASTLE and VIENNA, respectively.

The resulting sparse 3D models are very similar to those presented in Chapter 7, but the speed of the reconstruction differs significantly: 12.5 days were needed to obtain those results using the previous method, whereas with the new approach the models are obtained in 10 hours, including 4 hours for the image similarity matrix computation, which confirms that the task priority keys are assigned properly.
Figure 8.6: Two largest partial 3D models reconstructed from the reduced image set CASTLE (1,063 images) after 6 hours of computation.

VIENNA image set. Image set VIENNA [45] consists of 2,448 radially undistorted perspective images captured by a pre-calibrated camera while walking in the center of Vienna. After computing the image similarity matrix in 90 minutes, 1,008 vertices and 1,900 edges, the latter being the seeds of the reconstruction, are obtained in 10 seconds as the result of the search for the approximate minimum connected dominating set of the corresponding graph. As the resolution of the images in this set is sufficient, only SURF image features are used for 3D model reconstruction. The 3D models showing several important landmarks captured in the image set, obtained after 6 hours of reconstruction, can be seen in Figure 8.7. Compared to the omnidirectional image set CASTLE, only parts of the landmarks are reconstructed within the 6 hour limit because more images are needed to capture a whole landmark, as the field of view of a perspective camera is limited. Partial 3D models grow and become connected gradually as the reconstruction continues, see Table 8.1 for quantitative results of the reconstruction process at given times.
Figure 8.7: Two largest partial 3D models reconstructed from the reduced image set VIENNA (1,008 images) after 6 hours of computation.

Time        1h    2h    3h    4h    5h    6h    7h    8h    9h   10h   11h   12h
# pairs    548   991  1432  1773  2100  2360  2624  2882  3172  3437  3679  4030
# seeds     44    57    66    77    86    80    79    77    73    71    71    65
# images   153   244   313   368   411   438   466   496   521   546   572   600
Table 8.1: The number of computed pairwise matchings, the number of active seeds, and the number of images contained in at least one partial model for the reduced image set VIENNA (1,008 images) at given times of the reconstruction process.

Notice that the number of active seeds drops (86 → 77 → 65) after some time as the overlapping models are merged, and that the number of computed pairwise matchings grows sub-quadratically with the number of images contained in the partial models, staying far behind the quadratic number which would be reached by methods using exhaustive pairwise image matching. Note that when running Bundler on the reduced image set, 3 hours are spent on detecting and describing SIFT image features, and 1,922 out of 15,753 tested image pairs are accepted after an additional 6 hours of computation. No partial 3D models are output at this time, as bundling starts later, after all 507,528 possible image pairs have been tested. If one modified Bundler according to [2] so that it tests only the ten most promising image pairs per image based on image similarity and ran it on the non-reduced image set comprising 2,448 images, the whole 6 hour limit would still be spent on testing the 16,762 obtained image pairs. This demonstrates the need for a prioritized Structure from Motion pipeline for large image sets.
9 Conclusions
The thesis contributed to improving the scalability and efficiency of Structure from Motion computation from both ordered and unordered image sets. In particular, the methods presented in the thesis belong to the group of "incremental" SfM methods, which build the resulting 3D models by creating one or more seed reconstructions that are iteratively expanded by connecting additional images and by triangulating additional 3D points. The structure of the input data needs to be known in order to select the images which should be connected in a given iteration. It was pointed out that such images can be chosen relatively easily for image sequences, i.e. ordered image sets, where consecutive frames have visual overlap, but that the selection of suitable images is rather difficult for general image data, i.e. unordered image sets.

Recent methods dealing with the same problem [90], reviewed in Chapter 2, reveal the unknown structure of the input data by performing detection and description of salient feature points in the input images followed by exhaustive pairwise image matching. This becomes infeasible for large image sets as the number of tested image pairs is quadratic in the number of images. The method presented in Chapter 7 overcomes this problem by employing visual indexing, namely the distance of the so-called tf-idf vectors [87], to assign a similarity score to each pair of images. Only image pairs with high similarity scores, i.e. those which are likely to have visual overlap, are tested and used for seed generation, provided that the visual overlap is geometrically verified. It was shown that the proposed method is effective in reducing the number of tested image pairs while the quality of the resulting 3D model is not degraded compared to the exhaustive method.

Several technical choices which facilitate successful 3D model construction were also presented. First, triplets of cameras are used instead of image pairs as the seeds of the reconstruction. The well known fact that the third camera helps to reject incorrectly triangulated 3D points at early reconstruction stages is ignored by most of the competing approaches. Secondly, a novel measure for detecting the amount of motion w.r.t. the scene was introduced. The dominant apical angle (DAA) presented in Chapter 6 was successfully used both for seed rejection when reconstructing unordered image sets and for keyframe selection when dealing with image sequences. Finally, by employing the "cone test" instead of the widely used reprojection error, it is possible to correctly classify the tentative matches into inliers and outliers even when the current 3D point estimates are not accurate. Using the constraints added by the correctly accepted tentative matches, the 3D point estimates can be refined, which would not be possible if such tentative matches were rejected due to large reprojection errors.
The usefulness of visual indexing for ordered image sets was shown in Chapter 5. This time, the similarity scores of the individual image pairs were used to choose loop closing candidates in the proposed sequence bridging technique. The geometrically verified candidates were included in the resulting 3D model as additional constraints and the model was refined by an extra run of the non-linear optimization [56]. The approach was validated on a sequence comprising thousands of images by showing successful removal of the accumulated camera pose estimation error, which is inevitable for methods reconstructing from image sequences.

An efficient SfM method which is able to reconstruct from a large variety of image sets was described in Chapter 8. It was observed that some image sets are large just because they are highly redundant. As this redundancy is not beneficial to SfM computation, an image set reduction procedure was introduced. Scores obtained by visual indexing are used to construct a graph expressing image similarity, and the desired subset of input images is selected as the approximate minimum connected dominating set of this graph. It was shown that such a selection can reduce the number of input images drastically while the connectivity of non-redundant sequential data is preserved. Last but not least, the strict division of SfM computation into several steps was replaced by a priority queue of individual SfM tasks. The priorities of the particular tasks are assigned according to the visual indexing scores again, but they are also influenced by the history of the computation, so easy options are tried first, which leads to obtaining good scene covering reconstructions in limited time.

As stated earlier, all the presented methods were implemented to work with the general central camera model, covering the most common special cases including (i) perspective cameras, (ii) fish-eye lenses, (iii) equirectangular panoramas, and (iv) cameras calibrated by the polynomial omnidirectional model. This broadens the usability of the methods significantly because, unlike the commonly used perspective cameras, cameras with a wide field of view do not suffer from occlusions and sharp camera turns when processing sequential data, as shown in Chapter 4. Omnidirectional vision is beneficial for reconstructions from unordered image sets also because a much smaller number of images is sufficient for covering a given scene segment. If perspective data are required by subsequent computation steps, e.g. 3D surface construction or pedestrian detection, it is possible to generate truly perspective cutouts from the data. For pedestrian detection, image stabilization w.r.t. the ground plane together with the generation of the non-central cylindrical images proposed in Chapter 6 proved to be well suited.

The reconstruction pipelines are accessible to registered users through the CMP SfM Web Service¹ and were successfully used to reconstruct many challenging user-uploaded image sets.
¹ CMP SfM Web Service is a remote procedure call service operated by the Center for Machine Perception (CMP) at the Czech Technical University, with the majority of the available procedures being implementations of different computer vision methods developed at CMP. The service can be accessed through the web page http://ptak.felk.cvut.cz/sfmservice and also using a command line scripting interface.
Bibliography

[1] 2d3. Boujou – http://www.boujou.com, 2001.
[2] S. Agarwal, N. Snavely, I. Simon, S. Seitz, and R. Szeliski. Building Rome in a day. In ICCV'09, pages 72–79, 2009.
[3] A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, D. Nistér, and M. Pollefeys. Towards urban 3D reconstruction from video. In 3DPVT'06, pages 1–8, 2006.
[4] M. Antone and S. Teller. Automatic recovery of relative camera rotations for urban scenes. In CVPR'00, pages II:282–289, 2000.
[5] M. Antone and S. Teller. Scalable, absolute position recovery for omni-directional image networks. In CVPR'01, pages I:398–405, 2001.
[6] H. Bakstein and T. Pajdla. Panoramic mosaicing with a 180° field of view lens. In OMNIVIS'02, pages 60–67, 2002.
[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU, 110(3):346–359, June 2008.
[8] R. Benosman and S. Kang. Panoramic Vision. Springer-Verlag, 2000.
[9] C. Brenner and N. Haala. Fast production of virtual reality city models. IAPRS, 32(4):77–84, 1998.
[10] M. Brown and D. Lowe. Recognising panoramas. In ICCV'03, pages II:1218–1225, 2003.
[11] M. Brown and D. Lowe. Unsupervised 3D object recognition and reconstruction in unordered datasets. In 3-D Digital Imaging and Modeling (3DIM), pages 56–63, 2005.
[12] J. Burkardt. Sphere grid: Points, lines, faces on a sphere – http://people.scs.fsu.edu/~burkardt/datasets/sphere_grid, 2007.
[13] M. Byrod, K. Josephson, and K. Astrom. Fast optimal three view triangulation. In ACCV'07, pages II:549–559, 2007.
[14] C3 Technologies. Nokia Maps 3D – http://maps.nokia.com, 2011.
[15] O. Chum and J. Matas. Randomized RANSAC with T_{d,d} test. In BMVC'02, pages 448–457, 2002.
[16] O. Chum and J. Matas. Matching with PROSAC: Progressive sample consensus. In CVPR'05, pages I:220–226, 2005.
[17] O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In Conference on Image and Video Retrieval (CIVR), pages 549–556, 2007.
[18] O. Chum, T. Werner, and J. Matas. Two-view geometry estimation unaffected by a dominant plane. In CVPR'05, pages 772–780, 2005.
[19] B. Clipp, J.-H. Kim, J.-M. Frahm, M. Pollefeys, and R. Hartley. Robust 6DOF motion estimation for non-overlapping, multi-camera systems. In WACV'08, pages 1–8, 2008.
[20] N. Cornelis, K. Cornelis, and L. Van Gool. Fast compact city modeling for navigation pre-visualization. In CVPR'06, pages II:1339–1344, 2006.
[21] N. Cornelis, B. Leibe, K. Cornelis, and L. Van Gool. 3D city modeling using cognitive loops. In 3DPVT'06, pages 9–16, 2006.
[22] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR'05, pages II:886–893, 2005.
[23] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. PAMI, 29(6):1052–1067, June 2007.
[24] T. Ehlgen and T. Pajdla. Maneuvering aid for large vehicle using omnidirectional cameras. In WACV'07, pages 1–6, 2007.
[25] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multi-person tracking. In CVPR'08, pages 1–8, 2008.
[26] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24(6):381–395, June 1981.
[27] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In ECCV'10, pages IV:368–381, 2010.
[28] C. Früh, S. Jain, and A. Zakhor. Data processing algorithms for generating textured 3D building facade meshes from laser scans and camera images. IJCV, 61(2):159–184, February 2005.
[29] C. Früh and A. Zakhor. 3D model generation for cities using aerial photographs and ground level laser scans. In CVPR'01, pages II:31–38, 2001.
[30] Y. Furukawa and J. Ponce. Accurate, dense, and robust multi-view stereopsis. In CVPR'07, 2007.
[31] M. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
[32] C. Geyer and K. Daniilidis. Structure and motion from uncalibrated catadioptric views. In CVPR'01, pages I:279–286, 2001.
[33] T. Goedemé, M. Nuttin, T. Tuytelaars, and L. Van Gool. Omnidirectional vision based topological navigation. IJCV, 74(3):219–236, September 2007.
[34] Google. Google Earth – http://earth.google.com, 2004.
[35] Google. Photo tours in Google Maps – http://maps.google.com/phototours, 2012.
[36] A. Grün. Automation in building reconstruction. In D. Fritsch and D. Hobbie, editors, Photogrammetric Week'97, pages 175–186, Stuttgart, 1997.
[37] S. Guha and S. Khuller. Approximation algorithms for connected dominating sets. Algorithmica, 20(4):374–387, 1998.
[38] N. Haala, C. Brenner, and C. Stätter. An integrated system for urban model generation. In ISPRS Congress Comm. II, pages 96–103, 1998.
[39] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2003.
[40] M. Havlena, K. Cornelis, and T. Pajdla. Towards city modeling from omnidirectional video. In M. Grabner and H. Grabner, editors, CVWW'07, pages 123–130, St. Lambrecht, 2007.
[41] M. Havlena, Š. Fojtů, and T. Pajdla. Nao robot localization and navigation with atom head. Research Report CTU–CMP–2012–07, CMP Prague, March 2012.
[42] M. Havlena, A. Torii, J. Knopp, and T. Pajdla. Randomized structure from motion based on atomic 3D models from camera triplets. In CVPR'09, pages 2874–2881, 2009.
[43] K. Ho and P. Newman. Detecting loop closure with scene sequences. IJCV, 74(3):261–286, September 2007.
[44] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In CVPR'06, pages II:2137–2144, 2006.
[45] A. Irschara, C. Zach, and H. Bischof. Towards wiki-based dense city modeling. In Virtual Representations and Modeling of Large-scale Environments (VRML), pages 1–8, 2007.
[46] M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces. In CVPR'11, pages 3121–3128, 2011.
[47] F. Kahl. Multiple view geometry and the L∞-norm. In ICCV'05, pages II:1002–1009, 2005.
[48] Q. Ke and T. Kanade. Quasiconvex optimization for robust geometric reconstruction. PAMI, 29(10):1834–1847, October 2007.
[49] M. Klopschitz, C. Zach, A. Irschara, and D. Schmalstieg. Generalized detection and merging of loop closures for video sequences. In 3DPVT'08, pages 137–144, 2008.
[50] J. Knopp, J. Šivic, and T. Pajdla. Location recognition using large vocabularies and fast spatial matching. Research Report CTU–CMP–2009–01, CMP Prague, January 2009.
[51] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool. Dynamic 3D scene analysis from a moving vehicle. In CVPR'07, pages 1–8, 2007.
[52] B. Leibe, K. Schindler, and L. Van Gool. Coupled detection and trajectory estimation for multi-object tracking. In ICCV'07, pages 1–8, 2007.
[53] M. Lhuillier. Effective and generic structure from motion using angular error. In ICPR'06, pages I:67–70, 2006.
[54] H. Li and R. Hartley. A non-iterative method for correcting lens distortion from nine point correspondences. In OMNIVIS'05, pages 1–8, 2005.
[55] X. Li, C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV'08, pages I:427–440, 2008.
[56] M. Lourakis and A. Argyros. The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Tech. Report 340, Institute of Computer Science – FORTH, August 2004.
[57] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, November 2004.
[58] H. Maas. The suitability of airborne laser scanner data for automatic 3D object reconstruction. In Ascona'01, pages 291–296, 2001.
[59] A. Makhorin. Glpk: GNU linear programming kit – http://www.gnu.org/software/glpk, 2000.
[60] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR'07, pages 1–8, 2007.
[61] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC'02, pages 384–393, 2002.
[62] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. IVC, 22(10):761–767, September 2004.
[63] J. Matas, Š. Obdržálek, and O. Chum. Local affine frames for wide-baseline stereo. In ICPR'02, pages IV:363–366, 2002.
[64] T. Mauthner, F. Fraundorfer, and H. Bischof. Region matching for omnidirectional images using virtual camera planes. In O. Chum and V. Franc, editors, CVWW'06, pages 93–98, Telč, 2006.
[65] Microsoft. Photosynth – http://livelabs.com/photosynth, 2008.
[66] B. Mičušík and J. Košecká. Piecewise planar city 3D modeling from street view panoramic sequences. In CVPR'09, pages 2906–2912, 2009.
[67] B. Mičušík and T. Pajdla. Structure from motion with wide circular field of view cameras. PAMI, 28(7):1135–1149, July 2006.
[68] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63–86, October 2004.
[69] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1-2):43–72, November 2005.
[70] E. Mouragnon, F. Dekeyser, P. Sayd, M. Lhuillier, and M. Dhome. Real time localization and 3D reconstruction. In CVPR'06, pages I:363–370, 2006.
[71] M. Muja and D. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP'09, pages 331–340, 2009.
[72] D. Nistér. Preemptive RANSAC for live structure and motion estimation. In ICCV'03, pages 199–206, 2003.
[73] D. Nistér. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756–770, June 2004.
[74] D. Nistér. A minimal solution to the generalized 3-point pose problem. In CVPR'04, pages I:560–567, 2004.
[75] D. Nistér and C. Engels. Estimating global uncertainty in epipolar geometry for vehicle-mounted cameras. In SPIE – Unmanned Systems Technology VIII, pages 62301L:1–12, 2006.
[76] D. Nistér, F. Kahl, and H. Stewénius. Structure from motion with missing data is NP-hard. In ICCV'07, pages 1–7, 2007.
[77] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In CVPR'06, pages II:2161–2168, 2006.
[78] Š. Obdržálek and J. Matas. Object recognition using local affine frames on distinguished regions. In BMVC'02, pages 113–122, 2002.
[79] Š. Obdržálek and J. Matas. Image retrieval using local compact DCT-based representation. In DAGM'03, pages 490–497, 2003.
[80] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, May 2001.
[81] J. Philbin, O. Chum, M. Isard, J. Šivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR'07, pages 1–8, 2007.
[82] Point Grey Research Inc. Ladybug 2 – http://www.ptgrey.com/products/ladybug2/index.asp, 2005.
[83] R. Raguram, J.-M. Frahm, and M. Pollefeys. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In ECCV'08, pages 500–513, 2008.
[84] D. Scaramuzza, F. Fraundorfer, R. Siegwart, and M. Pollefeys. Closing the loop in appearance guided SfM for omnidirectional cameras. In OMNIVIS'08, pages 1–14, 2008.
[85] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or 'How Do I Organize My Holiday Snaps?'. In ECCV'02, pages I:414–431, 2002.
[86] G. Schweighofer and A. Pinz. Globally optimal O(n) solution to the PnP problem for general camera models. In BMVC'08, pages 1–10, 2008.
[87] J. Šivic and A. Zisserman. Video Google: Efficient visual search of videos. In Toward Category-Level Object Recognition (CLOR), pages 127–144, 2006.
[88] N. Snavely. Scene reconstruction and visualization from internet photo collections: A survey. IPSJ Transactions on Computer Vision and Applications (CVA), 3:44–66, 2011.
[89] N. Snavely, S. Seitz, and R. Szeliski. Photo tourism: Exploring image collections in 3D. In SIGGRAPH'06, pages 835–846, 2006.
[90] N. Snavely, S. Seitz, and R. Szeliski. Modeling the world from internet photo collections. IJCV, 80(2):189–210, 2008.
[91] N. Snavely, S. Seitz, and R. Szeliski. Skeletal graphs for efficient structure from motion. In CVPR'08, pages 1–8, 2008.
[92] I. Stamos and P. Allen. 3-D model construction using range and image data. In CVPR'00, pages I:531–536, 2000.
[93] H. Stewénius. Gröbner Basis Methods for Minimal Problems in Computer Vision. PhD thesis, Centre for Mathematical Sciences LTH, Lund University, Sweden, 2005.
[94] J. Sturm. SeDuMi: A software package to solve optimization problems – http://sedumi.ie.lehigh.edu, 2006.
[95] Y. Sun, J. Paik, A. Koschan, and M. Abidi. 3D reconstruction of indoor and outdoor scenes using a mobile range scanner. In ICPR'02, pages III:653–656, 2002.
[96] J. Tardif, Y. Pavlidis, and K. Daniilidis. Monocular visual odometry in urban environments using an omnidirectional camera. In IROS'08, pages 2531–2538, 2008.
[97] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa, and N. Master. Calibrated, registered images of an extended urban area. IJCV, 53(1):93–107, June 2003.
[98] A. Torii, M. Havlena, and T. Pajdla. Omnidirectional image stabilization by computing camera trajectory. In PSIVT'09, pages 71–82, 2009.
[99] A. Torii, M. Havlena, T. Pajdla, and B. Leibe. Measuring camera translation by the dominant apical angle. In CVPR'08, pages 1–7, 2008.
[100] A. Torii and T. Pajdla. Omnidirectional camera motion estimation. In VISAPP'08, pages II:577–584, 2008.
[101] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: A modern synthesis. In Vision Algorithms: Theory and Practice, pages 298–372, Corfu, 1999.
[102] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. PAMI, 13(4):376–380, April 1991.
[103] M. Vergauwen and L. Van Gool. Web-based 3D reconstruction service. Machine Vision and Applications (MVA), 17(6):411–426, December 2006.
[104] C. Vestri and F. Devernay. Using robust methods for automatic extraction of buildings. In CVPR'01, pages I:133–138, 2001.
[105] G. Vosselman and S. Dijkman. 3D building model reconstruction from point clouds and ground plans. IAPRS, 34(3):37–44, October 2001.
[106] B. Williams, G. Klein, and I. Reid. Real-time SLAM relocalisation. In ICCV'07, pages 1–8, 2007.
[107] Yahoo! Flickr: Online photo management and photo sharing application – http://www.flickr.com, 2005.
List of Publications Related to the Thesis

#  Publication
4  M. Havlena, T. Pajdla, and K. Cornelis: Structure from Omnidirectional Stereo Rig Motion for City Modeling. VISAPP 2008, Funchal, Madeira, Portugal.
5  A. Torii, M. Havlena, and T. Pajdla: From Google Street View to 3D City Models. OMNIVIS 2009, Kyoto, Japan.
6  A. Torii, M. Havlena, T. Pajdla, and B. Leibe: Measuring Camera Translation by the Dominant Apical Angle. CVPR 2008, Anchorage, Alaska, USA.
6  A. Torii, M. Havlena, and T. Pajdla: Omnidirectional Image Stabilization for Visual Object Recognition. International Journal of Computer Vision, Springer, New York, USA, 2011.
7  M. Havlena, A. Torii, J. Knopp, and T. Pajdla: Randomized Structure from Motion Based on Atomic 3D Models from Camera Triplets. CVPR 2009, Miami Beach, Florida, USA.
8  M. Havlena, A. Torii, and T. Pajdla: Efficient Structure from Motion by Graph Optimization. ECCV 2010, Hersonissos, Crete, Greece.