Int J Comput Vis (2008) 79: 137–158 DOI 10.1007/s11263-007-0108-2
Inter-Image Statistics for 3D Environment Modeling Luz A. Torres-Méndez · Gregory Dudek
Received: 13 March 2007 / Accepted: 5 November 2007 / Published online: 30 November 2007 © Springer Science+Business Media, LLC 2007
Abstract In this article we present a method for automatically recovering complete and dense depth maps of an indoor environment by fusing incomplete data for the 3D environment modeling problem. The geometry of indoor environments is usually extracted by acquiring a huge amount of range data and registering it. By acquiring a small set of intensity images and a very limited amount of range data, the acquisition process is considerably simplified, saving time and energy consumption. In our method, the intensity and partial range data are first registered using an image-based registration algorithm. Then, the missing geometric structures are inferred using a statistical learning method that integrates and analyzes the statistical relationships between the visual data and the available depth in terms of small patches. Experiments on real-world data with a variety of sampling strategies demonstrate the feasibility of our method.

Keywords 3D environment modeling · Sensor fusion · Markov random fields
Electronic supplementary material The online version of this article (http://dx.doi.org/10.1007/s11263-007-0108-2) contains supplementary material, which is available to authorized users.

L.A. Torres-Méndez · G. Dudek
Centre for Intelligent Machines, McGill University, Montreal, H3A 2A7, Canada
e-mail: [email protected]

Present address:
L.A. Torres-Méndez
CINVESTAV Campus Saltillo, Carretera Saltillo-Monterrey, Km. 13.5, Coahuila, 25900, Mexico
1 Introduction

Many computer graphics and computer vision applications require 3D models. Such a model must contain geometric and photometric details as well as knowledge about empty spaces. In robotics, having a mobile robot able to build a 3D map of the environment is particularly appealing, as the map can be used not only for planning robot actions but also for several important applications: for example, virtual exploration of remote locations, either for safety, for automatic rescue and inspection of hazardous or inhospitable environments (e.g., the reactor of a nuclear power plant), or for efficiency reasons (museum tours). All these applications depend on the transmission of meaningful visual and geometric information. When building a 3D model of a real environment, suitable sensors to densely cover the environment are required. Visual sensors are the preferred alternative for capturing the photometric characteristics of the scene, and they are fast. Conversely, range sensors capture geometric data directly, but acquiring dense range maps is often impractical. Stereo cameras can produce volumetric scans that are economical, but they often require calibration or produce range maps that are either incomplete or of limited resolution. Moreover, relying on only one type of sensor to obtain the complete characteristics of the environment limits and complicates the whole process. Sensor fusion may be a solution, as different types of sensors can have their data correlated appropriately, strengthening the confidence of the resulting percepts well beyond that of any individual sensor's readings. A typical 3D model acquisition pipeline is composed of a specialized 3D scanner to acquire precise geometry and a digital camera to capture appearance information. Most commercial devices at all price ranges, such as the Swiss Ranger and the CanestaVision sensors in the low-cost arena, and
the highly praised 3DV Systems, capture high-resolution visual images along with low-resolution depth information. Thus, one alternative solution is to recover high-resolution or dense range images from the low-resolution ones (Diebel and Thrun 2005). However, in this research, as we wish to have a 3D map representation of large indoor scenes, we choose to simplify the way sensor data is acquired so that the time and energy consumption can be minimized. This can be achieved by acquiring only partial, but reliable, depth information. Techniques for surface depth recovery are essential in multiple applications involving computer vision and robotics. In particular, we investigate the autonomous integration of incomplete sensory data to recover dense depth maps, which are then integrated to build a 3D model of an unknown large-scale1 indoor environment. Thus, the challenge becomes one of trying to extract, from the sparse sensory data, an overall concept of the shape and size of the structures within the environment. We explore and analyze the statistical relationships between intensity and range data in terms of small image patches. Our goal is to demonstrate that the surround (context) statistics of both the intensity and range image patches can provide information to infer the complete 3D layout of space. It has been shown by Lee et al. (2001) that although there are clear differences between optical and range images, they do have similar second-order statistics and scaling properties (i.e., they both have similar structure when viewed as random variables). Our motivation is to exploit this fact, and also the fact that both video imaging and limited range sensing are ubiquitous, readily available technologies, while complete volume scanning is prohibitive on most mobile platforms. Our approach for range estimation is based on the assumption that the pixels constituting both the range and intensity images acquired in an environment can be regarded as the results of pseudo-random processes, but that these random processes exhibit useful structure. In particular, the assumption that range and intensity images are correlated is exploited. A second assumption is that the variations of pixels in the range and intensity images are related to the values elsewhere in the image(s), and that these variations can be efficiently captured by the neighborhood system of a Markov Random Field (MRF) model. Both these assumptions have been considered before (Geman and Geman 1984; Efros and Leung 1999; Wei and Levoy 2000; Efros and Freeman 2001; Hertzmann et al. 2001), and recently there has been an enormous interest in developing techniques that combine information from camera images with range images to recover 3D structure (Torralba and Oliva 2002; Michels et al. 2005; Diebel and Thrun 2005).

1 Large-scale space is defined as a physical space that cannot be entirely perceived from a single vantage point (Kuipers 1978).
In summary, this work addresses the question of how the statistical nature of the visual (photometric) context provides information about its geometric properties; in particular, how the statistical relationships between intensity and range data can be modeled reliably such that the inference of unknown range is as accurate as possible. In this vein, as the sensor data to be acquired are of different kinds, achieving a robust registration of the intensity data with the incomplete range data is vital for the range synthesis problem. To this end, an image-based registration technique is presented.
2 Related Work

Our work is an instance of the 3D environment modeling problem. Over the last 30 years, this problem has received considerable attention in the computer vision and computer graphics communities and, more recently, in robotics. In the context of this paper we will consider only a few representative solutions. In computer vision, shape-from-X techniques extract depth information from intensity images by using cues such as shading, texture, retinal disparity and motion. This group of techniques is passive, i.e., they obtain image data without emitting energy, and they typically work by designing mathematical models of image formation and inverting these models. These models are traditionally based on physical principles of light interaction. However, because the inverse problem of these principles is highly underconstrained, many assumptions about the type of surface and albedo need to be made, which may not all be suitable for complex real scenes. Saxena et al. (2006) applied supervised learning to the problem of estimating depth from single monocular cues on images of unconstrained outdoor and indoor environments. This task is difficult since it requires a significant amount of prior knowledge about the global structure of the image. Michels et al. (2005) used supervised learning to estimate distances to obstacles in order to autonomously drive a remote-controlled car. Methods such as shape from shading (Zhang et al. 1999) rely on purely photometric properties, assuming uniform color and texture. Hoiem et al. (2005) also considered monocular 3D reconstruction, but focused on generating visually pleasing graphical images by classifying the scene as sky, ground, or vertical planes, rather than on producing accurate metric depth maps. Dense stereo vision gained popularity in the early 1990s due to the large amount of range data that it could provide (Little and Gillett 1990; Fua 1993). In mobile robotics, a common setup is the use of one or two cameras mounted on the robot to acquire depth information as the robot moves through the environment. Over the past decade, researchers have developed very good
stereo vision systems (see Scharstein and Szeliski 2002 for a review). Although these systems work well in many environments, the cameras must be precisely calibrated for reasonably accurate results. Also, the results are limited by the baseline distance between the two cameras: the depth estimates tend to be inaccurate when the distances considered are large. The depth maps generated by stereo under normal scene conditions (i.e., no special textures or structured lighting) suffer from problems inherent in window-based correlation. These problems manifest as imprecisely localized surfaces in 3D space and as hallucinated surfaces that in fact do not exist. Recently, there has been much progress on using learning approaches in stereo vision. In particular, following the findings in the work by Szeliski et al. (2006), stereo methods based on Markov Random Fields (MRFs) have become popular with the emergence of better inference techniques, such as graph cuts (Boykov et al. 2001; Kolmogorov and Zabih 2001) and belief propagation (Sun et al. 2003; Sun et al. 2005; Felzenszwalb and Huttenlocher 2006). One of the top-performing methods for stereo vision is the work by Zhang and Seitz (2005), who iteratively estimate the global parameters of an MRF stereo model from the previous disparity estimates, without having to rely on ground-truth data. In more recent work, Saxena et al. (2007) incorporate monocular cues from a single image into a stereo system, modeling depths and the relationships between depths at different points in the image using a hierarchical, multiscale MRF. A training set, comprising a large set of stereo pairs and their corresponding ground-truth depth maps, is used to model the posterior distribution of the depths given the monocular image features and the disparities. Other works have attempted to model 3D objects from image sequences (Tomasi and Kanade 1992; Fitzgibbon and Zisserman 1998; Pollefeys et al. 1998), in an effort to reduce the amount of calibration and to avoid restrictions on the camera motion. Fitzgibbon and Zisserman (1998) proposed a method that sequentially retrieves the projective calibration of a complete image sequence based on tracking corner and/or line features over two or more images, and reconstructs each feature independently in 3D. Their method solves the feature correspondence problem using methods based on the fundamental matrix and the tri-focal tensor, which encode precisely the geometric constraints available from two or more images of the same scene from different viewpoints. A similar work is that of Pollefeys et al. (1998), in which a 3D model of an object is obtained from image sequences acquired by a freely moving camera. The camera motion and its settings are unknown. Their method is based on a combination of projective reconstruction, self-calibration and dense depth estimation techniques. In general, these methods derive the epipolar geometry and the trifocal tensor from point correspondences. However, they assume that it is possible to run an interest operator such as a corner detector to extract from one of the images a sufficiently
large number of points that can then be reliably matched in the other images. It appears that if one uses information of only one type, the reconstruction task becomes very difficult and works well only under narrow constraints. Depth information from an object or scene can also be acquired directly by using active range sensors; these techniques are known as active sensing techniques. Such sensors are independent of external lighting conditions and do not need any texture to perform well. However, they tend to be expensive, the data acquisition process is slow, and the spatial resolution is normally limited. There is a large body of research using laser rangefinders for different applications, such as the 3D reconstruction problem or scene understanding for indoor mobile robots (Connolly 1984; Whaite and Ferrie 1997; Nyland et al. 1999; El-Hakim 1998; Sequeira et al. 1999; Stamos and Allen 2000). Most of this research focuses on the extraction of geometric relationships and calibration parameters in order to achieve realistic2 and accurate representations of the world. In most cases it is not easy to extract the required features, and human intervention is often required. Depending on the application and the complexity of the object or scene, achieving geometric correctness and realism may require data collection from different sensors (passive and/or active) as well as the correct fusion of all these observations. The idea of using more than one sensor to complement the data of one sensor with that of another is not new (Nelson and Cox 1988). However, methods for data fusion are not as well developed as those for the design and analysis of individual sensors. Independent of the level of representation used, a variety of popular mathematical techniques for sensor fusion appear in the literature: for example, probability and Bayesian inference (Davies 1986) and the Dempster-Shafer theory of evidence (Shafer 1976). These established techniques are then incorporated into a framework for transforming data into a common coordinate system, producing verdicts on the correctness of the various sources, and allowing stable estimation of the parameters of a problem. A vast body of research on 3D modeling and virtual reality applications has focused on the fusion of intensity and range data, with promising results (Pulli et al. 1997; El-Hakim 1998; Sequeira et al. 1999; Levoy et al. 2000; Stamos and Allen 2000). These works all consider the complete acquisition of 3D points from the object or scene to be modeled, focusing mainly on the registration and integration problems. There exists recent work whose goal is to generate high-resolution depth maps from low-resolution depth maps using high-resolution visual information (Freeman and Torralba 2003; Diebel and Thrun 2005).

2 In this context, realistic means that the final model contains geometric and photometric details.
With training images, Freeman and Torralba (2003) use new low-dimensional representations, called scene recipes, to learn the relationships between a low-resolution range image and a high-resolution intensity image in order to infer a more accurate high-resolution range image. Such a problem may arise when range data is derived through stereo algorithms. Stereo algorithms suffer from the lack of local surface texture, from the smoothness-of-depth constraint, or from local mismatches in the disparity estimates. Thus, stereo methods only provide a coarse depth map, which is often accurate in the low frequency bands but less accurate in the high frequency bands. In order to compute a better depth map, their method integrates both the high-resolution image and the low-resolution range image, which are decomposed into a steerable wavelet filter pyramid. This filter breaks the image down according to scale and orientation, with minimal aliasing between subbands (Freeman and Adelson 1991). Unlike Freeman et al.'s approach, our method does not need a special training set; all the parameters are estimated from the collected input data itself. Diebel and Thrun (2005) propose an MRF model that integrates the low-resolution range image and the high-resolution visual image to enhance the resolution and accuracy of the depth image. The intuition behind their MRF model is similar to that in our work, i.e., that depth discontinuities in a scene often correspond to color or brightness changes within the associated visual image. However, a notable difference of our research is that the amount of range data acquired is very small compared to the intensity data. There is no prior work (except for these authors' previous work; see Torres-Méndez and Dudek 2004a, 2004b, 2006; Torres-Méndez 2005) that uses partial depth measurements together with intensity images for complete and dense scene recovery. Man-made environments exhibit strong statistical regularities3 that are exploitable by biological and machine vision systems. Torralba and Oliva (2002) studied the statistics of natural images and concluded that these statistics strongly vary as a function of the interaction between the observer and the world. In particular, they show that second-order statistics of images are correlated with scene scale and scene category and provide information to perform fast and reliable scene and object categorization. They also show that image statistics vary when considering scenes at different scales. Close-up views of man-made objects tend to produce images that are composed of flat and smooth surfaces. Consequently, the energy of the power spectra for close-up views is concentrated mainly in low spatial frequencies. As the distance between the observer and the scene background increases, the visual field covers a larger space that is

3 Real scenes are constrained by many regularities in the environment, such as the natural geometry of objects and their arrangements in space, natural distributions of light, and regularities in the observer's position.
likely to encompass more objects. The images of man-made scenes appear as a collection of surfaces that break down into smaller pieces (objects, walls, windows, etc.). Thus, the spectral energy corresponding to high spatial frequencies increases as the scene becomes more cluttered, due to the increase, with distance, in the area covered by the visual field. In contrast, the spectral signatures of natural environments behave differently as depth increases. Recent emphasis on explicit probability models for images may be due to the growing appreciation for the variability exhibited by images and the realization that exact mathematical models may not be feasible. The key to a statistical approach is to have probability models that capture the essential variability and yet are tractable. The earliest, and still widely used, probabilistic models for images were based on Markov Random Field (MRF) models (Winkler 1995). MRFs, in particular, define a class of statistical models which make it possible to describe both the local and global properties of structures in images. However, few studies have been made into the statistical relationship between intensity images and range images. Those few studies have uncovered meaningful and exploitable statistical trends in real scenes, which may be useful for designing new algorithms for surface inference and also for understanding how humans perceive depth in real scenes. One of the few investigations into the joint statistics of range and intensity images was carried out by Howe and Purves (2002). In their study, they examine range and co-registered intensity images to find how the perceived length of a line segment on a blank background depends on the orientation of the line. They found that this bias closely matches the 3D length of the line segments when projected into a range image. This result may help to explain how the brain computes depth. Potetz and Lee (2003) measure the correlations between linear properties of range images and linear properties of image intensity data, to explore the structure of the correlations that could usefully underlie 3D judgments from intensity data in images of natural scenes (e.g., shape from shading). Using linear regression, ridge regression, and canonical correlation techniques, they extract simple but interesting statistical trends between the two image domains. The content of their images includes scans of trees and wooded areas, rocky areas, building exteriors, and sculptures. A greater understanding of how real images are formed could lead to substantial insight into how depth information may be inferred from single images.
3 Our Framework
3D environment modeling typically involves the accurate reconstruction of the sensed objects and knowledge of the spatial layout of these objects. We divide 3D environment modeling into the following stages: (1) sensor data acquisition and data registration; (2) depth inference or shape recovery; and (3) data integration for model representation. In this paper the three stages are described, focusing mainly on the depth inference stage.

3.1 Data Acquisition and Registration

The aim of our system configuration is to reduce the data acquisition time and facilitate the registration process. On top of the robot, we have assembled a system consisting of a 2D laser rangefinder (laser scanner) and a high-resolution digital camera, both mounted on a pan unit (see Fig. 1). The centers of projection of the camera (optical center) and the laser (mirror center) and the pan unit are aligned, facilitating the registration between the intensity and range data, as we only need to know their projection types in order to do the image mapping. We assume dense and uniformly sampled intensity images, and sparse but uniformly sampled range images. The sampling of intensity images occurs more often than that of range images. The area covered by the sampled data (a view of approximately 90°) remains fixed at each robot pose. However, the amount of range data may vary depending essentially on the sampling strategy.

Fig. 1 Arrangement of the sensors on the pan unit on the mobile robot

3.1.1 Acquiring Partial Range Data

In our acquisition framework, the spinning mirror (y-axis) of the laser and the panning motor (x-axis) combine to allow the laser to sweep out a longitude-latitude sphere. Since each step taken by the pan unit can be programmed, we can use different sampling strategies to acquire sparse range data. The quality of the final range map will depend on the amount of partial range data acquired and the sampling strategy used. Figure 2 shows the simple heuristic for collecting our experimental data, which is essentially based on the relationship between the size of the objects and the distance from which those objects are seen: the farther we are from an object, the smaller it looks, so dense sampling is needed to accurately represent its shape. On the other hand, the closer we are to an object, the larger it looks, so sparse subsampling is enough. The next step is to acquire the cylindrical panorama mosaic, followed by its registration with the partial range data. Appendix A gives details about how the panorama mosaic is acquired and the method we used for the laser-camera registration.
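To make this distance-based heuristic concrete, the following sketch (an illustration of ours, not the exact scheduling used on the robot) chooses the pan-unit step from a rough estimate of the scene distance: distant scenes get small steps (dense sampling), close scenes get larger steps (sparse sampling). The distance thresholds and step sizes are hypothetical.

```python
def pan_step_degrees(median_range_m, near_step=2.0, far_step=0.25):
    """Choose the pan-unit step (in degrees) from a rough range estimate.

    Far objects subtend few pixels, so they are sampled densely (small
    step); near objects are sampled sparsely (large step). The distance
    thresholds and step sizes below are illustrative only.
    """
    if median_range_m > 5.0:     # distant scene: dense sampling
        return far_step
    if median_range_m > 2.0:     # mid-range scene
        return 1.0
    return near_step             # close scene: sparse sampling suffices

def plan_pan_positions(median_range_m, fov_deg=90.0):
    """Pan positions covering a (roughly) 90-degree field of view."""
    step = pan_step_degrees(median_range_m)
    return [i * step for i in range(int(fov_deg / step) + 1)]

if __name__ == "__main__":
    print(plan_pan_positions(8.0)[:5])   # far scene  -> fine angular steps
    print(plan_pan_positions(1.5)[:5])   # near scene -> coarse angular steps
```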
3.2 Range Synthesis

After registering the intensity and partial range information, we want to solve the following problem: how to infer a dense range map from an intensity image and a limited amount of initial range data. We refer to this problem as the range estimation or range synthesis problem. The solution of the range estimation problem can be defined as the minimum of an energy function. The first idea on which our approach is based is that an image can be modeled as a sample function of a stochastic process based on the Gibbs distribution, that is, as a Markov Random Field (MRF) (Geman and Geman 1984). We consider range estimation a task of assigning to each pixel of the input image the depth value that best describes its surrounding structure, using the already available intensity and range data. The MRF model has the ability to capture the characteristics of this input data and then use them to learn a marginal probability distribution that is to be used to infer data in regions with missing
Fig. 2 Sampling strategy based on the distance of the scene from the robot. a The robot is far from the objects in the scene, so the subsampling is dense. b The robot is close to the objects, so the subsampling is sparser
range values. This model uses multi-scale representations of the intensity and range images to construct a probabilistic algorithm that efficiently estimates the missing range. Statistical relationships are learned directly from the input data, without making assumptions about lighting conditions or about a specific nature, location or environment type that would be inappropriate for a particular scene. MRF models have been successfully used to solve many fundamental problems in image analysis and computer vision (Stan 1995); some relevant work has already been mentioned here. Most of these models are for low-level processing, including image restoration, edge detection, image segmentation, texture synthesis and analysis, image inpainting, surface reconstruction, stereo vision and motion analysis, and recently for high-level vision, such as scene interpretation and object matching and recognition. In this context, the fundamental idea of MRF models is that the value of an image pixel cannot be considered in isolation; it is influenced by the values and properties of its neighboring pixels. Moreover, it is very improbable that a pixel has a value completely different from all of its neighbors; in fact, we can affirm that certain configurations are indeed impossible. The appeal of Markov Random Field (MRF) models for range estimation comes from their explicit ability to model interactions and relationships between neighboring parts of the data space, i.e., the spatial distribution of intensity and range data. This model is trained using the (local) relationships between the observed range data and the variations in the intensity images and is then used to compute unknown depth values. The learned representation of pixel variation used to perform resolution enhancement of face images, proposed by Baker and Kanade (2002), is quite similar in spirit to what we describe here, although the application and technical details differ significantly. The work by Freeman et al. (2000) on learning the relationships between intrinsic images is also related. For the particular problem of estimating dense range maps from intensity images and partial range data, we first need to consider how the relationships between the intensity information and the partial range can give us knowledge about the complete underlying geometry of the scene. Intensity edges can be used as the primary cue in guiding the search for discontinuities in other physical processes (for example surface depth, surface orientation, texture, shadows, color, specularities) (Poggio et al. 1988). The critical role of intensity edges in artificial, and probably also biological, vision is intuitively clear: changes in surface properties (depth, orientation, material, texture) usually produce large gradients in the image intensity. Assuming a simple imaging model (e.g., Lambertian), large intensity gradients in the image can be caused by six physical phenomena (Poggio et al. 1988): occluding edges (extremal edges and blades); folds; shadow edges; surface markings; and
specular edges. Intensity edges are detected quite reliably by the Canny edge detector (Canny 1986). Because of the constraints of image formation discussed earlier, the correct depth discontinuities will, in almost all cases, correspond precisely to the locations of intensity edges. Our range synthesis method exploits this by restricting the range estimation process according to intensity edges, thus ensuring the smoothness and continuity of the discontinuities. This process will be described in detail later. There are some cases in which discontinuities will not occur at intensity edges, for instance, objects that blend in with their background. Although this situation occurs rarely in natural scenes, it can arise due to camera underexposure or saturation, where some locations on the objects may blend in with the background. Moreover, there may be false edges (i.e., edges that do not represent a discontinuity), such as those coming from texture-like regions or even shadows. However, as will be demonstrated in our experiments, this is not critical for the reconstruction, since range data inside neighborhoods that are very close to these texture-like edges represent the same type of smooth surface. When learning relationships between intensity and shape, special attention has to be paid to occlusions and boundaries. Not all intensity values in an image are directly related to shape variations. For example, an image can be decomposed into paint (reflectance) and shading (surface normal) variations (Freeman and Torralba 2003). Shading variations are directly produced by the shape; the paint, however, is related to changes in the reflectance function of the material and not directly related to the shape. The method we propose learns the relationships from the intensity and range images directly, without having to hypothesize surface smoothness or reflectance properties that may be inappropriate for a particular environment. Thus, our approach does not intend to implicitly obtain a general algorithm, but to use the local statistical relationships between the intensity and the input range directly from the observed data.

3.2.1 The MRF Model

The range estimation problem can be posed as a labeling problem. A labeling is specified in terms of a set of sites and a set of labels. In our case, sites represent the pixel intensities in the matrix I and the labels represent the depth values in R. Let S index a discrete set of M sites S = {s_1, s_2, ..., s_M}, and let L be the set of corresponding labels L = {l_1, l_2, ..., l_M}, where each l_i takes a depth value. The inter-relationship between the sites defines the neighborhood system N = {N_s | ∀s ∈ S}, where N_s is the set of neighbors of s (i.e., the neighborhood of s), such that (1) s ∉ N_s, and (2) s ∈ N_r ⟺ r ∈ N_s (see Fig. 3). Each site s_i is associated with a random variable F_i. Formally, let F = {F_1, ..., F_M} be a random field defined on S, in which
Fig. 3 A neighborhood system definition

Fig. 4 Definition of an augmented voxel
a random variable F_i takes a value f_i in L. A realization f = (f_1, ..., f_M) is called a configuration of F, corresponding to a realization of the field. The random variables F defined on S are related to one another via the neighborhood system N. F is said to be an MRF on S with respect to N if and only if the following two conditions are satisfied (Hammersley and Clifford 1971): (1) P(f) > 0 (positivity); and (2) P(f_i | f_{S−{i}}) = P(f_i | f_{N_i}) (Markovianity), where S − {i} is the set difference, f_{S−{i}} denotes the set of labels at the sites in S − {i}, and f_{N_i} = {f_{i'} | i' ∈ N_i} stands for the set of labels at the sites neighboring i. The choice of N, together with the conditional probability distribution P(f_i | f_{S−{i}}), provides a powerful mechanism for modeling spatial continuity and other scene features. We model a neighborhood N_i as a square mask of size n × n centered at pixel location i, where only those voxels containing intensity and range values are used in the synthesis process. Calculating the conditional probabilities in explicit form to infer the exact maximum a posteriori (MAP) estimate in MRF models is intractable: we cannot efficiently represent or determine all the possible combinations between pixels and their associated neighborhoods. Various techniques exist for approximating the MAP estimate, such as Markov chain Monte Carlo, iterated conditional modes, the maximizer of posterior marginals, etc. In our research, we avoid the computational expense of sampling from a probability distribution and take two different approaches to the MAP-MRF estimation. The first approach is based on a non-parametric sampling strategy that is easy to implement, generates good results and is fast to execute. The second approach is an extension of the first; it uses the belief propagation (BP) algorithm to compute marginal probabilities. This approach improves on the synthesized results of the first approach, at the expense of additional computation.

3.2.2 Non-parametric Sampling and MAP-MRF Estimation
We introduce a definition of augmented voxels4, which contain intensity (either from grayscale or color images) and range information (where the range can initially be unknown). We are interested only in the set of such augmented voxels in which one augmented voxel lies on each ray that intersects each intensity pixel of the input intensity image, thus giving us registered range and intensity images (see Fig. 4). To compute the MAP estimate for a depth value R_i of the augmented voxel V_i, one first needs to construct an approximation to the conditional probability distribution P(f_i | f_{N_i}) and then sample from it. We could use the parameters themselves, estimated from the given input data or samples (neighborhoods from the input intensity and range data), as an approximation for maximum-likelihood sampling; however, this is hindered by the need to compute the partition function Z, which is computationally intractable for all but very small MRFs, since the number of possible combinations increases exponentially. Instead, for each new depth value R_i ∈ R to estimate, the samples, which correspond to the neighborhood system for the voxel location i, N_i, are queried and the distribution of R_i is constructed as a histogram of all possible values that occurred in the samples. N_i is a subset of the real infinite set of voxels, denoted by N_real. Based on our model, we assume that the depth value R_i depends only on its immediate neighbors in intensity and range, N_i. If we define a set

Γ(R_i) = {N ⊂ N_real : ‖N_i − N‖ = 0}   (1)

containing all occurrences of N_i in N_real, then the conditional probability distribution of R_i can be estimated with a histogram based on the depth values of the voxels representing each N in Γ(R_i). Unfortunately, we are only given V, i.e., a finite sample from N_real. Thus, there might not be any neighborhood

4 We define a voxel as a location containing only intensity information, whereas an augmented voxel is a location that may contain both intensity and range information.
containing exactly the same characteristics in intensity and range as N_i in V. Thus, we must use a heuristic which lets us find a plausible Γ'(R_i) ≈ Γ(R_i) to sample from. The similarity measure ‖·‖ between two generic neighborhoods N_a and N_b is defined over the partial data in the two neighborhoods and is calculated as follows:

‖N_a − N_b‖ = ∑_{v ∈ N_a, N_b} G(σ, v − v_0) × ((I_v^a − I_v^b)² + (R_v^a − R_v^b)²),   (2)

where v_0 represents the augmented voxel located at the center of the neighborhoods N_a and N_b, and v is a neighboring voxel of v_0. I^a and R^a are the intensity and range values of the augmented voxels neighboring the depth value R_p ∈ v_0 to be synthesized, and I^b and R^b are the intensity and range values they are compared with, for which the center voxel v_0 has already been assigned a depth value. G is a 2D Gaussian kernel applied to each neighborhood, such that voxels near the center are given more weight than those at the edge of the window. Let A_p be a local neighborhood system for the augmented voxel p, comprising nearby neighborhoods within a radius r:

A_p = {N_q ∈ N | distance(p, q) ≤ r},   (3)

where N_q is a neighborhood whose center voxel has an assigned depth value and is located at a maximum distance r from the location of voxel p. This set constitutes the training/sample data for that particular voxel p. A depth value R_p for the augmented voxel V_p with neighborhood N_p is then synthesized by first selecting the most similar neighborhood (N_best) to N_p, i.e., the closest match to the region being filled in:

N_best = arg min_{A_q ∈ A_p} ‖N_p − A_q‖.   (4)

Second, all the neighborhoods A_q in A_p that are similar (within a threshold ε) to this N_best are included in Γ'(R_p), as follows:

‖N_p − A_q‖ < (1 + ε) ‖N_p − N_best‖.   (5)
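The following is a minimal sketch of this non-parametric step, under the assumption that the registered intensity and partial range are stored as NumPy arrays with NaN marking unknown range; boundary handling, the edge constraints and the information-driven ordering of Sect. 3.2.4 are omitted, and all parameter names are ours.

```python
import numpy as np

def gaussian_kernel(n, sigma=1.0):
    """2D Gaussian weights for an n x n neighborhood (n odd)."""
    ax = np.arange(n) - n // 2
    xx, yy = np.meshgrid(ax, ax)
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def neighborhood_distance(Ia, Ra, Ib, Rb, G):
    """Gaussian-weighted sum of squared intensity and range differences,
    in the spirit of (2). NaN entries (unknown range) contribute zero."""
    d_int = (Ia - Ib) ** 2
    d_rng = np.where(np.isnan(Ra) | np.isnan(Rb), 0.0, (Ra - Rb) ** 2)
    return float(np.sum(G * (d_int + d_rng)))

def synthesize_depth(I, R, p, n=5, r=10, eps=0.1, rng=None):
    """Estimate the unknown depth at pixel p = (row, col).

    I : intensity image (fully known); R : range image with np.nan where
    depth is unknown. Candidate neighborhoods are taken within radius r
    of p and must have a known depth at their center (the set A_p of (3));
    the result is sampled from the centers of all candidates within
    (1 + eps) of the best match, as in (4)-(5).
    """
    rng = rng or np.random.default_rng()
    h = n // 2
    G = gaussian_kernel(n)
    Ip = I[p[0] - h:p[0] + h + 1, p[1] - h:p[1] + h + 1]
    Rp = R[p[0] - h:p[0] + h + 1, p[1] - h:p[1] + h + 1]

    candidates = []  # (distance to N_p, depth at candidate center)
    for dr in range(-r, r + 1):
        for dc in range(-r, r + 1):
            q = (p[0] + dr, p[1] + dc)
            if q == p:
                continue
            if not (h <= q[0] < I.shape[0] - h and h <= q[1] < I.shape[1] - h):
                continue
            if np.isnan(R[q]):           # center must already have a depth
                continue
            Iq = I[q[0] - h:q[0] + h + 1, q[1] - h:q[1] + h + 1]
            Rq = R[q[0] - h:q[0] + h + 1, q[1] - h:q[1] + h + 1]
            candidates.append((neighborhood_distance(Ip, Rp, Iq, Rq, G), R[q]))

    if not candidates:
        return np.nan
    best = min(d for d, _ in candidates)
    pool = [depth for d, depth in candidates if d <= (1.0 + eps) * best]
    return rng.choice(pool)              # random sample from the candidate histogram
```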
We can now construct a histogram from the depth values R_p at the center of each neighborhood in Γ'(R_p), and randomly sample from it. The sampled value R_q is then used to specify R_p. For each successive augmented voxel this approximates the maximum a posteriori estimate.

3.2.3 MAP-MRF Using Belief Propagation (BP)

In order to propagate evidence, we use a pairwise Markov network. In our case, observation nodes represent image patches (or neighborhoods) where some of the voxels may contain both intensity and range, or just intensity; the hidden nodes are the depth values to be estimated at the center voxels of the image patches of the observation nodes. BP efficiently estimates Bayesian beliefs in the MRF network by iteratively passing messages between neighboring nodes. Figure 5 shows (a) the input intensity and (b) range images and (c) the pairwise Markov network for the range estimation problem, where the observation node y_i is a neighborhood in intensity centered at voxel i, and the hidden nodes x_i represent the depth values to be estimated (white areas in (b)). Hidden nodes also contain the already available range data (as image patches), whose beliefs remain fixed at all times; a subset of this available range data is used to locally train the network and compute compatibilities between observation and hidden nodes, and the rest is used for propagating the beliefs in and out globally.

Learning the Compatibility Functions

The training pairs of intensity image patches with their corresponding range image patches are used to learn the compatibility functions. However, not all available pairs of intensity and range image patches are used as training pairs. We choose only a local set of image pairs that are located up to a distance r from the voxel location of the depth value to be estimated. This reflects our heuristic that intensity values locally provide knowledge about the type of surface to which they belong. As in the method described elegantly in Freeman et al. (2000), we use the overlapping information from the intensity image patches themselves to estimate the compatibilities Ψ(x_j, x_k) between neighbors. Let k and j be two neighboring intensity image patches. A candidate for an image
Fig. 5 Pairwise Markov network for the range estimation problem
Fig. 6 The MRF-BP algorithm for range synthesis
1. Divide the input registered images, i.e., the intensity and the (incomplete) range images, into small patches, which form the sets of x_i's and y_i's.
2. For each intensity image patch y_i, find the k closest intensity patches in the local training set located up to distance r from that voxel. The corresponding range patches x_i are the candidates for that patch.
3. Calculate the compatibility function Φ(x_i, y_i) according to (7).
4. For each pair of neighboring input patches, calculate the k × k compatibility function Ψ(x_i, x_j) according to (6).
5. Estimate the MRF-MAP solution using BP.
6. Assign the depth value of the center pixel of each estimated maximum-probability patch x_i^MAP to the corresponding pixel in the range image patch.
patch whose range value is to be estimated is defined as the corresponding range value of the most similar intensity image patch found in the training set. Let d_jk^l be the vector of pixels of the lth possible candidate for image patch x_k which lie in the overlap region with patch j. Likewise, let d_kj^m be the values of the pixels (in correspondence with those of d_jk^l) of the mth candidate for patch x_j which overlap patch k. We say that image candidates x_k^l (candidate l at node k) and x_j^m are compatible with each other if the pixels in their region of overlap agree. We assume that the image and training samples differ from the "ideal" training samples by Gaussian noise of covariance σ_i and σ_s, respectively. These covariance values are parameters of the algorithm. The compatibility matrix between range nodes k and j is then defined as follows:

Ψ(x_k^l, x_j^m) = exp(−|d_jk^l − d_kj^m|² / (2σ_s²)).   (6)

The rows and columns of the compatibility matrix Ψ(x_k^l, x_j^m) are indexed by l and m, the range image candidates at nodes k and j, respectively. We say that a range image patch candidate x_k^l is compatible with an observed intensity image patch y_o if the intensity image patch y_k^l, associated with the range image patch candidate x_k^l in the training database, matches y_o. Since it will not match exactly, we must again assume "noisy" training data and define the compatibility

Φ(x_k^l, y_k) = exp(−|y_k^l − y_o|² / (2σ_i²)).   (7)
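As a small illustration (ours, not the authors' implementation), the two compatibilities of (6) and (7) could be tabulated for one pair of neighboring nodes as follows, assuming the candidate range patches, their associated intensity patches and the overlap indices have already been collected.

```python
import numpy as np

def psi_matrix(cand_k, cand_j, overlap_k, overlap_j, sigma_s):
    """Compatibility matrix Psi of (6) between the candidate range
    patches at two neighboring nodes k and j.

    cand_k, cand_j : lists of candidate patches (2D arrays) at each node.
    overlap_k, overlap_j : boolean masks (or index arrays) selecting the
        pixels of each patch that lie in the overlap region; they must
        select the same number of pixels, in correspondence.
    """
    L, M = len(cand_k), len(cand_j)
    psi = np.empty((L, M))
    for l in range(L):
        d_jk = cand_k[l][overlap_k]        # overlap pixels of candidate l at node k
        for m in range(M):
            d_kj = cand_j[m][overlap_j]    # corresponding pixels of candidate m at node j
            psi[l, m] = np.exp(-np.sum((d_jk - d_kj) ** 2) / (2.0 * sigma_s ** 2))
    return psi

def phi_vector(cand_intensity_k, observed_y, sigma_i):
    """Compatibility Phi of (7): how well the intensity patch associated
    with each range candidate at node k matches the observed patch."""
    return np.array([
        np.exp(-np.sum((y_kl - observed_y) ** 2) / (2.0 * sigma_i ** 2))
        for y_kl in cand_intensity_k
    ])
```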
The MAP Estimate

The maximum a posteriori (MAP) range image patch for node i is:

x_i^MAP = arg max_{x_i} Φ(x_i, y_i) ∏_{j ∈ N(i)} M_ji(x_i),   (8)

where N(i) are all node neighbors of node i, and M_ji is the message from node j to node i and is computed as follows:

M_ij(x_j) = Z ∑_{x_i} Ψ(x_i, x_j) Φ(x_i, y_i) ∏_{k ∈ N(i)\{j}} M_ki(x_i),   (9)
where Z is the normalization constant. The algorithm for the range estimation problem using the MRF-BP framework is listed in Fig. 6. Currently, the main drawback of the BP-based method is the computational time. The BP-based algorithm runs in O(nk²T) time, where n is the number of pixels to be synthesized, k is the number of possible labels for each pixel, and T is the number of iterations. It takes O(k²) time to compute each message and there are O(n) messages per iteration. The NP sampling method runs in O(Mpn) time, where M is the number of neighborhoods to be compared and p is the number of pixels that make up the neighborhood of each pixel (e.g., 5 × 5 = 25 pixels). It usually takes around 30 seconds for an image of size 128 × 128 on a 3 GHz Pentium 4 with 1 GB of RAM. We are currently seeking better optimization methods to reduce this running time.

3.2.4 Range Synthesis Ordering

In standard MRF methods, the assumption is that the field is updated either stochastically or in parallel according to an iterative schedule. In practice, several authors have considered more limited update schedules. In this work, a single update is done at each unknown measurement. Thus, a depth value R(x, y) is synthesized sequentially (although this does not preclude parallel implementations). Critical to the quality of the reconstruction is the order in which a voxel, whose range value is to be synthesized, is selected. An ordering that is commonly used is the well-known onion-peel ordering. This ordering uses a predetermined schedule over space, essentially walking a spiral from the boundary of a region towards the center. The main problem with this ordering is the strong dependence on the previously assigned voxel. A more suitable ordering is based on the amount of available information in the voxel's neighborhood, such that voxels containing the maximum number of neighboring augmented voxels are synthesized first. Figure 7 shows a simulated example that compares the onion-peel ordering with the information-driven ordering proposed in this work.
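For concreteness, a simplified sketch of the information-driven selection (our illustration): at each step, the unfilled voxel whose n × n neighborhood contains the most already-filled voxels is chosen next. The full priority defined in (10) below additionally accounts for edges and isophotes.

```python
import numpy as np

def next_voxel_to_synthesize(known_mask, n=5):
    """Select the unfilled voxel whose n x n neighborhood contains the
    most already-filled voxels (ties broken arbitrarily by argmax).

    known_mask : boolean array, True where range is already available.
    Returns (row, col) of the selected voxel, or None if all are filled.
    The edge/isophote terms of the full priority in (10) are ignored here.
    """
    h = n // 2
    filled = known_mask.astype(float)
    H, W = filled.shape
    padded = np.pad(filled, h)
    counts = np.zeros_like(filled)
    # Box-filter count of filled neighbors (no SciPy dependency).
    for dr in range(-h, h + 1):
        for dc in range(-h, h + 1):
            counts += padded[h + dr:h + dr + H, h + dc:h + dc + W]
    counts[known_mask] = -1.0            # never pick an already-filled voxel
    idx = np.argmax(counts)
    if counts.flat[idx] < 0:
        return None                      # nothing left to synthesize
    return tuple(np.unravel_index(idx, counts.shape))
```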
Fig. 7 Comparing the onion-peel and information-based orderings

Fig. 9 The notation diagram
It was observed that the reconstruction across depth discontinuities is often problematic, as there is comparatively little constraint for probabilistic inference at these locations. Such regions are often identified with edges in the intensity images and with linear structures in the range maps. These linear structures are called isophotes, which in the range domain are defined as all normals forming the same angle with the direction to the eye (see Fig. 8). Thus, in the reconstruction sequence, the synthesis of voxels close to intensity discontinuities (indicated by edges) and/or depth discontinuities (indicated by the isophotes) is deferred as much as possible. Summarizing, the reconstruction sequence synthesizes first the depth values of those voxels for which we can make the most reliable inferences, based on essentially two factors: (1) the number of neighboring augmented voxels (i.e., locations with already assigned range and intensity) and (2) the existence of intensity and/or depth discontinuities (i.e., whether an edge or a linear structure exists). Priority values are computed based on these two factors and are assigned to each voxel for reconstruction, such that as we reconstruct, the voxel with the maximum priority value is selected. If more than one voxel shares the same priority value, then the selection is done randomly.

Fig. 8 Isophote on a surface

Computing the Priority Values

The reconstruction sequence depends entirely on the priority values that are assigned to each voxel on the boundary of the region to be synthesized. The priority computation is biased toward those voxels that are surrounded by high-confidence voxels, that are not on an isophote line, and whose neighborhood does not represent an intensity discontinuity, in other words, whose neighborhood does not have any edges in it. Furthermore, edge information is used to defer the synthesis of those voxels that are on an edge to the very end. When using the BP-based method, we essentially use the following message-passing rule:
Edge-constraint rule: A node can send a message to a neighboring node once it has received messages from all its other neighbors except those edge-labeled nodes.

The notation used is similar to that used in the inpainting literature (Criminisi et al. 2003). The region to be synthesized, i.e., the target region (see Fig. 9), is indicated by Ω = {ω_i | i ∈ A}, where ω_i = R(x_i, y_i) is the unknown depth value at (x_i, y_i) and A ⊂ Z^m is the set of subscripts for the unknown range data. The contour of Ω is denoted by δΩ. The input intensity (I) and the known range values (R_k) together form the source region, indicated by ζ. This region ζ is used to calculate the statistics between the intensity and the input range for reconstruction. Let V_p be an augmented voxel with unknown range located at the boundary δΩ and let N_p be its neighborhood, which is an n × n square window centered at V_p. For all voxels V_p ∈ δΩ, their priority value is computed as follows:

P(V_p) = C(V_p) D(V_p) + 1/(1 + E),   (10)
Fig. 10 The diagram shows how priority values are computed for each voxel Vp on δΩ. Given the neighborhood of Vp , np is the normal to the contour δΩ of the target region Ω and ∇I⊥p is the isophote (direction and range) at voxel location p
where E is the number of edges found in the neighborhood or image patch N_p, C(V_p) is the confidence term, and D(V_p) the data term. The confidence term is defined as follows:

C(V_p) = (∑_{i ∈ N_p ∩ ζ} C(V_i)) / |N_p|,   (11)

where |N_p| is the total number of voxels (augmented or not) in N_p. At the beginning, the confidence of each voxel is set to 1 if its intensity and range values are filled and to 0 if the range value is unknown. This confidence term C(V_p) may be thought of as a measurement of the amount of reliable information surrounding the voxel V_p. Thus, as we reconstruct, those voxels whose neighborhood has more of its voxels already filled are synthesized first, with preference given to voxels that were synthesized early on. The data term D(V_p) is computed using the available range data in the neighborhood N_p, as follows (see Fig. 10):

D(V_p) = 1 − |∇I⊥_p · n_p| / α,   (12)

where α is a normalization factor (e.g., α = 255 in a typical gray-level image), n_p is a unit vector orthogonal to the boundary δΩ at voxel V_p, and ∇I⊥_p is the isophote (direction and range) at voxel location p. This data term reduces the priority of a voxel into whose neighborhood an isophote "flows", thus altering the sequencing of the extrapolation process. This term plays an important role in the algorithm because it prevents the synthesis of voxels lying near a depth discontinuity. Note, however, that it does not explicitly alter the probability distribution associated with the voxel (except by deferring its evaluation), and thus poses only limited risk to the theoretical correctness of the algorithm. Once all priority values of each voxel on δΩ have been computed, we find the voxel V_p with the highest priority, which is the one to be synthesized. The confidence of the augmented voxel whose neighborhood most resembles that of V_p is assigned to C(V_p) (see (11)).

3.3 Comparing the NP-Based and BP-Based Methods for Range Synthesis

We conducted some experiments with data obtained from two databases available on the web, in which both the intensity images
and the real range data are already registered. The first database, the USF range image database5 from the CESAR lab at Oak Ridge National Laboratory, provides real intensity (reflectance) and range images of indoor scenes acquired by an Odetics laser range finder mounted on a mobile platform. The second database, the Middlebury stereo database6 (Scharstein and Szeliski 2003), provides color images with complex geometry and pixel-accurate ground-truth disparity data. The first database contains images of size 128 × 128 pixels represented by 256 grey levels, and the second database contains color images of size 450 × 375 pixels. The neighborhood systems used in all experiments were windows of 5 × 5 pixels, unless otherwise indicated. The Canny edge detector (Canny 1986) is used to extract the edges of the input intensity images and the smoothing parameter is set to 0.8. The parameter d, i.e., the maximum distance between two neighborhoods' center voxels, which are to be compared to find the best matches, is set to 10 pixels. For the BP-based method, the number of iterations was set to 50. We have conducted many experiments on the number of iterations and observed that below 50 iterations the results are poor, while more iterations do not yield much improvement. We consider several strategies for sampling the range data. In general, the sampling of the range data is assumed to be clumped into at least some sets of mutually adjacent voxels, as opposed to scattered measurements far from one another. The sampling strategies presented are not necessarily identical to those provided by any particular sensor (except for those that are indicated). However, they are useful for analyzing and testing our methods for range synthesis, so we can determine suitable samplings of the range data for a good reconstruction. Some of these results have already been published (Torres-Méndez and Dudek 2002; Torres-Méndez and Dudek 2004a).

3.3.1 Evaluation Methodology

Since we have ground truth range in all of the experiments, we can evaluate the performance of our method. There are different techniques to compute the errors in the reconstruction. We choose to compute the mean absolute residual (MAR) error. The absolute value of each error is taken and the mean of those values is computed to arrive at the MAR error. The MAR error between two range data matrices R_1 and R_2 is defined as:

MAR error = ∑_{i,j} |R_1(i, j) − R_2(i, j)| / (# of unknown range pixels).   (13)

5 http://marathon.csee.usf.edu/range/DataBase.html.
6 http://vision.middlebury.edu/stereo.
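A direct sketch of (13), assuming the range maps are NumPy arrays and that, as in our evaluation, the error is accumulated only over the pixels whose range was initially unknown (marked by a boolean mask).

```python
import numpy as np

def mar_error(r_est, r_true, unknown_mask):
    """Mean absolute residual of (13), restricted to the pixels whose
    range was withheld from the input and then synthesized."""
    return float(np.abs(r_est - r_true)[unknown_mask].mean())
```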
Fig. 11 Results on real data. a From left to right: the input intensity image, the edge map, and the associated ground truth range. b The left column shows the initial range data (the white squares represent unknown data to be estimated). The second and third columns show the synthesized results when using the NP sampling and BP-based methods, respectively. The last column shows the residual images for the BP-based synthesized results
The following experiments involve range synthesis when the initial range data is a set of stripes of variable width along the x- and y-axes of the intensity image. In the following cases, we tested our methods on the same intensity image in order to compare the results. Figure 11(a) shows the input intensity image (left) of size 128 × 128 pixels, the edge map (middle), and, for purposes of comparison, the ground truth range image (right), from which much of the data is withheld as input to simulate real sensing conditions. Four cases of sampling are shown in Fig. 11(b). The initial range data, shown in the left column, goes from dense to very sparse. The percentage of missing range data is indicated below each image. The parameters rw and xw indicate the width of the stripes of known and unknown range, respectively. For example, the first case has rw = 24 and xw = 80. For the first three cases the size of the neighborhood is set to 5 × 5 pixels and for the last case 3 × 3 pixels. The second and third columns show the synthesized range data obtained after running our algorithms using the NP sampling and BP-based (after 50 iterations) methods, respectively. We also show the residual images for the BP-based synthesized results. The first two cases have the same amount of missing range (39%); however, the synthesized range for the second case is much better. Intuitively, and based on the results from the previous section, this is because the sample spans a broader distribution of range-intensity combinations.
Table 1 MAR errors for the cases shown in Fig. 11(b)

rw   xw   % of area with   MAR errors (in centimeters)
          missing range    NP sampling   BP-based
24   80   39               36.36         4.23
10   20   39                5.76         1.33
 5   25   61                8.86         1.85
 3   28   76.5              9.99         2.76

Table 2 Average MAR errors for 32 representative images in the database

Mean MAR errors
NP sampling   BP-based
4.27          2.65
Table 1 shows the MAR errors (calculated only on the unknown areas) of the examples shown in Fig. 11(b) for each of the two methods. From the data in the table, we can see that the BP-based method produces smaller errors than the NP sampling method. The reason is that the belief at each pixel is propagated through all its neighbors (except when edges are detected), so that it covers the whole area of missing range. This allows for smooth transitions in the range values and the removal of inconsistencies, so as to find the minimum energy of the overall system. The NP sampling method, in contrast, considers only a single pass in which each value is assigned the best match given the already existing range values, which also induces order-dependent artifacts. The USF database contains approximately 400 images. As many of these images are very similar, we have selected for our experiments only the most representative ones, essentially those that could be found in a general indoor man-made environment. For the first experiments, we choose the sampling case with rw = 5 and xw = 25, which corresponds to images with 61% of the range points unknown. As indicated, the input images given to our range synthesis algorithm are the intensity image together with its edge map and the corresponding partial range image. The average MAR errors for the ensemble of images using both the NP sampling and BP methods are shown in Table 2. Four examples of these experiments are shown in Figs. 12 and 13, which correspond to the input and the synthesized range images, respectively. In order to confirm the importance of using edge information in the reconstruction process, the first column of Fig. 13 displays the synthesized results when edge information is not considered. The second and third columns show the synthesized range images when considering edge information and using the NP sampling and belief propagation-based methods, respectively. The last column displays the ground truth range images for comparison purposes.
Fig. 12 Input data to our range synthesis algorithms. The first two columns are the input intensity and the associated edge map; the third column shows the input range images (61% of the range is unknown)
Fig. 13 The first three columns are the synthesized results for the input data shown in Fig. 12 when no edge info is used and when using the NP sampling and BP-based methods, both with edge information. The last column shows the ground truth range for visual comparison
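The edge-aware propagation just described, in which a pixel passes its belief to its four neighbors except across detected intensity edges, can be sketched schematically as a per-pixel neighbor mask. The helper below is only our illustration of that idea, not the authors' implementation:

import numpy as np

def edge_aware_neighbor_mask(edges):
    # ok[i, j, k] is True if pixel (i, j) may propagate to its k-th 4-neighbor
    # (order: up, down, left, right); propagation is blocked whenever either
    # endpoint lies on a detected intensity edge
    e = edges.astype(bool)
    H, W = e.shape
    ok = np.zeros((H, W, 4), dtype=bool)
    ok[1:, :, 0] = ~e[1:, :] & ~e[:-1, :]    # to the pixel above
    ok[:-1, :, 1] = ~e[:-1, :] & ~e[1:, :]   # to the pixel below
    ok[:, 1:, 2] = ~e[:, 1:] & ~e[:, :-1]    # to the pixel on the left
    ok[:, :-1, 3] = ~e[:, :-1] & ~e[:, 1:]   # to the pixel on the right
    return ok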
We also show in Fig. 14 zoomed versions of some regions of interest from these experimental results. The MAR errors, from top to bottom, are shown in Table 3. Another type of experiment is one in which the initial range data is a set of stripes only along the x-axis. This set of experiments is interesting because the subsampling strategy resembles what is obtained by sweeping a two-dimensional LIDAR sensor. This sampling, however, presents a challenge for the range synthesis process, since we have very limited range data available (just along the y-axis).
Fig. 14 Zoomed versions of some regions of interest of the results shown in Fig. 13
Thus, it is in this type of sampling where the isophote information from the available range data, together with the edge map from the input intensity image, greatly helps in the reconstruction process. This will be shown in the following experiments, in which we used a total of 30 pairs of registered intensity and range images from the Middlebury database. Once again, we have selected only those images that may represent an indoor scene. Figure 15 displays, from left to right, the input color intensity image, its corresponding edge map, and the ground truth range image from which we hold back data to simulate the samples. Two samplings on the same range image are shown in order to compare the results. In Fig. 16, the first column displays the initial range data. The percentages of unknown range are 65% and 62%, respectively. The two right columns show the synthesized range images without using isophote constraints and when they are incorporated into our algorithm. The regions enclosed by the red rectangles show where our algorithm performed poorly. The mean absolute residual (MAR) errors are, from top to bottom, 10.5 and 12.2 when not using isophote constraints, compared to 6.5 and 7.3 when using them. The algorithm was able to capture the underlying structure of the scene, reconstructing object boundaries efficiently even with the small amount of range data given as input. More experimental results are shown in Fig. 17. The first row shows the input color intensity image, the edge map and the associated ground truth range (for comparison purposes). Three cases of sampling are shown in the subsequent rows. The first column is the input range. The percentages of unknown range (indicated below each image) are 63%, 78% and 85%, respectively. The last two columns depict the synthesized results when using isophote constraints with the NP sampling and BP-based methods, respectively. The MAR errors, from top to bottom, are given in Table 4. In general, the use of isophote information improves the reconstructions because it helps to detect the continuation of linear structures from the available range. This was clearly illustrated in the examples above.
Table 3 MAR errors for the cases shown in Fig. 13

No edge-info    NP sampling    BP-based
 8.58           3.77           2.58
13.48           3.03           1.63
11.39           2.99           2.03
 7.12           2.20           1.36

Table 4 MAR errors for the cases shown in Fig. 17

Percentage of unknown range    NP sampling    BP-based
63%                            6.42           3.90
78%                            6.45           3.90
85%                            7.98           4.80
Fig. 15 The input intensity image, edge map and associated ground truth range
Fig. 16 The initial range images are in the first column, with the percentage of unknown range indicated below each. To compare results, the middle column shows the synthesized range images without using isophote information and the last column shows the improved synthesized results with the incorporation of isophote constraints
3.3.2 Discussion

All the images used in our experiments were selected because they represent common scenes of an indoor environment. The proposed methods rely on how the visual information integrates with the very limited depth information in order to learn the inter-relationships between them. Visual information from indoor scenes usually shows smooth changes across surfaces, which helps to infer the geometry in the missing areas. Of course, the initial range data given as input plays a very important role in the quality of the synthesis. As the percentage of known range data increases, and if this known range data is distributed across the scene to be synthesized, our methods can produce very good results in inferring the missing range data. There is, however, a notable difference in the performance of the two methods we presented. The synthesized results from the NP sampling method are obtained faster than those from our BP-based algorithm. However, the results from the BP-based method are much better than those from the NP sampling method. The results obtained using belief propagation show that our method can propagate geometric structure from the available neighborhood information. However, there are some regions where this propagation was not effective, again due to the amount of initial range data and the types of surfaces it captures.

Fig. 17 Results on real data. In the first row are the input color image and ground truth range. The subsequent rows show three cases; the initial range images are in the left column (percentage of unknown range is indicated below each). The two right columns show the synthesized results using the NP sampling and BP-based (50 iterations) methods, respectively
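The isophote information referred to above can be made concrete with a small sketch: isophote directions are unit vectors perpendicular to the local gradient of the available range (or intensity) data. The function below is our own illustrative helper, not the authors' exact formulation:

import numpy as np

def isophote_directions(depth, eps=1e-6):
    # unit vectors along the isophotes, perpendicular to the local gradient (gy, gx);
    # returned components are in (row, col) order
    gy, gx = np.gradient(depth)
    mag = np.sqrt(gx**2 + gy**2) + eps
    return np.stack([-gx / mag, gy / mag], axis=-1)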
4 Experimental Results

In order to validate our approach for 3D environment modeling, we carried out experiments in two environments.
Fig. 18 Our mobile robot, a Nomad SuperScout II, with the sensors mounted on it
Fig. 19 Intensity images from the environments to model: (a) MRL lab; (b) CIM floor

Fig. 20 Complete range scans from the environments to be modeled: (a) a range scan from our lab; (b) a range scan from the CIM floor
The environments are different in size and in the type of objects they contain. The first environment is a medium-size room (approximately 9.5 m × 6 m × 3 m), which corresponds to our Mobile Robotics laboratory. It contains the usual objects found in offices and labs (e.g., chairs, tables, computers, tools, windows, etc.). The second environment is larger (approximately 2 m × 20 m × 3 m) and corresponds to the corridors of the CIM floor.7 The mobile robot used in our experiments is a Nomad Super Scout II, manufactured by Nomadics, Inc., retrofitted and customized for this work. It has a mobile base equipped with two wheels driven by servo motors. Mounted on the robot are the 2D laser range finder and the CCD camera (Fig. 18). Intensity images are acquired using a Dragonfly camera from Point Grey Research, with a resolution of 1024 × 768 at 15 frames per second. The camera, attached to the laser, is mounted on the pan unit, allowing for panoramic image acquisition. In Fig. 19, two intensity images acquired from the two environments are depicted: (a) is an image from our lab and (b) an image from the corridors of the CIM floor. To acquire the range images, we use an infrared 8 mW AccuRange 4000-LIR from Acuity Research, Inc. The scanning range of the laser allows distance measurements between 0 and 15 meters at up to 50,000 range readings per second, with precision better than 1 cm.
7 The Center for Intelligent Machines (CIM) floor is the 4th floor of the McConnell Engineering Building at McGill University.
A mirror, 45° off-axis, rotates about a shaft with a 2000-position encoder to sweep out a plane. The mirror center of projection of the laser rangefinder is attached to the center of rotation of the pan unit. The pan unit was built in our lab and has a very clean interface, making programming very simple. The panning angle (the horizontal angle) covers 180° in steps of 0.36°. Since we want high-quality range values, each (2D) slice is an average of 10 samples, taking around 0.5 s per sample at this slower sampling rate. The acquisition time of complete range scans covering a 180° area (500 slices) is about 20 minutes. Figure 20 shows two examples of the complete range scans acquired from the two environments (black areas represent missing range values).

Acquiring Partial Range In practice, we will acquire partial range data by sampling a small amount of range data (approximately 30 to 50% of the total range) at each robot pose, thus significantly reducing the acquisition time. However, in order to be able to estimate the performance of our method for estimating dense range maps, complete range scans were acquired throughout all the experiments presented here.

Acquiring the Panorama As the pan unit rotates, our CCD camera takes a picture every 18°. All the images are then projected onto a cylindrical representation to obtain a cylindrical panorama mosaic. Figure 21 presents two 180° cylindrical panoramas (from the respective environments) constructed using the technique described in Sect. A.1.
Fig. 21 Two cylindrical panoramas: (a) from images taken in our lab; (b) from images taken in the hall
4.1 Camera-Laser Data Registration

As described in Sect. A.2, both the range and intensity data must be in similar cylindrical representations. For the arrangement used in these experiments, f = 300 pixels, ΔY = 5 cm, the range of the points is r = 5–8 meters, and β is between 6 and 10 pixel units. Figure 22 shows samples of the registered panorama mosaic (top) and range image (bottom). It is important to note that the registration was computed using only the partial range data as input. Since we use a mapping between intensity and range image coordinates, the quality of the coarse registration step does not depend on the amount of range data given as input. In the fine registration step, i.e., the local alignment of 2D and 3D edges, the features captured by the input range data are crucial.
4.2 Estimating Dense Range Maps

In this section we present experimental results obtained with our BP-based range synthesis algorithm. To evaluate the performance of our method we compute the mean absolute residual (MAR) error. Figure 23 shows an example on data collected in our lab: (a) is the input intensity, (b) the intensity edges and (c) the input partial range data, where 50% of the total range is unknown. The resulting synthesized range image is shown in (d), and the ground truth range image in (e) for comparison purposes. The MAR error for this example is 7.85 centimeters. A second example, from the data acquired on the CIM floor, is shown in Fig. 24; (a)–(c) are the input data: the intensity image, intensity edges and the partial range, respectively. The synthesized range image after running our algorithm is shown in (d), and (e) shows the ground truth range for comparison purposes. The MAR error is 5.76 centimeters.
Fig. 22 Samples of the registered intensity (top) and range data (bottom) collected (a) from our lab and (b) from the CIM floor
5 Discussion and Future Work

In this paper we have presented a complete framework for the mobile robot environment modeling problem using incomplete sensor data.
Fig. 23 Results on dense range map estimation: (a) input intensity data, (b) intensity edges, (c) input range (50% is unknown), (d) the synthesized range image and (e) the ground truth range

Fig. 24 Results on dense range map estimation: (a) intensity data, (b) intensity edges, (c) input range (50% is unknown), (d) the synthesized range image and (e) the ground truth range
The ability to recover complete and dense depth maps of a scene greatly depends on the type, quality and amount of information available. The data acquisition framework described was designed to speed up the acquisition of range data by obtaining a relatively small amount of range information from the scene to be modeled. By doing so, we compromise the accuracy of our final representation. However, since we are dealing with man-made environments, we can take into account the coherence of surfaces and their causal inter-relationships with the photometric information to estimate complete range maps from the partial range data. The experimental results shown in this
paper for two different indoor scenes demonstrate the feasibility of our method. While our results are highly satisfying, there remain several interesting issues to be resolved. Foremost among these is the need for a formal characterization of performance for such an approach. Another issue is related to the spatial sampling of range data of the environment, which at the moment is done using a simple strategy that depends on the distance to objects in the scene. For more complicated environments, it may be desirable to apply an active sensing technique so that a denser sampling is done in sites having more objects
with different depth profiles and a sparser sampling in sites with no objects (e.g. walls) or few objects. In ongoing work, we are exploring this issue more fully. Interesting features from images taken at different views should be selected across different scales. This selection may be done according to the geometric structure, texture and shading characteristics observed in the images. In the same vein, the exploration of the environment by the robot will depend on the selection of interesting locations visually available from its previous position.

Acknowledgements This work was funded by the Natural Sciences and Engineering Research Council of Canada and the National Council for Science and Technology of Mexico (CONACyT).
Appendix A: Laser-Camera Data Registration

A.1 Acquiring the Cylindrical Panorama Mosaic

A cylindrical panorama is created by projecting images taken from the same viewpoint, but with different viewing angles, onto a cylindrical surface. Each scene point P = (x, y, z)^T is mapped to the cylindrical coordinate system (ψ, v) by

\psi = \arctan\frac{x}{z}, \qquad v = f\,\frac{y}{\sqrt{x^2 + z^2}},   (14)

where ψ is the panning angle, v is the scanline, and f is the camera's focal length. The projected images are spatially aligned and correlated in order to precisely determine the amount of rotation between two consecutive images. In the cylindrical space, a translation becomes a rotation, so we can easily build the cylindrical image by translating each component image with respect to the previous one. In practice, there is also a small vertical misalignment between images due to vertical jitter and optical twist. Therefore, both a horizontal translation tx and a vertical translation ty are estimated for each input image. To recover the translational motion, we estimate the incremental translation δt = (δtx, δty) by minimizing the intensity error between two images,

E(\delta t) = \sum_i \big[I_1(x'_i + \delta t) - I_0(x_i)\big]^2,   (15)
where x_i = (x_i, y_i) and x'_i = (x'_i, y'_i) = (x_i + tx, y_i + ty) are corresponding points in the two images, and t = (tx, ty) is the global translational motion field, which is the same for all pixels. After a first-order Taylor series expansion, the above equation becomes

E(\delta t) \approx \sum_i \big[g_i^T \delta t + e_i\big]^2,   (16)
where e_i = I_1(x'_i) − I_0(x_i) is the current intensity error, and g_i^T = ∇I_1(x'_i) is the image gradient of I_1 at x'_i. This minimization problem has a simple least-squares solution,

\Big[\sum_i g_i g_i^T\Big]\,\delta t = -\sum_i e_i g_i.   (17)
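The update in Eqs. (15)–(17) is one Gauss–Newton step for a single global translation. A minimal sketch, assuming gray-scale NumPy images and SciPy's map_coordinates for the sub-pixel warping, with names of our own choosing, is:

import numpy as np
from scipy.ndimage import map_coordinates

def incremental_translation(I0, I1, t):
    # one step of Eqs. (15)-(17): returns delta t for the current estimate t = (tx, ty)
    H, W = I0.shape
    rows, cols = np.mgrid[0:H, 0:W].astype(float)
    I1w = map_coordinates(I1, [rows + t[1], cols + t[0]], order=1, mode='nearest')
    gy, gx = np.gradient(I1w)                    # image gradient of I1 at x_i + t
    e = (I1w - I0).ravel()                       # intensity errors e_i
    g = np.stack([gx.ravel(), gy.ravel()], 1)    # gradients g_i, one per row
    A = g.T @ g                                  # sum_i g_i g_i^T  (2 x 2)
    b = -(g * e[:, None]).sum(axis=0)            # -sum_i e_i g_i
    return np.linalg.solve(A, b)                 # delta t = (dtx, dty)

# the caller iterates t += incremental_translation(I0, I1, t) until convergence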
The complexity of the registration lies in the amount of overlap between the images to be aligned. In our experimental setup, we typically have about 20 to 30% overlap between adjacent images. However, as the panning angles at which the images are taken are known, the overlap can be as small as 10% and the images can still be aligned. To reduce discontinuities in intensity between images,8 we weight each pixel in every image proportionally to its distance to the edge of the image (i.e., the weight varies linearly from 1 at the centre of the image to 0 at the edge), so that intensities in the overlap area show a smooth transition from intensities in one image to intensities of the other image. A natural weighting function is the hat function,

w(x, y) = \frac{h/2 - |x|}{h/2}\,\frac{w/2 - |y|}{w/2},   (18)

where h and w are the height and the width of the image, and x and y are measured from the image centre. Intuitively, this function gives less weight to the pixels along the image edges.

A.2 Camera-Laser Data Registration: Panorama with Depth

To effectively use the panoramic image mosaic and the incomplete spherical range data for range synthesis, model building and rendering, both sensor inputs need to be registered with respect to a common reference frame. This registration process becomes complex as the projections, resolutions and scaling properties of the sensors are different, due to the fact that they capture different types of data representing different physical aspects of the environment (i.e., appearance and distance). Therefore, before any meaningful high-level interpretation of the acquired images can be made, the projective model for both sensing technologies must be understood. The coordinate systems of the laser scanner (spherical coordinates) and the CCD camera (typical image projection) must be adjusted to each other. Typical methods for registration, such as feature extraction, cannot be used as we have limited range information.

8 Having different intensity values between corresponding points can be due to different factors, of which vignetting (intensity decreases towards the edge of the image) is the only effect that can be weighted within a function.
Geometry-based techniques are commonly used to determine the relationship between the projective models of the camera and laser rangefinder, but their main disadvantage is that the sensors need to be perfectly calibrated and remain fixed. In our method, even though we also have a fixed sensor configuration, there is no calibration involved. Instead, an image-based technique, similar to that presented in Cobzas (2003), is used that recovers the projective model transformation by computing a direct mapping between the points in the data sets. The accuracy of this method depends on the characteristics of the physical system and the assumptions made (e.g. affine camera or planar scene), but in general its performance is good for locally recovering a mapping between data sets. As mentioned above, the optical center of the camera and the mirror center of the laser scanner are vertically aligned and the orientations of both rotation axes coincide. Thus, we only need to transform the panoramic camera data into the laser coordinate system. However, as the two data sets are in different, independent coordinate systems, the first step in the registration process is to map both types of data to a cylindrical coordinate system. The previous section described how to obtain the cylindrical panorama. To convert the spherical range image to a cylindrical representation similar to that of the panoramic image mosaic, the radius of the cylindrical range image must be equal to the camera's focal length. This mapping is given by

P(r, \theta, \phi) \;\rightarrow\; P\!\left(r, \phi, \frac{f}{\tan\theta}\right) \;\rightarrow\; P(r, \phi, h),   (19)

where r represents the distance from the center of the cylinder to the point, h is the height of the point projected on the cylinder, φ is the azimuth angle and f the focal length of the camera (see Fig. 25). Again, this data is sampled on a cylindrical grid (φ, h) and represented as a cylindrical image. Once the intensity and range data are in similar cylindrical image representations, a global mapping between them is computed.

Fig. 25 Projection of the 3D point P onto cylindrical coordinates: (φ, h) for the range data and (u, v) for the panoramic mosaic

The physical configuration of the sensors is approximated, assuming only a vertical translation ΔY and a pan rotation between the two reference coordinate systems LCS (laser coordinate system) and CCS (camera coordinate system). For a point x_l(φ, h) in the cylindrical laser image, its corresponding point in the panoramic mosaic x_c(u, v) is

u = a\phi + \alpha, \qquad v = f\,\frac{Y - \Delta Y}{r} = f\,\frac{Y}{r} - f\,\frac{\Delta Y}{r} = bh - f\,\frac{\Delta Y}{r},   (20)
where a and b are two warp parameters that account for the difference in resolution between the two images, α aligns the pan rotation, and Y = rh/\sqrt{f^2 + h^2} is the height of the 3D point X(r, φ, h). Since f, ΔY, and r remain fixed throughout the experimental setup, the term f\,\Delta Y/r can be approximated by a constant β. Thus, the general warp equations are:

u = a\phi + \alpha, \qquad v = bh + \beta.   (21)
The warp parameters (a, b, α, β) are computed by minimizing the sum of the squared errors of two or more corresponding points9 in the two images. In the case of a good (i.e., what we hope to be normal) initial estimate, a reasonable convergence to a nearly correct10 solution is obtained. The initial estimate places the panorama mosaic nearly aligned with the range data, with a moderate translation or misalignment, typically of about 5 to 7 pixels. To correct this, a local alignment is performed using the set of corresponding control points. Since these control points are typically obtained by matching the 2D edges in the intensity image to the 3D edges in the range image, after calculating the misalignment with the warp parameters we can locally “stretch” the range data at each of the available control and matching points to fit the intensity data. This stretching is achieved using cubic interpolation, after which we check for additional matches between 2D and 3D edges. It is an iterative process that stops when the misalignment is below a given threshold. After that, each range data point acquired by the laser rangefinder has a corresponding pixel value in the panoramic image.
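Because Eq. (21) is linear in the warp parameters, fitting (a, α) and (b, β) to the corresponding control points reduces to two independent least-squares line fits. The sketch below is our own minimal illustration of that step (it does not include the subsequent local stretching):

import numpy as np

def fit_warp(phi, h, u, v):
    # least-squares fit of u = a*phi + alpha and v = b*h + beta (Eq. (21))
    # phi, h: control-point coordinates in the cylindrical laser image
    # u, v:   the corresponding coordinates in the panoramic mosaic
    a, alpha = np.linalg.lstsq(np.stack([phi, np.ones_like(phi)], 1), u, rcond=None)[0]
    b, beta = np.linalg.lstsq(np.stack([h, np.ones_like(h)], 1), v, rcond=None)[0]
    return a, alpha, b, beta

def apply_warp(phi, h, a, alpha, b, beta):
    # map a laser-image point (phi, h) into the panorama
    return a * phi + alpha, b * h + beta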
9 Since the computation of the warp parameters is done only once, the actual number of corresponding points must be large, ideally 20, to obtain a good linear fit.

10 The term “correct” simply means that the final solution “looks good”. There is no recorded ground truth for this purpose.

References

Baker, S., & Kanade, T. (2002). Limits on super-resolution and how to break them. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1167–1183.
Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), 1222–1239.
Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679–698.
Cobzas, D. (2003). Image-based models with applications in robot navigation. PhD thesis, University of Alberta, Canada.
Connolly, C. I. (1984). Cumulative generation of octree models from range data. In Proceedings of the international conference on robotics (pp. 25–32), March 1984.
Criminisi, A., Perez, P., & Toyama, K. (2003). Object removal by exemplar-based inpainting. In IEEE computer vision and pattern recognition.
Davies, D. (1986). Uncertainty and how to treat it: modeling under uncertainty. In Proceedings of the first international conference on modeling under uncertainty (pp. 16–18), April 1986.
Diebel, J., & Thrun, S. (2005). An application of Markov random fields to range sensing. In Advances in neural information processing systems (NIPS) (pp. 291–298). Cambridge: MIT Press.
Efros, A., & Freeman, W. T. (2001). Image quilting for texture synthesis and transfer. In SIGGRAPH (pp. 1033–1038).
Efros, A., & Leung, T. K. (1999). Texture synthesis by non-parametric sampling. In IEEE international conference on computer vision (pp. 1033–1038), September 1999.
El-Hakim, S. F. (1998). A multi-sensor approach to creating accurate virtual environments. Journal of Photogrammetry and Remote Sensing, 53(6), 379–391.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2006). Efficient belief propagation for early vision. International Journal of Computer Vision, 70(1), 41–54.
Fitzgibbon, A. W., & Zisserman, A. (1998). Automatic 3D model acquisition and generation of new images from video sequences. In Proceedings of European signal processing conference (pp. 1261–1269).
Freeman, W. T., & Adelson, E. H. (1991). The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9), 891–906.
Freeman, W. T., & Torralba, A. (2003). Shape recipes: scene representations that refer to the image. In Advances in neural information processing systems (NIPS) (Vol. 15). Cambridge: MIT Press.
Freeman, W. T., Pasztor, E. C., & Carmichael, O. T. (2000). Learning low-level vision. International Journal of Computer Vision, 20(1), 25–47.
Fua, P. (1993). A parallel stereo algorithm that produces dense depth maps and preserves image features. Machine Vision and Applications, 6, 35–49.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Hammersley, J. M., & Clifford, P. (1971). Markov field on finite graphs and lattices. Unpublished.
Hertzmann, A., Jacobs, C. E., Oliver, N., Curless, B., & Salesin, D. H. (2001). Image analogies. In SIGGRAPH (pp. 327–340), August 2001.
Hoiem, D., Efros, A. A., & Hebert, M. (2005). Geometric context from a single image. In Proceedings of ICCV.
Howe, C. Q., & Purves, D. (2002). Range image statistics can explain the anomalous perception of length. Proceedings of the National Academy of Sciences, USA, 99, 13184–13188.
Kolmogorov, V., & Zabih, R. (2001). Computing visual correspondence with occlusions using graph cuts. In Proceedings ICCV (pp. 508–515).
Kuipers, B. (1978). Modelling spatial knowledge. Cognitive Science, 2, 129–153.
Lee, A., Pedersen, K., & Mumford, D. (2001). The complex statistics of high-contrast patches in natural images. Private correspondence.
Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., Shade, J., & Fulk, D. (2000). The digital Michelangelo project: 3D scanning of large statues. In Proceedings of ACM SIGGRAPH (pp. 131–144), July 2000.
Li, S. Z. (1995). Markov random field modeling in image analysis. In T. L. Kunii (Ed.), Computer science workbench. Berlin: Springer.
Little, J. J., & Gillett, W. E. (1990). Direct evidence for occlusion in stereo and motion. Image and Vision Computing, 8, 328–340.
Michels, J., Saxena, A., & Ng, A. Y. (2005). High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 21st international conference on machine learning (ICML).
Nelson, W., & Cox, I. (1988). Local path control for an autonomous vehicle. In IEEE international conference on robotics and automation (pp. 1504–1510), New York, NY.
Nyland, L., McAllister, D., Popescu, V., McCue, C., Lastra, A., Rademacher, P., Oliveira, M., Bishop, G., Meenakshisundaram, G., Cutts, M., & Fuchs, H. (1999). The impact of dense range data on computer graphics. In Proceedings of multi-view modeling and analysis workshop (MVIEW, part of CVPR) (p. 8), June 1999.
Poggio, T. A., Gamble, E. B., & Little, J. J. (1988). Parallel integration of vision modules. Science, 242, 436–439.
Pollefeys, M., Koch, R., Vergauwen, M., & Van Gool, L. (1998). Metric 3D surface reconstruction from uncalibrated image sequences. In Proceedings of SMILE workshop (post-ECCV) (pp. 138–153).
Potetz, B., & Lee, T. S. (2003). Statistical correlations between 2D images and 3D structures in natural scenes. Journal of the Optical Society of America, 20(7), 1292–1303.
Pulli, K., Cohen, M., Duchamp, T., Hoppe, H., McDonald, J., Shapiro, L., & Stuetzle, W. (1997). Surface modeling and display from range and color data. In Lecture notes in computer science (Vol. 1310, pp. 385–397). Berlin: Springer.
Saxena, A., Chung, S. H., & Ng, A. Y. (2006). Learning depth from single monocular images. In Proceedings of the advances in neural information processing systems (NIPS). Cambridge: MIT Press.
Saxena, A., Schulte, J., & Ng, A. Y. (2007). Depth estimation using monocular and stereo cues. In Proceedings of international joint conference on artificial intelligence (IJCAI).
Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(7), 7–42.
Scharstein, D., & Szeliski, R. (2003). High-accuracy stereo depth maps using structured light. In IEEE CVPR (Vol. 1, pp. 195–202).
Sequeira, V., Ng, K., Wolfart, E., Goncalves, J. G. M., & Hogg, D. C. (1999). Automated reconstruction of 3D models from real environments. ISPRS Journal of Photogrammetry and Remote Sensing, 54, 1–22.
Shafer, G. (1976). Mathematical theory of evidence. Princeton: Princeton University Press.
Stamos, I., & Allen, P. K. (2000). 3D model construction using range and image data. In CVPR, June 2000.
Sun, J., Zheng, N., & Shum, H. (2003). Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7), 787–800.
Sun, J., Li, Y., Kang, S., & Shum, H.-Y. (2005). Symmetric stereo matching for occlusion handling. In Proceedings CVPR (Vol. II, pp. 399–406).
Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., & Rother, C. (2006). A comparative study of energy minimization methods for Markov random fields. In ECCV (Vol. 2, pp. 19–26).
Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: a factorization approach. International Journal of Computer Vision, 9(2), 137–154.
Torralba, A., & Oliva, A. (2002). Depth estimation from image structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1226–1238.
Torres-Méndez, L. A. (2005). Statistics of visual and partial depth data for mobile robot environment modeling. PhD thesis, McGill University.
Torres-Méndez, L. A., & Dudek, G. (2002). Range synthesis for 3D environment modeling. In IEEE workshop on applications of computer vision (pp. 231–236), Orlando, FL.
Torres-Méndez, L. A., & Dudek, G. (2004a). Statistical inference and synthesis in the image domain for mobile robot environment modeling. In Proceedings of the IEEE/RSJ conference on intelligent robots and systems (IROS), September 2004. New York: IEEE Press.
Torres-Méndez, L. A., & Dudek, G. (2004b). Reconstruction of 3D models from intensity image and partial depth. In AAAI proceedings (pp. 476–481), July 2004.
Torres-Méndez, L. A., & Dudek, G. (2006). Statistics of visual and partial depth data for mobile robot environment modeling. In A. Gelbukh & C. A. Reyes-García (Eds.), Lecture notes in artificial intelligence: Vol. 4293. Mexican international conference on artificial intelligence (pp. 715–725). Berlin: Springer.
Wei, L., & Levoy, M. (2000). Fast texture synthesis using tree-structured vector quantization. In SIGGRAPH (pp. 479–488), July 2000.
Whaite, P., & Ferrie, F. P. (1997). Autonomous exploration: driven by uncertainty. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(3), 193–205.
Winkler, G. (1995). Image analysis, random fields, and dynamic Monte Carlo methods: a mathematical introduction. Berlin: Springer.
Zhang, L., & Seitz, S. M. (2005). Parameter estimation for MRF stereo. In CVPR (Vol. 2, pp. 288–295).
Zhang, R., Tsai, P.-S., Cryer, J. E., & Shah, M. (1999). Shape from shading: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 690–706.