INTEGRATING 3D TIME-OF-FLIGHT CAMERA DATA AND HIGH RESOLUTION IMAGES FOR 3DTV APPLICATIONS

Benjamin Huhle, Sven Fleck
University of Tübingen, WSI/GRIS
{huhle | fleck}@gris.uni-tuebingen.de

Andreas Schilling
University of Tübingen, currently on leave at Stanford University
[email protected]

ABSTRACT

Applying the machine-learning technique of inference in Markov Random Fields, we build improved 3D models by integrating two different modalities. Visual input from a standard color camera delivers high-resolution texture data but also enables us to enhance the 3D data calculated from the range output of a 3D time-of-flight camera in terms of noise and spatial resolution. The proposed method to increase the visual quality of the 3D data makes this kind of camera a promising device for various upcoming 3DTV applications. With our two-camera setup we believe that the design of low-cost, fast and highly portable 3D scene acquisition systems will be possible in the near future.

Index Terms— 3D scene acquisition, MRF, sensor fusion, 3DTV, time-of-flight camera

1. INTRODUCTION

Automated 3D model acquisition techniques are essential for many emerging applications. 3DTV is one major area where 3D models are required, as the user will be able to freely choose their viewpoint in real-time. Visually appealing 3D models are also required for the visualization of buildings, of safety-critical environments such as airports and power plants, and for virtual walk-throughs of whole city models. Designing a system that builds 3D models of real environments in an automated and hassle-free way is still a challenging research topic. Many approaches use laser scanners to realize such a system. For example, the Wägele platform [1] uses 2D laser scanners, and the environment is sampled by moving the platform through the scene. Other systems based on 3D laser scanners are commercially available, e.g., by FARO. Although the resulting 3D models are quite good, both approaches share the drawback of requiring completely static scenes, as they sample the environment over time.

Typically, model acquisition systems take advantage of different sensors.

Unfortunately, each sensor has its inherent drawbacks in terms of noise, spatial and temporal resolution, etc. Most approaches do not combine all the available sensor information in an optimal way; e.g., in many setups based on a combination of range scanners and cameras, the vision sensor is not actively used during model creation, as stated by Andreasson et al. [2]. Instead, the camera data is only used as texture information in the visualization step. Previous work using 3D time-of-flight cameras shares the same problem of not optimally integrating the available information. Combining a time-of-flight camera with a standard camera, Prasad et al. [3] use traditional interpolation within the range domain and register both modalities to achieve a higher resolution 3D model. Using a stereo camera together with a time-of-flight camera, Kuhnert et al. [4] present a straightforward approach to combine the range data of both sensors.

Our system integrates the information of a time-of-flight camera and a standard camera based on a well-established machine-learning approach, namely a Markov Random Field (MRF). The recently presented method of Diebel and Thrun [5] makes use of special potential functions to elegantly encode the dependencies and characteristics of different modalities. Employing this technique enables us to build a promising setup for our integrated time-of-flight system. Section 2 describes our system, Section 3 discusses the integration of the two modalities, and first results are shown in Section 4.

2. OUR SYSTEM

The presented system comprises a 3D time-of-flight camera with an additional standard camera mounted closely on top of it (see Figure 1). This setup yields a comfortable and extremely portable acquisition platform that can be used on 3DTV movie sets due to its small form factor, light weight and low power consumption.

2.1. 3D Time-of-Flight Camera

The 3D time-of-flight camera used in our setup is a PMD[vision] 19k with a 160x120 pixel photonic mixer device (PMD) sensor array. It acquires distance data using the time-of-flight principle with active illumination by invisible modulated near-infrared light. For each pixel it delivers distance and intensity information simultaneously, where the distance data is computed from the phase shift of the reflected light directly inside the camera. Both modalities are captured through the same optical system and are therefore perfectly aligned. The camera works with a frame rate of up to 15 fps. We use the PMD[vision] camera equipped with a 12 mm lens, resulting in a horizontal field-of-view of about 30°. Figure 1 shows the depth map and the intensity image as output by the camera. It is clearly seen that the depth data from this device needs adequate post-processing when it comes to building visually convincing 3D models as needed for 3DTV applications.
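For reference, the standard phase-shift relation behind this measurement principle is sketched below. The paper does not state the modulation frequency, so the 20 MHz value (and hence the unambiguous range) is an assumption based on typical PMD devices of that generation:

    d = \frac{c}{4\pi f_{\mathrm{mod}}}\,\Delta\varphi ,
    \qquad
    d_{\max} = \frac{c}{2 f_{\mathrm{mod}}} \approx 7.5\,\mathrm{m}
    \quad\text{for } f_{\mathrm{mod}} = 20\,\mathrm{MHz}.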

Fig. 1. Our two-camera setup and raw output data. Top: color image from the standard camera. Bottom row: intensity and depth image from the time-of-flight camera.

2.2. Multi Sensor Setup

To enhance the low resolution of the image data as well as of the depth image delivered by the time-of-flight camera, we mounted a high-resolution standard camera on top of the PMD[vision] camera. A Matrix Vision BlueFox with a 1600x1200 pixel sensor and a 12 mm lens was used. The resulting horizontal field-of-view of about 34° is similar to that of the PMD camera, as can be seen in Figure 1. This ensures an easy calibration of both cameras and only a small loss of information. Figure 1 shows the combined camera system.

Since the intensity image and the depth data coming from the PMD camera are perfectly aligned, it is possible to use standard algorithms for calibration and registration of the high-resolution camera with the depth data. This means that by registering two standard images we can effectively register two different modalities. Using the stereo system calibration method of the camera calibration toolbox for Matlab from Caltech (available at www.vision.caltech.edu/bouguetj/calib_doc) we calculate the relative positions of both cameras. Applying the resulting translation and rotation to the 3D data calculated from the depth data, one obtains 3D coordinates in the reference frame of the high-resolution camera.
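As an illustration of this step, the following Python sketch transforms ToF points with the calibrated rotation and translation and projects them with the color camera's intrinsics. The function and variable names are hypothetical, and the intrinsic matrix K_color is an assumption, since the paper does not list calibration parameters.

```python
import numpy as np

def tof_points_to_color_pixels(points_tof, R, t, K_color):
    """Map 3D points from the ToF camera frame into the color camera image.

    points_tof : (N, 3) array of 3D points computed from the PMD depth image.
    R, t       : rotation (3, 3) and translation (3,) from the stereo
                 calibration, mapping ToF coordinates to the color-camera frame.
    K_color    : (3, 3) intrinsic matrix of the high-resolution camera.

    Returns (N, 2) pixel coordinates and the (N,) depths in the color frame.
    """
    # Rigid transform into the reference frame of the color camera.
    p_color = points_tof @ R.T + t
    # Perspective projection with the color-camera intrinsics.
    uv_homogeneous = p_color @ K_color.T
    uv = uv_homogeneous[:, :2] / uv_homogeneous[:, 2:3]
    return uv, p_color[:, 2]
```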

3. INTEGRATING THE TWO MODALITIES

With our two-camera setup we obtain three datasets per shot: high-resolution image data, low-resolution intensity data and low-resolution depth data (cf. Figure 1). The low-resolution intensity image from the depth camera is used in the calibration step to register both cameras, as described in the previous section. To integrate the high-resolution color image and the depth data, we exploit the dependencies between both modalities. These are elegantly encodable in a graphical model, namely an MRF, such that the available information is automatically integrated. Employing the well-accepted machine-learning approach of Diebel and Thrun [5], who achieved impressive results with a combined laser scanner and camera setup, we use the following prior as a heuristic: discontinuities in the depth and image data tend to co-align, as edges in space typically result in edges in the image.

3.1. MRF Model

A typical MRF model for image analysis consists of different layers of two-dimensional grids representing the pixel plane [6]: the observation layer and the hidden variable layer, where the hidden variables in our case represent the depth values that we want to infer. The observation layer represents the measurements available from the PMD camera. Since the depth values are transformed into the standard camera frame, they are not available at regular grid positions and therefore have to be interpolated to the pixel positions of the standard camera image at high resolution.
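A minimal sketch of this resampling step is given below, assuming projected sample positions such as those produced by the projection sketch in Section 2.2 and using SciPy's griddata; the nearest-neighbor choice is an illustrative assumption rather than the paper's stated method.

```python
import numpy as np
from scipy.interpolate import griddata

def depth_observations_on_grid(uv, depth, width, height):
    """Resample the projected ToF depth samples onto the color-image grid.

    uv    : (N, 2) pixel positions of the ToF samples in the color image.
    depth : (N,) depth value of each sample.
    Returns a (height, width) map used as the observation layer of the MRF.
    """
    grid_u, grid_v = np.meshgrid(np.arange(width), np.arange(height))
    # Nearest-neighbor resampling keeps depth discontinuities sharp; 'linear'
    # would smear them. This particular choice is an assumption, not taken
    # from the paper.
    return griddata(uv, depth, (grid_u, grid_v), method='nearest')
```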

Fig. 2. MRF including the third layer incorporating image gradients; the three node layers represent image gradients, latent depth values and depth measurements. Image pixels are omitted for clarity.

A standard approach is to a priori assume a smooth depth map. Therefore, a smoothness potential function on the edges connecting the nodes in the hidden variable layer adds costs to the energy function of the graph at sites where neighboring nodes have different depth values.

The dependency between the discontinuities is then modeled by a modification of these potential functions, weighting the cost differently depending on the image gradient at the corresponding sites of the color image, as proposed in [5]. This adds a third layer to the MRF, where image gradients are encoded as nodes in the graph. Connecting these with new gradient nodes introduced at the existing hidden variable layer, the resulting three-layer MRF in Figure 2 represents the inference model that integrates the different modalities. In the next subsection we briefly review the optimization problem as formulated by Diebel and Thrun [5].

3.2. Energy minimization

Introducing weights in the smoothness potential function, one obtains

    \Phi = \sum_{i \in G} \sum_{j \in N_i} w_{i,j} (d_i - d_j)^2 ,    (1)

where G is the set of nodes, N_i denotes the neighborhood of node i, and d_i, i \in G, are the depth values residing on the nodes. The weights w_{i,j} are determined by the third layer of the graph, namely the image gradients:

    w_{i,j} = \exp\!\left(-\theta_1 \|I_i - I_j\|_2^2\right),

where I_i are the intensity values of the high-resolution image. Variants of this potential function have recently been proposed in [2]. The link between the observations \tilde{d}_i and the hidden variables d_i is established by the measurement potential function relying on squared differences:

    \Psi = \sum_{i \in G} \theta_2 (d_i - \tilde{d}_i)^2 .    (2)

The resulting energy function is proportional to the joint probability of the hidden variables,

    p(d \mid I, \tilde{d}) = \frac{1}{Z} \exp\!\left(-0.5\,(\Phi + \Psi)\right),    (3)

where Z is a normalization constant [6].
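Since Z does not depend on the depth values d, maximizing (3) is equivalent to minimizing the energy; making this small step explicit:

    \hat{d} = \arg\max_{d}\, p(d \mid I, \tilde{d})
            = \arg\min_{d}\, \bigl(-2\log p(d \mid I, \tilde{d})\bigr)
            = \arg\min_{d}\, (\Phi + \Psi).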

The model depends on the parameters Θ = (θ_1, θ_2), which we set empirically. By minimizing Φ + Ψ we obtain the most likely depth data. As can be seen from Equations (1)-(3), the problem is quadratic and can therefore be solved quickly and robustly by gradient descent.
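The following Python sketch illustrates how such a gradient descent on Φ + Ψ could look on a regular grid with a 4-neighborhood. Parameter values, the step size and the convention of counting each neighbor pair once are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mrf_depth_refinement(depth_init, image, theta1=10.0, theta2=5.0,
                         n_iters=500, step=1e-3):
    """Minimal sketch of minimizing Phi + Psi (Eqs. 1-3) by gradient descent.

    depth_init : (H, W) depth map resampled to the color-image grid (the
                 observations d~ of Eq. (2)), also used as initialization.
    image      : (H, W, 3) high-resolution color image for the weights of Eq. (1).
    The step size may need tuning depending on the data scale.
    """
    d = depth_init.astype(np.float64).copy()
    d_obs = depth_init.astype(np.float64)
    img = image.astype(np.float64)

    # Weights w_ij = exp(-theta1 * ||I_i - I_j||^2) for right and down neighbors.
    w_right = np.exp(-theta1 * np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1))
    w_down = np.exp(-theta1 * np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=-1))

    for _ in range(n_iters):
        grad = 2.0 * theta2 * (d - d_obs)          # gradient of Psi
        # Gradient of Phi, with each unordered neighbor pair counted once.
        diff_r = d[:, :-1] - d[:, 1:]
        diff_d = d[:-1, :] - d[1:, :]
        grad[:, :-1] += 2.0 * w_right * diff_r
        grad[:, 1:] -= 2.0 * w_right * diff_r
        grad[:-1, :] += 2.0 * w_down * diff_d
        grad[1:, :] -= 2.0 * w_down * diff_d
        d -= step * grad
    return d
```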

4. RESULTS

Figure 5 shows a direct comparison of the depth image resulting from simple interpolation of the depth data and the depth image obtained with the proposed algorithm. As expected, the post-processed depth image contains much less noise due to the smoothness-enforcing MRF model, yet preserves high-frequency information in the data. In the right image of this figure it can be seen how the heuristic of co-aligned image and depth gradients works in practice. Apart from small registration errors, the dependencies are well extracted at the object boundaries. Small noise, e.g. on the cabinet door, does not affect the algorithm too much. However, problems can be encountered in highly textured areas where the depth values themselves are erroneous, e.g., at the photo on the wall, where the depth measurement partly fails due to the glass picture frame.

In Figures 3 and 4 we show rendered point clouds acquired with the proposed system. Compared both to the interpolated time-of-flight camera output and to the 3D model resulting from a slightly smoothed depth image (center image of Figure 4), the MRF-integrated data on the right of the same figure shows a significant improvement. Whereas the simple smoothing invents new points, e.g. at the side wall of the cabinet, which is not visible in the real data, the MRF-enhanced model only creates few outliers that can be removed automatically without corrupting the model. Therefore, by combining various datasets taken from different viewpoints it will be possible to create consistent 3D models of real scenes. This makes the system attractive for high-quality scene acquisition applicable to 3DTV.

Fig. 3. 3D model acquired with the presented system.

Fig. 4. Rendered 3D model of a test scene. Left: nearest-neighbor interpolation. Center: smoothed with a Gaussian filter (σ = 0.2). Right: proposed MRF algorithm.

Fig. 5. Depth images. Left: nearest-neighbor interpolation. Center: proposed MRF algorithm. Right: depth (red channel) with the overlaid sum of weights used in the smoothness potential function (green and blue channels).

5. CONCLUSION

We presented a system for the acquisition of high-resolution 3D models with a combined time-of-flight and color camera setup. Applying a machine-learning MRF framework, the high spatial resolution of the color camera is employed to increase the resolution and quality of the depth map from the time-of-flight camera. Our results show that a significant gain in the visual quality of the rendered 3D model derived from the different sensor data is achievable. Compared to previous work, our approach does not make use of mechanical components such as rotating laser scanners. The presented camera setup is also applicable to dynamic scenes, whereas most laser-scanner-based acquisition platforms are constrained to completely static environments. In comparison with stereo systems, our setup is also able to work in poorly illuminated or textureless scenarios where stereo systems generally fail.

As the described system only presents a first approach to the design of a low-cost, portable 3D acquisition device, there are many open issues for future work. Whereas we only presented a per-frame algorithm to enhance the 3D model quality, one could think of many ways to improve the data over time. The presented method deals especially well with small high-frequency noise, but various techniques could help to remove outliers from the data. Moreover, we will further investigate the sampling issues involved in projecting the depth values to the high-resolution image pixels in the first phase of the algorithm.

ACKNOWLEDGEMENT

This work is supported by the EC within FP6 under Grant 511568 with the acronym 3DTV.

6. REFERENCES

[1] P. Biber, S. Fleck, F. Busch, M. Wand, T. Duckett, and W. Straßer, "3D modeling of indoor environments by a mobile platform with a laser scanner and panoramic camera," in 13th European Signal Processing Conference (EUSIPCO 2005), September 4-8, 2005.

[2] H. Andreasson, R. Triebel, and A. Lilienthal, "Vision-based interpolation of 3D laser scans," in Proceedings of the 2006 IEEE International Conference on Autonomous Robots and Agents (ICARA 2006), 2006.

[3] A. Prasad, K. Hartmann, W. Weihs, S. E. Ghobadi, and A. Sluiter, "First steps in enhancing 3D vision technique using 2D/3D sensors," in Computer Vision Winter Workshop, Czech Pattern Recognition Society, Chum and Franc, Eds., 2006.

[4] K.-D. Kuhnert and M. Stommel, "Fusion of stereo-camera and PMD-camera data for real-time suited precise 3D environment reconstruction," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'06), 2006.

[5] J. Diebel and S. Thrun, "An application of Markov random fields to range sensing," in Advances in Neural Information Processing Systems 18, Y. Weiss, B. Schölkopf, and J. Platt, Eds., Cambridge, MA, 2006, pp. 291-298, MIT Press.

[6] Stan Z. Li, Markov Random Field Modeling in Image Analysis, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.
