Real-Time Connectivity Constrained Depth Map Computation using Programmable Graphics Hardware
Nico Cornelis
K. U. Leuven, ESAT-PSI
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee
[email protected]
Abstract
This paper presents a passive depth map computation algorithm which runs entirely on programmable 3D graphics hardware at real-time speed. Previous implementations relied on the use of either mipmapping or correlation windows to increase the robustness of the calculated depth map. These methods, however, are inherently unable to compute the depths for fine 3D structures. To overcome this obstacle we introduce a pixel-based method which incorporates connectivity information between neighboring pixels, while maintaining the performance boost gained by implementing the algorithm on a highly parallelized architecture such as graphics hardware.
1. Introduction
Nowadays, cutting-edge CPUs are capable of five gigaflops. In 1989 the world's fastest million-dollar supercomputer, the Cray-3, achieved this processing power; fifteen years later the same performance can be bought for less than $500 at a local computer store. This opens the possibility of richer user environments and more complex real-time digital processing. On the one hand, for the CPU industry, Moore's Law [1] still describes the current and expected growth rate until 2020. On the other hand, due to the stream-processor architecture of GPUs and advances in graphics hardware, mainly driven by the game industry, the growth curve for GPUs corresponds roughly to Moore's Law cubed, doubling processing speed approximately every six months. The introduction of vertex and fragment programs caused a major breakthrough, giving the programmer almost complete control over the GPU. They provide a large instruction set, comparable to that of CPUs, and can be programmed in both assembly and high-level
Luc Van Gool
K. U. Leuven, ESAT-PSI
Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee
[email protected]
languages. Also, the expanding list of new features allows more and more software algorithms, most of which rely on SIMD instructions, to be implemented on a graphics card with a substantial performance gain w.r.t. their software-based counterparts. All of these features make graphics hardware an attractive platform for a vast number of algorithms. Our goal is to design an algorithm which computes depth maps in real time while avoiding the artifacts exhibited by previous implementations.
2. Related Work
Yang et al. [2] present a real-time depth estimation algorithm utilizing graphics hardware. They use images from several calibrated cameras to increase the robustness of the algorithm. Depth values are obtained by a plane sweep through 3D world space, making the time complexity linear w.r.t. the desired depth resolution. Later, Yang and Pollefeys [3] refined this method to work for a two-camera setup, using a weighted sum of correlation values resulting from various mipmap levels to achieve robustness. The mipmap levels represent subsampled versions of the original images. Each of the mipmap levels is generated by passing the previous level through a 2 × 2 box filter before subsampling, where level zero corresponds to the original image. Zach et al. [4] provide an algorithm which is based on the iterative refinement of a 3D mesh hypothesis to estimate the depths. The mesh vertices are placed every four pixels and in each iteration the algorithm keeps track of the best local modification for each of these vertices. In later work they improved the speed of the algorithm significantly, but the basic idea remained the same [5]. Later, they optimized the algorithm even further by using disparity maps instead of the vertex mesh [6]. The previous methods use mipmapping and correlation windows, which are suitable for the depth estimation of connected smooth surfaces, but they are unable to
estimate correct depths in the presence of high frequency textures. Here, the information provided by the lower mipmap levels is little to none compared to the performance overhead. For instance, if an object in the scene has a uniformly distributed high frequency color pattern, mipmapping will quickly reduce the pattern to an overall constant color. Also, due to occlusions, the use of mipmapping and correlation windows results in artifacts at the pixels belonging to fine 3D structures only a few pixels thick, because the background surrounding the fine structure, viewed in the different images, is independent of the object the pixel belongs to. The method proposed in this paper introduces a new pixel-based algorithm which incorporates connectivity information between neighboring pixels to achieve the desired robustness.
3. Our Method
3.1 Setup
The input to the algorithm presented in this paper consists of two or more images for which the respective internal and external camera parameters, in a common coordinate frame, are known. For one of these images, referred to as the reference image, we wish to compute the depth map in a fast, robust way. In the remainder of this text, we will refer to the remaining images as sensor images.
3.2 Radial Undistortion
The internal camera parameters include the radial distortion parameters, which are used to preprocess the images such that they are converted to the equivalent images of ideal pinhole cameras. Provided the radial distortion parameters do not vary with time, this preprocessing step can be implemented efficiently on the GPU by precomputing a lookup texture containing the undistorted texture coordinates for each pixel.
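As an illustration of the lookup-texture idea, the sketch below precomputes such a table on the CPU with NumPy. It is a minimal sketch, not the actual GPU implementation: it assumes a simple two-coefficient radial model (k1, k2) and a pinhole calibration matrix K, both of which are hypothetical example inputs, and it stores, for every pixel of the ideal image, the distorted source coordinate that a fragment program would read from the lookup texture.

```python
import numpy as np

def build_undistort_lookup(width, height, K, k1, k2):
    """Precompute, per pixel of the ideal image, the distorted source
    coordinates (a lookup 'texture' of shape H x W x 2)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    # Normalized ideal (pinhole) coordinates.
    x = (u - cx) / fx
    y = (v - cy) / fy
    r2 = x * x + y * y
    # Simple two-coefficient radial model: where the ideal ray lands in the distorted image.
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = x * factor, y * factor
    lookup = np.stack([fx * xd + cx, fy * yd + cy], axis=-1)
    return lookup.astype(np.float32)

# Example usage with made-up calibration values.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
lut = build_undistort_lookup(640, 480, K, k1=-0.25, k2=0.07)
print(lut.shape)  # (480, 640, 2)
```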
3.3 Error Function
In order to find corresponding points between the image pairs formed by the reference image and each of the sensor images, we need an error measure. In case only two images are used, the sum of squared differences (SSD) is a suitable measure. If more than two images are used, we propose the error measure computed according to equation (1):

\[
e = \sum_{i=1}^{n_s} \min\bigl(m,\; (r_i - r_r)^2 + (g_i - g_r)^2 + (b_i - b_r)^2\bigr) \tag{1}
\]
Figure 1. Plane sweep setup for one reference image, two sensor images and three test planes.
where r denotes the index of the reference image, n_s equals the number of sensor images and m is an arbitrary threshold indicating the maximum radius of influence in RGB space. The saturation level m of the SSD scores is needed to account for occlusions by constraining the influence of outliers on the total SSD score. Using color images results in a better error measure than using grayscale images.
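The error measure of equation (1) is straightforward to express in code. The following NumPy sketch is only illustrative (the paper evaluates it in a fragment program); ref_rgb, sensor_rgbs and m are hypothetical inputs holding the reference color, the colors fetched from the sensor images and the saturation threshold.

```python
import numpy as np

def sweep_error(ref_rgb, sensor_rgbs, m):
    """Saturated SSD error of equation (1).

    ref_rgb:      (..., 3) RGB of the reference pixel(s).
    sensor_rgbs:  list of (..., 3) RGB values fetched from each sensor image.
    m:            saturation threshold limiting the influence of occluded views.
    """
    e = np.zeros(ref_rgb.shape[:-1])
    for rgb_i in sensor_rgbs:
        ssd = np.sum((rgb_i - ref_rgb) ** 2, axis=-1)
        e += np.minimum(ssd, m)   # clamp outliers (e.g. occlusions) to m
    return e

# Toy usage: one reference pixel against two sensor samples.
ref = np.array([0.8, 0.2, 0.1])
samples = [np.array([0.75, 0.22, 0.12]),   # good match
           np.array([0.10, 0.90, 0.90])]   # outlier, clamped to m
print(sweep_error(ref, samples, m=0.05))
```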
3.4 Iterative Hierarchical Sweep
3.4.1 Initial Depth Map Estimation Using Plane Sweep
For the initialization of the depth map we use the algorithm described in [2]. A simple plane sweep is performed throughout 3D space, positioning the plane at an arbitrary number of discrete depths, parallel to the image plane of the reference image. Figure 1 illustrates the plane sweep process. For a pixel at position (x, y) in the reference image (p_{x,y}) and a plane at depth d (P_d), the following computations are performed. First, the intersection of P_d with the ray connecting p_{x,y} and the camera center is computed. Next, the resulting 3D point is projected into each of the sensor images, and the respective colors are fetched from texture memory. Finally, the error e_{x,y,d} is computed according to equation (1). This process is repeated for multiple planes P_d, and the values e_{x,y,d} and d for which e_{x,y,d} is minimal are stored at position (x, y). Additionally, we use a hierarchical approach to speed up the computations and typically start at the third hierarchical level, which spans 1/64th of the area of the original reference image. After a fixed number of iterations the level is lowered towards full resolution at level zero. Mipmapping is disabled because, in the presence of high frequency textures, the information provided by the lower mipmap levels is little to none compared to the performance overhead. Therefore, we only make use of bilinear texture filtering on the full resolution images. Figure 2 shows the initial depth map estimations for two calibrated images
of the Stanford dragon, textured with high-frequency, uniformly distributed random colored noise. This figure clearly shows that the contribution of lower mipmap levels can have a negative effect on the depth map estimation. This effect is more pronounced at higher (lower resolution) mipmap levels. For scenes containing fine structures, this negative impact results in artifacts at the pixels belonging to the fine structure, due to occlusions: the mipmap levels contribute false information because the background surrounding the fine structure, viewed in the different images, is independent of the object the pixel belongs to.

Figure 2. Top: bilinear filtering (no mipmapping). Bottom: trilinear filtering (mipmapping). Left to right: computed depth maps for levels 0, 1 and 2 respectively.

3.4.2 Refinement Using Depth Map Sweep
During the initialization step, planes were the optimal choice for implementation on graphics hardware thanks to their simplicity and non-self-occluding property. When scaled about the camera center, a depth map itself also exhibits the latter property and can therefore be used to sweep the 3D space in the following iterations. This way, the algorithm can be made hierarchical in depth resolution. At each iteration, the error values for each pixel are recomputed for a number of small offsets to the depth map generated in the previous iteration. These offsets depend on the iteration count, the hierarchical level and the number of planes in the initialization step. By changing the hierarchical level after a fixed number of iterations, the algorithm becomes hierarchical in screen space resolution as well. For the initial depth map estimation, the planes were drawn at a number of discrete depths, parallel to the image plane of the reference image. Using the projective texture mapping facility of the graphics card, the texture coordinates were automatically generated by the GPU for each of the sensor images. After the initial depth map estimation, however, each pixel can contain a different depth value, inhibiting the direct use of the texture coordinates generated automatically by interpolating the texture coordinates of the plane corners in homogeneous 3D space. We designed the following algorithm to make efficient use of the GPU in terms of texture coordinate generation. First, we can use the coordinate frame of the reference image as the common coordinate frame without loss of generality. The projection matrix P_i for each sensor image can then be written as follows:

\[
P_{3\times4,i} = K_{3\times3,i}\,\bigl[\,R^{t}_{i} \;\; -R^{t}_{i}\,t_{i}\,\bigr]
\]
where K_{3×3,i} represents the internal camera matrix for sensor image i, and R^t_i and t_i specify the corresponding external rotation and translation in the coordinate frame of the reference image. Next, for each pixel with screen coordinates (u, v), we would like to efficiently compute the texture coordinates as if they were generated from a 3D point which is a scaled version of the ray direction with unit z value. If the depth d_{uv} is stored as the distance from the camera, perpendicular to the reference image plane, we want to calculate the texture coordinates for a 3D point with homogeneous coordinates (X_{uv}, Y_{uv}, 1, 1/d_{uv}), where X_{uv} and Y_{uv} are the 3D coordinates of a ray, cast from the reference camera center through the pixel at position (u, v), normalized to have unit z value. The new texture coordinates (x_{uv,i}, y_{uv,i}, w_{uv,i}) for sensor image i can then be derived as follows:

\[
\begin{pmatrix} x_{uv,i} \\ y_{uv,i} \\ w_{uv,i} \end{pmatrix}
= \bigl[\,P_{3\times3,i} \;\; P_{3\times1,i}\,\bigr]
\begin{pmatrix} X_{uv} \\ Y_{uv} \\ 1 \\ 1/d_{uv} \end{pmatrix}
= P_{3\times3,i}\begin{pmatrix} X_{uv} \\ Y_{uv} \\ 1 \end{pmatrix}
+ \frac{1}{d_{uv}}\,P_{3\times1,i}
\]

The first term in the resulting equation represents a homography at infinity, which can once again be computed automatically by the GPU. The matrices P_{3×3,i} are used for the automatic texture coordinate generation, while the vectors P_{3×1,i} are stored as parameters in the fragment program. So for a pixel with screen coordinates (u, v), the new texture coordinate for each sensor image i is computed by fetching the depth d_{uv} from the depth map, multiplying it with the vector P_{3×1,i} and adding the result to the texture coordinate which has been automatically generated by the GPU for image i. Notice that the depth value has to be fetched only once to compute the texture coordinates for all the sensor images. In terms of performance, the overhead consists of one texture lookup to fetch the depth value from the depth map and one multiply-add (MAD) operation per sensor image. The majority of the MAD operations can be performed during the texture fetch latencies, while the overhead caused by the depth texture fetch decreases as the number of sensor images increases, since the depth value has to be fetched only once.
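The per-pixel texture coordinate computation can be mimicked on the CPU as follows. This is a sketch of the arithmetic only, not the fragment program: P33 and P31 stand for the P_{3×3,i} and P_{3×1,i} blocks above, rays holds the (X_uv, Y_uv, 1) directions with unit z, and the depth map is assumed to store the perpendicular distance d_uv (if the stored value is an inverse depth, the epipole term is weighted by the stored value directly). Bilinear filtering and clamping of the actual texture fetch are omitted.

```python
import numpy as np

def sensor_tex_coords(P33, P31, rays, depth):
    """Texture coordinates for one sensor image, for every reference pixel.

    P33:   (3, 3) block of the sensor projection matrix (homography at infinity).
    P31:   (3,)   last column of the sensor projection matrix.
    rays:  (H, W, 3) ray directions through each reference pixel, scaled to unit z.
    depth: (H, W)   per-pixel depth map (perpendicular distance assumed here).
    """
    # Homography-at-infinity term: what the GPU would generate automatically.
    h_inf = rays @ P33.T                           # (H, W, 3)
    # Depth-dependent term: one multiply-add per sensor image in the shader.
    hom = h_inf + (1.0 / depth)[..., None] * P31
    # Perspective divide to obtain the actual 2D texture coordinates.
    return hom[..., :2] / hom[..., 2:3]

# Toy usage: a sensor camera translated slightly along x, depth map of constant 5.
H, W = 4, 4
K = np.array([[100.0, 0, 2.0], [0, 100.0, 2.0], [0, 0, 1.0]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
P = K @ np.hstack([R.T, (-R.T @ t)[:, None]])      # P = K [R^t | -R^t t]
u, v = np.meshgrid(np.arange(W), np.arange(H))
rays = np.stack([(u - 2.0) / 100.0, (v - 2.0) / 100.0, np.ones_like(u, float)], axis=-1)
depth = np.full((H, W), 5.0)
print(sensor_tex_coords(P[:, :3], P[:, 3], rays, depth).shape)  # (4, 4, 2)
```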
3.5 The Connectivity Constraint
3.5.1 Non-Linear Filtering
In most cases, the initial depth map will contain false depth estimates in the presence of homogeneous regions, occlusions and ambiguities. Because of the hierarchical nature of the algorithm, false depths at an early stage of the algorithm will have a significant influence on the final depth estimation at full resolution. Therefore, we add connectivity information at each level to reduce the number of false depths. We do this by incorporating a heuristic: if the pixel under inspection belongs to the same object as one of its neighbors, the most probable depth for the pixel is the depth of that neighbor, optionally with a small offset. For each pixel, the algorithm fetches the depths of neighboring pixels in a 3 × 3 neighborhood according to one of the sampling patterns shown in Figure 3. The heuristic is implemented by computing the error value for each of these depths and their offsets. The error and depth value corresponding to the lowest error value are stored at the position corresponding to the pixel under inspection. Finally, the resulting values for all pixels are copied back into the depth map. This process can be repeated several times at the same pyramid level.
Figure 3. Three neighbor-sampling patterns.

In order to remove isolated, floating points, the depth corresponding to the pixel itself is not sampled, so that the sampling pattern acts much like a median filter. It is possible that in this step some pixels with correct depth values are filtered out; the retrieval of these correct depth values is discussed in Section 3.5.2. At each iteration, a depth approximating the depth of a neighboring pixel is assigned to the pixel under inspection. Because the neighboring pixels undergo the same routine, it is possible that the depth corresponding to the current pixel is assigned to that neighboring pixel. Assuming we use the left pattern in Figure 3, the algorithm would allow flip-flopping of a depth value between two pixels which have an isolated depth w.r.t. their surroundings. To avoid this undesired effect, we alternate between the middle and right sampling patterns in Figure 3 at each iteration. By doing this, it takes at least three iterations for the depth value of a pixel to return to that pixel, allowing its neighbors to treat its depth as an outlier and filter it out. In contrast to a median filter, this approach can also provide the pixel with a new, better depth at the same time, and it is computationally less expensive. At the end of this step, the depth map will mainly consist of large connected patches.
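The connectivity pass can be sketched as follows. This is an illustrative CPU version, not the shader: error_fn stands for the matching error of equation (1) evaluated for a pixel at a candidate depth, offsets are the small depth offsets mentioned above, and the two patterns are stand-ins for the middle and right sampling patterns of Figure 3.

```python
import numpy as np

# Two neighbor-sampling patterns (pixel itself excluded), alternated between
# iterations to prevent a depth value flip-flopping between two isolated pixels.
PATTERN_A = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # axis-aligned neighbours
PATTERN_B = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # diagonal neighbours

def connectivity_pass(depth, error_fn, offsets, pattern):
    """One connectivity-constraint pass (illustrative sketch, not the shader).

    depth:    (H, W) current depth map.
    error_fn: error_fn(y, x, d) -> matching error of pixel (y, x) at depth d,
              e.g. the saturated SSD of equation (1).
    offsets:  small depth offsets tried around each neighbour's depth.
    pattern:  which neighbours to sample (PATTERN_A or PATTERN_B).
    """
    H, W = depth.shape
    new_depth = depth.copy()
    for y in range(H):
        for x in range(W):
            best_e, best_d = np.inf, depth[y, x]
            for dy, dx in pattern:
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W:
                    for off in offsets:
                        d = depth[ny, nx] + off
                        e = error_fn(y, x, d)
                        if e < best_e:
                            best_e, best_d = e, d
            new_depth[y, x] = best_d
    return new_depth

# Toy usage: a synthetic error that prefers depth 5 everywhere.
err = lambda y, x, d: (d - 5.0) ** 2
d0 = np.full((8, 8), 5.0)
d0[3, 3] = 20.0                                   # one isolated outlier
d1 = connectivity_pass(d0, err, offsets=[-0.1, 0.0, 0.1], pattern=PATTERN_A)
print(d1[3, 3])  # the outlier is replaced by a depth taken from its neighbours
```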
Figure 4. Left to right: Resulting depth maps after 0, 1 and 2 connectivity passes.
Figure 5. Left to right: Original reference image. Zoomed section of final depth map after 0, 4, 8 and 12 passes per hierarchical level respectively.
3.5.2 Retrieval Of Fine Structures
As described in Section 3.5.1, the depth map was filtered much like a median filter in order to remove isolated, floating points. It is possible that in this step some pixels with correct depth values were filtered out, not because the pixels themselves were wrong but because their surrounding pixels had incorrect depth values. In subsequent stages of the algorithm, we wish to retrieve these correct depth values, provided they are connected either directly to a larger surface or indirectly, through previously retrieved fine structure points. We do this by sampling the depth values of all the pixels in a 3 × 3 neighborhood, rather than applying the sampling patterns. Because, at the end of the filtering step, the depth map consists of large connected patches, the sampled depths for a pixel will belong to one of these patches. This list of depth samples is once more expanded and tested in the same way as described in Section 3.5.1. Figure 5 illustrates the retrieval of fine structures for a set of images taken of a traffic sign. It is clearly noticeable in the third image that most of the outliers are filtered out, but also a number of inliers. The fourth and fifth images show how the complete traffic sign is retrieved afterwards as the algorithm searches for new points, originating from the base and the top of the traffic sign.
3.5.3 Smoothness Penalty
The proposed algorithm already behaves well in the presence of high frequency textures and occlusions, but the estimated depths for homogeneous regions remain noisy. To remedy this, we add a smoothness penalty to the error value, computed as follows. After each iteration, the depth map is passed through a 3 × 3 Gaussian filter and the result is stored in a separate depth map. For each depth d in the list of depths to be tested by a pixel, the distance to the smoothed depth map d_s is calculated. Based on this distance, the following penalty is added to the error value:

\[
e_{new} = e + \min\bigl(\,|d - d_s|,\; d_{max}\,\bigr)
\]

where d_{max} represents the maximum distance of influence. When all pixels in the 3 × 3 neighborhood have approximately the same depths, which is the case for median-filtered homogeneous surfaces, all the depths to be tested are small variations of d_s, causing the penalty to give preference to the depths closer to d_s and therefore smoothing the surface. If the 3 × 3 neighborhood contains outliers, which could originate from fine structures, the tested depths will differ from d_s by a large amount. In case this amount is larger than d_{max}, the smoothness penalty saturates and has no further effect, avoiding the filtering of fine structures based on the smoothness constraint. The possibility remains that an erroneous tested depth approximates d_s by coincidence, influencing the error value. However, due to the constant change of the d_s value at each iteration, this erroneous influence has a high chance of being corrected in the following iterations. Figure 6 illustrates this smoothing effect, where the new error measure e_{new} is applied throughout the entire algorithm.

Figure 6. Left: Wireframe model without smoothness penalty. Right: With smoothness penalty.
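A possible CPU-side sketch of the smoothness penalty is given below. It assumes the saturated form described above: the depth map is smoothed with a small 3 × 3 binomial (Gaussian-like) kernel, and the penalty min(|d - d_s|, d_max) is added to the matching error of each tested depth; all function and variable names are illustrative.

```python
import numpy as np

def gaussian_3x3(depth):
    """3x3 binomial approximation of a Gaussian filter (replicated borders)."""
    p = np.pad(depth, 1, mode="edge")
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    tmp = k[0] * p[:, :-2] + k[1] * p[:, 1:-1] + k[2] * p[:, 2:]   # horizontal pass
    return k[0] * tmp[:-2] + k[1] * tmp[1:-1] + k[2] * tmp[2:]     # vertical pass

def penalised_error(e, d, d_smooth, d_max):
    """Add the saturated smoothness penalty to a matching error.

    e:        matching error of the tested depth d (equation (1)).
    d:        depth hypothesis being tested for this pixel.
    d_smooth: value of the smoothed depth map at this pixel.
    d_max:    maximum distance of influence; beyond it the penalty saturates,
              so genuine fine structures are not smoothed away.
    """
    return e + np.minimum(np.abs(d - d_smooth), d_max)

# Toy usage on a flat 5x5 depth map containing one outlier.
depth = np.full((5, 5), 3.0)
depth[2, 2] = 9.0
ds = gaussian_3x3(depth)
print(penalised_error(0.1, 3.05, ds[1, 1], d_max=0.5))
```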
4. Practical Implementation
The algorithm was implemented on an NVIDIA GeForce 6800 GT GPU. Because we do not make use of mipmapping for the images, the 50% overhead for generating the mipmaps automatically is eliminated. The GPU also provides us with the hardware facilities to directly convert the resulting depth map into a connected 3D mesh, making the resulting mesh available for further use on the GPU, such as visualizing the textured mesh from a virtual point of view. This avoids any unnecessary data transfer across the lower bandwidth AGP bus. Optionally, median and Gaussian filtering can be performed as a postprocessing step on the GPU to remove artifacts caused by variations in shutter speed, ill-defined colors at object edges, etc.
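The depth-map-to-mesh conversion mentioned above can be illustrated with a CPU-side sketch. This is not the GPU path used in the paper (which keeps the mesh in graphics memory); it merely shows the underlying geometry: every pixel is back-projected along its ray to the stored depth, and the pixel grid is triangulated into two triangles per 2 × 2 block. The helper name and calibration values are hypothetical.

```python
import numpy as np

def depth_map_to_mesh(depth, K):
    """Convert a depth map into a connected triangle mesh (CPU-side sketch).

    depth: (H, W) depth values (perpendicular distance to the image plane).
    K:     (3, 3) internal camera matrix of the reference camera.
    Returns (vertices, faces): (H*W, 3) 3D points and (M, 3) triangle indices.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project every pixel along its ray to the stored depth.
    rays = np.linalg.inv(K) @ np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
    vertices = (rays / rays[2]).T * depth.ravel()[:, None]
    # Two triangles per 2x2 pixel block, indexing the vertex grid row-major.
    idx = np.arange(H * W).reshape(H, W)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([a, b, c], 1), np.stack([b, d, c], 1)])
    return vertices, faces

# Toy usage: a flat 4x4 depth map at depth 2.
verts, faces = depth_map_to_mesh(np.full((4, 4), 2.0),
                                 np.array([[100.0, 0, 2], [0, 100.0, 2], [0, 0, 1.0]]))
print(verts.shape, faces.shape)  # (16, 3) (18, 3)
```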
5. Results
Figure 7 shows computer generated images of the Stanford dragon and a hand model, textured with a high-frequency, uniformly distributed random colored noise pattern. The images on the right illustrate the computed 3D mesh. In Figure 8, two images were selected from the well-known Tsukuba data set as input to the algorithm. The resulting depth map on the right illustrates the retrieval of fine structures such as the power cord of the desk lamp. The ghosting effect to the right of the objects originates from the occlusions generated by the objects themselves; these erroneous depths correspond to high e_{new} values. The left image in Figure 9 shows the depth map for the reference image, selected from a set of five pictures taken of a historic building. Since all camera parameters are known, the generated depth map can be converted into a 3D mesh. The middle and right images represent one sensor image and the image generated by projecting the 3D mesh onto its image plane. Table 1 presents performance results for this example. For each test, we used four hierarchical levels and 64 initialization planes. The instruction cycle count for the shader, testing one neighboring depth with five color images, is given at the bottom of the table.
Image size   Number of images   Smoothness penalty   Time (ms) for 1 / 2 / 3 iterations per level
256x256      2                  No                   4.5 / 7.3 / 10.3
256x256      2                  Yes                  5.4 / 9.5 / 13.3
256x256      5                  No                   9.0 / 14.8 / 20.6
256x256      5                  Yes                  10.0 / 16.7 / 23.4
720x576      2                  No                   21.7 / 38.9 / 56.7
720x576      2                  Yes                  28.9 / 55.1 / 82.8
720x576      5                  No                   46.4 / 85.0 / 122.3
720x576      5                  Yes                  53.1 / 97.9 / 141.7

Shader statistics (one neighboring depth, five color images):
Target: GeForce 6800 GT (NV40-GT) :: Unified Compiler: v71.80 :: Cycles: 20.75 :: R Regs Used: 5 :: R Regs Max Index (0 based): 4 :: Pixel throughput (assuming 1 cycle texture lookup): 280.00 MP/s

Table 1. Timing results.

Figure 9. Left: Depth map for the reference image. Middle: Sensor image. Right: Depth map, backprojected onto the sensor image plane.

6. Summary and Conclusions
We believe we have designed a fast algorithm which does not suffer from the artifacts still visible in previous implementations. Depending on the implementation, these artifacts include isolated erroneous points, holes, noisiness and incorrect depth values in the presence of high frequency textures and fine 3D structures. No constraints have been specified regarding the baseline, epipolar geometry, color channels or the maximum number of sensor images. The algorithm proposed in this paper is able to successfully retrieve the depths for fine 3D structures while efficiently rejecting isolated depths and smoothing homogeneous regions at the same time. Using programmable graphics hardware, we were able to implement this difficult combination of highly desirable properties and still meet the real-time processing constraint. Moreover, the current generation of GPUs allows us to take full advantage of this real-time processing power by making the resulting 3D mesh directly available for further use on the GPU. This algorithm will serve as the first step for future work on this topic, comprising the integration of the 3D meshes from all images into one connected 3D mesh and the addition of backmatching. The latter will perform a consistency check between the depth maps generated for each of the images.
Figure 7. Left and middle: Two computer generated images used as input. Right: Resulting flat-shaded models.

Figure 8. Left and middle: Two images from the Tsukuba data set. Right: Resulting depth map.

References
[1] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, Vol. 38, pp. 114-117, 1965.
[2] R. Yang, G. Welch, and G. Bishop, "Real-time consensus-based scene reconstruction using commodity graphics hardware," Proceedings of Pacific Graphics, 2002.
[3] R. Yang and M. Pollefeys, "Multi-resolution real-time stereo on commodity graphics hardware," Conference on Computer Vision and Pattern Recognition, 2003.
[4] C. Zach, A. Klaus, M. Hadwiger, and K. Karner, "Accurate dense stereo reconstruction using graphics hardware," EUROGRAPHICS 2003, Short Presentations, 2003.
[5] C. Zach, A. Klaus, B. Reitinger, and K. Karner, "Optimized stereo reconstruction using 3D graphics hardware," Workshop of Vision, Modelling, and Visualization (VMV 2003), pp. 119-126, 2003.
[6] C. Zach, K. Karner, and H. Bischof, "Hierarchical disparity estimation with programmable 3D hardware," WSCG Short Communication Papers Proceedings, 2004.