Disparity Map Refinement and 3D Surface Smoothing via Directed Anisotropic Diffusion

Atsuhiko Banno
Institute of Industrial Science, the University of Tokyo

Katsushi Ikeuchi
Interfaculty Initiative in Information Studies, the University of Tokyo

[email protected]

[email protected]

Abstract

We propose a new binocular stereo algorithm and a 3D reconstruction method from multiple disparity images. First, we present an accurate binocular stereo algorithm. In our algorithm, we use neither color segmentation nor plane fitting, techniques that are common among many algorithms listed in the Middlebury ranking. These methods assume that the 3D world consists of a collection of planes and that each segment of a disparity map obeys a plane equation. We exclude these assumptions and introduce a Directed Anisotropic Diffusion technique for refining a disparity map. Second, we show a method to fill holes in a distance map and smooth the reconstructed 3D surfaces by using another type of anisotropic diffusion. Evaluation on the Middlebury datasets shows that our stereo algorithm is competitive with algorithms that adopt plane fitting. We also present an experiment that demonstrates the high accuracy of a 3D model reconstructed with our method, as well as its effectiveness and practicality in a real environment.

1. Introduction

Binocular stereo is one of the most significant and active areas in the field of computer vision. Recently, the number of publications on binocular stereo has been increasing, due in part to the Middlebury Stereo Vision Page¹. The Middlebury page provides common benchmark datasets and evaluation systems that all researchers can use to examine their proposed methods objectively and uniformly. From the ranking given by the Middlebury page, we can identify common techniques adopted in many sophisticated algorithms. Many algorithms use a pixel-based method, which searches for correspondences between two images by single-pixel matching, such as Graph Cut [3, 11] and Belief Propagation [15, 16]. In addition, many algorithms include color segmentation of the input images and impose the following assumptions: all 3D points in each segment lie on the same plane in the 3D world (i.e., 1/d = au + bv + c), or the disparities in each segment obey the same plane equation (i.e., d = au + bv + c). To satisfy these assumptions, the algorithms apply plane fitting to segmented disparity images. In this article, we call these assumptions the segment constraint. In the Middlebury ranking of the 1-pixel threshold table, eight algorithms among the top ten adopt the segment constraint. This constraint is considered especially effective for the estimation of homogeneous and occluded regions. However, we ask: is the segment constraint really valid everywhere in every scene? For a small segment in an image, the corresponding small patch might be a proper approximation of a 3D object. For structured objects such as buildings and artifacts, the segment constraint would be reasonable, since almost all of their surfaces consist of planes. In the real 3D world, however, there are many curved surfaces with homogeneous color that obey neither 1/d = au + bv + c nor d = au + bv + c. If we impose the segment constraint on these surfaces, it could produce incorrect surface models. Moreover, a wrong plane might be estimated due to noise and errors. In this article, we exclude the segment constraint and propose a new stereo algorithm comparable to other algorithms that use the constraint. Our disparity generation algorithm is based on Belief Propagation [5] with sub-pixel order estimation, and we adopt a cross-check, exchanging the two input images, to identify the confidence of each pixel. A disparity map is then smoothed by using color information and the confidence labels. We utilize an anisotropic diffusion technique [12] for disparity map refinement instead of plane fitting. With anisotropic diffusion, the disparity map is smoothed without crossing any edges.

¹ http://vision.middlebury.edu/stereo/
Moreover, our implementation enables the anisotropic smoothing without violating the confidence labels. The evaluation of the reconstructed 3D model is also significant. Many researchers of binocular stereo algorithms evaluate only the disparity map, without reconstructing a 3D model. Especially for short-baseline stereo, little attention has been given to the accuracy of the reconstructed 3D model. In this article, we reconstruct a 3D model from several disparity maps and evaluate its accuracy. We propose a method to smooth the reconstructed surfaces without using a plane or other primitive types of model fitting in 3D reconstruction. Anisotropic diffusion is also utilized to refine the reconstructed surfaces. This paper is organized as follows. In the next section, we review some current trends in binocular stereo, focusing mainly on algorithms that are highly rated in the Middlebury ranking. In Section 3, we explain our proposed binocular stereo algorithm. In Section 4, we present our smoothing method for the reconstruction of 3D surfaces. In Section 5, experimental results on the Middlebury datasets show that our method is competitive with other methods. In addition, we show the high accuracy of a 3D model reconstructed by our method. The results demonstrate its effectiveness and robustness in a real environment. In Section 6, we summarize our method.

2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. 978-1-4244-4441-0/09/$25.00 ©2009 IEEE

2. Current Trends in Stereo Algorithms

Since the advent of the Middlebury page, a huge number of binocular stereo methods, especially for estimating disparity maps, have been proposed. Many researchers prefer pixel-based methods to region-based methods that utilize window matching techniques [13, 14, 25]. In pixel-based methods, a global energy function is defined by pixel matching and a smoothness constraint; the optimal disparity map is estimated by minimizing it. To solve the optimization problem, many algorithms adopt Graph Cut, Belief Propagation, Dynamic Programming [7, 8], or the Interior Point Method [1]. According to the Middlebury ranking, algorithms based on Belief Propagation have a numerical superiority. The ranking also shows that many stereo algorithms adopt a segment-based method [20, 10, 23, 2, 17, 16]. Segment-based methods are widely accepted for the effectiveness of their disparity map refinement. Almost all such stereo algorithms adopt the Mean Shift method [4] as their color segmentation strategy. Taguchi et al. [17] iteratively update the segmentation at each optimization step. Zitnick et al. [26] adopt an enforced over-segmentation that divides homogeneous regions into small patches. In order to verify the effect of the constraint, Sun et al. [16] compared two results generated by the same algorithm: one with the segment constraint and one without it. They reported that the results with the segment constraint were better than those without it. As mentioned above, only two stereo algorithms [22, 24] do not adopt the segment constraint.

Several methods adopt sub-pixel order disparity estimation other than plane fitting, which yields continuous disparity values. Belief Propagation and Graph Cut are intrinsically methods for labeling problems, which assign discrete disparity values to pixels. Almost all algorithms with sub-pixel order estimation obtain continuous disparities by parabola fitting of discrete disparities [24, 23, 21]. Gehrig et al. [6] further improved these sub-pixel estimations within an energy minimization framework. In the Middlebury ranking of the 0.5-pixel threshold table, these methods with sub-pixel order estimation are ranked high.
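As a concrete illustration of the parabola-fitting idea mentioned above, the following is a minimal sketch (our own illustration, not code from any cited paper): given matching costs at the integer disparities d−1, d, and d+1, fit a quadratic through them and take its vertex as the sub-pixel disparity.

```python
def subpixel_disparity(costs, d):
    """Refine an integer disparity d to sub-pixel precision by fitting
    a parabola through the matching costs at d-1, d, and d+1."""
    c_m, c0, c_p = costs[d - 1], costs[d], costs[d + 1]
    denom = c_m - 2.0 * c0 + c_p
    if denom <= 0:          # degenerate: no convex parabola, keep the integer d
        return float(d)
    offset = (c_m - c_p) / (2.0 * denom)   # vertex of the fitted parabola
    return d + offset       # offset lies in (-1, 1) around the integer optimum

# Example: cost curve with a minimum slightly offset from d = 5
costs = [9.0, 7.0, 5.0, 3.2, 1.5, 1.0, 1.4, 3.0, 6.0]
print(subpixel_disparity(costs, 5))
```

The vertex formula follows directly from interpolating the three cost samples with a quadratic polynomial.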

3. Binocular Stereo Algorithm

3.1. Estimation of Disparity Map

In pixel-based methods, we define an energy value at each pixel; the summation of all values is treated as a global energy function. The disparity map D that minimizes the global function is the optimal solution. Typically, the cost function takes the following form:

E(D) = E_data(D) + E_smooth(D).    (1)
The first term is the data term, which evaluates the pixel matching under the disparity configuration D. The second term is the smoothness term, which takes a low value when the disparity value at a pixel is similar to those of its neighbors.

Data Term

The data term indicates the consistency between the two input images and the disparities between them. This term takes a low value if the consistency is good. One image is referred to as the reference image I and the other as the supporting image I′. When the disparity value at a pixel p in image I is d_p, many stereo algorithms adopt criteria based on the color difference between the corresponding pixels (e.g., |I(p) − I′(p + d_p)| or (I(p) − I′(p + d_p))²). These criteria are effective for benchmark images such as the Middlebury datasets, since those are color-corrected. However, criteria based on color difference are unfavorable for pictures captured in a real environment, or for stereo with multiple cameras, because the same point on a real object is sometimes captured with different color values due to the color response or exposure of the cameras. In view of these cases, we adopt mutual information (MI) [19, 9, 7] for the data term. Similarly to the previous methods [9, 7], our MI calculation is based on the joint probability distribution Pr(I, I′) at each correspondence:

E_data = Σ_p D(p),    (2)

D(p) = −(1/|P|) log( Pr(I(p), I′(p + d_p)) ⊗ g ) ⊗ g.    (3)
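A rough sketch of eq. (3) follows; this is our own illustration under our reading of the text, not the authors' code. It builds the joint histogram of corresponding pixel intensities, applies the two Gaussian convolutions (the ⊗ g terms), and reads off the scaled negative log-probability per correspondence. Grayscale images, horizontal disparities, and the `bins`/`sigma` parameters are all assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mi_data_cost(ref, sup, disp, bins=64, sigma=1.0):
    """Per-pixel MI-style data cost, eq. (3):
    D(p) = -(1/|P|) * [log(Pr(I, I') (x) g) (x) g], looked up at (I(p), I'(p+d_p))."""
    h, w = ref.shape
    xs = np.arange(w)
    # corresponding columns in the supporting image, clipped to the image width
    corr = np.clip(xs[None, :] + disp.astype(int), 0, w - 1)
    a = (ref * (bins - 1) / 255.0).astype(int)                  # quantized reference
    b = (np.take_along_axis(sup, corr, axis=1) * (bins - 1) / 255.0).astype(int)
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[-0.5, bins - 0.5], [-0.5, bins - 0.5]])
    pr = joint / joint.sum()                                     # Pr(I, I')
    smoothed = gaussian_filter(pr, sigma)                        # Pr (x) g
    log_pr = np.log(np.maximum(smoothed, 1e-12))
    cost_table = -gaussian_filter(log_pr, sigma) / a.size        # second (x) g, scaled by 1/|P|
    return cost_table[a, b]                                      # D(p) for every pixel
```

In practice the cost table would be rebuilt for each candidate disparity configuration, as in the iterative schemes of [9, 7].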
In order to calculate the mutual information D(p), we apply Gaussian convolution twice, similarly to [9, 7].

Smoothness Term

It is well known that minimizing the data term alone cannot yield the optimal solution, because of ambiguity in homogeneous regions and wrong matches due to noise. Therefore, the cost function in almost all stereo algorithms includes a smoothness constraint. This constraint encourages adjacent pixels to have similar disparity values. The smoothness term is the summation of a penalty defined at each connection between adjacent pixels. We define the smoothness term as follows:

E_smooth = Σ_p V(p),    (4)

V(p) = Σ_{q∈N(p)} V_pq,

V_pq = { 0,   if d_p = d_q;
         a_1, if |d_p − d_q| = 1;
         a_2, if |d_p − d_q| > 1 and |I_p − I_q| > δ;
         a_3, otherwise,    (5)

where a_3 > a_2 > a_1 > 0. In our implementation, no penalty is imposed when adjacent pixels have the same disparity. If the disparity difference between adjacent pixels is 1 pixel, the pair is assessed a penalty a_1. If the difference is more than 1 pixel, the connection is assessed a penalty a_3, heavier than a_1. However, in the case of a disparity difference of over 1 pixel, the penalty is decreased to a_2 when the color difference between the pixels is larger than a threshold δ. These values are chosen because a depth discontinuity and a color discontinuity often occur simultaneously.

Minimization of Cost Function

Minimizing the cost function (1) yields the optimal disparity map D. As the method for the minimization of eq. (1), we utilize Efficient Belief Propagation [5]. For disparity estimation by Belief Propagation, the optimal disparity value d*_p at pixel p is given as the label with the minimum belief vector value b_p(d), i.e., d*_p = arg min_d b_p(d). The optimal disparity d* is a discrete integer value. In our implementation, we estimate the disparity value to sub-pixel order by quadratic polynomial interpolation. To estimate the continuous disparity value, we utilize the belief vector values adjacent to the optimal integer disparity, i.e., at d* and d* ± 1.

Confidence Label

Both images of the input stereo pair are utilized for the calculation of the data term. On the other hand, only one image is utilized for the calculation of the smoothness term. Such a one-way algorithm cannot handle occluded areas, which are visible in one image and invisible in the other. Generally, the accuracy of disparities in occluded areas is lower than in non-occluded areas; the treatment of occluded areas is therefore important. In order to identify the occluded areas, several algorithms adopt cross checking, which confirms mutual consistency by exchanging the reference image and the supporting one. Assuming that a pixel p at position x in the reference image has a disparity value d_r(x), the position of the corresponding pixel in the supporting image should be x + d_r(x), and the disparity value d_s(x + d_r(x)) should be −d_r(x). When a pixel p in the reference image satisfies the condition |d_r(x) + d_s(x + d_r(x))| ≤ 1, the pixel p is labeled H (High confidence); otherwise, it is labeled L (Low confidence). In the following step, the disparity map is refined by using these labels.

3.2. Refinement of Disparity Map

The disparity map resulting from the steps described above is not yet optimal, because it still contains noise and errors. Many algorithms apply the segment constraint at this stage to reduce errors and refine the disparities. Apart from the segment constraint, one of the simplest methods for error and noise reduction is a Gaussian filter. A Gaussian filter, however, does not preserve edges; in the case of a disparity map, it feathers the depth edges. In order to preserve edges, we adopt anisotropic diffusion [12] to smooth the disparity map. Anisotropic diffusion is an image-smoothing technique that smoothes images without crossing any edges. The smoothing mechanism obeys the following diffusion equation:

∂D(x)/∂t = div( c_D(x) · ∇D(x) ).    (6)

If the diffusion coefficient c_D(x) were constant everywhere in the image region, this would yield isotropic diffusion, like a Gaussian filter. In the above equation, the diffusion coefficient depends on the position x, and a suitable choice of it enables proper smoothing without blurring edges. In our refinement strategy, we make the following two assumptions:

• If an adjacent pixel has a similar color, its disparity value is also similar.
• Disparity information is transmitted from a pixel with high confidence to an adjacent pixel with low confidence; it is not transmitted in the reverse direction.

The second assumption prohibits any influence of an L pixel on an H one.
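The H/L labels used by the second assumption come from the cross-check described in Section 3.1. A minimal sketch of that labeling (our own illustrative reimplementation, assuming horizontal disparities and numpy arrays):

```python
import numpy as np

def confidence_labels(d_ref, d_sup, tol=1.0):
    """Cross-check: pixel x is High-confidence iff |d_r(x) + d_s(x + d_r(x))| <= tol.
    d_ref, d_sup : (H, W) disparity maps of the reference / supporting image.
    Returns a boolean map, True = H (high confidence), False = L."""
    h, w = d_ref.shape
    xs = np.arange(w)[None, :]
    # position of the corresponding pixel in the supporting image
    corr = np.clip(np.rint(xs + d_ref).astype(int), 0, w - 1)
    d_back = np.take_along_axis(d_sup, corr, axis=1)   # d_s(x + d_r(x))
    return np.abs(d_ref + d_back) <= tol

# Mutually consistent maps (d_s = -d_r at corresponding pixels) are labeled H
d_r = np.full((2, 5), 1.0)
d_s = np.full((2, 5), -1.0)
print(confidence_labels(d_r, d_s).all())
```

The resulting boolean map is what the directed diffusion coefficients below consume.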

Figure 1. Anisotropic diffusion by color information and confidence label between adjacent pixels

According to these assumptions, we set the diffusion coefficient based on the color information and the confidence labels (Fig. 1). The diffusion coefficient c^D_pq from pixel p to q is defined as follows:

c^D_pq = w_pq · exp( −((I_p − I_q)/K)² ),    (7)

w_pq = { 1,   if p ∈ H;
         0.5, if p, q ∈ L;
         0,   if p ∈ L, q ∈ H.    (8)

The scalar value w_pq controls the diffusion direction based on the confidence labels. The exponential factor takes a low value when crossing color edges. By using these directed diffusion coefficients and a sufficiently small time step in eq. (6), we can obtain a refined disparity map without blurring any depth edges and without disturbing the high-confidence pixels.
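Eqs. (6)-(8) can be combined into an explicit update loop. The sketch below is our own illustration, not the authors' implementation; the 4-neighborhood, the explicit Euler step `dt`, and the parameters `K` and `iters` are assumptions.

```python
import numpy as np

def directed_diffusion(disp, gray, high_conf, K=10.0, dt=0.2, iters=50):
    """Directed anisotropic diffusion of a disparity map (eqs. 6-8).
    disp      : (H, W) disparity map to refine
    gray      : (H, W) intensity image used for the color term
    high_conf : (H, W) bool, True = H label from the cross-check"""
    d = disp.astype(float).copy()
    for _ in range(iters):
        update = np.zeros_like(d)
        # 4-connected neighbors p of each pixel q, via array shifts
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            p_d = np.roll(d, (dy, dx), axis=(0, 1))          # neighbor disparity
            p_i = np.roll(gray, (dy, dx), axis=(0, 1))       # neighbor intensity
            p_h = np.roll(high_conf, (dy, dx), axis=(0, 1))  # neighbor H/L label
            # eq. (8): w = 1 if source p is H; 0 if p is L and receiver q is H; 0.5 if both L
            w = np.where(p_h, 1.0, np.where(high_conf, 0.0, 0.5))
            c = w * np.exp(-((p_i - gray) / K) ** 2)         # eq. (7)
            update += c * (p_d - d)                           # discrete div(c * grad D)
        d += dt * update                                      # explicit Euler step, eq. (6)
    return d
```

Because w_pq = 0 whenever the source is L and the receiver is H, low-confidence disparities never contaminate high-confidence ones, while L pixels are gradually pulled toward their reliable neighbors.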

4. Three-Dimensional Reconstruction

In our strategy, we reconstruct 3D models using at least three images. Among them, one image is referred to as the reference image; the others are referred to as supporting images. Hence, there are N input stereo images and N − 1 binocular stereo pairs. The first step is range image generation from the disparity maps. The range image is constructed from 3D points that are estimated as intersection points of the viewing rays of the stereo images. However, it is not guaranteed that the reconstructed 3D model consists of smooth surfaces, even if it is reconstructed from the smoothed disparity maps. In addition, there may be several holes in the range image. Therefore, we refine the 3D model by applying surface smoothing via anisotropic diffusion together with hole-filling techniques.

4.1. Range Image Generation

We assume that the extrinsic camera parameters and the intrinsic ones are known, as is standard in stereo. By obtaining disparity maps, we also obtain several viewing vectors from each camera center to a 3D surface point. Ideally, the 3D surface point is the intersection point of these vectors; in reality, the vectors do not intersect at one point, due to noise and error. Therefore, the 3D point is estimated as the nearest point to these rays. If the re-projection error of the estimated 3D point on the reference image is more than 1 pixel, the 3D point is deleted as an outlier. Then the range image with respect to the reference image, R, is generated from these 3D points; the range image and the reference image have the same size. Each pixel in the range image stores the distance, as a real number, from the camera center of the reference image to the corresponding point on the 3D surface. In addition, all pixels in the range image are classified into two categories: a pixel that stores a distance is labeled O (Occupied), and a pixel that does not is labeled E (Empty).
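Estimating the point nearest to a bundle of rays is a small linear least-squares problem. A sketch (our own illustration), where each ray is given by an origin c_i (a camera center) and a direction v_i:

```python
import numpy as np

def nearest_point_to_rays(origins, dirs):
    """Least-squares point closest to a set of 3D rays.
    Minimizes the sum of squared distances sum_i ||(I - v_i v_i^T)(x - c_i)||^2.
    origins : (N, 3) ray origins (camera centers)
    dirs    : (N, 3) ray directions (need not be unit length)"""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, v in zip(origins, dirs):
        v = v / np.linalg.norm(v)
        P = np.eye(3) - np.outer(v, v)   # projector onto the plane normal to v
        A += P
        b += P @ c
    return np.linalg.solve(A, b)

# Two rays that exactly intersect at (1, 1, 0)
origins = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
dirs    = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(nearest_point_to_rays(origins, dirs))
```

With noisy rays the solve returns the point minimizing the summed squared perpendicular distances, which matches the "nearest point from these rays" criterion above.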

4.2. Smoothing and Hole-Filling of a 3D Model

The surface smoothing procedure is applied to the pixels with label O. We adopt a smoothing method for the range image based on anisotropic diffusion [12]. Tasdizen et al. [18] utilized anisotropic diffusion with curvature. In contrast, we define the diffusion coefficient c^R based on the unit normal vectors n of adjacent pixels as follows:

∂R(x)/∂t = div( c^R · ∇R(x) ),    (9)

where c^R_pq = max(n_p · n_q, 0)².    (10)

Here, both pixels p and q have the label O. By using the inner product of unit normal vectors, the surface of the reconstructed model is smoothed while 3D edges are preserved. Hole filling is applied to a pixel that has the label E and at least one adjacent pixel with label O. In our implementation, such a pixel q is assigned the average distance of the adjacent pixels that have similar colors (within a threshold δ_R) and the label O:

R_q = (1/|P|) Σ_{p∈O, |I_p−I_q|<δ_R} R_p.    (11)
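The hole-filling rule of eq. (11) can be sketched as a single pass over the range image. This is our own illustration; the 4-neighborhood and the grayscale color check are assumptions.

```python
import numpy as np

def fill_holes(R, occupied, gray, delta=10.0):
    """One pass of eq. (11): an Empty pixel adjacent to at least one Occupied
    pixel receives the mean distance of its similar-colored Occupied neighbors.
    R        : (H, W) range image (distances)
    occupied : (H, W) bool, True = O label, False = E label
    gray     : (H, W) intensity image used for the color similarity test"""
    h, w = R.shape
    out, occ_out = R.copy(), occupied.copy()
    for y in range(h):
        for x in range(w):
            if occupied[y, x]:
                continue                      # only E pixels are filled
            vals = [R[ny, nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < h and 0 <= nx < w
                    and occupied[ny, nx]                       # neighbor has label O
                    and abs(gray[ny, nx] - gray[y, x]) < delta]  # similar color
            if vals:
                out[y, x] = sum(vals) / len(vals)   # average over the |P| neighbors
                occ_out[y, x] = True                # the pixel becomes Occupied
    return out, occ_out
```

Repeating the pass until no pixel changes grows the filled region inward from the hole boundary.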
