On the Recovery of Depth from a Single Defocused Image

Shaojie Zhuo, Terence Sim

School of Computing, National University of Singapore, Singapore 117417
Abstract. In this paper we address the challenging problem of recovering the depth of a scene from a single image using the defocus cue. To achieve this, we first present a novel approach to estimate the amount of spatially varying defocus blur at edge locations. We re-blur the input image and show that the gradient magnitude ratio between the input and re-blurred images depends only on the amount of defocus blur. Thus, the blur amount can be obtained from the ratio. A layered depth map is then extracted by propagating the blur amount at edge locations to the entire image. Experimental results on synthetic and real images demonstrate the effectiveness of our method in providing a reliable estimate of the depth of a scene.

Key words: Image processing, depth recovery, defocus blur, Gaussian gradient, Markov random field.
1 Introduction
Depth recovery plays an important role in computer vision and computer graphics, with applications such as robotics, 3D reconstruction and image refocusing. In principle, depth can be recovered either from monocular cues (shading, shape, texture, motion, etc.) or from binocular cues (stereo correspondences). Conventional methods for estimating the depth of a scene have relied on multiple images. Stereo vision [1, 2] measures disparities between a pair of images of the same scene taken from two different viewpoints and uses the disparities to recover the depth. Structure from motion (SFM) [3, 4] computes correspondences between images to obtain the 2D motion field, which is then used to recover the 3D motion and the depth. Depth from focus (DFF) [5, 6] captures a set of images using multiple focus settings and measures the sharpness of the image at each pixel location. The sharpest pixel is selected to form an all-in-focus image, and the depth of a pixel depends on which image it is selected from. Depth from defocus (DFD) [7, 8] requires a pair of images of the same scene with different focus settings. It estimates the degree of defocus blur, and the depth of the scene can be recovered provided the camera settings are known. These methods either suffer from the occlusion problem or cannot be applied to dynamic scenes.
Fig. 1. The depth recovery result of the book image. (a) The input defocused image. (b) Recovered layered depth map. Larger intensities indicate larger blur amounts and depths in all the depth maps presented in this paper.
Recently, approaches have been proposed to recover depth from a single image in very specific settings. Several methods [9, 10] use active illumination to aid depth recovery by projecting structured patterns onto the scene. The depth is measured by the attenuation of the projected light or the deformation of the projected pattern. The coded aperture method [11] changes the shape of the defocus blur kernel by inserting a customized mask into the camera lens, which makes the blur kernel more sensitive to depth variation. The depth is determined after a deconvolution process using a set of calibrated blur kernels. Saxena et al. [12] collect a training set of monocular images and their corresponding ground-truth depth maps and apply supervised learning to predict the value of the depth map as a function of the input image.

In this paper we focus on the more challenging problem of recovering the depth layers from a single defocused image captured by an uncalibrated conventional camera. The most closely related work, the inverse diffusion method [13], models the defocus blur as a diffusion process, uses the inhomogeneous reverse heat equation to obtain an estimate of the blur at edge locations, and then proposes a graph-cut based method for inferring the depth of the scene. In contrast, we model the defocus blur as a 2D Gaussian blur. The input image is re-blurred using a known Gaussian function, and the gradient magnitude ratio between the input and re-blurred images is calculated. The blur amount at edge locations can then be derived from this ratio. We also construct an MRF to propagate the blur estimates from the edge locations to the entire image and finally obtain a layered depth map of the scene.

Our work has three main contributions. Firstly, we propose an efficient blur estimation method based on the gradient magnitude ratio, and we show that our method is robust to noise, inaccurate edge location and interference from neighboring edges. Secondly, without any modification to the camera or use of additional illumination, our blur estimation method combined with MRF optimization can obtain the depth map of a scene using only a single defocused image captured by a conventional camera. As shown in Fig. 1, our method can extract a layered depth map of the scene with fairly good accuracy. Finally, we discuss two kinds of ambiguities in recovering depth from a single image using the defocus cue, one of which is usually overlooked by previous methods.
Fig. 2. (a) A thin lens model. (b) The diameter of the CoC c as a function of the object distance d and f-stop number N, given d_f = 500 mm and f_0 = 80 mm.
2 Defocus Model
As the amount of defocus blur is estimated at edge locations, we must model the edge first. We adopt the ideal step edge model, which is

$$ f(x) = A\,u(x) + B, \qquad (1) $$

where u(x) is the step function, and A and B are the amplitude and offset of the edge respectively. Note that the edge is located at x = 0.

When an object is placed at the focus distance d_f, all the rays from a point of the object will converge to a single sensor point and the image will appear sharp. Rays from a point of another object at distance d will reach multiple sensor points and result in a blurred image. The blurred pattern depends on the shape of the aperture and is often called the circle of confusion (CoC) [14]. The diameter of the CoC characterizes the amount of defocus and can be written as

$$ c = \frac{|d - d_f|}{d} \cdot \frac{f_0^2}{N (d_f - f_0)}, \qquad (2) $$
where f_0 and N are the focal length and the stop number of the camera respectively. Fig. 2 shows a thin lens model and how the diameter of the circle of confusion changes with d and N, given fixed f_0 and d_f. As we can see, the diameter of the CoC c is a non-linear, monotonically increasing function of the object distance d.

The defocus blur can be modeled as the convolution of a sharp image with the point spread function (PSF). The PSF can be approximated by a Gaussian function g(x, σ), where the standard deviation σ = kc is proportional to the diameter of the CoC c. We use σ as a measure of the depth of the scene. A blurred edge i(x) can be represented as follows:

$$ i(x) = f(x) \otimes g(x, \sigma). \qquad (3) $$
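To make these relations concrete, the following minimal Python sketch evaluates Eq. (2) and the blur scale σ = kc. It is only an illustration: the calibration constant k and the parameter values in the example call are assumptions for demonstration, not values calibrated in this paper.

```python
import numpy as np

def coc_diameter(d, d_f, f0, N):
    """Diameter of the circle of confusion c, Eq. (2).

    d   -- object distance (mm)
    d_f -- focus distance (mm)
    f0  -- focal length (mm)
    N   -- f-stop number
    """
    return (np.abs(d - d_f) / d) * f0 ** 2 / (N * (d_f - f0))

def blur_scale(d, d_f, f0, N, k=1.0):
    """Gaussian PSF scale sigma = k * c, where k is a camera-dependent
    calibration constant (an assumed value here)."""
    return k * coc_diameter(d, d_f, f0, N)

# Illustrative call using the settings quoted in the Fig. 2 caption.
d = np.linspace(2000.0, 3500.0, 4)          # object distances in mm
print(coc_diameter(d, d_f=500.0, f0=80.0, N=4))
```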
3 Blur Estimation
Fig. 3 shows an overview of our local blur estimation method. A step edge is re-blurred using a Gaussian function with a known standard deviation. Then the ratio between the gradient magnitudes of the step edge and its re-blurred version is calculated. The ratio is maximum at the edge location. Using this maximum value, we can compute the amount of defocus blur at the edge.

Fig. 3. Our blur estimation approach: here, ⊗ and ∇ are the convolution and gradient operators respectively. The black dashed line denotes the edge location.
For convenience, we describe our blur estimation algorithm for the 1D case first and then extend it to 2D images. The gradient of the re-blurred edge is

$$ \nabla i_1(x) = \nabla\big( i(x) \otimes g(x, \sigma_0) \big) = \nabla\big( (A u(x) + B) \otimes g(x, \sigma) \otimes g(x, \sigma_0) \big) = \frac{A}{\sqrt{2\pi(\sigma^2 + \sigma_0^2)}} \exp\!\left( -\frac{x^2}{2(\sigma^2 + \sigma_0^2)} \right), \qquad (4) $$
where σ_0 is the standard deviation of the re-blur Gaussian function. We call it the re-blur scale. The gradient magnitude ratio between the original and re-blurred edges is

$$ \frac{|\nabla i(x)|}{|\nabla i_1(x)|} = \sqrt{\frac{\sigma^2 + \sigma_0^2}{\sigma^2}} \exp\!\left( \frac{x^2}{2(\sigma^2 + \sigma_0^2)} - \frac{x^2}{2\sigma^2} \right). \qquad (5) $$
It can be proved that the ratio is maximum at the edge location (x = 0). The maximum value is given by

$$ R = \frac{|\nabla i(0)|}{|\nabla i_1(0)|} = \sqrt{\frac{\sigma^2 + \sigma_0^2}{\sigma^2}}. \qquad (6) $$
Examining (4) and (6), we notice that the edge gradient depends on both the edge amplitude A and the blur amount σ, while the maximum of the gradient magnitude ratio R eliminates the effect of the edge amplitude A and depends only on σ and σ_0. Thus, given the maximum value R, we can calculate the unknown blur amount σ as

$$ \sigma = \frac{\sigma_0}{\sqrt{R^2 - 1}}. \qquad (7) $$
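As a sanity check of Eqs. (4)–(7), the sketch below (Python with NumPy/SciPy; the edge amplitude, offset and blur values are arbitrary test values, not taken from the paper) synthesizes a blurred 1D step edge, re-blurs it with a known scale σ_0, and recovers σ from the maximum gradient magnitude ratio:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Synthesize a blurred step edge i(x) = (A*u(x) + B) convolved with g(x, sigma).
# A, B and sigma_true are arbitrary test values.
A, B, sigma_true, sigma0 = 2.0, 0.5, 1.8, 1.0
x = np.arange(-50, 51)
edge = A * (x >= 0).astype(float) + B
i = gaussian_filter1d(edge, sigma_true)

# Re-blur with the known scale sigma0 and take gradient magnitudes.
i1 = gaussian_filter1d(i, sigma0)
grad = np.abs(np.gradient(i))
grad1 = np.abs(np.gradient(i1))

# The ratio is maximal at the edge location (Eq. 6);
# invert it to obtain the blur estimate (Eq. 7).
R = (grad / (grad1 + 1e-12)).max()
sigma_est = sigma0 / np.sqrt(R ** 2 - 1.0)
print(sigma_est)   # close to sigma_true
```

In this discrete setting the recovered value is close to, but not exactly, the true σ because of the finite-difference gradient and the truncated Gaussian kernels.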
For blur estimation in 2D images, we use a 2D isotropic Gaussian function to perform the re-blur. As any direction of a 2D isotropic Gaussian function is a 1D Gaussian, the blur estimation is similar to that in the 1D case. In a 2D image, the gradient magnitude can be computed as follows:

$$ \|\nabla i(x, y)\| = \sqrt{\nabla i_x^2 + \nabla i_y^2}, \qquad (8) $$

where ∇i_x and ∇i_y are the gradients along the x and y directions respectively.
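One possible 2D reading of this step, offered as a sketch under stated assumptions rather than the authors' implementation (SciPy and scikit-image, with illustrative parameter values), computes the gradient magnitude ratio between the input and re-blurred images and keeps the blur estimate of Eq. (7) only at detected edge pixels:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import canny

def sparse_blur_map(image_gray, sigma0=1.0, canny_sigma=2.0):
    """Blur estimates at edge locations (zero elsewhere)."""
    i = image_gray.astype(float)
    i1 = gaussian_filter(i, sigma0)              # isotropic re-blur

    # 2D gradient magnitudes (Eq. 8) of the input and re-blurred images.
    gy, gx = np.gradient(i)
    g1y, g1x = np.gradient(i1)
    mag = np.hypot(gx, gy)
    mag1 = np.hypot(g1x, g1y)

    ratio = mag / (mag1 + 1e-12)
    edges = canny(i, sigma=canny_sigma)          # edge mask

    sigma = np.zeros_like(i)
    valid = edges & (ratio > 1.0)                # Eq. (7) needs R > 1
    sigma[valid] = sigma0 / np.sqrt(ratio[valid] ** 2 - 1.0)
    return sigma

# Usage (hypothetical file name):
# from skimage import io, color
# img = color.rgb2gray(io.imread("defocused.jpg"))
# blur_map = sparse_blur_map(img, sigma0=1.0)
```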
Fig. 4. Performance of our blur estimation method. (a) The synthetic image with blurred edges. (b) Estimation errors under Gaussian noise. (c) Estimation errors with edge distances of 30, 15 and 10 pixels. (d) Estimation errors with edge shifts of 0, 1 and 2 pixels. The x and y axes are the blur amount and the corresponding estimation error.
4 Layered Depth Map Extraction
After we obtain the depth estimates at edge locations, we need to propagate them from the edge locations to the regions that do not contain edges. We seek a regularized depth labeling σ̂ which is smooth and close to the estimates in Eq. (7). We also prefer the depth discontinuities to be aligned with the image edges. Thus, we formulate this as an energy minimization over a discrete Markov random field (MRF) whose energy is given by

$$ E(\hat{\sigma}) = \sum_i V_i(\hat{\sigma}_i) + \lambda \sum_i \sum_{j \in N(i)} V_{ij}(\hat{\sigma}_i, \hat{\sigma}_j), \qquad (9) $$
where each pixel in the image is a node of the MRF, and λ balances the single-node potential V_i(σ̂_i) and the pairwise potential V_ij(σ̂_i, σ̂_j), which are defined as

$$ V_i(\hat{\sigma}_i) = M(i)\,(\sigma_i - \hat{\sigma}_i)^2, \qquad (10) $$

$$ V_{ij}(\hat{\sigma}_i, \hat{\sigma}_j) = \sum_{j \in N(i)} w_{ij}\,(\hat{\sigma}_i - \hat{\sigma}_j)^2, \qquad (11) $$
where M(·) is a binary mask that is non-zero only at edge locations, and the weight w_ij = exp{-(I(i) - I(j))^2} encodes the difference between the neighboring colors I(i) and I(j). An 8-neighborhood system N(i) is adopted in our definition. We use FastPD [15] to minimize the MRF energy defined in Eq. (9). FastPD guarantees an approximately optimal solution and is much faster than previous MRF optimization methods such as conventional graph cut techniques.
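The sketch below illustrates how the terms of Eqs. (9)–(11) fit together in Python. For the optimizer we substitute a simple iterated conditional modes (ICM) sweep purely for illustration; the paper uses FastPD [15], whose API is not reproduced here. All parameter values are assumptions, intensities are assumed normalized to [0, 1], and the naive per-pixel loop is slow and meant only to show where the edge mask, the weights w_ij and the 8-neighborhood enter the energy.

```python
import numpy as np

def propagate_blur(sigma_sparse, edge_mask, image_gray, labels, lam=1.0, iters=10):
    """Illustrative propagation of sparse blur estimates via the MRF energy.

    sigma_sparse : blur estimates, valid only where edge_mask is True
    labels       : 1D array of candidate blur values (discrete depth labels)
    An ICM sweep stands in for the FastPD optimizer used in the paper.
    """
    h, w = image_gray.shape
    # Start every pixel at the label closest to its (sparse) estimate.
    lab = np.abs(sigma_sparse[..., None] - labels[None, None, :]).argmin(-1)

    # 8-neighborhood offsets N(i).
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]

    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                costs = np.zeros(len(labels))
                if edge_mask[y, x]:                      # unary term V_i (Eq. 10)
                    costs += (sigma_sparse[y, x] - labels) ** 2
                for dy, dx in nbrs:                      # pairwise term V_ij (Eq. 11)
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        w_ij = np.exp(-(image_gray[y, x] - image_gray[ny, nx]) ** 2)
                        costs += lam * w_ij * (labels - labels[lab[ny, nx]]) ** 2
                lab[y, x] = costs.argmin()
    return labels[lab]
```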
5 Experiments
There are two parameters in our method: the re-blur scale σ_0 and the MRF weight λ. We set σ_0 = 1 and λ = 1, which gives good results in all our examples. We use the Canny edge detector [16] and tune its parameters to obtain the desired edge detection output. The depth maps are simply the estimated σ values at each pixel.

We first test the performance of our method on the synthetic bar image shown in Fig. 4(a). The blur amount of the edge increases linearly from 0 to 5. We first add noise to the bar image.
Fig. 5. The depth recovery results of the flower and building images. (a) The input defocused images. (b) The sparse blur maps. (c) The final layered depth maps.
Under noisy conditions, although the estimates of edges with larger blur amounts are more affected by noise, our method can still achieve reliable estimation results (see Fig. 4(b)). We then create more bar images with different edge distances. Fig. 4(c) shows that interference from neighboring edges increases the estimation errors when the blur amount is large (> 3), but the errors are kept at a relatively low level. Furthermore, we shift the detected edges to simulate inaccurate edge locations and test our method. The result is shown in Fig. 4(d). When the edge is sharp, shifting the edge locations causes quite large estimation errors. However, in practice, sharp edges can usually be located very accurately, which greatly reduces the estimation error.

As shown in Fig. 5, we also test our method on real images. In the flower image, the depth of the scene changes continuously from the bottom to the top of the image. The sparse blur map gives a reasonable measure of the blur amount at edge locations, and the depth map reflects the continuous change of the depth. In the building image, there are mainly three depth layers in the scene: the wall in the nearest layer, the buildings in the middle layer, and the sky in the farthest layer. Our method extracts these three layers quite accurately and produces the depth map shown in Fig. 5(c). Both results are obtained using 10 depth labels with blur amounts from 0 to 3. One more example is the book image shown in Fig. 1, whose result is obtained using 6 depth labels with blur amounts from 0 to 3. As we can see from the recovered depth map, our method is able to obtain a good estimate of the depth of the scene from a single image.
Fig. 6. Comparison of our method and the inverse diffusion method. (a) The input image. (b) The result of the inverse diffusion method. (c) Our result. The image is from [13].

Fig. 7. The depth recovery result of the photo frame image. (a) The input defocused image. (b) Recovered layered depth map.
In Fig. 6, we compare our method with the inverse diffusion method [13]. Both methods generate reasonable layered depth maps. However, our method has higher accuracy in local estimation, and thus our depth map captures more details of the depth. As shown in the figure, the difference in depth between the left and right arms can be perceived in our result, whereas the inverse diffusion method does not recover this depth difference.
6 Ambiguities in Depth Recovery
There are two kinds of ambiguities in depth recovery from a single image using the defocus cue. The first one is the focal plane ambiguity. When an object appears blurred in the image, it can be on either side of the focal plane. To remove this ambiguity, most depth from defocus methods, including ours, assume that all objects of interest are located on one side of the focal plane. When taking images, we simply place the focus point on the nearest or farthest point in the scene. The second ambiguity is the blur/sharp edge ambiguity. The defocus measure we obtain may be due to a sharp edge that is out of focus or a blurred edge that is in focus. This ambiguity is often overlooked by previous work and may cause artifacts in our results. One example is shown in Fig. 7. The region indicated by the white rectangle is actually blurred texture of the photo in the frame, but our method treats it as sharp edges subject to defocus blur, which results in erroneous depth estimates in that region.
7 Conclusion
In this paper, we show that the depth of a scene can be recovered from a single defocused image. A new method is presented to estimate the blur amount at edge locations based on the gradient magnitude ratio. The layered depth map is then extracted using MRF optimization. We show that our method is robust to noise, inaccurate edge location and interference from neighboring edges, and can generate more accurate scene depth maps compared with existing methods. We also discuss the ambiguities arising in recovering depth from a single image using the defocus cue. In the future, we would like to apply our blur estimation method to images with motion blur to estimate the blur kernels.

Acknowledgement. The authors would like to thank the anonymous reviewers for their helpful suggestions. The work is supported by NUS Research Grant #R-252-000-383-112.
References

1. Barnard, S., Fischler, M.: Computational stereo. ACM Comput. Surv. 14(4) (1982) 553–572
2. Dhond, U., Aggarwal, J.: Structure from stereo: A review. IEEE Trans. Syst. Man Cybern. 19(6) (1989) 1489–1510
3. Dellaert, F., Seitz, S.M., Thorpe, C.E., Thrun, S.: Structure from motion without correspondence. In: Proc. CVPR. (2000) 557–564
4. Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: A factorization method. Int. J. Comput. Vision 9 (1992) 137–154
5. Asada, N., Fujiwara, H., Matsuyama, T.: Edge and depth from focus. Int. J. Comput. Vision 26(2) (1998) 153–163
6. Nayar, S., Nakagawa, Y.: Shape from focus. IEEE Trans. Pattern Anal. Mach. Intell. 16(8) (1994) 824–831
7. Favaro, P., Soatto, S.: A geometric approach to shape from defocus. IEEE Trans. Pattern Anal. Mach. Intell. 27(3) (2005) 406–417
8. Pentland, A.P.: A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. 9(4) (1987) 523–531
9. Moreno-Noguer, F., Belhumeur, P.N., Nayar, S.K.: Active refocusing of images and videos. ACM Trans. Graphics (2007) 67
10. Nayar, S.K., Watanabe, M., Noguchi, M.: Real-time focus range sensor. IEEE Trans. Pattern Anal. Mach. Intell. 18(12) (1996) 1186–1198
11. Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional camera with a coded aperture. ACM Trans. Graphics (2007)
12. Saxena, A., Sun, M., Ng, A.: Make3D: Learning 3D scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. (2008)
13. Namboodiri, V.P., Chaudhuri, S.: Recovery of relative depth from a single observation using an uncalibrated (real-aperture) camera. In: Proc. CVPR. (2008)
14. Hecht, E.: Optics (4th Edition). Addison Wesley (August 2001)
15. Komodakis, N., Tziritas, G., Paragios, N.: Performance vs computational efficiency for optimizing single and dynamic MRFs: Setting the state of the art with primal-dual strategies. Comput. Vis. Image Underst. 112(1) (2008) 14–29
16. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6) (1986) 679–698