IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 3, MARCH 2015
Nonlocal Random Walks Algorithm for Semi-Automatic 2D-to-3D Image Conversion

Hongxing Yuan, Shaoqun Wu, Peihong Cheng, Peng An, and Shudi Bao
Abstract—We propose a nonlocal random walks (NRW) algorithm to generate accurate depth from 2D images based on user interaction. First, a graphical model is proposed in which edges correspond to links between local and nonlocal neighboring pixels. Local edges are weighted by a pixel dissimilarity measure, and spatial distances are incorporated into the calculation of nonlocal weights. Second, user-defined values are mapped to probabilities that the marked pixels have the maximum depth value, and the probabilities of unmarked pixels are obtained by the NRW algorithm. Finally, the dense depth-map is recovered from the resulting probabilities. Since the nonlocal principle is effective in preserving fine structures in images, we can recover sharp depth boundaries. Experiments on three images containing color-bleeding areas demonstrate that our method achieves much higher-quality results than existing random walks (RW) based methods.

Index Terms—2D-to-3D conversion, depth boundaries, depth-map, nonlocal neighbors, nonlocal random walks.
I. INTRODUCTION
Semi-automatic 2D-to-3D conversion can generate high-quality 3D video for public use. The main challenge of the conversion process is obtaining high-quality depth around object boundaries. Recently, the RW algorithm has been widely used in semi-automatic 2D-to-3D conversion owing to its smoothing properties inside objects [1]–[3]. However, RW based methods have problems preserving sharp edges. To address this, [4] uses the hard segmentation of Graph-cuts (GC) as an initial depth estimate to maintain sharp boundaries in the RW depth-map. As explained in [5], the issue with [4] is that it is sensitive to the segmentation quality of GC. Another drawback is the time-consuming energy minimization in GC. Thus, a fast watershed segmentation is utilized in [6] to generate the hard constraints of depth-maps. Some existing methods also exploit user interaction to refine depth estimates [7], [8].
Fig. 1. Flowchart of semi-automatic 2D-to-3D image conversion with the proposed method.
Inspired by KNN matting [9], we propose the NRW algorithm, which combines KNN and local neighbors. Our method has the following advantages over existing RW based approaches. First, by combining RW, which creates smooth depth gradients, with nonlocal neighbors, which help preserve edges [10], NRW guarantees the homogeneity of flat zones while enhancing depth boundaries. Second, NRW is independent of any hard segmentation, so we can preserve sharp boundaries without introducing artifacts in these regions. Third, since the nonlocal principle is effective in reducing the amount of necessary user input while preserving accuracy [11], our method can maintain depth edges even with sparse user scribbles.

The paper is organized as follows. Section II describes the proposed NRW algorithm for semi-automatic 2D-to-3D conversion. Experimental results are given in Section III. Finally, we conclude in Section IV.

II. PROPOSED METHOD

A. Overview of Our Method

The difference between our method and existing RW based approaches [1]–[3] is that we incorporate KNN into local affinity propagation. Therefore, with the help of the nonlocal relationships between pixels, we can preserve sharp object boundaries even with sparse user labels.

B. Problem Formulation

Similar to StereoBrush [12], our method requires the user to brush sparse scribbles on an input color image. The workflow of semi-automatic 2D-to-3D image conversion based on our method is shown in Fig. 1. We assume the input image is of size $m \times n$. The normalized CIE Lab vector at pixel $i$ is denoted by $\mathbf{c}_i$; each channel is normalized individually so that all components lie within $[0, 1]$. Let $N$ and $N_M$ denote the numbers of pixels and marked pixels, respectively. We represent components of matrices and vectors corresponding to marked pixels by a superscript $M$; components corresponding to unmarked pixels are represented by a superscript $U$. Given the input image and the vector $\mathbf{d}^M$ containing user-defined depth values for the marked pixels, where $N_M$ is much less than $N$, the 2D-to-3D conversion is to find the depth values of all pixels.
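To make the formulation concrete, the following minimal sketch (our illustration, not the paper's code; function names and the NumPy pipeline are assumptions) normalizes the CIE Lab channels to $[0, 1]$ and maps user-defined depth values on marked pixels to probabilities:

```python
import numpy as np

def normalize_lab(lab):
    """Scale each CIE Lab channel of an (m, n, 3) image to [0, 1] independently,
    as the problem formulation prescribes."""
    lab = lab.astype(np.float64)
    for ch in range(3):
        c = lab[..., ch]
        lab[..., ch] = (c - c.min()) / (c.max() - c.min() + 1e-12)
    return lab

def scribbles_to_probabilities(scribble_depths, marked_mask):
    """Map user-defined depths (0..255) on marked pixels to the probability
    that each marked pixel has the maximum depth value (255)."""
    return scribble_depths[marked_mask].astype(np.float64) / 255.0
```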
C. Graphical Model

We construct an undirected graph $G = (V, E)$ whose nodes $V$ consist of the pixels, and whose edges $E = E_L \cup E_N$ are the links between two 8-neighboring pixels ($E_L$) and the KNN links ($E_N$). The edges between two nodes are illustrated in Fig. 2. In a subset $V_M$ of $V$, each node represents a pixel with a user-defined value. The value of node $v_i$ is denoted by $x_i$; we can interpret $x_i$ as the probability that the depth value of $v_i$ is 255.

Fig. 2. Illustration of the proposed graph. Black and gray edges are 8-neighboring and KNN links, respectively.

The weights are assigned to edges according to the type of link between the two connected nodes. First, an edge $e_{ij} \in E_L$, spanning two 8-neighboring vertices $v_i$ and $v_j$, has the Gaussian weight

$$w_{ij}^{L} = \exp\left(-\frac{\|\mathbf{c}_i - \mathbf{c}_j\|^2}{\sigma_1}\right) \quad (1)$$

where $\sigma_1$ is a free parameter controlling how dissimilar two colors may be. Computing a weight for an edge $e_{ij} \in E_N$ involves collecting the KNN of node $v_i$ using feature vectors. The feature vector $\mathbf{f}_i$ of $v_i$ is defined as

$$\mathbf{f}_i = \left(\mathbf{c}_i,\, \hat{x}_i,\, \hat{y}_i\right) \quad (2)$$

where $\hat{x}_i$ and $\hat{y}_i$ are the normalized spatial coordinates of pixel $i$ along the row and column directions, respectively. We apply FLANN [13] to compute the KNN in this feature space. The nonlocal weight is defined by

$$w_{ij}^{N} = \exp\left(-\frac{\|\mathbf{c}_i - \mathbf{c}_j\|^2}{\sigma_1} - \frac{(\hat{x}_i - \hat{x}_j)^2 + (\hat{y}_i - \hat{y}_j)^2}{\sigma_2}\right) \quad (3)$$

where $\sigma_2$ is a free parameter controlling how far apart two pixels may be. With $\sigma_1$ and $\sigma_2$, we can adjust the KNN weights according to the color and the spatial distances.
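A sketch of the edge-weight construction under our reading of (1)–(3). The paper computes KNN with FLANN [13]; this sketch substitutes scikit-learn's exact k-NN for simplicity, and the names (`local_weights`, `knn_weights`, `sigma1`, `sigma2`, `K`) are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_weights(lab, sigma1):
    """Gaussian weights (1) for the 8-neighbor edges; each undirected edge is
    listed once. Pixels are indexed in row-major order."""
    m, n, _ = lab.shape
    idx = np.arange(m * n).reshape(m, n)
    src, dst, wts = [], [], []
    for di, dj in [(0, 1), (1, 0), (1, 1), (1, -1)]:   # the four unique offsets
        i0, i1 = max(0, -di), m - max(0, di)
        j0, j1 = max(0, -dj), n - max(0, dj)
        a = idx[i0:i1, j0:j1]
        b = idx[i0 + di:i1 + di, j0 + dj:j1 + dj]
        d2 = np.sum((lab[i0:i1, j0:j1] - lab[i0 + di:i1 + di, j0 + dj:j1 + dj]) ** 2, axis=-1)
        src.append(a.ravel()); dst.append(b.ravel())
        wts.append(np.exp(-d2 / sigma1).ravel())
    return np.concatenate(src), np.concatenate(dst), np.concatenate(wts)

def knn_weights(lab, K, sigma1, sigma2):
    """Nonlocal weights (3): K nearest neighbors in the color+position feature
    space of (2), with the sigmas folded into the feature scaling so that the
    squared feature distance equals the exponent of (3)."""
    m, n, _ = lab.shape
    yy, xx = np.mgrid[0:m, 0:n]
    pos = np.stack([yy.ravel() / m, xx.ravel() / n], axis=1)     # normalized coords
    feat = np.hstack([lab.reshape(-1, 3) / np.sqrt(sigma1),
                      pos / np.sqrt(sigma2)])
    nn = NearestNeighbors(n_neighbors=K + 1).fit(feat)
    dist, nbr = nn.kneighbors(feat)   # neighbor 0 is the query pixel itself
    w = np.exp(-dist[:, 1:] ** 2)     # exp(-squared feature distance), i.e. (3)
    src = np.repeat(np.arange(m * n), K)
    return src, nbr[:, 1:].ravel(), w.ravel()
```

Scaling the color and position channels by $1/\sqrt{\sigma_1}$ and $1/\sqrt{\sigma_2}$ before the search means the k-NN query and the weight in (3) use the same distance, which is one reasonable way to realize the feature space of (2).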
D. Optimization Formulation via RW Algorithm

The combinatorial Dirichlet problem has the same solution as the desired RW probabilities [14]. Therefore, we formulate the depth propagation on the graph $G$ as the following combinatorial Dirichlet integral:

$$E(\mathbf{x}) = \frac{1}{2}\left[\sum_{e_{ij}\in E_L} 10\, w_{ij}^{L}\,(x_i - x_j)^2 + \sum_{e_{ij}\in E_N} w_{ij}^{N}\,(x_i - x_j)^2\right] \quad (4)$$

where the vector $\mathbf{x}$ is of size $N \times 1$ and its element $x_i$ is the probability that pixel $i$ has the maximum depth value. In the above equation, the local weights are multiplied by 10 to ensure that local neighbors with similar features have a stronger influence on depth propagation than the KNN. When local propagation fails, the KNN take the leading role in the diffusion. The energy function in (4) can be reformulated in matrix form as

$$E(\mathbf{x}) = \frac{1}{2}\,\mathbf{x}^{\mathsf{T}} L\,\mathbf{x} \quad (5)$$

where $L$ is the combinatorial Laplacian matrix of size $N \times N$ defined as

$$L_{ij} = \begin{cases} \sum_{k} w_{ik} & \text{if } i = j,\\ -w_{ij} & \text{if } v_i \text{ and } v_j \text{ are connected},\\ 0 & \text{otherwise}, \end{cases} \quad (6)$$

with $w_{ij} = 10\,w_{ij}^{L}$ for local edges and $w_{ij} = w_{ij}^{N}$ for KNN edges. Similar to [14], we re-arrange the vector $\mathbf{x}$ into two parts, such that marked pixels appear first, followed by unmarked pixels, and obtain the partition $\mathbf{x} = [(\mathbf{x}^M)^{\mathsf{T}},\,(\mathbf{x}^U)^{\mathsf{T}}]^{\mathsf{T}}$. Performing the same re-arrangement on the rows and columns of $L$ in (6), we decompose (5) into

$$E(\mathbf{x}) = \frac{1}{2}\begin{bmatrix}(\mathbf{x}^M)^{\mathsf{T}} & (\mathbf{x}^U)^{\mathsf{T}}\end{bmatrix}\begin{bmatrix} L^M & B \\ B^{\mathsf{T}} & L^U \end{bmatrix}\begin{bmatrix}\mathbf{x}^M \\ \mathbf{x}^U\end{bmatrix} \quad (7)$$

Differentiating $E(\mathbf{x})$ with respect to $\mathbf{x}^U$ yields

$$L^U \mathbf{x}^U = -B^{\mathsf{T}} \mathbf{x}^M \quad (8)$$

Solving this system of linear equations, with the marked entries fixed at $\mathbf{x}^M = \mathbf{d}^M/255$, we obtain the probabilities of all pixels; multiplying by 255 gives the final estimated depth-map.
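A minimal sketch of the solver step, assuming SciPy sparse machinery and the edge lists produced by the weight sketch above (function and variable names are ours). It assembles the combined Laplacian of (5)–(6), with the local weights scaled by 10 as in (4), and solves (8):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def solve_nrw(local_ijw, knn_ijw, marked, x_marked, n_pixels):
    """local_ijw, knn_ijw: (src, dst, weight) edge triples; marked: boolean
    mask of scribbled pixels; x_marked: their probabilities d^M / 255."""
    li, lj, lw = local_ijw
    ki, kj, kw = knn_ijw
    i = np.concatenate([li, ki])
    j = np.concatenate([lj, kj])
    w = np.concatenate([10.0 * lw, kw])        # local edges get the 10x factor of (4)
    W = sp.coo_matrix((w, (i, j)), shape=(n_pixels, n_pixels))
    W = (W + W.T).tocsr()                      # symmetrize the undirected graph
    L = (sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W).tocsr()   # Laplacian (6)
    u = np.flatnonzero(~marked)                # unmarked pixel indices
    mk = np.flatnonzero(marked)                # marked pixel indices
    L_U = L[u][:, u].tocsc()
    B_T = L[u][:, mk]
    x = np.empty(n_pixels)
    x[mk] = x_marked
    x[u] = spsolve(L_U, -B_T @ x_marked)       # eq. (8)
    return np.clip(255.0 * x, 0.0, 255.0)      # final depth-map values
```

For production use, a sparse Cholesky factorization or conjugate-gradient solver would be the natural choice for the symmetric positive-definite system $L^U$, but `spsolve` keeps the sketch short.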
III. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed method, three images, Philips-the-3D-experience (Philips), Philips-the-3D-experience-2 (Philips-2), and Dice-2 from the BBNC datasets [15], are used in our experiments.1 Table I provides descriptions of these images. The two parameters $\sigma_1$ and $\sigma_2$ in (1) and (3) are fixed in all experiments and set empirically, as is the number of neighbors $K$ used in the KNN calculations. We compare against the existing RW based method [3] and the hybrid GC and RW based approach (GC+RW) [4]. We evaluate the objective quality of the recovered depth in two ways:
1The source code and test images can be downloaded from https://github.com/tcyhx/NRW.
TABLE I DESCRIPTIONS OF TEST IMAGES
TABLE II PSNR OF ESTIMATED DEPTHS (IN DB)
TABLE III NCC OF ESTIMATED DEPTHS
• The peak signal-to-noise ratio (PSNR): $\mathrm{PSNR} = 10\log_{10}\!\left(255^2/\mathrm{MSE}(\hat{\mathbf{d}}, \mathbf{d})\right)$, where $\hat{\mathbf{d}}$ is the vector corresponding to the estimated depth-map, $\mathbf{d}$ is the vector corresponding to the ground-truth depth-map, and $\mathrm{MSE}$ denotes their mean squared error.

• Normalized cross-covariance (NCC) between the estimated depth-map and the ground truth [16]: $\mathrm{NCC} = \mathbb{E}\big[(\mathbf{d}-\mu_d)(\hat{\mathbf{d}}-\mu_{\hat{d}})\big]/(\sigma_d\,\sigma_{\hat{d}})$, where $\mu_d$ and $\mu_{\hat{d}}$ denote the mean depth values, and $\sigma_d^2$ and $\sigma_{\hat{d}}^2$ denote the variances of the ground truth and the recovered depth-map, respectively. NCC takes values between $-1$ and $1$ (values closer to $1$ indicate depths more similar to the ground truth) [17].
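Both metrics are straightforward to compute; the following NumPy sketch (function names are ours) mirrors the definitions above for 8-bit depth-maps:

```python
import numpy as np

def psnr(est, gt):
    """PSNR in dB between estimated and ground-truth depth-maps (0..255)."""
    mse = np.mean((est.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def ncc(est, gt):
    """Normalized cross-covariance in [-1, 1]; closer to 1 means more similar."""
    e = est.astype(np.float64).ravel()
    g = gt.astype(np.float64).ravel()
    return np.mean((e - e.mean()) * (g - g.mean())) / (e.std() * g.std())
```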
Fig. 3. Visual comparison between different algorithms for Philips. (a) Input image with scribbles. (b) Sparse depths. (c) Ground truth. (d) RW depths. (e) GC+RW depths. (f) Our depths.
A. Results

Table II reports the PSNR of the estimated depth-maps obtained by the different algorithms. The results show that our method performs better than RW [3] and GC+RW [4]. The proposed method outperforms GC+RW by up to 4 dB, while GC+RW lowers PSNR by at least 2 dB compared with RW. The reason is that inconsistent edges generated by GC introduce artifacts at object boundaries during depth merging, which is clearly visible in Fig. 3(e) and Fig. 4(e). Table III shows the NCC scores for the three evaluated images. The proposed algorithm reaches the highest scores for all of the test images, while the NCC scores of GC+RW remain lower than those of RW. From Tables II and III, we can see that GC+RW, which relies on the depth prior generated by GC to modify the edge weights of the RW algorithm, degrades depth accuracy.

The visual comparisons for the different methods are given in Figs. 3, 4 and 5. In each figure, (a) shows the input image with user scribbles overlaid, (b) shows the sparse depth-map extracted from the user scribbles, (c) shows the ground-truth depth-map, and the depth-maps obtained by RW, GC+RW, and our method are shown in (d), (e), and (f), respectively. The depth-maps recovered by our method clearly outperform those of RW and GC+RW in having sharper boundaries, which is easily seen in the enlarged parts (marked by squares).
Fig. 4. Visual comparison between different algorithms for Philips-2. (a) Input image with scribbles. (b) Sparse depths. (c) Ground truth. (d) RW depths. (e) GC+RW depths. (f) Our depths.
In Fig. 3, the depth boundaries of the girl's hand are lost by RW and GC+RW, while our method recovers the boundaries accurately. Due to the color-bleeding effect between the hand and the grass region, it is hard for RW to stop depth propagation across the object boundaries. GC also has problems separating object boundaries in this case; therefore, GC+RW loses sharp object boundaries in these regions.

Fig. 4 shows that our method obtains the boundaries of the trees accurately. Since the hard segmentation with GC does not respect fine details in images, it is hard for GC+RW to preserve these fine structures. To maintain such fine details, RW needs enough scribbles positioned tightly along the boundaries, since only pairwise relationships between neighboring pixels are considered.

Fig. 5. Visual comparison between different algorithms for Dice-2. (a) Input image with scribbles. (b) Sparse depths. (c) Ground truth. (d) RW depths. (e) GC+RW depths. (f) Our depths.

In Fig. 5, RW and GC+RW generate artifacts in the enlarged part of the smaller die due to color bleeding between it and the background. With the help of the nonlocal pairwise constraints provided by KNN, our method generates clear boundaries. As Tables II and III and Figs. 3, 4 and 5 show, GC+RW enhances depth boundaries at the cost of depth accuracy. Due to the limitations of GC segmentation quality, it is hard for GC+RW to obtain accurate object boundaries for complex natural images. Unlike GC+RW, our method preserves sharp depth boundaries while increasing depth accuracy.
B. System Complexity

Similar to [4], the complexity of the 2D-to-3D conversion system based on our method depends on the image size, the image content, and the number of user scribbles. On an Intel Core i7 machine with 8 GB of RAM, scribbling on an image takes between 20 seconds and 1 minute. Building the combinatorial Laplacian matrix in (5) takes 10 to 11 seconds on average, and solving (8) to obtain the depth-map takes 2 to 4 seconds.

IV. CONCLUSION

In this letter, we presented an NRW algorithm for producing dense depth-maps from user-defined sparse depth-maps. To improve depth quality at object boundaries, we incorporate nonlocal neighbors into the RW graphical model. RW has advantages in generating smooth depth, while the nonlocal principle is helpful for maintaining fine image structures. Therefore, NRW yields piecewise-smooth depth-maps separated by discontinuities around object boundaries. Experiments on three images containing color-bleeding areas have shown that NRW outperforms existing RW based methods. Unlike the hybrid GC and RW method, we enhance depth boundaries without loss of depth accuracy. In future work, NRW will be extended to videos by incorporating temporal coherence.

ACKNOWLEDGMENT

The first author gratefully acknowledges the support of the K. C. Wong Education Foundation, Hong Kong. The authors would like to thank Dr. Phan for discussions on the implementation of GC based semi-automatic 2D-to-3D conversion. The authors would also like to thank the reviewers for their valuable comments.

REFERENCES
[1] M. Guttmann, L. Wolf, and D. Cohen-Or, "Semi-automatic stereo extraction from video footage," in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2009, pp. 136–142.
[2] D. Sykora, D. Sedlacek, S. Jinchao, J. Dingliana, and S. Collins, "Adding depth to cartoons using sparse depth (in)equalities," Comput. Graph. Forum, vol. 29, no. 2, pp. 615–623, May 2010.
[3] R. Rzeszutek, R. Phan, and D. Androutsos, "Semi-automatic synthetic depth map generation for video using random walks," in Proc. IEEE Int. Conf. Multimedia and Expo (ICME), 2011, pp. 1–6.
[4] R. Phan and D. Androutsos, "Robust semi-automatic depth map generation in unconstrained images and video sequences for 2D to stereoscopic 3D conversion," IEEE Trans. Multimedia, vol. 16, no. 1, pp. 122–136, Jan. 2014.
[5] R. Phan, R. Rzeszutek, and D. Androutsos, "Semi-automatic 2D to 3D image conversion using a hybrid random walks and graph cuts based approach," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 897–900.
[6] X. Xu, L.-M. Po, K.-W. Cheung, and K.-H. Ng, "Watershed and random walks based depth estimation for semi-automatic 2D to 3D image conversion," in Proc. IEEE Int. Conf. Signal Processing, Communication and Computing (ICSPCC), 2012, pp. 84–87.
[7] M. Liao, J. Gao, R. Yang, and M. Gong, "Video stereolization: Combining motion analysis with user interaction," IEEE Trans. Vis. Comput. Graph., vol. 18, no. 7, pp. 1079–1088, Jul. 2012.
[8] Z. Zhang, C. Zhou, Y. Wang, and W. Gao, "Interactive stereoscopic video conversion," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 10, pp. 1795–1807, Oct. 2013.
[9] Q. Chen, D. Li, and C. K. Tang, "KNN matting," IEEE Trans. Patt. Anal. Mach. Intell., vol. 35, no. 9, pp. 2175–2188, Sep. 2013.
[10] G. Palma, M. Comerci, B. Alfano, S. Cuomo, P. De Michele, F. Piccialli, and P. Borrelli, "3D non-local means denoising via multi-GPU," in Proc. FedCSIS, 2013, pp. 495–498.
[11] A. Buades, B. Coll, and J.-M. Morel, "Image denoising methods. A new nonlocal principle," SIAM Rev., vol. 52, no. 1, pp. 113–147, Feb. 2010.
[12] O. Wang, M. Lang, M. Frei, A. Hornung, A. Smolic, and M. Gross, "StereoBrush: Interactive 2D to 3D conversion using discontinuous warps," in Proc. Eur. Symp. Sketch-Based Interfaces and Modeling, 2011, pp. 47–54.
[13] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," 2014. [Online]. Available: http://www.vlfeat.org/
[14] L. Grady, "Random walks for image segmentation," IEEE Trans. Patt. Anal. Mach. Intell., vol. 28, no. 11, pp. 1768–1783, Nov. 2006.
[15] X. Cao, Z. Li, and Q. H. Dai, "Semi-automatic 2D-to-3D conversion using disparity propagation," IEEE Trans. Broadcast., vol. 57, no. 2, pp. 491–499, Jun. 2011.
[16] R. Ranftl, S. Gehrig, T. Pock, and H. Bischof, "Pushing the limits of stereo using variational stereo estimation," in Proc. Intelligent Vehicles Symp., 2012, pp. 401–407.
[17] J. Konrad, M. Wang, P. Ishwar, and C. Wu, "Learning-based, automatic 2D-to-3D image and video conversion," IEEE Trans. Image Process., vol. 22, no. 9, pp. 3485–3496, Sep. 2013.