Proceedings of 2010 IEEE 17th International Conference on Image Processing

September 26-29, 2010, Hong Kong

PROBABILISTIC RELIABILITY BASED VIEW SYNTHESIS FOR FTV

Lu Yang¹, Tomohiro Yendo¹, Mehrdad Panahpour Tehrani¹, Toshiaki Fujii², Masayuki Tanimoto¹

¹ Graduate School of Engineering, Nagoya University, Nagoya 464-8603, Japan
² Graduate School of Science and Engineering, Tokyo Institute of Technology, Tokyo 152-8550, Japan
[email protected]

ABSTRACT

View synthesis using depth maps is an important application in 3D image processing. In this paper, a novel method is proposed for plausible view synthesis in Free-viewpoint TV (FTV), using two input images and their depth maps. Depth estimation based on stereo matching is known to be error-prone, leading to noticeable artifacts in the synthesized views. To produce high-quality view synthesis, we introduce a probabilistic framework that estimates the reliability of each pixel of the new view by Maximum Likelihood (ML). Spatially adaptive reliability is obtained by incorporating a Gamma hyper-prior and a synthesis-error approximation. Furthermore, we generate the virtual view by solving a Maximum a Posteriori (MAP) problem using graph cuts. We compare the proposed method with other depth-based view synthesis approaches on MPEG test sequences. The results show that our method outperforms them in both subjective artifact reduction and objective PSNR.

Index Terms— Free-viewpoint TV (FTV), view interpolation, depth map, reliability, Maximum a Posteriori (MAP)

1. INTRODUCTION

View synthesis is a crucial part of many multiview imaging applications, such as FTV [1] and 3DTV [2]. Generally, two original images, from the left and right sides respectively, serve as references to generate an intermediate view using their depth maps. Depth maps are obtained offline by stereo matching with energy optimization techniques [3]. Unfortunately, current stereo-based depth estimation is prone to errors, especially in textureless and boundary areas.

Several view synthesis methods have been proposed. In [4, 5], the depth maps were filtered to reduce the noise.
However, large depth errors are difficult to eliminate by simple smoothing. Other works [6, 7] treated boundaries or depth discontinuities as unreliable areas and synthesized those areas separately. In [8], the unreliable areas were labeled automatically by reliability reasoning. In contrast to [6, 7, 8], which adopt a binary notion of reliability, we use a probabilistic model together with the error approximation of [8] to obtain a continuous reliability and suppress synthesis noise.

View synthesis can also be considered a labeling or image reconstruction problem [9, 10], which is naturally formulated in a MAP framework. Fitzgibbon et al. [9] used a database of image patches as the prior constraint. Mudenagudi et al. [10] used graph cuts to generate a super-resolved virtual view. Neither approach handles depth errors, and both use more than two reference views, whereas we use only two references and their depth maps. A key problem is how to blend the intensities from the references; our reliability addresses it through the Maximum Likelihood criterion. Incorporating the reliability-based likelihood, the prior term adaptively smooths the synthesized image to suppress noise and improve the PSNR. The final objective function is optimized by graph cuts [3].

The contribution of this paper is two-fold: 1) a novel continuous reliability model for view synthesis is inferred probabilistically by maximizing the likelihood; 2) a reliability-based MAP framework is proposed that adaptively synthesizes the virtual view and generalizes conventional view interpolation.

The rest of this paper is organized as follows. In Sec. 2, we infer the new reliability from the likelihood probability. The synthesis framework and its optimization are presented in Sec. 3. Experimental results are given in Sec. 4, and Sec. 5 concludes the paper.

Footnote: This work is partially supported by the Strategic Information and Communications R&D Promotion Programme (SCOPE) 093106002 of the Ministry of Internal Affairs and Communications of Japan and by the China Scholarship Council (CSC). We would like to thank Heinrich-Hertz-Institut, Gwangju Institute of Science and Technology, Electronics and Telecommunications Research Institute and MPEG-Korea Forum for providing the MPEG sequences.

978-1-4244-7993-1/10/$26.00 ©2010 IEEE

2. RELIABILITY INFERENCE

Given references I_L, I_R and the corresponding depth maps D_L, D_R, we can generate two synthesized virtual views as observations: I_1, the projection of I_L using D_L, and I_2, the projection of I_R using D_R. The ideal synthesized view f is assumed to satisfy

$$f = I_k + e_k \quad (1)$$

where k = 1, 2, and e_1, e_2 are error images assumed to be Gaussian at each pixel i: e_ki ∼ N(0, σ_ki²), i = 1, 2, …, n. Here σ_ki² is the unknown, spatially varying variance of e_ki, i.e., each pixel has its own statistics, and n denotes the number of pixels in the image. The view synthesis problem is to find an estimate of f = (f_1, f_2, …, f_n). Following Bayesian inference, the posterior probability P(f, σ_k | I_k) is given by

$$P(f, \sigma_k \mid I_k) \propto P(I_k \mid f, \sigma_k)\,P(f, \sigma_k) = P(I_k \mid f, \sigma_k)\,P(\sigma_k)\,P(f) \quad (2)$$

where the variance and the image intensity are independent unknowns. Thus the likelihood with respect to f is

$$P_{\mathrm{Likelihood}} = P(I_k \mid f, \sigma_k)\,P(\sigma_k) \quad (3)$$

We assume the error at each pixel is also independent; then

$$P(I_k \mid f, \sigma_k) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{ki}^2}} \exp\!\left(-\frac{e_{ki}^2}{2\sigma_{ki}^2}\right) = \frac{1}{(2\pi)^{n/2}} \exp\!\left(-\sum_{i=1}^{n}\left[\frac{e_{ki}^2}{2\sigma_{ki}^2} + \ln\sigma_{ki}\right]\right) \quad (4)$$

Now consider the distribution P(σ_k) of the variance, which should be parameterized by a hyper-prior. To keep the likelihood tractable, we choose the Gamma hyper-prior [11], which is conjugate to the exponential function and has been applied successfully in image restoration [12]. The probability of the inverse variance is defined as P(1/σ_ki²) = Gamma(1/σ_ki² | α, β), where α and β are the hyper-parameters of the Gamma distribution. Thus we have

$$P(\sigma_k) = \prod_{i=1}^{n} \mathrm{Gamma}\!\left(\frac{1}{\sigma_{ki}^2} \,\middle|\, \alpha, \beta\right) \propto \prod_{i=1}^{n} \left(\frac{1}{\sigma_{ki}^2}\right)^{\alpha-1} \exp\!\left(-\frac{\beta}{\sigma_{ki}^2}\right) = \exp\!\left(-\sum_{i=1}^{n}\left[2(\alpha-1)\ln\sigma_{ki} + \frac{\beta}{\sigma_{ki}^2}\right]\right) \quad (5)$$

Substituting (4) and (5) into (3), we have

$$P_{\mathrm{Likelihood}} \propto \exp\!\left(-\sum_{i=1}^{n}\left[\left(\frac{e_{ki}^2}{2} + \beta\right)\frac{1}{\sigma_{ki}^2} + (2\alpha-1)\ln\sigma_{ki}\right]\right) \quad (6)$$

Now σ_ki² can be obtained by maximizing the likelihood probability in (6), which is equivalent to minimizing the likelihood energy:

$$E_{\mathrm{Likelihood}} = \sum_{i=1}^{n}\left[\left(\frac{e_{ki}^2}{2} + \beta\right)\frac{1}{\sigma_{ki}^2} + (2\alpha-1)\ln\sigma_{ki}\right] \quad (7)$$

The gradient with respect to σ_ki is

$$\nabla_{\sigma_{ki}} E_{\mathrm{Likelihood}} = -\frac{e_{ki}^2 + 2\beta}{\sigma_{ki}^3} + \frac{2\alpha-1}{\sigma_{ki}} \quad (8)$$

Setting (8) to zero yields

$$\sigma_{ki}^2 = \frac{e_{ki}^2 + 2\beta}{2\alpha - 1} \quad (9)$$
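As a quick sanity check on this derivation, the closed form in (9) can be compared against a direct numerical minimization of the per-pixel energy (7). The sketch below is illustrative only: the pixel error value is hypothetical, and the hyper-parameters are chosen so that a = α − 0.5 = 80 and b = 2β = 160, the values used later in the experiments.

```python
import numpy as np

# Per-pixel likelihood energy from Eq. (7):
#   E(sigma) = (e^2/2 + beta)/sigma^2 + (2*alpha - 1)*ln(sigma)
def likelihood_energy(sigma, e, alpha, beta):
    return (e**2 / 2 + beta) / sigma**2 + (2 * alpha - 1) * np.log(sigma)

# Closed-form minimizer from Eq. (9): sigma^2 = (e^2 + 2*beta)/(2*alpha - 1)
def sigma_sq_ml(e, alpha, beta):
    return (e**2 + 2 * beta) / (2 * alpha - 1)

alpha, beta = 80.5, 80.0   # illustrative: a = alpha - 0.5 = 80, b = 2*beta = 160
e = 12.0                   # hypothetical cross-check error at one pixel

# Brute-force grid search over sigma, compared with the closed form
sigmas = np.linspace(0.5, 5.0, 200001)
numeric = sigmas[np.argmin(likelihood_energy(sigmas, e, alpha, beta))]
closed_form = np.sqrt(sigma_sq_ml(e, alpha, beta))
print(numeric, closed_form)  # the two minimizers agree
```

The grid search is of course redundant given the closed form; its only purpose here is to confirm that (9) is indeed the minimizer of (7) for one pixel.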

However, we cannot compute (9) directly, since e_ki is the error between the reference and the unknown virtual view. We follow the idea in [8] and use the corresponding cross-check error between the two references as an approximation of e_ki. This is reasonable because a given depth error usually causes similar synthesis errors at different viewpoints. Thus the error in (9) can be approximated using only the input reference views and depth maps. From (7), we can also obtain the likelihood energy with respect to f:

$$E_{\mathrm{Likelihood}}(f) = \sum_{i=1}^{n}\left[\left(\frac{(f_i - I_{ki})^2}{2} + \beta\right)\frac{1}{\sigma_{ki}^2} + (2\alpha-1)\ln\sigma_{ki}\right] \quad (10)$$

We can simplify (10) by keeping only the terms that depend on f, so the likelihood energy becomes

$$E_{\mathrm{Likelihood}}(f) = \sum_{i=1}^{n} \frac{1}{2\sigma_{ki}^2}(f_i - I_{ki})^2 \quad (11)$$

From (11) we see that the inverse variance plays the role of a reliability for each pixel. Finally, combining (9) and (11), the reliability r_ki of each pixel from each reference is defined as

$$r_{ki} = \frac{1}{2\sigma_{ki}^2} = \frac{\alpha - 0.5}{e_{ki}^2 + 2\beta} = \frac{a}{e_{ki}^2 + b}, \qquad a = \alpha - 0.5,\; b = 2\beta \quad (12)$$

where the parameters a and b are determined by the hyper-parameters α and β, respectively.

3. IMAGE SYNTHESIS

Fig. 1. Left references (top) and their depth maps (bottom); the right views and depth maps are not shown. From left to right: Champagne Tower, Book Arrival, Newspaper and Love Bird.

After the reliability inference, we now describe our view synthesis framework based on the proposed reliability. Our goal is to estimate the virtual view f that maximizes the posterior in (2). The likelihood was analyzed in Sec. 2. However, the reliability is not perfect, and maximum likelihood alone may generate noise in the virtual view. Since most images are naturally smooth and can be modeled by a Markov Random Field (MRF), we adopt a simple prior probability that encourages smoothness over the neighborhood of each pixel in the synthesized virtual view. The prior term in (2) is defined as

$$P(f) \propto \prod_{i=1}^{n} \prod_{j \in N_i} \exp(-|f_i - f_j|) \quad (13)$$

where N_i is the nearest neighborhood of pixel i in the synthesized view. Maximizing the posterior is equivalent to minimizing the posterior energy:

$$E(f) = E_{\mathrm{Likelihood}}(f) + E_{\mathrm{Prior}}(f) = \sum_{k=1}^{2}\sum_{i=1}^{n} \frac{a}{e_{ki}^2 + b}\,(f_i - I_{ki})^2 + \sum_{i=1}^{n}\sum_{j \in N_i} |f_i - f_j| \quad (14)$$

where the likelihood energy comes from (11) and (12), applied to each reference side, and the prior energy is the negative logarithm of the prior probability in (13).

We now analyze how the hyper-parameters a and b in (14) control the adaptivity of the proposed algorithm. The parameter a globally balances the weights of the likelihood and the prior, so it has no influence on spatial adaptivity. The parameter b, however, plays an important role as a tuning factor in the denominator of the reliability r_ki. Specifically, when b ≫ e_ki, r_ki = a/(e_ki² + b) tends to a constant: the reliability is uniform and the Maximum Likelihood solution is f_i = (I_1i + I_2i)/2, exactly the result of conventional image interpolation. Thus conventional view synthesis is a special case of the proposed framework. When b → 0, r_ki → a/e_ki², so a change in the approximated error e_ki leads to a significant weight difference for the corresponding pixel. For example, as e_ki → 0, r_ki → ∞ and the prior energy has no influence, meaning that pixel is not smoothed. In contrast, when e_ki → ∞, the denominator of r_ki becomes very large and the likelihood energy receives the smallest weight; the prior (smoothness) energy then dominates at that pixel. As a result, the proposed synthesis framework is spatially adaptive, and the hyper-parameters adjust the degree of adaptivity.
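The behavior of the data term alone can be illustrated with a small sketch. Minimizing only the likelihood part of (14) at a pixel gives the reliability-weighted average f_i = (r_1i I_1i + r_2i I_2i)/(r_1i + r_2i); with a very large b this reduces to the plain two-view average, as argued above. The array values below are hypothetical, and a, b follow the experimental setting a = 80, b = 160.

```python
import numpy as np

def reliability(e, a=80.0, b=160.0):
    # Eq. (12): r = a / (e^2 + b), computed per pixel from the
    # cross-check error e (a = alpha - 0.5, b = 2*beta).
    return a / (e**2 + b)

def ml_blend(I1, I2, e1, e2, a=80.0, b=160.0):
    # Minimizing the data term sum_k r_k (f - I_k)^2 alone (no prior)
    # gives the reliability-weighted average below.
    r1, r2 = reliability(e1, a, b), reliability(e2, a, b)
    return (r1 * I1 + r2 * I2) / (r1 + r2)

I1 = np.array([100.0, 100.0])
I2 = np.array([110.0, 110.0])
e1 = np.array([0.0, 20.0])   # pixel 2 of view 1 has a large cross-check error
e2 = np.array([0.0, 0.0])

# Special case b >> e^2: near-constant reliability, essentially (I1 + I2) / 2
print(ml_blend(I1, I2, e1, e2, b=1e9))

# With b = 160, the unreliable pixel leans toward the reliable view I2
print(ml_blend(I1, I2, e1, e2))
```

This sketch covers only the data term; in the paper the full energy (14), including the smoothness prior, is minimized over discrete intensity labels with graph cuts rather than pixelwise in closed form.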


Fig. 2. Magnified local results on "Champagne Tower" (first row), "Book Arrival" (second row), "Newspaper" (third row) and "Love Bird" (fourth row).

Finally, the energy in (14) must be efficiently minimized as a labeling problem. Note that the prior term is a metric, so we can construct a graph for the energy and apply graph cuts [3].

4. EXPERIMENTAL EVALUATIONS

The proposed method has been implemented on four MPEG sequences [13]: "Champagne Tower" (1280×960), "Book Arrival" (1024×768), "Newspaper" (1024×768) and "Love Bird" (1024×768). All sequences are well calibrated and rectified. For all datasets, we generate the center view using two reference views from the left and right viewpoints, respectively. The corresponding depth maps are estimated off-line with the MPEG depth estimation reference software [14], which is based on stereo matching. We heuristically set the parameters a = 80, b = 160, which provide the best synthesized views on all sequences. The sequences and depth maps used in our experiments are shown in Fig. 1.

In Fig. 2, the proposed method (f) is evaluated against conventional synthesis [15] (b), the view generation of [4] (c), the intermediate view interpolation of [7] (d), the reliability-based interpolation of [8] (e), and the original ground-truth image (a). We magnify local areas of the synthesized views for visual comparison. For all benchmarks, the proposed method achieves the best subjective quality with the fewest artifacts. For "Champagne Tower", our method clearly eliminates the artifacts on the desk edges that come from wrongly projected background pixels. The visible artifacts in the head hair of "Book Arrival" are also reduced in our result. For "Newspaper", no method produces a perfect result near the hand, where the narrow texture is occluded; the occluded black line in the background violates both reliability reasoning and our error approximation [8]. However, our method slightly reduces the artifacts that are not connected with the thin texture. Due to the adaptive smoothness term in our synthesis framework, we also generate a plausible virtual view for "Love Bird" with the least noise.

Furthermore, Table 1 shows that our method provides the highest PSNR among all methods on every sequence. For "Champagne Tower", our result has about a 1.1 dB gain over the second-best result. For "Book Arrival" and "Newspaper", a gain of approximately 0.8 dB is observed. Although our emphasis is subjective quality, we also obtain a 0.3 dB PSNR gain on "Love Bird".

Table 1. PSNR (dB) comparisons and gains (relative to [15])

Method              Champagne      Book           News           Love
Convention [15]     30.16          33.04          30.61          29.36
Mori et al. [4]     30.77 (0.61)   33.10 (0.06)   30.68 (0.07)   28.62 (-0.74)
Smolic et al. [7]   30.09 (-0.07)  33.03 (-0.01)  30.58 (-0.03)  29.30 (-0.06)
Yang et al. [8]     30.45 (0.29)   33.85 (0.81)   30.98 (0.37)   29.22 (-0.14)
Proposed            31.92 (1.76)   34.67 (1.63)   31.87 (1.26)   29.70 (0.34)

5. CONCLUSIONS

In this paper, we introduced a probabilistic reliability based view synthesis method for FTV. The new reliability is inferred from the probabilistic likelihood and an error approximation, resulting in adaptive image rendering that favors reliable image intensities and avoids incorrect ones. Moreover, the image synthesis problem is posed in a MAP framework and optimized by graph cuts. We demonstrated the framework on four MPEG sequences and showed that it eliminates noticeable artifacts and improves PSNR values. Since the synthesis framework depends on reliability reasoning, a more accurate error approximation in the reliability computation would lead to better synthesis results. In the future, we would like to improve the reliability reasoning and extend our method to the chrominance channels and the temporal domain.

6. REFERENCES

[1] M. Tanimoto, "Overview of free viewpoint television," Signal Processing: Image Communication, vol. 21, no. 6, pp. 454–461, 2006.

[2] A. Kubota, A. Smolic, M. Magnor, M. Tanimoto, T. Chen, and C. Zhang, "Multiview imaging and 3DTV," IEEE Signal Processing Magazine, vol. 24, no. 6, pp. 10–21, 2007.

[3] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.

[4] Y. Mori, N. Fukushima, T. Yendo, T. Fujii, and M. Tanimoto, "View generation with 3D warping using depth information for FTV," Signal Processing: Image Communication, vol. 24, no. 1-2, pp. 65–72, 2009.

[5] D. Min, D. Kim, S. Yun, and K. Sohn, "2D/3D freeview video generation for 3DTV system," Signal Processing: Image Communication, vol. 24, no. 1-2, pp. 31–48, 2009.

[6] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in ACM SIGGRAPH '04, Los Angeles, California, 2004, pp. 600–608.

[7] A. Smolic, K. Müller, K. Dix, P. Merkle, P. Kauff, and T. Wiegand, "Intermediate view interpolation based on multiview video plus depth for advanced 3D video systems," in Proceedings of ICIP, San Diego, California, USA, 2008, pp. 2448–2451.

[8] L. Yang, T. Yendo, M. Panahpour Tehrani, T. Fujii, and M. Tanimoto, "Artifact reduction using reliability reasoning for image generation of FTV," Journal of Visual Communication and Image Representation, vol. 21, no. 5-6, pp. 542–560, 2010.

[9] A. Fitzgibbon, Y. Wexler, and A. Zisserman, "Image-based rendering using image-based priors," International Journal of Computer Vision, vol. 63, no. 2, pp. 141–151, 2005.

[10] U. Mudenagudi, A. Gupta, L. Goel, A. Kushal, P. Kalra, and S. Banerjee, "Super resolution of images of 3D scenes," in ACCV, Tokyo, Japan, 2007, pp. 85–95.

[11] J. M. Bernardo and A. F. M. Smith, Bayesian Theory, John Wiley & Sons, Chichester, 1994.

[12] G. K. Chantas, N. P. Galatsanos, and A. C. Likas, "Bayesian restoration using a new nonstationary edge-preserving image prior," IEEE Transactions on Image Processing, vol. 15, no. 10, pp. 2987–2997, 2006.

[13] ISO/IEC JTC1/SC29/WG11, Coding of moving pictures and audio, N9783, 2008.

[14] ISO/IEC JTC1/SC29/WG11, Coding of moving pictures and audio, M15377, 2008.

[15] H. Y. Shum, S. C. Chan, and S. B. Kang, Image-Based Rendering, Springer-Verlag, New York, 2006.
