VIRTUAL VIEW SYNTHESIS USING MULTI-VIEW VIDEO SEQUENCES

Il-Lyong Jung, Taeyoung Chung, Kwanwoong Song, Chang-Su Kim

School of Electrical Engineering, Korea University, Seoul, Korea
E-mails: {illyong, lovelool17, kwsong71, changsukim}@korea.ac.kr
ABSTRACT

A virtual view synthesis algorithm using multi-view video sequences, which inherits the advantages of both the forward warping and the inverse warping, is proposed in this work. First, we use the inverse warping to synthesize a virtual view without holes from the nearest two views. Second, we detect occluded regions in the synthesized view based on the uniqueness constraint. Then, we refine the occluded regions using the information in the farther views based on the forward warping technique. Simulation results demonstrate that the proposed algorithm provides significantly higher PSNR performance than the conventional inverse warping scheme.

Index Terms— Stereo matching, view synthesis, multi-view video, forward warping, and inverse warping.

1. INTRODUCTION

Recently, the demand for more realistic multimedia data has increased significantly. Three-dimensional (3-D) video and free-view video technologies have drawn much attention from both academia and industry, since they promise to offer a richer experience than traditional video. However, it is difficult to capture a scene from the large number of viewpoints necessary to achieve realism. Instead, attempts have been made to synthesize virtual views [1], while capturing real views from a few selected viewpoints only.

A variety of view synthesis techniques have been proposed. According to the coordinate system of view synthesis, these techniques can be classified into forward warping or inverse warping [2]. In the forward warping, as shown in Fig. 1 (a), stereo matching is performed to estimate disparity vectors on the coordinate system of one of the reference views (I_L or I_R), and the disparity vectors are scaled to synthesize a virtual view I_V. Since the stereo matching is not one-to-one in general, the forward warping may generate holes in the virtual view I_V. Also, if a scaled disparity vector s · d has non-integer-pixel accuracy, a post-processing step is required to interpolate pixel values on the integer-pixel grid of I_V. However, the forward warping has the advantage that it can detect occluded regions effectively [3]. On the other hand, in the inverse warping, as shown in Fig. 1 (b), bi-directional matching is performed directly on the coordinate system of the virtual view to be synthesized [4, 5]. The inverse warping does not generate holes in the synthesized view, nor does it require the post-processing step. However, occlusions are harder to detect in the inverse warping, and they tend to yield ghost artifacts in the synthesized view.

This work was supported partly by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2008-331-D00420) and partly by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the Institute of Information Technology Advancement (grant number IITA-2009-C1090-0902-0017).
Fig. 1. Two approaches for view synthesis [2]: (a) forward warping and (b) inverse warping.
In this work, we propose a view synthesis algorithm for multi-view video sequences, which combines the complementary advantages of the forward warping and the inverse warping. First, we employ the inverse warping to estimate disparity vectors directly on the coordinate system of a virtual view using the nearest two reference views. Second, we detect occluded regions based on the uniqueness constraint. Then, we employ the forward warping from the farther reference views to conceal the occluded regions. Moreover, we introduce a temporal coherence term to estimate disparity vectors more accurately. Simulation results demonstrate that the proposed algorithm synthesizes virtual views faithfully, while reducing occlusion artifacts.

The rest of this paper is organized as follows. Section 2 describes the proposed algorithm. Section 3 presents experimental results. Finally, concluding remarks are given in Section 4.

2. PROPOSED ALGORITHM

As shown in Fig. 2, the proposed algorithm synthesizes a virtual view I_V using the information in four reference views I_1, I_2, I_3, and I_4 in a multi-view video sequence. We assume that the multi-view cameras are in a 1-D parallel configuration, the distance between two adjacent cameras is 1, and the distance between I_2 and I_V is s. To synthesize and render the virtual view I_V, we should compute accurate disparity vectors. To this end, we employ the belief propagation (BP) algorithm in [6].
Fig. 2. A virtual view IV is synthesized using the information in four reference views I1 , I2 , I3 , and I4 . We assume that the distance between two adjacent reference views is 1, the distance between I2 and IV is s, and the distance between IV and I3 is 1 − s.
Let q denote a neighboring pixel of pixel p, and let d_p denote the disparity label of p. Also, let d denote the matrix that contains the disparity labels of all pixels in an image. Then, for a candidate matrix d, we define the energy function as

E(d) = E_data(d) + E_smooth(d) = Σ_p D(d_p) + Σ_(p,q) V(d_p, d_q),   (1)
where E_data(d) is the data term, i.e., the sum of the matching errors D(d_p) between each pixel p and its corresponding point specified by d_p. The smoothness term E_smooth(d) is the sum of the disparity differences between neighboring pixels p and q. Specifically, the disparity difference is defined as V(d_p, d_q) = min{|d_p − d_q|, τ_s} to avoid assigning excessively high costs across real depth edges. The disparity vector matrix d is selected to minimize the energy function E(d) in Eq. (1).
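To make the optimization target concrete, here is a minimal sketch of how the energy in Eq. (1) could be evaluated for a candidate disparity map. The array layout, the 4-neighborhood, and all function names are our own illustrative assumptions, not part of the BP implementation in [6].

```python
import numpy as np

def energy(disp, data_cost, tau_s):
    """Evaluate Eq. (1) for a candidate disparity map.

    disp      : (H, W) integer disparity label of each pixel p.
    data_cost : (H, W, L) precomputed matching costs D(d_p) for all L labels.
    tau_s     : truncation threshold tau_s of the smoothness term.
    """
    h, w = disp.shape
    ys, xs = np.mgrid[0:h, 0:w]
    e_data = data_cost[ys, xs, disp].sum()  # data term: sum of D(d_p) over p

    # Smoothness term: V(d_p, d_q) = min(|d_p - d_q|, tau_s), summed over
    # horizontal and vertical neighbor pairs (4-neighborhood).
    e_smooth = np.minimum(np.abs(np.diff(disp, axis=1)), tau_s).sum() \
             + np.minimum(np.abs(np.diff(disp, axis=0)), tau_s).sum()
    return e_data + e_smooth
```

In practice, BP minimizes this energy by message passing rather than by exhaustive evaluation; the sketch only fixes the notation.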
Fig. 3. Occlusion detection. The red region O_R contains the pixels that are occluded, i.e., not visible, in the right view I_3. The blue region O_L contains the pixels occluded in the left view I_2.
2.1. Stereo Matching Based on Inverse Warping

We first estimate the disparity vector d_p of each pixel p in the virtual view I_V using the left view I_2 and the right view I_3. We employ the inverse warping, and use the bi-directional matching cost as the data cost D(d_p) in Eq. (1). Specifically,

D(d_p) = λ_d · min{ |I_2(x + s · d_p, y) − I_3(x − (1 − s) · d_p, y)|, τ_d }.

After obtaining the disparity vectors using the belief propagation algorithm, we render pixel p = (x, y) in the virtual view I_V by

I_V(x, y) = w · I_2(x + s · d_p, y) + (1 − w) · I_3(x − (1 − s) · d_p, y),

where the weighting coefficient w is set to 1 − s, so that the pixel in the nearer view is assigned a higher weight.

2.2. Occlusion Detection

Although the inverse warping generates a virtual view without holes, it often yields ghost and blurring artifacts in occluded regions. Therefore, we detect the occluded regions using the uniqueness constraint, which was proposed for the forward mapping in [7]. For example, in Fig. 2, suppose that we match pixels in view I_1 to view I_2. If two or more pixels in I_1 are matched to the same pixel in I_2, those pixels tend to be occluded in I_2 and to gather to the left side of a foreground object in I_1. We can use the uniqueness constraint in the inverse mapping as well. During the bi-directional matching, if two or more pixels in I_V are matched to the same pixel in the left view I_2, those pixels in I_V are declared to be occluded in the left view. Those pixels compose the occluded region O_L, depicted as the blue region in Fig. 3. Similarly, if two or more pixels in I_V are matched to the same pixel in the right view I_3, those pixels compose the occluded region O_R, depicted as the red region in Fig. 3.
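As a concrete illustration of Secs. 2.1 and 2.2, the sketch below renders the virtual view by inverse warping and flags occluded pixels with the uniqueness constraint. It assumes rectified grayscale views as numpy arrays, a given integer disparity map on the virtual-view grid, and rounds warped coordinates to the nearest pixel; the first-come tie-breaking in the occlusion test and all function names are our simplifications.

```python
import numpy as np

def render_inverse(i2, i3, disp, s):
    """I_V(x,y) = w*I2(x+s*d, y) + (1-w)*I3(x-(1-s)*d, y) with w = 1 - s."""
    h, w_img = i2.shape
    ys, xs = np.mgrid[0:h, 0:w_img]
    xl = np.clip(np.round(xs + s * disp).astype(int), 0, w_img - 1)
    xr = np.clip(np.round(xs - (1 - s) * disp).astype(int), 0, w_img - 1)
    return (1 - s) * i2[ys, xl] + s * i3[ys, xr]

def detect_occlusions(disp, s):
    """Uniqueness constraint: a virtual-view pixel whose match in a reference
    view was already claimed by another pixel is declared occluded there."""
    h, w_img = disp.shape
    occ_l = np.zeros((h, w_img), dtype=bool)  # O_L: occluded in left view I2
    occ_r = np.zeros((h, w_img), dtype=bool)  # O_R: occluded in right view I3
    for y in range(h):
        seen_l, seen_r = set(), set()
        for x in range(w_img):
            xl = int(round(x + s * disp[y, x]))
            xr = int(round(x - (1 - s) * disp[y, x]))
            occ_l[y, x] = xl in seen_l
            occ_r[y, x] = xr in seen_r
            seen_l.add(xl)
            seen_r.add(xr)
    return occ_l, occ_r
```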
2.3. Occlusion Handling Based on Forward Warping

To refine the pixel values in the occluded regions O_L and O_R, we use the information in the farther views I_4 and I_1, respectively.

• Case 1 - Occluded region O_L: The pixels in O_L tend not to be visible in the left view I_2 and to gather to the right sides of foreground objects. Therefore, to synthesize the pixels in O_L, we use the information in the right views I_3 and I_4. Specifically, we obtain the disparity vector d_43, which matches a pixel in I_4 to I_3, using the conventional BP algorithm [6] with the matching cost

D(d_43) = λ_d · min{ |I_4(x, y) − I_3(x + d_43, y)|, τ_d }.

Then, based on the photo-consistency among the matching pixels, a virtual pixel in O_L is refined by

I_V(x + (2 − s) · d_43, y) = w′ · I_3(x + d_43, y) + (1 − w′) · I_4(x, y),

where the weighting coefficient w′ = (1 − s)^−1 / ((1 − s)^−1 + (2 − s)^−1), so that the pixels in I_3 and I_4 are weighted by the inverses of their distances from the virtual view I_V.

Fig. 4. The motion vectors and disparity vectors in the inverse warping.
Fig. 5. Comparison of the synthesized views and their enlarged parts in the “Akko and Kayo” sequence: (a) original, (b) inverse warping, and (c) inverse warping + occlusion handling.
• Case 2 - Occluded region O_R: Similarly, the pixels in O_R are not visible in the right view I_3, and they are rendered using the information in the left views I_1 and I_2. Based on the BP algorithm, we obtain the disparity vector d_12 using the matching cost

D(d_12) = λ_d · min{ |I_1(x, y) − I_2(x + d_12, y)|, τ_d }.

Then, a virtual pixel in O_R is synthesized via

I_V(x + (1 + s) · d_12, y) = w′ · I_1(x, y) + (1 − w′) · I_2(x + d_12, y),

where w′ = (1 + s)^−1 / ((1 + s)^−1 + s^−1).
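The two refinement cases can be sketched together as below, under the same assumptions as the previous snippet; d43 and d12 are disparity maps estimated on the grids of I4 and I1, respectively, and writes that fall outside the image or on non-occluded pixels are simply skipped. The weights follow the inverse-distance rule above.

```python
import numpy as np

def refine_occlusions(iv, i1, i2, i3, i4, d43, d12, occ_l, occ_r, s):
    """Forward-warp the farther views into the occluded regions of I_V."""
    h, w_img = iv.shape
    w1 = (1 - s)**-1 / ((1 - s)**-1 + (2 - s)**-1)  # Case 1 weight w'
    w2 = (1 + s)**-1 / ((1 + s)**-1 + s**-1)        # Case 2 weight w'
    for y in range(h):
        for x in range(w_img):
            # Case 1: fill O_L from the right views I3 and I4.
            d = d43[y, x]
            xv, x3 = int(round(x + (2 - s) * d)), int(round(x + d))
            if 0 <= xv < w_img and 0 <= x3 < w_img and occ_l[y, xv]:
                iv[y, xv] = w1 * i3[y, x3] + (1 - w1) * i4[y, x]
            # Case 2: fill O_R from the left views I1 and I2.
            d = d12[y, x]
            xv, x2 = int(round(x + (1 + s) * d)), int(round(x + d))
            if 0 <= xv < w_img and 0 <= x2 < w_img and occ_r[y, xv]:
                iv[y, xv] = w2 * i1[y, x] + (1 - w2) * i2[y, x2]
    return iv
```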
2.4. Temporal Coherence Term

In typical multi-view sequences, a frame at time instance t is highly correlated with the frame at the previous time instance t − 1. Therefore, during the inverse warping, we can improve the reliability of the disparity estimation by introducing a temporal coherence term. Fig. 4 shows the relationship among the disparity vectors and the motion vectors. Specifically, we have

p_V^(t−1) = p_V^t − (1 − s) · d^t + m_R + (1 − s) · d^(t−1) = p_V^t + s · d^t + m_L − s · d^(t−1).   (2)
In general, an object point has similar depths in temporally adjacent frames, i.e., d^t ≈ d^(t−1). Therefore, from Eq. (2), we have m_L ≈ m_R. Intuitively, the motion vector m_L of a point in the left view should be similar to the motion vector m_R of the corresponding point in the right view. Therefore, we first compute the motion vector fields of the left view and the right view. Then, for each disparity candidate d^t of pixel p_V^t, we find the corresponding points in the left view and the right view and their motion vectors m_L and m_R. Finally, we compute the temporal coherence term as

E_motion(d) = Σ_p M(d_p),   (3)

where

M(d_p) = λ_m · min{ ‖m_L − m_R‖, τ_m }.
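For one disparity candidate, the coherence cost could be evaluated as in the sketch below; mv_l and mv_r are the precomputed motion fields of the left and right views, and the coordinate rounding and clipping are our simplifications.

```python
import numpy as np

def motion_cost(x, y, d, s, mv_l, mv_r, lam_m, tau_m):
    """M(d_p) = lam_m * min(||m_L - m_R||, tau_m) for candidate disparity d
    at virtual-view pixel (x, y); mv_l, mv_r are (H, W, 2) motion fields."""
    w_img = mv_l.shape[1]
    xl = min(max(int(round(x + s * d)), 0), w_img - 1)        # point in left view
    xr = min(max(int(round(x - (1 - s) * d)), 0), w_img - 1)  # point in right view
    return lam_m * min(np.linalg.norm(mv_l[y, xl] - mv_r[y, xr]), tau_m)
```

This cost is added to D(d_p) and V(d_p, d_q) of Eq. (1) during the BP optimization.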
Fig. 6. Comparison of the synthesized views and their enlarged parts in the “Newspaper” sequence: (a) original, (b) inverse warping, and (c) inverse warping + occlusion handling.

By using the temporal coherence term E_motion(d) in addition to the data term and the smoothness term in the energy function in Eq. (1), we can acquire more accurate disparity vectors for multi-view sequences.

3. EXPERIMENTAL RESULTS

We evaluate the performance of the proposed algorithm on two multi-view video sequences, “Akko and Kayo” [8] and “Newspaper” [9]. The “Akko and Kayo” sequence has the spatial resolution 640 × 480, while the “Newspaper” sequence has the resolution 1024 × 768. To measure the quality of synthesized view frames objectively, we synthesize the 48th view (I_V) using the 46th (I_1), 47th (I_2), 49th (I_3), and 50th (I_4) views in the “Akko and Kayo” sequence, and compare the synthesized view with the real view. Similarly, we generate the 4th view from the 2nd, 3rd, 5th, and 6th views in the “Newspaper” sequence. The BP parameters are fixed in all experiments: λ_d = 0.07, λ_m = 0.04, τ_d = 15.0, τ_s = 5.0, and τ_m = 15.0.
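The objective quality measure used below is the standard PSNR between the real and synthesized frames; a minimal sketch for 8-bit frames:

```python
import numpy as np

def psnr(ref, synth):
    """PSNR in dB between a real view frame and the synthesized frame (8-bit)."""
    mse = np.mean((ref.astype(np.float64) - synth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```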
Fig. 7. Comparison of the estimated depth maps for the “Akko and Kayo” sequence: (a) without and (b) with the temporal coherence term.
Fig. 5 (a) shows an original 48th view frame in the “Akko and Kayo” sequence and its enlarged part. Fig. 5 (b) shows the corresponding synthesized view, obtained by the inverse warping using I_2 and I_3 only. We see that there are blurring artifacts near the object boundaries, and some details are lost. Fig. 5 (c) shows the result when the proposed occlusion handling is incorporated with the inverse warping. We see that the details are reconstructed more faithfully. Fig. 6 shows the results on the “Newspaper” sequence, which exhibit similar tendencies to Fig. 5. Fig. 7 shows how the temporal coherence term improves the accuracy of the disparity estimation. We see that, by using the temporal coherence term, the proposed algorithm effectively suppresses the artifacts due to incorrectly estimated disparity vectors.

Next, we compare the PSNR performances in Fig. 8. Compared with the inverse warping using the two nearest views only, the occlusion handling improves the PSNR performance by more than 1 dB on both sequences. Also, we see that the proposed algorithm provides even better performance when the temporal coherence term is incorporated into the energy function. Specifically, the temporal coherence term improves the performance by about 0.4 dB on the “Akko and Kayo” sequence and about 0.1 dB on the “Newspaper” sequence. These simulation results indicate that the proposed view synthesis algorithm is effective for rendering virtual views.

4. CONCLUSION

In this work, we proposed a virtual view synthesis algorithm for multi-view video sequences, which combines the complementary advantages of the forward warping and the inverse warping. After synthesizing a view without holes using the inverse warping, the proposed algorithm detects occluded regions using the uniqueness constraint and refines them using the forward warping from the farther views. Simulation results demonstrated that the proposed algorithm synthesizes virtual views more faithfully than the conventional inverse warping. Moreover, it was shown that the temporal coherence term can improve the accuracy of the disparity estimation.

5. REFERENCES

[1] A. Fusiello, “Specifying virtual cameras in uncalibrated view synthesis,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 5, pp. 604–611, May 2007.

[2] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. Comput. Vision, vol. 47, pp. 7–42, Apr. 2002.
Fig. 8. Comparison of PSNR performances on the (a) “Akko and Kayo” and (b) “Newspaper” sequences.
[3] J. Sun, Y. Li, S. B. Kang, and H.-Y. Shum, “Symmetric stereo matching for occlusion handling,” in Proc. IEEE CVPR, June 2005, vol. 2, pp. 399–406.

[4] A. Mancini and J. Konrad, “Robust quadtree-based disparity estimation for the reconstruction of intermediate stereoscopic images,” in Proc. SPIE Stereoscopic Displays and Virtual Reality Systems, Jan. 1998, vol. 3295, pp. 53–64.

[5] J. Lu, S. Rogmans, G. Lafruit, and F. Catthoor, “High-speed stream-centric dense stereo and view synthesis on graphics hardware,” in Proc. IEEE Workshop on Multimedia Signal Processing, Oct. 2007, pp. 234–246.

[6] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient belief propagation for early vision,” Int. J. Comput. Vision, vol. 70, pp. 41–54, Oct. 2006.

[7] D. Marr and T. Poggio, “Cooperative computation of stereo disparity,” Science, vol. 194, pp. 283–287, Oct. 1976.

[8] M. Tanimoto, T. Fujii, T. Senoh, T. Aoki, and Y. Sugihara, “Test sequences with different camera arrangements for call for proposals on multiview video coding,” ISO/IEC JTC1/SC29/WG11 M12338, July 2005.

[9] Y.-S. Ho, E.-K. Lee, and C. Lee, “Multi-view video test sequence and camera parameters,” ISO/IEC JTC1/SC29/WG11 M15419, Apr. 2008.