PROJECTIVE RECTIFICATION-BASED VIEW INTERPOLATION FOR MULTIVIEW VIDEO CODING AND FREE VIEWPOINT GENERATION

Xiaoyu Xiu, Jie Liang
School of Engineering Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada

This work was supported in part by Canada NSERC under grants RGPIN312262-05, EQPEQ330976-2006 and STPGP350740-2007. Authors' E-mails: {xxa4, jiel}@sfu.ca.

ABSTRACT

A projective rectification-based view interpolation algorithm is developed for multiview video coding and free viewpoint video. It first calculates the fundamental matrix between two views without using any camera parameter. The two views are then resampled so that they have horizontal and matched epipolar lines. A one-dimensional disparity is estimated next and used to interpolate the image at an intermediate viewpoint. After unrectification, the interpolated view can be displayed directly for free viewpoint video. It can also be used as a reference to encode the data of an intermediate camera. Experimental results show that the interpolated views can be 3 dB better than those of an existing method. Video coding results show that the method can provide up to 1.3 dB improvement over JMVC.

Index Terms— Multiview Video Coding, View Interpolation, View Rectification.

1. INTRODUCTION

Recent advances in computers, displays, cameras, and signal processing make it possible to deploy next-generation visual communication services such as 3D-TV and Free Viewpoint Video (FVV). The former offers a 3D depth impression of the observed scene, while the latter further allows interactive selection of viewpoints and generation of new views from any viewpoint. Since multiple cameras are used to capture the scenes, efficient compression of the multiview video data is crucial to these services. High-quality view synthesis is also required by free viewpoint applications.

Many multiview video coding (MVC) schemes use block-based disparity-compensated prediction to exploit inter-viewpoint correlation [1, 2], which describes where each block of the current view comes from in its neighboring views. This is usually used together with conventional motion compensation within one view. However, the translational inter-view motion assumed by the disparity compensation approach cannot accurately represent the geometric relationships between different cameras. Therefore, depth-map-based prediction algorithms have also been proposed. The algorithm in [3] first uses a block-based depth search method to extract the optimal depth. It then synthesizes a virtual view using the depth map. In [4], 3D image warping and relief texture mapping are used to encode the video and the depth information. However, these methods require the encoding of the depth map for each camera.

Geometry-based view interpolation can also be used for MVC and FVV. An attractive feature of this approach is that it does not need a depth map, so no extra bits are sent by the encoder. View interpolation methods usually need two views to produce a virtual view between them, and most of them first calculate a disparity map and then interpolate the intermediate picture. In [5], the cost function for disparity estimation considers the smoothness of disparity transitions. Recently, this method has been used in [6] for MVC. In [7], a cost function that is insensitive to image sampling is used, and a scanline algorithm speeds up the disparity matching. In [8], the graph-cut algorithm is adopted for the disparity map, and the occlusion problem is addressed by calculating the disparity maps of both the left and right pictures.

The aforementioned view-interpolation approaches assume that aligned cameras are used, where all cameras can only differ from each other by horizontal shifts. For more complex camera setups, an interpolation method for a chain of cameras is proposed in [9]. It first rectifies the two views based on the method in [10], which calculates the fundamental matrix between two views and then resamples them such that they have horizontal and matched epipolar lines. A modified version of the disparity estimation method in [7] is then used for the interpolation. The algorithm does not require the camera parameters, and there is no limitation on the camera setup, as long as the distance between neighboring cameras is not too large.

In this paper, by modifying the rectification, disparity estimation, and unrectification of the algorithms in [9, 10], we obtain an improved projective rectification-based view interpolation framework and apply it to MVC and FVV generation. Experimental results show that the interpolated views can be more than 3 dB better than those of the view interpolation method in [6]. Video coding results reveal that the method can provide up to 1.3 dB improvement over the Joint Multiview Video Coding (JMVC) reference software.

The main objective of this paper is to investigate the performance of projective rectification in view interpolation and MVC. Although the complexity of our method is higher than that of existing schemes, the rapid development of parallel computing can improve the feasibility of the method in the near future; for example, a real-time view interpolation scheme using GPU computing is reported in [11]. Reducing the complexity of the algorithm is also our ongoing research. Since view interpolation needs two neighboring views to interpolate the middle view, the proposed method can only be applied to half of the views. However, this problem can be resolved by using rectification-based view extrapolation instead of interpolation; our result is reported in [12].

2. PROJECTIVE RECTIFICATION

In this section, we present an improved version of the projective view rectification algorithms in [9, 10]. The method resamples the two views so that their image planes are parallel and corresponding points have the same vertical coordinates. The disparity between the two rectified views therefore reduces to a one-dimensional shift, which simplifies both disparity estimation and view interpolation.

The first step in projective rectification is to estimate the fundamental matrix, which characterizes the epipolar geometry between the two views [13]. In MVC and FVV, it is desirable that the matrix be obtained without using any camera parameter. Suppose a point X in 3-D space is projected to a point x in one view; then its projection x′ in the other view must lie on the line Fx, where F is the 3 × 3 fundamental matrix with rank 2 and seven degrees of freedom. Therefore the following relationship holds [13]:

$$\mathbf{x}'^T F \mathbf{x} = 0, \qquad (1)$$

where x and x′ are 3 × 1 homogeneous coordinates [13]. This equation is linear in the entries of the fundamental matrix F. If enough point correspondences are available, various algorithms can be used to calculate F, such as the 7-point algorithm, the 8-point algorithm, and the least-squares algorithm [13]. In this paper, the point correspondences are automatically selected from the two images using corner detection and the Random Sample Consensus (RANSAC) algorithm [13]. The implementation in [14] is modified to calculate F from the selected point correspondences.

The epipoles of the two views are the intersections between the line joining the two camera centers and the two image planes. Once the fundamental matrix F is known, the two epipoles can be obtained from the left and right null spaces of F [13].

The basic rectification algorithm in [9, 10] consists of three steps. It first translates the coordinate origin to the image center via

$$T = \begin{bmatrix} 1 & 0 & -c_x \\ 0 & 1 & -c_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad (2)$$

where c = (c_x, c_y) is the image center. Suppose the epipole is at e = (e_x, e_y, 1)^T after the translation. The next step is to rotate the image such that the epipole moves to the x-axis, i.e., its homogeneous coordinate becomes (f, 0, 1)^T. The required rotation is

$$R = \begin{bmatrix} \alpha e_x & \alpha e_y & 0 \\ -\alpha e_y & \alpha e_x & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3)$$

where α = 1 if e_x ≥ 0 and α = −1 otherwise. Given the new epipole position (f, 0, 1)^T, the following transformation is then applied to map the epipole to infinity:

$$G = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1/f & 0 & 1 \end{bmatrix}. \qquad (4)$$

The overall rectification matrix is thus

$$H = GRT. \qquad (5)$$
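To make the estimation step concrete, here is a minimal sketch of how F and the epipoles might be computed with OpenCV and NumPy. This is not the authors' implementation: the feature detector, matching strategy, RANSAC threshold, and file names are all illustrative assumptions.

```python
import cv2
import numpy as np

# Detect and match features across the two views.  ORB is an illustrative
# choice; the paper uses corner detection plus the implementation of [14].
img_l = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img_r = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create(2000)
kp_l, des_l = orb.detectAndCompute(img_l, None)
kp_r, des_r = orb.detectAndCompute(img_r, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_l, des_r)
pts_l = np.float32([kp_l[m.queryIdx].pt for m in matches])
pts_r = np.float32([kp_r[m.trainIdx].pt for m in matches])

# Rank-2 fundamental matrix via RANSAC, cf. Eq. (1); mask flags the inliers.
F, mask = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC, 1.0, 0.99)

# Epipoles from the null spaces of F: F e = 0 and F^T e' = 0, so each is the
# right singular vector associated with the (near-)zero singular value.
e = np.linalg.svd(F)[2][-1]
e = e / e[2]                      # epipole in the left image
e_p = np.linalg.svd(F.T)[2][-1]
e_p = e_p / e_p[2]                # epipole in the right image

# Sanity check of x'^T F x = 0 on an inlier pair (should be near zero).
i = int(np.flatnonzero(mask)[0])
x = np.append(pts_l[i], 1.0)
x_p = np.append(pts_r[i], 1.0)
print("epipolar residual:", float(x_p @ F @ x))
```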

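Given an epipole from the step above, Eqs. (2)-(5) compose the rectifying homography. Below is a minimal sketch (a hypothetical helper, not the authors' code); note that it normalizes Eq. (3) so that R is a pure rotation, whereas the paper's unnormalized form differs only by a uniform image scaling.

```python
import numpy as np

def rectification_matrix(e, width, height):
    """Compose H = G R T of Eq. (5), mapping epipole e to infinity."""
    cx, cy = width / 2.0, height / 2.0
    # Eq. (2): translate the coordinate origin to the image center.
    T = np.array([[1.0, 0.0, -cx],
                  [0.0, 1.0, -cy],
                  [0.0, 0.0, 1.0]])
    ex, ey, _ = T @ (e / e[2])
    alpha = 1.0 if ex >= 0 else -1.0
    n = np.hypot(ex, ey)
    # Eq. (3): rotate the epipole onto the x-axis (normalized here).
    R = np.array([[ alpha * ex / n, alpha * ey / n, 0.0],
                  [-alpha * ey / n, alpha * ex / n, 0.0],
                  [ 0.0,            0.0,            1.0]])
    f = (R @ np.array([ex, ey, 1.0]))[0]   # rotated epipole is (f, 0, 1)
    # Eq. (4): send (f, 0, 1) to infinity.
    G = np.array([[1.0,      0.0, 0.0],
                  [0.0,      1.0, 0.0],
                  [-1.0 / f, 0.0, 1.0]])
    return G @ R @ T                       # Eq. (5)
```

The view can then be warped with, e.g., cv2.warpPerspective(img_l, rectification_matrix(e, w, h), (w, h)); the resolution compensation derived from Eq. (7) below amounts to left-multiplying H by a scaling matrix diag(s, s, 1).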
In [9], the scheme in (5) is used to obtain the rectification matrices H and H′ for the left and right views, respectively. However, the performance of this method is mainly determined by the accuracy of the calculated epipoles. In [10], a more robust and accurate matching transformation method is used, where the transformation H for the left view is still obtained by (5), but H′ for the right view is obtained by finding a matching transform that minimizes the mismatch of the two rectified views. However, this method needs to solve for the camera matrices. In this paper, we optimize the matching transform H′ for the right view by minimizing the distances between a group of rectified corresponding points in the two views, i.e.,

$$\arg\min_{H'} \sum_i \| H \mathbf{x}_i - H' \mathbf{x}'_i \|^2, \qquad (6)$$

where x_i and x′_i are the most accurate point correspondences in the two images, selected by the RANSAC algorithm. The Levenberg–Marquardt algorithm [13] is used to find the optimal solution of H′, and the result of (5) is used to initialize the iteration. We will show in Sec. 5 that a better result than [9] can be achieved using only 25 point correspondences.
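A minimal sketch of this refinement using SciPy's Levenberg–Marquardt solver follows (a tooling assumption; the paper specifies the algorithm but no library). For simplicity, H′ is parameterized directly by its nine entries, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def match_right_transform(H_left, pts_l, pts_r, H_right_init):
    """Refine H' by minimizing Eq. (6) over the selected correspondences."""
    def apply_h(H, pts):
        q = np.c_[pts, np.ones(len(pts))] @ H.T   # homogeneous warp
        return q[:, :2] / q[:, 2:3]               # back to pixel coordinates

    left_rect = apply_h(H_left, pts_l)            # fixed target positions

    def residuals(h):
        return (apply_h(h.reshape(3, 3), pts_r) - left_rect).ravel()

    # method="lm" is Levenberg-Marquardt; start from the Eq. (5) estimate.
    sol = least_squares(residuals, H_right_init.ravel(), method="lm")
    return sol.x.reshape(3, 3)
```

Here H_left is the Eq. (5) matrix of the left view, pts_l and pts_r are the 25 RANSAC-selected correspondences, and the Eq. (5) estimate for the right view serves as the initial guess H_right_init.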

After the rectification, the resolution of the original input image is preserved at the origin of the coordinate system [10]. However, the resolutions of other regions are usually down-scaled, which can decrease the quality of the interpolated view. The down-scaling factor m at pixel (x′, y′) is given by [9]

$$m(x', y') = \begin{vmatrix} \dfrac{\partial x'}{\partial x} & \dfrac{\partial x'}{\partial y} \\[2pt] \dfrac{\partial y'}{\partial x} & \dfrac{\partial y'}{\partial y} \end{vmatrix}. \qquad (7)$$

Usually m is monotonic in x′ and y′, and the minimum value m_o occurs at one of the four image corners. We thus increase the resolution of the down-scaled regions by a factor of √m_o to compensate for the loss of resolution, as in [9].

Note that basic epipolar geometry is also exploited in [15], but it assumes that the fundamental matrix is known, and it only uses the theory to predict the 2-D disparity vectors, whereas our scheme uses the epipolar geometry for rectification, 1-D disparity estimation, and view interpolation. In addition, [15] only considers multiview image coding instead of multiview video coding.

3. DISPARITY ESTIMATION

After rectification, disparity estimation can be performed in 1-D. A 1-D dynamic programming method is used in [7] to estimate the disparity image. However, independent processing of different scan lines leads to horizontal stripes in the disparity map. Several algorithms based on graph cuts have been proposed [16], which achieve more accurate disparity estimation, but they cannot handle occlusions well, because the two images are treated asymmetrically, and no constraint is imposed to ensure that a pixel corresponds to at most one pixel in the other image. In [8], a smoothness term is introduced into the cost function to favor solutions with small changes between neighbors. The method can also deal with occlusions by computing the disparities of both the left and right images, while preserving the advantages of graph cuts. Its energy cost function is defined as

$$E(f) = E_{data}(f) + E_{occ}(f) + E_{smooth}(f), \qquad (8)$$

where E_data(f) results from the intensity differences between corresponding pixels, E_occ(f) imposes a penalty for marking a pixel as occluded, and the smoothness term E_smooth(f) ensures that neighboring pixels in the same image have similar disparities.

The disparity estimation in [9] is based on the method in [7], with an extra term added to the cost function to improve the smoothness of the disparity map. However, our experimental results show that the improvement is not always satisfactory. In this paper, we therefore use the more accurate method in [8] for disparity estimation.
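The graph-cut optimization of [8] is too long to reproduce here. As a rough illustration of what the 1-D disparity stage consumes and produces, the sketch below substitutes OpenCV's semi-global matcher, which also balances a data term against smoothness penalties, for the actual method of [8]; parameters and file names are illustrative.

```python
import cv2
import numpy as np

# Rectified left/right views: the disparity search is purely horizontal.
rect_l = cv2.imread("rect_left.png", cv2.IMREAD_GRAYSCALE)
rect_r = cv2.imread("rect_right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=96,      # multiple of 16; depends on the camera baseline
    blockSize=5,
    P1=8 * 5 * 5,           # smoothness penalties, cf. E_smooth in Eq. (8)
    P2=32 * 5 * 5,
)
# OpenCV returns fixed-point disparities scaled by 16.
disp_l = matcher.compute(rect_l, rect_r).astype(np.float32) / 16.0

# Disparity of the right image, needed for symmetric occlusion handling as
# in [8]: match the horizontally flipped pair, then flip the result back.
disp_r = matcher.compute(cv2.flip(rect_r, 1),
                         cv2.flip(rect_l, 1)).astype(np.float32) / 16.0
disp_r = cv2.flip(disp_r, 1)
```

Computing both disparity maps is what enables the symmetric occlusion reasoning described above.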

4. VIEW INTERPOLATION AND UNRECTIFICATION

View interpolation can be performed once the disparity is known. However, although two views are available, there is no guarantee that every pixel in one view has a corresponding pixel in the other view, due to occlusion. In this paper, we follow the standard approach used in many papers such as [4, 7, 8].

[Figure 1: (a) Y-PSNR (dB) of the proposed method and Yamamoto's method versus the camera index difference x; (b) Y-PSNR (dB) of the proposed method versus frame index.]

Fig. 1. View interpolation performance: (a) our method and [6] for the Xmas sequence; (b) our method for the Breakdancers sequence.

If a pixel is visible in both views, its position in the intermediate view can be obtained by simply interpolating the disparity value, and its color can likewise be interpolated from the corresponding pixels in the left and right views. If a pixel is seen in only one view, no disparity value is available for it; in this case, we extend the disparities of the background into the occluded area, and the color value is copied similarly. For pixels whose corresponding pixels fall outside the valid image area in the other view, we extend the disparity of the border pixels, and the pixel color is copied accordingly. (A code sketch of this procedure is given at the end of this section.)

4.1. Un-rectification

The last step of view interpolation is to project the interpolated view back to the original format, so an unrectification transformation matrix is needed for the intermediate view. In this paper, we first locate the positions of the four corners of the interpolated view, denoted x_i, i = 1, …, 4. Our goal is to find a 3 × 3 unrectification matrix B that minimizes the mapping error from these points to the four corners of the unrectified image, i.e.,

$$\arg\min_{B} \sum_{i=1}^{4} \| B \mathbf{x}_i - \mathbf{x}'_i \|^2, \qquad (9)$$

where x′_i are the homogeneous coordinates of the four corners of the unrectified view. The direct linear transform (DLT) algorithm in [13] (Chap. 4) is used here, which converts (9) into the constrained least-squares problem

$$\arg\min_{\mathbf{b}} \| A \mathbf{b} \|, \quad \text{s.t.} \ \| \mathbf{b} \| = 1, \qquad (10)$$

where b = [b_1 b_2 b_3]^T (b_i is the i-th row of B), i.e., the vectorized version of B. The matrix A is 8 × 9, and each pair of corner correspondences contributes two rows to A. The optimal solution b_o of (10) is the unit singular vector corresponding to the smallest singular value of A.
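As a concrete illustration, here is a minimal NumPy sketch of this DLT step, a direct transcription of Eqs. (9) and (10) rather than the authors' implementation; the corner coordinates and frame size are made-up values.

```python
import numpy as np

def dlt_unrectify_matrix(src, dst):
    """Solve Eq. (10): two rows of A per corner correspondence, then take
    the right singular vector of the smallest singular value as b = vec(B)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.asarray(A, dtype=float)   # 8 x 9 for the four corners
    _, _, Vt = np.linalg.svd(A)
    b = Vt[-1]                       # unit vector minimizing ||Ab||
    return b.reshape(3, 3)           # rows b1, b2, b3 of B

# Hypothetical example: corners of the interpolated (rectified) view mapped
# to the corners of a W x H output frame.
W, H = 1024, 768
corners_in = [(12.0, -7.0), (1030.5, 3.2), (1018.0, 760.9), (-4.4, 771.3)]
corners_out = [(0, 0), (W - 1, 0), (W - 1, H - 1), (0, H - 1)]
B = dlt_unrectify_matrix(corners_in, corners_out)
```

With exactly four correspondences, A generically has a one-dimensional null space, so the recovered B reproduces the mapping exactly; the same routine also works in a least-squares sense with more points.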

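Finally, the interpolation sketch promised at the start of this section: a heavily simplified version (forward mapping with rounding, a single left-referenced disparity map, and a viewpoint position alpha in [0, 1]), in which the paper's background-extension rules are approximated by a crude nearest-neighbor row fill.

```python
import numpy as np

def interpolate_view(left, right, disp_l, alpha=0.5):
    """Synthesize a view at position alpha between two rectified views.

    Pixels visible in both views are blended; holes left by occlusions are
    filled afterwards by extending neighboring values, cf. Sec. 4."""
    h, w = disp_l.shape
    mid = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disp_l[y, x]
            if d < 0:                          # no valid match (occlusion)
                continue
            xm = int(round(x - alpha * d))     # position in the middle view
            xr = int(round(x - d))             # corresponding right pixel
            if 0 <= xm < w and 0 <= xr < w:
                mid[y, xm] = (1 - alpha) * left[y, x] + alpha * right[y, xr]
                filled[y, xm] = True
    # Hole filling: propagate the nearest filled value along each row, a
    # simple stand-in for the background/border disparity extension above.
    for y in range(h):
        for x in range(1, w):
            if not filled[y, x]:
                mid[y, x] = mid[y, x - 1]
    return mid
```

In the full algorithm, the disparities of both views and the explicit background and border extension rules of Sec. 4 replace this simple hole filling.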
5. EXPERIMENTAL RESULTS

We first demonstrate the performance of the rectification-based view interpolation method by comparing it with the view interpolation method in [6], using the same testing conditions as Fig. 10 of [6]. In this example, 101 images of a fixed scene are taken by moving a camera to different positions along a straight line. The maximal disparity between the first and the last views is 70 pixels. Since the cameras are almost perfectly aligned, the rectification step in our method is not necessary, so we essentially compare only the disparity estimation and view interpolation of the two methods. As in [6], we calculate the Y-PSNR between a specified view and the interpolated image of the same view, using left and right reference views with index difference x. The results are shown in Fig. 1 (a), where each PSNR is the average of the PSNRs from picture 45 to picture 55, as defined in [6]. It can be seen that our method significantly outperforms [6] when x > 18, i.e., when the disparity between the left and right views is greater than 12.6 pixels, which is the case in most systems. The maximal gain is around 3 dB. Note that the method in [6] only performs well for aligned cameras, whereas our rectification-based method is more flexible.

Fig. 1 (b) shows the PSNRs of the interpolated frames for the second view of the Breakdancers sequence using our method, with uncoded views as references. The average PSNR is about 36 dB, making the method attractive for free viewpoint generation.

As mentioned before, our rectification method uses 25 point correspondences to find the matching transform in (6). To show the effectiveness of this method, we evaluate the average y-axis mismatch of the 25 pairs in the 31st frame of the sequence. The mismatch of our method is 0.97 pixels. If Eq. (5), i.e., the method in [9], is used for both views, the average mismatch of these points becomes 4.48 pixels.

To demonstrate the efficiency of the proposed algorithm in MVC, we incorporate it into the current reference software JMVC 3.0 [17]. We treat the interpolated view and the reconstructed right view as inter-view references, and two temporal images derived from the hierarchical B structure [2] as temporal references. Since the algorithm does not need any camera calibration or depth information, no extra bits are sent by the encoder. Three other coding schemes are evaluated. The first one uses the hierarchical B structure to encode each view independently, denoted simulcast. The second scheme is the current JMVC method, where the reconstructed left and right views are used for inter-view prediction. In the third method, we disable the rectification step in our method and only use disparity estimation and view interpolation, making it similar to [6]; we call it direct interpolation-based MVC.

Table 1 summarizes the coding improvements of JMVC, the direct interpolation-based MVC, and the proposed method over the simulcast approach, obtained by encoding the second view of different sequences. It can be seen that without view rectification, the direct interpolation method only achieves performance similar to JMVC, whereas the proposed method achieves an improvement on all sequences, ranging from about 1.3 dB for Rena to 0.5 dB for Breakdancers, with an average of 0.74 dB. The rate-distortion performance for Breakdancers is shown in Fig. 2. In this case, the direct interpolation method does not have any gain over JMVC, but the proposed method achieves an improvement of up to 1 dB, thanks to the rectification process.

Table 1. Average coding gains over simulcast (dB)

Sequence        JMVC    Direct interpolation    Proposed method
Rena            2.08    3.18                    3.38
Ballroom        1.55    1.29                    2.12
Exit            1.01    0.73                    1.50
Vassar          0.83    0.60                    1.72
Breakdancers    1.18    1.15                    1.71
Ballet          0.97    0.49                    1.61
Average         1.27    1.24                    2.01

[Figure 2: rate-distortion curves for Breakdancers, Y-PSNR (dB) versus rate (kb/s), for the proposed method, direct interpolation, JMVC, and simulcast.]

Fig. 2. MVC coding results for Breakdancers.

6. CONCLUSION

In this paper, a projective rectification-based view interpolation algorithm is presented for multiview video coding and free viewpoint generation. No camera calibration or depth map is required for decoding. Experimental results show that it has superior performance in free viewpoint generation, and the preliminary coding results demonstrate the potential of rectification in MVC.

However, view interpolation can only be applied to half of the views. To further improve the coding efficiency, we have also developed a view extrapolation method for MVC, which applies view rectification and extrapolation to all views except the first two; the results are reported in [12]. A synthesis bias correction method is also developed there to compensate for errors in disparity estimation and illuminance variations between cameras.

As mentioned before, the speed of the algorithm can be improved by using GPU computing, as in [11]. On the other hand, since most blocks tend to use conventional temporal prediction, the complexity of the algorithm can also be reduced by applying the view interpolation or extrapolation at the block level, invoking it only when temporal prediction is not efficient.

7. ACKNOWLEDGEMENT

The authors thank Dr. K. Yamamoto for providing the results in [6].

8. REFERENCES

[1] Y. Luo, Z. Zhang, and P. An, "Stereo video coding based on frame estimation and interpolation," IEEE Trans. Broadcasting, vol. 49, no. 1, pp. 14–21, Mar. 2003.
[2] K. Muller, P. Merkle, and T. Wiegand, "Multiview coding using AVC," ISO/IEC JTC1/SC29/WG11 M12945, Bangkok, Thailand, 2006.
[3] E. Martinian, A. Behrens, J. Xin, and A. Vetro, "View synthesis for multiview video compression," Picture Coding Symposium (PCS), 2006.
[4] Y. Morvan, D. Farin, and P. H. N. de With, "Multiview depth-image compression using an extended H.264 encoder," Advanced Concepts for Intelligent Vision Systems, vol. 4678, pp. 675–686, Delft, The Netherlands, Aug. 2007.
[5] M. Droese, T. Fujii, and M. Tanimoto, "Ray-space interpolation based on filtering in disparity domain," Proc. 3D Image Conf., pp. 213–216, 2004.
[6] K. Yamamoto, M. Kitahara, H. Kimata, T. Yendo, T. Fujii, M. Tanimoto, S. Shimizu, K. Kamikura, and Y. Yashima, "Multiview video coding using view interpolation and color correction," IEEE Trans. Circ. Syst. for Video Tech., vol. 17, no. 11, pp. 1436–1449, Nov. 2007.
[7] S. Birchfield and C. Tomasi, "Depth discontinuities by pixel-to-pixel stereo," Int. J. of Comp. Vision, vol. 35, no. 3, pp. 269–293, 1999.
[8] V. Kolmogorov and R. Zabih, "Computing visual correspondence with occlusions using graph cuts," International Conference on Computer Vision, vol. 2, pp. 508–515, 2001.
[9] D. Farin, Y. Morvan, and P. H. N. de With, "View interpolation along a chain of weakly calibrated cameras," IEEE Workshop on Content Generation and Coding for 3D Television, 2006.
[10] R. Hartley, "Theory and practice of projective rectification," Int. J. of Comp. Vision, vol. 35, no. 2, pp. 115–127, 1999.
[11] H. Kimata, S. Shimizu, Y. Kunita, M. Isogai, K. Kamikura, and Y. Yashima, "Real-time MVC viewer for free viewpoint navigation," IEEE Int. Conf. Multimedia Expo., pp. 1437–1440, 2008.
[12] D. Pang, X. Xiu, and J. Liang, "Multiview video coding using projective rectification-based view extrapolation and synthesis bias correction," IEEE Int. Conf. Multimedia Expo., 2009, submitted.
[13] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge Univ. Press, 2003.
[14] B. Lloyd, "Computation of the fundamental matrix," http://www.cs.unc.edu/~blloyd/comp290-089/fmatrix.
[15] X. San, H. Cai, J. C. Lou, and J. Li, "Multiview image coding based on geometric prediction," IEEE Trans. Circ. Syst. for Video Tech., vol. 17, no. 11, pp. 1536–1548, Nov. 2007.
[16] S. Roy, "Stereo without epipolar lines: A maximum-flow formulation," International Journal of Computer Vision, vol. 34, no. 2/3, pp. 147–161, 1999.
[17] "Joint Multiview Video Coding (JMVC) 3.0," garcon.ient.rwth-aachen.de, Nov. 2008.