INTER-VIEW MOTION VECTOR PREDICTION FOR DEPTH CODING

Vijayaraghavan Thirumalai, Li Zhang and Ying Chen
Qualcomm Technologies Inc., San Diego, CA, USA
{vthiruma, lizhang, cheny}@qti.qualcomm.com

ABSTRACT

This paper presents an inter-view motion prediction technique for efficient compression of the motion vectors of the depth views in 3D-HEVC. 3D-HEVC is an extension of the HEVC standard for coding multi-view video plus depth (MVD) content. In the MVD format, the motion characteristics of adjacent views in the depth video are highly correlated. In this paper, we exploit this correlation and propose an inter-view motion prediction technique, where the motion information of the dependent depth views is predicted from the already coded motion information in a reference depth view. In addition, a novel method for deriving disparity vectors from neighboring pixels is proposed in order to establish correspondences between blocks in different depth views. Experimental results show that the proposed inter-view motion prediction method provides an average bit-rate saving of 1.5% for the synthesized views when the motion information of the depth views is predicted without using the texture information.

Index Terms— 3D-HEVC, inter-view motion prediction, merge candidates, depth coding.


1. INTRODUCTION

3D video coding has become desirable with advances in acquisition and display technologies, and it finds applications in auto-stereoscopic 3DTV and in disparity adaptation for heterogeneous devices [1]. The Moving Picture Experts Group (MPEG) and ITU-T have established a new group, the Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V), to lead the standardization efforts for the 3D video coding (3DV) standards. The 3DV standards are built on top of existing video coding standards, e.g., H.264/AVC or High Efficiency Video Coding (HEVC) [2]. Among them, the most promising and highly efficient one, namely 3D-HEVC, is under development. The 3D-HEVC design is based on the HEVC standard, with additional block-level coding tools introduced for efficient compression of MVD content. The 3DV standards, including 3D-HEVC, target the coding of the visual information of a 3D scene, which usually consists of

multi-view texture data and its associated depth information (MVD) [3][4][5]. The multi-view videos (and the associated depth data) represent different projections of the same scene, usually captured by a set of synchronized cameras placed at different locations. 3D video codecs take advantage of the high correlation among the different views to achieve high compression efficiency. In 3D-HEVC, one view is selected as the base view and is coded independently of the other views in order to provide backward compatibility with HEVC decoders. The other views (known as dependent views) are coded with inter-view prediction, using the visual information of the base view or other reference views to reduce the redundancy between views. In particular, additional coding tools such as inter-view motion prediction, inter-view residual prediction and new intra modes are introduced to code the dependent views, which improves the coding efficiency [3].

In 3D-HEVC, an inter-view motion prediction technique has been introduced, where the motion information of the dependent views is inherited from the corresponding block in the inter-view reference picture [5][6][7]. A disparity vector, derived from the motion information of neighboring blocks, is used to identify the reference block; this technique is known as Neighboring Block Disparity Vector (NBDV) [8]. In the current design of 3D-HEVC, however, inter-view motion prediction is applied only for coding the motion information of the texture views, not of the depth views. For the depth views, the motion information of the co-located texture block in the same view is used instead to code the motion information of the current depth block. This prediction process is known as motion parameter inheritance (MPI) [9].

In MVD content, besides the texture-depth inter-layer correlation, the motion information between the two closest views in the depth video is also correlated, since two adjacent depth views are projections of the same 3D scene. In this paper, we therefore propose techniques that enable inter-view motion prediction for the depth views, together with a new disparity vector derivation scheme. Even though the bandwidth of the depth views is relatively low compared to the texture views, simulation results show that the proposed

method provides an average bit-rate reduction of 1.5% (up to 2.9% for one test sequence) for the synthesized views when MPI is disabled. Furthermore, when MPI is enabled to predict motion from texture, the proposed method provides a further 0.3% bitrate reduction for the synthesized views. Owing to its effectiveness, the proposed scheme has been adopted into the 3D-HEVC standard.

2. MERGE CANDIDATE LIST FOR MERGE/SKIP MODES

HEVC introduces the motion competition concept, in which multiple candidates are derived to form a candidate list, and one candidate from the list is finally chosen as the motion vector of the current prediction unit (PU) based on rate-distortion optimization. In HEVC, there are two inter prediction modes, namely the merge/skip mode and the advanced motion vector prediction (AMVP) mode. In merge mode, the merge candidate index and the residual information are coded. Skip mode is a special case of merge mode in which no residual information is transmitted. When an inter-coded PU is not encoded in skip/merge mode, it is encoded using AMVP mode, where the residual signals, the motion vector predictor index, as well as the motion information (prediction direction, reference picture index and motion vector differences) are transmitted. In 3D-HEVC, the merge/skip modes of HEVC are extended to support inter-view motion prediction, while the AMVP mode in 3D-HEVC is the same as in HEVC. Before we describe the merge candidate list construction process for the texture and depth views in 3D-HEVC, we briefly review the merge candidate list construction process in HEVC.
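To make the signaling difference between the two inter prediction modes concrete, the following Python sketch lists the syntax elements each mode carries, as described above. The container and field names are illustrative assumptions, not syntax element names from the HEVC specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class MergeModePU:
    """Merge mode: only a candidate index (plus residual) is coded.
    Skip mode is the special case where no residual is transmitted."""
    merge_idx: int                      # index into the merge candidate list
    residual: Optional[bytes] = None    # None => skip mode

@dataclass
class AmvpModePU:
    """AMVP mode: the motion information itself is signaled explicitly."""
    mvp_idx: int                        # motion vector predictor index
    pred_dir: int                       # prediction direction (list 0/1/bi)
    ref_idx: int                        # reference picture index
    mvd: Tuple[int, int]                # motion vector difference (x, y)
    residual: Optional[bytes] = None
```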

2.1. Merge candidate list in HEVC

In merge mode, the possible candidates for the merge list consist of spatial/temporal merging candidates derived from spatial/temporal neighboring blocks, and virtual merging candidates. The spatial merging candidates are derived from the neighboring blocks located at left (0), above (1), above right (2), below left (3), and above left (4), as depicted in Fig. 1. The temporal merging candidate is derived from the co-located blocks of the current PU. A pruning process is further applied to remove identical candidates from the list. Up to five candidates are finally derived to form the merge list. For more details, we refer the reader to [2].

2.2. Texture merge candidate list in 3D-HEVC

For texture coding in 3D-HEVC, in addition to the spatial/temporal motion predictors used in HEVC, four additional candidates are considered for the merge list: the inter-view motion vector candidate, the disparity motion vector candidate, the motion candidate from shifted neighbors, and the Block-based View Synthesis Prediction (BVSP) candidate [5][6][7]. The main reason for considering the first three additional candidates is to exploit the motion correlation among the different views; in other words, the motion information of a block in a dependent view is highly correlated with the motion information of a block in the reference base view. The BVSP candidate is considered to efficiently code the predicted block that is generated by warping the reference view with depth data. Due to the availability of more possible candidates, the number of candidates in the final merge list is increased to 6.

For generating the additional candidates, a disparity vector has to be calculated in order to identify the corresponding block in the reference view. In 3D-HEVC, the NBDV method [8] is used to derive a disparity vector; it is calculated from the disparity motion vectors of neighboring blocks at spatial and temporal locations. Note that a disparity motion vector points to a corresponding block in an inter-view reference picture. The disparity vector derived by NBDV is further refined using the Depth-oriented Neighboring Block Disparity Vector (DoNBDV) method [10]. In this method, the corresponding depth block in the already coded depth view pointed to by the disparity vector is identified first. Then, a new (refined) disparity vector is derived from a single depth value, taken as the maximum of the depth values at the four corners of the identified block.
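A minimal sketch of the DoNBDV refinement step just described is given below. The depth-to-disparity conversion is passed in as an assumed callable (in 3D-HEVC it is a camera-parameter based lookup); the function and argument names are our own.

```python
import numpy as np

def donbdv_refine(nbdv, depth_ref, x, y, w, h, depth_to_disp):
    """Refine an NBDV disparity vector (minimal DoNBDV sketch).

    nbdv          : (dx, dy) disparity vector from NBDV
    depth_ref     : 2-D array of the already coded reference depth view
    (x, y, w, h)  : position and size of the current block
    depth_to_disp : assumed callable mapping a depth value to a
                    horizontal disparity
    """
    # Locate the corresponding depth block pointed to by the NBDV,
    # clipped so it stays inside the reference depth picture.
    H, W = depth_ref.shape
    cx = int(np.clip(x + nbdv[0], 0, W - w))
    cy = int(np.clip(y + nbdv[1], 0, H - h))

    # Single representative depth = maximum of the four corner samples.
    corners = [depth_ref[cy, cx], depth_ref[cy, cx + w - 1],
               depth_ref[cy + h - 1, cx], depth_ref[cy + h - 1, cx + w - 1]]
    refined_depth = max(corners)

    # Convert the depth value into a refined (horizontal) disparity vector.
    return (depth_to_disp(refined_depth), 0)
```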

Fig. 1. Spatial neighboring candidates considered for constructing the merge candidate list. The spatial neighboring candidates that are pruned are indicated with dashed lines.

2.3. Depth merge candidate list in 3D-HEVC

For the depth views, similar to the texture merge list in 3D-HEVC, the maximum number of candidates in the final merge list is also set to 6. Besides the spatial/temporal motion predictors used in HEVC, an additional candidate, namely the motion parameter inheritance (MPI) candidate, is considered when constructing the merge list [9]. The main idea of the MPI candidate is to exploit the similar motion characteristics of the texture and depth data. For a given PU in the depth view, the MPI candidate is generated using the motion information

of the co-located texture block in the same view. During the merge candidate list construction process, the MPI candidate is inserted at the first position in the list, as it is very likely to be a good predictor in most scenarios. It is followed by the spatial/temporal candidates, which are generated in a similar way to the HEVC merge modes.

3. INTER-VIEW MOTION PREDICTION FOR DEPTH VIEWS

The current depth merge mode fails to exploit the motion correlation among the different depth views. Recall that the depth views represent different projections of the same 3D scene captured by synchronized (depth) video cameras. In such circumstances, the motion information of two associated blocks in depth view 0 and depth view 1 is likely to be the same, as shown in Fig. 2. In other words, the motion information of one view (say view 1) can be inferred from a previously coded view (say view 0), provided a disparity vector between the blocks in the current view (view 1) and the reference view (view 0) is known beforehand. Therefore, motion vector prediction accuracy can be improved by considering the motion information of the corresponding reference blocks as a possible candidate in the merge list. More details about the disparity vector derivation and the inter-view motion predictors are given in the rest of this section.

Fig. 2. Illustration of (inter-view) motion vector correlation between two correlated depth views. The motion vectors of two associated blocks in view 0 and view 1 are highly correlated.

3.1. Disparity vector derivation

For generating an inter-view motion candidate, a disparity vector is required in order to identify the corresponding block in the reference view (see Fig. 2). The NBDV method used in texture coding cannot be used to derive a disparity vector for the depth

views, as most of the neighboring blocks in the depth views may be intra coded, i.e., with high probability the neighboring blocks do not contain a disparity motion vector. Also, unlike in texture images, in depth images the reconstructed depth samples at neighboring locations can be accessed, which allows an accurate estimate of a disparity vector. Motivated by this, we propose to derive a disparity value for each coding unit (CU) from the neighboring reconstructed depth samples. The neighboring sample positions adjacent to the corners of the current CU are used; more specifically, the above-left, bottom-left and above-right sample positions of the current block, marked as P0, P1 and P2 (shown in red) in Fig. 3. From the reconstructed depth values at positions P0, P1 and P2, a single depth value is calculated as

Depth = ( 5*D[P0] + 5*D[P1] + 6*D[P2] + 8 ) >> 4,    (1)

where D[P] represents the reconstructed depth value at location P, and >> represents the right shift operator. In Eq. (1), the weights 5, 5 and 6 applied to the depth values at locations P0, P1 and P2, respectively, were selected based on trial-and-error experiments. We experimentally found that this particular choice of weights works quite well on test sequences containing both natural and synthetic MVD content. Finally, note that the weighted averaging in Eq. (1) is performed mainly to avoid a division by 3.
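As a minimal sketch, Eq. (1) amounts to the following integer-only computation; the boundary fallbacks described after Fig. 3 below are also included. The function and variable names are our own.

```python
def derive_cu_depth(d_p0, d_p1, d_p2):
    """Single depth value per CU from three corner neighbors, Eq. (1).

    d_p0, d_p1, d_p2: reconstructed depth samples at the above-left,
    bottom-left and above-right positions; None if unavailable.
    """
    # All three unavailable: top-left corner of the image -> depth 0.
    if d_p0 is None and d_p1 is None and d_p2 is None:
        return 0
    # Other image boundaries: only P1 (bottom-left) or P2 (above-right)
    # is available; use it directly, without weighted averaging.
    if d_p0 is None:
        return d_p1 if d_p1 is not None else d_p2
    # Interior case: integer weighted average of Eq. (1); the weights
    # (5, 5, 6)/16 approximate an average over three samples while
    # avoiding a division by 3, with +8 for rounding.
    return (5 * d_p0 + 5 * d_p1 + 6 * d_p2 + 8) >> 4
```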

Fig. 3. Three neighboring reconstructed depth samples, marked in red, are used for deriving a disparity vector for a CU of size 2Nx2N.

In special circumstances where not all three neighboring samples are available, e.g., at image boundaries, the depth value is calculated as follows. At the top-left corner of the image, none of the three neighbors P0, P1 and P2 is available; the depth value is set to zero in this case. At the other image boundaries, only the above-right (P2) or the bottom-left (P1) sample position is available for the current CU; in such cases, the depth value is set equal to the reconstructed value of the available neighbor, without weighted averaging. The calculated depth value is then converted into a disparity vector, denoted here

as DV. This disparity vector DV is used to set the disparity vector of all the PUs contained within the CU, i.e., all the PUs within the CU share the same disparity vector DV.

3.2. Inter-view motion candidates

We now describe the generation of inter-view motion candidates for the non-base depth views, assuming that the disparity vector between the corresponding blocks is known beforehand.

3.2.1. Inter-view motion candidate

The inter-view motion candidate is generated from the motion information of the reference block pointed to by the disparity vector. The reference block is identified by shifting the center position of the current PU, located at (W/2, H/2), by the disparity vector, where W and H respectively denote the width and height of the PU [11]. The disparity vector may be the same as the one derived in Section 3.1, or it may be a version of DV with an additional shift (offset) applied on top of it (see Section 4.1). Using the derived disparity vector, the corresponding block of the current PU in a reference view of the same access unit is identified, as shown in Fig. 4. If the corresponding block is not intra-coded and not inter-view predicted, and its reference picture has a picture order count (POC) value equal to that of one entry in the same reference picture list of the current PU, its motion information (including prediction directions, reference picture indices, and motion vectors), after converting the reference index based on POC, is taken as the inter-view motion candidate [12].
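The sketch below summarizes this derivation under stated assumptions: the motion-field accessor ref_view.motion_at and the dictionary layout of the motion information are our own, not the 3D-HTM API.

```python
def interview_motion_candidate(pu_x, pu_y, w, h, dv, ref_view, cur_ref_pocs):
    """Derive the inter-view motion candidate for a PU (minimal sketch).

    The reference block is located by shifting the PU center (W/2, H/2)
    by the disparity vector dv. ref_view.motion_at(x, y) is an assumed
    accessor returning a dict with 'mv' and 'ref_poc' per reference
    list, or None for intra-coded / inter-view predicted blocks.
    cur_ref_pocs[lst] is the list of POCs in the current PU's list lst.
    """
    cx, cy = pu_x + w // 2 + dv[0], pu_y + h // 2 + dv[1]
    motion = ref_view.motion_at(cx, cy)
    if motion is None:                    # intra or inter-view predicted
        return None
    candidate = {}
    for lst in (0, 1):
        poc = motion['ref_poc'][lst]
        if poc is None:
            continue                      # list not used by reference block
        if poc not in cur_ref_pocs[lst]:
            return None                   # no matching POC entry -> unusable
        # Convert the reference index based on POC; reuse the MV as-is.
        candidate[lst] = {'mv': motion['mv'][lst],
                          'ref_idx': cur_ref_pocs[lst].index(poc)}
    return candidate or None
```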

Fig. 4. Derivation of the inter-view predicted motion vector candidate for the merge/skip mode.

3.2.2. Disparity motion vector candidate

The motion vector for this candidate is generated by converting an input disparity vector into a disparity motion vector, and the reference index is set to the reference index of the inter-view reference picture associated with the disparity vector [13][14]. Note that the disparity motion vector has its vertical component equal to zero, and its horizontal component is derived from the disparity vector.

4. PROPOSED NEW MERGE LIST FOR DEPTH CODING

In this section, we first describe the procedure for generating three additional candidates based on the inter-view motion candidate and the disparity motion vector candidate. Then, we describe how these additional candidates are used to construct a new merge candidate list.

4.1. Generation of additional candidates

In this paper, we propose to generate three additional candidates, derived as follows (a code sketch follows the list):
(1) The first candidate is an inter-view motion candidate derived with the input disparity vector calculated in Section 3.1 (denoted as DV).
(2) The second candidate is a disparity motion vector (DMV) candidate generated from the derived disparity vector DV.
(3) The third candidate is generated by the following steps, as described in [15]. First, an additional inter-view motion candidate is generated with an input disparity vector equal to DV with a shift applied on top of it; the shifting vector has horizontal and vertical components equal to the width and height of the current PU, respectively. If the inter-view motion candidate generated with this shifted disparity vector is available, it is taken as the third candidate and the process terminates. Otherwise, the third candidate is set equal to a disparity motion vector candidate with an input disparity vector equal to DV with its horizontal component shifted by 4.
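A minimal sketch of the three derivations is given below. The two callables derive_ivmc and make_dmv stand in for the candidate derivations of Sections 3.2.1 and 3.2.2 and are assumptions of this sketch.

```python
def generate_additional_candidates(pu_w, pu_h, dv, derive_ivmc, make_dmv):
    """Generate the three additional depth merge candidates (sketch).

    dv          : disparity vector from Section 3.1, as (dx, dy)
    derive_ivmc : assumed callable returning an inter-view motion
                  candidate for a given disparity vector, or None
    make_dmv    : assumed callable turning a disparity vector into a
                  disparity motion vector candidate
    """
    # (1) Inter-view motion candidate from the unshifted DV.
    cand1 = derive_ivmc(dv)

    # (2) DMV candidate directly from DV (vertical component is zero).
    cand2 = make_dmv((dv[0], 0))

    # (3) Inter-view motion candidate from DV shifted by (width, height);
    #     fall back to a DMV with the horizontal component shifted by 4.
    shifted_dv = (dv[0] + pu_w, dv[1] + pu_h)
    cand3 = derive_ivmc(shifted_dv)
    if cand3 is None:
        cand3 = make_dmv((dv[0] + 4, 0))

    return cand1, cand2, cand3
```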

4.2. Insertion positions of the additional candidates

For a PU in a non-base view, the MPI candidate [9] is inserted at the beginning of the merge candidate list. Then, the first additional candidate (the inter-view motion candidate) is inserted into the list, followed by the spatial merging candidates derived from the left and above blocks (A1 and B1, respectively) of the current PU (see Fig. 1). The second additional candidate (the DMV candidate) is then inserted, followed by the spatial merging candidates B0, A0 and B2. Finally, the third additional candidate is inserted right before the temporal merge candidate.

The additional candidates are compared with selected other candidates in the list in order to identify redundant entries [13][14]. In detail, in addition to the original pruning processes among the spatial merging candidates [6], the first inter-view motion candidate is compared to the MPI candidate, and the second additional candidate (the DMV candidate) is compared to the left and above spatial merging candidates (A1 and B1, respectively). If the third additional candidate is an inter-view motion candidate, it is compared to the first additional inter-view motion candidate. A sketch of the resulting list construction is given below.
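The following sketch summarizes the insertion order and the extra pruning comparisons described above. Candidate objects, the equality test used for pruning, and the handling of availability are assumptions of this sketch; the pruning among the spatial candidates themselves [6] is not reproduced here.

```python
def build_depth_merge_list(mpi, cand1, cand2, cand3,
                           spatial, temporal, max_cands=6):
    """Assemble the proposed depth merge list (minimal sketch).

    spatial : dict with keys 'A1', 'B1', 'B0', 'A0', 'B2' (None if absent)
    candN   : the three additional candidates from Section 4.1
    Equality (==) stands in for the motion-information comparison used
    for pruning; availability checks are folded into add().
    """
    merge_list = []

    def add(cand, prune_against=()):
        if cand is None or len(merge_list) >= max_cands:
            return
        if any(p is not None and cand == p for p in prune_against):
            return                        # redundant entry, pruned
        merge_list.append(cand)

    add(mpi)                                          # MPI first
    add(cand1, prune_against=[mpi])                   # IVMC vs. MPI
    add(spatial['A1'])
    add(spatial['B1'])
    add(cand2, prune_against=[spatial['A1'],
                              spatial['B1']])         # DMV vs. A1, B1
    add(spatial['B0'])
    add(spatial['A0'])
    add(spatial['B2'])
    add(cand3, prune_against=[cand1])                 # if IVMC, vs. cand1
    add(temporal)                                     # temporal candidate last
    return merge_list
```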

5. EXPERIMENTAL RESULTS

In this section, the rate-distortion (RD) performance of the proposed inter-view motion prediction is presented, followed by a complexity analysis. The latest 3D-HEVC reference software, version 3D-HTM 8.2, is used in our experiments, together with the common test sequences and common test conditions (CTC) [16][17] used for evaluating 3D-HEVC proposals. The test set consists of 7 MVD sequences, tabulated in the first column of Table 1. The first four test sequences have a resolution of 1024x768 luma/depth samples at 30 fps, and the last three are at an HD resolution of 1920x1088 luma/depth samples at 25 fps. Each test sequence contains 3 texture views and 3 depth views. In the CTC, the center view is considered the base view and is coded independently; the remaining two views are coded with inter-view prediction using the information in the base view.

Two sets of experiments are conducted in order to evaluate the benefit of the proposed scheme: (1) common test conditions with MPI disabled; and (2) common test conditions. The first test considers the case where co-located texture views are unavailable, such as unpaired multi-view video plus depth; the second test verifies the coding efficiency of the proposed method on top of the latest 3D-HEVC.

5.1. Performance analysis

As required by the CTC, all three texture and depth views are coded. After decoding all the views (both texture and depth), 6 synthesized views are generated at uniformly spaced camera locations. Since the depth maps are used for generating the synthesized views, the Peak Signal-to-Noise Ratio (PSNR) values of the luminance component of the synthesized views are used for the evaluation, and coding performance is evaluated with the widely used BD-rate measurement [18]. The BD-rate performance is measured using the PSNR of the synthesized views and the total bitrate of the texture and depth views. The simulation results for the two tests are listed in columns 2 and 3 of Table 1, respectively. From the second column of Table 1, it can be seen that the proposed method provides a significant BD-rate reduction of 1.5% when the inter-layer motion prediction (i.e., MPI) between the texture and depth views is disabled. Although the bandwidth of the depth views is relatively low (around 12% of the overall bitrate), the proposed method could

reduce the overall bitrate by up to 2.9% for the PoznanHall2 test sequence. In addition, when the proposed method is tested with the MPI coding tool enabled, our solution achieves a further 0.3% bitrate reduction for the synthesized views. Finally, as expected, we observed in our experiments that the proposed coding tool has a negligible impact on the coding performance of the texture views.

5.2. Complexity considerations

We now briefly analyze the complexity of constructing the merge list for depth coding with and without inter-view motion prediction enabled. The complexity increase due to inter-view motion prediction comes from two aspects: (i) memory access; and (ii) computational complexity. One additional memory access is required to fetch the motion information of the block in a reference view. Note that the two inter-view motion candidates, with and without shifting the disparity vector (see Section 4.1), may be derived at the same time by loading the reference blocks into memory once. Also, in the worst case, 4 additional pruning processes are required, as described in Section 4.2. Comparing the number of pruning operations required to construct the list for the texture views and for the depth views, only one additional pruning process is required for the depth views due to inter-view motion prediction. Therefore, enabling inter-view motion prediction for the depth views gives a good trade-off between complexity and performance: significant coding gains are achieved with a reasonable increase in complexity.

Table 1. Coding performance in terms of synthesized views.

SEQUENCES       with MPI disabled    CTC
Balloons        -1.5%                -0.3%
Kendo           -1.7%                -0.3%
Newspaper       -1.2%                -0.3%
GT Fly          -0.6%                -0.2%
PoznanHall2     -2.9%                -0.4%
PoznanStreet    -1.2%                -0.1%
UndoDancer      -1.4%                -0.4%
Average         -1.5%                -0.3%

6. CONCLUSIONS

In this paper, we have extended the inter-view motion prediction design used in 3D-HEVC to the depth views. We have proposed a new method to derive a disparity vector from the neighboring reconstructed depth samples. Inter-view motion candidates are then derived based on the derived disparity vector and the motion information of a picture in a reference view. A pruning process is further applied to reduce the redundancy among the candidates in the final merge candidate list. Simulation

results show that an average bitrate saving of 1.5% can be achieved for the synthesized views when the inter-layer motion prediction is disabled, which highlights the potential of the proposed technique.

7. REFERENCES

[1] A. Vetro, W. Matusik, H. Pfister, and J. Xin, "Coding approaches for end-to-end 3D TV systems," Proc. Picture Coding Symposium, Dec. 2004.

[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Trans. Circuits and Systems for Video Technology, Vol. 22, No. 12, pp. 1649-1668, Dec. 2012.

[3] K. Müller, H. Schwarz, D. Marpe, C. Bartnik, S. Bosse, T. Hinz, H. Lakshman, P. Merkle, F. H. Rhee, M. Winken, and T. Wiegand, "3D high-efficiency video coding for multi-view video and depth data," IEEE Trans. on Image Processing, Vol. 22, No. 9, pp. 3366-3378, Sept. 2013.

[4] G. J. Sullivan, J. M. Boyce, Y. Chen, J.-R. Ohm, C. A. Segall, and A. Vetro, "Standardized extensions of High Efficiency Video Coding (HEVC)," IEEE Journal of Selected Topics in Signal Processing, Vol. 7, No. 6, pp. 1001-1016, Dec. 2013.

[5] H. Schwarz, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, D. Marpe, P. Merkle, K. Müller, H. Rhee, G. Tech, M. Winken, and T. Wiegand, "Description of 3D video technology proposal by Fraunhofer HHI (HEVC compatible; configuration A)," ISO/IEC JTC 1/SC 29/WG 11 (MPEG) document m22570, Nov. 2011.

[6] L. Zhang, Y. Chen, V. Thirumalai, J.-L. Lin, Y.-W. Chen, J. An, S. Lei, L. Guillo, T. Guionnet, and C. Guillemot, "Inter-view motion prediction in 3D-HEVC," accepted to IEEE International Symposium on Circuits and Systems (ISCAS), Melbourne, Australia, June 2014.

[7] E. G. Mora, J. Jung, M. Cagnazzo, and B. Pesquet-Popescu, "Modifications of the merge candidate list for dependent views in 3D-HEVC," IEEE International Conference on Image Processing (ICIP), Melbourne, Australia, Sept. 2013.

[8] L. Zhang, Y. Chen, and M. Karczewicz, "Disparity vector based advanced inter-view prediction in 3D-HEVC," IEEE International Symposium on Circuits and Systems (ISCAS), Beijing, China, May 2013.

[9] Y.-W. Chen, J.-L. Lin, Y.-W. Huang, and S. Lei, "3D-CE3.h results on removal of parsing dependency and picture buffers for motion parameter inheritance," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-A0049, 3rd Meeting: Geneva, CH, 11-18 Jan. 2013.

[10] Y.-L. Chang, C.-L. Wu, Y.-P. Tsai, and S. Lei, "3D-CE1.h: Depth-oriented neighboring block disparity vector (DoNBDV) with virtual depth retrieval," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-A0049, 3rd Meeting: Geneva, CH, 11-18 Jan. 2013.

[11] V. Thirumalai, L. Zhang, and Y. Chen, "Inter-view motion vector prediction for depth coding," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-E0133, 5th Meeting: Vienna, AT, 27 July-2 Aug. 2013.

[12] J. An, Y.-W. Chen, J.-L. Lin, Y.-W. Huang, and S. Lei, "3D-CE5.h related: Inter-view motion prediction for HEVC-based 3D video coding," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-A0049, 1st Meeting: Stockholm, SE, 16-20 July 2012.

[13] L. Zhang, Y. Chen, and L. He, "3D-CE5.h: Merge candidates derivation from disparity vector," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-B0048, 2nd Meeting: Shanghai, CN, 13-19 Oct. 2012.

[14] E. Mora, B. Pesquet, and M. Cagnazzo, "3D-CE5.h related: Modification of the merge candidate list for dependent views in 3DV-HTM," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-B0069, 2nd Meeting: Shanghai, CN, 13-19 Oct. 2012.

[15] V. Thirumalai, L. Zhang, Y. Chen, M. Karczewicz, C. Guillemot, L. Guillo, J.-L. Lin, Y.-W. Chen, and Y.-L. Chang, "CE3.h: Merge candidates derivation from vector shifting," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-E0126, 5th Meeting: Vienna, AT, 27 July-2 Aug. 2013.

[16] L. Zhang, G. Tech, K. Wegner, and S. Yea, "3D-HEVC Test Model 5," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-E1005, 5th Meeting: Vienna, AT, 27 July-2 Aug. 2013.

[17] D. Rusanovskyy, K. Müller, and A. Vetro, "Common test conditions of 3DV core experiments," Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V) document JCT3V-E1100, 5th Meeting: Vienna, AT, 27 July-2 Aug. 2013.

[18] G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," ITU-T Q.6/SG16 VCEG, document VCEG-M33, Apr. 2001.