3D Res. 04, 02(2013)6 10.1007/3DRes.02(2013)6
3DR EXPRESS
Depth-based Coding of MVD Data for 3D Video Extension of H.264/AVC
Dmytro Rusanovskyy • Miska M. Hannuksela • Wenyi Su
Received: 15 November 2012 / Revised: 12 March 2013 / Accepted: 29 April 2013 © 3D Research Center, Kwangwoon University and Springer 2013
Abstract This paper describes a novel approach of using depth information for advanced coding of the associated video data in Multiview Video plus Depth (MVD)-based 3D video systems. As a possible implementation of this concept, we describe two coding tools that were developed for an H.264/AVC-based 3D video codec in response to the Moving Picture Experts Group (MPEG) Call for Proposals (CfP): Depth-based Motion Vector Prediction (DMVP) and Backward View Synthesis Prediction (BVSP). Simulation results conducted under the JCT-3V/MPEG 3DV Common Test Conditions show that the proposed tools reduce the bit rate of the coded video data by 15% on average (delta bit rate), which results in 13% bit rate savings in total for the MVD data over the state-of-the-art MVC+D coding. Moreover, the concept of depth-based coding of video presented in this paper has been further developed by MPEG 3DV and JCT-3V, and this work resulted in even higher compression efficiency, bringing about 20% delta bit rate reduction in total for the coded MVD data over the reference MVC+D coding. Considering these significant gains, the proposed coding approach can be beneficial for the development of new 3D video coding standards.

Keywords H.264/AVC, three-dimensional video, video coding, 3D-AVC
Dmytro Rusanovskyy1 ( ) • Miska M. Hannuksela1 • Wenyi Su2
1 Nokia Research Center, Finland
2 University of Science and Technology of China, China
E-mail: [email protected]

1. Introduction

3D video has recently made significant progress in deployment: it has become very popular in cinemas, received reasonable popularity in the gaming industry, and "3D-enabled" TV sets have become available in many homes. Nevertheless, 3D video (3DV) remains uncommon in the living room environment. One of the reasons for this situation is the 3D video content available for viewing. The content that is nowadays usually understood as 3D video is not actually 3D video content, but stereoscopic 2D video content. Such content consists of a pair of 2D video signals representing the 3D scene from different views and displayed to the left and right eyes independently. Stereoscopic video content is typically created by capturing a scene with a pair of 2D cameras (a stereo camera) separated from each other by a specific distance called the stereo baseline. At the user side, the stereo content is displayed with stereoscopic displays, e.g. passive, active or autostereoscopic displays, that are optimized for a predefined stereo baseline. However, it has been discovered that the depth perception resulting from stereo content produced for a specific stereo baseline depends on the display size and viewing conditions1. The biggest problem here is not that depth perception can be reduced if the content is shown on an inappropriate display, but the fact that the disparity between views observed in certain conditions may exceed the comfortable range for some viewers, which leads to eye strain and/or headache1. This problem may be partly solved with specially designed stereo content that is viewed in fixed viewing conditions, i.e. where the screen size and the distance from the viewers to the screen are fixed. Such solutions have been deployed in 3D cinemas, and this has led to the popularity of 3D movies in movie theaters. In order to bring 3D video content to the heterogeneous display world, and to the flexible viewing conditions of a living room, the stereo baseline of the displayed content needs to be adjusted either automatically by the displaying system or by the viewers, a.k.a. depth perception volume control. To provide this functionality, it is believed that a plurality of stereo pairs produced with different baseline distances
should be available at the display side, and the optimal pair is shown on demand2. Another important problem of today's stereoscopic video displays is the need for special glasses, which may not suit everyone. Multiview autostereoscopic displays (multiview ASD), which are coming to the market, seem to provide a solution for this problem. However, a multiview ASD requires a plurality (10-20) of stereo views to be available at the display side for simultaneous displaying. Therefore, in order to resolve the major problems that delay 3D video deployment in the living room environment, the 3D video content available for displaying should be represented with a plurality of views, e.g. tens of available views of the 3D scene.

A state-of-the-art solution for the delivery of 3D video content with multiple views is H.264/MVC multiview video coding (MVC)3. However, the coding bit rate produced with MVC technology is proportional to the number of views representing the 3D scene. Considering that a multiview ASD would require tens of views (28-52) displayed simultaneously, 3D video content coded with MVC would require about 10x the bandwidth of single-view 2D video. It is obvious that 3D video systems with such requirements may not be very practical for deployment, and an alternative solution is needed.

In order to address this problem, the Moving Picture Experts Group (MPEG) recently initiated a new standardization by issuing a Call for Proposals (CfP) on 3D video coding technology4 together with the expected requirements5. The CfP invited submissions in two categories, the first compatible with H.264/AVC3 and the second compatible with the High Efficiency Video Coding (HEVC) standard6, which was under development at the time of the CfP. About 20 technology proposals were submitted to MPEG in response to the CfP. The proposals were evaluated through formal subjective testing7, and the most advanced responses in the corresponding categories8, 9 formed an initial basis for further 3DV standardization development.

Following the CfP evaluation, MPEG and, since July 2012, the Joint Collaborative Team on 3D Video Coding (JCT-3V)10 have initiated three parallel standardization developments for depth-enhanced multiview video coding. These developments differ in their finalization timeline, the utilized base coding technology and the extent to which the base coding technology is modified for depth-enhanced video coding. A depth-enhanced extension of MVC, abbreviated MVC+D, specifies the encapsulation of MVC-coded texture and depth views into a single bitstream11. The utilized coding technology is identical to MVC, and hence MVC+D is backward-compatible with MVC and the texture views of MVC+D bitstreams can be decoded with an MVC decoder. The MVC+D specification is planned to be technically finalized in January 2013. A reference test model of MVC+D is based on the proposal8 and is implemented in the 3DV-ATM reference software12.

Another ongoing JCT-3V development is a 3D video extension of H.264/AVC, referred to here as 3D-AVC13. This development exploits redundancies between texture and depth and includes several coding tools that provide a compression improvement over MVC+D. The specification requires that the base texture view is compatible with H.264/AVC, and compatibility of the dependent texture views with MVC may optionally be provided. 3D-AVC is planned to be technically finalized in November 2013. Similarly to
MVC+D, a reference test model of 3D-AVC is based on the proposal8 and is implemented in the 3DV-ATM reference software12.

The depth-enhanced 3D video coding extension of the High Efficiency Video Coding (HEVC) standard14 is scheduled to be technically complete in July 2014. Similarly to 3D-AVC, this development exploits inter-component redundancies that provide a compression improvement over the multiview extension of HEVC15, which is planned for January 2014. The design of the 3DV extension of HEVC14 originated from the proposal9.

Analysis of the technology proposals submitted in response to the MPEG CfP on 3DV shows that, despite differences in the base technology and implementation aspects, most of the proposals shared two major components. First, most of the responses utilized the multiview video plus depth (MVD) data format2 for representing a 3D scene, and depth-image-based rendering (DIBR)16 was utilized for rendering virtual views at the decoder side. Second, the most advanced proposals, including8, 9, utilized the inter-component redundancy inherent in MVD data and performed joint coding of the texture and depth components.

In this paper, we present the details of the latter aspect, the joint coding of MVD data. First we introduce the concept of Flexible Coding Order (FCO), which enables the encoder to configure the coding order of the MVD data so as to benefit from inter-component redundancies. The most beneficial coding configuration, "depth coded first", allows the implementation of two novel 3DV coding tools, namely Depth-based Motion Vector Prediction (DMVP) and Backward View Synthesis Prediction (BVSP), which are described in this paper. FCO, DMVP and BVSP were foundational elements of the 3DV-ATM test model design12 and undergo active development toward the 3D-AVC coding standard13.

The rest of the paper is organized as follows. Section 2 describes the general principles of the proposed 3DV coding concept. Section 3 provides a description of the coding tools and a brief discussion of the evolution of these tools within the ongoing 3D video standardization. Section 4 provides the results of a coding efficiency study for the given tools, and Section 5 concludes the paper.
2. Architecture of Depth-Enhanced 3D Video Coding

This section describes the high-level coding architecture adopted for 3D-AVC13 and implemented in the 3DV-ATM reference software12. The design is based on joint coding of the MVD data and provides great flexibility in the output bitstream format, offering either single-view compatibility with H.264/AVC or stereo-view compatibility with the H.264/MVC coding standard, as required5.
2.1. Multiview Video plus Depth Data Format

The MVD data format represents the scene with texture data captured from multiple viewing angles, and every texture is accompanied by depth map information; see Fig. 1, where the term T# denotes the texture of view # and the term D# denotes the depth information of view #. Thus, assuming that the 3D scene was captured from 3 viewing angles, there are 6 MVD components in total (three texture
and three depth view components) for every time instant. In order to represent the unlimited depth range of real-world 3D scenery with a limited number of bits, e.g. 8 bits, the actual depth values z are non-linearly quantized to produce depth map values d, and the dynamic range of the represented z is limited by the depth range parameters Znear and Zfar, see (1):

$d = \left(2^N - 1\right) \cdot \frac{1/z - 1/Z_{far}}{1/Z_{near} - 1/Z_{far}} + 0.5$    (1)

where N is the number of bits used to represent the quantization levels of the current depth map; the closest and farthest real-world depth values Znear and Zfar correspond to depth map values 255 and 0, respectively (for N = 8).

The MVD format offers a pixel-wise correspondence between texture and depth: every texture sample is associated with a depth sample. Such correspondence allows the projection of a texture sample of a source view to the corresponding location in a target view; the projection (displacement in spatial coordinates) of a texture sample is derived through the disparity D that is calculated between the source and target views for a particular depth value z. With a parallel camera setup in mind5, the disparity value D between two views for a particular depth value z is derived as shown in (2):

$D = \frac{f \cdot l}{z}$    (2)

where f is the focal length of the utilized cameras and l is the separation between the two cameras.
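To make the relationship between quantized depth maps and disparity concrete, the following sketch converts an 8-bit depth map sample into a real-world depth value via the inverse of (1) and then into a disparity via (2). It is an illustrative implementation only; the variable names (znear, zfar, focal_length, baseline) and the numeric values are assumptions, not values taken from the test sequences.

import numpy as np

def depth_to_disparity(d, znear, zfar, focal_length, baseline, n_bits=8):
    """Convert quantized depth map values d (0..2^N-1) to disparity in pixels.

    Inverts the non-linear quantization of (1) to recover the real-world
    depth z, then applies the parallel-camera disparity model of (2).
    """
    d = np.asarray(d, dtype=np.float64)
    levels = (1 << n_bits) - 1
    # Inverse of (1): recover 1/z from the quantized depth value.
    inv_z = (d / levels) * (1.0 / znear - 1.0 / zfar) + 1.0 / zfar
    z = 1.0 / inv_z
    # (2): disparity between the two views for depth z.
    return focal_length * baseline / z

# Example with assumed camera parameters (not from the CTC sequences):
depth_block = np.array([[128, 140], [128, 135]])
disp = depth_to_disparity(depth_block, znear=0.5, zfar=50.0,
                          focal_length=1000.0, baseline=0.05)
print(disp)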
Figure 1 Visualization of the MVD data format for the 3-view scenario (C3)

Due to these properties, the MVD format allows the rendering of a plurality of virtual views within some viewing range from a limited number of input views (e.g. 2-3) transmitted to the decoder, and thus provides support for advanced 3D video functionality5. Virtual views are produced by depth-image-based rendering (DIBR)16, which becomes an integral part of the 3DV system. In addition, the pixel-wise association between texture and depth information leads to a significant redundancy inherent in the MVD format.

2.2. Joint Coding of MVD Data

Figure 2 Visualization of joint coding of MVD data; (a) texture information is used for coding of depth, (b) depth information is used for coding of texture

Depending on the codec architecture, this redundancy may be exploited either for more efficient coding of depth data or for more efficient coding of texture, as shown in Fig. 2, or both approaches may be combined in one solution. The first approach assumes that the texture data is coded and decoded prior to the coding of the depth map data and thus can be used for more efficient coding of the depth map data, see the visualization in Fig. 2(a). This concept has been studied in the literature by many authors, e.g.17, where the motion information of the coded texture picture is utilized as a prediction for the motion vectors of the associated depth picture. In proposal9, this concept was adjusted to the HEVC core design by introducing the inheritance of tree-block subdivisions and the corresponding motion parameters from the texture component to the coded depth map component of the same view. In addition, proposal9 introduced depth map modeling in which some of the coding modes and coding parameters are determined using the texture information coded prior to the depth map data within the same view.

The second approach is to utilize the available depth data for more efficient coding of the texture data within the same view. Analysis of the CfP responses submitted for evaluation7 shows that the bit rate of the coded depth maps occupies about 15% of the total MVD bit rate on average, varying in the range from 7% to 25%, while the remaining 75%-93% of the bit rate is used for the coded texture. Therefore, the utilization of depth data for coding of the texture data within the same view can potentially bring significantly larger gains. This concept was initially proposed in Nokia's 3DV response8 to the MPEG 3DV CfP, where it was utilized for advanced coding of the dependent texture views and provided a significant coding gain. As a result, this coding approach was selected as the founding basis for the standardization development toward 3D-AVC13 and is described in more detail in this section.
The 3D-AVC codec codes the components of the MVD data as a sequence of access units. Each access unit consists of the texture view components and depth view components representing one temporal instant of the 3D scene; see the visualization in Fig. 3. The data of a coded view component is not interleaved with any other coded view component, and the data of an access unit is not interleaved with any other access unit in bitstream/decoding order. For example, in a
stereo MVD (2 views), the access unit at time t, consisting of the texture view components T and depth view components D (T0t, T1t, D0t, D1t), precedes in bitstream and decoding order the access unit at time t+1, consisting of the texture and depth view components (T0t+1, T1t+1, D0t+1, D1t+1), see Fig. 3.
Figure 3 Access unit in 3DV coding and an example coding order in 3D-AVC

Within an access unit, the codec allows certain flexibility in the coding order. In some configurations, the coding of a texture component precedes the coding of the depth component, e.g. T0D0-T1D1. In such a coding configuration, the coded texture may be utilized for more efficient coding of the depth component within the same view, as shown in Fig. 2(a). Alternatively, the coding of the depth component may precede the texture component of the same view, thus allowing the coding concept visualized in Fig. 2(b). The order T0D0D1T1 is an example of such a configuration. T0 is coded first and independently of the other components, thus allowing single-view compatibility with H.264/AVC. Following this, T0 can be utilized for more efficient coding of the depth component D0. For the second view (dependent view #1), the depth component D1 is coded prior to the texture component T1 and facilitates more efficient coding of the T1 component, which follows D1 in coding order. This coding structure, in which depth is coded prior to the associated texture data, enables coding tools that can be applied for the coding of dependent texture views. A flowchart of such an architecture is visualized in Fig. 4, where the dependent texture T1 is coded after the MVD components T0 and D1. The depth component D1 is utilized for more efficient coding of the texture T1 through the DMVP and VSP coding tools.

3. Depth-based Coding of Texture Data

Figure 4 High-level flowchart of the encoder for a dependent texture view

In the 3-view scenario, the MVD data consists of 6 components (T0, T1, T2 and D0, D1, D2) describing the 3D scene from different viewing angles. With respect to the requirements5, a single texture view T0 should be coded with H.264/AVC, and it is considered as the base view for the other components of the MVD data. Views T1 and T2, in contrast, are coded as dependent views, and inter-view redundancy as well as inter-component redundancy may be exploited for their more efficient coding. Fig. 4 shows a high-level flowchart of such enhanced coding of T1 and T2 in 3D-AVC, where the novel VSP and DMVP coding tools are marked in red. This section provides a description of these tools.

3.1. Depth-based Motion Vector Prediction
The performance of H.264/AVC relies on the quality of inter (temporal) prediction, which is implemented in encoders through a block-matching search between the currently coded block and candidate blocks in reference pictures. The block-matching search for a coded block (Cb) results in motion information, which is transmitted to the decoder side. The motion information associated with a Cb consists of three components: a reference index (refIdx) indicating the reference picture and the two spatial components of the motion vector (MVx and MVy). In order to reduce the number of bits required to encode the motion information, the blocks neighboring the Cb are used to produce a predicted motion vector (mvpx, mvpy), and the difference (dX, dY) between the actual motion information of Cb and the mvp is transmitted:

$dX = MV_x(Cb) - mvp_x; \quad dY = MV_y(Cb) - mvp_y$    (3)

H.264/AVC specifies that the components of the predicted motion vector are calculated as the median of the corresponding motion vector components (MVx, MVy) of the neighboring blocks A, B and C:

$mvp_x = \mathrm{median}\left(MV_x(A), MV_x(B), MV_x(C)\right)$
$mvp_y = \mathrm{median}\left(MV_y(A), MV_y(B), MV_y(C)\right)$    (4)

where the subscripts x and y indicate the horizontal and vertical components of the MV, respectively. The layout of the spatial neighbors (A, B, C, D) utilized in MVP is depicted in the top-left corner of Fig. 5. The motion vectors of the corresponding blocks (A, B, C) are denoted accordingly (MV(A), MV(B), MV(C)). In addition to the MVP defined by (3) and (4), there are two special modes, Direct and Skip. In these modes, the motion vector components are predicted as shown in (4), whereas the reference index is calculated as the minimal reference index among those used by the neighboring blocks (A, B, C) and is selected for Cb:

$refId_0 = \min\left(refId_0(A), refId_0(B), refId_0(C)\right)$
$refId_1 = \min\left(refId_1(A), refId_1(B), refId_1(C)\right)$    (5)

where refId0(.) and refId1(.) are the reference indices of
blocks (A, B, and C) in reference picture List 0 and List 1, respectively.

Figure 5 Flowchart of direction-separated MVP

Figure 6 (a) Flowchart of the DMC for the Skip mode in P slices; (b) flowchart of the DMC for the Direct mode in B slices
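As a reference point for the modifications described next, the following sketch illustrates the conventional median-based motion vector prediction of (3)-(5). It is a simplified illustration, not the normative H.264/AVC derivation (which also handles unavailable neighbors and specific partition shapes); the data structures are assumptions made for readability.

from dataclasses import dataclass
from statistics import median

@dataclass
class MotionInfo:
    mvx: int        # horizontal motion vector component (quarter-pel units)
    mvy: int        # vertical motion vector component (quarter-pel units)
    ref_idx: int    # reference index into the reference picture list

def median_mvp(a: MotionInfo, b: MotionInfo, c: MotionInfo):
    """Median MVP of (4): component-wise median over neighbors A, B, C."""
    return median([a.mvx, b.mvx, c.mvx]), median([a.mvy, b.mvy, c.mvy])

def skip_ref_idx(a: MotionInfo, b: MotionInfo, c: MotionInfo) -> int:
    """Reference index derivation of (5) for the Skip/Direct modes."""
    return min(a.ref_idx, b.ref_idx, c.ref_idx)

def mv_difference(cb: MotionInfo, mvp):
    """Motion vector difference of (3) that is actually transmitted."""
    return cb.mvx - mvp[0], cb.mvy - mvp[1]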
From equations (3)-(5) it is evident that the MVP scheme of H.264/AVC is well optimized for a single prediction direction (temporal) and is not suitable for use with more than one prediction direction, e.g. for inter-view and VSP prediction. To resolve this problem, the MVP was modified as follows. The conventional median MVP of (4) is restricted to the prediction direction that is used in Cb. All available neighboring blocks are classified according to the direction of their prediction (temporal, inter-view, VSP). For example, if Cb uses an inter-view reference picture, all neighboring blocks which do not utilize inter-view prediction are marked as not available for MVP and are not considered in the median MVP. The flowchart of this process is depicted in Fig. 5 for inter and inter-view prediction, and it is applied similarly for VSP. In addition, we introduced a new default inter-view candidate vector which is derived from the depth data d(Cb) associated with Cb. If no motion vector candidates are available from the neighboring blocks, MVx is set to the average disparity D(Cb), which is associated with the current texture Cb and computed by (6):

$D(Cb) = \frac{1}{N}\sum_{i} D(Cb(i))$    (6)

where i is the index of the pixels within the current Cb, D(Cb(i)) is the
disparity of pixel Cb(i), computed as given in (2), and N is the total number of pixels in Cb.

Flowcharts of the proposed depth-based motion competition (DMC) process in the Skip and Direct modes are shown in Fig. 6(a) and 6(b), respectively. In the Skip mode, the motion vectors {MVi} of the texture data blocks {A, B, C} are grouped according to their prediction direction, forming Group 1 and Group 2 for temporal and inter-view prediction, respectively. The DMC process, which is detailed in the grey block of Fig. 6(a), is performed for each group independently. For each motion vector MVi within a given group, we first derive a motion-compensated depth block d(Cb, MVi), where the motion vector MVi is applied relative to the position of Cb to obtain the depth block from the reference picture pointed to by MVi. Then, we estimate the similarity of d(Cb) and d(Cb, MVi) as shown in (7):

$SAD(MV_i) = SAD\left(d(Cb, MV_i),\, d(Cb)\right)$    (7)

where SAD is the Sum of Absolute Differences computed over the depth map samples of blocks d(Cb) and d(Cb, MVi), Cb is the currently coded block, d(Cb) is the depth data associated with Cb, and d(Cb, MVi) is the motion-compensated block of depth data whose coordinates are derived as described above. The MVi that provides the minimal SAD computed as in (7) within the current group is selected as the optimal predictor
for a particular direction (mvpdir):

$mvp_{dir} = \arg\min_{MV_i} SAD(MV_i)$    (8)

Following this, the predictor in the temporal direction (mvptemp) is competed against the predictor in the inter-view direction (mvpinter). The predictor that provides the minimal SAD is selected for use in the Skip mode:

$mvp_{opt} = \arg\min_{mvp_{dir} \in \{mvp_{temp},\, mvp_{inter}\}} SAD(mvp_{dir})$    (9)

The MVP for the Direct mode of B slices, illustrated in Fig. 6(b), is very similar to that of the Skip mode, but DMC (marked with grey blocks) is performed over both reference picture lists (List 0 and List 1) independently. Thus, for each prediction direction (temporal or inter-view), DMC produces two predictors (mvp0dir and mvp1dir) for List 0 and List 1, respectively. The SAD values of mvp0dir and mvp1dir obtained in (8) are averaged to form the SAD of bi-prediction for each direction independently:

$SAD(mvp_{dir}) = \left(SAD(mvp0_{dir}) + SAD(mvp1_{dir})\right)/2$    (10)

Finally, the MVP for the Direct mode is selected from the available mvpinter and mvptemp as shown in (9). A more detailed description of the DMVP scheme and the simulation results achieved with it can be found in27.
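The following sketch illustrates the depth-based motion competition logic of (7)-(9) for the Skip mode: candidate motion vectors from the neighbors are grouped by prediction direction, each candidate is scored by the SAD between the depth block of Cb and the motion-compensated depth block it points to, and the best candidate over all groups wins. This is a conceptual illustration, not the 3DV-ATM implementation; the helper fetch_depth_block and the candidate container are assumptions.

import numpy as np

def dmc_skip_predictor(cb_depth, candidates, fetch_depth_block):
    """Depth-based motion competition for the Skip mode, following (7)-(9).

    cb_depth          : depth block d(Cb) associated with the coded block.
    candidates        : list of (mv, ref_idx, direction) tuples from A, B, C,
                        with direction in {'temporal', 'inter-view'}.
    fetch_depth_block : callable (mv, ref_idx) -> depth block d(Cb, MVi)
                        taken from the reference picture at the displaced
                        position (an assumed helper, not a real API).
    """
    groups = {'temporal': [], 'inter-view': []}
    for mv, ref_idx, direction in candidates:
        groups[direction].append((mv, ref_idx))

    best = None  # (sad, mv, ref_idx)
    for direction, group in groups.items():
        for mv, ref_idx in group:
            # (7): SAD between d(Cb) and the motion-compensated depth block.
            sad = np.abs(cb_depth.astype(np.int32)
                         - fetch_depth_block(mv, ref_idx).astype(np.int32)).sum()
            # (8)/(9): keep the candidate with the minimal SAD overall.
            if best is None or sad < best[0]:
                best = (sad, mv, ref_idx)
    return best  # None if no candidates were available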
3.2. Backward View Synthesis Prediction (BVSP)

According to the selected MVD representation, the coded 3DV data consists of multiview texture, the associated depth maps and the camera parameters that describe the camera arrangement. The presence of this data within the (de)coding loop enables View Synthesis Prediction (VSP) for (de)coding of the dependent texture views. This technique allows an already decoded texture view component (e.g. T0), called the source view, to be projected to the viewing point of the currently (de)coded dependent view (e.g. T1, called the target view) using depth information and the DIBR rendering technique16.

Conventional VSP (so-called Forward VSP, F-VSP) implies that the synthesized image at the target view is produced by DIBR from the texture and depth information of the source view. The projected image is then included in the reference picture list(s) and serves as a reference picture for Motion Compensated Prediction (MCP). In F-VSP, the depth map samples d(s(x,y)) associated with the texture samples s(x,y) of a source texture view (e.g. T0) are converted to disparity vectors D(s(x,y)) as shown in (11) and (12):

$z = \left(\frac{d(s(x,y))}{255}\cdot\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right) + \frac{1}{Z_{far}}\right)^{-1}$    (11)

$D(s(x,y)) = \frac{f \cdot l}{z}$    (12)

In (12), f is the focal length of the cameras, l indicates the translation distance between the given views, and D is the disparity between the views for depth z. Following this, every texture sample s(x,y) of the source view (T0) is projected to a pixel location in the synthesized image t of the target view (e.g. T1 or T2), as in (13):

$t(x + D, y) = s(x, y)$    (13)

The sample-wise projection (11)-(13) may result in occlusions in the synthesized image (multiple samples of the source image
are projected onto the same pixel location) and holes (pixel locations of the virtual image that are not occupied by any of the samples projected from the source image). Such rendering artefacts can distort the MCP when the synthesized image serves as a reference picture. To resolve this problem, various non-linear processing techniques have been proposed; to mention a few, a z-buffering algorithm for occlusion handling and inpainting for hole filling18. An example of such an F-VSP implementation was proposed in19 and was later utilized in Nokia's 3DV technology8.

However, the F-VSP technique is considered to be memory and processing power demanding. For example, the pixel-based processing of (11)-(13) and the subsequent hole and occlusion handling significantly increase the memory access rate and effectively disable the block-based processing concept which is typically utilized in state-of-the-art video coding systems. In addition, F-VSP suffers from poor data localization, which becomes a severe complication for the implementation of Motion Compensated Prediction at the decoder side. When Cb is predicted from a reference block R(Cb) in the VSP frame, it is not obvious what the actual coordinates of the samples in the source image are that need to be copied to form this R(Cb). Therefore, F-VSP should either synthesize the entire VSP frame prior to (de)coding, or synthesize for each Cb a large enough fragment of the VSP frame. Since the actual disparity vectors of R(Cb) are not known in F-VSP, the synthesized reference area should be large enough to cover all possible disparity vectors between the target and source views20. The first solution requires a significant memory and computational complexity increase, since the entire frame must be synthesized even if a single Cb in the target view is coded with VSP. The second solution has lower memory requirements and average complexity figures, since only a fragment of the VSP frame is synthesized, but it significantly increases the worst-case complexity, since reference samples of the VSP frame may be produced multiple times if the reference areas R(Cb) of MCP in the VSP frame overlap. As shown by the analysis provided in20, in-loop F-VSP was the most computationally demanding module of the 3DV-ATM and occupied about 30-40% of the total decoding time; this complexity was considered unacceptable.

The coding order "depth comes first" that is introduced in this paper allows an alternative implementation of VSP, which is based on backward projection (from the target view to the source view), is free of the aforementioned problems and is well aligned with the block-based architecture of H.264/AVC. A possible implementation of this scheme was proposed in21, and it was adopted into 3D-AVC and into the reference 3DV-ATM software, replacing the earlier F-VSP implementation. In this section we provide a brief description of this scheme.

Let us assume that the following coding order is utilized: (T0, D0, D1, T1). Texture component T0 is the base view and serves as the source view for VSP, and T1 is a dependent view coded with VSP. Depth map components D0 and D1 are the respective depth maps associated with T0 and T1. Further, let x and y denote the absolute spatial coordinates of the samples constituting the currently coded Cb within the dependent texture view T1. The samples of the synthesized reference block R(Cb) can be retrieved from the source image
s(x,y) of the source view T0 using a disparity vector D that is derived from the depth data d(Cb) associated with Cb. Applying the vector D to the spatial coordinates of Cb provides the coordinates of the source samples R(Cb) in view T0:

$R(Cb) = s(x + D, y), \quad s \in T0$    (14)

In such an implementation, the residual signal r(Cb) predicted from the synthesized R(Cb) is derived with a traditional Motion Compensated Prediction (MCP) module in which the displacement (motion) vectors are replaced with disparity vectors:

$r(Cb) = Cb - R(Cb), \quad Cb \in T1$    (15)

A visualization of this process is shown in Fig. 7.
Figure 7 Visualization of Block-based VSP based on backward projection
In the case of a parallel camera arrangement, which is assumed to be the common 3DV use case, the different views of the MVD data are rectified; thus, the vertical component of the displacement between the views is equal to zero, and the disparity D as the horizontal component of the motion vector is sufficient to derive the prediction signal R(Cb) as shown in (14), thus:

$MV = (MV_y, MV_x) = (0, D)$    (16)

Considering that D can be derived from the depth map data, (1)-(2), which is present at the decoder side, no MV transmission is required for a VSP-predicted Cb. The third component of the motion information, the reference index refIdx, is transmitted to specify the image that serves as the source for the BVSP process. With such an implementation, BVSP can be interpreted as a special case of the SKIP/DIRECT mode of H.264/AVC, where the derivation procedures for the motion vector components and for the reference index are performed as specified in this section.

The BVSP process described in (14)-(16) does not require an additional memory buffer to hold a synthesized reference picture, since the R(Cb) samples are retrieved directly from the decoded texture image of the base view T0, which is already present in the Decoded Picture Buffer (DPB) as part of inter-view prediction. In addition, the BVSP process can be implemented with a block-based architecture, where a single disparity vector D is derived for each Cb. This makes the proposed scheme well aligned with the MCP modules of the original H.264/AVC. These two factors are significant benefits over the conventional forward-based VSP19.
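A minimal sketch of the backward projection of (14)-(16) is given below: a single disparity is derived from the depth block associated with Cb, the reference block R(Cb) is fetched directly from the decoded base-view texture, and the residual of (15) is formed. Function and variable names (base_view_texture, cb_depth, etc.) and the use of the block-average disparity of (6) are assumptions for illustration; they do not reproduce the 3DV-ATM code.

import numpy as np

def bvsp_predict(base_view_texture, cb_texture, cb_depth, x, y,
                 znear, zfar, focal_length, baseline):
    """Backward VSP for one block: derive R(Cb) per (14) and r(Cb) per (15).

    base_view_texture : decoded texture of the source view T0 (2D array).
    cb_texture        : currently coded block Cb of the dependent view T1.
    cb_depth          : depth block d(Cb) associated with Cb.
    (x, y)            : top-left spatial coordinates of Cb in T1.
    Boundary clipping of the displaced block is omitted for brevity.
    """
    h, w = cb_texture.shape
    # Average disparity over d(Cb), reusing the conversion of (1)-(2);
    # depth_to_disparity is the sketch defined earlier in Section 2.1.
    disparity = int(round(depth_to_disparity(cb_depth, znear, zfar,
                                             focal_length, baseline).mean()))
    # (14): fetch R(Cb) from the base view, displaced horizontally by D.
    r_cb = base_view_texture[y:y + h, x + disparity:x + disparity + w]
    # (15): residual that would be transform-coded.
    residual = cb_texture.astype(np.int32) - r_cb.astype(np.int32)
    return r_cb, residual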
3.3. Discussion on Standardization Development

The concept described in this paper, using depth/disparity information for the coding of the associated texture, was initially proposed in Nokia's response to the MPEG CfP and implemented in Nokia's 3DV Test Model (3DV-TM)8. This implementation was built as an extension of an H.264/AVC codec by introducing functionality for depth-based coding of MVD data and for VSP. Following the selection of 3DV-TM as the basis of the reference test model for the MPEG 3DV standardization of 3D-AVC13, the coding tools described in this paper were further improved through the collaborative work of JCT-3V10.

For example, proposal22 targeted the complexity reduction of the disparity derivation in DMVP. It argued that computing a disparity vector as the arithmetic average over every depth map value of the block, as shown in (6), is too complex, and that a sufficiently accurate disparity vector can be computed as the maximal disparity over the four corner samples of the current block:

$D = \max(D0, D1, D2, D3)$    (17)

where the disparity candidates D0...D3 are computed from the depth map samples associated with the corner samples of the Cb (top-left, top-right, bottom-left and bottom-right); a sketch of this derivation is given at the end of this subsection. Proposal23 suggested modifying the direction-separated MVP with a more efficient usage of disparity vectors: the disparity vector is utilized as the default motion vector candidate in the median-based MVP in the general MVP mode. Proposal24 introduced an explicit signalling scheme for the motion vector candidates in the Skip and Direct modes of DMVP. These contributions greatly improved the coding efficiency of the developed concept and matured the depth-based MVP technique in the scope of the 3D-AVC standardization.

Similarly to DMVP, BVSP is currently undergoing active development. There is a clear intention to implement BVSP as a part of the MCP chain with processing modules that are already defined in H.264/AVC. A block-based BVSP was introduced in proposal21, which implies that a single disparity value can be derived for a predicted block and utilized as an inter-view motion vector for a texture block of size 2x2. Proposal25 further developed this concept and proposed implementing BVSP for texture blocks of size 4x4 while utilizing the existing motion-compensation functionality of H.264/AVC. Similarly to the approach of26, a single disparity value is computed as in (17) for a block of texture data and
serves as the motion vector for it. It is obvious that using the existing H.264/AVC MCP modules to implement BVSP greatly reduces the complexity of the method and reduces the amount of changes required to the core technology for implementing the new 3DV video coding standard.
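The simplification of (17) can be illustrated as follows; the sketch assumes the depth_to_disparity conversion introduced in Section 2.1 and is only an approximation of the contribution in22, not its normative form.

import numpy as np

def max_corner_disparity(cb_depth, znear, zfar, focal_length, baseline):
    """Disparity derivation of (17): instead of averaging the disparity over
    all depth samples of the block as in (6), take the maximum disparity of
    the four corner samples (top-left, top-right, bottom-left, bottom-right).
    """
    corners = np.array([cb_depth[0, 0], cb_depth[0, -1],
                        cb_depth[-1, 0], cb_depth[-1, -1]])
    return depth_to_disparity(corners, znear, zfar,
                              focal_length, baseline).max()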
Table 1 Coding configuration of 3DV-ATM for the proposed schemes and the anchor (MVC+D)

Coding Parameters                          Settings
Multi-view scenario                        Three views (C3)
MVD resolution ratio (Texture : Depth)     1 : 0.5
Flexible Coding Order                      T0D0-D1T1-D2T2
Inter-view prediction structure            PIP
Inter prediction structure                 Hierarchical B, GOP 8
QP settings for texture & depth            26, 31, 36, 41
Encoder settings                           RDO ON, VSO OFF
View synthesis in post-processing          Fast_1D VSRS
Test sequences and coded views             As specified28
Table 2 Coding performance of the proposed DMVP scheme (Experiment 1)

Sequence     Texture coding        Total (coded PSNR)    Total (synth. PSNR)
             dBR, %   dPSNR, dB    dBR, %   dPSNR, dB    dBR, %   dPSNR, dB
Hall2        -13.00   0.64         -11.76   0.56         -10.99   0.52
Street        -5.32   0.18          -4.95   0.17          -4.47   0.16
Dancer        -8.95   0.35          -8.46   0.34          -7.08   0.26
GT_Fly       -13.24   0.59         -12.35   0.55         -10.89   0.46
Kendo         -7.80   0.48          -6.11   0.34          -6.42   0.34
Balloons      -8.21   0.47          -7.26   0.40          -6.99   0.37
Newspaper     -5.81   0.27          -5.06   0.23          -5.11   0.20
Average       -8.90   0.43          -7.99   0.37          -7.42   0.33
Table 3 Coding performance of the proposed DMVP and BVSP schemes (Experiment 2)

Sequence     Texture coding        Total (coded PSNR)    Total (synth. PSNR)
             dBR, %   dPSNR, dB    dBR, %   dPSNR, dB    dBR, %   dPSNR, dB
Hall2        -18.67   0.93         -16.63   0.80         -16.15   0.77
Street        -9.41   0.33          -8.56   0.30          -7.70   0.28
Dancer       -15.83   0.63         -14.94   0.61         -12.67   0.47
GT_Fly       -21.85   1.02         -20.35   0.95         -17.95   0.79
Kendo        -13.88   0.91         -10.06   0.59         -11.29   0.62
Balloons     -16.09   0.95         -13.56   0.77         -13.84   0.75
Newspaper     -8.15   0.39          -6.87   0.31          -6.99   0.28
Average      -14.84   0.74         -12.99   0.62         -12.37   0.57
Table 4 Coding performance of the state-of-the-art configuration of DMVP and BVSP in 3DV-ATM (Experiment 3)

Sequence     Texture coding        Total (coded PSNR)    Total (synth. PSNR)
             dBR, %   dPSNR, dB    dBR, %   dPSNR, dB    dBR, %   dPSNR, dB
Hall2        -28.08   1.41         -27.52   1.36         -25.73   1.26
Street       -11.73   0.42         -11.12   0.39         -10.03   0.37
Dancer       -18.96   0.76         -18.50   0.76         -16.70   0.62
GT_Fly       -23.78   1.11         -23.29   1.09         -20.98   0.92
Kendo        -21.91   1.44         -19.62   1.16         -19.13   1.07
Balloons     -25.88   1.58         -23.01   1.34         -21.29   1.17
Newspaper    -13.27   0.64         -11.38   0.53         -11.21   0.46
Average      -20.51   1.05         -19.20   0.95         -17.87   0.84

4. Simulation Results

The coding concept described in this paper was initially evaluated as a part of Nokia's response to the MPEG 3DV CfP, and a significant coding gain (about 35% improvement in dBR vs. the H.264/MVC anchor) was reported for the given test scenarios8. However, the Nokia 3DV-TM evaluated in7 included additional tools, such as joint
view filtering for depth, depth-range-based weighted prediction and gradual view refresh, which are out of the scope of the coding concept presented in this paper; therefore, the results presented in8 may not allow an accurate evaluation of the presented concept in isolation. In addition, the DMVP described in this paper was initially presented in27, and the simulation results provided there showed that DMVP provides about 8% delta bit rate
reduction compared to the traditional MVP scheme of H.264/MVC. Nevertheless, the simulation results in27 were produced with the Nokia 3DV-TM software, and thus may not allow an accurate evaluation of the technique in the scope of the recent 3DV development.

In this paper we study the coding performance of the proposed concept with the most recent 3DV-ATM software, v5.1r212, which is maintained as the reference test model for the MVC+D11 and 3D-AVC13 standardization activities. In our experiments, we utilized the most recent 3DV-ATM software configured under the Common Test Conditions (CTC)28. The experiments were conducted with the JCT-3V 3-view MVD test set, which was coded with the T0D0-D1T1-D2T2 coding order, where the terms T# and D# refer to the coded texture and depth view components, respectively. T0 was coded independently of D0 as the base view in order to provide H.264/AVC compatibility. For texture views T1 and T2, the concept of joint MVD data coding described in this paper was utilized. Views D1 and D2 were coded prior to the corresponding T1 and T2, thus enabling the DMVP and BVSP techniques. The inter-view prediction structure was fixed to PIP, with view 0 taken as the base view (I) and single-direction inter-view prediction enabled for the dependent views T1 and T2. Table 1 presents a short summary of the major parameters of this configuration; the complete configuration files for 3DV-ATM are available in12. The 3DV-ATM in the MVC+D configuration was utilized as the anchor (reference technique), and the coding parameters presented in Table 1 were also utilized for the MVC+D coding. The compression efficiency of the proposed concept and tools was evaluated in terms of the Bjontegaard delta bit rate (dBR, %) and delta PSNR (dPSNR, dB)29 computed against the MVC+D anchor results.

In Experiment 1, the 3DV-ATM software was configured to disable all 3D-AVC tools except DMVP as presented in this paper. For this purpose, all MPEG 3DV and JCT-3V adoptions modifying the DMVP scheme22-24, 26, 30 were disabled, and BVSP was disabled through the encoding configuration parameters. The coding gain achieved with DMVP against the MVC+D anchor is shown in Table 2. As shown in Table 2, the DMVP scheme presented in this paper provides a significant coding gain (9% dBR on average) for the coded texture component compared to the MVC+D anchor, and about 8% dBR on average in total for the coded MVD data, when the coded depth map bit rate is taken into account.

In Experiment 2, the 3DV-ATM software was configured to enable BVSP in addition to DMVP, while keeping the other tools of the reference test model disabled. The simulation results achieved with this configuration are shown in Table 3. DMVP and BVSP as presented in this paper provide a coding gain of 15% dBR on average for the coded texture component compared to the MVC+D coding, and about 13% dBR on average in total for the coded MVD data.

Finally, in Experiment 3, we evaluated the coding concept presented in this paper based on the latest 3DV-ATM development. The reference codec was configured to enable DMVP and BVSP as well as all recent contributions related to these tools22-24, 26, 30. This allowed us to evaluate the compression efficiency achieved with the coding concept proposed in this paper. As shown in Table 4, the state-of-the-art versions of DMVP and BVSP were
significantly improved; these tools provide about 21% dBR reduction on average for the texture component. In terms of the complete 3DV coding results, the coding concept proposed in this paper outperforms MVC+D by 19% dBR for the total MVD bit rate (texture and depth), and the view synthesis results (computed with the PSNR of the synthesized views and the total MVD bit rate) were improved by 18%. Fig. 8 visualizes this coding gain by depicting the resulting rate-distortion curves of the proposed method and the anchor for the Poznan_Hall2 and Balloons sequences.
Figure 8 Rate-distortion curves of the proposed concept compared against the anchor; (a) Poznan_Hall2 and (b) Balloons test sequences
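The dBR figures reported in Tables 2-4 are Bjontegaard delta bit rates29. For readers unfamiliar with the metric, the sketch below shows a common way of computing it from four rate-PSNR points per curve; it is an illustrative re-implementation under the usual cubic-fit assumption, not the reference VCEG-M33 implementation.

import numpy as np

def bd_rate(anchor_rd, test_rd):
    """Bjontegaard delta bit rate (dBR, %) between two RD curves.

    anchor_rd, test_rd : lists of (bitrate_kbps, psnr_dB) points, typically
    four per curve (one per QP). A cubic polynomial is fitted to log-rate as
    a function of PSNR and integrated over the overlapping PSNR range.
    """
    def fit(points):
        rates, psnrs = zip(*points)
        return np.polyfit(psnrs, np.log10(rates), 3), min(psnrs), max(psnrs)

    p_anchor, lo_a, hi_a = fit(anchor_rd)
    p_test, lo_t, hi_t = fit(test_rd)
    lo, hi = max(lo_a, lo_t), min(hi_a, hi_t)

    # Average log-rate difference over the common PSNR interval.
    int_anchor = np.polyval(np.polyint(p_anchor), hi) - np.polyval(np.polyint(p_anchor), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_anchor) / (hi - lo)

    return (10 ** avg_diff - 1.0) * 100.0  # negative values mean bit rate savings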
As can be seen from Tables 2 and 3, the coding gain provided by DMVP and BVSP relies heavily on the quality of the available depth maps. For example, a significant gain (18% dBR) is achieved for the synthetic sequences Dancer and GT_Fly, which have ground-truth depth map information, whereas for the Newspaper and Street sequences the gain is lower (8.5% dBR), due to the poor quality of their depth maps. However, DIBR algorithms rely on the depth quality even more strongly; thus we can expect that 3D video applications will operate with high-quality depth maps.
5. Conclusions

This paper presented a concept of depth-enhanced 3D video coding that is utilized for the compression of Multiview Video plus Depth (MVD) data. The paper described a novel approach of using depth/disparity information for more efficient coding of the associated video data. The tools described in this paper were developed for an H.264/AVC-based 3D video codec to address the 3D Video Coding (3DV) Call for Proposals (CfP) of the
Moving Picture Experts Group (MPEG). As a possible implementation of this concept, we described Depth-based Motion Vector Prediction and Backward View Synthesis Prediction. Simulation results conducted under the CTC show that DMVP and BVSP reduce the bit rate of the coded video data by 15%, which results in 13% bit rate savings in total for the MVD data. Moreover, the concept of depth-based coding of 3D video presented in this paper has been further developed by MPEG 3DV and JCT-3V, and this work resulted in even higher compression efficiency, bringing about 20% delta bit rate reduction in total for the coded MVD data. Combined with other coding tools that are currently under development, this may result in an attractive H.264/AVC-compatible 3D video coding standard.
References

1. T. Shibata, J. Kim, D. M. Hoffman, and M. S. Banks (2011) The zone of comfort: predicting visual discomfort with stereo displays, Journal of Vision 11(8):11, 1-29.
2. A. Smolic, K. Müller, P. Merkle, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, R. Tanger, P. Kauff, T. Wiegand, T. Balogh, Z. Megyesi, and A. Barsi (2007) Multi-view video plus depth (MVD) format for advanced 3D video systems, Joint Video Team, document JVT-W100.
3. ITU-T Recommendation H.264 (2012) Advanced video coding for generic audiovisual services.
4. Call for proposals on 3D video coding technology (2011) MPEG document N12036. Available online: http://mpeg.chiariglione.org/working_documents/explorations/3dav/3dv-cfp.zip
5. Applications and Requirements on 3D Video Coding, MPEG document. Online version: http://mpeg.chiariglione.org/working_documents/explorations/3dav/applications&requirements.zip
6. B. Bross, W.-J. Han, G. J. Sullivan, J.-R. Ohm, and T. Wiegand (ed.) (2012) High Efficiency Video Coding (HEVC) text specification draft 8, JCTVC document J1003.
7. Report of Subjective Test Results from the Call for Proposals on 3D Video Coding. Online: http://mpeg.chiariglione.org/working_documents/explorations/3dav/3d-test-report.zip
8. D. Rusanovskyy and M. M. Hannuksela (2011) Description of 3D video coding technology proposal by Nokia, MPEG document M22552.
9. H. Schwarz, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, D. Marpe, P. Merkle, K. Müller, H. Rhee, G. Tech, M. Winken, and T. Wiegand (2011) Description of 3D Video Coding Technology Proposal by Fraunhofer HHI (HEVC compatible, configuration A), MPEG document m22571.
10. JCT-3V document repository. Online: http://phenix.int-evry.fr/jct3v/
11. Y. Chen, M. M. Hannuksela, T. Suzuki, and S. Hattori, Overview of the MVC+D 3D video coding standard, Elsevier Journal of Visual Communication and Image Representation. (In press)
12. MVC+D and 3D-AVC reference software: 3DV-ATM version 5.1r2. Available online: http://mpeg3dv.research.nokia.com/svn/mpeg3dv/tags/3DV-ATMv5.1r2/
13. M. M. Hannuksela, Y. Chen, and T. Suzuki (ed.) (2013) 3D-AVC draft text 5, JCT-3V document JCT3V-C1002.
14. G. Tech, K. Wegner, Y. Chen, and S. Yea (ed.) (2012) 3D-HEVC test model 1, JCT-3V document A1005.
15. G. Tech, K. Wegner, Y. Chen, and M. M. Hannuksela (ed.) (2012) MV-HEVC working draft 1, JCT-3V document A1004.
16. C. Fehn (2004) Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV, Proc. SPIE Conf. Stereoscopic Displays and Virtual Reality Systems XI, 5291: 93-104.
17. J. Zhang, M. M. Hannuksela, and H. Li (2010) Joint multiview video plus depth coding, Proc. IEEE ICIP, 2865-2868.
18. S. Yea and A. Vetro (2009) View synthesis prediction for multiview video coding, Signal Processing: Image Communication, 24(1-2): 89-100.
19. D. Tian, P.-L. Lai, P. Lopez, and C. Gomila (2009) View synthesis techniques for 3D video, Proc. SPIE 7443, Applications of Digital Image Processing XXXII.
20. W. Su, D. Rusanovskyy, L. Chen, and M. Hannuksela (2011) CE1 - Low complexity block-based View Synthesis Prediction, MPEG document m24915, Geneva.
21. W. Su, D. Rusanovskyy, and M. M. Hannuksela (2012) 3DV-CE1.a: Block-based View Synthesis Prediction for 3DV-ATM, JCT-3V document A0107, Stockholm, Sweden.
22. J. Y. Lee, J. Lee, and D.-S. Park (2012) CE5.a results on inter-view motion vector derivation using max disparity in skip and direct modes, JCT-3V document B0149, Shanghai, China.
23. C.-L. Wu, Y.-L. Chang, Y.-P. Tsai, and S. Lei (2012) 3D-CE1.a: inter-view skip/direct mode with sub-partition scheme, JCT-3V document B0094, Shanghai, China.
24. J.-L. Lin, Y.-W. Chen, X. Guo, Y.-L. Chang, Y.-P. Tsai, Y.-W. Huang, and S. Lei (2012) 3D-CE5.a related: motion vector competition-based Skip/Direct mode with explicit signaling, MPEG document m24847, Geneva, Switzerland.
25. D. Rusanovskyy and M. M. Hannuksela (2013) CE1.a-related: Simplification of BVSP in 3DV-ATM, JCT-3V document C0169, Geneva, Switzerland.
26. J.-L. Lin, Y.-W. Chen, Y.-W. Huang, and S. Lei (2012) 3D-CE5.a related: Simplification on the disparity vector derivation for AVC-based 3D video coding, JCT-3V document A0045, Stockholm, Sweden.
27. W. Su, D. Rusanovskyy, M. M. Hannuksela, and H. Li (2012) Depth-based motion vector prediction in 3D video coding, Proc. of Picture Coding Symposium.
28. D. Rusanovskyy, K. Müller, and A. Vetro (2012) Common Test Conditions of 3DV Core Experiments, JCT-3V document A1100, Stockholm, Sweden.
29. G. Bjøntegaard (2001) Calculation of average PSNR differences between RD-curves, ITU-T SG16 Q.6, document VCEG-M33.
30. J. Y. Lee, T. Uchiumi, J. Lee, Y. Yamamoto, and D.-S. Park (2012) 3D-CE5.a results on joint proposal for an improved depth-based motion vector prediction method by Samsung and Sharp, MPEG document M24824, Geneva, Switzerland.