3D Res. 04, 02(2013)2 10.1007/3DRes.02(2013)2

3DR EXPRESS


Simple Multi-View Coding with Depth Map

Takanori Senoh • Yasuyuki Ichihashi • Hisayuki Sasaki • Kenji Yamamoto

Received: 12 October 2012 / Revised: 15 February 2013 / Accepted: 03 March 2013 © 3D Research Center, Kwangwoon University and Springer 2013

Abstract Many industries are anticipating the arrival of auto-stereoscopic video products and services based on multi-view images. Since multi-view images necessitate a large bandwidth, an efficient coding algorithm is desired. An approach that synthesizes multi-view images from a few views and depth maps is expected to be a promising method for accomplishing this. In this paper, an efficient and easy-to-apply multi-view image coding algorithm based on a depth map is proposed. The experimental results show its fast encoding and decoding speed together with subjectively good quality of synthesized view images.

Keywords multi-view image, depth map, global depth, inter-view prediction, residual, view synthesis

Takanori Senoh • Yasuyuki Ichihashi • Hisayuki Sasaki • Kenji Yamamoto
National Institute of Information and Communications Technology, 4-2-1 Nukui-kitamachi, Koganei, Tokyo 184-8795, Japan
Tel. +81-42-327-7262, Fax. +81-42-327-6902
E-mail: {senoh, y-ichihashi, sasaki, k.yamamoto}@nict.go.jp

1. Introduction

Multi-view images and depth maps provide promising methods of industrializing high-quality 3D video services and products. Stereoscopic displays displaying left and right view images enable cost-efficient 3D systems to be constructed but require eyeglasses or restrict the viewing zone. Auto-stereoscopic displays displaying multi-view images are a better solution, providing more natural 3D scenes using the 3D cues of motion parallax in addition to binocular disparity and convergence1. Since auto-stereoscopic displays require a large amount of data for multi-view images, an efficient coding method is indispensable for their popularization. To achieve this goal, the use of depth maps has been investigated2-14. All these methods add depth-based view synthesis prediction (DVSP) to block-search inter-view prediction and temporal
prediction. The prediction error is transmitted to the decoder as residuals. At the decoder side, more multi-view images than the number of transmitted views are synthesized based on the depth maps to realize smooth motion parallax for 3D images15. Since the goal for these research projects is to reconstruct input views as close to the original as possible, the residual data requires a lot of bits. Also, depth maps for each view must be coded as accurately as possible, also requiring a large number of bits. Since a multi-view plus depth map approach assumes that most of the multi-views will be synthesized from a few transmitted views and depth maps, focusing only on the decoded data will not be adequate. The final judgment should be made using the synthesized views since the view synthesis processing may introduce additional errors, especially when the depth maps or view synthesis software are not perfect. From this point of view, if the primary requirement focuses on the subjective quality of the synthesized views, the coding methods will become more efficient and appropriate. Another requirement is compatibility with existing single-view codecs. If the multi-view codec is constructed on top of an existing 2D codec without changing it, this codec can be easily introduced in the market. From these points of view, a unique multi-view coding method was proposed16, 17. In this method, depth maps and views are pre-processed to a global depth map and global view by reducing the redundancy among them before encoding. This idea was simplified and evaluated18 based on an existing 2D video codec MPEG-4 AVC22. In this paper, this concept is further extended and evaluated using a high efficiency video codec MPEG HEVC23, which is currently being standardized as a single-view 2D video codec. The following sections explain the proposed simple coding algorithm for auto-stereoscopic displays displaying multi-view images synthesized from few-view images plus depth maps. Performance is evaluated using the objective and subjective quality of synthesized views and the complexity of the codec. In the proposed algorithm, depth maps for each view are merged to create a global depth map. Since depth maps include some errors when they are estimated from multi-view
images or captured by range cameras19, the global depth map reduces these errors by averaging them and also reduces the amount of data included. The global depth map is encoded using HEVC. Since the global depth map is a single sequence, its encoding speed is faster than the inter-view prediction speed for multiple input depth maps. Multi-view images are inter-view-predicted from a base view based on the decoded global depth map. Instead of the prediction error, only unpredictable areas are merged into a frame and transmitted to the decoder as a residual view after being encoded using the same 2D video codec. Since the unpredictable areas are occlusion holes or out-of-frame
areas, the amount of residual data is small. Also, since the inter-view prediction is done using geometrical projection and the residual is encoded as a single view, its encoding speed is faster than the conventional block-search-based inter-view prediction speed for multiple input views. Since the base view is independently encoded using the same 2D codec, this view can be seen with conventional 2D decoders and displays. At the decoder side, multi-view images are synthesized from the decoded base view, residual view, and global depth map for auto-stereoscopic displays. Experimental results show the efficacy of the proposed algorithm for the HEVC codec.

Figure 1 Coding scheme of simple multi-view coding with depth map which consists of a global depth generator, residual view generator, core codecs (HEVC), and view synthesizer

2. Proposed coding algorithm

2.1 Base view coding

In the following discussion, multi-view images and depth maps provided by ISO/IEC JTC1/SC29/WG11 for MPEG 3DV Call for Proposal24 were used as the input data. As shown in Figure 1, the input data consists of three views (V1, V2, V3) left, center, and right, respectively, three depth maps at the same views (D1, D2, D3), and camera parameters for these views. These views are captured by parallel cameras placed with the same baseline length of about two pupil distances. The views are rectified and color-corrected. The associated depth maps are provided by MPEG24 after estimation from natural scenes or derivation from 3DCG scenes. The proposed simple multi-view coding with depth map scheme consists of a global depth generator, residual view generator, core codec of MPEG HEVC, and a view synthesizer. First, the global depth generator generates the global depth from all input depth maps (D1, D2, D3) using the camera parameters. Then, the residual view generator generates a combined residual view for input views V1 and V3, referring to the global depth and camera parameters. The base view (V2), residual view, and global depth are encoded by individual core codec (HEVC) separately. The view synthesizer synthesizes multi-views from the decoded base view, residual view, and global depth referring to the camera parameters. In order to include the camera parameters in the bit streams, they were compressed using the ZIP application and their bit counts were used for performance evaluation at this time. In the future, the camera parameters will be encoded according to the MPEG standard method. Details are described in the following sections.

Since the base view is used to synthesize the other multi-view images at the decoder, the center view V2 is assigned as the base view to preserve consistency in its left and right synthesized views. The base view is encoded by the core codec HEVC independently, maintaining compatibility with conventional 2D video decoders.

2.2 Global depth coding

The global depth generator projects and merges the left depth map D1, center depth map D2, and right depth map D3 to create a global depth map Dg, which is located in the center of the multi-views as shown in Figure 2. The corresponding pixel addresses before projection (x_i, y), i = 1, 2, 3, and after projection (x_i^g, y) are given by the following equations, where f (the focal length of the camera), B (the baseline length), Z_near (the nearest object distance), and Z_far (the farthest object distance) are provided by MPEG as the camera parameters for all input multi-view sequences24. D_i(x_i, y), i = 1, 2, 3, represents the depth value at the pixel location (x_i, y) in depth map D_i, and \Delta x_i represents the disparity value for depth value D_i(x_i, y).

\Delta x_i = \frac{fB}{255}\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right) D_i(x_i, y) + \frac{fB}{Z_{far}}, \quad (i = 1, 3) \qquad (1)

x_1^g = x_1 - \Delta x_1, \quad x_2^g = x_2, \quad x_3^g = x_3 + \Delta x_3

Holes in the projected depth maps, where no pixels are projected, are inpainted with the existing smaller depth value at the left or right edge of the hole. Then, the projected depth maps D_i^g(x, y), i = 1, 2, 3, are averaged as follows.

D^g(x, y) = \frac{D_1^g(x, y) + 2\,D_2^g(x, y) + D_3^g(x, y)}{4} \qquad (2)

Since the depth values D_i^g(x, y) are unreliable for natural scenes, inpainting is a kind of depth estimation from the surrounding depth values, and averaging after inpainting also mitigates the depth error of the existing depth pixels. Since the frame size of the global depth map is the same as that of the input depth maps, there is some loss of depth data at the left and right edges of the frame. Although these losses are twice as large as those of the previously reported method18, the amount of data for this global depth map is half as large. The lost areas are reconstructed at the decoder by inpainting them. Since depth maps are generally smooth, this inpainting works well.
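
As a concrete illustration of equations (1) and (2), the following Python/NumPy sketch converts an 8-bit depth map to per-pixel disparity, forward-warps the left and right depth maps to the center view point, fills projection holes with the smaller (farther) depth at the hole edges, and averages the three aligned maps with the 2:1:1 weighting of equation (2). It is a single-frame sketch under simplifying assumptions (integer-rounded disparities, nearest-wins collision handling); the function names are illustrative and not part of the paper.

```python
import numpy as np

def disparity(depth, f, B, z_near, z_far):
    """Eq. (1): 8-bit depth value -> horizontal disparity in pixels."""
    return f * B / 255.0 * (1.0 / z_near - 1.0 / z_far) * depth + f * B / z_far

def warp_to_center(depth, f, B, z_near, z_far, sign):
    """Forward-warp a side depth map to the center view point.
    sign = -1 for the left view (x - dx), +1 for the right view (x + dx)."""
    h, w = depth.shape
    warped = np.full((h, w), -1, dtype=np.int32)              # -1 marks holes
    dx = np.rint(disparity(depth, f, B, z_near, z_far)).astype(np.int32)
    for y in range(h):
        for x in range(w):
            xg = x + sign * dx[y, x]
            if 0 <= xg < w:
                # when two pixels collide, keep the nearer (larger) depth
                warped[y, xg] = max(warped[y, xg], int(depth[y, x]))
    # 1-D hole inpainting with the smaller depth at the left/right hole edge
    for y in range(h):
        row = warped[y]
        for x in np.where(row < 0)[0]:
            left = row[:x][row[:x] >= 0]
            right = row[x + 1:][row[x + 1:] >= 0]
            edges = [a[0] for a in (left[-1:], right[:1]) if a.size]
            row[x] = min(edges) if edges else 0
    return warped

def global_depth(d1, d2, d3, f, B, z_near, z_far):
    """Eq. (2): weighted average of the three aligned depth maps."""
    d1g = warp_to_center(d1, f, B, z_near, z_far, sign=-1)
    d3g = warp_to_center(d3, f, B, z_near, z_far, sign=+1)
    return ((d1g + 2 * d2.astype(np.int32) + d3g) // 4).astype(np.uint8)
```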


Then, the global depth map is subsampled to half both horizontally and vertically without a subsampling filter, keeping the depth edges sharp. As described in Section 2.3.1, the down-sampled global depth map is non-linearly up-sampled, keeping the depth edges sharp. These non-linear down-sampling and up-sampling methods avoid depth edge degradation and maintain synthesized view quality. Figure 2 shows a flow chart for global depth map generation, sub-sampling, encoding, decoding, and up-sampling. Global depth map sub-sampling can reduce the amount of depth data while preserving depth information13, 20. Since the amount of depth data is small compared with view data, even a sophisticated inter-depth coding will increase the total coding gain very slightly. The sub-sampled global depth map Dg is encoded by the core codec HEVC. Since the global depth map is a small single-view format, its encoding and decoding speed is fast compared with multi-view images. Although an additional depth frame buffer is required for global depth generation, the processing time for input depth map projection, which is done purely mathematically, one-dimensional hole inpainting, and sub-sampling without a filter is shorter than the conventional block-search-based inter-view prediction coding for three input depth maps.

Figure 2 Global depth generation, sub-sampling, encoding/decoding, and up-sampling

Figure 3 Comparison of input depth map (top) and global depth map (bottom) for CG sequence (Dancer) and natural scene (Newspaper): (a) Input depth map (Dancer), (b) Global depth map (Dancer), (c) Input depth map (Newspaper), (d) Global depth map (Newspaper)

Figure 3 shows comparisons between the input depth maps and the global depth maps. The global depth map quality is almost the same as that of the input depth map for the CG sequence (Dancer). The global depth map for the natural scene (Newspaper) is smoother than the corresponding input depth map. Since global depth generation is currently only an averaging of the input depth maps, combining it with joint multi-lateral filtering21, which utilizes texture information, would further improve the global depth map quality.


Figure 4 Residual view generation by checking holes in decoded and projected global depth maps

2.3 Residual view coding

2.3.1 Uncovered area detection

Residual view coding is done geometrically based on the decoded global depth map. Since the decoder uses the decoded global depth map to synthesize multi-view images, residual view coding should use the same decoded global depth map in order to maintain consistency between the encoder and decoder6, 25. Otherwise, the synthesized view quality degrades rapidly due to depth compression. As shown in Figure 4, the decoded global depth map Dg' is up-sampled to return to the original size20, 26, 29. This depth map up-sampling is done while maintaining the depth edges as follows. If the depth difference between two contiguous pixels is small, for example, less than 20 levels, the averaged depth sample is inserted between them. If not, the larger depth sample is repeated. Then, a 2D median filter of 5x5 pixels smoothes diagonal depth edges, maintaining the depth levels. After this up-sampling is performed, the depth maps of foreground objects become about one pixel fatter. Since background objects are generally less textured, this up-sampling maintains foreground object quality by moving the distortion to the less-visible background area in the synthesized views. Then, the reconstructed global depth map is projected to the in-between view-points D1.0', D1.25, D1.5, D1.75, D2.25, D2.5, D2.75, and D3.0' using its own depth value as explained in Section 2.2. These view-points are where the residual area in the left and right input views (V1, V3) is determined. The residual area corresponds to the occluded area or out-of-frame area in the base view V2. The corresponding pixel locations in the depth map (x_k, y) and the base view (x_k^2, y), k = 1.0, 1.25, 1.5, 1.75, 2.25, 2.5, 2.75, 3.0, are expressed in the following equation.

x_k^2 = x_k - (2 - k)\,\Delta x_k, \quad (k = 1.0, 1.25, 1.5, 1.75, 2.25, 2.5, 2.75, 3.0) \qquad (3)

This equation is just used to check whether the corresponding pixel address (x_k^2, y) is inside the base view or not. If not, that address is registered to the hole mask Mk as an unpredictable area. Hole masks are generated for all in-between view-points and used to determine the residual area in the left and right views V1 and V3. If the 3D scene has no common holes, which can be seen neither in the base view nor in the residual view, the occlusion areas detected in the left and right views (V1, V3) cover all occlusion areas in arbitrary in-between views. If common holes exist, mainly caused by wrong depth edges, they must be detected at the view-points where such holes appear and must be added to the residual view. Although such areas do not contain the true common-hole texture, they are necessary to compensate for wrong depth map edges. For that purpose, six test-points V1.25, V1.5, V1.75, V2.25, V2.5, and V2.75 are added to detect the common holes caused by wrong depth edges, in addition to the view-points V1 and V3. These eight test-points are empirically determined to suppress the complexity of the residual view generator while increasing the synthesized view quality. Any undetected common hole area in the other arbitrary views is filled with the average color of the residual view to minimize the view synthesis error at the decoder.
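
To make the non-linear up-sampling rule described at the beginning of this subsection concrete, the sketch below up-samples a depth map by two in each direction: when two contiguous samples differ by less than a threshold (20 levels in the text), their average is inserted between them; otherwise the larger (nearer) sample is repeated, keeping depth edges sharp. A 5x5 median filter is then applied, as in the text. This is a minimal sketch; the border handling and function names are assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import median_filter

def upsample_row(row, th=20):
    """Insert one sample between each pair of contiguous depth samples:
    the average if the pair is similar, otherwise the larger (nearer) value."""
    out = np.empty(2 * len(row), dtype=row.dtype)
    out[0::2] = row                              # keep the original samples
    left, right = row[:-1], row[1:]
    mid = np.where(np.abs(left.astype(np.int32) - right) < th,
                   (left.astype(np.int32) + right) // 2,
                   np.maximum(left, right))
    out[1:-1:2] = mid.astype(row.dtype)
    out[-1] = row[-1]                            # last sample is simply repeated
    return out

def upsample_depth(depth, th=20):
    """Edge-preserving 2x up-sampling, rows then columns, then a 5x5 median."""
    rows = np.stack([upsample_row(r, th) for r in depth])
    cols = np.stack([upsample_row(c, th) for c in rows.T]).T
    return median_filter(cols, size=5)
```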

2.3.2 Occlusion hole detection

An unpredictable area is not only an out-of-frame area but also an occluded area in the base view. In this case of backward projection from the base view to the depth map view, an occlusion hole must be detected by checking the depth map. Otherwise, pixels of foreground objects are projected to the
background area in the depth map view, as shown in Figure 5.

Figure 5 Occlusion hole detection by checking neighbor depth value

Figure 6 View synthesis by projecting decoded base view and residual view based on decoded global depth map

An occlusion hole is detected by checking the following condition in the depth map. If a neighbor depth value D_k(x+\delta, y) is larger than the currently checked depth value D_k(x, y) by th or more, the current pixel (x, y) is in an occlusion hole, where th is the depth difference value derived from the disparity difference \delta. Since an occlusion hole exists on the left side of foreground objects when the base view resides to the right of the depth map view, only the right neighbor depth values are checked. If the base view resides to the left, the left neighbor depth values are checked by changing the sign of \delta. The search range of \delta is 0 to the maximum disparity difference given by fB_k(1/Z_near - 1/Z_far), where B_k represents the baseline length between view-point D_k and the base view V2.

\text{if } D_k(x+\delta, y) - D_k(x, y) \ge th, \text{ then } (x, y) \text{ is in an occlusion hole, where}

th = \frac{255\,\delta}{fB_k\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right)}, \quad \left(0 \le \delta \le fB_k\left(\frac{1}{Z_{near}} - \frac{1}{Z_{far}}\right)\right) \qquad (4)

The detected hole area is registered to the hole mask Mk. As explained in Section 2.3.1, in order to detect common holes in in-between view points, this hole detection is done not only at the residual view-points D1 and D3, but also at the other view points Dk.
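
A direct reading of equation (4) can be sketched as follows: for each pixel of a projected depth map Dk, the right-hand neighbors within the maximum disparity difference are scanned, and the pixel is marked as being in an occlusion hole whenever a neighbor is nearer by at least the threshold derived from its offset delta. The vectorized loop below is an illustrative assumption about how this check might be implemented, not the paper's code.

```python
import numpy as np

def occlusion_mask(dk, f, Bk, z_near, z_far):
    """Eq. (4): mark pixels hidden behind nearer objects in the base view.
    Only right-hand neighbors are checked (base view to the right of Dk)."""
    h, w = dk.shape
    mask = np.zeros((h, w), dtype=bool)
    # maximum disparity difference between view point k and the base view
    span = f * Bk * (1.0 / z_near - 1.0 / z_far)
    d_max = int(np.ceil(span))
    for delta in range(1, d_max + 1):            # delta = 0 is trivially satisfied
        th = 255.0 * delta / span                # depth step matching offset delta
        diff = dk[:, delta:].astype(np.int32) - dk[:, :w - delta]
        mask[:, :w - delta] |= diff >= th
    return mask
```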

2.3.3 Residual coding

After out-of-frame area detection and occlusion hole detection are performed, all hole masks Mk are shifted to the nearer left or right view point, M1 or M3, and logically added to form the left hole mask M1 and the right hole mask M3. Since these masks are used to cut out the residuals, they are dilated by two pixels to enlarge the residual area and allow for degradation of the residual edge caused by the following encoding process. Based on these hole masks M1 and M3, the left and right residuals R1 and R3 are cut out from the left and right views V1 and V3. The area having no pixels is inpainted with the average color of the residual frame. Three pixels outside of the residual edge are linearly interpolated to increase the coding efficiency. The left and right residuals are then subsampled to half horizontally and vertically and stacked into a frame of half-width. A simple 3-tap (0.25, 0.5, 0.25) low-pass filter is used to avoid
aliasing in the subsampled residuals. Since residuals are used only for the uncovered areas in the synthesized views, subsampling does not affect the synthesized view quality very much but reduces the amount of residual data. Since the left and right residuals have no correlation with each other, inter-residual coding gains nothing; stacking them is a simpler way to handle residual coding. The stacked residual is encoded by the core codec HEVC as a single view.
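
The residual-frame construction described above can be sketched as follows: each hole mask is dilated by two pixels, the residual pixels are cut out of the side view, the remaining area is filled with the average color of the residual pixels, a 3-tap (0.25, 0.5, 0.25) filter is applied before 2:1 sub-sampling in both directions, and the two half-size residuals are stacked into a half-width frame. The linear interpolation of the three pixels outside the residual edge is omitted, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation, convolve1d

def lowpass_subsample(img):
    """3-tap (0.25, 0.5, 0.25) low-pass filter, then 2:1 sub-sampling per axis."""
    k = np.array([0.25, 0.5, 0.25])
    img = convolve1d(img.astype(np.float32), k, axis=0)
    img = convolve1d(img, k, axis=1)
    return img[::2, ::2]

def residual_frame(v1, v3, m1, m3):
    """Cut out residual areas from V1/V3, inpaint the rest with the average
    residual color, sub-sample, and stack into one half-width frame."""
    halves = []
    for view, mask in ((v1, m1), (v3, m3)):
        mask = binary_dilation(mask, iterations=2)        # 2-pixel dilation
        res = np.empty(view.shape, dtype=np.float32)
        res[:] = view[mask].mean(axis=0) if mask.any() else 0.0
        res[mask] = view[mask]                            # keep the residual pixels
        halves.append(lowpass_subsample(res))
    return np.vstack(halves).astype(np.uint8)             # half-width frame
```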

Figure 7 Averaged PSNR of synthesized views vs. total bit rate for seven test sequences (Reference: anchor, Tested: proposed): (a) Poznan_Hall2, (b) Poznan_Street, (c) Undo_Dancer, (d) GT_Fly, (e) Kendo, (f) Balloons, (g) Newspaper

2.4 View synthesis

At the decoder side, the core decoder HEVC outputs the decoded base view V2', global depth map Dg', and residual Vr'. Figure 6 shows the view synthesis process. The decoded global depth map Dg' and residual Vr' are up-sampled to return to the original size. Depth map up-sampling is done in the same manner as described in Section 2.3.1. Residual up-sampling is done with a simple 2-tap (0.5, 0.5) low-pass filter. Sophisticated peaking filters are not used since they increase the ringing artifacts around the edge of the residual area. The reconstructed global depth map Dg' is projected to the target view points Dl and Dr to synthesize the output multi-views. The uncovered areas in
Dl and Dr are in-painted in the same manner as in Section 2.2. The target depth maps Dl and Dr are used to inversely project the decoded base view V2' to the target views. During the inverse projection process, out-of-frame areas and occlusion holes are detected and the hole masks Ml and Mr are generated in the same manner as described in Section 2.3. The reconstructed left and right residuals R1' and R3' are also inversely projected to the target views Rl and Rr and overwritten on the hole mask areas Ml and Mr in the left and right target views Vl and Vr. During the inverse projection of the residuals, occlusion holes are detected and logically multiplied with the hole masks Ml and Mr of the target views to generate common hole masks. The common holes are inpainted using the surrounding pixels of the hole. Then, the target views are output as the synthesized views Sl and Sr. This view synthesis is repeated as many times as required by the auto-stereoscopic display.
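
The core of this view synthesis step, the backward projection of the base view along a target depth map and the overwriting of holes with the projected residual, can be outlined as below. The sketch assumes the target depth map and the residual have already been warped to the target view point, ignores the common-hole inpainting, and rounds disparities to integer pixels, so it is a simplified illustration rather than the decoder's actual procedure.

```python
import numpy as np

def synthesize_view(base, depth_t, residual_t, f, B_t, z_near, z_far, sign):
    """Backward-project the base view to a target view along the target depth
    map (cf. Eq. 3), then fill out-of-frame holes from the projected residual.
    sign = -1 for targets left of the base view, +1 for targets to the right;
    B_t is the baseline between the target view point and the base view."""
    h, w = depth_t.shape
    out = np.zeros_like(base)
    hole = np.zeros((h, w), dtype=bool)
    dx = np.rint(f * B_t / 255.0 * (1.0 / z_near - 1.0 / z_far) * depth_t
                 + f * B_t / z_far).astype(np.int32)
    for y in range(h):
        for x in range(w):
            xs = x + sign * dx[y, x]             # corresponding base-view column
            if 0 <= xs < w:
                out[y, x] = base[y, xs]
            else:
                hole[y, x] = True                # unpredictable from the base view
    out[hole] = residual_t[hole]                 # overwrite holes with the residual
    return out
```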

3. Experimental results

The proposed simple multi-view codec with depth map algorithm was implemented and simulated on a computer. Seven test sequences in the MPEG 3DV CfP24, Poznan_Hall2, Poznan_Street, Undo_Dancer, GT_Fly, Kendo, Balloons, and Newspaper, were used for the evaluation. Simulation setups such as the quantization parameter QP, numbers of frames, and prediction structures were configured according to the MPEG 3DV Common Test Condition27. The performance was compared with the MPEG 3DV HTM3.1 anchor codec28 with its default configuration files. The anchor codec encodes multi-view images with inter-view prediction, which searches corresponding macro-blocks in predicted views, as well as motion vector search in temporal frames. For the view synthesis of the anchor codec results, VSRS_1D_fast, which is included in the MPEG 3DV HTM3.1 anchor codec, was used. The anchor codec was also used as the HEVC codec for the core codecs of the proposed method by modifying the configuration files to a one-view sequence. The same QP parameters as were used for base view coding were used for global depth map and residual view coding in order to keep their quality the same as for the base view. The other test conditions are the same as for the anchor codec.

Figure 8 Encoding, decoding, and view synthesis (rendering) time ratios compared with the anchor: (a) Encoding time, (b) Decoding time, (c) Rendering time (1024x768: average of Balloons, Kendo, and Newspaper; 1920x1088: average of GT_Fly, Poznan_Hall2, Poznan_Street, and Undo_Dancer)

3.1 Objective evaluation

Figure 7 shows the averaged PSNRs of synthesized views S1.0, S1.25, S1.5, S1.75, S2.25, S2.5, S2.75, and S3.0 for each sequence at each rate point. Since no reference view is provided by MPEG for intermediate views such as V1.25, V1.5, V1.75, V2.25, V2.5, and V2.75, these views were synthesized using the MPEG VSRS_1D_fast software from input views V1, V2, and V3 and depth maps D1, D2, and D3. The other PSNRs were measured referring to the input views. Since the PSNRs of the decoded base view V2' are the same as those of the anchor, they were not included in the average PSNR. The PSNRs of the proposed algorithm are lower than those of the anchor. Since the synthesized reference views might include some artifacts caused by the imperfect depth maps and the view synthesis software, these PSNRs are not highly accurate. The PSNRs of the anchor results are better since they are synthesized by the same view synthesis software as was used for the reference view synthesis. Since practical synthesized view quality is better assessed subjectively, a subjective test was done as explained in the following section. Figure 8 shows the encoding, decoding, and view synthesis time ratios compared with the anchor. The enc time includes all processing time for global depth generation, its encoding, residual view generation, its encoding, and base view encoding. The dec time includes all decoding time for the global depth, residual view, and base view. The ren time includes all view synthesis time for the eight views V1.0, V1.25, V1.5, V1.75, V2.25, V2.5, V2.75, and V3.0. The ren time for the anchor as a reference is the sum of six views, V1.25, V1.5, V1.75, V2.25, V2.5, and V2.75, since V1.0 and V3.0 are directly decoded from the bit stream. The encoding time and decoding time are shorter than for the anchor (14.3% and 17.4% on average) since the block-search-based inter-view predictions for the depth maps and the multi-views are replaced by a purely mathematical geometrical projection in the proposed algorithm and no brute-force search is required. Also, the decoding time is faster than for the anchor since the decoding process does not require inter-view decoding and the residual views are not reconstructed to the complete view; their full reconstruction is done in the view synthesis process together with many other target views.
The view synthesis time is longer than for the anchor (125.7% on average) since the view synthesis for views 1 and 3 is included. The view synthesis time per view is almost the same as for the anchor.

Figure 9 Synthesized view S1.5 at rate point QP40 for the proposed algorithm (left) and the anchor (right): (a)(b) Poznan_Hall2, (c)(d) Poznan_Street, (e)(f) Undo_Dancer, (g)(h) GT_Fly, (i)(j) Kendo, (k)(l) Balloons, (m)(n) Newspaper

The proposed algorithm requires an additional one-frame memory for global depth generation, three-frame binary memory for hole masks, and one-frame texture memory for residual view generation compared with the anchor method. The one-frame memory for depth generation is used for the projection of the left depth map D1 and the averaging of it with the center depth map D2. The results are overwritten on D2. Then, the same process is performed on the right depth map D3. The same work memory for depth generation is used to project the decoded depth map to one of the in-between view points. There, the occlusion hole is detected and stored in a mask memory. The detected occlusion hole is projected to the left view point using the second mask memory and logically added with the hole mask for the left view point M1, which is stored in the third mask memory. In the same way, the occlusion hole masks for all in-between view points are generated and logically added with the hole mask M1 sequentially. Then, the left view V1 is masked by mask M1 and the hole area is left as the left residual R1. It is sub-sampled to 1/2x1/2 and stored in the work memory for residual view generation. The right residual is also generated in the same way and stored in the same residual memory. For view synthesis, one frame fewer depth memories are required for the proposed algorithm since there is only one depth map compared with three for the anchor.

3.2 Subjective evaluation

Since accurate objective evaluation is difficult for synthesized multi-view images, the subjective image quality was evaluated. Figure 9 shows the synthesized view S1.5 at the lowest rate point QP40 for the proposed algorithm (left) and for the anchor (right). As seen in these figures, the proposed algorithm yields almost the same subjective quality as the anchor. In Figure 9(a), proposed, a slight ringing was observed around the right person compared with Figure 9(b), anchor. In Figure 9(d), anchor, a ghost image was observed around the car compared with Figure 9(c), proposed. The following subjective evaluation was done by ten subjects, who are experts on multi-view image coding, with a stereo pair of synthesized views S1.75 and S2.25 of each sequence on a 46-inch interlaced stereo 3D monitor with polarized eyeglasses. They scored the subjective quality difference of the proposed algorithm compared to the anchor results on seven levels (+3: significantly better, +2: better, +1: slightly better, 0: same, -1: slightly worse, -2: worse, -3: significantly worse). Figure 10 shows the average score of the ten subjects for each sequence. In the figure, CE7 denotes the score of the proposed method. The last graph shows the total average over the seven sequences. As seen in these graphs, all scores except for Newspaper were almost the same as for the anchor. For the total average, the score of the proposed algorithm was almost the same as for the anchor. The reason for the slightly lower score for the Newspaper sequence is that the provided depth maps include more errors than for the other sequences, as shown in Figure 3. It must be noted that the scores of the synthesized sequences Dancer and GT_Fly are slightly better than those of the anchor at lower bit rates. The depth maps for these sequences are perfect, having no errors. This means that the proposed algorithm will yield better synthesized view quality when the input depth maps are improved in the future. Since the above subjective test evaluates views (S1.75 and S2.25) that are rather close to the base view (V2.0), Figure 11 shows an example of synthesized views (S1.0 and
S1.25), which are the farthest from the base view, at the lowest rate point QP=40 for the Newspaper sequence. Although some artifacts are observed compared with the middle view point (S1.5) in Figure 9(m) or (n), they look acceptable at the lowest rate point.

Figure 10 Averaged subjective quality difference of synthesized views compared to the anchor results (CE7: proposed): (a) Poznan_Hall2, (b) Poznan_Street, (c) Undo_Dancer, (d) GT_Fly, (e) Kendo, (f) Balloons, (g) Newspaper, (h) Average of all sequences

4. Conclusion

A simple multi-view coding algorithm with depth map for auto-stereoscopic displays was proposed. Experimental results showed that the proposed algorithm subjectively yielded the same quality for synthesized views with much faster coding speed than the anchor codec. Although the PSNRs for synthesized views were lower than for the anchor codec, they are not fully accurate because the references were synthesized using imperfect depth maps. The proposed algorithm is less complex and easy to implement, with almost the same subjective quality as the anchor codec. Since this algorithm uses a single-view 2D image codec without changing it, it can realize a fast and
efficient multi-view image codec easily. Also, since this algorithm works with all 2D image codecs without changing them, its coding efficiency will be easily improved when the core 2D codec is improved. The proposed algorithm will accelerate the early introduction of auto-stereoscopic displays and their services into the market.

Figure 11 Example of synthesized views (S1.0 and S1.25), which are the farthest from the base view (V2.0), at rate point QP=40 for Newspaper: (a) S1.0 (proposed), (b) S1.25 (proposed)

References
1. O. Schreer, P. Kauff, T. Sikora (2005) 3D video communication, WILEY.
2. M. Magnor, P. Eisert, B. Girod (2000) Multi-view image coding with depth maps and 3-D geometry for prediction, Proc SPIE Visual Communications and Image Processing 4310:263-271.
3. S. Shimizu, M. Kitahara, K. Kamikura, Y. Yashima (2006) Multi-view video coding based on 3-D warping with depth-map, 25th PCS Proceedings: Picture Coding Symposium, PCS2006.
4. S. T. Na, K. J. Oh, C. Lee, Y. S. Ho (2008) Multi-view depth video coding using depth view synthesis, Circuits and Systems, IEEE International Symposium on, 4541689:1400-1403.
5. P. Merkle, Y. Morvan, A. Smolic, D. Farin, K. Muller, P. H. H. de With, T. Wiegand (2009) The effects of multiview depth video compression on multiview rendering, Signal Processing: Image Communication 24:73-88.
6. S. Shimizu, H. Kimata, Y. Yashima, M. Tanimoto (2009) Efficient multi-view coding using multi-view depth map, J. Inst. of Image Info. and Television Eng. 63(4):524-532.
7. J. Y. Lee, H. Wey, D. S. Park (2010) A novel approach for efficient multi-view depth map coding, 28th Picture Coding Symposium PCS2010:302-305.
8. K. N. Iyer, K. Maiti, B. Navathe, H. Kannan, A. Sharma (2010) Multiview video coding using depth based 3D warp, Multimedia and Expo (ICME), 2010 IEEE International Conference on, ICME2010:1108-1113.
9. K. Takahashi (2010) Performance analysis on multi-view coding with depth map distortion, Proc International Conference on Image Processing (ICIP2010) 565257:2625-2628.
10. J. H. Yoo, Y. H. Seo, D. W. Kim, M. Kim, J. S. Yoo (2010) MVC algorithm using depth map through an efficient side information generation, ICIC Express Letters 4(5B):1863-1868.
11. C. Lee, B. Choi, Y. S. Ho (2011) Efficient multiview depth video coding using depth synthesis prediction, Optical Engineering 50(7):077004(1)-077004(13).
12. Q. Liu, Y. Zhang, X. Ji, Q. Dai (2012) Geometric mapping assisted multi-view depth video coding, Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, ICASSP2012:1453-1456.
13. F. Shao, D. Jiang, G. Jiang, M. Yu, F. Li (2012) A multi-view video plus depth coding method based on view warping and bit allocation, Proc International Conference on Measurement, Information and Control MIC2012:436-439.
14. H. Schwarz, C. Bartnik, S. Bosse, H. Brust, T. Hinz, H. Lakshman, D. Marpe, P. Merkle, K. Muller, H. Rhee, G. Tech, M. Winken, T. Wiegand (2012) 3D video coding using advanced prediction, depth modeling, and encoder control methods, 2012 Picture Coding Symposium PCS2012 6213271:1-4.
15. K. Muller, P. Merkle, G. Tech, T. Wiegand (2010) 3D video formats and coding methods, Proc 2010 IEEE 17th International Conference on Image Processing:2389-2392.
16. T. Ishibashi, T. Yendo, M. P. Tehrani, T. Fujii, M. Tanimoto (2011) Global view and depth format for FTV, Proc Digital Signal Processing ICDSP2011 6005013:1-6.
17. T. Ishibashi, M. P. Tehrani, T. Fujii, M. Tanimoto (2012) FTV format using global view and depth map, Picture Coding Symposium PCS2012:29-32.
18. T. Senoh, K. Yamamoto, R. Oi, Y. Ichihashi, T. Kurita (2012) Simple multi-view coding with depth map, Proc 3 Dimensional Systems and Applications 3DSA2012 S6-1:223-227.
19. M. Kurc, O. Stankiewicz, M. Domanski (2012) Depth map inter-view consistency refinement for multiview video, Proc Picture Coding Symposium PCS2012:137-140.
20. E. Ekmekciouglu, M. Mrak, S. Worrall, A. Kondoz (2009) Utilization of edge adaptive upsampling in compression of depth map videos for enhanced free-viewpoint rendering, Image Processing (ICIP), 16th IEEE International Conference on, ICIP2009 5414296:733-736.
21. E. Ekmekciouglu, S. Worrall, V. Velisavljevic, D. D. Silva, A. Kondoz (2011) Multi-view depth processing using joint filtering for improved coding performance, ISO/IEC JTC1/SC29/WG11 M20070.
22. ISO/IEC AVC (2012) Information technology - Coding of audio-visual objects - Part 10: Advanced Video Coding, ISO/IEC 14496-10.
23. ISO/IEC HEVC (2012) High efficiency video coding (HEVC) test model 8 (HM 8) encoder description, ISO/IEC JTC1/SC29/WG11, N12933.
24. ISO/IEC 3DV (2011) Call for proposals on 3D video coding technology, ISO/IEC JTC1/SC29/WG11, N12036.
25. G. Zhu, G. Jiang, M. Yu, F. Li, F. Shao, Z. Peng (2012) Joint video/depth bit allocation for 3D video coding based on distortion of synthesized view, Broadband Multimedia Systems and Broadcasting (BMSB), 2012 IEEE International Symposium on, BMSB2012 6264319:1-6.
26. W. S. Kim, A. Ortega, J. Lee, H. C. Wey (2010) 3-D video coding using depth transition data, 28th Picture Coding Symposium PCS2010 5702453:178-181.
27. JCT-3V CTC (2012) Common test conditions of 3DV core experiments, JCT-3V, A1100.
28. ISO/IEC HTM (2012) 3D-HEVC test model 1, ISO/IEC JTC1/SC29/WG11, N12937.
29. Q. Zhang, P. An, Y. Zhang, Q. Zhang, Z. Zhang (2010) Reduced resolution depth compression for multiview video plus depth coding, Signal Processing (ICSP), IEEE 10th International Conference on, 1145-1148.
