UTILISATION OF EDGE ADAPTIVE UPSAMPLING IN COMPRESSION OF DEPTH MAP VIDEOS FOR ENHANCED FREE-VIEWPOINT RENDERING

Erhan Ekmekcioglu, Marta Mrak, Stewart Worrall, Ahmet Kondoz
Centre for Communication Systems Research, University of Surrey, GU2 7XH Guildford, Surrey, United Kingdom
{E.Ekmekcioglu, M.Mrak, S.Worrall, A.Kondoz}@surrey.ac.uk

ABSTRACT

In this paper we propose a novel video object edge adaptive upsampling scheme for application in video-plus-depth and Multi-View plus Depth (MVD) video coding chains with reduced resolution. The proposed scheme improves the rate-distortion performance of reduced-resolution depth map coders, taking into account the rendering distortion induced in free-viewpoint videos. The inherent loss of fine detail due to downsampling, particularly at video object boundaries, causes significant visual artefacts in rendered free-viewpoint images. The proposed edge adaptive upsampling filter allows such critical object boundaries to be preserved and better reconstructed. Furthermore, the proposed scheme does not require the edge information to be communicated to the decoder, as the edge information used in the adaptive upsampling is derived from the reconstructed colour video. Test results show that a gain of as much as 1.2 dB in free-viewpoint video quality can be achieved with the proposed method, compared to the scheme that uses the linear MPEG re-sampling filter. The proposed approach is suitable for video-plus-depth as well as MVD applications, in which it is critical to satisfy bandwidth constraints while maintaining high free-viewpoint image quality.

Index Terms— multi-view video, depth map compression, free-viewpoint video.

1. INTRODUCTION

Stereoscopic video, the simplest form of 3D video, has extended conventional 2D video applications in many areas, from entertainment to communications. More recent 3D video concepts, such as free-viewpoint TV (FTV) and three-dimensional television (3DTV), are receiving increasing attention because they allow a high level of interaction between the user and the service provider. Furthermore, a more immersive viewing experience is provided, as 3D video conveys scene depth information. Scene depth information, which represents the relative distance of each video object to the recording camera, constitutes an important part of these 3D video representations. Video-plus-depth, Multi-view Video plus Depth (MVD) [1], and Layered Depth Video (LDV) [2] are some of the key representations. Robust depth information is particularly needed in free-viewpoint video rendering to avoid visual artefacts in image areas separating background and foreground objects. Inherent distortion in decompressed depth
images, induced by quantisation, triggers such artefacts. At the same time, efficient compression of depth map video sequences, especially in the multi-view scenario, is essential to minimise the overhead of 3D video. A common approach is to downsample the depth maps before compression to save bandwidth [3], [4]. Generally, linear sampling filters, the MPEG scaling conversion filters being an example [5], are used without local adaptation for the sake of implementation efficiency. However, the effects on free-viewpoint rendering are generally ignored, and the low-pass filtering distortion in the reconstructed depth map images often results in visually distorted object boundaries in the rendered images, especially at fairly wide baseline distances.

Our proposal is built on generic 1D upsampling filters, as they are of low complexity and, despite their simplicity, can accurately adapt to local image statistics. With the help of the extracted video shape information, significant depth transitions are well reconstructed in the High Resolution (HR) depth map after upsampling. This, in turn, yields visual improvements in rendered images, especially at object boundaries. Furthermore, there is no need for explicit communication of object edge information between the server and the client, so the overhead is not increased.

Section 2 gives background information on depth map compression and describes the proposed approach. The experiments are explained in section 3 and the results are discussed in section 4. Finally, the paper is concluded in section 5.

2. DEPTH MAP COMPRESSION

2.1. Background on depth map compression

Depth maps, unlike colour videos, are usually composed of regions of uniform gray-scale intensity. When encoded similarly to conventional 2D video, they tend to produce fewer bits than their equivalent colour texture images. MPEG-4 has an extension called multiple auxiliary components, suitable for handling depth map videos. MPEG-C provides specifications on the carriage of video depth information [6]. Current multi-view plus depth map coders (as in [1]) treat depth information as colour information and encode it using MVC [7]. There are also other approaches that do not rely on block-based coders. In [8], encoding of the depth map videos using platelets is suggested, which involves the creation of piecewise linear representations for depth map segments. In [9], the authors
represent the depth map segments by triangular meshes and compress them accordingly.

Our proposed approach is based on the AVC codec (similar to MVC) with reduced-resolution (asymmetric) coding of depth map videos, as in [3], [4], for enhanced compression performance. Colour videos are assumed to be coded at a bit-rate around five times higher than that of the depth map videos; such a rate allocation provides acceptable 3D perception [10]. Without loss of generality, a downsampling factor of 1/2 in both dimensions of the depth map videos is considered for reduced resolution depth map coding in our approach.

2.2. Description of the linear MPEG upsampling filter

The linear spatial scalability filters of MPEG are described in [5]. A set of two 1D, 6-integer-tap Lanczos filters, shown in (1), is used for dyadic upsampling by a factor of 2:

$\{1, -4, 28, 9, -2, 0\}, \qquad \{0, -2, 9, 28, -4, 1\}$  (1)

The two sets of coefficients are used in an alternating manner during upsampling, changing after each pixel shift of the filter window. The filters applied are linear and the sampling process is uniform, i.e. the same filter kernel is used regardless of local image statistics. Problems occur around depth discontinuities. At the discontinuities, the filter window encapsulates depth values belonging to both the background and the foreground objects, resulting in smooth transitions of depth values in these areas after upsampling. Smooth transitions of depth values result in ambiguous object boundaries in rendered free-view videos, particularly around foreground objects close to the camera.
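As an illustration of this uniform process, the following is a minimal sketch of the two-phase dyadic upsampling of one row using the coefficients in (1). The edge-replication padding, the rounding offset and the exact phase alignment are our assumptions; the coefficients of each phase sum to 32, hence the normalisation.

```python
import numpy as np

# The two 6-tap phase filters from (1); each sums to 32.
PHASES = [np.array([1, -4, 28, 9, -2, 0]),
          np.array([0, -2, 9, 28, -4, 1])]

def upsample_row_linear(row):
    """Dyadic (x2) upsampling of a 1-D depth row with the alternating
    MPEG phase filters, applied uniformly regardless of content."""
    padded = np.pad(row.astype(np.int64), (2, 3), mode='edge')
    out = np.empty(2 * len(row), dtype=np.int64)
    for x in range(len(row)):
        window = padded[x:x + 6]            # 6 input samples around x
        for phase in (0, 1):
            out[2 * x + phase] = (window @ PHASES[phase] + 16) // 32
    return np.clip(out, 0, 255).astype(np.uint8)
```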
2.3. Proposed edge-aware upsampling

We modify the upsampling process to take depth transitions into account. For this, video edge information is required during the upsampling of the depth maps. Edge-driven transformation and interpolation approaches exist in the literature [11], [12], [13]. In [11], the coefficients of the biorthogonal 9/7 type filter are changed according to the local spatial frequencies of high-pass frames for better edge reconstruction. In [12] and [13], the linear interpolation filters are replaced with filters that adapt themselves to local image statistics for better image perception, at the expense of increased complexity.

In our work, edge information is extracted from the high resolution reconstructed colour texture video by applying a 3x3 Sobel operator. Gradients caused by texture changes, rather than by foreground-background transitions, are eliminated by checking the local depth intensity gradients in the low resolution depth map as:

$\hat{e}(x, y) = N(x, y) \times e(x, y)$  (2)

where

$N(x, y) = \begin{cases} 1, & \frac{\nabla_x D_{LR}(x/2,\, y/2) + \nabla_y D_{LR}(x/2,\, y/2)}{2} \geq T \\ 0, & \text{otherwise} \end{cases}$
Here, $e$ and $\hat{e}$ represent the initial and refined edge maps, respectively, and $D_{LR}$ is the reconstructed low resolution (LR) depth map. $T$ is a predetermined threshold, and the same value of $T$ is used for all video sequences; it determines the sensitivity of the adaptive filter to the significance of the edges present in the downscaled depth map. $\hat{e}$ is equal to 1 if the pixel is an edge pixel and 0 otherwise. Fig. 1 depicts this process.

Figure 1. Edge map extraction process at the client side (the Sobel operator is applied directly to the depth map image at the server side).

During upsampling, whenever the sliding filter window crosses or encapsulates an edge pixel, the window is cut one pixel before the transition takes place, and the rest of the window is padded using the last pixel before the edge. In this way, only background region pixels or only foreground region pixels are included in the computation. The adaptively selected upsampling filter coefficients, $\hat{h}_i$, are derived from the coefficients of the original filter, $h_i$, and the edge map information of the corresponding pixels in the image under upsampling, $e(i)$, as:

$\hat{h}_i = \begin{cases} \prod_{j=i}^{H_D/2-1} \bar{e}(j) \cdot \left( e(i-1) \sum_{m=0}^{i} h_m + \bar{e}(i-1)\, h_i \right), & i < \frac{H_D}{2} - 1 \\ e(i-1) \sum_{j=0}^{i} h_j + e(i) \sum_{j=0}^{H_D-1} h_j + e(i+1) \sum_{j=i}^{H_D-1} h_j + \prod_{j=i-1}^{i+1} \bar{e}(j)\, h_i, & i = \frac{H_D}{2} - 1 \\ \prod_{j=H_D/2-1}^{i} \bar{e}(j) \cdot \left( e(i+1) \sum_{m=i}^{H_D-1} h_m + \bar{e}(i+1)\, h_i \right), & i > \frac{H_D}{2} - 1 \end{cases}$  (3)

where $i = 0, 1, \ldots, H_D - 1$ is the coefficient index and $H_D$ is the original filter size. $\bar{e}$ represents the complement of $e$, where $e$ equals 1 if the corresponding pixel is a boundary pixel and 0 otherwise. For simplicity of notation, the real pixel index in the edge map function $e$ is dropped. Equation (3) is exactly the mathematical representation of the coefficient reordering process depicted in Fig. 2, where the sliding filter window is about to cross an object border.
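The following is a minimal sketch of this edge-adaptive procedure, under stated assumptions: the binarisation threshold for the colour Sobel magnitude and the value of T are illustrative (the paper does not publish them), the Sobel operator is applied to the luma plane, the HR colour edge map is brought to LR depth coordinates by simple subsampling, and the window is modified in the sample domain by cutting and padding, which is equivalent to the coefficient reordering in (3). `PHASES` is the filter pair from the previous sketch.

```python
import numpy as np
from scipy.ndimage import sobel

# Filter pair from (1), as in the previous sketch.
PHASES = [np.array([1, -4, 28, 9, -2, 0]),
          np.array([0, -2, 9, 28, -4, 1])]

def refined_edge_map(colour_hr, depth_lr, T=8.0, colour_thresh=120.0):
    """Eq. (2): keep a colour-derived edge only where the co-located
    LR depth gradient is significant. Both threshold values here are
    illustrative assumptions, not the paper's settings."""
    g = colour_hr.astype(np.float64)                 # HR luma plane
    e = np.hypot(sobel(g, axis=0), sobel(g, axis=1)) > colour_thresh
    gy, gx = np.gradient(depth_lr.astype(np.float64))
    n = (np.abs(gx) + np.abs(gy)) / 2.0 >= T         # significant LR depth gradient
    # e lives on the HR grid; evaluate it at the LR depth samples (x/2, y/2)
    return e[::2, ::2] & n

def cut_and_pad(window, edges):
    """Cut the 6-tap window one pixel before an edge transition and pad
    with the last pixel on the near side of the edge -- the sample-domain
    equivalent of the coefficient reordering in (3)."""
    w = window.copy()
    for k in range(3, 6):                  # scan rightwards from the centre
        if edges[k]:
            w[k:] = w[k - 1]
            break
    for k in range(2, -1, -1):             # scan leftwards from the centre
        if edges[k]:
            w[:k + 1] = w[k + 1]
            break
    return w

def upsample_row_edge_adaptive(depth_row, edge_row):
    """Edge-adaptive dyadic (x2) upsampling of one LR depth row, guided
    by the corresponding row of the refined LR edge map."""
    d = np.pad(depth_row.astype(np.int64), (2, 3), mode='edge')
    e = np.pad(edge_row, (2, 3), mode='edge')
    out = np.empty(2 * len(depth_row), dtype=np.int64)
    for x in range(len(depth_row)):
        w = cut_and_pad(d[x:x + 6], e[x:x + 6])
        for phase in (0, 1):
            out[2 * x + phase] = (w @ PHASES[phase] + 16) // 32
    return np.clip(out, 0, 255).astype(np.uint8)
```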
3. EXPERIMENTS

JMVM version 6.0 is used for the depth coding tests. The single viewpoint mode of JMVM produces an AVC compatible bit-stream. Fig. 3 shows the complete coding/decoding cycle. The proposed upsampling approach is applied after decoding the depth map at low resolution. $H_D$ is equal to 6, as the filter coefficients in (1) are used. The Breakdancers [2] and Ballet multi-view plus depth map video sequences, supplied by Microsoft Research, USA, are used in the experiments. Without loss of generality, a single viewpoint with the associated depth map of each video sequence is considered for the experiments (view #0 of both test sequences).
Figure 2. A reduced resolution depth map frame section and the filter window sliding along a row, with the black pixels representing the object boundary.
Figure 3. Colour plus depth encoding/decoding cycle.
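To make the cycle of Fig. 3 concrete, the following is a hedged end-to-end sketch. The 2x2-mean downsampling, the separable (horizontal then vertical) application of the 1-D upsampler, the nearest-neighbour widening of the edge map for the vertical pass, and the `encode_decode` placeholder standing in for the JMVM round trip are all our assumptions; `refined_edge_map` and `upsample_row_edge_adaptive` are from the earlier sketch.

```python
import numpy as np

def downsample_half(depth):
    """2x2 mean as a simple stand-in for the MPEG downsampling filter."""
    h, w = depth.shape
    return (depth.reshape(h // 2, 2, w // 2, 2)
                 .mean(axis=(1, 3)).astype(np.uint8))

def upsample_frame(depth_lr, edge_lr):
    """Separable edge-adaptive x2 upsampling: rows first, then columns."""
    rows = np.stack([upsample_row_edge_adaptive(depth_lr[y], edge_lr[y])
                     for y in range(depth_lr.shape[0])])         # (H, 2W)
    edge_wide = edge_lr.repeat(2, axis=1)                        # (H, 2W)
    return np.stack([upsample_row_edge_adaptive(rows[:, x], edge_wide[:, x])
                     for x in range(rows.shape[1])], axis=1)     # (2H, 2W)

def depth_coding_cycle(depth_frames, colour_frames, encode_decode):
    """The reduced-resolution cycle of Fig. 3; `encode_decode` is a
    hypothetical placeholder for the JMVM encode/decode round trip."""
    lr_rec = encode_decode([downsample_half(d) for d in depth_frames])
    return [upsample_frame(d_lr, refined_edge_map(c_hr, d_lr))
            for d_lr, c_hr in zip(lr_rec, colour_frames)]
```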
The adaptation can be extended to each individual viewpoint within a multi-view video sequence. A group of pictures (GOP) size of 12 is used and the entropy coding mode is set to CABAC (Context Adaptive Binary Arithmetic Coding). The videos originally have a spatial resolution of 1024 x 768 pixels, so the half resolution depth map videos are 512 x 384 pixels. Colour viewpoints are encoded at full resolution, at a bit rate around five times that of the encoded half resolution depth map video sequences. Depth map and colour videos are encoded over a wide range of bit-rates. For comparison, depth maps are also encoded at full resolution.

4. RESULTS AND ANALYSIS

Fig. 4 (a) and (b) show the rate vs. rendering distortion performance of depth map coding for the Ballet and Breakdancers test sequences, respectively. The quality of the depth map is measured through the reconstruction quality of the free-viewpoint video rendered at the position of a nearby viewpoint. Unstructured lumigraph rendering is utilised for free-view synthesis [14]. The original colour videos are used during rendering in order to isolate the effect of depth map quality. The rate overhead due to colour video coding is not included in the graphs, as all scenarios use the same colour compression. It is observed that downsampling the depth map before encoding and upsampling it after decoding, even without edge adaptation, outperforms depth map coding at full resolution, particularly at low bit-rates. The proposed approach, with edge adaptive upsampling, further improves the compression efficiency of reduced resolution depth map coding. The cross-over point, at which full resolution compression starts to outperform reduced resolution compression, is shifted towards higher bit-rates, by around 75% for Ballet and 35% for Breakdancers.
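The distortion measure plotted in Fig. 4 is the average PSNR of the rendered free-viewpoint frames against the frames captured at the nearby real viewpoint. A small helper of the usual form is sketched below; the rendering step itself (unstructured lumigraph rendering [14]) is outside the scope of this sketch.

```python
import numpy as np

def average_psnr(rendered_frames, reference_frames, peak=255.0):
    """Mean PSNR (dB) of rendered free-view frames against the frames
    captured at the nearby real viewpoint."""
    psnrs = []
    for ren, ref in zip(rendered_frames, reference_frames):
        mse = np.mean((ren.astype(np.float64) - ref.astype(np.float64)) ** 2)
        psnrs.append(10.0 * np.log10(peak ** 2 / mse))
    return float(np.mean(psnrs))
```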
Figure 4. Rate vs. rendering distortion performance for the depth map coding experiments: (a) Ballet, (b) Breakdancers. Each plot shows the average PSNR of the rendered free-view images (dB) against the average bit rate for the depth map only (kbps), for reduced resolution depth without edge adaptation, reduced resolution depth with edge adaptation, and full resolution depth.
According to these results, the proposed approach outperforms full resolution depth map coding over a wider range of bit-rates than when no edge adaptive upsampling scheme is used.

Visual results also demonstrate the improved performance of reduced-resolution depth map coding. Fig. 5 shows two rendered free-viewpoint samples at a depth map compression rate of 280 kbps. Some image parts are enlarged and highlighted for easier inspection. It is evident that object boundaries that appear spoiled under the conventional reduced-resolution coding scheme are better preserved with the proposed edge adaptive upsampling scheme. This is because the adaptive filters preserve the local high frequency components well, so boundary pixels in the free-viewpoint images are not scattered around. Consequently, a sharper and more robust rendering is achieved. The edge detection process used to extract significant object boundaries is the only source of complexity overhead with respect to the reduced resolution coding scheme without edge adaptation. However, the set of 3x3 Sobel filters used for this purpose is sufficient to produce accurate edge information for adaptive filtering, while not being computationally complex. The increase in computational complexity is minor, whereas the reduction in rendering distortion is significant.
Figure 5. Rendered free-viewpoint samples from the Ballet and Breakdancers test sequences. The left-hand samples correspond to the uniformly upsampled depth map, whereas the right-hand samples correspond to the adaptively upsampled depth map.
5. CONCLUSIONS

In this work, we have proposed a video object edge adaptive upsampling scheme to be used with reduced resolution depth map compression. Significant improvements in the rate vs. rendering distortion performance of depth map coding are observed, and perceptually better free-viewpoint videos, free from boundary artefacts, are rendered. The proposed approach is useful especially in limited bandwidth scenarios, including video-plus-depth and multi-view plus depth, where highly efficient coding of depth data is essential. It is also suitable for free-viewpoint video (FVV) applications, allowing enhanced perception of rendered free-viewpoint videos.
ACKNOWLEDGEMENT

This work was developed within VISNET II, a European Network of Excellence (http://www.visnetnoe.org), funded under the European Commission IST FP6 programme.

REFERENCES

[1] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, "Multi-View Video Plus Depth Representation and Coding," Proc. of ICIP 2007, vol. 1, pp. I-201 - I-204, 2007.
[2] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in ACM SIGGRAPH 2004 Papers, New York, NY, USA: ACM, 2004, pp. 600-608.
[3] S. Shimizu, M. Kitahara, H. Kimata, K. Kamikura, and Y. Yashima, "View Scalable Multiview Video Coding Using 3-D Warping With Depth Map," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 11, pp. 1485-1495, Nov. 2007.
[4] E. Ekmekcioglu, S. T. Worrall, and A. M. Kondoz, "Bit-Rate Adaptive Downsampling for the Coding of Multi-View Video with Depth Information," 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, pp. 137-140, 2008.
[5] Spatial Scalability Filters, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, July 2005, Doc. JVT-P007, Poznan, Poland.
[6] Text of ISO/IEC 13818-1:2003/FDAM2 Carriage of Auxiliary Data, ISO/IEC JTC1/SC29/WG11, Jan. 2007, Doc. N8799, Marrakech, Morocco.
[7] A. Vetro, Y. Su, H. Kimata, and A. Smolic, "Joint multiview video model," Joint Video Team, China, Oct. 2006, Doc. JVT-U207.
[8] Y. Morvan, D. Farin, and P. H. N. de With, "Platelet-based coding of depth maps for the transmission of multiview images," Proc. of SPIE, Stereoscopic Displays and Applications, vol. 6055, 2006.
[9] B.-B. Chai, S. Sethuraman, and P. Hatrack, "Mesh-based depth map compression and transmission for real-time view-based rendering," Proc. of ICIP 2002, vol. 2, pp. II-237 - II-240, 2002.
[10] C. T. E. R. Hewage, S. Worrall, S. Dogan, and A. M. Kondoz, "Prediction of stereoscopic video quality using objective quality models of 2-D video," IET Electronics Letters, vol. 44, no. 16, pp. 963-965, 2008.
[11] N. Sprljan, M. Mrak, and E. Izquierdo, "Motion Driven Adaptive Transform Based on Wavelet Transform for Enhanced Video Coding," Proc. 2nd International Mobile Multimedia Communications Conference, 18-20 September 2006, Italy.
[12] S. G. Chang, Z. Cvetkovic, and M. Vetterli, "Locally adaptive wavelet-based image interpolation," IEEE Transactions on Image Processing, vol. 15, no. 6, pp. 1471-1485, June 2006.
[13] X. Li and M. T. Orchard, "New edge-directed interpolation," IEEE Transactions on Image Processing, vol. 10, no. 10, pp. 1521-1527, 2001.
[14] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen, "Unstructured Lumigraph Rendering," in SIGGRAPH '01: ACM SIGGRAPH 2001 Papers, New York, NY, USA: ACM, pp. 425-432, 2001.
with Depth Information," 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video, 2008, vol., no., pp.137-140, 2008. Spatial Scalability Filters, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, July 2005, Doc. JVT-P007, Poznan, Poland. Text of ISO/IEC 13818-1:2003/FDAM2 Carriage of Auxiliary Data, ISO/IEC JTC1/SC29/WG11, Jan. 2007, Doc. N8799, Marrakech, Morocco. A. Vetro, Y. Su, H. Kimata, and A. Smolic, “Joint multiview video model,” Joint Video Team, China, Oct. 2006, Doc. JVTU207. Y. Morvan, D. Farin, P. H. N. de With, “Platelet-based coding of depth maps for the transmission of multiview images”, Proc. of SPIE, Stereoscopic displays and Applications, vol. 6055, 2006. Bing-Bing C.; Sethurarnan, S.; Hatrack, P., "Mesh-based depth map compression and transmission for real-time viewbased rendering," Proc. of ICIP 2002, vol.2, no., pp. II-237II-240, 2002. C.T.E.R. Hewage, S. Worrall, S. Dogan and A.M. Kondoz, "Prediction of stereoscopic video quality using objective quality models of 2-D video", IET Electronics Letters, Vol. 44, Issue 16, p. 963-965, 2008. N. Sprljan, M. Mrak, E. Izquierdo, "Motion Driven Adaptive Transform Based on Wavelet Transform for Enhanced Video Coding", Proc. 2nd International Mobile Multimedia Communications Conference, 18-20 September 2006, Italy. Chang, S.G.; Cvetkovic, Z.; Vetterli, M., "Locally adaptive wavelet-based image interpolation," Image Processing, IEEE Transactions on , vol.15, no.6, pp.1471-1485, June 2006 Xin Li; Orchard, M.T., "New edge-directed interpolation," Image Processing, IEEE Transactions on , vol.10, no.10, pp.1521-1527, 2001. C. Buehler, M. Bosse, L. McMillan, S. Gortler, M. Cohen, "Unstructured Lumigraph Rendering," in SIGGRAPH '01: ACM SIGGRAPH 2001 Papers. New York, NY, USA: ACM, pp. 425-432, 2001.