A Novel 3D Video Transcoding Scheme for Adaptive 3D Video Transmission to Heterogeneous Terminals

SHUJIE LIU and CHANG WEN CHEN, State University of New York at Buffalo

Three-dimensional video (3DV) is attracting considerable interest with its enhanced viewing experience and user-driven features. 3DV has several unique characteristics that distinguish it from 2D video: (1) a much larger amount of data is captured and compressed, and the corresponding video compression techniques can be much more complicated in order to exploit data redundancy, which places more constraints on users' network access and computational capability; (2) most users only need part of the 3DV data at any given time, while users' requirements exhibit large diversity; (3) only a limited number of views are captured and transmitted for 3DV, so view rendering is necessary to generate virtual views based on the received 3DV data; however, many terminal devices do not have the functionality to generate virtual views. To enable the 3DV experience for the majority of users with limited capabilities, adaptive 3DV transmission is necessary to extract/generate the required data content and represent it with supported formats and bitrates for heterogeneous terminal devices. 3DV transcoding is an emerging and effective technique to achieve the desired adaptive 3DV transmission. In this article, we propose the first efficient 3DV transcoding scheme that can obtain any desired view, either an encoded one or a virtual one, and compress it with the more universal H.264/AVC. The key idea of the proposed scheme is to appropriately utilize the motion information contained in the bitstream to generate candidate motion information. Original information of both the desired view and the reference views is used to obtain this candidate information, and a proper motion refinement process is carried out for certain blocks. Simulation results show that, compared to the straightforward cascade algorithm, the proposed scheme is able to output a compressed bitstream of the required view with significantly reduced complexity while incurring negligible performance loss. Such 3DV transcoding can be applied to most gateways, which usually have constraints on computational complexity and time delay.

Categories and Subject Descriptors: I.4.2 [Image Processing and Computer Vision]: Compression (Coding)

General Terms: Algorithms, Design

Additional Key Words and Phrases: Multiview video, 3D video, transcoding, adaptive 3D video

ACM Reference Format:
Liu, S. and Chen, C. W. 2012. A novel 3D video transcoding scheme for adaptive 3D video transmission to heterogeneous terminals. ACM Trans. Multimedia Comput. Commun. Appl. 8, 3s, Article 43 (September 2012), 21 pages. DOI = 10.1145/2348816.2348822 http://doi.acm.org/10.1145/2348816.2348822

This research was supported in part by US NSF under Grant 0915842 and by gift funding from Technicolor (Thomson).
Authors' address: S. Liu and C. W. Chen, 338 Davis Hall, State University of New York at Buffalo, Buffalo, NY 14260; email: {sL252, chencw}@buffalo.edu.

1. INTRODUCTION

Three-dimensional video (3DV), by definition, is a collection of signals that is able to provide 3D perception of a given scene. With recent developments in 3D displays and multimedia technologies, 3DV is attracting considerable interest from both industry and academia.


There are numerous applications with 3DV, such as 3D Television (3DTV) [Vetro et al. 2004], Free Viewpoint Video (FVV) [MPEG 2009], 3D tele-immersive services [Yang et al. 2010], and other interactive 3DV applications. 3DTV is designed to display at least two views simultaneously, while with FVV, the viewer can select any viewpoint he or she would like to see. A 3D tele-immersive service intends to provide a collaborative environment for geographically distributed users, where the content displayed to each user depends on his or her viewpoint and bandwidth. Cha et al. [2009] also presented another interactive 3DV application in which the user can change the viewpoint of a 3D object with a touchable device.

3DV exhibits two unique characteristics compared with conventional 2D video. First, in order to enable the enhanced viewing experience, supplemental data should be included in 3DV in addition to the captured 2D video. This supplemental data may include video sequences captured from other views, as well as geometry information (such as depth information). Therefore, 3DV has a much larger data volume, and the redundancy among the data is more complicated. Second, due to constraints on the number of camera devices and on compression complexity, only a few views are captured and compressed into a 3DV bitstream. Any virtual views have to be generated based on the captured/received information.

Two categories of techniques are under development to facilitate these new features of 3DV. The first is efficient representation and compression of 3DV content. As an amendment of H.264/AVC, Multiview Video Coding (MVC) is the first attempt at 3DV coding standardization, proposed by the Joint Video Team (JVT) formed by MPEG of ISO/IEC and VCEG of ITU-T [JVT-AA209 2008]. MVC jointly encodes video sequences captured from multiple views, exploiting inter-view correlations. MPEG is also working on designing a new data format and compression standard for 3DV; it is believed that depth information will be included in the new data format and that the new 3DV coding standard can achieve better coding efficiency. A variety of approaches have been proposed for 3DV data compression, including depth compression with the corresponding color video available [Maitre and Do 2008; Liu et al. 2011], resource allocation between depth and video data [Liu et al. 2009], and improved motion search for depth compression with the color video bitstream available [Oh and Ho 2006]. The second category of techniques is seamless view rendering based on known information of nearby views, where Depth Image Based Rendering (DIBR) is a popular technique to generate a virtual view based on nearby views' color video and depth information [Smolic et al. 2008].

1.1 Needs for Adaptive 3DV Transmission

In order to enable the 3D enhanced viewing experience and to take full advantage of the unique features and techniques of 3DV outlined above, we need to overcome multiple constraints in both terminal devices and transmission networks. The terminal devices are required not only to be capable of decoding 3DV bitstreams compressed with more complicated coding structures, but also to be capable of generating virtual views with embedded view rendering functionality. Furthermore, the network access bandwidth should be large enough to transmit the compressed 3DV data, whose volume is much larger than that of 2D video. With all these strong requirements, only a limited number of users can enjoy 3DV services at present.
Most heterogeneous networks and devices are limited in one or more aspects of supporting 3DV transmission, compression, and virtual view rendering. For example, some devices can only decode H.264/AVC encoded video sequences, while others may have very low network access bandwidth. Furthermore, service providers would prefer to generate virtual views at network gateways or video servers so that they can easily control the quality-of-service of the 3DV delivery. View rendering techniques embedded in terminals with limited computational capability may introduce unexpected quality degradation.


On the other hand, in most 3DV applications, not all 3DV data is required by an end user at any given time. Most displays can only present a limited number of views simultaneously, and the majority of them can only display stereo video or 2D video. For example, in an interactive 3DV application, each user can select the viewpoint he or she wants to see through feedback channels, and the server/gateway will transmit the required view information to each user [Kurutepe et al. 2007]. Therefore, adaptive 3DV transmission is necessary to make 3DV services available to as many users as possible. We envision that this adaptive 3DV transmission is either implemented at intermediate nodes such as gateways, or embedded into video servers for more universal applications. The main functionality of adaptive transmission is to extract/generate the required video data and represent it with user-supported formats and bitrates.

1.2 Existing Adaptive 3DV Transmission Approaches

Several frameworks have been proposed for adaptive 3DV transmission [Yang et al. 2007; Zhang and Florêncio 2010; Florêncio and Zhang 2009; Lou et al. 2007; Kurutepe et al. 2007; Pan et al. 2011; Cheung et al. 2011]. Zhang and Florêncio [2010] and Florêncio and Zhang [2009] proposed bit allocation between multiple views for multiview video coding, based on user feedback in a real-time interactive scenario. Both assigned more bits to views near the selected view and fewer bits to other views. Unlike the fixed viewpoints used in Florêncio and Zhang [2009], Zhang and Florêncio [2010] proposed to predict future viewpoints with a probability distribution instead of a fixed point and to estimate the true viewpoint after receiving the 3DV bitstream. Lou et al. [2007] presented a 3DV multicast framework in which video from different viewpoints is broadcast through different channels and users can select the channel they would like to "listen to". Kurutepe et al. [2007], Pan et al. [2011], and Cheung et al. [2011] focused on 3DV compression scheme designs aiming at improved adaptive video transmission and view switching. Kurutepe et al. [2007] proposed to compress multiview video with a combination of MVC and SVC concepts and to dynamically select the required view at a certain quality. Pan et al. [2011] designed a frame selection scheme in multiview video compression; the scheme only encodes some frames from different views, based on the assumption that view switching only happens between neighboring views. Cheung et al. [2011] designed a different compression scheme that includes redundant frames encoded with a distributed coding method, and thus facilitates view switching.

The majority of these existing schemes focused on real-time unicast (one-to-one) service, in which the original uncompressed 3DV data is assumed to be available at the server and only a single user is served. Furthermore, the scheme proposed in Lou et al. [2007] did not consider inter-view correlation in 3DV compression and also assumed that the original uncompressed 3DV data is available at the server. The transmission schemes proposed by Kurutepe et al. [2007], Pan et al. [2011], and Cheung et al. [2011] were designed for multiview video without depth information. Since they focused on source coding algorithm design, the end users were still required to be capable of decoding bitstreams generated from the specifically designed encoders for their applications and of rendering virtual views based on the received information.
1.3 3DV Transcoding: An Emerging Technique

Existing approaches for adaptive 3DV transmission cannot be adopted for the numerous applications that need to deal with compressed bitstreams, because their adaptations are performed at the video server with the original uncompressed 3DV data available. Adaptation at intermediate network nodes is very much desired for contemporary 3DV applications. For example, some servers only have pre-encoded 3DV data and need to provide adaptive service based on user feedback. Moreover, for applications that serve multiple users, adaptive transmission needs to be carried out at some intermediate nodes. Apparently, these intermediate nodes only have access to the compressed data.


Therefore, an emerging technique for 3DV transmission needs to be developed to adaptively extract/generate the requested information from the compressed data and represent the desired 3DV contents with user-supported formats and bandwidths. As probably the most effective approach to accommodate various requirements in coding standard, bitrate, resolution, number of views, and so on [Ahmad et al. 2005], transcoding at a network node is an efficient way to achieve such functionality. With efficient 3DV transcoding techniques, not only are we able to satisfy the decoding and computational limitations of the terminal devices, we are also able to significantly reduce the required transmission bandwidth with a smaller data volume. In addition, since transcoders are usually deployed at network gateways, low computational complexity is necessary for transcoder designs to be widely deployed.

A great variety of transcoding algorithms have been proposed for single view video (not for 3DV), including open-loop and closed-loop transcoding [Ahmad et al. 2005]. A transcoder was also proposed to convert independently coded multiview video streams into an MVC bitstream [Bai et al. 2005]; it merges all information together instead of extracting partial information from a 3DV bitstream. In comparison with single view transcoding, there are several new challenges in 3DV transcoding design. First, in 3DV, video content from different views is encoded jointly, and the required view can be a virtual view that is not captured and encoded in the original bitstream. Second, there exist inter-view dependencies among nearby views. Therefore, it is necessary for a 3DV transcoder to utilize information of nearby views through inter-view correlation. This introduces new challenges to 3DV transcoding designs that do not arise in single view video transcoding: extracting multiple views' information and reconstructing useful information for the required view based on inter-view correlation.

We have recently developed two transcoder designs as initial efforts for 3DV transcoding [Liu and Chen 2009, 2010]. In Liu and Chen [2009], a transcoder was designed to extract an encoded view from an MVC bitstream, utilizing reference views' motion information for inter-view coded blocks. This scheme is able to convert the original true 3DV bitstream to a single view bitstream. In Liu and Chen [2010], a novel transcoding technique was developed to generate virtual views. This scheme utilizes reference views' motion information in a new process of generating candidate motions. Both schemes convert the original 3DV into single view video for rendering and display on terminal devices that cannot decode and render true 3DV bitstreams. The two frameworks are specifically designed for transcoding of captured views and virtual views, respectively. Although they share the same spirit in generating candidate motions, the detailed implementations of the candidate motion generation processes are different. In Liu and Chen [2009], the reference views' motion for an inter-view coded block was obtained using the global motion of the corresponding frame and the predicted coding mode was reused, while in Liu and Chen [2010], the reference views' motion was obtained using pixel-level disparity and further merged for different coding modes. It is necessary to design a universal 3DV transcoding scheme that can handle the generation of both captured views and virtual views.
Furthermore, these initial frameworks suffer from a loss in performance because spatial correlation was not exploited to generate candidate motion. The performance loss is due to the fact that spatial prediction is better than inter-view prediction for certain blocks, as well as the fact that signaling of a motion vector (MV) is based on spatial motion prediction, so an inter-view predicted MV might introduce a larger MV residual to be signaled.

This article presents an integrated solution for 3DV transcoding, seamlessly merging the two initial frameworks [Liu and Chen 2009, 2010] with a much improved motion prediction scheme that considers both spatial motion prediction and inter-view motion prediction.


The key idea of this integrated scheme is to re-encode the requested viewpoint with appropriate utilization of the motion information available in the original bitstream, including that from the same view and the motion information of reference/nearby views. Besides the new component of combined spatial and inter-view motion prediction, in this new scheme, instead of using global motion for the encoded view, we use the existing motion vectors of inter-view coded blocks to estimate the inter-view correspondence. This integrated scheme is able to generate either a captured view or a virtual view for terminal devices that are only capable of decoding and rendering single view (2D) video, and it can be deployed practically anywhere in the network of 3DV delivery.

2. BACKGROUND

As the first compression standard for multiview video, MVC was designed to jointly compress video sequences captured from multiple views. Furthermore, MPEG has been working on new data formats and a coding standard for 3D video, which intend to provide better coding efficiency and rendering capability. In this section, we introduce the basic framework of MVC, the new 3D video formats, and virtual view rendering techniques, as well as the background of transcoder designs.

2.1 Overview of MVC

As an amendment to H.264/AVC, MVC was designed as a compression standard for video sequences captured from multiple views simultaneously. It embeds several inter-view prediction coding tools to exploit inter-view redundancy and is thus able to improve the coding efficiency. A typical MVC inter-frame prediction structure is depicted in Figure 1, where Si represents view i and Tj represents time instant j. Each view is encoded with the hierarchical B prediction structure, combined with inter-view prediction where possible. There are base views coded without prediction from other views, such as S0 in Figure 1. B views are those coded with prediction from both backward and forward views, such as S1, S3, and S5 in Figure 1, while P views are those coded with only one-directional inter-view prediction, such as S2, S4, S6, and S7 in Figure 1. Frames encoded without prediction from other frames within the same view are called Anchor Pictures, such as the frames at time T0 of each view shown in Figure 1.

Fig. 1. MVC inter frame prediction structure.

There are two major inter-view prediction algorithms embedded in MVC: pixel-level prediction and Motion Skip mode. In pixel-level prediction, the corresponding reconstructed inter-view reference pictures are added into the reference picture list, and motion estimation is carried out in the same fashion as temporal inter prediction. Motion Skip mode indicates that a motion prediction generated from reference views is utilized to encode the current block. The predicted motion is obtained using disparity parameters and motion information of the reference views.
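To make the view dependencies concrete, one plausible reading of the I/P/B view classification of Fig. 1 can be written down as a small table; the specific reference assignments below are our own illustration, not taken from the standard text.

# Inter-view dependencies of the Fig. 1 prediction structure (our own
# encoding of the figure, for illustration): the base view has no
# inter-view references, P views have one, and B views have two.
INTER_VIEW_REFS = {
    "S0": [],            # base (I) view: no inter-view prediction
    "S1": ["S0", "S2"],  # B view: backward and forward view references
    "S2": ["S0"],        # P view: one-directional inter-view prediction
    "S3": ["S2", "S4"],
    "S4": ["S2"],
    "S5": ["S4", "S6"],
    "S6": ["S4"],
    "S7": ["S6"],
}

def view_type(view):
    """Classify a view as I (base), P, or B from its reference count."""
    refs = INTER_VIEW_REFS[view]
    return "I" if not refs else ("P" if len(refs) == 1 else "B")

print({v: view_type(v) for v in INTER_VIEW_REFS})
# {'S0': 'I', 'S1': 'B', 'S2': 'P', ..., 'S7': 'P'}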


2.2 3D Video Formats

In order to exploit the inter-view correlation more effectively and achieve enhanced view rendering capability, MPEG is working on defining new 3DV data formats, which are expected to include the depth map as extra geometry information. A depth map contains depth data within a range from Z_near to Z_far, indicating the minimum and maximum distances from objects in the 3D scene to the camera, respectively. A depth map typically contains depth data for every sample location, each represented with an 8-bit value obtained by quantizing and scaling [Z_near, Z_far] to [255, 0].

Several 3DV data formats have been proposed based on the idea of including the depth map as supplemental information, such as Multiview Video plus Depth (MVD) and Layered Depth Video (LDV) [Smolic et al. 2009]. Figure 2 shows an example of the MVD data format, where each encoded color video sequence has a corresponding depth map included in the 3DV data. As shown in Figure 2, both color video and depth sequences of views S1, S5, and S9 are included in the received MVD data, while more virtual views (S2, S3, S4, S6, S7, S8 in Figure 2) can be generated using Depth Image Based Rendering (DIBR) and the received data of the three views.

Fig. 2. MVD data format and rendering of intermediate views.

Unlike MVD, LDV was designed based on the idea that each pixel location of an image can be represented as a set of sampled color values with corresponding depth values, which originate from different 3D points along the line of sight. Two types of data formats have been proposed as LDV. Figure 3 shows one type of LDV, where both the color video and the depth sequence of a "major" view S5 are transmitted. In contrast to MVD, only residual signals of the selected views (S1 and S9 in Figure 3) are included in the LDV data [Müller et al. 2008]. With this data structure, only parts of the scene invisible from the "major" view might be included in the residual signals. The other type of LDV only contains four layers from a single viewpoint (the "major" view): the captured video, the corresponding depth map, an occlusion video, and its corresponding depth map. The occlusion video and depth can be generated from the MVD data set by warping additional views to the "major" view and removing all foreground pixels visible from the "major" viewpoint.

Fig. 3. LDV data format and rendering of intermediate views.
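For illustration, the 8-bit depth quantization just described can be sketched as follows; this assumes the plain linear mapping of [Z_near, Z_far] onto [255, 0] stated above (some systems quantize inverse depth instead), and the function names are ours.

def depth_to_8bit(z, z_near, z_far):
    """Quantize a depth value so that z_near maps to 255 and z_far to 0,
    following the linear [Z_near, Z_far] -> [255, 0] mapping described above."""
    z = min(max(z, z_near), z_far)  # clamp into the valid depth range
    return round(255.0 * (z_far - z) / (z_far - z_near))

def depth_from_8bit(v, z_near, z_far):
    """Approximate inverse mapping, as used when reconstructing depth for rendering."""
    return z_far - (v / 255.0) * (z_far - z_near)

print(depth_to_8bit(1.0, z_near=1.0, z_far=10.0))   # 255 (nearest)
print(depth_to_8bit(10.0, z_near=1.0, z_far=10.0))  # 0 (farthest)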


2.3 Virtual View Rendering

There are mainly two categories of methods for generating virtual views: light-field-based methods [Levoy and Hanrahan 1996] and image-based rendering [Shum et al. 2006; Fehn 2004]. The first is based on the light field of the 3D scene and is able to generate virtual views with high accuracy; however, it has very high computational complexity. Therefore, image-based rendering is more appropriate for certain applications. The most straightforward approach to image-based rendering is to adopt disparity-based inter-view prediction. However, a disparity-based 2D model cannot describe inter-view correlation with high accuracy and will introduce unexpected errors in certain cases. The most popular approach in this case is to exploit inter-view correlation with a 3D model, which is generated based on the received/captured video and the corresponding depth information [Zitnick et al. 2004]. One example of these 3D-model-based methods is DIBR [Fehn 2004]. In addition, MPEG has provided the view synthesis software VSRS (View Synthesis Reference Software), which is designed to generate virtual views based on the captured video of nearby views, as well as the corresponding depth information [Tanimoto et al. 2009].

2.4 Transcoding

Transcoding is a useful and popular technique to support video delivery over heterogeneous networks with various terminal devices. Transcoders can have multiple functionalities, such as coding format conversion, spatial resizing/cropping, frame rate reduction, bitrate adaptation, and so on. Transcoders are usually deployed at intermediate nodes, such as gateways. Therefore, it is necessary to design transcoders with low computational complexity and acceptable delay.

A video transcoder basically consists of two parts: a decoder and an encoder. The decoder part mainly parses/decodes the original input bitstream to extract the information needed for re-encoding and for inclusion in the re-encoded bitstream. The encoder part re-encodes the required information appropriately to meet the requirements of network links and terminal devices. The most straightforward transcoder design is the cascade transcoder, as indicated in Figure 4. After fully decoding the original bitstream, the transcoder extracts the required information (pre-processing is possible before re-encoding) and fully re-encodes the extracted information without referencing any extra information.


Fig. 4. Cascade transcoder structure.

Because of the full re-encoding process, a cascade transcoder cannot achieve high compression efficiency and low computation cost at the same time. On one hand, motion estimation, the most time-consuming part of an encoder, can be removed by coding each frame as an I frame; however, the resulting low compression efficiency will lead to low-quality video for users with limited access bandwidth. On the other hand, if motion estimation is included to achieve high compression efficiency, this most time-consuming part of the cascade transcoder design will lead to high computational complexity. This is particularly true for 3DV transcoding, in which much more complicated motion estimation is usually carried out to take full advantage of the multiple correlations between temporal frames and different views. Because of this high computational complexity, a cascade transcoder often cannot be used at gateways with limited computational capability or for delay-sensitive services. Fortunately, a significant amount of information contained in the input bitstream can be reused during the encoding process. For example, the motion information of some blocks can be reused directly, while the motion information of other views can be used in a relatively simple way to generate candidate motion based on inter-view correlation. The key innovation of the emerging 3DV transcoding presented in this article is a 3DV transcoder design that appropriately uses the relevant information in the input bitstream to substantially simplify the re-encoding process while maintaining the performance of 3DV compression.

3. PROPOSED SCHEME

In this article, we propose a transcoder design that extracts/generates the requested single view content from the input 3D video data stream and outputs an H.264/AVC coded bitstream for the extracted view. This scheme can be easily extended to transcoder designs for multiple views, such as transcoding of stereo video, with an additional inter-view prediction process. In the following sections, we first describe the proposed transcoding scheme. After that, the correlation of temporal motion in different views is presented. The subsequent two sections give detailed descriptions of the most important functions in the transcoder design: candidate motion generation and the motion refinement/mode decision process.

3.1 Proposed Transcoding Scheme

As indicated earlier, the basic principle of the proposed transcoder design is to utilize the information contained in the original input bitstream as much as possible. This principle yields two benefits: a significant reduction in the computation of motion estimation for re-encoding, and improved information fidelity to remedy the potential quality degradation of transcoding.


Since the most time-consuming part of the re-encoding process is motion estimation, we focus on predicting motion vectors using information in the originally compressed 3DV bitstream. Two types of information can be utilized to generate such predictions. The first is the temporal motion information of the requested view, if this view is encoded in the original bitstream. From the currently proposed 3D video data formats, we notice that video sequences captured from multiple views are included in a single set of 3D video data, with either residual information or original information as the input. Conventional video compression techniques, such as motion compensation, can indeed also help remove the intra-view redundancy and compress these 3DV sequences efficiently. This indicates that we can easily obtain the temporal correlation of an encoded video sequence and can use such information to represent motion information for transcoding in a similar fashion as conventional video coding techniques.

Another type of useful information for transcoding is the temporal motion information of encoded views other than the requested view. With such information and the inter-view correspondence, we can estimate the temporal correlation of the requested view more easily. The inter-view correspondence can also be obtained with low complexity because: (1) for captured multiple views, inter-view redundancy needs to be removed to achieve high coding efficiency, so inter-view correspondence is inherently contained in the bitstream, especially for those blocks coded with inter-view prediction; (2) for virtual views, which need to be generated from other views, inter-view correspondence is necessary for view rendering. Moreover, since view rendering needs to be carried out together with the decoder, one desired key feature of view rendering is low computational complexity. Since many conventional video compression techniques can help with 3DV compression, we can assume the encoding of 3DV data adopts techniques similar to conventional video coding methods.

Figure 5 shows the overall system framework implemented in this article. As shown in the framework, the motion information of both the reference views and the requested view (if available), as well as the inter-view correspondence, is utilized to predict the temporal correlation, which is represented as a candidate motion for each coding mode. The candidate motion with its corresponding coding mode can only be regarded as a coarse motion estimation result. Therefore, motion refinement and mode decision are included in the framework to improve the coding efficiency and to choose the best coding mode.

Fig. 5. Proposed transcoding framework.

3.2 Motion Vector Correlation between Two Views

In this section, we analyze the motion correlation between corresponding points in two views, which will be utilized in generating candidate motions. Without loss of generality, we assume perfect camera calibration and ignore possible illumination changes between different views. The following derivations can be easily extended to take these parameters into consideration: the parameters can be obtained from the originally compressed bitstream if they are used in the initial compression of the 3DV data, and in many cases there is no need to consider them because they are not utilized in the initial compression. Based on these assumptions, the relationship between two corresponding points P0 and P1 in two views can be represented as:

Z' · m̃' = Z · A'R A⁻¹ · m̃ + A't,    (1)

where m̃ and m̃' represent the locations of P0 and P1 in their corresponding image coordinates, respectively. Z and Z' are the depth values for P0 and P1, while A and A' indicate the camera intrinsic parameters for View0 and View1. The rotation parameter R and the translation parameter t denote the transformation from the world coordinate system (the camera coordinate system of View0) to the camera coordinate system of View1.
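For concreteness, Eq. (1) can be evaluated directly with a few lines of linear algebra; the sketch below is our illustration with made-up camera parameters, not the paper's implementation.

import numpy as np

def project_view0_to_view1(m, Z, A0, A1, R, t):
    """Map a homogeneous pixel m = (x, y, 1) with depth Z from View0 to
    View1 via Eq. (1): Z' * m' = Z * A1 @ R @ inv(A0) @ m + A1 @ t."""
    rhs = Z * (A1 @ R @ np.linalg.inv(A0) @ m) + A1 @ t
    Z1 = rhs[2]              # depth Z' of the point in View1's coordinates
    return rhs / Z1, Z1      # normalized pixel location m' and its depth

# Example with identity rotation and a pure horizontal baseline, the
# common 1D camera arrangement assumed below (all numbers made up).
A = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])       # same intrinsics for both views
m = np.array([400.0, 260.0, 1.0])     # pixel in View0, homogeneous
m1, Z1 = project_view0_to_view1(m, Z=2.5, A0=A, A1=A,
                                R=np.eye(3), t=np.array([-0.1, 0.0, 0.0]))
print(m1, Z1)  # -> [360. 260.   1.] 2.5, i.e. a 40-pixel disparity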


Most existing 3DV capturing systems have 1D or 2D camera settings, or small rotation angles between neighboring views. Therefore, we assume R = I in the implementation to simplify the process of generating candidate motions using inter-view correspondence. Furthermore, in most 3D video systems the same cameras are used to capture video from multiple views, which means A ≈ A'. Therefore, Eq. (1) can be further simplified to:

Z' · m̃' = Z · m̃ + A't.    (2)

For a pair of pixels in two views, their inter-view correspondence can always be represented as a disparity vector. As shown in Figure 6, for a pair of pixels P0 in T0 of View0 and P1 in T0 of View1, their temporally corresponding pixels in T1 are P0' and P1', respectively. The relationship between P0 and P1 is described by the disparity vector DV, while the relationship between P0' and P1' is described by DV'. The correspondence between P0 and P0' is described by MV0, while the correspondence between P1 and P1' is described by MV1.

Fig. 6. Inter-view and intra-view correspondence.

Temporal prediction in video coding is usually carried out within a group of pictures that can be played in about 0.5 s.


Therefore, both depth and camera parameter changes between temporally related pixels can be ignored. According to Eq. (2), the relation between MV1 and MV0 can be represented as:

Z' · MV1 = Z · MV0.    (3)

Therefore, to estimate the temporally corresponding pixel for a given pixel P1 in View1, we can first find the corresponding pixel P0 in View0 using the disparity vector DV. With the existing motion vector of pixel P0 in the original bitstream, we can then calculate the motion vector of P1 according to Eq. (3).
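This per-pixel procedure can be written compactly; the sketch below is our illustration and assumes the disparity vector and per-pixel depths are already available (e.g., produced during view rendering).

def predict_motion_vector(p1, dv, z0, z1, mv0):
    """Estimate the motion vector of pixel p1 in View1 from View0's motion.

    p1  : (x, y) pixel in View1
    dv  : disparity vector DV mapping p1 to its correspondence p0 in View0
    z0  : depth Z of p0; z1 : depth Z' of p1
    mv0 : motion vector encoded for p0 in the original bitstream
    By Eq. (3), Z' * MV1 = Z * MV0, so MV1 = (Z / Z') * MV0.
    """
    p0 = (p1[0] + dv[0], p1[1] + dv[1])   # inter-view correspondence (Fig. 6)
    scale = z0 / z1                       # depth ratio from Eq. (3)
    return p0, (scale * mv0[0], scale * mv0[1])

# Example: with nearly equal depths, the predicted motion in View1 is
# essentially the reference view's motion vector.
print(predict_motion_vector((400, 260), dv=(-40, 0), z0=2.5, z1=2.5,
                            mv0=(4, -2)))  # -> ((360, 260), (4.0, -2.0))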

3.3 Candidate Motion Generation

The pixel-level motion correlation analyzed in Section 3.2 cannot be directly applied to candidate motion generation due to several constraints. First of all, a single motion vector is signaled for a set of pixels (a block) in a conventional video bitstream; thus we cannot generate accurate pixel-level motion vectors using Eq. (3). Second, we cannot always obtain pixel-level inter-view correspondence. For example, an MVC bitstream may only have global disparity vectors encoded for each frame and block-level vectors encoded for some blocks. Finally, the motion vector in the reference view (View0 in the example shown in Section 3.2) may be unavailable, since the pixel might be compressed in intra mode.

On the other hand, there is no need to generate an accurate motion vector for each pixel. The target of motion estimation is to find the best motion for a given block instead of a single pixel, under the rate-distortion optimization criterion min{D + λR}. The best choice is the one that achieves the best tradeoff between distortion and the number of bits used to signal the motion/residual information; this can be different from the most accurate motion vector. Furthermore, as long as we can obtain an estimated motion vector, we are able to search in a very small region to get a refined motion vector that satisfies the rate-distortion optimization. The candidate motion generation processes for a virtual view and an encoded view are described in the following sections, respectively. The basic idea in both processes is to utilize the available motion information as much as possible.

3.3.1 Candidate Motion Generation for Virtual View. Virtual views are not directly encoded in the original input bitstream. Therefore, the available motion information for a virtual view can only be derived from encoded views different from the required virtual view.


Fig. 7. Estimate motion information for 4 × 4 block.

Instead of generating pixel-level motion information, we first estimate the motion information for each 4 × 4 block; the candidate motion for each coding mode is then obtained by merging the motions of the corresponding 4 × 4 blocks. Relatively accurate inter-view correspondence is available here, since a virtual view has to be generated based on this correspondence and the encoded views. In this research, VSRS is used to generate virtual views and can produce the disparity vector of each pixel. As shown in Figure 7, to estimate the motion information for a 4 × 4 block S in the virtual view, we first calculate the average disparity vector DV of the 16 pixels in this block to find the corresponding area S' in the reference view (the view used to generate the virtual view). Since the motion information encoded in the original bitstream follows the original block partitions, we need to find the 4 × 4 block of the original partition that has the largest overlap area with S', indicated as E in Figure 7. Finally, the initial motion of block S is set to be the same as the motion of block E.

For each coding mode in H.264, a candidate motion, consisting of a motion vector and the corresponding reference index, is generated for each sub-block. First, a reference index is obtained by voting among all collocated 4 × 4 blocks with inter coding modes. Then the motion vector is set as the median of all motion vectors that are paired with the selected reference index. Figure 8 shows an example of generating a candidate motion for the upper block in MODE 16x8. The reference index is set to 1 based on the voting result. Since blocks 1, 2, 4, and 8 have reference index 1, the candidate motion for the current 16 × 8 sub-block is set to Median{v1, v2, v4, v8}. Blocks 3 and 7 are not considered in candidate motion generation, since their modes are intra modes. Furthermore, if the reference indices of all collocated 4 × 4 blocks are invalid, we cannot generate a candidate motion from the reference view. In this case, we use spatial motion prediction as the candidate motion.

Fig. 8. Candidate motion generation example.
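A minimal sketch of this vote-and-median rule is shown below, replaying the Fig. 8 example; the block motions are our own made-up values, chosen only so that reference index 1 wins the vote.

from collections import Counter
from statistics import median

def candidate_motion(blocks_4x4):
    """Merge the motions of the collocated 4x4 blocks of one sub-block.

    blocks_4x4: list of (ref_idx, (mv_x, mv_y)) per 4x4 block; intra-coded
    blocks carry ref_idx = None and are skipped, as in the Fig. 8 example.
    Returns (ref_idx, (mv_x, mv_y)), or None when every block is intra,
    in which case spatial motion prediction is used instead.
    """
    inter = [(r, mv) for r, mv in blocks_4x4 if r is not None]
    if not inter:
        return None                       # fall back to spatial prediction
    ref_idx = Counter(r for r, _ in inter).most_common(1)[0][0]   # voting
    mvs = [mv for r, mv in inter if r == ref_idx]
    # Component-wise median of the motion vectors paired with the winner.
    return ref_idx, (median(v[0] for v in mvs), median(v[1] for v in mvs))

# Eight 4x4 blocks of a 16x8 sub-block; blocks 3 and 7 are intra.
blocks = [(1, (4, 0)), (1, (5, -1)), (None, None), (1, (4, -1)),
          (0, (9, 2)), (0, (8, 2)), (None, None), (1, (6, 0))]
print(candidate_motion(blocks))  # -> (1, (4.5, -0.5)), i.e. Median{v1,v2,v4,v8}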


3.3.2 Candidate Motion Generation for Encoded View. For a view encoded in the original bitstream, there are two types of blocks: blocks coded without inter-view prediction and blocks coded with inter-view prediction. For blocks coded without inter-view prediction, the temporal motion of the requested view is available in the original bitstream; the candidate motion can therefore be set to be the same as the originally encoded motion information. For these blocks, instead of generating a candidate motion for each coding mode, we keep the original coding modes and re-encode the blocks with them.

For blocks coded with inter-view prediction, the situation is similar to that of virtual views, where only the motion of other views and the inter-view correspondence are available. Whether pixel-level prediction or Motion Skip mode is used, we can always estimate the disparity vector for the current block. Similar to the process described in Section 3.3.1, a corresponding block is found for each 4 × 4 block. After the corresponding block is obtained and its motion information is copied, a candidate motion is generated for each inter coding mode with the same process proposed for virtual views. Spatial motion prediction is used if inter-view motion prediction is not available.

3.4 Motion Refinement and Mode Decision

Motion refinement and mode decision are necessary in the proposed transcoding to improve the coding efficiency and to choose the best coding mode. For a macroblock to be optimally encoded, the encoding process can be described in the following three steps:

Step 1: For each inter mode (MODE 16x16, MODE 16x8, MODE 8x16, MODE 8x8, SKIP, DIRECT in H.264/AVC), if a candidate motion is available, perform a motion search within a predefined small region.

Step 2: Encode the block with the Intra modes to check the performance of each Intra mode.

Step 3: Evaluate each coding mode and choose the best one.

In the first step, the best motion vector is selected using the same metric as in H.264: min{D + λR}, where D is the distortion introduced by compression, R is the corresponding bitrate of choosing a particular motion vector and coding mode, and λ is a parameter related to the quantization step. The basic idea of the motion search is to search in a range around an initial motion vector to find the vector that achieves the minimum (D + λR). Various motion search methods are provided in the reference software, such as the full search and TZSearch algorithms. TZSearch [Tang et al. 2010] is a fast motion search method that performs the search based on both the initial motion vector and spatial predictors. It is better to consider both inter-view prediction and spatial prediction because: (1) there is also spatial correlation that can be used to predict motion; (2) the motion vector is predictively coded based on spatial prediction, so spatial prediction may generate a better rate-distortion optimized result. After generating the best motion vector for each inter mode and the best prediction method for each Intra mode in the first two steps, the best coding mode is selected in Step 3 using the same evaluation criterion, min{D + λR}.
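The three steps can be summarized in the following sketch; it is our illustration, with the rate-distortion cost passed in as a placeholder for the encoder's actual D and R measurements.

import itertools

def refine_motion(candidate_mv, rd_cost, search_range=2):
    """Step 1: search a small window around the candidate motion vector,
    keeping the vector that minimizes the Lagrangian cost D + lambda*R."""
    offsets = itertools.product(range(-search_range, search_range + 1), repeat=2)
    return min(((candidate_mv[0] + dx, candidate_mv[1] + dy)
                for dx, dy in offsets), key=rd_cost)

def choose_mode(inter_candidates, intra_modes, rd_cost):
    """Steps 1-3: refine each inter mode that has a candidate motion,
    evaluate the Intra modes, then pick the RD-optimal (mode, mv) pair."""
    options = []
    for mode, cand in inter_candidates.items():
        if cand is not None:              # candidate motion available
            mv = refine_motion(cand, lambda v, m=mode: rd_cost(m, v))
            options.append((mode, mv))
    options += [(mode, None) for mode in intra_modes]      # Step 2
    return min(options, key=lambda o: rd_cost(*o))         # Step 3

# Toy cost favoring MODE_16x16 with a motion vector near (3, -1).
cost = lambda mode, mv: (0 if mode == "MODE_16x16" else 5) + \
    (abs(mv[0] - 3) + abs(mv[1] + 1) if mv else 4)
print(choose_mode({"MODE_16x16": (2, 0), "MODE_8x8": None},
                  ["INTRA_4x4"], cost))   # -> ('MODE_16x16', (3, -1))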


4. PERFORMANCE EVALUATION

4.1 Simulation Setup

The proposed transcoding scheme is implemented based on the reference software of MVC, Joint Multiview Video Coding (JMVC) version 8.5. As introduced in Section 1, existing adaptive 3DV transmission approaches did not use transcoder designs and most of them require modification of the source encoding process, while existing transcoders were designed either for 2D video or for transcoding from independently coded views to MVC. On the other hand, as the most straightforward transcoder design, the cascade algorithm with full H.264/AVC encoding provides an estimated upper bound of the compression efficiency (rate-distortion) with H.264/AVC, while the cascade algorithm with only Intra mode and SKIP/DIRECT mode provides an estimated lower bound of the computational cost, since it does not include any motion search. Therefore, we compare four transcoding methods: (1) the cascade algorithm with full H.264/AVC encoding (Cascade1); (2) the cascade algorithm with only Intra mode and SKIP/DIRECT mode in H.264/AVC (Cascade2); (3) the previously developed transcoder published in conference papers (Conference); (4) the proposed transcoder (Proposed). For fairness in the running time comparison, we also use the fast motion search option (TZSearch) in Cascade1.

We design two sets of simulations to evaluate the performance of the proposed transcoder for encoded views and virtual views, respectively. The first set of simulations is for the transcoding of an encoded view. Three test sequences provided for the MVC standard, Ballroom, Exit, and Race1, are used; Figure 9 shows example frames from the three views of Ballroom. The basic settings of the test sequences are indicated in Table I.

Fig. 9. Example frame of Ballroom (Left: View0, Middle: View1, Right: View2) (© MERL).

For each test sequence, three views (View0, View1, and View2) are jointly encoded to generate the original compressed bitstream, where View0 is encoded as an I view, View2 as a P view, and View1 as a B view. Each test sequence is encoded using four QPs (22, 27, 32, and 37) and the common test conditions provided by JVT [JVT-T207 2006]. We assume View1 is the requested view, and the transcoder is required to produce an H.264/AVC bitstream for View1 using the same QP as the one in the original bitstream.


Table I. Settings for Test Sequences

Name          Resolution  Number of Frames  GOP Size  Frame Rate
Ballroom      640×480     250               12        25
Exit          640×480     250               12        25
Race1         640×480     300               15        30
Ballet        1024×768    100               12        15
Breakdancers  1024×768    100               12        15

Fig. 10. Example frame of Ballet (Top-Left: video of left view, Top-Right: video of right view, Bottom: corresponding depth) (© Microsoft).

When comparing the performances of different approaches, we calculate the bitrate of the transcoding as the bitrate of the output bitstream, and the distortion as the PSNR between the receiver-decoded video and the original uncompressed video, as indicated in Figure 11.

The second set of simulations is designed for the transcoding of a virtual view. Two test sequences provided by Microsoft, Ballet and Breakdancers, are used [Microsoft], as indicated in Table I; Figure 10 shows example frames of Ballet. For each test sequence, the captured video sequences of two views are jointly encoded using four QPs (22, 27, 32, and 37). As introduced in Section 2.2, to achieve efficient virtual view rendering, depth information has to be transmitted along with the captured video sequences. In this simulation, we assume that all the transcoders have access to the original depth, since the design of the transcoders is independent of depth compression techniques. In each transcoding process, VSRS [Tanimoto et al. 2009] is adopted to generate the requested virtual view based on the decoded video sequences and the depth information. An intermediate view is assumed to be the requested virtual view. The output of the transcoder is an H.264/AVC encoded bitstream using the same QP as in the original bitstream. In the comparisons shown in the following, the bitrate is calculated as the bitrate of the output bitstream, and the distortion is the PSNR calculated between the rendering result using the original bitstream and the rendering result using the transcoded bitstream, as indicated in Figure 11.
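For reference, the PSNR used as the distortion measure can be computed as in the small sketch below (the generic 8-bit formula; frames are flattened to sample lists for brevity).

import math

def psnr(ref, test, peak=255.0):
    """PSNR in dB between two equally sized 8-bit frames, here given as
    flat sample lists; the distortion in the comparisons below is a PSNR
    of this form between the two videos named in the text."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak * peak / mse)

print(round(psnr([100, 120, 140, 160], [101, 118, 141, 158]), 2))  # 44.15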


Fig. 11. Calculating PSNR and bitrate for the simulations (Left: first simulation set, Right: second simulation set).

Fig. 12. Rate-distortion comparison for Ballroom.

4.2 Simulation Results and Analysis

The rate-distortion comparisons among the transcoder designs for all test sequences are shown in Figures 12–16.


Fig. 13. Rate-distortion comparison for Exit.

It can be observed that, compared with the cascade transcoder with full H.264/AVC encoding, the proposed scheme has about 0.2 dB∼0.3 dB loss for all test sequences. Compared with the Cascade2 method, the proposed approach achieves more than 1 dB gain for all sequences, and an outstanding gain of about 4.5 dB for Race1. Compared with the initial algorithms in the two conference papers, the proposed scheme achieves about 0.4∼0.8 dB gain.

The running time comparison between all the approaches is shown in Table II. Each running time in Table II is the average over the running times under the four selected QPs. Simulations are carried out on a PC with a 2.33 GHz CPU, without code optimization. The running time is calculated for the entire sequence, for example, 300 frames for Race1 and 100 frames for Ballet. From the comparison shown in Table II, it can be observed that the proposed scheme requires much less running time than Cascade1. For the transcoding of an encoded view, the proposed scheme has a running time similar to Cascade2, about 5%∼10% of the running time of Cascade1. On the other hand, the transcoding of a virtual view costs more time, about 20% of the running time of Cascade1. This difference in complexity between the encoded and virtual view scenarios is introduced by the different number of blocks that only have inter-view motion available in the original bitstream. For each block in a virtual view, we have to calculate the inter-view correspondence, generate candidate motions for each coding mode, and select the best mode over all possible coding modes. For the transcoding of an encoded view, however, only those blocks coded with inter-view prediction need to go through a similar process, while only the original coding mode and the intra modes need to be checked for the other blocks.


Fig. 14. Rate-distortion comparison for Race1.

Therefore, the proposed scheme is able to achieve an RD performance much closer to full H.264 encoding, with only about 5%∼20% of the running time of full H.264 encoding. Compared to Cascade2, it improves the RD performance by 1 dB∼4.5 dB with less than a 10% increase in running time.

Notice that the running time for Race1 with Cascade1 is much larger than for Ballroom and Exit, whereas the running times of the proposed algorithm and Cascade2 for these three sequences are nearly the same. This indicates that the motion search in the re-encoding of Race1 costs much more time than for the other sequences, because Race1 has much larger motion. Due to this characteristic of the video content, the proposed transcoder incurs a larger loss on Race1 compared with Cascade1; however, its much larger gain over the Cascade2 method indicates that our transcoding method is able to maintain relatively high accuracy in motion estimation.

Furthermore, compared to the two initial algorithms proposed in the conference papers, the running time of the new scheme increases slightly for the transcoding of encoded views, and is about twice as long for the transcoding of virtual views. This is because the new scheme considers both spatial and inter-view prediction with motion refinement. Since the spatial-prediction-based motion search is independent of the inter-view-prediction-based motion search, the running time can be further reduced by parallelization and code optimization. On the other hand, the new scheme achieves an RD performance close to full H.264/AVC encoding, which is much better than the two initial algorithms. In both cases, the new algorithm further improves the RD performance by about 0.6 dB, with only slightly increased running time for encoded views.


Fig. 15. Rate-distortion comparison for Ballet.

Alternatively, this algorithm increases the running time from 10% of Cascade1 to about 20% of Cascade1 for virtual views. All these simulation results suggest that, with the proposed scheme, we can achieve a much better trade-off between complexity and coding efficiency in transcoder design. Overall, the proposed approach has much lower computational complexity and causes much less time delay, with negligible performance loss. The proposed transcoder design is able to meet the desired requirements of both network links and terminal devices and therefore can be deployed in a wide variety of network gateways as a key component of an adaptive 3DV transmission system.

5. CONCLUSION

As 3DV becomes more and more popular, adaptive 3DV transmission and user-driven 3DV systems are necessary to support many 3DV applications by resolving the significant mismatch between the tremendous data volume of 3DV and the inherent stringent constraints of network links and terminal devices. In order to provide adaptive 3DV services to both 3D-capable and 2D-only user terminals, gateways should be able to extract the user-required information and re-encode it into user-defined formats. A transcoder design with low complexity and low time delay is crucial for gateways to achieve such functionality. We presented in this article the first transcoder design for 3DV to meet these requirements, capable of extracting the requested view and producing an H.264/AVC encoded bitstream with low complexity. The requested view can be either an original encoded view in the 3DV bitstream or a virtual view that needs to be generated from the compressed 3DV bitstream. The proposed algorithm appropriately reuses the information contained in the original bitstream, not only the information from the required view, but also the motion information from other views.


Fig. 16. Rate-distortion comparison for Breakdancers.

Table II. Running Time Comparison (s)

Sequences      Cascade1   Cascade2  Conference  Proposed Algorithm
Ballroom       3831.0803  130.6803  131.5040    204.1783
Exit           2494.451   123.5333  123.5145    172.3463
Race1          5013.3350  134.5640  133.9010    192.6298
Ballet         1928.4705  105.9075  267.0828    571.5230
Breakdancers   2723.8565  105.6768  266.5768    575.2288

In comparison with the straightforward cascade method with full H.264/AVC encoding, the proposed scheme is able to compress the required view with much less complexity while maintaining acceptable performance loss. Future improvements of this transcoder design will focus on combining the virtual view rendering and re-encoding processes, so that the virtual view is not generated before the re-encoding process and the re-encoding is done with only the reference views and the inter-view correspondence.

REFERENCES

AHMAD, I., WEI, X., SUN, Y., AND ZHANG, Y.-Q. 2005. Video transcoding: An overview of various techniques and research issues. IEEE Trans. Multimed. 7, 5.
BAI, B., BOULANGER, P., AND HARMS, J. 2005. A multiview video transcoder. In Proceedings of ACM Multimedia. 503–506.
CHA, J., EID, M., AND EL SADDIK, A. 2009. Touchable 3D video system. ACM Trans. Multimed. Comput. Commun. Appl. 5, 4.
CHEUNG, G., ORTEGA, A., AND CHEUNG, N.-M. 2011. Interactive streaming of stored multiview video using redundant frame structures. IEEE Trans. Image Process. 20, 3.


FEHN, C. 2004. Depth-image-based rendering (DIBR), compression and transmission for a new approach on 3D-TV. In SPIE Stereoscopic Displays and Virtual Reality Systems.
FLORÊNCIO, D. AND ZHANG, C. 2009. Multiview video compression and streaming based on predicted viewer position. In Proceedings of ICASSP'09.
JVT-AA209. 2008. Joint draft 7.0 on multiview video coding.
JVT-T207. 2006. Common test conditions for multiview video coding. Klagenfurt, Austria.
KURUTEPE, E., CIVANLAR, M. R., AND TEKALP, A. M. 2007. Client-driven selective streaming of multiview video for interactive 3DTV. IEEE Trans. Circuits Syst. Video Technol. 17.
LEVOY, M. AND HANRAHAN, P. 1996. Light field rendering. In Proceedings of SIGGRAPH'96. ACM, 31–42.
LIU, S. AND CHEN, C. W. 2009. Multiview video transcoding: From multiple views to single view. In Proceedings of the Picture Coding Symposium (PCS'09).
LIU, S. AND CHEN, C. W. 2010. 3D video transcoding for virtual views. In Proceedings of ACM Multimedia.
LIU, Y., HUANG, Q., MA, S., ZHAO, D., AND GAO, W. 2009. Joint video/depth rate allocation for 3D video coding based on view synthesis distortion model. Signal Process. Image Commun. 24, 8.
LIU, S., LAI, P., TIAN, D., AND CHEN, C. W. 2011. New depth coding techniques with utilization of corresponding video. IEEE Trans. Broadcast. 57, 2.
LOU, J.-G., CAI, H., AND LI, J. 2007. Interactive multiview video delivery based on IP multicast. In Advances in Multimedia.
MAITRE, M. AND DO, M. N. 2008. Joint encoding of the depth image based representation using shape-adaptive wavelets. In Proceedings of ICIP. 1768–1771.
MICROSOFT 3D VIDEO TEST SEQUENCES. Available online: http://research.microsoft.com/en-us/um/people/sbkang/3dvideodownload/.
MPEG VIDEO AND REQUIREMENTS SUBGROUP. 2009. Applications and requirements on 3D video coding. Document w11061. MPEG.
MÜLLER, K., SMOLIC, A., DIX, K., MERKLE, P., KAUFF, P., AND WIEGAND, T. 2008. Reliability-based generation and view synthesis in layered depth video. In Proceedings of the IEEE International Workshop on Multimedia Signal Processing.
OH, H. AND HO, Y.-S. 2006. H.264-based depth map coding using motion information of corresponding texture video. Adv. Image Video Technol. 4319.
PAN, Z., IKUTA, Y., BANDAI, M., AND WATANABE, T. 2011. User dependent scheme for multi-view video transmission. In Proceedings of ICC.
SHUM, H.-Y., CHAN, S.-C., AND KANG, S. B. 2006. Image-Based Rendering. Springer.
SMOLIC, A., MÜLLER, K., DIX, K., MERKLE, P., KAUFF, P., AND WIEGAND, T. 2008. Intermediate view interpolation based on multiview video plus depth for advanced 3D video systems. In Proceedings of the IEEE International Conference on Image Processing.
SMOLIC, A., MÜLLER, K., MERKLE, P., KAUFF, P., AND WIEGAND, T. 2009. An overview of available and emerging 3D video formats and depth enhanced stereo as efficient generic solution. In Proceedings of the Picture Coding Symposium.
TANG, X.-L., DAI, S.-K., AND CAI, C.-H. 2010. An analysis of TZSearch algorithm. In Proceedings of ICGCS.
TANIMOTO, M., FUJII, T., AND SUZUKI, K. 2009. View synthesis algorithm in view synthesis reference software 3.0 (VSRS3.0). Tech. rep. Document M16090, ISO/IEC JTC1/SC29/WG11.
VETRO, A., MATUSIK, W., PFISTER, H., AND XIN, J. 2004. Coding approaches for end-to-end 3D TV systems. In Proceedings of the Picture Coding Symposium.
YANG, Z., WU, W., NAHRSTEDT, K., KURILLO, G., AND BAJCSY, R. 2010. Enabling multi-party 3D tele-immersive environments with ViewCast. ACM Trans. Multimedia Comput. Commun. Appl. 6, 2.
YANG, Y., YU, M., JIANG, G., AND PENG, Z. 2007. A transmission and interaction oriented free-viewpoint video system. Int. J. Circ. Syst. Signal Process. 4, 1.
ZHANG, C. AND FLORÊNCIO, D. 2010. Joint tracking and multiview video compression. In Proceedings of VCIP.
ZITNICK, L., KANG, S. B., UYTTENDAELE, M., WINDER, S., AND SZELISKI, R. 2004. High-quality video view interpolation using a layered representation. ACM Trans. Graph. 23, 3, 600–608.

Received January 2010; revised April 2012 and June 2012; accepted June 2012
