An Efficient Multiple Description Coding Scheme for the Scalable Extension of H.264/AVC (SVC)

Hassan Mansour, Panos Nasiopoulos, Victor Leung
Department of Electrical and Computer Engineering, University of British Columbia
2356 Main Mall, Vancouver, BC, Canada V6T 1Z4
Email: {hassanm, panos, vleung}@ece.ubc.ca
Abstract— The demand for efficient scalable video codecs has constantly been on the rise in response to the increasing variety of services and QoS requirements in multimedia networks. Existing standardization efforts, such as the Scalable Video Coding extension of the H.264/AVC standard, do not offer efficient error-resilient protection for all the different levels of video enhancement. We developed a multiple description scalable video coding technique that generates complementary and independently decodable descriptions, offering acceptable video quality even if only one of them is successfully received. Performance evaluations show that our scheme delivers an average improvement of 5 dB for single channel decoding and an average improvement of 2 dB for packet loss simulations when compared with the UXP-protected SD-SVC and the multiple-description motion compensated temporal filtering (MD-MCTF) scheme at comparable redundancy levels.
Keywords—Multiple Description Coding, SVC

I. INTRODUCTION

The demand for efficient and highly scalable video codecs has constantly been on the rise to support the heterogeneous Quality-of-Service (QoS) requirements and unexpected bandwidth fluctuations in multimedia delivery networks. To address these requirements, the Scalable Video Coding (SVC) standardization project was launched as an amendment of the H.264/AVC standard [1]. Existing error protection solutions, such as Forward Error Correction (FEC) codes or Unequal Erasure Protection (UXP) [2], are used to combat bit errors, while Multiple Description Coding (MDC) is used to protect the video content from packet losses. In MDC, a coded video sequence is separated into multiple descriptions, such that each description is independently decodable. If all the descriptions are received by the decoder, then the original video quality is recovered. If, on the other hand, only one description is received, then a lower quality video is obtained. This independent decodability is made possible at the expense of additional coding overhead, also known as data “redundancy”, between the various descriptions. The design of multiple description coders, therefore, focuses on minimizing the redundancy while maintaining an acceptable level of distortion. Existing multiple description coding schemes, reviewed in [3], deal mostly with predictive video coders. These schemes involve spatial, temporal, or
frequency subsampling, multiple description (MD) quantization, or multiple description transform coding [3]. The only MCTF-based multiple description coders are the multiple-description motion compensated temporal filtering (MD-MCTF) method described in [4] and the embedded multiple description scalar quantization (EMDSQ) method presented in [5]. The latter applies embedded multiple description scalar quantizers to the output of MCTF. The transmission of motion information in this approach is assumed to occur without errors, and thus the motion data is neither multiple-description coded nor duplicated in each description. However, loss of motion information does occur in practical applications and should therefore be taken into consideration in the design of MDC schemes. MD-MCTF, on the other hand, uses temporal partitioning of video frames to produce multiple descriptions. Important information, such as motion vectors and temporal low pass frames, is simply duplicated in each description, resulting in a considerable increase in coding redundancy. Figure 1 illustrates the frame coding order of the MD-MCTF coding scheme.
Fig. 1. Coding structure of the MD-MCTF. Dark frames contain both texture and motion information, whereas light frames contain only motion information.
In each description, the lightly shaded HP frames contain only motion information; all residual components are set to zero.
The problem with this approach is that most of the redundant bit allocation is dedicated to duplicating the motion information. Texture information, on the other hand, is sacrificed to reduce the redundancy, which results in increased degradation in visual quality.

We developed a Multiple Description Coding scheme that is specifically designed for the scalable extension of H.264/AVC (SVC). Our proposed method generates two descriptions for each enhancement layer of an SVC coded stream by embedding in each description only half of the motion and texture information of the original coded stream, with a minimal degree of redundancy. The two descriptions are complementary but independently decodable, such that if only one description is received, the decoder will be able to recover the missing motion information from the available data and generate an output video sequence with acceptable quality. If both descriptions are received, then the full quality of a single description SVC stream is delivered. Furthermore, we have implemented our proposed scheme and integrated all of its functionalities into the existing SVC standard. The proposed framework provides a highly error resilient video bitstream that requires no retransmissions or feedback channels while minimizing the channel overhead imposed by the video redundancy due to multiple description coding.

The remainder of this paper is organized as follows. Section II presents our proposed scheme, Section III discusses our results, and Section IV concludes the paper.

II. OUR PROPOSED MD-SVC SCHEME

Our proposed method takes advantage of the layered structure of SVC to generate two descriptions of each enhancement layer of an SVC coded video sequence. The SVC structure is modified to allow the encoder to create two complementary descriptions of HP frames, such that each description can be independently decoded. The challenge thus lies in offering acceptable video quality when descriptions are decoded separately while minimizing the redundancy induced by MDC. Due to the temporal correlation that exists between neighboring pictures in a video sequence, most of the video signal power is retained in the low pass wavelet coefficients and the motion prediction information. We implement an extended MD temporal partitioning process to generate two descriptions of the High Pass (HP) frames’ motion information. Generating these two descriptions reduces the motion redundancy compared to duplicating the motion information as is done in MD-MCTF [4]. The bit-rate savings achieved from this approach are invested in improving the texture coding of both descriptions, a feature that is not addressed in MD-MCTF. The two-description coding structure of our scheme is displayed in Figure 2. An HP frame is separated into two frames, a motion frame and a texture frame. Each of these frames is then handled separately, following a different coding path.
Fig. 2. Basic structure of the Multiple Description Coding module with a two description coder of High Pass frames.
In the following sections, we discuss in detail the multiple description coding modules shown in Figure 2.

A. MDC of Motion Information

As a first step, the motion macroblocks (MBs) are divided between the two descriptions (D1 and D2) using the quincunx lattice (checkerboard) structure used in [6]. The H.264/AVC standard [7] specifies four macroblock and four sub-macroblock partitions. One motion vector (MV) is coded for each partition, so a macroblock can carry anywhere between one and sixteen MVs. To avoid any redundancy in coding the motion vectors, we place the MVs of the macroblocks labeled D1 in description one, while the MVs of the macroblocks labeled D2 are placed in description two of the HP frame. However, this produces HP frames with only half of the normally available motion vectors. We devised an efficient scheme that helps the decoder accurately predict each missing MV from the correctly transmitted MVs of the neighboring macroblocks.
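As an illustration, the following minimal Python sketch assigns per-macroblock motion data to the two descriptions with this checkerboard pattern. The grid layout and None-marking are our illustrative assumptions, not the SVC bitstream syntax or our encoder's actual code:

def quincunx_label(mb_row, mb_col):
    # Labels alternate in a checkerboard pattern over the macroblock grid.
    return 1 if (mb_row + mb_col) % 2 == 0 else 2

def split_motion_field(motion_mbs):
    # motion_mbs: 2-D grid (list of lists) with one motion entry per MB.
    # A None slot marks a macroblock whose MVs are not coded in this
    # description; recovery data is coded there instead.
    rows, cols = len(motion_mbs), len(motion_mbs[0])
    d1 = [[None] * cols for _ in range(rows)]
    d2 = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if quincunx_label(r, c) == 1:
                d1[r][c] = motion_mbs[r][c]
            else:
                d2[r][c] = motion_mbs[r][c]
    return d1, d2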
Fig. 3. A divided central D2 macroblock with the surrounding 8x8 blocks and their respective indices.
Consider an HP frame in description one. All D2-labeled MBs in this frame contain no motion vectors. Thus, the encoder must derive motion-recovery data for the D2-labeled macroblocks. This recovery data is then coded instead of the original motion vectors of the macroblock, resulting in a reduction in bit-rate. Figure 3 shows a central macroblock, whose actual motion vectors are not transmitted, surrounded by four dark-shaded macroblocks with known motion vectors. The central macroblock is divided into four 8x8 blocks {X0, X1, X2, X3}. The surrounding eight 8x8 dark-shaded blocks are assigned the indices shown in Figure 3. For each 8x8 block inside the central macroblock, we find a single block among the surrounding 8x8 neighbors with the best matching motion vector.

Let us consider the motion recovery scheme for block X0. First, we use the original motion vector of X0 to generate the motion-compensated samples. Next, starting with indexed block 0, we use its motion vectors in X0 and generate the corresponding motion-compensated samples. We repeat this step for all of the indexed blocks. The indexed block that minimizes the sum of squared error (SSE) distortion between the original X0 motion-compensated samples and the samples produced using the MVs of this indexed block is chosen as the best matching block; its MVs will be used to approximate the MVs of X0. If P is one of the central macroblock's partitions belonging to the set {X0, X1, X2, X3}, then the SSE distortion measure can be expressed as follows:
$$D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) = \sum_{i,j \in P} \big( l_{r_{orig}}[i + m_{orig,x},\ j + m_{orig,y}] - l_{r_{est}}[i + m_{est,x},\ j + m_{est,y}] \big)^2 \quad (1)$$
where $l_r[\cdot]$ represents the motion-compensated luma samples, $m$ the motion vectors, and $P$ the sub-macroblock partition. The best matching neighbor block index is then found by minimizing Equation (1) as follows:

$$BN_P = \arg\min_{BN \in S} D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) \quad (2)$$
where $S = \{0, 1, 2, 3, 4, 5, 6, 7\}$ is the set of neighboring 8x8 block indices. After the best-neighbor matching stage, each block within the central macroblock has one index pointing to a neighboring 8x8 block. However, coding all four indices is undesirable when the H.264/AVC encoder has partitioned the central macroblock into partitions larger than 8x8 (e.g., 16x16, 16x8, 8x16). In this case, we take advantage of the larger partitioning mode and use only one neighboring block index for the whole partition. For instance, if the macroblock mode is 16x16, then it is sufficient to code the best-neighbor index of only one 8x8 block partition to estimate the missing motion in the central macroblock. The chosen best-neighbor index $BN_P$ is the one that corresponds to the central 8x8 block $P$ with the minimum distortion $D_{SSE}$. Bit-rate reduction in motion coding is achieved as a direct consequence of reducing the number of motion fields coded in the two descriptions. To further compress the error recovery data, we adopt the Context Adaptive Binary Arithmetic Coding (CABAC) entropy coding method specified in the H.264/AVC and SVC standards.
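For illustration, the following minimal sketch mirrors Equations (1) and (2) under simplifying assumptions we introduce here (integer-pel motion vectors, one candidate MV per neighboring 8x8 block, and in-bounds displacements); it sketches the selection rule only, not our encoder implementation:

import numpy as np

def sse(a, b):
    # Sum of squared errors between two blocks of luma samples (Eq. 1).
    return int(np.sum((a.astype(np.int64) - b.astype(np.int64)) ** 2))

def best_neighbor_index(ref_frame, block_pos, mv_orig, neighbor_mvs, size=8):
    # ref_frame   : 2-D numpy array of reference luma samples
    # block_pos   : (y, x) of the central 8x8 block's top-left corner
    # mv_orig     : (dy, dx) original MV of the central block
    # neighbor_mvs: dict mapping a neighbor index in S = {0..7} to its MV
    y, x = block_pos

    def mc_block(mv):
        # Motion-compensated samples fetched from the reference frame.
        dy, dx = mv
        return ref_frame[y + dy:y + dy + size, x + dx:x + dx + size]

    target = mc_block(mv_orig)  # samples the original MV reconstructs
    # Eq. (2): pick the neighbor index BN_P that minimizes D_SSE.
    return min(neighbor_mvs, key=lambda bn: sse(target, mc_block(neighbor_mvs[bn])))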
B. MDC of Texture Information

The texture information encoded in an HP frame corresponds to the high pass components resulting from the MCTF process. Our approach to multiple-description coding of residual data is based on separating HP frames into even-numbered and odd-numbered frames. The even-numbered frames are then transmitted in one description while the odd-numbered frames are transmitted in the second description, resulting in zero redundancy. However, HP frames may contain Intra-coded macroblocks, which carry original image information (not residual). If those macroblocks are missing from a description, then the decoded quality of that description will be significantly degraded.

Fig. 4. Multiple-description coding of residual frames in a temporal-scalable layer with three temporal levels. Dark blocks indicate Intra-coded macroblocks in the residual frame.
Therefore, we extend the temporal separation process by inserting all Intra-coded macroblocks found in the high pass frames into both descriptions, as shown in Figure 4. This figure shows the separation of a set of residual HP frames into two descriptions; the dark blocks in the HP frames denote Intra-coded macroblocks. Notice that the HP frames coded in the two descriptions are complementary. The only redundancy involved arises from the Intra-coded macroblocks in the residual frames that are duplicated to improve the decoded video quality in case of packet losses.
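A minimal sketch of this separation, assuming a hypothetical frame model in which each HP frame is a list of macroblock records carrying an 'intra' flag (illustrative only, not the SVC bitstream syntax):

def split_texture(hp_frames):
    # Even-numbered HP frames go to description one, odd-numbered frames
    # to description two; Intra-coded macroblocks are the only data
    # duplicated in both descriptions.
    d1, d2 = [], []
    for n, frame in enumerate(hp_frames):
        intra_only = [mb for mb in frame if mb["intra"]]
        if n % 2 == 0:
            d1.append(frame)       # full texture frame in D1
            d2.append(intra_only)  # duplicated Intra MBs only in D2
        else:
            d2.append(frame)
            d1.append(intra_only)
    return d1, d2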
III. SIMULATION RESULTS

We compare the performance of our framework with the single-description SVC (SD-SVC) and the MD-MCTF scheme presented in [4]. The performance measures include single channel decoding and PSNR performance under packet-loss scenarios. In our simulations, we coded 961 frames from each of the three test sequences: Crew, Harbour, and Foreman. Four SVC layers were encoded: one base layer (layer 0) and three enhancement layers (layers 1, 2, and 3). Table 1 shows the configuration of the coded test sequences.

Table 1. Coding configuration of the test sequences.

Layer   Spatial Resolution   Frame Rate (fps)   GOP Size   Scalability
0       QCIF                 15                 8          Temporal
1       QCIF                 15                 8          SNR
2       CIF                  30                 16         Spatial-temporal
3       CIF                  30                 16         SNR
Our stream is composed of a base layer, which contains all layer 0 frames along with the low pass frames of the three higher SVC layers, and an enhancement layer containing two descriptions for each SVC enhancement layer. We assume that in all cases the base layer and the low pass frames of the enhancement layers are protected using Unequal Erasure Protection (UXP), a method specifically designed to protect base layer frames [2]. Furthermore, we compare the performance of our scheme and MD-MCTF at the same level of redundancy.

A. Single Description Performance

In an ideal MD channel environment, the MD network consists of two channels [3]. In this section we compare the performance of our scheme with that of MD-MCTF when one channel fails. Table 2 shows the average Y-PSNR values obtained from decoding only description one of our scheme and of MD-MCTF.
Fig. 5. Comparison of visual quality for the Foreman sequence: MD-SVC (left), MD-MCTF (right).
Table 2. Average distortion obtained by reconstructing description one.

                     Average Y-PSNR (dB)
MDC Scheme       Crew      Foreman    Harbour
Our scheme       30.05     28.41      26.77
MD-MCTF          21.63     25.24      23.01
We observe that our method outperforms MD-MCTF for all three test sequences by an average of 5 dB. Figure 5 shows snapshots from the Foreman sequence; the picture on the left is generated by our scheme, while the picture on the right is MD-MCTF coded. Our scheme introduces only slight distortions, mainly due to approximating the motion vectors of missing macroblocks from the neighboring blocks. The benefit of this coding mechanism lies in reducing the video data-rate so that more bits can be allocated to duplicating Intra-coded blocks. MD-MCTF, on the other hand, does not duplicate Intra-coded blocks, which results in more severe visual distortion.
B. Packet Loss Performance

We compare the performance of our scheme, which we label MD-SVC, with MD-MCTF and the single description SVC (SD-SVC) scheme under packet loss rates of 3%, 5%, 10%, and 20%. The network simulator used for these tests is the “Offline simulator for RTP/IP over UTRAN” provided by 3GPP SA4 S4-AHVIC036 [8]. The sequences Crew, Foreman, and Harbour were encoded using each of the above-mentioned coding schemes. The reconstructed frames obtained after decoding are compared using the luminance PSNR (Y-PSNR) measure, as shown in Table 3.

Table 3. Comparison of Y-PSNR performance of our scheme (MD-SVC), MD-MCTF, and SD-SVC under packet loss conditions.

                          Y-PSNR (dB) at loss rates of
Sequence   MDC Scheme   0%      3%      5%      10%     20%
Crew       MD-SVC       36.08   33.89   33.73   33.61   31.40
           MD-MCTF      36.21   33.65   31.38   28.50   23.93
           SD-SVC       36.63   30.22   24.22   22.92   21.11
Foreman    MD-SVC       36.42   33.18   32.45   30.51   28.64
           MD-MCTF      36.44   35.74   33.43   29.43   26.17
           SD-SVC       36.98   33.26   27.70   21.30   18.12
Harbour    MD-SVC       33.76   31.83   30.74   29.48   24.29
           MD-MCTF      33.76   30.20   30.43   25.96   23.26
           SD-SVC       34.28   26.58   23.62   19.33   17.06
We observe that, on average, our scheme outperforms the other two methods. For high packet loss rates (e.g., 10% and 20%) the improvement averages 3 dB.

IV. CONCLUSION

We have developed a new multiple-description coding scheme for the scalable extension of the H.264/AVC video coding standard (SVC). Our scheme generates two descriptions of the high pass frames of each enhancement layer of an SVC coder, each description carrying only half of the motion vectors and texture information, with a minimal degree of redundancy. Intra-coded macroblocks are inserted as redundant information since they cannot be approximated using motion information. The two descriptions are complementary but independently decodable, offering acceptable video quality even if only one of them is successfully received. If both descriptions are received, then the full quality of the corresponding SVC
enhancement layer is delivered. Objective and subjective performance evaluations have shown that our scheme delivers superior decoded video quality when compared with the UXP-protected SD-SVC and the multiple-description motion compensated temporal filtering (MD-MCTF) scheme at comparable redundancy levels. Single channel decoding demonstrated an average improvement of 5 dB for our scheme over MD-MCTF, and packet loss simulations exhibited an average improvement of 2 dB over MD-MCTF.

REFERENCES

[1] R. Schafer, H. Schwarz, D. Marpe, T. Schierl, and T. Wiegand, “MCTF and scalability extension of H.264/AVC and its applications to video transmission, storage, and surveillance,” in Visual Communications and Image Processing, July 2005.
[2] T. Schierl, H. Schwarz, D. Marpe, and T. Wiegand, “Wireless broadcasting using the scalability extension of H.264/AVC,” in ICME, July 2005.
[3] Y. Wang, A. R. Reibman, and S. Lin, “Multiple description coding for video delivery,” Proceedings of the IEEE, vol. 93, no. 1, pp. 57–70, January 2005.
[4] M. van der Schaar and D. S. Turaga, “Multiple description scalable coding using wavelet-based motion compensated temporal filtering,” in ICIP, September 2003.
[5] F. Verdicchio, A. Munteanu, A. Gavrilescu, J. Conelis, and P. Schelkens, “Scalable multiple description coding of video using motion compensated temporal filtering and embedded multiple description scalar quantization,” in SPIE Optics East, vol. 5607, Philadelphia, USA, 2004, pp. 81–91.
[6] C. Kim and S. Lee, “Multiple description coding of motion fields for robust video transmission,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 9, pp. 999–1010, September 2001.
[7] T. Wiegand, G. Sullivan, and A. Luthra, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), JVT-G050r1, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 2003.
[8] Common conditions for SVC error resilience testing, JVT-P206d0, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, July 2005.