Scalable video coding using motion-compensated temporal filtering and intra-band wavelet based compression
Bilel Ben Fradj & Azza Ouled Zaid
Multimedia Tools and Applications An International Journal ISSN 1380-7501 Multimed Tools Appl DOI 10.1007/s11042-012-1170-5
Author's personal copy Multimed Tools Appl DOI 10.1007/s11042-012-1170-5
Scalable video coding using motion-compensated temporal filtering and intra-band wavelet based compression Bilel Ben Fradj · Azza Ouled Zaid
© Springer Science+Business Media, LLC 2012
Abstract Scalable video compression is a crucial task that allows video streams to adapt flexibly to different networks in various applications. Current video coding techniques exploit temporal correlation using motion-compensated predictive or filtering approaches. In particular, motion-compensated temporal filtering (MCTF) is a useful framework for scalable video compression schemes. In this paper, we propose a new scalable video coding method that combines open-loop motion-compensated prediction with embedded intra-band wavelet based compression. Our major objective is to provide a wavelet based video coding system that circumvents the drawbacks of conventional closed-loop prediction systems without sacrificing compression performance. To improve the coding efficiency, we adaptively weight the target bitrate according to the frame position in the temporal pyramid. Comparisons with state-of-the-art scalable video coding solutions confirm an overall coding efficiency gain of the proposed method, especially at high bitrates.

Keywords Scalable video coding · Wavelet transform · Motion-compensated temporal filtering · Rate control
1 Introduction

With the development of multimedia applications over heterogeneous communication environments and the generalization of large video formats like high-definition television (HDTV) or digital cinema (DC) sequences, video compression is still a challenging problem. For instance, in most applications video content has to be
B. B. Fradj · A. O. Zaid (B)
SysCom Laboratory, National Engineering School of Tunis, B.P. 37, le Belvedere, 1002 Tunis, Tunisia
e-mail: [email protected]

B. B. Fradj
e-mail: [email protected]
accessible, at the same time, from a variety of devices ranging from high-definition displays to cellphones and PDA handsets. Indeed, each device has its own constraints in terms of bandwidth, display resolution, and battery life. To tackle this challenge, highly efficient and extremely flexible compression technology is essential. The ideal functionality of such technology would allow on-the-fly, real-time adaptation of the compressed bitstream according to the network characteristics and end-user terminals. The concept of scalability is a key feature to fulfill these requirements. A significant part of the success of a particular scalable coding algorithm relates to its compression efficiency. Consequently, one of the main requirements set for any scalable video coding algorithm is that it should perform as well as state-of-the-art non-scalable video coding in terms of visual quality, when both are compared for the vast majority of possible resolutions, quality levels and replay frame-rates. The conventional video coding standards MPEG-2, H.263, MPEG-4 Visual, and H.264/AVC [26] already include several tools by which the scalability functionality can be supported. However, the scalable profiles of those standards have rarely been used [20]. This is principally due to the fact that spatial and quality scalability features came along with a significant loss in coding efficiency. In addition, the use of the conventional closed-loop video coding structure of existing standards hampered the scalability functionalities. Simulcasting the spatial resolutions and fidelity representations provides similar functionality to a scalable bitstream, although typically at the cost of a significant increase in bitrate. Moreover, expensive transcoding or re-encoding is used to generate separately stored compressed programs for transmission over multipoint control units in video conferencing systems or for streaming services in 3G systems.
Currently, the Scalable Video Coding (SVC) Amendment of the H.264/AVC standard [18] enables the use of video resources across a wide range of networks and devices. The H.264/AVC SVC extension has achieved significant improvements in coding efficiency with an increased degree of supported scalability relative to the scalable profiles of prior video coding standards. However, because of the substantially higher complexity of the underlying H.264/AVC standard relative to prior video coding standards, the computational resources necessary for decoding an SVC bit stream are higher than those for scalable profiles of older standards. Furthermore, H.264 SVC still imposes restrictions on the encoded stream that limit accessibility and compression of non-generic video content (i.e., beyond QCIF, CIF or even 4CIF formats) [20]. Indeed, video coders based on wavelet transforms (WT) may prove to be much more efficient for encoding digital cinema, medical or satellite sequences. For example, Motion-JPEG2000 (MJP2), which extends JPEG2000 to video coding applications, proved to be as efficient as H.264/AVC in intra mode for high-resolution sequences encoded at high bitrate; this alone, however, does not guarantee an efficient video coding system, since motion compensation tools are not included. For broadcast and transmission over the networks or channels commonly in use today, interframe compression is needed to fit a high-quality video program into a limited bandwidth. Previous research in the field [3, 5, 12, 23] uses wavelet-based coding both spatially and temporally to obtain good compression efficiency along with the functional benefits of limited or full scalability. This work presents a wavelet-based scalable video codec which is suitable for various types of video sequences and for bitrate-scalable coding applications. The aim is to design a practical video coding scheme that takes advantage of the JPEG 2000 coding performance and the efficiency of the motion-compensated temporal filtering
technique included in the well-known Motion Compensated Embedded Zero Block Coding (MC-EZBC) algorithm. Experiments using MC-EZBC and H.264 SVC confirm an overall coding efficiency gain of the proposed method over a wide range of bit-rates. The structure of the paper is as follows. Section 2 covers previous research on wavelet-based scalable video coding systems. Section 3 describes the motion estimation/compensation framework included in MC-EZBC codec. Section 4 presents our proposal for video compression and introduces the derived rate allocation strategy. Experimental results are presented and discussed in Section 5. Finally, Section 6 draws some general concluding remarks and outlines future research possibilities.
2 Wavelet-based scalable video coding

In wavelet-based video coding, temporal and spatial scalability can be achieved by applying a 3D wavelet transform on the input frames. The result of such a transform is motion information and wavelet coefficients that represent the texture of the transformed frames. Since the result of a 3D wavelet transform is a set of transformed frames, scalable coding can be performed on a frame-by-frame basis using embedded still-image coding algorithms. In order to facilitate the description of interframe wavelet-based coding methods and architectures, we first revisit the wavelet-based still image coding concept. Then, we review the basic tools that can be used for the extension of scalable image coding architectures to motion-compensated video coding.

2.1 Wavelet-based scalable image coding

Wavelet-based scalable image coding algorithms are classified in two categories: inter-band and intra-band. Both of these categories apply successive approximation quantization (SAQ) to the wavelet coefficients generated by the 2D wavelet transform in order to encode them progressively, from the most significant bit (MSB) to the least significant bit (LSB). Each category (inter- or intra-band) uses a specific data structure in the coding process. The representative state-of-the-art scheme of the family of inter-band coders is the embedded zerotree wavelet coding (EZW) [21]. This coding scheme uses a general model to characterize the inter-band dependencies among wavelet coefficients located in subbands with similar orientations. The model is based on a set-partition (or zerotree) structure that takes advantage of the energy clustering of subband/wavelet coefficients in both frequency and space. Inspired by the basic principles of Shapiro's pioneering work, the SPIHT algorithm was developed [15].
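To make the SAQ idea concrete, the following minimal sketch (the function name is ours, not from the paper) extracts the significance bit-planes of an integer coefficient array from the MSB plane down to the LSB plane:

```python
import numpy as np

def saq_bitplanes(coeffs):
    """Successive approximation quantization, sketched: emit one boolean
    significance plane per magnitude bit, from MSB down to LSB."""
    mags = np.abs(coeffs).astype(np.int64)
    n_planes = max(int(mags.max()).bit_length(), 1)
    planes = []
    for p in range(n_planes - 1, -1, -1):
        planes.append((mags & (1 << p)) != 0)   # bit p of each magnitude
    return planes   # planes[0] is the MSB plane, planes[-1] the LSB plane
```

An embedded coder transmits these planes in order, so truncating the stream after any plane yields a coarser quantization of every coefficient.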
The main drawback of this approach is the fact that zero regions, which consist of insignificant coefficients, are approximated as a highly constrained set of tree-structured regions. Consequently, certain zero regions that are not aligned with the tree structure may be expensive to code, and some portions of zero regions may not be included in zerotrees at all. Intra-band wavelet-based coding algorithms, including embedded block coding with optimized truncation (EBCOT) [22] and square partitioning (SQP) coders, exploit intra-band redundancies [17]. The zero regions are represented by a union of independent rectangular regions of fixed or variable sizes. This simple model accounts for the fact that the significant coefficients are clustered in certain areas in the wavelet
domain, corresponding to the edge and contour regions. Therefore, by using this model, one can isolate interesting image features by immediately eliminating large insignificant regions from further consideration [11]. A third category combines the aforementioned coding concepts to take advantage of their respective strengths simultaneously. The motivation behind a combination of both methods comes from the fact that high redundancy exists not only among wavelet coefficients but also among nodes at individual quadtree levels within and across subbands. A representative state-of-the-art scheme for this category is described in [6]. This algorithm, entitled Embedded image coding using ZeroBlocks of subband/wavelet coefficients and Context modeling (EZBC), became popular through its 3D version (3D-EZBC) in the context of scalable video coding. Similarly to the SQP algorithm [17], the EZBC coding scheme encodes the bit-planes using quadtree partitioning. The distinctiveness of EZBC lies in the context modeling used in the context-based entropy coding of the significance of the nodes in the quadtrees. An information-theoretic analysis of combined inter- and intra-scale dependencies of wavelet coefficients and their application to wavelet image coders can be found in [8].

2.2 Three-dimensional wavelet coding

Techniques that extend the two-dimensional (2-D) DWT to three dimensions have been implemented in a number of video compression systems. Tham et al. [25] propose a video codec that performs a motion-compensated three-dimensional (3D) wavelet (packet) decomposition of a group of video frames, and then encodes the important wavelet coefficients using a data structure termed tri-zerotrees (TRI-ZTR), an extension of the two-dimensional EZW algorithm. In their work, Kim and Pearlman [9] extend the SPIHT algorithm to three dimensions to code video.
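Returning to the intra-band zero-block idea of Section 2.1, the quadtree partitioning used by SQP/EZBC-style coders can be sketched as follows; this is a deliberately simplified illustration (hypothetical names, no context modeling or entropy coding):

```python
import numpy as np

def quadtree_partition(sig, x=0, y=0, size=None, out=None):
    """Partition a square significance map into zero-blocks and significant
    pixels. A block containing no significant coefficient is emitted as a
    single 'zero' symbol; otherwise it is split into four quadrants, down
    to single pixels (SQP/EZBC-flavored sketch)."""
    if size is None:
        size = sig.shape[0]               # assumes a square, power-of-two map
    if out is None:
        out = []
    block = sig[y:y + size, x:x + size]
    if not block.any():
        out.append(('zero', x, y, size))  # one symbol covers the region
    elif size == 1:
        out.append(('sig', x, y, 1))      # an isolated significant coefficient
    else:
        h = size // 2
        for dy in (0, h):
            for dx in (0, h):
                quadtree_partition(sig, x + dx, y + dy, h, out)
    return out
```

Large insignificant regions are thus dismissed with a single symbol each, which is exactly the economy the intra-band model aims for.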
Taubman and Zakhor [23] employ the concept of layered zero coding (LZC) to design an embedded 3-D wavelet-based video coder. Ohm [12] proposes a 3-D subband coding method that performs motion compensation prior to temporal subband decomposition.

2.3 Motion-compensated wavelet video coding

Recent research advances in open-loop temporal prediction have led to a number of scalable video coding schemes that perform motion-compensated temporal filtering (MCTF) and spatial decomposition. These coding systems can be roughly categorized into two classes: t+2D [4, 29, 30] and 2D+t [1, 5, 31], where "t" stands for temporal and "2D" for spatial. In the t+2D scheme, also called spatial-domain MCTF (SDMCTF), the sequence is first temporally decomposed through an MCTF, and the temporal subbands are then spatially analyzed and coded by a sequence of 2D wavelet-based encoders. In 2D+t or in-band MCTF (IBMCTF) coding systems, different spatial resolutions are created, and temporal processing is performed separately on each spatial resolution. These two coding structures are illustrated in Fig. 1. We may notice that the 2D+t scheme has attracted more attention because it provides better resolution scalability [2]. The most significant problem is that motion estimation/motion compensation (ME/MC) in the subband domain is not as efficient as in the original image domain because of the shift-variance of the critically sampled wavelet transform.

Fig. 1 Wavelet-based scalable video coding: (a) t+2D architecture, (b) 2D+t architecture

To solve this problem, Andreopoulos et al. [1] propose to use complete-to-overcomplete discrete wavelet transforms for in-band MCTF (IBMCTF). This approach has demonstrated the best rate-distortion performance so far. However, this gain comes at the cost of high complexity and slow run-time performance. It is important to note that, with the intention to develop a new SVC coding standard, twelve of the fourteen proposals submitted in response to the call issued by MPEG represented scalable video codecs based on 3-D wavelet transforms, while the remaining two were extensions of H.264/AVC [10, 28]. Extensive evaluation based on objective and subjective comparisons between wavelet video codecs and the current SVC standardization effort, which provides a pyramidal extension of H.264/AVC, has been carried out. The tests conducted by ISO/MPEG
have shown mixed results, with a predominance of higher efficiency for the scalable extension of H.264/AVC. As a result, MPEG and VCEG agreed to jointly finalize the SVC project as an Amendment of H.264/AVC within the Joint Video Team [13, 19].
3 Motion estimation and compensation with temporal filtering

Although accurate motion estimation (ME) for video sequences has been the topic of a massive number of publications, no existing scheme can capture motion characteristics accurately and in a generic manner. This is mainly because of occlusion and aperture effects, the appearance of new objects, scene illumination changes, changes in camera focus, etc. Consequently, all motion compensation (MC) solutions alleviate some, but not all, of these effects. Furthermore, the general rule is that the more advanced the ME algorithm used for MC, the more complex it becomes to realize it in a practical video coding system. As a result, the most popular ME solutions used in common MC-based video coding systems are block-based motion-estimation algorithms, which in turn lead to block-based motion compensation. In this section, we review a particular instantiation of a related block-based ME/MC system that was integrated into the MC-EZBC codec [7]. Our choice of this system is motivated by the fact that MC-EZBC incorporates all the basic tools found in state-of-the-art open-loop video coding systems. For example, the prediction is bidirectional (using one past and one future frame), and variable block sizes are used for block-based motion-compensated prediction, ranging (dyadically) from 64 × 64 pixels down to 4 × 4 pixels. The design of the aforementioned ME/MC scheme can be summarized as follows: First, hierarchical variable size block matching (HVSBM) is performed to compute a set of motion vectors. Then, each pair of video frames is filtered through an MCTF which creates the corresponding temporal low and temporal high subband frames. The output of the first HVSBM/MCTF stage is the temporal low and high pair denoted $\tilde{y}_{L_0}(m)$ and $\tilde{y}_{H_0}(m)$, where $m$ is an index for each pair of input frames.
To generate the temporal pyramid, the temporal-low outputs of the first MCTF are input to the subsequent HVSBM/MCTF stage. The outputs of the second stage are then denoted $\tilde{y}_{L_1}(m)$ and $\tilde{y}_{H_1}(m)$ for frame pair $m$. In general, the low and high subbands of the temporal filtering pyramid can be denoted $\tilde{y}_{L_k}(m)$ and $\tilde{y}_{H_k}(m)$, where $k \in \{0, 1, \ldots, \log_2 N - 1\}$ is the temporal level and $m \in \{0, 1, \ldots, N/2^{k+1} - 1\}$ is the temporal position index. More details on the ME/MC method included in the MC-EZBC video coding scheme are given in the following subsections.

3.1 Motion estimation

The motion estimation strategy used in the MC-EZBC algorithm proceeds in two stages. The first stage is backward motion estimation using HVSBM. The second stage detects true covered and uncovered pixels based on the backward motion field. The HVSBM scheme uses hierarchical motion estimation to speed up the search process and to impose spatial interaction between neighboring motion vectors. Each block is subdivided into four sub-blocks, as shown in Fig. 2. Motion vectors for newborn sub-blocks are generated by refining the motion vector of their parent, and this spawning process continues. After generating the full motion vector tree, an optimal tree pruning algorithm is used to determine which nodes to merge based upon rate-distortion (R-D) criteria. This algorithm uses a Lagrangian multiplier to determine a specified trade-off between the distortion and the coded motion vector rate $R_{mv}$.

Fig. 2 A 3-level HVSBM for motion estimation

3.2 MC temporal filter

Since MCTF is a kind of subband/wavelet transform, it can be implemented either in the conventional way or by using the lifting scheme. The low-pass and high-pass filtering in the conventional implementation can be viewed as a parallel computation, whereas in the lifting implementation it is a serial computation. When using motion-compensated temporal filtering, the filter must have perfect reconstruction to obtain high-quality results. Constructing the filter using the lifting scheme satisfies this constraint. Moreover, for integer-pixel accurate MCTF, the conventional and lifting implementations are equivalent, but for sub-pixel accurate MCTF, the lifting scheme provides some additional nice properties. In addition, the lifting scheme reduces computational complexity by allowing the transform to be computed in place, where the input data is replaced by the transform data instead of requiring separate storage for the output. For details on the lifting implementation of sub-pixel accurate MCTF, the reader is referred to [4].
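Assuming zero motion (so the motion compensation step degenerates to the identity, which a real MCTF would replace by warping along the estimated motion field), one Haar lifting step and its recursion over a GOP can be sketched as:

```python
def haar_mctf_pair(frame_a, frame_b):
    """One Haar lifting step: predict (temporal detail), then update
    (temporal average). Frames may be scalars or NumPy arrays."""
    high = frame_b - frame_a          # predict step
    low = frame_a + 0.5 * high        # update step; equals (a + b) / 2
    return low, high

def haar_mctf_pair_inverse(low, high):
    frame_a = low - 0.5 * high        # undo update
    frame_b = frame_a + high          # undo predict
    return frame_a, frame_b

def haar_mctf_gop(frames):
    """Recursively filter a GOP of 2^n frames, returning the temporal
    subbands {('H', k): [...]} plus the final low band ('L', n-1)."""
    subbands = {}
    level = 0
    lows = list(frames)
    while len(lows) > 1:
        pairs = [haar_mctf_pair(lows[2 * m], lows[2 * m + 1])
                 for m in range(len(lows) // 2)]
        subbands[('H', level)] = [h for _, h in pairs]
        lows = [l for l, _ in pairs]
        level += 1
    subbands[('L', level - 1)] = lows
    return subbands
```

Because each lifting step is exactly invertible, the cascade has perfect reconstruction by construction, which is the property motivating the lifting formulation above.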
4 Description of the proposed framework

4.1 Global coding scheme

In this work, we propose a scalable video coding paradigm that takes advantage of the EBCOT [16] coding mechanism and the simplicity of MC-EZBC based spatial-domain
MCTF (SDMCTF). The central contribution of our work is the utilization of a sophisticated rate control scheme to increase the compression efficiency while keeping the computational complexity relatively low compared to related earlier works. In comparison to the MC-EZBC algorithm, the design of our coding scheme has the following differences:

– texture coding is performed by applying the EBCOT algorithm on the spatio-temporal wavelet coefficients instead of EZBC;
– the original MC-EZBC rate allocation strategy is optimized by considering a hybrid rate control method.

The complete new video coding system is outlined as follows (see Fig. 3):
1. Consecutive frames are divided into groups of pictures (GOP). Each GOP has 8 frames.
2. The HVSBM/MCTF scheme is performed, yielding a 3-level temporal pyramid. Using the notation of Section 3, the low and high subbands of the temporal filtering pyramid are denoted $y_{F_k}(m)$, where $F$ designates the low-pass ($L$) or high-pass ($H$) subband.
3. Regarding spatial analysis, texture coding and bitstream layering, each temporal frame $y_{F_k}(m)$ is spatially analyzed to obtain a spatio-temporal frame denoted $\tilde{y}_{F_k}(m)$. We adopt the Daubechies 9/7 analysis/synthesis filter-bank to perform 4-stage spatial filtering.
4. Each spatio-temporal frame $\tilde{y}_{F_k}(m)$ is partitioned into code-blocks which are compressed independently using the EBCOT algorithm.
5. Motion vectors are converted into a bitstream by using lossless DPCM and an Adaptive Arithmetic Coder (AAC) [27].

Fig. 3 System architecture of the proposed scalable video coding

As mentioned earlier, our main contribution consists of a simple and efficient rate control method for scalable video coding. The proposed rate control approach is discussed in detail in the following subsections.

4.2 Rate control

In the MC-EZBC coding framework, the entire set of spatio-temporal subbands is coded together without any rate-distortion optimization procedure. This tends to decrease the computational complexity compared with the efficient block-based EBCOT coder, but also provides limited efficiency, especially for low-resolution/low frame-rate decoding. In our work, we propose to modify the original MC-EZBC rate allocation scheme by considering rate-distortion optimization. Rather than adjusting an average bitrate over the entire encoded sequence, our algorithm performs rate allocation independently for each GOP. Due to the high sensitivity of the low temporal subbands to compression losses, the maximum allowed byte budget is partitioned according to the importance of the temporal frames. A weight parameter α, which serves as the criterion for the judicious allocation of the available bit-budget, is then assigned to each temporal frame depending on its position in the temporal pyramid. For each spatio-temporal frame, the Post-Compression Rate-Distortion (PCRD) algorithm is applied across all the corresponding code-blocks. This rate-control method ensures that compression performance will be superior to dividing the bitrate equally across the frames. At the same time, it benefits from the performance and flexibility of the PCRD algorithm. In the next paragraph we briefly describe the EBCOT coding algorithm performed by the JPEG 2000 coder at the block level.

4.2.1 EBCOT coding design

The EBCOT codec basically consists of two main units (Tier 1 and Tier 2). The wavelet data is partitioned into separate, equally sized blocks, called code-blocks. In a first phase, Tier 1, each block is encoded separately, making use of layered zero-coding principles. In Tier 2, a PCRD procedure determines where each code-block's embedded bitstream should be truncated in order to attain a pre-defined bitrate, distortion bound or visual quality level [22]. We point out that the EBCOT algorithm has recently been integrated into the JPEG2000-based Scalable Interactive Video (JSIV) paradigm, with and without motion compensation [24].
Despite its high efficiency in interactive browsing applications, JSIV performs worse than H.264 SVC in conventional streaming applications. In the following paragraph, we only provide a brief presentation of the PCRD optimization procedure for the case in which the codec minimizes the distortion given the target bitrate.

4.2.2 PCRD optimization method

Based on the Tier 1 module, every code-block $B_i$ can be truncated at the end of each coding pass, giving $N$ different truncation points referred to as $p_j$, $0 \le j < N$. Denoting the bitrate at truncation point $p_j$ of code-block $B_i$ by $R_i^{p_j}$, we have $R_i^{p_j} \le R_i^{p_{j+1}}$. When the distortion is measured in terms of mean square error (MSE), the contribution from code-block $B_i$ to the final distortion $D$ is given by the additive distortions $D_i^{p_j}$ at the truncation points $p_j$:

$$D = \sum_i D_i^{p_j} \qquad (1)$$
The distortion metric $D_i^{p_j}$ is determined as:

$$D_i^{p_j} = \omega_{b_i}^2 \sum_{k \in B_i} \left( y[k] - \hat{y}^{p_j}[k] \right)^2 \qquad (2)$$
where $\hat{y}^{p_j}[k]$ denotes the coefficients quantized at the truncation point $p_j$, $y[k]$ designates the original coefficients of the considered code-block $B_i$, and $\omega_{b_i}$ stands for the $L_2$-norm of the subband $b_i$ to which the code-block $B_i$ belongs. Considering that the total bitrate of the codestream is given by $R = \sum_i R_i^{p_j}$, we can approach the rate-distortion optimization problem as a generalized Lagrange multiplier problem for a discrete set of points, as follows: $\{p_j^\lambda\}$ stands for the set of truncation points which minimizes

$$D(\lambda) + \lambda R(\lambda) = \sum_i \left( D_i^{p_j^\lambda} + \lambda R_i^{p_j^\lambda} \right) \qquad (3)$$
where $\lambda$ is the Lagrange multiplier. The value of $\lambda$ which minimizes this expression while yielding $R(\lambda) = R_{max}$ represents the optimal solution to the rate-distortion optimization problem, $\{p_j^\lambda\}$ being the set of optimal truncation points for that $R_{max}$. Since we have a finite discrete set of truncation points $\{p_j^\lambda\}$, it will probably not be feasible to attain exactly the target bitrate $R_{max}$. In practice, since there are many truncation points, it is adequate to determine the smallest value of $\lambda$ that satisfies $R < R_{max}$.
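Given per-pass cumulative (rate, distortion) measurements for one code-block, the selection of usable truncation points and the per-$\lambda$ choice can be sketched as follows; the function names are illustrative, not taken from the JPEG 2000 software:

```python
def rd_convex_hull(points):
    """Keep the truncation points on the lower convex hull of a code-block's
    rate-distortion curve, i.e. those with strictly decreasing R-D slope.
    `points` holds cumulative (rate, distortion) per coding pass, with rate
    increasing and distortion decreasing, starting from the empty stream."""
    hull = [points[0]]
    for r, d in points[1:]:
        while len(hull) >= 2:
            r1, d1 = hull[-2]
            r2, d2 = hull[-1]
            # pop hull[-1] when the slope into it does not exceed the slope out
            if (d1 - d2) * (r - r2) <= (d2 - d) * (r2 - r1):
                hull.pop()
            else:
                break
        hull.append((r, d))
    return hull

def truncate_for_lambda(hull, lam):
    """Last hull point whose slope (distortion drop per extra rate unit)
    still exceeds the Lagrange multiplier `lam`."""
    best = hull[0]
    for (r1, d1), (r2, d2) in zip(hull, hull[1:]):
        if (d1 - d2) / (r2 - r1) > lam:
            best = (r2, d2)
        else:
            break
    return best
```

Sweeping `lam` downward then trades rate for distortion monotonically, which is why a simple search over $\lambda$ suffices to approach $R_{max}$.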
Basically, the algorithm computes the distortion-rate slopes

$$S_i^{p_j} = \frac{\Delta D_i^{p_j}}{\Delta R_i^{p_j}}, \qquad \Delta D_i^{p_j} = D_i^{p_{j-1}} - D_i^{p_j}, \quad \Delta R_i^{p_j} = R_i^{p_j} - R_i^{p_{j-1}},$$

to obtain the optimal set of truncation points $\{p_j\}$ for each code-block $B_i$ and a given $\lambda$. This task is achieved by selecting only those truncation points that lie on the convex hull, i.e. those truncation points with strictly decreasing rate-distortion slope.

4.2.3 Proposed hybrid rate allocation method

In the dyadic wavelet decomposition, the coarsest subband carries the most significant information about the processed signal. In contrast, the remaining high-frequency subbands are perceived to be less important, with the degree of significance decreasing from top to bottom in the wavelet pyramid structure. Consequently, the energy compaction behavior of the wavelet-based MCTF puts most of the energy in the temporal low subband. For efficient compression, the rate control is carried out by prioritizing the information of the temporal low subbands, i.e. following the same principle of prioritization of information used in wavelet-based resolution-scalable image coding. When the scalable bitrate mode is enabled, we propose to partition the total bitrate in proportion to the frame locations in the temporal pyramid. After assigning a fixed bitrate to each temporal frame, the PCRD algorithm is applied across the code-blocks of all the spatio-temporal frames to obtain the desired bitrate or a near approximation. Given a GOP of size 8, the proposed global rate allocation framework is formulated as follows:

1. For each temporal frame $y_{F_k}(m)$, the assigned bitrate is determined by $R_{F_k}(m) = \alpha_{F_k}(m) \times R^*$, with the weights normalized over the GOP so that $\sum_{k,m} \alpha_{F_k}(m)$ equals the number of frames and $R^*$ is the average per-frame bitrate.
2. After performing the spatial analysis, sample data coding is carried out using the fractional bit-plane coder of the EBCOT paradigm. Specifically, for each code-block in the spatio-temporal frame $\tilde{y}_{F_k}(m)$, a set of truncation points is selected so that the global objective function in (3), given $\lambda$ and the target bitrate $R_{max} = R_{F_k}(m)$, is minimized.
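Step 1 above can be sketched with the "collection 3" weights of Table 1 for an 8-frame, 3-level pyramid (one L3 frame, one H3, two H2 and four H1 frames); the normalization assumption, weights summing to the GOP size, is ours:

```python
def frame_budgets(gop_rate, alphas):
    """Split a GOP bit budget over temporal frames with weights
    alpha_{F_k}(m). `alphas` maps (F, k) -> weight; frame counts follow
    an 8-frame, 3-level Haar pyramid (sketch)."""
    counts = {('L', 3): 1, ('H', 3): 1, ('H', 2): 2, ('H', 1): 4}
    total_weight = sum(alphas[key] * n for key, n in counts.items())
    # each frame of subband (F, k) receives its weighted share of the budget
    return {key: gop_rate * alphas[key] / total_weight for key in counts}

# "collection 3" weights from Table 1
collection3 = {('L', 3): 16 / 9, ('H', 3): 16 / 9,
               ('H', 2): 8 / 9, ('H', 1): 2 / 3}
```

With these weights, the budgets of the individual frames still sum to the GOP budget, while the coarsest temporal subbands receive the largest per-frame share.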
5 Experimental results

In this section, we evaluate the coding efficiency of the proposed video coding scheme. The assessment is twofold: First, we evaluate the compression performance obtained with the proposed video coding system under a variety of settings in the rate control stage. Then, for the best setting of the proposed rate control method, a comparison against the successful prior video coding schemes, namely the Rensselaer version of the MC-EZBC software and JSVM version 12 [14] of the emerging H.264/SVC, is carried out. Currently, these coders deliver the best performance for scalable coding of video datasets.

5.1 Common experimental settings

For the coding experiments of the following subsections, all the video coding schemes under comparison use a GOP structure with 8 frames. Motion estimation is performed with half-pixel accuracy. The presented results are obtained for the Foreman 352 × 288 × 144, Bus 352 × 288 × 144 and Crew 352 × 288 × 144 CIF sequences. Only the luminance component is taken into consideration, since the human visual system is less sensitive to color than to luminance. For reasons of brevity, the average PSNR on the luminance component is used as the quality metric. For our codec and MC-EZBC, a three-stage temporal decomposition using motion-compensated Haar filtering is applied. A four-stage spatial analysis follows the temporal stage to complete the 3-D subband decomposition. The spatial transform is done with Daubechies 9/7 filter banks. The initial search range for the top (smallest) level of the HVSBM pyramid is fixed to 2. The value $\lambda_{mv} = 24$ is used for the Lagrange multiplier in HVSBM tree pruning. Bi-directional motion vectors are used when the percentage of unconnected pixels produced by forward motion estimation surpasses a given threshold.
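The average luminance PSNR used as the quality metric throughout this section can be computed as in this short sketch:

```python
import numpy as np

def psnr_luma(original, decoded, peak=255.0):
    """Luminance PSNR in dB between an original and a decoded frame."""
    err = original.astype(np.float64) - decoded.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0.0:
        return float('inf')   # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)
```

The sequence-level figures reported below are the per-frame PSNRs averaged over all decoded frames.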
We should mention that in the current study, we only focus on the compression performance of our coding scheme over a large range of bitrates, without considering its behavior under resolution and temporal scalability. For this reason, we retained the LRCP (Layer-Resolution-Component-Position) progression mode, supported by JPEG 2000 standard, to generate the layered bitstreams. This progression mode allows the quality of the decoded sequence to be improved as the bitrate increases. Each temporal frame is represented through a succession of bitrate layers. Each bitrate layer contains a multiresolution pyramid of the MCTF coefficients. Concerning the JSVM v. 12 software, which was obtained from the Internet public domain (http://wftp3.itu.int/av-arch/jvt-site/), adaptive inter-layer prediction with RD optimization is activated to improve the coding efficiency. Furthermore, we use the context-adaptive binary arithmetic coding (CABAC) and enable the 8 × 8 transform. For motion estimation, we employ the fast search block matching with a search range of 16. The H.264 SVC provides two quality scalable modes namely coarse grain scalability (CGS) and medium grain scalability (MGS). In CGS, the residual texture signal in the enhancement layer is re-quantized with a quantization step size that is smaller than the quantization step size of the preceding CGS layer. SVC supports up to eight CGS layers, corresponding to eight quality extraction points, i.e., one base layer and up to seven enhancement layers. The original H.264 SVC software constrains the inter-layer prediction to three dependency layers.
Fig. 4 Rate-Distortion comparison among the three proposed rate allocation methods for sequences Foreman (a) and Bus (b)
Medium-grain quality scalability (MGS) is a variation of the CGS approach. With the MGS mode, it is possible to divide the transform coefficient levels into multiple additional MGS layers to achieve finer-grain scalability. In our tests, we enable the
MGS quality scalable mode. Specifically, we employ one enhancement layer divided into five MGS layers to cover a wide bitrate adaptation range.

5.2 Tuning our method

Figure 4 depicts the average PSNR-rate curves obtained with our video coding method for three collections of the $\alpha_{F_k}(m)$ parameter, for the Bus and Foreman sequences. The selected values of $\alpha_{F_k}(m)$ are given in Table 1. The results of Fig. 4a and b show that the first collection provides the worst performance of the three tested approaches. This is due to the fact that the bitrate is allocated uniformly among the temporal frames, without privileging the temporal subbands located in the last levels, which contain the most relevant information. The second collection performs better than collection 1, but worse than collection 3, which is the best choice. For example, in the case of the Bus sequence decoded at a bitrate equal to 512 Kbps, collection 1 yields a PSNR of 27.24 dB, collection 2 provides 28.54 dB and collection 3 gives 29.27 dB. Note also that, on average over all bitrates, the gain achieved by collection 3 is about 0.68 dB versus collection 2 and 2.2 dB versus collection 1 for the Bus sequence. In the case of the Foreman sequence, the gain achieved by collection 3 is about 1.13 dB versus collection 2 and 2.61 dB versus collection 1. The previously described collections of the $\alpha_{F_k}(m)$ parameter represent only three out of many possible settings. In this study, we opted for these three specific settings because, on average, their impact is representative of the other possible collections. In the following experiments, we retain collection 3 since it provides the best performance.

5.3 Comparison with other methods

To demonstrate the efficiency of our coding method, we also compared it to the successful prior video coding schemes MC-EZBC and JSVM v.12.
Figure 5a–c illustrates the rate-distortion curves obtained with the video coding schemes under comparison, respectively for the Foreman, Bus, and Crew CIF sequences. From the results of Fig. 5, we found that our coding method outperforms the MC-EZBC codec for the Bus, Foreman and Crew test sequences over all tested bitrates. The average margin of improvement offered by our method varies from 1.33 dB to more than 3.77 dB. From the results reported in Fig. 5a, we can notice that for the Foreman sequence the JSVM v.12 codec performs best until the bitrate reaches approximately 810 Kbps, after which our video coding system performs better. This result comes from the fact that we use adaptive arithmetic coding (AAC) to code the motion vector prediction residuals. Even if most prediction residuals are near zero, AAC still

Table 1 The three collections of the α_Fk(m) parameter setting:
Collection 1: α_Fk(m) = 1 for all temporal subbands in the GOP.
Collection 2: α_Fk(m) = 4/3 if k = 3 and F = "L" or "H", or if k = 2 and F = "H"; α_Fk(m) = 2/3 if k = 1 and F = "H".
Collection 3: α_Fk(m) = 16/9 if k = 3 and F = "L" or "H"; α_Fk(m) = 8/9 if k = 2 and F = "H"; α_Fk(m) = 2/3 if k = 1 and F = "H".
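The weighting in Table 1 can be sketched in a few lines. The snippet below is an illustration, not the authors' implementation: the 8-frame GOP layout, the function names, and the proportional-allocation rule are assumptions. Note that, over an 8-frame GOP with three MCTF levels, both collections 2 and 3 average to one, so the weights redistribute a fixed GOP budget among the temporal subbands rather than changing it.

```python
# Sketch of the bitrate weighting in Table 1, collection 3 (illustrative
# assumptions: GOP layout, names, and proportional-allocation rule).

# Temporal subbands of an 8-frame GOP after 3 MCTF levels:
# level 3: one L and one H frame; level 2: two H frames; level 1: four H frames.
GOP = [(3, "L"), (3, "H"), (2, "H"), (2, "H"),
       (1, "H"), (1, "H"), (1, "H"), (1, "H")]

def alpha(k, F):
    """Collection 3 weights alpha_Fk(m) from Table 1."""
    if k == 3 and F in ("L", "H"):
        return 16 / 9
    if k == 2 and F == "H":
        return 8 / 9
    if k == 1 and F == "H":
        return 2 / 3
    return 1.0

def allocate(target_kbps):
    """Weight the average per-frame budget by alpha_Fk(m)."""
    avg = target_kbps / len(GOP)
    return [avg * alpha(k, F) for (k, F) in GOP]

budgets = allocate(512)
# The weights are normalized over the GOP: per-frame budgets sum to the target.
assert abs(sum(budgets) - 512) < 1e-9
```

Frames in the last temporal level (k = 3) thus receive roughly 1.8 times the average frame budget, at the expense of the first-level H frames.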
[Fig. 5 Rate-distortion comparison among video compression techniques for sequences Foreman (a), Bus (b) and Crew (c): PSNR (dB) versus bitrate (Kbps), our method and JSVM v.12]
[Fig. 6 A frame-by-frame comparison between our method, H.264/SVC and MC-EZBC, at (approximately) 512 Kbps, first 100 frames, for sequences Foreman (a), Bus (b), and Crew (c): PSNR (dB) versus frame number]
[Fig. 7 Key-frame extracted from the sequence Bus: original frame (a); frame decoded at 512 Kbps after compression using the MC-EZBC coder (b), the proposed method (c), and H.264/SVC (d)]
[Fig. 8 Key-frame extracted from the sequence Foreman: original frame (a); frame decoded at 512 Kbps after compression using the MC-EZBC coder (b), the proposed method (c), and H.264/SVC (d)]
[Fig. 9 Key-frame extracted from the sequence Crew: original frame (a); frame decoded at 512 Kbps after compression using the MC-EZBC coder (b), the proposed method (c), and H.264/SVC (d)]
must assign probabilities, or frequency counts, to all symbols, including unused ones. Since the motion in the Foreman sequence is relatively easy to track, for the resulting smaller sets of generated motion vectors the extra frequency counts have a relatively significant impact on the overall performance. The results of Fig. 5b show that, for the Bus sequence, our method performs similarly to or slightly better than JSVM v.12 at low bitrates. However, it achieves significant performance improvements at high bitrates. From Fig. 5c it can be seen that, for the Crew sequence, our codec performs better than the JSVM v.12 reference software at all experimental points, achieving an average PSNR gain of up to 1.92 dB.

Examples of frame-by-frame PSNR comparisons of the three schemes can be seen in Fig. 6a–c. It can be observed that our codec reduces the PSNR variations between frames. It also provides better performance for the Bus and Crew sequences. Figures 7, 8 and 9 show three key-frames extracted from the Bus, Foreman and Crew sequences, decoded at 512 Kbps (kbit/s). Visual inspection of these figures reveals that the proposed codec provides better visual quality than MC-EZBC: images processed by our compression method have a less noisy appearance in the homogeneous regions, better preserved textures, and a sharper appearance overall. For the tested bitrate and sequences, the visual quality provided by the proposed coding scheme is comparable to that of JSVM v.12.
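The cost of these unused symbols can be illustrated with a toy model. The snippet below is a sketch under stated assumptions (an idealized adaptive model with all counts initialized to one, and illustrative alphabet sizes), not the coder used in the paper:

```python
# Toy model (not the paper's coder) showing why an adaptive arithmetic coder
# pays for unused symbols: every symbol starts with a frequency count of one,
# so a large motion-vector alphabet dilutes the probability of the few
# residual values that actually occur.
import math

def adaptive_cost_bits(seq, alphabet_size):
    """Ideal code length (in bits) of seq under a simple adaptive model."""
    counts = [1] * alphabet_size                # every symbol starts at 1
    total = alphabet_size
    bits = 0.0
    for s in seq:
        bits += -math.log2(counts[s] / total)   # ideal arithmetic-code length
        counts[s] += 1                          # adapt the model
        total += 1
    return bits

# Near-zero motion-vector prediction residuals, as described in the text.
residuals = [0] * 90 + [1, -1] * 5

# Same data coded with a small and a large alphabet; the offsets just map
# residuals to non-negative symbol indices.
small = adaptive_cost_bits([r + 8 for r in residuals], 16)
large = adaptive_cost_bits([r + 128 for r in residuals], 256)
assert large > small   # unused symbols in the large alphabet inflate the cost
```

With easy-to-track motion, the residual alphabet effectively in use is tiny, so the fixed start-up counts of the full alphabet account for a proportionally larger share of the bit budget.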
6 Conclusion

In this paper, we presented a new scalable video coding method that takes advantage of the powerful compression and scalability of the EBCOT encoder and the efficiency of MCTF in exploiting temporal redundancy in image sequences. Moreover, we designed a new hybrid rate allocation mechanism that takes into account the characteristics of the temporal frames and the principles of PCRD rate control. Simulation results indicate that our algorithm compares favorably with the state of the art in scalable video coding, while offering great flexibility with respect to the bitrate and spatio-temporal resolution at which the video is to be delivered. In future work, we plan to extend scalable representations to the motion information, rather than just the subband sample data. Furthermore, additional performance improvements in motion vector coding may be achieved by replacing the existing adaptive arithmetic coder (AAC) with a context-adaptive binary arithmetic coder (CABAC).
References
1. Andreopoulos Y, Munteanu A, Barbarien J, van der Schaar M, Cornelis J, Schelkens P (2004) In-band motion compensated temporal filtering. Signal Process, Image Commun 19:653–673
2. Andreopoulos Y, Munteanu A, van der Schaar M, Cornelis J, Schelkens P (2004) Comparison between "t+2D" and "2D+t" architectures with advanced motion compensated temporal filtering. ISO/IEC JTC1/SC29/WG11, Tech. Rep. m11045
3. Chen YW, Pearlman WA (1996) Three-dimensional subband coding of video using the zerotree method. In: Proc. SPIE symp. visual commun. image processing, pp 1302–1309
4. Chen P, Woods JW (2004) Bidirectional MC-EZBC with lifting implementation. IEEE Trans Circuits Syst Video Technol 14(10):1183–1194
5. Cohen RA (2007) Hierarchical scalable motion-compensated video coding. PhD thesis, Rensselaer Polytechnic Institute
6. Hsiang S-T, Woods JW (2000) Embedded image coding using zeroblocks of subband/wavelet coefficients and context modeling. In: IEEE international symposium on circuits and systems (ISCAS), pp 662–665
7. Hsiang S-T, Woods JW (2001) Embedded video coding using invertible motion compensated 3-D subband/wavelet filter bank. Signal Process, Image Commun 16:705–724
8. Liu J, Moulin P (2001) Information-theoretic analysis of interscale and intrascale dependencies between image wavelet coefficients. IEEE Trans Image Process 10(11):1647–1658
9. Kim BJ, Xiong Z, Pearlman WA (2000) Low bit-rate scalable video coding with 3-D set partitioning in hierarchical trees (3-D SPIHT). IEEE Trans Circuits Syst Video Technol 10(8):1374–1387
10. Leonardi R, Oelbaum T, Ohm J-R (2003) Report on call for evidence on Scalable Video Coding (SVC) technology. ISO/IEC JTC1/SC29/WG11, Tech. Rep. N5701
11. Munteanu A, Cornelis J, Van Der Auwera G, Cristea P (1999) Wavelet image compression—the quadtree coding approach. IEEE Trans Inf Technol Biomed 3(3):176–185
12. Ohm J-R (1994) Three-dimensional subband coding with motion compensation. IEEE Trans Image Process 3(5):559–571
13. Reichel J, Wien M, Schwarz H (2004) Scalable video model 3.0. ISO/IEC JTC1/SC29/WG11, Tech. Rep. N6716
14. Reichel J, Schwarz H, Wien M (2007) Joint scalable video model JSVM-12 text. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, Tech. Rep., Shenzhen, China
15. Said A, Pearlman WA (1996) A new fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans Circuits Syst Video Technol 6:243–250
16. Schelkens P, Giro X, Barbarien J, Cornelis J (2000) 3D compression of medical data based on cube-splitting and embedded block coding. In: Proc. ProRISC/IEEE workshop, pp 495–506
17. Schelkens P, Munteanu A, Barbarien J, Galca M, Giro-Nieto X, Cornelis J (2003) Wavelet coding of volumetric medical datasets. IEEE Trans Med Imaging 22(3):441–458
18. Schwarz H, Wien M (2008) The scalable video coding extension of the H.264/AVC standard. IEEE Signal Process Mag 25(2):135–141
19. Schwarz H, Hinz T, Kirchhoffer H, Marpe D, Wiegand T (2004) Technical description of the HHI proposal for SVC CE1. ISO/IEC JTC1/SC29/WG11, Tech. Rep. M11244
20. Schwarz H, Marpe D, Wiegand T (2007) Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Trans Circuits Syst Video Technol 17(9):1103–1120
21. Shapiro JM (1993) Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans Signal Process 41:3445–3462
22. Taubman D (2000) High performance scalable image compression with EBCOT. IEEE Trans Image Process 9(7):1158–1170
23. Taubman D, Zakhor A (1994) Multirate 3-D subband coding of video. IEEE Trans Image Process 3(5):572–588
24. Naman AT, Taubman D (2011) JPEG2000-based Scalable Interactive Video (JSIV) with motion compensation. IEEE Trans Image Process 20(9):2650–2663
25. Tham JY, Ranganath S, Kassim A (1999) Highly scalable wavelet-based video codec for very low bit-rate environment. IEEE J Sel Areas Commun 16(1):12–27
26. Wiegand T, Sullivan GJ, Bjontegaard G, Luthra A (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560–576
27. Witten IH, Neal RM, Cleary JG (1987) Arithmetic coding for data compression. Commun ACM 30(6):520–540
28. Woods JW, Chen P (2002) Improved MC-EZBC with quarter-pixel motion vectors. ISO/IEC JTC1/SC29/WG11, Tech. Rep. M8366
29. Wu Y, Hanke K, Rusert T, Woods JW (2008) Enhanced MC-EZBC scalable video coder. IEEE Trans Circuits Syst Video Technol 18(10):1432–1436
30. Xiong R, Xu J, Wu F, Li S, Zhang YQ (2004) Spatial scalability in 3-D wavelet coding with spatial domain MCTF encoder. In: International symposium on picture coding, pp 583–588
31. Zhang D, Zhang W, Xu J, Wu F, Xiong H (2008) A cross-resolution leaky prediction scheme for in-band wavelet video coding with spatial scalability. IEEE Trans Circuits Syst Video Technol 18(4):516–521
Bilel Ben Fradj was born in Tunis, Tunisia, in June 1981. He received the Computer Sciences Engineering degree from the National Engineering School of Computer Sciences in 2005. He then obtained the Master's degree in Communication Systems in 2007 from the National Engineering School of Tunis, Tunisia. Since September 2007, he has been a member of the Communication Systems Laboratory of the National Engineering School of Tunis, where he is currently pursuing the PhD degree. His research interests include scalable video coding, multiresolution image analysis, image and video transmission over networks, and medical image processing.

Azza Ouled Zaid was born in Tunis, Tunisia, in November 1974. She received the Electrical Engineering degree in 1998. She then obtained her M.Sc. degree in Sensors and Instrumentation for Vision Systems in 1999 from the L3I Laboratory, University of Rouen, France, and her Ph.D. degree in Electronic Sciences in 2002 from the University of Poitiers, France. In 2003–2004 she worked as a research assistant in the LSS Laboratory of the SUPELEC engineering school in Paris, France. From 2004 to 2010 she was an Associate Professor at the High Computer Science Institute of Tunis, Tunisia, where she is currently a Professor in the Department of Networks and Systems Administration. Since September 2004, she has been a member of the Communication Systems Laboratory of the National Engineering School of Tunis, Tunisia. Her research interests include scalable image and video coding and 3D mesh processing.