SCALABLE VIDEO CODING USING MOTION-COMPENSATED TEMPORAL FILTERING

Bilel Ben Fradj
Azza Ouled Zaid
National Engineering School of Tunis Le Belvedere 1002, Tunis, Tunisia
High Computer Science Institute 2080 l’Ariana, Tunisia
ABSTRACT

Motion-compensated temporal filtering (MCTF) is a useful framework for scalable video compression schemes. In this paper, we propose a new scalable video coder that combines motion-compensated temporal filtering with an embedded wavelet-based coder. To improve the coding efficiency, we adaptively weight the target bitrate according to the temporal frame position in the temporal pyramid. Comparisons with state-of-the-art scalable video coding solutions show that our method yields higher compression performance with superior video quality.

Index Terms— Video coding, motion compensation, wavelet transforms

1. INTRODUCTION

With the development of multimedia applications over heterogeneous communication environments and the generalization of large video formats, video coding remains a challenging problem. In order to optimize the end-to-end quality of service of the delivery chain, the compression scheme must allow a flexible adaptation of the compressed bitstreams to network characteristics that may vary in time. The concept of scalability is a key feature for fulfilling these new requirements. Three dimensions of scalability are generally used in video compression: resolution (spatial), fidelity (SNR), and frame-rate (temporal). The conventional video coding standards MPEG-2, H.263, and MPEG-4 Visual already include several tools by which the scalability functionality can be supported. However, the scalable profiles of those standards have rarely been used [1], due to the significant loss in coding efficiency and the considerable increase in decoder complexity. Simulcast provides functionalities similar to those of a scalable bitstream, but at the cost of a significant increase in bitrate. Moreover, expensive transcoding or re-encoding is used to generate separately stored compressed bitstreams for transmission over multipoint control units in video conferencing systems or for streaming services in 3G systems. The H.264/AVC extension SVC [2] has achieved significant improvements in coding efficiency, with an increased degree of supported scalability relative to
the scalable profiles of prior video coding standards. However, because of the substantially higher complexity of the underlying H.264/AVC standard relative to prior video coding standards, the computational resources needed to decode an SVC bitstream exceed those required by the scalable profiles of older standards. By contrast, video coders based on wavelet transforms (WT) may prove much more efficient for encoding digital cinema, medical, or satellite sequences. For example, Motion-JPEG2000 (MJP2) [3], which extends JPEG2000 to video coding applications, has proved to be as efficient as H.264/AVC in intra mode for high-resolution sequences encoded at high bitrates. Nevertheless, MJP2 is simply a file format for storing intraframe-coded images. For broadcast and transmission over the networks and channels commonly in use today, interframe compression is usually needed to achieve reasonably good video quality at a limited bitrate. Related interframe wavelet-based coders can be found in [4, 5, 6, 7, 8]. These methods use wavelet-based coding both spatially and temporally to obtain good compression efficiency together with the functional benefits of limited or full scalability. The well-known Motion Compensated Embedded Zero Block Coding (MC-EZBC) algorithm [9] is optimized for best performance in the high-bitrate range; however, its efficiency at low bitrates is far from optimal. In this work, we address the task of building an interframe coding method that combines the advantages of the highly efficient intra-band JPEG2000 coding and of the fine-granularity scalable MC-EZBC scheme. To allocate the global bit-budget efficiently, within each Group of Pictures (GOP) the total bitrate is divided in proportion to the temporal band positions. The structure of the paper is as follows. In section 2, interframe wavelet-based video coding is presented. The proposed video compression framework is then described in section 3. Next, in section 4, experimental results are presented and discussed. The work is concluded in section 5.
2. INTERFRAME WAVELET-BASED VIDEO CODING

The interframe wavelet video coding paradigm is based on the concepts of motion-compensated temporal filtering and embedded subband/wavelet coding. It removes the temporal redundancy in an image sequence by applying motion-compensated (wavelet) filtering along the temporal axis; the spatial wavelet decomposition is then applied to the temporally filtered frames [5]. By imposing an intra-band hierarchical tree structure on the wavelet coefficients, the latter are processed using successive approximation quantization (SAQ) coupled with an entropy coder. In the following, we review the block-based motion estimation/motion compensation system integrated into the MC-EZBC codec [9]. Our choice of this system is motivated by the fact that MC-EZBC incorporates all the basic tools found in state-of-the-art open-loop video coding systems. For example, the prediction is bi-directional (using one past and one future frame), and variable block sizes, ranging dyadically from 64 × 64 pixels down to 4 × 4 pixels, are used for block-based motion-compensated prediction. It is worth mentioning that Secker and Taubman [4] proposed an MCTF method based on a deformable mesh motion model. This model can achieve superior compression efficiency compared to the block-based model; however, occlusion and aperture effects are not treated efficiently, and the generated motion field may be quite noisy. The MC-EZBC coder processes one group of pictures (GOP) at a time. The temporal subband decomposition is done in two sequential steps. First, the motion field (motion vectors) between two consecutive frames is constructed. Then, temporal filtering, using the lifting scheme [10], operates on these two frames to generate temporal high-pass and low-pass subbands. The decomposition continues iteratively until the highest temporal level has been reached, yielding the temporal filtering pyramid shown in Fig. 1. A GOP of N frames has log2 N levels if N is an integer power of 2. The low-pass and high-pass frames of the temporal pyramid are denoted y_L^k(m) and y_H^k(m), where k ∈ {1, 2, ..., log2 N} is the temporal level and m ∈ {0, 1, ..., N/2^k − 1} is a temporal position index. It is worth noting that the MCTF plays a crucial role in motion-compensated 3-D subband/wavelet coding, since it influences both the coding efficiency and the temporal scalability.
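As a minimal illustration of this two-step decomposition, the following Python sketch (ours, not part of the MC-EZBC software) performs the lifting-based temporal Haar split over a GOP; for readability, the motion compensation step is omitted, so the predict/update steps operate on co-located pixels.

    import numpy as np

    def haar_lift(a, b):
        # Predict step: the high-pass frame is the frame difference
        # (motion compensation omitted here for clarity).
        h = b - a
        # Update step: the low-pass frame is the pairwise average.
        l = a + 0.5 * h
        # Orthonormal scaling: l = (a+b)/sqrt(2), h = (b-a)/sqrt(2).
        return np.sqrt(2.0) * l, h / np.sqrt(2.0)

    def mctf_pyramid(gop, levels):
        # Iteratively re-split the low-pass frames, as in Fig. 1.
        pyramid, lows = {}, list(gop)
        for k in range(1, levels + 1):
            pairs = list(zip(lows[0::2], lows[1::2]))
            lows, highs = [], []
            for a, b in pairs:
                l, h = haar_lift(a, b)
                lows.append(l)
                highs.append(h)
            pyramid['H', k] = highs   # y_H^k(m), m = 0 .. N/2^k - 1
        pyramid['L', levels] = lows   # one y_L^k frame for a dyadic GOP
        return pyramid

For an 8-frame GOP, this yields four y_H^1 frames, two y_H^2, one y_H^3, and one y_L^3 frame, matching the three-level pyramid of Fig. 1.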
Fig. 1. Typical 3-level temporal filtering pyramid.
Fig. 2. System architecture of the proposed scalable video coding.
3. DESCRIPTION OF THE PROPOSED FRAMEWORK

3.1. Global coding scheme

In this work, we propose a scalable video coding paradigm based on MCTF and on the Embedded Block Coding with Optimized Truncation (EBCOT) [11] coder, which exploits the intra-band dependencies between the wavelet coefficients. The global coding structure is depicted in Fig. 2. Half-pel motion estimation is achieved using the fast hierarchical variable-size block matching (HVSBM) algorithm, from 16 × 16 blocks down to 4 × 4 blocks. After generating the motion vector tree associated with the temporal pyramid, a bottom-up pruning is applied. Depending on the percentage of unconnected pixels, a lifting implementation of the two-tap Haar filter is used along the temporal axis to achieve an invertible, half-pixel-accurate MCTF. The resulting set of frames consists of a temporal low subband and a collection of temporal high subbands. For spatial analysis, texture coding, and bitstream layering, the temporal frames are spatially decomposed; we adopt the Daubechies 9/7 analysis/synthesis filters for the spatial filtering. The EBCOT algorithm is then applied to the wavelet coefficients to create an embedded bitstream. To encode the motion information, the motion vectors are converted into a bitstream using lossless DPCM followed by adaptive arithmetic coding; the resulting motion-field bitstream is sent as side information.
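To make the motion side-information path concrete, the sketch below (our own simplification; function names are ours, and the arithmetic coder is abstracted away) shows the lossless DPCM stage applied to the motion vectors before entropy coding.

    def mv_dpcm_encode(mvs):
        # mvs: list of (dx, dy) motion vectors in half-pel units.
        # Each vector is predicted from the previous one; the residuals
        # are what the adaptive arithmetic coder actually sees.
        residuals, prev = [], (0, 0)
        for dx, dy in mvs:
            residuals.append((dx - prev[0], dy - prev[1]))
            prev = (dx, dy)
        return residuals

    def mv_dpcm_decode(residuals):
        # Exact inverse of the prediction, so the DPCM stage is lossless.
        mvs, prev = [], (0, 0)
        for rx, ry in residuals:
            prev = (prev[0] + rx, prev[1] + ry)
            mvs.append(prev)
        return mvs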
3.2. Rate allocation principle

Rather than adjusting an average bitrate over the entire encoded sequence, our algorithm performs rate allocation independently for each GOP. Indeed, when the temporal filtering is applied to a GOP according to the dyadic decomposition, the temporal low subband and the first temporal high subband carry the most significant information. These frames are encoded with higher priority than the remaining high temporal subbands, which are perceived to be less important. Because the low temporal subbands are highly sensitive to compression losses, the maximum allowed byte budget is partitioned according to the importance of the temporal frames. A weight parameter α, which serves as the criterion for a judicious allocation of the available bit-budget, is assigned to each temporal frame depending on its position in the temporal pyramid. It should be noted that in the work of André et al. [7], temporal and spatial bit allocations are performed separately. As a result, the quantizer has to satisfy coding constraints provided by two different processes, which adversely affects compression performance. To limit these performance losses, our method combines the temporal and spatial allocations: a fixed rate is assigned to each temporal frame, and the EBCOT algorithm is then applied to the processed temporal frame using this fixed rate as a rate constraint. The proposed rate allocation framework is formulated as follows. Given a GOP of size N and a global target bitrate R*:

Step 1. Each temporal frame is denoted by y_F^k(m), where k and m are respectively the temporal level and the temporal position index in the processed GOP, and F refers to the low-pass or high-pass subband. The bitrate assigned to that temporal frame is determined by

R_F^k(m) = \alpha_F^k(m) \times R^*

Step 2. After the spatial analysis, the Lagrangian rate allocation principle of the JPEG2000-EBCOT algorithm is applied across all the transformed frames. Specifically, for each spatio-temporal subband indexed by i in the frame y_F^k(m), a quantization step size is chosen so that the global objective function J is minimized:

J(\lambda) = \sum_i \omega_i D_i + \lambda \sum_i \gamma_i (R_i - R^*)
where R_i is the bitrate allocated to subband i, D_i is the distortion model associated with the same subband as a function of R_i, ω_i is a "size weight", γ_i is an "energy weight", and λ is the Lagrange multiplier.
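A possible realization of step 2 is sketched below (our illustration, not the JPEG2000 implementation): each subband exposes a set of (rate, distortion) operating points, and a bisection on λ selects, per subband, the point minimizing ω_i D_i + λ γ_i R_i (the R* term is constant in λ and does not affect the argmin) until the summed rate meets the frame budget R_F^k(m) of step 1.

    def pick(subbands, lam):
        # subbands: list of dicts {'points': [(R, D), ...], 'w': ..., 'g': ...}
        # Per-subband argmin of w*D + lam*g*R over the operating points.
        return [min(s['points'], key=lambda p: s['w'] * p[1] + lam * s['g'] * p[0])
                for s in subbands]

    def lagrangian_allocate(subbands, budget, iters=60):
        lo, hi = 0.0, 1e12
        for _ in range(iters):
            lam = 0.5 * (lo + hi)
            if sum(p[0] for p in pick(subbands, lam)) > budget:
                lo = lam   # total rate too high: penalize rate more strongly
            else:
                hi = lam   # budget met: try spending closer to the budget
        return pick(subbands, hi)   # hi always satisfies the budget
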
It is worth emphasizing that the proposed rate allocation scheme is independent of the choice of the embedded intraframe coder. Consequently, the JPEG 2000 encoder can be replaced by another intra-band compression technique, such as the QT-L algorithm [12].

4. EXPERIMENTAL RESULTS

Our implementation is based on the ENH-MC-EZBC version of MC-EZBC, available at http://www.cipr.rpi.edu/research/mcezbc/, and on the OpenJPEG implementation of the JPEG 2000 (Part 1) standard, obtained from http://www.openjpeg.org/. We tested the proposed scalable video coder on the QCIF and CIF formats of the Foreman, Mobile, and Bus sequences. Here, we only show the results obtained for the Foreman and Bus CIF sequences (352 × 288, 144 frames, 25 fps). Only the luminance component is taken into consideration, since the human visual system is less sensitive to color than to luminance. For brevity, the average PSNR of the luminance component is used as the quality metric. To assess the compression performance of our coding method, we conducted various comparisons with two successful prior scalable video coding schemes: the Rensselaer version of the MC-EZBC software, and version 12 of the JSVM software [13] for the emerging H.264/SVC standard. Currently, these coders deliver the best performance for scalable coding of video datasets.

Table 1. Configurations of the SVC base and enhancement layers.

Base layer:
- AVC-compatible base layer
- one intra-coded picture at the beginning of the sequence
- unweighted prediction
- CABAC entropy coding

Enhancement layer:
- inter-layer prediction
- 8 × 8 transform enabled
- one intra-coded picture at the beginning of the sequence
4.1. Experimental conditions

For our codec and MC-EZBC, a three-stage temporal decomposition using motion-compensated Haar filtering is applied. A four-stage spatial analysis follows the temporal stage to complete the 3-D subband decomposition; the spatial transform uses the Daubechies 9/7 filter banks. Bi-directional motion vectors are used when the percentage of unconnected pixels produced by forward motion estimation exceeds a given threshold. We should mention that, in the current study, we focus only on the compression performance of our coding scheme over a large range of bitrates, without considering its behavior under resolution and temporal scalability. For this reason, we retained the LRCP (Layer-Resolution-Component-Position) progression mode, supported by the JPEG 2000 standard, to generate the layered bitstreams. This progression mode allows the quality of the decoded sequence to improve as the bitrate increases. Concerning the JSVM v. 12 software, obtained from the public Internet site http://wftp3.itu.int/av-arch/jvt-site/, we use the quality scalable coding mode, also referred to as coarse-grain quality scalable (CGS) coding. The H.264/SVC bitstreams contain two representations (a base and an enhancement layer) with different reconstruction qualities and bitrates. Details of the configuration settings of the SVC base and enhancement layers are summarized in Table 1. It is important to note that the multilayer concept for quality scalability in the H.264/SVC codec only allows a few selected bitrates to be supported in a scalable bitstream: the number of supported rate points is identical to the number of layers. Consequently, the bitstreams are constructed such that successive extraction of the included rate points is enabled.
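In both codecs, a lower rate point is thus obtained by dropping data rather than by re-encoding. The toy sketch below (hypothetical offsets and function name, not an actual JSVM or OpenJPEG API) illustrates the common principle: a layered bitstream records where each quality layer ends, and extraction amounts to truncation at one of those boundaries.

    def extract_rate_point(bitstream, layer_end_offsets, n_layers):
        # Keep the first n_layers quality layers of an embedded bitstream.
        # With L layers there are exactly L extractable rate points,
        # which is the CGS limitation noted above.
        return bitstream[:layer_end_offsets[n_layers - 1]]

    # Example with made-up byte offsets for a 3-layer stream:
    # stream_2_layers = extract_rate_point(stream, [10_000, 25_000, 60_000], 2)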
4.2. Tuning our method
Fig. 3 depicts the average PSNR-rate curves obtained with our video coding method for three collections of α_F^k(m) parameters.
Collection 1:
α_F^k(m) = 1 for all temporal subbands in the GOP.

Collection 2:
α_F^k(m) = 4/3 if k = 3 and F = "L" or "H", or if k = 2 and F = "H";
α_F^k(m) = 2/3 if k = 1 and F = "H".

Collection 3:
α_F^k(m) = 16/9 if k = 3 and F = "L" or "H";
α_F^k(m) = 8/9 if k = 2 and F = "H";
α_F^k(m) = 2/3 if k = 1 and F = "H".

The results of Fig. 3a and Fig. 3b show that the first collection gives the worst performance of the three tested settings. This is because the bitrate is allocated uniformly among the temporal frames, without privileging the temporal subbands located in the last levels, which contain the most relevant information. The second collection performs better than collection 1 but worse than collection 3, which is the best choice. The collections of α_F^k(m) settings described above are only three out of many possible choices; we selected these three because, on average, their impact is representative of that of the other possible collections. In the following experiments, we retain collection 3, as illustrated by the sketch below.
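For concreteness, the following sketch (our illustration; R_star is an arbitrary placeholder rate) computes the per-frame budgets of collection 3 for an 8-frame GOP and checks that the weights preserve the overall bit budget.

    def alpha_collection3(F, k):
        # Weights of collection 3, as listed above.
        if k == 3 and F in ('L', 'H'):
            return 16.0 / 9.0
        if k == 2 and F == 'H':
            return 8.0 / 9.0
        if k == 1 and F == 'H':
            return 2.0 / 3.0
        raise ValueError((F, k))

    # An 8-frame GOP after 3 temporal levels: one L3, one H3, two H2, four H1.
    frames = [('L', 3), ('H', 3)] + [('H', 2)] * 2 + [('H', 1)] * 4
    R_star = 1000.0   # illustrative per-frame target, in kbit/s
    budgets = [alpha_collection3(F, k) * R_star for F, k in frames]
    # The weights average to 1 over the GOP, so the global budget is preserved.
    assert abs(sum(budgets) - len(frames) * R_star) < 1e-6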
Fig. 3. Rate-Distortion comparison among the three proposed rate allocation methods for (a) Foreman and (b) Bus sequences.
4.3. Comparison with other methods

Note that, for comparison purposes, all the video coding methods considered here have been restricted to motion vector estimation with half-pixel accuracy. Fig. 4a and Fig. 4b illustrate the rate-distortion curves obtained with our method, JSVM v. 12, and MC-EZBC for the Foreman and Bus CIF sequences, respectively. These figures clearly show that our coding method outperforms the MC-EZBC codec on the Bus and Foreman test sequences over all tested bitrates; the average margin of improvement offered by our method varies from 1.33 dB to more than 3.52 dB. The results reported in Fig. 4a show that, for the Foreman sequence, the JSVM v. 12 codec performs best until the bitrate reaches approximately 700 Kbps, after which our video coding system performs better.
Fig. 5. Original luminance image extracted from the Bus sequence (a), and the image decoded at 512 Kbps after compression using the proposed method (b), the MC-EZBC coder (c), and JSVM v. 12 (d).
Fig. 4. Rate-Distortion comparison among video compression techniques for the (a) Foreman and (b) Bus sequences.
This result comes from the fact that we use adaptive arithmetic coding (AAC) to code the motion vector prediction residuals. Even though most prediction residuals are near zero, AAC must still assign probabilities, or frequency counts, to all symbols, including unused ones. Since the motion in the Foreman sequence is relatively easy to track, the sets of generated motion vectors are small, and the extra frequency counts have a relatively significant impact on the overall performance. The results of Fig. 4b show that, for the Bus sequence, our method performs similarly to or slightly better than JSVM v. 12 at low bitrates, while achieving significant performance improvements at high bitrates. Fig. 5 depicts a key frame extracted from the Bus sequence compressed at 512 Kbps. From this figure, we can plainly discern that the image processed by our compression method has a less noisy appearance in the homogeneous regions, better preserved textures, and an overall sharper appearance.
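This overhead can be reproduced with a toy adaptive model (our sketch; real AAC implementations differ in detail): every symbol of the residual alphabet starts with a count of one, so when only a few residuals are actually coded, the unused symbols noticeably dilute the probability of the frequent near-zero values.

    import math

    def ideal_aac_cost_bits(residuals, alphabet_size=65):
        # Adaptive frequency model over residuals in [-32, 32]: every
        # symbol keeps a count of at least 1, whether used or not.
        counts = {s: 1 for s in range(-(alphabet_size // 2), alphabet_size // 2 + 1)}
        total, bits = alphabet_size, 0.0
        for r in residuals:
            bits += -math.log2(counts[r] / total)   # ideal arithmetic-code length
            counts[r] += 1                          # adapt the model to the data
            total += 1
        return bits

    # Ten zero residuals still cost about 4 bits each on average, because
    # the model initially spreads probability over all 65 symbols.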
5. CONCLUSION

In this paper, we have presented a scalable video coder aimed at high-quality applications. The proposed scheme takes advantage of the powerful compression and scalability of the EBCOT encoder and of the efficiency of MCTF in exploiting the temporal redundancy of image sequences. Moreover, we have designed a new hybrid rate allocation mechanism taking into account both the characteristics of the temporal frames and the Lagrangian rate allocation (LRA) principle. Preliminary simulation results indicate that our algorithm outperforms the conventional MC-EZBC codec and performs competitively with JSVM v. 12. In future work, we plan to extend the scalable representation to the motion information, rather than to the subband sample data only.

6. REFERENCES

[1] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the Scalable Video Coding Extension of the H.264/AVC Standard," IEEE Trans. on Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1103–1120, September 2007.

[2] H. Schwarz and M. Wien, "The scalable video coding extension of the H.264/AVC standard," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 135–141, March 2008.
[3] ISO/IEC 15444-3, "Motion-JPEG2000 (JPEG2000 Part 3)," 2002.

[4] A. Secker and D. Taubman, "Highly scalable video compression using a lifting-based 3D wavelet transform with deformable mesh motion compensation," in IEEE International Conference on Image Processing, Rochester, New York, September 2002, pp. 749–752.

[5] S. S. Tsai and H. M. Hang, "Motion information scalability for MC-EZBC," Signal Processing: Image Communication, vol. 19, no. 7, pp. 675–684, 2001.

[6] M. Cagnazzo, T. André, M. Antonini, and M. Barlaud, "A Model-Based Motion Compensated Video Coder with JPEG2000 Compatibility," in IEEE International Conference on Image Processing, Singapore, October 2004, pp. 2255–2258.

[7] T. André, M. Cagnazzo, M. Antonini, and M. Barlaud, "A JPEG2000-compatible full scalable video coder," EURASIP Journal on Image and Video Processing, vol. 2007, 11 pages, 2007.

[8] R. A. Cohen, Hierarchical Scalable Motion-Compensated Video Coding, Ph.D. thesis, Rensselaer Polytechnic Institute, Troy, New York, May 2007.

[9] S. T. Hsiang and J. W. Woods, "Embedded video coding using invertible motion compensated 3-D subband/wavelet filter bank," Signal Processing: Image Communication, vol. 16, pp. 705–724, 2001.

[10] B. Pesquet-Popescu and V. Bottreau, "Three-dimensional lifting schemes for motion compensated video compression," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, pp. 1793–1796.

[11] P. Schelkens, X. Giro, J. Barbarien, and J. Cornelis, "3-D Compression of Medical Data Based on Cube-Splitting and Embedded Block Coding," in Proc. ProRISC/IEEE Workshop, 2000, pp. 495–506.

[12] A. Munteanu, J. Cornelis, G. Van Der Auwera, and P. Cristea, "Wavelet image compression - the quadtree coding approach," IEEE Trans. on Information Technology in Biomedicine, vol. 3, no. 3, pp. 176–185, 1999.

[13] J. Reichel, H. Schwarz, and M. Wien, "Joint Scalable Video Model JSVM-12 text," Doc. JVT-Y202, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Shenzhen, China, 2007.