Transform Coding in the HEVC Test Model

2011 18th IEEE International Conference on Image Processing

Martin Winken¹, Philipp Helle¹, Detlev Marpe¹, Heiko Schwarz¹, and Thomas Wiegand¹,²
[ martin.winken | philipp.helle | detlev.marpe | heiko.schwarz | thomas.wiegand ]@hhi.fraunhofer.de

¹ Image Processing Department, Fraunhofer HHI, Einsteinufer 37, 10587 Berlin, Germany

² Image Communication Chair, Technical University of Berlin, Einsteinufer 17, 10587 Berlin, Germany

978-1-4577-1302-6/11/$26.00 ©2011 IEEE

ABSTRACT

Recently, ITU-T VCEG and ISO/IEC MPEG have started a new joint standardization activity on video coding, called High Efficiency Video Coding (HEVC). In this paper, the new transform coding techniques in the HEVC Test Model are described, including the residual quadtree (RQT) approach and coded block pattern signaling. Experimental results showing the advantage of using larger block size transforms, especially for high resolution video material, are presented. Furthermore, the trade-off between coding gain and encoder complexity in terms of the maximum RQT depth is discussed.

Index Terms: transform coding, HEVC, video coding

1. INTRODUCTION

In recent years, digital video has become the dominant form of media content in many consumer applications. Video coding is one of the enabling technologies for the delivery of digital video and, in particular, the current state-of-the-art video coding standard H.264/AVC has become the primary choice in many video applications. However, the core design of H.264/AVC was finalized with the development of the High Profile (HP) in 2004. Since then, exploratory work within both standardization organizations in the field of video coding, the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG), has shown potential for improving coding efficiency relative to H.264/AVC HP, in particular for high resolution video.

Motivated by these investigations, MPEG and VCEG agreed to form a Joint Collaborative Team on Video Coding (JCT-VC) and issued a Joint Call for Proposals (CfP) [1] for video compression technology. The new standardization project is also referred to as High Efficiency Video Coding (HEVC). At the first JCT-VC meeting, an initial Test Model under Consideration (TMuC) [2] was drafted, based on technology of the best-performing responses to the CfP.

In this paper, the transform coding in the HEVC Test Model (HM Version 2.0) [3] is described with focus on the residual quadtree (RQT) approach. Using quadtree structures for variable block-size transform coding is well known in the literature [4], but so far it has been applied only to a limited extent in video coding standards (e.g., in H.264/AVC HP, where the adaptive choice between 4 × 4 and 8 × 8 transform block sizes is supported). Note that the RQT design of the current HM was originally proposed in [5], while [6] proposed a different approach to adaptive transform coding using variable sized blocks, which can be viewed as a subset of our RQT approach.

The paper is organized as follows. Section 2 gives a short overview of the HM design, section 3 describes the residual quadtree (RQT) based transform coding approach, section 4 discusses signaling of the RQT parameters, section 5 shows experimental results, and finally section 6 concludes.

2. OVERVIEW OF THE HEVC TEST MODEL

The architecture of the HM is based on the conventional hybrid video coding approach, using spatial (intra) or temporal (inter) prediction, followed by transform coding of the prediction residual and entropy coding of transform coefficients and prediction parameters. Each input picture is divided into a number of square blocks of equal size, called treeblocks. Each treeblock can be further divided into so-called coding units. The division of the treeblocks into coding units is specified by a quadtree structure called the coding tree. For the coding units, block sizes from 8 × 8 to 64 × 64 are supported, where the edge length of each admissible block is given as a power of two. For each coding unit (CU), it is transmitted whether intra or inter prediction is used. A CU can be further divided into 2 or 4 prediction units (PU), for each of which individual prediction parameters (i.e., motion vectors or intra prediction modes) are specified. Furthermore, so-called merging of neighboring inter coded PUs, possibly from different CUs, into one contiguous region is possible, such that motion parameters have to be transmitted only once for the whole region. For transform coding of the prediction residual, a CU can be split into smaller transform units. The splitting is signaled using a second quadtree, the residual quadtree (RQT). It should be noted that only square block transform units



are supported. Each transform unit (TU) corresponds to one transform block, i.e., for a TU of size N × N, an N × N block transform is used, where N is a power of two. The current HM supports four different transform block sizes in the range of 4 × 4 to 32 × 32. Note that in HM 2.0, the core transforms used for 4 × 4 and 8 × 8 blocks are the same as in H.264/AVC, while for the two larger block sizes, integer transforms based on Chen’s fast algorithm for the Discrete Cosine Transform (DCT) [7] are used.

Fig. 1. Example of an RQT structure (right) for partitioning a given coding unit (left) into transform units of variable size.

3. RESIDUAL QUADTREE BASED TRANSFORM CODING

The partitioning of a given CU into TUs is done based on a quadtree approach. The corresponding structure is called the residual quadtree (RQT). Fig. 1 shows an example, where a CU is partitioned into 10 TUs, labelled with the letters a to j. The individual blocks are processed in alphabetical order, corresponding to a depth-first tree traversal. The quadtree approach makes it possible to adapt the transform to the varying space-frequency characteristics of the residual signal. Larger transform block sizes, having larger spatial support, provide a better frequency resolution. Smaller transform block sizes, having smaller spatial support, on the other hand, provide a better spatial resolution. The trade-off between spatial and frequency resolution is chosen by the encoder control, for example, based on Lagrangian optimization techniques.

The RQT based transform coding operates differently on intra and inter coded CUs. For inter coded CUs, the partitioning of the CU into TUs can be selected independently of the chosen PU partitioning. This means that a transform block may cover several different prediction blocks. Experimental results in [8] show an average coding performance loss of 0.4–0.7 % BD rate [9] if this feature is turned off.
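The Lagrangian choice between one large transform block and four smaller ones, mentioned in Section 3, can be illustrated with a minimal sketch. It is not the HM encoder control: the callback `cost`, the one-bit split flag price, the child order (top-left, top-right, bottom-left, bottom-right) and the toy distortion model are all assumptions made for illustration. Each candidate block is charged J = D + λR, and a node is split only if the summed cost of its four children is lower.

```python
# A node is None (leaf = one TU) or a list of four child nodes.

def best_partition(x, y, size, depth, max_depth, min_size, cost, lam, split_bits=1.0):
    """Return (J, tree) minimising J = D + lam * R for the block at (x, y).

    cost(x, y, size) -> (distortion, rate_bits) is a hypothetical callback
    standing in for transform, quantisation and entropy coding of one block.
    """
    distortion, rate = cost(x, y, size)
    can_split = depth < max_depth and size > min_size
    # A split flag is only paid for when a split is actually possible.
    flag = split_bits if can_split else 0.0
    j_leaf = distortion + lam * (rate + flag)
    if not can_split:
        return j_leaf, None

    half = size // 2
    j_split = lam * flag
    children = []
    for dy in (0, half):          # assumed child order: TL, TR, BL, BR
        for dx in (0, half):
            j, sub = best_partition(x + dx, y + dy, half, depth + 1,
                                    max_depth, min_size, cost, lam, split_bits)
            j_split += j
            children.append(sub)
    if j_split < j_leaf:
        return j_split, children
    return j_leaf, None


calls = []  # records every block for which a cost was evaluated

def toy_cost(x, y, size):
    """Made-up model: the bottom-right quadrant of a 32 x 32 CU is 'busy',
    and blocks larger than 8 x 8 that touch it are distorted four times more."""
    calls.append((x, y, size))
    busy = x + size > 16 and y + size > 16
    distortion = size * size * (4.0 if busy and size > 8 else 1.0)
    return distortion, 8.0


j_best, tree = best_partition(0, 0, 32, 0, max_depth=2, min_size=4,
                              cost=toy_cost, lam=10.0)
print(j_best, tree, len(calls))
```

With this toy model only the bottom-right 16 × 16 block is split further. The exhaustive search evaluates 1 + 4 + 16 = 21 candidate blocks for a maximum depth of 2, so the number of evaluations grows geometrically with the allowed depth.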
For example, if an inter coded CU is divided into two PUs, the RQT can still specify that the CU is not divided for the purpose of transform coding. In this case, one single transform block would cover the residual signal of two prediction blocks. For intra coded CUs, however, the causal neighboring blocks (i.e., those blocks which are encoded and transmitted before the current block in processing order) have to be fully reconstructed in order to generate the prediction signal for the current block. So, for the intra case, there is a forced subdivision of the CU into subblocks, such that none of these subblocks covers more than one prediction block. Each of these subblocks may then be further divided into TUs.

The maximum allowed depth of the RQT, which restricts the minimum allowed transform block size, is transmitted in the bit-stream. For intra and inter coded CUs, different values of this parameter can be specified. In the current design of the HM, the maximum depth is restricted to be not greater than 2.

For the coding of YUV video material with 4:2:0 chroma subsampling, the subdivision of the CU into transform blocks is the same for the luma and the chroma components, except when the luma transform block is of size 4 × 4. Since the smallest possible transform size is 4 × 4 for both luma and chroma, in this case the 8 × 8 block constituted by the four luma 4 × 4 blocks is not further subdivided for chroma coding, leading to one single 4 × 4 chroma transform block.

The transform coefficient coding using context-based adaptive binary arithmetic coding (CABAC) in the current HM is done using the same basic concepts as in H.264/AVC, but the context model selection has been adapted to larger block sizes. For details, the reader is referred to [10].

4. SIGNALING OF THE RESIDUAL QUADTREE PARAMETERS

The subdivision of the CU into TUs, corresponding to the RQT structure, is signaled using a recursive depth-first approach. Each node of the RQT corresponds to a certain block of image samples within the current CU. The root node of the RQT corresponds to the whole CU, while the leaf nodes correspond to the individual TUs. For signaling of the RQT structure, all the nodes of the RQT are traversed in depth-first order, and a split flag is transmitted for each node, indicating whether it is a leaf node or an internal node of the RQT. Note that there are certain cases in which signaling this flag would be redundant.
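The depth-first traversal described above can be sketched as follows. The tree representation (a leaf is `None`, an internal node is a list of four children), the child order (top-left, top-right, bottom-left, bottom-right) and the function names are assumptions made for illustration; the example tree is hypothetical and is not the partition shown in Fig. 1. This version writes one flag for every node and does not yet skip redundant flags.

```python
def split_flags(tree):
    """Yield one flag per RQT node in depth-first order: 1 = internal, 0 = leaf."""
    if tree is None:
        yield 0
        return
    yield 1
    for child in tree:
        yield from split_flags(child)


def transform_units(tree, x=0, y=0, size=32):
    """Yield (x, y, size) for every leaf (TU) in the same depth-first order."""
    if tree is None:
        yield (x, y, size)
        return
    half = size // 2
    # Assumed child order: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]
    for child, (dx, dy) in zip(tree, offsets):
        yield from transform_units(child, x + dx, y + dy, half)


# Hypothetical 32 x 32 CU: only the top-right quadrant is split again.
example = [None, [None, None, None, None], None, None]
flags = list(split_flags(example))
tus = list(transform_units(example, size=32))
print(flags)
print(tus)
```

For this example the traversal visits nine nodes and produces seven TUs: three of size 16 × 16 and four of size 8 × 8.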
One example of a redundant split flag is a 64 × 64 CU when the maximum allowed TU size is restricted to 32 × 32 (as specified in the current HM). In this case, the CU has to be subdivided at least once, since otherwise it would lead to a 64 × 64 TU, which is not allowed. A different example is the case where the current node of the RQT already corresponds to a block of 4 × 4 luma samples. In this case, no further subdivision is allowed, since the minimum allowed TU size is 4 × 4. In all cases where there is only one possible value of the split flag, this flag is not explicitly signaled, but is instead inferred by the decoder.

Besides the actual structure of the RQT, it is also signaled whether there are non-zero transform coefficients present in a particular TU or set of TUs. This corresponds to the coded block pattern (CBP) signaling in H.264/AVC. For inter coded CUs, one flag is signaled first, even before the transmission of the RQT structure described above, indicating whether all transform coefficients in the whole CU are equal to zero. Since transform blocks without non-zero transform coefficients are very rare in intra coding of typical video sequences, this flag is not signaled for intra coded CUs but is instead inferred to be zero (corresponding to false, i.e., there are non-zero transform coefficients). If this flag is equal to one, no further information is transmitted for the current CU. Since in this case all the transform coefficients are equal to zero, it makes no difference how the CU is subdivided into TUs, and therefore the RQT structure is not signaled. This flag is especially useful for coding video sequences at low bit rates, because one single flag can very efficiently set a potentially large number of transform coefficients to zero. If this flag is equal to zero, the RQT structure is signaled as described above.

Furthermore, so-called coded block flags are transmitted for luma and the two chroma components in order to indicate, for each TU, whether there are non-zero transform coefficients. In inter coding, the coded block flags for chroma are coded interleaved with the split transform flags of the RQT signaling.
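The inference of redundant split flags described at the beginning of this section can be sketched as a pair of hypothetical writer and reader routines. The parameter names, the default limits (32 × 32 maximum, 4 × 4 minimum, depth 2) and the assumption that a forced split of a 64 × 64 CU counts towards the depth limit are ours; the HM syntax may define these interactions differently. Coded block flags and the CU-level all-zero flag are left out.

```python
def flag_is_coded(size, depth, max_tu, min_tu, max_depth):
    """Return (coded, inferred_value); inferred_value is None when the flag is coded."""
    if size > max_tu:
        return False, 1   # larger than the largest transform: split is mandatory
    if size <= min_tu or depth >= max_depth:
        return False, 0   # cannot be split any further
    return True, None


def write_rqt(tree, size, depth, bits, max_tu=32, min_tu=4, max_depth=2):
    """Append only the non-inferable split flags of `tree` to the list `bits`."""
    coded, inferred = flag_is_coded(size, depth, max_tu, min_tu, max_depth)
    flag = 0 if tree is None else 1
    if coded:
        bits.append(flag)
    elif flag != inferred:
        raise ValueError("tree violates the size or depth constraints")
    if flag:
        for child in tree:
            write_rqt(child, size // 2, depth + 1, bits, max_tu, min_tu, max_depth)


def read_rqt(bits, size, depth=0, max_tu=32, min_tu=4, max_depth=2):
    """Rebuild the tree from an iterator over the coded flags."""
    coded, inferred = flag_is_coded(size, depth, max_tu, min_tu, max_depth)
    flag = next(bits) if coded else inferred
    if not flag:
        return None
    return [read_rqt(bits, size // 2, depth + 1, max_tu, min_tu, max_depth)
            for _ in range(4)]


# Hypothetical 64 x 64 CU: the root split is inferred, one 32 x 32 block is split.
tree = [None, [None, None, None, None], None, None]
coded_bits = []
write_rqt(tree, 64, 0, coded_bits)
decoded = read_rqt(iter(coded_bits), 64)
print(coded_bits, decoded == tree)
```

Under these assumptions the nine-node tree needs only four coded flags: the root split is inferred from the maximum TU size, and the four 16 × 16 leaves are inferred from the depth limit.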
Interleaving the chroma coded block flags with the split flags allows for efficient signaling of zero valued chroma coefficients in regions with non-zero luma coefficients. For luma, as well as in intra coding, the coded block flags are signaled for each transform block after transmission of the RQT structure. This signaling scheme has been chosen because transform blocks which have non-zero coefficients only for chroma, but not for luma, are typically very rare when using inter coding, whereas the opposite case (i.e., only luma non-zero coefficients) is not so rare. Thus, having the possibility to set a large area of chroma coefficients equal to zero at an intermediate RQT node can be advantageous.

5. EXPERIMENTAL RESULTS

The effectiveness of the transform coding is demonstrated for 9 sequences currently used in the development of the HEVC Test Model. The sequences BQMall, BasketballDrill, PartyScene and RaceHorses have a spatial resolution of 832 × 480. BQTerrace, BasketballDrive, Cactus, ParkScene and Kimono have a resolution of 1920 × 1080. The version of the Test Model software used for our simulations is TMuC 0.9. The encoder configurations are based on the random access, high efficiency configuration specified in [11]. In this configuration, inter prediction using at most two hypotheses is employed within groups of eight successive pictures. All results regarding coding efficiency are presented in terms of Bjøntegaard Delta (BD) rate differences [9].

First, the effect of large transform sizes is shown. Table 1 shows results for different maximum transform sizes. The smallest transform size is always 4 × 4, while the maximum size is varied from 16 × 16 to 64 × 64. Note that in later HM versions, the 64 × 64 transform was removed as a result of a complexity vs rate and distortion trade-off. The values represent gains in comparison to the configuration using only 4 × 4 and 8 × 8 transforms, as in H.264/AVC HP. In this experiment, the RQT is constrained to a maximum tree depth of zero, so for a given CU, the largest applicable transform size is always used. In any case, the transform block size is completely inferred, effectively disabling the use of the RQT. No extra bits are signaled for the RQT. The effect of choosing the transform size more flexibly by transmitting extra side information for the RQT is examined in a second experiment.

Sequence                    Max transform size
                            8 × 8    16 × 16    32 × 32    64 × 64
BQMall                       0.00      -2.48      -3.19      -3.28
BasketballDrill              0.00      -5.29      -7.91      -8.64
PartyScene                   0.00      -1.77      -2.10      -2.11
RaceHorses                   0.00      -1.42      -1.67      -1.69
Average (832 × 480)          0.00      -2.74      -3.72      -3.93
BQTerrace                    0.00      -2.62      -3.79      -4.15
BasketballDrive              0.00      -7.68      -9.94     -10.49
Cactus                       0.00      -6.14      -8.42      -8.94
ParkScene                    0.00      -1.69      -2.18      -2.18
Kimono                       0.00      -9.29     -12.13     -12.58
Average (1920 × 1080)        0.00      -5.48      -7.29      -7.67

Table 1. BD-rate differences (in %) using transforms larger than 8 × 8 for the random access, high efficiency configuration. The maximum RQT depth is always set to zero.

The highest gains, with more than 10 % BD-rate savings, are observed for the sequences BasketballDrive and Kimono. These sequences contain fast camera and scene movement. Presumably, this provokes higher residual energy such that the effect of larger transforms becomes more evident.

In the following, the effects of different maximum RQT depths are shown. In the experiment, the same values are used for the maximum depths in inter and intra coding. All other encoder parameters were set as in the random access, high efficiency configuration, which allows transform sizes from 4 × 4 to 32 × 32. With a maximum RQT depth of zero, no side information is used to signal the TU size. This zero depth configuration serves as the reference for the BD-rate differences shown in Table 2. With greater maximum depths, and thus more flexibility in choosing the transform size, overall averaged BD-rate savings of about 2 % are observed. In the rate-distortion (RD) sense, the extra rate spent for the RQT is compensated for by an increased reconstruction fidelity.
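All tables in this section report BD-rate differences [9]. As a rough illustration of how such a figure is commonly obtained, the sketch below interpolates the logarithm of the bit rate as a cubic polynomial of PSNR through four rate points per codec and averages the difference over the common PSNR range. This is our reading of the usual procedure, not code from [9] or from the HM software, and the rate points are made-up numbers chosen so that the result is easy to check.

```python
import math


def _cubic_through(points):
    """Return f(x) interpolating four (x, y) points (Lagrange form)."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f


def _simpson(f, a, b):
    """Simpson's rule on [a, b]; exact for polynomials up to degree three."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))


def bd_rate(ref, test):
    """Average bit-rate difference in percent at equal PSNR.

    ref and test each hold four (bitrate, psnr) pairs; negative results
    mean that `test` needs less rate than `ref`.
    """
    f_ref = _cubic_through([(p, math.log10(r)) for r, p in ref])
    f_test = _cubic_through([(p, math.log10(r)) for r, p in test])
    lo = max(min(p for _, p in ref), min(p for _, p in test))
    hi = min(max(p for _, p in ref), max(p for _, p in test))
    avg_diff = (_simpson(f_test, lo, hi) - _simpson(f_ref, lo, hi)) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0


# Made-up rate points (kbit/s, dB): the test codec saves 10 % rate everywhere.
reference = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
candidate = [(0.9 * rate, psnr) for rate, psnr in reference]
print(round(bd_rate(reference, candidate), 2))
```

Because the candidate curve is the reference curve shifted by a constant factor of 0.9 in rate, the sketch returns a BD-rate of -10 %, which matches the sign convention of the tables (negative values are savings).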


Sequence                    Max RQT depth
                            0        1        2        3
BQMall                      0.00    -1.82    -2.22    -2.24
BasketballDrill             0.00    -1.40    -1.71    -1.61
PartyScene                  0.00    -2.79    -2.99    -3.12
RaceHorses                  0.00    -2.26    -2.68    -2.78
Average (832 × 480)         0.00    -2.07    -2.40    -2.44
BQTerrace                   0.00    -2.21    -2.92    -3.26
BasketballDrive             0.00    -1.22    -1.63    -1.69
Cactus                      0.00    -1.13    -1.46    -1.45
ParkScene                   0.00    -2.08    -2.51    -2.71
Kimono                      0.00    -0.70    -1.23    -1.14
Average (1920 × 1080)       0.00    -1.47    -1.95    -2.05

Table 2. BD-rate gains (in %) using different maximum RQT depths in comparison to an RQT depth of zero for the random access, high efficiency configuration. Maximum transform size is 32 × 32.

In addition to this RD trade-off between RQT side information and reconstruction fidelity, computational complexity is also an important issue in designing video coding algorithms. With the employed RD-optimal tree decision algorithm, the number of encoder decisions grows exponentially with the allowed maximum tree depth. As a measure of complexity, Table 3 shows encoding durations for different maximum depths. As in Table 2, the reference is a maximum RQT depth of zero. As expected, encoding durations increase considerably with greater maximum tree depths. But as can be seen from Table 2, allowing tree depths greater than 2 yields very little further RD-performance gain, if any. Given this increase in complexity and the diminishing returns in rate-distortion performance at greater depths, a reasonable complexity versus RD-performance trade-off should limit the maximum depth. In the current HM, a maximum RQT depth of 2 is used as a fair trade-off.

Sequence                    Max RQT depth
                            0        1        2        3
BQMall                      1.00     1.15     1.32     1.47
BasketballDrill             1.00     1.15     1.31     1.45
PartyScene                  1.00     1.21     1.42     1.59
RaceHorses                  1.00     1.16     1.32     1.44
Average (832 × 480)         1.00     1.17     1.34     1.49
BQTerrace                   1.00     1.19     1.38     1.52
BasketballDrive             1.00     1.13     1.29     1.38
Cactus                      1.00     1.15     1.32     1.43
ParkScene                   1.00     1.17     1.34     1.48
Kimono                      1.00     1.13     1.28     1.40
Average (1920 × 1080)       1.00     1.15     1.32     1.44

Table 3. Normalized encoding durations for different RQT depths.

Note that in our experiments, all possible transform block sizes are tested in the RD optimization process at the encoder. The encoding time increase for larger RQT depths could therefore be reduced, at the cost of a slightly deteriorated RD performance, if an appropriate early-termination strategy is employed (e.g., as proposed in [12]).

6. CONCLUSION

Compared to previous video coding standards, the current HM uses larger transforms for residual coding, with block sizes up to 32 × 32. The use of larger transforms in the current HM results in significant gains in coding efficiency compared to the use of 4 × 4 and 8 × 8 transforms only. A residual quadtree (RQT) is used in order to further exploit the flexibility given by this larger range of applicable transform sizes. It is shown that a significant gain is achieved by spending extra rate to signal the transform size by means of the RQT. However, considering the trade-off between complexity and BD-rate savings, it is reasonable to reduce the freedom in choosing among possible tree structures. The current HM accounts for this by restricting the tree depth of the RQT to a maximum of two levels.

7. REFERENCES

[1] ITU-T VCEG and ISO/IEC MPEG, “Joint Call for Proposals on Video Compression Technology,” VCEG-AM91 and MPEG N11113, Jan. 2010.

[2] JCT-VC, “Test Model under Consideration,” JCTVC-A205, Apr. 2010.

[3] JCT-VC, “WD2: Working Draft 2 of High-Efficiency Video Coding,” JCTVC-D503, Jan. 2011.

[4] C.-T. Chen, “Adaptive transform coding via quadtree-based variable blocksize DCT,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 1989, vol. 3, pp. 1854–1857.

[5] D. Marpe et al., “Video compression using quadtrees, leaf merging and novel techniques for motion representation and entropy coding,” IEEE Trans. on Circ. and Sys. for Video Technology, vol. 20, no. 12, pp. 1676–1687, Dec. 2010.

[6] K. McCann et al., “Video coding technology proposal by Samsung (and BBC),” JCTVC-A124, Apr. 2010.

[7] W.-H. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans. Commun., vol. 25, no. 9, pp. 1004–1009, Sept. 1977.

[8] T. Lee, J. Chen, and W.-J. Han, “TE 12.1: Transform unit quadtree/2-level test,” JCTVC-C200, Oct. 2010.

[9] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, Apr. 2001.

[10] T. Nguyen, H. Schwarz, H. Kirchhoffer, D. Marpe, and T. Wiegand, “Improved Context Modeling for Coding Quantized Transform Coefficients in Video Compression,” in Proc. Picture Coding Symposium (PCS), Dec. 2010.

[11] JCT-VC, “Common test conditions and software reference configurations,” JCTVC-C500, Oct. 2010.

[12] M. Siekmann, H. Schwarz, B. Bross, D. Marpe, and T. Wiegand, “Fast encoder control for RQT,” JCTVC-E425, Mar. 2011.
