Four-step Algorithm for Early Termination in HEVC Inter-frame Prediction based on Decision Trees

Guilherme Correa1, Pedro Assuncao2, Luciano Agostini3, Luis A. da Silva Cruz1

1 Instituto de Telecomunicações – University of Coimbra – Coimbra, Portugal ([email protected], [email protected])
2 Instituto de Telecomunicações – Polytechnic Institute of Leiria – Leiria, Portugal ([email protected])
3 GACI – CDTEC – Federal University of Pelotas – Pelotas, Brazil ([email protected])
Abstract—The flexible encoding structures of High Efficiency Video Coding (HEVC) are chiefly responsible for the standard's improved compression efficiency over its predecessors. However, this flexibility comes with high computational complexity, since more options are evaluated in the Rate-Distortion (R-D) optimization scheme. In this paper, we propose a four-step early-termination method, which decides whether the inter mode decision should be halted without testing all possibilities. The method employs a set of decision trees, which are trained offline once, using information from unconstrained HEVC encoding runs. The resulting trees present a mode decision accuracy ranging from 97.6% to 99.4% with negligible computational overhead. The method achieves an average computational complexity decrease of 49% at the cost of a very small Bjontegaard Delta (BD)-rate increase (0.58%).

Index Terms—fast mode decision; inter prediction; block partitioning; decision trees; HEVC.
I. INTRODUCTION

The High Efficiency Video Coding (HEVC) standard provides increased compression efficiency in comparison to its predecessors by employing new tools and more flexible encoding structures, such as the tree-based Coding Blocks (CB), Prediction Blocks (PB) and Transform Blocks (TB). To achieve the best Rate-Distortion (R-D) performance, the HEVC reference software (HM) [1] employs an exhaustive process, which tests every possible combination of encoding structures and chooses the one with the lowest R-D cost. Even though it yields optimal encoding efficiency, this process is the main cause of the large computational complexity increase of the HEVC encoder, which ranges from 9% to 502% in comparison to H.264/AVC, depending on which configuration is used [2].

In inter-frame prediction, the encoder initially tests the Merge/SKIP Mode (MSM) for the CB, which is conceptually similar to the SKIP mode of H.264/AVC. MSM allows deriving motion information from spatially or temporally neighboring PBs, forming a merged region sharing the same motion information. The SKIP mode is treated as a special case of MSM. After testing MSM, the remaining partitioning modes are tested in the order presented in Fig. 1, except in 8×8 CBs, which allow only the first four partitioning modes (MSM, 2N×2N, 2N×N and N×2N).

This work was supported by IT-Portugal, CNPq-Brazil, and the project FCTCAPES (4.4.1.00 CAPES) (FCT/1909/27/2/2014/S) by FCT-Portugal and CAPES-Brazil.
978-1-4799-6139-9/14/$31.00 ©2014 IEEE
Fig. 1. Inter prediction modes available in HEVC.
For each mode following MSM, the encoder tests all candidate motion vectors and chooses the best option in terms of R-D cost for each PB in a CB. However, as only the best motion vectors and the best partitioning modes are kept, most of the computation performed in this Rate-Distortion Optimization (RDO) process is discarded, which adds up to a large amount of wasted encoding time. Consequently, fast mode decision (FMD) methods are necessary for efficient, low-complexity implementations of HEVC.

Several works have already been published on reducing or dynamically controlling the computational complexity of HEVC, focusing especially on low-complexity encoding structure definition and FMD approaches. In [3], a low-complexity R-D cost calculation decides whether or not the CU splitting process must be performed. However, the method is only applicable to intra-predicted CUs, which are a minority when the encoder configuration uses inter-frame prediction. The methods proposed in [4, 5] use information obtained from intermediate encoding steps, such as R-D costs and residue variance, to determine if a CU is split into smaller CUs. Other approaches seek to reduce the computational complexity of choosing the best PU mode for a CU [6, 7]. In [6], a complexity control scheme adjusts the number of evaluated PU modes for inter-predicted CUs according to a target complexity, while in [7] small PUs are combined into larger PUs depending on the image characteristics, skipping the evaluation of certain modes. Even though all these works reduce computational complexity to a certain extent, they come with associated losses in R-D efficiency, which are mostly non-negligible.
In order to reduce R-D efficiency losses, some authors have proposed intelligent approaches which apply machine learning techniques to benefit from intermediate encoding results and image characteristics, especially for transcoding [8] and encoding efficiency optimization [9]. However, no work has yet been proposed which uses this type of technique in the HEVC encoding process for computational complexity reduction. In this paper, we present a novel approach which uses data mining as a tool to build a set of decision trees that allow early-terminating the inter prediction process in order to avoid testing all modes available in HEVC.

The rest of this paper is organized as follows. Motivations and statistical analysis are presented in section II. The approach proposed in this work is detailed in section III. Experimental results are presented and discussed in section IV. Finally, conclusions are drawn in section V.
II. MOTIVATION AND STATISTICAL ANALYSIS

Even though the HM encoder tests every encoding structure possibility in the RDO process, the inter prediction modes are not equally likely and some of them occur only sporadically. Fig. 2 presents the average occurrence probability of each inter mode in the HEVC encoder for 10 video sequences of different temporal and spatial resolution, motion activity and texture (BlowingBubbles, RaceHorses, PartyScene, BQMall, SlideShow, vidyo1, ParkScene, BasketballDrive, NebutaFestival, Traffic). All sequences were encoded with Quantization Parameter (QP) 22, 27, 32, and 37 and using the Random Access (RA) encoder configuration, following the common test conditions of JCT-VC [10].

In Fig. 2, CBs encoded as Merge and SKIP are all grouped under the MSM label. Clearly, most CBs are encoded using such modes, especially in small CB sizes. For example, almost 95% and 82% of 8×8 and 16×16 CBs are encoded as MSM, respectively. However, even though the remaining modes are rarely used, they are still tested for every CB, which is not ideal when trading off encoding efficiency and computational complexity. Indiscriminately removing the possibility of using modes other than MSM for all CBs is not a viable solution, since it would cause large drops in the encoder R-D efficiency. If there were a way of predicting whether a CB should really be encoded in a mode different from MSM, the cost of testing the remaining seven modes could be avoided in 90% of the CBs, saving a great deal of computational complexity. Similarly, predicting whether a CB should really be encoded using PBs smaller than 2N×2N could avoid the cost of testing the remaining six modes.

In order to find features which could lead to good predictions for a low-complexity inter mode decision, we have collected a large amount of data from both the original video sequences and internal encoding variables. More than 50 attributes have been collected per CB.
Our analysis showed that the MSM R-D cost, the 2N×2N R-D cost, the ratio between MSM and 2N×2N R-D costs, the ratio between N×2N and 2N×N R-D costs and the splitting decision in the upper CB depth presented the highest correlation with the correct mode decision. Fig. 3 presents statistical results for such attributes in the case of inter-coded 16×16 CBs in the BasketballDrive sequence coded with QP 32. Even though Fig. 3 refers to one specific case, results for all video sequences and the remaining CB sizes (8×8, 32×32 and 64×64) presented similar characteristics.

Fig. 3(a) shows that the MSM R-D cost may be a good indicator of the necessity of testing the remaining seven modes for a CB. In the given example, CBs are rarely encoded with another mode if the MSM R-D cost is below 2,000. Similarly, CBs are rarely encoded with MSM when the MSM R-D cost is beyond 20,000. Fig. 3(b) shows that the ratio between the MSM and the 2N×2N mode R-D costs is also a strong indicator of the necessity of testing the remaining six modes. Almost all CBs with a ratio under 0.6 were encoded as MSM in the given example, while CBs with a ratio beyond 1.0 are rarely encoded with MSM.

Fig. 3(c) and Fig. 3(d) show that the R-D costs of the 2N×2N and MSM modes are generally lower in those CBs that do not split into smaller PBs than in CBs which split. Fig. 3(e) shows that in 76.72% of the cases when a 16×16 CB was split into smaller PBs, its upper CB was also split. Analogously, in 86.97% of the cases when a 16×16 CB was not split into smaller PBs, its corresponding upper depth CB was also not split. Finally, Fig. 3(f) shows that the ratio between the N×2N and 2N×N mode R-D costs can also give some indication of which asymmetric partitions should be tested (i.e., whether vertical or horizontal).
Fig. 2. Frequency of occurrence of each inter prediction mode for different CB sizes (RA configuration).
Fig. 3. Occurrence of CBs encoded using MSM according to (a) MSM R-D cost and (b) ratio between MSM and 2N×2N mode R-D costs; occurrence of CBs split into smaller PBs according to (c) 2N×2N mode R-D cost, (d) MSM R-D cost, and (e) splitting decision in the upper depth; (f) occurrence of vertical and horizontal asymmetric PBs according to the ratio between N×2N and 2N×N R-D costs.
The statistical analysis provided useful guidance for the mode decision approach proposed in this paper. It is clear that intermediate R-D costs from the MSM, 2N×2N, 2N×N and N×2N modes and the partitioning information from the upper depth CB can be used in the development of early-termination schemes for a low-complexity inter mode decision, as shown in the next section.
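As an illustration of how the attributes singled out above might be gathered per CB, the following Python sketch computes them from the four intermediate R-D costs and serializes them in ARFF, the input format of the WEKA tool used for training in section III. The container `CBCosts`, the attribute names, the relation name and the label column `decision` are hypothetical and do not come from HM or from the paper; R-D costs are assumed to be strictly positive.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

Value = Union[float, bool, None]


@dataclass
class CBCosts:
    """R-D costs gathered for one CB (hypothetical container, not an HM type)."""
    rd_msm: float
    rd_2nx2n: float
    rd_2nxn: float
    rd_nx2n: float
    upper_split: Optional[bool]  # None for 64x64 CBs, which have no upper depth


def extract_features(cb: CBCosts) -> Dict[str, Value]:
    """Compute the five attributes singled out by the statistical analysis."""
    return {
        "rd_msm": cb.rd_msm,
        "rd_2nx2n": cb.rd_2nx2n,
        "ratio_msm_2nx2n": cb.rd_msm / cb.rd_2nx2n,
        "ratio_nx2n_2nxn": cb.rd_nx2n / cb.rd_2nxn,
        "upper_split": cb.upper_split,
    }


def arff_value(value: Value) -> str:
    """Format one value for an ARFF data row ('?' marks a missing value)."""
    if value is None:
        return "?"
    if isinstance(value, bool):
        return "1" if value else "0"
    return repr(float(value))


def to_arff(relation: str, rows: List[Tuple[Dict[str, Value], str]]) -> str:
    """Serialize (features, label) pairs; the label is T (terminate) or C (continue)."""
    names = list(rows[0][0])
    lines = [f"@relation {relation}"]
    for name in names:
        kind = "{0,1}" if name == "upper_split" else "numeric"
        lines.append(f"@attribute {name} {kind}")
    lines.append("@attribute decision {T,C}")
    lines.append("@data")
    for features, label in rows:
        lines.append(",".join([arff_value(features[n]) for n in names] + [label]))
    return "\n".join(lines) + "\n"


# Made-up costs for one 16x16 CB, labelled as an early-termination case.
example = extract_features(CBCosts(1500.0, 2500.0, 3000.0, 2400.0, upper_split=False))
arff_text = to_arff("msm_early_termination_16x16", [(example, "T")])
```

Each data row carries one CB, so a training file for a given CB size and early-termination step would simply contain one such row per CB observed in the unconstrained encoding runs.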
III. FOUR-STEP EARLY-TERMINATION SCHEME

Predictive data mining aims at determining the value of a dependent variable by looking at the values of some attributes. Several predictive data mining methods are currently available, which vary broadly in terms of efficiency, complexity and applicability. Decision trees [11], in which a dependent variable can assume one among a finite number of values, are commonly used due to their low-complexity implementation.
We used the Waikato Environment for Knowledge Analysis (WEKA), version 3.6 [12], to mine the attributes and to assist the development of the early-termination conditions for the inter mode decision. The HM software, version 12.0 (HM12) [1], was set to encode the 10 video sequences mentioned in section II under the same test conditions previously described. WEKA uses ARFF (Attribute-Relation File Format) files as input, which are text files describing a list of instances sharing a set of attributes. For each early-termination step proposed, four decision trees (one for each CB size) were grown from ARFF files containing all the attributes selected for the training phase, except in the case of 64×64 CBs, whose files do not include the splitting decision at the upper CB depth. The J48 method, which is an implementation of the C4.5 algorithm [11], was used to create the decision trees.

Fig. 4 shows the RDO-based inter mode decision algorithm with the proposed four-step early-termination scheme implemented. White blocks in the diagram correspond to steps present in the original RDO-based decision, while grey blocks are the steps added for early termination. Each of the four tests in Fig. 4 corresponds to a different set of decision trees, which are presented in Fig. 5(a)-(d). Due to lack of space, only one tree per CB size is shown in Fig. 5(a)-(d), but the remaining decision trees present very similar structures.

Initially, the MSM mode is always tested for each CB. After that, the first set of decision trees detects whether or not the remaining modes should be tested (Test 1, in Fig. 4). The dependent variable for designing these decision trees is thus the information of whether or not MSM is used for a given CB. As only the MSM R-D cost is available at this point, the decision tree is composed of only one simple node and two leaves, as shown in Fig. 5(a) for the case of 8×8 CBs. In Fig. 5(a), RD_MSM stands for the MSM R-D cost, T represents the decision of choosing MSM and early-terminating the RDO process and C is the decision of continuing the tests.

After testing the 2N×2N mode, another set of trees decides if an early termination should occur (Test 2, in Fig. 4). In this case, the dependent variable is the information of whether or not a CB should be split into smaller inter PBs. Fig. 5(b) shows the 2N×2N early-termination decision tree for 16×16 CBs. RD_2N×2N corresponds to the 2N×2N mode R-D cost and Usplit corresponds to the splitting decision performed at the upper CB depth. The ratio attribute corresponds in this case to the ratio between the MSM and 2N×2N R-D costs. The N function is a normalization computed by dividing the attribute value by its average among all CBs in the previous frame.

If the process is not terminated in the previous steps, the 2N×N mode is tested and another early-termination step takes place (Test 3, in Fig. 4). If the encoder decides for early termination in this step, it tests the horizontal asymmetric partitions (2N×nU and 2N×nD) and only then terminates the process. Fig. 5(c) shows the 2N×N early-termination tree for 32×32 CBs, where ratio is the value obtained when dividing the 2N×N R-D cost by the lowest R-D cost among the previously tested modes.

Finally, the N×2N mode is tested and another set of trees decides if the encoder should test the horizontal, the vertical (nL×2N and nR×2N) or all the asymmetric partitions (Test 4, in Fig. 4). Fig. 5(d) shows the N×2N early-termination tree for 64×64 CBs, where ratio is the value obtained when dividing the 2N×N R-D cost by the N×2N R-D cost, T1 is the decision of testing only vertical partitions, T2 is the decision of testing only horizontal partitions and T3 is the decision of testing all partitions.
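The control flow of the four tests described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the HM implementation: `rd_cost` stands in for the encoder's R-D evaluation of a mode, and `test1` to `test4` are callables standing in for the trained decision trees, returning the labels T/C (Tests 1 to 3) or T1/T2/T3 (Test 4). The function names, the mode strings and the thresholds in the demo are hypothetical; the real thresholds are those of the trees in Fig. 5. The restriction of 8×8 CBs to the first four modes is left out for brevity.

```python
from typing import Callable, Dict, List, Tuple

HORIZONTAL_AMP = ["2NxnU", "2NxnD"]  # horizontal asymmetric partitions
VERTICAL_AMP = ["nLx2N", "nRx2N"]    # vertical asymmetric partitions

Costs = Dict[str, float]


def best_mode(costs: Costs) -> str:
    """Return the tested mode with the lowest R-D cost."""
    return min(costs, key=lambda mode: costs[mode])


def inter_mode_decision(
    rd_cost: Callable[[str], float],
    test1: Callable[[Costs], str],
    test2: Callable[[Costs], str],
    test3: Callable[[Costs], str],
    test4: Callable[[Costs], str],
) -> Tuple[str, List[str]]:
    """Run the four-step flow; return (chosen mode, modes actually tested)."""
    costs: Costs = {"MSM": rd_cost("MSM")}   # MSM is always tested first
    if test1(costs) == "T":                  # Test 1: stop after MSM
        return best_mode(costs), list(costs)

    costs["2Nx2N"] = rd_cost("2Nx2N")
    if test2(costs) == "T":                  # Test 2: stop after 2Nx2N
        return best_mode(costs), list(costs)

    costs["2NxN"] = rd_cost("2NxN")
    if test3(costs) == "T":                  # Test 3: horizontal AMP, then stop
        for mode in HORIZONTAL_AMP:
            costs[mode] = rd_cost(mode)
        return best_mode(costs), list(costs)

    costs["Nx2N"] = rd_cost("Nx2N")
    selection = {                            # Test 4: which AMP modes to test
        "T1": VERTICAL_AMP,
        "T2": HORIZONTAL_AMP,
        "T3": HORIZONTAL_AMP + VERTICAL_AMP,
    }[test4(costs)]
    for mode in selection:
        costs[mode] = rd_cost(mode)
    return best_mode(costs), list(costs)


# Demo with made-up R-D costs and illustrative stand-in tests.
demo_costs = {"MSM": 30000.0, "2Nx2N": 20000.0, "2NxN": 18000.0, "Nx2N": 25000.0,
              "2NxnU": 17000.0, "2NxnD": 19000.0, "nLx2N": 26000.0, "nRx2N": 27000.0}
mode, tested = inter_mode_decision(
    demo_costs.__getitem__,
    test1=lambda c: "T" if c["MSM"] < 2000 else "C",
    test2=lambda c: "T" if c["MSM"] / c["2Nx2N"] < 0.6 else "C",
    test3=lambda c: "T" if c["2NxN"] / min(c["MSM"], c["2Nx2N"]) < 1.0 else "C",
    test4=lambda c: "T3",
)
print(mode, tested)
```

In the demo, Tests 1 and 2 return C and Test 3 returns T, so only the two horizontal asymmetric partitions are evaluated after 2N×N and the N×2N and vertical modes are never tested, which is where the complexity saving comes from.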
Fig. 4. Four-step early-termination algorithm.

Fig. 5. (a) Test 1: MSM early-termination for 8×8 CBs; (b) Test 2: 2N×2N early-termination for 16×16 CBs; (c) Test 3: 2N×N early-termination for 32×32 CBs; (d) Test 4: N×2N early-termination for 64×64 CBs.

IV. EXPERIMENTAL RESULTS

In order to evaluate the performance of the proposed scheme, three encoder versions were used: the original HM12, a modified version of HM12 in which only the MSM and the 2N×2N inter modes were enabled, and the low-complexity version of HM12 proposed in this work, with the implemented decision trees. These versions are from now on referred to as original, simple and proposed, respectively. Notice that the HM encoder already includes some complexity reduction techniques, all of which were enabled in the three versions compared, so that the proposed method provides additional complexity reductions to the encoding process.

Following the common test conditions of JCT-VC [10], all tests were performed using QPs 22, 27, 32, and 37, the RA configuration, and 10 video sequences (PeopleOnStreet, SteamLocomotiveTrain, Kimono1, Cactus, BQTerrace, BasketballDrill, BQSquare, BasketballPass, ChinaSpeed, SlideEditing), none of which was used when training the decision trees in order to guarantee the validity of the tests.

Initially, performance was evaluated by counting the number of accurate decisions with reference to the original encoder. The decision accuracy was evaluated for each tree separately and the average results are presented in Table I. In 98.5% of the cases in which the original encoder chose MSM after testing every possible mode, the proposed method also chose MSM and then terminated the mode decision. Analogously, in 99% of the cases in which the original encoder chose either MSM or 2N×2N after testing all modes, the proposed method also decided for one of the two modes and then halted due to the 2N×2N early termination. The accuracies of the 2N×N and the N×2N early-termination steps were 97.6% and 99.4%, respectively.

TABLE I
AVERAGE ACCURACY OF EACH EARLY-TERMINATION STEP

Early-termination    Correct
MSM                  98.5%
2N×2N                99%
2N×N                 97.6%
N×2N                 99.4%
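One possible reading of the accuracy figures above is a conditional hit rate: among the CBs for which the original encoder ended up choosing one of the modes a step can terminate on, the fraction for which the proposed encoder actually terminated at that step. The Python sketch below follows that reading. The function name, the record layout and the sample data are hypothetical and are not taken from the paper's evaluation scripts.

```python
from typing import Iterable, Set, Tuple


def step_accuracy(records: Iterable[Tuple[str, bool]], reference_modes: Set[str]) -> float:
    """Fraction of CBs whose reference mode is in `reference_modes`
    for which the early-termination step also terminated.

    Each record is (mode chosen by the original encoder, whether the
    proposed encoder terminated at the step under evaluation).
    """
    hits = 0
    total = 0
    for reference_mode, terminated in records:
        if reference_mode in reference_modes:
            total += 1
            hits += int(terminated)
    if total == 0:
        raise ValueError("no CB with a reference mode in the given set")
    return hits / total


# Made-up records for a handful of CBs, for illustration only.
sample = [("MSM", True), ("MSM", True), ("MSM", True), ("MSM", False), ("2Nx2N", False)]
msm_accuracy = step_accuracy(sample, {"MSM"})
print(f"MSM early-termination accuracy: {msm_accuracy:.1%}")
```

For the 2N×2N step, the same function would be called with the reference set {"MSM", "2Nx2N"} and with termination flags recorded at Test 2.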
The R-D efficiency of the proposed method was evaluated by comparing the bit rate and PSNR differences between the original encoder and both the simple and proposed versions. R-D efficiency results are presented in Fig. 6 and in Table II. Fig. 6 presents results for two video sequences, SteamLocomotiveTrain and BQTerrace, which presented the worst and best R-D results, respectively. The R-D efficiency achieved with the proposed method is much closer to that of the original encoder than that obtained by the simple encoder. A closer look at a portion of the curves (400% zoom boxes) shows that they almost overlap even for the worst-case video sequence.

Table II presents results for the 10 sequences encoded with the simple and proposed encoder versions. ΔT indicates the percentage of computational complexity savings and BD-BR indicates the Bjontegaard-Delta (BD)-rate percentage increase when each encoder is used in comparison to the original one. An average BD-rate increase of 0.58% and an encoding time reduction of 49% are observed when the proposed encoder is used, whereas the simple encoder yields a 2.94% BD-rate increase and a ΔT of 60%. The BD-BR/ΔT ratio was calculated to compare the two encoders in terms of BD-rate increase per unit of computational complexity savings. The average BD-BR/ΔT ratio of the proposed encoder is 4.15 times smaller than the average BD-BR/ΔT ratio of the simple encoder, which means that the proposed method performs a much more efficient complexity reduction.

We have compared our results to previous works on complexity reduction of HEVC encoding that presented BD results using the original HM encoder as reference [3-7]. Although the test conditions are not always the same across the related works, Table III shows that our method achieves similar or larger time savings. Besides, it excels in terms of R-D efficiency in all cases, always incurring a smaller BD-rate increase. Considering the BD-BR/ΔT ratio, the only work that achieves a value close to our method is [3]. However, as previously mentioned, [3] is only applicable to intra-predicted CUs, which are a minority in inter-predicted images.
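The BD-BR/ΔT figures can be reproduced from the averages quoted above. In the Python sketch below, the scaling by 100 and the use of the magnitude of ΔT are inferred from the reported values (for example, 0.58 and 49 giving 1.18) rather than stated in the text, and the function name is hypothetical.

```python
def bdbr_per_dt(bd_br_percent: float, delta_t_percent: float) -> float:
    """BD-rate increase per unit of complexity saving.

    Assumption: the ratio is 100 * BD-BR / |dT|, which matches the
    averages reported for both encoders.
    """
    return 100.0 * bd_br_percent / abs(delta_t_percent)


simple_ratio = round(bdbr_per_dt(2.94, -60), 2)    # simple encoder averages
proposed_ratio = round(bdbr_per_dt(0.58, -49), 2)  # proposed encoder averages
improvement = round(simple_ratio / proposed_ratio, 2)
print(simple_ratio, proposed_ratio, improvement)
```

With the rounded averages this gives 4.90 for the simple encoder and 1.18 for the proposed one, whose quotient is the factor of 4.15 mentioned in the text.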
Fig. 6. Rate-Distortion efficiency for the (a) SteamLocomotiveTrain and (b) BQTerrace video sequences.

TABLE II
COMPLEXITY REDUCTION AND R-D EFFICIENCY

                        Simple                            Proposed
Video*                  ΔT (%)  BD-BR (%)  BD-BR/ΔT       ΔT (%)  BD-BR (%)  BD-BR/ΔT
BQSquare                -60     +2.7       4.55           -46     +0.3       0.60
BQTerrace               -61     +1.0       1.67           -53     +0.0       0.01
BasketballDrill         -60     +2.4       4.05           -44     +0.5       1.09
BasketballPass          -60     +3.9       6.43           -42     +0.6       1.36
Cactus                  -60     +2.4       4.01           -48     +0.4       0.86
ChinaSpeed              -59     +5.5       9.32           -46     +0.9       2.07
Kimono1                 -61     +1.9       3.06           -50     +0.7       1.39
PeopleOnStreet          -61     +3.6       5.93           -37     +1.0       2.76
SlideEditing            -61     +2.0       3.34           -66     +0.2       0.37
SteamLocomotiveTrain    -61     +4.0       6.65           -56     +1.2       2.20
Average                 -60     +2.94      4.90           -49     +0.58      1.18

* Sequences not used in the training of the decision trees.

TABLE III
COMPARISON WITH RELATED WORKS

Work              ΔT (%)  BD-BR (%)  BD-BR/ΔT
Seunghyun [3]     -50     +0.6       1.20
Jong-Hyeok [4]    -48     +1.2       2.50
Goswami [5]       -38     +1.7       4.42
Zhao [6]          -50     +5.9       11.8
Khan [7]          -44     +1.3       2.88
Proposed          -49     +0.58      1.18

V. CONCLUSIONS

This paper presented a four-step early-termination scheme for the HEVC inter mode decision, which makes use of data mining tools for the construction of decision trees. A statistical analysis was performed on a set of attributes to determine their usefulness in the training of four sets of decision trees used for deciding whether the inter mode decision process could be early-terminated or not. The four steps allow early-terminating the mode decision process after testing the Merge/SKIP, 2N×2N, 2N×N and N×2N modes, and achieved an average accuracy of 98.5%, 99%, 97.6% and 99.4%, respectively. The use of such trees in the HM encoder allowed a computational complexity reduction ranging from 37% to 66% at the cost of small losses in R-D efficiency (an average increase of 0.58% in terms of BD-rate). It is worth mentioning that the obtained decision trees are very simple and can be easily implemented in a hardware design for the HEVC encoder.
REFERENCES

[1] K. McCann, B. Bross, W.-J. Han, I.-K. Kim, K. Sugimoto, and G. J. Sullivan, "High Efficiency Video Coding (HEVC) Test Model 12 (HM 12) Encoder Description," Document JCTVC-N1002, JCT-VC Meeting, Vienna, Austria, 2013.
[2] G. Correa, P. Assuncao, L. Agostini, and L. A. da Silva Cruz, "Performance and Computational Complexity Assessment of High Efficiency Video Encoders," IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1899-1909, 2012.
[3] C. Seunghyun and K. Munchurl, "Fast CU Splitting and Pruning for Suboptimal CU Partitioning in HEVC Intra Coding," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 9, pp. 1555-1564, 2013.
[4] L. Jong-Hyeok, P. Chan-Seob, and K. Byung-Gyu, "Fast coding algorithm based on adaptive coding depth range selection for HEVC," in IEEE Int. Conf. Cons. Electronics - Berlin, 2012, pp. 31-33.
[5] K. Goswami et al., "Early Coding Unit (CU) Splitting Termination Algorithm for High Efficiency Video Coding (HEVC)," Electronics and Telecommunications Research Institute Journal, to be published.
[6] T. Zhao, Z. Wang, and S. Kwong, "Flexible Mode Selection and Complexity Allocation in High Efficiency Video Coding," IEEE J. of Sel. Topics Signal Process., vol. 7, no. 6, pp. 1135-1144, 2013.
[7] M. U. K. Khan, M. Shafique, and J. Henkel, "An Adaptive Complexity Reduction Scheme with Fast Prediction Unit Decision for HEVC Intra Encoding," in IEEE International Conference on Image Processing (ICIP), 2013, pp. 1578-1582.
[8] G. Fernandez-Escribano et al., "Low-Complexity Heterogeneous Video Transcoding Using Data Mining," IEEE Transactions on Multimedia, vol. 10, pp. 286-299, 2008.
[9] R. Garcia, D. Ruiz Coll, H. Kalva, and G. Fernandez-Escribano, "HEVC Decision Optimization for Low Bandwidth in Video Conferencing Applications in Mobile Environments," in IEEE International Conference on Multimedia and Expo (ICME 2013), San Jose, USA, 2013.
[10] Common test conditions and software reference configurations, JCTVC-J1100, ISO/IEC JTC1/SC29/WG11, Stockholm, Sweden, 2012.
[11] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[12] M. Hall et al., "The WEKA data mining software: an update," ACM SIGKDD Explorations Newsletter, vol. 11, pp. 10-18, 2009.