IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 3, MARCH 2008
Efficient Reference Frame Selector for H.264

Tien-Ying Kuo, Member, IEEE, and Hsin-Ju Lu

Abstract—This paper proposes a simple yet effective mechanism to select proper reference frames for H.264 motion estimation. Unlike traditional video codecs, H.264 permits more than one reference frame for increased precision in motion estimation. However, the search is further complicated by variable block-size motion estimation, which requires significant encoding complexity to identify the best inter-coding. Our selection mechanism identifies suitable reference frames by means of a simple test, and only the selected frames are searched further in the variable block size motion estimation. One major advantage of our mechanism is that it can work with any existing motion search algorithm developed for the traditional single reference frame. Experimental results demonstrate the effectiveness of the proposed algorithm.

Index Terms—Frame selection, H.264, motion estimation, multiple reference frames, variable block size.
I. INTRODUCTION

The H.264 standard is the latest video codec developed by the Joint Video Team (JVT) [1]. It introduces several new coding tools to improve upon the rate-distortion performance of past coding standards. For example, variable block size motion compensation, subpixel motion estimation, and multiple reference frame motion compensation are tools introduced to enhance inter-coding efficiency [2]. Multiple reference frame motion compensation allows the encoder to form a better prediction from several previously coded and stored pictures. Many conditions, such as repetitive motion, uncovered background, noninteger pixel displacement, and lighting change, demonstrate that multiple reference frames generate better predictions than a single-reference system [3]. However, the computational complexity of motion estimation increases linearly with the number of reference frames, and the situation is further complicated by variable block size motion estimation.

In the literature, several methods have been proposed to reduce the complexity of multiple reference frame motion estimation. One approach reduces the search points by exploiting the temporal correlation between the reference frames. For example, Wiegand [4], [5] used the triangle inequality to eliminate impossible search points, while other investigators [3], [6]–[8] have adopted the continuous tracking technique to guess a good initial search point for quick convergence. However, all
Manuscript received April 13, 2006; revised February 13, 2007. This work was supported by the National Science Council of R.O.C. under Grant 95-2221-E-027-030 and Grant 95-2219-E-002-012. This paper was recommended by Associate Editor L. Chen.

T.-Y. Kuo is with the Department of Electrical Engineering, National Taipei University of Technology, Taipei 106, Taiwan, R.O.C. (e-mail: [email protected]).

H.-J. Lu was with the Department of Electrical Engineering, National Taipei University of Technology, Taipei 106, Taiwan, R.O.C. He is now with Advanced Digital Broadcast Inc., Taipei 231, Taiwan, R.O.C. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2008.918111
the methods noted above still require searching all reference frames. Furthermore, continuous tracking can fail easily when occlusion happens, where multiple references are involved. Another general approach is to search the reference frames in order, from the most recent to the most distant, using early-stop criteria. Huang [9] searched either the previous frame or every reference frame based upon the result of the motion estimation on the previous frame. Chang [10] terminated early and excluded certain reference frames by comparing their motion precision. Li [11] examined the variance of the obtained motion vectors for early stops. Zhang [12] exploited the coding status of neighboring blocks as the stop criterion. Note that all the early-stop methods are based on the unimodal error model and can easily fall into local minima, and that some of the sequence-dependent thresholds pose a design problem. Hsu [13] searched only the reference frames referred to by neighboring blocks. However, an incorrect relationship between the neighboring and target blocks may occur along a moving object boundary and may confine the decision to a convergent set of reference frames. Ting [14] proposed a different concept, a 3-D cross-search pattern, to search among the reference frames. Note that most of the methods mentioned above are incompatible with each other, and that it would be difficult to reuse the existing motion search algorithms designed for a single reference frame. Thus, this paper proposes a simple yet effective mechanism to select proper reference frames for H.264 motion estimation that can also work with any existing motion search algorithm. The proposed method selects suitable reference frames according to the initial search results of an 8×8 block; consequently, only the selected, qualified frames are further tested in motion estimation. The rest of this paper is organized as follows.
In Section II, we review some characteristics of H.264. We provide a detailed description of the proposed method in Section III. Finally, experimental results and conclusions are given in Sections IV and V, respectively.

II. H.264 INTER-CODING TOOLS

In H.264, each macroblock in an inter-mode prediction can be divided into block partitions of size 16×16, 16×8, 8×16, or 8×8 pixels, called macroblock partitions, and an 8×8 block can be partitioned further into 8×8, 8×4, 4×8, or 4×4 pixels, called submacroblock partitions. Furthermore, in H.264 multiple reference frame motion estimation, each macroblock partition in the current frame can refer to a different reference frame, and an overhead term, the reference parameter REF, signals which frame is referred to. The reference parameter must be transmitted for each mode, including modes 16×16, 16×8, and 8×16. If a macroblock is coded in mode 8×8, the reference frame parameter is coded only once for each 8×8 subpartition [1]. This means that all subblocks smaller than 8×8 and within
1051-8215/$25.00 © 2008 IEEE Authorized licensed use limited to: National Tsing Hua University. Downloaded on November 4, 2008 at 02:05 from IEEE Xplore. Restrictions apply.
Fig. 1. Flowchart of the proposed method.

Fig. 2. Effect of TH on coding efficiency and the chance of early stop in the first stage for various QPs.
the same submacroblock partition must refer to the same frame. Since motion estimation must be performed for each reference frame and each mode, the Joint Model (JM) [15] selects the best inter-coding by comparing the rate-distortion cost (R-D cost) of each possible partition, evaluated as

J(m, λ_motion) = SA(T)D(s, c(m)) + λ_motion · R(m − p)    (1)

where m = (m_x, m_y)^T denotes the motion vector in the reference frame considered; p denotes the motion vector predicted from the neighbors; λ_motion is the Lagrange multiplier; s denotes the original video signal; and c is the coded video signal. The term R(m − p) represents the rate function of the motion vectors and is computed using a table lookup. The function SA(T)D stands for either SAD (sum of absolute differences) or SATD (sum of absolute transformed differences) and is used as the measure of distortion. In JM, SAD is applied for integer-pixel motion estimation, while SATD is used for subpel motion estimation if the UseHadamard coding option is enabled [16].

III. PROPOSED METHOD

A. Multiple Reference Frame Selection

Motivated by the previous discussion, namely that all subblocks inside the same submacroblock partition must refer to the same reference frame, we designed an efficient reference frame selector that treats the 8×8 block as the minimal unit and makes the selection using a mode 8×8 motion search. The flowchart of the proposed method is illustrated in Fig. 1.
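As an illustration of (1), the following Python sketch computes the R-D cost of a candidate block, using SAD or a Hadamard-based SATD as the distortion term and a signed Exp-Golomb code length as a stand-in for JM's motion-vector rate table. The helper names and the rate model are our assumptions for illustration, not the JM implementation itself.

```python
import numpy as np

# 4x4 Hadamard matrix used for the SATD transform
H4 = np.array([[1,  1,  1,  1],
               [1,  1, -1, -1],
               [1, -1, -1,  1],
               [1, -1,  1, -1]])

def sad(orig, pred):
    """Sum of absolute differences (integer-pixel distortion in JM)."""
    return int(np.abs(orig.astype(int) - pred.astype(int)).sum())

def satd(orig, pred):
    """Sum of absolute transformed differences over 4x4 sub-blocks
    (subpel distortion when UseHadamard is enabled)."""
    diff = orig.astype(int) - pred.astype(int)
    total = 0
    for y in range(0, diff.shape[0], 4):
        for x in range(0, diff.shape[1], 4):
            t = H4 @ diff[y:y+4, x:x+4] @ H4.T   # 2-D Hadamard transform
            total += int(np.abs(t).sum())
    return total

def exp_golomb_bits(v):
    """Bit length of a signed Exp-Golomb code; a stand-in for the
    motion-vector rate table R(m - p)."""
    k = 2 * abs(v) - (1 if v > 0 else 0)   # signed-to-unsigned mapping
    return 2 * int(np.log2(k + 1)) + 1

def rd_cost(orig, pred, mv, pred_mv, lam, use_hadamard=False):
    """J(m, lambda) = SA(T)D(s, c(m)) + lambda * R(m - p), as in (1)."""
    dist = satd(orig, pred) if use_hadamard else sad(orig, pred)
    rate = sum(exp_golomb_bits(a - b) for a, b in zip(mv, pred_mv))
    return dist + lam * rate
```

For example, an 8×8 block with a uniform residual of 1 gives SAD = 64, and with mv = (2, 0) against a zero predictor and λ = 4 the cost is 64 + 4·6 = 88 under this rate model.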
In the first stage, the target 16×16 macroblock is split into four 8×8 blocks, and a motion search is performed on the immediately previous frame (ref-frame t−1), yielding the four motion vectors mv_i^(1) and their corresponding minimal R-D costs J_i^(1) from (1), where i = 0, …, 3 denotes the block index of mode 8×8. Next, we check these motion vectors: if the variances of both the x and y components of mv_i^(1) are not greater than a small threshold TH, the macroblock has a high probability of being static content. In this case, our encoder terminates early by designating only the previous frame as valid, without searching the remaining reference frames. In real applications, the choice of TH depends upon the computational capacity of the encoder: the larger the TH value, the lower the complexity but the worse the coding efficiency. Fig. 2 analyzes this effect of TH for various quantization parameters (QPs), averaged over seven video sequences. In this work, we set TH to zero in all experiments to determine the upper bound of our coding efficiency and the worst case of the complexity requirement of our frame selector. Note that, even with TH set to zero, our speed performance is still satisfactory, as discussed in Section IV.

If, on the other hand, the flow proceeds to the second stage, the motion search of mode 8×8 is run on all of the remaining reference frames to obtain their motion vectors mv_i^(k) and corresponding R-D costs J_i^(k), k = 2, …, K, where K indicates the maximal number of reference frames (ref-frame t−k denotes the kth previous frame). For a given block index i, we let

k_i* = arg min_k J_i^(k),  k ∈ {1, …, K}    (2)

which means that block i obtains its best motion vector, with the lowest cost, by referring to ref-frame t−k_i* rather than any other frame. Hence, it is reasonable for the frame selector to set ref-frame t−k_i* as a valid, qualified reference frame. Once all four blocks, i.e., i = 0 to 3, have completed the frame qualification test, only the frames referred to by at least one of the 8×8 blocks are set as valid, qualified frames. The unqualified frames are dropped, and only the qualified reference frames are examined in the variable block size motion estimation. Note that, in the second stage, ref-frame t−1 is always set as valid.
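The two-stage flow above can be sketched as follows. This is a minimal Python illustration operating on hypothetical precomputed arrays; an actual encoder integration would run the mode 8×8 searches itself rather than take costs as input.

```python
import numpy as np

def select_reference_frames(mvs, costs, th=0.0):
    """Two-stage reference frame selection for one 16x16 macroblock.

    mvs[i][k]  : motion vector (x, y) of 8x8 block i on ref-frame t-(k+1)
    costs[i][k]: corresponding minimal R-D cost from (1)
    Returns the set of qualified reference indices (0 = previous frame).
    """
    # Stage 1: if both the x and y components of the four previous-frame
    # motion vectors have variance <= TH, treat the macroblock as static
    # content and keep only the previous frame.
    prev = np.array([mvs[i][0] for i in range(4)], dtype=float)
    if prev[:, 0].var() <= th and prev[:, 1].var() <= th:
        return {0}
    # Stage 2: each 8x8 block qualifies the frame that gives its lowest
    # R-D cost; the previous frame is always kept as valid.
    qualified = {0}
    for i in range(4):
        qualified.add(int(np.argmin(costs[i])))
    return qualified
```

With TH = 0 (the setting used in the experiments), the early stop fires only when the four previous-frame motion vectors are identical; any disagreement between the blocks routes the macroblock to the second stage.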
TABLE I
HIT RATE OF THE PROPOSED MULTIPLE REFERENCE FRAME SELECTION STRATEGY TO THE FAST FULL SEARCH (UNIT: %)

TABLE II
AVERAGE NUMBER OF REFERENCE FRAMES USED BY EACH MACROBLOCK

Fig. 4. Analysis of hit rate and false alarm using different modes on various sequences.

TABLE III
ENCODER PARAMETERS USED IN EXPERIMENTS
Fig. 3. Reference frame usage of the proposed reference frame selector for various video sequences.
B. Analysis of Hit Rate and Frame Usage

To verify the effectiveness of our frame selector, we measured the hit rate by comparing its retained frames with the frames actually used by the exhaustive search method, which searches all reference frames and modes. The average hit rate in Table I is as high as 88%–95% for all cases. Note that, despite the high hit rate, we expect our frame selector to drop as many frames as possible to speed up the process, provided that the right reference frames are kept. The false alarm rate ranges from 13% to 32%. To give a clearer view of the speed impact caused by the false alarms, we present a more informative analysis in Table II and Fig. 3. The values in Table II indicate the average number of qualified reference frames kept per macroblock after selection. Across sequences of different motion content, the encoder can avoid considerable computational complexity, reducing from 5 (K = 5) references down to 1.29 (i.e., Container) and, in the worst case, 2.13 (i.e., Tempete) reference frames on average. Fig. 3 plots the average reference frame usage of our proposed frame selector in each frame of several video sequences. The figure shows that our frame selector can adaptively choose a low number of reference frames, depending upon the characteristics of the video sequences and frames.

To further support the choice of mode 8×8, rather than other modes, for frame selection, we analyzed the hit rate and the false alarm with different modes in Fig. 4. Fig. 4 shows that a smaller block size yields a higher hit rate. This is expected: as the number of blocks within a macroblock grows, more frames can be selected. However, the hit rate saturates from mode 8×8 onward, a result of the fact that a submacroblock partition must refer to the same reference frame in H.264. On the other hand, the false alarm increases with smaller block modes, which is undesirable because it causes unnecessary complexity in the motion estimation on the extra selected frames. As for the motion estimation time spent on a macroblock in the first stage of Fig. 1, all modes are similar, with factors (using mode 8×8 as the basis) of 0.98, 0.98, 0.99, 1, 1.03, 1.03, and 1.08 from mode 16×16 to mode 4×4, respectively. Thus, judging from the
Fig. 5. Rate-distortion curve comparisons among FFS5, FME5, and two proposed methods FFS5+ERFS, FME5+ERFS. (a) Foreman. (b) Mobile.

TABLE IV
R-D PERFORMANCE OF FME5, LI, AND TWO PROPOSED METHODS FFS5+ERFS, FME5+ERFS (FFS5 IS THE BASIS OF COMPARISON)
hit rate, false alarm, and complexity, mode 8×8 proves to be the best choice for the frame selection task.
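The two measures used throughout this subsection can be formulated as set comparisons against the exhaustive search. This sketch assumes hit rate is the fraction of frames actually used by the exhaustive search that the selector retains, and false alarm is the fraction of retained frames that the exhaustive search never uses; the paper's exact counting over blocks may differ.

```python
def hit_rate(selected, actual):
    """Fraction of the frames actually chosen by the exhaustive
    search that the selector also retained."""
    return len(selected & actual) / len(actual)

def false_alarm(selected, actual):
    """Fraction of the retained frames that the exhaustive search
    never actually used."""
    return len(selected - actual) / len(selected)
```

For instance, if the selector keeps frames {0, 1, 3} while the exhaustive search uses {0, 1}, the hit rate is 1.0 and the false alarm is 1/3: no needed frame is lost, at the cost of one superfluous search.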
IV. EXPERIMENTAL RESULTS

Our experimental environment is based upon the H.264 reference software JM 9.2 [17]. As shown in Table III, we tested seven video sequences of various motion activities, resolutions (QCIF and CIF), frame rates/frame skipping factors, and QPs, as specified in the JVT simulation suggestions [18]. In addition to the QP range specified by JVT (16–28), we further adopted four QPs, 32, 36, 40, and 44, to cover very low bit rate performance. We also activated all seven block modes and encoded the first frame of the video as an I-frame, with the rest as P-frames. All tests were run on an Intel Pentium 4 3.0 GHz with 512 MB RAM under Microsoft Windows XP. The maximal number of multiple reference frames was set at 5 (K = 5) in accordance with Wiegand's analysis [5], a value which yields a significant R-D boost with a reasonable complexity increase and thus has been widely adopted in the literature [3], [6]–[14]. Since our proposed frame selector can work with any existing motion search algorithm, we adopted two different search methods with our frame selector in the test: Fast Full Search (FFS) and Fast Motion Estimation (FME), as implemented in JM 9.2.
First, the coding efficiency of each method was evaluated, including two exhaustive methods on all reference frames, FFS5 and FME5 (where 5 indicates the value of K), the frame selection method of Li [11], and two methods adopting our efficient reference frame selector (ERFS), FFS5+ERFS and FME5+ERFS. For a fair comparison, Li's frame selection method was implemented with FME as the motion search method. The R-D curves for each method are plotted in Fig. 5 for selected test sequences. The figure demonstrates that the R-D performance of our FFS5+ERFS and FME5+ERFS is nearly identical to that of FFS5 and FME5. Since some of the curves are very close to each other, we used BDPSNR and BDBR [19], as recommended by JVT, to measure the performance difference between the methods; these metrics calculate the average PSNR and bit rate distances, respectively, between the R-D curves of two methods. Table IV presents the BDPSNR and BDBR of the three methods using FFS5 as the basis of comparison, because the performance of the full search is theoretically the upper bound. A negative BDPSNR or positive BDBR indicates a coding loss relative to FFS5 and is not preferred. Table IV shows that, on average, our proposed ERFS algorithm degrades BDPSNR by only 0.05 dB relative to FFS5, and likewise by 0.05 dB (i.e., 0.07 dB − 0.02 dB) relative to FME5. Such insignificant degradation, a consequence of the high hit rate observed in Table I, will not cause a noticeable visual difference. As to the bit rate,
TABLE V
EXECUTION TIME SPEED-UP OF FME5, LI, AND TWO PROPOSED METHODS FFS5+ERFS, FME5+ERFS (FFS5 IS THE BASIS OF COMPARISON)

Fig. 6. Number of macroblock references in each reference frame. (a) Foreman. (b) Mobile.

TABLE VI
COMPUTATIONAL COMPLEXITY GAINS OF FME, LI, AND TWO PROPOSED METHODS FFS5+ERFS, FME5+ERFS (FFS5 IS THE BASIS OF COMPARISON). THE COMPLEXITY IS MEASURED BY THE SAD AND SATD CALCULATIONS OF 4×4 BLOCKS
the average percentage of increase is also as small as 1.14% and 1.79% for FFS5+ERFS and FME5+ERFS, respectively. Compared with Li's method, the results show that our ERFS frame selector maintains nearly the same coding efficiency as the exhaustive methods.

Next, we discuss the computational complexity of each method. Table V lists the speed-up factors of each method over FFS5, based upon both the total encoding time and the execution time of only the parts related to the multiple reference frame motion estimation and the frame decision. As shown in Table V, our ERFS can speed up FFS5 by 1.66 times in terms of motion estimation encoding time. If our ERFS is used with FME5, the speed-up ratio over FFS5 reaches 7.99 times on average, which is much faster than Li's method (3.18 times). Since the execution time may not truly reflect speed, owing to differences in programming skill, we also measured the speed-up factor of each method over FFS5 in terms of the SAD and SATD calculations for 4×4 blocks in Table VI, where SAD and SATD represent the number of pixel subtractions involved in the block matching for the integer and subpel motion estimations, respectively. By this measurement, our ERFS can speed up FME5 by as much as 19.23 and 4.28 times versus FFS5, in terms of SAD and SATD calculations, respectively. Again, these values reveal that our method is much faster than Li's approach.

Fig. 6 analyzes the frame selection effect based upon the average reference count of macroblocks in each reference frame. For example, in Fig. 6(b), FFS5 requires all 396 (22×18) macroblocks of a CIF video frame to search all reference frames exhaustively. However, the actual average number of macroblocks referring to each reference frame is much lower, as shown by the "Actual Reference Count by FFS5" curve (triangle-dotted line). Fig. 6 demonstrates that our ERFS can reduce the exhaustive search effort by dropping the unqualified reference frames for a given macroblock, thereby making the ERFS curve approach the actual usage. The slope of Li's curve is not as steep as ours, resulting in less complexity reduction. According to the above analysis, although our ERFS selects reference frames with a simple test, it can significantly improve the speed of multiple reference frame motion estimation while retaining coding efficiency.

Note that additional speed-up may be achieved if the mode decision also exploits the mode 8×8 motion vectors obtained in the first stage. For example, one could test only modes from 16×16 to 8×8 on the previous frame in the first stage of Fig. 1. However, since this work focuses mainly on frame selection, fast mode selection is not implemented. Furthermore, in the above experiments, we implemented the 8×8 block motion search of ERFS in the same way as its motion search counterpart to save code size, a concern in some implementations. That is, for FFS5 with ERFS, the ERFS decision is made via an FFS search, while for FME5 with ERFS, it is made by means of FME. This may limit the decision speed of ERFS and is not necessary in implementation.
For example, if FFS5+ERFS(DS) (i.e., the ERFS decision made via diamond search [20]) is adopted, a speed-up factor of 2.38–3.65 can be achieved, compared to the FFS5+ERFS(FFS) speed-up of 1.32–1.90 times, while the BDPSNR difference among all ERFS variants is less than 0.1 dB.

V. CONCLUSION

An efficient reference frame selector has been proposed for the H.264 encoder to address the complexity of multiple reference frame motion estimation. The experimental results demonstrate that the proposed algorithm can significantly reduce the complexity of motion estimation at the encoder while keeping almost the same R-D performance as that of FFS across different bit rates and motion sequences.
REFERENCES

[1] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC, 2003.
[2] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video coding with H.264/AVC: Tools, performance, and complexity," IEEE Circuits Syst. Mag., vol. 4, no. 1, pp. 7–28, 2004.
[3] Y. Su and M. T. Sun, "Fast multiple reference frame motion estimation for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 3, pp. 447–452, Mar. 2006.
[4] T. Wiegand, B. Lincoln, and B. Girod, "Fast search for long-term memory motion-compensated prediction," in Proc. IEEE Int. Conf. Image Process., Oct. 1998, vol. 3, pp. 619–622.
[5] T. Wiegand, X. Zhang, and B. Girod, "Long-term memory motion-compensated prediction," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 1, pp. 70–84, Feb. 1999.
[6] C. J. Duanmu, M. O. Ahmad, and M. N. S. Swamy, "A continuous tracking algorithm for long-term memory motion estimation," in Proc. IEEE Int. Symp. Circuits Syst., May 2003, vol. 2, pp. 356–359.
[7] M. J. Chen, Y. Y. Chiang, H. J. Li, and M. C. Chi, "Efficient multiframe motion estimation algorithms for MPEG-4 AVC/JVT/H.264," in Proc. IEEE Int. Symp. Circuits Syst., May 2004, vol. 3, pp. 737–740.
[8] Y. H. Hsiao, T. H. Lee, and P. C. Chang, "Short/long-term motion vector prediction in multi-frame video coding system," in Proc. IEEE Int. Conf. Image Process., Oct. 2004, vol. 3, pp. 1449–1452.
[9] Y. W. Huang, B. Y. Hsieh, T. C. Wang, S. Y. Chien, S. Y. Ma, C. F. Shen, and L. G. Chen, "Analysis and reduction of reference frames for motion estimation in MPEG-4 AVC/JVT/H.264," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2003, vol. 2, pp. 809–812.
[10] A. Chang, O. C. Au, and Y. M. Yeung, "A novel approach to fast multi-frame selection for H.264 video coding," in Proc. IEEE Int. Symp. Circuits Syst., May 2003, vol. 2, pp. 704–707.
[11] X. Li, E. Q. Li, and Y. K. Chen, "Fast multi-frame motion estimation algorithm with adaptive search strategies in H.264," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, vol. 3, pp. 369–372.
[12] D. Zhang, Y. Shen, S. Lin, and Y. Zhang, "Fast inter frame encoding based on modes pre-decision in H.264," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2005, vol. 1, pp. 550–553.
[13] C. T. Hsu, H. J. Li, and M. J. Chen, "Fast reference frame selection method for motion estimation in JVT/H.264," IEICE Trans. Commun., vol. E87-B, no. 12, pp. 3827–3830, Dec. 2004.
[14] C. W. Ting, H. Lam, and L. M. Po, "Fast block-matching motion estimation by recent-biased search for multiple reference frames," in Proc. IEEE Int. Conf. Image Process., Oct. 2004, vol. 3, pp. 1445–1448.
[15] Working Draft Number 2, Revision 2 (WD-2), Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, 2002.
[16] K. P. Lim, G. J. Sullivan, and T. Wiegand, Text Description of Joint Model Reference Encoding Methods and Decoding Concealment Methods, JVT of ISO/IEC MPEG and ITU-T VCEG, Hong Kong, 2005, JVT-N046.
[17] H.264/AVC Reference Software JM 9.2 [Online]. Available: http://bs.hhi.de/~suehring/tml/
[18] G. Bjontegaard, Recommended Simulation Conditions for H.26L, ITU-T Q6/SG16, Doc. VCEG-L38, 2001.
[19] G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, ITU-T Q6/SG16, Doc. VCEG-M33, 2001.
[20] J. Y. Tham, S. Ranganath, M. Ranganath, and A. A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 4, pp. 369–377, Aug. 1998.