FAST MULTIPLE REFERENCE FRAME SELECTION METHOD IN H.264 VIDEO ENCODING

Spyridon K. Kapotas and Athanassios N. Skodras
Digital Systems and Media Computing Laboratory, School of Science and Technology
Hellenic Open University, Patras, Greece
{s.kapotas, skodras}@eap.gr

ABSTRACT

The new JVT H.264 video coding standard is becoming the dominant standard, mainly due to its high coding efficiency. Among all modules in the encoder, motion estimation plays a key role, since it can significantly affect the output quality of an encoded sequence. However, motion estimation requires a significant part of the encoding time. One reason is that H.264 allows motion estimation based on multiple reference frames. In this paper, a new method for fast reference frame selection is proposed. The method performs a simple and quick test prior to motion estimation in order to choose the best reference frame.

Index Terms: H.264, SAD, Motion Estimation, Multiple Reference Frames.

1. INTRODUCTION

The recently established H.264/AVC is the newest video coding standard [1], developed by the Joint Video Team (JVT), formed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Motion Picture Experts Group (MPEG) in 2001. This new standard significantly outperforms the existing video coding standards. Among all modules in the encoder, motion estimation plays a key role, since it can considerably affect the output quality of an encoded sequence. H.264 offers motion estimation-compensation units in sizes of 16×16, 16×8, 8×16, 8×8 and sub8×8; the sub8×8 unit is further divided into four sub-partitions of sizes 8×8, 8×4, 4×8 and 4×4. Moreover, quarter-pixel motion compensation can be applied. Such a wide choice of block sizes greatly improves coding efficiency, but at the cost of a largely increased motion estimation time. The computational complexity becomes even higher when larger search ranges, bi-directional prediction and multiple reference frames are used. It has been observed that, in the case of exhaustive search of all candidate blocks, up to 80% of the computational power of the encoder is consumed by motion estimation. Such high computational complexity is often a bottleneck for real-time conversational applications.

As mentioned above, in H.264 motion estimation, searching in multiple reference frames is allowed in order for the temporal redundancy to be further reduced. Certainly, the computational load due to motion estimation increases with the number of reference frames, and the cost of motion estimation dominates the complexity of the video codec. Several methods for reducing the number of reference frames have been proposed over the past years. In [2] a method is proposed which employs the best reference frames of neighboring blocks in order to determine the best reference frame of the current block. In [3] the frame selection is based on the sub-pixel movement across the reference frames. In [4] the number of reference frames is reduced by exploiting the correlation between the block of the current frame and the corresponding block of the previous frame. In [5] a method is proposed which applies some well-known fast motion estimation methods on each reference frame; the local minimum SAD found along the selection path is used as the indicator of the final reference frame.

In this paper we propose an approach based on the same concept as [5], i.e. it performs a SAD test across the reference frames in order to reveal the optimal reference frame. However, our method outperforms [5], since it uses a significantly smaller number of test points, takes into account the Lagrangian reference cost, and leads to better results with respect to both video quality and the reduction of the motion estimation time.

The rest of the paper is organized as follows. Section 2 gives a brief description of the features of the H.264 reference code [6] which play a key role in the proposed method. Section 3 describes the new method, Section 4 presents the simulation results and, finally, Section 5 draws the conclusions.

2. H.264 REFERENCE CODE STUDY

2.1. Multiple Reference Frame Selection

H.264 uses multiple reference frames to achieve better motion prediction performance. The encoder performs motion estimation upon every reference frame for every macroblock of the current frame. Rate-distortion optimization is the criterion for selecting the best coding mode. The rate-distortion algorithm evaluates the cost of every possible reference frame, balancing the distortion against the number of bits consumed [12]. The reference frame which results in the smallest cost is then considered the optimum choice. The cost function for the selection of the optimal reference frame is calculated as in (1):

J(REF | λ_motion) = SAD(s, c(REF, m(REF))) + λ_motion · (R(m(REF) − p(REF)) + R(REF))    (1)

where λ_motion is the Lagrange multiplier used in motion estimation, R(REF) is the number of bits consumed for coding the index of the reference frame (computed by table look-up), m(REF) is the motion vector to be decided, p(REF) is the motion vector predictor, and R(m(REF) − p(REF)) is the size of the bit stream of the motion vector difference after entropy coding. For a block of M×N pixels, the SAD (Sum of Absolute Differences) is calculated as in (2):

SAD(s, c(m)) = ∑_{x=1}^{M} ∑_{y=1}^{N} | s(x, y) − c(x − m_x, y − m_y) |    (2)

where s are the values of the original pixels, c are the prediction values and m = (m_x, m_y) is the motion vector.
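To make (1) and (2) concrete, the following small C sketch computes the SAD of a block and the Lagrangian reference cost. It is only an illustration under assumed conventions (byte-per-pixel frames, a common stride for both buffers, no sub-pel interpolation or bounds checking); the function names are ours, not the reference software's.

#include <stdlib.h>

/* SAD of (2) for an M x N block: s points to the current block, c to the
   same position in the reference frame, stride is the line width of both
   buffers, (mx, my) is the candidate motion vector. */
static int sad_block(const unsigned char *s, const unsigned char *c,
                     int stride, int M, int N, int mx, int my)
{
    int sum = 0;
    for (int y = 0; y < N; y++)
        for (int x = 0; x < M; x++)
            sum += abs(s[y * stride + x] - c[(y - my) * stride + (x - mx)]);
    return sum;
}

/* Lagrangian cost of (1): the motion-vector rate term R(m - p) and the
   reference-index rate R(REF) are assumed to be supplied by the caller
   (the latter via the table look-up mentioned in the text). */
static int ref_frame_cost(int sad, int lambda_motion,
                          int mv_rate, int ref_rate)
{
    return sad + lambda_motion * (mv_rate + ref_rate);
}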

2.2. Search Center Prediction in Motion Estimation

The H.264 reference software [6] adopted three FME (Fast Motion Estimation) algorithms due to their competitive rate-distortion performance compared to Full Search: the UMHexagonS [7], the simplified UMHexagonS [8] and the EPZS [9]. The first two algorithms make use of the conventional motion vector predictor (MVp), described in the standard, in order to find a better search center, and then perform a limited search around this center. The EPZS algorithm, on the other hand, defines sets of predicted search points which are likely to give the best match. For that purpose the EPZS uses various predictors, such as the motion vector predictor and several spatial and temporal predictors. In [10] we showed that the motion vector predictor and the center position (0,0) of the collocated macroblock in the reference frame are by far the most accurate predictors of the search center. This can be seen in Figure 1, which shows the contribution of each predicted search point used by the EPZS in finding the best match. Saying that a search point has a contribution of N% means that if the encoder performs the motion estimation search for 100 different blocks, this search point gives the best match for N of them.

Figure 1. Contribution (%) of each search center predictor used by the EPZS, for several QCIF test sequences (Bridge Far, Carphone, Claire, Container, Foreman, Grandma, Highway, Bridge Close, News, Salesman, Silent).

The motion vector predictor itself is calculated in the following way [11]. Let E be the current macroblock, macroblock partition or sub-macroblock partition. Let A be the partition or sub-partition immediately to the left of E, B the partition or sub-partition immediately above E, and C the partition or sub-macroblock partition above and to the right of E. If there is more than one partition immediately to the left of E, the topmost of these partitions is chosen as A. If there is more than one partition immediately above E, the leftmost of these is chosen as B.
1. For transmitted partitions, excluding the 16x8 and 8x16 partition sizes, MVp is the median of the motion vectors of partitions A, B and C.
2. For 16x8 partitions, the MVp for the upper 16x8 partition is predicted from B and the MVp for the lower 16x8 partition from A.
3. For 8x16 partitions, the MVp for the left 8x16 partition is predicted from A and the MVp for the right 8x16 partition from C.
4. For skipped macroblocks, a 16x16 vector MVp is generated as in case (1) above (i.e. as if the block were encoded in 16x16 Inter mode).
If one or more of the previously transmitted blocks is not available (e.g. if it is outside the current slice), the choice of MVp is modified accordingly.
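As an illustration of case 1 above, a minimal C sketch of the component-wise median follows; the MV struct and the assumption that all three neighbours are available are simplifications of ours (the standard's availability rules are omitted).

typedef struct { int x, y; } MV;

/* Median of three integers. */
static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }   /* ensure a <= b     */
    if (b > c) b = c;                         /* b = min(b, c)     */
    return a > b ? a : b;                     /* max(a, min(b, c)) */
}

/* MVp for case 1: component-wise median of the motion vectors of the
   neighbouring partitions A (left), B (above) and C (above-right). */
static MV mv_predictor(MV A, MV B, MV C)
{
    MV p = { median3(A.x, B.x, C.x), median3(A.y, B.y, C.y) };
    return p;
}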

3. PROPOSED FRAME SELECTION METHOD

The proposed method takes advantage of the high prediction accuracy of both the motion vector predictor and the collocated macroblock's center predictor in order to find the optimal reference frame. This frame is found by minimizing a cost function. Clearly, we cannot use (1) as the cost function, since the value of R(m(REF) − p(REF)) is not known prior to motion estimation. We therefore take into account only the reference frame cost and define the modified cost function as in (3):

J(REF | λ_motion) = SAD(s, c(REF, m(REF))) + λ_motion · R(REF)    (3)

where λ_motion is the Lagrange multiplier used in motion estimation and R(REF) is the number of bits consumed for coding the index of the reference frame, computed by means of a table look-up. The SAD is performed over the luma blocks of the macroblocks and is calculated according to formula (2). It has been found that applying the cost function (3) only to the 16x16 blocks is sufficient to reveal the optimal reference frame. The proposed approach is as follows. For every macroblock in the current frame:
1. Get the 16x16 luma block.
2. Get the next frame from the reference frame list.
3. Calculate the cost function (3) for the 16x16 luma block pointed to by the motion vector predictor in the reference frame.
4. Calculate the cost function (3) for the 16x16 luma block of the collocated macroblock in the reference frame.
5. Compare the two results and keep the minimum.
6. Compare this minimum with the one from the previous reference frame and update the global minimum.
7. Repeat steps 2 to 6 until all of the reference frames have been examined.
8. The remaining global minimum denotes the optimal reference frame for the macroblock under test.

The pseudo code of the proposed method is shown in Figure 2.

for every_P_frame {
    for every_16x16_luma_block = M {
        M_BestReferenceFrame = 0
        GlobalMin = MAX_INTEGER
        for every_reference_frame = N {
            LocalMin = MAX_INTEGER
            A = cost(mv_predictor)
            if (A < LocalMin) LocalMin = A
            B = cost(collocated_centre)
            if (B < LocalMin) LocalMin = B
            if (LocalMin < GlobalMin) {
                GlobalMin = LocalMin
                M_BestReferenceFrame = N
            }
        }
    }
}

Figure 2. The pseudo code of the proposed method
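For readers who prefer real code over pseudo code, a compact C rendering of Figure 2 is given below. The helpers cost3() and mv_pred() are hypothetical stand-ins for encoder internals (cost3() is assumed to evaluate the modified cost (3) for the 16x16 luma block at a given position in a given reference frame); only the selection loop itself reflects the proposed method.

#include <limits.h>

typedef struct { int x, y; } MV;   /* as in the earlier sketch */

/* Assumed encoder hooks:
   cost3()   - evaluates cost (3) for macroblock mb at position cand in
               reference frame ref;
   mv_pred() - returns the motion vector predictor of mb for frame ref. */
extern int cost3(int mb, int ref, MV cand);
extern MV  mv_pred(int mb, int ref);

/* Returns the selected reference frame index for macroblock mb, testing
   only the MVp position and the collocated centre (0,0) in each frame. */
static int select_reference_frame(int mb, int num_ref)
{
    const MV centre = { 0, 0 };        /* collocated macroblock centre */
    int best_ref = 0, global_min = INT_MAX;

    for (int ref = 0; ref < num_ref; ref++) {
        int a = cost3(mb, ref, mv_pred(mb, ref));
        int b = cost3(mb, ref, centre);
        int local_min = a < b ? a : b;  /* steps 3-5 */
        if (local_min < global_min) {   /* step 6    */
            global_min = local_min;
            best_ref = ref;
        }
    }
    return best_ref;                    /* step 8    */
}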

The method also incorporates a simple early-stop criterion in order to speed up the SAD calculation: while accumulating the SAD, the encoder compares the running total with the previous minimum and, as soon as the total exceeds that minimum, the calculation is terminated.
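The early-stop criterion can be folded directly into the SAD loop. A sketch under the same assumptions as the earlier SAD example follows; checking once per row rather than per pixel is our choice, made to keep the test itself cheap.

#include <stdlib.h>

/* SAD over a 16x16 block that aborts as soon as the running total
   exceeds min_so_far, since such a candidate can no longer win. */
static int sad16_early_stop(const unsigned char *s, const unsigned char *c,
                            int stride, int min_so_far)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(s[y * stride + x] - c[y * stride + x]);
        if (sum > min_so_far)   /* partial SAD already exceeds the minimum */
            return sum;         /* caller discards any value > min_so_far */
    }
    return sum;
}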

The proposed method is quite simple and very easy to implement. There is no calculation overhead, since the position of the collocated block is known and the motion vector predictor is calculated by the encoder anyway, according to the H.264 standard. In addition, our method can be combined with any existing motion estimation algorithm, either the Full Search or any other fast motion estimation algorithm. Extended tests were conducted and useful conclusions are drawn in the following sections.

4. SIMULATION RESULTS

The proposed scheme was integrated within version 11.0 of the reference JVT software [6]. The most important configuration parameters of the reference software are shown in Table I. The rest of the parameters retained their default values.

TABLE I. CONFIGURATION PARAMETERS OF THE ENCODER

Profile:                        Baseline
Number of Frames:               100
Frame Rate:                     30 fps
Number of reference frames:     5
Motion Estimation Algorithm:    Full Search
RD Optimization:                High Complexity
Rate Control:                   Enabled
Bit Rate:                       45000 bps

First we conducted some quick tests, in which our algorithm was executed without interfering with the reference frame selection process. Our purpose was to find out how often our algorithm succeeds in finding the same optimal reference frame as the one found by the full search algorithm in normal operation mode. The results are shown in Table II and indicate that the proposed method can adequately replace the normal frame selection procedure.

TABLE II. SUCCESSFUL MATCHES BETWEEN THE PROPOSED METHOD AND THE OFFICIALLY ADOPTED METHOD

Video Sequence (QCIF, 50 frames)    Success (%)
Foreman                             73.6
Salesman                            95.4
News                                94.8
Carphone                            71.5
Grandma                             94.0

Several video sequences in QCIF format were tested. The test sequences are separated into the following classes:
1. Class A: low spatial detail and low amount of movement, such as the "akiyo" and "container" sequences.
2. Class B: medium spatial detail and low amount of movement, or vice versa, such as the "foreman", "silent" and "news" sequences.
3. Class C: high spatial detail and medium amount of movement, or vice versa, such as the "mobile" sequence.

The results are shown in Tables III and IV. The bit rate variations are negligible, since we set a very low bit rate constraint of 45000 bps, and therefore the comparisons can be made by examining the variations of the motion estimation time and the PSNR. The comparisons show that the proposed method results in a significant reduction of the motion estimation time, while the degradation of the PSNR does not exceed 0.5 dB. Moreover, the proposed method behaves uniformly across all the test sequences, no matter which class they belong to.

TABLE III. COMPARISON OF MOTION ESTIMATION TIME

Video Sequence    Encoding Time (ms)          Variation (%)
                  Reference    Proposed
bridge-close      68664        14182          -79.345
bridge-far        68980        13974          -79.741
highway           71306        14665          -79.433
salesman          69236        13814          -80.047
carphone          72852        14942          -79.489
news              69410        14170          -79.585
grandma           69785        14341          -79.449
container         69505        13785          -80.166
claire            69850        14410          -79.370
silent            69520        13735          -80.243
foreman           73965        14218          -80.777
akiyo             70853        13665          -80.71
mobile            79776        15155          -80.81
Average Variation of Motion Estimation Time: -79.935 %

TABLE IV. COMPARISON OF THE PSNR OF THE LUMA INTER FRAMES

Video Sequence    PSNR (dB)                   Variation (dB)
                  Reference    Proposed
bridge-close      32.733       32.774         0.041
bridge-far        39.216       39.239         0.023
highway           36.092       35.885         -0.207
salesman          33.057       32.988         -0.069
carphone          33.459       33.100         -0.359
news              33.590       33.530         -0.060
grandma           36.799       36.683         -0.116
container         36.410       35.981         -0.429
claire            41.126       40.893         -0.233
silent            32.427       32.389         -0.038
foreman           31.827       31.331         -0.496
akiyo             39.096       38.901         -0.195
mobile            23.742       23.312         -0.430
Average Variation of PSNR: -0.197 dB

5. CONCLUSIONS

In this paper we presented a new method for multiple reference frame selection in H.264 video coding. The method performs a simple and fast test prior to motion estimation in order to choose the best reference frame among a number of candidates. In this way, the proposed method reduces the number of reference frames to one, and thus decreases the motion estimation time without considerably affecting the PSNR and, consequently, the video quality.

6. ACKNOWLEDGEMENTS

This work was funded in part by the European Union - European Social Fund (75%), the Greek Government, Ministry of Development - General Secretariat of Research and Technology (25%), and the Private Sector, in the frame of the European Competitiveness Programme (Third Community Support Framework - Measure 8.3, programme PENED, contract no. 03ED832).

7. REFERENCES

[1] T. Wiegand, G. J. Sullivan, and A. Luthra, "Draft ITU-T Recommendation H.264 and Final Draft International Standard 14496-10 AVC," JVT of ISO/IEC JTC1/SC29/WG11 and ITU-T SG16/Q.6, Doc. JVT-G050r1, Geneva, Switzerland, May 2003.
[2] H.-J. Li, C.-T. Hsu and M.-J. Chen, "Fast Multiple Reference Frame Selection Method for Motion Estimation in JVT/H.264", IEEE Asia-Pacific Conference on Circuits and Systems, 6-9 Dec. 2004.
[3] A. Chang, O. C. Au and Y. M. Yeung, "A Novel Approach to Fast Multi-Frame Selection for H.264 Video Coding", IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Vol. 3, 6-10 April 2003.
[4] J.-S. Sohn and D.-G. Kim, "Fast Multiple Reference Frame Selection Method Using Correlation of Sequence in JVT/H.264", IEICE Trans. Fundamentals, Vol. E89-A, No. 3, March 2006.
[5] C.-W. Ting, L.-M. Po and C.-H. Cheung, "Center-Biased Frame Selection Algorithms for Fast Multi-Frame Motion Estimation in H.264", Proc. of the 2003 Int. Conf. on Neural Networks and Signal Processing, Vol. 2, 14-17 Dec. 2003.
[6] JVT Reference Software (version JM11.0), http://iphome.hhi.de/suehring/tml/download/old_jm/.
[7] Z. Chen, P. Zhou, and Y. He, "Fast Integer and Fractional Pel Motion Estimation for JVT," JVT-F017r.doc, Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, 6th Meeting, Awaji Island, Japan, 5-13 Dec. 2002.
[8] X. Yi, J. Zhang, N. Ling, and W. Shang, "Improved and Simplified Fast Motion Estimation for JM", Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 16th Meeting, Poznan, Poland, 24-29 July 2005.
[9] P. Yin, H.-Y. C. Tourapis, A. M. Tourapis, and J. Boyce, "Fast Mode Decision and Motion Estimation for H.264," IEEE Int. Conference on Image Processing, Vol. III, pp. 853-856, Sept. 2003.
[10] S. K. Kapotas and A. N. Skodras, "A New Spatio-Temporal Predictor for Motion Estimation in H.264 Video Coding", 8th Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2007), Santorini, Greece, 6-8 June 2007.
[11] I. E. G. Richardson, "H.264 and MPEG-4 Video Compression", John Wiley & Sons Ltd, 2003.
[12] Z. Chen and K. N. Ngan, "Recent Advances in Rate Control for Video Coding", Signal Processing: Image Communication, Elsevier, Vol. 22, pp. 19-38, 2007.