IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.11 NOVEMBER 2006
2970
PAPER
Special Section on Image Media Quality
Projection Based Adaptive Window Size Selection for Efficient Motion Estimation in H.264/AVC

Anand PAUL†a), Jhing-Fa WANG††, Nonmembers, Jia-Ching WANG†, Member, An-Chao TSAI†, and Jang-Ting CHEN†, Nonmembers
SUMMARY This paper introduces a block-based motion estimation algorithm that combines 1D projection matching with adaptive window size selection. Two blocks cannot match well if their corresponding 1D projections do not match well; on this foundation, the 2D block matching problem is reduced to a simpler 1D matching, which eliminates the majority of candidate pixels from participation. This projection method is combined with adaptive window size selection, in which an appropriate search window for each block is determined on the basis of the motion vectors and prediction errors obtained for the previous block, making this novel method several times faster than exhaustive search with negligible performance degradation. Encoding QCIF-size video with the proposed method reduces the computational complexity of motion estimation by roughly 45% and of overall encoding by 23%, while maintaining image/video quality.
key words: block based motion estimation, 1D projection, adaptive window size selection
1. Introduction
The recently established H.264/AVC is the newest video coding standard. The main goals of the H.264/AVC standardization effort have been to enhance compression performance and to provide a "network-friendly" video representation. In an H.264 encoder, motion estimation is the major source of complexity [1]. It is well known that the motion search range is an important parameter in determining both coding efficiency and encoding computational cost. Most video coding standards, including MPEG-1/2/4, H.261, H.263, and H.264/AVC, use block motion estimation and compensation to remove temporal redundancy. This is one of the most important parts of a video encoder, and newer standards achieve better video quality at a constant bit rate by allowing subdivision of 16×16 pixel macroblocks (MBs) into smaller blocks. The encoder may then select whether to use large blocks with only a few motion vectors (MVs), or more accurate motion estimation (ME) with smaller blocks but more motion vectors to transmit. For example, the newest and most efficient coding standard, H.264/AVC, allows subdividing an MB into 16×8, 8×16, or 8×8 pixel blocks, and when the smallest size is chosen, the block may be further subdivided in a tree-like fashion into 4×8, 8×4, or 4×4 pixel blocks [2], [3].

Fig. 1 Different partition sizes in a macroblock.

A good-quality encoder must then handle seven different block sizes, as shown in Fig. 1. Since there is a vast number of possibilities for subdividing the blocks, the encoder must intelligently decide which block subdivision to use and which MV to use for each block. Earlier encoders typically computed the sum of absolute differences (SAD) between the current block and candidate blocks and simply selected the MV yielding the least distortion. However, this often does not give the best image quality for a given bit rate, because it may select long motion vectors that require many bits to transmit. It also does not help in deciding how the subdivision should be performed, because the smallest blocks will always minimize the distortion, even though the multiple MVs may consume more bits and increase the bit rate. This paper does not focus on how those subdivisions are performed; rather, it focuses on determining the search range for subsequent macroblocks in a given frame and applies a projection technique to determine the motion vector.

Motion estimation is computationally the most demanding part of a typical video encoder, and the multiple block sizes available in newer standards only increase the computation. Many fast motion estimation methods have been developed, but many of them evaluate the matching criterion at only a few candidate MVs. This degrades video quality, because the best MV might not be found. Fast full search methods are an alternative. Much research has been carried out to theoretically improve the various fast search methods, but practical implementation has received relatively little attention. It is important that a given ME algorithm can be applied in practical encoders, taking into account aspects such as RD optimization and multiple block sizes, or it will remain unused. In this paper we propose a novel motion estimation method based on projection with adaptive window selection, which reduces computation and improves image quality.

Manuscript received December 7, 2005. Manuscript revised April 28, 2006. Final manuscript received July 31, 2006.
† The authors are with the Multimedia and Communication IC Lab, EE Department, NCKU, Tainan-701, Taiwan, Republic of China.
†† The author is a Chair Professor in the EE Department, NCKU, Tainan-701, Taiwan, Republic of China.
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e89–a.11.2970
Copyright © 2006 The Institute of Electronics, Information and Communication Engineers
1.1 Motivation

The computational complexity of block-based motion estimation is a direct consequence of the expensive 2D block matching process. The relationship between motion and projection has been well established [4]. Reference [5] was the first to introduce fast feature-based motion estimation based on integral projection, in which most candidate blocks are eliminated by matching the 1D projections of the blocks. The 1D projection of a 2D block is used to eliminate the majority of candidates by matching in 1D, which is much faster than matching 2D blocks. The basic projection scheme is shown in Fig. 2. This is a greedy approach: since only sums over the blocks are matched, the sum contains too little information about a block, and DC matching may yield too many mismatches; an adaptive scheme is therefore incorporated as an improvement [6]. It is well known that the motion search window is important in determining coding efficiency and encoding computational cost [4]. As the total encoding power is huge, it is necessary to develop a motion search window decision algorithm [6] to reduce encoding time. After obtaining the motion vector of the current block by the projection method, the Adaptive Window Size Selection (AWSS) technique is used to fix the window size for succeeding blocks in the frame so as to obtain accurate motion vectors. An appropriate search window for each block is determined on the basis of the motion vectors and prediction errors obtained for the previous block. In this work, we aim to develop a simple yet efficient algorithm in which the projection-based block matching method [7] is combined with an adaptive window size selection scheme [8], making the overall approach an efficient one for motion estimation. This novel Projection with Adaptive Window Size Selection (PAWSS) was implemented by modifying the JM 10.1 (Joint Model) reference software encoder [9] to include the PAWSS method. It showed performance similar to the fast full search algorithm with less computation time.

The paper is organized as follows. A review of the projection scheme, along with fast projection and candidate exclusion by 1D matching, is given in Sect. 2. In Sect. 3, Adaptive Window Size Selection (AWSS) is introduced and the mapping between projection and the AWSS algorithm is described. Simulation and experimental results of the proposed method are presented in Sect. 4, and Sect. 5 concludes our work.

2. Projection

2.1 Definition
Suppose the frame size is W×H, the block size is B_h × B_v, and the search window size is W_h × W_v. Then, to predict one block, there are W_h × W_v candidates to search. The common block size 16×16 (the macroblock size) is used here for illustration. Let B^{x,y} denote the 2D block of pixels with top-left position at (x, y) in a video frame, and let B^{x,y}_{i,j}, where 0 ≤ i < B_h and 0 ≤ j < B_v, be the pixel value at the j-th row and i-th column of the block. Define the vertical (column) projection of B^{x,y} to be PB^{x,y}, a 1D row vector whose i-th value is the sum of the i-th column of B^{x,y}:

$$PB_i^{x,y} = \sum_{j=0}^{B_v-1} B_{i,j}^{x,y}, \quad 0 \le i < B_h. \tag{1}$$

By projection, a B_h × B_v 2D block is reduced to a B_h-component 1D vector, and only the DC information of each column is preserved, as shown in Fig. 3.

Fig. 2 Basic projection scheme. (a) reference frame, (b) current frame.

Fig. 3 The projection of a 2D block.

2.2 Fast Projection

In the current frame, there are only (W × H)/(B_h × B_v) non-overlapped blocks, and the computational load to compute their projections is small (O(W × H) operations). By contrast, in the reference frame there are W × H different blocks (one starting at each pixel). B^{x,y} and its direct right (lower) neighbor B^{x+1,y} (B^{x,y+1}) share all pixels except two columns (rows). If PB^{x,y} is known, for example, then PB^{x,y+1} and PB^{x+1,y} in (2) and (3) can be updated efficiently (see Fig. 4), as follows:

$$PB_i^{x,y+1} = PB_i^{x,y} - B_{i,0}^{x,y} + B_{i,B_v-1}^{x,y+1}, \tag{2}$$

and

$$PB_i^{x+1,y} = \begin{cases} PB_{i+1}^{x,y}, & i < B_h - 1 \\ \displaystyle\sum_{j=0}^{B_v-1} B_{i,j}^{x+1,y}, & i = B_h - 1. \end{cases} \tag{3}$$
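The column projection of Eq. (1) and the incremental updates of Eqs. (2) and (3) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's C implementation; the function names and the 64×64 random test frame are our own assumptions.

```python
import numpy as np

def column_projection(frame, x, y, bh=16, bv=16):
    """Vertical (column) projection of the bh x bv block at top-left (x, y):
    a 1-D vector whose i-th entry is the sum of the block's i-th column (Eq. (1))."""
    return frame[y:y + bv, x:x + bh].sum(axis=0)

def shift_down(proj, frame, x, y, bh=16, bv=16):
    """PB^{x,y} -> PB^{x,y+1}: drop the outgoing top row, add the incoming
    bottom row (Eq. (2)). O(2*bh) operations instead of O(bh*bv)."""
    return proj - frame[y, x:x + bh] + frame[y + bv, x:x + bh]

def shift_right(proj, frame, x, y, bh=16, bv=16):
    """PB^{x,y} -> PB^{x+1,y}: reuse bh-1 existing column sums, compute only
    the new rightmost column (Eq. (3))."""
    out = np.empty_like(proj)
    out[:-1] = proj[1:]                       # columns shared with the old block
    out[-1] = frame[y:y + bv, x + bh].sum()   # the one new column
    return out

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(64, 64)).astype(np.int64)
p = column_projection(frame, 8, 8)
assert np.array_equal(shift_down(p, frame, 8, 8), column_projection(frame, 8, 9))
assert np.array_equal(shift_right(p, frame, 8, 8), column_projection(frame, 9, 8))
```

The assertions confirm that each incremental update reproduces the projection computed from scratch, which is what makes the O(2W × H) total cost claimed above plausible.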
Starting with B^{0,0}, with proper buffering, only two operations per pixel are required on average to compute the projection of a block in this updating manner. The cost of projection is thus only O(2W × H) operations.

Fig. 4 Incremental calculation of block projection.

2.3 Excluding Candidates by 1D Matching

Two blocks cannot match well if their projections do not match well [7]. For a block C^{x,y} in the current frame, block-based motion estimation searches all displaced blocks R^{x+dx,y+dy} in the search window in the reference frame for the best-matched block. The commonly used matching error metric, the maximum amplitude difference (MAD), is given by

$$\mathrm{MAD}(dx, dy) = \sum_{i=0}^{B_h-1} \sum_{j=0}^{B_v-1} \left| C_{i,j}^{x,y} - R_{i,j}^{x+dx,y+dy} \right|. \tag{4}$$

By contrast, the matching error MAD of the 1D projections (projection maximum amplitude difference, PMAD) of blocks C^{x,y} and R^{x+dx,y+dy} is

$$\mathrm{PMAD}(dx, dy) = \sum_{i=0}^{B_h-1} \left| PC_i^{x,y} - PR_i^{x+dx,y+dy} \right|. \tag{5}$$

The MAD in (4) requires far more operations than the PMAD in (5). A pictorial representation of excluding candidates by 1D matching is shown in Fig. 5: the 256 operations of 2D matching are reduced to 16 operations in 1D.

Fig. 5 Comparing the 2D matching cost vs. the 1D matching cost (MAD vs. PMAD).

2.4 Buffering Scheme

To search for the motion vectors of a strip of blocks with their top corners at y in the current frame, only the blocks with their top corners within [y − W_v/2, y + W_v/2] are involved in the reference frame, which is a W × W_v strip. So a W × W_v buffer, instead of one of size W × H (usually H ≫ W_v), is sufficient to store all reusable projections. When moving to the next strip, we slide the buffer up by B_v lines, discard the B_v lines moving out, and update the B_v lines moving in using fast projection. An additional B_v-point buffer is needed for the current frame.

2.5 Discussion

a) Although only vertical projection is discussed in this manuscript, an entirely analogous algorithm ensues by using horizontal projections. We believe that vertical projection may be generally preferred, since video sequences tend to have more horizontal motion in the scenes. However, horizontal projection may be more suitable for certain sequences.

b) By using both vertical and horizontal projections, a few more candidate blocks can be eliminated by 1D matching. However, the saving is just enough to compensate for the cost of the extra projection, and extra buffers are necessary. Hence, the combination of two 1D projections is not better than using a single projection.

c) An even more greedy approach could be based on matching just the total sum of a block; this is DC matching, and it is certainly very fast: we can derive fast methods both to compute the DC and to manage the 1D buffer. Although DC matching is very efficient (only one operation per match), the sum contains too little information about a block, and our analysis indicates that DC matching yields too many mismatches to be effective. Thus an adaptive scheme is incorporated to increase the probability of matching a block with less computation and improved performance over the existing method.
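Because |Σ_j c_j − Σ_j r_j| ≤ Σ_j |c_j − r_j| for each column (triangle inequality), the PMAD of Eq. (5) never exceeds the MAD of Eq. (4); a candidate whose PMAD already reaches the best MAD found so far can therefore be excluded without any 2D matching, and the screen loses no accuracy. A minimal sketch of this screening in illustrative Python (the random test blocks and names are our own):

```python
import numpy as np

def mad(cur, ref):
    # Eq. (4): sum of absolute pixel differences -- 256 operations for 16x16
    return np.abs(cur.astype(np.int64) - ref.astype(np.int64)).sum()

def pmad(cur, ref):
    # Eq. (5): sum of absolute column-projection differences -- 16 operations
    return np.abs(cur.sum(axis=0, dtype=np.int64)
                  - ref.sum(axis=0, dtype=np.int64)).sum()

rng = np.random.default_rng(1)
cur = rng.integers(0, 256, (16, 16))
refs = [rng.integers(0, 256, (16, 16)) for _ in range(8)]

# PMAD is a lower bound on MAD for every candidate.
assert all(pmad(cur, r) <= mad(cur, r) for r in refs)

# Prefilter: the full 2-D MAD is evaluated only for candidates whose cheap
# 1-D PMAD beats the best full cost found so far.
best = mad(cur, refs[0])
for r in refs[1:]:
    if pmad(cur, r) < best:          # survives the 1-D screen
        best = min(best, mad(cur, r))
# Skipped candidates had MAD >= PMAD >= best, so the minimum is exact.
assert best == min(mad(cur, r) for r in refs)
```

The final assertion shows the exclusion is lossless: the screened search returns the same best match as an exhaustive 2D search.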
3. Adaptive Window Size Selection
In motion estimation, motion vectors that exceed the search window cannot be detected; when this happens, sufficient motion compensation efficiency cannot be obtained and video quality is degraded in the encoding process. Although such PSNR degradation might be avoided with a wider search window, the extra computational complexity created by a wide window in scenes with only small motion would be wasteful [10].

3.1 Motion Vector Thresholding

To avoid this problem of choosing the search window, we have developed a method that makes an adaptive search window decision at each block after finding the first best match using the projection method. The search window is modified on the basis of the motion estimation results for the previous block, and the window for any given block is chosen from among three candidates: ±32, ±16, and ±8 pixels. In search window selection, the sum of absolute motion vector values (SumMV) is first calculated. The prediction error for each block is calculated based on the PMAD between that block and the reference block. When the PMAD and SumMV exceed the upper-bound threshold values, i.e., the PMAD threshold [8] and the motion vector threshold (PMADTh1, MVTh1), the largest search window (±32 pixels) is chosen for the next block. When both PMAD and SumMV are smaller than the lower threshold values (PMADTh2, MVTh2), the narrowest search window (±8 pixels) is chosen for the next block. That is, each search window is determined by comparing PMAD and, at times, SumMV with predetermined threshold values. The flow diagram is shown in Fig. 6, and the conditions for the flow, based on the threshold values, are given later in this section. The thresholds for SumMV are MVTh1 and MVTh2 (MVTh1 > MVTh2), and the thresholds for PMAD are PMADTh1 and PMADTh2 (PMADTh1 > PMADTh2); we use (MVTh1, MVTh2) = (15 × 10^4, 10 × 10^4) and (PMADTh1, PMADTh2) = (35 × 10^5, 25 × 10^5).
3.2 Projection and AWSS Mapping

The fast algorithm presented in this paper avoids most of the expensive 2D matchings: the 1D projections of 2D blocks eliminate the majority of candidates by matching in 1D, which is much faster than matching 2D blocks. Computational scalability can be achieved by controlling how many candidates are excluded by 1D matching. Initially, column projection is used in the reference frame and the current frame to project each 2D block into a 1D row; PMAD is then calculated and compared for all 1D projected rows, and the best-matched block is found for the frame. To find the next best-matched block, the AWSS algorithm fixes the search window size; PMAD is then computed for all blocks within this window and the best match is chosen. This reduces the computational complexity and improves image/video quality significantly compared with existing fast full search methods. Our results also show that the bit rate is slightly reduced, with negligible PSNR degradation.

3.3 AWSS Algorithm

The search window is modified on the basis of the motion estimation results for the previous block. The search window for any given block is chosen from among three candidates, i.e., ±32, ±16, and ±8 pixels. More specifically, the window size is determined on the basis of both the sum of the absolute motion vector values (SumMV) and the sum of the prediction errors for the previous block. The proposed AWSS algorithm optimally varies the area of each window size. The AWSS algorithm flow is described below.

Flow 1: if the current window size is 8 or 16, PMAD > PMADTh1, and SumMV > MVTh1, then choose the wider (±32) window size. When PMAD and SumMV are larger than PMADTh1 and MVTh1, respectively, fast motion is making the prediction inaccurate, so a larger window size should be chosen for the next block.

Flow 2: if the current window size is 8 and PMADTh2 < PMAD < PMADTh1, then choose the medium (±16) window size.

Flow 3: if the current window size is 16 or 32, PMAD < PMADTh2, and SumMV < MVTh2, then choose the smaller (±8) window size. When PMAD and SumMV are less than PMADTh2 and MVTh2, respectively, the motion is slow and a larger search window wastes computation, so a smaller window size should be chosen for the next block.

Flow 4: if the current window size is 32, PMADTh2 < PMAD < PMADTh1, and SumMV < MVTh1, then choose the medium (±16) window size.

Fig. 6 Adaptive window size selection scheme.
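The four flows above can be sketched as a single decision function. This is an illustrative Python sketch; the function name `next_window` is our own, while the threshold constants are those quoted in Sect. 3.1.

```python
# Threshold values as reported in Sect. 3.1.
MV_TH1, MV_TH2 = 15 * 10**4, 10 * 10**4
PMAD_TH1, PMAD_TH2 = 35 * 10**5, 25 * 10**5

def next_window(cur, pmad, sum_mv):
    """Pick the search window (+/-8, +/-16, or +/-32 pixels) for the next
    block from the previous block's PMAD and summed |MV| (Flows 1-4)."""
    if cur in (8, 16) and pmad > PMAD_TH1 and sum_mv > MV_TH1:
        return 32                                 # Flow 1: fast motion -> widen
    if cur == 8 and PMAD_TH2 < pmad < PMAD_TH1:
        return 16                                 # Flow 2: moderate error -> medium
    if cur in (16, 32) and pmad < PMAD_TH2 and sum_mv < MV_TH2:
        return 8                                  # Flow 3: slow motion -> narrow
    if cur == 32 and PMAD_TH2 < pmad < PMAD_TH1 and sum_mv < MV_TH1:
        return 16                                 # Flow 4: settle back to medium
    return cur                                    # otherwise keep the current size

assert next_window(16, 36 * 10**5, 16 * 10**4) == 32   # Flow 1
assert next_window(8,  30 * 10**5, 12 * 10**4) == 16   # Flow 2
assert next_window(32, 20 * 10**5,  9 * 10**4) == 8    # Flow 3
assert next_window(32, 30 * 10**5, 12 * 10**4) == 16   # Flow 4
```

Keeping the current size when no flow fires is our reading of the flow diagram, since Flows 1-4 do not cover every combination of window size and threshold band.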
3.4 Flow Chart

Figure 7 depicts the overall flow of the projection-based adaptive window size selection scheme. Initially, column projection is used in the reference frame and the current frame to project each 2D block into a 1D row; PMAD is calculated and compared for all 1D projected rows, and the best-matched block is found for the (initial) frame. To find the next best-matched block, the AWSS algorithm fixes the search window size; PMAD is then calculated, after column projection, for all blocks within this window, and the best match is chosen. The flow repeats, performing AWSS for each block, until the last frame in the sequence is reached.

4. Simulation Results
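The per-block search that the flow chart describes can be sketched end-to-end as follows. This is an illustrative Python sketch, not the JM-integrated C implementation; the driver `pawss_search` and the synthetic globally shifted test frame are our own assumptions.

```python
import numpy as np

def pawss_search(cur_frame, ref_frame, bx, by, window, bs=16):
    """Best MV for the bs x bs block at (bx, by): cheap PMAD screens each
    candidate in the +/-window range, and the full 2-D cost is computed
    only for candidates that survive the 1-D screen."""
    h, w = cur_frame.shape
    cur = cur_frame[by:by + bs, bx:bx + bs].astype(np.int64)
    cur_proj = cur.sum(axis=0)
    best_cost, best_mv = None, (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            x, y = bx + dx, by + dy
            if not (0 <= x <= w - bs and 0 <= y <= h - bs):
                continue                       # candidate falls outside the frame
            ref = ref_frame[y:y + bs, x:x + bs].astype(np.int64)
            pm = np.abs(cur_proj - ref.sum(axis=0)).sum()
            if best_cost is not None and pm >= best_cost:
                continue                       # 1-D mismatch: skip 2-D matching
            cost = np.abs(cur - ref).sum()
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost

rng = np.random.default_rng(2)
ref = rng.integers(0, 256, (48, 48))
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))   # block content moves by (dx, dy) = (3, -2)
mv, cost = pawss_search(cur, ref, 16, 16, window=8)
assert mv == (3, -2) and cost == 0
```

In a full encoder, `window` would be updated between blocks by the AWSS decision of Sect. 3.3 rather than held fixed as here.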
The proposed motion estimation algorithm was implemented in the reference JM software, version 10.1 [9]. We tested the proposed method using the standard QCIF (176 × 144), 30 frames-per-second "Foreman" test sequence; slice mode was turned off, and the 4:2:0 YUV format was used in the simulation. Results of PAWSS compared with the original video frame, together with their pixel difference, are shown in Fig. 8: Fig. 8(a) is the original frame, Fig. 8(b) is obtained after PAWSS, and Fig. 8(c) gives the error (pixel difference) between the original frame and PAWSS. Figure 9 compares the computation time of the fast full search used in the reference JM software [9] and the proposed PAWSS. For this experiment we used the full Foreman sequence; the quantization parameter was set to 20 and 10 for I and P frames, respectively, and the CAVLC entropy coding scheme was selected for our simulation. The total motion estimation time is reduced by 45% compared with the fast full search method, and the total encoding time is reduced by about 23%.

Fig. 7 Flow chart for the overall algorithm.

Fig. 8 (a) Original frame, (b) after PAWSS, (c) pixel difference of (a) and (b).

Fig. 9 Computation time complexity of PAWSS compared with the fast full search method: (1) total motion estimation time, (2) total encoding time.

4.1 Simulation Environment

The simulation was done on a P4 2.8 GHz workstation with 512 MB RAM running the Windows XP Professional operating system with Service Pack 2. The reference encoder was modified to include the PAWSS method; it was written in plain C, without any platform-dependent optimizations, and compiled with the VC++ compiler, version 6.0. Table 1 gives the bit rate comparison between fast full search and PAWSS: there is a slight reduction of the bit rate, from 2268604 to 2266884, with the PAWSS scheme, and some reduction of bits/picture for both P and I frames compared with the fast full search scheme. This shows that the PAWSS scheme brings no significant reduction in bit rate; rather, it reduces computational complexity and finds motion vectors quickly, which in turn increases encoding speed. The PSNR comparison of fast full search and PAWSS is tabulated in Table 2, which shows a negligible PSNR degradation of about 0.088 dB for I frames and 0.057 dB for P frames.

Table 1 Bitrate comparison.

Table 2 PSNR comparison.
5. Conclusion

We have presented projection-based motion estimation with Adaptive Window Size Selection in this paper. This novel method greatly reduces the computational complexity while maintaining prediction integrity, since most candidates can be quickly eliminated by matching 1D projections, which is faster than matching 2D blocks. Moreover, the addition of the adaptive scheme for fixing the window size of succeeding blocks avoids unneeded computation. Thus, in encoding a QCIF-size video, our method reduces the computational complexity of block-based motion estimation by 45% and of overall encoding by 23% with negligible PSNR degradation, in effect maintaining image/video quality.

Acknowledgments

This work is supported by the National Science Council, Republic of China, under research grant NSC93-2215-E-006-019.

References

[1] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, Kluwer, 1999.
[2] Joint Video Team of ITU-T and ISO/IEC JTC 1, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," JVT of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, March 2003.
[3] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol.13, no.7, pp.560–576, 2003.
[4] P. Milanfar, "A model of the effect of image motion in the radon transform domain," IEEE Trans. Image Process., vol.8, no.9, pp.1276–1281, Sept. 1999.
[5] J. Kim and R. Park, "A fast feature-based block matching algorithm using integral projections," IEEE J. Sel. Areas Commun., vol.10, no.5, pp.968–971, June 1992.
[6] K.L. Chung and L.C. Chang, "A new predictive search area approach for fast block motion estimation," IEEE Trans. Image Process., vol.12, no.6, pp.648–652, June 2003.
[7] C. Tu, T. Tran, and P. Topiwala, "A hybrid feature/image block motion estimation approach," ITU-T/VCEG M.26.doc, Austin Meeting, April 2001.
[8] T. Yamada, M. Ikekawa, and I. Kuroda, "Fast and accurate motion estimation algorithm by adaptive search range and shape selection," Proc. ICASSP, vol.2, pp.897–900, 2005.
[9] Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, H.264/AVC Reference Software JM10.1 (online), http://bs.hhi.de/suehring/tml/download/
[10] J. Mitchell, W. Pennebaker, C. Fogg, and D. LeGall, MPEG Video Compression Standard, Chapman and Hall, 1997.

Anand Paul is currently pursuing the Ph.D. degree in electrical engineering at National Cheng Kung University, Taiwan, R.O.C. His research interests include algorithms and architectures for motion estimation in video, and digital video SoC design for H.264/AVC.
Jhing-Fa Wang is a Chair Professor at National Cheng Kung University, Tainan, Taiwan. He received his Bachelor and Master degrees from the Department of Electrical Engineering, National Cheng Kung University, Taiwan, in 1973 and 1979, respectively, and his Ph.D. degree from the Department of Computer Science and Electrical Engineering, Stevens Institute of Technology, U.S.A., in 1983. He was elected an IEEE Fellow in 1999 and is now the Chairman of the IEEE Tainan Section. He received outstanding awards from the Institute of Information Industry in 1991 and from the National Science Council of Taiwan in 1990, 1995, and 1997. He was invited to give a keynote speech at PACLIC 12 (Pacific Asia Conference on Language, Information and Computation), Singapore, and served as the general chairman of ISCOM 2001 (International Symposium on Communication), Taiwan. He has developed a Mandarin speech recognition system called Venus-Dictate, known as a pioneering system in Taiwan. He was an associate editor for the IEEE Transactions on Neural Networks and the IEEE Transactions on VLSI Systems. He is currently leading a research group of different disciplines for the development of "Advanced Ubiquitous Media for Created Cyberspace." He has published about 91 journal papers and 217 conference papers and obtained 5 patents since 1983. His research areas include wireless content-based media processing, image processing, speech recognition, and natural language understanding.
Jia-Ching Wang received the M.S. and Ph.D. degrees in electrical engineering from National Cheng Kung University, Tainan, Taiwan, in 1997 and 2002, respectively. His research interests include signal processing and VLSI architecture design. Dr. Wang is a member of the Phi Tau Phi Scholastic Honor Society. He is also a member of IEEE, ACM, and IEICE.
An-Chao Tsai is currently pursuing the Ph.D. degree in electrical engineering at National Cheng Kung University, Taiwan, R.O.C. His research interests include architectures for entropy coding, video processing, and digital video SW/HW co-design.
Jang-Ting Chen received his B.S. degree in electronics engineering from National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C., in 1998. His research interests include intelligent video coding technology, H.264/AVC video coding and associated VLSI architecture.