Copyright © 2009 American Scientific Publishers. All rights reserved. Printed in the United States of America.

Journal of Low Power Electronics Vol. 5, 1–16, 2009

Adaptive Global Elimination Algorithm for Low Power Motion Estimation

Ajit Gupte 1,2,∗ and Amrutur Bharadwaj 2

1 Texas Instruments India Pvt. Ltd., Bangalore, 560093, India
2 Indian Institute of Science (ECE Department), Bangalore, 560012, India

(Received: 22 October 2008. Accepted: 19 February 2009)

Motion estimation typically consumes 50% to 70% of the total power in a video encoder. Optimizing the power consumption of the motion estimation process is therefore of great importance to low power video applications. Power dissipation increases with computational complexity, but reducing motion estimation complexity is usually associated with an increase in bit rate and a loss of quality. We explore a set of algorithms that reduce the complexity of motion estimation by adaptively changing the matching complexity based on macro-block features, yet incur only a modest cost in terms of bit rate increase and quality loss. The adaptive techniques are applied to the global elimination algorithm, a well known motion estimation algorithm. The global elimination algorithm uses fixed partition sizes and shapes irrespective of the nature of the macro-block. We show that by adapting the partition sizes and shapes according to macro-block features such as variance and Hadamard coefficients, the computational complexity of the global elimination algorithm can be significantly reduced with only a small increase in bit rate. We also propose a novel center-biased search order with an early termination method designed to work with the global elimination algorithm. The adaptive match and center-biased search together result in around a 57% reduction in computational complexity and a 50% reduction in power dissipation compared to the original global elimination algorithm.

Keywords: Block Matching, Global Elimination Algorithm (GE), Adaptive Global Elimination Algorithm (AGE), Motion Estimation (ME), Sum of Absolute Differences (SAD), Sum of Absolute Differences of Block-Sums (SADM), Peak Signal to Noise Ratio (PSNR), Macro-Block (MB), Hadamard Transform, Pixel Variance.

1. INTRODUCTION

Modern video coding methods use a block matching technique for motion estimation. For each 16 × 16 pixel 'macro-block' (MB) in the current frame, a best match is found within a window (the search window) in the reference frame. The matching criterion used is typically the sum of absolute differences of pixels, or 'SAD.' The complexity of motion estimation, and the associated power consumption, amounts to 50% to 70% of the total encoder complexity [1] and power dissipation. Various algorithms have been proposed to reduce this complexity. While lossless (optimal) algorithms such as successive elimination [2] and integral projection [3] achieve the same quality as the full search algorithm (FS), they tend to be too expensive for real time processing in applications with high resolution video.
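As a concrete baseline for the discussion above, the following pure-Python sketch implements full-search block matching with a SAD criterion. The function names, the representation of frames as lists of pixel rows, and the small search range are our own illustrative choices, not taken from the paper:

```python
def sad(cur, ref_frame, rx, ry, n=16):
    """Sum of absolute differences between an n x n current block and the
    n x n reference block whose top-left corner is at (rx, ry)."""
    return sum(abs(cur[y][x] - ref_frame[ry + y][rx + x])
               for y in range(n) for x in range(n))

def full_search(cur, ref_frame, mb_x, mb_y, search_range=8, n=16):
    """Exhaustive block matching: evaluate SAD at every integer position in a
    +/- search_range window and return (best motion vector, best SAD)."""
    h, w = len(ref_frame), len(ref_frame[0])
    best_mv, best_sad = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = mb_x + dx, mb_y + dy
            if 0 <= rx <= w - n and 0 <= ry <= h - n:
                cost = sad(cur, ref_frame, rx, ry, n)
                if cost < best_sad:
                    best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad
```

Every faster algorithm discussed below trades some of this exhaustive search's quality for a reduction in the number of positions visited or in the per-position matching cost.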

∗ Author to whom correspondence should be addressed. Email: [email protected]

In order to reduce complexity, many algorithms have been proposed that restrict the number of search points to a small number. Gradient descent based search algorithms such as the Three Step Search (TSS), the New Three Step Search (NTSS), and hexagon based search [4–6, 19] are some examples. Such algorithms assume a 'unimodal error surface,' in which the matching error increases monotonically with distance from the global minimum. In high resolution applications, such algorithms cause excessive quality loss compared to the full search method due to local minima in the search space. The JVT reference software for H.264 implements the Uneven cross Multi-hexagon Search algorithm (UMHexagonS) [15, 16, 18]. It avoids the local minima problem of traditional gradient based search algorithms by combining multiple search strategies. For example, the UMHexagonS algorithm divides the search into multiple stages, such as the uneven cross, the multi-hexagon, and finally the hexagon based and diamond search patterns. The number of search points is

1546-1998/2009/5/001/016

doi:10.1166/jolpe.2009.1010



significantly larger than in the traditional gradient based approaches, but the local minima problem is greatly reduced. However, due to the multi-step approach that uses a different complex search pattern in each stage, the implementation complexity is high and the data flow is irregular. This can be an issue in real time applications that demand regular data flow and simple control.

Apart from search optimization, there are many algorithms that optimize the motion estimation task by reducing the matching complexity at each search position. Some algorithms simplify the SAD calculation by pixel sub-sampling, exploiting the spatial correlation of pixels, e.g., alternate pixel patterns, quincunx patterns, etc. Refs. [7–10] describe algorithms based on the idea that high gradient pixels are the major contributors to distortion (i.e., mismatch between the reference and the current MBs). These algorithms prioritize high gradient pixels in distortion measurement; both lossy and lossless variants are explored. In Ref. [9], an adaptive decimation approach was used that chooses pixels based on individual pixel gradients. In Ref. [10], an attempt was made to simplify the addressing of high gradient pixels compared to Ref. [9] by limiting pixel selection to predefined patterns. Both approaches result in lower motion estimation complexity than uniform sub-sampling, which does not take macro-block features into account. However, addressing pixels in a somewhat random fashion adds to the complexity of these algorithms. Further, since only a subset of pixels is used for the SAD calculation in sub-sampling based algorithms, the quality loss can be large for large sub-sampling ratios. Alternatively, a large number of pixels would have to be selected to keep the loss in quality small. A pixel truncation approach is used in Ref. [11] to reduce complexity.

Averaging of pixels, instead of sub-sampling, is also quite commonly used.
This avoids eliminating a large number of pixels from the distortion measurement, as happens with sub-sampling, although with the additional complexity of calculating mean values. In the mean pyramid method [12], a pyramid of reference frames is constructed by averaging over successively wider ranges of pixels. The motion search starts at the top of the pyramid and is refined at the lower levels. However, in this approach it is necessary to construct a hierarchy of reference frames at different resolutions. The successive elimination algorithm (SEA) [2] also follows an averaging approach. SEA works on the principle that the SAD of sums-of-pixels cannot be larger than the SAD of the pixels themselves. Complete SAD calculation can be avoided at many search points by using this bound. Multilevel SEA (MSEA) [21] algorithms help eliminate more candidates quickly by creating successively tighter bounds, and hence avoid full SAD calculation at most of the points. Ref. [22] proposes a lossy MSEA approach where the SAD values at lower levels are increased deliberately to increase the chance of early elimination. The predictive fine


granularity MSEA algorithm [20] increases the number of bounds, and hence helps early elimination, by successively refining blocks at a given SEA level based on block gradients. The modified winner update approach of Ref. [17] improves upon the fine granularity approach by avoiding, through thresholding, the refinement of blocks with low gradients. Both Refs. [20] and [17] always partition a given block into four sub-blocks. A limitation of all MSEA algorithms is that mean value planes have to be pre-calculated at each level and stored in memory. This increases the memory requirement significantly. Also, since the algorithm applies successively tighter bounds, it results in many redundant calculations at the lower levels. Ref. [20] tries to address this issue by skipping lower levels depending upon the recent history of levels used by neighboring macro-blocks.

The global elimination (GE) method [13] operates on similar principles as SEA. To simplify the hardware implementation, the algorithm uses a lossy two stage approach for integer pixel motion estimation. It also eliminates the requirement of pre-calculating reference planes at all SEA levels, which is a significant bottleneck in SEA and MSEA. Here, the mean values at each reference point are calculated on the fly. In Ref. [13], the current and reference 16 × 16 MBs are partitioned into 4 × 4 groups, with 4 × 4 pixels from the original MB in each partition (level 2 of MSEA). Ref. [23] uses 8 × 8 blocks (level 1) instead of 4 × 4 but achieves performance similar to the original GE algorithm by increasing the number of candidate points in the 2nd stage. In both MSEA and GE, no pixels are skipped, unlike in the sub-sampling approaches. Also, unlike other averaging approaches such as the mean pyramid, the averages are repeatedly calculated at each search position, and no search position is skipped. This makes the algorithm less susceptible to local minima. However, the original GE algorithm always uses the same number of partitions for every macro-block.
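The SEA principle underlying both SEA/MSEA and GE, that the SAD of sub-block sums can never exceed the true SAD, can be illustrated with a small sketch. The helper names and grid sizes below are our own illustrative choices:

```python
def block_sum(blk, x0, y0, w, h):
    """Sum of pixels in the w x h region with top-left corner (x0, y0)."""
    return sum(blk[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w))

def sea_bound(cur, ref, parts):
    """MSEA-style lower bound: SAD of sub-block sums over a parts x parts grid.
    By the triangle inequality this never exceeds the true SAD, and finer
    grids (larger parts) give tighter, i.e., larger, bounds."""
    n = len(cur) // parts
    return sum(abs(block_sum(cur, bx * n, by * n, n, n) -
                   block_sum(ref, bx * n, by * n, n, n))
               for by in range(parts) for bx in range(parts))
```

In a lossless search, a candidate position can be skipped without computing its full SAD whenever its `sea_bound` is already no smaller than the best SAD found so far.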
In many video sequences there are many plain MBs, i.e., MBs in which the pixel values are almost the same. This can happen, for example, in shadow regions, on walls, etc. For such MBs, a reduced partition count can give similar PSNR quality at a given bit rate, but at a reduced computational cost. Another drawback of the original GE algorithm is the fixed square shape of the partitions. MBs with strong features that cut across these partitions tend to get averaged out, which can result in incorrect matches. Another significant limitation of the traditional GE algorithm is that it performs the matching operation at all the search points in the entire search range. Even though the cost of matching at each reference position is small, the total cost of matching at all the reference points is still quite large and may not be necessary.

In this paper we propose solutions to address these drawbacks. We calculate the pixel variance of each MB to determine the right amount of partitioning to use. Lower variance MBs can do with a smaller number of partitions,


say 2 × 2, which results in lower complexity. MBs with higher pixel variance can use a greater number of partitions, for example 4 × 4, at an increased computational complexity. Hence the computation is adapted to the MB characteristics. To solve the second problem of finding a better partition shape, we estimate spatial frequency components using the Hadamard transform [24]. By making use of both variance and spatial frequency information, we obtain better macro-block partitionings that reduce the complexity of the SAD calculations. Finally, we introduce a novel early termination algorithm that can be used with global elimination. The original global elimination algorithm uses raster-scan order to traverse the search window. We show that by using a center-biased order along with early termination, one can greatly reduce the search complexity without significantly affecting quality.

The rest of the paper is organized as follows. In the next section, we give a brief background of the global elimination method and its limitations, and elaborate on its computation and memory access complexity as a function of partition sizes. In Section 3 we describe our experimental setup. In Section 4, we explain the idea of adaptive partitioning based on the MB variance and show the PSNR versus bit rate tradeoff. In Section 5 we describe the adaptive partitioning algorithm based on MB variance as well as MB features. In Section 6 we describe the early termination technique applicable to the GE or adaptive GE (AGE) algorithms. Section 7 describes an architecture that can implement the proposed match-adaptive global elimination algorithm, along with power estimation results. In Section 8 we present our conclusions.

2. OVERVIEW OF MOTION ESTIMATION WITH THE GLOBAL ELIMINATION (GE) ALGORITHM

Block based motion estimation involves finding, for each block (MB) within the current frame, the best matching block from the reference frame. Typically, a rectangular window in the reference frame is used as the search area for each macro-block in the current frame. The size of the search window may depend upon the video resolution, the amount of motion in the video, etc. The best matching position for the current macro-block in the search window is identified by a 'motion vector,' which is encoded along with the residue as part of the encoded bit stream. As mentioned in the introductory section, the matching criterion typically used is the sum of absolute differences, or SAD. Also, since the cost of coding motion vectors can be significant in modern video standards, modern motion estimation algorithms include an additional term for the motion vector cost in the matching function. The overall matching

function is thus the weighted sum of the SAD value and the motion vector cost:

    matching_cost = SAD + λ · motion_vector_cost    (1)
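How Eq. (1) might be evaluated can be sketched as follows. The exp-Golomb-style bit estimate and the weight λ = 4 are hypothetical illustrations; the paper does not specify its rate model:

```python
def mv_bits(mv, pred_mv):
    """Hypothetical exp-Golomb-style bit count for coding mv relative to its
    predictor (real encoders use standard-specific, often table-driven, rate
    estimates)."""
    bits = 0
    for d in (mv[0] - pred_mv[0], mv[1] - pred_mv[1]):
        v = 2 * abs(d) - (1 if d > 0 else 0)   # signed-to-unsigned mapping
        bits += 2 * (v + 1).bit_length() - 1   # length of the ue(v) codeword
    return bits

def matching_cost(sad_value, mv, pred_mv, lam=4):
    """Eq. (1): SAD plus a lambda-weighted motion vector cost."""
    return sad_value + lam * mv_bits(mv, pred_mv)
```

With this model, a candidate whose SAD ties with the predictor's position always loses to it, since any nonzero motion vector difference costs extra bits.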

Due to its regular data flow structure and simple 2-stage control flow, the global elimination algorithm is of interest for VLSI implementations. The global elimination (GE) algorithm carries out motion estimation in two stages. In the first stage, each current MB is divided into equal sized sub-blocks and the sub-block averages are calculated. The reference MB at each search point is identically divided into the same number of sub-blocks. In Ref. [13], the MBs were divided into 16 partitions of 4 × 4 pixels, which is the same as level 2 of the MSEA algorithm. The SAD of the sub-block averages (SADM) between the current MB and a reference MB is used as the matching measure rather than a complete SAD. The complexity reduction is achieved both because SADM is cheaper to calculate than SAD and because the averages at successive search points can be incrementally calculated from information saved at the previous reference point. The top 'N' MB candidates with the least cost function based on SADM values are selected for the 2nd stage search. During the 2nd stage, a full search (FS) is performed on all 'N' candidate MBs to find the best integer pixel motion vector.

Partitioning the MB into many small sub-blocks in the first stage results in more calculations per matching operation than partitioning it into a few larger sub-blocks, as described later in this section. A small number of partitions would greatly reduce the complexity of the first stage. However, a cruder partition also results in a poorer match than a finer partition, because a larger number of pixels are merged to find the mean values and much detail is lost. A larger number of top candidates then needs to be selected for the 2nd stage search. For example, Ref. [23] uses only four 8 × 8 partitions as against the sixteen 4 × 4 partitions of Ref. [13].
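The two-stage GE flow described above can be sketched as follows. This is a simplified pure-Python illustration; the function names and the small search range are our own, and the incremental reuse of sub-block sums between neighboring positions is omitted for clarity:

```python
def sub_block_sums(blk16, parts):
    """Sums over a parts x parts grid of equal sub-blocks of a 16 x 16 block."""
    n = 16 // parts
    return [sum(blk16[y][x] for y in range(by * n, (by + 1) * n)
                            for x in range(bx * n, (bx + 1) * n))
            for by in range(parts) for bx in range(parts)]

def global_elimination(cur, ref_frame, mb_x, mb_y, search_range=6, parts=4, top_n=5):
    """Stage 1: rank every search position by SADM (SAD of sub-block sums).
    Stage 2: compute the full SAD only on the top_n stage-1 survivors."""
    h, w = len(ref_frame), len(ref_frame[0])
    cur_sums = sub_block_sums(cur, parts)
    stage1 = []
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = mb_x + dx, mb_y + dy
            if 0 <= rx <= w - 16 and 0 <= ry <= h - 16:
                ref_blk = [row[rx:rx + 16] for row in ref_frame[ry:ry + 16]]
                sadm = sum(abs(c - r) for c, r in
                           zip(cur_sums, sub_block_sums(ref_blk, parts)))
                stage1.append((sadm, (dx, dy)))
    stage1.sort()                         # cheapest SADM candidates first
    best_mv, best_sad = None, float("inf")
    for _, (dx, dy) in stage1[:top_n]:
        rx, ry = mb_x + dx, mb_y + dy
        s = sum(abs(cur[y][x] - ref_frame[ry + y][rx + x])
                for y in range(16) for x in range(16))
        if s < best_sad:
            best_mv, best_sad = (dx, dy), s
    return best_mv, best_sad
```

A hardware implementation would keep a running list of the N best SADM values instead of sorting all positions, and would update the sub-block sums incrementally as the window slides.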
However, our analysis shows that the bit rate performance with the top 5 candidates selected for the 2nd stage with 16-partition GE is comparable to the bit rate with the top 60 to 65 candidates needed with 4-partition GE at a similar PSNR level. It is desirable to limit the number of top candidates in the 2nd stage, because the number of comparisons required to find the worst candidate at each reference point in the 1st stage increases with the number of top candidates. Also, calculating the full SAD (in the 2nd stage) on a large number of candidates located at arbitrary positions in the search window is costly. Thus, there is scope to reduce the computation cost by adaptively partitioning macro-blocks based on the contents of the individual current macro-blocks. Plain MBs can be partitioned into a smaller number of sub-blocks with little loss in quality, but a significant reduction in complexity.


3. EXPERIMENTAL SETUP

All the experiments are carried out using the JM reference software for H.264 from the JVT (version 12.4). The software was modified to support the motion estimation algorithms proposed in this paper. Bit rate and PSNR values were measured for a range of QP values. The rate distortion optimization was set to low complexity mode. IPPP coding was used and a single motion vector per MB was allowed. Intra coded blocks inside P frames were not allowed, in order to isolate the relative performance of the different motion estimation schemes. A mix of high and low motion sequences was used for the experiments, with resolutions varying from QCIF to HD. The intra frame period was 30 frames. The number of best candidates chosen from the GE first stage was fixed at 5 unless specifically mentioned otherwise. Also, the search range was fixed at ±32 pixels.

4. COMPLEXITY REDUCTION BY IDENTIFYING MACRO-BLOCKS WITH SMALL VARIANCE

Many video sequences contain spatial regions which are nearly plain, with nearly identical pixel values: for example, an indoor sequence with a background wall, or shadow areas with low contrast. Also, high resolution video sequences have higher spatial correlation of pixels and a higher concentration of plain MBs than lower resolution sequences. Since plain macro-blocks are typically embedded in plain regions, if we look at the SAD values for a plain MB with different motion vectors, we get a gradual error surface with many motion vector candidates yielding very small and similar SAD values. The resulting bit rate due to these MBs will also be very small due to the absence of high frequencies, and a small error in the motion vector estimate will not result in a large increase in bit rate. Also, since the motion estimation cost function includes the motion vector cost in addition to the SAD value (Eq. (1)), the cost function difference between candidates with nearly equal SAD will be dominated by the motion vector cost; this ensures that we obtain a correct estimate of the true motion with high probability, even in the presence of a large number of similar SAD values. Figures 1 and 2 illustrate the error surfaces of a plain macro-block and a macro-block with a large variance, respectively. The 'x' and 'y' axes represent motion vectors in integer pixel units, while the 'z' axis shows the 'Error,' or SAD, values. The plain macro-block has an error surface with a large flat area of small and nearly equal error (SAD) values in the center, while the macro-block with large variance has an uneven surface with a large number of local minima and maxima.

We use this observation to adapt the GE partitioning to plain regions. In the following experiment, we select between two possible partition sizes for the GE algorithm. We measure the pixel variance within the current MB. If the variance

Fig. 1. Error surface of a macro-block in a flat region from the 'Akiyo' sequence.

exceeds a certain threshold, then the MB is partitioned into 4 × 4 = 16 sub-blocks, each containing 16 pixels. If the variance is smaller than the threshold value, then the MB is partitioned into 2 × 2 = 4 sub-blocks of 64 pixels each. The variance measure used is the sum of absolute differences between the pixel values and the mean value of the MB. Tables II and III show the computational complexity reduction using the variance based selection method for a larger set of videos. (The complexity analysis and details of the cost calculations are described in Section 7.) Bit rate values with the different algorithms at a fixed QP value of 28 are also noted. It can be observed that the complexity reduction is highly dependent upon the video content. For sequences like coastguard and football, the variance of the majority of the MBs is significantly large, so 4 × 4 partitioning gets selected for most of the MBs, and we do not obtain much complexity reduction compared to GE4 × 4. However, for sequences like 'Akiyo' and many of the high resolution sequences like 'viper train,' 'rush hour,' etc., there is a significant percentage of MBs with small variance values. For these sequences, the savings obtained in the number of computations are significant. More importantly, the bit rate loss compared to GE4 × 4 with the variance

Fig. 2. Error surface of a macro-block from ‘coastguard’ sequence.


Table I. List of video sequences used in the experiments.

Video sequence   | Abbreviation | Resolution  | Number of encoded frames
Coast guard QCIF | CGQCIF       | 176 × 144   | 300
Football QCIF    | FBQCIF       | 176 × 144   | 130
Foreman QCIF     | FMQCIF       | 176 × 144   | 400
Mobile QCIF      | MobQCIF      | 176 × 144   | 300
Akiyo            | AKIYO        | 352 × 288   | 300
Coastguard D1    | CGd1         | 640 × 480   | 300
Foreman D1       | FMd1         | 640 × 480   | 400
Football D1      | FBd1         | 704 × 480   | 147
Mobile D1        | Mobd1        | 704 × 480   | 150
Tennis           | TN           | 704 × 480   | 150
Mobile calendar  | MC           | 1280 × 720  | 504
Parkrun          | PR           | 1280 × 720  | 30
Blue sky         | BLSK         | 1920 × 1080 | 217
Pedestrian area  | PED          | 1920 × 1080 | 375
Riverbed         | RB           | 1920 × 1080 | 250
Rush hour        | RH           | 1920 × 1080 | 494
Station          | STN          | 1920 × 1080 | 313
Sunflower        | SUNFL        | 1920 × 1080 | 494
Tractor          | TRACT        | 1920 × 1080 | 493
Viper train      | VPTRN        | 1920 × 1080 | 316

based selection is fairly small. From Section 7 we can see that the GE2 × 2 computational cost is 33/67 ≈ 49.25% of that of GE4 × 4. This cost reduction comes at the cost of a significant average bit rate degradation of about 12%, as can be seen in Tables II and III. The average bit rate degradation compared to GE4 × 4 for the threshold-1 case is just 0.66%, while for the threshold-2 case it is 3.62%. Since the QP value is kept constant, the PSNR degradation between GE4 × 4 and the threshold-1 or threshold-2 case is very small: average PSNR degradations of 0.0 dB and 0.03 dB were observed. The average computation cost savings compared to GE4 × 4 are about 6.3% and 17% respectively, whereas the peak cost savings are about 26% and 34% respectively. Compared to full search (not shown in the tables), the average GE4 × 4 bit rate loss is 1.74% and the average threshold-1 bit rate loss is 2.42%.

For the purpose of effort versus quality comparison, a hypothetical bit rate at the complexity equal to that of the variance based selection is obtained by linear interpolation between GE4 × 4 and GE2 × 2 using the following equations:

    scale_BR = GE4×4_bitrate − GE2×2_bitrate
    scale_Effort = GE4×4_effort − GE2×2_effort
    reduced_Effort = GE4×4_effort − VarianceMethod_effort    (2)

Hence,

    GE4×4_bitrate − projected_bitrate = reduced_Effort × scale_BR / scale_Effort

It can be clearly seen from Tables II and III that the bit rate degradation with variance based selection is much smaller than the bit rate degradation projected using the linear interpolation method. It should also be noted that as we increase the variance threshold value, the effectiveness of the complexity versus bit rate trade-off diminishes. Figure 3 shows this graphically for the 'blue sky' HD sequence. The 2nd point from the left on the dotted curve corresponds to threshold-1; the point to its right corresponds to threshold-2. The two extremes of the dotted curve are the same as GE4 × 4 and GE2 × 2. The effort relative to GE4 × 4 falls quite sharply for the threshold-1 case compared to the linear curve. The threshold-2 point is still better than the linear curve, but the drop in effort required compared to the threshold-1 point is more gradual. This can be reasoned intuitively as follows: as we relax (increase) the threshold, more and more macro-blocks with

Table II. Macro-block variance based partition selection (threshold-1).

Video sequence | GE4×4 bit rate (kbps) | Threshold-1 bit rate (kbps) | Computational cost vs. GE4×4 (%) | Threshold-1 bit rate increase vs. GE4×4 (%) | Projected threshold-1 bit rate increase vs. GE4×4 (%) | GE2×2 bit rate increase vs. GE4×4 (%)
AKIYO   | 13628   | 13692   | 91.17 | 0.47 | 0.68 | 3.91
CGd1    | 406049  | 407681  | 99.80 | 0.40 | 0.07 | 16.88
FMd1    | 196058  | 197628  | 94.01 | 0.80 | 1.27 | 10.75
FBd1    | 390682  | 392193  | 98.94 | 0.39 | 0.31 | 14.56
Mobd1   | 608323  | 609265  | 98.06 | 0.15 | 0.25 | 6.52
TN      | 455784  | 456952  | 97.96 | 0.26 | 0.56 | 13.86
MC      | 474456  | 479303  | 89.19 | 1.02 | 3.70 | 17.37
PR      | 2634177 | 263603  | 99.91 | 0.07 | 0.01 | 5.25
BLSK    | 895104  | 904092  | 82.76 | 1.00 | 2.51 | 7.39
PED     | 674839  | 680762  | 91.00 | 0.88 | 2.98 | 16.79
RB      | 3715682 | 3716635 | 99.96 | 0.03 | 0.01 | 11.34
RH      | 538382  | 548811  | 88.66 | 1.94 | 3.09 | 13.83
STN     | 290597  | 294101  | 93.78 | 1.21 | 3.08 | 25.11
SUNFL   | 454952  | 455925  | 98.68 | 0.21 | 0.13 | 4.82
TRACT   | 1119564 | 1121235 | 97.62 | 0.15 | 0.70 | 14.88
VPTRN   | 447238  | 454206  | 77.95 | 1.56 | 5.74 | 13.20
Average |         |         | 93.71 | 0.66 | 1.57 | 12.28


Table III. Macro-block variance based partition selection (threshold-2).

Video sequence | GE4×4 bit rate (kbps) | Threshold-2 bit rate (kbps) | Computational cost vs. GE4×4 (%) | Threshold-2 bit rate increase vs. GE4×4 (%) | Projected threshold-2 bit rate increase vs. GE4×4 (%) | GE2×2 bit rate increase vs. GE4×4 (%)
AKIYO   | 13628   | 13809   | 80.34 | 1.33  | 1.52  | 3.91
CGd1    | 406049  | 418257  | 91.14 | 3.01  | 2.95  | 16.88
FMd1    | 196058  | 204158  | 79.63 | 4.13  | 4.31  | 10.75
FBd1    | 390682  | 411041  | 79.41 | 5.21  | 5.91  | 14.56
Mobd1   | 608323  | 612406  | 94.05 | 0.67  | 0.76  | 6.52
TN      | 455784  | 484931  | 73.93 | 6.39  | 7.12  | 13.86
MC      | 474456  | 488572  | 83.83 | 2.98  | 5.53  | 17.37
PR      | 2634177 | 2646466 | 96.87 | 0.47  | 0.32  | 5.25
BLSK    | 895104  | 918647  | 75.77 | 2.63  | 3.53  | 7.39
PED     | 674839  | 707438  | 76.77 | 4.83  | 7.69  | 16.79
RB      | 3715682 | 3750626 | 94.48 | 0.94  | 1.23  | 11.34
RH      | 538382  | 566081  | 73.73 | 5.14  | 7.16  | 13.83
STN     | 290597  | 326467  | 77.92 | 12.34 | 10.93 | 25.11
SUNFL   | 454952  | 459618  | 91.03 | 1.03  | 0.85  | 4.82
TRACT   | 1119564 | 1135164 | 91.34 | 1.39  | 2.54  | 14.88
VPTRN   | 447238  | 471779  | 66.08 | 5.49  | 8.83  | 13.20
Average |         |         | 82.89 | 3.62  | 4.45  | 12.28

larger variance values get partitioned into 2 × 2 blocks, thereby resulting in considerable quality loss. In the limit, with a very large threshold value, the performance of variance based selection will match GE2 × 2 performance. As shown in this section, the variance based partitioning method results in cost reduction at relatively low quality loss. However, the method is most effective for low variance threshold values, and hence the cost reduction that can be obtained is small for video sequences with a small percentage of low variance regions. In subsequent sections, we look at macro-block features other than just the variance to explore further cost reduction options.
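The variance based selection of Section 4 reduces to a few lines. The threshold below is an illustrative stand-in for the paper's empirically tuned threshold-1 and threshold-2 values:

```python
def mb_activity(mb):
    """The paper's variance measure: sum of |pixel - MB mean| over the MB."""
    pixels = [p for row in mb for p in row]
    mean = sum(pixels) / len(pixels)
    return sum(abs(p - mean) for p in pixels)

def choose_partition(mb, threshold=2000.0):
    """Return the GE grid to use for this MB: 4 (i.e., 4 x 4 = 16 sub-blocks)
    for high variance MBs, 2 (i.e., 2 x 2 = 4 sub-blocks) for plain ones.
    The threshold value here is illustrative, not the paper's tuned setting."""
    return 4 if mb_activity(mb) > threshold else 2
```

A perfectly flat macro-block has zero activity and always takes the cheap 2 × 2 grid, while textured content falls back to the full 4 × 4 grid.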

Fig. 3. Computational effort reduction with variance based selection method as a function of variance threshold.

5. FEATURE BASED MACRO-BLOCK PARTITIONING

Fig. 4. Edge adaptive partition selection: difference in distortion with two different identical-complexity methods of partitioning.

In Ref. [25], we introduced feature based partitioning using the Hadamard transform. In this section, we begin by explaining the motivation behind adaptive partitioning based on MB features, with an example. We then briefly describe the algorithm of Ref. [25] and its limitations, before proposing a new method based on a hierarchical partitioning approach that analyzes both macro-block features and variance.

Figure 4 illustrates the idea using a macro-block containing vertical stripes. A current 16 × 16 MB is shown against a background reference region. The macro-block consists of dark and bright stripes of 16 × 4 pixels each. Method 1 partitions the macro-block into vertical stripes with mean values m1 to m4, while method 2 partitions the block into horizontal stripes. Assume that all the pixels in the bright region have a luminance value of 100, while those in the dark region have a luminance value of 0. Then, if the macro-block is moved in the horizontal direction against the reference region by one pixel position, the resulting GE distortion using method 1


will be 100 ∗ 16 ∗ 4 = 6400. This is the same as the full SAD distortion between the two blocks. However, if method 2 is used instead, the resulting distortion will be zero. Any movement in the vertical direction results in identical distortion under either method. Thus it is clearly beneficial to partition the block into 16 (vertical) × 4 (horizontal) sub-blocks rather than 4 × 16 sub-blocks, since this greatly reduces the number of matching candidates compared to method 2. In real video sequences it would be rare to find macro-block features aligned with the partitions as in Figure 4. However, the fact that vertical partitions are more effective for blocks with intensity variation in the horizontal direction still holds.

Refs. [7–10] discussed partial distortion elimination techniques as well as lossy algorithms based on the idea that high gradient pixels contribute most of the distortion. GE performs averaging of groups of pixels within a block, which corresponds to creating a smaller resolution block with larger macro-pixels. We propose that, by avoiding the averaging of pixels across macro-block features, a smaller resolution block is created in which all the macro-pixels are high gradient pixels. These high gradient macro-pixels are then used for distortion measurement. Thus, the idea described in this section, of block partitioning based on macro-block features, is consistent with the idea of giving higher priority to high gradient pixels for distortion measurement.

5.1. Adaptive Partitioning Based on 4 × 4 Hadamard Transform Coefficients (HT4 × 4-AGE)

In Ref. [25], spatial frequency components were estimated using the Hadamard transform of 4 × 4 sub-block mean values, and the dominant transform coefficients were used to select the partitioning for the MB. Different partitioning options were chosen based on the dominant Hadamard coefficients.
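The Figure 4 stripes example can be checked numerically. The sketch below builds the striped block with luminance values 100 and 0 as in the text, shifts it one pixel horizontally, and compares the block-sum distortion under the two partitionings (helper names are our own):

```python
def stripe_block(x0, size=16, period=4):
    """size x size block cut at horizontal offset x0 from an endless pattern of
    vertical stripes: luminance 100 for even stripes, 0 for odd ones."""
    row = [100 if ((x0 + x) // period) % 2 == 0 else 0 for x in range(size)]
    return [list(row) for _ in range(size)]

def partition_sad(cur, ref, part_h, part_w):
    """SAD of sub-block sums using partitions of part_h x part_w pixels."""
    total = 0
    for y0 in range(0, 16, part_h):
        for x0 in range(0, 16, part_w):
            c = sum(cur[y][x] for y in range(y0, y0 + part_h)
                              for x in range(x0, x0 + part_w))
            r = sum(ref[y][x] for y in range(y0, y0 + part_h)
                              for x in range(x0, x0 + part_w))
            total += abs(c - r)
    return total

cur = stripe_block(0)                     # current MB
ref = stripe_block(1)                     # reference region shifted one pixel
full_sad = sum(abs(c - r) for cr, rr in zip(cur, ref) for c, r in zip(cr, rr))
method1 = partition_sad(cur, ref, part_h=16, part_w=4)   # vertical 16 x 4 parts
method2 = partition_sad(cur, ref, part_h=4, part_w=16)   # horizontal 4 x 16 parts
# full_sad == 6400, method1 == 6400, method2 == 0
```

Method 1 preserves the full 6400 distortion of the mismatched position, while method 2 reports zero and would wrongly treat the shifted position as a perfect match.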
The effectiveness of the algorithm was demonstrated using inverted selection and random selection methods. E.g., if Hadamard coefficient f(1, 4) is dominant in the macro-block, then the partitioning shown as method 1 of Figure 4 is performed. In reality, a macro-block may contain more than one dominant frequency component. Partitioning should therefore take into account all the dominant frequencies. So, if both f(1, 2) and f(4, 1) are dominant, then the macro-block is divided into four horizontal and two vertical partitions.

5.2. Hierarchical Partitioning Approach (HP-AGE)

When a pattern within a macro-block matches closely with one of the 16 Hadamard basis frequency patterns, the partitioning proposed in Section 5.1 results in averaging of pixels in an ideal manner, where no edges within the macro-block are averaged. However, this type of partitioning can result in excessive partitioning of the

Fig. 5. Block with multiple significant Hadamard coefficients.

macro-block where it may not be necessary. This is illustrated in Figure 5. The macro-block in this figure has significant features only in the bottom left region. However, HT4 × 4-AGE would partition this block into four rows and two columns. The top half of the macro-block does not have any features, but is excessively partitioned, resulting in increased complexity. In order to create partitions more efficiently, it is necessary to analyze the features within a macro-block in a more localized manner. Also, the 4 × 4 Hadamard transform based algorithm does not take into account the variance of pixels while creating partitions. We showed in Section 4 that variance based selection can yield significant complexity reduction.

In this section, we propose a new adaptive global elimination algorithm using hierarchical partitioning (referred to as HP-AGE). For this, we begin by dividing the macro-block into four 8 × 8 blocks, which is the same as GE2 × 2. Each block of 64 pixels is analyzed using the sub-block variance as well as the 2 × 2 Hadamard transform of the four mean values of the 4 × 4 pixel groups within the sub-block. Based on this analysis, the block can be further refined into 1, 2, 3, or 4 sub-blocks as dictated by the algorithm. If the sub-block has no significant frequency coefficient and its variance is small, then it is not partitioned further. On the other hand, if transform coefficient (1, 0)

is strong, while other coefficients are negligible, then the 8 × 8 sub block is partitioned into two 4 × 8 sub blocks. Similarly, we decide partitioning in favor of two 8 × 4 sub-blocks if 0 1 frequency component is large and all other frequency components are small. Further, in each of these two cases, variance of pixels in each sub-partition is separately analyzed. For example, if two 4 × 8 subpartitions were chosen, and the variance within the top sub-partition is larger than certain threshold, then that subpartition is further divided into two partitions of 4 × 4 pixels. If both 0 1 and 1 0 frequency coefficients are large or the coefficient 1 1 is large, it leads to partitioning the block into 4 blocks of 4 × 4 pixels each. Also, a large sub-block variance at 8×8 level in absence of a single significant frequency component also leads to four partitions of 4 × 4 pixels. Figure 6 explains the pseudo-code for the 7


Fig. 6. Pseudo-code to determine partitions within each 8 × 8 sub-block of an MB.

Fig. 8. Partitioning with HP-AGE in the presence of diagonal edges.

algorithm. Parameters and variables in the pseudo-code are defined below:
• Frq_Thr: Frequency threshold value. If a Hadamard transform coefficient exceeds this value, then that coefficient is considered a significant frequency component.
• Var_Thr: Variance threshold, defined for a block of 32 pixels (4 × 8 or 8 × 4).
• HmdTr[i][j]: Hadamard transform coefficient (i, j).
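As a concrete illustration, the per-block decision flow described above (and given as pseudo-code in Fig. 6) can be sketched as follows. This is a simplified sketch, not the exact pseudo-code from the figure; the function names and the row-of-rows block layout are our own, and the thresholds correspond to Frq_Thr and Var_Thr:

```python
def mean4x4(block, r, c):
    # Mean of the 4x4 region whose top-left pixel is (r, c).
    return sum(block[r + i][c + j] for i in range(4) for j in range(4)) / 16.0

def variance(pixels):
    m = sum(pixels) / len(pixels)
    return sum((p - m) ** 2 for p in pixels) / len(pixels)

def partition_8x8(block, frq_thr, var_thr):
    """Decide the partitioning of one 8x8 sub-block of a macro-block.

    Returns a list of (height, width) partition shapes, sketching the
    Fig. 6 decision flow described in the text."""
    m00, m01 = mean4x4(block, 0, 0), mean4x4(block, 0, 4)
    m10, m11 = mean4x4(block, 4, 0), mean4x4(block, 4, 4)
    # 2x2 Hadamard transform of the four 4x4 mean values.
    h01 = (m00 - m01) + (m10 - m11)   # horizontal frequency, coefficient (0, 1)
    h10 = (m00 + m01) - (m10 + m11)   # vertical frequency, coefficient (1, 0)
    h11 = (m00 - m01) - (m10 - m11)   # diagonal frequency, coefficient (1, 1)
    sig01, sig10, sig11 = abs(h01) > frq_thr, abs(h10) > frq_thr, abs(h11) > frq_thr
    if not (sig01 or sig10 or sig11):
        # No significant frequency: keep 8x8 if smooth, otherwise refine fully.
        flat = [p for row in block for p in row]
        return [(8, 8)] if variance(flat) < var_thr else [(4, 4)] * 4
    if (sig01 and sig10) or sig11:
        return [(4, 4)] * 4
    if sig10:
        # Only the vertical frequency is strong: two 4x8 (top/bottom) halves.
        halves = ([p for row in block[:4] for p in row],
                  [p for row in block[4:] for p in row])
        shape = (4, 8)
    else:
        # Only the horizontal frequency is strong: two 8x4 (left/right) halves.
        halves = ([p for row in block for p in row[:4]],
                  [p for row in block for p in row[4:]])
        shape = (8, 4)
    parts = []
    for half in halves:
        # A high-variance 32-pixel half is refined into two 4x4 partitions.
        parts += [(4, 4), (4, 4)] if variance(half) > var_thr else [shape]
    return parts
```

Applying this decision to each of the four 8 × 8 blocks of a macro-block yields between 4 and 16 partitions in total.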

With the hierarchical partitioning method (HP-AGE), the same block shown in Figure 5 would be partitioned into 5 partitions as shown in Figure 7, as against the 8 partitions that would have resulted with the 4 × 4 Hadamard coefficient based selection (HT4 × 4-AGE). The figure also illustrates (with a dotted line) the possibility of further partitioning one of the 4 × 8 sub-blocks if the variance within that region is large. Figure 8 shows partitioning with the HP-AGE algorithm in the presence of diagonal edges. Table IV shows the bit rate and cost comparison between the two adaptive GE algorithms, HT4 × 4-AGE and HP-AGE. The cost calculations are based on the method described in Ref. [25]. The different threshold

Fig. 7. Partition selection comparison between HT4 × 4-AGE and HP-AGE.


parameters for both the algorithms were adjusted so that the average bit rate difference between the two algorithms is very small. The experiment used a constant QP = 28. The PSNR difference between the two is negligible due to the constant QP. As can be seen, HP-AGE results in a significantly smaller (∼15%) total cost compared to HT4 × 4-AGE at a relatively small degradation in bit rate (0.42%). In fact, in 7 of the sequences, the HP-AGE algorithm results in a lower bit rate (negative bit rate increase percentage) at much lower cost compared to HT4 × 4-AGE. Table V shows a comparison between the HP-AGE algorithm bit rate and a linear interpolation between the GE4 × 4 and GE2 × 2 bit rates at equivalent effort as HP-AGE.

Table IV. Cost and bit rate comparison between HT4 × 4-AGE and HP-AGE.

Video sequence    HP-AGE compute cost relative to GE4 × 4 (%)    HT4 × 4-AGE compute cost relative to GE4 × 4 (%)    Bit rate increase from HT4 × 4-AGE to HP-AGE (%)

BLSK       56.99    59.84    −0.06
CGQCIF     62.26    75.79     1.66
CGd1       63.04    75.24    −0.74
FBQCIF     69.30    83.73     0.41
FBd1       53.73    86.76     4.16
FMQCIF     78.50    80.52     0.27
FMd1       56.26    81.21     3.01
MC         72.83    82.49    −0.77
MobQCIF    85.97    81.62    −1.71
Mobd1      75.37    79.56    −1.57
PED        52.53    71.12     1.17
PR         82.00    84.96    −0.81
RB         63.85    79.67     0.44
RH         51.74    71.68     0.70
STN        55.87    80.96     0.21
SUNFL      53.77    82.06     0.88
TN         58.99    88.11     2.58
TRACT      62.61    75.06    −1.22
VPTRN      54.03    64.23    −0.66

Average    63.67    78.14     0.42


Table V. Performance of the HP-AGE algorithm with respect to linearly projected bit rate at equivalent effort.

Video sequence    Full search bit rate (kbps)    GE4 × 4 bit rate (kbps)    HP-AGE bit rate (kbps)    Computational cost relative to GE4 × 4 (%)    HP-AGE bit rate increase vs. GE4 × 4 (%)    Projected HP-AGE bit rate increase vs. GE4 × 4 (%)

AKIYO       135.96      136.28      136.66     75.74    0.28    1.87
CGd1       3944.69     4060.49     4064.62     97.69    0.10    0.77
FMd1       1894.44     1960.58     1988.95     82.30    1.45    3.75
FBd1       3789.21     3906.82     3944.44     89.88    0.96    2.90
Mobd1      6064.61     6083.23     6092.91     92.74    0.16    0.93
TN         4472.94     4557.84     4571.38     92.61    0.30    2.02
MC         4668        4744.56     4795.18     83.47    1.07    5.66
PR        26305.54    26341.77    26344.34     99.59    0.01    0.04
BLSK       8884.89     8951.04     9055.35     75.29    1.17    3.60
PED        6536.02     6748.39     6947.78     73.63    2.95    8.72
RB        36457.07    37156.82    37406.61     93.14    0.67    1.53
RH         5229.33     5383.82     5575.94     69.04    3.57    8.44
STN        2869.57     2905.97     2975.11     83.00    2.38    8.42
SUNFL      4531.23     4549.52     4576.9      84.28    0.60    1.49
TRACT     11036.15    11195.64    11290.49     87.01    0.85    3.81
VPTRN      4377.31     4472.38     4611.75     66.15    3.12    8.81

Average                                        84.10    1.23    3.92
Average over HD sequences                      78.94    1.91    5.60

A constant QP of 28 was used. The cost calculations are based on the improved architecture described in Section 7. Savings in computational cost with HP-AGE are also shown. About 16% reduction in average computation cost is obtained, while the bit rate increase is only 1.23% on average. If we compare this with the variance threshold-2 based selection results (Table III), the HP-AGE algorithm has almost equal average computational effort; however, the average bit rate loss is much smaller. Further, for some of the video sequences, the HP-AGE method results in a lower bit rate at lower cost compared to the variance threshold-2 experiment. Figures 9–11 show the PSNR versus bit rate performance of three high resolution video sequences. For these

sequences, the HP-AGE algorithm results in significant savings in computational cost compared to GE4 × 4, as seen in Table V. In all these figures, the GE4 × 4 curve tracks the full search curve very closely, while the HP-AGE curve closely tracks the GE4 × 4 curve. Thus, by making use of both macro-block variance and spatial frequency coefficients, we can reduce the computational cost of the adaptive global elimination algorithm while maintaining a similar bit rate at a given PSNR. It can be noted that at lower bit rate levels, the HP-AGE and GE4 × 4 curves are even more tightly clustered with the full search curves. So, for low bit rate applications, one can potentially obtain higher computational cost savings at a very small increase in bit rate at the same PSNR. The PSNR

Fig. 9. "Blue Sky."

Fig. 10. "Pedestrian area."


Table VI. Computational cost reduction with the early termination technique applied to the HP-AGE algorithm.

Fig. 11. "Viper train."

for these three sequences at QP = 28 is in the range of 38 to 40 dB. The computational cost savings shown in Table V were obtained at QP = 28. So, at lower bit rates, significantly larger cost savings can be obtained by relaxing the thresholds in the pseudo-code (Fig. 6).

6. CENTER-BIASED SEARCH ORDER AND EARLY TERMINATION

The global elimination algorithm optimizes the matching phase of motion estimation compared to the full search method. In earlier sections of this paper, we demonstrated adaptive techniques to further optimize the matching operation. Although the global elimination algorithm is attractive for real-time hardware implementation due to its simple control and data-flow, it has the limitation that it searches all the reference positions in the search window. This may result in a large number of unwanted computations. In this section, we illustrate a simple modification to the algorithm to perform search phase optimization. In traditional global elimination, the search window is scanned in raster-scan order. In reality, however, motion vectors are center-biased; i.e., the distribution of motion vectors is heavily concentrated around the center of the search window. Several motion estimation algorithms perform search space optimization based on this fact. For example, the spiral full-search algorithm with early termination starts at the center of the search space and proceeds outwards, so that a good initial SAD value is found early in the search phase, which helps eliminate many candidates quickly. We apply this concept to the global elimination algorithm. For this, we modify the search order in the vertical direction; the horizontal search order is kept the same. So, the search starts at the row which contains the center of the search window, and then progresses outwards. The sequence of 'y' co-ordinates of the motion vectors that are searched is 0, −1, 1, −2, 2, −3, 3, and so on. (The search window is scanned in

Video sequence    GE4 × 4 bit rate (kbps)    Bit rate increase in HP-AGE vs. GE4 × 4 (%)    Bit rate increase in HP-AGE with early termination vs. GE4 × 4 (%)    Computational effort of HP-AGE with early termination as percentage of GE4 × 4 effort (%)

CGQCIF       800.46    0.18    0.26    49.55
FBQCIF      1596.68    0.42    0.45    48.94
FMQCIF       658.95    0.25    0.23    46.87
MobQCIF     1449.63    0.00    0.00    50.27
AKIYO        415.43    0.52    0.65    35.74
CGd1       12272.93    0.13    0.38    49.59
FMd1        7169.6     1.06    1.26    41.12
FBd1       12111.46    1.01    1.15    45.62
Mobd1      16352.24    0.15    0.18    46.78
TN         14887.59    0.34    0.39    46.87
MC         26851.98    0.85    0.94    40.96
PR         66681.35    0.00    0.02    50.53
BLSK       33620.48    0.57    0.72    34.23
PED        23224.92    2.98    3.82    34.82
RB         94062.58    0.47    0.49    47.29
RH         20309.08    2.65    2.99    34.25
STN        21324.47    0.67    0.79    40.65
SUNFL      14897.24    0.31    0.34    42.53
TRACT      39775.38    0.52    0.66    43.66
VPTRN      13272.08    3.47    5.23    28.56

Average                0.83    1.05    42.94

horizontal-first, then vertical, fashion.) The worst value of distortion (SADM + mvcost) among the top 'N' stored candidates is checked against a threshold at the end of each search row. If it is smaller than the threshold, then the first stage search is terminated, assuming that a good match has already been found. The search then proceeds to the second stage of global elimination. Table VI shows the bit rate performance of the early termination applied to the HP-AGE algorithm. A constant QP of 20 was applied. The minimum computation cost reduction compared to GE4 × 4 is close to 50%, and the average reduction is 58%. The bit rate increase compared to GE4 × 4 is 1.05%. The average bit rate increase and PSNR degradation in comparison with HP-AGE alone are negligible (about 0.2% and 0.003 dB respectively). The computation cost reduction compared to the GE4 × 4 cost is significantly larger than with HP-AGE alone. Also, it should be noted that the early termination algorithm reduces the computation cost significantly for all the sequences, ranging from QCIF to HD resolution.
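The row ordering and the per-row termination check just described can be sketched as follows. This is a simplified software sketch: `match_row`, `n_best` and `early_thr` are our placeholders for the per-row matching of the first GE stage, the number 'N' of stored candidates, and the termination threshold respectively:

```python
def center_biased_rows(search_range):
    """Row (y) visit order for a +/-search_range window: 0, -1, 1, -2, 2, ..."""
    order = [0]
    for d in range(1, search_range + 1):
        order += [-d, d]
    return order

def first_stage_search(match_row, search_range, n_best, early_thr):
    """First GE stage with center-biased row order and early termination.

    match_row(y) stands in for matching every candidate of row y and is
    assumed to return (distortion, motion_vector) pairs, where the
    distortion is SADM + mvcost.  The best n_best candidates are kept;
    the row loop stops early once the worst stored distortion falls
    below early_thr.
    """
    best = []
    for y in center_biased_rows(search_range):
        best.extend(match_row(y))
        best = sorted(best)[:n_best]          # keep the top-N candidates
        if len(best) == n_best and best[-1][0] < early_thr:
            break                             # a good match was found early
    return best
```

The surviving candidates then go to the second GE stage exactly as in the unmodified algorithm.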

7. ARCHITECTURE FOR HP-AGE AND COMPUTATIONAL COST AND POWER ESTIMATION

7.1. Architecture for HP-AGE

Since the GE algorithm performs a matching operation at each position in the search range, it is important to minimize


Fig. 12. Block-sum calculations at a new reference position in incremental fashion.

the number of computations at each search position. One way to reduce the number of computations is to reuse as many computations as possible. The architecture described in Refs. [13, 14] reused the pixel additions that produce column sums of four pixels. However, at each reference position, the column sums were added together to get the sums of 4 × 4 blocks. In Ref. [14], the same authors introduced a parallel version of the GE algorithm, along with a modified architecture that reuses the 4 × 4 block sum values rather than the 4 × 1 pixel column sums. This reduces the computational cost significantly (from 91 operations per reference position to 71). In this section, we demonstrate an architecture based on Ref. [14] that is suitable for the AGE algorithm. We also reduce the number of computations further compared to Ref. [14] by calculating each new block sum by subtracting the departing column sum and adding the new column sum, rather than adding all column sums afresh. In the case of 4 × 4 GE, this reduces the number of computations from 71 to 67. In AGE, where a block sum may be formed from 4 × 8 or 8 × 8 pixels, this method gives a larger benefit by reducing the number of column-sum additions from 7 to 2 for such block-sum calculations. Figure 12 shows the incremental block-sum calculation engine that processes 8 lines of pixels. Each block-sum engine receives 8 incoming pixels, as shown on the left hand side. Two such engines are needed to calculate block sums with 16 new incoming pixels at each reference position. In each block-sum engine, sum1 and sum3 give the 4 × 4 block sums of the top 4 rows and bottom 4 rows respectively. Sum2 and sum4 give the two 4 × 8 pixel sums. Sum5 and sum6 give the 8 × 4 and 8 × 8 sums respectively.
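The incremental block-sum update described above (subtract the departing column sum, add the arriving one) can be sketched as follows. This is a software sketch of the data-flow only, not the RTL; the function names are our own:

```python
def column_sums(rows):
    # Column sums for a strip of pixel rows (4 rows in the 4x4 GE case).
    return [sum(r[x] for r in rows) for x in range(len(rows[0]))]

def block_sums_incremental(rows, width=4):
    """Block sums at every horizontal offset in a strip of rows, computed
    incrementally: after the first full sum, each slide needs only one
    subtraction and one addition instead of width-1 fresh additions."""
    cols = column_sums(rows)
    s = sum(cols[:width])              # full sum only at the row start
    out = [s]
    for x in range(width, len(cols)):
        s = s - cols[x - width] + cols[x]   # drop old column, add new one
        out.append(s)
    return out
```

For a 4 × 8 or 8 × 8 block sum the same two-operation update applies, which is where the larger saving quoted above comes from.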

Figure 13 shows the shift register implementation that stores the block sums. A total of 12 shift registers are required to store the 6 sums from the top and bottom 8-pixel rows. In the original GE implementation with 4 × 4 block sums, only 4 shift registers were necessary. Only a subset of the shift registers needs to be active, based on the macro-block partitioning; unwanted shift registers are turned off to save power. The shift registers are implemented in circular buffer fashion, so for every new insertion into the shift register, only the read/write pointers are incremented rather than moving all 16 data values. This is more power efficient. Also, some multiplexing logic is necessary to select the appropriate block sum for each sub-block position, as shown. A maximum of 16 sub-block sums are required for each macro-block (when all sub-blocks are 4 × 4 pixel partitions). The multiplexing is designed to minimize the number of multiplexing stages. For each 8 × 8 block, there are up to 4 sub-block sums and a minimum of 1 sub-block sum. Figure 14 shows all possible partitioning options for an 8 × 8 block and the corresponding block-sum variable assignment that minimizes the multiplexing logic. Figure 13 shows the multiplexing necessary for the top-left 8 × 8 block. At most a 4:1 multiplexer is required, for the sum variable m11. Sub-block sum variables m12 and m13 can be assigned from two non-zero and one zero options. The sub-block sum variable m14 reads either a single non-zero value or a '0.' Similar multiplexing exists for the remaining three 8 × 8 blocks. The select signals for the multiplexers are determined by the partition calculator block (not shown) and are static over the entire search window for a macro-block.
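The pointer-based circular buffer mentioned above can be sketched behaviorally as follows; in hardware the "shift" is just a pointer increment, which is what saves the power. The class name is our own:

```python
class CircularShiftRegister:
    """Behavioral model of a shift register implemented as a circular
    buffer: inserting a value advances the write pointer instead of
    moving all stored entries."""

    def __init__(self, depth):
        self.buf = [0] * depth
        self.head = 0          # index of the oldest entry

    def shift_in(self, value):
        """Insert a new value and return the value shifted out."""
        oldest = self.buf[self.head]
        self.buf[self.head] = value
        self.head = (self.head + 1) % len(self.buf)
        return oldest
```

A 16-deep instance of this structure corresponds to one of the 12 block-sum shift registers of Figure 13.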


Fig. 13. Shift register implementation to store block sums, multiplexing logic and SADM calculator.

The result of the SADM calculator in Figure 13 is fed (after adding the mvcost function, which is not shown) to the SADM comparator block, which maintains the top 'N' candidates' SADM values and motion vector information in a register file. Figure 15 shows the SADM comparator. SADM values are compared using the 'max' logic tree to find the current worst SADM candidate. If the new incoming SADM value is smaller than the worst, then the worst SADM entry and the corresponding motion vectors are replaced by

Fig. 14. Eight possible block partitioning options and corresponding sub-block sum variable assignment.


the incoming candidate. In Ref. [13], after calculating the worst SAD value, a parallel comparator was used to find the matching SADM value in the register file, which gives the index of the candidate to be replaced. Additional checking hardware was necessary to break ties when there was more than one worst candidate. In Ref. [14], the checking hardware was avoided and the parallel comparator logic was reduced by assigning a unique tag to each SADM register. This requires propagating the tag value down the 'max' logic tree and still needs a parallel tag comparator to identify the index to be replaced. In Figure 15, we eliminate the requirement for tag logic. Instead, we maintain an additional bit of information that is already available in each max-value-finder block labeled 'max.' The s** output of the 'max' logic indicates whether the top or bottom input of the 'max' logic was chosen at the output. By tracing the s** values backward through the 'max' logic tree, the worst SADM index is found, which becomes the write address to the register file. In this way, we simplify the max logic tree by avoiding the maintenance of tag bits and also avoid the tag comparisons, which increase linearly with the number of SADM candidates. A large number of SADM candidates can therefore be maintained (for example, 15 or 31) without increasing the computational overhead significantly. Only log N comparisons take place for each incoming entry if the register file entries are static. If the incoming entry at cycle 'k' replaces an existing entry in the SADM register file, a maximum of 2 · log N comparisons will happen, including the comparisons due to the replaced entry at time 'k'


Fig. 15. Parallel SADM comparator.
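The tag-free worst-candidate selection of Figure 15 can be sketched in software as follows. This is a behavioral sketch only: instead of modeling the s** select bits explicitly, each tree node here carries forward the index of its winning input, which is exactly the information the backward trace of the s** bits recovers in hardware:

```python
def worst_index(sadm):
    """Index of the worst (largest) stored SADM, found with a binary
    'max' tree.  len(sadm) is assumed to be a power of two."""
    level = list(range(len(sadm)))            # leaf nodes: candidate indices
    while len(level) > 1:
        nxt = []
        for a, b in zip(level[0::2], level[1::2]):
            # One 'max' node: keep whichever input holds the larger SADM.
            nxt.append(a if sadm[a] >= sadm[b] else b)
        level = nxt
    return level[0]

def insert_candidate(sadm, mvs, new_sadm, new_mv):
    """Replace the current worst candidate if the new one is better."""
    w = worst_index(sadm)
    if new_sadm < sadm[w]:
        sadm[w], mvs[w] = new_sadm, new_mv
```

The winning index emerging at the tree root is the write address for the register file, with no tag storage or tag comparison needed.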

and the new entry coming at time 'k + 1.' The actual number of comparisons depends upon the position where the incoming and replaced SADM entries merge in the 'max' logic tree.

7.2. Computational Cost

The complexity of the first stage of GE is measured in terms of the number of add/sub/abs operations. The first stage of GE is the most computationally intensive. For example, the GE algorithm with all 4 × 4 partitions requires 67 × 4225 = 283,075 computations. Since operations other than the first stage of GE are infrequent in most cases, they are ignored in the analysis unless specifically mentioned. Specifically, the following operations are ignored in the complexity analysis:
• Every current macro-block is first analyzed and its partitioning is decided. This involves (i) block-sum calculation of the sixteen 4 × 4 sub-blocks, (ii) 2 × 2 Hadamard transform calculation for each of the four 8 × 8 blocks, (iii) variance calculation at the sub-block level, and (iv) comparison with the variance and Hadamard coefficient thresholds. All these operations together take fewer than 500 add/sub/abs calculations.
• At the beginning of a search row, some additional pixel sums must take place before the first block sum is available. So, for example, to get the block sum of a 4 × 4 block at the first position in a search row, 15 sums are required instead of the 3 that are needed for subsequent

positions in the same search row. For a search range of +/−32, this happens once every 65 positions, and hence can be ignored.
• In our experiments, we selected only the 5 best candidates from the first stage of GE or adaptive GE. The SADM comparator thus requires only 2 to 5 comparisons per reference point for both GE and adaptive GE, and these calculations are not counted in our relative comparisons. If the number of best candidates in the first stage is increased, then the number of SADM comparisons should be included in the cost analysis.
• The second stage of GE involves a full SAD calculation on the top 5 candidates, amounting to 767 × 5 = 3835 calculations. This number is small compared to the total cost of the GE or adaptive GE algorithm, and is ignored.

7.2.1. Computations that Determine the Computation Cost

Although Figure 12 shows 8 adders in a block-sum calculator, not all 8 adders are always active. For example, if both top 8 × 8 blocks are partitioned into 4 × 4 sub-blocks, then only the sum1 and sum3 adders and the corresponding shift registers are active. On the other hand, if both 8 × 8 blocks are not partitioned further, then the sum2, sum4 and sum6 adders are active. Regardless of the way each macro-block is partitioned, the 12 additions (adders shown on the left hand side of Fig. 12)


are always required to calculate the column sums from the incoming pixels. Further, the number of computations, and hence the power dissipated, in the SADM calculator in Figure 13 depends upon the total number of partitions in a macro-block. The total computational cost is calculated by adding all these computations together. When all four 8 × 8 blocks in a macro-block are partitioned as 4 × 4, a total of 67 add/abs/sub computations are needed for every reference position, whereas if no 8 × 8 block is partitioned further, only 33 computations are necessary. For other partitioning cases, the number of computations lies between 33 and 67. Figure 16 shows an example in which all the block sums are required, i.e., a total of 16 additions are needed to calculate the block sums. However, the total number of partitions in the macro-block is only 8, so the number of computations needed for the SADM calculation is reduced; a total of 51 computations are required in this case. Cost computations based on this analysis were performed for all video sequences, and these calculations are presented in Sections 4 and 5.

7.3. Comparison of Computational Cost with Sub-Sampling Techniques

Since sub-sampling is another approach to matching optimization in motion estimation, we compare the computational cost of the GE and HP-AGE algorithms with some of the popular sub-sampling algorithms. As seen, the computational cost of the GE4 × 4 algorithm per reference position is 67. With HP-AGE, this cost is reduced further by about 20–25%. In contrast, Ref. [10], which uses gradient based pixel sub-sampling, needed to select 40 to 44 high gradient pixels to achieve quality comparable to full search. In Ref. [7], 64 high gradient pixels were chosen. Many other sub-sampling algorithms, such as alternate pixel sub-sampling and standard sub-sampling, also use

25% of the pixels for SAD calculation. Thus, the number of computations required for the SAD calculation at each reference position using sub-sampling techniques can range from 119 to 191. The computational cost of the adaptive global elimination algorithm can thus be significantly lower than the computational cost of known (adaptive and non-adaptive) sub-sampling algorithms. Further, as already explained, the global elimination algorithm has the advantages of a regular and optimized memory access pattern and ease of implementation.

7.4. Power Estimation

The architecture described in the earlier sub-sections was implemented in RTL, and the integer pixel motion estimation hardware engine, consisting of the block-sum calculator, the shift registers and the SADM calculator, was synthesized to operate at 300 MHz in a 65 nm technology. Synopsys Design Compiler was used for synthesis. The logic area was 56 k gates. To support HD resolution, four such engines would be required, run in parallel. The power analysis was done using Synopsys PrimeTime-PX. The analysis was performed on the different types of macro-block partitions, and the net power saving was calculated for different video sequences based on the distribution of the various partition types. Table VII shows the effort (computation) reduction and power (mW) reduction comparison for some of the HD sequences. The average power dissipation with the GE4 × 4 algorithm was estimated to be 63.05 mW. We can observe that while the effort reduction is between 7% and 34%, the power savings are smaller, up to 22%. In fact, in one video sequence, 'RB,' the power goes up by a small amount. The main reason for this is that in the proposed scheme, we need more FIFOs to store intermediate partition results. As many as 10 out of the 12 FIFOs can be active for a given macro-block of a complex partition type in the proposed scheme, whereas only 4 FIFOs are needed for GE4 × 4.
The additional active FIFOs dissipate considerable power and offset the effect of the reduction in the number of computations. Hence it might

Fig. 16. An example of macro-block partitioning in which the number of additions required to calculate the block sums is maximum.
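The total cost figures quoted in Section 7.2 follow from a few lines of arithmetic, sketched below. The 67 and 33 per-position operation counts themselves come from the architecture described above; the function name is our own:

```python
def ge_first_stage_cost(ops_per_position, search_range=32):
    """Total first-stage operation count over a +/-search_range window."""
    positions = (2 * search_range + 1) ** 2   # 65 x 65 = 4225 for +/-32
    return ops_per_position * positions

# All-4x4 partitioning (67 ops/position) over a +/-32 search range:
print(ge_first_stage_cost(67))   # 283075, the figure quoted in Section 7.2

# Coarsest case, with no 8x8 block refined (33 ops/position):
print(ge_first_stage_cost(33))   # 139425
```

The second-stage cost of 767 × 5 = 3835 operations is negligible by comparison, which is why it is dropped from the relative cost analysis.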


Table VII. Power saving with respect to GE4 × 4 for the HP-AGE algorithm.

Video sequence    Number of computations relative to GE4 × 4 (%)    Power (mW)    Power dissipated relative to GE4 × 4 (%)

BLSK       75.29    53.34     84.60
PED        73.63    54.51     86.45
RB         93.14    63.47    100.67
RH         69.04    51.69     81.99
STN        83.00    58.52     92.81
SUNFL      84.28    60.41     95.81
TRACT      87.01    60.69     96.26
VPTRN      66.15    49.28     78.15

Average    78.94    56.49     89.59


Table VIII. Power saving percentage with respect to GE4 × 4 for the HP-AGE algorithm with early termination.

Video sequence    Net power saving compared to GE4 × 4 (%)

CGQCIF     48.70
FBQCIF     48.08
FMQCIF     47.84
MobQCIF    48.82
AKIYO      52.11
CGd1       48.35
FMd1       49.70
FBd1       46.63
Mobd1      48.74
TN         49.58
MC         52.39
PR         49.51
BLSK       56.46
PED        52.36
RB         45.18
RH         52.80
STN        49.53
SUNFL      45.94
TRACT      47.10
VPTRN      59.44

Average    49.96

be beneficial to restrict the kinds of partitions allowed in the adaptive scheme, with an eye on constraining the total number of active FIFOs. The effort reduction due to early termination, however, translates directly into an equivalent saving in power. Table VIII shows the net projected power saving percentages with both the adaptive matching and early termination optimizations, compared to GE4 × 4. For the HD sequences, the average power reduced from 63.05 mW to 30.83 mW, corresponding to a 51.10% power reduction compared to GE4 × 4.

8. CONCLUSION

We demonstrated an adaptive global elimination algorithm that significantly reduces matching complexity. It was shown that by using macro-block characteristics such as variance and Hadamard transform coefficients, the macro-block partitioning can be adapted to reduce the computational cost of the matching operation while maintaining similar motion estimation quality. The SAD computations can be done in an incremental, streaming fashion, allowing easy hardware implementation. Power analysis of an RTL implementation in a 65 nm process technology indicates that the power reduction is smaller than the reduction in the number of computations. This is due to the increased number of FIFOs required for complex partition types compared to the GE4 × 4 partitioning. Hence it will be useful to constrain the kinds of partitions used by the adaptive algorithm by considering the FIFO requirements along with the computation reductions. We have also proposed an early termination scheme based on a center-biased search order that significantly reduces the searching complexity of the global

elimination algorithm. The memory access patterns for this search technique are consistent with how the frames are stored and hence allow for efficient use of memory bandwidth. The combination of the search- and match-optimized adaptive global elimination algorithm results in 58% savings in computational cost and nearly 50% savings in power dissipation for integer pixel motion estimation compared to the global elimination technique.

Acknowledgments: We would like to acknowledge Dr. Ajit Rao of Texas Instruments India for his valuable feedback, and Soyeb Nagori and Anurag Jain of Texas Instruments India for their reviews and help with the experimental setup.

References
1. Y. Murachi et al., A 95 mW MPEG2 MP@HL motion estimation processor core for portable high resolution video application. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E88-A (2005).
2. W. Li and E. Salari, Successive elimination algorithm for motion estimation. IEEE Transactions on Image Processing 4, 105 (1995).
3. J. S. Kim and R.-H. Park, A fast feature-based block matching algorithm using integral projections. IEEE Journal on Selected Areas in Communications 10, 968 (1992).
4. J.-N. Kim and T.-S. Choi, A fast three-step search algorithm with minimum checking points using unimodal error surface assumption. IEEE Transactions on Consumer Electronics 44, 638 (1998).
5. L.-M. Po and W.-C. Ma, A novel four-step search algorithm for fast block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology 6, 313 (1996).
6. L. Reoxiang, Z. Bing, and M. L. Liou, A new three-step search algorithm for block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology 4, 438 (1994).
7. B. Tao and M. T. Orchard, Gradient-based residual variance modeling and its applications to motion-compensated video coding. IEEE Transactions on Image Processing 10, 24 (2001).
8. B. Montrucchio and D. Quaglia, New sorting based lossless motion estimation algorithms and a partial distortion elimination performance analysis. IEEE Transactions on Circuits and Systems for Video Technology 15, 210 (2005).
9. Y. L. Chan and W. C. Siu, New adaptive pixel decimation for block motion vector estimation. IEEE Transactions on Circuits and Systems for Video Technology 6, 113 (1996).
10. Y.-L. Chan, W.-L. Hui, and W.-C. Siu, A block motion vector estimation using pattern based pixel decimation. Proceedings of the 1997 IEEE International Symposium on Circuits and Systems (ISCAS '97) 2, 1153 (1997).
11. Z. L. He, C. Y. Tsui, K. K. Chan, and M. K. Liou, Low power VLSI design for motion estimation using adaptive pixel truncation. IEEE Transactions on Circuits and Systems for Video Technology 10, 669 (2000).
12. K. M. Nam, J.-S. Kim, R.-H. Park, and Y. S. Shim, A fast hierarchical motion vector estimation algorithm using mean pyramid. IEEE Transactions on Circuits and Systems for Video Technology 5, 344 (1995).
13. Y.-W. Huang, S.-Y. Chien, B.-Y. Hsieh, and L.-G. Chen, Global elimination algorithm and architecture design for fast block matching motion estimation. IEEE Transactions on Circuits and Systems for Video Technology 14, 898 (2004).
14. Y.-W. Huang, C.-H. Tsai, and L.-G. Chen, Parallel global elimination algorithm and architecture design for fast block matching motion estimation. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04) (2004), Vol. 5, pp. V-153–6.
15. Z. Chen, P. Zhou, and Y. He, Fast integer pel and fractional pel motion estimation for JVT. Document JVT-F01, Joint Video Team (JVT), 6th Meeting: Awaji Island, JP, December (2002).
16. Z. Chen, P. Zhou, and Y. He, Fast motion estimation for JVT. Document JVT-G016, Joint Video Team (JVT), 7th Meeting: Pattaya II, Thailand, March (2003).
17. S.-D. Wei, S.-W. Liu, and S.-H. Lai, Modified winner update with adaptive block partition for fast motion estimation. 2006 IEEE International Conference on Multimedia and Expo, July (2006), pp. 133–136.
18. C. A. Rahman and W. Badawy, UMHexagonS algorithm based motion estimation architecture for H.264/AVC. Proceedings of the 9th International Database Engineering and Application Symposium (IDEAS) (2005).
19. C. Zhu, X. Lin, and L.-P. Chau, Hexagon-based search pattern for fast block motion estimation. IEEE Transactions on Circuits and Systems for Video Technology 12, 349 (2002).
20. C. Zhu, W.-S. Qi, and W. Ser, Predictive fine granularity successive elimination for fast optimal block-matching motion estimation. IEEE Transactions on Image Processing 14, 213 (2005).
21. X. Q. Gao, C. J. Duanmu, and C. R. Zou, A multilevel successive elimination algorithm for block matching motion estimation. IEEE Transactions on Image Processing 9, 501 (2000).
22. Y. Song, Z. Liu, T. Ikenaga, and S. Goto, Lossy strict multilevel successive elimination algorithm for fast motion estimation. IEEE International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS) (2006), pp. 431–434.
23. C.-P. Fan and S.-W. Lin, Fast global elimination algorithm and low-cost VLSI design for motion estimation. TENCON 2007, IEEE Region 10 Conference, November (2007), pp. 1–4.
24. M. S. Porto, T. L. da Silva, R. E. C. Porto, L. V. Agostini, I. V. da Silva, and S. Bampi, Design space exploration on the H.264 4 × 4 Hadamard transform. 23rd NORCHIP Conference, November (2005), pp. 188–191.
25. A. Gupte and A. Bharadwaj, An adaptive, feature-based low power motion estimation algorithm. IEEE International Conference on Multimedia and Expo, June (2008), pp. 1013–1016.

Ajit Gupte

Ajit Gupte received the B.E. degree in Electronics and Telecommunications from Pune University in 1994 and the M.Tech in Integrated Electronics and Circuits from IIT Delhi in 1995. He joined Texas Instruments India in 1996, where he is currently a member of technical staff. He is pursuing his Ph.D. at the ECE Department, IISc Bangalore, in the area of video processing. His research interests include video and imaging algorithms, DSP processors and VLSI architectures.

Bharadwaj Amrutur

Bharadwaj Amrutur obtained his B.Tech in Computer Science and Engineering from IIT Bombay in 1990 and his Master's and Ph.D. in Electrical Engineering from Stanford University in 1994 and 1999 respectively. He has worked at Bell Labs, Agilent Labs and Greenfield Networks. He is currently an Assistant Professor in the ECE Department at IISc Bangalore, working in the areas of VLSI circuits and systems.
