Efficient Motion Estimation with Content-Adaptive Resolution

Ralph Braspenning and Gerard de Haan
Philips Research Laboratories, Prof. Holstlaan 4, Eindhoven, The Netherlands
[email protected]

Abstract

We present a motion estimation algorithm designed to fully exploit the flexibility offered by programmable platforms. This algorithm adapts its resolution to the image content, using the highest spatial accuracy only for those parts of the image where it is required, i.e. near the borders of moving objects.

Keywords: Motion estimation, content-adaptive algorithms, programmable platforms.

1. Introduction

The growing compute power of commercially available programmable platforms enables increasingly complex video processing algorithms to run in software, where they used to be feasible in dedicated hardware solutions only. Software implementations have a number of advantages over hardware solutions, like flexibility, reuse and portability, but also drawbacks like high cost and power dissipation. To balance the disadvantages, we need to fully exploit the advantages. In this paper, we present a motion estimation algorithm designed to use the flexibility of programmable platforms.

An attractive way to profit from the flexibility is to design complexity scalable video algorithms (SVAs). These algorithms are capable of dynamically trading resource usage for output quality in a near-optimal way, enabling dynamically adaptable systems [1]. Research regarding the general design of SVAs has been published before [2-6]. The basic approach in these earlier publications is to take an existing algorithm and make it scalable. As opposed to that approach, the algorithm presented in this paper is designed to exploit the flexibility by making it inherently scalable. At the same time, however, the algorithm can be used as a non-scalable efficient solution. Hence, scalability becomes a property instead of a design goal.

2. Motion Estimation

2.1 Block Matching

2-D motion estimation solves the following problem: given two successive luminance images $f(\vec{x}, n-1)$ and $f(\vec{x}, n)$, where $\vec{x}$ is the 2-D position in the image and $n$ is the picture number, find a vector field $\vec{d}(\vec{x}, n)$ such that

$f(\vec{x} - \vec{d}(\vec{x}, n), n - 1) \approx f(\vec{x}, n).$   (1)

However, 2-D motion estimation suffers from several problems, like the occlusion problem, the aperture problem and the sensitivity to noise [7]. Because of the ill-posed nature of the motion estimation problem, the algorithms need additional assumptions, or models, about the structure of the 2-D motion vector field. One popular approach is to assume that the motion vector, or model, is constant for a block of pixels. This approach is quite successful and is used for instance in MPEG encoding and in video format conversion. Typically, the dimensions of the blocks are constant for a given application, e.g. for MPEG-2 the block size is usually 16×16 and for scan-rate up-conversion it is 8×8. This introduces the constraint that

$\vec{d}(\vec{x}, n) = \vec{d}(\vec{x}\,', n), \quad \forall \vec{x}\,' \in B(\vec{x}),$   (2)

where $B(\vec{x})$ is the block of pixels at position $\vec{x}$, i.e.

$B(\vec{x}) = \{\vec{x}\,' \mid x'_i \,\mathrm{div}\, \beta_i = x_i \,\mathrm{div}\, \beta_i,\; i = 0, 1\},$   (3)

where $\vec{\beta} = (\beta_0\;\beta_1)^T$ contains the block dimensions.
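The grouping of pixels into blocks by equation (3) amounts to integer (floor) division of the pixel coordinates by the block dimensions. A minimal sketch, with a hypothetical helper name `same_block`:

```python
def same_block(x, x_prime, beta=(8, 8)):
    """Check whether two pixel positions fall in the same block (eq. 3).

    Positions are (horizontal, vertical) tuples; beta holds the block
    dimensions.  Integer division by the block size yields the block index,
    and two pixels belong to the same block iff their indices agree.
    """
    return all(x[i] // beta[i] == x_prime[i] // beta[i] for i in range(2))

# Pixels (3, 5) and (7, 0) share the 8x8 block with index (0, 0),
# while pixel (8, 5) lies in the neighbouring block:
assert same_block((3, 5), (7, 0), beta=(8, 8))
assert not same_block((3, 5), (8, 5), beta=(8, 8))
```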

Figure 1: Candidate Set configuration. C is the current block. S indicates a spatial and T a temporal candidate.

2.2 3-D Recursive Search Block Matching

Many block based motion estimation algorithms have been designed for different applications. We focus on scan-rate up-conversion applications and, therefore, take the 3-D Recursive Search (3DRS) motion estimation principle as a basis [8]. This motion estimation algorithm was designed for scan-rate up-conversion, which poses other requirements on the vector field than coding applications do. Particularly, for scan-rate up-conversion we require a true-motion vector field, which in general differs from the vector field yielding the lowest residual signal for motion compensated encoding. The 3DRS algorithm constructs a small set of candidate motion vectors,

$CS = \{\vec{c}_i \mid i = 0, \ldots, |CS| - 1\},$   (4)

obtained from spatio-temporal predictions.

Two assumptions underlie the 3DRS principle. First, objects are assumed to be larger than blocks. Consequently, the vector estimated for a neighbouring block is a good candidate for the current block. The blocks are processed in a certain scanning order, say from left to right and from top to bottom. Since some neighbouring blocks have already been estimated during the current scanning pass, their vectors are called spatial candidates, $\vec{c} = \vec{d}(\vec{x}_S, n)$, while the vectors of the other blocks have to be taken from an earlier scan and, therefore, are called temporal candidates, $\vec{c} = \vec{d}(\vec{x}_T, n-1)$. This is depicted in Figure 1. Next to spatial and temporal candidates, update candidates are added to the candidate set: take a spatial candidate and add a small random update vector $\vec{u}$ to it, i.e. $\vec{c} = \vec{d}(\vec{x}_S, n) + \vec{u}$. The objective is to find new motion vectors and to adjust the vector field to non-constant motion of objects. The update vectors can be relatively small because of the second assumption, which is that objects have inertia. This implies that the movement of objects varies continuously and not very fast from frame to frame. A typical candidate set contains two spatial candidates, one temporal candidate and one update candidate, resulting in four candidate vectors.

In order to find the motion vector for the current block, a match error $\varepsilon(\vec{c}, \vec{x}, n)$ is minimized over $\vec{c}$ to obtain the best matching output vector, i.e.

$\vec{d}(\vec{x}, n) = \arg\min_{\vec{c} \in CS} \varepsilon(\vec{c}, \vec{x}, n),$   (5)

where $\varepsilon(\vec{c}, \vec{x}, n)$ is the Sum of Absolute Differences (SAD),

$\varepsilon(\vec{c}, \vec{x}, n) = \sum_{\vec{x}\,' \in B(\vec{x})} | f(\vec{x}\,', n) - f(\vec{x}\,' - \vec{c}, n - 1) |.$   (6)
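The candidate evaluation of equations (5) and (6) can be sketched as follows. This is a minimal sketch assuming integer candidate vectors and ignoring image-border handling; the function names are our own, not from the paper:

```python
def sad(f_cur, f_prev, pos, size, c):
    """Sum of absolute differences (eq. 6) for candidate vector c.

    Frames are lists of rows of luminance values, pos = (y, x) is the
    top-left corner of the block, size = (h, w), and c = (dy, dx) is
    the candidate displacement into the previous frame.
    """
    (y0, x0), (h, w), (dy, dx) = pos, size, c
    return sum(abs(f_cur[y][x] - f_prev[y - dy][x - dx])
               for y in range(y0, y0 + h)
               for x in range(x0, x0 + w))

def best_candidate(f_cur, f_prev, pos, size, candidates):
    """Minimize the match error over the candidate set CS (eq. 5)."""
    return min(candidates, key=lambda c: sad(f_cur, f_prev, pos, size, c))
```

For a frame pair related by a pure two-pixel shift to the right, the candidate (0, 2) yields a zero match error and is selected.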

Note that the number of SAD computations per block is equal to the size of the candidate set, in a typical case four, making the 3DRS algorithm very efficient and suited for real-time applications [9].

3. Content-Adaptive Recursive Search

3.1 Adaptive Resolution

As already mentioned in Section 2.1, motion estimation suffers from several problems, like the occlusion problem, the aperture problem and the sensitivity to noise [7]. The block-based approach reduces the aperture problem and the sensitivity to noise, but does not solve the occlusion problem. A technique to reduce the occlusion problem uses three frames [10]. Here, we will focus on the other two problems. An important parameter for these two problems is the block size. For larger block sizes the estimation process is less sensitive to noise and the aperture is bigger, which reduces the aperture problem. However, larger block sizes reduce the spatial accuracy. For scan-rate up-conversion, a block size of 8x8 pixels has proven to be a good compromise and is used in commercial ICs [9]. These ICs are application specific hardware designs, in which the block size is constant; software implementations do not impose this restriction. In our new content-adaptive algorithm we propose to start the estimation process with large blocks, e.g. 32×32 pixels, and only reduce the block size at the places in the image where the spatial accuracy must be high, i.e. near the borders of moving objects. Hence, we combine the strengths of large and small blocks for the estimation process. This is different from hierarchical motion estimation methods, which generally generate multi-resolution versions of the whole image and perform complete motion estimation on all layers [7]. The method described in reference [11] starts with the lowest resolution and merges blocks depending on the content. Clearly, this is less efficient than our proposal, which focuses the calculations on regions that can profit.

3.2 Adaptive Candidate Set

At each resolution our method uses the 3DRS principle to estimate the motion. As explained in Section 2.2, this principle is based on a candidate set of motion vectors, which implements the search strategy. This candidate set can vary for each resolution: at different resolutions (block sizes), different compositions of the candidate set are appropriate. An example of such a set of candidate sets is listed in Table 1. The candidate positions are depicted in Figure 2.

Table 1: Example CS for each resolution.

Resolution   Candidate set
32x32        S1, S2, M, U2
16x16        S1, S2, S3, T
8x8          S1, S2, S3, T, U1, U2
4x4          S1, S2, S3, T
2x2          S1, S2, T
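The resolution-dependent candidate sets of Table 1 can be encoded as a simple lookup. This is one possible reading of the (flattened) table, with the update candidates placed at 32x32 and 8x8 as the prose of this section suggests; the names are illustrative only:

```python
# Hypothetical encoding of Table 1: which candidate types enter the
# candidate set CS at each (square) block size.  S = spatial, T =
# temporal, M = global parametric model, U = update candidate.
CANDIDATE_SETS = {
    32: ["S1", "S2", "M", "U2"],
    16: ["S1", "S2", "S3", "T"],
    8:  ["S1", "S2", "S3", "T", "U1", "U2"],
    4:  ["S1", "S2", "S3", "T"],
    2:  ["S1", "S2", "T"],
}

def candidate_set(block_size):
    """Return the candidate types used at a given block size."""
    return CANDIDATE_SETS[block_size]
```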

Figure 2: Candidate motion vector indications and their positions.

The largest block size (32x32) is appropriate for large areas that have equal motion. Therefore, spatial candidates from opposite directions can be good candidates. Furthermore, a candidate that results from a parametric model of the global motion (M) is added at this resolution, see reference [12]. The update candidate (U2) of spatial candidate S2 is added to cope with small spatial differences in the motion of the large areas and small deviations from the global motion model. If we need to go down to a higher resolution (16x16), we first try to find the correct vector in the neighbourhood, by trying some more spatial candidates and a temporal candidate, before resorting to more update candidates, because, although essential, too many update candidates generally result in noisy vector fields. If we need to increase the resolution even more (8x8), it is very likely that we are near an edge between objects. If the estimator has converged, the correct motion vector can be found in the neighbourhood: for one object a spatial candidate should be the correct vector, for the other the temporal candidate. If the estimator has not converged yet, the update candidates are required for finding the correct motion vector. At the highest resolutions (4x4 and 2x2) update candidates are avoided because of the noisiness of the estimation process at these resolutions. Furthermore, it is likely that the correct vectors have already been found in the neighbourhood; only the location of the edge between objects needs to be adjusted.

By adapting the resolution and the candidate set according to the image content as described above, the search strategy changes accordingly. Therefore, we call this algorithm Content-Adaptive Recursive Search (CARS).

3.3 Split Criterion

An important element in the CARS algorithm is the decision whether or not to split a block into four smaller blocks. This requires detecting the situation in which one motion vector is inappropriate to describe the motion of the whole block. One method is to check whether none of the candidate vectors has a good match. This cannot be achieved by simply looking at the computed match errors, because they also depend on the texture and contrast inside the block. A better metric, described in [10], uses the so-called VAR. This metric is a measure for the expected SAD value for a vector error of 1 pixel and is defined as follows:

$VAR(\vec{x}) = \sum_{\vec{x}\,' \in B(\vec{x})} \left( \, | f(\vec{x}\,', n) - f(\vec{x}\,' + 2\vec{e}_h, n) | + | f(\vec{x}\,', n) - f(\vec{x}\,' + 2\vec{e}_v, n) | \, \right),$   (7)

where $\vec{e}_h$ and $\vec{e}_v$ are the unit vectors in the horizontal and vertical direction, respectively. In reference [10] the following relation between the expected vector error $E(VE)$, the SAD value and the VAR has been found experimentally:

$E(VE) \approx \dfrac{3 \cdot \varepsilon(\vec{c}, \vec{x}, n)}{5 \cdot VAR(\vec{x})}.$

(8)

Given a certain allowed vector error, $\alpha_{VE}$, and rearranging equation (8), we can compute the expected SAD value for that vector error:

$\hat{\varepsilon}(\vec{c}, \vec{x}, n) = \dfrac{5}{3} \cdot VAR(\vec{x}) \cdot \alpha_{VE}.$

(9)
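Equations (7) and (9) can be sketched directly. This is a minimal sketch that ignores image-border handling; the function names are illustrative, not from the paper:

```python
def var_metric(f, pos, size):
    """VAR of eq. (7): sum of absolute luminance differences at a
    horizontal and a vertical offset of 2 pixels over the block, a
    measure for the SAD expected from a vector error of 1 pixel.
    The frame f is a list of rows, pos = (y, x), size = (h, w)."""
    (y0, x0), (h, w) = pos, size
    return sum(abs(f[y][x] - f[y][x + 2]) + abs(f[y][x] - f[y + 2][x])
               for y in range(y0, y0 + h)
               for x in range(x0, x0 + w))

def expected_sad(var, alpha_ve):
    """Expected SAD for an allowed vector error alpha_ve (eq. 9)."""
    return 5.0 / 3.0 * var * alpha_ve
```

On a flat block VAR is zero; on a horizontal luminance ramp only the horizontal difference term contributes.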

Since the motion vectors are estimated with an accuracy of one quarter of a pixel, we expect a vector error of one quarter of a pixel or lower if the estimator has converged. Hence, if the best candidate vector has a SAD value that is higher than the expected SAD value for that block, i.e.

$\varepsilon(\vec{d}, \vec{x}, n) > \hat{\varepsilon}(\vec{d}, \vec{x}, n),$   (10)

the block is split into four smaller blocks and the procedure is repeated. As can be deduced from equations (9) and (10), a threshold, $\alpha_{VE}$, is involved in the split criterion. Since this threshold has a physical meaning, being the allowed vector error, its proper value is known in advance and requires no experiments. If a block is split, the motion of the four smaller blocks is estimated in the same way and immediately (depth-first), resulting in a single-pass algorithm.

4. Efficiency

4.1 SAD Sub-sampling

To evaluate the candidate motion vectors the match error from equation (6) is used. In equation (6) the motion compensated difference is calculated for all pixels in the block. However, a previous study presented in reference [13] indicates that the quality of this match error hardly suffers from sub-sampling, so not all pixel differences need to be calculated. From our first experiments it follows that while the number of pixels increases quadratically with the block size, the number of pixel differences only needs to increase linearly. We expect the number of cycles required to calculate the SAD to be roughly linear in the number of pixel differences that must be calculated. Our first optimised implementation of some of the SAD calculations for different block sizes on a DSP processor shows this to be approximately correct.
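The depth-first split-and-recurse procedure of Section 3.3 can be sketched as follows. The matcher and the VAR-based threshold are replaced here by toy stand-ins (an imaginary frame with one vertical motion edge at x = 32), so the sketch only illustrates the recursion structure, not the actual estimator:

```python
def match_best(frames, pos, size):
    """Toy stand-in for eqs. (5)-(6): blocks entirely left of x = 32
    move with (0, 0), blocks entirely right of it with (0, 4); blocks
    straddling the edge match badly."""
    y, x = pos
    if x + size <= 32:
        return (0, 0), 0.0
    if x >= 32:
        return (0, 4), 0.0
    return (0, 0), 100.0

def expected_sad_for(frames, pos, size, alpha_ve):
    """Toy stand-in for the VAR-based threshold of eq. (9)."""
    return 10.0 * alpha_ve

def estimate_block(frames, pos, size, alpha_ve, min_size=8):
    """Depth-first CARS recursion: estimate the block, and split it
    into four quadrants while the best match error exceeds the
    expected SAD (eq. 10) and the minimum block size is not reached."""
    d, err = match_best(frames, pos, size)
    if size > min_size and err > expected_sad_for(frames, pos, size, alpha_ve):
        y, x = pos
        half = size // 2
        return [estimate_block(frames, q, half, alpha_ve, min_size)
                for q in ((y, x), (y, x + half), (y + half, x), (y + half, x + half))]
    return (pos, size, d)
```

A 64x64 block straddling the edge splits once into four 32x32 blocks, each of which then matches well and becomes a leaf.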

Table 2: SAD sub-sample data.

Block Size   #Pixels   Sub-sample   Relative Cost   Cycles
2x2                4            1               1   N.A.
4x4               16            1               4   N.A.
8x8               32            2               8   104
16x16             64            4              16   203
32x32            128            8              32   519
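The sub-sample factors of Table 2 can be sketched as follows. The paper does not specify the sub-sampling pattern (the patterns of references [8] and [9] are hardware-oriented), so row-thinning is used here purely as an illustrative choice that reproduces the pixel counts in the #Pixels column:

```python
# Sub-sample factor per (square) block size, from Table 2.
SUBSAMPLE = {2: 1, 4: 1, 8: 2, 16: 4, 32: 8}

def subsampled_sad(cur_block, prev_block):
    """SAD evaluated on a thinned pixel grid: for an n x n block only
    every SUBSAMPLE[n]-th row enters the sum, so the number of pixel
    differences grows linearly instead of quadratically with n."""
    n = len(cur_block)
    step = SUBSAMPLE[n]
    return sum(abs(a - b)
               for y in range(0, n, step)
               for a, b in zip(cur_block[y], prev_block[y]))
```

With unit differences everywhere, the result equals the number of pixels actually used: 4, 16, 32, 64 and 128 for the five block sizes, matching Table 2.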

Table 2 lists, per block size, the number of pixels used in the SAD calculation, the resulting sub-sample factor and the expected relative cost of the SAD calculation. It shows, for example, that the SAD computation for an 8x8 block is twice as expensive as for a 4x4 block; without sub-sampling it would be four times as expensive. Therefore, sub-sampling makes the SAD calculation for larger block sizes relatively cheaper than for smaller block sizes. This way the available resources (cycles) are used efficiently.

4.2 Block Hopping

To save computational power (cycles) a technique called block hopping can be used [3]. The idea behind block hopping is as follows: if the match error of a candidate vector is below a certain threshold, the motion vector is judged to be satisfactory, the evaluation of the other candidates is skipped and the algorithm hops on to the next block. Since we already introduced the VAR to judge whether a block should be split, it can also be used for block hopping. From equations (5) and (10) it follows that a block will not be split if the match error of a certain candidate is below the expected SAD value. Hence, if for a candidate vector $\vec{c}_i$, $i \in \{0, \ldots, |CS| - 1\}$,

$\varepsilon(\vec{c}_i, \vec{x}, n) \leq \hat{\varepsilon}(\vec{c}_i, \vec{x}, n),$   (11)

then for all candidate vectors $\vec{c}_j$, $j > i$, the match error computation can be skipped, thus saving computational power.
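The early-termination test of equation (11) can be sketched as a candidate loop that stops as soon as a candidate is good enough. `match_error` and `expected` are hypothetical callables standing in for equations (6) and (9):

```python
def evaluate_with_hopping(candidates, match_error, expected):
    """Evaluate candidates in order; stop early ("hop") as soon as one
    achieves a match error at or below the expected SAD (eq. 11).

    match_error(c) and expected(c) return floats; the function returns
    (best_vector, number_of_candidates_evaluated).
    """
    best, best_err = None, float("inf")
    for n, c in enumerate(candidates, start=1):
        err = match_error(c)
        if err < best_err:
            best, best_err = c, err
        if err <= expected(c):
            return best, n  # remaining candidates are skipped
    return best, len(candidates)
```

With three candidates whose errors are 50, 5 and 0 against a threshold of 10, the second candidate already satisfies equation (11), so the third is never evaluated.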

Table 3: M2SE and cost for each sequence from the test set and several algorithms and settings.

             4x4 (8x8)       4x4             8x8            cars-hop (1/4)   cars (1/4)
Sequence     Cost    M2SE    Cost    M2SE    Cost   M2SE    Cost    M2SE     Cost   M2SE
bicycle      205887  60      104626  66      54052  63      30214   71       41422  69
heli2        160671  71      81459   64      41146  70      16839   68       25485  68
hotel        213797  229     108166  231     53956  234     34847   239      45339  239
mummy        202584  56      103511  59      50447  58      23654   61       33502  63
renata       200003  63      102082  66      50921  69      14011   68       23519  69
ryan         211256  122     107414  123     54090  119     25036   132      35686  134
kiel         184167  57      93338   55      48873  58      10610   127      19426  75
Average      196909  94      100085  95      50498  96      22173   109      32054  102

5. Results

In Figure 3 the resulting block sizes for one frame of the Renata sequence are depicted by means of the white lines. In this sequence, the camera pans to the right while Renata is walking from left to right. The block sizes that were used range from 32×32 to 4×4, each time decreasing by a factor of 2 in the horizontal and vertical direction.

In the past, we have used the Modified Mean Square Error (M2SE) as a performance indicator for different motion estimation algorithms [8]. The essence of the modification (of the well-known MSE) is that the validity of the vectors is extrapolated outside the temporal interval on which they are estimated. Hence, it is a kind of measure of the deviation of the vector field from the true motion. Frame n can be reconstructed from frames n-1 and n+1 using the vector field $\vec{d}(\vec{x}, n)$ and motion compensated averaging, i.e.

$f_{PF}(\vec{x}, n) = \frac{1}{2} \left( f(\vec{x} - \vec{d}(\vec{x}, n), n - 1) + f(\vec{x} + \vec{d}(\vec{x}, n), n + 1) \right).$   (12)

The M2SE is then the MSE of the original frame n and the reconstructed one, i.e.

$M2SE(n) = \frac{1}{|W|} \sum_{\vec{x} \in W} \left( f(\vec{x}, n) - f_{PF}(\vec{x}, n) \right)^2,$   (13)

where W is the measurement window, which excludes the borders of the frame. Although the M2SE metric is far from perfect, it is the only one that gives an indication about the true-motion nature of the vector field. The results for each sequence from the test set are listed in Table 3. The algorithm

"4x4 (8x8)" is 3DRS using 4x4 blocks, but estimating the SAD on (overlapping) 8x8 blocks. "4x4" and "8x8" are also 3DRS, using the respective block sizes. The two rightmost algorithms are the CARS algorithm, the first with block hopping enabled, the other with it disabled; the value of the parameter $\alpha_{VE}$ is given between brackets. The cost listed in Table 3 is calculated using the relative costs of the SAD computation for the different block sizes mentioned in Table 2.¹

For most sequences, except "ryan" and "kiel", the differences in M2SE between the 3DRS variations and the CARS variations are around 8. For "ryan" the difference is larger because it is a dark sequence (low contrast); therefore, detecting the necessity to split a block becomes harder. The "kiel" sequence has no occlusion and only shows zoom. Since there are no borders between moving objects in this sequence, large blocks are used. But one vector for such a large block describes the zoom less accurately than many small blocks, resulting in a larger error. However, visually comparing up-converted sequences shows a much smaller difference between 3DRS and CARS. This problem can be countered by using a parametric model for large blocks.

Overall, the CARS algorithm performs slightly worse according to the M2SE criterion than the 3DRS algorithm, but is significantly cheaper. The increased error is most probably also due to the fact that we model larger parts of the image with one motion vector (when using large block sizes). This limits the degrees of freedom in solving equation (1) more than fixed 8x8 or 4x4 block sizes do. Hence, a higher error can be expected. However, the magnitude of this effect is yet unknown. When block hopping is enabled, the performance is similar to when it is disabled, except for the "kiel" sequence, where the problem with zoom described above manifests itself strongly. Note that the cost is reduced by approximately 30% when block hopping is enabled.

¹ The 3DRS algorithms used here for comparison are more expensive than the ones presented in references [8] and [9]. Firstly, the sub-sampling structures presented in those references are not suited for software implementations. Secondly, the CS used in [8] and [9] is smaller.
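The M2SE metric of equations (12) and (13) can be sketched as follows. For brevity this sketch uses a single global vector d instead of the per-block vector field, and integer displacements:

```python
def m2se(f_prev, f_cur, f_next, d, border=2):
    """M2SE of eqs. (12)-(13): reconstruct frame n from frames n-1 and
    n+1 by motion compensated averaging, then take the mean squared
    error inside a window W that excludes a border around the frame.
    Frames are lists of rows; d = (dy, dx) is a global motion vector
    (a simplification of the per-block field d(x, n))."""
    dy, dx = d
    height, width = len(f_cur), len(f_cur[0])
    errors = []
    for y in range(border, height - border):
        for x in range(border, width - border):
            # eq. (12): average of the backward- and forward-projected frames
            pred = 0.5 * (f_prev[y - dy][x - dx] + f_next[y + dy][x + dx])
            errors.append((f_cur[y][x] - pred) ** 2)
    return sum(errors) / len(errors)
```

For content moving with constant velocity, the correct vector reconstructs frame n exactly and the M2SE is zero, while a wrong vector leaves a residual error; this is what makes the metric sensitive to the true-motion quality of the vector field.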

Figure 3: Results on the Renata sequence

6. Conclusions

We have presented an efficient motion estimation algorithm designed to fully exploit the flexibility offered by programmable platforms. This algorithm uses content-adaptive resolution, applying the highest spatial accuracy only at the locations in the image where it is required, near the borders of moving objects, and hence efficiently obtains a high-resolution vector field. The example shows that a good criterion for splitting a block has been found. However, the raised M2SE and the difficulties with sequences containing zoom and low contrast are subject of further study.

References

[1] C. Hentschel, et al., "Scalable video algorithms and quality-of-service resource management for consumer terminals," in International Conference on Consumer Electronics (ICCE), Los Angeles, CA, June 2001, pp. 338-339.
[2] C. Hentschel, R. Braspenning and M. Gabrani, "Scalable algorithms for media processing," in International Conference on Image Processing (ICIP), Thessaloniki, Greece, October 2001, pp. 342-345.
[3] R. Braspenning, G. de Haan and C. Hentschel, "Complexity scalable motion estimation," in Proceedings of the SPIE: Visual Communications and Image Processing (VCIP), San Jose, CA, January 2002, pp. 442-453.
[4] S. Peng, "Complexity scalable video decoding via IDCT data pruning," in International Conference on Consumer Electronics (ICCE), Los Angeles, CA, June 2001, pp. 74-75.
[5] I. Richardson and Y. Zhao, "Adaptive algorithms for variable-complexity video decoding," in International Conference on Image Processing (ICIP), Thessaloniki, Greece, October 2001, pp. 457-460.
[6] K. Lengwehasatit, et al., "A novel computationally scalable algorithm for motion estimation," in Proceedings of the SPIE: Visual Communications and Image Processing (VCIP), San Jose, CA, January 1998, pp. 68-79.
[7] A. Tekalp, Digital Video Processing, Prentice-Hall, 1995, Chapters 5-6, ISBN 0-13-190075-7.
[8] G. de Haan and P. Biezen, "Sub-pixel motion estimation with 3-D recursive search block-matching," Signal Processing: Image Communication 6, 1994, pp. 229-239.
[9] G. de Haan, "IC for motion compensated deinterlacing, noise reduction and picture rate conversion," IEEE Transactions on Consumer Electronics, August 1999, pp. 617-624.
[10] G.A. Lunter, "Occlusion-insensitive motion estimation for segmentation," in Proceedings of the SPIE: Visual Communications and Image Processing (VCIP), San Jose, CA, January 2002, pp. 573-584.
[11] H.M. Jung, Variable size block matching motion estimation apparatus, EU patent EP0720382, 3 July 1996.
[12] G. de Haan and P.W.A.C. Biezen, "An efficient true-motion estimator using candidate vectors from a parametric motion model," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 1, Mar. 1998, pp. 85-91.
[13] G. de Haan and H. Huijgen, "Motion estimation for TV picture enhancement," Signal Processing of HDTV III, 1992, pp. 241-248.
