Real Time Global Motion Estimation for an MPEG-4 Video Encoder

H. Richter†, A. Smolic††, B. Stabernack††, E. Müller†

†) University of Rostock, Institute of Communications and Information Electronics, Richard-Wagner-Strasse 31, 18119 Rostock, Germany
e-mail: {buggs, erika.mueller}@ntie.e-technik.uni-rostock.de

††) Heinrich-Hertz-Institut Berlin, Image Processing Department, Einsteinufer 37, 10587 Berlin, Germany
e-mail: {smolic, stabernack}@hhi.de
ABSTRACT

This contribution presents a powerful method for real-time capable global motion estimation. To the authors' knowledge, no other real-time solutions exist to date. Global motion estimation is a valuable tool in the MPEG-4 coding process for improving overall visual quality. The main disadvantage of the reference implementation in the current Verification Model is its unacceptably low computation speed. By means of consistently speed-optimized algorithms, a 600 MHz Intel Pentium has been shown to be sufficient for the required task.
1 INTRODUCTION

The term global motion describes the 2D image motion caused by camera motion. Several mathematical models are available for a parametric description of motion in images. In the case of MPEG-4, the 6-parameter affine model is very attractive, since it handles all kinds of linear distortions at low computational complexity. Current state-of-the-art approaches for estimating global motion [1][4][5] combine a rough, wide-range initial step with a subsequent differential refinement stage. In the initial stage, a set of features is extracted from the given images. By tracking these features, a set of motion vectors is obtained, to which a global model is fitted. The feature selection process is based on a confidence measure introduced by Shi and Tomasi [2]. A feature is defined as a small window containing an image part that can be tracked reliably by a block matching
algorithm. By adapting the search range, the computation time spent here is kept to a minimum. At the beginning of the estimation process, and after a scene change, no information about the motion is available, so a relatively large search range has to be checked. Since camera motion usually changes slowly, the range can be lowered in consecutive frames by using previous estimation results as a predictor. Because it is unknown whether the tracked features belong to foreground or background, local object motion might have an undesirable impact on the global measure, and this influence has to be minimized. A suitable post-processing option is given by the maximum-likelihood operator, also known as the robust M-estimator. Typically, 5-7 iterations of a robust M-estimator are sufficient to distinguish between reliable measures and outliers. Due to the limited number of involved features and motion vectors, this type of post-processing does not noticeably affect the overall computation performance. The refinement stage of the proposed estimation hierarchy is based on an optical flow strategy. In [5], the iterative descending Gauss-Newton algorithm is proposed, which delivers reliable measures at the cost of excessive computational requirements. An approach to reduce estimation complexity was presented in [3]: by rejecting unstructured areas from the estimation, the number of candidate pixels decreases and thus fewer operations are required. These excluded areas do not carry meaningful motion information and would disturb the estimation process anyway (aperture problem). Another drawback of the MPEG-4 reference implementation is the use of first-order gradient operators for calculating the required spatial gradients. In textured image areas where high frequencies dominate, the first-order gradient operator is unable to estimate spatial gradients correctly. A solution in the form of polynomial filters is explained in [6].
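The initial matching stage outlined above (features tracked by full-search block matching, with the search range narrowed by a motion predictor) can be sketched as follows. This is an illustrative Python sketch, not the authors' optimized implementation; the function names and the SAD cost are our assumptions.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equal-sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def track_feature(cur, ref, x, y, block=8, search=7, pred=(0, 0)):
    """Full-search block matching for one feature window.

    (x, y) is the top-left corner of a block x block feature in `cur`.
    The search in `ref` is centred on the predictor `pred`; once camera
    motion is roughly known, a small `search` range suffices, which is
    how an adaptive search range saves computation.
    Returns the motion vector (dx, dy) with minimum SAD.
    """
    tmpl = cur[y:y + block, x:x + block]
    best_cost, best_mv = None, (0, 0)
    for dy in range(pred[1] - search, pred[1] + search + 1):
        for dx in range(pred[0] - search, pred[0] + search + 1):
            xx, yy = x + dx, y + dy
            if xx < 0 or yy < 0 or yy + block > ref.shape[0] or xx + block > ref.shape[1]:
                continue  # candidate window leaves the reference image
            cost = sad(tmpl, ref[yy:yy + block, xx:xx + block])
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv
```

After a scene change one would call this with a large `search`; in consecutive frames the previous global translation serves as `pred` and `search` can be kept small.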
2 METHOD

A block diagram of the proposed algorithm is shown in Fig. 1.

Fig. 1: Block diagram of the proposed algorithm for global motion estimation and compensation ($I_0$: current image; $I_1$: reference image; $\Xi_T$, $\Xi_A$: motion parameters (translation, affine); $I_P$: predicted image; stages: feature matching, differential affine estimation, warping)

2.1 FEATURE MATCHING

By definition, reliable tracking of features is possible in well-structured image areas. Therefore the feature selection is performed after highpass filtering one of the two images involved in each estimation. The Laplace operator with its FIR filter coefficients (1, -2, 1) in both directions is a suitable low-complexity filter. It can be implemented without multiplications, allows heavily parallel processing, and concentrates the filter output in the low value range, leaving only a minor number of peaks. This effect leads to a simplified selection stage: a one-pass binary classification in conjunction with small ring buffers for each binary class suffices to find the desired features. Furthermore, the image is divided into several tiles to ensure that features are obtained from every image area. Fig. 2 illustrates this process.

Fig. 2: Feature selection: a) original, b) filtered image, c) selected features

To track the features, a full-search 8x8 block matching is applied. As already mentioned, the search range is kept low by using estimates between former frames as a predictor.

The necessity, introduced above, of eliminating the disturbing influence of differently moving foreground objects is addressed with a robust M-estimator. A difference $\varepsilon^m$ between the mean translation $(a_1, b_1) \in \Xi$ and each individual measure $(v_x^m, v_y^m)$ is calculated:

$$\varepsilon^m = \left\| \begin{pmatrix} v_x^m \\ v_y^m \end{pmatrix} - \begin{pmatrix} a_1 \\ b_1 \end{pmatrix} \right\| \quad (2.1)$$

Then the mean error is:

$$\mu_\varepsilon = \frac{1}{M} \sum_{m=1}^{M} \varepsilon^m \quad (2.2)$$

As weighting function for each individual measure, Tukey's biweight was chosen, where $c$ is a tuning constant affecting the convergence behaviour:

$$w(\varepsilon) = \begin{cases} \left[ 1 - \left( \frac{\varepsilon}{c\,\mu_\varepsilon} \right)^2 \right]^2 & \varepsilon \le c\,\mu_\varepsilon \\ 0 & \varepsilon > c\,\mu_\varepsilon \end{cases} \quad (2.3)$$

With these prerequisites, the following iteration function for $(a_1, b_1) \in \Xi$ is applied:

$$\Xi_{k+1} = \frac{1}{\sum_m w^m} \begin{pmatrix} \sum_m w^m v_x^m \\ \sum_m w^m v_y^m \end{pmatrix} \quad (2.4)$$
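The robust averaging of the feature motion vectors (eqs. 2.1-2.4) can be sketched in Python as follows. This is an illustrative sketch of the stated equations; the function name, default tuning constant, and the early-exit shortcuts are our assumptions.

```python
import numpy as np

def robust_translation(vectors, c=2.0, iterations=7):
    """Iteratively re-weighted mean translation (eqs. 2.1-2.4).

    `vectors` is an (M, 2) array of feature motion vectors (vx, vy).
    Each iteration measures the distance eps_m of every vector to the
    current mean translation (a1, b1), derives Tukey-biweight weights
    from the mean error mu_eps, and recomputes the weighted mean;
    outliers caused by foreground objects end up with weight 0.
    """
    v = np.asarray(vectors, dtype=np.float64)
    t = v.mean(axis=0)                              # initial (a1, b1)
    for _ in range(iterations):
        eps = np.linalg.norm(v - t, axis=1)         # eq. 2.1
        mu = eps.mean()                             # eq. 2.2
        if mu == 0.0:                               # all vectors agree
            break
        r = eps / (c * mu)
        w = np.where(r <= 1.0, (1.0 - r ** 2) ** 2, 0.0)  # eq. 2.3
        if w.sum() == 0.0:
            break
        t = (w[:, None] * v).sum(axis=0) / w.sum()  # eq. 2.4
    return t
```

With a handful of consistent background vectors and one differently moving foreground vector, the outlier is suppressed after the first iteration.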
The shown algorithm is known for its fast convergence; typically around 5-7 iterations are required.

2.2 DIFFERENTIAL AFFINE ESTIMATION

After predicting the global translation, a higher-order refinement takes place. In principle, the goal of motion compensation is to make the transformed image as similar as possible to the actual image [3]. With $I_0$ as the current image and $I_1$ as the reference image, an image point in $I_0$ or $I_1$ is defined as $p = (x, y)$. The transformation $T(p, \Xi)$ of each image point is controlled by the parameter vector $\Xi$. In the case of the affine 6-parameter model with $(a, b) \in \Xi$, the following transformation is applied:

$$T(p, \Xi) = \begin{pmatrix} a_1 + a_2 x + a_3 y \\ b_1 + b_2 x + b_3 y \end{pmatrix} \quad (2.5)$$

The similarity at a certain point is measured in terms of the residual:

$$\varepsilon(p, \Xi) = I_1(p) - I_0(T(p, \Xi)) \quad (2.6)$$

In an M-estimator approach, a function $\rho$ of the residuals with a scale $\mu$ is used to minimize $\varepsilon$:

$$\min_\Xi \sum_{p \in R} \rho(\varepsilon(p, \Xi), \mu) \quad (2.7)$$
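The affine transformation (eq. 2.5) and the pointwise residual (eq. 2.6) translate directly into code. A minimal sketch, with nearest-neighbour sampling standing in for proper sub-pixel interpolation and helper names that are ours:

```python
import numpy as np

def affine_transform(points, xi):
    """Eq. 2.5: affine 6-parameter transform T(p, Xi).

    points: (N, 2) array of (x, y); xi = (a1, a2, a3, b1, b2, b3).
    """
    x, y = points[:, 0], points[:, 1]
    a1, a2, a3, b1, b2, b3 = xi
    return np.stack([a1 + a2 * x + a3 * y,
                     b1 + b2 * x + b3 * y], axis=1)

def residual(i1, i0, p, xi):
    """Eq. 2.6: eps(p, Xi) = I1(p) - I0(T(p, Xi)).

    Nearest-neighbour sampling is used here for brevity; a real
    implementation would interpolate the sub-pixel position T(p, Xi).
    """
    tx, ty = affine_transform(np.array([p], dtype=np.float64), xi)[0]
    return float(i1[p[1], p[0]]) - float(i0[int(round(ty)), int(round(tx))])
```

For the identity parameters xi = (0, 1, 0, 0, 0, 1) the transform leaves every point unchanged and the residual between identical images is zero.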
This is solved using an iterative Gauss-Newton minimization. The weighting of each point contributing to the estimation can be expressed as $w(\varepsilon) = (d\rho / d\varepsilon) / \varepsilon$. Using the simplified method introduced in [3], the weighting function is reduced to a single binary decision, i.e. a point either participates fully, as in a non-robust solution, or not at all:

$$w(\varepsilon) = \begin{cases} 1 & \varepsilon^2 \le \mu \\ 0 & \varepsilon^2 > \mu \end{cases} \quad (2.8)$$

The value of $\mu$ is determined as the mean square of all $N$ considered points within the region $R$, which allows automatic scaling of the weighting function.

To reduce the influence of unstructured areas, we perform an a-priori evaluation of the spatial gradients $(I_x, I_y)$ and use only those pixels ($v \ne 0$) in the estimation which satisfy the following condition:

$$v = \begin{cases} 1 & |I_x| \ge d \;\vee\; |I_y| \ge d \\ 0 & \text{otherwise} \end{cases} \quad (2.9)$$

A threshold $d$ between 10 and 20 leaves only those pixels which lead to fast convergence of the Gauss-Newton algorithm. We achieved a relatively constant number of pixels involved in the estimation by adapting $d$ to the pixel count of former estimations.

Due to the computation-intensive nature of the chosen Gauss-Newton algorithm (18 coefficients of a matrix plus 6 coefficients of an error vector have to be updated for every pixel involved in the estimation), the number of iterations must be kept low in order to reach the required computation speed. We present two methods for further improving the convergence behaviour. First, instead of applying the lowpass and derivative filter pair to each pixel in each iteration, as proposed in [6], we chose to pre-filter both images before the estimation takes place. The lowpass filter solves two important issues: it counters the already mentioned statistical errors in calculating spatial gradients, and it reduces noise. As a tradeoff between accuracy and speed, a simple 3-tap Gaussian filter with the coefficients (1/4, 1/2, 1/4) is used. The second improvement targets the interaction between the feature matching and differential stages. In scenes where only minor changes of global motion are apparent, the result of the last differential estimation has proven to be the faster-converging start vector, whereas the estimated translation acts as a good start vector under quickly changing motion.

3 IMPLEMENTATION

The explained algorithms were implemented first in portable ANSI-C. All time-critical parts have been rewritten in hand-optimized assembly language, specifically designed for use with Intel Pentium III and compatible processors. The main targets in this step were the efficient use of the available SIMD engines in conjunction with a minimized load on the memory bus. By interleaving code that uses all three concurrently available engines (Integer, MMX, ISSE) at the same time, a potentially high scalability could be achieved.
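The a-priori pixel selection (eq. 2.9) and the simplified binary weighting (eq. 2.8) can be sketched as follows. Note that plain central differences stand in here for the polynomial gradient filters recommended in [6], and the function names are illustrative:

```python
import numpy as np

def select_pixels(img, d=15):
    """Eq. 2.9: keep only pixels with |Ix| >= d or |Iy| >= d.

    Central differences serve as the gradient operator for brevity;
    border pixels are left unselected. Adapting d to the pixel count
    of former estimations keeps the selected set roughly constant.
    """
    im = img.astype(np.float64)
    ix = np.zeros_like(im)
    iy = np.zeros_like(im)
    ix[:, 1:-1] = (im[:, 2:] - im[:, :-2]) / 2.0
    iy[1:-1, :] = (im[2:, :] - im[:-2, :]) / 2.0
    return (np.abs(ix) >= d) | (np.abs(iy) >= d)

def binary_weights(residuals):
    """Eq. 2.8: w = 1 where eps^2 <= mu, else 0.

    mu is the mean square of all considered residuals, which scales
    the weighting function automatically.
    """
    r2 = np.asarray(residuals, dtype=np.float64) ** 2
    mu = r2.mean()
    return (r2 <= mu).astype(np.float64)
```

A flat image yields an empty selection mask, while a step edge passes the threshold; in the weighting, a single large residual among small ones is excluded.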
4 EXPERIMENTAL RESULTS

The proposed algorithm has been tested with several sequences, for instance the well-known Mobile, Horse and Stefan sequences, which are often used to study global motion algorithms. Fig. 3 shows the impact of the proposed lowpass filtering on the estimation, measured as the background PSNR between the original and predicted images. Estimating the global motion between filtered frames results in remarkably better prediction accuracy.

Fig. 3: Background PSNR with and without pre-filtering (robust, max. 4 iterations), Mobile sequence (352x240 pels, YUV 4:2:0)

To measure the performance of this algorithm in direct comparison to the reference MPEG-4 encoder, the developed routines have been integrated into MS-PDAM(1). Fig. 4 and Fig. 5 show the achieved compression ratio and PSNR for the Stefan and Horse sequences.

Fig. 4: Rate-distortion curves for the Stefan sequence (352x240 pels, YUV 4:2:0): without GMC, GMC with reference implementation, GMC with proposed algorithm

Fig. 5: Rate-distortion curves for the Horse sequence (352x288 pels, Y only): without GMC, GMC with reference implementation, GMC with proposed algorithm

The results obtained with the proposed high-speed algorithm are excellent and match those of the MPEG-4 reference implementation. In the case of the Stefan sequence, the newly developed algorithm proves superior in tracking the quick camera motion apparent in the last thirty frames. The whole optimized estimation process needs around 12-15 ms per picture in CIF format(2). In contrast to the reference implementation, which needs about 3.5 seconds under similar conditions, the developed solution works 230 times faster. Including the necessary motion compensation routines, the overall cost of integrating the proposed algorithms into an MPEG-4 encoder solution is around 18-22 milliseconds on common mid-class PC systems. In conjunction with an optimized MPEG-4 encoder, an 800 MHz Pentium III/E has been shown to be sufficient to encode GMC frames at 25 FPS.

5 CONCLUSION

We presented a robust global motion estimator fast enough for real-time operation on common hardware. By concentrating on reducing unnecessary operations and disturbing side effects, the usual accuracy penalties of high-speed applications could be successfully avoided. The proposed algorithm performs excellently in direct comparison to the MPEG-4 reference implementation.

(1) MPEG-4 Verification Model, reference encoder by Microsoft
(2) 600 MHz Intel Pentium III with Katmai core, Linux 2.2.16, gcc 2.95, nasm 0.98

REFERENCES

[1] A. Smolic, T. Sikora and J.-R. Ohm. Long-Term Global Motion Estimation and its Application for Sprite Coding, Content Description and Segmentation. IEEE Trans. on CSVT, Vol. 9, No. 8, pp. 1227-1242, December 1999
[2] J. Shi and C. Tomasi. Good Features to Track. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1994
[3] A. Smolic and J.-R. Ohm. Robust Global Motion Estimation using a Simplified M-Estimator Approach. Proceedings of the IEEE International Conference on Image Processing, 2000, Vancouver, Canada
[4] H. Schwarz and E. Müller. Hypothesis-based Motion Segmentation for Object-based Video Encoding. Proceedings of the Picture Coding Symposium, 1999, Portland, Oregon
[5] ISO/IEC JTC1/SC29/WG11. MPEG-4 Video VM 16.0, Doc. No. N3312, Noordwijkerhout, Netherlands, 2000
[6] E. P. Simoncelli. Distributed Representation and Analysis of Visual Motion. PhD Thesis, MIT, USA, 1993