
ROBUST GLOBAL MOTION ESTIMATION USING A SIMPLIFIED M-ESTIMATOR APPROACH

Aljoscha Smolić and Jens-Rainer Ohm
Heinrich-Hertz-Institute (HHI), Image Processing Dept., Einsteinufer 37, 10587 Berlin, Germany
{smolic/ohm}@hhi.de

ABSTRACT

Global motion estimation is an important task in a variety of video processing applications, such as coding, segmentation, classification/indexing or mosaicing. Due to the possible presence of differently moving foreground objects and other sources of distortion, robust methods such as M-estimators have to be applied. We present a simplified implementation of a robust M-estimator for global motion estimation that does not significantly increase the computational complexity compared to a non-robust estimator, while providing excellent results in terms of estimation accuracy. Additionally, unstructured image regions are detected and rejected for the estimation. This avoids aperture problems, which can have a detrimental impact especially on robust estimators that rely on a certain error measure.

1 INTRODUCTION

The term global motion is used in this paper to describe the apparent 2D image motion induced by camera operation. Depending on the application at hand, it can be parameterized by several motion models. The mathematical complexity of the global motion model directly affects the possible accuracy of the motion description, but also the computational complexity and stability. For instance, the planar perspective model with 8 parameters describes the global motion completely if camera operation is restricted to pure rotation about the center of projection and zoom, and if the assumption of a pin-hole camera model is satisfied [1]. More sophisticated formulations also include lens distortion parameters [2]. For many applications the affine model is sufficient; it consists of 6 parameters, represents a linear transformation, and is mathematically relatively easy to handle. Further simplifications lead to even simpler models, while extensions apply higher-order polynomials [6]. Global motion estimation denotes the determination of these motion parameters.
Once available, the motion information can be used in a variety of video processing applications. In a coding application it can be used, for instance, to predict the background in order to efficiently reduce temporal redundancy [4]. In a segmentation application, global motion information is exploited to separate differently moving foreground objects from the background [7]. Recent developments in the context of MPEG-7 show that a parametric description of global motion can also be used for classification of video, enabling motion-based search-and-retrieval applications in large video databases [3]. Finally, global motion estimation is the main task in video mosaicing, which itself can be applied in all the aforementioned video processing areas [6].

The state-of-the-art technology for global motion estimation applies differential approaches [3], [4], [5], which at their mathematical core consist of an iterative minimization of a postulated cost function with respect to the motion parameters to be estimated. If no a priori knowledge about the location of the background and possibly differently moving foreground objects is available, the quality of the resulting estimates can be poor. Additionally, there are other sources of errors, e.g. noise, lighting changes (shadows), model failures or covered/uncovered image regions. All these influences can be regarded as noise with respect to the global motion. However, the statistics of these errors can by no means be modeled as a normal distribution; many outliers have to be expected, i.e. measurements that are not covered by the applied statistical model. A classical solution for such a model parameter estimation problem is derived from robust statistics and maximum-likelihood theory, the so-called M-estimator [9]. The application of such a robust M-estimator to global motion estimation has been presented, for instance, in [5]. A simpler approach is described in the non-normative parts of MPEG-4/7, where a histogram of an error measure is computed and then a fixed percentage (e.g. 20%) of the pixels with the largest errors is excluded from the estimation [3], [4].
In this contribution, we present a robust global motion estimator that follows the principles of an M-estimator, but uses a very simple binary weighting function. This keeps the computational complexity reasonable, without depending on predefined thresholds of an error measure. The fields of differential global motion estimation and optical flow estimation [8] are closely related, and it is reasonable to transfer experience from one field to the other. We therefore introduce a mechanism for the detection of unstructured areas, which are known to carry improper motion information (aperture problems) and are thus excluded from the estimation process. We show that this is especially important for robust estimators that rely on certain error measures.

2 METHOD

The basic idea of a differential global motion estimator is to find those transformation parameters that make the transformed reference image as similar as possible to the actual image (note that this can also be formulated vice versa, i.e. forwards versus backwards prediction). Let I0 and I1 denote the intensities of the reference and the actual image, respectively, p = (x, y) an image point under consideration, and T(p, Ξ) the transformed position controlled by the parameter vector Ξ. For an affine motion model we have, for instance:

T(p, Ξ) = (Ξ1 + Ξ2·x + Ξ3·y, Ξ4 + Ξ5·x + Ξ6·y).   (1)

The similarity at a certain point is measured in terms of the residual:

ε(p, Ξ) = I1(p) − I0(T(p, Ξ)).   (2)

A first approach to estimate the motion parameter vector is to minimize the sum of squares of all residuals within an image region R, which is the entire image if no other information is available:

min_Ξ Σ_{p∈R} ε²(p, Ξ).   (3)
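As a concrete illustration, the affine transformation of eq. 1 and the residual of eq. 2 can be sketched as follows (a minimal NumPy sketch; the images and parameters are hypothetical placeholders, and proper subpixel interpolation of I0 at the transformed position is replaced by nearest-neighbour rounding for brevity):

```python
import numpy as np

def transform(p, xi):
    """Affine transform of eq. 1: maps point p = (x, y) under
    parameters xi = (xi1, ..., xi6) to a new position."""
    x, y = p
    return (xi[0] + xi[1] * x + xi[2] * y,
            xi[3] + xi[4] * x + xi[5] * y)

def residual(p, xi, I0, I1):
    """Residual of eq. 2: intensity difference between the actual image I1
    at p and the reference image I0 at the transformed position.
    Nearest-neighbour sampling stands in for proper interpolation."""
    tx, ty = transform(p, xi)
    return float(I1[p[1], p[0]]) - float(I0[int(round(ty)), int(round(tx))])

# Identity parameters leave the point unchanged: xi = (0, 1, 0, 0, 0, 1)
xi_identity = (0.0, 1.0, 0.0, 0.0, 0.0, 1.0)
assert transform((3, 4), xi_identity) == (3.0, 4.0)
```

For identical images and identity parameters, the residual is zero everywhere; a non-zero residual indicates either a mismatch of the motion description or one of the noise sources discussed above.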

Such a least-squares fit is indeed appropriate if the measurement errors are independent and normally distributed with constant standard deviation [9], but as explained before, these assumptions do not hold at all for the problem at hand. In an M-estimator approach, a function ρ of the residuals with a scale µ is used instead of the simple square function (which is a special case of ρ):

min_Ξ Σ_{p∈R} ρ(ε(p, Ξ), µ).   (4)

This is solved using an iterative Gauss-Newton minimization. Comparing the resulting equations (gradient and approximated Hessian [5]) of the least-squares and the M-estimator approaches, it can be recognized that the only difference is a weighting of the contribution of each point by a factor w(ε) = (dρ/dε)/ε. The central point in the design of an M-estimator is the selection of this weighting function. Classical examples for ρ are the Geman-McClure [5], Andrews' sine and Tukey's biweight [9] functions. The derived weighting functions are centered around zero and decrease rapidly with increasing ε. This is the desired property: points with a small error shall have a large influence on the estimation, whereas points with a large error (outliers) shall be rejected.
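For two of the classical choices, the weight factor w(ε) = (dρ/dε)/ε can be written out explicitly. The following is a sketch under the textbook definitions of these functions (the scale handling is simplified and the constants are illustrative defaults, not values from this paper):

```python
def w_geman_mcclure(eps, sigma=1.0):
    """Weight derived from the Geman-McClure function
    rho(eps) = eps^2 / (sigma^2 + eps^2), which gives
    w = (drho/deps) / eps = 2*sigma^2 / (sigma^2 + eps^2)^2."""
    return 2.0 * sigma**2 / (sigma**2 + eps**2) ** 2

def w_tukey_biweight(eps, c=4.685):
    """Weight of Tukey's biweight: w = (1 - (eps/c)^2)^2 for |eps| <= c,
    and 0 beyond c (hard rejection of gross outliers)."""
    if abs(eps) > c:
        return 0.0
    return (1.0 - (eps / c) ** 2) ** 2

# Both weights peak at eps = 0 and decay for large residuals:
assert w_tukey_biweight(0.0) == 1.0
assert w_tukey_biweight(10.0) == 0.0
assert w_geman_mcclure(0.0) > w_geman_mcclure(3.0)
```

Both functions exhibit the desired shape: maximal weight near zero residual and rapid decay for outliers; Tukey's biweight additionally rejects points beyond c completely.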

A drawback of an M-estimator compared to a non-robust (least-squares) approach is the additionally introduced computational complexity. Weighting with a hyperbolic or trigonometric function is required for each pixel (more than 100 000 for a CIF image) at each iteration. The necessary floating-point operations and divisions are not desirable for hardware and certain DSP implementations. We therefore simplify the weighting function to a rectangular function, i.e. a simple binary decision [10]: a point is either considered fully, as in the non-robust formulation, or not at all:

w(ε) = 1, if ε² < cµ; w(ε) = 0, otherwise.   (5)

Herein c denotes a tuning constant that is used to adjust the sensitivity of the algorithm (see below). Computation of eq. 5 introduces only minor additional complexity, since the residuals are needed anyway in a differential global motion estimator. Moreover, the exclusion of pixels even reduces the computational complexity. The mean square of the residuals of all N considered points within region R is given by:

µ = (1/N) · Σ_{p∈R} ε²(p, Ξ).   (6)

This is used for automatic scaling of the weighting function in eq. 5.

As already mentioned, unstructured areas can have a detrimental impact on the estimation results (aperture problems). This is reflected in the error function in eq. 2: the error will be small in unstructured areas, even if the motion description is not accurate. On the other hand, edges, which carry very good and important motion information, will probably result in large errors, even if the motion description is already good. Obviously, this is especially dangerous for the performance of a robust M-estimator, since unstructured regions would be favored whereas structured regions could be rejected. To reduce the influence of unstructured regions, we first evaluate the spatial image gradients and use only those pixels for the estimation that satisfy the following equation:

v = 0, if (|Ix| < d) ∪ (|Iy| < d); v = 1, otherwise.   (7)
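The binary weighting of eq. 5, the automatic scale of eq. 6, and the gradient test of eq. 7 can be sketched together as follows (a NumPy sketch over hypothetical residual and gradient arrays; the low-pass/derivative filters of [8] are replaced here by precomputed gradient arrays Ix, Iy for brevity):

```python
import numpy as np

def binary_weights(residuals, c=1.0):
    """Eqs. 5 and 6: mu is the mean square of all residuals; a pixel is
    kept (weight 1) iff its squared residual is below c * mu."""
    mu = np.mean(residuals**2)                       # eq. 6
    return (residuals**2 < c * mu).astype(np.uint8)  # eq. 5

def gradient_mask(Ix, Iy, d=3.0):
    """Eq. 7: reject unstructured pixels, i.e. those where either spatial
    gradient magnitude falls below the threshold d (v = 0)."""
    return (~((np.abs(Ix) < d) | (np.abs(Iy) < d))).astype(np.uint8)

# Example: one gross outlier among small residuals is rejected,
# because its squared residual dominates and exceeds c * mu.
eps = np.array([1.0, -1.0, 0.5, 20.0])
print(binary_weights(eps, c=1.0))
```

Note how the binary decision needs no transcendental functions or divisions per pixel: one mean over the residuals and one comparison per pixel suffice, which matches the complexity argument above.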

Note that the image gradients have to be calculated anyway in a differential global motion estimator, so this processing step causes only a minor complexity increase, while simultaneously decreasing the overall computation due to the smaller number of considered points. Instead of a simple pixel-difference approximation, we use a special combination of low-pass and derivative filters for the calculation of the spatial gradients, as described in [8]. The gradient threshold d is used as a tuning constant. Fig.1 shows an example for the test sequence "Stefan" and a variation of d. The optimum value was found experimentally to be d = 3, by evaluating the estimation accuracy in terms of the residual prediction error of the background over several test sequences.

Fig.1 Pixels used for the estimation (black); gradient threshold d increasing from top-left: d = 1, 3, 5, 10.

After calculation of the gradient mask (now representing R), the motion parameters are first estimated without taking eq. 5 into account, since no estimate is available yet. In the second and all following iterations of the differential estimator, the previously calculated parameters are used for the process formulated above. The influence of the tuning constant c is illustrated in Fig.2, which shows the pixels used for the estimation in the last iteration for different values of c. The gradient threshold was not evaluated in these experiments (i.e. d = -1). With a decreasing value of c, fewer pixels of the foreground object are taken into account, but background pixels are also suppressed. This means that there is an optimization problem for c. The optimum value was found experimentally to be c = 1.0 (in analogy to the procedure applied to optimize d, see above).

Fig.2 Pixels used in the last iteration (black); no gradient threshold evaluation (d = -1); tuning constant c decreasing from top-left: c = 10.0, 5.0, 1.0, 0.5.

3 EXPERIMENTAL RESULTS

The proposed robust estimator has been tested with several sequences, for instance the test sequence "Stefan", which is often used to study global motion algorithms. For this sequence segmentation masks are available, so it is possible to compare the results of experiments with and without using segmentation information known a priori. Fig.3 compares the PSNR of the predicted background and the estimated translation in x-direction. The results obtained with the proposed robust estimator without using a priori knowledge are excellent, with only minor differences.

Fig.3 Comparison of experiments with and without segmentation masks; left: PSNR of the predicted background; right: estimated translation in x-direction.

Fig.4 visualizes the prediction error for both settings and shows the pixels used in the last iteration. Since the algorithm does not rely on a priori knowledge, it can be used as a tool for segmentation of video.
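The overall estimation schedule described in Section 2 (gradient mask fixed once, full use of all residuals in the first iteration, binary re-weighting via eq. 5 from the second iteration onward) can be sketched as a skeleton. The Gauss-Newton update and the warping of I0 by the current parameters are hypothetical placeholders here, not the paper's implementation:

```python
import numpy as np

def gauss_newton_step(xi, residuals):
    """Hypothetical stand-in for one Gauss-Newton parameter update."""
    return xi

def estimate_global_motion(I0, I1, Ix, Iy, n_iter=10, c=1.0, d=3.0):
    """Skeleton of the robust estimator: the gradient mask (eq. 7) fixes
    the region R once; eq. 5 re-selects pixels from the second
    iteration onward."""
    R = (np.abs(Ix) >= d) & (np.abs(Iy) >= d)      # eq. 7 region
    xi = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 1.0])  # identity affine start
    for it in range(n_iter):
        # Placeholder residuals; real code warps I0 by the current xi.
        eps = I1[R] - I0[R]
        if it == 0:
            keep = np.ones_like(eps, dtype=bool)   # no estimate yet: use all
        else:
            mu = np.mean(eps**2)                   # eq. 6
            keep = eps**2 < c * mu                 # eq. 5
        xi = gauss_newton_step(xi, eps[keep])      # hypothetical update
    return xi
```

The two tuning constants c and d enter only through the pixel-selection steps, which is why their optimization (c = 1.0, d = 3) can be carried out independently of the minimization itself.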

Fig.4 Left: prediction error with use of segmentation masks; middle: prediction error without segmentation masks; right: pels used in the last iteration.

Fig.5 shows examples with the test sequence "Horse", which contains a lot of complex background and foreground motion, shadows, reflections, transparent objects (water drops), occlusions and disclosures. It is therefore a very difficult example for global motion estimators. The prediction error images show that the global motion is captured accurately, although no a priori segmentation information was used.

Fig.5 Left: original images; middle: predicted images; right: prediction error.

The algorithm presented in this paper is easily extended to long-term global motion estimation, i.e. the description of global motion over a long period relative to a fixed reference in time and space [6]. Fig.6 shows a mosaic for the very difficult test sequence "Horse", generated over 184 frames. Despite the complex content, the presented robust global motion estimator provides stable and accurate results over this long period.

Fig.6 Mosaicking over 184 frames, without using any segmentation information.

4 CONCLUSIONS

We presented a robust global motion estimator that follows the principles of M-estimation while remaining computationally efficient, through reduction of the weighting function to a simple binary decision. In addition, we integrated a method for the detection and rejection of unstructured areas, which can have a detrimental impact on motion estimation in general. They are especially dangerous for robust estimators that rely on a certain error measure to reduce the influence of outliers, since unstructured areas would otherwise gain influence at the expense of structured ones. The presented examples show that the proposed algorithm provides excellent results in terms of accuracy and stability, even with very difficult test material.

REFERENCES

[1] R.Y. Tsai and T.S. Huang, "Estimation of Three-Dimensional Motion Parameters of a Rigid Planar Patch", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 29, December 1981.
[2] H.S. Sawhney and R. Kumar, "True Multi-Image Alignment and its Application to Mosaicing and Lens Distortion Correction", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 3, March 1999.
[3] ISO/IEC JTC1/SC29/WG11, "MPEG-7 Visual Part of XM 5.0", Doc. No. N3321, Noordwijkerhout, Netherlands, March 2000.
[4] ISO/IEC JTC1/SC29/WG11, "MPEG-4 Video VM 16.0", Doc. No. N3312, Noordwijkerhout, Netherlands, March 2000.
[5] H.S. Sawhney, S. Ayer and M. Gorkani, "Model-Based 2-D & 3-D Dominant Motion Estimation for Mosaicing and Video Representation", IEEE Int. Conf. on Computer Vision, Cambridge, MA, USA, June 1995.
[6] A. Smolic, T. Sikora and J.-R. Ohm, "Long-Term Global Motion Estimation and its Application for Sprite Coding, Content Description and Segmentation", IEEE Trans. on CSVT, Vol. 9, No. 8, pp. 1227-1242, December 1999.
[7] S. Kruse and R. Schäfer, "Advanced Processing Tools for Multimedia Production", Conference of the International Broadcasting Convention (IBC), pp. 471-476, Amsterdam, Netherlands, September 1999.
[8] E.P. Simoncelli, "Distributed Representation and Analysis of Visual Motion", PhD Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, January 1993.
[9] W.H. Press, S.A. Teukolsky, W.T. Vetterling and B.P. Flannery, "Numerical Recipes in C", Cambridge University Press, 1992.
[10] T. Darrell and A.P. Pentland, "Cooperative Robust Estimation Using Layers of Support", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 5, May 1995.
