Multiscale Modeling and Estimation of Motion Fields for Video Coding

to appear in IEEE Transactions on Image Processing

Multiscale Modeling and Estimation of Motion Fields for Video Coding

Pierre Moulin (1), Ravi Krishnamurthy (2), and John Woods (2)

(1) University of Illinois, Beckman Institute, 405 N. Mathews Ave., Urbana, IL 61801
tel (217) 244-8366, fax (217) 244-8371, email: [email protected]

(2) Rensselaer Polytechnic Institute, ECSE Dept. and Center for Image Processing Research, Troy, NY 12180
email: [email protected]

Revised, April 1997.

Abstract

We present a systematic approach to forward-motion-compensated predictive video coding. The first step is the definition of a flexible model that compactly represents motion fields. The inhomogeneity and spatial coherence properties of motion fields are captured using linear multiscale models. One possible design is based on linear finite elements and yields a multiscale extension of the Triangle Motion Compensation (TMC) method. The second step is the choice of a computational technique that identifies the coefficients of the linear model. We study a modified optical flow technique and minimize a cost function closely related to Horn and Schunck's criterion. The cost function balances accuracy and complexity of the motion-compensated predictor and is viewed as a measure of goodness of the motion field. It determines not only the coefficients of the model, but also the quantization method. We formulate the estimation and quantization problems jointly as a discrete optimization problem and solve it using a fast multiscale relaxation algorithm. A hierarchical extension of the algorithm allows proper handling of large displacements. Simulations on a variety of video sequences have produced improvements over TMC and over the half-pel-accuracy, full-search block matching algorithm, in excess of 0.5 dB on average. The results are visually superior as well. In particular, the reconstructed video is entirely free of blocking artifacts.

* This work has been presented in part at ISCAS'93, IMDSP'93 and ICIP'95.
† Work partly performed while P. Moulin was with Bell Communications Research in Morristown, NJ.
‡ Work partly performed during an internship at Bell Communications Research in Morristown, NJ.


1 Introduction

Accurate modeling and estimation of motion in image sequences is a fundamental problem in many computer vision and image processing applications. In this paper we refer to motion as the apparent displacement of pixels between contiguous images, also called correspondence field [1]. Determination of the motion field is a well-known ill-posed problem that can be addressed by making certain physical assumptions. The motion field is determined primarily by the motion of objects in the scene relative to the camera, and by changes in illumination; for this reason, it should exhibit a good degree of spatio-temporal coherence, sometimes described as piecewise smoothness [2, 3, 4].

In video compression, accurate motion estimation is desirable for two reasons: (a) it generally improves compression performance, and (b) it enhances processing tasks such as deinterlacing and frame rate interpolation (1). Unfortunately there is no guarantee that the motion field that would optimize the encoder's operational rate-distortion performance is close to the "true" correspondence field, so the goals (a) and (b) are not identical [5]. We seek motion estimation techniques that perform well with regard to both (a) and (b). The main goal of this paper is to show that multiscale motion field models, coupled with appropriate estimation and quantization techniques, address both requirements.

We focus on forward-motion-compensated video coding, in which the current frame is predicted from the previous reconstructed frame using a motion field whose description is sent to the decoder as side information. The prediction error image is compressed and encoded using intra-frame coding techniques. Here accurate motion estimation helps reduce the prediction error and improve the overall compression performance.

1.1 Background

The standard block-matching (BM) algorithm [1] produces unnatural, blocky and noisy motion fields, and visually unpleasant artifacts at low to medium bit rates. The basic algorithm can be modified so as to improve the spatial coherence of the motion field. A suitable motion-vector coding algorithm may then be able to exploit this spatial redundancy, reduce the motion-vector bit rate, and improve the rate-distortion performance of the video coder [6, 7]. Blocking artifacts may be alleviated using techniques such as Triangle Motion Compensation (TMC) [8, 9, 10] or Control Grid Interpolation (CGI) [11]. In BM, TMC and CGI, the resolution of the motion field is fixed (typically 16 × 16) and is chosen so as to offer a reasonable overall tradeoff between accuracy of the motion estimates on one hand, and reliability and coding cost on the other hand. Nevertheless fixed resolution has a fundamental disadvantage, namely the inability to adapt locally to sharp variations in the motion field. Typically this results in unpleasant visual artifacts when a motion boundary passes through a block.

(1) e.g., bidirectional prediction of B-pictures in MPEG coders [1].

Very promising results have been obtained by making the resolution of the motion field spatially adapted. That option is used in variable-size BM [12, 13] and in the Advanced Prediction Mode of the H.263 videoconferencing standard. This results in a quadtree representation for the motion field. Orchard [14] has introduced a segmentation technique for refining a motion field obtained by BM. The motion information consists in the original BM vectors plus the segmentation information. Although work on spatially-adapted resolution has mostly dealt with piecewise-constant motion field models, more flexible representations have recently been investigated. Huang and Hsu have introduced Hierarchical Grid Interpolation (HGI), an extension of CGI which uses variable-size, bilinear-warped rectangles in order to perform motion compensation [15]. A more sophisticated version of HGI has been used in a recent motion-compensated 3-D subband coding scheme [16]. Closely related to HGI is Hierarchical Triangle Compensation, a hierarchical version of TMC based on variable-size deformable triangles [15]. Both schemes use quadtree decompositions. More general adaptive triangular meshes have been investigated recently [17]. Another interesting approach is [18], which uses affine motion models over a segmented frame, and the MDL criterion for choosing the segmentation that minimizes an approximation to the total bit rate.

In parallel with, but independently of, these developments in video coding, the computer vision community has developed a number of advanced techniques such as Bayesian and optical flow methods, which produce accurate, dense motion fields (one motion vector per pixel). Bayesian methods use a statistical model for the prediction error and for the a priori distribution of the motion field. Maximum A Posteriori (MAP) motion field estimates may then be computed. Such methods are applicable to coding problems, due to the approximate equivalence between code length and logarithm of the probability density function [19]. The main drawbacks of the MAP technique are the complexity of the model and the nonlinearity of the estimator. Considerable simplifications arise with methods based on the optical flow equation [20, 21, 22]. Here the motion field is computed based on the spatio-temporal derivatives of image brightness, using e.g., Horn and Schunck's linear estimation technique [20]. In order to keep the side motion information at a manageable level, the dense motion field needs to be lossy encoded. Unfortunately, early attempts in that direction have not been encouraging [23, 24]. The authors in [23] conjectured that lossy compression severely degrades the motion field and hence the quality of the prediction. Of course, this difficulty does not arise in backward-motion-compensated coding schemes such as the pel-recursive method, which uses dense motion fields without transmission of side information [1].

While the motion field should ideally optimize the encoder's operational rate-distortion performance, a number of desired practical features of forward-motion-compensated coding schemes emerge from our brief survey of the literature:

- The motion field should define an accurate predictor of the current frame.
- The motion field should be spatially coherent. This enhances interpolation and allows compact encoding of the motion information.
- The motion field is inhomogeneous and should be modeled using spatially-adaptive methods.
- Computational complexity is a significant practical issue.

All of the techniques discussed above also make a tradeoff, implicitly or explicitly, between smoothness (complexity) of the motion field and quality of the prediction. Tradeoffs between complexity and accuracy are not peculiar to motion estimation. They appear in all predictive coding problems for which a fixed, parametric form of the predictor is not available [25].

1.2 Overview of the Paper

Our study begins by pointing out, in Section 2, that in coding applications, motion estimation techniques implicitly or explicitly assume some type of model for the motion field, and represent the motion field in some compact form, of which fixed parameterizations, quadtrees and contour-based codes are examples. (This view focuses on the implicit model rather than on the particular computational technique used to estimate the model parameters.) We propose an alternative in the form of the multiscale motion model

    d(s) = Σ_τ a(τ) φ_τ(s)    (1)

where s are pixel coordinates, d is the motion field, {φ_τ(s)} is a complete, multiresolution basis, and a(τ) are a set of coefficients that parameterize the motion field. Assuming that d is piecewise smooth, we know from image coding practice [26] and theory [27, 28] that the coefficients a(τ) may be compactly encoded. The regular data structure implied by (1) may also present a number of practical advantages. Our particular choice for {φ_τ(s)} is Yserentant's Hierarchical Finite Element (HFE) basis [29]. We describe the motivation for this design and show that it contains the TMC model as a special case.

For this approach to be economical (in terms of bit rate) it is necessary that the estimated motion field be smooth, or at least piecewise smooth. In Section 3 we investigate the use of an optical flow technique for estimating the motion field. The basic optical flow technique produces a spatially coherent motion field, and we modify it so as to improve prediction performance under bit rate constraints. Our algorithm minimizes a modified Horn and Schunck criterion of the form

    E(d) = ‖DFD(d)‖² + λ ‖∇d‖²    (2)

where ∇ is a spatial gradient operator. The first term in (2) is the mean-squared value of a linear approximation to the prediction error, and the second is a regularization term that penalizes roughness of the motion field. The tradeoff between these two terms is controlled by the smoothness parameter λ. The criterion (2) is viewed as a measure of goodness of the motion field. It determines not only the coefficients that parameterize the motion field, but also the quantization method. We formulate the estimation and quantization problems jointly as a discrete optimization problem and solve it using a fast multiscale relaxation algorithm similar to subband image coding algorithms developed in [30, 31].

In Section 4 we cover several implementation issues of considerable practical significance, including the choice of filters for computing spatio-temporal derivatives, the choice of the smoothness parameter λ in (2), and the choice of quantizers.

The Horn and Schunck approach is based on derivative measurements; it is known that such techniques do not properly handle large displacements due to the temporal aliasing problem [2, 3, 32, 33]. This deficiency is addressed in Section 5. We investigate the use of a lowpass control pyramid to perform multiresolution measurements of the derivatives. This hierarchical technique initially produces coarse motion field estimates based on coarse-level derivative measurements, and then progressively improves the estimates using finer-level derivative measurements. The estimation and quantization criterion is modified appropriately; we have investigated the use of inter-level smoothness constraints of a type also encountered in [34]. These constraints enforce spatial coherence of motion field refinements. Anandan et al. [3] achieve the same goal using a windowing technique. Our approach is motivated primarily by coding considerations and may be viewed as a multiscale regularization technique.
In Section 6 we report on our experiments using a variety of video sequences. Conclusions are presented in Section 7.

1.3 Notation

We use italic symbols to denote scalar quantities, and boldface symbols for matrices, vectors and fields. For instance, I(s) is a scalar image, function of spatial coordinates s = (x, y)ᵀ, and f(s) is a field with two (horizontal and vertical) components f_x(s) and f_y(s). When a scalar operator T is applied to both components of a field f, we denote by Tf the resulting field, with components Tf_x and Tf_y. The time variable is denoted by t. If x, y and t are real-valued, the partial derivative operators in the respective directions are denoted by ∂_x, ∂_y and ∂_t. If x, y and t are integer-valued, the backward finite difference operators are respectively denoted by D_x, D_y and D_t. For instance, D_x I(x, y) ≜ I(x, y) − I(x−1, y). The spatial gradient operator is ∇ = (∂_x, ∂_y)ᵀ; on integer-valued coordinates we use its discrete counterpart ∇ ≜ (D_x, D_y)ᵀ, the meaning being clear from context. We reuse these symbols when applying gradient operators to a field, e.g., ∇f is the 2 × 2 matrix with columns ∇f_x and ∇f_y. The Laplacian operator is ∇² ≜ ∂_x² + ∂_y² in the continuous case and ∇² ≜ D_x² + D_y² in the discrete case.

2 Motion Field Models

Let {I(s, t)} be a discretized sequence of images, where s = (x, y)ᵀ are spatial coordinates on an N-pixel rectangular lattice 𝒩 and t ∈ {0, 1, 2, ...} is the temporal coordinate. We consider the motion-compensated prediction scheme

    Î(s, t) = Ĩ(s − d(s), t − 1),    s ∈ 𝒩

where Î(s, t) is a prediction for I(s, t), Ĩ(s, t − 1) is the previous reconstructed frame, and {d(s), s ∈ 𝒩} is the motion field. The components of d(s) may be real-valued. We use bilinear interpolation to define Ĩ(s, t − 1) for all s in a suitable rectangular region. The video bit stream consists of the displaced frame difference (DFD)

    DFD(s) = I(s, t) − Î(s, t),    s ∈ 𝒩    (3)

and {d(s), s ∈ 𝒩}, both suitably compressed and encoded.
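As a concrete illustration, the prediction scheme and the DFD of eq. (3) can be sketched as follows. This is a minimal NumPy sketch; the function names and the clipping-based boundary handling are our own assumptions, not the paper's implementation.

```python
import numpy as np

def predict_frame(prev_recon, d):
    """Motion-compensated prediction I_hat(s,t) = I_tilde(s - d(s), t-1),
    with bilinear interpolation of the previous reconstructed frame.
    prev_recon: (H, W) array; d: (H, W, 2) array of (dx, dy) displacements."""
    H, W = prev_recon.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    # Displaced (possibly fractional) sampling locations s - d(s),
    # clipped to the image (a simplification at the borders).
    x = np.clip(xs - d[..., 0], 0, W - 1)
    y = np.clip(ys - d[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    # Bilinear interpolation from the four surrounding integer pixels.
    return ((1-fy)*(1-fx)*prev_recon[y0, x0] + (1-fy)*fx*prev_recon[y0, x1]
            + fy*(1-fx)*prev_recon[y1, x0] + fy*fx*prev_recon[y1, x1])

def dfd(cur, prev_recon, d):
    """Displaced frame difference, eq. (3): DFD(s) = I(s,t) - I_hat(s,t)."""
    return cur - predict_frame(prev_recon, d)
```

Since the components of d(s) may be real-valued, half-pel displacements simply land between pixels and are resolved by the bilinear weights.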

2.1 Fixed Parameterizations

We first consider the following linear model for the motion field,

    d(s) = Σ_{τ=1}^{M} a(τ) φ_τ(s),    s ∈ 𝒩,    (4)

in which d is a linear combination of M basis functions.

    h_{0,n} = 1    if n = (0, 0),
            = 1/2  if n ∈ {(1, 0), (0, 1), (1, 1)},
            = 0    otherwise,

    h_{l,n} = δ_{n, e_l},    1 ≤ l ≤ 3,

where e₁ = (1, 0)ᵀ, e₂ = (0, 1)ᵀ, and e₃ = (1, 1)ᵀ. It is easily verified that these subband filters define a perfect-reconstruction system. Notice that the lowpass filter does not satisfy the standard normalization condition Σ_n h_{0,n} = 2 [26].
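To make the hierarchical structure concrete, the following sketch synthesizes a field from coarse nodal values plus per-level detail coefficients by linear interpolation on a triangulated grid. The midpoint-averaging formulation is our own rendering of the interpolation implied by the lowpass taps above; the paper's exact filter indexing may differ.

```python
import numpy as np

def refine(coarse, detail):
    """One level of hierarchical synthesis: linearly interpolate the coarse
    nodal values onto the twice-finer grid, then add the detail coefficients
    a(tau) stored at the newly created nodes (zero at retained nodes)."""
    H, W = coarse.shape
    fine = np.zeros((2*H - 1, 2*W - 1))
    fine[0::2, 0::2] = coarse                                   # coarse nodes kept
    fine[1::2, 0::2] = 0.5*(coarse[:-1, :] + coarse[1:, :])     # vertical edge midpoints
    fine[0::2, 1::2] = 0.5*(coarse[:, :-1] + coarse[:, 1:])     # horizontal edge midpoints
    fine[1::2, 1::2] = 0.5*(coarse[:-1, :-1] + coarse[1:, 1:])  # diagonal (hypotenuse) midpoints
    return fine + detail

def synthesize(coarsest, details):
    """d = sum_tau a(tau) phi_tau(s): start at the coarsest level and refine.
    details: list of detail-coefficient arrays, one per finer level."""
    d = coarsest
    for det in details:
        d = refine(d, det)
    return d
```

With all detail coefficients equal to zero this reduces to piecewise-linear (triangular-element) interpolation of the coarse field, which is why the fixed-resolution TMC model appears as a special case.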

3 Motion Field Estimation

The theory and algorithms below apply to any multiscale motion field representation of the type described in §2.2. The results are applied to the special case of HFEs in Section 6.

3.1 Estimation Criterion

Consider the following cost function, which trades off prediction error energy against motion field smoothness,

    E(d) = Σ_s |DFD(s)|² + λ Σ_s ‖∇d(s)‖²    (7)

where DFD is given in (3) and

    ‖∇d(s)‖² = |D_x d_x(s)|² + |D_y d_x(s)|² + |D_x d_y(s)|² + |D_y d_y(s)|².

The first term in (7) is the prediction error energy, and the second is a regularization term used by Horn and Schunck [20] that penalizes rough motion fields. (While various designs of the spatial gradient operator have appeared in the literature, we adopt the backward difference operator for convenience; see (15).) The nonnegative quantity λ, or regularization parameter, weights the influence of the penalty term. The criterion (7) admits a standard Bayesian interpretation [1, 4] and indirectly provides a tradeoff between motion accuracy and predictor complexity as well. Two extreme instances of this tradeoff are worth mentioning. If λ = 0, the cost function is the energy of the DFD. The design (4) with M = N will result in a very complex motion field and most if not all bits of the video coder being allocated to motion [37]. Conversely, if λ = ∞, the motion field is necessarily constant, and virtually all bits will be allocated to the DFD.
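For concreteness, criterion (7) can be evaluated directly. This is a small NumPy sketch; the backward differences follow the Notation section, and boundary terms are simply dropped by np.diff.

```python
import numpy as np

def hs_cost(dfd, d, lam):
    """Evaluate the estimation criterion (7): prediction-error energy plus
    lambda times the Horn-Schunck roughness penalty on the motion field.
    dfd: (H, W) displaced frame difference; d: (H, W, 2) motion field."""
    def sq_grad(f):
        # Backward finite differences D_x f and D_y f (boundary terms dropped).
        dx = np.diff(f, axis=1)
        dy = np.diff(f, axis=0)
        return np.sum(dx**2) + np.sum(dy**2)
    penalty = sq_grad(d[..., 0]) + sq_grad(d[..., 1])
    return np.sum(dfd**2) + lam * penalty
```

A constant motion field makes the penalty vanish, so the cost reduces to the DFD energy, matching the λ = ∞ / λ = 0 discussion above.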

3.2 Linearization of the DFD

An important simplification arises when the DFD is approximated by a linear function of d. Expanding the prediction Ĩ(s − d(s), t − 1) in (3) in a first-order Taylor series around (s, t − 1) yields an approximation of (3) by the residual error

    r(s) ≜ ∇Ĩ(s, t − 1)ᵀ d(s) + FD(s) ≈ DFD(s)    (8)

where FD(s) ≜ I(s, t) − Ĩ(s, t − 1) is the frame difference. By substitution of this approximation into (7), the criterion becomes the quadratic function

    E(d) = Σ_s |r(s)|² + λ Σ_s ‖∇d(s)‖².    (9)

The conditions under which the approximation (8) is valid are examined in §5. This formulation is closely related to the Horn and Schunck approach. Assuming conservation of the brightness of moving patches along motion trajectories would give rise to the classical optical flow equation [1, 20]

    (d/dt) I(s, t) = ∇ᵀI(s, t − 1) ∂_t s + ∂_t I(s, t) ≈ 0.    (10)

Thus, the residual error in (8) is a discrete-time version of the differential in (10). It also accounts for quantization effects (Ĩ ≠ I). The optical flow equation is an overdetermined system of N linear equations with 2N unknowns. In both (8) and (10), the motion component normal to the image gradient does not contribute to the residual error. This is known as the aperture problem and has several important consequences to be examined soon.

Substituting the expressions (8) and (4) for the residual error and the displacement field into (9), we obtain a quadratic expression for the cost function viewed as a function of a,

    E(a) = Σ_τ Σ_{τ'} a(τ)ᵀ R(τ, τ') a(τ') + 2 Σ_τ cᵀ(τ) a(τ) + Σ_s |FD(s)|²    (11)

where

    R(τ, τ') ≜ Σ_s ∇Ĩ(s, t − 1) ∇Ĩ(s, t − 1)ᵀ φ_τ(s) φ_{τ'}(s) + λ Σ_s ∇φ_τ(s) ∇ᵀφ_{τ'}(s)    (12)

are 2 × 2 matrices, and

    c(τ) ≜ Σ_s FD(s) ∇Ĩ(s, t − 1) φ_τ(s)    (13)

are 2-vectors. The derivation may be found in Appendix B, along with a fast algorithm for computing (12) and (13). Each matrix R(τ, τ') is symmetric, and each R(τ, τ) is nonnegative definite. If basis functions φ_τ(s) and φ_{τ'}(s) are nonoverlapping, then R_r(τ, τ') is zero. Orthogonality of φ_τ(s) and φ_{τ'}(s) does in general not imply that R_r(τ, τ') is zero. For this reason we see no advantage to constraining {φ_τ(s)} to be orthogonal.

In the absence of quantizers, the minimizer of (11) is a solution to the sparse linear system

    Σ_{τ'} R(τ, τ') a(τ') = −c(τ),    ∀τ.    (14)

We gain some insight into the aperture problem by momentarily assuming that R(τ, τ') = 0 for τ ≠ τ'. Then the linear system (14) admits the closed-form solution a(τ) = −R(τ, τ)⁻¹ c(τ), ∀τ. Due to the aperture problem, the image-dependent part R_r(τ, τ) of the matrix R(τ, τ) in (12) (R_r(τ, τ) is the first term in the right-hand side of (12); also see Appendix B) is often poorly conditioned at fine scales. In particular, at the finest scale j = 0, the linear HFE φ_τ(s) has point support, so R_r(τ, τ) has rank one. The eigenvalues of the matrix R_r(τ, τ) express our degree of confidence in the estimate a(τ) projected on the respective eigenvectors. The smoothness penalty term adds λ times a constant positive definite matrix to the matrix R_r(τ, τ) and so increases its eigenvalues. The effects of this regularization are more pronounced for unreliable coefficients than for reliable ones.
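The conditioning argument can be checked numerically: with a point-supported basis function, the image-dependent part of R(τ, τ) is a rank-one outer product of the local gradient, and the smoothness term lifts its zero eigenvalue. In this sketch the identity matrix stands in for the constant positive definite matrix contributed by the penalty term; the gradient value is hypothetical.

```python
import numpy as np

# Aperture problem at the finest scale: with a point-supported basis
# function, the image-dependent part of R(tau,tau) is the rank-one matrix
# g g^T built from a single gradient vector g, so the direction normal to
# the gradient is unobservable.
g = np.array([3.0, 1.0])            # hypothetical local image gradient
Rr = np.outer(g, g)                 # rank one: eigenvalues (0, |g|^2)
lam = 5.0
R = Rr + lam * np.eye(2)            # smoothness adds a positive definite term
w_unreg = np.linalg.eigvalsh(Rr)    # ascending: [0, 10]
w_reg = np.linalg.eigvalsh(R)       # ascending: [5, 15]
# The zero eigenvalue (the unreliable direction) is raised to lam, while the
# reliable direction is only mildly perturbed, as described in the text.
```
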

3.3 Estimation and Quantization Algorithm

Given a set of scalar quantizers, the coefficients a(τ) in (4) are constrained to take their value on a prescribed discrete set. We seek those discrete-valued coefficients that optimize the goodness criterion (11). This paradigm is not equivalent to computing the unconstrained optimum from (14) followed by independent quantization of the coefficients in a second step. (If R(τ, τ') were zero for τ ≠ τ', E(a) would be additive in a(τ) and each coefficient could be quantized independently of the others; in general it is not. The previous applications [23, 24] of optical flow to video coding used this plausible, but inferior, quantization technique. The need for quantizing, and not only estimating, under an appropriate distortion measure has also been recognized in [38].)

The discrete optimization problem is solved using a deterministic multiscale relaxation algorithm which extends one in [30, 31]. The algorithm is greedy, so the cost function never increases during iterations, and the algorithm is guaranteed to converge. The algorithm takes advantage of the multiresolution nature of the basis, visiting scales in a coarse-to-fine order and computing a number of quantities using scale recursions. Such algorithms usually converge very rapidly [29, 31, 36, 39, 40, 41], and the motion estimation application is no exception. Our algorithm is summarized in Table 1; detailed explanations and derivations may be found in Appendix C. The steps of the algorithm are outlined as follows. First, an initial guess for a is selected. The coarse-scale components (j = j_max) of a are then updated. Two representations of the motion field are computed in preparation for an update of the components of a at the next finer scale, j − 1. This step is repeated at all scales down to and including the finest scale, j = 0. Several such sequences of coarse-to-fine steps, called sweeps, may be performed successively to improve the estimates.
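The flavor of the joint estimation/quantization step can be conveyed with a toy version: greedy coordinate relaxation of the quadratic cost (11) over a uniform quantizer grid. Scalar coefficients are used and the multiscale scheduling is omitted; the function name and sweep structure are illustrative only, not the algorithm of Table 1.

```python
import numpy as np

def quantized_relaxation(R, c, step, sweeps=2):
    """Greedy coordinate relaxation of the quadratic cost
        E(a) = a^T R a + 2 c^T a
    with each coefficient constrained to a uniform quantizer grid of the
    given step size (scalar coefficients for simplicity)."""
    n = len(c)
    a = np.zeros(n)
    for _ in range(sweeps):
        for k in range(n):
            # Unconstrained minimizer of E over a[k] with the others fixed:
            # dE/da_k = 2 R[k,k] a_k + 2 (sum_{j != k} R[k,j] a_j + c_k) = 0.
            rest = R[k] @ a - R[k, k] * a[k]
            opt = -(rest + c[k]) / R[k, k]
            # For a convex parabola, the nearest quantizer level is also the
            # cost-minimizing level, so the cost never increases (greedy).
            a[k] = step * np.round(opt / step)
    return a
```

Because each update picks the exact cost-minimizing quantizer level for one coefficient, the cost is non-increasing and the iteration converges, mirroring the convergence argument in the text.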

3.4 Computational and Storage Complexity

The computational complexity is O(N) for a full sweep through the pyramid. The exact operations count depends on the number s_max of sweeps through the multiscale pyramid, the number r_max of relaxation steps within each level, the number j_max of levels in the multiscale pyramid, and the size of the filters used for the multiscale pyramid. For illustrative purposes, we shall consider the specific HFE case, and use the number of multiplies as a measure of computational complexity. The HFE transforms are implemented using very short filters with power-of-two coefficients. These multiplies could be implemented very efficiently in hardware, using binary shift registers. The HFE transforms M_h, M_hᵀ, M_g and R use N, N/3, N and 4N multiplies, respectively. The number of multiplies for the various steps in the basic algorithm is summarized in Table 3. The total count is 25N + s_max (33 r_max + 23) N. We usually choose s_max = 2 and r_max = 2, so this translates into approximately 203N multiplies. Substantial savings are possible if the fine-scale coefficients are automatically set to zero instead of being actually computed. For example, if all coefficients at level j = 0 are set to zero, the operations count becomes 77N multiplies. If in addition all coefficients at level j = 1 are set to zero, the operations count becomes 46N multiplies.

Let us now compare this to the computational complexity of full-search, half-pel accuracy BM. Assuming a search range of 16 pixels in each direction, and a local search over the 8 half-pel locations around the optimal integer-pel location, we obtain a total of 264 search locations. When using the MSE criterion for block matching, this would require a total of 264N multiplies. While the complexity of the BM algorithm may be reduced using a variety of methods (at some cost to estimation performance), the above computation indicates that the current implementation of our algorithm is comparable in complexity to full-search BM. Computational complexity may be reduced using various heuristics, e.g., setting all fine-scale coefficients of the motion field to zero. However, developing faster algorithms is beyond the scope of this paper. It must also be mentioned that the decode operation is very fast (two M_h transforms, hence 2N multiplies).

The current algorithm also requires a larger amount of storage than BM; in particular, four vectors (including a and d) require storage of 2N real numbers each. The storage requirements for the R(τ, τ') matrices are broken up as follows. Since each matrix is 2 × 2 and symmetric, we need to store 3 real numbers per matrix. In addition, R(τ, τ') = R(τ', τ). A detailed count of the number of matrices used in the relaxation algorithm gives us 2N matrices to store. The total storage requirements are thus 14N real numbers. Once again, storage requirements may be reduced if fine-scale coefficients are set to zero.

4 Implementation Issues

4.1 Derivative Estimation

The sensitivity of differential motion estimation techniques to noise has been studied by Kearney et al. in [42]. Practical algorithms, including the original Horn and Schunck algorithm [20], use simple finite differences to approximate the derivatives. Finite differences approximate the derivative well at low frequencies, but unfortunately the approximation degrades rapidly with increasing frequency. This has led to the common misconception that differential methods perform more poorly than matching methods if the signal has significant high-frequency content (like texture), or if the signal contains a significant amount of noise. In fact, the original Horn and Schunck algorithm is greatly improved if proper derivative filters are used, and if derivatives are computed after prefiltering [22, 32]. In [32], Simoncelli designs prefilter/derivative-filter pairs which produce reliable derivative measurements. We have adopted his 3-tap filters in our implementation. Unlike [32], we wish to keep the temporal latency low and hence do not prefilter in the temporal direction. The use of these derivative/prefilter pairs produced noticeable improvements when compared to simple central or backward differences. The prediction error was sometimes improved by as much as 1.5 dB for the sequences considered in §6; the motion fields were also smoother.
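The measurement scheme can be sketched as separable prefilter/derivative filtering. The tap values below are generic placeholders, NOT Simoncelli's published coefficients; consistent with the text, no prefiltering is applied along time.

```python
import numpy as np

def derivatives(prev, cur):
    """Spatio-temporal derivative measurement with a matched 3-tap
    prefilter / derivative-filter pair (Simoncelli-style design; the taps
    here are illustrative placeholders). Returns (Ix, Iy, It)."""
    p = np.array([0.25, 0.5, 0.25])   # assumed smoothing prefilter
    d = np.array([0.5, 0.0, -0.5])    # assumed derivative filter (flipped for
                                      # convolution: interior output is
                                      # (f[n+1] - f[n-1]) / 2)
    def sep(img, kx, ky):
        # Separable filtering: rows with kx, then columns with ky.
        tmp = np.apply_along_axis(np.convolve, 1, img, kx, mode='same')
        return np.apply_along_axis(np.convolve, 0, tmp, ky, mode='same')
    Ix = sep(prev, d, p)              # derivative in x, prefilter in y
    Iy = sep(prev, p, d)              # derivative in y, prefilter in x
    It = cur - prev                   # temporal backward difference only
    return Ix, Iy, It
```

The key design point is that the smoothing and differentiating filters are applied in orthogonal directions, so the derivative estimate is consistent across the two spatial axes.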

4.2 Quantizer Design

The quantization loss, defined as the difference between the minimum of the cost function (11) under quantizer constraints and the unconstrained minimum, is equal to

    ΔE = Σ_{τ, τ'} q(τ)ᵀ R(τ, τ') q(τ')

where q(τ) is the quantization error on the coefficient a(τ). We have empirically observed that with HFEs, the matrices R(τ, τ) approximately scale as 2^{2j}. (This is the same scaling factor as for the size of the support set of the basis functions.) These observations prompted us to design the (uniform) quantizers as follows. We made each coefficient contribute to the loss ΔE independently of its scale, by scaling quantizer step sizes as 2^{−j}. In other words, the quantizer steps are much larger at fine scales. The quantizer step size Δ at the coarsest level (j = j_max) was left as a design parameter. Horizontal and vertical components of each coefficient a(τ) were quantized using the same quantizer. In some experiments, coefficients at the finest scale(s) may be automatically quantized to zero, in accordance with the discussion in §3.4.
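In code, the scale-dependent step-size rule reads as follows (a sketch; the dictionary layout and the function name are our own conventions):

```python
import numpy as np

def quantize_coeffs(coeffs, base_step, jmax):
    """Scale-dependent uniform quantization of the model coefficients.
    coeffs: {level j: array of a(tau)}, with j = jmax the coarsest level.
    The step at level j is base_step * 2**(jmax - j), i.e. proportional to
    2**(-j): fine-scale coefficients (small j) are quantized much more
    coarsely, and many of them are driven to zero."""
    out = {}
    for j, a_j in coeffs.items():
        step = base_step * 2.0**(jmax - j)
        out[j] = step * np.round(np.asarray(a_j) / step)
    return out
```

With the matrices R(τ, τ) scaling as 2^{2j} and the squared quantization error scaling as 2^{−2j}, each coefficient's contribution to ΔE is indeed roughly scale-independent.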

4.3 Selection of the Smoothness Parameter

The smoothness parameter λ in (7) controls the trade-off between the prediction error energy and the smoothness penalty term, as discussed in §3.1. In particular, the design λ = 0 minimizes the prediction error energy. However, our actual criterion uses the residual error (8) as an approximation to the prediction error. Due to the limited accuracy of this approximation, we noticed that the prediction error energy viewed as a function of λ is not monotonically increasing as originally expected, but unimodal with a fairly broad minimum. The corresponding motion fields are smooth, giving us some flexibility in the choice of λ. At this point we have not investigated the use of automatic selection criteria.

Another important point in this coding application is that with an appropriate choice of quantizers, even the choice λ = 0 may produce stable motion fields. Indeed, at coarse scales, the matrices R(τ, τ) are usually well-conditioned regardless of the value of λ. Killing all fine-scale coefficients by making the fine-scale quantizer step sizes large enough would ensure stable and spatially coherent motion fields (7). In a practical scenario (λ > 0), the effects of λ and of the quantizers interact in a complex, nonlinear way to regularize the problem.

5 Multiresolution Measurements

The linearization (8) of the DFD implicitly assumes that the image intensity is linear over the range of motion. This assumption is likely to be violated when the motion is large. In this section we present a technique which addresses this well-known problem [2, 3, 32, 33].

(7) Fixed-resolution models such as TMC may be obtained in this fashion.

5.1 Control Pyramid and Coarse-to-Fine Strategy

The culprit is the temporal under-sampling of the video signal. Let us consider a 1-D signal spatially sampled at the Nyquist rate and undergoing a constant motion d. Assuming that the sampling interval in spatial and temporal coordinates is unity, the alias-free condition is |d| < 1 [32]. Unfortunately, a significant amount of temporal aliasing is present in real-world video sequences, due to limited frame rates.

The temporal aliasing problem affects the high spatial frequencies of the image more than the low frequencies. It is therefore advantageous to eliminate the former by lowpass filtering the image. Low-resolution (coarse) motion can then be estimated reliably, but the loss of high-frequency components makes it difficult to estimate high-resolution (fine) motion. A possible remedy consists in first undoing the coarse motion [32] by warping (motion-compensating) the previous image using the estimated coarse motion vectors. The residual motion (between warped and current images) is now smaller, so temporal aliasing in the high-frequency components is reduced. These high-frequency components can now be used to more reliably estimate fine corrections (motion field refinements) to the coarse field. This strategy can be applied recursively and elegantly implemented using a lowpass image pyramid [32]. We refer to this particular pyramid as the control pyramid. We adopted this strategy for estimating and encoding the coarse motion and the correction fields. The algorithm is summarized in Table 3 and detailed below.

5.2 Motion Field Models on the Control Pyramid

Two opposite approaches to modeling of motion field refinements have appeared in the literature. The first models these refinements as spatially incoherent fields [32, 33]; the second assumes spatial coherence [2, 3, 34]. In order to exploit the possible coherence of motion field refinements, we chose to model each of them using HFEs.

Another issue of importance is the choice of the cost function in the hierarchical approach. One possibility is to use a global Horn and Schunck cost function [39, 43]. Another one is to penalize the roughness of each motion field refinement. In [34], the latter approach was found to capture inhomogeneities of the motion field better, and to speed up the numerical algorithm. We also adopt this approach, for the following reasons: (a) we encode each motion field refinement individually, so the coding arguments at the end of Section 1.1 apply directly, and (b) global smoothness is guaranteed if the coarse motion and the refinements are smooth.

5.3 The Hierarchical Algorithm In Table 3, the letter i is used to indicate level within the image pyramid. Quantities such as intensities, derivatives and motion elds at level i are assigned a superscript i. The motion eld obtained by interpolation from level i + 1 to level i is denoted by diji+1 . The algorithm begins by creating the lowpass pyramids for the current and previous frames. The motion eld at the coarsest level, dimax , is estimated and encoded using the multiscale representation (4) and the basic algorithm in Table 1. At every level (0  i < imax ), the following steps are performed.

- The quantized motion field d^{i+1} is upsampled and bilinearly interpolated (these cascaded operations define an operator I), then multiplied by 2 to obtain d^{i|i+1}.
- The previous frame is warped using d^{i|i+1} to obtain a warped frame I~_W.
- A correction field Delta d^i is estimated between the warped and current frames using the basic algorithm in Table 1.
- The motion field d^i is obtained by composing the coarse motion and the correction field [32].

The motion field at the finest level is thus represented by a coarse field and correction fields at increasingly fine levels, similar to the Laplacian pyramid for image coding [26, 44]. Each of these fields is modeled using (4). At each level, the motion vectors are expected to be no greater than one pixel.
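One level of this coarse-to-fine scheme can be sketched as follows (illustrative Python under simplifying assumptions: bilinear upsampling of the coarse field as in the operator I, and nearest-neighbor sampling for the warping and composition steps; the correction-field estimator of Table 1 is not reproduced here):

```python
import numpy as np

def upsample_field(d_coarse):
    """Bilinear 2x upsampling of a vector field (H, W, 2), then doubling,
    so that displacements refer to the finer sampling grid."""
    H, W, _ = d_coarse.shape
    ys = np.minimum(np.arange(2 * H) / 2.0, H - 1)
    xs = np.minimum(np.arange(2 * W) / 2.0, W - 1)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]; wx = (xs - x0)[None, :, None]
    d = ((1 - wy) * (1 - wx) * d_coarse[y0][:, x0]
         + (1 - wy) * wx * d_coarse[y0][:, x1]
         + wy * (1 - wx) * d_coarse[y1][:, x0]
         + wy * wx * d_coarse[y1][:, x1])
    return 2.0 * d

def warp(img, d):
    """Motion-compensate: sample img at s - d(s), nearest neighbor for brevity."""
    H, W = img.shape
    yy, xx = np.mgrid[0:H, 0:W]
    ys = np.clip(np.rint(yy - d[..., 0]).astype(int), 0, H - 1)
    xs = np.clip(np.rint(xx - d[..., 1]).astype(int), 0, W - 1)
    return img[ys, xs]

def compose(d_coarse_up, d_corr):
    """d^i(s) = d^{i|i+1}(s - corr(s)) + corr(s), nearest-neighbor sampling."""
    H, W, _ = d_corr.shape
    yy, xx = np.mgrid[0:H, 0:W]
    ys = np.clip(np.rint(yy - d_corr[..., 0]).astype(int), 0, H - 1)
    xs = np.clip(np.rint(xx - d_corr[..., 1]).astype(int), 0, W - 1)
    return d_coarse_up[ys, xs] + d_corr
```

A level of the hierarchy would then read: upsample the quantized coarse field, warp the previous frame with it, estimate a small correction between the warped and current frames, and compose.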

6 Motion Estimation and Video Coding Results

We have tested our optical-flow coder with control pyramid (OF-CP) on a variety of video sequences, comparing it to two other forward-motion-compensated coders that use full-search BM and the more sophisticated TMC algorithm, respectively. In all cases, DFDs were subband coded using the 13-band subband decomposition depicted in Fig. 4, with Adelson and Simoncelli's 9-tap filters [35] and uniform quantizers in each subband. Thus, the coders compared differed only in the motion estimation method used. The BM algorithm used 16 x 16 blocks and a full search with half-pel accuracy. The search range was adjusted to capture the motion in the sequence under study and depends on the spatial and temporal resolution of the sequence. The TMC algorithm used is a fast hexagonal matching algorithm [9], sometimes more explicitly referred to as TMC-FHM. The algorithm computes an initial estimate of the motion at the grid points using BM with 16 x 16 blocks centered at the grid points and half-pel accuracy. These estimates are then refined using a gradient-based algorithm with quarter-pel accuracy.
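The uniform quantizers used in each subband can be sketched as a simple midtread rule (an assumption: the exact quantizer shape is not specified here); the same rule serves for the quantized motion coefficients:

```python
import numpy as np

def uniform_quantize(x, step):
    """Midtread uniform quantizer: map x to the nearest multiple of `step`.
    Returns (indices, reconstruction); the indices are what an entropy
    coder (e.g., an adaptive arithmetic coder) would see."""
    idx = np.rint(np.asarray(x, dtype=float) / step).astype(int)
    return idx, idx * step
```

The step size controls the rate/distortion trade-off: a coarser step yields fewer distinct indices (lower entropy, lower bit-rate) at the cost of larger reconstruction error.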

Coding of Claire at 24 kbit/s. A first set of experiments was performed on a 128 x 128, 10 frames/s sequence obtained by cropping the original Claire sequence at QCIF resolution [1, p. 433]. We intra-coded the initial frame (I-frame) at 0.6 bits per pixel (bpp) and predictive-coded the following 39 frames (P-frames) at 0.13 bpp. The bit-rates quoted include the motion information; the bits still available after encoding the motion were used to encode the DFD. The overall bit-rate converted to approximately 24 kbit/s. The adaptive arithmetic coder described in [45] was used to entropy-code the quantizer outputs and obtain the quoted bit-rates. Our OF-CP algorithm used two levels in the control pyramid (i_max = 1), and the following parameters for the basic motion estimator (Table 1) at each level: j_max = 4, lambda = 500, and Delta = 0.125. All coefficients at the finest level (j = 0) of the HFE pyramid were set to zero. The parameters were chosen empirically; however, the algorithm performed well over a range of parameter values. Automatic selection of these parameters based on the image and motion data is a topic of future research. The search range for the BM algorithm was 6 pixels in each direction. A comparison of the PSNRs is shown in Fig. 5, and a comparison of motion fields, DFDs, and reconstructed images is presented in Fig. 6. The images in Fig. 6 are from the part of the sequence with relatively large motion. The optical flow technique yielded an average PSNR of 36.2 dB, compared with 35.4 dB and 35.8 dB for BM and TMC, respectively. In addition to the PSNR improvement, our algorithm provided a number of visual improvements, due to superior performance at the edges. The BM algorithm produced severe blocking artifacts; with TMC, severe distortions appear near the right side of the face, the right eye, and the mouth. These results also demonstrate the robustness of our optical flow estimator in the presence of quantization noise. This robustness comes from the use of the control pyramid and from the use of the Simoncelli prefilter/derivative filter pairs for derivative estimation. The results obtained represent a dramatic improvement over earlier optical-flow-based coders, which performed worse than integer-pel-accuracy block matching [23]. Fig. 7 shows a comparison of the motion bit-rates for the three coders. Notice that the BM and TMC bit-rates are fairly constant even when the amount of motion in the sequence varies, mainly because of their fixed resolution. On the other hand, our optical-flow method produces motion fields closer to the true motion, and the motion bit-rate varies with the amount of motion.


Coding of Claire at 128 kbit/s. We also show results on a 256 x 256 sequence obtained by cropping the original Claire sequence at CIF resolution [1]. Here we encoded every other frame, producing a compressed 15 frames/s sequence. We intra-coded the initial frame (I-frame) at 0.7 bpp and predictive-coded the following 59 frames (P-frames) at 0.12 bpp. The overall bit-rate was approximately 128 kbit/s. The OF-CP algorithm used three levels in the control pyramid (i_max = 2), and the same parameters j_max = 4, lambda = 500, and Delta = 0.125 as in the QCIF experiment. All coefficients at the finest level (j = 0) of the HFE pyramid were set to zero. Here the search range for BM was 8 pixels, corresponding to the same range of physical motion as in the QCIF experiment. The PSNRs are compared in Fig. 8. The average PSNR for our optical flow algorithm in this experiment was 39.0 dB, to be compared with 38.4 dB and 38.3 dB for BM and TMC, respectively. In addition to these PSNR improvements, we again obtained substantial visual improvements.

Coding of Flower Garden at 700 kbit/s. We also compared the three motion estimators on the first 30 frames of the Flower Garden sequence at 30 frames/s, using a 256 x 256 sequence cropped from SIF resolution. This is a different type of sequence, with more texture and a smoother, more regular motion. In this case, we used a group of pictures (GOP) [1, p. 441] of size 12, with 0.8 bpp for the I-frames and 0.315 bpp for the P-frames. This translated into a bit-rate of 700 kbit/s. The OF-CP algorithm used the same parameters j_max = 4 and lambda = 500 as in the Claire experiments, but the HFE coefficients were quantized more coarsely: Delta = 0.25, and all coefficients at the finest two levels of the HFE pyramid were set to zero. The search range for BM was 16 pixels in each direction. The PSNRs are compared in Fig. 9. The average PSNR for the OF-CP algorithm was 24.8 dB, to be compared with 24.3 dB and 23.8 dB for BM and TMC, respectively. Here BM performs well due to the translational nature of the motion, and the visual advantage of the OF-CP algorithm was less significant than in the Claire experiments. Another experiment on the Flower Garden sequence demonstrated the need for the control pyramid introduced in Section 5. We compared the OF-CP algorithm used above with the OF-Basic algorithm in Table 1. OF-Basic is a special case of OF-CP with i_max = 0. The motion fields obtained were noisier than the OF-CP fields, so we used a larger step size Delta = 0.5 for OF-Basic, which made motion bit-rates comparable to those of OF-CP. The control pyramid gave us smoother and more accurate fields, and the capability to track large motion. There was a large performance gap between OF-CP and OF-Basic, as illustrated by Fig. 10.

Discussion. The experiments above have demonstrated some of the potential advantages of the OF-CP algorithm over BM and TMC, including PSNR improvements in the range 0.5 to 1 dB and comparable visual improvements for the Claire sequence at 24 and 128 kbit/s and for the Flower Garden sequence at 700 kbit/s. These results have been confirmed by numerous other experiments which are not reported here. However, the current method did not perform as well on sequences with fast and complex motion, such as Football and Susie. We conjecture that this is mainly due to the oversmoothing caused by the hierarchical estimator. Our control-pyramid algorithm suffers from the propagation of coarse motion estimation errors to the finer levels of the pyramid, and from the inability to recover from these errors. The iterated registration algorithm in [46] has demonstrated good potential for addressing that difficulty.

7 Conclusions

We have presented a systematic approach to motion-field estimation and coding. First, the use of multiscale models has been proposed for representing, computing, and encoding motion fields. The goal is to capture the inhomogeneity and spatial coherence of motion fields in a way that simultaneously achieves flexibility, spatial adaptivity, compression performance, and computational efficiency. We have considered models based on arbitrary tree-structured, perfect-reconstruction, nonseparable, two-dimensional filter banks with a separable sampling scheme. In particular, we have investigated the use of HFEs, which are a multiscale extension of the popular TMC model and can be cast in our subband filtering framework. We have adopted the HFEs in our video coding experiments. We have also proposed the following computational framework for forward-motion-compensated predictive coding:

1. Choosing a linear motion-field model and a set of quantizers defines a very large, discrete set of predictors.

2. A criterion that measures the goodness of the motion field (balancing motion-field smoothness and predictor accuracy) is identified.

3. Optimizing the criterion over the set of possible predictors gives rise to a large-dimensional discrete optimization problem. A fast relaxation algorithm is used to solve this optimization problem.

Our particular computational method is based on optical flow and uses a modified Horn and Schunck criterion in step 2 above. The expression (11) for the criterion is a quadratic function of the coefficients in the model, and shows the effects of the aperture problem and of the regularization method on coefficients at various scales. The choice of quantizers has no significant effect on the complexity of our relaxation algorithm. In order to cope with large displacements, the basic algorithm should be coupled with a hierarchical approach. We use a control pyramid for computing derivative measurements at multiple scales. An encoded coarse motion field and a set of correction fields at increasingly fine resolutions are obtained at the end of a single sweep down the control pyramid. We have found that practical issues, such as derivative filter design, have considerable importance. Simple but ad hoc choices for the quantizers and the smoothness parameter have been made; further improvements should be possible. Our current coder outperforms the TMC-FHM algorithm [9], half-pel-accuracy full-search BM algorithms, as well as earlier optical-flow-based algorithms [23, 24]. The choice of a motion-field model is somewhat independent of the specifics of the computational framework; this view, which is apparently not traditional in the video coding community, was already espoused by Anandan and his coworkers in computer vision [2, 3]. For instance, choices other than the modified Horn and Schunck criterion are possible, so long as they produce motion fields that satisfy the basic requirements listed at the end of Section 1.1. Possible candidates include criteria based on Nagel's oriented smoothness constraints [21] and fractal priors [4]. Another possible improvement of the computational technique has to do with the mismatch between prediction error and residual error in (8). We are currently developing an algorithm which alleviates some of these difficulties [46].
These possible modifications to the computational technique need not affect the choice of the motion model. A number of other extensions are possible. We are currently developing a scalable version of the video coder [46]. The use of occlusion models should enhance the performance of the coder [3, 16]. Finally, our approach need not be restricted to forward-motion-compensated predictive coding and 2-D motion-field models. Exploiting the temporal coherence of the motion field [4] and extending our approach to three dimensions should be promising. Meanwhile, temporal coherence could be exploited by initializing our iterative algorithms with the results from the previous frame.

Acknowledgments. The authors are grateful to Alex Loui and Jelena Kovacevic for their helpful interaction and suggestions at various stages of this research.


A  The R-Transform

The R-transform of an image x is computed using (5) and (6). The lowpass and detail signals are respectively

\[ c_{j,kk'} = \sum_{n}\sum_{n'} h_{0,n}\, h_{0,n'}\, c_{j-1,\,2k-n,\,2k'-n'} \]

\[ d_{j,ll',kk'} = \sum_{n}\sum_{n'} h_{l,n}\, h_{l',n'}\, c_{j-1,\,2k-n,\,2k'-n'} \]

where \( j \ge 0 \) and \( c_{-1,kk'} = x(k)\,\delta_{kk'} \). For \( \nu = (j,l,k) \) and \( \nu' = (j,l',k') \), we define

\[ R(x)(\nu,\nu') = \begin{cases} c_{j,kk'} & j = j_{\max} \\ d_{j,ll',kk'} & 0 \le j < j_{\max}. \end{cases} \]
B  The Cost Function (11)

A quadratic expression for the prediction energy, viewed as a function of a, can be obtained from (8) as

\[
E_r(a) \;\stackrel{\Delta}{=}\; \sum_s |r(s)|^2
= \sum_s \Big[ \nabla \tilde I(s,t-1)^T d(s) + FD(s) \Big]^2
= \sum_s \Big[ \nabla \tilde I(s,t-1)^T \sum_\nu a(\nu)\,\varphi_\nu(s) + FD(s) \Big]^2
\]
\[
= \sum_\nu \sum_{\nu'} a(\nu)^T R_r(\nu,\nu')\, a(\nu') + 2 \sum_\nu c(\nu)^T a(\nu) + \sum_s |FD(s)|^2,
\]

where c(\nu) is given in (13) and

\[ R_r(\nu,\nu') \;\stackrel{\Delta}{=}\; \sum_s \nabla \tilde I(s,t-1)\, \nabla \tilde I(s,t-1)^T\, \varphi_\nu(s)\,\varphi_{\nu'}(s). \]

Using the notation in Section 2, (13) can be written as \( c = M_h^T (FD \cdot \nabla \tilde I) \). The collection of 2 x 2 matrices \( \{R_r(\nu,\nu')\} \), where \nu and \nu' belong to the same scale j, plays a central role in our algorithms and can be written in terms of the R-transform as \( R_r = R(\nabla \tilde I\, \nabla^T \tilde I) \).

The Horn and Schunck penalty function takes the form

\[ E_s(a) = \lambda \sum_s \|\nabla d(s)\|^2 = \lambda \sum_s \Big\| \sum_\nu a(\nu)\, \nabla^T \varphi_\nu(s) \Big\|^2 = \sum_\nu \sum_{\nu'} a(\nu)^T R_s(\nu,\nu')\, a(\nu'), \]

where

\[ R_s(\nu,\nu') \;\stackrel{\Delta}{=}\; \lambda \sum_s \nabla \varphi_\nu(s)\, \nabla^T \varphi_{\nu'}(s) = \lambda \sum_s \bar\nabla \nabla^T \varphi_\nu(s)\, \varphi_{\nu'}(s) \]

and \( \bar\nabla \) is analogous to \( \nabla \) but uses forward differences instead of backward differences. The 2 x 2 matrices \( \{R_s(\nu,\nu')\} \), where \nu and \nu' belong to the same scale j, may also be computed off-line using the R-transform, e.g.,

\[ R_s(\nu,\nu') = \lambda\, R(\bar\nabla \nabla^T \delta)\big( (j,l,0),\, (j,l',k'-k) \big). \tag{15} \]

Adding E_r(a) and E_s(a), we obtain expression (11) for the cost function, where R(\nu,\nu') is the sum of the matrices R_r(\nu,\nu') and R_s(\nu,\nu').
C  Derivation of the Relaxation Algorithm in Table 1

We use the following notation. Estimates of quantities such as a at the end of the s-th sweep are denoted by a superscript (s). The current estimates during the s-th sweep, before work at scale j, are denoted by a superscript (s,j). We denote by \( \Delta x \stackrel{\Delta}{=} x - x^{(s-1)} \) the update for a variable x. We denote by \( \Delta d_j \) the decimated, lowpass version of \( \Delta d \) at scale j,

\[
\Delta d_j(k) = \begin{cases}
\Delta a_{j_{\max}}(k) & j = j_{\max} \\
\sum_{k'} h_0(k-2k')\, \Delta d_{j+1}(k') + \sum_{l=1}^{3} \sum_{k'} h_l(k-2k')\, \Delta a_{jl}(k') & 0 \le j < j_{\max}.
\end{cases} \tag{16}
\]

The motion-field update based on \( \Delta d_j \) is

\[ \Delta d(s) = \sum_k \Delta d_j(k)\, \varphi_{jk}(s), \quad s \in N. \tag{17} \]

From (11), the gradient of E(a) with respect to a(\nu) is given by

\[ \eta(\nu) \;\stackrel{\Delta}{=}\; \tfrac{1}{2}\, \nabla_{a(\nu)} E(a) = \sum_{\nu'} R(\nu,\nu')\, a(\nu') + c(\nu). \tag{18} \]

This formula is compact but not very suitable for computation, due to the interaction of a large number of terms at different scales in the right-hand side. An alternative expression for \eta(\nu) is

\[
\eta(\nu) = \nabla_{a(\nu)} \tfrac{1}{2} \Big( \sum_s |r(s)|^2 + \lambda \sum_s \|\nabla d(s)\|^2 \Big)
= \sum_s r(s)\, \nabla \tilde I(s,t-1)\, \varphi_\nu(s) + \lambda \sum_s \nabla^T d(s)\, \nabla \varphi_\nu(s)
= \sum_s \big( r(s)\, \nabla \tilde I(s,t-1) + \lambda\, \bar\nabla^2 d(s) \big)\, \varphi_\nu(s) \tag{19}
\]

or, in short,

\[ \eta = M_h^T \big( r\, \nabla \tilde I + \lambda\, \bar\nabla^2 d \big). \tag{20} \]

Armed with the definitions (17) and (18), we turn to the derivation of the algorithm. At step (s,j) of the algorithm, \( a_j^{(s-1)} \) is updated as follows. Using (11) with \( a = a^{(s-1)} + \Delta a(\nu) \), applying (18), and discarding the term independent of \( \Delta a(\nu) \), we obtain the contribution of \( \Delta a(\nu) \) at scale j to (11),

\[
E_j(\Delta a_j) \;\stackrel{\Delta}{=}\; \sum_{l,k} \Delta a_{jl}(k)^T R_{jll}(k,k)\, \Delta a_{jl}(k)
+ 2 \sum_{l,k} \Big( \eta_{jl}^{(s,j)}(k) + \sum_{(l',k') \ne (l,k)} R_{jll'}(k,k')\, \Delta a_{jl'}(k') \Big)^T \Delta a_{jl}(k), \tag{21}
\]

where \( R_{jll'}(k,k') \stackrel{\Delta}{=} R((j,l,k),(j,l',k')) \). A numerical relaxation technique is used to minimize (21) over \( \Delta a_j \). A simple and efficient scheme is the Gauss-Seidel scheme described in the Appendix of [31]. We let \( \Delta a_j^{(0)} \) be an initial guess for \( \Delta a_j \), and \( \Delta a_j^{(m)} \) the estimate after the m-th relaxation step. The Gauss-Seidel iterations take the form

\[
\Delta a_{jl}^{(m)}(k) = -a_{jl}^{(s-1)}(k) + Q\bigg( a_{jl}^{(s-1)}(k) + \Delta a_{jl}^{(m-1)}(k) - R_{jll}^{-1}(k,k) \Big[ \eta_{jl}^{(s,j)}(k) + \sum_{(l',k') > (l,k)} R_{jll'}(k,k')\, \Delta a_{jl'}^{(m-1)}(k') + \sum_{(l',k') < (l,k)} R_{jll'}(k,k')\, \Delta a_{jl'}^{(m)}(k') \Big] \bigg), \tag{22}
\]

where (l',k') > (l,k) is defined by raster-scan order. (On a parallel computer it may be advantageous to use a different scan order; see [31, 47].) The left-hand side is the constrained least-squares estimate of \( \Delta a_{jl}(k) \), given the current estimates of \( \Delta a_{jl'}(k') \), (l',k') \ne (l,k). The two summations in the right-hand side respectively represent anticausal filtering of \( \Delta a_j^{(m-1)} \) and causal filtering of \( \Delta a_j^{(m)} \). Two Gauss-Seidel relaxation steps are usually sufficient for convergence of \( \Delta a_j^{(m)} \); we let \( \Delta a_j = \Delta a_j^{(2)} \) and obtain the update

\[ a_{jl}^{(s)}(k) = a_{jl}^{(s-1)}(k) + \Delta a_{jl}(k). \tag{23} \]

By application of (8), (4), (16) and (19), we obtain

\[
\eta_{jl}^{(s,j)}(k) = \begin{cases}
\eta_{jl}^{(s-1)}(k) & j = j_{\max},\ l = 0 \\
\eta_{jl}^{(s-1)}(k) + \sum_{k'} R_{jl0}(k,k')\, \Delta d_{j+1}^{(s)}(k') & 0 \le j < j_{\max},\ 1 \le l \le 3.
\end{cases} \tag{24}
\]

From (22) and (24), \( \eta_j^{(s,j)} \) may be viewed as an inter-scale feedback of the motion-field correction at scale j+1.

The following update formula for the criterion has been found useful for monitoring the convergence of the algorithm. Let \( a_U(\nu) \) denote the argument of \( Q(\cdot) \) in (22). Application of a standard least-squares result to (21) shows that any variation \( \delta a(\nu) \) around \( a_U(\nu) \) results in an increase of E by an amount \( \delta E = \delta a(\nu)^T R(\nu,\nu)\, \delta a(\nu) \). Thus, the update (22) reduces E by an amount

\[ \Delta E(\Delta a) = q^{(m-1)T} R(\nu,\nu)\, q^{(m-1)} - q^{(m)T} R(\nu,\nu)\, q^{(m)}, \tag{25} \]

where \( q^{(m-1)} = a^{(s-1)}(\nu) + \Delta a^{(m-1)}(\nu) - a_U(\nu) \) and \( q^{(m)} = a^{(s-1)}(\nu) + \Delta a^{(m)}(\nu) - a_U(\nu) \).
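The core of iteration (22), cyclic coordinate updates passed through the quantizer Q, can be illustrated on a generic quadratic cost E(a) = a^T R a + 2 c^T a (a simplified sketch: the scale ordering, the inter-scale feedback (24), and the 2 x 2 block structure of the actual algorithm are omitted):

```python
import numpy as np

def quantized_gauss_seidel(R, c, step, sweeps=20):
    """Minimize E(a) = a^T R a + 2 c^T a over a with quantized entries
    (multiples of `step`) by cyclic coordinate descent: each coordinate
    update is the exact least-squares solution given the other coordinates,
    passed through the midtread quantizer Q."""
    n = len(c)
    a = np.zeros(n)
    Q = lambda x: step * np.rint(x / step)
    for _ in range(sweeps):
        for i in range(n):
            # Unconstrained minimizer of E along coordinate i, others fixed:
            # R_ii a_i + sum_{j != i} R_ij a_j + c_i = 0.
            a_u = -(c[i] + R[i].dot(a) - R[i, i] * a[i]) / R[i, i]
            a[i] = Q(a_u)
    return a
```

As in the text, quantization costs nothing extra computationally: Q is applied once per coordinate update, and each update can only decrease the (quantized-feasible) cost.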

References

[1] A. M. Tekalp, Digital Video Processing, Prentice-Hall, Upper Saddle River, NJ, 1995.
[2] P. Anandan, "A computational framework and an algorithm for the measurement of visual motion," International Journal of Computer Vision, Vol. 2, pp. 283-312, 1989.
[3] P. Anandan, J. R. Bergen, K. J. Hanna and R. Hingorani, "Hierarchical Model-Based Motion Estimation," pp. 1-22 in Motion Analysis and Image Sequence Processing, M. I. Sezan and R. L. Lagendijk, Eds., Kluwer, 1993.
[4] R. Chin, M. R. Luettgen, W. C. Karl, and A. S. Willsky, "An Estimation-Theoretic Perspective on Image Processing and the Calculation of Optical Flow," pp. 23-52 in Motion Analysis and Image Sequence Processing, M. I. Sezan and R. L. Lagendijk, Eds., Kluwer, 1993.
[5] H. Li, A. Lundmark and R. Forchheimer, "Image Sequence Coding at Very Low Bitrates: A Review," IEEE Trans. on Image Proc., Vol. 3, No. 5, pp. 589-609, 1994.
[6] C. Stiller and D. Lappe, "Gain/Cost controlled displacement-estimation for image sequence coding," Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 4, pp. 2729-2732, 1991.
[7] T. Ebrahimi, "A New Technique for Motion Field Segmentation and Coding for Very Low Bitrate Video Coding Applications," Proc. IEEE Int. Conf. Image Processing, Austin, TX, pp. II.433-437, 1994.
[8] H. Brusewitz, "Motion Compensation With Triangles," Proc. 3rd Int. Conf. on 64 kbit Coding of Moving Video, Rotterdam, The Netherlands, Sep. 1990.
[9] Y. Nakaya and H. Harashima, "An Iterative Motion Estimation Method Using Triangular Patches for Motion Compensation," SPIE Vol. 1605, pp. 546-557, 1991.
[10] K. Aizawa and T. S. Huang, "Model-Based Coding: Advanced Video Coding Techniques for Very Low Bit-Rate Applications," Proc. IEEE, Vol. 83, No. 2, pp. 259-271, Feb. 1995.
[11] G. Sullivan and R. L. Baker, "Motion Compensation for Video Compression Using Control Grid Interpolation," Proc. ICASSP'91, pp. 2713-2716, 1991.
[12] M. H. Chan, Y. B. Yu and A. G. Constantinides, "Variable Size Block Matching Motion Compensation with Applications to Video Coding," IEE Proc., Vol. 137, No. 4, pp. 205-212, Aug. 1990.
[13] F. Dufaux and F. Moscheni, "Motion estimation techniques for digital TV: A review and a new contribution," Proc. IEEE, vol. 83, pp. 858-876, June 1995.
[14] M. T. Orchard, "Predictive motion-field segmentation for image sequence coding," IEEE Trans. Circuits and Systems for Video Technology, vol. 3, pp. 54-70, Feb. 1993.
[15] C.-L. Huang and C.-Y. Hsu, "A new motion compensation method for image sequence coding using hierarchical grid interpolation," IEEE Trans. Circuits and Systems for Video Technology, vol. 4, pp. 42-51, Feb. 1994.
[16] J.-R. Ohm, "Motion-compensated 3-D subband coding with multiresolution representation of motion parameters," Proc. IEEE Int. Conf. Image Processing, vol. II, pp. 250-254, Austin, TX, 1994.
[17] Y. Altunbasak, A. M. Tekalp and G. Bozdagi, "Two-Dimensional Object-Based Coding Using a Content-Based Mesh and Affine Motion Parameterization," Proc. IEEE Int. Conf. Image Processing, vol. II, pp. 394-397, Arlington, VA, Oct. 1995.
[18] H. Zheng and S. D. Blostein, "Motion-Based Object Segmentation and Estimation Using the MDL Principle," IEEE Trans. on Image Proc., Vol. 4, No. 9, pp. 1223-1235, Sep. 1995.
[19] C. Stiller, "Object-Oriented Video Coding Employing Dense Motion Fields," Proc. ICASSP, Adelaide, Australia, pp. V.273-276, 1994.
[20] B. K. P. Horn and B. Schunck, "Determining Optical Flow," Artificial Intelligence, Vol. 17, pp. 185-203, 1981.
[21] H.-H. Nagel, "On the Estimation of Optical Flow: Relations Between Different Approaches and Some New Results," Artificial Intelligence, Vol. 33, pp. 299-324, 1987.
[22] J. L. Barron, D. J. Fleet and S. S. Beauchemin, "Performance of Optical Flow Techniques," Int. Journal of Computer Vision, Vol. 12, No. 1, pp. 43-77, 1994.
[23] M. Nickel and J. H. Husoy, "A hybrid coder for image sequences using detailed motion estimates," Proc. SPIE Conf. on Visual Communications and Image Processing, pp. 963-971, Boston, MA, 1991.
[24] J. N. Driessen, Motion Estimation for Digital Video, Ph.D. thesis, Delft University of Technology, The Netherlands, 1992.
[25] J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore, 1989.
[26] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice-Hall, 1995.
[27] R. A. DeVore, B. Jawerth and V. Popov, "Compression of Wavelet Decompositions," American J. of Mathematics, Vol. 114, pp. 737-785, 1992.
[28] R. A. DeVore and B. Lucier, "Classifying the Smoothness of Images: Theory and Applications to Wavelet Image Processing," Proc. IEEE Int. Conf. Image Processing, Austin, TX, pp. II.6-10, 1994.
[29] H. Yserentant, "On the Multilevel Splitting of Finite-Element Spaces," Numerische Mathematik, Vol. 49, pp. 379-412, 1986.
[30] P. Moulin, "An Adaptive Finite-Element Method for Image Representation," Proc. IEEE Int. Conf. on Pattern Recognition, pp. III.70-74, The Hague, The Netherlands, Sep. 1992.
[31] P. Moulin, "A Multiscale Relaxation Algorithm for SNR Maximization in Nonorthogonal Subband Coding," IEEE Trans. on Image Proc., Vol. 4, No. 9, pp. 1269-1281, Sep. 1995.
[32] E. Simoncelli, "Distributed Representation and Analysis of Visual Motion," Ph.D. thesis, MIT Dept. of Elec. Eng. and Comp. Sci., 1993. Available by anonymous ftp from whitechapel.mit.edu.
[33] E. Simoncelli, "Coarse-to-Fine Estimation of Visual Motion," 8th IEEE Workshop on Im. and Multidim. Sig. Proc., pp. 128-129, Cannes, France, 1993.
[34] S. H. Hwang and S. U. Lee, "A Hierarchical Optical Flow Estimation Algorithm Based on the Interlevel Motion Smoothness Constraint," Pattern Recognition, Vol. 26, No. 6, pp. 939-952, 1993.
[35] E. H. Adelson, E. Simoncelli, and R. Hingorani, "Orthogonal pyramid transforms for image coding," Proc. SPIE Conf. on VCIP, pp. 50-58, 1987.
[36] R. Szeliski, "Fast Surface Interpolation Using Hierarchical Basis Functions," IEEE Trans. on PAMI, Vol. 12, No. 6, pp. 513-528, 1990.
[37] J. V. Gisladottir and M. T. Orchard, "Motion-Only Video Coding," Proc. IEEE Int. Conf. Image Processing, Austin, TX, pp. I.730-734, 1994.
[38] Y. Y. Lee and J. W. Woods, "Motion Vector Quantization for Video Coding," IEEE Trans. on Image Proc., Vol. 4, No. 3, pp. 378-382, Mar. 1995.
[39] D. Terzopoulos, "Image Analysis Using Multigrid Relaxation Methods," IEEE Trans. on PAMI, Vol. 8, pp. 129-139, 1986.
[40] W. L. Briggs, A Multigrid Tutorial, SIAM, Philadelphia, 1987.
[41] F. Heitz, P. Perez and P. Bouthemy, "Multiscale Minimization of Global Energy Functions in Some Visual Recovery Problems," Computer Vision, Graphics and Image Processing: Image Understanding, Vol. 59, No. 1, pp. 125-134, Jan. 1994.
[42] J. K. Kearney, W. B. Thompson, and D. L. Boley, "Optical flow estimation: An error analysis of gradient-based methods with local optimization," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 229-244, 1987.
[43] F. Glazer, "Multilevel Relaxation in Low-Level Computer Vision," pp. 312-330 in Multiresolution Methods for Image Processing and Analysis, A. Rosenfeld, Ed., Springer-Verlag, 1984.
[44] P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Trans. Commun., vol. 31, pp. 532-540, 1983.
[45] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, pp. 520-540, June 1987.
[46] R. Krishnamurthy, P. Moulin and J. W. Woods, "Multiscale Motion Estimation for Scalable Video Coding," Proc. IEEE Int. Conf. Image Processing, Lausanne, Switzerland, pp. I.965-968, Sep. 1996.
[47] P. Moulin, G. Lilienfeld, A. Ogielski, and J. W. Woods, "Video signal processing and coding on data-parallel computers," Digital Signal Processing: A Review Journal, vol. 5, pp. 118-129, Apr. 1995.

Input I~, I, a^(0).
Compute FD = I - I~ and R = R(grad I~ grad^T I~) + R_s in (12).
Compute d^(0) = M_h a^(0), r^(0) = grad^T I~ . d^(0) + FD, eta^(0) = M_h^T ( r^(0) grad I~ + lambda grad-bar^2 d^(0) ).
for s = 1 to s_max do
    Delta a = 0
    for j = j_max downto 0 do
        For all l, k: eta_{jl}^{(s,j)}(k) = eta_{jl}^{(s-1)}(k) if j = j_max, l = 0;
            eta_{jl}^{(s-1)}(k) + sum_{k'} R_{jl0}(k,k') Delta d_{j+1}^{(s)}(k') if 0 <= j < j_max, 1 <= l <= 3   (24)
        Compute Delta a_j(nu) using the Gauss-Seidel algorithm (22).
        For all l, k: a_{jl}^{(s)}(k) = a_{jl}^{(s-1)}(k) + Delta a_{jl}(k)   (23)
        For all k: Delta d_j(k) = Delta a_{j_max}(k) if j = j_max;
            sum_{k'} h_0(k-2k') Delta d_{j+1}(k') + sum_{l=1}^{3} sum_{k'} h_l(k-2k') Delta a_{jl}(k') if 0 <= j < j_max   (16)
    endfor
    d^(s) = d^(s-1) + Delta d
    r^(s) = grad^T I~ . d^(s) + FD   (8)
    eta^(s) = M_h^T ( r^(s) grad I~ + lambda grad-bar^2 d^(s) )   (20)
endfor

Table 1: Basic Multiresolution Optimization Algorithm (OF-Basic). Essential parameters are the coarsest scale j_max, the motion-field smoothness parameter lambda, and the quantizer step size Delta at scale j_max.

Computations performed prior to the sweeps:
- Computation of image gradients grad I~: 8N (using Simoncelli's 3-tap antisymmetric filters, see Section 4.1)
- Computation of the R(nu, nu') matrices: 12N (requires 3 R-transforms)
- Computation of r^(0) and eta^(0): 5N

Computations performed during each sweep (1 <= s <= s_max):
- Update of eta^{(s,j)} in (24): 16N
- r_max relaxation steps (22): 33 r_max N
- Update of Delta d_j(k) in (16): 2N
- Computation of r^(s) and eta^(s) in (8) and (20): 5N

Table 2: Number of Multiplies at Each Step of the Basic Algorithm.

Input I~(s, t-1), I(s, t).
Compute image pyramids I~^i(s, t-1) and I^i(s, t), 0 <= i <= i_max.
Estimate and encode the coarse motion d^{i_max} between I~^{i_max}(s, t-1) and I^{i_max}(s, t) using the algorithm in Table 1.
for i = i_max - 1 downto 0 do
    Compute d^{i|i+1} = 2 I d^{i+1}.
    Perform warping to undo the coarse motion: I~_W(s, t-1) = I~^i(s - d^{i|i+1}(s), t-1).
    Estimate and encode the correction field Delta d^i between I~_W(s, t-1) and I^i(s, t) using the algorithm in Table 1.
    Compose coarse motion and correction: d^i(s) = d^{i|i+1}(s - Delta d^i(s)) + Delta d^i(s).
endfor

Table 3: Coarse-to-fine Algorithm using Control Pyramid (OF-CP). The motion field is represented by d^{i_max} and Delta d^i, 0 <= i < i_max, each of which is estimated and encoded using the basic algorithm in Table 1.

[Figure: two panels. (a) TMC: approximation planar over triangular patches defined by control points; (b) the motion field at a control point X affects the hexagonal area shown.]

Figure 1: The TMC motion field is a linear combination of hexagonal-base pyramid functions whose support set is shown in Fig. 1(b). There is exactly one basis function per control point.

[Figure: analysis section with filters g_{l,n} (l = 0, ..., 3) followed by downsampling; synthesis section with upsampling followed by filters h_{l,n}.]

Figure 2: Subband filtering scheme for representation of our motion fields. The analysis and synthesis sections implement the map (4) and its inverse, respectively. The downsampling and upsampling are by a factor of two along both horizontal and vertical directions.

[Figure: triangles and basis-function supports at resolution levels 2, 1, and 0 (finest).]

Figure 3: Hierarchical Finite Elements at three contiguous scales. Each pixel is associated with one particular resolution level and supports one basis function.

[Figure: the 13 subbands, numbered 0 through 12.]

Figure 4: Subband decomposition used for DFD coding.

[Figure: PSNR versus frame number for BM (*), TMC-FHM (x), and OF-CP (o).]

Figure 5: Comparison of PSNRs on Claire at QCIF resolution, 10 frames/s, coded at 24 kbit/s.

[Figure: image panels (a)-(k).]

Figure 6: Comparison of motion estimation techniques on frames 90 and 93 of Claire at QCIF resolution, 10 frames/s, coded at 24 kbit/s. First row: (a) frame 93 (90 x 80 section shown), (b) raw frame difference, variance = 114. Second row: motion fields for (c) OF-CP with i_max = 2 (0.049 bpp), (d) BM (0.023 bpp), and (e) TMC-FHM (0.015 bpp). Third row: DFDs using (f) OF-CP, variance = 33.1, (g) BM, variance = 55.9, (h) TMC-FHM, variance = 54.1. Fourth row: reconstructed frame 93 using (i) OF-CP, PSNR = 34.77, (j) BM, PSNR = 33.84, and (k) TMC-FHM, PSNR = 33.81.

[Figure: motion bit-rate (bpp) versus frame number for BM (*), TMC-FHM (x), and OF-CP (o).]

Figure 7: Comparison of motion bit rates on Claire at QCIF resolution, 10 frames/s, coded at 24 kbit/s.

[Figure: PSNR versus frame number for BM (*), TMC-FHM (x), and OF-CP (o).]

Figure 8: Comparison of PSNRs on Claire at CIF resolution, 15 frames/s, coded at 128 kbit/s.

[Figure: PSNR versus frame number for BM (*), TMC-FHM (x), and OF-CP (o).]

Figure 9: Comparison of PSNRs on Frames 1-30 of Flower Garden at 700 kbit/s.

[Figure: PSNR versus frame number for OF-Basic (+) and OF-CP (o).]

Figure 10: Comparison of the optical flow algorithm with and without Control Pyramid for Flower Garden at 700 kbit/s.