In the Proceedings of the 1998 IEEE International Conference on Image Processing.

Copyright 1998 IEEE. Published in the 1998 International Conference on Image Processing (ICIP'98), scheduled for October 4-7, 1998 in Chicago, IL. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.

STOCHASTIC MODELING AND ENTROPY CONSTRAINED ESTIMATION OF MOTION FROM IMAGE SEQUENCES

Sergio D. Servetto*†    Christine I. Podilchuk‡

† Beckman Institute, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign. 405 N. Matthews St., Urbana, IL, 61801. http://www.ifp.uiuc.edu/~servetto/

‡ Multimedia Communications Research Dept., Bell Labs - Lucent Technologies. 600 Mountain Av., Murray Hill, NJ, 07974.

* Work performed while S. Servetto was a summer intern at Bell Labs, between 5/97-8/97 and 5/98-6/98.

ABSTRACT

We consider the problem of coding video signals using motion compensation and a forward coded dense motion field. First, we develop a motion estimation technique that yields dense estimates suitable for the coding application; next, we develop a prototype of a video coder, which we use to verify that high coding performance is attainable within our framework. To find our sought motion estimates, we assume motion in an observed image sequence to be a stochastic process, modeled as a Markov Random Field (MRF). The standard Maximum A Posteriori (MAP) estimation problem with MRF priors is formulated as a constrained optimization problem (where the constraint is on the entropy of the sought estimate), but then transformed into a classical MAP estimation problem, and solved using standard techniques. A key advantage of the constrained formulation is that, in the process of transforming it back to the classical framework, parameters which in the classical framework are left unspecified (and often tweaked in an experimental stage) become uniquely determined by the introduced entropy constraint. To verify that our motion estimates are indeed useful for coding, we compare the performance of a prototype video coder with that of an equivalent coder based on block-matching motion estimates. Experimental results reveal, for various types of video signals and at various rates, that: (a) in terms of PSNR, our system equals or improves upon the performance of full search block-matching; and (b) in terms of visual quality our improvements are significant, since our images are completely free of blocking artifacts.

1. INTRODUCTION

1.1. Motivation

Coding is one of the most fundamental problems in video signal processing. Video coding systems are used in a variety of applications, of which a few examples are low bitrate, low quality teleconferencing; medium bitrate, medium quality systems for video storage and retrieval from CD-ROM media; and high bitrate, high quality satellite digital TV. For such applications, the highest rate/distortion performance has been achieved by video coding systems that make use of different motion compensation techniques, in order to reduce temporal correlations existing among different frames in the video sequence. In order to decode a video sequence encoded using motion compensation methods, the decoder is required to form an estimate of the motion field used by the encoder in the coding process.

A vast amount of work has been done on the problem of how to estimate the motion information, and how to convey this information to the decoder. The simplest approach, adopted by all video coding standards (e.g., MPEG-2 [5]), is that of block-based forward-coded motion compensation. In such methods, a set of motion vectors is computed at the encoder, and those vectors are explicitly transmitted to the decoder. Since in this context motion information is considered overhead, in the sense of it being side information to aid the decoding process, sending one motion vector per pixel (at least if using straightforward coding methods) would consume the entire bit budget, and therefore would result in very poor quality of the images reconstructed at the decoder. Therefore, in order to reduce the number of vectors transmitted, a simplifying assumption is made: blocks of pixels move with constant translational motion, and hence the motion field can be subsampled.

Although it has been found in practice that acceptable video coding performance can be obtained using forward block motion, it is clear that there exist inherent problems with this approach. The first one is that since physical motion is not piecewise constant, inaccurate compensation occurs at the boundaries of moving objects. As a result, there are pixels within blocks containing non-uniform motion which are incorrectly compensated, and therefore a significant energy increase in the prediction error signal occurs, with a consequent increase in the bitrate necessary to encode this residue signal. At low bitrates, when high quality coding of the prediction error is not possible, blocking artifacts become clearly visible in the reconstructed images.
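For concreteness, the block-based baseline described above can be sketched as a full search over a small displacement window. This is only an illustrative sketch, not the coder used in the experiments: the function names, the sum-of-absolute-differences cost, the block size, the search range and the tie-breaking rule are all assumptions made for the example.

```python
def block_sad(prev, cur, top, left, dy, dx, size):
    """Sum of absolute differences between a block of `cur` and the
    block of `prev` displaced by (dy, dx). The SAD cost is an assumption."""
    total = 0
    for r in range(size):
        for c in range(size):
            total += abs(cur[top + r][left + c] - prev[top + r + dy][left + c + dx])
    return total


def full_search(prev, cur, size=4, search=2):
    """Return one motion vector (dy, dx) per size x size block of `cur`."""
    rows, cols = len(cur), len(cur[0])
    field = {}
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip displacements that would leave the previous frame.
                    if not (0 <= top + dy and top + dy + size <= rows):
                        continue
                    if not (0 <= left + dx and left + dx + size <= cols):
                        continue
                    cost = block_sad(prev, cur, top, left, dy, dx, size)
                    # Lowest cost wins; ties go to the shorter vector.
                    key = (cost, abs(dy) + abs(dx))
                    if best is None or key < best[0]:
                        best = (key, (dy, dx))
            field[(top, left)] = best[1]
    return field


# Hypothetical 8x8 test frames: `cur_frame` is `prev_frame` shifted right by one pixel.
prev_frame = [[3 * r * r + 5 * c * c + r * c for c in range(8)] for r in range(8)]
cur_frame = [[prev_frame[r][max(c - 1, 0)] for c in range(8)] for r in range(8)]
print(full_search(prev_frame, cur_frame))
```

Every pixel in a block shares the single vector returned for that block, which is exactly the piecewise-constant assumption that breaks down at object boundaries.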

1.2. Motion Estimation Techniques

The literature on motion estimation techniques is vast, so in this section we attempt to summarize the main contributions that we are aware of:

• Motion Estimation Methods Motivated by Video Coding Problems: Overlapped and Segmented Block Motion Compensation (Orchard and Sullivan [10], Orchard [9]); Recursive Estimation of a Dense Motion Field at the Decoder (Netravali and Robbins [8]); Lossy Transform Coding of Dense Motion Fields (Moulin et al. [7]).

• Motion Estimation Methods Motivated by Computer Vision Problems: Motion Correspondences determined by Image Features (Huang and Netravali [4]); Optical Flow Techniques (Horn and Schunck [3]); Bayesian Estimation Techniques (Konrad and Dubois [6]).

1.3. Main Contributions and Paper Organization

Our goal is to derive a new motion estimation technique, to be applied to the construction of a new video coding system. Design goals that we set ourselves are:

• We would like to use a dense motion field for coding video, not a block-based field, the main reason being that we hope to be able to devise a better compression scheme than traditional block-based approaches, by obtaining more accurate prediction and freedom from blocking artifacts.

• Because of the difficulties associated with recursive methods, we would like to explicitly encode the dense motion estimate as side information.

The main contribution we present in this paper is the effective development of a motion estimation technique that meets our design goals. To do so, we take the following steps:

1. We construct an MRF model for motion; our data model captures spatial, temporal and scale coherence properties of motion fields.
2. We develop an algorithm to compute MAP estimates for our data model.
3. We develop an algorithm for coding the motion fields produced by our MAP estimator.
4. We construct a prototype video coding system, that we use to make comparisons.

The works of Konrad and Dubois [6] and Bouman and Liu [1] are the ones that triggered the line of research that led to the results presented in this work. The main differences between our work and [6] are: (a) their data model only accounts for spatial coherence properties of the motion estimates, while our model extends theirs by also including temporal dependencies (the cliques in our model are not only defined in terms of motion vectors at neighboring spatial locations, but also in terms of previous and future motion vectors); (b) their data model is defined at a single scale, while ours is defined at multiple scales; (c) their estimator is based on the Gibbs sampler [2], while ours is a faster greedy algorithm; and (d) while they never consider coding of the estimates, we show how to control the entropy of the posterior distribution, in order to get estimates that can be used in a coding application.

The main difference between our work and [1], besides the obvious fact that we deal with motion vector fields and they deal with segmentations of textured images, is that in their work the multiscale aspects are not part of the data model they define, but only a device they introduce in order to overcome some practical computational problems we also encountered in early stages of our work; instead, the random fields we seek to estimate are indeed defined on a four-dimensional lattice.

The rest of this paper is organized as follows. In Section 2, we formulate and solve the problem of MAP estimation with an entropy constraint and MRF priors. In Section 3, we define a multiscale MRF model for motion fields. In Section 4 we define our prototype video coders, and we present coding results. Finally, in Section 5, we summarize our work and discuss topics for future research.

2. ENTROPY CONSTRAINED MAP ESTIMATION WITH MRF PRIORS

2.1. Plan Outline

It is well known that a most efficient way to specify MRFs is by means of the celebrated Hammersley-Clifford theorem, which proves the equivalence between MRFs and Gibbs distributions. To define a proper Gibbs distribution, two sets of functions have to be specified, which reflect the statistician's beliefs on how data behaves:

• Singleton potential functions are the means provided by the model to force consistency between the random field and the observed data. Intuitively, by a proper choice of singleton potentials, the statistician is able to specify "degrees of agreement" between different realizations of the random field and the observed data.

• Higher order potential functions are the means provided by the model to reflect the structure of the field, independent of the observations. Intuitively, by a proper choice of these potentials, the statistician is able to express within the model properties of the data assumed a priori (e.g., smoothness, location of discontinuities, etc.).

It is clear that, unless the observed data is constrained in some way to agree with the structural properties of the field, the goals of having a field which simultaneously agrees with the observed data (i.e., that minimizes the singleton potentials) and has the desired structural properties (i.e., that minimizes the higher order potentials) are contradictory. This contradiction has been widely recognized in the literature as a major source of ad-hoc-ness in the application of MRF models, since typical solutions to this problem involve the minimization of a linear combination of singleton and higher order potentials, where the weights associated with this linear combination are chosen experimentally, and even worse, need to be changed for each particular set of observations available.

In this section we propose a new way of formulating the MAP estimation problem. We recognize the following facts:

• Singleton potentials do not reflect properties of the field; instead, they are used to allow violations of those properties in the estimation process, to bring the field to agree with the observed data.

• Higher order potentials do reflect properties of the field; and in this case, the closer the field is to the assumed properties, the more predictable it becomes.

Based on these observations, we propose an alternative formulation of the standard MAP optimization problem: instead of finding a realization of the field which minimizes some arbitrarily chosen linear combination of the singleton and higher order potentials, find a realization in which the singleton potential is minimized, but subject to a constraint on the entropy of the field, computed using the measure that results only from considering higher order cliques. In the remainder of this section we formalize these intuitive concepts.

2.2. Constrained Optimization Problem Statement

Let $(f_s)_{s \in L}$ be a random field defined on a lattice $L$, with sample space $\Omega$ and typical realizations $\omega \in \Omega$. Let $U_S(\omega)$, $U_H(\omega)$ be a pair of valid Gibbs potentials, with the property that $U_S(\omega)$ is only defined in terms of singleton cliques, and $U_H(\omega)$ is only defined in terms of non-singleton (higher order) cliques. Let $\mu_H$ be a Gibbs measure on the field $(f_s)_{s \in L}$, defined by means of the potential $U_H(\omega)$. Then, our goal is to find a realization $\omega^*$ of the random field $(f_s)_{s \in L}$, such that $U_S(\omega)$ is minimized, subject to the constraint that the self-information of the solution with respect to the measure $\mu_H$ does not exceed a given bound. Formally, the problem is stated as follows: find $\omega^*$ satisfying

$$\omega^* = \arg\min_{\omega \in \Omega} \frac{U_S(\omega)}{T} \quad \text{subject to} \quad -\log(\mu_H(\omega)) \le R_{\text{budget}}$$

The singleton potential is normalized by the temperature parameter $T$ used in the definition of the measure $\mu_H$.

2.3. Transformation into an Unconstrained Optimization Problem

Let $\lambda$ be a positive real number, and define the Lagrangian cost function:

$$J(\omega, \lambda) = \frac{U_S(\omega)}{T} - \lambda \log(\mu_H(\omega))$$

It is well known [13] that for any fixed value $\lambda_0$ there exists a value of the constraint $R_{\text{budget}}(\lambda_0)$, such that if $\omega(\lambda_0)$ is a solution to the unconstrained problem for $\lambda_0$, then $\omega(\lambda_0)$ is also a solution of the constrained problem, with constraint $R_{\text{budget}}(\lambda_0)$. Note however that this is not equivalent to saying that for any rate constraint $R$, there exists a value $\lambda_0$ of the Lagrange multiplier such that $R = R_{\text{budget}}(\lambda_0)$! The optimization problem being considered in this work is one of discrete optimization, and Lagrangian methods such as the one used here can only achieve solutions lying on the convex hull of the region of achievable pairs (1). Therefore, if this convex hull is not densely populated, the solution thus obtained may be suboptimal. At this time we do not know whether our convex hull is densely or sparsely populated; this is a topic on which we need to do further research. However, we hasten to add that, as suggested by our experimental results, this issue may very well have little practical implication.

(1) The achievable pairs $(D, R)$ are those for which there exists a realization $\omega$, from which $D$ is obtained as $U_S(\omega)/T$ and $R$ is obtained as $-\log(\mu_H(\omega))$.

2.4. Solution of the Unconstrained Problem

The unconstrained cost functional $J(\omega, \lambda)$ is composed of two additive terms: the original singleton potential $U_S(\omega)/T$, and the self-information of the argument $-\log(\mu_H(\omega))$, scaled by the Lagrange multiplier $\lambda$. We show next how it is possible to write, for any fixed $\lambda$ and for all realizations $\omega \in \Omega$, $J(\omega, \lambda)$ in terms of $U^\lambda(\omega)$ (up to the scaling by $T$ and an additive constant), where $U^\lambda(\omega)$ denotes a valid Gibbs potential. To see this, first observe that $-\lambda \log(\mu_H(\omega))$ reduces to a linear function of $U_H$. Taking natural logarithms, so that $\log(e) = 1$:

$$-\lambda \log(\mu_H(\omega)) = -\lambda \log\left(\frac{1}{Z} \exp\left(-\frac{U_H(\omega)}{T}\right)\right) = -\lambda \left(\log\frac{1}{Z} - \frac{U_H(\omega)}{T} \log(e)\right) = c + \lambda \frac{U_H(\omega)}{T}$$

where $c = \lambda \log(Z)$ is a constant independent of the optimization argument $\omega$. So, $J(\omega, \lambda)$ can be written as:

$$J(\omega, \lambda) = \frac{U_S(\omega)}{T} - \lambda \log(\mu_H(\omega)) = \frac{U_S(\omega)}{T} + c + \lambda \frac{U_H(\omega)}{T} \triangleq \frac{U^\lambda(\omega)}{T} + c$$

Therefore, since $U^\lambda(\omega) = U_S(\omega) + \lambda U_H(\omega)$ defined in this way is a valid potential, and since

$$\arg\min_{\omega} J(\omega, \lambda) = \arg\min_{\omega} U^\lambda(\omega)$$

the unconstrained problem is exactly the same as that of classical MAP estimation with MRF posteriors. Therefore, all the standard machinery available to solve the classical problem is applicable to the solution of the constrained problem.

Note that in classical formulations of MAP estimation problems related to Gibbs distributions (e.g., image restoration, image segmentation, unconstrained estimation of motion for machine vision problems), there are no clear criteria for deciding how to assign relative weights to the strengths of the internal and external forces. As a result, ad-hoc choices are made in those contexts (the relative weights between $U_S(\omega)$ and $U_H(\omega)$ are usually tweaked on a per-sequence basis), which in this context can be avoided.
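The relation between the constrained and the unconstrained problems can be checked by brute force on a toy field. The sketch below is only an illustration under stated assumptions: a one-dimensional chain of four sites, three labels per site, a squared-error singleton potential and an absolute-difference pairwise potential are all hypothetical choices, small enough that every realization can be enumerated and the partition function computed exactly.

```python
import itertools
import math

LABELS = (0, 1, 2)        # hypothetical label set for each site
OBSERVED = (0, 2, 2, 0)   # hypothetical observations, one per site
T = 1.0                   # temperature


def u_s(w):
    """Singleton potential: squared disagreement with the observations (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(w, OBSERVED))


def u_h(w):
    """Higher order potential: roughness over neighboring pairs (assumed form)."""
    return sum(abs(a - b) for a, b in zip(w, w[1:]))


# Enumerate the whole sample space; feasible only for toy problems.
FIELDS = list(itertools.product(LABELS, repeat=len(OBSERVED)))
LOG_Z = math.log(sum(math.exp(-u_h(w) / T) for w in FIELDS))


def distortion(w):
    """D = U_S(w) / T."""
    return u_s(w) / T


def rate(w):
    """R = -log mu_H(w) = log Z + U_H(w) / T, in nats."""
    return LOG_Z + u_h(w) / T


def solve_unconstrained(lam):
    """Minimize the Lagrangian J(w, lam) = D(w) + lam * R(w)."""
    return min(FIELDS, key=lambda w: distortion(w) + lam * rate(w))


def solve_constrained(budget):
    """Minimize D(w) subject to R(w) <= budget."""
    feasible = [w for w in FIELDS if rate(w) <= budget]
    return min(feasible, key=distortion)


for lam in (0.0, 0.5, 1.0, 2.0, 4.0, 8.0):
    w = solve_unconstrained(lam)
    best = solve_constrained(rate(w))
    print(lam, w, distortion(w), round(rate(w), 3), distortion(best) == distortion(w))
```

For each multiplier, the last printed column reports whether the unconstrained minimizer attains the smallest distortion among all realizations within its own rate budget. Sweeping the multiplier visits only some of the achievable (D, R) pairs, which is the convex hull limitation noted in Section 2.3.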

3. A MULTISCALE MRF MODEL FOR MOTION

In this section, we construct a multiscale MRF model for the set of motion fields. The key features that we want our data model to express are:

Spatial Coherence. Motion fields are mostly smooth: motion vectors corresponding to pixels at nearby spatial locations are likely to be similar, since such pixels most of the time correspond to a single moving object. Therefore, we want to assign probabilities to fields in a way such that smooth fields are more likely than rough fields.

Temporal Coherence. A similar argument to the case of spatial coherence holds here, where temporal neighbors refer to pixels at spatially close locations in adjacent frames.

Scale Coherence. Image sequences are observed at a single scale. This means that in order to obtain multiscale, successively refined motion estimates, it is necessary to define in what sense the coarse fields approximate the finest field. We do so by letting a motion vector at a coarse scale represent the average motion over some set of pixels in the observed sequence, where the size of this set of pixels increases as the field becomes coarser.

Recall from Section 2 that we argued how high order potential functions are used to express structural properties of the field, while singleton potential functions are used to express how observations update the likelihood with which fields occur. In our data model, the coherence properties listed above are expressed as follows:

• The set of neighbors of a given site is shown in Fig. 1.

• All cliques are given by pairs of sites, where one of the elements in each pair is the center site in Fig. 1, and the other element is each one of the neighbor sites.

• Clique potentials are defined to be inversely proportional to a measure of similarity between the vectors at each clique site (the more similar vectors are, the lower the potential).

• Singleton potentials are defined as the average energy in the motion prediction error signal, where averages are taken over suitable blocks of size $2^k \times 2^k$, for each site at scale $k$.

[Figure 1 diagram: a site and its neighbors, drawn across motion frames n-1, n and n+1 and across motion pyramid levels k and k+1.]

Figure 1: A typical neighborhood system.

Space constraints prevent us from giving formal definitions of the corresponding neighborhood systems, cliques, and potential functions. These can be found however in [11].
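Since the formal definitions are deferred to [11], the following is only a guess at what such potentials might look like in code, not the definitions used in this work. The squared-distance clique potential, the clamping at frame borders, the function names and the one-site greedy update are all assumptions introduced for illustration.

```python
def clique_potential(v, w):
    """Pairwise potential for two motion vectors: squared Euclidean
    distance, so similar vectors give a low potential (assumed form)."""
    return (v[0] - w[0]) ** 2 + (v[1] - w[1]) ** 2


def singleton_potential(prev, cur, site, vector, k):
    """Average squared prediction error over the 2^k x 2^k block
    covered by `site` (block row, block column) at scale k."""
    size = 2 ** k
    top, left = site[0] * size, site[1] * size
    dy, dx = vector
    rows, cols = len(prev), len(prev[0])
    total = 0.0
    for r in range(top, top + size):
        for c in range(left, left + size):
            # Clamp displaced coordinates to the frame (an assumption).
            pr = min(max(r + dy, 0), rows - 1)
            pc = min(max(c + dx, 0), cols - 1)
            total += (cur[r][c] - prev[pr][pc]) ** 2
    return total / (size * size)


def local_energy(prev, cur, site, vector, k, neighbor_vectors, lam):
    """Singleton term plus lam times the clique terms touching this site."""
    data = singleton_potential(prev, cur, site, vector, k)
    prior = sum(clique_potential(vector, n) for n in neighbor_vectors)
    return data + lam * prior


def greedy_update(prev, cur, site, k, neighbor_vectors, lam, candidates):
    """Pick the candidate vector with the lowest local energy, holding
    the neighbors fixed (one step of an ICM-style greedy sweep)."""
    return min(
        candidates,
        key=lambda v: local_energy(prev, cur, site, v, k, neighbor_vectors, lam),
    )
```

In a full estimator, `neighbor_vectors` would collect the spatial, temporal and scale neighbors of Fig. 1; here it is simply a list supplied by the caller. A small multiplier lets the update follow the prediction error, while a large one pulls the vector toward its neighbors.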

4. EXPERIMENTAL RESULTS

To justify our claim that the proposed motion estimation technique has advantages for the video coding application, we present video coding results supporting it. For that purpose we define in this section two video coders, which differ only in the way they estimate and encode motion, and then we compare the coding results obtained using each of them.

4.1. Two Video Coding Systems

The basic components of our video coder based on a dense motion field are:

Coder Architecture. Video frames are partitioned into groups of $N$ frames coded independently. Within each group, the first frame (dubbed I-frame) is encoded as a still image, and the remaining $N - 1$ frames (dubbed P-frames) are encoded using our proposed motion technique, with the residuals coded again as still images.

I-frame and Residue Coder. The still image coder is a standard wavelet-based coder [12].

Motion Coding. The motion vectors within a group of frames are encoded in two steps: first, each vector in the multiscale pyramid is replaced by the difference of itself with its parent vector (the intuition being that due to the scale coherence property of the field, parent and children vectors will be similar); then, each scale in the pyramid is encoded using an adaptive arithmetic coder.

Rate Control. The rate control mechanism adopted is essentially constant bitrate. Denoting by $M$ the total number of bits required to encode all motion vectors, $T$ the total budget (not the temperature of Section 2), and $I$ the bits used to encode the I-frame, $(T - M - I)/(N - 1)$ is the number of bits spent on coding each of the remaining $N - 1$ P-frames in the group.

For the coder based on block motion, the basic architecture, the rate control mechanism, and the I-frame and residue coder remain the same. The only difference is that motion is estimated using block matching, and coded using a standard entropy coder.

4.2. Video Coding Performance

We report coding results on two sequences. One is a head-and-shoulders, low motion type sequence (Manya); the other is a sports, high motion type sequence (football). Both sequences are of size 352 x 240 x 30 frames/sec. We encoded groups of pictures (GOPs) of $N = 15$ frames at a time. Manya was coded at a total rate of 0.05 bits/pixel (124 Kbits/sec); football was coded at 0.125 bits/pixel (310 Kbits/sec). We present results for one typical GOP for each sequence. For Manya, the motion bitrate was 16.7 Kbits/sec (dense field, $\lambda = 2$) and 52.9 Kbits/sec (block field). For football, the motion bitrate was 53.1 Kbits/sec (dense field, $\lambda = 4$) and 73.7 Kbits/sec (block field). For both dense fields the last two refinement levels in the motion pyramid were discarded. Each I-frame received 0.1 bits/pixel (8448 bits). Whatever bits were left after coding motion and the I-frame in each GOP were split evenly among all remaining P-frames. For Manya, our coder attained an average improvement of 0.93dB in PSNR over the equivalent block-based coder; for football, the average PSNRs were identical. In terms of subjective quality, the improvement attained by our technique is significant in both cases, due to its freedom from blocking artifacts. Fig. 2 shows PSNR values and Fig. 3 shows decoded images obtained by both coders.
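As a worked instance of the rate control rule $(T - M - I)/(N - 1)$, the figures quoted above for Manya with the dense field can be plugged in directly. The helper name is hypothetical, and the conversion of the motion bitrate into bits per GOP assumes 1 Kbit = 1024 bits, a convention that roughly matches the quoted 124 Kbits/sec but is not stated in the text.

```python
def p_frame_bits(total_bits, motion_bits, i_frame_bits, n_frames):
    """Bits per P-frame under the even split (T - M - I) / (N - 1)."""
    return (total_bits - motion_bits - i_frame_bits) / (n_frames - 1)


WIDTH, HEIGHT, FPS, N = 352, 240, 30, 15
pixels = WIDTH * HEIGHT                # pixels per frame
total = 0.05 * pixels * N              # Manya: 0.05 bits/pixel over one GOP
i_bits = 0.1 * pixels                  # I-frame at 0.1 bits/pixel
gop_seconds = N / FPS                  # duration of one GOP
# Assumption: 1 Kbit = 1024 bits when converting 16.7 Kbits/sec.
motion = 16.7 * 1024 * gop_seconds

print(round(i_bits), round(total), round(p_frame_bits(total, motion, i_bits, N)))
```

Under these assumptions each of the 14 P-frames receives a little over 3300 bits, which is less than half of what the I-frame gets.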

[Figure 2 plots: PSNR (dB) versus frame number in GOP (0 to 14) for the MRF-motion and block-motion coders; panel (a) spans roughly 25 to 28 dB, panel (b) roughly 22 to 23.8 dB.]

Figure 2: Objective performance comparison of motion estimation techniques: (a) Manya, (b) football. On average, our proposed method results in reconstructions over 0.9dB better than full search block matching for Manya, and identical for football.

5. CONCLUSIONS

5.1. Summary

In this work we studied the problem of how to estimate and explicitly encode as side information dense motion fields, in the context of motion compensated video coding. We started by introducing an entropy constraint in the classical problem of MAP estimation with MRF priors. We showed how this constraint uniquely determines an equilibrium between internal and external forces, and we used it as a mechanism to partially control the number of bits required to encode estimates. We then defined a multiscale MRF model for motion, capturing all the scale, spatial and temporal coherence properties expected of motion fields arising from image sequences.

In order to verify our claims of usefulness of the dense estimates we obtain for the video coding application, we defined two identical video coders, differing only in the way motion is estimated and encoded: in one case, we used our new proposed method, and in the other, we used the standard full-search block-matching algorithm followed by a standard entropy coder. We performed extensive coding tests, both on the simple head-and-shoulders type sequence and on the more complex football sequence, at rates ranging from 128Kbits/sec to 2.5Mbits/sec. In all cases, we found that:

• In terms of objective coding performance, the coder based on our proposed dense motion estimates achieved PSNR numbers between 0.0dB and 1.2dB above those of the block-matching coder (for Manya), and between -0.2dB and 0.2dB (for football), with the better results corresponding to low rates.

• In terms of subjective visual quality, the coder based on our proposed dense motion estimates performed significantly better at low rates, while at high rates the differences were imperceptible.

5.2. Future Work

Building an efficient video coder is a complex task, and is one we have not addressed in this work. A significant number of issues need to be addressed before a satisfactory solution can be obtained, among which the most important ones (but not the only ones) are:

• Finding optimal tradeoffs between bits spent in coding motion and bits spent in coding residue information, and the rate control problem in general.

• Modeling and coding of residue signals.

• Estimation of clique parameters in the MRF model, leading to optimal coding performance under specified rate constraints.

Our next step is to focus on the issue of modeling and coding of residue signals.

[Figure 3 images: panels (a), (b), (c), (d).]

Figure 3: Visual comparison of motion estimation techniques: (a,c) enlarged reconstructions using our motion field, (b,d) enlarged reconstructions using full search block motion. Observe how our reconstructions are free from visually annoying blocking artifacts.

6. REFERENCES

[1] C. Bouman and B. Liu. Multiple Resolution Segmentation of Textured Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(2):99-113, February 1991.

[2] S. Geman and D. Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721-741, November 1984.

[3] B. K. P. Horn and B. Schunck. Determining Optical Flow. Artificial Intelligence, 17:185-203, 1981.

[4] T. S. Huang and A. N. Netravali. Motion and Structure from Feature Correspondences: A Review. Proceedings of the IEEE, 82(2):252-268, February 1994.

[5] International Telecommunications Union. Generic Coding of Moving Pictures and Associated Audio (MPEG-2), 1994.

[6] J. Konrad and E. Dubois. Bayesian Estimation of Motion Vector Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(9):910-927, September 1992.

[7] P. Moulin, R. Krishnamurthy, and J. Woods. Multiscale Modeling and Estimation of Motion Fields for Video Coding. IEEE Transactions on Image Processing, 6(12):1606-1620, December 1997.

[8] A. Netravali and J. Robbins. Motion Compensated Television Coding: Part I. The Bell System Technical Journal, 58(3):631-670, March 1979.

[9] M. Orchard. Predictive Motion Field Segmentation for Image Sequence Coding. IEEE Transactions on Circuits and Systems for Video Technology, 3(1):54-70, February 1993.

[10] M. Orchard and G. Sullivan. Overlapped Block Motion Compensation: An Estimation Theoretic Approach. IEEE Transactions on Image Processing, 3(5):693-699, September 1994.

[11] S. Servetto, C. Podilchuk, and K. Ramchandran. Stochastic Modeling and Entropy Constrained Estimation of Dense Motion Fields for Video Coding. In Preparation, to be submitted to the IEEE Transactions on Image Processing, Fall 1998.

[12] S. Servetto, K. Ramchandran, and M. Orchard. Image Coding based on a Morphological Representation of Wavelet Data. Submitted to the IEEE Transactions on Image Processing, December 1996.

[13] Y. Shoham and A. Gersho. Efficient Bit Allocation for an Arbitrary Set of Quantizers. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(9):1445-1453, September 1988.
