MULTISCALE MOTION ESTIMATION FOR SCALABLE VIDEO CODING

Ravi Krishnamurthy¹, Pierre Moulin² and John W. Woods¹

¹ Rensselaer Polytechnic Institute, Center for Image Processing Research and ECSE Department, Troy, NY 12180-3590; email: {ravik@ipl, [email protected]

² University of Illinois, Beckman Institute and ECE Department, 405 N. Mathews Avenue, Urbana, IL 61801; email: [email protected]

ABSTRACT

Motion estimation is an important component of video coding systems because it enables us to exploit the temporal redundancy in the sequence. The popular block-matching algorithms (BMAs) produce unnatural, piecewise-constant motion fields that do not correspond to "true" motion. In contrast, our focus here is on high-quality motion estimates that produce a video representation that is less dependent on the specific frame rate or resolution. To this end, we present an iterated registration [1, 2, 3] algorithm that extends previous work on multiscale motion models and gradient-based estimation for coding applications [4, 5]. We obtain improved motion estimates and higher overall coding performance. Promising applications are found in temporally scalable video coding with motion-compensated frame interpolation at the decoder. We obtain excellent interpolation performance and video quality; in contrast, BMA leads to annoying artifacts near moving image edges.

1. INTRODUCTION

Motion estimation is an important component of hybrid video coders because it allows us to exploit the temporal redundancy in the sequence. The block-matching algorithm (BMA) is very popular for forward motion-compensated (FMC) video coding applications because it offers a compact representation of the motion field. However, BMA assumes piecewise-constant motion, and the estimates are generally not representative of the true motion. On the other hand, a variety of advanced motion estimators have been developed for computer vision applications. Unlike video coding, where the focus has been on the reduction of displaced frame-difference (DFD) variance, these techniques attempt to estimate the "true" motion in the scene. These advanced methods produce spatially coherent, dense motion fields. Lossy motion-field compression methods make the approach applicable to FMC video coding [4, 5]. Such motion estimates can be expected to provide a video representation (in terms of motion and DFD) that is less dependent on the specific frame rate or spatial resolution. Additionally, improved motion modeling frequently enhances the performance of motion-compensated interframe coders [4, 5].

In this paper, we first present an iterated registration (IR) algorithm that improves over the optical-flow-based algorithm in [4, 5]. We then apply the method to a temporally scalable coder with temporal subsampling and motion-compensated frame interpolation (MCFI) for the missing frames.

(Part of this research was performed by the first two authors at Bell Communications Research, Morristown, NJ.)
2. MULTISCALE MOTION MODEL

The motion field d = (d_x, d_y)^T is modeled as

    d(s) = \sum_{\lambda} a(\lambda) \varphi_{\lambda}(s),    (1)

where s = (x, y)^T indicates discretized spatial position, \{\varphi_{\lambda}(s)\} is a set of basis functions indexed by \lambda, and \{a(\lambda)\} are the coefficients of the motion field in this basis. This model provides us with a powerful framework that includes many existing techniques like BMA, control-grid interpolation (CGI) and triangle motion compensation (TMC) [4]. In our experiments, we choose a complete basis of hierarchical finite elements (HFEs) [4], and obtain continuous motion fields that are planar over triangular patches. The HFE basis may be regarded as a hierarchical extension of the TMC motion model [4]. HFE transforms are implemented efficiently using short, non-separable 2-D filters with rational coefficients.
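As a concrete illustration of the expansion in Eq. (1), the field can be synthesized by a single tensor contraction over the basis index. The sketch below is our own (hypothetical helper name, NumPy) and assumes the basis functions have been pre-sampled on the pixel grid; it sidesteps the efficient filter-bank implementation the paper actually uses.

```python
import numpy as np

def synthesize_motion_field(coeffs, basis):
    """Evaluate the multiscale model d(s) = sum_l a(l) * phi_l(s) of Eq. (1).

    coeffs : (L, 2) array of coefficients a(l), one (x, y) pair per basis function
    basis  : (L, H, W) array, each basis function phi_l sampled on the pixel grid
    Returns an (H, W, 2) field holding (d_x, d_y) at every pixel s.
    """
    # Contract over the basis index l: d[y, x, :] = sum_l basis[l, y, x] * coeffs[l, :]
    return np.tensordot(basis, coeffs, axes=([0], [0]))
```

With indicator functions over blocks as the basis, this reduces to the piecewise-constant BMA field; with the HFE basis it yields the continuous, piecewise-planar fields described above.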
3. GRADIENT-BASED ESTIMATION: ITERATED REGISTRATION
Motion is estimated by minimizing the criterion

    E(d) = \sum_s |DFD(s)|^2 + \lambda E_s(d).    (2)

Here, DFD(s) = I(s, t) - I(s - d(s), t-1) is the motion-compensated prediction error, and the sum of squares of the motion-field gradient,

    E_s(d) = \sum_s |\partial_x d_x|^2 + |\partial_y d_x|^2 + |\partial_x d_y|^2 + |\partial_y d_y|^2,    (3)

weighted by the positive parameter \lambda, is a smoothness term that penalizes rough motion fields. This variation on the Horn and Schunck method [4, 6] yields spatially coherent, compactly encodable motion estimates.

Standard methods for minimization of E(d) use a linear Taylor-series approximation to the prediction error [3, 4, 5]. The accuracy of this approximation is limited to small displacements, so it is generally not possible to minimize the cost function (2) exactly. The problem persists even when a
coarse-to-fine strategy (using a control pyramid [4]) is used for estimating large displacements. When motion is very large relative to the depth of the pyramid, motion estimates may be grossly inaccurate. The IR method proposed here improves motion estimates at all levels of the control pyramid, and in particular at the coarsest level.

IR extends the basic gradient approach by iteratively minimizing a series of linear approximations to the prediction error, a procedure similar to the one presented in [1, 2]. The algorithm is initialized using a motion field d^0. At iteration k (1 \le k \le k_{max}), a correction \Delta d^k to the current motion field d^{k-1} is computed. The motion field is updated as d^k(s) \triangleq d^{k-1}(s) + \Delta d^k(s). At each iteration, the algorithm defines the warped (registered) image, I_W^k(s, t-1) \triangleq I(s - d^k(s), t-1), and the DFD, DFD^k(s) \triangleq I(s, t) - I_W^k(s, t-1). Then, a first-order Taylor-series expansion of the prediction I(s - d^k(s), t-1) around (s - d^{k-1}(s), t-1) yields the following approximation to the prediction error:

    r^k(s) \triangleq DFD^{k-1}(s) + (\nabla I_W^{k-1})^T \Delta d^k(s).    (4)

Substituting this approximation into (2) yields a quadratic cost function for \Delta d^k,

    E_k(\Delta d^k) = \sum_s |r^k(s)|^2 + \lambda E_s(d^{k-1} + \Delta d^k).    (5)

Using the multiscale motion model in (1), we write d^k(s) = \sum_{\lambda} a^k(\lambda) \varphi_{\lambda}(s) and \Delta d^k(s) = \sum_{\lambda} \Delta a^k(\lambda) \varphi_{\lambda}(s). E_k may then be viewed as a quadratic cost function for \Delta a^k and may be directly minimized over the model coefficients using a gradient-descent algorithm similar to the basic algorithm described in [4, 5]. The minimization is subject to quantizer-set constraints on \{\Delta a^k(\lambda)\}, so the algorithm solves the estimation and compression problems in one single step [4, 5]. The basic algorithm may be viewed as a special case of IR with initial condition d^0 = 0 and a single iteration (k_{max} = 1).

Currently little is known about the convergence properties of IR algorithms [1, 2]. In our experiments, we chose a fixed value for k_{max} and used the motion estimate d^k (0 \le k \le k_{max}) that yielded the lowest prediction error.

In [4, 5], we used multiresolution measurements from an image (control) pyramid in order to handle large displacements. The motion field was represented by a coarse approximation at the coarsest level, and motion-field corrections at the finer levels. (This representation is similar to the Laplacian pyramid used in image coding [7] and finds applications in spatially scalable coding [8].) The coarse approximation and finer-level corrections are each modeled using the HFE basis and estimated in a coarse-to-fine manner using the basic algorithm with k_{max} = 1. In this paper, we use IR at each level of the control pyramid and obtain a better coarse motion field as well as better correction fields. In particular, IR makes it possible to track large motion at the coarsest level, and to recover from propagation of motion errors from coarse to finer levels.
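The warp/linearize/update structure of the IR loop can be sketched in a few lines. The toy implementation below is ours, not the paper's: it replaces the HFE coefficient parameterization and quantizer-set constraints with an unconstrained per-pixel gradient step on the data term \sum_s |r^k(s)|^2 (the smoothness term is omitted), and it uses nearest-neighbour warping instead of sub-pixel interpolation.

```python
import numpy as np

def warp_nn(img, d):
    """Nearest-neighbour warp: returns I_W(s) = img(s - d(s))."""
    H, W = img.shape
    y, x = np.mgrid[0:H, 0:W]
    xs = np.clip(np.round(x - d[..., 0]).astype(int), 0, W - 1)
    ys = np.clip(np.round(y - d[..., 1]).astype(int), 0, H - 1)
    return img[ys, xs]

def iterated_registration(I_t, I_prev, k_max=5, step=0.05, n_grad=100):
    """Sketch of the IR loop: at iteration k, warp the previous frame with
    d^{k-1}, linearize the prediction error as in Eq. (4), and take gradient
    steps on the data term of Eq. (5) to find the correction Delta d^k."""
    d = np.zeros(I_t.shape + (2,))            # d^0 = 0
    for k in range(1, k_max + 1):
        I_w = warp_nn(I_prev, d)              # warped image I_W^{k-1}
        dfd = I_t - I_w                       # DFD^{k-1}(s)
        gy, gx = np.gradient(I_w)             # spatial gradient of I_W^{k-1}
        delta = np.zeros_like(d)              # correction Delta d^k
        for _ in range(n_grad):
            # Linearized residual r^k(s) = DFD^{k-1} + (grad I_W)^T Delta d^k
            r = dfd + gx * delta[..., 0] + gy * delta[..., 1]
            delta[..., 0] -= step * r * gx
            delta[..., 1] -= step * r * gy
        d = d + delta                         # d^k = d^{k-1} + Delta d^k
    return d
```

Each outer iteration re-warps the reference frame, so the linearization is always taken about the current estimate; this is what lets IR move beyond the small-displacement limit of a single Taylor expansion.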
4. MOTION ESTIMATION RESULTS
Motion estimates were obtained between original Frames 46 and 48 of the Susie sequence at source interchange format (SIF) resolution. The motion is large, sometimes in excess of 30 pixels. We compared the optical-flow (OF) algorithm with IR (k_{max} = 10) to the OF algorithm without IR in [4, 5]. Both estimators used 4 levels in the control (image) pyramid, 5 levels in the HFE pyramid, and \lambda = 250. A fixed uniform quantizer was used at each level of the HFE pyramid (the quantization strategy was similar to one described in [4]), and the motion bit rate was computed by measuring the entropies of the coefficients at the various levels. Comparisons were also made to 16x16, half-pixel-accuracy BMA with a search range of [-32, 32] in both the horizontal and vertical directions.

We investigated the use of MCFI [9] for reconstructing the missing odd frame. Here interpolation was done using the backward motion field from Frame 48 to Frame 46. At each point in the intermediate frame, the interpolated value was obtained by averaging corresponding pixels from the previous and following frames along the motion trajectory.

The advantage of using IR is illustrated by the motion fields and the DFDs shown in Fig. 1. The IR algorithm helped us to reliably estimate the relatively large motion that is present even at the coarsest level (i_{max} = 3). The basic OF algorithm sometimes produces oversmoothed motion estimates due to error propagation from coarse to fine levels. The IR algorithm is better able to recover from such errors, as evidenced by a comparison of columns 2 and 3 of Fig. 1. IR provides a sharper and more accurate field; the improvements are most visible in the left side of the image (edge of face, nose, and lips). Improvements are also apparent when viewing the interpolated frames; OF without IR produces a blurry interpolation in these regions. The coarse, piecewise-constant nature of BMA fields is also illustrated in Fig. 1.

For this particular example, the DFD variance for BMA was 1.5 dB lower than for IR. Indeed, the video segment considered here features significant uncovered regions and many areas with significant brightness changes (such as the hair and face). These regions are difficult to predict using a coherent field. On the other hand, BMA is unconstrained by smoothness requirements and is able to produce a lower DFD variance. In many situations, due to the improved motion model, optical-flow algorithms (with or without IR) produce lower DFD variances than BMA (see, for example, the results presented in [4, 5]).

Two points regarding coding performance should be made here. First, the DFD image obtained using IR is smoother and easier to encode than its BMA counterpart; when we subband coded the two DFDs in the above example at 0.4 bits per pixel (bpp), the BMA advantage over IR dropped to a 0.8 dB coding gain, and the coded frame had annoying blocking artifacts. Second, IR presents a striking advantage over BMA for interpolated frames. The frame interpolated using BMA (Fig. 1(l)) exhibits severe blocking artifacts. There is also a large gain for IR in terms of mean-squared error (MSE) of the interpolated frame. However, as both existing literature and our later results indicate, MSE might not be a good indicator of the quality of interpolated frames [9].
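The interpolation rule used above (average the previous and following frames along the motion trajectory, assuming constant velocity) can be sketched directly. The snippet below is our own illustration; it uses nearest-neighbour sampling for brevity, where a real coder would interpolate to sub-pixel positions.

```python
import numpy as np

def mcfi(I_prev, I_next, d, alpha=0.5):
    """Motion-compensated frame interpolation between I_prev (time t-1)
    and I_next (time t), using the backward field d from I_next to I_prev.

    For an intermediate frame at temporal position alpha in (0, 1), each
    pixel s is predicted by averaging the two reference pixels that lie on
    the (constant-velocity) motion trajectory through s: the previous-frame
    pixel at s - alpha*d(s) and the next-frame pixel at s + (1-alpha)*d(s).
    """
    H, W = I_prev.shape
    y, x = np.mgrid[0:H, 0:W]

    def sample(img, dx, dy):
        # Nearest-neighbour sampling with border clamping
        xs = np.clip(np.round(x + dx).astype(int), 0, W - 1)
        ys = np.clip(np.round(y + dy).astype(int), 0, H - 1)
        return img[ys, xs]

    prev = sample(I_prev, -alpha * d[..., 0], -alpha * d[..., 1])
    nxt = sample(I_next, (1 - alpha) * d[..., 0], (1 - alpha) * d[..., 1])
    return 0.5 * (prev + nxt)  # simple average along the trajectory
```

Splitting the motion vector between the two references is what encodes the constant-velocity assumption; when that assumption is violated, the interpolated frame is displaced relative to the original, which is exactly the effect discussed for the "B"-frames in Section 5.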
5. SCALABLE CODING APPLICATION
In many practical coders, especially in video-conferencing applications, it may become necessary to temporally subsample the sequence in order to comply with real-time requirements and bandwidth constraints. In these situations, it would be advantageous to provide the decoder with the option of increasing the frame rate using MCFI. For this purpose, a good motion field should be available; with our motion estimator, this requirement is fulfilled without an additional motion-estimation step at the decoder. The interpolation procedure used here is identical to the one described in Section 4.

The interpolated frames may be regarded as similar to B-frames in MPEG [10], with some important differences. First, no bidirectional motion fields are estimated and transmitted; the field between the I- and P-frames is used for interpolation. Second, no residual is transmitted for the interpolated frames. Therefore, we shall henceforth refer to these interpolated frames as "B"-frames. For our experiments, we subsampled the sequence by a factor of 4, with 3 "B"-frames in between the encoded frames; we shall therefore refer to this as an IBBBP coder.

Both the optical-flow method with IR (IBBBP-IR) and 16x16 half-pixel-accuracy BMA (IBBBP-BM) were tested in this framework. The optical-flow method used 3 levels in the control (image) pyramid, 4 levels in the HFE pyramid, \lambda = 1000 at each level, and IR with k_{max} = 5. Again, fixed uniform quantizers were used at each level in the HFE pyramid, and the motion bit rate was computed by measuring the entropy of the coefficients at the various levels. BMA used a search range of [-16, 16] in both the horizontal and vertical directions. Both coders (IR and BM) used a group of pictures of size 12; of these, only 3 frames (1 I and 2 P) were actually transmitted. The intra-frame coder for the DFD was a subband coder using the Adelson and Simoncelli 9-tap QMF filters [11] and the 13-band decomposition used in [4]. Uniform quantizers were used in each subband, and entropies were used to calculate the bit rates.

The test sequence was a 256x256 cropped version of Claire at SIF resolution and 30 frames per second (fps). Our temporal subsampling reduced the frame rate to 7.5 fps. We considered Frames 64-96, which feature a relatively large rotation of the head near the end of the segment.
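For clarity, the frame-type layout of the IBBBP coder described above can be written out explicitly. The helper below is our own illustration (not part of the paper's coder) and simply enumerates which frames in a group of pictures are transmitted.

```python
def gop_schedule(gop_size=12, n_b=3):
    """Frame types for the 'IBBBP' structure of Section 5: the first frame
    is an I-frame, every (n_b + 1)-th frame after it is a transmitted
    P-frame, and the n_b frames in between are interpolated "B"-frames,
    for which no motion field or residual is transmitted."""
    types = []
    for i in range(gop_size):
        if i == 0:
            types.append("I")
        elif i % (n_b + 1) == 0:
            types.append("P")
        else:
            types.append("B")
    return types
```

With the defaults, `gop_schedule()` yields one I-frame and two P-frames per 12-frame GOP, matching the 3 transmitted frames quoted above.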
In both coders, approximately 0.85 bpp were used for the I-frames; the P-frames used 0.28 bpp (including motion). The resulting bit rate was approximately 250 kbps. A comparison of peak signal-to-noise ratios (PSNRs) is shown in Fig. 2(a); it features oscillations corresponding to low PSNRs for the "B"-frames. However, we observed that PSNR was an inadequate measure of the quality of "B"-frames in the IBBBP-IR coder. A similar observation was made in [9]. To illustrate some of the reasons behind this, we show the reconstructed Frame 90 (a "B"-frame between I-frame 88 and P-frame 92) for both methods, along with the difference with respect to the original (Fig. 2). A visual inspection of the reconstructed (interpolated) frame for IBBBP-IR suggests that the quality is much higher than that indicated by the 34.5 dB PSNR. The reconstruction error is partly due to a displacement between the reconstructed and original images. This displacement is caused by a violation of the constant-velocity assumption made for frame interpolation. While this displacement would be unacceptable in P-frames (since we would then lose track of the motion), it is acceptable in the "B"-frames; the video has good subjective quality as long as the edges in the images are sharply reproduced. Therefore, the quality of the video is much higher than indicated by the PSNR figures, and it may become necessary to use a different measure to describe the quality of "B"-frames.

Fig. 2 also shows that BMA fields were unsuitable for interpolation. A view of the right side of the face shows severe artifacts which substantially degrade video quality. BMA also produced blocking artifacts in reconstructed P-frames.

We also observed significant coding gains going from the basic OF algorithm (k_{max} = 1) to the IR algorithm (k_{max} = 5). The former was unable to reliably estimate the large displacements in the temporally subsampled video and produced blurry "B"-frames similar to Fig. 1(j). The basic OF algorithm also produced inferior visual and MSE performance on the P-frames. On average, IR gave a coding gain of 0.85 dB on the P-frames.
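The PSNR figures quoted above follow the usual definition for 8-bit video; a minimal sketch (our own helper, peak value 255 assumed):

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio in dB: PSNR = 10 log10(peak^2 / MSE),
    where MSE is the mean squared error between reference and reconstruction."""
    mse = np.mean((np.asarray(ref, dtype=float) - np.asarray(rec, dtype=float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR penalizes any pixel-wise difference, a sharply reproduced but slightly displaced "B"-frame scores poorly even when its subjective quality is high, which is the mismatch discussed above.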
6. CONCLUSIONS
In conclusion, we have developed an advanced motion-estimation algorithm that produces dense, spatially coherent, compactly encoded motion fields. The estimator is gradient-based and uses coarse-to-fine iterated registration. It improves the accuracy of the motion estimates in [4, 5]. In particular, the algorithm can successfully track large motion and recover from errors at coarse resolution levels. Due to the improved motion accuracy, the representation of the video in terms of motion fields and prediction-error images is less dependent on the specific frame rate or spatial resolution. This feature offers interesting flexibility in the design and operation of the coder and decoder. For instance, our motion fields may be used for post-processing without an additional motion-estimation step at the decoder. We have implemented this idea and simulated a temporally scalable video coder; our coding and interpolation results have demonstrated several improvements over the popular BMA.
7. REFERENCES
[1] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 5th Intl. Joint Conf. Artificial Intell., pp. 674-679, 1981.
[2] J. K. Kearney, W. B. Thompson, and D. L. Boley, "Optical flow estimation: An error analysis of gradient-based methods with local optimization," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 229-244, 1987.
[3] W. Enkelmann, "Investigations of multigrid algorithms for the estimation of optical flow fields in image sequences," CVGIP, vol. 43, pp. 150-177, 1988.
[4] P. Moulin and R. Krishnamurthy, "Multiscale modeling and estimation of motion fields for video coding," submitted to IEEE Trans. on Image Process., Dec. 1995.
[5] R. Krishnamurthy, P. Moulin, and J. W. Woods, "Optical flow techniques applied to video coding," in Proc. IEEE ICIP, vol. I, pp. 570-573, Washington, DC, 1995.
[6] B. K. P. Horn and B. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981.
[7] P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Trans. Commun., vol. 31, pp. 532-540, 1983.
[8] R. Krishnamurthy, "Optical flow techniques in video coding," CIPR Tech. Report CIPR TR-96-04, Rensselaer Polytechnic Institute, Troy, NY, 1996.
[9] C. Cafforio, F. Rocca, and S. Tubaro, "Motion compensated image interpolation," IEEE Trans. Commun., vol. 38, pp. 215-222, Feb. 1990.
[10] D. Le Gall, "MPEG: A video compression standard for multimedia applications," Communications of the ACM, vol. 34, pp. 46-58, Apr. 1991.
[11] E. H. Adelson, E. Simoncelli, and R. Hingorani, "Orthogonal pyramid transforms for image coding," in Proc. SPIE Conf. on VCIP, pp. 50-58, 1987.
Figure 1: Comparison of motion estimation methods on Frames 46-48 of Susie. Top row: (a) original Frame 46; (b)-(d) motion fields: (b) optical flow (no iterated registration, 0.059 bpp), (c) OF (IR, k_{max} = 10, 0.053 bpp), (d) 16x16 BMA (0.040 bpp). Middle row: (e) raw FD between Frames 46 and 48 (128x128 section, scaled 4 times); (f)-(h) DFDs: (f) OF (no IR, MSE = 146.0), (g) OF (IR, k_{max} = 10, MSE = 90.0), (h) 16x16 BMA (MSE = 62.2). Bottom row: (i) original Frame 47 (96x96 section shown); (j)-(l) interpolated images: (j) OF (no IR, MSE = 63.2), (k) OF (IR, k_{max} = 10, MSE = 45.1), (l) 16x16 BMA (MSE = 88.6).

Figure 2: Comparison of temporally scalable coders on Frames 64-90 of Claire. (a) PSNR vs. frame number (o---o IBBBP-IR, x---x IBBBP-BM); (b), (c) reconstructed Frame 90 (110x90 section): (b) IBBBP-IR, (c) IBBBP-BM; (d), (e) reconstruction errors (scaled 4 times): (d) IBBBP-IR (PSNR = 34.5), (e) IBBBP-BM (PSNR = 32.0).