Submitted to IEEE Transactions on Circuits and Systems for Video Technology
Frame Interpolation and Bidirectional Prediction of Video using Compactly-Encoded Optical Flow Fields and Label Fields

Ravi Krishnamurthy¹†, John W. Woods¹, and Pierre Moulin²

¹ Rensselaer Polytechnic Institute, Center for Image Processing Research & ECSE Dept., Troy, NY 12180
  tel: (518) 276-6079, fax: (518) 276-6261, email: {ravik@cipr, [email protected]

² University of Illinois, Beckman Institute, 405 N. Mathews Ave., Urbana, IL 61801
  tel: (217) 244-8366, fax: (217) 244-8371, email: [email protected]

August 11, 1997
Abstract

We consider the problems of motion-compensated frame interpolation (MCFI) and bidirectional prediction in a video coding environment. These applications generally require good motion estimates at the decoder. In this paper, we use a multiscale optical-flow-based motion estimator that provides smooth, natural motion fields under bit-rate constraints. These motion estimates scale well with changes in temporal resolution and provide considerable flexibility in the design and operation of coders and decoders. In the MCFI application, this estimator provides excellent interpolated frames that are superior, both visually and in terms of PSNR, to those produced by conventional motion estimators. We also consider the effect of occlusions in the bidirectional prediction application, and introduce a dense label field that complements our motion estimator. This label field enables us to adaptively weight the forward and backward predictions, and gives us substantial visual and PSNR improvements in the covered/uncovered regions of the sequence.
† This work has been presented in part at ICIP'96 and VCIP'97. Corresponding Author: Ravi Krishnamurthy is currently with Sarnoff Corporation, CN5300, Princeton, NJ 08543-5300. Tel: (609) 734-3116, Fax: (609) 734-2049.
1 Introduction

Video sequences usually contain a high degree of temporal redundancy that can be exploited for coding and processing purposes. Motion estimation and compensation is a powerful means of exploiting this redundancy and is used in most advanced video coders, including the MPEG and H.263 video coding standards [1]. The MPEG I and II standards define three types of frames: intra-coded I-frames, predictively coded P-frames, and bidirectionally predicted B-frames. The I-frames are coded independently of other frames, while the P- and B-frames both involve motion-compensated prediction from previously decoded frames, and an encoding of the residual (prediction or interpolation error). The I- and P-frames together may be viewed as a temporally subsampled representation of the video, and the use of B-frames is a form of interpolative coding that provides temporal scalability [1]. Each B-frame is interpolated from two reference pictures (I or P), one before and one after the current B-frame. B-frames often lead to a greater degree of compression for the following reasons: (a) bidirectional prediction is usually more efficient than forward prediction, and (b) the B-frames are not used as a reference for encoding other frames, and can therefore be quantized coarsely, since the quantization noise does not propagate further.

Motion estimation plays a crucial role in the performance of video coders. The most popular motion estimator for video coding is the block-matching algorithm (BMA), whose popularity comes from its simplicity and low overhead. Even though more sophisticated estimators like triangle motion compensation (TMC) [2, 3] and control-grid interpolation (CGI) [4] have been developed, motion estimators for video coding have, for the most part, remained relatively simple. These estimators address the requirement that the motion field be compactly represented, since it is transmitted as side information. They are usually built with regard to minimizing the residual displaced-frame-difference (DFD) energy, which impacts the coding performance; the accuracy and quality of the motion field are not paramount.

On the other hand, accurate motion estimates are crucial in post-processing applications like frame-rate conversion and deinterlacing. An example of the first type of application is found in video conferencing, where it is often necessary to temporally subsample the sequence in order to comply with
real-time requirements and bandwidth constraints. In these situations, it would be advantageous to provide the decoder with the option of increasing the frame-rate. Such applications involve a resampling of the spatio-temporal sampling grid, and processing along motion trajectories [5]. Thus, when motion estimates are accurate and close to the "true" motion, we can expect the video representation (motion plus residual) to scale better with temporal or spatial resolution. Furthermore, accurate motion estimates may be expected to improve compression performance. However, most accurate motion estimates require a large number of bits to encode, and there is no guarantee that the motion field that optimizes the rate-distortion function is close to the "true" motion. In previous work [6, 7], we have shown that multiscale optical flow field models, coupled with appropriate estimation and quantization techniques, address both requirements of accurate motion and good coding performance, especially when the motion is not too complex. The iterated registration (IR) algorithm recently presented in [8] addresses this limitation. The first contribution of this paper is to demonstrate the benefits of using compactly-encoded, high-resolution motion fields in frame-interpolation applications involving compressed video.

In the applications considered here, no motion estimation is performed at the decoder; the transmitted motion fields are used for processing. This corresponds to the ideas presented by Chupeau [9],
who argues that motion should be estimated as near the video acquisition system as possible, so that this (presumably) good-quality motion field can then be used in subsequent processing and coding steps. The multiscale optical flow estimator that we have developed scales well with spatial and temporal resolution and fits nicely into this framework.

In this paper, we consider the motion-compensated frame interpolation (MCFI) and bidirectional prediction applications. In the first application, MCFI, we transmit a temporally subsampled sequence consisting only of I- and P-frames. Then, at the decoder, the missing frames may be interpolated using the transmitted frames and motion vectors. This provides us with temporal scalability, with basic decoders displaying only the lower frame-rate sequence, and more complex decoders performing the MCFI. The interpolated missing frames are analogous to MPEG's B-frames, and will still be referred to as B-frames. No interpolation-error residuals are transmitted for the B-frames. This surprising design
is made possible by the quality of our motion fields, and the resulting high quality of the interpolated frames. We show that the use of BMA fields fails miserably in this application; for this reason, MPEG encoders usually allocate bits to B-frame interpolation-error residuals. Furthermore, the use of accurate but rate-constrained motion fields provides considerable flexibility in the design of coders and decoders alike, since the video representation scales well with temporal resolution. For example, we could change both the transmitted frame-rate and the display frame-rate at the decoder, depending on video content, available channel bandwidth, and user requirements.

The presence of occlusions (covered/uncovered regions) complicates the problem of bidirectional prediction. In this situation, it is advantageous to weight the forward and backward predictions unequally, to reflect the presence of covered/uncovered regions. The MPEG standard [1] handles this problem by making a decision between forward, backward, and bidirectional prediction on a macroblock-by-macroblock basis. (The MPEG standards also allow intra-coding of a B macroblock, but we do not consider this option in this paper.) The second contribution of this paper is a novel scheme for handling the problems caused by covered/uncovered regions, extending the idea used in the MPEG standard: here the bidirectional prediction may be changed on a pixel-by-pixel basis. Our extension involves the estimation of a label field that weights the forward and backward predictions. The label field is allowed to be dense (one label per pixel), with labels allowed to vary continuously in the range [0, 1]. In order to compress the label field, we represent it in a multiscale basis, and use smoothness and quantizer-set constraints. The label field is then estimated by an efficient, discrete optimization algorithm (similar to the optical flow estimation algorithm described in [6]), enabling us to compactly encode these label fields. The label field complements our optical flow field and provides substantial improvements in interpolation performance. This application necessitates the transmission of a small overhead in the form of label-field data for each B-frame; here again, interpolation-error residuals are not transmitted.

The paper is organized as follows. In Section 2, we discuss previous work on MCFI. Our framework for MCFI is presented in Section 3, where we introduce two codecs that skip frames at the encoder and interpolate the missing frames at the decoder. Each codec uses a different frame-interpolation
procedure. We briefly describe our optical-flow-based motion estimation technique in Section 4. This motion estimator is multiscale and gradient-based; the motion estimates are smooth, have high spatial resolution, and are compactly encoded. We refer the reader to [6, 8, 10] for more details. In Section 5, we present experimental results demonstrating the benefits of using these fields in MCFI. In Section 6, we introduce the new label-field technique for frame interpolation. This method is especially successful in the presence of occlusions, as shown by the experimental results in Section 7. We summarize the discussion and present conclusions in Section 8.
2 Background on MCFI

Motion-compensated frame interpolation (MCFI) is used in video compression (e.g., MPEG's B-frames), as well as in other processing applications like slow-motion video and frame-rate conversion. Motion-compensated filtering and processing is complicated by the sampling of the signal in the temporal and spatial directions; the presence of uncovered/covered backgrounds, and of sensor and coding noise, further complicates the problem. A discussion of spatio-temporal sampling and of the relationships between spatial resolution and frame-rate, as well as a discussion of motion-compensated filtering, is presented by Dubois in [5]. He also describes interpolative coding methods that are based on dropping fields at the encoder and reconstructing the missing frames at the decoder using interpolation. Other coders with field-skipping or frame-skipping are presented by Bierling and Thoma [11, 12], Huang and Chao [13], and Cafforio, Rocca and Tubaro [14]. These references also present techniques to identify and handle occlusions. Examples of standards-conversion applications that use MCFI are presented by Reuter [15] and Lagendijk and Sezan [16].

The MPEG I and II standards use bidirectional prediction to interpolate a B-frame between two (I or P) reference frames [1]. Two motion vectors (forward and backward) are used for each macroblock encoded in the bidirectional prediction mode. The macroblock is then interpolated as a weighted sum of the forward and backward predictions. In this case, the motion fields are estimated at the coder and transmitted to the decoder. The interpolated frames usually contain severe blocking artifacts and are visually inadequate, necessitating the encoding and transmission of residuals for the
B-frame. In general, transmitted BMA vectors, which are used for video coding, are inadequate for motion-compensated processing. This can be handled either by sending residuals for the interpolated frames (or macroblocks in MPEG), or by performing an additional, computationally expensive motion estimation step at the decoder. By using our improved motion estimates, we propose to use the transmitted motion vectors for motion-compensated processing, and avoid both of these disadvantages.
3 Frame Interpolation and Bidirectional Prediction

The following discussion assumes that the interpolated B-frame lies between two reference frames P1 and P2, as shown in Figure 1. Note that P1 and P2 can be either I- or P-frames. For simplicity of presentation, we assume the presence of only one B-frame between two reference frames; this is easily generalized when there are more B-frames. Given a frame A, we denote the intensity at a point s in that frame by I_A(s). The motion field d_{AC} between two frames A and C is used to predict frame C from frame A. This motion field passes through the image grid points of frame C, but not necessarily through the grid points of frame A, since we allow sub-pixel accuracy. Using this notation, we need the motion field d_{P1P2} to predictively code frame P2. The motion fields d_{P1B} and d_{P2B} are also needed to obtain the forward and backward predictions for the B-frame. Defining Î_{B1} as the forward prediction, and Î_{B2} as the backward prediction, we have

$$\hat{I}_{B1}(\mathbf{s}) = I_{P1}\bigl(\mathbf{s} - \mathbf{d}_{P1B}(\mathbf{s})\bigr),\qquad(1)$$

$$\hat{I}_{B2}(\mathbf{s}) = I_{P2}\bigl(\mathbf{s} - \mathbf{d}_{P2B}(\mathbf{s})\bigr).\qquad(2)$$
Bidirectional prediction can be performed by simply averaging the forward and backward predictions. Thus, a possible interpolation for the B-frame, Î_B(s), is

$$\hat{I}_B(\mathbf{s}) = 0.5\,\bigl(\hat{I}_{B1}(\mathbf{s}) + \hat{I}_{B2}(\mathbf{s})\bigr).\qquad(3)$$
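As a concrete illustration of (1)-(3), the Python sketch below warps each reference frame toward the B-frame grid and averages the results. This is a minimal sketch of ours, not the codec's implementation: the bilinear sampler, the array layout (flows stored as (horizontal, vertical) component pairs), and the function names are all illustrative assumptions.

```python
import numpy as np

def warp(image, flow):
    """Motion-compensate `image` onto the B-frame grid: sample it at
    s - d(s) with bilinear interpolation, since sub-pixel motion is allowed."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs - flow[..., 0], 0, w - 1)   # flow[..., 0]: horizontal
    y = np.clip(ys - flow[..., 1], 0, h - 1)   # flow[..., 1]: vertical
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bot = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bot

def interpolate_b_frame(I_p1, I_p2, d_p1b, d_p2b):
    """Equations (1)-(3): average the forward and backward predictions."""
    I_b1 = warp(I_p1, d_p1b)    # forward prediction, eq. (1)
    I_b2 = warp(I_p2, d_p2b)    # backward prediction, eq. (2)
    return 0.5 * (I_b1 + I_b2)  # simple bidirectional average, eq. (3)
```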
This procedure is used by many MPEG coders [1]. Note that (3) always uses bidirectional prediction; there is no explicit handling of covered and uncovered regions. We shall address this concern in Section 6.

The motion fields d_{P1P2}, d_{P1B} and d_{P2B} are needed at the decoder. When coding a macroblock in the bidirectional prediction mode, MPEG coders transmit both motion vectors, in addition to the motion vector already transmitted for the corresponding block in the P-frame. The explicit transmission of all three motion vectors is apparently redundant, but is necessary because BMA fields do not usually scale well with spatial or temporal resolution. On the other hand, our optical flow estimator produces smoother and more accurate motion fields that exhibit spatial and temporal coherence. We can take advantage of this coherence in order to reduce the number of motion fields transmitted. We consider two possible approaches below.
Procedure 1

The first possibility consists of transmitting only d_{P1P2}, which is used to encode P2, and then explicitly calculating the fields d_{P1B} and d_{P2B} from the transmitted field [10]. The motion field d_{P1P2} passes through the grid points of P2, but d_{P1B} and d_{P2B} pass through the grid points of B. Therefore, in order to calculate these fields, we motion-compensate d_{P1P2} along its motion trajectories, and use spatial interpolation to obtain d_{P1B} and d_{P2B}. In our experiments, we use a simple nearest-neighbor interpolator for the spatial interpolation. This approach is similar to the transformation of motion information described in [12]. The main problem with this approach occurs when the motion is accelerated between P1 and P2. In this situation, the interpolated frames are displaced from the original, leading to low PSNR values [8]. However, these frames might still be acceptable from a visual standpoint.
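The sketch below illustrates this projection step under the constant-velocity assumption, with B at temporal fraction t between P1 (t = 0) and P2 (t = 1). The nearest-grid-point scatter and the fallback fill are simplified stand-ins of ours for the nearest-neighbor interpolation described above.

```python
import numpy as np

def split_field(d_p1p2, t=0.5):
    """Procedure 1: derive d_{P1B} and d_{P2B} on the B-frame grid from the
    transmitted field d_{P1P2}, which is anchored on the P2 grid."""
    h, w, _ = d_p1p2.shape
    # Fallback for B-grid points that no trajectory lands on (crude hole fill).
    d_p1b = t * d_p1p2.copy()
    d_p2b = -(1.0 - t) * d_p1p2.copy()
    ys, xs = np.mgrid[0:h, 0:w]
    # The trajectory through P2 pixel (x, y) crosses frame B near (bx, by).
    bx = np.rint(xs - (1.0 - t) * d_p1p2[..., 0]).astype(int)
    by = np.rint(ys - (1.0 - t) * d_p1p2[..., 1]).astype(int)
    ok = (bx >= 0) & (bx < w) & (by >= 0) & (by < h)
    # Constant velocity: scale each vector by the temporal distance to the
    # corresponding reference frame, scattered to the nearest B-grid point.
    d_p1b[by[ok], bx[ok]] = t * d_p1p2[ys[ok], xs[ok]]
    d_p2b[by[ok], bx[ok]] = -(1.0 - t) * d_p1p2[ys[ok], xs[ok]]
    return d_p1b, d_p2b
```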
Procedure 2

The second possibility is to estimate and transmit the fields d_{P1B} and d_{BP2}. First, we encode d_{P1B} using the optical flow algorithm described in the following section. This motion field is then used as a prediction for d_{BP2}, and we estimate and encode a correction to this prediction. This corresponds to a temporal DPCM with a prediction coefficient of 1. The same technique is used for successively predicting K motion fields if there are K B-frames. In many cases, when the motion between the reference frames is fairly uniform, the temporal DPCM coding of the motion vectors gives us substantial bit-rate savings. Then, d_{P1P2} can be obtained by the composition operation (represented here by ∘),

$$\mathbf{d}_{P1P2}(\mathbf{s}) = \mathbf{d}_{P1B}(\mathbf{s}) \circ \mathbf{d}_{BP2}(\mathbf{s}) = \mathbf{d}_{BP2}(\mathbf{s}) + \mathbf{d}_{P1B}\bigl(\mathbf{s} - \mathbf{d}_{BP2}(\mathbf{s})\bigr).\qquad(4)$$
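In code, the composition (4) only requires sampling d_{P1B} at the off-grid points s − d_{BP2}(s); a sketch of ours, reusing the bilinear `warp` from the earlier listing on each flow component:

```python
import numpy as np

def compose_fields(d_p1b, d_bp2):
    """Equation (4): d_{P1P2}(s) = d_{BP2}(s) + d_{P1B}(s - d_{BP2}(s)).
    Each component of d_{P1B} is resampled off-grid with the same
    bilinear sampler used for images."""
    shifted = np.stack([warp(d_p1b[..., c], d_bp2) for c in range(2)], axis=-1)
    return d_bp2 + shifted
```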
The field d_{P2B} needed for backward prediction may be obtained from d_{BP2} by using a procedure similar to Procedure 1 above. The composition operation also generalizes in a straightforward manner when there is more than one B-frame between reference frames. Our experiments, which are detailed in Section 5, indicate that both approaches provide similar performance on the P-frames. This is an indication that our optical flow field scales well with changes in temporal resolution. However, the second approach gave us better performance (in terms of PSNR) on the B-frames, since it can handle acceleration.
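The temporal DPCM of Procedure 2 can be summarized by the sketch below; `encode_field`, standing for the HFE analysis, quantization, and entropy coding of Section 4, is a hypothetical placeholder here, and the sketch assumes it returns the quantized reconstruction of its input.

```python
def dpcm_encode_motion(fields, encode_field):
    """Temporal DPCM with prediction coefficient 1: code the first motion
    field directly, then code each later field as a correction to the
    previously reconstructed one (as between d_{P1B} and d_{BP2})."""
    reconstructed = []
    prev = None
    for d in fields:
        residual = d if prev is None else d - prev
        rec = encode_field(residual)       # quantize + entropy-code
        prev = rec if prev is None else prev + rec
        reconstructed.append(prev)         # decoder-side field
    return reconstructed
```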
4 Motion Estimation

The motion estimator used is gradient-based, and is capable of providing a high-resolution estimate of large motion under bit-rate constraints [8]. A brief overview is given below; a more detailed description can be found in [6, 10].
4.1 Motion Field Models

In forward motion-compensated coding applications, most motion estimation techniques implicitly or explicitly assume a model for the motion field, and compactly represent the motion field. We model the motion field in the form

$$\mathbf{d}(\mathbf{s}) = \sum_{k} a(k)\,\varphi_k(\mathbf{s}),\qquad(5)$$
where s are pixel coordinates, d is the motion field, {φ_k(s)} is a basis, and {a(k)} is a set of coefficients that parameterize the motion field. Most existing motion estimation techniques, like BMA, TMC [2, 3], and CGI [4], use this model either explicitly or implicitly. For example, in BMA, the basis consists of indicator functions over blocks. The number of basis functions is typically much smaller than the number of pixels in the image, leading to a compact representation of the motion field. In contrast, we use a complete multiscale basis and rely on the energy compaction capabilities of the basis. Our particular choice is Yserentant's Hierarchical Finite Element (HFE) basis [6], which may be regarded as an extension of the TMC basis. In TMC, the motion is modeled as affine over triangular patches. In the HFE model, we use a hierarchy of triangulations, and successive approximations are affine over higher-resolution (smaller) triangular patches. The HFE basis provides a smooth and continuous approximation to the motion field, provides energy compaction when encoding piecewise-smooth motion fields, and gives spatially adaptive resolution to the motion field approximation. The HFE basis is similar to many of the wavelet bases described in the literature, with the basis functions being scaled and translated versions of a generating function. Therefore, transforms involving this basis may be efficiently implemented using scale recursions and filterbanks, which also leads to efficient multiscale optimization algorithms.
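To fix ideas, the sketch below synthesizes a field from a hierarchy of coefficient grids, each level adding an interpolated correction at higher resolution. It mimics the coarse-to-fine structure of the representation, but substitutes simple separable linear upsampling of ours for the triangulated (affine-per-patch) interpolation of the actual HFE basis.

```python
import numpy as np

def upsample(coarse, shape):
    """Separable linear interpolation of a coefficient grid to `shape`
    (our stand-in for interpolation over a refined triangulation)."""
    h, w = shape
    tmp = np.empty((coarse.shape[0], w))
    xs = np.linspace(0, coarse.shape[1] - 1, w)
    for i in range(coarse.shape[0]):
        tmp[i] = np.interp(xs, np.arange(coarse.shape[1]), coarse[i])
    out = np.empty((h, w))
    ys = np.linspace(0, coarse.shape[0] - 1, h)
    for j in range(w):
        out[:, j] = np.interp(ys, np.arange(coarse.shape[0]), tmp[:, j])
    return out

def synthesize(levels):
    """Coarse-to-fine synthesis: `levels[0]` is the coarsest grid of
    (quantized) coefficients; each finer grid adds a correction to the
    interpolated coarse approximation."""
    field = levels[0]
    for corr in levels[1:]:
        field = upsample(field, corr.shape) + corr
    return field
```

Setting the finest-level coefficient grids to zero, as is done in the experiments below, simply drops the last correction terms of this synthesis.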
4.2 Optical Flow Estimation Algorithm

Motion is estimated by minimizing the criterion

$$E(\mathbf{d}) = \sum_{\mathbf{s}} |DFD(\mathbf{s})|^2 + \lambda E_s(\mathbf{d}).\qquad(6)$$

Here, DFD(s) is the motion-compensated prediction error, and the energy of the motion field gradient,

$$E_s(\mathbf{d}) = \sum_{\mathbf{s}} \bigl[\, |\partial_x d_x|^2 + |\partial_y d_x|^2 + |\partial_x d_y|^2 + |\partial_y d_y|^2 \,\bigr],\qquad(7)$$
weighted by the positive parameter λ, is a smoothness term that penalizes rough motion fields. This variation on the Horn and Schunck method [6, 17] yields spatially-coherent, compactly-encodable motion estimates. Other smoothness constraints may also be used, and are discussed in [10].

Standard methods for minimization of E(d) use a first-order Taylor series approximation to DFD(s), viewed as a function of d(s) [18, 19, 6, 7]. This approach produces an approximation to E(d) that is quadratic in d. The linear multiscale model (5) for the motion field is then substituted into (6), yielding an equivalent quadratic cost function Ẽ(a) in the model coefficients {a(k)}. Since we are considering coding applications, we impose quantizer-set constraints on {a(k)}. Minimization of the cost function Ẽ(a) subject to these constraints is a discrete optimization problem, which we solve using a fast relaxation algorithm. This method allows us to efficiently estimate and represent the motion field under bit-rate constraints.

The accuracy of the linear approximation is limited to small displacements, so it is generally not possible to minimize the cost function (6) exactly. In order to estimate large displacements, we have used two extensions. The first, called iterated registration (IR) [20, 21, 8, 22, 10], uses a sequence of linear Taylor series approximations to increase the range of motion. This method can be computationally expensive when the motion is very large. The second extension is a hierarchical motion estimator that uses an image (control) pyramid and a coarse-to-fine strategy. This procedure minimizes a sequence of lowpass, subsampled versions of the prediction error [6]. The procedure is computationally efficient and also finds applications in spatially scalable (multiresolution) coding [10]. However, the coarse-to-fine strategy can lead to oversmoothing at object boundaries, due to the propagation of coarse motion estimation errors to the finer levels.

For the applications presented here, we use the optical-flow-based estimator in [10, 8, 22], which combines iterated registration with the control pyramid. This enables us to trade off the relative advantages of each algorithm, and we obtain compactly-encoded motion fields that capture reasonably large and complex motion. In the rest of this paper, we shall refer to this motion estimator as the optical flow (OF) estimator.
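For intuition, here is a minimal pixel-domain sketch of one such linearized (IR) step: the DFD is linearized around the current field, and Horn-and-Schunck-style Jacobi relaxation solves the resulting quadratic problem for an increment. The regularization of the increment itself, and the omission of the HFE representation and quantizer-set constraints, are simplifications of ours.

```python
import numpy as np

def ir_step(I_ref, I_cur, d, lam=1000.0, n_iter=50):
    """One iterated-registration step for criterion (6)-(7), sketched in
    the pixel domain.  `warp` is the bilinear sampler from Section 3."""
    warped = warp(I_ref, d)          # current motion-compensated prediction
    gy, gx = np.gradient(warped)     # spatial intensity gradients
    dfd = I_cur - warped             # prediction error to be reduced
    du = np.zeros_like(dfd)
    dv = np.zeros_like(dfd)
    for _ in range(n_iter):
        # Neighborhood means realize the smoothness coupling of (7).
        dub = 0.25 * sum(np.roll(du, s, a) for s in (1, -1) for a in (0, 1))
        dvb = 0.25 * sum(np.roll(dv, s, a) for s in (1, -1) for a in (0, 1))
        # Closed-form pointwise minimizer of the linearized quadratic cost.
        err = (gx * dub + gy * dvb + dfd) / (lam + gx**2 + gy**2)
        du = dub - gx * err
        dv = dvb - gy * err
    return d + np.stack([du, dv], axis=-1)
```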
5 Experimental Results in Motion-Compensated Frame Interpolation

We present experimental results on two sequences, Claire and Susie. In these experiments, we temporally subsample the sequence by skipping frames, and encode the subsampled sequence. The missing frames are then estimated by bidirectional prediction (motion-compensated frame interpolation) at the decoder, using the transmitted motion fields; this therefore represents a post-processing application driven by transmitted motion vectors.

Coding of Susie at 600 kbits/sec
The first experiment was performed on Frames 30-59 of the Susie sequence, which feature large motion in the latter part of the segment. This is a 30 frames/sec SIF sequence; we used a 256 × 256 cropped version for our experiments. The sequence was first coded at 15 frames/sec by skipping alternate frames and coding the subsampled sequence; the missing frames were then interpolated at the decoder. We compared three forward motion-compensated (FMC) coders that differed only in the motion estimation technique; they used BMA, TMC, and our OF algorithm, respectively. All three coders used subband coding of the DFD, and a group of pictures of size 12. The I-frames were coded at 1.0 bpp, and the P-frames at 0.5 bpp including the motion information. All bit-rates were computed using the adaptive arithmetic coder described in [23]. The subband coder used the Adelson and Simoncelli 9-tap filters [24] and the 13-band subband decomposition shown in [6] for subband decomposition of the DFD. Furthermore, we used Procedure 1 for motion estimation and transmission (Section 3); in other words, we transmitted d_{P1P2}, and used this motion field to obtain the interpolated frames.

The OF algorithm used 3 levels in the control (image) pyramid. The motion fields at each level
of the control pyramid were represented and encoded using a 6-level HFE pyramid. The coefficients at the finest two levels were set to zero so as to reduce computational complexity [6]. We chose the smoothness parameter λ = 1000, and used 10 iterations of the IR algorithm. The quantizer step-size at the coarsest level of the HFE pyramid was chosen as 0.25, and the step-size was doubled at each successively finer level of the pyramid. The BMA algorithm used full search, half-pel accuracy, 16 × 16 blocks, and a search range of [−32, 32] in the horizontal and vertical directions. The TMC algorithm used here begins with an initialization using the above BMA algorithm. This initialization is then followed by a gradient-based update that uses quarter-pel accuracy. This TMC algorithm is very similar to the fast hexagonal matching algorithm proposed by Nakaya and Harashima in [2], and a detailed description of this
algorithm is available in [6, 10].

The experimental results are shown in Figures 2 and 3. The plot of PSNR against frame number shows that the three coders have somewhat similar performance on the I- and P-frames; however, the optical-flow-based OF motion fields give us vastly superior frame interpolation. Note, however, that in frame interpolation it is important to compare the results visually, since PSNR is not always a reliable measure of image quality [14, 8]. This comparison is shown in Figure 3. The OF algorithm gives us a high-resolution motion field that is very close to the "true" motion, and the frame interpolated using this motion field is of excellent visual quality. The BMA algorithm produces very blocky and noisy motion fields that are totally unacceptable for frame interpolation. The TMC field is smooth but erroneous in many parts of the image, especially in areas requiring high motion resolution, resulting in many annoying frame-interpolation artifacts.

Coding of Claire at 64 kbits/sec
Here, the test sequence used was a 256 × 256 cropped version of Claire at CIF resolution and 30 frames/sec. Our temporal subsampling reduced the frame-rate to 7.5 frames/sec. We considered frames 0-120, with the first I-frame coded at 0.75 bpp and the subsequent 39 P-frames coded at 0.1 bpp,
giving us a bit rate of approximately 64 kbits/sec. The figures show the PSNR over frames 64-96, which feature a relatively large movement of the head. The three coders, OF, BMA and TMC, were tested in this framework. Motion was estimated between the reference frames (I- and P-frames), and these motion fields were then used to interpolate the missing B-frames; this corresponds to Procedure 1 described in Section 3. The BMA and TMC algorithms used the same parameters as in the Susie experiment above, with the exception of a smaller search range, [−16, 16], for BMA. The OF algorithm used the following parameters: a 3-level control (image) pyramid, 5-level HFE pyramids, λ = 1000, and 5 iterations of the IR algorithm. Again, fixed uniform quantizers were used at each level in the HFE pyramid, with the step size at the coarsest level of the HFE pyramid chosen as 0.125; the step-size was doubled as we went to a finer level. The DFD data in each subband was quantized using uniform quantizers, and both the motion and DFD bit-rates were calculated using the adaptive arithmetic coder described in [23].

The comparison between the three estimators is shown in Figures 4 and 5. Figures 4(c) and 4(f) show that the BMA fields resulted in poor interpolation: a view of the right side of the face shows severe artifacts which substantially degrade video quality, and BMA also produced blocking artifacts in the reconstructed P-frames. The TMC estimator performs quite well on this sequence, and produces a visually good interpolation; however, the resolution of the field is insufficient in the eyes and the lips, resulting in a blurry interpolation in these areas. Figure 5 shows the comparison of PSNRs, and we observe that the OF estimator also did better than both BMA and TMC in terms of PSNR. However, the visual improvements were more dramatic than the improvement in objective quality. Here again, we observed the inadequacy of PSNR as a visual quality measure when evaluating B-frames, particularly with the OF coder. This is illustrated by Figures 4(b) and 4(e), which show that the reconstruction error is partly due to a displacement between the reconstructed and original images. This displacement is caused by a violation of the constant-velocity assumption that was made for frame interpolation. While this displacement would be unacceptable in P-frames (since we would then lose track of the motion), it is acceptable in the B-frames; the video has good subjective quality as long as the edges in the images are sharply reproduced. Therefore, the quality of the video is much
higher than that indicated by the PSNR figures, and it may become necessary to use a different measure to describe the quality of B-frames.

In a related experiment, we compared the two procedures of Section 3 for representation and transmission of the various fields required for forward and bidirectional prediction. When we use Procedure 2, we can account for acceleration, which results in substantially higher PSNR values for the B-frames; this is illustrated in Figures 6 and 7. However, Procedure 2 gives inferior performance on the P-frames, mainly due to the higher bit rate required by the motion field and, consequently, the lower bit rate available for encoding DFD data. Our DPCM scheme for encoding the motion fields d_{B1B2}, d_{B2B3}, and d_{B3P} turned out to be quite efficient: for example, when estimating motion between frames 88 and 92 of the Claire sequence, Procedure 1 used 0.06 bpp for motion while Procedure 2 used 0.08 bpp. Since the overall rate for the P-frame was only 0.1 bpp, this increase in
motion bit-rate reduced the number of bits available for DFD data by a factor of two. However, when motion is relatively large, Procedure 2 provides better motion estimation performance, because the effect of occlusions is reduced. In our optical flow estimator, the intensity-conservation assumption breaks down in covered and uncovered regions, and the erroneous motion estimates are propagated from the occluded regions to the surrounding regions by the smoothness constraints. In Procedure 2, we use the intermediate frames, and therefore the occlusion region between P1 and B, or between B and P2, is smaller than that between P1 and P2, leading to better motion estimates. This effect is especially significant for higher-motion sequences like Susie and Caltrain. In these situations, where the better motion estimation performance offsets the increased motion bit-rate, Procedure 2 can give better performance even on the P-frames. In general, one may wish to adaptively switch between the two procedures, depending on the application and on whether there is large acceleration in a particular segment. This is another degree of flexibility gained by using our smoother and more natural optical flow fields.
6 Improved Bidirectional Prediction using Label Fields

The bidirectional prediction used so far (3) weights the forward and backward predictions equally. While this procedure is reasonable when there are no covered or uncovered regions, it needs to be modified in the presence of occlusions. For example, a region uncovered from frame P1 to frame B would likely be better predicted from frame P2, while a region covered from frame B to frame P2 would likely be better predicted from frame P1. In order to handle these occlusion effects, MPEG predicts the intensity at a particular point s in the B-frame as

$$\hat{I}_B(\mathbf{s}) = (1 - l(\mathbf{s}))\,\hat{I}_{B1}(\mathbf{s}) + l(\mathbf{s})\,\hat{I}_{B2}(\mathbf{s}),\qquad(8)$$
where Î_{B1}(s) and Î_{B2}(s) are the forward and backward predictions defined in (1) and (2) respectively, and {l(s)} may be considered as a label field that weights the contributions of the forward and backward predictors. The MPEG label field is constant over each 16 × 16 macroblock and takes on one of three values: 1 (backward prediction), 0.5 (bidirectional prediction), and 0 (forward prediction). The value is usually chosen to minimize the interpolation error over the macroblock. The overhead needed for this label field is very low, and it fits in well with the motion estimator, which is also constant over the macroblocks.

We extend this idea to higher-resolution label fields, so as to gain greater flexibility in weighting the forward and backward predictions. To this end, we represent the dense label field (one label per pixel) using the multiscale HFE basis, and allow the labels to vary continuously in the range [0, 1]. These high-resolution label fields are the natural complement of the optical-flow-based motion fields, in the same way that the MPEG label fields are the natural complement of the BMA fields. The multiscale HFE basis can be expected to efficiently model label fields that are piecewise smooth, but with sharp boundaries; since we expect occlusion regions to be contiguous, this should be a good model assumption. Thus, we rely on the energy compaction properties of the HFE basis to obtain a compact representation for the label field, as was already the case with the optical flow fields. The HFE coefficients of the label field are transmitted as overhead information for the B-frame.
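Applying (8) is then a pixelwise blend; a short sketch, again reusing the bilinear `warp` of Section 3 (the function name is ours):

```python
def interpolate_with_labels(I_p1, I_p2, d_p1b, d_p2b, labels):
    """Equation (8): weight forward and backward predictions by a label
    field l(s) in [0, 1]; 0 = forward, 1 = backward, 0.5 = plain average."""
    I_b1 = warp(I_p1, d_p1b)   # forward prediction, eq. (1)
    I_b2 = warp(I_p2, d_p2b)   # backward prediction, eq. (2)
    return (1.0 - labels) * I_b1 + labels * I_b2
```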
The label field can be chosen to minimize the energy of the interpolation error $DFD_B(\mathbf{s}) = I_B(\mathbf{s}) - \hat{I}_B(\mathbf{s})$, where $\hat{I}_B(\mathbf{s})$ is defined in (8). Defining $DFD_{B1}(\mathbf{s}) = I_B(\mathbf{s}) - I_{P1}(\mathbf{s} - \mathbf{d}_{P1B}(\mathbf{s}))$ and $DFD_{B2}(\mathbf{s}) = I_B(\mathbf{s}) - I_{P2}(\mathbf{s} - \mathbf{d}_{P2B}(\mathbf{s}))$, we get

$$DFD_B(\mathbf{s}) = (1 - l(\mathbf{s}))\,DFD_{B1}(\mathbf{s}) + l(\mathbf{s})\,DFD_{B2}(\mathbf{s}) = l(\mathbf{s})\bigl(DFD_{B2}(\mathbf{s}) - DFD_{B1}(\mathbf{s})\bigr) + DFD_{B1}(\mathbf{s}).\qquad(9)$$
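Since (9) is affine in l(s), the error energy at each pixel is a simple quadratic in the label. As a side remark of ours (not part of the original development), if the smoothness and quantizer-set constraints introduced below were dropped, the minimizer could be written pointwise in closed form:

```latex
% With D(s) = DFD_{B2}(s) - DFD_{B1}(s), eq. (9) gives
% |DFD_B(s)|^2 = ( l(s) D(s) + DFD_{B1}(s) )^2, minimized over [0, 1] by
l^*(\mathbf{s}) \;=\; \operatorname{clip}_{[0,1]}\!\left(
    \frac{-\,DFD_{B1}(\mathbf{s})}{DFD_{B2}(\mathbf{s}) - DFD_{B1}(\mathbf{s})}
\right), \qquad D(\mathbf{s}) \neq 0 .
```

The constraints below trade some of this pointwise optimality for spatial coherence and a compact encoding.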
In order to force the label field to be spatially coherent and compactly encodable, we follow an approach similar to (6) and minimize the criterion

$$E(l) = \sum_{\mathbf{s}} |DFD_B(\mathbf{s})|^2 + \mu E_s(l),\qquad(10)$$

where

$$E_s(l) = \sum_{\mathbf{s}} \bigl\{\, |\partial_x l|^2 + |\partial_y l|^2 \,\bigr\}\qquad(11)$$

is a smoothness functional weighted by a non-negative parameter μ, similar to (7). With this definition, E(l) is a quadratic functional in {l(s)}, and can be optimized using an algorithm that is very similar to the optical flow estimation algorithm described in Section 4. The only difference with the optical flow estimator is that {l(s)} is a scalar field, while the motion field {d(s)} is a vector field. Therefore, the label field may be estimated by a scalar version of the OF estimation algorithm. These dense label fields also have several other potential applications; for example, they could be used in forward prediction to handle occlusions and illumination changes.
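A scalar analogue of the relaxation in Section 4 suffices to illustrate the estimation. As before, this pixel-domain Jacobi sketch of ours, with clipping to [0, 1], omits the HFE representation and quantizer-set constraints that make the field compactly encodable:

```python
import numpy as np

def estimate_labels(I_b, I_b1, I_b2, mu=500.0, n_iter=100):
    """Minimize (10)-(11): by (9) the data term is quadratic in l(s), so
    each sweep has a closed-form pointwise update given neighbor means."""
    dfd1 = I_b - I_b1            # DFD_{B1}(s), forward prediction error
    dfd2 = I_b - I_b2            # DFD_{B2}(s), backward prediction error
    diff = dfd2 - dfd1           # coefficient of l(s) in (9)
    l = np.full_like(I_b, 0.5)   # start from simple averaging
    for _ in range(n_iter):
        lbar = 0.25 * sum(np.roll(l, s, a) for s in (1, -1) for a in (0, 1))
        # Pointwise minimizer of (l*diff + dfd1)^2 + mu*(l - lbar)^2.
        l = np.clip((mu * lbar - diff * dfd1) / (mu + diff**2), 0.0, 1.0)
    return l
```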
7 Experimental Results on Bidirectional Prediction using Label Fields

Experiments were performed on a 256 × 256 cropped version of the Susie sequence at 30 frames/sec and SIF resolution. We used a group-of-pictures (GOP) of size 12, with one B-frame between any two reference frames. We used 1.0 bits/pixel (bpp) for the I-frames and 0.5 bpp for the P-frames, giving us a bit rate of approximately 600 kbits/sec.
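As a rough consistency check of this figure (our own back-of-the-envelope arithmetic, not taken from the original): with alternate frames interpolated, 15 reference frames are coded per second, almost all of them P-frames, so

```latex
R \;\approx\; \underbrace{256 \times 256}_{\text{pixels}} \times
    \underbrace{0.5}_{\text{bpp (P)}} \times
    \underbrace{15}_{\text{frames/sec}}
  \;\approx\; 492\ \text{kbits/sec},
```

with the 1.0-bpp I-frame at the start of each GOP raising the average toward the quoted 600 kbits/sec.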
The DFD was encoded using a spatial subband coder, and the subband decomposition used the Adelson and Simoncelli 9-tap filters [24]. The bit rate for the P-frames included the overhead bits required to transmit the motion fields. We observed the advantage of using the label field by comparing three video coders that differ in the strategy used for bidirectional prediction. The first coder used the HFE label field estimated using the procedure of Section 6; the bit-rate used to encode the label fields is counted as extra side-information. The second coder uses a constant label field (l(s) ≡ 0.5), which means that no label field is transmitted to the receiver; this scheme is equivalent to Procedure 2 in Section 3. The third coder uses an MPEG-type block-based label field. All three coders predict P-frames using the optical flow (OF) algorithm described in Section 4. The OF algorithm used a 3-level control pyramid, 5-level HFE pyramids, 10 iterations of the IR algorithm, and the smoothness parameter λ = 250. Uniform quantizers were used at each scale of the HFE pyramid; the quantizer step at the coarsest scale was chosen to be 0.25, and the step-size was doubled as we went down the HFE pyramid. Note that all three coders perform identically on the I- and P-frames, and differ only in their treatment of the B-frames.

A comparison of the PSNRs of the three coders is shown in Figure 8. We observe a substantial increase in PSNR over the B-frames (up to 2 dB) when we use our label field, compared to simple averaging. The increase is most pronounced in the fast-moving part of the sequence, where there are significant covered/uncovered regions. The bit-rate used for the label field may be considered as extra side-information and is shown in Figure 9. Notice that this bit-rate is higher in the fast-moving parts of the sequence. However, the overall bit-rate for the label field is quite low and is never higher than 0.025 bpp.

To illustrate the improvements in visual quality, we show interpolated Frame 51 in Figure 10. This is a B-frame interpolated between reference frames 50 and 52. All three original frames are shown, and they illustrate the covered/uncovered backgrounds: there is an uncovered background on the left side of the face, part of the hair is covered with an illumination change, and the phone and hand start to move down and out of the picture while covering a part of her shirt.
The label field estimated by our new algorithm is shown in Figure 10(d), with a value of 0 represented by black (forward prediction) and a value of 1 represented by white (backward prediction); intermediate values of gray correspond to various weights for bidirectional prediction. To represent the label field, we used 6 levels in the HFE pyramid, with the coefficients at the two finest scales set to zero. We used the smoothness parameter μ = 500 and uniform quantizers at the different scales of the HFE pyramid, similar to the strategy used for optical flow estimation; the quantizer step-size at the coarsest scale was chosen as 0.25, and the step-size was doubled as we went down the HFE pyramid. The figure shows that the label field corresponds approximately to what we would expect in the covered and uncovered regions. Figure 10(e) illustrates the MPEG label field for this frame, which is constant over 16 × 16 blocks and takes on three possible values: 0, 0.5, and 1. Due to its blocky nature, this label field is an unnatural partner for the OF estimator; however, this experiment gives us insight into desirable properties of label fields.

The interpolated frames illustrate the advantages of using our label field. When we use simple bidirectional prediction (l(s) ≡ 0.5), the motion estimator uses covered/uncovered regions in the reference frames to predict parts of Frame 51. This effect is apparent when viewing the zoomed pictures shown in Figure 11. There is a significant artifact near the phone when we use simple bidirectional prediction. In addition, some uncovered background in Frame 52 (near the left side of the face) is used to predict Frame 51; while this region does not look very objectionable in the still frame, it does degrade the visual quality of the video. Our label fields also perform better than the MPEG block-based label fields, as illustrated in Figure 11, where the block-based label field leads to unnatural blocking artifacts in the interpolated images. Thus, we obtain substantial improvements on the still images for a small additional increase in bit-rate. These improvements are also reflected in the PSNR comparisons shown in Figure 8, where we get improvements of up to 1 dB in some cases. There was also an improvement in the visual quality of the moving video, but it was a little less pronounced than the improvement in the still frames.
8 Summary and Conclusions

We have considered the problems of motion-compensated frame interpolation (MCFI) and bidirectional prediction in a video coding environment. These applications involve post-processing at the decoder and require accurate motion estimates for good performance. In this paper, we use an optical-flow-based motion estimator that we have previously developed; this estimator provides smooth, natural motion fields under rate constraints. This enables us to use the transmitted motion vectors for post-processing, and eliminates an expensive motion estimation step at the decoder. We compare our motion estimator with the popular BMA and TMC motion estimators.

In the first application, we temporally subsampled the sequence before coding, and provided the decoder with the option of increasing the frame-rate using MCFI. The frames dropped at the encoder were not used for motion estimation, and motion was estimated directly between the reference frames (I- and P-frames); this corresponds to Procedure 1 in Section 3. For this application, we have presented results on the Claire and Susie sequences and shown that our optical flow estimator produces excellent interpolated frames, especially from a visual standpoint. BMA produces blocky, unnatural interpolated frames that are unacceptable in most applications. TMC produces smoother frames that nevertheless contain annoying artifacts, due to the low resolution of the motion estimate.

The above motion estimation procedure, Procedure 1, does not use intermediate frames for motion estimation, and cannot handle accelerated motion between reference frames. In the presence of acceleration, the interpolated frames are displaced from the original, leading to low PSNR values; however, these frames are quite acceptable from a visual standpoint. This illustrates that PSNR is not always a reliable measure of image quality, especially in the case of B-frames. Since our optical flow fields are more natural, they exhibit temporal coherence. This enables us to use the dropped frames (which are available at the encoder) in the motion estimation step. This procedure (Procedure 2 in Section 3) enables us to account for acceleration between the reference frames and gives us higher PSNR on the B-frames, without a significant drop in performance on the P-frames. It also raises the possibility of adaptively switching between Procedures 1 and 2, illustrating the additional flexibility provided by our optical-flow-based motion estimator.

The procedures described above do not take occlusions (covered/uncovered regions) into account. In order to handle occlusions, we have developed a novel scheme that uses a label field to optimally weight the forward and backward predictions. The label field may be considered a pixel-based extension of an idea from the MPEG standard. We represent this dense label field using a multiscale basis, and estimate it using a fast optimization algorithm. This enables us to compactly encode the label field and transmit it as side-information. The use of label fields gives substantial improvements, both visually and in terms of PSNR, especially in the covered and uncovered parts of the sequence, for a small additional overhead. For example, on the Susie sequence coded at 600 kbits/sec, we obtain PSNR gains of up to 2 dB on the B-frames for additional label-field rates of less than 0.025 bpp.

Suggestions for future work include the incorporation of label-field estimation into the motion estimation algorithm; this would enable us to avoid the propagation of motion estimation errors from the occlusion regions to surrounding pixels. The label fields could also be used to handle occlusion effects and illumination changes when encoding the P-frames. Furthermore, since our motion fields scale well with changes in frame-rate and spatial resolution, we could develop other coding applications that exploit this flexibility. It would also be possible to improve the coding performance by actually encoding the B-frame residuals.
References

[1] A. M. Tekalp, Digital Video Processing. Upper Saddle River, NJ: Prentice Hall, 1995.

[2] Y. Nakaya and H. Harashima, "An iterative motion estimation method using triangular patches for motion compensation," in Proc. SPIE Conf. on Visual Communications and Image Processing, vol. 1605, pp. 546–567, Boston, MA, 1991.

[3] K. Aizawa and T. S. Huang, "Model-based image coding: Advanced video coding techniques for very low bit-rate applications," Proc. IEEE, vol. 83, pp. 259–270, Feb. 1995.
[4] G. J. Sullivan and R. L. Baker, "Motion compensation for video compression using control grid interpolation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 2713–2716, 1991.

[5] E. Dubois, "Motion-compensated filtering of time-varying images," Multidimensional Systems and Signal Processing, vol. 3, pp. 211–239, 1992.

[6] P. Moulin, R. Krishnamurthy, and J. W. Woods, "Multiscale modeling and estimation of motion fields for video coding," IEEE Trans. Image Process., 1996. Accepted for publication; available at http://cipr.rpi.edu/ravik/publications.html.

[7] R. Krishnamurthy, P. Moulin, and J. W. Woods, "Optical flow techniques applied to video coding," in Proc. IEEE Int. Conf. Image Processing, vol. I, pp. 570–573, Washington, DC, 1995.

[8] R. Krishnamurthy, P. Moulin, and J. W. Woods, "Multiscale motion models for scalable video coding," in Proc. IEEE Int. Conf. Image Processing, pp. 965–968, Lausanne, Switzerland, Sept. 1996.

[9] B. Chupeau, "Estimation and distribution of motion information in image communication networks," Signal Processing: Image Communication, pp. 539–552, 1993.

[10] R. Krishnamurthy, Compactly-Encoded Optical Flow Fields for Motion-Compensated Video Compression and Processing. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY, 1997. Available at http://cipr.rpi.edu/ravik/publications.html.

[11] M. Bierling and R. Thoma, "Motion compensated field interpolation," Signal Processing, vol. 11, pp. 387–404, 1986.

[12] R. Thoma and M. Bierling, "Motion compensating interpolation considering covered and uncovered background," Signal Processing: Image Communication, vol. 1, pp. 191–212, 1989.

[13] C.-L. Huang and T.-T. Chao, "Motion-compensated interpolation for scan rate up-conversion," Optical Engineering, vol. 35, pp. 166–176, Jan. 1996.
[14] C. Cafforio, F. Rocca, and S. Tubaro, "Motion compensated image interpolation," IEEE Trans. Commun., vol. 38, pp. 215–222, Feb. 1990.
[15] T. Reuter, "Standards conversion using motion compensation," Signal Processing, vol. 16, pp. 73–82, Dec. 1989.

[16] R. L. Lagendijk and M. I. Sezan, "Motion compensated frame rate conversion of motion pictures," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. III, pp. 453–456, 1992.

[17] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185–203, 1981.

[18] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185–203, 1981.

[19] W. Enkelmann, "Investigations of multigrid algorithms for the estimation of optical flow fields in image sequences," Computer Vision, Graphics, and Image Processing, vol. 43, pp. 150–177, 1988.

[20] B. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 5th International Joint Conference on Artificial Intelligence, pp. 674–679, 1981.

[21] J. K. Kearney, W. B. Thompson, and D. L. Boley, "Optical flow estimation: An error analysis of gradient-based methods with local optimization," IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 229–244, 1987.
[22] R. Krishnamurthy, P. Moulin, and J. W. Woods, "Compactly-encoded optical flow fields for video coding and bidirectional prediction," in Proc. SPIE Conf. on Visual Communications and Image Processing, San Jose, CA, Feb. 1997.
[23] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, pp. 520–540, June 1987.
[24] E. H. Adelson, E. Simoncelli, and R. Hingorani, "Orthogonal pyramid transforms for image coding," in Proc. SPIE Conf. on Visual Communications and Image Processing, pp. 50–58, 1987.
[Figure omitted: frames P1, B, and P2 along the time axis, with interval fractions f and 1−f marking the temporal position of B between the reference frames.]
Figure 1: Arrangement of Frames for Bidirectional Prediction
[PSNR-vs-frame-number plot omitted; legend: o OF, * BMA, x TMC.]
Figure 2: Comparison of Motion Estimators for Coding and Frame-Interpolation on Susie at SIF resolution, Coded at 600 kbits/sec
Figure 3: Comparison of Motion Estimators in Motion Estimation and Frame Interpolation on Susie, Coded at 600 kbits/sec. (a)-(c) Original Frames 46-48 of Susie. (d)-(f) Motion Fields: (d) OF, 0.061 bpp; (e) BMA, 0.043 bpp; (f) TMC, 0.056 bpp. (g)-(i) Interpolated Frame 47: (g) OF, PSNR = 31.78 dB; (h) BMA, PSNR = 28.51 dB; (i) TMC, PSNR = 30.39 dB. [Images omitted.]
Figure 4: Comparison of Motion Estimators for Coding and Frame-Interpolation on Claire at CIF resolution, Coded at 64 kbits/sec. (a) Original Frame 90 (180 × 160 section shown). (b)-(d) Interpolated Frame 90: (b) OF, PSNR = 33.7 dB; (c) BMA, PSNR = 31.4 dB; (d) TMC, PSNR = 32.6 dB. (e)-(g) Interpolation Errors: (e) OF; (f) BMA; (g) TMC. [Images omitted.]
[PSNR-vs-frame-number plot omitted; legend: o OF, * BMA, x TMC.]
Figure 5: Comparison of Motion Estimators for Coding and Frame-Interpolation on Claire at CIF resolution, Coded at 64 kbits/sec
[PSNR-vs-frame-number plot omitted; legend: o Procedure 1, + Procedure 2.]
Figure 6: Comparison of Procedures to Obtain Various Motion Fields used in Bidirectional Prediction: Comparison of PSNRs for Coding and Frame-Interpolation on Claire at 64 kbits/sec
Figure 7: Comparison of Procedures to Obtain Various Motion Fields used in Bidirectional Prediction: Reconstructed Frames for Coding and Frame-Interpolation on Claire at 64 kbits/sec. (a) Interpolated Frame 90 using Procedure 1, PSNR = 33.66 dB (180 × 160 section). (b) Interpolated using Procedure 2, PSNR = 36.94 dB. (c) Reconstruction Error, Procedure 1. (d) Reconstruction Error, Procedure 2. [Images omitted.]
[PSNR-vs-frame-number plot omitted; legend: * HFE Label Field, x Block-based MPEG Label Field, o Simple Averaging.]
Figure 8: Comparison of PSNRs on Frames 30-59 of Susie for Bidirectional Prediction Using Various Label Fields: Sequence Coded at 600 kbits/sec
[Label bit-rate-vs-frame-number plot omitted (range 0 to 0.025 bpp).]
Figure 9: HFE Label Bit-Rate for B-Frames when Encoding Frames 30-59 of Susie
Figure 10: Label Fields for Bidirectional Prediction on Frames 50-52 of Susie at 600 kbits/sec. (a)-(c) Frames 50-52 of Susie. (d) HFE Label Field used for Bidirectional Prediction of Frame 51 (0.018 bpp). (e) Block-Based MPEG Label Field. [Images omitted.]
Figure 11: Performance of Label Fields for Bidirectional Prediction on Frames 50-52 of Susie at 600 kbits/sec. (a) Original Frame 51 (zoomed 128 × 128 section). (b) Interpolated B-Frame 51 Using HFE Label Field (PSNR = 36.1 dB). (c) Interpolated using Simple Averaging (PSNR = 34.1 dB). (d) Interpolated Using MPEG Label Field (PSNR = 35.1 dB). [Images omitted.]