ESTIMATING DISTORTION OF CODED AND NON-CODED FRAMES FOR FRAMESKIP-OPTIMIZED VIDEO CODING

Anthony Vetro†, Yao Wang‡ and Huifang Sun†

† MERL - Mitsubishi Electric Research Laboratories, Murray Hill, NJ
‡ Polytechnic University, Department of Electrical Engineering, Brooklyn, NY

ABSTRACT

This paper focuses on the problem of estimating the distortion for coded and non-coded frames in a video coder that employs variable frameskip. The distortion for coded frames is given by classic rate-distortion models; however, the distortion for non-coded frames has not been considered. Based on the optical flow equation, we formulate a method for estimating the distortion of the non-coded frames. Simulations verify that our estimation is quite accurate over a range of sequences with varying scene complexity. Such methods can be used to optimize the trade-off between spatial and temporal quality in the video coding process.

1. INTRODUCTION

A number of video coding standards have been developed that support encoding with variable frameskip, e.g., H.263 and MPEG-4. With these standards, the encoder may choose to skip frames either to satisfy buffer constraints or to optimize the video coding process. However, in the literature, frame skipping has only been employed for the first case, to satisfy buffer constraints. In this scenario, the encoder is forced to drop frames because limitations on the bandwidth do not allow the buffer to drain fast enough. Consequently, bits that would be used to encode the next frame cannot be added to the buffer, as they would cause the buffer to overflow. It should be noted that skipping frames in this case could lead to poor reconstruction of the video, since frames are skipped according to the buffer occupancy and not according to content characteristics.

The general problem that we address in this paper is the optimal rate-distortion (R-D) encoding of video sequences. The specific problem that we are interested in can be stated as follows: given a particular video sequence, should the encoder choose to code more frames with lower spatial quality or fewer frames with higher spatial quality? This trade-off is not a simple binary decision, but rather a decision over a finite set of coding parameters. The best set of coding parameters is the one that yields the optimal R-D curve. The two parameters of interest are the number of frames per second (fps) and the quantization parameter (QP). It should be emphasized that the overall distortion of the coded sequence, which is usually reported as PSNR, must also include the distortion of the frames that have been skipped. This point is often overlooked in papers that report data with skipped frames. In the context of this problem, it is necessary to consider the distortion in this way.

The majority of the literature on R-D optimization does not touch on the temporal aspect [1]-[3]. By and large, it is assumed that the frame rate is fixed. These papers consider optimizations on the QP [1], mode decisions for motion and block coding [2], and frame-type selection [3]. Such algorithms and techniques can be applied to achieve the optimum when the frame rate is fixed and the bit rate can be met with the chosen frame rate. However, the optimum across varying frame rates has not yet been considered. The trade-off between spatial and temporal quality has been studied to some extent by Martins et al. [4]; however, in that work the trade-off was achieved with a user-selectable parameter.

To begin considering the problem stated above, a model for the distortion of coded and non-coded frames is needed. For the coded frames, we provide a brief review of the classic approach to rate-distortion modeling, and for the non-coded frames, we develop a novel formulation based on the optical flow equation.

The rest of this paper is organized as follows. The next section discusses the models that we use to estimate the distortion for coded and non-coded frames. In Section 3, we provide simulation results that verify the accuracy of our estimation. Section 4 summarizes the main contributions and discusses how these models may be employed by the video coder.

2. DISTORTION MODELS

The purpose of this section is to consider estimates for the distortion of coded and non-coded frames that would be encountered in a video coder that uses variable frameskip. We consider a general formulation for distortion that accounts for skipped frames, $D(Q, f_s)$, where $Q$ represents the quantization parameter used for coding and $f_s$ represents the amount of frameskip. Furthermore, we denote the average distortion for coded frames by $D_c(Q)$ and the average distortion for skipped frames by $D_s(Q, f_s)$. The coded distortion depends on the quantizer, while the temporal distortion depends on both the quantizer and the amount of frameskip. Although the frameskip factor does not directly influence the coded distortion of a particular frame, it does affect it indirectly: first, the amount of frameskip influences the residual component, and second, it affects the quantizer that is chosen. It is important to note that the distortion for skipped frames has a direct dependency on the quantization step size in the coded frames. The reason is that the skipped frames are interpolated from the coded frames, thereby carrying the same spatial quality, in addition to the temporal distortion.

Given the above, we consider the average distortion over the specific time interval $(t_i, t_{i+f_s}]$, which is given by,

$$D_{(t_i,\, t_{i+f_s}]}(Q_{i+f_s}, f_s) = \frac{1}{f_s}\left[ D_c(Q_{i+f_s}) + \sum_{k=i+1}^{i+f_s-1} D_s(Q_i, k) \right] \tag{1}$$

In the above, the distortion over the specified time interval is due to the (spatial) distortion of 1 coded frame at $t = t_{i+f_s}$ plus the (temporal) distortion of $f_s - 1$ skipped frames, which is dependent on the quantizer for the previously coded frame at $t = t_i$.
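As an illustration, the averaging in eqn. (1) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, the per-frame distortions are assumed to be MSE values that are already available, and the PSNR helper assumes 8-bit samples with a peak value of 255.

```python
import math
from typing import Sequence


def interval_distortion(d_coded: float, d_skipped: Sequence[float]) -> float:
    """Average distortion over (t_i, t_{i+fs}] in the sense of eqn. (1).

    d_coded   -- D_c(Q_{i+fs}), MSE of the one coded frame in the interval
    d_skipped -- D_s(Q_i, k) for the fs - 1 skipped frames, k = i+1 .. i+fs-1
    """
    fs = len(d_skipped) + 1  # one coded frame plus fs - 1 skipped frames
    return (d_coded + sum(d_skipped)) / fs


def psnr_from_mse(mse: float, peak: float = 255.0) -> float:
    """Convert an MSE to PSNR in dB (peak = 255 is an assumption: 8-bit video)."""
    return 10.0 * math.log10(peak * peak / mse)


# Hypothetical numbers: frameskip fs = 3, i.e. two skipped frames per coded frame.
avg_mse = interval_distortion(d_coded=20.0, d_skipped=[26.0, 32.0])
print(avg_mse, psnr_from_mse(avg_mse))
```

Averaging the MSE first and converting to PSNR afterwards keeps the skipped frames in the reported figure, which is the point made in the introduction.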

2.1. Coded Distortion

From classic rate-distortion modeling [5], it is well known that the variance of the quantization error is given by,

$$\sigma_q^2 = a \cdot 2^{-2R} \cdot \sigma_z^2 \tag{2}$$

where $\sigma_z^2$ is the input signal variance, $R$ is the average rate per sample and $a$ is a constant that is dependent on the pdf of the input signal and quantizer characteristics. In the absence of entropy coding, the value of $a$ typically varies between 1.0 and 10, but may take on values less than 1 with entropy coding. We use the above equation to model the spatial distortion,

$$D_c(Q_i) = a \cdot 2^{-2R(t_i)} \cdot \sigma_{z_i}^2 \tag{3}$$

The above model is valid for a wide array of quantizers and signal characteristics. Such aspects are accounted for in the value of $a$. However, as stated earlier, the amount of frameskip can impact the statistics of the residual. In our experiments, we have found that the average bits/frame increases for larger values of $f_s$. However, the variance remains almost the same. This indicates that the variance is not capable of reflecting small differences in the residual that impact the actual relation between rate and distortion. This is caused by the design of the particular coding scheme and is believed to happen because of the presence of high-frequency coefficients. Actually, we believe that it is not only their presence, but also the position of such coefficients. If certain run-lengths are not present in the VLC table, less efficient escape coding techniques must be used. This probably means that $f_s$ affects the pdf of the residual, i.e., the value of $a$, while not changing $\sigma_{z_i}^2$ much.

Currently, we ignore any changes in the residual due to frameskip and simply use the model given by eqn. (3) to model the spatial distortion. A fixed $a$ is used, together with $\sigma_{z_i}^2$ computed from the last coded frame. Refining this model to account for the impact of frameskip on the residual and the influence of high-frequency coefficients is a topic for further study. It is not clear at this time if an improved model will impact the results significantly.

2.2. Non-Coded Distortion

To model the temporal distortion due to skipped frames, we assume, without loss of generality, that the temporal interpolator simply repeats the previously coded frame. Other interpolators that average past and future reference frames, or make predictions based on motion, can still be considered in this framework, but the following derivation may no longer hold. As alluded to earlier, the distortion due to skipped frames can be broken into two parts: one due to the coding of the reference and another due to the interpolation error. This can be derived as follows. Let $\hat{\psi}_k$ denote the estimated frame at $t = t_k$, $\tilde{\psi}_i$ denote the last coded frame at $t_i < t_k$, and $\hat{\psi}_k = \tilde{\psi}_i$. Then, the estimation error at $t_k$ is,

$$e_k = \psi_k - \hat{\psi}_k = \psi_k - \tilde{\psi}_i = \underbrace{\psi_k - \psi_i}_{\Delta z_{i,k}} + \underbrace{\psi_i - \tilde{\psi}_i}_{\Delta c_i} \tag{4}$$

where $\psi_k$ and $\psi_i$ are the original frames, and $\Delta z_{i,k}$ and $\Delta c_i$ represent the frame interpolation error and coding error, respectively. If these quantities are independent, the MSE is given by,

$$E\{e_k^2\} = E\{\Delta^2 c_i\} + E\{\Delta^2 z_{i,k}\} \tag{5}$$

which is equivalently expressed as,

$$D_s(Q_i, k) = D_c(Q_i) + E\{\Delta^2 z_{i,k}\} \tag{6}$$

To derive the expected MSE due to frame interpolation, we first assume that the frame at $t_k$ is related to the frame at $t_i$ with motion vectors $(\Delta x(x,y), \Delta y(x,y))$,

$$\psi_k(x, y) = \psi_i\big(x + \Delta x(x,y),\; y + \Delta y(x,y)\big) \tag{7}$$

In the above, it is assumed that every pixel $(x, y)$ has a motion vector associated with it. In actuality, we use block motion vectors to approximate the motion at every pixel inside the block. Then,

$$\Delta z_{i,k} = \psi_i(x + \Delta x_{i,k},\; y + \Delta y_{i,k}) - \psi_i(x, y) \approx \frac{\partial \psi_i}{\partial x}\,\Delta x_{i,k} + \frac{\partial \psi_i}{\partial y}\,\Delta y_{i,k} \tag{8}$$

where $(\partial \psi_i / \partial x,\; \partial \psi_i / \partial y)$ represent the spatial gradients in the x and y directions. Note that the above has been expanded using a first-order Taylor expansion and is valid for small $(\Delta x, \Delta y)$. This is equivalent to the optical flow equation, where the same condition on motion also applies. As a result, eqn. (8) is less accurate for sequences with large motion. However, for these sequences, the accuracy of the estimation is not so critical, since an optimized encoder would not choose to lower the frame-rate for such sequences anyway. Treating the spatial gradients and motion vectors as random variables and assuming the motion vectors and spatial gradients are independent and zero-mean, we have,

$$E\{\Delta^2 z_{i,k}\} = \sigma_{x_i}^2\, \sigma_{\Delta x_{i,k}}^2 + \sigma_{y_i}^2\, \sigma_{\Delta y_{i,k}}^2 \tag{9}$$

where $(\sigma_{x_i}^2, \sigma_{y_i}^2)$ represent the variances for the x and y spatial gradients in frame $i$, and $(\sigma_{\Delta x_{i,k}}^2, \sigma_{\Delta y_{i,k}}^2)$ represent the variances for the motion vectors in the x and y direction. The above shows that it is sufficient to model the interpolation error based on the second-order statistics of the motion and spatial gradient.

2.3. Practical Considerations

The main practical aspect to consider is how the equations for the distortion of non-coded frames are evaluated based on current and past data. For instance, in its current form, eqn. (9) assumes that the motion between $i$, the current time instant, and $k$, a future time instant, is known. However, this would imply that motion estimation is performed for each candidate frame, $k$. Since such computations are not practical, it is reasonable to assume linear motion between frames and approximate the variance of motion vectors by,

$$\sigma_{\Delta x_{i,k}}^2 \approx \left(\frac{k - i}{f_l}\right)^2 \sigma_{\Delta x_{i-f_l,\,i}}^2 \tag{10}$$

where $f_l$ denotes the amount of frameskip between the last coded frame and its reference; the same form is used for the y component. Similarly, estimating the distortion for the next frame to be coded (i.e., calculation of eqn. (3)) requires knowledge of $a$ and $\sigma_{z_i}^2$, which depend on $f_s$. As mentioned earlier, motion estimation for every candidate frame is not performed; therefore the actual residuals are not available either. To overcome this practical difficulty, the residual for future frames may also be predicted based on the residual of the current frame at $t = t_i$. However, as discussed earlier, the relationship between $a$, $\sigma_{z_i}^2$ and frameskip is not as obvious as the relation between motion and frameskip. Also, we have observed that changes in the variance for different frameskip are very small. Therefore, we use the residual variance of the current frame at $t = t_i$ for the candidate frames as well. In this way, changes in $D_c$ are only affected by the bit budget for candidate frameskip factors.

Figure 1: Comparison of actual vs estimated distortion using eqn. (3) for coded frames: (top) Akiyo, (bot) Foreman. (Plot axes: MSE vs. Frames; curves: Actual, Estimated.)
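To make the evaluation order of eqns. (6), (9) and (10) concrete, the following Python sketch estimates the distortion of a skipped frame from the last coded frame and its block motion vectors. It is a sketch under stated assumptions rather than the procedure used in the experiments: all function names are hypothetical, spatial gradients are approximated by simple forward pixel differences, frames are plain lists of rows, and the motion vectors are passed as flat lists of x and y components. Note also that dropping the cross term between eqn. (8) and eqn. (9) additionally relies on the x and y terms being uncorrelated; the sketch inherits that assumption.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple

Frame = List[List[float]]  # luminance samples, one inner list per row


def gradient_variances(frame: Frame) -> Tuple[float, float]:
    """Variances of the x and y spatial gradients of frame i.

    Forward differences are an assumption; any gradient operator could be used.
    """
    gx = [row[c + 1] - row[c] for row in frame for c in range(len(row) - 1)]
    gy = [frame[r + 1][c] - frame[r][c]
          for r in range(len(frame) - 1) for c in range(len(frame[0]))]
    return pvariance(gx), pvariance(gy)


def scaled_mv_variance(var_last: float, k_minus_i: int, f_l: int) -> float:
    """Eqn. (10): scale the last measured motion-vector variance, assuming linear motion."""
    return (k_minus_i / f_l) ** 2 * var_last


def skipped_frame_distortion(d_coded: float, frame: Frame,
                             mv_x: Sequence[float], mv_y: Sequence[float],
                             k_minus_i: int, f_l: int) -> float:
    """Eqn. (6): coded distortion of the reference plus the interpolation error of eqn. (9)."""
    var_gx, var_gy = gradient_variances(frame)
    var_dx = scaled_mv_variance(pvariance(mv_x), k_minus_i, f_l)
    var_dy = scaled_mv_variance(pvariance(mv_y), k_minus_i, f_l)
    return d_coded + var_gx * var_dx + var_gy * var_dy


# Hypothetical data: a 4x4 checkerboard and purely horizontal block motion.
frame = [[0.0, 10.0, 0.0, 10.0],
         [10.0, 0.0, 10.0, 0.0],
         [0.0, 10.0, 0.0, 10.0],
         [10.0, 0.0, 10.0, 0.0]]
mv_x, mv_y = [1.0, -1.0, 1.0, -1.0], [0.0, 0.0, 0.0, 0.0]
for skip in (1, 2):
    print(skip, skipped_frame_distortion(20.0, frame, mv_x, mv_y, k_minus_i=skip, f_l=1))
```

Because eqn. (10) scales the variance by the square of the temporal distance, the interpolation term grows quadratically with the number of frames skipped, while the coded term is unchanged.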

3. SIMULATION RESULTS

To confirm the accuracy of the distortion models discussed in the previous section, we devise two sets of experiments: a first experiment to test the accuracy of the estimated distortion for coded frames, and a second experiment to test the accuracy of the estimated distortion for non-coded frames.

The testing conditions for the first experiment are as follows. We consider two test sequences: Akiyo and Foreman. Each sequence is coded at its full frame-rate of 30fps. Three fixed quantizers are used to code each sequence: one for the first 100 frames, one for the next 100 frames and one for the last 100 frames. Fig. 1 shows plots for Akiyo and Foreman in which the estimated distortion is calculated according to eqn. (3) with a fixed value of $a$. From these plots, we first note that for the most part the estimated distortion tracks the actual distortion. Second, in both sequences, the coded distortion tends to be over-estimated at fine quantizer values. The reason for this is not clear, but we note that it is possible to use a calibrated value of $a$ to correct this estimation error, or to simply use a fixed MSE when the quantizer is so fine.

= 30

=2

= 15

=1
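The coded-frame estimate of eqn. (3), and the idea of calibrating $a$ from a frame that has already been coded, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the rate per sample is taken as the bits spent on a frame divided by the number of luminance samples, and the calibration simply solves eqn. (3) for $a$ using the measured MSE of the previous coded frame. The numbers in the example are made up.

```python
def coded_distortion(a: float, rate_per_sample: float, residual_variance: float) -> float:
    """Eqn. (3): D_c = a * 2^(-2R) * sigma_z^2."""
    return a * 2.0 ** (-2.0 * rate_per_sample) * residual_variance


def calibrate_a(measured_mse: float, rate_per_sample: float, residual_variance: float) -> float:
    """Solve eqn. (3) for a, given the measured MSE of an already coded frame."""
    return measured_mse / (2.0 ** (-2.0 * rate_per_sample) * residual_variance)


# Hypothetical QCIF frame (176 x 144 luminance samples) coded with 12672 bits.
samples = 176 * 144
rate = 12672 / samples          # 0.5 bits per sample
variance = 64.0                 # residual variance, assumed measured by the encoder

estimate_fixed = coded_distortion(a=1.0, rate_per_sample=rate, residual_variance=variance)

# Suppose the frame actually came out with an MSE of 24; reuse the implied a next time.
a_calibrated = calibrate_a(measured_mse=24.0, rate_per_sample=rate, residual_variance=variance)
estimate_next = coded_distortion(a_calibrated, rate_per_sample=1.0, residual_variance=variance)
print(estimate_fixed, a_calibrated, estimate_next)
```

A per-frame calibration of this kind trades the simplicity of a constant multiplier for an estimate that follows the coder's recent behaviour.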

For the second experiment, we consider the Akiyo sequence coded at a fixed frame-rate of 10fps and the Foreman sequence coded at a fixed rate of 15fps. A comparison of the actual and estimated distortion for the non-coded frames in Akiyo is shown in Fig. 2. The top plot shows the first skipped frames, while the bottom plot shows the second skipped frames. The plots indicate that the estimated distortion of skipped frames for Akiyo is quite accurate.

In Fig. 3, comparisons of the actual and estimated distortion for the skipped frames in Foreman are shown. Although the estimated distortion is not as accurate as for Akiyo, we can see that it is able to follow variations in the actual distortion quite well, except in a few cases. However, since the motion in this sequence is large, the distortion due to frame skipping would be a significant factor in the overall distortion. It is safe to say that an optimized coder will never choose to skip frames for such a sequence. Rather, it will resort to using a coarser quantizer until buffer constraints force the encoded sequence to a lower frame-rate. As a result, the accuracy of the estimated distortion for non-coded frames is much more critical for sequences with low to moderate motion.

Figure 2: Comparison of actual vs estimated distortion using eqn. (6) for Akiyo: (top) first skipped frames only, (bot) second skipped frames only. (Plot axes: MSE vs. Frames; curves: Actual, Estimated.)

Figure 3: Comparison of actual vs estimated distortion using eqn. (6) for Foreman. (Plot axes: MSE vs. Frames; curves: Actual, Estimated.)

4. CONCLUDING REMARKS

This paper provides a means for estimating the distortion in coded and non-coded frames. The distortion for coded frames can be expressed by classic rate-distortion models. These equations were found to be accurate using a constant multiplier. However, if the multiplier is calibrated for every frame based on data from the past frame, we found that the accuracy of the estimation can improve significantly. For the distortion of non-coded frames, we derived an approximation from the principles of optical flow. We found that the MSE due to skipped frames could be estimated as a function of the second-order statistics of the motion and spatial gradient. This approximation provides reasonably accurate estimates for sequences with low to moderate motion.

The ability to estimate these distortions is an important step towards optimizing a video coder in terms of the frames that it skips. In this way, an optimized video coder that can make trade-offs between spatial and temporal quality can be considered. The details of an algorithm that can perform such trade-offs will be reported in a future paper.

5. REFERENCES

[1] H. Sun, W. Kwok, M. Chien, and C.H. John Ju, “MPEG coding performance improvement by jointly optimizing coding mode decision and rate control,” IEEE Trans. Circuits Syst. Video Technol., June 1997.

[2] T. Wiegand, M. Lightstone, D. Mukherjee, T.G. Campbell, and S.K. Mitra, “R-D optimized mode selection for very low bitrate video coding and the emerging H.263 standard,” IEEE Trans. Circuits Syst. Video Technol., Apr. 1996.

[3] J. Lee and B.W. Dickinson, “Rate-distortion optimized frame type selection for MPEG encoding,” IEEE Trans. Circuits Syst. Video Technol., June 1997.

[4] F.C. Martins, W. Ding, and E. Feig, “Joint control of spatial quantization and temporal sampling for very low bit rate video,” Proc. ICASSP, May 1996.

[5] N.S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice Hall, 1984.
