An Overview of MPEG-4 Object-Based Encoding Algorithms

Anthony Vetro and Huifang Sun
MERL - Mitsubishi Electric Research Laboratories
Murray Hill, New Jersey, USA

Abstract – In this paper, we describe several object-based encoding issues and provide solutions based on the MPEG-4 coding standard. Specifically, we first consider the general problem of object-based rate control and the allocation of bits among multiple objects. Going a step further, we then discuss the issues related to the allocation of bits between shape and texture, and propose a method to model the rate-distortion characteristics of the shape. Finally, we describe some preliminary work that we have been doing towards optimizing the trade-off in spatial and temporal quality. For object-based coding, it is believed that significant coding gains can be achieved.

1. Introduction

Traditional video coding standards, such as MPEG-1 and MPEG-2, are frame-based coding methods. In an object-based coding scheme, several new problems must be considered. The first problem is that the encoding optimization should be performed among the objects in the video sequence rather than among frames. A traditional rate control algorithm assigns bits to each frame according to its coding type and quality requirements, given the bit rate and buffer conditions. For object-based coding, bits should instead be allocated to the objects according to the quality requirement of each object during the coding process. The second issue is the bit assignment between shape and texture coding. Since MPEG-4 is an object-based coding scheme, each object is described by both texture information and shape information. It is well known how to model the rate-distortion characteristics of texture information; however, doing the same for a lossy shape-coding scheme is a rather difficult problem. Such a model may be needed for low-bit-rate applications, where the percentage of shape bits can exceed that used for texture. The third issue arises from the frame-skipping mechanism allowed by MPEG-4 video coding. Unlike MPEG-1 and MPEG-2, the MPEG-4 video-coding scheme can skip frames to satisfy the desired bit rate in low-bit-rate applications. Encoding a video sequence at a lower temporal resolution, i.e., skipping some frames, can provide great benefit for segments of the sequence with low motion, but it significantly increases the distortion for segments with high motion. Therefore, it is important to investigate the trade-off between spatial quality and temporal quality.

In this paper, we give an overview of these issues and describe encoding algorithms that address them. Several of the algorithms presented in this paper have been adopted by the MPEG-4 standard [1, 2].

2. Object-Based Rate Control

The rate control algorithm that we propose for object-based coding is based on the quadratic rate-quantizer model of Chiang and Zhang [3] and has been described in [4]. Here, we review the main points to give context for the next two sections. Fig. 1 is a block diagram that represents the major operations of object-based rate control. First, the algorithm must be initialized, which essentially defines buffer-related quantities, such as the number of bits that are drained from the buffer per coding instant, and bit-rate-related quantities, such as the number of available bits. The rate control algorithm is initiated only after the first P-frame has been coded. After this frame, and after every inter-coded frame to follow, the rate-quantizer model is updated based on the encoding results of the current frame as well as a specified number of past frames. It should be noted that only the bits used for coding the texture are considered in this update. The objective is to compute a new set of second-order complexity measures for each object. This is accomplished through a least-squares estimation method [3] that uses the QP values and corresponding bit counts of several past frames. Besides updating the model parameters, another key operation in the post-encoding stage is to check and regulate the buffer occupancy. If the buffer level is too high, some objects should be skipped to allow the buffer to drain to a more comfortable level.

Given the updated model parameters and the next set of objects to be coded, the most critical step is for the rate control algorithm to allocate bits to each object. It begins by estimating an initial target, which is mainly based on the number of bits remaining and the number of bits used to code the previous object. After the initial target estimate for each object has been determined, the estimate is scaled according to the current buffer level. Distributing this target among the multiple objects can be done in many ways. We have found that a distribution according to the size, motion, and a variance-like measure of the residual of each object works reasonably well and is flexible, since the weights for each component can be adjusted according to bit-rate, coding complexity, etc.
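To make the allocation and QP-selection steps concrete, the following Python sketch distributes a frame-level texture-bit target among the objects using weights on size, motion, and a variance-like residual measure, and then maps each object's target to a quantization parameter by inverting the quadratic rate-quantizer model of [3]. The function names, the default weights, the example numbers, and the clipping range are illustrative assumptions rather than the normative MPEG-4 verification-model procedure.

```python
import math

def distribute_target(frame_target, objects, w_size=0.4, w_motion=0.3, w_var=0.3):
    """Split a frame-level texture-bit target among the objects.

    Each object is a dict with 'size' (pixels in the object), 'motion'
    (e.g., mean absolute motion-vector magnitude), and 'mad' (a
    variance-like measure of the prediction residual).  The weights are
    illustrative and could be tuned per bit-rate or coding complexity.
    """
    total_size = sum(o['size'] for o in objects) or 1.0
    total_motion = sum(o['motion'] for o in objects) or 1.0
    total_mad = sum(o['mad'] for o in objects) or 1.0
    targets = []
    for o in objects:
        share = (w_size * o['size'] / total_size +
                 w_motion * o['motion'] / total_motion +
                 w_var * o['mad'] / total_mad)
        targets.append(frame_target * share)
    return targets

def qp_from_quadratic_model(target_bits, mad, x1, x2, qp_min=1, qp_max=31):
    """Invert the quadratic model  T = x1*mad/Q + x2*mad/Q^2  for Q.

    x1 and x2 are the first- and second-order complexity measures that the
    rate controller maintains per object via least-squares estimation over
    past frames [3].  The result is clipped to the valid MPEG-4 QP range.
    """
    if target_bits <= 0:
        return qp_max
    # Rearranged:  -T*Q^2 + x1*mad*Q + x2*mad = 0  (take the positive root).
    a, b, c = -target_bits, x1 * mad, x2 * mad
    disc = b * b - 4.0 * a * c
    q = (-b - math.sqrt(disc)) / (2.0 * a)
    return int(min(max(round(q), qp_min), qp_max))

# Example: three objects sharing a 12000-bit texture budget for this frame.
objects = [
    {'size': 9000, 'motion': 0.5, 'mad': 4.0, 'x1': 60.0, 'x2': 900.0},
    {'size': 3000, 'motion': 3.0, 'mad': 9.0, 'x1': 80.0, 'x2': 1200.0},
    {'size': 1500, 'motion': 0.2, 'mad': 2.0, 'x1': 50.0, 'x2': 700.0},
]
for obj, t in zip(objects, distribute_target(12000, objects)):
    print(qp_from_quadratic_model(t, obj['mad'], obj['x1'], obj['x2']))
```

In the post-encoding stage, the complexity measures x1 and x2 of each object would then be re-estimated by least squares from the QP values and texture-bit counts of the last several frames, and the buffer occupancy would be checked to decide whether objects must be skipped.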

Figure 1: Block diagram of object-based rate control.

Figure 2: Trade-off in shape vs. texture bits.

3. R-D Modeling for Binary Shape

As first discussed in [4], one of the major problems in performing object-based rate control arises at low bit-rates, where the bits spent on shape consume a large percentage of the total bit-rate. This is illustrated in Fig. 2 for an arbitrarily shaped object in the Coastguard sequence coded at a bit-rate of 24 kbps. In this figure, we plot the average bits per coded frame versus the value of AlphaTH, a threshold used by an MPEG-4 encoder to control the amount of allowable distortion in the shape; an AlphaTH value of zero implies lossless shape coding. It is clear from this plot that a large portion of the bit-rate is dedicated to shape. Under low bit-rate conditions, it becomes crucial to know the rate-distortion characteristics of the various shapes being considered. At the very least, knowing the rate alone can help to maintain a stable buffer, since the encoder is then able to accurately estimate the number of bits needed for shape. However, with a shape model that also accounts for distortion, it becomes possible to perform joint optimizations between the shape and texture coding.

The proposed approach to modeling the shape is motivated by the fact that MPEG-4 uses a context-based arithmetic coding algorithm; more details of this shape-coding technique can be found in [5]. Essentially, intra-coded binary alpha blocks (BABs) require a 10-bit context to be computed, which leads to 1024 possible states that determine the rate needed to code each pixel. Attempting to model each of these states can be difficult. As a result, the states are first reduced by considering a smaller context, and the number of states is further reduced by exploiting the symmetry of the possible binary patterns. In doing this, we must be careful not to lose the information that is conveyed by the 10-bit context, but at the same time we should provide a more manageable set of parameters that are easily computed. In this modeling problem, our goal is to extract a set of parameters for each block; these parameters should be easy to compute and will be used to estimate the corresponding rate and distortion of the block at multiple resolutions. Our modeling approach is based on categorizing all possible binary patterns over a 2x2 neighborhood into N states. Let q_ij represent state i at resolution j, where j = 0 is full-scale, j = 1 is half-scale, and j = 2 is quarter-scale. Also, let n_ij denote the number of occurrences of state q_ij. Then, for every block,

R_j = α_j^T n = Σ_{i=0}^{N−1} α_ij n_ij

D_j = β_j^T n = Σ_{i=0}^{N−1} β_ij n_ij

where the α's and β's are the model parameters that need to be estimated. Obviously, α_ij denotes the rate for coding a pattern which belongs to state q_ij, and β_ij denotes the distortion that is associated with such a pattern at scale j. Since the shape is coded without loss at full-scale, β_i0 = 0.
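As an illustration of how the model can be evaluated per block, the sketch below counts the 2x2-pattern states of a binary alpha block and combines the counts with trained per-state parameters to predict the rate and distortion at full, half, and quarter scale. The pattern-to-state table, the sliding-window scan, and the array shapes are assumptions made for illustration; the actual state partitioning and training procedure are described in [6].

```python
import numpy as np

N_STATES = 6
# Hypothetical table mapping each of the 16 possible 2x2 binary patterns to
# one of the N states.  The real grouping, chosen so that patterns in a state
# have similar probability and similar down-sampling distortion, is trained
# as described in [6]; zeros here are only placeholders.
PATTERN_TO_STATE = np.zeros(16, dtype=int)

def count_states(bab):
    """Count occurrences n_i of each 2x2-pattern state in a binary alpha block.

    'bab' is a 2-D array of 0/1 values.  A sliding 2x2 window is assumed here;
    the exact scanning used in [6] may differ.
    """
    n = np.zeros(N_STATES)
    rows, cols = bab.shape
    for y in range(rows - 1):
        for x in range(cols - 1):
            pattern = ((int(bab[y, x]) << 3) | (int(bab[y, x + 1]) << 2) |
                       (int(bab[y + 1, x]) << 1) | int(bab[y + 1, x + 1]))
            n[PATTERN_TO_STATE[pattern]] += 1
    return n

def predict_rate_distortion(n, alpha, beta):
    """Predict (R_j, D_j) for j = 0 (full), 1 (half), 2 (quarter) scale.

    'alpha' and 'beta' are (N_STATES x 3) arrays of trained parameters, with
    beta[:, 0] = 0 because full-scale shape coding is lossless.
    """
    rate = n @ alpha        # R_j = sum_i alpha_ij * n_ij
    distortion = n @ beta   # D_j = sum_i beta_ij  * n_ij
    return rate, distortion
```

An encoder could then, for example, use the full-scale estimate R_0 to reserve shape bits before allocating the remainder to texture, or select the coarsest scale whose predicted distortion stays below a target.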

Lastly, in modeling the rate that is generated by an arithmetic encoder we have,

α_ij = − log( p_ij )

where p_ij is the probability of state q_ij. Assuming that the neighborhood is fixed, the major task is to partition the possible binary configurations into states. The criterion for partitioning these patterns is that patterns belonging to the same state should i) have similar probability, so that reliable estimates of the rate can be obtained, and ii) incur similar distortion when down-sampled. More details on this partitioning of binary states and the necessary training can be found in [6].


Figure 3: Comparison of actual and predicted R-D characteristics for binary shapes.

To confirm the accuracy of our model, several images outside the test set are used to train the model parameters. We then attempt to predict the bit-rate and distortion for a sample set of binary images. For each of the images, the level of resolution is fixed for every block; in this way, all the blocks are coded at either full, half, or quarter resolution, which produces three operating points in the R-D domain. A sample of the outcome for the Singer and Dancer sequences is shown in Fig. 3. From these plots, we see that the rate is predicted well, but the distortion exhibits some error. This is expected for a 2x2 neighborhood that is partitioned into only 6 states: as noted in [1], the down/up-sampling process used to code and then reconstruct the binary shape depends on a significant amount of data outside this neighborhood. One way to overcome this problem may be to consider a larger neighborhood, e.g., 3x3. This would account for the neighboring dependence; however, one would then need to partition the larger set of possible configurations so that the criteria above are still met. As a final point, it should be noted that even an accurate estimate of the rate alone has significant application: it can be used to stabilize the buffer under low bit-rate coding conditions by predicting the number of shape bits needed for each frame, then allocating the remaining bits to texture or deciding whether additional frames need to be skipped due to insufficient rate.
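As a rough illustration of this last point, the sketch below reserves the predicted shape bits first and then decides whether enough rate is left to code the texture of the frame, or whether the frame should be skipped to protect the buffer. The function name and threshold are hypothetical and not part of any MPEG-4 normative procedure.

```python
def budget_frame(frame_target, predicted_shape_bits, min_texture_bits=500):
    """Reserve bits for shape first; allocate the rest to texture or skip.

    frame_target: bits granted to this frame by the rate controller.
    predicted_shape_bits: estimate from the shape R-D model (e.g., R_0 above).
    min_texture_bits: smallest texture budget still considered worth coding
    (an illustrative threshold).
    Returns (code_frame, texture_bits).
    """
    texture_bits = frame_target - predicted_shape_bits
    if texture_bits < min_texture_bits:
        # Too little rate left after shape: skip the frame to keep the buffer stable.
        return False, 0
    return True, texture_bits
```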

4. Trade-Off in Spatial and Temporal Quality

The general problem that we consider in this section is the optimal rate-distortion (R-D) encoding of video sequences. One of the major advantages of an object-based scheme is that each object can vary in its temporal quality. Varying the temporal rate of each object is expected to yield modest gains under certain conditions. However, an object-based coding scheme that codes each object at a different frame rate is still quite ambitious, especially since a good formulation for frame-based coding does not exist. Consequently, we consider only the frame-based optimization for the moment and then discuss the gains that can be expected in an object-based scheme.

The specific problem that we are interested in can be stated as follows: given a particular video sequence, should the encoder choose to code more frames with lower spatial quality or fewer frames with higher spatial quality? Intuitively, we can expect that for low-motion segments of a video sequence, frame skipping will cause less subjective distortion than for segments with fast motion. In this case, if the bits saved from frame skipping are assigned to coding the intra-frame or predictive residues, we obtain better quality for the coded frames without a significant subjective penalty when the skipped frames are displayed by repeating the previously coded frames. However, this does not hold in the fast-motion case, where displaying skipped frames by repeating the previously coded frames causes significant subjective distortion, even though the coded frames have good quality due to the extra bits. Therefore, we believe a trade-off exists between temporal and spatial quality; in other words, there is an optimization issue between spatial coding quality and temporal resolution. It should be noted that this trade-off is not a simple binary decision, but rather a decision over a finite set of coding parameters. Obviously, the best set of coding parameters will yield the optimal R-D curve. The two parameters of interest are the number of frames per second (fps) and the quantization parameter (QP).

The majority of the literature on R-D optimization does not touch on the temporal aspect [7]-[9]; by and large, it is assumed that the frame rate is fixed. These papers consider optimizations of the QP [7], mode decisions for motion and block coding [8], and frame-type selection [9]. Such algorithms and techniques can be applied to achieve the optimum when the frame rate is fixed and the bit rate can be met with the chosen frame rate. However, the optimum across varying frame rates is a different story. To solve this problem, we must first consider models to estimate the distortion of coded and non-coded frames. The details of this formulation will be presented in [11], but the major results are summarized here. Essentially, the distortion of skipped frames can be modeled by the variances of the motion vectors and spatial gradients, while the distortion of coded frames can be modeled according to the classical bit-allocation theory described in [12].

In the following, we discuss the simulation results that have motivated us to consider this problem.

Simulations are conducted with three test sequences: Akiyo, Foreman and News. Akiyo is a sequence with low coding complexity, Foreman is a sequence with very high coding complexity due to high motion and texture complexity, while News is characterized by moderate motion and coding complexity. Actually, there is significant motion in the News sequence, but it is localized. Each sequence is encoded and decoded with a fixed QP from the set of quantizers Q = {6, 9, 12, 18, 24} and a fixed frame rate from the set F = {7.5, 15, 30} Hz. In some cases, additional quantizers were added to the simulation to extend the R-D curve in relevant places. The PSNR for each decoded sequence is always computed at the full frame rate of 30 Hz. For sequences coded at lower temporal resolutions, the skipped frames are simply repeated. Of course, some form of interpolation could be used to reconstruct the skipped frames, e.g., linear interpolation or something more sophisticated that accounts for the motion between coded frames. However, the use of such methods would make the results dependent on a post-processing technique, which we would like to avoid for the moment.

In the first set of simulation results, with the Akiyo sequence, it becomes very clear that our initial hypothesis is confirmed: better results can be achieved with lower frame rates. The main reason that such gains are observed at lower temporal resolutions is that the motion in the scene is very low; as a result, when a frame is skipped, the increase in distortion is relatively small compared to the savings in bits. In sharp contrast to the previous results, the R-D curves for the Foreman sequence indicate that better results will not be obtained with lower frame rates. Due to the high motion in this sequence, skipping a frame is severely penalized; consequently, lower frame rates cannot compete with higher frame rates at the same levels of distortion. On the other hand, with high-motion sequences, we observe a sort of hand-off or roll-off effect, which creates a staircase or step-like function that is discontinuous. In other words, assuming that a frame rate is optimal for a given bit-rate, it will remain optimal for decreasing bit-rates until the maximum QP can no longer meet the given rate; at that point, the next frame rate takes over, with a moderate decline in the overall distortion. Since the News sequence is characterized by moderate motion (overall), we can expect a middle ground between the low-motion and high-motion extremes. In this case, the distortion introduced by lower temporal resolutions is not as severe as for Foreman, but the gain from lower temporal resolutions is not as great as for Akiyo. On average, the motion in the News sequence is moderate. In actuality, however, there is one particular object with very high motion, while the remainder of the scene has very little.

With an object-based coding scheme, it is possible to segment this object and code it with a higher temporal resolution, while the other objects are reduced to lower temporal resolutions. In this case, one must be aware of potential composition problems. However, when composition problems are not a concern, the potential gains that can be achieved are very high. The composition problem is discussed in more detail in [13].
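A simplified search over this trade-off might look like the following sketch: for each candidate (frame rate, QP) pair, the distortion of coded frames is estimated with a quantizer-based model in the spirit of [12], the distortion of skipped frames with a term driven by the motion-vector and spatial-gradient variances, and the pair with the lowest average distortion that fits the bit budget is retained. The specific model forms, constants, and the rate estimator are illustrative assumptions; the actual formulation is developed in [11].

```python
def coded_frame_distortion(qp, mad):
    """Coded-frame distortion from a classical quantizer model in the spirit
    of [12]: error grows with the square of the quantizer step, scaled by the
    residual energy.  The exact form is an illustrative assumption."""
    return (qp ** 2) / 12.0 * mad

def skipped_frame_distortion(sigma_mv2, sigma_grad2, frames_skipped):
    """Skipped-frame distortion: repeating the last coded frame incurs an
    error that grows with the motion-vector variance and the spatial-gradient
    variance, accumulated over the number of frames skipped (assumption)."""
    return frames_skipped * sigma_mv2 * sigma_grad2

def best_tradeoff(bit_budget, candidates, stats):
    """Pick the (fps, qp) pair with the lowest average distortion that fits
    the bit budget.

    candidates: iterable of (fps, qp) pairs, e.g. built from the F and Q sets.
    stats: per-sequence measurements -- 'mad', 'sigma_mv2', 'sigma_grad2' --
    plus a callable 'rate'(fps, qp) estimating bits per second (assumptions).
    """
    best = None
    for fps, qp in candidates:
        if stats['rate'](fps, qp) > bit_budget:
            continue                               # cannot meet the rate
        skip = int(round(30.0 / fps)) - 1          # frames repeated per coded frame
        d_coded = coded_frame_distortion(qp, stats['mad'])
        d_skipped = d_coded + skipped_frame_distortion(
            stats['sigma_mv2'], stats['sigma_grad2'], skip)
        # Average distortion over the full 30 Hz display rate.
        d_avg = (d_coded + skip * d_skipped) / (skip + 1)
        if best is None or d_avg < best[0]:
            best = (d_avg, fps, qp)
    return best
```

In an object-based setting, the same kind of search could in principle be run per object, keeping high-motion objects at higher temporal resolutions, subject to the composition constraints noted above.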

5. Concluding Remarks

In this paper, we discussed the major issues related to object-based video encoding. These issues include object-based rate control, bit allocation between shape coding and texture coding, and R-D optimization over the selection of temporal resolution and quantization parameter for spatial quality. The solutions that we have proposed to address these problems have been summarized.

References

[1] ISO/IEC 14496-2:1999, "Information technology -- Coding of audio-visual objects -- Part 2: Visual."
[2] ISO/IEC 14496-5:2000, "Information technology -- Coding of audio-visual objects -- Part 5: Reference software."
[3] T. Chiang and Y.-Q. Zhang, "A new rate control scheme using quadratic rate-distortion modeling," IEEE Trans. Circuits Syst. Video Technol., Feb. 1997.
[4] A. Vetro, H. Sun, and Y. Wang, "MPEG-4 rate control for multiple video objects," IEEE Trans. Circuits Syst. Video Technol., Feb. 1999.
[5] N. Brady, F. Bossen, and N. Murphy, "Context-based arithmetic encoding of 2D shape sequences," Proc. Int'l Conf. Image Processing, Santa Barbara, CA, Oct. 1997.
[6] A. Vetro, H. Sun, Y. Wang, and O. Guleryuz, "Rate-distortion modeling of binary shape using state partitioning," Proc. Int'l Conf. Image Processing, Kobe, Japan, Oct. 1999.
[7] H. Sun, W. Kwok, M. Chien, and C.H. John Ju, "MPEG coding performance improvement by jointly optimizing coding mode decision and rate control," IEEE Trans. Circuits Syst. Video Technol., June 1997.
[8] T. Wiegand, M. Lightstone, D. Mukherjee, T.G. Campbell, and S.K. Mitra, "Rate-distortion optimized mode selection for very low bit rate video coding and the emerging H.263 standard," IEEE Trans. Circuits Syst. Video Technol., Apr. 1996.
[9] J. Lee and B.W. Dickinson, "Rate-distortion optimized frame type selection for MPEG encoding," IEEE Trans. Circuits Syst. Video Technol., June 1997.
[10] F.C. Martins, W. Ding, and E. Feig, "Joint control of spatial quantization and temporal sampling for very low bit rate video," Proc. ICASSP, May 1996.
[11] A. Vetro, H. Sun, and Y. Wang, "Rate-distortion optimized video coding with frameskip," IEEE Int'l Conf. Image Processing, submitted Jan. 2001.
[12] N.S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, 1984.
[13] A. Vetro, H. Sun, and Y. Wang, "Object-based transcoding for adaptable video content delivery," IEEE Trans. Circuits Syst. Video Technol., to appear, 2001.
