AN MPEG ENCODER INCORPORATING PERCEPTUALLY BASED QUANTISATION

Wilfried Osberger, Sean Hammond and Neil Bergmann
Space Centre for Satellite Navigation, Queensland University of Technology, GPO Box 2434, Brisbane, Q., 4001, Australia.
Email: [email protected]

ABSTRACT

We discuss a strategy for adaptive quantisation in an MPEG encoder, based on the properties of the human visual system. This method can be used to replace the adaptive quantisation stage of the well known MPEG Test Model 5 (TM5), which is a simple method that does not closely model human perception. Our quantiser takes into account spatial masking by distinguishing between smooth, edge and texture regions, since they are known to have different masking properties. We also discuss methods of incorporating motion and higher level perceptual factors into our quantiser. The results show an improvement in PSNR of between 0.6 and 2.0 dB for a wide range of sequences and bit rates, compared to the TM5 model. We observed a significant increase in subjective picture quality using our adaptive quantiser, particularly at low bit rates.

1. INTRODUCTION

The Moving Picture Experts Group (MPEG) digital video coding standard is undergoing widespread usage in a range of different applications. A primary reason for its broad usage is its flexible and open structure. This openness has led to extensive ongoing research into methods of optimising the coder for particular applications and bit rates. One of the most important but often neglected factors in coder design is that the human is the final viewer and judge of the quality of the compressed picture. Thus in order to obtain the best perceptual quality picture at a given bit rate, it is essential to design the coder based on the properties of the Human Visual System (HVS). In this paper we demonstrate a way of implementing adaptive quantisation in an MPEG encoder, in a manner which reflects HVS properties.

The organisation of the paper is as follows. Section 2 outlines the most important properties of the HVS, while Section 3 gives a basic overview of the operation of an MPEG encoder, focusing on the quantisation strategy. In Section 4 we discuss our method of incorporating important HVS properties into an MPEG encoder, and the results achieved by our method are compared with those of a standard MPEG encoder in Section 5. Finally, the discussion looks at future perceptual enhancements which may be made to improve subjective picture quality.

2. PROPERTIES OF THE HVS

The work of vision researchers over the last 35 years has led to a rapid advancement in our knowledge of the operation of the HVS. However, it has not been until more recently that this knowledge has been put into practical use in image processing algorithms. The MPEG standard explicitly takes into account several low level vision properties (e.g. reduced sensitivity to chrominance), and its flexibility allows other HVS features to be incorporated. When designing a perceptually based MPEG quantiser, some of the most important HVS characteristics which we need to consider include [1]:

• Frequency sensitivities. This refers to our varying sensitivity to stimulus (or coding errors) at different spatial and temporal frequencies. The relationship at the threshold of visibility is represented by a Contrast Sensitivity Function (CSF), which is approximated by the Quantisation Matrix (QM) of an MPEG encoder.

• Masking effects. This refers to our reduced ability to detect a stimulus when the background is either spatially or temporally complex. Thus errors are less visible along strong edges (but only for a few pixels either side of the edge), in textured areas, in fast moving areas which our eyes are unable to track, or immediately following a scene change. MPEG allows us to incorporate masking by allowing spatially variable quantisation through the MQUANT parameter.

• Higher level perceptual factors, such as attention and eye movements. We only possess high visual acuity over a small area of viewing (the fovea), and our acuity drops off rapidly in the periphery. A coder which can take advantage of this fact by identifying Regions of Interest (ROI) in the scene (i.e. areas where any introduced distortions are most objectionable) can significantly reduce bandwidth without a degradation in subjective quality.
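As a rough illustration of how the QM and MQUANT act together, the sketch below quantises a block of DCT coefficients with a step size proportional to QM[i][j] × MQUANT. The divisor of 16, the rounding rule and the function names are simplifying assumptions made for illustration; the normative arithmetic (including the separate handling of the intra DC coefficient) is defined by the MPEG standard and is not reproduced here.

```python
def quantise_block(coeffs, qm, mquant):
    """Quantise DCT coefficients with an assumed, simplified rule.

    The step size for position (i, j) is qm[i][j] * mquant / 16, so the QM
    shapes the error across frequencies while MQUANT scales the whole block.
    """
    if not 1 <= mquant <= 31:
        raise ValueError("mquant must lie in the range 1..31")
    return [
        [int(round(c * 16 / (q * mquant))) for c, q in zip(row_c, row_q)]
        for row_c, row_q in zip(coeffs, qm)
    ]


def dequantise_block(levels, qm, mquant):
    """Reconstruct coefficient values from quantised levels (same assumed rule)."""
    return [
        [level * q * mquant / 16 for level, q in zip(row_l, row_q)]
        for row_l, row_q in zip(levels, qm)
    ]
```

With a toy 2 × 2 block, raising MQUANT from 2 to 8 coarsens every coefficient and drives the small high-frequency value to zero, which is the behaviour a perceptual quantiser would want to reserve for regions with strong masking.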

3. OPERATION OF AN MPEG ENCODER

Currently there are two MPEG standards which have been released: MPEG-1 (1991) and MPEG-2 (1995). The basic operation of the two standards is similar, and we refer to them generically as MPEG unless otherwise specified. Although we have used an MPEG-2 encoder [2] to implement the quantiser, our algorithm could readily be applied to an MPEG-1 encoder.

MPEG is a block-based coding method based on the Discrete Cosine Transform (DCT). A sequence is made up of groups of pictures, which contain pictures of three different types: I- (intra-coded), P- (predictive-coded) and B- (bidirectionally predictive-coded). Each picture is broken into 16 × 16 pixel regions (macroblocks), which are further broken into 8 × 8 pixel regions before applying a DCT. This transform significantly reduces the spatial redundancy in the frame and allows efficient representation following quantisation via a user-defined QM. The QM is designed both on the statistics of natural images, and on the spatial frequency sensitivity of the HVS. The significant temporal redundancy is reduced by allowing predictive P- and B-pictures, as well as independent I-pictures. Thus, through use of motion compensation in P- and B-pictures, only differences between adjacent pictures need be coded. For a complete description of the MPEG standard, refer to [3].

The quantisation strategy used in MPEG revolves around the QM. Although we may specify a new QM only once per picture, we would like to use spatially variable quantisation within a picture since, due to the perceptual factors discussed previously, some areas of the picture can tolerate more severe distortion than others. MPEG has allowed for this via the MQUANT parameter, which allows a scaling of the QM (by a factor of 1 to 31) for each macroblock. MPEG-2 provides further flexibility by allowing a non-linear scaling of the QM via MQUANT. The MPEG-2 committee developed a strategy for the usage of MQUANT in their Test Model 5 (TM5) [4]. This however was designed only as a basic strategy, and a more complex strategy can easily be adopted. TM5 involves three basic processes:

1. Target bit allocation for a frame,
2. Rate control via buffer monitoring,
3. Adaptive quantisation based on local activity.

The adaptive quantisation strategy crudely models our reduced sensitivity to complex spatial areas by varying MQUANT in proportion to the amount of local spatial activity in the macroblock. The spatial activity measure used in TM5 is the minimum variance among the four 8 × 8 luminance blocks in each macroblock. However, this simple activity measure fails to take into account such HVS properties as our different sensitivities to edges and textures, and higher level perceptual factors. For these reasons, a more complex model is required to provide compressed pictures with improved subjective quality.

4. PERCEPTUALLY IMPROVED QUANTISATION STRATEGY

This paper proposes a new method of locally adaptive quantisation which closely models HVS properties. Our method is used in place of the adaptive quantisation strategy of TM5. It is an extension of adaptive quantisation schemes designed for JPEG still image compression [5, 6]. The framework is flexible and easily allows different perceptual factors to be taken into consideration. The process involves a classification of the macroblocks with regard to their visual importance and tolerance for compression. Each macroblock is weighted according to a range of different criteria which are known to be perceptually significant. These include:

• Spatial Activity. This shows the amount of spatial change occurring locally within a macroblock, which reflects the amount of spatial masking occurring. A distinction is made between edges and textures. This is because the HVS can only tolerate errors within a very small spatial area close to an edge (smaller than the width of a macroblock), so harshly quantised edge macroblocks contain easily visible errors. However, masking occurs over all parts of a textured macroblock, so higher distortions can be tolerated [7]. We have implemented activity masking in a computationally efficient manner by using a local high-pass filter as an activity measure. The luminance picture is divided into 8 × 8 pixel sub-regions for classification. Regions with an activity below a threshold are regarded as smooth. Regions with an activity above the threshold are considered edges if they have a dominant orientation, otherwise they are classified as texture. We use a conservative approach to classify a macroblock: it is given the classification of the most sensitive of its four constituent 8 × 8 sub-regions. We then increase quantisation with activity more rapidly in textured regions than in flat or edge regions, since more masking occurs in textured regions.

• Motion. Regions are classified with regard to their motion. Areas undergoing smooth, predictable motion attract attention and can be tracked by the eye, and therefore cannot tolerate any more distortion than stationary regions [8]. However, areas undergoing fast or unpredictable motion cannot be tracked and can thus tolerate significant quantisation due to temporal masking.

• Position [9]. Areas near the centre of the picture are likely to be of more importance than peripheral regions, which can be quantised more harshly.

• Contrast [9]. Regions of high contrast with their surrounds attract our attention and are likely to be of greater visual importance, so should be encoded with greater fidelity.

• A-priori factors. If information is known about the scene contents, other criteria can be used. For instance, in a videoconferencing application, faces can be detected and classified as important regions.
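To make the contrast between the TM5 activity measure of Section 3 and the smooth/edge/texture classification concrete, here is a minimal sketch in plain Python. Several parts are assumptions rather than the method used in the paper: the high-pass filter is replaced by sums of absolute pixel differences, "dominant orientation" is approximated by a ratio between horizontal and vertical difference sums (which only detects near-horizontal or near-vertical edges), the thresholds `act_thresh` and `orient_ratio` are arbitrary, and the sensitivity ordering smooth, edge, texture is assumed.

```python
# Assumed sensitivity ordering, most sensitive first.
SENSITIVITY = ("smooth", "edge", "texture")


def variance(block):
    """Population variance of the pixel values in a 2-D block."""
    values = [p for row in block for p in row]
    mean = sum(values) / len(values)
    return sum((p - mean) ** 2 for p in values) / len(values)


def split_macroblock(mb):
    """Split a 16x16 macroblock into its four 8x8 sub-regions."""
    return [[row[x:x + 8] for row in mb[y:y + 8]] for y in (0, 8) for x in (0, 8)]


def tm5_activity(mb):
    """Minimum variance among the four 8x8 luminance blocks (Section 3)."""
    return min(variance(b) for b in split_macroblock(mb))


def gradient_sums(block):
    """Sums of absolute horizontal and vertical pixel differences."""
    gx = sum(abs(row[x + 1] - row[x]) for row in block for x in range(len(row) - 1))
    gy = sum(abs(block[y + 1][x] - block[y][x])
             for y in range(len(block) - 1) for x in range(len(block[0])))
    return gx, gy


def classify_block(block, act_thresh=200, orient_ratio=3):
    """Label an 8x8 sub-region as 'smooth', 'edge' or 'texture'."""
    gx, gy = gradient_sums(block)
    if gx + gy < act_thresh:          # low activity: smooth region
        return "smooth"
    weak, strong = sorted((gx, gy))
    if strong >= orient_ratio * max(weak, 1):  # one direction dominates: edge
        return "edge"
    return "texture"


def classify_macroblock(mb):
    """Give the macroblock the label of its most sensitive 8x8 sub-region."""
    labels = [classify_block(b) for b in split_macroblock(mb)]
    return min(labels, key=SENSITIVITY.index)
```

A checkerboard macroblock and a macroblock containing vertical step edges have identical sub-block variances, so a variance-only measure such as `tm5_activity` cannot separate them, whereas this sketch labels the first as texture and the second as edge.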

More criteria can easily be added to this list if required. To work out the overall importance and compressibility of a macroblock, the above factors are calculated and weighted according to their respective importance. The weights are determined using empirical tests. The weighted result is then converted into a value of MQUANT for each macroblock, allowing locally adaptive quantisation.
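The paper does not give the formula that turns the weighted factors into MQUANT, so the following is only a hypothetical sketch of one way such a mapping could work. It assumes each factor has already been reduced to a score in [0, 1] (1 meaning most important), that the combination is a weighted mean, and that importance scales a base MQUANT geometrically; the factor names, weights, `spread` parameter and both function names are illustrative inventions.

```python
def importance(factors, weights):
    """Weighted mean of per-factor scores in [0, 1] (1 = most important)."""
    total = sum(weights[name] for name in factors)
    return sum(weights[name] * score for name, score in factors.items()) / total


def mquant_from_importance(imp, base_mquant, spread=2.0):
    """Map an importance score to MQUANT in 1..31 (assumed mapping).

    imp = 0.5 keeps base_mquant; imp = 1 divides it by `spread` (finer
    quantisation); imp = 0 multiplies it by `spread` (coarser quantisation).
    """
    scale = spread ** (1.0 - 2.0 * imp)
    return max(1, min(31, int(round(base_mquant * scale))))


# Hypothetical weights; in the paper the weights come from empirical tests.
WEIGHTS = {"activity": 5, "motion": 2, "position": 1, "contrast": 2}
```

For example, a macroblock that scores 1.0 on every factor would be quantised with half the base MQUANT under these assumptions, while one that scores 0.0 everywhere would get double the base value, clamped to the legal range of 1 to 31.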

5. EXPERIMENTAL RESULTS

So far we have implemented the spatial activity masking process of our adaptive quantisation strategy. The computational complexity is similar to that of the TM5 adaptive quantiser. Thresholds for the classification of sub-regions have been chosen empirically. The results of the classification for a frame from the sequence "Claire" can be seen in Figure 1(b). Light areas indicate smooth regions and mid-grey areas indicate edges, while textures are represented by dark regions. The results show that our classification distinguishes these different parts of the picture accurately.

An indication of the subjective improvement which can be achieved by using our adaptive quantisation strategy can be seen in Figures 1(c) (frame quantised using our method) and 1(d) (frame quantised using TM5). The TM5 coder harshly quantises important edges on the face, such as the eyes and lips, resulting in objectionable distortion. However, our quantiser recognises these areas as being edges, and does not quantise them as severely.

We have tested our adaptive quantisation strategy on a variety of different sequences and at a number of bit rates, and compared the results to TM5. Table 1 shows the peak signal-to-noise ratios (PSNR) achieved. For the same bit rate, our quantiser achieves an improvement in PSNR over TM5 of between 0.6 and 2.0 dB. We have observed that the subjective quality is also considerably improved. This is particularly noticeable at lower bit rates. Coding errors are now transferred to areas of high spatial masking (textures), while areas with lower tolerance to coding error (smooth regions and edges) are more accurately coded. The result of this redistribution of the coding error is a higher subjective quality at a given bit rate.
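For readers unfamiliar with the PSNR figures quoted above, the sketch below shows the usual definition, PSNR = 10 log10(peak² / MSE), for a single luminance frame. The paper does not say how PSNR was averaged over frames or colour components, so this per-frame form, the peak value of 255 (8-bit samples) and the function name are assumptions.

```python
import math


def psnr(original, coded, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are given as lists of rows of pixel values.
    """
    squared_error = 0
    count = 0
    for row_o, row_c in zip(original, coded):
        for o, c in zip(row_o, row_c):
            squared_error += (o - c) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(peak ** 2 / mse)
```

Because the measure is logarithmic in the mean squared error, a gain of 2.0 dB corresponds to the mean squared error falling to roughly 63% of its previous value.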

6. DISCUSSION

This paper has presented a novel strategy for implementing locally adaptive quantisation in an MPEG encoder. It is based on known properties of the HVS, and involves a classification of local regions of an image depending on their visual importance and tolerance for compression. So far we have implemented only the first stage of this model, which allows for spatial activity masking depending on the local activity and content of the picture. The results confirm the improved perceptual quality of sequences coded using this method, in comparison to those encoded with the standard TM5 adaptive quantisation strategy. We are currently implementing the other stages of our quantisation strategy discussed in Section 4, which take into account motion and some higher level perceptual factors. We are also working on calculating the parameters of our model from HVS data, rather than from empirical tests.

REFERENCES

[1] B.A. Wandell. Foundations of Vision. Sinauer Associates Inc., Sunderland MA, 1995.
[2] S. Eckart and C. Fogg. MPEG software simulation group, MPEG-2 encoder/decoder. Available from: ftp://ftp.mpeg.org/pub/mpeg/mssg/.
[3] J.L. Mitchell, W.B. Pennebaker, C.E. Fogg, and D.J. LeGall. MPEG Video Compression Standard. Chapman and Hall, New York, 1997.
[4] Test model 5. ISO/IEC JTC1/SC29/WG11/No 400, MPEG93/457, Apr 1993.
[5] A.J. Maeder, J. Diederich, and E. Niebur. Limiting human perception for image sequences. In Proceedings SPIE 2657, pages 330-337, San Jose, Feb 1996.
[6] J. Zhao, Y. Shimazu, K. Ohta, R. Hayasaka, and Y. Matsushita. An outstandingness oriented image segmentation and its application. In ISSPA, pages 45-48, Gold Coast, Australia, Aug 1996.
[7] A.N. Netravali and B.G. Haskell. Digital Pictures: Representation and Compression. Plenum Press, New York, 1988.
[8] M.P. Eckert and G. Buchsbaum. The significance of eye movements and image acceleration for coding television image sequences. In Digital Images and Human Vision, pages 149-162. MIT Press, 1993.
[9] J.W. Senders. Distribution of attention in static and dynamic scenes. In Proceedings SPIE 3016, pages 186-194, San Jose, Feb 1997.

Figure 1. Comparison of coded frames, produced using our adaptive quantisation strategy and the TM5 method. (a) original picture, (b) region classification (light = smooth, mid-grey = edge, dark = texture), (c) picture coded using our quantiser, (d) picture coded using TM5. Both sequences were coded at a bit rate of 250 kbits per second.

Sequence        Bit Rate (Mb/s)   PSNR using our quantiser (dB)   PSNR using TM5 (dB)
Claire          0.25              39.7                            37.7
Miss America    0.5               40.5                            39.1
Flight          0.8               38.1                            36.1
Table Tennis    2.0               31.1                            30.5
Garden          2.0               27.7                            27.0
Football        3.0               31.6                            30.3

Table 1. PSNRs achieved using our adaptive quantisation strategy and TM5, for various sequences and bit rates.
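As a quick arithmetic check of the quoted improvement range, the short sketch below recomputes the per-sequence PSNR gain from the Table 1 values. The data structure and function name are illustrative only; the numbers are copied from the table.

```python
# Rows of Table 1: (sequence, bit rate in Mb/s, PSNR ours in dB, PSNR TM5 in dB)
TABLE_1 = [
    ("Claire", 0.25, 39.7, 37.7),
    ("Miss America", 0.5, 40.5, 39.1),
    ("Flight", 0.8, 38.1, 36.1),
    ("Table Tennis", 2.0, 31.1, 30.5),
    ("Garden", 2.0, 27.7, 27.0),
    ("Football", 3.0, 31.6, 30.3),
]


def psnr_gains(rows):
    """PSNR gain over TM5 for each sequence, rounded to 0.1 dB."""
    return {name: round(ours - tm5, 1) for name, _rate, ours, tm5 in rows}


gains = psnr_gains(TABLE_1)
```

The smallest gain (Table Tennis) and the largest gains (Claire and Flight) match the 0.6 to 2.0 dB range stated in the text, and the two largest gains occur at the lowest bit rates apart from Miss America.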
