Content Based Video Compression for High Perceptual Quality Videoconferencing using Wavelet Transform

Jayashree Karlekar
Hughes Software Systems Ltd.,
Electronic City, Plot 31, Sector 18,
Gurgaon - 122 015, INDIA

U.B. Desai
Dept. of Electrical Engineering,
Indian Institute of Technology,
Powai, Mumbai - 400 076, INDIA
Abstract

A new hybrid compression scheme for videoconferencing and videotelephony applications at very low bit rates (i.e., < 32 Kbits/s) is presented in this paper. The human face is the most important region within a frame and should be coded with high fidelity. To preserve perceptually important information, such as face regions, at low bit rates, skin tone is used to detect these regions and quantize them adaptively. Novel features of this coder are the use of overlapping block motion compensation in combination with the discrete wavelet transform, followed by zerotree entropy coding with a new scanning procedure for wavelet blocks, such that the rest of the H.263 framework can be reused. At the same total bit rate, coarser quantization of the background enables the face region to be quantized finely and coded with higher quality.
1: Introduction

Video compression at low bit rates has attracted considerable attention recently in the image processing community. This is due to the expanding list of very low bit-rate applications, such as videoconferencing, multimedia, and wireless communications. The bulk of very low bit-rate coding research has centered around so-called second-generation techniques [2], because existing standards such as H.263 perform rather poorly at lower rates. While second-generation techniques are certainly of interest, they are characterized by computationally intensive algorithms which are required to solve challenging computer vision problems, and are therefore unlikely candidates for real-time communications over low-capacity channels. The work described in this paper can be seen as a compromise between block-based methods and model-based techniques. The approach taken here is a hybrid one which blends waveform-based coding with content-based techniques, as this facilitates adaptive quantization of the content of input video sequences. The proposed method is mainly based on the ITU-T Recommendation H.263 [1] standard, where the discrete cosine transform (DCT) is replaced with the discrete wavelet transform (DWT), with a number of additional object-oriented features which make it well suited for very low bit-rate (i.e., < 32 Kbits/s) coding of "head-and-shoulder" video, typical of real-time multimedia, videoconferencing, and videotelephony applications. In "head-and-shoulder" sequences, the human face and the associated subtle facial expressions need to be transmitted and reproduced as faithfully as possible at the receiver. The perceptual quality of such sequences can be improved by masking the face region for discriminative quantization in the process of compression.
A higher compression ratio in the background, where the distortion is perceptually less significant, allows the human face to be coded with higher fidelity at the same overall bit rate. Replacing the DCT with the DWT to reduce spatial redundancy makes the underlying video compression scheme highly scalable because of the inherent multiresolution structure of the DWT. Also, because of the global nature of the wavelet transform, region-based adaptive quantization does not produce visible artificial boundaries between regions with different quantization parameters. While the image content within the masked region is reproduced with high fidelity, the transition around the region boundary appears gradual and natural.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on December 3, 2008 at 03:53 from IEEE Xplore. Restrictions apply.
The organization of the paper is as follows. Section 2 describes the face detection method used in the proposed video coder, based on skin-tone color information, while Section 3 gives a brief description of quantization and coding using zerotrees. In Section 4 the proposed algorithm based on the wavelet transform is described. Section 5 gives simulation results. The conclusion is given in Section 6.
2: Face Region Identification
Face region identification can be looked upon as a preprocessing step for complex applications like face recognition, identity and credit card verification, face tracking, gaze tracking, indexing of huge databases, and model-based coding of images and videos. Algorithms for locating faces based on template matching are described in [9]-[11]. In video transmission and storage, colors are usually separated into luminance and chrominance components to exploit the fact that human eyes are less sensitive to chrominance variations. Human skin tones form a special category of colors, distinctive from the colors of most other natural objects. Although skin colors differ from person to person, they are distributed over a very small area in the chrominance plane [7]. To generate skin-tone colors on the chrominance plane, training data in the form of face patches are cut from different images. Fig. 1 shows the distribution of skin-tone colors in the (Cb, Cr) chrominance plane. A neural network (NN) is trained for skin-tone and non-skin-tone classification: given an input (Cb, Cr) pair, the neural network classifies it as skin tone or non skin tone. This trained classifier is used in the proposed video coder to classify each macroblock as a candidate face macroblock or a non-face one; the averages of the chrominance blocks in a macroblock are fed as input to the NN. The classification information is then passed on to an adaptive quantizer, so that a finer quantization step can be used for the face region while a coarser quantization step is used for the rest of the image. Examples of face region identification using the proposed NN based approach are shown in Fig. 3 for the Miss America and Suzie video sequences, where white rectangles denote the detected face regions.

Figure 1. Skin-tone color distribution on the Cb-Cr plane.
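The macroblock classification step can be sketched as follows. The paper's classifier is a trained neural network over (Cb, Cr) pairs; since its weights are not published, this sketch substitutes a simple elliptical decision region in the chrominance plane (the center and radii below are illustrative values, not taken from the paper), applied to the mean chrominance of each macroblock:

```python
import numpy as np

def is_skin_tone(cb_mean, cr_mean,
                 center=(-20.0, 25.0), radii=(25.0, 20.0)):
    """Stand-in for the paper's trained NN classifier: accept a
    (Cb, Cr) pair if it falls inside a small elliptical region of
    the chrominance plane (center/radii are illustrative, not
    taken from the paper)."""
    dcb = (cb_mean - center[0]) / radii[0]
    dcr = (cr_mean - center[1]) / radii[1]
    return dcb * dcb + dcr * dcr <= 1.0

def classify_macroblocks(cb, cr, mb=8):
    """Label each macroblock as face (True) or non-face (False).
    For 4:2:0 video the chrominance planes are half-size, so a
    16x16 luminance macroblock maps to an 8x8 chrominance block
    (mb=8); the block's mean (Cb, Cr) is fed to the classifier."""
    h, w = cb.shape
    labels = np.zeros((h // mb, w // mb), dtype=bool)
    for i in range(h // mb):
        for j in range(w // mb):
            sl = np.s_[i*mb:(i+1)*mb, j*mb:(j+1)*mb]
            labels[i, j] = is_skin_tone(cb[sl].mean(), cr[sl].mean())
    return labels
```

The resulting boolean map plays the role of the FL (face locations) signal passed to the coding control unit.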
3: Quantization and zerotree coding
Zerotree entropy coding (ZTE) [5][6] is a new technique for coding wavelet transform coefficients. The technique is based on, but differs significantly from, the EZW [3] and SPIHT [4] algorithms. Like EZW and SPIHT, the ZTE algorithm exploits the self-similarity inherent in the wavelet transform of images. ZTE coding organizes wavelet coefficients into wavelet blocks using zerotrees. Each wavelet block comprises the coefficients at all scales and orientations that correspond to the frame at the spatial location of that block. The concept of the wavelet block provides an association between wavelet coefficients and what they represent spatially in the frame. Wavelet blocks are scanned from low to high frequency into a vector, as proposed in [6], in order to be processed by the remaining parts of the H.263 coder.
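The grouping of coefficients into wavelet blocks can be sketched as follows, assuming the decomposition is stored in the standard Mallat layout (LL subband in the top-left corner, detail subbands at offsets given by the subband size); the function name and layout convention are illustrative, while the low-to-high frequency ordering follows the scan idea of [6]:

```python
import numpy as np

def wavelet_block(coeffs, i, j, levels):
    """Gather, from a Mallat-layout wavelet decomposition `coeffs`
    of an NxN frame, the coefficients at all scales and orientations
    that correspond to spatial block (i, j).  With `levels` levels
    the block covers a (2**levels)-square pixel area and the vector
    has (2**levels)**2 entries, ordered low to high frequency."""
    out = [coeffs[i, j]]                      # one LL coefficient
    n = coeffs.shape[0]
    for lev in range(levels, 0, -1):          # coarsest scale first
        s = n >> lev                          # subband size at this level
        k = 1 << (levels - lev)               # coefficients per side
        for (oy, ox) in ((0, s), (s, 0), (s, s)):   # LH, HL, HH
            sub = coeffs[oy + i*k:oy + (i+1)*k, ox + j*k:ox + (j+1)*k]
            out.extend(sub.ravel())
    return np.array(out)
```

Scanning every block this way partitions the full coefficient array, so the vectors can replace the zig-zag scanned DCT blocks in the rest of the H.263 pipeline.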
4: Proposed video encoder
The block diagram of the proposed video encoder is shown in Fig. 2. The encoder is very similar to the H.263 coder except for the inclusion of the face region identification block and a change of transform, where the DCT is replaced with the DWT. Each frame of the video sequence is encoded either as an intra ("I") or inter ("P", forward prediction) frame; the intra mode is used for the first frame and the inter mode for the remaining frames of the video sequence. Before encoding, each frame of the original video sequence is passed to the NN based face region identification block, which gives information about the location of the face region in the frame based on skin-tone color. The output of this block is passed to the coding control unit, which adaptively selects the quantization parameter.
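A minimal sketch of this adaptive quantization, assuming a uniform mid-tread quantizer and illustrative QP values (the paper does not publish the exact steps used for face and background regions):

```python
import numpy as np

def select_qp(face_mb, qp_face=8, qp_background=16):
    """Coding-control rule sketched from the paper: macroblocks
    flagged as face by the skin-tone classifier get a finer
    quantization step, everything else a coarser one (the QP pair
    here is illustrative, not taken from the paper)."""
    return qp_face if face_mb else qp_background

def quantize(coeffs, qp):
    """Uniform mid-tread quantizer applied to a vector of wavelet
    coefficients; the reconstruction levels are multiples of qp."""
    return np.round(coeffs / qp) * qp
```

Holding the total bit rate fixed, raising `qp_background` frees bits that allow `qp_face` to be lowered, which is the trade the abstract describes.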
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on December 3, 2008 at 03:53 from IEEE Xplore. Restrictions apply.
Figure 2. Block diagram of the proposed video encoder. Legend: T - discrete wavelet transform; T^-1 - inverse discrete wavelet transform; Q - quantizer; Q^-1 - inverse quantizer; P - predictor; RLC - run-length coding; VLC - variable-length coding; MV - motion vectors; FL - face locations; b - output bits.

Figure 3. Face region detection for (a) Miss America (frame 12) and (b) Suzie (frame 64).
For inter frames, a block based motion estimation scheme is used to detect local motion on blocks of size 16×16 with half-pel accuracy. Each block is predicted using an overlapping block motion compensation scheme. After all the blocks have been predicted, the residuals are put together to form a complete residual frame for subsequent processing by the DWT. For those blocks where motion estimation fails, the intra mode is selected. To turn such a block into a residual similar to the other predicted blocks, the mean of the block is subtracted from it and sent as overhead [5]. The output obtained from the transformation block is coded using the ZTE algorithm and the variable length code tables of H.263.
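The overlapped prediction step can be sketched as follows. This is a simplification: it assumes integer-pel motion vectors (the paper uses half-pel accuracy) and a separable triangular weighting window (the paper's exact window is not given), with per-pixel renormalization of the accumulated weights:

```python
import numpy as np

def obmc_predict(ref, mvs, B=16, overlap=4):
    """Overlapping block motion compensation sketch: each block's
    motion-compensated prediction is extended `overlap` pixels past
    its boundary, weighted by a separable triangular window, and the
    weighted contributions are accumulated and renormalized per
    pixel.  `mvs[bi, bj]` holds the (dy, dx) vector of block (bi, bj)."""
    h, w = ref.shape
    pred = np.zeros((h, w))
    wsum = np.zeros((h, w))
    # window tapering linearly over the overlap region, flat inside
    ramp = np.minimum(np.arange(B + 2*overlap) + 1,
                      np.arange(B + 2*overlap)[::-1] + 1)
    ramp = np.minimum(ramp, overlap + 1).astype(float)
    win = np.outer(ramp, ramp)
    for bi in range(h // B):
        for bj in range(w // B):
            dy, dx = mvs[bi, bj]
            for y in range(-overlap, B + overlap):
                for x in range(-overlap, B + overlap):
                    ty, tx = bi*B + y, bj*B + x    # target pixel
                    sy, sx = ty + dy, tx + dx      # source pixel
                    if 0 <= ty < h and 0 <= tx < w and \
                       0 <= sy < h and 0 <= sx < w:
                        wgt = win[y + overlap, x + overlap]
                        pred[ty, tx] += wgt * ref[sy, sx]
                        wsum[ty, tx] += wgt
    return pred / np.maximum(wsum, 1e-9)
```

Because neighboring blocks' predictions are blended across block boundaries, the residual frame handed to the DWT is free of the sharp block-edge discontinuities that plain block matching produces.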
5: Experimental Results
Objective and subjective comparison tests have been carried out between the proposed coding scheme and the emerging ITU-T standard H.263 on various test sequences. Each frame of the video sequence is decomposed into three levels of wavelet transform using an orthogonal Daubechies filter of regularity 4. Frames are coded as I or P using block motion estimation on 16×16 blocks, overlapping block motion compensation, the discrete wavelet transform, and the ZTE algorithm with region based quantization and the modified scan of wavelet blocks [6], so that the output can be further encoded using the run-length and variable length coding tables of H.263. Several experiments were carried out using the test sequences "Miss America" and "Suzie" (QCIF with resolution 176×144 and an original frame rate of 30 frames per second (fps)). H.263 results were produced with the publicly available TMN encoder software from Telenor (http://www.nta.no/brukere/DVC). For a target bit rate of 10 Kbits/s, the encoded frame rate is 7.5 fps for both methods. This frame rate is achieved by discarding three out of every four frames of the original sequence during encoding. Precise bit rate control was achieved by changing the quantization parameter in both methods.
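The multi-level decomposition into the Mallat layout can be sketched as follows; a Haar filter is used purely to keep the example short, whereas the paper uses an orthogonal Daubechies filter of regularity 4:

```python
import numpy as np

def dwt2_mallat(frame, levels=3):
    """Multi-level 2-D orthonormal DWT in Mallat layout (LL subband
    recursively in the top-left corner).  The Haar filter used here
    is a stand-in for the Daubechies filter of the paper."""
    out = frame.astype(float).copy()
    n = frame.shape[0]
    for _ in range(levels):
        sub = out[:n, :n]
        # transform rows: lowpass into left half, highpass into right
        lo = (sub[:, 0::2] + sub[:, 1::2]) / np.sqrt(2)
        hi = (sub[:, 0::2] - sub[:, 1::2]) / np.sqrt(2)
        sub[:, :n//2], sub[:, n//2:] = lo, hi
        # transform columns: lowpass into top half, highpass into bottom
        lo = (sub[0::2, :] + sub[1::2, :]) / np.sqrt(2)
        hi = (sub[0::2, :] - sub[1::2, :]) / np.sqrt(2)
        sub[:n//2, :], sub[n//2:, :] = lo, hi
        n //= 2
    return out
```

Since the transform is orthonormal it preserves the energy of the residual frame, and its output is in exactly the layout the wavelet-block scan of Section 3 assumes.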
Figure 4. PSNR (Y-axis) versus frame number (X-axis) for the proposed algorithm (solid lines) and H.263 (dotted lines) at 10 Kbits/s and 7.5 frames/s. The top row shows the luminance component; the bottom two rows show the chrominance components. (a) Miss America; (b) Suzie.
Fig. 4 shows the peak signal to noise ratio (PSNR) values versus frame number at 10 Kbits/s for the Miss America and Suzie video sequences. Note that the objective measure PSNR is a good indicator of subjective image quality only at high and medium bit rates; at very low bit rates, this is not always the case. As can be seen from Fig. 4, H.263 and the proposed scheme give comparable PSNR values. Subjective comparisons between H.263 and the proposed method can be made by examining Fig. 5 (for the luminance component only). Fig. 5 depicts frame 40 of Miss America and frame 144 of Suzie at 10 Kbits/s. The output of H.263 (Fig. 5(b)) clearly shows blockiness near the eyebrows of Miss America. Also, the lips of Miss America look broken and incomplete because of the block transform. For Suzie, blockiness is observed near the lower lip and the tip of the nose, whereas the result of the proposed method (Fig. 5(c)) looks more uniform and smooth, with no blockiness.
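For reference, the PSNR measure plotted in Fig. 4 is the standard one and can be computed as:

```python
import numpy as np

def psnr(orig, coded, peak=255.0):
    """Peak signal-to-noise ratio in dB between an original and a
    decoded frame: 10*log10(peak^2 / MSE), with peak = 255 for
    8-bit samples."""
    mse = np.mean((orig.astype(float) - coded.astype(float)) ** 2)
    if mse == 0:
        return float('inf')          # identical frames
    return 10.0 * np.log10(peak * peak / mse)
```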
Figure 5. Zoomed images of the Miss America (frame 44) and Suzie (frame 144) video sequences encoded at 10 Kbits/s, 7.5 fps: (a) Original; (b) H.263; and (c) Proposed method.
6: Conclusion
For the proposed video coding algorithm, simulation results demonstrate comparable objective and superior subjective performance when compared with H.263. The algorithm provides a better way to address scalability functionalities because of the inherent multiresolution structure of the wavelet transform, and is very efficient for very low bit-rate coding applications. Due to the DWT, there are no visible blocking artifacts in spite of the use of region based adaptive quantization. The region-based enhancements proposed here provide high perceptual quality videoconferencing for head-and-shoulder video material where high coding quality of facial areas is needed.
References

[1] Draft ITU-T Recommendation H.263, "Video Coding for Low Bit Rate Communication," Dec. 1995.
[2] D. E. Pearson, "Developments in model-based video coding," Proc. IEEE, Vol. 83, pp. 892-906, 1995.
[3] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, Vol. 41, No. 12, pp. 3445-3462, Dec. 1993.
[4] A. Said and W. A. Pearlman, "A new fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. CSVT, Vol. 6, No. 3, pp. 243-250, June 1996.
[5] S. A. Martucci, I. Sodagar, T. Chiang, and Y.-Q. Zhang, "A zerotree wavelet video coder," IEEE Trans. CSVT, Vol. 7, No. 1, pp. 109-118, Feb. 1997.
[6] R. de Queiroz, C. K. Choi, Y. Huh, and K. R. Rao, "Wavelet transforms in a JPEG-like image coder," IEEE Trans. CSVT, Vol. 7, No. 2, pp. 419-424, April 1997.
[7] H. Wang and S.-F. Chang, "A highly efficient system for automatic face region detection in MPEG video," IEEE Trans. CSVT, Vol. 7, No. 4, pp. 615-628, Aug. 1997.
[8] J. Luo, C. W. Chen, and K. J. Parker, "Face location in wavelet-based video compression for high perceptual quality videoconferencing," IEEE Trans. CSVT, Vol. 6, No. 4, pp. 411-414, Aug. 1996.
[9] A. N. Rajagopalan, K. Sunil Kumar, J. Karlekar, R. Manivasakan, M. M. Patil, U. B. Desai, P. G. Poonacha, and S. Chaudhuri, "Finding faces in photographs," Proc. IEEE Int. Conf. on Computer Vision, pp. 640-645, 1998.
[10] T. Poggio and K. Sung, "Finding human faces with a Gaussian mixture distribution based face model," Proc. Asian Conf. on Computer Vision (Singapore), pp. 437-446, 1995.
[11] H. A. Rowley, S. Baluja, and T. Kanade, "Neural network based face detection," Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 203-208, 1996.