Face Detection and Tracking for Video Coding Applications

Bernd Menser and Michael Brünig
Institut für Elektrische Nachrichtentechnik
Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen
52056 Aachen, Germany
[email protected]
Abstract

In video communication applications at low bitrates, coding artifacts are especially disturbing in face areas. Encoding faces with higher quality than background regions requires automatic face detection. This paper presents a robust face segmentation and tracking algorithm and investigates its application to region-of-interest coding with H.263+ and object-based coding with MPEG-4.
1 Introduction

The demand for integrating visual information into communication applications has increased greatly in the last decade. This trend is driven by recent developments in wireless communication technologies and the internet. The list of applications includes video conferencing, video telephony, and multimedia email. Due to the limited bandwidth and the huge size of the image data, high compression ratios must be achieved. At low bitrates, the image quality suffers from compression artifacts, which are unpleasant especially in facial areas. However, the block-based coding standards ITU-T H.261 and H.263, widely used in video communication systems, assign equal importance to each image block. The overall perceptual quality at low bitrates can be improved by encoding the face region at a higher bitrate than less relevant background regions [1, 2, 7]. This requires the automatic detection and segmentation of human faces in the video sequences. A face region then defines a region-of-interest (ROI) and is coded with priority. Based on a robust face segmentation and tracking algorithm [8], ROI coding with H.263+ and object-based coding using MPEG-4 are investigated, and their potential to enhance the image quality of the face region is compared.
2 Face Detection and Tracking

This section presents a robust face segmentation approach that is able to extract face regions in color images with complex background [8]. In contrast to other face segmentation algorithms that assume head-and-shoulder sequences with a constrained background, the proposed algorithm requires no prior knowledge about the number of faces or their position, and it is able to reject images that contain no faces at all. Therefore, the algorithm can also be applied to other tasks such as image classification for databases. An overview of the proposed face segmentation scheme is given in Figure 1. In a first step, a skin probability map is generated from the original image using a model for the distribution of skin color. A common approach is a binary segmentation into skin and non-skin regions by applying a single threshold to the skin probability map. The subsequent analysis of the face candidate regions may fail if the face is not represented by a single region or is merged with the background. Thus, the selection of an appropriate threshold is crucial to the overall detection performance. In our approach, instead of a binary segmentation, the skin probability map is analyzed at multiple threshold levels using connected operators. Connected operators are nonlinear filters that eliminate some parts of the image while preserving the contours of the remaining parts [11]. Connected components are either preserved or completely removed based on a decision criterion applied separately to each component. Originally introduced for binary images, connected operators can be extended to gray level image processing. A simple way to create a gray level connected operator is a threshold decomposition of the image and the application of binary connected operators to each layer; however, more efficient implementations exist for many operators [11]. The quantization of the skin probability map defines the number of threshold levels and leads to a set of nested connected components. Then, a hierarchy of connected operators using geometrical criteria is applied. Each stage reduces the number of face candidates by removing regions that do not meet the operator criterion. The regions preserved in the shape analysis stage are passed to operators that analyze the texture inside the remaining components and perform the final classification.

Figure 1. Outline of the face segmentation algorithm
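The threshold-decomposition view of a gray level connected operator can be made concrete with a short sketch. The following Python code is a minimal illustration, not the paper's implementation: the decision criterion is passed in as a function, the filtered map is rebuilt as the pointwise maximum over the surviving layers, and subtleties such as the stacking property of the criterion are ignored.

```python
import numpy as np
from scipy import ndimage

def gray_connected_operator(prob_map, levels, keep):
    """Gray level connected operator by threshold decomposition: apply a
    binary per-component decision criterion `keep` on every threshold
    layer and rebuild the simplified map from the surviving layers."""
    out = np.zeros(prob_map.shape)
    for t in levels:
        layer = prob_map >= t
        labels, n = ndimage.label(layer)
        kept = np.zeros(layer.shape, dtype=bool)
        for i in range(1, n + 1):
            comp = labels == i
            if keep(comp):          # decision criterion per component
                kept |= comp
        out = np.maximum(out, t * kept)
    return out
```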
2.1 Skin Color Analysis
Color is a simple but powerful pixel-based feature for detecting human faces, since the color of human skin differs from the color of many other natural objects. The distribution of skin color is analyzed using a large database with labeled skin pixels. The histogram shows a large variation in the luminance component of skin color, e.g. due to varying lighting conditions, whereas the chrominance components of skin pixels are clustered in a small area of the chrominance plane. Thus, the distribution of skin color in the chrominance plane of the YCbCr color space is modeled by a 2D Gaussian. Using this model, a skin probability map indicates the likelihood of each image pixel to represent skin. Figure 2 shows the luminance component and the corresponding skin probability map of the first image of the test sequence Foreman. Bright pixels in the skin probability map correspond to image pixels with a high probability of representing skin.

Figure 2. Original image and corresponding skin probability map

Various color spaces and models have been proposed to describe the skin color, e.g. [1, 3, 12, 10]. However, the skin model is often used for a binary classification into skin and non-skin pixels, which corresponds to a binary segmentation of the skin probability map. Using a generic skin model, a universal threshold for skin classification that is applicable to all images can hardly be found. Instead of applying a single threshold, the skin probability map is therefore analyzed at different threshold levels by gray level connected operators.
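As an illustration of the skin color model, the sketch below computes a skin probability map from the CbCr channels using a 2D Gaussian. The mean and covariance values are hypothetical placeholders for parameters that would be estimated offline from a labeled skin database.

```python
import numpy as np

# Hypothetical Gaussian skin model in the CbCr plane; in practice the
# parameters are estimated from a large database of labeled skin pixels.
SKIN_MEAN = np.array([110.0, 150.0])
SKIN_COV = np.array([[60.0, 20.0],
                     [20.0, 40.0]])

def skin_probability_map(cb, cr, mean=SKIN_MEAN, cov=SKIN_COV):
    """Per-pixel skin likelihood from a 2D Gaussian model in CbCr."""
    d = np.stack([cb - mean[0], cr - mean[1]], axis=-1)
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance of each pixel's chrominance vector.
    m2 = np.einsum('...i,ij,...j->...', d, inv_cov, d)
    return np.exp(-0.5 * m2)  # unnormalized likelihood in (0, 1]
```

The resulting map can then be quantized to the desired number of threshold levels before the connected operators are applied.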
2.2 Shape Analysis
The skin probability map is quantized and simplified using a hierarchical scheme of connected operators whose decision criteria are based on shape features (Fig. 1). The first step removes small and thin objects. An opening by reconstruction preserves all components that are not totally removed by binary erosion; thus, small bright regions in the probability image are removed. The dual operator, closing by reconstruction, removes small regions with low probability. Since closing is an extensive operator, it may artificially increase the skin likelihood assigned to the pixels; to prevent such errors, a smaller structuring element is employed for closing. The subsequent operators make use of basic assumptions about the shape of a face. The operators act on threshold layers of the skin probability map, and components that can be excluded from the face candidates based on their shape are removed. The simple but effective decision criteria rely on combinations of the area A, the perimeter P, and the size (Dx, Dy) of the bounding box surrounding the connected component; these features have to be computed only once for all operators. A compactness operator removes connected components with complex contours. The decision criterion, the compactness C, is given by the ratio between the area and the squared perimeter of the connected component:

C = A / P^2    (1)
This criterion reaches its maximum for circular objects. All objects with compactness lower than a predefined threshold are removed. The next operator measures the solidity S of a connected component, defined by the ratio of its area to the size of the bounding box:

S = A / (Dx Dy)    (2)
If the solidity of a component is below a threshold, it is removed. The last operator uses a simple measure of orientation O, given by the ratio between the height Dy and the width Dx of the bounding box surrounding the object:

O = Dy / Dx    (3)
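A sketch of how the three shape criteria might be evaluated on one threshold layer is given below. The threshold values are illustrative assumptions, since the paper does not list concrete numbers, and the perimeter is approximated by counting border pixels.

```python
import numpy as np
from scipy import ndimage

def shape_filter(layer, c_min=0.02, s_min=0.5, o_min=0.8, o_max=2.5):
    """Remove components of a binary threshold layer that violate the
    compactness (1), solidity (2) or orientation (3) criteria.
    All threshold values are assumed for illustration."""
    labels, n = ndimage.label(layer)
    kept = np.zeros(layer.shape, dtype=bool)
    for i in range(1, n + 1):
        comp = labels == i
        area = comp.sum()
        # Perimeter approximated by the number of border pixels.
        border = comp & ~ndimage.binary_erosion(comp)
        perimeter = border.sum()
        ys, xs = np.nonzero(comp)
        dy = ys.max() - ys.min() + 1   # bounding box height Dy
        dx = xs.max() - xs.min() + 1   # bounding box width Dx
        c = area / perimeter ** 2      # compactness, Eq. (1)
        s = area / (dx * dy)           # solidity, Eq. (2)
        o = dy / dx                    # orientation, Eq. (3)
        if c >= c_min and s >= s_min and o_min <= o <= o_max:
            kept |= comp
    return kept
```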
The hierarchy of simple, low-complexity connected operators simplifies the skin probability map and significantly reduces the number of connected components that are passed to the texture analysis stage.

Figure 3. Face detection results
2.3 Texture Analysis
The previously described filtering process utilizes geometrical properties of the connected components. However, for a reliable detection of faces, texture information has to be considered as well. Therefore, two connected operators are used whose decision criteria are based on the original image texture inside the support of the connected component. Before evaluating the texture, holes inside the connected components are closed. Face regions show a certain amount of luminance variation due to eyes, mouth, etc. This property is exploited by a connected operator based on the variance of the luminance inside the support of the connected component: the operator removes regions with little variance, which are often caused by skin-colored background. For the final classification, a criterion is applied that measures the likelihood of each remaining image region being a face. Moghaddam and Pentland have shown how this likelihood can be estimated assuming a unimodal Gaussian density for the class of face images [9]. The Mahalanobis distance can be approximated by projecting the image into the subspace spanned by the eigenvectors (eigenfaces) calculated from a training set of face images; further details are given in [8] and [9]. To compute the Mahalanobis distance for arbitrarily shaped regions, a rectangular subimage is created which contains the original image texture of the region defined by the connected component. This image is of the same size as the tightest rectangle surrounding the connected component. The areas outside the connected component are filled with the background color of the training set, and the image is scaled to the size of the training images. Then, this rectangular image is projected into the subspace and the approximated Mahalanobis distance is calculated. If the distance
is below a predefined threshold the region is classified as a face, otherwise it is removed. If nested regions belonging to different threshold layers are preserved by the final operator, the one with the maximum compactness defined by (1) is selected as the face region. In a final step, the contours of the segmented regions are smoothed by morphological operators.
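The final face-likelihood test can be sketched as follows: the masked, rescaled region is projected onto a set of eigenfaces, and the Mahalanobis distance is approximated in that subspace following Moghaddam and Pentland [9]. The eigenface matrix and eigenvalues are assumed to come from an offline PCA on a face training set; this is an illustrative sketch, not the paper's code.

```python
import numpy as np

def face_distance(patch, mean_face, eigenfaces, eigenvalues):
    """Approximate Mahalanobis distance in the eigenface subspace [9].

    patch       -- region texture, already masked, background-filled and
                   scaled to the training image size
    mean_face   -- (d,) mean of the flattened training faces
    eigenfaces  -- (d, k) leading eigenvectors of the face covariance
    eigenvalues -- (k,) corresponding eigenvalues
    """
    diff = patch.astype(float).ravel() - mean_face
    coeffs = eigenfaces.T @ diff           # projection onto the subspace
    # Coefficients weighted by the inverse eigenvalues approximate the
    # Mahalanobis distance within the subspace.
    return float(np.sum(coeffs ** 2 / eigenvalues))
```

A region would then be classified as a face if this distance falls below a threshold tuned on the training set, as described above.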
2.4 Face Detection Results
Figure 3 shows results of the face segmentation algorithm for the first frames of the Foreman and News sequences. The detected face regions do not correspond to connected components of the same threshold layer; thus, applying a global threshold to the skin probability maps would have failed.
2.5 Face Tracking
Once a face is detected, the initially extracted region and the corresponding threshold are used to track the face segment through the sequence. Since the detection algorithm indicates the presence of a face at a particular spatial position in the image sequence, tracking can be done using color and shape information. Similar to the detection step, the concept of connected operators is used to track the face. The skin probability map of the current frame is quantized using an adaptive set of threshold levels: the threshold that belongs to the face region of the previous frame serves as an estimate of the new optimal threshold, and a small number of additional threshold levels is centered around it. The connected component that contains the face in the previous frame is projected into the current frame and used as a marker. All connected components that are covered by this marker by less than a predefined portion are removed. The remaining connected components are classified according to their shape properties: the center of gravity, the size of the bounding box, and the solidity defined by (2) are computed. In contrast to the detection step, the shape features are compared to those of the marker region rather than to predefined thresholds. In typical image sequences, the size and position of a face vary smoothly; at least one of these features changes significantly if the face is merged with the background or breaks into several components. If more than one nested region is preserved, the connected component with the maximum compactness defined by (1) is selected. To cope with temporary occlusions, the connected component belonging to the old threshold is accepted if all regions are rejected in the shape analysis stage; in this case, the current marker region is also used for the next frame. If the region is lost for several frames, the detection process is reinitialized. Figure 4 presents some results of tracking the face region in the Foreman sequence at a frame rate of 25/3 fps. The face segment is tracked even in the presence of rapid motion and partial occlusions.

Figure 4. Face tracking in the Foreman sequence (frames 12, 69, 90, 108, 153, 180)
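The marker test of the tracking step described above can be sketched as follows; the coverage threshold is an assumed value, and the subsequent shape comparison against the marker region is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def marker_filter(layer, marker, min_coverage=0.5):
    """Keep the components of the current threshold layer that are
    sufficiently covered by the face region (marker) projected from
    the previous frame. `min_coverage` is an assumed threshold."""
    labels, n = ndimage.label(layer)
    survivors = []
    for i in range(1, n + 1):
        comp = labels == i
        coverage = (comp & marker).sum() / comp.sum()
        if coverage >= min_coverage:
            survivors.append(comp)
    return survivors  # empty list: fall back to the old threshold layer
```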
3 Video Coding Applications

The face area extracted by the face detection algorithm is the focus of attention in video communication applications and has to be coded with higher quality than the less relevant background. This section addresses the enhancement of image quality by spatially adaptive quantization, although image quality can be enhanced by temporal scalability as well [6]. ROI coding using H.263+ is compared to the object-based coding of MPEG-4. For this purpose, the first 200 frames of the Foreman sequence in CIF format are encoded at a fixed frame rate of 25/3 fps. No rate control is used in either encoder.

3.1 ROI Coding using H.263+
In the block-based coding standard ITU-T H.263, the quantization parameter Q can be adjusted at the macroblock level [5]. The limited incremental quantizer update of version 1 is extended by Annex T (H.263+), which allows an independent selection of Q for each macroblock. For ROI coding, different quantization parameters are used for macroblocks that belong to the face region (Qf) and macroblocks belonging to the background (Qb). For this purpose, a macroblock mask is generated from the pixel-based face mask. Whenever a transition between face and background regions occurs in the macroblock scan, a DQUANT marker followed by the new quantization parameter is coded. Table 1 compares the achieved PSNR for the luminance component in the face region and the background region. All extensions of version 2 except Annex T are switched off. The chosen quantization parameters lead to approximately the same bitrate without any rate control. Using a ROI increases the average PSNR of the face region by 2.5 dB, with an average degradation of 1 dB in the background. As shown in Figure 5, the subjective quality is significantly improved. On average, 150 bits per frame are necessary to encode the changes of the quantization parameter, so the overhead for encoding the block-based mask is nearly negligible.

Figure 5. Frame 84 of the Foreman sequence without ROI (upper image) and with ROI (lower image)

                 without ROI, Q = 17    with ROI, Qf = 10, Qb = 30
Rate [kbit/s]         131.07                     131.45
PSNR ROI [dB]          31.41                      33.95
PSNR BG [dB]           31.35                      29.36

Table 1. Coding results for Foreman sequence (CIF, 25/3 fps)
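One possible way to derive the macroblock-level ROI mask from the pixel-based face mask is sketched below; the coverage rule is an assumption, as the paper does not specify how boundary macroblocks are assigned.

```python
import numpy as np

def macroblock_mask(face_mask, mb=16, min_coverage=0.5):
    """Mark a 16x16 macroblock as ROI if at least `min_coverage` of its
    pixels belong to the pixel-based face mask (assumed rule)."""
    h, w = face_mask.shape
    blocks = face_mask[:h - h % mb, :w - w % mb].astype(float)
    blocks = blocks.reshape(h // mb, mb, w // mb, mb)
    return blocks.mean(axis=(1, 3)) >= min_coverage

def quantizer_map(mb_mask, qf=10, qb=30):
    """Per-macroblock quantization parameters (values from Table 1):
    Qf inside the ROI, Qb in the background."""
    return np.where(mb_mask, qf, qb)
```

At encoding time, a DQUANT update would then be emitted whenever the quantizer value changes between consecutive macroblocks in scan order.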
3.2 Object-based Coding using MPEG-4
In contrast to the frame-based coding standards MPEG-1/2 and H.26x, the MPEG-4 standard [4] allows coding of arbitrarily shaped regions. Each video object (VO) is encoded in a separate bitstream and can be decoded independently, which enables user interaction and scene decomposition at the decoder side. While the potential of MPEG-4 goes far beyond region-of-interest coding, the concept of arbitrarily shaped video objects can be used to enhance the face quality in a communication application. The face region and the background are encoded as separate VOs using different quantization parameters. Table 2 shows the obtained quality compared to coding the entire image as a single rectangular VO.
                 1 VO, Q = 14    2 VOs (ROI+BG), Qf = 10, Qb = 30
Rate [kbit/s]        182.2                  183.4
PSNR ROI [dB]        32.97                   34.20
PSNR BG [dB]         32.25                   29.74

Table 2. Coding results for Foreman sequence (CIF, 25/3 fps)
The combination of quantization parameters is selected to approximately match the rate; the H.263 quantization type is selected. To assess the cost of shape coding, both VOs are encoded again using the same quantization parameter for both, and the bitrate is compared to the rate for a single rectangular VO. The total average bitrate per frame rises from 16180 bits to 19885 bits, which is significantly higher than the overhead of ROI coding in H.263. On the other hand, coding two separate video objects offers additional possibilities such as interactivity or unequal error protection.

References
[1] D. Chai and K. N. Ngan. Face segmentation using skin-color map in videophone applications. IEEE Trans. on Circuits and Systems for Video Technology, 9(4):551–564, June 1999.
[2] A. Eleftheriadis and A. Jacquin. Automatic face location detection and tracking for model-assisted coding of teleconferencing at low bit-rates. Signal Processing: Image Communication, 7(3):231–248, 1995.
[3] C. Garcia and G. Tziritas. Face detection using quantized skin color regions merging and wavelet packet analysis. IEEE Trans. Multimedia, 1(3):264–277, 1999.
[4] ISO/IEC JTC1 IS 14496-2 (MPEG-4). Information technology - generic coding of audio-visual objects (final draft of international standard), Oct. 1998.
[5] ITU-T. Draft H.263 version 2: Video coding for low bit rate communication, Jan. 1998.
[6] W.-J. Kim, J. W. Yi, and S. D. Kim. Quality scalable coding of selected region: Its modeling and H.263-based implementation. Signal Processing: Image Communication, 15:181–188, 1999.
[7] B. Menser and M. Wien. Automatic face detection and tracking for H.263 compatible region-of-interest coding. In Proc. SPIE Image and Video Communications and Processing 2000, volume 3974, pages 882–891, San Jose, California, USA, Jan. 2000.
[8] B. Menser and M. Wien. Segmentation and tracking of facial regions in color image sequences. In Proc. SPIE Visual Communications and Image Processing 2000, volume 4067, pages 731–740, Perth, Australia, June 2000.
[9] B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. IEEE Trans. Patt. Anal. Machine Intell., 19(7):696–710, July 1997.
[10] A. Saber and A. M. Tekalp. Frontal-view face detection and facial feature extraction using color, shape and symmetry based cost functions. Pattern Recognition Letters, 19(8):669–680, June 1998.
[11] P. Salembier, A. Oliveras, and L. Garrido. Anti-extensive connected operators for image and sequence processing. IEEE Trans. Image Processing, 7(4):555–570, 1998.
[12] K. Sobottka and I. Pitas. A novel method for automatic face segmentation, facial feature extraction and tracking. Signal Processing: Image Communication, 12(3):263–281, June 1998.