Robust and Efficient Multipose Face Detection Using Skin Color Segmentation

Murad Al Haj, Andrew D. Bagdanov, Jordi González, and Xavier F. Roca

Departament de Ciències de la Computació and Computer Vision Center, Campus UAB, Edifici O, 08193, Bellaterra, Spain
{malhaj,bagdanov}@cvc.uab.es, {Jordi.Gonzalez,xavier.roca}@uab.es

Abstract. In this paper we describe an efficient technique for detecting faces in arbitrary images and video sequences. The approach is based on segmentation of images or video frames into skin-colored blobs using a pixel-based heuristic. Scale- and translation-invariant features are then computed from these segmented blobs and used to perform statistical discrimination between face and non-face classes. We train and evaluate our method on a standard, publicly available database of face images and analyze its performance over a range of statistical pattern classifiers. The generalization of our approach is illustrated by testing on an independent sequence of frames containing many faces and non-faces. These experiments indicate that our proposed approach obtains false positive rates comparable to more complex, state-of-the-art techniques, and that it generalizes better to new data. Furthermore, the use of skin blobs and invariant features requires fewer training samples, since significantly fewer non-face candidate regions must be considered compared to AdaBoost-based approaches.

Keywords: Multiview Face Detection, View-Invariant Face Features Classification.

1 Introduction

Face detection is a crucial step in many applications in computer vision, including automatic face recognition, video surveillance, human-computer interfaces, and expression recognition. Face detection also plays an important role in video coding, where data must be transmitted in real time over low-bandwidth networks [3]. The better a face detection system performs, the less post-processing is needed and the better all of the previously mentioned applications will perform. A survey of face detection is presented in [11], where the different methods are classified into four, sometimes overlapping, categories: knowledge-based methods, feature invariant approaches, template matching methods, and appearance-based methods.

H. Araujo et al. (Eds.): IbPRIA 2009, LNCS 5524, pp. 152–159, 2009. © Springer-Verlag Berlin Heidelberg 2009

However, face detection is an expensive search problem. To properly localize a face in an image, all regions should be searched at varying scales; usually, a sliding window approach is implemented to classify the regions of an image at different scales. This naturally produces many more windows corresponding to background objects than windows corresponding to actual face positions: the ratio of non-face regions to actual face regions can be as high as 100,000:1. This high ratio calls for a very well trained classifier that produces a low number of false positives, especially since face detection, as mentioned earlier, is a first step in many applications that will suffer from those false positives. Furthermore, intra-class variations of non-rigid objects like faces, caused mainly by changes in expression and pose, make detection a challenging problem. While classifiers aimed at frontal faces work well, training a classifier to detect multipose faces is not an easy task.

Frontal face detection has been extensively studied. Early work includes that of Sung and Poggio [6], where the difference between a local image pattern and a distribution-based model was computed and used to build a classifier. Papageorgiou et al. used an over-complete wavelet representation of faces to achieve detection [4]. Today, the most commonly used techniques for frontal upright face detection are that of Rowley et al. [7], which uses neural networks, and that of Viola and Jones [9], which uses Haar-like features combined with AdaBoost. Some attempts have been made to extend the Viola and Jones method to multiview faces, such as [2]. However, fast multiview face detection is still an open problem, since existing methods are either computationally expensive, as in the previously mentioned paper, or produce a large number of false positives [10], or both. In this paper, an efficient and robust face detection method is presented. Our method starts by detecting skin blobs; a small number of invariant features is then extracted from them, and those features are used to learn a statistical model of faces vs. non-faces.
The rest of the paper is organized as follows: Section 2 introduces our skin color model, Section 3 describes the invariant features, Section 4 presents the experimental results of training and detection, and Section 5 discusses concluding remarks.
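To illustrate the cost of the exhaustive sliding-window search that our blob-based approach avoids, the candidate windows can be counted with a short sketch; the window size, stride, and scale factor below are illustrative assumptions, not values from any particular detector:

```python
def count_windows(width, height, w0=24, stride=4, scale=1.25):
    """Count the candidate windows a sliding-window detector must classify
    when scanning an image at all positions and scales.
    w0, stride, and scale are illustrative choices, not fixed by any method."""
    total, w = 0, w0
    while w <= min(width, height):
        nx = (width - w) // stride + 1   # horizontal positions at this scale
        ny = (height - w) // stride + 1  # vertical positions at this scale
        total += nx * ny
        w = int(w * scale)               # next (coarser) scale
    return total
```

For a 640 x 480 frame this already yields on the order of a hundred thousand windows, which is how a handful of true faces comes to be outnumbered by non-face windows at ratios near 100,000:1.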

2 Skin Color Model

Skin is an effective cue for face detection, since color is highly invariant to geometric variations of the face and allows fast processing. By eliminating non-skin regions, the search space is reduced significantly. To model skin color, around 800,000 skin pixels were sampled from 64 different images of people of different ethnicities and under various lighting conditions. Studying the distribution of these pixels, we concluded that a pixel with values (R, G, B) should be classified as skin if:

R > 75 and 20 < R − G < 90 and R/G < 2.5

The advantages of this method are its simplicity and computational efficiency, especially since no colorspace transformation is needed. The distribution of R is shown in Fig. 1(a), that of R − G in Fig. 1(b), and that of R/G in Fig. 1(c). Our experiments revealed that 96.9% of the skin pixels have an R value greater than 75 (Fig. 1(a)); 94.6% have R − G between 20 and 90 (Fig. 1(b)), which supports the observation in [1]; and 98.7% have R/G less than 2.5 (Fig. 1(c)).

Fig. 1. (a) The distribution of R values in skin pixels. (b) The distribution of R − G values in skin pixels. (c) The distribution of R/G in skin pixels.

After skin-colored blobs are detected, a Sobel edge detector is applied in an attempt to separate any face blob from a background blob of similar color. A dilation process is then employed to further separate the edges. An example of this segmentation process is shown in Fig. 2: Fig. 2(a) shows the original image, and Fig. 2(b) shows the skin-segmented image after edge detection and dilation are applied.

Fig. 2. The original image and its corresponding skin segmented blobs
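As a sketch, the rule above can be applied to a whole image in a vectorized way (NumPy is assumed; the function name is ours, not from the paper):

```python
import numpy as np

def skin_mask(rgb):
    """Classify each pixel of an H x W x 3 RGB image as skin using the rule
    R > 75, 20 < R - G < 90, and R/G < 2.5."""
    r = rgb[..., 0].astype(np.int32)  # widen to avoid uint8 underflow in R - G
    g = rgb[..., 1].astype(np.int32)
    ratio = r / np.maximum(g, 1)      # guard against division by zero
    return (r > 75) & (r - g > 20) & (r - g < 90) & (ratio < 2.5)
```

The resulting binary mask would then be cleaned with edge detection and dilation before connected skin blobs are extracted.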

3 Invariant Features

Once the skin blobs are detected, a small number of view-invariant, scale-independent features is extracted from each. Unlike the Viola and Jones method, where hundreds of Haar-like features are used for frontal face detection alone, we use 16 features for multipose face detection. These features are listed below.

Aspect Ratio: The ratio of the longer side of the bounding box to the shorter.

Eccentricity: Fitting the blob with the ellipse that has the same second moments as the blob, eccentricity is defined as the ratio of the distance between the foci of this ellipse to the length of its major axis.

Euler Number: The number of objects in the region minus the number of holes in those objects.

Extent: The area of the blob divided by the area of the bounding box.

Orientation: The angle between the x-axis and the major axis of the ellipse having the same second moments as the blob.

Solidity: The proportion of the pixels in the convex hull that are also in the region.

Roundness: The ratio of the minor axis of the ellipse to its major axis.

Centroid Position: Two features encoding the distance between the centroid of the blob and the center of its bounding box, in the x and y directions respectively.

Hu Moments: Hu moments are very useful for our problem since they are invariant under translation, rotation, and changes in scale. They are computed from the normalized central moments η_pq up to order three as follows:

I1 = η20 + η02
I2 = (η20 − η02)² + 4η11²
I3 = (η30 − 3η12)² + (3η21 − η03)²
I4 = (η30 + η12)² + (η21 + η03)²
I5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] + (3η21 − η03)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
I6 = (η20 − η02)[(η30 + η12)² − (η21 + η03)²] + 4η11(η30 + η12)(η21 + η03)
I7 = (3η21 − η03)(η30 + η12)[(η30 + η12)² − 3(η21 + η03)²] − (η30 − 3η12)(η21 + η03)[3(η30 + η12)² − (η21 + η03)²]
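A minimal NumPy sketch of these seven invariants, computed directly from a binary blob mask (our own helper, not the authors' code), may clarify how the normalized central moments enter:

```python
import numpy as np

def hu_moments(mask):
    """Compute the seven Hu invariant moments of a binary blob mask."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))              # blob area (zeroth-order moment)
    xc, yc = xs.mean(), ys.mean()     # centroid

    def eta(p, q):
        # scale-normalized central moment: mu_pq / m00^(1 + (p+q)/2)
        mu = (((xs - xc) ** p) * ((ys - yc) ** q)).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    s, d = n30 + n12, n21 + n03       # recurring sums in I4..I7
    return np.array([
        n20 + n02,                                            # I1
        (n20 - n02) ** 2 + 4 * n11 ** 2,                      # I2
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,          # I3
        s ** 2 + d ** 2,                                      # I4
        (n30 - 3 * n12) * s * (s ** 2 - 3 * d ** 2)
            + (3 * n21 - n03) * d * (3 * s ** 2 - d ** 2),    # I5
        (n20 - n02) * (s ** 2 - d ** 2) + 4 * n11 * s * d,    # I6
        (3 * n21 - n03) * s * (s ** 2 - 3 * d ** 2)
            - (n30 - 3 * n12) * d * (3 * s ** 2 - d ** 2),    # I7
    ])
```

Because only normalized central moments enter, translating or uniformly scaling the blob leaves the values unchanged, and rotations leave all seven unchanged as well (I7 flips sign only under reflection).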

4 Experimental Results

Having generated the corresponding features for each skin-color blob in an image, what remains is to classify which of these blobs correspond to faces. For this purpose, several different classifiers have been evaluated. The first part of this section describes the training process, in which four classifiers were trained on faces from the CVL database; the second part shows classification results on an independent sequence we recorded, in which the face varies in pose and expression.

4.1 Results on CVL Database

The face images used in this section were provided by the Computer Vision Laboratory (CVL), University of Ljubljana, Slovenia [5,8]. The CVL database contains color images of people under seven different poses, with variations in expression. The database contains around 800 faces (positive examples for the face class). For the non-face class, many images that do not contain any faces but do contain skin-colored regions were collected from various sources; a total of around 1600 non-face blobs were detected in this independent set. Examples of face blobs and non-face blobs are shown in Fig. 3, where it can be noted that the negative examples contain body parts as well as non-body regions.

Fig. 3. (a) A sample face from the CVL database along with the segmented face blob. (b) Negative examples corresponding to body parts. (c) Negative examples not corresponding to body parts.

Fig. 4. (a) The learning curves. (b) The ROC curves.

We generated the invariant features for the face and non-face blobs and used them to train several classifiers. The classifiers we tested are the following:

– Linear Bayes Classifier: minimum Bayes risk classifier assuming normally distributed classes and equal covariance matrices.
– Quadratic Bayes Classifier: minimum Bayes risk classifier assuming normally distributed classes, separating the classes by a quadratic surface.
– K Nearest Neighbor: classifies objects based on the closest training examples in the feature space; an object is assigned by majority vote to the class most common among its k nearest neighbors. In our experiments, k was set to 3.
– Decision Tree Classifier: a hierarchical classifier which compares the data against a range of automatically selected features, where the selection of features is determined from an assessment of the spectral distributions or separability of the classes.
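As a sketch of the K Nearest Neighbor step, assuming the 16-dimensional feature vectors have already been extracted (the function name and data layout are ours):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Label a feature vector x by majority vote among its k nearest
    training examples under Euclidean distance (the paper uses k = 3)."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every training blob
    nearest = y_train[np.argsort(dists)[:k]]     # labels of the k closest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]             # majority label
```

The Bayes and decision tree classifiers would plug into the same interface: a function from one feature vector to a face / non-face label.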


The learning curves for the different classifiers are shown in Fig. 4(a) on a semi-log scale. The learning curves were computed by varying the size of the training set; for each training set size, the classification error was estimated by averaging over 30 trials (i.e., 30 randomly selected training sets of that size). It should be noted that at the end of the learning process, the error for the Decision Tree and K Nearest Neighbor classifiers is less than 10%. The Linear and Quadratic Bayes Classifiers behave differently: the error decreases to a certain point and then starts increasing, probably due to overfitting. The ROC curves are shown in Fig. 4(b), where half of the data was used for training and the other half for testing. It can be seen that low error rates are obtained while maintaining a reasonably low false-positive rate, and that the Decision Tree classifier has the highest precision among the four.

4.2 Classification on a Video Sequence

In order to test the classification rate of our classifiers, we recorded a sequence in which the face of the agent varies in pose and expression. We generated the blobs from the sequence using our skin color segmentation method; in this process, all small blobs whose area is less than 2% of the image area were dropped. The sequence contained 541 face blobs and 2012 non-face blobs. The classifiers, which were trained on the faces from the CVL database and the non-faces we collected, were used to classify these blobs. The results of the various classifiers on these blobs are shown in Table 1.

Table 1. Classification Results

               Face Detection Rate   Non-Face Detection Rate   Overall Detection Rate
LBC            97.23%                87.38%                    89.46%
QBC            100%                  46.17%                    57.58%
kNN            72.83%                94.04%                    89.54%
Decision Tree  95.75%                89.76%                    91.03%
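The overall rates in Table 1 are consistent with a blob-count-weighted average of the two per-class rates; a small sanity check, using the blob counts stated above:

```python
def overall_rate(face_rate, nonface_rate, n_face=541, n_nonface=2012):
    """Overall detection rate as the blob-count-weighted average of the
    face and non-face detection rates (541 face blobs, 2012 non-face blobs)."""
    return (n_face * face_rate + n_nonface * nonface_rate) / (n_face + n_nonface)
```

For the Decision Tree, for example, overall_rate(0.9575, 0.8976) gives approximately 0.9103, matching the 91.03% reported in Table 1.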

Fig. 5. The output of the Viola and Jones detector on one of our images

It is important to mention that we were able to achieve a high classification rate despite the fact that the testing sequence is completely unrelated to the training samples. The classifier with the best precision turned out to be the Decision Tree, as suggested by the ROC curve in Fig. 4(b). To compare with the Viola and Jones method, we ran their detector on our sequence: it detected only 386 of the 541 faces (around 69%), with a huge number of false positives (470 in 541 images). A sample output image of the Viola and Jones detector is shown in Fig. 5, while some experimental results of the Decision Tree classifier are shown in Fig. 6.

Fig. 6. Some of the experimental results of the decision tree classifier

5 Conclusions

In this paper we described an efficient and robust technique for face detection in arbitrary images and video sequences. Our approach is based on segmentation of images into skin-colored blobs using the pixel-based heuristic described in Section 2. Identified skin blobs are then analyzed to extract scale- and view-invariant features, which are used to train a statistical pattern classifier to discriminate face blobs from non-face blobs. The use of scale-invariant features, particularly the Hu moments characterizing the spatial distribution of skin pixels in a candidate blob, is key to the success of our algorithm. The use of invariant features on segmented blobs obviates the need to scan all possible regions of an image at multiple scales looking for candidate faces. This results in a more efficient algorithm and reduces the need to use a weak classifier with a vanishingly small false-positive rate, as is required by AdaBoost approaches.

A strong advantage of our approach is its ability to generalize to new data. The Viola and Jones face detector is known to be highly sensitive to the illumination and imaging conditions under which it is trained, and consequently it often does not generalize well to new situations without retraining. Our experiments show that with our invariant feature space representation of skin blobs, we can train a classifier on one dataset and accurately detect faces in an independent sequence it has never seen before (see Table 1). The decision tree and k-NN classifiers consistently performed best, which is not surprising given the limited training data and their known optimality. The Bayes classifiers succumb to overtraining before reaching respectable classification rates; they would be preferable due to their simplicity, and we plan to address this problem by investigating feature reduction techniques in the near future. Also of interest is the uncorrelated performance of the four classifiers in Table 1, which indicates that classifier combination could be used to effectively improve overall accuracy. Finally, the skin detection algorithm described in this work will be improved to further reduce the number of false skin blob detections.

Acknowledgements This work is supported by EC grants IST-027110 for the HERMES project and IST-045547 for the VIDI-video project, and by the Spanish MEC under projects TIN2006-14606 and CONSOLIDER-INGENIO 2010 MIPRCV CSD2007-00018.

References

1. Al-Shehri, S.A.: A simple and novel method for skin detection and face locating and tracking. In: Masoodian, M., Jones, S., Rogers, B. (eds.) APCHI 2004. LNCS, vol. 3101, pp. 1–8. Springer, Heidelberg (2004)
2. Huang, C., Ai, H., Wu, B., Lao, S.: Boosting nested cascade detector for multiview face detection. In: Proceedings of the 17th Intl. Conf. on Pattern Recognition (ICPR 2004), vol. 2, pp. 415–418 (August 2004)
3. Menser, B., Brunig, M.: Face detection and tracking for video coding applications. In: The 34th Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 49–53. IEEE, Los Alamitos (2000)
4. Papageorgiou, C.P., Oren, M., Poggio, T.: A general framework for object detection. In: Proceedings of the 6th International Conference on Computer Vision, vol. 2, pp. 555–562 (January 1998)
5. Peer, P.: CVL face database, http://www.lrv.fri.uni-lj.si/facedb.html
6. Poggio, T., Sung, K.K.: Example-based learning for view-based human face detection. In: Proceedings of the ARPA Image Understanding Workshop, vol. 2, pp. 843–850 (1994)
7. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 20, 22–38 (1998)
8. Solina, F., Peers, P., Batagelj, B., Juvan, S., Kovac, J.: Color-based face detection in the '15 seconds of fame' art installation. In: Mirage 2003, Conference on Computer Vision / Computer Graphics Collaboration for Model-Based Imaging, Rendering, Image Analysis and Graphical Special Effects, INRIA Rocquencourt, France, March 10–11, pp. 38–47 (2003)
9. Viola, P., Jones, M.: Robust real-time object detection. In: Proceedings of the IEEE Workshop on Statistical and Computational Theories of Vision (July 2001)
10. Wang, P., Ji, Q.: Multi-view face detection under complex scene based on combined SVMs. In: Proceedings of the 17th Intl. Conf. on Pattern Recognition (ICPR 2004), vol. 4, pp. 179–182 (2004)
11. Yang, M.-H., Kriegman, D.J., Ahuja, N.: Detecting faces in images: A survey. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 24(1), 34–58 (2002)