On selecting an appropriate colour space for skin detection

G. Gomez1, M. Sanchez2, and L. E. Sucar1

1 Dept. of Computing, ITESM-Morelos, A.P. 99-C, 62050, Morelos, México
{gegomez,esucar}@campus.mor.itesm.mx
2 Faculty of Sciences, UAEM, Av. Universidad 1001, Cuernavaca, Morelos, México
[email protected]
Abstract. We present a comprehensive and systematic approach to skin detection. We evaluated each component of several colour models, and then selected a suitable set of components for skin detection. This approach is well known in the machine learning community as attribute selection. After listing the top components, we show that a mixture of colour components can discriminate skin very well in both indoor and outdoor scenes. The space spanned by these components is nearly convex, which allows us to use even simple rules to separate skin from non-skin points. These simple rules can recognise 96% of skin points with just 11% of false positives. This is a data analysis approach that will help many skin detection systems.
1 Introduction
Skin detection is a very important step in many vision systems, such as gesture recognition, hand tracking, video indexing, regions of interest, and face detection (see e.g. [2, 4, 6, 10, 14-17], to name just a few). Pixel-based skin detection can narrow the search space prior to high-level layers. However, this is not an easy task. Skin values can vary with ambient light: colour lamps acting as filters, specularities, shadows, daylight, etc. Moreover, different cameras return different values for the same scene. Finally, other input devices, such as scanners, do not follow the CIE standards closely. Hence, skin detection becomes a cumbersome step, and it is therefore usual to see systems where people wear artificial landmarks or devices. Skin detection has a long tradition in computer vision; however, there has been recent interest in probabilistic methods for skin detection. One popular choice is the Skin Probability Map, or SPM for short [1, 2, 8]. Nevertheless, there is a lack of work addressing which components are more reliable for SPMs. Hence, many SPMs work on raw RGB, HSV, or YCbCr, even though it is recognised that a fixed transformation of RGB values does not affect the overlap between skin and non-skin [8]. We show that a mixture of colour components does achieve a high recognition rate, whilst the overlap between the skin and non-skin clusters is minimised.
2 Skin detection
Although the RGB (Red, Green, Blue) space is quite sensitive to intensity variations, it is widely used [1, 2, 21]. An SPM is a lookup table in which RGB values directly address a voting slot. Two 3D histograms are computed, one for skin and one for non-skin. After dividing every slot by the total count of elements, we obtain the probability associated with an [r, g, b] index. Statistics are derived from these two 3D histograms. The conditional probability that a pixel with given RGB values is skin or non-skin is then:

P(rgb | skin) = Hist_skin[r, g, b] / Total_skin

P(rgb | ~skin) = Hist_non-skin[r, g, b] / Total_non-skin

A new pixel can be labelled as skin if it satisfies a given threshold θ:

P(rgb | skin) / P(rgb | ~skin) ≥ θ

where θ is obtained empirically. Thus, the recognition ratio is a trade-off between reducing false positives and increasing correct skin classification. A recent survey [1] found that 95% skin detection came at the cost of 20% false matches. This high error rate is perhaps a consequence of an initial assumption behind the 3D histograms: that skin points will form a cluster in some colour model, in this case raw RGB. However, the sparseness of skin points and the overlap with non-skin in RGB space are considerable. We argue that there are more reliable axes, along which the overlap is minimised. We can select different axes from different colour models, as we shall see in the next sections.
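As an illustration, the SPM scheme above can be sketched as follows. This is a hedged reconstruction under our own assumptions (the bin size, quantisation, and function names are ours, not those of the cited systems):

```python
import numpy as np

def build_spm(pixels, bins=32):
    """Build a quantised 3D RGB histogram (Skin Probability Map).

    pixels: (N, 3) array of RGB values in [0, 255].
    Returns P(rgb | class) as a normalised 3D histogram.
    """
    idx = (pixels // (256 // bins)).astype(int)   # quantise each channel
    hist = np.zeros((bins, bins, bins))
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return hist / hist.sum()                      # slot count / total count

def classify(pixel, p_skin, p_nonskin, theta=1.0, bins=32):
    """Label a pixel skin iff P(rgb|skin) / P(rgb|~skin) >= theta."""
    r, g, b = (np.asarray(pixel) // (256 // bins)).astype(int)
    num, den = p_skin[r, g, b], p_nonskin[r, g, b]
    if den == 0:                                  # never seen as non-skin
        return num > 0
    return num / den >= theta
```

In practice θ is tuned on held-out data, trading false positives against skin coverage, exactly as described above.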
3 An appropriate colour space
As far as we know, there is no previous work addressing the selection of appropriate colour components for skin detection. Previous work evaluates a single, and thus limited, colour space [1, 6, 18, 21], such as HSV, YCrCb, YUV, RGB, normalised RGB, etc. However, their limited performance would suggest that we are looking at incorrect colour models. Therefore, we changed the common methodology of using a single colour model. We evaluated each component of several colour models, and then selected a set of appropriate colour components for skin detection. We decomposed and analysed each component of the following colour models: HSV, YIQ, RGB-Y, Gaussian Colour Model [5], YES, YUV, CMY, CIE XYZ, RGB, normalised RGB, CIE L*a*b*, CIE L*u*v*, CIE xyY, CIE u'v'w', and YCrCb, as well as some non-linear relations: r/g, r/b, g/b, x/y, x/z, y/z, and (r/(r+g+b) - 1/3)^2 + (g/(r+g+b) - 1/3)^2. We shall refer to the latter as Wr [17]. In addition, we also included Chroma, Hue, and Luminance computed from CIE L*a*b*. All white references were set to (240, 240, 240). Hue from HSV was shifted into the range -30 < H < 330, so that the red range becomes continuous.
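For concreteness, a few of the components listed above can be computed directly from RGB. This is a minimal sketch under our own assumptions (the function name and the handling of degenerate denominators are ours), using E = (red - green)/2 [13] and the Wr relation [17]:

```python
def components(r, g, b):
    """Compute a few of the evaluated colour components from 8-bit RGB."""
    s = r + g + b
    # normalised rg chromaticities; grey fallback when the sum is zero
    nr, ng = (r / s, g / s) if s else (1 / 3, 1 / 3)
    return {
        "E": (r - g) / 2.0,                       # E of YES [13]
        "R/G": r / g if g else float("inf"),      # red/green ratio
        "Wr": (nr - 1 / 3) ** 2 + (ng - 1 / 3) ** 2,  # Wr [17]
    }
```

For example, a reddish pixel (200, 100, 50) gives E = 50 and R/G = 2, while a grey pixel gives E = 0, R/G = 1, and Wr = 0.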
3.1 Datasets
We explored the performance of several colour components on two image types: (i) skin/non-skin indoor, and (ii) skin/non-skin outdoor. We used several daylight and illumination conditions, as well as a number of input sources. These image sets cover more than 2000 people of different races and ages, under a wide range of illumination conditions, from Tungsten lamps (~3200K) to daylight (D65, or 5000K-5500K). We expected a shift towards yellow in indoor scenes (Tungsten), and a slight shift towards blue in outdoor conditions (sunshine). For outdoor scenes, we used both direct sunshine and shadows. A significant difference from [1, 2, 8] is that we do not use skin images from the WWW, because those are mainly scanned, and many scanners do not follow the CIE standards closely. Moreover, there is also no control over the filters and manipulations commonly used in photographic studios. After carefully labelling 33500 image windows1 as skin and non-skin, we applied a statistical test (T-test at 95%) to decide whether or not the indoor and outdoor skin data sets are different. This test shows that there is a significant difference (>5%). Nevertheless, we shall see that, in practice, both data sets behave quite similarly. We took the average of every image window as the input to a classification system2. While it is usual to work with every pixel, we opted for averaging each sample window. Hence, many image artifacts were minimised, such as small spots, lumps, and freckles.

3.2 Selecting variables
For every colour model we analysed each component independently. All colour components were analysed keeping in mind that we do not want to find the most discriminative components, but the complementarity among them. It is important to stress this difference, because one might think of using a standard PCA technique. In PCA one can select variables corresponding to high eigenvalues from the resulting eigenvectors. This analysis does not show complementarity; that is, many "good" variables overlap, and hence their individual contributions are minimal. Conversely, we wish to select a minimal set of complementary variables, maximising recognition and minimising errors. Since a sparse distribution of skin points along a given colour component is not a promising signal, we select colour components with a compact range of skin. Further, two promising variables (with compact ranges in skin) will span a compact 2D space, and so forth. The basic idea is to select variables that give maximal coverage of skin points and minimal overlap with non-skin. Therefore, we start by selecting variables exhibiting a compact range for skin points and minimal overlap with non-skin points. Several tools were applied to select variables, such as CN2 [3], C4.5 [12], and J48 [20]. Other tools, like χ², PCA, and 1-Rule [7], were also used to give a closer treatment to certain variables. Once we have selected one variable, we can focus on the dataset covered by its range. Then, using this reduced dataset, we search for the most discriminant variable again. For both indoor and outdoor scenes, the first selected variable was E of the YES colour model. This variable has a range that preserves 96.98% of skin samples while introducing 22.41% of non-skin samples. These non-skin samples (false matches) will be reduced by adding a new variable. Roughly speaking, we applied the following three steps:

1. Using one promising variable from the pool, create a new dataset by applying its range. The dataset should have a good proportion of true positives and a reduced number of false positives. The false positives will be reduced by looking for a new variable.
2. Compute new ranges for the remaining variables over the current dataset. Select the best variable, i.e. the one that not only reduces false positives, but also covers the maximum number of skin points.
3. Is this new variable significant enough in reducing false positives? If so, apply its range, create a new current dataset, and continue with step two. Otherwise, discard this variable, select the next one from the original pool, and restart from step one.

Notice that at this stage we are working with window averages; however, as we shall see, the results can be extrapolated to single pixels.

1 A typical size of each window is 35x35 pixels. The dataset is freely available by contacting any author.
2 80% for training, and 20% for testing.
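The three-step loop above can be sketched as a greedy search. This is an illustrative reconstruction under our own assumptions, not the CN2/C4.5/J48 pipeline actually used; in particular, the percentile-based range heuristic and the thresholds are ours:

```python
import numpy as np

def skin_range(values, coverage=0.97):
    """A range keeping roughly `coverage` of skin samples (percentile-based)."""
    lo = np.percentile(values, (1 - coverage) / 2 * 100)
    hi = np.percentile(values, (1 + coverage) / 2 * 100)
    return lo, hi

def greedy_select(skin, nonskin, names, max_vars=3, min_gain=0.02):
    """Greedily pick complementary components: each new range must cut
    false positives while keeping nearly all remaining skin samples."""
    chosen = []
    s_mask = np.ones(len(skin), bool)       # skin samples still covered
    n_mask = np.ones(len(nonskin), bool)    # surviving false positives
    for _ in range(max_vars):
        if not n_mask.any():                # no false positives left
            break
        best = None
        for j, name in enumerate(names):
            if name in [c[0] for c in chosen]:
                continue
            lo, hi = skin_range(skin[s_mask, j])
            # fraction of surviving non-skin samples inside the skin range
            fp = ((nonskin[n_mask, j] >= lo) & (nonskin[n_mask, j] <= hi)).mean()
            if best is None or fp < best[3]:
                best = (name, j, (lo, hi), fp)
        if best is None:
            break
        name, j, (lo, hi), fp = best
        if 1 - fp < min_gain:               # barely removes false positives
            break
        chosen.append((name, (lo, hi)))
        s_mask &= (skin[:, j] >= lo) & (skin[:, j] <= hi)
        n_mask &= (nonskin[:, j] >= lo) & (nonskin[:, j] <= hi)
    return chosen
```

On data where one component cleanly separates skin from non-skin and a second overlaps completely, the loop selects only the first, which mirrors the "significant enough" stopping test in step three.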
3.3 An initial two-dimensional space
After finding the E component of YES as the starting variable, we computed the next variable on the narrowed dataset. Our top components among all tests done are shown in Table 1. Despite the large coverage of I, RY, and Cr, these are not complementary variables, and their combination only brings a marginal reward.
Component   Colour space
E           YES
RY          RGB-Y
Cr          YCrCb
R/G         RGB
H           HSV
U           YUV
I           YIQ

Table 1. Top variables among all tests done.
Interestingly, we found two complementary variables which achieve more than 95% recognition with just 10% of false matches on indoor+outdoor samples. Our initial hypothesis was oriented towards finding a three-dimensional colour space; nevertheless, we found that the space spanned by E and R/G exhibits a clustered zone for skin. This cluster has minimal overlap with non-skin, and the skin points share a nearly convex area. Figure 1 depicts a section of this mixed colour map.
[Scatter plot omitted: skin vs. non-skin points; x-axis: E component of YES (0-60); y-axis: Red/Green (1-2.4).]

Fig. 1. A colour space for skin detection in indoor and outdoor scenes. In this space 95% of skin points share a nearly convex area with a minimal overlap to non-skin.
Using machine learning tools like CN2, C4.5, and J48, we computed a couple of rules for this convex area, from which we found that a single range in E and in R/G is good enough to recognise more than 97% of skin with just 10.6% of false matches in indoor+outdoor images. These ranges are easily appreciated in Figure 1. Hence, a final recognition rule can be written as follows:

if (E > 13.4224) and (red/green < 1.7602)
then label := skin
else label := non-skin

where the E component is calculated [13] as E = (red - green)/2. Although it is possible to compute a double range for the E or red/green components, the overall contribution is marginal. Table 2 shows the resulting statistics. Notice the 96% recognition with just 10.2% of false acceptances for both datasets together, although a T-test does not show a similarity at 95%.
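The rule above can be applied to whole images at once. A minimal vectorised sketch (assuming floating-point RGB arrays; the function name is our own):

```python
import numpy as np

def skin_mask_2d(rgb):
    """Pixel-wise skin mask from the two selected components.

    rgb: (..., 3) float array of RGB values.
    Returns a boolean array of the same leading shape.
    """
    r, g = rgb[..., 0], rgb[..., 1]
    E = (r - g) / 2.0                 # E of YES [13]
    # red/green ratio; pixels with g == 0 get +inf and are rejected
    ratio = np.divide(r, g, out=np.full_like(r, np.inf), where=g != 0)
    return (E > 13.4224) & (ratio < 1.7602)
```

For instance, a skin-like pixel (180, 140, 120) satisfies both ranges, while a greenish pixel (100, 110, 90) fails the E range.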
Scenes    Recognition   Error
Indoor    97.95%        10.65%
Outdoor   93.74%        9.22%
Both      96.59%        10.20%

Table 2. Results using the [E, red/green] colour space for skin detection.
This recognition rate requires neither storing two 3D SPMs nor fitting undesirable thresholds. However, some skin detection systems need to recognise scenes which are not taken directly from videotapes or digital cameras. This is the case for WWW indexing, digital libraries, etc., where many images come from scanners. The 2D colour space was tested on some thousands of images from the WWW (details are given in the next section), and we found an unacceptable error rate of 30%. In order to cope with this unconstrained scenario, we extended the computed colour space to three variables.

3.4 A three-dimensional space
A natural extension of this colour space is to take into account the next complementary component. In this case we found that Hue of HSV is complementary to E and red/green. Not surprisingly, this new skin cluster still resembles a nearly convex area. The same ranges on E and red/green remain valid. For the new variable, we found the following range for skin: H < 23.89. Table 3 shows recognition results for this new [E, red/green, H] colour space. We exposed this new colour space to a more unconstrained data set. We labelled 5000 small non-skin images from the WWW. The new data set includes many "noisy" images among the most difficult ones [8] for skin detectors, such as wood, flowers, and desert. We preserved our original set of skin images. The "challenge" set was then indoor + outdoor + "noise". Figure 2 depicts a section of the challenge set, in which 95% of skin points remain together, but the cluster is also more sparse and introduces 16.91% of false matches.

Scenes      Recognition   Error
Indoor      95.58%        8.76%
Outdoor     93.74%        8.72%
Challenge   95%           16.91%

Table 3. Colour space [E, red/green, H] for skin detection.
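A per-pixel sketch of the extended [E, red/green, H] rule, including the shifted Hue described in Section 3. The use of Python's colorsys for the Hue computation, and the function names, are our own choices:

```python
import colorsys

def shifted_hue(r, g, b):
    """Hue in degrees, shifted into (-30, 330) so the red range is continuous."""
    h = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)[0] * 360.0
    return h - 360.0 if h >= 330.0 else h

def is_skin_3d(r, g, b):
    """[E, red/green, H] rule: the 2D ranges plus H < 23.89."""
    E = (r - g) / 2.0
    if not (E > 13.4224 and g > 0 and r / g < 1.7602):
        return False
    return shifted_hue(r, g, b) < 23.89
```

The third component rejects pixels that pass the 2D ranges but lie far from the red hue region, e.g. a purplish pixel such as (180, 140, 200).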
Notice again that it is not necessary to store the SPMs for skin and non-skin, nor to fit empirical thresholds, and the method is comparably fast, since it takes just some basic arithmetic and conditional statements. However, with these new axes we
[3D scatter plot omitted: skin vs. non-skin points; axes: E component of YES (0-60), Red/Green (1-2.4), and H component of HSV (-10-30).]

Fig. 2. A section of the [E, red/green, H] colour space for skin detection. 95% of skin points share a nearly convex area with a minimal overlap to non-skin.
can compute SPMs, and we expect good results, since skin and non-skin are not as overlapped as in raw RGB.
4 Final comments
We can now summarise the appropriateness of the synthetic colour space. The colour space was obtained after averaging thousands of small windows of indoor and outdoor scenes. Despite this, we applied the same space to all pixels of the data set. Before doing so, the ranges were refitted to cope with more variability. Since E and red/green are based on the same two channels (red and green), both components can be expressed together as 20 + green < red < 1.7602 * green, which is in agreement with other works [11, 19]. The H component now has the range -17.4545 < H < 26.6666. The final recognition rate was 96% of skin points, with 11% of false positives. This yields an overall recognition3 rate of 95%. A 95% recognition rate on 9.8 million pixels in 5.4 seconds4 can be good enough for many applications in computer vision and robotics (e.g. gestures, tracking, face detection, etc.) where the primary source is a camera or videotape. Figure 3 shows several illumination conditions and the corresponding pixel-based segmentation using these values. Notice that the segmentation is not sparse, and it returns good results in both indoor and outdoor scenes. Because Hue is not the fastest choice to compute, we have explored other components. Preliminary results are shown in Figure 4, where we show faster alternatives

3 True negatives and positives divided by the total number of points.
4 A standard 1 GHz PC.
Fig. 3. Examples of skin detection. (a) Indoor with poor illumination. (b) Outdoor under sunshine. (c) Outdoor with daylight. (d) Indoor with good illumination.
to Hue. However, the analysis of other possible colour spaces is far beyond the scope of this paper. Nevertheless, the inclusion of Hue is highly recommended as a trade-off between recognition rate and running time.
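The refitted per-pixel rule summarised above (20 + green < red < 1.7602 * green, together with -17.4545 < H < 26.6666) can be sketched as follows. This is a hedged illustration; colorsys for the Hue computation and the function names are our own choices:

```python
import colorsys

def shifted_hue(r, g, b):
    """Hue in degrees, shifted into (-30, 330) so the red range is continuous."""
    h = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)[0] * 360.0
    return h - 360.0 if h >= 330.0 else h

def is_skin_final(r, g, b):
    """Refitted rule: 20 + g < r < 1.7602 * g and -17.4545 < H < 26.6666."""
    if not (20 + g < r < 1.7602 * g):      # combined E and red/green ranges
        return False
    return -17.4545 < shifted_hue(r, g, b) < 26.6666
```

Note that the combined red/green condition folds the E threshold and the ratio test into two comparisons per pixel, which is what makes the 5.4-second figure on 9.8 million pixels plausible.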
[Diagram omitted: mixed colour components, showing complementary pairings among E (YES), Hue (HSV), R/G and R/B (RGB), Wr, y (xyY), U (YUV), Gy (RGB-Y), a (L*a*b*), and Cb (YCbCr).]

Fig. 4. Preliminary results show a more complete view of complementary variables.
Our findings should not be extrapolated to images from scanners or WWW sites, since the colour space was built for a different purpose. One could argue that this colour space does not mean a thing in terms of colour science, i.e., colloquially, that it says nothing about perceptual stimulus. Nonetheless, we have shown a methodology which can be used to create better colour spaces for pattern recognition. Research involving SPMs will also find this methodology useful, since it is not necessary to change the technique, but only the axes of the histograms. These colour components are just opening the door to far more complex recognition procedures.
5 Conclusions and future work
We found that no single standard colour model is well suited for pixel-based skin detection in indoor and outdoor scenes. Further, it has been argued [8] that there is no fixed transformation of RGB that outperforms the original RGB space. Therefore, we explored a mixture of colour components, taking their complementarity into account. This approach is well known in the machine learning community as attribute selection. As an example, we have shown a reliable colour space for skin detection in indoor and outdoor images. This colour space has three axes: E of YES, the ratio red/green, and H from HSV. This space has shown low susceptibility to noise from unconstrained sources, illumination conditions, and cameras. We found an error rate, recognition rate, and running time which are quite competitive with any previous work in the area.
References

1. J. Brand, J. S. Mason. A comparative assessment of three approaches to pixel-level human skin-detection. Proc. of the ICPR, vol. I, pp. 1056-1059, 2000.
2. J. Brand, J. S. Mason, M. Roach, M. Pawlewski. Enhancing face detection in colour images using a skin probability map. Proc. of the Int. Conf. on Intelligent Multimedia, Video and Speech Processing, pp. 344-347, 2001.
3. P. Clark, R. Boswell. Rule induction with CN2. In Y. Kodratoff, ed., Machine Learning - EWSL-91, pp. 151-163, Springer-Verlag, Berlin, 1991.
4. M. Fleck, D. A. Forsyth, C. Bregler. Finding naked people. Proc. of the ECCV, vol. II, pp. 592-602, 1996.
5. J. M. Geusebroek, R. van den Boomgaard, A. W. M. Smeulders, A. Dev. Color and scale: The spatial structure of color images. ECCV, LNCS 1842, pp. 331-341, 2000.
6. E. Hjelmås, B. K. Low. Face detection: A survey. CV&IU, vol. 83(3), pp. 236-274, Sept. 2001.
7. R. C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, vol. 11, pp. 63-91, 1993.
8. M. J. Jones, J. Rehg. Statistical color models with applications to skin detection. Proc. of the CVPR, vol. I, pp. 274-280, 1999.
9. L. Jordão, M. Perrone, J. P. Costeira. Active face and feature tracking. Proc. of the Int. Conf. on Image Analysis and Processing, pp. 572-576, 1999.
10. V. P. Kumar, T. Poggio. Learning-based approach to real time tracking and analysis of faces. Automatic Face and Gesture Recognition, pp. 96-101, 2000.
11. A. Ogihara, A. Shintani, S. Takamatsu, S. Igawa. Speech recognition based on the fusion of visual and auditory information using full-frame color images. IEICE Trans. on Fundamentals, pp. 1836-1840, 1996.
12. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
13. E. Saber, A. M. Tekalp, R. Eschbach, K. Knox. Automatic image annotation using adaptive color classification. Graphical Models and Image Processing, vol. 58, pp. 115-126, 1996.
14. E. Saber, A. M. Tekalp. Frontal-view face detection and facial feature extraction using color, shape and symmetry based cost functions. Pattern Recognition Letters, vol. 19, pp. 669-680, 1998.
15. L. Sigal, S. Sclaroff, V. Athitsos. Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. Proc. of the CVPR, vol. II, pp. 152-159, 2000.
16. K. Sobottka, I. Pitas. Segmentation and tracking of faces in color images. Automatic Face and Gesture Recognition, pp. 236-241, 1996.
17. M. Soriano, B. Martinkauppi, S. Huovinen, M. Laaksonen. Skin detection in video under changing illumination conditions. Proc. of the ICPR, vol. I, pp. 839-842, 2000.
18. J.-C. Terrillon, M. Shirazi, H. Fukamachi, S. Akamatsu. Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images. Automatic Face and Gesture Recognition, pp. 54-61, 2000.
19. T. Wark, S. Sridharan. A syntactic approach to automatic lip feature extraction for speaker identification. ICASSP, pp. 3693, 1998.
20. I. H. Witten, E. Frank. Data Mining. Morgan Kaufmann, 1999.
21. B. Zarit, B. J. Super, F. K. H. Quek. Comparison of five color models in skin pixel classification. Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pp. 58-63, 1999.