Lip Contour Segmentation Using Kernel Methods and Level Sets
A. Khan, W. Christmas, and J. Kittler
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK
{a.khan,w.christmas,j.kittler}@surrey.ac.uk
Abstract. This paper proposes a novel method for segmenting lips from face images or video sequences. A non-linear learning method in the form of an SVM classifier is trained to recognise lip colour over a variety of faces. The pixel-level information that the trained classifier outputs is integrated effectively by minimising an energy functional using level set methods, which yields the lip contour(s). The method works over a wide variety of face types, and can elegantly deal with both the case where the subjects’ mouths are open and the mouth contour is prominent, and with the closed mouth case where the mouth contour is not visible.
1 Introduction
The ability to extract a person’s lips from an image or image sequence has some useful and interesting applications. Bi-modal speech recognition is one: according to [1], speech perception and understanding in humans is not just an auditory activity, but also employs visual information conveyed through the associated lip and mouth movement of the speaker. Even in machine-based speech understanding, visual cues extracted from lip shape can serve as a complementary source of information that can help bolster the speech recognition process when combined with the auditory information. The appearance of the lips also conveys valuable information about a person’s facial expression. Applications include expression/emotion recognition and character-driven animation. This paper proposes a novel method of segmenting/tracking lip shapes in images/videos of a person’s face.
Lip extraction has generally been performed using some sort of flexible lip shape representation, such as deformable templates [2], active contours [3] or active shape models [4]. Generally, in a sequence consisting of a person speaking into the camera, it will be observed that the shape of the lips exhibits some dominant modes of variation. Therefore some of these methods try to impose a strong prior on the set of possible shapes that these contours or templates can take. This can be achieved, for instance, by using a parametric shape model with limited degrees of freedom that can model these variations with sufficient accuracy, or by learning the shape statistics and constraining shape deformations accordingly [5], [6].
G. Bebis et al. (Eds.): ISVC 2007, Part II, LNCS 4842, pp. 86–95, 2007. © Springer-Verlag Berlin Heidelberg 2007
Another issue to be considered is how to model both the inner and outer lip contour, given that the mouth of a speaking person opens and closes, affecting the visibility and prominence of the inner lip boundary. Methods differ in whether they model only the outer boundary of the lip or both the inner and outer contours, or whether the lip model is switched depending on whether the mouth is ascertained to be open or closed [7].
In the method presented here, the level set method is used to represent the lip shape. Level sets provide an energy minimisation framework that has been used extensively to model shapes and object boundaries in the computer vision community. This method facilitates incorporation of region information at an image pixel level while still enforcing global notions such as boundary smoothness. An additional benefit of this method in this particular application is that, because of its unique behaviour with regard to topological properties of shapes, the problem of modelling both the inner and outer lip contour is elegantly solved.
To drive the shape fitting process to the actual in-image lips, some sort of colour and/or gradient information (of the lip region and lip boundaries respectively) is used. When gradient-based features are used, the segmentation process needs to be supported by a strong shape model, as the contrast between the lip region and the skin often tends to be weak. When using colour information, the assumption is that in a face image the colour distribution of lips is sufficiently different from everything else (primarily skin, but also other facial features and hair). This assumption can be relaxed somewhat (as in this paper) by allowing a non-negligible degree of overlap between the colour distributions of lip and skin.
Methods that rely on learning to distinguish lip and non-lip regions make the implicit assumption that, over a large set of face images, the differences between the lip and non-lip feature observations can be characterised across the different races or skin types.
The rest of this paper is organised as follows: Section 2 gives a detailed description of the approach used in this work. Section 3 presents some visual results and a numerical characterisation of their quality. Section 4 concludes the paper.
2 The Method
The primary task to be achieved is the segmentation of the lip shape in a face image. For shape segmentation, the level set approach is a popular variant of active contour methods among the computer vision community, and is the one used in this work. In one of its basic formulations, level sets facilitate the segmentation of an image or image sequence into two regions, separated by a boundary over which some smoothness constraint has been exerted, such that the homogeneity of some property in each region is maximised. The property used depends on the particular application, and may be one, or a combination, of various cues (such as intensity, texture, optical flow, etc.) but is basically a scalar function defined over the image domain. To define this function, the method presented in this paper takes a machine-learning approach; a support vector machine classifier,
suitably trained over a set of colour-based features, defines at a pixel-by-pixel level some notion of the likelihood of the class label (whether lip or non-lip). This effectively allows the conversion of a multidimensional feature vector (based on RGB colour components) to a scalar quantity which represents a “soft” decision regarding whether a particular pixel looks like lip or non-lip. The level set method allows integrating this information at a more global level, by yielding an entire lip region rather than a pixel-level classification.
A practical overview of the method is as follows: a number of example face images with the lip region demarcated are used to generate a training set of feature vectors and class labels (lip, non-lip), which are fed as training examples to an SVM (support vector machine) classifier. Once the classifier has been trained, a new face image can be presented to the system for lip extraction, and the SVM output is treated as a function over which an appropriate energy minimisation problem is defined. This problem is solved by means of gradient descent on the level set function. The solution of this problem generally yields the lip contour(s). The details of the method are given in the following subsections.
2.1 Choice of Classification Method
It is observed that in general it is not possible to linearly separate skin and lip colours when colour is represented in RGB. Several non-linear colour spaces (i.e. obtained as non-linear transformations of the RGB space) have been proposed that purport to give better results for distinguishing lip and skin pixels [8]. This problem becomes even more serious when considering the distribution of skin and lip colour over the range of races and skin types, and it may well be impossible to distinguish adequately between these two classes by using any particular colour space.
Given the non-linear nature of the classification problem, in this method we chose to employ a support vector machine-based classifier with a radial basis kernel. The first advantage of using SVMs is that, for colour-based features, we can justifiably dispense with the task of having to find a new colour space to represent the data, as the non-linearity built into the SVM classifier should be able to cope. Hence RGB can be retained as the colour space. Further, it is possible to cope with variations of skin and lip colour, and their correlation with race or type of skin, by including the necessary features that bring out these characteristics. Another important characteristic of the SVM is that the classification process yields not a hard binary decision about which class a particular observation falls into, but rather a decision function that is a continuous function of the feature variables. This functional characteristic makes the SVM output usable by the next stage of the process, namely level set method-based evolution. This will be discussed in more detail later.
The training set was composed of tightly cropped face images (although it is acceptable to have a fair bit of background, as long as the skin region dominates the image area) which were subjectively chosen to be representative of the majority of the images that would be used to test the system.
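For reference, the continuous decision function mentioned above can be written in the standard form for a kernel SVM with a radial basis kernel. The symbols $x_i$ (support vectors), $y_i \in \{-1, +1\}$ (their class labels), $\alpha_i$ (learned weights), $b$ (bias), $N$ (number of support vectors) and $\gamma$ (kernel width) are conventional SVM notation and are not taken from the original text:
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b, \qquad K(x_i, x) = \exp\left(-\gamma \lVert x_i - x \rVert^2\right)$$
The sign of $f(x)$ gives the hard class label, while $f(x)$ itself varies continuously with the feature vector $x$, which is the property exploited by the level set stage.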
A 6-dimensional feature vector was chosen for classification: three of its components were the R, G and B intensities of the pixel, and the other three were the medians of the individual R, G and B components of the training image, referred to from here onwards as the “dominant colour” of the image. It is this dominant image colour that brings out the differences between the various types of faces. Although, strictly speaking, colour elements cannot be ordered (because of their multidimensionality), it is quite reasonable to think that colours that are perceptually near have small Euclidean distances between them in the RGB space, and hence that the median of the individual colour components does in fact capture the representative colour of the face (i.e. skin colour). Simple tests performed on the images bear this out.
Once the classifier has been trained, a test image is fed to the system, which outputs the value of the SVM decision function for each pixel of the image. The sign of the decision function denotes the class label assigned to the pixel by the binary classifier, and in many SVM-based applications that is all the information that is used. However, the absolute value of the decision function (which is a real number) is actually a measure of the confidence of the classification decision, i.e. a larger absolute value indicates a higher likelihood that the classification is correct. “Borderline” cases have values close to zero. The decision function map is used in defining the lip segmentation energy functional.
2.2 The Energy Functional and Its Level Set-Based Minimisation
To extract and represent the lip shape, a level-set shape model [9] is used. Level sets provide an implicit representation of shape that is free of parametrisation (and the problems associated with it). For instance, they can handle sharp discontinuities in the shape, such as at the corners of the mouth where the curvature of the lip shape is quite high. Also, because level sets automatically handle shapes and regions that are multiply connected or possess holes, in this particular problem they present an elegant means of dealing with multiple hypotheses (i.e. if there are other regions in the image that look like lip) as well as dealing with the appearance and disappearance of the inner lip contour as the speaker’s mouth opens and closes.
Suppose the decision function returns a positive result for the lip class and a negative result for the non-lip class. In order to define an energy functional over the decision function map output by the SVM classifier, the following simple fact is used: in the decision function map, the lip region will contain a large proportion of positive values (ideally 100%, although if that were the case the SVM-based classifier would be enough to extract the lip shape and any further steps would be unnecessary), whereas the non-lip region would consist of (again, mostly) negative values. What needs to be determined now is the boundary that separates these two regions. Let f be the decision function on Ω, a subset of $\mathbb{R}^2$ (corresponding to the image support). Suppose C represents this boundary, and ins(C) and out(C) represent the
(possibly multiply-connected) regions on either side of this boundary. Consider the following energy functional:
$$E = -\left(\int_{\mathrm{ins}(C)} f(x)\,dx\right)^{2} - \left(\int_{\mathrm{out}(C)} f(x)\,dx\right)^{2} + \mu\,\mathrm{length}(C) \tag{1}$$
The first two terms depend upon the decision function map and the last term penalises the length of the boundary separating the two regions. If we momentarily ignore the contribution of the last term, the minimisation of the above functional becomes trivial: C simply corresponds to the “boundary” separating the binary thresholded version of the decision function (with a threshold of zero). The all-important boundary length term imposes the requirement of smoothness on the solution, which is where the level set method comes into play. This allows the lip boundaries to be smooth in shape and “whole” (i.e. even though they contain small regions whose pixels have been wrongly classified as skin pixels).
To recast this energy minimisation problem into the level set framework, we introduce the level set surface function $\phi(x) : \mathbb{R}^2 \to \mathbb{R}$. The contour C is represented by the (1-dimensional) zero level set of this function, $C = \{x : \phi(x) = 0\}$, and the inside and outside regions are represented by the following two-dimensional sets:
$$\mathrm{ins}(C) = \{x : \phi(x) > 0\}, \qquad \mathrm{out}(C) = \{x : \phi(x) < 0\}$$
We also introduce the Heaviside function, defined as follows:
$$H(\phi) = \begin{cases} 1, & \phi > 0 \\ 0, & \phi \le 0 \end{cases} \tag{2}$$
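and this allows us to write the level set version of (1) as follows:
$$E(\phi) = -\left(\int_{\Omega} f\,H(\phi)\,dx\right)^{2} - \left(\int_{\Omega} f\,(1 - H(\phi))\,dx\right)^{2} + \mu \int_{\Omega} \delta(\phi)\,|\nabla\phi|\,dx \tag{3}$$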
Here, $\delta(\phi)$ is the distributional derivative of the Heaviside function. The last term is equal to the length of the set $\phi(x) = 0$ and is a consequence of the co-area formula [10]. The level set evolution equation obtained for the gradient descent of (3) is given by
$$\frac{\partial\phi}{\partial t} = \left(\beta f + \mu\,\mathrm{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right)\right)|\nabla\phi| \tag{4}$$
where $\beta = \int_{\Omega} f(x)\,dx$. The discrete version of (4) over the image grid is solved using well-known methods for solving partial differential equations.
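The text does not say which discretisation was used, so the following is only a minimal sketch of one common choice: an explicit Euler time step with central differences and replicated borders. It reads (4) as the update φ ← φ + Δt (β f + μ div(∇φ/|∇φ|)) |∇φ|, and treats β as a given scalar (the text defines it as the integral of f over Ω, which on a pixel grid would be the sum of f). The function names, the step size `dt`, the regulariser `eps` and the toy inputs are all assumptions made for illustration.

```python
import math

def clamp(k, n):
    """Clamp index k into [0, n-1] so that border pixels are replicated."""
    return max(0, min(n - 1, k))

def gradient(a):
    """Central-difference gradient (d/dx, d/dy) of a 2-D grid given as a list of rows."""
    h, w = len(a), len(a[0])
    gx = [[(a[i][clamp(j + 1, w)] - a[i][clamp(j - 1, w)]) / 2.0
           for j in range(w)] for i in range(h)]
    gy = [[(a[clamp(i + 1, h)][j] - a[clamp(i - 1, h)][j]) / 2.0
           for j in range(w)] for i in range(h)]
    return gx, gy

def evolve_step(phi, f, beta, mu, dt, eps=1e-8):
    """One explicit Euler step of phi_t = (beta*f + mu*div(grad phi/|grad phi|)) * |grad phi|."""
    h, w = len(phi), len(phi[0])
    gx, gy = gradient(phi)
    mag = [[math.hypot(gx[i][j], gy[i][j]) for j in range(w)] for i in range(h)]
    # Unit normal field; eps avoids division by zero where phi is flat.
    nx = [[gx[i][j] / (mag[i][j] + eps) for j in range(w)] for i in range(h)]
    ny = [[gy[i][j] / (mag[i][j] + eps) for j in range(w)] for i in range(h)]
    dnx_dx, _ = gradient(nx)
    _, dny_dy = gradient(ny)
    return [[phi[i][j]
             + dt * (beta * f[i][j] + mu * (dnx_dx[i][j] + dny_dy[i][j])) * mag[i][j]
             for j in range(w)] for i in range(h)]

def evolve(phi, f, beta, mu, dt, steps):
    """Repeat the explicit step a fixed number of times (no convergence test)."""
    for _ in range(steps):
        phi = evolve_step(phi, f, beta, mu, dt)
    return phi

# Toy check: a linear ramp has unit gradient and zero curvature, so with f = 1
# and beta = 1 every interior value should rise by exactly dt.
phi0 = [[float(j - 5) for j in range(11)] for _ in range(11)]
f_map = [[1.0] * 11 for _ in range(11)]
phi1 = evolve_step(phi0, f_map, beta=1.0, mu=0.5, dt=0.5)
```

In practice the step size would have to be small enough for the explicit scheme to remain stable, and φ is often re-initialised periodically; neither detail is specified in the text.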
3 Some Results
The method was tested on face images from the XM2VTS database. The training set for the SVM classifier consisted of 8 images chosen from this database, and the performance of the method was tested on 122 images (not including the ones in the training set).
Fig. 1. SVM decision function map (panels: face image, SVM decision function, final lip segmentation)
Fig. 2. Lip segmentation in open-mouth cases (panels: final lip segmentation)
The centre image in figure 1 provides a two-colour visual representation of the SVM decision map; red indicates that the corresponding pixel has been classified as a lip pixel, while blue indicates non-lip classification. The intensity of the colour is proportional to the reliability of the classification; brighter colours represent pixels more likely to belong to their respectively assigned classes. From this representation it can be observed that even within the true lip region, there are sub-regions for which the classifier is incorrect or uncertain (for example, the dark region in the upper-left portion of the lip). However, since the level-set minimisation process (through the boundary-smoothness term) takes a broader, region-level look instead of focusing on individual pixel values, the lip region has been correctly segmented.
Fig. 3. Some examples of good results
As the results indicate, the classifier is able to distil sufficient information from the training process to allow it to capture the lip region over a wide variety of faces. From figure 2, the advantages of using level set techniques should also be apparent; the mouth contour has been located in the examples where the subjects have opened their mouths. The contours are also able to deal with the high curvature at the corners of the mouth.
3.1 Quantitative Measurement of the Results
To numerically define a notion of the quality of the results, a reasonable approach is to find a measure that matches the shape of the extracted region against the ground truth. This was done as follows: let G denote the lip region as demarcated in the ground truth, and E the lip region extracted by the method, considered as sets inside the image region (which can be taken to be the “universal set”). If | · | denotes the region area, then the quality of the extracted shape when matched with the ground-truth shape has been defined as follows:
$$q(E, G) = \frac{|E \cap G|}{|E \cup G|} \tag{5}$$
The measure q is simply the ratio between the overlapping area and the total area of the two shapes. q(·, ·) is symmetric with respect to its arguments and lies in the range [0, 1], where 1 represents a perfect match and 0 represents a total mismatch. The histogram in figure 4 indicates that the majority of results have a quality measure greater than 0.65 and visually correspond to a “good” segmentation result. The intermediate range contains results of reasonable quality towards its upper end, while at its lower end lie images that were partially segmented, but for which the level-set evolution process got stuck in a local minimum. There are a small number of completely failed segmentations, and these have a quality measure of less than 0.1.
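The overlap measure q can be computed directly from two binary masks of the same size. The sketch below is illustrative only: the function name `overlap_quality`, the list-of-rows mask representation, and the convention of returning 1.0 when both masks are empty are assumptions, not details given in the text.

```python
def overlap_quality(extracted, ground_truth):
    """Ratio of intersection area to union area for two same-sized binary masks."""
    inter = 0
    union = 0
    for row_e, row_g in zip(extracted, ground_truth):
        for e, g in zip(row_e, row_g):
            if e and g:
                inter += 1
            if e or g:
                union += 1
    # Assumed convention: two empty regions are treated as a perfect match.
    return inter / union if union else 1.0

# Two small masks sharing 2 pixels out of 4 in their union.
E = [[0, 1, 1],
     [0, 1, 0]]
G = [[0, 1, 0],
     [0, 1, 1]]
q = overlap_quality(E, G)  # 2 / 4 = 0.5
```

Because intersection and union are both symmetric, swapping the two arguments gives the same value, in line with the symmetry noted above.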
Fig. 4. Histogram distribution of the image quality
3.2 Discussion
It may be noticed from the results that at times some portion of the face other than the lips may also be captured. Since only colour features have been used, this is inevitable, as other parts of the face may have similar colouring to lips. However, due to the topological property of level sets, these regions tend to be disjoint from the true lip contours, and can be treated as cases where there are multiple hypotheses regarding the lip region. This is not necessarily bad; it may in fact be advantageous if, for instance, the image/video had several faces, so the method would work without having prior knowledge of their number. It is also observed that whether or not the method discovers multiple lip-like regions is strongly dependent on the shape and position of the initial level set contour. Gradient descent energy minimisation gives local solutions which depend on the initial estimate of the object whose energy is being minimised. As might be expected, by initialising closer to the expected lip region, better results are obtained (the results are usually “better” in the sense that a smaller number of lip-like candidates are extracted). By employing another process that considers higher level
information such as lip shape, position and size, the true lip contour can be chosen from the set of lip candidates. In a tracking application, it is expected that this would have to be done only once (if at all), after which the segmented lip boundary from every frame would serve as a good initial estimate for the next frame, and due to the local nature of the minimisation, lip-like regions away from the lip region would not be found again. With regard to the few instances of complete failure, such as the leftmost image in figure 4, these may be attributed to the fact that the skin tone in these cases was quite different from that of any of the eight images used in the training set.
4 Conclusion
In this paper, we have presented a new method to segment lips in face images. This method employs learning to deal with difficulties in the separability of lip and skin colour. Contour-based segmentation using level set methods ensures that the lip will be segmented as a whole shape, and further deals elegantly with the problem of segmentation of the mouth contour.
4.1 Future Work
It would be interesting to adapt the method proposed in this work to track a speaker’s lips in a video sequence. Although the SVM classifier was found to be the bottleneck in terms of the speed of the method, this problem could be alleviated in the tracking scenario by pre-computing the classifier’s decision function once over a range of feature vector values and then employing a look-up table to construct the decision function map for each frame. In an environment with controlled lighting conditions, this seems quite feasible, as the skin colour components would not be expected to vary much throughout the sequence (and the variation in the median features would be even less). Alternative methods to speed up the SVM include reducing its complexity by selectively reducing the training set, or by simplifying the decision function itself [11], [12]. In the level set minimisation phase, the contour extracted at each frame would serve as an excellent approximation for the next frame, and convergence could be achieved within a few iterations. Another improvement that might be considered is the incorporation of shape prior information in the level set framework [13].
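As a rough illustration of the look-up-table idea, the sketch below tabulates a decision function once over a coarse grid of RGB values for a fixed dominant colour, then builds a decision map for a frame by table look-up. Everything specific here is an assumption: the quantisation into 8 levels per channel (`levels` must divide 256), the function names, and in particular `toy_decision`, which is a made-up stand-in for a trained SVM decision function.

```python
from itertools import product

def build_lut(decision_fn, dominant, levels=8):
    """Tabulate decision_fn over a coarse RGB grid for one fixed dominant colour."""
    step = 256 // levels
    centres = [k * step + step // 2 for k in range(levels)]
    lut = {}
    for r, g, b in product(range(levels), repeat=3):
        # 6-D feature: quantised pixel colour followed by the dominant colour.
        feature = (centres[r], centres[g], centres[b]) + tuple(dominant)
        lut[(r, g, b)] = decision_fn(feature)
    return lut, step

def decision_map(image, lut, step):
    """Look up the tabulated decision value for every (R, G, B) pixel of an image."""
    return [[lut[(r // step, g // step, b // step)] for (r, g, b) in row]
            for row in image]

def toy_decision(feature):
    """Hypothetical stand-in for the SVM: redness of the pixel relative to the dominant colour."""
    r, g, _, med_r, med_g, _ = feature
    return (r - g) - (med_r - med_g)

lut, step = build_lut(toy_decision, dominant=(150, 110, 100))
frame = [[(200, 80, 90), (120, 110, 100)]]  # one reddish pixel, one skin-like pixel
fmap = decision_map(frame, lut, step)
```

The table has `levels ** 3` entries per dominant colour, so the trade-off is between quantisation error in the decision map and the cost of filling the table with classifier evaluations.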
References
1. McGurk, H., MacDonald, J.W.: Hearing lips and seeing voices. Nature 264 (1976)
2. Hennecke, M., Prasad, K., Stork, D.: Using deformable templates to infer visual speech dynamics. In: 9th Asilomar Conference on Signals Systems and Computers, Pacific Grove, CA, pp. 578–582 (1994)
3. Kaucic, R., Blake, A.: Accurate, real-time, unadorned lip tracking. In: ICCV 1998: Proceedings of the Sixth International Conference on Computer Vision, p. 370. IEEE Computer Society, Washington, DC, USA (1998)
4. Luettin, J., Thacker, N.A., Beet, S.W.: Visual speech recognition using active shape models and hidden markov models. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1996), vol. 2, pp. 817–820 (1996)
5. Lievin, M., Delmas, P., Coulon, P., Luthon, F., Fristot, V.: Automatic lip tracking: Bayesian segmentation and active contours in a cooperative scheme. In: ICMCS, vol. 1, pp. 691–696 (1999)
6. Bregler, C., Omohundro, S.M.: Nonlinear manifold learning for visual speech recognition. In: ICCV 1995: Proceedings of the Fifth International Conference on Computer Vision, p. 494. IEEE Computer Society, Washington, DC, USA (1995)
7. Tian, Y.-L., Kanade, T., Cohn, J.: Robust lip tracking by combining shape, color and motion. In: Proceedings of the 4th Asian Conference on Computer Vision (ACCV 2000) (2000)
8. Liévin, M., Luthon, F.: Nonlinear color space and spatiotemporal mrf for hierarchical segmentation of face features in video. IEEE Transactions on Image Processing 13, 63–71 (2004)
9. Osher, S., Fedkiw, R.: Level set methods and dynamic implicit surfaces. Springer, Heidelberg (2003)
10. Morgan, F.: Geometric Measure Theory: A Beginner’s Guide, 3rd edn. Academic Press, London (2000)
11. Bakir, G.H., Bottou, L., Weston, J.: Breaking svm complexity with cross-training. In: NIPS (2004)
12. Burges, C.J.C.: Simplified support vector decision rules. In: International Conference on Machine Learning, pp. 71–77 (1996)
13. Chan, T., Zhu, W.: Level set based shape prior segmentation. In: CVPR 2005: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 1164–1170. IEEE Computer Society, Washington, DC, USA (2005)