Smile and Laughter Recognition using Speech Processing and Face Recognition from Conversation Video

Akinori Ito, Xinyue Wang, Motoyuki Suzuki, Shozo Makino
Graduate School of Engineering, Tohoku University
6-6-05 aza Aoba, Aramaki, Aoba-ku, Sendai 980-8579 Japan
{aito,wxinyue,moto,makino}@makino.ecei.tohoku.ac.jp

Abstract

This paper describes a method to detect smiles and laughter sounds in videos of natural dialogue. A smile is the most common facial expression observed in a dialogue. Detecting a user's smiles and laughter sounds can be useful for estimating the mental state of the user of a spoken-dialogue-based user interface. In addition, detecting laughter sounds can prevent the speech recognizer from wrongly recognizing them as meaningful words. In this paper, we propose a method to detect smile expressions and laughter sounds robustly by combining an image-based facial expression recognition method and an audio-based laughter sound recognition method. The image-based method uses a feature vector based on feature points detected from face images, and detects smiling faces with more than 80% recall and precision. Combining a GMM-based laughter sound recognizer with the image-based method improved the accuracy of laughter sound detection compared with methods that use the image or the sound alone. As a result, more than 70% recall and precision of laughter sound detection were obtained from the natural conversation videos.

1. Introduction

In a dialogue using speech, it is not only linguistic information that is exchanged between the speakers. To realize a natural human-machine dialogue system, it is important to exploit information other than the linguistic content, for example, prosodic information or facial expressions. Fujie et al. exploited the paralinguistic information conveyed by prosody to realize a dialogue robot[1]. There are many works on the recognition of facial expressions and emotional states[2, 3, 4, 5, 6]. These works seem to be useful for estimating the emotional state of the user and for using that emotional information to improve the dialogue control strategy.


However, most of these works used still images of fully expressed facial expressions, or short videos containing a change from a neutral state to a fully expressed facial expression. It is unclear whether these methods can be applied to recognize facial expressions in videos of natural dialogues.

On the other hand, nonlinguistic sounds that inevitably occur in a spoken dialogue cause severe degradation of speech recognition accuracy. For example, a dialogue system would recognize the user's laughter or cough as part of a linguistic utterance if no precaution against such noises were taken. Once these noises are recognized as words, the errors prevent smooth dialogue. Such nonlinguistic noises can be detected by acoustic models such as Gaussian mixture models (GMMs) if a large amount of sample data can be obtained. For example, Lee et al.[7] trained a GMM using a large amount of noise data gathered by the information terminal Takemaru-kun, and realized the rejection of such noises. However, this approach requires a large amount of nonlinguistic noise data. Unlike read speech, nonlinguistic utterances are difficult to produce intentionally. Therefore, the only way to gather them is to record a large number of natural dialogues, which is costly and time-consuming.

In this work, we focus on smiles and laughter in natural dialogues. A smile is the most common facial expression in a natural dialogue. Detecting the smiling face is useful for estimating the emotional state of the user. In addition, a smiling face often accompanies nonlinguistic utterances, so detecting it is also useful for rejecting nonlinguistic noises. In this paper, we develop a method to detect smiles and laughter from natural conversation video. As mentioned above, it is difficult to gather a large amount of dialogue data; we therefore aim at a method that can detect smiles and laughter using a small amount of training data. Smiles and laughter can be viewed from two viewpoints: the smiling face and the laughter sound.

Table 1. Overview of the natural conversation database.

Frame rate              30 fps
Number of dialogues     7
Length of each sequence 4 to 8 minutes
Languages               Japanese, English, Chinese
Proportion of laugh     about 10%
Proportion of smile     about 37-60%

Figure 1. Experimental setup for recording the natural conversations.

We then propose a method that combines an image-based and an audio-based detection method. The structure of this paper is as follows. First, the conversation video database recorded for the development of the system is described. Next, a method for recognizing smiling faces from face images is described. Then a method to detect laughter sounds from the dialogue speech is proposed. Finally, we propose a method to combine the two recognition results, and present experimental results.

2. Conversation video database

Before developing a detection method for smiling faces and laughter sounds, we recorded videos of several natural conversations. The database contains seven natural conversation video sequences recorded from 7 male test subjects speaking 3 languages (Japanese (3 subjects), English (2) and Chinese (2)). In the later experiments, only the Japanese data are used.

The experimental setup is shown in Figure 1. It was used to record the facial expressions and voice of person A during his conversation with person B. Camera A recorded the face of person A, and the wireless microphone recorded person A's voice. Person A saw person B's face on the television and heard person B's voice through the headphones. Persons A and B were friends and were instructed to converse about any topic of interest. Smile/laughter labels were assigned to the recorded video manually, frame by frame. We used three kinds of labels: 'smile' (smiling face without laughter sound, with or without utterances), 'laugh' (smiling face with laughter sound) and 'normal' (otherwise). An overview of the recorded data is shown in Table 1.


3. Detection of smile face using image

3.1. Overview of the image-based smile face detection

A large number of studies have been conducted on recognizing facial expressions from images[2, 3, 4, 5, 6]. The feature vectors used in existing methods can be divided into two types: structure-based features and frequency-based features. Structure-based methods try to find the structure of a face directly from the image. They first find several feature points in a face image, and recognition is carried out using the relationships between the feature points[4] or the motions of face parts[5]. Frequency-based methods first find the parts of a face, and then DCT or wavelet coefficients are calculated as the feature vector[6]. We decided to use structure-based features, detecting feature points and feature vectors directly from the face image, because the structure-based approach is easier to implement.

Our work uses a six-dimensional feature vector. Figure 2 shows an overview of the features, which are as follows.

1. The lip lengths (L1, L2)
2. The lip angles (θ1, θ2)
3. The mean intensities of the cheek areas (W1, W2)

After extracting the feature vector for each frame, recognition is done on a frame-by-frame basis. We use a linear discrimination function for the recognition. Other methods, such as a neural network or an SVM, could be used; in this work we chose the simplest method because the amount of training data is small.
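As a rough illustration only (not the authors' implementation), a linear discriminant over the six-dimensional feature vector could be trained with perceptron updates as in the following sketch; feature extraction and normalization are assumed to have been done already.

import numpy as np

def train_perceptron(X, y, epochs=50, lr=0.1):
    """Train a linear discriminant: w.x + b > 0 means 'smile'.
    X: (n_frames, 6) array of (L1, L2, theta1, theta2, W1, W2); y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            if t * (np.dot(w, x) + b) <= 0:   # misclassified frame: update
                w += lr * t * x
                b += lr * t
    return w, b

def classify(X, w, b):
    """Frame-by-frame decision: +1 = 'smile', -1 = 'non-smile'."""
    return np.where(X @ w + b > 0, 1, -1)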

3.2. Extraction of features from image

Next, we describe how to extract the features from an image. The feature extraction is performed as follows.

Figure 2. Facial feature points and feature vector.

Figure 3. Detection of the eye points.

1. Detection of the face area

We used skin color detection[8] to detect the face area. This method classifies a pixel as skin when 0.333 < r < 0.664, r > g, 0.246 < g < 0.398 and g > 0.5 − 0.5r, where r = R/(R + G + B), g = G/(R + G + B), and R, G and B are the red, green and blue pixel values, respectively (a code sketch of this rule is given after this list). After detecting the face area, the image is converted into a grayscale image.

2. Detection of the feature points (P1, P2, P3, P4)

Next, the feature points P1 and P2 (the eyes) are located. Figure 3 shows the process of the eye detection. First, the upper half of the face area is divided vertically into two regions, and the intensity centroid (treating black pixels as having the maximum value) is calculated for each region. These centroids become the candidate eye points. Next, a rectangular window is set around each centroid. The window is moved around the centroid, and the position whose average intensity is closest to black is found. The center of that window is taken as the estimated center of the eye. Because searching for the eyes in a single frame may cause misdetection, the detected eye points are compared with those of the preceding and following frames. If the eye points of the current frame differ greatly from those of the neighboring frames, they are regarded as misdetections and the eyes are searched for again.

Next, the feature points P3 and P4 (the nose and the upper lip) are detected.


Figure 4. Detection of the nose and the upper lip.

The Sobel filter is applied to the grayscale image to detect horizontal edges, and the filtered image is binarized using a fixed threshold. Then the feature points are searched for by scanning the image downward from the midpoint of P1 and P2. The first two connected regions crossing the scan line are regarded as candidates for the nose and the upper lip. Figure 4 depicts the procedure for finding P3 and P4.

3. Extraction of features from the normalized face image

After finding the feature points, the size of the face image is normalized. Then the six-dimensional feature vector V = (L1, L2, θ1, θ2, W1, W2) is extracted from the normalized image.
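The skin-color rule in step 1 translates directly into code; the following is a minimal sketch, assuming the image is given as a floating-point RGB array (the array layout and the small epsilon guard are our assumptions, not part of the original method).

import numpy as np

def skin_mask(image):
    """Return a boolean mask of skin pixels for an RGB image of shape (H, W, 3)."""
    R, G, B = image[..., 0], image[..., 1], image[..., 2]
    total = R + G + B + 1e-6          # guard against division by zero on black pixels
    r, g = R / total, G / total       # normalized chromaticity coordinates
    return ((r > 0.333) & (r < 0.664) & (r > g) &
            (g > 0.246) & (g < 0.398) & (g > 0.5 - 0.5 * r))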

3.3. Experiment

We carried out experiments to recognize smiling faces. The classification of facial expressions is performed by a linear discrimination function trained by perceptron learning.

First, we examined a user-dependent recognition experiment. The first one minute of the dialogue video (1800 frames) is used to train the discrimination function; then each frame of the following four-minute video sequence is classified as either 'smile' or 'non-smile'. We carried out this experiment for three video sequences (JPA, JPB and JPC).

We use three evaluation measures: classification accuracy, recall rate and precision rate. Let N be the number of frames to be classified, L be the number of frames labeled as 'smile', S+ be the number of frames correctly classified as 'smile' by the system, and S− be the number of frames wrongly classified as 'smile' by the system. Then the classification accuracy is

  A = (S+ + N − L − S−) / N,    (1)

the recall rate is

  R = S+ / L,    (2)

and the precision rate is

  P = S+ / (S+ + S−).    (3)

The classification accuracy is the proportion of frames that are correctly classified, regardless of whether or not a frame is labeled as 'smile'. The recall and precision rates, on the other hand, focus on the 'smile' events: the recall rate is the proportion of 'smile' frames correctly classified as 'smile' ('non-smile' frames are not considered), and the precision rate is the proportion of correctly classified 'smile' frames among all frames the system detected as 'smile'.

The results are shown in Table 2. The column 'subject' shows the video used in the experiment. More than 80% performance was obtained for each of the evaluation measures. This is a promising result, although the experimental conditions were strictly controlled.

Next, we investigated the performance of smile face detection when the training data and the test data were taken from video sequences of different subjects. In this experiment, we trained the discrimination function on one subject and evaluated it on the other two. The classification accuracy results are shown in Table 3. These results show that relatively high performance was obtained between JPA and JPB. However, when JPC was used as either the training data or the test data, the classification accuracy degraded severely.
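For concreteness, the three measures in Eqs. (1)-(3) could be computed from frame-level labels as in the short sketch below (the boolean-array representation is our assumption).

import numpy as np

def frame_metrics(labels, detected):
    """Accuracy, recall and precision as in Eqs. (1)-(3).
    labels, detected: boolean arrays over frames, True = 'smile'."""
    labels = np.asarray(labels, bool)
    detected = np.asarray(detected, bool)
    N = len(labels)
    L = labels.sum()                          # frames labeled 'smile'
    S_plus = (labels & detected).sum()        # correctly detected 'smile' frames
    S_minus = (~labels & detected).sum()      # wrongly detected 'smile' frames
    A = (S_plus + N - L - S_minus) / N
    R = S_plus / L if L else 0.0
    P = S_plus / (S_plus + S_minus) if (S_plus + S_minus) else 0.0
    return A, R, P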


Table 2. Result of smile detection (user dependent).

subject    A (%)   R (%)   P (%)
JPA        86.9    98.4    82.0
JPB        84.6    86.9    76.0
JPC        83.2    86.1    86.9
Average    84.6    84.3    81.8

Table 3. Classification accuracy (%) of smile face detection (user independent).

training\test   JPA    JPB    JPC
JPA             -      84.7   77.2
JPB             85.7   -      65.4
JPC             60.1   71.1   -

The recall and precision results of this experiment are shown in Figure 5, where the x-axis is the precision rate and the y-axis is the recall rate. Both the user-dependent results (Table 2) and the user-independent results are plotted. The recall and precision rates are high for the user-dependent case (gray points), while the user-independent results can have considerably lower recall rates (black points).

4. Laughter sound detection using audio data

In this section, the method for detecting laughter sounds using audio data is described. The laughter sounds observed in a dialogue can be very diverse[9]. While some are voiced sounds similar to ordinary utterances, others are similar to unvoiced consonants such as [f], or to nasal breath sounds. It is therefore difficult to recognize laughter sounds using ordinary acoustic models for speech recognition, and a dedicated model for laughter sounds is needed. In this work, we chose a Gaussian mixture model (GMM) to model the laughter sounds[7].

An overview of the laughter sound detection is shown in Figure 6. First, feature vectors (MFCC and delta-MFCC) are extracted from the training speech data. These feature vectors are labeled as 'laugh' (L) or 'no-laugh' (N). The feature vectors of each category are then modeled using a GMM. When an input audio signal is given, the two GMMs (for 'laugh' and 'no-laugh') are applied to the speech and the likelihood values for both models are calculated.

Table 4. Experimental conditions of laughter detection by speech.

sampling freq.           44.1 kHz
feature vector           MFCC(12) + ΔMFCC(12)
window                   Hamming, 25 ms
frame shift              10 ms
NL                       2, 4, 8, 16, 32, 64
NN                       2, 4, 8, 16, 32, 64
detection threshold T    0.7
averaging width          37

Figure 5. The recall rate and the precision rate

Figure 6. Overview of the laughter detection using audio stream.

The frame-by-frame sequences of likelihood values are smoothed using a moving-average filter, and the difference between the two smoothed likelihood values is calculated. Finally, the likelihood difference is compared with a predefined threshold T, and the input frame is classified as 'laugh' or 'no-laugh'.

Table 4 shows the conditions of the experiment. NL and NN stand for the number of mixture components of the 'laugh' GMM and the 'no-laugh' GMM, respectively. Considering that the 'no-laugh' sounds (mainly speech or silence) are more complex than the 'laugh' sounds, we examined combinations of NL and NN where NL ≤ NN. The speakers of the training and evaluation data are identical. To compensate for the shortage of training data, we carried out five-fold cross-validation: the five-minute speech was divided into five one-minute parts, four of which were used for training while the remaining one minute was used for evaluation. This process was repeated five times, changing the evaluation part each time, and the results were averaged to obtain the final result.
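A minimal sketch of this audio pipeline is given below, assuming librosa for the MFCC features and scikit-learn for the GMMs; the window and shift follow Table 4, while everything else (function names, the mixture sizes shown in the comments) is illustrative rather than the authors' implementation.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav, sr):
    """12 MFCCs + 12 delta-MFCCs per frame (25 ms window, 10 ms shift)."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=12,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    delta = librosa.feature.delta(mfcc)
    return np.vstack([mfcc, delta]).T                  # shape: (n_frames, 24)

def detect_laughter(feats, gmm_laugh, gmm_nolaugh, threshold=0.7, width=37):
    """Smooth each frame-wise log-likelihood sequence with a moving average,
    take their difference, and threshold it to get per-frame 'laugh' decisions."""
    kernel = np.ones(width) / width
    ll_laugh = np.convolve(gmm_laugh.score_samples(feats), kernel, mode="same")
    ll_nolaugh = np.convolve(gmm_nolaugh.score_samples(feats), kernel, mode="same")
    return (ll_laugh - ll_nolaugh) > threshold         # True = 'laugh' frame

# Training: one GMM per class on labeled frames, e.g.
# gmm_laugh = GaussianMixture(n_components=16).fit(laugh_feats)
# gmm_nolaugh = GaussianMixture(n_components=16).fit(nolaugh_feats)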


Figure 7. Result of laughter detection (recall and precision)

The detection threshold T and the averaging width were set to 0.7 and 37, respectively, according to a preliminary experiment.

The recall and precision rates of the experiment are shown in Figure 7, where the x-axis is the precision rate and the y-axis is the recall rate. Each point in the graph represents the result for one combination of (NL, NN). The recall and precision rates vary according to the combination of GMM mixture sizes.

Figure 7 is a frame-by-frame evaluation, and it shows that the recall and precision rates are not high. By inspecting the detection results, we found that the GMM-based detector fires many times within one contiguous laughter sound. Figure 8 shows an example of laughter detection by the GMMs; Lh denotes the manually labeled laughter periods and Ls the laughter periods detected by the system.

Figure 8. An example of laughter sound detection by audio.

This result suggests that the GMM-based detector can detect laughter sounds, but that it has difficulty estimating the duration of the laughter. To confirm this, we calculated the laughter event detection rate, which is the ratio of laughter events in which at least one laughter sound is detected. Figure 9 shows the relationship between the precision rate and the laughter event detection rate. The result shows that the GMM-based detector detects laughter events with relatively high reliability: when NL = 16 and NN = 16, about 60% precision and a 95% laughter event detection rate were obtained.
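The event detection rate could be computed as in the small sketch below; the representation of labeled events as (start, end) frame ranges and of the detector output as a boolean per-frame sequence is our assumption.

def event_detection_rate(events, detected_frames):
    """Fraction of labeled laughter events that overlap at least one detected 'laugh' frame.
    events: list of (start, end) frame-index ranges; detected_frames: boolean sequence."""
    hit = sum(any(detected_frames[s:e]) for s, e in events)
    return hit / len(events) if events else 0.0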

5. Combination of audio-based and image-based laughter detection

In this section we describe a laughter detection method that combines the audio-based and the image-based laughter detection. The audio-based method is exactly the same as that described in Section 4. The image-based method is almost the same as that explained in Section 3; the only difference is that the discrimination function is trained on 'laugh' and 'non-laugh' data instead of 'smile' and 'non-smile' data.

First, we investigated laughter detection using the image alone. Figure 10 shows the result of the image-based laughter detection. Compared with the audio-based detection (Figure 8), the image-based method estimated the duration of the laughter sound relatively well, but produced many more false alarms. To combine the two results effectively, we examined the following method.


Figure 9. Result of laughter detection (event detection rate and precision)

1. Laughter periods detected by the image-based method become candidates for laughter events.

2. If any laughter sound is detected by the audio-based method within a candidate laughter event, that candidate becomes a final result. Otherwise, the candidate is rejected.

Figure 11 illustrates the proposed method. Using this method, the precision rate of the image-based method can be improved without degrading the recall rate.

We carried out an experiment to investigate the performance of the combined method. The experimental conditions are the same as those of the user-dependent experiment in Section 4. Tables 5, 6 and 7 show the results of the audio-based, the image-based and the combined method, respectively. Comparing Table 7 with the other two tables shows that the precision rate of the image-based method is improved by incorporating the audio-based result. As a final result, a recall rate of 71% and a precision rate of 74% were obtained.
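A straightforward frame-level realization of this combination rule might look like the following sketch (the boolean per-frame representation is our assumption).

import numpy as np

def combine(image_frames, audio_frames):
    """Image-based laughter periods become candidates; a candidate is kept
    only if the audio-based detector fires at least once inside it."""
    image_frames = np.asarray(image_frames, bool)
    audio_frames = np.asarray(audio_frames, bool)
    result = np.zeros_like(image_frames)
    start = None
    for i, v in enumerate(np.append(image_frames, False)):  # sentinel closes the last run
        if v and start is None:
            start = i                                        # candidate period begins
        elif not v and start is not None:
            if audio_frames[start:i].any():                  # audio confirms the candidate
                result[start:i] = True
            start = None
    return result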

6. Summary

Smile and laughter recognition combining facial expression recognition and speech processing was described. The image-based smile recognition used a feature vector based on feature points detected from the face image, and the proposed method obtained more than 80% recall and precision. The audio-based laughter recognition used 'laugh' and 'non-laugh' GMMs to detect laughter sound events from the input audio signal.

Figure 10. An example of laughter sound detection by image.

Figure 11. Combination of results of image-based and audio-based laughter detection.

This method gave a high laughter event detection rate, but its frame-level recall and precision were not very high. Finally, a laughter sound detection method combining the image-based and the audio-based methods was proposed. The experimental results showed that the combination of the two methods improves the precision rate; as a final result, more than 70% recall and precision were obtained.

As future work, we will conduct further experiments using more data to obtain more reliable results. In addition, we plan to develop a more robust user-independent smile/laughter recognition method, as well as recognition methods for other kinds of nonverbal information.

References

[1] S. Fujie, D. Yagi, Y. Matsusaka, H. Kikuchi and T. Kobayashi, "Spoken Dialogue System Using Prosody as Para-Linguistic Information," Proc. Speech Prosody 2004, March 2004.

[2] Y. Tian, T. Kanade and J. F. Cohn, "Recognizing Action Units for Facial Expression Analysis," IEEE Trans. PAMI, Vol. 23, No. 2, February 2001.

[3] Y. Yacoob and L. S. Davis, "Recognizing Human Facial Expressions From Long Image Sequences Using Optical Flow," IEEE Trans. PAMI, Vol. 18, No. 6, June 1996.

[4] H. Ohta, H. Saji and H. Nakatani, "Recognition of Facial Expressions Using Muscle-Based Feature Models," Trans. IEICE (D-II), Vol. J82-D-II, No. 7, pp. 1129-1139, July 1999.

[5] Y. Zhu, L. C. De Silva and C. C. Ko, "Using moment invariants and HMM in facial expression recognition," Pattern Recognition Letters, Vol. 23, pp. 83-91, 2002.

[6] Y. Xiao, N. P. Chandrasiri, Y. Tadokoro and M. Oda, "Recognition of Facial Expressions Using 2-D DCT and Neural Network," Trans. IEICE (A), Vol. J81-A, No. 7, pp. 1077-1086, 1998.

[7] A. Lee, K. Nakamura, R. Nisimura, H. Saruwatari and K. Shikano, "Noise Robust Real World Spoken Dialogue System using GMM Based Rejection of Unintended Inputs," Proc. ICSLP, Vol. I, pp. 173-176, 2004.

[8] Y. Araki, N. Shimada and Y. Shirai, "Detection of Faces of Various Directions in Complex Backgrounds," Proc. ICPR, pp. 409-412, 2002.

[9] J. A. Bachorowski and M. J. Owren, "Not all laughs are alike: Voiced but not unvoiced laughter elicits positive affect in listeners," Psychological Science, Vol. 12, pp. 252-257, 2001.

Table 5. Laughter sound detection by audio.

subject    recall (%)   precision (%)
JPA        22.1         64.3
JPB        15.7         63.3
JPC        13.3         68.0

Table 6. Laughter sound detection by image.

subject    recall (%)   precision (%)
JPA        75.1         68.0
JPB        70.6         44.7
JPC        67.2         43.9

Table 7. Laughter sound detection by the combined method.

subject    recall (%)   precision (%)
JPA        75.1         73.7
JPB        70.6         72.8
JPC        67.2         75.6
