SENTENCE LEVEL EMOTION RECOGNITION BASED ON DECISIONS FROM SUBSENTENCE SEGMENTS

Je Hun Jeon, Rui Xia, Yang Liu
Department of Computer Science
The University of Texas at Dallas, Richardson, TX, USA
{jhjeon,rx,yangl}@hlt.utdallas.edu

ABSTRACT

Emotion recognition from speech plays an important role in developing affective and intelligent systems. This study investigates sentence-level emotion recognition. We propose a two-step approach that leverages information from subsentence segments for the sentence-level decision. First, a segment-level emotion classifier generates predictions for the segments within a sentence. A second component then combines the predictions from these segments to obtain a sentence-level decision. We evaluate different segment units (words, phrases, time-based segments) and different decision combination methods (majority vote, average of probabilities, and a Gaussian mixture model (GMM)). Our experimental results on two different data sets show that the proposed method significantly outperforms the standard sentence-based classification approach. In addition, we find that time-based segments achieve the best performance, so no speech recognition or alignment is needed when using our method, which is important for developing language independent emotion recognition systems.

Index Terms— Emotion, Subsentence, Segment, Decision model

1. INTRODUCTION

Speech contains rich information beyond what is said, such as the speaker's emotion, real intention, and meaning. Emotion recognition in speech is an important part of affective computing. Systems with the ability to recognize a speaker's emotion or other paralinguistic information are important for human-machine interaction. There has been increasing interest recently in recognizing emotion from speech, for example the Interspeech Emotion Challenge in 2009 and the Paralinguistic Challenge in 2010 [1, 2].

Most previous work adopted sentence- or utterance-level emotion recognition, that is, a decision is made for a sentence regarding its emotion category. Numerous studies in the last decade have tried to improve either the features or the classifiers [3].


There are two major approaches to emotion recognition. The first is a dynamic modeling approach, in which low-level descriptors (LLD) such as pitch, energy, or MFCCs are extracted at the frame level and then modeled with a GMM [4] or an HMM [5, 6]. In the second, static approach, a feature vector is created for the entire sentence by projecting the LLD through descriptive statistical functionals such as mean, standard deviation, kurtosis, and skewness. These features are modeled using a support vector machine (SVM) [6] or a multilayer perceptron (MLP) [7]. Similar performance has been reported for the two approaches.

For sentence/utterance-level emotion recognition, intuitively the segments in an utterance contribute differently to the sentence-level decision, and a better model of the segments may improve recognition performance. A few studies have explored using segment units from the utterance [8, 9, 10]. Their results showed that segment information alone does not perform as well as utterance-level features; however, combining segment and utterance features yielded better performance.

In this study, we investigate whether information from subsentence segments can help make decisions for the entire sentence. Rather than preserving all the features for subsentence segments and applying dynamic models, we propose a two-stage method: we first build emotion classification models based on these subsentence units, and then obtain the sentence-level emotion from the decisions of its segments. This approach is different from previous segment-based methods [8, 9]. We evaluate several subsentence units (word, phrase, and time-based segments) and different ways to combine the predictions from each segment for the utterance-level emotion, including majority vote, average of posterior probabilities, and a GMM model. Our experimental results on different data sets show that using subsentence units achieves better performance than the standard utterance-level classification approach.

2. METHODS

The task in this study is sentence-level emotion recognition. Fig. 1 shows an overview of our framework. The whole utterance X is first chunked into a sequence of segments, x_1, x_2, ..., x_N. We then use the segment-level emotion classification model to generate posterior probabilities for the segments in a test utterance. The last step uses these probabilities to make a decision for the sentence regarding its emotion category. The following sections describe each component in detail.

Fig. 1. Framework for sentence-level emotion classification using subsentence segments.


Table 1. 384 feature set: low-level descriptors (LLD) and functionals.

LLD (16 · 2):      ZCR, RMS energy, F0, HNR, MFCC 1-12 (each LLD plus its first-order delta)
Functionals (12):  mean, std. deviation, kurtosis, skewness; extremes: min/max value, rel. min/max position, range; linear regression: offset, slope, MSE
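To make the functional projection of Table 1 concrete, here is a minimal Python sketch. It is our illustration, not the authors' code: the actual features are extracted with the openSMILE toolkit [13], and the function name and the example contour are ours. It applies the 12 functionals to a single LLD contour and to a simple first difference standing in for the delta coefficients.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(contour):
    """Apply the 12 statistical functionals of Table 1 to one LLD contour."""
    x = np.asarray(contour, dtype=float)
    t = np.arange(len(x))
    # Linear regression of the contour over time: slope, offset, and MSE of the fit.
    slope, offset = np.polyfit(t, x, deg=1)
    mse = np.mean((x - (slope * t + offset)) ** 2)
    return {
        "mean": x.mean(),
        "stddev": x.std(),
        "kurtosis": kurtosis(x),           # scipy's definition; toolkits may differ
        "skewness": skew(x),
        "min": x.min(),
        "max": x.max(),
        "rel_min_pos": np.argmin(x) / len(x),   # relative position of the minimum
        "rel_max_pos": np.argmax(x) / len(x),   # relative position of the maximum
        "range": x.max() - x.min(),
        "lin_reg_offset": offset,
        "lin_reg_slope": slope,
        "lin_reg_mse": mse,
    }

# Example: 12 functionals for an F0 contour and for its first difference
# (openSMILE uses delta regression coefficients; np.diff is a simple stand-in).
f0 = np.random.rand(100)                   # placeholder contour
feats = {**functionals(f0),
         **{"d_" + k: v for k, v in functionals(np.diff(f0)).items()}}
```

Applying the 12 functionals to the 16 LLD and their deltas gives the 16 · 2 · 12 = 384 features listed in Table 1.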

2.1. Segmentation Units

We investigate different subsentence segment units in our method.

• 3 words: This is similar to using words as segments; however, since some words are very short (e.g., one-syllable words), we expand a word with its previous and following word to form a 3-word segment.

• Phrases: We expect that emotion cues may vary depending on the syntactic roles of different regions in a sentence, and thus evaluate this syntax-based segment. We use the YamCha tool [11] to obtain chunks for a sentence (such as noun phrases and verb phrases).

• Time-based segments: This does not use any word or syntax information; the segmentation is done simply based on time. We use a fixed-length segment of 1 second and shift the window every 0.2 seconds. Schuller and Devillers [12] showed that a 1-second duration includes enough information for humans or machines to determine emotion; therefore we set the segment length to 1 second in this study.
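For illustration, here is a minimal sketch of the time-based unit. It is not the authors' implementation; the function and variable names are ours, and the 16 kHz sampling rate in the example is only an assumption. It cuts the waveform into 1-second windows shifted by 0.2 seconds, as described above.

```python
import numpy as np

def time_based_segments(samples, sr, win_sec=1.0, shift_sec=0.2):
    """Split a waveform into fixed-length, overlapping segments.

    samples: 1-D array of audio samples; sr: sampling rate in Hz.
    Returns a list of 1-second segments taken every 0.2 seconds.
    """
    win = int(win_sec * sr)
    shift = int(shift_sec * sr)
    segments = []
    for start in range(0, max(len(samples) - win, 0) + 1, shift):
        segments.append(samples[start:start + win])
    if not segments:                      # utterance shorter than one window
        segments.append(samples)
    return segments

# Example: a 3.5-second utterance at 16 kHz yields 13 overlapping segments.
audio = np.zeros(int(3.5 * 16000))
print(len(time_based_segments(audio, sr=16000)))
```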


2.2. Segment-based Emotion Classification

The segment-level emotion classification is similar to the static modeling approach described in Section 1. We extract the 384 features used in the 2009 Interspeech Emotion Challenge [1] with the openSMILE toolkit [13]. Table 1 lists the features: 12 functionals are applied to 16 low-level descriptors (LLD) and to their corresponding first-order delta regression coefficients.

To train the segment-level emotion classifier, we assume that the emotion label of a segment is the same as that of the sentence containing it. For a given test utterance, the classifier generates posterior probabilities for each segment in the sentence.

2.3. Decision Model

This step makes a decision about the sentence's emotion category using the predictions for its segments. We evaluate three approaches. For the segments x_1, x_2, ..., x_N in a test utterance, let p(E_j|x_i) denote the posterior probability of emotion category E_j for segment x_i, obtained from the above emotion classifier.

• Majority vote: First, for each segment we determine its emotion class, which is the one with the highest probability p(E_j|x_i). The emotion category of the entire sentence is then simply the majority vote over all the segments.

• Average of segment probabilities: For a sentence, we calculate its posterior probability for each class by averaging the probabilities of all the segments, i.e., P(E_j|X) = (1/N) Σ_i p(E_j|x_i). The most likely class is determined from this sentence-level probability.

• GMM model: We use a separate classifier to model the distribution of the posterior probabilities of the segments in a sentence. For each segment, the feature vector consists of the posterior probabilities of the different emotion categories. A GMM is trained for each emotion class using the segment feature vectors of all the corresponding training sentences. During testing, we compute the likelihood of the test sequence under each class's GMM and choose the class with the highest score.
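The following sketch illustrates the three combination rules on a matrix of segment posteriors. It is our illustration, not the authors' code: the paper uses WEKA for the segment classifier and HTK for the GMM, whereas scikit-learn's GaussianMixture is used here as a stand-in, and all names are ours.

```python
import numpy as np
from collections import Counter
from sklearn.mixture import GaussianMixture

def majority_vote(seg_post):
    """seg_post: (N, K) posteriors for N segments; return the majority class index."""
    votes = seg_post.argmax(axis=1)
    return Counter(votes).most_common(1)[0][0]

def average_probability(seg_post):
    """Average the segment posteriors over the sentence and pick the best class."""
    return seg_post.mean(axis=0).argmax()

def train_gmm_decision_model(train_sentences, train_labels, n_classes, n_mix=4):
    """Fit one GMM per emotion class on the segment posterior vectors of the
    training sentences of that class (four mixtures, as stated in Section 3.1)."""
    models = []
    for c in range(n_classes):
        vecs = np.vstack([s for s, y in zip(train_sentences, train_labels) if y == c])
        models.append(GaussianMixture(n_components=n_mix).fit(vecs))
    return models

def gmm_decision(seg_post, models):
    """Score the segment sequence under each class GMM and pick the best class."""
    scores = [m.score_samples(seg_post).sum() for m in models]  # total log-likelihood
    return int(np.argmax(scores))
```

Tie handling for the majority vote is not specified in the paper; a natural backoff would be the average rule.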

3. EXPERIMENTS

3.1. Experimental Setup

Two different data sets are used in this study.

• USC AudioVisual data (USC) (in English) [14]. One actress was asked to read 121 sentences with 4 different emotions: happiness (H), anger (A), sadness (S), and neutral (N). The emotion class distribution is thus balanced in this database. We used the available sentences to perform forced alignment and generate the word boundary information needed for the word- or phrase-based segments.

• Berlin Emotional Speech Database (EMODB) (in German) [15]. The sentences in this database have emotionally neutral content. Ten actors read these sentences with 7 emotions: anger, boredom, disgust, fear, happiness, sadness, and neutral. To make the data comparable with the English set, we kept the sentences in four emotional categories (anger, happiness, sadness, and neutral), resulting in 339 sentences in total. The class distribution in this subset is 37% angry, 21% happy, 22% neutral, and 18% sad.

These two data sets differ in several dimensions: language, number of speakers, and class distribution. Using two different data sets allows us to evaluate the robustness and generality of our proposed method.

For the segment emotion model, we use an SVM with an RBF kernel from the WEKA 3 Data Mining Toolkit (http://www.cs.waikato.ac.nz/ml/weka/). For the GMM decision model, we use the Hidden Markov Model Toolkit (HTK, http://htk.eng.cam.ac.uk/docs/docs.shtml) with four mixture components. The baseline system is the sentence-level static modeling approach: the features for the entire sentence are the same as those used for the segments, as described in Section 2, and an SVM with an RBF kernel is used as the classifier.

We perform 10-fold cross-validation experiments. The data is split such that each fold has a similar number of sentences and a similar distribution over emotion categories. Note that for the GMM, since the segment emotion classifier's predictions are needed to train a second decision model, a second cross-validation is needed for decision-model training (a sketch of this nested setup is given after Table 3). The accuracy averaged over the 10 folds is used as the performance measure.

Table 2. Results of utterance level classification using different segment units and different decision methods on USC data.

Unit Type        Decision Model   Accuracy (%)
Whole utterance  -                84.9
3 words          Majority vote    85.3
3 words          Avg. of prob.    86.2
3 words          GMM              87.0
Phrases          Majority vote    81.8
Phrases          Avg. of prob.    83.1
Phrases          GMM              83.1
Time based       Majority vote    87.8
Time based       Avg. of prob.    88.4
Time based       GMM              88.4

Table 3. Confusion matrices using the whole utterance and the time based segments with the GMM decision model on USC data (rows: reference emotion, columns: hypothesized emotion).

(a) Using whole utterance
      A    H    N    S
A    95   20    2    4
H    24   94    2    1
N     0    0  114    7
S     0    3   10  108

(b) Using time based segment
      A    H    N    S
A   104   13    1    3
H    23   94    1    3
N     0    0  120    1
S     0    3    8  110
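As referenced in Section 3.1, the GMM decision model needs segment posteriors produced by a classifier that has not seen the corresponding training sentences. The sketch below is our own illustration of one outer fold of such a nested cross-validation, not the authors' code: scikit-learn stands in for WEKA and HTK, and the 5-fold inner split and all names are assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

def outer_fold(seg_feats, seg_sent_id, sent_labels, train_sents, test_sents):
    """One outer fold: train the segment SVM and produce segment posteriors.

    seg_feats: (M, 384) features for all M segments; seg_sent_id: sentence index
    of each segment; sent_labels: array of emotion labels indexed by sentence id.
    Returns posteriors for the training segments (from an inner cross-validation,
    used to train the GMM decision model) and posteriors for the test segments.
    """
    tr_mask = np.isin(seg_sent_id, train_sents)
    te_mask = np.isin(seg_sent_id, test_sents)
    X_tr, y_tr = seg_feats[tr_mask], sent_labels[seg_sent_id[tr_mask]]
    svm = SVC(kernel="rbf", probability=True)
    # Inner cross-validation: out-of-fold posteriors for the training segments,
    # so the decision model is not trained on over-confident predictions.
    inner = StratifiedKFold(n_splits=5)
    post_tr = cross_val_predict(svm, X_tr, y_tr, cv=inner, method="predict_proba")
    # Refit on all training segments and score the held-out test segments.
    post_te = svm.fit(X_tr, y_tr).predict_proba(seg_feats[te_mask])
    return post_tr, post_te
```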


Table 4. Results of utterance level classification using time based segments with the GMM decision model on EMODB data.

Unit Type        Decision Model   Accuracy (%)
Whole utterance  -                80.5
Time based       GMM              84.7

3.2. Emotion Recognition Results

Table 2 shows the results on the USC data for the three segment units and the three decision approaches, as well as for the whole-utterance baseline. Comparing the different segment units, time-based segments achieve the best performance, while phrase-based units perform worst. We notice that phrase-level segments are generally shorter (average duration about 0.5 seconds), whereas the 3-word segments have a length similar to the fixed-length segments (1 second). This difference in segment length partly explains the poor performance of the phrase units. With time-based units, all segments have the same duration, so the training and testing data are matched. These results are consistent with those in [12]. More importantly, the fact that time-based segments perform best has the additional advantage that the method does not rely on word alignment, speech recognition performance, or a particular language.

Among the decision models, the GMM achieves the best result, and averaging the posterior probabilities yields similar performance. Compared to the baseline using the whole utterance, the gain from time-based segments with the GMM decision model is 3.5% absolute on this data set (statistically significant). The confusion matrices for the whole utterance and for the time-based segments with the GMM decision model are shown in Table 3. There is a consistent performance improvement for the three emotion categories other than "happy".

We also evaluate whether our approach generalizes to other data sets. Table 4 shows the EMODB results using time-based units and the GMM decision model. Since we do not have a speech recognizer for German, word- and phrase-level segments cannot be used. The results show a significant performance improvement with our proposed model.

4. CONCLUSION AND DISCUSSION

In this paper we proposed to use information from subsentence segments to improve sentence-level emotion recognition. A segment-level emotion classifier generates emotion probabilities for the segments in a sentence, and these probabilities are then combined to make the final decision for the entire sentence. This method combines some of the advantages of the static and dynamic modeling approaches: the segments are longer than the frames typically used in dynamic modeling and thus capture acoustic/prosodic variation over longer regions, while subsentence segments provide richer information than the sentence-level statistics used in static modeling. Using decisions from segments also adds information that helps the system make a better sentence-level decision.

We evaluated three subsentence segment units and three decision models. Among the segment units, time-based segments outperformed the others, and the GMM-based decision model achieved the best performance. Compared with using the whole utterance, the gain from time-based segments with the GMM decision model is 3.5% and 4.2% (absolute) on the USC and EMODB data respectively. This gain is consistent across the two data sets, which makes this a promising direction for emotion recognition frameworks. In addition, no word-level time alignment is needed for time-based segments. This eliminates the need for a speech recognizer, so a language independent emotion recognition system can be developed.

In this study, when training the segment emotion models, we assumed that the emotion label of a segment is the same as that of the sentence containing it. However, some segments may not convey the same affective intent as the utterance they belong to. In that case, using automatic clusters, such as those used in [16], instead of predefined emotion categories might yield better performance. In future work, we will explore clustering-based labeling for subsentence segments within our current framework. We also plan to apply our approach to other spontaneous emotional data to evaluate its effectiveness.

5. ACKNOWLEDGMENT

This work is supported by U.S. Air Force Award FA9550-10-1-0388. Any opinions expressed in this work are those of the authors and do not necessarily reflect the views of the funding agency.

6. REFERENCES

[1] Björn Schuller, Stefan Steidl, and Anton Batliner, "The Interspeech 2009 emotion challenge," Proc. of Interspeech, pp. 312–315, 2009.


[2] Björn Schuller, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers, Christian Müller, and Shrikanth Narayanan, "The Interspeech 2010 paralinguistic challenge," Proc. of Interspeech, pp. 2794–2797, 2010.

[3] Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31 (1), pp. 39–58, 2009.

[4] Daniel Neiberg, Kjell Elenius, and Kornel Laskowski, "Emotion recognition in spontaneous speech using GMMs," Proc. of Interspeech, pp. 809–812, 2006.

[5] Björn Schuller, Gerhard Rigoll, and Manfred Lang, "Hidden Markov model-based speech emotion recognition," Proc. of ICASSP, vol. II, pp. 1–4, 2003.

[6] Yi-Lin Lin and Gang Wei, "Speech emotion recognition based on HMM and SVM," Proc. of ICMLC, pp. 4898–4890, 2005.

[7] Anton Batliner, K. Fischer, Richard Huber, Jorg Spilker, and Elmar Nöth, "How to find trouble in communication," Speech Communication, vol. 40, pp. 117–143, 2003.

[8] Mohammad T. Shami and Mohamed S. Kamel, "Segment-based approach to the recognition of emotions in speech," Proc. of ICME, 2005.

[9] Björn Schuller and Gerhard Rigoll, "Timing levels in segment-based speech emotion recognition," Proc. of Interspeech, pp. 1818–1821, 2006.

[10] Anton Batliner, Dino Seppi, Stefan Steidl, and Björn Schuller, "Segmenting into adequate units for automatic recognition of emotion-related episodes: A speech-based approach," Advances in Human-Computer Interaction, 2010.

[11] Taku Kudoh and Yuji Matsumoto, "Use of support vector learning for chunk identification," Proc. of CoNLL, pp. 142–144, 2000.

[12] Björn Schuller and Laurence Devillers, "Incremental acoustic valence recognition: An inter-corpus perspective on features, matching, and performance in a gating paradigm," Proc. of Interspeech, pp. 801–804, 2010.

[13] Florian Eyben, Martin Wöllmer, and Björn Schuller, "openEAR - Introducing the Munich open-source emotion and affect recognition toolkit," Proc. of ACII, pp. 576–581, 2009.

[14] Carlos Busso, Zhigang Deng, Serdar Yildirim, Murtaza Bulut, Chul Min Lee, Abe Kazemzadeh, Sungbok Lee, Ulrich Neumann, and Shrikanth Narayanan, "Analysis of emotion recognition using facial expressions, speech and multimodal information," Proc. of ICMI, pp. 205–211, 2004.

[15] Felix Burkhardt, Astrid Paeschke, Miriam Rolfes, Walter Sendlmeier, and Benjamin Weiss, "A database of German emotional speech," Proc. of Interspeech, pp. 1517–1520, 2005.

[16] Emily Mower, Kyu J. Han, Sungbok Lee, and Shrikanth Narayanan, "A cluster-profile representation of emotion using agglomerative hierarchical clustering," Proc. of Interspeech, pp. 797–800, 2010.