Problems of Automatic Emotion Recognition in Spontaneous Speech: An Example of Recognition in a Dispatcher Center

Klára Vicsi, Dávid Sztahó

Laboratory of Speech Acoustics, Budapest University of Technology and Economics, Department of Telecommunication and Media Informatics, Stoczek u. 2, 1111 Budapest, Hungary
[email protected],
[email protected]
Abstract. Numerous difficulties in the examination of emotions occurring in continuous spontaneous speech are discussed in this paper; then different emotion recognition experiments are presented, using clauses as the recognition unit. In a testing experiment we examined which acoustical features are the most important for the characterization of emotions, using a spontaneous speech database. An SVM classifier was built for the classification of the 4 most frequent emotions. It was found that the fundamental frequency, the energy and their dynamics within a clause are the main characteristic parameters of the emotions, and that averaged spectral information, such as MFCCs and harmonicity, is also very important. In a real-life experiment, an automatic recognition system was prepared for a telecommunication call center. Summing up the results of these experiments, we can say that clauses can be an optimal unit for the recognition of emotions in continuous speech.

Keywords: speech emotion recognition, telephone speech, prosodic recognizer, speech emotion database
1 Introduction
Two channels have been distinguished in human speech interaction. One conveys messages with a specific semantic content (verbal channel); the other (the non-verbal channel) conveys information related both to the image content of a message and to the general feeling and emotional state of the speaker [1]. Enormous efforts have been made in the past to understand the verbal channel, whereas the role of the non-verbal channel is less well understood. People show affect in many ways in speech: changes in speaking style, tone of voice and intonation are commonly used to express personal feelings and emotions, often at the same time as imparting information. Research on various aspects of paralinguistic and extralinguistic speech has gained considerable importance in recent years.
There are results in the literature about the characterization and recognition of emotions in speech, but those results were obtained on clear laboratory speech [2, 3, 4, 5]. Most of them used simulated emotional speech, most frequently produced by actors. On the basis of these acted databases, researchers could find answers to many scientific questions connected with the perception of emotional information [7]. But real-world data differ considerably from acted speech, and for speech technology applications, real-world data processing is necessary. In recent years some works have been published dealing with the examination [8] and recognition [9] of emotion in spontaneous everyday conversations. The other big problem of emotion research is the categorization of the emotions themselves. A group of emotion categories commonly used in psychology, linguistics and speech technology is the one described in the MPEG-4 standard [6]: happiness, sadness, anger, surprise, and scorn/disgust. The emotions in the MPEG-4 standard were originally created to describe facial animation parameters (FAPs) of virtual characters. But these 5 emotion categories do not cover the emotions occurring in spontaneous speech, and in general the type and frequency of the emotions used differ across different selections of real-world data. This paper describes an emotion recognition experiment on a corpus of spontaneous everyday conversations between telephone dispatchers and customers, recorded through the telephone line. The acoustical pre-processing was prepared on the basis of our former experiment on acted speech [10]. Not only were the acoustical parameters of the emotions examined and classified; statistics of words and word connections in texts with different emotional content are also planned to be prepared on the basis of the corpus.
2 System description
During a speech conversation, especially a long one, the speaker's emotional state changes. If we want to follow the states of the speakers, we have to divide the continuous speech flow into segments, and thus we can examine how the emotional state of a speaker changes segment by segment through the conversation. The clause was selected as the segmentation unit in our system, on the basis of the experience of our earlier study [10]. The segmentation into clause-sized units was done by our prosodic segmenter [11]. (This segmenter was developed for the semantic processing of input speech in ASR, and is used to perform clause and sentence boundary detection and modality (sentence type) recognition. The classification is carried out by HMMs, which model the prosodic contour for each clause and/or sentence modality type. Clause (and hence also sentence) boundary detection is based on the HMM's excellent capacity for dynamically aligning the reference prosodic structure to the utterance coming from the ASR input. The method also allows placing punctuation marks automatically.)
After the acoustic pre-processing, these clause-sized segments were classified according to their emotional content, using a support vector machine (SVM) classifier. The block diagram of our system is presented in Fig. 1.

Fig. 1. Recognition system of our speech emotion classifier. (Block diagram: the speech corpus labelled according to emotions and the clause unit boundaries from the clause unit prosodic segmenter feed the acoustical pre-processing; the resulting clause chain with emotion features is classified by the support vector machine, followed by decision and monitoring.)
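To make the data flow of Fig. 1 concrete, the following minimal Python sketch shows how such a chain could be organized. The function names and the injected components are hypothetical placeholders, not the authors' implementation (the actual segmenter and classifier used in the paper are described in [10, 11]).

from typing import Callable, List, Sequence, Tuple

def recognize_emotions(
    speech: Sequence[float],
    segment_clauses: Callable[[Sequence[float]], List[Tuple[int, int]]],
    clause_features: Callable[[Sequence[float], Tuple[int, int]], List[float]],
    classify_clauses: Callable[[List[List[float]]], List[str]],
) -> List[Tuple[Tuple[int, int], str]]:
    """Run the chain of Fig. 1: segment -> pre-process -> classify each clause."""
    boundaries = segment_clauses(speech)                        # clause unit boundaries
    vectors = [clause_features(speech, b) for b in boundaries]  # statistical feature vectors
    labels = classify_clauses(vectors)                          # SVM decision per clause
    return list(zip(boundaries, labels))                        # clause chain with emotion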
In the beginning, four different emotional states were differentiated in the recorded dialogues: neutral (N), nervous (I), querulous (P), and others (E). In the later experiments these emotions were grouped. Then the classified clauses were monitored through a time window, much longer than the clause's duration, to give a more certain decision about the speaker's emotional states.

2.1 Acoustical pre-processing
In general, the fundamental frequency and intensity time courses have been the most commonly used features for the expression of emotions, both in speech recognition and in synthesis. In our earlier automatic emotion recognition experiments it was found that adding spectral information greatly improves the recognition results [10]. Accordingly, the fundamental frequency (F0i), the intensity values (Ei), 12 MFCCs and their deltas were measured using a 150 ms time window in 10 ms time frames, giving a 28-dimensional feature vector every 10 ms (a sketch of this frame-level analysis is given after the list below). Then the clause unit prosodic segmenter marks the boundaries of the clauses in the speech, resulting in a chain of clause units. On the basis of the feature vectors computed every 10 ms, statistical data were calculated, characterizing each clause unit by a multidimensional statistical feature vector, as presented in Fig. 2. These statistical feature vectors were calculated as follows: F0i values were normalized by the F0 value of the first time frame, and E values were normalized to the maximum value of E within each clause. The following statistical data were then calculated from these normalized parameters for each clause:

− max, min, mean, median values of F0i
− max, min, median, skew values of ∆F0i
− mean, median of Ei
− max, min, mean, median, skew values of ∆Ei
− maxima, minima, mean values of MFCCi
− maxima, minima, mean values of ∆MFCCi
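As a rough illustration of the frame-level analysis described above, the sketch below uses librosa as a stand-in front end; the F0 search range and the use of RMS as the intensity measure are our assumptions, not specifications from the paper.

import numpy as np
import librosa

def frame_features(wav_path: str, sr: int = 8000,
                   win_ms: int = 150, hop_ms: int = 10) -> np.ndarray:
    """Return a (28, n_frames) matrix: F0, energy, 12 MFCCs and their deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * win_ms / 1000)      # 150 ms analysis window (1200 samples at 8 kHz)
    hop = int(sr * hop_ms / 1000)        # 10 ms frame step (80 samples)
    # F0 track; unvoiced frames are returned as NaN and set to 0 here
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                            frame_length=n_fft, hop_length=hop)
    f0 = np.nan_to_num(f0)
    # RMS energy as a simple intensity measure
    energy = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)[0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=n_fft, hop_length=hop)
    n = min(len(f0), len(energy), mfcc.shape[1])
    base = np.vstack([f0[:n], energy[:n], mfcc[:, :n]])        # 14 features per frame
    return np.vstack([base, librosa.feature.delta(base)])      # plus their deltas -> 28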
Fig. 2. Acoustical pre-processing. (Block diagram: from the speech corpus and the clause unit boundaries, the F0i, ∆F0i, Ei, ∆Ei and MFCCi, ∆MFCCi streams are each normalized and reduced to statistical values, which are then composed into the multidimensional statistical feature vector.)
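The per-clause statistics listed above could then be collected roughly as follows (a numpy/scipy sketch; the ordering of the 87 values, the scaling of the delta streams and the handling of an unvoiced first frame are our choices, since the paper does not specify them).

import numpy as np
from scipy.stats import skew

def clause_vector(feats: np.ndarray) -> np.ndarray:
    """feats: (28, n) frame features of one clause -> 87-dimensional statistical vector."""
    f0, e, mfcc = feats[0], feats[1], feats[2:14]
    df0, de, dmfcc = feats[14], feats[15], feats[16:28]
    ref = f0[0] if f0[0] != 0 else 1.0               # normalize F0 by the first frame
    emax = np.max(e) if np.max(e) > 0 else 1.0       # normalize energy by its clause maximum
    f0, df0 = f0 / ref, df0 / ref                    # (delta scaling is our assumption)
    e, de = e / emax, de / emax
    stats = [np.max(f0), np.min(f0), np.mean(f0), np.median(f0),           # 4
             np.max(df0), np.min(df0), np.median(df0), skew(df0),          # 4
             np.mean(e), np.median(e),                                     # 2
             np.max(de), np.min(de), np.mean(de), np.median(de), skew(de)] # 5
    for m in (mfcc, dmfcc):                          # per-coefficient max, min, mean: 2 x 36
        stats += list(np.max(m, axis=1)) + list(np.min(m, axis=1)) + list(np.mean(m, axis=1))
    return np.asarray(stats)                         # 15 + 72 = 87 values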
Thus each clause was characterized by an 87-dimensional vector.

2.2 Speech corpora
The speech corpora, a collection of customer-service dialogues, were recorded through the telephone line (250 Hz–3500 Hz bandwidth) at an 8000 Hz sampling rate with 16-bit resolution. 1000 calls were recorded. The duration of the dialogues between the dispatchers and the customers varied between 1 and 30 minutes. The "Transcriber" tool [13] was used for the segmentation and labeling, as it is suitable for parallel processing. We wanted to know the linguistic content, the place of the clauses in the speech, the emotional content of each clause, and moreover who is speaking, the gender of the speakers, etc. Thus the linguistic content, the clause boundaries, the emotions, and the speakers with their gender were marked in the acoustic signal. Our prosodic recognizer and segmenter [11] were used to mark the clause boundaries in the recorded speech. Then experts labeled the clause segments by hand according to their emotional content. Four different emotional states were differentiated in the recorded dialogues: neutral (N), nervous (I), querulous (P), and others (E). Practically no other emotion types occurred in the 1000 calls, only these 4. Unfortunately, in many cases the customer's speech was neutral. Altogether there were 346 nervous clauses, 603 querulous and 225 other clauses in the customers' speech, and 603 typical neutral clauses were selected from the neutral ones for the classification experiment. An example of the segmentation and labeling of the dialogues is presented in Fig. 3. The handmade segmentation and labeling is presented in the third line, and the labeling result of our classifier is presented below it. The texts of the customer and the dispatcher speech were marked in parallel with the speech and its emotion.
Fig. 3. An example of a segmented and labeled conversation, with the dispatcher and customer tracks and their emotion labels. U: silent period, N: neutral, I: nervous, P: querulous and E: else.
2.3 Testing of the system
2.3.1 Classification of clauses according to their emotion

For the training and testing of our emotion classifier, the so-called leave-one-out cross-validation (LOOCV) was used [12]: a single clause from a call is used as the validation data, and the remaining clauses in the calls as the training data. This is repeated so that each clause in the calls is used once as the validation data. The confusion matrix for the four emotions is shown in Table 1.
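As an illustration, this leave-one-out evaluation could be reproduced roughly as follows with scikit-learn; the SVM kernel, the feature scaling and the other settings shown here are assumptions, since the paper does not report them.

import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix

def loocv_confusion(X: np.ndarray, y: np.ndarray, labels=("E", "I", "N", "P")):
    """X: (n_clauses, 87) statistical feature vectors, y: emotion label per clause."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())   # one clause held out per fold
    cm = confusion_matrix(y, y_pred, labels=list(labels))
    per_class = cm.diagonal() / cm.sum(axis=1)                # the "Correct" column of Table 1
    return cm, per_class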
Table 1. E, I, P, N emotions classified into separate classes (rows: hand-labeled emotion, columns: automatic classification result)

        E     I     N     P    Correct
E      49    26    62    88    22%
I       9   153    60   124    44%
N      14    38   398   153    66%
P      11    70   157   365    60%
                       average 54%
The I and P emotions are hard to differentiate not only for the classifier but also for humans. Thus the I, P and E classes were merged into one class, denoted the "discontent" emotion. This "discontent" class and the class of neutral emotions were differentiated and trained. The testing results are shown in Table 2.

Table 2. The (E, I, P) and (N) emotions classified into two separate classes, discontent and neutral
        EIP    N    Correct
EIP     887   287    76%
N       335   839    71%
                   average 73%

2.3.2 Monitoring of the emotional state of the customer
The aim of this research work is to find a way to detect the emotional state of the customer automatically throughout a conversation. To this end, a 15 s long monitoring window was selected, and the clauses automatically classified as having the "discontent" emotion were counted in this window. Then the window was moved forward in 10 s time steps. Fig. 4 presents some examples of how the automatic monitoring detects the degree of discontentment of the customer through the conversation. (Discontentment was 100% when all clauses in the monitoring window were detected as discontent.) The results obtained automatically are compared with those obtained from the hand-labeled data. On the whole, in the case of continuous monitoring, the average distance between the results obtained automatically and those from the hand-labeled data is 11.3%, comparing the monitored values at every 10 s time step and averaging the differences over the full database.
Fig. 4. The degree of discontentment of the customer through a conversation. (Discontentment was 100% when all clauses in the 15 s monitoring window were detected as discontent.) The results obtained automatically are compared with those obtained from the hand-labeled data.
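A minimal sketch of this monitoring step is given below; the clause representation (start time, end time, label, in seconds) is a hypothetical interface, not the authors' data format.

from typing import List, Tuple

def discontentment_curve(clauses: List[Tuple[float, float, str]], total_s: float,
                         win_s: float = 15.0, step_s: float = 10.0) -> List[float]:
    """clauses: (start_s, end_s, label); return % of discontent clauses per window."""
    curve, t = [], 0.0
    while t < total_s:
        in_win = [lab for (start, end, lab) in clauses
                  if start < t + win_s and end > t]           # clause overlaps the window
        share = (100.0 * sum(lab == "discontent" for lab in in_win) / len(in_win)
                 if in_win else 0.0)
        curve.append(share)
        t += step_s                                            # move the window by 10 s
    return curve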
In real usage, the main goal of the automatic recognition system is to signal when the discontentment has reached a critical level. We call this the "alarm level". This "alarm level" has to be set manually. For example, setting this "alarm level" to 30 percent (that is, an alarm when discontentment is above 30%), it was marked where the hand-labeled value was over this 30% alarm level and where the automatic value was over this level. The comparison was then done frame by frame, and the differences were considered alarm detection errors. Thus the average detection error was 20.8%. The main reason for the errors is a small shift between the automatic recognition data and the hand labeling at some places, as illustrated in Fig. 5.
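The alarm comparison itself reduces to thresholding the two discontentment curves and counting the steps where the decisions differ, roughly as in the following sketch (the two curves are assumed to be sampled at the same 10 s steps).

from typing import Sequence

def alarm_error(auto_curve: Sequence[float], hand_curve: Sequence[float],
                alarm_level: float = 30.0) -> float:
    """Fraction of monitoring steps where the two alarm decisions disagree."""
    n = min(len(auto_curve), len(hand_curve))
    mismatches = sum((a > alarm_level) != (h > alarm_level)
                     for a, h in zip(auto_curve[:n], hand_curve[:n]))
    return mismatches / n if n else 0.0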
Fig. 5. An enlarged part of the third diagram in Fig. 4, as an example of the small shifts between the hand-labeled and automatically recognized curves.
3 Discussion
The semantic content (verbal channel) and the general feeling and emotional state of the speaker (non-verbal channel) are expressed at the same time in speech, and the semantic content contributes to human emotion recognition. In the experiments discussed here, only the non-verbal parameters were used for training and classification, without any linguistic information. In our second experiment, in Section 2.3.2, the classified clauses were monitored through a time window much longer than the clause duration, to give a more reliable decision about the speaker's emotional state. This monitoring technique seems capable of giving an alarm if the discontentment of the customer exceeds a threshold, without using the verbal channel. It is clear that much better results could be obtained if some information from the verbal channel were integrated into the system. This is why we recorded the linguistic content as well during the database processing, as described in Section 2.2. In the future we plan to process some linguistic information too, and we are going to integrate the information of the two channels.
4 Conclusions
The aim of this research work was to find a way to detect the emotional state of the customer automatically through a spontaneous conversation, and to signal when the discontentment level of a customer has reached a critical level. The simple classification of the four emotions (neutral, nervous, querulous, and others) was not very useful. A likely reason is that separating the nervous and querulous emotions was not a good decision, because they are hard to differentiate even for humans, especially without linguistic content.
Thus the I, P and E classes were merged into one class, denoted the "discontent" emotion. In this case the two classes, discontent and neutral, were correctly classified in 73% of cases on average. Using our automatic monitoring technique for detecting the degree of discontentment of the customer (a 15 s long monitoring window moved in 10 s time steps), the average distance between the discontentment obtained automatically and that from the hand-labeled material is 11.3%. On the basis of this monitoring technique it is possible to signal when the discontentment has reached a critical "alarm level", in our case set at 30% discontentment. The average alarm detection error was 20.8%, but in most cases the reason for this error is only a small shift between the automatic recognition data and the hand labeling, resulting only in a delay or an anticipation of an alarm period. Accordingly, the described decision technique can be useful in dispatcher centers, in spite of the fact that only the non-verbal content of the speech was used for the decision. Of course, in the future we are going to process the linguistic content too, in parallel with the non-verbal one, hopefully yielding even better results.
5 Acknowledgements
The research was prepared within the framework of the TÁMOP-4.2.2-08/1/KMR-2008-0007 project. We would like to thank the leaders of SPSS Hungary Ltd. and INVITEL Telecom Zrt. for giving us free access to the 1000 recorded dialogues.
6 References
[1] Campbell, N.: Getting to the Heart of the Matter. Keynote speech, in: Proc. Language Resources and Evaluation Conference (LREC-04), Lisbon, Portugal (2004)
[2] Hozjan, V., Kacic, Z.: A rule-based emotion-dependent feature extraction method for emotion analysis from speech. The Journal of the Acoustical Society of America, Vol. 119, Issue 5 (2006) 3109–3120
[3] Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: towards a new generation of databases. Speech Communication 40 (2003) 33–60
[4] Burkhardt, F., Paeschke, A. et al.: A Database of German Emotional Speech. In: Proc. of Interspeech 2005, pp. 1517–1520 (2005)
[5] Navas, E., Hernáez, I., Luengo, I.: An Objective and Subjective Study of the Role of Semantics and Prosodic Features in Building Corpora for Emotional TTS. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 4, July (2006)
[6] MPEG-4: ISO/IEC 14496 standard. http://www.iec.ch (1999)
[7] Esposito, A.: The Perceptual and Cognitive Role of Visual and Auditory Channels in Conveying Emotional Information. Cognitive Computation, DOI 10.1007/s12559-009-9017-8, Springer Science+Business Media, LLC (2009)
[8] Campbell, N.: Individual Traits of Speaking Style and Speech Rhythm in a Spoken Discourse. In: COST Action 2102 International Conference on Verbal and Nonverbal Features, Patras, Greece, October 2007, pp. 107–120 (2007)
[9] Kostoulas, T., Ganchev, T., Fakotakis, N.: Study on Speaker-Independent Emotion Recognition from Speech on Real-World Data. In: COST Action 2102 International Conference on Verbal and Nonverbal Features, Patras, Greece, October 2007, pp. 235–242 (2007)
[10] Tóth, Sz. L., Sztahó, D., Vicsi, K.: Speech Emotion Perception by Human and Machine. In: Proceedings of the COST Action 2102 International Conference, Patras, Greece, October 29-31, 2007; Revised Papers in Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction. Springer LNCS, ISBN 978-3-540-70871-1, pp. 213–224 (2008)
[11] Vicsi, K., Szaszák, Gy.: Using Prosody for the Improvement of ASR: Sentence Modality Recognition. In: Interspeech 2008, Brisbane, Australia, 23-26 September 2008. ISCA Archive, http://www.isca-speech.org/archive (2008)
[12] Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence 2 (12): 1137–1143 (1995)
[13] Transcriber. http://trans.sourceforge.net/en/presentation.php