Real-life Emotion Representation and Detection in Call Centers Data

Laurence Vidrascu, Laurence Devillers
Department of Human-Machine Communication, LIMSI-CNRS, BP 133, 91403 Orsay cedex, France
{vidrascu,devil}@limsi.fr

Abstract. Since the early studies of human behavior, emotions have attracted the interest of researchers in neuroscience and psychology. More recently, emotion has become a growing field of research in computer science. We are exploring how to represent and automatically detect a subject's emotional state. In contrast with most previous studies conducted on artificial data, this paper addresses some of the challenges faced when studying real-life, non-basic emotions. Real-life spoken dialogs from call-center services reveal the presence of many blended emotions. A soft emotion vector is used to represent emotion mixtures. This representation yields a much more reliable annotation and makes it possible to select the part of the corpus without conflictual blended emotions for training models. A correct detection rate of about 80% is obtained between Negative and Neutral emotions and between Fear and Neutral emotions using paralinguistic cues on a corpus of 20 hours of recordings.

1 Introduction

Detecting emotions can help orient the evolution of a human-computer interaction via dynamic modification of the dialog strategy. We focus on real-life corpora, which allow the study of complex and natural emotional behavior. Our aim is twofold: to study emotion mixtures in natural data and to achieve a high level of performance in real-life emotion detection. Call centers provide an interesting setting for recording people in various natural emotional states, since the recordings can be made imperceptibly. Among the natural corpora for emotion detection, we can mention the 'Lifelog' corpus, consisting of everyday interactions between a female speaker and her family and friends [1]; the 'Interviews' corpus, also known as the Belfast database [2]; the 'EmoTV' corpus, a set of TV interviews in French recorded in the HUMAINE project [3]; and medical dialogs [4]. Obviously, the types of emotions found in a corpus depend heavily on the task and situation. A corpus of real agent-client dialogs (20 hours) recorded in French at a medical emergency call center is studied in this paper.

There are many reviews on the representation of emotions; for a recent one, the reader is referred to Cowie & Cornelius [5]. Three types of theories are generally used to represent emotions: abstract dimensions, appraisal dimensions and, most commonly, verbal categories. According to Osgood [6], the communication of emotions is conceptualized along several dimensions such as Evaluation, Power and Activation. Other subjective dimensions include intensity, control, tension, etc. Instead of naming emotions as discrete categories, they are defined along continuous abstract dimensions. The most widely employed scheme is based on two perceptual abstract dimensions, Activation and Evaluation, and has been used to annotate several corpora with the Feeltrace tool [7]. However, other dimensions are necessary, for instance for distinguishing between Fear and Anger. The appraisal theory [8] provides a detailed specification of appraisal dimensions that are assumed to be used in evaluating emotion-antecedent events (novelty, pleasantness, goal relevance, etc.). The major methodological problem is that the only reliable way to ensure a correct annotation is to ask the persons themselves to perform it: if done in real time this can affect the expression of the emotions, but if done a posteriori, it relies on the person's recall of the situation. New abstract dimensions are suggested by appraisal theorists [8], for example the goal conduciveness/obstructiveness of an event or the individual's potential to cope with its consequences. Most emotion detection studies have focused only on verbal categories, using a minimal set of emotions to remain tractable, often basic emotions (anger, fear, etc.). However, it is well known that linguistic labels are imprecise and capture only a specific aspect of the phenomena, namely those aspects that are immediately relevant for speakers in a particular context.

In this work we make the assumption that it is possible to perceive and to annotate mixtures of emotions. There is no clear typology of the different mixtures of emotions. Blended emotions can be considered as the presence of two or more emotions at the same time, one of which may be considered Major and the others Minor. The Major emotion can be related to the dominant emotion in the brain, with the other perceived emotions in a mixture being secondary. Our philosophy is to represent emotional states with a soft emotion vector combining Major and Minor emotions [3, 10]. The same emotion representation with Major and Minor labels has been used for audiovisual data [10]. A soft vector is obtained by combining the Major and Minor annotations produced by the different raters. Such a soft representation allows a more reliable annotation as well as a better knowledge of the emotion mixtures present in the corpus. For detection purposes, this knowledge is used to select appropriate subsets of the corpus in order to obtain better detection performance.

Automatic systems can lead to a deeper understanding of the perception of emotion by identifying the cues relevant to emotion detection in natural emotional states. Our detection experiments use prosodic and spectral features, as well as some disfluencies (hesitations, pauses) and non-verbal events such as laughter, inspiration and expiration, which were time-stamped during the transcription phase. We report in this paper our experiments on classification between Negative and Neutral classes and between Fear and Neutral classes, without taking into account the conflictual emotion mixtures. Section 2 is devoted to the description of the corpus. Section 3 presents the emotion labeling scheme and the analysis and reliability of our annotations. Section 4 describes the cues and the selection procedure. Models and results are briefly reported in Section 5.

2 Real-life Corpus

The studies reported in this paper were done on a corpus of naturally-occurring dialogs recorded in a real-life call center. The corpus contains real agent-client recordings obtained under a convention between a medical emergency call center and LIMSI-CNRS. The transcribed corpus contains about 20 hours of data. The service center can be reached 24 hours a day, 7 days a week; its aim is to offer medical advice. An agent follows a precise, predefined strategy during the interaction in order to efficiently acquire the important information. The agent's role is to determine the call topic and to obtain sufficient details about the situation so as to be able to evaluate the urgency of the call and make a decision. In the case of emergency calls, the patients often express stress, pain, fear of being sick or even real panic. This study is based on a 20-hour subset comprising 688 agent-client dialogs (7 different agents, 784 clients). About 10% of the speech data is not transcribed because of heavily overlapping speech. Table 1 summarizes the characteristics of the corpus.

Table 1. Corpus characteristics: 688 agent-client dialogs of around 20 hours (M: male, F: female)

#agents         | 7 (3M, 4F)
#clients        | 784 (271M, 513F)
#turns/dialog   | average: 48
#distinct words | 9.2 k
#total words    | 262 k

The use of these data carefully respected ethical conventions and agreements ensuring the anonymity of the callers, the privacy of personal information and the non-diffusion of the corpus and annotations. The transcription guidelines are similar to those used for spoken dialogs in the Amities project (www.dcs.shef.ac.uk/nlp/amities/). Some additional markers have been added to denote named entities, breath, silence, unintelligible speech, laughter, tears, throat clearing and other noises (mouth noise).

3 Emotion Annotation

Our emotion and meta-data annotation scheme and the choice of our labels have been described in [9]. The meta-data are given at the dialog level. The scheme specification at the segment level enables the annotation of emotion labels and abstract dimensions, with one or two emotion labels per segment selected from fine-grained and coarse-grained labels, as well as some local emotional context cues. The audio signal was further segmented into emotional segments wherever the annotators felt it was appropriate, so the temporal grain can be finer than the speaker turn. In order to minimize the annotation time, only the abstract dimensions which are complementary to the verbal categories are labeled. The bi-polar Valence (Negative/Positive) is deduced from the fine-grained verbal labels; the only ambiguity concerns the class 'Surprise', which can be associated with both positive and negative emotions. Activation (passive, normal, active) is often confused with Intensity (low, middle, high) by non-expert annotators. In these annotations, only intensity is rated, on a 5-level scale, and rough labels for the bi-polar Activation (Passive/Active) were extracted from the fine-grained verbal labels. We also added a new dimension named Self-Control (not the Power/Control described in [6]). The Self-Control dimension is a meta-annotation describing the perception of the self-control of the person (from controlled to uncontrolled on a 7-level scale). All combinations of categories are possible with only one exception: for the verbal label Surprise, which has an ambiguous valence, there must be a Minor label indicating the valence. The set of labels is hierarchically organized, from coarse-grained to fine-grained labels, in order to deal with the lack of occurrences of fine-grained emotions and to allow for different annotator judgments. A definition and instances of each label were given in the annotation protocol. Because there are few positive occurrences in the corpus, there is only one coarse positive class. The annotation level used to train the emotion detection system can be chosen based on the number of segments available (see the mapping sketch after Table 2).

Table 2. Emotion classes hierarchy: multiple levels of granularity

Valence level        | Coarse level (7 classes) | Fine-grained level (20 classes + Neutral)
Negative             | Fear                     | Fear, Anxiety, Stress, Panic, Embarrassment, Dismay
Negative             | Anger                    | Anger, Annoyance, Impatience, ColdAnger, HotAnger
Negative             | Sadness                  | Sadness, Disappointment, Resignation, Despair
Negative             | Hurt                     | Hurt
Negative or Positive | Surprise                 | Surprise
Positive             | Positive                 | Interest, Compassion, Amusement, Relief
Neutral              | Neutral                  | Neutral
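As an illustration, the hierarchy can be encoded as a simple mapping used to re-label segments at the chosen granularity. The sketch below (in Python, with illustrative names not taken from the paper) merely restates Table 2.

```python
# Coarse class for each fine-grained label (restates Table 2); segments can be
# re-labeled at the coarse or valence level when fine-grained counts are too low.
COARSE_TO_FINE = {
    'Fear':     ['Fear', 'Anxiety', 'Stress', 'Panic', 'Embarrassment', 'Dismay'],
    'Anger':    ['Anger', 'Annoyance', 'Impatience', 'ColdAnger', 'HotAnger'],
    'Sadness':  ['Sadness', 'Disappointment', 'Resignation', 'Despair'],
    'Hurt':     ['Hurt'],
    'Surprise': ['Surprise'],
    'Positive': ['Interest', 'Compassion', 'Amusement', 'Relief'],
    'Neutral':  ['Neutral'],
}
FINE_TO_COARSE = {fine: coarse
                  for coarse, fines in COARSE_TO_FINE.items() for fine in fines}
COARSE_TO_VALENCE = {'Fear': 'Negative', 'Anger': 'Negative', 'Sadness': 'Negative',
                     'Hurt': 'Negative', 'Surprise': 'Negative or Positive',
                     'Positive': 'Positive', 'Neutral': 'Neutral'}

def relabel(fine_label, level='coarse'):
    """Map a fine-grained label to the requested annotation level."""
    coarse = FINE_TO_COARSE[fine_label]
    return coarse if level == 'coarse' else COARSE_TO_VALENCE[coarse]

print(relabel('Annoyance'))          # Anger
print(relabel('Stress', 'valence'))  # Negative
```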

3.1 Annotation validation

The high subjectivity of human annotation requires the use of rigorous annotation protocols. After deciding on the list of labels and the adopted scheme, precise rules for segmentation must be determined, along with the number of annotators and the validation procedures. We also have to consider inter-labeler consistency and confidence measures. There are different measures of annotation reliability, for instance the widely used Kappa inter-coder agreement measure for categorical labels and Cronbach's alpha for continuous variables. For those measures, one label per segment is normally used. When a mixture of emotions is annotated, a solution is to compare only the Major labels, or to add rules that improve inter-labeler agreement, such as counting Major/Minor = Minor/Major as an agreement. We have adopted a procedure of self re-annotation of small sets of dialogs at different times (for instance once a month) in order to judge the intra-annotator coherence over time. As shown in Table 3, the annotation reliability seems to stabilize around 85%. This result illustrates the difficulty of the task.

Table 3. Intra-labeler reliability in terms of % agreement between two annotations by the same labeler at different times. Dec-Feb means first annotation in December, re-annotation in February (14 dialogs); Jan-Feb, first annotation in January, re-annotation in February (11 dialogs); Mar-Apr (16 dialogs); Apr-May (16 dialogs). The two lines each for Agent and Client correspond to the two annotators (Labeller 1 and Labeller 2). (The corpus annotation started in December.)

                 | Dec-Feb         | Jan-Feb         | Mar-Apr         | Apr-May
Agent (Lab. 1)   | 76.4 (369 seg.) | 82.9 (287 seg.) | 86.1 (495 seg.) | 85.7 (405 seg.)
Agent (Lab. 2)   | 66.5 (369 seg.) | 80.8 (279 seg.) | 86.8 (499 seg.) | 87.6 (412 seg.)
Client (Lab. 1)  | 73.9 (356 seg.) | 83.9 (255 seg.) | 83.4 (499 seg.) | 84.2 (442 seg.)
Client (Lab. 2)  | 78.5 (350 seg.) | 76.5 (254 seg.) | 81.4 (505 seg.) | 85.8 (450 seg.)

Since segments were labeled by more than one labeler, and since segments could be assigned one or two labels, it was necessary to create a mapping (i.e. to reduce the multiple labels per segment to one label) for the machine learning experiments. Let us consider each annotation as a vector (Major, Minor). The mapping combines the N (Major, Minor) vectors (for N annotators) into a soft emotion vector. Different weights are given to the emotion annotations: one weight for the Major emotions (wM) and another for the Minor emotions (wm) [9]. About 50% of the corpus was thus labeled as neutral. The top fine-grained emotion classes were Neutral, Anxiety, Stress, Relief and Amusement for clients, and Neutral, Interest, Compassion and Surprise for agents. In order to assess the consistency of the selected labels, the inter-annotator agreement was calculated. The Kappa value is 0.55 for clients and 0.35 for agents when only the Major annotation is considered. The Kappa values are slightly better (0.6 and 0.37, respectively) if a rule allowing agreement on either of the two annotations, Major or Minor, is used.
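A minimal sketch of how such a soft vector could be built from the (Major, Minor) annotations is given below. The weight values and the normalization are assumptions for illustration; they are not the wM/wm values used in the experiments.

```python
from collections import defaultdict

def soft_emotion_vector(annotations, w_major=2.0, w_minor=1.0):
    """Combine (Major, Minor) pairs from N labelers into one soft vector.

    `annotations` holds one (major, minor) pair per labeler; `minor` may be
    None. The weights are illustrative, not the wM/wm values used in [9].
    """
    scores = defaultdict(float)
    for major, minor in annotations:
        scores[major] += w_major
        if minor is not None:
            scores[minor] += w_minor
    total = sum(scores.values())
    # Normalize so that the vector components sum to 1.
    return {label: score / total for label, score in scores.items()}

# Two labelers annotating the same segment:
print(soft_emotion_vector([('Anxiety', 'Annoyance'), ('Anxiety', None)]))
# {'Anxiety': 0.8, 'Annoyance': 0.2}
```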

3.2 Blended emotions

Labeler 1 assigned a Minor label to 31% of the non-neutral segments, whereas Labeler 2 did so for only 17%. A first rough typology of emotion mixtures was defined. Mixed emotions within the same coarse-grained label are noted as Ambiguous: a labeler perceiving an emotion between Annoyance and HotAnger would label it "Annoyance/HotAnger". A mixture of two different coarse-grained labels is called Conflictual if they do not have the same valence, and Non-conflictual otherwise; the non-conflictual mixtures can be further separated into positive and negative. An example of a non-conflictual mixture would be "Anxiety/Annoyance" (two negative labels), whereas a mixture of worry and relief would be conflictual. When perceived together with another emotion, the class 'Surprise' does not fit into these categories because its valence is not fixed, which accounts for a separate Surprise class. For analysis purposes, the Conflictual and Non-conflictual emotion mixtures are clearly the most interesting data. It should be noted (see Figure 1) that both annotators perceived mixtures in those classes and that they appear at different positions in the dialog, for both Agent and Client: a recurring blended emotion for an agent is to feel both compassion and annoyance towards a caller, while a caller might feel worry coupled with relief from knowing that help is on its way. For detection purposes, we have not considered those mixed-emotion segments; instead, we improve the performance of the system by choosing non-blended and non-conflictual emotion segments.

[Figure 1: bar chart of the number of segments in each mixed-emotion category (Ambiguous, Blended: 2 Neg, Blended: 2 Pos, Conflictual, Surprise) for the two labelers lab1 and lab2.]

Fig. 1. Distribution of the mixed emotions for each labeler. Lab1 and Lab2 are the two labelers; Blended: 2 Pos means that the two labels are chosen from two different positive coarse-grained labels ('Amusement', 'Relief', 'Compassion/Interest'); Blended: 2 Neg means that the two labels are chosen from two different negative coarse-grained labels ('Fear', 'Anger', 'Sadness' and 'Hurt').

4 Features

A crucial problem for any emotion recognition system is the selection of the set of relevant features to be used with the most efficient machine learning algorithm. In the experiments reported in this paper, we have focused on the extraction of lexical, prosodic, spectral, disfluency and non-verbal event cues. For prosodic (F0 and energy) and spectral cue extraction, the Praat program [11] has been used. About fifty features are input to a classifier, which selects the most relevant ones (an extraction sketch follows the list):

- F0 and spectral features (log-normalized per speaker): min, max, mean, standard deviation and range at the turn level; slope (mean and max) in the voiced segments; regression coefficient and its mean square error (also computed on the voiced parts); maximum cross-variation of F0 between two adjoining voiced segments (inter-segment) and within each voiced segment (intra-segment); ratio of the number of voiced and unvoiced segments; first and second formants and their bandwidths: min, max, mean, standard deviation, range.
- Energy features (normalized): min, max, mean, standard deviation and range at the segment level.
- Duration features: speaking rate (inverse of the average length of the voiced parts of speech), number and length of silences (unvoiced portions between 200 and 800 ms).
- Disfluency features: number of pauses and filled pauses ("euh" in French) per utterance, annotated with time-stamps during transcription.
- Non-linguistic event features: inspiration, expiration, mouth noise, laughter, crying and unintelligible voice, marked during the transcription phase.

5 Classification

The above set of features is computed for all emotion segments and fed into a classifier. Ongoing experiments are being carried out on the corpus using Support Vector Machines [12], algorithms that search for an optimal hyperplane separating the data. As a first study, experiments were only made on broad classes. For reliability, all experiments were done using jack-knifing with 5 subsets (4 subsets are used for training and one for testing, each of the 5 subsets being used once for testing). This procedure is repeated 10 times with different subsets (a sketch of this protocol is given at the end of this section).

Table 4. Neutral/Negative detection performance with all features (% correct). The number in parentheses is the standard deviation

Role in the dialog | Emotion classification | SVM
Agent              | Neutral / Anger        | 75.0 (2.4)
Agent              | Neutral / Negative     | 73.4 (3.1)
Client             | Neutral / Negative     | 80.2 (2.4)
Client             | Neutral / Fear         | 83.8 (0.9)

Different performances are obtained depending on the role in the dialog. Clients are much clearer than Agents when they express Negative emotions (80% vs. 73% correct detection). Experiments selecting non-blended emotions for training the models have yielded a high level of emotion detection performance (comparable to that already achieved on subsets of the corpus [9]), and we expect this selection to improve emotion detection with more classes or with same-valence classes such as Anger and Fear.
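For illustration, a minimal sketch of the repeated 5-fold jack-knifing protocol described at the beginning of this section is given below, using scikit-learn as an assumed toolkit; the feature matrix, labels and SVM settings are placeholders rather than the actual experimental setup.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(X, y, n_repeats=10):
    """5-fold jack-knifing repeated n_repeats times on different splits.

    X: (n_segments, n_features) paralinguistic features; y: class labels
    (e.g. 0 = Neutral, 1 = Fear). Returns mean accuracy and its standard
    deviation, both in %.
    """
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv)
    return 100 * scores.mean(), 100 * scores.std()
```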

6 Discussion and perspectives

This paper focuses on real-life emotions and shows the complexity of the natural emotional behavior expressed in dialogs from a medical emergency call center. Our study of this corpus, using a uni-modal channel (speech only), reveals the presence of mixtures of emotions with conflictual or non-conflictual valences. When the reliable part of the annotated corpus is selected, the performances obtained are around 80% correct detection between Negative and Neutral or between Fear and Neutral. Further experiments will be carried out on finer classes. We have not yet used all of the corpus annotations, such as the intensity and self-control dimensions, or the meta-data annotations. These annotations will be correlated with the soft-vector emotion annotation in future experiments. We also intend to combine linguistic cues and paralinguistic cues, as done in previous experiments [9].

7 Acknowledgements

This work was partially financed by several EC projects: FP6 CHIL and the NoE HUMAINE. The authors would like to thank M. Lesprit and J. Martel for their help with data annotation. The work is conducted in the framework of a convention between the APHP France and LIMSI-CNRS. The authors would also like to thank Professor P. Carli, Doctor P. Sauval and their colleagues N. Borgne, A. Rouillard and G. Benezit.

8 References

1. Campbell, N.: Accounting for Voice Quality Variation. Speech Prosody 2004 (2004) 217-220
2. Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P.: Emotional speech: Towards a new generation of databases. Speech Communication (2003)
3. Abrilian, S., Devillers, L., Buisine, S., Martin, J.-C.: EmoTV1: Annotation of Real-life Emotions for the Specification of Multimodal Affective Interfaces. HCI International (2005)
4. Craggs, R., Wood, M.M.: A 2-dimensional annotation scheme for emotion in dialogue. In: Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications, Stanford University (2004)
5. Cowie, R., Cornelius, R.R.: Describing the emotional states expressed in speech. Speech Communication 40(1-2) (2003) 5-32
6. Osgood, C., May, W.H., Miron, M.S.: Cross-cultural Universals of Affective Meaning. University of Illinois Press, Urbana (1975)
7. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.: Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine 18(1) (2001) 32-80
8. Scherer, K.R.: Appraisal Theory. In: Dalgleish, T., Power, M. (eds.): Handbook of Cognition and Emotion. John Wiley, New York (1999) 637-663
9. Devillers, L., Vidrascu, L., Lamel, L.: Challenges in real-life emotion annotation and machine learning based detection. To appear in Neural Networks (2005)
10. Devillers, L., Abrilian, S., Martin, J.-C.: Representing real-life emotions in audiovisual data with non basic emotional patterns and context features. ACII 2005 (submitted)
11. Boersma, P.: Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. Proceedings of the Institute of Phonetic Sciences (1993) 97-110
12. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer (1995)
