Fiction database for emotion detection in abnormal situations

Chloé Clavel, Ioana Vasilescu*, Laurence Devillers**, Thibaut Ehrette

Thales Research and Technology France, Domaine de Corbeville, 91404 Orsay Cedex, France
*ENST-TSI, 46 rue Barrault, 75634 Paris Cedex 13, France
**LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France

[email protected], [email protected], [email protected]

Abstract

The present research focuses on the acquisition and annotation of vocal resources for emotion detection. We are interested in detecting emotions occurring in abnormal situations, and particularly in detecting "fear". The present study considers a preliminary database of audiovisual sequences extracted from fiction movies. The selected sequences provide various manifestations of the target emotions and are described with a multimodal annotation tool. We focus on audio cues in the annotation strategy and use the video as support for validating the audio labels. The present article describes the methodology of data acquisition and annotation. The annotation is validated via two perceptual paradigms in which the presence or absence of video in the stimuli presentation varies. We show the perceptual significance of the audio cues and the presence of the target emotions.

1. Introduction

Emotion detection in spontaneous speech is an interesting topic for a wide set of potential applications, such as human-machine interaction (human-machine dialog systems) and surveillance systems (homeland security). Classical security systems rely only on the detection of visual cues, without any recording of the audio signal. Our aim is to show the relevance of acoustic cues in illustrative abnormal situations. We define an abnormal situation as an unplanned event, resulting from a present or imminent human action or from a natural disaster, which threatens human life and requires prompt action to protect life or limit the damage. We assume that in these situations human beings experience fear or other related negative emotions. Among abnormal situations we can mention natural disasters such as fires, earthquakes, floods, etc., as well as physical or psychological threats and aggression against human beings (kidnapping, hostage taking, etc.). Emotion detection in speech for security applications requires appropriate real-life databases. The work presented here describes the database we are building with the aim of developing a system able to detect emotions occurring during abnormal situations. We are hence interested in vocal manifestations in situations in which human life could be in danger.

The potential applications deal with the security of public places, for example bank or subway surveillance. The development of an appropriate database of natural speech and of a robust annotation strategy is not straightforward. The literature on emotions shows that the detection of robust vocal cues carrying information about the speaker's emotional state is strongly correlated with the naturalness of the corpus employed for this purpose. We can notice an increasing interest in collecting and analyzing natural and context-dependent corpora. As real-life corpora for emotion detection, we can mention a conversational corpus containing everyday interactions between a target speaker and his environment [3], the Belfast database [1] containing an interview corpus and a talk-show-based corpus, and call center corpora consisting of interactions between agents and clients [4]. However, despite the naturalness of the data, all the mentioned corpora illustrate life contexts in which emotions are shaded, mixed and moderate, as politeness rules and social conventions require. Natural corpora with extreme emotional manifestations for surveillance applications are not available because of the private character of the data. Yet those emotions are present in everyday life, even though they are rare and unpredictable. Indeed, broadcast news nowadays provides strong examples of such extreme emotional manifestations, but generally via short excerpts from a large variety of contexts. As a consequence, there is a lack of studies focusing on strong natural emotions.

Given the difficulty of collecting abundant material with the target emotions, we are building a corpus based on fiction sequences extracted from a collection of selected recent movies in English. Even though the emotions are acted, they are expressed by skilled actors in realistic interpersonal interactions within the whole context of the movie scenario. Moreover, fiction provides an interesting range of emergence contexts and types of speakers that would have been very difficult to collect in real life. Our annotation methodology consists of three main phases: naming the emotions, annotating them and validating the annotation via appropriate perceptual tests.

In the following sections we describe the methodology of extraction (section 2) and annotation (section 3) of the corpus sequences, and the validation of the annotation via perceptual paradigms (section 4) carried out in two experimental conditions on a subset of the corpus. Finally, in section 5 we present conclusions and future work.

2. Corpus

2.1. Description

A corpus is being acquired with the aim of providing about 500 audiovisual sequences in English, corresponding to manifestations of emotional states in abnormal situations, either in individual, group or crowd situations. Recent thrillers, action movies and psychological dramas are good candidates. Our selection criteria rely on the actors' play and the naturalness of the situations, on the audio quality (predominance of voice over noise and music) and on the type of abnormal situation, which we prefer realistic and close to the situations defined by our application field. In addition, other emotions occurring in normal situations are also considered in order to verify the robustness of acoustic cues for emotion detection in the target situations. Both verbal and non-verbal manifestations (shots, explosions, cries, etc.) are considered. Verbal manifestations illustrate the neutral (reference) state, the target emotions and emotions other than neutral occurring in normal situations, and are annotated at several levels (section 3).

2.2. The extraction and selection method

Chapters of interest are chosen from previously selected DVDs and stored as MPEG files, before being segmented into short sequences of 20 to 100 seconds. The audio data are extracted into .wav files. The variability in sequence duration is a consequence of the topic-based segmentation criterion: each sequence illustrates a particular topic and a verbal or non-verbal context. Verbal sequences contain dialogs and/or monologues. Different types of situations are illustrated: hostages, individuals and groups lost in a threatening environment, kidnapping, etc. Sequences are segmented into speaker turns. For the study presented here, we considered 20 preliminary sequences extracted from six different movies, containing 152 speaker turns (from 28 speakers: 14 male, 12 female and 2 children) and 21 overlaps. From the 152 speaker turns we selected a subset of 40 speaker turns for the perceptual test (section 4).
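The audio extraction step is not detailed further in the paper; the snippet below is a minimal sketch of how one sequence could be cut out of an MPEG chapter and converted to a .wav file, assuming the ffmpeg command-line tool is available. File names, start times, durations and the 16 kHz mono format are illustrative choices, not values taken from the corpus.

```python
import subprocess

def extract_audio_segment(mpeg_path, start_s, duration_s, wav_path):
    """Cut one sequence out of an MPEG chapter and save its audio as a .wav file."""
    subprocess.run(
        [
            "ffmpeg",
            "-ss", str(start_s),       # start of the sequence (seconds)
            "-t", str(duration_s),     # sequence duration (20 to 100 s in the corpus)
            "-i", mpeg_path,           # input chapter stored as an MPEG file
            "-vn",                     # drop the video stream
            "-acodec", "pcm_s16le",    # uncompressed 16-bit PCM audio
            "-ar", "16000",            # 16 kHz sampling rate (illustrative choice)
            "-ac", "1",                # mono (illustrative choice)
            wav_path,
        ],
        check=True,
    )

# Hypothetical call: a 45-second sequence starting at 12 min 30 s into the chapter.
extract_audio_segment("movie_chapter_03.mpg", 750, 45, "sequence_017.wav")
```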

3. Annotation strategies

3.1. A task-dependent annotation strategy

We adopt a task-dependent annotation strategy which considers two main factors: the context of emotion emergence and the temporal evolution of the emotional manifestations. For this purpose, each sequence, which provides a particular context, is segmented into speaker turns, which represent the basic annotation unit. The annotation strategies correspond to the first two phases of the adopted methodology, naming and annotating. Emotion perception by human beings is strongly multimodal: both audio and video information help us understand the speaker's emotional state. Although our annotation scheme relies essentially on the audio description, our choices are supported by the video, through the use of the multimodal annotation tool ANVIL [5]. ANVIL allows us to create and adapt different annotation schemes; a specification file is provided in XML format. The output of the ANVIL tool is an interpretation file in which the selected description is stored. ANVIL also provides the possibility to import data from PRAAT, such as pitch contours and intensity variation analyses.

3.2. Annotation scheme

We consider the following elements as relevant for the annotation: the emotional content (category labels and a dimensional description), the context (here, the threat), and the acoustic properties extracted with PRAAT [7].

Figure 1: Annotation scheme in ANVIL (tracks for audio descriptors, context descriptors, and emotion descriptors, both categorical and dimensional).

3.2.1. Emotion descriptors

The interpretation files incorporate two types of emotional content for each speaker turn, a dimensional one and a categorical one. The dimensional description is based on the following three abstract dimensions: activation, evaluation and control, which have been suggested as salient for emotion description [8]. The third dimension is renamed and adapted as a "reactivity" dimension, which is more perceptually intuitive for distinguishing emotions occurring in abnormal situations. Our final three dimensions are intensity, evaluation and reactivity.

Intensity indicates how intense the emotion is (for example, terror is more intense than fear). Evaluation gives a global indication of the positive or negative feelings associated with the emotional state (for example, happiness is a positive emotion and anger a negative one). Reactivity enables us to distinguish different types of "fear": indeed, we do not experience the same fear depending on whether we cope with the threat or not [6]. The reactivity value indicates whether the speaker seems to be subjected to the situation (passive) or to react to it (active); for example, one reaction to fear can be inhibition (very passive) or anger (very active). For the evaluation and reactivity dimensions, the axes cover discrete values from wholly negative (−3, −2, −1) to wholly positive (+1, +2, +3). The intensity axis provides four levels, from 0 to 3. Level 0 corresponds to the neutral state for the three dimensions. Each dimension is stored in a track of ANVIL's annotation file, as shown in figure 1. We also employ categorical labels for the emotional content of each speaker turn. So far we have selected two groups of labels, corresponding to emotions in abnormal situations (the global class fear and other negative emotions) and to other emotions (neutral and other emotions in normal situations).

3.2.2. Context and meta descriptors

We consider here several factors that allow us to better describe the context of emotion emergence. These factors concern both the environment and the relation between speakers. A "threat track" provides the description of the threat intensity and of its incidence (immediate, imminent or potential). The speaker track gives the gender of the current speaker and his or her position in the interaction (victim or aggressor). Judgements about the audio quality (i.e. +/− noise, +/− music) of each speaker turn are also stored; they will be employed to test the robustness of the detection methods.

3.2.3. Transcription and acoustic descriptors

With the aim of studying salient lexical cues, we also transcribe the verbal content, with the help of the subtitles provided by the DVDs. Breathing, shots and shouts are transcribed as non-verbal events. The PRAAT input allows us to extract a set of statistics from the speech signal, such as intensity and pitch contours, which will be correlated with the other descriptors.
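To make the per-speaker-turn descriptors of sections 3.2.1 to 3.2.3 concrete, here is a minimal sketch of how they could be grouped in code. The class layout is our own illustration, not ANVIL's interpretation-file format, and the acoustic statistics are computed with the parselmouth Python interface to PRAAT rather than through ANVIL's PRAAT import; both choices are assumptions made for the example.

```python
from dataclasses import dataclass, field

import numpy as np
import parselmouth  # Python interface to PRAAT (assumed available)

@dataclass
class SpeakerTurnAnnotation:
    """One speaker turn, the basic annotation unit."""
    speaker_gender: str        # "male", "female" or "child"
    speaker_position: str      # "victim" or "aggressor"
    emotion_label: str         # e.g. "fear", "other negative", "neutral", "other normal"
    intensity: int             # 0 (neutral) to 3
    evaluation: int            # -3 (wholly negative) to +3 (wholly positive), 0 = neutral
    reactivity: int            # -3 (very passive) to +3 (very active), 0 = neutral
    threat_incidence: str      # "immediate", "imminent" or "potential"
    noise: bool = False        # +/- noise judgement on audio quality
    music: bool = False        # +/- music judgement on audio quality
    transcription: str = ""    # verbal content plus non-verbal events
    acoustic_stats: dict = field(default_factory=dict)

def pitch_intensity_stats(wav_path: str) -> dict:
    """Mean/max F0 and intensity over one speaker-turn audio file."""
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]  # keep voiced frames only
    inten = snd.to_intensity().values.flatten()
    return {
        "f0_mean": float(np.mean(f0)) if f0.size else 0.0,
        "f0_max": float(np.max(f0)) if f0.size else 0.0,
        "intensity_mean_db": float(np.mean(inten)),
        "intensity_max_db": float(np.max(inten)),
    }
```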

4. Perceptual validation

4.1. Protocol and subjects

The perceptual test is conducted to answer several questions. Firstly, we evaluate the presence of emotions corresponding to an abnormal situation. We also estimate whether the basic unit selected, the speaker turn, is salient enough to carry a particular emotion. We select 40 speaker turns illustrating the previously described emotion categories. Each class is illustrated by 10 stimuli, pronounced by 5 male and 5 female speakers.

The speaker turns employed as stimuli vary in length (from 3 to 43 words) and are presented in random order. We consider two experimental conditions, +/−video, in order to verify the role of the audio cues in perceiving the target emotions. The +video condition provides both the video and audio recordings for each speaker turn; the −video condition only provides the audio file corresponding to each speaker turn. 22 subjects participated in the perceptual test, 11 for the −video condition and 11 for the +video condition. They are first instructed on the purpose of the test, i.e. to judge short audio or audiovisual sequences in terms of the emotion they perceive and to evaluate this emotion using the three abstract dimensions. We expect the test to allow us to evaluate the two description schemes of emotions, i.e. categorical vs. dimensional. Subjects are asked to name the emotion emerging from each stimulus. The abstract dimensions are described to the subjects and examples are given in order to illustrate them. A familiarization phase consisting of 5 stimuli precedes the test; these 5 training stimuli are not counted in the results and are provided to help subjects practice and build a personal evaluation scale. All the subjects understand English without any difficulty and, in order to avoid misunderstandings, the transcription of the stimuli is also provided. The test phase consists of listening to (and, in the +video condition, watching) the 40 stimuli and answering the questions described in the instructions. Subjects are allowed to listen to and/or watch each stimulus as many times as they wish.

4.2. Results

Figures 2, 3 and 4 show the percentages of intensity, evaluation and reactivity labels, respectively, for emotions occurring in abnormal situations and for other emotions, in both test conditions (audio and audiovisual). We present here the main results. We focus both on the evaluation of emotions with the abstract dimensions according to the experimental conditions, i.e. +/−video, and on the categorical labeling of the stimuli. The results obtained with the three-dimensional emotion description show a differentiation of emotions in abnormal situations (the global class fear and other negative emotions) from other emotions (neutral and other emotions in normal situations). Indeed, the speaker turns initially labeled as emotions in abnormal situations are perceived as more intense emotions (figure 2). Concerning the evaluation axis, emotions in normal situations are globally perceived as corresponding to the zero level, which means they are perceived as neither negative nor positive (figure 3). As expected, stimuli labeled as corresponding to abnormal situations are evaluated as negative. For the reactivity axis, emotions in normal situations lead to the same observation as for evaluation (figure 4), whereas emotions in abnormal situations are more frequently considered as active. Moreover, emotions are perceived as more intense with the help of the video support. However, for the three dimensions, the two curves are close, which means that audio cues may be sufficient to detect such emotional states.
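The paper does not spell out how the rating distributions of figures 2 to 4 are tallied; the sketch below shows one straightforward way of computing such percentage distributions from the subjects' answers. The record layout (one tuple per subject and per stimulus) and the field order are assumptions made for the example.

```python
from collections import Counter

# Hypothetical answer records: (stimulus_id, annotated_class, condition, intensity, evaluation, reactivity)
answers = [
    ("s01", "fear", "video", 3, -3, 2),
    ("s01", "fear", "audio", 2, -2, 1),
    ("s02", "neutral", "audio", 0, 0, 0),
    # ... one record per subject and per stimulus
]

def rating_percentages(answers, dim_index, classes, condition):
    """Percentage of each rating value on one dimension, restricted to the given classes and condition."""
    values = [a[dim_index] for a in answers if a[1] in classes and a[2] == condition]
    counts = Counter(values)
    return {v: 100.0 * n / len(values) for v, n in sorted(counts.items())} if values else {}

# Intensity distribution (field index 3) for emotions in abnormal situations, audio-only condition.
abnormal = {"fear", "other negative"}
print(rating_percentages(answers, 3, abnormal, "audio"))
```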

Figure 2: Percentage of intensity label ratings (intensity scale from 0 to 3), for emotions in abnormal vs. normal situations, with audio-only and with video support.

Figure 3: Percentage of evaluation label ratings (evaluation scale from −3 to +3), for emotions in abnormal vs. normal situations, with audio-only and with video support.

Figure 4: Percentage of reactivity label ratings (reactivity scale from −3 to +3), for emotions in abnormal vs. normal situations, with audio-only and with video support.

Finally, we correlate the labels provided by the subjects with the four emotion classes. Table 1 shows a global correlation and allows us to assess the usefulness of the video when annotating the emotions. We notice that the audio condition is sufficient to correctly annotate neutral stimuli, as well as stimuli initially labeled as other emotions in normal situations (emot. norm.) and as other negative emotions (neg. emot.). In those cases, providing the video seems to complicate the task: neutral stimuli are correctly rated in 100% of cases in the audio condition, but when the video is added they are perceived as neutral in only 70% of cases. However, for the global class fear, the video seems to be necessary in order to obtain the same judgement as in the initial annotation. Even if the neutral state does not need additional video information for correct annotation, emotions, and especially extreme negative emotions such as fear in abnormal situations, have complex multimodal manifestations, and the video represents a real help for annotation. This finding is also illustrated in table 2, which presents the number of stimuli that received a higher reactivity rating in the +video condition than in the −video condition. As in table 1, it shows that images help in perceiving more marked emotions in the case of stimuli initially categorized as fear.

Table 1: Annotated vs. perceived emotion classes (% of stimuli perceived as initially annotated).

  %             neutral   emot. norm.   neg. emot.   fear
  audio         100       80            80           50
  audio+video   70        70            70           80

Table 2: Number of stimuli judged as more marked in the video condition (majority vote) on the reactivity axis, for the four emotion classes (10 stimuli per emotion class).

  Nb. st.       neutral   emot. norm.   neg. emot.   fear
  reactivity    0/10      3/10          4/10          5/10

5. Conclusion and future work

This paper presented preliminary work on the acquisition and annotation of a fiction database for emotion detection in abnormal situations, and particularly fear. We showed the perceptual significance of the audio cues and the role of the video, the presence of the target emotions in our data, and the interest of the emotion annotation strategy, i.e. categorical and with abstract dimensions. Ongoing work focuses on a finer discrimination inside the global emotion classes and on the correlation between the context descriptors and the vocal manifestations of emotions.

6. References

[1] Douglas-Cowie, E., Campbell, N., Cowie, R., Roach, P., 2003. "Emotional speech: Towards a new generation of databases". Speech Communication.

[3] Campbell, N., Mokhtari, P., 2003. "Voice quality: the 4th Prosodic Dimension". In 15th International Congress of Phonetic Sciences (ICPhS), Barcelona.

[4] Devillers, L., Vasilescu, I., 2004. "Reliability of Lexical and Prosodic Cues in two Real-life Spoken Dialog Corpora". In 4th International Conference on Language Resources and Evaluation (LREC).

[5] Kipp, M., 2001. "Anvil - a generic annotation tool for multimodal dialogue". In 7th European Conference on Speech Communication and Technology (Eurospeech).

[6] Ekman, P., 2003. "Emotions Revealed: Recognizing Faces and Feelings to Improve Communication and Emotional Life". Times Books.

[7] PRAAT: www.praat.org

[8] Osgood, C., May, W. H., Miron, M. S., 1975. "Cross-cultural Universals of Affective Meaning". University of Illinois Press, Urbana.
