An ‘open-set’ detection evaluation methodology for automatic emotion recognition in speech

Khiet P. Truong and David A. van Leeuwen
TNO Human Factors, Dept. of Human Interfaces
P.O. Box 23, 3769 ZG Soesterberg, The Netherlands
{khiet.truong, david.vanleeuwen}@tno.nl
Abstract

In this paper, we present a detection approach and an ‘open-set’ detection evaluation methodology for automatic emotion recognition in speech. The traditional classification approach does not seem suitable or flexible enough for typical emotion recognition tasks. For example, classification has no appropriate way to cope with ‘new’ emotions from outside the training database that can be encountered in real-life situations. With the presented approach and evaluation method we aim to perform experiments that suit the task of an emotion detector and to obtain results that are more interpretable and generalizable to other datasets. The evaluation method takes into account that the emotion detectors can encounter ‘new’ emotions that are not covered by the trained models. We also tested the trained emotion detectors on an independent emotion database. The results indicate that our ‘open-set’ evaluation methodology indeed produces more representative and generalizable results.

Index Terms: emotion recognition, detection, evaluation
1. Introduction

One of the features that can make human-machine interaction more efficient is automatic emotion recognition. Speech is one of the channels through which emotion can be expressed. Automatic emotion recognition in speech is a relatively new research area that aims at recognizing how something is said. Using this feature, systems can increase their intelligence and can adapt their processes to the user’s emotional state. Gaming is one of the areas where emotion recognition can be applied: for example, adapting the level of difficulty to the player’s emotional state can make the gameplay more challenging and interactive [1]. Emotion recognition can also be applied in a more professional environment: if we can sense an operator’s stress or workload level, the system can decide to adapt its task allocation to the operator’s stress level [2]. Call centers are particularly interested: in order to keep customers satisfied, a supervisor can receive a signal from a frustration detector and decide to take over the conversation [3]. Emotion recognition can also be used as a ‘filter’ to search through speech recordings for a specific emotion E. Especially in large audio archives, an additional search or browse parameter for emotion can increase the functionality of a browse/search system. For example, we can search or browse through movies or concerts, looking for happy moments [4], or we can search in meetings for so-called hot-spots, i.e., moments of increased activation of the participants [5]. In order to develop such applications, we need to train emotion detectors with data that matches the environment in which
the end application will be used as closely as possible. This implies that for most applications, natural emotion data is needed, which is difficult to collect. Therefore, most researchers have been working with relatively small databases (in comparison with databases used in automatic speech and speaker recognition) containing (usually) acted emotional data from a small number of speakers. Furthermore, classification experiments are traditionally performed in which the task is to select one class out of N emotion classes. This does not seem an appropriate task for a typical emotion recognition problem. For example, if we want to develop a frustration detector for a call center, a classification experiment of frustration with respect to other emotions such as disgust, boredom or sadness does not seem to make sense, because these other emotions are not frequently encountered in call centers. In addition, what happens if the classifier encounters a new emotion from outside the training database? By adopting the detection framework, we believe we can treat these shortcomings in more appropriate ways. A typical emotion recognition problem consists of determining whether a person displays emotion E or not. This task suits the detection framework very well. Therefore, we adopt a detection approach in the current study. However, we are still dependent on the types of emotions in the database, which makes the results less transferable to other datasets. In order to be less dependent on the types of emotions available in databases and to obtain more representative results, we propose a detection evaluation methodology that emulates ‘open-set’ detection experiments. We also evaluated our emotion detectors against another, independent test database. In this paper, we present a detection approach and evaluation methodology for emotion recognition in speech. Section 2 describes the speech material used in this study and several data collection techniques and issues. In Section 3 the classification and detection approaches are described. The proposed evaluation methodology and the results are presented in Section 4. We end with a discussion and conclusions in Section 5.
2. Emotional speech data

2.1. Speech material used for training and evaluation

In order to enable comparison of the results and methods between different studies, we have used an emotional speech database that is widely available. The Berlin Emotional Speech database [6] was used for training and testing our emotion detectors. The Berlin database contains emotional speech from five female and five male actors. Each actor uttered ten different sentences in seven different emotions: neutral (Ne), anger (An), fear (Fe), joy (Jo), sadness (Sa), disgust (Di) and boredom (Bo). This resulted in a total of 800 emotional utterances
that were normal sentences which could be used in everyday life. In a validation experiment, 20 subjects were asked to recognize the emotion and to rate the naturalness of the emotional utterance. In our analyses, we only used those utterances that had a human recognition rate of more than 80% and a naturalness of more than 60% (see Table 1; a total of 535 utterances were used from the original 800).

Emotion        N     Emotion        N
Anger (An)     127   Boredom (Bo)   81
Disgust (Di)   46    Fear (Fe)      69
Joy (Jo)       71    Sadness (Sa)   62
Neutral (Ne)   79
Table 1: Number of utterances used per emotion in the Berlin Emotional Speech database.

Additionally, a second emotional speech database was used for testing only: the Geneva Vocal Emotion Expression Stimulus Set (GVEESS). This database is based on research conducted by Klaus Scherer, Harald Wallbott, Rainer Banse and Heiner Ellgring; detailed information on the production of this data set can be found in [7]. The GVEESS contains acted emotional speech samples in 14 different emotions from 12 actors. The 14 emotions are anxiety, disgust, happiness, hot anger, interest, cold anger, boredom, panic fear, shame, pride, sadness, elation, contempt and desperation. A selection of 224 emotional speech samples (16 sentences × 14 emotions) was obtained based on human recognizability and authenticity criteria. The GVEESS database differs from the Berlin Emotional Speech database on the following points:
• recording conditions (different environment acoustics)
• speakers
• types of emotions, e.g., interest, shame, pride
• interpretations or expressions of the same emotions, e.g., cold/hot anger, elation/happiness
We are well aware that both databases contain acted data and so-called full-blown emotions, and that these are rare in everyday life. However, the focus is not on the performance of the emotion detectors but rather on a methodology to obtain results that are as representative as possible given these limitations. In the next subsection, we pay attention to efforts to collect spontaneous emotion data.

2.2. Acquiring semi-spontaneous and natural emotion data

Acquiring annotated, spontaneous emotional speech data that is suitable for automatic emotion recognition research is important but has proven to be difficult. Efforts to collect such data and to make them publicly available therefore deserve more attention. In order to collect (semi-)spontaneous emotions in a more or less controlled way, several elicitation techniques have been employed. For example, researchers have elicited emotions by showing movies or still pictures [8], by letting subjects listen to music [9], by interacting with spoken-dialogue systems [10, 11], by interacting with virtual characters [12] or by playing games [13, 14, 15]. In interaction with machines, Wizard-of-Oz techniques are often used to frustrate the user deliberately. Games in particular seem to be a growing topic of interest for emotion researchers [13, 14, 15, 16, 1, 17]. In one of these studies, multimodal emotion data was collected by inviting subjects to play
a first-person shooter game in teams of two persons [17]. The lively (non)verbal interaction between the rivaling teams produced a relatively broad range of emotions that were occasionally very intense (which also depended on the group of players). During the game, surprising elements (e.g., extra monsters) were added to the gameplay to elicit even more emotions. Further, the players were asked to annotate their own emotions afterwards. It appears that people do not express their emotions very easily or intensely in daily life [18], where we usually adhere to the unwritten social rules of conversation. Given the subtlety and the ‘blended’ characteristics of natural emotions, the analysis of natural emotions is not straightforward. Furthermore, speech signals recorded in a natural environment usually contain background noises, overlapping speech, crosstalk etc.: in Fig. 1 we can observe a tension between the use of acted and natural emotion data.

Figure 1: Tension between naturalness and complexity of data (natural data: authentic but uncontrolled and noisy, with subtle, blended, complex emotions and ethical issues; acted data: controlled and clean, with simple, full-blown emotions)
One of the largest natural emotion audiovisual databases is the Belfast Naturalistic database [19], which contains TV clips of people. The SMARTKOM database is another large database containing spontaneous emotional utterances [20]. The natural emotion database used in [3] consists of real dialogs recorded in a French medical emergency call center. One important issue with natural emotion databases is that it is often hard to make the data publicly available due to privacy issues. We have experienced that, although call centers show interest in emotion recognition technology, they are not very keen on sharing their speech databases due to legal and privacy issues. Other potentially useful sources of natural emotion data are meetings (e.g., [21, 22]), although these usually do not have emotion annotations.
3. Method

3.1. Acoustic modeling

We used Relative Spectral Transform - Perceptual Linear Prediction (RASTA-PLP) features [23, 24]. Other spectral features such as Mel-Frequency Cepstrum Coefficients (MFCCs) would also have been good candidates, but mainly for practical reasons and because of their good performance in our speaker recognition system, we used RASTA-PLP features. Every 16 ms, 12 RASTA-PLP coefficients plus 1 energy component were computed with a window width of 32 ms. In addition, delta coefficients were calculated by taking the first-order derivatives of the 13 features over five consecutive frames. The features were normalized per utterance to obtain zero mean and unit standard deviation. Gaussian Mixture Models (GMMs) were used to train the acoustic models. Four Gaussian components were used, following a ‘rule-of-thumb’ of approximately 50 data points per estimated parameter. The GMMs were trained with five iterations of the Expectation-Maximization (EM) algorithm.
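To make the above concrete, the following is a minimal sketch of the front-end and model training, assuming Python with librosa and scikit-learn; MFCCs are used here as a stand-in for the RASTA-PLP features described above, and all function and file names are illustrative rather than part of our actual implementation.

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(wav_path):
    # 13 cepstral coefficients (MFCCs as a stand-in for RASTA-PLP + energy),
    # 16 ms hop and 32 ms window, plus first-order deltas over five frames,
    # z-normalized per utterance as described in Section 3.1.
    y, sr = librosa.load(wav_path, sr=16000)
    hop, win = int(0.016 * sr), int(0.032 * sr)
    ceps = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop, n_fft=win)
    deltas = librosa.feature.delta(ceps, width=5, order=1)
    x = np.vstack([ceps, deltas]).T          # shape: (frames, 26)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def train_gmm(wav_paths, n_components=4, em_iters=5):
    # Pool the frames of all training utterances and fit a small diagonal-covariance GMM.
    frames = np.vstack([extract_features(p) for p in wav_paths])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='diag', max_iter=em_iters)
    return gmm.fit(frames)

In practice, one such GMM is trained per emotion class (for classification) or per target/non-target split (for detection), as described in the next subsection.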
3.2. Recognition techniques

3.2.1. For reference: classification

In classification, the task is to select one class out of N classes. For each emotion class E, a model GMM_E was trained. The choice for an emotion class E is determined by the GMM_E that produces the highest likelihood on a given test sample. Traditionally, emotion classification experiments are performed in this way, but we wonder whether this is appropriate for most of the emotion recognition applications that we would like to build. Already in [7], it was stated that “It is doubtful whether studies using 4-6 response alternatives in a vocal emotion recognition study actually study recognition or whether, more likely, the psychological process involved is discrimination among a small number of alternatives.” A test showing how anger can be discriminated from sadness, disgust or boredom does not seem to be relevant for an application (e.g., in a call center) that is unlikely to encounter these emotions. Further, the assumption is made that we can only encounter emotions from a closed set of classes; in other words, there is no room for ‘new’ emotions that are not covered by the existing classes and models but which can be encountered in real-life applications. Therefore, we adopt a detection approach that has more appropriate ways of dealing with typical emotion recognition problems.

3.2.2. What we would like to do: detection

A typical emotion recognition task is to detect whether a person displays emotion E or not, which can be more appropriately investigated by adopting the detection framework. The task is defined as a two-class problem where one class represents the emotion E that one would like to detect, the target. The other class represents all other emotions ¬E that one does not want to detect, the non-target. For instance, if we want to detect Anger, we would train a target GMM_E with angry speech samples and a non-target GMM_¬E with non-angry speech samples, which with the current database comprises Bored, Disgusted, Fearful, Joyful, Neutral and Sad speech. We can define a log likelihood ratio for each test sample x by computing llr_x = ll_x^tar − ll_x^non, where ll_x^m is the log likelihood of sample x under the target or non-target model m. The higher the log likelihood ratio, the more likely it is that sample x belongs to the target class, and vice versa. ‘New’ emotions that are not covered by the training database and that are encountered by the emotion detectors can be classified as belonging to either E or ¬E, in contrast with the classification approach, where there is no corresponding class. However, the modeling of the non-target model is still based on a rather arbitrary choice of emotions that happen to be available in the database. In order to obtain results that are somewhat less dependent on the emotions that are available in the training database and that make up the non-target model, we introduce an ‘open-set’ detection evaluation methodology, which is described in Section 4.2 in more detail.
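The following is a sketch of the resulting detector, reusing the hypothetical extract_features and train_gmm helpers from the previous sketch; note that pooling the per-frame log likelihoods into an utterance-level score by averaging is an assumption, as the exact pooling is not specified above.

def log_likelihood(gmm, wav_path):
    # Average per-frame log likelihood of the utterance under the GMM
    # (GaussianMixture.score returns the mean log likelihood over frames).
    return gmm.score(extract_features(wav_path))

def detection_score(wav_path, gmm_tar, gmm_non):
    # Log likelihood ratio llr_x = ll_x^tar - ll_x^non; higher values favor the target emotion.
    return log_likelihood(gmm_tar, wav_path) - log_likelihood(gmm_non, wav_path)

def classify(wav_path, gmms):
    # For reference, the closed-set classifier of Section 3.2.1:
    # pick the emotion whose GMM yields the highest likelihood.
    return max(gmms, key=lambda emo: log_likelihood(gmms[emo], wav_path))

A hard accept/reject decision for the detector is then obtained by thresholding detection_score, whereas the evaluation in Section 4.2 works directly on the pooled scores.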
4. Evaluation methodology and results

4.1. For reference: classification

The classification results reported are usually based on a percentage of correct classifications (#correct classifications / #all test samples). It is hard to interpret and compare these classification accuracies between different studies because these performance figures depend, among others, on the number of test samples used per
emotion class (i.e., the priors in the test set) and the number of classes. In Table 2 we present speaker-independent cross-validated results on the Berlin data. We obtained an average classification accuracy of 50.6%. The two relatively extreme emotions An and Sa can be classified with relatively high accuracy, while the other, less extreme emotions are often confused with each other.
          Classified as
      An   Bo   Di   Fe   Jo   Ne   Sa
An    81    4   10    9   17    6    0
Bo     1   37    3    5    1   29    5
Di     9    1   18    9    5    5    6
Fe     3   12    7   18   14   10    5
Jo    16    3    8    7   33    4    0
Ne     1   22   10    8    0   38    0
Sa     1    5    2    3    0    5   46
Table 2: Confusion matrix of the speaker-independent classification experiment, Berlin database.

We also tried to apply the Berlin classifiers to ‘new’ data from another database, namely the GVEESS database. The GVEESS dataset contains other speakers and other emotions. Some emotions from GVEESS were mapped to Berlin emotions: ‘happy’ and ‘elation’ were grouped together as Jo, ‘hot’ and ‘cold’ anger were grouped together as An, and ‘angst’ and ‘panic fear’ were grouped as Fe. The emotion classes that the databases do not have in common form a problem for evaluation, because the classification of these test trials will always be considered incorrect since they do not have a corresponding emotion class in the Berlin data. If we evaluated the performance only on the mutual emotion classes, we achieved a classification accuracy of 34.3%. As one would expect, evaluating on all GVEESS emotions, thus including the emotion classes that the databases do not have in common, decreased the classification accuracy to 21.4%.

4.2. Proposed evaluation method: ‘open-set’ detection experiment

In detection, we have two classes, E and ¬E, and the task is to make a choice between these two classes. In this framework, detectors can be evaluated with a single performance measure known as the Equal Error Rate (EER), which can be identified and visualized in a Detection Error Tradeoff (DET) curve [25]. A detector can make two types of errors:
• false alarm: a trial that is incorrectly recognized as the target emotion (the false alarm rate is defined as the number of false alarms divided by the number of non-target trials)
• miss: a trial that is incorrectly recognized as the non-target emotion (the miss rate is defined as the number of misses divided by the number of target trials)
The EER can be defined as the point where the false alarm rate is equal to the miss rate. The decision whether a trial belongs to a certain class or not is based on a decision score, in our case the log likelihood ratio: typically, the higher the score, the more likely it is that the trial belongs to the target class, and vice versa. Decisions can be made by setting a well-chosen threshold. However, choosing a threshold has proven to be difficult in practice, which is better known as a ‘calibration problem’. The EER is a measure of discriminability where the error rates are
based on ‘soft’ decisions: i.e., thresholds are set ‘afterwards’ instead of prior to observing the decision scores. The trade-off between the false alarm rate and the miss rate can be plotted in a graph with probit-transformed scales, better known as a DET plot [25]. In order to be speaker-independent (SI), which prevents our results from being biased towards certain speakers, we always tested on a speaker that was excluded from the training set. We carried out cross-validation experiments (i.e., leave-speaker-out experiments) by training a target GMM for emotion detector E_tar (E_tar stands for the emotion that one would like to detect) on E_tar, excluding speech samples from test speaker S_i. The non-target GMM was trained on E_non, i.e., the rest of the emotions, again excluding speech samples from speaker S_i. The test was performed by applying these models to all emotional speech samples x_i from speaker S_i. We iterated over speakers S_i; in our case we had ten iterations since there are ten speakers in the database. All scores (i.e., log likelihood ratios) were pooled together to form target and non-target score distributions so the EER could be computed. This process was repeated for each E_tar. The results of these speaker-independent experiments can be observed in Fig. 3, where the result for each emotion E_tar is represented by a DET curve and an EER. We would like to take this concept of ‘speaker-independency’ a step further and apply it to the rather arbitrarily chosen emotion classes in emotion databases. In other words, we would like to make the results somewhat less dependent on the types of emotions that are present in an emotion database, so we can better interpret and compare the results over different databases and studies. For instance, an anger detector that is developed on a database containing four emotions, e.g., sadness, happiness, anger and neutral, does not seem to transfer very well to a real-life application or other datasets where we can encounter a broad range of other emotions. What happens if the anger detector encounters a ‘new’ emotion from outside the database that has not been ‘seen’ by the trained models before? In addition to the speaker-independent cross-validation, we therefore performed a ‘speaker-and-emotion-independent’ cross-validation in which we test on a ‘new’ emotion that has been left out of the training set and of which we have no prior information in testing (Fig. 2). This evaluation intends to emulate so-called ‘open-set’ detection experiments, in which we leave out a test speaker and a test emotion at the same time during training. For each E_tar detector, we trained a target model on emotion E_tar, excluding speech samples from test speaker S_i. We trained a non-target model with the ‘rest of the emotions’, excluding emotion E_tar, excluding samples from test speaker S_i and excluding the ‘new’ test emotion E_new ≠ E_tar. We then tested on speech samples containing the target emotion E_tar and the ‘new’ emotion E_new from test speaker S_i. We iterated over speakers S_i (ten iterations since we have ten speakers) and over the ‘new’ emotion E_new (in our case six iterations since we have seven emotions minus the target emotion). A summary of this procedure is given in Fig. 2. This process ensures that for each test performed, a target model is paired with a non-target model where both models have no prior information about the test speech sample, which carries an ‘unknown’ emotion and is uttered by an ‘unknown’ speaker.
In this way we obtained non-target scores for the ‘new’ emotions we tested on, which can be pooled together into a single non-target distribution. The target scores need special attention: we obtained multiple, correlated target scores for the same target trial x_tar
due to each iteration over E_new. We decided to take the minimum of these scores and discarded the rest, to emulate the most pessimistic situation (an alternative strategy would have been to average these scores). This whole process was repeated for each target emotion E_tar (the results of these ‘leave-speaker-and-emotion-out’ experiments are shown in Table 3).

{speakers} = set of all speakers
{emotions} = set of all emotions
{nonemotions} = {emotions} \ E_tar
S = test speaker
E_new = the ‘new’ emotion
E_tar = the target emotion

foreach S of {speakers}
    train target model: ¬S ∧ E_tar
    foreach E_new of {nonemotions}
        train non-target model: ¬S ∧ ¬E_tar ∧ ¬E_new
        test: S ∧ (E_tar ∨ E_new)
    end
end

or in other words:
1. train the target model with speech samples that have emotion E_tar and that are not uttered by speaker S
2. train the non-target model with speech samples that have neither emotion E_tar nor E_new and that are not uttered by speaker S
3. perform a test with all speech samples that have emotion E_tar or E_new and that are uttered by speaker S
4. repeat 2 and 3 for another E_new until all non-target emotions have been covered
5. repeat 1, 2, 3 and 4 for another S until all speakers have been covered
Figure 2: ‘Open-set’ cross-validation scheme in pseudo code

The results of the speaker-independent (SI) and the ‘open-set’ detection experiments on the Berlin data are presented in Fig. 3 and Table 3 (the lower the EER, the better the performance). We can observe in these results that Sadness is relatively easy to detect, while Fear is relatively hard to detect. In order to show how well results on one database transfer to another database or application, we also applied the Berlin emotion detectors to the GVEESS data (without performing any cross-validation, because the speakers and emotions already differed from the training dataset). It is not surprising that the EERs on the GVEESS data increase, given the differences between training and test set (Table 3). There are fairly large differences between the SI results on the Berlin data and the GVEESS results: this confirms our suspicion that the SI results do not transfer well to other datasets. However, our proposed ‘open-set’ results are comparable to the GVEESS results, indicating that SI results are often too optimistic and that the ‘open-set’ detection experiments give more representative, interpretable and transferable results.
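To make the protocol of Fig. 2 and the EER computation concrete, the following sketch (continuing the hypothetical helpers from Section 3) implements the leave-speaker-and-emotion-out loop for one target emotion; the corpus is assumed to be a list of (speaker, emotion, wav_path) tuples, and the minimum over repeated target trials follows the pessimistic choice described above.

import numpy as np

def equal_error_rate(tar_scores, non_scores):
    # EER: sweep a threshold over all observed scores and return the point
    # where the miss rate and the false alarm rate are (nearly) equal.
    thresholds = np.sort(np.concatenate([tar_scores, non_scores]))
    miss = np.array([(tar_scores < t).mean() for t in thresholds])
    fa = np.array([(non_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2.0

def open_set_eer(corpus, e_tar):
    # Leave-speaker-and-emotion-out cross-validation for target emotion e_tar.
    speakers = {s for s, _, _ in corpus}
    emotions = {e for _, e, _ in corpus}
    tar_scores, non_scores = {}, []
    for s in speakers:
        tar_model = train_gmm([p for sp, e, p in corpus if sp != s and e == e_tar])
        for e_new in emotions - {e_tar}:
            non_model = train_gmm([p for sp, e, p in corpus
                                   if sp != s and e not in (e_tar, e_new)])
            for sp, e, p in corpus:
                if sp != s or e not in (e_tar, e_new):
                    continue
                score = detection_score(p, tar_model, non_model)
                if e == e_tar:
                    # keep the minimum over the E_new iterations (pessimistic choice)
                    tar_scores[p] = min(score, tar_scores.get(p, np.inf))
                else:
                    non_scores.append(score)
    return equal_error_rate(np.array(list(tar_scores.values())), np.array(non_scores))

The speaker-independent (SI) condition corresponds to the same outer loop with a single non-target model trained on all emotions except E_tar.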
Figure 3: Speaker-independent results (EERs) obtained with the Berlin Emotional Speech database (DET plot: miss probability vs. false alarm probability, in %; per-emotion EERs: Fear 37.7%, Disgust 32.4%, Joy 26.7%, Neutral 26.4%, Boredom 23.4%, Anger 18.2%, Sadness 10.0%)

Emotion   SI      Berlin ‘open-set’   GVEESS
An        0.182   0.252               0.344
Bo        0.234   0.360               0.312
Di        0.324   0.461               0.498
Fe        0.377   0.652               0.447
Jo        0.267   0.355               0.406
Ne        0.264   0.417               -
Sa        0.100   0.129               0.174
Mean      0.250   0.375               0.364

Table 3: EERs of speaker-independent and ‘open-set’ detection experiments on the Berlin database and GVEESS

GVEESS emotions (mapped to Berlin emotions in brackets)   Evaluated on all emotions   Evaluated on mutual emotions
Hot, cold anger (An)                                      0.344                       0.371
Boredom (Bo)                                              0.312                       0.312
Disgust (Di)                                              0.498                       0.492
Angst, panic fear (Fe)                                    0.438                       0.433
Happy, elation (Jo)                                       0.406                       0.408
Neutral (Ne)                                              -                           -
Sad (Sa)                                                  0.174                       0.129
Mean                                                      0.362                       0.356

Table 4: EERs of Berlin detectors applied to GVEESS data, evaluated on either all GVEESS emotions or only the mutual emotions that are present in both GVEESS and Berlin data
In contrast with the classification approach, we now have no difficulties evaluating the Berlin detectors on another dataset that contains other emotions (we do not have results for Neutral because GVEESS does not contain Neutral emotions). If we evaluate only on the mutual emotions, we obtain the results presented in the right column of Table 4. The small differences between the right and left columns of Table 4 imply that the ‘new, unknown’ emotions that had never been seen before by the Berlin models (i.e., the GVEESS emotions interest, shame, pride, contempt and desperation) were nearly always detected correctly as the non-target class.
5. Discussion and conclusions

Given the lack of annotated natural emotion data, most researchers have usually worked with relatively small databases (in comparison to what is usually used in automatic speaker or speech recognition) containing acted emotional speech. The results obtained with these datasets are usually not representative for other databases or situations in which other speakers and other emotions can occur. A classification approach that assumes a ‘closed set’ of emotions cannot cope with ‘new’ emotions from outside the training database and does not seem to be appropriate for the typical emotion recognition task. Therefore, we proposed to adopt the detection framework (a well-established framework that is frequently used in, e.g., speaker recognition) and introduced an ‘open-set’ detection evaluation methodology that provides more representative and transferable results. This was confirmed by the results of subsequent classification/detection experiments in which we applied the Berlin emotion detectors to another database, namely the GVEESS database, which contains other speakers and emotions. We obtained relatively low EERs for Sadness, in contrast with Disgust and Fear, which had relatively high EERs. Since the field of automatic emotion recognition in speech is relatively new, many (evaluation) standards have not been agreed upon yet. Moreover, for relatively new emotion recognition methods that adopt a dimensional approach (e.g., [26]), which has not been covered here, other evaluation methods should be developed that reflect the performance of the emotion analyzer more closely and honestly. With the presented detection approach and evaluation methodology for emotion recognition, we hope to encourage researchers to perform future emotion recognition experiments that more closely suit the type of task that one is interested in. Hopefully, we can then achieve results that can be better interpreted and are better comparable between studies. Although we have tried to make the detection results as representative as possible given the limitations, by proposing an ‘open-set’ detection evaluation methodology, the results will not be ‘really representative’ in the sense of ‘realistic, authentic’ if we continue working with artificial data. Therefore, the lack of natural emotion data for automatic emotion recognition purposes remains an important issue that should be tackled in the near future. Hence, data collection efforts deserve more attention and recognition.
6. Acknowledgements

This work was supported by the Dutch Bsik project MultimediaN.
7. References

[1] K. M. Gilleade and A. Dix, “Affective videogames and modes of affective gaming: assist me, challenge me, emote me (ACE),” in Proc. DIGRA, 2005.
[2] M. Grootjen, M. A. Neerincx, J. C. M. Weert, and K. P. Truong, “Measuring cognitive task load on a naval ship: implications of a real world environment,” in Proc. ACI/HCII, 2007.
[3] L. Vidrascu and L. Devillers, “Detection of real-life emotions in call centers,” in Proc. Interspeech, 2005, pp. 1841–1844.
[4] A. Hanjalic and L.-Q. Xu, “Affective video content representation and modeling,” IEEE Trans. on Multimedia, vol. 7, pp. 143–154, 2005.
[5] B. Wrede and E. Shriberg, “Spotting ‘hot spots’ in meetings: human judgments and prosodic cues,” in Proc. Eurospeech, 2003, pp. 2805–2808.
[6] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in Proc. Interspeech, 2005, pp. 1517–1520.
[7] R. Banse and K. R. Scherer, “Acoustic profiles in vocal emotion expression,” Journal of Personality and Social Psychology, vol. 70, pp. 614–636, 1996.
[8] P. J. Lang, “The emotion probe - studies of motivation and attention,” American Psychologist, vol. 50, pp. 371–385, 1995.
[9] J. Wagner, J. Kim, and E. André, “From physiological signals to emotions: implementing and comparing selected methods for feature extraction and classification,” in Proc. IEEE ICME International Conference on Multimedia and Expo, 2005, pp. 940–943.
[10] J. Ang, R. Dhillon, A. Krupski, E. Shriberg, and A. Stolcke, “Prosody-based automatic detection of annoyance and frustration in human-computer dialog,” in Proc. ICSLP, 2002, pp. 2037–2040.
[11] A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth, “How to find trouble in communication,” Speech Communication, vol. 40, pp. 117–143, 2003.
[12] C. Cox, “Sensitive artificial listener induction techniques,” presented to HUMAINE Network of Excellence Summer School, 2004. [Online]. Available: http://emotionresearch.net/ws/summerschool1/SALAS.ppt259
[13] J. Kim, E. André, M. Rehm, T. Vogt, and J. Wagner, “Integrating information from speech and physiological signals to achieve emotional sensitivity,” in Proc. Interspeech, 2005, pp. 809–812.
[14] T. Johnstone, “Emotional speech elicited using computer games,” in Proc. ICSLP, 1996, pp. 1985–1988.
[15] S. Yildirim, C. M. Lee, and S. Lee, “Detecting politeness and frustration state of a child in a conversational computer game,” in Proc. Interspeech, 2005, pp. 2209–2212.
[16] N. Lazarro, “Why we play games: 4 keys to more emotion,” 2007, retrieved February 2007. [Online]. Available: http://www.xeodesign.com/xeodesign
[17] P. Merkx, K. Truong, and M. Neerincx, “Inducing and measuring emotion through a multiplayer first-person shooter computer game,” in Proc. Computer Games Workshop, 2007.
[18] N. Campbell, “A language resources approach to emotion: corpora for the analysis of expressive speech,” in Proc. LREC Workshop on Corpora for Research on Emotion and Affect, 2006.
[19] E. Douglas-Cowie, R. Cowie, and M. Schröder, “A new emotion database: considerations, sources and scope,” in Proc. ISCA Workshop on Speech and Emotion, 2000, pp. 39–44.
[20] S. Steiniger, F. Schiel, O. Dioubina, and S. Raubold, “Development of user-state conventions for the multimodal corpus in SmartKom,” in Proc. Workshop on Multimodal Resources and Multimodal Systems, 2002, pp. 33–37.
[21] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, and P. Wellner, “The AMI Meetings Corpus,” in Proc. Measuring Behavior 2005 Symposium on “Annotating and Measuring Meeting Behavior”, 2005, AMI-108.
[22] N. Morgan, D. Baron, J. Edwards, D. Ellis, D. Gelbart, A. Janin, T. Pfau, E. Shriberg, and A. Stolcke, “The meeting project at ICSI,” in Proc. Human Language Technologies Conference, 2001, pp. 1–7.
[23] H. Hermansky and N. Morgan, “RASTA processing of speech,” IEEE Trans. on Speech and Audio Proc., vol. 2, pp. 578–589, 1994.
[24] SPRACHcore. [Online]. Available: http://www.icsi.berkeley.edu/~dpwe/projects/sprach/
[25] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proc. Eurospeech, 1997, pp. 1895–1898.
[26] M. Grimm, E. Mower, K. Kroschel, and S. Narayanan, “Combining categorical and primitives-based emotion recognition,” in Proc. EUSIPCO, 2006.