Speech-based Emotion Characterization using Postures and Gestures in CVEs

Senaka Amarakeerthi, Rasika Ranaweera, and Michael Cohen
Spatial Media Group, University of Aizu
Aizu-Wakamatsu 965-8580, Japan
{d8111101, d8121104, mcohen}@u-aizu.ac.jp

Abstract—Collaborative Virtual Environments (CVEs) have become increasingly popular in the past two decades. Most CVEs use avatar systems to represent each user logged into a CVE session. Some avatar systems are capable of expressing emotions with postures, gestures, and facial expressions. In previous studies, various approaches have been explored to convey emotional states to the computer, including voice and facial movements. We propose a technique to detect emotions in the voice of a speaker and animate avatars to reflect the extracted emotions in real time. The system has been developed in "Project Wonderland," a Java-based open-source framework for creating collaborative 3D virtual worlds. In our prototype, six primitive emotional states (anger, dislike, fear, happiness, sadness, and surprise) were considered. An emotion classification system which uses short-time log frequency power coefficients (LFPC) to represent features and hidden Markov models (HMMs) as the classifier was modified to build an emotion classification unit. Extracted emotions were used to activate existing avatar postures and gestures in Wonderland.

Keywords-avatar animation; voice; emotion detection; hidden Markov model; emotion representation

I. INTRODUCTION

Classification of emotions in human speech has become a popular area of research due to the wide variety of domains which can benefit from such technology. Among the identified fields are call center applications, medical diagnosis, lie detection, and child care applications. In terms of communication, this technology enables a more natural way of communicating between those who meet in virtual environments, complementing traditional text-based and microphone-and-speaker-based approaches. Collaborative Virtual Environments (CVEs) can be used for a broad spectrum of purposes, ranging from simple entertainment to advanced research. A well-known example of a CVE is "Second Life." In computer-simulated virtual environments, avatars are driven by their respective users and have expressive capabilities ranging from simple movements such as bowing to complicated movements such as facial expressions.

Humans interact with each other in several ways, such as speech, eye contact, and gesture. The expression of emotions enriches social interactions by providing participants a rich channel of information. Using the human voice to extract the affective state of a collaborator in a CVE greatly increases the richness of communication. The success rate of conveying emotions depends greatly on how well the interface can capture factors which are enriched with emotional clues. Facial expressions, voice, heartbeat, blood pressure, and skin humidity can be considered biometric parameters which can be used to characterize the emotional state of a user. To acquire emotional state from some of the above-mentioned parameters, sophisticated instruments and advanced emotion classification techniques are required. Even though the accuracy of emotion representation with biometric parameters is limited, they are not easily fooled, compared to other approaches. As far as CVEs are concerned, the most popular modes of communication are emoticons, text-chat, and voice-chat. Voice-chat can be considered the most efficient and fastest way of communication in CVEs. In our approach, we have selected the human speech signal as the feature source. Fig. 1 illustrates the overall process of emotion representation in CVEs.

Figure 1. Emotion representation process

II. RELATED WORK

The field of affective computing has several research avenues for recognition, interpretation, and representation of affect. Emotional information is conveyed across a wide range of modalities, including affect in written language, speech, facial display, posture, and physiological activity [16]. Several studies have been carried out to extract emotions from voice and to convey emotions through avatars in virtual environments. Different techniques available for voice-related emotion extraction and classification have been discussed by Vogt et al. [21]. Real-time voice-based emotion extraction and classification using mel frequency cepstral coefficients and Bayes classification, a probabilistic classifier based on Bayes' theorem, was prototyped by the same researchers. Nwe et al. used HMMs for voice-based emotion classification [14]. An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters. In a regular Markov model, states are visible to an observer, and therefore the state transition probabilities are known. In an HMM, the state is not directly visible to an observer, but manifests through visible output. Some preliminary studies using posture as a modality for expressing emotions have been reported [8]. Nonverbal interaction in CVEs on PC platforms and on mobile platforms was closely examined, and prototypes have been developed, by Zhu et al. [24].

Three major aspects of virtual reality (realistic appearance modeling, realistic smooth and flexible motion modeling, and realistic high-level behavior modeling and implementation) are discussed in [9]. By analyzing instant messages, emotions have been extracted and classified to animate a conversing avatar by manipulating body gestures and gaze, as described in [17]. Neviarouskaya et al. have developed a text-based affect detection system and haptic sensation system for Second Life [11].

III. EMOTIONS

The word "emotions" usually refers to affective states of the mind. In normal conversation, humans are capable of apprehending the emotions of others from voice, posture, gesture, and facial expressions. In human-computer interaction, such emotions are often ignored. Three major problems which should be addressed to enable machines to detect human emotions are [2]:
1) What is an affective state?
2) Which human communicative signals convey information about affective state?
3) How should various kinds of evidence be combined to optimize inferences about affective states?
For emotion-related studies, researchers have used various classifications of emotions. The Ekman emotions (anger, dislike, fear, happiness, sadness, surprise) are used by most of them [5]. Some psychologists have proposed models to illustrate the relationship between valence, arousal, and emotions. Fig. 2 shows a 2D taxonomy of emotions proposed by Russell [18]. The Pleasure-Arousal-Dominance (PAD) emotional state model is another affective state model popular among psychologists and computer scientists [10].
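To make the taxonomy concrete, the sketch below encodes the six emotions used in our prototype with rough, qualitative valence/arousal placements in the spirit of Russell's circumplex (Fig. 2). The numeric coordinates are illustrative assumptions for demonstration, not values taken from [18].

```java
/**
 * Illustrative only: the six emotions used in the prototype, with rough
 * valence/arousal placements in the spirit of Russell's circumplex.
 * The coordinates are assumptions for demonstration, not values from [18].
 */
public enum Emotion {
    ANGER(-0.6, 0.8),      // negative valence, high arousal
    DISLIKE(-0.6, 0.3),    // negative valence, moderate arousal
    FEAR(-0.7, 0.7),       // negative valence, high arousal
    HAPPINESS(0.8, 0.5),   // positive valence, moderately high arousal
    SADNESS(-0.7, -0.6),   // negative valence, low arousal
    SURPRISE(0.1, 0.9);    // near-neutral valence, very high arousal

    public final double valence;   // displeasure (-1) .. pleasure (+1)
    public final double arousal;   // calm (-1) .. excited (+1)

    Emotion(double valence, double arousal) {
        this.valence = valence;
        this.arousal = arousal;
    }
}
```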

Figure 2. Circumplex model of core affect with relevant emotions

IV. PROJECT WONDERLAND

Project Wonderland (http://openwonderland.org) is a Java-based open-source toolkit for creating collaborative 3D virtual worlds. With small development effort, highly customized, special-purpose virtual worlds can be created. Within these worlds, users can communicate with immersive audio and share live desktop applications and documents. With its modular architecture, any part of the system can be extended by developers to add functionality. An avatar in Wonderland is a module that includes code, scripts, artwork, and world descriptions. It is packaged as a Java Archive (JAR) file with a specific structure and XML manifest files describing its content. Using standard CAD tools such as 3D Studio Max, Blender, Maya, SketchUp, or Softimage, content can be exported in COLLADA (COLLAborative Design Activity) format, an XML-based interchange file format for interactive 3D applications. Such objects can be rendered by Java code at runtime as an animated avatar module. Wonderland is in continuous development. Avatars in Wonderland can be animated with a limited set of pre-defined movements such as raised hand, nod, shake head, bow, shake hand, and eye winks, as in Fig. 3, but there is no limitation to integrating new animations along with new artwork or combinations of existing content.

V. SOUND CAPTURING AND PREPROCESSING

For real-time processing, the voice of the human user is captured via a microphone. The voice stream is then subjected to segmentation. The goal of audio segmentation is to divide a speech stream into units which are capable of carrying emotions. Most previous research has not addressed audio segmentation issues, since it is based on prerecorded, well-defined utterances.
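As a minimal sketch of the capture stage, the following Java snippet records mono 16-bit PCM at 22.05 kHz (the format used in our preprocessing, described below) and hands off two-second segments for classification. The use of javax.sound.sampled and the consumer callback are assumptions for illustration; the paper does not prescribe a particular capture implementation.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;
import java.util.function.Consumer;

/** Captures mono 16-bit PCM at 22.05 kHz and emits two-second segments. */
public class VoiceCapture {
    private static final float SAMPLE_RATE = 22050f;
    private static final int SEGMENT_SAMPLES = (int) (2 * SAMPLE_RATE); // 2 s window

    public static void capture(Consumer<short[]> segmentConsumer)
            throws LineUnavailableException {
        AudioFormat format = new AudioFormat(SAMPLE_RATE, 16, 1, true, false);
        TargetDataLine line = AudioSystem.getTargetDataLine(format);
        line.open(format);
        line.start();

        byte[] buffer = new byte[SEGMENT_SAMPLES * 2]; // 16-bit = 2 bytes/sample
        while (!Thread.currentThread().isInterrupted()) {
            int read = 0;
            while (read < buffer.length) {            // fill one 2 s segment
                int n = line.read(buffer, read, buffer.length - read);
                if (n <= 0) break;
                read += n;
            }
            short[] samples = new short[read / 2];    // little-endian bytes -> samples
            for (int i = 0; i < samples.length; i++) {
                samples[i] = (short) ((buffer[2 * i] & 0xFF) | (buffer[2 * i + 1] << 8));
            }
            segmentConsumer.accept(samples);          // pass segment to the classifier
        }
        line.stop();
        line.close();
    }
}
```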

Table I
EMOTIONS AND THEIR EXPRESSION USING BODY POSTURES

Anger: backward head bend; absence of a backwards chest bend; no abdominal twist; arms raised forwards and upwards; weight transfer is either forwards or backwards.
Dislike: higher degree of abdominal twisting; weight transfer is either forwards or backwards; most features are not very predictive.
Fear: head backwards and no abdominal twist are predictive; no effect of chest bend or upper arm position; forearms are raised; weight transfer is either backwards or forwards.
Happiness: head backwards; no forwards movement of the chest; arms raised above shoulder level and straight at the elbow; weight transfer is not predictive.
Sadness: forwards head bend; forwards chest bend; no twisting; arms at the side of the trunk; weight transfer is not predictive.
Surprise: backwards head and chest bends; any degree of abdominal twisting; arms raised with forearms straight; weight transfer is not predictive.

In spontaneous speech, no clear boundaries exist. Segmented voice units should fulfill certain requirements to be useful for emotion classification [22]: they should be long enough to reliably calculate the affective state of the voice, but short enough to avoid covering more than one emotion. For segmentation, utterances have been used by [14], [4], [6], [15], and [22], and words by [12], [23], and [22]. The relevant research reveals that the appropriate segment length is highly dependent on the problem. Our approach windows the voice stream into two-second segments, which are then processed by the classifier. The sound stream is sampled as 16-bit PCM at 22.05 kHz. Each speech segment is further divided into 16 ms frames with a 9 ms overlap, and a Hamming window is applied to each frame to reduce spectral leakage.

VI. FEATURE EXTRACTION

From the preprocessed sound stream, emotion-relevant features must be extracted. There is no common agreement among researchers on which features should be extracted; the best approach is to perform a statistical analysis of features proven in previous research and identify the most relevant ones based on the results [22]. Mel-based speech power coefficients have been identified as an indicator of the power of short-time portions of a speech signal, and they are capable of quantifying the agitation and calm of a given signal [13]. To extract features, the Mel-scale filter bank implemented by [14] is used. This filter bank was designed by studying the auditory resolving power at various frequencies.

VII. EMOTION CLASSIFICATION

Once feature vectors are created, emotion classification can be treated as a data mining problem. In principle, any statistical classifier that can handle multi-dimensional data can be used to classify affective state. The most commonly used techniques are support vector machines, neural networks, and HMMs. In our approach we use an HMM as the emotion classifier.
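The sketch below illustrates the preprocessing and feature steps described in Sections V and VI: splitting a two-second segment into overlapping 16 ms frames, applying a Hamming window, and computing log band-power values per frame. The linear band boundaries, band count, and direct DFT are simplifying assumptions; the actual LFPC implementation follows the Mel-scale filter bank of [14].

```java
/**
 * Sketch of the preprocessing/feature stage: overlapping 16 ms frames
 * with a 9 ms overlap, Hamming windowing, and per-band log power.
 * Linear frequency bands and a direct DFT are simplifications; the real
 * system uses the Mel-scale filter bank of [14] for LFPC features.
 */
public class LfpcLikeFeatures {
    private static final double SAMPLE_RATE = 22050.0;
    private static final int FRAME_LEN = (int) (0.016 * SAMPLE_RATE);  // 16 ms frame
    private static final int FRAME_HOP = (int) (0.007 * SAMPLE_RATE);  // 9 ms overlap
    private static final int NUM_BANDS = 12;                           // assumed band count

    /** One feature vector (log band powers) per frame of the segment. */
    public static double[][] extract(short[] segment) {
        int numFrames = Math.max(0, (segment.length - FRAME_LEN) / FRAME_HOP + 1);
        double[][] features = new double[numFrames][NUM_BANDS];
        for (int f = 0; f < numFrames; f++) {
            double[] frame = new double[FRAME_LEN];
            for (int i = 0; i < FRAME_LEN; i++) {
                // Hamming window to reduce spectral leakage.
                double w = 0.54 - 0.46 * Math.cos(2 * Math.PI * i / (FRAME_LEN - 1));
                frame[i] = w * segment[f * FRAME_HOP + i];
            }
            double[] power = powerSpectrum(frame);
            int binsPerBand = power.length / NUM_BANDS;     // linear bands (simplification)
            for (int b = 0; b < NUM_BANDS; b++) {
                double sum = 1e-10;                         // floor to avoid log(0)
                for (int k = b * binsPerBand; k < (b + 1) * binsPerBand; k++) sum += power[k];
                features[f][b] = Math.log(sum);
            }
        }
        return features;
    }

    /** Direct DFT power spectrum (adequate for a 16 ms frame in a sketch). */
    private static double[] powerSpectrum(double[] x) {
        int n = x.length;
        double[] p = new double[n / 2];
        for (int k = 0; k < p.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double ang = 2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(ang);
                im -= x[t] * Math.sin(ang);
            }
            p[k] = re * re + im * im;
        }
        return p;
    }
}
```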

Figure 3. Project Wonderland
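To make the HMM-based classification of Section VII concrete, a minimal sketch follows: one discrete HMM per emotion scores a sequence of vector-quantized feature frames via the forward algorithm in the log domain, and the emotion whose model yields the highest log-likelihood is selected. The model topology, codebook, and training are omitted, and the actual classification unit reports a percentage per emotion (Section VIII); this is not the exact configuration of [14].

```java
import java.util.Map;

/** Scores a vector-quantized observation sequence with one HMM per emotion. */
public class HmmEmotionClassifier {
    /** Minimal discrete HMM with forward-algorithm scoring in the log domain. */
    public static class DiscreteHmm {
        private final double[] logPi;    // log initial-state probabilities
        private final double[][] logA;   // log transition probabilities
        private final double[][] logB;   // log emission probs [state][symbol]

        public DiscreteHmm(double[] logPi, double[][] logA, double[][] logB) {
            this.logPi = logPi; this.logA = logA; this.logB = logB;
        }

        /** log P(obs | model) for a non-empty sequence of codebook indices. */
        public double logLikelihood(int[] obs) {
            int n = logPi.length;
            double[] alpha = new double[n];
            for (int i = 0; i < n; i++) alpha[i] = logPi[i] + logB[i][obs[0]];
            for (int t = 1; t < obs.length; t++) {
                double[] next = new double[n];
                for (int j = 0; j < n; j++) {
                    double acc = Double.NEGATIVE_INFINITY;
                    for (int i = 0; i < n; i++) acc = logAdd(acc, alpha[i] + logA[i][j]);
                    next[j] = acc + logB[j][obs[t]];
                }
                alpha = next;
            }
            double total = Double.NEGATIVE_INFINITY;
            for (double a : alpha) total = logAdd(total, a);
            return total;
        }

        private static double logAdd(double x, double y) {
            if (x == Double.NEGATIVE_INFINITY) return y;
            if (y == Double.NEGATIVE_INFINITY) return x;
            double m = Math.max(x, y);
            return m + Math.log(Math.exp(x - m) + Math.exp(y - m));
        }
    }

    /** Returns the emotion whose trained model best explains the observations. */
    public static String classify(Map<String, DiscreteHmm> modelPerEmotion, int[] obs) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, DiscreteHmm> e : modelPerEmotion.entrySet()) {
            double score = e.getValue().logLikelihood(obs);
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```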

VIII. MATLAB-CVE BRIDGE

The above-mentioned feature extraction and classification steps are performed in Matlab. The output of the emotion classification unit consists of six values, one for each emotion, each given as a percentage of that particular emotion. Emotions cannot be considered discrete entities with clear boundaries [1], so directly mapping the dominant emotion for emotion characterization does not yield good results. The Matlab-CVE bridge therefore calculates a degree of intensity for the posture, and the calculated values are conveyed to the Wonderland avatar system.
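A minimal sketch of the bridge logic is shown below: it takes the six per-emotion percentages, derives a dominant emotion and an intensity value, and forwards them to the avatar side. The intensity heuristic (dominant percentage scaled to [0, 1]) and the newline-delimited text-over-TCP message are assumptions for illustration; the paper does not specify the actual intensity calculation or the transport between Matlab and Wonderland.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.Map;

/**
 * Sketch of the Matlab-CVE bridge: converts six per-emotion percentages
 * into a dominant emotion plus intensity and forwards them to the avatar
 * system. The intensity heuristic and the text-over-TCP message format
 * are illustrative assumptions.
 */
public class EmotionBridge {
    private final PrintWriter out;

    public EmotionBridge(String host, int port) throws IOException {
        out = new PrintWriter(new Socket(host, port).getOutputStream(), true);
    }

    /** percentages: emotion name -> percentage (0..100) from the classifier. */
    public void forward(Map<String, Double> percentages) {
        String dominant = null;
        double max = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : percentages.entrySet()) {
            if (e.getValue() > max) { max = e.getValue(); dominant = e.getKey(); }
        }
        double intensity = Math.min(1.0, Math.max(0.0, max / 100.0)); // assumed scaling
        out.println(dominant + ";" + intensity);   // e.g. "anger;0.85"
    }
}
```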

IX. EMOTION EXPRESSION

Virtual characters are vital components of virtual environments, where avatars represent humans. However, creating an interactive, responsive, and expressive virtual character is difficult because of the complex nature of human nonverbal communication, which spans facial expression, body posture, and gesture [20]. Only limited research has been carried out regarding representation of affective nonverbal communication through posture and gesture [3], [7], [19]. Coulson has explained the relationship between emotions and body postures [3], depicted in Fig. 4; elaborations of those relationships can be found in Table I. For our prototype, existing postures and gestures of avatars were triggered according to the postures explained in [3]. Postures were expressed by calling the relevant API exposed by Wonderland, as sketched below.
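As an illustration of this last step, the sketch below maps each classified emotion to one of the pre-defined Wonderland animations listed in Section IV (raised hand, nod, shake head, bow, shake hand, eye wink). The particular emotion-to-gesture assignments and the triggerGesture callback are hypothetical placeholders, not the mapping or the API actually used in the prototype.

```java
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Illustrative mapping from classified emotions to the pre-defined avatar
 * animations mentioned in Section IV. The assignments below and the
 * triggerGesture callback are hypothetical, not the prototype's actual
 * mapping or the Wonderland API.
 */
public class EmotionToGesture {
    private static final Map<String, String> GESTURES = Map.of(
            "anger", "ShakeHead",
            "dislike", "ShakeHead",
            "fear", "Bow",
            "happiness", "RaiseHand",
            "sadness", "Bow",
            "surprise", "Wink");

    /** Triggers the gesture for an emotion, scaled by the bridge's intensity. */
    public static void express(String emotion, double intensity,
                               BiConsumer<String, Double> triggerGesture) {
        String gesture = GESTURES.getOrDefault(emotion, "Nod");
        triggerGesture.accept(gesture, intensity);   // delegate to the avatar system
    }
}
```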

Figure 4. Emotion representations: (a-c) represent anger, (i-k) dislike, (d-f) happiness, (l-n) sadness, (o-q) fear, and (r-t) surprise, shown from front, side, and rear viewpoints respectively.

X. RESULTS AND CONCLUSION

Even though emotion classification from voice is not a new research area, performing the process in real time is a relatively new area of research. In our approach, we used two-second windows of the voice stream. This approach could be further improved by letting the system select the voice segments dynamically. Experiments were carried out only for user-dependent emotion classification; we will extend our experiments by incorporating user adaptation to implement a user-independent system. Due to its ease of rapid prototyping, we used Matlab as the platform for emotion classification. Table II shows the results of user-dependent real-time emotion classification. Anger, dislike, and surprise were recognized well by the developed system, while classification of fear, happiness, and sadness was not satisfactory. The relevant confusion matrix is shown in Table III.

Table II
AVERAGE ACCURACY OF EMOTION RECOGNITION

Emotion      Average accuracy
Anger        100%
Dislike       90%
Fear          50%
Happiness     35%
Sadness       30%
Surprise      90%

Table III
CONFUSION MATRIX FOR EMOTION RECOGNITION (rows: actual emotion; columns: classified emotion)

             Anger   Dislike   Fear   Happiness   Sadness   Surprise
Anger         100%       0%     0%          0%        0%         0%
Dislike         0%      90%    10%          0%        0%         0%
Fear            0%      40%    50%          5%        0%         5%
Happiness      15%      30%     0%         35%        0%        20%
Sadness         0%      20%    35%          5%       30%        10%
Surprise        5%       5%     0%          0%        0%        90%

Enabling emotions in virtual environments using body postures is challenging. For our research, we selected Project Wonderland because its source is available for extension, but in terms of emotion representation, Wonderland is at an early stage of development. Although we were restricted by the above-mentioned limitations, we achieved emotion representation at a level satisfactory for research purposes.

REFERENCES

[1] S. Amarakeerthi, R. Ranaweera, M. Cohen, and N. Nagel. Mapping Selected Emotions to Avatar Gestures. In Proc. 19th Intelligent System Symp. (FAN 2009), Aizu-Wakamatsu, Japan, 2009.
[2] G. Caridakis, K. Karpouzis, and S. Kollias. User and Context Adaptive Neural Networks for Emotion Recognition. Neurocomputing, 71(13-15):2553–2562, 2006.
[3] M. Coulson. Attributing Emotion to Static Body Postures: Recognition Accuracy, Confusions, and Viewpoint Dependence. J. Nonverbal Behavior, 28(2):117–139, 2004.
[4] L. Devillers, L. Vidrascu, and L. Lamel. Challenges in Real-life Emotion Annotation and Machine Learning Based Detection. Neural Networks, pages 407–422, 2005.
[5] P. Ekman and W. V. Friesen. The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding. Semiotica, 1969.
[6] R. Fernandez and R. W. Picard. Classical and Novel Discriminant Features for Affect Recognition from Speech. In Proc. Interspeech 2005, Lisbon, Portugal, 2005.
[7] A. Kleinsmith, R. P. De Silva, and N. Bianchi-Berthouze. Grounding Affective Dimensions into Posture Features. In Proc. First Int. Conf. on Affective Computing and Intelligent Interaction, 2005.
[8] A. B. Loyall and J. Bates. Real-time Control of Animated Broad Agents. In Proc. Fifteenth Annual Conf. of the Cognitive Science Society, 1993.
[9] N. Magnenat-Thalmann and D. Thalmann. Virtual Humans: Thirty Years of Research, What Next? Visual Computer, 21(12):997–1015, 2005.
[10] A. Mehrabian. Pleasure-Arousal-Dominance: A General Framework for Describing and Measuring Individual Differences in Temperament. Current Psychology, 14(4):261–292, 1996.
[11] A. Neviarouskaya, H. Prendinger, and M. Ishizuka. EmoHeart: Conveying Emotions in Second Life Based on Affect Sensing from Text. Advances in Human-Computer Interaction, Special Issue on Emotion-Aware Natural Interaction, 2010(1), 2010.
[12] G. Nicholas, M. Rotaru, and D. J. Litman. Exploiting Word-level Features for Emotion Prediction. In Proc. IEEE/ACL Workshop on Spoken Language Technology, Aruba, 2006.
[13] T. L. Nwe, S. W. Foo, and L. C. De Silva. Speech Based Emotion Classification. In Proc. IEEE Region 10 Int. Conf. (TENCON), pages 297–301, 2001.
[14] T. L. Nwe, S. W. Foo, and L. C. De Silva. Speech Emotion Recognition Using Hidden Markov Models. Speech Communication, 41(4):603–623, 2003.
[15] P. Y. Oudeyer. The Production and Recognition of Emotions in Speech: Features and Algorithms. Int. J. of Human-Computer Studies, pages 157–183, 2003.
[16] R. W. Picard. Affective Computing. MIT Press, 1997.
[17] H. Prendinger. The Global Lab: Towards a Virtual Mobility Platform for an Eco-Friendly Society. Trans. of the Virtual Reality Soc. of Japan, pages 163–170, 2009.
[18] J. A. Russell. A Circumplex Model of Affect. J. of Personality and Social Psychology, 39(6):1161–1178, 1980.
[19] V. Vinayagamoorthy, M. Gillies, A. Steed, E. Tanguy, X. Pan, C. Loscos, and M. Slater. Building Expression into Virtual Characters. In Proc. Eurographics, 2006.
[20] V. Vinayagamoorthy, M. Slater, and A. Steed. Emotional Personification of Humanoids in Immersive Virtual Environments. In Proc. Equator Doctoral Colloquium, Brockenhurst, New Forest, Hampshire, 2002.
[21] T. Vogt, E. André, and J. Wagner. Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realization. Springer-Verlag, 2008.
[22] J. Wagner, T. Vogt, and E. André. A Systematic Comparison of Different HMM Designs for Emotion Recognition from Acted and Spontaneous Speech. In Int. Conf. on Affective Computing and Intelligent Interaction (ACII), pages 114–125, Lisbon, Portugal, 2007.
[23] S. Yacoub, S. Simske, X. Lin, and J. Burns. Recognition of Emotion in Interactive Voice Systems. In Proc. Eurospeech 2003, Geneva, Switzerland, 2003.
[24] J. Zhu, Z. Pan, G. Xu, H. Yang, and A. D. Cheok. Virtual Avatar Enhanced Nonverbal Communication from Mobile Phones to PCs. In Edutainment, LNCS 5093, pages 551–561. Springer, Berlin, 2008.