Modeling the Emotional State of Computer Users

Gene Ball, Senior Researcher, User Interface Group
Jack Breese, Assistant Director
Microsoft Research
One Microsoft Way, Redmond, WA 98052
425-936-5653
[email protected], [email protected]

ABSTRACT

We describe the structure of a Bayesian network designed to monitor the behavior of a user interacting with a conversational computer and use that information to estimate the user's emotional state. Our model of emotional state uses two discrete dimensions, valence (bad to good) and arousal (calm to excited), to capture the dominant aspects of physical emotional response. Emotion is expressed behaviorally in a variety of ways, including linguistic choices, qualities of vocal expression, and movement. In this paper, we focus on vocal expression and its correlation with emotional state, using the psychological literature to suggest the appropriate parameterization of our Bayesian model.

Introduction

Within the human-computer interaction community there is a growing consensus that traditional WIMP (windows, icons, mouse, and pointer) interfaces need to become more flexible, adaptive, and human-oriented [Flanagan 1997]. Simultaneously, technologies such as speech recognition, text-to-speech, video input, and advances in computer graphics are providing increasingly rich tools with which to construct such user interfaces. These trends are driving growing interest in agent- or character-based user interfaces that exhibit quasi-human appearance and behavior. One aspect of developing such a capability is the ability of the system to recognize the emotional state and personality of the user and respond appropriately [Picard 1995, Reeves 1995]. Research has shown that users respond emotionally to their computers. Emotion and personality are of interest to us primarily because of the ways in which they influence behavior, and precisely because those behaviors are communicative: in human dialogues they establish a channel of social interaction that is crucial to the smoothness and effectiveness of the conversation. To be an effective communicant, a computer character needs to respond appropriately to these signals from the user and should produce its own emotional signals that reinforce, rather than confuse, its intended communication. There are two crucial issues on the path to what Picard has termed affective computing [Picard 1997]:

• Providing a system with a mechanism to infer the likely emotional state and personality of the user, and

• Providing a mechanism for generating behavior in an agent (e.g. speech and gesture) consistent with a desired personality and emotional state.

In earlier work [Breese 98], we showed that a Bayesian network can be used to relate the internal states of emotion to the external behaviors that they cause. The network can then be used to generate behaviors in an animated agent that appropriately express the emotional state that it is intended to portray. In this paper we focus on the diagnostic side of the problem, where the same causal network can be used to discover the internal emotional states which are most consistent with the behaviors that the system observes in the human user.


Modeling Emotion

The understanding of emotion is the focus of an extensive psychology literature. In this work, we adopt a simple model in which the current emotional state is characterized by discrete values along just two dimensions. These internal states are then treated as unobservable variables in a Bayesian network model. We construct model dependencies based on purported causal relations from these unobserved variables to observable quantities (external expressions of emotion) such as word choice, facial expression, speech speed, etc. Bayesian networks are an appropriate tool due to the uncertainty inherent in this domain. The flexibility of the dependency structures expressible within the Bayes net framework makes it possible to integrate the various expressions of emotion in a single model that is easily extended and modified. Emotion is the term used in psychology to describe short-term variations in internal mental state, including both physical responses like fear and cognitive responses like jealousy. We focus on two basic dimensions of emotional response [Lang 1995] that can usefully characterize nearly any experience:

• Valence represents overall happiness, encoded as positive (happy), neutral, or negative (sad).

• Arousal represents the intensity level of emotion, encoded as excited, neutral, or calm.

Many commonly named emotions (especially those corresponding to physical responses to the environment) can easily be positioned along the Valence-Arousal dimensions (See Figure 1). Psychologists have devised laboratory tests that can reliably measure the physical aspects of emotional state by monitoring physiological variables such as galvanic skin response and heart rate. A computer-based agent does not have these "sensors" at its disposal, so alternative sources of information must be used.

Figure 1: Position of some named emotions (Angry, Joyful, Sad, Relaxed) within the Valence-Arousal space.
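To make this discrete representation concrete, the following minimal sketch (in Python; all names are ours for illustration and are not taken from the paper's implementation) encodes the two dimensions and the approximate placement of the named emotions from Figure 1:

# Illustrative sketch of the discrete valence-arousal state space described above.
# Names and placements are our own reading of Figure 1, not the authors' code.

VALENCE = ("negative", "neutral", "positive")   # bad ... good
AROUSAL = ("calm", "neutral", "excited")        # low ... high intensity

# Each named emotion maps to one (valence, arousal) cell of the grid.
NAMED_EMOTIONS = {
    "angry":   ("negative", "excited"),
    "joyful":  ("positive", "excited"),
    "sad":     ("negative", "calm"),
    "relaxed": ("positive", "calm"),
}

def describe(emotion: str) -> str:
    valence, arousal = NAMED_EMOTIONS[emotion]
    return f"{emotion}: valence={valence}, arousal={arousal}"

if __name__ == "__main__":
    for name in NAMED_EMOTIONS:
        print(describe(name))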

A traditional interactive computer system has very limited sources of information about the user's emotional state. With keyboard and mouse as the only means of interaction, the speed of input is the only clue available. As interfaces begin to make use of spoken input, using speech recognition software to decode the word sequence from a microphone signal, a much richer set of emotional cues becomes accessible. The choice of words used to express a concept can frequently carry emotional weight, and in [Breese 98] we described a mechanism for integrating that information into our Bayesian model of emotion and personality. In human-human conversation, it is clear that much of the emotional communication occurs through the non-linguistic (prosodic) aspects of speech. It is not just what we say, but how we say it, that conveys our feelings of the moment. This implies that there is information about emotional state encoded in the acoustic signal beyond just the sequence of phonemes being pronounced. Of course, this encoding is very complex, and the problem of fully decoding the prosodic content of speech is probably at least as difficult as the speech recognition problem itself.
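The paper does not specify how the acoustic parameters are extracted from the microphone signal. Purely as an illustration, the rough sketch below (our assumptions: a mono waveform of float samples, fixed-length frames, and a naive autocorrelation pitch estimate with no voicing detection) computes two of the prosodic cues discussed in the next section, average pitch and speech energy:

# Rough, hypothetical sketch of extracting acoustic cues (average pitch, pitch
# range, energy) from an utterance. A real prosody front end would use a robust
# pitch tracker with voicing detection; this is only meant to show the idea.
import numpy as np

def rms_energy(frame: np.ndarray) -> float:
    """Root-mean-square energy of one frame of float samples."""
    return float(np.sqrt(np.mean(frame ** 2)))

def pitch_autocorr(frame: np.ndarray, sample_rate: int,
                   fmin: float = 75.0, fmax: float = 400.0) -> float:
    """Crude F0 estimate: autocorrelation peak within the [fmin, fmax] lag band."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sample_rate / lag

def prosodic_features(signal: np.ndarray, sample_rate: int,
                      frame_ms: int = 40) -> dict:
    """Average pitch, pitch range, and energy over fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, frame_len)]
    pitches = [pitch_autocorr(f, sample_rate) for f in frames]
    energies = [rms_energy(f) for f in frames]
    return {"mean_pitch_hz": float(np.mean(pitches)),
            "pitch_range_hz": float(np.max(pitches) - np.min(pitches)),
            "mean_energy": float(np.mean(energies))}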


Vocal Expression of Emotion

As summarized by Murray and Arnott [Murray 1993], there is a considerable (but fragmented) literature on the vocal expression of emotion. Research has been complicated by the lack of agreement on the fundamental question of what constitutes emotion, and how it should be measured. Most work is based upon either self-reporting of emotional state or upon an actor's performance of a named emotion. In both cases, a short list of "basic emotions" is generally used; however, the categories used vary among studies.

A number of early studies demonstrated that vocal expression carries an emotional message independent of its verbal content, using very short fragments of speech, meaningless or constant carrier phrases, or speech modified to make it unintelligible. These studies generally found that listeners can recognize the intended emotional message, although confusions between emotions with a similar arousal level are relatively frequent. Using synthesized speech, in a 1989 MIT Master's thesis, Janet Cahn showed that the acoustic parameters of the vocal tract model in the DECtalk speech synthesizer could be modified to express emotion, and that listeners could correctly identify the intended emotional message in most cases [Cahn 1989]. Studies by the Geneva Emotion Research Group [Banse 1996, Johnstone 1995] have looked at some of the emotional states that seem to be most confusable in vocal expression. They suggest, for example, that the communication of disgust may not depend on acoustic parameters of the speech itself, but on short sounds generated between utterances. In more recent work [Johnstone 1999], they have collected both vocal and physiological data from computer users expressing actual emotional responses to interactive tasks.

The body of experimental work on vocal expression indicates that arousal, or emotional intensity, is encoded fairly reliably in the average pitch and energy level of speech. This is consistent with the theoretical expectation of increased muscle tension in high-arousal situations. Pitch range and speech rate also show correlations with emotional arousal, but these are less reliable indicators.

The communication of emotional valence through speech is a more complicated matter. While there are some interesting correlations with easily measured acoustic properties, particularly pitch range, complex variations in rhythm seem to play an important role in transmitting positive/negative distinctions. In spite of the widely recognized ability to "hear a smile", which Tartter [Tartter 1980] related to formant shifts and speaker-dependent amplitude and duration changes, no reliable acoustic measurements of valence have been found. Roy and Pentland [Roy 1996] more recently performed a small study in which a discrimination network, trained with samples from three speakers expressing imagined approval or disapproval, was able to distinguish those cases with reliability comparable to human listeners. Thus, recognition of emotional valence from acoustic cues remains a possibility, but supplementary evidence from other modalities (especially observation of facial expression) will probably be necessary to achieve reliable results.

Bayesian Model of Vocal Expression

Our preliminary Bayesian sub-network representing the effects of emotional valence and arousal on vocal expression therefore includes the causal links shown in Figure 2. The parameterization of the probability distributions for each link reflects the trends reported in the literature cited above, as follows:

• Increasing levels of arousal cause:
  o higher average pitch,
  o wider pitch range,
  o faster speech, and
  o higher speech energy.

• More positive valence produces:
  o higher average pitch,
  o a tendency toward a wider pitch range, and
  o a bias toward higher speech energy.


Figure 2: A Bayesian network showing the modeled causal links between emotional state (Valence, Arousal) and aspects of vocal expression (Base Pitch, Pitch Range, Speech Rate, Speech Energy).
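The paper does not publish the numerical conditional probabilities for these links, so the sketch below (Python, with distributions we invented) merely follows the qualitative trends listed above: speech rate is driven by arousal alone, while pitch, pitch range, and energy are driven mainly by arousal with a smaller positive-valence bias. It then performs the diagnostic inference by enumerating the discrete hidden states.

# Illustrative sketch of using the Figure 2 sub-network diagnostically.
# The conditional probabilities are invented to follow the qualitative trends
# from the literature review; they are not the authors' parameters.
from itertools import product

VALENCE = ("negative", "neutral", "positive")
AROUSAL = ("calm", "neutral", "excited")
LEVELS  = ("low", "medium", "high")

# Uniform priors over the hidden emotional state.
P_VALENCE = {v: 1 / 3 for v in VALENCE}
P_AROUSAL = {a: 1 / 3 for a in AROUSAL}

def _level_dist(score: float) -> dict:
    """Map a score in [0, 1] to a distribution over low/medium/high."""
    weights = {"low": (1 - score) ** 2,
               "medium": 2 * score * (1 - score),
               "high": score ** 2}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def cpt(node: str, valence: str, arousal: str) -> dict:
    """P(node | Valence, Arousal); speech rate depends on Arousal only."""
    a = AROUSAL.index(arousal) / 2          # 0, 0.5, 1
    v = VALENCE.index(valence) / 2
    if node == "speech_rate":
        score = a
    elif node == "base_pitch":
        score = 0.7 * a + 0.3 * v           # arousal dominates, valence biases
    elif node in ("pitch_range", "speech_energy"):
        score = 0.8 * a + 0.2 * v
    else:
        raise ValueError(node)
    return _level_dist(score)

def posterior(evidence: dict) -> dict:
    """P(Valence, Arousal | observed acoustic levels) by enumeration."""
    joint = {}
    for v, a in product(VALENCE, AROUSAL):
        p = P_VALENCE[v] * P_AROUSAL[a]
        for node, level in evidence.items():
            p *= cpt(node, v, a)[level]
        joint[(v, a)] = p
    z = sum(joint.values())
    return {state: p / z for state, p in joint.items()}

if __name__ == "__main__":
    obs = {"base_pitch": "high", "speech_energy": "high", "speech_rate": "high"}
    for (v, a), p in sorted(posterior(obs).items(), key=lambda kv: -kv[1]):
        print(f"valence={v:8s} arousal={a:8s} P={p:.3f}")

Because every acoustic cue in this sub-network rises with both arousal and positive valence, the posterior computed this way cannot fully separate the two dimensions; this is the limitation discussed in the Conclusion.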

Conclusion

We have developed a Bayesian network model that relates two primary dimensions of emotion (valence and arousal) to emotionally expressive behaviors. A review of the psychological literature on the connection between vocal expression and emotional state suggests the importance of certain acoustic parameters in recognizing arousal, and reveals a lack of reliably distinguishable markers for emotional valence. As our model is currently structured, acoustic parameters alone cannot distinguish between increasing arousal and more positive valence. Other sources of evidence, especially facial expressions and word choices, would typically address that deficiency in the full emotional network.

REFERENCES

[Banse 1996] Banse, R. & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70, 614-636. [Breese 98] Breese, Jack and Ball, Gene. “Bayesian Networks for Modeling Emotional State and Personality: Progress Report” in Emotional and Intelligent: The Tangled Knot of Cognition, Papers from the 1998 AAAI Fall Symposium, October 23-25, Orlando, Florida, Technical Report FS-98-03, AAAI Press, 37-42. [Cahn 1989] Cahn, Janet E., Generating Expression in Synthesized Speech. Master's Thesis, Massachusetts Institute of Technology. May, 1989. [Flanagan 1997] Flanagan, J., Huang, T., Jones, P. and Kasif, S. (1997). Final Report of the NSF Workshop on Human-Centered Systems: Information, Interactivity, and Intelligence. Washington, D.C., National Science Foundation. [Jensen 1989] Jensen, F. V., L., L. S. and G., O. K. (1989). Bayesian updating in recursive graphical models by local computations. Institute for Electronic Systems, Department of Mathematics and Computer Science, University of Aalborg, Denmark. [Jensen 1996] Jensen, F. V. (1996). An Introduction to Bayesian Networks. New York, New York, Springer-Verlag. [Johnstone 1995] Johnstone, I. T., Banse, R. & Scherer, K. R. (1995). “Acoustic Profiles from Prototypical Vocal Expressions of Emotion. ” Proceedings of the XIIIth International Congress of Phonetic Sciences, 4, 2-5. [Johnstone 1999] Johnstone, T. & Scherer, K. R. (in press). The effects of emotions on voice quality. To appear in the Proceedings of the XIVth International Congress of Phonetic Sciences, August 1999.


[Lang 1995] Lang, P. (1995). "The emotion probe: Studies of motivation and attention." American Psychologist, 50(5), 372-385.

[Murray 1993] Murray, I. R. and Arnott, J. L. (1993). "Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion." Journal of the Acoustical Society of America, 93(2), 1097-1108, February 1993.

[Picard 1995] Picard, R. W. (1995). Affective Computing. Perceptual Computing Section Technical Report No. 321, Cambridge, Massachusetts, M.I.T. Media Lab.

[Picard 1997] Picard, R. W. (1997). Affective Computing. Cambridge, Massachusetts, MIT Press.

[Pittam 1990] Pittam, J., Gallois, C. and Callan, V. (1990). "The long-term spectrum and perceived emotion." Speech Communication, 9, 177-187.

[Reeves 1995] Reeves, B. and Nass, C. (1995). The Media Equation. New York, New York, CSLI Publications and Cambridge University Press.

[Roy 1996] Roy, D. and Pentland, A. (1996). "Automatic spoken affect analysis and classification." In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 363-367, Killington, VT, October 1996.

[Scherer 1971] Scherer, K., Koivumaki, J. and Rosenthal, R. (1971). "Minimal cues in the vocal communication of affect: Judging emotions from content-masked speech." Journal of Psycholinguistic Research, 1, 269-285.

[Tartter 1980] Tartter, V. C. (1980). "Happy Talk: Perceptual and Acoustic Effects of Smiling on Speech." Perception and Psychophysics, 27, 24-27.
