Towards a Simulation of Conversations with Expressive Embodied Speakers and Listeners

Thomas Rist, Markus Schmitt
DFKI, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
{rist, mschmitt}@dfki.de

Catherine Pelachaud
LINC – Paragraphe, IUT of Montreuil, University of Paris 8
[email protected]

Massimo Bilvi
University of Rome “La Sapienza”
[email protected]
Abstract

In this paper we present results on modeling complex interactions among virtual characters that participate in negotiation dialogues, as well as our work on a gaze model that controls the eye behavior of several agents conversing with each other. As a test-bed we have created an Avatar Arena in which several avatars negotiate meeting arrangements on behalf of their users. To enhance the naturalness of the emerging negotiation dialogues we need to determine the behaviors of speakers as well as the behaviors of listening characters. In our approach we exploit socio-psychological concepts, such as cognitive balance and dissonance, to determine the verbal and nonverbal behavior of all dialogue participants. We pay particular attention to the gaze behavior of the speaker, considering both the communicative functions the speaker wishes to convey and a statistical model of eye movements. In addition, a model of gaze behavior for the listener is proposed.
1. Introduction

User interfaces of a broad range of applications are becoming increasingly populated with life-like animated characters. Apart from diversity in the characters' embodiment (talking head only, full embodiment), rendering style (2D/3D/video) and conversational expressiveness (voice only, speech synchronized with facial displays and body gestures), most contemporary systems have in common that they assume the user engages in a face-to-face conversation with a single virtual character. A major contribution of the character in such face-to-face settings is often seen in the fact that the character can make use of non-verbal channels and thus enables the emulation of important aspects of multimodal communication among humans. There is, however, also a vast range of applications that require interactions among virtual characters. Depending on its role in a conversation, a character can be a speaker, a listener more or less directly addressed by a speaker, or an overhearer of a conversation among other characters. While considerable research effort has been devoted to the development of sophisticated models of expressive talking heads and fully embodied virtual presenters that align speech production with non-verbal behavior, little attention has been paid so far to modeling the non-verbal behavior of listeners and overhearers. A peculiarity of our work is that we not only generate facial displays for speaking characters but also facial displays for listeners, as well as eye-gaze interaction between a speaker and a listener. The current paper presents Avatar Arena, a test-bed for the simulation of negotiation dialogues among embodied conversational characters that is currently being developed in the EU-funded project Magicster IST-1999-29078.
2. Avatar Arena

2.1 Set-up and functional view

Technically speaking, an Avatar Arena can be conceived as a distributed n:1 client-server architecture (cf. Fig. 1). While the server component provides the arena where the negotiation takes place, a client component allows a user to configure and instruct her/his avatar, and also to observe the negotiation process carried out at the server. To this end, the client receives a generated script of the overall negotiation dialogue for display, e.g., on several screens.
Figure 1: Conceptual view on Avatar Arena (left). Avatar Arena installation with negotiating “Greta Sisters” (right)
Arena avatars negotiate meeting appointments on behalf of human users. However, we have picked this domain just for the purpose of illustration and do not attempt to make a contribution to meeting planning or appointment scheduling as such. Rather, Avatar Arena serves as a test-bed to investigate and evaluate mind models of different "cognitive complexity" for the virtual characters that engage in negotiation dialogues. Our research interest lies solely in simulating the dynamics of social relationships among affective characters during the negotiation dialogues. For the display of negotiation dialogues the early version of Avatar Arena represents all participants by small cartoon characters (using the MS Agent control for their animation). Apart from head movements in the direction of a speaking character, this version does not provide appropriate means to align the behavior of characters that listen to a speaking character. Even if a designer manually crafts an extensive library of facial display animations for potential listener behaviors, a fine-grained temporal coordination of simultaneous speaker and listener behaviors would be difficult to achieve. In contrast, our new version of Avatar Arena not only represents users by 3D talking heads but also displays each talking head on a separate screen. The right-hand part of Fig. 1 shows three negotiating “Greta Sisters” sitting at the table with a human observer. Compared to a single-screen display of a 3D scene that includes all characters, this spatial arrangement has two advantages: Firstly, it supports the impression for a human observer of being herself involved in the scene. Secondly, the talking heads can be displayed in full-screen mode without an intelligent camera control that switches back and forth between speaker and listeners. To determine speech-accompanying non-verbal behaviors a Greta humanoid relies on a taxonomy of communicative functions as proposed by [19]. A communicative function is defined as a pair (meaning, signal) where the meaning corresponds to the communicative value the agent wants to communicate and the signal to the behavior used to convey this meaning. To control our agent we have developed a representation language, called `Affective Presentation Markup Language' (APML), whose tags are the communicative functions [18]. Our system takes as input the text (tagged with APML) the agent should say. The system instantiates the communicative functions into the appropriate signals. The output of the system is the audio file and the animation file that drives our facial model (for further details see [18] and Sec. 2.3).
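To make the (meaning, signal) pairing concrete, the following minimal Python sketch shows how such a mapping might be looked up; the table entries and function names are illustrative assumptions of ours, not the actual Greta implementation, which derives its signals from the taxonomy in [19] and the APML tags in [18].

    # Illustrative sketch: looking up the signal that realizes a communicative
    # function's meaning. The concrete entries are assumptions for illustration.
    COMMUNICATIVE_FUNCTIONS = {
        # meaning               -> signal (facial display / gaze behavior)
        "emphasis":              {"eyebrows": "raise", "gaze": "look_at_addressee"},
        "certainty-low":         {"eyebrows": "frown", "head": "tilt"},
        "performative-inform":   {"gaze": "look_at_addressee"},
    }

    def instantiate(meaning: str) -> dict:
        """Return the signal associated with a communicative-function meaning."""
        return COMMUNICATIVE_FUNCTIONS.get(meaning, {})

    # Example: the signals accompanying an emphasized word
    print(instantiate("emphasis"))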
2.2 Conceptual basis of Avatar Arena

The development of Avatar Arena comprises a number of basic modeling tasks concerning an avatar's:

- Understanding of the domain, e.g., the avatars need to know what a meeting date is, and that one cannot participate in several meetings at the same time.
- Personal attitudes concerning already scheduled meeting dates as well as new dates to be negotiated. That is, we allow the users to assign importance values to appointments that the avatars should take into account in a negotiation.
- Personal attitudes towards other characters. That is, we allow the users to indicate liking relationships (or social distances) holding between themselves and other users.
- Rudimentary conversational skills that enable them to propose meeting dates, to justify their own proposals, as well as to comment on, accept or reject proposals made by others based on their personal calendar entries and attitudes.

Since Avatar Arena does not aim to make a contribution to the area of appointment scheduling systems in the first place, the domain model has been kept quite simplistic. We currently use a small ontology of meeting dates that distinguishes between five categories of activities that can be associated with a meeting: activities that are somehow related to a person's business career, wellness activities, cultural activities, social contacts, and any hobby-related activities (that are not covered by the other classes). Concerning their conversational skills, all avatars have a number of different communicative acts at their disposal. They fall into classes which correspond to the different phases of a negotiation:

Opening phase:
  o Greeting, Task announcement
Negotiation phase:
  o Request-proposal, Propose-date,
  o Accept, Reject, Justify proposal,
  o Comment on justification or attitude
Closing phase:
  o Wrap-up, Leave-taking.

To model attitudes we indicate a polarity that expresses a positive, neutral or negative bias towards a subject matter or another avatar. For instance, an avatar that is interested in cultural events but not at all in the career dimension may not be willing to postpone a theater visit in favor of a late evening working meeting. Since Avatar Arena aims to simulate negotiation dialogues between social entities, the most important modeling task aims at capturing some aspects of group dynamics that are observable in multi-party negotiation processes among humans. We start from the following assumptions:

1. Before a negotiation dialogue starts, all participants have a certain social distance to each other.
2. Avatars make assumptions about the attitudes of other avatars in a way that is compatible with their attitudes towards these other avatars.
3. When an avatar discovers a mismatch between its assumption about another avatar's attitudes (i.e., by listening to the other avatar's utterances), this discovery may cause the experience of dissonance
and may eventually trigger a change in the social distance to the other avatar so that compatibility is achieved again.

As outlined in more detail in [22], this model has its roots in the so-called “Congruity Theory” developed by Osgood and Tannenbaum [17], a socio-psychological theory that can be seen as a derivative of Balance Theory as originally proposed by Heider [13]. In essence, Congruity Theory uses a triangular scheme to describe (i) a receiver (hearer) R and its attitude towards a sender (speaker) S, (ii) R's attitude towards a subject matter X, and (iii) R's assumption about S's attitude towards X. Assertions in this model are made from the perspective of the receiver R. That is, Congruity Theory makes an assertion about the impact of a received message on R's beliefs. This impact depends on (a) the liking relationship between R and S (from the point of view of R), (b) R's attitude towards the subject matter X, and (c) the expressed attitude of S towards the subject matter X. Figure 2 illustrates the concept.
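As a rough illustration of this triangular scheme, the following Python sketch (our own simplification, not the Avatar Arena code) encodes the three relations as polarities in {-1, 0, +1} and flags dissonance when the R-S-X triad is unbalanced in Heider's sense.

    # Sketch of the R-S-X triangle: R's liking of S, R's attitude towards X,
    # and R's assumption about S's attitude towards X, each in {-1, 0, +1}.
    # Simplification for illustration only.

    def is_balanced(liking_r_s: int, attitude_r_x: int, assumed_attitude_s_x: int) -> bool:
        """A triad counts as balanced if the product of the three polarities is
        non-negative (neutral relations are treated as unproblematic); a negative
        product indicates dissonance."""
        return liking_r_s * attitude_r_x * assumed_attitude_s_x >= 0

    # The situation of Fig. 2: R likes S (+1), R believes in career (+1),
    # but S expresses a negative attitude towards career (-1) -> dissonance.
    print(is_balanced(+1, +1, -1))   # False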
[A1] We have to make an arrangement for a meeting in the next 8 days.
[A1] When would it be fine for you?
[A2] What about a meeting in 5 days?
[A3] In 5 days I will have something better to do.
[A3] I am going to be at a private meeting in Paris.
[A3] This arrangement is a bit more important for me than our meeting.
[A2] Every time it's the same with you! You are a real egoist.
[A3] I couldn't care less!
......
[A1] Fantastic! We've done it!
Figure 3: Excerpt of a generated negotiation dialogue
2.3 Avatar Arena with Greta Humanoids

In this section we briefly sketch how Avatar Arena interfaces with several instances of Greta humanoids, as illustrated in the right-hand side of Fig. 1. The installation's overall system architecture is shown in Fig. 4.
Figure 2: A statement made by S causes an imbalance in R’s belief system.
At time t0, R's belief system is balanced with respect to S and X, since R likes S, believes in career, and further believes that career is of similar importance to S, too. During the conversation, however, R learns from S at t1 that S is not at all interested in career. Consequently, R's observation causes some dissonance in her belief system. According to Festinger [11] there is a general tendency for individuals to seek consistency among their beliefs and opinions. Therefore, R may try different strategies of cognitive re-organization to reduce and eliminate dissonance, e.g., R may: (a) change the dissonant beliefs to achieve consistency; (b) lower the importance value of dissonant beliefs; (c) increase the number of consonant beliefs to outweigh dissonant ones. Avatar Arena characters also have different coping strategies at their disposal. They can either change their attitude towards the subject matter, or their attitude towards the speaker who caused a dissonance, or even try to convince the speaker that she should change her attitude. Consequently, the model allows us to simulate changes in personal relationships that may occur as side effects of a negotiation process. To sum up, in addition to task-oriented behavior, changes in an avatar's cognitive configuration (be it a speaker or a hearer) provide an important source for determining an avatar's verbal and non-verbal behavior. An excerpt of a generated sample dialogue is shown in Fig. 3.
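A hypothetical selection rule for these coping strategies might look as follows; the strategy names follow the text, but the heuristic itself is an assumption made for illustration, not the planner actually used in Avatar Arena.

    # Illustrative choice among the three dissonance-reduction strategies.
    def reduce_dissonance(liking_r_s: int, importance_x: int) -> str:
        """Pick a coping strategy given R's liking of the speaker S (-1..+1)
        and the importance R assigns to the subject matter X (e.g., 1..5)."""
        if importance_x <= 1:
            return "change_attitude_towards_subject"   # X matters little: give it up
        if liking_r_s <= 0:
            return "change_attitude_towards_speaker"   # increase social distance to S
        return "try_to_convince_speaker"               # both matter: argue with S

    print(reduce_dissonance(liking_r_s=+1, importance_x=3))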
Figure 4: Avatar Arena interfacing with several Greta animation engines.
The Arena-Server simulates a negotiation dialogue on a meeting arrangement task while taking into account the profile information of the participating avatars. At its heart lies a planning component that determines the dialogue contributions of the avatars during a negotiation process. For details on the planning approach see [1, 2]. Each avatar maintains a network of social relationships for all negotiation parties based on the concept sketched in Sec. 2.2. The dynamic changes of these networks influence the avatars' active participation in the dialogue and also their motivation to make interpersonal statements. The output of the dialogue simulator is a specification of dialogue turns which will be executed by the avatars. Dialogue turns are usually natural language sentences enriched by annotations expressed in APML. An example of an APML expression is shown in Fig. 5. It corresponds to the first turn of avatar A3 in the dialogue script shown in Fig. 3. Once an APML expression for a single speech turn has been determined, the expression is forwarded to the Dialogue-Act-Encoder. The Dialogue-Act-Encoder has two tasks. Firstly, the component creates data structures with unique identifiers that are needed to
establish a reference between a speech turn and the corresponding FAP and audio files. For each speech turn there is one data structure for the speaking avatar that includes a not-yet instantiated reference to the speaker's facial expressions (in the form of a FAP file) and a reference to the speaker's encoded verbal utterance (in the form of a wav audio file). In addition, the component creates a reference to a specification of the listener's facial expressions (in the form of a FAP file). The second task of the Dialogue-Act-Encoder is to call the Greta agent server with an encoding request. The Greta agent server receives as input an APML expression and provides as output a FAP specification for the speaking avatar, an audio file that encodes the verbal realization of the speech turn, and a FAP specification for the addressee (listener) of the speaker's speech turn.

"In 5 days I will have something better to do. I am going to be at a private meeting in Paris. This arrangement is a bit more important for me than our meeting."
Figure 5: Specification of a dialogue turn in APML
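The per-turn data structures created by the Dialogue-Act-Encoder might be sketched as follows; the class and field names are our own invention for illustration and do not reflect the actual implementation.

    # Sketch of the data structures described above: one record for the speaking
    # avatar and one for a listening avatar; the FAP and audio references are
    # filled in later by the output collectors.
    from dataclasses import dataclass, field
    from typing import Optional
    import uuid

    @dataclass
    class SpeakerTurn:
        turn_id: str = field(default_factory=lambda: uuid.uuid4().hex)
        apml: str = ""                      # APML-annotated utterance
        fap_file: Optional[str] = None      # speaker facial animation, set later
        audio_file: Optional[str] = None    # synthesized speech (wav), set later

    @dataclass
    class ListenerTurn:
        turn_id: str = ""                   # identifier linking back to the speaker's turn
        fap_file: Optional[str] = None      # listener facial animation, no audio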
To obtain a verbal realization of the speech turn, the Greta agent server in turn sends an encoding request to the Festival speech synthesis server. This component returns an audio file for playback together with a corresponding specification of a timed sequence of the phonemes involved. While the audio file returned by Festival directly becomes part of the output of the Greta agent server, the phoneme specification is used to calculate visemes and facial expressions for the talking character, and also facial expressions for a second character which is assumed to be the addressed listener of the speech turn. In both cases the specifications are delivered in the form of a FAP file. For the convenience of its clients, the Greta agent server provides the output for the speaking agent and the listening agent at two different ports. The components named Speaker-/Listener-Output-Collector in Fig. 4 are connected to these ports and collect the generated files. Once output is received, the Speaker-Output-Collector instantiates the references to the audio file
and the FAP file in the corresponding data structure that has already been created by the Dialogue-Act-Encoder. The instantiated data structure is then forwarded to the Playlist-Scheduler component. Similarly, the Listener-Output-Collector instantiates the reference to the FAP file of the listening character; in this case there is no audio file. In the case of the Avatar Arena, however, there might be several listening characters, and the current version of the Greta agent server does not support multiple listeners. As a first, rather simplistic approach we just make copies of the generated listener FAP file for all listening characters and forward a corresponding playlist item to the Playlist-Scheduler component. In the following section we describe our current gaze model that controls the eye behavior of several agents conversing with each other.
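Before turning to the gaze model, the listener fan-out just described can be sketched as follows, under the assumption that listener animations are simply duplicated per listening avatar; all names are ours, not those of the actual component.

    import shutil

    def fan_out_listener_faps(listener_fap: str, listeners: list) -> dict:
        """Copy the single generated listener FAP file once per listening avatar
        and return a mapping avatar -> FAP file for the Playlist-Scheduler."""
        playlist_items = {}
        for avatar in listeners:
            copy_name = f"{avatar}_{listener_fap}"
            shutil.copyfile(listener_fap, copy_name)
            playlist_items[avatar] = copy_name
        return playlist_items

    # Example: three listening Greta heads receive the same animation
    # fan_out_listener_faps("turn42_listener.fap", ["A1", "A2", "A4"])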
3. Gaze Model

In previous work, our gaze model was based on the communicative functions model proposed by Poggi et al. [20]. This model predicts what the value of gaze should be in order to convey a given meaning in a given conversational context. For example, if at some point of her speech the agent wants to emphasize a given word, the model will output that the agent should gaze at her conversant. But using only this model creates a very deterministic behavior: every communicative function associated with a meaning is always mapped to the same signals. This model also does not take into account the duration that a given signal remains on the face. Indeed, the model is event-driven: it is only when a communicative function is specified that the associated signals are computed and that the corresponding behaviors may vary. Such a model used by itself has several drawbacks: first of all, it takes into account neither the past nor the current gaze behaviors to compute the new one, nor does it consider how long the gaze states of S and L have lasted. To embed this model into temporal considerations, as well as to partly compensate for factors missing in our gaze model (such as social and cultural aspects), we have developed a statistical model. That is, we use our previously developed model to compute what the communicative gaze behavior should be; the gaze behavior output by this model is then probabilistically modified. The probabilistic model is not simply a random function; rather, it is a statistical model defined with constraints. This model has been built using data reported in [5]. The data corresponds to interactions between two subjects lasting between 20 and 30 minutes. A number of behaviors (vocalic behaviors, gaze, smiles and laughter, head nods, back channels, posture, illustrator gestures, and adaptor gestures) were coded every 1/10th of a second. The analysis of this data (cf. [5]) was done with the aim of establishing two sets of rules. The first one, called `sequence rules',
refers to the time at which a behavior change occurs and its relation to other behaviors (e.g., does breaking mutual gaze happen by both conversants breaking the gaze simultaneously or one after the other?); the second set of rules, called `distributional rules', refers to a probabilistic analysis of the data (e.g., what is the probability of having mutual gaze and a mutual smile?).

Our model comprises two main steps:

1. Communicative prediction: First it applies the communicative function model as introduced in [18] and [20] to compute the gaze behavior that conveys a given meaning.
2. Statistical prediction: The second step is to compute the final gaze behavior using a statistical model, considering information such as: the gaze behavior for the Speaker (S) and a Listener (L) that was computed in step one of our algorithm, which gaze behavior S and L were in previously, and the durations of the current gaze of S and of L.

The first step of the model has already been described elsewhere [18, 20]. In the remainder of this section we concentrate on the statistical model. We use a Belief Network (BN) made up of several nodes. Suppose we want to compute the gaze states of S and L at time Ti; the nodes are:

- Communicative Functions Model: these nodes correspond to the communicative functions applying at time Ti. These functions have been extended beyond the set specified in [20] to take into account the Listener's functions such as back-channel and turn-taking functions.
- Previous State: these nodes denote the gaze direction at time Ti-1.
- Temporal consideration: these nodes monitor for how long S (respectively L) has been in a given gaze state.
- NextGaze: the gaze state for both agents at time Ti, considering the values of the nodes just defined.

The transition from Ti-1 to Ti is phoneme-based, that is, at each phoneme the system instantiates the BN nodes with the appropriate values to obtain from the BN the next gaze state. The outputs are probabilities for each of the four possible states of NextGaze. We use a uniform random draw over these probabilities to select the final gaze state from the four possible states.
Figure 7 shows the belief network used for the gaze model. The weights specified within each node of the BN have been computed using empirical data reported in [5] so as to follow the two sets of rules proposed in this model: sequence and distribution. For example, the BN has been built so that a change of state corresponding to `breaking mutual gaze' may not happen by having both agents break the gaze simultaneously.
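A per-phoneme update step of this statistical model could be sketched as follows; the four NextGaze states are assumed to be the combinations of S and L looking at or away from each other, and the probability table is a placeholder standing in for the belief network of Fig. 7, not the weights actually estimated from [5].

    import random

    # Assumed NextGaze states: (speaker gaze, listener gaze)
    GAZE_STATES = [("at_L", "at_S"), ("at_L", "away"), ("away", "at_S"), ("away", "away")]

    def next_gaze(communicative_prediction, previous_state, duration_s, duration_l):
        """One phoneme-level transition of the gaze model (illustrative sketch).
        Inputs mirror the BN nodes: the communicative prediction from step one,
        the previous gaze state, and how long S and L have held their gaze."""
        # Placeholder distribution; in the real model these probabilities are
        # produced by the belief network conditioned on all of the above.
        probs = {state: 0.25 for state in GAZE_STATES}
        probs[communicative_prediction] += 0.30          # favor the communicative prediction
        if duration_s > 20 or duration_l > 20:           # long-held states become less likely
            probs[previous_state] = max(probs[previous_state] - 0.20, 0.0)
        # Draw the final state with a uniform random number over the probability mass.
        r = random.uniform(0.0, sum(probs.values()))
        cumulative = 0.0
        for state, p in probs.items():
            cumulative += p
            if r <= cumulative:
                return state
        return previous_state

    # Example call (hypothetical values): S has gazed away for 25 phonemes
    # next_gaze(("at_L", "at_S"), ("away", "at_S"), duration_s=25, duration_l=3)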
4. Related Work

The need to model social relationships between the involved characters is increasingly recognized in the field, and some research groups have already started to account for the social dimension in simulated conversations between animated characters. The work by Prendinger and Ishizuka [21] deserves mentioning here. In their SCREAM system they explicitly model concepts such as social distance, social power and threat to enhance the believability of generated dialogues. Traum and Rickel [26] have addressed the issue of multiparty dialogues in immersive virtual environments. In the context of a military mission rehearsal application [23] they address dialogue management comprising human-character and character-character dialogues. They propose a layered model for structuring multiparty, multi-conversation dialogues and point out the importance of non-verbal communication acts for turn management. Since the primary field of application of their work is a military mission rehearsal scenario, turn-taking behavior is often predetermined by the distinct roles of the dialogue partners. This is not the case for un-chaired negotiation dialogues with equally footed participants. Furthermore, besides the "how" of indicating initiative, it is also important to understand the "why" and the "when" a character should try to get the turn in a negotiation dialogue. Of particular interest for our work are also approaches that aim to produce communicative and affective behaviors for embodied conversational characters (e.g., by Ball and Breese [3], Cassell et al. [6, 7], Lester et al. [15], Lundeberg and Beskow [16], Poggi et al. [20]). Some researchers concentrate on gaze models to emulate turn-taking protocols [4, 6, 7, 8, 24], to call for the user's attention [27], to indicate objects of interest in the conversation [4, 15, 25], or to simulate the attending behaviors of agents during different activities and for different cognitive actions [9]. Others [10, 12, 14] use a statistical model to drive eye movements. In particular, Lee et al. [14] based their model on empirical models of saccades and statistical models of eye-tracking data.
5. Conclusions
Figure 7: Belief network used for the gaze model
We have presented Avatar Arena, a test-bed for the simulation of multi-character negotiation dialogues. Our primary research interest lies in the emulation of negotiation dialogues between affective and expressive characters that are embedded in a certain social context. To this end, we considered the characters' attitudes towards other characters and modeled a character's social context in terms of liking
relationships between the character and all other dialogue partners. A gaze model that controls the eye behavior of several agents conversing with each other has also been proposed. An AVI video clip (~99 MB) showing a sample negotiation dialogue among three Greta characters is available at www.dfki.de/mlounge/AAGreta.avi.
References

[1] André, E., and Rist, T. 2000. Presenting Through Performing: On the Use of Life-Like Characters in Knowledge-Based Presentation Systems. In Proc. of IUI 2000: International Conference on Intelligent User Interfaces.
[2] André, E., and Rist, T. 2001. Controlling the Behavior of Animated Presentation Agents in the Interface: Scripting vs. Instructing. AI Magazine 22(4):53-66.
[3] Ball, G. and Breese, J. 2000. Emotion and personality in a conversational agent. In J. Cassell, J. Sullivan, S. Prevost and E. Churchill, eds., Embodied Conversational Agents. MIT Press, Cambridge, MA.
[4] Beskow, J. 1997. Animation of talking agents. In C. Benoit and R. Campbell, eds., Proc. of the ESCA Workshop on Audio-Visual Speech Processing, pp. 149-152.
[5] Cappella, J. and Pelachaud, C. 2002. Rules for responsive robots: Using human interactions to build virtual interactions. In A. Vangelisti, H. Reis and M.A. Fitzpatrick, eds., Stability and Change in Relationships, pp. 325-353. Cambridge University Press.
[6] Cassell, J., Pelachaud, C., Badler, N.I., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., and Stone, M. 1994. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. Computer Graphics (SIGGRAPH '94 Proceedings), 28(4):413-420.
[7] Cassell, J., Bickmore, T., Billinghurst, M., Campbell, L., Chang, K., Vilhjálmsson, H., and Yan, H. 1999. Embodiment in conversational interfaces: Rea. In CHI '99, pp. 520-527.
[8] Cassell, J., Torres, O., and Prevost, S. 1999. Turn taking vs. discourse structure: How best to model multimodal conversation. In Y. Wilks, ed., Machine Conversations. Kluwer, The Hague.
[9] Chopra-Khullar, S. and Badler, N.I. 1999. Where to look? Automating attending behaviors of virtual human characters. In Proc. of Autonomous Agents '99.
[10] Colburn, A. 2000. The role of eye gaze in avatar mediated conversational interfaces. In ACM Transactions on Graphics, SIGGRAPH 2000, Sketches and Applications. ACM Press.
[11] Festinger, L. 1957. A Theory of Cognitive Dissonance. Stanford University Press.
[12] Fukayama, A., Ohno, T., Mukawa, N., Sawaki, M., and Hagita, N. 2002. Messages embedded in gaze of interface agents: Impression management with agent's gaze. In CHI 2002, Vol. 4, pp. 41-48.
[13] Heider, F. 1958. The Psychology of Interpersonal Relations. New York: Wiley. Chapter 7, pp. 174-217.
[14] Lee, S., Badler, J., and Badler, N. 2002. Eyes alive. In ACM Transactions on Graphics, SIGGRAPH 2002, pp. 637-644. ACM Press.
[15] Lester, J.C., Stuart, S.G., Callaway, C.B., Voerman, J.L., and FitzGerald, P.J. 2000. Deictic and emotive communication in animated pedagogical agents. In J. Cassell, J. Sullivan, S. Prevost and E. Churchill, eds., Embodied Conversational Agents. MIT Press, Cambridge, MA.
[16] Lundeberg, M. and Beskow, J. 1999. Developing a 3D agent for the August dialogue system. In Proc. of the ESCA Workshop on Audio-Visual Speech Processing, Santa Cruz, USA.
[17] Osgood, C. and Tannenbaum, P. 1955. The principle of congruity in the prediction of attitude change. Psychological Review, 62:42-55.
[18] Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F., and Poggi, I. 2002. Embodied contextual agent in information delivering application. In Proc. of AAMAS '02, Vol. 2.
[19] Poggi, I. 2002. Mind markers. In M. Rector, I. Poggi, and N. Trigo, eds., Gestures. Meaning and Use. University Fernando Pessoa Press, Oporto, Portugal.
[20] Poggi, I., Pelachaud, C., and de Rosis, F. 2000. Eye communication in a conversational 3D synthetic agent. AI Communications, 13(3):169-181.
[21] Prendinger, H. and Ishizuka, M. 2001. Social Role Awareness in Animated Agents. In Proc. of Agents '01, pp. 270-277. ACM Press.
[22] Rist, T., and Schmitt, M. 2003. Applying Socio-Psychological Concepts of Cognitive Consistency to Negotiation Dialog Scenarios with Embodied Conversational Characters. In Proc. of the AISB '02 Symposium on Animated Expressive Characters for Social Interactions, pp. 79-84. (Extended version submitted for publication in 2003.)
[23] Swartout, W., Hill, R., Gratch, J., Johnson, W.L., Kyriakakis, C., LaBore, C., Lindheim, R., Marsella, S., Miraglia, D., Moore, B., Morie, J., Rickel, J., Thiébaux, M., Tuch, L., Whitney, R., and Douglas, J. 2001. Towards the Holodeck: Integrating Graphics, Sound, Character and Story. In Proc. of Agents '01, pp. 409-416.
[24] Thorisson, K. 2002. Natural turn-taking needs no manual. In B. Granström, D. House, and I. Karlsson, eds., Multimodality in Language and Speech Systems, pp. 173-207. Kluwer Academic Publishers.
[25] Thorisson, K.R. 1997. Layered modular action control for communicative humanoids. In Computer Animation '97, Geneva, Switzerland. IEEE Computer Society Press.
[26] Traum, D. and Rickel, J. 2002. Embodied Agents for Multi-party Dialogue in Immersive Virtual Worlds. In Proc. of AAMAS '02, Vol. 2, pp. 766-773.
[27] Waters, K., Rehg, J., Loughlin, M., Kang, S.B., and Terzopoulos, D. 1996. Visual sensing of humans for active public interfaces. Tech. Rep. CRL 96/5, Cambridge Research Laboratory, Digital Equipment Corporation.