MULTIMODAL COMMUNICATION BETWEEN SYNTHETIC AGENTS

Catherine Pelachaud
Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza", Via Buonarroti 12, 00185 Rome, Italy
[email protected], tel: (39-6) 48299213

Isabella Poggi
Dipartimento di Linguistica, Università di Roma Tre, Via del Castro Pretorio 20, 00185 Rome, Italy
[email protected], tel: (39-6) 491375
ABSTRACT
Dialoging with a synthetic agent is a vast research topic aimed at enhancing user-interface friendliness. We present in this paper an on-going project on the simulation of a dialog situation between two synthetic agents. More particularly, we focus our interest on finding the appropriate facial expressions of a speaker addressing different types of listeners (tourist, employee, child, and so on) using various linguistic forms such as request, question and information. Communication between speaker and listener involves multimodal behaviors: the choice of words, intonation and paralinguistic parameters on the vocal side; facial expressions, gaze, gestures and body movements on the non-verbal side. The choice of each individual behavior, their mutual interaction and their synchronization produce the richness and subtlety of human communication. In order to develop a system that automatically computes the appropriate facial and gaze behaviors corresponding to a communicative act for a given speaker and listener, our first step is to categorize facial expressions and gaze based on their communicative functions rather than on their appearance. The next step is to find inference rules that describe the "mental" process ongoing in the speaker while communicating with the listener. The rules take into account the power relation between speaker and listener and the beliefs the speaker has about the listener to constrain the choice of performative acts.
KEYWORDS: 3D synthetic agents, visual modality, speech acts, facial expression
INTRODUCTION
Conversation is a common means of exchanging information between humans. It involves not only verbal language but also hand gestures, facial expressions, body posture, gaze behaviors and so on. To fully understand the transmitted information one needs to consider both verbal and non-verbal signals. Humans learn from a tender age to decode such signals, to interpret a particular gaze direction, a smile or a body posture. They are highly skilled both at displaying varied and subtle signals and at perceiving them. Listing all the verbal and nonverbal signals that can occur during a conversation is an enormous task. A large amount of work in
the area of cognitive science is available. Scholars have elaborated very sophisticated notation systems to describe nonverbal behaviors ([17], [3]). Detailed descriptions of human interaction have been produced. They evidence the properties of the links between verbal and nonverbal signals: synchrony between signals at different unit levels (syllable, word, sentence...) and the role of each individual signal within the conversation. But most of this work remains very descriptive. In order to make use of this knowledge and let it drive our system, we have to extract the laws ruling these behaviors. As humans we are very sensitive to any errors perceived in the emitted signals: wrong movements, wrong timing of appearance and disappearance of the signals, as well as wrong duration of their display. This is especially true as synthetic agents become more and more realistic: 3D models, fine simulation of muscle actions and of skin elasticity, good lip movements and so on increasingly create the illusion of a realistic model. The use of cartoon faces, caricatures, or other non-human animated objects (animals, or even lifeless objects, as Walt Disney animation has accustomed us to) bypasses this difficulty (even though producing a "good" animation is far from simple). So in the case of a 3D human model one has to be very careful and meticulous in order to bring the model to life. Dialoging with a synthetic agent is a challenging topic and is subject to development in a large variety of domains: the entertainment industry (video games, interactive videos), tourist activities (interactive information assistance systems), school (long-distance teaching systems) and so on. The general question we confront in this work is how a Speaker can decide which words to utter, which intonation to use and which facial expression to exhibit when talking to a given interlocutor in a given situation. However, due to the complexity of such an issue, we limit our research to a more restricted problem: given a Speaker and his/her goal of performing a Communicative Act, that is, of communicating something to a Hearer, we propose a method to compute the appropriate performative of the communicative act and the corresponding facial expressions. Considering the person we are talking to and the context where the dialog is taking place is necessary. Indeed, as Ekman has explained at length [16], display rules (the rules that govern the display of multimodal signals during a conversation) explain the variation one can notice in the behavior of a person talking to different interlocutors in different situations. The same sentence would have different linguistic structures and would be accompanied by different facial expressions if, for example, the interlocutor is a child, a tourist or a high-ranking officer. The power relation between the speaker and the listener varies in the 3 cases; moreover, the knowledge capacity of the 3 listeners is not at all the same. All this information needs to be integrated in our system. In the next section we show the importance of nonverbal
signals in a conversation. We also give a repertoire of these signals and set out the functions and properties linking them. The notion of display rules is introduced at the end of that section. The following section relates our system to other existing systems. Then we present the approach taken in this paper: the notion of performative and the structure of a communicative act are explained, and we propose a hypothesis relating performatives to the facial expressions of emotions. Then an overview of our system is provided, followed by a detailed example and by a description of the system design.
THE RELEVANCE OF NONVERBAL SIGNALS IN COMMUNICATION
Let us give an example to demonstrate the importance that nonverbal signals have in conversation and how they vary the sense of what is being said. Given a very common sentence such as "I don't know", Bolinger points out how it can be interpreted differently depending on the accompanying gestures [4, p. 211]:
1. Lips pursed: 'No comment'.
2. Eyebrows arched: 'I'm wondering too'.
3. Shoulders raised: same.
4. Head tilted sideways: 'Evasion'.
5. Hands held slightly forward, palms up: 'Empty, no information'.
Even for such a simple utterance there exists a large number of interpretations. Actors know this very well, since part of their training consists in repeating the same text with various expressions, bringing to life totally different characters. A conversation is a continuous stream of signals. The speaker chooses particular words to utter and uses a particular intonation. Facial expressions, gaze, hand and body gestures accompany the flow of speech. They are timely linked to what is being said. A raised eyebrow, a smile, a head nod appear at a given moment of the discourse. Each signal has a function in the conversation: it may add, modify or even substitute information. Verbal and nonverbal signals do not occur in a random and independent way. On the contrary, their occurrences are intertwined and synchronized; one needs to consider all of them to fully understand the meaning of the conversation. Imagine conversing with a person who only moves her lips to talk but uses no other signals: no intonation to mark an accent or the end of an utterance, no facial expression, no change in gaze direction, no hand gesture... You would soon have the impression of dialoging with a robot rather than with a human. Moreover, you would have a hard time understanding what she is saying: new and important information in her discourse would not be marked, and no end of turn would be underlined. The absence of any change in gaze direction would soon become embarrassing; having a person either constantly staring at you or constantly avoiding your gaze is a very awkward feeling. Given a sentence such as "I told John to put the blue book over there", various interpretations are possible. If no accent is indicated, the sentence could be interpreted either as "I told John and not Peter to put the blue book over there" or as "I told John to put the blue book and not the red book over there". If no pointing gesture or head direction accompanies the word "there", it has no meaning: "there" could be anywhere. In the same way, if a listener does not give you any feedback during your speech, you will not know his reactions to what you are saying; you will not know if he understands, agrees, or is interested. It will be like talking to a wall! Not displaying the correct facial expression at the right moment can be a source of misunderstanding and can convey the wrong message. Now imagine a person raising her eyebrows to punctuate the end of her statement. Her statement would then be interpreted as a question. For example, a person is saying "John is
living in Rome" and she raises her eyebrows on the word "Rome" and holds them raised during the pause following her utterance; the sentence will be interpreted as the non-syntactically marked question "John is living in Rome?" rather than as the affirmation that "John is living in Rome". Marking a non-accented word with a head nod can put the focus of the conversation on the wrong information. In the sentence "John is wearing a red scarf today", suppose the voice stresses the word "red" but a head nod occurs on the word "scarf". The two situations have very different interpretations, depending on which sign (the linguistic or the nonverbal one) prevails. If the verbal signal (the accent on "red") prevails, the sentence can be understood as "John is wearing a red scarf today rather than a blue scarf". If the head nod occurring on "scarf" prevails, the interpretation could be "John is wearing a red scarf today rather than a red hat". In the former case the new information is the color of the scarf that John is wearing; in the latter it is that John is wearing a scarf. Similarly, stressing one word while raising the eyebrows on another creates de-synchrony between the verbal and facial channels and makes the intended message difficult to understand. These examples show the importance and the role of each multimodal signal in a conversation. Signals in different modalities are inter-synchronized and their meanings need to be evaluated in the context in which they are emitted. We may add that dissociating the channels in this way, or deliberately using the wrong signals, is extremely difficult to do in a normal conversation; even great actors find it difficult.
THE REPERTOIRE OF MULTIMODAL COMMUNICATIONS
Among nonverbal behaviors we can consider ([16], [36]):
• intonation: it defines how a discourse is decomposed into prosodic units and how these units are related to each other. It is linked to the syntax and semantics of the discourse. It gathers words into utterances and utterances into paragraphs. It stresses a particular word in an utterance, showing its salience and/or its contrast within the context of the discourse.
• paralinguistic elements: these parameters refer to voice quality. Loudness, pitch range, tempo, number of pauses and rate of speech define the tone of voice. Emotions change these parameters; pitch is an accurate indicator of emotional arousal. Various studies [43] have characterized the vocal parameters of emotions. Different voice types are also obtained: by changing the laryngeal quality of the voice one can get a breathy voice, often used during intimate discussions, or on the contrary a harsh voice, used during an angry discussion.
• spatial orientation and distance: distance and orientation among participants in a conversation depend on their respective social status (distance increases during formal discussion), their intimacy relationship (close friends tend to be close to each other) and the type of conversation they are involved in (during competition people tend to sit far from each other [22]). This parameter is highly culturally dependent. Conversants continually adjust to each other to find their appropriate positions.
• body posture: it depends on the situation conversants are involved in and on their relationship [5]. In a conversation between a high-ranking person and an employee one will have no problem distinguishing who is who. Friendliness is often reflected by the openness of the posture.
• hand gesture: it accompanies speech. Hands stop moving as the speech ends [25]. A gesture can repeat the speech (showing the left direction while saying "then you
turn left"), add information (showing the left direction while saying "then you go this direction"), contradict it (showing the left direction while saying "then you turn right"), or substitute words (waving the hand instead of saying good-bye) [31]. Hand gestures can be classified into 4 symbolic classes ([25], [6]): deictic (gestures pointing to a point in space), iconic (gestures depicting an object), metaphoric (gestures representing an abstract idea), and beat (gestures marking the utterance rhythm).
• facial expression: faces can express a large variety of expressions. Emotions are mainly expressed through the face [15]. Facial expressions may accompany the flow of speech, punctuating an accent or a pause [14] (raising the eyebrows on the accented word "RED" in "John is wearing a RED scarf"). They are linked to what is being said. They can replace a word (one can smile to say "hello") or refer to an emotion (showing a sad face while mentioning a past event: "Yesterday I was really upset").
• gaze: eye movement may be used to control the communicative process [1]. Its main functions in a conversation are to regulate the flow of speech [1] (breaking the gaze when taking the speaking turn), look for feedback (the speaker looks at the listener to check how s/he follows), express emotion (staring at the feared object), influence another person's behavior (looking directly in the eyes to assert power over the other) and show one's attitude toward the other (friends look at each other more often). A turn-taking system has been established [13] to explain how people negotiate speaking turns.
FUNCTIONS RELATING THE MULTIMODAL SIGNALS
Verbal and nonverbal signals in a conversation are highly linked with each other. Co-occurring signals modulate what is being said. Different functions characterize their relation:
• redundancy: signals from different modalities (vocal, face, gaze, gesture...) have the same meaning, e.g. a vocal stress and a raised eyebrow co-occurring on an accented word;
• substitution: a signal is used in place of another one, e.g. using a raised eyebrow to mark a non-syntactically formulated question;
• contradiction: co-occurring signals from different modalities have opposite meanings, e.g. shaking the head while saying "yes";
• addition: the meanings of the signals add to each other, e.g. making a square hand shape while talking about a box (the gesture could refer to the size of the box).
From the above sections we can see that the production of speech and of nonverbal behaviors are highly connected. The complete sense of a discourse is obtained by summing up the information given by all these signals. It is as if verbal and nonverbal signals arose from the same mental representation; they are only different forms of the same process [21]. One needs to consider them all to obtain a natural face-to-face conversational system.
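Purely for illustration, these relations can be given an operational reading. The following Python sketch is our own formalization with hypothetical names, not part of the system described here; it classifies the meanings of two co-occurring signals into three of the four relations (substitution is the case in which one channel carries the meaning alone).

```python
# Illustrative sketch (hypothetical, not from the implementation): classify the
# relation between two co-occurring signals from their meanings.
from enum import Enum


class Relation(Enum):
    REDUNDANCY = "same meaning on two modalities"
    CONTRADICTION = "opposite meanings"
    ADDITION = "meanings add up"


def relate(verbal_meaning: str, nonverbal_meaning: str,
           opposites: set[tuple[str, str]]) -> Relation:
    """Very rough classification of the relation between two signal meanings."""
    if verbal_meaning == nonverbal_meaning:
        return Relation.REDUNDANCY
    if (verbal_meaning, nonverbal_meaning) in opposites:
        return Relation.CONTRADICTION
    return Relation.ADDITION


print(relate("yes", "no", {("yes", "no"), ("no", "yes")}))  # Relation.CONTRADICTION
```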
DISPLAY RULES
The choice of nonverbal signals displayed by participants depends not only on the meaning one wants to convey but also on the relationship one has with one's interlocutor. Considering the same sentence used in the example introduced in the section above, "I don't know", if one is talking to a child, one will choose to raise the shoulders or purse the lips while talking: these are standardized signals and are easy to understand. But if one is talking to a touchy boss, one might feel embarrassed to answer "I don't know" and will therefore look down while talking. This person will not use any of the signals used to communicate with a child, which could be viewed as too familiar in such a context. Such a phenomenon has been called display rules by Ekman [16]. Display rules refer to who can show an expression to whom, when and where. Every culture has its rules governing social relations. Breaking such rules would be perceived as an offence or as an attempt to establish a power relationship. Display rules explain how one tends to modify an expression, or replace it by another, in order to respect such rules. An expression may be masked by another one (in some cultures, upon the loss of a loved one, sorrow cannot be shown and a happy face is required); an expression may be neutralized (one cannot laugh in public at the misadventure of someone else, so one will restrain one's smile); or an expression may be added (smiling in a polite situation). More generally, then, considering the context in which a conversation is happening is necessary. Even if Ekman and Friesen first pointed out this phenomenon in trying to account for cultural differences in nonverbal communication, the notion can be extended to all context-driven variations.
TALKING FACES
In recent years research on autonomous agents has been enriched with a specific research area: the area of talking faces. Computer graphics techniques have made it possible to create moving bodies and heads able to perform human-like lip movements and facial expressions ([9], [18], [20], [40], [26], [28]). Different levels of representation [20] and the consideration of coarticulation effects [9], [27], [18] produce natural and believable lip movements during speech. The simulation of muscle contraction ([29], [41]) and of skin elasticity ([42], [23]) gives 3D human head models a high degree of realism. Multimodal systems have attracted much interest in the past years. With a view to creating a user interface in which a synthetic agent dialogs with the user in real time, Takeuchi et al. [38] categorize facial expressions based on their communicative meaning, following Chovil's work [8]. Their system is able to understand what the user is saying (within the limits of a small vocabulary) and to answer the user. The synthetic agent speaks with the appropriate facial expression; for example, the head nods in concert with "No" and a facial shrug is used as an "I don't know" signal. 'Gandalf' [39] is an architecture to simulate face-to-face conversation with a user. The system takes as sensory input the hand gestures, eye direction, intonation and body position of the user. Gandalf's behavior is automatically computed in real time. He can exhibit some communicative facial expressions, eye and body movements, as well as generate turn-taking signals. PPP Persona [35], a 2D animated agent, has been created for the personalization of user interfaces. This agent is able to present and explain multimedia documents using facial expressions, pointing hand gestures and spoken language. It has the ability to decide which material to select and how to present it to the user. At the same time, research in voice synthesis has given rise to sophisticated voice synthesizers that modulate the voice by stress, intonation and other prosodic phenomena ([34], [19]). Two main types of voice synthesizers exist: text-to-speech and meaning-to-speech systems. Text-to-speech systems are based on fairly simple syntactic analysis for accent placement [19]. Meaning-to-speech systems, on the other hand, use semantic and discourse knowledge in assigning accents [11]. The decision whether or not to accentuate a word is made taking into account not only its syntactic role but also its semantic novelty [34]. The notion of contrast between entities has also been integrated [33]; contrast refers to the fact that a word can be said in opposition to what has been said previously and thus should receive a certain type of accent. The links between facial expression and intonation [26], and
between facial expression and dialog situation [6], have been studied. A method to automatically compute some of the facial expressions and head movements performing syntactic and dialogic functions has been proposed. This has made it possible to create faces exhibiting natural-looking facial expressions that communicate emphasis, topic and comment through eye and eyebrow movements, while also performing these functions through voice modulation [28]. In this work we concentrate on facial expressions and propose a meaning-to-face approach, aiming at a face simulation automatically driven by semantic data. The particular problem we address here is how an animated face can produce the appropriate facial expression according to the performative of the communicative act being performed. For the sake of simplicity we restrict ourselves here to the visual aspects, leaving aside the auditory ones.
OUR APPROACH
According to a model in terms of goals and beliefs ([7], [10]), a speech act, and we would say any communicative act (a smile, a gesture, a gaze), may be decomposed into a set of cognitive units represented as logical propositions. A communicative act is an action performed (but also, possibly, a morphological feature exhibited) by a Sender S through any (biological or technological) device apt to produce some stimulus perceivable by an Addressee H, with the goal that the Addressee gets some belief about S's beliefs and goals. A communicative act then has two faces: it is made up of a signal (the muscular actions performed or the morphological features displayed) and a meaning (the set of goals and beliefs that S has the goal of transferring to H's mind). In order to simulate a communicative action, therefore, we have to provide our communicating agent with information on both the signal and the meaning. Information about the signal is provided in terms of 3D visual cues (say, changes in lip shape for vocal communication, facial actions for facial expression, changes in skin color for expressing emotions like paling in fear or blushing in shame). Information about the meaning is provided in terms of "cognitive units". Cognitive units are declarative representations of semantic primitives, in which all kinds of semantic content, among which communicative intentions, word meanings and emotions, may be represented. The meaning of a communicative act is made up of a set of cognitive units that can be subdivided into information of two kinds: a general goal and a propositional content. The propositional content includes what S is referring to and what S is predicating about it. The general goal is the goal for which S is speaking of that propositional content. We may have three different kinds of general goals: information, question and request. But in real interaction S has many different possible ways to make a request, ask a question or give a piece of information. In other words, the general goal may be specified in many different performatives, or illocutionary forces ([2], [37]): in the realm of requests alone we may have orders (e.g. "Put this room in order without any discussion"), commands (e.g. "Put this room in order"), advice (e.g. "You could put this room in order"), proposals (e.g. "An idea would be to put this room in order"), suggestions (e.g. "Why don't you put this room in order"), begs (e.g. "Would you, please, put this room in order"), implorations (e.g. "Please, please, please, put this room in order") and many more. These different ways of making a request depend, among other things, on the particular relationship between Speaker and Hearer.
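As a purely illustrative sketch of this decomposition (a hypothetical Python rendering of our own; the class and field names are assumptions, not the representation actually used in our implementation), a communicative act can be stored as a signal paired with a meaning made of cognitive units, a general goal and a propositional content:

```python
# Hypothetical sketch, not the authors' implementation: a communicative act as
# a signal paired with a meaning (a set of cognitive units).
from dataclasses import dataclass, field
from enum import Enum


class GeneralGoal(Enum):
    INFORM = "inform"
    QUESTION = "question"
    REQUEST = "request"


@dataclass(frozen=True)
class CognitiveUnit:
    """A semantic primitive such as 'S wants H to do A'."""
    proposition: str


@dataclass
class CommunicativeAct:
    general_goal: GeneralGoal                        # why S is communicating
    propositional_content: str                       # what S refers to and predicates
    meaning: frozenset[CognitiveUnit]                # beliefs/goals S wants H to get
    signal: list[str] = field(default_factory=list)  # e.g. facial actions, gaze


# A command, for instance, adds power- and interest-related units to the request goal.
command = CommunicativeAct(
    GeneralGoal.REQUEST,
    "put this room in order",
    frozenset({CognitiveUnit("S wants H to do A"),
               CognitiveUnit("A is useful to a goal GS of S"),
               CognitiveUnit("S has power on H")}),
)
```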
THE PERFORMATIVE OF A COMMUNICATIVE ACT
A performative may be defined as the specific social action one is trying to bring about and the specific social relation one wants to hold with one's interlocutor in performing a communicative action. Human communication is endowed with three very sophisticated devices to express performatives: performative verbs, intonation and facial expression. In this work we present a system that computes the appropriate performative facial expression. Before illustrating what the system does on the signal side (the side of facial expression) we present a way to represent the meaning side of a performative. The overall meaning of a performative may be represented as a cluster of cognitive units, which may be subdivided into at least four cognitive sub-clusters, each pertaining to particular aspects of meaning generally included in any performative. The cognitive sub-clusters contain the following kinds of information:
1. information common to all performatives, including units like: this is a communicative act, with S addressing H. This implies in principle that each performative expression contains some way to select one's Hearer and address him/her, which may be done, say, by directing gaze at H. A further piece of information concerns which of the three general goals - requesting, asking or informing - is being aimed at.
2. information about which of the three kinds of general goals (information, question, request) the specific performative belongs to:
(a) performatives of informing all contain the cognitive unit "S wants H to believe K";
(b) questioning performatives contain the unit "S wants H to have S believe K";
(c) requesting performatives contain units like "S wants H to do A".
3. specific information apt to distinguish that specific performative among others having the same general goal; for instance, a relevant distinction among requests is whether the requested action is in the interest of S or of H.
(a) In commanding, units like the following may be included: "S wants H to do A", "A is useful to a goal GS of S"; that is, S is requesting an action useful to a goal of him/herself.
(b) In advising, instead, the requested action is (or S claims it is) in the interest of H: "A is useful to a goal GH of H".
(c) In proposing, finally, the requested action may be a cooperative action of both S and H useful to a goal of both: "S wants H and S to do A", "A is useful to a goal GSH of S and H".
(d) Again, a relevant distinction among acts of information is the degree of certainty or commitment with which S assumes what s/he is saying. For instance, in affirming or claiming S feels quite sure of what s/he is saying ("S wants H to believe K", "S is sure of K"), while in suggesting s/he is not.
4. information on the power relationship between S and H, and on whether S intends to take advantage of it. In an order S claims to have power on H and to be willing to use it; this in turn means that in case H does not do what S is requesting, S is entitled to retaliate. In an imploration, S acknowledges the power of H on him/herself, calls upon H's benevolence, and in a sense acknowledges that if H does do A, S is indebted to H. In an advice, S claims to be on the same level as H, and leaves H free to do A or not.
5. information on the affective state of S that is relevant to S's communicative action. In imploring, S is not only asking for help, but also shows sadness, as if saying: if you do not do A, I am helpless and therefore sad. This is expressed by raising the inner parts of the eyebrows, as in sadness.
Now, we may think that a sort of parallel correspondence holds between the components on the signal side (muscular actions) and the components on the meaning side (cognitive units). So, for instance, information n.1, stating that S is performing a communicative act, and hence is addressing H, might correspond to the specific action of directing gaze at H. The cognitive structure of the performative of imploring may then be represented as follows (see Figure 1):
1. communicative act: S is addressing H
2. type of communicative act: S wants H to do A
3. in interest of whom: S believes A is useful to a goal GS of S; GS is an important goal to S
4. power relation: S depends on H for GS
5. affective state: if H does not do A, S feels sad

Figure 1: Performative of imploring
Figure 2: Peremptory command
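This grouping lends itself to a simple declarative encoding. The sketch below is a hypothetical Python rendering of Figure 1, not the data format actually used in our implementation:

```python
# Minimal sketch (our reading of Figure 1): the performative of imploring as a
# cluster of cognitive units, grouped by the five kinds of information above.
IMPLORE = {
    "communicative act": ["S is addressing H"],
    "type of communicative act": ["S wants H to do A"],
    "in interest of whom": ["S believes A is useful to a goal GS of S",
                            "GS is an important goal to S"],
    "power relation": ["S depends on H for GS"],
    "affective state": ["if H does not do A, S feels sad"],
}


def units(performative: dict) -> set[str]:
    """Flatten the sub-clusters into the full set of cognitive units."""
    return {u for cluster in performative.values() for u in cluster}


print(sorted(units(IMPLORE)))
```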
AFFECTIVE COMPONENTS IN PERFORMATIVES
As we said, both emotions and communicative intentions may be represented in terms of cognitive units. Therefore, in analyzing some performatives we may find that some emotions, or parts of emotions, are contained in them. For instance, in the performative of imploring a cognitive unit of sadness is included, since when S implores H to do A, S is claiming that H has power on S, in that having A done is very important to S, but S has no power to do A by him/herself; and S is so powerless that, if H does not do A, so important to S, S will be sad (see Figure 1):
S has the goal that A is done
the goal that A is done is very important to S
S can not do A by him/herself
S wants H to do A
S is dependent on H
if H does not do A, S will be sad
In the same vein, in a peremptory command a cognitive unit of anger is included, since when S commands H to do A, S is claiming S has power on H, and if H does not do what S requests, S is going to get angry (see Figure 2):
S has the goal that A is done
S has power on H
if A is not done S is going to get angry
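Once performatives and emotions are both encoded as sets of cognitive units, this containment relation can be checked mechanically. The following is a minimal, purely illustrative Python sketch (the unit wordings come from the lists above; representing sadness by a single unit is a simplification of ours):

```python
# Hedged illustration of the hypothesis that a performative may contain the
# cognitive units of an emotion (sadness within imploring).
IMPLORING = {"S has the goal that A is done",
             "the goal that A is done is very important to S",
             "S can not do A by him/herself",
             "S wants H to do A",
             "S is dependent on H",
             "if H does not do A, S will be sad"}

SADNESS = {"if H does not do A, S will be sad"}   # simplified to one unit for illustration


def contains_emotion(performative: set, emotion: set) -> bool:
    """True if every cognitive unit of the emotion appears in the performative."""
    return emotion <= performative


assert contains_emotion(IMPLORING, SADNESS)
```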
COMMON AND DIFFERENT MEANINGS IN PERFORMATIVES
The representation of the cognitive structure of performatives also allows one to account for ambiguities in performative verbs, capturing, for polysemous words, both common and differential meanings. To suggest, for instance, is ambiguous between a request and an information: I can suggest you to do something, or suggest to you that the thing is so and so. But in both meanings the degree of assertiveness or certainty with which S is requesting or informing is low. A representation in terms of cognitive units may account for both differences and commonalities. In the two readings, the representations of suggesting as a performative of request and as a performative of information are, respectively, the following:
S is addressing H
S wants H to do A
S believes A is useful to a goal GH of H
S believes this with a low degree of certainty

S is addressing H
S wants H to believe K
S believes K with a low degree of certainty
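Encoding the two readings as sets of cognitive units makes the common and differential parts directly computable, as in the following illustrative Python sketch (our own rendering; with a normalized predicate form the low-certainty unit would also fall in the common part):

```python
# Sketch (assumed representation): the two readings of "to suggest" share some
# cognitive units and differ in others.
SUGGEST_REQUEST = {"S is addressing H",
                   "S wants H to do A",
                   "S believes A is useful to a goal GH of H",
                   "S believes this with a low degree of certainty"}

SUGGEST_INFORM = {"S is addressing H",
                  "S wants H to believe K",
                  "S believes K with a low degree of certainty"}

common = SUGGEST_REQUEST & SUGGEST_INFORM        # shared meaning
differential = SUGGEST_REQUEST ^ SUGGEST_INFORM  # what distinguishes the readings
print(common, differential, sep="\n")
```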
PARALLEL COMPOSITION IN SIGNALS AND MEANINGS
Our hypothesis is that there is a (more or less systematic) correspondence between single cognitive units, or clusters of them, on the meaning side and, on the signal side, single action units or morphological features, or clusters of them; so much so that each specific performative, with its specific set of cognitive units, points to a particular combination of signalling devices (actions and features) that makes up that performative facial expression. As was pointed out in a previous section, a performative can be decomposed into five cognitive structures:
1. the remark that a communicative act is being performed: directing the gaze to the Hearer means, as stated by the turn-taking system, that a turn has been established: "S is addressing H".
2. information about the general goal: facial and/or gaze signals characterizing the three main performative classes (inform, question, request) might be linked to each performative of the class; for example, a raised eyebrow may mean that S is asking a question.
3. whom the action serves: a parameter of head inclination could instead be devoted to expressing whether the action requested, or the information provided, is in the interest of S or of H. So, no inclination might mean "A is in the interest of S", as in an order, with the head slightly bent forward meaning instead "A is in the interest of H". The former could in fact mark a command, the latter an advice or a warning.
4. information on the power relationship: gaze is an important cue to show power relationships [1]. To show importance over the other, one might look down at him/her or even slightly tilt back the head so as to raise the chin up. Other dominance facial cues could be used.
5. information on the affective state: it might use the same facial action devoted to expressing that emotion outside a performative, since performative and emotion may share some of the cognitive units they are decomposed into. For instance, the raising of the inner parts of the eyebrows, generally devoted to expressing sadness, is also present in imploring.
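Under this hypothesis, composing a performative face amounts to looking up, for each cognitive unit or sub-cluster, its corresponding signal and collecting the results. The correspondence table below is a hypothetical Python sketch assembled from the examples in this section, not an exhaustive or validated mapping:

```python
# Hypothetical meaning-to-signal table (illustrative only): sub-clusters of
# cognitive units map to facial, gaze and head signals; composing the entries
# found in a performative yields its facial expression.
MEANING_TO_SIGNAL = {
    "S is addressing H":          "gaze directed to H",
    "general goal: question":     "raised eyebrows",
    "A is in the interest of H":  "head slightly bent forward",
    "S claims power over H":      "head tilted back, gaze down at H",
    "affective state: sadness":   "inner parts of the eyebrows raised",
}


def compose_expression(cognitive_units: list[str]) -> list[str]:
    """Collect the signals corresponding to the units present in a performative."""
    return [MEANING_TO_SIGNAL[u] for u in cognitive_units if u in MEANING_TO_SIGNAL]


print(compose_expression(["S is addressing H", "affective state: sadness"]))
```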
SYSTEM OVERVIEW
In this section we present an overview of our system (see Figure 3). The flow of the system is the following: given a General Goal (GG) of communicating some Propositional Content (PC) to a particular Hearer (H), the Speaker (S) deduces from the Hearer Model box a set of beliefs about H. Then S uses inference rules to choose a specific performative from a library. Finally the corresponding facial expressions are selected and the animation of the face is generated.
1. Hearer Model. S models H, that is, makes up a representation of H in terms of cognitive units. H is modeled according to three criteria:
(a) knowledge capacity: more specifically, H's knowledge base and inference capacity [24]. For instance, in comparing a tourist and a child as Hearers, S may think that the tourist has an inference capacity much like one's own, while the cultural knowledge base may not be shared; on the contrary, the child's cultural knowledge may be widely shared with the Speaker's, while the inference capacity may be much lower.
(b) power relationship between S and H: S may, for instance, have power on a child but not on a tourist.
(c) H's personality: S may form a representation of H in terms of which goals are generally most important to H, which emotions H is most likely to feel, and so on.
2. Inference Rules. Upon creating one's own model of the Hearer, the Speaker applies a set of inference rules that take as inputs the general goal of his/her Communicative Act and the Hearer model, and produce as an output some constraints on which specific performative may be triggered [32].
3. Performative Library. A third component is a performative library, where each performative is represented as a cluster of cognitive units: for instance, a command and an advice both share the cognitive unit of being a request, but in a command S has a strong intention that H fulfil it, and explicitly wants H to understand that S has the power to retaliate and is willing to do so in case of non-fulfilment; in an advice, the requested action is on behalf of H, and moreover S claims to be in the same position of power as H, nor does s/he threaten retaliation in case H does not do the requested action.
4. Performative Expression Library. The next box in our system is a library of facial expressions, where a correspondence is set between sub-clusters of cognitive units, within the meaning of each performative, and clusters of facial actions. The performative face of imploring, for instance, includes the following facial actions:
gaze directed to H, eyebrows raised at their inner corners, lowered lip corners, head bent laterally. 5. Graphics Output The last box in our system is a graphic animation component that automatically generates the output facial actions.
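For concreteness, the flow of Figure 3 can be sketched end to end as follows. The Python below is a toy illustration with invented names and data; the actual system relies on the Golem inference engine and on a FACS-based facial model, as described in the System Design section.

```python
# Self-contained toy sketch of the Figure 3 pipeline (all names/data are ours).
def hearer_model(hearer: str) -> dict:
    """Step 1: S's beliefs about H (power relation, personality excerpt)."""
    models = {"boss": {"power": "H over S", "touchy": True},
              "child": {"power": "S over H", "touchy": False}}
    return models.get(hearer, {"power": "equal", "touchy": False})


def apply_inference_rules(goal: str, beliefs: dict) -> set:
    """Step 2: constraints on the specific performative, from the hearer model."""
    constraints = {goal}
    if beliefs["power"] == "H over S" and beliefs["touchy"]:
        constraints |= {"show uncertainty", "leave H free"}
    return constraints


PERFORMATIVE_LIBRARY = {            # Step 3: performatives as unit clusters (excerpt)
    "command":    {"request", "A in interest of S", "S has power on H"},
    "suggestion": {"request", "A in interest of H", "show uncertainty", "leave H free"},
}

EXPRESSION_LIBRARY = {              # Step 4: meaning -> facial signals (excerpt)
    "request": "gaze directed to H",
    "A in interest of H": "slight forward head inclination",
    "show uncertainty": "raised eyebrow",
}


def select(goal: str, hearer: str) -> tuple[str, list]:
    """Pick the performative whose units best satisfy the constraints, then its face."""
    constraints = apply_inference_rules(goal, hearer_model(hearer))
    name, units = max(PERFORMATIVE_LIBRARY.items(),
                      key=lambda kv: len(kv[1] & constraints))
    return name, [EXPRESSION_LIBRARY[u] for u in units if u in EXPRESSION_LIBRARY]


print(select("request", "boss"))    # ('suggestion', [...facial signals...])
```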
A DETAILED EXAMPLE
Let us give an example of how our system works. Suppose that the Speaker is going to utter the word "Here". This word may work as an incomplete sentence, and thus (after [30]) stand for different sentences with different general goals: it may be an information, like "John is here", a question, as "Where is John now?", or a request, as "Let John come here". This single word might then be uttered with a number of different performative intonations and facial expressions. Now, if the input to the system is the incomplete sentence "Here", with only the general goal specified (a request), what processing steps should this input undergo for the system to decide which specific performative to express (say, a command, an advice, a supplication, a suggestion) for its communicative action to be effective and tailored to the Hearer? First the input undergoes the component "Hearer Model". Suppose H is S's Boss, who is modeled through the following cognitive units:
1. Power relationship: H has power on S
2. Knowledge capacity
(a) Knowledge base: H has a knowledge base largely comparable to S's; H has more knowledge than S on strategic domains; H has less knowledge than S on local domains
(b) Inference capacity: H has the same or better inference capacity than S
3. Personality: H is quite touchy; H attributes great importance to power relationships
The system now goes to the inferential component and finds inferences like the following:
1. if H attributes great importance to power relationships, then H tends to think that status is a realistic representation of actual worth; this implies that
2. if H has more status than S does, then H may not accept that an action devised by somebody of lower status may be better than one devised by H.
Therefore, S should:
(a) show uncertainty about whether the requested action is the right one or not;
(b) present the requested action as just one of many possible ones;
(c) acknowledge that H is free to reject S's request.
The system then explores the performative library looking for a performative of request that includes elements like (a), (b) and (c) among its cognitive units, and does find one: the performative of suggesting, which may be analyzed as follows:
1. S is addressing H
2. S wants H to do A
3. A is useful for a goal G of H (it is in H's interest)
4. S is uncertain whether A is useful to G
5. A might not be the only action useful for G
6. H is free to do A or not
Cognitive Units 4, 5 and 6 meet the constraints deduced by the inferential process. The performative of suggestion is then the best candidate to perform the request in an effective way, given that particular addressee. Then the system goes to the Performative Expression Library, where it finds the following correspondences between the selected performative of suggestion and a set of expressions: cognitive unit n.1, addressing H, is expressed by directing the gaze to H; n.3, claiming that the request is in H's interest, is expressed by a light forward head inclination; n.4, showing uncertainty, is expressed by a raised eyebrow (see Figure 4). Finally, the Graphics Output component generates an animation of the 3D face that moves the lips to say "Here" and at the same time performs the performative expression of suggesting.

Figure 3: System overview (the inputs GG, PC and H feed the Hearer Model, the Inference Rules, the Performative Library, the Performative Expression Library and the Graphics Output)

Figure 4: Performative of suggestion
SYSTEM DESIGN
The system is designed according to a cognitive model of social and communicative action ([7], [10]). A communicative act consists of a signal and a meaning. The latter, the meaning part, is represented in terms of cognitive units. The former, the visual part of the signal of a communicative act, is formally represented in our system by Ekman and Friesen's notational system, FACS [17]. FACS stands for Facial Action Coding System and is a notational system that describes all visible facial actions. It is made of Action Units (AUs); each AU corresponds to the action of one muscle or of a group of related muscles. An expression is represented by a set of AUs. Information in the Hearer Model and in the Inference Rules is represented in a propositional format that will be implemented in the Golem inference engine [12]. The goal of the Golem project is to investigate how personality traits affect delegation-help attitudes in multi-agent activities. An ad hoc logic programming language and a user-friendly graphical interface have been developed that enable one to create agents with different personality traits. Agents are able to perform several forms of reasoning, such as purely logical reasoning or uncertainty-based reasoning. The facial model we will be using is a structured, regionally defined 3D model [29]. It uses FACS; each facial expression is defined as a set of Action Units, plus a head and eye position. The lip shape during speech is obtained using the coarticulation algorithm of [27]. An implementation of the system is currently under way.
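On the signal side, a performative expression can thus be stored as a set of FACS Action Units plus head and eye positions. The sketch below is illustrative only: AU1 and AU15 are the standard FACS labels for the inner brow raiser and the lip corner depressor, their association with imploring follows the description given earlier, and the data structure itself is an assumption rather than our actual implementation.

```python
# Hedged sketch of a FACS-based performative expression record.
from dataclasses import dataclass


@dataclass
class PerformativeExpression:
    action_units: set[int]        # FACS Action Units making up the expression
    head: str = "neutral"         # head position
    eyes: str = "toward hearer"   # gaze direction


IMPLORING_FACE = PerformativeExpression(
    action_units={1, 15},         # AU1 inner brow raise (sadness), AU15 lowered lip corners
    head="bent laterally",
    eyes="directed to H",
)
```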
CONCLUSION
We have presented a system that is able to automatically select the appropriate performative and its corresponding facial behaviors given a specific Speaker, Hearer, General Goal and utterance. Considering the context and the type of Hearer one is dealing with when computing the Speaker's behavior enhances the credibility of the latter. Conversing with a believable agent is one step toward intelligent, user-friendly interfaces. In the future we want to include in our system an audio component in which the appropriate prosodic and intonational parameters would be automatically derived as well.
REFERENCES
[1] Argyle, M., and Cook, M. Gaze and Mutual Gaze. Cambridge University Press, 1976.
[2] Austin, J. How to Do Things with Words. Oxford University Press, London, 1962.
[3] Birdwhistell, R. Kinesics and Context: Essays on Body Motion Communication. University of Pennsylvania, 1970.
[4] Bolinger, D. Intonation and Its Parts. Stanford University Press, 1986.
[5] Bull, P. Posture and Gesture. Pergamon Press, 1987.
[6] Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., and Stone, M. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. Computer Graphics Annual Conference Series (1994), 413–420.
[7] Castelfranchi, C., and Parisi, D. Linguaggio, conoscenze e scopi. Il Mulino, Bologna, 1980.
[8] Chovil, N. Social determinants of facial displays. Journal of Nonverbal Behavior 15, 3 (Fall 1991), 141–154.
[9] Cohen, M. M., and Massaro, D. W. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation (Tokyo, 1993), M. Magnenat-Thalmann and D. Thalmann, Eds., Springer-Verlag.
[10] Conte, R., and Castelfranchi, C. Cognitive and Social Action. University College London, 1995.
[11] Davis, J., and Hirschberg, J. Assigning intonational features in synthesized spoken directions. In 25th Annual Meeting of the Association for Computational Linguistics (Buffalo, 1987), pp. 187–193.
[12] De Rosis, F., and Grasso, F. Mediating between hearer's and speaker's views in the generation of adaptive explanations. Expert Systems with Applications 8, 4 (1995).
[13] Duncan, S. Some signals and rules for taking speaking turns in conversations. In Nonverbal Communication, S. Weitz, Ed. Oxford University Press, 1974.
[14] Ekman, P. About brows: Emotional and conversational signals. In Human Ethology: Claims and Limits of a New Discipline: Contributions to the Colloquium, M. von Cranach, K. Foppa, W. Lepenies, and D. Ploog, Eds. Cambridge University Press, Cambridge, England; New York, 1979, pp. 169–248.
[15] Ekman, P. Emotion in the Human Face. Cambridge University Press, 1982.
[16] Ekman, P., and Friesen, W. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica 1 (1969).
[17] Ekman, P., and Friesen, W. Facial Action Coding System. Consulting Psychologists Press, Inc., 1978.
[18] Guiard-Marigny, T., Adjoudani, A., and Benoit, C. 3D models of the lips and jaw for visual speech synthesis. In Progress in Speech Synthesis, J. van Santen, R. Sproat, J. Olive, and J. Hirschberg, Eds. Springer-Verlag, 1996.
[19] Hirschberg, J. Accent and discourse context: Assigning pitch accent in synthetic speech. In AAAI-90 (1990), pp. 952–957.
[20] Kalra, P., Gobbetti, E., Magnenat-Thalmann, N., and Thalmann, D. A multimedia testbed for facial animation control. In International Conference on Multi-Media Modeling, MMM'93 (Singapore, Nov 9-12, 1993), T. Chua and T. Kunii, Eds., pp. 59–72.
[21] Kendon, A. Gesticulation and speech: Two aspects of the process of utterance. In The Relation between Verbal and Nonverbal Communication, M. R. Key, Ed. Mouton, 1980, pp. 207–227.
[22] Laver, J. The Gift of Speech. Edinburgh University Press, 1991.
[23] Lee, Y., Terzopoulos, D., and Waters, K. Realistic modeling for facial animation. Computer Graphics Annual Conference Series (1995), 55–62.
[24] Magno-Caldognetto, E., and Poggi, I. Micro- and macro-bimodality. In Proceedings of the Workshop on Audio-Visual Speech Perception (Rhodes, September 26-27, 1997), C. Benoit and R. Campbell, Eds.
[25] McNeill, D. Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, 1992.
[26] Pelachaud, C. Functional decomposition of facial expressions for an animated system. In Advanced Visual Interfaces (Rome, May 1992), M. C. T. Catarci and S. Levialdi, Eds., vol. 36, World Scientific Series in Computer Science, pp. 26–49.
[27] Pelachaud, C., Badler, N., and Steedman, M. Generating facial expressions for speech. Cognitive Science 20, 1 (January-March 1996), 1–46.
[28] Pelachaud, C., and Prevost, S. Sight and sound: Generating facial expressions and spoken intonation from context. In Proceedings of the ESCA/AAAI/IEEE Workshop on Speech Synthesis (New Paltz, New York, September 1994).
[29] Platt, S. A Structural Model of the Human Face. PhD thesis, University of Pennsylvania, Dept. of Computer and Information Science, Philadelphia, PA, 1985.
[30] Poggi, I. Le parole nella testa. Guida a un'educazione linguistica cognitivista. Il Mulino, Bologna, 1987.
[31] Poggi, I., and Caldognetto, E. M. Mani che parlano. Gesti e psicologia della comunicazione. Unipress, Padova, 1997.
[32] Poggi, I., and Castelfranchi, C. Dare consigli. In 2nd International Workshop on Language Teachers. DILIT/International House, Rome, 1990, pp. 29–49.
[33] Prevost, S. A Semantics of Discourse Information for Specifying Intonation in Spoken Language Generation. PhD thesis, University of Pennsylvania, 1995.
[34] Prevost, S., and Steedman, M. Specifying intonation from context for speech synthesis. Speech Communication 15 (1994), 139–153.
[35] Rist, T., André, E., and Müller, J. Adding animated presentation agents to the interface. In Intelligent User Interfaces (1997), pp. 79–86.
[36] Scherer, K. The functions of nonverbal signs in conversation. In The Social and Psychological Contexts of Language, R. St. Clair and H. Giles, Eds. Lawrence Erlbaum Associates, 1980, pp. 225–243.
[37] Searle, J. Speech Acts. Cambridge University Press, London, 1969.
[38] Takeuchi, A., and Nagao, K. Communicative facial displays as a new conversational modality. In ACM/IFIP INTERCHI'93 (Amsterdam, 1993).
[39] Thórisson, K. Layered modular action control for communicative humanoids. In Computer Animation '97 (Geneva, Switzerland, 1997), IEEE Computer Society Press.
[40] Vatikiotis-Bateson, E., Bateson, K., Kasahara, Y., Garcia, F., and Yehia, H. Characterizing audiovisual information during speech. In ICSLP (1996).
[41] Waters, K. A muscle model for animating three-dimensional facial expressions. Computer Graphics 21, 4 (July 1987), 17–24.
[42] Waters, K., and Terzopoulos, D. A physical model of facial tissue and muscle articulation. In Proceedings of the First Conference on Visualization in Biomedical Computing (May 1990), pp. 77–82.
[43] Williams, C., and Stevens, K. Vocal correlates of emotional states. In Speech Evaluation in Psychiatry, Darby, Ed. Grune and Stratton, New York, 1981.