Social Intelligence in Conversational Computer Agents ProSeminar Conceptual Analysis of Thesis Area
Timothy W. Bickmore Gesture & Narrative Language Group MIT Media Laboratory December 15th 1999
I. Introduction

The purpose of this work is to endow an embodied conversational computer agent with the ability to establish and maintain an on-going social relationship with a user. This ability is not only important in entertainment applications (e.g., Tamagotchis and Furbies, which are specifically designed to establish on-going relationships with their users) and affect support systems (such as computer therapists or frustration management systems), but plays a crucial role in any application in which we want to manage how a user interacts with an agent. If we want our users to feel comfortable with our agents, trust them, or be persuaded or comforted by them, then we must understand how humans perform these functions and endow our agents with similar behavior.

In this work I am particularly interested in embodied conversational agents (Cassell, Bickmore et al. 1999), which are agents that use natural speech and an animated humanoid representation to interact with a user. I believe that most of the behaviors that people use to influence each other socially are conveyed either through language or through nonverbal behavior in face-to-face interaction, and thus it is crucial that agents have these communication channels at their disposal for effecting the same results. Embodied conversational agents also leverage the knowledge all users in a language community have about how to engage in face-to-face conversation, making them potentially more natural and easy to use than any other form of human-computer interface. Adding social skills to these agents will serve to make them even more natural and easy to use.

People use many linguistic strategies to achieve interpersonal goals, such as building rapport, establishing credibility and trust, effecting persuasion, and befriending. One example strategy is small talk: "light" conversation about neutral topics (e.g., weather, immediate context) (Laver 1981) which is used to build rapport and trust, provide time to "size up" a stranger, and mitigate positive and negative face threats (Goffman 1983). Even in business meetings or sales encounters it is customary to begin with some amount of small talk before "getting down to business" (at least in American culture). Small talk is not a random phenomenon used to fill idle time, but a linguistic behavior that can be used strategically in a conversation, for example, to reduce a client's fear before asking them to sign a contract.

Recent work has shown that people already respond socially to computers (Reeves and Nass 1996); by endowing our computer agents with social intelligence and the communicative abilities to use that intelligence, we enable them to fulfill roles that until now have been exclusively in the domain of humans. With appropriate social skills, computer agents would move closer to being usable as digital friends or confidants, effective therapists, counselors, teachers, or trainers, or persuasive salespeople or corporate representatives.

In order to meet the goal of socially intelligent conversational agents I will need to develop an understanding of certain human social protocols, gleaned from the fields of Linguistics, Social Psychology, and Human Ethnography and from my own studies of natural interactions among human interlocutors, including theories of when particular protocols should be used to meet the goals of the agent and how normal execution of the protocols should proceed.
Since it is not realistic to build an agent of the complexity of a human, I will need to decide which interactional cues (verbal and nonverbal) are essential for a given protocol and how these cues are to be detected, interpreted, and generated by a conversational agent. Finally, I will need to evaluate my theories by testing how humans interact with conversational agents based on them.

If I am successful in this endeavor, computers as we know them will disappear and be replaced by computational artifacts which are an integral part of our social lives. Individuals will never be lacking a sympathetic ear to turn to in time of crisis, yielding potentially significant health benefits to those who have socially intelligent agents at their disposal (e.g., lack of a confidant has been shown to lead to an order-of-magnitude increase in likelihood of depression following severe life events (Brown and Harris 1978)). Businesses will gain powerful new tools for establishing on-going relationships with clients in cyberspace. The fields of Linguistics and Social Psychology will benefit from the theories developed about how humans interact with each other and with computer agents, and the field of Computational Linguistics will benefit from new theories and algorithms which extend the work in task-oriented conversational systems into the social realm.
II. Previous Work

In this section I will discuss the intellectual antecedents of the proposed research, in rough order of influence. Emulating human social behavior touches upon many disciplines and technologies, many of which have been pioneered at the Media Lab.

Linguistics

Since I am primarily concerned with conversational behavior, the deepest intellectual roots for this research are in the field of Linguistics, specifically in Discourse Theory and Pragmatics, since I am more interested in language as a means to an end than in the detailed structure of a given utterance. There is a large body of basic research on the linguistic phenomena of interest (e.g., small talk is discussed as early as 1923 (Malinowski 1923)), although researchers in this discipline have historically produced only descriptive models of the behaviors of interest, and not models which can be used to produce the behaviors or decide when they are appropriate. Although most areas of Discourse Theory are relevant to this work, research on Intention (Searle 1969; Levinson 1983; Grice 1989; Grosz and Sidner 1990), Footing (Goffman 1983), Face and Politeness (Brown and Levinson; Goffman 1983), Conversational Storytelling (Jefferson 1978; Polanyi 1989), and Conversational Analysis (Sacks 1995) is particularly relevant.

Computational Linguistics

Computational linguistics, and more specifically conversational dialog systems (such as TRAINS (Traum and Hinkelman 1992)), provide performance models of certain conversational behaviors. Non-task-oriented (i.e., social) conversation has so far been mostly avoided by this community, since it is seen as open-ended, unbounded dialog, and is not seen as having direct application in business or industrial domains. Interestingly, the most successful social conversational systems that have been built are "chatterbots", which have been developed primarily for entertainment purposes (Mauldin 1994).
Chatterbot systems are capable of carrying on text-based "casual" conversations with users. Most of these systems are direct descendants of ELIZA (Weizenbaum 1966) and primarily use regular expression pattern matching against the user's input to directly index canned responses. The success of these systems depends upon having a large enough set of responses that the system never repeats itself, and upon crafting the responses to be ambiguous enough that the user can draw their own relevance relations. More recent chatterbot systems use various indices in addition to pattern matching to determine the system's response. For example, the system described in (Mauldin 1994) used a current topic index which is updated based on an activation network. Rousseau and Hayes-Roth (1997) describe a system in which response indices include the character's affective state. Chatterbot systems also utilize several "tricks" to give the illusion of intelligence and fluency (Mauldin 1994), including (a sketch of the core pattern-matching mechanism follows this list):
• Maintaining initiative by asking questions.
• Including portions of the user's input in the system's response.
• Changing the level of the conversation (e.g., "Why do you ask that?").
• Rigidly continuing a topic.
• Launching into a new topic.
• Making controversial or humorous statements.
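To make the pattern-matching mechanism concrete, here is a minimal sketch of ELIZA-style response selection in Python; the patterns and responses are invented for illustration and do not reproduce any particular deployed system:

import random
import re

# Ordered (pattern, responses) rules; the first matching pattern wins.
# Capture groups let a response echo a portion of the user's input.
RULES = [
    (re.compile(r"\bI need (.*)", re.I),
     ["Why do you need {0}?", "Would {0} really help you?"]),
    (re.compile(r"\bmy (mother|father|family)\b", re.I),
     ["Tell me more about your {0}.", "How do you feel about your {0}?"]),
    (re.compile(r"\?$"),
     ["Why do you ask that?", "What do you think?"]),
]

# Canned fallbacks, ambiguous enough that users supply their own relevance.
DEFAULTS = ["I see.", "Please go on.", "That is interesting."]

def respond(user_input):
    for pattern, responses in RULES:
        match = pattern.search(user_input)
        if match:
            return random.choice(responses).format(*match.groups())
    return random.choice(DEFAULTS)

print(respond("I need a bigger house"))  # e.g., "Why do you need a bigger house?"

The need for a large, carefully worded response set (so the system never repeats itself) follows directly from this structure: the rules carry no memory or understanding, only surface indexing.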
Chatterbot systems are now available commercially as desktop "friends" or personal assistants, or as web site guides or assistants. For example, the Extempo "Imp" characters (Extempo Systems) can be programmed via a forms interface in which responses are indexed by current web page and topic in addition to keyword/regular expression pattern matching. Neuromedia's "Virtual Representatives" (Neuromedia, Inc.) are programmed in a high-level language to function as web-based customer service representatives. The development of these systems has been driven to some extent by the Loebner Prize competition, in which a $100,000 reward is promised to the first system that can pass the Turing test (Mauldin 1994).

Embodied Conversational Agents

Embodied conversational agents extend the work of speech-only dialog systems by adding an animated humanoid body which can be used to exhibit appropriate nonverbal conversational behavior (Cassell, Bickmore et al. 1999; Andre, Muller et al. 1996; Ball, Ling et al. 1998), in addition to multimodal sensors which can detect the user's nonverbal behavior. While the metaphor of face-to-face conversation has been successfully applied to human interface design for quite some time, its use to date has been just that: a metaphor. Embodied conversational agents leverage the full breadth and power of human conversational competency by using all of the conversational skills that humans have, namely the ability to use facial expression, gaze, gesture, and intonation to regulate the process of conversation, as well as the ability to use verbal and nonverbal means to contribute content to the ongoing conversation.
Although significant advances have been made in the development of these agents over the last few years (Section III presents a detailed discussion of work at the Media Lab), all of the work to date has focused on the recognition and production of behaviors which fulfill relatively low-level conversational functions, such as turn-taking, feedback, repair, and spatial deixis (pointing gestures), and has not addressed discourse-level phenomena such as social interactions.

Social Psychology

This work will also draw on a large body of research in Social Psychology, specifically in the area of interpersonal relationships. For example, in the work on intimacy, there are not only descriptions of the phenomenon and assessment tools for measuring it, but experimental procedures for inducing it between strangers (Aron, Melinat et al. 1997). As in Linguistics, however, the vast majority of work in this area consists of descriptive models which are not readily automatable. Reeves and Nass (Reeves and Nass 1996) have repeatedly demonstrated how much of this work in Social Psychology directly applies to people's relationships with computers and other media. One of their students--Youngme Moon--has gone on to show how certain social skills (e.g., self-disclosure) can be used to influence a user's buying behavior when interacting with a computer agent (Moon 1998), and another--B. J. Fogg--has established the field of "Captology" to study how technology in general can be used to persuade users (Fogg 1999). However, none of these researchers are investigating the use of social skills by anthropomorphic agents, and they are also primarily interested in developing descriptive theories of the phenomena they study.

Human Ethnography

Studies by researchers such as Adam Kendon (Kendon 1990) attempt to describe the regularities of ritualized human interactions such as greetings and group formation. Although there are many valuable insights into social protocols that can be gained from this research, the discipline focuses exclusively on nonverbal behavior and thus provides only a small but important part of the characterization of social conversation.

Affective Computing

Also relevant is recent work in Affective Computing, pioneered by Roz Picard at the Media Lab as computing that relates to, arises from, or deliberately influences the emotions of humans (Picard 1997). Especially relevant is the work of Jonathan Klein (Klein 1999), who developed a system which was demonstrated to alleviate the frustration of users through "active listening" skills, one of the social skills I am interested in modeling. However, work in this area to date has not utilized conversational or animated agents, but has relied upon a standard window-and-menu based interface. I contend that a natural interaction with a conversational agent would enhance the effectiveness of such systems.
Synthetic Characters and Autonomous Agents

The goal of researchers in the field of Synthetic Characters is to produce autonomous, interactive, animated computer characters which seem to be intelligent, sentient beings. Much of the work in this area was pioneered by Bruce Blumberg's group at the Media Lab (Maes, Darrell et al. 1995; Blumberg 1996; Johnson, Wilson et al. 1999), which takes an ethologically-inspired approach to modeling autonomous, intelligent agents, although the group has not addressed conversational behavior in any of their work. This research is important to build upon in my work in order to produce conversational agents which are not only socially intelligent, but believable and lifelike enough to engage a user in the first place. Since embodied conversational agents are necessarily autonomous (although Cassell and Vilhjalmsson have explored the implementation of semi-autonomous conversational agents (Cassell and Vilhjalmsson 1999)), they must address the same issues faced by researchers developing autonomous software agents and robots, namely rational agency, reactivity vs. deliberation, management of goals, etc. Thus, much of the literature on autonomous agents is relevant to the construction of conversational agents (see (Goetz 1997), Chapter 2, for a review of autonomous agent architectures).
III. Embodied Conversational Agents: Gandalf, REA and Sam

The Gesture and Narrative Language Group at the Media Lab has been developing technologies for embodied conversational agents over the last four years. This section describes the three generations of conversational agents developed in the group, since they will be used as the development platform for my proposed research by endowing them with some degree of social intelligence.

Gandalf

The first generation system was Gandalf (Figure 1), who acted to assist users in navigating a 3D model of the solar system (Thorisson 1996). Users were required to wear motion and gaze tracking equipment so that the system could detect where they were looking and whether they were gesturing, as well as a microphone for speech recognition. Gandalf was animated as a 2D head and a hand that could be used for simple gestures to accompany his synthesized speech. While Gandalf was capable of real-time, multimodal interaction with a user, all of his outputs (verbal and nonverbal) were pre-programmed, making the system very inflexible and non-scalable. Interactions with Gandalf consisted of single-utterance command and question exchanges with the user.

Figure 1. Gandalf

REA

The REA (Real Estate Agent) project is the GNL group's current research platform for embodied conversational agents (Cassell, Bickmore et al. 1999) (Figure 2). REA is a simulated real estate agent, and builds upon Gandalf in several ways. Most fundamentally, she has the capability to synthesize her outputs from first principles, i.e., a dictionary and grammar, encompassing both speech and gesture (Stone 1998). REA also incorporates a discourse model which enables her to synthesize and resolve anaphoric references. The underlying approach to conversational understanding and generation in REA is based on discourse functions: each of the user's inputs is interpreted in terms of its conversational function, and responses are generated according to the desired function to be fulfilled. Perception of the user's nonverbal behavior is now handled by a stereoscopic vision system (Azarbayejani, Wren et al. 1996) rather than the unwieldy motion and gaze tracking equipment used in Gandalf. REA also has a fully articulated 3D graphical body inhabiting a 3D virtual world, enabling her to give users guided tours through virtual houses.

Figure 2. REA

REA's interactional behavior has also been improved. When the user produces cues typically associated with turn-taking behavior, such as gesturing, REA allows herself to be interrupted, and then takes the turn again when she is able. She is able to initiate conversational repair when she misunderstands what the user says, and can generate combined voice and gestural output. Table 1 shows the range of interactional functions that REA is capable of and the behaviors used to realize those functions.
Communicative Function                Communicative Behavior
Initiation and termination:
  Reacting                            Short Glance
  Inviting Contact                    Sustained Glance, Smile
  Distance Salutation                 Looking, Head Toss/Nod, Raise Eyebrows, Wave, Smile
  Break Away                          Glance Around
  Farewell                            Looking, Head Nod, Wave
Turn-taking:
  Give Turn                           Looking, Raise Eyebrows (followed by silence)
  Wanting Turn                        Raise Hands into gesture space
  Take Turn                           Glance Away, Start talking
Feedback:
  Request Feedback                    Looking, Raise Eyebrows
  Give Feedback                       Looking, Head Nod

Table 1. Some examples of conversational functions and their behavior realization

REA's software architecture is radically different from the one used in Gandalf (Figure 3), and is based on a separation of modules into those which deal with conversational functions and those which deal with conversational behavior. The decision module (the central decision-making and planning part of the system) deals exclusively with conversational functions and is thus modular with respect to input and output modalities.
Figure 3. REA's software architecture (input devices for speech, body position, gaze, and gesture feed an Input Manager; an Understanding Module and a Generation Module perform interactional and propositional processing; the Decision Module, with its Response Planner, draws on a Knowledge Base and Discourse Model; an Action Scheduler drives speech and gesture generation to the output devices; processing spans hardwired reaction to deliberation)
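The function/behavior separation in this architecture can be illustrated with a small sketch. The mapping below is a hypothetical Python rendering of some Table 1 entries; the module structure and names are my own, not REA's actual code:

# Hypothetical mapping from conversational functions to nonverbal
# behaviors, following REA's function/behavior separation: the decision
# module reasons only over function names, and a separate generation
# step expands them into concrete, schedulable behaviors.

BEHAVIOR_REALIZATIONS = {
    "REACT":           ["short_glance"],
    "INVITE_CONTACT":  ["sustained_glance", "smile"],
    "DISTANCE_SALUTE": ["look", "head_toss", "raise_eyebrows", "wave", "smile"],
    "GIVE_TURN":       ["look", "raise_eyebrows", "silence"],
    "TAKE_TURN":       ["glance_away", "start_talking"],
    "GIVE_FEEDBACK":   ["look", "head_nod"],
}

def realize(function_name):
    """Expand a conversational function into behaviors for the scheduler."""
    return BEHAVIOR_REALIZATIONS.get(function_name, [])

# The decision module might emit GIVE_TURN; the action scheduler then
# executes: look, raise_eyebrows, silence.
print(realize("GIVE_TURN"))

Because the decision module never mentions glances or nods directly, new input or output modalities can be added by changing only the realization tables.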
REA will likely be the initial testbed for the theories of social interaction that I develop in my research. Real estate sales was purposely selected as a task domain because of the opportunities for both social and task-oriented dialog. And, even with all of the fundamental improvements made to REA, interactions with her still feel like a command-and-control interface, with the user asking a question or issuing a command and REA responding in single-utterance exchanges. My work will initially focus on the development of the decision module, and in particular the response planner, which is responsible for planning what REA is to say next.

Sam

The latest addition to the GNL family of conversational agents is Sam (Figure 4), who functions as a peer playmate for children to encourage and scaffold their storytelling. While not as sophisticated as REA in many ways (e.g., Sam's output is not synthesized and user presence is detected via pressure-sensitive floor mats), Sam does have a much more complex interaction model and is capable of producing and listening to multi-utterance story turns. Sam is an exploration into the use of conversational agents and storytelling systems for children.
Figure 4. Sam
IV. Social Protocol 1: Small Talk

One of the initial areas of my research will be how small talk can be used in the real estate domain to establish trust and rapport between REA and the user in such a way that the user feels comfortable disclosing personal information (such as salary). Through the introduction of a metric of interpersonal distance which varies from "strangers" to "intimates", the discourse planner can actually plan to conduct small talk by adding a requisite level of interpersonal distance to the preconditions for asking particular questions (see the sketch at the end of this section). For example, before REA asks the user invasive questions about financial status (a negative face threat) she should first establish some level of rapport, and this rapport might be achieved by conducting some amount of small talk. Although small talk is most noticeable at the margins of conversational encounters, it can be used at various points in the interaction to continue to build rapport and trust (Cheepen 1988), and in real estate sales, a good agent will continue to focus on building rapport throughout the relationship with a buyer (Garros 1999).

Small talk has received sporadic treatment in the linguistics literature, starting with the seminal work of Malinowski (Malinowski 1923), who defined "phatic communion" as "a type of speech in which ties of union are created by a mere exchange of words". Small talk is the language used in free, aimless social intercourse, which occurs when people are relaxing or when they are accompanying "some manual work by gossip quite unconnected with what they are doing." Jakobson (Jakobson 1960) also included a "phatic function" in his well-known conduit model of communication, which is focused on the regulation of the conduit itself (as opposed to the message, sender, or receiver). More recent work has further characterized small talk by describing the contexts in which it occurs, topics typically used, and even grammars which define its surface form in certain domains (Laver 1975; Cheepen 1988; Schneider 1988). In addition, degree of "phaticity" has been proposed as a persistent goal which governs the degree of politeness in all utterances a speaker makes, including task-oriented ones (Coupland, Coupland et al. 1992).

Phatic communion is closely related to the notion of "face" (Goffman 1983): "positive face" is the desire of all speakers to be approved of by their listeners, while "negative face" is the desire of all speakers to be unobstructed in their autonomy. Small talk mitigates positive face threats by providing an interactional style in which it is very easy (even somewhat obligatory) for all interlocutors to carry on a conversation and thereby achieve some degree of camaraderie. It can also be used to mitigate negative face threats by establishing that one's interlocutors are non-hostile (e.g., as used to break uneasy silences in waiting rooms).
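As a minimal Python sketch of the interpersonal-distance precondition idea, the scale, threshold values, and specific moves below are invented for illustration, not a committed design:

# Interpersonal distance modeled as a number in [0, 1]:
# 0.0 = strangers, 1.0 = intimates. Small talk moves raise it;
# face-threatening questions carry a minimum level as a precondition.

FINANCE_QUESTION = {
    "text": "What price range can you afford?",
    "min_intimacy": 0.5,   # negative face threat: requires rapport first
}

SMALL_TALK_MOVES = [
    {"text": "Isn't this glorious weather we're having?", "raises": 0.15},
    {"text": "Fall is such a lovely time of year in New England.", "raises": 0.15},
]

def next_move(intimacy):
    """Do small talk until the task question's precondition is met."""
    if intimacy >= FINANCE_QUESTION["min_intimacy"]:
        return FINANCE_QUESTION["text"]
    return SMALL_TALK_MOVES[0]["text"]  # a real planner would pick by topic/coherence

intimacy = 0.2
print(next_move(intimacy))  # small talk first; finances only after rapport

The point is that small talk stops being idle filler and becomes a planned action: it is selected precisely because it raises the interpersonal-distance value that other utterances list among their preconditions.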
V. Social Protocol 2: Conversational Storytelling

Another preliminary area of my research will be how REA can use conversational storytelling to help achieve her interpersonal goals. Although conversational storytelling is not a social interaction protocol per se, it is used very frequently in social conversation as a means to achieve multiple goals within the larger framework of a social protocol such as small talk (Cheepen actually characterizes most of the exchanges in small talk as story contributions (Cheepen 1988)). Conversational storytelling is the spontaneous generation of a narrative embedded within face-to-face interaction, and it is found both during periods of small talk and during task-oriented dialogue as a way of presenting task-oriented information in an engaging way, and/or meeting one or more of the goals stated above. From a functional perspective, conversational storytelling can be characterized as a linguistic device which can be used to organize and relate information in a narrative form within any interactional frame. Following the literature on conversational storytelling, stories are considered to be "specific, affirmative, past time narratives which tell about a series of events which did take place at specific unique moments in a unique past time world" and which are told to others to make a point or transmit a message (Polanyi 1989). In casual conversation, stories are often used to relate interesting events or humor and thus contribute to the rapport-building function. Conversational storytelling should be an integral part of an agent's small talk capability.

One of the tricks that chatterbot programs use to convey the illusion of intelligence is to tell stories whenever possible in the course of the conversation (Mauldin 1994), a technique that is also espoused in several popular press books on conversational skills (e.g., (RoAne 1997)). There are two possible reasons why this is successful. First, people try to establish relevancy relationships among conversational contributions whenever possible, and multi-utterance narratives may give listeners more material to work with. Second, a story contribution that is irrelevant to an on-going conversation may be tolerated if the entertainment or information content is high enough. This may be why most books on conversational skills suggest that speakers should always have several interesting stories ready to contribute to any social occasion.

As with small talk, storytelling can be used to achieve multiple conversational goals simultaneously. In the real estate domain, it can be used to achieve any combination of the following goals:
• Phatic -- Simply having the agent deliver a story meets the objectives of phatic communion by keeping the conversation going.
• Establishing expertise -- As with the phatic goal, the successful delivery of any story helps to establish the agent's linguistic expertise and intelligence. In addition, specific stories can help establish the agent's expertise in specific areas, such as: "I sold a house to someone just like you."; "I sold a house in the area you are interested in."; or "I sold 25 houses last month."
• Encouraging self-disclosure -- It has been demonstrated that users are more willing to disclose personal information to a computer which has just disclosed similar information to them (Moon 1998). For example: "I just bought a wonderful home in Cambridge. Where are you looking to buy?"
• Persuasion/Problem solving -- The agent can relate a story about a situation similar to the client's in which one of the client's problems is solved. For example: "I sold a house to someone who used Acme Mortgage Company and they closed escrow in only two weeks."
• Providing requested information -- Stories give the agent the opportunity to answer a client's question while also achieving other goals. For example, in response to the question "Are there any new homes available in Boston?", the agent might respond "I just sold a brand new Foobar Development home to a couple out in Waltham."

Spontaneously generated conversational stories differ from prepared narratives in several significant ways. Conversational stories are typically not delivered as monologues, but rather intimately involve the listeners in the production through their elicitation, feedback, requests for elaboration, and attempts to show relevance. Conversational stories must be locally occasioned and recipient designed (Sacks 1995). Speakers must take care to tell stories which are relevant to their listeners (locally occasioned), otherwise they suffer a loss of face for wasting the time of their audience. One general rule of story relevance presented by Polanyi (Polanyi 1982) is that whatever is "close" to the listener (in terms of space, time, or relationship) is relevant. In the real estate domain, for example, stories which relate to the area the client is looking to buy in, the style of home they are looking for, or the school district they are interested in would all be appropriate to tell. In addition to this strategic relevance requirement, stories must also be told at an appropriate point in a given social interaction, with the storyteller constructing the point of the story so that it relates directly to what is being discussed when it is introduced.

The problem of deciding which story to tell at any given time in a conversation has been addressed by several researchers. Gough (Gough 1990) analyzed the production of Xhosa folk narrative, which is synthesized during a performance by weaving together several story fragments that the narrator has memorized. Gough identified two mechanisms which were used to index and modify story fragments based on relevancy relationships. Computational systems for relevant story indexing have also been developed using keywords (Bers and Cassell 1998) and case-based retrieval mechanisms (Domeshek 1992).

Stories must also be recipient designed, in that they need to be tailored for the specific audience they are delivered to. Describing the mechanics of escrow in an otherwise interesting story about creative financing would be inappropriate if told to a banker, but required for a first-time buyer new to home financing. Thus, the knowledge representation from which stories are generated must be hierarchical, with varying levels of detail, so that only the portions interesting and relevant to the listener are selected and conveyed.

Once a story has been selected and tailored, it must be told in a "lifelike" manner by the embodied conversational character. The performance must be punctuated by emphasis at the appropriate points, using relevance or information structure to determine the placement and degree of prosody and gesture (Cassell 1995) which convey the emphasis to the listener. Appropriate linguistic devices must also be used to naturally introduce a story, mark its ending, and demonstrate relevance to the listener, if necessary (Jefferson 1978).
VI. Discourse Planning for Social Interaction One of the primary theoretical contributions of this research will be a model of discourse planning which encompasses social dialog as well as task-oriented dialog, effectively performing social protocols in addition to determining when they should be used.
Classical Approaches to Discourse Planning

The problem of deciding what an autonomous agent should do at any point in time is known as the action selection problem. For a conversational agent, the choices include both interactional behaviors such as conversation initiation, turn-taking, interruption, feedback, etc., and propositional behaviors, which consist of the possible utterances the agent can make. Within the field of Computational Linguistics, the predominant approach to determining appropriate propositional behaviors (what an agent should say next) is to use a speech-act-based discourse planner to determine the semantic content to be conveyed. Once the content is determined, other processes are typically used to organize the content into a coherent dialog and map the semantic representations into the words the agent actually speaks (known as "text generation"). Although some researchers have investigated the possibility of performing all phases of discourse planning and text generation in a single planning framework (e.g., (Appelt 1982)), most contemporary discourse systems separate content determination from generation. In the rest of this discussion I will focus on the content determination problem.

The classical approach to discourse planning is based on the observation that utterances constitute speech acts (Searle 1969), such as requesting, informing, wanting, and suggesting. In addition, humans plan their actions to achieve various goals, and in the case of communicative actions, these goals include changes to the mental states of listeners. Thus, this approach uses classical "static world" planners (e.g., STRIPS (Fikes and Nilsson 1971)) to determine a sequence of speech acts which will meet the agent's goals in a given context (a schematic sketch of a speech act as a planning operator appears below). One of the major advantages of plan-based theories of dialog is that language can be treated as a special case of other rational noncommunicative behavior.

Problems with Extending the Classical Approach

Social conversation presents many novel and theoretically interesting requirements on dialog systems which have not been addressed by existing systems. First, the discourse planning part of the agent must be able to manage and pursue multiple conversational goals, some or all of which may be persistent or non-discrete. It is not sufficient that the planner work on one goal at a time, since a properly selected utterance, for example, can satisfy a task goal by providing information to the user while also advancing the interpersonal goals of the agent. In addition, many goals, such as intimacy or face goals (Coupland, Coupland et al. 1992), are better represented by a utility model in which degrees of satisfaction can be planned for, rather than the discrete all-or-nothing goals typically addressed in AI planners (Hanks 1994). The discourse planner must also be very reactive, since the user's responses cannot be anticipated. The agent's goals and plans may be spontaneously achieved by the user (e.g., through volunteered information) or invalidated (e.g., by the user changing their mind), and the planner must be able to immediately accommodate these changes. Finally, the planner must be very fast, given that the overall response time of the agent is especially crucial during social conversation, since part of the function of this behavior is to establish the capability and intelligence of the other party. An agent with sluggish responsiveness will likely not be held in high regard.
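To ground the classical approach, here is a schematic Python sketch of a speech act encoded as a STRIPS-style operator; the predicate and operator names are invented for illustration:

# A speech act as a STRIPS-style operator: preconditions must hold in
# the modeled conversational state; the add and delete lists describe
# its intended effects on the listener's (believed) mental state.

from dataclasses import dataclass, field

@dataclass
class SpeechAct:
    name: str
    preconditions: set = field(default_factory=set)
    add_list: set = field(default_factory=set)
    delete_list: set = field(default_factory=set)

INFORM_PRICE = SpeechAct(
    name="inform(user, price(house1))",
    preconditions={"wants_to_buy(user)", "topic(house1)"},
    add_list={"knows(user, price(house1))"},
)

def applicable(act, state):
    return act.preconditions <= state

def perform(act, state):
    return (state - act.delete_list) | act.add_list

state = {"wants_to_buy(user)", "topic(house1)"}
if applicable(INFORM_PRICE, state):
    state = perform(INFORM_PRICE, state)
# A classical planner chains such operators backward from goals like
# knows(user, price(house1)) to produce a sequence of speech acts.

The difficulties listed above follow from this formulation: the goal knows(user, price(house1)) is all-or-nothing, while goals like rapport are matters of degree, and the "static world" assumption breaks as soon as the user volunteers information mid-plan.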
Alternative Approaches to Action Selection

Unfortunately, the AI community has mostly abandoned the "static world" planner approach to action selection (such as STRIPS) for applications in which an autonomous agent must function in a dynamic, real-time environment (Goetz 1997). One alternative to static world planners is reactive systems such as the subsumption architecture (Brooks 1985), which have no memory and perform no deliberation; they simply react to the world. Clearly, for behavior as complex as language, some amount of planning and deliberation is required (although ELIZA addicts may disagree). A large number of hybrid architectures have sprung up over the last decade which attempt to bridge the gap between purely reactive and purely deliberative systems, often by simply gluing them together (e.g., (Vere and Bickmore 1990)). However, the idea that behavior can be split cleanly into reactive and deliberative layers has been criticized by many researchers as nonsensical.

Behavior networks are a relatively new paradigm which attempts to use a single mechanism for both deliberative and reactive behavior. In these networks, nodes represent behaviors, sensors, and goals; behavior nodes typically have an activation level associated with them (a measure of the agent's desire to perform the behavior); and the edges connecting nodes represent paths along which activation energy can flow (i.e., paths of influence among behaviors, sensors, and goals). Behavior networks function in a reactive manner when energy from sensors flows to a behavior and leads directly to its execution, and they function in a deliberative manner when energy flows from a goal (directly or indirectly through other behaviors) and leads to a "planned" behavior's execution.

Do The Right Thing

Maes' Do The Right Thing architecture (Maes 1989) was an early behavior network designed to remedy three flaws of reactive systems: their lack of explicit goals, the necessity for the designer to precompile the action selection, and the lack of prediction and planning activity. In this system, behaviors are defined as STRIPS operators (with precondition, add, and delete lists) but are then automatically compiled into a behavior network according to their symbolic relationships to the agent's goals, the current environment, and each other. Edges are added linking the agent's goals to behaviors which achieve the goals (i.e., are in the operator's add list). They are also added linking environmental conditions to behavior preconditions, and between behaviors to express enablement and conflict relationships. Whenever the agent needs to select a behavior to perform, energy is propagated through the network according to Maes' update algorithm until an executable behavior's activation level reaches a specified threshold, at which time it is selected for execution (a small numerical sketch follows below). Maes showed that this system was capable of solving simple blocks world planning problems. More interestingly, however, she showed that by changing a few global parameters, the behavior of the system could be smoothly graduated from purely reactive (ignoring goals) to purely deliberative (not doing anything until a complete plan was formulated).
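The following toy Python sketch conveys the spreading-activation idea; it is a simplification in the spirit of Maes' architecture, not her exact update algorithm, and the behaviors, weights, and decay factor are invented:

# Toy behavior network: goals inject activation into behaviors that
# achieve them, and activation spreads backward along enablement links,
# so behaviors that enable desired behaviors also gain energy.

BEHAVIORS = {
    "greet":      {"achieves": set(),             "enabled_by": set()},
    "small_talk": {"achieves": {"rapport"},       "enabled_by": {"greet"}},
    "ask_budget": {"achieves": {"qualify_buyer"}, "enabled_by": {"small_talk"}},
}
GOALS = {"rapport": 1.0, "qualify_buyer": 2.0}
EXECUTABLE = {"greet", "small_talk"}   # ask_budget's preconditions are unmet

def step(activation):
    new = {}
    for name in BEHAVIORS:
        a = activation[name]
        a += sum(w for g, w in GOALS.items()
                 if g in BEHAVIORS[name]["achieves"])        # goal energy
        for other, spec in BEHAVIORS.items():
            if name in spec["enabled_by"]:                   # backward spread
                a += 0.5 * activation[other]
        new[name] = 0.9 * a                                  # decay
    return new

activation = {name: 0.0 for name in BEHAVIORS}
for _ in range(5):
    activation = step(activation)

# Select the most activated behavior whose preconditions are satisfied.
print(max(EXECUTABLE, key=lambda n: activation[n]))  # -> small_talk

Note the deliberative effect: small_talk wins partly because the strongly desired but currently blocked ask_budget feeds activation back to the behavior that enables it.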
More recent work has extended Maes' system and corrected a few problems with the approach (Rhodes 1996). Further, Goetz has demonstrated an equivalence between this type of activation network and a certain class of neural networks, and used that equivalence to draw conclusions about the ability of the system to converge to ensure persistent action (Goetz 1997).

A New Paradigm for Discourse Planning

Given the novel requirements that social discourse places upon a conversational agent's action selection mechanism, in particular the requirements for pursuing multiple, non-discrete goals in a dynamic, real-time task domain, I plan to investigate the use of activation networks as a new approach to discourse planning. I believe that Maes' architecture, in particular, is well-suited to this task, since extending it to deal with non-discrete goals is very straightforward given that everything in the network is represented as continuous-valued activation energy.

An example of the small talk dialog I am interested in generating is shown in Figure 5. In this example a new buyer has just approached REA about buying a house.

Ex1: U: (enters) R: Hello.
Ex2: R: Isn't this glorious weather we're having? U: It's nice. R: Yes.
Ex3: R: Fall is such a lovely time of year in New England.
Ex4: R: Are you from Boston? U: No, I'm just moving from Wisconsin. R: I see.
Ex5: R: I grew up in Maine, but Boston feels like home now. R: I sold my first house here in 1981.
Ex6: R: What brings you to Boston? U: I just accepted a position at MIT. R: Really?
Ex7: R: I just sold a house to someone else at MIT.

Figure 5. Example Small Talk Dialog (U=User; R=REA)
Exchange 1 is a simple ritual greeting exchange. In exchange 2, the agent leads with a classic small talk question about the weather, since this is a topic assumed to be relevant to everyone and there are no intimacy preconditions for the topic. In exchange 3, the agent uses an evaluative statement (another classic small talk contribution) which is coherent with the current weather topic, but serves to transition to a new topic (New England) to lay the groundwork for a discussion about real estate. In exchange 4, the
agent asks the user a question which is within the topic of New England, and should provide some information to further the task goals of selling a house. In exchange 5 REA tells two conversational stories (about growing up in Maine and selling her first house), the first of which transitions the topic back to Boston and the second of which helps to establish her expertise as an experienced real estate agent. In exchange 6, REA stays on topic (Boston) and asks a slightly more personal question (now that some interpersonal groundwork has been laid) to gain more information in order to qualify the buyer. Finally, in exchange 7, REA takes the opportunity to tell another story to help further establish her expertise and further increase her bond with the buyer.

In reviewing transcripts such as this, and ones between human buyers and real estate agents, I have started compiling a list of criteria which must be taken into account by the discourse planner in selecting its next utterance. These criteria currently include:
• Logical preconditions -- It does not make sense to ask a user what kind of financing they can arrange until the agent establishes that they want to buy a house. Most utterances have a large number of logical preconditions which must be satisfied in order for the agent to even consider using them.
• Support goals -- The utterance should further the agent's task and interpersonal goals. These goals may be discrete (e.g., SELL-HOUSE) or non-discrete (e.g., ESTABLISH-EXPERTISE).
• Coherence -- Utterances should stay within the current topic being discussed whenever possible. Deliberate breaks from the current topic should be marked linguistically (Schiffrin 1987).
• Topic transition -- Utterances which serve to transition the current topic to one that will enable the agent to further its goals should be preferred.
• Relevance -- Utterances, especially conversational stories, should be relevant to the user.
• Intimacy criteria -- Utterances may have intimacy preconditions which should be observed (e.g., don't ask about finances until some rapport has been established).
• Novelty -- Conversational stories which are interesting, timely or funny should be preferred.
A discourse planner can be constructed using an activation network architecture by having each node represent an utterance (or story) the agent can tell, with the edges carrying activation energy derived from the criteria listed above (Figure 6; a small sketch follows the figure). I am currently in the process of building the first prototype of such a system.
Figure 6. Activation network-based discourse planner (utterance nodes connected by edges carrying activation for precondition satisfaction, precondition enablement, goal achievement, coherence, relevance, topic enablement, and predictive enablement)
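As a minimal Python sketch of how such a planner might weigh utterance nodes against the criteria above, the following performs a single scoring pass rather than full iterative activation spreading; all utterances, goals, and weights are invented:

# Each node is a candidate utterance; one scoring pass approximates the
# activation it would receive from preconditions, goals, coherence,
# intimacy, and novelty. The planner speaks the highest-scoring node.

UTTERANCES = [
    {"text": "Isn't this glorious weather we're having?",
     "topic": "weather", "preconds": set(), "min_intimacy": 0.0,
     "goals": {"BUILD-RAPPORT": 0.5}},
    {"text": "What kind of financing can you arrange?",
     "topic": "finances", "preconds": {"wants_to_buy"}, "min_intimacy": 0.6,
     "goals": {"QUALIFY-BUYER": 1.0}},
]
GOAL_WEIGHTS = {"BUILD-RAPPORT": 1.0, "QUALIFY-BUYER": 2.0}

def activation(u, state):
    if not u["preconds"] <= state["facts"]:       # logical preconditions
        return float("-inf")
    if state["intimacy"] < u["min_intimacy"]:     # intimacy preconditions
        return float("-inf")
    a = sum(GOAL_WEIGHTS[g] * w for g, w in u["goals"].items())  # support goals
    if u["topic"] == state["topic"]:              # coherence
        a += 0.5
    if u["text"] in state["already_said"]:        # novelty
        a -= 2.0
    return a

state = {"facts": set(), "intimacy": 0.1,
         "topic": "greeting", "already_said": set()}
best = max(UTTERANCES, key=lambda u: activation(u, state))
print(best["text"])  # small talk: the finance question is blocked

In the full network version, blocked utterances would not simply score negative infinity; they would pass their activation backward to utterances (such as small talk) that establish their missing preconditions, which is what makes the mechanism plan.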
VII. Research Agenda

I intend to pursue an experimental methodology in which I first observe humans performing the behaviors of interest (e.g., small talk in sales encounters), analyze the data to infer the goals and strategies of the interlocutors, construct theories and models which incorporate these strategies, and finally construct computational models of this behavior in existing embodied conversational interfaces to assess their validity with users. The research agenda that I plan to pursue is roughly the following:
1. Develop a taxonomy of social protocols (one initial attempt at doing this can be found in (Goldsmith 1996)).
2. Develop a taxonomy of conversational agent task domains.
3. Determine which protocols are important for a conversational agent to use in different task domains.
4. Develop theories about how people select and perform selected social protocols.
5. Develop an overarching framework within which models of the theories can be implemented and integrated in a conversational agent so that they can be evaluated.
6. Produce a unified theory about how a conversational agent should plan its actions to address both task and social goals.
7. Refine the theories based on experimental results.

Research Questions

Some of the fundamental questions I will be addressing in this program of research are:
• Which social protocols are most important to use to ensure an agent's success in a given task domain?
• What nonverbal behaviors accompany these social protocols? Which ones are essential?
• Are activation-network-based discourse planners an appropriate framework for modeling these protocols in the agent? What is the power of these planners relative to non-linear planners? How can they be extended to produce equivalent functionality?
• What requirements do these social protocols place on the agent's physical appearance and animation models?
• How will users react to socially intelligent agents? What can be done to help ensure their acceptance? Will they find them more natural and compelling to use than socially non-intelligent agents?
Research Challenges

Certainly, avoiding the abyss of trying to understand unbounded natural language is one of the biggest challenges in this work. However, there are several aspects of social conversation which make this endeavor seem tractable. First, the structure of much of social conversation is surprisingly regular, the topics likely to be discussed are predictable (Laver 1975; Laver 1981; Cheepen 1988), and in fact grammars have been developed which completely characterize phenomena like small talk in particular contexts (Schneider 1988). Second, conversational contributions in social conversation need not always be completely semantically relevant. As experience with chatterbot systems has shown, users work hard to establish the relevance of contributions made by a conversational partner. Third, it is natural for most professionals (such as therapists, teachers, or salespersons) to maintain initiative in interactions with their clients, and keeping initiative is one of the tricks used by most chatterbot systems to maintain the illusion of intelligence by limiting what the user can say and minimizing the need to respond directly to their queries. I contend that a properly designed system in a limited task domain (such as sales) can work on speech input using keyword spotting to detect topic shifts, feedback, and repair moves by the user, and a task-specific grammar to detect when the user transitions into task talk (a toy sketch appears at the end of this section).

Evaluation

The social conversational skills developed in this research will be evaluated by comparing the effectiveness of an embodied conversational agent with and without the social behaviors enabled (Dehn and Mulken). The evaluation will be along three dimensions: user behavior while using the system, subjective evaluation, and outcome. Measurements of user behavior will indicate how "natural" the interaction is, as indicated by such metrics as number of disfluencies, syntactic complexity, and length of utterance in the user's speech. Subjective measures will indicate how the user liked interacting with the agent under different conditions and will be measured through post-experiment questionnaires. Evaluation of outcome will depend upon the task domain chosen, but could be represented by the likelihood of a user to buy from a computerized sales agent, as in (Moon 1998; Moon 1999a; Moon 1999b).
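Returning to the keyword-spotting idea raised under Research Challenges, here is a toy Python sketch; the keyword sets and move categories are invented for illustration:

# Toy keyword spotter for classifying user moves in a sales domain:
# repair, feedback, topic shifts, and transitions into task talk.

TOPIC_KEYWORDS = {
    "weather":  {"weather", "rain", "snow", "sunny"},
    "finances": {"mortgage", "loan", "salary", "escrow"},
    "housing":  {"house", "home", "bedroom", "neighborhood"},
}
FEEDBACK = {"uh-huh", "right", "okay", "yes"}
REPAIR = {"what", "sorry", "pardon", "repeat"}
TASK_TOPICS = {"finances", "housing"}

def classify(utterance):
    words = set(utterance.lower().strip("?!.").split())
    if words & REPAIR:
        return ("repair", None)
    if words & FEEDBACK:
        return ("feedback", None)
    for topic, keywords in TOPIC_KEYWORDS.items():
        if words & keywords:
            kind = "task_talk" if topic in TASK_TOPICS else "topic_shift"
            return (kind, topic)
    return ("unknown", None)

print(classify("Can I get a mortgage on my salary"))  # ('task_talk', 'finances')

Even this crude classifier lets the agent keep initiative: anything classified as unknown can be absorbed with feedback or a new small talk move rather than a direct answer.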
VIII. Conclusion

The research program outlined here addresses the basic question of how to give embodied conversational agents social intelligence to improve their effectiveness in interacting with humans. The theories and models developed in this work will benefit the fields of Linguistics (in particular Discourse Theory and Computational Linguistics) and Social Psychology, and pave the way for the next generation of intelligent computers which can interact with humans more naturally and effectively than ever before.
References

Andre, E., J. Muller, et al. (1996). The PPP Persona: A Multipurpose Animated Presentation Agent. Advanced Visual Interfaces.
Appelt, D. (1982). Planning natural language utterances to satisfy multiple goals. Palo Alto, CA, Stanford University.
Aron, A., E. Melinat, et al. (1997). "The experimental generation of interpersonal closeness: A procedure and some preliminary findings." Personality and Social Psychology Bulletin 23(4): 363-377.
Azarbayejani, A., C. Wren, et al. (1996). Real-time 3-D tracking of the human body. IMAGE'COM, Bordeaux, France.
Ball, G., D. Ling, et al. (1998). Lifelike Computer Characters: the Persona project at Microsoft Research. Seattle, Microsoft Research.
Bers, M. U. and J. Cassell (1998). "Interactive Storytelling Systems for Children: Using Technology to Explore Language and Identity." Journal of Interactive Learning Research 9(2): 183-215.
Blumberg, B. (1996). Old Tricks, New Dogs: Ethology and Interactive Creatures. Media Arts and Sciences. Cambridge, MIT.
Brooks, R. (1985). A robust layered control system for a mobile robot. Cambridge, MA, MIT AI Lab.
Brown, G. and T. Harris (1978). Social origins of depression: A study of psychiatric disorder in women. New York, Free Press.
Brown, P. and S. Levinson. Universals in language usage: Politeness phenomena.
Cassell, J. (1995). The role of Gesture in Stories as Multiple Participant Frameworks. AAAI Spring Symposium Series.
Cassell, J., T. Bickmore, et al. (1999). Embodiment in Conversational Interfaces: Rea. CHI 99, Pittsburgh, PA.
Cassell, J. and H. Vilhjalmsson (1999). "Fully Embodied Conversational Avatars: Making Communicative Behaviors Autonomous." Autonomous Agents and Multi-Agent Systems 2: 45-64.
Cheepen, C. (1988). The Predictability of Informal Conversation. New York, Pinter.
Coupland, J., N. Coupland, et al. (1992). ""How are you?": Negotiating phatic communion." Language in Society 21: 207-230.
Dehn, D. M. and S. v. Mulken. The Impact of Animated Interface Agents: A Review of Empirical Research. Saarbrucken, Germany, University of Saarland.
Domeshek, E. (1992). Do The Right Thing: A Component Theory for Indexing Stories as Social Advice. Institute for the Learning Sciences, Northwestern University.
Fikes, R. and N. Nilsson (1971). "STRIPS: A new approach to the application of theorem proving to problem solving." Artificial Intelligence 5(2): 189-208.
Fogg, B. J. (1999). Persuasive Technologies. Communications of the ACM 42: 27-29.
Garros, D. (1999). Real estate agent, Home and Hearth Realty, Cambridge.
Goetz, P. (1997). Attractors in Recurrent Behavior Networks. Buffalo, NY, State University of New York at Buffalo.
Goffman, E. (1983). Forms of Talk. Philadelphia, PA, University of Pennsylvania Publications.
Gough, D. (1990). The Principle of Relevance and the Production of Discourse: Evidence from Xhosa Folk Narrative. Narrative Thought and Narrative Language. B. Britton and A. Pellegrini. Hillsdale, New Jersey, Lawrence Erlbaum Associates: 199-217.
Grice, P. (1989). Studies in the Way of Words. Cambridge, MA, Harvard University Press.
Grosz, B. and C. Sidner (1990). Plans for Discourse. Intentions in Communication. P. R. Cohen, J. Morgan and M. E. Pollack. Cambridge, MA, MIT Press: 417-444.
Hanks, S. (1994). Discourse Planning: Technical Challenges for the Planning Community. AAAI Workshop on Planning for Inter-Agent Communication.
Jakobson, R. (1960). Linguistics and Poetics.
Jefferson, G. (1978). Sequential aspects of storytelling in conversation. Studies in the organization of conversational interaction. J. Schenkein. New York, Academic Press: 219-248.
Johnson, M. P., A. Wilson, et al. (1999). "Sympathetic Interfaces: Using a Plush Toy to Direct Synthetic Characters." Proceedings of CHI'99: 152-158.
Kendon, A. (1990). Conducting Interaction: Patterns of behavior in focused encounters. New York, Cambridge University Press.
Klein, J. T. (1999). Computer Response to User Frustration. MIT.
Laver, J. (1975). Communicative functions of phatic communion. The organization of behavior in face-to-face interaction. A. Kendon, R. Harris and M. Key. The Hague, Mouton: 215-238.
Laver, J. (1981). Linguistic routines and politeness in greeting and parting. Conversational routine. F. Coulmas. The Hague, Mouton: 289-304.
Levinson, S. C. (1983). Pragmatics. Cambridge, Cambridge University Press.
Maes, P. (1989). How to do the right thing.
Maes, P., T. Darrell, et al. (1995). The ALIVE System: Wireless, full-body interaction with autonomous agents. Cambridge, MA, MIT Media Lab.
Malinowski, B. (1923). The problem of meaning in primitive languages. The Meaning of Meaning. C. K. Ogden and I. A. Richards, Routledge & Kegan Paul.
Mauldin, M. L. (1994). Chatterbots, Tinymuds, and the Turing Test: Entering the Loebner Prize Competition. AAAI 94.
Moon, Y. (1998). Intimate self-disclosure exchanges: Using computers to build reciprocal relationships with consumers. Cambridge, MA, Harvard Business School.
Moon, Y. (1999a). The Effects of "Canned" Personalization on Outcomes in an Interactive Marketing Situation. Harvard Business School.
Moon, Y. (1999b). When the Computer is the "Salesperson": Consumer Responses to Computer "Personalities" in Interactive Marketing Situations. Harvard Business School.
Picard, R. (1997). Affective Computing. Cambridge, MA, MIT Press.
Polanyi, L. (1982). "Linguistic and social constraints on storytelling." Journal of Pragmatics 6: 509-524.
Polanyi, L. (1989). Telling the American Story: A Structural and Cultural Analysis of Conversational Storytelling. Cambridge, MA, MIT Press.
Reeves, B. and C. Nass (1996). The Media Equation. Cambridge, Cambridge University Press.
Rhodes, B. J. (1996). PHISH-Nets: Planning Heuristically in Situated Hybrid Networks. Media Lab. Cambridge, MA, MIT.
RoAne, S. (1997). What Do I Say Next? Talking your way to business and social success. New York, Warner Books.
Sacks, H. (1995). Lectures on Conversation. Oxford, Blackwell.
Schiffrin, D. (1987). Discourse markers. Cambridge, Cambridge University Press.
Schneider, K. P. (1988). Small Talk: Analysing Phatic Discourse. Marburg, Hitzeroth.
Searle, J. (1969). Speech Acts: An essay in the philosophy of language. Cambridge University Press.
Stone, M. (1998). Modality in Dialogue: Planning, Pragmatics, and Computation. University of Pennsylvania.
Thorisson, K. (1996). Communicative Humanoids: A Computational Model of Psychosocial Dialogue Skills. Media Arts and Sciences. Cambridge, MIT.
Traum, D. R. and E. A. Hinkelman (1992). "Conversation acts in task-oriented spoken dialogue." Computational Intelligence 8(3).
Vere, S. and T. Bickmore (1990). "A Basic Agent." Computational Intelligence 6: 41-60.