Linguistic Style Improvisation for Lifelike Computer Characters
AAAI 96 Workshop on AI, Alife and Entertainment, Portland

Marilyn A. Walker, AT&T Research Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974, [email protected]
Janet E. Cahn, MIT Media Lab, 20 Ames St., Cambridge, MA 02139, [email protected]
Stephen J. Whittaker, AT&T Research Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974, [email protected]

This research was supported in part by Hewlett Packard Laboratories U.K., by the University of Pennsylvania, by Mitsubishi Electric Research Laboratories, by the MIT Media Laboratory, and by AT&T Research Laboratories.

Just because you are a character doesn't mean that you have character. (Wolf to Raquel in Pulp Fiction, Q. Tarantino)
Abstract

This paper introduces Linguistic Style Improvisation, a theory and algorithms for improvisation of spoken utterances by artificial agents, with applications to interactive story and dialogue systems. We argue that linguistic style is a key aspect of character, and show how speech act representations common in AI can provide abstract representations from which computer characters can improvise. We show that the mechanisms proposed introduce the possibility of socially oriented agents, meet the requirement that lifelike characters be believable, and satisfy particular criteria for improvisation proposed by Hayes-Roth.
1 Introduction
Recent work on interactive drama systems attempts to create virtual environments for learning or entertainment in which humans and lifelike computer characters engage in interesting and emotionally evocative play [Loyall and Bates, 1995, Rich et al., 1994, Maes et al., 1994]. Previous work suggests that the key requirements for these characters include believability, the ability to represent and express emotion, and the ability to respond to human users in an interpretable way. In addition, it is desirable for a character to be able to IMPROVISE on an abstract specification of an act that it intends to perform, either by its own volition or under the direction of another agent [Hayes-Roth and Brownston, 1994, Hayes-Roth et al., 1995]. Here, we argue that linguistic style is a key aspect of character, previously ignored, and present a theory of, and algorithms for, Linguistic Style Improvisation by computer characters. Linguistic Style Improvisation (henceforth LSI) concerns the choices a speaker makes about the SEMANTIC CONTENT, SYNTACTIC FORM and ACOUSTICAL REALIZATION of their spoken utterances. As an example of how linguistic style can
convey character, consider Laszlo's request for two cointreaux in 1, from the Casablanca screenplay excerpted in Figure 1. In the film, this request is delivered in pleasant tones.

(1) a. Two cointreaux, please.

(Laszlo and Ilsa enter Rick's Cafe)
Headwaiter: Yes, M'sieur?
Laszlo: I reserved a table. Victor Laszlo.
Waiter: Yes, M'sieur Laszlo. Right this way.
(Laszlo and Ilsa follow the waiter to a table)
Laszlo: Two cointreaux, please.
Waiter: Yes, M'sieur.
Laszlo: (to Ilsa) I saw no one of Ugarte's description.
Ilsa: Victor, I feel somehow we shouldn't stay here.

Figure 1: Excerpt from the Casablanca script.

We show how it is possible, given an abstract plan-based representation of Laszlo's utterance as a speech act, to automatically improvise any of the alternative stylistic realizations in 2 for requesting two cointreaux:

(2) a. Bring us two cointreaux, right away.
b. You must bring us two cointreaux.
c. We don't have two cointreaux, yet.
d. You wouldn't want to bring us two cointreaux, would you?

Clearly, speakers make such stylistic choices when they realize their communicative intentions, and these stylistic realizations express the speaker's character and personality. Furthermore, listeners draw inferences about the character and personality of a speaker based on the speaker's choices. According to theories of social interaction, a speaker's choice is controlled by social and affective variables [Goffman, 1983, Brown and Levinson, 1987]. For example, the power variable can affect Laszlo's choice among the forms in 2: if he believes he is more powerful than the waiter he may say 2a or 2b, but if he believes he is less powerful he may say 2d.

Our work on Linguistic Style Improvisation is most similar to Hayes-Roth's work on directed improvisation [Hayes-Roth and Brownston, 1994, Hayes-Roth et al., 1995]. In her work, directions to computer characters are abstract representations of acts that the characters can improvise on and perform. These acts may be either physical movements or linguistic behavior. We focus exclusively on linguistic behavior. Our work draws from two theoretical bases: computational work on SPEECH ACTS [Allen, 1979, Cohen, 1978, Litman, 1985], and social anthropology and linguistics research on social interaction [Goffman, 1983, Brown and Levinson, 1987]. In section 2 we introduce the components of speech act theory that we draw on; in section 3 we discuss in detail Brown and Levinson's theory of linguistic social interaction. We argue that the combination of these provides a rich generative source of different characterizations for artificial agents. Section 4 explains how these theories provide the basis for generating the improvisations exemplified in 2 above. In section 5 we discuss how we augment these improvisations with improvisations of acoustical configurations for emotions [Cahn, 1990]. Section 6 illustrates how the theory is implemented in the domain of interactive story and dialogue. Finally, section 7 discusses how our methods for improvisation satisfy Hayes-Roth's criteria for improvisations by lifelike characters [Hayes-Roth et al., 1995].

header:          REQUEST-ACT(speaker, hearer, action)
precondition:    WANT(speaker, action)
                 CANDO(hearer, action)
decomposition-1: surface-request(speaker, hearer, action)
decomposition-2: surface-request(speaker, hearer, INFORMIF(hearer, speaker, CANDO(hearer, action)))
decomposition-3: surface-inform(speaker, hearer, CANDO(speaker, action))
decomposition-4: surface-inform(speaker, hearer, WANT(speaker, action))
effects:         WANT(hearer, action)
                 KNOW(hearer, WANT(speaker, action))
constraint:      AGENT(action, hearer)

Figure 2: Definition of the REQUEST-ACT plan operator, from Litman and Allen, 1987.
2 Speech Acts
2.1 Speech Acts as Abstractions

Speech acts were first proposed as a small set of communicative intentions, such as REQUEST or INFORM, that underlie all utterance production [Searle, 1975]. A speech act definition specifies the conditions under which a speaker could be successful at achieving a communicative intention by performing a particular speech act, and the effects on the hearer if the speaker is successful. Initial computational work proposed that speech acts should be implemented in a standard AI planning system as plan operators that include the act's DECOMPOSITION, PRECONDITIONS and EFFECTS, thereby enabling computer agents to plan utterances as well as physical acts [Allen, 1979, Cohen, 1978, Litman, 1985]. While many different inventories of speech acts are possible, our taxonomy consists of the initiating acts of INFORM, OFFER and two types of REQUEST: REQUEST-INFO and REQUEST-ACT. We also use three types each of response speech acts for acceptance and rejection, corresponding to each major type of initiating act. These are ACCEPT-INFORM, ACCEPT-OFFER and ACCEPT-REQUEST, and REJECT-INFORM, REJECT-OFFER and REJECT-REQUEST. An example plan-based representation of a REQUEST-ACT, such as Laszlo produced in 1, based on Litman's work, is given in Figure 2 [Litman, 1985].

A critical basis of our improvisation algorithms is speech act theory's distinction between the underlying intention of a speech act, presumably described by the speech act's effect field (and encapsulated in its label), and the surface forms that can REALIZE that speech act as an utterance. This distinction is seen in Figure 2: the REQUEST-ACT speech act specifies an underlying intention (the desired effect) of the speaker getting the hearer to do (or want to do) a particular action, while the four decompositions specify the different ways that the underlying speech act can be realized by surface speech acts, i.e., by particular sentential forms such as declarative sentences or questions. For example, the sentential equivalents of decompositions 1 to 4 in Figure 2 are in 3a to 3d respectively, where action represents an action description:

(3) a. Do action.
b. Can you do action?
c. I can't do action.
d. I want you to do action.

Our algorithms for improvisation, to be discussed in section 3, are mechanisms for deciding how to realize a given underlying intention as a particular surface form. These realizations depend on social and affective variables and include a wide range of alternative improvisations in addition to the surface realizations that Litman enumerates.

header:          SERVE(?A, ?C, cointreaux)
precondition:    HAS(?B, cointreaux)
decomposition-1: acquire-from-bar(?A, ?B, cointreaux)
                 BRING(?A, ?C, cointreaux)
effects:         HAS(?C, cointreaux)

Figure 3: A possible plan in the restaurant domain for serving two cointreaux.
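To make the plan-operator representation concrete, here is an illustrative Python transcription of the operators in Figures 2 and 3. It is only a sketch of the data involved: our implementation represents plan operator fields as FUF attribute-value structures (section 2.2), and the dictionary keys below simply mirror the field names in the figures, with terms kept as strings for readability.

# Illustrative sketch: the plan operators of Figures 2 and 3 as Python dictionaries.
REQUEST_ACT = {
    "header": "REQUEST-ACT(speaker, hearer, action)",
    "preconditions": ["WANT(speaker, action)", "CANDO(hearer, action)"],
    "decompositions": [
        "surface-request(speaker, hearer, action)",
        "surface-request(speaker, hearer, INFORMIF(hearer, speaker, CANDO(hearer, action)))",
        "surface-inform(speaker, hearer, CANDO(speaker, action))",
        "surface-inform(speaker, hearer, WANT(speaker, action))",
    ],
    "effects": ["WANT(hearer, action)", "KNOW(hearer, WANT(speaker, action))"],
    "constraint": "AGENT(action, hearer)",
}

SERVE = {
    "header": "SERVE(?A, ?C, cointreaux)",
    "preconditions": ["HAS(?B, cointreaux)"],
    "decompositions": [["acquire-from-bar(?A, ?B, cointreaux)", "BRING(?A, ?C, cointreaux)"]],
    "effects": ["HAS(?C, cointreaux)"],
}

# Sentential equivalents of decompositions 1-4 (examples 3a-3d in the text):
SENTENTIAL_FORMS = ["Do action.", "Can you do action?", "I can't do action.", "I want you to do action."]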
2.2 Representing a Script or a Task-Oriented Dialogue by Speech Acts

We assume that both plays and interactive dialogue can be represented as sequences of speech acts by multiple characters. We use speech acts as the abstract representation for utterances that provides the basis for improvisation. Thus, each spoken utterance is represented as an instantiation of a plan operator, and these instantiations are interleaved with descriptions of physical acts in a real or simulated world. This representation matches the output of current dialogue planners, so we can apply LSI directly to interactive dialogue systems; we illustrate this in section 6 on dialogue plans for task-oriented dialogues [Walker, 1993]. Applying LSI to interactive drama requires that scripts for interactive drama be written as a sequence of speech acts and physical acts. We believe that writing such a script would require a new type of interactive fiction writer, who is willing to represent the script abstractly. In Hayes-Roth's work, part of children's interaction with the system is the writing of abstract scripts for the characters' behavior. In our initial work, we constructed the speech-act/physical-act representation by hand for a segment of the script of Casablanca.

In addition to the speech act types, the real or simulated world that the speech acts refer to must also be represented. In general, each speech act refers to some belief or act in some domain. For example, note that the REQUEST-ACT plan in Figure 2 takes an action as an argument. A convenient way to represent this action is with a domain-specific plan operator, with its own domain-specific preconditions, decompositions, and effects. A somewhat simplified action plan for an agent A who works in a restaurant B, serving cointreaux to customer C, is in Figure 3. If the plan operator described in Figure 3 serves as the action in the REQUEST-ACT speech act in Figure 2, then this represents the request that the hearer serve the speaker cointreaux, as in Laszlo's request to the waiter in 1, once the variables are instantiated to reflect the fact that the hearer of the request act is the agent of the serve action. The preconditions, decompositions, and effects for both speech acts and domain acts provide SEMANTIC CONTENT for improvisation, as will be explained in more detail in sections 3 and 4.

Finally, because we are defining a mapping between abstract representations and surface strings, as a matter of convenience we represent the content of the plan operator fields in the generator FUF's (Functional Unification Formalism's) attribute-value language [Elhadad, 1992]. This simplification saves us from writing a translator from logical forms to FUF attribute-value representations. As an example, consider the proposition in Figure 3 that 'the restaurant has cointreaux'. In FUF, this would be represented as:

((cat clause)
 (proc ((type possessive) (mode attributive)))
 (tense present)
 (partic ((possessor ((cat common) (head === restaurant)))
          (possessed ((cat common) (denotation meal) (head === cointreaux))))))

Given this representation, it is straightforward to implement transformation functions which change this representation to another feature-value matrix for a particular linguistic form, which is then given as input to the FUF generator. For example, the yes-no question Does the restaurant have cointreaux? can be generated by adding the feature-value pair (mood yes-no) to the set of feature-value pairs above.
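As a rough illustration of the kind of transformation function described above, the sketch below encodes the attribute-value structure for 'the restaurant has cointreaux' as a nested Python dictionary and adds the (mood yes-no) feature to obtain the structure for the corresponding yes-no question. The dictionary encoding and the function name are our own illustrative choices; the actual implementation operates on FUF's Lisp-style functional descriptions and hands them to the FUF generator.

import copy

# 'The restaurant has cointreaux' as a nested attribute-value structure,
# mirroring the FUF functional description given above.
HAS_COINTREAUX = {
    "cat": "clause",
    "proc": {"type": "possessive", "mode": "attributive"},
    "tense": "present",
    "partic": {
        "possessor": {"cat": "common", "head": "restaurant"},
        "possessed": {"cat": "common", "denotation": "meal", "head": "cointreaux"},
    },
}

def make_yes_no_question(fd):
    """Return a copy of the feature structure with (mood yes-no) added.

    Given the structure above, the generator would then produce
    'Does the restaurant have cointreaux?' instead of the declarative form.
    """
    question = copy.deepcopy(fd)
    question["mood"] = "yes-no"
    return question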
3 Social Interaction and Linguistic Style
Whenever a person realizes a particular speech act, s/he makes choices about the linguistic style with which that act is realized. Our main idea is that all of these choices have a major effect on our perception of an agent's character and personality. Given a goal to express a particular character by improvising on an abstract speech-act specification, how does an agent choose among all the possible variations in SEMANTIC CONTENT, SYNTACTIC FORM and ACOUSTICAL REALIZATION? We claim that a principled explanation may be found in Brown and Levinson's theory of social interaction, described in detail below [Brown and Levinson, 1987], and, in addition, that the results are artistically interesting.

Maintaining public face

Brown and Levinson, henceforth B&L, posit that all agents have, and know each other to have:

1. FACE: An agent's public self-image, which consists of the desire for:
(a) AUTONOMY: Freedom of action and freedom from imposition by other agents;
(b) APPROVAL: A positive, consistent self-image or personality that is appreciated and approved of by other agents;
2. Capabilities for RATIONAL REASONING, such as means-end reasoning, deliberation, and plan recognition.

Social Variables and Face

Given the desire to maintain their own and others' face, and beliefs about their own and others' rationality, an agent's algorithm for choosing a strategy relies on evaluating three socially determined variables:

1. D(S,H): the SOCIAL DISTANCE between the speaker and hearer.
2. P(H,S): the POWER that the hearer has over the speaker.
3. R: a RANKING OF IMPOSITION for the act under discussion.
B&L propose that these variables are quantities. In LSI, these variables range over values between 1 and 50. Human agents use personal experience, background knowledge, and cultural norms to determine the values for these variables. For example, SOCIAL DISTANCE often depends on how well S and H know one another, but also on social class and status. POWER comes from many sources, but often arises from the ability of S to control access to goods that H wants, such as money. The RANKING OF IMPOSITION relies on the fact that all agents' basic desires include the desire for autonomy and approval. Thus particular speech-act types can be ranked as higher impositions simply by how they relate to agents' basic desires.

B&L claim that speech acts that can function as a threat to H's desire for autonomy include those that predicate some future act of H, as well as speech acts that predicate some future act of S toward H, such as offers, which put pressure on H to accept or reject them. In LSI, we define the act types of REQUEST-INFO, REQUEST-ACT and OFFER as acts that threaten H's desire for autonomy. We also assume that the INFORM speech act can threaten H's desire for autonomy, on the basis that it is an attempt by S to affect H's mental state. B&L claim that speech acts that threaten H's desire for approval include all rejections; in our speech-act inventory, this includes the act types REJECT-INFORM, REJECT-OFFER and REJECT-REQUEST. (Other speech acts not in our inventory, such as criticisms and complaints, also threaten H's desire for approval.) Given our inventory of speech acts and the range of the variables D and P, we instantiate B&L's claims with the ranking of imposition R based on speech-act type shown in Figure 4. (We discuss in section 7 how this R is a simplification of what is actually required.)
Speech Act       R
accept-request    5
accept-inform     5
accept-offer     10
inform           15
request-info     20
offer            25
reject-offer     30
reject-inform    35
reject-request   40
request-act      45

Figure 4: A ranking R of imposition for various types of speech acts, with values from 1 to 50.
Linguistic Style Strategies and Social Variables

As social and rational actors, S and H attempt to avoid threats to one another's face. Given values for social distance D(S,H), power P(H,S) and ranking of imposition R, the agent S estimates the threat Θ to H of performing the speech act by simply summing these variables, as in equation 4:

(4) Θ = D(S,H) + P(H,S) + R

Once a value for Θ has been calculated, the agent uses it to choose one of the following four strategies for executing a speech act (Brown and Levinson also include a strategy of not executing the speech act at all, because the face threat is too great):

(5) a. DIRECT: Do the act directly.
b. APPROVAL-ORIENTED: Orient the realization of the act to H's desire for approval.
c. AUTONOMY-ORIENTED: Orient the realization of the act to H's desire for autonomy.
d. OFF-RECORD: Do the act off-record by hinting, and/or by ensuring that the interpretation of the utterance is ambiguous.

The lowest values of Θ lead to the DIRECT strategy and higher values lead to the OFF-RECORD strategy. Since D, P, and R range between 0 and 50, Θ can have values from 0 to 150. DIRECT strategies correspond to values up to 50, APPROVAL-ORIENTED strategies to values up to 80, AUTONOMY-ORIENTED strategies to values up to 120, and OFF-RECORD strategies to values from 120 to 150. Each strategy can be realized by a wide range of sub-strategies, whose SEMANTIC CONTENT is selected from the plan-based representation for a speech act and whose SYNTACTIC FORM is selected from a library of syntactic forms. Since there are many ways to realize each strategy, realizations within particular ranges are heuristically assigned to the upper or lower end of the scale, or assigned to the same values of the scale to support random variation.
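The following sketch restates equation 4 and the threshold scheme as code. The dictionary reproduces Figure 4 and the boundaries (50, 80, 120, 150) are those given above; the function names are our own, and the sketch omits the finer-grained assignment of realizations within each range.

# Ranking of imposition R by speech-act type (Figure 4).
RANKING_OF_IMPOSITION = {
    "accept-request": 5, "accept-inform": 5, "accept-offer": 10,
    "inform": 15, "request-info": 20, "offer": 25,
    "reject-offer": 30, "reject-inform": 35, "reject-request": 40,
    "request-act": 45,
}

def threat(distance, power, speech_act):
    """Equation 4: Theta = D(S,H) + P(H,S) + R."""
    return distance + power + RANKING_OF_IMPOSITION[speech_act]

def choose_strategy(theta):
    """Map Theta onto one of the four strategies in 5."""
    if theta <= 50:
        return "DIRECT"
    elif theta <= 80:
        return "APPROVAL-ORIENTED"
    elif theta <= 120:
        return "AUTONOMY-ORIENTED"
    else:
        return "OFF-RECORD"

For instance, with D = 5, P = 0 and a REQUEST-ACT (R = 45), threat(5, 0, "request-act") is 50, which falls in the DIRECT range; this matches the worked example in section 4 below.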
Emotion as an Element of Linguistic Style

Varying the affect with which an utterance is realized is also a critical aspect of linguistic style. Brown and Levinson state that expressions of strong emotion threaten both S's and H's desires for approval and autonomy, but they do not further specify the relation between strategies for selecting SEMANTIC CONTENT and SYNTACTIC FORM and those for selecting the ACOUSTICAL REALIZATIONS which most directly express emotions. In order to explore this interaction, we adopt a very simple view of emotional expression: emotional disposition is a dimension orthogonal to the social variables, and each character is assigned an emotional disposition from the set: Angry, Pleasant, Disgusted, Annoyed, Distraught, Sad, Gruff.
4 Computing Linguistic Style

We have implemented LSI in two domains: in a task-oriented dialogue planner in which two agents discuss furnishing a two-room house [Walker, 1993], and on a segment of the Casablanca script [Walker, Forthcoming, Waters et al., 1995]. LSI is implemented as a program operated under the influence of a human DIRECTOR, who sets the values of the variables that affect agents' emotional dispositions and their choices in linguistic style, and whose stage directions are realized by algorithms for varying how computer characters execute speech acts.

First, the director constructs a SOCIAL STRUCTURE, which consists of a value between 1 and 50, for each pair of characters, for both social distance D and power P. Then, for each speech act in the script or the dialogue, the speaker determines the social distance D between him/herself and the hearer, the power P that the hearer has over him/her, and the value of R for the speech-act type as in Figure 4. Then, by equation 4, s/he determines the value of Θ and uses it to select one of the strategies given in 5.

For example, in the Casablanca application, assume that Laszlo wishes to order two cointreaux from Emil as in 1, given the representations in Figures 2 and 3. For different social structures, Laszlo determines the social distance D between himself and Emil, the power P that Emil has over him, and the value of R for requests as in Figure 4. Then, by equation 4, he determines the value of Θ. If, for example, D is 5, P is 0 and the R for REQUEST-ACT is 45, then Θ's value is 50, and this leads Laszlo to select a direct strategy. Other values lead to one of the other strategies. Below we describe each strategy in turn. Since there are many more realizations of the strategies than can be discussed here, interested readers are referred to [Brown and Levinson, 1987].
4.1 Direct strategies

DIRECT realizations are derived from a SEMANTIC CONTENT that comes from the decomposition step of the speech act, with the default SYNTACTIC FORM for that speech-act type, as specified in 6:

(6) REALIZE-DIRECT-STRATEGY: Realize the content of the decomposition step directly.

For a request such as 1, the default syntactic form is an imperative as in 7:

(7) Bring us two cointreaux.

Direct realizations are divided further in the range of 0 to 50, so that lower values correspond to styles that convey that H has no power (P is low); one way to make a direct request-act express power is given in 8 and illustrated in 9 and 10:

(8) POWER-DIRECT-STRATEGY: Add you must or right away to the imperative.
(9) Bring us two cointreaux right away.
(10) You must bring us two cointreaux.

For speech acts such as INFORM, the default syntactic form is a declarative sentence, and for speech acts which are more specific forms of ACCEPT or REJECT, the default forms are Okay, Yes or No respectively.

4.2 Approval-oriented strategies

Orienting the realization of a speech act to the hearer's desire for approval includes: (1) intensifying interest or attention to H; (2) implying that S and H are cooperators who have the same perspective or desires; or (3) conveying that S and H are part of the same social group, or are friends. One way to do (3) for request-acts is to select the semantic content from the want hearer action effect of the request-act and then use a declarative sentence with a tag question; this strategy is encapsulated in 11, producing forms such as 12:

(11) OPTIMISM-APPROVAL-STRATEGY: S expresses optimism that H will want to do what S wants him to do.
(12) You'd like to bring us two cointreaux, wouldn't you?

A similar strategy of assuming that the effect already holds can be used for INFORM speech acts, as in 13:

(13) TAG-QUESTION-APPROVAL-STRATEGY: Add a tag question to requests and informs.

Another strategy related to social groups is in 14, producing forms such as 15:

(14) GROUP-APPROVAL-STRATEGY: Use in-group address forms such as buddy, mate, honey, doll, my man, etc., depending on the group.
(15) Hey Emil, my man, bring us two cointreaux.

Using in-group address forms conveys that S and H are in the same social group, and that S believes that the relative P between himself and H is small. The form in 15, my man, is implemented by adding an in-group address to the default syntactic form of the decomposition field of the speech act.

For ACCEPT-OFFER or ACCEPT-REQUEST speech acts, approval-oriented forms are those that realize the WANT effect of the offer or request, but in addition emphatically assert H's desire to accept, as in 16 and 17:

(16) I'd be glad to.
(17) With pleasure.

For rejections, approval-oriented forms are those by which H affirms a social relationship with S, as in 18:

(18) I'm sorry, I can't. Normally I'd love to.
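As a minimal sketch of the syntactic side of the direct and approval-oriented strategies just described, the functions below operate on the default imperative form at the string level. In the implementation, these manipulations are performed on feature structures handed to FUF rather than on strings, and the function names and fixed word lists are our own illustrative choices.

def power_direct(imperative, use_right_away=True):
    """POWER-DIRECT-STRATEGY (8): add 'right away' or 'You must' to the imperative."""
    if use_right_away:
        return imperative.rstrip(".") + " right away."
    return "You must " + imperative[0].lower() + imperative[1:]

def group_approval(imperative, hearer_name, address_form="my man"):
    """GROUP-APPROVAL-STRATEGY (14): add an in-group address form."""
    return "Hey %s, %s, %s" % (hearer_name, address_form,
                               imperative[0].lower() + imperative[1:])

# Usage, starting from the default direct form (7):
# power_direct("Bring us two cointreaux.")            -> "Bring us two cointreaux right away."
# power_direct("Bring us two cointreaux.", False)     -> "You must bring us two cointreaux."
# group_approval("Bring us two cointreaux.", "Emil")  -> "Hey Emil, my man, bring us two cointreaux."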
4.3 Autonomy-oriented strategies

Orienting towards H's desire for autonomy includes tactics for: (1) making minimal assumptions about H's wants and desires; (2) leaving H the option not to do the act; and (3) dissociating S and H from the autonomy infringement. Making minimal assumptions about H's desires and wants relies on selecting the semantic content from the plan preconditions and effects. In the examples below, we again instantiate the action for REQUEST-ACT with the cointreaux serving action in Figure 3. One strategy is given in 19, producing a form such as 20:

(19) QUERY-ABILITY-AUTONOMY-STRATEGY: Query the ability precondition of the plan.
(20) Can you bring us two cointreaux?

Another rule is to be pessimistic about H's desires; this can be achieved by the strategy in 21, which produces a form such as 22:

(21) NEGATE-EFFECT-AUTONOMY-STRATEGY: State that the want effect doesn't hold.
(22) You wouldn't want to bring us two cointreaux, would you?

Ways of dissociating S and H from the autonomy infringement include producing an indirect form of a request by the strategy in 23, which produces the form in 24:

(23) ASSERT-WANT-PRECONDITION-AUTONOMY-STRATEGY: State that the want precondition holds.
(24) We'd like two cointreaux.

Another dissociation strategy is in 25, producing the form in 26:

(25) IMPERSONALIZE-ACTOR-AUTONOMY-STRATEGY: Impersonalize who actually performs the requested act.
(26) Let's bring us two cointreaux.

The form in 26 is produced by realizing a let's form of proposal from the effect of the domain plan in Figure 3. (Note that the forms in 20 and 24 are represented directly as decompositions in Litman's definition of a REQUEST-ACT, but are also producible by rule [Allen and Perrault, 1980, Gordon and Lakoff, 1971], as we do here.)

INFORM speech acts also have realizations that are autonomy-oriented. An INFORM speech act can impinge on H's autonomy of believing what s/he wants to believe. One way to orient to H's autonomy is to soften the strength of an assertion by HEDGING it [Prince et al., 1980]. For example, consider Laszlo's utterance of I reserved a table. This can be hedged by embedding the declarative sentence, which is produced from the decomposition step of the plan for an inform, under hedging phrases such as I feel, I believe, It seems, As you may know, I think, I heard, or by adding other hedges such as somehow, sort of, kind of to the verb phrase. This strategy is encapsulated in 27 and produces forms such as 28:

(27) HEDGE-INFORM-STRATEGY: Augment any inform statement with either a pre-sentential or a verbal hedge.
(28) I believe I reserved a table.

An example of hedging in the original script is Ilsa's assertion from Figure 1, repeated in 29:

(29) Victor, I feel somehow we shouldn't stay here.

Hedging the strength of the assertion can also function as an approval-oriented strategy, since it is a simple way to avoid disagreement.
4.4 Off-record strategies

Tactics for going off record are more difficult to implement, since they involve inferences that are difficult to model computationally. There are, however, several simple ways to make a request off-record by constructing hints from plan-based representations. One is to state that the effect of the domain plan does not hold, as in 30, exemplified in 31:

(30) ASSERT-NEGATION-DOMAIN-EFFECT-STRATEGY: Assert that the effect of the domain plan does not hold.
(31) We don't have two cointreaux yet.

Another realization is to state that the domain precondition holds, as encapsulated in 32. For example, Laszlo's utterance of I reserved a table is a statement that the domain precondition for being shown to a table holds. Thus the original realization in the script is an off-record form.

(32) ASSERT-DOMAIN-PRECONDITION-HOLDS-STRATEGY: Assert that the precondition of the domain plan holds.

Another tactic is to lambda abstract the agent role in the plan decomposition of the domain plan and assert that it hasn't yet occurred, as in 33, exemplified by 34, leading to an implicature inference [Hirschberg, 1985]:

(33) REPLACE-AGENT-WITH-NONSPECIFIED-AGENT-STRATEGY: Lambda abstract the agent role from the plan decomposition of the domain plan so that the agent is not specified.
(34) Someone hasn't brought us two cointreaux.

In the current implementation of LSI, autonomy-oriented forms are sometimes substituted for off-record forms in order to provide more variability when characters choose to go off record.
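The autonomy-oriented and off-record strategies above all select their semantic content from fields of the speech-act and domain plan operators. As a hedged sketch of that content selection, and assuming plan operators encoded as in the illustrative dictionaries of section 2 (our own encoding, not the FUF representation used in the implementation), the mapping might look as follows; turning the selected content into an actual sentence is left to the generator.

def select_content(strategy, speech_act, domain_plan):
    """Pick the plan field whose content a strategy realizes (sections 4.3 and 4.4).

    Returns a (field-content, gloss) pair; each gloss is the surface form the
    paper gives for the cointreaux example.
    """
    if strategy == "QUERY-ABILITY-AUTONOMY":             # (19)/(20)
        return speech_act["preconditions"][1], "Can you bring us two cointreaux?"
    if strategy == "NEGATE-EFFECT-AUTONOMY":             # (21)/(22)
        return speech_act["effects"][0], "You wouldn't want to bring us two cointreaux, would you?"
    if strategy == "ASSERT-WANT-PRECONDITION-AUTONOMY":  # (23)/(24)
        return speech_act["preconditions"][0], "We'd like two cointreaux."
    if strategy == "ASSERT-NEGATION-DOMAIN-EFFECT":      # (30)/(31), off-record
        return domain_plan["effects"][0], "We don't have two cointreaux yet."
    raise ValueError("unknown strategy: %s" % strategy)

# Usage with the sketched operators from section 2:
# select_content("QUERY-ABILITY-AUTONOMY", REQUEST_ACT, SERVE)
#   -> ("CANDO(hearer, action)", "Can you bring us two cointreaux?")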
5 Implementing Emotional Dispositions
Once a character's emotional disposition has been set, all of that character's utterances are synthesized with the acoustical correlates of that emotion. We implement this by drawing on Cahn's theory of expressing affect in synthesized speech, and a version of the Affect Editor program developed expressly for interactive theater and simulated conversation [Cahn, 1990]. The Affect Editor computes instructions for a speech synthesizer (so far, the DECtalk 3 and 4.1) so that it produces emotional and expressive synthesized speech. The output is a set of synthesizer instructions; the input is a combination of text and acoustical parameter values. The parameters number seventeen, and control the presence in the speech signal of various aspects of pitch, timing, voice quality and phoneme quality. Because some of the acoustical properties are moderated by linguistic properties of the text, the words in the text must be annotated for part of speech and focus information (expressed as the likelihood of receiving intonational stress, that is, as the inverse of the accessibility of items in memory), and the text itself must be marked with all possible phrase boundaries (based on syntax and grammatical role). The acoustical parameters have numerical values; their adjustment around zero, which represents neutral affect, allows various shadings of emotion, for example from calm to sadness to completely dejected, or from enthusiasm to harsh anger. We use parameter value configurations for the emotional dispositions of Angry, Annoyed, Disgusted, Distraught, Gruff, Pleasant and Sad.
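To give a feel for how emotional dispositions are applied, the sketch below pairs each disposition with a configuration of parameter deviations from zero (neutral). The three parameter names shown are generic stand-ins for the seventeen Affect Editor parameters, and the particular numbers are invented for illustration; only the overall scheme (a per-disposition vector of deviations from neutral, applied to every utterance of the character) reflects what is described above.

# Hypothetical illustration of per-disposition acoustical settings; parameter
# names and values are invented, not the actual Affect Editor parameters.
NEUTRAL = {"pitch_range": 0, "speech_rate": 0, "voice_quality": 0}

DISPOSITIONS = {
    "pleasant":   {"pitch_range": +2, "speech_rate": 0,  "voice_quality": +1},
    "angry":      {"pitch_range": +4, "speech_rate": +3, "voice_quality": -4},
    "distraught": {"pitch_range": +5, "speech_rate": -2, "voice_quality": -2},
    "sad":        {"pitch_range": -3, "speech_rate": -3, "voice_quality": -1},
}

def synthesizer_settings(disposition):
    """Overlay a character's disposition on the neutral configuration."""
    settings = dict(NEUTRAL)
    settings.update(DISPOSITIONS[disposition])
    return settings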
6 Example Runs of Linguistic Style Improvisation
To demonstrate the effect of LSI on a task-oriented dialogue, we discuss several variations of a two-part dialogue in which one agent A requests that the two agents agree to put a particular item of furniture into a room, and the other agent B accepts the request. To demonstrate the effect of LSI on a script, we apply LSI to the first five lines of the Casablanca script in Figure 1, where agent A is Laszlo and agent B is the waiter. The examples are derived from extreme power and social distance parameter settings in order to demonstrate the range of variation that is possible.

A direct/angry speaker with an approval-oriented/pleasant hearer

If the director sets up the social structure so that A's emotional disposition is angry, B's is pleasant, D(A,B) = 0, P(B,A) = 0, D(B,A) = 30, and P(A,B) = 30, then A will choose direct strategies and an angry delivery, and B will choose approval-oriented strategies, delivered in pleasant tones. The application of this social structure to CASABLANCA is in 35:

(35) W: Could I help you?
L: You must take us to a table. I am Victor Laszlo.
W: It's a pleasure.
L: Bring us two cointreaux, right away.
W: I'd be glad to.

The application of this social structure to the furniture task is in 36:

(36) A: Put the red couch in the study.
B: I'd be glad to.

An autonomy-oriented/distraught speaker with a direct/pleasant hearer

If the director sets up the social structure so that A's emotional disposition is distraught, B's is pleasant, D(A,B) = 40, P(B,A) = 40, D(B,A) = 0, and P(A,B) = 0, then A will choose autonomy-oriented strategies delivered as distraught, and B will choose the lower end of direct strategies (rude ones), delivered as pleasant. The application to CASABLANCA is in 37:

(37) W: I will help you.
L: Can you take us to a table? As you may know, I am Victor Laszlo.
W: Yes, if you insist.
L: You wouldn't want to bring us two cointreaux, would you?
W: Yes, if I must.

The configuration in 37 results in portraying Laszlo as a wimp, for a combination of several reasons. First, Laszlo, who is the customer, is orienting to the waiter's autonomy. Second, the distraught intonation is very high pitched and tentative. Finally, the fact that the waiter is rude highlights their differences in linguistic style. The application of this social structure to the furniture task is in 38:

(38) A: Let's put the red couch in the study.
B: If you think so.
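Assuming the threat and choose_strategy sketches given after equation 4 (our own illustrative functions, not the implementation itself), the first configuration above can be checked directly against Figure 4 and the strategy thresholds:

# A's request to B (a REQUEST-ACT) under the first social structure:
theta_A = threat(0, 0, "request-act")        # D(A,B)=0, P(B,A)=0, R=45 -> 45
print(choose_strategy(theta_A))              # DIRECT

# B's acceptance of the request (an ACCEPT-REQUEST):
theta_B = threat(30, 30, "accept-request")   # D(B,A)=30, P(A,B)=30, R=5 -> 65
print(choose_strategy(theta_B))              # APPROVAL-ORIENTED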
7 Discussion

In this paper, we have proposed that linguistic style is a key aspect of character and presented algorithms for Linguistic Style Improvisation by computer characters. Hayes-Roth proposes five lifelike qualities required of improvisation mechanisms for computer characters [Hayes-Roth et al., 1995], p. 5:

1. Interesting variability in a character's interpretation of a given direction on different occasions;
2. Random variability in the way a character performs a specific behavior on different occasions;
3. Idiosyncrasies in the behaviors of different characters;
4. Plausible motivations for a character's behavior;
5. Recognizable emotions associated with a character's behaviors and interactions.

Here we argue that our mechanisms for LSI satisfy Hayes-Roth's criteria. First, we hope that it is obvious that we can produce interesting variability, random variability, and, by using social structure, idiosyncrasies. Second, we believe that because B&L's theory is based on empirical observation of human interaction in many cultures, a theory of LSI based on it satisfies Hayes-Roth's last two criteria. If B&L have captured linguistic universals, as they claim they have, then human users should be able to ascribe plausible motivations for a character's behavior and recognize the emotions associated with a character's behaviors and interactions. The motivations that B&L's theory ascribes are not only descriptive and explanatory, but predictive and generative.

One limitation of our work is that the SOCIAL STRUCTURE does not change during a particular period of interaction between two agents. In future work, we hope to explore a reciprocal feedback loop to social structure, in which, for example, one agent's friendliness results in another agent adjusting their beliefs about social distance. This should result in interpretable and interesting changes in the way two agents treat one another over the course of a social interaction. In future work, we also hope to examine in more detail the relationship of the acoustical expression of emotions to choices about linguistic semantic content and syntactic form.

Finally, we noted that our RANKING OF IMPOSITION R is a simplification of what is actually required. The problem is that R should be a function of both the speech act type and the type of the action in the domain. For example, a REQUEST-ACT that H pass the salt is less of an imposition than a REQUEST-ACT that H give S five dollars. We leave to future work a way of incorporating domain-act based rankings of imposition into R. We conjecture that a function for R could take both the speech act and the domain act as inputs, if the speech act planner could access information about the effort involved in executing the domain act.

We have shown how LSI can be applied to computer characters in both interactive fiction and task-oriented dialogue simulation. In future work, we hope to investigate applying the same mechanisms to characters for personal assistants for spoken language interfaces [Ball and Ling, 1993, Yankelovich et al., 1995]. We believe that the combination of dimensions we have focused on provides a motivated and artistically interesting basis for making choices about linguistic style, that these choices are closely related to human perceptions of character and personality, and that they provide a rich generative source of linguistic behaviors for lifelike computer characters.
8 Acknowledgments
Thanks to Gene Ball, Justine Cassell, Larry Friedlander, Dawn Griesbach, Don Martinelli, Phil Stenton, and Loren Terveen for interesting discussions about linguistic style, to Michael Elhadad and Charles Callaway for consultation on using FUF, to Obed Torres for implementation work on VIVA, and to Wendy Plesniak for artistic inspiration.
References

[Allen, 1979] James F. Allen. A plan-based approach to speech act recognition. Technical report, University of Toronto, 1979.
[Allen and Perrault, 1980] James F. Allen and C. Raymond Perrault. Analyzing intention in utterances. Artificial Intelligence, 15:143–178, 1980.
[Ball and Ling, 1993] J. Eugene Ball and Daniel T. Ling. Natural language processing for a conversational assistant. Technical Report MSR-TR-93-13, Microsoft, 1993.
[Brown and Levinson, 1987] Penelope Brown and Stephen Levinson. Politeness: Some Universals in Language Usage. Cambridge University Press, 1987.
[Cahn, 1990] Janet E. Cahn. The generation of affect in synthesized speech. Journal of the American Voice I/O Society, 8:1–19, July 1990.
[Cohen, 1978] Phillip R. Cohen. On knowing what to say: Planning speech acts. Technical Report 118, Department of Computer Science, University of Toronto, 1978.
[Elhadad, 1992] Michael Elhadad. Using Argumentation to Control Lexical Choice: A Functional Unification Implementation. PhD thesis, Columbia University, 1992.
[Goffman, 1983] Erving Goffman. Forms of Talk. University of Pennsylvania Press, 1983.
[Gordon and Lakoff, 1971] D. Gordon and G. Lakoff. Conversational postulates. In Papers from the 7th Regional Meeting of the Chicago Linguistic Society, pages 63–84. CLS, 1971.
[Hayes-Roth and Brownston, 1994] Barbara Hayes-Roth and Lee Brownston. Multiagent collaboration in directed improvisation. Technical Report KSL 94-69, Knowledge Systems Laboratory, Stanford University, 1994.
[Hayes-Roth et al., 1995] Barbara Hayes-Roth, Lee Brownston, and Erik Sincoff. Directed improvisation by computer characters. Technical Report KSL 95-04, Knowledge Systems Laboratory, Stanford University, 1995.
[Hirschberg, 1985] Julia Hirschberg. A Theory of Scalar Implicature. PhD thesis, Computer and Information Science, University of Pennsylvania, 1985.
[Litman, 1985] Diane Litman. Plan recognition and discourse analysis: An integrated approach for understanding dialogues. Technical Report 170, University of Rochester, 1985.
[Loyall and Bates, 1995] A. Bryan Loyall and Joseph Bates. Behavior-based language generation for believable agents. Technical Report CMU-CS-95-139, School of Computer Science, Carnegie Mellon University, 1995.
[Maes et al., 1994] Pattie Maes, Trevor Darrell, Bruce Blumberg, and Sandy Pentland. Interacting with animated autonomous agents. In AAAI Spring Symposium on Believable Agents, 1994.
[Prince et al., 1980] Ellen Prince, Joel Frader, and Charles Bosk. On hedging in physician-physician discourse. In AAAL Symposium on Applied Linguistics in Medicine, 1980.
[Rich et al., 1994] Charles Rich, Dick Waters, Carol Strohecker, Michal Roth, and Yves Schabes. Demonstration of an interactive multimedia environment. Computer, 27(12):15–22, 1994.
[Searle, 1975] John R. Searle. Indirect speech acts. In P. Cole and J. L. Morgan, editors, Syntax and Semantics III: Speech Acts, pages 59–82. Academic Press, New York, 1975.
[Walker, 1993] Marilyn A. Walker. Informational Redundancy and Resource Bounds in Dialogue. PhD thesis, University of Pennsylvania, 1993.
[Walker, Forthcoming] Marilyn A. Walker. The VIVA virtual theater. Technical report, Mitsubishi Electric Research Labs, forthcoming.
[Waters et al., 1995] Dick Waters, David Anderson, John Barrus, and Joe Marks. The SPLINE scalable platform for interactive environments. Technical report, Mitsubishi Electric Research Labs, 1995.
[Yankelovich et al., 1995] Nicole Yankelovich, Gina-Anne Levow, and Matt Marx. Designing speech acts: Issues in speech user interfaces. In CHI 95 Conference on Human Factors in Computing Systems, 1995.