Look Who’s Talking — Engaging in Interactions with Multiple Users

Matthias Rehm
Multimedia Concepts and Applications, Faculty of Applied Computer Science, University of Augsburg, Germany
[email protected]

Elisabeth André
Multimedia Concepts and Applications, Faculty of Applied Computer Science, University of Augsburg, Germany
[email protected]

ABSTRACT

To create a bond between the user and an agent, it is indispensable to engage the user in the interaction. A crucial engagement behavior is gaze, which serves various functions from feedback to turn taking. In this article, we present a corpus analysis of the gaze behavior of two users interacting with an embodied conversational agent. The setting for this interaction is the Gamble¹ system, which realizes a small game of dice for multiple players.

Categories and Subject Descriptors

H.5.1 [Multimedia Information Systems]: Artificial, augmented, and virtual realities; Evaluation/methodology; H.5.2 [User Interfaces]: Evaluation/methodology

General Terms

Design, Human Factors

Keywords

Embodied Conversational Agents, Non-verbal Behavior, Engagement

1. INTRODUCTION

To start the process of creating a bond between the user and an embodied conversational agent, it is indispensable to draw the user into the interaction with the system, i.e., to engage her in the interaction. According to Sidner et al. ([19]), engagement “is the process by which two (or more) participants establish, maintain and end their perceived connection during interactions they jointly undertake.” Engagement behaviors comprise spoken linguistic behavior, i.e., the ability to communicate by speech; collaborative behavior, i.e., the ability to do something together with others; and gestural behavior, i.e., the ability to engage in multimodal interactions including body movements and eye gaze. All of these behaviors are also essential ingredients of embodied conversational agents (ECA, [6]).

In this article, we concentrate on one specific engagement behavior, namely gaze. Goodwin ([10]) presents an in-depth analysis of the interaction between speaker and hearer which revealed the importance of gaze behavior for engagement in interaction. Gaze serves a number of functions such as giving feedback, directing attention, showing interest, and turn taking.² Nakano and colleagues ([15]) have shown how the user’s engagement can be measured by analyzing his gaze behavior during an interaction with the MIT information kiosk agent MACK. The agent uses gaze as a deictic device as well as a feedback and turn-taking mechanism.

The system we present in this paper, Gamble, has one special feature that makes it more difficult to come up with a gaze model for the agent: in Gamble, two users interact with an ECA at the same time. Thus, Gamble serves as a test bed for multiparty interactions, and in this article we present a corpus study looking into the users’ gaze behaviors. In particular, we are interested in the following questions:

1. Do people apply different attentive behavior patterns in multi-party scenarios when talking to an agent as opposed to talking to a human?


2. Do people apply different attentive behavior patterns in multi-party scenarios when listening to an agent as opposed to listening to a human?

¹ This work is supported by the EU Network of Excellence Humaine (http://emotion-research.net).

² A comment on the notion of gaze behavior is necessary at this point: whereas in psychology the measurement of gaze typically relies on sophisticated eye-tracking devices, conversation analysis generally uses more unobtrusive means such as video recordings to analyze gaze behavior in interactions.

2. MULTIPARTY INTERACTIONS

The standard test bed for ECA systems is built around a dyadic interaction with one user and one agent. Such scenarios allow for a thorough investigation of an ECA’s abilities to interact with and to engage a user. Accordingly, a vast amount of literature can be found on such dyadic interactions. For example, Cassell and colleagues present REA, an agent able to engage in small talk for building trust within a real-estate scenario ([5]).

The multimodal assembly expert MAX cooperates in the task of building a toy plane, employing a rich repertoire of multimodal behaviors such as speech, gestures, and facial expressions ([14]). The Greta agent gives health advice to the user ([17]); the agent is provided with a personality and a social role that allow it to show its emotions or to refrain from showing them, depending on the context in which the conversation takes place.

But little is known about the effects when we move from a dyadic towards a multiparty interaction. We can distinguish three main settings for such interactions:

1. One user and multiple agents: Generally we have a multiagent system in this case, and the user can interact with one or more of the agents, leading to consequences in the virtual world of the agents. In the Crosstalk installation ([18]), e.g., the user specifies some of the agents’ parameters, including their interests and personality. The agents then perform a presentation for the user, interacting amongst each other, and the user can provide feedback on the performance. The VicTec system (e.g., [16]) realizes a multiagent learning environment to teach kids strategies against bullying, relying on a Forum Theatre metaphor. The user is able to interact with one of the agents and suggest plans of action that will influence the storyline. A more direct interaction is realized in the MRE system ([11]). In this training scenario, the user has to resolve the dilemma between his orders and an unforeseen event such as an accident. Several agents populate the world, but the user interacts with only one of them via speech; his actions, however, will influence all of the agents. Finally, a commercial example is The Sims ([2]): again, the parameters of the agents are specified, and then their life in the multiagent world is observed. In contrast to the 1:1 scenarios, the user ideally has more than one potential interaction partner, which poses some interesting challenges. The hierarchy of agents has to be clear, or has to be learned by the user, to ensure a successful interaction. Due to the emergent behavior arising between different agents, the system’s behavior may not be understandable. In the worst case, the agents may leave the user out of the interaction loop, degrading him to a mere spectator of the action.

2. Multiple users and one agent: There are not many systems in which multiple users interact with a single agent. The Mel robot by Sidner and colleagues ([19]) is an example that goes in this direction: the robot primarily interacts with a single user but is able to take onlookers into account by directing its gaze towards them. The Gamble system presented in this paper is another example: two users play a little game of dice together with an agent. Compared to the dyadic or one-user-and-multiple-agents situation, such a multiuser interaction is much less predictable. Although the context is unambiguous (playing a game of dice with set rules and turns), the two users might show, and indeed do show, all kinds of behavior, e.g., sympathizing with or collaborating against the agent, discussing off-topic matters, etc. In the dyadic situation, the only interaction partner is the agent, and thus it is likely that the user will fall back on his normal interaction behavior, his “traffic rules” of interaction.

Figure 1: The setting.

Now, however, there is a real human interaction partner as well as another (inferior) interaction device. Thus, it might well be that the communication behavior is radically different from the dyadic examples.

3. Multiple users and multiple agents: A third setting merges the above scenarios to ultimately end up with a many-to-many interaction. In such a setting, multiple agents interact among each other and with different users, who in turn interact among each other and with different agents. The complexity renders this setting intractable at the moment.

In the rest of this paper, we examine the behavior of two users playing a little game of dice together with an ECA, the Greta agent developed by Catherine Pelachaud and colleagues ([8]). We constrain our analysis to the users’ gaze behaviors, which are generally interpreted as good indicators of the users’ engagement in an interaction. We suppose that humans interact with an agent in a way that roughly resembles interaction with a human. Based on studies by Argyle and Cook ([1]) and Vertegaal and colleagues ([20]), we assume that humans spend more time looking at the agent when listening to it than when talking to it. Following Kendon ([12]), we expect similar behaviors at sentence boundaries as in human-human communication. Nevertheless, the user will probably pay more attention to the other human conversational partner, since the communicative skills of the agent are strongly limited. For instance, the user might not establish frequent gaze contact with the agent since he does not expect it to notice anyway. Furthermore, there is empirical evidence that humans tend to avoid computer-controlled agents when navigating through a virtual 3D environment ([3]), which, however, seems to be in conflict with observations by Colburn and colleagues ([7]), who assume that humans might feel less shy about addressing an agent.

3. GAMBLE: THE GAME

The setting can be seen in Figure 1. We devised an interactive scenario called Gamble in which two users play a simple game of dice (also known as Mexicali) with the agent. To win the game, it is indispensable to lie to the other players and to catch them lying to you. The traditional (not computer-based) version of the game is played with two dice and a cup.

Let’s assume player 1 casts the dice. He inspects the dice without permitting the other players to have a look. The cast is interpreted in the following way: the higher digit always represents the first part of the cast; thus, a 5 and a 2 correspond to a 52. Two equal digits (11, ..., 66) have a higher value than all other casts, and the highest cast of all is 21. Player 1 has to announce his cast, with the constraint that he has to say a higher number than the previous player. For instance, if he casts a 52 but the previous player already announced a 61, player 1 has to say at least 62. Now player 2 has to decide whether to believe the other player’s claim. If he believes it, he has to cast next. Otherwise, the dice are shown, and if player 1 has lied, he has lost this round and has to start a new one. Although the rules are simple, they trigger rich emotional interactions, because catching another player lying, or getting away with such an attempt, creates highly affective situations.
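Since the ordering of casts is central to the game, it can be stated compactly as a ranking function. The following is a minimal sketch of that ordering under the rules above; the helper cast_value is hypothetical and not code from the Gamble system itself.

```python
def cast_value(d1: int, d2: int) -> int:
    """Rank a two-dice cast: 21 beats everything, doubles beat all
    ordinary casts, and ordinary casts compare as two-digit numbers
    with the higher die read first."""
    hi, lo = max(d1, d2), min(d1, d2)
    if (hi, lo) == (2, 1):
        return 1000           # 21 is the highest cast of all
    if hi == lo:
        return 100 + hi       # doubles (11, ..., 66) outrank ordinary casts
    return 10 * hi + lo       # e.g., a 5 and a 2 count as 52

# An announcement is legal only if it outranks the previous one:
assert cast_value(5, 2) == 52
assert cast_value(3, 3) > cast_value(6, 5)   # 33 beats 65
assert cast_value(2, 1) > cast_value(6, 6)   # 21 beats even 66
```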

4. THE MULTIMODAL CORPUS

4.1 Subjects and Design

Subjects were 24 students, all native speakers of German, recruited from the computer science and philosophy faculties of Augsburg University: 12 students from each faculty, in their second and third year of study, 14 male and 10 female. As independent variables, we defined the type of interlocutor (ToI) with the levels Human vs. Agent, and the user’s role during an utterance (RoU) with the levels Speaker vs. Addressee. Both variables were manipulated within subjects: the value of the independent variables depends on the positions of the individual players. If the subject is standing on the right-hand side in Fig. 1, he has to listen to the agent’s announcements and to make announcements to the human player on his left. If the subject is standing on the left-hand side in Fig. 1, she has to listen to the announcements of the human player on her right and to make announcements to the agent. As dependent variables, we defined the length and number of attentive behaviors directed to the conversational partner.

4.2 Procedure

The subjects were randomly divided into 12 teams. At the beginning of the experiment, the subjects were shown a three-minute video of the Gamble system. In addition, they had to participate in a test round to get acquainted with the game, the handling of the PDA, and the Greta agent. After the test round, each team played two rounds of 12 minutes. The participants changed positions after the first round, so that each participant came to play both before and after the agent. We told the subjects that the agent might not be able to conceal its emotions perfectly, but left it open how deceptive behaviors might be detected; consequently, the subjects had no idea which channel of expression to concentrate on. To increase interest in the game, the winner was paid five Euros. We videotaped the interactions and logged the game progress for the analysis.

4.3 Coding Scheme

The videos are coded for utterances, gaze, role in the game, and laughing. Moreover, we coded who the current player was, and, because the subjects sometimes (although rarely) addressed the experimenter, a track was introduced for his utterances. Coding is done in Anvil ([13]), and Figure 2 gives an impression of the annotation board. It contains the following tracks:

Figure 2: The annotation board.

• Current: Indicates the current player, i.e., the player who has to announce her belief/disbelief and who has to cast the dice. Possible elements are Agent, P right, and P left.

• P right: Group of tracks for the right player³ who plays after the agent. The group consists of

  – ExtraLing: For annotating whether the player is laughing. This information will be used to test automatic recognition of emotions from speech. It is planned to include more features in this track in the future.

  – Trl: In this track, the utterances of the player are annotated. Utterances are coded per sentence to minimize the coding effort.

  – Gaze: The head movements of the player are given in this track. They are interpreted as gaze towards different entities in the environment. Possible elements are Agent, P right, P left, PDA, Camera, and Elsewhere. The coding of gaze behaviors was adopted from Nakano et al. ([15]). A gaze is defined in the following way: the current gaze ends and a new one starts at the moment the head starts moving, and the direction of the new gaze is determined at the end of the head movement (a sketch of this segmentation rule follows at the end of this subsection).

  – Role: This is a secondary track that is bound to the track Current. It specifies the role of the player at the current moment in the game. Possible elements are Current, Previous, and Unaffected. Current duplicates the information present in the primary track, Previous indicates that the player is judged by the current player in this turn, and Unaffected indicates that it is the player whose turn is next.

• P left: see P right

• Agent: see P right

³ The right player as seen from the perspective of the coder, not of the agent. This minimizes problems with left/right distinctions, because no cognitive transformations are necessary.

• Other

  – Trl: The utterances of other people, such as the experimenter. This track is very rarely used.

Until now, half of the material has been coded by 24 coders, with each video sequence coded by two people. The inter-rater reliability has been calculated for the gaze behavior and is very good, with a kappa value of around 0.9 for each pair. Thus it can be concluded that the applied coding scheme is very robust and that measuring gaze behavior from video recordings is reliable. Where disagreement is present, a typical pattern is observed: in most cases, one of the coders has overlooked a very short gaze, such as looking up from the PDA to the other player for a fraction of a second. At the moment, the corpus contains 2200 utterances, of which 645 are produced by the agent, 675 by the right player, and 700 by the left player. Moreover, we have 5398 head movements, which are considered as gaze behavior of the users.
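The gaze definition above lends itself to a small algorithmic restatement. The following is a minimal sketch of how head-movement annotations could be turned into gaze segments; the HeadMove and Gaze records and the segment_gazes helper are hypothetical illustrations, not part of the Anvil toolchain.

```python
from dataclasses import dataclass

@dataclass
class HeadMove:
    start: float   # time (s) at which the head starts moving
    end: float     # time (s) at which the head comes to rest
    target: str    # entity the head points at when the movement ends

@dataclass
class Gaze:
    start: float
    end: float
    target: str    # Agent, P right, P left, PDA, Camera, or Elsewhere

def segment_gazes(moves: list[HeadMove], session_end: float) -> list[Gaze]:
    """Apply the coding rule: a gaze ends (and the next one starts) the
    moment the head starts moving; the new gaze's direction is the
    target reached at the end of that movement."""
    gazes = []
    for i, m in enumerate(moves):
        nxt = moves[i + 1].start if i + 1 < len(moves) else session_end
        gazes.append(Gaze(start=m.start, end=nxt, target=m.target))
    return gazes
```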

4.4 Gaze Behavior

Figure 3: Number of gazes.

The goal of the annotation work was twofold. On the one hand, we were interested in collecting information to develop an appropriate gaze model for agents in multi-party scenarios. On the other hand, analyzing the users’ gaze behavior should reveal to what extent they regard the agent as a real game partner worthy of communication. To analyze the data, we need to know where people look in human-human interactions. According to Kendon ([12]), we can distinguish at least four functions of seeking or avoiding to look at the partner in dyadic interactions: (i) providing visual feedback, (ii) regulating the flow of conversation, (iii) communicating emotions and relationships, and (iv) improving concentration by restricting visual input. Concerning the listener, Argyle and Cook ([1]) show that people look nearly twice as much while listening (75%) as while speaking (41%). Compared to dyadic conversations, we know little about gaze behavior in multiparty interactions. Vertegaal and colleagues ([20]) describe a study of looking behavior in a four-party interaction. Subjects looked about 7 times more at the individual they listened to (62%) than at others (9%), and about three times more at the individual they spoke to (40%) than at others (12%). In accordance with Sidner et al. ([19]) and Nakano et al. ([15]), they conclude that gaze, or looking at faces, is an excellent predictor of conversational attention in multiparty conversations. Vertegaal et al. also showed that:

1. People look more at the person they speak or listen to than at others.

2. Listeners in a group can still see that they are being addressed. Each person still receives 1.7 times more gaze than could be expected had he not been addressed.

3. Speakers compensate for divided visual attention by increasing the total amount of their gazes.

4. Listeners gaze more than speakers (1.6 times).

Starting with some basic statistics, Figure 3 shows the number of gazes towards each of the given directions. The total number of gazes is 5398.

Figure 4: Length of gazes.

The players looked roughly as often towards the synthetic agent (27%) as towards the other human player (30%). Judging only by the number of gazes, the agent seems to be as attractive as the other player. The fact that people look slightly more often at the PDA (38%) can be attributed to its use as the interface for casting the dice and for indicating belief or disbelief. If we examine instead the length of the gazes towards each of the given directions (Fig. 4), this interpretation no longer holds. More than half of the time, the players look at the PDA (55%), which seems to bind a lot of their attention. Noteworthy is the fact that the players spend considerably more time (by a factor of about 1.5) looking at the agent (26%) than looking at the other player (17%). Obviously, the type of interlocutor (human or agent) influences the users’ gaze behavior.

The total number of gazes and the length of gazes during the game provide a rough impression of the users’ attention towards human and synthetic interlocutors. In addition, we are interested in the question of whether the users’ gaze behavior depends on their role as a speaker or as an addressee. Because Gamble is a strictly round-based game, the utterances can be categorized into three main categories: announcement, belief, and comment. During announcements, the current player announces his cast, or what he pretends to be his cast, to the next player, who is the addressee of this announcement. The belief category comprises utterances indicating a player’s belief or disbelief of an announcement; hence, the addressee of such an utterance is the previous player, who made the announcement that is subject to the speaker’s evaluation. All other utterances are categorized as comments, which are, strictly speaking, not game-relevant. Among other things, utterances in this category comprise general comments on the game or on the behavior of other players.
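Given gaze segments of the kind sketched in Section 4.3, the contrast between the count shares (Figure 3) and the duration shares (Figure 4) reduces to a simple aggregation. The following is an illustrative sketch, reusing the hypothetical Gaze records from above; it is not the analysis code used for the corpus.

```python
from collections import Counter, defaultdict

def gaze_statistics(gazes):
    """Per gaze target, compute the share of the number of gazes and the
    share of total gaze duration. The two measures can rank targets
    differently, as Figures 3 and 4 show for the agent and the PDA."""
    counts = Counter(g.target for g in gazes)
    durations = defaultdict(float)
    for g in gazes:
        durations[g.target] += g.end - g.start
    n = sum(counts.values())
    total = sum(durations.values())
    return {t: {"count_share": counts[t] / n,
                "duration_share": durations[t] / total}
            for t in counts}
```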

Table 1: Gaze behavior of speaker towards addressee and vice versa.

                      Interlocutor (ToI)
User’s Role (RoU)     Agent    Human    Result
Speaker                9.33     8.75    F(1,23)=0.77
Addressee             31.75    20.17    F(1,23)=23.87

Table 2: Gaze behavior of speaker towards addressee.

Gaze Behavior         Agent    Human    Result
Total                  9.33     8.75    F(1,23)=0.77
begin of utterance     3.25     3.08    F(1,23)=0.23
end of utterance       7.33     7.33    F(1,23)=0.00

For the analysis conducted in this paper, comments are disregarded, since we are mostly interested in conversational utterances with uniquely determined addressees. In our future work, we will consider comments to study gaze behaviors in situations where the addressee cannot be identified with certainty or where several conversational partners are addressed simultaneously.

Table 1 compares the gaze behaviors of human interlocutors in the role of a speaker and in the role of an addressee for game-relevant utterances. A comparison of the speakers’ and addressees’ gaze behaviors only makes sense for human interlocutors, because the agent is driven by a gaze model (which is not the subject of our investigations). We further distinguish whether the interlocutor is an agent or another human user. No significant difference was observed in the gaze behavior of the speaker between the two conditions (i) agent (as interlocutor) and (ii) human (as interlocutor). That is, people did not apply different gaze behaviors when talking to an agent. Turning to the addressee’s gaze behavior gives a different picture: whereas the speaker seems to be uninfluenced by the fact that one of her interaction partners is an agent, the addressee’s gaze behavior shows a strong significant effect (F(1,23)=23.97, p
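As an aside on the statistics reported in Tables 1 and 2: with a single two-level within-subjects factor, an F(1,23) over 24 subjects equals the square of a paired t statistic. A minimal sketch of such a contrast, using randomly generated stand-in values since the per-subject scores are not reported here:

```python
import numpy as np
from scipy import stats

# Hypothetical per-subject mean gaze scores for the addressee role in the
# two ToI conditions (stand-ins; not the actual per-subject data).
rng = np.random.default_rng(0)
gaze_at_agent = rng.normal(31.75, 8.0, size=24)
gaze_at_human = rng.normal(20.17, 8.0, size=24)

# Paired t-test over the 24 subjects; with one two-level within-subjects
# factor, F(1, 23) equals t(23) squared.
t, p = stats.ttest_rel(gaze_at_agent, gaze_at_human)
print(f"F(1,23) = {t**2:.2f}, p = {p:.4g}")
```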