Designing for Learnability in Human–Robot Communication

Anders Green and Kerstin Severinson Eklundh

Abstract—In a future scenario where many devices can be controlled using the voice, easy and intuitive access will be crucial for avoiding cognitive overload when users are faced with many different systems and interaction models. We propose a model for interaction with spoken language interfaces applied to heterogeneous tasks for service robots, based on the idea of using a family of lifelike characters. We argue that we can signal important features of the speech interface by using certain visual cues. The aim is to facilitate learning and transfer between interfaces. We discuss challenges for dialogue design affecting learnability in the light of the speech interface constructed for our full-scale robot prototype CERO.

Index Terms—Dialogue, human–robot interaction, learning, speech interface, ubiquitous computing.
I. INTRODUCTION
Recent developments in service robotics promise a future where robots become part of the electronic landscape of our offices and homes. In addition, the cost of hardware will become low enough to equip any kind of device with a spoken language interface. A general assumption of this work is that careful dialogue design may reduce the complexity of the interactions that the system needs to handle. We will discuss some issues which we have found to be important for dialogue design, starting with the general question of how models of human-to-human conversation can be used to reduce the effort of learning required by users to handle speech interfaces for service robots.

Engaging in dialogue can, from the robot's perspective, be formulated in computational terms as the problem of selecting and executing the next appropriate system task, i.e., a physical action, an utterance, or a combination of the two. Identifying the attributes and conditions that determine what an appropriate action is requires careful design effort. From a software engineering perspective, spoken human–machine dialogue using a conversational style of interaction requires advanced models of discourse and task-related domain knowledge. The contribution of this work concerns how naturalness can be supported by means of dialogue design, rather than an attempt to address the computational aspects of human–robot interaction.

Manuscript received January 16, 2002; revised February 6, 2002. Abstract published on the Internet May 26, 2003. This work was supported by The Swedish Labour Market Board (AMS), The Swedish Foundation for Strategic Research (SSF), and the Swedish Transport and Communication Board (KFB). This paper was presented at the 2001 IEEE International Workshop on Robot and Human Interactive Communication, Bordeaux and Paris, France, September 18–21. The authors are with the KTH Interaction and Presentation Laboratory, KTH Royal Institute of Technology, 10044 Stockholm, Sweden (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TIE.2003.814763
Fig. 1. CERO fetch-and-carry robot with the interface character placed on top of the robot cover.
This paper is organized as follows. First, we discuss learnability issues of speech interfaces in general and the affordances of human conversational agents. Then, we present some usage scenarios and how we have addressed them in the design of the speech interface for our fetch-and-carry robot CERO (Fig. 1; see also [1]). After this, we turn to a more general discussion of an interface metaphor whose central idea is to use what we call a family of lifelike characters. The aim is to reduce the effort spent on learning a new style of interaction as users switch between interfaces with diverse tasks. We conclude the paper by discussing some remaining issues of speech interfaces, based on our analyses of video-recorded dialogues.

II. LEARNABILITY OF SPEECH INTERFACES

One of the arguments that has been made for natural language user interfaces with a conversational style of interaction is that no learning is required. In reality, learning always takes place, since even state-of-the-art spoken language interfaces provide only more or less poor imitations of human behavior and intelligence. Consequently, the user needs to acquire a specific style of interaction for every new speech interface. A key issue for the usability of speech interfaces is that the barrier for first-time use must be as low as possible, preferably low enough to afford first-time success.

Learnability is defined in ISO 9126 as the attributes of a system that bear on the effort required by its users to learn its application [2]. Other factors affecting usability are the ability of users to understand the concept of the system (understandability) and the effort required to control the interface (operability).
There is a wide range of principles that affect learnability in human–machine interaction [3]. The ones most applicable to human–robot communication are predictability, consistency, and familiarity (e.g., affordances). Designing for learnability in this sense means, in practical terms, using design elements that give the user intuitive cues for understanding the functions that the system offers.

One way to facilitate learning is to use a design where functions and features are applied consistently across many different speech interfaces. Rosenfeld et al. [4], [5] discuss a framework called Universal Speech Interfaces (USIs). Its fundamental hypothesis is that once a user has acquired the skills needed to handle one USI-based application, learning speed improves for new applications that use the same interface model. This is achieved by using the same commands and style of interaction (i.e., the "say-and-sound" of the interface) for the types of phenomena recurring in many dialogues (e.g., asking for help, error handling, navigation commands). The practical consequence is that interface components behave and look similar across all applications running within the same system environment.

Another important means of increasing usability and learnability is to guide the users' speech using specific prompts, as discussed by Yankelovich [6]. For example, explicit hints, or help messages, can encourage the user to adopt specific discourse strategies. The use of tutorials for novice users has also been proposed by Kamm et al. [7] as a factor that may have positive effects on the usability of speech interfaces.
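The consistency idea behind USIs can be made concrete with a minimal sketch: a shared core of universal commands that every application inherits, so "help," "repeat," and "cancel" behave identically everywhere while task-specific commands vary. The class and command names below are our own illustrative assumptions, not part of the USI framework itself.

```python
# Sketch of a shared command core in the spirit of Universal Speech
# Interfaces [4], [5]. All names here are illustrative assumptions.

class SpeechApp:
    """Base class: universal commands behave identically in every app."""

    def __init__(self):
        self.last_prompt = ""

    def say(self, text):
        self.last_prompt = text
        print(f"[{type(self).__name__}] {text}")

    def task_commands(self):
        """Application-specific commands; overridden by subclasses."""
        return {}

    def universal_commands(self):
        return {
            "help": lambda: self.say("You can say: " + ", ".join(sorted(
                set(self.universal_commands()) | set(self.task_commands())))),
            "repeat": lambda: print(f"[{type(self).__name__}] {self.last_prompt}"),
            "cancel": lambda: self.say("Cancelled."),
        }

    def handle(self, utterance):
        # Universal commands are merged in last, so 'help' and 'cancel'
        # keep the same say-and-sound in every application.
        commands = {**self.task_commands(), **self.universal_commands()}
        action = commands.get(utterance.lower())
        if action:
            action()
        else:
            self.say("I did not understand. Say 'help' for options.")

class Microwave(SpeechApp):
    def task_commands(self):
        return {"start": lambda: self.say("Cooking started.")}

class Robot(SpeechApp):
    def task_commands(self):
        return {"deliver": lambda: self.say("Where should I deliver it?")}

# The same 'help' works unchanged on both devices:
Microwave().handle("help")
Robot().handle("help")
```

A user who has learned "help" or "cancel" on one application can reuse them unchanged on any other, which is exactly the transfer effect the USI hypothesis predicts.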
III. AFFORDANCES OF CONVERSATIONAL AGENTS

Natural language interaction is one of the most prominent human activities. From our earliest years we learn to recognize what it means to use language in a social setting. We learn and adapt to patterns of interaction that are common to all humans within a certain community. For this purpose, we possess mental models, just as we have mental models of interaction with doorknobs, stoves, and light switches. In the following, we take the view that embodied human-like agents afford human-like communication. The anthropomorphic features of these agents, together with their behavior, signal to users what kinds of natural language interaction they support.

The concept of affordances perhaps needs some further explanation. From a psychological perspective, objects afford different kinds of behavior (e.g., a door handle affords pulling, a button affords pushing) [8], determined by cognitive and cultural factors. Gaver [9] points out that the affordances of an object are related both to the physical attributes of the object to be acted upon and to the way the actor perceives and understands them. Relationships like these can often be described using terms such as "graspability," describing the relation between ball size and hand size, or "climbability," describing the relation between individual stairs and how difficult people perceive them to be to climb.
A common view is that the affordances of the human body, especially the human face, play a special role in language perception and understanding. For instance, facial gestures (e.g., mouth movements) seem to aid the auditory perception process [10]. Thus, the affordances of human-like embodied agents are believed to provide strong cues for interaction, and people seem to find such agents engaging [11].

According to linguistic theories of human-to-human conversational behavior, the participants in a dialogue are engaged in joint cooperative behavior to achieve common goals [12], [13]. During the dialogue, the interlocutors try to establish mutual belief by performing coordinated actions oriented toward a set of goals [14]. Issuing feedback is one important feature of the process of achieving common ground in human dialogue. Feedback signals may be either linguistic, using back-channels and other linguistic structures, or nonverbal, using the body to produce gestures (e.g., nods, shakes, raised eyebrows). Thus, speech and body gestures can be viewed as the two primary modes of producing feedback. The two modalities either reinforce each other by introducing redundancy or add information to one another. Allwood et al. [12] list the following functions of body gestures:

1) adding emotions and attitudes, e.g., smiling to display a friendly attitude;
2) adding illustrations to verbal content, e.g., pointing in a certain direction, or drawing the shape of an object;
3) adding information pertaining to interactive communication management, e.g., nodding to show that the current message has been understood.

The concept of affordances of embodied agents and the models of conversational behavior described above can be combined when designing speech interfaces. Consequently, a speech interface design that uses embodied characters needs an appropriate set of affordances signaling what the communicative behaviors of that interface are. The REA system developed by Cassell et al. [15] is a good example of how conversational behavior can be used in a spoken language interface: the system uses head gestures to manage turn taking, and it raises its eyebrows to provide feedback when a sentence has been understood.

IV. USERS AND DIVERSE SPEECH INTERFACES

We have created a set of scenarios in order to reflect upon how users may encounter speech interfaces in their daily life. Scenarios where a user interacts with a single service robot pose interesting challenges, but the challenges become even greater when considering the case of multiple speech interfaces providing a wide range of services. We are especially interested in what kinds of metaphors could provide viable models for straightforward and intuitive access to devices where the underlying system has diverse tasks.

A. Scenario 1: Interfacing a Single Agent

The following simple scenario illustrates the kind of tasks that we intend the users to be able to perform using the
dialogue interface of a robot. The tasks that the robot performs seem simple from a robotics perspective, but our experience is that there are a number of nontrivial issues in modeling the human–robot dialogue, as we will see below.

Delivering: The user Kim wants to send an annotated article to one of her colleagues. She summons the robot by turning to it and saying "Robot." The robot is activated by the command word and responds: "How may I help?" Kim says "Deliver." The robot asks for a place where the object should be delivered. When the robot has enough information, it starts navigating to the office of Kim's colleague. Upon arrival, the robot states its mission and waits. Then, after a short while, it returns to its standby position.
B. Scenario 2: Diverse Interfaces

The following scenarios each demand a different model of user interaction; they differ in the style of interaction they require.

Microwave oven: Kim comes into the kitchen. She puts a pizza in the microwave oven. After the door is closed, the oven utters "Pizza! Cook time 3 minutes. Push the yes button to start cooking, or specify the time, please." Kim thinks for a while, but does not push the button. Instead she picks up the microphone hanging from the microwave oven and says "2"; the oven lights up, and the pizza is cooked for two minutes.

Refrigerator: When Kim shuts the refrigerator after taking the last can of soda, it makes a beeping noise and says: "Last soda taken, should I place an order?" Kim responds by uttering "yes," speaking into the microphone beneath the screen, and the refrigerator replies "How many?" Kim says "A dozen." The refrigerator beeps and goes quiet.

Recalling different styles of interaction from memory is generally hard for users, and there is clearly a need for consistency in interfaces like the ones exemplified above. Designing for transfer between interfaces is, therefore, an important question that needs to be addressed from different perspectives. At the level of software architectures, it is necessary to provide abstract interfaces in order to find uniform ways of interacting with the user [16]. One of the goals of dialogue design addressing consistency and learnability is to reduce the effort required of users transferring between interfaces [4].

V. SPEECH INTERFACE INFRASTRUCTURES

The infrastructures currently used for speech interfaces are mainly centered around a single application. This means that there is a one-to-one relationship between a speech interface and an application. Common examples of this kind of system are the various voice dictation applications available for personal computers. Most research systems for service robots also have a centralized model using a dedicated speech interface, e.g., the JIJO-2 robot [17].
Fig. 2. (a) Scenario where multiple interface agents allow several users to interact with the system; the system can assume a reasonably fixed spatial relationship between the interface and the user. (b) Scenario where a single device constrains the use of the system to a single user.
In a setting where there are several speech interfaces available, solving diverse tasks, the user will most likely be required to learn many different models of interaction. If we consider how interaction appears from the user's point of view, the kind of infrastructure used to record the user's commands becomes important. We are considering two main architectures for interfacing multiple speech interfaces.

• Direct/Situated: Speaker-independent speech interfaces are directly attached to devices [see Fig. 2(a)].
• Centralized (Voice-Portal): Every user carries a personal interface device [see Fig. 2(b)].

VI. CERO SPEECH INTERFACE

Our full-scale prototype robot CERO (Fig. 1) has been developed with the aim of assisting users with everyday tasks such as fetching and delivering objects in an office environment. The target users suffer from physical impairments that make it difficult for them to move around and to carry lightweight objects. The work model can be characterized as user centered, meaning that we bring users into the process at different stages of development [18]–[20].

The development of the spoken dialogue interface started by analyzing the types of tasks, relevant to the user, that the robot platform would be able to perform. The physical capability of the current system allows the robot to move between points in a predefined map. This means that the robot normally performs tasks based on instructions issued by a single user. In some
cases, other users may also be engaged, by the robot or by the first user, in a collaborative effort to solve the task. With these constraints as a starting point, we have worked with simulation techniques as well as with synthetic dialogues, i.e., dialogues constructed in order to explore possible commands and dialogue patterns.

A. Synthetic Dialogues and Wizard-of-Oz Simulation

By constructing dialogue examples, we were able to explore what types of dialogue capabilities we need to address when building the dialogue system. Others have also used this method as a means of assessing how users envision their interchange with a robot (e.g., [21] and [22]). Example 1 is a prototypical (synthetic) dialogue illustrating an explicit style of grounding a task.

Example 1:
U: Robot!
R: What is the mission?
U: Go to the kitchen
R: Go to the kitchen?
U: Yes
R: Going to the kitchen

However, the use of synthetic dialogues can only take us as far as our own educated guesses allow. Subsequently, we wanted to assess how our ideas would be received by real users. This was done by letting users interact with the robot to solve a task in an office environment. For this purpose, we used a first prototype of our robot equipped with a provisional transport compartment and a loudspeaker playing prerecorded synthetic speech. The robot's movements and responses were controlled by two so-called wizard operators, while the user was led to believe that she was interacting with a fully functional system. This technique is referred to as Hi-Fi or Wizard-of-Oz simulation, and it has been used for different types of agent-based systems [23], [24]. From the data collected, we were able to identify a set of dialogue acts that need to be handled within the system.

B. Dialogue Design

The dialogue handling in the CERO dialogue system is based upon a set of rules which decide what actions are appropriate given the input and the current system state. The interaction between the system and the user can be modeled as a finite-state machine, as sketched below. A state-based design only deals with the ideal flow of dialogue and, thus, there is a need for a general principle for handling errors and breakdowns within the dialogue.

To support learnability, we strive for predictability and consistency in the utterances used by the system. We accomplish this in the CERO dialogue system by using prompts that initially request the user to specify a task, and then prompts that reflect, or repeat, what the system has recognized. Another strategy for providing predictability is to use explicit prompts that limit the range of possible user responses, e.g., yes/no questions or requests that limit the set of possible responses ("Say 'go' to confirm or 'cancel' to stop the mission!").
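The following is a minimal sketch of such state-based dialogue handling with explicit grounding, modeled on Example 1. The state names and transition rules are simplified assumptions of ours, not the actual CERO implementation.

```python
# Sketch of a finite-state dialogue manager with explicit confirmation,
# modeled on Example 1. State names and rules are illustrative assumptions.

IDLE, AWAIT_TASK, CONFIRM = "idle", "await_task", "confirm"

class DialogueManager:
    def __init__(self):
        self.state = IDLE
        self.pending_task = None

    def respond(self, text):
        print("R:", text)

    def handle(self, utterance):
        u = utterance.strip().rstrip("!?.").lower()
        if self.state == IDLE and u == "robot":
            # Summoning word recognized: prompt for a task.
            self.state = AWAIT_TASK
            self.respond("What is the mission?")
        elif self.state == AWAIT_TASK:
            # Reflect the recognized task back to the user: an explicit
            # prompt that narrows the expected response to yes/no.
            self.pending_task = utterance
            self.state = CONFIRM
            self.respond(utterance + "?")
        elif self.state == CONFIRM and u == "yes":
            self.respond("Executing: " + self.pending_task)
            self.state = IDLE
        elif self.state == CONFIRM and u == "no":
            self.state = AWAIT_TASK
            self.respond("What is the mission?")
        # Anything else falls through silently here; a real system needs
        # general rules for errors and breakdowns, as noted above.

# Reproducing the flow of Example 1:
dm = DialogueManager()
for turn in ["Robot!", "Go to the kitchen", "Yes"]:
    print("U:", turn)
    dm.handle(turn)
```

Every task passes through the confirmation state, so the user always receives the same reflecting prompt before the robot acts, which is the predictability and consistency strategy described above.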
Fig. 3. CERO interface robot seen from three different angles.
One system behavior aimed at providing predictability is asking for confirmation: the robot should always ask the user before performing a task like "move to the kitchen." As a means of providing consistency, the system always asks for confirmation in the same manner for every task. The dialogue in Example 1 shows a successful dialogue which starts with the user uttering a command that is interpreted as a summoning action.

C. Low-Level Feedback Using a Lifelike Character

During early user studies using a simulated robot interface, users sometimes said that the robot seemed very "quiet," since it was neither using its speech synthesis nor moving. The users also told us that they lacked a sense of the direction or heading of the robot. In response to this, we devised a lifelike character, CERO (see Figs. 1 and 3), with the twofold purpose of: 1) providing a visible direction for the robot and 2) working as an interface component able to provide low-level feedback gestures as a supplement to the spoken feedback issued by the dialogue system.

The gesture repertoire of the CERO character is limited to a set of basic conversational gestures. This set includes head nods and head shakes for positive and negative feedback. Raising the character's head, which can be seen as a special case of nodding, is intended to convey that the system is attending. To make contact or call the attention of users, the arms may also be used for different waving gestures. The kinds of conversational gestures we are considering have also been used in screen-based conversational interfaces built on the embodied agent metaphor, such as Gandalf and REA [15], [25]. The different gestures of the system are issued reactively based on system states. In the design, we have considered a classification of system feedback discussed by Brennan and Hulteen [26], who proposed a ranking of eight categories, or system states, which can be seen as a measure of the depth of grounding. A sketch of this state-to-gesture coupling is given at the end of this section.

We argue that there is a need for some kind of unifying metaphor or model that enables easy and intuitive transfer between diverse speech interfaces, and that the use of lifelike characters, like CERO, is a candidate for such a model. To support learnability, we suggest that the different interface characters share features like shape, color, size, and gestures for conversational behavior (see [27]). We refer to this metaphor as a family of interface characters, against the background of Wittgenstein's notion of family resemblance [28]. The idea is that one can easily spot who the members of the same family are because they look like each other. It is easy to realize that they have some features in common, but specifying what these features are is generally very hard.
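The reactive coupling between system state and gesture described above can be read as a simple lookup table. Below is a sketch in that spirit; the state names loosely echo Brennan and Hulteen's feedback categories [26], but the particular names and the mapping are our own illustrative assumptions, not CERO's actual repertoire.

```python
# Sketch: reactive low-level feedback, mapping system states to gestures.
# State names loosely follow Brennan and Hulteen's categories [26]; the
# names and the mapping are illustrative assumptions, not CERO's real set.

GESTURE_FOR_STATE = {
    "attending":      "raise_head",  # system is listening for input
    "heard":          "nod_small",   # speech signal detected
    "understood":     "nod",         # utterance mapped to a task
    "not_understood": "shake_head",  # recognition or parsing failed
    "seeking_user":   "wave_arms",   # calling for the user's attention
}

def issue_gesture(state, actuator=print):
    """Trigger the gesture associated with a system state, if any."""
    gesture = GESTURE_FOR_STATE.get(state)
    if gesture is not None:
        actuator(f"gesture: {gesture}")

issue_gesture("understood")  # -> gesture: nod
```

Keeping the mapping in one table is what makes the consistency requirement easy to meet: the same state always yields the same gesture, across tasks and, in the family metaphor, across devices.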
Example 2—In this Wizard-of-Oz dialogue, the robot is slow to respond to the user's request.
U: ah-ha
U: Okay
U: Okay
U: robot deliver this to Maria Svensson at room //pause// 1628
U: I can //pause// walk with you.
U: Are you ready?
R: I am going to Maria

Fig. 4. Here, two characters are placed on the corner of a VCR and a microwave oven. They each have a microphone and an array of LEDs which can give feedback about the signal strength of the user's voice.
The illustrations in Fig. 4, with two similar lifelike characters attached to a VCR and a microwave oven, are intended to show how this could be utilized in a home environment. In a system where the microphone and speaker are placed in the immediate proximity of the service-providing device, together with visual information telling the user that a speech interface is available, the user will most likely infer that the interface is for that particular device. In short, we believe that if we can signal what type of speech interface the user faces by using visible cues, the user will be able to draw on the experience of using one interface, increasing the learnability of other speech interfaces.

VII. CHALLENGES FOR FURTHER DESIGN

Prototypical dialogues like the one in Example 1 are very rare in practical settings. In order to further inform our dialogue design, we have collected and analyzed video-recorded data from user interactions with the working prototype. The purpose of these analyses is to discover issues that pose challenges for further development of the speech interface. One important difference between the interactions with the simulated system and those with the prototype system is that in the wizard study performed earlier the users had no previous experience with the system, whereas the user interacting with the implemented system in the examples below is a member of the research staff who is well acquainted with the way the system works.

A. Sequencing Breakdowns

Ideally, the sequencing of contributions is supposed to follow the pattern A-B-A-B. However, in the dialogues collected in the Wizard-of-Oz study the user does not always wait for the system. Example 2 shows a transcribed excerpt of a Wizard-of-Oz dialogue. In this dialogue, the user picks up a magazine, turns to the robot, and puts the magazine in the compartment of the robot. The robot (controlled by the wizard operator) is slow in its responses. This causes a breakdown in the sequencing, which subsequently makes the user suggest alternative actions to the system. In Example 2, the user first tells the robot that he will accompany it to the goal destination, and then asks for explicit feedback from the system (using the request "Are you ready?").
In the video-recorded transcript shown in Example 3, the system is also slow in its responses, causing a breakdown in the dialogue sequencing. In this case, the reason for the breakdown is that the speech recognizer fails to translate the utterance to text, which causes the system to fail to provide feedback in time. The strategy of the trained user is first to speak louder with a changed intonation before trying to rephrase the command. Example 3 is a dialogue with a patient user.

U: Cero!
R: Missions: Deliver, Get, Go. Please specify a mission, for instance: Go to Maria's office.
U: Go to Lars office! //5 sec pause//
U: Go to Lars office! //5 sec pause//
U: Cero, go to Lars office! //2 sec pause//
R: Go to Lars office?
U: Yes //5 sec pause//
U: Yes //2 sec pause//
R: Going to Lars office!

In terms of learnability, there is a difference between the inexperienced user and the trained user. It seems that the users in the wizard study not only failed to get the appropriate prompts, but also lacked models of how to interact with the system. Consequently, when breakdowns occurred they resorted to general conversational patterns of human-to-human dialogue. In contrast, the trained user in Example 3 has learned ways of interacting with the system and, therefore, has a good idea of what types of utterances the system expects. Consequently, he is able to apply another set of strategies with greater chances of success (e.g., changing tone of voice or using commands known to work) than the more natural style of interaction of the inexperienced user (e.g., eliciting feedback using "ok?" or "robot?"). In both cases above, it seems that relevant feedback from the system, such as answering when the user summons the robot, would reduce the problems described.

B. Time to Response

We have also observed a pattern concerning the timing between the system failing to respond within a certain time and the
user's issuing of a new command. There are two cases where the time to response is important:

1) the time between an utterance and a system response;
2) the time between the end of an utterance and the moment when the user realizes that the system has failed to respond.

It seems that the user in Example 3 has learned how long it takes before the system issues a response (interpreted by the user as a sign of successful input). In Example 3, the 5-s delay between the first and second issuing of the command "Go to Lars office!" can be divided into two parts: the user first waits a couple of seconds for the system response, and then, after noticing the failure to respond, needs another couple of seconds before reissuing the command. The strategy thus seems to be to wait at least the amount of time that it usually takes for the system to respond to a command. This may also explain the fact that the user stops speaking in the middle of the confirming utterance "Ok" on line 4 of Example 4: after noticing that the robot starts moving, there is no need to confirm the action. A sketch of how a system might exploit this timing pattern with immediate low-level feedback follows Example 4. Example 4 is a dialogue where the user cooperates with the robot in the second part of a fetch mission.

Example 4:
R: Put the paper on the tray please!
U: Okay
//7 sec passes//
U: Ok //overlap with robot moving//
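One design implication of these observations is that the system should produce some low-level acknowledgment well inside the interval users have learned to wait, since silence beyond that point triggers a retry. The following is a sketch of such a watchdog; the 2-s threshold and all function names are assumptions for illustration only.

```python
# Sketch: emit a low-level acknowledgment before slow processing completes,
# so the user is not left in silence past their learned waiting threshold.
# The 2-s figure and all names are assumptions, not measured CERO values.

import threading

ACK_DEADLINE_S = 2.0  # assumed upper bound on acceptable silence

def handle_utterance(utterance, process, respond):
    """Run (possibly slow) processing; nod first if it drags on."""
    done = threading.Event()

    def ack_if_still_working():
        # Fires only if processing has not finished within the deadline.
        if not done.is_set():
            respond("gesture: nod")  # 'I heard you' before the full reply

    timer = threading.Timer(ACK_DEADLINE_S, ack_if_still_working)
    timer.start()
    try:
        reply = process(utterance)  # e.g., recognition, parsing, planning
    finally:
        done.set()
        timer.cancel()
    respond(reply)
```

Had the systems in Examples 2 and 3 acknowledged input within such a window, the users would have had less reason to retry or to propose alternative actions.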
C. System Actions

Both the wizard study and the examples discussed here show that the users closely monitor the behavior of the system. Users interpret even small movements of the robot as signs of the robot's intentions. In the fetch dialogue (Example 4, above) the user monitors the behavior of the robot, and it takes only a slight movement of the robot to make the user believe that the robot is about to perform its mission. In the wizard study, the users were not aware of the extent to which the robot was actually able to perform a mission. The majority of users accompanied the robot while it performed its task. This suggests that a user who has been using the system for a long time will be more apt to interpret movements as acknowledgment by the robot that it is going to perform the specified mission.

In this paper, we have mainly addressed issues concerning learnability in first-time use of a system. As users gain more experience with the system, other design strategies and techniques can be employed, e.g., using explicit prompts in the beginning of the dialogue and implicit prompts later, when tasks are repeated (see [6]). The current version of the system always asks for confirmation using an explicit paraphrase of the task, as specified by the user. For an experienced user, a simple "OK" accompanied by the physical movement of the robot, after a command has been given, might suffice.

VIII. CONCLUDING REMARKS

Our aim with this paper has been to stimulate discussion about what kinds of metaphors and design strategies can be used to support learnability of, and access to, situated natural language interfaces. The question of how, and whether, we can enhance learnability by employing lifelike characters in spoken language interfaces needs to be considered carefully. Our findings suggest that the following points are important for situated natural language interfaces.

• A design that enables easy and intuitive transfer between different kinds of interfaces is necessary for practical use.
• Relevant and immediate feedback is crucial for enabling successful dialogue.
• Designing for learnability is essential for first-time users.
• The use of lifelike characters can be an important form factor for achieving a low threshold for first-time users and for users switching between different speech interfaces.

Further research efforts should be spent on gaining knowledge about the principles that affect learnability in human–robot communication. Together with controlled studies of specific dialogue phenomena, it is important to perform long-term studies in realistic settings, studying the patterns of use that emerge from daily encounters with robots.

ACKNOWLEDGMENT
The authors would like to acknowledge E. Espmark for the physical design of CERO and the Centre for Autonomous Systems (CAS) at KTH for their support on robot technology.

REFERENCES

[1] A. Green, H. Hüttenrauch, M. Norman, L. Oestreicher, and K. Severinson-Eklundh, "User-centered design for intelligent service robots," in Proc. 9th IEEE Int. Workshop Robot and Human Interactive Communication, Osaka, Japan, 2000, pp. 161–166.
[2] Information Technology—Software Product Evaluation—Quality Characteristics and Guidelines for Their Use, International Organization for Standardization, ISO/IEC 9126, 1991.
[3] A. Dix, J. Finlay, G. Abowd, and R. Beale, Human-Computer Interaction, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1998.
[4] R. Rosenfeld, D. Olsen, and A. Rudnicky, "Universal speech interfaces," ACM Interactions, vol. VIII, no. 6, pp. 34–44, 2001.
[5] "Universal human-machine speech interface: A white paper," Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CS-00-114, 2000.
[6] N. Yankelovich, "How do users know what to say?," ACM Interactions, vol. 3, no. 6, pp. 32–43, Nov./Dec. 1996.
[7] C. Kamm, D. Litman, and M. A. Walker, "From novice to expert: The effect of tutorials on user expertise with spoken dialogue systems," in Proc. Int. Conf. Spoken Language Processing (ICSLP'98), 1998, pp. 1211–1214.
[8] D. A. Norman, The Design of Everyday Things. Cambridge, MA: MIT Press, 1990.
[9] W. W. Gaver, "Technology affordances," in Proc. CHI'91, New Orleans, LA, Apr. 27–May 2, 1991, pp. 79–84.
[10] D. W. Massaro, Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. Cambridge, MA: MIT Press, 1998.
[11] J. Cassell, T. Bickmore, H. Vilhjalmsson, and H. Yan, "More than just a pretty face: Conversational protocols and the affordances of embodiment," Knowl.-Based Syst., vol. 14, no. 1–2, pp. 55–64, 2001.
[12] J. Allwood, J. Nivre, and E. Ahlsén, "On the semantics and pragmatics of linguistic feedback," Göteborg Univ., Göteborg, Sweden, Tech. Rep. 64, 1991.
[13] H. C. Bunt, "Dynamic interpretation and dialogue theory," in The Structure of Multimodal Dialogue, M. Taylor, D. Bouwhuis, and F. Neel, Eds. Amsterdam, The Netherlands: John Benjamins, 1999, vol. 2.
[14] H. H. Clark, Using Language. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[15] J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjalmsson, and H. Yan, "Embodiment in conversational interfaces: REA," in Proc. CHI'99, 1999, pp. 520–527.
[16] A. Fox, B. Johanson, P. Hanrahan, and T. Winograd, "Integrating information appliances into an interactive workspace," IEEE Comput. Graph. Applicat., vol. 20, pp. 54–65, May/June 2000.
[17] J. Fry, H. Asoh, and T. Matsui, "Natural dialogue with the JIJO-2 office robot," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, vol. 2, 1998, pp. 1278–1283.
[18] H. Hüttenrauch and M. Norman, "PocketCERO—Mobile interfaces for service robots," in Proc. Mobile HCI 2001: Third Int. Workshop Human Computer Interaction With Mobile Devices, Lille, France, Sept. 2001.
[19] H. Hüttenrauch and K. Severinson Eklundh, "Fetch-and-carry with CERO: Observations from a long-term user study with a service robot," in Proc. IEEE Int. Workshop Robot and Human Interactive Communication (ROMAN 2002), Berlin, Germany, Sept. 25–27, 2002, pp. 158–163.
[20] K. Severinson Eklundh, A. Green, and H. Hüttenrauch, "Social and collaborative aspects of interaction with a service robot," Robot. Auton. Syst. (Special Issue on Socially Interactive Robots), vol. 42, no. 3–4, pp. 223–234, 2003.
[21] I. Isendor, "Mänsklig interaktion med autonom servicerobot" (Human interaction with an autonomous service robot), Master's thesis, Interaction and Presentation Lab., Dept. Numer. Anal. Comput. Sci., Royal Inst. Technol., Stockholm, Sweden, 1998.
[22] M. C. Torrance, "Natural communication with robots," Master's thesis, Dept. Elect. Eng. Comput. Sci., Massachusetts Inst. Technol., Cambridge, MA, Jan. 1994.
[23] N. Dahlbäck, A. Jönsson, and L. Ahrenberg, "Wizard of Oz studies—Why and how," Knowl.-Based Syst., vol. 6, no. 4, pp. 258–266, 1993.
[24] D. Maulsby, S. Greenberg, and R. Mander, "Prototyping an intelligent agent through Wizard of Oz," in Proc. ACM INTERCHI'93, Apr. 1993, pp. 277–282.
[25] K. Thorisson, "Gandalf: An embodied humanoid capable of real-time multimodal dialogue with people," in Proc. First ACM Int. Conf. Autonomous Agents, Feb. 1997, pp. 536–537.
[26] S. E. Brennan and E. Hulteen, "Interaction and feedback in a spoken language system: A theoretical framework," Knowl.-Based Syst., vol. 8, pp. 143–151, 1995.
[27] A. Green, "C-roids: Life-like characters for situated natural language user interfaces," in Proc. 10th IEEE Int. Workshop Robot and Human Interactive Communication, Bordeaux/Paris, France, Sept. 2001, pp. 140–145.
[28] L. Wittgenstein, Philosophical Investigations. Oxford, U.K.: Blackwell, 1953.
Anders Green received the M.A. degree in computational linguistics from Göteborg University, Göteborg, Sweden, in 1997. He is currently working toward the Ph.D. degree in human–machine interaction in the Interaction and Presentation Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden. His research is focused on human–robot interaction using multisensory natural language user interfaces. He participates in the Graduate School of Human–Machine Interaction and The Swedish National Graduate School of Language Technology (GSLT).
Kerstin Severinson Eklundh received the Ph.D. degree in communication studies from the University of Linköping, Linköping, Sweden, in 1986. She is a Professor of human–computer interaction at the KTH Royal Institute of Technology, Stockholm, Sweden, where she also heads the Interaction and Presentation Laboratory (IPLab), an interdisciplinary environment for human–computer interaction research and education. One of her current research projects is “Human interaction with intelligent service robots,” which is a collaboration with the Centre for Autonomous Systems. Other research areas at the IPLab include computer-supported cooperative work, language technology, and computer-assisted writing.