IPLab-192. Published in Proceedings of Ro-Man '01 (10th IEEE International Workshop on Robot and Human Communication), September 2001.
Task-oriented Dialogue for CERO: a User-centered Approach

Anders Green and Kerstin Severinson-Eklundh
IPLab, Nada, Royal Institute of Technology (KTH)
100 44 Stockholm, Sweden
E-mail: {green, kse}@nada.kth.se
Abstract

We describe a user-centered approach to the process of designing spoken dialogues for commanding robots. Using scenarios and synthetic dialogues, followed by simulated trials with real users, we built a spoken language interface for commanding an office robot. Initial evaluation of the implemented system has raised interesting questions concerning the feedback necessary for interacting with a robot that has no screen. We use a small life-like character, placed upon the robot, that is able to display conversational gestures. We have performed initial evaluations of video-recorded material, which have raised issues concerning low-level feedback and the timing and sequencing of commands in dialogue.
1 Introduction
The purpose of this paper is to discuss issues concerning the dialogue design of a natural language speech interface for task-oriented dialogue with an intelligent service robot. Working together with the Swedish National Labour Market Board (AMS) and the Center for Autonomous Systems (CAS) at KTH, we have been developing a robot with the aim of assisting users with everyday tasks such as fetching and delivering objects in an office environment. The target users suffer from physical impairments that make it difficult to move around and to carry objects [6]. The following simple scenario illustrates the kind of tasks that we intend users to be able to perform using the dialogue interface of the robot. The tasks that the robot performs seem simple from a roboticist's perspective, but our experience is that there are a number of non-trivial issues in modeling the human-robot dialogue, as we will see below.
Delivering: The user Kim wants to send a commented article to one of her colleagues. She summons the robot by turning to it and saying "Robot". The robot is activated by the command word and responds: "How may I help?". Kim says "Deliver". The robot asks for a place where the object should be delivered. When the robot has enough information it starts navigating to the office of Kim's colleague. Upon arrival the robot states its mission and waits. Then, after a short while, it returns to its standby position.

The successful application of such an interface depends on a number of factors, such as the reliability of components for speech recognition and natural language interpretation. Another factor that is essential for the successful application of speech technology is dialogue design.
2 User-centered dialogue design
We employ a user-centered work model for the development of the system, which means that we try to bring users into the process at all stages of development. The work on the dialogue system started with an analysis of the types of tasks that the robot platform would be able to perform. The current system is able to perform a small set of tasks:

– GOTO: navigation from one location to another;
– DELIVER: carry an object whilst navigating;
– FETCH: navigate to a location and address another user in order to get an object and carry it back.

With these tasks as a starting point we have constructed synthetic prototype dialogues that we use to assess the kind of phenomena that need to be handled by the dialogue system. Then we designed a simulated prototype of the robot with which we could perform studies with users in the wizard-of-oz framework (cf. [5, 9]). In the following we will describe the development process in more detail and describe the dialogue system that has been implemented for the working prototype. We will also discuss some challenges we have identified that are related to human-robot dialogue in general.
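To make the notion of a task specification concrete, the sketch below shows one possible representation. This is a minimal Python sketch under our own assumptions; the class, slot names, and completeness rules are illustrative, not the actual CERO data structures.

# Hypothetical representation of the three task types; not the actual
# CERO implementation, only an illustration of slot-based task specs.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                          # "GOTO", "DELIVER" or "FETCH"
    destination: Optional[str] = None  # e.g. "kitchen"
    obj: Optional[str] = None          # object carried (DELIVER/FETCH)

    # Slots that must be filled before the task can be sent to the planner.
    REQUIRED = {
        "GOTO": ("destination",),
        "DELIVER": ("destination", "obj"),
        "FETCH": ("destination", "obj"),
    }

    def missing_slots(self):
        """Unfilled slots; a non-empty result triggers a follow-up question."""
        return [s for s in self.REQUIRED[self.kind] if getattr(self, s) is None]

# "Deliver" with no place given: the dialogue system must ask where.
print(Task(kind="DELIVER", obj="article").missing_slots())   # ['destination']

An incomplete specification of this kind is what later triggers a request for the missing information, as described in Section 3.1.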
2.1 Synthetic dialogue examples
By constructing dialogue examples we were able to get a picture of what type of dialogue capabilities we need to address when building the dialogue system. Others have also used this method as a means of assessing how users envision their interchange with a robot (e.g. [7, 10]).

U: Robot! Deliver this in the kitchen
R: Deliver something in the kitchen?
U: Deliver.
R: Delivering something in the kitchen.

Example 1. A synthetic dialogue.

By analyzing dialogues like the one above (Example 1) we were able to identify problematic cases that must be handled or worked around, without actually building a prototype, and thus avoid spending effort on dialogue designs that have little potential to be successful. However, the use of synthetic dialogues can only take us as far as our own educated guesses allow. Subsequently we wanted to try out how our ideas would be received by real users.
2.2 Wizard-of-oz
By simulating the interface in an agent-based system we try to get a picture of the behavior of the user. This technique is referred to as hi-fi simulation or a wizard-of-oz study, and it has been used for different types of agent-based systems [5, 9]. There can be many objectives for a wizard-of-oz study: one might be to collect data in order to perform structural analysis, or to see what words the system needs to handle. Another kind of interesting data is the way different dialogue acts are used in different situations (e.g. question, answer, require, repair). For the designer, the experience of acting as the wizard is also an important qualitative way of getting a feel for what the dialogue should be like. Example 2 shows a transcript of an excerpt of a wizard dialogue we collected. In the dialogue, the user picks up a magazine, turns to the robot (focus) and performs the action of putting the magazine in the compartment of the robot. The robot is slow in its responses, which makes the user suggest alternative actions to the system.
3 The CERO dialogue system
In human-to-human dialogue the participants are engaged in joint co-operative behavior to achieve a common goal [1, 3]. For complex dialogues, such as iteratively specifying goals for the robot using the dialogue system as an intelligent conversational agent, a richer sense of the notion of dialogue is needed. Thus the natural language dialogue system does not merely react to the spoken commands issued by the user. In this process, grounding is an important way of establishing successful communication [4]. We have constructed the dialogue system with the aim that it should be modular. Currently we use a commercially available dictation system (IBM ViaVoice) to translate the user's commands to text. When audio input is received by the system it is mapped to a possible command using a simple context-free grammar. The result from the speech recognition is passed as a list of words to the semantic analysis, which produces a logical representation of the command. The dialogue handler then tries to map the logical expression to an appropriate action, i.e. respond using synthetic speech or perform a physical system action by sending a goal to the robot's planner. In the following we will focus on how the dialogue is treated within the system.
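As a rough illustration of this processing chain, the flow from recognized words to a dialogue action might look like the sketch below. This is our own sketch only; the toy grammar, function names, and responses are assumptions, not the actual CERO components or the real context-free grammar.

# Hypothetical sketch of the recognition -> semantics -> dialogue chain.
# A real system would call the recognizer (e.g. IBM ViaVoice) here; we
# start from its word-list output.

GRAMMAR = {                      # toy command patterns, not the real CFG
    ("go", "to"): "goto",
    ("deliver",): "deliver",
    ("fetch",): "fetch",
}

def semantic_analysis(words):
    """Map a recognized word list to a logical form, or None on failure."""
    for pattern, predicate in GRAMMAR.items():
        if tuple(words[:len(pattern)]) == pattern:
            return (predicate, tuple(words[len(pattern):]))
    return None

def dialogue_handler(logical_form):
    """Choose an action: speak, or send a goal to the planner."""
    if logical_form is None:
        return ("speak", "Could not understand, please rephrase")
    predicate, args = logical_form
    if args:                                   # complete specification
        return ("plan", (predicate, args))     # goal for the planner
    return ("speak", f"Where should I {predicate}?")

print(dialogue_handler(semantic_analysis("go to the kitchen".split())))
# ('plan', ('goto', ('the', 'kitchen')))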
3.1 Dialogue design
The dialogue handling in the CERO dialogue system is based upon a set of rules which decide what action is appropriate given the input and the current system state. In Figure 1 we see a schematic representation of the actions the system and the user can take in order to move between dialogue states. A state-based design only deals with the ideal flow of dialogue; modeling all cases probably requires principle-based solutions. Incoherent actions by the system or the user might put the system in a state from which alternative actions are necessary in order to progress in the dialogue (e.g. repairs, emergency stop).

U: Robot! ⟨Summon⟩
R: How may I help? ⟨Req-Mission⟩
U: Go to the kitchen ⟨Mission-Spec⟩
R: Go to the kitchen? ⟨Req-Perm⟩
U: Yes ⟨Ack-Mission⟩
(planner receives: task goto kitchen)
R: Going to the kitchen ⟨Report⟩

Example 3. A successful dialogue and the corresponding states.
The dialogue in Example 3 shows a successful dialogue which starts with the user uttering a command that is interpreted as a summon action. The appropriate dialogue action given the current state is to perform a request for a task (Req-Mission). The next user utterance is a task specification, which, if it is complete, as in this case, causes the system to request permission (Req-Perm) to perform the action. The user acknowledges the mission (Ack) and the system sends a request to the planner and reports what the intended action is. If the mission specification had been incomplete, the next action would instead have been to request the missing information (Req-Spec), normally by asking the user.

U: ah-ha  (Looks at task-list)
U: Okay  (Takes up one magazine)
U: Okej  (Turns to R, puts magazine into transport bay)
U: robot deliver this to Maria Svensson at room . . . ⟨pause⟩  (Turns around to task list on table to check for office number)
U: 1628  (Takes up task list, turns to robot)
U: I can ⟨pause⟩ walk (go?) with you.  (Stands up)
U: Are you ready?  (Standing slightly in front of R, looking back and down at R)
R: I am going to Maria  (Robot starts moving)

Example 2: In this wizard-of-oz dialogue the robot is slow to respond to the user's request. (User and robot actions in parentheses.)

Figure 1: Some dialogue states of the CERO system expressed as a schematic finite-state transition network.
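To illustrate the state-based control, a deliberately simplified transition table in the spirit of Figure 1 might look as follows. The state and event names echo Example 3, but the code is our sketch, not the CERO rule set, and the canned responses are illustrative only.

# Minimal finite-state sketch of the dialogue flow in Figure 1.
# Keys: (state, input event) -> (next state, system response).
TRANSITIONS = {
    ("idle",        "summon"):       ("req_mission", "How may I help?"),
    ("req_mission", "mission_spec"): ("req_perm",    "Go to the kitchen?"),
    ("req_perm",    "ack"):          ("acting",      "Going to the kitchen"),
    ("acting",      "done"):         ("idle",        "Mission completed"),
}

def step(state, event):
    """Advance one dialogue step; fall back to a repair if no rule fits."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        # Incoherent input: stay in place and prompt again rather than crash.
        return (state, "Sorry, I did not understand that here.")

state = "idle"
for event in ["summon", "mission_spec", "ack", "done"]:
    state, response = step(state, event)
    print(f"R: {response}   (state -> {state})")

The fallback branch corresponds to the alternative actions (repairs) mentioned above for states that the ideal flow does not cover.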
3.2 Low-level feedback using life-like characters

One of the things we discovered when performing the wizard-of-oz trial was that users said that the robot seemed very "quiet" when it was neither using its speech synthesis nor moving. The users also informed us that they lacked a sense of direction or heading on the robot. In response to this we devised a life-like character, CERO, to support the generation of conversational gestures for giving feedback signals. CERO was placed upon the robot with the twofold purpose of a) providing a visible direction for the robot and b) working as an interface component able to provide low-level feedback as a supplement to the spoken feedback issued by the dialogue system. The CERO character is interleaved with the speech system so that it is capable of issuing both conversational gestures reactively based on system states (e.g. raised amplitude, spoken vowels; see Figure 2) and conventional gestures (e.g. emblems such as nodding or shaking its head). We are currently not (deliberately) providing gestures aimed at expressing emotions. In our design we have considered a classification for system feedback discussed by Brennan and Hulteen in [2]. They proposed a rank of eight categories which can be seen as a measure of the depth of grounding:

State 0: Not attending. System is not listening.
State 1: Attending. System has noticed but not yet interpreted anything.
State 2: Hearing. System has identified some words.
State 3: Parsing. System has received a command but has not yet mapped it to an action.
State 4: Interpreting. System is attempting to interpret the command.
State 5: Intending. System has not acted yet.
State 6: Acting. System attempts to carry out the command.
State 7: Reporting. System reports on the outcome of its actions.

The system of [2] was a spoken language system for call-routing. However, the categorization of system states is general enough to serve as a starting point for the modeling of system feedback. In Figure 2 we have put the system states and behavior of CERO side by side with the categories of Brennan and Hulteen. Using the interface character we are able to provide low-level feedback using visual cues. It has not been possible to find behaviors of the life-like character that can be directly mapped at all levels. Instead the life-like character is supposed to give feedback for which the speech synthesis is not well suited, for instance displaying that speech has been detected or showing that the system is switched on.
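The side-by-side mapping of Figure 2 lends itself to a simple lookup table. The sketch below is our illustration of how grounding states could be dispatched to speech or gesture; the function and state names are assumptions, and the entries mirror Figure 2 rather than the actual CERO code.

# Hypothetical table pairing grounding states [2] with CERO feedback.
# The gestures mirror Figure 2; the dispatch code is our illustration.
FEEDBACK = {
    "attending":    ("gesture", "raise_head"),   # amplitude above threshold
    "hearing":      ("speak",   "The word X is not in the lexicon"),
    "parsing":      ("speak",   "Cannot understand, please rephrase"),
    "interpreting": ("gesture", "small_nods"),
    "intending":    ("speak",   "Go to the kitchen?"),  # request permission
    "acting":       ("move",    None),           # robot motion itself is the cue
    "reporting":    ("speak",   "Going to the kitchen!"),
}

def give_feedback(state):
    """Use visual feedback where speech is ill-suited, speech otherwise."""
    channel, content = FEEDBACK.get(state, ("none", None))
    print(f"[{state}] {channel}: {content}")

for s in ["attending", "parsing", "intending"]:
    give_feedback(s)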
3.3 When things go wrong
The examples above have all been ideal in the sense that the user has specified the mission to the robot using enough information and the correct wording. However, this is not the whole picture. Things might go wrong, making the robot fail to perform its tasks. In the following we are not discussing errors related to low-level speech processing. Below we have listed some causes of failures, or breakdowns, in communication that we are attempting to handle in the system.

Interpretation: when the system could not find a semantic interpretation for the string that the speech recognition module delivered. Ex:

U: Hi there robot there is something which I would like you to do.
R: Could not understand, please rephrase

Task-related: when the user changes his/her mind and tries to get the system to abort the current mission.

U: Get coffee
R: Where is the coffee?
U: No
R: Aborting mission: get

Dialogue breakdowns: when the system gets into a state for which there currently is no rule.

R: Get coffee in the kitchen?
U: Get tea in the kitchen!
R: How may I help?
Planner failures: when the planner cannot perform the plan, or rejects the plan specified by the dialogue system.
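A rule-based handler might dispatch these failure classes roughly as follows. This is a sketch under our own naming, not the actual CERO rules; the responses echo the examples above.

# Hypothetical dispatch for the breakdown classes listed above.
def handle_failure(kind, context=None):
    if kind == "interpretation":
        # No semantic interpretation for the recognized string.
        return "Could not understand, please rephrase"
    if kind == "task_related":
        # User rejected or revised the mission: abort the current task.
        return f"Aborting mission: {context}"
    if kind == "dialogue_breakdown":
        # No rule for the current state: reset to the initial prompt.
        return "How may I help?"
    if kind == "planner":
        # The planner rejected or could not execute the plan.
        return f"Cannot carry out the mission: {context}"
    raise ValueError(f"unknown failure class: {kind}")

print(handle_failure("task_related", "get"))   # Aborting mission: get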
4 Challenges for further design
In order to inform our dialogue design we have collected video-recorded data both from the wizard study and from interactions with the implemented system. These video recordings have been closely analyzed in order to discover issues that pose challenges for the development of the dialogue. A couple of these examples are shown below. The user is a member of the staff and has been trained on the system; he also has a very solid understanding of how the system is supposed to work under ideal circumstances. Thus, the user knows exactly which words and phrases are supposed to work. Yet, when faced with problems leading to breakdowns, even this small sample of dialogues shows that there are regularities at the micro-level which are interesting to study. In the following we will discuss some of the issues causing these breakdowns and also suggest ways of improving the dialogue design. The work presented below is very much work in progress. Nevertheless we feel that it is important to share our observations in order to stimulate a discussion about these issues.
4.1 Sequencing
Ideally the sequencing of contributions is supposed to follow the pattern A-B-A-B. However, in the dialogues collected in the wizard-of-oz study the user does not wait for the system. This is probably related to the system's inability to provide feedback in time. Instead of waiting for the system response, the user suggests different actions to the system or initiates new tasks (see Example 2, above). We also observed this pattern in the dialogues with the trained user, but in a slightly different form. In the dialogue below (Example 4), the user changes amplitude and intonation before trying to rephrase the command or suggest new actions.

U: Cero!
R: Missions: Deliver, Get, Go. Please specify a mission, for instance: Go to Maria's office.
U: Go to Lars office! //5 sec pause//
U: Go to Lars office! //5 sec pause//
U: Cero, go to Lars office! //2 sec pause//
R: Go to Lars office?
U: Yes //5 sec pause//
U: Yes //2 sec pause//
R: Going to Lars office! ⟨robot starts moving⟩
Example 4. A dialogue with a patient user. No feedback is given because the speech recognition does not recognize any commands. In this example the CERO character was disconnected and did not provide low-level feedback based on the amplitude of the sound input.

Category          | CERO action                                | Example action
0. Not attending. | Not applicable (system switched off)       | -
1. Attending.     | On; amplitude above threshold -> Attending | LED: blink; CERO: raise head
2. Hearing.       | Hearing (word out of lexicon)              | "Sorry, cannot understand"; "The word X is not in the lexicon"; CERO: raise head
3. Parsing.       | Parsing: parse errors                      | "Cannot understand, please rephrase"; CERO: shake head
4. Interpreting.  | Interpreting                               | "What is the object?"; CERO: small nods
5. Intending.     | Require permission                         | "Go to the kitchen?"
6. Acting.        | Executing                                  | Robot moves
7. Reporting.     | Report                                     | "Going to the kitchen!"

Figure 2: The categories used in Brennan and Hulteen [2] together with some of the corresponding multimodal actions taken by the CERO system.
4.2 Time-to-response
There also seems to be a pattern concerning the timing between the system failing to respond within a certain time and the issuing of a new command. There are two cases where the time to response is important: i) the time between an utterance and a system response, and ii) the time between the end of an utterance and the moment when the user realizes that the system has failed to respond. In the cases where the system failed to respond to the user's command, he usually waited twice the amount of time before making a new contribution (see Example 4, above). This observation is based on a very small amount of data and could be due to the idiosyncratic behavior of the user rather than reflecting a general pattern of use. However, within this particular set of data the pattern recurs with enough similarity for us to believe it to be a deliberate strategy.
⟨robot in navigation state is moving slowly⟩
R: Put the paper on the tray please!
⟨user puts the paper on the tray⟩
U: Okay //seven seconds pass//
U: Ok //overlap// ⟨robot starts moving⟩

Example 5. A dialogue where the user co-operates with the robot during the second part of a fetch mission.
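Both timing quantities can be tracked with simple timestamps. The sketch below shows one way to flag an overdue response; the class and threshold are our own invention (cf. the 5-second pauses in Example 4), not part of the CERO system.

import time

RESPONSE_DEADLINE = 5.0   # seconds; invented threshold, cf. Example 4

class ResponseTimer:
    """Track the gap between the end of a user utterance and the first
    system response, so overdue responses can be flagged."""

    def __init__(self):
        self.utterance_end = None

    def user_finished(self):
        self.utterance_end = time.monotonic()

    def overdue(self):
        return (self.utterance_end is not None and
                time.monotonic() - self.utterance_end > RESPONSE_DEADLINE)

timer = ResponseTimer()
timer.user_finished()
# If overdue() becomes true before the system speaks, the user is likely
# to rephrase or escalate, as in Example 4.
print(timer.overdue())   # False immediately after the utterance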
4.3 System actions
Both the wizard study and the examples discussed here show that the users closely monitor the behavior of the system. Users interpret even small movements of the robot as signs of the robot's intentions. In the fetch dialogue (Example 5, above) the user monitors the behavior of the robot. It only takes a slight movement of the robot to make the user believe that the robot is about to perform its mission. In the wizard study it was less clear to the users to what extent the robot actually was able to perform a mission. The majority of users tried to accompany the robot while it performed its task. This suggests that a user who has been using the system for a long time will be more apt to interpret movements as acknowledgement and intent to perform the specified mission. It is not clear, however, in what way this may affect the way grounding should be performed. In the simulated wizard-of-oz system the grounding by paraphrasing the mission was limited, and the wizard only expressed acceptance and intention of performing a mission by a simple "Ok" accompanied by the physical movement of the robot. It is possible that the type of task the user wants the system to perform will have effects on the necessity of providing an explicit paraphrase of the intended goal. Currently in this system the act of issuing a complete task specification is always something that causes the robot to move to a location, normally one which is not adjacent to the user. If the task instead were a close navigation task, like moving an inch forward or backward, we might expect that just moving the robot the specified distance would suffice. It is important to clarify which types of actions require explicit grounding and which could be performed more reactively, like other robot behaviors.
5 Concluding remarks
What should be on the agenda for a system behaving like ours? The actual sequencing and interpretation of commands, given that they are parsed by the grammar, is quite robust. The problems are related to the channel of communication: speech. Many of the problems we have discussed would probably be non-issues if we used written commands. This is also reflected in the sparseness of the available literature on spoken language interfaces to robots. Extensive work has been done on written commands (e.g. [8, 10]), but the problems related to spoken commands have attracted less attention from the human-robot interaction viewpoint. Looking at the data we have from users' interaction with the system, our lasting impression is that relevant feedback is crucial for enabling a successful dialogue with a robot. It seems especially important to enable low-level visual feedback of the kind we have sketched above. For this purpose we need to incorporate and extend the notion of grounding states [2] using information from different parts of the system, such as sensors and interpretation components. We also need to investigate what other types of interface modalities could add in terms of giving feedback and making the behavior of the robot more intuitive and transparent to the user.
References

[1] Jens Allwood, Joakim Nivre, and Elisabet Ahlsén. On the semantics and pragmatics of linguistic feedback. Technical Report 64, Gothenburg Papers on Theoretical Linguistics, 1991.

[2] S. E. Brennan and E. Hulteen. Interaction and feedback in a spoken language system: a theoretical framework. Knowledge-Based Systems, 8:143-151, 1995.

[3] Harry C. Bunt. Dynamic interpretation and dialogue theory. In M. M. Taylor, D. G. Bouwhuis, and F. Neel, editors, The Structure of Multimodal Dialogue, volume 2. John Benjamins, 1999.

[4] Herbert H. Clark. Using Language. Cambridge University Press, Cambridge, 1996.

[5] Nils Dahlbäck, Arne Jönsson, and Lars Ahrenberg. Wizard of Oz studies: why and how. Knowledge-Based Systems, 6(4):258-266, 1993.

[6] Anders Green, Helge Hüttenrauch, Mikael Norman, Lars Oestreicher, and Kerstin Severinson-Eklundh. User-centered design for intelligent service robots. In Proceedings of the 9th IEEE International Workshop on Robot and Human Interactive Communication, Osaka, Japan, 2000.

[7] Ingvar Isendor. Mänsklig interaktion med autonom servicerobot [Human interaction with an autonomous service robot]. Master's thesis, Royal Institute of Technology, Department of Numerical Analysis and Computing Science, Interaction and Presentation Laboratory, 1998. Report no. TRITA-NA-E9841, IPLab-148.

[8] T. C. Lueth, T. Laengle, G. Herzog, E. Stopp, and U. Rembold. KANTRA: human-machine interaction for intelligent robots using natural language. In Proceedings of the 3rd IEEE International Workshop on Robot and Human Communication (RO-MAN '94), pages 106-111, Nagoya, Japan, 1994.

[9] David Maulsby, Saul Greenberg, and Richard Mander. Prototyping an intelligent agent through Wizard of Oz. In Proceedings of INTERCHI '93, pages 277-282. ACM, April 1993.

[10] Mark C. Torrance. Natural Communication with Robots. Master's thesis, MIT Department of Electrical Engineering and Computer Science, January 1994.