Spoken Dialogue Management Using Hierarchical Reinforcement Learning and Dialogue Simulation

Heriberto Cuayáhuitl


Supervisors: Prof. Steve Renals and Dr. Oliver Lemon

PhD Research Proposal
Institute for Communicating and Collaborative Systems
School of Informatics
University of Edinburgh
October 2005

Abstract

Speech-based human-computer interaction faces several difficult challenges in order to be more widely accepted. One of the challenges in spoken dialogue management is to control the dialogue flow (dialogue strategy) in an efficient and natural way. Dialogue strategies designed by humans are prone to errors, labour-intensive and non-portable, making automatic design an attractive alternative. Previous work proposed addressing dialogue strategy design as an optimization problem using the reinforcement learning framework. However, the size of the state space grows exponentially with the number of state variables taken into account, making the task of learning dialogue strategies for large-scale spoken dialogue systems difficult. In addition, learning dialogue strategies from real users is a very expensive and time-consuming process, making learning from simulated dialogues an attractive alternative. To address these research problems, three lines of investigation are proposed. Firstly, to investigate a method to simulate task-oriented human-computer dialogues at the intention level in order to design the dialogue strategy automatically. Secondly, to investigate a metric to evaluate the realism of simulated dialogues. Thirdly, to make a comparative study between hierarchical reinforcement learning methods and reinforcement learning with function approximation, in order to find an effective and efficient method to learn optimal dialogue strategies in large state spaces. Finally, a timeline for the completion of this research is proposed.

Keywords: Spoken dialogue systems, probabilistic dialogue management, human-computer dialogue simulation, user modelling, hidden Markov models, dialogue optimization, dialogue strategies, Markov decision processes, Semi-Markov decision processes, reinforcement learning, hierarchical reinforcement learning, function approximation, dialogue systems evaluation.


Acknowledgements

This research is being sponsored mainly by PROMEP ("PROgrama de MEjoramiento del Profesorado"), part of the Mexican Ministry of Education (http://promep.sep.gob.mx). It is also being sponsored by the Autonomous University of Tlaxcala (www.uatx.mx).


Table of Contents

1 Introduction
  1.1 Spoken Dialogue Systems
  1.2 Motivation
  1.3 Proposal
  1.4 Research Questions
  1.5 Contributions

2 Previous Work
  2.1 Spoken Dialogue Management
  2.2 Human-Computer Dialogue Simulation
  2.3 Reinforcement Learning for Dialogue Management
  2.4 Spoken Dialogue Systems Evaluation

3 Human-Computer Dialogue Simulation Using Hidden Markov Models
  3.1 Introduction
  3.2 Probabilistic Dialogue Simulation
    3.2.1 The System Model
    3.2.2 The User Model
    3.2.3 The Simulation Algorithm
  3.3 Dialogue Similarity
  3.4 Experimental Design
    3.4.1 Training the System and User Model
    3.4.2 Evaluation Metrics
    3.4.3 Experiments and Results
  3.5 Conclusions and Future Directions
  3.6 Proposed Future Work

4 Spoken Dialogue Management Using Hierarchical Reinforcement Learning
  4.1 Introduction
  4.2 The Reinforcement Learning Framework
    4.2.1 Markov Decision Processes
    4.2.2 Semi-Markov Decision Processes
    4.2.3 Reinforcement Learning Methods
  4.3 Hierarchical Reinforcement Learning Methods
    4.3.1 The Options Framework
    4.3.2 The MAXQ Method
    4.3.3 Hierarchies of Abstract Machines
    4.3.4 Comparison of Methods
  4.4 Reinforcement Learning with Function Approximation
  4.5 Experimental Design
    4.5.1 The Agent-Environment
    4.5.2 Evaluation Metrics
    4.5.3 Experiments
  4.6 Proposed Future Work

5 Future Plans
  5.1 Timetable

References

Chapter 1

Introduction

1.1 Spoken Dialogue Systems

Speech-based human-computer interaction faces several difficult challenges in order to be more widely accepted. An important justification for doing research on this topic is the fact that speech is the most efficient and natural form of communication for humans. Currently, human-computer interaction is mainly performed using the following devices: keyboard, mouse and screen. There are many different and attractive reasons for interacting with computers using speech: for instance, in hands-free and eyes-free environments (such as walking or driving a car), in computer applications where providing information is tedious (such as searching/consulting information or booking a service), in mobile environments, or simply to have fun (such as talking with toys). Computer programs supporting interaction with speech are called "Conversational Interfaces"; computer programs supporting different modalities (such as speech, pen, and touch among others) are called "Multimodal Conversational Interfaces"; and computer programs supporting only speech are typically called "Spoken Dialogue Systems".

The main task of a spoken dialogue system is to recognize user intentions and to provide coherent responses until user goals are achieved. Figure 1.1 illustrates the architecture of a basic spoken dialogue system. Briefly, when the user speaks, the "input" components recognize user intentions and provide them to the "dialogue manager" (DM), which consults information from a database and provides an answer to the user through the "output" components. The conversation is basically driven by three levels of communication: speech, words and intentions. Typically, the user provides speech signals using either a microphone or a telephone. The "input" components receive a speech signal and provide user intentions: the speech signal is given to the "Automatic Speech Recognition" (ASR) component, which looks for the words corresponding to the given speech signal and passes them on to the "Natural Language Understanding" (NLU) component, which looks for the intentions corresponding to the given words. The main task of the dialogue manager is to control the flow of the conversation in an effective and natural way by providing the best system intentions given the current user intentions, information from the database and the history of the conversation. The "output" components are the counterpart of the input components, receiving a set of system intentions and providing a speech signal to the user: the system intentions are given to the "Natural Language Generation" (NLG) component, which generates a contextually appropriate response and provides the corresponding words to the "Speech Synthesis" (TTS) component, which provides the corresponding speech signal to the user. In this way, human-computer conversations are composed of user turns and system turns in a finite iterative loop until user goals are achieved.

The process described above is still a challenge for science and engineering.


[Figure 1.1 diagram: the user's speech is processed by the input components (Speech Recognition, then Natural Language Understanding) into intentions for the Spoken Dialogue Manager, which consults a database or the internet and sends system intentions through the output components (Natural Language Generation, then Speech Synthesis) back to the user as speech.]

Figure 1.1: Basic architecture of a spoken dialogue system.
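To make the flow between these components concrete, the following is a minimal Python sketch of the pipeline as duck-typed interfaces; the class and method names (recognize, parse, decide, generate, synthesize) are illustrative assumptions and do not correspond to the API of any existing toolkit.

from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Minimal dialogue context: a history of user/system intentions."""
    history: list = field(default_factory=list)

class SpokenDialogueSystem:
    """Illustrative pipeline: speech -> words -> intentions -> words -> speech."""

    def __init__(self, asr, nlu, dm, nlg, tts, database):
        self.asr, self.nlu, self.dm, self.nlg, self.tts = asr, nlu, dm, nlg, tts
        self.database = database
        self.state = DialogueState()

    def turn(self, user_audio):
        words = self.asr.recognize(user_audio)             # speech -> words
        user_intentions = self.nlu.parse(words)            # words -> intentions
        self.state.history.append(("user", user_intentions))
        info = self.database.query(user_intentions)        # consult the back-end
        system_intentions = self.dm.decide(self.state, user_intentions, info)
        self.state.history.append(("system", system_intentions))
        response_words = self.nlg.generate(system_intentions)
        return self.tts.synthesize(response_words)         # words -> speech

In a real system each of these components is a substantial subsystem in its own right; the point of the sketch is only the speech, word and intention loop described above.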

None of the components described above is perfect, even for simple tasks. The ASR component may provide an incorrect sequence of words due to noise in the channel, noise in the environment or out-of-vocabulary words. The NLU component may provide an incorrect sequence of intentions because a word sequence may be incorrect or may have several different interpretations. The DM component has a very challenging task because the input components may convey incorrect user intentions, so the DM must work under uncertainty; in order to overcome the misunderstandings of the input components and choose the best system intentions, it must use all possible knowledge in the conversation. In addition, the NLG component may provide a word sequence that is unclear to the user. Finally, the TTS component usually provides an unnatural speech signal that may distract the user's attention.

All components in a spoken dialogue system may be simplified for well-defined and simple task-oriented applications. For instance, the ASR may have a small vocabulary, the NLU may only provide a semantic representation based on utterance-level semantic tags, the DM may have a predefined control flow of the conversation, the NLG may have predefined answers, and finally, pre-recorded prompts may be used instead of TTS. However, even in simple applications, sophisticated components are required for successful conversations in real environments. For instance, robust ASR and DM may significantly improve the performance of spoken dialogue systems. This research is focused on dialogue management for large-scale spoken dialogue systems, where the user may have several different goals in a single conversation (e.g., some goals in the travel domain are: book a multi-leg flight, book a hotel and rent a car).

1.2 Motivation

The main motivation behind this research is the fact that the automatic design of spoken dialogue managers remains problematic, even for simple applications. Dialogue design (the control flow of the conversation) is typically hand-crafted by system developers, based on their intuition about proper dialogue flow. There are at least three motivations for automating dialogue design: a) it is a time-consuming process, b) the difficulty increases with the dialogue complexity, and c) there may be some design issues that escape system developers. Automating dialogue design using the reinforcement learning framework, based on learning by interaction within an environment, is a current research topic. However, the size of the state space grows exponentially with the number of variables taken into account, making the task of learning optimal dialogue strategies¹ in large state spaces difficult.

¹In this document the terms dialogue design, dialogue strategy and dialogue policy are used interchangeably.


Another related problem is how to learn the dialogue strategy automatically. Spoken dialogue managers are typically optimized and evaluated with dialogues collected from lengthy cycles of trials with human subjects. But training optimal dialogue strategies usually requires many dialogues to derive an optimal policy, and learning from conversations with real users may be impractical. An alternative is to use simulated dialogues. For dialogue modelling, simulation at the intention level is the most convenient, since the effects of recognition and understanding errors can be modelled and the intricacies of natural language generation can be avoided (Young, 2000). Simulated dialogues must incorporate all possible outcomes that may occur in real environments, including system and user errors. Previous work has addressed this topic by learning a user model in order to plug it into a spoken dialogue system. However, little attention has been given to methods for simulating both sides of the conversation (system and user). Furthermore, there has been a lack of research on evaluating the realism of simulated dialogues. Some potential uses of simulating the system and user are: a) to acquire knowledge from both entities, b) to learn optimal dialogue strategies in the absence of a real spoken dialogue system and real users, and c) to evaluate spoken dialogue systems in early stages of development.

The primary aims of my research proposal are:

1. To investigate a method for simulating task-oriented human-computer dialogues on both sides of the conversation (system and users). This is important because small dialogue data sets may be expanded for purposes of optimization and evaluation, and because knowledge from both entities may be acquired in order to improve the interaction.

2. To investigate a metric for evaluating the realism of simulated dialogues. This is important because different system/user models may be compared in a more reliable way, so that the best models may be used for simulation.

3. To investigate an efficient reinforcement learning method for learning optimal dialogue strategies in large state spaces. This is beneficial for optimizing dialogue strategies for dialogue managers with many different variables, which is useful not only for spoken dialogue systems but also for multimodal conversational interfaces, where the number of variables increases.

1.3 Proposal

I propose a two-stage research project: the first stage focuses on human-computer dialogue simulation and the second stage on learning optimal dialogue strategies.

In the first stage I propose to investigate a novel probabilistic method to simulate task-oriented human-computer conversations at the intention level using hidden Markov models with rich structures. In addition, for evaluating the simulation method I propose to investigate a novel metric to measure the realism of the simulated dialogues, based on a probabilistic approach to dialogue similarity and potentially combined with other metrics previously proposed in the natural language processing field that may correlate with dialogue realism.

For the second stage I propose to perform a comparative study of three hierarchical reinforcement learning methods proposed in the machine learning field. This proposal is motivated by the fact that hierarchical methods have been shown to learn faster and with less training data, and have not yet been applied to spoken dialogue systems with large state spaces. In addition, I propose to compare the hierarchical methods against a function approximation method, which is another alternative for addressing optimization in large state spaces. The first stage will be used as a


valuable resource for learning dialogue strategies automatically, and the second stage will be a valuable resource for learning dialogue strategies for large-scale spoken dialogue systems.

Finally, for both stages, I propose to perform experiments using the 2001 DARPA Communicator corpora, which are annotated with the DATE annotation scheme and consist of ∼1.2K dialogues in the domain of travel information (multi-leg flight booking, hotel booking and car rental). If time and resources permit, I will perform experiments using a real spoken dialogue system and real users within the same domain.

1.4 Research Questions

In this research I aim to provide answers to the following questions:

1. How can a small corpus of dialogue data be expanded with more varied simulated conversations? There are two challenges in this question: a) how to predict system behaviour, and b) how to predict user behaviour. Because probabilistic models will be used for this purpose, incoherent dialogues may be generated; therefore, the exploitation of knowledge from different variables must be taken into account. Preliminary experiments suggest promising results, and answers to this question will be driven by the following assumption: the more knowledge the models have, the more accurate they are. An important problem to face in this question is data sparsity, due to the small amount of data used to train the models.

2. How can the realism of simulated human-computer dialogues be evaluated? This question is difficult because it is not known whether simulated dialogues may occur in real environments. Nevertheless, the answers to this question will be driven by the following assumption: the more similar the simulated dialogues are to the real ones, the more realistic they are. Another possible direction is to evaluate the utility of simulated dialogues under the following assumption: the more useful (for optimization or evaluation) they are, the more accurate they are.

3. How can optimal dialogue strategies be learnt for spoken dialogue systems with large state spaces? There are two research fields to address for this question: a) hierarchical reinforcement learning, and b) reinforcement learning with function approximation. There are several methods in each field, and potential methods will be selected in order to perform a comparative study, which may reveal the potential applicability of specific methods to spoken dialogue systems. Experiments on this topic must be designed carefully in order to evaluate performance, computational cost and portability to other domains.

1.5 Contributions

This research intends to advance the current knowledge in spoken dialogue systems according to the following expected contributions:

1. A method to generate human-computer simulated dialogues at the intention level.
2. A metric to evaluate the realism of human-computer simulated dialogues.
3. A comparative study of reinforcement learning methods to learn optimal dialogue strategies in large state spaces.

Chapter 2

Previous Work

This chapter summarizes previous work that supports the proposal described in this document. The topics investigated and related to this research are: spoken dialogue management, human-computer dialogue simulation, reinforcement learning for dialogue management and evaluation of spoken dialogue systems. A brief description of each topic is provided, with references to relevant related work. Finally, at the end of each section a list of research gaps is provided.

2.1 Spoken Dialogue Management

The main task of a spoken dialogue manager is to control the flow of the conversation between the user and the system. More specifically, a dialogue manager in task-oriented applications must gather information from the user (e.g., "Where do you want to go?"), possibly clarifying information explicitly (e.g., "Did you say London?") or implicitly (e.g., "A flight to London. For what date?"), and resolve ambiguities that arise due to recognition errors (e.g., "Did you say Boston or London?") or incomplete specifications (e.g., "On what day would you like to travel?"). In addition, the dialogue manager must guide the user by suggesting subsequent subgoals (e.g., "Would you like me to summarize your trip?"), offer assistance or clarification upon request (e.g., "Try asking for flights between two major cities."), provide alternatives when the information is not available (e.g., "I couldn't find any flights on United. I have 2 Alaska Airlines flights ..."), provide additional constraints (e.g., "I found ten flights, do you have a preferred airline?"), and control the degree of initiative, such as system-initiative (e.g., "What city are you leaving from?") or mixed-initiative (e.g., "How may I help you?"). The dialogue manager can also influence other system components in order to make dynamic adjustments in the system, such as to the vocabulary, language models or grammars. In general, the goal of a spoken dialogue manager is to take an active role in directing the conversation towards a successful, effective and natural experience. However, there is a trade-off between increasing user flexibility and increasing system understanding accuracy (Zue & Glass, 2000).

There are several architectures for designing spoken dialogue managers, which can be broadly classified as follows:

• State-Based: In this architecture the dialogue structure is represented in the form of a network, where every node represents a question and the transitions between nodes represent all the possible dialogues (McTear, 1998).

• Frame-Based: This architecture is based on frames (forms) that have to be filled by the user, where each frame contains slots that guide the user through the dialogue. Here the user is free to take the initiative in the dialogue (Goddeau et al., 1996) (Chu-Carroll, 1999) (Pieraccini et al., 2001).


• Agenda-Based: This architecture is based on the frame-based one, but builds a more complex data structure (using dynamic trees) for conversations in more complex domains (Rudnicky & Xu, 1999) (Wei & Rudnicky, 2000) (Bohus & Rudnicky, 2003).

• Agent-Based: (Allen et al., 2001a) (Allen et al., 2001b) propose an architecture where the components of a dialogue system are divided into three areas of functionality: interpretation, generation and behaviour. Each area consists of a general control module, and the modules communicate with each other, sharing information and messages. The agent mainly responsible for controlling the flow of the conversation is the behavioural agent. Another approach using collaborative agents is COLLAGEN, which mimics the relationships between two humans (agents) collaborating on a task involving a shared artifact. The collaborative manager supports mixed-initiative by interpreting dialogue acts of the tasks and goals of the agents (Rich & Sidner, 1998).

• Information State Approach: This architecture is based on the notion of information state, which represents information to distinguish dialogues, to represent previous actions and to motivate future actions (Larsson & Traum, 2000).

An important dialogue data collection from several different systems using different features is the DARPA Communicator corpora (available through the LDC) (Walker et al., 2002). This data was collected with the aim of supporting rapid development of spoken dialogue systems with advanced conversational capabilities. This valuable research resource collected dialogues from eight different systems: AT&T (Levin et al., 2000b), BBN (Stallard, 2000), CMU (Rudnicky et al., 1999), COL (Pellom et al., 1999) (Pellom et al., 2001), IBM (Erdoğan, 2001), LUC (Potamianos et al., 2000), MIT (Seneff & Polifroni, 2000) and SRI. Most of the systems adopt the frame-based architecture with some variations in the dialogue strategies; only the CMU Communicator adopted the agenda-based architecture.

Figure 2.1 illustrates a dialogue from the MIT Communicator system. We can observe that manual design of this kind of dialogue is a labour-intensive task due to the many possible strategies that must be taken into account. In addition, manual design is prone to errors: for instance, in the second leg the system implicitly confirmed the user's departure from "Zurich", but in the third leg the ASR recognizes a new departure city, "Newark". Here the dialogue manager assumed that the user had changed his mind, which may be a reasonable action. The system then explicitly confirms this new assumption, and the recognizer provides a wrong sequence of words due to a mixture of in-vocabulary and out-of-vocabulary words ("i will what about oregon"). Next, the system assumes a correct confirmation and the user is annoyed with this new departure city. At this point, a potentially better strategy (considering the dialogue history) might be to ignore this new assumption, possibly saving system turns or simply minimizing user annoyance. Thus there will always be some design issues that escape system developers, which makes automatic design of dialogue strategies an attractive option.
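To make the frame-based style of dialogue management discussed above concrete, the sketch below shows a minimal hand-crafted slot-filling strategy with explicit confirmation of low-confidence values; the slot names and the confidence threshold are invented for illustration and are not taken from any of the cited systems.

SLOTS = ["departure_city", "destination_city", "date"]   # illustrative frame
CONFIRM_THRESHOLD = 0.7                                   # illustrative value

def next_system_action(frame, last_hypothesis):
    """Hand-crafted strategy: fill each slot, confirming low-confidence values.

    frame: dict slot -> (value, confidence) for slots accepted so far.
    last_hypothesis: dict slot -> (value, confidence) from the latest user turn.
    """
    # Accept high-confidence values; confirm the first low-confidence one.
    for slot, (value, confidence) in last_hypothesis.items():
        if confidence >= CONFIRM_THRESHOLD:
            frame[slot] = (value, confidence)
        else:
            return ("explicit_confirm", slot, value)      # "Did you say ...?"
    # Ask for the first still-empty slot.
    for slot in SLOTS:
        if slot not in frame:
            return ("request", slot)                      # "Where do you want to go?"
    return ("query_database_and_present", None)           # all slots filled

Strategies of this kind encode the designer's intuition directly in code, which is exactly what makes them labour-intensive to extend and prone to the kind of design oversights illustrated in Figure 2.1.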
Despite the previous research efforts in spoken dialogue management, more investigations are needed in order to build systems supporting successful, efficient and natural conversations. The following are some potential research gaps in this area:

• To investigate methods for optimizing dialogue strategies exploiting system and users' knowledge
• To balance the trade-off between user flexibility and system understanding
• To separate the domain-independent and domain-dependent system features
• To explore how dialogue managers can be abstracted and reused across domains
• To compare the architectures in domains of different complexity
• To extend the architectures for more complex tasks


SYS: how may i help you?
USR: thank you i want to fly from kansas city missouri on october third in the morning to denver colorado [thank you i want to fly from kansas city missouri on october third in the morning to denver colorado]
SYS: okay, flights from kansas city to denver departing 5 a.m. and noon on Wednesday october third...
USR: what else is available please [what are available please]
SYS: i have a nonstop united flight leaving at 6:20 a. m. and arriving at 7:01 a. m would that work?
USR: please [nowhere]
SYS: i have a nonstop united flight leaving at 6:20 a. m. and arriving at 7:01 a. m would that work?
USR: no i want to fly out later than six twenty [no i want to fly later in the morning]
SYS: okay, flights from kansas city to denver departing after 6:20 a.m. on Wednesday october third... would that work? i also have a united flight leaving at 7:40 a. m.
USR: i will take the united leaving at seven forty [i will take the united leaving at seven forty]
SYS: united flight 1631 has been added to your itinerary. you depart... and arrive.... what date will you be returning on?

USR: i want to leave denver on october fourth in the afternoon and i want to fly from there to zurich (switzer-) switzerland [i want to leave denver on october four in the afternoon and i want to fly american zurich switzerland switzerland]
SYS: okay, flights from denver to kansas city departing between noon and 6:00 pm on thursday october fourth. one moment please. i have flights on...
USR: no [no]
SYS: i have a nonstop united flight leaving at 12:38 p. m. and arriving at 3:13 pm? would that work?
USR: no i am flying from denver to zurich switzerland [no i am flying from denver to zurich switzerland]
SYS: okay from denver to zurich. what date will you be travelling?
USR: october four [october four]
SYS: okay, from denver to zurich on thursday october fourth. can you provide the approximate departure time or airline?
USR: i'd like to leave in mid afternoon [i would like to leave in the afternoon]
SYS: okay, flights from denver to zurich departing between noon and 6:00 pm on thursday october fourth…
USR: the american flight please [the american flight please]
SYS: american flight 1078 connecting to american flight 170 in chicago o'hare ... what date will you be returning on?

USR: i'm going to return to kansas on october fifteen [sunday returning in atlanta at four eighteen]
SYS: okay, flights from zurich to kansas city departing at 4:18 pm on sunday october seventh… would that work?
USR: (-o) i want to leave zurich in the late morning please [i want to leave newark in the late morning please]
SYS: i heard you say you want to depart from newark. is that correct? if you need help, you can say, for example, what cities do you know in oregon.
USR: i i don't want to go to oregon [i will what about oregon]
SYS: okay, flights from newark to kansas city departing between 10:00 am and noon... would that work?
USR: no i am not in newark i'm flying from zurich z. u. r. i c. h. switzerland remember you booked me to go to zurich [no i am not in newark on friday from new york and you are i see switzerland remember you booked me to voyager]
SYS: sorry, i'm not sure what you said.
USR: i want to leave zurich switzerland on on october fifteen in the late morning and fly back to kansas city [i want to leave zurich switzerland on october fifteen in the late morning and fly back to kansas city]
SYS: okay, flights from zurich to kansas city departing between 10:00 am and noon on monday october fifteenth…
USR: ...

Figure 2.1: Sample dialogue from the MIT Communicator system illustrating a three-leg flight booking; the three legs are separated by blank lines in the transcript above. The tags SYS and USR correspond to system turns and user turns. The word sequences in square brackets are the speech recognition results.



2.2 Human-Computer Dialogue Simulation

The task of human-computer dialogue simulation consists of generating artificial conversations between a spoken dialogue system and a user. The aim is to automate the optimization and evaluation of spoken dialogue systems. Dialogue simulation may be performed at different levels of communication: the speech and word levels are useful for ASR systems in order to train acoustic and language models, and the intention level is useful for dialogue managers in order to train dialogue strategies. The following factors motivate the use of dialogue simulation for dialogue management:

• Training optimal dialogue strategies requires many dialogues to derive an optimal policy, and learning from real conversations may be impractical since it is expensive, labour-intensive and time-consuming. An alternative is to use simulated dialogues.

• Simulated dialogues can be used to evaluate spoken dialogue systems at early stages of development, and potentially to discover errors that may help to reduce expensive and lengthy trials with human subjects.

• When a dialogue manager is updated, the previous optimization is no longer valid and another optimization is needed. At this point, simulated dialogues may help to speed up the development and deployment of optimized spoken dialogue systems.

Several research efforts have been undertaken in this area, and the following dimensions are used to summarize such investigations:

1. Approach: Whilst some approaches are rule-based (Chung, 2004) (Lin & Lee et al., 2001) (López-Cózar et al., 2003), others are corpus-based (Eckert et al., 1997) (Scheffler & Young, 2000) (Scheffler & Young, 2001) (Georgila et al., 2005b). The advantage of corpus-based methods is that they minimize the portability problem (lack of expertise and high development costs) (Sutton et al., 1996). Rule-based methods tend to be ad hoc to the task and domain.

2. Communication Level: Most of the investigations are intention-based, and some use the speech and word levels, depending on the purposes of the simulated dialogues. The investigations based on intentions have the purpose of optimizing dialogue strategies (Eckert et al., 1997) (Scheffler & Young, 2000) (Levin et al., 2000a) (Scheffler & Young, 2001) (Pietquin & Renals, 2002) (Georgila et al., 2005b). (López-Cózar et al., 2003) use the speech and word levels in order to evaluate different speech recognition front-ends and dialogue strategies. (Chung, 2004) uses the speech and word levels in order to train speech recognition and understanding components.

3. Evaluation: A few investigations attempt to evaluate the simulated dialogues. (Eckert et al., 1997) (Scheffler & Young, 2000) (Scheffler & Young, 2001) use the average number of turns. (Schatzmann et al., 2005a) use three dimensions: high-level features (dialogue and turn lengths), dialogue style (speech-act frequency; proportion of goal-directed actions, grounding, formalities and unrecognized actions; proportion of information provided, reprovided, requested and rerequested), and dialogue efficiency (goal completion rates and times). Finally, (Georgila et al., 2005b) use perplexity and a performance function based on filled slots, confirmed slots, and number of actions performed.


4. Degree of Simulation: Most of the investigations simulate user behaviour, and some of them model speech recognition errors in order to corrupt users' responses (Chung, 2004) (Lin & Lee et al., 2001) (Scheffler & Young, 2000) (Scheffler & Young, 2001) (Pietquin & Renals, 2002). (Georgila et al., 2005b) simulate a system and user model using n-gram models in order to optimize dialogue strategies.

5. Domain: Dialogue simulation has been investigated in: restaurants (Chung, 2004), air travel information (Eckert et al., 1997) (Levin et al., 2000a) (Schatzmann et al., 2005a) (Georgila et al., 2005b), banking (Scheffler & Young, 2000), cinema (Scheffler & Young, 2001), computer purchasing (Pietquin & Renals, 2002), and fast food (López-Cózar et al., 2003).

This paragraph summarizes results in dialogue simulation. (Eckert et al., 1997) found that it is possible to identify shortcomings by investigating compressed statistics: for instance, several bugs were found by looking at the distributions of the dialogue length, and a more adequate dialogue strategy was found by examining the user type "patient"; after integrating such changes into the system, a better overall performance was observed. (Scheffler & Young, 2000) found that dialogue simulation can identify flaws in the dialogue strategy that escaped the developers' attention. (Lin & Lee et al., 2001) conclude that dialogue simulation is very useful for the analysis and design of spoken dialogue systems, although online tests, corpus-based analysis and user surveys can always follow after the system is in operation. (López-Cózar et al., 2003) conclude that, despite using a simple user model, the simulations were useful for evaluating a dialogue system using different recognition front-ends and different dialogue strategies to handle user confirmations. (Chung, 2004) concludes that the use of a simulator greatly facilitates the development of dialogue systems, due to the availability of thousands of artificial dialogues; even relatively restricted synthetic dialogues have been shown to accelerate development. (Schatzmann et al., 2005a) show that goal-based user models outperform a bigram baseline, and qualitative evaluations reveal that simple statistical metrics are still sufficient to discern synthetic dialogues from real ones. (Georgila et al., 2005b) show that user models based on linear feature combination and n-grams produce simulated dialogues of similar quality. Finally, (Schatzmann et al., 2005b) conclude that the development of realistic user models is a priority for future research on simulation-based learning of dialogue strategies.

The aim of conversational interfaces is to communicate with humans in a natural and efficient way. Therefore, it is fundamental to acquire and exploit knowledge about the system and users (Kass & Finin, 1988) (Zukerman & Litman, 2001). The area of user modelling plays a crucial role for such a purpose, with the aim of acquiring knowledge about the user in order to improve the interaction. (Thompson et al., 2004) provides a classification of user models: stereotypical vs. individual users, hand-crafted vs. learnt, short-term vs. long-term, probabilistic vs. absolute, direct-feedback vs. unobtrusive, and content-based vs. collaborative. (Webb et al., 2001) and (Zukerman & Albrecht, 2001) promote machine learning techniques for user modelling as promising. However, a number of challenges must be faced: large data sets, labelled data, concept drift (not ad hoc) and computational complexity.
Despite the previous research efforts in human-computer dialogue simulation, more investigations are needed in order to generate reliable simulated dialogues that may be truly useful to conversational interfaces. The following are potential research gaps in this area:

• To assess the reliability of simulated dialogues
• To learn rich user models incorporating important features
• To learn a system model in order to acquire knowledge about the system


• To exploit system and user knowledge for predicting system/user responses
• To learn multiple user models considering degrees of user experience
• To learn reusable models in order to be used across applications
• To assess spoken dialogue systems using simulated dialogues

2.3 Reinforcement Learning for Dialogue Management

Reinforcement learning characterizes the problem faced by an agent that learns behaviour through trial-and-error interactions with a dynamic environment (Kaelbling et al., 1996) (Sutton & Barto, 1998). (Levin et al., 1997) pioneered the idea of treating spoken dialogue design as an optimization problem using a Markov Decision Process (MDP) and reinforcement learning methods. In addition, (Eckert et al., 1997) pioneered the idea of learning dialogue strategies automatically through the use of a probabilistic user model. First experimental results in dialogue design show that it is indeed possible to learn dialogue strategies similar to those designed by human intuition (Levin et al., 1998) (Levin et al., 2000a), making this a promising research topic. In addition, applying the reinforcement learning framework to dialogue allows the objective evaluation and comparison of different dialogue strategies. First experimental results in user simulation show its utility for the optimization of spoken dialogue systems. Table 2.1 summarizes investigations of reinforcement learning applied to dialogue optimization; here we can observe that MDPs have been investigated more than Partially Observable MDPs (POMDPs), and that different algorithms have been investigated with different reward functions. A few investigations use real dialogues (Litman et al., 2000) (Walker, 2000) (Roy et al., 2000); the rest use simulations.

This paragraph summarizes results of reinforcement learning applied to dialogue. (Levin et al., 1998) show that a system that started without initial knowledge converged to a reasonable strategy. (Levin et al., 2000a) show that it is indeed possible to find a simple reward criterion, state space representation and simulated user model in order to learn relatively complex behaviour, similar to that heuristically designed by several research groups. (Young, 2000) emphasizes the importance of user simulation models and the need for developing methods of mapping system features in order to achieve sufficiently compact state spaces. (Walker, 2000) tested a learnt policy on a set of real users, and showed that the learnt policy resulted in a statistically significant increase in user satisfaction on the test set of dialogues. (Litman et al., 2000) conclude that the application of reinforcement learning to dialogue allows the empirical optimization of the system's dialogue strategy by searching through a much larger search space than can be explored with traditional methods (e.g., empirically testing several versions of the system). (Pietquin & Renals, 2002) conclude that after several thousands of simulated dialogues, the system adopts a stationary policy, which appears to be optimal. (Scheffler & Young, 2002) found that learnt policies outperformed hand-crafted policies that operated in the same restricted state space, and gave performance similar to the original design obtained from several iterations of manual refinement. (Pietquin & Dutoit, 2005) argue that dialogue simulation and reinforcement learning do not fully automate dialogue design, but a first acceptable strategy can be obtained. (Henderson et al., 2005) found that linear function approximation is a viable approach for addressing large state spaces. (English & Heeman, 2005) found that an effective dialogue strategy can be learnt by using reinforcement learning on both conversants, without the drawbacks of training data.
(Schatzmann et al., 2005b) found that the choice of user model has a significant impact on the learnt strategy, meaning that the development of realistic user models is a priority for future research. In addition, POMDP-based dialogue strategies have


been shown to outperform MDP-based dialogue strategies, but have so far been limited to small-scale problems (Roy et al., 2000) (Zhang et al., 2001) (Williams & Young, 2005). Finally, results using a hierarchical POMDP show that for very small domains there are no benefits, but that the benefits increase dramatically for larger problems (Pineau & Thrun, 2001).

In spite of the previous research efforts on this topic, more investigations are needed in order to learn effective and efficient dialogue strategies. The following are potential research gaps:

• To find a principled way to discover a good state representation
• To find a principled way to propose a good performance function
• To learn optimal dialogue strategies for expert and novice users
• To learn optimal dialogue strategies on-line, exploiting system and user knowledge
• To learn optimal dialogue strategies in large state spaces
• To learn optimal dialogue strategies using small amounts of training data
• Faster learning algorithms
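To ground the MDP formulation surveyed above (and summarized in Table 2.1), the following is a minimal tabular Q-learning sketch for a toy two-slot task against a trivial simulated user. The state encoding, action set and assumed recognition-success rate are illustrative, and the reward scheme (−1 per turn, +20/−20 at the end) only loosely follows the patterns reported in Table 2.1; it is not any of the cited systems.

import random
from collections import defaultdict

ACTIONS = ["ask_slot0", "ask_slot1", "confirm_all"]   # illustrative action set

def simulate_turn(state, action):
    """Trivial simulated environment: state = tuple of booleans (slot filled?)."""
    filled = list(state)
    if action.startswith("ask_"):
        if random.random() < 0.8:                      # assumed recognition success rate
            filled[int(action[-1])] = True
        return tuple(filled), -1, False                # -1 per system turn
    # confirm_all ends the episode; it succeeds only if both slots are filled
    return tuple(filled), (20 if all(filled) else -20), True

def q_learning(episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = (False, False), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = simulate_turn(state, action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)])
          for s in [(False, False), (True, False), (False, True), (True, True)]}
print(policy)   # expected: ask the unfilled slots first, then confirm

The studies cited above differ mainly in how the state is represented, which learning algorithm is used, and how the reward function is defined, which is exactly what Table 2.1 summarizes.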

2.4 Spoken Dialogue Systems Evaluation

The evaluation of spoken dialogue systems is fundamental to monitoring the progress of such systems. The most widely embraced evaluation method proposed so far is PARADISE (Walker et al., 1997). This method uses a decision-theoretic framework to specify the relative contribution of various factors to the overall system performance. Performance is modelled as a weighted function of: task success (exact scenario completion), dialogue efficiency (task duration, system turns, user turns, total turns), dialogue quality (word accuracy, response latency) and user satisfaction (sum of TTS performance, ease of task, user expertise, expected behaviour, future use). More recent investigations applying this framework develop general models for predicting user satisfaction based on experimental data from different spoken dialogue systems (Walker et al., 2000) (Walker et al., 2001). (Wright et al., 2002) propose a PARADISE-based automatic method for predicting user satisfaction using the DARPA Communicator corpora. PARADISE has been used to compare spoken dialogue systems with different capabilities, such as degrees of initiative (Chu-Carroll & Nickerson, 2000), adaptability (Litman & Pan, 1999), adaptivity (Litman & Pan, 2002), and multimodality (Beringer et al., 2003).

(Danielli & Gerbino, 1995) describe a set of metrics (contextual appropriateness, turn correction ratio and transaction success) and propose a new metric called implicit recovery. Contextual appropriateness measures the degree of contextual coherence of the system answers: appropriate, inappropriate or ambiguous. Turn correction ratio is calculated by adding the results of system turn corrections and user turn corrections. Transaction success measures the success of the system in providing the users with the information they required. Finally, implicit recovery measures the capacity to recover utterances that partially failed at the recognition and understanding levels.

(Polifroni et al., 1998) propose a method to evaluate a spoken dialogue system in two ways: a) system behaviour, judged by examining each query/response pair (manual), and b) component evaluation (automatic), which examines the performance of each system component (speech recognizer, natural language parser, natural language generation component and domain server). A comparison between the automatic evaluation and the manual evaluation showed significant agreement.

Despite these research efforts in evaluating spoken dialogue systems, more investigations are necessary in order to evaluate systems in a more reliable, integrated and automatic way.
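As an illustration of the PARADISE idea of modelling performance as a weighted combination of task success, efficiency and quality measures, here is a small sketch. The metric names, normalization and weights are invented for illustration; in PARADISE proper the weights are obtained by linear regression against user-satisfaction ratings.

def paradise_style_performance(metrics, weights):
    """Weighted sum of normalized dialogue metrics (PARADISE-style)."""
    return sum(weights[name] * value for name, value in metrics.items())

# Hypothetical, already-normalized metrics for one dialogue (e.g., z-scores).
metrics = {"task_success": 1.0, "total_turns": -0.4, "word_accuracy": 0.6}
weights = {"task_success": 0.5, "total_turns": 0.2, "word_accuracy": 0.3}   # illustrative
print(paradise_style_performance(metrics, weights))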


Table 2.1: Reinforcement Learning for Spoken Dialogue Management (in a Nutshell). The first group of references uses MDPs and the second group uses POMDPs. Each entry lists: Reference | Slots, States, Actions, Obs.* | Algorithm; Strategy | Reward Function.

(Levin et al., 1997) (Levin et al., 1998) (Levin et al., 2000a) | 5, 111, 12 | Monte Carlo Explorative Starts | C = Wi⟨Ni⟩ + Wr⟨Nr⟩ + Wo⟨fo(No)⟩ + Ws⟨Fs⟩, where ⟨Ni⟩ = expected length of the interaction (# turns), ⟨Nr⟩ = expected number of tuples from the DB, fo(No) = data presentation cost, Fs = overall task success measure.

(Young, 2000) | 2, 62, 5 | Value Iteration & Monte Carlo | −1 for each question-answer cycle, +100 for a successful conclusion.

(Litman et al., 2000) (Singh et al., 2002) | 3, 42, 2 | Value Iteration | StrongComp = binary reward function: 1 if the system queried the database using the attributes specified, 0 otherwise.

(Goddeau & Pineau, 2000) | −, 3^n, 5 | Dynamic Programming | Cost = A + B, where A = fixed amount for every prompt to the user, B = final cost for each unfilled or incorrectly filled slot.

(Walker, 2000) | 3, 18, 17 | Q-Learning | Performance = .27·Comp + .54·MRS − .09·BargeIn% + .15·Rejection%, where Comp = user perception of task completion (1, 0), MRS = Mean Recognition Score, BargeIn% = percentage of utterances interrupted by the user, Rejection% = percentage of recognizer rejections.

(Pietquin & Renals, 2002) | 7, 3^7, 24 | Monte Carlo Explorative Starts; ε-greedy | rt = Wt·Nt + Wdba·Ndba + Wpr·Npr − WCL·CL − Ws·f(Us), where Nt = 0 if st+1 = sF, 1 otherwise; Ndba = number of database accesses; Npr = number of presented data; CL = confidence level of the current user's utterance; f(Us) = function of the modelled user's satisfaction (Us); Wx = adjustable positive weights.

(Scheffler & Young, 2002) | 4, {209, 303}, 6; 4, {480, 778}, 6; 4, 1298, 6 | Q(λ); ε-greedy | Cost = NumTurns + FailCost + NumFailures, where NumTurns = average number of turns per transaction, FailCost = average number of failures per transaction, NumFailures = weighting factor with assigned values 20, 10, 5, 2.

(Pietquin & Dutoit, 2005) | 7, 3^7, 25 | Q(λ); softmax | rt = WTC·TC + WCL·CL + WN·N, where TC = task completion measure, CL = confidence measure, N = 0 if final state, 1 otherwise, Wx = positive weights.

(Henderson et al., 2005) | 4, 10^87, 10 | Sarsa(λ) & Linear Function Approx. | −1 for all states other than the final state; the final state reward is the sum of rewards based on a PARADISE evaluation.

(Frampton & Lemon, 2005) | 2, 640, 6; 4, 1539, 6 | Sarsa(λ); ε-greedy | RF1 = 100 for each correct slot value; RF2 = 100 if all slots are correct, else 0 per slot; RF3 = 100 if all slots are correct, else 10 per slot.

(English & Heeman, 2005) | 4, 25, 5 | On-Policy Monte Carlo; ε-greedy | o(S, I) = w1·S + w2·I, where S = task score, I = number of interactions, wi = constants.

(Schatzmann et al., 2005b) | 4, 81, 256 | Q-Learning; ε-greedy | −1 for each question-answer cycle, +20 for a successful conclusion.

(Roy et al., 2000) | 4, 13, 20, 16 | MDP, POMDP approx., Exact POMDP | Each action labelled as Correct (+100), OK (−1) or Wrong (−100).

(Zhang et al., 2001) | 3, 40, 18, 25 | MDP, QMDP, FIB, Grid-based approx. | Positive reward when the answer matches the user's request, negative reward if a mismatch occurs.

(Pineau & Thrun, 2001) | 3, 10, 15, 16 | Incremental Pruning | Computation time in seconds.

(Williams & Young, 2005) | 2, 36, 5, 5 | Perseus | −1 if ask slot not stated, −3 if confirm slot not stated; −2 if ask slot stated, −1 if confirm slot stated; −3 if ask slot confirmed, −2 if confirm slot confirmed; +50 if the dialogue goal ends successfully, −50 otherwise.

*Observations (Obs.) are only applicable to POMDPs (Partially Observable Markov Decision Processes).

Chapter 3

Human-Computer Dialogue Simulation Using Hidden Markov Models

This chapter presents a probabilistic method to simulate task-oriented human-computer dialogues at the intention level¹, which may be used to improve or to evaluate the performance of spoken dialogue systems. Our method uses a network of Hidden Markov Models (HMMs) to predict system and user intentions, where a "language model" predicts sequences of goals and the component HMMs predict sequences of intentions. We compare standard HMMs, Input HMMs and Input-Output HMMs in an effort to better predict sequences of intentions. In addition, we propose a dialogue similarity measure to evaluate the realism of the simulated dialogues. We performed experiments using the DARPA Communicator corpora and report results with three different metrics: dialogue length, dialogue similarity and precision-recall.

¹Most of the content described in this chapter appears in (Cuayáhuitl et al., 2005).

3.1 Introduction

The task of human-computer dialogue simulation consists of generating artificial conversations between a spoken dialogue system and a user. The communication in real spoken dialogue systems is achieved at several levels: speech, words and intentions (analogous to dialogue acts). Training optimal dialogue strategies usually requires many dialogues to derive an optimal policy, and on-line learning from real conversations may be impractical. An alternative is to use simulated dialogues. For dialogue modelling, simulation at the intention level is the most convenient, since the effects of recognition and understanding errors can be modelled and the intricacies of natural language generation can be avoided (Young, 2000).

Several research efforts have been undertaken in the area of human-computer dialogue simulation for the three levels of communication aforementioned (see section 2.2). However, no corpus-based efforts have been undertaken to simulate both system and user behaviour. This chapter presents a method that addresses the following question: how to expand a small corpus of dialogue data with more varied simulated conversations? Our method learns system and user behaviour based on a network of HMMs, where each HMM represents a goal in the conversation. In an effort to better predict real dialogues we compare three models with different dependencies in their structures. In addition, this chapter presents a measure to evaluate the realism of the simulated dialogues through the comparison of HMMs trained with real and simulated dialogues. Some potential uses of the expanded corpus may be to learn optimal dialogue strategies and to evaluate spoken dialogue systems in early stages of development.




[Figure 3.1 diagram: a set of real dialogues (training set) feeds a Knowledge Acquisition and Representation stage, which produces a system model (the spoken dialogue system) and a user model (the simulated user); the two models exchange system turns and user turns as intentions in order to generate simulated dialogues, which are evaluated against another set of real dialogues (test set).]

Figure 3.1: A high-level diagram of the proposed human-computer dialogue simulator.

3.2 Probabilistic Dialogue Simulation

This section describes a probabilistic human-computer dialogue simulation method that models both system and user behaviour at the intention level (see figure 3.1). A set of real dialogues (the training set) is required in order to acquire knowledge and train the system and user models, which then interact with each other using intentions in order to generate simulated dialogues. The system model is a probabilistic dialogue manager that controls the flow of the conversation, and the user model is a set of conditional probabilities that describe user behaviour. Finally, the simulated dialogues and another set of real dialogues (the test set) are used to evaluate the realism of the simulated dialogues.

3.2.1 The System Model

The task of the system model is to generate a sequence of system turns including system intentions, allowing user responses between turns. Because conversations may have many system turns and some turns are reused during the conversations, we decided to divide the conversation into goals, which are subsequences of system turns within the same topic. Therefore, our system model consists of multiple Hidden Markov Models (HMMs) connected by a bigram language model, where each HMM in the network represents a dialogue goal (see figure 3.2a). The task of the bigram language model is to predict the goal sequence within a dialogue through the conditional probability P(gn|gn−1) of the current goal given the preceding goal, over the set of goals G = {g1, g2, ..., gN}. The language model is parameterized as Λ = (σ, δ), where σ is the initial distribution and δ the transition distribution. The conversation within a goal is modelled by an ergodic HMM with visible states. The notation λ = (A, B, π) is used to indicate the complete parameter set of a standard HMM, and its characterization is as follows (Rabiner, 1989):

• N, the number of states within a goal plus a final state. We assume that any goal can be modelled as a set of visible states S = {S1, S2, ..., SN} representing system turns; the state at time t is referred to as qt and the final state is referred to as qN.

• M, the number of observed symbols, represented as a set of system intentions V = {v1, v2, ..., vM}; the symbol observed at time t is referred to as ct.

• The discrete random variable A describes the flow of system turns by P(qt+1|qt).

• The discrete random variable B describes the system intentions generated in each state by P(ct|qt).


[Figure 3.2 diagrams omitted.]

Figure 3.2: HMM-based system models. (a) a language model defining a network of Hidden Markov Models (HMMs), (b) a standard HMM, (c) an Input HMM (IHMM) and (d) an Input-Output HMM (IOHMM). The empty circles represent visible states, the lightly shaded circles represent observations and the dark shaded circles represent user responses.

• The initial state distribution π = P(q0) represents the start of the conversation within a goal.

Standard HMMs consider state transitions (system turns) and observations (system intentions) independent of user responses (see figure 3.2b), meaning that the control flow of the conversation does not take the previous user responses into account. This fact motivated the use of models with more dependencies in their structure. Therefore, we use Input Hidden Markov Models (IHMMs) and Input-Output Hidden Markov Models (IOHMMs), which are extensions of the standard HMM (Bengio & Frasconi, 1996); see figures 3.2c and 3.2d. IHMMs condition the next state transition qt+1 on the current state qt and the current user response ut, so the state transition probability is rewritten as P(qt+1|qt, ut). IOHMMs extend IHMMs by conditioning the current observation ct on the current state qt and the previous user response ut−1, so the observation symbol probability distribution is rewritten as P(ct|qt, ut−1).
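A minimal sketch of how the per-goal parameters λ = (A, B, π) and the IHMM extension could be represented and sampled, assuming turns and intentions are strings and every distribution is a plain dictionary mapping outcomes to probabilities; the class and helper names are illustrative.

import random

def sample(distribution):
    """Draw an outcome from a dict mapping outcomes to probabilities (non-empty)."""
    r, cumulative = random.random(), 0.0
    for outcome, p in distribution.items():
        cumulative += p
        if r < cumulative:
            return outcome
    return outcome   # guard against floating-point rounding

class GoalHMM:
    """One dialogue goal: states are system turns, observations are system intentions."""

    def __init__(self, pi, A, B, A_input=None):
        self.pi = pi              # P(q0)
        self.A = A                # P(q_{t+1} | q_t), standard HMM transitions
        self.B = B                # P(c_t | q_t), intention emissions
        self.A_input = A_input    # optional IHMM transitions P(q_{t+1} | q_t, u_t)

    def initial_state(self):
        return sample(self.pi)

    def emit_intention(self, state):
        return sample(self.B[state])

    def next_state(self, state, user_intention=None):
        if self.A_input is not None and user_intention is not None:
            return sample(self.A_input[(state, user_intention)])   # IHMM
        return sample(self.A[state])                               # standard HMM

An IOHMM would additionally key the emission distribution B on the previous user response, i.e. B[(state, previous_user_intention)].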

3.2.2 The User Model

The task of the user model is to interact with the system model by providing answers to system intentions. Our user model is based on the assumption that a user response is conditional only on the previous system response (Eckert et al., 1997). The observed symbols are represented by the set of user intentions H = {h1, h2, ..., hL}, where L is the number of intentions and the intention at time t is referred to as ut. Thus, the discrete random variable U describes the user intentions generated in each state by P(ut|qt, ct). Figure 3.3 illustrates the structure of an IOHMM including user responses.
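The user model P(ut|qt, ct) can be estimated from an annotated corpus by simple relative-frequency counting; a minimal sketch, assuming each training turn is available as a (system_turn, system_intention, user_intention) triple (this annotation format is an assumption for illustration, not the DATE scheme itself).

from collections import Counter, defaultdict

def train_user_model(annotated_turns):
    """Estimate P(u_t | q_t, c_t) by relative frequency from (q, c, u) triples."""
    counts = defaultdict(Counter)
    for q, c, u in annotated_turns:
        counts[(q, c)][u] += 1
    return {context: {u: n / sum(counter.values()) for u, n in counter.items()}
            for context, counter in counts.items()}

# Hypothetical annotated data: (system turn, system intention, user intention).
turns = [("ask_date", "request(date)", "provide(date)"),
         ("ask_date", "request(date)", "provide(date)"),
         ("ask_date", "request(date)", "no_answer")]
user_model = train_user_model(turns)
print(user_model[("ask_date", "request(date)")])   # provide(date): 2/3, no_answer: 1/3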

The language model and HMMs are used as a generator in order to simulate task-oriented human-computer dialogues at the intention level. A simplified version of the dialogue simulation algorithm using standard HMMs is shown in figure 3.4.


Figure 3.3: The IOHMM including user responses (dark shaded circles); the black arrows correspond to the user model.

The function DialogueSimulator generates sequences of goals using the language model Λ, choosing initial goals from σ and goal transitions from δ, until reaching the final goal gN. For each goal, the function SimulateHMM is invoked with the corresponding model λ, which generates a sequence of system intentions ct and user intentions ut until reaching the final state qN. The probability distributions in lines 18 and 21 may be replaced with the ones specified by IHMMs or IOHMMs. The algorithm assumes that the system starts the conversation and the user ends it.

3.3 Dialogue Similarity

This section describes a measure to evaluate the realism of simulated dialogues. The motivation for proposing another measure is that previous measures are either very general (such as dialogue length (Scheffler & Young, 2000)) or very strict (such as precision-recall (Schatzmann et al., 2005a), which highly penalizes unseen dialogues). Therefore, in an attempt to address the deficiencies of the previous measures, we propose a dialogue similarity measure. The purpose of this measure is to evaluate the similarity between two sets of dialogues. For our purposes, we compare a corpus of real dialogues against a corpus of simulated dialogues2, training a set of standard HMMs (one per dialogue goal) for each corpus. This measure computes the normalized distance between HMMs trained from each corpus, where γr represents a set of HMMs trained with real dialogues and γs represents another set of HMMs trained with simulated dialogues. The similarity is the distance between γr and γs given by equation 3.1. Notice that this measure can evaluate the system model (including the variables q and c), the user model (including the variable u) or both (including all variables). This measure attempts to provide an indication of how far all the simulated dialogues are from the real dialogues.

D^*(\gamma_r, \gamma_s) = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M_i} \sum_{j=1}^{M_i} D(\omega_j; \lambda_r^i, \omega_j; \lambda_s^i),     (3.1)

where L is the number of variables to compare, N is the number of HMMs (one per goal), Mi is the number of probability distributions in the model λi, ω is the variable (e.g., q, c, u), and D is a distance between HMMs expressed as

D(p, q) = \frac{D_{KL}(p \| q) + D_{KL}(q \| p)}{2},     (3.2)

and DKL is the Kullback-Leibler divergence expressed as

D_{KL}(p \| q) = \sum_i p_i \log_2 \left( \frac{p_i}{q_i} \right).     (3.3)

2 Under the assumption that the more similarity, the more realism.
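To make the computation concrete, here is a minimal Python sketch of equations 3.1-3.3, assuming the HMM probability distributions are available as plain dictionaries; the container layout and all names are illustrative, not the actual implementation:

```python
import math

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p || q) in bits (equation 3.3)."""
    return sum(pi * math.log2((pi + eps) / (q.get(k, 0.0) + eps))
               for k, pi in p.items() if pi > 0)

def sym_kl(p, q):
    """Symmetric KL distance between two distributions (equation 3.2)."""
    return 0.5 * (kl(p, q) + kl(q, p))

def dialogue_similarity(real_hmms, sim_hmms):
    """Normalized distance between two sets of HMMs (equation 3.1).

    real_hmms/sim_hmms: dict variable -> list of HMMs (one per goal),
    where each HMM is given as a list of probability distributions (dicts).
    """
    total = 0.0
    variables = real_hmms.keys()                       # e.g., {'q', 'c', 'u'}
    for var in variables:                              # average over L variables
        for hmm_r, hmm_s in zip(real_hmms[var], sim_hmms[var]):  # over N goals
            per_goal = sum(sym_kl(p_r, p_s)            # over M_i distributions
                           for p_r, p_s in zip(hmm_r, hmm_s)) / len(hmm_r)
            total += per_goal / len(real_hmms[var])
    return total / len(variables)

# Toy usage: one variable, one goal, one distribution per corpus.
real = {'c': [[{'start': 0.7, 'apology': 0.3}]]}
sim = {'c': [[{'start': 0.5, 'apology': 0.5}]]}
print(dialogue_similarity(real, sim))
```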


01. function DialogueSimulator()
02.   load parameters of the language model Λ
03.   current goal ← random goal from σ
04.   while current goal != gN do
05.     λ ← parameters of the HMM given current goal
06.     SimulateHMM(λ)
07.     current goal ← random goal from δ
08.   end
09. end
10. function SimulateHMM(λ)
11.   t ← 0
12.   qt ← random system turn from π
13.   ct ← random system intention from P(ct|qt)
14.   loop
15.     print ct
16.     ut ← random user intention from P(ut|qt, ct)
17.     print ut
18.     qt ← random system turn from P(qt+1|qt)
19.     if qt = qN then return
20.     else t ← t + 1
21.     ct ← random system intention from P(ct|qt)
22.   end
23. end

Figure 3.4: The dialogue simulation algorithm.
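For concreteness, a minimal Python sketch of this generator (the standard-HMM variant), assuming the model parameters are stored as dictionaries of discrete distributions; all container names are illustrative:

```python
import random

def sample(dist):
    """Draw one outcome from a dict {outcome: probability}."""
    outcomes = list(dist)
    return random.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def simulate_hmm(hmm, final_state='qN'):
    """Generate (speaker, intention) pairs for one dialogue goal."""
    q = sample(hmm['pi'])                  # initial system turn
    while True:
        c = sample(hmm['B'][q])            # system intention, P(c|q)
        yield ('SYS', c)
        u = sample(hmm['U'][(q, c)])       # user intention, P(u|q,c)
        yield ('USR', u)
        q = sample(hmm['A'][q])            # next system turn, P(q'|q)
        if q == final_state:               # the final state ends the goal
            return

def simulate_dialogue(lm, hmms, final_goal='gN'):
    """Chain goal-level HMMs using the bigram language model (sigma, delta)."""
    goal = sample(lm['sigma'])             # initial goal
    while goal != final_goal:
        yield from simulate_hmm(hmms[goal])
        goal = sample(lm['delta'][goal])   # goal transition
```

Replacing the distributions for q and c with the IHMM or IOHMM variants, which additionally condition on the user response, changes only the lookup keys.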

3.4 Experimental Design

3.4.1 Training the System and User Model

Our experiments use the 2001 DARPA Communicator corpora, which are annotated using the DATE annotation scheme (Walker & Passonneau, 2001). These corpora (available from the LDC) consist of task-oriented human-computer dialogues in the domain of travel information. The DATE scheme annotates dialogues using dialogue acts, which characterize the behaviour of human-computer dialogues. Both system turns and user turns are annotated, with a focus on system turns, assuming that system behaviour is correlated with user behaviour. As a consequence, system turns are annotated with dialogue acts, whilst user turns provide the ASR and user transcriptions at the word level, embedding semantic tags. Using this data we trained our models in the following five steps:

1. Dialogue segmentation, where each segment corresponds to a goal; these segments are application dependent. Figure 3.5 shows the goal delimiters (dialogue acts) of the systems used in our experiments. This step was used to train the language models; the remaining steps were used to train the HMMs.

2. Classification of system turns into states for the HMMs. Briefly, system turns with the speech acts request info, offer, and acknowledgement were classified as states, using that order to avoid duplicated states. System turns without any of these speech acts were classified according to their most recent state.

3. Classification of system turns into intentions. Because system turns have many combinations of dialogue acts, we collapsed them into the set of system intentions

Figure 3.5: Information extracted from the Communicator data: for each system (BBN, CMU, COL, IBM, LUC and MIT), its goal delimiters (dialogue acts), the number of states per goal, the user initiative, and the number of training and test dialogues. User initiative is the ratio between the number of semantic tags and the number of utterances (from user transcriptions). These corpora are a subset of the original dialogue data.

V = {start, apology, instruction, confirmation}. Briefly, and using the following order: system turns with the speech act explicit confirm were classified as confirmation; system turns with the speech act apology as apology; system turns with the speech acts request info, offer, and acknowledgement as start; system turns with the speech act instruction as instruction; and any other system turn as start.

4. Classification of user turns into intentions. As we are interested in intention-based dialogues, information from the transcriptions was used to classify user turns into the set of user intentions H = {oov, command, yes, no, CITY, DATE TIME, RENTAL, CAR, AIRLINE, HOTEL, AIRPORT, NUMBER, CITY CITY, DATE TIME DATE TIME, CITY DATE TIME, AIRLINE DATE TIME, AIRLINE NUMBER, CITY CITY DATE TIME}. The items in capitals are the semantic tags that occur in most of the systems. The use of more than one semantic tag allows user initiative. The full set of user intentions H was used to provide user responses and to train the state transitions in IHMMs and IOHMMs, but for training observations in IOHMMs we collapsed the semantic tags into the intention iv, in an effort to reduce the data sparsity problem. Finally, subsets of H (the system vocabulary) were allowed in each HMM, according to the user intentions observed in the data.

5. Smoothing of intentions in order to consider unseen entries. Because many intentions may not have occurred in the data, the probability distributions of the HMMs (system turns, system intentions and user intentions) were smoothed using back-off estimation with Witten-Bell discounting (Jurafsky & Martin, 2000).
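A minimal sketch of Witten-Bell smoothing for a single discrete distribution, assuming raw counts and a known vocabulary of possible outcomes are available; this illustrates the general estimator, not necessarily the exact back-off scheme used in the experiments:

```python
def witten_bell(counts, vocabulary):
    """Witten-Bell smoothed distribution over a known vocabulary.

    Seen outcomes get c(w) / (N + T); the probability mass T / (N + T)
    is shared uniformly among the Z unseen outcomes.
    """
    n = sum(counts.values())                              # total observations
    t = len([w for w, c in counts.items() if c > 0])      # distinct seen types
    unseen = [w for w in vocabulary if counts.get(w, 0) == 0]
    z = len(unseen)
    dist = {}
    for w in vocabulary:
        c = counts.get(w, 0)
        if c > 0:
            dist[w] = c / (n + t)
        else:
            dist[w] = t / (z * (n + t)) if z else 0.0
    return dist

# Example: smoothing the system-intention distribution of one HMM state.
print(witten_bell({'start': 8, 'confirmation': 2},
                  ['start', 'confirmation', 'apology', 'instruction']))
```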


Figure 3.6: Data sets used by the Precision-Recall measure (A = real dialogues in the training set, B = real dialogues in the test set, C = simulated dialogues); if the set C completely covers A and B, this measure indicates realism in the simulated dialogues.

3.4.2 Evaluation Metrics

Evaluating simulated dialogues is a difficult process because we do not know in advance whether the simulated dialogues would occur in real environments. Nevertheless, we evaluate our method using the following metrics, which compare two sets of dialogues. For our purposes we are mainly interested in comparing real dialogues (test set) against simulated dialogues.

• Dialogue Length: This measure computes the average number of turns per dialogue, giving a rough indication of agreement between two sets of dialogues.

• Precision-Recall: This measure evaluates how well a model can predict training and test data, but it highly penalizes the simulated dialogues that did not occur in the real data. This measure is illustrated in figure 3.6, where recall is given by Rtrain = (A ∩ C)/A or Rtest = (B ∩ C)/B, and precision is given by Ptrain = (A ∩ C)/C or Ptest = (B ∩ C)/C. The harmonic mean of recall and precision is given by F = 2PR/(P + R) (Jurafsky & Martin, 2000).

• Dialogue Similarity: This proposed measure computes the normalized distance of standard HMMs between two sets of dialogues, penalizing unseen behaviour but taking into account both seen and unseen dialogues (see section 3.3).

In this chapter our evaluation focuses on the HMM-based system models, but these measures can also be used to evaluate the user model or both. In the case of dialogue length we only consider system turns. In the case of precision-recall we consider fragments (one per goal) composed of a state plus a system intention. Finally, in the case of dialogue similarity we only consider system turns (states) and system intentions (observations), but other parameters such as user intentions might also be incorporated.
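A small sketch of how these precision-recall figures can be computed over two sets of dialogue fragments (here encoded as tuples of state and system-intention pairs); the names and the fragment encoding are illustrative assumptions:

```python
def precision_recall_f(real_fragments, simulated_fragments):
    """Precision, recall and F-measure between two sets of fragments."""
    real, sim = set(real_fragments), set(simulated_fragments)
    overlap = real & sim
    precision = len(overlap) / len(sim) if sim else 0.0
    recall = len(overlap) / len(real) if real else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy usage with fragments encoded as tuples of (state, system intention).
real_test = [(('s0', 'start'), ('s1', 'confirmation'))]
simulated = [(('s0', 'start'), ('s1', 'confirmation')),
             (('s0', 'start'), ('s1', 'apology'))]
print(precision_recall_f(real_test, simulated))
```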

3.4.3 Experiments and Results

We trained the proposed models for six Communicator systems: BBN, CMU, COL, IBM, LUC and MIT. From the original data we filtered out dialogues with missing annotations that impede the induction of system intentions. The size of the corpora used for the experiments is shown in figure 3.5. We performed experiments for each system in order to compare the proposed HMM models; in each comparison 10^4 simulated dialogues were generated. Figures 3.7 and 3.8 illustrate results from closed and open tests using the three evaluation metrics: Dialogue Length (DL), Precision-Recall (PR) and Dialogue Similarity (DS). The bars in each plot represent: real dialogues (comparing the training and test sets), random dialogues (using the same setup as standard HMMs but with flat probabilities), and HMM-based simulations (HMMs, IHMMs and IOHMMs). Ideally, we would like our models to behave similarly to the real dialogues; we assume that our simulations may be considered realistic if they reach scores similar to the real dialogues (although this is only an indication).

Figure 3.7: Results using training data with three different metrics (Dialogue Length, Precision-Recall of goals and intentions, and Dialogue Similarity) for real, random, HMM, IHMM and IOHMM dialogues; the thin bars represent the standard deviation.

Figure 3.8: Results using test data with three different metrics (Dialogue Length, Precision-Recall of goals and intentions, and Dialogue Similarity); Precision-Recall of goals uses the same language model.

Figure 3.9: Average test results from all Communicator systems for Dialogue Length, Precision-Recall (intentions) and Dialogue Similarity (assumption: the closer to "Real" the better).


SYS: about_task|request_info|dest_city            SYS: where do you want to go?
USR: CITY                                         USR: Atlanta
SYS: about_task|request_info|depart_arrive_date   SYS: and what day would you be departing Boston?
USR: oov                                          USR: [cough]
SYS: about_task|request_info|depart_arrive_time   SYS: and what time would you like to depart Boston?
USR: DATE_TIME                                    USR: september twenty seventh early afternoon
SYS: about_task|offer|flight                      SYS: [info] do you want to take that?
USR: no                                           USR: no
SYS: about_task|offer|flight                      SYS: [info] do you want to take that?
USR: DATE_TIME                                    USR: morning
SYS: about_task|offer|flight                      SYS: [info] do you want to take that?
USR: yes                                          USR: yes
SYS: about_task|request_info|return_date          SYS: will you return to Boston?
USR: oov                                          USR: [noise]
SYS: about_communication|apology|meta_slu_reject  SYS: i'm sorry. i didn't catch that.
USR: yes                                          USR: yes
SYS: about_task|offer|flight                      SYS: [info] do you want to take that?
USR: yes                                          USR: yes
SYS: about_task|request_info|hotel                SYS: will you need a hotel in Atlanta?
USR: yes                                          USR: yes
SYS: about_task|request_info|hotel_location       SYS: would you like a hotel downtown or near the ...
USR: HOTEL                                        USR: downtown
SYS: about_task|request_info|flight               SYS: would you like me to summarize your trip?
USR: yes                                          USR: yes
SYS: about_task|acknowledgement|rental_booking    SYS: i've requested a rental car with no preference ...
USR: command                                      USR: repeat

Figure 3.10: Fragment of a simulated dialogue.

From the results we can observe that random dialogues obtain the worst performance (meaning that they are strongly unrealistic), whilst the HMM-based models are better than random. From the PR (goals) results we can observe that the HMM-based models obtain similar performance because they use the same language model. From the PR (intentions) results we can observe that the HMM-based models obtain performance similar to the real dialogues. Thus, PR is only partially useful because it only tells us how well our models can predict training and test data, while penalizing unseen dialogues. This raises the question "What proportion of the dialogues penalized by PR may occur in real environments?" On the other hand, from the DS measure we can observe that the HMM-based simulations are considerably distant from the real dialogues. This measure is promising because it evaluates dialogue behaviour more strictly than the other metrics. This raises the question "How realistic might the simulated dialogues be if they obtain a distance similar to the real ones?" In the meantime, all measures agree that the random dialogues are significantly unrealistic, whilst our trained models generate dialogues closer to the real ones; this can be observed from the average results in figure 3.9. According to PR and DS we can observe that IHMMs and IOHMMs perform slightly closer to the real dialogues, but still cannot be considered realistic. This suggests exploring more effective dependencies in the HMMs. Finally, figure 3.10 illustrates a simulated dialogue based on the CMU simulated system with IHMMs; the left column uses intentions and the right column is an instantiation in natural language. Since our method is purely probabilistic, some incoherencies may occur; for instance, the system offers a return flight without asking for a return date.

3.5 Conclusions and Future Directions

In this chapter we have presented a corpus-based method to simulate task-oriented human-computer dialogues at the intention level using a network of HMMs connected by a bigram language model, where each HMM represents a dialogue goal. This method learnt a system model and a user model:


the system model is a probabilistic dialogue manager that models the sequence of system intentions, and the user model consists of conditional probabilities of the possible user responses. We argue that our method is independent of application and annotation scheme. Because in the proposed method all possible system and user intentions may occur in each state, more exploratory dialogues may be generated than are observed in the real data. We compared three models with different structures: HMMs, IHMMs and IOHMMs. Our experiments with the DARPA Communicator data reveal that the HMM-based models obtain very similar performance, clearly outperforming random dialogues, and are close to being considered realistic. We believe that Precision-Recall and Dialogue Similarity are potentially complementary metrics, because precision-recall penalizes the unseen dialogues, whilst dialogue similarity considers all the dialogues. This suggests that a combination of measures may better evaluate the realism of simulated dialogues, but there is no guarantee that these metrics are directly related to dialogue realism. Immediate work on dialogue simulation follows two directions: 1) finding better evaluation measures and 2) improving the performance of our proposed method, including: degrees of initiative in user responses, investigating the application of a balanced number of goals and states, duration modelling, modelling system and user intentions according to the dialogue history (this should yield more coherent sequences of goals and intentions), modelling confidence levels, modelling different kinds of users, and exploring richer dependencies in the models while avoiding the data sparsity problem. Future work consists of using the proposed method within the reinforcement learning framework to learn optimal dialogue strategies for large-scale spoken dialogue systems.

3.6 Proposed Future Work

One of the main limitations of the dialogue simulation method proposed in this chapter is that incoherent dialogues are generated. Considering that this research will use dialogue simulation to learn dialogue strategies in an automatic way, the following question arises: "What kind of dialogue strategies may be learnt from incoherent dialogues?" Notice that incoherencies may occur on both sides of the conversation (system and user). A second limitation is that the models proposed here have a very limited amount of knowledge, making the task of learning optimal dialogue strategies3 difficult. Here, a second question arises: "What kind of dialogue strategies may be learnt from a very limited amount of knowledge?" Furthermore, this research intends to investigate how to learn dialogue strategies in large state spaces, and for that purpose a method with knowledge from several different variables is desirable. These facts strongly suggest investigating a more sophisticated method for simulating dialogues. This research proposes to investigate an extension of the proposed method by adding more knowledge to the proposed models. Figure 3.11 illustrates a potential model including knowledge4 from 7 different variables (described below). However, this potential model was manually designed. Therefore, this research proposes to investigate a principled way to find the variables and dependencies to take into account, so that important knowledge may be incorporated into the simulations.

• User Type: The possible values to consider are naive user and expert user. A potential probability distribution is given by P(tt|tt−1). This information will be induced from the data based on user initiative and dialogue length using clustering techniques (Jain et al., 1999) (Bechet et al., 2004) (Webb, 2002).

3 Under the assumption that the more knowledge, the more optimal dialogue strategies may be learnt.
4 The Communicator corpora do not provide such knowledge explicitly; we assume that it can be induced from the data. (Georgila et al., 2005a) propose an extended annotation based on the Information State Update approach. However, this research will mainly use the original data because it can still be exploited further.


Figure 3.11: HMM-based dialogue simulation model including knowledge from 7 different variables: user type (t), user intention (u, the user model), semantic frame (f), system turn (q), system intention (c, the system model), confidence level (l) and barge-in (b). The empty circles represent visible states, the lightly shaded circles represent observations and the dark shaded circles represent knowledge used to predict system and user responses. The black arrows correspond to dependencies of the observations.

• User Intention: The possible values to consider are oov, yes, no, sequences of semantic tags (slots), and a set of basic commands such as repeat, cancel and start over. A potential probability distribution is given by P(ut|ft, ct, tt).

• Semantic Frame: The semantic frames are composed of the semantic tags (filled slots) of the corresponding dialogue goal (a fragment of a conversation within the same topic). A potential probability distribution is given by P(ft|ft−1).

• System Turn: The system turns (slots asked) correspond to the relevant utterances in the system. A potential probability distribution is given by P(qt|qt−1, ft).

• System Intention: The possible dialogue actions to consider are start, apology, instruction, confirmation, and provide information. A potential probability distribution is given by P(ct|qt, lt).

• Confidence Level: The possible values to consider are low, medium and high. A potential probability distribution is given by P(lt|qt, bt). This information will be either induced from data based on speech recognition results or borrowed from the Communicator extended annotation (Georgila et al., 2005a).

• Barge-in: The possible values to consider are barge-in and no barge-in. A potential probability distribution is given by P(bt|qt, bt−1). This information will be induced from data based on the start times of each dialogue turn.
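As an illustration of how these distributions could drive the extended simulator, the sketch below samples one system/user exchange from the 7-variable model, assuming each conditional distribution is available as a lookup table keyed by its conditioning variables; all names are hypothetical, and the system turn is assumed to condition on the previous turn:

```python
import random

def sample(dist):
    """Draw one outcome from a dict {outcome: probability}."""
    outcomes = list(dist)
    return random.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def simulate_exchange(model, prev):
    """Sample one system/user exchange from the model of figure 3.11."""
    t = sample(model['user_type'][prev['t']])          # P(t_t | t_{t-1})
    f = sample(model['frame'][prev['f']])              # P(f_t | f_{t-1})
    q = sample(model['system_turn'][(prev['q'], f)])   # P(q_t | q_{t-1}, f_t)
    b = sample(model['barge_in'][(q, prev['b'])])      # P(b_t | q_t, b_{t-1})
    l = sample(model['confidence'][(q, b)])            # P(l_t | q_t, b_t)
    c = sample(model['system_intention'][(q, l)])      # P(c_t | q_t, l_t)
    u = sample(model['user_intention'][(f, c, t)])     # P(u_t | f_t, c_t, t_t)
    return {'t': t, 'f': f, 'q': q, 'b': b, 'l': l, 'c': c, 'u': u}
```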


Finally, I propose to investigate a more reliable metric for evaluating the realism of simulated dialogues, with the following potential directions:

1. A combination of metrics based on precision-recall (Jurafsky & Martin, 2000) (Manning & Schütze, 2001), dialogue similarity (Cuayáhuitl et al., 2005) (Lin, 1996), and the BLEU score (Papineni et al., 2002).

2. A metric based on the utility of the simulated dialogues5. The utility will be measured according to the total expected reward. For instance, let dialogue policy A be trained with user model X and dialogue policy B with user model Y; using real user responses on the corresponding dialogue policies, the dialogue policy with the higher accumulated reward has the more accurate user model.

5 Under the assumption that the greater the utility of the simulated dialogues, the more accurate they are.

Chapter 4

Spoken Dialogue Management Using Hierarchical Reinforcement Learning

This chapter addresses the problem of learning optimal dialogue policies in large state spaces using the Reinforcement Learning (RL) framework. It contains a description of work proposed with the aim of finding an effective and efficient fully observable RL method for learning optimal dialogue strategies, based on a comparative study of hierarchical RL methods and RL with function approximation. A brief description of each method is provided, highlighting some potential strengths and weaknesses. A potential experimental design in the travel domain is described, based on the hierarchical approach and the dialogue simulation method proposed in the previous chapter. Finally, a summary of the proposed work is given.

4.1 Introduction

The specification of a dialogue manager has typically been treated as an iterative process: design by hand, evaluation with human subjects, and refinement (Barnard et al., 1999). This process is expensive, time-consuming and prone to errors, even for simple applications, and it becomes more difficult as dialogue complexity increases. Therefore dialogue design has been considered more an art than engineering or science. (Levin et al., 1997) pioneered the idea of automating dialogue design as an optimization problem using the Reinforcement Learning (RL) framework, which applies the paradigm of learning by trial and error. Figure 4.1 illustrates the interaction between the agent and the environment in an RL framework applied to spoken dialogue systems. The environment consists of a spoken dialogue system and users (or a dialogue simulator in the case of automatic learning), and has an associated reward function that rewards or punishes the actions taken by the agent. The agent consists of a dialogue strategy1, which is optimized by an RL method (an algorithm that learns to choose the best actions). At each time step the dialogue strategy aims to provide the actions that achieve the highest reward (or lowest punishment). Notice that the dialogue system in the environment no longer takes actions; instead it consults the dialogue strategy to choose the best actions. There is a disagreement of terms in the literature between dialogue manager and dialogue strategy: for some researchers the agent is the dialogue manager and for others it is the dialogue strategy. A definition of dialogue manager is as follows: the task of the dialogue manager is to control the flow of the conversation, accept spoken input from the user, produce messages (to clarify, disambiguate, suggest, control, assist and constrain the conversation) to be conveyed to the user, and interact with internal and external resources (McTear, 2004) (Zue & Glass, 2000).

1 In this chapter the terms dialogue design, dialogue strategy and dialogue policy are used interchangeably.


Figure 4.1: The agent-environment interaction in the reinforcement learning framework. The agent (the dialogue strategy) receives state st and reward rt and takes action at; the environment (the system and users, or a dialogue simulator, together with the reward function) returns the next state st+1 and reward rt+1.

Therefore in this research it is assumed that the dialogue strategy is a subcomponent of the dialogue manager, meaning that the dialogue manager consults the dialogue strategy to choose actions. Several research efforts have been undertaken within the RL framework applied to spoken dialogue systems, with significant advances. However, such investigations have mainly focused on small state spaces (see table 2.1). This chapter describes reinforcement learning methods for large state spaces, suitable for large-scale spoken dialogue systems, where the state space grows exponentially.

4.2 The Reinforcement Learning Framework

4.2.1 Markov Decision Processes

The agent-environment interaction is usually expressed as a Markov Decision Process (MDP). MDPs were developed to address the problem of choosing optimal actions in stochastic domains. In spoken dialogue systems the environment is everything outside the dialogue strategy (the agent), such as the spoken dialogue system, users and reward function. An MDP (see figure 4.2) applied to dialogue is characterized as follows (Putterman, 1994) (Sutton & Barto, 1998) (Levin et al., 1997) (Young, 2000):

• The set of states S = {s1, s2, ..., sn} represents the knowledge about the conversation (represented as vectors of state variables, usually called factored states); the state at time t is referred to as st or simply s, and the next state as st+1 or s′.

• The set of actions A = {a1, a2, ..., am} represents all possible actions that the system can perform, such as interactions with the user and interactions with other system components. The action at time t is referred to as at or simply a, and the next action as at+1 or a′.

• The transition probabilities P(s′|a, s) specify the control flow of the conversation by choosing the next state s′ given the current action a and state s.

• The reward function R(s′, a, s) = E(r′|s′, a, s) specifies the expected reward given the next state s′ and the previous action a and state s.


Figure 4.2: A Markov Decision Process.

• The policy matrix π(s, a) = P(a|s) specifies the dialogue strategy by choosing an action given the current state of the environment. The optimal policy π∗ maximizes the expected total reward.

In MDPs the state and action spaces may be finite or infinite, with discrete or continuous time. For human-computer dialogues, discrete and finite MDPs are of particular interest. Typically, a dialogue consists of a sequence of states s0, s1, ..., sT, which receives a total expected reward expressed as

R = \sum_{t=1}^{T} \gamma^t R(s_{t+1}, a_t, s_t).     (4.1)

The discount rate 0 ≤ γ ≤ 1 is used to weight future reinforcements. The task in reinforcement learning is to optimize the interaction between the agent and the environment. Usually, the goal is to find a policy that maximizes R. The expected value of the reward can be computed recursively by introducing the state-value function V^π(s), which is the value of state s under policy π, defined by the Bellman equation for V^π as

V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P(s'|a, s) [R(s', a, s) + \gamma V^\pi(s')] = \sum_a \pi(s, a) Q^\pi(s, a),     (4.2)

where Q^π(s, a) gives the expected reward of taking action a from state s following policy π. The optimal state-value function V∗ can be found by

V^*(s) = \max_\pi V^\pi(s) = \max_a \sum_{s'} P(s'|a, s) [R(s', a, s) + \gamma V^*(s')].     (4.3)

Similarly, the optimal action-value function Q∗ can be found by

Q^*(s, a) = \max_\pi Q^\pi(s, a) = \sum_{s'} P(s'|a, s) [R(s', a, s) + \gamma \max_{a'} Q^*(s', a')].     (4.4)

Finally, the optimal policy is given by

\pi^*(s) = \arg\max_a Q^*(s, a),     (4.5)

and can be learnt by either classical dynamic programming methods (Putterman, 1994) or reinforcement learning methods (Sutton & Barto, 1998) (Kaelbling et al., 1996) (Mitchell, 1997) (Russell & Norvig, 2002).
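As an illustration of how equations 4.3-4.5 are used when the model is known, here is a minimal value iteration sketch; the dictionary layout of P and R is an assumption for illustration only:

```python
def value_iteration(states, actions, P, R, gamma=0.95, theta=1e-6):
    """Compute V* and a greedy policy pi* from a known MDP model.

    P[s][a] is a dict {s': probability}; R[s][a][s'] is the expected reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # One Bellman backup per state (equation 4.3).
            q = [sum(p * (R[s][a][s2] + gamma * V[s2])
                     for s2, p in P[s][a].items()) for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy from the converged value function (equation 4.5).
    policy = {s: max(actions,
                     key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                       for s2, p in P[s][a].items()))
              for s in states}
    return V, policy
```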


4.2.2 Semi-Markov Decision Processes

In MDPs the agent assumes that decisions are taken in a fixed amount of time. A discrete-time Semi-Markov Decision Process (SMDP) is a generalization of an MDP in which actions can take a variable amount of time to complete. Let the random variable N denote the number of time steps that action a takes when it is executed in state s. In SMDPs the transition probability is rewritten as P(s′, N|s, a). The Bellman equations for V∗ and Q∗ are rewritten as

V^*(s) = \max_a \sum_{s', N} P(s', N|s, a) [R(s', a, s) + \gamma^N V^*(s')],     (4.6)

and

Q^*(s, a) = \sum_{s', N} P(s', N|s, a) [R(s', a, s) + \gamma^N \max_{a'} Q^*(s', a')].     (4.7)

Similarly, the value functions can be learnt either by dynamic programming algorithms applied to SMDPs (Putterman, 1994) or by hierarchical reinforcement learning methods. The basic notions of these methods are described in section 4.3.

4.2.3 Reinforcement Learning Methods

Reinforcement Learning methods offer two important advantages over classical dynamic programming: the methods are online and they can employ function approximation. There are three main families of Reinforcement Learning (RL) methods that find an optimal policy: Dynamic Programming (DP), Monte Carlo (MC) methods, and Temporal Difference (TD) learning (Kaelbling et al., 1996) (Sutton & Barto, 1998). In DP the simplest and most effective method is value iteration (Young, 2000), but its utility is limited due to its great computational expense. Nevertheless, DP provides an essential foundation for all RL methods. Whilst DP requires complete knowledge of the environment, MC methods require only experience: sample sequences of states, actions and rewards. TD learning is a combination of DP and MC methods, and two of the most widely used learning algorithms are Q-Learning and Sarsa. Figure 4.3 illustrates the Q-learning algorithm (Watkins, 1989), which updates Q-values for every state-action pair, where α represents a learning rate parameter that decays from 1 to 0. The learned action-value function Q directly approximates Q∗ and has been shown to converge with probability 1. Another RL method combining TD and MC methods is known as Eligibility Traces, which is a basic mechanism for temporal credit assignment2.

One of the challenges in RL is the trade-off between exploitation and exploration. The agent has to exploit what has already been learnt in order to obtain reward, but it also has to explore in order to discover better actions. In this dilemma the agent must try different actions and progressively prefer those that seem to be the best. The basic methods for this purpose are ε-greedy and softmax (Sutton & Barto, 1998).

These RL methods estimate value functions using a table with entries of action-value pairs, but this is limited to tasks with small state spaces. The problem is not just the memory needed for large tables, but also the time and data needed to fill them accurately; this problem is addressed by "generalization" (Sutton & Barto, 1998). For spoken dialogue systems it is of particular interest to apply RL where many state-action combinations have never been experienced before. Thus, the way to learn these kinds of tasks is to generalize from previously seen state-actions to the ones that have never been seen. This kind of generalization is known as "function approximation", which is an instance of supervised learning. Therefore it is appealing to combine RL methods with function approximation such as "gradient-descent methods".

2 Temporal credit assignment consists of determining which of the actions in a sequence are to be credited with producing the eventual rewards. The use of eligibility traces helps to solve this problem.


Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s′
        Q(s, a) ← Q(s, a) + α[r + γ max_a′ Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is terminal

Figure 4.3: The Q-learning algorithm.
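For concreteness, a tabular version of this algorithm with ε-greedy exploration in Python; the environment interface (reset/step) is an assumption for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning (figure 4.3) with epsilon-greedy action selection.

    env is assumed to expose reset() -> s and step(s, a) -> (s', r, done).
    """
    Q = defaultdict(float)                      # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:       # explore
                a = random.choice(actions)
            else:                               # exploit the current estimates
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(s, a)
            target = r + gamma * max(Q[(s2, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```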

There are two gradient-descent methods that have been widely studied in RL: multilayer artificial neural networks and the linear form. Linear methods include radial basis functions, tile coding and Kanerva coding, among others. Another alternative for dealing with large state spaces is to apply divide-and-conquer methods such as hierarchical reinforcement learning, which introduce the notion of "abstraction" by ignoring details that are irrelevant for the task at hand (Barto & Mahadevan, 2003). A simple type of abstraction is the idea of a "macro", which is a sequence of actions that can invoke other actions, forming the basis of hierarchical specifications. There are two kinds of hierarchical methods: context-free methods, which discover recursive optimal policies, such as MAXQ (Dietterich, 2000a); and context-sensitive methods, which discover hierarchical optimal policies, such as Options (Sutton et al., 1999) or HAMs (Parr & Russell, 1998) (Parr, 1998).

4.3 Hierarchical Reinforcement Learning Methods

4.3.1 The Options Framework

The options framework extends the notion of action to options, incorporating the concept of "temporal abstraction", where the actions in SMDPs take a variable amount of time to complete and are intended to model temporally-extended courses of action (Sutton et al., 1999). A fixed set of options defines a discrete-time SMDP embedded within a corresponding MDP, where the MDP takes single-step transitions and the SMDP takes multi-step transitions. The SMDP actions (the options) are no longer black boxes, but policies in the MDP that can be examined and learned. An option is the three-tuple <I, π, β>, where I is the initiation set, π is the policy, and β is a termination condition. If the option taken in state st is Markov, then the next action at is selected given π(st, ·). The environment then makes a transition to state st+1, where the option either terminates given β(st+1), or continues, determining action at+1 given π(st+1, ·), possibly terminating in st+2 given β(st+2), and so on. Options that depend only on the current state are called "Markov options" and options that depend on all prior events since the option was initiated are called "semi-Markov options". That is, at each intermediate time t ≤ T ≤ t + k, the decisions of a Markov option depend only on sT, whilst the decisions of a semi-Markov option depend on the entire sequence st, at, rt+1, st+1, at+1, ..., rT, sT, but not on events prior to st or after sT. Given a set of options, the initiation sets implicitly define a set of available options Os for each state s. Options generalize the set of available actions, so that actions can be considered a special case of options: each action a is an option that is available whenever a is available, that always lasts one step (β(s) = 1), and that selects action a everywhere (π(s, a) = 1).


Therefore the choices available to an agent are always options, some lasting for a single time step and some lasting for multiple time steps. We can consider policies that select options instead of policies that select actions. When a policy over options µ is initiated in a state st, it selects an option o according to the probability distribution µ(st, ·). The option o is then taken in st, determining actions until it terminates in st+k, where a new option is selected according to µ(st+k, ·), and so on. In this way, a policy over options µ determines a conventional policy over actions, or flat policy π = flat(µ). Policies that depend on a single time step are called "Markov policies", and policies that depend on multiple time steps are called "semi-Markov policies". The value of a state s under a semi-Markov flat policy π is given by

V^\pi(s) = E\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... \mid E(\pi, s, t)\},     (4.8)

where E(π, s, t) denotes the event of π being initiated in state s at time t. Similarly, the option-value function of taking option o in state s under policy µ is defined as

Q^\mu(s, o) = E\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... \mid E(o\mu, s, t)\},     (4.9)

where the composition oµ denotes the semi-Markov policy that first follows option o until it terminates and then starts choosing according to policy µ in the resultant state. The reward part of the model of option o for any state s is

R(s, o) = E\{r_{t+1} + \gamma r_{t+2} + ... + \gamma^{k-1} r_{t+k} \mid E(o, s, t)\},     (4.10)

where t + k is the random time at which option o terminates. The state-prediction part of the model of option o for state s is

P(s'|o, s) = \sum_{k=1}^{\infty} p(s', k) \gamma^k,     (4.11)

where p(s′, k) is the probability that option o terminates in state s′ after k steps. The quantities R(s, o) and P(s′|o, s) generalize the reward and transition probabilities R(s, a) and P(s′|a, s). The Bellman equations for V_O^* and Q_O^* are

V_O^*(s) = \max_{o \in O_s} [R(s, o) + \sum_{s'} P(s'|o, s) V_O^*(s')]     (4.12)

and

Q_O^*(s, o) = R(s, o) + \sum_{s'} P(s'|o, s) \max_{o' \in O_{s'}} Q_O^*(s', o').     (4.13)

The SMDP Q-Learning method can be used to find an optimal policy over a set of options by applying updates after each option termination:

Q(s, o) \leftarrow Q(s, o) + \alpha [r + \gamma^k \max_{o' \in O_{s'}} Q(s', o') - Q(s, o)],     (4.14)

where k is the number of time steps elapsed between s and s′, r is the reward accumulated while the option was executing, and the step-size parameter α may depend on the state, option and time steps. The estimate Q(s, o) converges to Q_O^*(s, o) under conditions similar to those of Q-Learning. Finally, learning policies over options can be further improved with other methods, such as interrupting options, intra-option model learning, intra-option value learning and learning options that achieve subgoals; see (Sutton et al., 1999) and (Precup, 2002) for more details.
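A small sketch of the update in equation 4.14, applied once an option terminates; the surrounding execution loop, the Q-table layout and options_at are assumptions for illustration:

```python
def smdp_q_update(Q, s, o, reward, k, s_next, options_at, alpha=0.1, gamma=0.95):
    """SMDP Q-learning update after option o, started in s, terminates in s_next.

    reward is the reward accumulated while o was executing, k is the number of
    elapsed time steps, and options_at(state) lists the options available there.
    """
    best_next = max(Q.get((s_next, o2), 0.0) for o2 in options_at(s_next))
    target = reward + (gamma ** k) * best_next
    Q[(s, o)] = Q.get((s, o), 0.0) + alpha * (target - Q[(s, o)])
    return Q
```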


4.3.2 The MAXQ Method

The MAXQ method provides a hierarchical decomposition of the given reinforcement learning problem into a set of subproblems (Dietterich, 2000a) (Dietterich, 1998) (Dietterich, 2000c). This decomposition has the following advantages: a) policies learnt in subproblems can be reused for multiple parent tasks, b) value functions learnt in subproblems can be shared, so that learning in new tasks is accelerated, and c) the value function can be represented more compactly by applying state abstraction, meaning that learning is faster and requires less training data. The MAXQ method decomposes a target Markov Decision Process (MDP) into a hierarchy of smaller MDPs, and decomposes the value function of the target MDP into a combination of value functions of the smaller MDPs. The value function decomposition creates opportunities to exploit state abstraction, so that individual MDPs within the hierarchy can ignore parts of the state space.

To construct a MAXQ decomposition a set of individual subtasks must be identified; each subtask is defined by a subgoal, and each subtask terminates when its subgoal is achieved. After defining these subtasks, for each subtask it must be indicated which other subtasks or primitive actions it should employ to reach its goal. If a subtask invokes other subtasks, then it is formulated as an SMDP, otherwise it is formulated as an MDP. Suppose that a policy is written for each subtask; having a policy for each subtask gives us an overall policy for the target MDP. This collection of policies is called a "hierarchical policy". The hierarchical policy is executed using a stack discipline, similar to familiar data structures: if a subtask is invoked, its name and parameters are pushed onto the stack; if a subtask ends, its name and parameters are popped off the stack.

More formally, the value function decomposition takes a given MDP M and decomposes it into a set of subtasks M0, M1, ..., Mn and a hierarchical policy π. Each subtask Mi defines an SMDP with states Si, actions Ai, probability transition function P_i^π(s′, N|s, a), and expected reward function R̃i(s, a) = V^π(a, s), where V^π(a, s) is the projected value function for the subtask Mi in state s. If a is a subtask, denoted V^π(<s, K>), then it gives the expected cumulative reward of following the hierarchical policy π in state s with stack contents K. If a is a primitive action, then it gives the expected immediate reward of executing a in s. A subtask is a three-tuple <Ti, Ai, R̃i>, where Ti is a termination predicate3, Ai is a set of actions for the subtask Mi, and R̃i is the pseudo-reward function4.

3 A termination predicate partitions the set S into a set of active states Si and a set of terminal states Ti.
4 A pseudo-reward function specifies a pseudo-reward for each transition to a terminal state s′ ∈ Ti.

The purpose of the MAXQ value function decomposition is to decompose the projected value function of the root task, V(0, s), in terms of the projected value functions V(i, s) of all subtasks in the hierarchy. The state-value function can be computed using the Bellman equation for SMDPs, given by

V^\pi(i, s) = V^\pi(\pi_i(s), s) + \sum_{s', N} P_i^\pi(s', N|s, \pi_i(s)) \gamma^N V^\pi(i, s').     (4.15)

The action-value function Q can be extended to support subtasks. Let Q^π(i, s, a) be the expected cumulative reward for subtask Mi of performing action a in state s and then following the hierarchical policy π until subtask Mi terminates, where action a may be either a primitive action or a subtask. Thus, the action-value function is given by

Q^\pi(i, s, a) = V^\pi(a, s) + \sum_{s', N} P_i^\pi(s', N|s, a) \gamma^N Q^\pi(i, s', \pi(s')).     (4.16)

The rightmost term in this equation is referred to as the completion function C^π(i, s, a), which is the expected discounted cumulative reward of completing subtask Mi after executing action a in state s, expressed as


C^\pi(i, s, a) = \sum_{s', N} P_i^\pi(s', N|s, a) \gamma^N Q^\pi(i, s', \pi(s')).     (4.17)

Using equation 4.17 in equation 4.16, the Q function is expressed recursively as

Q^\pi(i, s, a) = V^\pi(a, s) + C^\pi(i, s, a).     (4.18)

Finally, the definition of V^π(i, s) can be expressed as

V^\pi(i, s) = \begin{cases} Q^\pi(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s'|i, s) R(s'|s, i) & \text{if } i \text{ is primitive} \end{cases}     (4.19)

The three previous equations are referred to as the "decomposition equations" for the MAXQ hierarchy under a fixed hierarchical policy π. These equations recursively decompose the projected value function of the root subtask, V^π(0, s), into the projected value functions of the individual subtasks Mi and the individual completion functions C^π(i, s, a). (Dietterich, 2000a) provides a graphical representation called the MAXQ graph, which is intended to facilitate the design and debugging of MAXQ decompositions. In general, the MAXQ value function decomposition has the form

V^\pi(0, s) = V^\pi(a_m, s) + C^\pi(a_{m-1}, s, a_m) + ... + C^\pi(a_1, s, a_2) + C^\pi(0, s, a_1),     (4.20)

where a0, a1, ..., am is the path of subtasks chosen by the hierarchical policy π going from the root subtask to a primitive action.

There are two kinds of optimal policies in hierarchical methods: hierarchical optimal policies and recursive optimal policies. A hierarchical optimal policy for an MDP is a policy that achieves the highest cumulative reward among all policies consistent with the given hierarchy. A hierarchical policy is recursively optimal if each policy πi is optimal given the policies of its descendants in the task graph. The MAXQ method adopts recursive optimality. The reason for pursuing this kind of optimality rather than hierarchical optimality is that it makes it possible to solve each subtask without reference to the context in which it is executed. This property makes it easier to share and re-use subtasks. The MAXQ-Q learning algorithm can be used to find a recursive optimal policy; it is a variation of Q-Learning based on the following update rule:

C(i, s, a) \leftarrow (1 - \alpha_t) C(i, s, a) + \alpha_t \gamma^N \max_{a'} [V(a', s') + C(i, s', a')].     (4.21)

So far the MAXQ method must keep tables for each of the C functions at the internal nodes and the V functions at the leaves, requiring more than four times the memory of a flat Q table. This problem is addressed with five conditions that permit state abstraction, ignoring certain aspects of the state space (Dietterich, 2000b). The use of state abstraction produces a more compact representation of the value function, and has proven convergence to a locally optimal policy. Finally, the optimal policy for an MDP may not be strictly hierarchical, so a non-hierarchical policy may be derived from the hierarchical policy. (Dietterich, 2000a) describes two methods for this purpose, one based on the dynamic programming algorithm known as value iteration, and the other based on Q-learning.
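To make the decomposition equations 4.18 and 4.19 concrete, here is a small sketch that evaluates the decomposed value function given learned V and C tables, a task hierarchy and the subtask policies; all table layouts and names are assumptions for illustration:

```python
def decomposed_v(i, s, V_leaf, C, policy, children):
    """V(i, s): equation 4.19. Leaves use their expected-reward table."""
    if not children.get(i):                  # i is a primitive action (leaf)
        return V_leaf[(i, s)]
    a = policy[i][s]                         # pi_i(s), the subtask's own policy
    return decomposed_q(i, s, a, V_leaf, C, policy, children)

def decomposed_q(i, s, a, V_leaf, C, policy, children):
    """Q(i, s, a) = V(a, s) + C(i, s, a): equation 4.18."""
    return decomposed_v(a, s, V_leaf, C, policy, children) + C[(i, s, a)]
```

Evaluating decomposed_v at the root recovers the telescoping sum of equation 4.20, since each recursive call adds one completion term along the path from the root to a primitive action.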


4.3.3 Hierarchies of Abstract Machines

Hierarchies of Abstract Machines (HAMs) consist of non-deterministic Finite State Machines (FSMs) whose transitions may invoke lower-level machines (Parr & Russell, 1998) (Parr, 1998). A machine is hierarchical if it calls other machines. The control flow is the same as that of an action: when a machine is called, control is transferred to its start state; when the machine reaches a stop state, control returns to the caller, which determines the next machine state. A machine is abstract if it specifies non-deterministic choice states, which specify the machine to be executed. A Hierarchical Abstract Machine (HAM) executed by an agent constrains the actions that can be taken in each state. A machine for a HAM is a three-tuple N = (µ, I, δ), where µ is a finite set of machine states, I is a start function from environment states to machine states, and δ is the transition function mapping machine states and environment states to next machine states. There are four types of machine states: action (take an action), call (execute another machine), choice (select the next machine state), and stop (halt execution and return control). An additional requirement for machines used by HAMs is that the call graph for each machine must be a tree, in order to prevent recursion.

For any MDP M and any HAM H, there exists an SMDP called H ◦ M. The solution defines an optimal choice function that maximizes the expected total reward obtained by an agent executing H in M. To find the optimal policy, H is applied to M to yield an induced MDP (H ◦ M). A description of the induced MDP is as follows:

• The set of states in H ◦ M is the cross-product of the states of H with the states of M.

• For each state in H ◦ M where the machine component is an action state, the model and machine transition functions are combined.

• For each state where the machine component is a choice state, actions that change only the machine component of the state are specified.

• The reward is taken from M for primitive actions, otherwise it is zero.

As H ◦ M may be quite large, for any M and H, let C be the set of choice points in H ◦ M. There exists a decision process reduce(H ◦ M), with state set C, such that the optimal policy for reduce(H ◦ M) corresponds to the optimal policy for M that is consistent with H. A variation of Q-learning called HAMQ-learning is proposed. A HAMQ-learning agent keeps track of the following quantities: t, the current environment state; n, the current machine state; sc and mc, the environment state and machine state at the previous choice point; a, the choice made at the previous choice point; and rc and βc, the total accumulated reward and discount since the previous choice point. In addition, it maintains an extended Q-table Q([s, m], a), which is indexed by an environment-state/machine-state pair and by an action taken at a choice point. For every environment transition from state s to state t with observed reward r and discount β, the HAMQ-learning agent performs the updates rc ← rc + βc r and βc ← β βc. For each transition to a choice point the agent does

Q([s_c, m_c], a) \leftarrow Q([s_c, m_c], a) + \alpha [r_c + \beta_c V([t, n]) - Q([s_c, m_c], a)],     (4.22)

and then rc ← 0, βc ← 1. For any M and H, HAMQ-learning has been proven to converge to the optimal choice point in reduce(H ◦ M) with probability 1.


4.3.4 Comparison of Methods

The following questions are considered in order to provide a comparison of the hierarchical reinforcement learning methods described above. In addition, a summary highlighting the features of each method is shown in table 4.1.

• How to specify subtasks? There are three approaches for defining subtasks: 1) define each subtask in terms of a fixed policy, 2) define each subtask in terms of a termination predicate and a local reward function, and 3) define each subtask in terms of a non-deterministic finite state controller. These approaches correspond to Options, MAXQ and HAMs, respectively. An advantage of Options and HAMs is that the subtask can be defined in terms of amount of effort rather than in terms of a particular goal condition. However, Options and HAMs require the programmer to provide complete policies for the subtasks, which can be a difficult task in realistic and large-scale applications. An advantage of MAXQ is that the termination predicate allows the programmer to guess in which states the subtask might terminate, and this can be revised automatically by the learning algorithm.

• What form of optimality to employ, hierarchical or recursive? A limitation of all hierarchical methods is that the learned policy may be suboptimal. The Options and HAMs methods converge to a form of hierarchical optimality, meaning that a policy achieves the highest cumulative reward consistent with all policies in the hierarchy. The MAXQ method converges to a form of local optimality called recursive optimality, meaning that each policy is locally optimal given the policies of its children. Although recursive optimality is weaker than hierarchical optimality, it is an important form of optimality because it permits each subtask to learn a locally optimal policy while ignoring the behaviour of its ancestors.

• Should we employ state abstractions within subtasks? A subtask employs state abstraction if it ignores parts of the state of the environment. Only the MAXQ method creates opportunities to exploit state abstraction, which can accelerate the learning process. Therefore, there is a trade-off between achieving hierarchical optimality and employing state abstractions.

• What form of execution for a learned policy, hierarchical or non-hierarchical? A value function learned from a hierarchical policy can be evaluated incrementally to yield a potentially better non-hierarchical policy (Kaelbling, 1993). In general, non-hierarchical execution requires additional computation and memory, because learning is required in all states and at all levels in the hierarchy, but it may be worth the extra cost. The Options framework adopts the non-hierarchical execution mode, MAXQ adopts both modes and HAMs adopt the hierarchical execution mode.

• What form of learning algorithm to employ? An important advantage of reinforcement learning algorithms is that they can learn online. The Options framework assumes that the policies for the subproblems are given and do not need to be learned. MAXQ and HAMs provide on-line learning algorithms, but the HAMQ learning algorithm requires flattening the hierarchy, which has undesirable consequences (Dietterich, 2000a).

From this rough comparison we can observe that each method has advantages and disadvantages, which justifies a comparative study in order to identify which method performs better (under certain conditions) for learning optimal dialogue strategies in large state spaces.


Table 4.1: Hierarchical reinforcement learning methods (in a nutshell).

Feature                              Options            MAXQ                              HAMs
Stochastic Model                     MDPs & SMDPs       MDPs & SMDPs                      FSMs, MDPs & SMDPs
Algorithm                            SMDP Q-Learning    MAXQ-Q                            HAMQ-Learning
On-line Learning                     No                 Yes                               Yes
Algorithm with Proven Convergence    Yes                Yes                               Yes
Optimal Policy                       Hierarchical       Recursive                         Hierarchical
Execution Mode                       Non-Hierarchical   Hierarchical & Non-Hierarchical   Hierarchical
Model-Based Learning                 Yes                No                                No
Subgoal Association                  Yes                No                                No
Value Function Decomposition         No                 Yes                               No
State Abstraction                    No                 Yes                               No

More recent investigations have been extending the hierarchical methods discussed above. For instance, (Sutton et al., 2005) presents an algorithm for intra-option learning in TD networks with function approximation and eligibility traces5. (Littman et al., 2005) presents the algorithm MAXQ-Rmax, which unifies three ideas (factored representations, model-based learning and hierarchies) for improving the efficiency of reinforcement learning in large state spaces. Finally, (Andre & Russell, 2000) extend the HAMs method using constructs borrowed from programming languages such as parameters, interrupts, aborts and local state variables. This research will mainly focus on the original methods; the recent extensions will be considered as future work.

4.4 Reinforcement Learning with Function Approximation

The goal of reinforcement learning (RL) methods is to learn the value of taking each action in each possible state in order to maximize the total reward. Most RL methods use a tabular representation (a lookup table) for this purpose, but for large state spaces several difficulties arise, such as very large tables and data sparsity. To address these problems, previous research has investigated combining RL with function approximation. The task of a function approximator is to find a function that generalizes from a set of training examples, so that it can replace the lookup table. Examples of function approximators applied to RL include linear function approximation, CMACs, decision tree and regression tree approximation, sparse coarse coding, radial basis functions, SVM regression, kernel-based approximation, recursive least squares and neural networks, among others. Two kinds of approximation have been applied to RL: a) value-based, used to represent value functions, and b) policy-based, used to represent the policy itself; the former approach has been studied more extensively. However, not all function approximation methods are suitable for reinforcement learning, and in general there is no guarantee of convergence (Boyan & Moore, 1995). Only a few combinations of RL algorithms and function approximation methods have been proven to be stable, such as temporal difference learning with linear function approximation (Precup et al., 2001; Tadic, 2001). Henderson et al. (2005) applied Linear Function Approximation (LFA) to learn optimal dialogue strategies and found the method viable for addressing large state spaces. This research proposes a comparative study of hierarchical RL methods and RL with LFA, with the aim of finding an efficient and effective method to learn dialogue policies in large domains.
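To make the value-based approach concrete, the sketch below performs a single Q-learning update with linear function approximation, where Q(s, a) is represented as a dot product between a weight vector and a feature vector. This is only an illustrative sketch: the feature function, the state strings and the numbers are hypothetical placeholders, not part of the proposal or of Henderson et al.'s system.

```python
import numpy as np

def phi(state, action, num_features=8):
    """Hypothetical featurizer: a single hash-based indicator feature per (s, a)."""
    vec = np.zeros(num_features)
    vec[hash((state, action)) % num_features] = 1.0
    return vec

def q_value(weights, state, action):
    """Q(s, a) approximated as a linear function of the features."""
    return np.dot(weights, phi(state, action))

def q_learning_update(weights, s, a, reward, s_next, actions, alpha=0.01, gamma=0.95):
    """One temporal-difference update of the weight vector."""
    best_next = max(q_value(weights, s_next, a2) for a2 in actions)
    td_error = reward + gamma * best_next - q_value(weights, s, a)
    return weights + alpha * td_error * phi(s, a)

# Toy update using the primitive actions of table 4.3 and slot states as in table 4.4.
actions = ["mi", "si", "co", "ap", "in", "pi", "hu"]
weights = np.zeros(8)
weights = q_learning_update(weights, s="0001", a="si", reward=-1, s_next="1301", actions=actions)
print(weights)
```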


[Figure 4.4: hierarchy diagram omitted. Root (S={0000, 1000, 1100, 1110, 1111}, A={fb, hb, rc, st}) has the children Flight Booking, Hotel Booking, Rental Car and Summarize Trip. Flight Booking (S={000, 100, 110, 111}, A={ow, ct, rt}) has the children One Way, Continue Trip and Return Flight, each decomposed into the slots Departure City, Destination City, Date, Time, Flight Number, Airline, Airport and Offer. Hotel Booking is decomposed into Brand, Location and Offer; Rental Car into Company, Size and Offer; Summarize Trip into Offer.]

Figure 4.4: A hierarchical structure of dialogue goals (lightly shaded rectangles) in the travel domain. The dark shaded rectangles form factored states for its parents (see tables 4.2 & 4.4).

4.5 Experimental Design

4.5.1 The Agent-Environment

Experiments will be performed in the travel domain (see table 4.4). The elements of the agent-environment in the RL framework are described as follows:

• States: if a state achieves a dialogue goal it is considered terminal, otherwise it is considered non-terminal. The states are mainly vectors of state variables, as shown in figure 4.4 and tables 4.2 and 4.4. Accordingly, the size of the state space is computed as 72^n ∗ 4^m, where 72 is the number of states per slot, n is the number of slots, 4 is the number of states for the database and m is the number of database queries. Although a more compact representation may be obtained by eliminating some state-action combinations (for instance, states with an unfilled slot must not confirm it, filled slots must not be filled or confirmed again, and confirmed slots must not be filled or confirmed again), the state space may still be large. On the one hand, the size of the state space using a non-hierarchical approach is 72^31 ∗ 4^6, corresponding to one instance of each dialogue goal. On the other hand, the size of the state space using a hierarchical approach (given the hierarchy above) is 5 + 4 + 72^8∗4 + 72^8∗4 + 72^8∗4 + 72^3∗4 + 72^3∗4 + 72∗4, which gives a much more manageable state representation; a small calculation illustrating these numbers is sketched after this list. This fact justifies the exploration of hierarchical approaches.

• Actions: table 4.3 shows the set of primitive actions, and figure 4.4 shows the non-primitive actions (the dialogue goals).

• Transition function: next states will be provided according to the current status of the dialogue simulation model described in chapter 3.

• Reward function: a simple reward function to explore is as follows: non-terminal states receive a punishment of −1 (penalizing long dialogues) and terminal states receive a reward of +100; other performance functions can also be explored (e.g., see table 2.1).

• Policies: the dialogue policies are learnt with a reinforcement learning method as described in sections 4.3 and 4.4.

From figure 4.4 we can observe that the empty nodes represent SMDPs and the lightly shaded nodes represent MDPs (which execute only primitive actions). Notice that the dialogue policy "one way" may potentially be shared.
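The following sketch reproduces the state-space arithmetic quoted above, assuming the 72 slot states of table 4.2 (2 × 6 × 3 × 2) and 4 database states; it is a sanity check rather than part of the proposed system.

```python
# Sanity check of the state-space sizes discussed above.
SLOT_STATES = 2 * 6 * 3 * 2   # F x C x R x U = 72 states per slot (table 4.2)
DB_STATES = 4                 # database status D (table 4.2)

# Non-hierarchical: all 31 slots and 6 database variables in one flat state space.
flat = SLOT_STATES ** 31 * DB_STATES ** 6

# Hierarchical (figure 4.4): root (5 states) + flight booking (4 states) + three
# flight subtasks of 8 slots each + hotel booking and rental car with 3 slots each
# + summarize trip with 1 slot, each subtask with its own database variable.
hierarchical = (5 + 4
                + 3 * (SLOT_STATES ** 8 * DB_STATES)
                + 2 * (SLOT_STATES ** 3 * DB_STATES)
                + SLOT_STATES * DB_STATES)

print(f"non-hierarchical state space: {flat:.3e}")
print(f"hierarchical state space:     {hierarchical:.3e}")
```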


Table 4.2: State variables and values. A combination of state variables forms a factored state (state for short). Multiple combinations are needed for multiple slots, but only one instance of D.

State Variable   | Abbreviation | Values      | Explanation
-----------------|--------------|-------------|------------------------------------------------------------
Filled Slot      | F            | 0,1         | 1 if the slot was collected, 0 otherwise
Confirmation     | C            | 0,1,2,3,4,5 | 0 if unconfirmed slot; 1, 2, 3 if confidence level was low, medium or high; 4 if implicit confirmation and 5 if explicit confirmation
Retries          | R            | 0,1,2       | 0 if first attempt, 1 if first retry, and 2 if second retry
User Type        | U            | 0,1         | 1 if the user shows expertise (default), 0 otherwise
Database Status  | D            | 0,1,2,3     | 0 if no query available, 1 if few tuples, 2 if many tuples, and 3 if error
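For illustration, one factored slot state of table 4.2 could be held in a small record like the one below; the class and field names are hypothetical, since the proposal does not prescribe a concrete data structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlotState:
    filled: int        # F: 0 or 1
    confirmation: int  # C: 0..5
    retries: int       # R: 0..2
    user_type: int     # U: 0 or 1

    def as_digits(self) -> str:
        """Compact notation as used in table 4.4, e.g. '1301'."""
        return f"{self.filled}{self.confirmation}{self.retries}{self.user_type}"

# A slot that was collected with high confidence, no retries, expert user.
print(SlotState(filled=1, confirmation=3, retries=0, user_type=1).as_digits())  # '1301'
```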

Table 4.3: Set of primitive actions potentially executed in every non-terminal state.

Action               | Abbreviation | Explanation
---------------------|--------------|-------------------------------------------------------------------
Mixed Initiative     | mi           | Initial mixed-initiative prompt asking to fill a potential group of slots
System Initiative    | si           | Initial system-initiative prompt asking to fill the corresponding slot
Confirm              | co           | Explicit confirmation, possibly due to a medium confidence score
Apologize            | ap           | Apologize and ask again due to a low confidence score
Instruct             | in           | Provide a prompt with help on how to fill the current slot
Provide Information  | pi           | Provide the information retrieved from (or stored in) the database
Hang up              | hu           | Ends the conversation

Table 4.4: Sample dialogue (from figure 2.1) illustrating state-action mappings according to the task hierarchy. Italics represent user responses and bold fonts represent chosen actions. System prompts and user responses are marked S and U, respectively.

State:   0000; 000; 0001 0001 0001 0001 0001 0001 0001 0001 0
S:       how may i help you?
U:       thank you i want to fly from kansas city missouri on october third in the morning to denver colorado
Actions: fb, hb, rc, st; ow, ct, rf; mi, si, co, ap, in, pi, hu

State:   1301 1301 1301 1301 0001 0001 0001 0001 1
S:       okay, flights from kansas city to denver departing between 5:00 am and noon on wednesday october third. one moment please. i have flights on frontier, vanguard airlines, and united. the earliest nonstop united flight leaves at 6:20 a.m. and arrives at 7:01 a.m.
U:       what else is available please
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1401 1001 0001 0001 0001 1
S:       i have a nonstop united flight leaving at 6:20 a.m. and arriving at 7:01 a.m
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1401 0001 0001 0001 0001 1
S:       would that work?
U:       please
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1401 0001 0001 0001 0111 1
S:       would that work?
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1401 0001 0001 0001 0111 1
S:       i have a nonstop united flight leaving at 6:20 a.m. and arriving at 7:01 a.m
U:       no i want to fly out later than six twenty
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1301 0001 0001 0001 0001 1
S:       okay, flights from kansas city to denver departing after 6:20 am on wednesday october third. one moment please. what about a frontier flight leaving at 7:10 a.m. i also have a united flight leaving at 7:40 a.m.
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1401 0001 0001 0001 0001 1
S:       would that work?
U:       i will take the united leaving at seven forty
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1401 0001 0001 0001 1301 1
S:       united flight 1631 has been added to your itinerary. you depart kansas city at 7:40 am and arrive in denver at 8:21 am.
Actions: mi, si, co, ap, in, pi, hu

State:   1401 1401 1401 1401 0001 0001 0001 1401 1 [terminal state]; 100; 0001 0001 0001 0001 0001 0001 0001 0001 0
S:       what date will you be returning on?
U:       i want to leave denver on october fourth in the afternoon and i want to fly from there to zurich (switzer-) switzerland
Actions: [terminal state]; ow, ct, rf; mi, si, co, ap, in, pi, hu

State:   1301 1301 1301 1301 0001 0001 0001 0001 1
S:       okay, flights from denver to kansas city departing between noon ...
Actions: mi, si, co, ap, in, pi, hu

...

*fb=flight booking, hb=hotel booking, rc=rental car, st=summarize trip, ow=one way, ct=continue trip, rt=return trip. Multi-columns in state correspond to departure city, destination city, date, time, flight number, airline, airport, offer and db status. Note: The state representation may be more compact; for instance, the relaxation slots (airline and airport) can be placed in another level.


4.5.2 Evaluation Metrics

Each experiment will be assessed along three dimensions: performance, computational cost and portability to other domains. The objective evaluation of the PARADISE (PARAdigm for DIalogue System Evaluation) method (Walker et al., 1997) will be used to assess the performance of our experiments. PARADISE provides a way to estimate a performance function as a linear combination of a number of metrics that can be directly measured, such as task success (k), user turns, system turns, elapsed time, mean recognition score, timeouts, help requests, barge-ins, retries, etc. The performance of any dialogue is defined by

Performance = (α ∗ N(k)) − Σ_{i=1}^{n} w_i ∗ N(c_i),     (4.23)

where α is a weight on k, the c_i are cost functions weighted by the w_i, and N is a Z-score normalization function. Given values for α and the w_i, performance can be computed for different spoken dialogue systems using the equation above. Computational cost will be assessed with both the computation time (in seconds) and the memory (in bytes) required during training. Finally, a set of factors affecting portability (such as rapid design and re-usable models) will be identified for each RL method.
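As an illustration of equation 4.23, the sketch below computes the PARADISE performance of a single dialogue; the weights, means and standard deviations are invented toy numbers (in practice they would be estimated by regression against user satisfaction, as in Walker et al., 1997).

```python
def z_score(value, mean, std):
    """N(.) in equation 4.23: Z-score normalization."""
    return (value - mean) / std

def paradise_performance(task_success, costs, alpha, weights, stats):
    """Performance = alpha * N(k) - sum_i w_i * N(c_i)."""
    performance = alpha * z_score(task_success, *stats["task_success"])
    for name, w in weights.items():
        performance -= w * z_score(costs[name], *stats[name])
    return performance

# Toy numbers for one dialogue (hypothetical corpus means and standard deviations).
stats = {"task_success": (0.7, 0.2), "user_turns": (12.0, 4.0), "timeouts": (1.0, 1.0)}
costs = {"user_turns": 9, "timeouts": 0}
weights = {"user_turns": 0.3, "timeouts": 0.2}
print(paradise_performance(task_success=1.0, costs=costs, alpha=0.5,
                           weights=weights, stats=stats))
```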

4.5.3 Experiments

The following experiments (using dialogue simulation) will be performed in order to find an effective and efficient method to learn optimal dialogue strategies in large state spaces: Options with and without interruptions, intra-options and subgoals; MAXQ with and without state abstraction; HAMs; and Q(λ) with Linear Function Approximation. A couple of final experiments will be performed in a real environment using a handcrafted dialogue strategy and the best-performing RL method. All algorithms will use the same shared parameters, such as the discount factor, learning rate and exploration policy (ε-greedy or softmax).
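For completeness, the two exploration policies mentioned above can be sketched as follows; this is a generic illustration (the action names are the abbreviations of table 4.3) rather than the configuration that will actually be used.

```python
import math
import random

def epsilon_greedy(q, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(q))
    return max(q, key=q.get)

def softmax(q, temperature=1.0):
    """Sample an action with probability proportional to exp(Q/temperature)."""
    exps = {a: math.exp(v / temperature) for a, v in q.items()}
    total = sum(exps.values())
    threshold, cumulative = random.random() * total, 0.0
    for action, weight in exps.items():
        cumulative += weight
        if threshold <= cumulative:
            return action
    return action  # floating-point fallback

q = {"mi": 0.2, "si": 0.5, "co": -0.1}
print(epsilon_greedy(q), softmax(q))
```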

4.6 Proposed Future Work

The ultimate goal of this research is to find an efficient reinforcement learning method for optimizing dialogue policies in large state spaces. This chapter has described two approaches for learning optimal dialogue policies in large state spaces: hierarchical reinforcement learning (RL) methods and RL with function approximation. This research can potentially be addressed along the following directions: a) investigate hierarchical RL methods, b) investigate RL with function approximation methods, and c) investigate a combination of both. In this research I am interested in investigating fully observable hierarchical RL methods, because they have been shown to learn faster and with less training data than non-hierarchical RL methods, and because they have not yet been applied to dialogue optimization. For this purpose, I plan to investigate three hierarchical reinforcement learning methods: Options, MAXQ and HAMs. In addition, I propose to compare the hierarchical methods against a function approximation method with proven convergence, such as temporal difference learning with linear function approximation. Note that this research will evolve as the methods are investigated in detail; therefore I plan to perform the experiments mentioned above in order, considering that the earlier experiments are more likely to be explored than the later ones. Irrespective of any change of plans, this research will follow one main direction: "to investigate an efficient hierarchical reinforcement learning method (applied to spoken dialogue management) and compare it with a function approximation method".

Chapter 5

Future Plans

5.1 Timetable

I propose the following list of research activities in order to achieve the proposed work described in the previous chapters.

Table 5.1: Calendar of activities.

Year  Term    Activity
2005  Autumn  HMM-based dialogue simulation with richer dependencies
2005  Autumn  Preparation of the agent-environment for experiments with (non-)hierarchical methods
2006  Spring  Dialogue simulation metric combining different metrics
2006  Spring  Hierarchical RL with Options and dialogue simulation
2006  Spring  Hierarchical RL with Options+interruptions and dialogue simulation
2006  Spring  Hierarchical RL with Options+interruptions+intra-options and dialogue simulation
2006  Spring  Hierarchical RL with Options+interruptions+intra-options+subgoals and dialogue simulation
2006  Summer  Hierarchical RL with MAXQ and dialogue simulation
2006  Summer  Hierarchical RL with MAXQ+state abstraction and dialogue simulation
2006  Autumn  Hierarchical RL with HAMs and dialogue simulation
2006  Autumn  Non-hierarchical RL with Q(λ), LFA and dialogue simulation
2006  Autumn  Dialogue simulation metric based on utility for optimization
2007  Spring  Final experiments using a handcrafted/learnt dialogue strategy with dialogue simulation
2007  Spring  Set up a spoken dialogue system in the travel domain
2007  Spring  Final experiments using a handcrafted/learnt dialogue strategy in a real environment
2007  Spring  Write thesis
2007  Summer  Submit thesis
2007  Autumn  Thesis defence
2007  Autumn  Apply corrections requested and submit final version

NOTES: The second and third activities in Spring 2007 (experiments in a real environment) depend on the completion of the previous activities. If the activities in 2006 require a more extensive study, the experimentation in real environments will be suggested as future work. Nevertheless, if time permits, experiments in a real environment can potentially be performed using the resources generated by the TALK project (www.talk.org). The research work described in this document started in September 2004 and is expected to be completed in September 2007. This document describes work carried out in the first year.


References

Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., and Stent, A. (2001). Towards Conversational Human-Computer Interaction. In AI Magazine, 22(4), pp. 27-37.
Allen, J., Ferguson, G., and Stent, A. (2001). An Architecture for More Realistic Conversational Systems. In Proc. of IUI, Santa Fe, New Mexico, USA, pp. 1-8.
Andre, D., and Russell, S. (2000). Programmable Reinforcement Learning Agents. In Proc. of NIPS, Cambridge, MA, USA, pp. 1019-1025.
Barnard, E., Halberstadt, A., Kotelly, C., and Phillips, M. (1999). A Consistent Approach to Designing Spoken-Dialog Systems. In Proc. of IEEE ASRU Workshop, Keystone, Colorado, USA.
Barto, A., and Mahadevan, S. (2003). Recent Advances in Hierarchical Reinforcement Learning. In Discrete Event Dynamic Systems: Theory and Applications, Kluwer Academic Publishers, 13, pp. 343-379.
Bechet, F., Riccardi, G., and Hakkani-Tur, D. (2004). Mining Spoken Dialogue Corpora for System Evaluation and Modeling. In Proc. of EMNLP'04, Barcelona, Spain, pp. 134-141.
Bengio, Y., and Frasconi, P. (1996). Input-Output HMMs for Sequence Processing. In IEEE Trans. Neural Networks, 7:5, pp. 1231-1249.
Beringer, N., Kartal, U., Louka, K., Schiel, F., and Turk, U. (2003). PROMISE - A Procedure for Multimodal Interactive System Evaluation. In Proc. of the Workshop on Multimodal Resources and Multimodal Systems Evaluation, Las Palmas, Gran Canaria, Spain, pp. 77-80.
Bohus, D., and Rudnicky, A. (2003). RavenClaw: Dialog Management Using Hierarchical Task Decomposition and an Expectation Agenda. In Proc. of Eurospeech, Geneva, Switzerland, pp. 597-600.
Boyan, J. A., and Moore, A. M. (1995). Generalization in Reinforcement Learning: Safely Approximating the Value Function. In Proc. of NIPS, Cambridge, MA, USA, pp. 369-376.
Chu-Carroll, J. (1999). Form-Based Reasoning for Mixed-Initiative Dialogue Management in Information-Query Systems. In Proc. of Eurospeech, Budapest, Hungary, pp. 1519-1522.
Chu-Carroll, J. (2000). MIMIC: An Adaptive Mixed Initiative Spoken Dialogue System for Information Queries. In Proc. of ANLP, Seattle, pp. 97-104.
Chu-Carroll, J., and Nickerson, J. (2000). Evaluating Automatic Dialogue Strategy Adaptation for a Spoken Dialogue System. In Proc. of NAACL, pp. 202-204.
Chung, G. (2004). Developing a Flexible Spoken Dialogue System Using Simulation. In Proc. of ACL, Barcelona, Spain, pp. 63-70.
Cuayáhuitl, H., Renals, S., Lemon, O., and Shimodaira, H. (2005). Human-Computer Dialogue Simulation Using Hidden Markov Models. To appear in Proc. of IEEE ASRU Workshop, Cancun, Mexico.
Danieli, M., Gerbino, E., and Moisa, L. M. (1997). Dialogue Strategies for Improving the Usability of Telephone Human-Machine Communication. In Interactive Spoken Dialogue Systems: Bridging Speech and NLP Together in Real Applications, ACL, pp. 114-120.


Danieli, M., and Gerbino, E. (1995). Metrics for Evaluating Dialogue Strategies in a Spoken Language System. In Proc. of AAAI Symposium on Empirical Methods in Discourse Interpretation and Generation, California, USA, pp. 34-39.
Dietterich, T. (1998). The MAXQ Method for Hierarchical Reinforcement Learning. In Proc. of ICML, San Francisco, USA, pp. 118-126.
Dietterich, T. (2000a). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. In Journal of Artificial Intelligence Research, 13, pp. 227-303.
Dietterich, T. (2000b). State Abstraction in MAXQ Hierarchical Reinforcement Learning. In Proc. of NIPS, 12, pp. 994-1000.
Dietterich, T. (2000c). An Overview of MAXQ Hierarchical Reinforcement Learning. In Proc. of SARA, pp. 26-44.
Eckert, W., Levin, E., and Pieraccini, R. (1997). User Modeling for Spoken Dialogue System Evaluation. In Proc. of IEEE ASRU Workshop, Santa Barbara, California, USA, pp. 80-87.
English, M., and Heeman, P. (2005). Learning Mixed Initiative Dialogue Strategies By Using Reinforcement Learning on Both Conversants. In Proc. of the HLT Conference, Vancouver, Canada.
Erdoğan, H. (2001). Speech Recognition for a Travel Reservation System. In Proc. of IC-AI.
Frampton, M., and Lemon, O. (2005). Reinforcement Learning of Dialogue Strategies Using the User's Last Dialogue Act. In Workshop on Knowledge and Reasoning in Practical Dialogue Systems (IJCAI), Edinburgh, Scotland, pp. 83-90.
Georgila, K., Lemon, O., and Henderson, J. (2005a). Automatic Annotation of Communicator Dialogue Data for Learning Dialogue Strategies and User Simulations. In Proc. of DIALOR, Nancy, France.
Georgila, K., Henderson, J., and Lemon, O. (2005b). Learning User Simulations for Information State Update Dialogue Systems. In Proc. of Interspeech-Eurospeech, Lisbon, Portugal, pp. 893-896.
Goddeau, D., Meng, H., Polifroni, J., Seneff, S., and Busayapongchai, S. (1996). A Form-Based Dialogue Manager for Spoken Language Applications. In Proc. of ICSLP, Philadelphia, USA, pp. 701-704.
Goddeau, D., and Pineau, J. (2000). Fast Reinforcement Learning of Dialogue Strategies. In Proc. of IEEE ICASSP, Istanbul, Turkey.
Henderson, J., Lemon, O., and Georgila, K. (2005). Hybrid Reinforcement/Supervised Learning for Dialogue Policies from COMMUNICATOR Data. In Workshop on Knowledge and Reasoning in Practical Dialogue Systems (IJCAI), Edinburgh, Scotland, pp. 68-75.
Huang, X., Acero, A., and Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice-Hall.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data Clustering: A Review. In ACM Computing Surveys, 31:3, pp. 264-323.
Jurafsky, D., and Martin, J. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River.
Kaelbling, L. P. (1993). Hierarchical Reinforcement Learning: Preliminary Results. In Proc. of ICML, San Francisco, CA, USA, pp. 167-163.
Kaelbling, L. P., Littman, M., and Moore, A. (1996). Reinforcement Learning: A Survey. In Journal of Artificial Intelligence Research, 4, pp. 237-285.


Kass, R., and Finin, T. (1988). Modeling the User in Natural Language Systems. In Computational Linguistics, 14:3, pp. 5-22.
Larsson, S., and Traum, D. (2000). Information State and Dialogue Management in the TRINDI Dialogue Move Engine Toolkit. In Natural Language Engineering, 1, pp. 1-17.
Levin, E., Pieraccini, R., and Eckert, W. (1997). A Stochastic Model of Computer-Human Interaction for Learning Dialog Strategies. In Proc. of Eurospeech, Rhodes, Greece, pp. 1883-1886.
Levin, E., Pieraccini, R., and Eckert, W. (1998). Using Markov Decision Process for Learning Dialogue Strategies. In Proc. of the IEEE, Transactions on Speech and Audio Processing, 8:1, pp. 11-23.
Levin, E., Pieraccini, R., and Eckert, W. (2000a). A Stochastic Model of Human Machine Interaction for Learning Dialog Strategies. In Proc. of the IEEE ICASSP, Istanbul, Turkey, pp. 1883-1886.
Levin, E., Narayanan, S., Pieraccini, R., Biatov, K., Bocchieri, E., Di Fabbrizio, G., Eckert, W., Lee, S., Pokrovsky, A., Rahim, M., Ruscitti, P., and Walker, M. (2000b). The AT&T-Darpa Communicator Mixed-Initiative Spoken Dialogue System. In Proc. of ICSLP, Beijing, China, pp. 122-125.
Lin, D. (1996). An Information-Theoretic Definition of Similarity. In Proc. of ICML, pp. 296-304.
Lin, B., and Lee, L. (2001). Computer-Aided Analysis and Design for Spoken Dialogue Systems Based on Quantitative Simulations. In Proc. of the IEEE, Transactions on Speech and Audio Processing, 9:5, pp. 534-548.
Litman, D., and Pan, S. (1999). Empirically Evaluating an Adaptable Spoken Dialogue System. In Proc. of UM, pp. 55-64.
Litman, D., Kearns, M. S., Singh, S., and Walker, M. A. (2000). Automatic Optimization of Dialogue Management. In Proc. of COLING'00, Saarbrücken, Germany, pp. 502-508.
Litman, D., and Pan, S. (2002). Designing and Evaluating an Adaptive Spoken Dialogue System. In User Modeling and User-Adapted Interaction, 12, pp. 111-137.
Littman, M. L., Diuk, C., and Strehl, A. L. (2005). A Hierarchical Approach to Efficient Reinforcement Learning. In Proc. of ICML - Workshop on Rich Representations for Reinforcement Learning, Bonn, Germany.
López-Cózar, R., De la Torre, A., Segura, J., and Rubio, J. (2003). Assessment of Dialogue Systems by Means of a New Simulation Technique. In Speech Communication, 40, pp. 387-407.
Manning, C., and Schütze, H. (2001). Foundations of Statistical Natural Language Processing. MIT Press.
McTear, M. (1998). Modelling Spoken Dialogues with State Transition Diagrams: Experiences with the CSLU Toolkit. In Proc. of ICSLP, Sydney, Australia, pp. 1223-1226.
McTear, M. (2004). Spoken Dialogue Technology: Toward the Conversational User Interface. Springer-Verlag.
Mitchell, T. (1997). Machine Learning. McGraw-Hill.
Papineni, K., Roukos, S., Ward, T., and Zhu, W. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proc. of ACL, pp. 311-318.
Parr, R., and Russell, S. (1998). Reinforcement Learning with Hierarchies of Machines. In Proc. of NIPS, Cambridge, MA, USA, pp. 1043-1049.
Parr, R. (1998). Hierarchical Control and Learning for Markov Decision Processes. PhD Thesis, University of California at Berkeley.


Pellom, B., Ward, W., and Pradham, S. (1999). The CU Communicator: An Architecture for Dialogue Systems. In Proc. of IEEE ASRU Workshop, Keystone, Colorado, USA, pp. 723-726.
Pellom, B., Ward, W., Hansen, J., Hacioglu, K., Zhang, J., Yu, X., and Pradham, S. (2001). University of Colorado Dialog Systems for Travel and Navigation. In Proc. of HLT, San Diego, USA.
Pieraccini, R., Caskey, S., Dayanidhi, K., Carpenter, B., and Phillips, M. (2001). ETUDE, A Recursive Dialogue Manager With Embedded User Interface Patterns. In Proc. of IEEE ASRU Workshop, Madonna di Campiglio, Italy.
Pietquin, O., and Renals, S. (2002). ASR System Modeling for Automatic Evaluation and Optimization of Dialogue Systems. In Proc. of the IEEE ICASSP, Orlando, USA, pp. 46-49.
Pietquin, O., and Dutoit, T. (2005). A Probabilistic Framework for Dialogue Simulation and Optimal Strategy Learning. To appear in Proc. of the IEEE, Transactions on Speech and Audio Processing.
Pineau, J., and Thrun, S. (2001). Hierarchical POMDP Decomposition for a Conversational Robot. In Workshop on Hierarchy and Memory in Reinforcement Learning (ICML), Williams College, MA, USA.
Polifroni, J., Seneff, S., Glass, J., and Hazen, T. (1998). Evaluation Methodology for a Telephone-Based Conversational System. In Proc. of LREC, pp. 42-50.
Potamianos, A., Emmicht, E., and Kuo, H. K. (2000). Dialogue Management in the Bell Labs Communicator System. In Proc. of ICSLP, Beijing, China, pp. 603-606.
Precup, D. (2001). Off-Policy Temporal Difference Learning with Function Approximation. In Proc. of ICML, San Francisco, CA, USA, pp. 417-424.
Precup, D. (2002). Temporal Abstraction in Reinforcement Learning. PhD Thesis, University of Massachusetts, Amherst.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Proc. of the IEEE, 77:2, pp. 257-286.
Rich, C., and Sidner, C. L. (1998). COLLAGEN: A Collaboration Manager for Software Interface Agents. In User Modeling and User-Adapted Interaction, 8(3/4), pp. 315-350.
Roy, N., Pineau, J., and Thrun, S. (2000). Spoken Dialogue Management Using Probabilistic Reasoning. In Proc. of ACL, Hong Kong.
Rudnicky, A., Thayer, E., Constantinides, P., Tchou, C., Shern, R., Lenzo, K., Xu, W., and Oh, A. (1999). Creating Natural Dialogues in the Carnegie Mellon Communicator System. In Proc. of Eurospeech, Budapest, Hungary, pp. 1531-1534.
Rudnicky, A., and Xu, W. (1999). An Agenda-Based Dialogue Management Architecture for Spoken Language Systems. In Proc. of IEEE ASRU Workshop, Keystone, Colorado, USA, pp. 337-340.
Russell, S., and Norvig, P. (2002). Artificial Intelligence: A Modern Approach. Prentice Hall.
Schatzmann, J., Georgila, K., and Young, S. (2005a). Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. In Proc. of Workshop on Discourse and Dialogue, Lisbon, Portugal.
Schatzmann, J., Stuttle, M. N., Weilhammer, K., and Young, S. (2005b). Effects of the User Model on Simulation-Based Learning of Dialogue Strategies. To appear in Proc. of IEEE ASRU Workshop, Cancun, Mexico.


Scheffler, K., and Young, S. (2000). Probabilistic Simulation of Human-Machine Dialogues. In Proc. of the IEEE ICASSP, Istanbul, Turkey, pp. 1217-1220.
Scheffler, K., and Young, S. (2001). Corpus-Based Simulation for Automatic Strategy Learning and Evaluation. In Workshop on Adaptation in Dialogue Systems (NAACL), Pittsburgh, Pennsylvania, USA.
Scheffler, K., and Young, S. (2002). Automatic Learning of Dialogue Strategy Using Dialogue Simulation and Reinforcement Learning. In Proc. of the HLT Conference, San Diego, USA.
Seneff, S., and Polifroni, J. (2000). Dialogue Management in the Mercury Flight Reservation System. In Proc. of ANLP/NAACL, Workshop on Conversational Systems, Seattle, USA.
Singh, S., Litman, D., Kearns, M., and Walker, M. (2002). Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. In Journal of Artificial Intelligence Research, 16, pp. 105-133.
Skantze, G. (2003). Exploring Human Error Handling Strategies: Implications for Spoken Dialogue Systems. In ISCA Tutorial and Research Workshop on Error Handling in Spoken Dialogue Systems, pp. 71-76.
Stallard, D. (2000). Evaluation Results for the Talk'n'Travel System. In Proc. of HLT, San Diego, USA, pp. 1-3.
Sutton, R., and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. In Artificial Intelligence, 112, pp. 181-211.
Sutton, R., Rafols, E. J., and Koop, A. (2005). Temporal Abstraction in Temporal Difference Networks. In Proc. of NIPS, Vancouver, Canada.
Sutton, S., Novick, D. G., Cole, R., and Fanty, M. (1996). Building 10,000 Spoken Dialogue Systems. In Proc. of ICSLP, Philadelphia, USA, pp. 709-712.
Tadic, V. (2001). On the Convergence of Temporal-Difference Learning with Function Approximation. In Machine Learning, 42, pp. 241-267.
Thompson, C., Goker, M., and Langley, P. (2004). A Personalized System for Conversational Recommendations. In Journal of Artificial Intelligence Research, 21, pp. 1-36.
Walker, M., Litman, D., Kamm, C., and Abella, A. (1997). PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In Proc. of ACL'97, pp. 271-280.
Walker, M. A. (2000). An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email. In Journal of Artificial Intelligence Research, 12, pp. 387-416.
Walker, M., Kamm, C., and Litman, D. (2000). Towards Developing General Models of Usability with PARADISE. In Natural Language Engineering, 1, pp. 1-16.
Walker, M., and Passonneau, R. (2001). DATE: A Dialogue Act Tagging Scheme for Evaluation of Spoken Dialogue Systems. In Proc. of HLT, San Diego, USA, pp. 1-8.
Walker, M., Passonneau, R., and Boland, J. (2001). Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems. In Proc. of ACL, pp. 515-522.
Walker, M., Rudnicky, A., Prasad, R., Aberdeen, J., Bratt, E., Garofolo, J., Hastie, H., L, A., Pellom, B., Potamianos, A., Passonneau, R., Roukos, S., Sanders, G., Seneff, S., and Stallard, D. (2002). DARPA Communicator: Cross-System Results for the 2001 Evaluation. In Proc. of ICSLP, Colorado, USA.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD Thesis, King's College.


Webb, A. (2002). Statistical Pattern Recognition. John Wiley and Sons Ltd.
Webb, G., Pazzani, M., and Billsus, D. (2001). Machine Learning for User Modeling. In User Modeling and User-Adapted Interaction, 11, pp. 19-29.
Wei, X., and Rudnicky, A. I. (2000). Task-Based Dialog Management Using an Agenda. In Proc. of ANLP/NAACL, Workshop on Conversational Systems, pp. 42-47.
Williams, J., and Young, S. (2005). Scaling Up POMDPs for Dialogue Management: The "Summary POMDP" Method. To appear in Proc. of IEEE ASRU Workshop, Cancun, Mexico.
Wright Hastie, H., Prasad, R., and Walker, M. (2002). Automatic Evaluation: Using a DATE Dialogue Act Tagger for User Satisfaction and Task Completion Prediction. In Proc. of LREC, pp. 641-648.
Young, S. (2000). Probabilistic Methods in Spoken Dialogue Systems. In Philosophical Transactions of the Royal Society, (Series A) 358(1769), pp. 1389-1402.
Young, S. (2002). Talking to Machines (Statistically Speaking). In Proc. of ICSLP, Colorado, USA, pp. 9-16.
Zhang, B., Cai, Q., Mao, J., Chang, E., and Guo, B. (2001). Spoken Dialogue Management as Planning and Acting Under Uncertainty. In Proc. of Eurospeech, Aalborg, Denmark, pp. 2169-2172.
Zue, V., and Glass, J. (2000). Conversational Interfaces: Advances and Challenges. In Proc. of the IEEE, Special Issue on Spoken Language Processing, 8, pp. 1166-1180.
Zukerman, I., and Albrecht, D. (2001). Predictive Statistical Models for User Modeling. In User Modeling and User-Adapted Interaction, 11, pp. 5-18.
Zukerman, I., and Litman, D. (2001). Natural Language Processing and User Modeling: Synergies and Limitations. In User Modeling and User-Adapted Interaction, 11, pp. 129-158.