Fast Feature Selection for Reinforcement-Learning-based Spoken Dialog Management: A Case Study

Lihong Li (LIHONG@CS.RUTGERS.EDU)
Department of Computer Science, Rutgers University, Piscataway, NJ 08854

Jason D. Williams (JDW@RESEARCH.ATT.COM)
Suhrid Balakrishnan (SUHRID@RESEARCH.ATT.COM)
AT&T Labs - Research, 180 Park Avenue, Building 103, Florham Park, NJ 07932

The goal of spoken dialog management is to create a dialog system that optimizes the actions it chooses based on its observations of the conversation. In a voice dialer application, for example, the dialog system asks the user/caller questions to obtain information such as the name, location, and phone type (cell, office, etc.) of the callee, and then transfers the call to the right person at the right phone type. Actions taken by this system may include: AskName (which asks the name of the callee), AskPhoneType (which asks the phone type of the callee), ConfirmName (which confirms the name of the callee with the user), CallTransferred (which transfers the call), etc. The system receives a -1 penalty for every action it takes, a large final reward (say, 20) if the call is correctly transferred, and a large penalty (say, -20) otherwise. The objective of the system is to choose actions that maximize the total reward it receives over the whole conversation.

In such a problem, observations of the conversation are usually the outputs of automatic speech recognition applied to the user's speech. Since speech recognition errors are common and difficult to detect, we view the user's true goal as a hidden variable and maintain a distribution over all possible user goals. From this distribution, we extract continuous features such as the probability of the most likely target callee. In addition, we use discrete features such as which types of phones a callee has listed and which dialog actions are available to the system, so our feature set includes both discrete and continuous features. Finally, it may be helpful to include similar features extracted from previous timesteps of the dialog, as well as higher-order features (elements of the cross product of the basic features). As a result, we end up with a set of hundreds or even thousands of features.

To deal with the large state space of this problem, we represent the value function as a linear combination of the features described above, and use a powerful algorithm known as LSPI (Lagoudakis and Parr, 2003) to learn a near-optimal linear value function from a set of training dialogs. Unfortunately, the computational complexity of LSPI is cubic in the number of features, rendering it impractical for most realistic dialog problems. We are thus interested in augmenting LSPI with a fast feature-selection mechanism that is suitable for our target application as well as many other real-world problems. We note that one could view dialog management as a POMDP optimization problem (Williams, 2006), since the user's true intention is unobservable, but it is challenging to scale POMDP-based learning algorithms to problems of this size.

Existing methods for feature selection in reinforcement learning are often too expensive (Parr et al., 2007; Mahadevan and Maggioni, 2007). Instead, we adopt the following heuristic: in each LSPI iteration, we (i) run temporal difference learning (Sutton, 1988) with experience replay on the training dialogs to obtain a rough linear value function, (ii) pick a small number of features that correspond to the largest weight magnitudes, and (iii) use this small subset of features to run the normal operations of the original LSPI.
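As a rough illustration, the sketch below implements this heuristic in Python under simplifying assumptions: each training sample is a tuple (phi_sa, r, phi_next), where phi_sa is the feature vector of the observed state-action pair, r is the immediate reward, and phi_next is a list of next-state feature vectors, one per available action (or None at the end of a dialog). The function and parameter names, step size, replay schedule, and regularization term are illustrative and are not taken from the paper.

```python
import numpy as np

def td_rough_weights(samples, w_policy, gamma=0.95, alpha=0.01, passes=5):
    # Step (i): TD(0) with experience replay over the training dialogs,
    # evaluating the policy that is greedy w.r.t. w_policy, to obtain a
    # rough linear Q-function over the full feature set.
    w = np.zeros_like(w_policy)
    for _ in range(passes):
        for phi_sa, r, phi_next in samples:
            if phi_next is None:                        # end of dialog
                q_next = 0.0
            else:
                a = int(np.argmax([w_policy @ p for p in phi_next]))
                q_next = w @ phi_next[a]
            w += alpha * (r + gamma * q_next - w @ phi_sa) * phi_sa
    return w

def top_k_features(w, k=400):
    # Step (ii): keep the k features with the largest weight magnitudes.
    return np.argsort(-np.abs(w))[:k]

def lstdq_on_subset(samples, idx, w_policy, gamma=0.95, reg=1e-3):
    # Step (iii): the usual LSPI policy-evaluation step (LSTDQ), restricted
    # to the selected features.  The linear solve is O(k^3), so a small k is
    # cheap, whereas solving with thousands of features is the bottleneck.
    k = len(idx)
    A, b = reg * np.eye(k), np.zeros(k)
    for phi_sa, r, phi_next in samples:
        phi = phi_sa[idx]
        if phi_next is None:
            phi_pi = np.zeros(k)
        else:
            a = int(np.argmax([w_policy @ p for p in phi_next]))
            phi_pi = phi_next[a][idx]
        A += np.outer(phi, phi - gamma * phi_pi)
        b += r * phi
    return np.linalg.solve(A, b)

def lspi_with_feature_selection(samples, num_features, k=400, iterations=10):
    w_policy = np.zeros(num_features)
    for _ in range(iterations):
        w_rough = td_rough_weights(samples, w_policy)    # (i)
        idx = top_k_features(w_rough, k)                 # (ii)
        w_sub = lstdq_on_subset(samples, idx, w_policy)  # (iii)
        w_policy = np.zeros(num_features)
        w_policy[idx] = w_sub       # greedy policy for the next iteration
    return w_policy
```

This is only a sketch of the procedure described above; details such as tie-breaking, the TD step size, and the number of replay passes would need tuning in practice.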

Such a heuristic approach, although simple, works well in a simulator of an AT&T internal voice dialer system. There are 50,000 AT&T employees in the directory, and each employee may have a cell phone number, an office phone number, or both, on record. Using a set of 3456 human-designed features, we pick the top 400 features in each iteration of LSPI with the approach above, and compare the task completion ratio (roughly, the percentage of correct call transfers) of LSPI to two policies: HC-Baseline is a hand-crafted baseline policy; RL-Baseline is learned by a model-based reinforcement-learning algorithm that is a common and successful approach in the spoken dialog literature (Williams, 2008). With a reasonable number of training dialogs, both reinforcement-learning methods produce policies better than the hand-crafted baseline, and our variant of LSPI with the feature-selection heuristic consistently outperforms RL-Baseline, which does not select features at all. Note that the original LSPI algorithm is intractable with such a large feature set. In the future, we would like to perform a more thorough empirical evaluation of our feature-selection method, relate the heuristic to other regularization techniques, and investigate feature selection in online reinforcement-learning algorithms such as Q-learning and Sarsa (Sutton and Barto, 1998).

[Figure 1 plots the task completion ratio (y-axis, 0.8 to 1.0) against the number of training dialogs (x-axis, 0 to 200) for HC-Baseline, RL-Baseline, and LSPI.]

Figure 1: Comparison of task completion ratios. Each data point is averaged over 10 runs; in each run a different training set is used to run LSPI, and the learned policy is evaluated on 1000 randomly generated dialogs.
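For concreteness, a minimal sketch of this evaluation protocol is given below. It assumes a hypothetical simulator hook run_dialog(policy) that returns True exactly when the simulated call is transferred to the right person and phone type, and a hypothetical train_and_eval(n) routine that trains a policy on n freshly generated training dialogs and returns its completion ratio; neither is part of the paper.

```python
import numpy as np

def task_completion_ratio(policy, run_dialog, num_dialogs=1000):
    # Fraction of simulated dialogs that end in a correct call transfer.
    wins = sum(bool(run_dialog(policy)) for _ in range(num_dialogs))
    return wins / num_dialogs

def learning_curve_point(train_and_eval, num_train_dialogs, runs=10):
    # One point in Figure 1: the completion ratio averaged over 10 runs,
    # each trained on an independently drawn set of training dialogs.
    return np.mean([train_and_eval(num_train_dialogs) for _ in range(runs)])
```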

References

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.

Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8:2169–2231, 2007.

Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael L. Littman. Analyzing feature generation for value-function approximation. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, pages 737–744, 2007.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, March 1998.

Jason D. Williams. Partially Observable Markov Decision Processes for Spoken Dialogue Management. PhD thesis, Cambridge University, Cambridge, UK, 2006.

Jason D. Williams. Integrating expert knowledge into POMDP optimization for spoken dialog systems. In Proceedings of the AAAI-08 Workshop on Advancements in POMDP Solvers, 2008.
