MODELING CONFIRMATIONS FOR EXAMPLE-BASED DIALOG MANAGEMENT

Kyungduk Kim1, Cheongjae Lee1, Donghyeon Lee1, Junhwi Choi1, Sangkeun Jung2 and Gary Geunbae Lee1

1 Department of Computer Science and Engineering, Pohang University of Science & Technology, Pohang, South Korea
2 UX Center, Samsung Electronics, Suwon, South Korea
{getta, lcj80, semko, chasunee, hugman, gblee}@postech.ac.kr

ABSTRACT

This paper proposes a method to model confirmations for example-based dialog management. To enable the system to provide a confirmation to the user at an appropriate time, we employed a multiple dialog state representation approach to keep track of user input uncertainty, and implemented a confirmation agent which decides when the information gathered from the user contains an error. We developed a car navigation dialog system to evaluate the proposed method. Evaluations with simulated dialogs show that our approach is useful for handling misunderstanding errors in example-based dialog management.

Index Terms— dialog management, dialog modeling, spoken dialog system, confirmation dialogs

1. INTRODUCTION

Error handling is an essential part of robust spoken dialog management. Speech input, which is the result of the Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU) modules, must be considered error-prone from the standpoint of a dialog manager because it usually contains recognition and understanding errors. Therefore, robust dialog managers need to be capable of handling recognition errors for the purpose of error recovery.

Confirmation is one of the most popular and natural error handling methods for resolving misunderstandings of a user's utterance. A misunderstanding error occurs when the system constructs an incorrect discourse-level interpretation of the user's turn. The system asks the user to confirm unclear task-related information so that it can find what the user really wants. McTear et al. have proposed an object-oriented approach to determine confirmation strategies by guaranteeing common ground between the user and the system [1].

Recently, Lee et al. proposed a data-driven dialog modeling approach called Example-based Dialog Modeling (EBDM) [2].
It adapted an agenda-based approach with n-best recognition hypotheses to improve robustness in follow-up research [3]. Several other approaches have been studied that exploit the n-best list in the dialog manager to improve robustness. Rather than Markov Decision Processes, Partially Observable Markov Decision Processes (POMDPs) have the potential to provide a probabilistic framework for robust dialog modeling, since they support n-best hypotheses to estimate the belief over dialog states [4, 5]. To improve the dialog manager's robustness, probabilistic dialog models have also been researched [6, 7, 8]. They usually take advantage of the information in the n-best list to track multiple dialog states. Maintaining a distribution over multiple dialog states has been shown to enhance robustness to recognition and understanding errors.

In this paper, we propose a method to model confirmations for EBDM to handle misunderstanding errors. We implemented a confirmation agent which engages confirmation strategies by maintaining multiple dialog state hypotheses. Furthermore, we employed a dynamic threshold approach to reduce useless confirmations by using dialog examples. We developed a dialog system for a car navigation service to evaluate the proposed method.
2. MODELING CONFIRMATIONS FOR EBDM

2.1. Overview

The EBDM, a generic dialog modeling method, is used as the base dialog manager of the system. To handle misunderstandings of the user's input, two additional modules operate before the EBDM searches for the task-related system action (Fig. 1). First, the dialog state representation module manages multiple dialog state hypotheses and updates their beliefs at each point in the dialog. Second, the confirmation agent, which conducts the actual confirmation process, decides whether each piece of information gathered from the user needs to be confirmed or not. This confirmation decision is based on the given dialog state hypotheses and on a confirmation strategy which is learned from simulated dialogs generated by the user simulator. In the learning phase, the confirmation agent models the confirmation strategy using a number of simulated dialogs generated by the user simulator. The confirmation agent also uses the dialog example database (DEDB) to reduce useless confirmations and thereby save dialog turns. If there is no need to confirm information, the confirmation agent passes the top dialog state hypothesis to the EBDM, which generates the task-related system action.

Fig. 1. Overall system operation (ASR, SLU, multiple dialog states representation, the confirmation agent as misunderstanding handler with its confirmation strategy, the EBDM with the DEDB, and the user simulator used in the learning phase).

2.2. Example-based Dialog Management

Our approach is implemented on top of the EBDM framework. EBDM is a data-driven dialog modeling method in which the next system action is predicted by finding the most semantically similar dialog example in a DEDB. To generalize the data, a dialog example is first indexed as a set of tuples that share the same semantic and discourse constraints. Details of how the EBDM framework predicts the next system action are described in [2].

Table 1. An example dialog state
User's utterance: Where is the hospital near Hyojadong?
Dialog state:
  User's intention: Dialog Act = wh_question, Main Goal = search_location
  Discourse history (slot-filling information): LOC_TYPE = hospital, LOC_ADDRESS = Hyojadong, LOC_NAME = (unfilled), ROUTE_TYPE = (unfilled)

Fig. 2. Updating the beliefs of dialog states. One dialog state s1 splits into a set of new states, one per recognized user action in the n-best list, by applying the new dialog act, main goal, and slot information; each new belief is P(s'1,j) = P(s1) · P(aj).

The EBDM framework is a simple and powerful approach for rapidly developing a spoken dialog system for multi-domain dialog processing [2]. Moreover, the dialog manager is able to handle errors using n-best recognition hypotheses
through follow-up research [3]. However, the system still needs to maintain multiple dialog state hypotheses to keep track of the uncertainty of the current dialog situation. Tracking a probability distribution over multiple dialog states has been shown to add robustness against recognition errors [6], and this capability can be used to model appropriate confirmations to handle misunderstanding errors, which were not dealt with in the previous EBDM framework.
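As a rough illustration of the example-retrieval step of EBDM described in Section 2.2 (the actual indexing scheme and similarity measure are those of [2] and are not reproduced here), the Python sketch below scores DEDB entries by how many semantic and discourse constraints they share with the current dialog situation. The record layout, field names, and scoring function are our own simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DialogExample:
    """One DEDB entry: the indexed user situation and the system action that followed it."""
    dialog_act: str                 # e.g. "wh_question"
    main_goal: str                  # e.g. "search_location"
    filled_slots: Dict[str, str]    # slot values observed in the user's utterance
    system_action: str              # e.g. "request(loc_address)"

def best_example(dedb: List[DialogExample], dialog_act: str,
                 main_goal: str, filled_slots: Dict[str, str]) -> DialogExample:
    """Return the DEDB entry most similar to the current dialog situation."""
    def similarity(ex: DialogExample) -> int:
        # Count matching semantic (act, goal) and discourse (slot) constraints.
        score = int(ex.dialog_act == dialog_act) + int(ex.main_goal == main_goal)
        score += sum(1 for k, v in filled_slots.items() if ex.filled_slots.get(k) == v)
        return score
    return max(dedb, key=similarity)
```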
2.3. Multiple Dialog State Representation

The system maintains multiple dialog state hypotheses to exploit the information acquired from the n-best list of ASR and SLU results. Our approach is inspired by the frame-based belief state representation framework [7]. A dialog state is defined by a set of state variables which are similar to the set of search key constraints of EBDM [2]: the dialog state consists of the user's intention (dialog act and main goal) and the discourse history, such as slot-filling information (Table 1). The system tracks multiple dialog states and their beliefs to represent the current dialog situation.

At the start of a dialog, there is only one root dialog state. As the dialog progresses, this root dialog state splits into smaller dialog states which carry more specific information gathered from the n-best list of recognized user actions. To obtain the confidence of a user's action, we adapted the n-best list-based confidence measuring method [9]. The system updates the current dialog states according to the recognized user's actions. If there is more than one hypothesis of the user's action, a dialog state may split into a number of new dialog states.
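To make this state representation concrete, the following is a minimal Python sketch of a dialog state hypothesis as described above and in Table 1; the class and field names are our own illustrative choices, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DialogState:
    """One dialog state hypothesis: the user's intention plus the discourse history."""
    dialog_act: str                                                 # e.g. "wh_question"
    main_goal: str                                                  # e.g. "search_location"
    slots: Dict[str, Optional[str]] = field(default_factory=dict)  # slot-filling history
    belief: float = 1.0                                             # p(s_i), belief in this hypothesis

# The state of Table 1 expressed with this sketch (a single root hypothesis).
table1_state = DialogState(
    dialog_act="wh_question",
    main_goal="search_location",
    slots={"LOC_TYPE": "hospital", "LOC_ADDRESS": "Hyojadong",
           "LOC_NAME": None, "ROUTE_TYPE": None},
)
```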
Fig. 3. Static and dynamic thresholds for confirmation. (a) With static thresholds, a slot belief below θ2 is rejected, a belief above θ1 is accepted, and a belief in between triggers a confirmation. (b) With dynamic thresholds θ'2 and θ'1, the confirmation range on the slot-belief axis becomes narrower.
The system multiplies the probability distribution over the current dialog states by the probabilities of the recognized user's actions to compute the beliefs of the next dialog states. If too many next states are generated, the system prunes the long tail of next states whose probabilities are small enough to ignore, so that a reasonable number of states is maintained, and then renormalizes the computed probabilities (Fig. 2).
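A minimal sketch of this belief update, reusing the illustrative DialogState class from the sketch in Section 2.3; the splitting, pruning, and renormalization follow the description around Fig. 2, while the function name, the n-best action tuple format, and the pruning limit are our own assumptions.

```python
from typing import Dict, List, Tuple

# One recognized user action from the n-best list:
# (dialog act, main goal, newly observed slots, probability of the action)
UserAction = Tuple[str, str, Dict[str, str], float]

def update_beliefs(states: List[DialogState],
                   nbest_actions: List[UserAction],
                   max_states: int = 20) -> List[DialogState]:
    """Split every current state by every n-best action, prune the tail, renormalize."""
    new_states: List[DialogState] = []
    for s in states:
        for act, goal, new_slots, p_action in nbest_actions:
            # P(s'_{i,j}) = P(s_i) * P(a_j), with the new act/goal/slot information applied.
            new_states.append(DialogState(
                dialog_act=act,
                main_goal=goal,
                slots={**s.slots, **new_slots},
                belief=s.belief * p_action,
            ))
    # Prune the long tail of states whose probabilities are small enough to ignore.
    new_states.sort(key=lambda st: st.belief, reverse=True)
    new_states = new_states[:max_states]
    # Renormalize so the remaining beliefs sum to one again.
    total = sum(st.belief for st in new_states)
    if total > 0.0:
        for st in new_states:
            st.belief = st.belief / total
    return new_states
```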
2.4. Confirmation Agent

The confirmation agent determines whether or not the system needs to provide a confirmation to the user. We adapted the notion of a concept belief to decide when a piece of the user's information is uncertain enough to require confirmation [10]. The belief of a slot having value c is computed as

b(c) = Σ_{si ∈ S} δ_c(si) · p(si),

where b(c) is the belief of the slot having value c, p(si) is the belief of dialog state si, and δ_c(si) = 1 if state si has value c, 0 otherwise. In other words, b(c), the belief of concept c, is the marginal probability of the dialog states that contain concept c.
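A short sketch of this computation, again using the illustrative DialogState class from Section 2.3: the slot belief is simply the sum of the state beliefs over the hypotheses whose slot carries the value in question.

```python
from typing import List

def slot_belief(states: List[DialogState], slot: str, value: str) -> float:
    """b(c): marginal probability of the dialog states whose `slot` carries `value`."""
    return sum(s.belief for s in states if s.slots.get(slot) == value)
```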
Fig. 3(a) shows the typical policy for confirmation: if the slot (or concept) belief of the top hypothesis is higher than a certain threshold θ1, accept the value as correct; alternatively, if the slot belief is very low, below θ2, reject the value. A confirmation is conducted when the slot belief is larger than θ2 but smaller than θ1 (Fig. 3(a)). However, these thresholds need to be optimized differently according not only to the label of the slot but also to the current dialog state. This means that some information which is not important for processing the task at the current point of the dialog does not have to be confirmed strictly and immediately, while other, critical information does. The Move On strategy in [10] has a similar motivation. When the Move On strategy is invoked as an error handling strategy, the system does not try to directly address the current error. Instead, the system ignores the current error slot and continues the dialog by moving on to a different system action, or by switching to an alternative dialog plan for accomplishing the same goal.

We believe that if a piece of information is not important or is redundant and has a high probability of being uttered by the user subsequently, then the belief range for confirmation should be narrower (Fig. 3(b)). If the system knows that the user is going to talk about the uncertain information, such as the [loc_type] slot and its value, again in a later turn, the system has no need to confirm the potentially redundant information immediately (the loc_type slot in Fig. 4). To know which slots are potentially redundant, the system exploits the dialog examples that are used to generate the next system action in the EBDM approach. A dialog example contains a pair of a dialog situation DS and the corresponding system action SA, and the user's intention in DS includes information about the slots that appeared in the user's utterance (Fig. 5). The system uses this information to adjust the thresholds for confirmation. For each dialog state, the system searches for the most similar dialog example DEt in the DEDB using the EBDM approach. The next dialog example DEt+1, which is linked with the extracted one, also contains slot information.

Fig. 4. An example dialog showing the necessity of dynamic thresholds. User turns are prefixed by U and system turns by S; the understanding results and associated confidence scores are shown in brackets. The user uttered the location type (loc_type) information in the first turn U1 with low confidence and uttered it again later (in U3 with static thresholds and in U2 with dynamic thresholds, respectively). If the system uses dynamic thresholds, one dialog turn is saved in this case.

Static thresholds (θ1 = 0.5, θ2 = 0.2)
U1: Where is the coffee shop? [loc_type=coffee shop] = 0.25 (uncertain)
S1: Coffee shop. Is that correct?
U2: Yes. [loc_type=coffee shop] = 0.9 (certain)
S2: Please specify the address.
U3: Coffee shop near Hyojadong. [loc_type=coffee shop] = 0.9 (certain), [loc_address=Hyojadong] = 0.7 (certain)
S3: There are three cafés near Hyojadong...

Dynamic thresholds (θ1 = 0.5, θ2 = 0.2, θ'1 = 0.4, θ'2 = 0.3)
U1: Where is the coffee shop? [loc_type=coffee shop] = 0.25 (reject-ignore)
S1: Please specify the address.
U2: Coffee shop near Hyojadong. [loc_type=coffee shop] = 0.9 (certain), [loc_address=Hyojadong] = 0.7 (certain)
S2: There are three cafés near Hyojadong...
Fig. 5. The system searches the DEDB to find redundant information. In this example, the slot loc_type appears in the next dialog examples DEt+1 of states s1 and s3, so b'(loc_type) = p(s1) + p(s3) = 0.5 + 0.2 = 0.7.
We believe that if the slot k appears in DEt+1 many times, the slot has a high probability of being uttered by the user at the next turn. The confirmation agent computes the dynamic thresholds θ'1 and θ'2 as follows:

θ'1 = θ1 − w · b'(k) · (θ1 − θ2)
θ'2 = θ2 + w · b'(k) · (θ1 − θ2)
b'(k) = Σ_{si ∈ S} δ'_k(si) · p(si)
where w is an empirical weight for adjusting the thresholds, b'(k) is the sum of the probabilities of the dialog states whose next dialog example DEt+1 contains slot k, and δ'_k(si) = 1 if slot k appears in the next dialog example of state si, 0 otherwise. When the system uses dynamic thresholds, the belief range for confirmation becomes narrower in proportion to w · b'(k).
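The following hedged sketch computes the dynamic thresholds and applies the accept/confirm/reject policy of Fig. 3. It assumes a helper appears_in_next_example(state, slot) that looks up whether slot k occurs in the next dialog example DEt+1 linked to the state's retrieved example; that lookup, the weight w, and the function names are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, List, Tuple

def dynamic_thresholds(states: List[DialogState],
                       slot: str,
                       appears_in_next_example: Callable[[DialogState, str], bool],
                       theta1: float = 0.5,
                       theta2: float = 0.2,
                       w: float = 1.0) -> Tuple[float, float]:
    """Narrow the confirmation range [theta2, theta1] in proportion to w * b'(k)."""
    # b'(k): marginal belief of the states whose next dialog example DE_{t+1} contains slot k.
    b_prime = sum(s.belief for s in states if appears_in_next_example(s, slot))
    theta1_dyn = theta1 - w * b_prime * (theta1 - theta2)
    theta2_dyn = theta2 + w * b_prime * (theta1 - theta2)
    return theta1_dyn, theta2_dyn

def decide(belief: float, theta1: float, theta2: float) -> str:
    """Fig. 3 policy: accept above theta1, reject below theta2, otherwise confirm."""
    if belief >= theta1:
        return "accept"
    if belief <= theta2:
        return "reject"
    return "confirm"

# With theta1 = 0.5, theta2 = 0.2 and w * b'(k) = 1/3, this yields
# (theta1', theta2') = (0.4, 0.3), the values used in the Fig. 4 example.
```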
The static thresholds are learned using dialog simulation with a user simulator and an ASR channel simulator. A simple user simulator is used to generate user responses to system actions. It has three main components: a user intention simulator, a user utterance simulator and an ASR channel simulator. The intention simulator, which generates the next user intention given the current discourse context, is implemented using the bi-gram based user model approach [11]. The user utterance simulator generates user utterances that express the generated intention using a template-based approach. The ASR channel simulator, which simulates speech recognition errors, is implemented based on [12]. We use a grid search to find the optimal θ1 and θ2 that maximize the average score of the simulated dialogs. The score function used for searching the static thresholds is similar to the reward function of reinforcement learning used in [4]: for each dialog, the system gets +20 points if the dialog is successfully completed, and one point is subtracted for each dialog turn.

Fig. 6. Average score computed using grid search over θ1 and θ2. The average score has the highest value when θ1 = 0.5 and θ2 = 0.2.

Fig. 7. Simulated environment for automatically evaluating the system.
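To make the threshold learning concrete, here is a minimal sketch of the dialog score and the grid search, assuming a simulate_dialogs(theta1, theta2) routine that runs the user and ASR channel simulators against the dialog manager and reports, for each simulated dialog, whether the task was completed and how many turns it took; that routine and the grid resolution are our own assumptions.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

def dialog_score(completed: bool, num_turns: int) -> float:
    """+20 points for a successfully completed dialog, minus one point per dialog turn."""
    return (20.0 if completed else 0.0) - num_turns

def grid_search_thresholds(simulate_dialogs: Callable[[float, float], Iterable[Tuple[bool, int]]],
                           step: float = 0.1) -> Tuple[float, float]:
    """Find theta1 > theta2 on a grid that maximize the average score of simulated dialogs."""
    grid = [round(i * step, 2) for i in range(int(round(1.0 / step)) + 1)]
    best, best_score = (1.0, 0.0), float("-inf")
    for theta1, theta2 in product(grid, grid):
        if theta2 >= theta1:
            continue  # the reject threshold must stay below the accept threshold
        outcomes = list(simulate_dialogs(theta1, theta2))
        avg = sum(dialog_score(c, n) for c, n in outcomes) / max(len(outcomes), 1)
        if avg > best_score:
            best, best_score = (theta1, theta2), avg
    return best
```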
3. EXPERIMENTS & RESULTS

A spoken dialog system for car navigation was developed to provide information about places of interest to the user. For this system, we collected a human-human dialog corpus of about 123 dialogs with 510 user utterances (turns). The system was evaluated automatically in a simulated environment (Fig. 7). The user simulator (both the intention simulator and the utterance simulator) was trained on the same corpus that was used to index the dialog example database. However, the simulator could randomly generate unseen patterns of user utterances based on a probabilistic model, and it can model speech recognition errors as well as SLU errors at a given error rate. We evaluated the performance of four dialog systems (Table 2):
Table 2. Dialog systems used for evaluating performance
System       | Multiple dialog states | Confirmations
EBDM         | X                      | X
+MS          | O                      | X
+MS.static   | O                      | Static thresholds
+MS.dynamic  | O                      | Dynamic thresholds

Fig. 8. Task completion rates (TCR, %) of the four dialog systems under various WER (%) conditions.
The original EBDM system was evaluated as a baseline; it supports only a single dialog state and cannot process confirmation dialogs. The three other systems, MS, MS.static and MS.dynamic, are able to maintain multiple dialog states as described in Section 2.3. MS.static uses static thresholds learned from the grid search, and MS.dynamic can vary the belief range for confirmations using dialog examples. We measured task completion rates (TCR) and average dialog lengths for these four dialog systems under various WER conditions, which were generated by the ASR channel simulator; 1000 simulated dialogs were generated for each WER condition.

We found that the task completion rate decreased linearly with the WER when the dialog manager maintained a single dialog state (EBDM in Fig. 8). When the system supports multiple dialog states, task completion rates remain above 85% even when the WER is larger than 30% (MS, MS.static and MS.dynamic in Fig. 8). The use of confirmations slightly increased the task completion rate, and the system using dynamic thresholds achieved the highest TCR in noisy conditions, especially when the WER was larger than 25%. There was not much difference in average dialog length between MS and MS.dynamic. In addition, the effect of the confirmations became larger when the WER was larger than 20%. The higher task completion rate indicates that confirmations make the system more robust when the environment is quite noisy or the speech recognizer performs poorly.
Fig. 9. Average dialog lengths (turns) of the four dialog systems under various WER (%) conditions.

Fig. 10. Ratio of turns that are used for confirmation.
However, when the system used static thresholds, the average dialog length was relatively high compared to the system using dynamic thresholds (MS.static vs. MS.dynamic in Fig. 9). This was caused by the confirmation overhead: during a confirmation, the system takes an extra dialog turn to confirm the value of a slot, and the system using static thresholds tends to confirm more frequently even when a confirmation is not necessary. We also measured how many confirmations were conducted in MS.static and MS.dynamic (Fig. 10). The number of confirmations increased with the WER in both systems; the systems invoked confirmations frequently when the WER was large because the uncertainty of the information acquired from a user's utterance usually increases in a noisy environment.
4. CONCLUSIONS

This paper proposes a method to model confirmations for EBDM. The system maintains multiple dialog state hypotheses to keep track of user input uncertainty, and the beliefs of the dialog states are used by the confirmation agent to find an appropriate point in the dialog at which to confirm unclear information. The experimental results show that our confirmation strategy helps the user complete a dialog successfully, and that supporting confirmation dialogs makes the system more robust in noisy environments. Dialog examples which are indexed for generating system actions in EBDM can also be used to reduce useless confirmations: the system adjusts the confirmation thresholds when it notices that some information from the user's utterance is uncertain but redundant. The simulated evaluations have shown that confirmation dialogs help the system achieve a higher TCR in the EBDM approach and that dynamic thresholds help the system reduce useless confirmations.
5. ACKNOWLEDGEMENTS

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0019523), and by the Industrial Strategic Technology Development Program, 10035252, Development of dialog-based spontaneous speech interface technology on mobile platform, funded by the Ministry of Knowledge Economy (MKE, Korea).
6. REFERENCES
[1] M. McTear, I. O'Neill, P. Hanna, and X. Liu, Handling errors and determining confirmation strategies – An objective-based approach, Speech Communication, 45(3), 2005.
[2] C. Lee, S. Jung, J. Eun, M. Jung, and G. G. Lee, Example-based dialog modeling for practical multi-domain dialog system, Speech Communication, 51(5):466-484, 2009.
[3] C. Lee, S. Jung, K. Kim, and G. G. Lee, Hybrid approach to robust dialog management using agenda and dialog examples, Computer Speech and Language, 24(4):609-631, 2010.
[4] O. Lemon and V. Rieser, Learning effective multimodal dialogue strategies from Wizard-of-Oz data: Bootstrapping and evaluation, in Proceedings of ACL 2008, 2008.
[5] J. D. Williams and S. Young, Partially observable Markov decision processes for spoken dialog systems, Computer Speech and Language, 21(2):393-422, 2007.
[6] J. D. Williams, Incremental partition recombination for efficient tracking of multiple dialog states, in Proceedings of ICASSP, Dallas, Texas, USA, 2010.
[7] K. Kim, C. Lee, S. Jung, and G. G. Lee, A frame-based probabilistic framework for spoken dialog management using dialog examples, in Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue (SIGdial 2008), Ohio, 2008.
[8] J. Henderson and O. Lemon, Mixture model POMDPs for efficient handling of uncertainty in dialogue management, in Proceedings of ACL 2008, Columbus, 2008.
[9] B. Rueber, Obtaining confidence measures from sentence probabilities, in Proceedings of EUROSPEECH 97, Greece, 1997.
[10] D. Bohus, Error awareness and recovery in conversational spoken language interfaces, Ph.D. thesis, Carnegie Mellon University, 2007.
[11] W. Eckert, E. Levin, and R. Pieraccini, User modeling for spoken dialogue system evaluation, in Proceedings of the IEEE ASR Workshop, 1997.
[12] S. Jung, C. Lee, K. Kim, M. Jeong, and G. G. Lee, Data-driven user simulation for automated evaluation of spoken dialog systems, Computer Speech and Language, 23(4):479-509, Oct 2009.