Integrating Topic Estimation and Dialogue History for Domain Selection in Multi-domain Spoken Dialogue Systems

Satoshi Ikeda, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno
Graduate School of Informatics, Kyoto University, Kyoto, Japan
{sikeda, komatani, ogata, okuno}@kuis.kyoto-u.ac.jp
Abstract. We present a method for robust domain selection against out-of-grammar (OOG) utterances in multi-domain spoken dialogue systems. Such utterances cause language-understanding errors because they fall outside the systems' limited grammar and vocabulary, and these errors degrade domain selection, which is critical for determining the system's response in a multi-domain spoken dialogue system. We first define a topic as the domain about which the user wants to retrieve information, and we estimate it as the user's intention. This topic estimation uses a large number of sentences collected from the Web together with Latent Semantic Mapping (LSM), and its results are reliable even for OOG utterances. We then integrate the topic estimation results with the dialogue history to construct a domain classifier that is robust against OOG utterances. The integration is motivated by the fact that the reliability of the dialogue history is often impaired by language-understanding errors caused by OOG utterances, for which topic estimation still provides useful information. Experimental results on 2191 utterances showed that our integrated method reduced domain selection errors by 14.3%.
1 Introduction
More and more novices are using spoken dialogue systems over the telephone. They often have difficulty using such systems because of automatic speech recognition (ASR) errors. These errors are often caused by out-of-grammar utterances, i.e., utterances containing expressions that the systems cannot accept with their grammar and vocabulary for language understanding. This is an increasingly important issue for multi-domain spoken dialogue systems because they deal with various tasks. We define a domain as a sub-system in a multi-domain spoken dialogue system. Domain selection, i.e., determining which sub-system should respond to a user's request, is essential for such systems. We previously presented a framework for domain selection that uses dialogue history [1]. However, when the user's utterance is out-of-grammar, this method cannot determine the unique domain that should respond to the utterance, because it obtains no useful information from such utterances.
To solve this problem, we adopted the following two approaches:

1. Topic estimation using a large amount of text collected from the Web and Latent Semantic Mapping (LSM) [2] (described in Section 3.1).
2. Integration of the topic estimation result and the dialogue history (described in Section 3.2).

We define a "topic" as the domain about which users want to find more information, and we estimate it as the user's intention. Using the estimation results makes domain selection reliable even for out-of-grammar utterances. Topic estimation and dialogue history provide complementary information. Topic estimation uses only information obtained from a single utterance, whereas the dialogue history takes the context into consideration. On the other hand, the dialogue history is often damaged by language-understanding errors caused by out-of-grammar utterances, whereas topic estimation results remain relatively reliable for such utterances. Therefore, integrating topic estimation and dialogue history should reduce the number of domain selection errors.
2 Managing Out-of-Grammar Utterances in Multi-domain Spoken Dialogue Systems

2.1 Architecture of Multi-domain Spoken Dialogue Systems
Multi-domain spoken dialogue systems deal with various tasks, such as searching for restaurants and retrieving bus information, and can handle the user's various requests through a single interface. A problem with such systems is the large amount of effort required to develop them. To reduce this effort, Lin et al. proposed an architecture with domain extensibility [3], which enables system developers to design each domain independently. This architecture is composed of several domains and a central module that selects an appropriate domain (domain selection) to generate a response. Because the central module does not manage the dialogues and each domain does, domain selection is essential to keep the domains independent of each other. Our multi-domain spoken dialogue system is based on this architecture and, as shown in Figure 1, has five domains.
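As an illustration of this kind of architecture, the sketch below shows one way a central module could dispatch an utterance to independently developed domain sub-systems. This is a minimal sketch, not the system described in the paper; the class and method names (DomainSubSystem, understand, respond) are hypothetical, and the selection rule is deliberately simplified to the best language-understanding score, whereas the paper's selector also uses dialogue history and topic estimation.

```python
from abc import ABC, abstractmethod

class DomainSubSystem(ABC):
    """One independently developed domain (e.g., restaurant, bus information)."""

    name: str

    @abstractmethod
    def understand(self, asr_result: str) -> float:
        """Return a language-understanding score for the ASR result."""

    @abstractmethod
    def respond(self, asr_result: str) -> str:
        """Generate a response, managing this domain's own dialogue state."""

class CentralModule:
    """Selects which domain should respond; it does not manage dialogues itself."""

    def __init__(self, domains: list[DomainSubSystem]):
        self.domains = domains

    def handle(self, asr_result: str) -> str:
        # Simplified domain selection: the domain with the best LU score.
        selected = max(self.domains, key=lambda d: d.understand(asr_result))
        return selected.respond(asr_result)
```

Because each sub-system only exposes a scoring and a response interface, a new domain can be added without changing the central module, which is the extensibility property the architecture aims for.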
2.2 Definition of Topics for Managing Out-of-Grammar Utterances
Out-of-grammar utterances cause language-understanding errors because they fall outside the system's limited grammar and vocabulary, and they also damage domain selection. We define in-domain utterances as utterances that at least one of the domains in a multi-domain system can accept, and out-of-grammar utterances as those that none of the domains in the system can. Therefore, out-of-grammar utterances cannot be handled by either conventional frameworks [3,4] or our previous system [1]. To reduce domain selection errors, we define the term "topic" for out-of-grammar utterances.
Fig. 1. Architecture for multi-domain spoken dialogue system (a central module with a speech recognizer and domain selector, and five sub-systems: restaurant, sightseeing, bus information, hotel, and weather information)

Fig. 2. Relation between domains and topics: the restaurant domain covers utterances acceptable to the restaurant grammar (e.g., "Tell me about restaurants near the Gion area"), while the restaurant topic also covers utterances that the restaurant grammar cannot accept but that users want the restaurant domain to answer (e.g., "Tell me about Japanese restaurants with a relaxed atmosphere")
A topic is defined as the domain from which users want to retrieve information, and we estimate it as the user's intention. The relation between domains and topics is shown in Figure 2. Lane et al. developed a method for detecting utterances whose topics cannot be handled by the system [5]. They estimated topics and detected such utterances with a Support Vector Machine or linear discriminant analysis. However, their method lacks domain extensibility because it requires dialogue corpora to be collected in advance, which takes a lot of effort. Our method does not impair domain extensibility because it automatically collects training data from Web texts and does not need dialogue corpora collected in advance.
2.3 Robust Domain Selection against Out-of-Grammar Utterances
We previously presented a domain classifier based on a 3-class categorization [1]: (i) the domain in which the previous response was made, (ii) the domain in which the ASR result with the highest language-understanding score can be accepted, and (iii) other domains. Conventional methods [3,4] considered only the former two. However, when the user's utterance is out-of-grammar, language-understanding errors mean that this method cannot determine the unique domain that should respond to the utterance. An example of such a situation is shown in Figure 3. Here, utterance U2 relates to the restaurant domain, but an ASR error occurred because of an out-of-vocabulary word, which is not included in the vocabulary for language understanding. As a result of the error, the domain with the highest language-understanding score for U2 is the sightseeing domain.
U1: Tell me the address of Holiday Inn. (domain: hotel)
S1: The address of Holiday Inn is ...
U2: I want Tamanohikari near there. (domain: restaurant)
    (Tamanohikari, the name of a liquor, is an out-of-vocabulary word and is misrecognized as the sightseeing spot Tamba-bashi, the name of a place. (domain: sightseeing))
S2 (by method [1]): I do not understand what you said. (domain: others)
S2 (by our method): I do not understand what you said. You can ask about several conditions such as location, price, and food type about restaurants. For example, you can say "Tell me restaurants near the Gion area". (domain: restaurant)

Fig. 3. Example of dialogue including out-of-grammar utterances
Fig. 4. Overview of domain selection: a grammar-based speech recognizer feeds the language-understanding (LU) module, a statistical speech recognizer feeds the topic estimation (TE) module built from Web text with LSM, and their features, together with the dialogue history, drive the 4-class selection among (I) the domain of the previous response, (II) the domain with the highest LU score, (III) the domain with the highest TE score, and (IV) domains other than (I), (II), and (III)
In this case, the domain that should ideally be selected is neither the previous domain (hotel) nor the domain with the highest language-understanding score (sightseeing). The system can detect case (iii) with our previous method [1], but, because the concrete domain is not determined, it cannot generate a concrete response, as shown by S2 (by method [1]). We therefore integrated topic estimation with the domain selection using dialogue history presented in our previous paper [1]. An overview of our domain selection is shown in Figure 4. In addition to our previous features based on the dialogue history and language-understanding results [1], we introduce features obtained by topic estimation. These features improve the accuracy of domain selection for out-of-grammar utterances. We then define domain selection as the following 4-class categorization: (I) and (II), which are the same as (i) and (ii) of the method in [1], (III) the domain with the highest score for topic estimation, and (IV) other domains. This 4-class categorization increases the number of utterances for which the system can select a uniquely correct domain, which enables the system to generate a concrete response. This is shown as S2 (by our method) in Figure 3: the system does not accept the language-understanding result for U2, which contains errors, and instead provides help messages based on the domain (in this case, restaurant) derived from the topic estimation result.
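The response policy implied by Figure 3 and this 4-class categorization can be sketched as follows. This is an illustrative sketch only: the function names, the help-message templates, and the bracketed domain tags are assumptions, not the actual implementation.

```python
HELP_MESSAGES = {
    # Hypothetical help templates keyed by domain name.
    "restaurant": ("You can ask about several conditions such as location, price, "
                   "and food type. For example: 'Tell me restaurants near the Gion area.'"),
    "hotel": "You can ask about hotel locations, prices, and vacancies.",
}

def answer_in_domain(domain, lu_result):
    # Placeholder for each domain's own dialogue manager.
    return f"(domain-specific answer in {domain} using slots {lu_result})"

def generate_response(selected_class, previous_domain, lu_best_domain,
                      te_best_domain, lu_result):
    """Map the 4-class domain selection result to a system action.

    (I)   keep the previous domain; (II) accept the best-LU domain;
    (III) the LU result is unreliable, so give help for the topic-estimation
          domain (response S2 'by our method' in Figure 3);
    (IV)  nothing is reliable, so ask the user to rephrase.
    """
    if selected_class == "I":
        return answer_in_domain(previous_domain, lu_result)
    if selected_class == "II":
        return answer_in_domain(lu_best_domain, lu_result)
    if selected_class == "III":
        help_text = HELP_MESSAGES.get(te_best_domain, "Please rephrase your request.")
        return "I do not understand what you said. " + help_text
    return "I do not understand what you said. Could you rephrase your request?"
```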
3 Domain Selection Using Dialogue History and Topic Estimation

3.1 Topic Estimation with Domain Extensibility
We estimate topics by calculating the closeness between the user's utterance and training data collected from the Web, using LSM [6]. The topic estimation module in Figure 4 gives a brief overview of the topic estimation process. Our system has five topics corresponding to the five domains (restaurant, sightseeing, bus information, hotel, and weather information) and a command topic, which corresponds to command utterances for the system such as "yes" and "undo". Topic estimation consists of the following two parts; more details are given in our previous paper [6].

Collecting training data from the Web: We collected Web texts for each of the five topics other than the command topic by using a tool for developing language models [7]. First, we manually prepared about 10 keywords and several hundred sentences related to each topic. The tool then automatically retrieved 100,000 sentences related to the topic by using the keywords, while filtering out irrelevant sentences based on the prepared sentences. We added 10,000 sentences generated by each domain grammar to this training data. For the command topic, we manually prepared 175 sentences as training data. We randomly divided the training data into d sets (d = 20) for each topic to form the training documents.

Using LSM to remove the effect of noise from the training data: We used LSM [2] because the training data collected from the Web contain documents on other topics as noise. Latent Semantic Mapping is suitable for expressing the conceptual topic of a document; it removes the effect of noise from such data and allows robust topic estimation. We first constructed the M × N co-occurrence matrix between the words and the training documents, where M is the vocabulary size and N is the total number of documents. We applied singular value decomposition to the matrix and calculated the k-dimensional vectors of all the documents. The size of the co-occurrence matrix we constructed was M = 67533 and N = 120, with k = 50. The degree of closeness between a topic and a user's utterance is defined as the maximum cosine distance between the k-dimensional vector of an ASR result of the user's utterance and those of the d training documents for that topic. The ASR for topic estimation uses a statistical language model trained on the collected Web texts, unlike the grammar-based recognizer used for language understanding. The resulting topic is the one with the highest closeness to the user's utterance.
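A minimal sketch of this LSM-based closeness computation is shown below, assuming the training documents have already been collected and tokenized. The use of NumPy, the function names, and the folding-in of an utterance as a bag of words are implementation assumptions; the paper itself only specifies the co-occurrence matrix, the SVD, and the maximum cosine measure over each topic's d training documents.

```python
import numpy as np
from collections import Counter

def build_cooccurrence_matrix(docs, vocab):
    """Word-by-document count matrix (M x N), like the paper's co-occurrence matrix."""
    index = {w: i for i, w in enumerate(vocab)}
    mat = np.zeros((len(vocab), len(docs)))
    for j, words in enumerate(docs):
        for w, c in Counter(words).items():
            if w in index:
                mat[index[w], j] = c
    return mat, index

def train_lsm(mat, k=50):
    """Truncated SVD; returns the word mapping, singular values, and k-dim document vectors."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :].T      # doc_vectors: one k-dim row per document

def fold_in(u_k, s_k, index, words):
    """Map an ASR result (treated as a bag of words) into the same k-dimensional space."""
    q = np.zeros(u_k.shape[0])
    for w, c in Counter(words).items():
        if w in index:
            q[index[w]] = c
    return (q @ u_k) / s_k

def estimate_topic(words, u_k, s_k, index, doc_vectors, doc_topics):
    """Closeness of a topic = max cosine similarity to that topic's d training documents."""
    q_hat = fold_in(u_k, s_k, index, words)
    denom = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_hat) + 1e-12
    cos = doc_vectors @ q_hat / denom
    closeness = {t: float(max(cos[i] for i, dt in enumerate(doc_topics) if dt == t))
                 for t in set(doc_topics)}
    return max(closeness, key=closeness.get), closeness
```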
3.2 Integrating Dialogue History and Topic Estimation
We define domain selection as the following 4-class categorization: (I) the previous domain, (II) the domain with the highest score for language understanding, (III) the domain with the highest score for topic estimation, and (IV) other domains.
Table 1. Features representing confidence in previous domain

P1: number of affirmatives after entering the domain
P2: number of negations after entering the domain
P3: whether tasks have been completed in the domain (whether to enter "requesting detailed information" in a database search task)
P4: whether the domain appeared before
P5: number of changed slots after entering the domain
P6: number of turns after entering the domain
P7: ratio of changed slots (= P5/P6)
P8: ratio of user's negative answers (= P2/(P1 + P2))
P9: ratio of user's negative answers in the domain (= P2/P6)
P10: states in tasks

Table 2. Features of ASR results

U1: acoustic score of the best candidate of ASR results for LU interpreted in (I)
U2: posterior probability of N-best candidates of ASR results for LU interpreted in (I)
U3: average of word confidence scores for the best candidate in (I)
U4: acoustic score of the best candidate of ASR results for LU in (II)
U5: posterior probability of N-best candidates of ASR results for LU interpreted in (II)
U6: average of word confidence scores for the best candidate of ASR results for LU in (II)
U7: difference of acoustic scores between candidates selected as (I) and (II) (= U1 - U4)
U8: ratio between posterior probability of N-best candidates in (I) and that in (II) (= U2/U5)
U9: ratio between averages of word confidence scores for the best candidate in (I) and that in (II) (= U3/U6)

LU: Language Understanding.
We constructed a domain classifier using machine learning. Here, we describe the features used to construct it. In addition to the information listed in Tables 1, 2, and 3, which was used in our previous work [1], we adopted the features listed in Table 4. Using this information enables the system to select correct domains for out-of-grammar utterances. The features T1 to T6 are defined so that they represent the confidence in the topic estimation result. We define the confidence measure of topic T, used in T2 and T4, as CM_T = closeness_T / Σ_j closeness_{t_j}, where the t_j are the topics handled by the system and closeness_t is the degree of closeness between topic t and the user's utterance. We also adopted the features T7 to T9 to represent the relation between (I), (II), and (III). For example, if the topic estimation result is the same as (I), the system prefers (I). We defined feature T10 because an utterance whose duration is too short often causes errors in topic estimation.
Table 3. Features representing situations after domain selection

C1: dialogue state after selecting (I)
C2: whether the language-understanding result is affirmative after selecting (I)
C3: whether the language-understanding result is negative after selecting (I)
C4: number of changed slots after selecting (I)
C5: number of common slots (name of place, here) changed after selecting (I)
C6: dialogue state after selecting (II)
C7: whether the language-understanding result is affirmative after selecting (II)
C8: whether the language-understanding result is negative after selecting (II)
C9: number of changed slots after selecting (II)
C10: number of common slots (name of place, here) changed after selecting (II)
C11: whether (II) has appeared before
Table 4. Features of topic estimation result

T1: closeness between (III) and ASR result for TE
T2: confidence measure of (III)
T3: closeness between (II) and ASR result for TE
T4: confidence measure of (II)
T5: difference of closeness to ASR result for TE between (II) and (III) (= T1 - T3)
T6: difference of confidence measures between (II) and (III) (= T2 - T4)
T7: whether (III) is the same as (II)
T8: whether (III) is the same as (I)
T9: whether (III) is the same as "command"
T10: duration of ASR result for TE (number of phonemes in recognition result)
T11: acoustic score of ASR result for TE
T12: difference of acoustic scores per phoneme between candidates selected as (III) and (I) (= (T11 - U1)/T10)
T13: difference of acoustic scores per phoneme between candidates selected as (III) and (II) (= (T11 - U4)/T10)

TE: Topic Estimation.
The features T11 to T13 represent whether the user's utterance is out-of-grammar [8]. If the utterance seems to be out-of-grammar, the system does not prefer (II).
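The confidence measure CM_T and a subset of the Table 4 features could be computed along the following lines. This is a hedged sketch: the dictionary-based representation of closeness scores, the argument names, and the handling of missing values are assumptions made for illustration.

```python
def confidence_measure(topic, closeness):
    """CM_T = closeness_T / sum_j closeness_{t_j} (the definition given in the text)."""
    total = sum(closeness.values())
    return closeness.get(topic, 0.0) / total if total > 0 else 0.0

def topic_estimation_features(closeness, domain_I, domain_II,
                              num_phonemes, acoustic_te,
                              acoustic_lu_I, acoustic_lu_II):
    """Features in the spirit of Table 4 (T1-T13) for a single utterance."""
    domain_III = max(closeness, key=closeness.get)     # topic with the highest closeness
    t1 = closeness[domain_III]
    t2 = confidence_measure(domain_III, closeness)
    t3 = closeness.get(domain_II, 0.0)
    t4 = confidence_measure(domain_II, closeness)
    return {
        "T1": t1, "T2": t2, "T3": t3, "T4": t4,
        "T5": t1 - t3, "T6": t2 - t4,
        "T7": int(domain_III == domain_II),
        "T8": int(domain_III == domain_I),
        "T9": int(domain_III == "command"),
        "T10": num_phonemes,
        "T11": acoustic_te,
        "T12": (acoustic_te - acoustic_lu_I) / num_phonemes if num_phonemes else 0.0,
        "T13": (acoustic_te - acoustic_lu_II) / num_phonemes if num_phonemes else 0.0,
    }
```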
4 Experimental Evaluation

4.1 Dialogue Data for Evaluation
We evaluated our method using dialogue data collected from 10 subjects [1], containing 2191 utterances. The data were collected with the following procedure. First, to get accustomed to the timing of speaking to the system, the subjects used it by following a sample scenario. They then used the system by following three scenarios, each of which mentioned at least three domains. We used Julian, a grammar-based speech recognizer [9], for language understanding. Its grammar rules correspond to those used in the language-understanding modules of each domain.
Table 5. Features that survived feature selection

Our method: P2, P3, P4, P5, P6, P7, P9, P10, U2, U3, U5, U6, C3, C6, C8, C10, C11, T2, T3, T4, T5, T7, T8, T9, T10, T11, T12
Baseline method: P1, P4, P5, P8, P9, P10, U1, U2, U3, U5, U6, U7, U9, C8, C9, C11
We also used Julius, a speech recognizer based on a statistical language model [9], for topic estimation. Its language model was constructed from the training data collected for topic estimation. A 3000-state Phonetic Tied-Mixture (PTM) model [9] was used as the acoustic model. The ASR accuracies of the two recognizers were 63.3% and 69.6%, respectively. We used C5.0 [10] as the classifier. To construct it, we used features selected by backward stepwise selection, in which features are removed from the feature set one by one and a feature survives if domain classification performance degrades when it is removed (a sketch of this procedure is given after the label definitions below). In our method, we selected from the features listed in Tables 1, 2, 3, and 4. Table 5 lists the selected features. The performance was calculated with 10-fold cross validation. Accuracies for domain selection were calculated per utterance. When several domains had the same score after domain selection, one domain was randomly selected from them. Reference labels for domain selection were given by hand for each utterance on the basis of the domains the system had selected and transcriptions of the user's utterances:

Label (I): the correct domain for the user's utterance is the same as the domain in which the previous system response was made.
Label (II): except for case (I), the correct domain for the user's utterance is the domain in which the ASR result for language understanding with the highest score among the N-best candidates can be interpreted.
Label (III): except for cases (I) and (II), the correct domain for the user's utterance is the domain that has the maximum closeness to the ASR result used for topic estimation.
Label (IV): cases other than (I), (II), and (III).
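As mentioned above, features were chosen by backward stepwise selection with 10-fold cross validation. The sketch below illustrates that procedure; a scikit-learn decision tree stands in for C5.0, and the stopping criterion and accuracy comparison are assumptions rather than the paper's exact setup.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def backward_stepwise_selection(X, y, feature_names):
    """Remove features one at a time; a feature is dropped only if removing it
    does not degrade 10-fold cross-validated accuracy.

    X is assumed to be a NumPy array of shape (n_utterances, n_features).
    """
    selected = list(range(X.shape[1]))
    best_acc = cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for i in list(selected):
            trial = [j for j in selected if j != i]
            acc = cross_val_score(DecisionTreeClassifier(), X[:, trial], y, cv=10).mean()
            if acc >= best_acc:          # removal does not hurt, so drop the feature
                selected, best_acc, improved = trial, acc, True
                break
    return [feature_names[j] for j in selected], best_acc
```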
4.2 Evaluation of Domain Selection
We first calculated the accuracy of domain selection with our method. Table 6 shows the classification results as a confusion matrix. There were 464 domain selection errors with our method. Our method enabled the system to select the correct domain for 37 of the 131 utterances of (III); for these utterances, the conventional method cannot select the correct domain. For example, the topic estimation results let us select the correct domain even for the utterance with Label (III) "Is any bus for Kyoudaiseimonmae (a place name) in service?", which contains the out-of-vocabulary expression "in service". Nevertheless, the recall for (III) was not very high because the number of utterances of (I) was much larger than that of (III), so the classifier was trained in a way that classifies most utterances as (I). Our method also selected the correct domain for 84 of the 238 utterances of (IV).
Table 6. Confusion matrix in our domain selection

Reference label \ Output            (I)            (II)          (III)        (IV)          Total (recall)
(I): in previous response           1348           34            23           37            1442 (93.5%)
(II): with highest score for LU     93             258 + 10†     14           5             380 (67.9%)
(III): with highest score for TE    81             7             37           6             131 (28.2%)
(IV): others                        130            11            13           84            238 (35.5%)
Total (precision)                   1652 (81.6%)   320 (80.6%)   87 (42.5%)   132 (63.6%)   2191 (78.8%)

LU: Language Understanding. TE: Topic Estimation. †: These include 10 errors because of random selection when there were several domains having the same highest scores.
Table 7. Comparison between our method and baseline method (output / reference label)

Method \ Output   (I): previous   (II): with highest score for LU   (IV): others   Total
Baseline          1303/1442       238/380                           131/369        1672/2191
Our method        1348/1442       258/380                           140/369        1746/2191
These are the utterances for which the dialogue history is not reliable. In fact, an investigation of these 84 utterances correctly classified as (IV) revealed that for 83 of them the domain selected for the previous utterance was incorrect. By detecting (IV), we can avoid successive domain selection errors.

We also compared the performance of our method with that of the baseline method described below.

Baseline method: A domain was selected on the basis of our previous method [1]. After merging the utterances with Labels (III) and (IV), we constructed a 3-class domain classifier using the dialogue data described in Section 4.1, with features selected from those listed in Tables 1, 2, and 3. The selected features are listed in Table 5.

Table 7 lists the classification accuracy per reference label, calculated under the baseline conditions (3-class classification). The domain selection error rate was 23.7% (= 519/2191) for the baseline method and 20.3% (= 445/2191) for our method, giving an error reduction rate of 14.3% (= 74/519). Adopting the features obtained from the topic estimation results improved the accuracy for all labels over the baseline. These results also show that our method selects a uniquely correct domain for more utterances than the baseline method does. As listed in Table 7, the number of utterances for which the system selected a uniquely correct domain with the baseline method was 1541 (= 1303 + 238); as listed in Table 6, with our method it was 1643 (= 1348 + 258 + 37), an increase of 102 utterances over the baseline.

We also investigated which features played an important role in the domain classifier by calculating how much the number of errors increased when each feature was removed. Table 8 lists the top 10 features by the increase in the number of errors when the feature was removed.
Table 8. Increased number of errors when feature was removed

Removed feature:              U8   P9   T7   U6   T2   C8   U3   P5   T10   T12
Increased number of errors:   86   67   62   58   47   43   40   40   37    33
Four of these features are related to topic estimation, which shows that the features obtained from topic estimation were useful for domain selection.
5 Conclusion
We developed a method of domain selection that is robust against out-of-grammar utterances in multi-domain spoken dialogue systems. By integrating topic estimation results and dialogue history, our method enables the system to generate appropriate responses for utterances that it cannot accept with its grammar. Experimental results on 2191 utterances showed that our method reduced domain selection errors by 14.3% compared with a baseline method. Some issues remain in making multi-domain systems robust against out-of-grammar utterances. When the system selects either (III), the domain with the highest score for topic estimation, or (IV), other domains, it obtains no language-understanding result. We therefore need to develop effective dialogue management, such as providing help messages, on the basis of the selected domain information.

Acknowledgements. We are grateful to Mr. Teruhisa Misu and Prof. Tatsuya Kawahara of Kyoto University for allowing us to use the document-collecting program they developed [7]. We are also grateful to Dr. Mikio Nakano of Honda Research Institute Japan and Mr. Naoyuki Kanda for their cooperation in developing the multi-domain system used to collect the evaluation data.
References

1. Komatani, K., Kanda, N., Nakano, M., Nakadai, K., Tsujino, H., Ogata, T., Okuno, H.G.: Multi-domain spoken dialogue system with extensibility and robustness against speech recognition errors. In: Proc. SIGDial, pp. 9–17 (2006)
2. Bellegarda, J.R.: Latent semantic mapping. IEEE Signal Processing Magazine 22(5), 70–80 (2005)
3. Lin, B., Wang, H., Lee, L.: A distributed agent architecture for intelligent multi-domain spoken dialogue systems. In: Proc. ASRU (1999)
4. O'Neill, I., Hanna, P., Liu, X., McTear, M.: Cross domain dialogue modelling: an object-based approach. In: Proc. ICSLP, pp. 205–208 (2004)
5. Lane, I.R., Kawahara, T., Matsui, T., Nakamura, S.: Topic classification and verification modeling for out-of-domain utterance detection. In: Proc. ICSLP, pp. 2197–2200 (2004)
6. Ikeda, S., Komatani, K., Ogata, T., Okuno, H.G.: Topic estimation with domain extensibility for guiding user's out-of-grammar utterance in multi-domain spoken dialogue systems. In: Proc. Interspeech, pp. 2561–2564 (2007)
7. Misu, T., Kawahara, T.: A bootstrapping approach for developing language model of new spoken dialogue systems by selecting Web texts. In: Proc. ICSLP, pp. 9–12 (2006)
8. Komatani, K., Fukubayashi, Y., Ogata, T., Okuno, H.G.: Introducing utterance verification in spoken dialogue system to improve dynamic help generation for novice users. In: Proc. SIGDial, pp. 202–205 (2007)
9. Kawahara, T., Lee, A., Takeda, K., Itou, K., Shikano, K.: Recent progress of open-source LVCSR engine Julius and Japanese model repository. In: Proc. ICSLP, pp. 3069–3072 (2004)
10. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993), http://www.rulequest.com/see5-info.html