Automatic Learning of Sentence Dependencies in Spoken Dialogues

Mauro Cettolo (*), Anna Corazza (*) and Renato De Mori (**)
(*) IRST - Istituto per la Ricerca Scientifica e Tecnologica, I-38050 Povo (Trento), Italy
(**) School of Computer Science, McGill University, 3480 University Str., Montreal, Quebec, Canada, H3A 2A7

ABSTRACT

This paper describes a strategy followed by the dialogue manager of a database inquiry system to assess the reliability of input utterances. For this purpose, binary classification trees are used, in such a way that the assessment capability can be automatically learnt from samples. As a starting point, such trees were trained to classify ATIS-3 sentences in terms of the A, D, and X classes; experimental results show that this assessment technique can be effectively exploited and integrated into the dialogue manager under development.

1. INTRODUCTION

In this paper, the problem of data retrieval through a speech-based interface is considered. Usually, this task is performed by a pipeline: the speech recognizer outputs one or more word sequence hypotheses, or a word lattice, which are then processed by an understanding module. The goal of the latter module is to obtain all the information necessary to formulate a query to the database. Often, in real applications, a single utterance cannot provide all the information necessary for a complete request. This may happen for several reasons: some words are lost because of a recognition error or a user's mistake, or the user forgets some constraints or chunks of information. In these cases, if the system asked the user to repeat the complete sentence together with the missing information, the interaction would become long, clumsy and also prone to further errors. Moreover, shorter sentences are recognized with higher accuracy. This paper deals with how to achieve clarification, completion and similar tasks with a short dialogue that tries to minimize further effort on the user's part. Examples of these limited dialogues can be found in the ATIS-3 corpus [4]; therefore, this corpus will be used to train and evaluate the proposed algorithms. Many authors of papers related to dialogue modeling claim that this kind of interaction is structured, predictable and domain independent. In particular, these properties characterize the type of actions a user intends to execute in the so called discourse plans.

Different authors define different taxonomies of discourse plans [10, 9]. Examples of discourse plans are to continue the initial plan or to start a subdialogue; the goals of subdialogues can be clarification, confirmation or correction. For the sake of completeness, the possibility of switching the topic should also be considered. Nevertheless, our intuition is that this kind of discourse plan, certainly important for human-human interaction modeling, is very rare, if not absent, in the limited data retrieval application considered here. To solve the general problem of the user's goal recognition, a basic ability of the dialogue manager should be the categorization of each utterance in terms of discourse plans. Our claim is that an (even rough) classification based on rules learnt automatically from training data can be effectively integrated with hand-designed ones. Further, the more relevant the automatic part is, the more robust and portable across different domains the final system will be. As a first step in the construction of a new dialogue manager, focus was put on an algorithm to perform this automatic classification. One powerful statistical tool for classification is given by binary classification trees (such as CART, presented in [2] and [8]). Using a version of this basic tool, the task of associating each utterance with one of the three classes proposed in ATIS-3 was considered: class A for context-free interpretation, class D for context-dependent interpretation and class X for unevaluable utterances. Our short-term goal is to integrate this kind of discrimination into the dialogue manager in order to make it able to prevent and handle error situations. The paper is organized as follows. Section 2 contains a general description of the problem the dialogue manager deals with. In Section 3 the architecture of the system is described, with particular attention given to the parts involved in the dialogue strategy: binary classification trees, the understanding module and the dialogue manager. In Section 4 some experiments are described. Finally, the last section contains a few concluding remarks and considerations.

2. SPOKEN LANGUAGE INTERACTION

One of the most important characteristics of spoken language is uncertainty. There are two main sources of uncertainty: one is intrinsic in spontaneous speech, where ill-formedness and extra-linguistic phenomena are very frequent; the second is due to acoustic recognizer errors. Clearly these two sources are not completely independent, because the high variability of spontaneous speech also deteriorates the performance of the acoustic recognizer, which makes things even worse. The hypothesis that should then be made is that the linguistic module of an understanding system has to consider an error-prone input and try to recover from such errors whenever possible. The fact that enough information is contained in the original message to accomplish the task, even when the sentence is ungrammatical or interrupted, is proven by the fact that humans can interpret such input. The linguistic module can face possible (and probable) errors mainly at two levels: understanding of the single utterance and dialogue management. The role of the dialogue manager in a spoken language system is to guarantee a good interaction quality. This means that the initiative should be left to the user whenever possible. But, when a possible discrepancy between the real user's intentions and those hypothesized by the system is detected, the system should take the initiative and follow a strategy for recovering from such divergences [7]. The problem of user's goal recognition is widely treated in the literature [10, 9]. Sophisticated approaches are usually followed to recognize the user's goal within generic domains while allowing large freedom to the user, who can, for example, suddenly change the topic of the interaction. On the contrary, our task was restricted, so that the robustness of the interaction could be improved. First of all, the chosen domain regards database inquiry: this allows making reliable predictions of the user's goal, based on the database contents, which are well known during the system design phase. Moreover, if the change of topic is forbidden until the former request is satisfied, the modeling is further simplified because there is no need to recognize and mask, possibly in a temporary way, contextual information already acquired, a problem that may make satisfaction of a new request difficult. Given these constraints, the main problem of the dialogue manager becomes that of correctly acquiring all the information that can focus the search for the datum the user is looking for. This information acts as constraints used for reducing search effort, and in general not all of it is supplied by the user in the first utterance. So, the user is asked for the remaining information through proper questions in successive stages. Since, as underlined above, any input data can be corrupted, the dialogue manager has to be able to recover from error situations.

Figure 1: System architecture (user, acoustic level producing a 1-best hypothesis or a lattice, understanding module, dialogue manager, semantic language, SQL, database; an answer or a question is returned to the user).

3. SYSTEM ARCHITECTURE

The architecture of the system under development is shown in Figure 1. For the acoustic level, the technology developed at IRST is adopted (see for example [1]). The linguistic level considered in this work is given by the union of the dialogue manager and the understanding module. The understanding module is in an advanced development phase, as is the definition of the semantic language. Since the understanding module is based on binary classification trees, and the dialogue manager will also use them, a brief description of this algorithm is given in the following subsection.

3.1. Binary Classification Trees

Binary Classification Trees (BCTs) are used by a classification algorithm that examines the input features in order to reduce the overall classification error by means of an automatically designed sequence of YES/NO questions. Given an initial set of YES/NO questions, one of them is associated to each internal node of a binary tree, while a class is associated to each leaf. The same class can be associated to more than one leaf. When an input enters the algorithm, it follows an ideal path from the root of the tree to one of its leaves. At each internal node, the corresponding question is applied to the input: if the answer is YES, the path continues through the left node (left and right are purely conventional here and could be interchanged); if the answer is NO, the path continues through the right node. When a leaf is eventually reached, the associated class is given as output. Even from this schematic description, it should be clear that the application of this algorithm is simple and efficient. The most important part is then the tree design, whose goal is to find the most effective question sequence relative to every possible input. This can be done automatically on the basis of a labeled corpus, namely a collection of pairs (x, y) in which x is a possible input and y the correct output (class) to be associated to x.
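As a minimal sketch of the traversal just described (not the authors' implementation; the tree, keywords and class labels below are purely illustrative), classification amounts to walking from the root to a leaf while answering keyword questions:

```python
class Node:
    def __init__(self, keyword=None, yes=None, no=None, label=None):
        self.keyword = keyword  # internal node: "does this word appear?"
        self.yes = yes          # child followed when the answer is YES
        self.no = no            # child followed when the answer is NO
        self.label = label      # leaf: class label, e.g. "A", "D" or "X"

def classify(root, utterance):
    """Follow the path from the root to a leaf and return its class."""
    words = set(utterance.lower().split())
    node = root
    while node.label is None:            # stop when a leaf is reached
        node = node.yes if node.keyword in words else node.no
    return node.label

# Toy tree with made-up keywords, only to show the mechanics.
tree = Node(keyword="flights",
            yes=Node(label="A"),
            no=Node(keyword="them", yes=Node(label="D"), no=Node(label="X")))
print(classify(tree, "show me the flights from Boston to Denver"))  # -> A
```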

The set of all possible YES/NO questions also needs to be defined. This is a very important point in the design of the whole algorithm: if the set is too wide, the algorithm becomes too inefficient, but if the questions are not powerful enough, important information could become impossible to get. The simplest possible question set was used, namely keywords. Therefore, each question has the form: "Does word w appear in the input?". Every word that occurs in the training set a number of times greater than a fixed threshold is a good candidate to be a keyword. The threshold was chosen so that the computation time needed to choose the best question is kept low without lowering overall algorithm performance.
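A small sketch, under our own assumptions, of how the candidate keywords could be collected: every word whose training-set frequency exceeds a fixed threshold generates one "does word w appear?" question (the threshold value here is illustrative, not the one used in the paper).

```python
from collections import Counter

def candidate_keywords(training_utterances, min_count=5):  # illustrative threshold
    """Return the words frequent enough to be used as keyword questions."""
    counts = Counter(w for utt in training_utterances for w in utt.lower().split())
    return sorted(w for w, c in counts.items() if c > min_count)
```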

The tree is built starting from an initial tree composed only of the root; at each step all the leaves of the tree generated in the last step are examined: the ones that do not satisfy a predefined halting criterion are split and become internal nodes; when all the leaves satisfy the halting criterion, the expansion is stopped. An interesting interpretation of the algorithm is based on the fact that a subset of the training set can be associated to each node in the tree, namely the set of all the pairs (x, y) such that, when the algorithm is applied to x, the path it follows passes through the considered node. Then, whenever a node is split, the corresponding subset is also split into two parts. The essential part of the split operation is the choice of the best question to be associated to the node under examination. For this choice an optimality criterion is adopted, based on an impurity measure: a node is as pure as possible when all the elements of the subset associated to it belong to the same class; it is completely impure when no class dominates the others. A good impurity measure is given by entropy; the Gini criterion, which has approximately the same performance but can be computed in a more efficient way (see [2] for a discussion), was used here. So, at every node, all the questions are considered and the one that yields the maximum decrease of impurity between the node and its children is chosen. Another important point is the halting criterion. In fact, if the tree is expanded too much, it will describe the training data very well, but no generalization will be attained and it will very likely fail on test data. On the contrary, if the expansion is halted too soon, the classification error will be too high. The algorithm presented in [6] was used here, in which the result of every expansion step is verified and pruned on a second set. The result of this pruning step is then expanded on this second set and pruned on the first one, and so on. This algorithm was proved to converge to an optimal tree, and it usually does so in a very small number of iterations (4-6).
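The following is a hedged sketch (not the paper's code) of the split selection just described: for the samples reaching a node, the keyword question that maximizes the decrease of Gini impurity between the node and its children is chosen.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_question(samples, keywords):
    """samples: list of (word_set, class_label) pairs reaching this node."""
    labels = [y for _, y in samples]
    parent = gini(labels)
    best, best_gain = None, 0.0
    for w in keywords:
        yes = [y for x, y in samples if w in x]
        no = [y for x, y in samples if w not in x]
        # children impurity weighted by the fraction of samples each receives
        children = (len(yes) * gini(yes) + len(no) * gini(no)) / len(samples)
        gain = parent - children
        if gain > best_gain:
            best, best_gain = w, gain
    return best, best_gain
```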

3.2. Understanding Module

The understanding module follows the main ideas presented in [8], even if, in some cases, they are realized in a different way. For example, a Semantic Representation (SR) language was defined that permits describing in a formal way the semantic contents of the sentence (for the SR some public domain software was used, namely TXL 7.4, (c)1988-1993 Queen's University at Kingston, a programming language specifically designed to support transformational programming). However, the definition is different from that in the cited work. Whenever a complete SQL query can be extracted from the input utterance, the corresponding SR can be unambiguously generated by applying some transformation rules to the syntactic analysis of the SQL query. Moreover, even if the input utterance does not contain a complete query, the SR can be used to describe the query fields given by the input. Therefore, the goal of the understanding module is to translate the input into a corresponding SR. This is done by a forest of BCTs, each of which has the task of extracting the value of a particular field in the query. In order to use the information given by the database as much as possible, a pre-processing step is performed, which associates classes to particular words in the input, such as city names, days of the week, months and so on. However, more work has to be done on this pre-processing to improve performance. For a more detailed description of this module see [3].
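A hedged sketch of the "forest of BCTs" idea: one tree per query field, each producing the value (or class) for its own field. The field names and the per-field classifier function are illustrative assumptions, not taken from the paper.

```python
def interpret(utterance, field_trees, classify):
    """field_trees: dict mapping a query field name to the BCT trained for it.
    classify: function (tree, utterance) -> value, e.g. the traversal sketched earlier."""
    return {field: classify(tree, utterance) for field, tree in field_trees.items()}

# e.g. interpret("flights from Boston to Denver on Monday",
#                {"origin": origin_tree, "destination": dest_tree, "day": day_tree},
#                classify)
```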

3.3. Dialogue Manager

At present, only some aspects of the dialogue manager are well defined, while others are only sketched and will require further investigation. One of the aspects to be explored concerns the organization of contextual information; this is necessary whenever context dependent utterances are found. On the contrary, the task representation [5] has already been well defined, based on frames. In [3] a hierarchy is defined among different frame fields and strategies to take advantage of it are discussed. Using the same representation for both understanding and dialogue management will make the communication between these two modules much easier. Perhaps the most important situation the dialogue manager will be asked to face is represented by an incomplete query. But, besides incomplete information, it can also receive corrupted data. The system, then, has to be able to assess the reliability of acquired information in order to prevent discrepancies between the real user's intentions and those hypothesized by the system itself. The dialogue manager strategy can be based on the following two extreme cases: the explicit confirmation by the user of all acquired information, or full confidence in the machine interpretation.

is safe but extremely boring for the user, with the second one it can be dicult for the system to catch errors in previous phases of the interaction. The system should be able to discriminate between reliable and unreliable information, asking for con rmation only for the latter. The discriminant ability should be learned from real samples rather than heuristic knowledge coming from designers. The learnable knowledge is available from the two following sources: the probability of the class corresponding to the leaf on which the current interpretation is based; the acoustic score coming from the recognizer. Moreover, this twofold knowledge can be integrated to a third that comes from a source created on purpose: a pre- ltering of the utterance through BCTs trained on the A, D and X ATIS classes. Then, for each input utterance, the dialogue manager will exploit the following set of information:  interpretation + class probability from the understanding module;  acoustic score from the recognizer;  A, D or X classi cation + class probability. Some experiments were performed to assess the exploitability of pre- ltering based on BCTs.
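Purely as an illustration of how these three information sources might be combined (this decision rule and its thresholds are our own assumptions, not the paper's algorithm), the dialogue manager could ask for confirmation whenever any of the scores falls below a reliability threshold:

```python
def needs_confirmation(interp_class_prob, acoustic_score, adx_label, adx_prob,
                       p_min=0.8, a_min=0.6):       # hypothetical thresholds
    if adx_label == "X" and adx_prob > 0.5:          # input likely unevaluable
        return True
    if interp_class_prob < p_min:                    # uncertain interpretation
        return True
    if acoustic_score < a_min:                       # uncertain recognition
        return True
    return False
```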

4. EXPERIMENTS

Some experimental results for BCT-based pre-filtering follow. The experiments were performed on 3476 utterances from the ATIS-3 corpus. The distribution of the three classes on these data is reported in Table I.

Table I: Class distributions in the ATIS-3 corpus.

  Class   # sentences   percentage
  A       1692          48.7 %
  D       1114          32.0 %
  X        670          19.3 %

Table II: Experiment results.

  Classes     % errors on training   % errors on test
  A, D, X     14.8 %                 19.8 %
  A, (D,X)    12.8 %                 17.9 %
  A, D         8.7 %                 11.0 %

Table III: Detailed results for the (A, D, X) discrimination experiment.

  Class   % errors on training   % errors on test
  A       12.8 %                 17.9 %
  D       14.9 %                 21.5 %
  X       16.6 %                 20.0 %

As a first step, 8 categories (airline name, airport name, cardinal number, city name, day name, meal description, month name, state name) were associated to the words that appear in tables within the ATIS database. Three experiments were performed that differ in the classes to be discriminated: all three classes in the first experiment, class A versus the union of classes D and X in the second one, and finally class A versus class D (only 1692+1114=2806 sentences). In each of these experiments, a 6-fold cross-validation technique [2] was adopted to estimate the classification error rate. Global results are summarized in Table II. For each class, Table III details the error rates of the first experiment. Note that the third experiment has the same goal as the one presented in [8]; nevertheless, the experimental conditions are different because they used more varied data (ATIS-2 in addition to ATIS-3), and this can be the reason for our slight improvement in performance (11.0 % error rate instead of the 11.2 % they obtained) even if a less powerful question set was used (they used a combination of keywords with their relative position). The first experiment is the most interesting, since our goal is just to exploit even rough preliminary information about the kind of a generic input utterance. The good results obtained show that the dialogue manager can take advantage of pre-filtering based on BCTs trained on each of the three ATIS classes for assessing the reliability of input contents. However, given the good discrimination obtained for the A and D classes (see results of the third experiment), the question arises of why a specific BCT should be trained for the X class. An attempt was therefore made to classify the 670 X utterances with the two BCTs trained for the A and D classes respectively, but error rates were high: 150 (22.4%) sentences were classified as A and not D, while 137 (20.4%) were classified as D and not A (the remaining 383 (57.2%) utterances were classified as both A and D or as neither A nor D). The bad performance obtained on X sentences probably reflects the limits of our question set. X sentences are often characterized by chunks of correct sentences interrupted before enough information can be extracted, or by multiple corrections, difficult to clarify even for a human. This kind of phenomenon cannot be detected by keywords alone; a more sophisticated analysis, perhaps based on syntax and/or on semantics, is required.
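A small sketch, under our own assumptions, of the 6-fold cross-validation used to estimate the error rates: the data are split into six folds, each fold serves once as test set while the remaining folds are used for training. The `train_fn` and `error_fn` arguments (e.g. a grow-and-prune BCT trainer and an error counter) are hypothetical placeholders.

```python
def cross_validation_error(samples, train_fn, error_fn, k=6):
    """Average test error over k folds of the labeled samples."""
    folds = [samples[i::k] for i in range(k)]        # simple interleaved split
    errors = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(train)                       # e.g. grow-and-prune a BCT
        errors.append(error_fn(model, test))
    return sum(errors) / k
```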

5. CONCLUSIONS AND FUTURE WORK

The first thing a dialogue manager should be able to determine is whether the input sentence can be answered or not. This decision should be based on different aspects, such as syntactic analysis, semantic contents, pragmatic considerations and so on. On the other hand, the rejection of a request is going to be frustrating for the users: such frustration can be lowered if an assessment of the reliability of the utterance's contents is made instead of a plain acceptance/rejection decision. A quite crude analysis of the input sentence can be based on BCTs.

It permits immediate discovery of sentences that do not contain enough information either to give an immediate answer or to start a clarification dialogue. Experimental results confirmed that this approach is worth pursuing, and a first version of the dialogue manager that uses such information is under construction. In the future, to refine the reliability assessment phase, other sources of information will be used, such as acoustic and interpretation scores.

References

[1] B. Angelini, G. Antoniol, F. Brugnara, M. Cettolo, M. Federico, R. Fiutem, and G. Lazzari. Radiological Reporting by Speech Recognition: the A.Re.S. System. In Proceedings of the International Conference on Spoken Language Processing, volume III, pages 1267-1270, Yokohama, Japan, 1994.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth Inc., 1984.
[3] A. Corazza and R. De Mori. Hierarchies of Classification Trees for Semantic Interpretation. IRST abstract n. 9502-06. Submitted to the European Conference on Speech Communication and Technology, Madrid, Spain, 1995.
[4] D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. Expanding the Scope of the ATIS Task: the ATIS-3 Corpus. In Proceedings of the ARPA Human Language Technology Workshop, pages 45-50, Plainsboro, NJ, 1994.
[5] W. Eckert and S. McGlashan. Managing Spoken Dialogues for Information Services. In Proceedings of the European Conference on Speech Communication and Technology, pages 1653-1656, Berlin, Germany, 1993.
[6] S. Gelfand, C. Ravishankar, and E. Delp. An Iterative Growing and Pruning Algorithm for Classification Tree Design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(2):163-174, Feb. 1991.
[7] E. Gerbino and M. Danieli. Managing Dialogue in a Continuous Speech Understanding System. In Proceedings of the European Conference on Speech Communication and Technology, pages 1661-1664, Berlin, Germany, 1993.
[8] R. Kuhn, R. De Mori, and E. Millien. Learning Consistent Semantics from Training Data. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume II, pages 37-40, Adelaide, Australia, 1994.
[9] D. J. Litman and J. F. Allen. A Plan Recognition Model for Subdialogues in Conversations. Cognitive Science, 11:163-200, 1987.
[10] S. R. Young. Dialog Structure and Plan Recognition in Spontaneous Spoken Dialog. In Proceedings of the European Conference on Speech Communication and Technology, pages 1169-1172, Berlin, Germany, 1993.
