Dialogue Context-Based Speech Recognition using User Simulation
Ioannis Konstas
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
August 2008
©Copyright 2008 by Ioannis Konstas
Declaration
I hereby declare that this thesis is of my own composition, and that it contains no material previously submitted for the award of any other degree. The work reported in this thesis has been executed by myself, except where due acknowledgement is made in the text.
Ioannis Konstas
Speech recognisers do not usually perform well in spoken dialogue systems because they lack linguistic knowledge and therefore cannot exploit the context of a dialogue the way humans do. This study, following several previous efforts, builds a post-processing system that acts as an intermediate filter between the speech recogniser and the dialogue system in order to improve the accuracy of the former. To achieve this, it trains a Memory Based Classifier using features extracted from recognition hypotheses, acoustic information, n-best list distributional properties and, most importantly, a User Simulation model trained on dialogue data that simulates the way people predict the next dialogue move based on the discourse history. The system was trained on dialogue logs collected with the TownInfo dialogue system and consists of a two-tier architecture: a classifier that ascribes a confidence label to each hypothesis of the speech recogniser, and a re-ranker that extracts the hypothesis with the highest confidence label from the n-best list. Overall, the system exhibited a relative reduction in Word Error Rate (WER) of 5.13% and a relative increase in Dialogue Move Accuracy (DMA) of 4.22% compared to always selecting the topmost hypothesis (Baseline), thus capturing 44.06% of the possible WER improvement on this data and 61.55% of the possible DMA improvement. This validates the main hypothesis of this thesis, i.e. that User Simulation can effectively boost the speech recogniser's performance. Future work involves using a more elaborate semantic parser for the labelling of each hypothesis and the evaluation of the system, and integrating the system into a real dialogue system such as the TownInfo system.
I wish to warmly thank my supervisor Oliver Lemon for his constant guidance and support, and for the time he devoted to the completion of the project. I also wish to thank Kalliroi Georgila, Xingun Liu, Helen Hastie and Sofia Morfopoulou for their co-operation and helpful advice.
Contents
Declaration
Abstract
Acknowledgements
Contents
Chapter 1 - Introduction
1.1 Overview
1.2 Related Work
1.2.1 Topmost hypothesis classification
1.2.2 Re-ranking of n-best lists
1.3 User Simulation
1.4 TiMBL – Memory Based Learning
1.5 Evaluation Metrics
1.5.1 Classifier Metrics (Precision, Recall, F-Measure and Accuracy)
1.5.2 Re-ranker Metrics (Word Error Rate, Dialogue Move Accuracy, Sentence Accuracy)
1.6 The TownInfo Dialogue System
Chapter 2 - Methodology
2.1 The Edinburgh TownInfo Corpus
2.2 Automatic Labelling
2.3 Features
2.4 System Architecture
2.5 Experiments
2.6 Baseline and Oracle
Chapter 3 - Results
3.1 First Layer – Classifier Experiments
3.2 Second Layer – Re-ranker Experiments
3.3 Significance tests: McNemar's test & Wilcoxon test
Chapter 4 - Experiment with high-level Features
Chapter 5 - Discussion and Conclusions
5.1 Automatic Labelling
5.2 Features
5.3 Classifier
5.4 Results
5.5 Future work
5.6 Conclusions
References
CHAPTER 1 Introduction
Speech recognisers are essential modules in spoken dialogue systems, yet on their own they do not perform well enough for a system to be robust and intuitive to use. The most common causes of erroneous recognition of the user's utterances are the ASR module's lack of linguistic knowledge and its inability to perform well in noisy environments. By comparison, there is evidence that the human speech recognition subsystem is usually able to predict upcoming words when placed in a sufficiently constraining linguistic context (so called 'high-Cloze' contexts), as in the case of domain-specific dialogues, even when the level of surrounding distracting noise is high (Pickering et al., 2007). This behaviour of the human brain suggests that we can, to some extent, simulate its ability to correctly disambiguate possible misrecognitions of what the user of a dialogue system intended to say. The most obvious way would be to supply linguistic context, such as the history of the dialogue so far between the user and the system, to a post-processing system in an effort to boost the accuracy of the speech recogniser module.
Consider the psycholinguistic theory of Pickering et al. (2007), who argue that people go a step further and use their language production subsystem to make predictions while comprehending their co-speaker in a dialogue: “if B overtly imitates A, then A's comprehension of B's utterance is facilitated by A's memory for A's previous utterance.” With this in mind, let us make the following analogy: let A be the speaker-system in a dialogue session and B the user-human of this system. The user is said to imitate the system in terms of both word choice and the semantics of the messages to be conveyed, since he or she is asked to fulfil a certain scenario in a rather limited domain of interest, for example book a train ticket, find a hotel/bar/restaurant, etc. Then the system (A) can understand better what the user (B) said, because it “remembers”, i.e. it has stored, its previous actions and turns. Under this interpretation, it would be useful to model the dialogues between the user and the system and thus have the system “remember” what it has said and done before, in order to better understand what the user is really saying. The idea behind this theory can be approached computationally by what is called User Simulation, the main theme around which this study revolves.
To combine this theory with the speech recognition module of a dialogue system, we consider that the latter produces several hypotheses for a given user utterance as its output, namely an n-best list. The reasoning above leads to the conclusion that the topmost hypothesis in this list might not be correct in terms of either word recognition or semantic interpretation; the correct hypothesis might instead lie somewhere lower in the n-best list. Note that in dialogue systems we are usually interested in the semantic representation of an utterance, since it is usually sufficient for the system merely to understand what the user wants to say, rather than the exact way he or she said it. Word alignment may of course account for the level of confidence that the semantic interpretation is truly the one the user meant to convey. This study attempts to build a post-processing system that takes as its input the speech recogniser's n-best lists of the user's utterances in dialogues and re-ranks them in an effort to extract the correct ones in terms of both semantic representation and word alignment.
In order to achieve this, it trains a Memory Based Classifier using features extracted from recognition hypotheses, acoustic information, n-best list distributional properties and a User Simulation model trained on dialogue data.
1.1 Overview
The chapters of this thesis are organised as follows:
Chapter 1: Introduction to the problem of context-sensitive speech recognition and previous work in this area.
Chapter 2: Detailed description of the methodology adopted to implement and train the system discussed in the study.
Chapter 3: Results of the experiments conducted in order to train the system.
Chapter 4: Additional experiment with a minimal number of high-level features.
Chapter 5: Discussion of the methodology and the results, future work and conclusion.
1.2 Related Work
The notion of incorporating explicit knowledge to evaluate and refine the ASR hypotheses in order to enhance the dialogue strategy of a system, as outlined above, is not new. Several studies have tried to boost the performance of the speech recogniser following one of two approaches to essentially the same problem: either make decisions on the topmost hypothesis of the ASR's output, or classify and then re-rank the n-best lists. All the experiments in these studies were conducted on similar input data, i.e. transcripts and wave files of user utterances and logs of dialogues. However, the systems the data were extracted from differed in their target domain, and the size of the corpora ranged from a few hundred utterances (Gabsdil and Lemon, 2004) to several thousand utterances (Litman et al., 2000).
1.2.1 Topmost hypothesis classification
Litman et al. (2000) use prosodic cues extracted directly from speech waveforms rather than confidence scores of the acoustic model incorporated in the speech recogniser, in order to predict misrecognised user utterances in their corpus. In their experiments they show that utterances containing word errors have certain prosodic features. What is more, even simple acoustic features such as the energy of the captured waveforms and their duration provide good separation between correctly recognised and misrecognised utterances. On this basis they maintain that these features can yield higher accuracy than standard confidence scores. In order to distinguish between correct and incorrect recognitions they train a classifier using RIPPER (Cohen, 1996), a rule-induction learner that builds a set of binary rules based on the entropy of each feature on a given training set. Their corpus consists of 544 dialogues between humans and three different dialogue systems: voice dialling and messaging (Kamm et al., 1998), accessing email (Walker et al., 1998) and accessing online train schedules (Litman and Pan, 2000). Their best configuration scores 77.4% accuracy, a 48.8% relative increase compared to their baseline.
Walker et al. (2000) use a combination of features from the speech recogniser, natural language understanding and dialogue history to assign one of three classes to the topmost hypothesis, namely: correct, partially correct and misrecognised. Like Litman et al. (2000) they also use RIPPER as their classifier. They train their system on 11,787 spoken utterances from AT&T's How may I help you corpus (Gorin et al. 1997; Boyce and Gorin, 1996), consisting of telephone dialogues concerning subscription-related scenarios. Their system achieves 86% accuracy, an improvement of 23% over the baseline.
1.2.2 Re-ranking of n-best lists
On the other hand, Chotimongkol and Rudnicky (2001), Gabsdil and Lemon (2004), Jonson (2006) and Andersson (2006) move a step further than simply classifying the topmost hypothesis and perform re-ranking of the n-best lists using prosodic and speech recognition features as well as dialogue context and task-related attributes.
Chotimongkol and Rudnicky (2001) train a linear regression model on acoustic, syntactic and semantic features in order to reorder the n-best hypotheses for a single utterance. Each hypothesis in the list is ascribed a correctness score, namely its relative word error rate in the list. The hypothesis with the lowest score is then chosen instead of the top-1 result. The corpus used is extracted from the Communicator system (Rudnicky et al., 2000) regarding travel planning and consists of 35766 utterances, for which the 25-best lists are taken into consideration. The performance of the re-ranker is 11.97% WER, a 3.96% relative reduction compared to the baseline.
Gabsdil and Lemon (2004) similarly perform reordering of n-best lists by combining acoustic and pragmatic features. Their study shows that dialogue features, such as the previous system question and whether a hypothesis is the correct answer to a particular question, contributed more than the other, more common attributes. Each hypothesis in the n-best list is automatically labelled as being either in-grammar, out-of-grammar (oog) (WER ≤ 50), oog (WER > 50) or crosstalk. This labelling is based on a combination of the semantic parse of each hypothesis and its alignment with the true transcript. Their approach to the problem has two steps: first they use TiMBL (Daelemans et al., 2007), a memory based classifier, to predict the correct label of each hypothesis in the n-best list, and then they perform a simple re-ranking by choosing the hypothesis that has the most significant label (if it exists in the list) according to the order: in-grammar < oog (WER ≤ 50) < oog (WER > 50) < crosstalk. The corpus used was extracted with the WITAS dialogue system (Lemon, 2004) and consisted of interactions with a simulated aerial vehicle, a total of 30 dialogues with 303 utterances, the 10-best lists of which were taken into consideration. Their system performed 25% relatively better than the baseline with a weighted f-score of 86.38%.
Jonson (2006) classifies recognition hypotheses with quite similar labels denoting acceptance, clarification, confirmation and rejection.
These labels were assigned automatically in a manner equivalent to that of the Gabsdil and Lemon (2004) study and correspond to varying levels of confidence, being essentially potential directives to the dialogue manager. Apart from the common features, Jonson includes close-context features, e.g. previous dialogue moves and slot fulfilment, as well as the dialogue history. She also includes attributes that account for the whole n-best list, e.g. the standard deviation of confidence scores. Jonson (2006) also uses TiMBL in order to classify each hypothesis of the n-best list to one of the 5 labels incorporated and uses the same re-ranking algorithm as Gabsdil and Lemon (2004) to choose the top-1 hypothesis. Her system was trained on the GoDiS corpus, comprising dialogues with a virtual jukebox and consisting of 486 user utterances, the 40-best lists of which were taken into account. Her optimal set-up scored 83% DMA and 58% SA (see section 1.5.2 for an explanation of these measures), a relative increase of 56.60% in DMA and 20.83% in SA compared to the baseline.
Andersson (2006) uses similar acoustic, list and dialogue features but adheres to a simpler binary annotation characterising whether each hypothesis of the ASR n-best list is close enough ('B') or not ('N') to the original transcript. For classification he trains maximum entropy models and performs a simple re-ranking by choosing the first hypothesis, if it exists, which belongs to the 'B' category. His corpus is taken from the Edinburgh TownInfo system, containing dialogues for booking hotels/bars/restaurants (see section 2.1), and consists of 191 dialogues or 2904 utterances, taking into consideration on average the 7-best lists. He achieves an absolute error reduction of 4.1%, which corresponds to a relative improvement of 44.5% compared to the baseline.
Gruenstein (2008) follows a somewhat different approach to the problem of re-ranking by considering the prediction of the system's response rather than the user's utterance in the context of a multi-modal dialogue system. Along with the common recognition and distributional features of the hypotheses in the n-best lists, he takes into account features that deal with the response of the system to the n-best list produced by the speech recogniser. Similarly to Andersson (2006) he labels each hypothesis as 'acceptable' or 'unacceptable' depending on the semantic match with the true transcript.
He then trains an SVM to predict either of the two classes and fits a linear regression model to the classification output of the SVM in order to output a confidence score between -1 and +1, with -1 being totally 'unacceptable' and +1 totally 'acceptable'. Re-ranking is then performed by setting a threshold in the domain [-1,+1] and choosing the hypothesis that exceeds this threshold. Unlike the previous studies, this method is able to output a numerical confidence score rather than a discrete label. His system is trained on data taken from the City Browser multi-modal system (Gruenstein et al., 2006; Gruenstein and Seneff, 2007), resulting in 1912 utterances. His system achieved an absolute F1-measure of 72%, a 16% improvement compared to the baseline.
1.3 User Simulation
What makes this study different from previous work on post-processing of ASR hypotheses is the incorporation of the output of a User Simulation as an additional feature in my own system. Undeniably, the history of the discourse between a user and a dialogue system plays an important role in what the user might be expected to say next. As a result, most of the studies mentioned in the previous section make various efforts to capture it by including relevant features directly in their classifiers. Although this may make their systems simpler and faster at runtime, they fall short of a more systematic way of coping with user behaviour over the course of the dialogue. A User Simulation model comes in handy to fill this gap, since it can be trained on small corpora of dialogue data in order to simulate real user behaviour.
In my system I have used a User Simulation model created by Georgila et al. (2006) based on n-grams of dialogue moves. Essentially, it treats a dialogue as a sequence of lists of consecutive user and system turns in a certain high level semantic representation, i.e. {, } pairs (see section 2.1 for a complete explanation of this semantic representation). It takes as input the n - 1 most recent lists of {, } pairs in the dialogue history, and uses the statistics of n-grams in the training set to decide on the next user action. If no n-grams match the current history, the model can back off to n-grams of lower order.
The benefit of using n-gram models to simulate user actions is that they are fully probabilistic and are fast to train even on large corpora. A main drawback, common to n-grams in general, is that they are quite local in their predictions. In other words, given a history of n - 1 dialogue turns, the prediction of the n-th user turn may be too dependent on the previous ones and thus might not make much sense in the more global context of the dialogue.
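The back-off behaviour described above can be illustrated with a small sketch. This is not the Georgila et al. (2006) implementation: the class name `NGramUserSim`, the encoding of dialogue moves as plain strings, and the unsmoothed "drop the oldest turn" back-off scheme are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


class NGramUserSim:
    """Toy n-gram model over dialogue moves (hypothetical, for illustration)."""

    def __init__(self, n=3):
        self.n = n
        # Maps a history tuple (length 0..n-1) to counts of the next user move.
        self.counts = defaultdict(Counter)

    def train(self, dialogues):
        # Assumption: each dialogue is a list of moves that alternates
        # system turn, user turn, system turn, ...
        for moves in dialogues:
            for i in range(1, len(moves), 2):  # odd positions are user turns
                for k in range(self.n):        # histories of length 0..n-1
                    if i - k < 0:
                        break
                    history = tuple(moves[i - k:i])
                    self.counts[history][moves[i]] += 1

    def prob(self, history, move):
        # Start from the n-1 most recent turns; if that history was never
        # seen in training, back off by dropping the oldest turn.
        history = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        while True:
            seen = self.counts.get(history)
            if seen:
                return seen[move] / sum(seen.values())
            if not history:
                return 0.0
            history = history[1:]
```

A probability obtained this way for the dialogue move of each hypothesis could then serve as one feature value per hypothesis. A real model would normally add discounting when backing off, which this sketch omits.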
The main hypothesis of this study is that by using the User Simulation model to predict the next dialogue move of the user utterance as a feature in my system, I shall effectively increase the performance of the speech recogniser module.
1.4 TiMBL – Memory Based Learning
In this study I chose TiMBL 6.1 (Daelemans et al. 2007) as the main model for the classification of the hypotheses in the n-best list. TiMBL is a well-established, efficient, open-source C++ package, already used with considerable success by Gabsdil and Lemon (2004) and Jonson (2006). Memory-Based Learning (MBL) is an elegantly simple and robust machine learning method which has been applied to a multitude of tasks in Natural Language Processing (NLP). MBL descends directly from the plain k-Nearest Neighbour (k-NN) method of classification, which is still considered to be a quick, yet powerful pattern classification algorithm. Though plain k-NN performs well in various applications, it is notoriously inefficient at runtime, since each test vector needs to be compared to all the training data. As a result, since classification speed is a critical issue in any realistic application of MBL, non-trivial data structures and speed-up optimisations have been employed in TiMBL. Typically, training data are compressed and represented in a decision-tree structure.
In general, MBL is founded on the hypothesis that “performance in cognitive tasks is based on reasoning on the basis of similarity of new situations to stored representations of earlier experiences, rather than on the application of mental rules abstracted from earlier experiences” (Daelemans et al., 2007). Like most machine learning methods, TiMBL is divided into two parts: a learning component, which is memory-based, and a performance component, which is similarity-based.
The learning component merely involves adding training instances to memory. This process is sometimes referred to as “lazy”, since storing into memory takes place without any form of abstraction or intermediate representation. An instance consists of a fixed-length vector of n feature-value pairs, plus the class that this particular vector belongs to.
In the performance component of an MBL system, the training vectors are used to classify a previously unseen test datum. The similarity between the new instance X and all training vectors Y in memory is computed using some distance metric ∆(X, Y). The extrapolation is done by assigning the most frequent class within the found subset of most similar examples, the k-nearest neighbours, as the class of the new test datum. If there is a tie among classes, a tie-breaking algorithm is applied.
In order to compute the similarity between a test datum and each training vector I chose to use IB1 with information-theoretic feature weighting (among other MBL implementations found in the TiMBL library). The IB1 algorithm calculates this similarity in terms of weighted overlap: the total difference between two patterns is the sum of the relevance weights of those features which are not equal. The class for the test datum is decided on the basis of the least distant item(s) in memory. To compute relevance, Gain Ratio is used, which is essentially the Information Gain of each feature of the training vectors divided by the entropy of the feature values. Gain Ratio is considered to be a normalised version of the Information Gain measure, which treats each feature as independent of the rest and measures how much information each feature contributes to our knowledge of the correct class label.
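The weighted-overlap distance and the Gain Ratio weighting described above can be sketched in a few lines. This is a simplified illustration and not TiMBL's code: the function names are hypothetical, features are assumed to be symbolic, the k nearest examples are used (rather than the k nearest distances), and ties are broken arbitrarily by sort order.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of symbols."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information Gain of one feature divided by the entropy of its values."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    split_info = entropy(values)
    if split_info == 0.0:  # the feature takes a single value: no information
        return 0.0
    return (entropy(labels) - remainder) / split_info


def classify(train_x, train_y, query, k=1):
    """Weighted-overlap k-NN: distance = sum of weights of unequal features."""
    weights = [gain_ratio([x[i] for x in train_x], train_y)
               for i in range(len(query))]
    scored = sorted(
        (sum(w for w, a, b in zip(weights, x, query) if a != b), y)
        for x, y in zip(train_x, train_y)
    )
    nearest = [y for _, y in scored[:k]]
    return Counter(nearest).most_common(1)[0][0]
```

In this sketch a feature whose values do not help separate the classes receives a weight of zero, so a mismatch on it does not increase the distance at all.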
1.5 Evaluation Metrics
In this section I shall introduce the seven metrics that were used in the evaluation of the two core components of the system, namely the classifier and the re-ranker.
1.5.1 Classifier Metrics (Precision, Recall, F-Measure and Accuracy)
Precision of a given class X is the ratio of the vectors that were classified correctly to class X (True Positive) to the total number of vectors that were classified as class X either correctly or not (True Positive + False Positive).
\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{1.1} \]
Recall of a given class X is the ratio of the vectors that were classified correctly to this class (True Positive) to the total number of vectors that actually belong to class X, in other words to the number of vectors that were correctly classified as class X plus the number of vectors that were incorrectly not classified as class X (True Positive + False Negative).
\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{1.2} \]
F-measure is a combination of precision and recall. The general formula of this metric is (b is a non-negative real valued constant):
\[ \text{F-measure} = \frac{(b^2 + 1) \cdot \text{Precision} \cdot \text{Recall}}{b^2 \cdot \text{Precision} + \text{Recall}} \tag{1.3} \]
In this study we use the formula of F1 (b = 1), which gives equal weight to precision and recall and is the harmonic mean of the two:
\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{1.4} \]
Accuracy of the classifier in total is the ratio of the vectors that were correctly classified in their classes to the total number of vectors that exist in the test set.
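As a concrete illustration of the four classifier metrics, the sketch below computes them from lists of true and predicted labels. The function names and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, not part of the thesis software.

```python
def class_metrics(gold, predicted, target, b=1.0):
    """Precision, recall and F-measure of class `target`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = b * b * precision + recall
    f_measure = (b * b + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def accuracy(gold, predicted):
    """Fraction of vectors whose predicted class equals the true class."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
```

With the default b = 1 the third returned value is F1.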
1.5.2 Re-ranker Metrics (Word Error Rate, Dialogue Move Accuracy, Sentence Accuracy)
Word Error Rate (WER) is the ratio of the number of deletions, insertions and substitutions in the transcription of a hypothesis as compared to the true transcript to the total number of words in the true transcript. In my system I compute it by measuring the Levenshtein distance between the hypothesis and the true transcript.
\[ \text{WER} = \frac{\text{Deletions} + \text{Insertions} + \text{Substitutions}}{\text{Length of Transcript}} \tag{1.5} \]
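A word-level Levenshtein computation of WER can be sketched as follows. This is an illustrative sketch and not the code used in the thesis; the function name, the whitespace tokenisation and the case-insensitive comparison are assumptions.

```python
def word_error_rate(hypothesis, transcript):
    """WER = (deletions + insertions + substitutions) / length of transcript."""
    hyp = hypothesis.lower().split()  # assumption: case-insensitive matching
    ref = transcript.lower().split()
    if not ref:
        raise ValueError("transcript must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                            # deletion
                curr[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1 when the hypothesis is much longer than the transcript, since insertions are counted against the transcript length.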
Dialogue Move Accuracy (DMA) is a variant of the Concept Error Rate (CER) as defined by Boros et al. (1996), which takes into account the semantic aspects of the difference between the classified utterance and the true transcription. CER is similar in a sense to the WER, since it takes into account deletions, insertions and substitutions but to the semantic rather than the word level of the utterance. In our case DMA is stricter than CER, in the sense that it does not allow for partial matches in the semantic representation. In other words, if the classified utterance corresponds to the same semantic representation as the transcribed then we have 100% DMA, otherwise 0%.
Sentence Accuracy (SA) is the alignment of a single hypothesis in the n-best list with the true transcription. Similarly to DMA, it accounts for perfect alignment between the hypothesis and the transcription, i.e. if they match perfectly we have 100% SA, otherwise 0%.
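Both DMA and SA are all-or-nothing scores per utterance, which the following sketch makes explicit. The function names, the tuple encoding of a dialogue move, the case-insensitive word comparison and the averaging over utterances to obtain a corpus-level percentage are assumptions made for illustration.

```python
def sentence_accuracy(hypothesis, transcript):
    """1.0 if the hypothesis matches the transcript word for word, else 0.0."""
    # Assumption: case and extra whitespace are ignored when comparing.
    return float(hypothesis.lower().split() == transcript.lower().split())


def dialogue_move_accuracy(hypothesis_move, true_move):
    """1.0 if both semantic representations are identical, else 0.0."""
    # No partial credit: a single differing field counts as a miss.
    return float(hypothesis_move == true_move)


def corpus_average(score_fn, pairs):
    """Average a per-utterance score over (hypothesis, reference) pairs."""
    return sum(score_fn(h, r) for h, r in pairs) / len(pairs)
```

For example, two hypotheses that share the same speech act but differ in one value of the (hypothetical) tuple would score 0 under `dialogue_move_accuracy`, whereas a CER-style measure would give partial credit.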
1.6 The TownInfo Dialogue System
The training datasets used in this study were collected from user (both native and non-native) interactions with the TownInfo dialogue system (Lemon, Georgila and Henderson, 2006) developed within the TALK project. The TALK TownInfo system is an experimental domain-specific system with which users interact via natural speech in order to book a room in a hotel, book a table in a restaurant or find a bar. Each user was given a specific scenario to fulfil involving subtasks of preferred choice regarding price range, location and type of facility (Lemon, Georgila, Henderson and Stuttle, 2006).
The dialogue system is implemented in the Open Agent Architecture (OAA) (Cheyer and Martin, 2001) with the main components being a dialogue manager, a dialogue policy reinforcement learner, a speech recogniser and a speech synthesiser. The inputs to my system are provided by the dialogue manager and the speech recogniser.
The dialogue manager, DIPPER (Bos et al., 2003), is an Information State Update (ISU) approach to dialogue management that was specifically developed to handle spoken input/output and integrates several communicating software agents. These include agents that monitor dialogue progress and help other agents decide what action should be taken based on the previous and current state of the dialogue. The output of DIPPER which is of particular interest for us is the logging of the flow of the dialogues, with information such as user utterances, system output (both the transcript and the semantic representation), current user task etc. The goal of the system is to guide the dialogue manager's clarification and confirmation strategies, which are then given to the speech synthesiser to realise (Lemon, Georgila, Henderson and Stuttle, 2006).
The speech recogniser was built with the ATK tool-kit (Young, 2004). The recogniser models natural speech using Hidden Markov Models and uses n-grams as its language model, in this way integrating domain-specific data with wide coverage data instead of relying on a domain dependent recognition grammar network. What is more, it operates in an n-best mode, which means that it produces the top-n hypotheses that were recognised against the recorded speech of the user, ordered by the model's overall confidence score.
CHAPTER 2 Methodology
The outcome of this study is a standalone piece of software written in Java and C/C++ that performs re-ranking of the n-best lists produced by the speech recogniser in the context of a dialogue system. The input to the system is the n-best list that corresponds to a user utterance, along with the confidence score per utterance and per word of each utterance, and the dialogue log which contains the turns of both the system and the user. The output of the system is a single hypothesis that has been chosen from the n-best list, along with a label that corresponds to a degree of “certainty” as to the correctness of the picked hypothesis.
In order to extract the correct hypothesis the system includes a Memory Based Classifier (TiMBL - Daelemans et al., 2007) that has been trained on the Edinburgh TownInfo corpus, consisting of 126 dialogues containing 1554 user utterances. For each utterance I used the 60-best lists produced by the recogniser, resulting in a total of 93240 hypotheses. Each hypothesis was automatically labelled with one of the 'opt', 'pos', 'neg' and 'ign' labels, denoting a decreasing order of confidence as to the match with the true transcript and semantic representation of the user's utterance, with 'opt' representing the most confident estimate.
The final testing of the classifier was performed on a separate part of the Edinburgh TownInfo corpus consisting of 58 dialogues containing 510 utterances. For each utterance I used the full 60-best lists, resulting in 30200 hypotheses. The testing of the re-ranker was performed with a 10-fold cross-validation on the whole training set. Cross-validation was chosen because it lends more significance to the results and better demonstrates the overall robustness of the system in its ability to boost the recogniser's performance.
The classifier was trained with four different sets of increasing number of features: the first set included only list features such as the standard deviation of confidence scores; the second included the first plus current hypothesis features such as hypothesis length, confidence score etc.; the third included the previous two plus acoustic features; and the fourth included all previous plus the User Simulation score.
2.1 The Edinburgh TownInfo Corpus
The input to the system is extracted from the Edinburgh TownInfo corpus, which consists of a total of 184 dialogues, an example of which is given in Figure 2.1. The corpus had already been captured and transcribed by K. Georgila, since it is a part of the training set used in the User Simulation model also incorporated in my system. In more detail, the corpus contains a collection of interactions with the TownInfo dialogue system by native and non-native users (see section 1.6).

System: How can I help you?
User: I need to find a hotel room
System: Did you say hotel?
User: Yes
System: Do you want something central?
User: Yes please
System: Do you need a single or double room?
User: [empty]
System: Do you need a single or double room?
User: [empty]
System: Do you need a single or double room?
User: Single
System: Would you like something cheap, expensive or reasonably priced?
User: I don't mind
System: Now I am going to search for a single central room. Please wait a moment while I search for suitable hotel options. (system presents options) Are you interested in any hotel option number or would you like me to keep going?
User: Number three
(system repeats option number three)
System: Are you happy with that choice?
User: Yes
System: Would you like to search some more?
User: No
System: Goodbye...and thanks for using the Edinburgh Cambridge dialogue system.
Figure 2.1: A dialogue log from the Edinburgh TownInfo corpus (Andersson 2006)
Something cheap
SOMETHING CHEAP        -15268.4
SOMETHING A CHEAP      -15283.8
I THINK CHEAP          -15294.8
UH SOMETHING CHEAP     -15287.4
SOMETHING CHEAPER      -15287.3
SOMETHING CHEAP I      -15276.6
I DON'T CHEAP          -15307.4
I SOMETHING CHEAP      -15310.5
I WANT A CHEAP         -15383.5
A SOMETHING CHEAP      -15287.4
SOMETHING CHEAP A      -15259.5
SOMETHING CHEAP UH     -15259.5
I HAVE A CHEAP         -15396.8
I WANT CHEAP           -15327.8
THE SOMETHING CHEAP    -15311.4
UH THANK CHEAP         -15270.5
AH SOMETHING CHEAP     -15291.3
ER SOMETHING CHEAP     -15300
SOMETHING CHEAP AT     -15261.9
I THINK A CHEAP        -15336.4
Figure 2.2: Part of an n-best list for the transcription 'Something cheap'. The second column denotes the acoustic score of the speech recogniser

Each utterance is stored in various formats, depending on the context we are focusing our attention on. On the highest level we have a collection of dialogue logs which are structured in accordance with the Information State Update (ISU) paradigm, as shown in Figure 2.3. Apart from the transcript of the user's or system's utterance (shown in bold), the logs also contain a semantic representation for the limited knowledge domain of hotels, bars and restaurants that denotes the current Dialogue Move. More specifically, each utterance is transcribed in the following format: <Speech Act>, <Task>, <Slot Value> (shown in red in the example of Figure 2.3 with the equivalent values filled). The Speech Act field is a high-level representation of the type of the sentence that was uttered by the user/system and takes values such as provide_info and yes_answer, which mean that the user/system tries to convey some domain-specific information or answers affirmatively, respectively.

TypeOfPolicy: 1
STATE 7
DIALOGUE LEVEL
Turn: user
TurnNumber: 3
Speaker: user
DialogueActType: user
ConvDomain: about_task
SpeechAct: [provide_info]
AsrInput: chinese
TransInput:
Output:
TASK LEVEL
Task: [food_type]
FilledSlot: [food_type]
FilledSlotValue: [chinese]
LOW LEVEL
AudioFileName: kirsten-003--2006-11-06_12-30-13.wav
ConfidenceScore: 0.44
HISTORY LEVEL
PreviouslyFilledSlots: [null],[top_level_trip],[null],[food_type]
PreviouslyFilledSlotsValues: [null],[restaurant],[],[chinese]
PreviouslyGroundedSlots: [null],[null],[top_level_trip],[]
SpeechActsHist: opening_closing,request_info,[provide_info,provide_info],explicit_confirm,[yes_answer,yes_answer],request_info,[provide_info]
TasksHist: meta_greeting_goodbye,top_level_trip,[top_level_trip,food_type],top_level_trip,[top_level_trip,food_type],food_type,[food_type]
FilledSlotsHist: [top_level_trip,food_type],[],[food_type]
FilledSlotsValuesHist: [restaurant,chinese],[],[chinese]
Figure 2.3: Excerpt from a dialogue log containing the most useful fields, showing the Information State fields for the user utterance 'chinese'
The Task field is a lower-level representation of the contents of the utterance of the user/system and takes values such as top_level_trip and food_type, which summarise the fact that the user/system has made a general statement about a hotel, bar or restaurant or a statement about the type of food, respectively. Finally, the Slot Value field is the lowest-level representation of the message conveyed and usually corresponds to specific information, such as chinese if the Task field has the value food_type, cheap if the Task field is filled with hotel_price etc.
For the purposes of the experiments I have used a different, cut-down version of the ISU logs that contains just the semantic parses of the dialogue moves of the system's and users' turns and the file names of the wave files that correspond to the users' utterances. For each utterance we have a series of files of 60-best lists produced by the speech recogniser, namely the transcription hypotheses on a sentence level along with the acoustic model score (Figure 2.2) and the equivalent transcriptions on a word level, with information such as the duration of each recognised frame and the confidence score of the acoustic and language model of each word (Figure 2.4). Finally, there exist the wave files of each utterance, which were used to compute various acoustic features.

Something cheap
0        6000000             15.000000  -3306.653320  -3291.653320  0.903141
6000000  10900000 SOMETHING -69.225189  -3832.953613  -3902.178711  0.774873
10900000 14000000 CHEAP     -16.965895  -2162.810547  -2179.776367  0.950973
14000000 25900000             6.578006  -5965.947266  -5959.369141  0.935400
///
0        6000000             15.000000  -3306.653320  12041.324219  0.903141
6000000  10300000 SOMETHING -69.225189  -3324.001465  -3393.226562  0.785827
10300000 11000000 A         -42.978447   -608.698303   -651.676758  0.514222
11000000 14000000 CHEAP     -17.142681  -2078.526367  -2095.668945  0.954854
14000000 25900000             6.578006  -5965.947266  -5959.369141  0.935400
///
0        6700000             15.000000  -3828.631348  11577.962891  0.890681
6700000  8800000  I         -20.461586  -1653.962280  -1674.423828  0.720694
8800000  11200000 THINK     -33.112690  -1921.326782  -1954.439453  0.784222
11200000 14000000 CHEAP     -73.240974  -1924.966553  -1998.207520  0.957299
14000000 25900000             6.578006  -5965.947266  -5959.369141  0.935400
Figure 2.4: Speech recogniser's output at a word level for the transcript 'Something cheap'. The columns correspond to: start of frame, end of frame, label, language modelling, acoustic, total and confidence score.
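To make the word-level format of Figure 2.4 concrete, the following Python sketch parses such rows into records. It is only an illustration and not the parser used in this study: the names WordRow, parse_row and parse_word_level are hypothetical, and it assumes that rows are whitespace-separated, that hypotheses are separated by a line containing only '///', and that rows without a word label simply have six fields instead of seven.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordRow:
    start: int                # start of frame, in the recogniser's time units
    end: int                  # end of frame
    label: Optional[str]      # recognised word; None when the row carries no label
    lm_score: float           # language modelling score
    acoustic_score: float
    total_score: float
    confidence: float


def parse_row(line: str) -> WordRow:
    """Parse one row with 7 fields, or 6 when the label is absent (assumption)."""
    fields = line.split()
    if len(fields) == 7:
        start, end, label = fields[0], fields[1], fields[2]
        scores = fields[3:]
    elif len(fields) == 6:
        start, end, label = fields[0], fields[1], None
        scores = fields[2:]
    else:
        raise ValueError(f"unexpected number of fields: {line!r}")
    lm, acoustic, total, conf = (float(s) for s in scores)
    return WordRow(int(start), int(end), label, lm, acoustic, total, conf)


def parse_word_level(text: str, separator: str = "///") -> List[List[WordRow]]:
    """Split the text into hypotheses at each separator line and parse their rows."""
    hypotheses: List[List[WordRow]] = []
    current: List[WordRow] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == separator:
            hypotheses.append(current)
            current = []
        else:
            current.append(parse_row(line))
    if current:
        hypotheses.append(current)
    return hypotheses


# A few rows copied from Figure 2.4 (two hypotheses, shortened).
SAMPLE = """\
0 6000000 15.000000 -3306.653320 -3291.653320 0.903141
6000000 10900000 SOMETHING -69.225189 -3832.953613 -3902.178711 0.774873
10900000 14000000 CHEAP -16.965895 -2162.810547 -2179.776367 0.950973
///
6700000 8800000 I -20.461586 -1653.962280 -1674.423828 0.720694
8800000 11200000 THINK -33.112690 -1921.326782 -1954.439453 0.784222
"""

parsed = parse_word_level(SAMPLE)
words = [[row.label for row in hyp if row.label is not None] for hyp in parsed]
```

Keeping unlabelled rows as records with label set to None, rather than dropping them, leaves it to later steps to decide whether such rows should count towards word-based statistics.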
2.2 Automatic Labelling
In order to perform the re-ranking of the n-best lists we have to rely on some measure of correctness of each hypothesis. In other words, we need to distinguish the hypotheses that are close enough to the true transcript from those that are not. Instead of adopting the industry-standard measure of closeness for speech recognisers, namely WER, I adhered to a less strict hybrid method that combines primarily the DMA and then the WER of each hypothesis. What is more, in order to induce some kind of discrete confidence scoring that can guide, or at least help, the dialogue manager to choose a particular strategy move, I have devised four labels with decreasing order of confidence: 'opt', 'pos', 'neg', 'ign'. These are automatically generated by using two different modules: a keyword parser that computes the {<Speech Act>, <Task>} pair as described in the previous section, and a Levenshtein Distance calculator, for the computation of the DMA and WER of each hypothesis respectively. The reason for opting for a more abstract level, namely the semantics of the hypotheses, rather than delving into the lower level of individual word recognition, is that in Dialogue Systems it is usually sufficient to rely on the message that is being conveyed by the user rather than the words that he or she used.
Similar to Gabsdil and Lemon (2004) and Jonson (2006), I ascribed to each utterance one of the 'opt', 'pos', 'neg', 'ign' labels according to the following schema:
• opt: The hypothesis is perfectly aligned and semantically identical to the transcription
• pos: The hypothesis is not entirely aligned (WER ≤ 50%) but is semantically identical to the transcription
• neg: The hypothesis is semantically identical to the transcription but does not align well (WER > 50%), or is semantically different compared to the transcription
• ign: The hypothesis was not addressed to the system (crosstalk), e.g. the user laughed, coughed, etc.
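The labelling schema above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the labelling module of this study: a Dialogue Move is assumed to be comparable as a plain (speech act, task) tuple, crosstalk is assumed to be flagged externally through a hypothetical is_crosstalk argument, and the WER of an empty hypothesis against an empty (silence) transcript is assumed to be 0.

```python
from typing import List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: List[str], hyp: List[str]) -> float:
    """Word Error Rate in percent, relative to the length of the transcript."""
    if not ref:
        # Assumption: empty vs. empty counts as a perfect match (silence case).
        return 0.0 if not hyp else 100.0
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def label_hypothesis(ref: List[str], hyp: List[str],
                     ref_move: Tuple[str, str], hyp_move: Tuple[str, str],
                     is_crosstalk: bool = False,
                     threshold: float = 50.0) -> str:
    """Assign 'opt', 'pos', 'neg' or 'ign' following the schema in the text."""
    if is_crosstalk:
        return "ign"
    same_move = ref_move == hyp_move
    error = wer(ref, hyp)
    if same_move and error == 0.0:
        return "opt"
    if same_move and error <= threshold:
        return "pos"
    return "neg"


transcript = "i'd like to find a bar please".split()
move = ("provide_info", "top_level_trip")  # illustrative Dialogue Move
```

For instance, a hypothesis that differs from this transcript in a single word but parses to the same Dialogue Move falls under 'pos', while any hypothesis with a different Dialogue Move falls under 'neg' regardless of its WER.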
The 50% value for the WER as a threshold for the distinction between the 'pos' and 'neg' categories is adopted from Gabsdil (2003), based on the fact that WER is affected by concept accuracy (Boros et al. 2003). In other words, if a hypothesis is erroneous as far as its transcript is concerned, then it is highly likely that it does not even convey the correct message from a semantic point of view. It can be clearly seen that I am always labelling hypotheses that are conceptually equivalent to a particular transcription as potential candidate dialogue strategy moves, and total misrecognitions as rejections. In Figure 2.5 we can see some examples of the four labels. Notice that in the case of silence, we give an 'opt' to the empty hypothesis.
Transcript: I'd like to find a bar please
Transcript: silence
neg MM
opt HM
Figure 2.5: Examples of the four labels: opt, pos, neg and ign
2.3 Features
All the features used by the system are extracted from the dialogue logs, the n-best lists per utterance and per word, and the audio files. The majority of the features were chosen on the basis of their success in previous systems as described in the literature. The novel feature is of course the User Simulation score, which may make redundant most of the equivalent dialogue features met in other studies. In order to measure the usefulness of each candidate feature and thus choose the most important ones, I used the common metrics of Information Gain and Gain Ratio (see section 1.4 for a very brief explanation) on the whole training set, i.e. 93240 hypotheses.
In total I extracted 13 attributes that can be grouped into 4 main categories: those that concern the current hypothesis to be classified, those that concern low-level statistics of the audio files, those that concern the whole n-best list, and finally the user simulation feature:
1. Current Hypothesis Features (6): acoustic score, overall model confidence score, minimum word confidence score, grammar parsability, hypothesis length and hypothesis duration.
2. Acoustic Features (3): minimum, maximum and RMS amplitude
3. List Features (3): n-best rank, deviation of confidence scores in the list, match with most frequent Dialogue Move
4. User Simulation (1): User Simulation confidence score
The current hypothesis features were extracted from the n-best list files that contained the hypotheses' transcription along with the overall acoustic score per utterance (Figure 2.2) and from the equivalent files that contained the transcription of each word along with the start of frame, end of frame and confidence score (Figure 2.4):
Acoustic score is the negative log likelihood that is ascribed by the speech recogniser to the whole hypothesis, being the sum of the individual word acoustic scores. Intuitively this is considered to be helpful since it depicts the confidence of the statistical model only for each word, and it has also been adopted in previous studies. Incorrect alignments will tend to fit the model less well and thus have a low log likelihood.
Overall model confidence score is the average of the individual word confidence scores. In the absence of the real model confidence scores in the given files of the corpus, I adhered to the average of the word confidence scores as the next best approximation of the model's overall confidence, taking into account both the language and the acoustic model.
Minimum word confidence score is also computed from the individual word transcriptions and is the confidence score of the word about which the speech recogniser is least certain. It is expected to help our classifier identify poor overall recognitions, since a high overall confidence score can sometimes prove to be misleading.
Grammar Parsability is the negative log likelihood of the transcript for the current hypothesis as produced by the Stanford Parser, a wide-coverage Probabilistic Context-Free Grammar (PCFG) parser (Klein et al. 2003). This feature seems helpful since we expect that a highly ungrammatical hypothesis is unlikely to match the true transcription semantically.
Hypothesis duration is the length of the hypothesis in milliseconds, as extracted from the n-best list files with transcriptions per word that include the start and the end time of the recognised frame. The reason for the inclusion of this feature is that it can help distinguish between short utterances such as yes/no answers, medium-sized utterances of normal answers and long utterances caused by crosstalk.
Hypothesis length is the number of words in a hypothesis and is considered to help in a similar way to the above feature.
The acoustic features were extracted directly from the wave files using SoX, an industry-standard open-source audio editing and conversion utility in *NIX environments:
Minimum, maximum and RMS amplitude are straightforward features, common to all the previous studies mentioned in section 1.2.
The list features were calculated based on the n-best list files with transcriptions per utterance and per word and take into account the whole list:
N-best rank is the position of the hypothesis in the list and could be useful in the sense that 'opt' and 'pos' hypotheses are usually found in the upper part of the list rather than at the bottom.
Deviation of confidence scores in the list is the deviation of the overall model confidence score of the hypothesis from the mean confidence score in the list. This feature is extracted in the hope that it will indicate potential clusters of confidence scores in particular positions in the list, i.e. group hypotheses that deviate in a specific fashion from the mean, thus suggesting that they should be classified with the same label.
Match with most frequent Dialogue Move is the only boolean feature and indicates whether the Dialogue Move of the current hypothesis, i.e. the {<Speech Act>, <Task>} pair, coincides with the most frequent one in the list. The trend in n-best lists is to have a majority of hypotheses that belong to one or two labels, with only one hypothesis belonging to 'opt' and/or a few to 'pos'. As a result, the idea behind this feature is to pick out such potential outliers, which are the desired goal for the re-ranker.
Finally, the user simulation score is given as an output by the User Simulation model created by K. Georgila and adapted for the purposes of this study (see next section for more details). The model operates with 5-grams. Its input comes from two different sources: the history of the dialogue, namely the 4 previous Dialogue Moves, is taken from the dialogue logs, and the current hypothesis' semantic parse is generated on the fly by the same keyword parser used in the automatic labelling.
User Simulation score is the probability that the current hypothesis' Dialogue Move has really been said by the user given the 4 previous Dialogue Moves. The advantage of this feature has been discussed in section 1.3.
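As an illustration of how some of the features described in this section could be derived, the Python sketch below computes four of the current hypothesis features and the three list features. It is a hedged sketch rather than the feature extractor of this study: the function names and the dictionary keys other than those shown in Figure 2.7 are hypothetical, the deviation is taken to be the signed difference from the list mean, the time stamps are assumed to be in 100 ns units (hence the units_per_ms default), and an empty hypothesis is assumed to yield zeros.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# One recognised word: (start of frame, end of frame, label, confidence).
Word = Tuple[int, int, str, float]


def hypothesis_features(words: Sequence[Word],
                        units_per_ms: int = 10_000) -> Dict[str, float]:
    """Confidence, length and duration features of a single hypothesis."""
    if not words:
        # Assumption: the empty hypothesis (silence) gets all-zero features.
        return {"avgConfScore": 0.0, "minWordConfidence": 0.0,
                "hypothesisLength": 0, "hypothesisDuration": 0.0}
    confidences = [w[3] for w in words]
    return {
        "avgConfScore": mean(confidences),          # overall model confidence
        "minWordConfidence": min(confidences),      # least certain word
        "hypothesisLength": len(words),             # number of words
        # Duration from the first start to the last end, converted to ms.
        "hypothesisDuration": (words[-1][1] - words[0][0]) / units_per_ms,
    }


def list_features(avg_confs: List[float],
                  moves: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Rank, deviation from the list mean and match with the most frequent move."""
    list_mean = mean(avg_confs)
    most_frequent = Counter(moves).most_common(1)[0][0]
    return [
        {"nBestRank": rank,
         "confDeviation": conf - list_mean,   # signed difference (assumption)
         "matchesFrequentDM": move == most_frequent}
        for rank, (conf, move) in enumerate(zip(avg_confs, moves), start=1)
    ]


hyp = [(6000000, 10900000, "SOMETHING", 0.774873),
       (10900000, 14000000, "CHEAP", 0.950973)]
feats = hypothesis_features(hyp)
ranked = list_features(
    [0.9, 0.7, 0.8],
    [("provide_info", "hotel_price"), ("provide_info", "hotel_price"),
     ("yes_answer", "top_level_trip")])
```

The list features need the whole n-best list at once, which is why they are computed in a separate pass from the per-hypothesis features.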
2.4 System Architecture The system developed in the context of this study is implemented mainly in JAVA, with the exception of the parts that interact with the User Simulation model of K. Georgila and the TiMBL classifier (Daelemans et al., 2007) that were written in C/C++ and Java Native Interface (JNI). In Figure 2.6 we can see an overview of the system's architecture. Currently the system works in off-line mode, i.e. gets its input from flat files that comprise the Edinburgh TownInfo corpus and performs re-ranking of an n-best list, i.e. outputs the hypothesis that has the label with the highest degree of confidence along with this very label. For evaluation purposes it currently computes the DMA, SA and WER of the training set with 10-fold cross-validation as its output. However, an OAA wrapper has been included as well in order to enable it to work in
a real time environment, where the input shall be given directly by the speech recogniser and the dialogue logger and its output will be given as input to the dialogue manager.
[Diagram. Components: Edinburgh TownInfo Dialogue Corpus, Keyword Parser, User Simulation, Feature Extractor, Re-Ranker. Data passed between them: N-best transcriptions, n-th hypothesis Dialogue Move, n-1 hypotheses' Dialogue Moves, User Simulation score, Feature Vectors, Top Hypothesis and Label.]
Figure 2.6: The system's architecture
A brief description of the individual components follows:
• The keyword parser was originally written in C by K. Georgila and has been adapted to Java by Neil Mayo, whose version I included in my system. The keyword parser reads a vocabulary file which contains a simple mapping from the various domain-specific words of interest that are met in the transcripts to an intermediate reduced vocabulary that will be used by a pattern matcher. It then reads a file that contains all the patterns of the reduced vocabulary and maps these to {<Speech Act>, <Task>} pairs. Note that the pattern files included with the version of the parser by N. Mayo mapped the vocabulary to a different semantic representation. However, these were not considered to be helpful, since I wanted to keep the same formalism adopted in the ISU logs that already existed.
• The User Simulation is written in C by K. Georgila and is ported to my system via JNI. Originally, K. Georgila had written the User Simulation as an OAA agent, but the first experiments that were conducted using this version were rather inefficient in terms of runtime. The reason was that the OAA itself was introducing unwanted overhead, possibly due to the large size of the messages that were transferred between my system and the agent. As a result I wrote a JNI wrapper around the original C code that interfaces its three main functions: load the model from n-grams stored in flat files to memory, simulate the user action given the n-1 history, and kill the model. It should be noted that originally the User Simulation was trained using both the Cambridge and the Edinburgh TownInfo corpus, resulting in a total of 461 dialogues with 4505 utterances. These were stored, as mentioned above, as n-grams in flat files produced by the CMU-Cambridge Statistical Language Model Toolkit v.2, using absolute discounting for smoothing the n-gram probabilities. Since I am also using the Edinburgh TownInfo corpus for training and testing TiMBL (Daelemans et al., 2007), I had to reduce the training dataset given to the User Simulation to avoid having it trained on the test data of my system as well. As a result, I removed the separate part of the Edinburgh TownInfo corpus consisting of 58 dialogues containing 510 utterances, which was used to test the TiMBL (Daelemans et al., 2007) classifier, and re-calculated the n-grams.
• The feature extractor is the core module of my system. It is written fully in JAVA and is responsible for reading the Edinburgh TownInfo corpus from the various flat files that make it up and for extracting the features that were described in detail in the previous section. The output of this module is the training and testing dataset in ARFF format, since it was considered convenient to visualise and [...] (http://www.c-[...]). This format can also be read by TiMBL (Daelemans et al., 2007).
• TiMBL (Daelemans et al., 2007) is written purely in C++ and usually runs in standalone mode. However, it provides a rather convenient API that enables other software to integrate it quite easily in their work flow. Since my system is written in JAVA, I wrote a JNI wrapper for it as well, porting the main API calls, namely load the model from a flat file in a tree-based format, train the model, test a flat file against a trained model, predict the class of a single vector given a trained model, and kill the model. The input to TiMBL is a set of feature vectors with a combination of real-valued numbers, integers and a single boolean attribute. The classifier itself internally converts the numeric attributes to discrete ones using a default of 20 classes. The output is a set of labels that are attributed to each input vector. Note that TiMBL completely ignores the fact that the input vectors actually correspond to hypotheses in an n-best list; in other words, each vector is fully independent of the others. It is the responsibility of the Feature Extractor and the Re-ranker to keep track of the position of each vector in a dialogue and n-best list, and of its mapping to a single hypothesis. TiMBL was trained using different parameter combinations, mainly choosing between the number of k-nearest neighbours (1 to 5) and distance metrics (Weighted Overlap and Modified Value Difference Metric). Quite surprisingly, there was no significant gain from using parameter combinations other than the default, namely Weighted Overlap with k = 1 nearest neighbours.
• The Re-ranker is written in JAVA and takes as input the labels that have been assigned to each hypothesis of the n-best list under investigation. It returns a hypothesis according to the following algorithm, along with the corresponding label, in the hope that it will assist the dialogue manager's strategies (adapted from Gabsdil and Lemon 2004):
1. Scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as 'opt'.
2. If 1. fails, scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as 'pos'.
3. If 2. fails, count the number of neg's and ign's in the classified recognition hypotheses. If the number of neg's is larger than or equal to the number of ign's then return the first 'neg'.
4. Else return the first 'ign' utterance.
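The four steps of the re-ranking algorithm can be sketched in Python as follows. This is an illustrative sketch rather than the JAVA implementation of the Re-ranker: the function name rerank is hypothetical, each hypothesis is assumed to be a (text, label) pair kept in n-best order, and returning None for an empty list is an added assumption that the algorithm above does not cover.

```python
from typing import List, Optional, Tuple

Hypothesis = Tuple[str, str]  # (transcription, label assigned by the classifier)


def rerank(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Pick one hypothesis from a classified n-best list (in top-down order)."""

    def first(label: str) -> Optional[Hypothesis]:
        # Top-down scan for the first hypothesis carrying the given label.
        return next((h for h in hypotheses if h[1] == label), None)

    # Steps 1 and 2: prefer the first 'opt', then the first 'pos'.
    for label in ("opt", "pos"):
        found = first(label)
        if found is not None:
            return found

    # Step 3: compare the number of 'neg' and 'ign' labels.
    negs = sum(1 for _, label in hypotheses if label == "neg")
    igns = sum(1 for _, label in hypotheses if label == "ign")
    if negs >= igns:
        return first("neg")  # None only if the list holds neither label

    # Step 4: otherwise return the first 'ign'.
    return first("ign")


choice = rerank([("SOMETHING CHEAPER", "neg"),
                 ("SOMETHING A CHEAP", "pos"),
                 ("SOMETHING CHEAP", "opt")])
```

Because the label is returned together with the hypothesis, a caller such as a dialogue manager can still distinguish a confident choice ('opt') from a rejection ('neg' or 'ign').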
2.5. Experiments
In this study the experiments were conducted in two layers: the first layer concerns only the classifier, i.e. the ability of the system to correctly classify each hypothesis as one of the four labels 'opt', 'pos', 'neg', 'ign', and the second layer the re-ranker, i.e. the ability of the system to boost the speech recogniser's accuracy.
For the first layer, I trained the TiMBL classifier using the Weighted Overlap metric and k = 1 nearest neighbours (as discussed in the previous section) on 75% of the Edinburgh TownInfo corpus, consisting of 126 dialogues containing 1554 user utterances. Each utterance has a corresponding 60-best list produced by the recogniser, resulting in a total of 93240 hypotheses. Using this corpus, I performed a series of experiments using different sets of features in order to both determine and illustrate the increasing performance of the classifier. These sets were determined not only by the literature but also by the Information Gain measures that were calculated on the training set using WEKA, as shown in Figure 2.7.
InfoGain   Attribute
-------------------------------------
1.0324     userSimulationScore

0.9038     rmsAmp
0.8280     minAmp
0.8087     maxAmp

0.4861     parsability
0.3975     acousScore
0.3773     hypothesisDuration
0.2545     hypothesisLength
0.1627     avgConfScore
0.1085     minWordConfidence

0.0511     nBestRank
0.0447     standardDeviation
0.0408     matchesFrequentDM
Figure 2.7: Information Gain for all 13 attributes (measured using WEKA)
Quite surprisingly, we can notice that the ranking given by the Information Gain measure coincides perfectly with the logical grouping of the attributes that was initially performed (see section 2.3). As a result, I chose to stick to this very grouping as the final 4 feature sets on which the experiments on the classifier were performed, in the following order:
1. List Features
2. List Features + Current Hypothesis Features
3. List Features + Current Hypothesis Features + Acoustic Features
4. List Features + Current Hypothesis Features + Acoustic Features + User Simulation
Note that the User Simulation score seems to be a really strong feature, scoring first in the Information Gain ranking, which supports our main hypothesis.
The testing of the classifier using each of the above feature sets was performed on the remaining 25% of the Edinburgh TownInfo corpus, comprising 58 dialogues with 510 utterances; taking the 60-best lists results in a total of 30600 vectors. In each experiment I measured Precision, Recall and F-measure per class and the total Accuracy of the classifier.
For the second layer, I used a trained instance of the TiMBL classifier on the 4th feature set (List Features + Current Hypothesis Features + Acoustic Features + User Simulation) and performed re-ranking using the algorithm illustrated in the previous section on the same training set used in the first layer, using 10-fold cross-validation.
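For readers unfamiliar with the Information Gain measure used to rank the attributes in Figure 2.7, the following Python sketch shows the standard entropy-based computation for a discrete attribute. It is not the WEKA implementation: the function names are hypothetical, and the sketch assumes that numeric attributes have already been discretised, which it does not attempt to do itself.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """Reduction in class entropy obtained by splitting on a discrete attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    # Expected entropy of the class after observing the attribute value.
    remainder = sum(len(group) / total * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


# Toy data, invented for illustration only.
toy_labels = ["opt", "neg", "neg", "ign"]
informative = [1, 0, 0, 2]      # separates the classes perfectly
uninformative = [7, 7, 7, 7]    # constant, carries no information
```

With four class labels the class entropy can reach 2 bits, which is why an Information Gain above 1, as reported for the User Simulation score, is possible.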
2.6. Baseline and Oracle
For the first layer I chose as a baseline the scenario in which the most frequent label, 'neg', would be chosen in every case for the four-way classification. For the second layer I chose as a baseline the normal behaviour of the speech recogniser, i.e. giving as output the topmost hypothesis. As an oracle for the system I defined the choice of either the first 'opt' in the n-best list to be classified or, if this does not exist, the first 'pos' in the list. In this way it is guaranteed that we shall always get as output a perfect match to the true transcript as far as its Dialogue Move is concerned, provided there exists a perfect match somewhere in the list.
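The baseline and oracle choices described above can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, hypotheses are assumed to be (text, gold label) pairs in n-best order, and falling back to the topmost hypothesis when the list contains neither an 'opt' nor a 'pos' is an assumption, since that case is not specified in the text.

```python
from collections import Counter
from typing import List, Optional, Tuple

Hypothesis = Tuple[str, str]  # (transcription, gold label from automatic labelling)


def majority_label(labels: List[str]) -> str:
    """First-layer baseline: always predict the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def baseline_choice(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Second-layer baseline: the recogniser's topmost hypothesis."""
    return hypotheses[0] if hypotheses else None


def oracle_choice(hypotheses: List[Hypothesis]) -> Optional[Hypothesis]:
    """Oracle: the first 'opt' in the list, otherwise the first 'pos'."""
    for wanted in ("opt", "pos"):
        for hyp in hypotheses:
            if hyp[1] == wanted:
                return hyp
    # Assumption: with no 'opt' or 'pos' available, behave like the baseline.
    return baseline_choice(hypotheses)


nbest = [("SOMETHING CHEAPER", "neg"),
         ("SOMETHING A CHEAP", "pos"),
         ("SOMETHING CHEAP", "opt")]
```

Note that the oracle reads the gold labels, whereas the re-ranker only has access to the labels predicted by the classifier.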
CHAPTER 3 Results
As explained in chapter 2, I performed two series of experiments in two layers: the first corresponds to the training of the classifier alone and the second to the system as a whole, measuring the re-ranker's output. A brief summary of the method follows:
• First Layer – Classifier Experiments
  • List Features (LF)
  • List Features + Current Hypothesis Features (LF + CHF)
  • List Features + Current Hypothesis Features + Acoustic Features (LF + CHF + AF)
  • List Features + Current Hypothesis Features + Acoustic Features + User Simulation (LF + CHF + AF + US)
• Second Layer – Re-ranker Experiments
  • 10-fold cross-validation
All results reported in this chapter are drawn from the TiMBL classifier which is being trained with the Weighted Overlap metric and k = 1 nearest neighbours settings. Both layers are trained on the same Edinburgh TownInfo Corpus of 126 dialogues containing 1554 user utterances or a total of 93240 hypotheses. The first layer was tested against a separate Edinburgh TownInfo Corpus of 58 dialogues containing 510 user utterances or a total of 30600 hypotheses, while the second was tested on the
whole training set with 10-fold cross-validation.
3.1 First Layer – Classifier Experiments
In this series of experiments I measure precision, recall and F1-measure for each of the four labels, as well as the overall F1-measure and accuracy of the classifier. In order to give a better view of the classifier's performance I have also included the confusion matrix for the final experiment with all 13 attributes, which scores better than the rest. Tables 3.1-3.4 show per-class and per-attribute-set measures, while Table 3.5 shows a collective view of the results for the four sets of attributes and the baseline, which is the majority class label 'neg'. Table 3.6 shows the confusion matrix for the final experiment.
Feature set (opt)    Precision    Recall    F1-Measure
LF                   42.50%       58.41%    49.20%
LF+CHF               62.35%       65.71%    63.99%
LF+CHF+AF            55.59%       61.59%    58.43%
LF+CHF+AF+US         70.51%       73.66%
Table 3.1 Precision, Recall and F1-Measure for the 'opt' category

Feature set (pos)    Precision    Recall    F1-Measure
LF                   25.18%       1.72%     3.22%
LF+CHF               51.22%       57.37%    54.11%
LF+CHF+AF            51.52%       54.60%
LF+CHF+AF+US         64.79%       61.80%
Table 3.2 Precision, Recall and F1-Measure for the 'pos' category

Feature set (neg)    Precision    Recall    F1-Measure
LF                   54.20%       96.36%    69.38%
LF+CHF               70.70%       74.95%    72.77%
LF+CHF+AF            69.50%       73.37%
LF+CHF+AF+US         85.61%       87.03%    86.32%
Table 3.3 Precision, Recall and F1-Measure for the 'neg' category

Feature set (ign)    Precision    Recall    F1-Measure
LF                   19.64%       1.31%     2.46%
LF+CHF               63.52%       48.72%    55.15%
LF+CHF+AF            59.30%       48.90%    53.60%
LF+CHF+AF+US         99.89%       99.93%    99.91%
Table 3.4 Precision, Recall and F1-Measure for the 'ign' category

Feature set: Baseline, LF, LF+CHF
F1-Measure, Accuracy: 51.08% 37.31% 53.07% 64.06% 64.77%
Table 3.5: F1-Measure and Accuracy for the four attribute sets.

In tables 3.1 – 3.5 we generally notice an increase in precision, recall and F1-measure as we progressively add more attributes to the system, with the exception of the addition of the Acoustic Features, which seem to impair the classifier's performance. We also note that in the case of the 4th attribute set the classifier can distinguish the 'neg' and 'ign' categories very well, with 86.32% and 99.91% F1-measure respectively. Most importantly, we observe a remarkable boost in F1-measure and accuracy with the addition of the User Simulation score. We mark a 37.36% relative increase in F1-measure and a 34.02% increase in accuracy compared to the 3rd experiment, which contains all but the User Simulation score attribute, and a 66.20% relative increase in accuracy compared to the Baseline. In table 3.4 we note a considerably low recall for the 'ign' category in the case of the LF experiment, suggesting that the list features do not add extra value to the classifier, partially validating the Information Gain measure (Figure 2.7).

Table 3.6 Confusion Matrix for LF + CHF + AF + US set.
Taking a closer look at the 4th experiment with all 13 features, we notice in table 3.6 that most errors occur between the 'pos' and 'neg' categories. In fact, the False Positive Rate (FPR) is 18.17% for the 'neg' category and 8.9% for the 'pos' category, all in all a lot larger than for the other categories.
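The per-class measures reported in this section (precision, recall, F1-measure and False Positive Rate) can all be derived from a confusion matrix, as the following Python sketch shows. It is only an illustration: the function names are hypothetical, the matrix orientation (rows are true classes, columns are predicted classes) is an assumption, and the small matrix is invented toy data, not the contents of Table 3.6.

```python
from typing import Dict, List


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_metrics(matrix: List[List[int]],
                      labels: List[str]) -> Dict[str, Dict[str, float]]:
    """matrix[i][j] = number of instances of true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                   # true class k, predicted otherwise
        fp = sum(row[k] for row in matrix) - tp    # predicted k, true class otherwise
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1_score(precision, recall),
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
        }
    return results


# Invented two-class toy matrix, for illustration only.
toy = per_class_metrics([[8, 2],
                         [1, 9]], ["opt", "neg"])
```

As a consistency check on the figures quoted in the text, applying f1_score to the precision and recall reported for the 'neg' category with all 13 features (85.61% and 87.03%) gives approximately the 86.32% F1-measure mentioned above.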
3.2 Second Layer – Re-ranker Experiments
In this series of experiments I measure WER, DMA and SA for the system as a whole. In order to make sure that the improvement noted was really attributable to the classifier, I computed the p-values for each of these measures using the Wilcoxon signed rank test for the WER and the McNemar chi-square test for the DMA and SA measures.
              WER           DMA           SA
Baseline
Classifier    45.27% **     78.22% *      42.26%
Oracle        42.16% ***    80.20% ***    45.27% ***
Table 3.7 WER, DMA and SA measures for the Baseline, Classifier and Oracle (*** indicates p < 0.001, ** indicates p < 0.01, * indicates p < 0.05)
In table 3.7 we note that the classifier scores 45.27% WER, a notable relative reduction of 5.13% compared to the baseline, and 78.22% DMA, a relative improvement of 4.22%. The classifier scored 42.26% on SA, but this was not considered significant compared to the baseline (0.05 < p < 0.10). Comparing the classifier's performance with the Oracle, it achieves 44.06% of the possible WER improvement on this data, 61.55% for the DMA measure and 37.16% for the SA measure. Finally, we also notice that the Oracle has a DMA of 80.20%, which means that 19.80% of the n-best lists did not include any hypothesis that matched the true transcript semantically.
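The relation between the relative changes and the share of the possible improvement quoted above can be made explicit with a small Python sketch. The function names are hypothetical, and the baseline values are not taken from a reported table: they are back-computed here from the classifier scores and the stated relative changes, so they should be read as an assumption of this illustration.

```python
def relative_change(baseline: float, system: float) -> float:
    """Signed change of the system score relative to the baseline score."""
    return (system - baseline) / baseline


def share_of_possible_gain(baseline: float, system: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap that the system closes."""
    return (system - baseline) / (oracle - baseline)


def implied_baseline(system: float, relative: float) -> float:
    """Baseline score implied by a system score and a signed relative change."""
    return system / (1.0 + relative)


# Classifier and Oracle scores as quoted in the text (percentages).
wer_system, wer_oracle = 45.27, 42.16
dma_system, dma_oracle = 78.22, 80.20

# Back-computed baselines (assumption): 5.13% relative WER reduction,
# 4.22% relative DMA increase.
wer_baseline = implied_baseline(wer_system, -0.0513)
dma_baseline = implied_baseline(dma_system, +0.0422)

wer_share = share_of_possible_gain(wer_baseline, wer_system, wer_oracle)
dma_share = share_of_possible_gain(dma_baseline, dma_system, dma_oracle)
```

Under these assumptions the two shares come out close to the 44.06% and 61.55% quoted for WER and DMA, with small differences attributable to rounding of the published figures.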
3.3. Significance tests: McNemar's test & Wilcoxon test
McNemar's test (Tan et al., 2001) is a statistical procedure that can validate the significance of differences between two classifiers on boolean data. Let fA be the baseline and fB be our system. Given a pair of binary data (in our case, whether the true transcript matches the topmost hypothesis for the baseline, or the output of the re-ranker for our system, semantically for the DMA measure and on a word level for the SA measure), we record the matches on each utterance for both fA and fB simultaneously to construct the following contingency table:

                   Correct by fA    Incorrect by fA
Correct by fB      n11              n01
Incorrect by fB    n10              n00
McNemar’s test is based on the idea that there is little information in the number of utterances for which both the baseline and the classifier give correct results, or for which both give incorrect results; it is based entirely on the values of n01 and n10. Under the null hypothesis (H0), the two algorithms should have the same error rate, meaning n01 = n10. It is essentially a χ² test that uses the following statistic:

$$\chi^2 = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}$$

If H0 is correct, then the probability that this quantity is greater than χ² = 3.84 is less than 0.05. If the statistic exceeds this value, we may therefore reject H0 in favour of the hypothesis that the two algorithms have different performance.

The Wilcoxon signed rank test is a statistical test for real-valued paired data that do not follow a normal distribution, as is the case with the WER distribution shown in Figure 3.1 below. I used MATLAB's implementation of the test, signrank.
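McNemar's test depends only on the two discordant counts n01 and n10. The following minimal sketch computes them from paired correctness flags and evaluates the statistic. It assumes the continuity-corrected form of the statistic, and the function names are illustrative. The p-value uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math


def discordant_counts(correct_a, correct_b):
    """Count the utterances on which exactly one system is correct.

    n01: fA wrong, fB right.  n10: fA right, fB wrong.
    """
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if not a and b)
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    return n01, n10


def mcnemar(n01, n10):
    """Return (chi-square statistic, p-value), with continuity correction."""
    if n01 + n10 == 0:
        return 0.0, 1.0  # the two systems never disagree
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    chi2, p = mcnemar(n01=10, n10=25)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}, reject H0 at 0.05: {chi2 > 3.84}")
```

Utterances on which both systems agree never enter the computation, which is why a large test set with few disagreements can still yield an insignificant result.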
Figure 3.1: WER distribution for the re-ranker output
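For readers without MATLAB, the signed rank test can be sketched in a few lines. This is not MATLAB's signrank: it is a simplified version that assumes the large-sample normal approximation, drops zero differences, averages the ranks of tied differences, and applies neither a tie correction to the variance nor a continuity correction. The function name is illustrative.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical per-utterance WER values (baseline vs. re-ranker).
    baseline = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
    reranked = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(wilcoxon_signed_rank(baseline, reranked))
```

The test uses only the signs and ranks of the paired differences, which is what makes it suitable for a skewed distribution such as per-utterance WER.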
CHAPTER 4 Experiment with high-level Features
The four sets of attributes described in the previous chapters were chosen on the basis of both previous studies and intuition, and were partially justified by the ranking produced by running the Information Gain measure on the Edinburgh TownInfo corpus training set. Apart from this more traditional approach to feature selection, I also trained a Memory Based Classifier using only two higher-level features, the User Simulation score and the Grammar Parsability (US + GP). The idea behind this choice is to find a combination of features that ignores low-level characteristics of the user's utterances, as well as features that rely heavily on the speech recogniser and are thus by default not considered very trustworthy. Quite surprisingly, the results of the experiment with just the User Simulation score and the Grammar Parsability are very promising and comparable with those of the 4th experiment with all 13 attributes. Table 3.9 shows the precision, recall and F1-measure per label, and table 3.10 compares the classifier's performance with that of the 4th experiment.
Label    Precision    Recall    F1-Measure
opt      73.99%       64.13%    68.70%
pos      76.29%       46.21%    57.56%
neg      81.87%       94.42%    87.70%
ign      99.99% 99.95%
Table 3.9 Precision, Recall and F1-measure for the high-level features experiment
It can be seen from table 3.9 that there is a noticeable decrease in recall and a corresponding increase in precision for the 'pos' and 'opt' categories compared to the LF + CHF + AF + US attribute set, which accounts for the lower F1-measures. However, overall the US + GP set classifies 207 more vectors correctly and, quite interestingly, produces far fewer ties¹ and resolves a larger proportion of them correctly than the full 13-attribute set.
Feature set       F1-measure    Accuracy    Ties
LF+CHF+AF+US      86.03%        84.90%      4993 / 863 / 57.13%
US+GP             85.68%        85.58%      115 / 75 / 65.22%
Table 3.10 F1-measure, Accuracy and number of ties that were correctly resolved by TiMBL for the LF+CHF+AF+US and US+GP feature sets
Next, I ran the re-ranker with the aforementioned classifier. It did not achieve much compared to the Baseline on the DMA and SA measures: it scored 74.85% DMA, 0.2 percentage points lower than the Baseline, and 40.82% SA, 0.34 percentage points higher than the Baseline, with neither difference being statistically significant. For WER it scored 46.39%, a relative decrease of 2.78% compared to the Baseline, achieving 23.92% of the possible WER improvement for this dataset.
Following the success of the previous experiment on the classifier alone, I took things to the extreme and trained TiMBL with just the User Simulation score feature. The classifier scored 80.60% overall F1-measure and 81.64% accuracy but, not surprisingly, it was unable to classify any of the 'opt' hypotheses correctly. As a result, it was not considered necessary to check the performance of the re-ranker with this rather minimal classifier.
¹ In k-nn algorithms a particular vector may be found to be equidistant from two or more neighbours that belong to different classes. In this case a tie-resolving scheme, such as weighted voting, is adopted.
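To make the notion of a tie concrete, the sketch below shows a generic k-nn vote in which a tie between classes is broken by inverse-distance weighting. It only illustrates the general idea; it is not TiMBL's actual tie-resolution scheme, and the function name, the weighting formula and the data are assumptions.

```python
from collections import defaultdict


def knn_vote(neighbours):
    """Pick a label from the k nearest neighbours.

    neighbours: list of (distance, label) pairs.
    Returns (label, was_tie), where was_tie is True when the plain
    majority vote was tied and had to be resolved by distance weights.
    """
    counts = defaultdict(int)
    weights = defaultdict(float)
    for distance, label in neighbours:
        counts[label] += 1
        weights[label] += 1.0 / (distance + 1e-9)  # closer neighbours weigh more
    top = max(counts.values())
    tied = sorted(label for label, count in counts.items() if count == top)
    if len(tied) == 1:
        return tied[0], False
    # Tie on vote counts: fall back to the summed inverse-distance weights.
    return max(tied, key=lambda label: weights[label]), True


if __name__ == "__main__":
    # Hypothetical neighbours, for illustration only.
    print(knn_vote([(0.1, "pos"), (0.5, "neg"), (0.6, "neg")]))
    print(knn_vote([(0.1, "pos"), (0.5, "neg")]))
```

Counting how often the second return value is True, and how often the resolved label is correct, gives the kind of tie statistics reported in table 3.10.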
CHAPTER 5 Discussion and Conclusions
In this chapter we discuss the methodology as a whole and the results drawn from the experiments on the Edinburgh TownInfo corpus, and present some overall conclusions. The results, especially for the second layer of experiments, are limited for three major reasons:
1. The speech recogniser's performance. The Oracle score for the DMA measure shows that approximately 19.80% of the n-best lists do not contain a hypothesis that matches the true transcript semantically. This partly resulted in a highly imbalanced dataset (Figure 5.1) and impaired the classifier's separability. According to Andersson (2006) there are two causes for this problem:
• Mis-timed recognition: the microphone was not activated in time before the user started speaking and/or was deactivated before the user had finished speaking.
• Bad recognition hypotheses: the user said something clearly but the system failed to recognise it. This can be ascribed to the decoding parameters and to the language model's failure to include domain-specific vocabulary.
2. The problem we are trying to solve is somewhat trivial from a semantic point of view. In order to compute the labels and measure the DMA of each hypothesis I used a keyword parser (see section 2.4), which “translates” each sentence to the format {} pair. While this high-level representation seems to be enough for the User Simulation model, it lets even highly ungrammatical hypotheses align semantically with the true transcript. Although this can be justified by the fact that in dialogue systems we are interested in the message conveyed rather than the exact way it was uttered by the user, we have in this way artificially increased the Baseline's DMA; in other words, erroneous topmost hypotheses still align semantically with the true transcript.
Figure 5.1 Label histogram for the Edinburgh TownInfo training set
5.1. Automatic Labelling
The labelling used in this study is closely related to that devised by Gabsdil and Lemon (2004) and Jonson (2006). The main idea is to map each hypothesis to a class that categorises it primarily from a semantic point of view and secondarily by its WER. This reflects the notion that dialogue systems are sensitive to the meaning behind the user's utterance rather than to its grammaticality. However, the method used to ascribe a label to each hypothesis, as described in section 2.2, aggravated the problem of the imbalanced dataset mentioned in the introduction of this chapter (Figure 5.1). The 'neg' category includes both semantically aligned hypotheses with high WER (>50) and hypotheses which do not align but are addressed to the system and are thus distinguished from 'ign'. This of course boosted the 'neg' category, making it the majority class by a wide numerical margin over the rest. A way to alleviate this problem would be to split the 'neg' category in two, namely 'pessimistic' ('pess'), which would include semantically identical hypotheses with high WER (>50), and 'neg', which would cover semantically mismatched hypotheses. A few preliminary experiments in this direction were conducted but did not achieve much accuracy, and were left for future work.
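The proposed split of the 'neg' category can be sketched as a relabelling step applied to already labelled hypotheses. This is only an illustration of the idea described above: the function name and the example data are hypothetical, and the threshold reads the "(>50)" of the text as a WER above 50%.

```python
from collections import Counter

HIGH_WER = 50.0  # threshold taken from the text's "(>50)", read here as percent


def split_neg(label, semantically_aligned, wer):
    """Map the four-label scheme to the proposed five-label scheme."""
    if label == "neg" and semantically_aligned and wer > HIGH_WER:
        return "pess"  # same meaning as the true transcript, but high WER
    return label


# Hypothetical hypotheses: (label, aligned with the true transcript?, WER in %).
EXAMPLE = [
    ("opt", True, 0.0),
    ("pos", True, 25.0),
    ("neg", True, 66.7),
    ("neg", False, 80.0),
    ("neg", False, 40.0),
    ("ign", False, 100.0),
]

if __name__ == "__main__":
    before = Counter(label for label, _, _ in EXAMPLE)
    after = Counter(split_neg(*row) for row in EXAMPLE)
    print("before:", dict(before))
    print("after: ", dict(after))
```

Comparing the two histograms on the real training set would show how far such a split reduces the dominance of the 'neg' class.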
5.2. Features
The features used in this study are divided into four groups, which also correspond to the four series of experiments that were conducted: list features, current hypothesis features, acoustic features and the User Simulation score. All of these features, with the exception of the User Simulation score, have been used in previous similar studies with successful results, a strong justification for including them in my system as well. However, as illustrated in chapter 4, not all of them contributed to the classifier's performance.

The list features alone (1st experiment) did not yield a significant improvement over the Baseline. It seems that, at least in the Edinburgh TownInfo corpus, we cannot account for possible clusters of labels gathered in a specific place within the list (something that the n-best rank feature or the standard deviation of confidence scores, for example, could capture). This phenomenon is particularly evident in the case of the 'ign' label, as shown in table 3.4. This seems quite reasonable: when an utterance is actually crosstalk, most of the time all the hypotheses in the n-best list are labelled 'ign', rendering list-wise features such as n-best rank and standard deviation of confidence score useless.

The current hypothesis features (2nd experiment) contributed significantly to the classifier's performance. This was largely as expected, even though they include attributes such as the speech recogniser's confidence and acoustic score, which by default are the foremost suspects for the mediocre performance of the 1-best baseline system. The grammar parsability and the minimum word confidence score seem to separate the hypotheses well, especially between the 'opt' - 'pos' and 'neg' - 'ign' categories. In this way they validate the assumption that fairly ungrammatical and/or marginally acceptable recognitions (which might have a high confidence score on average but contain words that are not recognised with much confidence by the acoustic model and/or language model, e.g. utterances with wrong syntax) do not carry the correct semantic information compared to the true transcript.

On the other hand, although the acoustic features (3rd experiment) seemed promising according to the Information Gain ranking (Figure 2.7) and the literature, they actually impaired the accuracy of the classifier (Table 3.5). This may be because the minimum, maximum and RMS amplitude values correspond to a single wave file and are thus the same for all the hypotheses in an n-best list. From a dataset point of view, we are essentially multiplying the probability mass of a certain value of each of these attributes without the values being unique. As a result, we artificially boost their importance, which in turn tricks the Information Gain and Gain Ratio measures used internally by TiMBL. In other words, these attributes score high on both measures because the same values occur many times, rather than because they are genuinely useful.

The addition of the User Simulation score (4th experiment) gives a remarkable boost to the classifier's performance, which validates the main hypothesis of this study as far as the classification of each hypothesis is concerned. What is most striking is that the User Simulation score helps the classifier distinguish very clearly the 'ign' and then the 'neg' category, i.e. the categories which correspond to hypotheses that mostly differ semantically from the true transcript or do not address the system. Especially in the case of the 'ign' category, when the user does not address the system, the User Simulation almost always models it very accurately. In other words, given a history of 4 dialogue moves (the User Simulation uses 5-grams) and a current move that is semantically empty, {,}, it assigns it the highest probability it can give (as shown in figure 5.2). This makes sense: if the user is currently not addressing the system, then the dialogue that has preceded is rather fixed and can thus be modelled easily. An equivalent justification holds when the user says something that does not align semantically with the true transcript and/or is erroneous, and has thus caused the system in the past to respond in a fixed way. Bear in mind that we are considering a dialogue system whose responses and vocabulary (and those of the user as well) are rather limited.

Figure 5.2 Histogram of User Simulation score for the Ed TownInfo training set

On the other hand, in the case of the 'opt' and 'pos' categories the User Simulation is less certain (Figure 5.2), for exactly the opposite reason. When hypotheses are correctly recognised, the dialogue between the system and the user may progress rather quickly, in the sense that the system does not need to confirm the user's utterances explicitly or implicitly. This means that the course of the dialogue can vary considerably and is thus more difficult to model ( {,} can occur in many different contexts compared to {,} ). This is partially validated by the additional experiment in which I trained TiMBL with just the User Simulation score as a feature and found that it was not able to classify any of the 'opt' hypotheses correctly.
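The way an n-gram User Simulation scores a candidate move given the dialogue history can be sketched as follows. This is a deliberately simple, unsmoothed maximum-likelihood model and not the model used in this study; the class name, the move names and the toy dialogues are all hypothetical. The default n = 5 mirrors the 5-grams mentioned above.

```python
from collections import Counter


class NGramUserSimulation:
    """Unsmoothed n-gram model over sequences of dialogue moves."""

    def __init__(self, n=5):
        if n < 2:
            raise ValueError("n must be at least 2")
        self.n = n
        self.context_counts = Counter()
        self.ngram_counts = Counter()

    def _pad(self, moves):
        # Pad so that the first moves of a dialogue also have a full context.
        return ["<s>"] * (self.n - 1) + list(moves)

    def train(self, dialogues):
        for moves in dialogues:
            padded = self._pad(moves)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.context_counts[context] += 1
                self.ngram_counts[context + (padded[i],)] += 1

    def score(self, history, move):
        """P(move | last n-1 moves of history); 0.0 for an unseen context."""
        context = tuple(self._pad(history)[-(self.n - 1):])
        seen = self.context_counts[context]
        if seen == 0:
            return 0.0
        return self.ngram_counts[context + (move,)] / seen


# Hypothetical dialogues, for illustration only.
DIALOGUES = [
    ["sys_ask_food", "user_provide_food", "sys_confirm", "user_yes"],
    ["sys_ask_food", "user_silence", "sys_ask_food", "user_provide_food"],
]

if __name__ == "__main__":
    model = NGramUserSimulation(n=3)
    model.train(DIALOGUES)
    print(model.score(["sys_ask_food"], "user_provide_food"))
```

A context that is always followed by the same move receives a probability of 1, which is the situation described above for rigid exchanges, whereas contexts with many possible continuations spread the probability mass.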
5.3. Classifier
The TiMBL classifier seems to be rather well-suited to modelling dialogue context-based speech recognition using the User Simulation as an extra feature. Though every effort was made to keep the method for feature selection and model optimisation as consistent as possible (Information Gain ranking), I believe that the classifier would benefit from a more systematic, exhaustive search through the possible combinations of features and/or parameter settings, for example using the “leave-one-out” method adopted by Gabsdil and Lemon (2004), who obtained a 9% increase in their classifier's accuracy.

The main drawback of our trained classifier was the high false positive rate for the 'pos' category. As is evident in table 3.6, the 'pos' category is easily mistaken for the 'neg' category. A possible cause is that the hypotheses belonging to the 'neg' category far outnumber those belonging to 'pos', as described in the introduction of this chapter. Another explanation is that the 'neg' category includes semantically aligned hypotheses (with high WER), as does the 'pos' category, so most of the features used cannot distinguish well between the two classes in this case. For example, the hypothesis' duration is the same whether or not the recogniser captures something semantically aligned with the true transcript.
5.4. Results
In the first layer of experiments I managed to train a reasonably effective classifier using all 13 attributes, scoring 86.03% F1-measure and 84.90% accuracy. The User Simulation score seems to be the key attribute that accounts for most of the classifier's ability to separate the four classes. In favour of this hypothesis is the extra experiment performed towards the end of this study, where I trained TiMBL with just the User Simulation score and the Grammar Parsability and scored 85.68% F1-measure and 85.58% accuracy.

This latter experiment seems very promising in the sense that we can get acceptable results with just two attributes, resulting in a very robust and efficient system. What is more, the fact that these attributes are of a higher level than the rest poses an interesting argument as to the approach that should be followed in the post-processing of speech recognition output. We should bear in mind, though, that neither feature can be extracted directly from the n-best lists or the wave files of the user's utterances; both involve applying models to the dialogue and to the syntax of each hypothesis. This means that we are adding time overhead to the system, which is crucial in the case of a real-time dialogue system. Dealing with 60-best lists induces a fairly acceptable overhead for the User Simulation model, which does not have to account for too many different states in the case of domain-specific dialogue systems. However, this is not always the case, especially when using a wide-coverage grammar parser, which sometimes has to parse long sentences and may slow down the overall response of the system. Using a more domain-specific and efficient parser than the one used in this thesis should alleviate this problem.

In the second layer of experiments the performance of the re-ranker is as encouraging as that of the classifier. The system achieved a relative reduction of WER of 5.13%, a relative increase of DMA of 4.22% and a relative increase of SA of 4.40%, with only the latter not being statistically significant (0.05 < p < 0.10) compared to the Baseline. In the case of dialogue systems we are primarily interested in a gain in the DMA measure, which would essentially mean that our re-ranker is helping the system to better “understand” what the user really said, and it seems that my system can improve the performance of the speech recogniser. Even though the increase is somewhat small compared to previous studies, it shows that my system is robust, gaining 61.55% of the possible DMA improvement, and the result is statistically significant. The same applies to the relative improvement of WER compared to the Baseline, which captures 44.06% of the possible boost in the overall performance of the speech recogniser.

A possible reason for the modest gains in WER and DMA and the statistically insignificant improvement in the SA measure is the limited size of the test data. What is more, as described in the introduction of this chapter, the problem we are trying to solve is rather trivial from a semantic point of view, resulting both in the Baseline having an already high DMA and SA and in a very tight margin between the Baseline and the Oracle, leaving only a small improvement to be achieved.
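The second layer can be pictured as a selection step over the labelled n-best list: the re-ranker returns the hypothesis with the highest confidence label. The sketch below is a minimal illustration under two assumptions that are not confirmed by the text: that the labels are ordered 'opt' > 'pos' > 'neg' > 'ign', and that ties between equally labelled hypotheses are broken in favour of the recogniser's original rank. The function name and example data are hypothetical.

```python
# ASSUMED ordering of the labels, from most to least trustworthy.
LABEL_PRIORITY = {"opt": 0, "pos": 1, "neg": 2, "ign": 3}


def rerank(nbest, labels):
    """Return (hypothesis, label) with the best label in the n-best list.

    nbest: hypotheses in the recogniser's original order (best first).
    labels: the label the classifier predicted for each hypothesis.
    """
    if not nbest or len(nbest) != len(labels):
        raise ValueError("nbest and labels must be non-empty and equally long")
    best = min(
        range(len(nbest)),
        key=lambda i: (LABEL_PRIORITY[labels[i]], i),  # label first, then rank
    )
    return nbest[best], labels[best]


if __name__ == "__main__":
    # Hypothetical 3-best list, for illustration only.
    hypotheses = ["a cheap hotel", "cheap hotel please", "sheep hotel please"]
    predicted = ["neg", "pos", "pos"]
    print(rerank(hypotheses, predicted))
```

With this selection rule the Baseline corresponds to always returning the first element of the list, regardless of the labels.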
5.5. Future work
Some ideas for future work have already been mentioned in previous sections of this chapter: the adoption of a fifth ('pess') label that would split the 'neg' category and therefore bring more balance to the dataset; a more systematic search over the combinations of features and TiMBL parameters, using the “leave-one-out” approach of Gabsdil and Lemon (2004); and the use of a more domain-specific and more efficient grammar parser.

Another useful improvement would be to use a more elaborate semantic parser than the keyword parser used in my system, one which would take into account not only the presence or absence of certain keywords in each hypothesis but also some semantic function among the uttered words. In this way we would end up with a more difficult problem for the re-ranker to solve, since the accuracy of the baseline, i.e. merely choosing the topmost hypothesis, would be reduced.

Finally, the current system is already implemented in a way that adheres to the OAA practices and is thus very easy to integrate with a real dialogue system such as DIPPER and the TownInfo dialogue system. In this way, we would be able to evaluate the system on truly unseen data and test it against the baseline system, which relies on the topmost 1-best hypothesis of the speech recogniser alone.
5.6. Conclusions
The system developed was tested in two layers, namely experiments that involved the classifier alone and experiments that concerned the re-ranker. For the first layer I conducted four experiments by training the classifier with an increasing number of features:
• List Features (LF)
• List Features + Current Hypothesis Features (LF + CHF)
• List Features + Current Hypothesis Features + Acoustic Features (LF + CHF + AF)
• List Features + Current Hypothesis Features + Acoustic Features + User Simulation (LF + CHF + AF + US)
Out of the four experiments the 4th gave the best results, with 86.03% F1-measure and 84.90% accuracy, a 66.70% relative increase in accuracy compared to the Baseline. I also conducted an additional experiment in which the classifier was trained on a limited set of high-level features, namely the User Simulation score and the Grammar Parsability feature; it scored 85.68% F1-measure and 85.58% accuracy, a 67.54% relative increase in accuracy compared to the Baseline.

For the second layer of experiments the re-ranker achieved a relative reduction of WER of 5.13%, a relative increase of DMA of 4.22% and a relative increase of SA of 4.40%, with only the latter not being statistically significant (0.05 < p < 0.10) compared to the Baseline. Compared with the Oracle, the re-ranker achieved 44.06% of the possible WER improvement on this data, 61.55% of the possible DMA improvement and 37.16% of the possible SA improvement.

This study has shown that a system that re-ranks the n-best lists produced by the speech recogniser module of a dialogue system can improve the performance of the speech recogniser. It has also validated the main hypothesis that this boost in performance can be achieved to a considerable extent using a User Simulation model of the dialogues between the system and the user.
Andersson, S. (2006), “Context Dependent Speech Recognition”, MSc Dissertation, University of Edinburgh.
Boros, M., Eckert, W., Gallwitz, F., Gorz, G., Hanrieder, G. and Niemann, H. (1996), “Towards Understanding Spontaneous Speech: Word Accuracy vs. Concept Accuracy”, in Proceedings of International Symposium on Spoken Dialogue, ICSLP-96, Philadelphia, USA, pp. 1005–1008.
Bos, J., Klein, E., Lemon, O. and Oka, T. (2003), “Dipper: Description and formalisation of an information-state update dialogue system architecture”, in 4th SIGdial Workshop on Discourse and Dialogue, Sapporo, Japan, pp. 115–124.
Boyce, S. and Gorin, A. L. (1996), “User interface issues for natural spoken dialogue systems”, in Proceedings of International Symposium on Spoken Dialogue, pp. 65–68.
Cheyer, A. and Martin, D. (2001), “The open agent architecture”, Journal of Autonomous Agents and Multi-Agent Systems 4(1), pp. 143–148.
Chotimongkol, A. and Rudnicky, A. (2001), “N-best speech hypotheses reordering using linear regression”, in Proceedings of EuroSpeech, pp. 1829–1832.
Cohen, W. (1996), “Learning trees and rules with set-valued features”, in Proceedings of the Association for the Advancement of Artificial Intelligence, AAAI-96.
Daelemans, W., Zavrel, J., van der Sloot, K. and van den Bosch, A. (2007), “TiMBL: Tilburg Memory Based Learner”, version 6.1 Reference Guide, ILK Technical Report 07-07.
Gabsdil, M. (2003), “Classifying Recognition Results for Spoken Dialogue Systems”, in Proceedings of the Student Research Workshop at ACL-03.
Gabsdil, M. and Lemon, O. (2004), “Combining acoustic and pragmatic features to predict recognition performance in spoken dialogue systems”, in Proceedings of ACL, Barcelona, Spain, pp. 343–350.
Georgila, K., Henderson, J. and Lemon, O. (2006), “User Simulation for Spoken Dialogue Systems: Learning and Evaluation”, in Proceedings of the 9th International Conference on Spoken Language Processing (INTERSPEECH–ICSLP-06), Pittsburgh, USA.
Gorin, A. L., Riccardi, G. and Wright, J. H. (1997), “How may I help you?”, Journal of Speech Communication, 23(1/2), pp. 113–127.
Gruenstein, A. (2008), “Response-Based Confidence Annotation for Spoken Dialogue Systems”, in Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, Columbus, Ohio, USA, pp. 11–20.
Gruenstein, A. and Seneff, S. (2007), “Releasing a multimodal dialogue system into the wild: User support mechanisms”, in Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pp. 111–119.
Gruenstein, A., Seneff, S. and Wang, C. (2006), “Scalable and portable web-based multimodal dialogue interaction with geographical databases”, in Proceedings of INTERSPEECH-06.
Jonson, R. (2006), “Dialogue context-based re-ranking of ASR hypothesis”, in Spoken Language Technology Workshop, IEEE, Palm Beach, Aruba, pp. 174–177.
Lemon, O. (2004), “Context-sensitive speech recognition in ISU dialogue systems: results for the grammar switching approach”, in Proceedings of the 8th Workshop on the Semantics and Pragmatics of Dialogue, CATALOG-04.
Lemon, O., Georgila, K. and Henderson, J. (2006), “Evaluating effectiveness and portability of reinforcement learned dialogue strategies with real users: The TALK TownInfo evaluation”, in Spoken Language Technology Workshop, IEEE, pp. 178–181.
Lemon, O., Georgila, K., Henderson, J. and Stuttle, M. (2006), “An ISU dialogue system exhibiting reinforcement learning of dialogue policies: Generic slot-filling in the talk in-car system”, in Proceedings of European Chapter of the ACL, EACL-06, Trento, Italy, pp. 119–122.
Kamm, C., Litman, D. and Walker, M. A. (1998), “From novice to expert: The effect of tutorials on user expertise with user dialogue systems”, in Proceedings of the International Conference on Spoken Language Processing, ICSLP-98.
Klein, D. and Manning, C. D. (2003), “Fast Exact Inference with a Factored Model for Natural Language Parsing”, in Journal of Advances in Neural Information Processing Systems 15, NIPS-02, Cambridge, MA, MIT Press, pp. 3–10.
Litman, D., Hirschberg, J. and Swerts, M. (2000), “Predicting automatic speech recognition performance using prosodic cues”, in Proceedings of NAACL-00.
Litman, D. and Pan, S. (2000), “Predicting and adapting to poor speech recognition in a spoken dialogue system”, in Proceedings of the Association for the Advancement of Artificial Intelligence, AAAI-00, Austin, USA, pp. 722–728.
Pickering, M. J. and Garrod, S. (2007), “Do people use language production to make predictions during comprehension?”, Journal of Trends in Cognitive Sciences, 11(3), pp. 105–110.
Rudnicky, A. I., Bennett, C., Black, A. W., Chotimongkol, A., Lenzo, K., Oh, A. and Singh, R. (2000), “Task and Domain Specific Modeling in the Carnegie Mellon Communicator System”, in Proceedings of the International Conference on Spoken Language Processing, ICSLP’00, Beijing, China.
Tan, C. M., Wang, Y. F. and Lee, C. D. (2001), “The use of bigrams to enhance text categorization”, Journal of Information Processing & Management, 38(4), pp. 529–546.
Walker, M., Fromer, J. C. and Narayanan, S. (1998), “Learning optimal dialogue strategies: A case study of a spoken dialogue agent for email”, in Proceedings of the 36th Annual Meeting of the Association of Computational Linguistics, COLING/ACL-98, pp. 1345–1352.
Walker, M., Wright, J. and Langkilde, I. (2000), “Using natural language processing and discourse features to identify understanding errors in a spoken dialogue system”, in Proceedings of International Conference on Machine Learning, ICML-00.
Young, S. (2004), “ATK An Application Toolkit for HTK”, 1.4.1 edn. Technical Manual.