Cue-Based Dialogue Act Classification
Nick Webb
Submitted for the Degree of Ph.D. Department of Computer Science University of Sheffield March, 2010
Typographic Conventions

In all writing about language there is the danger of confusion over whether a word is being mentioned or used. In this thesis we shall use “these quotes” when a word, phrase, or utterance is mentioned. In addition, it is convenient to have a mechanism for representing word meanings, and ‘single quotes’ shall be used for this. For example, “thesis” can mean either ‘dissertation’ or ‘claim’. We will also discuss Dialogue Act labels, and these will be denoted using a consistent representation, by surrounding the label with < and > marks. For example, the label representing a statement that is not an opinion (a label in the switchboard corpus) will be shown as <Statement-non-opinion>.
Abstract

Cue-Based Dialogue Act Classification
Nick Webb
Supervisors: Yorick Wilks & Mark Hepple
In this thesis, we will address three research questions relating to the discovery and use of cue phrases for Dialogue Act classification. Cue phrases are single words, or combinations of words in phrases, that can serve as reliable indicators of some discourse function. In our case, we are looking to automatically discover cue phrases in corpora that are useful for the detection of Dialogue Acts (da). Dialogue Acts are labels attached to utterances in dialogue that serve to concisely characterise a speaker’s intention in producing a particular utterance, a notion that most major theories of dialogue take as central. Our first research question is whether or not we can extract cue phrases automatically from a corpus. We apply a method of cue extraction to the switchboard corpus of annotated human-human dialogues, and experiment with thresholds to identify cue phrases. To determine if these automatically extracted cue phrases are reliable indicators of das, we created a novel, cue-based da classification model. In this model, our cue phrases are exploited directly, by determining if they appear in unseen dialogue utterances. This forms our second research question, and we extensively explore
our cue-based classification model. In doing so, we obtain classification accuracy scores over the switchboard corpus that rival the best published classification scores for this corpus. Finally, we investigate how general these automatically extracted cue phrases are, by applying cues extracted from one corpus directly to two new corpora, the icsi-mrda corpus and the amities ge corpus. We find that we can achieve surprisingly good classification accuracy using this method, and that a core of automatically extracted cue phrases is shared between the major corpora, one that proves extremely reliable when used directly to classify Dialogue Acts.
Acknowledgements

Completing this thesis has taken significant time and effort, and would not have been possible without the support of many. I wish to give extensive thanks to my supervisors, Yorick Wilks and Mark Hepple. Yorick I want to thank for his patience and his encouragement, both as a supervisor and an employer. Yorick stressed many times during our relationship that he had confidence that I would complete this thesis, and although many times I doubted him, it appears he was right. Mark has provided a great amount of guidance and feedback on the experiments and the narrative, and the work presented here and in our publications has benefited immensely from this process. I thank them both. Some of the material in this thesis has benefited from direct collaboration with colleagues and students, including specific input from Cristian Ursu, Ting Liu and Michael Ferguson, and I am grateful for their ideas and their work. The road to this thesis began way back in 1997, with my colleague Udo Kruschwitz at the University of Essex. I thank him for starting this journey with me, although he finished several years ago, a testament to his will power. I have never been a traditional Ph.D. student, instead balancing the needs of the thesis with my life as a working NLP researcher. I have been employed by three great academic institutions, the University of Essex, the
University of Sheffield, and now the University at Albany, SUNY. I have collaborated with very talented people, who I thank for their contribution to excellent working environments, including John Carson, Mark Stevenson, Andrea Setzer, Roberta Catizone, Louise Guthrie, Joe Polifroni, Diana Maynard, Rob Gaizauskas, Tomek Strzalkowski, Hilda Hardy, Sharon Small and Samira Shaikh. I have also had access to excellent administrative support, from Gillian Callaghan, Karen Barker, Lucy Moffatt and Gill Wells at the University of Sheffield, and now Lynne Casper at the University at Albany. This Ph.D. has benefited greatly from discussions with others at conferences (especially SemDial and COLING) and research project meetings. The work in this thesis has been applied to a number of research projects on which I was employed, including the European Commission funded projects AMITIES and COMPANIONS, the DTO funded COLLANE project and the NSF funded DeER project. I am grateful to all the funding agencies for their support. I am extremely grateful to my parents. These thanks mark the second time in one year that I am able to formally thank them for everything that they have helped me to accomplish, and I do so gladly with love and respect for them as parents. Finally, I would like to thank my wife, Kristen van Ginhoven. Kristen is an inspiration to me daily. Aside from her own, magnificent achievements to serve as a guide, she refused to allow me to fail with respect to this thesis, and I owe her immensely for that. I look forward with joy to our future together.
CONTENTS

1. Introduction
   1.1 Goal of Thesis
   1.2 CuDAC: Cue-Based Dialogue Act Classification
   1.3 Structure of the Thesis
   1.4 Contributions to Literature

2. Motivation and State-of-the-Art
   2.1 Introduction
   2.2 Dialogue Acts: A Definition
   2.3 Dialogue Acts and their Uses
       2.3.1 Evaluation Metrics
       2.3.2 Discourse Structure
   2.4 Corpora & Labelling Paradigms
       2.4.1 MapTask
       2.4.2 verbmobil
       2.4.3 Dialog Act Mark-up in Several Layers (damsl)
       2.4.4 AMI
       2.4.5 Additional Individual Dialogue Corpora
       2.4.6 DIT++
   2.5 Computational Models of Dialogue Act Recognition
       2.5.1 Plan and Inference Models
       2.5.2 Cue-Based Approach
   2.6 Automatically Classifying Dialogue Acts
       2.6.1 DA Sequence Models
       2.6.2 Hidden Markov Models
       2.6.3 Bayesian Methods
       2.6.4 Transformation Based Learning
       2.6.5 Memory Based Learning
       2.6.6 Latent Semantic Analysis
       2.6.7 Boosting
       2.6.8 Decision Trees
       2.6.9 Neural Networks
       2.6.10 Rule Based Approach
       2.6.11 Review of Approaches
   2.7 Summary
       2.7.1 Number of Target Categories
       2.7.2 Enhanced Features

3. Method
   3.1 Introduction
   3.2 Cue Phrase Selection
   3.3 Cue-Based da Classification
   3.4 Initial Experiments
   3.5 Elaboration of the Classification Model
       3.5.1 Utterance Length Models
       3.5.2 Position Specific Cues
       3.5.3 Overlapping Speech
   3.6 Effect of Training Data Size
       3.6.1 4k Data Set
       3.6.2 202k Data Set
   3.7 Empirical Determination of Thresholds
       3.7.1 Computing a Range of Thresholds
       3.7.2 Validation Model
   3.8 Error Analysis
       3.8.1 Errors concerning the <…> category
       3.8.2 Errors concerning the <…> category
       3.8.3 Resolution to Some of the Classification Problems
   3.9 N-best Classification
   3.10 Models of da progression
   3.11 Summary

4. Using Cue Phrases for Cross-Corpus Classification
   4.1 Introduction
   4.2 Establish Baselines for New Corpora
       4.2.1 icsi-mrda Corpus Classification
       4.2.2 amities Corpus Classification
       4.2.3 Number of Cues
       4.2.4 Review of Baselines
   4.3 Cross-Corpus Classification
       4.3.1 Cue Overlap
       4.3.2 icsi-mrda & switchboard Cross-Corpus Experiments
       4.3.3 switchboard & amities ge Cross-Corpus Experiments
   4.4 Summary

5. Conclusions and Future Work
   5.1 Conclusion
   5.2 Applications and Future Research
       5.2.1 AT-AT: Albany Tagging & Annotation Tool
       5.2.2 Collaboration & Deliberation
       5.2.3 Next Steps

References

Appendix
A. Mapping between switchboard, amities ge and superclass labels
B. Cue phrases shared between the switchboard and icsi-mrda corpora, listed by da label
LIST OF FIGURES

2.1 Sequence of utterances, showing game structure, from the maptask corpus (taken from Kowtko et al. (1993))
2.2 Hierarchy of 43 verbmobil Dialogue Acts
2.3 The hierarchical damsl Dialogue Acts
2.4 Hierarchy of XDML labels for Information Level and Communicative Status
2.5 Hierarchy of XDML labels for Forward-Looking Function
2.6 Hierarchy of XDML labels for Backward-Looking Function
2.7 Example dialogue with the TRIPS system
3.1 An utterance interrupted by a back-channel
3.2 Effects of predictivity and frequency on tagging accuracy
3.3 Plateau of good performance for thresholds
3.4 switchboard: Example utterance incorrectly labelled
3.5 N-best classification, for various n
3.6 Example rules learnt by the JRip algorithm over a sample of the switchboard data. Features used are ‘CuDAC’: the classification made by the CuDAC algorithm; ‘Truth’: the actual da of the current utterance; and DA−n: the actual da at position n
4.1 Excerpt of dialogue from the icsi-mrda corpus
4.2 Linear increase in number of cues, as number of total words in corpus increases
A.1 xdml simplification steps
A.2 Understanding mapping table (xdml → superclass ← switchboard-damsl)
A.3 Agreement mapping table (xdml → superclass ← switchboard-damsl)
A.4 Conventional, offer, options, commits and action directive mapping table (xdml → superclass ← switchboard-damsl)
A.5 Questions mapping table (xdml → superclass ← switchboard-damsl)
A.6 Answers and Statements mapping table (xdml → superclass ← switchboard-damsl)
A.7 switchboard labels with no mapping to xdml labels (superclass ← switchboard-damsl)
LIST OF TABLES

2.1 maptask dialogue acts
2.2 The 18 top-level verbmobil Dialogue Acts
2.3 switchboard dialogue acts
2.4 icsi-mrda dialogue acts, taken from Shriberg et al. (2004): not shown are the labels indecipherable, non-speech and non-labelled
2.5 A selection of the amities ge dialogue acts. Shown are all acts that have a frequency count across the corpus greater than 0.55% of the total number of labels
2.6 15 das of the AMI corpus
2.7 Summary data for the major dialogue corpora
2.8 Number of cue phrases for each method of automatic discovery investigated by Samuel et al. (1999), after lexical filtering
2.9 62 single word cue phrases from literature, as reported in Hirschberg et al. (1993)
2.10 Classification results over the Verbmobil, Switchboard and ICSI corpora
2.11 Classification results over additional corpora
3.1 switchboard: Example cue-based classifier
3.2 switchboard: Example classifier, with <…> and <…> features
3.3 switchboard: Example classifier, with utterance length and <…> and <…> features
3.4 switchboard Experiments: 50k data set
3.5 switchboard Experiments: 4k data set
3.6 switchboard Experiments: 202k data set
3.7 switchboard: Best threshold experiments, 202k data set
3.8 switchboard Dialogue Acts: Overall tagging accuracy
3.9 switchboard: Single category error analysis
4.1 icsi-mrda dialogue acts, taken from Shriberg et al. (2004): not shown are the labels indecipherable, non-speech and non-labelled
4.2 icsi-mrda dialogue acts, clustered into 11 general tags, not including any disruption classes
4.3 icsi-mrda Classification Results
4.4 amities ge dialogue acts, with frequency of occurrence above 0.5%, clustered with the <…> tag removed
4.5 amities ge Classification Results
4.6 switchboard and icsi-mrda overlapping cue phrases
4.7 switchboard & icsi-mrda Cross-Corpus Experimental Results
4.8 switchboard & amities ge Cross-Corpus Experimental Results
1. INTRODUCTION

The study of dialogue, the communication between two or more participants, both spoken (either face-to-face or at a distance via telephone) and written (such as Internet chat), remains an interesting challenge within the field of Computational Linguistics. Dialogue has been studied from a variety of different perspectives, including linguistic and psychological, but it is relatively recently that this study has extended to the computational. One mechanism that has proved useful in analysing dialogue has been Dialogue Acts (Bunt, 1994). These acts are very much related to the Speech Acts introduced by Austin (1962), where each utterance can be explained by understanding the action that the speaker is attempting to accomplish. Here, as elsewhere in this thesis, we refer to an ‘utterance’ as a natural unit of speech that corresponds to a single act. This is a slightly different definition to that used in the speech community, where an utterance is a complete unit of speech bounded by the speaker’s silence. In this thesis, we refer to such a unit as a ‘speaker turn’ or just ‘turn’. Thus a single speaker turn can be comprised of many utterances. Whereas Speech Acts deal with the intentions of the speaker, and the resulting actions of the hearer, Dialogue Acts (das) are annotations over
utterances that ascribe dialogue function to those utterances. We are interested in the role each utterance plays in the developing dialogue at the functional level. This means that we want to know if a particular utterance is functioning as a question, a confirmation or a statement of fact. Ideally we want to learn models of dialogue from data annotated with these acts. To do so requires that we have corpora of transcribed dialogues, which have been reliably encoded with Dialogue Acts. A major development in recent years is the availability of dialogue corpora, through organisations such as the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu) and the European Language Resources Association (ELRA, http://www.elra.info). An increasing number of research projects are making human annotated dialogue transcripts available to the research community. Once annotated corpora were available, researchers were able to construct models of interaction based on real data. Initially, models of da assignment required complex planning and belief based interpretation (such as the work of Cohen and Perrault (1979), Perrault and Allen (1980) and Ballim and Wilks (1991)). However, statistical models were soon at the fore, where annotated data is used to automatically train da classifiers. Both the knowledge based and the statistical approaches have practical utility for building working language processing systems, although models exploiting statistical methods, often trained over large corpora, tend to display a greater range of coverage than the more precise, yet possibly limited, hand coded planning methods. Both models can be scientifically illuminating,
showing which aspects of the data (called ‘features’ in machine learning) help the system correctly predict linguistic annotations. There are a wide range of features used in da assignment, including the words in each utterance, syntactic information (such as parse chunks or part of speech information), pragmatic information, including the discourse context as captured by the das of preceding utterances, whether there has been a change of speaker, and prosodic information from the acoustic signal if the audio data is available. We can intuitively imagine the importance of some of these features a priori. For example, useful factors in determining specific functions of utterances might include the lexical content of the utterance, or the nature of the preceding utterance, or some combination of features. There have been a number of approaches to the automatic assignment of a da tag to utterances, most of which leverage recent advances in machine learning techniques and toolkits. An extensive overview is presented in Chapter 2, but some of the more popular approaches include Planning and Belief interpretation (Cohen and Perrault, 1979), Hidden Markov Models (Stolcke et al., 2000), Transformation-Based Learning (Samuel et al., 1998), Memory-Based Learning, Neural Networks (Ries, 1999), an extended version of Latent Semantic Analysis, FLSA (Serafin and Eugenio, 2004), and a range of Bayesian approaches (Grau et al., 2004). As might be understood from this list, a significant amount of effort on Dialogue Act classification has been devoted to choosing a machine learning algorithm. So far there has been little practical comparison between machine learning algorithms, and this is often made harder by the lack of, for example, standardised data sets for training and testing. Equally, there has not been significant exploration of the impact
of individual features for da classification, with the notable exception of the work of Samuel et al. (1999), where the authors investigate the use of ‘cue phrases’ as a key feature for their machine learning approach.
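To make the preceding feature inventory concrete, the sketch below shows one way such utterance-level features could be collected; it is a simplified illustration under assumed feature names, not the feature set used later in this thesis.

```python
# Illustrative sketch only: the feature names and representation below are
# assumptions for exposition, not the feature set used later in this thesis.

def extract_features(utterance_words, prev_da=None, speaker_changed=False):
    """Collect simple utterance-level features for da classification.

    utterance_words : list of word tokens in the utterance
    prev_da         : Dialogue Act label of the preceding utterance, if known
    speaker_changed : True if this utterance starts a new speaker turn
    """
    features = {
        "length": len(utterance_words),                  # utterance length in words
        "first_word": utterance_words[0].lower() if utterance_words else "",
        "prev_da": prev_da,                              # discourse context feature
        "speaker_change": speaker_changed,               # turn-taking feature
    }
    # Lexical features: each word and word bigram becomes a boolean feature.
    for word in utterance_words:
        features["word=" + word.lower()] = True
    for w1, w2 in zip(utterance_words, utterance_words[1:]):
        features["bigram=" + w1.lower() + "_" + w2.lower()] = True
    return features

print(extract_features("so do you have any pets".split(),
                       prev_da="<Statement>", speaker_changed=True))
```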
1.1 Goal of Thesis

In this thesis, we wish to take a look at the automatic selection of words and phrases (so-called ‘cue phrases’) from annotated utterances in dialogue corpora, determine if we can extract a good set of cues, and then investigate the use of these cues as a da classifier. In addition, we want to know if we can extract cue phrases from one corpus, and apply them to a different corpus from a different domain of conversation (so-called “out of domain” data). Specifically we wish to address the following research question:
Given an appropriately annotated corpus of dialogues, can we automatically extract cue phrases from utterances that are useful for Dialogue Act classification?

To expand on this question, we want to know if, given a corpus of human annotated dialogues, where individual utterances have been manually assigned Dialogue Acts, we can automatically discover words and phrases that are predictors of das. We hypothesise that cue phrases are reliable indicators of intention or discourse structure, for example words and phrases such as ‘now’, ‘well’ or ‘can you’. Such cue phrases are reported in the existing linguistics literature, and have previously been automatically extracted
from small corpora, as described in the work of Samuel et al. (1999). This question suggests two corollary questions that we will also explore:

• Can the cue phrases we automatically extract be used directly as a method of classification, without reference to dialogue context information?

To address our research question, we want to identify cue phrases that serve as reliable indicators of das. One method of testing such reliability is to use the cue phrases directly as a classification method. We will attempt to construct a da classifier that uses cue phrases in conjunction with other utterance internal features.

• Are these automatically extracted cue phrases general in nature; can they be used to classify utterances from different dialogue types (conversation vs. task oriented) and domains (free conversation vs. financial transactions)?

Finally, if we can extract a set of reliable cue phrases, can we determine how general these phrases are? By general, we mean the ability to apply these cue phrases, extracted from one particular corpus, to dialogue data from a different corpus entirely, where the interactions in this new corpus may be of a different style (free flowing, social conversation in contrast to more focused, task specific dialogue for instance). If the cue phrases we automatically extract can be used as features to successfully classify new dialogue data, we hypothesise that this is evidence of the power of cue phrases in human language. Capturing general cue phrases
could greatly aid both the labelling and decoding of Dialogue Acts for a range of applications.
1.2 CuDAC: Cue-Based Dialogue Act Classification

As we shall see in this thesis, cue phrases can play a key role in the identification of dialogue acts. Our model captures the idea that the surface form of an utterance provides all manner of indicators as to what the dialogue act could be. Such cues can be lexical, collocational, syntactic, prosodic or based on a deeper conversational structure. The key to the cue model from a computational perspective is that these cues can be probabilistically associated with specific dialogue acts. The work of Samuel et al. (1999) indicates that these automatically discovered cue phrases can be a very powerful indicator of the associated dialogue act, when used as a feature for some machine learning algorithm. The work presented in this thesis uses these cue phrases directly as a technique for classifying the correct dialogue act for an utterance. This combination of automatic cue discovery and the subsequent use of these cues directly for classification of both in-domain and out-of-domain data forms the basis of this thesis. If we can automatically acquire cue phrases and are able to accurately recognise cue phrase usage, we can use this information to annotate dialogues in a way useful to later processing applications.
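As a rough illustration of this idea, the sketch below scores word n-grams by how strongly they predict a single dialogue act, and then labels unseen utterances by the strongest matching cue. It is a simplified sketch only: the thresholds, labels and toy data are invented, and the full CuDAC model with its additional features is developed in Chapter 3.

```python
from collections import defaultdict

# Sketch of cue-based classification: cue phrases (here, word unigrams and
# bigrams) are scored by their predictivity, i.e. the conditional probability
# of the most frequent dialogue act given the cue, and unseen utterances are
# labelled by the strongest cue they contain. The thresholds, labels and toy
# data are invented; the model developed in Chapter 3 uses more than this.

def ngrams(words, n_max=2):
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def extract_cues(labelled_utterances, min_freq=2, min_predictivity=0.6):
    counts = defaultdict(lambda: defaultdict(int))        # cue -> da -> count
    for words, da in labelled_utterances:
        for cue in set(ngrams(words)):
            counts[cue][da] += 1
    cues = {}
    for cue, by_da in counts.items():
        total = sum(by_da.values())
        best_da, best_count = max(by_da.items(), key=lambda kv: kv[1])
        if total >= min_freq and best_count / total >= min_predictivity:
            cues[cue] = (best_da, best_count / total)     # keep reliable cues only
    return cues

def classify(words, cues, default="<Statement>"):
    matches = [cues[c] for c in ngrams(words) if c in cues]
    if not matches:
        return default
    return max(matches, key=lambda m: m[1])[0]            # most predictive cue wins

training = [("do you have pets".split(), "<Yes-No-Question>"),
            ("do you like dogs".split(), "<Yes-No-Question>"),
            ("yeah i have a dog".split(), "<Statement>"),
            ("i have two cats".split(), "<Statement>")]
cues = extract_cues(training)
print(classify("do you have a cat".split(), cues))        # -> <Yes-No-Question>
```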
1.3 Structure of the Thesis

This thesis has five chapters, the first of which is this introduction. We then present a review of prior work, and an overview of the state-of-the-art in da classification (Chapter 2). We give an overview of our method for Dialogue Act classification (Chapter 3), with a series of experiments using one annotated dialogue corpus. We automatically extract a number of potential cue phrases from this corpus, and in order to evaluate their effectiveness, we introduce a cue-based method of da classification, where we apply these cue phrases directly. In doing so, we demonstrate a high degree of performance over this corpus in relation to other, published work. We examine our approach in detail, including an error analysis of performance, to gain insights into both our approach and the data we are attempting to classify. We explore the utility of n-best classification, selecting more than a single category for each utterance, and we incorporate simple models of dialogue context (models of da progression), to illustrate that for our approach, dialogue context appears to add little additional usable information that our chosen features do not already capture or exploit. We apply this method for cross-corpus classification (Chapter 4). We begin by exploring our Dialogue Act classification approach using a range of corpora, to demonstrate that we have not over-tuned our model to one specific data set. Then we attempt cross-corpus classification where we demonstrate that, in some cases, we can train our cue based classifier on one set of data, and achieve reasonable performance on an entirely new and unrelated data set. We illustrate the approach by training over three different data
sets. We conclude with some novel applications, future work and conclusions (Chapter 5).
1.4 Contributions to Literature

The work reported in this thesis has contributed to the following publications: Strzalkowski et al. (2009), Benyon et al. (2008), Field et al. (2008), Kruschwitz et al. (2008), Muhlberger et al. (2008), Webb and Liu (2008), Webb et al. (2008), Webb et al. (2005a), Webb et al. (2005b), Webb et al. (2005c), Wilks et al. (2004), Wilks et al. (2003).

• Webb et al. (2005a) describes the cue-based da classification mechanism, and its application to the switchboard corpus. In this paper we explore the effects of training data size on our method, and experiment with a range of features for da classification. The paper outlines the method that can be seen in Chapter 3.

• Webb et al. (2005b) explores the automatic discovery of key threshold parameters. The paper describes our experiments with a range of manually determined parameters, and the subsequent use of a validation set of dialogues to set these parameters automatically. An expansion of this work can be seen in Chapter 3.

• Webb et al. (2005c) performs an error analysis of classifier performance. This paper forms the basis of the error analysis section in Chapter 3.

• Webb et al. (2008) and Webb and Liu (2008) describe the techniques used for cross-corpus classification. This work is expanded in Chapter 4. We explore the overlap of cue phrases between three corpora, the switchboard corpus, the icsi-mrda corpus and the amities ge corpus.

• Muhlberger et al. (2008) and Strzalkowski et al. (2009) describe applications which have made use of our method for cue-based da classification. A brief description of these applications appears in Chapter 5.

• Wilks et al. (2003) and Wilks et al. (2004) describe early work on our classifier in the context of the amities project.
2. MOTIVATION AND STATE-OF-THE-ART

2.1 Introduction

This Chapter reviews some of the main contributions to the field of Dialogue Act (da) analysis and classification. We start with a definition of Dialogue Acts, and a historical perspective of their development, beginning with Speech Acts. We review the role of discourse cues in the identification of dialogue acts specifically, and the interplay of dialogue acts and cues in discourse structure in general. As dialogue acts have evolved, so a number of annotation schema for them have emerged, and we discuss these in terms of their uses, aims and objectives. Corpora annotated for Dialogue Acts play a key role in the development and evaluation of algorithms in this field, so some overview of prominent corpora will be presented. The potential uses of Dialogue Acts, and their benefit to applications, are surveyed. This includes the role das can play in practical applications such as dialogue systems development and evaluation. Finally, we look at prior methods for the automatic annotation of Dialogue Acts. Throughout this part of the chapter, we present a series of tables that show the accuracy of these approaches over the particular corpora, and compare their performance where possible, in terms of the features they use, the tag-set they are operating over, and their resulting accuracy as classifiers.
2.2 Dialogue Acts: A Definition

Searle (1969) built on the work by Austin (1962), and proposed an initial typology of speech acts:

1. assertives commit the Speaker to the truth of some proposition (e.g. stating, claiming, reporting, announcing);

2. directives attempt to bring about some effect through the action of the Hearer (e.g. ordering, requesting, demanding, begging);

3. commissives commit the Speaker to some future action (e.g. promising, offering, swearing to do something);

4. expressives are the expression of some psychological state (e.g. thanking, apologising, congratulating);

5. declarations are speech acts whose successful performance brings about the correspondence between the propositional content and reality (e.g. resigning, sentencing, dismissing, christening).

In many of the examples Searle (1969) addressed, the utterances contained cue words or phrases, which could be used to indicate the presence of particular Speech Acts. For example, a number of directives were realised by utterances beginning with wh-words, or containing subject-aux inversion (“Can you…”). The Literal Meaning Hypothesis (Gazdar, 1981) is a strong version of this observation, suggesting that every utterance has an illocutionary force built into its surface form. If it is the case that the selection of the surface realisation of our utterances is so intentional as to communicate the illocutionary force we wish to display, it indicates that an analysis of utterances that relies on surface lexicalisations should perform well in identifying the underlying Speech Act. While a full treatment of speech acts provides a useful characterisation of pragmatic force, and is certainly an interesting area of study within linguistics, models of interpretation often require complex conversational planning mechanisms, and models of the belief space of Speaker and Hearer (cf. the work of Cohen and Perrault (1979), Perrault and Allen (1980), Ballim and Wilks (1991) and Lee and Wilks (1996)). Within the field of computational linguistics, recent work, closely linked to the development and deployment of spoken language dialogue systems, has focused on some of the more conversational roles such acts can perform. For example, if an automatic system or agent asks a question, and a user replies with another question, it is possible for this to be seen, in context, as a clarification of the original question. Less important then are the notions of pragmatic force, and more important is their role in a progression of dialogue and how the recognition of some sort of act can aid in the interpretation of a user’s utterance. These acts are called Dialogue Acts (Bunt, 1994) or in some literature Conversational or Dialogue Moves (Power, 1979). This level of analysis concentrates
on how what you say commits you to some action, such as accepting or rejecting some offer. Lewin (2000) characterises this as a stress on “what you say committing you, rather than your committing to what you say”, as in the focus of the early work by Austin (1962) and Searle (1969). There is a possible interpretation of the work of Lewin (2000), in common with the literal meaning hypothesis, that there is much literal meaning in the words chosen to express an utterance that can also be used in interpretation. This stands in contrast to the need to construct a complex mental model of your conversational partner in order to accurately process utterances. This thesis explores to what extent you can perform an interpretation of utterances based solely on the words in the utterance (and a few other simple lexical features).
2.3 Dialogue Acts and their Uses

Having identified what Dialogue Acts are, we look at some of the uses of, and references to, Dialogue Acts. Notably, das are used practically in many live dialogue systems such as the SUNDIAL project (Peckham, 1993), the Airline Travel Information Systems (ATIS) programme (c.f. Seneff et al. (1991)), the DARPA Communicator program (c.f. Constantinides and Rudnicky (1999), Levin et al. (2000), and Pellom et al. (2001)), the verbmobil project (Wahlster, 2000), and in the amities dialogue system (Hardy et al., 2005). We do not describe each of these projects in detail. Instead we look at the kinds of tasks other than immediate utterance interpretation that das can be used for.
2.3.1 Evaluation Metrics

The Communicator Program, referenced earlier, made extensive use of the PARADISE metric (Walker et al., 1997a). PARADISE (PARAdigm for DIalogue System Evaluation) was developed to evaluate the performance of spoken dialogue systems, in a way de-coupled from the task the system was attempting. ‘Performance’ of a dialogue system is affected “both by what gets accomplished by the user and the dialogue agent working together, and how it gets accomplished”: in other words, PARADISE aims to maximise task completion, whilst simultaneously minimising dialogue costs. These costs include measures of objective efficiency of the dialogue (length, measured in total turns for example) and some qualitative measures (how appropriate were system responses, how many ASR rejections occurred, or how many explicit requests for help were made by the user). The goals of the task are seen as an Attribute-Value Matrix (AVM); for example, in the ELVIS e-mail retrieval task (Walker et al., 1997b), “Find the time and place of your meeting with Kim”, where the time and place are pre-defined, task success can be calculated by finding a match between the AVM values at the end of the dialogue and the “true” known values for the AVM as they relate to the proposed meeting. PARADISE also looks at user satisfaction and task success, both perceived completion and as a function of the information extracted from the user. Efficiency and effectiveness metrics include the number of user turns, system turns, and total elapsed time. For the quality of interaction, it is necessary to record Automatic Speech Recognition rejections, time out prompts,
help requests, barge-ins, mean recognition score (concept accuracy), and cancellation requests. In discussing the PARADISE framework, Walker et al. (1997a) observe that many of these variables correspond directly to Dialogue Acts, both generated and recognised by the system. By annotating system data from user interactions with Dialogue Acts, PARADISE can automatically assess some parts of the operation of the dialogue system. This idea was expanded during the life-span of the Communicator program, and included the development of a set of Dialogue Acts called DATE (Dialogue Act Tagging for Evaluation) (Walker and Passonneau, 2000), a series of 10 labels created specifically to tag and analyse system utterances for the automatic dialogue systems. These dialogue act metrics are used to show the amount of effort spent managing a dialogue, or establishing and completing the task. By using dialogue acts as a feature, it is possible to estimate the effect of aspects of dialogue on task completion and user satisfaction. Various interaction costs (such as making explicit or implicit confirmations) can be used as predictors of eventual user satisfaction. Walker et al. (2001) implemented a Dialogue Act classifier for the system utterances in the Communicator corpus. They limited themselves to the system utterances because the systems concerned all used a method of template based generation, so very high levels of utterance recognition could be achieved, close to 100% accuracy. At the time they believed classifying user utterances to be too complex a task. The PARADISE model is useful both for ongoing system development, making predictions about system modifications, and distinguishing good dialogues from bad dialogues, but it can also tell on-line when a dialogue is going
wrong. Most prominently, PARADISE models of dialogue, including the use of Dialogue Acts, are being used to train complete dialogue systems using reinforcement learning (Sutton and Barto, 1998), such as the work of Rieser and Lemon (2008). A consequence of this model is that often the dialogue quality parameters are tuned to overcome the deficiencies highlighted by the observable metrics (cf. Hajdinjak and Mihelič (2006)). For example, using explicit confirmation increases the likelihood of task completion, and so is often chosen, despite being regarded as somewhat unnatural in comparable human-human speech data.
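PARADISE combines these measures into a single figure by trading normalised task success against a weighted sum of normalised costs. The sketch below illustrates the shape of that trade-off; the weights, cost measures and example numbers are placeholders of our own, not values from Walker et al. (1997a).

```python
from statistics import mean, pstdev

# Sketch of a PARADISE-style performance function: normalised task success
# traded off against a weighted sum of normalised dialogue costs. The cost
# names, weights and example figures are illustrative placeholders, not
# values taken from Walker et al. (1997a).

def z_normalise(value, sample):
    spread = pstdev(sample)
    return 0.0 if spread == 0 else (value - mean(sample)) / spread

def performance(task_success, costs, all_success, all_costs,
                alpha=1.0, weights=None):
    """task_success : task success score for one dialogue (e.g. AVM match)
    costs           : cost measures for that dialogue, e.g. {"turns": 12}
    all_success     : task success scores over the whole dialogue set
    all_costs       : each cost name mapped to its values over the whole set
    """
    weights = weights or {name: 1.0 for name in costs}
    success_term = alpha * z_normalise(task_success, all_success)
    cost_term = sum(weights[name] * z_normalise(costs[name], all_costs[name])
                    for name in costs)
    return success_term - cost_term

all_success = [0.9, 0.6, 0.75]
all_costs = {"turns": [10, 25, 16], "asr_rejects": [1, 5, 2]}
print(performance(0.9, {"turns": 10, "asr_rejects": 1}, all_success, all_costs))
```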
2.3.2 Discourse Structure

So far, we have referred to Dialogue Acts as a mostly stand alone entity, marking the intention of a single utterance. We have indicated that many researchers use sequences of dialogue acts, predicting the next act to come, and using such predictions to help with, for example, speech recognition. The use of sequence information in this way is based on the existence of a discourse structure, the idea that dialogue is not a random collection of utterances, but that certain forms (such as questions) lead to certain other forms (such as answers, or further, clarification questions). In this section, we examine such structure in closer detail. We note that both dialogue acts, as discussed throughout this chapter, and discourse cues, words or phrases that have a particular structural role in conversation, play a vital role in the identification of discourse structure. Such structural information can be used in many natural language processing
tasks, including anaphora resolution and language generation. For example, a comprehensive treatment of discourse structural issues is presented in Grosz and Sidner (1986). In their work, Grosz and Sidner (1986) stress discourse purpose and processing, through three separate but interrelated components (needed to explain such dialogue phenomena as interruptions, referring expressions, and so on). The three components are Linguistic Structure (loosely seen as the sequence of utterances), Intentional Structure and the Attentional State. For the purposes of this thesis, the most pertinent levels of understanding are Linguistic Structure, where the utterances in a discourse are naturally aggregated into discourse segments (like words into constituent phrases), and Intentional Structure, an indication of the intention motivating the discourse or dialogue. Within the linguistic structure, segments are not necessarily continuous, thanks to the influence of, for example, interruptions (an example of this can be seen in Chapter 3, in Figure 3.1). There is a two-way interaction between discourse segment structure and the utterances constituting the discourse: linguistic expressions can convey information about the discourse structure (for example using cue phrases or linguistic boundary markers), whilst at the same time discourse structure constrains the interpretation of these linguistic expressions. The nature of discourse understanding as presented in the work of Grosz and Sidner (1986) treats the entire discourse as individual units of information, contributing toward a greater whole. This gives us two essential units of information: each individual utterance of a user, and the global discourse or dialogue as context. In order to accommodate the claims of Grosz and Sidner (1986) about task structure influencing dialogue structure, it seems likely that there are structures higher than a single utterance, yet more fine grained than a complete dialogue. A commonly occurring example is the <question>-<answer> pair, where one person asks a question, and the next utterance contains an answer to that question. Of course, dialogue can get infinitely more complex than this; for example, the first question could itself be answered by another question, perhaps indicating some sort of clarification required before the initial question can be answered. Several researchers identify structures within dialogue at levels higher than individual utterances or speaker turns, but below the level of complete discourse description. There has been some significant exploration of the use of sequences of Dialogue Acts, at a number of levels of granularity. The simplest, illustrated in our <question>-<answer> pair, is the study of adjacency pairs, as reported in Schegloff and Sacks (1973) and Atkinson and Drew (1979), and exploited directly in computational models we have already hinted at, and that we shall examine later (cf. Nagata and Morimoto, 1994; Reithinger and Maier, 1995). Similar to adjacency pairs is the notion of exchanges (Sinclair and Coulthard, 1975). By their definition, exchanges are most suited to classroom, or doctor/patient interchanges, where there is a known structure in the interaction and little exchange of initiative or deviation from that known structure. Under such analysis, with the specified constraints, much dialogue appears to be constructed from adjacency pairs: greetings following other greetings, answers following questions. Kowtko et al. (1993) looked at the possibility of nested structures in
the HCRC maptask corpus (described in Section 2.4.1). Nested structures, dispreferred by Sinclair and Coulthard (1975) in favour of their adjacency-pair approach, seem to make anecdotal sense, in the following way. If one is attempting to complete an information passing cycle, say informing a friend of the location of a social event, the speaker may make certain assumptions about the hearer’s knowledge of known landmarks. If these assumptions are incorrect, the hearer may initiate a clarification sub-dialogue, interrupting the main goal or game of information interchange only to clarify specific details. Such sub-structures, like clarification and negotiation, have been found both in the somewhat artificial dialogues of the maptask corpus (to give one example) and in naturally occurring dialogue (Carberry and Lambert, 1999). For example, a question may be followed by a clarification question, thus generating a side sequence (Jefferson, 1972), as in the example below, slightly modified from Levinson (1983).

(1) A: “How many tubes would you like sir?”
(2) B: “Uhm. What’s the price now eh with VAT do you know eh?”
(3) A: “Er I’ll just work that out for you”
(4) B: “Thanks”
(pause)
(5) A: “Three pounds nineteen a tube sir”
(6) B: “Three nineteen is it”
(7) A: “Yeah”
(8) B: “Then I’ll have 3 tubes”

In this example, presumed to be part way through a conversation, speaker
A asks a question (utterance (1)). Instead of answering the question directly, speaker B initiates a clarification side sequence, which spans from utterance (2) (its inception) to utterance (7). Even within this sequence there are elements of sub-structure; utterance (3) from speaker A is a hold utterance, to which B replies out of politeness (“thanks”). In utterance (5), speaker A replies with the information requested by the clarification question, although speaker B then requests explicit confirmation (utterance (6)), which is received from speaker A in utterance (7) of this example. Given that the answer to the initial question (in utterance (1)) is finally received in utterance (8), the conversation from utterance (2) to utterance (7) can be seen as an insertion sequence. Such structure has been referred to in some literature as Conversational Games. Games are captured by conversational game theory (Power, 1979). These games are comprised of transactions, the highest level information units, which have some correlation with sub-tasks, sub-topics or the side sequences shown above. Carletta et al. (1997) describe dialogue games as exchanges between speakers that fulfil some limited goal, embodying our expectation of natural patterns within dialogue, e.g. that questions usually precede answers, and requests precede either an acceptance or a refusal. Games consist of initiations, that set up a discourse expectation, and are then usually followed by responses, that fulfil those expectations. In their work, Kowtko et al. (1993) use sequences of Dialogue Acts (or moves, as they call them) to indicate game initiation, sub-games and game completion. They found that certain of their moves, namely <instruct> (the communication of a direct instruction), <check> (an explicit confirmation), <query-yn> (a yes-no question), <query-w> (a wh-question), <explain> (an unsolicited new nugget of information relating to the goal) and <align> (a check of the other participant’s understanding or accomplishment of a goal), are the initiating moves of dialogue games, and there are six other moves that serve as response and feedback moves within a game. An example of an instructing game can be seen in Figure 2.1.

Figure 2.1: Sequence of utterances, showing game structure, from the maptask corpus (taken from Kowtko et al. (1993))

In this example, there are four types of games: Instructing (I), Explaining (E), Querying (Q) and Checking (C). The dialogue fragment shown is one
large Instructing game that contains three nested games. Units marked ‘*g’ are from the direction giver, and those marked ‘*f’ are from the follower. The angled brackets are indicators of overlapping speech, and the double vertical lines (||) mark move boundaries. Dialogue modelling as a whole is made easier if these games can be shown to exist, and exist across domains and conversational styles. In terms of intention analysis, it may be possible to combine information from a conversational structure approach with simple methods for analysing the content of utterances. For example, if higher level structures can give a more global perspective on the current state of a dialogue, that may enable better predictions of what is to follow, and aid interpretations of otherwise ambiguous utterances. Evidence supporting this view can be found in Poesio and Mikheev (1998), who achieve a 12% improvement in da tagging accuracy on the maptask corpus by exploiting information about manually assigned game boundaries and types. Of course, this ignores the potentially circular argument, that we require da analysis to determine game structure, whilst relying on game structure to improve our da analysis.
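The sequence regularities discussed in this section (questions tending to precede answers, and so on) are exactly what da n-gram models of the kind cited above try to capture. The following sketch estimates a smoothed bigram model of da progression from annotated dialogues; the labels, toy dialogues and smoothing constant are invented for illustration.

```python
from collections import defaultdict

# Minimal bigram model of da progression: estimate P(next act | previous act)
# from counts over annotated dialogues. The labels, toy dialogues and
# smoothing constant are invented for illustration.

def train_bigrams(dialogues, smoothing=0.5):
    counts = defaultdict(lambda: defaultdict(float))
    acts = set()
    for da_sequence in dialogues:
        acts.update(da_sequence)
        for prev, curr in zip(da_sequence, da_sequence[1:]):
            counts[prev][curr] += 1

    def prob(prev, curr):
        total = sum(counts[prev].values()) + smoothing * len(acts)
        return (counts[prev][curr] + smoothing) / total

    return prob, acts

dialogues = [["<Greet>", "<Yes-No-Question>", "<Yes-Answer>", "<Statement>"],
             ["<Greet>", "<Wh-Question>", "<Statement>", "<Bye>"]]
prob, acts = train_bigrams(dialogues)
most_likely = max(acts, key=lambda a: prob("<Yes-No-Question>", a))
print(most_likely, round(prob("<Yes-No-Question>", most_likely), 3))
```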
2.4 Corpora & Labelling Paradigms

Whilst the investigation of Dialogue Acts can occupy a theoretical space concerning the communication of concepts between speakers and hearers, the best place to observe them as a mechanism of communication are studies of corpora of interactions in which da usage has been indicated or marked. Later we want to explain the prevalent computational models used to recognise or generate dialogue acts automatically, both using these corpora and as a tool to annotate new resources. Since the availability of such annotated resources is a major factor in developing computational models, we will take the time to describe a few key resources. We will then need to focus both on the resource itself, and on the labelling paradigm for Dialogue Acts that the resource employs. There are a number of different goals to bear in mind when designing a set of da labels. Principally, researchers are attempting to capture some notion of the utterance function: i.e. what role the speaker is attempting to accomplish with this utterance. One key difference that can be seen between tag-sets is the granularity of approaches. Some domain specific approaches might choose to concatenate semantic and functional information into a single label (consider a dialogue act such as <…>), where another (higher level) approach might only identify the same utterance as a request, and leave the semantic analysis to another module. Domain specific labels are easier to use for particular functionality, such as working out how to reply to specific requests for example, but they are not portable to new domains, nor do they provide sufficient generality to learn models of dialogue beyond their specific application. There has been a proliferation of labelling schemes, often with each new corpus generating a new schema, often starting from the typology suggested by Searle (1969). The conceptual granularity of the da labels used varies considerably between alternative analyses, often being driven by demands specific to some application or domain. In the context of a European-funded
project, MATE (http://mate.nis.sdu.dk/), Klein (1999) surveyed 16 different Dialogue Act taxonomies, and drew several comparisons. Each scheme provided coding books for training purposes, and those that were so evaluated found moderate to good levels of inter-annotator agreement, showing that these acts could be reliably encoded. Notably, even whilst attempting to preserve some sense of domain independence, most schemes were hard to re-use in different contexts because they intrinsically were domain, task or language dependent. That is, when a new application or domain led to the creation of a new annotated resource, the annotations were highly specific to the needs of the researchers at that time. In the late 1990s, a subset of researchers, who referred to themselves as the Discourse Resource Initiative (the DRI website is now defunct, but a substantial number of original materials can be found at the multiparty discourse group, a member of the DRI, at Rochester: http://www.cs.rochester.edu/research/speech/damsl/), collaborated to produce a reusable hierarchical representation of das called Dialogue Act Mark-up in Several Layers (DAMSL), which was designed from the outset to be applicable across multiple domains. This set of labels remains the most widely used and sourced labelling paradigm. In the following sections, we present some of the most notable schemes, in particular those that we have referenced in earlier sections, or those to which we apply our experiments in the chapters to follow. We detail each labelling scheme and, where applicable, the corpora they are associated with.
Table 2.1: maptask dialogue acts

Initiating Moves: instruct, explain, check, align, query-yn, query-w
Response Moves: acknowledge, reply-y, reply-n, reply-w, clarify, ready
2.4.1 MapTask

The first corpus we describe came out of the MapTask project. The dialogues in this corpus are collaborative in nature, as people are working together to solve a shared problem. The maptask corpus (Anderson et al., 1991), available for download from http://www.hcrc.ed.ac.uk/maptask/, comprising 128 task-oriented dialogues (of 150,000 total words), is a collection of dialogues in which two people negotiate an agreed route on separate (and slightly different) maps. An excerpt of the corpus can be seen in Figure 2.1. The dialogues feature a giver, who has a map featuring a known route, and a follower, who is attempting to reproduce that route on their own map. These maps contain landmarks which the giver can use to describe the route, although the features may not match between the two maps, resulting in some confusion, and need for clarification dialogue. The da annotation uses 12 distinct da labels (as shown in Table 2.1), and this corpus represents one of the very few resources annotated for dialogue games, as we described in Section 2.3.2. The full annotation scheme is outlined in Carletta et al. (1997). Significantly, Carletta et al. (1997) performed excellent work in determining the inter-annotator agreement for the annotations over the MapTask corpus, something that became standard practice for the majority of corpora that followed. Carletta et al. (1997) used the kappa statistic (see Artstein and Poesio (2008) for an excellent overview of the use of inter-coder reliability scores in Natural Language Processing), and determined that scores of 0.8 or higher indicated a high degree of reliability, good enough for the basis of machine learning algorithms. Scores lower than 0.8 indicate that human coders cannot reliably make the judgements required by the schema. For the MapTask corpus, experienced annotators scored a kappa of 0.83 for dialogue act annotation. This dropped (to 0.67) when naive coders were used, indicating that this is not a trivial assignment of categories, but that sufficient training can lead to good annotation results.
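The kappa statistic corrects raw agreement for the agreement expected by chance between annotators. The sketch below shows the standard two-annotator (Cohen) version; the example move labels are invented and are not taken from the MapTask annotation study.

```python
from collections import Counter

# Cohen's kappa for two annotators over the same utterances:
#   kappa = (P(A) - P(E)) / (1 - P(E))
# where P(A) is the observed agreement and P(E) is the agreement expected by
# chance from each annotator's own label distribution. Labels are invented.

def kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(dist_a) | set(dist_b)
    expected = sum((dist_a[l] / n) * (dist_b[l] / n) for l in all_labels)
    return (observed - expected) / (1 - expected)

coder1 = ["instruct", "check", "reply-y", "instruct", "acknowledge", "query-yn"]
coder2 = ["instruct", "align", "reply-y", "instruct", "acknowledge", "query-yn"]
print(round(kappa(coder1, coder2), 2))   # moderate agreement on this toy sample
```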
2.4.2 verbmobil

The verbmobil project (1993 – 2000) aimed at the development of an automatic speech to speech translation system for the languages German, American English and Japanese (Wahlster, 2000). Within verbmobil an empirical data collection was carried out by seven academic institutions in Tokyo, Pittsburgh, Kiel, Bonn, Hamburg, Karlsruhe and Munich. The main task of this data collection was to record a large corpus of spontaneous speech dialogues, and include annotations necessary to train acoustic models, language models, and to build the translation dictionary (together with most likely pronunciation variants).
Figure 2.2: Hierarchy of 43 verbmobil Dialogue Acts (a tree rooted at ‘dialogue act’, with top-level acts such as suggest, accept, reject, request_suggest and request_comment, and their date, duration and location sub-acts)
In the first phase of the verbmobil project, a corpus of 168 English dialogues4 (comprising 3117 utterances) was annotated with a total of 43 distinct Dialogue Acts (Jekat et al., 1995). These acts were organised in a hierarchy, where 18 acts were designated to describe the primary intentions, at a supposedly domain-independent level. The 43 acts can be seen in Figure 2.2, and the 18 higher-level acts can be seen in Table 2.2. There was a second phase of the verbmobil project, which expanded the dialogues from meeting scheduling to comprehensive travel planning.
4 Available for download from http://www.phonetik.uni-muenchen.de/Bas/BasKorporaeng.html
Dialogue Acts: thank, bye, greet, suggest, reject, digress, give-reason, garbage, confirm, deliberate, request-suggest, request-comment, accept, init, clarify, motivate, feedback, introduce

Table 2.2: The 18 top-level verbmobil Dialogue Acts
In common with a number of other Dialogue Act annotation efforts, this change in the underlying domain of the dialogues resulted in a new da hierarchy (as outlined in Alexandersson et al. (1998)). This new set of da labels was introduced in part to cope with the new domain, and in part to address issues of data sparsity with respect to the original 43 Dialogue Acts of the phase one corpus. The new scheme contained just 29 labels, instead of the original 43. However, all automatic classification results relating to the verbmobil corpus reported in this thesis use the first phase corpus, either over the complete set of 43 acts or over the 18 higher-level clustered acts.
2.4.3 Dialog Act Mark-up in Several Layers (damsl)
In 1998, the Discourse Resource Initiative finalised a task-independent set of das, called damsl (Dialogue Act Mark-up in Several Layers), for use across different domains (Core et al., 1999). damsl and its variants have been used to mark up several dialogue corpora, such as trains (Core and Allen, 1997)
and switchboard data (Jurafsky et al., 1998). damsl draws both on the need to provide a reliable corpus from which to derive cue models, and on the philosophical underpinnings of Speech Acts as described earlier by Austin (1962) and Searle (1969). For example, damsl includes a series of forward-looking functions, such as questions and performatives (similar to Searle's directives) and offers and commits (which represent Searle's commissives). The tag-set as devised in damsl is multidimensional; there are 4 dimensions across which the tags operate: the information level (addressing the question 'is an utterance about the task, about how to complete the task (i.e. task-management), or about managing the communication itself'); the communicative status (aside from normal communication, is the utterance uninterpretable, or self-talk); and the backward- and forward-looking functions, such as statements, questions and answers. This tag-set allows as many tags as necessary to fully describe the behaviour of the utterance, which leads to a vast number of potential combinations of tags. This can create significant problems for automated processes trying to learn correlations, so, in common with other tag-sets, clusters of labels are used to reduce data sparsity. The full set of hierarchical labels can be seen in Figure 2.3. This annotation schema is very expressive, but with that comes a lower expectation of inter-coder reliability. Scores for complete dialogues have not been given, but Core and Allen (1997) cite kappa scores of 0.66 for the most prevalent da type. Some categories seem to be labelled fairly reliably (with kappa scores of 0.76 and 0.70), whereas others appear very unreliable (one scores a kappa of only 0.15). There are limited examples of dialogues that have been annotated with the full, hierarchical damsl scheme (including some early examples of TRAINS dialogues5 (Allen and Core, 1997)).
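To make the dimensional structure concrete, the sketch below uses deliberately tiny, invented label inventories (these are not the actual damsl label sets) to illustrate why the number of composite labels multiplies across dimensions:

    from dataclasses import dataclass
    from typing import Optional

    # Toy inventories for illustration only; not the actual damsl label sets.
    INFO_LEVEL = ["task", "task-management", "communication-management"]
    COMM_STATUS = ["ok", "uninterpretable", "self-talk"]
    FORWARD = [None, "statement", "info-request", "offer", "commit"]
    BACKWARD = [None, "accept", "reject", "answer", "backchannel"]

    @dataclass
    class MultiDimensionalTag:
        """One utterance receives (at most) one value in each dimension."""
        info_level: str
        comm_status: str
        forward: Optional[str]
        backward: Optional[str]

    # Even these tiny inventories yield a large space of composite labels,
    # which is the data-sparsity problem that motivates clustering.
    print(len(INFO_LEVEL) * len(COMM_STATUS) * len(FORWARD) * len(BACKWARD))  # 225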
switchboard-damsl
Precisely because of the difficulty of consistently applying the damsl annotation schema, when Jurafsky et al. (1997) wanted to annotate a large amount of transcribed speech data, the switchboard corpus, they were attracted by the domain independence of the damsl annotation scheme, but took steps to improve inter-coder reliability. The switchboard corpus contains a large number of approximately 5-minute conversations between two people who are unknown to each other, who were asked to converse about a range of everyday topics with little or no constraint. The dialogues themselves are essentially 'domain-free', the task being simply to talk to someone for a certain amount of time; thus there is no overarching domain of conversation, and few external constraints on the definition of da categories. Such conversation should be ideally suited to the damsl scheme. The goal of Jurafsky et al. (1997) was to facilitate machine learning over the da annotated portion of the switchboard corpus6, so a high degree of inter-coder reliability became very important.
5 Available for download from http://www.cs.rochester.edu/research/cisd/resources/damsl/
6 Available for download from ftp://ftp.ldc.upenn.edu/pub/ldc/public-data/swb1-dialogact-annot.tar.gz
To achieve this, Jurafsky et al. (1997) compressed the hierarchical tag-set of damsl into a collection of single atomic labels, which attempt to capture both the function of the individual utterance and the hierarchical information captured by the original damsl schema. The damsl annotated portion of the switchboard corpus (Jurafsky et al., 1997) consists of 1155 annotated conversations, containing some 225,000 utterances. The resulting switchboard-damsl annotation was a set of 220 distinct labels, still a very large set for manual annotators to manage; however, Jurafsky et al. (1997) were able to achieve an average pair-wise kappa score across the corpus of 0.8. Of the 220 categories, 130 occurred fewer than 10 times in the entire corpus. To obtain enough data per class for statistical modelling purposes, a clustered tag set was devised. This tag set distinguishes 42 mutually exclusive da types, and it is this clustered data that was used for the majority of the experiments that we report in this thesis. Table 2.3 shows the 42 categories, with the relative frequency of each over the entire corpus. While some of the original infrequent classes were collapsed, the resulting da type distribution is still highly skewed.
icsi-mrda
Like the switchboard corpus, the icsi-mrda Meeting Room corpus (Shriberg et al., 2004) was annotated using a variant of the damsl tag-set, combining the tags into a single, distinct da to reduce aspects of the multidimensional nature of the original damsl annotation scheme.
Dialogue Act                        % of corpus
statement-non-opinion               36%
acknowledge                         19%
statement-opinion                   13%
abandoned/uninterpretable           6%
agree/accept                        5%
appreciation                        2%
yes-no-question                     2%
non-verbal                          2%
yes answers                         1%
conventional-closing                1%
wh-question                         1%
no answers                          1%
response acknowledgement            1%
hedge                               1%
declarative yes-no-question         1%
other                               1%
back-channel in question form       1%
quotation                           0.5%
summarise/reformulate               0.5%
affirmative non-yes answers         0.4%
action-directive                    0.4%
collaborative completion            0.4%
repeat-phrase                       0.3%
open-question                       0.3%
rhetorical-questions                0.2%
hold before answer                  0.2%
reject                              0.2%
negative non-no answers             0.1%
signal-non-understanding            0.1%
other answers                       0.1%
conventional-opening                0.1%
or-clause                           0.1%
disprefered answers                 0.1%
3rd-party-talk                      0.1%
offers, options commits             0.1%
self-talk                           0.1%
down-player                         0.1%
maybe/accept-part                   < 0.1%
tag-question                        < 0.1%
declarative wh-question             < 0.1%
thanking                            < 0.1%
apology                             < 0.1%

Table 2.3: switchboard dialogue acts
The underlying domain of the dialogues in the icsi-mrda corpus is that of multi-party meetings, with multiple participants discussing an agenda of items in a structured meeting. This application required the introduction of new tags specific to this scenario, for example a tag introduced to indicate when an utterance was used to take control of the meeting. The icsi-mrda corpus comprises 75 naturally occurring meetings, each around an hour in length. There were 53 unique speakers in the corpus, an average of 6 speakers per meeting. The corpus consists of around 180,000 utterances. For each utterance in the corpus, one general tag was assigned, with zero or more additional specific tags. Excluding non-labelled cases, there are 11 general tags and 39 specific tags, and these can be seen in Table 2.4. Tags in boldface are not present in the switchboard tag-set, and have been added for the icsi-mrda data; tags in italics are based on the switchboard version, but have had their meaning modified for the icsi-mrda data. In addition to Dialogue Acts that label the function of utterances, there are two disruption forms (%-, %–), two types of indecipherable utterances (x, %) and a further tag to denote rising tone in the prosody (rt); collectively, these five labels are treated as a distinct group. The vocabulary size of the entire corpus is 14,347 words, and there are 1,260 unique dialogue acts used in the annotation, formed through permutations of the general tags and combinations of the specific tags. As with the switchboard data set, the number of unique Dialogue Acts is too large (and the distribution of data among them too sparse) for machine learning experiments, so processing steps were introduced that removed some of the special tags, compressing the number of unique Dialogue Acts to 65.
In later work, the dimensionality was further reduced: starting with the experiments reported in Ang et al. (2005), a da tag sub-set of just 5 tags is used.
AMITIES
In this section, we include details of our own, privately collected dialogue corpus. From 2001 to 2005, I worked on the amities dialogue project, an EC 5th Framework funded collaboration between a group of EU and US universities and companies. During the amities project7, we collected 1000 English human-human dialogues from GE call centres in the United Kingdom. These calls are of an information seeking or transactional type, in which customers interact with their financial account information by phone to check balances, make payments and report lost credit cards. All 1,000 English dialogues, comprising some 24,000 utterances and a vocabulary of around 8,000 words, have been annotated with das (using a variant of damsl called XDML) to capture the functional layer of the dialogues, and with a frame-based semantic scheme to record the semantic layer (Hardy et al., 2002; Hardy et al., 2003). With this level of annotation, both the categories and the lists remain largely independent of the domain. However, we have made some adjustments to the tag set in order to reflect more accurately the features found in the amities ge corpus. Our taxonomy follows the general damsl categories: Information-Level, Communicative Status, and Forward- and Backward-Looking Functions.
7 http://www.dcs.shef.ac.uk/nlp/amities/
Dialogue Act                                    % of corpus
statement                                       31.80%
back-channel                                    14.20%
floor-holder                                    7.94%
acknowledgement                                 6.82%
yes-answers                                     5.59%
defending/explanation                           3.55%
expansion                                       3.04%
floor-grabber                                   2.96%
suggestion                                      2.53%
appreciation                                    2.10%
interruption                                    2.05%
understanding check                             1.98%
declarative question                            1.73%
abandoned                                       1.13%
narrative-affirmative answers                   1.05%
wh-question                                     0.90%
no answers                                      0.86%
collaborative-completion                        0.80%
no knowledge answers                            0.78%
hold                                            0.75%
command                                         0.64%
yes-no-question                                 0.64%
disprefered answers                             0.46%
humorous material                               0.44%
down-player                                     0.37%
commit                                          0.35%
narrative-negative answers                      0.33%
maybe                                           0.33%
or clause after yes-no-question                 0.33%
exclamation                                     0.29%
mimic                                           0.28%
apology                                         0.24%
task management                                 0.24%
signal non-understanding                        0.22%
partial-accept                                  0.21%
rhetorical question                             0.20%
topic change                                    0.20%
repeat                                          0.20%
self-talk                                       0.19%
3rd-party-talk                                  0.16%
rhetorical-question-down-player-back-channel    0.15%
partial reject                                  0.14%
misspeak self-correction                        0.14%
reformulation                                   0.13%
"follow me"                                     0.12%
or question                                     0.12%
thanks                                          0.11%
tag question                                    0.08%
open-ended question                             0.07%
misspeak correction                             0.05%
sympathy                                        0.01%
welcome                                         < 0.01%

Table 2.4: icsi-mrda dialogue acts, taken from Shriberg et al. (2004); not shown are the labels indecipherable, non-speech and non-labelled
Each utterance can have one annotation from each of these categories; for example, the Answer portion of an utterance represents its Backward-Looking Function. In this way we can capture topical distinctions, unusual occurrences in conversations, and the ways in which a particular utterance relates to previous or subsequent parts of the dialogue. Our tag set for the categories Information Level and Communicative Status is shown in Figure 2.4.

Figure 2.4: Hierarchy of XDML labels for Information Level and Communicative Status
The category Information Level allows us to make an abstract description of the content of each utterance. We are concerned here with the broad topic of a particular turn or portion of a turn. Is the speaker participating in an exchange of information that will accomplish a task? Or is he stepping above the task, so to speak, and talking about the process needed to achieve some goal or complete the task (Task-management)? Perhaps his words serve to initiate, maintain or close the conversation (Communication-management). Or perhaps he is digressing from the subject of the dialogue (Out-of-topic). Most turns in call-center dialogues make progress toward accomplishing a customer-service task, such as making a payment, verifying a customer's identity, or giving the customer information about his account balance. Questions and answers, directions, suggestions, explanations, commitments to perform some action connected with the task, and agents' offers of help all fall into this category. For convenience, the XDMLTool we used to annotate the amities ge dialogues labels every turn (or turn segment) "Task" by default. An utterance concerned with 'doing the task' receives the Information Level label "Task". If an agent or customer is not making direct progress toward completing the task, but is instead talking about the task or the process of doing the task, we use the label Task-management. We found it useful to subdivide Task-management into three categories: System capabilities, Order of tasks, and Summary. The System capabilities category of Task-management means that the speaker is addressing problems that the computer system or the service centre can and cannot solve; this is the 'competence domain' for customer service. For Order of tasks, the speaker talks about the order in which tasks will be completed, or indicates that a task will be started.
Occasionally the agent will make a statement that serves to summarise, wrap up, or recapitulate the task that has been accomplished in the dialogue, or to indicate that the task has been completed. We reserve the Summary tag for summary or completion statements that refer to the entire task, such as closing an account or changing an address, not to parts of the task. A third category under Information Level serves to describe utterances that deal with social obligations such as greetings, introductions, apologies, expressions of gratitude and farewells, as well as markers which maintain the conversation. Conventional phrases such as "hello" and "good-bye", as well as back-channel words or non-words such as "uh-huh", "yes", and "ok", are examples of Communication-management. Ordinary sentences and phrases used to signal misunderstanding or to manage delays in the conversation should also be labelled Communication-management. One useful test suggested by Allen and Core (1997) is to remove the utterance in question from the dialogue: the conversation might be less fluent, but it would still have the same content relative to the task and how it is solved. Another Information Level category has to do with brief or extensive digressions from the task in the conversation. Out-of-topic includes jokes, non-sequiturs, small talk, and any comments or remarks that have no direct bearing on accomplishing the task. Whereas every utterance in a dialogue can be assigned an Information Level tag, Communicative Status tags are intended to be used only in exceptional circumstances. We use three labels under this category: Self-talk, Third-party-talk, and Abandoned. Interruptions and unintelligible utterances are best annotated using a transcription tool while the user is listening to the dialogue.
A Self-talk utterance indicates that the speaker does not intend the other person to respond to or otherwise use what he is saying. The speaker's apparent intentions are the key, rather than whether or not the other person actually responds to or uses the utterance. An utterance labelled Third-party-talk is one in which the speaker is addressing someone other than the second party to the conversation. Typically a customer will ask a question of a family member, or an agent will speak to a customer who is with him while the agent is holding a telephone conversation with another agent. A speaker occasionally abandons an utterance, when he makes an error or changes his mind about what to say. The Abandoned label should be used only if the utterance has no effect on the progress of the dialogue; that is, the utterance could be removed from the conversation without changing its content. An utterance can be marked Abandoned whether or not the speaker was interrupted, as long as the speaker actually leaves his thought and does not return to it. We do not use the label Abandoned for cases in which the speaker corrects himself during the turn. An utterance having a Forward-Looking Function anticipates the future in some way, having an effect on what is answered, discussed or done next. The speaker may be making a statement, asking a question, or committing himself to some course of action. He may be suggesting or directing the other person to do something. Forward-looking functions can be distinguished from backward-looking functions in that backward functions are primarily responses to something that was said, whereas forward functions typically elicit a response. Some functions, such as the various kinds of statements, as well as the Expression function, have either a forward or a backward orientation, depending on the context.
Figure 2.5: Hierarchy of XDML labels for Forward-Looking Function
Sometimes an utterance can be tagged with labels from both the forward and the backward categories. For example, the backward function Answer is also labelled Assert. If more than one tag (forward, backward, or both) is applicable for an utterance, the annotator should select all appropriate tags. Our hierarchy of Forward-Looking Functions is illustrated in Figure 2.5. In general, a sentence or phrase that is a Statement can be said to be true or false. A statement makes a claim about the world, and tries to change the beliefs of the listener.
We use four tags under the category Statement: Assert, Reassert, Explanation, and Re-explanation. Assertions and Reassertions are ordinary statements, distinguished by whether or not the speaker has already made the claim in an earlier part of the dialogue. Here we include, first, any yes/no answer to a question. Second, Assertions and Reassertions can take the form of statements that communicate some specific details. Third, Assertions or Reassertions may take the form of a recapitulation, reformulation, or summary; these are also tagged "Task-management-summary". In contrast to Assertions, which are simple statements, Explanations are reasons people give for their answers or for the questions they ask, or elaboration about topics such as customer service policies. Either the agent or the customer may use these. Explanations and Re-explanations, similar to Assertions and Reassertions, are distinguished by whether the statement has been made previously in the dialogue. Offers are implicit or explicit questions that, if answered in the affirmative or with some positive information, mean that the speaker will perform some action for the listener. The Commit tag is used for utterances in which the speaker obligates himself to perform a future action, phrased in such a way that the commitment is not contingent on the listener's agreement. A Commit may also be the response to an Action-directive. The Expression tag encompasses conventional phrases such as "Thank you", "I apologize", and "Sorry", exclamations, short words used to hold or grab the turn, such as "Right" or "Okay", and other expressive phrases. These may also be tagged Back-channel, Accept or Non-understanding, depending on the context. In the Influence-on-listener group of tags, the speaker is asking the listener a question, directing him or her to do something, or suggesting some course of action the listener may take.
A request for information, whether it is spoken in the interrogative form (Explicit) or in the imperative or declarative form (Implicit), is tagged Information-request. We exclude from this category questions that call for a yes/no answer. A Confirmation-request is an utterance that calls for the listener either to confirm or to deny the request or the question; in other words, it calls for a simple acceptance or rejection: a yes/no answer. In this category we also make a distinction between Explicit and Implicit. The label Repeat-request is used to mark any request, whether it is for information or confirmation, that has been made earlier in the dialogue. If the speaker directs the listener to perform some action, we label the utterance Action-directive. If the directive is used to manage some delay in the conversation, for example "Please wait" or "Bear with me", then we also use the Information Level label Communication-management. Agents and customers in call-center dialogues rarely phrase action-directives in a 'Do this' manner. Instead they make a polite request, or a statement of a problem that must be solved or a task that needs to be done. If a speaker suggests a course of action but puts no obligation on the listener, we use the tag Open-option. The difference between Open-option and Offer has to do with who will perform the action. The Offer means that the speaker proposes to do something for the listener, as in "I can do this for you" or "What can I do for you?". The Open-option, on the other hand, suggests that the listener or some other person perform the action. This type of utterance takes the form "You can do this", or "This [option or course of action] can be done."
The Opening tag indicates that the speaker is beginning the interaction
by using a conventional social phrase to greet the listener, or by replying to such a greeting with a conventional phrase. Often the speaker, if he or she is a customer service representative, will identify the service name and/or the agent name as part of the greeting. The Closing label is used for turns in which the speaker utters a conventional social phrase or expression to finish or wrap up the conversation. If the annotator selects the Opening or Closing tag, for convenience the annotation tool used for the amities ge data automatically assigns the Information Level tag Communication-management. Utterances in the Backward-Looking Function category respond in some way to one or more previous turns in the dialogue. An answer to a question is one example of a common backward-looking function. If the speaker signals some level of understanding or not understanding what the previous speaker has said, we use one of the five tags in the Understanding sub-category. If the speaker signals some level of agreeing or disagreeing with the previous speaker's question (or some degree of accepting or rejecting the previous speaker's proposal), then we select a tag in the Agreement sub-category. Note that most, if not all, acceptances and rejections are also answers. Our set of Backward-Looking Function tags is shown in Figure 2.6. The Response-to field on XDMLTool's user interface provides a place to annotate the antecedent, or the utterance to which the current turn is responding. Because the most common antecedent is the previous turn, XDMLTool automatically fills in the number of the previous turn or turn segment when the user selects any Backward Function tag. This number may be overridden manually if the antecedent occurs earlier in the dialogue, or if more than one utterance forms the antecedent.
An Answer is a response to an Information-request or Confirmation-request. An answer by definition will always be an assertion, as it provides information or confirms a previous supposition, and it makes a claim about the world. XDMLTool automatically tags an utterance Assert if the user chooses Answer. An Understanding response to an utterance reveals whether and in what way the speaker heard and understood what the other speaker was saying. This aspect indicates nothing about whether the speaker accepts or rejects what was heard. Because a speaker may be indicating understanding and agreement at the same time, choices in both areas can be appropriate. We use five tags in the Understanding category: Back-channel, Repeat-rephrase, Completion, Non-understanding and Correction. Because all these types of utterances can be categorised at the Information Level of Communication-management, XDMLTool makes that choice automatically. The annotator can change the Information Level label to "Task" if he or she determines that the utterance also has a Task role; for example, Accept or Commit. A Back-channel response is typically a short phrase indicating that the speaker heard and understood the previous utterance, but did not necessarily accept what he heard. A Back-channel utterance may be paraphrased as "I heard you", "I understand what you said", or "That's clear; you can continue". This type of response may or may not interrupt the previous speaker, or it may occur while the other speaker is still speaking. If the utterance also counts as an acceptance or a commitment, it should be annotated at both levels. The Repeat-rephrase label is used for utterances that repeat or paraphrase the previous speaker's words, to show that those words were understood but not necessarily accepted.
Figure 2.6: Hierarchy of XDML labels for Backward-Looking Function
If the speaker repeats or paraphrases some words with uncertainty, or with a rising inflection (indicated in the transcription by a question mark), then we use the tag Non-understanding. The Completion tag is used for utterances that indicate understanding by continuing or finishing the other person's sentence or phrase. If the speaker has not understood, or has only partially understood, something he has just heard, we use the Non-understanding tag. Utterances of this type can usually be paraphrased "What did you say?", "What did you mean?" or "Is this what you said?". Many of these examples can also be labelled Explicit Confirmation-request. The Correction tag is used to indicate that the speaker has understood what the other person has said, but wants to correct a perceived error in the other person's utterance. We reserve the label Correction for cases in which the speaker corrects the other person, not for cases of self-correction. The set of tags in the Agreement category indicates whether the speaker accepts a proposal, offer or request, or confirms the truth of a statement or confirmation-request. We use five labels in this category: Accept, Accept-part, Maybe, Reject, and Reject-part. We mark an utterance with Accept or Accept-part if the speaker accepts all or part of the other speaker's proposal or request, or if the information or claim conveyed in an Assert is accepted or confirmed. We use the Maybe tag when the speaker is uncertain of an answer, or says "I'll have to think about it" or "I'm not sure". We also use this tag when the person cannot answer the question or address the proposal or offer because of lack of knowledge: "I don't know". We use Reject or Reject-part when the speaker disagrees, rejects a proposal or offer, says he will not comply, or says that all or part of the claim or the information conveyed by the other speaker is incorrect.
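The defaulting behaviour of the annotation tool described in the preceding pages can be summarised in one place. The sketch below is a hypothetical reconstruction for illustration only; the function name and dictionary keys are ours, not the actual XDMLTool code:

    UNDERSTANDING_TAGS = {"Back-channel", "Repeat-rephrase", "Completion",
                          "Non-understanding", "Correction"}

    def apply_tool_defaults(annotation, previous_turn_id):
        """Add the tags the tool fills in automatically when certain labels are chosen."""
        # Opening and Closing imply the Information Level tag Communication-management.
        if annotation.get("forward") in {"Opening", "Closing"}:
            annotation.setdefault("information_level", "Communication-management")
        # Understanding tags also default to Communication-management
        # (the annotator may override this to "Task").
        if annotation.get("backward") in UNDERSTANDING_TAGS:
            annotation.setdefault("information_level", "Communication-management")
        # Choosing Answer automatically adds the Assert tag.
        if annotation.get("backward") == "Answer":
            annotation.setdefault("forward", "Assert")
        # Any Backward-Looking Function defaults its antecedent to the previous turn.
        if "backward" in annotation:
            annotation.setdefault("response_to", previous_turn_id)
        return annotation

    print(apply_tool_defaults({"backward": "Answer"}, previous_turn_id=41))
    # {'backward': 'Answer', 'forward': 'Assert', 'response_to': 41}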
Example tags can be seen in Figure 2.5. With so many possible labels in these 4 categories, the maximum number of possible tag combinations is huge (around 1,000,000 possible combinations exist). However, in our annotated corpus we see only 259 combinations of tags in actual use, a number that corresponds somewhat to the 220 total tags reported for the switchboard corpus earlier. All conversations in the amities ge corpus are anonymised following strict guidelines, whereby all names, account numbers and personal information are removed from the audio data and replaced by generic substitutes in the word transcripts. By adopting the general damsl mark-up scheme, we had hoped at a later stage to be able to acquire domain-independent models of dialogue, simply by changing the set of domain-specific semantic frames, and this remains a research goal. The most frequent tag in the amities corpus occurs around 15% of the time. For this corpus, the average pair-wise kappa score of 0.59 was significantly lower than that of the switchboard corpus. Cohen's kappa coefficient is a statistical measure of inter-rater agreement for qualitative (categorical) items, and is generally thought to be a more robust measure than a simple percent agreement calculation, since kappa takes into account the agreement occurring by chance (Cohen, 1960). For the major categories (such as questions and answers), average pair-wise kappa scores in the amities corpus were around 0.70.
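As an aside, the pair-wise kappa figures quoted in this chapter can be computed directly from two coders' label sequences. The following sketch is ours, with invented toy labels, and simply follows the standard definition (averaging such scores over all coder pairs gives the average pair-wise kappa):

    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Cohen's kappa for two annotators' label sequences of equal length."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        # Chance agreement: probability that both coders pick the same label independently.
        expected = sum((freq_a[l] / n) * (freq_b[l] / n)
                       for l in set(labels_a) | set(labels_b))
        return (observed - expected) / (1 - expected)

    # Toy example: two coders labelling six utterances.
    coder_1 = ["Assert", "Answer", "Assert", "Accept", "Opening", "Assert"]
    coder_2 = ["Assert", "Answer", "Reject", "Accept", "Opening", "Assert"]
    print(round(cohen_kappa(coder_1, coder_2), 2))  # 0.78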
Table 2.5: A selection of the amities ge dialogue acts. Shown are all acts that have a frequency count across the corpus greater than 0.55% of the total number of labels
Again, according to the work of Carletta et al. (1997), a minimum kappa score of 0.67 is required for the corpus to be reliable enough for learning automatic models for classification.
2.4.4 AMI
The AMI project8 is a European research project centred on multi-modal meeting room technology. The AMI Meeting Corpus contains 100 hours of meetings captured using many synchronised recording devices, and is designed to support work in speech and video processing, language engineering, corpus linguistics, and organisational psychology. It has been transcribed orthographically, with annotated subsets for everything from named entities, Dialogue Acts, and summaries to simple gaze and head movement. Two-thirds of the corpus consists of recordings in which groups of four people played different roles in a fictional design team that was specifying a new kind of remote control. Controlling the roles that the participants play is a way of focussing the data, and allows researchers to better measure how well the groups are doing and to compare against new data where groups use assistive technologies. However, it also limits the things people talk about. The remaining third of the corpus contains recordings of other types of meetings. The AMI Meeting Corpus is available for free download under a Creative Commons license9, but was not used in any of the experiments we attempt later in this thesis. The AMI da tag-set has 15 tags, which can be seen in Table 2.6.
8 http://www.ami-project.org
9 http://corpus.amiproject.org/
Dialogue Acts: back-channel, stall, fragment, inform, elicit-inform, suggest, offer, elicit-offer-or-suggest, assess, elicit-assessment, be-positive, be-negative, comment-about-understanding, elicit-comment-about-understanding, other

Table 2.6: The 15 das of the AMI corpus
2.4.5 Additional Individual Dialogue Corpora
There are a number of other dialogue corpora which we will reference in the next chapter, when we discuss methods and models that have been used for automatic da classification. We include details of these corpora here, for a better understanding of those results. Each of these corpora is associated with a single dialogue project or annotation effort, and they are not widely used.
Spanish CallHome
The Spanish CallHome corpus (Levin et al., 1998; Ries, 1999) comprises 120 telephone calls in Spanish between family members and friends, for a total of 12,066 distinct words and 44,628 utterances. The Spanish CallHome corpus is annotated at three levels: Dialogue Acts, dialogue games and dialogue activities. The da annotation augments a basic tag such as statement along several dimensions, such as whether the statement describes a psychological state of the speaker. This results in 232 different da tags, many of which have very low frequencies, in common with other corpus annotation efforts such as switchboard.
As with the work of Jurafsky et al. (1997) on the switchboard corpus, tag categories are collapsed when running experiments so as to obtain meaningful frequencies. In CallHome37, different types of statements and back-channels are collapsed, obtaining 37 different tags. CallHome37 maintains some sub-categorisations, e.g. whether a question is yes/no or rhetorical. In CallHome10, these categories are further collapsed: CallHome10 is reduced to 8 das proper (e.g. statement, question, answer) plus the two tags "%" for abandoned sentences and "x" for noise. CallHome Spanish is further annotated for dialogue games and activities. Dialogue game annotation is based on the maptask notion of a dialogue game: a set of utterances starting with an initiation and encompassing all utterances up until the purpose of the game has been fulfilled (e.g. the requested information has been transferred) or abandoned (Carletta et al., 1997). Moves are the components of games; each move corresponds to one or more das, and each is tagged as Initiative, Response or Feedback. Each game is also given a label, such as Info(rmation) or Direct(ive). Finally, activities pertain to the main goal of a certain discourse stretch, such as gossip or argue.
Basurde
The BASURDE task (Sanchis and Castro, 2002; Grau et al., 2004) consists of information retrieval by telephone for Spanish nationwide trains. Queries are restricted to timetables, prices and services for long-distance trains. This corpus has 227 dialogues, 4884 turns (2333 user turns, 2551 system turns), 61,483 words, and a vocabulary of 860 words.
An average conversation has 21.5 turns, and the average number of words per turn is 14.6. There are 15 labels for the dialogue acts.
OVIS
The OVIS corpus of Dutch Train Timetable dialogues (Lendvai et al., 2003) consists of 3,738 pairs of system questions and user answers; in total 441 full dialogues (involving more than 400 different speakers). The dialogues were sampled from a range of telephone calls in which users interacted with a Dutch train timetable information system. The dialogues are relatively short (2-10 turns). The system uses a mixed-initiative dialogue strategy that prompts the user to fill various slots before it can perform a database query. System prompts are tagged in terms of dialogue acts and slots. Basic dialogue acts include asking a question (Q), explicit verification (E), repeating a prompt (R), asking a meta-question (M) and offering travel advice (final result, Fr). Implicit verification is represented as the simultaneous occurrence of a question and a verification (Q;I). The slots to be filled from the user input are the departure and arrival station (V and A respectively), and the corresponding day, time of day (i.e. morning, noon or night) and hour (represented as D, T and H respectively). These time slots can be questioned together ("when", Q DTH) or in isolation (e.g. "at what time", Q H). In addition, the system can ask whether the user wants to have the travel advice repeated (repeat connection, Q Rc), whether the user would like to have information about another connection (Q Oc), or an earlier or later one, and so on.
The user tag represents jointly a high-level dialogue act (S, A, Y, N), a
shallow semantic interpretation of the types of slots filled by the user, and a high-level pragmatic “awareness” flag of a communication problem.
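Based only on the example tags quoted above (e.g. "Q H", "Q DTH", "Q;I"), and not on the actual OVIS annotation format, one way to picture such a composite system-prompt label is as a small structure holding the act codes and the slot letters:

    from dataclasses import dataclass, field
    from typing import List

    # Toy representation, inferred from the quoted examples only.
    @dataclass
    class SystemPromptTag:
        acts: List[str]                                  # e.g. ["Q"], or ["Q", "I"] for implicit verification
        slots: List[str] = field(default_factory=list)   # e.g. ["D", "T", "H"] for "when"

    when_question = SystemPromptTag(acts=["Q"], slots=["D", "T", "H"])  # "Q DTH"
    implicit_verification = SystemPromptTag(acts=["Q", "I"])            # "Q;I"
    print(when_question, implicit_verification)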
NESPOLE
The NESPOLE corpus (Levin et al., 2003) originates from the NESPOLE speech-to-speech translation project. The NESPOLE travel domain covers inquiries about vacation packages. There were two data collection protocols for the NESPOLE travel domain: monolingual and bilingual. In the monolingual protocol, an English speaker in the United States had a conversation with an Italian travel agent speaking (non-native) English in Italy. Monolingual data was also collected for German, French and Italian. Bilingual data was collected during user studies with, for example, an English speaker in the United States talking to an Italian-speaking travel agent in Italy, with the NESPOLE system providing the translation between the two parties. The dialogues were transcribed, and multi-sentence utterances were broken down into multiple Semantic Dialogue Units (SDUs), each of which corresponds to one domain action. Over 14,000 SDUs have been tagged with inter-lingua representations including domain actions as well as argument-value pairs, and a set of 70 speech acts, although the authors do not list what these speech acts are.
2.4.6 DIT++
Finally, the development of taxonomies of Dialogue Acts did not end with damsl. The DIT++ taxonomy10 is a comprehensive system of dialogue act types obtained by extending the taxonomy of Dynamic Interpretation Theory (DIT), originally developed for information dialogues (Bunt, 1994), with a number of dialogue act types from damsl (Allen and Core, 1997) and other schema. The DIT++ taxonomy forms a multidimensional system not only in the sense that it supports the assignment of multiple tags to utterances, but also in the sense that dimensions can represent different aspects of communication that may be addressed independently, i.e. each utterance may have more than one communicative function, but maximally one in each dimension (Bunt and Girard, 2005). There are 11 dimensions in the DIT++ tag-set, with around 95 communicative functions, around 42 of which, like the switchboard set, are general-purpose functions, whereas others cover elements of feedback, interaction management and the control of social obligations. To test the reliability of annotation given the multi-dimensional nature of the tag set, Geertzen and Bunt (2006) annotated 558 utterances collected from three different sources: OVIS, a train scheduling system (Strik et al., 1997); DIAMOND, a language interface to a fax machine (Geertzen et al., 2004); and a collection of Map Task dialogues in Dutch (Caspers, 2006). The pair-wise agreement, as indicated by the kappa score, is averaged across the 11 dimensions, and it can clearly be seen that for some dimensions, as with the switchboard schema, high levels of agreement can be achieved (one dimension scores a kappa of 0.82), whereas other, high-frequency dimensions score a kappa as low as 0.47.
10 http://dit.uvt.nl/
Corpus        Availability   Utterances   Dialogues   Words     Distinct words   Dialogue type
maptask       public         26621        128         152705    2502             Direction giving
verbmobil     public         3117         168         24980     959              Travel planning
switchboard   public         223606       1155        1431725   21715            Conversational
icsi-mrda     public         180887       75          795000    14347            Meetings
amities ge    restricted     23412        1000        228165    7841             Financial data

Table 2.7: Summary data for the major dialogue corpora
These scores do indeed seem to indicate that high-dimensionality schema can cause annotation problems, although the creators of DIT++ make valid points about the applicability of kappa statistics to multidimensional tag sets (Geertzen and Bunt, 2006), which we do not explore here. DIT++ has also been used to annotate part of the AMI multi-party dialogues (McCowan et al., 2005), consisting of 72 four-person, one-hour meetings, comprising a total of 57,000 utterances.
2.5 Computational Models of Dialogue Act Recognition
There are two broad categories of computational model used to interpret Speech or Dialogue Acts. The first relies on models of interpretation first developed by Searle (1976) when considering the definition and structure of these acts.
The model works by processing belief logic, centring on the impact each utterance has on the hearer and interpreting what the hearer believes the speaker intended to communicate. The second model type is far shallower, and is often cue-based. This cue-based model, so called by Jurafsky and Martin (2008), is based on the notion of repeated predictive cues: subsections or groups of words that are strong indicators of specific das. The first category of da interpretation requires both complex models of the speaker and any hearers, as well as planning mechanisms able to process complex models of understanding. In the second category, much of the work is cast as a probabilistic classification task, solved by training approaches on labelled examples of speech acts. We will look at both models in a little more detail.
2.5.1 Plan and Inference Models
The first set of models requires reasoning about both speakers' and hearers' beliefs, desires and goals. An example of this is the plan-based model of Cohen and Perrault (1979), where planning is used to determine how speech acts are generated, by developing an understanding of what it is that the speaker wishes to communicate and of the likely effect each utterance will have on the hearer. Initially, this work centred on those speech acts that had a literal meaning, although later work (Perrault and Allen, 1980; Allen, 1979) applied the same approach to understanding indirect speech acts, where perhaps this approach is most effective. Each individual utterance in a dialogue is modelled as a step in a plan.
Understanding an utterance requires the system to derive a complete (and consistent) plan that the speaker is attempting to achieve. This model for recognising speech acts relies heavily on mutual belief. Interpretations of speech acts can only occur when two or more agents share a particular piece of knowledge, and each is aware that the other knows it also, such that Person A believes proposition P, Person B believes proposition P, A believes B believes P, B believes A believes P, and so on. There is no theoretical upper bound to the number of levels of nesting, and theoretical utterances and exchanges can be created that demonstrate large nested structures. However, when Lee (1994) examined dialogue corpora, they found that there was little requirement for more than two levels of belief nesting. Although occasional mistakes in the dialogue were observed due to this limited nesting, they could be quickly identified and resolved through a subsequent clarification dialogue. This evidence, that deeper-level representations of dialogue may not be needed, is one possible explanation for the success of the shallow processing methods we shall examine later. Ballim and Wilks (1991) created the ViewGen system, which uses nested belief structures to model the beliefs of an artificial system. This system uses a minimal set of beliefs to interpret speech acts. To interpret a speech act, involving a 'speaker', a 'hearer' and a 'proposition', there are a series of preconditions that must be met: the 'speaker' must believe the 'proposition', and there must be a goal, which is that the 'speaker' believes that the 'hearer' also believes the proposition. The speech acts are hierarchical, in that more specific speech acts can inherit preconditions from more general acts. Lee and Wilks (1996) use the example of a correction act, a more specific example of an inform act in their hierarchy.
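A minimal sketch of how such nested belief preconditions might be represented is given below. This is purely illustrative, in the spirit of the description above, and is not the ViewGen implementation:

    from dataclasses import dataclass
    from typing import Union

    @dataclass(frozen=True)
    class Bel:
        agent: str
        content: Union["Bel", str]  # a nested belief or an atomic proposition

    def inform_preconditions(speaker, hearer, proposition):
        """Precondition and goal of an 'inform' act, as described above (illustrative only)."""
        precondition = Bel(speaker, proposition)       # the speaker believes the proposition
        goal = Bel(speaker, Bel(hearer, proposition))  # the speaker believes the hearer believes it
        return precondition, goal

    pre, goal = inform_preconditions("A", "B", "the meeting is at noon")
    print(goal)  # Bel(agent='A', content=Bel(agent='B', content='the meeting is at noon'))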
As well as the preconditions and the goals of the higher-level inform act just outlined, the correction act has an additional precondition: that, through interaction, the 'speaker' has come to believe that the 'hearer' does not believe the 'proposition', which motivates the 'speaker's' need to correct them. In order to apply the plan-based model, there are three key components (taken from Jurafsky and Martin (2008)):
1. an axiomatisation of belief, of desire, of action and of planning
2. a set of plan inference rules, which codify the abductive heuristics of the understanding system
3. a theorem prover
Given these three components, a plan-inference system can align an input sentence with the correct speech act. In common with most inference systems, one benefit of the model is that it is highly explanatory; it is possible to see why certain inferences were made. This model can be highly effective in identifying indirect speech acts, using models of the speaker and of current speaker goals to perform accurate intention analysis. This is possible due to the extremely rich knowledge structures which this model requires in order to operate. However, these structures are also one of the drawbacks of this approach, in that they often require a lot of manual effort to construct and maintain. Some impressive co-operative dialogue systems have been implemented using this approach, including the TRAINS system (Allen et al., 1995) and, using the TRAINS architecture, the TRIPS system (Ferguson and Allen, 1998), an interactive planning system.
The scenario for TRIPS is organising the evacuation of a city in the face of an impending disaster. An example dialogue with the TRIPS system, taken from Ferguson and Allen (1998), can be seen in Figure 2.7, where H is the human and S is the system. In early work it was assumed that there were only a few high-level domain goals that the speaker might be pursuing. Later, Carberry (1983) proposed a context model that captured the speaker's domain plan as inferred from the dialogue so far. Plan inference rules were used to relate each new utterance to the partially formed plan, and to expand it appropriately. Beyond domain goals, work was required to form plans around communicative goals, where speakers work together to exchange the information necessary to further their problem-solving goals. A complete discourse plan contains actions that are executed in order to achieve either discourse or communicative goals (Carberry, 1990). It is these actions that are referred to as Dialogue Acts in the communication planning literature.
2.5.2 Cue-Based Approach
One alternative to the plan-based approach is the cue model. In this model, the hearer uses cues (or simple indicators, either alone or in combination) in the utterances (both individually and in the context of the wider dialogue) to decide on an interpretation. This model captures the idea that the surface form of an utterance provides all manner of cues as to what this interpretation could be. Cues can be lexical, collocational, syntactic, prosodic or based on a deeper conversational structure. The cue model relies on ideas similar to the literal meaning hypothesis (Gazdar, 1981) we discussed earlier; that is, the literal meaning hypothesis provides a strong prior reason to expect cue-based modelling to be successful.
Figure 2.7: Example dialogue with the TRIPS system
63
Chapter 2. Motivation and State-of-the-Art
The cue model relies on ideas similar to the literal meaning hypothesis (Gazdar, 1981) we discussed earlier; that is, the literal meaning hypothesis must hold, at least as a strong prior assumption, for cue-based modelling to be successful. The key to the cue model from a computational perspective is that these cues can be probabilistically associated with specific Dialogue Acts, and we shall exploit this later. Cue phrases (also called ‘discourse cues’, ‘discourse markers’ or ‘clue words’) are words and phrases that explicitly signal the structure of a discourse or dialogue, and in turn can be used to determine the intention of the speaker. They include single word cues, such as “well”, and multiword cues, for example “in any case” (or, for that matter, “for example”). Although cue phrases have been well covered in both the linguistic and computational literature, it was not until fairly recently, with the availability of large annotated corpora, that empirical studies analysing and classifying cue phrases became possible (cf. Hirschberg et al. (1987), Litman and Allen (1990)). Much of this work has focused either on correlating instances of cues with classes of annotations, such as intention or Dialogue Act (as in the work reported in this thesis), or on disambiguating the structural role that cue phrases play in utterances (which can be useful in resolving ambiguity, for example). As an example of this second approach, the disambiguation of phrases, the work of Hirschberg and Litman (1993) centres on resolving those instances where cues play discourse-structuring roles (where, for example, the cue ‘now’ signals a return to a previous topic, or the introduction of a new sub-topic, as in “now, if you look at this next example”) against those where the same cue is used in sentential form (where ‘now’ indicates a specific span of time, including the duration of the utterance, as in “I’m not free now, but
I may be later”, where the cue phrase ‘but’ plays a crucial discourse role). There are specific features that help determine such usage, for example the position of the cue phrase in the utterance, and we can leverage these in our algorithm to determine the weight and role of such a phrase in an utterance. In most prior work concerning cue phrases, once phrases have been identified, either manually or automatically, they are passed to some later process as one of many possible features, which are then used to identify Dialogue Acts (as in the work of Samuel et al. (1998), reported later), used for the segmentation of broadcast news (Maybury, 1998) and spontaneous spoken narratives (Passonneau and Litman, 1997), or presented for selection and use in natural language generation (Moser and Moore, 1995). The work presented in this thesis differs in that we use cue phrases directly as a technique for classifying the correct Dialogue Act for an utterance.
Lexical and Syntactic Cues

Lexical cues (specific words or phrases) are the most widely understood set of cues and have appeared in the literature across languages and conversational styles (Cohen, 1984; Warner, 1985; Grosz and Sidner, 1986; Hirschberg and Litman, 1993; Marcu and Hirst, 1995). Syntactic cues (such as specific grammatical constructions) have the benefit that they can be cross-linguistic, such as the strong correlations involving sentence-initial or sentence-final particles, special verb order, and general word order discussed in Jurafsky and Martin (2008). Given a large corpus, Jurafsky and Martin (2008) were able to find many other such syntactic correlations.
In this work, we concentrate on lexical cues, although we exploit some syntactic relationships (such as sentence-initial elements), as can be seen later. Hirschberg and Litman (1993) detail 62 single word cue phrases drawn from the linguistic and computational literature, which can be seen in Table 2.9. When Samuel et al. (1999) performed a broader analysis of the discourse processing literature, they found a total of 687 different cue phrases listed or mentioned as useful for either intention or structural analysis of discourse and dialogue. Most cue phrases from the literature were recorded as being generally useful for spotting discourse structure or function. Samuel et al. (1999) decided to attempt to learn automatically those cue phrases that were good indicators of Dialogue Acts, using the annotated verbmobil corpus as source material. Samuel et al. (1999) used two baselines to compare their automatic acquisition techniques for this work: one being the 687 cue phrases they identified in the literature, the other being all the phrases of up to three words found in the corpus, a set comprising a total of 14,231 n-grams. We make the distinction here between cue phrases (those phrases we believe, through some assessment, to be useful indicators of Dialogue Acts) and n-grams (which in this case are collections of every possible 1-3 word phrase in the corpus). Taking the 14,231 n-grams from the verbmobil corpus, Samuel et al. (1999) looked at nine different methods for identifying so-called key cue phrases, including co-occurrence measures, conditional probability, entropy, mutual information, selectional preference strength and a measure they introduce called deviation, where an individual cue phrase is scored for its deviation from some optimally predictive phrase, both on its own and in combination with some measure of conditional probability. How the use of
these measures affects the total number of remaining cue phrases can be seen in Table 2.8. Using these measures, Samuel et al. (1999) identified potential cue phrases contained in the group of all possible n-grams. They quickly identified an undesirable bias caused by frequency: in the case of a low-frequency n-gram, their information measures were unable to draw reliable conclusions about its usefulness. They considered removing n-grams with a frequency outside of some range, but felt that it may be impossible to find an appropriate range (one that maximises performance whilst not allowing coverage to suffer). We report on our efforts to find such a range automatically as part of the experiments in Chapter 3. Samuel et al. (1999) took their resulting ranked list of cue phrases (that is, the cue phrases that remain once the measures have been applied, ordered so that those which convey the most information come first), and subjected the phrases to lexical filtering, removing duplicate or overlapping phrases. In order to test the effectiveness of these automatically acquired cue phrases, Samuel et al. (1999) passed them as features to a machine learning method, in their case Transformation Based Learning (TBL). This work was a direct inspiration for our own automatic extraction of cue phrases, described in Chapter 3.
Method                            | Cue Number
Literature                        | 687
Co-occurrences                    | 3,994
Mutual Information                | 4,291
Information Gain                  | 5,202
Conditional Probability Deviation | 5,515
Conditional Probability           | 8,509
Entropy                           | 9,610
Selectional Preference Strength   | 9,635
T Test                            | 10,189
Deviation                         | 11,007
All                               | 14,231

Table 2.8: Number of cue phrases for each method of automatic discovery investigated by Samuel et al. (1999), after lexical filtering
At this stage it is interesting to note that most of the measures for selecting cue phrases from the complete n-gram set were adjudged to be equally effective, with little statistically significant difference between them, but that six of the automatic selection measures outperformed the literature model.
2.6 Automatically Classifying Dialogue Acts

In this section, we examine some of the prior work applying machine learning approaches to da classification. There are many ways we can categorise these approaches: by the corpus they were trained and applied to, by the algorithm used, or by the range of features each approach utilised. On inspection of the results, it appears as though the choice of algorithm may be the least important factor: correct identification and use of a feature set plays a far more critical role in classification performance.
accordingly, again, alright, also, alternately, alternatively, although, altogether, and, anyway, boy, consequently, conversely, equally, finally, fine, first, further, furthermore, gee, hence, hey, hopefully, however, incidentally, indeed, last, like, likewise, listen, look, meanwhile, moreover, namely, next, nevertheless, nonetheless, nor, now, oh, ok, only, or, otherwise, overall, say, second, see, similarly, still, so, then, therefore, though, thus, too, unless, well, where, whereas, why, yet

Table 2.9: 62 single word cue phrases from the literature, as reported in Hirschberg and Litman (1993)
As there is no true reference task for this problem, it remains difficult to compare the performance of models across data sets. There is no level playing field, and results on a single corpus can be interpreted in different ways. In contrast with standard practice in the machine learning literature, researchers reporting the application of ML algorithms to the da classification problem have not historically published the split of training and testing data used in their experiments, and in some cases methods to reduce the impact of the variation that can be observed when choosing data for training and testing (such as 10-fold cross validation) have not been used. In the discussion that follows, we divide the work by the choice of algorithm used, to reduce duplication when explaining methods. In the summary
tables at the end of the chapter we group results by corpus so that the reader can compare approaches easily. In each table, we will highlight the best performing approach over that corpus, although the accompanying text may contain some caveat, where a method has used some non-standard approach.
2.6.1 DA Sequence Models

One of the simplest methods of da prediction is the use of da sequence n-gram models, predicting the upcoming da based on some limited sequence of previous das. The use of n-grams of Dialogue Acts for analysis, such as the work reported in Niedermair (1992) and Nagata and Morimoto (1994), showed some early promise for da classification. Niedermair (1992) used sequences of Dialogue Acts to reduce the search space of interpretations for a word recogniser engine. Nagata and Morimoto (1994) used a prior sequence of Dialogue Acts to predict the next act, and achieved 61.7% accuracy, although over a small set of data, and with a very small set of dialogue acts. They also showed results for n-best classification, that is, assigning more than one tag per utterance (up to n), and evaluating performance based on this disjunctive offering of candidates, marking success if the true da is one of the n das offered. They achieve scores of 77.5% using two tags (2-best) and 85.1% using 3-best. Reithinger and Maier (1995) use the same model as Nagata and Morimoto (1994), over the 18 act annotated verbmobil data described in Section 2.4.2, including the evaluation of an n-best tagging approach. Assigning only a single tag (1-best) achieves 40.28% accuracy, 2-best achieves 59.62% and 3-best achieves 71.93%.
Remember, these scores are achieved by modelling only the progression of Dialogue Acts, so the prediction of a da is made only on the sequence of das so far. Core and Allen (1997) analysed damsl tags applied to TRAINS dialogues. Whilst not performing sequence modelling per se, this analysis was useful for determining the characteristics of the underlying corpus. They determined that approximately 50% of utterances were statements of some kind (a distribution echoed in the similarly damsl-annotated switchboard corpus). They also presented evidence that a decision tree mechanism, conditioned only on the preceding Dialogue Act, should have a performance greater than some reasonable baseline measure. Most of the work using sequence models found that a combination of both da n-grams and models of utterance words performs better than da n-grams alone. The models described below use this combination for the most part.
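To make the n-best evaluation scheme used in these early studies concrete, the following minimal Python sketch computes n-best accuracy from ranked candidate lists. It is purely illustrative: the classifier outputs and label names below are invented, and the code does not come from any of the cited systems.

```python
from typing import List

def n_best_accuracy(ranked_predictions: List[List[str]],
                    gold_labels: List[str], n: int) -> float:
    """Fraction of utterances whose gold da appears among the top-n candidates."""
    hits = sum(1 for candidates, gold in zip(ranked_predictions, gold_labels)
               if gold in candidates[:n])
    return hits / len(gold_labels)

# Hypothetical ranked outputs from some da classifier, most likely label first.
ranked = [["<statement>", "<yes-no-question>", "<backchannel>"],
          ["<backchannel>", "<statement>", "<agree>"]]
gold = ["<yes-no-question>", "<backchannel>"]

for n in (1, 2, 3):
    print(f"{n}-best accuracy: {n_best_accuracy(ranked, gold, n):.1%}")
# prints 50.0%, 100.0%, 100.0%
```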
2.6.2 Hidden Markov Models

One of the most referenced pieces of work in da classification is that of Stolcke et al. (2000). They attempted the first large scale modelling of dialogue acts, using the switchboard corpus, outlined in Section 2.4.3. The data was partitioned into a training set of 1,115 conversations (1.4M words, 198K utterances), and a test set of 19 conversations (29K words, 4K utterances). Stolcke et al. (2000) investigated a range of techniques, including Hidden Markov Models (HMM), decision trees and neural nets, as principled methods to combine a range of features, including the words in the individual
utterances, and the progression of Dialogue Acts so far in the dialogue. Stolcke et al. (2000) apply HMM models of the words in individual utterances in combination with tri-gram models of da sequences, and report 71% accuracy. Importantly, without the model of da progression, they achieve an accuracy of only 54.3%. This increases to 68.2% when considering only the most recent da. This appears to indicate that dialogue context is vital for interpreting the current da, that immediate context (i.e. only the two preceding das) provides a good deal of information, and that even “task-free” domains such as switchboard contain sufficient structure to exploit for this modelling effort. We note that Stolcke et al. (2000) did not publish the splits in the data they used for this experiment, nor did they employ a cross-validation approach, so comparison with their published results is difficult. Further, the work of Stolcke et al. (2000) trains a tri-gram model of da progression over the entire corpus, rather than relying only on the progression of da labels up to the point of the utterance under consideration. Reithinger and Klesen (1997) also apply an HMM modelling approach to the words in the utterances in the English part of the verbmobil corpus (which provides rather limited training data). As with their previous work they used the data tagged with the 18 higher level acts, and report a 74.7% tagging accuracy (compare this with the 40.28% we describe earlier, using only the da sequence model). When applied to the German data tagged with the same 18 acts, performance is lower at 67.18% (the difference being due in part, according to Reithinger and Klesen (1997), to the fixed word order of English, compared to the relatively free word order of German). As an
interesting point of comparison, when this same method is applied to the full 43 act tagged German data, performance falls again to 65.18%. In experiments over the Spanish CallHome corpus (Levin et al., 2003), Ries (1999) applies both HMM modelling (using unigram, bi-gram and tri-gram models) and neural network learning (using the SNNS system (Zell et al., 1993)) to the words in the utterances. Both models make use of only lexical features (words and part of speech (POS) tags), and were trained on 55 dialogues and tested on 40 dialogues. Neural networks using just unigram features, and taking no account of dialogue context at all, achieve significantly better results (at 76.2% accuracy) than the n-gram HMM back-off models, which achieved 74.4% classification accuracy.
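The general architecture reviewed in this subsection, a per-da model of the words in an utterance combined with an n-gram model of da progression, can be sketched as follows. This is a schematic illustration only, not a reconstruction of any of the cited systems: it assumes unigram word likelihoods, a bigram da transition model, add-alpha smoothing and Viterbi decoding, all of which are simplifications introduced here.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple

def train(dialogues: List[List[Tuple[List[str], str]]], alpha: float = 0.1):
    """dialogues: each dialogue is a list of (utterance tokens, da label) pairs."""
    word_counts = defaultdict(Counter)    # da -> word counts (emission model)
    trans_counts = defaultdict(Counter)   # previous da -> next da counts (bigram model)
    vocab = set()
    for dialogue in dialogues:
        previous = "<start>"
        for tokens, da in dialogue:
            word_counts[da].update(tokens)
            vocab.update(tokens)
            trans_counts[previous][da] += 1
            previous = da
    das = sorted(word_counts)
    vocab_size = len(vocab) + 1           # reserve some mass for unseen words

    def log_emit(da: str, tokens: List[str]) -> float:
        total = sum(word_counts[da].values())
        return sum(math.log((word_counts[da][w] + alpha) / (total + alpha * vocab_size))
                   for w in tokens)

    def log_trans(previous: str, da: str) -> float:
        total = sum(trans_counts[previous].values())
        return math.log((trans_counts[previous][da] + alpha) / (total + alpha * len(das)))

    return das, log_emit, log_trans

def viterbi(utterances: List[List[str]], das, log_emit, log_trans) -> List[str]:
    """Most likely da sequence for a dialogue of tokenised utterances."""
    best = [{} for _ in utterances]       # best[i][da] = (score, best previous da)
    for i, tokens in enumerate(utterances):
        for da in das:
            emit = log_emit(da, tokens)
            if i == 0:
                best[i][da] = (log_trans("<start>", da) + emit, None)
            else:
                score, prev = max((best[i - 1][p][0] + log_trans(p, da), p) for p in das)
                best[i][da] = (score + emit, prev)
    da = max(best[-1], key=lambda d: best[-1][d][0])
    sequence = [da]
    for i in range(len(utterances) - 1, 0, -1):
        da = best[i][da][1]
        sequence.append(da)
    return list(reversed(sequence))
```

Training over dialogues of (tokens, da) pairs yields the two component models; viterbi then labels a whole dialogue at once, which is how the da-sequence context enters the decision.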
2.6.3 Bayesian Methods

Bayesian approaches have proven to be among the most popular and most successful approaches adopted for the classification of Dialogue Acts. Mast et al. (1996) used a Bayesian approach, called polygrams, over the 18 act annotated English verbmobil corpus, using both words and da sequence as features, and achieved an accuracy of 68.7%. In Grau et al. (2004), a naive bayes approach is used over the switchboard corpus. Using a 5-fold cross validation approach to evaluation, with training sets representing 80% of the corpus and test sets of 20%, they achieve 60% classification accuracy, using a tri-gram language model of the words in the utterances. This compares favourably to the 54.3% achieved using the HMM language modelling approach to utterances (as demonstrated earlier by Stolcke et al.
(2000)), when taking no account of the context of the dialogue (i.e. using no model of Dialogue Act progression). Several researchers have also investigated Bayesian Networks, including early work by Pulman (1996). Keizer et al. (2002) apply Bayesian Networks to their own Schisma corpus (Keizer, 2001) of 64 Wizard of Oz dialogues. The domain is enquiry and booking regarding theatre performances. 20 dialogues of the Schisma corpus have been annotated with the switchboard-damsl annotation scheme, somewhat modified to include transaction-specific acts and to reduce the overall number of target tags. The small amount of training data (they use 75% of the available data for training) has a significant impact on performance, but overall classification accuracy is 43.5%, compared to a baseline (achieved by randomly selecting the tag) of 8.3%. In more recent work (Ji and Bilmes, 2005; Ji and Bilmes, 2006) there is an investigation of the use of dynamic Bayesian networks (DBNs), using graphical models, applied to the icsi-mrda corpus. Their efforts show one possible effect of reducing the number of labels applied to the same corpus. In their first work, Ji and Bilmes (2005) classify utterances of the icsi-mrda corpus into one of 65 da categories, the major labels of the icsi-mrda corpus. Their best performing set of features is a tri-gram model of the words in the utterances combined with a bi-gram model of da progression, and achieves 66% accuracy. In later work (Ji and Bilmes, 2006), they apply the same set of features to a version of the icsi-mrda corpus that is clustered into 5 super classes (as described in Section 2.4.3). This increases performance (with no other major changes to system set-up) to an accuracy of 81.3%.
2.6.4 Transformation Based Learning

An early paper in the field of automatic da classification is the work of Samuel et al. (1998), in which they apply Transformation Based Learning (TBL) (Brill, 1995) to the verbmobil corpus described in Section 2.4.2. Samuel et al. (1998) use transformation-based learning over a number of utterance features, including utterance length, speaker turn and the dialogue act tags of adjacent utterances (including those that follow the da under investigation, as with Stolcke et al. (2000), earlier). They achieved an average score of 75.12% tagging accuracy over the verbmobil corpus. They also explore the automatic identification of word sequences that might serve as useful dialogue act cues. A number of statistical criteria are applied to identify potentially useful word n-grams, which are then supplied to the transformation-based learning method to be treated as ‘features’. TBL has also been applied to the maptask corpus, replicating the feature set used in Samuel et al. (1998), in the work of Lager and Zinovjeva (1999). They achieve a classification accuracy of 62.1%.
2.6.5 Memory Based Learning

Memory Based Learning (MBL) is based on the hypothesis that humans handle a new situation by matching it against stored representations of previous situations. The matching process assumes a similarity metric between situations. Rotaru (2002) applied MBL to the switchboard corpus, extracting 5,000 utterances to use as test data, and the remainder (around 195,000 utterances) for training. Best results were obtained when using 3 nearest
neighbours, and excluding the ‘+’ annotation (which we will discuss further in relation to our own work, later), when accuracy reached 72.32%. This result is somewhat comparable to the result of 71% reported in Stolcke et al. (2000), who also ignored the ‘+’ annotation. As with the work of Stolcke et al. (2000), Rotaru (2002) performed no cross-validation exercise, and did not publish the split of training and test data. Lendvai et al. (2003) applied MBL to a corpus of 3,738 interactions with a Dutch train timetable system, which was annotated with a total of 94 different dialogue act tags. The experiments used a range of features including lexical features, dialogue history, ASR confidence values, and prosodic features. The lexical features were encoded as a 759 bit bag-of-words vector (there being a 759 word vocabulary size for this system). The TiMBL system was used as the classifier, and achieved an accuracy of 73.5%.
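The binary bag-of-words encoding of lexical features used by Lendvai et al. (2003) can be illustrated with a short sketch; the toy vocabulary and utterance below are invented for the example, and only the general encoding idea is taken from the description above.

```python
from typing import Dict, List

def binary_bow(tokens: List[str], vocab_index: Dict[str, int]) -> List[int]:
    """One bit per vocabulary word: 1 if the word occurs in the utterance, else 0."""
    vector = [0] * len(vocab_index)
    for token in tokens:
        position = vocab_index.get(token)
        if position is not None:        # out-of-vocabulary tokens are simply dropped
            vector[position] = 1
    return vector

# Toy 5-word vocabulary; the real system had 759 words, giving a 759-bit vector.
vocabulary = ["ik", "wil", "naar", "amsterdam", "vandaag"]
vocab_index = {word: i for i, word in enumerate(vocabulary)}
print(binary_bow("ik wil naar utrecht".split(), vocab_index))   # [1, 1, 1, 0, 0]
```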
2.6.6 Latent Semantic Analysis

Latent Semantic Analysis (LSA) is an information retrieval method which has been used for many classification problems. Serafin and Eugenio (2004) propose an extension to LSA, called Feature Latent Semantic Analysis (FLSA), which adds the richness of the many other linguistic features often labelled in a corpus. They apply FLSA to dialogue act classification over the CallHome Spanish corpus (Levin et al., 2003), annotated with a total of 232 das. To facilitate classification, these 232 tags are collapsed to a set of 37 das, and then further reduced to a set of just 10 das, with results reported for both sets. They also apply FLSA classification to the maptask corpus (Anderson et al., 1991),
comprising 13 tags, and to their own corpus of tutoring dialogues, with just 4 distinct tags. Importantly, both the maptask corpus and the CallHome Spanish corpus are annotated by hand with dialogue game information, as described in Section 2.3.2. We have already discussed the work of Poesio and Mikheev (1998) and the contribution to da classification that game structure can provide. Serafin and Eugenio (2004) make use of game structure as one feature for FLSA, as well as syntactic, lexical and dialogue context information. Their best performing models find that syntactic features do not help, but most other dialogue-related features do. One dialogue-related feature that does not help is the dialogue act history. This result creates a possible tension between two sets of work. On the one hand, there are those that find dialogue history to be an important feature when determining the current da, as shown in the work of Stolcke et al. (2000) and Ji and Bilmes (2006), for example. On the other, the results from Serafin and Eugenio (2004) and Ries (1999) show that dialogue context is not an important feature. In any case, over the CallHome corpus clustered into 37 das, Serafin and Eugenio (2004) achieve an accuracy of 74.87%, making use of dialogue game structure as a feature. Over the CallHome corpus clustered into 10 das, performance increases to 78.88%, using the same features. This increase is to be expected when reducing the number of target classification categories. Serafin and Eugenio (2004) then applied the same technique to maptask, where the highest accuracy is again achieved using game structure, and is 73.91% (compare this to other maptask classification efforts without the benefit of game structure, in Table 2.11).
2.6.7 Boosting

In Tur et al. (2006), a Boosting approach to classification is applied to the icsi-mrda corpus, using only lexical features. Additionally, Tur et al. (2006) examine the impact of combining data from the icsi-mrda and switchboard corpora, using maximum a posteriori (MAP) adaptation, to overcome the smaller amount of training data in the ICSI corpus. In doing so, they encounter some of the same problems we face when using out of domain data (as fully described in Chapter 4), in that the da sets do not correspond exactly. Specifically, the icsi-mrda data contains tags that do not appear in the switchboard data. Tur et al. (2006) address this by exclusion, removing all instances of utterances marked with these tags from a 5,000-utterance portion of the icsi-mrda corpus. Through string matching, they then exclude similar utterances from the switchboard corpus. This reduced the amount of switchboard data by 19%, because utterances such as “yeah” frequently carry the excluded tags in the icsi-mrda data. Clearly not all utterances containing the word “yeah” are labelled with the tags to be excluded, but they are nonetheless removed from the switchboard data. Under these conditions, using icsi-mrda data for training and testing over the 5 category clustered icsi-mrda data, Tur et al. (2006) achieve 77.98% tagging accuracy. Using switchboard data to tag the icsi-mrda corpus, they achieve only 57.37% accuracy. Tur et al. (2006) go on to show that, starting from some small amount of in-domain labelled data (the 5,000 utterance portion), they can boost classification
performance by using out of domain data, and show that the gains from such boosting largely disappear once the in-domain data reaches a size of around 50,000 utterances, at which point classification performance approaches that achieved by the in-domain data alone.
2.6.8 Decision Trees

Decision trees are a popular method for classification tasks. Mast et al. (1996) used semantic classification trees over the verbmobil data, annotated with 18 different dialogue acts, using both the words and da sequences as features, and achieved 69% accuracy (compare this with the 68.7% reported earlier using a Bayesian approach). Verbree et al. (2006) used J48 (an implementation of C4.5 (Quinlan, 1993) in the WEKA (Witten and Frank, 2005) package). They apply this method to three corpora: the icsi-mrda corpus, the switchboard corpus, and the AMI corpus described in Section 2.4.4, for which they report initial work. For features, they use word transcripts of the conversations and, unusually, punctuation features (such as the presence of question marks), which might not be available in real-time processing with ASR systems. For all three corpora, the best performing set of features for classification (obtained by performing a 10-fold cross validation over each) was the same: principally the presence of a ‘question mark’ token, the presence of the word ‘OR’, the length of the utterance, a bi-gram model of da progression, and the presence of word and part-of-speech n-grams of lengths 1 through 4 in the utterances. With these features, they score 89.27% on the five label cluster of the icsi-mrda corpus (the best
reported for this corpus so far, although the use of punctuation makes this of questionable value for those looking to work with automatic speech systems), 65.68% on the switchboard corpus, and 59.76% on the AMI corpus.
2.6.9 Neural Networks

Levin et al. (2003) used a range of classification approaches, including memory-based learning (TiMBL (Daelemans and den Bosch, 2005)), decision trees (C4.5 (Quinlan, 1993)), neural networks (SNNS (Zell et al., 1993)) and naive bayes (Rainbow (McCallum, 1996)). All classifiers were applied to the NESPOLE! corpus of interactions (Lavie et al., 2006), consisting of 8,289 utterances in English (they also report scores for German data, which we do not reproduce here). The da set they use contained 70 labels, the most frequent of which (‘give-information’) occurred 41.4% of the time. For their machine learning experiments, they had a total of 212 features for English, including binary features indicating the presence or absence of certain syntactic labels from the grammars in the parse forest for the utterances. In terms of accuracy, the best results were produced by the neural net approach (71.52% accuracy), whereas, perhaps surprisingly, the worst results were produced by naive bayes (51.39%). The other two methods were not significantly different from the top performing result: memory based learning achieved 69.82%; decision trees 70.41%. The SNNS neural network system was also used by Sanchis and Castro (2002). This work shows that for very tightly defined domains (they are working with a corpus of 215 dialogues in the train time-tabling domain,
labelled with 11 domain-specific dialogue acts), high levels of da classification accuracy are possible. They achieve 92.54% accuracy on word transcripts. In common with others, this score suffers substantially when applied to the output of ASR systems, falling to 72.39%. Interestingly, in later work, Grau et al. (2004) apply a naive bayes classifier to this same data, and score 89% classification accuracy. Wright (1998) also compares a range of classification algorithms (HMMs, neural nets and classification trees) using the MapTask corpus (Anderson et al., 1991), consisting of 13 da types. Unlike other approaches, Wright (1998) uses only prosodic information, in combination with a 4-gram model of dialogue act progression. 20 dialogues of the MapTask corpus (comprising 3,726 utterances) are used for training, and 5 dialogues (1,061 utterances) for testing. All the classification methods achieved similar results. Without the 4-gram model of da progression, HMM methods achieved 42% accuracy, decision trees 44% and neural nets 43%. Adding the model of da progression boosts accuracy significantly, in a similar way to the experiments of Stolcke et al. (2000), with HMM methods performing the best (64% accuracy), then decision trees (63%) and finally neural nets (62%).
2.6.10 Rule Based Approach

Although part of a wider, plan-based model of the kind we discussed earlier, Carberry and Lambert (1999) implemented a rule-based model of da recognition that uses three sources of knowledge: linguistic (including cue phrases), contextual and world knowledge. The linguistic knowledge is used primarily to
identify whether the speaker has some belief in the evidence presented, using prior known cue phrases (such as “but”), or the use of surface-negative question forms (“Doesn’t X require Y?”). Prasad and Walker (2002) were interested in classifying DATE tags (Walker and Passonneau, 2000) in the DARPA Communicator dialogues (Walker et al., 2001). Specifically, they wanted to model the system side (or service side) interactions in human-computer data. They used a rule-based learning method, RIPPER (Cohen, 1995), applied to the words in the utterances, and achieved 98.5% recognition accuracy when applied only to the system utterances of the corpus collected in 2000, annotated with a total of 11 dialogue acts. They believed that an automatic system applied to the user utterances would not achieve high enough levels of accuracy. When Prasad and Walker (2002) applied their classifier, trained over the initial dialogue set collected in 2000, to a new set of data from substantially the same domain, collected in 2001, recognition performance on the system utterances fell to 71.8%. Minor changes to the vocabulary used by the system, and some underlying task changes, meant that the automatically induced rules were no longer as applicable to this new data as they were to the original training set. This appears to be a flaw in many machine learning approaches to dialogue, i.e. they are unable to generalise to new data sets. However, Prasad and Walker (2002) were able to show that by adding a small amount of data (in this case, 3,000 utterances) from the new data set, they could increase performance to 93.8%. More recently, Georgila et al. (2009) extended this work to include manually constructed context rules that cover the user side of the Communicator dialogues. In comparison with hand-annotated data, they are able to
achieve precision scores of around 80% for the DATE labels. When Prasad and Walker (2002) applied their classifier trained on human-computer data to human-human data from the same domain, again focused only on the service side interactions (where most of the interaction is scripted), the classifier performed at 36.7%. There is an argument, made by Dahlbäck and Jönsson (1992), that human-computer data and human-human data, even in the same domain, are not comparable and so, Dahlbäck and Jönsson (1992) claim, human-human data is insufficient for training human-computer systems. The issue appears to be one of coverage; in other words, does the training data contain enough of the types of interaction seen in the test data? When moving from small, tightly constrained data (in the case of Prasad and Walker (2002), the human-computer (H-C) data) to less restricted interactions (the human-human (H-H) data) there are likely to be issues of coverage, where there are interactions in the H-H data that do not occur in the H-C data. It would be interesting to see the results of their experiments in reverse, where the more expressive H-H data is used to train a classifier applied to H-C data. We encounter similar issues when we attempt cross-domain classification later in Chapter 4. Lendvai et al. (2003) also applied RIPPER to their OVIS corpus (3,738 interactions with a Dutch train timetable system), but achieved only 59.5% accuracy. Compare this result to the 73.5% they achieved over the same corpus using Memory Based Learning, as reported in Section 2.6.5.
Corpus (labels)  | Method          | Features                     | Accuracy | Reference
Verbmobil (16)   | HMMs            | da sequences                 | 40.28%   | Reithinger and Maier (1995)
Verbmobil (16)   | Decision Trees  | words + das                  | 59%      | Mast et al. (1996)
Verbmobil (16)   | HMMs            | words + das                  | 68.7%    | Mast et al. (1996)
Verbmobil (16)   | HMMs            | words + das                  | 74.7%    | Reithinger and Klesen (1997)
Verbmobil (16)   | TBL             | cue phrases                  | 75.12%   | Samuel et al. (1998)
Switchboard (42) | HMM             | words                        | 54.3%    | Stolcke et al. (2000)
Switchboard (42) | Bayesian        | words                        | 60%      | Grau et al. (2004)
Switchboard (42) | Decision Trees  | words + das                  | 65.68%   | Verbree et al. (2006)
Switchboard (42) | HMM             | words + das                  | 71%      | Stolcke et al. (2000)
Switchboard (42) | MBL             | words + das                  | 72.32%   | Rotaru (2002)
ICSI (65)        | Graph model     | words + das                  | 66%      | Ji and Bilmes (2005)
ICSI (5)         | Boosting        | words                        | 77.98%   | Tur et al. (2006)
ICSI (5)         | Maximum Entropy | words                        | 79.53%   | Ang et al. (2005)
ICSI (5)         | Maximum Entropy | words + prosodic cues        | 81.18%   | Ang et al. (2005)
ICSI (5)         | Graph model     | words + das                  | 81.3%    | Ji and Bilmes (2006)
ICSI (5)         | Naive Bayes     | words                        | 82%      | Lendvai and Geertzen (2007)
ICSI (5)         | MBL             | words + das + prosodic cues  | 84%      | Lendvai and Geertzen (2007)
ICSI (5)         | Decision Trees  | words + das + orthography    | 89.27%   | Verbree et al. (2006)

Table 2.10: Classification results over the Verbmobil, Switchboard and ICSI corpora
2.6.11 Review of Approaches
We will conclude this section with a review of the approaches to da classification we have discussed so far. For each major corpus discussed in Section 2.4, we can see (in Tables 2.10 and 2.11) that there is a range of results, obtained through the application of different machine learning methods. The results for the three major corpora, verbmobil, switchboard and icsi-mrda, can be seen in Table 2.10. The verbmobil corpus was one of the first generally available annotated dialogue corpora. Although Transformation-Based Learning (TBL) is the best performing method over this data (Samuel et al., 1998), we see that other popular methods such as HMMs are not far behind (Reithinger and Klesen, 1997). On larger corpora such as the switchboard corpus, large scale language modelling techniques such as HMMs perform very well (Stolcke et al., 2000), although Memory Based Learning (MBL) also performs well (Rotaru, 2002). The icsi-mrda corpus is relatively new, although a wide range of machine learning methods have already been applied to this corpus for da classification. The best performing is Decision Trees (Verbree et al., 2006), although they make extensive use of punctuation, which may not be readily available for real-time processing of utterances. Outside of this approach, the graphical models of Ji and Bilmes (2006) are the best performing. It is interesting to note the increase in performance that is achieved by this model when moving from 65 labels (66% accuracy) to 5 labels (81.3% accuracy). It is curious that so much classification activity has concentrated on the 5-category clustering of the icsi-mrda corpus, when these 5 categories appear to be too broad to present a useful interpretation of the on-going dialogue.
Corpus (labels)              | Method          | Features                     | Accuracy | Reference
MapTask (13)                 | HMM             | prosodic cues + das          | 64%      | Wright (1998)
MapTask (13)                 | Decision Trees  | prosodic cues + das          | 63%      | Wright (1998)
MapTask (13)                 | Neural Networks | prosodic cues + das          | 62%      | Wright (1998)
MapTask (13)                 | TBL             | words + das                  | 62.1%    | Lager and Zinovjeva (1999)
MapTask (13)                 | FLSA            | game structure               | 73.91%   | Serafin and Eugenio (2004)
Spanish CallHome (37)        | HMM             | words + das                  | 74.4%    | Ries (1999)
Spanish CallHome (37)        | Neural Networks | words + das                  | 76.2%    | Ries (1999)
Spanish CallHome (37)        | FLSA            | game structure               | 74.87%   | Serafin and Eugenio (2004)
Spanish CallHome (10)        | FLSA            | game structure               | 78.88%   | Serafin and Eugenio (2004)
DARPA Communicator H-C (11)  | Ripper          | words                        | 98% (a)  | Prasad and Walker (2002)
Basurde (11)                 | Neural Networks | words                        | 92.54%   | Sanchis and Castro (2002)
Basurde (11)                 | Naive Bayes     | words                        | 89%      | Grau et al. (2004)
OVIS (94)                    | Rule based      | words + das + prosodic cues  | 59.5%    | Lendvai et al. (2003)
OVIS (94)                    | MBL             | words + das + prosodic cues  | 73.5%    | Lendvai et al. (2003)
NESPOLE (70)                 | Naive Bayes     | words                        | 51.39%   | Levin et al. (2003)
NESPOLE (70)                 | TiMBL           | words                        | 69.82%   | Levin et al. (2003)
NESPOLE (70)                 | Decision Trees  | words                        | 70.41%   | Levin et al. (2003)
NESPOLE (70)                 | Neural Networks | words                        | 71.52%   | Levin et al. (2003)
AMI (15)                     | Decision Trees  | words + das + orthography    | 59.76%   | Verbree et al. (2006)

(a) Only analysing system utterances

Table 2.11: Classification results over additional corpora
Table 2.11 shows the results over the other corpora discussed in this chapter. We reiterate that it remains very difficult to directly compare approaches, even when applied to the same corpus, so cross-corpora comparisons must be carefully considered. There are issues of the da label set used, the labels considered and those ignored, the pre-processing of the corpus, the use of orthographic information or prosody, and so on. Even taking all of these factors into account, there is no shared information about the splits used in the data for training and testing purposes, so there is no level playing field for the comparison of results. As our own results over some of these corpora will show, the variation in a 10-fold cross validation can be large, sometimes up to 8 percentage points across the 10 folds. What we hope to have shown in this section is that there is no obvious leading contender for the algorithm best suited to the Dialogue Act classification task. For example, Bayesian classifiers work well over the icsi-mrda corpus (Lendvai and Geertzen, 2007), but fare badly over the switchboard corpus (Grau et al., 2004), and are the worst classifier for the NESPOLE corpus (Levin et al., 2003). Neural Networks are the best classifier for the NESPOLE corpus (Levin et al., 2003), but the worst classifier for the maptask corpus (Wright, 1998). For the maptask corpus, FLSA performs the best (Serafin and Eugenio, 2004), but this method is outperformed on the Spanish CallHome corpus by Neural Networks (Ries, 1999). Memory Based Learning is rarely the top scoring algorithm, but consistently performs relatively well across a range of corpora, including switchboard (Rotaru, 2002) and
icsi-mrda (Lendvai and Geertzen, 2007). Comparing the features used by each machine learning method, it appears as though using prosodic information, when available, consistently adds a few percentage points to classification accuracy (such as the 1.65 percentage point gain demonstrated over the icsi-mrda corpus by Ang et al. (2005)). Similarly, in highly structured dialogues such as those in the maptask corpus, game structure can be a very significant feature, but less so with unstructured two-person conversation, as seen in the Spanish CallHome corpus (Serafin and Eugenio, 2004).
2.7 Summary

In this chapter, we have presented an introduction to Dialogue Acts, starting with initial work in Speech Act theory through to the practical use made of das today in both on-line (by which we mean active dialogue systems) and off-line (corpus-based investigation) processing of dialogue. Central to much current work in the field, and to the work presented in this thesis, is the availability of large, manually annotated corpora featuring Dialogue Acts. In order to review these, we discussed some of the most prevalent annotation paradigms in use, and then described the resources that have been annotated with examples of those paradigms. Whilst there remains discussion about the appropriate level of annotation of dialogue structure, and there is no universally accepted Dialogue Act set in use, the majority of large-scale annotation projects in recent years have adopted some variant of the damsl annotation scheme, as seen in both the switchboard corpus
(Jurafsky et al., 1997) and the more recent icsi-mrda corpus (Shriberg et al., 2004). We also adopted a variant of this annotation scheme for our own dialogue system project, amities, which included the manual annotation of 1,000 dialogues between users and call centre agents (Hardy et al., 2005). We then presented a cross-section of work previously published on the automatic classification of Dialogue Acts. Where possible, we have compared approaches, although the range of corpora and annotation schemes makes this difficult. What we hope is clear from an examination of the results when applied to the same corpus is that there is no single algorithm that is clearly better than the others for the da classification task. There are a range of factors that can contribute to good classification performance, and the two factors that contribute most centrally appear to be (1) the number of target categories for classification, and (2) whether or not the algorithm has access to enhanced features such as prosody, or manually annotated dialogue game structure (as discussed earlier in Section 2.3.2). Other features, such as dialogue history, that work well for one machine learning approach over one corpus, appear to have less impact with a different machine learning algorithm, or over a different corpus.
2.7.1 Number of Target Categories

There is a clear benefit in reducing the number of overall target categories, and experiments (cf. Ji and Bilmes (2005) versus Ji and Bilmes (2006)) show this. Still, there is a trade-off against having a set of labels that is maximally useful for the application in hand, and designers of da taxonomies should be
careful to consider this. The reduced category set of the icsi-mrda corpus appears to have too few categories for it to be practically useful in a range of dialogue applications, and yet it is the most popular version of the icsi-mrda corpus for da classification experiments.
2.7.2 Enhanced Features

It would seem from the results discussed earlier that a number of features are complementary, i.e. that either prosody or lexical features can be used, and achieve reasonable accuracy alone. It is not clear that combining these two features achieves significantly increased levels of classification accuracy. In part, this is presumably because they encode much of the same information (for example, in questions, the use of key cue phrases mirrors the rising intonation seen in prosodic features), and we should note that purely lexical features continually perform well as a mechanism for classification. What is clear is that higher level dialogue structure, if available, is a key feature that can improve da classification in restricted domains. For example, dialogue structure is shown to be highly useful in the direction-giving domain of the maptask corpus, but significantly less so in the open, conversational domain of the Spanish CallHome corpus (both results from Serafin and Eugenio (2004)). In any case, this may be a complex situation, for how do we ascribe dialogue structure automatically without an assignment of Dialogue Acts to build upon? However, in contrast to this evidence are empirical results that show that dialogue history is not an important feature when it comes to Dialogue Act classification (see Mast et al. (1996), Ries (1999) and Serafin and Eugenio (2004) for confirmation).
This is a key point, and slightly confusing in the context of the dialogue structure argument. It appears as though dialogue structure is very important, but at a level of abstraction more complex than the information contained in a limited prior sequence of Dialogue Acts.
3. METHOD

3.1 Introduction

In Chapter 2, we introduced Dialogue Acts as labels on utterances of dialogue that represent the function of those utterances in the context of the on-going dialogue. In Section 2.5 we detailed a range of implementations of Dialogue Act classifiers, concluding that whilst straightforward comparison between classifiers is difficult, it appears as though there is no one single algorithm ideally suited to da classification. What we did notice is that approaches used a range of different ‘features’ for the da classification task, including lexical, syntactic, prosodic and dialogue context features. Most classifiers used some lexical features (the words in the utterances under consideration), frequently applying some kind of Hidden Markov Modelling (HMM) to every utterance (Levin et al., 2003; Stolcke et al., 2000; Reithinger and Klesen, 1997), a technique popular in speech processing. We were inspired by the work of Samuel et al. (1999), who, instead of modelling entire utterances, extract cue phrases from the verbmobil corpus of dialogues. Several researchers, such as Hirschberg and Litman (1993) and Grosz and Sidner (1986), discuss the presence of cue phrases in utterances,
which can serve as useful indicators of Dialogue Acts or discourse function. These cue phrases are words or groups of words that appear to be reliable indicators of Dialogue Acts. We wanted to see if we could extract useful cue phrases from a corpus at least one order of magnitude larger than the verbmobil corpus (3k utterances, 1k distinct words) used by Samuel et al. (1999). When we began our investigation, we were looking to create an online Dialogue Act classifier (that is, a system that worked in real time, using only features which are available at run time) for use in the AMITIES project (http://www.dcs.shef.ac.uk/nlp/amities/), where we would have access to a large corpus of manually annotated human-human dialogues (around 1,000 dialogues in English), with Dialogue Acts annotated using a slight variant of the damsl tag set introduced in Section 2.4.3. We started an investigation of cue phrases using the switchboard corpus we described in Section 2.4.3. This corpus contains nearly 225k utterances and 22k distinct words, so we felt that if we could deploy techniques that would extract cue phrases from this amount of data, then a subsequent move to the AMITIES data would be possible. In the first part of this chapter, we present our initial experiments with a subset of the 225k utterances from the switchboard corpus. We describe the intuition behind our cue phrase extraction method, and describe how we determine if our automatically extracted cue phrases are effective in identifying Dialogue Acts. We do this by using them directly as a method of classification, and we discovered fairly early that our simple classification method
worked surprisingly well on the switchboard corpus, without taking any dialogue context into consideration. We look at a number of elaborations to our cue extraction model, including two variable thresholds, predictivity and frequency. Setting a correct level for the frequency variable results in a boost in performance, by weeding out low frequency features that skew classification performance, whereas setting the predictivity level (and we explain what we mean by predictivity later in this chapter) enables us to reduce the overall size of the set of cue phrases we use in classification, to maintain a compact model. Ultimately for our experiments, these thresholds are set empirically, using a validation set, as we report later in this chapter. We also examine the effect on performance caused by varying amounts of training data. In addition we examine the effect of a number of additional, simple lexical and syntactic features, as others (notably Samuel et al. (1998) and Stolcke et al. (2000)) had done, for example using positional information (whether the cue phrase in question occurs at the beginning or the end of an utterance), and using specific, discrete length models (one model for ‘short’ utterances and another for ‘long’ utterances, for instance). The second part of this chapter presents our cue based, intra-utterance dialogue act classifier. We experiment with different sets of da labels (using both a large set and a small set of categories) to see if the observations on number of categories versus classification accuracy, reported in the previous chapter (cf. Ji and Bilmes (2005) versus Ji and Bilmes (2006), or the results reported in Serafin and Eugenio (2004)), were borne out in our experience. We will show some steps we took to improve performance, ending with our best performing model. The results we ultimately obtain rival the best results
achieved on that corpus, in work by Stolcke et al. (2000), who use a far more complex approach involving Hidden Markov Modelling (HMM), which addresses both the sequencing of words within utterances and the sequencing of dialogue acts over utterances. We perform an error analysis, and examine the results in some detail, to determine what the possible upper bound of performance is for a mechanism of this type, and to discover what errors are regularly made by our simple approach. We discuss the potential utility of classifiers that identify the n most likely dialogue acts for each utterance. Our approach can also be used to produce a (possibly ranked) list of the n most likely alternative classifications for each utterance, something we explore later in this chapter, which might feed into some subsequent process, such as a dialogue manager, that could select amongst the restricted set of alternatives offered on the basis of higher-level dialogue information. The subsequent process might alternatively be a machine-learning based component trained to make the final choice of da based on inter-utterance context, with the possible benefit of having a much reduced feature space from the elimination of word n-gram based features, which would have already been exploited by our simple classifier component. This could address one of our initial motivating factors, where we were looking to adopt methods and models that would be computationally tractable even when exposed to large amounts of data. Finally, having evaluated our intra-utterance classifier, we present results incorporating our approach into a model that uses a notion of dialogue context, by modelling the sequence of prior Dialogue Acts. We conclude with a summary and some discussion.
3.2 Cue Phrase Selection

What defines a good cue phrase? We are looking for words or phrases in a corpus that regularly co-occur with individual Dialogue Acts. For example, if the n-gram “hello there” occurs in the corpus a total of 100 times, and of those 100, 95 instances occur in utterances that are annotated with the same Dialogue Act, then we could say that “hello there” is a cue phrase that is somehow a reliable indicator of that Dialogue Act. We use the term predictivity to indicate how predictive a phrase is of a particular da. In this case, the cue phrase “hello there” has a predictivity of 95% for that da label. Note that there are 5 other instances in our sample corpus where this phrase co-occurs with one or more additional dialogue acts, but we are not interested in these. We want to select phrases that are highly discriminative, and so concern ourselves with the highest predictivity of a particular cue phrase, irrespective of its relationships with other da labels. We call this score the maximal predictivity. Already in this informal introduction, there are several thresholds that should be apparent. First, below some maximal predictivity score, we assume that phrases will no longer be discriminative enough to be useful for labelling Dialogue Acts. Second, the number of occurrences of each phrase in the corpus as a whole is important. We will have to investigate setting values for these thresholds in our model. We were inspired by the work of Samuel et al. (1999) on automatically identifying potential cue phrases from a corpus. In their experiments, Samuel et al. (1999) constructed all n-grams of lengths 1 through 3 from the corpus,
and then applied a range of measures which effectively pruned the n-gram list, until only candidate cue phrases remained. In order to test the effectiveness of these automatically acquired cue phrases, Samuel et al. (1999) passed them as features to a machine learning method, in their case Transformation Based Learning (TBL). We began our experiments with an intuition as to what constituted a good cue phrase. We hypothesised that a cue phrase would be a word or phrase in the corpus that would serve as a reliable indicator of a particular Dialogue Act. If we look at cue phrases that regularly co-occur with individual Dialogue Acts, we can see if the presence of a particular phrase is a reliable predictor of that Dialogue Act. Each cue phrase could predict many different Dialogue Acts, but we are interested only in the da that is maximally predicted by the cue phrase in question. More formally, we can describe our criterion, predictivity, for selecting cue phrases from the set of all possible cue phrases in the following way. The predictivity of phrase c for da d is the conditional probability P(d|c), where:
P(d|c) = #(c & d) / #(c)
We represent the set of all possible cue phrases (all n-grams of length 1–4 from the corpus) as C, so given c ∈ C, c represents some possible cue phrase. Similarly, D is the set of all dialogue act labels, and d ∈ D represents some dialogue act label. Therefore #(c) is the count of (possible) cue phrase c in the corpus, and #(c & d) is the count of occurrences of phrase c in utterances with dialogue act d in the training data. The maximal predictivity of a cue
phrase c, written as mp(c), is defined as:
mp(c) = max_{d ∈ D} P(d|c)
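As a concrete illustration of these definitions, the following minimal Python sketch counts all n-grams of length 1–4 against the da labels of the training utterances and computes P(d|c) and mp(c). It is an illustrative reconstruction for exposition only; the function and variable names are not taken from the thesis implementation.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

def ngrams(tokens: List[str], max_n: int = 4):
    """All word n-grams of length 1..max_n in an utterance, as tuples."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def count_cues(training: List[Tuple[List[str], str]]):
    """training: a list of (utterance tokens, da label) pairs.
    Returns #(c) and #(c & d) for every candidate cue phrase c."""
    cue_count = Counter()                  # #(c)
    cue_da_count = defaultdict(Counter)    # #(c & d), indexed as cue -> da -> count
    for tokens, da in training:
        for c in ngrams(tokens):
            cue_count[c] += 1
            cue_da_count[c][da] += 1
    return cue_count, cue_da_count

def maximal_predictivity(c, cue_count, cue_da_count):
    """mp(c), together with the da that achieves it."""
    da, joint = cue_da_count[c].most_common(1)[0]
    return joint / cue_count[c], da
```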
In their experiments, Samuel et al. (1999) also experimented with conditional probability. However, they used P(c|d), the probability of some phrase occurring given some Dialogue Act, which does not appear as intuitive as our approach: there will be many words that occur with a large number of Dialogue Acts, and so will not be discriminative; the word “the”, for example, could occur with virtually any Dialogue Act. For our experiments, the word n-grams used as potential cue phrases are automatically extracted from the training data. All word n-grams of length 1–4 within the data are considered as candidates. The maximal predictivity of each cue phrase can be computed directly from the corpus. We can use this value as one threshold for pruning potential cue phrases from our model. Removing n-grams below some predictivity threshold will improve the compactness of the model produced. For example, taking our earlier example of “hello there”, 95 instances of this cue phrase occurred in utterances annotated with the same da. For the other 5 occurrences, we may as well discard this information, because the resulting predictivity (even if all 5 instances occur in utterances marked with the same da) can only be 5%. Thus the maximal predictivity for the n-gram “hello there” is 95% in this instance. Another reasonable threshold would appear to be the frequency count of each potential cue phrase. Phrases which have a low frequency score are
likely to have very high predictivity scores, possibly skewing the model as a whole. For example, any potential cue phrase which occurs only once will de facto have a 100% predictivity score. We can use a minimal count threshold (t_#) and a minimal predictivity threshold (t_mp) to prune the set C* of 'useful' cue phrases derived from the training data, defined by:
C* = {c ∈ C | mp(c) ≥ t_mp ∧ #(c) ≥ t_#}

The n-grams that remain after this thresholding process are those we identify as cue phrases. For our initial experiments, we used a predictivity of 30% and a frequency of 2 as our thresholds for cue extraction, and applied the mechanism to the switchboard corpus, described in Section 2.4.3 earlier.
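To make the extraction procedure concrete, the following is a minimal sketch (in Python, not the implementation used in this thesis) that computes predictivity, maximal predictivity and the thresholded cue set from a list of (tokens, da) pairs. Counting each n-gram at most once per utterance is an assumption about how #(c) is counted, and the function and variable names are illustrative only.

from collections import Counter, defaultdict

def extract_cues(tagged_utterances, max_n=4, t_freq=2, t_mp=0.30):
    """tagged_utterances: list of (tokens, da_label) pairs.
    Returns {cue: (best_da, max_predictivity, count)} for cues passing both thresholds."""
    cue_count = Counter()                 # #(c)
    cue_da_count = defaultdict(Counter)   # #(c & d)
    for tokens, da in tagged_utterances:
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        for cue in seen:                  # count each cue once per utterance (assumption)
            cue_count[cue] += 1
            cue_da_count[cue][da] += 1
    cues = {}
    for cue, total in cue_count.items():
        best_da, best = cue_da_count[cue].most_common(1)[0]
        mp = best / total                 # maximal predictivity mp(c)
        if total >= t_freq and mp >= t_mp:
            cues[cue] = (best_da, mp, total)
    return cues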
3.3 Cue-Based da Classification

Having defined our mechanism to extract cue phrases from a corpus, we need some way to evaluate their effectiveness. Samuel et al. (1999) passed their cue phrases as features to a machine learning method. We chose instead to use the cue phrases extracted from a corpus directly as a method of classification. If our extracted cues are indeed reliable predictors of Dialogue Acts, then a classifier that uses these cues directly should perform reasonably well. If, on the other hand, this mechanism did not work, it would not necessarily mean that our cue phrases are not effective, only that we would need to pass them to a subsequent machine learning process as others had done.
Speaker A: DA="yes-no-question": would you ever have thought of that

Example n-gram        Total Count   Predicts da (count)             Predictivity
would                 4563          statement-non-opinion (2368)    51.9%
you                   38905         statement-non-opinion (16342)   42.0%
would you             157           wh-question (53)                33.8%
would you             157           yes-no-question (40)            25.5%
would you ever        3             yes-no-question (3)             100%
would you ever have   1             yes-no-question (1)             100%

Table 3.1: switchboard: Example cue-based classifier
The benefit of our direct classification approach is that it is very fast to apply, and gives us immediate feedback as to the possible effectiveness of our automatically extracted cue phrases.

The predictivity of a cue phrase can be exploited directly in a simple model of Dialogue Act classification. Intuitively, when we see the phrase "hello there", we want always to assign its maximally predicted category, and all other things being equal we will be correct 95% of the time. To apply this method, we train on part of a corpus, extracting potential cue phrases as described in the previous section. The cue phrases selected using our measure of predictivity are then used directly to classify unseen utterances in the following manner. We identify all the potential cue phrases a target utterance contains, determine which has the highest predictivity of some dialogue act category, and assign that category. Given the notation defined above, we can obtain the da predicted by a particular cue, dp(c), by:
dp(c) = argmax_{d ∈ D} P(d|c)
In the example shown in Table 3.1, the utterance "would you ever have thought of that" has been annotated by a human as a <yes-no-question>. The table shows the predictivity scores for a selection of the n-grams in the utterance. We can see that the 1-grams, such as "would" and "you", have a high total count, but equally these phrases are found in utterances marked with a range of different das, as indicated by their low predictivity scores (for all the following examples, only the maximum predictivity score is shown, unless otherwise indicated). Neither of the 1-grams shown has a maximum predictivity score that would indicate the correct da for this utterance, and if we used only 1-grams, we would incorrectly assign the category <statement-non-opinion> to this utterance. If we look at a selection of 2-grams, such as "would you", we can see that this n-gram is maximally predictive of the da <wh-question>, but at a lower predictivity than the 1-grams we examined earlier, so even using this mix of 1- and 2-grams we would not assign the correct category. We can see that "would you" has a low rate of predictivity for <yes-no-question>, at a level below our predictivity threshold, and so this association would not be retained by our classifier. It is only when we look at 3-grams and above that we see really useful results. The 3-gram "would you ever" is 100% predictive of the correct da category: the three times that this 3-gram occurs in this corpus, they are all in utterances annotated by humans as a <yes-no-question>. We can also see that the 4-gram "would you ever have" has a 100% predictivity
of the correct category, but occurs only once in the entire corpus, so would be discarded for falling below our frequency threshold.

If multiple cue phrases share the same maximal predictivity but predict different categories, we select the da category of the phrase which has the higher number of occurrences (that is, the n-gram with the highest frequency). If the combination of predictivity and occurrence count is insufficient to determine a single da, then a random choice is made amongst the remaining candidate das. If ng(u) defines the set of n-grams of length 1–4 in utterance u, and C*_u is the set of n-grams in the utterance u that are also in the thresholded model C*, then C*_u is defined as:
C*_u = ng(u) ∩ C*

Given our thresholds, mpu(u) (the utterance maximal prediction, or the mp value of the highest scoring cue in utterance u) is defined as:
mpu(u) = max_{c ∈ C*_u} mp(c)
The maximally predictive cues of an utterance (mpcu(u)) are:
mpcu(u) = {c ∈ C*_u | mp(c) = mpu(u)}

Then the maximal cue of utterance u, mcu(u), i.e. one of its maximally predictive cues that has a maximal count (from within that set), is:
mcu(u) = argmax_{c ∈ mpcu(u)} #(c)
Finally, for our classification model, dpu(u), the utterance da prediction, i.e. the da predicted by the model for utterance u, is defined as:
dpu(u) = dp(mcu(u))

If no cue phrases are present in the utterance under consideration, then a default tag is assigned, corresponding to the most frequent tag within the training corpus, which for the switchboard corpus is the tag <statement-non-opinion>.
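A minimal sketch of the classification step just defined, assuming the cue table produced by the previous sketch; the default tag and the count-based tie-break follow the description above, and the final random choice mirrors the stated fallback. Names are illustrative, not those of the CuDAC code.

import random

def classify(tokens, cues, default_da, max_n=4):
    """Assign a da to an utterance using the thresholded cue set (dpu(u))."""
    # C*_u: cues of the utterance that survived thresholding
    present = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            cue = tuple(tokens[i:i + n])
            if cue in cues:
                present.append((cue,) + cues[cue])   # (cue, da, mp, count)
    if not present:
        return default_da                            # no cue phrases present
    mpu = max(mp for _, _, mp, _ in present)         # mpu(u)
    best = [entry for entry in present if entry[2] == mpu]        # mpcu(u)
    top_count = max(count for _, _, _, count in best)
    best = [entry for entry in best if entry[3] == top_count]     # mcu(u) candidates
    return random.choice(best)[1]                    # dp(mcu(u))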
3.4 Initial Experiments

For our initial experiments, we used a 50k utterance section of the switchboard corpus. To obtain this 50k data, and the data samples we use in later experiments, we first extracted a 202k utterance portion of the 225k utterance complete data set (in line with the experiments detailed in Stolcke et al. (2000)). This data was randomly selected by dialogue (we did not split complete dialogues). From that 202k data, we randomly chose dialogues until we had a 50k utterance data set. The complete switchboard data set consists of 1155 conversations, annotated with the Dialogue Act types seen in Jurafsky et al. (1997); in total, there are 226 distinct da labels covering the entire corpus. Altogether the complete corpus of 1155 conversations comprises around 225k utterances, but for our initial experiments we wanted some reasonably sized subset to determine if our method of cue extraction was effective. The corpus was pre-processed to remove all punctuation and case information, and
some specific corpus mark-up, such as filler or repeated information, as described in Meteer (1995), was also removed in line with the work of Stolcke et al. (2000).

An important aspect of the switchboard corpus is that the data is already segmented into appropriate utterance units. Such units are often sentential in form, but this is not always the case, and some methods we discussed earlier look to incorporate segmentation into their classification mechanisms. We will return to the subject of segmentation later, but for now we acknowledge that it is a challenge we assume to be solved for this sequence of experiments.

To address the problems of reporting results for this classification task that we have described, we adopted a 10-fold cross-validation approach, with results being averaged over 10 runs. For our 50k utterance data set we used a standard leave-one-partition-out approach, with the data being split into ten approximately equal partitions of 5k utterances, each being used in turn for testing, with the remaining 45k utterances combined for training.

Taking our initial 50k utterance data set, and using our arbitrary thresholds, we extracted potential cue phrases from the training data and used them to classify the unseen utterances of the testing data. Using the 226 da labels, we were facing issues of severe data sparsity, but even so we achieved an average classification accuracy of 54.5% using this simple method of da classification. This compares to a baseline of 33.4% obtained by assigning the most frequently occurring tag in the data. We took this result to indicate that we had indeed extracted a useful set of cue phrases from the complete set of n-grams, and were surprised by the relatively high performance of our evaluation model, the da classifier
constructed by directly applying the extracted cue phrases. The best published result on the switchboard corpus is 71% classification accuracy (Stolcke et al., 2000); Rotaru (2002) achieves a slightly higher classification accuracy, but omits one important label. However, this score is obtained over a clustered version of the switchboard corpus. This clustering, suggested by Jurafsky et al. (1997), was designed specifically to overcome problems of data sparsity and to improve inter-coder reliability, and reduces the 226 da label set to just 42 labels.

Our average cross-validated score for this experiment is 54.5%. This is significantly below the high mark of 71% reported in Stolcke et al. (2000), although the single highest score for our approach is 60.7%. This gives some indication of the wide degree of variation possible in these cross-validated classification experiments, and reinforces the opinion that authors should either publicise data splits or perform cross-validation experiments to ensure meaningful (and repeatable) results, or preferably both. None of the experiments of Stolcke et al. (2000) or Rotaru (2002) adopt a cross-validation approach. Our 60.7% result over the 226 label corpus is not as far from the 71% score reported over the 42 label corpus as we might have expected.

The next logical step was to apply the same method of cue extraction and evaluation to the 42 label switchboard corpus. For these experiments, we used the clustering of labels proposed by Jurafsky et al. (1998), which maps the 226 damsl da labels to the 42 classes shown in Figure 2.3. It is this clustered corpus that is used by others for da
classification experiments. Jurafsky et al. (1997) proposed this clustering to improve the levels of inter-annotator agreement (fewer categories meant fewer cases of disagreement) and to create data clusters that were more amenable to machine learning experiments. Looking at the distribution of da labels across the data in Figure 2.3, we can see that this clustered data set still has a very skewed representation, with the top 5 labels covering around 80% of the corpus.

When we apply our classification algorithm to a 50k utterance portion of this clustered corpus, we obtain an average cross-validated score of 61.3% (an increase of an absolute 6.8% over the average cross-validated score using the corpus with 226 labels). The single highest score we obtain during this 10-fold cross-validation is 65.8%.

Remember that we wanted a method to evaluate our set of automatically extracted cues, and devised our simple method of da classification only as a mechanism to perform this evaluation. What we see from the results is that applying these cue phrases directly as a method of da classification obtains results that are close to the best published scores for da classification on this corpus. Our best score represents 92.7% of the figure published by Stolcke et al. (2000) (and of the similar score obtained by Rotaru (2002)), both of which employ methods that are significantly more complex than our classification method, and draw on a greater range of features.
3.5 Elaboration of the Classification Model

To continue our investigation, we saw at least two possible alternatives: either pass our extracted cue phrases to some machine learning method to exploit them, or refine our direct classification approach, to determine the full impact of these simple features. We chose to explore the latter, using in all subsequent cases a 50k portion of the clustered 42 label corpus. We decided to add two features to the model, accounting for the length of the utterance under consideration and the position of cue phrases within the utterance, followed by a final manipulation of the corpus to reconnect segmented utterances.
3.5.1 Utterance Length Models

The first elaboration used models sensitive to utterance length. Intuitively, we saw that a particular cue phrase might serve as a predictor of different das given different lengths of utterance. We therefore hoped that this model would provide better classification for Dialogue Acts whose realisation was skewed towards, for instance, short utterances, like "okay". We had observed that "okay" occurring in short utterances was a good indication of one tag, whereas in longer utterances "okay" was more usually part of an utterance carrying a quite different label. In work on the icsi-mrda corpus reported earlier, Ji and Bilmes (2006) found that mean utterance length varied widely across da categories, with means of 8.60, 1.04, 1.31 and 6.50 words reported for different categories. Taking this as a starting point, we grouped utterances
into those of length 1 (short, or one word, utterances), those of length 2–4 (which we call medium length utterances), and those of length 5 and above (the long length model, comprising everything else), and produced separate cue-based models for each group. We chose to represent this data as discrete sets, rather than use the raw utterance lengths, to avoid creating data sparsity issues. Whilst the models we introduced related to our experiments with the switchboard data set, we hoped they would capture some general property of the associated dialogue acts. In terms of our models of utterance length, it is most important to separate those categories that share a number of our automatically extracted cue phrases but tend to be realised at different lengths. Grouping together categories that are clearly lexicalised, containing very clear, different cue phrases, should present no particular problems. When we apply this model to our 50k utterance, 42 label corpus, and perform a 10-fold cross-validation (c-v), we achieve an average c-v score of 65.7%. This represents an absolute increase of 4.4% over the model that takes no account of utterance length. The single highest score for this elaboration is 68.8%.
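One way to realise the utterance length models is to partition the training data into the three buckets and build a separate cue table for each, as in this sketch; the bucket boundaries are those given above, while the helper extract_cues refers to the earlier sketch and everything else is an illustrative assumption.

def length_bucket(tokens):
    n = len(tokens)
    if n == 1:
        return "short"
    if n <= 4:
        return "medium"
    return "long"

def extract_length_models(tagged_utterances, **kwargs):
    """Build one cue table per length bucket (reusing extract_cues from the earlier sketch)."""
    buckets = {"short": [], "medium": [], "long": []}
    for tokens, da in tagged_utterances:
        buckets[length_bucket(tokens)].append((tokens, da))
    return {name: extract_cues(data, **kwargs) for name, data in buckets.items()}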
3.5.2 Position Specific Cues

In Section 2.5, we reported work that exploited position specific information about cue phrases, or the use of syntactic information in general (Stolcke et al., 2000; Lendvai et al., 2003).
Speaker A: DA="wh-question": what would you rather have

Example n-gram           Total Count   Predicts da (count)            Predictivity
what                     7236          statement-non-opinion (2978)   41.2%
would                    4563          statement-non-opinion (2368)   51.9%
what would               78            wh-question (39)               50%
would you                157           wh-question (53)               33.8%
<start> what             1375          wh-question (650)              47.3%
<start> what would       17            wh-question (13)               76.5%
<start> what would you   9             wh-question (9)                100%

Table 3.2: switchboard: Example classifier, with <start> and <end> features
We particularly wanted to represent cue phrases that occurred at the beginning and/or end of utterances. We introduced <start> and <end> tags to each utterance (independent of the calculation of utterance length), to capture position specific information for particular cues. For example, "<start> okay" effectively identifies the occurrence of the word 'okay' as the first word in the utterance.

In the example shown in Table 3.2, the utterance "what would you rather have" has been annotated by a human as a <wh-question>. Whilst there are a number of cue phrases shown in Table 3.2 that predict the correct category, none of the cue phrases without the <start> label in place have a predictivity higher than the 51.9% indicated by the high frequency 1-gram "would", which predicts an incorrect category. However, when we take the positional indicator in conjunction with the 2-gram, giving "<start> what would" (remember, we do not count the positional labels as part of the n-gram), we get a high predictivity of 76.5%. For the 3-gram "<start> what would you", we get an even better
100% predictivity of the correct category. When we add this elaboration to the model described so far, the average cross-validated score increases to 66.4%, and the single highest score improves to 69.5%, both an increase of around an absolute 1% over the previous model.

In the final in-depth example, shown in Table 3.3, we see how the position indicators work in conjunction with the utterance length models to achieve correct classification. In this instance, our example utterance, "would you", has been annotated by a human as a <backchannel-in-question-form>, as in utterance (2) in the following exchange:

(1) B: ‘‘I’d give it up tomorrow if I had to’’
(2) A: ‘‘Would you?’’

The utterance length models we have implemented are short, for utterances with a single word, medium for utterances with 2–4 words, and long for everything else. The positional indicator labels are not included in calculating the length model. To gauge the impact of the length models, we can see that the 2-gram "would you", with no length models in place, has a very low maximal predictivity (of 33.8%) of the category <wh-question>, but that when the length models are introduced, instances of "would you" in medium length utterances have a much higher predictivity (of 50%) of the category <wh-question>. Similarly, instances of "<start> would you" occurring in medium length utterances have a reasonably high level of predictivity of the correct category, <backchannel-in-question-form>, in this instance. In fact, this is the highest predictivity of all n-grams occurring in this length model, something that would have been lost if we did not separate n-grams in this way.
Speaker A: DA="backchannel-in-question-form": would you

Example n-gram            Length Model   Total Count   Predicts da (count)                Predictivity
would                     medium         250           statement-opinion (62)             24.8%
would                     long           4311          statement-non-opinion (2316)       53.7%
would you                 n/a            157           wh-question (53)                   33.8%
would you                 medium         8             wh-question (4)                    50.0%
would you                 long           149           wh-question (49)                   32.9%
<start> would             medium         13            uninterpretable (5)                38.5%
<start> would             long           32            yes-no-question (23)               71.9%
<start> would you         medium         3             backchannel-in-question-form (2)   66.7%
<start> would you         long           18            yes-no-question (17)               94.4%
<start> would you <end>   medium         2             backchannel-in-question-form (2)   100%

Table 3.3: switchboard: Example classifier, with utterance length and <start> and <end> features
Speaker A: DA="sv": probably the biggest thing we're got going right now is the robberies and theft and probably murder –

Speaker B: DA="b": uh-huh

Speaker A: DA="+": – are the two top ones we have.

Figure 3.1: An utterance interrupted by a back-channel
Again, the 4-gram "<start> would you <end>" has the highest predictivity (of 100%), but occurs only 3 times in the corpus in total.
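The positional indicators can be pictured as pseudo-tokens added to each utterance before n-gram extraction, as in the following sketch; the literal strings "<start>" and "<end>" are our own choice of representation, and, as noted above, they are not counted when computing utterance length.

def add_position_tags(tokens):
    """Wrap an utterance with positional pseudo-tokens before n-gram extraction.
    The tags are not counted when computing utterance length."""
    return ["<start>"] + list(tokens) + ["<end>"]

# Example: n-grams of the tagged sequence now include position-specific cues
# such as ("<start>", "what", "would") alongside the plain ("what", "would").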
3.5.3 Overlapping Speech

In addition to the Dialogue Act annotations of the switchboard corpus, there are other markers dealing with various linguistic issues, as outlined in Meteer (1995). A primary example is the label <+>, which indicates the presence of overlapping speech, as in the example shown in Figure 3.1. Our experiments so far have ignored utterances marked with the <+> tag entirely, and consequently a lot of potentially useful word data was being ignored. The <+> tag occurs around 16,000 times in the 202k corpus, or around 8% of total annotations. One approach to utilising this data would be to 'reconnect' the divided utterances, i.e. to append any utterance assigned the <+> tag to the last utterance by the same speaker. This is entirely feasible from a speech reconstruction perspective for automatic online processing, and so we considered it a valid pre-processing step for our algorithm.
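A sketch of this reconnection pre-processing step: any utterance carrying the continuation mark is appended to the most recent utterance by the same speaker. The assumption that the mark is available as the plain string "+" (as displayed in Figure 3.1) is ours.

def reconnect(utterances):
    """utterances: list of (speaker, da, tokens) in dialogue order.
    Appends each '+'-tagged utterance to that speaker's previous utterance."""
    result = []
    last_index = {}   # speaker -> index of their last utterance in result
    for speaker, da, tokens in utterances:
        if da == "+" and speaker in last_index:
            prev_speaker, prev_da, prev_tokens = result[last_index[speaker]]
            result[last_index[speaker]] = (prev_speaker, prev_da, prev_tokens + tokens)
        else:
            result.append((speaker, da, tokens))
            last_index[speaker] = len(result) - 1
    return result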
Data Set                                Cross Validated Score   Single Best Score
most frequent tag (baseline)            33.4%                   33.4%
50k, unclustered                        54.5%                   60.7%
50k, clustered 42 tags                  61.3%                   65.8%
as above, plus utt. length models       65.7%                   68.8%
as above, plus <start>, <end> tags      66.4%                   69.5%
as above, plus interrupted utterances   68.4%                   72.0%

Table 3.4: switchboard Experiments: 50k data set
Reconnecting utterances in this way gave us both our highest cross-validated average score so far, of 68.4%, and our highest single run score so far, of 72.0%. By extracting cue phrases from our 45k utterance training corpus, and applying them directly as a means of da classification to an unseen 5k utterance test corpus, we have shown that we can achieve scores that rival (and in some cases beat) methods that are far more sophisticated, such as HMM language modelling (Stolcke et al., 2000) and Memory Based Learning (Rotaru, 2002). Most notably, both of these methods employ models of utterances in combination with models of the dialogue context, exploiting the progression of da labels in the corpus, where we do not. All of our classification is intra-utterance, using only the presence of our automatically extracted cue phrases in conjunction with simple syntactic information. Finally, both Stolcke et al. (2000) and Rotaru (2002) use the full switchboard corpus of 202k utterances, whereas we achieved our results (summarised in Table 3.4) using a 50k utterance portion of the same corpus. The next experiments for us would be to determine the impact of variable data sizes, using the same
model elaboration we describe.
3.6 Effect of Training Data Size

To experiment with the effect of training data size on our method, we created two further experimental data sets from the switchboard corpus. The first mirrored the size of the experiments performed on the verbmobil corpus (Samuel et al., 1998), described in Section 2.4.2, i.e. a total data size of around 4k utterances. The kind of techniques we have been investigating had been successful when applied to a data set of this fairly small size. The final split used the same data size as that of Stolcke et al. (2000), with 198k utterances for training and 4k for testing. Ideally we would want to perform our experiments over the same split of data used by Stolcke et al. (2000), but this split has not been publicised, which creates problems for direct comparison of results, as others have noted (Rotaru, 2002). The test data represents much less than a tenth of the overall set, so a standard ten-fold cross-validation experiment does not apply. Instead, we randomly selected dialogues out of the overall data to create ten disjoint subsets of around 4k utterances for use as test sets, which were used across the different experimental runs. In each case, the corresponding training set was the overall data minus the testing subset. Overall, we believed that a significant increase in the amount of training data would translate to a much improved tagging accuracy.
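The ten disjoint test sets described above can be produced by sampling whole dialogues until roughly 4k utterances are collected, as in this sketch; the sizes follow the text, and the function is an illustrative assumption rather than the procedure actually used.

import random

def make_disjoint_test_sets(dialogues, n_sets=10, target_size=4000, seed=0):
    """dialogues: list of lists of utterances. Returns up to n_sets disjoint test sets,
    each built from whole dialogues and containing roughly target_size utterances."""
    pool = list(range(len(dialogues)))
    random.Random(seed).shuffle(pool)
    test_sets, current, count = [], [], 0
    for idx in pool:
        if len(test_sets) == n_sets:
            break
        current.append(idx)
        count += len(dialogues[idx])
        if count >= target_size:
            test_sets.append(current)
            current, count = [], 0
    return test_sets   # training data for run i is everything not in test_sets[i]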
3.6.1 4k Data Set

We started with the smaller corpus of 4k utterances. All steps of cue extraction and application of cues for classification are the same as described for the 50k utterance data. With the 226 label corpus our approach yields an average c-v classification accuracy of 51.8%, which compares relatively well against a baseline accuracy of 36.5% from assigning the most frequently occurring tag in this set (which is <statement-non-opinion>). When we clustered this corpus into the 42 labels proposed by Jurafsky et al. (1997), average tagging accuracy rose to 56.5% (an increase of an absolute 4.7%). Adding our later modifications, such as the utterance length and position indicators, we achieve an average score of 61.0%. Finally, reconnecting the utterances of overlapping speech, we achieve an average cross-validated score of 62.7%, and a single highest score of 69.5%. These scores are summarised in Table 3.5. Comparing these scores with those of the 50k utterance data set, we see that we consistently score an absolute 4% to 5% lower with the smaller data set. In particular, with the introduction of the utterance length models, where we see an increase of more than 4 percentage points with the 50k data, the increase is far smaller with the 4k data, presumably because of the high degree of data sparsity this model introduces into the already depleted training data.
3.6.2 202k Data Set

Finally, we applied the steps of our classification model to the large data set, averaging results over 10 runs.
Data Set                                Cross Validated Score   Single Best Score
most frequent tag (baseline)            36.5%                   36.5%
4k, unclustered                         51.8%                   57.1%
4k, clustered 42 tags                   56.5%                   67.3%
as above, plus utt. length models       59.1%                   66.9%
as above, plus <start>, <end> tags      61.0%                   66.1%
as above, plus interrupted utterances   62.7%                   69.4%

Table 3.5: switchboard Experiments: 4k data set
Each run used 198k utterances for training and 4k utterances for testing. With the 226 label corpus, we produced an average tagging accuracy of 55.8%, compared to a baseline of 36%, and to the score of 54.5% achieved using the 50k utterance data. When we used the 42 label corpus, we achieved a score of 60.7%; compare this to the 61.3% we achieve using the 50k data. Adding the utterance length models raises our score to 64.8% (compared to 65.7% with the 50k data), and the position specific indicators raise this further to 65.9% (compared to 66.4% for the 50k data). Finally, using the utterances of interrupted speech, we achieve an average cross-validated score of 69.09% (compared with 68.4% with the 50k data), our highest average c-v score so far. The single highest score using the 202k data is 71.3%, compared with 72.0% using the 50k data. These results are summarised in Table 3.6. Whilst it was clear that the 4k data set performs worse than the 50k utterance data collection, there are only very slight differences between the 50k utterance data set and the 202k utterance data set, and in many cases the 50k data set performs marginally better than the 202k data.
Data Set                                Cross Validated Score   Single Best Score
most frequent tag (baseline)            36%                     36%
202k, unclustered                       55.8%                   58.9%
202k, clustered 42 tags                 60.7%                   65.1%
as above, plus utt. length models       64.8%                   69.7%
as above, plus <start>, <end> tags      65.9%                   71.5%
as above, plus interrupted utterances   69.1%                   71.3%

Table 3.6: switchboard Experiments: 202k data set
However, for all remaining experiments in this thesis, to make use of all the available data, we use the 202k utterance data, with the overlapping speech corrected as per Section 3.5.3.
3.7 Empirical Determination of Thresholds

In each of the experiments described so far, we have made use of two important variables to select the n-grams extracted from our corpus as potential cue phrases: the frequency of occurrence of each n-gram, and the notion of how predictive a particular n-gram is of some Dialogue Act. The threshold values of these variables for all of the experiments described above were set in an arbitrary manner, following our initial investigations with the 50k utterance data set: a minimum frequency count of 2, and a minimum predictivity score of 30%. N-gram cue phrases with scores lower than these thresholds are discarded from the set of total n-grams created from the corpus. Starting with the original switchboard corpus, which has 1.4M words (22k distinct words), we extract around 2.3M distinct n-grams. Once we have applied our thresholds, we are left with around 180k cue phrases in total.
Figure 3.2: Effects of predictivity and frequency on tagging accuracy (tagging accuracy plotted against the predictivity threshold, 0–100%, and the frequency threshold, 1–10)
This approach has a number of inherent problems. First, we do not know if there are other values of our thresholds which will work better for the purposes of classification. The scores we used were chosen following extensive pre-experimental work with a 50k utterance training set. It is possible that these threshold values would no longer be optimal when used with larger data sets. Secondly, and more importantly, these values were chosen for their ability to perform well over the test data. Such an approach undermines our attempts to establish a thorough scientific baseline for potentially portable classification performance. Our subsequent experiments aim to address this last problem directly.
3.7.1 Computing a Range of Thresholds

To address these concerns we sought to develop a method that would determine thresholds automatically, as part of the training process, through the use of a validation set. As a prelude to this, we investigated how the selection of thresholds interacts with the overall performance of our classification approach, by computing performance results over a large range of threshold values, using the 202k utterance version of the switchboard data. For this investigation we computed scores for frequency count thresholds of 1, 2, 3, 4, 5, 6, 8 and 10. For predictivity, all scores from 0 to 100% were used, in gradations of 5%, for each of the possible predictivity cut-offs (so at 100%, clearly, no cue phrases should remain in the classification set, and the score should be the same as assigning the most frequent tag, our default measure, or 36%). Using this approach we should be able to tell if there is an optimal range of scores for our thresholds. We can then determine if we can use a validation set to automatically select values for these thresholds within this optimal range.

Figure 3.2 shows the effect on tagging accuracy of varying these thresholds. A quick interpretation of this graph shows that the classifier performs well with minimum predictivity thresholds of 40% and below, but that performance falls rapidly for thresholds above that value. As can be seen from the graph, although the classifier performs optimally with a frequency threshold between 2 and 3, classification behaviour is tolerant of higher thresholds. Most importantly, there appears to be a plateau, indicated in Figure 3.3, over which classifier performance does not vary significantly.
Figure 3.3: Plateau of good performance for thresholds (tagging accuracy plotted against the predictivity and frequency thresholds)
If we are able to set our threshold values automatically somewhere in this plateau region (from 2 to 10 for frequency, and from 0% to 40% for predictivity) then we should obtain optimal performance from our overall classifier.

As can be seen in Table 3.7, the best cross-validated accuracy scores occur at a frequency count of 3 and a minimum predictivity of either 30% or 35%. This score is higher than that for our manually selected thresholds of frequency 2, predictivity 30%, although the effective gain is only 0.2%. Additionally, the single highest score occurs at 30% predictivity, although again the difference is extremely small, at 0.1%.
Freq   Pred   Cross Validated Score   Single Best Score
2      30     69.5%                   74.9%
3      30     69.7%                   75.0%
3      35     69.7%                   74.9%

Table 3.7: switchboard: Best threshold experiments, 202k data set
For historical reasons, these experiments were performed on a different random selection of data than the experiments reported earlier in this chapter. It is worth noting that the figures quoted here for both cross-validation and single highest score are greater than our best figures so far, and that the single highest score is an absolute 4% higher than the 71% accuracy reported in Stolcke et al. (2000). This reinforces the problem of comparing classification accuracies where only single data splits are used, and raises issues with our own attempt at cross-validation, where we use 10 random subsets of 4k test utterances. It is possible that one of our splits of test data is an extremely fortunate one for our method, and that this results in these higher scores.
3.7.2 Validation Model

We recognise that selecting thresholds manually by performance on the test set may not be a scientifically robust method for this task. To counter this, we split training data into two parts, training and validation. After training is complete, we validate on the second held-out part of the training data, to automatically select the best values for minimum frequency and predictivity counts. This directly addresses the original problem of setting values based
on the test data, the downside being a slight loss of training data. However, our earlier experiments on data size showed minimal difference between experiments using 198k training utterances versus 45k training utterances. Experimentally, we now take the 198k utterance training set, and take 10% (around 20k utterances) to use for validation, a set distinct from the 4k utterances used for testing. We derive n-grams from the 178k training set, then perform exhaustive testing over the validation set, using the range of threshold values described in the previous experiment. These experiments select the best performing combination of frequency and predictivity scores, which are then used when applying the n-grams to the test set. We repeat this 10 times, using a random selection of dialogues for both the validation and testing data sets. In each case, we also tag the test data using our original, arbitrary values of frequency 2 and predictivity 30%, to establish some kind of baseline, important given the variance observed in our cross-validation results shown earlier.

The average frequency count selected by our automatic method is 2.9, with an average minimum predictivity of 32.5%. The cross-validated tagging accuracy when classifying using these automatically selected thresholds is 67.44% (with a high score of 70.31%). This compares favourably to the cross-validated score of 67.49% (high score 70.72%) obtained using our static, manually prescribed thresholds on the same data splits. These results are perhaps not surprising given the previous experiment, and the wide plateau of possible values indicated in Figure 3.3, which seems to demonstrate a broad range of possible values for these thresholds over which tagging accuracy is largely unaffected.
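The validation procedure amounts to a grid search over the two thresholds, scored on the held-out portion of the training data; this sketch assumes the extract_cues and classify helpers from the earlier sketches and uses the threshold grids given in the previous section.

def select_thresholds(train, validation, default_da,
                      freq_grid=(1, 2, 3, 4, 5, 6, 8, 10),
                      pred_grid=tuple(p / 100 for p in range(0, 101, 5))):
    """Pick (t_freq, t_mp) by tagging accuracy on a held-out validation set."""
    best = (-1.0, None)
    for t_freq in freq_grid:
        for t_mp in pred_grid:
            cues = extract_cues(train, t_freq=t_freq, t_mp=t_mp)
            correct = sum(classify(tokens, cues, default_da) == da
                          for tokens, da in validation)
            accuracy = correct / len(validation)
            best = max(best, (accuracy, (t_freq, t_mp)))
    return best[1]   # thresholds to apply when tagging the unseen test set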
Dialogue Act                      Freq     Acc
statement-non-opinion             36%      92%
acknowledge                       19%      97%
statement-opinion                 13%      26%
abandoned/uninterpretable         6%       55%
agree/accept                      5%       31%
appreciation                      2%       86%
yes-no-question                   2%       51%
non-verbal                        2%       100%
yes answers                       1%       0%
conventional-closing              1%       47%
wh-question                       1%       46%
no answers                        1%       5%
response acknowledgement          1%       0%
hedge                             1%       64%
declarative yes-no-question       1%       3%
other                             1%       0%
back-channel in question form     1%       19%
quotation                         0.5%     0%
summarise/reformulate             0.5%     0%
affirmative non-yes answers       0.4%     0%
action-directive                  0.4%     25%
collaborative completion          0.4%     0%
repeat-phrase                     0.3%     0%
open-question                     0.3%     67%
rhetorical-questions              0.2%     0%
hold before answer                0.2%     33%
reject                            0.2%     0%
negative non-no answers           0.1%     14%
signal-non-understanding          0.1%     0%
other answers                     0.1%     0%
conventional-opening              0.1%     50%
or-clause                         0.1%     0%
dispreferred answers              0.1%     0%
3rd-party-talk                    0.1%     0%
offers, options commits           0.1%     0%
self-talk                         0.1%     0%
downplayer                        0.1%     0%
maybe/accept-part                 < 0.1%   0%
tag-question                      < 0.1%   0%
declarative wh-question           < 0.1%   0%
thanking                          < 0.1%   50%
apology                           < 0.1%   50%

Table 3.8: switchboard Dialogue Acts: Overall tagging accuracy
These overall cross-validated scores appear lower than other reported scores. This could be due in part to the loss of training data caused by the creation of the validation set, coupled with the general variance observed when choosing different training and test data. However, it is encouraging to see that we can use the validation data to select threshold values which perform well over the test data.
3.8 Error Analysis

Now that we have established our performance figures over the switchboard data, we want to determine what our upper performance levels could be, given optimal conditions. To that end, we performed an in-depth error analysis, to understand in more detail which mis-classifications occur regularly with our model. Using the average scores from a 10-fold cross-validation with the 202k utterance data set, we computed the accuracy per da label, as seen in Table 3.8. We can see that automatic classification of <statement-non-opinion> labels scores very highly (92% recognition accuracy), but that <statement-opinion> labels score far lower (26%). What could be worrying for automatic dialogue systems is the low recognition rate for important categories such as <yes answers> (0%) and <no answers> (5%). Some of these low scores can be attributed to sparse amounts of training data for specific Dialogue Acts; given the low frequency of occurrence for many das in the switchboard corpus, there is insufficient training data to create discriminative models.
Other errors can be explained in terms of the lexicalisations which realise the utterances, where multiple classes are indicated by identical cue phrases, as we will show in some examples later in this section. Still, these figures do not necessarily help us identify areas on which we should focus in order to improve classification accuracy. To provide further insight, for each occurrence of a Dialogue Act incorrectly tagged, we noted which tag was used in error, creating a confusion matrix. The full matrix is not reproduced here; instead we concentrate on the specific examples which are of most interest. We examined those categories which are the most frequently occurring, and those that are most consistently tagged as another single category. For example, the <statement-opinion> tag occurs 469 times in the chosen test data, of which we correctly tagged 120, or 26%. The distribution of the incorrect tags assigned to <statement-opinion> utterances can be seen in Table 3.9, along with the proportional scores calculated by dividing the number of times an incorrect tag is used for a specific category by the number of times the correct category occurs in the corpus. It seems clear that the significant score here is the number of times that a <statement-opinion> utterance is incorrectly tagged as a <statement-non-opinion>: 67.2% of all <statement-opinion> instances. We determined that this proportional score is one useful discriminator for selecting interesting, regularly confused da tags. We chose to look at only those tags where 50% or more of the proportion of errors to total occurrences are tagged as a single incorrect category. An equally important factor is the number of occurrences of the da tag in question.
Dialogue Act Name            count   % accuracy
statement-opinion            469     25.6%

Incorrectly tagged as:       count   % error
appreciation                 4       0.9%
abandoned/uninterpretable    13      2.8%
yes-no-question              1       0.2%
hedge                        1       0.2%
acknowledge                  2       0.4%
conventional-closing         1       0.2%
statement-non-opinion        315     67.2%
acknowledge-accept           11      2.3%
wh-question                  1       0.2%

Table 3.9: switchboard: Single category error analysis
It makes sense in the first instance to concentrate on those das whose count is significant, i.e. those where correcting errors in classification would have a noticeable effect on classifier performance. We chose to concentrate on those das whose frequency of occurrence in the test data is higher than 40, equivalent to an effective 1% gain in classifier performance if all incorrect instances are subsequently tagged correctly. Interestingly, there are only two categories in our confusion matrix where both criteria of significant count and significant proportional errors are fulfilled. The first of these, as already mentioned, is the case of <statement-opinion> being incorrectly tagged as <statement-non-opinion>. The second is the case of <agree/accept> being tagged as <acknowledge> (there are 228 instances of <agree/accept> in the test data, of which 70 were tagged correctly; of the 158 errors, 140 were tagged as <acknowledge>, 61.4% of all instances). We shall examine both cases in turn.
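The bookkeeping behind this analysis can be sketched as follows: build a confusion matrix from (gold, predicted) pairs and, for each gold label, report the single incorrect label that accounts for the largest share of its occurrences. The function is illustrative and not part of the thesis software.

from collections import Counter, defaultdict

def confusion_analysis(pairs):
    """pairs: list of (gold_da, predicted_da). Returns, per gold label,
    (count, accuracy, most_confused_label, proportional_error)."""
    totals = Counter(gold for gold, _ in pairs)
    matrix = defaultdict(Counter)
    for gold, pred in pairs:
        matrix[gold][pred] += 1
    report = {}
    for gold, row in matrix.items():
        count = totals[gold]
        accuracy = row[gold] / count
        errors = Counter({p: c for p, c in row.items() if p != gold})
        worst, worst_count = (errors.most_common(1)[0] if errors else (None, 0))
        report[gold] = (count, accuracy, worst, worst_count / count)
    return report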
3.8.1 Errors concerning the <statement-opinion> category

For the confusion regarding <statement-opinion>, we first look at the tagging guidelines for the switchboard corpus, laid out in Jurafsky et al. (1998). The authors themselves are unable, in the documentation, to ascertain whether the distinction made between these two categories is fruitful. Having trained separate tri-gram models over the two categories, Stolcke et al. (2000) claim these tri-gram models look somewhat distinct, and yet found that this distinction was very hard for human labellers to make. Jurafsky et al. (1998) report that this lack of distinction between the two categories accounted for a large proportion of their inter-labeller disagreements. In the comprehensive annotation guidelines, Jurafsky et al. (1998) provide 'clue' phrases which may be present in a <statement-opinion> utterance, including: 'I think', 'I believe', 'It seems' and 'It's my opinion that'.

Looking at the n-grams created by our method of predictivity from the entire switchboard corpus, we can identify some potential problems. 'I think' is a common n-gram, occurring more than 6250 times in the corpus as a whole. However, while some 63% of those occurrences are in utterances annotated with the <statement-opinion> tag, this leaves a significant number of incorrect assignments. If we remember the examples we gave earlier (Tables 3.1 to 3.3), we note that whilst a predictivity of 63% may seem high, it is often insufficient for a phrase with such a predictivity score to be chosen from all the candidate cue phrases. 31% of the remaining n-grams occur in <statement-non-opinion> utterances. This finding is replicated with 'it seems' (472 total instances, 307 as <statement-opinion> (65%), 144 as <statement-non-opinion> (31%)).
Speaker A: DA="statement-non-opinion": but I also believe that the earth is a kind of a self-regulating system

Figure 3.4: switchboard: Example utterance incorrectly labelled
In these cases, although the cue phrases are clearly indicative of <statement-opinion>, if some other, more highly predictive n-gram is present in the target utterance, it is possible that the presence of the crucial cue phrases will be overlooked. The situation is worse with respect to 'I believe', which occurs 190 times in total, but where 88 (46%) of those occurrences co-occur with the label <statement-opinion> and 91 (48%) occur in utterances marked as <statement-non-opinion>. The only one of the example phrases given in Jurafsky et al. (1998) to score well is 'it's my opinion that', but as this occurs only once in the entire corpus, it is of limited use. This investigation bears out the argument that labellers had extreme difficulty in differentiating between these two categories.
Hand inspection of the utterances in the corpus indicates that where the example cue phrases were present, an overwhelming majority should have been annotated as <statement-opinion> (as in the example shown in Figure 3.4), but it is possible that the overwhelming presence of <statement-non-opinion> utterances caused annotators to be blind to the distinction (remember that fully one third of the entire corpus is comprised of <statement-non-opinion> utterances). There is perhaps an argument here that if this is a hard category
for human labellers to separate, perhaps there should not be two distinct categories in the first place.
3.8.2 Errors concerning the <agree/accept> category

The second problem category, where <agree/accept> is often tagged as <acknowledge>, is more straightforward to understand. By looking at a sample of the utterances coded in each category we can see that, as might be expected, they have substantially similar lexicalisations: both are often represented by 'yeah', 'yes' and 'right'. According to the labellers' manual (Jurafsky et al., 1998), there are several contextual issues which may help to disambiguate the two. This raises an important point. Since thus far we have concerned ourselves only with intra-utterance features, we have to accept that there are some categories, such as these, that will be impossible for us to disambiguate. That said, we are still achieving a high level of accuracy without any dialogue context, and as our error analysis indicates, this is the only category that we get significantly wrong on this basis. We hope that higher level processes, perhaps powered by machine learning algorithms, may enable us to leverage the context of surrounding utterances in our classification. We speculate that a machine learning approach using context (we investigate such a mechanism later, in Section 3.10) might do better at disambiguating between <agree/accept> and <acknowledge>, but such a mechanism will not help us to perform better for <statement-opinion> and <statement-non-opinion>.
3.8.3 Resolution to Some of the Classification Problems

As we have shown, the categories <statement-opinion> and <statement-non-opinion> are often confused, and there is no hope that a higher level process will be able to separate these categories based on context. This split of the overall statement category is one that Jurafsky et al. (1998) created for the switchboard corpus, as no such distinction is made in the original damsl coding schema (Core and Allen, 1997) from which the set of Dialogue Acts for the switchboard corpus is derived. In order to test the performance of a system where such a distinction was not made, we created a version of the corpus where all instances of both <statement-opinion> and <statement-non-opinion> were merged into a single statement category. The results from the error analysis would seem to indicate that this should lead to an almost 10% improvement in our overall classification accuracy, if we were now to classify correctly 100% of the previous inaccuracies. However, in the analysis shown in Table 3.9, we see that only around 70% of the misclassifications of <statement-opinion> are as <statement-non-opinion>, so we can predict an improvement of around 7%. In earlier sections, we reported a best cross-validated score of 69.7% (with a high score of 75.0%) (Webb et al., 2005b). After repeating the cross-validation experiment on the corpus with merged categories, we achieve a cross-validated score of 76.7%, with a high of 78.6%, in both cases a significant gain, especially for the cross-validated score (a gain of an absolute 7%, as predicted), although perhaps not as high as might have been expected in terms of the single highest score.
Another possible solution to this problem of mislabelled categories is to use the phrases suggested by Jurafsky et al. (1998) to create a distinct set of utterances, all of which should originally have been labelled as <statement-opinion> utterances. We could then correct errors such as the one indicated in Figure 3.4. Alternatively, when we merge the <statement-opinion> and <statement-non-opinion> utterances into a single category, we could propose a separate feature recording whether an utterance contains what we suppose to be a lexical indicator of opinion. This would make annotation easier, in that when clear evidence of opinion was identified, this information could be added to the basic annotation. We do not look at this further in the context of this thesis.
3.9 N-best Classification

The experiments described so far have tried to select the single best-fit candidate Dialogue Act tag for an individual utterance. da tagging, however, can be seen as a first step in some wider natural language understanding process, such as an active online dialogue system, and we have suggested the possibility of later refinement of the selection by some higher level process, such as a dialogue manager or context-aware interpretation module. To aid in this interpretation process, it may be sub-optimal to select one and only one da category for some utterances. Taking the switchboard corpus as an example, if we really are interested in the distinction between <agree/accept> and <acknowledge>, we know a priori that our classification method has limited possibility of making this distinction
correctly. However, we also know that when we encounter an appropriate utterance, such as "okay", it is highly likely to be one of these two choices, as opposed to any other. Taking the <agree/accept> example, if we combine the percentage of the time we correctly classify such an utterance with the percentage of the time we misclassify it as <acknowledge>, we obtain a potential accuracy score of 92.1%. In other words, whatever cue phrases lead us to select one of these two categories, we know a priori that one of the two is correct 92% of the time. This information could be passed to a dialogue manager, which could make a final interpretation based on the progression and content of the entire dialogue thus far, or on other contextually aware features.

We wanted to investigate the possibility of selecting a list of potential Dialogue Acts for a target utterance, as can be seen in earlier work reported in Nagata and Morimoto (1994) and Reithinger and Maier (1995). To this end we built a classifier using the medium-sized switchboard data set, i.e. with a 45k utterance training set and a 5k utterance test set. However, rather than attempting to find the single best match from the classifier, we tagged each utterance with the top 5 possible dialogue acts, as indicated by the classifier on the basis of the predictivity of the n-grams the utterance contained. All possible das suggested by the presence of cue phrases are considered, and the top 5, ordered by the predictivity of the underlying cue phrases, are used. Duplicate das are deleted from the candidate set, so the 2nd ranked da could be represented by the 5th ranked cue phrase, for example, in the case where the cue phrases with the top 4 predictivity scores all indicate the same da.
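A sketch of the n-best variant: collect every da suggested by a cue in the utterance, keep the best supporting predictivity per da, and return the top n distinct candidates. It assumes the cue table of the earlier sketches; names are illustrative.

def classify_n_best(tokens, cues, default_da, n=5, max_n=4):
    """Return up to n distinct candidate das, best-supported first."""
    candidates = {}                       # da -> best predictivity of a supporting cue
    for length in range(1, max_n + 1):
        for i in range(len(tokens) - length + 1):
            cue = tuple(tokens[i:i + length])
            if cue in cues:
                da, mp, _ = cues[cue]
                candidates[da] = max(mp, candidates.get(da, 0.0))
    if not candidates:
        return [default_da]
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:n]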
Figure 3.5: N-best classification accuracy for various n, comparing the thresholded classifier (predictivity 32%, frequency 3) against a classifier with no thresholds
In order to create a baseline measure for this task, we computed scores for utterances in the test set by assigning a default set of tags to every utterance, consisting of the 5 most frequently occurring tags in the entire corpus. We score such a set of labels correct if the correct tag is contained within this disjunctive list. This is a challenging baseline, given the skewed distribution of the Dialogue Acts over the corpus: the correct Dialogue Act occurred in the top 5 default tag assignment 71.09% of the time (in other words, for the 42 category data set, just 5 tags account for the coverage of over 70% of the entire corpus). We applied our 5-best classifier, and automatically assigned 5 tags using
the presence of cue phrases as an indicator, and we calculated (over a 10-fold cross-validation) that 86.7% of the time the correct Dialogue Act was contained in the 5-best output of the classifier. This score defines a theoretically attainable upper limit on the performance that could be reached by some higher level process that selected correctly amongst the 5-best DAs. The choice of 5-best is used here only to demonstrate that we can reduce the ambiguity of choice of da labels to some more feasible, smaller set.

In a later experiment, we compared n-best tagging for a variety of values of n (using thresholds set at a minimum predictivity of 32% and a minimum frequency of 3, as chosen by our previous experiments with a validation model, and used for all subsequent experiments reported in this thesis) against a version of the classifier that used no thresholds whatsoever. The results can be seen in Figure 3.5, and as might be expected, the 1-best classifier with our optimal thresholds significantly outperforms the classifier with no thresholds (note that, due to the 10-fold nature of our experiments with the 202k utterance switchboard data, this is a different data set from the first top-n experimental data). However, as early as n=2, the classifier with no thresholds outperforms the optimised classifier. The performance of the optimised classifier levels off, as the reduction in the set of n-grams caused by the thresholds means that it is unable to suggest more than two or three candidate das, a restriction not felt by the classifier with access to the full range of potential cue phrases.
3.10 Models of da progression

Given that others performing experiments over the switchboard corpus, most notably Stolcke et al. (2000), had shown such a significant impact from using dialogue context, we decided to include a da modelling effort in conjunction with our classification approach, to determine if this resulted in a significant increase in da classification accuracy. As a reminder, when Stolcke et al. (2000) used hmm modelling of purely intra-utterance features, they achieved a classification accuracy over the switchboard corpus of only 54.3%. It was only when they combined this model with a tri-gram model of da progression (looking at the das of the previous three utterances in the dialogue) that performance was boosted to 71% accuracy.

We decided to use our cue phrase classifier as a first classification pass over the data, and then to pass the single best category assigned by our classifier as a first guess to a later process, including as features the tri-gram model of da progression (the model that achieved the best results on this same corpus for Stolcke et al. (2000)), with the true da of the utterance as the target feature. Some of the best results for combining models of the utterances with da sequences, such as those of Samuel et al. (1998) and Martínez-Hinarejos et al. (2008), also used information about subsequent das, but we decided against such an approach. We encoded our feature information in a format suitable for the WEKA (Witten and Frank, 2005) machine learning toolkit, and applied the Naive Bayes classifier. Our approach to intra-utterance classification is Bayesian in nature, so we felt this was the most appropriate selection of classification
mechanism. We conducted a cross-validated run using our simple cue phrase classification method. We took the output of each run of that experiment and converted it into WEKA format, representing the best guess of our classifier, the target da, and the true das of each of the three previous utterances in the dialogue. We also created compound features that represented both bi-gram and tri-gram models of da progression (so included DA−1 DA−2 as a single feature, for example). We ran separate classification experiments on each file, using all three, two and then just one previous da, to determine the impact of having a larger window of da progression.

Although Stolcke et al. (2000) found their best performance using a tri-gram model of da progression, we found that the larger window resulted in less absolute performance gain, and in all cases the gain observed was very small. We averaged the increase over the 10 cross-validation runs, and found that using a bi-gram model of da progression added an absolute 3.6%. The single n-gram model of da progression, i.e. using only the information about the most recent da, added on average an absolute 3.3% to our cue-phrase classifier performance. Using the tri-gram model added on average a 2.8% performance increase. A possible explanation is that the switchboard corpus does not contain much dialogue structure that can be exploited. Remember, the switchboard corpus is not a domain specific corpus, but rather comprises general conversation between two parties who are at the outset unknown to each other. However, in the experiments reported in Stolcke et al. (2000), when they ignore da progression and concentrate on full hmm language modelling
of the utterances, they report a best score of 54.3%, significantly lower than our best intra-utterance score of 74.95%. Stolcke et al. (2000) report their best scores when using this hmm utterance modelling in conjunction with a tri-gram model of da progression, boosting performance to the 71% they report. This gain (16.7% absolute) is huge in comparison to the gains we show, but perhaps this only demonstrates the inefficiency of their initial full-utterance hmm modelling approach. It is also worth remembering the findings of Mast et al. (1996), Ries (1999) and Serafin and Eugenio (2004) discussed earlier, which, in contrast to the results of Stolcke et al. (2000), all indicate that incorporating dialogue history or dialogue context has no significant impact on their classification experiments.

We also explored different machine learning algorithms through the WEKA framework. We applied the rule learning algorithm JRip (an implementation of the RIPPER algorithm (Cohen, 1995)) to a small portion of the switchboard data. We had to use a fraction of the total available switchboard corpus data (around 5k utterances), as memory usage was extremely high. Overall, the sample scores we obtained did not rival the scores we achieved using the Naive Bayes implementation; however, it was instructive to examine the rules learnt automatically by the algorithm, as the most influential rules (that is, those that affected the greatest number of instances in both the training and test data) concerned those categories we identified as most confused in our earlier error analysis. Three rules pertained to utterances which our CuDAC classifier labelled as <statement>, and resulted in changing the da label to <statement-opinion>. First, if the most recent da (DA−1) is a <statement-opinion>, then it is likely that the current utterance is also a <statement-opinion>. Second, if DA−2 is a <statement-opinion> and DA−1 is a back-channel form, then the current utterance is likely to be a <statement-opinion>. Finally, if DA−3 is a <statement-opinion>, and both DA−2 and DA−1 are back-channel forms, then again the current utterance is likely to be a <statement-opinion>. Intuitively, if a recent utterance expresses a <statement-opinion>, and the only intervening utterances are forms of ‘back-channel’, then the current utterance is more likely to be another <statement-opinion>, rather than a <statement>. Finally, an additional rule took utterances given one label by CuDAC and reclassified them to match the label of the most recent da. The rules acquired by JRip can be seen in Figure 3.6.

(CuDAC = ) AND (DA−1 = ) => (Truth = )
(CuDAC = ) AND (DA−1 = ) AND (DA−2 = ) => (Truth = )
(CuDAC = ) AND (DA−1 = ) AND (DA−2 = ) AND (DA−3 = ) => (Truth = )
(CuDAC = ) AND (DA−1 = ) => (Truth = )

Figure 3.6: Example rules learnt by the JRip algorithm over a sample of the switchboard data. Features used are ‘CuDAC’: the classification made by the CuDAC algorithm; ‘Truth’: the actual da of the current utterance; and DA−n : the actual da at position n
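To make the feature construction described at the start of this section concrete, the following is a minimal sketch in Python (not the WEKA/Java pipeline actually used, and with hypothetical feature names) of how the classifier's best guess and the das of up to three previous utterances can be combined into flat and compound (bi-gram and tri-gram) da-progression features.

```python
def progression_features(cudac_guess, prev_das, window=3):
    """Build the kind of feature dictionary described above: the CuDAC
    best guess, the das of up to `window` previous utterances, and
    compound bi-gram / tri-gram features of da progression.
    `prev_das` is ordered most-recent-first; label values are illustrative."""
    feats = {"CuDAC": cudac_guess}
    for i, da in enumerate(prev_das[:window], start=1):
        feats[f"DA-{i}"] = da
    if len(prev_das) >= 2:
        feats["DA-1_DA-2"] = prev_das[0] + "|" + prev_das[1]   # bi-gram compound feature
    if len(prev_das) >= 3:
        feats["DA-1_DA-2_DA-3"] = "|".join(prev_das[:3])       # tri-gram compound feature
    return feats

# Toy usage with invented labels.
print(progression_features("<statement>",
                           ["<back-channel>", "<statement-opinion>", "<statement>"]))
```

In this representation each compound feature simply concatenates the previous da labels, so a downstream learner such as a naive bayes or rule-based classifier can treat each progression pattern as a single nominal value.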
3.11 Summary

In this chapter we have discussed how we can automatically extract cue phrases from all the n-grams of length 1–4 in a corpus. We have shown that these cue phrases appear to be useful, by evaluating them using our own mechanism for da classification, which demonstrates not only the value of the cue phrases, but also that they can be used directly as a classification method. We described a series of modifications to the basic classification algorithm that show the individual contributions of some simple lexical and positional features. We explored in depth the contribution of features such as word n-grams, size of training set, positional cues and utterance length models. We found that using more training data increased classification accuracy when moving from 4k total utterances to 50k utterances, but that improvements in accuracy were not significant when using a 202k utterance data set over a 50k utterance set. Combining all automatic features for simple dialogue act tagging, we obtain an average cross-validated score of 69.7% over the larger, 202k data set for the switchboard corpus. Our highest single run score was 75.0%, also using the 202k data set of the switchboard corpus. Both of these scores were obtained during experiments to automatically determine the best combination of thresholds for this corpus. The single highest score of 75.0% outperforms the best published score for this corpus, the 71% claimed by Stolcke et al. (2000), although for reasons already given it is difficult to compare directly. Importantly, both Stolcke et al. (2000) and Rotaru (2002)
use more complex methods than our simple classification approach, and both make use of dialogue context in their learning algorithms. Our method makes no use of information outside of the utterance under consideration. Using our automatically extracted cue phrases, we have created a simple dialogue act tagger that uses just intra-utterance cues for classification. A Java package containing our intra-utterance classifier, CuDAC, trained over the 202k utterances of the switchboard data, is available for download and use3. We analysed the common errors encountered by our classification approach, and determined that a proportion of these reflect inconsistencies in the underlying human annotation (the confusion of <statement> and <statement-opinion>, for example). Others are due to identical cue phrases indicating two or more possible da categories, where one category is often classified as another because both are often represented by ‘yeah’, ‘yes’ and ‘right’. Our intra-utterance classifier has no mechanism to deal with this confusion. We looked at two possible solutions. First, we used our classifier as a first step in a pipeline approach, where we combine information from the dialogue context, and showed that this has little significant impact on our experiments using the switchboard corpus. Second, we adopted the n-best approach to tagging, where we demonstrated that our naive classifier can present a list of ranked possible alternatives, which could be used by some later, higher-level process, such as a dialogue manager, to make informed choices in the evaluation of utterances.
3 http://www.nick-webb.net/Home/Tools.html
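As an illustration of the n-best idea, the sketch below is a simplified reconstruction, not the released CuDAC implementation: it ignores the length and positional models, and the cue table and predictivity values shown are invented. It ranks candidate da labels by the predictivity of the cue phrases found in an utterance.

```python
def nbest_classify(utterance_tokens, cue_table, n=3, default_da="<statement>"):
    """Score each candidate da by the best predictivity of any cue phrase
    (word n-gram of length 1-4) found in the utterance, and return the
    top-n labels. `cue_table` maps a cue phrase to {da_label: predictivity}."""
    scores = {}
    words = list(utterance_tokens)
    for length in range(1, 5):
        for start in range(len(words) - length + 1):
            ngram = " ".join(words[start:start + length])
            for da, predictivity in cue_table.get(ngram, {}).items():
                scores[da] = max(scores.get(da, 0.0), predictivity)
    if not scores:
        return [(default_da, 0.0)]   # fall back to a default tag when no cue matches
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy cue table (invented values, purely for illustration).
cues = {"would you": {"<yes-no-question>": 0.62},
        "right": {"<back-channel>": 0.48, "<yes-answers>": 0.35}}
print(nbest_classify("would you check that for me".split(), cues))
```

The ranked list returned here is exactly the kind of output a later, higher-level process such as a dialogue manager could consume.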
Interestingly, the cue phrases that we automatically extract from the switchboard corpus, as seen in Figures 3.1 to 3.3, appear to be widely applicable to a range of domains and applications. Indeed, many of the cue phrases appear to be similar to the cue phrases in the literature, such as those outlined in Chapter 2. If this is in fact the case, perhaps the cue phrases we automatically extract from the switchboard corpus can be applied as-is to new corpora, from new domains. It is this avenue we will explore in the next chapter.
4. USING CUE PHRASES FOR CROSS-CORPUS CLASSIFICATION

4.1 Introduction

In this thesis so far we have investigated the automatic extraction of cue phrases from an annotated corpus, and have shown that these cue phrases are reliable indicators of das. We reported that on inspection these phrases appear to be intuitive: in the examples given in the previous chapter we saw that “would you” in short utterances was a reliable indicator of one da, whereas the same phrase in long utterances was a reliable predictor of a different da (see Figure 3.3). If the cue phrases we automatically extract are indeed as intuitive as this, then perhaps we can take these cue phrases, learnt from one large scale corpus of spontaneous conversational speech, and apply them directly to other, unrelated corpora, possibly from new domains or containing dialogues of different conversational styles, such as task-oriented dialogue as opposed to general conversation. If this is so, it would address two areas of interest: first, that cue phrases do indeed play a general role in
understanding dialogue, as indicated in previous literature; second, that these phrases can be exploited for a range of uses in natural language processing applications. In this chapter, we explore the portability of our automatically extracted cue phrases. We take the Dialogue Act classifier created in Chapter 3, and apply the same mechanism to the classification of utterances from two new corpora. We do this in two stages. First, we apply our method to each new corpus to determine an upper bound on the performance our technique can achieve on that data: we automatically extract cue phrases from the new corpus, and apply these cue phrases directly back to the same corpus. This tells us how well our cue-based classification method works on the new data. Then we take the cue phrases extracted from the switchboard data, and apply those switchboard cues as a classifier directly to the new corpora, to see whether they are general enough to classify new, unseen material from these corpora. This both tests the cue phrases we have extracted, and serves as a useful tool for the future annotation of new dialogue data. In order that we might make comparisons, it would be ideal if the new target corpora were labelled using a schema at least similar to that of our training switchboard corpus. We are fortunate therefore to have access to the icsi-mrda corpus, as described in Section 2.4.3, which has been manually annotated with a variant of the switchboard-damsl schema. Additionally, we have our own amities corpus, annotated with a variant of the original damsl scheme. This means that we should be able to annotate both using a classifier trained over the switchboard corpus, and evaluate
accuracy, precision and recall of the resulting classifications.
4.2 Establish Baselines for New Corpora

Before we perform cross-corpus classification experiments, we need to establish baselines of performance for our algorithm over these new data sets. In Chapter 2, we reported on the best published figures for each of these corpora, where possible. Remember that for the switchboard corpus, the best published figure prior to our work was 71%, as reported by Stolcke et al. (2000). For the icsi-mrda corpus, the absolute best results are 89.27%, as reported by Verbree et al. (2006). However, this classification effort included a wide range of orthographic information, such as punctuation and capitalisation, that may not be available in an online dialogue system. We consider the 81.3% reported by Ji and Bilmes (2006) a more realistic target for this effort. Remember also that this figure relates to the collapsed, 5 category da set. When the same method is applied to the full, 55 category set, Ji and Bilmes (2005) report a classification accuracy of 66%. For the amities ge corpus, we reported that there are no directly comparable results, although when dealing with this corpus later, we will report some recently published indicative results. For this classification task it remains difficult to compare the results we obtain directly with other published results (such as those discussed earlier). Instead, we apply 10-fold cross-validation to all our experiments, to compensate for the arbitrary nature of any testing and training selections, and report the average value across the 10 experiments.
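For concreteness, the following is a generic sketch of the 10-fold protocol assumed throughout; the function name and the `train_and_score` callable are placeholders for illustration, not part of our tool set.

```python
import random

def ten_fold_accuracy(utterances, train_and_score, seed=0):
    """Shuffle the corpus, split it into 10 folds, and report the average
    accuracy of `train_and_score(train, test)` across the 10 runs."""
    data = list(utterances)
    random.Random(seed).shuffle(data)
    folds = [data[i::10] for i in range(10)]
    scores = []
    for i in range(10):
        test = folds[i]
        train = [u for j, fold in enumerate(folds) if j != i for u in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```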
i mean you can’t just like print the - the values out in ascii and you know look at them to see if they’re ==
well
==
not unless you had a lot of time .
and == uh and also they’re not - i mean as i understand it you you don’t have a way to optimize the features for the final word error .
right right .

Figure 4.1: Excerpt of dialogue from the icsi-mrda corpus
4.2.1 icsi-mrda Corpus Classification

We described the icsi-mrda corpus of meeting dialogues earlier in Section 2.4.3, and outlined the various cluster groups of Dialogue Acts that are used over the corpus, which result in an original clustered tag set containing 55 tags (seen in Table 4.1), another set considered as main tags comprising 11 tags in total (seen in Table 4.2), and finally a super-clustered set which is the most coarse-grained representation, comprising 5 tags in total (back-channels (b), place-holders (h), questions (q), statements (s), and disruptions (x)). An example excerpt from the icsi-mrda corpus, processed to fit our XML
corpus description used for the switchboard corpus, can be seen in Figure 4.1. In this example, there are four different speakers (C2, C3, C5, C6), and the utterances are labelled with one of the 5 upper level categories. For the initial experiments we directly apply the classification model, complete with parameters derived from the switchboard experiments described earlier, to the new corpus data. We appreciate that these parameters, for the frequency, predictivity and length models of utterances, may well be sub-optimal for this corpus. Remember however that motivation for our work comes in part from the ability to semi-automatically annotate new dialogue resources, for which an optimal definition of parameters will not be possible a priori, and that, as shown earlier, at least for the switchboard corpus there appears to be a high degree of tolerance to a wide range of possible value assignments for these variables (see Figure 3.3, for example). It is interesting then to know how well our method of da classification works ‘out of the box’ with no additional work, and how much of the presumed upper level of classification, as described earlier in Chapter 2, we can achieve using our simple approach. In each case, we perform a 10-fold cross-validation exercise, taking a random nine-tenths of the data for training, and one-tenth for testing. The results of our classification algorithm applied to the icsi-mrda data can be seen in Table 4.3. In this table, we show the level of accuracy achieved by our cue-based classifier, and compare it with the highest reported score on each clustering of the icsi-mrda corpus. The final column of Table 4.3 gives a score showing the ratio of our cue-based classifier score to the best score for each cluster. For the 55 tag set (as shown in Table 4.1), we achieve an average
Dialogue Act                          % of corpus
statement                             31.80%
back-channel                          14.20%
floor-holder                          7.94%
acknowledgement                       6.82%
yes-answers                           5.59%
defending/explanation                 3.55%
expansion                             3.04%
floor-grabber                         2.96%
suggestion                            2.53%
appreciation                          2.10%
interruption                          2.05%
understanding check                   1.98%
declarative question                  1.73%
abandoned                             1.13%
narrative-affirmative answers         1.05%
wh-question                           0.90%
no answers                            0.86%
collaborative-completion              0.80%
no knowledge answers                  0.78%
hold                                  0.75%
command                               0.64%
yes-no-question                       0.64%
dispreferred answers                  0.46%
humorous material                     0.44%
downplayer                            0.37%
commit                                0.35%
narrative-negative answers            0.33%
maybe                                 0.33%
or clause after yes-no-question       0.33%
exclamation                           0.29%
mimic                                 0.28%
apology                               0.24%
task management                       0.24%
signal non-understanding              0.22%
partial-accept                        0.21%
rhetorical question                   0.20%
topic change                          0.20%
repeat                                0.20%
self-talk                             0.19%
3rd-party-talk                        0.16%
rhetorical-question back-channel      0.15%
partial reject                        0.14%
misspeak self-correction              0.14%
reformulation                         0.13%
“follow me”                           0.12%
or question                           0.12%
thanks                                0.11%
tag question                          0.08%
open-ended question                   0.07%
misspeak correction                   0.05%
sympathy                              0.01%
welcome                               < 0.01%

Table 4.1: icsi-mrda dialogue acts, taken from Shriberg et al. (2004): not shown are the labels indecipherable, non-speech and nonlabeled
classification accuracy of 51.76% with the basic n-gram approach. For the 11 category data (the 11 categories can be seen in Table 4.2) we get an immediate improvement, to an average score of 76.12%. Finally, for the 5 category set, we record scores of 75.10% using the basic n-gram approach. This score is lower than that for the 11 category cluster, which does not include disruption forms, whereas the 5 category cluster includes these forms together as a single label, x. In exploratory work using the icsi-mrda corpus, Zimmermann et al. (2006) reimplemented our approach as described in Chapter 3, and reported a score of 73.4% over the 5 category clustering of the data, which is in line with the score we report over the same data clustering. Comparing these results to those discussed earlier, we see that the work of Ji and Bilmes (2005) achieves a high score of 66% over the 55 category set. By comparison, our score using only the n-gram approach (51.76%) represents 78% of this upper bound of current performance. Using the highly clustered 5 da tag set, we see results range between 77.98% and 89.27%, as algorithms are increasingly tailored to the problem. We choose to compare our performance to the 81.3% achieved by Ji and Bilmes (2006), which is most comparable to our approach in terms of features and utility. Our method achieves a creditable 75.10%, which represents 92.37% of this upper bound performance, although only 88% of the best performance reported on this da cluster by Verbree et al. (2006), who use orthographic information such as ‘full stops’ and ‘question marks’. We should note that we are not aware at this time of others reporting
Dialogue Act                        % of corpus
statement                           68.00%
back-channel                        13.37%
floor-holder                        7.38%
yes-no-question                     4.91%
floor-grabber                       2.74%
wh-question                         0.90%
hedge                               0.70%
rhetorical question                 0.36%
or clause after yes-no-question     0.35%
or question                         0.20%
open-ended question                 0.17%

Table 4.2: icsi-mrda dialogue acts, clustered into 11 general tags, not including any disruption classes
classification scores over the data clustered into 11 categories. The 11 categories (shown in Table 4.2) are identified by the authors of the corpus, Shriberg et al. (2004), as the major da categories. Instead, classification has concentrated on the 5 category clustering, possibly in an effort to boost reported classification scores (note, in Table 4.3, the difference in performance reported by Ji and Bilmes (2005) on the 55 category corpus, versus the score for the same mechanism reported by Ji and Bilmes (2006) using the 5 category corpus). We believe that the 5 category cluster of the icsi-mrda corpus is far too small a grouping to be of practical use, and that the 55 category scheme of the icsi-mrda corpus, in common with the 42 category cluster of the switchboard corpus, represents the most useful set of categories for building sophisticated artificial conversational agents, but we do not make a case for this viewpoint here. It seems reasonable to conclude at this stage that we achieve solid, if not state-of-the-art, performance over this corpus. In contrast with our experiments on the switchboard corpus, it is possible that context could be a
Method                   Categories    Sequence Model    Accuracy    % of Maximum
Most Frequent Tag        55            No                31.80%      n/a
Ji and Bilmes (2005)     55            Yes               66%         100%
CuDAC                    55            No                51.76%      78.42%
CuDAC                    55            Yes               55.93%      84.74%
CuDAC                    11            No                76.12%      n/a
CuDAC                    11            Yes               77.67%      n/a
Ji and Bilmes (2006)     5             Yes               81.3%       100%
CuDAC                    5             No                75.10%      92.37%
CuDAC                    5             Yes               78.29%      96.3%
Verbree et al. (2006)    5             Yes               89.27%      n/a

Table 4.3: icsi-mrda Classification Results
much more useful feature in the dialogues of the icsi-mrda corpus, possibly because meetings contain more structure than the open-ended conversations of the switchboard corpus, and methods that can exploit this structure (better than our simple n-gram representation) benefit accordingly. This remains an open research question. In order to test this hypothesis, we took the naive bayes model used in the da sequence experiments in the previous chapter, and applied this model (again using a window of up to 3 previous das) to the three variants of the icsi-mrda corpus. Our best performing models achieve 55.93% on the 55 category cluster (an increase of 4.17%), 77.67% on the 11 category cluster (an increase of 1.55%), and 78.29% on the 5 category cluster (an increase of 3.19%). We can see that our simple model of da progression consistently adds around an absolute 1.5% to 4% to our classification accuracy, which takes our results close to the state of the art. Our combination of
information encoded in utterances (as represented by our cue phrases) and the da sequence model (captured using the naive bayes classifier) is simplistic, yet we believe the results compare well with other approaches.
4.2.2 amities Corpus Classification

Our next step is to repeat the experiments described for the icsi-mrda corpus, using the amities ge corpus. We will extract cue phrases from a portion of the amities ge corpus, and apply them directly as a classification device to an unseen, test portion of the same corpus. Again, we conduct a 10-fold cross-validation experiment over this corpus, giving us a total of 20k utterances for training and around 2.5k utterances for testing for each fold. One notable difference between this new corpus and those we have operated over so far is the complexity of the da tagging schema. Here, instead of a single atomic label that represents the multidimensional damsl tag set (as we see in both the switchboard and the icsi-mrda corpora), we see a hierarchical damsl mark-up (as seen in Figure 2.3), which when applied to the amities ge corpus results in 259 distinct combinations of da labels, taking into account all the assignments made to each of the hierarchical dimensions. When using the amities ge corpus for both training and testing, we require all parts of each multi-dimensional tag to be labelled correctly for a classification to count as correct. That said, there are still issues with the amities ge data that we need to resolve. There are a number of amities ge tags that we know a priori we have little or no chance of recognising.
Dialogue Act / % of corpus: 15.22%, 13.34%, 12.49%, 9.43%, 5.49%, 5.10%, 4.86%, 4.49%, 3.43%, 2.74%, 2.25%, 1.99%, 1.95%, 1.72%, 1.63%, 1.47%, 0.96%, 0.83%, 0.80%, 0.74%, 0.73%, 0.57%, 0.56%, 0.55%, 0.52%

Table 4.4: amities ge dialogue acts, with frequency of occurrence above 0.5%, clustered with the tag removed
For example, the amities ge corpus is meticulously annotated to indicate that certain utterances are perceived as answers to prior utterances. Our approach to da tagging is purely intra-utterance, taking no account of the wider discourse structure, and so will not recognise these distinctions. As a consequence, we will omit all reference to these labels. Similarly, we omit the distinction between “Task” and “Communication-management”, part of the ‘information level’. Therefore, we ignore the ‘information level’ hierarchy completely. The distribution of tags when the information level class is ignored can be seen in Table 4.4. When the information level class is considered, there are 259 distinct da labels in the amities ge corpus. When we ignore the information level category, the remaining labels cluster into 163 distinct da labels.
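A minimal sketch of this flattening step is given below; the dimension names are hypothetical stand-ins for the xdml fields, and dropping the information-level dimension before flattening mirrors the clustering that reduces the 259 distinct labels to 163.

```python
def atomic_label(dimension_values, ignore=()):
    """Collapse a hierarchical annotation (one value per annotation
    dimension, None where a dimension is unmarked) into a single atomic
    label, optionally ignoring some dimensions (e.g. the information level)."""
    parts = [f"{dim}={val}"
             for dim, val in sorted(dimension_values.items())
             if val is not None and dim not in ignore]
    return "+".join(parts) if parts else "<unlabelled>"

# Hypothetical dimension names and values, for illustration only.
utt = {"forward-function": "statement", "info-level": "task", "backward-function": None}
print(atomic_label(utt))                         # forward-function=statement+info-level=task
print(atomic_label(utt, ignore={"info-level"}))  # forward-function=statement
```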
Looking at Table 4.4, we can see that the most frequent tag in this corpus occurs 15% of the time, which represents a baseline score for this classification task (if we assigned the most frequently occurring label to every utterance in the dialogue). Using our CuDAC classifier method we score 65.9% accuracy. If we add the same naive bayes classifier as in previous experiments, incorporating a tri-gram model of da progression trained over amities ge data, we increase this to a score of 71.47% (an increase of an absolute 5.57%), a larger increase than we observed with the switchboard or icsi-mrda corpora. The results from our classification experiments using the amities ge corpus are summarised in Table 4.5. Hardy et al. (2005) show partial results for da classification on this data, looking only at 5 major classes, where they achieve a classification score of 86%. However, this includes only 5 da labels, and only considers utterances shorter than 5 words. Recently, some classification work has been performed by Rosset et al. (2008) using the amities ge corpus. Unlike our approach, where we treat the damsl hierarchical annotation as an individual atomic label (i.e. try to match each part of the hierarchy for an individual utterance), Rosset et al. (2008) take the approach that each utterance has to be marked with exactly one tag for each class of the hierarchy. They treat the data as though there are 8 independent labels per utterance (there are 5 hierarchical levels in the original damsl scheme, but Rosset et al. (2008) split one of these (forward looking functions) into two sub-categories, and another (backward looking functions) into a further 3
Method                  Note                              Sequence Model    Accuracy
Most Frequent Tag       all 259 labels                    No                15%
Hardy et al. (2005)     limited labels                    Yes               86%
CuDAC                   strict scoring, all 259 labels    No                65.9%
CuDAC                   strict scoring, all 259 labels    Yes               71.47%
Rosset et al. (2008)    8 categories per utterance        Yes               85.7%

Table 4.5: amities ge Classification Results
subcategories). This means that for an utterance marked with labels in only two of these dimensions, for example, the classifier of Rosset et al. (2008) must make 8 assignments, 2 of which carry those labels, and 6 of which contain the label “NULL”. Taking a test corpus of 1711 utterances, this means there are a total of 13,688 da assignments to make, and Rosset et al. (2008) report their classification accuracy over these individually, treating the assignments as independent. Of those 13,688 assignments, Rosset et al. (2008) claim a classification accuracy of 85.7%, but it is not clear how we should interpret this score. For example, returning to our example utterance, if the classifier of Rosset et al. (2008) were to label it with all “NULL” categories (remember, of the 8 categories, 6 for this utterance should be “NULL”), they would score this as 75% accurate for this label, without making any assignments at all.
4.2.3 Number of Cues

Given that we have now performed our automatic cue extraction exercise over three different corpora, it is interesting to compare the number of cue phrases extracted from each corpus. Starting with the original switchboard corpus, which has 1.4M words (22k distinct words), we extract around 2.3M distinct n-grams. These include the utterance start and end tags we append to each utterance. Once we have applied our thresholds, which remove phrases with a frequency of 3 or less, and those with a predictivity of less than 32.5%, we are left with around 140k cue phrases in total. For the icsi-mrda corpus, we have a total of around 795k words (14k distinct) from which we extract 55k cue phrases after filtering with the same thresholds. The amities ge corpus of 228k total words has only 8k distinct words. After thresholds are applied, the total number of cue phrases for the amities ge corpus is around 17k phrases. For comparison, we also add the work of Samuel et al. (1999), who perform a similar cue extraction experiment on the verbmobil corpus, which comprises nearly 25k words, resulting in nearly 4k cue phrases. In Figure 4.2 we can see that as the corpus size grows (as measured by the total number of words in the corpus), the number of cue phrases extracted after filtering grows roughly linearly. If we look in a little more detail at the number of cue phrases extracted for the switchboard corpus, we see that of the 140k cue phrases we extract using our thresholds, around 130k of those (92.9%) represent one of the two statement labels, understandable when considering that these two labels account for nearly 50% of annotations over the switchboard corpus,
and that utterances labelled as statements are significantly longer than utterances with other da labels. What this means is that the rest of the labels are being classified by only 7% of the total cue phrases. If we look at a single label which covers 19% of the entire corpus, we obtain a precision of 0.76 using just 10 cue phrases. Another label (precision 0.85) has 91 cue phrases in total, and a third (precision 0.85) has 586 cue phrases. Taking these indicative scores, it may be the case that some da labels are predicted by a few specific cue phrases. Further, perhaps these cue phrases, as indicated in the literature, are general predictors of the same da labels, irrespective of the domain of discussion found in an individual dialogue corpus. If so, can we use cue phrases extracted from one corpus of dialogues directly as a classification device applied to a different corpus?
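The sketch below shows one way the two thresholds quoted above can be applied; it is a simplified reconstruction (counting each n-gram at most once per utterance, which may differ from the exact counting used in Chapter 3) rather than the actual extraction code.

```python
from collections import defaultdict

def extract_cues(labelled_utterances, min_freq=4, min_predictivity=0.325):
    """Count every word n-gram of length 1-4 against the da label of the
    utterance it occurs in, then keep (ngram, da) pairs whose frequency and
    predictivity (count(ngram, da) / count(ngram)) clear the thresholds
    quoted above (frequency > 3, predictivity >= 32.5%)."""
    ngram_da_counts = defaultdict(lambda: defaultdict(int))
    ngram_counts = defaultdict(int)
    for words, da in labelled_utterances:
        seen = set()
        for length in range(1, 5):
            for start in range(len(words) - length + 1):
                seen.add(" ".join(words[start:start + length]))
        for ngram in seen:
            ngram_counts[ngram] += 1
            ngram_da_counts[ngram][da] += 1
    cues = {}
    for ngram, per_da in ngram_da_counts.items():
        if ngram_counts[ngram] < min_freq:
            continue
        for da, count in per_da.items():
            predictivity = count / ngram_counts[ngram]
            if predictivity >= min_predictivity:
                cues.setdefault(ngram, {})[da] = predictivity
    return cues
```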
4.2.4 Review of Baselines

We have applied both the cue extraction and da label classification mechanism outlined in Chapter 3 to two additional corpora. Whereas with the switchboard corpus experiments we were able to show state-of-the-art classification performance using this direct method of da classification, that has not been the case with the icsi-mrda corpus. We are able to show competitive results using our approach, and to demonstrate that we can extract useful cue phrases that are reliable predictors of da labels. For the amities ge corpus, we show the first results for classification over the corpus annotated with 259 distinct labels.
Figure 4.2: Linear increase in number of cues, as number of total words in corpus increases
4.3 Cross-Corpus Classification

The central purpose of this chapter is to examine the use of automatically extracted cues to tag data other than the corpus from which they are derived. The hypothesis we wish to test is that these cues are sufficiently general to work as a classification device on a corpus from a different domain, even one containing interactions of a different conversational style. The ability to extract cues from one or more corpora, and to use these cues to directly classify new data, is an interesting challenge. From a theoretical perspective, it would confirm linguistic observations that indicate the prominence of such word cues in language (Hirschberg and Litman, 1993). From a practical standpoint, these cue phrases could be the basis of a reliable da classifier that would operate over new data, both in an online, live dialogue system, and as a method of annotating new dialogue collections.
4.3.1 Cue Overlap

We will begin with the switchboard and icsi-mrda corpora, as these two are most similar in terms of overall size, and in terms of annotation schema, allowing a more direct comparison of results. As we saw in the previous section, from the switchboard corpus we extract a total of 140k cue phrases and from the icsi-mrda corpus we extract 55k cue phrases. Taking no information into account other than the raw overlap of these cue phrases, there are around 38k cue phrases that overlap between these two corpora. These 38k cue phrases represent 27% of the total switchboard cue phrases, but 60% of icsi-mrda cue phrases. From this we might expect that
Dialogue Act                         Number of Cues
appreciation                         199
wh-question                          150
yes-no-question                      102
yes-answers                          94
back-channel                         37
or clause after yes-no-question      27
no answers                           26
signal non-understanding             19
apology                              17
thanks                               11
rhetorical question back-channel     10
acknowledgement                      8
maybe                                7
suggestion                           6
no knowledge answers                 5
rhetorical questions                 4
narrative affirmative answers        1
open ended question                  1

Table 4.6: switchboard and icsi-mrda overlapping cue phrases
classification of the icsi-mrda data using cue phrases generated from the switchboard corpus will perform at a reasonable level of accuracy. Earlier we reported that of the 140k switchboard cue phrases, nearly 93% represented one of the two statement categories. For the icsi-mrda data, of the 55k cue phrases generated from the corpus, 87% represent the statement category. If we now look at the intersection of cue phrases between the icsi-mrda corpus and the switchboard corpus, only this time requiring a match between both the cue phrase and the label it predicts, we are left with 34k cue phrases, a drop of 4k phrases. This means
that there are 34k phrases which match between the two corpora in terms of which label they are predicting. Of these 34k cue phrases, 33k are predictors of the statement labels for both corpora. If we then also consider our length model criteria, so that for cue phrases to be considered as overlapping they have to predict the same da label and belong to the same length model, the total number of overlapping cues is 32,572. Of those, 31,848 cue phrases predict a statement category. This means that the remaining 719 shared cues are spread across the remaining categories. We show the number of overlapping cue phrases per category in Table 4.6. When looking at the switchboard corpus and the amities ge corpus, we already expect there to be a smaller degree of overlap than between the switchboard corpus and the icsi-mrda corpus, given that the amities ge corpus is significantly smaller, and contains dialogues that are much more task-oriented. This hypothesis is confirmed when we look at the extracted cues. From the amities ge data, we extract around 17,000 cues, whereas from switchboard data we extract nearly ten times as many cue phrases. The size of the overlap between these two data sets is only a little over 3,000 cue phrases. This represents around 20% of the amities ge cue phrase set, but a little more than 2.5% of the switchboard cue phrase set. Given these figures, we might anticipate that a classifier using cue phrases extracted from the switchboard data might perform acceptably when applied to amities ge data, but that the reverse will not be true: there is too little data (in terms of extent of vocabulary) in the amities ge corpus to provide sufficient coverage of the switchboard corpus.
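The three levels of overlap reported above (raw phrase overlap, overlap requiring the same predicted label, and overlap additionally requiring the same length model) can be computed as simple set intersections, as in this sketch; the (phrase, label, length model) keying of the cue sets is an assumption about representation, not the thesis's data format.

```python
def cue_overlap(cues_a, cues_b):
    """Compare two cue sets at the three levels of strictness used above.
    Each cue set maps (phrase, da_label, length_model) -> predictivity."""
    phrases_a = {phrase for phrase, _, _ in cues_a}
    phrases_b = {phrase for phrase, _, _ in cues_b}
    raw = phrases_a & phrases_b                                   # phrase only
    with_label = ({(p, d) for p, d, _ in cues_a}
                  & {(p, d) for p, d, _ in cues_b})               # phrase + predicted da
    with_length = set(cues_a) & set(cues_b)                       # phrase + da + length model
    return len(raw), len(with_label), len(with_length)
```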
4.3.2 icsi-mrda & switchboard Cross-Corpus Experiments

The hardest part of any cross-corpus classification effort will always be the mismatch between the annotation sets over those corpora. It makes sense for us to start with two corpora with similar, if not quite identical, annotation schema. Both the switchboard corpus and the icsi-mrda corpus use a flattened, single-code variant of the hierarchical damsl annotation scheme. Whilst the switchboard corpus is of open, non-task specific conversation, the underlying domain of the dialogues in the icsi-mrda corpus is that of a multi-party meeting room, and this seemingly slight change in application required the introduction of 11 new tags specifically for this scenario, including tags such as <floor-grabber>, introduced to indicate when an utterance was used to take control of the meeting. There are three further labels in the icsi-mrda schema that are composites of switchboard labels. The positive answers labels and the negative answers labels from the switchboard schema have each been combined into a single group. More importantly, in the switchboard corpus, the distinction is made between <statement> and <statement-opinion>. This distinction is not preserved in the icsi-mrda corpus, there being a single statement category. To deal with these labels, we created superclass clusters (YES ANSWERS, NO ANSWERS and STATEMENTS) at run time, and during the scoring procedure. Following the experiments reported in Section 3.8, we know that when we apply such label clustering to the switchboard data, we can expect an improvement in classification accuracy over the baseline.
Training corpus    Training utterances    Testing corpus    Test utterances    Common tag (%)    Accuracy score
switchboard        198,000                switchboard       3,000              49%               76.7%
icsi-mrda          160,000                icsi-mrda         20,000             31.8%             51.8%
switchboard        198,000                icsi-mrda         180,000            31.8%             46.9%
icsi-mrda          180,000                switchboard       198,000            49%               70.8%

Table 4.7: switchboard & icsi-mrda Cross-Corpus Experimental Results
We first establish the baseline classification performance. The absolute baseline for each corpus is represented by the most frequent tag in each, and that would be the statement label for both, which constitutes 49% for the switchboard corpus, when the statement and statement-opinion categories are merged, and 32% for the icsi-mrda corpus. In Chapter 3, we show that our simple, cue-based classifier achieves a cross-validated accuracy score of 76.7% over the switchboard corpus, where the statement categories have been conflated. Earlier in this chapter, in Section 4.2.1, we show that the same mechanism achieves 51.76% over the icsi-mrda corpus. For this comparison, and all further experiments, we exclude the additional 3–5 percentage point increase we can gain by including the da sequence model, as we are interested in how well the intra-utterance classifier performs alone. We now attempt cross-domain classification: first, we train our classifier using switchboard data, and test using icsi-mrda data. Then we apply the classification in reverse, training on icsi-mrda data and testing on the switchboard corpus, using all available data in both cases.
When using cue phrases derived from the switchboard corpus to classify utterances from the icsi-mrda corpus, we achieve an overall tagging accuracy of 46.9%. Comparing this to the 51.8% we achieve when training and testing on icsi-mrda data, we can see that we achieve 90.5% of our achievable upper baseline using the out-of-domain cue phrases. Running the same experiment in reverse, training on the icsi-mrda corpus and testing using utterances from the switchboard corpus, we achieve 70.8%, compared to 76.7% when training and testing occur with the same corpus. That means that we obtain scores that are 92.3% of our achievable upper baseline in this experiment. These are extremely encouraging results. Remember however that when dealing with the icsi-mrda corpus data comprising 55 categories, the best published score is the 66% demonstrated by Ji and Bilmes (2005), of which we achieve only 71.1% using our out-of-domain cue phrases.
4.3.3 switchboard & amities ge Cross-Corpus Experiments

We have established that we can extract cue phrases successfully from one corpus, and use these same cues to classify data from another corpus, by experimenting with the switchboard and icsi-mrda corpora. Now we perform the same set of experiments, substituting the amities ge corpus for the icsi-mrda corpus. This is a more challenging task than before, for several reasons. The underlying task structure in the amities ge corpus is far more regularised than that of the switchboard corpus or the icsi-mrda
corpus, and has a smaller vocabulary than either. This means that as well as being smaller, there is a smaller degree of variety in the data. In addition, the amities ge corpus has a segmentation of utterances that is less fine-grained. This results in longer utterances that may well have been split into two or more individual utterances under the segmentation schemes employed by the other corpora. As a consequence, the cue phrases learned from the switchboard corpus can often be in conflict. The first problem we must deal with is a mismatch between the atomic category labels used to annotate the switchboard corpus, and the hierarchical labels of the amities ge corpus. One way to deal with this hierarchy is in the same manner as Rosset et al. (2008), who take the hierarchical classes of the xdml annotation, and treat them as independent of each other, but requiring an annotation in each (resulting in eight independent annotations per utterance in their case). However, in most cases levels are not independent, and can be inferred from one another. Also, there is the question of mapping between multiple possible categories. For example, in the amities ge annotation, statements of fact are all annotated with a single class. This class contains utterances that are in some way equivalent to the two statement classes of the switchboard annotation, <statement> and <statement-opinion>. If we are training and testing using two different category sets, we need some consistent mechanism of translating between them, in order to evaluate our performance.
amities ge da Mapping

Cross-corpus classification would be simplified if both corpora were annotated with identical da taxonomies. In fact, the switchboard corpus and the amities ge corpus are both annotated with variants of the damsl da annotation scheme. In the switchboard corpus, the hierarchical nature of the damsl schema has been flattened and clustered, to produce 42 atomic classes. In the amities ge corpus, the hierarchical dialogue act schema has been left largely untouched from the damsl original, resulting in the large number of potential unique annotations, as described earlier in Section 4.2.2. In order then to be able to compare automatic classification performance across the two corpora, a translation is required between the 42-class schema of switchboard and the damsl-like xdml schema of the amities ge corpus. The mechanism by which we achieve this, using a set of intermediate superclass labels, is detailed in Appendix A.
Scoring Criteria

In all of our cross-corpus classification experiments, we are training on one set of data (Corpus A), and so learning relationships between cue phrases and labels of that corpus (Labels A), and then labelling a new target corpus (Corpus B), attempting to assign the correct labels for Corpus B from the set of Labels B. In the previous set of experiments using the switchboard corpus and the icsi-mrda corpus, this process was simplified, as essentially Labels A and Labels B were the same, and we could directly assign the correct tags to a test utterance in a corpus.
In the situation that exists with the switchboard and amities ge corpora, we have to construct a more complex scoring mechanism for the tagging process. When training over superclass-clustered amities ge data (and so assigning superclass labels to switchboard test data), the process is relatively straightforward. If we assign the correct superclass label to the switchboard data, and there is a one-to-one relationship between the superclass and the relevant switchboard category, then we have effectively assigned the correct switchboard label to the target utterance, and we score this as correct. We define this method of scoring as a strict measure. When training over switchboard data and using amities ge data as the target, the process is slightly more complex. We can score a test item as correct under the strict measure provided that (a) the label assigned by the switchboard-trained classifier and the amities ge target label both map onto the same superclass category as before, and (b) the relevant rule mapping amities ge labels onto this superclass specifies only a single amities ge feature representation on its left hand side. Note that a number of different amities ge labels will match this left hand side representation (i.e. labels which include this feature-value pair, but which may also have values for other features), suggesting a one-to-many relation from the superclass to amities ge target labels, which might seem to contradict the intuitive requirement of ’strict’ correctness. We take the view, however, that the left hand side of these rules captures the salient aspect of the relevant amities ge labels, and so a classifier trained over the switchboard data whose output can be unambiguously mapped to this left hand side representation should
be treated as being correct in the strict sense. This strict mapping measure gives us a sense of the upper level of performance achievable by our mechanism of training over one corpus and applying the resulting classifier to a new, different corpus. We also incorporate a less strict method of scoring: our lenient measure requires only a match at the superclass level.
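A compact sketch of the two scoring regimes is given below; the mapping table and the notion of an ‘unambiguous’ superclass rule are stand-ins for the Appendix A superclass mapping, and the label names in the example are illustrative.

```python
def score(assigned_label, target_label, to_superclass, unambiguous_superclasses):
    """Return (strict_correct, lenient_correct) for one test utterance.
    `to_superclass` maps any corpus-specific label to its intermediate
    superclass; `unambiguous_superclasses` is the set of superclasses whose
    mapping rule licenses strict credit (one-to-one, or a single salient
    left-hand-side representation, as described above)."""
    same_superclass = to_superclass[assigned_label] == to_superclass[target_label]
    lenient = same_superclass
    strict = same_superclass and to_superclass[target_label] in unambiguous_superclasses
    return strict, lenient

# Hypothetical mapping fragment, for illustration only (the real mapping
# is the Appendix A superclass table).
superclass = {"<statement>": "STATEMENT", "<statement-opinion>": "STATEMENT",
              "<yes-answers>": "YES_ANSWERS"}
print(score("<statement>", "<statement-opinion>", superclass, {"YES_ANSWERS"}))
# -> (False, True): the superclasses match, but the rule is not one-to-one
```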
Cross-Corpus da Classification

As with the single-domain amities ge experiments earlier, we first show our baseline tagging performance, representing the most frequent tag in each corpus. This baseline changes depending on whether we are using a strict or lenient approach to classification. For the switchboard corpus, the strict baseline, as in most of our experiments, is 36%, corresponding to the frequency of the <statement> label. For the lenient score, we are combining the <statement> and <statement-opinion> labels, and so the baseline increases to 49%. For the amities ge corpus, the strict baseline is 15% for the most frequent label, and increases to 20% when we cluster the related labels together into a superclass as per Figure A.1. Next we determine the baseline performance of our CuDAC classifier. The score we report in Section 4.2.2 (65.9%) for training and testing using amities ge data represents the strict measure we describe above (i.e. the unclustered amities ge corpus). We need to re-calculate this score, taking into
account the da clustering we report in Section 4.3.3, using 21.5k amities ge utterances for training, and 2.5k utterances for testing. Using the lenient scoring mechanism, we score 70.8% classification accuracy. We then apply cross-domain classification: first, we train our classifier using switchboard data, and test using amities ge data. Then we apply the classification in reverse, training on amities ge data and testing on the switchboard corpus, using all available data in both cases. Finally, we note that in the cross-corpus classification experiments involving the switchboard and icsi-mrda corpora, the amount of data available for training is somewhat similar. That is not the case here, as the amities ge corpus is significantly smaller overall. Despite showing, in Chapter 3, that corpus size is not a significant factor when dealing with data of an order of magnitude similar to that contained in the amities ge corpus, we wanted in this last experiment to study the effect of limiting the training data on cross-domain classification, by reducing the switchboard data to match the size of the amities ge training set. That is, we use only 24,000 utterances of the switchboard corpus as training data to extract cues, which are then applied both to the switchboard corpus itself (for reference) and to the amities ge corpus. For all experiments where amities ge data is used as a test corpus, both strict and lenient scoring will be used. Strict scoring sets a lower bound for this exercise, and should be greater than chance, which corresponds to the distribution of the most frequent da tag in each corpus. The results of our experiments are summarised in Table 4.8. For the scores where the switchboard corpus is used for both training and testing, we do not report values for lenient scoring. These scores are possibly of interest,
Training corpus    Training utterances    Testing corpus    Test utterances    Lenient score    Strict score
switchboard        198,000                switchboard       3,000              –                69%
amities ge         21,500                 amities ge        2,500              70.8%            65.9%
switchboard        198,000                amities ge        30,000             55.7%            39.8%
amities ge         24,000                 switchboard       198,000            48.3%            40%
switchboard        21,500                 amities ge        2,500              53.2%            38%
switchboard        21,500                 switchboard       2,500              –                67.4%

Table 4.8: switchboard & amities ge Cross-Corpus Experimental Results
but are not the focus of this experiment. Initially, we train using all of the switchboard data, and test over the complete amities ge corpus, and we record a strict evaluation score of 39.8% tagging accuracy, compared to a lower bound (the most frequent tag) of 15%, and an upper bound of 65.9%. This means we score 60.4% of the upper bound using out-of-domain cues. Using the lenient score, we achieve around 55.7% accuracy, compared to a lower bound of 20% and an upper bound of 70.8%. For the lenient scoring, we achieve 78.7% of the upper bound of performance we have shown on this corpus. Applying the same experiment in reverse, we train with the amities ge corpus, and test over the switchboard data. Analysing the cue overlap statistics discussed in Section 4.3.1, we already expect that this experiment will not achieve high scores. Using the strict evaluation metric, we achieve a score of 40.0%, which compares to a strict baseline of 36%. When we apply the lenient metric, we score 48.3%, in comparison to a lenient baseline measure of 49%, so the classifier is performing below the reported baseline. Some inspection of the data showed that the amities ge data only
had a minimal number of utterances annotated with one of the categories that is prominent in the switchboard corpus, so subsequently most of these instances in the test data from the switchboard corpus were missed by our classifier. If we set the default tag to be this category, rather than using the most frequent tag from the training corpus (a statement tag or equivalent for all the corpora we use in this thesis), we achieve a performance gain to 47.7% with strict scoring, and 56.0% with the lenient metric. However, for this to be relevant when we apply the classifier to new data, we would need to know in advance what the most frequent tag would be, and this is unlikely. In the final experiment, using a size-adjusted variant of this cross-domain classification experiment, we score 53.2% with the lenient metric and 38% with the strict metric, indicating that the reduction in size of the training data has little effect on classification accuracy, mirroring the result we report in Section 3.6. This is also highlighted by the reference score, training and testing using a reduced version of the switchboard corpus, where we score 67.4%. Compare this to the scores reported in Chapter 3, where, training using a 50k data set, we achieve a score of 68.4%. These scores confirm that it is less the size of the training data than the composition of that data that is important.
4.4 Summary

In this chapter, we started by applying our cue-based model for da classification to two new corpora, the icsi-mrda corpus of meeting dialogues, and the amities ge corpus of financial transactions with a call centre. For the icsi-mrda corpus, we are able to show that our cue-based method alone
achieves 78% of maximum published performance on the 55 category cluster of labels, and 92% of best published performance on the 5 category cluster. Adding a simple da sequence model boosts these scores to 85% and 96% of best published respectively. For the amities ge corpus we score a respectable 66% accuracy, rising to 71% when we add the simple model of da progression. Using these figures as baselines, when we apply the cues derived from the switchboard corpus to classify the utterances from the icsi-mrda corpus, we obtain a score of 46.9%, not a particularly high score, but one that represents 90.5% of the score we achieve using cue phrases generated from the icsi-mrda corpus. When we apply cue phrases generated from the icsi-mrda corpus to classify utterances from the switchboard corpus, we achieve an impressive 92.3% of the maximum score for that corpus. Next, we substitute the amities ge corpus for the icsi-mrda corpus. We achieve almost 80% of the best score we achieved over the amities ge corpus, when judged using our lenient scoring mechanism, scoring 55.7% using the cross-domain cues, compared to the 70.8% when using in-domain cues. When using the strict measure we still achieve around 60% of the best performance, both results being a substantial improvement over the baseline measure of 20%, corresponding to the most frequent tag. However, using amities ge derived cues to classify switchboard data does not work well. This could be related to the size of data available for training, although our experiments in this area seem to suggest otherwise. We believe that the composition of the training data is a more crucial element. For example, although the da distribution in the switchboard
corpus is skewed over a few categories, it contains enough data for the major classes (such as varieties of ‘statements’, ‘questions’ and ‘answers’) for the extracted cues to be effective on new data. Although the amities ge corpus contains a reasonable number of questions and statements, there is very little data for other significant categories that are key das in the switchboard corpus and in conversational speech in general. There are many areas of future investigation not covered in this material. We believe that the cues we extract have a general nature, and think the experiments we show in this chapter confirm that hypothesis. We want to compare our list of automatically derived cue phrases, particularly those that overlap between the two major corpora (switchboard and icsi-mrda), with a focus on the 719 we report earlier, to those reported in the prior literature in Chapter 2.5. For a quick analysis, we took the table of 62 cue phrases from the literature (shown in Table 2.9), and determined whether any of these cues were present in the 719 cue phrases that overlap between the switchboard corpus and the icsi-mrda corpus (these phrases can be seen in Appendix A). Of those 62 cues, 11 appear in our cue phrase list: ‘and’, ‘like’, ‘oh’, ‘okay’, ‘or’, ‘see’, ‘so’, ‘too’, ‘well’, ‘where’ and ‘why’.
5. CONCLUSIONS AND FUTURE WORK

5.1 Conclusion

In this thesis, we addressed three research goals relating to automatic Dialogue Act classification. The labelling of Dialogue Acts has become a classic classification challenge: using large, hand-annotated corpora of dialogues to learn key features that predict the presence of particular das. In Chapter 2, we saw that many approaches have been taken so far to learn the correct classifications for a number of corpora using different da paradigms. We also saw that there is no dominant method of machine learning for this task. Although comparison of approaches is often complicated, as there are no reference sets of data for this task, no standard use of methods such as cross-validation, and different clusters of labels, there appears to be no one algorithm, or set of algorithms, that clearly outperforms others for da classification. What does seem clear is that the identification and isolation of specific features can make a far greater contribution to classification performance than the choice of learning algorithm. It was this observation, made perhaps most clearly to date in the paper
by Samuel et al. (1999), that led us to our initial research question: ‘Can we automatically extract useful cue phrases from an annotated corpus of dialogues?’. In Chapter 3, we applied a method for cue extraction to the switchboard corpus of human-labelled spoken dialogues. Our cue extraction method relied on the concept of ‘predictivity’, or how reliably an individual cue phrase predicts a particular Dialogue Act. We demonstrated that we were able to extract cue phrases from the switchboard corpus using predictivity as one threshold, and frequency of occurrence as another. Once we had selected candidate cue phrases, we wanted some method of evaluating their effectiveness. Clearly one method of evaluation would be to use these extracted cue phrases directly as a method for classifying unseen utterances, where the presence of the selected cue phrases in utterances would signal particular dialogue acts. This was our second research question: ‘Can the cue phrases we automatically extract be used directly as a method of classification, without reference to dialogue context information?’. In Chapter 3, we devised a novel Dialogue Act classification algorithm that used our automatically extracted cue phrases directly as a method for labelling new dialogues. We described the evolution of our method to include utterance length models and position-specific cue information, and a series of experiments on data sets of different sizes, from 4k utterances up to 202k utterances. Our complete, final model for Dialogue Act classification over the switchboard corpus obtains a cross-validated score of 69.65% over the 202k data set. Our highest single run score was 74.95%, also using the 202k data set of the switchboard corpus. Compare this with the 71% classification ac-
curacy Stolcke et al. (2000) report. It is difficult to compare our results directly with those of Stolcke et al. (2000) given that they did not use a cross-validation approach, but even so it is striking that our cross-validated score comes so close to their result (71% performance) given their use of a much more complex language modelling approach, one that also exploits inter-utterance information by way of a tri-gram model of da progression. As a reminder, when Stolcke et al. (2000) used hmm modelling for purely intra-utterance features (i.e. using only the words in the individual utterance), they achieved a classification accuracy over the switchboard corpus of only 54.3%. It was only when they combined this model with a tri-gram model of da progression (looking at the das of the previous three utterances in the dialogue) that performance was boosted to 71% accuracy. This is compelling evidence that dialogue structure can be an important feature for Dialogue Act classification experiments. However, in contrast to these results, it is also worth remembering the findings of Mast et al. (1996), Ries (1999) and Serafin and Eugenio (2004) discussed in Chapter 3, all of which indicate that dialogue context is not a significant contributor in their classification experiments. These experiments show that indeed we can apply automatically extracted cue phrases directly as a method of Dialogue Act classification. To conclude Chapter 3, we explored our direct classification algorithm further, carrying out an in-depth error analysis to determine under what circumstances misclassification was occurring, and found two prevalent types of error. The first occurred when one class was consistently labelled as a different class because of a confused set of cue phrases (such as the confusion between <statement> and <statement-opinion>). This is a problem our classifier cannot solve, as there are issues with the underlying human labelling that cause such an error. The second occurs when the confused classes have similar lexicalisations. That is, under normal circumstances, the same set of cue phrases can predict multiple da labels. In this case, additional features may be necessary to distinguish between these categories. We looked at two further modifications to our classification algorithm that could help with these cases. We considered assigning more than one label per utterance, using n-best classification. Here the classifier could offer some ranked list of alternative da labels. We also incorporated a simple model of da progression, to see if some use of dialogue context could help us make better classifications. We found that adding da progression information added around 3 percentage points. A Java package containing our intra-utterance classifier, CuDAC, trained over the switchboard data, is available for download and use1. For our final research question, we wanted to see if these cue phrases, extracted from one corpus, could be applied directly as a classification device to a new corpus. We stated our research question as ‘Are these automatically extracted cue phrases general in nature; can they be used to classify utterances from different dialogue types (conversation vs. task-oriented) and domains (free conversation vs. financial transactions)?’. We had observed that the cue phrases that we automatically extracted from the switchboard corpus, and that subsequently are used to classify
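The simple model of da progression mentioned above can be folded into cue-based classification in several ways; the sketch below shows one hedged possibility, re-ranking the intra-utterance cue scores with a bigram prior over the previous label and returning an n-best list. The additive combination, the weight and the bigram (rather than tri-gram) progression model are assumptions made for illustration, not the exact CuDAC formulation.

from collections import defaultdict

def classify_nbest(tokens, cues, prev_da=None, da_bigrams=None, weight=0.5, n=3):
    # Rank candidate da labels for a single utterance.
    # cues: {(phrase, da label): predictivity}, e.g. from a cue extraction step.
    # da_bigrams: {(previous da, da): probability}, an optional progression model.
    words = tuple(tokens)
    scores = defaultdict(float)
    # Intra-utterance evidence: sum the predictivity of every cue phrase that
    # occurs in the utterance.
    for (phrase, da), predictivity in cues.items():
        k = len(phrase)
        if any(words[i:i + k] == phrase for i in range(len(words) - k + 1)):
            scores[da] += predictivity
    # Inter-utterance evidence: a small boost from the da progression model.
    if prev_da is not None and da_bigrams:
        for da in list(scores) or {d for (_, d) in cues}:
            scores[da] += weight * da_bigrams.get((prev_da, da), 0.0)
    # n-best output: a ranked list of alternative labels rather than one choice.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]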
In Chapter 4, we started by applying our model for da classification to two new corpora, the icsi-mrda corpus of meeting dialogues, and the amities ge corpus of financial transactions with a call centre. For the icsi-mrda corpus, we are able to show that our cue-based method alone achieves 78% of the maximum published performance on the 55 category cluster of labels, and 92% of the best published performance on the 5 category cluster. Adding a simple da sequence model boosts these scores to 85% and 96% of the best published scores respectively. For the amities ge corpus, we score a respectable 66% accuracy, rising to 71% when we add the simple model of da progression. We believe these to be interesting scores given the simplicity of the model used to produce them. Using these figures as baselines, we have shown that the cues extracted from the switchboard corpus can be used to classify utterances in new domains, starting with the icsi-mrda corpus. From the cue overlap statistics, we predict that the scores using icsi-mrda cues over switchboard data should be comparable, based solely on the raw number of cue phrases shared between the two corpora. Interestingly, we see that of the thousands of cue phrases the corpora share, if we exclude phrases that relate to the categories, there are only some 719 phrases left. Looking at the overlap between cue phrases from the switchboard corpus and the amities ge corpus, we see a far smaller degree of overlap, so we predict a much lower cross-corpus classification accuracy. When we apply the cues derived from the switchboard corpus to classify the utterances from the icsi-mrda corpus, we obtain a score of 46.9%, not a particularly high score, but one that represents 90.5% of the score we achieve using cue phrases generated from the icsi-mrda corpus.
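The cue overlap statistics referred to here can be computed directly from two extracted cue inventories. The sketch below simply intersects the phrase sets and reports the counts; the dictionary format follows the earlier extraction sketch, and the option to exclude the phrases of particular labels is an illustrative stand-in for the category exclusion described in the text.

def cue_overlap(cues_a, cues_b, exclude_labels=()):
    # cues_a, cues_b: {(phrase, da label): predictivity} dictionaries.
    # exclude_labels: da labels whose phrases should be ignored in the comparison.
    phrases_a = {p for (p, da) in cues_a if da not in exclude_labels}
    phrases_b = {p for (p, da) in cues_b if da not in exclude_labels}
    shared = phrases_a & phrases_b
    return {
        "corpus_a_only": len(phrases_a - shared),
        "corpus_b_only": len(phrases_b - shared),
        "shared": len(shared),
        "jaccard": len(shared) / max(1, len(phrases_a | phrases_b)),
    }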
When we apply cue phrases generated from the icsi-mrda corpus to classify utterances from the switchboard corpus, we achieve an impressive 92.3% of the maximum score for that corpus. Next, we substitute the amities ge corpus for the icsi-mrda corpus. We achieve almost 80% of the upper baseline performance over the amities ge corpus when judged using our lenient scoring mechanism, scoring 55.7% using the cross-domain cues, compared to 70.8% when using in-domain cues. When using the strict measure we still achieve around 60% of the upper bound performance, both results being a substantial improvement over the baseline measure of 20%, which corresponds to the most frequent tag. This is a significant result, which confirms the idea that cues can be sufficiently general across domains to be used in classification. However, whilst the experiment using switchboard-derived cues to classify amities ge data works well, the same is not true in reverse. There are two possible explanations for this result. It could be related to the size of the data available for training, although our experiments in this area seem to suggest otherwise. We believe that the composition of the training data is a more crucial element. Although the da distribution in the switchboard corpus is skewed over a few categories, it contains enough data for the major classes to be effective on new data. Although the amities ge corpus contains a reasonable number of questions and statements, it contains very few instances of the other significant categories, such as , a key da in the switchboard corpus and in conversational speech in general. Correspondingly, the cues de-
rived from the amities ge data perform well on a selection of utterances in the switchboard corpus, but very poorly on others. Taken together, these results allow us to confirm that we can extract cue phrases from one corpus and apply them directly as a classification device to a new corpus.
5.2 Applications and Future Research It has been some five years of exploration since we started this work. The original goal, of exploring Dialogue Act tagging in the context of the AMITIES program, has long since been completed. Over the past few years, we have taken our Dialogue Act classifier described in Chapter 3, called CuDAC, and applied it to a range of different applications and funded research projects. These applications include tagging online, live internet chat, and the creation of a tool to assist human labellers of new dialogue data. In this section, we outline a few of those applications to demonstrate the usability of our classification mechanism. We then give some examples of future research we intend to pursue.
5.2.1 AT-AT: Albany Tagging & Annotation Tool We wanted to create a tool that would aid annotators of new dialogue data. There are many text processing tools available that can assist with the annotation task - including general purpose tools such as GATE (Cunningham et al., 2002) and NLTK (Bird and Loper, 2004) that contain language processing elements that can be used for data annotation, as well as Dialogue
Act specific annotation tools, such as the Dialogue Annotation Tool (DAT; http://www.cs.rochester.edu/research/speech/damsl/) created for damsl annotation, and the XDML tool created in the amities project (http://www.dcs.shef.ac.uk/nlp/amities/amities demos.htm). These specific dialogue act annotation tools tend to be very focused on particular annotation schemes, such as XDML in the case of the amities tool, and hence are not easily reusable if the annotation set changes. To address this lack of flexibility, we created the Albany Tagging and Annotation Tool (AT-AT), a Java tool that is dynamically configurable at run time, using XML specification files. For example, annotation sets can be described in simple XML files, including grouping tag names (creating a ‘questions’ group for example, containing , , , . . . ). When the tool is started, these group labels are used to dynamically assign pull-down menus in the tool. In this way, when using a new tag set, a user need only specify the labels required in an XML file. This also allows users to create maps between tag sets, to effectively re-annotate existing data to meet new tag set requirements, by specifying maps between two or more XML files. For example, we could create an XML file for the icsi-mrda schema, and a map file between the switchboard and icsi-mrda schemas, and the AT-AT tool could emit data in either. Further, we have integrated our CuDAC classifier into the AT-AT tool, allowing the human labeller to automatically annotate the utterances in their corpus. This changes the annotation process from pure labelling to error correction, and looking at the cross-corpus classification results reported in Chapter 4, we assess that this will mean less work for human labellers when annotating a new corpus.
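To illustrate the kind of run-time configuration described above, the sketch below parses a small, hypothetical tag-set specification and builds the group-to-label structure that would populate the pull-down menus. The element and attribute names, and the labels shown, are invented for illustration and are not the actual AT-AT file format.

import xml.etree.ElementTree as ET

# A hypothetical tag-set specification: named groups of labels for the menus.
SPEC = """
<tagset name="switchboard">
  <group name="questions">
    <label>yes-no-question</label>
    <label>wh-question</label>
    <label>open-question</label>
  </group>
  <group name="statements">
    <label>statement-non-opinion</label>
    <label>statement-opinion</label>
  </group>
</tagset>
"""

def load_tagset(xml_text):
    # Return {group name: [label, ...]} for building annotation menus.
    root = ET.fromstring(xml_text)
    return {group.get("name"): [label.text for label in group.findall("label")]
            for group in root.findall("group")}

print(load_tagset(SPEC))
# {'questions': ['yes-no-question', 'wh-question', 'open-question'],
#  'statements': ['statement-non-opinion', 'statement-opinion']}

A second specification file and a simple label-to-label map would then be enough to re-emit the same data under a different scheme, which is the re-annotation facility described above.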
Note that this tool is not currently an active learning tool, in that we do not use the error-corrected data to retrain the underlying model, although this is an area of future investigation. We are not aiming to reproduce existing functionality. Rather, we are trying to create as flexible a tool as possible. We do not discuss the design philosophy of the AT-AT tool further here, but describe the integration between AT-AT and CuDAC. By having CuDAC run in the background, we believe we can help the user to annotate volumes of data. In Chapter 4, we were able to show that the cues extracted from the switchboard corpus can be used to classify utterances in new domains, with some success. We have incorporated the CuDAC classifier into the AT-AT tool, so that users can automatically annotate new data. Of course, we know that we will not achieve 100% classification accuracy. As it stands, the user will have to manually check each assignment, and has the capability to correct the automatic choices made by CuDAC. The next iteration of CuDAC in AT-AT will be more intelligent. Looking at the precision scores we achieve across the three corpora used in this thesis, we can choose those categories that achieve some threshold of precision (say, around 60%) and in those cases make a single category assignment to an utterance. For other categories, especially those that we know a priori are easily confused by CuDAC, we will make some top-n assignment of possible categories, and explicitly flag these cases to the annotator. When using the switchboard categories, we hope to reduce the category assignment problem from one of forty-two categories to a choice between two or three possible labels.
In order to do this, we need to re-investigate our top-n assignment. We can see from prior results that moving to a top-n assignment will require us to retune our classification parameters. For example, we may wish to prune our cue phrases, but retain information about cue phrases that predict multiple categories (information that we currently discard). This information can help us to suggest alternative categories for the user to consider. In this way, we hope to show that the speed of data annotation is increased, with a corresponding increase in data annotation accuracy, although this analysis is outside the scope of this thesis. This tool also provides a test bed for further investigation of the power of our cue phrases. The first version of our AT-AT tool is currently being used by multiple sites (including Charles University, Prague and Napier University, Edinburgh) for Dialogue Act annotation (and simultaneously for Dialogue Appropriateness annotation (Benyon et al., 2008)) in the EU-funded COMPANIONS project.
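A minimal sketch of the assignment strategy described above is given below; the precision table, the threshold value and the classifier interface are assumptions for illustration rather than the AT-AT implementation.

def propose_labels(ranked, per_label_precision, threshold=0.6, top_n=3):
    # ranked: (da label, score) pairs from the classifier, best first.
    # per_label_precision: measured precision of the classifier for each label.
    # Returns (labels, needs_review): a single confident label, or the top-n
    # alternatives explicitly flagged for the human annotator to choose between.
    if not ranked:
        return [], True
    best_da, _ = ranked[0]
    if per_label_precision.get(best_da, 0.0) >= threshold:
        return [best_da], False                       # commit: annotator only error-corrects
    return [da for da, _ in ranked[:top_n]], True     # flag for manual choice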
5.2.2 Collaboration & Deliberation Aside from the AT-AT annotation tool, we have used our CuDAC classifier directly in a number of funded research projects. We overview two of these projects here, and describe how CuDAC is used in each. The first is an online collaboration project, where we use CuDAC to classify utterances exchanged through chat between intelligence analysts. The second is an online deliberation website, where we use discourse structure as one feature to summarise discussion between many users.
A Collaborative Analytical Environment We are working on COLLANE, a collaborative analytic environment that enables groups of users to work together effectively on a complex information problem. The objective of this project is to investigate the impact of certain forms of collaboration on the quality of work emerging from a group. Our focus is on tacit collaboration, where the participants can share and leverage each other's contributions without the cost and commitment of a planned activity. COLLANE has been designed to enable automatic information sharing, so that individual users can take advantage of others' actions and insights while pursuing independent lines of work. The expected outcome of such collaboration is a competitively ranked ‘portfolio’ of reports covering various aspects of the overall problem. An initial prototype of the COLLANE system, which embodies early principles of targeted information sharing, has been developed and tested in realistic analytic experiments. The technical focus of the COLLANE project is to design and implement a virtual working environment that would enable tacit collaboration by a group of analysts who may be using the system remotely and in an asynchronous manner. By tacit collaboration we mean the exchange of information, advice, and insights between the analysts that does not require (though does not preclude) a conscious, extraneous effort that would take one's attention from the task at hand. Instead, COLLANE supports each analyst's individual line of work, while managing the totality of information and knowledge created by the collaborating group.
This work is based on research supported in part by the Air Force Research Laboratory and the Intelligence Advanced Research Projects Activity under agreement number FA8750-06-2-0229 (CASE Program).
This way the cognitive power of each analyst is expanded without creating an undue distraction or information overload. A typical analytical task involves collecting relevant information (evidence) from available sources, making some assessment about this evidence (judgments, conclusions), and preparing a reasoned report (a solution). A central component of COLLANE is the Multi-channel Dialogue Manager (MDM), which supports coordinated non-linear interaction between the analysts and the COLLANE information server. MDM exploits both verbal (text) and visual communication channels to conduct an efficient dialogue with each user, and the selection of communication modality depends upon the urgency of a message and the current state of each user. In addition to direct communication (both user-system and user-user), an indirect interaction is supported, whereby one user's actions are communicated through their effect or impact on another user's workspace. To facilitate inter-analyst communication, COLLANE provides a chat (IM) mechanism by which analysts can communicate directly. IM is used by analysts to ask questions of each other, to exchange opinions and data items, and to coordinate their work. The chat channel complements tacit collaboration when information sharing appears slow (so analysts often ask questions of each other, such as “Can't find anything on cabinet restructuring, did you?”). To identify key parts of the inter-analyst communication, CuDAC is used to tag the dialogue function aspect of all communications, and a subset is identified (such as different types of questions, statements of fact, and agreement and rejection of ideas).
For example, any question raised by an analyst inside the chat framework can be answered by another online analyst, but also by the system itself, using the interactive QA mechanism HITIQA (Small and Strzalkowski, 2009). Any analyst material (such as statements) offered in reply to open questions can also be collected, and made available to other analysts with similar questions. This tacit information sharing has been shown to lower the cognitive load on individual analysts, and to circumvent some of the problems that arise when analysts do not share information. For more on COLLANE see Strzalkowski et al. (2009).
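As an illustration of how the da tags can drive this kind of routing, the sketch below dispatches incoming chat messages according to their predicted label, forwarding questions to a question-answering hook and statements to the information-sharing mechanism. The label names, the classifier interface and the two callbacks are assumptions made for illustration, not the COLLANE code.

QUESTION_LABELS = {"yes-no-question", "wh-question", "open-question"}   # illustrative
SHAREABLE_LABELS = {"statement-non-opinion", "statement-opinion"}       # illustrative

def route_chat_message(message, classify, answer_question, share_with_group):
    # classify: callable returning a da label for the message text.
    # answer_question / share_with_group: callbacks standing in for the QA
    # component and the tacit information-sharing mechanism.
    da = classify(message)
    if da in QUESTION_LABELS:
        # Questions can be answered by another analyst or by the QA system.
        answer_question(message)
    elif da in SHAREABLE_LABELS:
        # Statements become candidate material for analysts with similar questions.
        share_with_group(message)
    return da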
Deliberative E-Rulemaking Decision Facilitation Project The purpose of the Deliberative E-Rulemaking Decision Facilitation (DeER) project is to address real-world problems in governmental electronic rulemaking, while using this practical context to address key problems in advanced information technology and the social science of deliberative groups. DeER addresses issues of rulemaking by employing a combination of proven and novel deliberation techniques and an artificial discussion facilitation agent (DiFA), while seeking to build an online user community around the rulemaking process (Muhlberger et al., 2008). DiFA helps to inform users about the subject of discussion, to facilitate deliberation between users, and to help officials identify the key points and structure of the resulting deliberations. E-rulemaking is a relatively new process in digital government. When government agencies wish to introduce new rules, laws or amendments to existing provisions, the agency concerned engages in a dialogue with the citizens concerned.
This research is based upon work supported by the National Science Foundation under Grant No. IIS-0713143.
Traditionally, agencies would post an intention to change or create a rule, and people interested in the rule would comment on the change, often via post or email. The agency would review all comments received within some timeframe, and then adjust the proposed rule accordingly. In E-rulemaking, much of this process takes place online, and in a more interactive way, often with online discussion between representatives of a wide variety of viewpoints (those in favour of the rule and those against, for example). We are experimenting with a range of NLP technologies to see how we can both improve the quality of the discussion and digest the resulting data (often long ‘dialogues’ posted on online message boards). We are using methods explored in this thesis to investigate the combination of an existing set of comprehensive dialogue acts with a discussion-specific content coding scheme elaborated by Stromer-Galley (2007). Using hand-labelled data for training and testing, and using our classification mechanism, we achieve an initial classification accuracy of around 66% on this new data. We have yet to perform a cross-corpus classification. In discussion act analysis, DiFA will provide feedback on the quality of discussion as determined by the amount of agreement, disagreement, elaboration, and so forth present in a conversation. It will also be used to identify potential paths of discussion to be presented as exemplars for other discussion groups. Finally, there is existing evidence that using the discourse structure of these kinds of discussions results in higher quality summaries when compared to existing news-text summarisers applied to the same data (cf. Klaas (2005) for instance).
5.2.3 Next Steps There are many areas of future investigation not covered in this thesis. We are planning a major investigation into the power of the 719 cue phrases that overlap between the switchboard corpus and the icsi-mrda corpus. These phrases are listed in Appendix B. We want to use additional corpus resources to determine whether these same cue phrases appear in those corpora, and are useful as a direct classification device. We want to be able to characterise corpora (such as the amities ge corpus) in terms of these cue phrases, to determine the likelihood of achieving reasonable da classification performance, and we want to extensively compare these cue phrases to those reported in the literature (such as the set of around 700 cue phrases reported in Samuel et al. (1999)). Finally, we will continue to use the tools we have developed (both CuDAC and the AT-AT tool) to annotate new resources, and to operate in on-line dialogue systems. We see da annotation as a first, important step in building dialogue resources. The next step is to discover re-usable sequences of das, in combination with cues, which form patterns of game-like structures of dialogue, such as error correction, clarification, or conventional politeness, and which enable us to interpret larger units of dialogue. If da recognition is a building block, Conversation Management (CM) is one of the goals. Spoken language dialogue systems are increasingly a commercial reality, but often in limited domains, with highly structured language and a constrained range of options. We are looking to expand the functionality of these systems by using more complex methods of conversation management. CM goes beyond the dialogue management seen in many deployed
systems, by having a higher level view of the interaction, as a long-term, persistent goal. Such Conversation Managers can leverage patterns of dialogue to be able to control appropriate interaction, and handle errors and inconsistencies within the framework of the wider picture, or conversation.
References
Alexandersson, J., B. Buschbeck-Wolf, T. Fujinami, M. Kipp, S. Koch, E. Maier, N. Reithinger, B. Schmitz, and M. Siegel. 1998. Dialogue Acts in VERBMOBIL-2 (second edition). Vm report 226, DFKI GmbH, Universities of Berlin, Saarbrücken and Stuttgart. Allen, J. and M. Core. 1997. Draft of DAMSL: Dialog Act Markup in Several Layers. Technical report, Jan. Allen, J., L. Schubert, G. Ferguson, P. Heeman, C. Hwang, T. Kato, M. Light, N. Martin, B. Miller, M. Poesio, and D. Traum. 1995. The TRAINS project: A Case Study in Building a Conversational Planning Agent. Journal of Experimental and Theoretical Artificial Intelligence, 7:7–48. Allen, J. 1979. A Plan-Based Approach to Speech Act Recognition. Ph.D. thesis, University of Toronto. Anderson, A., M. Bader, E. Bard, E. Boyle, and G. Doherty. 1991. The HCRC Map Task Corpus. Language and Speech, 34:351–366. Ang, J., Y. Liu, and E. Shriberg. 2005. Automatic Dialog Act Segmentation and Classification in Multiparty Meetings. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 1061–1064, Philadelphia. Artstein, R. and M. Poesio. 2008. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555–596. Atkinson, J. M. and P. Drew. 1979. Order in Court. Macmillan, London. Austin, J. L. 1962. How to Do Things with Words. Oxford University Press, Oxford. Ballim, A. and Y. Wilks. 1991. Artificial Believers: The Ascription of Belief. L. Erlbaum Associates Inc., Hillsdale, NJ, USA. Benyon, D., P. Hansen, and N. Webb. 2008. Evaluating Human-Computer Conversation in Companions. In Proceedings of the 4th International Workshop on Human-Computer Conversation, Bellagio, Italy. Bird, S. and E. Loper. 2004. NLTK: The Natural Language Toolkit. Proceedings of the ACL Demonstration Session. Brill, E. 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, 21:543–565. Bunt, H. and Y. Girard. 2005. Designing an Open, Multidimensional Dialogue Act Taxonomy. Proceedings of the 9th Workshop on the Semantics and Pragmatics in Dialogue. Bunt, H. 1994. Context and Dialogue Control. THINK, 3:19–31. Carberry, S. and L. Lambert. 1999. A Process Model for Recognizing Com-
municative Acts and Modeling Negotiation Subdialogues. Computational Linguistics, 25:1–53. Carberry, S. 1983. Tracking User Goals in an Information-Seeking Environment. In Proceedings of the American Association of Artificial Intelligence Conference, pages 59–63. Carberry, S. 1990. Plan Recognition in Natural Language Dialogue. ACL-MIT Press Series on Natural Language Processing edition. Carletta, J. C., A. Isard, S. Isard, J. Kowtko, G. Doherty-Sneddon, and A. Anderson. 1997. The Reliability of a Dialogue Structure Coding Scheme. Computational Linguistics, 23:13–31. Caspers, J. 2006. Pitch Accents, Boundary Tones and Turn-Taking in Dutch Map Task Dialogues. In International Conference on Speech and Language Processing, pages 565–568. Cohen, P. R. and C. R. Perrault. 1979. Elements of a Plan Based Theory of Speech Acts. Cognitive Science, 3. Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46. Cohen, R. 1984. A Computational Theory of the Function of Clue Words in Argument Understanding. In Proceedings of the 1984 International Computational Linguistics Conference, pages 251–255, California. Cohen, W. 1995. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, Tahoe City, California. Constantinides, P. and A. Rudnicky. 1999. Dialog Analysis in the Carnegie Mellon Communicator. In Proceedings of Eurospeech, pages 243–246. Core, M. and J. Allen. 1997. Coding Dialogs with the DAMSL annotation scheme. In AAAI Fall Symposium on Communicative Action in Humans and Machines, MIT, Cambridge, MA. Core, M., M. Ishizaki, J. Moore, and C. Nakatani. 1999. The Report of the Third Workshop of the Discourse Resource Initiative. Chiba University and Kazusa Academia Hall. Cunningham, H., D. Maynard, K. Bontcheva, and V. Tablan. 2002. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics. Daelemans, W. and A. Van den Bosch. 2005. Memory-Based Language Processing. Cambridge University Press, Cambridge, UK. Dahlbäck, N. and A. Jönsson. 1992. An Empirically Based Computationally Tractable Dialogue Model. In Proceedings of the 14th Annual Conference of the Cognitive Science Society (COGSCI-92), Bloomington, Indiana.
Ferguson, G. and J. F. Allen. 1998. TRIPS: An Integrated Intelligent Problem-Solving Assistant. In Proceedings of the Fifteenth National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, AAAI 98, IAAI 98, pages 567–572, Madison, Wisconsin. Field, D., S. Worgan, N. Webb, M. Hepple, and Y. Wilks. 2008. Automatic Induction of Dialogue Structure from the Companions Dialogue Corpus. In Proceedings of the 4th International Workshop on Human-Computer Conversation, Bellagio, Italy. Gazdar, G. 1981. Speech Act Assignment. In Joshi, A., B. Webber, and I. Sag, editors, Elements of Discourse Understanding, pages 64–83. Cambridge University Press, Cambridge, England. Geertzen, J. and H. Bunt. 2006. Measuring Annotator Agreement in a Complex Hierarchical Dialogue Act Annotation Scheme. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, pages 126–133, Sydney, Australia. Association for Computational Linguistics. Geertzen, J., Y. Girard, and R. Morante. 2004. The DIAMOND project. Poster at the 8th Workshop on Semantics & Pragmatics of Dialogue (CATALOG2004), Barcelona, Spain. Georgila, K., O. Lemon, J. Henderson, and J. D. Moore. 2009. Automatic Annotation of Context and Speech Acts for Dialogue Corpora. Journal of Natural Language Engineering, 15(3):315–353. Grau, S., E. Sanchis, M. J. Castro, and D. Vilar. 2004. Dialogue Act Classification using a Bayesian Approach. 9th Conference Speech and Computer. Grosz, B. and C. Sidner. 1986. Attention, Intentions, and the Structure of Discourse. Computational Linguistics, 19(3). Hajdinjak, M. and F. Mihelič. 2006. The PARADISE Evaluation Framework: Issues and Findings. Computational Linguistics: Special Issue on Empirical Studies in Discourse Interpretation and Generation, 32:263–272. Hardy, H., K. Baker, L. Devillers, L. Lamel, S. Rosset, T. Strzalkowski, C. Ursu, and N. Webb. 2002. Multi-Layered Dialogue Annotation for Automated Multilingual Customer Service. In Proceedings of the ISLE Workshop on Dialogue Tagging for Multimodal Human Computer Interaction, Edinburgh. Hardy, H., K. Baker, H. Bonneau-Maynard, L. Devillers, S. Rosset, and T. Strzalkowski. 2003. Semantic and Dialogic Annotation for Automated Multilingual Customer Service. In Proceedings of Eurospeech’03, Geneva, Switzerland. Hardy, H., A. Biermann, R. Bryce Inouye, A. McKenzie, T. Strzalkowski,
C. Ursu, N. Webb, and M. Wu. 2005. The AMITIÉS System: Data-Driven Techniques for Automated Dialogue. Speech Communication, 48:354–373. Hirschberg, J. and D. Litman. 1993. Empirical Studies on the Disambiguation of Cue Phrases. Computational Linguistics, 19(3):501–530. Hirschberg, J., D. Litman, J. Pierrehumbert, and G. Ward. 1987. Intonation and the Intentional Structure of Discourse. In Proceedings of the International Joint Conference on Artificial Intelligence, Milan. Jefferson, G. 1972. Side Sequences. In Sudnow, D., editor, Studies in Social Interaction, pages 294–338. Free press, New York. Jekat, S., R. Klein, E. Maier, I. Maleck, M. Mast, T. Berlin, and J. J. Quantz. 1995. Dialogue Acts in VERBMOBIL. Vm report 65, DFKI GmbH, Universities of Berlin, Saarbrücken and Stuttgart. Ji, G. and J. Bilmes. 2005. Dialog Act Tagging Using Graphical Models. Acoustics. Ji, G. and J. Bilmes. 2006. Backoff Model Training using Partially Observed Data: Application to Dialog Act Tagging. In Proceedings of the Human Language Technology/American chapter of the Association for Computational Linguistics (HLT/NAACL’06). Jurafsky, D. and J. H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. Prentice-Hall, 2nd edition. Jurafsky, D., R. Bates, N. Coccaro, R. Martin, M. Meteer, K. Ries, E. Shriberg, A. Stolcke, P. Taylor, and C. Van Ess-Dykema. 1997. Automatic Detection of Discourse Structure for Speech Recognition and Understanding. In Proceedings of the 1997 IEEE Workshop on Speech Recognition and Understanding, Santa Barbara. Jurafsky, D., R. Bates, N. Coccaro, R. Martin, M. Meteer, K. Ries, E. Shriberg, A. Stolcke, P. Taylor, and C. Van Ess-Dykema. 1998. Switchboard Discourse Language Modeling Project Final Report. Research Note 30, Center for Language and Speech Processing, Johns Hopkins University, Baltimore. Keizer, S., R. op den Akker, and A. Nijholt. 2002. Dialogue Act Recognition with Bayesian Networks for Dutch Dialogues. In Proceedings of the Third SIGdial Workshop on Discourse and Dialogue, pages 210–218, Philadelphia. Keizer, S. 2001. A Bayesian Approach to Dialogue Act Classification. In Proceedings of the 5th Workshop on Formal Semantics and Pragmatics of Dialogue, pages 210–218. Klaas, M. 2005. Toward Indicative Discussion Fora Summarization. Techni-
cal Report UBC-CS TR-2005-04, University of British Columbia. Klein, M. 1999. Standardisation Efforts on the Level of Dialogue Act in the MATE Project. Proceedings of the ACL Workshop Towards Standards and Tools for Discourse Tagging. Kowtko, J., S. Isard, and G. M. Doherty. 1993. Conversational Games Within Dialogue. Research paper hcrc/rp-31, Human Communication Research Centre, University of Edinburgh. Kruschwitz, U., N. Webb, and R. Sutcliffe. 2008. Query Log Analysis for Adaptive Dialogue-Driven Search. In Jansen, B. J., A. Spink, and I. Taksa, editors, Handbook of Research on Web Log Analysis, pages 389–416. IGI, Hershey, PA. Lager, T. and N. Zinovjeva. 1999. Training a Dialogue Act Tagger with the µ-TBL System. In Proceedings of the Third Swedish Symposium on Multimodal Communication, Linköping University Natural Language Processing Laboratory. Lavie, A., L. Levin, and F. Pianesi. 2006. The NESPOLE! System for Multilingual Speech Communication over the Internet. IEEE Transactions on Speech and Audio Processing, 14:1664–1673. Lee, M. and Y. Wilks. 1996. An Ascription-Based Approach to Speech Acts. In Proceedings of the 16th Conference on Computational Linguistics, pages 699–704, Morristown, NJ, USA. Association for Computational Linguistics. Lee, M. 1994. Conjunctive goals as a cause of conversational implicature. Department of Computer Science, Technical Report 94-10-05, University of Sheffield. Lendvai, P. and J. Geertzen. 2007. Token-Based Chunking of Turn-Internal Dialogue Act Sequences. In Proceedings of the 8th SIGDIAL Workshop on Discourse and Dialogue, pages 174–181, Antwerp, Belgium. Lendvai, P., A. van den Bosch, and E. Krahmer. 2003. Machine Learning for Shallow Interpretation of User Utterances in Spoken Dialogue Systems. In Proceedings of the 2003 Workshop on Dialogue Systems. Levin, L., A. Thymé-Gobbel, A. Lavie, K. Ries, and K. Zechner. 1998. A Discourse Coding Scheme for Conversational Spanish. In Proceedings of the International Conference on Speech and Language Processing. Levin, E., S. Narayanan, R. Pieraccini, K. Biatov, E. Bocchieri, G. Di Fabbrizio, W. Eckert, S. Lee, A. Pokrovsky, M. Rahim, P. Ruscitti, and M. Walker. 2000. The AT&T DARPA Communicator Mixed-Initiative Spoken Dialogue System. In Proceedings of the International Conference on Speech and Language Processing, pages 122–125. Levin, L., C. Langley, A. Lavie, D. Gates, and D. Wallace. 2003. Domain
Specific Speech Acts for Spoken Language Translation. In Proceedings of 4th SIGdial Workshop on Discourse and Dialogue. Levinson, S. C. 1983. Pragmatics. Cambridge University Press, Cambridge, England. Lewin, I. 2000. A Formal Model of Conversational Game Theory. In Fourth Workshop on the Semantics and Pragmatics of Dialogue: Gotalog 2000, pages 115–122. Litman, D. and J. F. Allen. 1990. Discourse Processing and Commonsense Plans. In Cohen, P., J. Morgan, and M. Pollack, editors, Intentions in Communication, pages 365–388. MIT Press, Cambridge, MA. Marcu, D. and G. Hirst. 1995. A Uniform Treatment of Pragmatic Inferences in Simple and Complex Utterances and Sequences of Utterances. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 144–150, Boston, MA. Martínez-Hinarejos, C. D., J. M. Benedí, and R. Granell. 2008. Statistical Framework for a Spanish Spoken Dialogue Corpus. Speech Communication, 50(11-12):992–1008. Mast, M., R. Kompe, S. Harbeck, A. Kiessling, and V. Warnke. 1996. Dialog Act Classification with the Help of Prosody. In Proceedings of the International Conference on Speech and Language Processing ICSLP ’96, volume 3, pages 1732–1735, Philadelphia, PA. Maybury, M. 1998. Discourse Cues for Broadcast News Segmentation. Proceedings of the 17th International Conference on Computational Linguistics, pages 819–822. McCallum, A. K. 1996. BOW: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. McCowan, I., J. Carletta, W. Kraaij, and S. Ashby. 2005. The AMI Meeting Corpus. Proceedings of Measuring Behavior, Jan. Meteer, M. 1995. Dysfluency Annotation Stylebook for the Switchboard Corpus. Working paper, Linguistic Data Consortium. Moser, M. and J. D. Moore. 1995. Investigating Cue Selection and Placement in Tutorial Discourse. Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pages 130–135. Muhlberger, P., N. Webb, and J. Stromer-Galley. 2008. The Deliberative E-Rulemaking Project (DeER): Improving Federal Agency Rulemaking Via Natural Language Processing and Citizen Dialogue. In 9th International Conference on Digital Government Research (dg.o 2008), Montreal, Canada. Nagata, M. and T. Morimoto. 1994. First Steps Towards Statistical Model-
ing of Dialogue to Predict the Speech Act Type of the Next Utterance. Speech Communication. Niedermair, G. T. 1992. Linguistic Modelling in the Context of Oral Dialogue. In Proceedings of International Conference on Spoken Language Processing (ICSLP’92), pages 635–638, Banff, Canada. Passonneau, R. J. and D. J. Litman. 1997. Discourse Segmentation by Human and Automated Means. Computational Linguistics: Special Issue on Empirical Studies in Discourse Interpretation and Generation, 23:103–139. Peckham, J. 1993. A New Generation of Spoken Dialogue Systems: Results and Lessons from the SUNDIAL Project. In Proceedings of the 3rd European Conference on Speech Communication and Technology, pages 33–40, Berlin, Germany. Pellom, B., W. Ward, J. Hansen, R. Cole, K. Hacioglu, J. Zhang, X. Yu, and S. Pradhan. 2001. University of Colorado Dialog Systems for Travel and Navigation. In HLT ’01: Proceedings of the First International Conference on Human Language Technology Research, pages 1–6, Morristown, NJ, USA. Association for Computational Linguistics. Perrault, C. R. and J. F. Allen. 1980. A Plan-Based Analysis of Indirect Speech Acts. Computational Linguistics, 6:167–182. Poesio, M. and A. Mikheev. 1998. The Predictive Power of Game Structure in Dialogue Act Recognition: Experimental Results Using Maximum Entropy Estimation. In International Conference on Speech and Language Processing. Power, R. J. 1979. The Organisation of Purposeful Dialogues. Linguistics, 17:107–152. Prasad, R. and M. Walker. 2002. Training a Dialogue Act Tagger for Human-Human and Human-Computer Travel Dialogues. In Proceedings of the 3rd SIGdial Workshop on Discourse and Dialogue, Philadelphia, Pennsylvania. Pulman, S. G. 1996. Conversational Games, Belief Revision and Bayesian Networks. In Computational Linguistics in the Netherlands. Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. Reithinger, N. and M. Klesen. 1997. Dialogue Act Classification Using Language Models. In Proceedings of EuroSpeech-97. Reithinger, N. and E. Maier. 1995. Utilizing Statistical Dialogue Act Processing in Verbmobil. Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics. Ries, K. 1999. HMM and Neural Network Based Speech Act Classification.
In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 497–500, Phoenix, AZ. Rieser, V. and O. Lemon. 2008. Learning Effective Multimodal Dialogue Strategies from Wizard-of-Oz Data: Bootstrapping and Evaluation. In Proceedings of the Association for Computational Linguistics. Rosset, S., D. Tribout, and L. Lamel. 2008. Multi-Level Information and Automatic Dialog Act Detection in Human-Human Spoken Dialogs. Speech Communication, 50(8-9):1–13. Rotaru, M. 2002. Dialog Act Tagging using Memory-Based Learning. Term project, University of Pittsburgh. Samuel, K., S. Carberry, and K. Vijay-Shanker. 1998. Dialogue Act Tagging with Transformation-Based Learning. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Montreal. Samuel, K., S. Carberry, and K. Vijay-Shanker. 1999. Automatically Selecting Useful Phrases for Dialogue Act Tagging. In Proceedings of the Fourth Conference of the Pacific Association for Computational Linguistics, Waterloo, Ontario, Canada. Sanchis, E. and M. J. Castro. 2002. Dialogue Act Connectionist Detection in a Spoken Dialogue System. In Abraham, A., J. Ruiz del Solar, and M. Köppen, editors, Soft Computing Systems - Design, Management and Applications, pages 644–651. Schegloff, E. A. and H. Sacks. 1973. Opening Up Closings. Semiotica, 7:289–327. Searle, J. R. 1969. Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge. Searle, J. 1976. A Classification of Illocutionary Acts. Language in Society. Seneff, S., L. Hirschman, and V. W. Zue. 1991. Interactive Problem Solving and Dialogue in the ATIS Domain. In HLT ’91: Proceedings of the Workshop on Speech and Natural Language, pages 354–359, Morristown, NJ, USA. Association for Computational Linguistics. Serafin, R. and B. Di Eugenio. 2004. FLSA: Extending Latent Semantic Analysis with Features for Dialogue Act Classification. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain. Shriberg, E., R. Dhillon, S. Bhagat, J. Ang, and H. Carvey. 2004. The ICSI Meeting Recorder Dialog Act (MRDA) Corpus. In Special Interest Group on Discourse and Dialogue (SIGdial), Boston, USA. Sinclair, J. and M. Coulthard. 1975. Toward an Analysis of Discourse: the
English Used by Teachers and Pupils. Oxford University Press, Oxford, England. Small, S. and T. Strzalkowski. 2009. HITIQA: High-Quality Intelligence through Interactive Question Answering. Journal of Natural Language Engineering: Special Issue on Interactive Question Answering, 15(1):31–54. Stolcke, A., K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, and M. Meteer. 2000. Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech. Computational Linguistics, 26(3):339–373. Strik, H., A. J. M. Russel, H. van den Heuvel, C. Cucchiarini, and L. Boves. 1997. A Spoken Dialog System for the Dutch Public Transport Information Service. International Journal of Speech Technology, 2(2):119–129. Stromer-Galley, J. 2007. Assessing Deliberative Quality: A Coding Scheme. Journal of Public Deliberation, 3. Strzalkowski, T., S. Taylor, S. Shaikh, B. Lipetz, H. Hardy, N. Webb, T. Cresswell, M. Wu, Y. Zhan, T. Liu, and S. Chen. 2009. COLLANE: An Experiment in Computer-Mediated Tacit Collaboration. In Marciniak, M. and A. Mykowiecka, editors, Aspects of Natural Language Processing, Lecture Notes in Computer Science, Vol. 5070. Springer. Sutton, R. and A. Barto. 1998. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, March. Tur, G., U. Guz, and D. Hakkani-Tür. 2006. Model Adaptation for Dialogue Act Tagging. In IEEE Spoken Language Technology Workshop. Verbree, D., R. Rienks, and D. Heylen. 2006. Dialogue Act Tagging using Smart Feature Selection; Results on Multiple Corpora. Spoken Language Technology Workshop, 2006. IEEE, pages 70–73. Wahlster, W. 2000. Verbmobil: Foundations of Speech-To-Speech Translation. Springer. Walker, M. and R. Passonneau. 2000. DATE: A Dialogue Act Tagging Scheme for Evaluation of Spoken Dialogue Systems. Proceedings of the First International Conference on Human Computer Interaction. Walker, M., D. Litman, C. Kamm, and A. Abella. 1997a. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, Jan. Walker, M. A., J. E. Cahn, and S. J. Whittaker. 1997b. Improvising Linguistic Style: Social and Affective Bases for Agent Personality. In Proceedings of the Conference on Autonomous Agents, AGENTS97. Walker, M., J. Aberdeen, J. Boland, E. Bratt, J. Garofolo, L. Hirschman,
A. Le, S. Lee, S. Narayanan, K. Papineni, B. Pellom, J. Polifroni, A. Potamianos, P. Prabhu, A. Rudnicky, G. Sanders, S. Seneff, D. Stallard, and S. Whittaker. 2001. DARPA Communicator Dialog Travel Planning Systems: The June 2000 Data Collection. In The European Conference on Speech (EuroSpeech’01). Warner, R. G. 1985. Discourse Connectives in English. Garland Publishing, Inc. Webb, N. and T. Liu. 2008. Investigating the Portability of Corpus-Derived Cue Phrases for Dialogue Act Classification. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, United Kingdom. Webb, N., M. Hepple, and Y. Wilks. 2005a. Dialogue Act Classification Based on Intra-Utterance Features. In Proceedings of the AAAI Workshop on Spoken Language Understanding. Webb, N., M. Hepple, and Y. Wilks. 2005b. Empirical Determination of Thresholds for Optimal Dialogue Act Classification. In Proceedings of the Ninth Workshop on the Semantics and Pragmatics of Dialogue. Webb, N., M. Hepple, and Y. Wilks. 2005c. Error Analysis of Dialogue Act Classification. In Proceedings of the 8th International Conference on Text, Speech and Dialogue, Carlsbad, Czech Republic. Webb, N., T. Liu, M. Hepple, and Y. Wilks. 2008. Cross-Domain Dialogue Act Tagging. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC2008), Marrakech, Morocco. Wilks, Y., N. Webb, A. Setzer, M. Hepple, and R. Catizone. 2003. Human Dialogue Modelling Using Machine Learning. In Nicolov, Nicolas, Kalina Bontcheva, Galia Angelova, and Ruslan Mitkov, editors, Recent Advances in Natural Language Processing III (Selected Papers from RANLP). Wilks, Y., N. Webb, A. Setzer, M. Hepple, and R. Catizone. 2004. Human Dialogue Modelling Using Annotated Corpora. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004), Lisbon, Portugal. Witten, I. H. and E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition. Wright, H. 1998. Automatic Utterance Type Detection Using Suprasegmental Features. In Fifth International Conference on Spoken Language Processing. Zell, A., G. Mamier, R. Hübner, N. Schmalzl, T. Sommer, and M. Vogt. 1993. SNNS: An Efficient Simulator for Neural Nets. In Schwetman, Herbert D., Jean C. Walrand, Kallol Kumar Bagchi, and Doug DeGroot, editors, Proceedings of the International Workshop on Modeling, Analysis,
and Simulation on Computer and Telecommunication Systems, January 17-20, 1993, La Jolla, San Diego, CA, USA, pages 343–346. The Society for Computer Simulation. Zimmermann, M., D. Hakkani-Tür, E. Shriberg, and A. Stolcke. 2006. Text Based Dialog Act Classification for Multiparty Meetings. In Proceedings of the 3rd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms, LNCS 4299, pages 190–199, Berlin. Springer Verlag.
APPENDIX
A. MAPPING BETWEEN switchboard, amities ge AND superclass LABELS Our first step is to simplify the xdml markup. Mark-up in the amities ge corpus makes the distinction between different types of ‘statement’, accepting that they can be ‘assertions’, ‘re-assertions’, ’explanations’, ’re-explanations’ and ’expressions’. Some of these assignments are impossible to make using our intra-utterance classifier. For example, we would not be able to tell the difference between an ‘explanation’ and a ‘re-explanation’ without recourse to dialogue context. As a first simplifying step, we collapse these four categories to form one ‘assertion’ type, i.e. . We have a similar problem with information requests. In the xdml schema, there is a division between ‘information-request’ and ‘confirmation-request’. Often the decision to choose between these annotations is taken by a human labeller on the basis of the dialogue context (i.e. if it’s the same question, only the second time it is asked, it qualifies as a ‘confirmation-request’). Further, these information requests are split into implicit and explicit requests. For the purposes of our exploratory experiments, we collapse all of the requests into a single information request label, i.e. . These simplifications, performed before any further processing, can be seen together in Figure A.1. Once this simplification procedure has been completed, we look to create a map between the two category sets under consideration, the switchboard labels set and the amities ge label set. The problem of creating a mapping between the two is immediately apparent; It is impossible to create a consistent map in either direction. There are instances where one or many labels in the source category are in some way equivalent to more than one label of the target category. When trying to translate from amities ge labels to switchboard labels, there is the aforementioned problem of going from one label () to two possible target la-
204 bels ( and ). When trying to map from switchboard labels to amities ge labels, there are similar problems (like trying to translate between the single switchboard category and the three corresponding amities ge labels). To address this problem, we chose to introduce an additional level of representation which employs a new set of categories which correspond to ‘superclasses’ of the category labels appearing in either of the two source corpora category sets. This new level (the superclass level) in effect captures what exists in common between the representational schemes of the two corpora (by generalising away any distinctions that are expressed by one scheme that are beyond the expressive power of the other). The use of superclasses in this way allows us to map representations from both source corpora onto the superclass level even though mappings from either source level to the other one is not possible. For example, we may note that there is a correspondence between tag A of corpus P and tags A1, A2 and A3 of corpus Q, but this one-to-many relationship makes a mapping from corpus P to corpus Q impossible. Likewise, there is a correspondence between tag B of corpus Q and tags B1, B2 and B3 of corpus P, which makes a mapping from corpus Q to corpus P impossible. We can reconcile these representational differences by positing categories SC A and SC B at the superclass level and making the obvious mapping of the alternative category sets onto this common level. Pursuing this approach, we arrive at a set of 15 superclass labels, which divide into understanding labels (Figure A.2), agreement labels (Figure A.3), conventional and other special labels (Figure A.4), question labels (Figure A.5) and finally answer and statement labels (Figure A.6). Creating the mapping from switchboard labels into our superclass labels is relatively trivial, as we are dealing with 42 atomic labels. The mapping from amities ge labels into the superclass labels is less straightforward. There are large number of labels, and given the ability to select an annotation from each of the four hierarchies in the scheme, annotators were not always consistent. In their work annotating the switchboard corpus, Jurafsky et al. (1997) include a mapping from the full damsl hierarchy to the atomic 42 labels of their annotation, where they focus on the most salient part of the annotation. We attempt to do the same here when mapping from the amities ge labels to the superclass categories. Take the example of utterances in the amities ge corpus that signal the end of a conversation, such as “goodbye”. This could be annotated as . Given the earlier simplification steps indicated in Figure A.1, we know that this will have been changed to . We will have a mapping from amities ge labels to our superclass categories that will deal with labels, mapping them into the SC STATEMENT category. However, this is not the correct assignment for the label , which should instead be mapped to the superclass SC GOODBYE. In this sense, the part of the complete annotation is considered the most salient. Through the inconsistency of the human labellers, there are many examples where additional label information is introduced into the annotation. If we look at the example annotations given in Table 4.4, we see instances of these inconsistencies. The most frequent instances of back-channels are labelled as . However, there are a significant number of back-channel utterances that are labelled as . 
Here, the presence of the ‘assertion’ label is optional, and the basic function of the utterance remains as a ‘back-channel’. Similarly, with utterances signifying a lack of understanding (such as “I don’t understand, what do you mean?”), they can be annotated as , or with additional information added, such as the fact that our example is also a request for confirmation (and can be annotated as ). In this example, it is the ‘understanding’ part that is the most salient. Part of this problem is due to the lack of proper segmentation of utterances (where, in other annotation efforts, this utterance might have been split into two separate utterances). Consequently, in this case both parts of the annotation could be said to be equally valid. To achieve as much consistency as possible, we adopt the approach that backward-looking functions (such as agreement and understanding) are more information bearing than forward looking functions (which are statements and questions), where both are present and in conflict (as in the ‘understanding’ example). This is an arbitrary decision, and one that should be explored in further work. In the instances where two or more forward-looking functions are in conflict, such as in , we choose to rank the options: conventional, offers, commits, influence-on-listener, and then answers and assertions. In this case, ‘conventional’ is ranked above ‘assertions’, and so is identified as most salient. The complete mappings from amities ge labels to superclass labels, and switchboard labels to superclass labels, are ordered by the priorities we define above, where more specific mappings are applied ahead of more general mappings in all cases. The figures also include optional label infor-
mation (parts of the label that are shown in green, appear inside square brackets, and are indicated with a ‘*’). This means that this indicated part of the annotation may or may not be present. This mapping process accounts for over 99% of occurrences of utterances in the data set. Those labels not covered are ignored for the following experiments. This mapping leaves 10 switchboard labels with no corresponding amities ge labels. Of these, two of the labels in the switchboard corpus, and are encoded in the amities ge data set as ‘features’. Utterances that might be annotated as assertions or questions of some type may include a feature label that indicates that these are or . These feature assignments occur less than 20 times in our corpus, and we chose to ignore them at this time, classifying them according to the major salient role (as ‘question’, for example). The remaining 8 switchboard labels that have no mapping to amities ge labels are shown in Figure A.7. This mapping to a NULL superclass guarantees that we cannot successfully classify utterances with these labels.
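A sketch of how such an ordered mapping can be applied is given below. The rule format and the handful of rules shown are an illustrative subset only; the full tables are given in Figures A.2 to A.7.

# Ordered mapping rules: the first rule whose required attribute values are all
# present in the (simplified) xdml annotation wins, so more salient and more
# specific parts of the label are listed first.
RULES = [
    ({"Understanding": "Back-channel"},                "SC BACK-CHANNEL"),
    ({"Conventional": "Closing"},                      "SC GOODBYE"),
    ({"Influence-on-listener": "Information-request"}, "SC QUESTION"),
    ({"Forward-function": "Assert", "Answer": "True"}, "SC ANSWER"),
    ({"Forward-function": "Assert"},                   "SC STATEMENT"),
]

def to_superclass(annotation, rules=RULES):
    # Map a simplified xdml annotation (a dict of attribute values) onto a superclass.
    for required, superclass in rules:
        if all(annotation.get(attr) == value for attr, value in required.items()):
            return superclass
    return "SC NULL"

# The "goodbye" example from the text: the conventional part of the label is
# treated as more salient than the assertion part.
print(to_superclass({"Forward-function": "Assert", "Conventional": "Closing"}))  # SC GOODBYE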
Influence-on-listener = "Info-request-explicit" | "Info-request-implicit" | "Conf-request-implicit" | "Conf-request-explicit" | "Repeat-request" → Influence-on-listener = "Information-request"
Forward-function = "Assert" | "Reassert" | "Explanation" | "Rexplanation" | "Expression" → Forward-function = "Assert"
Figure A.1: xdml simplification steps
(1) [Forward-function = "Assert"]* Understanding = "Back-channel" → SC BACK-CHANNEL ← {acknowledge, response-acknowledgement, hedge, back-channel-in-question-form}
(2) [Forward-function = "Assert"]* Understanding = "Completion" → SC COMPLETION ← {collaborative-completion}
(3) [Influence-on-listener = "Information-request"]* Understanding = "Non-understanding" → SC NON-UNDERSTANDING ← {signal-non-understanding}
(4) [Forward-function = "Assert"]* Understanding = "Correction" → SC CORRECTION ← {correction}
(5) [Forward-function = "Assert"]* Understanding = "Repeat-rephrase" → SC REPEAT ← {repeat-phrase, summarise-reformulate}
Figure A.2: Understanding mapping table (xdml → superclass ← switchboard-damsl)
(6) Forward-function = "Assert" Answer = "True" Agreement = "Accept" → SC ACCEPT ← {accept}
(7) Forward-function = "Assert" Answer = "True" Agreement = "Reject" → SC REJECT ← {reject}
(8) Forward-function = "Assert" Answer = "True" Agreement = "Accept-part" | "Maybe" | "Reject-part" → SC MAYBE ← {maybe/accept-part}
Figure A.3: Agreement mapping table (xdml → superclass ← switchboard-damsl)
(9) Forward-function = "Assert" Conventional = "Opening" → SC HELLO ← {conventional-opening}
(10) Forward-function = "Assert" Conventional = "Closing" → SC GOODBYE ← {conventional-closing}
(11) Forward-function = "Offer" | Forward-function = "Commit" | Influence-on-listener = "Open-Option" → SC OFFER ← {offers-options-commits}
(12) [Forward-function = "Assert"]* Influence-on-listener = "Action-Directive" [Answer = "True"]* → SC ACTION-DIRECTIVE ← {action-directive}
Figure A.4: Conventional, offer, options, commits and action directive mapping table (xdml → superclass ← switchboard-damsl)
(13) Influence-on-listener = "Information-request" → SC QUESTION ← {yes-no-question, wh-questions, open-questions, or-clause, declarative-yes-no-question, declarative-wh-question, rhetorical-question, tag-question}
Figure A.5: Questions mapping table (xdml → superclass ← switchboard-damsl)
(14) Forward-function = "Assert" Answer = "True" → SC ANSWER ← {yes-answers, no-answers, affirmative-non-yes-answers, negative-non-no-answers, dispreferred-answers, other-answers, hold-before-answer}
(15) Forward-function = "Assert" → SC STATEMENT ← {statement-non-opinion, statement-opinion}
Figure A.6: Answers and Statements mapping table (xdml → superclass ← switchboard-damsl)
(16) SC NULL ← {abandoned/uninterpretable, appreciation, non-verbal, other, quotation, downplayer, thanking, apology}
Figure A.7: switchboard labels with no mapping to xdml labels (superclass ← switchboard-damsl)
B. CUE PHRASES SHARED BETWEEN THE switchboard AND icsi-mrda CORPORA, LISTED BY da LABEL
216 agree-accept Cue Phrase think so too so too think so too yes agree with that i agree with that that’s true agree with that that is true that is true i i agree that is true absolutely right absolutely right is true is true absolutely that’s true that’s true absolutely well that’s true exactly that’s true absolutely that’s true exactly true true that’s right well that’s true well that’s true that’s right i agree exactly that’s right that’s right you’re right you’re right sure i agree agree i i agree i i agree i agree yep agree
Utterance Length Long Long Long Long Long Long Long Long Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium
Frequency 4 4 3 3 4 3 6 4 4 4 3 3 3 3 7 7 2 5 5 4 4 5 1 3 9 6 1 1 1 4 4 1 4 9 1 1 5 5 1 2 4 3 3 4 4 4
Predictivity 100% 100% 100% 58.9% 57.1% 50% 46.2% 40% 100% 100% 100% 100% 100% 100% 87.5% 87.5% 84.6% 81.7% 81.5% 80% 80% 77.1% 76.9% 76.9% 76.4% 72.3% 71% 70.3% 67.1% 66.7% 66.7% 66.5% 64.5% 64.3% 63.8% 62.9% 62.5% 62.5% 61.5% 61% 60% 60% 60% 58.6% 55.6% 55%
Cue Phrase i agree you’re right you’re right definitely oh yes oh yes me too it’s true yeah you it’s true me too it is it does oh yes definitely yeah you of course course oh yes course of course yes right it does of course of course right exactly exactly exactly exactly true true true true definitely definitely definitely definitely absolutely absolutely absolutely absolutely yes yes yes yes
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short
Frequency 2 1 1 7 9 8 4 4 4 4 4 3 5 9 1 4 9 9 8 9 9 2 3 6 5 5 2 9 9 9 9 9 9 9 9 1 1 1 1 2 2 2 2 1 1 1 1
Predictivity 53.1% 52.2% 52.2% 50% 47.4% 44.4% 44.4% 44.4% 44.4% 44.4% 44.4% 42.9% 41.7% 40.9% 40% 40% 39.1% 39.1% 38.1% 36% 36% 34.4% 33.6% 33.3% 33.3% 33.3% 33.2% 84.1% 84.1% 84.1% 84.1% 81.8% 81.8% 81.8% 81.8% 76.9% 76.9% 76.9% 76.9% 68.3% 68.3% 68.3% 68.3% 60.7% 60.7% 60.7% 60.7%
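One plausible reading of the Frequency and Predictivity columns in these tables (a sketch of the idea only, not necessarily the exact computation used to generate them) is that Frequency records how often the cue phrase occurs with the listed da label in the corpus used to build the table, while Predictivity is the proportion of all occurrences of the cue that carry that label. The helper below, with its hypothetical cue_statistics name and (utterance, label) corpus format, illustrates that reading.

```python
from collections import Counter

def cue_statistics(corpus, cue):
    """Tally how often `cue` occurs per DA label and report its most predictive label.

    `corpus` is assumed to be an iterable of (utterance_text, da_label) pairs;
    returns (best_label, frequency_for_that_label, predictivity), or None if unseen."""
    per_label = Counter()
    for text, label in corpus:
        if cue in text:                 # naive substring match, for illustration only
            per_label[label] += 1
    total = sum(per_label.values())
    if total == 0:
        return None
    best_label, best_count = per_label.most_common(1)[0]
    return best_label, best_count, best_count / total

# e.g. cue_statistics(corpus, "i agree") -> ("agree-accept", 12, 0.76)  (illustrative values only)
```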
maybe
Cue Phrase        Utterance Length   Frequency   Predictivity
could be          Medium             2           64.5%
could be          Medium             2           52.5%
maybe that’s      Medium             3           42.9%
maybe             Short              4           90%
maybe             Short              4           90%
maybe             Short              4           90%
maybe             Short              4           90%
reject
Cue Phrase no no actually no no no well no well no well no well no actually no no uh no uh no uh no nope nope nope nope no no no no
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Short Short Short Short Short Short Short Short
Frequency 5 5 3 8 8 7 6 7 6 3 1 1 1 1 1 1 1 1 3 3 3 3
Predictivity 100% 100% 100% 88.9% 88.9% 87.5% 85.7% 77.8% 75% 75% 55.5% 37.1% 35.3% 34.5% 85.7% 85.7% 85.7% 85.7% 68.2% 68.2% 68.2% 68.2%
back-channel
Cue Phrase uhhuh uhhuh uhhuh uhhuh uhhuh uhhuh uhhuh uhhuh uhhuh uhhuh uhhuh uhhuh yeah uhhuh yeah huh yeah uhhuh yeah uhhuh yeah right yeah yeah right yeah right yeah uhhuh uhhuh uhhuh uhhuh huh huh huh huh yeah yeah yeah yeah yep yep yep yep right right right right
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short
Frequency 2 2 2 2 6 4 1 1 1 1 3 1 1 5 7 5 5 4 4 4 4 1 1 1 1 5 5 5 5 2 2 2 2 1 1 1 1
Predictivity 83.3% 81.8% 81.2% 77.8% 56.8% 56.5% 55.1% 52% 51.9% 50.6% 50% 48.3% 46.9% 38.5% 36.8% 33.3% 33.3% 90.8% 90.8% 90.8% 90.8% 86% 86% 86% 86% 59% 59% 59% 59% 47.7% 47.7% 47.7% 47.7% 37.2% 37.2% 37.2% 37.2%
appreciation
Cue Phrase well that’s a good wow that’s wow that’s well that’s a well that’s a that’s a good idea wow that’s that’s a good good point sounds good good to know like a good that’s great good point a good point that’s a good point a good point that’s cool that’s cool that’s a great that’s a great that’s amazing good idea that sounds good that sounds good good idea how interesting that’s amazing how interesting that’s amazing that sounds good that’s amazing that’s wonderful fascinating fascinating how interesting how interesting that’s wonderful that’s nice wow that’s that’s wonderful what a what a that’s nice well that’s interesting well that’s interesting well that’s good
Utterance Length Long Long Long Long Long Long Long Long Long Long Long Long Long Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium
Frequency 4 3 3 4 3 1 4 5 8 3 4 4 5 1 1 1 1 8 8 7 7 6 6 6 6 6 6 6 6 5 5 5 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
Predictivity 100% 100% 100% 80% 75% 66.7% 57.1% 55.6% 53.3% 50% 40% 40% 33.3% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100%
Cue Phrase that’s wonderful well that’s interesting wow that’s that’s not bad that’s not bad well that’s good well that’s good that’s nice that’s a good that’s a good a good idea a good idea that’s a good idea good point that’s great good point that’s great that’s good good idea good idea that’s interesting that’s interesting good point that’s interesting that’s great that’s great that’s interesting that’s good that would be great that’s cool that’s cool a good very interesting wow very interesting very good that sounds cool very good interesting amazing amazing be great be great cool interesting sounds good sounds good
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium
Frequency 3 3 3 3 3 3 3 3 3 3 2 2 1 3 3 1 3 3 3 3 2 2 3 4 4 4 4 3 8 8 8 4 7 7 7 7 7 2 5 1 9 9 2 2 3 1 1 1
Predictivity 100% 100% 100% 100% 100% 100% 100% 100% 97.5% 97.5% 95.8% 95.8% 94.7% 94.6% 94.4% 94.4% 94.3% 94.1% 93.8% 93.8% 93.1% 92.9% 92.3% 92.2% 91.8% 91.7% 90.6% 90% 88.9% 88.9% 88.9% 88.2% 87.5% 87.5% 87.5% 87.5% 87.5% 83.9% 83.3% 82.1% 81.8% 81.8% 81.5% 81.5% 81.1% 80.9% 80% 80%
Cue Phrase neat would be great would be great that’s funny that’s funny that’s nice that’s good that’s good good question good question that would be that sounds that’s funny that’s funny that’s very that’s so exciting really neat that’s kind exciting really neat that’s kind of great nice good great neat wow wow wonderful that’s really that would be that’s a be good be good good that’d that’d be that’s really nice sounds idea that’d be weird that’s a that’d funny good
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium
Frequency 8 8 8 4 4 4 5 5 7 7 2 1 6 6 6 3 3 3 3 3 3 3 1 1 4 1 8 2 2 9 9 2 5 1 1 2 2 2 1 2 7 5 2 5 5 2 1 1
Predictivity 80% 80% 80% 80% 80% 80% 79.7% 78.6% 77.8% 77.8% 76.9% 76.9% 75% 75% 75% 75% 75% 75% 75% 75% 75% 75% 74.3% 73.1% 72.9% 72.8% 72.7% 71.8% 70% 69.2% 69.2% 69% 68% 66.7% 66.7% 65.7% 65.6% 65.6% 65.2% 64.3% 63.6% 62.9% 62.5% 62.5% 61.9% 61% 60.9% 60.8%
Cue Phrase wonderful be fun be fun idea funny sounds point not bad not bad god that’s pretty deal god deal helpful helpful pretty good pretty good oh that’s oh that’s oh my oh my that’s pretty is good nice nice nice nice interesting interesting interesting interesting man man man man great great great great cool cool cool
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short Short
Frequency 9 3 3 5 1 2 3 5 5 5 3 3 6 3 3 3 3 3 1 1 5 5 4 3 7 7 7 7 3 3 3 3 3 3 3 3 6 6 6 6 3 3 3
Predictivity 60% 60% 60% 59.6% 56% 55.3% 52.1% 50% 50% 50% 50% 50% 42.9% 42.9% 42.9% 42.9% 37.5% 37.5% 35.8% 35.3% 33.3% 33.3% 33.3% 33.3% 100% 100% 100% 100% 84.4% 84.4% 84.4% 84.4% 75% 75% 75% 75% 62.3% 62.3% 62.3% 62.3% 60.4% 60.4% 60.4%
Cue Phrase cool wonderful wonderful wonderful wonderful wow wow wow wow good good good good
Utterance Length Short Short Short Short Short Short Short Short Short Short Short Short Short
Frequency 3 6 6 6 6 5 5 5 5 7 7 7 7
Predictivity 60.4% 60% 60% 60% 60% 53.2% 53.2% 53.2% 53.2% 51.1% 51.1% 51.1% 51.1%
backchannel-in-question-form
Cue Phrase        Utterance Length   Frequency   Predictivity
oh really         Medium             2           78.1%
oh really         Medium             2           78.1%
oh really         Medium             2           78.1%
oh really         Medium             2           78.1%
isn’t that        Medium             3           42.9%
isn’t that        Medium             3           37.5%
really            Short              4           80%
really            Short              4           80%
really            Short              4           80%
really            Short              4           80%
response acknowledgement
Cue Phrase        Utterance Length   Frequency   Predictivity
oh okay           Medium             2           96.6%
oh okay           Medium             2           96.5%
oh okay           Medium             2           96.5%
oh okay           Medium             2           96.3%
i see             Medium             3           94.8%
i see             Medium             3           94.2%
i see             Medium             2           93.5%
i see             Medium             2           92.6%
signal non understanding
Cue Phrase        Utterance Length   Frequency   Predictivity
the what          Medium             3           100%
pardon            Medium             3           100%
what’s that       Medium             4           50%
excuse me         Medium             9           47.4%
what’s that       Medium             4           44.4%
what’s that       Medium             4           44.4%
excuse me         Medium             1           43.5%
excuse            Medium             9           42.9%
excuse me         Medium             9           42.9%
excuse me         Medium             1           40%
what’s that       Medium             4           33.3%
pardon            Short              3           100%
pardon            Short              3           100%
pardon            Short              3           100%
pardon            Short              3           100%
what              Short              3           40.7%
what              Short              3           40.7%
what              Short              3           40.7%
what              Short              3           40.7%
offer
Cue Phrase why don’t put why don’t you why don’t you should why don’t you
Utterance Length Long Long Long Long Long Long
Frequency 1 6 6 4 3 1
Predictivity 67.9% 54.5% 50% 47.4% 37.5% 35.9%
apology
Cue Phrase sorry i i’m sorry i’m sorry i sorry about sorry i’m sorry i’m sorry sorry i’m sorry i’m sorry sorry excuse me excuse me sorry sorry sorry sorry
Utterance Length Long Long Long Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Short Short Short Short
Frequency 1 6 5 3 2 4 3 1 6 5 8 9 7 8 8 8 8
Predictivity 66.7% 60% 55.6% 100% 95.5% 80.7% 77.6% 72.2% 71.6% 68.8% 67.2% 39.1% 36.8% 86.6% 86.6% 86.6% 86.6%
thanking
Cue Phrase        Utterance Length   Frequency   Predictivity
thanks for        Long               3           75%
thank you         Medium             3           100%
thank             Medium             4           97.6%
thank you         Medium             4           97.6%
thank             Medium             4           83.3%
thank you         Medium             4           83%
thank you         Medium             3           81.4%
thanks            Short              2           100%
thanks            Short              2           100%
thanks            Short              2           100%
thanks            Short              2           100%
affirmative non yes answers
Cue Phrase        Utterance Length   Frequency   Predictivity
i did             Medium             4           80%
other answers
Cue Phrase        Utterance Length   Frequency   Predictivity
i have no idea    Medium             8           100%
have no idea      Medium             8           100%
have no idea      Medium             8           100%
i have no         Medium             1           90.9%
i have no         Medium             1           90.9%
rhetorical questions
Cue Phrase        Utterance Length   Frequency   Predictivity
how can           Long               6           40%
who knows         Medium             3           50%
who knows         Medium             3           50%
why not           Medium             3           33.3%
open questions
Cue Phrase        Utterance Length   Frequency   Predictivity
what about the    Long               3           33.3%
or clauses
Cue Phrase        Utterance Length   Frequency   Predictivity
or is it just     Long               4           100%
or are you        Long               3           100%
or are            Long               1           90.9%
or do you         Long               1           83.3%
or should         Long               6           75%
or are you        Long               3           75%
or do             Long               1           73.7%
or is that        Long               7           70%
or is it          Long               9           64.3%
or is             Long               2           63.6%
or is that        Long               8           53.3%
or do you         Long               1           52.6%
or are            Long               1           45.8%
or should         Long               7           38.9%
or is             Long               2           38.6%
or are they       Long               3           37.5%
or is it          Long               1           37.1%
or do             Long               1           36.8%
or was            Medium             5           100%
or was            Medium             5           100%
or are            Medium             3           100%
or are            Medium             3           100%
or is it          Medium             4           80%
or is it          Medium             6           75%
or is it          Medium             6           75%
or is             Medium             1           65%
or is             Medium             1           65%
wh-questions
Cue Phrase long does how long does does it take long does it long does it take how long does it how long does when is why do you and what are how does how is the how big is so what was how is big is how often so what was what what are what does what are they when is how many how much what kind what kind of what did and how do it take and how do you they doing how much do where where what’s your how long how is how do you and what are what was the so how do
Utterance Length Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long
Frequency 1 1 9 9 9 9 5 5 5 4 3 3 3 3 3 3 3 3 3 8 7 6 1 1 9 8 4 4 1 3 3 3 3 3 8 1 1 5 5 5
Predictivity 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 100% 88.9% 87.5% 85.7% 84.6% 82.4% 81.8% 80% 80% 80% 78.6% 75% 75% 75% 75% 75% 72.7% 71.4% 71.4% 71.4% 71.4% 71.4%
Cue Phrase what was what do you so what’s what are what do what’s the but how what is the how how what do you mean how did and what’s where does so what’s the what’s what so what’s the and how do so how do you so what’s would you do what’s it so how so what are how why would how does what does what did how do what what what are you how often what what are what did you so how do what would you how much is what sort what what do you what does that mean what sort of why do how did
Utterance Length Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long
Frequency 7 1 1 1 1 1 8 8 6 1 7 5 5 5 5 5 5 5 1 3 3 1 4 9 6 1 2 1 1 2 9 6 5 5 5 4 4 3 3 3 3 3 3
Predictivity 70% 68.8% 68.8% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 65% 63.6% 62.5% 62.5% 62.5% 62.5% 62.5% 62.5% 62.5% 60% 60% 60% 59.1% 57.1% 56.5% 54.5% 52.6% 52.5% 52.4% 51.7% 51.3% 50% 50% 50% 50% 50% 50% 50% 50% 50% 50% 50% 50% 50%
Cue Phrase when do what do you what was the what is what’s who you doing what does that what are do you mean well how how would so what are what would how would you how big what’s the what what is what do so how what’s a wha what what if what’s what’s what is the how do we would it be so where where did
Utterance Length Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long
Frequency 3 2 1 9 1 9 4 4 5 1 3 1 5 5 6 4 3 5 3 1 3 3 1 8 7 2 1 4 4 3
Predictivity 50% 49.2% 48.1% 47.4% 45.2% 45% 44.4% 44.4% 43.7% 42.9% 42.9% 41.7% 41.7% 41.7% 40% 40% 39.8% 38.5% 38.4% 37.8% 37.5% 37.5% 36.6% 36.4% 35% 33.9% 33.3% 33.3% 33.3% 33.3%
Cue Phrase what’s your how does what are where’s what are what do you what do you so what’s where’s what’s the how many how many what’s the what about what do what do what’s what’s what what what’s that where is where is what is what what was what is it who’s what is what was what is it who’s what’s that what’s that why why why why
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Short Short Short Short
Frequency 4 3 1 4 1 6 6 3 3 6 6 4 4 4 7 8 2 3 4 6 4 3 1 7 8 3 3 7 9 3 3 3 3 1 1 1 1
Predictivity 100% 100% 80% 80% 78.6% 75% 75% 75% 75% 66.7% 66.7% 66.7% 66.7% 66.7% 63.6% 61.5% 60% 57.1% 57.1% 50% 50% 50% 48.1% 47.7% 44.4% 42.9% 42.9% 41.2% 37.5% 37.5% 37.5% 33.3% 33.3% 68.4% 68.4% 68.4% 68.4%
yes-no questions
Cue Phrase are there and do you have have you i mean do i mean do you is it the but are there any is there any do you think do you know and do you was it i mean do you think that would is there do you know what do they mean do you and do you did you have mean do is there a do we do you have but do you but do you do you have a is that the does that do you see does is it like do they have is it should we do you think that so do you would you do you know do you should do isn’t but are did but do are are
Utterance Length Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long
Frequency 4 3 6 4 4 4 4 9 6 6 6 3 3 3 3 1 7 4 4 4 4 4 4 8 1 3 3 3 3 3 3 2 4 4 2 5 4 3 3 1 3 1 5 4 4 1 3 3
Predictivity 100% 100% 85.7% 80% 80% 80% 80% 75% 75% 75% 75% 75% 75% 75% 75% 70.6% 70% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 61.5% 61.1% 60% 60% 60% 60% 60% 60% 57.1% 57.1% 57.1% 56.8% 50% 50% 50% 50% 48% 47.1% 45.5% 45.2% 44.4% 44.4% 42.9% 42.9% 42.9%
Cue Phrase do you do you do you have so are so are is this is it the so do you has did you is the so do are are we would does it but do is that what ever is that what is that do you do you did you that what is there do you know are you do they is is have you do you know was it did wasn’t do you know have you have you can you is that are you
Utterance Length Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Long Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium
Frequency 3 1 5 6 6 6 3 3 4 4 4 2 5 8 3 3 3 3 3 1 2 5 5 4 4 4 7 3 5 5 4 4 1 3 3 3 3 3 3 1
Predictivity 42.9% 42.5% 41.7% 40% 37.5% 37.5% 37.5% 37.5% 36.4% 36.4% 36.4% 36.1% 35.7% 33.3% 33.3% 33.3% 100% 100% 100% 94.1% 85.2% 83.3% 83.3% 80% 80% 80% 77.8% 75% 71.4% 71.4% 66.7% 66.7% 61.1% 60% 60% 60% 60% 60% 57.7% 55.6%
Cue Phrase did you do can is it are was is is is there can you do you do you are they is do they
Utterance Length Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium Medium
Frequency 6 3 1 2 1 7 5 5 4 2 7 4 6 3
Predictivity 54.5% 53.4% 52.6% 51.3% 50% 50% 50% 50% 50% 49% 46.7% 44.4% 43.3% 42.9%
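Used as a stand-alone resource, tables such as these can be applied directly to unseen utterances by letting the highest-predictivity matching cue decide the label. The sketch below illustrates that idea under the assumption that the tables have been flattened into (cue phrase, da label, predictivity) triples; the function name classify_with_cues is illustrative only, and a fuller treatment would also respect the Utterance Length column.

```python
def classify_with_cues(utterance, cue_table):
    """Return the DA label of the highest-predictivity cue phrase found in `utterance`,
    or None when no cue matches.  `cue_table` is assumed to be a list of
    (cue_phrase, da_label, predictivity) triples drawn from tables like those above."""
    matches = [(predictivity, label)
               for cue, label, predictivity in cue_table
               if cue in utterance]                 # naive substring match, for illustration
    return max(matches)[1] if matches else None

# Illustrative values only:
# classify_with_cues("yeah i agree with that",
#                    [("i agree", "agree-accept", 0.76),
#                     ("uhhuh", "back-channel", 0.83)])
# -> "agree-accept"
```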