Coding Schemas for Dialogue Moves

Staffan Larsson
[email protected]

Department of Linguistics, Göteborg University

January 13, 1998
1 Introduction

The goal of this paper is to investigate and compare coding schemas for dialogue moves1 for possible use in the S-DIME (Swedish dialogue move engine) project. The aim of the project is to build a dialogue move engine which simulates changes of information states in dialogue participants engaging in Swedish dialogues, and which sees dialogue moves as an intermediary level of description between spoken utterances and information state updates. For this, we will need to specify a computational model of dialogue moves, which will involve integrating a number of developments in the literature and in current systems.

Why do we need a coding schema for dialogue moves in task-oriented dialogues? There are several reasons. Firstly, we will at some point need to specify which dialogue moves to include in our model and how to define them. One way of trying out such a model is to look at real dialogues and see how our categories fit reality, and whether there is an intersubjective way of recognizing the moves in the model. Trying out different schemas at an early stage of the project will provide an overview of the range of variation in this field and some empirical background for a computational model. Secondly, if a sufficient amount of coding is done, we may obtain various types of data useful in an implemented dialogue system, such as conditional probabilities for predicting and interpreting moves.

The paper starts with some background, and then moves on to consider various proposed coding schemas for dialogue moves. Three of these are singled out and described in more detail, with examples from both their native corpora and an "Airplane" dialogue from the Göteborg Spoken Language Corpus. These schemas are then compared; some general issues in designing schemas for dialogue moves are isolated, and their consequences for reliability, reliability testing, computational tractability of move recognition, and requirements on tools for coding dialogue are discussed.
Work on this paper has been supported by S-DIME (Swedish dialogue move engine), NUTEK/HSFR Language Technology project F305/97.
1 There is a proliferation of names for this level of dialogue structure, which perhaps is symptomatic of the lack of agreement between researchers. Apart from dialogue moves (Carlson [16]) we have dialogue objects (Jönsson [26]), communicative actions (Allen [2]), communicative acts (Allwood [5] and Traum [39]), and the speech acts of Austin and Searle. Of course, there are also sometimes theoretical differences between many of these accounts. As noted, we will use the "dialogue move" terminology, which is derived from Wittgenstein's "language game" metaphor, but this should not be taken as indicating any specific theoretical standpoint.
1.1 The Airplane corpus

The investigation will include examples from the "Airplane" task dialogues in the Göteborg Spoken Language Corpus. The Airplane subcorpus consists of six dialogues containing a total of just over 10 000 words. In the task, one of the participants (I) has an assembled model airplane and the other (F) an unassembled one. Without looking at the assembled plane, F has to assemble his plane using the instructions given by I. Before the dialogue starts, I is allowed to look at instructions for how to assemble the plane. During the conversation, the participants cannot see each other (apart from eye contact in half of the dialogues). The dialogues are recorded on audio and video2 and have been transcribed using modified standard orthography following the standard specified in ([31]).
1.2 The Kappa statistic

An important issue in choosing a coding schema is reliability. Is the schema reliable, i.e. can the coding results be replicated by researchers other than those who have devised the schema? If so, then (arguably) the results derived from coding are more likely to reflect facts. Unfortunately, many schemas do not include useful measures of reliability. This problem has only recently been recognized, and a proposed solution, the Kappa statistic ([12]), seems to have been generally accepted. The advantage of Kappa is that it normalizes pairwise agreement between coders with respect to the expected random agreement, i.e. the amount of agreement that the coders would reach if they annotated a dialogue by chance3.

For example, if we have a schema S with two categories A and B, we can measure Kappa by first letting two (or more) coders annotate a dialogue. We can then measure the relative distribution between categories A and B in the annotated dialogue. Suppose that 75% of all utterances4 are tagged with A, and 25% with B. This means that the expected chance agreement is P_E = 0.75 · 0.75 + 0.25 · 0.25 = 0.62. Suppose also that the actual pairwise agreement P_A is 0.8, i.e. the coders have annotated 80% of the utterances with the same category. We can now compute κ:

    κ = (P_A − P_E) / (1 − P_E) = (0.8 − 0.62) / (1 − 0.62) = 0.18 / 0.38 ≈ 0.47

It also seems generally accepted that Kappa values above 0.67 allow tentative conclusions to be drawn, while values above 0.8 are considered reliable. In the example above, the schema would not be considered reliable even though a pairwise agreement of 80% was reached, since the expected chance agreement was so high.

A problem with the Kappa statistic, however, is that it is designed for mutually exclusive categories. This makes it a bit complicated (but not impossible) to apply it to schemas allowing for multifunctionality of utterances. One solution is to allow only limited multifunctionality by dividing a schema into independent parts (or "layers"), each of which contains mutually exclusive categories. Kappa statistics can then be provided for each individual part. Another solution is

2 Unfortunately, only one participant (I) is captured on video. These recordings were originally made with a different purpose than ours.
3 Not entirely by chance; the relative distribution between categories must be assumed to be constant.
4 This assumes that we have already divided the dialogue into utterances.
2
to generate the set of all combinations of categories (i.e. the powerset of the set of categories) and treat each such combination as a single category when computing κ.
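For concreteness, the computation above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the implementation behind any of the cited reliability figures; estimating P_E from the pooled category distribution over both coders follows the worked example above.

```python
from collections import Counter

def kappa(coder1, coder2):
    """Kappa for two coders' parallel annotations of the same utterances.

    P_E is estimated from the pooled distribution of categories over
    both coders' annotations, as in the worked example above.
    """
    assert len(coder1) == len(coder2)
    n = len(coder1)
    # Observed pairwise agreement P_A
    p_a = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Expected chance agreement P_E from the pooled category distribution
    pooled = Counter(coder1) + Counter(coder2)
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_a - p_e) / (1 - p_e)
```

With perfect agreement κ is 1; with agreement exactly at chance level it is 0.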
2 Background: From Taxonomies to Coding Schemas

Since the theory of speech acts was conceived, there have been numerous proposals concerning which speech acts there are, how they are to be defined, and how they are related to each other. The canonical taxonomy is perhaps that of Searle [35] (S stands for speaker and H for hearer; examples are given in parentheses):
representatives: commit S to the truth of the expressed proposition (assert, state, hypothesize)
directives: S attempts to get H to do something (command, request)
commissives: commit S to some future course of action (promise, undertake)
expressives: express a psychological state in S regarding some state of affairs (thank, deplore, welcome)
declarations: directly affect the world (fire, declare war, wed)
Starting from Searle's taxonomy, Vanderveken [42] attempts to parameterize the definitions of English speech act verbs in a very detailed and formal manner. However, his taxonomy is very complex - about 107 illocutionary speech act categories are included - which would be likely to make reliability a problem if it were used for coding. The aforementioned taxonomies have all more or less sprung from the rationalistic traditions in linguistics, and few, if any, attempts have been made to apply them to real data. This is changing, however, and in the "second wave of corpus linguistics", several taxonomies in the form of coding schemas for dialogue moves have been developed more or less independently and used for coding dialogues:
HCRC have developed a schema with three complementary structural levels (move, game and transaction) for coding dialogue structure in the Map Task corpus. Reliability has been measured using the Kappa statistic ([13]), indicating various levels of agreement for different schemes, but generally good.

The DAMSL (Dialogue Act Markup in Several Layers) schema is the product of the Discourse Resource Initiative (DRI), consisting of researchers from several dialogue projects worldwide. The goal of DRI is to provide a standard for coding of dialogue acts, which if necessary can be augmented with further subdivisions of the given categories. Detailed Kappa statistics are given, indicating various levels of success.

The VERBMOBIL project ([24]) has developed a large and complex coding schema for dialogue acts, encompassing almost 60 different categories, for coding appointment scheduling dialogues. This coding has been used for obtaining weighted defaults used in interpreting dialogue acts. Statistics on reliability are not available.

Linköping University (Ahrenberg, Dahlbäck, Jönsson) have coded a corpus of WOZ dialogues using two different schemas - one very simple (henceforth referred to as LINLIN1) and one slightly more complex (LINLIN2). Reliability measures are given for the simple schema, though not in terms of Kappa.
In connection with the TRAINS project [33], Traum ([39]) has developed five complementary coding schemas for spoken dialogue structure: coherence, grounding, surface form, illocutionary function5 and argumentation structure. A part of the TRAINS corpus has been annotated using these schemes. No reliability statistics are given.6

Allwood [5] also gives a parameterized account of communicative acts, where the parameters (or "dimensions") are expressive and evocative function. This model includes an account of communication management which has subsequently been refined and specified in two complementary coding schemas, Own Communication Management (OCM) and Interaction Management (IM).
In addition to these there are a few more schemas which I haven't looked at: Walker and Whittaker (1990) [41], Sutton et al. (1995), Nagata and Morimoto (1993) and Condon and Cech (1995). There is also the COCONUT schema, an extension of DAMSL [17]. Apart from these schemas, there is a wealth of taxonomies not directly intended for coding purposes, e.g. Hancher [20], Sinclair and Coulthard [36], Bilange [8], Bunt [9], Wachtel [40], and Stenström [37].
3 Three coding schemas for dialogue moves In this section, three of the schemas listed above will be presented: the LINLIN, HCRC and DAMSL schemas. For a more thorough description, the reader is advised to read the original documents and manuals. Some notes on terminology:
Multifunctionality of utterances refers to the fact that one single utterance can have several functions at once. Some schemas allow utterances to be coded for several move categories, and some allow only one category per utterance.

Relational categories are categories which relate to two separate utterances rather than only one. For example, if a schema allows coding of which question an answer is an answer to, "answer" is said to be a relational category in that schema.

Discontinuous moves are moves which are performed by uttering a stretch of speech which is interrupted by some other speaker and then resumed, as in this fictitious example:

A: I would like an // um
B: yes?
A: ice cream

Here, clearly "I would like an // um ice cream" is a single move (some kind of indirect request), and in order to encode this fact we need to be able to apply the category "request" (or similar) to a discontinuous stretch of speech. This could be written e.g. request(utt1,utt3).

5 In recent publications (e.g. [34]), Poesio & Traum have replaced this level with the Forward Looking Function level of the DAMSL schema.
6 Apart from this schema, there is another taxonomy of Conversation Act types which is frequent in publications relating to TRAINS and many publications by Traum, consisting of Turn-taking acts, Grounding acts, Core Speech Acts and Argumentation acts.
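A coding tool has to represent such codings somehow. One possible data model for relational and discontinuous codings is sketched below; it is purely illustrative, and the class and field names are my assumptions, not taken from any of the schemas discussed here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MoveCoding:
    """One move annotation over one or more (possibly non-adjacent) utterances."""
    category: str                      # e.g. "request", "answer"
    utterance_ids: list                # discontinuous moves span several utterance ids
    responds_to: Optional[str] = None  # relational categories point at another utterance

# The fictitious example above: utt1 and utt3 jointly realise a single request
request = MoveCoding("request", ["utt1", "utt3"])
# A relational coding: B's "yes?" related back to utt1
check = MoveCoding("check", ["utt2"], responds_to="utt1")
```

Allowing a list of utterance ids rather than a single id is what makes discontinuous moves like request(utt1,utt3) expressible.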
A distinction can be made between descriptive levels (or layers) and structural levels. The latter refers to parts of schemas concerned with a particular structural unit of dialogue; a schema might e.g. contain an utterance level to describe utterance units. Descriptive levels, on the other hand, refer to parts of schemas concerned with describing different aspects of dialogue, regardless of structural level. Usually, each descriptive level corresponds to a single structural level, but a single structural level may correspond to several descriptive levels (e.g. the structural level "utterance" may correspond to the descriptive layers "illocutionary function" and "topic"). Unless otherwise indicated, the terms "level" and "layer" both refer to descriptive levels.
3.1 LINLIN

3.1.1 LINLIN: Background

The two LINLIN schemas (LINLIN1 and LINLIN2) were developed for annotating information retrieval dialogues collected by means of WOZ experiments with several different simulated systems. The dialogues were conducted using written language. Unfortunately, no coding manual is available. The information below is taken from various sources ([25], [26], [27], [7]).
3.1.2 LINLIN: The schemas

The first and simpler schema (LINLIN1) has only two "basic moves", Initiative and Response, where initiatives introduce goals and responses satisfy goals. The second schema (LINLIN2) contains two different initiatives, Update (U) and Question (Q). Update occurs when the user provides information to the system. Question occurs when the user obtains information from the system or when the user answers a clarification request. Responses are called "Answer" in the second schema; an Answer occurs when the system gives a database answer or when the user answers a clarification request. (Note that while the schemas themselves are very similar, the kinds of definitions they use are very different.) The simple schema is interesting in that it is the most rudimentary coding of dialogue moves possible (provided one does not take into account the discourse-management moves, which were only introduced to exclude them from analysis, see [7]). Of course, coding done using the LINLIN2 schema is easily reduced to LINLIN1 by collapsing the two Initiative categories. The LINLIN2 schema looks like this:
Initiative
  - Update (U): User provides information to the system
  - Question (Q): User obtains information from the system
Response
  - Answer (A): System database answer, answer to clarification request
Discourse management
  - Greeting
  - Farewell
  - Discourse Continuation (DC)

Apart from the abovementioned categories, the LINLIN2 schema also includes a Topic layer. Relational coding and discontinuous moves are not allowed by this schema. Multifunctionality of utterances, however, is supported7, but I have not been able to find any examples of this.
3.1.3 LINLIN: Example

Here is a sample of WOZ dialogue in the "Bildata" (Cardata) domain, annotated with the LINLIN2 schema (and translated into English)8. The coding, made using the Nb (Nota Bene) annotation tool ([19]), has been converted (by the present author) from Nb's SGML-like format. Information about who the "speaker" is has been added, and information about topic has been removed. The initials S and U stand for "System" and "User", respectively.

S: [Welcome to Cardata]Greeting
U: [show mercedes]Question
S: [Wait...]DC [Cardata can answer questions about a number of car models, concerning manufacturer, model, year, country of manufacturing, disposition to rust and size class, as well as questions about price, security, space and technical data.9]Answer [Any particular wishes?]Question
U: [cost and space]Answer
S: [Wait...Searching...]DC
[
+---------------+----------+------+-------------+-----------+-------+
|manufacturer   |model     |year  |price        |space,coupe|luggage|
+---------------+----------+------+-------------+-----------+-------+
|Mercedes       |200       | 1982 |        66400|          4|    370|
|Mercedes       |200       | 1985 |        90000|          4|    370|
|Mercedes       |190       | 1988 |       149000|          3|    320|
|Mercedes       |200       | 1988 |       180300|          4|    370|
+---------------+----------+------+-------------+-----------+-------+
]Answer

In this first example, a difficult problem for single-function schemas rears its ugly head: if A asks B something and B asks for a clarification, A's next utterance will be an answer to the clarification request but also a reformulation of the previous question (e.g. "cost and space"), and will thus be both a question and an answer. In a single-function schema, this cannot be expressed unless a new category is introduced for these cases.
7 (From [25]:) "Utterances are not analysed as dialogue objects, but as linguistic objects which function as vehicles of one or more moves."
8 Thanks to Arne Jönsson for providing this dialogue transcript.
9 While this utterance doesn't really answer the user's question, it still counts as an answer in the LINLIN2 schema. If the system cannot answer a question, it will give information about its own capabilities, which counts as an answer with a "System" topic rather than a "Task" topic. For a complete list of topics see Section A.1.
3.1.4 LINLIN: Application to airplane dialogue

We have applied the LINLIN2 schema to a segment of instructional dialogue from the Airplane corpus. In doing this, one must of course keep in mind that the LINLIN schema is not intended for this kind of dialogue (i.e. spoken human-to-human instructional dialogue) but for a simpler type (written human-to-(simulated)-computer information retrieval dialogue). Below is a slightly simplified and translated10 version of this segment with LINLIN2 coding. (In the original, underlining indicates overlapping speech; this marking is not reproduced here.)

A: [ok so i have a plane in front of me which / has um / two wheels / and it's like a wheel chassi / and / a long body and there's TWO wings]Update [you see these um long]Question
B: [is it a biplane or]Question
A: [no]Answer [/ it's just two wings next to each other so to speak / these / with seven holes / those sticks]Answer [/ do you see]Question
B: [yeah]Answer
A: [they are wings]Answer?
B: [that's wings]? [and there's one of these five+ / +hole things]?
A: [so to speak]? [/ is supposed to be placed at the back as / tail wing]Update
B: [mhm]?
A: [um / the ones with THREE holes / is the propeller / there i+]Update [and then there is one more with three holes, right?]Question
One can note several problems when applying this schema to the segment above. Firstly, feedback utterances like "mhm" and the rephrase "that's wings" do not fit into any of the categories given. Secondly, the phrases "these / with seven holes / those sticks" - "they are wings" - "so to speak" seem to form a coherent unit which we would perhaps like to assign a single classification. Assuming this could be done, which classification should we choose? It seems that this unit is some kind of further specification of the answer "no / it's just two wings next to each other so to speak" (which, by the way, in itself contains a simple yes/no-answer and a clarification). Thus, it seems to fall somewhere between the Answer and Update categories. From this we may conclude that the LINLIN2 schema is simply not sufficiently fine-grained and specific to capture some of the interesting phenomena in spoken human-to-human instructional dialogue (which, again, is not surprising given what the schema was designed for). Also, there is no way of indicating e.g. that "yeah" is an answer to "/ do you see".
3.1.5 LINLIN: Reliability

Unfortunately, no Kappa statistic is available for the LINLIN schemas, but for LINLIN1 (the simpler version), a pairwise agreement of 97% was achieved. If we hypothesise that both categories (Initiative and Response) were equally common, the expected chance agreement would be 50%, which would give κ = (0.97 − 0.5) / (1 − 0.5) = 0.94, which is very good.

10 The original transcription, reproduced in Appendix B, is in Swedish and uses the MSO (Modified Standard Orthography) transcription format (see [31]).
3.1.6 LINLIN: Relation to computational models

Jönsson ([25]) claims that, while plan-based approaches to dialogue management may be useful in some cases, they are not required in the domain of "simple service systems" [21]. These systems require only that the user can identify an object to the system; once this has been done, the system can provide information about this object. Thus, the simpler approach of a dialogue grammar is used. The Dialogue Manager records instances of dialogue objects (moves, discourse segments and dialogues) in a dialogue tree as the dialogue proceeds, using a dialogue grammar. The dialogue object definitions consist of a set of parameters (e.g. initiator, responder, context and content) and the values they can take. Typically, the content parameter of a move holds information about focus structure using a set of primary referents ("Objects") and a complex predicate ascribed to this set ("Properties"). There are two rules for maintaining and updating the focus structure:
- Everything not changed in an utterance is copied from the latest Initiative-Response node in the dialogue tree to the current node.
- The value of "Objects" will be updated with database information if the system provides it.
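A minimal sketch of how these two rules might be implemented; this is purely illustrative, and the dict-based representation and names such as `objects`, `properties` and `db_result` are my assumptions, not the LINLIN implementation.

```python
def update_focus(previous_node, utterance_changes, db_result=None):
    """Compute the current node's focus structure from the previous
    Initiative-Response node and the changes expressed in the utterance."""
    # Rule 1: everything not changed in the utterance is copied from the
    # latest Initiative-Response node in the dialogue tree.
    current = {
        "objects": {**previous_node["objects"], **utterance_changes.get("objects", {})},
        "properties": {**previous_node["properties"], **utterance_changes.get("properties", {})},
    }
    # Rule 2: "Objects" is updated with database information if the system provides it.
    if db_result is not None:
        current["objects"].update(db_result)
    return current

# "cost and space" changes only the aspect; the Mercedes referent carries over
previous = {"objects": {"manufacturer": "Mercedes"}, "properties": {"aspect": "price"}}
current = update_focus(previous, {"properties": {"aspect": "space"}})
```

On this representation, the answer-to-clarification example from Section 3.1.3 simply overrides one property while inheriting the rest of the focus structure.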
The dialogue objects are domain-dependent, and the Dialogue Manager is customized to each new application by setting them. There are three types of dialogue objects: D-objects, IR-objects and Move-objects, corresponding to dialogues, initiative-response segments and single utterances. An example of a template IR-object is shown in Figure 1.

Class:            IR
Context:          link to parents
Children:         list of child nodes
Type:             type of IR-unit, e.g. Q/A
Initiator:        e.g. User
Initiative type:  e.g. Question
Initiative topic: e.g. Task
Responder:        e.g. User
Response type:    e.g. Answer
Response topic:   e.g. Task
Objects:          [ Manufacturer: e.g. Volvo
                    Model:        e.g. 244
                    Year:         e.g. 1978 ]
Properties:       [ Aspect: e.g. price
                    Value:  e.g. < 70 000 cr. ]

Figure 1: A (slightly simplified) template for LINLIN IR-objects, customized for a car-related domain.

This is a rather rudimentary computational model. To generalise the notions used in the LINLIN system, one can view the dialogue object definitions as simple representations of the common ground of the participants11. Seen in this way, the LINLIN model does not take into account additional information about mental states

11 The content parameter in particular seems to be representable as a prototypical DRS, but this is speculation.
such as desires, intentions, private beliefs and beliefs about the other participant's beliefs, nor phenomena such as grounding and feedback. Of course, this is not an accident but an intentional strategy, motivated by the argument that this additional information is not necessary for the kind of dialogues which the LINLIN system is concerned with.
3.2 HCRC

3.2.1 HCRC: Background

The theory of dialogue games stems from Wittgenstein [43], and has been used as a basis for several more-or-less formal accounts of dialogue (Carlson [16], Levin and Moore [29], Power [32], Sinclair and Coulthard [36] and others). The games and moves used in the HCRC schema are extensions of Houghton's interaction frames [22]. The Map Task corpus, for which the HCRC schema has been developed, consists of instructional dialogues where one person (the instruction giver) explains to another person (the instruction follower) how to draw a line from one point to another on a map. The giver's map and the follower's map are slightly different, and they cannot see each other's maps. A detailed coding manual for dialogue moves is available ([11]), providing detailed descriptions, a decision tree and several examples. Additional information on the HCRC schema is available in [28] and [23].
3.2.2 HCRC: The schema

In the HCRC schema there are three levels of dialogue structure: move, game and transaction. Transactions are subdialogues that accomplish one major step in achieving the task. Transactions consist of games. A game is a sequence of utterances starting with an initiation and ending when the goal of the initiation has been achieved or when the game is abandoned. Games, in turn, are made up of moves (initiations, e.g. questions, and responses, e.g. answers). While moves are mutually exclusive, games usually overlap with or are embedded within other games. The Map Task move schema contains 12 different move types based on models of dialogue such as the "interaction frames" of Houghton [22]. The moves are listed below together with short explanations taken from [11].
Initiating moves
  - Instruct: S tells A to carry out any action other than the one implicit in queries
  - Explain: S states info which has not been elicited by A
  - Check: S requests A to confirm info that S has some reason to believe, but is not entirely sure about
  - Align: S checks attention or agreement of A, or A's readiness for the next move
  - Query-yn: S asks A any question which takes a "yes" or "no" answer and does not count as a Check or Align
  - Query-w: Any query which is not covered by the other categories

Response moves
  - Acknowledge: A verbal response which minimally shows that S has heard the move to which it responds, and often also demonstrates that the move was understood and accepted
  - Reply-y: Any reply to any query with a yes-no surface form which means "yes", however that is expressed
  - Reply-n: Any reply to any query with a yes-no surface form which means "no", however that is expressed
  - Reply-w: Any reply to any query which doesn't simply mean "yes" or "no"
  - Clarify: A reply to some kind of question in which the speaker tells the partner something over and above what was strictly asked

Ready: A move which occurs after the close of a dialogue game and prepares the conversation for a new game to be initiated
Since games are always classified according to the name of the initiating move, game coding basically means determining the start and end of each game, and deciding whether the game is embedded in another game or is a top-level game.
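To make the game level concrete, here is one possible representation of coded games as labelled spans over a move sequence. It is purely illustrative; the class and field names are my assumptions, not the HCRC tool format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Move:
    speaker: str     # "G" or "F"
    category: str    # e.g. "Instruct", "Acknowledge"
    text: str

@dataclass
class Game:
    game_id: str     # e.g. "G3"
    start: int       # index of the initiating move
    end: int         # index of the last move in the game
    embedded_in: Optional[str] = None  # id of the enclosing game, if any

    def label(self, moves):
        # Games inherit their category from the initiating move
        return moves[self.start].category

moves = [
    Move("G", "Instruct", "we are going to go due south ..."),
    Move("F", "Check", "due south and then back up again ?"),
    Move("G", "Reply-Y", "yeah"),
    Move("F", "Acknowledge", "right , okay"),
]
g2 = Game("G2", 0, 3)
g3 = Game("G3", 1, 3, embedded_in="G2")
```

Since games may overlap and embed, spans plus explicit game ids (rather than a strict tree) are needed; this is also why the Kappa statistic is awkward to apply at the game level.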
3.2.3 HCRC: Example

The coded dialogue example below has been reconstructed from the resources available at http://www.ltg.ed.ac.uk/dmck/mtparse sub6.html12. The format has been slightly altered to increase readability. The coding has been converted from SGML-style tags. The initials G and F stand for "instruction giver" and "instruction follower", respectively.

G: {G1:uncoded13 [ okay , ]Ready [ starting off // we are // above // a caravan park . ]Instruct
F: [ // mmhmm . ]Acknowledge }G1
G: {G2:uncoded [ // we are going to go // due south // straight south // and // then we +'re going to g-- // turn straight back round and head north // past an old mill // on the right // hand side . ]Instruct
F: {G3:Check [ // due south and then back up again ? ]Check
G: [ yeah , // ]Reply-Y [ south and then straight back up again with an old mill on the right and you +'re going to pass on the left-hand side of the mill ]Clarify
F: [ // right , okay ]Acknowledge
G: {G4:Align [ // okay ? // ]Align }G4 }G3 }G2 {G5:uncoded [ and then we +'re going to turn // east . ]Instruct
F: [ // mmhmm . ]Acknowledge

12 The dialogue index number is q1ec1.
13 For some reason, the rule that games inherit their category from the category of the initial move is not applied in some cases. Instead, the move category is left unspecified. Indices (e.g. G1) are necessary to indicate where games begin and end, since games are allowed to overlap.
3.2.4 HCRC: Application to airplane dialogue

The move and game levels of the HCRC schema have been used in the same manner as described in Section 3.1.4 to code a segment of Airplane dialogue. An immediate problem facing the coder is that there are no clear instructions as to how moves are segmented. For example, is an utterance of "it's raining and I want to go home" a sequence of two moves (of type Explain), or just one? In the former case, syntactic criteria of complete sentences are used to find minimal complete moves; in the latter case one tries to locate maximal segments realising a single move. Judging from examples, the HCRC schema seems to apply the latter method, which we will refer to as the maximal-functional-segment method of utterance segmentation.

A: [ok so i have a plane in front of me which / has um / two wheels / and it's like a wheel chassi / and / a long body and there's TWO wings]Explain [you see these um long]Query-YN
B: [is it a biplane or]Query-YN
A: [no]Reply-N [/ it's just two wings next to each other so to speak / these / with seven holes / those sticks]Clarify [/ do you see]Query-YN
B: [yeah]Reply-Y
A: [they are wings]Clarify
B: [that's wings]Acknowledge [and there's one of these five+ / +hole things]Explain
A: [so to speak]? [/ is supposed to be placed at the back as / tail wing]Instruct
B: [mhm]Acknowledge
A: [um / the ones with THREE holes / is the propeller / there i+]Explain [and then there is one more with three holes, right?]Query-YN?/Check?
As we can see, some of the problems we had with the LINLIN2 schema have been resolved, specifically the coding of feedback, which is now handled by the Acknowledge move. The distinction between YN- and W-type questions and answers is fairly easy to make, since it depends to a large extent on surface (syntactic) phenomena. The coding of "is supposed to be placed at the back as / tail wing" as Instruct is perhaps not self-evident, since it is in a sense a statement. The coding manual, however, states that "the instruction can be quite indirect (...), as long as it is obvious that there is a specific action which the speaker wants to elicit" ([11], p. 3). The action, in this case, is to (paraphrasing) "place one of those five-hole things at the back as tail wing". One might, however, argue that the preceding Explain should be included in the Instruct. The ideal solution would perhaps be to be able to say both that "and there's one of these five+ / +hole things" is an Explain and that it, together with "is supposed...", is an Instruct. Unfortunately, this is not possible using the HCRC schema, since it does not allow multifunctionality of utterances. A problem arises with A's utterance of "so to speak" following B's "that's wings". It is not obvious what this should be coded as using the HCRC schema. One might want to join it with A's previous Clarify move, thus forming a larger discontinuous Clarify move, but this is not possible in the HCRC schema. Another possible solution is to code it as a solitary Explain move, but is A really making a statement here? Finally, is the last move a Query-YN or a Check move? This depends on whether "(...) the question ask[s] for confirmation of material which the speaker believes
might be inferred, given the dialogue context" ([11], p. 4). If "dialogue context" is taken to mean that which has been said previously in the dialogue, which seems to be the intended reading, the Query-YN move should be chosen, since the information which can be assumed to be guessed by A is not inferred from the dialogue but from the general context (A is simply assuming that B's plane has the same set of parts as A's own plane). However, if "dialogue context" is taken to mean "the context of the dialogue", the Check move might be more appropriate. The latter interpretation also seems to fit with the tag phrase "right?".
3.2.5 HCRC: Reliability

Move segmentation:                    κ = 0.92
Move classification:                  κ = 0.83
Response/initiation classification:   κ = 0.89
Game beginning:                       70% pairwise agreement
Game embedded/not embedded:           κ = 0.46
Game ending:                          65% pairwise agreement

Table 1: Reliability of the HCRC schema when applied to Map Task dialogues
Note that the Kappa score for move classification is based only on those segments on which all move coders agreed. Since games are allowed to overlap, game coding has not been measured using the Kappa statistic; instead, pairwise agreement was used. Game coding is clearly not as reliable as move coding. No reliability scores are given for the transaction coding.
3.2.6 HCRC: Relation to computational models

The HCRC schema is based on computational models of dialogue such as those found in Houghton [22], Carlson [16] and Power [32]. In Figure 2, a (procedural) definition of the ASK dialogue game is shown14. A move and game model of dialogue based on the HCRC schema has been used in a spoken dialogue interface for the Autoroute route planning package [30]. Here, dialogue management is implemented using planning and plan recognition techniques, which require moves and games to be specified as action operators. As an example, the Query-W game formalization is shown in Figure 3. The intended interpretation is "if agent A1 knows the value of predicate P in situation S0, and there is a communication channel between agents A1 and A2 in S0, and A2 conducts a questioning game with content P with A1 in S0, then the result is S1, where agent A2 knows the value of P".

14 If this definition doesn't seem sufficiently "formal", this is because it has been translated (by Power) from computer program code to increase readability.
CONVERSATIONAL PROCEDURE ASK (S1,S2,Q)
1. S1 composes a sentence U which expresses Q as a question, and utters it.
2. S2 reads U and obtains a value for Q. He records that S1 cannot see the object
   mentioned in Q, and then inspects his world model to see if Q is true. If he
   finds no information there he says I DON'T KNOW, otherwise YES or NO as
   appropriate.
3. S1 reads S2's reply. If it is YES or NO he updates his world model
   appropriately. If it is I DON'T KNOW he records that S2 cannot see the object
   mentioned in Q.
Figure 2: A procedural definition of a dialogue game, from Power (1978)
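The procedural character of such a definition can be made concrete by executing it. Below is a rough Python rendering of the ASK procedure; the dict-based world-model representation and the function names are my own invention, not Power's code.

```python
def ask(s1_model, s2_model, q):
    """Sketch of Power's ASK procedure: S1 asks S2 whether proposition q holds.

    Each world model is a dict mapping propositions to True/False; absence of q
    means the agent has no information about it. (Representation invented for
    illustration; Power's original is a computer program, not this code.)
    """
    # Step 1: S1 composes and utters q as a question (simulated by passing q).
    # Step 2: S2 records that S1 cannot see the object mentioned in q,
    # then inspects its world model for q.
    s2_model.setdefault("cannot_see", set()).add(("S1", q))
    if q not in s2_model:
        reply = "I DON'T KNOW"
    else:
        reply = "YES" if s2_model[q] else "NO"
    # Step 3: S1 reads the reply and updates its world model accordingly.
    if reply == "YES":
        s1_model[q] = True
    elif reply == "NO":
        s1_model[q] = False
    else:
        s1_model.setdefault("cannot_see", set()).add(("S2", q))
    return reply

s1, s2 = {}, {"door_open": True}
print(ask(s1, s2, "door_open"))  # -> YES
```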
∀A1, A2, P, S0, S1:
  ( knows_ref(A1, P, S0)
  & channel(A1, A2, S0)
  & actions(S0, play(A2, qw, A1, P), S1) )
  → knows_ref(A2, P, S1)
Figure 3: The "qw" (Query-W) operator in the Autoroute dialogue system
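Plan-based dialogue managers of the kind used in the Autoroute interface need such operators in machine-usable form. The following is a generic STRIPS-style sketch of the qw operator, assuming a state is represented as a set of ground facts; the predicate names follow Figure 3, but the data structures are my own and are not taken from the Autoroute system.

```python
# A generic STRIPS-style rendering of the "qw" operator: if the addressee knows
# the value of P and a channel exists, playing a Query-W game with content P
# results in a state where the speaker also knows the value of P.

QW_OPERATOR = {
    "name": "play-qw",
    "parameters": ("A1", "A2", "P"),  # addressee, speaker, predicate
    "preconditions": {("knows_ref", "A1", "P"), ("channel", "A1", "A2")},
    "effects": {("knows_ref", "A2", "P")},
}

def apply_operator(state, op, bindings):
    """Apply an operator to a state (a set of ground facts) if its
    preconditions hold under the given variable bindings."""
    ground = lambda fact: tuple(bindings.get(t, t) for t in fact)
    if not all(ground(p) in state for p in op["preconditions"]):
        raise ValueError("preconditions not satisfied")
    return state | {ground(e) for e in op["effects"]}

s0 = {("knows_ref", "wizard", "route"), ("channel", "wizard", "caller")}
s1 = apply_operator(s0, QW_OPERATOR, {"A1": "wizard", "A2": "caller", "P": "route"})
print(("knows_ref", "caller", "route") in s1)  # -> True
```

A planner can then chain such operators forwards to construct dialogue plans, or backwards to recognize which game an observed move belongs to.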
3.3 DAMSL

3.3.1 DAMSL: Background

The Discourse Resource Initiative (DRI), consisting of researchers from several dialogue projects worldwide, has held two seminars on dialogue coding, which have resulted in the DAMSL coding schema15 ([3]). This schema is still under development. Several interesting and important questions concerning the design of coding schemas for dialogue moves are raised in the report on the second seminar ([14]). A coding manual similar to that of the HCRC schema is available. Decision trees are available for each subgroup (see below) of the schema.
3.3.2 DAMSL: The schema

There are four layers of "utterance-tags" in DAMSL. The layer corresponding to Searle's illocutionary acts is called Forward Looking Function in DAMSL. The other layers are Backward Communicative Function, dealing with "the speaker's reaction to previous utterances", and Information Level, describing "features that characterize the content and structure of utterances". I will concentrate on the first two, since these correspond to the layer we are mainly interested in. The short descriptions below represent an attempt to summarize the central parts of the descriptions in the DAMSL manual. Most of the descriptions are taken from the decision trees, and some are from the longer descriptions. It is of course possible that I have distorted the descriptions in doing this, and the reader is encouraged to read the full descriptions in the DAMSL manual.

15 DAMSL starts from Searle's traditional taxonomy of speech acts, but uses slightly more cumbersome names. A translation table might come in handy (note that Searle's "expressives" are not present in DAMSL):

DAMSL                                 Searle
Statement                             Representative
Influencing Addressee Future Action   Directive
Committing Speaker Future Action      Commissive
Explicit-performative                 Declaration
The Forward Looking Function layer of DAMSL

Statement: makes claims about the world
  - Assert: S tries to affect beliefs of H
  - Reassert: (S thinks that) the claim has been made previously in the dialogue16
  - Other-Statement
Influencing Addressee Future Action: S tries to directly influence H's future non-communicative actions
  - Action-Directive: obligates H to either perform the requested action or to communicate a refusal or inability to do so
  - Open-option: S suggests a course of action but puts no obligation on H
Info-Request: introduces an obligation for the hearer to provide information
Committing Speaker Future Action: potentially commits S to some future course of action
  - Offer: commitment is conditional on H's agreement
  - Commit: commitment is not conditional on H's agreement
Conventional
  - Opening: a phrase conventionally used to summon H and/or start the interaction (e.g. "Can I help you", "Hi")
  - Closing: a phrase conventionally used in a dialogue closing or used to dismiss the addressee (e.g. "Goodbye")
Explicit-performative: S performs an action by virtue of making the utterance (e.g. "You're fired", "I quit", "Thank you"17)
Exclamation
Other Forward Function: forward looking functions not captured in the current scheme, such as holding/grabbing the turn (e.g. "Right" or "Okay") or signaling an error (e.g. "Oops")

16 The qualification "S thinks that" is omitted on p. 10 of [3], but in the decision tree on p. 11 it is included. This is unfortunate, since these two criteria may not coincide.
17 The reason given for seeing "Thank you" as an explicit performative is the "hereby"-test, i.e. the word "hereby" can be inserted before the verb. One might object that this is a valid reason only if "Thank you" is seen as a short form for "I (hereby) thank you", and not as a lexicalized phrase, which is not obvious. If we reason along these lines, "go home" can be seen as a short form for "I (hereby) order you to go home", and is thus an explicit performative also.
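As mentioned in the background section, the DAMSL distribution includes decision trees for each subgroup. A toy version of such a coding aid is sketched below; the yes/no questions paraphrase the category descriptions above and do not reproduce the official trees.

```python
def forward_looking_function(makes_claim=False, requests_info=False,
                             influences_hearer_action=False,
                             obligates_hearer=False,
                             commits_speaker=False,
                             conditional_on_hearer=False):
    """Toy coding aid for DAMSL's Forward Looking Function layer.

    The coder answers yes/no questions about the utterance and gets a tag
    back. Illustration only, not the official DAMSL decision tree.
    """
    if makes_claim:
        return "Statement"
    if requests_info:
        return "Info-Request"
    if influences_hearer_action:
        # H must either act or communicate refusal, vs. a mere suggestion.
        return "Action-Directive" if obligates_hearer else "Open-Option"
    if commits_speaker:
        # An Offer is conditional on the hearer's acceptance; a Commit is not.
        return "Offer" if conditional_on_hearer else "Commit"
    return "Other-Forward-Function"

# "Put the tanker on track one" obligates the hearer:
print(forward_looking_function(influences_hearer_action=True,
                               obligates_hearer=True))  # -> Action-Directive
```

Note that a real DAMSL coder would consult one tree per independent subgroup, since an utterance may receive tags from several subgroups at once; the single function above flattens this for brevity.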
The Backward Looking Function layer of DAMSL

Agreement: S is addressing a previous proposal, request, or claim (henceforth the antecedent)
  - Accept: S (explicitly) agrees to all of the antecedent
  - Accept-Part: S (explicitly) agrees to part of the antecedent
  - Maybe: S refuses to make a judgement at this point; S agrees to one part, but disagrees with another part, of the antecedent
  - Reject-Part: S (explicitly) disagrees with part of the antecedent
  - Reject: S (explicitly) disagrees with all of the antecedent
  - Hold: S is not stating his/her attitude towards the proposal, request, or claim18
Understanding: actions that speakers take to make sure that they are understanding each other as the conversation proceeds
  - Signal-Non-Understanding: S explicitly indicates a problem in understanding the antecedent
  - Signal-Understanding: S explicitly indicates understanding (but not necessarily acceptance)
    - Acknowledge: S signals understanding using phrases such as "okay", "yes", "uh-huh"
    - Repeat-Rephrase: S signals understanding by repeating or paraphrasing what has just been said
    - Completion: S signals understanding by finishing or adding to the clause that H is in the middle of constructing
  - Correct-Misspeaking: S indicates that S believes that H has not said what he/she actually intended, by offering a correction
Answer: S is supplying info explicitly requested by a previous Info-Request act
Information-Relation

The FLF and BLF layers together contain 23 bottom-level categories, i.e. 23 move types. The analysis is intended to be general in scope and coverage and to be applicable to all types of dialogue. The schema is meant to be used as the common foundation for other coding schemas, giving the individual research group the possibility to subdivide the given categories to suit a particular research task. There are some logical relations between categories; for example, an Answer is always an Assert. The restrictions posed by these relations must be attended to when coding, and the DAT tool provides support for this. DAMSL allows limited multifunctionality of utterances. The three top-level groups are independent of each other, and within the groups, each subgroup is independent. Within the subgroups, however, the categories are not independent, i.e. they are mutually exclusive19. For example, the five subgroups of Forward Looking Function (Statement, Influencing Addressee Future Action, Committing Speaker Future Action, Performative and Other Forward Function) are independent, but within Influencing Addressee Future Action the categories (Open-Option, Info-Request and Action-Directive) are mutually exclusive.

Relational coding is used extensively (in the Backward Looking Function). However, the DAT coding tool only allows one antecedent for each utterance, which means that it is impossible to code different moves in a single utterance as being related to different previous utterances. Whether this limitation has any effects in practice remains to be seen. Also, this is possibly not intended as a feature of the DAMSL schema but may have come about as a consequence of using the DAT tool. The DAT tool also allows discontinuous moves to be coded. This is done by marking the relevant utterances and grouping them into a "segment". This segment can then be coded as any utterance.

18 On p. 19 of [3], it says that "The Hold tag applies to the case where the participant does not address the proposal but performs an act that leaves the decision open pending further discussion" [my italics]. This seems to contradict the definition of the Agreement aspect. Presumably, this will be fixed in future DAMSL versions.
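The combination constraints just described, i.e. independent subgroups, mutual exclusivity within a subgroup, and logical relations such as "an Answer is always an Assert", are exactly the kind of support a coding tool can provide. A minimal sketch of such a checker follows; the subgroup contents are abridged and the data layout is my own, not DAT's.

```python
# Sketch of DAMSL-style combination constraints: tags from different subgroups
# may co-occur, tags within a subgroup are mutually exclusive, and some tags
# entail others (an Answer is always an Assert).

SUBGROUPS = {
    "Statement": {"Assert", "Reassert", "Other-Statement"},
    "Influencing-Addressee-Future-Action": {"Action-Directive", "Open-Option"},
    "Committing-Speaker-Future-Action": {"Offer", "Commit"},
    "Agreement": {"Accept", "Accept-Part", "Maybe", "Reject-Part", "Reject", "Hold"},
    "Answer": {"Answer"},
}
ENTAILMENTS = {"Answer": "Assert"}  # coding Answer requires coding Assert too

def check_coding(tags):
    """Return a list of constraint violations for one utterance's tag set."""
    errors = []
    for name, members in SUBGROUPS.items():
        if len(tags & members) > 1:
            errors.append(f"mutually exclusive tags within {name}")
    for tag, required in ENTAILMENTS.items():
        if tag in tags and required not in tags:
            errors.append(f"{tag} entails {required}")
    return errors

print(check_coding({"Answer", "Assert", "Accept"}))  # [] - a legal combination
print(check_coding({"Assert", "Reassert"}))          # within-subgroup clash
```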
3.3.3 DAMSL: Example
The dialogue transcription below is available at ftp://ftp.cs.rochester.edu/pub/packages/dialog-annotati. The annotation format has been converted from the output format of the DAT tool [18]. The coding was done with an older version of DAMSL, but since there seems to be a straightforward way of translating the coding to fit the new schema, I have taken the liberty of doing so20. Information about Info-level has been omitted to increase readability. For the same reason, many tags have been shortened from "feature-value pairs" (e.g. Influence-on-speaker=Offer) to simple values (e.g. Offer). The corresponding feature can be found by looking at the DAMSL schema, so no information is lost. For each utterance, there is an index, e.g. "utt2", followed by a list of coded functions (moves). The Backward-Looking Function (BLF) is listed first, followed (in some cases) by an indication of which utterance the BLF is a response to (indicated by *), and finally the Forward-Looking Function.

S: [ hello // ]U1:Opening [ can I help you ]U2:Info-request;Offer
U: [ yes // ]U3:Accept;Answer;*U2;Commit;Assert [ um // I have a problem here I need to transport one tanker of orange juice to Avon // and a boxcar of bananas to Corning // by three p.m. ]U4:Assert [ and I think it's midnight now ]U5:Assert
S: [ uh right it's midnight ]U6:Accept;Acknowledge;*U5;Assert
U: [ okay ]U7:Accept;Acknowledge;*U6 [ so we need to // um get a tanker of OJ to Avon is the first thing we need to do so ]U8:Assert
S: [ okay ]U9:Accept;Acknowledge;*U8;Commit [ (click) so we have to make orange juice first ]U10:Open-option;Assert
U: [ mm-hm // ]U11:Acknowledge;*U10
The annotation of utterances U4 and U5 is somewhat puzzling. The DAMSL manual says that "an utterance is a set of words by one speaker that is homogeneous with respect to Information Level and Forward and Backward Looking Functions." Now, it seems reasonable to assume that "a set of words" means "a maximal set of words". If it doesn't, each word and word sequence within an utterance is an utterance in itself. That is, we assume that DAMSL adheres to the "maximal functional segment" principle mentioned above. However, in utterances U4 and U5 the annotator has not used this principle, since two consecutive utterances by the same speaker have identical annotation with regard to Information Level and Forward and Backward Looking Functions. The question is, what criteria have been used to split this word sequence into two utterances? One hypothesis is that syntactic criteria (complete sentences) have been used. Here we note a problem with DAMSL that will be further addressed in Section 5.4 below.

19 That two categories are independent means that one can be coded regardless of the other, i.e. the categories are not mutually exclusive. That two categories are dependent means that if an utterance is coded with one of the categories, the other category cannot be applied to that segment, i.e. the categories are mutually exclusive.
20 The changes I have observed are that Info-request was formerly a subcategory of Influencing-addressee-future-action, while it is now a category on par with Influencing-addressee-future-action. Also, there seem to have been two kinds of acknowledge, one called simply "Acknowledge" and one called SU-Acknowledge, with "SU" presumably standing for "Signal Understanding". The category "Greeting" is now called "Opening". For reference, the original coding is reproduced in Appendix C.
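The "maximal functional segment" principle can be made concrete as a small grouping procedure that merges consecutive material with the same speaker and annotation; the input format below is invented for illustration.

```python
from itertools import groupby

def maximal_segments(annotated_words):
    """Group consecutive (speaker, word, tags) triples into maximal segments
    that are homogeneous in speaker and annotation - the 'maximal functional
    segment' reading of the DAMSL utterance definition discussed above.
    (Input format invented for illustration.)"""
    keyfn = lambda item: (item[0], item[2])  # group by (speaker, tags)
    return [(" ".join(w for _, w, _ in group), speaker, tags)
            for (speaker, tags), group in groupby(annotated_words, key=keyfn)]

words = [
    ("S", "uh", "Assert"), ("S", "right", "Assert"),
    ("S", "it's", "Assert"), ("S", "midnight", "Assert"),
    ("U", "okay", "Acknowledge"),
]
for text, speaker, tags in maximal_segments(words):
    print(speaker, text, tags)
```

Under this principle, U4 and U5 in the sample above would have collapsed into a single utterance, since they share both speaker and annotation.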
3.3.4 DAMSL: Application to Airplane dialogue

A: [ ok ]U1:Other-forward-function [ so i have a plane in front of me which / has um / two wheels / and it's like a wheel chassi / and / a long body and there's TWO wings ]U2:Assert [ you see these um long ]U3:Assert;Info-request
B: [ is it a biplane or ]U4:Hold;*U3;Info-request
A: [ no / it's just two wings next to each other so to speak / these / with seven holes / those sticks ]U5:Answer;*U4;Assert [ / do you see ]U6:Info-request
B: [ yeah ]U7:Answer;*U6
A: [ they are wings ]U8:Assert
B: [ that's wings ]U9:Repeat-rephrase;*U8
A: [ so to speak ]U9:? [ and there's one of these five+ / +hole things ]U10:Assert [ is supposed to be placed at the back as / tail wing ]U11:Assert;Open-option
B: [ mhm ]U12:Acknowledge;*U10-U11
A: [ um / the ones with THREE holes / is the propeller / there i+ ]U13:Assert [ and then there is one more with three holes, right? ]U14:Info-request
Judging from the DAMSL manual, Answers are usually not Accepts of the preceding Info-Request. This is rather confusing, especially as refusals to answer (e.g. by saying "I don't know") are coded as Rejects of the previous Info-request (p. 24, [3]). It seems intuitively plausible to say that answers to questions are acceptances of the question. Perhaps the reason for this is that answers do not explicitly accept info-requests, but the explicit/implicit distinction seems rather vague. In the above airplane dialogue sample, the annotation does not utilize the possibility of creating multi-utterance and discontinuous segments and coding these as if they were regular utterances. Note that we still haven't been able to solve the problem of the overlapped "so to speak"; it doesn't seem satisfactory to label this as a solitary Assert act. One solution would be to form a continuous segment of several utterances and see this as a single, multi-utterance, multi-agent Answer move:

B: [U4 is it a biplane or ]U4:Hold;*U3;Info-request
A: [X1 no / it's just two wings next to each other so to speak / these / with seven holes / those sticks [U6 / do you see ]U6:Info-request
B: [U7 yeah ]U7:Answer;*U6
A: [U8 they are wings ]U8
B: [U9 that's wings ]U9:Repeat-rephrase;*U8
A: so to speak ]X1:Answer;*U4;Assert
A further solution would be to create a discontinuous single-agent Answer move, as done below. The DAMSL manual does not specify any recommendation concerning these alternative solutions, but they are allowed by both the schema and the DAT tool.

B: [U4 is it a biplane or ]U4:Hold;*U3;Info-request
A: [X1 no / it's just two wings next to each other so to speak / these / with seven holes / those sticks ]X1 [U6 / do you see ]U6:Info-request
B: [U7 yeah ]U7:Answer;*U6
A: [X1 they are wings ]X1
B: [U9 that's wings ]U9:Repeat-rephrase;*U8
A: [X1 so to speak ]X1:Answer;*U4;Assert
3.3.5 DAMSL: Reliability

The values for Kappa shown in Tables 2 and 3 are taken from [2].

Statement                            κ = 0.66
Influencing Addressee Future Action  κ = 0.70
Committing Speaker Future Action     κ = 0.15
Other Forward Function               κ = 0.48

Table 2: Forward Looking Function

Agreement      κ = 0.42
Understanding  κ = 0.57
Answer         κ = 0.76
Response-to21  κ = 0.77

Table 3: Backward Looking Function

None of the schema subgroups can be considered reliable. The Answer and Influencing Addressee Future Action aspects seem promising, as does (to a lesser extent) Statement. The values for Committing Speaker Future Action, Other Forward Function and Agreement clearly indicate a need for revision. Note, however, that these statistics are based on a previous version of the schema, and that some of the problems may already have been resolved.
3.3.6 Relation to computational models

Poesio and Traum ([34]) give tentative formal definitions of the forward-looking functions of the DAMSL schema, examples of which can be seen in Figure 4. The notation used, based on DRT, may require some explanation: Bel(A, K(s)) means that A believes the DRS K is true of situation s. SCCOE(A, B, K(s)) means that A is socially committed to B to K being the case in situation s, whether or not A actually believes it. e:Try(A, α) means that the event e is characterized by A trying to perform an action of type α (present-directed intention).
e:Achieve(A, K) means that e is characterized by the act of A bringing about the satisfaction of DRS K. These definitions are obviously not complete in their present version, but they are very interesting in that they take into consideration many phenomena that most earlier attempts have ignored, e.g. obligations and grounding.

Name       Act condition          Defining effect
Statement  e:stmt(A,B,K(s)(s'))   DS': SCCOE(A,B,K(s))
Assert     e:asrt(A,B,K(s)(s'))   e:Try(A, Achieve(A, DS':Bel(B,K(s))))
Reassert   e:rsrt(A,B,K(s)(s'))   DS: SCCOE(A,B,K(s))

Figure 4: The formal act definitions for the Statement aspect of the DAMSL schema
4 Comparison of taxonomies

In this and the following section, we will attempt to establish some relevant parameters of variation in coding schema design, and find the corresponding parameter values for each schema. We also draw some conclusions from the differences and similarities observed. In the present section we will deal with the similarities and differences between the schemas concerning the move taxonomy itself: the range of phenomena to cover, the division of phenomena into levels (or layers), the division of layers into categories, and domain and genre dependency. Apart from the design of the move taxonomy itself, there are additional issues, such as whether coders are allowed to look ahead in the transcription when coding, whether relational coding is allowed (e.g. which question is a particular answer an answer to?), and whether discontinuous moves or utterances are allowed. These and other issues will be dealt with in Section 5. All these choices influence, in various ways, the reliability of a schema and the potential for computational tractability of a model of dialogue based on that schema. A simple schema will most likely make it easier both to achieve and assess its reliability. It might also lead to a more tractable computational model. However, a simple schema may also produce an over-simplistic and unnatural model of dialogue. For reference, two related coding schemas are included in this section: the TRAINS "Conversation Acts" coding schema [33] and the Goteborg Interaction Management [4] schema. The TRAINS schema contains four different conversation act types: turn-taking acts, grounding acts, core speech acts and argumentation acts22. The Goteborg Interaction Management schema (henceforth referred to as the GBG-IM schema) includes several aspects of feedback (FB) and turn management (TM), some of which have been included here.
22 Each conversation act type corresponds to a Discourse Level (structural level) described in terms of Discourse Units (DUs) or Utterance Units (UUs). A DU is a bit like an HCRC game, except that while a game usually ends when the goal of the initiating move has been fulfilled (e.g. a question has been answered), a DU ends when the initiating utterance has been mutually understood, or Grounded. Utterance Units correspond to more or less continuous speech by one speaker, punctuated by prosodic boundaries. Turn-taking acts, grounding acts, core speech acts and argumentation acts correspond to the Discourse Levels Sub-UU, UU, DU, and Multiple DUs, respectively.
4.1 Scope and layers

There are clear differences between the schemas we have considered above in the scope of phenomena and in the division of these phenomena into layers. A rough impression of these differences is given in Table 4.
LINLIN2               HCRC   DAMSL                               TRAINS              GBG-IM
-                     Game   -                                   Argumentation acts  -
Type                  Move   Forward-Looking, Backward-Looking   Core Speech Acts    -
-                     -      Signal-Understanding                Grounding acts      Feedback function
Discourse management  -      Conventional                        -                   -
-                     -      -                                   Turn-taking         Turn characteristics
Topic                 -      Information-level                   -                   -
-                     -      Communicative Status                -                   -

Table 4: Rough impression of relations between schema layers

The transaction and game levels of the HCRC schema seem to have no corresponding layers in the LINLIN and DAMSL schemas23. Likewise, the Communicative-status layer of DAMSL has no obvious counterpart in the other schemas. However, the Topic layer of LINLIN is very similar to the Information Level layer of DAMSL, in that they both try to capture some general semantics of utterances in terms of what they are about.
4.2 Move taxonomies

It is interesting to notice that all three schemas have divided moves into initiators/forward-looking functions and responses/backward-looking functions. Apart from this, however, there are big differences regarding lower-level categories. A somewhat speculative characterization of the relations between schema categories can be found in Tables 5 and 6. Italics indicate a category set. There are clearly similarities between the schemas, e.g. the top-level division into initiatives and responses. The DAMSL schema is generally the most complex, but in some cases the HCRC schema gives a more fine-grained analysis. The TRAINS and GBG-IM schemas include some aspects of feedback and turn-taking not covered by the other schemas. To make this kind of comparison more exact, however, there is a need for finding a way of giving exact semantics for coding schemas. This is clearly a subject worth further study.
4.3 Dependencies on dialogue genre, domain and theory

The three coding schemas described above have all been designed for different genres of dialogue and different domains, and are based on different theories. A rough overview of these differences is given in Table 7.

23 The Argumentation Acts of the TRAINS schema seem to have largely the same coverage as these levels.
LINLIN2     HCRC                             DAMSL                                 TRAINS               GBG-IM
Initiative  Initiating moves                 Forward Looking Function              Core speech acts     -
Update      Explain                          Statement: Assert, Reassert, Other    Inform               -
Question    Query-yn, Query-w, Check, Align  Info-request                          YNQ, WHQ             -
-           Instruct                         Influencing-addressee-future-action:  Request, Suggest     -
                                             Action-directive, Open-Option
-           -                                Committing-speaker-future-action:     Offer, Promise, ...  -
                                             Offer, Commit
-           -                                Explicit-performative, Exclamation    -                    -
-           Ready                            ???                                   -                    -
Response    Response moves                   Backward Looking Function             -                    -
(Answer)    Reply-y, Reply-n, Reply-w,       Answer                                -                    -
            Clarify
-           -                                Agreement: Accept, Accept-part,       Eval, Accept,        +Accept-content,
                                             Maybe, Reject, Reject-part, Hold      Reject               -Accept-content

Table 5: Rough impression of relations between move taxonomies, pt. 1
LINLIN2               HCRC              DAMSL                          TRAINS                     GBG-IM
-                     (Response moves)  Understanding                  Grounding                  Feedback function
-                     Acknowledge       Signal-understanding:          Ack                        +Accept-com-act
                                        Acknowledge, Repeat-rephrase,                             (also Give-FB)
                                        Completion
-                     -                 Signal-Non-Understanding       ReqRepair                  -Accept-com-act, -Understanding
-                     -                 Correct-Misspeaking            -                          -
-                     -                 -                              Repair, Initiate,          -
                                                                      Continue, Cancel
-                     -                 -                              ReqAck                     Elicit-FB
-                     -                 -                              -                          +Perception, -Perception,
                                                                                                 +Contact, -Contact
Discourse management  -                 Conventional                   Turn-taking                Turn Management
Opening,              -                 Opening, Closing               take-turn, keep-turn,      Turn acceptance, Turn holding,
Continuation,                                                         release-turn, assign-turn  Turn closing
Closing

Table 6: Rough impression of relations between move taxonomies, pt. 2
          Dialogue genre         Theory                     Domain
LINLIN    information retrieval  dialogue grammar           various
HCRC      instructional          dialogue games             route following
DAMSL     general                speech acts                general
TRAINS    interactive planning   conversation acts          route planning
GBG-IM    general                activity-based pragmatics  general

Table 7: Dialogue genre, intended domain and foundational theory for the schemas described above.
5 Some general problems

In this section we will look at general problems involved in designing coding schemas for dialogue moves, how these problems have been treated in the schemas described above, and how they are related to computational issues. Apart from the division of utterances into dialogue move categories, a number of properties of a schema determine its flexibility and expressivity, i.e. what can and cannot be said when using the schema for annotating. For example, many schemas allow multiple moves/acts to be performed by a single utterance, and some do not. Under the (seemingly uncontroversial) assumption that utterances can, in fact, be multifunctional, the latter type of schema is not as realistic as the former. Of course, it may be possible to construct a schema which combines functions into function aggregates, e.g. Answer-and-assert, or where e.g. an Answer always implies an Assert. Some of the issues discussed in this section, e.g. whether to allow discontinuous utterances, are pretty much independent of the choice of categories in a schema. Others are clearly related to the choice of categories; in HCRC, for example, relational categories are not used, since this aspect is included in the game coding. The question of lookahead is a decision relating mainly to the coding procedure, but it may still have effects on e.g. how categories are defined (e.g., do the criteria relate to actual effects of an utterance or not?).
5.1 Intention-based and surface-based definitions

It might be expected that definitions relying on surface criteria such as sentence form would be easier to assign reliably to utterances, while intention-based definitions may be harder to apply, since they require more interpretation from the coder and thus can be expected to be less intersubjective. Looking at the definitions, it appears that the HCRC schema is more surface-dependent while the DAMSL and LINLIN2 schemas are more intention-based, though all three mix surface and intention-related criteria. For the purpose of coding dialogues for use in developing a theory of information state updates, on the one hand it seems more appropriate to use a surface-based schema as a point of departure, since an intention-based schema is likely to be more or less permeated by some theory of mental states. We would like a carte blanche, so to speak, on the theory of informational states and mental states, in order to use the schema for testing different computational (state-related) theories of dialogue moves. On the other hand, such a schema is bound to be more difficult to relate to intentions, which is what we must do when giving formal definitions. The very problem of, e.g., dialogue move interpretation lies in the difference between surface form and the underlying intention. In a dialogue system, we may assume that a syntactic parser will provide information about surface-syntactic form and that this information will be used as a clue in recognizing the intention(s) underlying an utterance. In this light, it seems pointless to code dialogues for surface-syntactic phenomena which have only indirect links to dialogue moves. Thus, according to this point of view, it seems wiser to use an intention-based schema. This suggests that, ideally, one should strive for a coding schema based on intention-based but (as far as possible) theory-neutral definitions. Also, we need to satisfy additional requirements of intersubjectivity, simplicity and comprehensibility. Most likely, the best way of achieving this is to use some semi-formal theory based on "folk psychology" which does not take a definite stand on controversial issues. Since complete theory-neutrality is not possible in practice, the schema should be as flexible as possible. One way of achieving this is to allow for variation in the degree of precision in definitions, so that a category may have several more specific subcategories, which in turn have subcategories, and so on.
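The idea of a taxonomy with successively more specific subcategories can be sketched as a simple tag tree; the tag names below are borrowed from the schemas discussed, but the tree itself and the helper functions are invented for illustration.

```python
# Categories form a tree, so two codings can be compared at whatever level
# of precision is wanted: precise tags may differ while their ancestors agree.

TAXONOMY = {
    "Query-YN": "Question", "Query-W": "Question", "Check": "Question",
    "Question": "Initiative", "Instruct": "Initiative",
    "Reply-Y": "Response", "Reply-N": "Response",
}

def generalize(tag, levels=1):
    """Climb `levels` steps towards the root of the taxonomy."""
    for _ in range(levels):
        if tag not in TAXONOMY:
            break
        tag = TAXONOMY[tag]
    return tag

def agree(tag_a, tag_b, levels=0):
    """Do two coders' tags agree when generalized to a coarser level?"""
    return generalize(tag_a, levels) == generalize(tag_b, levels)

print(agree("Query-YN", "Check"))             # -> False: precise tags differ
print(agree("Query-YN", "Check", levels=1))   # -> True: both are Questions
```

A schema organized this way lets a research group code at the precision its task requires, while still permitting comparison (and reliability measurement) at the coarser shared levels.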
5.2 Effect-based definitions and lookahead

The DAMSL schema encourages the coder to look ahead in the dialogue for clues as to the function of utterances, thus in a sense including effect-based definitions. However, it is not so clear whether actual effects should determine the choice of category by themselves, or whether they should only be used as a clue enabling the coder to make a better guess about the actual intentions of the speaker. A dialogue system, of course, has no access to the actual effects of an utterance at the time of interpretation, but is, on the other hand, present in the actual discourse situation. Looking ahead in the dialogue should be used only as a means of compensating for the coder's lack of this direct access.
5.3 Instructional and computational definitions

As we have seen, the definitions given in coding manuals are informal and are often given in the form of instructions for the annotator. We may call these definitions instructional. On the other hand, there have been attempts to give computational (formal) definitions of dialogue moves. Here, it is crucial that there is a close correspondence between the two kinds of definition for each dialogue move, or else there will be little point in providing formal definitions. Regarding LINLIN, the number of formally defined moves is so low (2 or 3) that the formal representations are extremely simple, and it is therefore hard to assess their correspondence to the instructional definitions. In the case of the HCRC schema, there is no canonical set of computational definitions for dialogue moves, but we may perhaps take the liberty of using the computational definition of the "qw" operator from the Autoroute dialogue system (see Figure 3) for comparison. According to the instructional definition, a Query-W move is "Any query which is not covered by any of the other categories", i.e. any query which is not a Check, Align or Query-YN. In essence, this means that a Query-W move is any query which does not take a "yes" or "no" answer, and where S is not checking the attention of A or A's readiness for the next move, and where S does not already have reason to suspect what the answer might be. It is perhaps not obvious how this corresponds to the formal definition, and to make things simpler we may perhaps use the "informal" version24:

If the addressee knows the value of predicate P in situation S0, and there is a communication channel between the hearer and the speaker in S0, and the speaker conducts a questioning game with content P with the addressee in S0, then the result is S1, where the speaker knows the value of P.

In this case, we can see several discrepancies between the instructional and the computational definitions. For example, the latter requires that the addressee knows the answer and that the speaker comes to know it, but the former does not; the former requires that the speaker does not have reason to suspect what the answer is, but the latter does not. Of course, one should note that there are no indications that the HCRC schema has been used in the Autoroute system; the point of this example is to show how discrepancies between instructional and formal definitions of the "same" move may look. As a contrast, the formal definitions for the moves in the DAMSL schema given in Figure 4 seem to correspond relatively well to the instructional definitions25.

24 To make comparison easier, "agent A1" and "agent A2" have been replaced by "the addressee" and "the speaker", respectively.
5.4 Segmentation, multifunctionality, and the Kappa statistic

Until now, we have used an informal notion of utterances, meaning, basically, continuous stretches of speech (but not necessarily full turns). In the HCRC and DAMSL (and probably also LINLIN) schemas, the principle we have called the maximal functional segment principle is (or should be) used. In the TRAINS schema, prosodic information is used to divide a dialogue into utterances. In the case of the HCRC schema, which is unifunctional, kappa values can be computed in the standard way; kappa computation poses no requirement that segmentation be done prior to coding. In both these cases, we are provided with a notion of utterance: in the TRAINS case an intonation-based notion, and in the HCRC case a functional one. In multifunctional schemas such as DAMSL, the standard way of computing kappa is not directly applicable, and therefore alternative methods must be used. In evaluating the DAMSL schema, Allen [2] has solved this by essentially dividing the schema into several subschemas, which internally are unifunctional, and computing kappa for each of these separately. This means that for each subschema, there is an implicit division of the dialogue into functional segments, each segment assigned either one (and only one) category of the subschema, or no category at all. Now, how are utterances to be defined in this kind of schema? As stated above, the DAMSL manual says that "an utterance is a set of words by one speaker that is homogeneous with respect to Information Level and Forward and Backward Looking Functions." If we assume that "a set of words" means "a maximal set of words"26, this means that we basically take all our neat unifunctional dialogue segmentations and stack them on top of each other. Whether this gives an intuitively satisfying notion of utterances remains to be seen. Also, there is another problem of a more technical nature with DAMSL and the DAT coding tool.
DAT requires dialogues to be segmented into utterances before they can be coded/annotated. Unfortunately, the DAMSL definition of utterances relies on sameness of Information Level and Forward and Backward Looking Functions, which means that the coding of these dimensions determines the utterance segmentation, i.e. the minimal parts of the dialogue which can be coded for dialogue moves. This appears to make the coding procedure rather awkward, since all (or almost all) coding decisions for the entire dialogue have to be made before coding using DAT can even begin. As noted above in Section 3.3.3, this increases the risk that annotators take a shortcut and divide the dialogue into utterances using other, more traditional criteria such as syntactically complete sentences. This indicates that the design of coding tools is an important, if not directly theoretically significant, issue in the actual use of a coding schema.

Apart from these problems, the idea that a schema can be multifunctional in one layer but consist of unifunctional subschemas seems reasonable. While multifunctionality needs to be accounted for, it seems unnecessary to allow unlimited multifunctionality. Of course, the limitations on multifunctionality are essentially an empirical question and we should be careful not to make ungrounded a priori assumptions.

It should also be said that utterance segmentation is a (very important) special case of dividing dialogues into structural units. Since we are primarily concerned with dialogue moves, which usually correspond to utterance-level segments, we will not go deeper into the problems of segmenting dialogues into e.g. transactions (HCRC), Discourse Units (TRAINS) or Feedback Units (GBG-IM). It suffices to say that all structural units that are to be coded must be defined and segmentation instructions must be given.

25 The difference between Assert and Reassert has been lost, but the definitions are merely tentative first attempts and this problem will most likely be attended to.
26 If it doesn't, each word within a maximal functional unit is an utterance in itself.
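To make the per-subschema procedure concrete, here is a minimal sketch of the standard two-coder kappa computation (an illustration only, not the evaluation code used by Allen [2]; the segment labels, including the "none" category for segments a subschema does not apply to, are invented):

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Kappa for two coders assigning one label per segment: observed
    agreement corrected for chance agreement, estimated from each
    coder's marginal label distribution."""
    assert len(coder1) == len(coder2)
    n = len(coder1)
    p_observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    m1, m2 = Counter(coder1), Counter(coder2)
    p_chance = sum(m1[label] * m2[label] for label in m1) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# One unifunctional subschema: every segment gets exactly one category,
# or "none" if the subschema does not apply to it.
coder1 = ["query-yn", "none", "check", "query-yn", "none", "align"]
coder2 = ["query-yn", "none", "check", "check",    "none", "align"]
print(round(cohens_kappa(coder1, coder2), 2))  # 0.78
```

For a multifunctional schema like DAMSL, this computation would simply be repeated once per unifunctional subschema.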
5.5 Segmentation, multifunctionality, and computational move interpretation

Considering that we are interested in modelling dialogue moves computationally, we need to consider the effects of utterance segmentation and multifunctionality of utterances on the interpretation (and generation) of dialogue moves. If we assume unifunctionality and prosody-based utterance (and consequently move) segmentation, the dialogue agent simply waits until an intonational unit is completed and then attempts to classify it according to its move category. If we remove the prosody-based segmentation, things get a little more difficult - apart from categorizing each move, the move recognizer must now also decide when a move starts and when it is completed. If we also introduce multifunctionality, things get even more difficult. The dialogue agent must now entertain several hypotheses about possible moves. Also, while in the previous cases the end of one move was always the beginning of another, the agent now has to decide for each move where it starts and where it ends, and there are only indirect (statistical, perhaps) relations between start and end points of overlapping moves. This last case is of course the most realistic, but also the most computationally expensive, method of move interpretation.
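The hardest of these cases can be illustrated with a toy sketch in which overlapping move hypotheses each carry their own start and end points (the cue-phrase patterns and move labels are invented for illustration; a real recognizer would of course use much richer evidence than surface cues):

```python
def recognize(words, patterns):
    """Propose (label, start, end) move hypotheses wherever a cue-word
    sequence matches; overlapping spans are allowed, so one stretch of
    speech may support several hypotheses (multifunctionality)."""
    hypotheses = []
    for label, cues in patterns.items():
        for cue in cues:
            n = len(cue)
            for i in range(len(words) - n + 1):
                if words[i:i + n] == cue:
                    hypotheses.append((label, i, i + n))
    return hypotheses

# Invented cue patterns for two move types.
patterns = {
    "query-yn": [["do", "you", "see"]],
    "acknowledge": [["yeah"]],
}
print(recognize("do you see yeah".split(), patterns))
```

Note that nothing here forces the end of one hypothesis to coincide with the start of the next, which is exactly what makes this case more expensive than the prosody-based, unifunctional one.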
5.6 Discontinuous and multi-agent actions

As an alternative to allowing discontinuous moves, one might want to allow multi-agent moves, i.e. moves that are performed by several agents27. As an example, here is a segment of the Airplane dialogue shown above:

A: no / it's just two wings next to each other so to speak / these / with seven holes / those sticks / do you see
B: yeah

27 Thanks to David Traum for pointing this out to me.
A: they are wings
B: that's wings
A: so to speak
This whole segment may be seen as a multi-agent "clarify"-type move, performed by A and B in cooperation. To code this, we would mark the segment consisting of all five turns with one single move label28:

A: [ no / it's just two wings next to each other so to speak / these / with seven holes / those sticks / do you see
B: yeah
A: they are wings
B: that's wings
A: so to speak ]  clarify
Of course, to be able to code the moves occurring inside this multi-agent move (e.g. the question "do you see" and the answer "yeah"), we need to allow for multifunctional utterances. This takes care of cases where all turns address the same topic and can be seen as working towards a common goal. This is not always true, however, as this example illustrates:

A: I would like an // um
B: excuse me, I have to take the kettle off // ok
A: ice cream
If we want to be able to cover cases like these, we might still need to allow discontinuous moves.
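One way to accommodate both options in a single annotation format is to let each move record a set of speaker-tagged spans, so that multi-agent and discontinuous moves fall out of the same representation (a hypothetical sketch, not drawn from any of the schemas discussed; the word indices are invented):

```python
from dataclasses import dataclass, field

@dataclass
class Move:
    """A coded move as a set of (speaker, start_word, end_word) spans:
    several speakers give a multi-agent move, and a gap between spans
    gives a discontinuous one."""
    label: str
    spans: list = field(default_factory=list)

    def agents(self):
        return sorted({speaker for speaker, _, _ in self.spans})

    def is_discontinuous(self):
        # Discontinuous if consecutive spans leave a gap in the word indices.
        ordered = sorted(self.spans, key=lambda span: span[1])
        return any(a[2] < b[1] for a, b in zip(ordered, ordered[1:]))

# The five-turn multi-agent "clarify" move above (invented word indices):
clarify = Move("clarify", [("A", 0, 25), ("B", 25, 26), ("A", 26, 29),
                           ("B", 29, 32), ("A", 32, 35)])
# The interrupted "ice cream" request:
request = Move("request", [("A", 0, 4), ("A", 9, 11)])
print(clarify.agents(), request.is_discontinuous())  # ['A', 'B'] True
```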
5.7 Coding of relations between utterances

We have seen that only one coding schema, DAMSL, allows for coding of relations between utterances. But in a sense, the HCRC schema also allows for coding of relations, namely implicitly in the coding of dialogue games. As an example, a Query-YN game starts with a Query-YN move and ends when the query has been answered (or when the game is abandoned). Now, if the last move of the game is e.g. an Answer-Y, one can make the assumption that this answer addresses the initiating query. Of course, this might not be true, which makes move coding a less powerful way of coding relations.
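This kind of implicit relation coding can be made concrete with a small sketch (the representation is assumed, not part of the HCRC schema itself: each move carries a game identifier, and the inference that an answer addresses the move that initiated its game is exactly the defeasible assumption just mentioned):

```python
def implicit_relations(moves):
    """moves: list of (move_label, game_id) in dialogue order. Returns
    (answer_index, query_index) pairs under the assumption that every
    answer move addresses the initiating move of its game."""
    initiator = {}
    links = []
    for i, (label, game) in enumerate(moves):
        if game not in initiator:
            initiator[game] = i        # first move opens the game
        elif label.startswith("answer"):
            links.append((i, initiator[game]))
    return links

# Two games, game 2 embedded inside game 1.
coded = [("query-yn", 1), ("query-w", 2), ("answer-w", 2), ("answer-y", 1)]
print(implicit_relations(coded))  # [(2, 1), (3, 0)]
```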
6 Summary and conclusions

When designing a coding schema for dialogue moves, there are several choices that can be made. From the above investigation, (at least) the following dimensions of variation emerge:

28 In a sense, these multi-agent moves can be seen as games rather than moves, which implies a model where games consist of moves which in turn consist of (sub)games.
- Range of phenomena to cover
- How to divide these phenomena into different layers
- How to divide each layer into categories, subcategories and so on
- Principles of (utterance) segmentation
- Relational moves?
- Multifunctionality of utterances?
- Multi-agent moves?
- Discontinuous moves?
- Domain dependence?
- Dialogue genre dependence?
- Theory dependence?
- Kind of definitions: intentional, surface-based
Along all these dimensions we find variation between different schemas, and the choices along different dimensions are to various degrees dependent on each other. Regarding all of these, we need to consider various trade-offs between cognitive plausibility, ease of coding, reliability and computational tractability. Of course, it is not necessary to make all these choices in designing a schema; some issues may simply be left open. This is especially true in the case of general coding schemas such as DAMSL. Care must be taken so that limitations in various coding tools are not allowed to influence these choices, or to force schema designers to make choices which would better be left open.
          number of   multi-           relations   discontinuous   reliability
          categories  functionality                units
  LINLIN  2-5         -29              -           -               high
  HCRC    12          -                + (games)   +               moves: high, games: fair
  DAMSL   24          +                +           +               fair

Table 8: Impression of expressiveness and reliability of schemas

Table 8 clearly indicates, not surprisingly, that there is a trade-off between the complexity and the reliability of a coding schema. For example, the HCRC schema does not permit utterances to be coded for more than one move. Also, it does not permit coding of relations, e.g. one cannot annotate which question an answer is an answer to. These limitations may make the schema less expressive and less realistic, but they also seem to make it more reliable.

Regarding our initial question about the usefulness of these schemas for coding the Airplane dialogues and for general use in the S-DIME project, there are signs that the DAMSL schema is being accepted as a standard for dialogue move coding. As noted, DAMSL uses intention-based rather than surface-based definitions, which is probably a better approach for our purposes. The disadvantage of DAMSL is the fact that it is still under development and has not yet been extensively used or tested in coding actual dialogues, while the LINLIN and HCRC schemas have the advantage of having actually been used, and they also have higher reliability rates than DAMSL. The latter fact can probably in part be explained by the smaller number of categories in the LINLIN and HCRC schemas.

29 While the schema does not rule out multifunctional utterances explicitly, multifunctionality doesn't seem to be used in practice. It is also nontrivial how to combine multifunctionality with a dialogue grammar model.
[Figure 5: the tree itself is not recoverable from the extracted text; its node labels include Move, Initiating (Forward Looking), Response (Backward Looking), Update, Assert, Reassert, Instruct, Other, Info-request, Check, Align, Action-directive, Query-yn, Query-w and Open-Option, with kappa values (K=0.89, K=0.66, K=0.7, K=?) attached to groups at different levels.]

Figure 5: Part of hypothetical schema formed by uniting the DAMSL and HCRC schemas, with phony Kappa values for groups at different levels

Tables 5 and 6 suggest that, if we view coding schemas as type hierarchies for dialogue moves, we can embed complex (parts of) schemas in simpler schemas. For example, we can produce a maximally complex schema by extending the DAMSL schema with parts of the HCRC schema (see Figure 5), and code dialogues using this maximally complex schema. This, of course, requires among other things that the category definitions of the different schemas are made compatible, and preferably also more precise. We can then compute reliability at any desired level (or combination of levels) in the hierarchy. Comparing the Kappa statistics obtained for the three coding schemas indicates that we can expect higher reliability for less complex schemas.

When analysing coded dialogues, we may choose to collapse some distinctions (e.g. seeing checks, aligns and queries as info-requests) if we want to. We may also choose to allow for coding of "non-leaf" categories (e.g. Initiative and Response). An essential property of any coding schema to be used in developing and testing a theory of dialogue moves is, of course, that it can be related to that theory. Following the idea of a coding schema as a type hierarchy, we may want to start with a simple model of dialogue using only two moves, initiating and response. As the theory develops, hopefully we will be able to move down the hierarchy, differentiating between more and more moves, and perhaps reaching something like the DAMSL schema somewhere along the way.
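The idea of collapsing distinctions by moving up a type hierarchy can be sketched as follows (the parent links below are a hypothetical fragment loosely modelled on Figure 5, not an actual unified schema):

```python
# Hypothetical parent links: child category -> immediate supercategory.
HIERARCHY = {
    "check": "info-request", "align": "info-request",
    "query-yn": "info-request", "query-w": "info-request",
    "info-request": "initiating", "assert": "initiating",
    "initiating": "move", "response": "move",
}

def collapse(label, level):
    """Follow parent links upward until a category in the chosen level
    set is reached; coded dialogues can then be analysed (and reliability
    computed) at that granularity."""
    while label not in level and label in HIERARCHY:
        label = HIERARCHY[label]
    return label

print(collapse("check", {"info-request", "assert"}))   # info-request
print(collapse("align", {"initiating", "response"}))   # initiating
```

Reliability at a coarser level is then obtained by collapsing both coders' labels to that level before computing kappa.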
References

[1] Allen, J. et al. (1995): The TRAINS Project: A case study in building a conversational planning agent. Journal of Experimental and Theoretical AI, to appear.
[2] Allen, J. and Core, M. (1997): Coding Dialogs with the DAMSL Annotation Scheme. Accepted for publication in Proceedings of the AAAI Fall 1997 Symposium.
[3] Allen, J. and Core, M. (1997): Draft of DAMSL: Dialog Act Markup in Several Layers.
[4] Allwood, J., Nivre, J. & Ahlsén, E. (1994): Semantics and Spoken Language: Manual for Coding Interaction Management. Report from the HSFR project Semantik och talspråk.
[5] Allwood, J. (1995): An Activity Based Approach to Pragmatics. In Gothenburg Papers in Theoretical Linguistics 76, Dept. of Linguistics, Göteborg University. Forthcoming in Bunt & Black (eds.) Approaches to Pragmatics.
[6] Andernach, T. (1996): A Machine Learning Approach to the Classification of Dialogue Utterances. To appear in Proceedings of NeMLaP-2, Bilkent University, Ankara, Turkey.
[7] Ahrenberg, L., Dahlbäck, N. and Jönsson, A. (1995): Coding Schemes for Studies of Natural Language Dialogue. In Working Notes from AAAI Spring Symposium, Stanford, 1995.
[8] Bilange, E. (1991): A task independent oral dialogue model. In Proceedings of the Fifth Conference of the European Chapter of the ACL, Berlin, 1991.
[9] Bunt, H.C. (1995): Dynamic Interpretation and Dialogue Theory. To appear in: M.M. Taylor, D.G. Bouwhuis & F. Néel (eds.) The Structure of Multimodal Dialogue, Vol. 2, Amsterdam: John Benjamins.
[10] Carletta, J., Isard, A., Isard, S., Kowtko, J., Doherty-Sneddon, G., Anderson, A. (1995): The coding of dialogue structure in a corpus. In Proceedings of the Twente Workshop on Language Technology: Corpus-based approaches to dialogue modelling, pp. 25-34.
[11] Carletta, J., Isard, A., Isard, S., Kowtko, J., Doherty-Sneddon, G. (1996): HCRC dialogue structure coding manual. Technical Report HCRC/TR-82.
[12] Carletta, J. (1996): Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2), pp. 249-254.
[13] Carletta, J., Isard, A., Isard, S., Kowtko, J., Newlands, A., Doherty-Sneddon, G., Anderson, A. (1997): The reliability of a dialogue structure coding scheme. Computational Linguistics, 23, pp. 13-31.
[14] Carletta, J., Dahlbäck, N., Reithinger, N., Walker, M. A. (1997): Report on the Dagstuhl-Seminar "Standards for Dialogue Coding in Natural Language Processing".
[15] Di Eugenio, B., Jordan, P. W. and Pylkkänen, L. (1997): The COCONUT project: dialogue annotation manual, DRAFT. Unpublished manuscript.
[16] Carlson, L. (1983): Dialogue Games: An Approach to Discourse Analysis. Reidel.
[17] Di Eugenio, B., Jordan, P. W. and Pylkkänen, L. (1997): The COCONUT project: dialogue annotation manual, DRAFT. Unpublished manuscript.
[18] Gaizauskas, R., Rodgers, P., Cunningham, H. and Humphreys, K. (1997): GATE User Guide.
[19] Flammia, G. and Zue, V. (199?): Empirical Evaluation of Human Performance and Agreement in Parsing Constituents in Spoken Dialogue. Eurospeech 95.
[20] Hancher, M. (1979): The classification of cooperative illocutionary acts. In Language in Society, vol. 8, 1979.
[21] Hayes, P.J. and Reddy, D.R. (1983): Steps toward graceful interaction in spoken and written machine communication. International Journal of Man-Machine Studies 19:231-284.
[22] Houghton, G. (1986): The Production of Language in Dialogue: A Computational Model. Ph.D. thesis, University of Sussex.
[23] Isard, A., Carletta, J. (1995): Transaction and action coding in the Map Task Corpus. Research Paper HCRC/RP-65.
[24] Jekat, S. et al. (1995): Dialogue Acts in VERBMOBIL. VM-Report 65.
[25] Jönsson, A. (1993): A Method for Development of Dialogue Managers for Natural Language Interfaces. In Proceedings of AAAI-93, pp. 190-195, Washington DC, 1993.
[26] Jönsson, A. (1995): Dialogue Actions for Natural Language Interfaces. In Proceedings of IJCAI-95, Montréal, Canada, 1995.
[27] Jönsson, A. (1995): A Dialogue Manager for Natural Language Interfaces. In Proceedings of the Pacific Association for Computational Linguistics, Second Conference, The University of Queensland, Brisbane, Australia, 1995.
[28] Kowtko, J., Isard, S., Doherty, G. M. (1993): Conversational Games Within Dialogue. Research Paper HCRC/RP-31, Human Communication Research Centre, University of Edinburgh.
[29] Levin, J. A. and Moore, J. A. (1977): Dialogue games: Metacommunication structures for natural language interaction. Cognitive Science 1(4), 395-420.
[30] Lewin, I., Russell, M., Carter, D., Browning, S., Ponting, K. and Pulman, S. G. (1993): A Speech-Based Route Enquiry System Built from General Purpose Components. In EUROSPEECH 93, Proceedings of the 3rd European Conference on Speech Communication and Technology, Berlin, 2047-2050.
[31] Nivre, J. (ed.) (1997): Transcription Standards. Report from the HSFR project Spoken Language and Social Activity.
[32] Power, R. J. D. (1979): The organisation of purposeful dialogues. Linguistics 17, 107-152.
[33] Traum, D. R. and Hinkelman, E. A. (1992): Conversation Acts in Task-oriented Spoken Dialogue. Computational Intelligence, 8(3):575-599, 1992. Also available as University of Rochester TR 425.
[34] Poesio, M. and Traum, D. R. (1997): Representing Conversation Acts in a Unified Semantic/Pragmatic Framework. Draft.
[35] Searle, J. (1969): Speech Acts. Cambridge: Cambridge University Press.
[36] Sinclair, J. M. and Coulthard, R. M. (1975): Towards an Analysis of Discourse: The English Used by Teachers and Pupils. Oxford University Press.
[37] Stenström, A-B. (1984): Questions and Responses. Lund Studies in English 68. Lund: CWK Gleerup.
[38] Traum, D. R. and Hinkelman, E. A. (1992): Conversation Acts in Task-oriented Spoken Dialogue. Computational Intelligence, 8(3):575-599, 1992. Also available as University of Rochester TR 425.
[39] Traum, D. R. (1996): Coding Schemas for Spoken Dialogue Structure. Unpublished manuscript.
[40] Wachtel, T. (1986): Pragmatic sensitivity in NL interfaces and the structure of conversation. In Proceedings of the 11th International Conference on Computational Linguistics, University of Bonn, pp. 32-42.
[41] Walker, M. A., Whittaker, S. (1990): Mixed Initiative in Dialogue: An Investigation into Discourse Segmentation. In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, 1990.
[42] Vanderveken, D. and Searle, J. R. (1985): Foundations of Illocutionary Logic. Cambridge: Cambridge University Press.
[43] Wittgenstein, L. (1958): Philosophical Investigations. Oxford: Blackwell.
A Additional schemas and parts of schemas

A.1 LINLIN

A.1.1 Schema I

Basic Moves
- Initiative (I): introduces a goal
- Response (R): satisfies a goal

Discourse management moves
- Discourse opening (DO)
- Discourse continuation (DC)
- Discourse Ending (DE)

A.1.2 Schema II

Functional type
See Section 3.1.2.

Topic
- Task (T)
- System (S)
- Dialogue (D)
- Order (O)

Focus - kind of info needed to specify Objects and Properties
- Fully Specified (FS)
- Local Context (LC)
- Global Context (GC)

A.2 HCRC

A.2.1 Move
See Section 3.2.2.

A.2.2 Game
- Toplevel
- Embedded

A.2.3 Transaction
- Normal
- Review
- Overview
- Irrelevant

A.3 DAMSL

Forward Looking Function
See Section 3.3.2.

Backward Looking Function
See Section 3.3.2.

Utterance Features

Information Level
- Task
- Task Management
- Communication Management
- Other

Communicative Status
- Abandoned
- Uninterpretable

Syntactic Features
- Conventional Form
- Exclamatory Form
B A segment of Airplane dialogue TV3001.1.1 from the Goteborg Spoken Language Corpus

Note that this transcript contains two initial utterances not included in the simplified representation, and a similarly ignored initial continuation of the first utterance by I ("leka nu").

$I: får vi0 // < börja0 > @ < high pitch >
$J: < ja0 > @ < quiet >
$I: leka nu / okej ja1 har alltså ett plan framfö mej som / har a1 / två hjul / a0 så0 e0 de0 liksom ett hjulchassi / a0 / en0 lång kropp a0 så0 e0 de0 TVÅ vingar / du du ser rom1 här e1 [3 långa ]3
$C: < [3 e0 de0 en0 ]3 dubbeldäckare > elle @ < event: C looks for eye contact >
$I: näe / de0 e0 bara två vingar breve varann så0 att säja / dom1 här / me0 sju hål i / dom1 pinnana / ser ru
$C: a0
$I: dom1 e0 vingar
$C: de0 [4 e0 vingar ]4
$I: [4 så0 att säja ]4 a0 så0 sitter de0 en0 sån här fem+ / +hålsgrejs / ska sitta längst bak som / stjärtvinge
$C: mhm
$I: o1 / dom1 me0 TRE hål / e0 propeller / de0 < fi+ > a0 så0 finns de0 en0 till [5 me0 tre hål va0 ]5 @ < cutoff: finns >
C Original DAMSL-coded dialogue segment

hello [sil] can I help you
yes [sil] um [sil] I have a problem here I need to transport one tanker of orange juice to Avon [sil] and a boxcar of bananas to Corning [sil] by three p.m. and I think it's midnight now
uh right it's midnight
okay so we need to [sil] um get a tanker of OJ to Avon is the first thing we need to do + so +
+ okay +
[click] so we have to make orange juice first
mm-hm [sil]