Topic Information and Spoken Dialogue Systems

Kristiina Jokinen and Tsuyoshi Morimoto

ATR Interpreting Telecommunications Research Laboratories
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
email: {kjokinen, [email protected]}

Abstract

The paper concerns dialogue coherence measured relative to dialogue topics, and the usefulness of topic information in language modelling for speech recognition. It proposes a linguistically motivated method to recognize topics in task-oriented dialogues and reports preliminary results on the accuracy of the method. The method combines a top-down approach (topic tree) and a bottom-up approach (information structure of utterances), and aims to overcome two problems in topic modelling: the topic's domain-dependence and the lack of a standard definition of the information structure of utterances.

1 Introduction

Rational agents produce utterances which are intentionally and thematically linked to previous utterances: the agents' intentions as well as the content of the utterances contribute to some logical organisation of the related events and propositions. Speakers' intentions are usually modelled in terms of speech acts, which can be assumed to be domain-independent and provide a suitable basis for statistical coherence measures (Katoh and Morimoto, 1996). The information content of the utterances, however, is related to the exchange of domain information, and a similar domain-independent classification for possible topics is impossible. Domain-independent dialogue models thus tend to discard topic information, although such general models also tend to be less specific and hence less accurate. Our starting point is that topic information is an important knowledge source, one of the major factors for creating coherence in goal-oriented dialogues, cf. (Möller, 1996). We aim at incorporating

* Proceedings of the Natural Language Pacific Rim Symposium 1997, Phuket, Thailand.

topic information into a general dialogue model, thus essentially continuing the research started in (Katoh and Morimoto, 1996). In AI-based dialogue modelling, the use of topic has mainly been supported by arguments regarding processing effort (search space limits) and anaphora resolution, and topics are associated with a particular discourse entity, the focus, which is currently in the centre of attention and on which the participants want to focus their actions. Our goal is to use thematic information in predicting the likely content of the next utterance, and thus we are more interested in the topic types that describe new information exchanged in the utterances than in the actual topic entity. The topic model is to assist speech processing, so we also prefer a model that relies on observed facts and uses statistical information instead of elaborate reasoning about plans and world knowledge. The model should also adapt easily to extensions, for instance to an account of sentential stress and pitch accent.

The guidelines for our topic modelling are summarised as follows. The topic model should be

1. linguistically motivated: based on the information structure of the utterances;
2. surface-syntax oriented: no deep analysis of sentence meanings nor a world model available;
3. operational: automatic recognition possible.

We propose a topic model which can be used to improve predictions of what will be said next. From the dialogue corpus, we extract a probabilistic topic tree which describes possible dependencies between different topic types. The tree is constructed by calculating topic sequences in a training corpus, and it provides top-down information for the prediction of the likely next topic. Bottom-up analysis of the information structure of the utterances identifies new information conveyed by particular words of the utterance and matches the words to possible topic types. The topic is assigned to an input utterance as the best candidate out of the possible topic types proposed by the topical words, using the likelihood information encoded in the tree about the relative probabilities of the shifts from the previous topic to the candidates.

The rest of the paper is organised as follows. Section 2 introduces our bottom-up approach to model the information content of utterances. Section 3 deals with topic tagging, and Section 4 gives an overview of the topic tree. Section 5 discusses the usefulness of the model in speech recognition and reports preliminary results on the accuracy of the method. Finally, Section 6 gives conclusions and points to future work.
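As a rough illustration of how the two knowledge sources combine, the sketch below trains topic-shift probabilities from tagged dialogues and uses them to rank the topic types proposed bottom-up by topical words. The tag names, corpus format, and maximum-likelihood estimation are our own illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

def train_topic_shifts(dialogues):
    """Count topic-to-topic shifts in tagged dialogues and normalise
    them into conditional shift probabilities P(next | previous)."""
    counts = defaultdict(Counter)
    for tags in dialogues:  # each dialogue is a sequence of topic tags
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return {prev: {tag: n / sum(ctr.values()) for tag, n in ctr.items()}
            for prev, ctr in counts.items()}

def predict_topic(prev_topic, candidates, shift_probs):
    """Pick the candidate topic (proposed bottom-up by topical words)
    with the highest shift probability from the previous topic."""
    probs = shift_probs.get(prev_topic, {})
    return max(candidates, key=lambda tag: probs.get(tag, 0.0))

# Toy corpus: three tagged dialogues (hypothetical tag sequences).
dialogues = [["room", "stay", "name"],
             ["room", "paym", "name"],
             ["room", "paym", "card"]]
shifts = train_topic_shifts(dialogues)
# Topical words in the next utterance propose {"paym", "stay"}:
print(predict_topic("room", ["paym", "stay"], shifts))  # -> paym
```

In the paper's full model the candidates come from matching utterance words to topic types; here they are simply given as a list.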

2 Information Structure of Utterances

Two basic information structures for utterances have been suggested: the topic-comment structure (Gundel, 1985; Rats, 1996), and the focus-(back)ground structure (Clark and Haviland, 1977; Carlson, 1983; Vallduví and Engdahl, 1996).[1] The topic-comment structure divides the utterance into topic (what is talked about) and comment (what is said about the topic). The order in which different sentential elements can be considered topics is captured in the topicality hierarchy (Givón, 1985), akin to the availability hierarchy of different foci in AI focus stacks. However, the strict topic-comment structure has several drawbacks. Utterances can be topicless, and elliptical utterances have no explicit topic at all. Our main concern is that most topics in task-oriented dialogues deal with the speaker and thus provide no help for task identification: [I]T will take a twin; May [I]T have your name, please? Moreover, utterances may have the same topic-comment structure but differ with respect to the new information that the comment carries: the utterances in (1) talk about the speaker, but (a) focuses on what she would like to do, and (b) on how she would like to pay:

(1) a. [I]T would like to [pay by Master Card]F.
    b. [I]T would like to pay [by Master Card]F.

The focus-background structure distinguishes new information, the focus,[2] from the old, known or expected information, the background. The locus of new information is related to the sentential nuclear stress, so it is important for speech processing and a useful basis for our topic model. (Carlson, 1983) defines new and old information as follows: if a sentence S = XBY is addressed to a sentence S' = XAY, then string B is old if B repeats A, and string B is new, or focus, if B replaces A. The sentence S addresses S' if S can be interpreted in the context of S'. The context can be set up by appropriate questions. Following the two-level question method of (Vallduví and Engdahl, 1996), contexts for (1) can be set up as follows:

(2) a. What about your activities? What would you like to do?
       I'd like to [pay by Master Card]F
    b. What about paying? How would you like to pay?
       I'd like to pay [by Master Card]F

(Jokinen, 1994a) combines the "aboutness" and "newness" approaches into the following notions:[3]

Central Concept, CC: a discourse referent which the utterance is about.
NewInfo: a property or property value which is new with respect to some CC.

NewInfo is the information centre of the utterance, singled out by the question method, while CC refers to the topic-entity and fixes the viewpoint from which NewInfo is presented. NewInfo is always realised, but the realisation of CC depends on the context. The concepts related to CC form the background: those already introduced in the dialogue are old information, and those pending to be realised are potential new information, likely to be talked about next. The information structure of utterances is thus represented as in (3). The locus of information which is important for our topic modelling is the NewInfo, marked with the subscript New.

(3) [[I]CC would like to]G [pay by Master Card]New
    [[I]CC would like to pay]G [by Master Card]New

[1] Differences between the two approaches are discussed in (Jokinen, 1994b; Vallduví and Engdahl, 1996).
[2] This focus is a different concept from the AI-focus, which refers to the most salient element activated in the course of the dialogue and is related to the linguistic topic (aboutness) rather than to focus (newness).
[3] Incidentally, the same distinction is made by Vallduví (Vallduví and Engdahl, 1996), whose starting point is not in dialogue management but in the cross-linguistic realisation of information packaging. He divides utterances into focus (our NewInfo) and ground, and the ground further into link (our CC) and tail. The combinations of the tripartite division instruct the hearer on the ways in which she should update her information state.
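Carlson's definition suggests a simple operational reading: align the sentence S = XBY with the context sentence S' = XAY, treat the longest shared prefix and suffix as old information, and the replaced middle as the focus. A minimal sketch under the assumption that both sentences are whitespace-tokenised strings (the function name is ours):

```python
def new_info(sentence, context):
    """Split sentence = X B Y against context = X A Y: the longest
    shared prefix X and suffix Y are old information (background);
    the replaced middle B is new information (the focus)."""
    s, c = sentence.split(), context.split()
    i = 0
    while i < min(len(s), len(c)) and s[i] == c[i]:
        i += 1                                  # shared prefix X
    j = 0
    while (j < min(len(s), len(c)) - i
           and s[len(s) - 1 - j] == c[len(c) - 1 - j]):
        j += 1                                  # shared suffix Y
    return s[i:len(s) - j]                      # replaced middle B

print(new_info("I'd like to pay by Master Card",
               "I'd like to pay somehow"))  # -> ['by', 'Master', 'Card']
```

Real utterances would of course need syntactic normalisation before such an alignment is meaningful; the point is only that the question method pins down a contiguous focus span.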

3 Tagging corpus with topics

We have tagged 80 transcribed English dialogues of the spoken bilingual ATR Travel Dialogue Corpus according to the information structure outlined above. The dialogues deal with hotel reservations and tourist information, and basic statistics of the dialogues are given in Table 1.

speaker    turns    %    utt.    %    words   %
clerk        881   53    2486   59     845   84
customer     788   47    1736   41     663   66
total       1669  100    4222  100    1008

Figure 1: Dialogue statistics.

The speakers' turns are segmented into utterances according to written-language markers (periods, commas) already present in the transcripts. Hesitations, filled pauses, etc. are analysed as temporizers (time management acts) and considered separate acts unless they occur in the middle of a sentence (production errors). Complex utterances are divided into their constituent clauses (4), except for conditional and temporal clauses (if I can help you with anything else, please feel free to give me a call) and coordinations of reasons and conclusions (I'd like to relax so I definitely need a room with a bath), which are regarded as single utterances.

(4) My name is Kazuo Suzuki                  *Name
    and I have a VISA card                   *CardType
    and the number is 4883 5800 4088 1718    *CardN

The tags are abstractions of the NewInfo exchanged on a particular CC. NewInfo is found using the question method, and the topic is assigned as the property of the CC updated by the NewInfo.

(5) a. What about your activities? What would you like to do?
       I'd like to [pay by Master Card]F  => *Paying
    b. What about paying? How would you like to pay?
       I'd like to pay [by Master Card]F  => *PaymentMethod
    c. What about paying? You'll pay somehow, but [how]F would you like to pay?  => *PaymentMethod

One topic tag is assigned to each utterance. If the utterance contains multiple items of new information, a tag which subsumes the different subtopics is used (cf. the topic tree in Section 4).

(6) I'd like to stay [for two nights] [on August 10th]
    *StayLength and *ArrivalDate  => *Staying

Utterances like That's correct; Yes, please; That sounds nice have no special lexical item which realises the NewInfo. They occur after the partner's suggestion, explanation or confirmation request, and since the speaker continues the topic by accepting or rejecting it, the tag is assigned on the basis of the context, as the previous tag. For utterances like (7), which explicitly convey information about the speaker's attitudes, preferences and abilities, the NewInfo is the actual meta-level attitude. However, subsequent utterances usually refer to the task-related information locus, and, since the speaker's attitudes are modelled by dialogue acts, we assign the topic tag on the basis of the content of the attitude.

(7) a. I don't care how much it costs as long as the room is on the second or third floor.  *Room
    b. I think we can arrange that.  *Room

Utterances like (8) function as time management requests or conventionalised dialogue acts, and the notion of topic is not really relevant for them: they do not request or provide information about the domain, but control the dialogue flow itself. Their topics are classified as *iam, InterAction Management topics. More than one third of the utterances fall into this class.

(8) a. Could you wait for a moment while I check?  *iam (request to wait)
    b. Sorry to keep you waiting.  *iam (apology, renewed contact)
    c. Okay.  *iam (acknowledgement)
    d. We will be looking forward to your arrival.  *iam (closing)
    e. Thank you for calling the New Washington Hotel.  *iam (thanking)
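The tagging conventions above can be caricatured as a fallback rule: topical words decide the tag, interaction-management cues yield *iam, and contentless continuations inherit the previous tag. The word lists below are illustrative assumptions, not the lexicon actually used:

```python
# Illustrative lexicons; the real tagging used the question method
# and a much richer tag set (about 80 tags before pruning).
TOPIC_WORDS = {"room": "room", "card": "paym", "pay": "paym",
               "nights": "stay", "name": "name"}
IAM_CUES = {"okay", "sorry", "moment", "thank", "goodbye", "hello"}

def tag_utterance(utterance, prev_tag):
    """Assign a topic tag: topical word > *iam cue > previous tag."""
    words = utterance.lower().replace(".", "").replace(",", "").split()
    for w in words:
        if w in TOPIC_WORDS:
            return TOPIC_WORDS[w]   # NewInfo carried by a topical word
    if any(w in IAM_CUES for w in words):
        return "iam"                # interaction management utterance
    return prev_tag                 # e.g. "That's correct": inherit context

print(tag_utterance("Could you wait for a moment while I check", "room"))  # -> iam
```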

The main problem in topic tagging is the level of specificity: how fine-grained should the tag distinctions be? For instance, when talking about a reasonable room price, the customer may first introduce the price range by talking about her budget: We're planning a budget of fifty dollars per person. Such topics resemble side-sequences brought in as explanations or reasons rather than the main threads of the domain, and the topic *RoomPrice is generalised to subsume them as well. However, the distinction between single unique topics and topics which can be subsumed by a more general class is not clear-cut. We have classified topics which have no direct relation to the reservation task as unique topics. Examples of such topics are given in (9); they comprise about 1.7% of the utterances in the corpus.

(9) a. My business is taking longer than I expected.
    b. And if it isn't too much trouble, could you please tell me how safe that area is?
    c. My wife will be relieved to hear that.

4 Topic trees

Topic trees provide a "top-down" approach to dialogue coherence: the branches describe what sort of shifts are cognitively easy to process and thus likely to occur in dialogues. Originally, "focus trees" were suggested by (McCoy and Cheng, 1991) to enable more flexible focussing management than a stack, and (Carcagno and Iordanskaja, 1993) used topic trees to structure domain knowledge in order to provide information for their text planner.

We cluster topics into a tree where each node corresponds to a topic tag (Figure 2). Possible traversals of the tree describe possible thematic structures of the domain, i.e. likely topic sequences in the dialogues. For instance, focusing on the action "make a reservation" highlights what the speaker knows about reservations, and she is likely to move on to discuss the room that she wants to reserve or the dates and length of her stay. The tree[4] can be traversed in any order, but in practice some of the transitions are favoured: likely topic shifts correspond to shifts from a node to its daughters or sisters, while shifts to nodes in separate branches are less likely to occur. Thus, after *RoomLoc it is unlikely that the topic *Card occurs if the topic *RoomPrice has not yet been discussed: *RoomLoc and *Card are not sisters, and switching attention back and forth between them would make their processing difficult.[5] Once a topic has been exhaustively discussed, it is not likely to re-occur in the dialogue, and its probability drops close to zero. Towards the end of the dialogue it is thus not very likely that topics (or: words related to the topic types) which are already closed will be found, unless the speaker explicitly re-opens them for confirmation.

At the moment, the topic tree is manually built from our tagged corpus, but on-going research is aimed at making the tagging automatic. We envisage that the domain model may be constructed using conceptual clustering techniques (Fisher and Langley, 1985) or word classification (McMahon and Smith, 1996).

[4] We will continue talking about a topic tree, although in statistical modelling the tree becomes a topic network where the shift probability between nodes which are not daughters or sisters of each other is close to zero.
[5] Awkward shifts are usually syntactically marked: afterthoughts and jumps to distant topics are accompanied by syntactic markers such as By the way; oh, I forgot to say.

Figure 2: A partial topic tree for the reservation domain. (The tree layout is not recoverable from this transcript; its node labels include CONTACT, NAME, TEL, ADDRESS, SPELLING, MEALS, PRICE, CHECKING, IN, OUT, STAY, ARRIVAL, DEPARTURE, LENGTH, ROOM, TYPE, PEOPLE, LOC, PAYMENT, CHEQUE, CARD, NUMBER, EXP.DATE.)
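The daughter-or-sister constraint on likely shifts can be sketched over a fragment of such a tree. The parent-child layout below only approximates Figure 2 and is our own assumption:

```python
# Approximate fragment of the reservation-domain topic tree (assumed layout).
TREE = {"reservation": ["contact", "checking", "stay", "room", "payment"],
        "contact": ["name", "tel", "address"],
        "stay": ["arrival", "departure", "length"],
        "room": ["type", "people", "loc", "price"],
        "payment": ["cheque", "card"]}
PARENT = {kid: parent for parent, kids in TREE.items() for kid in kids}

def easy_shift(a, b):
    """A topic shift a -> b is 'easy' (likely) if b is a daughter or
    sister of a; shifts across separate branches are marked/unlikely."""
    if b in TREE.get(a, ()):                  # daughter
        return True
    pa, pb = PARENT.get(a), PARENT.get(b)
    return pa is not None and pa == pb        # sisters

print(easy_shift("loc", "price"))  # sisters under room -> True
print(easy_shift("loc", "card"))   # separate branches  -> False
```

In the statistical version of the model (footnote 4), this binary test is replaced by shift probabilities estimated from the tagged corpus.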

5 Topic model and speech recognition

To test the feasibility and accuracy of the proposed topic model for speech recognition, we calculated trigram perplexity[6] and compared the topic model to a dialogue act model built for the same set of dialogues. The original number of topic tags (80) is too big to be used in successful testing, so the topic tree was pruned and only the nine topmost nodes were taken into account, merging the subtopics into appropriate mother topics (Figure 3).

tag      count     %    interpretation
iam       1743   41.3   Interaction Management
room       824   19.5   Room, its properties
stay       332    7.9   Staying period
name       320    7.6   Name, spelling
res        310    7.3   Make/change/extend/cancel reservation
paym       250    5.9   Payment method
contact    237    5.6   Contact Info
meals      135    3.2   Meals (breakfast, dinner)
mix         71    1.7   Single unique topics

Figure 3: Topic tags for the experiment.

In speech recognition, given the utterance W_1^n = w_1,...,w_n to be recognized, we want to maximize the likelihood of the word string by maximizing the probability of each word w_i in the context in which it occurs. In the trigram model, the context for a word contains its two previous words, and the conditional probability of a word w_i given the two previous words w_{i-1} and w_{i-2} is calculated as:

    P(w_i | w_{i-2} w_{i-1}) = Occ(w_{i-2} w_{i-1} w_i) / Sum_{w_x} Occ(w_{i-2} w_{i-1} w_x)

To overcome the sparse data problem, we used the back-off model (Katz, 1987): if the string on one level does not occur, its probability is calculated by backing off to the next lower level. Thus, if the trigram does not occur, its probability is assigned on the basis of bigrams, then on unigrams, multiplied by some estimation of their relative weight.

The quality of statistical models can be measured by test set perplexity. The goal is to find models with low perplexity: the direct interpretation of the perplexity is the number of words among which the next word must be chosen. Our test perplexity was calculated using the normalized formula:

    PP = 2^{-(1/n) log L(W_1^n)}

where n is the number of words in the utterance W_1^n = w_1,...,w_n and L is the likelihood of the word string. Trigram perplexity results are shown in Table 4.

Words      General model   Topic-dependent model
                           Random   Manual tags
known          12.77        23.93      10.31
open set       14.81        33.71      12.72

Figure 4: Trigram perplexity in the experiments.

As expected, perplexity increases on the topic-dependent model if the topics are randomly tagged. However, compared to the general model trained on all dialogues, perplexity decreases by 20% for a topic-dependent model where topics have been (manually) tagged, and by 14% if we use an open test with unknown words included as well. Since any consistent classification is likely to improve the quality of statistical models, we need to conduct further experiments on automatically tagged topic corpora. However, the results show that a topic model based on a linguistically motivated classification provides a good starting point, and, at least for the amount of data we used for each topic, it is useful to specialise the language model for speech recognition depending on the topic.

Since our corpus is also tagged with dialogue act information, we compared the topic information to dialogue act information on a word-based recognition model, using the same program for both tasks.[7] Now the conditional probability of a tag s given a word w is computed as:

    P(s | w) = 0                    if Occ(w) = 0,
               Occ(w|s) / Occ(w)   otherwise.

Given the utterance W_1^n = w_1,...,w_n, a tag is assigned to it by adding up the conditional probabilities of each word w_i and selecting the tag with the highest probability.

Testing of the method is carried out using training and test dialogue sets of different sizes. The sets are formed by partitioning the tagged corpus into two by random dialogue selection; the five different partitions used are shown in Table 5. For each utterance in the test set, a tag is assigned according to the formula above, and this is then compared with the tag in the manually tagged corpus. The accuracy of the method is calculated as the percentage of correctly assigned tags with respect to the total number of test utterances (4222).

Train/Test    30/10   40/10   50/10   60/10   70/10
Unigrams
  Topics       77.8    79.2    78.9    80.4    78.2
  Dacts        71.2    73.4    72.2    73.3    72.3
Trigrams
  Topics       80.0    82.5    84.0    84.6    84.8
  Dacts        84.0    85.6    85.5    85.4    85.8

Figure 5: Accuracy of topic and dialogue act recognition. The dialogue act accuracies are from Quantz (pc).

The accuracy of the different training and test set combinations, averaged over 10 runs, is shown in Table 5. There is a steady increase in accuracy as the training set increases, and the best result occurs when the training set contains 60 dialogues. What is interesting in the comparison is that topic labelling outperforms dialogue act labelling in the unigram test, but if trigrams are taken into account, dialogue act labelling performs slightly better. This is understandable if we consider how the topics and dialogue acts contribute to dialogue coherence in the first place. The topic tags describe the content of the utterance, encoded in the semantics of the words occurring in the utterance (usually one or two topical words contribute to the topic tag), while the acts describe the speaker's intentions, encoded in the function of the utterance. The correlation between dialogue acts and individual words is thus less tight: the function is a matter of the whole utterance (longer word strings) in relation to the whole dialogue.

[6] We are grateful to Sabine Deligne (ATR-ITL, Dept 1) for discussions on speech recognition and for the trigram perplexity measures using the Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit.
[7] The act labels are based on the classification proposed by (Seligman et al., 1994). The dialogue act tagging and test runs were done by J. Quantz during his visit to ATR.
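The word-based tag model used in this comparison sums P(s|w) = Occ(w|s)/Occ(w) over the words of the utterance and selects the highest-scoring tag. A unigram-only sketch of that procedure follows (the toy corpus and tag names are illustrative; the actual experiments also used trigram contexts):

```python
from collections import Counter, defaultdict

def train(tagged_utterances):
    """Count word occurrences overall and per tag, so that
    P(s|w) = Occ(w|s) / Occ(w) can be looked up directly."""
    occ_w, occ_w_s = Counter(), defaultdict(Counter)
    for words, tag in tagged_utterances:
        for w in words:
            occ_w[w] += 1
            occ_w_s[tag][w] += 1
    return occ_w, occ_w_s

def assign_tag(words, occ_w, occ_w_s):
    """Sum P(s|w) over the utterance's words; words unseen in training
    (Occ(w) = 0) contribute nothing. Return the highest-scoring tag."""
    scores = {tag: sum(ctr[w] / occ_w[w] for w in words if occ_w[w])
              for tag, ctr in occ_w_s.items()}
    return max(scores, key=scores.get)

# Toy training corpus of (word list, topic tag) pairs.
corpus = [(["single", "room", "bath"], "room"),
          (["two", "nights", "august"], "stay"),
          (["master", "card", "number"], "paym")]
occ_w, occ_w_s = train(corpus)
print(assign_tag(["a", "room", "with", "bath"], occ_w, occ_w_s))  # -> room
```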

6 Future Work

This paper reports on-going work on topic modelling and leaves many questions open. Future work concerns two research lines in particular. First, we intend to test the topic model on all available data (evaluating the model when more unique topics occur), thus improving the generality of the topic tree. Second, we will explore ways to automate the question method and move from manual to automated topic tagging. This includes exploiting the semantic information available in the parser as well as exploring conceptual clustering methods for the automatic construction of the tree. The preliminary results of the proposed model show that the use of topic information in speech recognition is promising when the quality of the model is measured in terms of trigram perplexity. Furthermore, the topic model performs slightly better than a surface-structure oriented dialogue act model, suggesting that topics provide at least as reliable a source for dialogue coherence as intention-based dialogue acts when used as the only knowledge source. Future research will answer the question of how independent the two knowledge sources are, and how the dialogue model's accuracy can be improved when the two sources are combined.

7 Acknowledgements

We are grateful to the following people for their help and useful discussions: Naoya Arakawa, Ezra Black, Sabine Deligne, Laurie Fais, Patric Healey, Joachim Quantz.

References

D. Carcagno and L. Iordanskaja. 1993. Content determination and text structuring: two interrelated processes. In H. Horacek and M. Zock, editors, New Concepts in Natural Language Generation, pages 10-26. Pinter Publishers, London.

L. Carlson. 1983. Dialogue Games. D. Reidel Publishing Company, Dordrecht.

H. H. Clark and S. E. Haviland. 1977. Comprehension and the given-new contract. In R. O. Freedle, editor, Discourse Production and Comprehension, Vol. 1. Ablex.

D. H. Fisher and P. Langley. 1985. Approaches to conceptual clustering. In Proceedings of IJCAI-85, pages 691-697.

T. Givón. 1985. Iconicity, isomorphism and non-arbitrary coding in syntax. In J. Haiman, editor, Iconicity in Syntax, pages 187-219. John Benjamins, Amsterdam.

J. K. Gundel. 1985. Shared knowledge and topicality. Journal of Pragmatics, 9:83-107.

K. Jokinen. 1994a. Coherence and cooperation in dialogue management. In K. Jokinen, editor, Pragmatics in Dialogue Management, pages 97-111. Proceedings of the XIVth Scandinavian Conference of Linguistics, Göteborg.

K. Jokinen. 1994b. Response Planning in Information-Seeking Dialogues. Ph.D. thesis, University of Manchester Institute of Science and Technology.

N. Katoh and T. Morimoto. 1996. Statistical method of recognizing local cohesion in spoken dialogues. In Proceedings of the 16th COLING, pages 634-639.

S. M. Katz. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, pages 400-401.

K. McCoy and J. Cheng. 1991. Focus of attention: Constraining what can be said next. In C. L. Paris, W. R. Swartout, and W. C. Moore, editors, Natural Language Generation in Artificial Intelligence and Computational Linguistics, pages 103-124. Kluwer Academic Publishers, Norwell, Massachusetts.

J. G. McMahon and F. J. Smith. 1996. Improving statistical language model performance with automatically generated word hierarchies. Computational Linguistics, 22(2):217-247.

J-U. Möller. 1996. Using DIA-MOLE for unsupervised learning of domain specific dialogue acts from spontaneous language. Technical Report FBI-HH-B-191/96, University of Hamburg.

M. Rats. 1996. Topic Management in Information Dialogues. ITK Dissertation Series, Katholieke Universiteit Brabant, Tilburg.

M. Seligman, L. Fais, and M. Tomokiyo. 1994. A bilingual set of communicative act labels for spontaneous dialogues. TR-IT-81, ATR Interpreting Telecommunications Research Laboratories, Kyoto.

E. Vallduví and E. Engdahl. 1996. The linguistic realization of information packaging. Linguistics, 34:459-519.
Using DIA-MOLE for unsupervised learning of domain speci c dialogue acts from spontaneous language. Tech Report FBI-HH-B-191/96, University of Hamburg. M. Rats. 1996. Topic Management in Information Dialogues. ITK Dissertation Series, Katholieke Universiteit Brabant, Tilburg. M. Seligman, L. Fais, and M. Tomokiyo. 1994. A bilingual set of communicative act labels for spontaneous dialogues. TR-IT-81, ATR Interpreting Telecommunications Research Laboratories, Kyoto. E. Vallduv and E. Engdahl. 1996. The linguistic realization of information packaging. Linguistics, 34:459{519.