Evaluating Robustness in Dialogue Systems*

Kristiina Jokinen
Computational Linguistics Laboratory
Graduate School of Information Science
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-01 JAPAN
Email:

kris@is.aist-nara.ac.jp

Summary:

This paper contrasts two views of robustness, wide coverage as opposed to communicative ability, and provides a preliminary study of the methods and problems in evaluating robust Natural Language Dialogue (NLD) systems. A special emphasis is put on the metrics of conversational adequacy, so as to operationalise this into a quantifiable measure.

1 Introduction

Robustness is generally regarded as one of the main requirements of NLD systems, encoding some standards of appropriate system behaviour. However, in order to structure our goals in building robust dialogue systems, we should have a clear definition of robustness. What is it that we require from a good NLD system? We also need to operationalise this goal and define some quantifiable measures: which specific system properties can be identified with the given definition? Finally, to assess different systems in regard to their robustness, a method to determine the appropriate value for a given measure and a given system must be defined: how can we evaluate robustness in NLD systems? In this paper, I will answer these questions in turn.

2 Robust NLD systems

2.1 Two approaches

[9] defines robustness as the system's ability to react sensibly to all input. With the aim of producing good user interfaces, careful design, usually based on empirical WOZ studies, is thus carried out to specify particular system requirements and to achieve robustness within the given application domain [7]. However, two problems arise from this definition. First, wide coverage as such cannot be equated with robustness, since this could be trivialised to quasi-conversational robustness by responding "Please re-phrase" to all problematic requests. Second, coverage is tied to a particular application, and if the system is tailored according to the specific needs peculiar to that domain, it will be difficult to apply the system to new domains. Consequently, robustness in the widest sense of 'all input' will not be achieved.

* Presented at the DiaLeague96 Summer Session Workshop, Tokyo, 23rd July 1996.


A dialogue system's desired behaviour can also be grounded on the notion of cooperative and appropriate communication [5, 12]. Robustness is thus not only a quantitative measure of the system's response capabilities, but it also subsumes qualitative evaluation of the system's communicative competence. System design is based on empirical research, but it does not aim to model the collected dialogues as such. Instead, general pragmatic principles are sought as the basis on which contextually adequate responses can be formulated. However, the problem is that the principles of 'conversational adequacy' are often as general and difficult to formalise as the notion of robustness itself.

2.2 Identifying robustness goals

The design of NLD systems should be based on both a solid theory of communication and empirical corpus studies of typical dialogues that the system is to handle. Below I discuss four general requirements for robust NLD systems, drawn from a theory of communication as cooperative rational action [2] and empirical research on information-seeking dialogues [12]. Robustness is understood as the system's communicative competence, and included as one of the desiderata of a communicatively competent NLD system.

1. Physical feasibility of the interface
2. Efficiency of reasoning components
3. Syntactic robustness
4. Conversational adequacy

2.2.1 Interface

Human-computer communication has constraints on physical and biological levels (the user's eye-sight, physical capability to type or use the mouse, etc.), and these external enablements form an important requirement for flexible, user-friendly dialogue systems. In HCI studies, these factors are usually gathered under the notion of usability: systems should be easy to use, easy to learn and easy to talk to [13]. This is also discussed in [1], in terms of habitability (the user should be able to express commands and requests conveniently without transgressing the linguistic capabilities of the interface) and transparency (the system's capabilities and limitations should be evident to the user from experience). The use of multimodal facilities like graphics and speech can also contribute to the system's robustness. Combined with natural language dialogue capabilities, such interfaces are powerful tools in a wide variety of applications dealing with human-computer communication.

2.2.2 Efficiency

An NLD system should not slow down the interaction with the background system noticeably. If the system does not reply in a reasonably short time, the user normally starts to think that contact with the system has been lost, and this may cause unnecessary turn-takings and other complications in dialogue management. Dialogue studies also show that even when users acknowledge the naturalness of system responses, they regard long response times as the main factor in characterizing differences between their computer conversations and real human-human conversations. Considerable attention should thus be paid to fast and efficient algorithms: this requires computational complexity measures and a study of efficient implementation possibilities.

Real-time operation usually means a decrease in the system's reliability and/or user-friendliness. To overcome this, some processing heuristics are needed: e.g. predicting the most probable next utterances, or classifying different discourse phenomena as frequently or rarely occurring and providing a fast response method for the former, while allowing more processing for the latter.
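The fast-path/slow-path idea can be sketched as a simple frequency-based dispatcher. The phenomenon labels and handlers below are hypothetical illustrations, not part of any system described in this paper:

```python
# Sketch of a frequency-based dispatch heuristic for real-time dialogue
# processing: frequent phenomena get a cheap template-based handler,
# rare ones are routed to the slower full reasoning component.
# All phenomenon names and handler bodies are illustrative assumptions.

FREQUENT = {"wh-question", "yes-no-question", "acknowledgement"}

def fast_handler(utterance: str) -> str:
    # Cheap pattern/template-based response for common cases.
    return f"fast:{utterance}"

def full_reasoning(utterance: str) -> str:
    # Placeholder for the expensive dialogue-manager path.
    return f"deliberated:{utterance}"

def respond(utterance: str, phenomenon: str) -> str:
    # Dispatch on how frequently the phenomenon occurs in the corpus.
    if phenomenon in FREQUENT:
        return fast_handler(utterance)
    return full_reasoning(utterance)
```

In a real system the `FREQUENT` set would be derived from corpus frequencies, and the fast path would be bounded by a response-time budget.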

2.2.3 Syntactic robustness

Robustness in the narrow sense refers to the system's ability to cope with syntactically problematic input. However, in this paper robustness also includes conversational adequacy, which goes beyond linguistic parsing and generation capabilities. Nevertheless, a good natural language interface improves the system's usability: it reduces unintended problem solving situations like "what would be the best way to put the questions so that it would understand me".¹ Linguistic sophistication has often been questioned as esoteric and rather unimportant.² Negative conclusions may be due to the constrained underlying task, simple pattern matching techniques and the emphasis on a working service system. Considering more elaborated NLD systems and their communicative capability, a good quality parser and generator, based on solid grammatical principles, are essential. However, more extensive use of pragmatics in NLD systems also suggests that the role of the natural language front-end is to be redefined: the dialogue manager can handle much of the interpretation of the user input, and also be responsible for much of the tactical generation of the system output, so the functionality of the parser and the generator can be changed to a shallow NL front-end which 'translates' linguistic input into a conceptual format suitable for dialogue manager reasoning.

2.2.4 Conversational adequacy

Conversational adequacy manifests itself in the contributions that the system plans to clear up vagueness, misunderstandings or lack of understanding in user contributions. It is closely tied to the complexity of the underlying task and to the level of the dialogue partners' communicative competence. Adequacy criteria differ depending on whether the task concerns information providing, advice giving, suggesting, or argumentation: the overall goals of the partners in the given dialogue situation determine the kind of reasoning needed, and thus also the kind of criteria by which the adequacy of their dialogue contributions is assessed. A more complicated task also requires a more intelligent agent. The two aspects are interconnected and specify the system's competence as a dialogue partner: whether the user is expected to communicate with a simple service system or with a more sophisticated, rational agent. An intelligent dialogue agent is able to conduct mixed-initiative dialogues, negotiate the correct meaning of references and other vague expressions, and produce helpful, cooperative responses that enable both partners to take steps towards the fulfillment of the underlying task. An intelligent NLD system should thus be able to formulate responses that clarify vagueness (I need a car - Do you want to buy or rent one?), non-understanding (Hot in taste, not in temperature; spicy as being highly flavoured), and misunderstandings due to knowledge (But there is no airport in Bolton) and confusion (I meant in Brighton, not in Bolton). Let us consider the following sample dialogue from our corpus:

¹ One subject in the PLUS project reported that he had been wondering whether the computer would understand `restaurants indian' or `indian restaurants'.
² [6] shows that the number of words used in computer interactions is rather limited, and thus the system's NL capability may not be such a big obstacle as has been thought, and [3] claim that NL enhancements (e.g. ellipsis) do not necessarily improve the system's usability but may in some cases even obscure the task.


(1)
User1:    Is there anywhere in the town centre that serves hot, spicy food?
Wizard1:  What type of food?
User2:    Hot, spicy food.
Wizard2:  Far eastern or Asian, perhaps. Please wait...

Assuming that the wizard's task is to provide the user with information about restaurants, but the user specification hot, spicy food is not clear enough to pick out a restaurant type, the response Wizard1 sounds natural. However, for a system, the generation of such a response is not that easy: it has to recognize the metonymic relation between the concepts `food' and `restaurant' (if the user is looking for a place that serves food, this is taken to be a restaurant, as opposed to other places associated with food like take-aways and shops), then formulate its own goal to restrict the database search with the help of the restaurant type (it would be uncooperative to give the user information about all the restaurants), relate the modifiers hot and spicy to a restaurant 'type', and finally, generate an elliptical response using the metaconcept type and the word food used in the user's contribution. A system might thus formulate a slightly different response like What type of restaurant? or I don't know "hot, spicy food". Although these responses may not be as natural as the one in example (1) (abrupt topic shift from food to restaurant, and a curt statement of one's knowledge limits), they nevertheless convey relevant information to the user, and are in this context more cooperative than Sorry, I don't understand. The first explicitly signals what information is looked for, thus allowing the user to correct the interpretation if the connection between her request and the application model is wrong. The second prevents the user from repeating the unknown adjectives in her response, and thus helps the wizard to proceed with the task without digressing into a repetition of the restaurant type question³. If the system is able to take initiatives with respect to task fulfillment, this can be combined with a question Could you specify the type of food?, and the reply makes an adequate albeit somewhat tedious response.
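The choice between these clarification strategies can be sketched as follows. The lexicon, type names and response strings below are hypothetical illustrations, not drawn from the system or corpus discussed here:

```python
# Illustrative sketch of choosing a clarification response when a user's
# modifiers cannot be mapped onto the application model. All word lists
# and messages are hypothetical examples.

KNOWN_RESTAURANT_TYPES = {"indian": "Indian", "chinese": "Chinese",
                          "thai": "Thai", "italian": "Italian"}

def clarify(request_modifiers):
    """Return a response for a restaurant request with the given modifiers."""
    known = [m for m in request_modifiers if m in KNOWN_RESTAURANT_TYPES]
    unknown = [m for m in request_modifiers if m not in KNOWN_RESTAURANT_TYPES]
    if known:
        # Modifiers map onto a restaurant type: no clarification needed.
        return f"Searching for {KNOWN_RESTAURANT_TYPES[known[0]]} restaurants..."
    if unknown:
        # State the knowledge limit rather than a bare "I don't understand",
        # so the user does not simply repeat the unknown adjectives.
        return 'I don\'t know "' + ", ".join(unknown) + '". What type of restaurant?'
    return "What type of restaurant?"
```

For instance, `clarify(["hot", "spicy"])` yields the knowledge-limit response, while `clarify(["thai"])` proceeds directly with the search.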
Assessing the success of a contribution in a particular dialogue situation is not a simple task. The first criterion for an appropriate response seems to lie in the general aims of the dialogue model: in human-computer interaction the latter responses would be fine, justified by the requirements of efficiency and feasibility, and by the assumption that humans are still far better than computers in bridging inference gaps. In analogous human-human dialogues, say those at a Tourist Information Office, these responses, or ones like I don't understand what you mean, No, I do not know, or repetitions of the same question, usually reinforce an impression of a rude and uncooperative partner, and the fluency of communication suffers seriously. Besides the external system design viewpoint, example (1) also exemplifies the notion of communicative strategy: the way in which mutual knowledge is established, maintained, modified and exploited. Depending on the speaker's conversational posture [11] and risk-taking ability [4], some part of the information content can be ignored or omitted in order to contribute to conversational fluency, to achieve a particular communicative effect, to maintain one's 'face', etc. Moreover, an appropriate way to react does not require truthfulness in the sense that the response details one's knowledge limits. For instance, the wizard does not explicitly tell whether hot and spicy are truly understood, or that she assumes the user is looking for a restaurant rather than a fast food take-away or an exotic shop only because she can find no information about the latter two in the database. On the contrary, the response strongly implies that the wizard understands the meaning of the modifiers and can associate 'serving food' with 'restaurants', and only requires some clarification of the type of food.
The cooperative partner chooses to specify the food type further, the wizard continues with the task, and the user gets a list of restaurants probably without even noticing that there was something vague in her first request.

³ After an open question like Wizard1, the user could use the same unknown adjectives in her reply and thus give no useful information at all. In the example, the user indeed repeats the adjectives, although she gives further specifications as well.

2.3 Communicative Competence and NLD systems

A flexible, communicatively competent NLD system aims to fulfil the robustness goals discussed above. Its responses can be characterised by the following properties:

1. Cooperative: informative, helpful responses; related to the functionality of the system.
2. Robust: appropriate and conversationally adequate responses; related to the precision level on which the system clarifies vague or misunderstood input, as well as to the system's external user-friendliness.
3. Coherent: natural responses; related to the fluency and smoothness of information flow, to the way in which dialogue parts `hang together'.

3 Evaluation of NLD systems

3.1 Basic concepts

Evaluation of NLD systems is needed for forming and refining system design (formative evaluation), and for assessing the impact, usability and effectiveness of the overall system performance (summative evaluation). It is also important for potential purchasers of commercially available systems (consumer reports), for funding agencies who want to develop core technology, and for researchers to guide and focus their research. [10] distinguish different types of evaluation. Adequacy evaluation deals with assessing the system's fitness for a particular purpose: will it do what is required, how well, and at what cost. It is typically done for a prospective user, and may include an extensive study of the user's needs. Diagnostic evaluation is typically used by system developers to compare system performance with respect to some set of possible inputs, a test suite. The compilation of a large, representative test suite is one of the important subtasks in evaluation. Performance evaluation, also called progress or technology evaluation, measures system performance in one or more specific areas, and is used by system developers and R&D programme managers to compare alternative implementations or successive generations of the same implementation. Another distinction can be made on the basis of the information that is obtained from the evaluation. Qualitative evaluation aims to clarify which parts of the system need alteration, why errors or misunderstandings occur, etc., by providing an answer to the question "How does the system do what it does?" Quantitative evaluation produces information on the absolute number of errors and misunderstandings, the impact of changes to the system, etc., by looking for an answer to the question "How well does the system do what it does?" Qualitative evaluation often tends to be glass box evaluation, a component-wise assessment of the system, while quantitative evaluation tends to be black box evaluation, an evaluation of the system as a whole.

Quantitative evaluation of system performance also calls for some criteria which define the evaluation goals, a measure which illustrates the system performance on the chosen criterion, and a method which describes how the appropriate value is determined for a given measure and a given system.


3.2 Problems with evaluating NLD systems

Evaluation of NLD systems is prone to the variation of human judgements in general: the evaluator's expertise, personal preferences, emotional state etc. affect the assessment. Moreover, dialogue participants adapt their language according to their partners (we speak differently to children, foreigners, the brain-injured, etc.), and most people are already used to talking to a 'dumb' computer: thus their assessment may be biased by tacit adaptation and default attitudes. The users also have different needs and requirements (which may be more or less adequate for system design purposes), and a value of good coverage at one point versus bad coverage at another is not in itself indicative of the system's fitness for a user's purpose. Evaluation is also likely to be affected by the laboratory situation in which it takes place: as pointed out by [8], what is being evaluated is a setup, a system embedded in a context of use. HCI literature stresses that the usability of a system can be properly assessed only in real situations. The evaluation goals may also be interdependent: a system may gain accuracy at the expense of real-time interaction. On the other hand, the users' subjective consideration of the length of a transaction may be rather different from the objective time measurement. Also, if the users are novices, their assessment of the system's clarity and friendliness differs from that of expert users. In the following, I will not discuss adequacy or diagnostic evaluation, but concentrate on performance evaluation, especially on evaluating communicatively adequate system responses as they are defined in the previous section. Although the cooperativeness, robustness and coherence of system responses are difficult to operationalise in objective terms, it is an interesting enterprise to explore how far objective measures can be used in NLD system evaluation.

3.3 What kind of measures?

The simplest objective measure is the length of a dialogue: the number of words, the length and number of utterances. It is based on the assumption that a shorter dialogue represents a more successful transaction, since the fewer words and turns used, the more cooperatively and smartly the partners have communicated. However, the length of the transaction is a weak and limited indication of the system's communicative competence: the assumption may only be true if the partners share the same plan of what is the optimal way to achieve the joint goal - and this is usually not the case. Instead, dialogue partners have varied knowledge and they frequently use explicit checks and feedback to monitor the success of their contributions. Also, as discussed above, they plan (or do not plan) clarification questions depending on their conversational posture [11] and risk-taking ability [4]. Moreover, redundancy seems to play a crucial part in human communication in general, especially in spoken dialogues, since it guarantees that the message can get through even though part of it is lost on the way. The total time of utterances can also be used to measure communicative adequacy: the longer it takes for the partners to perform a task, the less efficient and cooperative their communication. In HCI, one of the common measures of usability is the user's performance score S [14]:

S = (1/T) · (P/C)

where T = the time the user takes to perform the task, P = the percentage of the task the user manages to complete, and C = what can be performed (the expert's performance). Of course, the same caveats as with the length of the dialogue apply here; and the performance score measures the user's ability to cope with the system rather than the communicative competence of the system. Qualitative information about the system's design can also be obtained from internal complexity measures: the number of rules, inference steps, time to produce an analysis or generate a response

etc. tell how well and efficiently the system performs its task. Although these are important in assessing NLD systems as a whole, their relation to conversational adequacy is indirect. One of the most useful sources of information is an error situation. Errors show where problems exist, and also suggest the cause and a potential remedy for the problems. However, in evaluating the conversational adequacy of NLD systems, the concept of an 'error' is problematic since it suggests that there also exists a normative 'correct' answer; in human-human and human-computer interaction this is usually not the case, but different types of misunderstandings occur rather than errors. The source of misunderstandings is usually a confusion, lack of relevant knowledge or excess of unimportant knowledge, so that the partner fails to capture the intended meaning. From the point of view of the speaker, there seem to be two sources of misunderstandings: strategic miscalculation and tactical sloppiness. The former refers to a badly chosen action (e.g. confirming instead of asking for more information, following one path instead of another), causing the hearer to get confused about the speaker's goals and probably drawing false conclusions about her knowledge. Tactical sloppiness deals with problems in carrying out the intentions: the speaker does not possess enough information, or possesses too much, wrong and confusing information, to perform the action appropriately (e.g. referring to names which the partner may not know, adding extra information which is too detailed). The hearer is thus confused about the referents and parameter values assigned to the shared knowledge, and usually asks for clarification. Since dialogues are collaborative activities, the success of an utterance can only be assessed when the partner has given feedback on whether she understands and accepts the contribution. Consequently, miscalculation or sloppiness may be accepted by the partner (i.e. go unnoticed or be deliberately ignored), and may even contribute to the fluency of communication rather than being a communication failure. If we take into account both the speaker's view-point of the source of the misunderstanding and the hearer's acceptance, contributions can be classified as follows:

Contribution  Speaker   Partner
Ok            correct   accepted
Partial       miscalc   accepted
Slip          sloppy    accepted
Mistake       miscalc   not accepted
Failure       sloppy    not accepted
Spurious      -         not understood

3.4 Measuring communicative competence

Communicative competence of a speaker can now be associated with different degrees of particular contribution types in the contributions produced by the speaker. Cooperativeness is regarded as the degree of 'completeness' of the speaker's utterances from the partner's view-point, and calculated as the degree of the contributions accepted by the partner with respect to all those contributions produced by the speaker and understood by the hearer:

Cooperativeness = (Ok + 0.5 · Partial + 0.5 · Slip) / (All utterances − Spurious)

In a similar manner, robustness, defined as the appropriateness of contributions and the precision level on which vagueness and misunderstandings are clarified, is calculated as follows:

Appropriateness = (Ok + 0.5 · Slip + 0.5 · Failure) / (All utterances − Spurious)

Precision = Ok / All utterances

Appropriateness gives the speaker's view-point of the degree of strategic accuracy of the utterances which are understood by the partner (regardless of whether or not they are accepted by her), while precision is the absolute degree of the correct and accepted utterances with respect to all utterances produced by the speaker. Coherence, defined with respect to the fluency and smoothness of the information flow, can be studied from the view-point of either partner, thus resulting in the degrees of fluency and fallout. Fluency records successful responses from the speaker's view-point, and is calculated as the complement of spurious and strategic inaccuracy:

Fluency = 1 − (Partial + 0.5 · Mistake + 0.5 · Spurious) / All utterances

Fallout records unfortunate responses from the partner's view-point (and is thus the opposite of cooperativeness), and is calculated as follows:

Fallout = (Mistake + Failure + Spurious) / All utterances
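The measures of this section can be computed directly from a dialogue annotated with the six contribution types. A minimal sketch, assuming the formulas as reconstructed above; the function name is my own and the labels follow the classification table:

```python
# Computing the communicative-competence measures from a list of
# contribution labels (Ok, Partial, Slip, Mistake, Failure, Spurious).
# The 0.5 weights mirror the partial-credit terms in the formulas;
# assumes at least one understood utterance (non-Spurious).
from collections import Counter

def measures(labels):
    c = Counter(labels)
    n = len(labels)                   # All utterances
    understood = n - c["Spurious"]    # utterances understood by the partner
    return {
        "cooperativeness": (c["Ok"] + 0.5 * c["Partial"] + 0.5 * c["Slip"]) / understood,
        "appropriateness": (c["Ok"] + 0.5 * c["Slip"] + 0.5 * c["Failure"]) / understood,
        "precision": c["Ok"] / n,
        "fluency": 1 - (c["Partial"] + 0.5 * c["Mistake"] + 0.5 * c["Spurious"]) / n,
        "fallout": (c["Mistake"] + c["Failure"] + c["Spurious"]) / n,
    }
```

For a four-utterance dialogue labelled Ok, Ok, Partial, Mistake, this gives a precision of 0.5 and a fallout of 0.25, with cooperativeness and fluency both at 0.625.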

4 Conclusions

This paper discusses different goals for robust NLD systems, and advocates the view that robustness should be defined as the communicative competence of NLD systems. Based on this definition, conversationally adequate system responses can be characterised as cooperative, robust and coherent. The second half of the paper is devoted to the evaluation of NLD systems, especially their communicative competence. Problems related to the evaluation of conversationally adequate system responses are surveyed, and different evaluation measures discussed. Some such objective measures for quantitative evaluation of cooperativeness, robustness and coherence are also formulated, based on the notion of misunderstanding. Finally, I summarise the points that [10] make in regard to the success and limitations of evaluation. On the positive side, the rapid growth of evaluation studies has created a number of performance evaluation conferences (MUCs, TRECs, MT Evaluation Workshops, Spoken Language Technology Workshops), which, besides spurring competition in building advanced systems, have also helped communication among researchers and the sharing of information to solve hard problems. Increased public visibility of the area, support for infrastructure, and rapid technical progress are also regarded as positive side-effects of such enterprises. However, evaluation also takes time and resources, and thus competes with other activities, especially with the development of new, innovative technologies. Focus on performance evaluation may also lead to risk-avoidance strategies, where getting a good score becomes more important than doing good research. As limitations of the evaluation methods, [10] note that performance evaluation methods are application-specific, and there exists no methodology for assessing the portability of systems into new application domains. Little attention has also been paid to the ways the user interacts with a system, and to multilingual settings.
These latter aspects directly relate to the topic of this paper: the evaluation of NLD systems. It is the aim of this paper to start a discussion on how to define and measure robustness, cooperativeness and coherence in NLD systems, and for this, tentative operationalisations of the concepts are given.

References

[1] L. Ahrenberg, A. Jonsson, and A. Thuree. Customizing interaction for natural language interfaces. In K. Jokinen, editor, Pragmatics in Dialogue Management, pages 21–38. Proceedings of The XIVth Scandinavian Conference of Linguistics, University of Göteborg, Göteborg, 1994. Gothenburg Papers in Theoretical Linguistics 71.

[2] J. Allwood. Linguistic Communication as Action and Cooperation. Department of Linguistics, University of Göteborg, 1976. Gothenburg Monographs in Linguistics 2.

[3] A. Burton and A. P. Steward. Effects of linguistic sophistication on the usability of a natural language interface. Interacting with Computers, 5(1):31–59, 1993.

[4] J. Carletta. Planning to fail, not failing to plan: Risk-taking and recovery in task-oriented dialogue. In Proceedings of COLING-92, pages 896–900. Nantes, 1992.

[5] V. Cavalli, H. Dahlgren, C. E. Donzella, S. Fujol, and C. Godin. System evaluation criteria and test environment. Technical Report D2.2, PLUS deliverable, 1992.

[6] D. Diaper. Identifying the knowledge requirements of an expert system's natural language processing interface. In People and Computers: Designing for Usability, pages 263–280. Cambridge University Press, 1986.

[7] L. Dybkjaer, N. O. Bernsen, and H. Dybkjaer. Evaluation of spoken dialogue systems. In Dialogue Management in Natural Language Processing Systems. Proceedings of the 11th Twente Workshop on Language Technology, Twente, 1996.

[8] J. R. Galliers and K. S. Jones. Evaluating Natural Language Processing Systems. Springer-Verlag, Berlin, 1996.

[9] P. J. Hayes and D. R. Reddy. Steps toward graceful interaction in spoken and written man-machine communication. International Journal of Man-Machine Studies, 19:231–284, 1983.

[10] L. Hirschman and H. S. Thompson. Overview of evaluation in speech and natural language processing. In R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, editors, Survey of the State of the Art in Human Language Technology. 1996. Chapter 13.1. Also available at http://www.cse.ogi.edu/CSLU/HLTSurvey/.

[11] E. H. Hovy. Generating Natural Language Under Pragmatic Constraints. Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1988.

[12] K. Jokinen. Response Planning in Information-Seeking Dialogues. PhD thesis, University of Manchester Institute of Science and Technology, 1994.

[13] M. D. Wallace and T. J. Anderson. Approaches to interface design. Interacting with Computers, 5(3):259–278, 1993.

[14] J. Whiteside, S. Jones, P. S. Levy, and D. Wixon. User performance with command, menu and iconic interfaces. In L. Borman and W. Curtis, editors, Human Factors in Computer Systems II: Proceedings of the CHI'85 Conference, San Francisco. North-Holland Publishing Company, Amsterdam, 1985.
