Natural Language Engineering 1 (1): 1–16. Printed in the United Kingdom
© 2000 Cambridge University Press

Modelling Usability: Development Methods for Dialogue Systems

Joris Hulstijn
Department of Computer Science, University of Twente
e-mail: [email protected]

(Received 7 March 2000)
Abstract

This paper gives a characterisation of dialogue properties which are argued to be related to the usability of dialogue systems: effectiveness, efficiency, transparency and a number of properties that have to do with coherence. Such properties can be used as guidelines in dialogue design, but also as aspects to test for in verification and validation.
1 Introduction

Designing a usable dialogue system is considered an art. Current texts on dialogue system development are based on ‘best practice’ (Gibbon et al., 1998; Bernsen et al., 1998). There is no established theory of dialogue development that can be applied in the specification and evaluation of dialogue systems. On the one hand, dialogue systems are special. Natural language technology requires special resources, such as a lexicon, a grammar, an interaction model and an extensive domain-dependent knowledge base. So general object-oriented design techniques like OMT or UML (Rumbaugh et al., 1991; Fowler and Scott, 1997) are not directly applicable. On the other hand, dialogue systems are just another example of complicated interactive software. Their success depends as much on general software quality factors like functionality, usability, reliability, performance, maintainability and portability as on their linguistic coverage. Because of their special linguistic resources, natural language interfaces are relatively complex, unstable and difficult to maintain or adapt (Androutsopoulos et al., 1995). Because of their interactive nature and the acclaimed advantages of natural language interfaces as being natural and easy to use, usability is the most important quality factor for a dialogue system. Can the system actually be learnt and used in an effective, efficient and satisfactory way? It has proved difficult to assess the usability of a completed dialogue system, let alone to predict the usability of a dialogue design. It is unclear what aspects of a design influence the usability of the resulting system. In this paper we argue that the
following dialogue properties affect usability: effectiveness, efficiency, transparency and properties related to coherence. These properties can be used in the verification of dialogue system designs. Throughout the paper we use the example of a theatre information and ticket reservation system (Van der Hoeven et al., 1995; Hulstijn, 2000). There is a keyboard-based version of the system, called schisma, as well as a spoken version adapted to movie theatres. Ticket reservation is interesting because it combines information exchange with a transaction between the user and the system.

The paper is structured as follows. In section 2 we briefly review an iterative development process, which is recommended for complex interactive systems. In section 3 we discuss the different types of models that are needed in dialogue design. In section 4 we analyse the concept of usability, and reduce it to a number of verifiable dialogue properties. Finally, in section 5 we summarise the proposal and compare it to related research.

2 Iterative Development

Dialogue systems are interactive; their performance crucially depends on user behaviour. Different and often opposing requirements have to be met. Novice and expert users, business or leisure users, and users from different social and cultural backgrounds have potentially diverging expectations of a system. Ideally, a consistent mixture of these requirements should form the basis for a system design. A development method that is both iterative and user-centred is therefore recommended. The idea is to involve target users in the whole development process, not just in the testing phase (Nielsen, 1993).

Iterative development proceeds in cycles. In each cycle, the system is evaluated and improved. Early in development, mistakes are not yet very expensive. Simulations or early prototypes allow developers to experiment with radical new ideas. Feedback from user focus groups, surveys or user experiments will detect bugs, design mistakes or conceptual errors.
There are four types of activities that have to be carried out for each development cycle: analysis, design, implementation and evaluation, shown in figure 1. These activities are mutually dependent and can take place simultaneously.

2.1 Analysis

Development is based on a detailed domain and task analysis. A task is essentially a pattern of actions that is common to a large number of social activities. For each activity type there are conventions that regulate the behaviour of participants in a particular role and determine the meaning of artifacts such as coins or tickets. Dialogue developers should be aware of such social and linguistic conventions. Often the development of a system does not start from scratch: there may be a system in place, or there is some human service. In that case, experiences with the existing service must be taken into account. In the schisma case, we use the theatre schedule and price conventions of the local theatre.

Analysis may begin with scenarios, simple descriptions of the projected behaviour
Fig. 1. Iterative development cycle. The cycle connects four activities: analysis (scenarios, task analysis, corpus, simulation, Wizard of Oz), design (re-design, lexicon, grammar, objects, actions, database), implementation (datastructures, control, prototype, environment, tools) and evaluation (user testing, logfile analysis, error analysis, verification).
of dialogue participants. Scenarios are based on observations of real-life activities. Scenarios are applied in two ways: they serve as a basis for a more systematic domain and task analysis, and as a basis for the collection of a corpus.

Corpus-based analysis

A corpus is a way of modelling the interactive behaviour of users on a particular task. If no appropriate corpus exists, one can be collected in a simulation or ‘Wizard of Oz’ experiment (Fraser and Gilbert, 1991). A human experimenter simulates the behaviour of the system, using an environment with semi-automated responses. A group of representative subjects is asked to enact several scenarios. Scenario descriptions should not trigger a particular way of responding; to this end, diagrams and pictures are often used. The interaction is recorded and logged. The recorded data is edited and annotated with linguistic descriptions at different levels: phonetic features, syntactic features and cue words can be annotated, but also the apparent content and dialogue act or ‘move’ in the interaction. Different modules are tested and trained on the annotated data. Resources like a grammar can thus be improved by machine learning techniques (Van Noord, 1997). Descriptions of recurring interaction patterns can also be derived automatically (Alexandersson, 1996).

Use case analysis

Task analysis can be documented in use cases (Jacobson et al., 1992). A use case is a specification document that generalises over a set of related scenarios. It defines the projected roles of user and system and their private and public goals, the setting, success and failure conditions, and the form, content and function of possible interaction sequences. In addition, use cases specify management information, such as version control, deadlines and a list of open problems.
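As a rough sketch, such a use case specification can be represented as a structured record. The field names below follow the template of figure 2, but the class itself is our own illustration, not part of the schisma system.

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    """A use case as a structured record (fields follow a Cockburn-style template)."""
    name: str
    actors: dict          # e.g. {"primary": [...], "secondary": [...]}
    goals: list
    trigger: str
    preconditions: list
    success: list         # conditions that hold on successful completion
    failure: list         # conditions that hold on failure
    description: list     # numbered interaction steps
    extensions: dict = field(default_factory=dict)  # step label -> alternative course

reservation = UseCase(
    name="C: Ticket Reservation",
    actors={"primary": ["user", "system"], "secondary": ["seat reservation database"]},
    goals=["User wants to make a ticket reservation", "System wants to sell tickets"],
    trigger="User indicated the wish to make a ticket reservation",
    preconditions=["User has preferences and a limited budget", "Enough seats available"],
    success=["Agreement on performance, tickets and total costs"],
    failure=["No seats reserved", "User made no commitments"],
    description=["1. assist selection", "2. inform of details",
                 "3. get name, address and tickets", "4. inform of costs",
                 "5. get confirmation", "6. reserve seats", "7. confirm reservation"],
    extensions={"5c": "User does not confirm. System tries again, else (1)."},
)
```

A record like this makes the template machine-checkable, for instance to verify that every numbered step has at least been considered for extensions.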
There are use case templates that can be filled in as the analysis progresses (Cockburn, 1997). Figure 2 shows an example of a use case for ticket reservation. Because they can be intuitively understood, use cases facilitate communication between the system developer, the client and user focus groups. Use cases look much like scripts (e.g. Schank and Abelson, 1977). The recursive structure and the trigger, precondition, success and failure conditions suggest an underlying plan-based theory of task analysis (see below). Each use case roughly corresponds to a sub-task or goal. Use cases may be nested or sequentially ordered. Use case templates allow variations and extensions to be defined as the development progresses. In this way, motivated design changes can be monitored in the documentation of the development process. Test case designs for validation tests – black box tests that focus on functionality – can be successfully derived from use cases. For each interaction sequence, each use case extension and each sub-variation, a test case must be designed.

2.2 Design and Implementation

The domain and task analysis has to be converted into designs for data structures and control mechanisms. These can then be implemented using a programming language, or by combining existing modules. Not all of the task model has to be implemented! Domain and task models play a role in the background, to motivate choices and to ground specification, evaluation and testing effort. For example, a plan-based task analysis does not need to be implemented using a planning algorithm. Often the procedure-call mechanism of an imperative programming language, with the help of a stack, suffices. Dialogue systems can be implemented in many ways. Figure 3 depicts the proposed architecture of a spoken version of the schisma system (Hulstijn and Van Hessen, 1998).
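The flow through such a component pipeline can be sketched as follows. The function names mirror the processing steps of figure 3, but their bodies are trivial stand-ins of our own, not the actual schisma interfaces.

```python
# A toy rendering of the component pipeline: each stage is a function from the
# previous stage's output; the stage bodies are illustrative stand-ins only.

def recognise(audio):            # speech recognition (speech models)
    return audio.split()

def preprocess(words):           # preprocessing: name and date recognition
    return [w.lower() for w in words]

def parse_utterance(words):      # parser (lexicon, grammar)
    return {"tokens": words}

def understand(parse, context):  # dialogue manager: utterance representation
    return {"act": "request", "content": parse["tokens"]}

def plan_response(act, context): # dialogue manager: plans, dialogue rules, database
    return "Which day would you like?", context

def generate(spec):              # generator (response templates)
    return spec

def process_turn(audio, context):
    """One dialogue turn, from user audio to system response text."""
    words = preprocess(recognise(audio))
    act = understand(parse_utterance(words), context)
    text, context = plan_response(act, context)
    return generate(text), context

reply, ctx = process_turn("3 for tonight please", {})
```

The point of the sketch is the modular decomposition: each stage can be specified, tested and replaced independently, which is what makes the resources in the middle column of figure 3 meaningful.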
Each component captures a basic processing step: speech recognition, preprocessing (name and date recognition), parsing, understanding and dialogue management, response generation and text-to-speech conversion. Similar components appear in almost any dialogue system, partly because the basic functions are universal. For human language production there are three levels: deciding what to say (planning), how to say it (generation) and actually saying it (production) (Levelt, 1989). Combining these with the corresponding modules for processing the meaning of what was said (understanding), how it was said (parsing) and what was actually said (recognition), we get the presented architecture, given that both the understanding and planning functions are carried out by the dialogue manager. The schema in figure 3 also represents a particular philosophy of dialogue system design. On the left we find the actual system components. Specific resources, shown in the middle, specify the behaviour of components. For example, a grammar specifies the behaviour of a parser. On the right we see the information, task and interaction models that guide the design and can play a role in verification and validation. A corpus, too, can be seen as a model of typical user behaviour. In this architecture, each component corresponds to a basic processing function. But components do not have to be scheduled one after the other. A black-board
Use Case C: Ticket Reservation

Actors: primary: user, system; secondary: seat reservation database.
Goals: User wants to make a ticket reservation. System wants to sell tickets.
Context: User called the system. User may know of performances.
Scope: Primary task, transaction.
Trigger: User indicated the wish to make a ticket reservation.
Precondition: User has preferences and a limited budget. Enough seats are available.
Success: User and system agreed on title, artist, date, time and price of performance. System knows user’s name and address and number and type of tickets. User and system agreed on total costs. The seats are marked as reserved for user. User is committed to pick up the tickets.
Failure: No seats marked as reserved for user. User made no commitments.
Description:
1. System assists user to select performance within budget (use case B).
2. System informs user of title, artist, date, time and price.
3. System gets user name and address and also number and type (discount, rank) of tickets.
4. System informs user of total costs.
5. System gets explicit confirmation from user.
6. System marks seats as reserved.
7. System informs user of reservation and suggests commitment.
Next: D: Closure, B: Theatre Information or again C: Ticket Reservation.
Extension:
xa. User misunderstands. System initiates help (use case E: Help).
xb. User ends this reservation attempt (use case D: Closure).
1c. Selection not available. System apologises and offers alternative (1).
1d. User changes mind. System follows (1).
3c. User corrects number or type of tickets or corrects name and address. System follows (3).
5c. User does not confirm. System tries again, else (1).
6c. Seats not available. System apologises and offers alternative (3).
7c. User changes mind. System cancels reservation and informs user that no commitments have been made (1).
Sub-variation:
2c. Information already discussed is not repeated.
3d. User satisfied with information only (use case D: Closure).

Fig. 2. Use case for ticket reservation. Template adapted from Cockburn (1997).
Fig. 3. Dialogue system components, resources and models. An arrow depicts data flow, a solid line ‘specification’ and a dashed line a ‘captures’ relation. Components (left): speech processing, preprocessing, parser, dialogue manager (with database and DBMS), generator, text-to-speech. Resources (middle): speech models, lexicon, grammar, inference rules, context, plans, dialogue rules, objects and actions, response templates, phonetic lexicon; resources cluster into classes. Models (right): corpus and domain model (information, task, interaction).
architecture uses a central data structure, called the black-board, from which each component takes its input and onto which it writes its results. Such a configuration is especially useful when the behaviour of modules is mutually dependent, for instance when the speech recognition may benefit from pragmatic expectations of the dialogue manager. Apart from the decomposition into components, it is usually a good idea to cluster the parts of resources that have to do with the same domain object. For instance, a date object needs specific lexical entries (“Wednesday”), grammar constructions (“a month from now”), inference rules (“if unknown, the date is today by default”) and utterance templates (“Which day would you like?”). Ideally, this clustering leads to a library of re-usable classes that specify objects in the domain. Obviously, not all aspects of a system belong to a re-usable class. Good candidates for domain objects that will generate a re-usable class can be found by studying scenarios for recurring elements.
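As an illustration of such clustering, the date example above can be bundled into one class. The class below is our own sketch, not schisma code; the resource lists are deliberately minimal.

```python
import datetime

class DateObject:
    """Illustrative clustering of the resources that concern the domain object 'date':
    lexical entries, grammar constructions, a default inference rule and a prompt
    template, all kept together in one re-usable class."""

    lexicon = ["monday", "tuesday", "wednesday", "thursday",
               "friday", "saturday", "sunday", "today", "tonight", "tomorrow"]
    grammar = ["a month from now", "next <weekday>", "the <ordinal> of <month>"]
    prompt = "Which day would you like?"

    def __init__(self, value=None):
        # Default inference rule: if unknown, the date is today by default.
        self.value = value if value is not None else datetime.date.today()

d = DateObject()  # no date supplied, so the default rule applies
```

Keeping lexicon, grammar fragments, defaults and prompts for one domain object in one place means that adding, say, a time-of-day object to a new application touches one class rather than four resource files.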
2.3 Evaluation

In iterative, user-centred development, testing and evaluation are done in experiments similar to the ‘Wizard of Oz’ experiments used in analysis. Subjects are asked to perform assignments with the system. The assignments can be used to focus testing on precisely defined test objectives. The interaction is recorded, logged and annotated, producing a corpus for further analysis. To achieve improvement, generally a combination of objective and subjective evaluation should be used. Objective metrics, like dialogue length, are used to separate problematic from unproblematic dialogues. Log-files of the problematic dialogues can then be analysed in detail. Problems are classified according to the apparent source of the problem. Subjective measures are used to set objectives for re-design and to assess the relative importance of different problem types.

Dybkjaer and others have developed a detailed theory of diagnostic evaluation for spoken dialogue systems (Dybkjaer et al., 1998; Bernsen et al., 1998). The theory consists of a list of neo-Gricean maxims: principles of cooperative dialogue (Grice, 1975). Some maxims were adapted specifically for the case of man-machine dialogue. Particular attention has to be paid to background information, misunderstanding and error repairs.

Evaluation Example

Here is an example of such an evaluation and re-design cycle (Bouwman and Hulstijn, 1998). The evaluation focussed on the trade-off between dialogue length and reliability in the design of a confirmation strategy for spoken dialogue systems. One can choose an explicit confirmation strategy, always asking the user directly whether certain information was correctly understood. Such a strategy is reliable, but it tends to lengthen the dialogue. Instead, an implicit confirmation strategy can be used, formulating the next question in such a way that it conveys what information was understood. In case of a misunderstanding, the user is expected to protest.
Implicit confirmation does speed up the dialogue, but it is less reliable (Danieli et al., 1997). An experiment was set up to evaluate the confirmation strategy of a particular spoken dialogue system called padis, developed at Philips (Kellner et al., 1996). The system is an automatic telephone directory that can retrieve telephone numbers, room numbers and e-mail addresses of employees. The system also allows ‘voice dialling’: establishing a direct connection to a requested employee. In a pre-experiment survey, users indicated that they thought the existing system was slow and irritating in some cases. The first evaluation experiment revealed that so-called ‘dead-lock’ misunderstandings were the source of many complaints. Because of wrong assumptions built into the confirmation strategy, user corrections were not always processed by the system. After identification of the sources of errors, the strategy could be re-designed. Part of the re-design involved the speech recogniser’s reliability measure: based on statistics over the recognition results, the system’s confidence in a reliable outcome can be predicted. The strategy was changed to take the cost of failure into account. Direct connections have a high cost of failure: you do not want to disturb the wrong person. So for direct connections the implicit confirmation was changed
to an explicit one. Retrieval tasks have a low cost of failure: in case of a recognition error resulting in retrieval of the wrong number, the user can always correct the system. So for utterances with a high reliability measure the confirmation could be removed altogether. In this way, the number of utterances in case of an error still only equals the normal number of utterances under the original strategy. This speeded up the dialogues considerably. The second evaluation experiment, after the re-design, showed that general user satisfaction had increased. Another remarkable result was that despite the increase in satisfaction, the perceived speech recognition rate had gone down: people thought the system did not recognise their utterances as well as it did before. However, judging from objective measures of system performance, the recognition rate had not decreased at all. One explanation is that the users’ mental model of the system’s capabilities had become more accurate. After the change in strategy, users may have realised that the system’s problems were caused by misrecognition, not by misunderstanding.

3 Models

A dialogue is a coherent sequence of utterances; coherence concerns different aspects of those utterances. An utterance is a unit of verbal interaction that can be interpreted as a dialogue act, consisting of a semantic content and a communicative function. The content of an utterance concerns the information that is conveyed, requested or proposed. The communicative function determines what is to be done with the content. One can distinguish assertives, that add factual information, interrogatives, that bring up issues for discussion, and directives, that establish commitment to future action (Hulstijn, 2000, ch. 4). Part of the communicative function is related to the task; part of it relates to the interaction with the surrounding dialogue acts.
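This decomposition of an utterance into semantic content and communicative function can be sketched as a small data type. The representation below is our own illustration, not the paper’s formalism.

```python
from dataclasses import dataclass
from enum import Enum

class Function(Enum):
    ASSERTIVE = "assertive"          # adds factual information
    INTERROGATIVE = "interrogative"  # brings up an issue for discussion
    DIRECTIVE = "directive"          # establishes commitment to future action

@dataclass
class DialogueAct:
    function: Function  # what is to be done with the content
    content: dict       # the information conveyed, requested or proposed

# "3 for tonight, please": a directive committing to a reservation for tonight
act = DialogueAct(Function.DIRECTIVE, {"tickets": 3, "date": "tonight"})
```

Separating the two fields matters downstream: the content is matched against the information model, while the function is matched against the task and interaction models.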
Therefore the three major aspects of a domain model are the information model, the task model and the interaction model. Each of these structures constrains the dialogue structure in its own way.

3.1 Information Model

The information model provides an ontology of the basic concepts and relations in the application domain. Each concept is characterised by a number of attributes. For example, the theatre performance concept is characterised by the attributes title, artist, date and genre. Concepts with their specific attributes can be implemented in object-oriented class hierarchies, using records and types in a conventional programming language, or using typed feature structures. The concept hierarchy may act as a blueprint for the database design. It indicates which attribute combinations can serve as a key to select a unique instance of an object. Default expectations, domain-dependent inference rules and functional dependencies between attributes must also be specified in the information model.

Each concept is a potential dialogue topic. The topic is the actual object about which information is provided or requested. A topic is associated with a ‘topic
space’ of attributes and related concepts that may act as sub-topics, thus forming a hierarchy. There is a dialogue representation structure that keeps track of the topic for each utterance. In information dialogues the current topic, or else attributes and objects related to the current topic, is likely to be the next topic (Rats, 1996). Such ‘topic chains’ indicate that the dialogue is coherent. Thus the actual topic representation is severely constrained by the conceptual hierarchy in the information model. Algorithms for the resolution of anaphora and ellipsis make use of this representation.

3.2 Task Model

The task model relates to the typical actions carried out by a user in a given application type. The task should be matched by the functionality of the system. A task can be expressed by a hierarchy of goals and sub-goals, or else by a plan, a combination of actions to achieve those goals (Litman and Allen, 1987; Carberry, 1990). For dialogue systems, the basic actions from which complex plans can be constructed are either database operations or dialogue acts like requests or assertions.

The task constrains the dialogue in two ways. First, the task determines what objects are relevant. The attributes needed by a sub-task become relevant when that sub-task is addressed. So a task generates a corresponding topic space. Given that the task is ticket reservation, an utterance like “3 for tonight, please” can be interpreted unambiguously as a request for three tickets for tonight’s performance. Second, the task may put constraints on the order of dialogue act exchanges that address sub-tasks. There are two basic types of constraints: precedence and dominance (Grosz and Sidner, 1986). A task a satisfaction-precedes a task b when a must be completed before b can start. A task a dominates b when completion of a is part of completion of b.
A third type of constraint, that two tasks be carried out simultaneously, becomes relevant for multi-modal systems, where different output modalities like speech and gesture must be synchronised. The task structure of a ticket reservation can be based on the use case example in figure 2. A reservation task (use case C) dominates sub-tasks 1–7. Precedence constraints can be partly derived from the preconditions for successful completion. Sub-task 1, performance selection, is a precondition for steps 2, 4, 5, 6 and 7, and must therefore be completed before them. But sub-task 3 is independent of the performance selection and can thus occur before it, despite the preferred order of steps initially indicated in the use case analysis. Sub-tasks may also be combined, as in “3 for tonight, please”.

3.3 Interaction Model

The interaction model relates to the linguistic realisation of a task. It describes the segmentation and annotation of dialogues into exchanges of related dialogue acts. For example, a question is an initiative. It must be followed by an appropriate response: preferably an answer, or a counter-question which is then followed by an answer, or a rejection to answer. Either way, the hearer is obliged to respond.
Similarly, a proposal must be followed by an acceptance, a counter-proposal or a rejection. A greeting must be returned. Such typical interaction patterns can be described by dialogue games (Mann, 1988; Carletta et al., 1997). Interaction patterns for a particular task can be derived from an annotated corpus.

How are these aspects of the model related? The answer is: by a dialogue representation structure. This is a hierarchical representation, much like a parse tree. At the nodes of the tree we find a representation of the semantic content carried by specific utterances. On the branches of the tree we find coherence relations. A coherence relation indicates the function of an utterance with respect to the task and the surrounding utterances. As the dialogue progresses, a representation of each utterance is attached to the appropriate part of the tree by means of a plausible coherence relation. Typical coherence relations are elaboration, causation (because) or contrast (but). For dialogue, some coherence relations are derived from initiative-response pairs (Asher and Lascarides, 1998). Examples are direct or indirect question answering, acknowledging an assertion, accepting or rejecting a proposal, or answer elaboration.

In figure 3 we pictured the domain model on the right. Ideally, the general principles and observations of a model must be captured by the resources, which in turn specify the desired behaviour of a specific implementation. The corpus, too, can be seen as a model in this sense. Given a model we can in principle verify whether the resources are internally coherent and whether they are ‘correct’ and ‘complete’ with respect to the application domain. For example, each linguistic expression that is well-formed according to the specified lexicon, grammar and dialogue rules should be interpretable as a concept or action in the domain model. And each concept or action from the domain model must have a preferred way of expressing it as system behaviour.
Such constraints would ensure, among other things, that all expressions used by the system are also understood when applied by the user. Built into toolkits for dialogue system development, such constraints allow users to learn a consistent and transparent mental model of the system’s behaviour. But when exactly is a design correct or complete relative to a model? What level of detail is required? We will try to find verifiable dialogue properties that are related to those properties of system design that affect the usability of a system.

4 Usability Properties

In general, the assessment of software quality involves three levels of analysis (Pressman, 1997). The first level describes quality factors, for example maintainability, how easy it is to keep the system adapted to a changing environment; portability, how easy it is to port the system to a new application; or reliability, how unlikely it is that the system breaks down. These factors describe general aspects of a system; usability is one of them. The second level describes properties, both for design and for evaluation. Properties are still abstract, but can be made concrete for different application types. For example, maintainability and portability are related to modularisation: the extent to which a system is composed of independent components.
Fig. 4. Combined usability properties of ISO 9126 and Nielsen (1993): usability comprises effectiveness, efficiency, understandability, learnability, operability and satisfaction.
At the third level we find observable features or metrics to collect data about the design and evaluation properties. For example, modularisation is related to the number of cross-references between different components. If features or metrics are measured quantitatively, they can be combined in a weighted sum: a summation of the measurements multiplied by their respective weights. Weights indicate the relative importance of a feature. For each application type, specific metrics and weights will have to be decided on and tuned to fit the test objectives. If the metrics are concerned with qualitative concepts, they can often be turned into numbers by indicating a value on a scale of preference. In isolation, it is difficult to interpret the meaning of a metric. Metrics are especially valuable when historical data exists for comparison: they can indicate a trend or show improvement after re-design.

Usability is one of the defining characteristics of software quality, along with such factors as functionality, maintainability, reliability, efficiency and portability. These characteristics have been named in different configurations by different authors. Nielsen defines usability by learnability, efficiency, memorability, error recovery and satisfaction (Nielsen, 1993). The ISO standard for software quality lists understandability, learnability and operability as aspects of usability (ISO 9126, 1991). We propose to combine these theories (figure 4). Effectiveness and efficiency are listed as sub-aspects of usability, although one could argue that they are quality factors in their own right. The software quality factor of functionality, which assesses the features and capabilities of a system, is obviously related to effectiveness. However, our choice means that all user-related aspects are now listed under usability.
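The weighted combination of metrics can be sketched as follows. The metric names, normalised values and weights below are invented for illustration; in practice they would be tuned to the test objectives.

```python
def weighted_score(measurements, weights):
    """Combine normalised feature measurements into a weighted sum;
    weights express the relative importance of each feature."""
    assert measurements.keys() == weights.keys()
    return sum(measurements[m] * weights[m] for m in measurements)

# Hypothetical normalised measurements (0..1) and weights for one test objective
measurements = {"success_rate": 0.9, "mean_turns": 0.6, "satisfaction": 0.8}
weights = {"success_rate": 0.5, "mean_turns": 0.2, "satisfaction": 0.3}
score = weighted_score(measurements, weights)
```

A single score like this is only meaningful in comparison: against an earlier release of the same system, or against a competing design evaluated with the same weights.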
For each usability property we describe how it can be influenced by design properties, such as transparency, and how it can be assessed by dialogue properties, such as coherence (figure 5). Dialogue properties are defined for completed dialogues, but often we also refer to the contribution of a single utterance.

4.1 Effectiveness

Effectiveness concerns the accuracy and completeness with which users achieve their task. Obviously, the effectiveness of a design is influenced by the extent to which the system’s functionality matches the task. Effectiveness can be assessed by measuring the success rate: the number of successfully completed dialogues out of a set of dialogues. Effectiveness is also related to the relative coherence of individual utterances with respect to their context; coherence is dealt with below. It is difficult but possible to measure task success quantitatively. The paradise evaluation framework uses confusion matrices that represent the deviations from
usability property                  design property        dialogue property        metrics

effectiveness                       functionality          task success, coherence  success rate, see [1]
efficiency                          functionality          complexity/effort        number of basic actions/
                                                                                    duration, number of utterances
understandability, learnability,    transparency           coherence                [1] number of topic shifts,
operability                                                                         corrections, superfluous repetitions
satisfaction                        several aspects,       greetings, thanks        surveys, actual usage
                                    e.g. appearance

Fig. 5. Usability, design and dialogue properties for task-oriented dialogue systems.
some fixed attribute-value matrix with expected values (Walker et al., 1997). By calculating agreement over the confusion matrices for a large corpus of dialogues, they derive a general measure of task success. In principle, this allows one to compare task success for different systems, or for different phases in the dialogue. Unfortunately, a task cannot always be represented as a fixed set of attributes to be filled in. Why not? First, the number of attributes depends on the contents of the database. In the theatre schedule, on some days mentioning the date is enough to select a unique performance, but on other days the artist or title is also needed. The scheme could be adjusted to accommodate this flexibility, but this would require a way to normalise for the actual task complexity. Second, some attributes are mutually dependent. Attributes that determine a unique object, like title, artist, genre and date, are clustered into objects. The measure should account for such objects instead of single attributes. Third, a dialogue is dynamic. Not only the content of the dialogue, but also the user’s task or goal may change. Recognising what the user wants is a crucial part of a dialogue system’s function. Obviously, in a controlled evaluation experiment the task can be fixed temporarily to make quantitative assessment possible.

4.2 Efficiency

Efficiency is the relative effectiveness of a system, in proportion to the effort needed to achieve it. Efficiency can be influenced in the design by carefully adapting the interaction strategy to the desired task. An efficiency measure can be computed by dividing a relative effectiveness measure by a measure of the effort. Effort can be measured by the number of turns, the number of utterances or the dialogue duration. Indicators of human processing effort can also be measured, such as the number of spoken disfluencies (Oviatt, 1995). Obviously, the effort needed to accomplish
a task depends on the complexity of the task. By task complexity we mean the number of basic actions – in this case dialogue acts, and therefore utterances – needed to complete the task under ideal circumstances. Depending on the likelihood of a misunderstanding, which depends on the quality of the speech recognition, we can also predict task complexity for different degrees of misunderstanding. Suppose that on average a misrecognition takes an extra 2.6 utterances to recover from, and that the recognition rate is 84%. Then the estimated task complexity equals the ideal task complexity plus 0.16 ∗ 2.6 = 0.4 utterances. For this reason, some recent spoken dialogue systems dynamically adjust their strategy depending on a reliability measure. Misunderstanding and error have a great impact on both effectiveness and efficiency. Therefore the error rate and the error recovery rate are good indicators of usability. A system response is classified as an error when it is incoherent with the task, with the information in the database or with the interaction pattern, as can be derived from the previous utterances. For systems that are not task-oriented, the aspects of efficiency and effort are less crucial. For leisure users, for example, usability has to do with entertainment value. Still, the number of misunderstandings and errors remains a good indicator of usability. Misunderstandings are irritating, take effort to correct and lead to behaviour that is difficult to understand, learn or avoid.

4.3 Understandability, Learnability and Operability

Understandability, learnability and operability are listed as important aspects of usability (ISO 9126, 1991). They indicate how easily a user can understand, learn and handle the controls of a system. For natural language interfaces these aspects should be less problematic: understanding and learning the commands of a natural language system is not a problem, provided the system's resources match the application domain.
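The expected task complexity estimate above can be expressed as a small computation. This is an illustrative sketch only: the function names are ours, and the ideal complexity of 10 utterances is an invented example, while the 84% recognition rate and 2.6-utterance recovery cost are the figures from the text.

```python
def expected_task_complexity(ideal, recognition_rate, recovery_cost=2.6):
    """Estimated task complexity in utterances: the ideal task complexity
    plus the expected recovery overhead for misrecognitions, following
    the estimate in the text."""
    return ideal + (1 - recognition_rate) * recovery_cost

def efficiency(effectiveness, effort):
    """Efficiency as relative effectiveness divided by a measure of effort
    (e.g. number of turns or utterances, or dialogue duration)."""
    return effectiveness / effort

# Example: 84% recognition rate, 2.6 utterances average recovery cost,
# for a (hypothetical) task that ideally takes 10 utterances.
print(round(expected_task_complexity(10, 0.84), 1))  # → 10.4
```

Such an estimate also suggests why systems that dynamically adjust their strategy use a reliability measure: a lower recognition rate directly inflates the expected effort.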
The occurrence of out-of-domain errors shows that this is not always the case. Operability becomes a problem in case of misunderstandings, and when the expectations of users turn out to be wrong. The design property related to understandability, learnability and operability is transparency; the corresponding dialogue property is coherence.

Transparency
For natural language interfaces the limits of the system's capabilities, both with respect to linguistic understanding and with respect to functionality, are not immediately visible. The user has to rely on his or her own mental model of the system and of the task, which is largely based on previous experience. The capabilities of a system must coincide with the expected mental model, or, if that is not possible, the mental model should be influenced to bring it in line with the system's capabilities. Apart from marketing techniques, there is not much to go on: users often do not remember the introductory advice about how to use a system. The only other way to influence the user's mental model is by careful design of the system prompts: what to say, when and how. A system is called transparent to the extent that it does not allow the user's mental model to deviate from the system's capabilities. A system tends to be transparent when it always indicates what it is doing, and why.
coherence:
  form: lexical choice, parallelism, intonation
  content: consistent, relevant, licensed
  function: task, interaction

Fig. 6. Coherence aspects
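The aspects in figure 6 can be read as an annotation schema for utterances. A minimal sketch of such a schema (the attribute names are ours, for illustration only; no published annotation tool is implied):

```python
# Hypothetical coherence annotation schema, following figure 6.
COHERENCE_ASPECTS = {
    "form": ["lexical choice", "parallelism", "intonation"],
    "content": ["consistent", "relevant", "licensed"],
    "function": ["task", "interaction"],
}

def empty_annotation():
    """A blank coherence annotation for one utterance: every feature
    of every aspect is initially unjudged (None)."""
    return {aspect: dict.fromkeys(features)
            for aspect, features in COHERENCE_ASPECTS.items()}
```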
Transparency is a kind of consistency in behaviour. Obviously, transparency cannot simply be added to a system's design; it must come from a systematic and principled way of designing system prompts. It is difficult to assess how transparent a system is: only by comparative testing can one draw even limited conclusions. For example, the dialogue redesign for padis described in section 2.3 probably made the system more transparent (Bouwman and Hulstijn, 1998).

Coherence concerns the way utterances are combined into a well-formed dialogue. An utterance is coherent with the context to the extent that it 'fits' the dialogue representation structure at the relevant linguistic levels. Incoherence signals a lack of understanding. Because misunderstandings decrease usability, coherence is a good indicator of usability. Coherence involves aspects of form, content and function (figure 6).

Form aspects are intonation, lexical choice, and syntactic and semantic parallelism. For instance, when the user refers to the date as "Wednesday", it is confusing if the system replies with "On August 25, ...". In general, adjacent utterances tend to be similar both in structure and in wording. Objects that are given in the context are referred to by a pronoun, or even left out. Getting the intonation right is important too: new or contrastive information must be accented; agreed or old information must be deaccented. In particular, information to be confirmed by the implicit confirmation strategy must be deaccented.

With respect to content, information must be consistent and accurate (maxim of quality). Moreover, an utterance should be relevant and should be just informative enough for the current purposes (maxims of relevance and quantity). An utterance is not licensed when it is overinformative. These aspects can be given a formal characterisation (Groenendijk, 1999; Hulstijn, 2000). The function of an utterance must also fit the context.
This can be assessed relative to the task model and the interaction patterns. An utterance is incoherent when it cannot be attached to some part of the dialogue representation by a coherence relation (Asher and Lascarides, 1998). How can we measure coherence effectively? In general, anaphoric references, semantic and syntactic parallelism between adjacent utterances, and verb-phrase ellipsis increase the perceived coherence. On the other hand, unnecessary repetitions, interruptions and corrections decrease coherence. A dialogue with many topic shifts is also perceived as incoherent. Most of these aspects must be annotated by hand; repetitions and corrections, however, can be calculated automatically on the basis of the system's decisions.
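Two of the countable indicators mentioned here, superfluous repetitions and topic shifts, can be sketched as follows. This is a deliberately crude approximation: real repetition detection would use the system's own decisions rather than literal string matching, and the topic labels would come from hand annotation.

```python
def repetition_count(system_turns):
    """Count system turns that literally repeat the preceding system turn,
    a crude proxy for superfluous repetitions."""
    return sum(1 for prev, cur in zip(system_turns, system_turns[1:])
               if cur.strip().lower() == prev.strip().lower())

def topic_shift_count(topic_labels):
    """Count changes between consecutive (hand-annotated) topic labels."""
    return sum(1 for prev, cur in zip(topic_labels, topic_labels[1:])
               if cur != prev)

# Invented mini-dialogue for illustration.
turns = ["On which date?", "On which date?", "For how many persons?"]
print(repetition_count(turns))                          # → 1
print(topic_shift_count(["date", "date", "persons"]))   # → 1
```

Normalising such counts by dialogue length would make them comparable across dialogues of different task complexity.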
4.4 Satisfaction

User satisfaction is the usability factor that is most difficult to influence by design. Yet in a way it is the most important one: ultimately it determines whether users are going to use the system. Satisfaction combines most aspects of a system. In that sense, user satisfaction corresponds to perceived usability: it is the observable correlate of the quality factor usability. User satisfaction is measured with questionnaires, think-aloud protocols, counting expressions of greetings and thanks, and counting the actual usage of the system. On the basis of such measures the relative weights of usability measures can be calibrated (Walker et al., 1997). User satisfaction is affected by all aspects of a system, including apparently non-functional ones. For example, satisfaction with spoken dialogue systems is sometimes related to whether users prefer a male or a female voice. What is decisive is the added value: how much better is this system compared to an existing system for the same task? In the example of padis, users often preferred the paper list for retrieving telephone numbers; only when they were not at the office did they appreciate the possibility of calling the system.

5 Summary and Related Research

This paper addresses the software engineering aspects of developing a dialogue system. It was argued that an iterative development method, in which a prototype or system is constantly evaluated and improved, does not preclude formal specification and verification. Relative to a sufficiently formal model of the domain, the correctness and completeness of resources can be checked. Because it is not initially clear when a dialogue design is correct, we defined properties that are related to usability, which is arguably the most important quality factor for dialogue systems. The properties are effectiveness, efficiency, understandability, learnability and operability, and satisfaction.
We indicated that these can be affected by the functionality of the system and by the transparency of the system prompts. They can be assessed by the task success rate, by a measure of the relative effort such as duration, and by coherence measures such as the number of anaphoric references, the number of superfluous repetitions and the number of topic shifts.

Our proposal is similar to the diagnostic evaluation method based on neo-Gricean maxims (Dybkjaer et al., 1998; Bernsen et al., 1998). In this method, a selection of problematic dialogues is annotated with the maxims whose violation caused an error or misunderstanding. By tracing errors back to design choices, a design can be analysed and improved. A disadvantage is that this is a laborious method. In principle the maxims can also be used as guidelines for design, but maxims may well conflict in some cases, and the theory does not give a method for handling such trade-offs. The theory developed here addresses not only the content- and appearance-related aspects, but also the task and interaction model. In particular, our account of coherence in terms of form, content and the task- and interaction-related communicative function of an utterance may act as a basis for more concrete qualitative
and quantitative evaluation schemes. Formal aspects have been worked out in detail by different authors. In particular, the content-related aspects of coherence, namely consistency, relevance, informativeness and licensing, can be given a formal semantics in terms of a set of possibilities that are structured by issues or contextual questions (Groenendijk, 1999; Hulstijn, 2000). In a plan-recognition framework such as (Litman and Allen, 1987; Carberry, 1990), the task-related aspects can be characterised in terms of plans and goals. The interaction-related aspects are dealt with either in terms of dialogue games, or in terms of coherence relations and default principles for coherent attachment (Asher and Lascarides, 1998). Few theories, however, offer a formal verification system that combines all of these aspects. An exception is the artimis system, which applies a logical theory of dialogue acts, in terms of the beliefs and goals of the participants, to specify the behaviour of the dialogue manager (Bretier and Sadek, 1997). Applying such verification methods in toolkits for dialogue system development is the major topic for further research.
References

Alexandersson, J. (1996). Some ideas for the automatic acquisition of dialogue structure. In Dialogue Management in Natural Language Dialogue Systems, TWLT11, pages 149–158. University of Twente, Enschede.

Androutsopoulos, I., Ritchie, G. D., and Thanisch, P. (1995). Natural language interfaces to databases: an introduction. Natural Language Engineering, 1(1):29–81.

Asher, N. and Lascarides, A. (1998). Questions in dialogue. Linguistics and Philosophy, 21:237–309.

Bernsen, N. O., Dybkjaer, H., and Dybkjaer, L. (1998). Designing Interactive Speech Systems. From First Ideas to User Testing. Springer-Verlag, Berlin.

Bouwman, G. and Hulstijn, J. (1998). Dialogue (re-)design with reliability measures. In Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC), pages 191–198, Granada, Spain.

Bretier, P. and Sadek, D. (1997). A rational agent as the kernel of a cooperative spoken dialogue system: Implementing a logical theory of interaction. In Intelligent Agents III: ECAI'96 Workshop on Agent Theories, Architectures and Languages (ATAL), LNCS 1193, pages 189–204. Springer-Verlag, Berlin.

Carberry, S. (1990). Plan Recognition in Natural Language Dialogue. MIT Press, Cambridge, Mass.

Carletta, J., Isard, A., Isard, S., Kowtko, J. C., Doherty-Sneddon, G., and Anderson, A. H. (1997). The reliability of a dialogue structure coding scheme. Computational Linguistics, 23(1):13–32.

Cockburn, A. (1997). Using goal-based use cases: transitioning from theory to practice. Journal of Object-Oriented Programming, 10(7):56–62.

Danieli, M., Gerbino, E., and Moisa, L. (1997). Dialogue strategies for improving the naturalness of telephone human-machine communication. In Proceedings of the ACL Workshop on Interactive Spoken Dialog Systems, pages 114–120, Madrid.
Dybkjaer, L., Bernsen, N., and Dybkjaer, H. (1998). A methodology for diagnostic evaluation of spoken human-machine dialogue. International Journal of Human-Computer Studies, 48(5):605–626.

Fowler, M. and Scott, K. (1997). UML Distilled: Applying the Standard Object Modeling Language. Addison-Wesley, Reading, Mass.

Fraser, N. and Gilbert, G. (1991). Simulating speech systems. Computer Speech and Language, 5:81–99.

Gibbon, D., Moore, R., and Winski, R., editors (1998). Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, Berlin.

Grice, H. P. (1975). Logic and conversation. In Syntax and Semantics 3. Academic Press, New York.

Groenendijk, J. (1999). The logic of interrogation: classical version. In Proceedings of the Ninth Conference on Semantics and Linguistic Theory (SALT-9), Santa Cruz.

Grosz, B. J. and Sidner, C. L. (1986). Attention, intentions and the structure of discourse. Computational Linguistics, 12:175–204.

van der Hoeven, G. F., Andernach, J., van de Burgt, S. P., Kruijff, G.-J. M., Nijholt, A., Schaake, J., and de Jong, F. (1995). SCHISMA: A natural language accessible theatre information and booking system. In First International Workshop on Applications of Natural Language to Data Bases (NLDB'95), pages 271–285, Versailles, France.

Hulstijn, J. (2000). Dialogue Models for Inquiry and Transaction. PhD thesis, University of Twente, Enschede.

Hulstijn, J. and Van Hessen, A. (1998). Utterance generation for transaction dialogues. In Proceedings of the Fifth International Conference on Spoken Language Processing (ICSLP'98), paper number 776.

ISO 9126 (1991). Information technology – Software product evaluation – Quality characteristics and guidelines for their use. Technical Report 9126, International Organization for Standardization.

Jacobson, I., Christerson, M., Jonsson, P., and Övergaard, G. (1992). Object-Oriented Software Engineering: A Use Case Driven Approach. Addison-Wesley, Wokingham.
Kellner, A., Rüber, B., and Seide, F. (1996). A voice-controlled automatic switchboard and directory information system. In Proceedings of the IEEE Third Workshop on Interactive Voice Technology for Telecommunications Applications, pages 117–120, Basking Ridge, NJ.

Levelt, W. J. (1989). Speaking: From Intention to Articulation. MIT Press, Cambridge, Mass.

Litman, D. J. and Allen, J. F. (1987). A plan recognition model for subdialogues in conversation. Cognitive Science, 11:163–200.

Mann, W. C. (1988). Dialogue games: Conventions of human interaction. Argumentation, 2:511–532.

Nielsen, J. (1993). Usability Engineering. Academic Press, Boston.

van Noord, G. (1997). An efficient implementation of the head-corner parser. Computational Linguistics, 23(3):425–456.
Oviatt, S. (1995). Predicting spoken disfluencies during human-computer interaction. Computer Speech and Language, 9(1):19–36.

Pressman, R. S. (1997). Software Engineering: A Practitioner's Approach. McGraw-Hill, European adaptation.

Rats, M. (1996). Topic Management in Information Dialogues. PhD thesis, Katholieke Universiteit Brabant, Tilburg.

Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., and Lorensen, W. (1991). Object-Oriented Modeling and Design. Prentice Hall, New York.

Schank, R. C. and Abelson, R. P. (1977). Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Erlbaum, New York.

Walker, M. A., Litman, D. J., Kamm, C. A., and Abella, A. (1997). PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the ACL/EACL, pages 271–280, Madrid.