Simulating recognition errors in speech user interface prototyping

Matthias Peissner, Frank Heidmann, Jürgen Ziegler
Fraunhofer-Institute for Industrial Engineering (IAO), Nobelstr. 12, D-70569 Stuttgart, Germany
ABSTRACT

We have developed a Wizard of Oz simulation tool that supports scenario-based simulation of speech systems for conducting empirical studies with future users. This paper focuses on the adequate integration of recognition errors, as these are an important characteristic of speech-based applications. The presented solution addresses the aspects of reliability and validity, both of which are necessary preconditions for the immediate transferability of simulation results to the real system.
1. SPEECH USER INTERFACE PROTOTYPING

In the field of GUI design it has become common practice to test usability in early development stages. By using paper prototypes, important design decisions can be made on the empirical basis of tests with future users. In comparison to the vast amount of empirical studies and guidelines concerning the usability of GUIs, we know very little about how to design effective speech user interfaces (SUIs). Moreover, SUI designers face the essential difficulty of getting a sound feeling for the dialogue flow by merely inspecting a written dialogue specification. For these reasons it is all the more important to include prototyping and usability testing early in the design process of user-friendly interactive voice response systems (IVR systems).

The speech equivalent of a paper prototype is a Wizard of Oz (WOZ) study (Weinschenk & Barker, 2000), in which a human (the wizard) simulates the role of the computer during testing and starts different recorded system prompts depending on what the user said. Usability testing with the WOZ technique can lead to valuable results regarding the following topics:

• Designing a user-oriented grammar: In very early development stages WOZ studies can pinpoint the utterances that are typically used to control the available functions. Given a sufficient number of subjects, the transcriptions of the test sessions can give a representative image of what users expect the system to understand. The most frequently recorded utterances can serve as a valid basis for a user-centred grammar. In this way, the time-consuming procedure of pilot testing, including iterative grammar modifications and recognition tuning, can be shortened or even partially avoided (Pearl, 2000).

• Comparison of different systems or system versions: Alternative design decisions can quickly be acted out and tested with future users. In particular, the different effects of alternative prompt versions on the users' performance and attitudes towards the system can be evaluated.

• Overall ergonomic evaluation: WOZ experiments can take the traditional role of usability tests in evaluation and troubleshooting. The detection of major problems of use in an early development stage enables iterative redesign and re-conception without the otherwise necessary phases of implementation.

A necessary precondition for the validity of a WOZ study is that the interaction between user and "machine" (here the wizard) be as realistic as possible. Otherwise, the results cannot be transferred immediately to the real situation of system use. This means that, on the one hand, the subject in a WOZ study must actually believe that she interacts with a real system, which is a matter of adequate instruction. On the other hand, the simulation must not differ from the specified system behaviour in essential aspects. Among other things, this refers to the reliability of speech recognition, which is treated in detail in the following section, and to the available complexity of the dialogue. For highly complex applications it is necessary to do scenario-based testing in order to reduce the number of probable user utterances. This supports the wizard's decisions by giving a situation-specific pre-selection of probable options for "system" reactions.

2. PROBLEM

Speech technology is probabilistic in nature, and recognition errors are therefore inherent in any speech-based application.
Furthermore, situations of recognition errors are especially critical for usability variables such as effectiveness and efficiency in task solving and for user acceptance (Yankelovich, Levow & Marx, 1995).
Therefore, it will be indispensable in most cases to carefully simulate error situations in WOZ studies in order to obtain data on questions such as: How frustrating do users find recognition errors in the application in question? Do the mechanisms of error management actually assist the users in correction? Do the users recognise the occurrence of an error at all?

How should recognition errors be included in the simulation design? Even with the complete grammar of the recognition system in mind, one would never be able to anticipate the system's behaviour. This unpredictability of recognition errors is further increased if the IVR system is used from a cellular phone. Obviously, a simple rule-based model for the simulation of recognition errors is not applicable. On the other hand, pure arbitrariness or intuition as the basis for the wizard's decisions will bias the test results. In order to ensure reliability and validity of a WOZ study, the following aspects have to be taken into account:

• Realistic probabilities for recognition errors: A predefined and realistic probability for correct understandings, substitution errors and rejection errors is a precondition for a sound evaluation of the relevant usability criteria, and it allows controlled examination of the consequences of various confidence thresholds. The confidence threshold defines the minimal probability of correct classification needed to execute an action. Probabilities below the threshold lead to rejection, usually accompanied by a prompt like 'Sorry, I could not understand you. Please repeat.' (a minimal illustration of this threshold logic follows this list). The necessary data stem from knowledge of the recogniser used and of relevant parameters of the classification scheme used.

• Standardised simulation: Without using automatic speech recognition it will never be possible to completely eliminate influences on simulation performance that arise from the wizard's decisions. These influences cannot be held constant over time, different persons and situations. It is an important goal to achieve a maximum level of objectivity by reducing the possible options, the consequences and the need for human decisions to a manageable minimum. Only under comparable test conditions can different systems or system versions and the performance of different user groups be compared adequately. For the comparison of different prototype versions, the simulated recognition performance should be balanced in order to avoid undesired side effects.

• Interactivity: Although standardisation is an important feature, especially in within-subjects designs of system comparison, interactivity is essential for the validity of the results. That means that, despite standardisation of the simulation system, responses must depend on what the user says. Strict balancing (i.e. constant predefined sequences of correct recognition, substitution error and rejection in both conditions) and randomising (i.e. constant predefined frequencies of correct recognition, substitution error and rejection in both conditions) do not account for training effects in the users' speech performance, which are likely to produce higher recognition rates in the version presented in the second position.
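For illustration, the threshold mechanism described in the first point might be sketched as follows. This is a minimal sketch, assuming a recogniser that returns a hypothesis together with a confidence score; the threshold value, function names and prompt texts are our own illustrative assumptions, not part of the tool described in this paper.

    # Minimal sketch of a confidence-threshold decision (illustrative only;
    # the threshold value of 0.45 is a hypothetical example).
    CONFIDENCE_THRESHOLD = 0.45

    def execute_action(hypothesis: str) -> str:
        # Placeholder: look up and play the prompt for the recognised utterance.
        return f"<prompt for {hypothesis!r}>"

    def handle_recognition(hypothesis: str, confidence: float) -> str:
        """Return the system reaction for one recognition result."""
        if confidence < CONFIDENCE_THRESHOLD:
            # Confidence below the threshold leads to rejection.
            return "Sorry, I could not understand you. Please repeat."
        # Otherwise the associated action is executed.
        return execute_action(hypothesis)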
One method to support the simulation of recognition errors is the use of filters, e.g. vocoders that distort the spoken input, in order to help the wizard perform at the system's expected level (Bernsen, Dybkjær & Dybkjær, 1998). Filters satisfy the requirements of interactivity and standardisation. But it is questionable whether they can support a realistic simulation of errors. Firstly, the relationship between the probability of recognition errors and the physical intensity of the filter is not straightforward and would have to be investigated empirically beforehand. Secondly, a deterministic filter that constantly distorts the input signal may not be an appropriate model for a highly probabilistic process. Human speech performance is probabilistic: even if two utterances sound completely identical to another person, the acoustic signals will never be exactly the same. Environmental noise, recording and transmission are also probabilistic factors that make it impossible to anticipate the acoustic quality of the system's input signal. Finally, the procedure of recognition itself is probabilistic in nature, as it follows a statistical classification scheme.

3. OUR APPROACH

We have developed a software tool that supports WOZ simulations of IVR systems (see figure 1 for the GUI).

3.1 The WOZ GUI

Each button on the simulation GUI (except those for scenario selection) stands for a set of user utterances, a specific subset of the grammar. For ease of use, each button is labelled with the corresponding grammar or at least a part of it. The scenario-based approach makes it possible to simulate even highly complex applications. Any scenario consists of one or more pairs of user utterance and system prompt. When a scenario is started, the main frame displays a matrix of buttons, each representing an expected user utterance. For illustration, let us take a scenario that involves calling John Smith and then changing his number entry in the telephone book. The first target user utterance is something like "I'd like to call John Smith", which is represented by the first button in the first column.
The other buttons in the first row represent expected variations from the target utterance in this first sub-task, e.g. "I want to place a call", "Go to telephone book" or "John Smith". These utterances start other actions, e.g. feed-forward prompts that are to obtain the missing data needed to accomplish a transaction (e.g. "Whom would you like to call?"). When feed-forward prompts are played that are not part of the target path (the first column of the main frame), a child window pops up displaying buttons that represent possible user utterances in the current sub-dialogue. When all data needed to proceed to the next sub-task of the scenario have been captured, the window is closed again. The following rows are built up in the same way: the first button represents the next target user utterance, and the other buttons of the same row stand for other expected utterances. Simultaneously displaying all stages of a scenario, instead of only the current one, makes it possible to react appropriately to users who choose a different order of actions than expected. A rough sketch of such a scenario definition is given below.
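This is a sketch under our own assumptions about naming and data layout; the utterances are taken from the example above, and the tool itself need not be implemented this way.

    # Sketch of a scenario as a list of steps; each step pairs a target
    # utterance (first button in a row) with its target prompt and maps
    # expected deviations to their feed-forward prompts (names hypothetical).
    from dataclasses import dataclass, field

    @dataclass
    class Step:
        target_utterance: str                           # first button in the row
        target_prompt: str                              # prompt on the target path
        deviations: dict = field(default_factory=dict)  # utterance -> prompt

    scenario = [
        Step("I'd like to call John Smith",
             "Calling John Smith.",
             {"I want to place a call": "Whom would you like to call?",
              "Go to telephone book": "Telephone book. What would you like to do?"}),
        Step("Change the number entry for John Smith",
             "What is the new number?"),
    ]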
[Figure 1 shows the simulation GUI: the main frame with Target and Deviation buttons for each scenario step, the scenario selection, the permanently available prompts (Main - Content, Main - Info, Main - Help), the context-sensitive functions (Help, Cancel, Stop) and the log-file creator.]
Figure 1: Graphical user interface of the simulation tool.

The frame at the top of the screen displays specific help prompts and other prompts that are permanently available independent of the current scenario and context of use. Examples of corresponding user utterances are "Content" and "Help on ...". On the left of the main frame, the control frame displays buttons for functions that are context-sensitive. The user utterances "help" and "cancel" or an out-of-grammar utterance (OOG) start different prompts depending on the current context of use. The same holds for barge-in, which may be switched off in certain contexts. The DOS box in the lower right corner displays the text that is written to the log-file during interaction. The log-file includes information about which button was pressed at what time and which prompt was played. Log-file analysis can reveal detours and wanderings and identify particularly problematic areas in the dialogue flow. Moreover, the log-files contain information about the actual rate of correct understandings, which is important for further evaluations.
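For illustration, a log entry might be written roughly as follows; the exact format and field names shown here are hypothetical, not the tool's actual output.

    # Illustrative sketch of writing one log entry: which button was pressed
    # at what time and which prompt was played (format hypothetical).
    import time

    def log_event(logfile, button: str, prompt: str) -> None:
        timestamp = time.strftime("%H:%M:%S")
        logfile.write(f"{timestamp}\tBUTTON={button}\tPROMPT={prompt}\n")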
3.2 Simulating speech recognition errors

The occurrence of recognition errors is integrated on the basis of the considerations outlined above concerning realistic error probabilities, standardisation and interactivity. The imperfect performance of speech recognition is modelled by human speech understanding that is restricted to a predefined set of utterances (the grammar) and extended by a probabilistic element of uncertainty. The wizard only has to decide whether the user utterance can be understood given the grammar restrictions and then to press a corresponding button. She does not have to assess whether the real system would understand correctly. Each button, except those for scenario selection, starts a random function that triggers one of a certain set of possible actions. For example, pressing the target button is followed by the target action (target prompt), by another, undesired action (simulating a substitution error), or by rejection ('Sorry, I could not understand you. Please repeat.'). The probabilities for correct recognition and the errors of substitution and rejection are customised according to experience with or expectations regarding the real system (see figure 2); a minimal sketch of such a random function follows figure 2.
[Figure 2 diagram: a user utterance is judged by the wizard against the target grammar (yes / no / ambiguous); a random function then selects the system response, e.g. the target prompt, a substitution error or a rejection, with predefined probabilities such as 0.8 / 0.07 / 0.13.]

Figure 2: Probabilities for correct recognition and errors of substitution and rejection.
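The following is a minimal sketch of such a random function. The probability values follow the example figures shown in figure 2; the function and variable names are our own illustrative assumptions, not the tool's actual implementation.

    # Sketch of the random function behind an utterance button: one button
    # press triggers the target prompt, a substitution error or a rejection
    # according to predefined probabilities (example values from figure 2).
    import random

    OUTCOMES = ("target", "substitution", "rejection")
    PROBABILITIES = (0.8, 0.07, 0.13)  # must sum to 1

    def simulate_recognition(target_prompt: str,
                             substitution_prompts: list) -> str:
        outcome = random.choices(OUTCOMES, weights=PROBABILITIES, k=1)[0]
        if outcome == "target":
            return target_prompt
        if outcome == "substitution":
            # One of the predefined prompts simulating a substitution error.
            return random.choice(substitution_prompts)
        return "Sorry, I could not understand you. Please repeat."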
3.3 Required definitions and wizard's skills

The following parameters are necessary input variables for running a WOZ simulation with our software. Any testing scenario must be decomposed into all necessary user utterances (the target utterances) and the corresponding target prompts. Expected variations should be identified for every target user utterance. For every target prompt, a set of suitable prompts for simulating substitution errors must be given. And finally, every available user utterance (displayed as a button on the GUI) must be assigned its probabilities for correct recognition, substitution error and rejection error. These probabilities may vary between different classes of utterances, e.g. digits, utterances containing more than one token, etc. (see the sketch below).

The wizard plays a key role for the validity of the study. She must know the dialogue flow exactly, especially the permanently available functions and prompts, in order to be able to manage the complex GUI. Furthermore, deep knowledge of the grammar, including permissible variations, is needed. Extensive training will be indispensable in order to gain the necessary speed and competence in the decision process.
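For illustration, such per-class probability assignments might be collected in a simple table like the following; the utterance classes and values are hypothetical examples, not measured recognition rates.

    # Sketch of per-class probabilities (correct, substitution, rejection)
    # as required input for a simulation run (values hypothetical).
    ERROR_PROFILES = {
        "single word":  (0.90, 0.04, 0.06),
        "digit string": (0.85, 0.08, 0.07),
        "multi-token":  (0.75, 0.10, 0.15),
    }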
3.4 First testing experiences

We conducted a study in order to compare two different prompt versions. We used exactly the same dialogue model and simulation parameters (grammar, error rates) with two different prompt versions. The questions were: Do users notice a difference? Do they establish different mental models, including hypotheses concerning the reliability of recognition? Do user utterances differ between the two prompt versions? How do different prompt versions affect the acceptance and likeability of an IVR system?

We used a within-subjects design with 20 subjects. Each subject worked on 10 scenarios, 5 scenarios with each version. Presentation order was balanced. Subjects were instructed that they were to test two different prototypes and to judge which version was more comfortable to use. It turned out that the actual performance of the "speech recognition", which was a matter of chance, was a dominant factor in the preference for one or the other version. With only a small number of user utterances for each version and a rather low assumed probability of 80% for correct recognition, it was not rare for the actual performance to differ noticeably between the two versions. Even very small differences were noticed (e.g. one misrecognition with the first version, two with the second). This effect would also have appeared if a real IVR system had been used for testing with different prompt styles. Strict balancing is surely not the right way to avoid this effect, but it can be reduced by increasing the number of necessary user utterances. Taking this into account, the study provided valuable information, e.g. the transcripts of the interactions, which indicate that one prompt version evokes user utterances that can be assumed to increase the performance of speech recognition compared to the other version. We will not discuss further results here in detail. Boyce (2000) reports a comparable study and results.

Another WOZ study, with 10 subjects, was conducted in order to check the overall usability of a speech portal before the real system was completely implemented and tuned. Before our tests the specified grammar was rather sparse, and so one valuable by-product was a collection of utterances used to accomplish the given tasks. The results indicated, among other things, that the provided grammar urgently needed to be extended in order to allow for efficient use of the IVR system. First clues for these extensions could be given but needed to be validated and completed by a more extensive study. Important recommendations on the basis of the results concerned modifications in prompting, especially in cases of rejection as a consequence of an OOG utterance or of a recognition failure. As user utterances strongly depend on system prompting, it is important that these changes be made before extensive pilot testing that aims at optimising speech recognition on the basis of recorded user utterances.

In both studies our wizard's decisions were not free of human error, as she had to make complex decisions in a very short time. Erroneous decisions can never be totally prevented, but they are not a problem as long as they do not arise systematically. However, it seems to be very difficult for a human "speech recogniser" to repeatedly reject an OOG utterance when the subject is trying so hard. Cases of transaction success as a consequence of an overly tolerant wizard decision should be marked in the transcripts in order to be taken into account in further evaluations.

4. CONCLUSIONS

Designing and developing an IVR system is a complex task due to close interdependencies between the essential aspects of dialogue architecture, prompting, grammar and reliability of speech recognition. In order to find application-specific solutions for user-friendly SUIs, prototyping and WOZ studies are indispensable, as they help to save time and costs in an iterative development cycle. Moreover, conducting WOZ studies is a favourable approach for scientific investigations that collect empirical data as a sound basis for guiding principles of ergonomic SUI design. Our approach to simulating recognition errors in WOZ studies allows for realistic simulation conditions. It allows us to see how users cope with recognition errors, and it enables us to evaluate the effectiveness of the mechanisms the system provides to support users in discovering and coping with recognition errors.

REFERENCES

Bernsen, N.O., Dybkjær, H., and Dybkjær, L., "Designing Interactive Speech Systems: From First Ideas to User Testing", London: Springer, 1998.

Boyce, S.J., "Natural spoken dialogue systems for telephony applications", Communications of the ACM, 2000, Vol. 43, No. 9, pp. 29-34.

Pearl, C., "A Prototyping Tool for Telephone Speech Applications", CHI 2000 Workshop on Natural Language Interfaces, The Hague, Netherlands, April 3, 2000 (http://www.cs.utep.edu/novick/nlchi/papers/Pearl.htm).

Weinschenk, S. and Barker, D.T., "Designing Effective Speech Interfaces", New York: John Wiley & Sons, 2000.

Yankelovich, N., Levow, G.A. and Marx, M., "Designing SpeechActs: Issues in Speech User Interfaces", CHI '95 Proceedings, ACM Conference on Human Factors in Computing Systems, Denver, CO, May 7-11, 1995.