Int. J. Human-Computer Studies (1999) 50, 85-107. Article No. ijhc.1998.0235. Available online at http://www.idealibrary.com

Modelling the effects of constraint upon speech-based human–computer interaction

KATE S. HONE
ICL Institute of Information Technology, School of Computer Science and Information Technology, University of Nottingham, University Park, Nottingham NG7 2RD, UK. email: [email protected]

CHRIS BABER
Industrial Ergonomics Group, School of Manufacturing and Mechanical Engineering, The University of Birmingham, Birmingham B15 2TT, UK. email: [email protected]

(Received 4 February 1998 and accepted in revised form 18 August 1998)

Commercial speech systems, for use by the public, rely heavily on prompts which aim to constrain user input to a highly limited vocabulary set. Constraints help to increase the recognition accuracy of the automatic speech recognition device and thus improve dialogue efficiency. However, this strategy can also lengthen interactions because longer prompts are needed to effectively constrain user utterances and more steps are usually needed to complete a task. The current paper argues that to achieve optimal dialogue design solutions it is necessary to balance these conflicting effects of constraint. Two modelling studies are presented in which dialogue constraint levels for a home-banking application were systematically manipulated in order to investigate the effects on overall transaction time. The results indicate that, even with the assumption that high constraint leads to high recognition accuracy, it is difficult for highly constrained dialogues which entail the need for extra dialogue steps (e.g. those using menus) to compete with less-constrained dialogues which do not (e.g. those using queries). The implications of these findings for system design are discussed and it is suggested that the modelling method presented here can provide a useful tool early in the design process. © 1999 Academic Press

1. Introduction

In recent years, there has been an increase in the number of public speech-based applications, particularly for use over the telephone. Such systems use speech input and output technology and are intended to offer a means of extending public access to computer systems and services. Already there are working speech systems providing travel information, home banking and voice messaging, and there is considerable scope for other services to be automated in the same way. However, while such systems can offer advantages to the consumer (e.g. 24 h access to services) and to the service provider (reduced cost compared to a human operator), they also raise a number of problems. Baber, Stanton and Johnson (1998) propose that developers of public technology face many human factors difficulties. For instance, systems will be used by a population with little or no training, and with limited previous experience. In addition, the technology

will face situations in which, if it does not reach performance criteria immediately, it could be rejected. For speech systems, these factors can be compounded by the performance of the technology, and by unrealistic expectations among users. To illustrate the latter point, users might expect to be able to speak to a computer in a similar manner to their normal speech over the telephone. However, "the possibility of carrying on an unconstrained conversation with a computer in which the computer understands in a completely human-like way is not likely to be realised in the foreseeable future." (Wolf, Kassler, Zadrozny & Opyrchal, 1997, p. 461). This means that the majority of applications depend on the imposition of constraint on what can be said at each point in the interaction.

At first glance, the use of constraint can offer a means of handling many of the problems concerning public technology: by limiting the user to specific words, it will be possible to both minimize variability in the vocabulary to be recognized and provide guidance and structure to new users. Constraint can function on several levels and will have varying influence on user performance, e.g. constraint on the available vocabulary at a specific juncture in the interaction will have an impact on which word can be spoken, while constraint on the type of speech used (isolated word or continuous speech) will have an impact on how a word is spoken. Even systems which can accept natural language input still require carefully designed prompts to ensure that users constrain their speech within acceptable limits (Mane, Boyce, Karis & Yankelovich, 1996). In this paper, "constraint" will be used to refer to the choice of individual words and their combination. The issue of constraint influencing the manner in which a person speaks is rapidly becoming a less important human factors issue, in that the capabilities of contemporary speech recognition technology allow connected speech at a reasonable speaking rate (in contrast with the old isolated word technology which required the speaker to pause after each word).

This paper will begin with a review of literature concerning the effect of constraint in speech dialogues. Throughout the paper, a key issue will be the impact of constraint on transaction time (that is, how quickly a user can interact with the system in order to achieve their task goals). Reducing transaction time can be crucial to system developers; for example, in telephone-based systems even relatively small reductions in transaction time can mean huge savings in the cost of service provision. Users are also likely to be more satisfied with shorter interactions. The current paper then presents an approach to investigating the effects of constraints in speech dialogues using task network modelling. Studies such as these provide data which will help designers to make informed decisions about the optimal level of constraint to include in speech interactive dialogues.

2. Background

The following literature review will begin by considering the benefits of using high levels of constraint. It will then go on to discuss the costs of constraint and consider how a balance can be struck between these two conflicting effects.

2.1. BENEFITS OF CONSTRAINT

The principal reason that speech recognition requires the use of constraint is to limit the opportunity for recognition errors by limiting the active vocabulary set, i.e. the number


of words from which the speech recognizer is able to match the incoming word. This is because the recognition accuracy of a speech recognizer is partly dependent upon the "perplexity" of the vocabulary. Perplexity is a measure of the branching factor in a syntax, i.e. the number of words which can be chosen at a given juncture in the vocabulary (Schmandt, 1994). The lower the perplexity, the more effectively lexical constraint can improve recognition; "Other things being equal, it is desirable to construct languages which have low perplexity, since this ensures higher recognition accuracy." (Rudnicky, 1995, p. 414). Constraint, therefore, is assumed to offer a means of enhancing the performance of the recognizer by reducing the search space, which will minimize the possibility of substitution errors and reduce processing time.

In applications which are accessible to the general public, high levels of constraint are often imposed on user utterances to ensure that the interactions are successful. For example, British Telecom's CallMinder, an answering-machine service using voice input and output, has a total vocabulary of 20 words and the majority of the prompts are designed to restrict answers to "yes" or "no" (Beacham & Barrington, 1996). The intention of this constraint is to minimize the chance of substitution errors by restricting active vocabulary to only two items in most cases.
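To make the branching-factor reading of perplexity used above concrete, the sketch below (our own illustration, not drawn from Schmandt, 1994, and using a hypothetical answering-machine syntax) computes the geometric mean of the number of words active at each juncture. A CallMinder-style design keeps this figure close to 2, even with a 20-word total vocabulary, because most junctures accept only "yes" or "no".

```python
import math

def branching_perplexity(syntax):
    """Geometric mean of the number of words active at each juncture:
    the simplified, branching-factor reading of perplexity used above."""
    counts = [len(words) for words in syntax.values()]
    return math.exp(sum(math.log(n) for n in counts) / len(counts))

# Hypothetical syntax: most junctures are yes/no confirmations,
# one juncture offers a small spoken menu.
toy_syntax = {
    "confirm_playback": ["yes", "no"],
    "confirm_delete": ["yes", "no"],
    "confirm_save": ["yes", "no"],
    "main_menu": ["play", "delete", "save", "help"],
}

print(f"branching-factor perplexity: {branching_perplexity(toy_syntax):.2f}")  # about 2.4
```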

2.2. COSTS OF CONSTRAINT

Using a restricted vocabulary with unambiguous questions does not necessarily eliminate "out-of-task" vocabulary, i.e. even with questions designed to elicit "yes" or "no", users might reply with additional words (Baber, Stammers & Usher, 1990). Furthermore, transaction time is often influenced by the length of prompt required to ensure that users' responses are sufficiently constrained. For example, Kamm (1994) noted that given the prompt "Will you accept the charges?" only 51% of users responded with the desired word ("yes"), but when the prompt was rephrased as "Say Yes if you will accept the call; otherwise say No", 81% of users said the desired word. While correct response rate is improved, the provision of suggested response words before each question doubles the time to issue the prompt.† Brems, Rabin and Waggett (1995) show that a spoken menu, e.g. "please say one of the following: collect, calling card, third number, person to person or operator, now", produces longer interactions than prompts which simply ask a question, e.g. "what type of call would you like to make?". In this study (using the Wizard of Oz subterfuge), participants were more "successful" with the spoken menu than with the question prompt. However, the mean entry time for the question strategy was lower than the strategy with spoken menu options. It is also worth noting that if options are given in a spoken prompt, the short-term memory capacity of the user sets a limit on the number of options which should be stated per prompt; Schumacher, Hardzinski and Schwartz (1995) suggest a maximum of four items per auditory prompt. This means that additional dialogue steps may have to be negotiated compared to a question prompt, in order for users to reach the service they require. Of course, it might be possible to allow users to "barge-in" when they hear a menu item which they require, but research into the use of speech output tends to suggest that people will wait until the full set has been displayed before acting (Stanton & Baber, 1997).

† This can be illustrated with a simple model of the interaction: assume that the first prompt consists of five words and the second of 11 words, and assume further that each "phoneme" (including pauses) will take 100 ms. Thus, the first prompt consists of 18 "phonemes" and will take 1.8 s, while the second prompt consists of 38 "phonemes" and will take 3.8 s.

Murray, Jones and Frankish (1996) compared the performance effects of different levels of syntactic constraint for data entry using an automatic speech recognizer. The speech recognizer used in this study, the Votan VPC 2000, allows vocabulary sets to be divided by a "syntax" which governs the availability of words at different junctures in the interaction. Reducing the number of available words reduces the perplexity of the vocabulary and should improve recognition accuracy. This was the case in Murray et al.'s (1996) work, as they found that increased syntactic constraint (and consequent reduction in active vocabulary size) was associated with higher machine recognition accuracy. However, imposing the highest level of constraint added the need for extra command words which increased the basic transaction time for this strategy (as shown in error-free interactions). Murray et al. (1996) also found that user errors increased when syntactic constraint was employed.

A study by Casali, Williges and Dryden (1990) employed a "Wizard of Oz" subterfuge where a range of recognition accuracies and vocabulary sizes were simulated. Participants were required to enter stock control data (where a vocabulary item was not available, the participant entered the word character by character in spelling mode). Casali et al. (1990) showed significant effects of both recognition accuracy and available vocabulary on transaction time. Higher recognition accuracy led to faster transaction times because less time was spent correcting errors. Constraining the available vocabulary led to slower transaction times because of the extra time needed to enter unavailable items in the spelling mode. There was also a significant interaction between recognition accuracy and available vocabulary (more unavailable items combined with low recognition accuracy leading to very slow transaction times). This effect was due to the increased opportunity for error introduced by the increased number of input steps with vocabulary constraint. However, Casali et al. (1990) do not seem to appreciate that there is a relationship between available vocabulary size and recognition accuracy (in their study these variables were independent). Thus, one might expect increased constraint to result in higher recognition accuracy and, under some conditions, this may offset the extra time costs of entering the unavailable items in spell mode. In fact, Casali et al.'s (1990) data showed that the most restricted vocabulary at high recognition accuracy (25% of items unavailable, 99% accuracy) was faster than a less restricted vocabulary at lower recognition accuracy (12.5% of items unavailable, 91% accuracy).

2.3. CONCLUSIONS

The discussion above highlights that although constraint is imposed with the intention of improving effectiveness (through an increase in recognition accuracy), it can also reduce efficiency (through an increase in transaction time). That is, conversational interactions with high constraint may need longer prompts in order to ensure that the user keeps within the available vocabulary. Furthermore, constraint frequently necessitates an increase in the number of dialogue steps needed to complete a task. In addition to these points, although high constraint reduces the chance of recognition error at each step, it


could increase the chance of recognition error across the entire transaction as a function of the increasing number of dialogue steps. For example, assuming a recognition accuracy of 95% for each step, in a two-step transaction, the overall accuracy will be 90%, while in a four-step transaction, the overall accuracy will be reduced to 81% (see Ainsworth, 1988). The proposed relationships between constraint level and efficiency are illustrated in Figure 1.
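The arithmetic behind this compounding effect is simply the per-step accuracy raised to the power of the number of recognition steps, assuming errors at each step are independent. A minimal sketch, reproducing the figures quoted above:

```python
def overall_accuracy(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that every recognition step in a transaction succeeds,
    assuming independent errors with the same per-step accuracy."""
    return per_step_accuracy ** n_steps

for steps in (2, 4):
    print(f"{steps} steps at 95% per step -> {overall_accuracy(0.95, steps):.0%} overall")
# 2 steps -> 90%, 4 steps -> 81%, as in the text.
```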

FIGURE 1. Proposed relationship between dialogue constraint and dialogue efficiency.

Overall, it is clear that there are performance trade-offs inherent in the task of imposing constraint in speech-based dialogues with machines. Effective dialogue design for speech-based interaction with computers will be achieved through judicious balancing of the conflicting effects of dialogue constraints. Peckham (1995) recognized this when he stated that, "… in tasks where error costs are high or where recogniser performance is likely to be poor … the speed of a command language will need to be sacrificed in favor of a system using prompts" (p. 477). Surprisingly, there has been no attempt to date to quantify these trade-offs in terms of system efficiency. However, some studies do provide results which are relevant to this issue, and support some of the key relationships between dialogue constraint and efficiency proposed in Figure 1.

Both the Murray et al. (1996) and the Casali et al. (1990) studies demonstrate the performance decrement introduced in high constraint dialogues by the requirement for extra dialogue steps. Furthermore, the Casali et al. (1990) study supports the notion that additional dialogue steps further reduce efficiency by the increased opportunity they provide for error. Finally, the Murray et al. (1996) study demonstrates that imposing dialogue constraint can improve recognition accuracy, which should act to increase efficiency. It is worth noting that both the Murray et al. (1996) study and the Casali et al. (1990) study used speech input combined with visual output. The kinds of effects that they obtained would be likely to be even more pronounced in conversational systems (with speech output) due to the additional time cost for each spoken prompt.

3. Modelling the performance effect of constraint

In this section, an alternative method for studying the effects of constraint in speech dialogues will be introduced. First the use of the modelling method will be justified in relation to previous research. The basic modelling approach will then be described and finally the current modelling studies will be introduced.


3.1. WHY USE MODELLING?

What the research reported above does not recognize is the potential utility of balancing the conflicting effects of dialogue constraint in order to achieve an optimal design solution. For instance, Murray et al. (1996) do not report the overall mean transaction times achieved in their study, which would demonstrate the outcome of the constraint trade-offs. In fact, even if these data were reported, the results would not be clear cut, due to the contaminating effects of user errors on performance time (user errors being more prevalent in the conditions with syntactic constraint). While it is clearly important to consider the effects of human error on overall performance, it is argued here that effort should not be spent designing prompts to effectively constrain user utterances before it is clear whether successful achievement of this goal would actually lead to faster transaction times. It may be that systems which cater for a wider range of user utterances could be as effective as systems with large numbers of highly constrained steps. In order to investigate the trade-off between constraint and transaction time, it is necessary to study the system performance effects of constraint in isolation from the effects of human error. In effect, this is what authors such as Murray et al. (1996) have done in excluding trials with human error from some of their analyses.

In the current paper, a different approach is taken. Rather than running human trials and then excluding instances of human error, models of human-computer interactions were developed which represented perfect human performance combined with errorful computer recognition performance. These models were used to investigate the effects of constraint in conversational interactions, where the trade-offs inherent in imposing constraint will be at their most pronounced. Previous research has not attempted to investigate the transaction time effects of constraints in these types of interactions. Furthermore, using the modelling technique allowed a large range of different machine recognition accuracy values to be tested (recall that in Casali et al.'s study only three recognition accuracy levels were used, presumably due to the time and cost restrictions imposed by the use of human subjects). This allows the researcher to pin-point the machine parameters at which a given dialogue design would be most effective (this approach is taken in the second of the two studies reported here). Previous research has shown that simulating speech-based interactions using task-network models can provide similar data to that obtained from experiments using human subjects (Baber & Hone, 1993; Hone & Baber, 1993, 1995a, b; Ainsworth, 1993; Hone, Series & Baber, 1995).

3.2. TASK NETWORK MODELLING

Task-network models simulate the performance of events or tasks within a specified segment of activity. Using this approach, speech events can be described in terms of completion time and probability of either successful completion or progression to a further speech event (e.g. a correction move). This basic scenario is illustrated in Figure 2. In Figure 2, each utterance (computer and user) is represented as a step in a flowchart. In the modelling process, each of these steps is defined according to the mean and standard deviation of completion time (which is normally distributed). Machine recognition performance is represented as a percentage chance of completion. The model is run dynamically with time samples taken at each step based on sampling from the normal distribution.


FIGURE 2. Basic task network for a speech dialogue.

The path through the model is determined in a probabilistic manner based on the values defined for recognition accuracy. During runs of the model where a "misrecognition" occurs, a time cost is incurred because the prompt-reply-output sequence must be repeated. It is possible for repeat errors to occur, each repetition adding an additional time penalty. Each task model is run a number of times (at least 500 times in the current work) and the results provide predictions in terms of overall transaction time (minimum, mean and maximum values), and in terms of a frequency distribution of the number of iterations needed to complete the task. The predictions obtained reflect not only the basic transaction time effects incurred by different dialogue structures, but also the effects of having to enter error correction sub-dialogues following a misrecognition.

It should be noted that while the error correction strategy illustrated in Figure 2 involves simple repetition of the misrecognized word, it is also possible to design task-network models which embody different error correction strategies. For example, it is possible to alter the probability of a successful completion given a previous successful completion, or to re-prompt in a different way if a first prompt is unsuccessful. Readers are directed to Baber and Hone (1993) for a demonstration of how task network models can be used to compare alternative strategies of this kind.
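The mechanics just described can be sketched in a few lines of code. The sketch below is not MicroSAINT and the step timings are illustrative placeholders; it simply shows the approach: each prompt-reply-output step takes a normally distributed time, recognition succeeds with a fixed probability, a misrecognition forces the step to be repeated, and running the network many times yields minimum, mean and maximum transaction times.

```python
import random
import statistics

def run_dialogue(steps, accuracy, rng):
    """One pass through a simple task network. 'steps' is a list of
    (mean_s, sd_s) durations for each prompt-reply-output sequence;
    a sequence is repeated until the simulated recogniser succeeds."""
    total = 0.0
    for mean_s, sd_s in steps:
        while True:
            total += max(0.0, rng.gauss(mean_s, sd_s))  # time for this attempt
            if rng.random() < accuracy:                 # recognised correctly
                break                                   # otherwise repeat the step
    return total

def simulate(steps, accuracy, runs=500, seed=0):
    """Run the task network 'runs' times and summarise transaction time."""
    rng = random.Random(seed)
    times = [run_dialogue(steps, accuracy, rng) for _ in range(runs)]
    return min(times), statistics.mean(times), max(times)

# Illustrative network: four prompt-reply-output steps of about 4 s each.
example_steps = [(4.0, 0.4)] * 4
for acc in (0.9, 0.7):
    lo, mean, hi = simulate(example_steps, acc)
    print(f"accuracy {acc:.0%}: min {lo:.1f} s, mean {mean:.1f} s, max {hi:.1f} s")
```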

3.3. THE CURRENT STUDIES

In this paper, two task-network modelling studies are described which investigated the effect of constraint in home-banking dialogues. Home banking is typical of the type of task which researchers are trying to automate using conversational speech-interactive systems (see Larsen, 1997). It is also a domain in which there will be a limited number of


services offered to clients. A decision must therefore be made over whether the dialogue structure will allow direct access to all of these services or whether constraints will be imposed so that at each step only a restricted set of options is available. There is a range of design options available for tasks such as home banking. These are illustrated in Table 1.

TABLE 1
Interaction styles for conversational speech interfaces (adapted from Waterworth, 1984)

Interaction style | Example prompt | User input
Command | No prompt | Natural language specification of request OR command language
Query | Which service do you require? | Name of an available banking service, e.g. "balance"
Menu | Which service, balance, cash transfer, order chequebook or other service? | Name of one of the listed choices, e.g. "balance"
Yes/No | Would you like to hear your balance? | Yes or no
Grunt | Please make a sound now if you would like to hear your balance. | Grunt or silence

There are several points to note from Table 1. Firstly, the amount of constraint increases from the top to the bottom of the table. Consequently, the size of active vocabulary needed per step decreases from top to bottom and one would therefore expect the system accuracy to improve from top to bottom. On the other hand, the number of steps needed to access a service tends to increase as you go down the table. For example, with the query strategy, a user can directly ask for the service they require; with the menu, yes/no or grunt strategy the user must wait until they are offered the required service and this may take several steps.

In the two studies described below, the banking dialogues modelled differed according to the interaction style they used. This provided a manipulation of constraint level. The dialogues were also modelled assuming different levels of recognition accuracy. In the first study, the level of accuracy modelled for each level of constraint was calculated on the basis of assumptions regarding the effect of vocabulary size on recognition performance. The second study took a different approach by modelling dialogues across a wide range of recognition accuracies. The aim in each case was to investigate the trade-off inherent in imposing dialogue constraints. The specific aims of each study are discussed in more detail in the sections which follow.

4. The modelling studies

4.1. STUDY 1

4.1.1. Study aims and objectives
For this study two home-banking dialogues were compared. The first embodied a relatively low level of constraint and assumed a combination of the query and command


styles of interaction (see Table 1). The second embodied a higher level of constraint and assumed a combination of the menu and yes/no styles of interaction. The aim of the study was to investigate whether the effects of increased recognition accuracy in the more constrained dialogue were sufficient to outweigh the extra transaction time needed for longer prompts and more dialogue steps.

4.1.2. Dialogue structures
The two dialogues modelled in this study both covered the same home-banking task: requesting an account balance and requesting an amount of money. In the first (less constrained) dialogue, query prompts were used for service selection and commands were used to correct recognition errors. In the second (more constrained) dialogue, menus were used for service selection and yes/no prompts were used to confirm recognitions and thus provide error recovery if necessary. Figure 3 shows a segment of the less constrained dialogue structure, while Figure 4 shows a segment from the more constrained dialogue.

4.1.3. Model parameters and assumptions
Speaking rates: following previous studies, the speaking rate for both computer output and user input was assumed to be 100 ms (S.D. 3 ms) per phoneme (Flanagan, 1969; Baber & Hone, 1993). Timings for each prompt were thus calculated by counting the number of phonemes in the utterance.

User behaviour: it was assumed that users would not make any errors in inputting words so that the effects of constraint on performance could be studied in isolation from the effects on user behaviour.
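As a small illustration of the speaking-rate assumption (our own sketch; the phoneme counts below are hypothetical examples, not taken from the study dialogues), an utterance's duration can be sampled by drawing 100 ms (S.D. 3 ms) for each phoneme:

```python
import random

def utterance_time_s(n_phonemes, rng, mean_ms=100.0, sd_ms=3.0):
    """Sampled duration (in seconds) of an utterance of n_phonemes,
    with each phoneme drawn from N(mean_ms, sd_ms)."""
    return sum(max(0.0, rng.gauss(mean_ms, sd_ms)) for _ in range(n_phonemes)) / 1000.0

rng = random.Random(1)
# Hypothetical counts: a short query prompt vs. a longer spoken-menu prompt.
print(f"short prompt (18 phonemes): {utterance_time_s(18, rng):.2f} s")
print(f"menu prompt (38 phonemes):  {utterance_time_s(38, rng):.2f} s")
```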

FIGURE 3. Section from the less constrained dialogue.


FIGURE 4. Section from the more constrained dialogue.

Recognition rates: for the first (less constrained) dialogue style two recognition rates were modelled: 90 and 70% accuracy (note this refers to probability of success per word). Thus, with 90% accuracy, the models incorporated a 90% chance of successful recognition and progression to the next step, and a 10% chance of misrecognition and the need to repeat the input step. The second (more constrained) dialogue was run with the assumption that the increased constraint would improve recognition accuracy. Here the 90 and 70% values were assumed to apply to the total vocabulary with all items active (as in the less constrained dialogue). New probabilities of success were then calculated for each dialogue step based on the number of words from the total vocabulary which would actually need to be active at that point given the constraining nature of the prompt (typically three or four items). The following formula was used to calculate the new probabilities of success:

P(new) = 100 - [100 - P(tot)][(V(new) - 1)/(V(tot) - 1)]     (1)

where P(new) is the new probability of success, P(tot) is the probability of success when the whole vocabulary is active, V(new) is the new vocabulary size and V(tot) is the total vocabulary size. This formula assumes that all recognition errors are substitutions and that if a vocabulary item is misrecognized all other items in the vocabulary have an equal chance of being the word substituted for it. While previous research (e.g. Brown & Vorsburgh, 1989) does indicate that substitutions are the most common form of recognition error, it is recognized that these assumptions represent a considerable simplification. A further simplification used in the models was an assumption that no errors would occur in recognition of words used for navigation within the dialogue (i.e. words used for confirmation of input and error correction). Support for the use of this assumption comes from an experiment performed by Noyes and Frankish (1994) where the number of substitution errors involving similar types of command word was negligible. However, it is recognized that the actual number of errors involving navigation words will depend upon both the chosen vocabulary and the characteristics of the speech recognition device used. Where researchers are modelling a particular speech recognition system (in contrast to the more generic approach used here) they should attempt to take these features into account.
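A direct implementation of Equation (1) makes the size of the assumed gain explicit. The sketch below uses illustrative numbers (a hypothetical 20-word total vocabulary constrained to three or four active items); it is not the actual vocabulary of the study dialogues.

```python
def constrained_accuracy(p_tot, v_tot, v_new):
    """Equation (1): per-step probability of success (in percent) when only
    v_new of the v_tot vocabulary items are active, given whole-vocabulary
    accuracy p_tot and assuming all errors are equally likely substitutions."""
    return 100.0 - (100.0 - p_tot) * (v_new - 1) / (v_tot - 1)

# Illustrative values only: a hypothetical 20-word total vocabulary.
for p_tot in (90.0, 70.0):
    for v_new in (3, 4):
        p_new = constrained_accuracy(p_tot, v_tot=20, v_new=v_new)
        print(f"P(tot) = {p_tot:.0f}%, V(new) = {v_new}: P(new) = {p_new:.1f}%")
```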


4.1.4. Predictions
It was predicted that the lower recognition accuracy (70%) would lead to longer transaction times than the higher recognition accuracy (90%) for both types of dialogue (because of the time cost of recovery). It was also expected that the more constrained dialogues would lead to longer basic transaction times (before the effect of errors) than the less constrained dialogues because of the longer prompts and the extra steps required. The main issue of interest was whether the recognition performance gains modelled in the increased constraint condition would offset these basic transaction time differences.

4.1.5. Procedure
The test dialogues were programmed into MicroSAINT as task network models. MicroSAINT has a graphical user interface which allows the basic task network to be drawn as a flowchart diagram. The parameters (mean, standard deviation and distribution of performance time) for each task in the network are then defined using dialogue boxes. The parameters of each decision point (representing recognition performance) are also defined using a dialogue box. Throughout this work probabilistic decision rules were used, although readers may like to note that MicroSAINT also provides the option to use rules which take account of what has occurred previously during a run of the model. MicroSAINT allows users to choose how many times to run the model and how those runs are displayed. In the current study, each condition was run 1000 times (however, comparisons of these results with those obtained from fewer runs suggested that a stable solution was reached within 500 runs; this number was therefore used in study 2 to save time). During each run of the model, MicroSAINT displayed the path taken by an "entity" travelling through the network. At the end of each run, the time taken for this entity to finish its journey through the network was recorded. This value was dependent on the particular path followed (i.e. the number of times error correction


paths were followed). At the end of each condition, these data were used to calculate mean and maximum transaction times over the 1000 runs.

4.1.6. Results
Figure 5 shows the mean transaction times which were obtained in the study. For the less constrained dialogue, the mean transaction time at 90% accuracy was 21.3 s and at 70% accuracy was 26.4 s. For the more constrained dialogue, the mean transaction time at 90% accuracy was 36.3 s and at 70% accuracy was 36.8 s (recall that these recognition accuracy values refer to the situation where the entire vocabulary is active; the actual modelled recognition accuracy for each step varied according to Equation (1) above). It is clear from these results that, as expected, the transaction time for the less constrained dialogue was shorter at an assumed recognition accuracy of 90% than at an assumed recognition accuracy of 70%. For the more constrained dialogue, there was a negligible difference between the transaction times for the two recognition accuracy conditions. It is also clear that transaction times for the less constrained dialogue at both recognition accuracy values modelled were shorter than for the more constrained dialogue.

Figure 6 shows the maximum transaction times which were obtained. For the less constrained dialogue, the maximum transaction time at 90% accuracy was 36.6 s and at 70% accuracy was 63.3 s. For the more constrained dialogue, the maximum transaction time at 90% accuracy was 52.4 s and at 70% accuracy was 53.2 s. These results indicate that the maximum transaction time for the less constrained dialogue was shorter at an assumed recognition accuracy of 90% than at an assumed recognition accuracy of 70%. For the more constrained dialogue, there was only a negligible difference between the maximum transaction times for the two recognition accuracy conditions.

FIGURE 5. Mean transaction times for alternative dialogues at 90 and 70% recognition accuracy.


FIGURE 6. Maximum transaction times for alternative dialogues at 90 and 70% recognition accuracy.

The maximum transaction time for the less constrained dialogue at 90% was also shorter than for the more constrained dialogue at 90%. However, unlike the mean time values, the more constrained dialogue at 70% outperformed the less constrained dialogue at 70% in terms of maximum transaction time.

4.1.7. Conclusions
The results of this first study indicate that dialogue design had a major effect on transaction time. In fact, in this example, the effect of dialogue design on mean transaction time is greater than the effect of a 20% difference in assumed recognition accuracy. Within the parameters tested in the current work, the less constrained dialogue style (using queries and commands) always produced a faster mean transaction time than the more constrained dialogue style (using menus and yes/no). Modelling the effect of reduced vocabulary size for the more constrained dialogue style did not lead to sufficient improvements in mean transaction time to allow this dialogue to challenge the less constrained dialogue. Thus, for this example, those effects of dialogue constraint which act to shorten an interaction (fewer recognition errors and less need for time-consuming error correction) were not sufficient to outweigh the factors which act to lengthen the interaction (increased number of steps and longer prompts). Interestingly, if maximum transaction time is used as the performance metric, the more constrained dialogue does outperform the less constrained dialogue at 70% accuracy. This suggests that more constrained dialogues may be useful in capping transaction times when recognition accuracy is low.

Overall, this study illustrates the conflicting effects of dialogue constraint in determining overall transaction time (described in Figure 1). However, the dialogues which were


used in this study differed from each other in several ways. As well as including different types of prompts for choosing the main services, they assumed different approaches to feedback and error correction. While it is true that different prompts lend themselves more readily to particular types of feedback and error correction, it may also be that the less constrained dialogue was given an unfair advantage by the assumption that explicit confirmation was not required. In addition, the formula used to estimate recognition accuracy with a reduced vocabulary set represented a considerable simplification.

4.2. STUDY 2

4.2.1. Study aims and objectives
Study 2 was run in order to address the issues raised by study 1. Here, feedback and error correction methods were held constant and only the constraint level inherent in the service prompts was varied. Furthermore, instead of trying to estimate recognition accuracy for the less constrained dialogues on the basis of vocabulary set reduction, a "best-case scenario" for the constrained dialogues was assumed and compared to a less constrained dialogue at a range of recognition accuracy values. The study also provided an opportunity to investigate the relative importance of prompt length and number of dialogue steps in contributing to the performance effects of increased dialogue constraint. Three alternative home-banking dialogues were compared in which these variables were manipulated.

4.2.2. Dialogue structures
For this study, a small part of a home-banking dialogue (ordering a new cheque book) was isolated and three alternative dialogues for completing the task were modelled. The methods of feedback and error correction were held constant across the dialogues, with the assumption that the machine would provide feedback on what had been recognized and users would need to explicitly confirm or reject this feedback. Two different styles of service selection prompt were compared: a less constrained style using a query, and a more constrained style using a spoken menu. For the more constrained style there were two alternative conditions: either the task could be completed with one menu selection, or two consecutive menu selections were required. For the less constrained query-style dialogue, the task could always be completed with one selection step.

The differences between the dialogues were intended to capture two key features of increasing constraint which were discussed in the introduction to this paper. The first feature is that increasing constraint will require longer prompts in order to inform the user of the restrictions. In this case, listing menu items takes longer than simply asking the user which service they require. The second feature is that the restriction on the number of options at each stage with increased constraint means that some tasks may need more than one step in order to reach the required option. The three dialogue structures used are illustrated in Figures 7-9.

4.2.3. Model parameters and assumptions
The same parameters and assumptions were used as in study 1, with one exception. Unlike study 1, this study did not use Equation (1) to calculate recognition accuracy based on vocabulary size. This was to provide results which were independent of the


assumptions used to write this equation (which were, as already noted, a considerable simplification). In this study, it was assumed that the effect of reducing the active vocabulary to three items with the menu prompts would be to increase the recognition accuracy to 99%. This was considered to approximate a "best-case scenario" for the more constrained dialogues. The less constrained (query) dialogue was then modelled at a range of accuracies from 99 to 25%, with the intention of finding out the values of recognition accuracy at which the less constrained, query dialogue would cease to be faster than (a) the more constrained, menu dialogue with one step and (b) the more constrained, menu dialogue with two steps.

FIGURE 7. Query (less constrained) dialogue.

4.2.4. Predictions
At 100% recognition accuracy, it is clear that the less constrained style will be quickest as the basic prompt time is shortest and only one selection step is needed (see Figure 7). The menu selection dialogue with two steps will be the slowest, as not only does this strategy


require the menu options to be spoken, but the process must be gone through twice in order to complete the task (see Figure 9). The menu dialogue with one selection step (Figure 8) will be of intermediate speed. It was predicted that this is the pattern of results which would be observed when all three dialogues were modelled at 99% recognition accuracy.

FIGURE 8. Menu dialogue with one step.

For the query (less constrained) dialogue, which was modelled at a range of recognition accuracies, it was predicted that transaction time would increase as modelled recognition accuracy fell. As already stated, the aim of the study was to determine the recognition accuracy values at which the query dialogue would equal the menu dialogues at 99% accuracy in terms of transaction times. No specific predictions were made regarding the absolute recognition accuracy values at which this would occur. However, it was predicted that the query dialogue strategy would cease to be competitive with the one-step menu dialogue at a higher recognition accuracy than it would cease to be competitive with the two-step menu dialogue.


FIGURE 9. Menu dialogue with two steps.


4.2.5. Procedure
The three test dialogues were programmed into MicroSAINT as task network models, as described for study 1. The two menu (more constrained) dialogues were each run 500 times with the assumption of recognition performance at 99%. The query (less constrained) dialogue was then run 500 times assuming each of the following recognition accuracy values: 99, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30 and 25%. Data were recorded on mean and maximum transaction time.

4.2.6. Results
The mean and maximum transaction times obtained for the two menu dialogues at 99% are shown in Table 2. As expected, the mean and maximum transaction times for the two-step menu dialogue were larger than those for the one-step menu dialogue.

TABLE 2
Mean and maximum transaction times for the menu dialogues modelled at 99% recognition accuracy

Number of steps | Mean transaction time (s) | Maximum transaction time (s)
1 | 14.4 | 19.1
2 | 17.9 | 25.2

The mean and maximum transaction times for the query dialogue across a range of recognition accuracies are plotted in Figures 10 and 11. It is clear that as recognition accuracy rises the less constrained dialogue becomes more and more efficient. As predicted, at 99% accuracy, this strategy is more efficient than either of the menu dialogues.
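The kind of sweep plotted in Figures 10 and 11 can be reproduced with the simulator sketched in Section 3.2. The fragment below reuses that hypothetical simulate() helper with placeholder step timings, so its numbers will not match the study's; it only shows how the break-even accuracy against the Table 2 baselines can be located.

```python
# Sweep the query-style dialogue over the study 2 accuracy levels and find
# where it stops beating each menu-style baseline (mean times from Table 2).
ACCURACIES = [0.99] + [a / 100 for a in range(95, 20, -5)]
QUERY_STEPS = [(3.5, 0.3), (4.5, 0.4)]          # placeholder prompt/reply/confirm timings
MENU_MEAN_BASELINES = {"one-step menu": 14.4, "two-step menu": 17.9}

sweep = {acc: simulate(QUERY_STEPS, acc)[1] for acc in ACCURACIES}  # mean times

for label, baseline in MENU_MEAN_BASELINES.items():
    slower = [acc for acc, mean_time in sweep.items() if mean_time > baseline]
    if slower:
        print(f"query dialogue falls behind the {label} below about {max(slower):.0%}")
    else:
        print(f"query dialogue beats the {label} at every accuracy tested")
```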

FIGURE 10. Graph of mean transaction time against recognition accuracy for the query (less constrained) dialogue. [Mean transaction times for the one- and two-step menu dialogues are indicated by lines (a) and (b), respectively.]


FIGURE 11. Graph of maximum transaction time against recognition accuracy for the query (less constrained) dialogue. [Maximum transaction times for the one- and two-step menu dialogues are indicated by lines (a) and (b), respectively.]

Of interest is the point at which the less constrained dialogue will become less efficient than (a) the one-step menu dialogue at 99% accuracy and (b) the two-step menu dialogue at 99% accuracy.

(a) Query vs. one-step menu dialogue. The mean transaction time for the one-step menu dialogue at 99% is 14.4 s (from Table 2). Looking at Figure 10, the recognition accuracy at which the query dialogue becomes less efficient than this is around 65%. The maximum transaction time for the one-step menu dialogue at 99% is 19.1 s (from Table 2). Looking at Figure 11, the value of recognition accuracy at which the query dialogue becomes less efficient than this is around 90%.

(b) Query vs. two-step menu dialogue. The mean transaction time for the two-step menu dialogue at 99% is 17.9 s (from Table 2). Looking at Figure 10, the value of recognition accuracy at which the query dialogue becomes less efficient than this is around 35%. The maximum transaction time for the two-step menu dialogue at 99% is 25.2 s (from Table 2). Looking at Figure 11, the value of recognition accuracy at which the query dialogue becomes less efficient than this is around 82%.


4.2.7. Conclusions
This study used the assumption that increasing dialogue constraint through the use of menus (with three options per menu) would increase recognition performance to near perfect (99%). The performance of a less constrained (query) dialogue was then modelled at a range of recognition accuracies and the results compared to the constrained (menu) dialogues at 99%. It was found that when the addition of dialogue constraint was assumed to result in an increase in prompt length (one-step menu) then the less constrained dialogue only becomes the less efficient strategy once its recognition accuracy falls below 65%. When the addition of dialogue constraint led to an increase in prompt length and the number of dialogue steps (two-step menu) then the less constrained dialogue only becomes the less efficient strategy once its recognition accuracy falls below 35%.

These findings suggest that where the imposition of high levels of constraint within a speech dialogue entails the need for an extra step it will be hard for it to compete with a less constrained dialogue which allows direct access to the required option. This effect holds even when additional constraint is assumed to result in much greater recognition accuracy: the less constrained dialogue is still faster than the more constrained dialogue with two steps when their respective recognition accuracies are assumed to be 40 and 99%. When additional constraint only entails a longer prompt, the less constrained dialogue must still fall below a relatively low level of recognition accuracy before it performs worse than the dialogue with the longer prompt. It is also worth noting that if high constraint were to prove less effective in increasing recognition accuracy than the 99% value assumed here, then the advantage for the less constrained approach would be even more pronounced.

If one is concerned with limiting the maximum observed transaction time, the more constrained approach performs rather better than if one is concerned with minimizing mean transaction time. Thus, the less constrained dialogue was equivalent to the one-step constrained dialogue at 99% accuracy when its recognition accuracy was around 90%. It was equivalent to the two-step constrained dialogue when its recognition accuracy was around 82%.

5. Discussion

The reported task network modelling studies illustrate the trade-offs between dialogue constraint, transaction time and recognition accuracy. The results suggest that the performance gains which can be expected from imposing high levels of dialogue constraint are relatively small compared to the costs of imposing that constraint. In particular, it is unlikely that gains in recognition accuracy (and consequent reduction in error correction time) will offset the increases in time which extra dialogue steps entail. This result implies that dialogue designers should think very carefully before they increase the number of dialogue steps in order to reduce vocabulary size, as the effort may be counterproductive. The studies also highlight the importance of keeping spoken prompts as brief as possible, as only small increases in the number of words can affect whether a dialogue strategy is competitive or not.

This paper has taken mean transaction time as the primary metric of speech system performance. Several authors have previously justified this approach by pointing out


that small reductions in average transaction time (in the order of a few seconds) can be worth millions of dollars per year to service providers (e.g. Brems et al., 1995; Lawrence, Atwood, Dews & Turner, 1995). By minimizing transaction time, we could reduce associated costs, such as call time (and hence, charge to the caller), as well as increase throughput for call handling, i.e. more callers could be handled if the average call time is reduced. However, it is also worth considering other ways of assessing performance. In the current paper maximum transaction times are also reported (and it is worth noting that more constrained dialogues do appear more effective when viewed in terms of maximum transaction time). Designers might consider reducing the maximum transaction time to be a good aim, as it might avoid the situation where some users experience very long interactions and consequently evaluate the system poorly. Research is needed on the relationship between user evaluations and mean and maximum transaction times. If maximum transaction times were found to be an important factor in forming user opinion, then the use of constraint would be more likely to be justified.

User evaluation is of course an important metric of system performance, but it is one which cannot be addressed directly by a modelling approach. There is some evidence that users prefer systems in which the prompts imply some constraint over what can be input. For instance, in a large-scale trial of a simulated speech-based telephone inquiry system, Rothkrantz, Manintveld, Rats, van Vark, de Vreught and Koppelaar (1997) showed that users preferred a directive style of dialogue to a non-directive style. It can be hypothesized that this kind of effect is due to users seeking guidance on how to interact with a new type of conversational partner (Baber, 1993). However, it is not clear whether users will prefer very high levels of constraint or intermediate levels. It is likely that regular users will dislike "Spanish inquisition" mode as it would hamper their performance, but that novice users might find such a mode supportive. However, this hypothesis needs to be tested experimentally. Overall performance will be a function of both system performance with that level of constraint and the effects of that constraint on user performance, i.e. Murray et al. (1996) showed that users made more errors when the level of constraint increased.

Generally, users do not seem to be particularly good at obeying implied constraint, at least at the level of producing required input words without extraneous speech (Brems et al., 1995). Given that the available vocabulary set is limited at specific points in the interaction, constraint leads to the requirement to inform users of which words they ought to use. Baber, Johnson and Cleaver (1997) suggest that choice of words will be influenced by a set of factors which include "… the vocabulary set provided, the nature of the speech recognizer, the design of the feedback, the nature of the tasks being performed and the relationship between user goals and task demands" (p. 57). Further human factors research needs to examine these issues in more detail.

This paper has highlighted the different effects which designers can expect from the imposition of high levels of constraint in conversational speech-based human-computer dialogues, and it has shown how task network modelling can be used to investigate these trade-offs. It is suggested that designers could use this approach in the early stages of design.
The method will be most effective when designers have information about the recognition accuracy of their system with various vocabulary sizes. However, it is also possible to make general assumptions about performance and investigate how changes in the model parameters affect dialogue efficiency. Using this approach, in the current


paper, it was found that even when it is assumed that high constraint levels will lead to very high recognition accuracy, it will still be hard for them to compete with less constrained dialogues because of the need for longer prompts and more dialogue steps. It is suggested that dialogue designers should be more aware of this effect and should be cautious about increasing the number of dialogue steps simply to reduce the number of options at each step. Further research on the behaviour of both users and speech systems will allow for more accurate and useful models to be developed in the future.

References

AINSWORTH, W. A. (1988). Optimisation of string length for spoken digit input with error correction. International Journal of Man-Machine Studies, 28, 573-581.
AINSWORTH, W. A. (1993). Theoretical and simulation approaches to error correction strategies in automatic speech recognition. International Journal of Man-Machine Studies, 39, 517-520.
BABER, C. & HONE, K. S. (1993). Modelling error recovery and repair in automatic speech recognition. International Journal of Man-Machine Studies, 39, 495-515.
BABER, C., JOHNSON, G. I. & CLEAVER, D. (1997). Factors affecting users' choice of words in speech-based interaction with public technology. International Journal of Speech Technology, 2, 45-60.
BABER, C., STAMMERS, R. B. & USHER, D. M. (1990). Error correction requirements in ASR. In E. J. LOVESEY, Ed. Contemporary Ergonomics 1990, pp. 454-459. London: Taylor and Francis.
BABER, C., STANTON, N. A. & JOHNSON, G. I. (1998). From public technology to ubiquitous computing: implications for ergonomics. Ergonomics, 41, 921-926.
BEACHAM, K. & BARRINGTON, S. (1996). CallMinder: the development of BT's new telephone answering service. BT Technology Journal, 14, 52-59.
BREMS, D. J., RABIN, M. D. & WAGGETT, J. L. (1995). Using natural language conventions in the user interface design of automatic speech recognition systems. Human Factors, 37, 265-282.
BROWN, N. R. & VORSBURGH, A. M. (1989). Evaluating the accuracy of a large vocabulary speech recognition system. Proceedings of the Human Factors Society 33rd Annual Meeting, pp. 296-300.
CASALI, S. P., WILLIGES, B. H. & DRYDEN, R. D. (1990). Effects of recognition accuracy and vocabulary size of a speech recognition system on task performance and user acceptance. Human Factors, 32, 183-196.
HONE, K. S. & BABER, C. (1993). Using task networks to model error correction dialogues for ASR. In E. J. LOVESEY, Ed. Contemporary Ergonomics 1993, pp. 152-157. London: Taylor and Francis.
HONE, K. S. & BABER, C. (1995a). Optimization of feedback position in the entry of digits by voice. In S. A. ROBERTSON, Ed. Contemporary Ergonomics 1995, pp. 181-186. London: Taylor and Francis.
HONE, K. S. & BABER, C. (1995b). Using a simulation to predict the transaction time effects of applying alternative levels of constraint to user utterances within speech interactive dialogues. Proceedings of the ESCA Workshop on Spoken Dialogue Systems, pp. 209-213. Grenoble: ESCA.
HONE, K. S., SERIES, R. W. & BABER, C. (1995). An experimental investigation of the error correction strategies used by subjects entering digit strings with the AURIX speech recogniser. Proceedings of EuroSpeech '95, pp. 1445-1448.
KAMM, C. (1994). User interfaces for voice applications. In D. B. ROE & J. G. WILPON, Eds. Voice Communication between Humans and Machines, pp. 422-442. Washington, DC: National Academy Press.
LARSEN, L. B. (1997). A strategy for mixed-initiative dialogue control. Proceedings of Eurospeech '97. Grenoble: ESCA.
LAWRENCE, D., ATWOOD, M. E., DEWS, S. & TURNER, T. (1995). Social interaction in the use and design of a workstation. In P. THOMAS, Ed. The Social and Interactional Dimensions of Human-Computer Interfaces, pp. 240-260. Cambridge: Cambridge University Press.


MANE, A., BOYCE, S., KARIS, D. & YANKELOVICH, N. (1996). Designing the user interface for speech recognition applications. SIGCHI Bulletin, 28, October, pp. 29-34.
MURRAY, A. C., JONES, D. M. & FRANKISH, C. R. (1996). Dialogue design in speech-mediated data-entry: the role of syntactic constraints and feedback. International Journal of Human-Computer Studies, 45, 263-286.
NOYES, J. M. & FRANKISH, C. R. (1994). Errors and error correction in automatic speech recognition systems. Ergonomics, 37, 1943-1957.
PECKHAM, J. (1995). Behavioral aspects of speech technology: industrial systems. In A. SYRDAL, R. BENNETT & S. GREENSPAN, Eds. Applied Speech Technology, pp. 469-486. Boca Raton, FL: CRC Press.
ROTHKRANTZ, L., MANINTVELD, W., RATS, M., VAN VARK, R., DE VREUGHT, J. & KOPPELAAR, H. (1997). An appreciation study of an ASR inquiry system. Proceedings of Eurospeech '97: 5th European Conference on Speech Communication and Technology. Grenoble: ESCA.
RUDNICKY, A. I. (1995). The design of spoken language interfaces. In A. SYRDAL, R. BENNETT & S. GREENSPAN, Eds. Applied Speech Technology, pp. 403-426. Boca Raton, FL: CRC Press.
SCHMANDT, C. (1994). Voice Communication with Computers. New York: Van Nostrand Reinhold.
SCHUMACHER, J. R., HARDZINSKI, M. L. & SCHWARTZ, A. L. (1995). Increasing the usability of interactive voice response systems: research and guidelines for phone-based interfaces. Human Factors, 37, 251-264.
STANTON, N. A. & BABER, C. (1997). Comparing speech versus text displays for alarm handling. Ergonomics, 40, 1240-1254.
WATERWORTH, J. A. (1984). Interaction with machines by voice: human factors issues. British Telecom Technology Journal, 2.
WOLF, C. G., KASSLER, M., ZADROZNY, W. & OPYRCHAL, L. (1997). Talking to the conversational machine: an empirical study. In S. HOWARD, J. HAMMOND & G. LINDGAARD, Eds. Human-Computer Interaction: INTERACT '97, pp. 461-468. London: Chapman & Hall.

Paper accepted for publication by Associate Editor, Dr D. Hill
