User Interface Issues for Natural Spoken Dialog Systems Susan J. Boyce and Allen L. Gorin AT&T Laboratories
ABSTRACT
We are interested in building machines that can understand and act upon fluently spoken language. Understanding is often not achieved in a single interaction, but rather through a dialog that negotiates the proper outcome. Our goal is to make such human-computer dialogs as natural as possible, so as to enable large populations of non-experts to use these systems. Many user interface issues then arise. We focus on a particular experimental vehicle: automatically routing telephone calls based on a user's fluently spoken response to the prompt "How may I help you?". We describe several elements that are necessary in human-computer dialog but have no ready analogs in the human-human case. Finally, we present preliminary experimental results evaluating user interface options for these elements.
INTRODUCTION
Large populations of non-expert users pose several design challenges for spoken dialog systems. These users form a diverse group, varying widely along a number of dimensions. Some may never, or only rarely, have used an automated system before. In addition, many people call only occasionally; it is uncommon for the same user to call two or three times per day for several days in a row, as one might expect with other kinds of services (voice mail, for example). Each time a customer calls, he or she may remember little from any previous call (if there was one), so the user interface cannot assume that the caller has "learned" how to use the system. Callers also differ from one another in accent, proficiency with the language, rate of speech, volume of speech, and in the quality of the sound transmitted by their telephone handsets. These differences affect both the performance of the automated system and the design of the user interface. The challenge is to develop a design that is acceptable to all.
What is Naturalness?
Our goal is to design a dialog that is natural, which we define as one that closely resembles a conversation two humans might have. One can imagine a dialog that is understandable and usable (at least for experienced users), but not natural:
A Non-Natural Dialog:
System: Please say your authorization code now.
User: 5 1 2 3 4
System: Invalid entry. Please repeat.
User: 5 1 2 3 4
In this example, the system does not phrase questions as a human would (a human would be more likely to say "Authorization code, please?" or "May I have your authorization code?"). The user may be able to interpret the system's prompts and complete the task, but the dialog is not natural. A better design would incorporate elements of human to human conversation in order to make the experience easier and more pleasant for the user [Brems, Rabin & Waggett, 1995]. By making the dialog follow the pattern of a human to human conversation, users can bring what they already know about language and conversation to bear in guiding their responses.
The Call Routing Task
The call routing task that we studied involved classifying users' responses to the open-ended prompt "How may I help you?" from a telephone operator [Gorin et al., 1994]. The goal of this experimental system is to classify each response as one of 17 call types so that the call can be routed to an appropriate destination. For example, if a person says "Can I reverse the charges?" then the appropriate action is to connect them to an automated subsystem that processes collect calls. If the request is "I can't understand my bill" then the call should be routed to the business office.
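As a purely illustrative sketch (the call-type names and routing table below are hypothetical; the paper does not specify an implementation), the routing step amounts to mapping a classified call type to a destination, with a human operator as the fallback:

```python
# Illustrative sketch only: call-type names and destinations are invented examples.
ROUTING_TABLE = {
    "COLLECT_CALL": "collect-call subsystem",
    "BILLING_QUESTION": "business office",
    "AREA_CODE_INFO": "directory assistance",
    # ... one entry per call type (17 in the experimental system)
}

def route_call(call_type: str) -> str:
    """Map a classified call type to a destination, falling back to a human operator."""
    return ROUTING_TABLE.get(call_type, "human operator")

print(route_call("COLLECT_CALL"))   # -> collect-call subsystem
print(route_call("UNKNOWN_TYPE"))   # -> human operator
```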
Our Design Process
To produce a natural dialog design, we engaged in an iterative process of design and user testing [Day and Boyce, 1993]. The first phase was to collect and analyze human to human dialogs for the call routing task [Gorin, Parker, Sachs & Wilpon, 1996]. In the second phase, we identified the elements of the dialog that could make the human-computer dialog seem unnatural. The third phase was to conduct Wizard of Oz experiments, as defined in [Gould, 1983], to investigate some of these issues.
Analysis of Human to Human Dialogs
The first phase involved gaining a better understanding of how callers express their requests to humans and how human agents elicit clarifying information [Gorin et al., 1996]. By doing this, we not only gained important data for algorithm and
technology development [Gorin, 1995], but also for the design of the user interface. By closely matching the wording of our prompts to the words used by the human operator, we can achieve a greater degree of naturalness in the dialog. However, not all aspects of human-computer dialog can be modeled after human to human dialog. Some elements of the human-computer dialog are necessary simply because the automated system does not have all of the capabilities of a human listener.
Issues for a Natural Call Routing Dialog
In designing our call routing system, there were several aspects of the dialog for which there was no ready analog in human to human conversation. These elements include the initial greeting, confirmation of the user's request, disambiguation of an utterance, reprompts, and knowing when to bail out to a human to complete the transaction.
Initial Greeting. A frequently voiced concern about designing very natural human-computer dialogs is that, early in the interaction, users are likely to assume that the system has greater capabilities than it actually has, and therefore to speak in a manner that the system has little chance of understanding. Designing the right initial greeting is necessary to set user expectations appropriately.
Confirmations. Since no automated system will perfectly interpret the user's speech, a necessary element of a human-computer dialog is a confirmation step. During confirmation, the system repeats what it thinks was said, giving the user a chance to confirm or deny the system's interpretation. The confirmation strategy depends on the confidence estimate provided by the spoken language understanding (SLU) subsystem. For example:
System: How may I help you?
User: Yes, could you give me the area code for Morristown, NJ?
System: Do you need area code information?
User: Yes, I do.
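To make the dependence on the confidence estimate concrete, the sketch below shows one way the system's next turn might be keyed to SLU confidence; the thresholds and the interface to the SLU subsystem are our assumptions, not the design of the experimental system:

```python
# Hypothetical sketch: thresholds and the description string are illustrative only.
HIGH_CONFIDENCE = 0.9
LOW_CONFIDENCE = 0.5

def next_prompt(confidence: float, description: str) -> str:
    """Choose the system's next turn after the SLU subsystem returns an interpretation."""
    if confidence >= HIGH_CONFIDENCE:
        # Very confident: state the interpretation and let silence count as agreement
        # (the "implicit" confirmation discussed later in the paper).
        return f"OK, you need {description}."
    if confidence >= LOW_CONFIDENCE:
        # Moderately confident: ask an explicit yes/no confirmation question.
        return f"Do you need {description}?"
    # Low confidence: reprompt rather than confirm a guess that is probably wrong.
    return "I'm sorry. How may I help you?"

print(next_prompt(0.7, "area code information"))  # -> "Do you need area code information?"
```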
One design difficulty is that a human is not likely to say the phrase “Do you need area code information?” in such a context (Clark, 1987). The issue for confirmations is to ask the question in a way that more closely resembles the way a human might ask for confirmation, in order to elicit the appropriate user response. Disambiguating an Utterance. In some cases, the automated system is going to come up with more than one interpretation of the user’s speech. There are several ways that the system could ask for clarifying information from the user. One straightforward way to do this would be to simply ask the user, “Do you want A or B?”. Another way might be to ask a yes/no question “Do you want A?” [Ballentine, 198X] so that if the answer is no then it may be assumed that B is the correct choice. Which strategy is most appropriate will depend on the relative confidences of likely
interpretations and on which strategy is most effective at completing the dialog quickly and naturally.
Reprompts. A low confidence estimate from the SLU subsystem indicates that it did not understand the user's utterance. In these cases, rather than asking for confirmation of information that is almost certainly wrong, it is better simply to ask the user to repeat his or her request. We refer to this step as a reprompt. In human-computer dialogs this step often takes the form "Sorry, please repeat": the system admits culpability, then as quickly as possible asks for the information to be repeated. This phrasing does not, however, provide any information about what went wrong. In contrast, human listeners have a wider repertoire of responses available to communicate which elements of an utterance they did not understand: they may not have heard properly, or they heard the utterance but did not understand it, or they heard and partially understood it but need more information. All of these states can be quickly communicated between humans using pauses, prosody and the content of the response. A challenge for human-computer dialog is to intelligently mimic these devices in order to tell the user how the conversation has failed, so that the user can provide useful input.
When to Bail Out. Sometimes a human-computer dialog experiences repeated breakdowns, signaled by either low system confidence or repeated responses of "no" to confirmation prompts. In such cases, a human should be brought in to complete the transaction. The design question is how to decide when a situation calls for human intervention. The right criterion might be the number of recognition errors in a row, or the number of errors that have occurred overall in the dialog. The answer may well depend on characteristics of the user population and of the task being performed by the automated system. The goal should be to bail out before the user's frustration creates a negative perception of the system.
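As a minimal sketch of one possible bailout criterion (the counts and thresholds below are our assumptions; the paper deliberately leaves the criterion open), the decision might combine consecutive and cumulative failure counts:

```python
# Assumed thresholds for illustration; the right values would depend on the
# user population and the task, as discussed above.
MAX_CONSECUTIVE_FAILURES = 2
MAX_TOTAL_FAILURES = 3

def should_bail_out(consecutive_failures: int, total_failures: int) -> bool:
    """Decide whether to hand the caller over to a human operator."""
    return (consecutive_failures >= MAX_CONSECUTIVE_FAILURES
            or total_failures >= MAX_TOTAL_FAILURES)

assert should_bail_out(2, 2)        # two misunderstandings in a row
assert should_bail_out(1, 3)        # three failures spread across the dialog
assert not should_bail_out(1, 1)    # a single isolated failure
```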
Preliminary Experimental Results
We conducted a study to evaluate two components of the user interface for this automated system. In this section we describe results from a study evaluating how to confirm users' initial requests and how to reprompt. These tests were conducted using a "Wizard of Oz" methodology [Gould, 1983]. In such a study, the speech recognition and natural language understanding components of the system are simulated, although the user is unaware of this. The user calls in and is greeted by the automated system. The experimenter monitors the call, and it is the experimenter, not the system, who determines how the system should respond to the caller. The experimenter can "simulate" an error or a correct response by pushing the appropriate key on a computer that controls which system prompts are played back over the telephone to the caller. This kind of experiment can be very valuable for evaluating user
interface components, particularly error recovery strategies, since the experimenter can tightly control when and where "spoken language understanding errors" occur.
The initial Wizard of Oz prototype was tested by having participants complete a list of tasks by interacting with the device over the telephone. The experimenter, not the user, selected which tasks should be completed. Examples of tasks users were asked to complete included finding out the time of day in another country, getting an area code for a US city, and getting a billing credit for a call placed to a wrong number. Eighty-eight users completed 7 calls each to the automated system. Of particular interest to us were strategies for confirming user input and for reprompting. Overall, the number of interactions with the system in this study was not large, so although these results may suggest certain trends, more data are needed to support specific statistical claims.
Confirmation Strategy Results. The necessity of a confirmation step makes the dialog unlike human-human communication. For this reason we experimented with "natural" ways to confirm the machine's understanding of the user's request. Two methods were evaluated. The first was the explicit confirmation, "Do you need X?". An alternative method was suggested by research we had previously done on confirming strings of digits, such as phone numbers, for automated systems. Rather than explicitly asking the user a question, the system states its interpretation, to which the user can say yes, say no, or remain silent, with silence interpreted as agreement. We call this the implicit confirmation method.
System: How may I help you?
User: Yes, I just dialed a number out of the area and it happened to be a wrong number and I want to get credit for the call.
System: OK, you need me to give you credit for a wrong number.
User: [silence]
System: Did you bill that call to the phone you are using now?
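A minimal sketch of how such an implicit confirmation turn might be interpreted is given below; treating a timeout as agreement and the simple yes/no matching are our assumptions, not a description of the experimental system:

```python
from typing import Optional

def interpret_implicit_confirmation(user_reply: Optional[str]) -> str:
    """Return 'confirmed', 'rejected', or 'unclear' for an implicit confirmation turn.

    A reply of None stands for silence: the caller said nothing before a timeout,
    which the implicit strategy interprets as agreement.
    """
    if user_reply is None:
        return "confirmed"
    reply = user_reply.strip().lower()
    if reply.startswith("yes"):
        return "confirmed"
    if reply.startswith("no"):
        return "rejected"   # the system should re-elicit or correct its interpretation
    return "unclear"        # e.g. the caller restates the request in other words

print(interpret_implicit_confirmation(None))                         # confirmed (silence)
print(interpret_implicit_confirmation("No, it was a collect call"))  # rejected
```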
This strategy is somewhat more likely to occur in natural dialogs, at least when digit strings are confirmed (Clark, 1987). However, if the confirmation information is incorrect, some users may have a more difficult time figuring out how to correct the error with implicit confirmations than with explicit confirmations. Both strategies were very successful if the machine had correctly interpreted the user’s speech. However, the data indicated that users were slightly less successful at repairing errors in the dialog when the implicit confirmation was used as compared to the explicit confirmation (see Table 1).
User Behavior                                Implicit Confirmation   Explicit Confirmation
Correctly repaired error                     83% (59)                97% (62)
Answered "yes" when "no" was appropriate     6% (4)                  3% (2)
Silence                                      11% (8)                 0
Table 1: User behavior when confronted with errors via implicit and explicit confirmations. The actual number of observations is shown in parentheses.
In particular, for trials in which a recognition error was simulated, 11% of the users were silent when they should have said "no" and/or restated the correct information. It is possible that these users were unclear about how to interrupt the system to correct the error. Another possibility is that users were not sure whether their request would be considered part of the category being confirmed. For some of the categories, these distinctions might be particularly difficult. These data suggest that the explicit confirmation method, albeit less natural, is the more robust strategy, given that more of the errors were correctly repaired. However, this study did not tell us how the naturalness of the interaction affected how much users liked interacting with the system. It is possible that users could react more positively to a system that is more natural despite its being slightly more error-prone. This is a subject for future research.
Reprompt Results. In the current study, several different wordings of reprompts were evaluated. They fell into two categories: those consisting of an apology followed by a restatement of the original prompt (e.g. "I'm sorry. How may I help you?") and those including an explicit statement that the response was not understood (e.g. "I'm sorry, your response was not understood. Please tell me again how I can help you."). The two types of reprompts did not produce markedly different results, so in the following analysis we combined the data from the different reprompt conditions. One of the questions we were exploring was how users behave when reprompted for a piece of information, as in the following example:
System: How may I help you?
User: Yes, I was making a call to my sister and I must have dialed the wrong number. I was wondering if there is any way I can get credit for this call?
System: I'm sorry. How may I help you?
The user has several options at this point in the dialog. He or she could repeat the phrase word for word. Alternatively the user
might decide there is a more concise way to phrase the question and repeat only that portion, or the user might decide that the system did not understand because not enough information was provided, and so try again including even more details of the story. The strategy the user adopts depends on the user's mental model of the dialog failure that has occurred. Do users interpret a reprompt as an opportunity to elaborate and enhance the system's understanding, or do they simply repeat their initial utterance to give the speech recognizer a second try? The latter strategy degrades system performance, since people tend to speak more slowly and to over-articulate when repeating themselves. To evaluate this issue, we classified users' responses to reprompts.

Users' Responses to Reprompts                        Percent
Exact or almost exact repeat of initial utterance    37% (19)
Shorter utterance                                    31% (16)
Longer utterance                                     20% (10)
Rephrased, same length                               12% (6)

Table 2: Types of responses to reprompts. The actual number of observations is shown in parentheses.
Thus, slightly more than a third of the time, users interpreted the reprompt as a request to repeat the same utterance (see Table 2). It seems that in these instances the users took the reprompt to mean that the system did not "hear" their words properly and simply needed the phrase repeated. This strategy may not be particularly successful in terms of system performance: if the speech recognizer or the language understanding module failed the first time, a second try on the exact same utterance may not be any more successful.
Of those who gave a shorter response, 80% conveyed the same amount of information relevant to the task. For example:
System: How may I help you?
User: I would like to make a call to London and I need to know what time it is there so I won't wake my brother up.
System: I'm sorry. How may I help you?
User: Could you tell me what time it is right now in London, England?
In these cases, users were able to rephrase their initial requests more succinctly. This may indicate that these users attributed the dialog failure to an understanding problem caused by their own word choice rather than to a failure in "hearing". In general, this strategy will be beneficial, since shorter utterances are recognized and understood more easily than long ones. In the other 20% of these cases, users shortened their utterances by providing less information relevant to the task. This could reflect a belief on the user's part that the system is not very sophisticated, so that the best strategy is to provide one chunk of information at a time.
These preliminary data indicated that users as a whole did not adopt a single strategy, and that they did not all have the same mental model of the system's capabilities. Some of the strategies adopted by the users are likely to produce a more successful dialog than others. The goal of the dialog system is to guide users toward the successful strategies. Thus, a subject for future research is to explore reprompting strategies that are more explicit about the reason for the communication failure.
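Purely for illustration, the sketch below shows one way a response to a reprompt might be assigned to the categories of Table 2; the word-overlap and length heuristics are our assumptions, not the procedure used to code the data in this study:

```python
def categorize_reprompt_response(initial: str, retry: str) -> str:
    """Assign a reply to a reprompt to one of the four categories of Table 2."""
    first, second = initial.lower().split(), retry.lower().split()
    overlap = len(set(first) & set(second)) / max(len(set(first)), 1)
    if overlap > 0.9 and abs(len(second) - len(first)) <= 2:
        return "exact or almost exact repeat"
    if len(second) < len(first):
        return "shorter utterance"
    if len(second) > len(first):
        return "longer utterance"
    return "rephrased, same length"

print(categorize_reprompt_response(
    "I would like to make a call to London and I need to know what time it is there",
    "Could you tell me what time it is right now in London England"))  # -> shorter utterance
```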
CONCLUSIONS
We have proposed that a natural user interface is one that mirrors as closely as possible the elements of human to human dialogs. However, we found that there are necessary elements of the dialog that are difficult to model after human to human communication. We have presented preliminary experimental data evaluating natural approaches to two of these elements, namely confirmation and reprompt strategies.
REFERENCES
Ballentine, B. (19XX). [Full reference to be supplied.]
Brems, D. J., Rabin, M. D. & Waggett, J. L. (1995). Using natural language conventions in the user interface design of automatic speech recognition systems. Human Factors, 37(2), 265-282.
Clark (1987). [Full reference to be supplied.]
Day, M. C. & Boyce, S. J. (1993). Human factors in human-computer system design. In Advances in Computers (M. Yovits, ed.), pp. 333-430. Academic Press, San Diego.
Gorin, A. L. (1995). On automated language acquisition. Journal of the Acoustical Society of America, 97(6), 3441-3461.
Gorin, A. L., Henck, H., Rose, R. & Miller, L. (1994). Spoken language acquisition for automated call routing. Proc. ICSLP, 1483-1485, Yokohama (Sept. 1994).
Gorin, A. L., Parker, B. A., Sachs, R. M. & Wilpon, J. G. (1996). "How may I help you?". Proc. IVTTA (Oct. 1996), to appear.
Gould, J. D., Conti, J. & Hovanyecz, T. (1983). Composing letters with a simulated listening typewriter. Communications of the ACM, 26, 295-308.
Karis, D. & Dobroth, K. M. (1991). Automating services with speech recognition over the public switched telephone network: Human factors considerations. IEEE Journal on Selected Areas in Communications, 9, 574-585.