Automated Appointment Scheduling

Mark Fanty, Stephen Sutton, David G. Novick, Ronald Cole
Center for Spoken Language Understanding
Oregon Graduate Institute of Science and Technology
P.O. Box 91000, Portland, Oregon 97291
[email protected]
ABSTRACT

We describe a spoken language system that schedules appointments over the telephone. The system has a calendar of available times for some service that callers want to obtain. The system and the caller engage in a cooperative dialogue until a mutually satisfactory appointment time can be found and scheduled. The system has three parts: a speech recognizer based on neural-network phoneme classification and word bigrams, a robust phrase-spotting parser, and a dialogue module. The dialogue module has a calendar of available system and user times, a partial history of system goals, and a preference stack to keep track of the focus of the conversation. The dialogue module uses rules to interpret the user's input, make an appropriate response, and provide a prediction (grammar) of the user's next response for the speech recognizer. Rules also determine when the system grabs the initiative from the user.
1. INTRODUCTION

Our goal is to develop a cooperative spoken language system (cf. [2]) for appointment scheduling that preserves the natural flow of the dialogue and attempts to provide the caller with as much freedom to lead the dialogue as possible. In particular, we aim to support mixed-initiative interaction of a relatively unstructured nature. The recognizer will make errors, so we will focus attention on developing suitable repair strategies. Furthermore, we hope that what we learn will generalize to other tasks. There are three major components of the system: a speech recognizer that maps the caller's speech into a sequence of words; a parser that provides some semantic structure for the words; and a dialogue module that interprets the caller's utterances and generates system utterances. We will present an outline of each component, with particular emphasis on the dialogue module.
2. DESIGN FOR ROBUSTNESS
Our efforts have focused on designing a dialogue for cooperative callers. However, even with cooperative callers, unexpected dialogue situations will inevitably arise. For instance, errors made by the speech recognizer may lead to misunderstanding and result in communication breakdown. The ability to cope with such unexpected situations is essential to achieving a robust system. The key to improving robustness in the current system rests in four areas: prevention of breakdowns, detection of breakdowns, recovery from breakdowns and, finally, bailing out.

Prevention: Steps must be taken to prevent breakdowns in communication from occurring. Ways in which this can be done include predicting the caller's response, careful wording of questions, such as making system expectations explicit to the caller or implicitly clarifying information already recognized (e.g., "What time on Monday?"), and explicitly confirming important information (e.g., "Your appointment will be scheduled for 3 p.m. on Friday. Is this correct?").

Detection: The system must be capable of detecting difficulties and, ideally, diagnosing them as they arise. In general, the sooner this can be done the better, in order to minimize the effects of the breakdown and to prevent the situation from degrading further. Difficulties can be detected at all levels: if the caller does not respond to a question within a preset time, the question can be repeated. If the speech recognition has a low confidence score, the dialogue module is notified and pursues confirmation. Similarly, if the phrase-spotting parser cannot assign a large number of the words, the system should proceed with less confidence (e.g., verify more). If a response is unlikely or would "undo" a lot of recent work, the system should confirm it. Not all of these strategies have been implemented as of this writing.

Recovery: The system should incorporate various repair strategies. Possibilities include repeating the question if confidence is low, or clarifying what was recognized if confidence is medium (e.g., "Did you say Monday?"). Also, the system should provide progressive assistance in the event of repeated breakdown. This may involve announcing that there is a problem (e.g., "I'm having difficulty understanding you"), asking more directed questions, and perhaps giving instructions (e.g., "Don't over-articulate").

Bail-out: The system should not attempt to recover indefinitely in the case of repeated breakdown. The system should degrade gracefully and at some stage bail out. This may mean, for example, closing the dialogue or providing human assistance.
3. SPEECH RECOGNITION
Speech recognition is based on neural-network phonetic classification. We first reduce the English phoneme inventory to 39 relatively distinct classes. Each phoneme is modeled in three parts: left, middle, and right. Each left and right part is further broken into six classes depending on the broad phonetic identity of the surrounding phoneme. So, for example, the /eh/ in "ten" is modeled as a series of three states: (1) the left part of /eh/ following a stop; (2) the middle part of /eh/; and (3) the right part of /eh/ before a nasal. The word sequence that best matches the phonetic scores is found using a fast two-pass search that uses bigram probabilities.
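The context-dependent state naming described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the broad-class table and state-name format are assumptions made for the example.

```python
# Illustrative broad phonetic classes for a few phonemes (assumed mapping).
BROAD_CLASS = {
    "t": "stop", "d": "stop", "k": "stop",
    "n": "nasal", "m": "nasal",
    "eh": "vowel", "iy": "vowel",
    "s": "fricative", "f": "fricative",
}

def phoneme_states(prev, phone, nxt):
    """Return the three modeling states for `phone` in the context prev _ nxt.

    The left and right parts are conditioned on the broad phonetic class of
    the neighboring phoneme; the middle part is context-independent.
    """
    return [
        f"{phone}-left<{BROAD_CLASS[prev]}",   # left part, after left neighbor's class
        f"{phone}-mid",                        # context-independent middle part
        f"{phone}-right>{BROAD_CLASS[nxt]}",   # right part, before right neighbor's class
    ]

# The /eh/ in "ten": its left part follows a stop, its right part precedes a nasal.
print(phoneme_states("t", "eh", "n"))
# -> ['eh-left<stop', 'eh-mid', 'eh-right>nasal']
```

With 39 phoneme classes and six broad neighbor classes per side, this scheme keeps the state inventory compact while still capturing coarticulation at phoneme boundaries.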
4. SEMANTIC PARSER
A robust semantic parser developed at CMU [1] is used to map the word string into semantic frames. We developed a grammar for this parser that covers expressions in the scheduling domain. For example, the utterance (containing a restart) "the a how about Monday at three" would map to:

[give_info]
  ( [suggest_appt]
    ( HOW ABOUT
      [appt] ( [spec_day] ( [day_of_week] ( MONDAY ))
               [spec_time] ( AT [hours] ( THREE )))))
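The key property of such phrase-spotting parsing is that known phrases are matched wherever they occur, while unmatched material (here the restart "the a") is simply skipped. The toy sketch below illustrates this behavior; it is not the Phoenix parser, and the frame names simply mirror the example above while the patterns are our own simplification.

```python
import re

# Toy phrase patterns for the scheduling domain (illustrative, not Phoenix).
DAY_PATTERN = r"\b(monday|tuesday|wednesday|thursday|friday)\b"
HOUR_PATTERN = r"\b(one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve)\b"

def parse(utterance):
    """Spot day and time phrases anywhere in the word string; skip the rest."""
    words = utterance.lower()
    appt = {}
    day = re.search(DAY_PATTERN, words)
    if day:
        appt["spec_day"] = {"day_of_week": day.group(1).upper()}
    hour = re.search(HOUR_PATTERN, words)
    if hour:
        appt["spec_time"] = {"hours": hour.group(1).upper()}
    return {"give_info": {"suggest_appt": {"appt": appt}}}

# The restart "the a" is ignored; only the scheduling phrases are captured.
frame = parse("the a how about Monday at three")
print(frame)
```

As noted in the conclusion, robustness of this kind depends on the pattern set being relatively complete: a word the patterns cannot account for (such as "not") is silently dropped, which can invert the meaning of the utterance.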
5. DIALOGUE MODULE
The semantic frames are passed to the dialogue module, which (1) interprets the caller's utterances; (2) determines the appropriate next system action; and (3) provides the recognizer with expectations of what the caller will say next. The dialogue module receives as input a semantically parsed caller utterance and produces as output a system prompt and the
grammar name (characterizing the current dialogue state). The dialogue module has two phases of processing: interpretation and action. The interpretation phase involves normalizing the form of the caller's input, extracting salient information from it, and updating the appropriate data structures. The internal representation of a time range (central to this task) is a four-tuple of the form range(from day, from time, to day, to time). For example, if today is Monday, then the inputs "tomorrow afternoon" and "Tuesday afternoon" will map to the same representation: range(Tuesday, 12, Tuesday, 5). Performing such normalization reduces the number of rules required for subsequent processing during the action phase. The action phase involves a series of conditional operators. First, the system goal is decided upon; then the corresponding system prompt and grammar are chosen. The preconditions of the operators typically include the nature of the caller's input, the currently active system goals, any previously stated caller preferences, and the calendar availability.
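The normalization step above can be sketched as follows. The range() four-tuple, the day names, and the afternoon hours (12 to 5) follow the text; the handling of "tomorrow" relative to "today" is our illustrative choice, not a documented detail of the system.

```python
from collections import namedtuple

# The four-tuple time-range representation described in the text.
Range = namedtuple("Range", ["from_day", "from_time", "to_day", "to_time"])

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]

def normalize(expr, today):
    """Map a day/part-of-day expression to a canonical Range.

    Only "<day|tomorrow> afternoon" is handled in this sketch.
    """
    day_word, part = expr.split()
    if day_word == "tomorrow":
        day = DAYS[(DAYS.index(today) + 1) % 7]   # resolve relative day
    else:
        day = day_word
    if part == "afternoon":
        return Range(day, 12, day, 5)             # noon to 5 p.m., as in the text
    raise ValueError(f"unhandled part of day: {part}")

# If today is Monday, both phrasings map to the same representation.
print(normalize("tomorrow afternoon", "Monday"))
print(normalize("Tuesday afternoon", "Monday"))
```

Because both surface forms collapse to one canonical Range, the action-phase rules need only match on the normalized representation.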
5.1 Data Structures
We will now review the data structures used by the dialogue module. There are three main structures in use: the calendar, the context, and the preference stack.

Calendar: records system availability for each hour of the day. It is also used to record time periods when the caller is unavailable.

Context: records a partial history of system goals. This keeps track of what the system is currently attempting to achieve. It also detects repetitions, so the same question won't be asked repeatedly in the same way.

Preference stack: records day and time preferences expressed by the caller. The preference stack provides immediate context representing the current focus. It is a basis for interpreting the caller's utterances and for keeping track of shifts in focus.

Table I illustrates the role of the preference stack.(1) The stack is updated at each turn of the exchange. Initially the stack is empty. When the caller specifies a day ("Monday"), this preference gets added onto the stack. Next, the system asks for a time and the caller responds with "afternoon." The system then attempts to create a new preference by combining this new information with the entry on the top of the stack. In this case, "afternoon" is successfully combined with "Monday" to produce "Monday afternoon." This new preference is then added to the top of the stack. The net effect is that the response "afternoon" has been interpreted in context to mean "Monday afternoon." The system then proposes the specific time "1 p.m." and adds the new preference "Monday 1pm" to the stack. This is turned down by the caller, so the preference is immediately removed from the stack; the time slot is also recorded in the calendar as unavailable for the caller. Now "Monday afternoon" is back in focus (at the top of the stack), but because there are no more times available in this range it too is removed, leaving only "Monday" on the stack. Finally, the system checks whether any other times on "Monday" are available. It determines that some are available on "Monday morning," so it informs the caller and adds that preference to the stack.

TABLE I. Role of the preference stack.

  Utterance                                                   Preference Stack
  S: On what day would you like to schedule an appointment?   []
  C: Monday.                                                  [Mon]
  S: What time on Monday?                                     [Mon]
  C: Afternoon.                                               [Mon aft, Mon]
  S: The only time available Monday afternoon is 1pm.         [Mon 1pm, Mon aft, Mon]
  C: That's no good.                                          [Mon aft, Mon]
  S: There are times available in the morning.                [Mon morn, Mon]

(1) In the examples, "S:" denotes a system utterance and "C:" denotes a caller utterance.

Briefly, the rule for combining preferences states that the new information must constitute a refinement of the existing preferences. For example, "1pm" is considered a refinement of "afternoon," which in turn is a refinement of "anytime." If this condition is not met, the new information contradicts existing preferences and is treated as a shift in context. This involves replacing the top preference with the new one. The replacement is performed iteratively for each preference on the stack until either the new preference satisfies the refinement condition or the stack is empty.

Table II shows how a shift in focus is managed by the preference stack. The start of the dialogue is similar to the previous example. However, there are several times available on "Monday afternoon," so the system proposes "3 p.m." Next, rather than responding to the proposal directly, the caller changes his mind and shifts the focus to "Tuesday." This new information violates the refinement condition: first, "Tuesday" conflicts with the preference "Monday 3pm," so it replaces it. Next, "Tuesday" conflicts with "Monday afternoon," so it replaces it. Finally, "Tuesday" conflicts with "Monday," so it replaces it. This makes "Tuesday" the new focus and the only entry on the stack.

TABLE II. Preference stack during a shift of focus.

  Utterance                                                   Preference Stack
  S: On what day would you like to schedule an appointment?   []
  C: Monday.                                                  [Mon]
  S: What time on Monday?                                     [Mon]
  C: Afternoon.                                               [Mon aft, Mon]
  S: How about 3pm?                                           [Mon 3pm, Mon aft, Mon]
  C: Oh no, make that Tuesday.                                [Tue]
  S: What time on Tuesday?                                    [Tue]
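The refinement-or-replace rule can be sketched compactly. This is a simplified reconstruction: it assumes preferences are already combined into (day, granularity) pairs, and the refinement relation is reduced to "more specific, same day," which approximates but does not reproduce the system's full rules.

```python
# Illustrative refinement hierarchy: an hour refines a part of day, which
# refines a whole day, which refines "anytime".
LEVEL = {"anytime": 0, "day": 1, "part": 2, "hour": 3}

def refines(new, old):
    """A new preference refines an old one iff it is more specific
    and concerns the same day (assumed simplification)."""
    new_day, new_level = new
    old_day, old_level = old
    return LEVEL[new_level] > LEVEL[old_level] and new_day == old_day

def update(stack, new):
    """Push a refinement; otherwise pop conflicting entries (shift of focus),
    iterating until the refinement condition holds or the stack is empty."""
    while stack and not refines(new, stack[-1]):
        stack.pop()          # new information contradicts this preference
    stack.append(new)
    return stack

# A refinement is pushed: "Monday afternoon" on top of "Monday".
print(update([("Mon", "day")], ("Mon", "part")))

# Table II's shift of focus: "Tuesday" sweeps away all Monday preferences.
stack = [("Mon", "day"), ("Mon", "part"), ("Mon", "hour")]
print(update(stack, ("Tue", "day")))
```

The second call mirrors Table II: "Tuesday" fails the refinement test against "Monday 3pm," "Monday afternoon," and "Monday" in turn, leaving it as the sole entry and the new focus.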
5.2 Repair
The dialogue model supports some caller-initiated and system-initiated repairs. Caller-initiated repairs include requests for repetition, such as "Pardon?", system corrections, such as "No, I said Monday," and self-corrections. A self-correction may take place within a single turn ("Monday no Tuesday") or across turns ("C: Monday; S: What time?; C: Oh wait, I meant Tuesday"). System-initiated repairs include situations where the system is not confident of what the caller said (as indicated by a low-confidence recognition score), or where the caller said nothing at all (as indicated by a time-out). We are currently implementing a progressive back-off repair strategy that resorts to more explicit forms of repair on repeated failure, ending in bail-out to a human operator. Such a scheme will dynamically adjust (i.e., degrade and recover) the repair mode depending on the success of the interaction.
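The progressive back-off scheme can be sketched as a small escalation ladder. The specific prompts and the number of levels are assumptions for illustration (the strategy was still being implemented at the time of writing); the escalate-on-failure, relax-on-success behavior is what the text describes.

```python
# Repair prompts, ordered from least to most explicit, ending in bail-out.
# The wording of each level is illustrative only.
REPAIR_LEVELS = [
    "Repeat the question.",
    "Did you say {hypothesis}?",
    "I'm having difficulty understanding you. Please answer with a day of the week.",
    "Transferring you to a human operator.",   # bail-out
]

class RepairManager:
    """Escalates the repair mode on failure; relaxes it on success."""

    def __init__(self):
        self.level = -1   # -1 means no repair is active

    def on_failure(self):
        """Each consecutive failure moves to a more explicit repair,
        capped at the bail-out level."""
        self.level = min(self.level + 1, len(REPAIR_LEVELS) - 1)
        return REPAIR_LEVELS[self.level]

    def on_success(self):
        """A successful exchange steps the repair mode back down."""
        self.level = max(self.level - 1, -1)

mgr = RepairManager()
print(mgr.on_failure())   # first failure: mildest repair
mgr.on_failure()
mgr.on_failure()
print(mgr.on_failure())   # repeated failure: bail out to a human
```

Because successes decrement the level rather than resetting it, the mode degrades and recovers gradually with the fortunes of the interaction.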
5.3 Initiative
The aim is to give the caller as much freedom as possible to lead the conversation and control the initiative. However, there are certain circumstances in which it is more appropriate for the system to assume control, for instance when the calendar is relatively busy for the time period currently in focus. Example 1a illustrates a typical dialogue in which the caller is allowed to keep the initiative and propose a time. In such an exchange, the system can be regarded as particularly unhelpful and clearly runs the risk of frustrating the caller.
Example 1a
S: What time on Monday?
C: 9.
S: Sorry, 9 is not available. What other time would work?
C: 10.
S: Sorry, 10 is not available. What other time would work?
C: 11.

In contrast, example 1b illustrates the preferred system action. Here the system takes the initiative and suggests a time as soon as it determines that there are not many time slots available on Monday.
Example 1b
S: When would you like to schedule an appointment?
C: Monday.
S: The only time available on Monday is 3 p.m. Will this work?

Another situation where the system might take control is when the caller provides a non-specific day or time. In example 2a, the caller gives the non-specific time "afternoon" and the system follows up by asking "What time Monday afternoon?" This situation is more subtle than the previous example. The problem here is that one possible interpretation of "afternoon" is "any time Monday afternoon," which renders the follow-up question "What time Monday afternoon?" somewhat awkward.
Example 2a

S: What time on Monday?
C: Afternoon.
S: What time Monday afternoon?

Example 2b shows the preferred system action of assuming the initiative and suggesting a time that satisfies the caller's preference.
Example 2b
S: What time on Monday?
C: Afternoon.
S: How about 3 p.m.?
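The initiative decisions illustrated in examples 1b and 2b can be summarized in a small rule. The threshold, prompts, and function are hypothetical; the underlying policy (system leads when the calendar is busy or the caller's time is non-specific) is the one described in this section.

```python
# Illustrative threshold: with this many free slots or fewer, the system leads.
BUSY_THRESHOLD = 2

def next_prompt(free_slots, caller_time_specific):
    """Choose between leaving the initiative with the caller and proposing a slot.

    free_slots: available times in the range currently in focus.
    caller_time_specific: whether the caller named a specific time.
    """
    if not free_slots:
        return "There are no times available then."
    if len(free_slots) <= BUSY_THRESHOLD or not caller_time_specific:
        # Take the initiative and propose a concrete time (cf. Examples 1b, 2b).
        return f"How about {free_slots[0]}?"
    # Otherwise leave the initiative with the caller.
    return "What time would work for you?"

print(next_prompt(["3 p.m."], caller_time_specific=True))      # busy calendar: system leads
print(next_prompt(["9", "10", "11", "1 p.m."], True))          # open calendar: caller leads
print(next_prompt(["3 p.m.", "4 p.m.", "5 p.m."], False))      # non-specific time: system leads
```

This avoids both the unhelpful rejection loop of example 1a and the awkward follow-up question of example 2a.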
6. CONCLUSION
We found that, in order to engage in a cooperative, natural, mixed-initiative scheduling dialogue, the system needed three data structures: a calendar of user and system availabilities, a preference stack to keep track of the current focus of the conversation, and a history of system goals. The first two were especially important. The logic controlling the system's actions consisted, to a large extent, of domain-specific rules refined through experience. An obvious goal is to specify as much of that logic as possible in a domain-independent fashion. We view this as part of a larger repertoire of dialogue elements. The robust, phrase-spotting parser works well for this task, although care must be taken to specify a relatively complete set of patterns. If the word "not" goes unrecognized in "I will not be able to make it on Thursday," for example, the system might respond "What time on Thursday?" Actual use of the system indicates that designing for robustness in the face of recognizer errors is a major consideration and should be a major focus.
7. Acknowledgments
This research was funded by U S WEST Advanced Technologies.
8. REFERENCES
[1] W. Ward, "Understanding Spontaneous Speech: the Phoenix System," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 1991, pp. 365-367.
[2] R. A. Cole, D. G. Novick, M. Fanty, P. Vermeulen, S. Sutton, D. Burnett, and J. Schalkwyk, "A Prototype Voice-Response Questionnaire for the U.S. Census," Proceedings of the 1994 International Conference on Spoken Language Processing, Yokohama, Japan, Sept. 1994, pp. 683-686.