RAPID PROTOTYPING OF SPOKEN LANGUAGE SYSTEMS: THE YEAR 2000 CENSUS PROJECT

Ronald A. Cole, David G. Novick, Mark Fanty, Stephen Sutton, Brian Hansen and Daniel C. Burnett

Center for Spoken Language Understanding, Oregon Graduate Institute of Science & Technology
20000 N.W. Walker Road, P.O. Box 91000, Portland, OR 97291-1000 USA

ABSTRACT

In this paper, we describe a rapid-prototyping approach to developing spoken language systems (SLSs). This new design methodology facilitates more usable and quickly implemented SLSs. Moreover, it allows SLS developers to address the delicate balance between designing dialogues that are sufficiently constraining to meet current speech recognition capabilities, yet feel natural and intuitive to the user. This rapid-prototyping methodology was developed in the service of a project to determine the feasibility of using SLSs for the Year 2000 Census in the United States.

1. INTRODUCTION

A spoken language system (SLS) engages the user in a dialogue to achieve some goal. The system must recognize words, interpret their meaning and respond appropriately to accomplish the goals of the task [1]. Current speech recognition technology represents a considerable bottleneck for many practical SLS applications, especially for speaker-independent systems handling spontaneous telephone speech. Typically, in order to develop an effective speech recognizer, it is necessary to impose constraints on the range and freedom of user responses. From the user’s standpoint, these constraints tend to reduce the naturalness of interactions. Thus, a major challenge for researchers is to design dialogues that constrain the kinds of responses that the user may give, while maintaining a relatively natural interaction.

For example, eliciting a caller’s date of birth by asking “When were you born?” is too open-ended. The variability in the responses is likely to make recognition difficult.
At the other extreme, an overconstrained series of questions like “Were you born in January, please say yes or no” may lead to easily recognized responses but is likely to be intolerable for the caller. A better phrasing might be “Please state the month, day and year of your birth.”
We have been developing a new methodology for the rapid prototyping of SLSs that combines dialogue design and data collection for recognizer training. Our approach has been developed in the service of building working spoken language systems for specific tasks. This paper describes a project that served as a test-bed for the methodology and then presents the methodology itself.

2. YEAR 2000 CENSUS PROJECT

Our approach to system building has been developed in the context of a project to determine the feasibility of using an automated spoken language system to facilitate data collection and capture for the Year 2000 Census in the United States of America. This involves developing a prototype SLS to acquire personal information from telephone callers in both English and Spanish. Protocol and speech corpus development is being performed at the Center for Spoken Language Understanding at the Oregon Graduate Institute of Science & Technology (OGI). Spoken language systems will be developed by OGI for both English and Spanish, by Wayne Ward at Carnegie-Mellon University for English, and by the Spoken Language Systems Laboratory at MIT for Spanish.

The goal of the study is to design and test the feasibility of spoken language systems that will interact with a caller to elicit specific information about the caller. The information we desire to collect from each caller consists of:

• Preferred language: English or Spanish;
• First name, last name and middle initial;
• Sex: male or female;
• Birth Date: Month, Day, Year;
• Marital status: now married, widowed, divorced, separated, never married (choose one only);
• Hispanic origin: yes/no.
  - If “yes”: Mexican, Mexican-American, Chicano, Puerto Rican, Cuban or other (choose one only);
  - If “other,” what origin;
• Race: White, Black or Negro, American Indian, Eskimo, Aleut, Chinese, Japanese, Filipino, Asian Indian, Hawaiian, Samoan, Korean, Guamanian, Vietnamese, or other (choose one only).
  - If “American Indian,” then which tribe;
  - If “other,” then caller states race and whether it is an Asian or Pacific Islander race;
• Telephone number (including area code).

To demonstrate feasibility, we must solve a number of problems, including:

• Determining how information sought by the written questionnaire should be obtained over the telephone;
• Developing a methodology for rapid prototyping of dialogues so that we can arrive at dialogues that are natural yet constrain word choice;
• Dealing with the many sources of speaker variability in a large population, such as regional dialects and foreign accents;
• Recognizing words in the midst of filled pauses, restarts, corrections, and other spontaneous speech phenomena; and
• Assigning a level of certainty to the caller’s response, then deciding how to respond (for instance, whether to proceed to the next question or to initiate some kind of repair, such as repeating the question or seeking confirmation of a particular response).

3. METHODOLOGY

As a general development method, rapid prototyping seeks to minimize the time between design and evaluation, thereby speeding up the process of constructing working systems. This allows for multiple iterations through the design and test stages, leading to better attention to the needs of the user and to systems that are more functional and usable. Our approach is an extension of rapid prototyping, which has played a significant role in human-computer interaction research, particularly with regard to usability (e.g., [2]) and user-centered design (e.g., [3]). In prior work, we developed a usability-based methodology for developing graphic user interfaces using prototype tools [4].
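The census items listed in Section 2, with their conditional follow-up questions, lend themselves to a small tree-shaped data structure. The sketch below is only an illustration of that idea, not the project's actual representation: the `Item` class, the `items_to_ask` helper and the item names such as `hispanic_group` are hypothetical, and only the Hispanic-origin branch is spelled out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Item:
    """One questionnaire item; `choices=None` means a free-form answer."""
    name: str
    choices: Optional[List[str]] = None
    # Maps a triggering answer to the follow-up item that it unlocks.
    follow_ups: Dict[str, "Item"] = field(default_factory=dict)


# Hypothetical encoding of the Hispanic-origin item and its follow-ups.
HISPANIC_ORIGIN = Item(
    "hispanic_origin",
    ["yes", "no"],
    {
        "yes": Item(
            "hispanic_group",
            ["Mexican", "Mexican-American", "Chicano",
             "Puerto Rican", "Cuban", "other"],
            {"other": Item("hispanic_other_origin")},
        )
    },
)


def items_to_ask(item: Item, answers: Dict[str, str]) -> List[str]:
    """Return the item names that must be asked, given the answers so far."""
    asked = [item.name]
    follow_up = item.follow_ups.get(answers.get(item.name, ""))
    if follow_up is not None:
        asked.extend(items_to_ask(follow_up, answers))
    return asked
```

Under this sketch, a caller who answers “no” to the first question is asked nothing further, while “yes” followed by “other” unlocks both follow-up items.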
We have adapted this rapid-prototyping methodology as a basis for developing SLS dialogues. This development process iterates over two phases. In the first phase, the structure and surface form of a dialogue are designed and refined. In the second phase, this dialogue is tested during the collection of a speech corpus for the recognizer. This approach allows more emphasis to be placed on dialogue design, particularly with regard to exploring alternative dialogues and studying their effects on user responses. As previously suggested, developing dialogues that meet current speech recognition capabilities is central to the success of SLSs.

As a part of the iterative process of protocol evaluation, we asked callers to provide their assessment of various qualities of the interaction, including naturalness, usability, and any good points or problems with the protocol as a whole or with specific questions. We used the results of these inquiries to help guide refinement of the protocols.

To begin our approach, we developed a set of design heuristics that specify ways of phrasing or structuring individual system queries. These heuristics make predictions as to the nature of user responses and provide a principled way of choosing among the large number of possible prompts. Associated with the heuristics is a set of features describing the characteristics of proposed system queries, such as terseness, pre-explanation (i.e., providing the user with an explicit, expanded context) and options (i.e., specifying to the user a set of possible actions). These design heuristics and features promote a systematic approach to protocol generation, reduce the amount of human introspection and encourage stylistic consistency.
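One way to picture the feature descriptions above is as boolean tags attached to each candidate prompt, which makes a stylistic-consistency check mechanical. The following is a minimal sketch under that assumption; the `Prompt` class, the `inconsistent_features` helper and the feature values assigned to the two example prompts are illustrative choices, not the authors' coding.

```python
from dataclasses import dataclass
from typing import List, Set

FEATURES = ("terse", "pre_explanation", "options")


@dataclass(frozen=True)
class Prompt:
    """A candidate system query tagged with the three features named above."""
    item: str
    text: str
    terse: bool
    pre_explanation: bool
    options: bool


def inconsistent_features(protocol: List[Prompt]) -> Set[str]:
    """Return the features whose value differs across a protocol's prompts."""
    return {
        name for name in FEATURES
        if len({getattr(p, name) for p in protocol}) > 1
    }


# Feature values below are assumed for illustration only.
SEX_PROMPT = Prompt("sex", "What is your sex, female or male?",
                    terse=True, pre_explanation=False, options=True)
MARITAL_PROMPT = Prompt(
    "marital_status",
    "Which of the following options best describes your current marital "
    "status: now married, widowed, divorced, separated, or never married?",
    terse=False, pre_explanation=False, options=True)
```

Calling `inconsistent_features([SEX_PROMPT, MARITAL_PROMPT])` flags `terse` as the only feature on which these two prompts disagree, which is the kind of mismatch a designer might then resolve.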
The rapid-prototyping approach to SLS development applies these design heuristics to generate a set of initial protocols and consists of the following steps:
(1) For each desired response, design a set of reasonable prompts;
(2) Collect speech data from groups of callers using protocols containing the different prompts;
(3) Analyze the responses, eliminate “bad” prompts, and refine the most promising prompts; and
(4) Iterate steps 2 and 3 until satisfied.
The process is depicted schematically in Figure 1.

[Figure 1. Rapid-prototyping development stages. Boxes: Specify Design Heuristics, Protocol Generation, Data Collection, Speech Recognizer, Protocol Evaluation, Dialogue Model.]

4. PRELIMINARY RESULTS

Using our SLS rapid-prototyping methodology, we have completed several rounds of protocol development. This involved collecting, transcribing and analyzing responses from approximately 500 different telephone callers using six different protocols.

The initial round (“round-0”) protocols provided a starting point in determining the issues involved in transforming the written (or enumerator-assisted) questions into a form compatible with an SLS. From our analysis of the round-0 protocols we produced a methodology for characterizing and selecting from the many possible variations in question phrasing and structuring. Briefly, this methodology entailed examining each census question in its written form (taken from a census questionnaire), and producing a set of options as to how it could be transformed for use within an SLS. Associated with each option was a set of features (e.g., +/-Polite, +DecisionTree) used to describe the resulting question. We were then able to use these features to explore different variations and to ensure that each protocol had a degree of stylistic consistency.

This process produced three protocols used in our first round of formal testing (“round-1”), ranging from open-ended (“What is your birthdate?”) to structured (“What day, month and year were you born?”) to highly structured (“Please say the year in which you were born.” “Please say the month in which you were born.”). After collecting the round-1 data we performed a qualitative analysis to determine the variability of data acquired and the degree of coverage associated with each protocol and question. This analysis led us directly to revisions in the phrasing and structuring of protocol question styles, which were subsequently incorporated into round-2 of protocol development.
For example, round-1 testing of the marital-status questions indicated that: (1) the unconstrained case (“What is your marital status?”) produced responses that were too variable for effective recognition; (2) a “yes-no” case (an introduction plus a sequence of individual questions like “Have you ever been married?”) seemed long-winded: there was a lot of interaction for a small amount of information, and the fact that the questions were both predictable and non-natural invited callers to volunteer information that we would eventually want but were not yet prepared to understand; and (3) a constrained-choice case (“Which of the following options best describes your current marital status: now married, widowed, divorced, separated, or never married?”) tended to produce reasonable interaction; however, we refined the form of the question to avoid possible confusion by the recognizer in distinguishing “now married” and “never married.”

The data collected in round-2 was evaluated in terms of categories of callers’ responses to individual questions. Briefly, this analysis divided responses into the following categories: concise responses, usable (but not concise) responses, responsive (but not usable) responses, unresponsive answers, and no response. In this way we were able to perform quantitative analysis of the responses with respect to coverage and variability. Our subsequent analysis provided the basis for developing round-3 protocols.

We used this analysis to choose among alternative prompts. For example, we found that one of three prompts that asked the respondent to tell us their sex was clearly superior. The three candidate prompts were (1) “What is your sex?” (2) “What is your sex, female or male?” (3) “Are you female or male?” We collected about 90 calls for each protocol, for a total of about 270 calls. The percentages of concise responses to these prompts were approximately 95, 100, and 91 percent respectively.
Consequently, we concluded that protocol 2 is the most effective for this question. Similar analyses were performed for the other questions.

We are now starting round-3 data collection for English, our final iteration of the protocol development cycle. We expect to collect several thousand calls in order to provide sufficient speech data to develop and evaluate the system. For this round of data collection, we will use a digital T-1 line which can handle up to 24 simultaneous calls. This corpus, with suitable provisions for privacy, will eventually be made available to the public.

For analysis of the round-3 data for English, we have extended and enhanced the response coding scheme employed in round-2. We have developed a behavioral coding scheme, in conjunction with the Census Bureau, for analyzing the callers’ responses. This extended coding scheme is intended to characterize the adequacy of the response from the perspective of automatic speech recognition. In turn, it provides a basis for a quantitative evaluation of the protocols. There are ten levels of responsiveness, ranging from “answer is concise and responsive” to “respondent refuses to answer.”

At the same time, we are developing the Spanish-language system. An initial prototype was written in Spanish, based on the results of the English round-2 tests. We are currently collecting data on tests of this round-1 Spanish protocol with Spanish-language callers.
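The round-2 analysis described above (code each response into one of five categories, then prefer the prompt with the highest share of concise responses) can be sketched in a few lines. This is an illustrative reading of that procedure rather than the authors' tooling: the category labels, function names and the `TOY_CODES` data are made up for the example and are not the study's data.

```python
from collections import Counter
from typing import Dict, List

# The five round-2 response categories, with hypothetical short labels.
CATEGORIES = ("concise", "usable", "responsive", "unresponsive", "no_response")


def concise_rate(codes: List[str]) -> float:
    """Fraction of coded responses that fall in the 'concise' category."""
    counts = Counter(codes)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown response codes: {sorted(unknown)}")
    return counts["concise"] / len(codes) if codes else 0.0


def best_prompt(coded: Dict[str, List[str]]) -> str:
    """Pick the prompt whose responses were most often concise."""
    return max(coded, key=lambda prompt: concise_rate(coded[prompt]))


# Invented toy codes for the three candidate prompts (20 responses each).
TOY_CODES = {
    "What is your sex?": ["concise"] * 19 + ["usable"],
    "What is your sex, female or male?": ["concise"] * 20,
    "Are you female or male?": ["concise"] * 18 + ["usable", "responsive"],
}
```

With these toy codes, `best_prompt(TOY_CODES)` selects the second prompt. A fuller analysis would presumably also weigh the other categories, as the ten-level round-3 coding scheme suggests.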
5. SYSTEM OVERVIEW

We have constructed an initial prototype SLS from the round-2 prompts judged to be most natural and constraining, based on our analyses. The prototype employs a modular architecture and has been designed with extensibility in mind. Thus, the present system provides a basis for continued development, engineering and refinement. The next generation prototype will be developed from the round-3 protocols and training data. The initial system consists of five main components:

1. Speech recognizer. The speech recognition component is based on a neural-network phoneme classifier. It incorporates two separate recognizers: the OGI alphabet recognizer and the OGI word recognizer; we envision a separate recognizer for numbers in the final system. The alphabet recognizer is hand-tuned to recognize letters spoken with pauses and achieves 89% accuracy over the telephone [5]. The current word recognizer [6] is vocabulary-independent. It uses a neural-network-based phonetic front end trained on fluent telephone speech. The language model was built by hand and uses an all-word model to allow for some extraneous speech before and after the target response (e.g., “I’m” in “I’m male”). The final system will be optimized using task-dependent data collected in round-3. Research in noise robustness, word-spotting, rejection, barge-in (i.e., talking over prompts), and increased accuracy is in progress; features based on this research will be incorporated into the final system.

2. Semantic parser. This prototype’s semantic parser is based on the Phoenix system developed at Carnegie-Mellon University [7]. The parser produces candidate meanings for a given word sequence. It parses semantic fragments using a frame-based semantic grammar. Development of the semantic grammar was based on observation of the range of protocol responses in the corpus from rounds 1 and 2.

3. Dialogue module. The dialogue module contains the system’s goals and patterns of interaction. For each caller utterance, the dialogue module determines the appropriate system action, provides the speech synthesizer with the next prompt, and provides the recognizer and parser with expectations as to what the caller will say next.

4. Speech output. Currently, the prototype system produces speech output using DECtalk, a commercially available speech synthesizer. A synthesizer, as opposed to a recorded human voice, is especially convenient during the early development stages when changes in the protocol are frequent. In the near future, however, we intend to experiment with other forms of speech output such as digitized recordings of professional announcers.

5. Graphical interface (under development). We are developing a graphical interface that allows system developers to correct recognition errors and to evaluate the performance of the system.
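The dialogue module's role described above (choose an action for each caller utterance, supply the next prompt, and pass expectations to the recognizer and parser) can be pictured with a small state-stepping function. The sketch below is an assumption-laden illustration, not the prototype's code: the `Step` class, `next_action`, the two-step protocol, the closing prompt and the thresholds `ACCEPT` and `CONFIRM` are all hypothetical, and the repair choices follow the certainty-based options mentioned in Section 2.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    item: str
    prompt: str
    expected: List[str]  # expectations handed to the recognizer and parser


STEPS = [
    Step("sex", "What is your sex, female or male?", ["female", "male"]),
    Step("marital_status",
         "Which of the following options best describes your current marital "
         "status: now married, widowed, divorced, separated, or never married?",
         ["now married", "widowed", "divorced", "separated", "never married"]),
]

# Hypothetical certainty thresholds; real values would be tuned on data.
ACCEPT, CONFIRM = 0.8, 0.5


def next_action(index: int, value: Optional[str],
                confidence: float) -> Tuple[str, int, str]:
    """Return (action, next step index, next prompt) for one parsed response."""
    step = STEPS[index]
    if value in step.expected and confidence >= ACCEPT:
        following = index + 1
        if following == len(STEPS):
            return ("done", following, "Thank you.")
        return ("proceed", following, STEPS[following].prompt)
    if value in step.expected and confidence >= CONFIRM:
        # Repair by seeking confirmation of the recognized value.
        return ("confirm", index,
                f"Did you say {value}? Please say yes or no.")
    # Repair by repeating the question.
    return ("repeat", index, step.prompt)
```

The sketch deliberately leaves out how a “yes” or “no” to a confirmation prompt would be handled; it is meant only to show how one component can drive both the synthesizer and the recognizer's expectations.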
Within the next few months, round-3 of the protocol development cycle will be completed; the prototype SLSs will be further developed and evaluated at three sites. As part of the evaluation process, we will analyze the impact of the design heuristics and the success of the rapid-prototyping approach. We expect that the results will show that the rapid-prototyping methodology allows SLS developers the opportunity to explore and influence users’ responses in light of users’ natural expectations about interaction, while achieving early collection and use of speech data for recognition. We believe that this novel approach will assist in the development of more usable and rapidly deployed SLSs.

6. SUMMARY

A significant problem with current SLS development practices is that constructing a speech recognizer requires data and that collecting data is expensive. Furthermore, the nature of this data is dependent on the particular dialogue. The rapid-prototyping approach allows data to be collected while developing a suitable dialogue. At each iteration of the design-and-test loop, the protocols converge toward a final version, so the data collected becomes increasingly appropriate for training the speech recognizer. A major benefit of combining the dialogue design and data collection phases is that it leads to more usable and quickly implemented systems. The rapid-prototyping approach thereby permits SLS developers to address the balance between building systems that are (1) effective in terms of recognizing what users say and taking the appropriate actions; and (2) relatively natural in terms of user interaction.

7. ACKNOWLEDGMENTS

This research was supported by the United States Bureau of the Census, the Office of Naval Research, and the National Science Foundation.

8. REFERENCES

[1] R. A. Cole, L. Hirschman, et al., “Workshop on Spoken Language Understanding,” Technical Report CS/E 92-014, Department of Computer Science and Engineering, Oregon Graduate Institute of Science & Technology, 1992.
[2] C. Lewis, P. Polson, C. Wharton, and J. Rieman, “Testing a walkthrough methodology for theory-based design of walk-up-and-use interfaces,” Proceedings of CHI’90, pp. 235-242, 1990.
[3] J. Grudin, “interface,” Proceedings of CSCW’90, pp. 269-278, 1990.
[4] D. Novick and S. Douglas, “QUID: A quick user-interface design method using prototyping tools,” Proceedings of the Hawaii International Conference on System Sciences (HICSS’90), pp. 709-718, 1990.
[5] M. Fanty, R. A. Cole, and K. Roginski, “English Alphabet Recognition with Telephone Speech,” Advances in Neural Information Processing Systems 4, pp. 199-206, Morgan Kaufmann Publishers, 1992.
[6] M. Fanty, P. Schmid, and R. A. Cole, “City name recognition over the telephone,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, I, pp. 549-552, 1993.
[7] W. Ward, “The CMU air travel information service: Understanding spontaneous speech,” Proceedings of the DARPA Speech and Natural Language Workshop, pp. 127-129, 1990.