AN ARCHITECTURE FOR RAPID RETRIEVAL OF STRUCTURED INFORMATION USING SPEECH WITH APPLICATION TO SPOKEN ADDRESS RECOGNITION

Anand Venkataraman, Horacio Franco, Greg Myers

SRI International, Menlo Park, CA

ABSTRACT

We describe an iterative recognition strategy that can vastly improve the performance of a speech recognition system when the speech pertains to structured information that can be looked up in a database. The framework we present is designed to extract specific fields of interest from the speech signal during each iteration, query a database using these fields, and thereby construct the hypothesis space to be searched during the next iteration. The architecture has proved significantly useful in applications such as spoken address recognition, for which a proof of concept and a demonstration system have been developed. We also present results on a small test set that compare the performance of the described system with the more common baseline approach.

1. INTRODUCTION

Automatic speech recognition (ASR) systems have hitherto been used with acceptable error rates in real-time applications for restricted domains. It is, however, possible to do significantly better in terms of both speed and accuracy if it is known beforehand that the data to be recognized is structured. The basic idea is to use the structural information to restrict the search space of the recognizer. For example, in an ASR-based air travel reservation system, once the destination state is known, the range of cities that can form allowable destinations is considerably narrowed. Thus, a system can prompt the user to first say the state or province into which he or she is flying and then ask for a city within that region. Previous work that has exploited structural information for ASR has largely used such an interface, which we characterize as being of the prompt-and-response kind.

Although prompt-and-response systems simplify the speech recognition task, they can be both slow and annoying to use because they require the speaker to answer a series of questions. Thus, there remains a need for speech recognition systems that allow the user to speak structured information naturally, in a single utterance. We characterize such systems as being of the free-style kind.

In this paper, we present techniques and an architecture to recognize freely spoken structured information and illustrate them with an application to spoken address recognition (SAR). By exploiting the structure inherent in spoken addresses, using an elastic filler word that can consume an arbitrary number of speech frames, and carefully interleaving ASR iterations with lookups in a relational database of addresses from the United States Postal Service (USPS), we were able to develop a novel architecture that recognizes free-style spoken addresses with high accuracy. This system, for example, allows a user to say "I'd like to plan a route to 333 Ravenswood Avenue, Menlo Park, California", rather than having to navigate a complex series of steps in a menu, drip-feeding the same information a portion at a time to the recognition system.

Section 2 provides a high-level overview of our system architecture along with descriptions of some essential components. Section 3 describes the architecture in detail. Section 4 compares the performance of the system to a baseline that does not exploit structural information. Finally, Section 5 summarizes the main achievements of this work and describes plans to extend the system beyond the simple application described here.

2. OVERVIEW

We give a high-level description of the system as applied to the problem of free-style spoken address recognition. Three components are essential to its functioning: class-based language models, which we implemented using the SRILM toolkit [1]; an elastic filler word model; and a relational database lookup module. None of these components is particularly noteworthy by itself. We have previously used class-based language modeling to good effect in improving speech recognition accuracy on the SPINE task [2]. The elastic filler model has likewise found good application in a medium-vocabulary speech recognition task in the financial domain [3]. Relational databases, again, are common in a number of real-world applications, notably air-travel reservation systems. Their combination in this work, however, together with an iterative framework to refine ASR output, is both novel and useful.

The basic idea behind this system is to perform iterative piecewise recognition of the waveform, applying during each iteration one or more constraints inferred during the previous iteration. At each iteration, we try to recognize from the speech signal exactly those components of the structured information that promise the greatest reduction in search space for the next iteration. In the case of addresses, given the interface specific to the database we were dealing with, these were the numeric fields. The numbers thus recognized are then used to query the USPS database for address hypotheses that are consistent with them, and the subsequent iteration is constrained to match only one of these hypotheses. This considerably reduces the search space in that iteration. ASR systems for such restricted grammars are typically small and can be built rapidly on the fly.
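The recognize–query–constrain loop just described can be sketched as follows. This is a minimal illustration under our own assumptions: `recognize_with_grammar` and `query_address_db` are hypothetical stand-ins for the recognizer and database interfaces, not components of the actual system.

```python
# Minimal sketch of the iterative constrained-recognition loop; the recognizer
# and database interfaces are hypothetical, not SRI's actual APIs.

def first_pass_grammar():
    """Grammar extracting only the numeric fields; @reject@ absorbs the rest."""
    return "@reject@ STREET_NUMBER @reject@ ZIPCODE_NUMBER @reject@"

def recognize_address(waveform, recognize_with_grammar, query_address_db):
    # Pass 1: recognize only the numeric fields (street number, ZIP code).
    fields = recognize_with_grammar(waveform, first_pass_grammar())

    # Query the address database for hypotheses consistent with the numbers.
    candidates = query_address_db(fields.get("street"), fields.get("zip"))

    # Pass 2: a compact grammar that admits exactly the candidate addresses.
    constrained_grammar = " | ".join(candidates)
    return recognize_with_grammar(waveform, constrained_grammar)
```

Because the second-pass grammar admits only database-backed addresses, the final hypothesis is guaranteed to be a deliverable address.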

0-7803-7980-2/03/$17.00 © 2003 IEEE

ASRU 2003

3. SYSTEM ARCHITECTURE

The SAR system we implemented consists of two main components, although this is by no means the only possible way of architecting it. The system has a front end to capture the speech signal and transmit it (or its quantized feature vectors) to the back end, where the iterative recognition happens. Once the spoken address is determined, the back end sends a structured representation of the address back to the front end, which then either displays the address to the user or performs some desired action, such as determining a navigation strategy from the user's current geographical position (determined, say, using GPS). The primary advantage of this architecture is that the front end can perform some preprocessing on the speech signal and transmit only the essential information – for example, the cepstral features – to the recognition engine, thereby saving valuable bandwidth, especially if transmission takes place over an errorful wireless network.

At the back end, a copy of the information received from the front end is made immediately, primarily because we expect to make more than one recognition pass over the signal. During each pass, only selected fields are recognized, using a combination of class-based language modeling and the elastic filler model. Class-based language models are used to generate word graphs with class tags that expand into number grammars. All the grammars are encoded as word graphs that we call probabilistic finite state grammars (PFSGs), but other encodings and representations are obviously possible. Figure 1 depicts the word graph used for the recognition of numeric fields in the first pass. Nonnumeric components of the signal that we want blanked out are represented in the word graph by the elastic filler labeled @reject@. This filler word has the distinctive property that its pronunciation can be arbitrarily long, in contrast with all other words in the lexicon, whose pronunciations are of fixed length.
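A first-pass word graph of the kind Figure 1 depicts can be rendered as a toy data structure. The encoding below (nodes mapped to labeled out-arcs) is our own illustration, not SRI's actual PFSG format; class labels stand for number subgrammars, and @reject@ is the elastic filler.

```python
# Toy word graph in the spirit of Figure 1 (not SRI's actual PFSG format).
# Class labels expand to number subgrammars; @reject@ is the elastic filler.

first_pass_graph = {
    "start": [("@reject@", "n1"), ("STREET_NUMBER", "n1")],
    "n1":    [("STREET_NUMBER", "n2"), ("@reject@", "n2")],
    "n2":    [("@reject@", "n3"), ("ZIPCODE_NUMBER", "n3")],
    "n3":    [("ZIPCODE_NUMBER", "end"), ("@reject@", "end")],
}

def label_sequences(graph, node="start", prefix=()):
    """Enumerate the label sequences the graph admits (for inspection)."""
    if node == "end":
        yield prefix
        return
    for label, successor in graph.get(node, []):
        yield from label_sequences(graph, successor, prefix + (label,))
```

Enumerating the sequences shows, for example, that "@reject@ STREET_NUMBER @reject@ ZIPCODE_NUMBER" is one admissible path.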

Fig. 1. PFSG used for recognition of stage 1 hypotheses. This word graph is used to extract only the numeric fields: @reject@ filler arcs absorb everything between the STREET NUMBER and ZIPCODE NUMBER class nodes.

The STREET NUMBER and ZIPCODE NUMBER classes can, in principle, be arbitrarily complex subgrammars that encode the most natural ways of speaking each. In our experiments, however, we restricted these to sequences of separated digits for ease of implementation and to develop a proof of concept. The ZIPCODE NUMBER class expands into exactly five digits, while the STREET NUMBER class expands into one to six digits. Just these two fields were extracted in the first recognition pass so that the USPS address database could be queried with them to yield a list of addresses consistent with that particular combination of street number and ZIP code.

3.1. Incorporating N-best recognition output

To compensate for some error in the recognition of the numeric fields, we considered not only the best street number and ZIP code combination from the first-pass recognition results, but the cross-product of the sets of all street numbers and all ZIP codes found in the 50 best hypotheses from the first recognition pass. We found, on average, 30 to 40 usable street numbers and 10 to 20 usable ZIP codes in the first-pass output. This caps the total number of database queries at approximately 800. Since more than one address is usually consistent with a given street number and ZIP code combination, one would normally expect the total number of results from nearly a thousand database queries to be correspondingly higher. But we found that a remarkable number of street number and ZIP code combinations never occurred in the USPS database. Typically, querying the USPS database with all the combinations gave fewer than a hundred possible addresses to match against. Thus, the second pass of the recognition uses a very compact PFSG that is generated from just these addresses. When the spoken address included a ZIP code, the probability that it was in the set of results returned from the USPS database query was around 92%, and the probability that the second recognition pass identified that particular address as the correct one was very high (almost 100%). As a consequence, the performance of the system was largely determined by the results of the first recognition pass and the deterministic results of the USPS database query.

3.2. ZIP-less extension

In an effort to accept more natural ways of speaking addresses, we developed an extension of the above two-pass recognition system. This was motivated by the observation that people who board taxis, for instance, often omit ZIP codes in conveying their destination to the driver. For example, the preferred form of saying the postal address "333 Ravenswood Avenue, Menlo Park, California 94025" tends to be one of the following rather than its canonical presentation above:

333 Ravenswood Avenue, Menlo Park
333 Ravenswood Avenue
333 Ravenswood
333 Ravenswood Avenue, Menlo Park, California (if the passenger, for instance, had won a large sum of money at a casino in Nevada and needed to get out of there quickly)

Thus, we modeled the recognition procedure to encode heuristics regarding which of several subsequent paths to follow based on the results of recognition at the end of each stage. We may concisely characterize the resultant system as a decision-graph-controlled iterative recognition procedure. This procedure is portrayed graphically in Figure 3, which depicts a simplified version of the actual decision graphs we used. At the internal nodes of these graphs, decisions are made as to which grammars will be used for the subsequent recognition phases and how they are to be constructed. The grammars are built directly from a list of allowable prior candidate address hypotheses, and they constrain the output of the recognizer to be exactly one of these prior hypotheses.

Initially, we assume that the address has both a street number and a ZIP code. If, however, the first pass yields only a single number, then we attempt either to recognize a trailing state name or to follow a subprocedure for recognizing an address with no street number. The procedure for each of these stages is similar to that in the first pass: to recognize state names, for instance, a compact word graph (see Figure 2) that encodes exactly the state names at the end is used to decode the spoken address. If recognition of the state name succeeds, then we attempt to recognize a city within that state; if not, we attempt to recognize a city in the current state. If this fails too, then we assume that the speaker means the current city. We then look up addresses in the target city that are consistent with the recognized street number and all the ZIP codes in that city.
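The N-best cross-product strategy of Section 3.1 can be sketched directly. In this illustration, `query_db` is a hypothetical database interface, and the field names in the n-best hypotheses are our own invention.

```python
from itertools import product

# Sketch of the Section 3.1 strategy: pool street numbers and ZIP codes from
# the 50-best first-pass hypotheses and query every combination. query_db is
# a hypothetical database interface; most combinations return nothing.

def candidate_addresses(nbest_fields, query_db, max_queries=800):
    streets = sorted({h["street"] for h in nbest_fields if h.get("street")})
    zips = sorted({h["zip"] for h in nbest_fields if h.get("zip")})
    candidates = []
    for street, zip_code in list(product(streets, zips))[:max_queries]:
        candidates.extend(query_db(street, zip_code))
    return candidates
```

With 30 to 40 street numbers and 10 to 20 ZIP codes, the cross-product stays within the roughly 800-query budget the paper reports.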

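The decision-graph-controlled procedure can be rendered as a heavily simplified sketch. The `rec` and `db` objects below are hypothetical stand-ins for recognition passes and database lookups, and several branches of the actual decision graphs are collapsed here.

```python
# Heavily simplified sketch of the decision-graph-controlled procedure; rec
# and db are hypothetical interfaces, and several branches are collapsed.

def master_procedure(wave, rec, db, current_state, current_city):
    numbers = rec.numeric_fields(wave)            # first pass: numeric fields
    if len(numbers) == 2:
        street_no, zip_code = numbers
        candidates = db.by_street_and_zip(street_no, zip_code)
        if candidates:
            return rec.rescore(wave, candidates)  # final acoustic rescoring
        return None                               # ask the user to rephrase
    if len(numbers) == 1:
        state = rec.trailing_state(wave) or current_state
        city = rec.city_in_state(wave, state) or current_city
        return rec.rescore(wave, db.by_street_and_city(numbers[0], city, state))
    return None                                   # fall through to procedure B
```

Each branch ends the same way: a PFSG is built from the surviving candidate addresses and the waveform is rescored against it.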

Fig. 3. Master procedure for Spoken Address Recognition. The decision graph branches on how many numbers the first pass finds: two (query the database with the first as street number and the second as ZIP code), one (decide from its position relative to the non-numeric data whether it is a street number or a ZIP code, then attempt to recognize a trailing state and city, or follow procedure A given a ZIP code), or none (follow procedure B). Each branch builds a PFSG from the candidate addresses and acoustically rescores them against the input waveform to pick the final hypothesis; if no candidate survives, the user is asked to rephrase the address. User input is required exactly once at the root node, in contrast to many of today's systems that require input at one or more interior nodes as well.

Fig. 2. PFSG used for recognition of state names. A chain of @reject@ filler arcs precedes a branch over the 51 state names, from "alabama" to "wyoming".

The complete procedure is illustrated in Figures 3 through 5. It may seem that the logic embodied in the decision procedure depicted in these figures is as complex as that deployed in many domain-dependent speech recognizers in use today. It is noteworthy, however, that the framework we propose requires user intervention exactly once, at the root of the decision procedure; menu-based input systems, in contrast, typically require user intervention and input at several of the interior nodes as well.

3.3. Class-based language modeling

Class-based language modeling is a useful technique for compensating for data sparsity in domain-dependent speech recognition. It is especially useful when the amount of data in the target domain is meager but useful generalizations can be drawn about how sentences in the domain are formed. When a word graph derived from a class-based language model is expanded using class membership information, the membership probabilities (which come from a discrete distribution over the members of each class) determine the weight of the transitions leading into each word sequence that belongs to that class. We can thus weight particular class members as more likely to occur than others, or, if we lack sufficient training data, weight them all equally. Although word classes can be inferred automatically by using, say, information-theoretic criteria (see [4] for an example), we have found that in domain-dependent tasks, human classification of words based on their intuitive groupings often works best.
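The class-expansion step can be illustrated in a few lines. This sketch (our own, not SRILM's implementation) weights the transition into each member word sequence by its class-membership probability, defaulting to the uniform distribution when no training data supports anything finer.

```python
import math

# Sketch of class expansion in a word graph: the transition into each member
# word sequence is weighted by its class-membership probability. With no
# training data, we fall back to a uniform distribution, as in the paper.

def expand_class_arc(arc_logprob, members, membership_probs=None):
    if membership_probs is None:                      # uniform fallback
        membership_probs = [1.0 / len(members)] * len(members)
    return [(word, arc_logprob + math.log(p))
            for word, p in zip(members, membership_probs)]
```

Expanding a CITY arc over two members, for example, yields two arcs each carrying an additional log(0.5) of weight.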


Fig. 4. Procedure A to recognize a spoken address when the only numeric component spoken is the ZIP code. The procedure attempts to recognize a state name and then a city given the ZIP code (falling back to the list of cities obtained from the database for that ZIP code, or asking the user to rephrase), queries the database for the streets in the recognized ZIP code or city, builds a PFSG from the candidates of the form street name, city, state, ZIP code, and acoustically rescores them against the waveform. The best match is the final hypothesis, which is confirmed with the user.

Fig. 5. Procedure B to recognize a spoken address when no numeric components exist in the address. The procedure attempts to recognize one of the 51 states at the end of the address (defaulting to the current state), then one of the cities in that state (defaulting to the current city), queries the database for the streets in the resulting city and state, builds a PFSG from the candidate addresses, and selects the best acoustic match against the input waveform as the final hypothesis, which is confirmed with the user.

For example, if we desire to build a speech (or optical text) recognizer within the financial stock trading domain and it is difficult to collect a large body of sentences on which to train language models, we could instead preprocess the few sentences that are in fact available and replace selected words with their class tags. These class tags would then perform the function of generalization, extending the sentences to span a far larger set of possibilities than that found in the training set alone. In our SAR system, class-based language modeling was used to good effect to capture groupings within numbers, cities, and states. Although it is conceptually possible to determine sophisticated probability distributions over possible destination cities for a passenger boarding a taxi, in this study we adopted the simple approach of keeping class-membership distributions uniform. We also adopted the simplifying assumption that users typically do not desire to communicate specific details down to the level of apartment or suite numbers to a recognition system. Thus, our system preprocesses all the data extracted from the USPS database to remove such qualifiers before building PFSGs from them.
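The preprocessing step just described, replacing member words with their class tags so that a handful of training sentences generalizes, can be sketched as follows. The word-to-class table here is invented for the example.

```python
# Illustrative preprocessing for class-based training data: member words are
# replaced by class tags so that a few sentences generalize to many. The
# word-to-class table below is invented for this example.

WORD_TO_CLASS = {
    "333": "STREET_NUMBER",
    "94025": "ZIPCODE_NUMBER",
    "california": "STATE",
}

def tag_tokens(tokens):
    return [WORD_TO_CLASS.get(tok.lower(), tok) for tok in tokens]
```

A sentence mentioning one street number now stands in for sentences mentioning any street number the class subgrammar admits.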

3.4. The elastic filler model

The elastic filler model, which we represent using the "@reject@" word, is one of the most important components of the SAR system. As far as the recognition grammars (word graphs) are concerned, it is the same as any other word: it is composed as a sequence of phones. Internally, however, there is an important difference: one of the states of the reject word contains a self-transition – a feature that is absent from the topology of regular words. Figure 6 shows the internal structure of the @reject@ word as compared to the regular word "cat". This unique feature makes it possible for the @reject@ word to be arbitrarily long, so that a single @reject@ word can stretch to consume arbitrarily long segments of the speech signal that we are not interested in. For this to happen, however, the @reject@ word must be made up of very general phones that can represent any arbitrary phone in speech. This is achieved by training a special acoustic model for the purpose. This phone, which we call the rej phone, is trained on nonsilent segments of a number of waveforms and is thus able to absorb a variety of phones. A similar effect can be obtained by constructing a @reject@ word that consists of a loop over all possible phones. We experimented with this approach as well, but found that the specially trained rej phone gave much better performance.

3.5. Tunable parameters

A consequence of using the elastic reject model is the possibility that it might absorb not only the unwanted portion of the speech signal, but also some of the wanted portion. Because of the way the system was architected, the reject model could consume either all of the ZIP code or none of it. However, because the primary number of the address was of variable length, it was possible for one or more of its trailing digits to be consumed by the reject model. Our initial experiments with the presented framework showed that during the first pass it is critical for the reject model to consume exactly the nonnumeric portions of the signal.
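The two word topologies of Section 3.4 can be sketched as state–arc lists. This toy encoding (integer states, arcs labeled with phones) is our own illustration, not the recognizer's internal representation.

```python
# Toy encoding of the two word-model topologies: a regular word is a fixed
# chain of phone arcs, while the elastic @reject@ word adds a self-transition
# on an interior state, letting it span arbitrarily long stretches of speech.

def word_model(phones, elastic=False):
    arcs = [(i, i + 1, ph) for i, ph in enumerate(phones)]  # chain 0->1->2...
    if elastic:
        arcs.append((1, 1, phones[0]))  # self-loop on an interior state
    return arcs

cat_model = word_model(["k", "ae", "t"])
reject_model = word_model(["rej", "rej", "rej"], elastic=True)
```

The self-loop arc is the entire difference: without it, a word model must consume exactly as many phone segments as it has arcs.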


Fig. 6. Word models for a regular word ("cat", phones k–ae–t over states 0 through 3) as compared to the special @reject@ word (three rej phones). Note the self-transition in state 1 of the @reject@ word, which enables the word to stretch arbitrarily.

When the reject model was too greedy and absorbed the last digit or two of the primary address number, very few viable hypotheses were generated from the database queries, and as a consequence no satisfactory final results were produced. We addressed this problem by varying the reject weight, a tunable parameter that controls the probability of entering the reject word. We found a range of values for this parameter within which the recognizer performed as desired. With a value within this range, the reject model consumed just the nonnumeric portion of the signals and thus generated a rich set of primary numbers and ZIP codes with which to query the USPS database.

3.6. Pronunciations for proper names

A final detail concerned the problem of associating pronunciations with unknown words. In most speech recognition tasks, the vocabulary of the recognizer is closed and known beforehand, so multiple possible pronunciations for each word can be generated, perhaps by hand, before attempting to decode the speech signal. In our case, however, the potentially unlimited number of proper nouns in a domain such as postal addresses makes it impossible to be prepared for every word that might be encountered. Fortunately, the first recognition pass recognizes only numeric fields; it is only the second pass that must deal with potentially unknown words. For these words, which by then are few in number, we generate pronunciations automatically on the fly, using heuristics. It is noteworthy that a significant number of the pronunciations we generate in this way are erroneous, yet these errors turn out not to matter: within the relatively highly constrained search space of the second and subsequent iterations, the acoustic match of the signal to candidates containing erroneous pronunciations is still significantly better than to candidates with a completely different street or city name from the one spoken.

4. EVALUATION AND RESULTS

In speech recognition experiments, it is customary to measure performance improvements using the word error rate (WER) criterion. However, in a domain-dependent task context, such as the one in which we evaluated the presented framework, the WER metric is less relevant than one that measures task completion accuracy. For example, if the aim is to identify tasks to perform based just on the content words in recognized utterances, a system that recognizes all and only the content words correctly is clearly better than one that recognizes all and only the function words, although the two systems may be comparable in terms of WER. In this light, we decided that the best evaluation metric for the spoken address recognition system was one that measured the correctness of the final address recognized, which we call the sentence error rate or SER.

To evaluate the performance of the system, we recorded 200 addresses spoken by 8 native speakers of American English; 155 of these addresses were spoken with the ZIP code present and 45 without. All of the addresses without ZIP codes ended in the city name, for example, "333 Ravenswood Avenue, Menlo Park". This set of 200 speech signals was then recognized using our iterative recognition strategy as well as a baseline approach. The baseline approach consisted of running our speech recognizer on the waveform to identify the individual words in it, much as we decode broadcast news stories or conversational telephone speech. However, because it would be impossible for the baseline approach to recognize street names missing from its vocabulary, all the words in these 200 addresses were explicitly added to the vocabulary for the baseline approach. We tested these waveforms in two batches – one consisting of the 155 waveforms that included the ZIP code and the other of the 45 that did not. Table 1 shows the performance of the two systems for each of the two sets.

Set name      SER/Baseline   SER/Iterative
ZIPless-45    23.87%         15.56%
ZIPful-155    62.6%          7.05%

Table 1. Summary of results using the presented iterative strategy and the baseline speech recognition approach.

It is clear that the iterative recognition strategy significantly outperforms the baseline approach both when a ZIP code is present and when it is not. Furthermore, it is noteworthy that this significantly higher accuracy was obtained even though the baseline approach had an edge in the above comparison: its vocabulary was primed with all the proper nouns that could be expected. Obviously, this is not a strategy that we would expect to work in a realistic setting when dealing with the actual USPS database of around 140 million records. In such cases, the iterative recognition approach presented in this paper is undoubtedly the better technique to use.

A minor point regarding Table 1 is worth discussing further. Although the numbers in row 2 (pertaining to the ZIPful case) may seem anomalous because of the dramatic difference in the reported accuracies, in fact they are not. Because the evaluation procedure scored task completion accuracy, measured as the percentage of addresses correctly recognized in their entirety, incorrectly recognizing any component of an address scored that trial as a failure. In several test cases, the baseline approach incorrectly recognized a single digit of either the primary street number or the ZIP code, or recognized an unlikely street name. Each of these cases was flagged as a failure and thus contributed to the relatively higher error rate of the baseline. The next section discusses this phenomenon in slightly more detail, shedding some light on the pros and cons of the kinds of errors the two very different recognition paradigms seem to make.
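The SER metric described above is all-or-nothing at the address level, which can be computed directly; this sketch is our own rendering of that definition.

```python
# Sentence error rate (SER) as defined for this evaluation: a trial counts
# as correct only if the entire address is recognized exactly.

def sentence_error_rate(hypotheses, references):
    assert len(hypotheses) == len(references)
    errors = sum(hyp != ref for hyp, ref in zip(hypotheses, references))
    return 100.0 * errors / len(references)
```

Under this metric, a single wrong digit in a ZIP code costs as much as a completely wrong address, which is why the baseline's many small errors inflate its SER.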

4.1. Kinds of errors

It is interesting to compare the kinds of errors made by the baseline and iterative systems on the spoken addresses. Since the approach of the iterative recognizer is markedly different from that of the baseline recognizer, the errors the two make tend to be very different in character. Although the iterative approach suffers far fewer errors, when it does get an address wrong, it is often completely (and sometimes preposterously) wrong. The baseline approach, in contrast, which on average generates a much larger number of errors, is more conservative in the kinds of errors it makes. For example, in recognizing "333 Ravenswood Avenue, Menlo Park, California 94025", the baseline approach may instead recognize "833 Ravenswood Avenue, Menlo Park, California 74303", whereas the iterative approach might produce "200 Rural Route 1, Hominy, Oklahoma 74035". This is a consequence of the fact that the only fruitful primary-number/ZIP code combination found in the n-best recognizer output after the first recognition pass on that signal was 200/74035, which yielded addresses in Oklahoma. Unfortunately, in this case, one of these addresses was also a better match to the spoken waveform than a @reject@. This problem can be addressed by tuning the reject weight to ensure that a @reject@ is more probable than a match as poor as the one we got. We also found in informal tests that users were far more tolerant of incorrect addresses if they were given the result quickly and allowed to repeat the address than if they were required to wait longer for a slightly increased chance of getting the right result. The presented architecture does indeed allow rapid turnaround of recognition results, since each recognition pass typically uses a very compact search space. The baseline approach, on the other hand, if it is workable at all, requires a very large vocabulary to encompass the full address database and consequently takes considerably longer to recognize the spoken signal.

5. SUMMARY AND FUTURE WORK

In the erroneous case illustrated in Section 4, the result obtained from the baseline approach is clearly more useful to the user than the wildly unlikely address in Oklahoma that our system proposed.1 Although problems such as this were in fact rare, we believe they can be significantly alleviated by using heuristics and employing judicious distributions over the space of locations the user is likely to have spoken. At present, we use a uniform distribution that does not encode knowledge of the user's current location, except on completion, when a route plan is drawn up via a Web-service query. We plan to modify these distributions so that the probabilities of locations are a function of their distances from the user's current position. We are also working on a dialog agent, to be integrated into the system, that will enable correction of incorrectly recognized address components in a natural way.

1 Note, however, that incorrect ZIP codes will be problematic in situations where the address needs to be entered into a route-planning algorithm.

6. ACKNOWLEDGMENTS

The authors thank all members of the Speech Technology and Research Laboratory at SRI International, Menlo Park, for their assistance in architecting the various stages of the initial proof of concept. Mike Frandsen of SRI International, Montana, put together the distributed speech recognition-based Java client for demonstration at the conference. Talia Shaham wrote the interface to the USPS database. Jim Arnold and Doug Bercow assisted with the commercialization of the technology described in this paper and provided much feedback and encouragement. Judith Lee proofread and edited an initial draft of this paper. Anonymous ASRU-03 reviewers offered helpful comments toward improving the content.

7. REFERENCES

[1] Andreas Stolcke, "SRILM – an extensible language modeling toolkit," in Hansen and Pellom [5], pp. 901–904.
[2] Venkata Ramana Rao Gadde, Andreas Stolcke, Dimitra Vergyri, Jing Zheng, Kemal Sonmez, and Anand Venkataraman, "Building an ASR system for noisy environments: SRI's 2001 SPINE evaluation system," in Hansen and Pellom [5], pp. 1577–1580.
[3] Babak Hodjat, Horacio Franco, Harry Bratt, Kristin Precoda, Andreas Stolcke, Anand Venkataraman, Dimitra Vergyri, and Jing Zheng, "Iterative statistical language model generation for use with an agent-oriented natural language interface," in Proc. 10th Intl. Conf. HCI, Crete, 2003.
[4] Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer, "Class-based n-gram models of natural language," Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[5] John H. L. Hansen and Bryan Pellom, Eds., Proc. ICSLP, Denver, Sept. 2002.

