
HIGH LEVEL KNOWLEDGE SOURCES IN USABLE SPEECH RECOGNITION SYSTEMS

The authors detail an integrated system which combines natural language processing with speech understanding in the context of a problem solving dialogue. The MINDS system uses a variety of pragmatic knowledge sources to dynamically generate expectations of what a user is likely to say.

SHERYL R. YOUNG, ALEXANDER G. HAUPTMANN, WAYNE H. WARD, EDWARD T. SMITH, and PHILIP WERNER

Understanding speech is a difficult problem. The ultimate goal of all speech recognition research is to create an intelligent assistant, who listens to what a user tells it and then carries out the instructions. An apparently simpler goal is the listening typewriter, a device which merely transcribes whatever it hears with only a few seconds delay. The listening typewriter seems simple, but in reality the process of transcription requires almost complete understanding as well. Today, we are still quite far from these ultimate goals. But progress is being made.

One of the major problems in computer speech recognition and understanding is coping with large search spaces. The search space for speech recognition contains all the acoustics associated with words in the lexicon as well as all the legal word sequences. Today, the most widely used recognition systems are based on hidden Markov models (HMMs) [2]. In these systems, typically, each word is represented as a sequence of phonemes, and each phoneme is associated with a sequence of states. In general, the search space size increases as the size of the network of states increases. As search space size increases, speech recognition performance decreases. Knowledge can be used to constrain the exponential growth of a search space and hence increase processing speed and recognition accuracy [9, 17].

Currently, the most common approach to constraining search space is to use a grammar. The grammars used for speech recognition constrain legal word sequences. Normally they are used in a strict left to right fashion and embody syntactic and semantic constraints on individual sentences. These constraints are represented in some form of probabilistic or semantic network which does not change from utterance to utterance [16-18]. As we move toward habitable systems and spontaneous speech, the search space problem is greatly magnified. Habitable systems permit users to speak naturally. Grammars for naturally spoken sentences are significantly larger than the small grammars typically used by speech recognition systems. When one considers interjections, restarts and additional natural speech phenomena, the search space problem is further compounded. These problems point to the need for using knowledge sources beyond syntax and semantics to constrain the speech recognition process.

There are many other knowledge sources besides syntax and semantics. Typically, these are clustered into the category of pragmatic knowledge. Pragmatic knowledge includes inferring and tracking plans, using context across clausal and sentence boundaries, determining local and global constraints on utterances, and dealing with definite and pronominal reference. Work in the natural language community has shown that pragmatic knowledge sources are important for understanding language. People communicate to accomplish goals, and the structure of the plans to accomplish them is well understood [9, 24, 26, 32, 33]. When speech is used in a structured task such as problem solving, pragmatic knowledge sources are available for constraining search spaces.

(This research was sponsored by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 5167, monitored by the Air Force Avionics Laboratory under contract N00039-85-C-0163. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the US Government.)


In the past, pragmatic dialogue level knowledge sources were used in speech either to correct speech recognition errors [4, 10] or to disambiguate spoken input and perform inferences required for understanding [20, 21, 30, 35]. In these systems, pragmatic knowledge was applied to the output of the recognizer.

In this article we describe an approach for flexibly using contextual constraints to dynamically circumscribe the search space for words in a speech signal. We use pragmatic knowledge to derive constraints about what the user is likely to say next. Then we loosen the constraints in a principled manner. We generate layered sets of predictions which range from very specific to very general. To enable the speech system to give priority to recognizing what a user is most likely to say, each prediction set dynamically generates a grammar which is used by the speech recognizer. The prediction sets are tried in order of most specific first, until an acceptable parse is found. This allows optimum performance when users behave predictably, and displays graceful degradation when they do not. The implemented system (MINDS) uses these layered constraints to guide the search for words in our speech recognizer. For our recognizer, we use a modified version of the SPHINX [19] large vocabulary, speaker independent, continuous speech recognition system. The MINDS spoken language dialogue system developed at Carnegie-Mellon University applies pragmatic knowledge-based constraints as early as possible in the speech recognition process to eliminate incorrect recognition choices and drastically reduce the speech system error rate.

The main problem in speech recognition is the enormous complexity involved in analyzing speech input. Variations in pronunciation, accent, speaker physiology, emphasis and characteristics of the acoustic environment typically produce hundreds of different possible phoneme classifications for each sound. In turn, the many phoneme classifications can result in many possible word hypotheses at each point in the utterance. All of these word choices can then be combined to yield hundreds of sentence candidates for each utterance. The resulting search space is huge. Yet a speech system is required to filter out all incorrect candidates and correctly recognize an utterance in real time.

Different approaches have been used in the past to limit the exponential explosion of the search space and trim the computational complexity to a more manageable level. In an attempt to reduce complexity, the speech recognition problem has been simplified along different dimensions. The first speech recognition systems [18] were tailored to specific speakers only. This reduced much of the speech signal variation due to speaker characteristics such as sex, age, accent and physiological characteristics. The early systems also recognized only very few words. This reduction in vocabulary eliminated much confusion during recognition, especially if the words were all acoustically distinct.


To avoid the problem of slurred and coarticulated words, the early systems required that each word be pronounced separately and that the speaker pause slightly between words. A final simplification of the speech problem was to artificially limit the number of different words that could be used at any one point. Similar to the choices available in a series of menus, the speaker could only use one of a few words at any place in the utterance. This technique depends on the previously uttered sentential context to reduce the so-called branching factor. The effective vocabulary at each point is made much smaller than the overall vocabulary available to the system.

Speech technology has made great strides in the recent past. We are now in a position where we can progress beyond systems that merely type out a sentence read to them for demonstration purposes. The speech recognition research focus has shifted toward integrated spoken language systems, which can be used by people trying to accomplish a task. For the rest of this article we will only be concerned with speaker independent, large vocabulary, connected speech recognition systems.

THE NEED TO INTEGRATE SPEECH AND NATURAL LANGUAGE

Speech recognition techniques at the word level are inadequate. Error rates are fairly high even for the best currently available systems. The Carnegie-Mellon SPHINX system [19] is considered to be the best speaker independent connected speech recognition system today. But even the SPHINX system has an error rate of 29.4 percent for speaker independent, 1,000 word connected speech recognition, when recognizing individual words in sentences without using knowledge about syntax or semantics. Clearly we need some forms of higher level knowledge to understand speech better. This is the kind of knowledge that has been used for years by researchers concerned with natural language understanding of typed input.

Just using the modules developed by the typed natural language understanding community is not as simple as it may seem. There are a number of very specific demands on a speech system interface which differ from those on a typed system interface. Many of the techniques for parsing typed natural language do not adapt well to speech specific problems. The following points highlight some of the unique speech problems not found in typed natural language.

Probability Measures. There is nothing uncertain about what a person has typed. The ASCII characters are transmitted unambiguously to the program. The speech system has to deal with many uncertainties during recognition. At each level of processing, the uncertainties compound. The result of each processing step is usually expressed in terms of some probability or likelihood estimate. Any techniques developed for typed natural language systems need to be adapted to account for different alternatives with different probabilities.


Identifying Words. The speech system usually has many words hypothesized for each word actually spoken. The word identification problem has several components (a small sketch of such a word lattice appears after this list of problems), some of which include:

- Uncertainty of the location of a word in the input. For almost every acoustic event in the utterance, there are words hypothesized that start and end at this event. Many of these words overlap and are mutually exclusive.
- Multiple alternatives at every word location. Even at the correct word location, the system is never certain of which word was actually spoken. A list of word candidates is usually hypothesized with different probabilities.
- Word boundary and juncture identification. Often words are coarticulated such that their boundaries become merged and unrecognizable. "Some milk" is a classic example of a phrase where the two words overlap without a clear boundary.

Phonetic Ambiguity of Words. Many words sound completely alike, and the correct orthographic representation can only be distinguished from larger context. This is the case for "ice cream" and "I scream."

Syllable Omissions. Spoken language tends to be terse. Because people are so good at disambiguating the speech signal, speakers unconsciously omit syllables. An example of this is frequently found in the pronunciation of "United States," which is reduced to sound like "unite states" in everyday speech.

Missing Information. The speech system will occasionally fail to recognize the correct word completely. Even though the speaker may have said the word correctly, it cannot be hypothesized from the acoustic evidence. There will be too many other word candidates that receive a better score, and the correct word will be left out.

Ungrammatical Input. If mistyping is the kind of phenomenon a standard natural language system has to deal with, speech systems encounter misspoken words and filled pauses like "ah" and "uhm" which further complicate recognition [16]. Natural human speech is also more likely to be ungrammatical [6].
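To make the word identification problem concrete, the following minimal sketch (in Python) shows the kind of scored, overlapping word hypotheses a recognizer must sort through. The data structure, frame spans and scores are invented for illustration; they are not the representation used by SPHINX or MINDS.

    # Illustrative only: a toy word lattice in which several scored word
    # hypotheses overlap the same stretch of centisecond speech frames.
    from dataclasses import dataclass

    @dataclass
    class WordHypothesis:
        word: str
        start_frame: int   # first frame covered by this hypothesis
        end_frame: int     # last frame covered by this hypothesis
        score: float       # acoustic log-likelihood (higher is better)

    lattice = [
        WordHypothesis("I",      10, 20,  -95.7),
        WordHypothesis("ice",    10, 32, -210.4),
        WordHypothesis("scream", 21, 55, -310.2),
        WordHypothesis("cream",  28, 55, -290.8),
    ]

    def hypotheses_at(frame, lattice):
        """Return every word hypothesis whose span covers the given frame."""
        return [h for h in lattice if h.start_frame <= frame <= h.end_frame]

    # Several mutually exclusive words compete for the same region of speech.
    print([h.word for h in hypotheses_at(30, lattice)])   # ['ice', 'scream', 'cream']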

The combined effect of all these differences is that speech recognition systems must deal with many more alternatives and a much larger search space than typed natural language systems. Therefore, the techniques developed by the natural language processing community must be restructured if they are to be adapted to these speech specific problems. In particular, they must be adapted to deal with the huge search spaces that result from the magnitude of the problem if these knowledge sources are to be used to assist in actual speech recognition.


Uses of Knowledge in Speech Recognition Systems

In the past, speech systems have used a variety of different kinds of knowledge sources to reduce the magnitude of the search space. The following list describes the major information sources used by different speech systems. We restrict ourselves to enumerating the knowledge sources above the level of complete words as they apply to sentences, dialogues, user goals and user focus.

- Word Transition Probabilities. If one wants to use knowledge of more than one word, the obvious solution is to use two words. By analyzing a large set of training sentences, a matrix of word pairs is constructed. The sentences are analyzed individually, without regard to dialogue structure, focus of attention or user goals. The resulting word pair matrix indicates which words can follow an already recognized word. A further extension of this method uses likelihoods of transitions encoded in the matrix instead of just binary values. Not only do we know which word pairs are legal, but we also have an indication of how likely they are (a small sketch of this kind of estimate appears after this list). Empirically derived trigrams of words have also been used. Here a matrix is computed which, when given a sequence of the two preceding words, indicates which words can immediately follow at this point. Variations on the word transition probability estimates using Markov modeling techniques have been used by [2]. A minor modification of this approach uses word categories instead of words. Word categories are independent of the actual vocabulary size and require less training data to establish the transition probability matrix. While this approach does well to reduce the amount of search that is required, there is still much information missing in the triplets of allowable words [28, 31].
- Syntactic Grammars. A syntactic grammar first divides all words into different syntactic categories. Instead of using transition probabilities between word pairs or triplets, a syntactic grammar specifies all possible sequences of syntactic word categories for a sentence [34]. Network grammars seem to be the most efficient representation for this type of constraint, since fast processing times are crucial in a speech system. Other grammar parsing representations are not as efficient when faced with the large numbers of candidates in a speech recognition situation. While the grammar may be written in a different notation, it can usually be compiled down to a network for the actual speech processing. The big drawback of these grammars is that they are difficult to construct by hand. They also assume the speaker will produce an utterance which is recognizable by the grammar.
- Semantic Grammars. Semantic grammars have been the most popular form of sentential information encoded in speech recognition systems. The grammar rules are similar to those of syntactic grammars, but words are categorized by a combination of syntactic class and semantic function. Only sentences that are both syntactically well formed and meaningful in the context of the application will be recognized by a semantic grammar. Semantic grammars express stronger constraints than syntactic grammars, but also require more rules. These grammars are also easily representable as networks. Compared to the syntactic grammars above, they are even more difficult to construct by hand. Nevertheless, most speech systems which use higher level knowledge have chosen to use semantic grammars as their main sentential knowledge source [5, 17, 18, 21, 36]. Some speech recognition systems emphasized semantic structure while minimizing syntactic dependencies [12, 16]. This approach results in a large number of choices due to the lack of appropriate constraints. The recognition performance therefore suffers due to the increased ambiguities. None of these systems proposed to use any knowledge beyond the constraints within single sentences.


- Thematic Memory. Barnett [3] describes a speech recognition system which uses a notion of history for the last sentences. The system keeps track of previously recognized content words and predicts that they are likely to reoccur. The possibility of using a dialogue structure is mentioned by Barnett, but no results or implementation details are reported. The thematic memory idea was picked up again by Tomabechi and Tomita [29], who demonstrated an actual implementation in a sophisticated frame-based system. Both speech recognition systems use an utterance to activate a context. This context is then transformed into word expectations which prime the speech recognition system for the next utterance.
- History-based Models. Fink, Biermann and others [4, 10] implemented a system that used a dialogue feature to correct errors made by a small vocabulary, commercial speech recognition system. Their module was strictly history-based. It remembered previously recognized meanings (i.e., semantic structures) of sentences as a finite state dialogue network. If the currently analyzed utterance was semantically similar to one of the stored sentence meanings and the system was at a similar state of the dialogue at that time, the stored meaning could be used to correct the recognition of the new utterance. Significant improvements were found in both sentence and word error rates when a prediction from a previous utterance could be applied. However, the history-based expectation was only applied after a word recognition module had processed the speech, in an attempt to correct recognition errors.
- Strategy Knowledge. Strategy knowledge was applied as a constraint in the voice chess application of Hearsay-I [25]. The task domain was defined by the rules of chess. The "situational semantics of the conversation" were given by the current board position. Depending on these, a list of legal moves could be formulated which represented plausible hypotheses of what a user might say next. In addition, a user model was defined to order the moves in terms of the goodness of a move. From these different knowledge sources, an exhaustive list of all possible sentences was derived. This list of sentences was then used to constrain the acoustic-phonetic speech recognition. Hearsay-I went too far in the restriction of constraints. A classic anecdote tells of the door slamming during a demonstration and the system recognizing the sentence "Pawn to Queen 4." Hearsay-I applied its constraints in an extremely limited domain and overly restricted what could be said. Nevertheless, the principles of using a user model, task semantics and situational semantics are valid.


- Natural Language Back-Ends. Several speech recognition systems claim to have dialogue, discourse or pragmatic components [20, 21, 30, 35]. However, most of these systems only use this knowledge just as any typed natural language understanding system would. The speech input is processed by a speech recognition module which uses all its constraints up through the level of semantic grammars to arrive at a single best sentence candidate. This sentence is then transformed into the appropriate database query, anaphoric references are resolved, elliptic utterances are completed and the discourse model is updated. All these higher level procedures are applied after the sentence is completely recognized by the speech front-end. There is no interaction between the natural language processing modules and the speech recognizer.
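As a concrete illustration of the first knowledge source above, the sketch below estimates word-pair (bigram) transition probabilities from a handful of training sentences. It is a minimal Python example; the tiny training corpus and the function names are ours, not taken from any of the systems cited.

    # Illustrative sketch: word-pair (bigram) transition probabilities estimated
    # from training sentences, without regard to dialogue structure or user goals.
    from collections import defaultdict

    def train_bigrams(sentences):
        counts = defaultdict(lambda: defaultdict(int))
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            for prev, nxt in zip(words, words[1:]):
                counts[prev][nxt] += 1
        # Convert counts into conditional likelihoods P(next word | previous word).
        probs = {}
        for prev, following in counts.items():
            total = sum(following.values())
            probs[prev] = {w: c / total for w, c in following.items()}
        return probs

    bigrams = train_bigrams([
        "what is the mission for England",
        "what is the mission for Gridley",
        "show me the capabilities of Gridley",
    ])

    # Words allowed to follow "the", with their estimated likelihoods.
    print(bigrams["the"])   # mission: ~0.67, capabilities: ~0.33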

Natural Language Research

There has been much research on discourse, focus, planning, inference and problem solving strategies in the natural language processing community. Some of the research was not directly carried out in the context of natural language systems, but describes methods for representation and analysis of these issues. We will briefly review the key principles which influenced the design of the MINDS spoken language dialogue system.

Plans. The utility of tracking plans and goals in a story has been well established. A number of researchers [1, 22] have described the utility of identifying a speaker's goals to disambiguate natural language sentences and to provide helpful system responses. Similarly, Cohen and Perrault [8] have developed a plan-based dialogue model for understanding indirect speech acts. A program called PAM [32] showed how an understanding of a person's goal can explain an action of the person. The goal and subsequent actions can also be used to infer the particular plan which the person is using to achieve this goal. Additionally, Wilensky [33] developed methods for dealing with competing goals and partial goal failures. The ways in which a plan can be broken down into a hierarchical set of goals and subgoals was originally demonstrated by Sacerdoti [26].

Problem Solving. Newell and Simon [24] were key influences in the study of human problem solving. Among other things, they showed how people constantly break goals into subgoals when solving problems. Their findings, as well as much of the other research done in this area [22], illustrate the function of user goals represented as goal trees, and traversal procedures for goal trees.


Focus. Focus determines a set of relevant concepts for a particular situation. Grosz [13] found that natural language communication is highly structured at the level of dialogues and problem solving. She showed how the notion of a user focus in problem solving dialogues is related to a partitioning of the semantic space. Focus can also provide an indication of how to disambiguate certain input. Additional work by Sidner [27] confirmed the use of focus as a powerful notion in natural language understanding. Sidner successfully used a focus mechanism to restrict the possibilities of referent determination in pronominal anaphora.

Ellipsis. Elliptical utterances are incomplete sentences which rely on previous context to become meaningful. Methods for interpreting elliptical utterances were studied in depth by Frederking [11]. He used a chart-based representation to remember fragments of preceding sentences which were suitable complements for elliptic phrases.

User Domain Knowledge. Chin [7] showed how the knowledge of the user about the domain can influence the expectations and behavior of a system. In addition, he described ways in which the user expertise could be inferred by the system.

THE MINDS SYSTEM

The main problem in speech recognition is the enormous complexity involved in analyzing speech input. As search space size increases, recognition performance decreases and processing time increases. The value of a reduced search space and stronger constraints is well known in the speech recognition community [9, 17, 23]. Reducing the search to only the most promising word candidates by pruning often erroneously eliminates the correct path. By applying knowledge-based constraints as early as possible, one can trim the exponential explosion of the search space to a more manageable size without eliminating correct choices. Earlier we described the key knowledge sources incorporated into the MINDS system to provide constraint. Now we briefly overview the entire MINDS system and enumerate the primary innovations of this new approach. The approach employs knowledge-based constraints to reduce the exponential growth of the search space the speech recognizer must analyze.

To demonstrate our new approach in speech recognition, we have built MINDS, a Multi-modal, INteractive Dialog System [14, 38]. It allows a user to speak, type and point during a problem solving session with the system. The system outputs information in a variety of media. It produces a natural language answer to the user's question as well as displaying relevant information on multiple raster display screens. These screens display the current information in various contexts emphasizing different aspects of the dialogue history and the current world situation. The MINDS system operates in real time.


MINDS works in a resource management domain, featuring information obtained from a database of facts about ships in the United States Navy. The basic problem scenario involves a damaged ship performing a particular task. The system user must determine the impact of the damage on the mission and then determine whether the damaged ship should continue in degraded condition or should be replaced by a different ship. If a replacement ship is being considered, the user must locate one with similar capabilities and a ship which will have minimal impact on other mission operations. An excerpt from a sample interaction transcript can be found in Figure 1.

    List the casualty reports for ships in the Persian Gulf?
    Gridley is in the Persian Gulf. Harpoon missile is disabled.
    What's its mission?
    Gridley's mission is escort.
    Show me Gridley's required capabilities.
    Harpoon missile, Phalanx gun, SPS-10 radar are required capabilities.
    Let me see all Leahy-class cruisers in the Persian Gulf.
    England is a Leahy-class ship. It is a cruiser. It is in the Persian Gulf.
    Gridley is a Leahy-class ship. It is a cruiser. It is in the Persian Gulf.
    What is the mission for England?
    England's mission is transport.
    Priority?
    England's mission importance is 1.
    What are her capabilities?
    It has Harpoon missile. It has Phalanx gun. It has SPS-10 radar. It has torpedo tubes. It has SM-2 missile.
    Does England have any CASREPS?
    Nothing found in the database.

    FIGURE 1. An Excerpt of a Typical Dialogue with the MINDS System

For the purposes of this article, MINDS can be viewed as a speaker-independent continuous speech recognition system that uses dialogue knowledge, user goals, plans and focus to understand what was said in its naval logistics problem solving domain. The system uses this higher level knowledge of dialogues, in addition to a representation of the user's domain knowledge, to predict what the current user will talk about next. The predictions drastically reduce the search space before the sentence and word detection modules even begin to analyze the speech input.

In very general terms, we can describe the main operations of the MINDS system as a continuous loop. First, the system generates a set of predictions based on the last user query, the database response and the state of the dialogue. Then, the predictions are translated


into a semantic grammar. The active lexicon is restricted to cover only words which are part of the predictions. The next user query is parsed using the dynamically created grammar and lexicon, and the user's question is displayed for verification. The database response to the user request is then presented in the output modalities.

Innovations of the MINDS System

The MINDS system represents a radical departure from the principles of most other speech recognition systems. The key innovations of the MINDS system include the following:

1. Use of a combination of knowledge sources, including discourse and dialogue knowledge, problem solving knowledge, pragmatics, user domain knowledge, goal representation, as well as task semantics and syntax, in an integrated system.
2. All the knowledge is used predictively. Instead of applying the knowledge to correct an error or resolve ambiguities after they occur, the knowledge is applied in a predictive way to constrain all possibilities as they are generated.
3. The constraints generated by the system are immediately applied to the low-level speech processing to reduce the search space. We use the predictive constraints to eliminate large portions of the search space for the earliest acoustic-phonetic analysis.
4. In case the predictions fail, the system provides a principled way of recovery when constraints are violated. If the constraints are satisfied, recognition is more accurate and faster. However, if some of our predictions are violated the system does not break; it degrades gracefully.
5. In addition to speech input, the MINDS system allows pointing and clicking as well as typed modes of interaction. The user may use the mouse to select objects displayed graphically on the screen. For those users who are uncomfortable speaking to a computer, anything that could be spoken can also be typed on a keyboard.

MINDS exploits knowledge about users' domain knowledge, problem solving strategy, their goals and focus, as well as the general structure of a dialogue, to constrain speech recognition down to the signal processing level. Pragmatic knowledge sources are used predictively to circumscribe the search space for words in the speech signal [14, 37]. In contrast to other systems, we do not correct misrecognition errors after they happen, but apply our constraints as early as possible during the analysis of an utterance.

Our approach uses predictions derived from the problem-solving dialogue situation to limit the search space at the lower levels of speech processing. At each point in the dialogue, we predict a set of concepts that may be used in the next utterance. This list of concepts is combined with a set of syntactic networks for possible sentence structures. The result is a dynamically constructed semantic


network grammar, which reflects all the constraints derived from all our knowledge sources. To avoid getting trapped by predictions which are not fulfilled, we generate them at different levels of specificity. When the parser then analyzes the spoken utterance, the dynamic network allows only a very restricted set of word choices at each point. This reduces the amount of search necessary and cuts down on the possibility of recognition errors due to ambiguity and confusion between words. The bottom line is that the MINDS system uses as much knowledge as possible to achieve accurate speech recognition and help the user complete the task efficiently.
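The overall processing cycle just described can be summarized in a short sketch. The code below is a hypothetical, much simplified rendering of that loop: the "grammars" are just sets of allowed word sequences, and recognition is simulated by a membership test rather than by SPHINX. Only the control structure, trying layered prediction sets from most specific to most general, reflects the MINDS design.

    # Hypothetical sketch of the layered-prediction control loop; the grammars and
    # the membership-test "recognizer" are stand-ins for the real dynamic network
    # grammars and the SPHINX search.
    def recognize_with_layers(spoken_words, prediction_layers):
        """Try each prediction layer, most specific first, until one accepts."""
        for level, grammar in enumerate(prediction_layers):
            if tuple(spoken_words) in grammar:      # stand-in for a real parse
                return level, spoken_words
        return None, None                           # fall back to word-level heuristics

    # Layer 0: most likely goal plus user model; layer 1: most likely goal only;
    # layer 2: the full semantic grammar (every meaningful sentence).
    layers = [
        {("does", "spark", "have", "sonar")},
        {("does", "spark", "have", "sonar"),
         ("what", "is", "sparks", "mission")},
        {("does", "spark", "have", "sonar"),
         ("what", "is", "sparks", "mission"),
         ("list", "all", "ships", "in", "the", "persian", "gulf")},
    ]

    print(recognize_with_layers(["does", "spark", "have", "sonar"], layers))   # level 0
    print(recognize_with_layers(
        ["list", "all", "ships", "in", "the", "persian", "gulf"], layers))      # level 2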

The Predictive Use of Knowledge

Predictions are derived from what we know about the current state of the dialogue. This knowledge is then refined to constrain what we expect the user will actually say. In some sense predictions cover everything we expect to happen, and exclude events which are unlikely to happen. To be able to create predictions, we have three very important data structures in the system:

A knowledge base of domain concepts. In this data structure we represent all objects and their attributes in the domain. The representation uses a standard frame language which provides the capability to express inheritance and multiple relations between frames. The domain concepts also represent everything that can be expressed by the user. Each possible utterance will map into a combination of domain concepts which constitute the meaning of that utterance. This representation of meaning is also used to generate the database queries from the utterance.

Hierarchical goal trees. The goal trees represent a hierarchy of all possible abstract goals a user may have during the dialogue. The goal trees are composed of individual goal nodes, structured as AND-OR trees. Each goal node is characterized by the possible subgoals it can be decomposed into and a set of domain concepts involved in trying to achieve this goal. The concepts associated with a goal tend to be restricted by previous dialogue context. These restrictions on concept expansions are dynamically computed for each concept. The computation is based upon principles for inheriting and propagating constraints based upon their embedding, and these are often referred to as local and global focus in the natural language literature. The goal tree not only defines the goals, subgoals and domain concepts, but also the traversal options available to the user. A goal node's associated concepts can be optional or required, single use or multiple use. If a concept is optional, it is possible but not necessary for a user to apply this concept in the current problem-solving step. If a concept is defined as multi-usable, then a user could refer to it several times in different utterances during the current problem solving step.


A User Model. A user model represents domain concepts and the relations between domain concepts which a user knows about. These models are represented as control structures which are associated with goals in the goal tree. The control structures express which goals may be exclusive because the user can infer the information in one goal once the other, exclusive goal has been completed. Other goals may be optional because the user is unfamiliar with the domain concept or its potential importance in deriving a solution to the current problem. Additionally, the control structures contain probabilistic orderings for conjunctive subgoals. Hence, the user model provides the system with potential traversal options which are more restrictive than the traversal options provided by the other knowledge sources.

These three complex data structures are currently only constructible by hand, based on a detailed and careful analysis of the problem-solving task itself. We will now try to explain how the knowledge is used by the MINDS system to track the progress of a user during a problem-solving session. We will also show how predictions in the form of domain concepts are generated during a dialogue.

When an input utterance and its database response have been processed, we first try to determine which goal states were targeted by the present interaction. Determination of activated goal states is by no means unambiguous. During one interaction cycle, several goal states may be completed and many new goal states may be initiated. Similarly, it is possible that an assumed goal state is not being pursued by the user. To deal with these ambiguities, we use a number of algorithms. Goals that have just been completed by this interaction and that are consistent with previous plan steps are preferred. Based on the information we have available, we select the most likely plan step to be executed next. If the current goal is not complete, then our most likely plan step will attempt to complete the current goal. If the current goal is satisfied, we identify the next goal states to which a user could transit. The result of tracking the goal states is a list of potential goals and subgoals which a user will try to complete in the current utterance. Additionally, since there are always many active goals which may or may not be hierarchically embedded, we also maintain a list of all active goals. Hence, the procedures described above are used for determining the best, most likely goal state a user will transit to.

To generate our most restrictive predictions, we then restrict the most likely goal state further by taking the constraints from the user model. The next prediction layer ignores the user model and is derived only from the best, most likely next goal. Finally, additional less restrictive sets of predictions are derived from currently active goals which are at higher levels of the goal tree. This procedure continues until all active goals are incorporated into a prediction set. The goals all have an associated list of concepts in the task domain.


These are the concepts a user will refer to when trying to satisfy the current goal. For example, in a goal state directed at assessing a ship's damage, we expect the ship's name to appear frequently in both user queries and system statements. We also expect the user to refer to the ship's capabilities. The predicted sentence structures should allow questions about the features of a ship like "Does its sonar still work?", "Display the status of all radars for the Spark" and "What is Badger's current speed?"

Some domain concepts which are active at a goal tree node during a particular dialogue phase have been partially restricted by previous goal states. The representation of the domain concepts associated with goal nodes provides a mechanism to specify what prior goal can restrict the current concept. These restrictions may come either from the user's utterances or from the system responses. Thus each goal state not only has a list of active domain concepts, but also a set of concepts whose values were partially determined by an earlier goal state. In our example, once we know which ship was damaged, we can be sure all statements in the damage assessment phase will refer to the name of that ship or its hull number only.

In addition to the knowledge mentioned earlier, we also restrict what kinds of anaphoric referents are available at each goal node. The possible anaphoric referents are determined by user focus. From the current goal or subgoal state, focus identifies previously mentioned dialogue concepts and answers which are relevant at this point. These concepts are expectations of the referential content of anaphora in the next utterance. Continuing our example, it does not make sense to refer to a ship as "it" before the ship's name has been mentioned at least once. We also do not expect the use of anaphoric "it" if we are currently talking about a group of several potential replacement ships.

Elliptic utterances are predicted when we expect the user to ask about several concepts of the same type after having seen a query for the first concept. If the users have just asked about the damage to the sonar equipment of a ship, and we expect them to query about damage to the radar equipment, we must include the expectation for an elliptic utterance about radar in our predictions.
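The sketch below shows one way the goal tree and the layered predictions described in this section might be represented. The field names, goal names and concepts are illustrative assumptions on our part; MINDS stores this information in its frame language, and the real first layer is further restricted by the user model.

    # Illustrative data structures only: an AND-OR goal tree whose nodes carry
    # domain concepts, and a function deriving layered concept predictions from
    # the most likely next goal and the remaining active goals.
    from dataclasses import dataclass, field

    @dataclass
    class GoalNode:
        name: str
        combination: str = "AND"                       # "AND" or "OR" over subgoals
        subgoals: list = field(default_factory=list)
        concepts: dict = field(default_factory=dict)   # concept -> usage restrictions

    assess_damage = GoalNode(
        name="assess-damage",
        concepts={"ship-name":       {"optional": False, "multi_use": True},
                  "ship-capability": {"optional": True,  "multi_use": True}})
    find_replacement = GoalNode(
        name="find-replacement",
        concepts={"ship-class": {"optional": True,  "multi_use": True},
                  "location":   {"optional": False, "multi_use": True}})
    mission = GoalNode("evaluate-mission", "AND", [assess_damage, find_replacement])

    def prediction_layers(most_likely_goal, active_goals):
        """Most restrictive layer first, then widen over the other active goals."""
        layers = [set(most_likely_goal.concepts)]
        widened = set(most_likely_goal.concepts)
        for goal in active_goals:
            widened |= set(goal.concepts)
            layers.append(set(widened))
        return layers

    print(prediction_layers(assess_damage, [find_replacement, mission]))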

Expanding Predictions into Networks

After the dialogue tracking module has identified the set of concepts which could be referred to in the next utterance, we need to expand these into possible sentence fragments. Since these predicted concepts are abstract representations, they must be translated into word sequences which signify the appropriate conceptual meaning. For each concept, we have precompiled a set of possible surface forms which can be used in an actual utterance. In effect, we reverse the classic understanding process by un-parsing the conceptual representation into all possible word strings which can denote the concept. A predicted concept can be quickly un-parsed into all its possible semantic network grammar subnets.

In addition to the individual concepts, which usually expand into noun phrases, we also have a complete semantic network grammar that has been partitioned into subnets. Each subnet expresses a complete sentence. A subnet defines allowable syntactic surface forms to express a particular semantic content. For example, all ways of asking for the capabilities of ships are grouped together into subnets. The semantic network is further partitioned into separate subnets for elliptical utterances, and subnets for anaphora. The semantic grammar subnets are precompiled to allow direct access for processing efficiency. The terminal nodes in the networks are word categories instead of words themselves, so no recompilation is necessary as new lexical items in existing categories are added to or removed from the lexicon.

The final expansion of predictions brings together the partitioned semantic networks and the predicted concepts which were translated into their surface forms. Through a set of cross-indices, we intersect all predicted concept expressions with all the predicted semantic networks. This operation dynamically generates one combined semantic network grammar which embodies all the dialogue level and sentence level constraints. This dynamically created network grammar is used by the parser to process an input utterance.

To illustrate this point, let us assume that the frigate "Spark" has somehow been disabled. We expect the user to ask for its capabilities next. The dialogue tracking module predicts the "shipname" concept restricted to the value "Spark" and any of the "ship-capabilities" concepts. Single anaphoric reference to the ship is also expected, but ellipsis is not meaningful at this point. The current damage assessment dialogue phase allows queries about features of a single ship. During the expansion of the predicted concepts, we find word nets such as "the ship," "this ship," "the ship's," "this ship's," "it," "its," "Spark" and "Spark's." We also find the word nets for the capabilities such as "all capabilities," "radar," "sonar," "Harpoon," "Phalanx," etc. We then intersect these with the sentential forms allowed during this dialogue phase. Thus we obtain the nets for phrases like "Does {it, Spark, this ship, the ship} have {Phalanx, Harpoon, radar, sonar}," "What {capabilities, radar, sonar} does {the ship, this ship, it, Spark} have," and many more. This semantic network now represents a maximally constrained grammar at this particular point in the dialogue.
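The following sketch illustrates the expansion step on a toy scale: predicted concepts are mapped to their precompiled surface forms, and only those sentence subnets whose concepts are all predicted survive into the dynamic grammar. The tables and patterns are invented for the example; the real subnets are compiled semantic network grammars, not string templates.

    # Illustrative sketch of building a dynamic grammar from predicted concepts.
    SURFACE_FORMS = {
        "ship-name":       ["Spark", "Spark's", "it", "its", "the ship", "this ship"],
        "ship-capability": ["radar", "sonar", "Harpoon", "Phalanx", "all capabilities"],
    }

    SENTENCE_SUBNETS = [
        {"concepts": {"ship-name", "ship-capability"},
         "pattern": "does <ship-name> have <ship-capability>"},
        {"concepts": {"ship-name", "ship-capability"},
         "pattern": "what <ship-capability> does <ship-name> have"},
        {"concepts": {"ship-class", "location"},
         "pattern": "list all <ship-class> ships in <location>"},
    ]

    def expand_predictions(predicted_concepts):
        """Keep only subnets whose concepts were all predicted, and attach word nets."""
        grammar = []
        for subnet in SENTENCE_SUBNETS:
            if subnet["concepts"] <= predicted_concepts:
                slots = {c: SURFACE_FORMS[c] for c in subnet["concepts"]}
                grammar.append({"pattern": subnet["pattern"], "slots": slots})
        return grammar

    # With only the damage-assessment concepts predicted, the third subnet drops out.
    for net in expand_predictions({"ship-name", "ship-capability"}):
        print(net["pattern"])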


Recognizing Speech Using Dynamic Networks

We use the SPHINX system [19] as the basis for our recognizer. SPHINX samples input speech in centisecond frames. Based on the LPC cepstrum coefficients, each frame is then mapped into one of 256 prototype vectors. Vector-quantized speech is also used to train hidden Markov models (HMMs) for phonemes.


HMMs are trained from a corpus of approximately 4,200 sample utterances. Each word is represented in the dictionary as a single sequence of phonemes. The models for words are precompiled by concatenating the HMMs for each phoneme in a word. During recognition, SPHINX performs a time-synchronous beam search known as the Viterbi algorithm, matching word models against the input.

In the MINDS system, we use the active set of semantic networks to control word transitions instead of the word-pair constraints normally used by the SPHINX system. The search begins at the set of initial words for all active subnets. This set includes only currently active words from the dynamically created lexicon for this utterance. As the search matches a word from the input utterance, it transits along the arc in the grammar represented by that word. A score is assigned to each path in the beam, indicating how well the input is matching the HMMs in the path. Paths falling below a threshold score are pruned. The dynamically created semantic network is used to allow only legal word transitions. The network does not affect the score of a path but simply restricts the words which can continue a particular path. If no string of words is found which matches the HMMs better than a certain threshold score, a different grammar and lexicon from a more general set of predictions must be used to re-process the utterance. After the spoken input has been processed, the word string with the best score is passed back to the system for parsing.
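The sketch below shows, at the word level, how a dynamically created network can constrain such a beam search: the grammar dictates which words may extend a path, weak paths are pruned against a threshold, and the best-scoring complete path is returned. The scores are toy log-likelihoods standing in for SPHINX's HMM path scores, and the grammar here is a plain dictionary rather than a compiled semantic network.

    # Illustrative word-level beam search constrained by a dynamic network.
    def beam_search(word_scores, grammar, initial_words, final_words, beam=-8.0):
        paths = [([w], word_scores.get(w, -99.0)) for w in initial_words]
        best_complete = None
        while paths:
            new_paths = []
            for words, score in paths:
                if words[-1] in final_words:
                    if best_complete is None or score > best_complete[1]:
                        best_complete = (words, score)
                for nxt in grammar.get(words[-1], []):      # legal continuations only
                    s = score + word_scores.get(nxt, -99.0)
                    if s > beam:                            # prune weak paths
                        new_paths.append((words + [nxt], s))
            paths = new_paths
        return best_complete

    grammar = {"does": ["spark", "it"], "spark": ["have"], "it": ["have"],
               "have": ["sonar", "radar"]}
    scores = {"does": -1.0, "spark": -1.5, "it": -3.0, "have": -0.5,
              "sonar": -1.0, "radar": -2.5}
    print(beam_search(scores, grammar, initial_words=["does"],
                      final_words=["sonar", "radar"]))
    # -> (['does', 'spark', 'have', 'sonar'], -4.0)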



When Predictions Fail

There are a number of assumptions built into the use of predictions. If a user conforms to our model of a problem-solving dialogue, the advantages are clear. However, we must consider the case when some assumptions are violated. There are two points to consider when predictions fail: We must first be able to identify the situation of failed predictions and then find a way to recover.

In the MINDS system, the first point is accomplished without extra work. When the user speaks an utterance which was not predicted, the speech recognition component usually fails to produce a complete parse. The spoken words do not match the predicted words and receive low probability scores. This may not always be as easy in other recognition systems.

As a mechanism for recovery from failed predictions, the MINDS system always produces several sets of predictions for each utterance. These sets of predictions range from very specific to very general. For the most specific predictions, the system uses all the possible constraints. Each successive set of predictions then becomes more general. The number of levels of constraint relaxation depends on the goal tree structure at that point. Predictions are made more general by assuming additional goal nodes are active. Eventually we reach a level of prediction constraints which is identical to the constraints provided by the full semantic grammar with all possible words. We now can parse any syntactically and semantically legal utterance, disregarding all dialogue considerations. Beyond that we can only relax the constraints to a point where any word can be followed by any other word. This would be necessary if a user spoke an utterance that was not covered by the grammar. In this case, we must rely on heuristics during the semantic interpretation of the utterance to provide a correct meaning and database query. Details of this procedure are described in [38].

When the speech recognition module fails to parse at a particular level of constraints, the next set of predictions is used to reparse the same utterance until a successful parse is obtained. If the user is cooperative and within our predictions, recognition accuracy will be high and response time immediate. As the system backs up over several levels of constraints, the search space of the recognition module becomes larger and processing time increases while accuracy drops. However, the system never experiences a complete loss of continuity when predictions are violated.

EVALUATION OF PROGRESS

Many systems developed by researchers in the artificial intelligence community lack a rigorous evaluation. While the individual systems may incorporate brilliant ideas, it is rarely shown that they are in some way better than other systems based on a different approach. If the research in a field wants to make progress, that progress must be made visible and measurable.

In the field of speech understanding one clear measure of success is recognition accuracy. Recognition accuracy can be measured in terms of word accuracy, sentence accuracy as well as semantic accuracy. Word accuracy is defined here as the number of words that were recognized correctly divided by the number of words that were spoken. In addition to the number of correct words, we also record the number of insertions of extra words by the recognizer. This number is otherwise not reflected in the percentage of correct words. Recognition accuracy thus takes into account deleted words and word substitutions (i.e., "its" was spoken but "his" was recognized). On the other hand, error rate reflects insertions, deletions, and substitutions. If the speech recognizer makes minor errors in recognizing an utterance, but the underlying meaning of the utterance is preserved, the utterance is considered to be recognized semantically accurately even though some words were incorrect. Semantic accuracy therefore is the percentage of sentences with correct meaning. In our system, a sentence is considered semantically correct if the recognition produces the correct database query.
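A minimal sketch of these measures follows. Word accuracy here counts substitutions and deletions against the reference, while the error rate also charges insertions; the alignment uses Python's difflib as a convenient stand-in for the string alignment a scoring tool would perform. The example sentences are ours.

    # Illustrative computation of word accuracy and error rate per reference word.
    import difflib

    def recognition_measures(reference, hypothesis):
        ref, hyp = reference.lower().split(), hypothesis.lower().split()
        subs = dels = ins = 0
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
            if op == "replace":
                subs += min(i2 - i1, j2 - j1)
                dels += max(0, (i2 - i1) - (j2 - j1))
                ins += max(0, (j2 - j1) - (i2 - i1))
            elif op == "delete":
                dels += i2 - i1
            elif op == "insert":
                ins += j2 - j1
        n = len(ref)
        return {"word accuracy": (n - subs - dels) / n,
                "error rate": (subs + dels + ins) / n}

    # "his" substituted for "its": word accuracy 0.75, error rate 0.25.
    print(recognition_measures("what is its mission", "what is his mission"))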


To test the ability of the MINDS system to reduce search space and improve speech recognition performance, we performed two experiments. The first experiment assessed search space reduction caused by predictive use of all pragmatic knowledge sources. The second experiment measured improvement in recognition accuracy rates resulting from the use of layered predictions. Both studies used a test set of data which was independent from the training data used to develop the system. This means that the utterances and dialogues processed by the system to obtain the experimental results had not been seen previously by the system or the developers.

Our test data consisted of 10 problem solving scenarios. These were adapted versions of three actual transcripts of naval personnel solving problems caused by a disabled vessel. The personnel must determine whether to delay a mission, find a replacement vessel or schedule a repair for a later date. They use a database to find necessary problem solving information. In addition, we created seven additional scenarios by paraphrasing the original three. The test scenarios contained an average of nine sentences with an average of eight words each. An excerpt of a dialogue sequence is given in Figure 1.

The training data had consisted of five different problem solving scenarios from transcripts of naval personnel performing the same basic task. The training scenarios were used for developing the user models. Dialogue phases, goals and problem solving plans were derived from an abstract description of the stages and options available to a problem solver. The abstract plan descriptions had been provided by the Navy. Since our database was different from the one used in gathering the original transcripts, we were forced to adapt all scenarios. Lexical items which were unknown to our system were substituted with known words. Ship names, locations, capabilities, mission requirements, etc. were changed to be consistent with our database. We feel these adaptations had minimal impact on the integrity of the data and did not alter the problem solving structure of the task. The lexicon for this domain contained 1,000 words.

Reduction of Search Space and Perplexity

Since the magnitude of the search space is such a critical factor in speech recognition, one measure of success is the reduction in search space provided by a system. To measure the constraint imposed by the knowledge sources we use two measures: perplexity and search space reduction. Perplexity is an information theoretic measure that is widely used in speech systems to characterize the constraint provided by a grammar.


Perplexity consists of the geometric mean of the number of nodes which are visited during the processing of an utterance. In our case, we use the semantic network grammars to calculate the number of word alternatives which the system has to consider. Test set perplexity is computed specifically for actual utterances. After we compute the alternatives for a word in the utterance, we assume the system recognizes this word correctly and continue by only computing the alternatives which directly follow this word in the grammar. A more detailed justification of this measure is given in [17]. The size of the search space is calculated by raising the sentence perplexity value to the number of words in the sentence.
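The two measures can be made concrete with a short sketch. The per-word branching factors below are invented for illustration (they are not measurements from MINDS); the point is only the arithmetic: test set perplexity is the geometric mean of the number of word alternatives at each position, and the search space estimate is that perplexity raised to the number of words in the sentence.

    # Illustrative arithmetic for test set perplexity and search space size.
    import math

    def test_set_perplexity(branching_factors):
        """Geometric mean of the number of word alternatives at each position."""
        logs = [math.log(b) for b in branching_factors]
        return math.exp(sum(logs) / len(logs))

    def search_space(perplexity, sentence_length):
        return perplexity ** sentence_length

    # Invented branching factors for one eight-word utterance in two conditions.
    full_grammar     = [300, 250, 310, 240, 280, 300, 260, 290]
    with_predictions = [20, 15, 25, 12, 18, 20, 16, 19]

    for name, branches in (("full grammar", full_grammar),
                           ("specific predictions", with_predictions)):
        p = test_set_perplexity(branches)
        print(f"{name}: perplexity {p:.1f}, search space {search_space(p, len(branches)):.2e}")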

Our first experiment was designed to test the perplexity and search space reduction resulting from applying pragmatic constraints. To measure the reduction in perplexity and search space we collected test set perplexity measurements for each of the parsed sentences in two conditions. The first condition represented the constraints provided by the complete semantic grammar networks with the full vocabulary available. The second condition measured perplexity for the most specific set of predictions that could be applied. The estimate for the second condition is the perplexity obtained by merging the successful prediction level with all of the more specific but unsuccessful levels of constraints. Otherwise, the results would be misleading whenever the predictions were not fulfilled. As seen in Table I, by applying our best constraints, test set perplexity was reduced by an order of magnitude, from 279.2 to 17.8, while search spaces decreased by roughly 10 orders of magnitude.

TABLE I. Reduction in Branching Factor and Search Space

                           Complete grammar    Most specific predictions
    Test Set Perplexity        279.2                   17.8
    Search Space            3.81 x 10^19            1.01 x 10^9

Improvements in Recognition Accuracy

To evaluate the effectiveness of using predictions on recognition performance we used 10 speakers (8 male, 2 female) who had never before spoken to the system. To assure a controlled environment for these evaluations, each speaker read 20 sentences from the adapted test scenarios provided by the Navy transcripts. Each of these utterances was recorded. The speech recordings were then analyzed by the MINDS system in two conditions. The first condition ignored all constraints except those provided by the complete semantic grammar. In other words, all possible meaningful sentences were acceptable at all times. The second condition used the MINDS system with the most specific set of predictions appropriate for the utterance. To prevent confounding of the experiment due to misrecognized words, the system did not use its normal speech recognition result to change state. Instead, after producing the speech recognition result, the system read the correct recognition from a file which contained the correct set of utterances. Thus, the system always changed state according to a correct analysis of the utterance.

The results can be found in Table II. The system performed significantly better with the predictions. Error rate decreased from 17.9 percent to 3.5 percent. Perhaps just as important is the nature of the individual errors. In the condition with the most specific successful predictions, almost all of the errors (insertions and deletions) were made on the word "the." Another large proportion of errors consisted of substituting the word "his" for the word "its." Furthermore, none of the errors in the "with predictions" condition resulted in an incorrect database query. Hence, semantic accuracy was 100 percent on this sample of 200 spoken sentences.

TABLE II. Recognition Performance

                           Complete grammar    Most specific predictions
    Test Set Perplexity        242.4                   18.3
    Word Accuracy              82.1%                   97.0%
    Insertions                  0.0%                    0.5%
    Semantic Accuracy             85%                    100%
    Deletions                   8.5%                    1.6%
    Substitutions               9.4%                    1.4%
    Error Rate                 17.9%                    3.5%

CONCLUSIONS

It is obvious that the MINDS system represents only a beginning in the integration of speech recognition with natural language processing. We have shown how one can apply various forms of dialogue level knowledge to reduce the complexity of a speech recognition task. Our experiments demonstrated the effectiveness of the added constraints on the recognition accuracy of the speech system. We have also demonstrated that specific predictions can fail and the system will recover gracefully using our mechanism for gradually relaxing constraints.

For this domain, we hand-coded all the goal trees and grammars into the knowledge sources of the system. For larger domains and vocabularies it would be desirable to automate the process of deriving the goal trees and grammars during interactions with the initial users. There is much more work needed on automatic modeling of human problem solving processes based on empirical observation.

We do not claim that these exact results should be obtainable in any domain or any task. Rather, it was our intent to demonstrate the usefulness of dialogue level knowledge for speech recognition. Future spoken language systems dealing with larger domains and very large vocabularies will be well advised to consider


incorporating the kinds of mechanisms described in this article.

Acknowledgements. We are indebted to Raj Reddy, who chaperoned this research effort.

REFERENCES

1. Allen, J.F., and Perrault, C.R. Analyzing intention in utterances. Artif. Intell. 15, 3 (1980), 143-178.
2. Bahl, L.R., Jelinek, F., and Mercer, R.L. A maximum likelihood approach to continuous speech recognition. IEEE Trans. Patt. Anal. and Mach. Intell. 5, 2 (1983), 179-190.
3. Barnett, J. A vocal data management system. IEEE Trans. Audio and Electroacoustics AU-21, 3 (June 1973), 185-186.
4. Biermann, A., Rodman, R., Ballard, B., Betancourt, T., Bilbro, G., Deas, H., Fineman, L., Fink, P., Gilbert, K., Gregory, and Heidlage, F. Interactive natural language problem solving: A pragmatic approach. In Proceedings of the Conference on Applied Natural Language Processing (Santa Monica, Calif., Feb. 1-3, 1983), pp. 180-191.
5. Borghesi, L., and Favareto, C. Flexible parsing of discretely uttered sentences. COLING-82, Association for Computational Linguistics (Prague, July 1982), pp. 37-48.
6. Chapanis, A. Interactive human communication: Some lessons learned from laboratory experiments. In Shackel, B., Ed., Man-Computer Interaction: Human Factors Aspects of Computers and People. Sijthoff and Noordhoff, Rockville, Md., 1981, pp. 65-114.
7. Chin, D.N. Intelligent Agents as a Basis for Natural Language Interfaces. Ph.D. dissertation, Computer Science Division (EECS), University of California, Berkeley, 1988. Report No. UCB/CSD 88-396.
8. Cohen, P.R., and Perrault, C.R. Elements of a plan-based theory of speech acts. Cog. Sci. 3 (1979), 177-212.
9. Erman, L.D., and Lesser, V.R. The Hearsay-II speech understanding system: A tutorial. In W.A. Lea (Ed.), Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, N.J., 1980.
10. Fink, P.E., and Biermann, A.W. The correction of ill-formed input using history-based expectation with applications to speech understanding. Comput. Ling. 12, 1 (1986), 13-36.
11. Frederking, R.E. Natural Language Dialogue in an Integrated Computational Model. Ph.D. dissertation, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, 1986. Tech. Rep. CMU-CS-86-178.
12. Gatward, R.A., Johnson, S.R., and Conolly, J.H. A natural language processing system based on functional grammar. Speech Input/Output: Techniques and Applications, Institute for Electrical Engineers, 1986, pp. 125-128.
13. Grosz, B.J. The representation and use of focus in dialogue understanding. SRI Stanford Research Institute, Stanford, CA, 1977.
14. Hauptmann, A.G., Young, S.R., and Ward, W.H. Using dialog-level knowledge sources to improve speech recognition. In Proceedings of AAAI-88, The 7th National Conference on Artificial Intelligence, American Association for Artificial Intelligence, Saint Paul, MN, 1988, pp. 729-733.
15. Hauptmann, A.G., and Rudnicky, A.I. Talking to computers: An empirical investigation. International J. Man-Machine Studies (in press), 1988.
16. Hayes, P.J., Hauptmann, A.G., Carbonell, J.G., and Tomita, M. Parsing spoken language: A semantic caseframe approach. In Proceedings of COLING-86, Association for Computational Linguistics, Bonn, Germany, August 1986.
17. Kimball, O., Price, P., Roucos, S., Schwartz, R., Kubala, F., Chow, Y.-L., Haas, A., Kramer, M., and Makhoul, J. Recognition performance and grammatical constraints. In Proceedings of the DARPA Speech Recognition Workshop, Science Applications International Corporation Report Number SAIC-86/1546, 1986, pp. 53-59.
18. Lea, W.A. (Ed.). Trends in Speech Recognition. Prentice-Hall, Englewood Cliffs, N.J., 1980.
19. Lee, K.-F. Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System. Ph.D. dissertation, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, 1988. Tech. Rep. CMU-CS-88-148.
20. Levinson, S.E., and Rabiner, L.R. A task-oriented conversational mode speech understanding system. Bibliotheca Phonetica 12 (1985), 149-196.
21. Levinson, S.E., and Shipley, K.L. A conversational-mode airline information and reservation system using speech input and output. The Bell System Technical Journal 59 (1980), 119-137.

Februa y 1989

Volume 32

Number 2

22. Litman, D.J., and Allen, J.F. A plan recognition model for subdialogues in conversation. Cog. Sri. 11, 2 (1987). 163-200. 23. Lowerre, B. and Reddy, R. The Hearsay Speech Understanding System. In Trends in SpeechRecogrlifion, W.A. Lea (Ed.). Prentice-Hall. Englewood Cliffs. N.J.. 1980. 24. Newell, A. and Simon, H.A. Human Problem Solving. Prentice-Hall, Englewood Cliffs, N.J.. 1972. 25. Reddy, R.. and Newell, A. Knowledge and its representation in a speech understanding system. In Knowledge and Cog&ion, Gregg L.W., Ed. L. Erlbaum Associates, Potomac, Md., 1974. pp. 256-282. 26. Sacerdoti. E.D. Planning in a hierarchy of abstraction spaces.Artif. Intell. 5. 2 (1974), 115-135. 27. Sidner, C.L. Focusing for interpretation of pronouns. Amer. 1. Comput. Ling. 7. 4 (Oct.-Dec. 1981). 217-231. 28. Stern, R.M., Ward, W.H., Hauptmann, A.G., and Leon, J. Sentence parsing with weak grammatical constraints. ICASSP-87. 1987. pp. 380-383. Tomabechi, H.. and Tomita, M. The integration of unification-based 29. syntax/semantics and memory-based pragmatics for real-time understanding of noisy continuous speech input. In Proceedingsof AAAI-88, The 7th National Conferenceon Artificial Intelligence, American Association for Artificial Intelligence, 1988, pp. 724-728. Saint Paul, MN. 30. Walker, D.E. SRI research on speech recognition. In W.A. Lea (Ed.) Trends in SpeechRecognition. Prentice-Hall, Englewood Cliffs, N.J., 1980. 31. Ward, W.H.. Hauptmann, A.G., Stern, R.M.. and Chanak, T. Parsing spoken phrases despite missing words. ICASSP-88, 1988. 32. Wilensky. R. Understandinggoal-basedstories. Ph.D. dissertation. Yale University, Sept. 1978. 33. Wilensky, R.. Plamring and Undersfanding. Addison Wesley, Reading, Mass., 1983. 34. Winograd, T. Languageas a Cognitive Process,Volume I: Syntax. Addison Wesley, Reading, Mass., 1982. 35. Wolf, J.J.and Woods, W.A. The HWlH Speech Understanding Systems. In W.A. Lea (Ed.) Trends in SpeechRecognition. Prentice-Hall, En&wood Cliffs, N.J., 1980. 36. Woods, W.A.. Bates, M., Brown, G., Bruce, B.. Cook, C., Klovstad, J., Makhoul, J., Nash-Webber. B., Schwartz, R.. Wolf, J,. and Zue, V. Speech understanding systems-Final technical report. Tech. Rep. 3438, Bolt, Beranek. and Newman. Inc., Cambridge, Mass., 1976. 37. Young, S.R.. Hauptmann, A.G.. and Ward, W.H. An integrated speech and natural language dialog system: Using dialog knowledge in speech recognition. Department of Computer Science, CMU-CS88-128, Carnegie-Mellon University, April, 1988. 38. Young, S.R., and Ward, W.H. Towards habitable systems: Use of world knowledge to dynamically constrain speech recognition. 2d Symposiumou Advanced Man-Machine Interfaces fhrough Spoken Language,Hawaii. Nov., 1988. (submitted].

ABOUT THE AUTHORS:

SHERYL R. YOUNG is a research faculty member of the Computer Science Department at Carnegie Mellon University. She has a B.A. in math and psychology from the University of Michigan and a Ph.D. in cognitive science/psychology from the University of Colorado.

ALEXANDER G. HAUPTMANN is working toward a Ph.D. in computer science at CMU. He has a B.A. and M.A. in psychology from Johns Hopkins University and a Diploma (M.A.) in computer science from the Technische Universitaet Berlin in West Germany.

WAYNE H. WARD is a research associate in the CMU Computer Science Department. He has a B.A. in mathematical science from Rice University and a Ph.D. in psychology from the University of Colorado. Authors’ present address: Young, Hauptmann and Ward, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213.

EDWARD T. SMITH is president of Greenfield Educational Software, 1014 Flemington St., Pittsburgh, PA 15217. His research interests include small educational simulations for grade and high school students.

PHILIP WERNER is a software developer for MAD Intelligent Systems in Cambridge, Massachusetts. He has an M.Sc. in cognitive science from the University of Edinburgh and held a research assistantship while attending CMU as a graduate student.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
