SPOKEN LANGUAGE UNDERSTANDING: A SURVEY
Renato De Mori
LIA – BP 1228 – 84911 Avignon CEDEX 9 (France)
[email protected]
ABSTRACT
A survey of research on spoken language understanding is presented. It covers aspects of knowledge representation, automatic interpretation strategies, semantic grammars, conceptual language models, semantic event detection, shallow semantic parsing, semantic classification, semantic confidence and active learning.
Index Terms— Spoken language understanding, conceptual language models, spoken conceptual constituent detection, stochastic semantic grammars, semantic confidence measures, active learning.
1. INTRODUCTION
Epistemology, the science of knowledge, considers a datum as its basic unit. A datum can be an object, an action or an event in the world and can have time and space coordinates, multiple aspects and qualities that make it different from others. A datum can be represented by an image or it can be abstract and be represented by a concept. Computer epistemology deals with observable facts and their representation in a computer. Knowledge about the structure of a domain represents a datum by an object and groups objects into classes by their properties. Semantics deals with the organization of meanings and the relations between signs or symbols and what they denote or mean [135]. Computer semantics performs a conceptualization of the world using well defined elements of programming languages.
Spoken Language Understanding (SLU) is the interpretation of signs conveyed by a speech signal. This is a difficult task because signs for meaning are mixed with other information like speaker identity and environment. Signs to be used for interpretation have to be defined and extracted from the signal. Meaning is represented by a computer language. Relations between signs and meaning are part of the interpretation knowledge source (KS) and are applied by one or more processes controlled by strategies. The knowledge used is often imperfect.
The transcription of user utterances in terms of word hypotheses is performed by an Automatic Speech Recognition (ASR) system which makes errors. Strategies of some SLU systems perform transformations from signals to words, then from words to meaning. Some strategies are also proposed to transform signals into basic semantic constituents to be further composed into semantic structures.
Programming languages have their own syntax and semantics. The former defines legal programming statements; the latter specifies the operations a machine performs when an instruction is executed. Specifications are defined in terms of the procedures the machine has to carry out. Semantic analysis of a computer program is essential for understanding the behavior of a program and its coherence with the design concepts and goals. Formal logics can be used to describe computer semantics. Computer programs conceived for interpreting natural language differ from the human process they model. They can be considered as approximate models for developing useful applications, interesting research experiments and demonstrations.
Semantic representations in computers usually treat data as objects respecting logical adequacy in order to formally represent any particular interpretation of a sentence. Even if utterances, in general, convey meanings whose relations cannot always be expressed in formal logics ([45], p. 287), formal logics have been considered adequate for representing natural language semantics in many application domains. Logics used for representing natural language semantics should be able to deal with intension (the essence of a concept) and extension (the set of all objects which are instances of a given concept). Computer systems interpret natural language for performing actions such as accessing a database and displaying the results, and may require knowledge which is not coded into the sentence but can be inferred from the system knowledge stored in long or short term memories. It is argued in [135] that a specification for natural language semantics requires more than the transformation of a sentence into a representation. In fact, computer representations should permit, among other things, legitimate conclusions to be drawn from data [72].
Interpretation may require the execution of procedures that specify the truth conditions of declarative statements as well as the intended meaning of questions and commands [135]. Problems and challenges in SLU can be grouped as follows: meaning representation, definition and extraction of signs, conception of interpretation KSs based on relations between signs and meaning and between instances of meaning, processes for sign extraction, generation of hypotheses about units of meaning (also called semantic constituents) and composition of constituents into semantic structures. As processes generate interpretation hypotheses, other challenging problems are the evaluation of confidence for semantic hypotheses, the design of interpretation KSs using human knowledge, automatic learning from annotated corpora, and the collection and semantic annotation of corpora. This report reviews the history of SLU research with particular attention to the evolution of interpretation paradigms, influenced by experimental results obtained with evaluation corpora. This review integrates and complements reviews in [22,73].
2. COMPUTER REPRESENTATIONS OF MEANING
Computer representation of meaning is described by a Meaning Representation Language (MRL) which has its own syntax and semantics. An MRL should follow a representation model coherent with a theory of epistemology, taking into account intension and extension, relations, reasoning, composition of semantic constituents into structures, and procedures for relating them with signs. The semantic knowledge
of an application is a knowledge base (KB). A convenient way for reasoning about semantic knowledge is to represent it as a set of logic formulas. Formulas contain variables which are bound by constants and may be typed. An object is built by binding all the variables of a formula or by composing existing objects. Semantic compositions and decisions about composition actions are the result of an inference process. The basic inference problem is to determine whether KB ⊨ F, which means that KB entails a formula F, that is, F is true in all possible variable assignments (worlds) for which KB is true.
In [135], the possibility of representing semantic relations with links between classes and objects is discussed. The formulas in a KB describe concepts and their relations, which can be represented in a network called a semantic network. A semantic network is made of nodes corresponding to entities and links corresponding to relations. This model combines the ability to store factual knowledge and to model associative connections between entities [135]. Examples of relations are composition functions [44]. The structure of semantic networks can be defined by a graph grammar.
Computer programming classes and objects called frames can be defined to represent entities and relations in semantic networks. In a frame based MRL, a grammar of frames is a model for representing semantic entities and their properties. The language should be able to represent types of conceptual structures as well as instances of them. Such a grammar should generate frames describing general concepts and their specific instances. Part of a frame is a data structure which represents a concept by associating to the concept name a set of roles which are represented by slots. Finding values for roles corresponds to filling the frame slots.
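The entailment test described above can be illustrated for the propositional case by enumerating all worlds. The following is only a minimal sketch under stated assumptions: the function name `entails`, the two-symbol KB and the representation of formulas as Python callables are hypothetical choices made for illustration, and practical systems use far more efficient inference procedures.

```python
from itertools import product

def entails(kb, formula, symbols):
    """Return True if `formula` holds in every world in which all KB formulas hold."""
    for values in product([False, True], repeat=len(symbols)):
        world = dict(zip(symbols, values))
        # A counterexample is a world where the KB is true but the formula is false.
        if all(f(world) for f in kb) and not formula(world):
            return False
    return True

# Hypothetical KB: "city implies location" and "city".
kb = [
    lambda w: (not w["city"]) or w["location"],
    lambda w: w["city"],
]
symbols = ["city", "location"]

print(entails(kb, lambda w: w["location"], symbols))   # True
print(entails(kb, lambda w: not w["city"], symbols))   # False
```

Enumeration is exponential in the number of symbols, which is why it serves here only to make the definition concrete.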
In general, slots contain associations between names of aspects of an entity and its descriptions, also called slot fillers which are constrained to respect certain given types. A slot filler can be the instance of another frame. This is represented by a pointer from the filler to the other frame. Acceptable frames for semantic representations in a domain can be characterized by frame grammars which generate acceptable structures and have rules of the type: : : :
* [ ”of potential slot fillers” : types] .
A frame system is a network of frames. Important examples of MRL semantics are discussed in [135]. Early frame representations were used to represent facts about an object with a property list. For example, a specific address can be represented by the following frame:
{a0001
  instance_of  address
  loc          Avignon
  area         Vaucluse
  country      France
  street       1, avenue Pascal
  zip          84000
}
Here a0001 is a handle that represents an instance of a class which is specified by the value of the first slot. The other slots, made of a property name and a value, define the property list of this particular instance of the class “address”. The above frame can be derived [145], after skolemization from the following logic formula:
(∃x) { instance_of(x, address) ∧ loc(x, Avignon) ∧ area(x, Vaucluse) ∧ country(x, France) ∧ street(x, 1 avenue Pascal) ∧ zip(x, 84000) }
A definition, with a similar syntax but with different semantics, is provided for the address class, which defines the structure of any address:
{address
  loc      TOWN ……attached procedures
  area     DEPARTMENT OR PROVINCE OR STATE ……attached procedures
  country  NATION ……attached procedures
  street   NUMBER AND NAME ……attached procedures
  zip      ORDINAL NUMBER
}
This frame has different semantics since it defines a prototype from which many instances can be obtained. The semantics of an MRL can be described by procedures for generating instances of entities and relations. This characterizes procedural semantics. Procedures for slot filling as well as for frame evocation use methods. Different frames may share slots with similarity links. There may be necessary and optional slots. Fillers can be obtained by attachment of procedures or detectors (e.g. of noun groups), by inheritance or by default. Procedures can also be attached to slots together with the condition under which they have to be executed. Examples of conditions are when-needed and when-filled. Slots may contain expectations or replacements (to be considered if slots cannot be filled).
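The distinction between a class frame with attached procedures and an instance frame can be sketched as follows. This is an illustrative sketch only: the `Frame` class, the `when_needed` dictionary and the `AREA_TO_COUNTRY` lookup are hypothetical names introduced here, not part of any system cited in this survey.

```python
class Frame:
    """A frame with slots, an optional parent (class) frame and when-needed procedures."""

    def __init__(self, name, parent=None, slots=None, when_needed=None):
        self.name = name
        self.parent = parent
        self.slots = dict(slots or {})
        self.when_needed = dict(when_needed or {})

    def get(self, slot):
        # Look for a filler in this frame, then up the inheritance chain.
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            if slot in frame.when_needed:
                # Run the attached procedure on the requesting instance and cache the filler.
                value = frame.when_needed[slot](self)
                self.slots[slot] = value
                return value
            frame = frame.parent
        return None

# Hypothetical lookup used by the procedure attached to the "country" slot.
AREA_TO_COUNTRY = {"Vaucluse": "France"}

address = Frame(
    "address",
    when_needed={"country": lambda inst: AREA_TO_COUNTRY.get(inst.get("area"))},
)
a0001 = Frame("a0001", parent=address,
              slots={"loc": "Avignon", "area": "Vaucluse", "zip": "84000"})

print(a0001.get("country"))  # France
```

The instance stores only what was observed; the missing filler is produced on demand by the procedure inherited from the class frame.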
Descriptions are attached to slots to specify constraints. Given a slot filler for a slot, the attached description can be inferred. Descriptions can be instantiations of a concept carrier and can inherit its properties. Descriptions may have connectives, coreferential links (descriptions attached to a slot are attached to another and vice versa) and declarative conditions.
Verbs are fundamental components of natural language sentences. They represent actions for which different entities play different roles. Actions reveal how sentence phrases and clauses are semantically related to verbs by expressing cases for verbs. A case is the name of a particular role that a noun phrase or other component takes in the state or activity expressed by the verb in a sentence. There is a case structure for each main verb. Attempts were made to map specific surface cases into a deep semantic representation expressing a sort of semantic invariant. Many deep semantic representations are based on deep case n-ary relations between concepts as proposed by Fillmore [31]. Deep case systems have very few cases, each one representing a basic semantic constraint. Case determination may depend on syntactic information (case signals) as well as feature checking (case conditions) and can be done by a case function. This function may return the likelihood that a given preposition term serves the case relationship to the main verb of the sentence. It is possible to use a variable number of preemptive levels. A case function may return a value which preempts any previous use of that case. Cases proposed in (Fillmore, 1968) are:
Agentive (A) - the animate instigator of an action,
Instrumental (I) - inanimate force or object causally involved,
Dative (D) - animate being affected,
Factitive (F) - object or being resulting from an action,
Locative (L) - location or spatial orientation,
Objective (O) - determined by the verb.
The verb determines a predicate P which has associated cases, as in "push [O {A} {I}]". The predicate means that push must have an object O; as {} means optional, push may have an agent A and a force I. The case structure of P is a set of sequences of cases. Cases which are properties of a verb are inner cases (in particular the obligatory ones). Verbs typically specify up to three inner cases, at least one of which must always be realized in any sentence using the verb. Sometimes a particular case must always be present. Verb predicates have arguments characterized by semantic roles which can be cases. Predicates and arguments of this type are related by linguistic structures. Case structures are Relation Semantic Structures (RSS). Values for roles can be complex objects, called Object Semantic Structures (OSS), characterized by properties (Waltz 1981). A verb with its cases and other roles can be represented by a frame as in the following example:
{accept
  is_a :   verb
  subject  [human…..]
  theme    [………….]
  ………………… Other roles…… [………….]
}
Constraints on the types of values obtained with the slot filling procedures are represented between brackets. Slots with constraints about possible fillers have assertional import, while slots filled by objects have structural import. Other representations can be added such as mass terms, adverbial modification, probabilistic information, degree of certainty, time and tense. An instance of a verb frame could be:
{V003
  instance_of :  accept
  subject        user
  theme          [service_004]
  ………………… Other roles…… [………….]
}
which is a representation of the predicate accept(user, service_004).
Humans communicate with computers with discourse actions that include speech acts. An example of a speech act is a request, as in the following question: “What is the zip code of Avignon?”. Using the notation in (Allen, 1987), the meaning of this question can be represented by the following logical sentence:
REQUEST(user, system, INFORMREF(y, loc(G1, Avignon) ∧ zip(G1, y)))
A speech act REQUEST is a function having as arguments the user, which is the agent of the request, the system, which is the destination of the request, and the theme, which is the result of the function INFORMREF. The function INFORMREF returns the value of y for which loc(G1, Avignon) ∧ zip(G1, y) is true. G1 is a constant obtained by skolemization of the existential quantifier in the logical description of frame a0001. REQUEST returns the value of INFORMREF. In order to evaluate INFORMREF(y, loc(G1, Avignon) ∧ zip(G1, y)), a database is consulted. In a relational database, a conceptual entity corresponds to a table, and an instance corresponds to a row. Its content is logically represented by a collection of instances of the frame address. Interesting books [44, 4, 2, 146, 148] describe various types of semantic knowledge and their use.
A common aspect of many of them is that it is possible to represent complex relational structures with nonprobabilistic schemes that are more effective than context-free grammars. For example, in KL-ONE [147] concept descriptions account for the internal structure with role/filler descriptions and for a global
structural description (SD). Roles have substructures with constraints specifying types and quantities of fillers. SDs are logical expressions indicating how role fillers interact. Role descriptions contain value restrictions. Epistemological relations are defined for composing conceptual structures. They may connect formal objects of the same type and account for inheritance. It is important to point out that semantic knowledge is, in general, context-sensitive. Semantic relations are used to compose instances of conceptual structures. Functions are examples of relations with their arguments. An example of a function with one argument, using the notation in [44], is:
[Place IN ([Thing LOC])]
Subscripts are ontological category variables. IN indicates a function whose argument follows between parentheses. Selectional restrictions are general semantic restrictions on arguments. In the above example, LOC is a restriction for Thing. Restrictions use senses which are part of type hierarchies or ontologies. Disjunctions are represented within curly brackets.
An interpretation is an instance of a semantic structure in which a restriction is bound to a type/token pair, such as (City Paris) for LOC. General basic building blocks for conceptual structures are lexical items with associated constraints of various types and patterns of semantic structures. In [81], schemas containing roles and other information are proposed as active structures to model events and capture sequentiality. A popular example of MRL is the Web Ontology Language (OWL) [87], which integrates some of the most important requirements for computer semantic representation. A heterarchical architecture based on a KB made of situation-action (production) rules is described in [29].
3. SYNTACTIC AND SEMANTIC ANALYSIS FOR INTERPRETATION
A generic architecture structure for performing the SLU process is shown in Figure 1. KS indicates knowledge sources which are stored in a long term memory with the acoustic models (AM) and language models (LM), while hypotheses are written into a short term memory (STM). The content of the STM can be used for adapting some KSs. An initial, considerable effort in SLU research was made with an ARPA project started in 1971. The project is reviewed in [55] and included approaches mostly based on Artificial Intelligence (AI) for combining syntactic analysis and semantic representation in logic form. Some early SLU systems have an architecture shown in Figure 2. A sequence of word hypotheses is generated by an ASR system. Interpretation is performed with the same approaches used for written text. S indicates speech. W indicates written text. Control indicates control strategies.
[Figure 1 – Generic architecture structure for performing the SLU process. Labels: speech, signs, words, concept tags, concept structures, MRL description, speech to conceptual structures and MRL, Long Term Memory (AM, LM, interpretation KSs), Short Term Memory, learning, dialogue.]
[Figure 2 – System architecture of early SLU systems. Labels: speech, ASR, S control, ASR KS, text, W control, W KS, WLU, meaning.]
It is not clear how concepts relate to words. The knowledge that relates these two levels has to contain patterns of word sequences for each conceptual constituent. Patterns used for detecting different constituents in the same sentence may share components and may capture context dependences, but are
of finite length because sentences, especially spoken sentences, contain a finite and often small number of words. Finite state models are thus appropriate for concisely representing these patterns. In contrast, semantic relations may use components hypothesized in different sentences and generate structures which may belong to a context sensitive language. Sequences of conceptual constituents may have to satisfy constraints which are different from the constraints imposed on words expressing a conceptual constituent. Furthermore, semantic relations are language independent, while relations between conceptual constituents and words are language dependent. It was assumed, as stated for example in [128], that a semantic analyzer has to work with a syntactic analyzer and produce data acceptable to a logical deductive system. This is motivated by arguments, for example in [44], that each major syntactic constituent of a sentence maps into a conceptual constituent, but the inverse is not true. For example, adapting the notation in [44], a sentence requesting a restaurant near the Montparnasse metro station in Paris can be represented by the following bracketed conceptual structure expression:
Γ: [Action REQUEST ([Thing RESTAURANT], [Path NEAR ([Place IN ([Thing MONTPARNASSE])])])]
The formalism is based on a set of categories. Each category, e.g. Place, can be elaborated as a Place-function, e.g. IN, and an argument. The expression Γ can be obtained from a syntactic structure like this:
Ψ: [S [VP [V give, PR me] NP [ART a, N restaurant] PP [PREP near, NP [N Montparnasse, N station]]]]
Concerning the relation between syntax and semantics, in [44], it is observed that:
• Each major syntactic constituent of a sentence maps into a conceptual constituent, but the inverse is not true.
• Each conceptual constituent supports the encoding of units (linguistic, visual,…).
• Many of the categories support the type/token distinction (e.g. place_type, place_token).
• Many of the categories support quantification.
• Some realizations of conceptual categories in conceptual structures can be decomposed into a function/argument structure.
• Various types of relations, such as IS_A and PART_OF, hold between conceptual constituents. These relations can be used to infer the presence of a constituent in a sentence given the presence of other constituents.
Assuming that natural languages are susceptible to the same kind of semantic analysis as programming languages, in [78], it is suggested that each syntactic rule of a natural language generative grammar is associated with a semantic building procedure that turns the sentence into a logic formula. An association of semantic building formulas with syntactic analysis is proposed in categorial grammars conceived for obtaining a surface semantic representation [62]. The syntax of a language is seen as an algebra, and grammatical categories are seen as functions. Lexical representations have an associated syntactic pattern that suggests possible continuations of the syntactic analysis and the semantic expression to be generated, as shown in the following fragment of the lexicon:
write    (S\NP)/NP    λxλy ((WRITE x) y)
Mary     S/(S\NP)     λf (f Mary)
a        NP/N         λx (an x)
letter   N
Elements are associated with a syntactic category which identifies them as functions and specifies the type and directionality of their arguments and the type of their results. So, in the example "Mary writes a letter", when "writes" in the data is matched with the lexical entry for "write", the associated function (S\NP)/NP is applied. The symbol / indicates a forward function application that looks for a match with an NP following "writes" and requires the evaluation of the function (S\NP). The word "a" has the lexical entry <a NP/N>. This causes the execution of another forward function application that looks for a noun following "a". As the noun "letter" is found, the semantic function λx (an x) is executed, returning (a letter), which is associated with the assertion of an NP that now matches the expectation of (S\NP)/NP. The x of λxλy ((WRITE x) y) is bound to (a letter), leading to λy ((WRITE a letter) y). Now the backward function S\NP has to be executed. The symbol \ means that the function will look backward for a match with a lexical entry with label NP, which is found by performing the forward execution of the function associated with the lexical entry <Mary S/(S\NP)>. The function considers the assertion of S if what follows is asserted. This is true because it is the backward expectation of the verb and NP is a rewriting for Mary. As a result of matching, y is bound to Mary, producing the semantic representation ((WRITE a letter) Mary) and causing the assertion of the start symbol S, with which the analysis of the sentence to be interpreted is successfully completed. Parsing a sentence results in asserting logical sentences from which frames can be instantiated and slots filled by suitable procedures. Semantic knowledge is associated, in this case, with lexical entries, and logic formulas are composed by actions performed during parsing.
Composition knowledge is associated with grammar rules and is seen as a grammar augmentation.
The use of a lexicon with Montague grammars is discussed in detail in [26].
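The semantic side of the "Mary writes a letter" derivation discussed above can be reproduced with ordinary function application. The sketch below is illustrative only: it hard-codes the order of applications instead of deriving it from the syntactic categories, represents logical forms as nested Python tuples, and uses the article "a" in the output, all of which are assumptions made for this example.

```python
# Semantic expressions of the lexicon fragment, written as Python functions.
write = lambda x: lambda y: (("WRITE", x), y)   # category (S\NP)/NP
mary = lambda f: f("Mary")                      # category S/(S\NP)
a = lambda x: ("a", x)                          # category NP/N
letter = "letter"                               # category N

# Forward application: NP/N applied to the following N gives an NP.
noun_phrase = a(letter)
# Forward application: (S\NP)/NP applied to the following NP gives S\NP.
verb_phrase = write(noun_phrase)
# S/(S\NP) applied to the verb phrase binds y to Mary and asserts S.
sentence = mary(verb_phrase)

print(sentence)  # (('WRITE', ('a', 'letter')), 'Mary')
```

The nested tuple mirrors the logical form ((WRITE a letter) Mary) obtained at the end of the derivation.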
The organization of lexical knowledge for sentence interpretation has recently been the object of investigation. VerbNet [54] is a manually developed hierarchical verb lexicon. For each verb class, VerbNet specifies the syntactic frames along with the semantic role assigned to each slot of a frame. Modelling joint information about the argument structure of a verb is proposed in [123]. In the WordNet Project [75], a word is represented by a set of synonymous senses belonging to an alphabet of synsets. It can be used for word sense disambiguation. Suitable procedures can be attached to frames to generate logical sentences from slots when they are filled. Details on the use of syntax and semantics for natural language understanding can be found in [2].
Slot filling procedures can be executed under the control of a parser or, in general, by precondition-action rules. As natural language is context sensitive, procedural networks for parsing under the control of Augmented Transition Network Grammars (ATNG) were proposed. ATNGs [134] are augmentations of Transition Network Grammars (TNGs). TNGs are made of states and arcs. The input string is analyzed during parsing from left to right, one word at a time. The input word and the active state determine the arc followed by the parser. Arcs have types, namely CAT (to read an input symbol), PUSH (to transfer the control to a sub-network) and POP (to transfer the control from a sub-network to the network that executed the PUSH to it). In ATNGs, condition testing and register setting actions are associated with certain arcs. Actions set the content of registers with linguistic feature values and can also be used for building parse trees. It is also possible to introduce actions of the type BUILD associated with an arc to compose a parse tree or to generate semantic interpretations. An example of ATNG is shown in Figure 3.
[Figure 3 – Example of ATNG. Labels: NP network with arcs DET, N and JMP, action SETR NP (current word), POP.]
Different ATNGs can be used in cascade for parsing and interpretation. An arc type TRANSMIT transfers syntactic structures from the syntactic to the semantic ATNG. Augmentations for generating semantic hypotheses are shown in the following example:
(atn np
  (CAT determiner)
  (optional* (CAT adjective))
  (CAT noun)
  (BUILD semantics ……..))
If a portion of a parse tree can be mapped into a semantic symbol of an MRL, then this symbol could be used as a nonterminal in a grammar which integrates syntactic and semantic knowledge. In [135], syntactic, semantic and pragmatic knowledge are integrated into procedural semantic grammar networks in which symbols for sub-networks can correspond to syntactic or semantic entities.
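The noun-phrase network given above (a determiner, any number of adjectives, a noun, then a BUILD action) can be traversed by a very small procedure with registers. This is a hedged sketch: the lexicon, the register names and the shape of the structure produced by the BUILD step are hypothetical, since the original example elides the content of BUILD.

```python
# Hypothetical lexicon mapping words to syntactic categories.
LEXICON = {
    "the": "determiner", "a": "determiner",
    "cheap": "adjective", "good": "adjective",
    "restaurant": "noun", "flight": "noun",
}

def parse_np(words):
    """Traverse CAT determiner, optional* CAT adjective, CAT noun.

    Returns (semantics, remaining_words) on success, or None if an arc fails.
    """
    registers = {"det": None, "adjs": [], "head": None}
    i = 0
    # CAT determiner arc with a register setting action.
    if i >= len(words) or LEXICON.get(words[i]) != "determiner":
        return None
    registers["det"] = words[i]
    i += 1
    # optional* CAT adjective: loop on the same state.
    while i < len(words) and LEXICON.get(words[i]) == "adjective":
        registers["adjs"].append(words[i])
        i += 1
    # CAT noun arc.
    if i >= len(words) or LEXICON.get(words[i]) != "noun":
        return None
    registers["head"] = words[i]
    i += 1
    # BUILD action (assumed form): compose a semantic hypothesis from the registers.
    semantics = {"entity": registers["head"], "modifiers": registers["adjs"]}
    # POP: return control and the unread input to the calling network.
    return semantics, words[i:]

print(parse_np("a cheap restaurant near Montparnasse".split()))
```

Returning the unread words plays the role of the POP arc, so that a calling network can continue from where the sub-network stopped.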
An example of a portion of a parse tree containing semantic nonterminal symbols is shown in the following grammar fragment:
TOLOC -> to CITY | ……..
CITY -> London | …….
In [139], TNGs are proposed as procedural attachments to frame slots. A chart parser can be activated for each TNG under the control of the interpretation strategy. In [133], a search algorithm was implemented in which the TNG was employed during ASR decoding. There are several ways of using syntactic and semantic analysis. In most systems, a semantic analyzer has to work with a syntactic analyzer and produce input for a logical deductive system. Grammars can be represented in logic form and parsing can be seen as theorem proving or problem solving. Syntactic and semantic knowledge can be represented with a single logic formalism. Attempts have been made to do the opposite and represent everything by a grammar (Woods 1976, pragmatic grammars). This can be used in an architecture having the scheme shown in Figure 4.
[Figure 4 – Architecture with an integrated interpretation knowledge. Labels: speech, ASR, signs, word lattices, interpretation, concept structures, MRL description, AM, LM, linguistic KS, Short Term Memory, learning, dialogue.]
In [127] a best first parser is used. Its results trigger activations in a partitioned semantic network with which inferences and predictions are performed by spreading node activation through links. Tree Adjoining grammars (TAG) also integrate syntax and logic form (LF) semantics [114].
Classification based parsing may use Functional Unification grammars (FUG), Systemic Grammars (SG), or Head Driven Phrase Structure Grammars (HDPSG), which are declarative representations of grammars with logical constraints stated in terms of features and category structure. Semantics may also drive the parser, causing it to make attachments in the parse tree. Semantics can resolve ambiguities and translate English words into semantic symbols using a discriminant net for disambiguation. An interesting example of interleaving syntax and semantics in a parser is proposed in [25].
Semantic parsing is discussed in [144]. A semantic first parser is described in [143]. Simple grammars are used for detecting possible clauses, then classification-based parsing completes the analysis with inference [51]. 4. PARTIAL PARSING AND FALLBACK FOR SLU
Early experiments in SLU made clear the necessity of analyzing portions of a sentence when the complete sentence could not be analyzed. Problems of this type arise because spoken language very often does not follow a formal grammar, hesitations and repetitions are frequent, and available parsers do not ensure full coverage of possible sentences even in the case of written text. Furthermore, ASR systems make errors. In [136], ATNGs were proposed to interpret parts of a sentence using a middle-out analysis of the input words. A scope specification is associated with grammar actions. Parsing can proceed to the left or to the right of the input word. The scope specification indicates a set of states the parser has to have passed through before the action can be safely performed. If this is not the case, the action is delayed. Another approach to avoid parsing an entire sentence under the control of a single grammar consists in using specific TNGs for each frame slot, as in the Phoenix system [139]. In early versions of the system, the input to Phoenix was the top hypothesis of the speech recognition component. Subsequently [133], a search algorithm was implemented in which information from the TNG slot parsers was employed during the A* portion of the recognizer. Adopting the more conventional approach, in which the natural language component rescores a set of N-best hypotheses generated with standard N-gram language models, did not yield better recognition performance. On the other hand, it did yield significant improvement in understanding performance. The score for a frame was simply the number of words in an utterance it accounts for, though certain non-content words are ignored. In [112], it is proposed to relax parser constraints when a sentence parser fails. This permits the recovery of phrases and clauses that can be parsed. Fragments obtained in this way are then fused together.
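The frame scoring rule attributed above to Phoenix (count the words of the utterance that a frame accounts for, ignoring certain non-content words) can be sketched as follows. The frames, slot patterns and non-content word list are invented for illustration, and simple word-sequence matching stands in for the slot-level TNG parsers of the actual system.

```python
# Hypothetical frames: each slot lists word patterns that can fill it.
FRAMES = {
    "flight": {"origin": [["from", "boston"]], "destination": [["to", "denver"]]},
    "ground": {"service": [["limousine"]], "city": [["in", "denver"]]},
}
# Hypothetical list of non-content words that do not contribute to the score.
NON_CONTENT = {"to", "from", "in", "the", "a"}

def frame_score(words, slots):
    """Count content words covered by at least one slot pattern of the frame."""
    covered = set()
    for patterns in slots.values():
        for pattern in patterns:
            n = len(pattern)
            for i in range(len(words) - n + 1):
                if words[i:i + n] == pattern:
                    covered.update(range(i, i + n))
    return sum(1 for i in covered if words[i] not in NON_CONTENT)

def best_frame(words):
    """Select the frame that accounts for the largest number of content words."""
    return max(FRAMES, key=lambda name: frame_score(words, FRAMES[name]))

utterance = "show flights from boston to denver".split()
print(frame_score(utterance, FRAMES["flight"]), best_frame(utterance))  # 2 flight
```

Words outside any pattern, such as "show", are simply skipped, which is what makes this kind of scoring tolerant to unparsable material.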
Other solutions for partial parsing were proposed using finite state grammars. As stochastic versions of them were developed, they will be reviewed later on. More complex systems using fallback were proposed. They are described in some detail in (De Mori, 1998, ch. 14) and briefly reviewed in the following.
The DELPHI system [10] contains a number of levels, namely, syntactic (using Definite Clause Grammar, DCG), general semantics, domain semantics and action. Various translations are performed using links between representations at various levels. DCG rules have LHS and RHS elements with an associated functor (their major category) and zero or more features in a fixed-arity positional order. Features are slots that can be filled by terms. Terms can be variables or functional terms. Semantic representation is based on frames. A grammatical relation has a component that triggers a translation relation. Binding operates on the semantic interpretation of the arguments to produce the semantic interpretation of a new phrase. In this way semantic fragments are built. DELPHI contains a linguistic analyzer that generates the N best hypotheses using a fast, simple algorithm, and then repeatedly rescores these hypotheses by means of more complex, slower algorithms. In this manner, several different knowledge sources can contribute to the final result without complicating the control structure or significantly slowing down derivation of the final result. The first version of DELPHI used a chart-based unification parser (Austin et al., 1991). An important and useful feature of this parser, which has been retained in all subsequent versions, was the incorporation of probabilities for different senses of a word and for application of grammatical rules. These probabilities are estimated from data and used to reduce the search space for parsing. A robust fallback module has been incorporated in successive versions. The fallback understanding module within DELPHI was called if the unification chart parser failed. Rather than employing the semantic module to assign an explicit natural-language score to hypotheses, DELPHI tried to parse the first N=10 hypotheses completely, stopping when a complete interpretation could be generated.
If that failed, another pass through these ten hypotheses would be made with the fallback module, which tried to generate a robust interpretation from parsed fragments left over from the first, failed parse. The fallback module was itself made up of two parts: the Syntactic Combiner and the Frame Combiner. The Syntactic Combiner used extended grammatical rules that skipped over intervening material in an attempt to generate a complete parse. If the attempt failed, the Frame Combiner tried to fill slots in frames in a manner similar to that of SRI's Template Matcher. The Frame Combiner used many pragmatic rules, obtained through study of training data, which could not be defended on abstract grounds. For instance, interpretations which combine flight and ground transportation information are ruled out because they are never observed in the data, even though a query like "Show flights to airports with limousine service" is theoretically possible. Surprisingly, the fallback module worked better if only the Frame Combiner, but not the Syntactic Combiner, was included. In order to increase robustness and reduce reliance on the fallback module, a semantic graph data structure was introduced, and syntactic evidence was considered as only one way of determining the semantic links out of which the graph is built. A semantic graph is a directed acyclic graph in which nodes correspond to meanings of head words (e.g. arrival, flight, Boston) and the arcs are binary semantic relations. The basic parsing operation is that of linking two disconnected graphs with a new arc. If the chart parser does not succeed in connecting such disconnected graphs, the Semantic Linker is invoked. Fragments are lexical nodes, combination is graph completion through search, link probabilities
are derived from a corpus. This component can ignore fragment order, skip over unanalyzable material, and even "hallucinate" a new node if that is the only way to link fragments.
Semantically driven parsers use pattern matching to recognize items that match with the data. Matching may start with lexico-semantic patterns for instantiating initial lexical items. Interpretations are built by adding non-lexical items inferred by a search algorithm. Semantic labels can be attached to parse tree nodes as a result of the application of a rule whose premise matches words, syntactic categories and available or expected semantic labels. Different types of matchers can be designed for different purposes. When the purpose is solely retrieval, a vector of features may adequately represent the content of a message. Different structures are required if the purpose is that of obtaining a conceptual representation to be used for data-base access or for a dialogue whose goal is the execution of an action. Finite state pattern matchers, lexical pattern matchers, and sentence level pattern matchers are discussed in (Hobbs and Israel, 1994).
Unanticipated expressions and difficult constructions in the spoken language cause problems for a conventional approach. A few types of information account for a very large proportion of utterances. A template matcher (TM) tries to build templates (four basic ones, instantiated in various ways). The system developed at Stanford Research Institute (SRI) consists of two semantic modules yoked together: a unification-grammar-based module called "Gemini", and the TM, which acts as a fallback if Gemini cannot produce an acceptable database query. Gemini is a unification-based natural-language parser that combines general syntactic and semantic rules for English with an ATIS-specific lexicon and sortal/selectional restrictions.
Templates have slots filled by looking for short phrases produced by the recognizer, even if not all the words have been hypothesized correctly. Instantiation scores are basically the percentage of correct words. The template with the best score is used to build the query. The input to the TM is the top word sequence hypothesis generated by the speech recognition component, which uses a bigram language model. The TM simply tries to fill slots in frame-like templates. An early version had just 8 templates dealing with flights, fares, ground transportation, meanings of codes and headings, aircraft, cities, airlines and airports. The different templates compete with each other on each utterance; all are scored, and the template with the best score generates the database query (provided its score is greater than a certain cut-off). Slots are filled by looking through the utterance for certain phrases and words. Here is a typical example. For the utterance "Show me all the Delta flights Denver to Atlanta nonstop on the twelve of April leaving after ten in the morning", the following flight template would be generated:
[flight, [stops, nonstop], [airline, DL], [origin, DENVER], [destination, ATLANTA], [departing_after, [1000]], [date, [april, 12, current_year]]].
The score for a template is basically the percentage of words in the utterance that contribute to filling the template.
Early experiments in SLU made clear the necessity of analyzing portions of a sentence when the complete sentence could not be analyzed. Problems of this type may be due to the fact that spoken language very often does not follow a formal grammar, hesitations and repetitions are frequent, and available parsers do not ensure full coverage of possible sentences even in the case of written text. More details and references can be found in ([22], ch. 14).
5. FINITE STATE PROBABILISTIC MODELS FOR INTERPRETATION
Even if there are relations between semantic and syntactic knowledge, integrating these two types of knowledge into a single grammar formalism may not be the best solution. Many problems of automatic interpretation in SLU systems arise from the fact that many sentences are ungrammatical, the ASR components make errors in hypothesizing words and grammars have limited coverage. These considerations suggest that it is worth considering specific models for each conceptual constituent. In addition to partial parsing [51] and back-off, in the Air Travel Information System (ATIS) project, it was found useful to conceive devices for representing knowledge whose imprecision is characterized by probability distributions. It was also found useful to obtain model parameters by automatic learning using manually annotated corpora. This works as long as manual annotation is easy, reliable and ensures a high coverage. The SLU system architecture evolved according to the scheme shown in Figure 5. Figure 6 shows the types of KSs mostly used in the 1990s. The modular architecture is based on knowledge and automatic learning capability. Different types of SLU knowledge are reviewed in this and the following sections. Stochastic finite-state approximations of natural language knowledge are practically useful for this purpose. Finite-state approximations of context-free grammars are proposed in [89]. Approximations of TAG grammars are described in [97]. A review of these approximations is provided in [28]. Let us assume that a concept C is expressed by a user in a sentence W which is recognized by an ASR system, based on acoustic features Y. This can be represented as follows: Y →e W →e C. Symbol →e indicates an evidential relation meaning that if Y is observed then there is evidence of W and, because of this, there is evidence of C.
There are exceptions to this chain of rules, because a different concept C' can be expressed by W, and Y can generate other hypotheses W' which express other concepts. Furthermore, C can be expressed by other sentences W_j which can be hypothesized from Y. The presence of C in a spoken message described by Y can only be asserted with probability:
\[ P(C \mid Y) \approx \frac{1}{P(Y)} \left[ \sum_{j} P(Y \mid W_j)\, P(C W_j) \right] \]
Let us now assume that C is a sequence of hypotheses about semantic constituents. The following decision strategy can be used to find the most likely sequence C':
\[ C' = \arg\max_{C} P(C \mid Y) = \arg\max_{C} P(Y \mid W)\, P(C W) \]
Word hypotheses are generated by an ASR system using a probabilistic language model (LM).
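As a minimal illustration of the approximation above, the following Python sketch sums P(Y|W_j) P(C W_j) over a short N-best list and normalises the result. The function name `concept_posteriors`, the toy hypotheses and all probability values are hypothetical. P(Y) is approximated here by the total mass of the listed hypotheses, which is an assumption of this sketch and not a procedure described in the surveyed systems.

```python
from collections import defaultdict


def concept_posteriors(nbest, joint):
    """Approximate P(C | Y) from an N-best list.

    nbest : list of (W_j, P(Y | W_j)) pairs (hypothetical acoustic scores)
    joint : dict mapping W_j to {C: P(C W_j)} (hypothetical joint model)
    """
    scores = defaultdict(float)
    for words, p_y_given_w in nbest:
        for concept, p_cw in joint.get(words, {}).items():
            # One term of the sum over j: P(Y | W_j) * P(C W_j)
            scores[concept] += p_y_given_w * p_cw
    # Assumption: P(Y) is approximated by the mass of the listed hypotheses.
    p_y = sum(scores.values())
    if p_y == 0:
        return {}
    return {concept: s / p_y for concept, s in scores.items()}


# Toy data, invented for illustration only.
nbest = [
    ("flights to boston", 0.6),
    ("flight to austin", 0.3),
    ("fights to boston", 0.1),
]
joint = {
    "flights to boston": {"destination=BOS": 0.02},
    "flight to austin": {"destination=AUS": 0.01},
    "fights to boston": {"destination=BOS": 0.001},
}

posteriors = concept_posteriors(nbest, joint)
best_concept = max(posteriors, key=posteriors.get)
print(best_concept, posteriors)
```

In this toy case two different word hypotheses support the same concept, so their contributions add up, which is the effect the summation over W_j is meant to capture.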
Figure 5 – Architecture with translation and structural KSs (block diagram; recoverable labels: learning, structural KS, AM, LM, trans KS, speech, signs, words, speech to MRL constituents, constituents to structures, concept structures, concept tags, Short Term Memory, dialogue)
Figure 6 - Architecture with stochastic knowledge and automatic learning capability (block diagram; recoverable labels: ASR KS, speech, ASR, words, lattices, S, control, matchers, parsers, translators, corpus, learning, SFSM, n-grams, SCFG, meaning, SLU)
A solution based on the concepts introduced above is implemented in the system called Chronus [90]. The core of this system is a stochastic model whose parameters are learned from a corpus in which
semantic constituents are associated with sentence chunks. The conceptual decoder at the core of Chronus is based on a view of utterances as generated by an HMM-like process whose hidden states correspond to meaning units called concepts. Thus, understanding is a decoding of these concepts hidden in an utterance. In the Chronus system, the probability P(CW) is computed as P(CW) = P(W|C) P(C), where P(C) is obtained with concept bigram probabilities. A version of Chronus obtained the best score on the 1994 natural language (NL) benchmark. It was based on the following principles:
• locality - the analysis of the entire sentence is delayed as long as possible,
• learnability - everything that can be learned automatically from data should be,
• patchability - it should be easy to introduce new knowledge into the system,
• separation - among algorithms, and between general and specific knowledge,
• habitability - the focus should be on robustness to unexpected non-linguistic phenomena and recognizer mistakes, rather than on dealing with rare, complex linguistic events.
The success of this system is in some respects surprising, given that the conceptual decoder chops an utterance up into non-overlapping segments, which to a first approximation are considered to contribute to the meaning independently of each other (interactions are handled by the "interpreter", a small hand-coded module, at a later stage of processing). A later version of Chronus has four main modules: the lexical analyzer, the conceptual decoder, the template generator, and the interpreter. The input to the lexical analyzer is the top hypothesis generated by the recognizer. The lexical analyzer recognizes predefined semantic categories, which group together all possible idiomatic variants of the same word or fixed phrase: for instance, "JFK", "Kennedy Airport", "Kennedy International Airport", "New York City International Airport" are all assigned to the same semantic category.
The lexical analyzer also groups together singular and plural forms of a word, and inflectional variants of a verb, thus achieving robustness to minor speech recognition errors. The conceptual decoder views the modified word sequences emerging from the lexical analyzer as conceptual Hidden Markov Models (conceptual HMMs), with the words being the observations and the concepts being the states. Concept sequences are currently modeled via a bigram language model, and the sequence of words within a concept is modeled as a concept-dependent N-gram language model. The function of the conceptual decoder is to segment an utterance into phrases, each representing a concept. This is equivalent to finding the most likely sequence of states in the conceptual HMM, given the sequence produced by the lexical analyzer. The choice of conceptual units is a domain-dependent design decision. For ATIS, some concepts relate directly to database entities (e.g., "destination", "origin", "aircraft_type") and others are more linguistic (e.g., "question", "dummy" for irrelevant words, and "subject" for what the user wants to know). Once these units have been defined, the parameters of the conceptual HMM must be estimated from a training corpus of segmented, labeled word sequences by means of the Viterbi training algorithm for HMMs. This process can be bootstrapped. A typical output from the conceptual decoder might look like this:
wish : I WOULD LIKE TO GO
origin : FROM NEW YORK
destin : TO SAN FRANCISCO
day : SATURDAY
time : MORNING
aircraft : PREFERABLY ON A BOEING SEVEN FORTY SEVEN
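The decoding step performed by a conceptual decoder of this kind can be sketched as a Viterbi search in which the states are concepts and the observations are words. The following Python code is an illustrative sketch and not the Chronus implementation: the function names, the toy probabilities, the choice of a concept-dependent word bigram that restarts whenever the concept changes, and the probability floor used for unseen events are all assumptions made here.

```python
import math
from itertools import groupby

FLOOR = 1e-6  # assumed probability for unseen events (hypothetical smoothing)


def logp(table, key):
    return math.log(table.get(key, FLOOR))


def viterbi(words, concepts, start, trans, emit):
    """Return the most likely concept label for each word.

    start[c]        : P(c) for the first concept
    trans[(p, c)]   : concept bigram P(c | p)
    emit[(c, v, w)] : P(w | v, c), where v is the previous word inside the
                      same concept segment, or None at the start of a segment
    """
    if not words:
        return []
    # best[c] = (log-score of the best path ending in concept c, that path)
    best = {c: (logp(start, c) + logp(emit, (c, None, words[0])), [c])
            for c in concepts}
    for t in range(1, len(words)):
        new_best = {}
        for c in concepts:
            candidates = []
            for p in concepts:
                # The word context is kept only while the concept is unchanged.
                context = words[t - 1] if p == c else None
                score = (best[p][0] + logp(trans, (p, c))
                         + logp(emit, (c, context, words[t])))
                candidates.append((score, best[p][1]))
            score, path = max(candidates, key=lambda item: item[0])
            new_best[c] = (score, path + [c])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def segment(words, labels):
    """Group consecutive words that share a concept label into chunks."""
    chunks, i = [], 0
    for concept, run in groupby(labels):
        n = len(list(run))
        chunks.append((concept, " ".join(words[i:i + n])))
        i += n
    return chunks


# Toy model, invented for illustration only.
words = "FROM NEW YORK TO SAN FRANCISCO".split()
concepts = ["origin", "destin"]
start = {"origin": 0.5, "destin": 0.5}
trans = {("origin", "origin"): 0.7, ("origin", "destin"): 0.3,
         ("destin", "destin"): 0.7, ("destin", "origin"): 0.3}
emit = {("origin", None, "FROM"): 0.9, ("destin", None, "TO"): 0.9}
for c, head in (("origin", "FROM"), ("destin", "TO")):
    emit[(c, head, "NEW")] = 0.5
    emit[(c, "NEW", "YORK")] = 0.9
    emit[(c, head, "SAN")] = 0.5
    emit[(c, "SAN", "FRANCISCO")] = 0.9

labels = viterbi(words, concepts, start, trans, emit)
chunks = segment(words, labels)
print(chunks)
```

The word bigram inside each concept is what lets the toy model place the boundary before TO; with concept-dependent unigrams alone, the city names would be equally likely under both concepts and the segmentation would be ambiguous.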
The following frame is then obtained by hand-written rules:
ORIGIN_CITY : NNYC
DESTINATION_CITY : SSFO
WEEKDAY : SATURDAY
ORIGIN_TIME : 0 [...]
AIRCRAFT : [...]
[...] that matches with a sentence containing any word in the member set M(fare) of words expressing the same meaning as fare.
Figure 7 - Example of an SCT (binary tree; recoverable node questions: "+ fare +", "+ M(fare) +", "+ fare code +", "flights + fare +", "+ cost +"; branches are labeled YES/NO, and leaves are labeled Y or N or lead to subtrees)
The nature of the questions in the SCTs is such that the rules learnt are robust to grammatical and lexical errors in the input from the recognizer. In fact, these questions are generated in a manner that tends to minimize the number of words that must be correct for understanding to take place. Question generation involves "gaps": words and groups of words that can be ignored. Thus, each leaf of an SCT corresponds to a regular expression containing gaps, words, and syntactic units (e.g., times, dates, airplane types). Most SCTs in CHANEL decide whether a given concept is present or absent from the semantic representation for an utterance; for such SCTs, the label Y or N in a leaf denotes the presence or absence of the corresponding concept. If one generalizes away from the domain-specific details of CHANEL, one can give the following recipe for building a CHANEL-like system.
1. Collect a corpus of utterances in which each utterance is accompanied by its semantic representation.
2. Write a local parser that recognizes semantically important noun phrases that encode variables in the semantic representation (e.g., times, locations) and replaces such phrases with a generic code (while retaining a value for each variable). For instance, a time might be replaced by the symbol TIME, and a city name by the symbol CITY. Thus, the utterance "give me all uh ten at night flights out of Boston" might become "give me all uh TIME[22:00] flights out of CITY[Bos]".
3. Devise a way of mapping the rest of the semantic representation (i.e., the part that does not consist of the variables just mentioned) into a vector of N bits. For example, CHANEL had a "fare" bit that was set to 1 if the user wanted to know the cost of a flight, and to 0 otherwise. Some bits are allocated to deciding the role of variables, e.g., to deciding whether the CITY in "give me all uh TIME flights out of CITY" should be an origin or a destination.
4. Grow N SCTs, one for each position in the bit vector. The training data for each SCT is the whole training corpus of utterances after processing by the local parser (with variable values stripped out); the label for each utterance is the value of the appropriate bit. E.g., for CHANEL two typical training utterances for the fare SCT might be: "give me all uh TIME flights out of CITY" => 0 and "how much are flights to CITY these days" => 1.
5. Given a new utterance, one can generate a semantic representation from the resulting system as follows:
• pass the utterance through the local parser,
• temporarily strip out variable values (saving them for later use) and submit the resulting string to the N SCTs (each SCT receives a complete copy),
• the resulting vector of bits, together with saved variable values, gives a unique semantic representation for the utterance.
Probability P(C|W) is estimated from counts of the number of times the leaf corresponding to the pattern that matched W is reached. Notice that W can be an entire sentence. Different concept tag hypotheses can be generated by different sentence patterns that share some components. Good parsers for semantically important noun phrases can be hand-coded quite quickly; implementing machine learning of the rules in these parsers would have been more trouble than it was worth and was avoided. The following is an example of the semantic representation generated by CHANEL: DISPLAYED_ATTRIBUTES (flights, fares) CONSTRAINTS (flight_from_airport