Natural Language Engineering 1 (1): 1-57. Printed in the United Kingdom
© 1998 Cambridge University Press
A domain independent approach for efficiently interpreting extragrammatical utterances

Carolyn Penstein Rose
University of Pittsburgh, Learning Research and Development Center, Pittsburgh, PA 15260
[email protected]
Alon Lavie
Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA 15213
[email protected]
(Received 28 June 1998)
Abstract

In this paper we address the issue of efficiently and effectively handling the problem of extragrammaticality in a large-scale spontaneous spoken language system. We propose and argue in favor of ROSE, a domain independent parse-and-repair approach to the problem of interpreting extragrammaticalities in spontaneous language input. We argue that in order for an approach to robust interpretation to be practical, it must be domain independent, efficient, and effective. Where previous approaches to robust interpretation possessed one or two of these qualities, the ROSE approach uniquely possesses all three. We evaluate our approach by comparing its performance in terms of parse time and quality
with a parameterized version of the minimum distance parsing (MDP) approach as well as with a more restrictive partial parser, two alternative domain independent approaches. Our analysis demonstrates that the ROSE approach performs significantly faster than a limited version of MDP, while producing analyses of superior quality. We also demonstrate that the two stage ROSE approach achieves a higher level of quality in less time than a more restrictive partial parser. We believe these results support the claim that in general two stage approaches like ROSE achieve high quality results more efficiently than one stage approaches.
1 Introduction

Applications involving any level of spoken language understanding require the fundamental ability of analyzing the input utterance into some meaning representation that can then be used for further processing. While a variety of techniques have proven successful for the analysis of written language input, most do not adapt easily to the analysis of spoken input. Spoken language disfluencies and grammatical deviations in spontaneous language are the main underlying reasons for the particular difficulty of analyzing spoken input. These problems are often exacerbated by errors introduced by a speech recognition front end component. In this paper we propose and argue in favor of the ROSE approach. ROSE, RObustness with Structural Evolution, is a domain independent parse-and-repair approach to the problem of effectively and efficiently overcoming extragrammaticalities in spontaneous language input. We define extragrammaticality as any deviation of an input utterance from the language covered by a system's parsing grammar.
A novel application of Genetic Programming (Koza 1992; Koza 1994) is used to construct repair hypotheses in order to avoid the necessity of hand-coded repair rules that specify how to assemble portions of partial parses. The genetic programming algorithm is used to generate populations of programs that assemble fragments from a partial parse into complete meaning representation hypotheses. The goodness of repair hypotheses increases as the best hypotheses are selected from each generation to be used in constructing similar hypotheses for the next generation. The goodness of alternative hypotheses is evaluated using a trained fitness function. The focus on the problem of robust interpretation and the use of a trained fitness function each make the ROSE application of genetic programming unique. The ROSE approach was developed and evaluated in the context of the large-scale JANUS multi-lingual machine translation system (Woszcyna et al. 1993; Woszcyna et al. 1994; Lavie et al. 1996). The JANUS system deals with the scheduling domain, where two speakers attempt to schedule a meeting together over the phone. The system is composed of four language independent and domain independent modules, including speech recognition, parsing, discourse processing, and generation. The repair module described in this paper is similarly language independent and domain independent, requiring no hand-coded knowledge dedicated to repair. We demonstrate the attractiveness of our approach by comparing its performance in terms of parse time and quality with an implementation of a parameterized version of the minimum distance parsing (MDP) approach, an alternative domain independent approach. The evaluations described in this paper were conducted using a grammar with
approximately 1000 rules and a lexicon with approximately 3000 lexical items. The grammar formalism has a context free backbone with LFG style unification augmentations. However, the approach could be used with any grammar formalism that allows the parser to build feature structures that can be assembled compositionally.
2 The ROSE Approach

An overview of the complete ROSE approach is displayed in Figure 1. ROSE repairs extragrammatical input in two phases. ROSE's first phase, Repair Hypothesis Formation, is responsible for assembling a set of hypotheses about the meaning of the ungrammatical utterance. This phase is itself divided into two stages, Partial Parsing and Combination. In the Partial Parsing stage, a partial parser is used to obtain an analysis of portions of the speaker's sentence in cases where it is not possible to obtain an analysis for the entire sentence. The partial parser produces analyses for each of the largest segments spanning the sentence as well as for smaller portions of those segments. These then form the set of chunks that are input to the Combination stage. The Combination stage, where structural evolution comes into play, is the heart of the ROSE approach. Genetic programming (Koza 1992; Koza 1994) is used to search for different ways to combine the fragments. Genetic programming is an opportunistic search algorithm used for constructing computer programs to solve particular problems. Among its desirable properties is its ability to search a large space efficiently by first sampling widely and shallowly, and then narrowing in on the regions surrounding the most promising looking points. In its novel application in the ROSE approach, genetic programming is used to search the
[Figure 1 depicts the ROSE control flow. In Phase 1, Stage 1, the sentence is parsed and the partial parse is checked for Parse Quality; an acceptable parse becomes the Final Result directly, while a Repairability Indicator diverts unrepairable parses to a request for a Rephrase. Otherwise, Stage 2's Combination Mechanism produces a Population of Repair Hypotheses. In Phase 2, a Repair Quality Confidence check either accepts a high confidence hypothesis as the Final Result or invokes the Interaction Mechanism.]

Fig. 1. Overview of the ROSE Interpretation Process
space of possible repair hypotheses. As a result of the genetic search, the fragments from the partial parse are assembled into a set of alternative meaning representation hypotheses. Thus, the genetic search eliminates the need for hand-crafted repair rules specifying how to assemble fragments of partial parses. Eliminating this need distinguishes ROSE from previous parse-and-repair approaches. In ROSE's second phase, Interaction with the user, the system generates a set of queries, negotiating with the speaker in order to narrow down to a single best
meaning representation hypothesis. First, ROSE extracts the set of features that distinguish the alternative hypotheses from one another. These features are then sorted according to a set of criteria, including coherence with previous questions asked and potential search space reduction. This sorted list is finally used to generate a series of queries to the user. It is not necessary for ROSE to understand what these features mean in any sense in order to perform this function. Thus, ROSE is able to participate in this interaction in a domain independent manner. Processing in the ROSE approach is controlled at three key decision points, displayed as hexagons in Figure 1. The goal is to make efficient use of resources by channeling them to those portions of the process that are likely to yield an improvement in interpretation quality. For example, in cases where the parser is able to obtain a high quality parse for the input sentence, there is no need for the Combination stage or the Interaction with the User phase. We therefore use a Parse-Quality evaluation heuristic to assess the result of the parsing stage, prior
to attempting any kind of repair. If the heuristic indicates that the parser output is of sufficient quality, the result of the parser is passed on as the final result, bypassing the Combination stage and the Interaction with the User phase. Similarly, if the output from the parser is so limited that no meaningful repair can be done, then it is equally useless to continue processing in the Combination stage and Interaction with the User phase. Thus, we use a second evaluation heuristic to indicate repairability. If the Repairability heuristic indicates that the partial parse does not provide useful building blocks that can be used by the Combination stage, the Combination stage and the Interaction with the User phase are bypassed, and
the user is requested to provide a rephrase. If, on the other hand, the heuristic indicates that useful building blocks for combination were constructed during the Partial Parsing stage, then ROSE's Combination stage is used. After a population of hypotheses is constructed by the Combination stage, a third evaluation heuristic, the Repair Quality Confidence heuristic, is used to determine whether interaction is necessary. If the Repair Quality Confidence heuristic indicates that interaction is necessary, ROSE's Interaction with the User phase is used. Otherwise, the result of the top ranking repair hypothesis created by the Combination stage is returned as the final result. Therefore, ROSE tailors its approach to each specific type of case. In this paper, we focus on ROSE's first repair phase, the Hypothesis Formation phase, which operates automatically with no human intervention. Details about ROSE's evaluation heuristics and the Interaction with the User phase can be found in (Rose 1997; Rose and Levin 1998). Throughout the rest of the paper, we refer to the implementation of the Combination stage as the repair module, since it attempts to repair the partial parses provided by the parser. Although the Combination stage produces a set of hypotheses, in the evaluation presented in this paper we look only at the best scoring repair hypothesis.
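The control flow around these three decision points can be sketched as follows. This is our own schematic, not the ROSE implementation; the parser, combiner, interaction routine, and the three heuristic predicates are passed in as hypothetical stand-ins for the components described above.

```python
def interpret(sentence, parser, combiner, interact,
              parse_quality_ok, repairable, high_confidence):
    """Schematic of ROSE's three decision points (the hexagons in Fig. 1)."""
    partial_parse = parser(sentence)
    if parse_quality_ok(partial_parse):     # decision point 1: Parse Quality
        return partial_parse                # bypass repair and interaction
    if not repairable(partial_parse):       # decision point 2: Repairability
        return "please rephrase"            # no useful building blocks
    hypotheses = combiner(partial_parse)    # Combination stage (Stage 2)
    best = hypotheses[0]                    # hypotheses assumed sorted by rank
    if high_confidence(best):               # decision point 3: Repair Quality
        return best                         #   Confidence
    return interact(hypotheses)             # Interaction with the User phase
```

The point of the schematic is that each check short-circuits the remaining, more expensive processing, so effort is spent only where it can improve the interpretation.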
3 Alternative Avenues Towards Robustness

There are many different approaches to handling the problem of extragrammaticality, but which way is best? In order to be practical, an approach to robust interpretation must be domain independent, efficient, and effective. Without the quality of domain independence, the work required to make an interpretation system robust must be redone each time a new domain is encountered. Efficiency is also required in any practical interpretation system, particularly if the system is to interact with the user in real time. Finally, a language interpretation system must unarguably be effective, correctly interpreting the user's input the great majority of the time. Efficiency and effectiveness are not absolute qualities, and often they trade off with one another. Thus, an approach that achieves the same or greater effectiveness more efficiently than another approach should be considered more desirable.

Three basic avenues exist whereby the coverage, and thus the effectiveness, of a natural language understanding system can be expanded: further development of the parsing grammar, an increase in the flexibility of the parsing algorithm, or the addition of a post-processing repair stage. Expanding a parsing grammar in order to increase coverage is both time intensive in terms of development and ultimately computationally expensive at run time, since large, cumbersome grammars generate excessive amounts of ambiguity. Improving the flexibility of the parsing algorithm is preferable in some respects, particularly in that it reduces the grammar development burden. This approach can be accomplished in a domain independent manner. However, it lends itself to the same weakness in terms of computational cost when taken to its logical conclusion. The ROSE approach follows the third alternative and achieves a better efficiency/effectiveness trade-off than other completely domain independent approaches, such as those that follow the second alternative.

The most extreme case of a flexible parser is a Minimum Distance Parser (MDP) (Lehman 1989; Hipp 1992), which has its origins in the programming language parsing literature (Aho and Peterson 1972; Sippu 1981). In an MDP parser, any ungrammatical sentence can be mapped onto the `closest' sentence within the coverage of the grammar through a series of insertions, deletions, and in some cases substitutions or transpositions. The underlying assumption is that the meaning of the ungrammatical input is the same as or very similar to that of the `closest' grammatical sentence. Since any string can be mapped onto any other string through a series of insertions, deletions, and transpositions, this approach makes it possible to repair any extragrammatical sentence. In theory, the more flexibility, the better the coverage, but in realistic large scale systems the MDP approach becomes computationally intractable. In practice, MDP has only been used successfully in very small and limited systems. For example, Lehman's core grammar, described in (Lehman 1989), has on the order of 300 rules, and all of the inputs to her system can be assumed to be commands to a calendar program. Hipp's Circuit Fix-It Shop system, described in (Hipp 1992), has a vocabulary of only 125 words and a grammar of only 500 rules. Lehman's and Hipp's grammar formalisms are comparable to that used in the evaluations presented in this paper: both are composed of context-free backbones with augmentations. Lehman's formalism constructs feature structures where Hipp's constructs logical forms.

Recent efforts towards robust interpretation have focused on achieving a compromise between reasonable effectiveness and an acceptable level of efficiency. They have accomplished this by developing more restrictive partial parsers (Hayes and Mouradian 1981; Kwasny and Sondheimer 1981; Jensen and Heidorn 1993; Lang 1989; Lavie 1995; Abney 1996; Ehrlich and Hanrieder 1996; Srinivas et al. 1996; Federici et al. 1996) and repair approaches where the labor is distributed between two or more stages (Danieli and Gerbino 1995; Van Noord 1996). More restrictive partial parsers are attractive because the majority of disfluencies encountered in spoken input can be handled effectively without the full power of MDP. Lavie (1995) argues that the majority of disfluencies found in speech input are the result of inserted words or phrases and substitutions, or can reasonably be treated as such. This observation led to the development of the GLR* skipping parser (Lavie and Tomita 1993; Lavie 1995) and other similar partial parsing approaches (Abney 1996; Van Noord 1996; Srinivas et al. 1996; Federici et al. 1996). The weakness of these partial parsing approaches is that, if only the analysis for the single largest subset is returned, part of the original meaning of the utterance may be discarded along with the skipped portion(s) of the utterance. These less powerful algorithms essentially trade effectiveness for efficiency. Their goal is to introduce enough flexibility to gain an acceptable level of coverage at an acceptable computational expense.

The goal behind two stage approaches such as ROSE is to increase coverage at a reasonable computational cost by introducing a post-processing repair stage. The repair stage attempts to construct a complete meaning representation out of the various fragments of the partial parse spanning the ungrammatical sentence. Such a two stage approach is applicable wherever the meaning representation used by a system is compositional. Compositionality is essential since the two stage approach must compose together partial analyses from separate fragments of the input. The manner in which fragments are composed with one another in the repair stage depends entirely on the constraints imposed by the meaning representation specification. Once the partial parser constructs analyses for separate portions of the input sentence, the rules from the parsing grammar are no longer relevant. Thus, there is no necessity for the parsing grammar, or any permutation of grammar rules, to be able to describe the structure of the sentence. This is very different from both the MDP approach and the more restrictive partial parsing approach, which rely completely on the grammar.

The two-stage approach can be expected to be more efficient than the MDP approach. Since the input to the second stage is a collection of partial analyses, the additional flexibility introduced at this second stage can be channeled just to the part of the analysis that the parser does not have enough knowledge to handle straightforwardly. This is unlike the MDP approach, where the full amount of flexibility is unnecessarily applied to every part of the analysis, even in completely grammatical sentences. Therefore, the two stage process is a more efficient distribution of labor, since the first stage is highly constrained by the grammar and the results of this first stage are then used to constrain the search in the second stage. Furthermore, the second stage can be skipped in cases where it is not necessary. Thus, the two stage approach has the ability to achieve a better efficiency/effectiveness trade-off than either the MDP approach or the restrictive partial parsing approach. In our evaluation, we demonstrate that the ROSE approach reaches this potential.

Although two stage approaches have grown in popularity in recent years because of their efficiency, previous two stage approaches have achieved it at the cost of requiring hand coded repair heuristics that specify how to assemble more complete meaning representations from the fragments constructed by a partial parser (Danieli and Gerbino 1995; Van Noord 1996). Thus, while they achieve an admirable effectiveness/efficiency trade-off, they lose the quality of domain independence. In contrast, the ROSE approach uses genetic programming to automatically learn how to construct reasonable repair hypotheses given a meaning representation specification and a training corpus. Thus, ROSE achieves the benefits of efficient and effective repair with the additional benefit of domain independence.
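To make the minimum-distance notion underlying MDP concrete, the sketch below computes a word-level edit distance with insertions, deletions, and substitutions; an MDP parser in effect searches for the in-coverage sentence that minimizes such a distance. This is our own illustration, not code from any of the systems cited above.

```python
def edit_distance(source, target):
    """Word-level minimum edit distance (insertions, deletions, and
    substitutions, each at cost 1), computed by dynamic programming."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all remaining source words
    for j in range(n + 1):
        d[0][j] = j                      # insert all remaining target words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

def closest_sentence(utterance, coverage):
    """An MDP parser, in effect, maps the ungrammatical input onto the
    in-coverage sentence at minimal edit distance from it."""
    return min(coverage, key=lambda s: edit_distance(utterance, s))
```

The intractability discussed above arises because a realistic grammar's coverage cannot be enumerated like the toy `coverage` list here; MDP must thread this distance computation through the parsing search itself.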
4 ROSE's Two Stage Interpretation Process

The main goal of the two stage ROSE approach is to achieve the ability to robustly interpret spontaneous natural language efficiently in a large and complex system such as the JANUS multi-lingual machine translation system. In this section we describe our implementation of the Partial Parsing and Combination stages of the ROSE approach.
4.1 The Partial Parsing Stage

The first stage in our approach is the Partial Parsing stage, where the goal is to obtain an analysis for islands of the speaker's utterance if it is not possible to obtain an analysis for the whole utterance. Although any partial parser capable of producing analyses for contiguous portions of an ungrammatical utterance could be used, in our implementation this is accomplished with a restricted version of Lavie's GLR* parser. Where the standard version of GLR* (Lavie and Tomita 1993; Lavie 1995) searches for the maximal subset of the original input that is covered by
Sentence: That wipes out my mornings.

Partial Analyses:

Chunk1: that
((ROOT THAT) (TYPE PRONOUN) (FRAME *THAT))

Chunk2: out
((TYPE NEGATIVE) (DEGREE NORMAL) (FRAME *RESPOND))

Chunk3: my
((ROOT I) (TYPE PERSON-POSS) (FRAME *I))

Chunk4: mornings
((TIME-OF-DAY MORNING) (NUMBER PLURAL) (FRAME *SIMPLE-TIME) (SIMPLE-UNIT-NAME TOD))

Fig. 2. Parse Example
the grammar, the more restrictive version produces partial analyses for only contiguous portions of the input sentence. See Figure 2 for an example parse. Here the restricted GLR* parser attempts to handle the sentence `That wipes out my mornings.' The expression `wipes out' is not covered by the parsing grammar. The
grammar also does not allow time expressions to be modified by possessive pronouns, so `my mornings' also does not parse. Although the grammar recognizes `out' as a way of expressing a rejection, as in `Tuesdays are out,' it does not allow the time being rejected to follow the `out'. However, although the parser was not able to obtain a complete parse for this sentence, it was able to extract four chunks. The chunks are feature structures in which the parser encodes the meaning of portions of the user's sentence. We refer to this frame based meaning representation as an interlingua because it is language independent. It is defined by an interlingua specification in BNF form. Each concept, or basic unit of meaning in the domain, is represented as a frame in the meaning representation. Each frame is associated with a set of slots that allow the concept represented by the frame to be modified by the contents of those slots. Each slot can be filled in either with a feature structure or with an atomic filler. Frame names are marked with *'s to distinguish them from other atomic slot fillers. The set of frames and atomic fillers in the meaning representation is arranged into subsets that are assigned a particular type. Each slot is also associated with a type, which defines the range of fillers that can be inserted into that slot. This meaning representation specification is used by the repair module to constrain the number of ways chunks can be assembled together. Although this meaning representation specification is knowledge that must be encoded by hand, it is knowledge that is used by all aspects of the system, not only the repair module, as would be the case with hand-coded repair rules. Arguably, any well designed natural language understanding system would have such a specification to describe its meaning representation. Thus, rather than require specific
knowledge to be represented for the purpose of repair, the ROSE approach makes use of knowledge that is already part of the system.
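To make the representation concrete, a specification of this kind can be sketched as two tables: one mapping each frame to its slots and each slot to a type, and one mapping each type to its admissible fillers (frame names or atoms). The frames below follow this paper's scheduling examples, but the encoding and the `well_formed` checker are our own illustration, not the actual JANUS interlingua specification.

```python
# Frame -> {slot: type of admissible fillers} (illustrative subset).
SPEC = {
    "*RESPOND": {"TYPE": "response-type", "DEGREE": "degree",
                 "WHEN": "temporal"},
    "*SIMPLE-TIME": {"TIME-OF-DAY": "tod", "NUMBER": "number",
                     "SIMPLE-UNIT-NAME": "unit"},
}
# Type -> admissible fillers: frame names (for feature-structure
# fillers) or atoms (for atomic fillers).
TYPES = {
    "temporal": {"*SIMPLE-TIME"},
    "response-type": {"NEGATIVE", "AFFIRMATIVE"},
    "degree": {"WEAK", "NORMAL", "STRONG"},
    "tod": {"MORNING", "AFTERNOON", "EVENING"},
    "number": {"SINGULAR", "PLURAL"},
    "unit": {"TOD"},
}

def well_formed(fs):
    """Recursively check a dict-encoded feature structure against the
    specification: every slot must belong to its frame, and every
    filler must have the slot's declared type."""
    slots = SPEC.get(fs.get("FRAME"))
    if slots is None:
        return False
    for slot, filler in fs.items():
        if slot == "FRAME":
            continue
        if slot not in slots:
            return False
        t = slots[slot]
        if isinstance(filler, dict):     # feature-structure filler
            if filler.get("FRAME") not in TYPES[t] or not well_formed(filler):
                return False
        elif filler not in TYPES[t]:     # atomic filler
            return False
    return True
```

Under this encoding, the constraint the repair module exploits is exactly this typing discipline: a chunk may only be placed where its frame's type is admitted.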
The four chunks extracted by the parser each encode a different part of the meaning of the sentence `That wipes out my mornings.' The first chunk represents the meaning of `that'. The frame associated with this concept is *THAT. The filler of the TYPE
slot indicates that this concept is a nominative pronoun. The second chunk
represents the meaning of `out'. Since `out' is generally a way of rejecting a meeting time in this domain, the associated feature structure represents the concept of a response that is a rejection. Thus, the frame associated with this concept is *RESPOND, and the filler of the TYPE slot is NEGATIVE, since this is a response that is negative. A filler of NORMAL in the DEGREE slot indicates that it is neither a weak response nor a particularly strong one. Since `wipes' does not match anything in the grammar, this token is left without any representation among the fragments returned by the parser. The last two chunks represent the meaning of `my' and `mornings' respectively. `My' is represented the same way that `I' would be, except that a filler of PERSON-POSS in the TYPE slot indicates that the case is genitive. `Mornings' is a basic temporal concept, so it is represented with the *SIMPLE-TIME frame. A filler of TOD in the SIMPLE-UNIT-NAME slot indicates that this chunk represents a time of day. A filler of MORNING in the TIME-OF-DAY slot indicates that the time of day is the morning. And the filler of the NUMBER slot indicates plurality.
4.2 The Combination Stage

The purpose of the Combination stage is to assemble the partial analyses that were created by the parser into a meaningful complete analysis structure. This is similar in nature to what can be done with a minimum distance parser using insertions, deletions, and transpositions, but cannot be performed with just the skipping parser. The Combination stage takes as input the partial analyses returned by the skipping parser. These chunks are then combined into a set of repair hypotheses. The hypotheses built during this combination process are programs that specify how the fragments can be combined into complete feature structures that are consistent with the meaning representation and that represent the meaning of the speaker's entire sentence, rather than just parts of the sentence.
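A single combination step of the kind just described might look as follows: try to insert the second feature structure into a type-compatible empty slot of the first, falling back to returning the larger chunk. This is a simplified sketch under our own dict-based encoding, not the actual implementation (which, as described below, also attempts to merge the repair actions that built the two chunks).

```python
def combine(fs1, fs2, spec):
    """Try to insert feature structure fs2 into some slot of fs1.

    `spec` maps a frame name to {slot: set of admissible filler frames}.
    Returns (new_fs, slot) on success; otherwise (larger chunk, None),
    a simplified fallback.
    """
    for slot, admissible in spec.get(fs1.get("FRAME"), {}).items():
        if slot not in fs1 and fs2.get("FRAME") in admissible:
            new_fs = dict(fs1)       # copy, then fill the chosen slot
            new_fs[slot] = fs2
            return new_fs, slot
    bigger = fs1 if len(fs1) >= len(fs2) else fs2
    return bigger, None
```

Run on the `out' and `mornings' chunks of Figure 2, one argument order yields a rejection frame with its WHEN slot filled; the reverse order finds no compatible slot and falls back to the larger chunk.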
4.2.1 Applying Genetic Programming
Genetic programming is an extension of the traditional genetic algorithm (Holland 1975) where the structures that are undergoing adaptation are computer programs, which dynamically vary both in size and shape. It is a technique for finding the most fit computer program for a particular task by first generating an initial population of programs, and then generating subsequent generations by allowing the most `fit' programs to `mate' with one another. The goal is for the average fitness of the generated population of programs to increase from one generation to the next. A program can be viewed as a tree structure constructed from a set of functions that can be composed in order to build a program and a set of terminals
(i.e. variables and constants) that provide the leaf nodes of the program. The genetic programming algorithm builds programs from these primitives. In the zeroth generation, a set of programs is generated randomly from the provided set of functions and terminals. A fitness function is then used to evaluate the relative goodness of the individual programs in this initial population. Usually the programs in this initial population have a low fitness. Nevertheless, some programs will have a better fitness than others. Those that are relatively more fit are selected for reproduction as in the traditional genetic algorithm. Crossover and mutation operators are used to generate subsequent generations. The crossover operation swaps a subprogram from one parent program with a subprogram of the other parent program. The mutation operation swaps a function with another randomly chosen function or a
terminal with another randomly selected terminal. In each subsequent generation the fitness function is again used to rank the set of programs generated during that generation in order to prepare for generating the next generation. When the final generation of programs has been generated and evaluated, the most fit program or set of programs generated during the entire process is returned. There are five steps involved in applying genetic programming to a particular problem: determining a set of terminals; determining a set of functions; determining a fitness measure; determining the parameters and variables to control the run; and determining the method for deciding when to stop the evolution process. In the ROSE application of genetic programming, the first two constrain the range of repairs that the Combination process is capable of making. The fitness measure determines how alternative repair hypotheses are ranked, and thus whether it is
possible for the search to converge on the correct hypothesis rather than on a suboptimal competing hypothesis. The last two factors determine how quickly it will converge and how long it is given to converge. We use a fitness measure in which lower fitness values are preferred over higher ones, so the best ranking hypotheses
are assigned the lowest fitness scores. The parameters for the run, such as the size of the population of programs in each generation, were determined experimentally from a training corpus.

4.2.2 ROSE's Space of Possible Repairs
In the Combination stage, genetic programming is used to evolve a population of programs that specify how to build complete meaning representations from the chunks returned by the parser. A complete meaning representation is one that is meant to represent the meaning of the speaker's entire utterance, rather than just part of it. Since the meaning representation is compositional, a single, more complete meaning representation can be built by assembling the meaning representations for the various parts of the sentence. Thus, the most natural set of terminals for this problem is the set of chunks constructed by the parser. Each step in repairing an analysis involves taking chunks as input and returning a new chunk as output. In our implementation, a single function called MY-COMB is used to accomplish this. MY-COMB is a simple function that attempts to insert the second argument, a feature structure, into some slot in the first argument, another feature structure. It selects an appropriate slot, if a suitable one can be found, by referring to the meaning representation specification. The third
Ideal Repair Hypothesis:
(MY-COMB                                    ; insert arg2 into arg1 in slot
 ((FRAME *RESPOND) (DEGREE NORMAL) (TYPE NEGATIVE))                  ; arg1
 ((TIME-OF-DAY MORNING) (NUMBER PLURAL) (FRAME *SIMPLE-TIME)
  (SIMPLE-UNIT-NAME TOD))                                            ; arg2
 WHEN)                                                               ; slot

Ideal Structure:
((FRAME *RESPOND) (DEGREE NORMAL) (TYPE NEGATIVE)
 (WHEN ((FRAME *SIMPLE-TIME) (TIME-OF-DAY MORNING) (NUMBER PLURAL)
        (SIMPLE-UNIT-NAME TOD))))

Gloss: Mornings are out.
Fig. 3. Ideal Repair Hypothesis
parameter of MY-COMB is then set to this slot. If it is not possible to insert the second chunk into the first one, it attempts to merge the two by performing the largest set of repair actions that are consistent with one another from the sets of repair actions
that constructed the two input chunks. If this is not possible either, the larger of the two chunks is returned. Repair hypotheses in ROSE are programs built hierarchically using the MY-COMB function. The range of possible repairs includes all meaning representation structures that can be built compositionally from some subset of the chunks returned by the parser. Thus, repairing an analysis is analogous to fitting pieces of a puzzle into a mold that contains receptacles for particular shapes. In this analogy, the meaning representation specification acts as the mold, making it possible to compute all of the ways partial analyses can fit together in order to create a structure that is legal in this frame based meaning representation. The ideal repair hypothesis is shown in Figure 3. In this case, the WHEN slot is selected in which to insert the second chunk parameter. Thus, the feature structure corresponding to `mornings' is inserted into the WHEN slot in the feature structure corresponding to `out'. The result is a feature structure representing `Mornings are out.' Though this is not an exact representation of the speaker's meaning, it is reasonably close, and is the best that can be done with the available feature structures. Notice that since the expression `wipes out' is foreign to the parsing grammar, and no similar expression in its coverage is associated with the same meaning, the MDP approach would not be able to do any better in this case, since it can only insert and delete words in order to fit the current sentence to the rules in its parsing grammar. Additionally, since the time expression follows the word `out' rather than preceding it, only MDP with transpositions in addition to insertions and deletions would be able to produce a similar result. Note also that
the feature structures corresponding to `my' and `that' are not included in this hypothesis. The repair module is responsible both for determining which fragments to include and for determining how to combine the selected fragments. As mentioned above, the repair hypotheses generated in the zeroth generation normally have a low fitness, since they are constructed randomly. As the best hypotheses from each generation are selected and used for producing the subsequent generation, the average fitness value of the hypotheses increases. See Figures 4 and 5 for two suboptimal repair hypotheses produced during the Combination stage for the example in Figure 2. The result of each of the hypotheses is an alternative representation for the sentence. The first hypothesis, displayed in Figure 4, corresponds to the interpretation `Mornings and that are out.' The problem with this hypothesis is that it includes the chunk `that', which in this case should be left out. In the second hypothesis, displayed in Figure 5, the repair module attempts to insert the `rejection' chunk into the `time expression' chunk, the opposite of the correct order. However, no slot could be found in the time expression chunk in which to insert the rejection expression chunk. The slot therefore remains uninstantiated, and the largest chunk, in this case the time expression chunk, is returned. This hypothesis produces a feature structure that is indeed a portion of the correct structure, though not the complete structure. Although both of these hypotheses are suboptimal, the ideal repair hypothesis can be constructed from these for the next generation using the crossover operation. The crossover operator selects a subprogram from each of the two parent programs
Hypothesis1:
(MY-COMB (MY-COMB ((FRAME *RESPOND))
                  ((FRAME *SIMPLE-TIME) (TIME-OF-DAY MORNING)
                   (NUMBER PLURAL) (SIMPLE-UNIT-NAME TOD))
                  WHEN)
         ((FRAME *THAT) (ROOT THAT) (TYPE PRONOUN))
         WHEN)

Result1:
((FRAME *RESPOND) (DEGREE NORMAL) (TYPE NEGATIVE)
 (WHEN (*MULTIPLE* ((FRAME *SIMPLE-TIME) (TIME-OF-DAY MORNING)
                    (NUMBER PLURAL) (SIMPLE-UNIT-NAME TOD))
                   ((FRAME *THAT) (ROOT THAT) (TYPE PRONOUN)))))

Gloss: Mornings and that are out.

Fig. 4. Alternative Repair Hypothesis 1
Hypothesis2:
(MY-COMB ((TIME-OF-DAY MORNING) (NUMBER PLURAL)
          (FRAME *SIMPLE-TIME) (SIMPLE-UNIT-NAME TOD))
         ((FRAME *RESPOND) (DEGREE NORMAL) (TYPE NEGATIVE))
         ??)

Result2:
((FRAME *SIMPLE-TIME) (TIME-OF-DAY MORNING) (NUMBER PLURAL)
 (SIMPLE-UNIT-NAME TOD))

Gloss: Mornings.

Fig. 5. Alternative Repair Hypothesis 2
and swaps them, producing two new hypotheses. If the boldface portions from Figures 4 and 5 were selected to swap, one of the resulting new hypotheses would be that found in Figure 6. Note that since the meaning for a portion of the input sentence cannot be represented in the resulting meaning representation structure more than once, the outermost MY-COMB function has no effect on the resulting
Resulting Repair Hypothesis:
(MY-COMB (MY-COMB ((FRAME *RESPOND) (DEGREE NORMAL) (TYPE NEGATIVE))
                  ((TIME-OF-DAY MORNING) (NUMBER PLURAL)
                   (FRAME *SIMPLE-TIME) (SIMPLE-UNIT-NAME TOD))
                  WHEN)
         ((FRAME *RESPOND) (DEGREE NORMAL) (TYPE NEGATIVE))
         ??)

Resulting Structure:
((FRAME *RESPOND) (DEGREE NORMAL) (TYPE NEGATIVE)
 (WHEN ((FRAME *SIMPLE-TIME) (TIME-OF-DAY MORNING) (NUMBER PLURAL)
        (SIMPLE-UNIT-NAME TOD))))

Gloss: Mornings are out.

Fig. 6. Crossover Example
(defun fitness-eval (NUM_CONCEPTS NUM_STEPS STAT_SCORE PERCENT_COV)
  ((1 * (1 / STAT_SCORE))
   + (0.55 * NUM_STEPS)
   + (1.15 * (1 / NUM_CONCEPTS))
   + (1.15 * (1 / PERCENT_COV))))

Fig. 7. Hillclimbing Trained Linear Combination Fitness Function
structure. Thus, although the program looks slightly different from the ideal one found in Figure 3, the resulting structure is the same.

4.2.3 Ranking Alternative Repairs
Once a population of hypotheses is generated, each individual in the population is evaluated for its fitness. The fitness measure is the most critical part of the genetic search since the fitness evaluations are what guide the search process. Since survival of the fittest is the key to the evolutionary process, the determination of which hypotheses are more fit is absolutely crucial. The fitness function measures how well each evolved program performs at a particular task or fits some task-specific criteria. Evaluating the fitness of repair hypotheses is the most difficult part of applying genetic programming to repair. The ROSE approach to fitness evaluation of repair hypotheses is what makes this application of genetic programming unique. An ideal fitness function for repair would rank hypotheses that generate structures closer to the target structure better than those that are more different. In comparing the relative goodness of alternative repair hypotheses, the repair module must consider not only which subset of chunks returned by the parser to include in the final result, but also
how to put them together. However, it does not know what the ideal hypothesis is. Since the repair module does not have access to this information, it must rely upon indirect evidence for determining which hypotheses are better than others. A fitness function is trained to use this indirect evidence for the purpose of ranking repair hypotheses in a useful manner. The ROSE application of genetic programming is the only one the authors are aware of that makes use of this trained approximate fitness evaluation approach. Four pieces of indirect evidence about the relative goodness of repair hypotheses can be computed for each repair hypothesis: the number of frames and atomic fillers in the resulting structure (NUM CONCEPTS), the number of repair actions involved (NUM STEPS), a statistical score reflecting the goodness of repair actions (STAT SCORE), and the percentage of the sentence that is covered by the repair (PERCENT COV). These parameters are generally useful in ranking hypotheses. Both NUM CONCEPTS and PERCENT COV provide an estimate of the completeness of solutions. NUM STEPS provides an estimate of the simplicity of the program. And to the extent that the statistical goodness score gives a reliable indication of the goodness of fit between slots and fillers and the goodness of chunks, repair hypotheses with a better average statistical score are more likely to be better hypotheses. The statistical score of a hypothesis is calculated by averaging the statistical scores for each repair action, where the statistical score of each included chunk is the statistical score assigned by the parser to the analysis of that chunk, and the statistical score of inserting a chunk into a slot in another chunk is defined by the information gain between the slot and the type of the inserted chunk.
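Combining these four scores, the hillclimbing-trained linear combination displayed in Figure 7 can be transcribed as follows (a Python transcription of the Lisp figure; lower values indicate fitter hypotheses):

```python
def fitness_eval(num_concepts, num_steps, stat_score, percent_cov):
    """Trained linear combination from Figure 7.  Lower values are better,
    so the three scores for which larger is better enter as multiplicative
    inverses; the step count, for which smaller is better, enters as is."""
    return (1.0 * (1.0 / stat_score)
            + 0.55 * num_steps
            + 1.15 * (1.0 / num_concepts)
            + 1.15 * (1.0 / percent_cov))
```

Note that a hypothesis with more concepts, higher sentence coverage, a higher statistical score, or fewer repair steps receives a lower (better) value.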
Intuitively, one would prefer more complete hypotheses over less complete ones. And following the principle of Occam's razor, other things being equal, one would prefer simpler solutions over more complex ones. Since simpler solutions may be less complete, and more complete hypotheses might be more complex, the trained fitness function must learn how to balance these two qualities. Two alternative trained fitness functions were used in our implementation of the Combination stage. The trained fitness functions combine the four pieces of indirect evidence about the goodness of repair hypotheses. The function displayed in Figure 7 is a linear combination fitness function trained with a hillclimbing approach. An alternative fitness function was trained using genetic programming, and thus is difficult for a human to interpret. The training procedures used to create these two alternative fitness functions are described below. Both alternative fitness functions were trained over a corpus of 48 randomly selected sentences from the scheduling domain, drawn from a separate corpus from the one used in the evaluation discussed in Section 6. These sentences were automatically selected from a larger corpus using the Parse Quality evaluation heuristic discussed in Section 2 in order to ensure that each of these sentences required recovery from parser failure. In this corpus each sentence was coupled with its corresponding ideal meaning representation structure. To generate training data for the fitness functions, the repair module was run using an ideal fitness function that evaluated the goodness of hypotheses by comparing the meaning representation structure produced by the hypothesis with the ideal structure. It assigned a fitness score to the hypothesis equal to the number of frames and atomic fillers in the largest
substructure shared by the produced structure and the ideal structure. The four scores that serve as input to the trained fitness function were extracted from each of the hypotheses constructed in each generation after the programs were ranked by the ideal fitness function. The resulting ranked lists of sets of scores were used to train a fitness function that can order the sets of scores the same way that the ideal fitness function ranked the associated hypotheses. The linear combination fitness function was trained using a hillclimbing technique. The goal of this training process was to find the set of coefficients making the linear combination fitness function perform the best over the training corpus. Since lower fitness values are preferred over higher ones, the multiplicative inverse of the number of concepts, the percentage of the sentence covered, and the average statistical score were passed into the linear combination function rather than the original values, which are such that bigger scores are better than smaller scores. Since lower values are preferred for the number of steps involved in a repair hypothesis, the original value for this parameter was passed in as is. At the beginning of the training process, each of the four coefficients was set to an initial value of one. On each iteration, eight alternative sets of coefficients were generated by either adding or subtracting 0.05 from one of the four coefficients. The performance for each set of coefficients was measured by first ordering the set of scores in each training example using a linear combination of the scores with the hypothesized coefficients. The length of the greatest common subsequence between the ideal ordering and the generated ordering was then computed. The greatest common subsequence was computed using Dijkstra's well-known algorithm (Cormen et al. 1989).
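The training loop just described can be sketched as follows (a Python sketch; the `score` argument, standing in for the greatest-common-subsequence performance over the training corpus, is not reimplemented here):

```python
def hillclimb(score, start=(1.0, 1.0, 1.0, 1.0), step=0.05):
    """Sketch of the coefficient-training loop described above: from an
    all-ones starting point, generate eight neighbors per iteration by
    adding or subtracting `step` from one coefficient, move to the best
    neighbor while it improves the score (higher is better), and stop
    at the first iteration where no neighbor improves."""
    current = list(start)
    while True:
        neighbors = []
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = list(current)
                candidate[i] += delta
                neighbors.append(candidate)
        best = max(neighbors, key=score)
        if score(best) <= score(current):
            return current      # no neighbor improves: terminate
        current = best
```

With a toy score that peaks at the published coefficients, the loop converges to them in a handful of iterations; on the real objective it converged in under an hour, as reported below.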
After the eight alternative sets of coefficients were evaluated on each iteration, the best performing one replaced the current one if it performed better than the current one. The hillclimbing algorithm terminated after the first iteration where none of the alternative new sets of coefficients performed better than the current one. The entire training process took less than an hour. The genetic programming trained fitness function took as terminals the four scores plus a random real number function. The function set included addition, subtraction, multiplication, and division. The fitness of alternative proposed fitness functions generated during the genetic search was computed the same way as for the linear combination fitness function. A population size of 1000 was used, and the training process continued for approximately one week, until subsequent generations did not produce a function with performance better than the previous generation. The two alternative fitness functions were evaluated in two different ways. They were first evaluated using the greatest common subsequence performance metric over a set of 100 test examples. The performance of the two was comparable: whereas the linear combination fitness function achieved an average common subsequence length of 5.29 out of 10, the genetic programming trained fitness function achieved 4.72 out of 10. Next they were evaluated in an end-to-end translation evaluation inside the repair module on a set of 133 test sentences needing repair. It was found that the linear combination fitness function produced hypotheses of slightly higher quality, although the total number of acceptable translations produced by the system was affected only slightly. ROSE using the linear combination trained fitness function
produced two more acceptable translations than it did using the genetic programming trained fitness function. Although neither of these trained fitness functions was able to rank more than about 53 per cent of the elements of each list correctly on average, they both performed well inside the repair module. In the evaluation discussed in Section 6, the genetic programming search successfully ranked an acceptable hypothesis in the top ten list of hypotheses returned by the repair module in about 93 per cent of the cases where the parser produced chunks sufficient for building an acceptable solution. And in about 85 per cent of these cases, the manually determined best hypothesis was ranked first. It can thus be concluded from these results that it is not necessary for the trained fitness function to be extremely accurate in its ability to rank the lists of sets of scores. What appears to be most important is that better repair hypotheses in general be ranked closer to the top of the list. Interestingly, in both evaluations the simpler linear combination fitness function, which required only a fraction of the training time, performed slightly better than the genetic programming trained fitness function. Note that while the fitness function may well need to be retrained for each new domain where the ROSE approach is used, neither the code used to do the training nor any of the code that uses the trained function need be altered in any way. Thus, since the training is done automatically, the approach itself can still be considered domain independent.
4.3 An Additional Example
Sentence: What did you say about what was your schedule for Thursday?

Partial Analysis:

Chunk1: What was your?
((CLARIFIED ((FRAME *WHAT) (WH +))) (TENSE PAST) (FRAME *CLARIFY)
 (WHOSE ((FRAME *YOU))))

Chunk2: schedule for Thursday
((FRAME *SCHEDULE) (WHEN ((FRAME *SIMPLE-TIME) (DAY-OF-WEEK THURSDAY))))

Chunk3: was your
((TENSE PAST) (FRAME *CLARIFY) (WHOSE ((FRAME *YOU))))

Chunk4: schedule
((FRAME *SCHEDULE))

Chunk5: Thursday
((FRAME *SIMPLE-TIME) (DAY-OF-WEEK THURSDAY))

Chunk6: What
((FRAME *WHAT) (WH +))

Chunk7: you
((FRAME *YOU))

Fig. 8. A More Complicated Partial Analysis
Ideal Hypothesis:
(MY-COMB (MY-COMB ((FRAME *SCHEDULE)
                   (WHEN ((FRAME *SIMPLE-TIME) (DAY-OF-WEEK THURSDAY))))
                  ((FRAME *WHAT) (WH +))
                  WHAT)
         ((FRAME *YOU))
         WHO)

Result:
((FRAME *SCHEDULE) (WHO ((FRAME *YOU)))
 (WHEN ((FRAME *SIMPLE-TIME) (DAY-OF-WEEK THURSDAY)))
 (WHAT ((FRAME *WHAT) (WH +))))

Gloss: What do you have scheduled for Thursday?

Fig. 9. A More Complex Repair Example
An additional example is illustrated in Figures 8 and 9. In this example the first speaker is asking the second speaker what he had said about his schedule for Thursday. Rather than analyze `your schedule' as a noun phrase, the parser separated `your' and `schedule' and represented `schedule' as an action rather than as an object. It constructed seven chunks, which are displayed in Figure 8. The first chunk represents the meaning of the fragment `What did you say about what was
your?' The *CLARIFY frame is used for the copulative form of the verb `to be', as in `My car is a blue Honda Civic.' The TENSE slot indicates that this fragment is expressed in the past tense. The WHOSE slot indicates that the sentence is about the other speaker. The filler of the CLARIFIED slot is the first argument of copulative be; in this case, the first argument is `what'. Hence, this meaning representation structure can be paraphrased as `What was your?' The parser drops the meaning of `What did you say' in constructing this analysis. The second chunk represents the meaning of `schedule for Thursday'. Without the `your', the word `schedule' is misanalyzed as a verb instead of as a noun. The filler of the WHEN slot indicates that the scheduling action is for Thursday. The remaining chunks are subparts of these two large chunks. They are labeled with appropriate paraphrases in Figure 8. Unlike in the previous example, here we see chunks that span more than a single word. Specifically, the first chunk covers the first eight words of the sentence. The second chunk covers the final three words. Additionally, in the remaining chunks the portion of the sentence spanned by each chunk is a subset of the portion spanned by one or more other chunks in the set. The complete set of chunks is constructed by first selecting the set of largest chunks spanning the entire sentence using a greedy algorithm. Then, using the context-free structure of the parse for each chunk, chunks representing each constituent within that chunk are also selected. In the ROSE approach, the meaning of a single span of words is constrained to be represented only once in the meaning representation assembled by a repair hypothesis. Thus, where the portion of the sentence spanned by a set of chunks overlaps, only one of these chunks may be inserted into the resulting structure from
a single repair hypothesis. A chunk will only be inserted into a slot in another chunk if there is no overlap in the portions of the sentence spanned by the two chunks. When repair hypotheses are constructed using the genetic programming algorithm, it is possible for chunks covering the same span of text to appear in more than one place in a hypothesis, but since the MY-COMB function checks to be sure its second chunk argument can legally be inserted into a slot in its first chunk argument before doing so, this is not a problem: at most one of these overlapping chunks will be inserted into the resulting structure. Because `schedule' is misanalyzed as an action rather than as an object, there is no way to represent the exact meaning of this sentence with the available chunks. The meaning representation structure with the closest meaning to the original sentence that can be constructed from the available chunks represents the meaning `What do you have scheduled for Thursday?' Both this structure and its associated repair hypothesis can be found in Figure 9. This repair hypothesis builds the ideal structure by first inserting the chunk representing `what' in the WHAT slot of the *SCHEDULE frame. The resulting structure indicates that the sentence is asking about scheduling `what'. The positive WH feature indicates that the sentence is a question. It then inserts the chunk representing `you' in the WHO slot. This filler indicates that the question is asking what `you' have scheduled. Thus, the final resulting structure represents the best available result. Notice that the ideal repair hypothesis does not include both of the largest chunks. Instead, it includes the second largest chunk and two smaller chunks. It is the job of the fitness function to rank hypotheses in such a way that the correct subset
of chunks is included in the hypothesis. This task is non-trivial since the best hypothesis does not necessarily include the largest chunks.
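The combination behavior assumed throughout these examples can be sketched as follows: MY-COMB inserts its second chunk argument into a slot of its first when the meaning-representation specification licenses it, and otherwise falls back to the larger chunk. This is a simplified Python sketch (the original system is in Lisp, and the real operator also consults the subprograms that constructed its arguments); the toy specification SPEC is an illustrative assumption.

```python
# Toy meaning-representation specification mapping each frame to the slots
# it licenses (an assumption of this sketch).
SPEC = {"*RESPOND": ["WHEN"], "*SIMPLE-TIME": [], "*THAT": []}

def size(fs):
    # crude measure of structure size: slot-filler pairs, counted recursively
    return sum(1 + (size(v) if isinstance(v, dict) else 0) for v in fs.values())

def my_comb(chunk1, chunk2, slot):
    """Insert chunk2 into `slot` of chunk1 when the specification licenses
    that slot for chunk1's frame; otherwise leave the slot uninstantiated
    and return the larger chunk, as in the suboptimal hypothesis of Figure 5."""
    if slot in SPEC.get(chunk1.get("FRAME"), []):
        combined = dict(chunk1)
        combined[slot] = chunk2
        return combined
    return chunk1 if size(chunk1) >= size(chunk2) else chunk2

out = {"FRAME": "*RESPOND", "DEGREE": "NORMAL", "TYPE": "NEGATIVE"}
mornings = {"FRAME": "*SIMPLE-TIME", "TIME-OF-DAY": "MORNING",
            "NUMBER": "PLURAL", "SIMPLE-UNIT-NAME": "TOD"}
repaired = my_comb(out, mornings, "WHEN")
```

Here `repaired` is the feature structure glossed `Mornings are out.', while combining the same chunks in the opposite order with an unusable slot falls back to the larger (time expression) chunk.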
4.4 Extending the Scope of Possible Repairs

The constraint that the meaning of a single span of text can be represented no more than once in a single repair hypothesis makes it possible to handle cases of lexical ambiguity. In the repair module a single chunk from each largest parsable span is selected and then dissected into its smaller constituent chunks. If alternative chunks from each largest parsable span were selected instead of only one, then the resulting set of chunks would contain chunks from each alternative ambiguity. It would then be the job of the fitness function to make the correct decision based on the four pieces of indirect evidence about the goodness of each alternative hypothesis. Since alternative chunks covering overlapping spans of text cannot be inserted into the same resulting structure, each resulting structure is constrained to represent a single choice for each ambiguous portion of the sentence. A similar technique can be used to extend ROSE to accept word lattices from a speech recognizer as input. This requires a parser capable of constructing partial analyses covering all portions of a word lattice, such as the lattice parsing version of GLR* described in (Lavie 1995). Instead of starting with the set of largest chunks spanning the sentence, the set of largest chunks spanning the whole word lattice would be selected. Just as in the original ROSE approach, the full set of smaller constituents from inside these chunks would also be selected. By adding the condition that chunks from parallel branches of a word lattice cannot be inserted
into the same final structure, it can be ensured that each resulting structure will represent exactly one path through the word lattice. Note that the potential success of these two extensions depends upon the ability of the trained fitness function to correctly evaluate the relative goodness of hypotheses that differ in terms of alternative ambiguities or alternative speech hypotheses.
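The span bookkeeping that both extensions rely on can be sketched as follows (a Python sketch; the function names and the representation of a chunk as a (start, end, analysis) tuple over word or lattice positions are assumptions of this sketch):

```python
def select_cover(chunks):
    """Greedy selection of the largest parsable spans: repeatedly take the
    longest chunk whose span does not overlap any chunk already chosen.
    A chunk is a (start, end, analysis) tuple over word positions."""
    chosen = []
    for chunk in sorted(chunks, key=lambda c: c[1] - c[0], reverse=True):
        if all(chunk[1] <= s or chunk[0] >= e for s, e, _ in chosen):
            chosen.append(chunk)
    return sorted(chosen)

def compatible(spans):
    """The overlap constraint: no two chunks inserted into one resulting
    structure may cover overlapping spans, so each hypothesis represents a
    single choice per ambiguous region (or per parallel lattice branch)."""
    spans = sorted(spans)
    return all(a[1] <= b[0] for a, b in zip(spans, spans[1:]))
```

For the sentence of Figure 8, the greedy pass selects the two largest chunks, spanning words 0 to 8 and 8 to 11, and the overlap check rejects any hypothesis that also inserts one of their sub-chunks.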
5 Why Genetic Programming

The genetic algorithm and genetic programming techniques have not been applied very widely to computational linguistics, and the history of their application in this field is short. Losee (1995) describes an approach to using the genetic algorithm to evolve parsing grammars to be used for document retrieval. Berwick (1991) and Clark (1991) both describe applications of the genetic algorithm to evolving parameter settings for parameterized GB grammars. A short review of applications of genetic programming to computational linguistics can be found in (Koza 1994). Two examples are given. The first example is Siegel's (1994) work, in which genetic programming was used to induce decision trees for word sense disambiguation. The other example is Nordin's (1994) work, in which genetic programming was used to evolve machine code programs for classifying spelled out Swedish words as nouns or pronouns. It has also been used for combining context-based and non-context-based predictions for parse disambiguation (Qu et al. 1996a; Qu et al. 1996b) and for learning constraints to be used in a plan-based discourse processor for the purpose of identifying the discourse purpose of sentences (Mason and Rose 1998). Although genetic programming has never before been applied to the problem of
recovery from parser failure, this problem is a completely natural application for it. Genetic programming is most appropriate for problems where the solution can be represented easily as a computer program and where it is possible to evaluate the goodness of potential solutions efficiently. The most compelling reason why genetic programming is more appropriate than other potential search algorithms is that it provides a natural representation for the problem. One can easily conceptualize the process of constructing a global meaning representation hypothesis as the execution of a computer program that assembles the set of chunks returned from the parser. Because the programs generated by the genetic search are hierarchical, they naturally represent the compositional nature of the repair process. Although it is not straightforward to evaluate the relative fitness of alternative repair hypotheses, as discussed above, once trained, the fitness functions can be executed quickly, making it possible to evaluate the relative fitness of alternative repair hypotheses efficiently, as evidenced by the short time consumed by the entire genetic search as reported below in the evaluation presented in Section 6. Although it may appear on the surface that a simpler control structure such as a priority queue would suffice, such an approach would require the system to decide which set of repairs to start with and then which alternative hypotheses logically follow in an ordered manner. However, since the goodness of hypotheses is determined by factors that make competing predictions (completeness versus simplicity), no such ordering can be determined a priori. Likewise, what distinguishes hypotheses from one another is both which subset of chunks is included in the hypothesis and how the chunks are composed. It is not clear what principle should be used in generating successive hypotheses to test: whether it is better to change the subset of chunks included in the hypothesis or simply to change the way they are assembled. Finally, it is not clear what stopping criterion one would use for an application of this method to the problem of repair, or how one would avoid locally optimal solutions. Thus, genetic programming's opportunistic search method is more appropriate in this case. The genetic search first samples its search space widely and shallowly and then narrows in on the regions surrounding the most promising looking points. Genetic programming is also appropriate because it has the ability to search a large space efficiently. Because in the same population there can be programs that specify how to build different parts of the meaning representation, different parts of the full solution are evolved in parallel, making it possible to evolve a complete solution quickly. An efficient search algorithm is necessary because the space of alternative repair hypotheses is too large to search thoroughly.

Number of Chunks    Search Space Size
       1                    1
       2                    4
       3                   18
       4                  116
       5                  536

Fig. 10. Number of possible repair hypotheses corresponding to the number of chunks produced by the parser

Figure 10 demonstrates how the total number of possibilities grows as the number of chunks produced by the parser increases. This growth was calculated by considering that for any set of chunks produced by the parser, the repair module must decide both which subset of those chunks should be included in the final hypothesis and how the selected subset should be composed. For example, if there are three chunks, there are three ways to choose one chunk, three ways to choose two chunks, and one way to choose three chunks. For each set of two chunks, the first chunk can be inserted into the second chunk or the second chunk can be inserted into the first chunk. Thus, there are six hypotheses involving two chunks. For the set of three chunks, there are nine ways of assembling them: two chunks can be combined and then inserted into the third one, or two of them can be inserted into the third one separately. Thus, there are eighteen possible hypotheses from three chunks. Note that the actual total number of possible hypotheses is even larger than this, since wherever a chunk is inserted into another chunk, there may be more than one possible slot in which to make the insertion. The number of alternative hypotheses examined by the genetic programming algorithm can be fixed ahead of time by setting the population size (the number of programs constructed for each generation) and the maximum number of generations. By limiting the population size to thirty-two and the number of generations to four, the repair module is constrained to search only 128 alternatives. Thus, if the number of chunks produced by the parser is any more than four, the genetic programming approach yields a significant savings in terms of the number of alternatives explored. If the number of chunks produced by the parser is exactly four, the
number of alternatives explored is similar. The average number of chunks produced by the parser in the evaluations presented in Section 6 was 5.66. In 55.7 per cent of the cases, the parser produced more than four chunks.
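The bounded search just described can be sketched as follows (a Python sketch; the `breed` argument, standing in for the crossover and mutation step, and the toy fitness used in the illustration are assumptions, and lower fitness is better):

```python
def genetic_search(population, fitness, breed, generations=4):
    """Run a fixed number of generations; keep the fitter half of each
    population as parents and refill it with their offspring.  With a
    population of 32 and four generations, at most 128 hypotheses are
    examined, as described above."""
    for _ in range(generations):
        population = sorted(population, key=fitness)
        parents = population[: len(population) // 2]   # survival of the fittest
        population = parents + breed(parents)
    return min(population, key=fitness)

# Toy illustration: 32 integer 'programs', fitness is distance to 7,
# and the breeder simply perturbs each parent.
best = genetic_search(list(range(32)),
                      fitness=lambda x: abs(x - 7),
                      breed=lambda parents: [p + 1 for p in parents])
```

Because the fittest individuals always survive into the next generation, the best hypothesis found so far is never lost, even though only a fixed budget of alternatives is ever examined.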
6 A Comparative Analysis in a Large Scale Practical Setting

In order to evaluate how the two stage repair approach compares with the single stage MDP approach in a practical, large-scale scenario, we conducted a comparative experiment. For a minimum distance parser we use a version of Lavie's GLR* parser (Lavie 1995) extended to be able to perform both skipping and inserting, which we refer to as LR MDP. For the two stage ROSE approach, we use a limited version of GLR* to perform the initial partial parsing stage. Thus, we use a similar parsing framework in both cases in order to render the two approaches as comparable as possible.
6.1 The LR MDP Parser

Most minimum distance parsers are based on chart parsing (Earley 1970; Kay 1980). The implementation described here, however, was developed in the LR parsing framework so that it could be directly compared with the two stage ROSE implementation, which uses an LR style parser. The primary difference between the standard chart parsing approach and the LR parsing approach is in the way that the grammar is used. Rather than using the grammar directly, an LR parser uses a finite-state pushdown automaton, which is a compiled version of the grammar. The grammar rules are implicit in the state machine. Each state represents a set
of `dotted' rules that capture which right-hand-side constituents of grammar rules have been recognized while parsing the part of the input string that has been processed so far. Associated with each state is a set of parsing actions. The Graph Structured Stack (GSS) keeps track of partial parses as they are built, and is similar to the chart in the chart parsing approach. On the GSS, symbol nodes and state nodes alternate, representing recognized constituents and the states derived after processing the constituents. Each state node on the frontier of the GSS is an active state from which the next input token can be pushed. The LR MDP parser allows both deletion of words from the input string and insertion of words (or constituents) into the input string. Deletion of words in the input string is accomplished by pushing input tokens onto interior state nodes of the GSS. Insertions are handled in the LR MDP implementation by pushing `dummy' non-terminals onto the GSS, followed by the appropriate state, as determined from the `goto' portion of the LR parsing table. Pushing such a non-terminal A onto the stack is equivalent to inserting into the input string some sequence of words that is reducible to A. The inserted `dummy' non-terminals in our LR MDP implementation have empty feature structures associated with them. This allows the feature unification procedure of the GLR parser to treat inserted symbol nodes in the same way it treats regular symbol nodes. In order to allow the parser to select parses of minimum distance from the original input, a penalty is assigned to each inserted non-terminal symbol. The penalty is equal to the minimum number of inserted words in a derivation that would produce that particular non-terminal symbol. A more in-depth discussion of LR MDP can be found in (Rose 1997).
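The insertion penalty described above, the minimum number of words in any derivation from a non-terminal, can be precomputed from the grammar by a simple fixpoint iteration. This is a Python sketch; the grammar encoding (non-terminals map to lists of right-hand sides, and any symbol not in the map is treated as a terminal) is a convention assumed here:

```python
def min_insertion_penalty(grammar):
    """For each non-terminal, compute the minimum number of words in any
    derivation from it.  Iterate to a fixpoint: a right-hand side costs one
    per terminal plus the current cost of each non-terminal it contains."""
    cost = {nt: float("inf") for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                c = sum(cost[sym] if sym in grammar else 1 for sym in rhs)
                if c < cost[nt]:
                    cost[nt] = c
                    changed = True
    return cost

# e.g. a toy grammar; lowercase symbols are terminals
grammar = {"S": [["NP", "VP"]],
           "NP": [["det", "N"], ["N"]],
           "N": [["noun"]],
           "VP": [["verb", "NP"]]}
```

For this toy grammar, inserting an NP costs 1 (a bare noun), a VP costs 2, and a full S costs 3, which is exactly the penalty charged for pushing the corresponding `dummy' non-terminal.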
6.2 An Empirical Comparison of ROSE and LR MDP

We designed an experiment in order to compare the two interpretation processes using the same grammar and test set. In the evaluation, we ran each parser using the same semantic grammar with approximately 1000 rules. For each sentence, the meaning representation structure returned by the alternative interpretation processes was passed to a generation component, which generates a sentence in the target language. This text was then graded by a human judge as Bad, Partial, Okay, or Perfect in terms of translation quality. The human judge was a staff member on the JANUS project experienced in grading translation quality who had not been involved in the development of any portion of the system being evaluated. `Partial' indicates that the result communicated part of the content of the original sentence while not containing any incorrect information. `Okay' indicates that the generated sentence communicated all of the relevant information in the original sentence but not in a perfectly fluent way. `Perfect' indicates both that the result communicated the relevant information and that it did so in a smooth, high quality manner. The test set used for this evaluation contains 500 sentences extracted from a corpus of spontaneous scheduling dialogues collected in English. One should note that while a full MDP algorithm allows insertions, deletions, and transpositions, our more restrictive version of MDP allows only insertions and deletions. Although this still allows the MDP parser to repair any sentence, in some cases the result may not be as complete as it would have been with the full version of MDP or with the two stage repair process. We use the restrictive version
of MDP, and not the full implementation, because the runtime of the full version of MDP quickly diverges and becomes infeasible when working with a large-scale grammar, such as the one we use here. Additionally, with a lexicon on the order of 3000 lexical items, it is not practical to do insertions on the level of the lexical items themselves. Instead, we allow only non-terminals to be inserted. As mentioned earlier, an insertion penalty equivalent to the minimum number of words it would take to generate a given non-terminal is assigned to a parse for each inserted non-terminal. In order to test the performance of the LR MDP approach with increasing levels of robustness, we used a parameterized version of LR MDP, where we restricted the maximum allowed deviation penalty. The deviation penalty of a parse is the total number of words skipped plus the parse's associated insertion penalty as described above.
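The parameterization just described amounts to the following check (a Python sketch with assumed names; `insertion_cost` maps each non-terminal to its precomputed insertion penalty):

```python
def deviation_penalty(words_skipped, inserted_nonterminals, insertion_cost):
    """Deviation penalty of a parse: the number of words skipped plus the
    insertion penalty of each inserted `dummy' non-terminal (the minimum
    number of words needed to derive it)."""
    return words_skipped + sum(insertion_cost[nt] for nt in inserted_nonterminals)

def within_limit(words_skipped, inserted_nonterminals, insertion_cost, max_penalty):
    # parameterized LR MDP keeps only parses within the maximum allowed penalty
    return deviation_penalty(words_skipped, inserted_nonterminals,
                             insertion_cost) <= max_penalty
```

For example, a parse that skipped two words and inserted one non-terminal with penalty 1 has deviation penalty 3, so it survives under the MDP 3 and MDP 5 settings described below but not under MDP 1.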
6.3 Results

The LR MDP parser was evaluated on the test corpus at three different flexibility settings. The first setting, MDP 1, is LR MDP with a maximum deviation penalty of 1. Similarly, MDP 3 and MDP 5 are LR MDP with maximum deviation penalties of 3 and 5 respectively. MDP 5 was the most flexible version of MDP that was found to be practical to run with a grammar and lexicon as large as those used in this evaluation. To evaluate ROSE, we also ran a number of different experiments. We first ran a very limited version of GLR* in which only initial segments can be skipped, which we refer to as GLR with Restarts. Thus, while the parser can
start a new parse with each word in the sentence, analyses produced are always for contiguous portions of the sentence. Since GLR with Restarts is very fast, we wanted to evaluate what level of performance can be achieved using ROSE with such a limited flexibility setting for the parser. We ran GLR with Restarts both with and without ROSE's automatic repair module. Timings for all five of these different evaluations on the corpus are displayed in Figure 11. The distribution of sentence lengths can be found in Figure 12. Notice that GLR with Restarts is significantly faster than even MDP 1. Also noteworthy is the very small difference in processing time between GLR with Restarts and GLR with Restarts +
Repair. This is due to the fact that repair is actually performed only on the select portion of the sentences for which it is automatically determined to be beneficial. The process itself is also not time consuming; thus the overall effect on the run time of GLR with Restarts + Repair is quite minimal. In the above experiments, ROSE was run using the fitness function trained by genetic programming. The translation quality ratings for the five different evaluations on the corpus are found in Figure 13. Predictably, MDP 5 shows an improvement over MDP
1, with an associated significant cost in run time. Also, not surprisingly, the very restrictive GLR with Restarts, while it is faster than either of the other two, has a correspondingly lower associated translation quality. However, GLR with
Restarts + Repair outperforms the others in terms of total number of acceptable translations while not being significantly slower than GLR with Restarts and no repair. Though these results demonstrate certain trends in the performance of the alternative approaches, the differences in translation quality overall are very small.
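The fitness functions referred to throughout this section score candidate repair hypotheses so that the best one can be selected. A linear combination fitness function can be sketched as follows; the feature names and weights below are invented for illustration and do not reproduce ROSE's trained function.

```python
# Hypothetical sketch of a linear combination fitness function for
# ranking repair hypotheses. Features and weights are invented; the
# actual trained features and weights are not reproduced here.

WEIGHTS = {
    "coverage": 2.0,          # fraction of input words accounted for
    "chunk_count": -0.5,      # fewer, larger chunks preferred
    "combination_cost": -1.0, # cost of the repair operations used
}

def fitness(hypothesis):
    """Weighted sum of the hypothesis features."""
    return sum(WEIGHTS[f] * hypothesis[f] for f in WEIGHTS)

def best_hypothesis(hypotheses):
    """Return the hypothesis with the highest fitness score."""
    return max(hypotheses, key=fitness)

# A broad but expensive repair loses to a cheaper, simpler one here.
h_broad = {"coverage": 0.9, "chunk_count": 2, "combination_cost": 1.0}
h_cheap = {"coverage": 0.6, "chunk_count": 1, "combination_cost": 0.5}
print(best_hypothesis([h_broad, h_cheap]) is h_cheap)  # True
```

A genetic programming approach would instead evolve the scoring expression itself rather than just fitting the weights of a fixed linear form.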
Fig. 11. Parse Times for Alternative Strategies. (Plot of processing time in seconds against sentence length, for MDP 1, MDP 3, MDP 5, GLR with Restarts, and GLR with Restarts + Repair.)
Fig. 12. Distribution of Sentence Lengths. (Histogram of number of sentences against sentence length.)
Nevertheless, the very significant difference in runtime performance between MDP
                             NIL     Bad    Partial   Okay   Perfect   Total Acceptable
MDP 1                       21.4%    3.4%    3.4%    18.4%    53.4%         71.8%
MDP 3                       16.2%    4.2%    5.0%    19.6%    55.0%         74.6%
MDP 5                        8.4%    8.2%    6.0%    21.0%    56.4%         77.4%
GLR with Restarts            9.2%    6.4%   12.8%    19.4%    52.2%         71.6%
GLR with Restarts + Repair   0.4%    9.6%   11.8%    23.4%    54.8%         78.2%

Fig. 13. Translation Quality of Alternative Strategies
5 and GLR with Restarts + Repair demonstrates that the ROSE approach is a clear winner. The contribution made by ROSE's Combination stage is more evident when considering the maximum potential repair that is possible using chunks produced by the parser. The percentage of test sentences in the corpus where repair was determined to be necessary and the parser produced sufficient chunks for actually constructing an acceptable hypothesis is only 8.6 per cent. Therefore, the 6.6 per cent of additional acceptable hypotheses produced by ROSE constitutes 76.7 per cent of the maximum potential improvement. Though this still leaves room for further work, it demonstrates that a significant percentage of the maximum improvement that is possible to achieve with repair was indeed realized. As mentioned earlier, the linear combination trained fitness function performed
                                      NIL     Bad    Partial   Okay   Perfect   Total Acceptable
Linear Combination Fitness Function   0.4%    8.6%   12.4%    22.4%    56.2%         78.6%
Genetic Programming Fitness Function  0.4%    9.6%   11.8%    23.4%    54.8%         78.2%

Fig. 14. Translation Quality of GLR with Restarts + Repair with Two Alternative Fitness Functions
slightly better than the genetic programming trained fitness function. Figure 14 compares the performance of ROSE with the two fitness functions. Notice that the total percentage of acceptable translations for the linear combination fitness function is slightly higher than for the genetic programming trained fitness function. While the genetic programming trained fitness function achieved 76.7 per cent of its potential improvement, the linear combination trained fitness function achieved 81.4 per cent of its potential. After these results were established for ROSE running with GLR with Restarts, a second evaluation was run over the same corpus making use of the full GLR* parser, constrained only by a beam size of 30 on the frontier of the parser's Graph Structured Stack. The purpose of this second evaluation was to test the limits of robustness that are achievable with word deletion alone, and to look at the difference in time requirements between the full GLR* parser and the more restrictive
Fig. 15. Parse Times for Alternative Strategies. (Plot of processing time in seconds against sentence length, for MDP 1, MDP 3, MDP 5, GLR with Restarts, GLR with Restarts + Repair, and GLR*.)

Fig. 16. Parse Times for Alternative Strategies. (Plot of processing time in seconds against sentence length, for GLR with Restarts, GLR with Restarts + Repair, GLR*, and GLR* + Repair.)
GLR with restarts. In this evaluation, full GLR* was run both with and without repair.
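The effect of a beam restriction like the one above can be illustrated schematically. The state representation and scoring below are assumptions made for illustration; the actual GLR* beam operates on states on the frontier of the Graph Structured Stack.

```python
# Schematic illustration of a beam restriction: at each step only the
# best-scoring frontier states survive, bounding the search even when
# word skipping multiplies the number of alternatives. The state
# representation and scoring here are invented.

def prune_frontier(frontier, beam_size=30):
    """Keep the beam_size best states; here, states that skipped
    fewer words score better."""
    return sorted(frontier, key=lambda s: s["words_skipped"])[:beam_size]

# 100 competing states with varying amounts of skipping.
frontier = [{"id": i, "words_skipped": i % 10} for i in range(100)]
pruned = prune_frontier(frontier, beam_size=30)
print(len(pruned))                              # 30
print(max(s["words_skipped"] for s in pruned))  # 2
```

The point of the beam is to keep the cost of full word skipping bounded, at the risk of occasionally discarding the state that would have led to the best analysis.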
                             NIL     Bad    Partial   Okay   Perfect   Total Acceptable
MDP 5                        8.4%    8.2%    6.0%    21.0%    56.4%         77.4%
GLR with Restarts            9.2%    6.4%   12.8%    19.4%    52.2%         71.6%
GLR with Restarts + Repair   0.4%    9.6%   11.8%    23.4%    54.8%         78.2%
GLR*                         2.2%    8.8%   11.6%    21.4%    56.0%         77.4%
GLR* + Repair                0.6%    8.8%   10.6%    23.6%    56.4%         80.0%

Fig. 17. Translation Quality of Alternative Strategies
The results in Figure 15 clearly indicate that even GLR* is faster than MDP 1. Figure 16 shows the differences in runtime between GLR*, GLR* + Repair, GLR with Restarts and GLR with Restarts + Repair. The translation quality for these four alternatives, along with MDP 5, is found in Figure 17. The time required by GLR* is greater than that of GLR with Restarts + Repair. However, since the parse quality is higher for GLR* than for GLR with Restarts, repair is used less often, and thus there is a smaller difference in time between the case with repair and the case without repair. Interestingly, although GLR* is slower than GLR with Restarts + Repair, it has fewer overall acceptable translations. Therefore, although it is possible to get more perfect translations with more word skipping, GLR with Restarts + Repair gets more acceptable translations faster. By adding repair on top of GLR*, it is possible to achieve an even better performance. Notice that the
                             NIL     Bad    Partial   Okay   Perfect   Total Acceptable
GLR with Restarts + Repair   0.4%    9.6%   11.8%    23.4%    54.8%         78.2%
  Oracle                     0.4%    8.8%   10.8%    24.0%    56.0%         80.0%
GLR* + Repair                0.6%    8.8%   10.6%    23.6%    56.4%         80.0%
  Oracle                     0.6%    8.2%    9.8%    24.0%    57.4%         81.4%

Fig. 18. Selection Heuristic vs. Oracle
difference in average processing time for GLR* and GLR* + Repair is barely distinguishable. This is because the Parse Quality Confidence heuristic rarely indicated that the parse needed repair, thus not giving the repair stage much room to yield a big improvement. The results for both fitness functions were identical. ROSE's Hypothesis Formation Stage was evaluated one final time in order to test the accuracy of its heuristic for selecting between the best result from the repair module and simply keeping the parser's best ranked result. Since the repair module has less information than the parser, the result of the repair module is not always better than simply returning the set of largest non-overlapping, full sentence parses from the chart. In Figure 18, results of using ROSE's internal heuristic are compared with those of an oracle which always selects the hypothesis with the best translation quality. The results are very similar, indicating that ROSE's internal heuristic performs well.
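The kind of selection being evaluated here can be sketched as follows. The confidence score, threshold, and names are illustrative assumptions rather than ROSE's actual heuristic.

```python
# Hypothetical sketch of selecting between the parser's best-ranked
# result and the repair module's best hypothesis. The confidence
# measure and threshold are invented for illustration.

def select_hypothesis(parser_result, repair_result,
                      parse_quality_confidence, threshold=0.5):
    """Keep the parser's analysis when its quality confidence is high
    (repair can then be skipped entirely); otherwise prefer the
    repair module's hypothesis when one is available."""
    if repair_result is None or parse_quality_confidence >= threshold:
        return parser_result
    return repair_result

print(select_hypothesis("parser-best", "repaired", 0.9))  # parser-best
print(select_hypothesis("parser-best", "repaired", 0.2))  # repaired
```

An oracle, by contrast, would pick whichever hypothesis actually yields the better translation, which is what Figure 18 uses as the upper bound.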
6.4 Discussion

The set of evaluations that we conducted clearly indicates that the ROSE approach is significantly faster than MDP per associated improvement in translation quality. While we have no doubt that increasingly more flexible versions of MDP would perform better than MDP 5, we have demonstrated that even MDP 5 is impractical in terms of its run-time performance. Thus we conclude that the two stage approach, even with a very limited flexibility parser, is a superior choice. ROSE with GLR* is demonstrated to achieve almost 2 per cent more acceptable translations than ROSE with GLR with Restarts. These results appear to indicate that more skipping yields better results. However, a more in-depth examination of individual sentences parsed with the alternative methods reveals that in some cases the additional word skipping of GLR* actually results in a worse parse (1.5 per cent of the time over the whole corpus of 500 sentences). Since more skipping makes more alternative analyses available for the parser to choose from, and since it must make a trade-off between the skipping penalty and other parse scores, the additional level of ambiguity creates more opportunity for the parser to select the wrong analysis. Thus, GLR* is not necessarily the ultimate partial parser for a two stage approach like ROSE. Instead, some compromise between GLR* and GLR with Restarts would likely yield better results in less time than GLR* + Repair. Finding the optimal level of parser flexibility remains to be investigated in future research.

The comparison between GLR with Restarts + Repair and GLR* with no repair appears to support the claim that a two step approach yields better performance per cost in time than a one stage approach. The reason is that a more constrained parsing algorithm is used in the two step approach while the one-stage approach requires a more flexible parsing algorithm. In the two-stage approach, the knowledge gained in the first stage constrains the search in the second stage. In this way, the second stage can be performed efficiently. Additionally, dividing the process into two stages makes it possible to avoid the unnecessary computation of the second stage in cases where the second stage is determined not to be necessary.

Though it was demonstrated that the ROSE approach achieves a significant percentage of its maximum potential performance benefit, it is important to also examine the roughly 20 per cent of the corpus where the ROSE approach is not able to achieve any improvement. A small portion of these sentences are such that improvement on the parsing side would make it possible to produce sufficient chunks to construct an acceptable hypothesis with the repair module. However, in the great majority of these cases, either the meaning of the sentence cannot be represented by the meaning representation of the system (about half of these cases) or at least one important portion of the sentence is worded in a way that cannot be analyzed at all by the parsing grammar (about a quarter of these cases).
7 Conclusions

In this paper we addressed the issue of how to efficiently and effectively handle the problem of interpreting extragrammatical input in a large-scale spontaneous spoken language system. We argue that even though Minimum Distance Parsing offers a theoretically attractive solution to the problem of extragrammaticality, it is computationally infeasible in large scale practical applications. Our analysis demonstrates that the ROSE approach, consisting of a skipping parser with limited flexibility coupled with a completely automatic post-processing repair module, performs significantly faster than even a limited version of MDP, while producing analyses of superior quality. We demonstrate further that two stage approaches such as ROSE possess the ability to achieve a better efficiency/effectiveness trade-off than single stage approaches in general. Interesting avenues for future research include looking into alternative beam techniques for effectively limiting skipping, testing for the optimal level of skipping, and searching for a more effective trade-off between the number of skipped words and other parser scores in selecting the best parse and extracting stable chunks from the parser's internal chart. Another alternative is supplying the repair module with multiple analyses of ambiguous portions of the sentence, as described in Section 4 above. This would give the repair module the means to produce a reasonable interpretation in cases where the set of chunks currently provided by the parser is insufficient for repair due to incorrect disambiguation.
8 Acknowledgements

We would like to thank all of the members of the JANUS project and other close colleagues for their direct and indirect contributions to this work. We would like to express our particular appreciation to Lori Levin, Jaime Carbonell, Barbara Di Eugenio, Johanna Moore, and Sandra Carberry for helpful discussions and feedback on earlier drafts of this work. We are grateful to Christine Watson for grading all of the output for each of the experiments and to Donna Gates for making available all of the training and testing corpora. Finally, we would like to thank our anonymous reviewers for their substantial feedback on this article.
References

S. Abney. 1996. Partial parsing via finite-state cascades. In Proceedings of the Eighth European Summer School In Logic, Language and Information, Prague, Czech Republic.
A. V. Aho and T. G. Peterson. 1972. A minimum distance error correcting parser for context free languages. SIAM Journal of Computing, 1(4):305-312.

R. C. Berwick. 1991. From rules to principles in language acquisition: A view from the bridge. In D. Powers and L. Reeker, editors, Proceedings of Machine Learning of Natural Language and Ontology.
R. Clark. 1991. A computational model of parameter setting. In D. Powers and L. Reeker, editors, Proceedings of Machine Learning of Natural Language and Ontology. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. 1989. Introduction to Algorithms (Ch 25). The MIT Press.
M. Danieli and E. Gerbino. 1995. Metrics for evaluating dialogue strategies in a spoken language system. In Working Notes of the AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation.
J. Earley. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13(2).

U. Ehrlich and G. Hanrieder. 1996. Robust speech parsing. In Proceedings of the Eighth European Summer School In Logic, Language and Information, Prague, Czech Republic.
S. Federici, S. Montemagni, and V. Pirrelli. 1996. Shallow parsing and text chunking: a view on underspecification in syntax. In Proceedings of the Eighth European Summer School In Logic, Language and Information, Prague, Czech Republic.
P. J. Hayes and G. V. Mouradian. 1981. Flexible parsing. American Journal of Computational Linguistics, 7(4).

D. R. Hipp. 1992. Design and Development of Spoken Natural-Language Dialog Parsing Systems. Ph.D. thesis, Dept. of Computer Science, Duke University.
J. Holland. 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press.

K. Jensen and G. E. Heidorn. 1993. Parse Fitting and Prose Fixing (Ch 5). In Natural Language Processing: the PLNLP Approach. Kluwer Academic Publishers.
M. Kay. 1980. Algorithm schemata and data structures in syntactic processing. Technical report, Xerox PARC, CSL-80-12. J. Koza. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press.
J. Koza. 1994. Genetic Programming II. MIT Press. S.L. Kwasny and N. K. Sondheimer. 1981. Relaxation techniques for parsing grammatically ill-formed input in natural language understanding systems. American Journal of Computational Linguistics, 7(2).
B. Lang. 1988. Parsing incomplete sentences. In Proceedings of the 12th International Conference on Computational Linguistics (COLING 88).
A. Lavie and M. Tomita. 1993. GLR* - an efficient noise-skipping parsing algorithm for context free grammars. In Proceedings of the Third International Workshop on Parsing Technologies.
A. Lavie, D. Gates, M. Gavalda, L. Mayfield, A. Waibel, and L. Levin. 1996. Multi-lingual translation of spontaneously spoken language in a limited domain. In Proceedings of COLING 96, Copenhagen.
A. Lavie. 1995. A Grammar Based Robust Parser For Spontaneous Speech. Ph.D. thesis, School of Computer Science, Carnegie Mellon University. J. F. Lehman. 1989. Adaptive Parsing: Self-Extending Natural Language Interfaces. Ph.D. thesis, School of Computer Science, Carnegie Mellon University.
R. M. Losee. 1995. Learning syntactic rules and tags with genetic algorithms for information retrieval and filtering: An empirical basis for grammatical rules. Information Processing And Management.
M. Mason and C. P. Rose. 1998. Automatically learning constraints for plan-based discourse processors. In Working Notes for the AAAI Spring Symposium on Machine Learning and Discourse Processing.
P. Nordin. 1994. A compiling genetic programming system that directly manipulates the machine code. In K. E. Kinnear, editor, Advances in Genetic Programming. MIT Press. Y. Qu, B. Di Eugenio, A. Lavie, L. Levin, and C. P. Rose. 1996a. Minimizing cumulative error in discourse context. In Proceedings of the ECAI-96 workshop on Spoken Dialogue Processing.
Y. Qu, C. P. Rose, and B. Di Eugenio. 1996b. Using discourse predictions for ambiguity resolution. In Proceedings of COLING.

C. P. Rose and L. Levin. 1998. An interactive domain independent approach to robust dialogue interpretation. In Proceedings of the 36th Annual Meeting of the ACL.

C. P. Rose. 1997. Robust Interactive Dialogue Interpretation. Ph.D. thesis, School of Computer Science, Carnegie Mellon University.

E. Siegel. 1994. Competitively evolving decision trees against fixed training cases for natural language processing. In K. E. Kinnear, editor, Advances in Genetic Programming. MIT Press.

S. Sippu. 1981. Syntax error handling in compilers. Technical report, Report A-1981-1, Department of Computer Science, University of Helsinki, Helsinki.

B. Srinivas, C. Doran, B. Hockey, and A. Joshi. 1996. An approach to robust partial parsing and evaluation metrics. In Proceedings of the Eighth European Summer School In Logic, Language and Information, Prague, Czech Republic.
G. Van Noord. 1996. Robust parsing with the head-corner parser. In Proceedings of the Eighth European Summer School In Logic, Language and Information, Prague, Czech Republic.
M. Woszcyna, N. Coccaro, A. Eisele, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Sloboda, M. Tomita, J. Tsutsumi, N. Waibel, A. Waibel, and W. Ward. 1993. Recent advances in JANUS: a speech translation system. In Proceedings of the ARPA Human Language Technology Workshop.
M. Woszcyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, and A. Waibel. 1994. JANUS 93: Towards spontaneous speech translation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.