Left-Corner Parsing Algorithm for Unification Grammars

A Dissertation Submitted to
The School of Computer Science, Telecommunications and Information Systems,
DePaul University
In partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
by Noriko Tomuro
April, 1999

Dissertation Committee:
Dr. Steve Lytinen, Chair
Dr. David Miller
Dr. Tom Muscarello
Dr. Bill Rounds (Univ. of Michigan)


Copyright © by Noriko Tomuro, April 1999


Acknowledgements

As many Ph.D. graduates will say, writing a thesis and doing the research that comes with it is a solitary business. Fortunately I had many people who gave me kind support and warm encouragement throughout this process. I owe a debt of sincere gratitude to each of those people.

First, my advisor Dr. Steve Lytinen, who directed the study which led to this thesis. From day one of my Ph.D. program at DePaul CTI, he made every single step with me, from writing code to proving lemmas. The work described in this thesis has benefitted greatly from his intuition in identifying problems and solving them. In this regard, this thesis is considered to be his as well as mine.

Second, my mentor Dr. David Miller, who nourished my scholarly development during my Master's and Ph.D. programs. His strong top-down view of how pieces fit together, his deep understanding of what each piece is, and his preciseness at the detailed level have influenced me greatly in my graduate studies. After all, this thesis in essence is my best effort to become his true student.

Third, Dr. Tom Muscarello and Dr. Bill Rounds, who reviewed this thesis. From early in my Ph.D. program, Dr. Muscarello has kindly accepted me and given me encouragement. His view of what's really important from a practical standpoint applies beyond academic studies, and it is something I can learn a lot from, particularly in my future "real" work after the degree. Dr. Rounds, a Professor at the University of Michigan, is a highly respected researcher in Computational Linguistics and AI. It is my honor to have him as my committee member.

There are quite a few people, my friends, colleagues and other faculty members, who gave me encouragement from the sideline. Their kind support meant a lot to me, and helped me get through a process which required a significant amount of perseverance. I sincerely want to thank Joseph Morgan, Sotiris Skevoulis and Charles Sykes for their support and friendship. Countless hours of chats and laughs with them, who were struggling through the same process, made the long, no-fun life of a Ph.D. student bearable, and kept me motivated. Another thanks to Kathy Rossi, who has put up with me when I was out of line, and stayed a great friend.

I would also like to thank all CTI faculty and staff members. In particular, I would like to say special thanks to Dr. Gary Andrus and Dr. Henry Harr. I appreciate very much the warm, caring support and encouragement they gave me as friends rather than as teachers. Dr. Harr babysat me almost on a regular basis by listening to my complaints, and kept me in line and focused on what I should be doing. I must also thank Dr. Helmut Epp, the dean of the CTI, for his generous support in providing me the facilities and equipment to do the research, as well as the encouragement he has given me. Among the CTI staff, special thanks to Eleni, Patty and Geri for their kind support, which got me through the program trouble-free.

Finally, I want to thank my dear friends in Japan and Australia, who kept cheering me on from thousands of miles away, for many years. And last but not least, I want to thank my dearest friend Andrew McKeown, for always being behind me and having confidence in me in every endeavor I have attempted. This one's for you.

Chicago, Illinois
April, 1999


Abstract

Parsing with unification grammars is inefficient due to the expressive power of the grammars. Most unification-based parsing algorithms are extensions of context-free (CF) parsing algorithms, and few have been specially designed for unification-style grammars. We have developed an efficient parsing algorithm for unification grammars which takes full advantage of the expressiveness of the grammar. Our algorithm (called LC) is a variation of Left-corner parsing, and it exhibits significantly improved average-case performance compared with previous unification-based parsers.

The efficiency of our LC algorithm comes from two factors. The first is the representation and architecture of LINK. LINK is a syntax-semantics integrated unification-based system which dynamically combines syntax (grammar) and semantics (domain knowledge), and it utilizes all available information at any given point during parsing. The second is the expectation-based Left-corner parsing strategy. By utilizing expectations, the algorithm can eliminate unsuccessful parses which will not fit the left-context (the previous word(s) in the sentence).

The central focus of this thesis is the formalization and the proof of correctness of the LC algorithm. To do so, we specify the algorithm using the constraint-based grammar formalism presented in (Shieber, 1992). In the formulation, the LC algorithm is characterized as an optimization of the abstract parsing algorithm developed in (Shieber, 1992). Then, by using Shieber's proof of correctness of his algorithm, we prove the correctness of our LC algorithm by reducing LC to his algorithm.

In formulating the proof of correctness, we discovered a difficulty in Shieber's algorithm as well as the LC algorithm: for certain grammars, the algorithms may spuriously create nonminimal derivations in addition to the minimal ones. As it turns out, the nonminimal derivation problem raises important issues concerning some of the basic notions in unification grammar and unification-based parsing. We discuss this nonminimal derivation problem in depth, including its sources and possible solutions.

Finally, we present the empirical results obtained from running LINK on a corpus of example sentences taken from real-world texts. The results indicate that, for limited domain texts, LINK achieved linear time average-case performance. This is a marked improvement over other unification-based parsing algorithms.


Contents

1 Introduction
  1.1 Parsing Algorithms for Unification Grammars
  1.2 Efficient Unification-based Parsing Algorithm
  1.3 Formalization
  1.4 Nonminimal Derivations
  1.5 Thesis Overview

2 Parsing Algorithms for Unification Grammars
  2.1 Introduction
  2.2 Unification Grammars
    2.2.1 Feature Structures
    2.2.2 Subsumption and Unification
    2.2.3 PATR-II
    2.2.4 Properties of Unification Grammars
  2.3 Unification-based Parsing Algorithms
    2.3.1 Context-free Parsing Algorithms
    2.3.2 Unification-based Algorithms and Systems
    2.3.3 Issues in Unification-based Parsing
  2.4 Left-Corner Parsing Algorithm (LC)
    2.4.1 LINK
    2.4.2 The LC Algorithm

3 Abstract Left-corner Parsing Algorithm
  3.1 Introduction
  3.2 Informal Overview
    3.2.1 Models
    3.2.2 Algorithms
  3.3 Shieber's Logic
    3.3.1 Logic System
    3.3.2 Formalism for Unification Grammar
    3.3.3 Shieber's Abstract Algorithm
    3.3.4 Filtering Function
    3.3.5 Extension in LINK
  3.4 Abstract Left-corner Algorithm
    3.4.1 Reachability Net Rules
    3.4.2 Item Generation Rules
  3.5 Example

4 Nonminimal Derivations
  4.1 Introduction
  4.2 Nonminimal Derivation Problem
    4.2.1 Minimal and Nonminimal Derivations
    4.2.2 Nonminimal Derivation Phenomena
    4.2.3 Nonminimal Derivations in Shieber's Algorithm
    4.2.4 Nonminimal Derivations in LC Algorithm
  4.3 Sources of the Problem
  4.4 Possible Solutions
  4.5 Nonminimal Derivation and Parse Tree
    4.5.1 Shieber's Parse Trees
    4.5.2 Parse Trees and Admissibility
    4.5.3 Other Representations and Logics
  4.6 Minimal Parse Trees for Unification Grammars
    4.6.1 Bottom-up Definition
    4.6.2 Top-down Definition
    4.6.3 Discussion
  4.7 Parsing Algorithms for Unification Grammars
    4.7.1 Incorrect Unification-based Parsing
    4.7.2 Correctness of Unification-based Systems
    4.7.3 Shieber's Parse Tree Revisited
    4.7.4 Alternative Algorithms
  4.8 Conclusion

5 Proof of Correctness
  5.1 Introduction
  5.2 Nonminimal to Minimal Item Mapping
    5.2.1 Nonminimal to Minimal Mapping in Shieber's Algorithm
    5.2.2 Nonminimal to Minimal Mapping in the LC Algorithm
  5.3 Model Correspondence
    5.3.1 RN1 and Non-RN1 Models
    5.3.2 Model Correspondence
  5.4 Proof of Algorithm Reduction
    5.4.1 Prediction Items
    5.4.2 Operation Mapping

6 Implementations and Empirical Results
  6.1 Introduction
  6.2 Automatic Derivation of Maximizing
    6.2.1 Top-down Prediction Propagation
    6.2.2 Discussion
  6.3 Lazy Unification
    6.3.1 The Unification Operation
    6.3.2 Lazy Unification
    6.3.3 Copy Environment
    6.3.4 The Lazy Unification Algorithm
    6.3.5 Discussion
    6.3.6 Related Work
  6.4 Empirical Results
    6.4.1 Shieber vs. LC
    6.4.2 MAX vs. PS
    6.4.3 Lazy vs. Nonlazy
  6.5 Conclusion

7 Conclusion
  7.1 Contributions
    7.1.1 Efficient Parsing Algorithm for Unification Grammars
    7.1.2 Nonminimal Derivation Problem
  7.2 Future Research
    7.2.1 Grammar Formalism
    7.2.2 Applications

A Shieber's Lemmas

B Auxiliary Lemmas
  B.1 Shieber's Operations
  B.2 Path Replacement Operator
  B.3 Reachability Net Entries

C Nonminimal to Minimal Item Mapping
  C.1 Nonminimal Mapping in Shieber's Algorithm
    C.1.1 Shieber's Derivation
    C.1.2 Minimal and Nonminimal Items in Shieber's Algorithm
    C.1.3 Proof of Nonminimal to Minimal Mapping
  C.2 Nonminimal Mapping in LC Algorithm
    C.2.1 Context for LC Items
    C.2.2 Nonminimal Derivations
    C.2.3 Nonminimal and Minimal Items in LC Algorithm
    C.2.4 Proof of Nonminimal to Minimal Mapping

D Model Correspondence

E Proof of Algorithm Reduction

References

Chapter 1

Introduction

1.1 Parsing Algorithms for Unification Grammars

Unification grammar is a style of complex grammar formalism that originated from context-free grammar. The basic idea of unification grammar is to augment context-free grammar with more linguistic information, such as subject-verb agreement, to better capture the phenomena observed in natural languages. This information is recorded as additional features and their associated values (i.e., feature-value pairs) for each phrase structure constituent, in a complex feature structure representation. A feature structure can also be thought of as a set of constraint equations between features and values in the linguistic domain of the grammar. This kind of formalism based on feature structures allows grammar rules to be written declaratively, with feature-value pairs as the description of the well-formedness of each rule. This declarative approach to grammar clearly departs from traditional, procedural approaches such as Transformational Grammar (TG) (Akmajian and Heny, 1975; Radford, 1988) or Augmented Transition Networks (ATN) (Woods, 1970), which specify the grammar by a set of operations that determine the derivability of constituents. Within the last fifteen years or so, various unification grammars have been developed, including linguistic theories such as Generalized Phrase Structure Grammar (GPSG) (Gazdar et al., 1985), Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982) and Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994), and linguistic tools such as PATR-II (Shieber, 1986a) and Definite Clause Grammar (DCG) (Pereira and Warren, 1980).

Unification grammar is very expressive. However, this expressiveness also brings difficulties in parsing. In general, parsing with unification grammars is intractable in the unrestricted form (Barton et al., 1987). It has also been shown that parsing with some types of left-recursive unification grammars is undecidable (Shieber, 1985). Although those are theoretical worst-case bounds, empirical examinations indicate an inherent computational inefficiency: average-case quadratic or cubic performance (Shann, 1991; Carroll, 1993, 1994). In addition to asymptotic complexity, the performance of implemented unification-based natural language processing (NLP) systems suffers even more because of the overhead of processing large data structures (i.e., feature structures). Even with polynomial time average-case complexity, parsing systems may exhaust their finite resources in processing even short sentences, particularly when the grammar is highly ambiguous.


Therefore, it is of utmost importance to derive efficient algorithms and implementation techniques to make the systems practical (Carroll, 1993).

Most unification-based parsers developed to date are implemented as a relatively straightforward extension of some context-free parsing algorithm. In particular, most systems adopt efficient context-free parsing algorithms such as Chart parsing (Kay, 1980), Generalized LR parsing (Tomita, 1986, 1991), Earley's algorithm (Earley, 1970), and Left-corner parsing (Aho and Ullman, 1972). More recently, there have been several approaches which utilize probabilistic information in unification-based parsing (Briscoe and Carroll, 1993). In those systems, the extension was done simply by replacing context-free symbol concatenation with feature structure unification, with no change in the control logic of the algorithm. Many systems employ efficiency techniques as well, including packing and subsumption checking (Alshawi, 1992), and lower-level optimization techniques such as lazy unification (Godden, 1990).

However, the context-free extension approach does not exploit the increased expressiveness of the underlying unification-style grammar formalism to derive efficient parsing algorithms. In other words, this approach does not promote the design of new control logics (i.e., algorithms) which utilize the additional features to speed up parsing. Also, many implemented systems do not use the additional features immediately as each word in an input sentence is processed. In those systems, the application of unification is delayed until each rule is complete (when the left-hand side (LHS) constituent is realized) or even until the end of parsing. They first use only the phrase structure component (the context-free backbone) of the grammar rules to create a context-free skeleton, and apply unification either rule-by-rule to a complete LHS constituent, or at the end of parsing to a parse tree (i.e., a phrase structure tree). This processing model is used essentially to avoid exponential explosion: for each context-free phrase structure rule, there are many (and possibly infinitely many) decorations in a unification grammar. However, such a model is aimed at overcoming this deficiency rather than at taking advantage of the rich linguistic information available in the feature structures.

Another issue is the use of semantic information. In most unification-based systems, syntactic and semantic analyses are performed asynchronously in two separate phases. In the first phase, syntactic parsing is done using the grammar, and in the second phase, semantics is applied to the result of the syntactic analysis phase (i.e., parse trees). Again, this processing model does not exploit the expressiveness of the unification-style grammar, which can encode syntactic as well as semantic information in the feature structure representation, to improve parsing performance.

1.2 Efficient Unification-based Parsing Algorithm

We have developed an efficient parsing algorithm for unification grammars which takes full advantage of the expressiveness of the grammar. Our algorithm (called LC) is a variation of Left-corner parsing, and it exhibits significantly improved average-case performance for limited domain texts as compared with previous unification-based parsers. It has been implemented within our unification-based NLP system, called LINK (Lytinen, 1992), and preliminary performance results are reported in (Lytinen and Tomuro, 1996). In this thesis, we formalize the LC algorithm in logic and show the proof of correctness; we discuss the formalization further in the next section.


The efficiency of our LC algorithm comes from two factors. The first is the representation and architecture of LINK. LINK is a syntax-semantics integrated unification-based system which dynamically combines syntax (grammar) and semantics (domain knowledge). Those two components are represented uniformly by feature structures in LINK, and during parsing both kinds of features are retrieved and combined for each word in the input sentence. Thus, LINK's architecture supports synchronous (or incremental) processing, which makes full use of the information available in the feature structures.

The second is the expectation-based Left-corner parsing strategy. As with context-free Left-corner parsing, our LC algorithm generates one or more expectations for every word in the input sentence, and uses them to guide parsing. An expectation for a word is a set of features which should be found in the word based on the analysis of the previous word(s) in the sentence (i.e., the left-context). Those features are percolated down to the word through top-down rule rewriting/propagation (top-down processing). If the expected features are consistent with the features of the word, the two sets of features are unified and a new constituent is realized. This constituent in turn is used to fill an intermediate expectation that was generated during top-down propagation (bottom-up processing). Thus, the algorithm performs a hybrid of top-down/bottom-up parsing. Using this strategy, the algorithm gains efficiency by avoiding the generation of unsuccessful parses (bottom-up constituents) that will not fit the left-context. By embedding this strategy in LINK, our LC algorithm can generate maximal expectations with both syntactic and semantic features, and gain further efficiency through the increased pruning power of the semantic features.

In addition to efficiency, the LC algorithm has an important, unique characteristic. As with other Left-corner unification-based parsers such as the Core Language Engine (Alshawi, 1992), the LC algorithm pre-analyzes the grammar for the reachability relation: a reflexive, transitive closure of left-corner derivation (through the first right-hand side (RHS) constituent) from an expectation to the input word. Our LC algorithm extends this standard left-corner technique in a novel way. In particular, LC precompiles the grammar rules that are in a reachability relation into reachability net entries, in which the expectation and a rule which rewrites it after some (unspecified) number of levels of left-corner derivation are connected by a special lc arc. More specifically, every reachability net model has the form in which the expectation is placed at the root node and the rule is found under the lc arc from the root. Thus, the lc arc allows expectations to be propagated to lower levels directly, skipping intermediate derivations.

Figure 1.1 shows an example of a reachability net model represented by a directed acyclic graph (dag). The model in the figure represents a reachability relation between a VP expectation (indicated under the cat arc from the root) and a rule VG → V (under the lc arc). As you can see, the VP and the rule are also connected by the VP's head arc, which points at the same node as the head arcs under the VG and the V. Through these arcs, features that are added to the expected VP during parsing are propagated to the VG and the V.
Our LC algorithm precompiles all possible left-corner relations in the grammar into reachability net entries. Then, during parsing, it uses these entries instead of the original grammar rules to process sentences.


Figure 1.1: Example of LC's reachability net model (with lc arc)

Having this unique data structure with the lc arc, LC's parsing operations are modified from the standard Left-corner algorithm to involve manipulation of this arc. In particular, after the rule under the lc arc becomes complete (i.e., all RHS constituents are realized), LC's bottom-up processing "stretches" the lc arc to fill in the skipped derivation levels. This scheme is quite unique and interesting, especially because the transformation is done while the expectation is still kept intact at the root node.

Our LC algorithm is fully implemented and has been tested with a corpus of example sentences taken from real-world texts. In the latest version, we incorporated lazy unification (Godden, 1990) for optimization. The results indicate that, for limited domain texts, the LC algorithm achieves linear time average-case performance (Lytinen and Tomuro, 1996). This is a marked improvement over other unification-based parsing algorithms.
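As a concrete illustration of the reachability net entry of Figure 1.1, the following sketch (our own encoding, not LINK's actual representation) writes the dag as nested Python dictionaries: the VP expectation sits at the root, the precompiled rule VG → V sits under the lc arc, and the three head arcs share a single node.

shared_head = {}                       # node shared by the VP, the VG and the V

reachability_entry = {
    "cat": "VP",                       # the expectation, at the root
    "head": shared_head,
    "lc": {                            # the rule VG -> V, some levels below
        "cat": "VG",
        "head": shared_head,
        "1": {"cat": "V", "type": "trans", "head": shared_head},
    },
}

# Because the head node is shared, any feature added to the expected VP's
# head during parsing is immediately visible under the VG and the V as well.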

1.3 Formalization

The central focus of this thesis is the formalization and the proof of correctness of the LC algorithm. To do so, we first specify the algorithm using the constraint-based grammar formalism presented in (Shieber, 1992). Shieber (1992) developed a general logic to express a large class of unification-style grammars. He then used the logic to define an abstract parsing algorithm for such grammars and proved its correctness. Based on his result, we prove the correctness of our LC algorithm by first expressing the LINK grammar in Shieber's logic formalism. Then we formulate the LC algorithm in a similar manner to Shieber's abstract algorithm, and show an algorithm reduction from LC to his algorithm, which ultimately proves the correctness of LC.

Shieber's algorithm is essentially an incarnation of Earley's algorithm (Earley, 1970) for context-free grammar: a chart parser which performs top-down/bottom-up mixed-direction parsing. Since Left-corner parsing is considered an optimization of Earley's algorithm in context-free grammar, and our LC algorithm and Shieber's algorithm are essentially extensions of those context-free algorithms to unification grammars, we can analogously characterize LC as an optimization of Shieber's algorithm. In this sense, our reduction proof can also be considered a proof of this optimization relation.


1.4 Nonminimal Derivations

In formulating the proof of correctness, we discovered that Shieber's algorithm, as well as LC, may spuriously create additional parses that carry extra, irrelevant features (hence we call them nonminimal derivations). Our further analysis indicates that this problem happens because of subtle interactions between certain parsing algorithms and unification grammars with certain properties. As it turns out, the nonminimal derivation problem raises important issues concerning some of the basic notions in unification grammar and unification-based parsing, namely the validity of nonminimal parse trees as evidence for the language of a grammar, and the correctness of parsing algorithms which create such results. Those issues are of particular interest to our proof of correctness of the LC algorithm, since Shieber's (1992) proof of his algorithm accepts nonminimal derivations. We discuss this nonminimal derivation problem in depth in Chapter 4.

1.5 Thesis Overview

This thesis is organized as follows. In Chapter 2, we give a summary of unification grammars and survey the unification-based parsing algorithms developed to date. We then present our LC algorithm and give an overview of the LINK system. In Chapter 3, we formalize our LC algorithm using Shieber's logic formalism. In the formulation, we characterize LC as an optimization of Shieber's algorithm, and compare the two algorithms explicitly by examples. In Chapter 4, we turn to the nonminimal derivation problem. We first describe in detail how the spurious situations occur in Shieber's and the LC algorithms, and discuss possible sources of the problem. After proposing several solutions, we go on to discuss the implications this problem has for unification grammar and unification-based parsing. At the end of the discussion, we propose two kinds of correctness: a strong correctness based on strict model minimality, and a weak correctness, which we call correctness up to licensing, that accepts nonminimal models as well as minimal ones. In Chapter 5, we return to our LC algorithm and prove its correctness. The proof is shown in two steps. In the first step, we show the existence of a total mapping from nonminimal to minimal items generated by both Shieber's and the LC algorithms. This allows us to discard nonminimal derivations with respect to the weak correctness criteria that we propose in Chapter 4. In the second step, we show the algorithm reduction from LC to his algorithm, which ultimately proves the correctness of LC. In Chapter 6, we describe two efficiency techniques that are incorporated in the current version of LINK. We then present the results of empirical testing, which indicate average-case linear time performance. Finally, in Chapter 7, we discuss the contributions of this thesis and give a brief sketch of future research.


Chapter 2

Parsing Algorithms for Unification Grammars

2.1 Introduction

In this chapter, we review some preliminary notions in unification grammars and unification-based parsing algorithms, and present a survey of those areas. First we review basic concepts of unification grammar formalisms. Then we survey existing unification-based parsing algorithms and systems. In particular, we describe in detail some of those algorithms that constitute an essential background to our algorithm. Finally we present our efficient unification-based parsing algorithm, and give an overview of the LINK system (Lytinen, 1992) in which our algorithm is implemented.

2.2 Unification Grammars

Since its conception in the early 1980's, unification grammar has been widely accepted and has become one of the most popular grammar formalisms in computational linguistics and natural language processing. Unification-based formalisms developed to date are basically classified into two camps. One camp consists of formalisms developed as a particular linguistic theory, such as Generalized Phrase Structure Grammar (GPSG) (Gazdar et al., 1985), Lexical Functional Grammar (LFG) (Kaplan and Bresnan, 1982; Kaplan, 1989)¹ and Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1987, 1994), and the other camp consists of formal systems developed as a general linguistic tool, such as PATR-II (Shieber, 1986), Functional Unification Grammar (FUG) (Kay, 1984) and Definite Clause Grammar (DCG) (Pereira and Warren, 1980). Although these two camps are motivated by different goals and the formalisms within each camp vary, most unification-based formalisms share some common characteristics: feature structures as the base data structure, and unification as the sole information-combining operation.

¹(Shieber, 1986) gives an excellent introduction and survey of unification-based approaches to grammars. Also, (Sells, 1987) gives a concise overview of GPSG and LFG along with Government-binding Theory (GB) (Chomsky, 1981; Haegeman, 1994).


Figure 2.1: Feature structure and corresponding dag

2.2.1 Feature Structures

In unification-style grammar formalisms, grammar rules are represented by feature structures. A feature structure is a set of feature-value pairs. A feature-value pair consists of a name (the feature) and its associated value.² A feature value may either be an atomic constant, which is a symbol in the linguistic domain of the respective grammar, or may itself be another feature structure; thus, feature structures are recursive. For instance, Figure 2.1 (left) shows a feature structure for a rule S → NP VP written in attribute-value matrix (AVM) notation. In this representation, the feature structures for the RHS constituents are placed under numbered features (1 and 2) from the root, and the features of the LHS constituent are indicated by non-numbered features at the root (cat and head). By convention, the syntactic categories of the constituents (S, NP and VP) are indicated by a special feature cat within each respective structure.

A feature structure can also be shared as a substructure within multiple enclosing structures (i.e., structure sharing). Such shared substructures are marked with co-indices. For instance, the boxed indices 1 through 3 in the AVM of Figure 2.1 represent the constraint that the values under co-indexed substructures must be identical (and possibly at the same location as well). As an example, the boxed index 3 specifies that the agr (agreement) feature under the head feature in the NP and the VP must have the same value. A feature-value pair can also be interpreted as a constraint equation, which views the feature-value association as an equality relation. For this reason, unification grammars are sometimes referred to as constraint-based grammar formalisms.

Feature structures are often modeled by directed acyclic graphs (dags) because of their ability to model recursive nested structure as well as shared substructures.³ In a dag, feature names become the labels of the graph edges, and feature values become either labels of the dag nodes, if they are atomic, or subdags, if they are themselves feature structures. The concatenation of edges, which represents a hierarchical ordering of features, then becomes a path in the graph. For instance, Figure 2.1 (right) shows the corresponding dag for the AVM on the left.

²Features are sometimes called attributes.
³Another popular representation scheme is first-order terms (as in DCG).
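To make the representation concrete, here is a minimal sketch (not taken from the thesis) of how the feature structure of Figure 2.1 might be encoded as nested Python dictionaries, with structure sharing expressed by reusing the same object for co-indexed substructures.

agr = {}                                   # co-index 3: shared agreement value
np_head = {"agr": agr}                     # co-index 2: head of the NP
vp_head = {"subj": np_head, "agr": agr}    # co-index 1: head of the VP (and of S)

rule_s_np_vp = {
    "cat": "S",
    "head": vp_head,                       # <S head> shared with <VP head>
    "1": {"cat": "NP", "head": np_head},   # first RHS constituent
    "2": {"cat": "VP", "head": vp_head},   # second RHS constituent
}

# Adding a value through one path...
np_head["agr"]["num"] = "sing"
# ...is visible through every co-indexed path:
assert rule_s_np_vp["2"]["head"]["agr"]["num"] == "sing"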


Figure 2.2: Subsumption ordering between feature structures


Figure 2.3: Unification of two dags D1 and D2

2.2.2 Subsumption and Unification

There are two important notions defined on feature structures: subsumption and unification. Subsumption, denoted ⊑, is an ordering on feature structures where a feature structure X subsumes another feature structure Y (i.e., X ⊑ Y) if Y contains more features than X (i.e., X is more general than Y). Figure 2.2 shows the subsumption ordering between three dags X, Y and Z. Notice that Z is more specific than Y because it satisfies one more constraint for the shared substructure under the f and g arcs, that is, ⟨f⟩ = ⟨g⟩.

Unification, denoted ⊔, is a procedure which combines feature structures. The unification of two structures X and Y (i.e., X ⊔ Y) is another feature structure which represents the set union of the features contained in X and Y. This information-combining operation is fundamental to feature structures, and thus the grammar formalisms based on feature structures are called unification grammars. Unification also checks the compatibility of feature structures: the unification of two feature structures may fail if they share a feature with incompatible values. To be compatible, values must be equal (if they are atomic), or must be undefined (i.e., a variable) for at least one of the two structures, or must unify if they are feature structures themselves. Figure 2.3 shows an example of the unification of two dags D1 and D2 and the resulting dag D3.
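As an illustration of the compatibility check just described, the following is a minimal sketch (not the thesis' implementation) of unification over feature structures encoded as nested Python dictionaries; structure sharing and cycles are ignored for brevity, and all names are our own.

class UnificationFailure(Exception):
    pass

def unify(x, y):
    """Return a new feature structure with the union of the features of x
    and y, or raise UnificationFailure if they are incompatible."""
    if isinstance(x, dict) and isinstance(y, dict):
        result = {}
        for feature in set(x) | set(y):
            if feature in x and feature in y:
                result[feature] = unify(x[feature], y[feature])   # recurse on shared features
            else:
                result[feature] = x.get(feature, y.get(feature))  # copy the one-sided feature
        return result
    if x == y:                        # compatible atomic values
        return x
    raise UnificationFailure(f"{x!r} conflicts with {y!r}")

# Example: two partial descriptions of the same constituent.
d1 = {"cat": "NP", "head": {"agr": {"num": "sing"}}}
d2 = {"head": {"agr": {"per": "3"}}}
print(unify(d1, d2))
# e.g. {'cat': 'NP', 'head': {'agr': {'num': 'sing', 'per': '3'}}}
# unify(d1, {"cat": "VP"}) would raise UnificationFailure.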

2.2.3 PATR-II

Among the various unification-based formalisms, one of the most general, theory-independent formalisms is PATR-II (Shieber, 1986a), developed at CSLI, Stanford. PATR-II is intended to be a general linguistic tool on which different unification-based grammars can be implemented and tested. PATR-II's formalism is quite simple in that it does not include complex constructions such as disjunction, negation or sets of feature values. But because of this simplicity as well as its generality, PATR-II has been widely used as a base formalism in many unification-based systems (e.g. the reconstruction of GPSG using PATR-II in (Shieber, 1986b)).

In PATR-II, a grammar rule is denoted in the form X0 → X1 .. Xn, where the Xi (0 ≤ i ≤ n) are meta symbols which represent the constituents in a rule. By convention, X0 is used for the LHS constituent, and X1 .. Xn are the ordered constituents on the RHS. Additional feature-value pairs are then denoted using the equality symbol =. For instance, the following is the rule S → NP VP shown previously in Figure 2.1, written in PATR-II notation.

    X0 → X1 X2
    ⟨X0 cat⟩ = S
    ⟨X1 cat⟩ = NP
    ⟨X2 cat⟩ = VP
    ⟨X0 head⟩ = ⟨X2 head⟩
    ⟨X2 head subj⟩ = ⟨X1 head⟩
    ⟨X2 head subj agr⟩ = ⟨X1 head agr⟩

In this notation, the context-free category information of the rule (i.e., the context-free backbone) is found under the ⟨Xi cat⟩ paths. This part is often directly instantiated into the symbols of the first line and written in the shorthand notation as follows.

    S → NP VP
    ⟨S head⟩ = ⟨VP head⟩
    ⟨VP head subj⟩ = ⟨NP head⟩
    ⟨VP head subj agr⟩ = ⟨NP head agr⟩
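For readers who want to see the rule above in program form, here is one possible (hypothetical) encoding of a PATR-II style rule as a context-free backbone plus a list of path equations; PATR-II itself does not prescribe this representation.

# Hypothetical encoding of the PATR-II rule above: a context-free backbone
# plus path-equation constraints.  Each path is a tuple of features rooted
# at one of the rule's constituents X0..X2.

rule = {
    "backbone": ("S", ["NP", "VP"]),                          # X0 -> X1 X2
    "constraints": [
        (("X0", "head"), ("X2", "head")),                     # <X0 head> = <X2 head>
        (("X2", "head", "subj"), ("X1", "head")),             # <X2 head subj> = <X1 head>
        (("X2", "head", "subj", "agr"), ("X1", "head", "agr")),
    ],
}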

2.2.4 Properties of Unification Grammars

Using feature structures as the basic informational domain, unification-style grammars are quite expressive: we can express much more about the properties of each constituent by way of various features, rather than by a single context-free symbol. Unification-based formalisms possess other attractive properties as well.

First, unification grammars facilitate immediate computational interpretation. Feature structures have a natural correspondence to familiar computational models such as graphs, record structures and first-order terms. Those models, together with the operations defined on them, such as unification or substitution, are well-defined and established notions. Therefore, an interpretation of unification-based formalisms is immediately available. One particular model that needs mentioning is the frame structure used in AI. Because of the strong correspondence between feature structures and frames as a representational formalism, unification-style grammar flourished not only in the linguistics community but in the NLP community as well, and it brought the two communities together through this common approach.

Second, unification-style grammars foster a mathematical study of feature structures bearing linguistic information. In this study, unification grammars are rigorously defined as a formal system, often in logic or by algebraic specifications, and the similarity or equivalence of unification grammars to other formal systems can be identified. For example, viewing a feature structure as a set of constraint equations allows unification grammars to be viewed as an equational system. The problem of recognizing a sentence in the language of a grammar then reduces to a constraint satisfaction problem, which immediately suggests a correspondence to programming language semantics (such as operational semantics). Unification grammars also have some nice mathematical properties. One important property is monotonicity. Unification is a monotonic operation, producing a resulting feature structure which contains at least the same or more features than either of the original structures combined.⁴ Not only does monotonicity allow simplification of the formalisms, it also enables the application of mathematical models of information content (such as domain theory) to unification-style grammars.

Third, unification grammars describe the linguistic properties of constituents explicitly and declaratively through various features of the constituents. This procedure-independent, descriptive formalism has advantages in making linguistic generalizations about natural languages compared to traditional, procedural approaches such as Transformational Grammar (TG) (Akmajian and Heny, 1975; Radford, 1988) or Augmented Transition Networks (ATN) (Woods, 1970). As a note, some unification-based formalisms, such as LFG and HPSG, take so-called lexicalized approaches. In lexicalized formalisms, most relations between constituents are viewed as lexical properties rather than grammatical features, and they are encoded as features in the lexicon, such as subcategorization. Those features are checked when lexical entries are combined by unification, and propagated upward as higher-level constituents are created. Other, non-lexical relations are often generalized and defined as meta rules or universal rules/principles in the lexicalized formalisms.

Because of their increased expressiveness, unification grammars have the formal power of recursively enumerable sets in the unrestricted form (e.g. PATR-II). Therefore, the recognition problem for unification grammars is in general intractable (Barton et al., 1987; Rounds, 1991).⁵ It is also known that some unification grammars are even undecidable because of left-recursive rules (Shieber, 1985). This has a direct effect on the efficiency of unification-based parsing algorithms, which we discuss in the next section.
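Monotonicity can be stated precisely in terms of the subsumption order of Section 2.2.2: when the unification X ⊔ Y exists (i.e., X and Y are compatible), it is the least upper bound of X and Y under ⊑, that is,

    X ⊑ X ⊔ Y,   Y ⊑ X ⊔ Y,   and   X ⊔ Y ⊑ Z for every Z such that X ⊑ Z and Y ⊑ Z.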

⁴There have been some nonmonotonic extensions of unification grammars as well. For instance, (Bouma, 1992) proposes a general, nonmonotonic extension of the unification operation on feature structures, called default unification. Others introduce typing into the formalism and organize feature structures in a type inheritance hierarchy (e.g. (Zajac, 1992)), as is done in HPSG.
⁵Some unification-based formalisms impose restrictions on the grammar so that they yield reduced complexity. For instance, LFG has the generative capacity of weakly context-sensitive grammar.

2.3 Unification-based Parsing Algorithms

As we mentioned previously in Chapter 1, most unification-based parsing algorithms developed to date are extensions of context-free algorithms. In this section, we review some of the context-free parsing algorithms which are often extended to unification grammars, and survey the unification-based systems that implement them. Then we discuss some issues pertaining to unification-based parsing.


2.3.1 Context-free Parsing Algorithms

In general, parsing with a nondeterministic context-free grammar is exponential with respect to the size of the input sentence. In order to improve efficiency, quite a few algorithms and techniques have been developed. One technique is memoization, the caching of intermediate computations (a technique used in dynamic programming), which has been employed in many algorithms. For instance, Chart parsing (Kay, 1980) uses a global table called the chart to record edges: partially realized, instantiated rules that are produced during parsing as intermediate results. Then, by requiring that no duplicate edges be added to the chart, redundant computations can be avoided. This scheme can also effectively restrict the repeated application of the same rule to the same substring in the input sentence, thereby ensuring that the algorithm terminates. Variants of chart parsing include Earley's algorithm (Earley, 1970), Left-corner parsing (Aho and Ullman, 1972), Head-corner parsing (Kay, 1989; Sikkel, 1997), and Generalized (nondeterministic) LR parsing (GLR) (Tomita, 1986, 1991).⁶

Another efficiency technique is packing, which records intermediate parse results in a compressed representation. The idea is to group together the complete edges/constituents which span the same substring in the input sentence and have the same LHS category (although their RHSs may vary). This way, locally ambiguous constituents are compressed under a single category, thereby reducing the number of operations applied during parsing. The packed constituents are later unpacked at the end of parsing to generate all possible parses. Pushing the idea of packing further, Tomita's GLR uses a packed shared forest representation for the parse tree, in which each node is a list of LHS categories which span the same substring.

By using these and other techniques, several efficient context-free algorithms have achieved a reduced complexity, from exponential to polynomial time. The optimal result known to date is O(n³), achieved by Earley's algorithm (Earley, 1970), the Cocke-Younger-Kasami (CYK) parser (Kasami, 1965; Younger, 1967) and some others.

Note that there are other approaches which achieve even more efficiency in context-free parsing. One approach is to restrict the expressiveness of the grammar. For instance, deterministic LL and LR parsing used in compilers (Aho, Sethi and Ullman, 1986), the Marcus parser (Marcus, 1980) and Register Vector Grammar (Blank, 1989) recognize only a subset of the context-free languages, but they do so in linear time. More recently, partial parsing has been adopted in many practical NL systems. This approach is aimed more toward identifying constituent fragments in a reasonable time rather than doing a full analysis of a sentence. For instance, the least-commitment parser Fidditch (Hindle, 1983) leaves ambiguous constructions (such as prepositional attachment) as separate fragments and allows interaction with the user to disambiguate. Partial parsing is also quite often used in empirical, corpus-based applications such as Information Extraction.⁷

⁶Tomita's GLR uses a graph-structured stack instead of a chart, but these two data structures facilitate the same memoization effect.
⁷Most of the recent partial parsers (sometimes called "shallow parsers") accept only regular grammars. A good example is the parser used in the FASTUS system (Hobbs et al., 1997).


In what follows, we review three context-free algorithms which are often extended to unification grammars: Bottom-up chart parsing, Earley's algorithm and the Left-corner algorithm. These three algorithms are closely related in that they stand in an optimization relation: the Left-corner algorithm is an optimization of Earley's algorithm, and Earley's algorithm in turn is an optimization of the Bottom-up chart parsing algorithm. We selected these algorithms, in particular Earley's and the Left-corner algorithm, for the purpose of illustrating the context-free equivalents of the unification-based algorithms that constitute the main background of this thesis.

Bottom-up Chart Parsing Algorithm

Bottom-up (BU) chart parsing is a variation of the standard bottom-up algorithm which utilizes a global table, the chart, to store intermediate results for efficiency (i.e., memoization). As with standard bottom-up parsing, BU chart parsing builds a parse tree bottom-up, by realizing the LHS constituent of a rule after all RHS constituents are realized. To keep track of how much of the RHS has been realized thus far (i.e., partial results), BU chart parsing uses active edges and records them in the chart. An edge is a partially completed, instantiated grammar rule, denoted by a 3-tuple ⟨i, j, A → α • β⟩, where i and j are the starting and ending positions in the input string, A → αβ is a rule in the grammar, and α, β are strings of context-free terminal/nonterminal symbols of length ≥ 0. A → α • β is a dotted rule which indicates that the α portion of the RHS constituents has been realized so far (covering the substring from the ith to the jth position). An edge is called active or incomplete when the dot has not reached the end of the RHS (i.e., A → α • β where β is not null), or inactive or complete when the dot has reached the end (i.e., A → α •). Parsing proceeds by extending the edges (advancing the dot in the edges) until a complete edge is generated whose LHS is the start symbol of the grammar.

In this section, we describe a slight variation of the algorithm presented in (Allen, 1995, p. 53).⁸ The algorithm is roughly as follows. At each input position i, keep applying the following operations until no operation is applicable:

(i) Shift 1: If the category of the next input word (at the ith position) is C, where C is a terminal (i.e., a part-of-speech category such as N, V, Det, Prep, etc.),⁹ then for each rule in the grammar of the form A → C α, add an edge ⟨i, i+1, A → C • α⟩ if it is not in the chart.

(ii) Shift 2: For each complete edge of the form ⟨i, j, C → γ •⟩ where C is a nonterminal, for each rule in the grammar of the form A → C α, add an edge ⟨i, j, A → C • α⟩ if it is not in the chart.

⁸In the presentation here, we have named the parsing procedures and eliminated agenda control. We also eliminated the construction of empty edges of the form ⟨i, i, A → • α⟩ for simplicity. However, this kind of edge plays an important role in Earley's algorithm, which we describe next.
⁹Strictly speaking, C is a "preterminal" for the word (a terminal string). In this thesis, we use "terminal" to mean a preterminal category symbol or a terminal string.

Figure 2.4: Example of Bottom-up chart parsing. (The figure shows the chart built for the sentence "John ate the cake", positions 0 through 4, with the grammar S → NP VP, NP → Det N, VP → VG NP, VG → V, NP → "John", V → "ate", Det → "the", N → "cake"; each edge is annotated with the order of operations (1) through (11).)

(iii) Reduce: For each complete edge of the form ⟨i, j, D → γ •⟩ where j ≥ i, and for each edge ⟨h, i, A → α • D β⟩ in the chart, add an edge ⟨h, j, A → α D • β⟩ if it is not in the chart.

Given a grammar G and an input sentence w = w₁ ... wₙ, if the algorithm produces complete constituent(s) for the start category S, then w is in the language of G.

The two Shift operations introduce edges in the chart upon realizing a complete constituent C. Whether the complete constituent is a terminal (Shift 1) or a nonterminal (Shift 2), a new edge is created which is instantiated with a rule that has C as its first RHS constituent; this new edge therefore has its dot positioned just after that first RHS constituent. The dot in the edge is then advanced in subsequent parsing as the rest of the RHS constituents are realized. When the dot reaches the end of a rule (i.e., all RHS constituents are realized), the edge is complete, and its LHS (complete) constituent is used to advance the dot in another edge by the Reduce operation. The important thing to note here is that edges are created only if the same edge is not already recorded in the chart. As mentioned earlier, this scheme prevents the application of the same rule more than once to the same substring, thereby ensuring that the algorithm terminates as well as avoiding unnecessary duplicate computations.

Figure 2.4 shows the chart created by parsing the example sentence "John ate the cake" using a given grammar. The indices 0 through 4 at the bottom of the chart correspond to the positions in the sentence, and each edge (i.e., chart entry) is indicated by a rectangle which covers from its starting position to its ending position. Each edge, denoted by a 3-tuple, is also annotated with a number in parentheses indicating the order in which the operations were applied when parsing was done strictly from left to right. A parse tree produced by the algorithm for this sentence (i.e., the result of operation (11) reduce) is shown in Figure 2.5.

Figure 2.5: Parse tree for "John ate the cake"
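The following is a minimal sketch (not the thesis' implementation) of the bottom-up chart parser just described, run on the toy grammar of Figure 2.4; the edge representation, the function names, and the folding of lexical rules into a LEXICON table are our own choices.

GRAMMAR = [("S", ("NP", "VP")), ("NP", ("Det", "N")),
           ("VP", ("VG", "NP")), ("VG", ("V",))]
LEXICON = {"John": ["NP"], "ate": ["V"], "the": ["Det"], "cake": ["N"]}

def bu_chart_parse(words):
    # An edge is (i, j, lhs, found, remaining): "found" holds the RHS symbols
    # to the left of the dot, "remaining" those still expected to its right.
    chart = set()

    def add(edge):
        if edge in chart:               # duplicate check: the memoization step
            return
        chart.add(edge)
        if not edge[4]:                 # the edge is complete
            process_complete(edge)

    def process_complete(edge):
        i, j, cat = edge[0], edge[1], edge[2]
        # Shift 2: start a new edge for every rule whose left corner is cat.
        for lhs, rhs in GRAMMAR:
            if rhs[0] == cat:
                add((i, j, lhs, (cat,), rhs[1:]))
        # Reduce: advance the dot of any edge ending at i that expects cat.
        for other in list(chart):
            if other[1] == i and other[4] and other[4][0] == cat:
                add((other[0], j, other[2], other[3] + (cat,), other[4][1:]))

    for pos, word in enumerate(words):
        # Shift 1 (via a lexical complete edge): the word's category spans
        # pos..pos+1, and process_complete then introduces the rule edges.
        for cat in LEXICON.get(word, []):
            add((pos, pos + 1, cat, (word,), ()))

    return any(e[0] == 0 and e[1] == len(words) and e[2] == "S" and not e[4]
               for e in chart)

print(bu_chart_parse("John ate the cake".split()))   # True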

Earley's Algorithm

Earley's algorithm is a hybrid of top-down and bottom-up parsing. This algorithm is one of the most efficient general parsing algorithms developed for context-free grammar: it runs in polynomial time with a worst-case complexity of O(n³), one of the best results known to date for nondeterministic context-free parsing.

The efficiency of Earley's algorithm comes from two factors. The first factor is the top-down/bottom-up mixed-direction parsing strategy, in which top-down information is utilized as expectations while constituents are built bottom-up. An expectation is information about what constituent must be found at the next input position/word, based on the analysis of the previous word(s) in the sentence (i.e., the left-context). By propagating expectations top-down before bottom-up processing begins, the construction of unsuccessful (known-to-fail) bottom-up constituents can be pruned out. In this sense, Earley's algorithm is an optimization of bottom-up parsing. Also, since the algorithm allows mixed-direction information flow, Earley's algorithm is more flexible than pure top-down or bottom-up parsing.

The second factor is the use of items (or states in his original paper), which roughly correspond to edges in chart parsing. As with chart parsing, items in Earley's algorithm are kept in a global table (the state set, a set of all states), and they are added to the table only if the same item does not already exist; the algorithm therefore utilizes memoization. This memoization scheme provides another advantage: it can prevent infinite loops caused by applying top-down processing to left-recursive rules. Thus, the grammar can be written more naturally and without any constraints.


Earley's algorithm is described as follows. The description here is slightly modified from the original version in (Earley, 1970), without the strictly left-to-right control restriction (i.e., that for each word in a sentence, all processing must be done before moving to the next word). Note, however, that this modification does not lose the major characteristics of his algorithm that are pertinent to this thesis.

Given a grammar G and an input sentence w = w₁ ... wₙ, Earley's algorithm starts by initializing the global table with an item ⟨0, 0, S′ → • S⟩, where S is the start symbol in G and S′ is a new root symbol (not in the vocabulary of G). Then, the algorithm keeps applying the following three operations until no more items can be inserted in the table.

(i) Predictor: For each item ⟨i, j, A → α • X β⟩ where X is a nonterminal, and for each rule X → γ in the grammar, insert an item ⟨j, j, X → • γ⟩ in the table if it does not exist.

The Predictor operation extracts the expectation X in the old item and finds each rule that can rewrite it (X → γ). The resulting item is based on the new rule with the dot at the beginning of the RHS, indicating that no RHS constituents have been realized yet. Thus, the Predictor operation essentially propagates the expectation top-down, one level down the parse tree.

(ii) Scanner: For each item ⟨i, j, A → α • a β⟩ where a is a terminal, if the next input word (the word between positions j and j+1) is a, insert an item ⟨i, j+1, A → α a • β⟩ in the table if it does not exist.

The Scanner operation "eats up" the next input word and advances the dot. It is equivalent to the Shift 1 or Reduce operation in BU chart parsing when the LHS constituent of the complete edge (D in D → γ) is a terminal.

(iii) Completer: For each item ⟨i, j, A → α • X β⟩ where X is a nonterminal, and for each item ⟨j, k, X → γ •⟩ in the table, insert an item ⟨i, k, A → α X • β⟩ in the table if it does not exist.

The Completer operation fills the expectation X in the old item with another, complete item and advances the dot. The resulting item spans from the beginning of the old item (i) to the end of the complete item (k), indicating that the substrings covered by those items have been concatenated. The Completer operation is equivalent to the Shift 2 operation, or to the Reduce operation in BU chart parsing when the LHS constituent of the complete edge (D in D → δ) is a nonterminal.

Then, if the parsing generates item(s) of the form ⟨0, n, S' → S •⟩, the input sentence w is determined to be grammatical. Figure 2.6 shows a diagram of the operations and items produced when parsing the sentence "John ate the cake". As can be seen, top-down processing is done by the Predictor operation and bottom-up processing by the Scanner and Completer operations. For example, after "John" is scanned by (2) scanner, the next expectation VP is brought down to lower levels by two successive Predictor operations ((3) predictor and (4) predictor). At this point, another expectation, for a terminal V, is generated. This expectation is matched against the next input word "ate" in (5) scanner, and since they match, another item is created and the dot is advanced. Then, since this new item is complete, it is used by (6) completer to fill the suspended expectation VG one level above, which was created by the earlier Predictor operation ((3) predictor). In this way, Earley's algorithm works so that, from a given expectation, top-down predictions first produce prediction items which are consistent with the expectation, and bottom-up operations then advance the dot in those items and complete them later in the parsing.
Figure 2.6: Example of Earley's Algorithm
(Grammar: S → NP VP, NP → Det N, VP → VG NP, VG → V, NP → "John", V → "ate", Det → "the", N → "cake". Starting from ⟨0,0, S' → • S⟩, the operations (1) predictor, (2) scanner "John", (3)-(4) predictor, (5) scanner "ate", (6) completer, (7) predictor, (8) scanner "the", (9) scanner "cake" and (10)-(12) completer produce the items leading to ⟨0,4, S' → S •⟩.)
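To make the three operations concrete, the following is a minimal recognizer sketch in Python; it is our illustration rather than code from the thesis, and the grammar/lexicon encoding, item layout and function names are assumptions made for the example.

from collections import namedtuple

# An item <i, j, A -> alpha . beta> is stored as (i, j, lhs, done, todo):
# done/todo are the symbols to the left/right of the dot.
Item = namedtuple("Item", "i j lhs done todo")

def recognize(grammar, lexicon, words, start="S"):
    # grammar: nonterminal -> list of RHS tuples; lexicon: word -> set of categories
    n = len(words)
    items = {Item(0, 0, "S'", (), (start,))}        # initial item <0, 0, S' -> . S>
    agenda = list(items)
    while agenda:
        it = agenda.pop()
        new = []
        if it.todo:
            x = it.todo[0]
            # Predictor: expand the expectation X with every rule X -> gamma.
            for rhs in grammar.get(x, []):
                new.append(Item(it.j, it.j, x, (), tuple(rhs)))
            # Scanner: consume the next input word if its category is X.
            if it.j < n and x in lexicon.get(words[it.j], set()):
                new.append(Item(it.i, it.j + 1, it.lhs, it.done + (x,), it.todo[1:]))
            # Completer (expectation side): pair with complete items already in the table.
            new += [Item(it.i, c.j, it.lhs, it.done + (x,), it.todo[1:])
                    for c in items if not c.todo and c.lhs == x and c.i == it.j]
        else:
            # Completer (complete side): fill every waiting expectation with this item.
            new += [Item(old.i, it.j, old.lhs, old.done + (it.lhs,), old.todo[1:])
                    for old in items if old.todo and old.todo[0] == it.lhs and old.j == it.i]
        for item in new:                             # memoization: keep only unseen items
            if item not in items:
                items.add(item)
                agenda.append(item)
    return Item(0, n, "S'", (start,), ()) in items

grammar = {"S": [("NP", "VP")], "NP": [("Det", "N")], "VP": [("VG", "NP")], "VG": [("V",)]}
lexicon = {"John": {"NP"}, "ate": {"V"}, "the": {"Det"}, "cake": {"N"}}
print(recognize(grammar, lexicon, "John ate the cake".split()))   # True

The sketch processes items from an agenda in no particular order, which is consistent with dropping the strictly left-to-right control restriction mentioned above.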

Context-free Left-corner Algorithm

The context-free (CF) Left-corner algorithm is essentially bottom-up parsing with top-down filtering. The idea is very similar to Earley's: propagate an expectation for the next constituent in order to filter out unsuccessful bottom-up constructions. There are several variations of this algorithm; we describe a version similar to the one presented in (Shann, 1991), with a modification to align it with Earley's algorithm as we have described it above. Like Earley's algorithm, the CF Left-corner algorithm has both top-down and bottom-up components. It differs from Earley's, however, in the following ways. First, the CF Left-corner algorithm pre-analyzes the grammar for the reachability relation: the reflexive, transitive closure of left-corner derivation (derivation through the first RHS constituent) from a given expectation.10 This idea is analogous to the FIRST set in compiler techniques (Aho, Sethi and Ullman, 1986).

10 Here, we mean context-free top-down derivation.


For example, given the following two rules:

R1: VP → VG NP
R2: VG → V

a reachability relation exists between the ordered pairs [VP, VG], [VP, V] and [VG, V], where the first argument in each pair is the expectation and the second argument is the left-corner constituent. Before parsing starts, the algorithm analyzes all possible left-corner relations in the grammar and stores them in a reachability table. Second, CF Left-corner does not create "prediction items" as in Earley's algorithm (items of the form ⟨i, j, A → • γ⟩). Instead, items are introduced in the CF Left-corner algorithm only after their left-corner constituent is filled (items of the form ⟨i, j, A → B • γ⟩). We call such an item a "left-corner item". The algorithm constrains the creation of left-corner items by utilizing the pre-analyzed reachability relations: when a category X is expected, a new edge is created only if there is a rule A → α such that A is reachable from X (i.e., there exists an entry [X, A] in the table). This way, unsuccessful (intermediate) prediction items are eliminated, along with the subsequent items that would have been derived from them. In this sense, CF Left-corner can be considered an optimization of Earley's algorithm. For the purpose of illustration, we represent a reachability relation by a 3-tuple ⟨X, A, Rk⟩ where X is a nonterminal, A is a nonterminal or terminal, and Rk is the index of a rule whose left-corner constituent is A (for example, C → A γ). An item is represented in the same way as in Earley's algorithm (on p. 16), by a 3-tuple ⟨i, j, A → α • β⟩. As with Earley's algorithm, CF Left-corner starts by initializing the global table with an item ⟨0, 0, S' → • S⟩, and for a grammatical sentence of length n, the algorithm generates item(s) of the form ⟨0, n, S' → S •⟩ at the end of parsing. The algorithm has four operations instead of the three in Earley's algorithm. The difference comes from Earley's Predictor operation. Since predictions are precompiled in the reachability table, CF Left-corner does not have a Predictor; instead it has two operations which introduce left-corner items. The four operations of the algorithm are as follows.

(i) Scanner: same as Earley's
(ii) Completer: same as Earley's

These operations are basically bottom-up dot-advancing operations, and they are unchanged in CF Left-corner.

(iii) Predict-Shift: For each item ⟨i, j, A → α • X β⟩ where X is a nonterminal, if the next input word (the word following position j) is a (a terminal) whose category is A (i.e., the grammar contains a rule A → a), and there exists a reachability relation ⟨X, A, Rk: C → A γ⟩ (obtained by looking up the reachability table), then insert an item ⟨j, j+1, C → A • γ⟩ if it does not exist.

Predict-Shift connects the high-level (nonterminal) expectation X with A (the terminal category of the input word a), which appears after some number of left-corner derivation steps, and "shifts" by eating up the word a.

(iv) Predict-Reduce: For each item ⟨i, j, A → α • X β⟩ where X is a nonterminal, and for each complete item ⟨j, k, C → γ •⟩ in the table,


if there exists a reachability relation ⟨X, C, Rl: B → C δ⟩, then insert an item ⟨j, k, B → C • δ⟩ in the table if it does not exist.

Predict-Reduce connects the expectation X with a nonterminal C which appears after some left-corner derivations from X. This operation essentially "reduces" to C, which is in turn used to fill the left corner of a rule Rl one level above in the derivation tree. Figure 2.7 shows a diagram of the operations and items produced when parsing the example sentence "John ate the cake" shown earlier for Earley's algorithm (p. 17). It uses the same grammar as before, but in the CF Left-corner example the rules are indexed from R0 through R7, and the reachability relations are precomputed (indexed t0 through t4) and stored in the reachability table. The differences from Earley's are three Predict-Shift operations ((1) predict-shift, (2) predict-shift and (4) predict-shift in Figure 2.7), which replace Earley's Predictor-followed-by-Scanner sequences ((1) predictor - (2) scanner, (3) predictor - (4) predictor - (5) scanner, and (7) predictor - (8) scanner in Figure 2.6, respectively), and one Predict-Reduce operation, (3) predict-reduce, which produces the same result as Earley's (6) completer. Notice that in CF Left-corner, (1) predict-shift eliminates one Predictor step of Earley's (i.e., one level of top-down derivation), whereas (2) predict-shift skips two levels of derivation. The prediction item eliminated from Earley's (produced by (3) predictor) is then introduced as a left-corner item by the (3) predict-reduce operation. The overall result is that CF Left-corner recognizes the same sentence with the same grammar in 4 fewer steps than Earley's (12 steps in Earley's and 8 in CF Left-corner).
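The reachability table itself is a simple closure computation. The following is a small sketch (ours, not from the thesis) of precompiling it; the table layout, mapping each expectation to a set of (left-corner symbol, rule index) pairs, is an assumption made for the example.

def reachability_table(rules):
    # rules: list of (lhs, rhs_tuple); the left corner of rule k is rules[k][1][0]
    table = {}
    changed = True
    while changed:
        changed = False
        for k, (lhs, rhs) in enumerate(rules):
            corner = rhs[0]
            # every category that can reach lhs (including lhs itself) reaches rule k's corner
            for x in [lhs] + [x for x, pairs in table.items()
                              if any(c == lhs for c, _ in pairs)]:
                if (corner, k) not in table.setdefault(x, set()):
                    table[x].add((corner, k))
                    changed = True
    return table

rules = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("VG", "NP")), ("VG", ("V",))]
for x, pairs in sorted(reachability_table(rules).items()):
    print(x, sorted(pairs))
# S reaches NP via R0 and Det via R1; NP reaches Det via R1; VP reaches VG via R2 and V via R3

Run on the example grammar, this reproduces exactly the entries t0 through t4 of Figure 2.7.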

2.3.2 Unification-based Algorithms and Systems

As we mentioned at the beginning of this chapter, most unification-based algorithms are extensions of context-free parsing algorithms. The extension is made by simply replacing context-free symbol equality and concatenation with feature structure unification, with no change in the control logic of the algorithms. Among the context-free parsing algorithms, it is the relatively efficient ones that have been extended for unification grammars. This is due to the intractable complexity of the underlying grammar formalisms: since the recognition problem for unification grammars is intractable (Barton et al., 1987), efficient (context-free) base algorithms are preferred in order to make unification-based parsing practical. Another factor seems to be the generality of the algorithms. More specifically, context-free algorithms which assume no restriction on the grammar (and are thus fully expressive) and which perform full parsing (as opposed to partial parsing) tend to be adopted in unification-based systems. This is probably due to the objective of unification-based parsing: to perform a deep linguistic analysis of sentences instead of simply building a surface phrase structure. Of the unification-based algorithms extended from context-free algorithms, BU chart parsing has been widely used, for instance in the parser of the Alvey Natural Language Tools (ANLT) (Briscoe et al., 1987; Carroll, 1993), a variant of GPSG grammar, and in the parser of an HPSG-based system called the Attribute Logic Engine (ALE) (Carpenter and Penn, 1998). Because of the simplicity of the algorithm, BU chart parsing is often used as a baseline preliminary implementation of a system, to which efficiency techniques are added in later implementations (Moore and Dowding, 1991; Carroll, 1993).

Figure 2.7: Example of Context-free Left-corner Algorithm
(Grammar: R0: S → NP VP, R1: NP → Det N, R2: VP → VG NP, R3: VG → V, R4: NP → "John", R5: V → "ate", R6: Det → "the", R7: N → "cake". Reachability table: t0: ⟨S, NP, R0⟩, t1: ⟨S, Det, R1⟩, t2: ⟨NP, Det, R1⟩, t3: ⟨VP, VG, R2⟩, t4: ⟨VP, V, R3⟩. Starting from ⟨0,0, S' → • S⟩, the operations (1) predict-shift with t0 and "John", (2) predict-shift with t4 and "ate", (3) predict-reduce with t3, (4) predict-shift with t2 and "the", (5) scanner "cake" and (6)-(8) completer produce the items leading to ⟨0,4, S' → S •⟩.)

Earley's algorithm is employed in many unification-based systems as well, for instance in the parsers for ID/LP (immediate dominance, linear precedence) grammars in (Shieber, 1984) and (Morawietz, 1995), and in the LILOG system (Seiffert, 1991). In addition, Earley's algorithm is often used in systems that perform both parsing and generation (Shieber, 1988, 1990) because of its flexible flow of information, that is, mixed top-down/bottom-up parsing. One algorithm of which we make special mention is Shieber's (1992) algorithm. Shieber's algorithm is an incarnation of Earley's algorithm formulated in a logical formalism for a generalized unification-based grammar. His algorithm is a major focus of this thesis, and will be described in detail in Chapter 3. The Left-corner algorithm is notably adopted in the Core Language Engine system (Alshawi, 1992). As with the CF Left-corner algorithm, the unification extension of this algorithm also appears implicitly as an optimized implementation of bottom-up chart parsing, typically under the name of "top-down predictive parsing" (Moore and Dowding, 1991; Shann, 1991; Carroll, 1993). The unification extension of Tomita's GLR parsing is used in the Knowledge-based Machine Translation (KBMT) project (Goodman and Nirenburg, 1991), an LFG-based machine translation system. GLR is also implemented as a modified version of the ANLT parser (Briscoe et al.; Carroll, 1993) mentioned earlier. A good description of the unification-based Head-corner algorithm and its empirical performance is found in (van Noord, 1993, 1997). As for other unification-based parsing algorithms, constraint algorithms are frequently used in systems which implement a particular principle-based unification linguistic theory. For example, the implementation of an HPSG parser in (Franz, 1990) processes a sentence by applying relations derived from HPSG principles such as the Head Feature Principle and the Subcategorization Principle. In what follows, we describe a variation of the Left-corner parser used in the Core Language Engine as a concrete example of an implemented unification-based algorithm.

Unification Left-corner Algorithm (in Core Language Engine)

The Core Language Engine (CLE) is a state-of-the-art unification-based system developed at SRI Cambridge. The system is implemented in Prolog, and its grammar (defined by a set of syntactic and semantic rules) is written in a high-level language which compiles into base Prolog expressions. The system's coverage is quite comprehensive: its syntactic and semantic rules cover a wide range of natural language phenomena including coordination, unbounded dependencies and anaphora resolution. CLE's treatment of those constructions is very clean, using a gap-threading technique which takes advantage of Prolog's depth-first behavior and term unification operation.11 The scope of CLE also spans several analysis levels, from morphology to syntactic analysis, then to quasi-logical form (as a semantic interpretation of a sentence), and finally to logical form with resolved anaphora and quantifier scoping. At each level of analysis, CLE incorporates many recent theoretical and technological advances in natural language analysis.

11 This idea of CLE's gap threading is basically the same as difference lists in DCG.


In CLE, the parser performs syntactic processing, and semantic processing then takes place in the next phase using the results of syntactic analysis. There are a few slight differences between the CLE parser and the CF Left-corner algorithm we presented previously. First, CLE's global chart (or, as they call it, an analysis record table) records only complete constituents/edges; no partial results are memoized. In CLE, intermediate computations are handled dynamically by utilizing Prolog's built-in backtracking mechanism. The system also uses an explicit parse stack to control the operations applicable at a given time. This, together with backtracking, is how CLE handles nondeterminism. Second, CLE's parsing operations are formulated differently from CF Left-corner in order to handle gap categories in the grammar. Other than these, the CLE parser performs essentially the same or equivalent operations as the CF Left-corner algorithm. As with CF Left-corner, CLE first precompiles the grammar into reachability relations and stores them in a table. To extend the relation from context-free symbols to feature structures, CLE takes into account all features in the expectation that propagate down to the left-corner constituent12 and generates a relation of the form

(cat1:[feature1=X,feature2=Y], cat2:[feature5=X,feature6=Y])

where cat1 is the expectation and cat2 is the left-corner constituent which can realize the expectation after some number of left-corner derivation steps. During parsing, those reachability relations are retrieved from the table and used as a possible bridge between the current expectation and a realized terminal/nonterminal constituent, to check their compatibility (i.e., whether it is a potential left-corner constituent). CLE incorporates a packing technique for efficiency, which we described previously in the context-free parsing section. To extend this technique to unification-style grammars, CLE packs multiple complete constituents/items (or feature structures) which cover the same substring of the input sentence and which are in a subsumption relation, and keeps the most general item as the representative. This technique essentially further enhances the effect of memoization and eliminates redundant computations. We will later discuss the role of this technique in solving what we call the nonminimal derivation problem in Chapter 4.

CLE's parsing algorithm is defined by the following six operations. The description given here is informal and simplified from the original one in (Alshawi, 1992, p. 131).13 The algorithm makes use of a parse stack, which works analogously to a current parse front. Each entry in the stack is a record structure with three fields (mother-category, daughters-found and daughters-needed), where mother-category is a constituent/category symbol, and daughters-found and daughters-needed are lists of constituents. We denote a stack record by a 3-tuple (mother-category, [daughters-found], [daughters-needed]). This essentially encodes the same information as an edge does in chart parsing. The algorithm also temporarily records a just-found (complete) constituent in the constituent-found variable.14

12 Actually, some features are intentionally ignored in computing the reachability relations. That is because of the difficulty of the prediction nontermination problem (Shieber, 1985). We will briefly touch on this topic in the next section (2.3.3) and discuss it in more detail in Chapter 6.
13 In (Alshawi, 1992), the algorithm produces syntactic analysis records (equivalent to parse trees) as the result of parsing. In our description, it is modified to a recognition algorithm (i.e., without building parse trees).


Just like the CF Left-corner algorithm, the CLE algorithm uses a start production of the form S' → S, where S' is a symbol not in the vocabulary of the grammar.

(i) Start: To initialize the parser, push (S',[],[S]) on the stack, and apply the Shift or Create-Gap operation. The Start operation is applied only once, at the beginning of parsing.

(ii) Shift: If the daughters-needed list in the stack top record is not empty, create a lexical constituent with a category C using the next input word, set constituent-found to that constituent, and apply the Predict or Match operation.

The Shift operation is equivalent to creating a complete lexical/terminal edge in chart parsing, and to the first part of the Scanner operation in Earley's algorithm.

(iii) Create-Gap: If there can be an empty constituent of category C, and the first category in the daughters-needed list of the stack top record could have an empty left-corner constituent according to some reachability relation(s), then analyze C as the empty gap and apply the Predict or Match operation.

The Create-Gap operation hypothesizes an empty category at the current position in the sentence. If there is a reachability relation from some category to an empty category, the gap is realized.

(iv) Reduce: If the daughters-needed list in the stack top record is empty and C is its mother category, and the constituent-found variable holds a constituent C′, then set constituent-found to C, pop the stack, and apply the Predict or Match operation.

The Reduce operation is equivalent to creating a complete nonterminal edge in chart parsing and the Completer operation in Earley's algorithm. It essentially realizes the mother constituent.

(v) Predict: If the constituent-found variable holds a constituent C1, and there is a reachability relation (D, C1) recorded in the table, where D is in the daughters-needed list of the stack top record and C0 → C1 .. Cn is a rule in the grammar, then create a new record (C0, [C1], [C2 .. Cn]), push it on the stack, and apply the Reduce, Shift or Create-Gap operation.

The Predict operation performs top-down processing; it is equivalent to the Predict-Shift operation in the CF Left-corner algorithm.

(vi) Match: If C is the first category in the daughters-needed list of the stack top record, and the constituent-found variable holds a (complete) constituent C′ that is consistent with C, then remove C from that list, add C′ to the end of the daughters-found list, and apply the Reduce, Shift or Create-Gap operation.

14 Note that in this simplified description, the constituent-found variable can only keep track of a single parse context, with no ambiguity.


Step  Operation       Const-found  Stack
1     Start           -            [(S',[],[S])]
2     Shift "John"    NP           [(S',[],[S])]
3     Predict w/t0    NP           [(S',[],[S]) (S,[NP],[VP])]
4     Shift "ate"     V            [(S',[],[S]) (S,[NP],[VP])]
5     Predict w/t4    V            [(S',[],[S]) (S,[NP],[VP]) (VG,[V],[])]
6     Reduce          VG           [(S',[],[S]) (S,[NP],[VP])]
7     Predict w/t3    VG           [(S',[],[S]) (S,[NP],[VP]) (VP,[VG],[NP])]
8     Shift "the"     Det          [(S',[],[S]) (S,[NP],[VP]) (VP,[VG],[NP])]
9     Predict w/t2    Det          [(S',[],[S]) (S,[NP],[VP]) (VP,[VG],[NP]) (NP,[Det],[N])]
10    Shift "cake"    N            [(S',[],[S]) (S,[NP],[VP]) (VP,[VG],[NP]) (NP,[Det],[N])]
11    Match           N            [(S',[],[S]) (S,[NP],[VP]) (VP,[VG],[NP]) (NP,[Det,N],[])]
12    Reduce          NP           [(S',[],[S]) (S,[NP],[VP]) (VP,[VG],[NP])]
13    Match           NP           [(S',[],[S]) (S,[NP],[VP]) (VP,[VG,NP],[])]
14    Reduce          VP           [(S',[],[S]) (S,[NP],[VP])]
15    Match           VP           [(S',[],[S]) (S,[NP,VP],[])]
16    Reduce          S            [(S',[],[S])]
17    Match           S            [(S',[S],[])]
18    Reduce          S'           []

Figure 2.8: Example of Unification Left-corner Algorithm in CLE

The Match operation essentially advances the dot in an item in chart parsing (or in CF Left-corner). Notice that the stack is not popped: the only change is to the daughters-needed list of the stack top record. This implies a lateral movement (i.e., dot advancing), rather than a top-down prediction movement. As can be seen, the CLE algorithm breaks up the Predict-Shift and Predict-Reduce operations of CF Left-corner into smaller operations: Predict, Shift, Reduce and Match. Figure 2.8 shows the example of parsing "John ate the cake" in CLE, using the same grammar and reachability table as in the previous Figure 2.7 (on p. 20). Note that in the figure, the stack grows from left to right, with the right-most tuple as the stack top. In CLE, n levels of derivation skipped by one Predict operation are restored/filled by n − 1 subsequent Predict operations (using other stack records). This sequence is equivalent to a Predict-Shift followed by n − 1 Predict-Reduce operations in CF Left-corner. For example, after "ate" (a V) is scanned by 4 Shift while the expectation (the next daughters-needed category) is VP, a prediction is made by 5 Predict using the reachability relation t4: ⟨VP, V, R3⟩, which skips one intermediate level (VG). This level is filled in later, after the VG is completed and reduced by 6 Reduce, and then by another prediction, 7 Predict, using t3: ⟨VP, VG, R2⟩.
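As a concrete illustration of the data structures involved, the following small sketch (ours, not CLE's Prolog implementation) replays the first three steps of Figure 2.8 with a Python parse stack; the record layout and helper names are assumptions made for the example.

from dataclasses import dataclass, field

@dataclass
class Record:                                     # one parse-stack entry
    mother: str                                   # mother-category
    found: list = field(default_factory=list)     # daughters-found
    needed: list = field(default_factory=list)    # daughters-needed

def predict(stack, constituent_found, rule):
    # rule = (C0, [C1, .., Cn]) with C1 equal to constituent_found:
    # push the new record (C0, [C1], [C2 .. Cn]) on the stack
    c0, rhs = rule
    stack.append(Record(c0, [constituent_found], list(rhs[1:])))

def match(stack, constituent_found):
    # move the matched category from daughters-needed to daughters-found
    top = stack[-1]
    assert top.needed and top.needed[0] == constituent_found
    top.found.append(top.needed.pop(0))

stack = [Record("S'", [], ["S"])]                           # step 1: Start
constituent_found = "NP"                                    # step 2: Shift "John" (an NP)
predict(stack, constituent_found, ("S", ["NP", "VP"]))      # step 3: Predict with t0 (rule R0)
print(stack)   # [Record(mother="S'", ...), Record(mother='S', found=['NP'], needed=['VP'])]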


2.3.3 Issues in Unification-based Parsing

The unification-based formalism is a very expressive and clean style of grammar which has many desirable properties. Such formalisms allow a much broader range of information to be encoded in the grammar, thereby supporting a rich sentence processing model. However, parsing with unification grammars has some difficulties. In particular, we discuss three issues in this section: computational efficiency, brittleness and inherent grammatical difficulties.

Computational Efficiency

First, unification-based parsing is slow: the recognition problem is intractable in the unrestricted form. But this is the worst-case theoretical complexity; the average-case complexity has been empirically shown to be quadratic or cubic in implemented systems (Shann, 1991; Carroll, 1993, 1994). Carroll (1993) attributes this to the algorithms being based on (or extended from) tractable context-free parsing algorithms and to the additional implementation techniques incorporated in the systems (Carroll, 1993, p. 14):

  A parsing algorithm is termed efficient if its time complexity is polynomial. It is not the case, though, that a straightforward implementation of an efficient algorithm will result in a practical parsing system. Attention must be paid to the computational techniques used, and also to how augmentations to rules are dealt with, since operations such as unification can easily dominate parse times. However, most practical, wide-coverage [unification-based] NL analysis systems have indeed been based on formalisms for which computationally tractable parsing algorithms are known. Thus, no system in this area has employed an unmodified implementation of GPSG (shown by (Barton et al., 1987) to have EXP-POLY complexity for recognition), LFG (which (Barton et al., 1987) showed to be NP-hard), or the ID/LP format (shown by (Barton et al., 1987) to be NP-complete).

However, even though the systems show average-case polynomial time, processing with large data structures (i.e., feature structures) imposes a significant burden.15 In some extreme cases, for instance when the grammar contains a large number of ambiguous rules, this may even limit the tasks and/or the set of sentences that can be handled by the finite resources of the systems. To overcome this problem, several implementation techniques have been developed for improving the efficiency of unification-based systems. For instance, the item packing used in CLE, described in the last section, is one of those techniques. Other techniques include grammar pre-analysis and indexing, and off-line compilation for grammars with gaps (Moore and Dowding, 1991; Minnen et al., 1995).

15 In particular, the cost of copying feature structures (or dags) has been reported to be the major source of inefficiency of unification-based systems (Godden, 1990). Several techniques have been developed to optimize this procedure (e.g. (Pereira, 1985), (Godden, 1990) and (Tomabechi, 1992)). We will discuss some of them and our original technique in Chapter 6.
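To make the packing idea concrete, here is a small sketch (ours, not CLE's implementation) of the underlying subsumption test and of keeping only the most general edge per span; feature structures are plain nested dicts, and reentrancy is ignored for simplicity.

def subsumes(general, specific):
    # general subsumes specific if every feature/value in general also appears
    # (possibly in a more specific form) in specific
    if not isinstance(general, dict):
        return general == specific
    if not isinstance(specific, dict):
        return False
    return all(k in specific and subsumes(v, specific[k]) for k, v in general.items())

def pack(edges):
    # keep, per span, only the most general feature structures as representatives
    kept = []
    for span, fs in edges:
        if any(s == span and subsumes(g, fs) for s, g in kept):
            continue                               # fs is packed into a more general edge
        kept = [(s, g) for s, g in kept if not (s == span and subsumes(fs, g))]
        kept.append((span, fs))
    return kept

edges = [((2, 4), {"cat": "NP", "head": {"agr": "3S"}}),
         ((2, 4), {"cat": "NP"})]                  # a more general NP over the same span
print(pack(edges))                                 # only the general NP survives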


Another efficiency approach, taken in (Maxwell and Kaplan, 1994), encodes some of the functional (i.e., non-context-free-backbone) features of the grammar rules directly in the context-free symbols (which requires grammar modification). This way, the algorithm can exploit the polynomial time complexity of context-free parsing (e.g. the O(n^3) of Earley's algorithm, as used in their paper) in unification-based systems. This approach essentially delineates the feature-value pairs encoded in a feature structure and represents each of them by a distinct symbol. Thus, the number of rules in the original unification grammar is multiplied (by an exponential factor) in the modified context-free grammar. In their report, however, parsing with the modified grammar showed improved efficiency despite the increased number of rules. One shortcoming of this approach is that it requires an ad hoc decision as to which features should be considered for delineation: no account of the criteria, nor any automatic selection procedure, is proposed.
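The following toy sketch (our illustration of the symbol-multiplication idea, not Maxwell and Kaplan's actual procedure) folds a single feature into the context-free symbols; constraints that tie different occurrences of the feature together are ignored for simplicity.

from itertools import product

def specialize(rules, category, feature, values):
    # each chosen feature value yields a distinct specialized symbol,
    # multiplying every rule that mentions the category
    out = []
    for lhs, rhs in rules:
        slots = [i for i, sym in enumerate([lhs] + rhs) if sym == category]
        if not slots:
            out.append((lhs, rhs))
            continue
        for combo in product(values, repeat=len(slots)):
            syms = [lhs] + list(rhs)
            for i, v in zip(slots, combo):
                syms[i] = f"{category}[{feature}={v}]"
            out.append((syms[0], syms[1:]))
    return out

rules = [("S", ["NP", "VP"]), ("NP", ["Det", "N"])]
for r in specialize(rules, "NP", "agr", ["3s", "3p"]):
    print(r)
# S -> NP[agr=3s] VP, S -> NP[agr=3p] VP, NP[agr=3s] -> Det N, NP[agr=3p] -> Det N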

Brittleness

Second, unification-based parsing is still brittle: parsing fails ungracefully when processing ungrammatical sentences. There have been several approaches to applying unification grammars to ill-formed sentences (Stede, 1992). For instance, (Lytinen, 1991) handles terse text by first parsing a sentence normally; if no parse is found (i.e., a parse failure), syntactic features are relaxed and sentence fragments are recovered by gluing partial constituents together using semantic features. Other approaches use an overgeneral grammar in which some problematic (or often-mistaken) features are left undefined (i.e., variable), so that unification will succeed even with ill-formed constructs. Intuitively, feature structures can encode more information about a constituent than context-free symbols, and thus could support robust parsing (Douglas and Dale, 1992). But in the case of ungrammatical sentences, error correction or recovery opens up more search paths for "probable" interpretations (e.g. interpretation as abduction (Hobbs et al., 1993)), so the problem becomes rather one of computational control strategies. In other words, we must determine when to "reject" the possible recoveries. This applies as well when possible fragments are recovered after a parse failure.

Grammatical Difficulties

Third, in addition to computational inefficiency, unification-style grammars have properties that make parsing inherently difficult. One of the difficulties is prediction nontermination: when a parsing algorithm performs top-down processing with a left-recursive grammar, the algorithm may not terminate. This problem does not arise with context-free grammars when chart-based parsing algorithms are applied, because the items produced by infinite left-recursion are discarded as duplicates by memoization. With unification grammars, however, there may be cases where (direct or indirect) left-recursion produces a different item at every iteration because of some "problematic" features. Several partial solutions have been proposed for this problem (e.g. (Shieber, 1985) and (Samuelsson, 1993)).


But since this problem reduces to the halting problem, no complete solution exists. We will discuss this problem and propose a heuristic solution in Chapter 6. Another difficulty is the (lack of) locality of unification-style grammars. A feature structure can express shared substructures (indicated by co-indexes in an AVM, or by shared subdags). So in a general unification grammar formalism, features can be passed from anywhere to anywhere, at any arbitrary depth, making the effect of unification global. Because of this, unification grammars do not support locality of processing. This nonlocality requires a complex computational model, in contrast to functional-style processing in which the effect of an operation is strictly local. Because of the difficulties mentioned above, most unification-based parsing work focuses more on how to overcome the difficulties than on taking advantage of the expressiveness of the underlying grammar formalism. It is still quite popular in many unification-based systems to use only the context-free backbone to parse a sentence and to apply unification after each rule is completed (i.e., the LHS is realized) or at the end of parsing (e.g. the GLR parser for the ANLT grammar (Briscoe et al.; Carroll, 1993, 1994)). This kind of processing model does not support incremental processing which utilizes all available features at each point during parsing. Another point is the integration of semantics into the formalisms. Feature structure representation can encode various kinds of information about each constituent in a uniform manner, including syntactic as well as semantic features. However, most unification-based systems process these two kinds of information in different phases (except for the ones based on a linguistic theory that fundamentally integrates semantics into syntax, such as HPSG): syntactic analysis followed by semantic analysis. In such systems, syntax and semantics are processed asynchronously. This is done to reduce ambiguity. If a word is n-way ambiguous syntactically and m-way ambiguous semantically, a maximum of n × m readings are possible, and all of them must be carried around during parsing, thereby causing combinatorial explosion. But if semantic analysis is applied to the result of syntactic analysis, disambiguation in the first phase will reduce the ambiguities in the second phase. However, such a processing model does not take full advantage of the expressiveness of feature structures in deriving a rich sentence processing model.

2.4 Left-Corner Parsing Algorithm (LC)

We have developed an efficient parsing algorithm for unification grammars, called LC, which takes full advantage of the expressiveness of the grammar. LC is a variation of Left-corner parsing, and it is implemented in our unification-based NLP system called LINK (Lytinen, 1992). The algorithm is quite efficient: it showed average-case linear time performance on limited-domain texts (Lytinen and Tomuro, 1996). The efficiency of our LC algorithm comes from two factors.


First is the representation and architecture of LINK, which encodes syntax as well as semantics and domain knowledge in a uniform representation, thereby allowing the algorithm to utilize all available information at any point during parsing. Second is the expectation-based Left-corner parsing strategy, which allows the algorithm to prune unsuccessful parses/constituents by propagating the expected features top-down and matching them against constituents before bottom-up construction takes place. In this section, we present our LC algorithm by describing these two components. We first give a brief summary of LINK, and then describe the LC algorithm in detail. We formalize the LC algorithm in the next chapter (Chapter 3), and present a complete empirical performance analysis in Chapter 6.

2.4.1 LINK

LINK is a syntax-semantics integrated unification-based system. It is a general parser, just like PATR-II, which is not tied to any specific linguistic theory. However, LINK is more general than PATR-II in that it does not assume the existence of a context-free backbone in the grammar: the syntactic category is treated evenly, as one of the features encoded in feature structures, which may or may not be present in each constituent. LINK utilizes two kinds of knowledge: linguistic knowledge described in the grammar and lexicon, and domain knowledge represented by a semantic network. In LINK, the grammar encodes both syntactic and semantic information, and the semantic network is a hierarchical organization of semantic/domain concepts. LINK represents this knowledge uniformly, in feature structure representation. The information is combined dynamically during parsing: for each word in a sentence, its syntactic and semantic features are first retrieved from the lexicon and combined/unified with the domain features associated with the semantic concept of the word (Lytinen, 1987). This way, all information is available at any given time during parsing, thereby supporting synchronous, incremental processing. By taking advantage of this processing scheme, LINK has been successfully applied to various NLP tasks, such as robust processing of ungrammatical sentences (Lytinen, 1990; Kirtner and Lytinen, 1991), metaphor understanding (Lytinen et al., 1992), semantics-driven parsing (Lytinen, 1991), heuristic parsing (Huyck and Lytinen, 1993), and lexical acquisition and language learning (Hastings, 1994).

LINK's Grammar and Knowledge Base

The grammar rules in LINK are represented very similarly to those in PATR-II, by constraint equations. Also, just as in PATR-II, rules are modeled by dags in LINK. Figure 2.9 shows an example of the LINK rule S → NP VP and its corresponding dag. In the LINK notation, a path which starts with a number (i) represents a feature of the ith RHS constituent. For instance, in this example, the path (1 cat) represents the cat feature (i.e., syntactic category) of the first RHS constituent (whose value is NP). Any path which starts with a non-numbered arc represents a feature of the LHS constituent. For instance, the path (head) represents the head feature of the S.


  R0: (define-rule
        (cat) = S
        (1 cat) = NP
        (2 cat) = VP
        (head) = (2 head)
        (head subj) = (1 head)
        (1 head agr) = (2 head agr))

Figure 2.9: LINK rule S → NP VP and its dag model

  S1: (define-word
        (cat) = V
        (word) = "ate"
        (head tense) = past
        (head sem cat) = EAT
        (head sem actor) = (head subj sem)
        (head sem object) = (head dobj sem))

Figure 2.10: LINK lexical rule for "ate" and its dag model

LINK's grammar also includes lexical rules (i.e., the lexicon). For example, Figure 2.10 shows an entry for the word "ate" and its corresponding dag. Note that lexical entries do not have RHS constituents, since they are terminal. Also, in LINK, semantic features are found under the (head sem) path. Most semantic features encoded in the grammar are general case-frame relations (Fillmore, 1968), such as actor, object, location and purpose. Those semantic features (deep structures) are properly mapped to surrounding syntactic constituents (surface structures) in the sentence through constraint equations which equate semantic and syntactic feature paths (i.e., make the paths point to the same node). For instance, the constraint (head sem actor) = (head subj sem) in the "ate" rule maps the semantic actor of the EAT action to the syntactic subject, and similarly the constraint (head sem object) = (head dobj sem) maps the object of the EAT action to the syntactic direct object.16 So, for example, when the sentence "John ate the cake" is parsed, assuming the V ("ate") is propagated up to the VP in the S rule, "John" will be placed as the actor and "the cake" as the object of EAT by the syntax-semantics mapping equations in "ate". Conceptual and domain knowledge in LINK is organized in a semantic network. Every node in the network represents a semantic concept, and nodes are organized hierarchically by the is-a relation. A node is also connected to other nodes through its semantic case-frame relations. With this information, every concept (or semantic category) is represented as a frame, and encoded by a set of constraint equations in the same way as grammar rules. For example, Figure 2.11 shows a frame for the concept EAT, which specifies the actor to be ANIMATE, the object to be FOOD and the instrument to be UTENSIL.

16 In LINK, most syntax-semantics mapping equations are encoded in the lexicon, thus it is considered a lexicalized approach.


  F1: (define-semcat
        (cat) = EAT
        (actor cat) = ANIMATE
        (object cat) = FOOD
        (instrument cat) = UTENSIL)

Figure 2.11: EAT frame

Figure 2.12: Unified "ate" dag

When the word "ate" is encountered in the input sentence, this frame is unified with the lexical rule shown in Figure 2.10 under the (head sem) path, resulting in the dag shown in Figure 2.12.17 As can be seen, with the additional information from the EAT frame, the semantic case-frame relations in the "ate" dag become more specific: the actor must be ANIMATE, the object must be FOOD, and so on (indicated with thick arcs in the figure). Those features are percolated to surface syntactic constituents through the syntax-semantics mapping equations in the grammar, and can be utilized to predict the features of the words that will come next. Our algorithm, which we describe in the next section, makes maximum use of those prediction features as expectation in improving parsing efficiency.

17 In LINK, two node labels are considered compatible if they are equal or in an is-a relation.
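All of this combining is done by unification over dags. The following is a minimal sketch of that operation on nested dicts (our simplification, not LINK's implementation): reentrancy (shared subdags) is not modeled, and the is-a compatibility of node labels from footnote 17 is reduced to plain equality.

def unify(f1, f2):
    # atomic values must be equal (a real system would also accept values
    # related by is-a); dicts are unified feature by feature
    if not isinstance(f1, dict) or not isinstance(f2, dict):
        if f1 == f2:
            return f1
        raise ValueError(f"clash: {f1} vs {f2}")
    result = dict(f1)
    for feat, val in f2.items():
        result[feat] = unify(result[feat], val) if feat in result else val
    return result

# the "ate" lexical entry combined with the EAT frame under the (head sem) path
ate = {"cat": "V", "word": "ate",
       "head": {"tense": "past", "sem": {"cat": "EAT"}}}
eat_frame = {"cat": "EAT", "actor": {"cat": "ANIMATE"},
             "object": {"cat": "FOOD"}, "instrument": {"cat": "UTENSIL"}}
ate["head"]["sem"] = unify(ate["head"]["sem"], eat_frame)
print(ate["head"]["sem"]["object"]["cat"])   # FOOD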

2.4.2 The LC Algorithm

Our LC algorithm is a variation of the standard Left-corner algorithm. Compared with the standard algorithm, LC has a unique characteristic, which comes from the base data structure on which it is defined: a dag with a special lc arc. In the LC algorithm, lc arcs are brought into the models (dags) by way of reachability net entries. As with other Left-corner systems such as CLE, the LC algorithm precompiles all possible reachability relations that exist in the grammar and records them in a table.


But unlike other systems, our algorithm represents the relation using a single dag with an lc arc.

Reachability Net Entries

Traditionally, a reachability relation is defined between two constituents: an expectation and a left-corner constituent. For instance, given the following two rules (repeated from p. 18):

R1: VP → VG NP
R2: VG → V

reachability relations exist between the ordered pairs [VP, VG], [VP, V] and [VG, V]. In the LC algorithm, the reachability relation is extended to hold between a constituent and a grammar rule. Thus, the reachability relations in LC become [VP, R1], [VP, R2] and [VG, R2], respectively. LC represents this relation in a single dag in which the expectation is placed at the root node and the rule is found under the lc arc. Thus, the lc arc represents an unspecified number of levels of left-corner derivation that are possible from a high-level expectation down to a rule, and functions as a compressed, virtual link between the two. Intuitively, a reachability net entry is composed by applying left-corner derivation along every consistent derivation path and transforming the resulting dag into a uniform representation. For example, Figure 2.13 shows the rules R1 and R2 augmented with some features in LINK-style grammar. Basically, the left-corner derivation for the R1-R2 path is performed in the following way. First, the left-corner constituent of R1 (the VG) is unified with the LHS constituent of R2. Next, an arc labelled lc is added to this unified dag between the root (i.e., the LHS VP of R1) and the VG, which are currently connected by the 1 arc. Then, all numbered arcs (including the 1 arc) are deleted from the root. This operation essentially removes all RHS constituents of R1 from the root, and leaves all features of the LHS VP, plus the lc arc, which is the connection to the rule R2. The resulting reachability net entry (dag) places the VP expectation at the root, with the rule R2 under the lc arc. Notice that the co-reference between the VP's head arc and the VG's and V's same arcs is preserved by unification after the transformation. Through these arcs, features added during parsing under the head arc of the VP (as the expectation) are percolated down to the VG and the V directly, and they are used to predict the constituents that will satisfy the expectation. Also notice the feature (head type) = trans under the VG and the V. This feature was "pushed down" from the VP in R1 by the derivation, and remained in the VG and the V. This additional feature makes the prediction more specific, constraining the VG and the V to have this feature in order to eventually fill the VP through this particular R1-R2 derivation path.

  R1: (define-rule
        (cat) = VP
        (1 cat) = VG
        (2 cat) = NP
        (head) = (1 head)
        (head type) = trans
        (head dobj) = (2 head))

  R2: (define-rule
        (cat) = VG
        (1 cat) = V
        (head) = (1 head))

Figure 2.13: Reachability net entry created from the R1 and R2 rules

The actual construction of reachability net entries is slightly different from the procedure just described. From a given expectation E and a left-corner derivation path of length n consisting of rules ⟨r1, .., rn⟩ (in an ordered sequence), a reachability net entry which represents the reflexive relation is created first. Let r1: E → A B. The dag created for this relation has the form E →lc E → A B.18 From this dag, entries which represent the transitive relation are created by "pushing up" the left-corner constituent under the (lc 1) path to (lc), then applying derivation. So the reflexive dag E →lc E → A B is first transformed to E →lc A, and then unified with the next rule in the sequence. Let r2: A → C D. The resulting dag, which represents a transitive relation from E to r2, has the form E →lc A → C D. For every derivation path, this process continues until the left-corner constituent of the last rule (rn) is a terminal (i.e., no more derivation is possible). The construction of reachability net entries is explained in precise detail in the next chapter (Chapter 3), where the LC algorithm is formalized.

18 A reachability net entry which represents the reflexive relation is explained in detail in the next chapter (Chapter 3). For now, we simply note that this kind of entry has the form in which the expectation and the LHS of the rule (under the lc arc) have the same category.
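In the nested-dict approximation used in the earlier unification sketch, the [VP, R2] entry of Figure 2.13 could be written roughly as follows (our illustration; LINK's actual dag representation differs), with a shared head dict standing in for the co-reference among the VP, the VG and the V.

# shared head structure models the co-reference among the VP, the VG and the V;
# (head type) = trans was pushed down from rule R1 during precompilation
head = {"type": "trans"}

vp_r2_entry = {
    "cat": "VP",            # the expectation, kept at the root
    "head": head,
    "lc": {                 # compressed left-corner derivation down to rule R2
        "cat": "VG",
        "head": head,
        "1": {"cat": "V", "head": head},
    },
}
print(vp_r2_entry["lc"]["1"]["head"] is vp_r2_entry["head"])   # True: co-referent heads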

The Parsing Algorithm

After precompiling all possible left-corner derivations, the LC algorithm uses reachability net entries instead of the original grammar rules to parse sentences. Our LC algorithm is based on the standard Left-corner algorithm, with some modifications to handle lc arcs. The most notable modification is in the bottom-up component, in which the intermediate levels of derivation compressed in the lc arc are expanded. The LC algorithm proceeds in the following way. As with the standard Left-corner algorithm, at the beginning of parsing an initial expectation for an S (for a complete sentence), more specifically an item which represents a reflexive left-corner derivation from the S using the rule S → NP VP (a reachability net entry), is added to the chart.19


Figure 2.14: Model created after parsing "John" in the LC algorithm
(The item dag: the NP for "John" fills the first RHS constituent of the S → NP VP entry under the lc arc, the VP is the next expectation, and the agr value 3S is shared with the VP through co-reference.)

Then the algorithm reads the input sentence from left to right, advancing the dot (i.e., the expectation) in the items recorded in the chart by top-down prediction and by bottom-up construction. LC's top-down operation is similar to the Predict-Shift operation of the CF Left-corner algorithm, in which an expectation sanctions the construction of a new item whose first RHS constituent, a terminal, is in a reachability relation and is consistent with the input word.20 However, there is a slight modification in that the reachability net entry is used directly in the resulting model. Figure 2.14 shows the item model after "John" (an NP) in the sentence "John ate the cake" has been processed and has filled the first RHS constituent of the initial expectation. At this point, the next expectation is the VP (the subdag under the shaded node in the figure). Notice that the agr feature, whose value is 3S, has been inserted into the VP from the NP through co-reference. Then, this new expectation and the next input word "ate" (shown previously in Figure 2.10, p. 29) are connected by an (lc 1) path (thereby creating a single dag), and any reachability net entries that are consistent with the dag are retrieved. In this case, the algorithm finds the reachability net entry VP →lc VG → V (shown previously in Figure 2.13, p. 32), and creates a unified dag. This scheme is depicted in Figure 2.15.21 Notice that the (head agr) and (head subj agr) features from the VP expectation are brought down to the V through the co-reference specified in the reachability net entry. These features, as well as the type feature in the reachability net entry, stay in the resulting model, since they do not conflict with the features in "ate". At this point, the V is realized and, since there are no more RHS constituents, the LHS VG is complete. The LC algorithm then transforms this dag by "stretching" the lc arc into an (lc 1) path, and retrieves the reachability net entries which are consistent with the dag.

19 This dag has the form S →lc S → NP VP.
20 This LC operation is called Scanning 2. It is explained in detail in the next chapter (Chapter 3).
21 In the figure, semantic features are omitted for simplicity. However, the actual dags do carry semantic features, in particular in the word "ate". The complete dag for "ate" is shown previously in Figure 2.12, p. 30. Also, in LC, a single-word lookahead is used to constrain the retrieval of reachability net entries.


Figure 2.15: VP expectation matched with "ate" using the reachability net entry

Figure 2.16: Restoration of the derivation path by lc to (lc 1) path extension


In the example, the retrieved net entries include one which represents the derivation path consisting of R1 only (which represents a reflexive relation and has the form VP →lc VP → VG NP). By unifying the transformed dag with this reachability net entry, the algorithm effectively restores the derivation level immediately above, which was compressed in the original lc arc. Figure 2.16 depicts this operation. By this scheme, the LC algorithm restores a derivation path of length n in a bottom-up manner by n − 1 applications of this operation. The operation described above is essentially a bottom-up operation, and it is similar to the Predict-Reduce operation of the standard algorithm, in which an expectation sanctions the creation of a new item whose first RHS constituent, a nonterminal, is in a reachability relation.22 The difference is that the standard algorithm must apply unification twice, once to check the reachability relation and again to actually build the new item, whereas our LC operation requires only one unification, to combine the transformed dag with the new reachability net entry. That is because LC models carry the original expectation at the root. This data structure design allows the algorithm to access and manipulate the expected features at the appropriate time. The result is that the LC algorithm is able to replace an expensive unification operation with a fast path extension, thereby improving efficiency over the standard Left-corner algorithm. LC has other operations, which are analogous to the Scanner and Completer of the standard algorithm. The operation equivalent to Completer (called Completion in LC) is somewhat different in that it removes the lc arc from a dag in which the derivation path has been fully expanded. This operation, together with all other LC operations, is explained in full detail in the next chapter, where the LC algorithm is formalized and compared with Shieber's algorithm, an extension of Earley's algorithm for unification grammars.

22 This LC operation is called Continuation. It is explained in detail in the next chapter (Chapter 3).
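In the same nested-dict approximation used earlier, the lc-to-(lc 1) stretch at the heart of the Continuation operation might look as follows (our illustration, not LINK's code); the simplified unifier from section 2.4.1 is repeated so the example is self-contained.

def unify(f1, f2):                                # same idea as the earlier sketch
    if not isinstance(f1, dict) or not isinstance(f2, dict):
        if f1 != f2:
            raise ValueError("clash")
        return f1
    out = dict(f1)
    for k, v in f2.items():
        out[k] = unify(out[k], v) if k in out else v
    return out

def stretch_lc(model):
    # push the complete constituent under lc down to (lc 1), so that a reflexive
    # net entry can be unified on top, restoring one skipped derivation level
    stretched = dict(model)
    stretched["lc"] = {"1": model["lc"]}
    return stretched

# complete VG model built from the VP -lc-> VG -> V entry after scanning "ate"
complete = {"cat": "VP", "lc": {"cat": "VG", "1": {"cat": "V", "word": "ate"}}}
reflexive_entry = {"cat": "VP", "lc": {"cat": "VP", "1": {"cat": "VG"}, "2": {"cat": "NP"}}}
restored = unify(stretch_lc(complete), reflexive_entry)
print(restored["lc"]["cat"], "next daughter needed:", restored["lc"]["2"]["cat"])   # VP ... NP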

Summary

In summary, our LC algorithm gains efficiency by generating maximal expectations with all available information, including syntax, semantics and domain knowledge, and by manipulating them in an efficient way. This is facilitated by LINK's syntax-semantics synchronous, incremental processing model and by the unique data structure with the special lc arc. Most unification-based systems adopt an asynchronous, context-free-backbone-based, non-incremental model in order to avoid the high computational load which results from processing large data structures. Consequently, this approach does not utilize the expressiveness of the underlying grammar formalism. Our approach, on the other hand, turns the difficulty into an advantage: it exploits the rich information encoded in the grammar to pursue efficient parsing for unification grammars. The result is a parsing algorithm which exhibits significantly improved performance: average-case linear time complexity on a domain text. Complete analyses of the empirical results are presented in Chapter 6.


Chapter 3

Abstract Left-corner Parsing Algorithm

3.1 Introduction

This chapter presents a formalization of our LC algorithm. The formalization is done by formulating the implemented algorithm described in the previous chapter (Chapter 2) and expressing it in the general unification-based formalism presented in (Shieber, 1992). Shieber (1992) developed a general logic to express a large class of unification grammars. His logic is essentially a generalization of PATR-II (Shieber, 1986a). He represents a feature structure by a set of logical formulas, each of which represents a feature-value pair. Using this logic, he expresses a general unification-based grammar and defines various grammatical notions such as the parse tree. Based on this grammar formalism, Shieber defined an abstract parsing algorithm for such grammars in the same logic. His algorithm is basically an extension of Earley's algorithm adapted to unification-style grammars. His formulation of the algorithm is quite general, in several ways. First, as with Earley's algorithm, Shieber's algorithm allows a flexible information flow, in both top-down and bottom-up directions. Second, the algorithm is specified by logic, leaving the control part of "algorithm = logic + control" (Kowalski, 1979) unspecified. The algorithm given by the logical specification is also implementation-independent, without specifying any particular data structures to be used in implemented systems. Third, the algorithm contains a parameter that can constrain the amount of information in the parser's expectation. By adjusting this parameter, various instantiations of the algorithm can be characterized by their particular parsing behavior, from expectation-driven top-down parsing using maximal expectations to data-driven bottom-up parsing using minimal expectations, and anywhere in between. With this general and underspecified schema, Shieber effectively defined an algorithm which subsumes a class/family of parsing algorithms for unification grammars. As described in Chapter 1, our LC algorithm is a variation of the Left-corner algorithm for unification-style grammars, with modifications to manipulate the lc arc.


Figure 3.1: Schematic relation of LC to other algorithms
(Diagram: within context-free parsing, CF Left-corner is an optimization of Earley's algorithm; unification extension yields Shieber's algorithm from Earley's and the unification Left-corner algorithm from CF Left-corner; LC is the unification Left-corner algorithm modified with the lc arc.)

Also, as described, in context-free parsing CF Left-corner is considered an optimization of Earley's algorithm. The unification extensions of those algorithms, the unification Left-corner algorithm and Shieber's, stand in the same optimization relation. Therefore, our LC algorithm can be characterized as an optimization of Shieber's algorithm with the modification of using the lc arc. The schematic relation of the LC algorithm to the other algorithms is shown in Figure 3.1. In this chapter, we formalize our LC algorithm in Shieber's logic and compare LC with Shieber's algorithm. This comparison is intended to set the stage for the proof of correctness of the LC algorithm in Chapter 5, where the proof is given by using the correspondence between the two algorithms.

3.2 Informal Overview

In this section, we give an informal overview of Shieber's and our LC algorithms in terms of models and algorithm operations, with a focus on the differences and correspondences between the two algorithms.

3.2.1 Models

Both Shieber's and the LC algorithm produce a set of items, just like Earley's algorithm. Part of an item is a model, a dag which represents a partial or complete result obtained by parsing a (sub)string of the input sentence using a given grammar. Basically, models in the LC algorithm differ from those produced in Shieber's algorithm in that all LC models have the lc arc, whereas no Shieber model has an lc arc, and in every LC model the corresponding Shieber model is found at the end of the lc arc.1 Figure 3.2 shows the corresponding Shieber and LC models after the word "John" (an NP) in the sentence "John likes Mary" is parsed using the rule S → NP VP (shown previously in Figure 2.9, p. 29).

1 There are some cases where the subdag under the lc arc in an LC model carries less information than Shieber's model. The relation between an LC model and the corresponding Shieber model is discussed in detail in section 5.3.


Figure 3.2: Models created after parsing "John" in Shieber's and LC algorithms
(Diagram: Shieber's model on the left and the LC model on the right; the LC model has an lc arc at its root, under which the Shieber model appears.)

Notice that the subdag under lc in the LC model is identical to Shieber's model. By pushing Shieber's model down one level, the LC algorithm gains certain advantages. First, having extra features at the root does not harm or interfere with the Shieber model under the lc arc. Therefore, LC models can support the same or an equivalent set of operations as Shieber's algorithm, if they are applied to the submodels under the lc arc. Second, the lc arc in LC is implemented as an ordinary feature with no special privileges. Therefore, the unification algorithm can be applied uniformly to LC models. More importantly, as described in the previous chapter (section 2.4.2), the extra features at the root of LC models represent the expectation. This unique structure facilitates efficient parsing. By carrying the expectation around and applying an operation which manipulates it conveniently at the appropriate time during parsing, the algorithm can build parse models directly with instantiated reachability net entries, effectively eliminating the unification operation that checks the reachability relation, which the standard unification Left-corner algorithm needs before the actual parse model is created (by another unification). Therefore, LC models support more efficient processing than simple standard models such as Shieber's.

3.2.2 Algorithms

The differences between Shieber's algorithm and LC come from the composition of the two relations shown in Figure 3.1. The first relation is analogous to the optimization relation between Earley's algorithm and the Left-corner algorithm for context-free grammars: the elimination of top-down predictions by grammar precompilation. In the context of Shieber's and LC algorithms, this relation is embodied in LC's precompilation of the grammar into reachability net entries.

40

CHAPTER 3. ABSTRACT LEFT-CORNER PARSING ALGORITHM

expectation and an input word (or terminal). This operation is described informally in the previous chapter (2.4.2). The second relation is the modi cation made in LC to process dags with lc arcs. In Shieber's algorithm, when a model which was initially created by a Prediction operation becomes complete (i.e., all RHS constituents are lled), its LHS nonterminal constituent is used to ll previous expectation in another model (by Completion operation; analogous to Earley's Completor). In LC, skipped predictions are restored by one of the bottomup operations called Continuation, as described in the previous chapter (2.4.2). This operation stretches an lc arc in a complete model to hlc 1i path, and lls in the level immediately above the current level while keeping the original expectation at the root node. This way, n levels of prediction compressed in a reachability net entry are expanded one level at a time in a reverse order. Moreover, the Continuation operation ensures that, for each skipped level, the resulting model will have exactly the same expectation that would have been propagated, if predictions were applied top-down in a step-wise manner as is done in Shieber's algorithm.2

3.3 Shieber's Logic

Shieber (1992) developed a general logic intended to subsume many variants of unification grammar, and used the logic to develop an abstract parsing algorithm for such grammars. The abstract algorithm was intended to subsume a wide variety of possible unification-based parsing strategies, depending on how certain underconstrained portions of the algorithm are filled in.

3.3.1 Logic System

Shieber's logic is considered a generalization of PATR-II. The generalization is made in several respects. The first is the elimination of grammar rules with a context-free backbone notation, thereby allowing all features to be encoded explicitly. The second is the elimination of the presumption of a graph as the representation for feature structures, thereby separating the logic as a description language from the model as an object which bears the information/constraints described by the logic (i.e., satisfies the formula).

Shieber's logic L_{L,C} is a formalism that expresses feature structures for unification grammars. L is a set of labels (feature names) and C is a set of constants (values). Labels can be concatenated by · to form a path.[3]

[3] Paths are denoted by angle brackets such as ⟨f⟩, or by a symbol itself. Concatenated paths are denoted by multiple symbols enclosed in angle brackets such as ⟨f g⟩, or with · such as f · g. Note that a single arc is also a path of length 1.

L* then represents the set of paths for features that have hierarchical structure. Paths and constants are connected by ≐ to represent the equality relation in the unification (constraint) equations, creating atomic formulas in one of two forms:

    p1 ≐ p2
    p1 ≐ c

where p1, p2 ∈ L* and c ∈ C. Thus, the inference rules of L_{L,C} allow various inferences on unification equations, such as reflexivity, symmetry and substitution. Shieber then defines a class of models M that is appropriate for L_{L,C},[4] and specifies the satisfaction relation ⊨ between models and formulas. These constitute the elements of his logic system ⟨L, M, ⊨⟩.

3.3.2 Formalism for Unification Grammar

Based on this logic system, Shieber defines a unification grammar G as a triple ⟨Σ, P, p0⟩, where Σ is the vocabulary of the grammar, P is a set of productions (grammar rules), and p0 is a designated start production. The vocabulary consists of L, C, and the terminals (input words). For his grammar, L must include the integers from 0 to n, where n is the maximum arity (number of RHS constituents) of the productions in P. There are two kinds of productions: phrasal and lexical. A phrasal production is defined as a two-tuple ⟨a, Φ⟩, where a is a nonnegative integer which corresponds to the arity of the production, and Φ is a conjunction of atomic formulas that are satisfied in the rule. For phrasal productions, each path in Φ must begin with an integer between 0 and a. Lexical productions are of the form ⟨ω, Φ⟩, where ω is a terminal and Φ is a formula all of whose paths begin with 0.

To define the language of a grammar, Shieber first defines a notion of parse tree for his formalism by extending the analogous notion in context-free grammar to a unification-based setting.[5] Basically, a valid parse tree is a model τ in which every subconstituent (under a path consisting of numeric arcs only), including the root node, satisfies a lexical or phrasal production. Note that, when τ ⊨ Φ where Φ is the formula of some production p = ⟨v, Φ⟩, p is said to license the parse tree τ. The language of a grammar is then defined as the set of yields of parse trees licensed by the start production p0. The yield of a parse tree τ and its licensing production p are defined as follows:

1. If p is a lexical production ⟨ω, Φ⟩ and τ ⊨ Φ, then τ is licensed by p. The yield of τ is ω.

2. If p is a phrasal production ⟨a, Φ⟩, τ ⊨ Φ, and for 1 ≤ i ≤ a, τ/⟨i⟩ is defined and is licensed by a production, then τ is licensed by p. The yield of τ is α1 ... αa, where αi is the yield of τ/⟨i⟩.

[4] Shieber chooses graphs for M.
[5] We will discuss Shieber's parse tree in depth in Chapter 4.


Note that the symbol / stands for the extraction operator, which extracts the portion of a model found at the end of a particular path. Thus, in a parse tree τ, τ/⟨1⟩ retrieves the first right-hand-side child of the production (i.e., the left-corner constituent) in the tree.
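As a small illustration of the yield definition above, the following sketch computes the yield of a parse tree represented as a nested Python dictionary whose numbered features hold the subconstituents and whose lexical leaves carry a word feature (as in the example grammars used later in this thesis). It is only a sketch under those assumptions, not the formal definition.

```python
def tree_yield(tau):
    """Return the yield of parse tree tau as a list of terminal words."""
    children = []
    i = 1
    while str(i) in tau:              # collect tau/<1>, tau/<2>, ... in order
        children.append(tau[str(i)])
        i += 1
    if not children:                  # leaf: licensed by a lexical production
        return [tau["word"]]
    result = []                       # internal node: licensed by a phrasal production
    for child in children:
        result.extend(tree_yield(child))
    return result

# A tiny example tree for "John likes":
tree = {"cat": "S",
        "1": {"cat": "NP", "word": "John"},
        "2": {"cat": "VP", "1": {"cat": "V", "word": "likes"}}}
print(tree_yield(tree))               # ['John', 'likes']
```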

3.3.3 Shieber's Abstract Algorithm

Having developed the formalism for unification grammar as above, Shieber defines an abstract parsing algorithm in the logic. This is based on viewing parsing as deduction in a logic framework (Pereira and Warren, 1983; Shieber et al., 1994). Just like Earley's algorithm, Shieber's algorithm creates items. Formally, an item is a quintuple ⟨i, j, p, M, d⟩, where i and j are indices into the input string being parsed, p is a phrasal production, M is a model which corresponds to a parse tree licensed by p, and d is an integer between 0 and a representing an index into the phrasal production (i.e., the dot position of an edge in chart parsing). Note that an item ⟨i, j, p = ⟨a, Φ⟩, M, d⟩ is said to be complete if d = a (i.e., the dot has reached the arity), and incomplete if d < a. Shieber's abstract algorithm is defined by a set of deduction rules that generate items. Those rules make use of the following operations defined on models:

• Extraction: This operation extracts the submodel M′ at the end of a path p in some model M, if it exists, and is written M/p.

• Embedding: This operation embeds a model M at the end of a path p, and is written M\p. Its result is the least model M′ such that M′/p = M.

• Unification:[6] This operation merges the information contained in two models M and M′, and is written M ⊔ M′. If it exists, the result is the least model M″ such that M ⊑ M″ and M′ ⊑ M″.[7]

• Restriction: This operation filters the paths in M which extend from the top node. The restriction of M to a set of features F, written M ↾ F, is the least model M′ such that if M ⊨ Φ and all nonempty paths in Φ start with some element of F, then M′ ⊨ Φ.

Figure 3.3 shows Shieber's parsing operations expressed as logical deduction rules.[8] M is a model, mm(Φ) stands for a minimal model[9] which satisfies a set of formulas Φ, and p0 = ⟨a0, Φ0⟩ is the start production.

[6] Shieber calls this operation informational union.
[7] The subsumption relation is defined as follows (Definition 9, Shieber, 1992, p. 34): a model M subsumes another model M′ (written M ⊑ M′) if and only if, for all formulas Φ ∈ L, M′ ⊨ Φ whenever M ⊨ Φ.

[8] The formulas have been modified to eliminate Shieber's use of the 0 arc for the LHS constituent. Also note that, for each rule, deduction goes from above the bar (antecedent) to below the bar (consequent).
[9] A minimal model is defined as follows (Property 10, Shieber, 1992, p. 34): if Φ is a consistent formula, then there is a model M such that M ⊨ Φ and, for all M′ such that M′ ⊨ Φ, it is the case that M ⊑ M′.
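Before turning to the deduction rules of Figure 3.3, the following sketch makes the four model operations concrete on a deliberately simplified representation: models are nested Python dictionaries (no structure sharing or cycles) and paths are tuples of feature names. It is an illustration of the definitions, not of Shieber's graph-based formulation.

```python
def extract(m, path):                     # M / p
    """Return the submodel at the end of `path`, or None if undefined."""
    for f in path:
        if not isinstance(m, dict) or f not in m:
            return None
        m = m[f]
    return m

def embed(m, path):                       # M \ p: least model M' with M'/p = M
    for f in reversed(path):
        m = {f: m}
    return m

def unify(m1, m2):                        # M1 join M2, or None if inconsistent
    if not isinstance(m1, dict) or not isinstance(m2, dict):
        return m1 if m1 == m2 else None   # atomic values must agree
    out = dict(m1)
    for f, v in m2.items():
        if f in out:
            sub = unify(out[f], v)
            if sub is None:
                return None               # clash below feature f
            out[f] = sub
        else:
            out[f] = v
    return out

def restrict(m, features):                # M restricted to F: keep top-level arcs in F
    return {f: v for f, v in m.items() if f in features}
```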


INITIAL ITEM:
    ⟨0, 0, p0, mm(Φ0), 0⟩

PREDICTION:
    ⟨i, j, p = ⟨a, Φ⟩, M, d⟩
    ----------------------------------------
    ⟨j, j, p′, ρ(M/⟨d+1⟩) ⊔ mm(Φ′), 0⟩
    where d < a and p′ = ⟨a′, Φ′⟩ ∈ P

SCANNING:
    ⟨i, j, p = ⟨a, Φ⟩, M, d⟩
    ----------------------------------------
    ⟨i, j+1, p, M ⊔ mm(Φ′)\⟨d+1⟩, d+1⟩
    where d < a and ⟨w_{j+1}, Φ′⟩ ∈ P

COMPLETION:
    ⟨i, j, p = ⟨a, Φ⟩, M, d⟩    ⟨j, k, p′ = ⟨a′, Φ′⟩, M′, a′⟩
    ----------------------------------------
    ⟨i, k, p, M ⊔ (M′\⟨d+1⟩), d+1⟩
    where d < a

Figure 3.3: Shieber's parsing operations

Note that we assume the same operator associativity and precedence as (Shieber, 1992) throughout: the operators /, \, ↾ and ⊔ are left-associative, and /, \ and ↾ have the same precedence, which is higher than that of ⊔. So, for example, the Scanning operation M ⊔ mm(Φ′)\⟨d+1⟩ means M ⊔ (mm(Φ′)\⟨d+1⟩).

Just like Earley's algorithm, Shieber's algorithm starts with the Initial Item. At any given time during parsing, any of the other three operations may be applied to any item whose dot position has not reached the arity of its licensing production. The Scanning operation unifies an input word (a lexical production ⟨w_{j+1}, Φ′⟩) with the expectation (the constituent after the dot) in an item and advances the dot. The Prediction operation first extracts the expectation (ρ(M/⟨d+1⟩)) from an item, unifies it with a rule (p′ = ⟨a′, Φ′⟩) that can rewrite the expectation, and creates a new item whose dot position is 0, indicating that no RHS constituent of the resulting item is filled yet. By this operation, the expectation is propagated one level down in a derivation tree. Note that the function ρ is a constant function which removes some features; it is explained in the next section. The Completion operation unifies a complete parse tree (M′) with an expectation in an item and advances the dot. These three operations are applied until no more rules are applicable. The final parse, then, is the complete parse tree (model) M in the item ⟨0, n, p0, M, a0⟩.
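Under the same simplified dict representation used in the earlier sketch (with unify, extract and embed as defined there), the four deduction rules can be transcribed almost literally. Productions are assumed to be given as (arity, minimal_model) pairs, i.e., with mm(Φ) already computed, and lexical entries as (word, minimal_model) pairs; chart and agenda bookkeeping, and the choice of ρ, are left out.

```python
def initial_item(p0):
    a0, m0 = p0                              # start production as (arity, mm(Phi0))
    return (0, 0, p0, m0, 0)

def predict(item, p_new, rho=lambda m: m):   # PREDICTION
    i, j, p, M, d = item
    a, _ = p
    assert d < a
    _, m_new = p_new
    M2 = unify(rho(extract(M, (str(d + 1),))), m_new)
    return None if M2 is None else (j, j, p_new, M2, 0)

def scan(item, lex):                         # SCANNING
    i, j, p, M, d = item
    a, _ = p
    assert d < a
    _, m_lex = lex
    M2 = unify(M, embed(m_lex, (str(d + 1),)))
    return None if M2 is None else (i, j + 1, p, M2, d + 1)

def complete(item, done):                    # COMPLETION
    i, j, p, M, d = item
    a, _ = p
    j2, k, p2, M2, d2 = done
    assert d < a and j2 == j and d2 == p2[0] # `done` must be complete and adjacent
    M3 = unify(M, embed(M2, (str(d + 1),)))
    return None if M3 is None else (i, k, p, M3, d + 1)
```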


3.3.4 Filtering Function ρ

The function ρ used in Prediction plays an important role in Shieber's algorithm. It is a constant function which removes the features of a model that are not in a predefined set of features. This function serves two purposes. The first is as an adjustable parameter in the algorithm specification.[10] Since ρ is applied in Prediction, it can control the amount of expectation that is propagated top-down to the constituents at lower levels. Thus, various instantiations of Shieber's algorithm can characterize their parsing behavior by ρ, from minimal filtering, which passes down all top-down prediction features, to maximal filtering, which passes down no prediction at all. By adjusting this parameter, the algorithm behaves as anything from expectation-driven top-down parsing to data-driven, purely bottom-up parsing, or anywhere in between. In the case where only the context-free category symbol is passed down, the algorithm behaves very much like CF Left-corner parsing.

The other purpose is to avoid prediction nontermination in implemented unification-based systems (Shieber, 1985). We briefly discussed this issue in the previous chapter (2.3.3). The problem is that, with certain left-recursive grammars, the algorithm may not terminate during top-down prediction because some features cause a different model to be created at every iteration. To deal with this problem, (Shieber, 1985) proposed a function ρ which removes those problematic features from the expectation.[11] By applying ρ, the algorithm creates the same models (possibly after a subsumption check), which are discarded as duplicates by memoization, thereby making the algorithm terminate. The issue of prediction nontermination is discussed in detail in Chapter 6.2.

[10] The control strategy is another parameter that can characterize the instantiated algorithms.
[11] In (Shieber, 1985), this function was called "restriction". Although there are minor differences between restriction and ρ, they are considered equivalent for the most part.
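As one concrete (and purely illustrative) instantiation under the dict representation of the earlier sketches, ρ can be realized as a recursive filter over a fixed set of "safe" feature names; the particular set below is hypothetical.

```python
SAFE_FEATURES = {"cat", "1", "2", "head"}      # hypothetical choice of safe features

def rho(model, safe=frozenset(SAFE_FEATURES)):
    """Remove every arc whose label is not in `safe`, at any depth."""
    if not isinstance(model, dict):
        return model                            # atomic value: keep as is
    return {f: rho(v, safe) for f, v in model.items() if f in safe}
```

With safe set to all features, rho is the identity (minimal filtering); with an empty set it discards the whole expectation (maximal filtering); keeping only "cat" roughly corresponds to CF Left-corner behavior.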

3.3.5 Extension in LINK

With some minor deviations, LINK's grammar conforms to Shieber's grammar formalism. In addition to Shieber's four operators, we define a path replacement operator as follows. In our LC algorithm, this operator plays a critical role: it is used to stretch and shrink lc arcs in the parse models.

• Path replacement: The path replacement of p1 for p2 in M, denoted M[p1 ⇒ p2], deletes the path p1 in M while preserving all other paths, and then connects the top node and the node which used to be under p1 through the new path p2.[12]

Figure 3.4 shows an example of the path replacement operator in which the path ⟨f g⟩ is replaced by ⟨h⟩. Note that the ⇒ operator is defined by a composition of Shieber's four operators (⊔, /, \, ↾), so it is fully expressible in his logic. The precise definition of this operator is given in Appendix B.2.

[12] We will also use the same [ ] notation for the textual substitution operator defined in Shieber's logic. That operator is defined for logical formulas, and textually substitutes a term (path or value) in the formulas. For example, the substitution of path p for q in Φ is denoted Φ[p → q].

[Figure 3.4: Path Replacement Operator]
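A rough sketch of M[p1 ⇒ p2] under the dict representation, reusing extract, embed and unify from the earlier sketch. It assumes p1 is defined in M; how the vacated node is detached is simplified here and does not follow the precise definition of Appendix B.2.

```python
import copy

def path_replace(model, p1, p2):
    """Sketch of M[p1 => p2]: detach the subdag under p1 and re-attach it under p2."""
    m = copy.deepcopy(model)
    sub = extract(m, p1)                       # node that used to sit under p1
    parent = extract(m, p1[:-1]) if len(p1) > 1 else m
    del parent[p1[-1]]                         # delete the last arc of path p1
    return unify(m, embed(sub, p2))            # reconnect the node under p2

# Example in the spirit of Figure 3.4: replace <f g> by <h>
m = {"f": {"g": {"a": "x", "b": {"c": "y"}}}}
print(path_replace(m, ("f", "g"), ("h",)))
# -> {'f': {}, 'h': {'a': 'x', 'b': {'c': 'y'}}}
```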

3.4 Abstract Left-corner Algorithm

Our LC algorithm is defined by two sets of logical deduction rules. The first set of rules produces the entries of the reachability net and is applied at grammar-compilation time. The second set of rules derives items, and constitutes the operations performed during parsing.

3.4.1 Reachability Net Rules

In the LC algorithm, a reachability net entry is produced for every consistent left-corner derivation path, consisting of an ordered sequence of rules s_n = ⟨p1, ..., pn⟩. A reachability net model is constructed by two logical deduction rules, RN1 and RN2. Rule RN1 is first applied to p1. The resulting model is then fed into rule RN2, along with p2, to create another net entry. This entry is fed back into RN2 again, along with p3, and so on. For the purposes of the logic, net entries are three-tuples of the form ⟨s_n, p_n, M⟩, where s_n = ⟨p1, .., pn⟩ is the sequence of phrasal productions giving rise to the entry, p_n is the last production in s_n, and M is the resulting model, which is licensed by p_n.

The RN1 rule creates a model which represents the reflexive reachability relation. It directly converts a phrasal production to a net entry by prefixing all paths that start with a numbered arc with an lc arc, and unifying the production again under the lc arc.

RN1:

    ⟨⟨p⟩, p, mm(Φ)[⟨n⟩ ⇒ ⟨lc n⟩]_{n=1..a} ⊔ mm(Φ)\⟨lc⟩⟩
    where p = ⟨a, Φ⟩ ∈ P

Note that we use the notation [ ]_{n=1..a} to denote a simultaneous (or parallel) path replacement for all numbered arcs n (1 ≤ n ≤ a). Figure 3.5 shows the result of applying RN1 to the VP → VG NP rule (R1, shown previously in Figure 2.13). A model created by the RN1 rule, which we call an RN1 net model, has some particular properties. First, the constituents at the root and under the lc arc have the same category feature. In the figure, the resulting RN1 net model can be written, using the context-free backbone, as VP →lc VP → VG NP.

[Figure 3.5: Application of RN1 to the VP rule (R1)]

That is because, since an RN1 net model represents a reflexive relation, these two constituents intensionally denote the same constituent (i.e., the LHS constituent). Converting grammar rules into this form has an advantage: left-recursive rules can be represented uniformly with other rules in the reachability net entries. Second, after the conversion, the constraint equations between the LHS constituent and the RHS constituents in the original rule are preserved as non-numbered paths from the root to the RHS constituents (now under the lc arc). For instance, the path equation ⟨head dobj⟩ ≐ ⟨2 head⟩ in R1 is preserved as ⟨head dobj⟩ ≐ ⟨lc 2 head⟩ in M. Our LC algorithm makes use of these properties of RN1 net models in its operations. Appendix B.3 lists some important properties of RN1 net models, along with other general properties of reachability net entries.
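A sketch of the RN1 rule under the same simplified representation (productions as (arity, minimal_model) pairs, with path_replace, embed and unify from the earlier sketches). Because the dicts have no reentrancy, path equations such as ⟨head⟩ ≐ ⟨1 head⟩ are not represented; only the ground features of the minimal model are manipulated.

```python
def rn1(p):
    """RN1: build the reflexive reachability net entry <(p), p, M> for production p."""
    a, m = p
    shifted = m
    for n in range(1, a + 1):                      # [<n> => <lc n>] for 1 <= n <= a
        if extract(shifted, (str(n),)) is not None:
            shifted = path_replace(shifted, (str(n),), ("lc", str(n)))
    entry_model = unify(shifted, embed(m, ("lc",)))  # ... joined with mm(Phi)\<lc>
    return ((p,), p, entry_model)
```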

The RN2 rule creates net entries which represent the transitive reachability relation. It first compresses the left-corner path ⟨lc 1⟩ into the ⟨lc⟩ path by the path replacement operator. This step essentially pushes the left-corner constituent (under ⟨lc 1⟩) up to the LHS position (under the lc arc) in the resulting (intermediate) model. The RN2 rule then unifies this intermediate model with a rule that can rewrite the LHS constituent. In this way, the RN2 rule extends the reachability relation one level further.

RN2:

    ⟨s_n = ⟨p1, .., pn⟩, p_n, M⟩
    ----------------------------------------
    ⟨s′ = ⟨p1, .., pn, p′⟩, p′, ρ(M[⟨lc 1⟩ ⇒ ⟨lc⟩]) ⊔ mm(Ψ)\⟨lc⟩⟩
    where p′ = ⟨a, Ψ⟩ ∈ P

Note that ρ in our algorithm is analogous to Shieber's. As the reachability relation is extended further from a given expectation, some features which influence the left-corner constituents are propagated down and make the resulting model (which we call an RN2 net model, or non-RN1 model) more specific. Figure 3.6 shows the result of applying the RN2 rule to the RN1 net model M from the previous Figure 3.5 and the VG → V rule (R2, shown previously in Figure 2.13).

[Figure 3.6: Application of the RN2 rule to M and the VG rule (R2)]

The feature ⟨lc 1 head type⟩ ≐ trans on the left-corner constituent VG in M is first transformed to ⟨lc head type⟩ ≐ trans by path replacement, and is then pushed into M2's new left-corner constituent V through the path equation ⟨lc 1 head⟩ ≐ ⟨lc head⟩ in R2.
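The RN2 rule, and a naive way of closing the RN1 entries under it, might look as follows in the same simplified setting. The path replacement step is written out inline as a plain dict manipulation (the dicts have no reentrancy), the filtering function rho and the bound max_len are placeholders, and a real compiler would instead use subsumption or another termination criterion, which left-recursive grammars in particular require.

```python
def rn2(entry, p_new, rho=lambda m: m):
    """RN2: extend a reachability net entry one level down with production p_new."""
    seq, _, m = entry
    _, m_new = p_new
    # M[<lc 1> => <lc>], written out for this simplified setting: the old
    # left-corner constituent (under <lc 1>) becomes the new content of the
    # lc arc, while the non-lc features at the root are kept.
    pushed = {f: v for f, v in m.items() if f != "lc"}
    pushed["lc"] = extract(m, ("lc", "1"))
    result = unify(rho(pushed), embed(m_new, ("lc",)))   # ... joined with mm(Psi)\<lc>
    return None if result is None else (seq + (p_new,), p_new, result)

def reachability_net(productions, max_len=3):
    """Naive closure of RN1 entries under RN2, up to sequences of length max_len."""
    entries = [rn1(p) for p in productions]
    frontier = list(entries)
    for _ in range(max_len - 1):
        new = []
        for entry in frontier:
            for p in productions:
                extended = rn2(entry, p)
                if extended is not None:        # inconsistent extensions are dropped
                    new.append(extended)
        entries.extend(new)
        frontier = new
    return entries
```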

3.4.2 Item Generation Rules

LC's item generation rules are modifications of those in Shieber's algorithm: Shieber's four operations become five operations in LC: Initial Item, Scanning 1, Scanning 2, Completion and Continuation. As with Shieber's algorithm, parsing begins with the generation of an initial item. Assume that the phrasal production p0 = ⟨a0, Φ0⟩ has been designated as the grammar's start production. The Initial Item rule is analogous to Shieber's Initial Item. At the beginning of parsing, Initial Item inserts an item which has the RN1 net model created from the start production (⟨⟨p0⟩, p0, RM⟩) as the initial expectation. Notice that the dot position in the resulting item is 0, implying that no RHS constituent is filled yet.

INITIAL ITEM:

    ⟨0, 0, p0, RM, 0⟩

In LC, there are two scanning operations, which examine the next token in the input string to match it against the expectations in the existing items. Two operations are necessary because an expectation may directly predict the next input token (terminal), or it may be satisfied by the lexical item via a reachability net entry. The Scanning 1 rule is analogous to Scanning in Shieber's algorithm. It unifies an expectation in an item model (M/⟨lc d+1⟩) with the input word (mm(Φ′), where Φ′ is the formula of a lexical production), and advances the dot.


SCANNING 1:
    ⟨i, j, p = ⟨a, Φ⟩, M, d⟩
    ----------------------------------------
    ⟨i, j+1, p, M ⊔ (mm(Φ′)\⟨lc d+1⟩), d+1⟩
    where d < a and ⟨w_{j+1}, Φ′⟩ ∈ P

The Scanning 2 rule is the operation by which LC, as a Left-corner algorithm, optimizes Shieber's Prediction operation. It extracts an expectation (ρ(M/⟨lc d+1⟩)), combines it with the input word (mm(Φ′)), and then unifies the result with a reachability net model (RM) that connects the two.

SCANNING 2:
    ⟨i, j, p = ⟨a, Φ⟩, M, d⟩    ⟨s, p′, RM⟩
    ----------------------------------------
    ⟨j, j+1, p′, ρ(M/⟨lc d+1⟩) ⊔ mm(Φ′)\⟨lc 1⟩ ⊔ RM, 1⟩
    where d < a and ⟨w_{j+1}, Φ′⟩ ∈ P

Notice that the first index in the resulting item of Scanning 2 is j instead of i, and the dot position is 1, implying that the expectation from the previous item is separated out and used in the construction of a new item whose left-corner terminal constituent is filled. Figure 3.7 graphically depicts the actions of the two scanning rules.[13]

When the dot position reaches the arity of its phrasal production, the item is complete. Depending on whether the model is an RN1 model (a model based on an RN1 net model) or a non-RN1 model (a model based on an RN2 net model), two different operations are necessary. The Completion rule is essentially equivalent to Completion in Shieber's algorithm. It applies to a completed RN1 model. As discussed in the previous section, an RN1 model is in the form where the expectation at the root and the LHS constituent under the lc arc have the same category.[14] The Completion rule first merges the two constituents (by M′/⟨lc⟩ ⊔ M′↾(dom(M′) − {lc})). This step basically collapses the lc arc, which was originally introduced by grammar precompilation, in particular by the RN1 rule. Then Completion unifies the merged LHS constituent with the expectation in the previous item model (M/⟨lc d+1⟩), and advances the dot. The actions of this rule are graphically depicted in Figure 3.8.

COMPLETION:

    ⟨i, j, p = ⟨a, Φ⟩, M, d⟩    ⟨j, k, p′ = ⟨a′, Φ′⟩, M′, a′⟩
    ----------------------------------------
    ⟨i, k, p, M ⊔ (M′/⟨lc⟩ ⊔ M′↾(dom(M′) − {lc}))\⟨lc d+1⟩, d+1⟩
    where d < a

[13] In Figures 3.7-3.9, shaded nodes and trees are directly affected by unification. Also note that, in those figures, the root nodes of the models M, M′ and of the reachability net model RM may have other paths, besides ⟨lc⟩, that are connected to the rules under the lc arc. Those paths, however, are omitted from the pictures for simplicity.
[14] A non-RN1 model could have this property as well, if the licensing rule is left-recursive. However, for a non-RN1 model, the Continuation operation (discussed next), rather than Completion, should be applied to produce a correct result. This issue is further discussed in Chapter 4.

[Figure 3.7: Diagram depicting the two Scanning rules]

[Figure 3.8: Diagram depicting the Completion rule]

The Continuation rule is the operation by which the rule sequence compressed into the lc arc of the underlying reachability net entry (an RN2 net model) is expanded. Continuation first stretches the ⟨lc⟩ path in a completed RN2 model (M) to ⟨lc 1⟩ by the path replacement operator, and unifies the result with another reachability net model (RM).

CONTINUATION:

    ⟨i, j, p = ⟨a, Φ⟩, M, a⟩    ⟨s, p′, RM⟩
    ----------------------------------------
    ⟨i, j, p′, M[⟨lc⟩ ⇒ ⟨lc 1⟩] ⊔ RM, 1⟩

Notice that the first two indices i, j in the resulting item are unchanged from the previous item, but the dot position is 1, implying that the complete LHS constituent covering the ith to the jth positions in the previous item is filled in as a nonterminal left-corner constituent in the new item. Figure 3.9 graphically depicts the actions of the Continuation rule.
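The five LC item generation rules can be transcribed in the same simplified style as the Shieber sketch earlier: items are (i, j, p, M, d) tuples, productions and lexical entries are (arity-or-word, minimal_model) pairs, and net entries are the triples produced by rn1/rn2 above. The path replacement steps are written out inline as plain dict manipulations, since the dict representation has no reentrancy; all of this is illustrative only.

```python
def lc_initial_item(p0):
    _, _, rm = rn1(p0)                            # RN1 net model of the start production
    return (0, 0, p0, rm, 0)

def lc_scanning1(item, lex):
    i, j, p, M, d = item
    _, m_lex = lex
    M2 = unify(M, embed(m_lex, ("lc", str(d + 1))))
    return None if M2 is None else (i, j + 1, p, M2, d + 1)

def lc_scanning2(item, lex, net_entry, rho=lambda m: m):
    i, j, p, M, d = item
    _, m_lex = lex
    _, p_new, rm = net_entry
    expect = extract(M, ("lc", str(d + 1)))       # expectation M/<lc d+1>
    if expect is None:
        return None
    M2 = unify(unify(rho(expect), embed(m_lex, ("lc", "1"))), rm)
    return None if M2 is None else (j, j + 1, p_new, M2, 1)

def lc_completion(item, done):
    i, j, p, M, d = item
    _, k, _, M2, _ = done                          # `done` is a completed RN1 model item
    merged = unify(extract(M2, ("lc",)),           # collapse the lc arc: M'/<lc> joined
                   {f: v for f, v in M2.items() if f != "lc"})  # with the non-lc root features
    if merged is None:
        return None
    M3 = unify(M, embed(merged, ("lc", str(d + 1))))
    return None if M3 is None else (i, k, p, M3, d + 1)

def lc_continuation(item, net_entry):
    i, j, _, M, _ = item                           # item must be complete
    _, p_new, rm = net_entry
    stretched = {f: v for f, v in M.items() if f != "lc"}  # M[<lc> => <lc 1>]:
    stretched["lc"] = {"1": M["lc"]}               # push the old lc content down one level
    M2 = unify(stretched, rm)
    return None if M2 is None else (i, j, p_new, M2, 1)
```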

3.5 Example

In this section, we show an example of the correspondence between Shieber's and LC algorithms. We focus on the synchronization point between the two algorithms: from a given expectation, each algorithm eventually produces a complete constituent which fills the expectation after some processing. In Shieber's algorithm, the complete constituent is produced by n step-wise Prediction operations followed by n Completion operations (after intermediate completions).

[Figure 3.9: Diagram depicting the Continuation rule]

In LC, on the other hand, the complete constituent is produced by fewer operations: one Scanning 2 operation followed by n−1 Continuation operations and one Completion operation.

Suppose the sentence "John likes Mary" is parsed using the following grammar:

    R0 = ⟨2, { ⟨cat⟩ ≐ S, ⟨1 cat⟩ ≐ NP, ⟨2 cat⟩ ≐ VP, ⟨head⟩ ≐ ⟨2 head⟩,
               ⟨head subj⟩ ≐ ⟨1 head⟩, ⟨head subj agr⟩ ≐ ⟨2 head agr⟩ }⟩
    R1 = ⟨2, { ⟨cat⟩ ≐ VP, ⟨1 cat⟩ ≐ VG, ⟨2 cat⟩ ≐ NP, ⟨head⟩ ≐ ⟨1 head⟩,
               ⟨head dobj⟩ ≐ ⟨2 head⟩, ⟨head type⟩ ≐ trans }⟩
    R2 = ⟨1, { ⟨cat⟩ ≐ VG, ⟨1 cat⟩ ≐ V, ⟨head⟩ ≐ ⟨1 head⟩ }⟩
    R3 = ⟨"John",  { ⟨cat⟩ ≐ NP, ⟨head agr⟩ ≐ 3S, ⟨word⟩ ≐ "John" }⟩
    R4 = ⟨"likes", { ⟨cat⟩ ≐ V, ⟨head agr⟩ ≐ 3S, ⟨head type⟩ ≐ trans, ⟨word⟩ ≐ "likes" }⟩
    R5 = ⟨"Mary",  { ⟨cat⟩ ≐ NP, ⟨head agr⟩ ≐ 3S, ⟨word⟩ ≐ "Mary" }⟩

Figure 3.10 shows a diagram of the operations applied and the items created after each parsing operation in Shieber's algorithm. Items in this figure (and also in the next Figure 3.11) are indicated in a simplified format using the item model's context-free backbone. Notice that after "John" is scanned, two Prediction operations ((3) prediction and (4) prediction in the figure) are necessary to bring down the high-level VP expectation to be matched with the input word "likes" (a V). Then, after Scanning eats up "likes" ((5) scanning), Completion is applied to fill the VG and advance the dot ((6) completion).
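For reference, the same grammar can be written down as data in an equational form (each production paired with a list of path equations, where a right-hand side is either another path or a constant); this is just a transcription of the rules above, not part of the algorithm.

```python
# Phrasal productions: (arity, [(path, path_or_constant), ...]); lexical: (word, [...]).
R0 = (2, [(("cat",), "S"), (("1", "cat"), "NP"), (("2", "cat"), "VP"),
          (("head",), ("2", "head")), (("head", "subj"), ("1", "head")),
          (("head", "subj", "agr"), ("2", "head", "agr"))])
R1 = (2, [(("cat",), "VP"), (("1", "cat"), "VG"), (("2", "cat"), "NP"),
          (("head",), ("1", "head")), (("head", "dobj"), ("2", "head")),
          (("head", "type"), "trans")])
R2 = (1, [(("cat",), "VG"), (("1", "cat"), "V"), (("head",), ("1", "head"))])
R3 = ("John",  [(("cat",), "NP"), (("head", "agr"), "3S"), (("word",), "John")])
R4 = ("likes", [(("cat",), "V"), (("head", "agr"), "3S"),
                (("head", "type"), "trans"), (("word",), "likes")])
R5 = ("Mary",  [(("cat",), "NP"), (("head", "agr"), "3S"), (("word",), "Mary")])
```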


[Figure 3.10: Operations and items created in Shieber's algorithm]

Figure 3.11 shows a similar diagram when the same sentence is parsed by the LC algorithm using the same grammar. Notice that the reachability net entries are listed at the top of the figure. In LC, parsing starts with Initial Item using the reachability net entry t0 (note that t0 is an RN1 net model). After "John" is scanned in the same way as in Shieber's algorithm (but using the Scanning 1 rule; (2) scanning 1 in the figure), Shieber's two Predictions followed by one Scanning (which eats up the word "likes") are done by a single operation, Scanning 2, using the reachability net entry t2: VP →lc VG → V ((3) scanning 2). This is where LC optimizes Shieber's algorithm, by eliminating Shieber's Prediction items. Also notice that after (3) scanning 2, Continuation is applied and creates an item whose left-corner VG is filled ((4) continuation). What happened here is that Continuation used the reachability net entry t1: VP →lc VP → VG NP to realize the VG.

Figures 3.12[15] and 3.13 show the items created after the above operation sequences in each algorithm, with detailed dag models. Notice that the model in Shieber's (5) scanning is the same as the submodel under lc in the model of LC's (3) scanning 2, and that Shieber's model in (6) completion is the same as the submodel under lc in the model of LC's (4) continuation, implying a strong parallel between the two algorithms through corresponding operation sequences.

[15] In this figure, nodes in the next expectation are shaded for clarity. Also, in Figures 3.12 and 3.13, the function ρ is the identity function.

[Figure 3.11: Operations and items created in the LC algorithm]

[Figure 3.12: Items in Shieber's algorithm]

[Figure 3.13: Items in LC algorithm]

Chapter 4

Nonminimal Derivations

4.1 Introduction

This chapter discusses a difficulty that exists in Shieber's algorithm and in LC. During the effort to prove the correctness of our LC algorithm, we discovered that Shieber's algorithm produces unintended, spurious derivations in addition to the intended ones. We call these spurious parses nonminimal derivations because they contain more information (i.e., features) than they absolutely have to. We also discovered that LC produces derivations of the same nature. Nonminimal derivations in Shieber's and LC algorithms are subtle and not obvious from the algorithm definitions; in fact, as far as we know, this problem has not been reported in any previous literature. Not only is the nonminimal derivation problem an interesting topic of its own, it also has a particular importance in this thesis, since it affects the correctness of Shieber's and LC algorithms, which we show in the next Chapter 5.

In the context of unification-based parsing, a correct algorithm should not produce nonminimal results, since unification, as the information-combining operation, preserves minimal models. Therefore, we argue that nonminimal results are incorrect, and that any parsing algorithm which produces such results is unsound, and thus incorrect. However, despite the nonminimal derivations, Shieber (1992) proved the correctness of his algorithm. As it turned out, his proof relied on his definition of parse tree as the defining evidence of the language of a grammar, which Shieber extended from the equivalent notion in context-free grammar to a unification-based setting. This raises another aspect of the nonminimal derivation problem, namely nonminimal derivation as a grammatical specification problem. The key realization here is that, in unification grammars, formal definitions of some of the fundamental notions in grammar and parsing (such as parse tree), which seemed rather trivial in context-free grammar, need more careful consideration to ensure minimality.

In this chapter, we first describe informally the nonminimal derivations produced in the two algorithms, and discuss the sources of the problem. We then suggest several possible solutions, which are modifications to the algorithms that prevent nonminimal derivations. We then go on to discuss nonminimal derivation as a grammatical specification problem, with reference to the notion of tree admissibility. In that discussion, we rigorously formulate minimal derivation as a procedure, and propose alternative definitions of minimal parse tree for unification grammars.


[Figure 4.1: Example of a context-free derivation tree, for the grammar r1: C → D E, r2: D → "d"]

4.2 Nonminimal Derivation Problem

In this section, we present the nonminimal derivations produced in Shieber's algorithm and in LC. We first build an intuitive notion of minimal and nonminimal derivations, and then illustrate by examples how such derivations are created in the two algorithms. Later, in section 4.6, we present a formal account of those notions as part of alternative definitions of minimal parse tree. Note that throughout this chapter, we assume the top-down filtering function ρ in Shieber's Prediction operation (ρ(M/⟨d+1⟩) ⊔ mm(Φ), on p. 43) to be the identity function. In other words, all features in the expectation (M/⟨d+1⟩) are propagated down to a production (p = ⟨a, Φ⟩) which can rewrite the expectation.

4.2.1 Minimal and Nonminimal Derivations

Derivation is a well-established notion in the mathematical study of languages. Given a grammar rule expressed in the general form A1..An → B1..Bm, a derivation is a rewriting of symbols in which the realization of the symbols/constituents on one side is replaced by those on the other side. In general, derivation can be done either top-down or bottom-up: either the left-hand side (LHS) constituents A1..An are replaced by those on the right-hand side (RHS) B1..Bm, or the RHS is replaced by the LHS, respectively. Whether derivation is done in one direction or the other, in the case of context-free grammar the trace of a series of symbol rewritings forms a tree structure. In a derivation tree, every node is labeled with a context-free symbol (or, for some leaf nodes, a terminal string), and every internal node (including the root) is licensed by a rule in the grammar. Figure 4.1 shows a derivation tree for an arbitrary grammar, in which D, the first RHS constituent of r1, is rewritten by the LHS of r2.

In the case of unification grammars, a context-free symbol becomes a feature structure, and the operation equivalent to rewriting becomes unification. The trace of a series of derivations using unification then forms a graph (or dag), in which every node is a complex feature structure. In this dag, every node/subdag under a numbered arc/path (including the root) represents a syntactic constituent, and those nodes are licensed by a phrasal or lexical rule in the grammar.

The unification operation is essentially the same as set union, and it preserves minimal models (by the logical definition given in the last chapter, on p. 42).[1]

[1] The resulting structure is a least upper bound (LUB), or a most general unifier (MGU).

[Figure 4.2: Example of minimal and nonminimal derivations, for the grammar r1: {cat = C, ⟨1 cat⟩ = D, ⟨2 cat⟩ = E, ⟨h⟩ = ⟨1 h⟩} and r2: {cat = D, word = "d"}]

So, when a series of unifications is applied by derivations, the resulting structure should have all and only the features from the licensing rules/productions (embedded under the appropriate paths) that were used to construct it. We call such a derivation a minimal derivation. Based on this principle, any correct unification-based parsing algorithm should produce the same result as a derivation would, that is, a minimal derivation. However, we discovered that some unification-based parsing algorithms, when applied to certain grammars, spuriously allow derivations that result in models which have extra, irrelevant features. We call such a derivation a nonminimal derivation. Figure 4.2 shows an example of minimal and nonminimal derivations produced from an arbitrary grammar. As you can see, both derivations are built from the same rules r1 (which licenses the root) and r2 (which licenses the subdag under the ⟨1⟩ arc), but the nonminimal derivation has an extra feature ⟨1 h foo⟩ ≐ bar (and ⟨h foo⟩ ≐ bar) which came from neither r1 nor r2.

4.2.2 Nonminimal Derivation Phenomena

Before we show how nonminimal derivations are actually produced in Shieber's algorithm and LC, we briefly describe the general scheme of the nonminimal derivation phenomena. Basically, a nonminimal derivation occurs when an expectation which spawned a sequence of top-down Prediction operations (forming what we call a prediction path) is later filled by a complete item that is not the result of the exact bottom-up restoration of that path. This inappropriate item is a completion of some other path. Therefore, it may carry features that are propagated from its own prediction path but are not in the expectation with which it is unified. Those features essentially came from a different context, but the item may nonetheless be used to fill the expectation, because the irrelevant features did not cause any conflict with the existing features in the expectation.

It is important to note here that the same spurious situations also occur in context-free parsing, specifically in Earley's algorithm and context-free left-corner algorithms. However, since the only information a constituent carries in a context-free grammar is the grammar symbol, there is no notion of minimality, and thus the spurious derivations produce exactly the same results as the normal ones.


Those results are most often discarded as duplicates explicitly by chart-based parsing algorithms, and therefore do not become a problem.

4.2.3 Nonminimal Derivations in Shieber's Algorithm

Nonminimal derivations are produced in at least three situations in Shieber's algorithm. All of them occur in conjunction with the Completion operation. The first case involves a single expectation and a single prediction path (Case 1); the second case involves a single expectation and multiple prediction paths (Case 2); and the third case involves multiple expectations and multiple prediction paths (Case 3).

Case 1 (skipped left-recursion): This case happens when Completion is applied prematurely to fill the expectation by skipping some left-recursive rules in a prediction path.[2] Suppose n successive Prediction operations are performed (forming a single top-down prediction path) using a rule sequence s = (p1, p2, .., pn) from an expectation, generating items I1, I2, .., In respectively. After items In through I2 are completed bottom-up in reverse order, creating new items In′, .., I2′, one more Completion should be applied to I1 with (the complete) I2′ to fill its left-corner constituent (since the dot position of I1 is still 0). Only after I1 becomes complete, creating I1′, should the expectation be filled by that complete I1′ (i.e., a complete restoration of the prediction path s). However, if I1 is left-recursive, the (syntactic) category of the LHS constituent of I2′ is the same as the category of the LHS constituent of I1, which in turn is the same as the category of the expectation. Therefore, I2′ may well be able to fill the expectation directly, if the two unify. Thus, Completion may be applied to the expectation with I2′ prematurely, skipping the last, left-recursive, still incomplete I1. Note that this skipped prediction path essentially represents another path s′ = (p2, .., pn): the portion of s through which bottom-up construction actually took place (i.e., the completion path).[3] This situation becomes problematic when the left-recursive p1 had features which are not in the expectation originally, nor in the rules used in the bottom-up completion path. Suppose the sentence "John sleeps." is parsed using the rules R0 (the start production) through R4 shown below.

[2] This case also includes indirect left-recursion.
[3] Since this path is also legitimate, parsing will generate the result for this path as another derivation, as well as one more derivation which is the result of the complete prediction path restoration for s = (p1, p2, .., pn). The existence of corresponding minimal and nonminimal derivations is further discussed and proved in the next chapter, section 5.2.

[Figure 4.3: Nonminimal derivation for Case 1, showing items E: [S → NP • VP], I1: [VP → • VP SLOW-ADV], I2: [VP → • V], I2′: [VP → V •], I1′: [VP → VP • SLOW-ADV], and the prematurely completed E′: [S → NP VP •]]

    R0 = ⟨2, { ⟨cat⟩ ≐ S, ⟨1 cat⟩ ≐ NP, ⟨2 cat⟩ ≐ VP, ⟨head⟩ ≐ ⟨2 head⟩,
               ⟨head subj⟩ ≐ ⟨1 head⟩, ⟨head subj agr⟩ ≐ ⟨2 head agr⟩ }⟩
    R1 = ⟨2, { ⟨cat⟩ ≐ VP, ⟨1 cat⟩ ≐ VP, ⟨2 cat⟩ ≐ SLOW-ADV, ⟨head⟩ ≐ ⟨1 head⟩,
               ⟨head speed⟩ ≐ slow }⟩
    R2 = ⟨1, { ⟨cat⟩ ≐ VP, ⟨1 cat⟩ ≐ V, ⟨head⟩ ≐ ⟨1 head⟩, ⟨head type⟩ ≐ intrans }⟩
    R3 = ⟨"John",   { ⟨cat⟩ ≐ NP, ⟨word⟩ ≐ "John" }⟩
    R4 = ⟨"sleeps", { ⟨cat⟩ ≐ V, ⟨word⟩ ≐ "sleeps" }⟩

Notice that R1 is left-recursive, and R2 is non-recursive. After the word "John" is processed as an NP and unified into R0 using R3, the next expectation is VP. The item for this expectation is shown in the top-left corner of Figure 4.3 (item E).[4] Suppose two Predictions are applied successively from the expectation using the rule sequence s = (R1, R2)[5] (indicated by the arrows marked (1) Prediction and (2) Prediction in Figure 4.3), and items I1 and I2 are created. In I1, the feature ⟨head speed⟩ ≐ slow was added to the (copy of the) VP expectation by the first Prediction using R1,[6] and was propagated down to I2 by the second Prediction through head arcs. As a result, the LHS VP and the left-corner V in I2 have this feature. Then, after "sleeps" is added by the next Scanning operation using R4 ((3) Scanning), the resulting item I2′ still has the ⟨head speed⟩ ≐ slow feature under VP and V, since it caused no conflict. Now that I2′ is complete (since the arity of R2 is 1), the next operation should be a Completion, to fill its expectation: the left-corner VP in I1 (creating I1′, by (4) Completion). However, if this Completion is skipped and Completion is instead applied prematurely to the original expectation with I2′ (indicated by the thick dashed line marked Completion*, creating E′), the filled VP expectation has the feature ⟨head speed⟩ ≐ slow, which is propagated up from the VP and V in I2′ through head arcs. Since this extra feature is not in any of the licensing productions of E′ (R0, R2, R3 and R4), the resulting item model is a nonminimal derivation.

[4] In Figures 4.3 and 4.5, items are represented in a simplified form, using the context-free backbone portion of the item model with a dot to indicate the arity (as in chart parsing).
[5] If parsing performs lookahead, R1 will not be selected because SLOW-ADV can never be filled (end of the input sentence). Without lookahead, however, parsing can validly select this rule.
[6] Recall our assumption that the function ρ in Prediction passes down all features in the expectation.

[Figure 4.4: Minimal and nonminimal derivation models for Case 1]

Figure 4.4 shows the dag models for the minimal and the nonminimal derivation. The nonminimal model on the right is produced by the unmatched prediction-completion paths discussed above. The minimal derivation on the left is produced in the minimal case, where the non-left-recursive R2 was applied directly to the expectation and that path (consisting of R2 only) was exactly restored. As you can see, the nonminimal model has the additional feature ⟨head speed⟩ ≐ slow.

Case 2 (mixed-up completion): This case happens when multiple prediction paths, spawned from a single expectation, are mixed together while they are restored bottom-up by Completion operations. Suppose an expectation spawned two prediction paths, generating items I1, I2, .., In and J1, J2, .., Jm using rule sequences s_i = (p1, p2, .., pn) and s_j = (p1′, p2′, .., pm′) respectively. While path I is being completed bottom-up, some intermediate result may be used in the bottom-up completions of path J. When this happens, (irrelevant) features from path I are "mixed in" with the (relevant) features of path J in the resulting item. Suppose the sentence "John sleeps silently" is parsed using rules R0 through R4 from the previous Case 1 example, together with R5 and R6 shown below.[7]

[7] Although R1 and R5 are left-recursive, this spurious situation can also occur with non-recursive rules.

[Figure 4.5: Nonminimal derivation for Case 2, showing items E: [S → NP • VP], I1: [VP → • VP SLOW-ADV], I2: [VP → • V], J1: [VP → • VP SILENT-ADV], J2: [VP → • V], I2′: [VP → V •], and the prematurely completed J1′: [VP → VP • SILENT-ADV]]

8 > > > > < R5 = h2; 5 = > > > > :

62

CHAPTER 4. NONMINIMAL DERIVATIONS

cases, spurious interaction may happen with many combinations at several levels during bottom-up construction.

Case 3 (mixed-up expectation): This case happens when there are multiple expectations, each of which spawns its own prediction path(s), one expectation may be lled by a completion of another expectation (a totally di erent context). When this happens, features in the two expectations along with features from the wrong prediction path are mixed together in the resulting item. Thus, this is the case when multiple expectations interact. This case can be considered as a subcase of Case 2 in that multiple expectations at the input position i were spawned from the same original expectation further up in the prediction path. We listed this case separately here because of the correspondence to the nonminimal cases produced in LC algorithm which we describe in the next section.

4.2.4 Nonminimal Derivations in LC Algorithm In LC algorithm, nonminimal derivations are produced by the same principle as in Shieber's, by mismatched prediction-completion paths. There are three spurious situations, each of which corresponds to Case 1 through 3 in Shieber's algorithm. In LC, there are two bottomup operations (Continuation and Completion), and the spurious cases occur in conjunction with either or both operations.

Case 1 (skipped left-recursion): This case is essentially the same as Shieber's Case 1:

when Completion is applied prematurely to ll the expectation by skipping left-recursive rules. In LC algorithm, Shieber's n successive Prediction operations (which form a prediction path) are precompiled and compressed into a reachability net entry. This prediction path of length n is expanded and restored bottom-up by n?1 Continuation operations followed by one Completion to nish the completion path. The spurious situation occurs when one or more expansions are skipped (by either Continuation or Completion); this may happen if the prediction path had left-recursive rules. The resulting item may become nonminimal by the same reason as Shieber's Case 1: features from the skipped rules may be left in the resulting item.

Case 2 (mixed-up continuation): This case is almost identical to Shieber's Case 2:

one prediction path gets mixed together with another path while it is restored bottom-up. In LC, prediction paths are restored by Continuation operation (except for the last Completion). Therefore, this case happens when a path compressed in a reachability net entry is expanded by Continuation by using some other rules that were not in that path. The resulting item may become nonminimal by the same reason as Shieber's Case 2: features from the original path, which are irrelevant to the rules used in bottom-up Continuations, may be left in the resulting item.

Case 3 (mixed-up expectation): This case is almost identical to Shieber's Case 3: when

there are multiple expectations for the same input word, one expectation may be lled by a complete restoration of a path spawned from another expectation. The resulting item may become nonminimal because features in the complete item of the wrong expectation are

4.3. SOURCES OF THE PROBLEM

63

mixed together in the relevant ones (from the original expectation) in the resulting item. Recall Shieber's Case 3 could be considered a subcase of Case 2. That was because in Shieber's algorithm, when an expectation spawns n successive Predictions, each new item is created by separating the expectation one level above and unifying it with some production. So, the prediction items do not keep the information of the original expectation which initiated the path. On the other hand, in LC, original expectation is kept at the root (above lc arc) in all items as they are created by expanding the prediction path. So, LC may create two items whose models M; M 0 are distinct (i.e., M 6= M 0 ) but have identical submodel under lc arc (i.e., M=hlci = M 0 =hlci). Since LC's Continuation expands the prediction path by unifying the old item with a reachability net entry at their root nodes (by M [hlci ) hlc 1i]tRM ), nonminimal Case 2 and Case 3 described just above occur as distinct phenomena; Case 2 involves a single original expectation which spawns multiple prediction paths (thus involving multiple reachability net entries obtained by compressing the paths that have the same rst rule), whereas Case 3 involves multiple original expectations (thus involving multiple reachability net entries obtained by compressing the paths whose rst rules are di erent).

4.3 Sources of the Problem In the preceding sections, we described what nonminimal derivations are and where/how they occur in Shieber's and LC algorithms. But why do they occur to begin with? We suggest the conjunction of three factors as the source: mixed-direction parsing using itembased parsing with uni cation grammars with certain properties. First, nonminimal derivations occur because of the mismatch between the top-down prediction and the bottom-up completion paths, which is facilitated by the mixed-direction parsing strategy. Shieber's algorithm builds constituents in two steps: rst by introducing a new item with top-down prediction features, and later by completing it with bottom-up features. If the parsing is done in one direction only, whether top-down or bottom-up, no mismatch will occur. Second, nonminimal derivations occur because inappropriate items are used to restore the prediction path. That is because item-based parsing records partial results in one global structure (such as chart in chart-parsing), thereby allowing the reuse of those partial phrases (i.e., memoization). Thus all items are accessible any time during parsing. If the accessibility of items are somehow restricted, spurious interactions between items will not happen. Third, nonminimal derivations occur because some \unsafe" features in the rules are propagated during prediction. If the grammar contains no such features or the algorithm blocks them out in the predictions (using ), even the unmatched prediction-completion paths will not produce nonminimal derivations. As a note, the rst two factors are adopted in Shieber's algorithm (and LC) for eciency purposes: top-down prediction to constrain and guide the bottom-up constructions which lead to unsuccessful parses, and memoization to avoid redundant computation. Inasmuch as they bring the intended eciency to the algorithms, they also seem to bring diculties, speci cally incorrectness, in uni cation-based parsing.

CHAPTER 4. NONMINIMAL DERIVATIONS

64

4.4 Possible Solutions To prevent nonminimal derivations, there seems to be basically two approaches: to lter them out at the end of parsing, or to modify the algorithm so that nonminimal derivations will not be produced. As for the rst approach, a system can inspect the nal parses and discard nonminimal derivations from nal answers. Or a system can be implemented such that only the context-free backbone features are used during parsing, and the nal parses are constructed by adding the rest of the features after parsing terminates. But neither of these schema addresses the real issues of the nonminimal derivation problem. In what follows, we propose four possible solutions to the nonminimal derivation problem. All of them require some level of specialization of the abstraction in Shieber's and LC algorithms. Also, those solutions have diculties of their own, and they are not quite complete in that they trade o some of the current nice properties for minimality. Note that some of these proposed solutions are in fact known techniques which have been applied to solve other problems in uni cation grammars. This implies that nonminimal derivation is an instance of some inherent problems in uni cation-style grammars. We will return to this issue later in section 4.7.

(Solution 1) Eliminating Redundancy: Nonminimal derivations can be prevented by using subsumption to avoid generating redundant items, that is, the algorithm generates an item hi; j; p; M; di if there is no other item hi; j; p; M 0 ; di such that M 0  M .8 The key observation is that nonminimal derivations carry (strictly) more features than the minimal derivations (thus more speci c), and also they are created in addition to the minimal ones.9 Therefore, by disallowing the generation of items that are more speci c than any other ones that have been already produced, only the minimal derivations will be created. For example, given a minimal and a nonminimal derivation shown in the previous Figure 4.4 (p. 60), the nonminimal model on the right has an additional feature hhead speedi = slow, therefore it can be eliminated by subsumption check. Redundancy elimination using subsumption is a well-known technique adopted in many uni cation-based systems (e.g. (Alshawi, 1992), (van Noord, 1997)). This technique was originally developed to improve the eciency of uni cation-based algorithms.10 However, this subsumption technique potentially blocks some minimal derivations from being entered in the chart for certain grammars. Consider a case when the grammar contains two rules where one is strictly more speci c than the other, such as the rules R7 and R8 for the word \drank" below. 9 8 :V h cat i = > > > > = < hvformi =: past R7 = h\drank"; 7 = > hhead semi =: ingest >i : hwordi =: \drank" ; 8 Notice the more speci c item can be di erent from the general one only by its model; other parameters

i, j, p and d must be identical. Note this modification to the original algorithm preserves both soundness and completeness (Shieber, 1992, p. 78).
9 These two points will be formally proved in the next Chapter 5.
10 Some unification-based systems use a packed representation for items that are in a subsumption relation (e.g. the Core Language Engine (Alshawi, 1992), which we presented in Chapter 2). This technique, however, does not discard nonminimal items and therefore does not solve the nonminimal derivation problem.


    R8 = ⟨"drank", Φ8 = { ⟨cat⟩ ≐ V,
                          ⟨vform⟩ ≐ past,
                          ⟨head sem⟩ ≐ ingest,
                          ⟨head sem object⟩ ≐ alcohol-beverage,
                          ⟨word⟩ ≐ "drank" }⟩

Rule R7 is less specific than R8 since it does not specify any features associated with the ingest action, whereas R8 specifies the object of the ingest action to be an alcoholic beverage, by ⟨head sem object⟩ ≐ alcohol-beverage, as a particular meaning of "drank". Now suppose the sentence "John drank" is parsed using these rules and R2 (VP → V) shown previously (on p. 58). When each of R7 and R8 is used to fill the V in R2, two items I = ⟨1, 2, R2, M, 1⟩ and I′ = ⟨1, 2, R2, M′, 1⟩ are created respectively. The models in these items are in a subsumption relation M ⊑ M′, because M′ has the additional feature ⟨head sem object⟩ ≐ alcohol-beverage under the first RHS constituent V, although both models are licensed by the same production R2 and all other parameters in the items have the same values. Therefore, if parsing produces the general item I before the specific item I′, then I′ will not be entered in the chart although it is a minimal derivation. Thus, this subsumption scheme may compromise the completeness of the algorithm. Moreover, to incorporate this modification formally, we would have to extend our current logic, or use another kind of logic, to express the control aspects of the parsing algorithm (e.g., proving the absence of more general items at a given time during parsing).
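As a rough illustration of Solution 1 (and not part of Shieber's or the LC formalism), the following Python sketch implements the subsumption check on the assumption that models are plain nested dictionaries without re-entrancy; the names Item, subsumes and add_item are illustrative inventions.

```python
# A minimal sketch of Solution 1 (redundancy elimination by subsumption),
# assuming feature structures are plain nested dicts without re-entrancy.
# All names (Item, subsumes, add_item) are illustrative, not from the thesis.

from dataclasses import dataclass

@dataclass
class Item:
    i: int          # start position
    j: int          # end position
    p: str          # licensing production (identifier)
    d: int          # dot position
    M: dict         # model (feature structure), compared structurally

def subsumes(general, specific):
    """True if `general` carries no information absent from `specific`."""
    if not isinstance(general, dict):
        return general == specific
    if not isinstance(specific, dict):
        return False
    return all(f in specific and subsumes(v, specific[f])
               for f, v in general.items())

def add_item(chart, new):
    """Add `new` unless an equal-or-more-general item already exists.

    Mirrors the condition of Solution 1: block <i,j,p,M,d> when some
    <i,j,p,M',d> with M' subsuming M is already present.  Note the caveat
    discussed above: with rules like R7/R8 this can also block a minimal
    derivation that happens to be more specific.
    """
    for old in chart:
        if (old.i, old.j, old.p, old.d) == (new.i, new.j, new.p, new.d) \
                and subsumes(old.M, new.M):
            return False            # new item is redundant (or blocked)
    chart.append(new)
    return True
```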

(Solution 2) Filtering Function Φ: Another way is to filter out the problematic "unsafe" features propagated downward in top-down prediction. This can be done by appropriately specifying the filtering function Φ applied in Shieber's Prediction (Φ(M/⟨d+1⟩)) and in LC's Scanning 2 and RN2 rules (Φ(M/⟨lc d+1⟩) and Φ(M[⟨lc 1⟩ ⇒ ⟨lc⟩]) respectively). This function Φ was explained previously in Section 3.3.4. If those features are pruned by Φ, the resulting model becomes an exact duplicate of the minimal derivation, and will be discarded by the chart-parsing scheme. However, analyzing which features are unsafe does not seem trivial, and it is not clear whether a complete solution exists at all.11 Also, filtering top-down features in effect weakens the predictive power of the algorithm(s). Therefore, this solution trades off computational efficiency for soundness.
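As a simple illustration of what such a restrictor-style filter might look like (the safe-feature set and function name are invented for this sketch, not taken from the thesis):

```python
# A sketch of Solution 2: a restrictor-style filtering function that prunes
# "unsafe" features from a model before it is propagated in Prediction.
# SAFE_FEATURES and filter_model are illustrative assumptions.

SAFE_FEATURES = {"cat", "vform", "agr"}   # hypothetical set of "safe" features

def filter_model(model, safe=SAFE_FEATURES):
    """Return a copy of `model` keeping only features considered safe.

    Applied to M/<d+1> before it seeds a predicted item, this makes a
    would-be nonminimal prediction collapse onto the minimal one, which
    the chart then discards as a duplicate.
    """
    if not isinstance(model, dict):
        return model
    return {f: filter_model(v, safe) for f, v in model.items() if f in safe}
```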

(Solution 3) Data Structure: Nonminimal derivations occur because of mismatched prediction-completion rule/item paths. To ensure that the prediction path is exactly restored, we can adopt a data structure, possibly based on some kind of stack similar to (Tomita, 1986, 1991). As an item is produced by the Prediction operation, it is pushed onto this stack. If an expectation spawns more than one prediction path, they are recorded in separate paths in the stack, so the stack takes on a tree structure.12 Then, Completion will be

11 For Head-corner parsing, (Sikkel, 1997y) proposes the use of transitive features: features that percolate only through head arcs. However, this scheme does not seem to solve the problem of nonminimal derivations completely.
12 We may also make this scheme more efficient by compressing multiple paths using subsumption, as described in Solution 1 above. In this case, the stack becomes a graph-structured stack, similar to (Tomita, 1986, 1991).


INITIAL ITEM:

    ⟨id, nil, ⟨0, 0, p₀, mm(Φ₀), 0⟩⟩
        where id is a new symbol

PREDICTION:

    ⟨id, pid, ⟨i, j, p = ⟨a, Φ⟩, M, d⟩⟩
    ─────────────────────────────────────────────
    ⟨id′, id, ⟨j, j, p′, Φ(M/⟨d+1⟩) ⊔ mm(Φ′), 0⟩⟩
        where id′ is a new symbol, and d < a and p′ = ⟨a′, Φ′⟩ ∈ P

SCANNING:

    ⟨id, pid, ⟨i, j, p = ⟨a, Φ⟩, M, d⟩⟩
    ─────────────────────────────────────────────
    ⟨id, pid, ⟨i, j+1, p, M ⊔ mm(Φ′)\⟨d+1⟩, d+1⟩⟩
        where d < a and ⟨w_{j+1}, Φ′⟩ ∈ P

COMPLETION:

    ⟨id, pid, ⟨i, j, p = ⟨a, Φ⟩, M, d⟩⟩    ⟨id″, id, ⟨j, k, p′ = ⟨a′, Φ′⟩, M′, a′⟩⟩
    ─────────────────────────────────────────────
    ⟨id, pid, ⟨i, k, p, M ⊔ (M′\⟨d+1⟩), d+1⟩⟩
        where d < a

Figure 4.6: Modified Shieber's parsing operations

done only through the items in the same stack path. Not only can this data structure limit the accessibility of items as we hoped, it can also enforce the ordering of the completion path, namely the exact reverse of the top-down prediction path. However, coping with left-recursion, particularly indirect left-recursion, may become a major difficulty.

(Solution 4) Parent Pointers: The stack solution above can also be realized by directly encoding parent pointers in the items. We associate every item produced by the algorithm with a parent pointer and explicitly encode these pointers in the algorithm's operations. Then, to ensure the exact restoration of the prediction path, we restrict Completion to take place through this pointer. Figure 4.6 shows the modified Shieber's algorithm. In the figure, an item is represented by a nested 3-tuple where the first argument is the self label, the second is the parent label/pointer, and the third is the old 5-tuple used in the original algorithm. As can be seen, a parent pointer is set by the Prediction operation: the resulting item has the label of the antecedent item (id) as its parent. By generating a new symbol for the self label of every Prediction item (id′), the parent pointers in those items are threaded and chained to form a prediction path. Then in Completion, the parent pointer is used to select the antecedent items: the complete item (on the right) must have the prior expectation (on the left) as its parent (id). With this restriction, a prediction path will be precisely restored by bottom-up completions.
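As a rough sketch of the bookkeeping that Figure 4.6 formalizes (the data layout, field names and helper functions below are assumptions made for illustration, and the unification details are omitted):

```python
# A sketch of Solution 4 (Figure 4.6): items carry a self label and a parent
# pointer; Prediction threads them, and Completion is licensed only along
# that pointer.  All names are illustrative; unification is not shown.

import itertools

_gensym = itertools.count()

def new_id():
    return f"id{next(_gensym)}"

def predict(expecting, production, filtered_model):
    """Create a predicted item whose parent is the expecting item's label."""
    return {"id": new_id(),
            "parent": expecting["id"],      # thread of the prediction path
            "span": (expecting["span"][1], expecting["span"][1]),
            "p": production,
            "M": filtered_model,
            "dot": 0}

def can_complete(expecting, complete):
    """Completion may fire only through the parent pointer."""
    return complete["parent"] == expecting["id"]
```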


While this approach solves the nonminimal derivation problem at the level of the logic, the parent-pointer scheme has some undesirable implications for the implementation of the algorithm. First, the logical specification now further specifies how the algorithm should be implemented; the algorithm therefore loses some of its abstraction and generality. Second, this scheme makes memoization no longer possible. Earlier we said that the global accessibility of items is one of the sources of nonminimal derivations. However, memoization is in fact very useful, in that it allows left-recursive items/rules to be reused as many times as needed during bottom-up completion when those rules were actually used only once in top-down prediction (or even when they were not used at all). But by forcing the completion path to reverse the prediction path exactly, the logic will no longer support memoization. This affects the efficiency of the algorithm, and also introduces nontermination problems if the algorithm is not implemented carefully, or unless the form of phrasal productions is restricted in certain ways.13 However, it is important to note here that Shieber's formalism does not include an operational semantics, and therefore does not specify or presume that memoization be used in an implementation. Also, logically it makes no difference whether all possible paths and their items are generated separately in their entirety beforehand, or whether memoized items involved in left-recursion are reused to complete the paths as needed; the logical specification only requires that appropriate prior items exist and that the operations select them appropriately.14

In the case of LC, in addition to parent pointers we need one more piece of information: the prediction path compressed in a reachability net entry, to ensure the path is correctly expanded (by Continuation). Figure 4.7 shows the modified LC algorithm, which uses the same parent-pointer scheme. Just like the modified Shieber's operations, each item is now represented by a nested 3-tuple whose first argument is the self label and whose second is the parent label/pointer; these are used to prevent wrong completions (in the Completion operation), as explained for the modified Shieber's operations. The third argument of an item structure is basically the same as the old item structure, but with an additional parameter (the fourth one) which records the prediction path compressed in the base reachability net entry of the item model. This information is utilized in two operations: Continuation and Completion. In Continuation, correct path expansion of a model based on a reachability net entry sₙ is forced by selecting a reachability net entry based on sₙ₋₁, the exact same path one step shorter. In Completion, the complete item (the second antecedent) is required to be a model based on a reachability net entry whose compressed prediction path is of length 1, implying the path has already been expanded by previous Continuations.

13 For example, in context-free parsing, one way to prevent infinite looping on left-recursive rules in top-

down (non-chart) parsers is to stop expanding the parse tree once the number of leaves in the tree exceeds the number of words in the sentence. This technique only works, however, if the grammar does not contain empty symbols.
14 Besides, memoization does not work in cases where repeated left-recursions produce different results at every iteration: items involved in such paths must be generated top-down in their entirety before completions begin, in order to ensure correct results.


INITIAL ITEM:

    ⟨id, nil, ⟨0, 0, p₀, s = ⟨p₀⟩, RM, 0⟩⟩
        where ⟨(p₀), p₀, RM⟩ is a reachability net entry based on the start production

SCANNING 1:

    ⟨id, pid, ⟨i, j, pₙ = ⟨aₙ, Φₙ⟩, s = ⟨p₁, .., pₙ⟩, M, d⟩⟩
    ─────────────────────────────────────────────
    ⟨id, pid, ⟨i, j+1, pₙ, s, M ⊔ (mm(Φ′)\⟨lc d+1⟩), d+1⟩⟩
        where d < aₙ and ⟨w_{j+1}, Φ′⟩ ∈ P

SCANNING 2:

    ⟨id, pid, ⟨i, j, pₙ = ⟨aₙ, Φₙ⟩, s = ⟨p₁, .., pₙ⟩, M, d⟩⟩    ⟨s′ = ⟨p′₁, .., pₘ⟩, pₘ = ⟨aₘ, Φₘ⟩, RM⟩
    ─────────────────────────────────────────────
    ⟨id′, id, ⟨j, j+1, pₘ, s′, Φ(M/⟨lc d+1⟩) ⊔ mm(Φₘ)\⟨lc 1⟩ ⊔ RM, 1⟩⟩
        where id′ is obtained by gensym, and d < aₙ and ⟨w_{j+1}, Φ′⟩ ∈ P

COMPLETION:

    ⟨id, pid, ⟨i, j, pₙ = ⟨aₙ, Φₙ⟩, s = ⟨p₁, .., pₙ⟩, M, d⟩⟩    ⟨id″, id, ⟨j, k, pₘ = ⟨aₘ, Φₘ⟩, s′ = ⟨pₘ⟩, M′, aₘ⟩⟩
    ─────────────────────────────────────────────
    ⟨id, pid, ⟨i, k, pₙ, s, M ⊔ (M′/⟨lc⟩ ⊔ M′{dom(M′) − lc})\⟨lc d+1⟩, d+1⟩⟩
        where d < aₙ

CONTINUATION:

    ⟨id, pid, ⟨i, j, pₙ = ⟨aₙ, Φₙ⟩, sₙ = ⟨p₁, .., pₙ₋₁, pₙ⟩, M, aₙ⟩⟩    ⟨sₙ₋₁ = ⟨p₁, .., pₙ₋₁⟩, pₙ₋₁, RM⟩
    ─────────────────────────────────────────────
    ⟨id, pid, ⟨i, j, pₙ₋₁, sₙ₋₁, M[⟨lc⟩ ⇒ ⟨lc 1⟩] ⊔ RM, 1⟩⟩

Figure 4.7: Modified LC parsing operations


[Figure: nested triangles representing constituents, labeled Π₂, Π₁, Π₀]

Figure 4.8: Structure of Shieber's Parse Tree

4.5 Nonminimal Derivation and Parse Tree

In the next two sections, we discuss nonminimal derivation as a grammatical specification problem. As we mentioned in the beginning of this chapter, Shieber (1992) proved the correctness of his algorithm despite the nonminimal derivations. As it turns out, his proof relies on his definition of parse tree as the defining evidence of the language of a grammar. In this section, we show how his definition fails to exclude nonminimal derivations, and compare it with other unification-based formalisms. Our discussion focuses on the difficulty of specifying minimal features for parse trees, which seems inherent in unification grammars. Then in the next section 4.6, we propose modified definitions of a minimal parse tree for unification grammars.

4.5.1 Shieber's Parse Trees

In order to define the language of a grammar for his logic formalism for unification grammars, Shieber takes the notion of parse tree in context-free grammar and extends it to a unification-based setting. Formally, Shieber defines a parse tree ν for a given grammar G as (Shieber, 1992, p. 54):

A parse tree ν is a model that is a member of the infinite union of sets of bounded-depth parse trees Π = ⋃_{i≥0} Πᵢ, where each Πᵢ is defined as:
1. Π₀ is the set of models ν for which there is a lexical production p = ⟨w, Φ⟩ ∈ G such that ν ⊨ Φ.
2. Πᵢ (i > 0) is the set of models ν for which there is a phrasal production p = ⟨a, Φ⟩ ∈ G such that ν ⊨ Φ and dom(ν) = {1, 2, .., a} and, for all 1 ≤ k ≤ a, ν/⟨k⟩ is defined and ν/⟨k⟩ ∈ ⋃_{j<i} Πⱼ.

In this definition, Π₀ contains the terminal models licensed by lexical productions, and each Πᵢ (i > 0) contains models which satisfy a phrasal production and whose subconstituents are all filled and are in turn members of some Πⱼ (0 ≤ j < i). Because of this recursive structure, every constituent in a parse tree has the same property: it satisfies


its licensing production (ν ⊨ Φ). Then, by the requirement that all subconstituents are filled, a parse tree has the property that all "leaf" constituents are terminal (licensed by lexical productions) and all "internal" constituents are nonterminal (licensed by phrasal productions). Figure 4.8 shows the general structure of Shieber's parse tree. Each triangle enclosed by solid lines represents a constituent (with dashed lines indicating its subconstituents), and each is annotated with the Πᵢ it belongs to. The index i increases from 0 (for terminal constituents) as higher-level constituents are created bottom-up (to i = 1, 2 in the figure). Shieber then defines the language of a grammar to be the set of yields of all parse trees that are licensed by the start production. Shieber defines a valid parse tree ν for a sentence w₁…wₙ as follows:
1. The yield of ν is w₁…wₙ;
2. ν is licensed by the start production (p₀);
3. ν ∈ Π.

But, according to the above definition, a parse tree could be decorated with additional features in infinitely many ways. That is because a parse tree is defined by the licensing relation between a model ν and a production formula Φ (i.e., ν ⊨ Φ). The problem is that this relation imposes only a partial, minimal condition that has to be obeyed, namely that ν must satisfy Φ; it does not restrict any other features the model may carry beyond those in Φ. Models produced by nonminimal derivations are exactly such models. Consequently, his definition of parse tree is too weak to exclude nonminimal derivations.15
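The point can be made concrete with a toy satisfaction check. In the sketch below (which uses an invented, simplified encoding: models as nested dictionaries, a formula as a list of path-value constraints, and made-up helper names), a model that carries the spurious ⟨head speed⟩ ≐ slow feature satisfies the same formula as the minimal model, illustrating why licensing alone cannot exclude nonminimal trees.

```python
# A toy illustration of why licensing (the |= relation) fails to exclude
# nonminimal models: satisfaction checks that required features are present,
# never that no other features are.  The representation and names here are
# simplifying assumptions, not the thesis's actual encoding.

def get_path(model, path):
    for f in path:
        if not isinstance(model, dict) or f not in model:
            return None
        model = model[f]
    return model

def satisfies(model, formula):
    """`formula` is a list of (path, atomic value) constraints."""
    return all(get_path(model, path) == value for path, value in formula)

phi_vp = [(("cat",), "VP")]                       # a (toy) VP production formula

minimal    = {"cat": "VP", "head": {"sem": "ingest"}}
nonminimal = {"cat": "VP", "head": {"sem": "ingest", "speed": "slow"}}

assert satisfies(minimal, phi_vp)
assert satisfies(nonminimal, phi_vp)              # the extra feature slips through
```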

4.5.2 Parse Trees and Admissibility

Shieber's weak definition of parse tree, however, is not uncommon; rather, it agrees with the same or equivalent notions in other unification-based grammar formalisms. In unification grammars, the notion of parse tree has been adopted from context-free grammar as a structural representation of a sentence admitted by a given grammar. As in the case of context-free grammar, parse trees in unification grammars are most often based on licensing or a similar notion of tree admissibility (Gazdar, et al., 1985). These relations essentially map rules in the grammar to structural descriptions (i.e., trees or local subtrees). Intuitively, a tree must have at least the information specified by the rule (i.e., the licensing relation), plus it must obey any constraints specific to the base grammatical formalism. Among the various unification grammars, in theory-neutral general formalisms such as Shieber's, or in those designed as linguistic tools such as FUG, the notion of licensing is specified by the ⊨ relation between a tree and the formula of a rule, or equivalently by subsumption ⊑ between feature-structure descriptions. In contrast, in formalisms which are intended as linguistic theories, such as GPSG, LFG, and HPSG, the notion of licensing or node admissibility includes the satisfaction of theory-specific constraints, such as linear precedence or feature

15 Note Shieber (1992) does hint at the possibility of a nonminimal parse tree in his comments on an illustrated

example parse tree: "The parse tree given in Figure 3.1 is the minimal tree whose yield is the sample sentence" (Shieber, 1992, p. 57). But he does not discuss the issue any further, except at the beginning of the correctness proof, where he states that the proof is carried out within the range of minimal models.


default principle. But even in those linguistic theories, the key relation between a rule and a tree, in terms of the amount of information, is ultimately defined in the same way, by ⊨ or ⊑. For example, GPSG equates licensing with projection of a rule to a tree,16 which requires every node/category in the tree to be an extension of the corresponding category in the rule. The notion of extension is (informally) defined as (Gazdar, et al., 1985, p. 27):

A category A is an extension of a category B (B ⊑ A) if and only if
1. the atom-valued feature specifications in B are all in A, and
2. for any category-valued feature f, the value of f in A is an extension of the value of f in B.

Thus, most unification grammar formalisms developed to date share Shieber's difficulty in specifying a minimal parse tree. Why, then, is a parse tree, or specifically the licensing relation, defined so weakly, by

⊨ or ⊑, in unification grammars? The answer lies in the powerful expressiveness of this style of grammar. In unification grammars, constituents are represented by complex structures (i.e., feature structures). These can represent recursive structures, and often form graphs. When constituents are combined by unification, information is passed from anywhere to anywhere via shared structures in the models (i.e., graph re-entrancies), thereby making the effect of unification global. Because of this, when we try to describe the properties of a submodel (including the root) in a general way, it becomes difficult to specify precisely where each feature came from without referring to the whole model enclosing it from the root.17 In the case of a parse tree, every internal constituent carries two sets of features: those in the licensing production of the constituent itself, and possibly some additional ones inserted from other constituents elsewhere in the enclosing tree or propagated upward from subconstituents (e.g. lexical features). Since the features in the additional portion cannot be easily specified, the properties of a submodel in a parse tree are given by a partial description, namely the features of the licensing production, which in effect specifies only a necessary condition for the submodel.

16 Projection is actually defined as a function in GPSG. The simplest version of the definition, given in (Gazdar, et al., 1985, p. 78), is as follows: Given an immediate dominance (ID) rule C₀ → C₁, .., Cₙ and a tree whose root category is C′₀ and whose children are C′₁, .., C′ₙ, we say that π is a projection function if and only if π is a one-one, onto function whose domain is {C₀, C₁, .., Cₙ}, whose range is {C′₀, C′₁, .., C′ₙ}, and which meets the following conditions: (i) π(C₀) = C′₀, and (ii) for all i, 0 ≤ i ≤ n, π(Cᵢ) is a legal extension of Cᵢ. Whenever r is a rule and π is a projection function, we use π(r) to denote the local tree which is determined (or admitted) by π. Then, a local tree π(r) is (locally) admissible from r when π is an admissible projection of r. Note the above definition is an initial version in (Gazdar, et al., 1985); it becomes more elaborate as the details of GPSG theory are introduced. The definition of extension is given shortly in the document body.
17 In (Wilks, 1989), several researchers in computational linguistics (including F. Pereira, A. Joshi, G. Gazdar, S. Pulman, M. Kay and M. Marcus) discuss this issue. The particular point of discussion was how to reduce computational complexity and processing overhead by constraining the flow of information, that is, by bringing some form of locality into the unification system.


4.5.3 Other Representations and Logics

Before we close the discussion on Shieber's parse tree, we must note that his weak definition is not a limitation of the representational power of the underlying logic formalism. There have been other approaches which formalize unification-style grammars using more general representations (excluding those with enhanced descriptiveness for disjunction (Kasper and Rounds, 1986), negation (Dawar and Vijay-Shanker, 1990), etc., since those enhancements are not relevant to the nonminimal derivation problem). For example, (Carpenter, 1992), (Wintner, 1997) and (Sikkel, 1997) use multi-rooted feature structures instead of the single-rooted representation of Shieber's logic. This way, both the LHS and RHS of a production A → X₁…X_a can be represented by a list of categories. Carpenter (1992) also uses a fully-instantiated parse tree to define the notion of (standard) derivation, so that all features that will necessarily exist are accounted for in the definition. However, although the increased expressiveness contributes to the generality of the formalism, it does not make it any easier to specify the minimality of a parse tree.

Lastly, we discuss the possibility of using a different logic that can express more constraints on the constituents (or, more generally, on linguistic objects). In recent years, many feature-structure formalisms use a typed system, including HPSG (Pollard and Sag, 1994), the typed logic in (Carpenter, 1992), and the Type Description Language (TDL) developed at DFKI (King, 1994; Gerdemann and King, 1994). What typing brings is the power to put restrictions on feature structures, in particular to specify feature cooccurrences as in GPSG: which features can coexist in objects of a type. GPSG expresses feature cooccurrence restrictions (FCRs) using first-order terms. For example, the FCR [VFORM] ⊃ [+V, −N] represents the constraint that the feature VFORM can coexist in an object with the features +V (verb) and −N (not noun), but not with others, for instance their complements −V or +N; that is, the verb-form feature should not appear in a nominal category. These kinds of constraints are formalized and specified by the notion of appropriateness in the typed feature-structure formalisms mentioned above. Intuitively, appropriateness restricts a feature to occur only in objects of certain types. For instance, the AGR (agreement) feature is appropriate for an object of type sign (as in HPSG), or for its subtypes if the types are organized into an inheritance hierarchy. Carpenter (1992) formally defines the appropriateness conditions by a partial function Approp : Feat × Type → Type, where ⟨Type, ⊑⟩ is a partial order. Basically, Approp(F, t), where F is a feature and t is a type, returns the most general type of value that the feature F can have in an object of type t. Appropriateness can also express negative constraints: if a feature is not appropriate for objects of a type, it can never be defined for that type. Carpenter then develops the notion of well-typing, in which every feature in the feature-structure system is appropriate and takes appropriate values, and the more restrictive notion of total well-typing, in which every appropriate feature must be present in all feature structures in the system.

However, appropriateness does not seem to solve the nonminimal derivation problem, particularly in the case of left-recursive rules. For example, in Shieber's Case 1 (on p. 59), the problematic feature ⟨head speed⟩ for VP (or something like a Prop-typed feature structure under a SEM feature in (Carpenter, 1992)), or rather the feature speed in the structure under the head feature, must be allowed to be appropriate, since the VP in the S rule can be


legally rewritten by the LHS VP of the VP → VP SLOW-ADV rule, and the speed feature will then necessarily be placed under the head feature in S's VP. Therefore, it seems that appropriateness cannot prevent nonminimal derivations when appropriate features are spuriously added to the typed feature structures.
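The following sketch makes the appropriateness idea concrete, using an invented type table and function names (the actual hierarchy and encoding in (Carpenter, 1992) differ). It also illustrates the point just made: because speed is appropriate for its type, a spuriously added speed value still counts as well-typed.

```python
# A sketch of Carpenter-style appropriateness: Approp(F, t) gives the most
# general value type a feature F may take on an object of type t (None =
# not appropriate).  The type table below is invented for illustration only.

APPROP = {
    ("AGR",   "sign"): "agr",
    ("SEM",   "sign"): "prop",
    ("SPEED", "prop"): "rate",   # speed is appropriate for prop-typed objects
}

def well_typed(obj, obj_type, approp=APPROP):
    """Check that every feature on `obj` is appropriate for `obj_type`.

    Because SPEED is appropriate for prop, a spuriously added speed value
    still passes this check -- matching the argument made in the text.
    """
    if not isinstance(obj, dict):
        return True
    for feat, value in obj.items():
        value_type = approp.get((feat, obj_type))
        if value_type is None:
            return False
        if not well_typed(value, value_type):
            return False
    return True
```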

4.6 Minimal Parse Trees for Unification Grammars

From the discussion in the previous sections, we can see that unification grammars have an inherent difficulty in specifying minimal parse trees by a static description of subconstituents. Generalized representation schemes do not seem to help in this regard either. In this section, we propose two modified definitions of minimal parse tree. Our approach is to trade off declarativeness and introduce procedures that ensure minimality: top-down and bottom-up derivations, as we are familiar with from context-free grammar.18 We define those notions as procedures which build a parse tree, and extend them to unification grammars. A minimal parse tree can then be obtained as the result of those minimal derivations. At the end of this section, we discuss in detail the differences between these definitions and their usefulness.

4.6.1 Bottom-up Definition

The bottom-up definition of minimal parse tree given here is quite simple, obtained by slightly modifying Shieber's definition. Recall from section 4.5.1 that Shieber's parse tree requires each subconstituent to be licensed by a production (and thus to belong to some Πᵢ; see p. 69). Taking this definition, we impose the restriction that a tree licensed by a production is minimal if it is the result of unifying RHS subconstituents that are themselves minimal. This restriction is procedural because the minimality of a tree can be determined only after the minimality of all of its subconstituents is known, as expressed by the if-then implication in the following definition.

Definition (Minimal Parse Tree (BU)): Given a grammar G, a minimal parse tree ν admitted by G is a model that is a member of the infinite union of sets of bounded-depth derivation models Π′ = ⋃_{i≥0} Π′ᵢ, where each Π′ᵢ is defined as:

1. For each lexical production p = ⟨w, Φ⟩ ∈ G, mm(Φ) ∈ Π′₀.

2. For each phrasal production p = ⟨a, Φ⟩ ∈ G, let ν₁, …, ν_a ∈ ⋃_{j<i} Π′ⱼ. If ν = mm(Φ) ⊔ ν₁\⟨1⟩ ⊔ … ⊔ ν_a\⟨a⟩, then ν ∈ Π′ᵢ.

Notice the difference from Shieber's definition: in particular, neither condition is defined by the ⊨ relation; instead, equality is used, and the operators which compose the constraint, mm, ⊔ and \, all preserve minimal models (Lemmas 36 and 37, Shieber, 1992, p. 44). By

18 Note that the procedures that generate minimal parse trees are not limited to top-down and bottom-up derivations.


combining the minimal trees in Π′ by those operations, the minimality of a parse tree can be guaranteed inductively. Notice that the set of trees Π′ is obviously a subset of Shieber's parse trees Π, that is, Π′ ⊆ Π. The difference of the two sets (Π − Π′), then, contains the nonminimal parse trees resulting from nonminimal derivations.
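As a rough procedural reading of the bottom-up definition (using simplified, dictionary-based models without re-entrancy; unify, embed and minimal_tree are invented stand-ins for ⊔, \⟨i⟩ and the definition's construction, not the thesis's actual machinery):

```python
# A sketch of the BU definition: a tree is minimal if it is exactly mm(Phi)
# unified with its (already minimal) daughters embedded at positions 1..a.

def unify(m1, m2):
    """Naive unification of nested dicts; returns None on clash."""
    if not isinstance(m1, dict) or not isinstance(m2, dict):
        return m1 if m1 == m2 else None
    out = dict(m1)
    for f, v in m2.items():
        out[f] = unify(out[f], v) if f in out else v
        if out[f] is None:
            return None
    return out

def embed(model, position):
    """The \\<i> operation: place `model` under the daughter path <i>."""
    return {position: model}

def minimal_tree(mm_phi, daughters):
    """nu = mm(Phi) |_| nu_1\\<1> |_| ... |_| nu_a\\<a>."""
    nu = mm_phi
    for k, daughter in enumerate(daughters, start=1):
        nu = unify(nu, embed(daughter, k))
        if nu is None:
            return None          # production and daughters are incompatible
    return nu
```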

4.6.2 Top-down Definition

In contrast to the relatively simple bottom-up definition, the top-down definition is rather involved, requiring us to build formal definitions for some fundamental notions in the context of unification grammars. Our basic approach is to extend context-free top-down derivation to unification grammars by specifying a minimal parse tree to be a model which is minimally derivable from the start production and whose "leaf" constituents are all filled with lexical productions (i.e., terminals). We proceed with the formalization as follows:

1. Clarify context-free top-down derivation and define derivation tree;
2. Extend the context-free notions above to unification grammars, defining derivation and derivation model respectively; and finally,
3. Define minimal parse tree as a special case of a derivation model which satisfies the conditions just mentioned.

We start from context-free derivation and work through each step. Throughout, "derivation" always means top-down derivation.

Context-free Top-down Derivation

In context-free grammar, top-down derivation, denoted ⇒, is the process of rewriting a nonterminal symbol with the RHS symbols of a production whose LHS is that nonterminal. For example, given an arbitrary list of grammar symbols αCβ, if there exists a rule C → DE in the grammar, derivation can be applied to produce αDEβ (i.e., αCβ ⇒ αDEβ). When a sequence of derivations is applied, the transitive closure of these derivations is denoted ⇒*.

Derivation can also be considered a procedure for building a parse tree, particularly when the language of a grammar is defined by the yields of parse trees, instead of by the sequences of terminals obtained by symbol rewriting. In this view, a parse tree is a pictorial representation of a sequence of derivations from the start symbol that generates a string of terminal symbols (i.e., the yield). Typically, a parse tree in context-free grammar is defined as follows (Aho, Sethi and Ullman, 1986, p. 29):19

Given a context-free grammar, a parse tree is a tree with the following properties:
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a terminal.

19 For the sake of brevity, the definition is slightly modified here to eliminate the empty symbol ε.

4.6. MINIMAL PARSE TREES FOR UNIFICATION GRAMMARS

75

3. Each interior node is labeled by a nonterminal.
4. If A is the nonterminal labeling some interior node and X₁, X₂, .., Xₙ are the labels of the children of that node from left to right, where each Xᵢ is a terminal or a nonterminal, then A → X₁X₂..Xₙ is a production.

Notice that this definition is given by a static description of the properties of each node in the tree. In this sense, the definition essentially states the well-formedness of a parse tree. Derivation, as a procedure for building such a parse tree, must therefore be defined on partially filled parse trees. We call such a tree a derivation tree. A derivation tree is similar to a (complete) parse tree, except that its leaf nodes can be terminal or nonterminal. Therefore, (one-step) derivation amounts to a procedure which expands a nonterminal leaf node in a derivation tree with a production or a terminal. The steps involved are as given by (Winograd, 1983, p. 84): first choose any node in the tree that has no children and whose label is a nonterminal symbol of the grammar; then choose any rule in the grammar whose LHS is the label of that node; for each symbol on the RHS of the rule, create a new node whose label is that symbol, and place it under the chosen parent node. The choice of which nonterminal to expand is nondeterministic; it can be arbitrary or follow a specific strategy such as left-most or right-most. A parse tree is then generated from the start symbol by repeating this process until all leaf nodes are terminal.
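The one-step expansion just quoted can be sketched directly in code. The toy grammar, node class and function names below are assumptions made purely for illustration of the procedure, not part of the formalism developed here.

```python
# A sketch of one-step top-down derivation on a partially built derivation
# tree, following the Winograd-style procedure quoted above.

class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

GRAMMAR = {"S": [["NP", "VP"]], "VP": [["V"], ["VP", "ADV"]]}   # toy CFG
TERMINALS = {"NP", "V", "ADV"}

def expandable_leaves(node):
    """All leaves labeled by a nonterminal (candidates for expansion)."""
    if not node.children:
        return [] if node.label in TERMINALS else [node]
    return [leaf for c in node.children for leaf in expandable_leaves(c)]

def derive_step(tree, choose_leaf=lambda ls: ls[0], choose_rule=lambda rs: rs[0]):
    """Expand one nonterminal leaf with one rule; both choices are nondeterministic."""
    leaves = expandable_leaves(tree)
    if not leaves:
        return False                 # tree is complete: all leaves are terminal
    leaf = choose_leaf(leaves)
    rhs = choose_rule(GRAMMAR[leaf.label])
    leaf.children = [Node(sym) for sym in rhs]
    return True
```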

Derivation for Unification Grammars

Now, to extend context-free top-down derivation to unification grammars (Shieber's logic formalism in particular), we first define the kind of models on which it operates: context-free derivation trees extended to feature structures. To define such models, we modify the definition of Shieber's (complete) parse tree by allowing nonterminal leaf nodes. Since a partially filled parse tree is generated as the result of derivation, we call it a derivation model, and formally define it as follows:

Definition (Derivation Model): Given a grammar G, a derivation model ν admitted by G is a model that is a member of the infinite union of sets of bounded-depth derivation models Δ = ⋃_{i≥0} Δᵢ, where each Δᵢ is defined as:

1. Δ₀ is the set of models ν for which there is a lexical production p = ⟨w, Φ⟩ ∈ G such that ν ⊨ Φ.

2. Δᵢ (i > 0) is the set of models ν for which there is a phrasal production p = ⟨a, Φ⟩ ∈ G such that ν ⊨ Φ and, for all 1 ≤ k ≤ a, if ν/⟨k⟩ is defined, then ν/⟨k⟩ ∈ ⋃_{j<i} Δⱼ.
