ISSN 0918-2802 Technical Report

Probabilistic Language Modeling for Generalized LR Parsing
Virach Sornlertlamvanich
TR98-0005, September

Department of Computer Science, Tokyo Institute of Technology
Ookayama 2-12-1, Meguro, Tokyo 152, Japan
http://www.cs.titech.ac.jp/

(c) The author(s) of this report reserve all rights.
Preface

In this thesis, we introduce probabilistic models to rank the likelihood of the resultant parses within the GLR parsing framework. Probabilistic models can also bring about the benefit of a reduction in search space, if the models allow prefix probabilities for partial parses. In devising the models, we carefully observe the nature of GLR parsing, one of the most efficient parsing algorithms in existence, and formalize two probabilistic models with the appropriate use of the parsing context. The context in GLR parsing is provided by the constraints afforded by context-free grammars in generating an LR table (global context), and the constraints of adjoining pre-terminal symbols (local n-gram context).

In this research, we firstly conduct both model analyses and quantitative evaluation on the ATR Japanese corpus to evaluate the performance of the probabilistic models. Ambiguities arising from multiple word segmentation and part-of-speech candidates in parsing a non-segmenting language are taken into consideration. We demonstrate the effectiveness of combining contextual information to determine word sequences in the word segmentation process, define parts-of-speech for words in the part-of-speech tagging process, and choose between possible constituent structures, in single-pass morphosyntactic parsing. Secondly, we apply empirical evaluation to show that the performance of the probabilistic GLR parsing model (PGLR) using an LALR table is in no way inferior to that of using a CLR table, despite the states in a CLR table providing more precise context than those in an LALR table. Thirdly, we propose a new node-driven parse pruning algorithm based on the prefix probability of PGLR, which is effective in beam-search-style parsing. The pruning threshold is estimated from the number of state nodes up to the current parsing stage. The algorithm provides significant evidence of reduction in both parsing time and computational resources. Finally, a further PGLR model is formalized which overcomes some problematic issues by way of increasing the context in parsing.
Acknowledgments

First and foremost, I would like to thank my supervisor, Prof. Hozumi Tanaka, for his guidance, support and encouragement throughout the years of my Ph.D. studentship. I would also like to thank Assoc. Prof. Takenobu Tokunaga for his wealth of valuable comments on this research, and the other members of my thesis committee: Prof. Sadaoki Furui, Prof. Taisuke Sato, Assoc. Prof. Motoshi Saeki and Assoc. Prof. Masayuki Numao. Additionally, I would like to express my gratitude to the members of the Tanaka & Tokunaga laboratories for their contributions to this research, amongst whom I can never thank Kentaro Inui, Kiyoaki Shirai and Timothy Baldwin enough. Especially, I would like to thank Masahiro Ueki for support in using MSLR and for kindly adapting his parser to be able to accumulate action counts in the training phase. Taiichi Hashimoto helped with the implementation of the probabilistic GLR parser, and Toshiki Ayabe aided with LR table generation, especially in computing the state list for use with the B&C model. Thanaruk Theeramunkong of JAIST provided helpful discussions on accumulating probabilities into a GLR packed shared parse forest. Imai Hiroki helped revise the ATR Japanese corpus. I would also like to thank Toshiyuki Takezawa for providing the ATR Japanese corpus, and John Carroll for his helpful discussions and efforts in evaluating different models on an English corpus by way of his parser and grammar, during his stay at TITECH. I would never have had the chance to fulfill this research without the support of the Royal Thai government scholarship fund, Ministry of Science, Technology and Environment, and my colleagues at the Linguistics and Knowledge Science Laboratory (LINKS). Last but not least, I would like to thank my parents, my wife and my daughter for their understanding and encouragement.
Contents

Preface
Acknowledgments
1 Introduction
  1.1 Background
  1.2 Focus of this research
  1.3 Outline of the proceeding chapters
2 Language Modeling
  2.1 GLR parsing
  2.2 Parse probability
  2.3 Probabilistic Context-Free Grammar (PCFG)
  2.4 Two-level Probabilistic Context-Free Grammar
  2.5 Briscoe and Carroll's Model (B&C)
  2.6 Probabilistic Generalized LR (PGLR)
  2.7 Related work
  2.8 Summary
3 Context-Sensitivity in GLR Parsing
  3.1 Context in GLR parsing
  3.2 Action conflicts
  3.3 An example of GLR parsing with ambiguity
  3.4 Summary
4 Incorporating Probability into an LR Table
  4.1 An example of incorporating probabilities into an LR table
  4.2 Parse tree probability
  4.3 Summary
5 Experimental Results
  5.1 Ambiguity in parsing non-segmenting languages
  5.2 ATR corpus and grammar
  5.3 Parsing the ATR corpus
  5.4 Model trainability
  5.5 Comparative results for LALR and CLR table-based PGLR
  5.6 Summary
6 Model Analyses
  6.1 Advantages of PGLR-based models over PCFG-based models
  6.2 Model preciseness of PGLR and B&C
  6.3 Summary
7 A Node-Driven GLR Parse Pruning Technique
  7.1 The node-driven parse pruning algorithm
  7.2 Efficiency of parse pruning in PGLR
  7.3 Summary
8 Probabilistic GLR Model-2
  8.1 Problematic issues
  8.2 Another model for PGLR
  8.3 Effects of PGLR model-2 on the exemplified training set
  8.4 Evaluation of B&C, PGLR and PGLR model-2
  8.5 Summary
9 Conclusions
Bibliography
List of Tables

2.1 Parsing phrase levels represented in sentential form.
3.1 LR table generated from Grammar-1.
4.1 LR table generated from Grammar-2.
4.2 LR table generated from Grammar-2, with associated probabilities. Probabilities in the first and second line of each state row are those estimated by B&C and PGLR, respectively.
5.1 The ATR corpus.
5.2 Average parse base (APB) of the ATR corpus, compared to other corpora.
5.3 Performance on the ATR Corpus. PA is the parsing accuracy and indicates the percentage of top-ranked parses that match standard parses. LP/LR are label precision/recall. BP/BR are bracket precision/recall. 0-CB and m-CB are zero crossing brackets and mean crossing brackets per sentence, respectively.
5.4 The full set of the original ATR Japanese corpus.
5.5 Performance on the ATR corpus with part-of-speech inputs.
5.6 Comparison of an LALR table and CLR table generated from the same ATR context-free grammar containing 762 rules, 137 non-terminal symbols and 407 terminal symbols.
5.7 Performance on the ATR Corpus. Comparative results for PGLR using an LALR and CLR table.
6.1 An LR table generated from Grammar-3 with the associated probabilities. Probabilities in the first line of each state row are those estimated by B&C and the bracketed values in the second line are those estimated by PGLR.
6.2 Probabilities of the parse trees (S3), (S4) and (S5), estimated according to each model.
6.3 Probabilities of parse trees as estimated by each model. (S1-a) and (S1-b) are the possible parse trees for the input string `a a'; (S2-a) and (S2-b) are the possible parse trees for the input string `a a a'.
7.1 Average time and space consumption when parsing with the node-driven parse pruning algorithm, as compared to full parsing.
8.1 The probabilistic LR table, updated according to the extended training set. Probabilities in the first and second line of each state row are those estimated by B&C and PGLR, respectively.
8.2 Results of ranking parse trees in ascending order, according to probabilities from B&C, PGLR and PGLR model-2.
8.3 Parse performance of PGLR on a closed test set of 500 sentences.
8.4 Parse performance of PGLR on an open test set of 500 sentences.
List of Figures

2.1 Application of the rule {VP → adverb, verb} within an NP.
2.2 The decomposition of a parse tree into parsing phrase levels.
2.3 Sample representation of "with a list" in the HBG model.
2.4 Context in an arbitrary parse tree for the Pearl model.
2.5 Treebank analysis encoded using feature values.
3.1 Reduce/reduce conflict.
3.2 Shift preference for a shift/reduce conflict.
3.3 Reduce preference for a shift/reduce conflict.
3.4 Reduction of [Xa → a b] in different contexts.
3.5 Parse trees for the input string `x x x', based on Grammar-1.
4.1 The four parse tree types in the training set. Values in square brackets indicate the frequency of each parse tree in the training set.
4.2 Parse probabilities of the four parse tree types.
5.1 Ambiguity in parsing a non-segmenting language.
5.2 Trainability measure for PCFG, two-level PCFG, B&C and PGLR.
5.3 Parsing accuracy distribution over different sentence lengths, for PGLR(LALR) and PGLR(CLR).
5.4 Learning curve of actions in PGLR, using both an LALR and CLR table.
5.5 Trainability measure for PGLR using an LALR and CLR table.
6.1 Parse trees, with the frequency in the training set shown in square brackets.
7.1 Parse pruning within a graph-structured stack. Circled state numbers are active nodes. All possible parses (sequences of state transitions) at each active node are shown in the box pointing to that node. The action/state pairs after applying the actions are shown in the square brackets next to the active nodes.
7.2 Parsing accuracy under a varying beam width for parse pruning.
7.3 Distribution of state nodes over input symbols in parsing a sentence of 33 characters with 121,472 potential parses.
7.4 Distribution of state nodes over input symbols in parsing a sentence of 36 characters with 1,316,912 potential parses.
8.1 Estimation of parse probabilities based on the expanded training set.
8.2 Effects of adding (S2-c) into the training set, on PGLR model-2 as compared with B&C and PGLR. The number in the square brackets is the frequency of the parse tree.
8.3 Effects of adding (S2-d) into the training set, on PGLR model-2 as compared with B&C and PGLR. The number in the square brackets is the frequency of the parse tree.
8.4 Parsing accuracy distribution over different sentence lengths, for B&C, PGLR and PGLR model-2.
8.5 Trainability measure for PGLR and PGLR model-2.
Chapter 1
Introduction

1.1 Background

Syntactic parsing is the process of assigning phrase markers to a sentence. In general, parsing a natural language sentence can produce multiple structures (also termed parse trees or parse derivations in this thesis), according to ambiguity inherent in the grammar. Grammars can potentially assign dozens, possibly thousands or even millions of parse tree candidates to a single sentence, ranging from the reasonable to the uninterpretable, with the majority at the uninterpretable end of the spectrum. The main topic for statistical work on syntactic disambiguation in parsing is to give preference to plausible parses by ranking ambiguous parse trees according to statistics generated from observation of current language use. (The terms `statistical', `stochastic' and `probabilistic' are used interchangeably in this research, to refer to the assignation of probabilities to possible alternatives, and their subsequent ranking according to probabilities, or the likelihood of locating the most probable answer.)

Traditional work on statistical parsing has broken down the disambiguation process into the problems of (i) determining the part-of-speech of each word, i.e. statistical or probabilistic part-of-speech tagging, and (ii) choosing between possible constituent structures, i.e. statistical or probabilistic parsing. For languages that have no explicit word boundaries, there must be a word segmentation stage prior to part-of-speech tagging. This process can be viewed as a speech recognition problem, because speech input, in general, is a sound sequence with no pauses between words. In this research, we integrate all three disambiguation processes together and formalize two probabilistic models for GLR parsing, aimed at applying all contextual information simultaneously to produce better results than disambiguating in a cascaded manner, in which each disambiguation stage is biased directly by the precision of the previous stage and information cannot be recalled from the previous stage. Consequently, we have focused our research on the task of parsing a sentence and assigning probabilities to the possible parse trees as a single process, given an input string of characters with no word boundaries, i.e. character-based parsing; the input could equally be a string of allophones in a speech recognition task, however.

Generalized LR (GLR) [55] parsing is a generalized form of the LR parsing algorithm, with Tomita's contribution being to make non-deterministic LR parsing deterministic by introducing the notion of a graph-structured stack (GSS) to maintain the parsing stack. Except for the handling of non-deterministic parsing with a GSS, GLR parsers follow the same parsing algorithm as standard LR parsers. MSLR (Morpho-Syntactic LR) [52], GLR parsing involving multiple input strings, makes the integration of morphological and syntactic parsing feasible. The LR parsing algorithm [1] is a table-driven shift-reduce left-to-right parser for context-free grammars, constructing a rightmost derivation in reverse. At each step, the action of the parser is determined by the current parser state and the current input symbol. It is an appealing algorithm in terms of applying language constraints to syntactic parsing because it is simple, analytically tractable, very fast, and can prune ungrammatical hypotheses at an early stage. Based on the GLR parsing framework, we focus our research on formalizing two probabilistic models without sacrificing parsing efficiency. Our newly formalized models are called the probabilistic GLR parsing model (PGLR), and the probabilistic GLR parsing model-2 (PGLR model-2).

Briscoe and Carroll [5] proposed an efficient way of incorporating trained probabilities into each parsing action of the LR table. Probabilities are computed directly from the frequency of application of each action when parsing the training corpus. Their method seems to be able to exploit the advantages offered by the context-sensitivity of GLR parsing. However, Inui et al. [25] showed that Briscoe and Carroll's model (B&C) is flawed by defects in distributing parse probabilities over the actions of an LR parsing table ("LR table" for short). Firstly, in B&C, no distinction is made between actions when normalizing action probabilities over the states in an LR table, while PGLR distinguishes the action probability normalization of states reached immediately after applying a shift action from that of states reached immediately after applying a reduce action. There is evidence that B&C reconsiders the probability of the next input symbol even when parsing is at a state reached after applying a reduce action (where the input symbol is deterministic). Redundantly including the probability of the input symbol from the previous state in this way significantly distorts the overall parse probability. Secondly, subdividing reduce action probabilities according to the states reached immediately after applying reduce actions is also redundant, because the resultant stack-top state after the stack-pop operation associated with a reduce action is always deterministic. As a result, the parse probabilities estimated by the B&C model are lower than they should be.
1.2 Focus of this research

In our preliminary experiments, the probabilistic GLR parsing model (PGLR) has previously been proven to be better than existing models, in particular the model proposed by Briscoe and Carroll [5] and a baseline model using a probabilistic context-free grammar (PCFG), in parsing strings of parts-of-speech (non-word-based parsing) from both the ATR and EDR Japanese corpora [41]. Parsing a sentence at the morphological level makes the task much more complex because of the increase in parse ambiguity stemming from word segmentation ambiguity and multiple corresponding sequences of parts-of-speech.

We empirically evaluate the preciseness of the probabilistic model for PGLR against that for Briscoe and Carroll's model (B&C), which is based on the same GLR parsing framework. We also examine the benefits of context-sensitivity in GLR parsing, in which the PGLR model is matched, by empirical results and model analyses, against the "two-level PCFG" model [16], or "pseudo context-sensitive grammar" (PCSG) model as recently presented in [13], which has been shown to capture greater context-sensitivity than the original PCFG model. Like the B&C model, PGLR inherits the benefits of context-sensitivity in generalized LR (GLR) parsing. Its LR table is generated from a context-free grammar (CFG) by decomposing a parse into a sequence of actions. Every action in the LR table is determined by the pairing of a state and input symbol, so it is valid to regard the state/input symbol pair as the context for determining an action. As a result, PGLR inherently captures two levels of context:
- Global context over structures from the source context-free grammar. The global context is inherited from the states, which are the global constraints generated by the grammar in the process of constructing an LR table, in which the goto function over sets of items defines a deterministic finite automaton (DFA) that recognizes the viable prefixes of the grammar.

- Local n-gram context from adjoining pre-terminal constraints. Reading the next input symbol in LR parsing is analogous to the prediction of the next symbol in the n-gram model.
Trainability is not trivial in estimating a probabilistic model. The problem of data sparseness is very sensitive to the number of free parameters taken into account. Certainly, the amount of contextual information can be increased to raise the context-sensitivity of a model. But to make the context effective, it must be applied with restraint and preciseness (poor estimates of context are worse than no estimate at all [20]). We investigate the trainability of each model by varying the size of the training set, to observe the performance curve. We also observe the learning curve of parsing actions in the LR table. These will show the effects of data sparseness, and the appropriate handling of context by the models, within the GLR parsing framework. Moreover, the performance of PGLR using an LALR table and a CLR table is empirically compared. Since the states in an LALR table are approximated versions of those in the corresponding CLR table, because of state merging in creating the LALR table, the modeling context in the LALR table can be inferior to that in the CLR table. However, parsing with the LALR table is preferable because of its efficiency in terms of both time and space over the CLR table [28]. Our results confirm the applicability of the PGLR model.
Parsing highly ambiguous sentences or long sentences can cause an unexpectedly long parse time and exhaustive use of computing resources. To make parsing feasible, we need an algorithm that determines the most probable parse as early as possible. Wright et al. [60, 61] as well as Yamada and Sagayama [62] introduced a Viterbi-style beam search to GLR parsing by adding a storage facility to keep track of the parse. Both methods require an overhead cost in computation, and the storage mechanism certainly defeats the purpose of using a GSS in GLR parsing. Carroll and Briscoe [9] also proposed a method to extract the n-best parses from a complete packed parse forest, which requires that the sentence be completely parsed before the n-best parses can be extracted. Their method can reduce the time taken to extract the n-best parses from the packed parse forest, but exhaustive parsing is still needed, and exhaustive searching is required in the worst case. Lavie and Tomita [29] also proposed a beam search heuristic in GLR* parsing. Precisely speaking, it is a beam width for limiting the increase in the number of inactive state nodes for performing shift operations in the skipping feature of GLR* parsing. Moreover, they use a very crude scoring method for sorting the inactive state nodes. Considering the efficient use of computing resources and parsing in an acceptable time, we need a new algorithm that can prune off the less probable parses as early as possible, without modifying the original GSS or introducing additional storage requirements to keep track of the various parses. The availability of prefix probabilities in the PGLR model makes it feasible to set such a pruning threshold.
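To make the idea concrete, the following is a minimal sketch of how prefix probabilities could drive beam pruning during parsing. The function and variable names (prune_frontier, beam_width, the (node, prefix log-probability) pairs, and the fixed relative threshold) are illustrative assumptions for this sketch, not the node-driven algorithm formalized in Chapter 7.

```python
import math

def prune_frontier(frontier, beam_width):
    """Keep only the most probable partial parses at the current parsing stage.

    frontier: list of (node, prefix_logprob) pairs, where prefix_logprob is the
    log of the product of action probabilities applied so far (a prefix
    probability in the PGLR sense).  Hypotheses far below the best one, or
    outside the top beam_width, are discarded.
    """
    if not frontier:
        return frontier
    ranked = sorted(frontier, key=lambda item: item[1], reverse=True)
    best = ranked[0][1]
    # Relative threshold: drop hypotheses more than a factor of e**10 below the best.
    kept = [(node, lp) for node, lp in ranked if best - lp <= 10.0]
    return kept[:beam_width]

# Example: three partial parses with prefix probabilities 0.2, 0.05 and 1e-8.
frontier = [("node-A", math.log(0.2)), ("node-B", math.log(0.05)), ("node-C", math.log(1e-8))]
print(prune_frontier(frontier, beam_width=2))  # node-C is pruned
```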
Finally, we formalize a second model for PGLR, called PGLR model-2, to overcome some problematic issues that occurred in parsing the ATR corpus; this model is based on a different set of assumptions to the PGLR model. We evaluate the models with both empirical evaluation and model analyses, which lead us to conclusions as to the appropriate level of context, and which point to future work in extending our framework to produce a richer context model and a lexicalization of the probabilistic model for GLR parsing.
1.3 Outline of the proceeding chapters

Chapter 2 briefly reviews the various probabilistic models, namely B&C, two-level PCFG and PGLR, which are evaluated through character-based parsing on the ATR Japanese corpus. A summary is given of previous work in probabilistic parsing. Chapter 3 shows the nature of context in GLR parsing, which reflects the context-sensitivity of the probabilistic models. We exemplify a method to incorporate probabilities into the LR table and compute parse probabilities from the action probabilities in Chapter 4. Chapter 5 shows the results of experiments, concerning performance and model trainability, carried out on the three aforementioned models and the baseline PCFG model. Comparative results for LALR and CLR table-based PGLR are also discussed. We discuss the empirical results and give case analyses in Chapter 6. Chapter 7 introduces an effective technique, called the node-driven parse pruning algorithm, for pruning less probable parses based on the partial parse probabilities of PGLR. Chapter 8 formalizes another model for PGLR based on a different assumption about GLR parsing, and also discusses the results of its evaluation.
Chapter 2
Language Modeling

Formalizing a probabilistic model in the GLR parsing framework without sacrificing parsing efficiency allows the model to fully inherit the advantages of GLR parsing.
2.1 GLR parsing

GLR parsing [53, 55] is a table-driven parsing algorithm which utilizes a table, called an LR table [2, 1], precompiled from a context-free grammar. The operation of the parser, i.e. the parsing actions within the LR table, is defined in terms of shift and reduce actions, with reference to the current state and the current input symbol. GLR parsing is a generalized form of the LR parsing algorithm, with Tomita's contribution being to make non-deterministic LR parsing deterministic by introducing the notion of a graph-structured stack (GSS) to maintain the parsing stack. Except for the handling of non-deterministic parsing with a GSS, GLR parsers follow the same parsing algorithm as standard LR parsers. All LR parsers behave in the manner shown in the following LR parsing algorithm.
LR parsing algorithm [1]: Suppose that the current configuration of the LR parser (a pair whose first component is the stack content and whose second component is the unexpanded input) is:

(stack: $s_0 X_1 s_1 X_2 s_2 \cdots X_m s_m$,  input: $a_i a_{i+1} \cdots a_n \$$)

where the initial state $s_0$ is the state initially put on top of the LR parser stack. The next move of the parser is determined by reading $a_i$, the current input symbol, and $s_m$, the state on top of the stack, and then consulting the parsing action table entry $action[s_m, a_i]$. The configurations resulting after each of the four types of move are as follows:

1. If $action[s_m, a_i] = \text{shift } s$, the parser executes a shift move, producing the configuration

   ($s_0 X_1 s_1 X_2 s_2 \cdots X_m s_m a_i s$,  $a_{i+1} \cdots a_n \$$)

   Here the parser has shifted both the current input symbol $a_i$ and the next state $s$, which is given in $action[s_m, a_i]$, onto the stack; $a_{i+1}$ becomes the current input symbol.

2. If $action[s_m, a_i] = \text{reduce } A \rightarrow \beta$, then the parser executes a reduce move, producing the configuration

   ($s_0 X_1 s_1 X_2 s_2 \cdots X_{m-r} s_{m-r} A s$,  $a_i a_{i+1} \cdots a_n \$$)

   where $s = goto[s_{m-r}, A]$ and $r$ is the length of $\beta$, the right-hand side (RHS) of the production. Here the parser first pops $2r$ symbols off the stack ($r$ state symbols and $r$ grammar symbols), exposing state $s_{m-r}$. The parser then pushes both $A$, the left-hand side (LHS) of the production, and $s$, the entry for $goto[s_{m-r}, A]$, onto the stack. The current input symbol is not changed in a reduce action. For LR parsers, $X_{m-r+1} \cdots X_m$, the sequence of grammar symbols popped off the stack, will always match $\beta$, the right-hand side of the reducing production.

3. If $action[s_m, a_i] = \text{accept}$, parsing is completed.

4. If $action[s_m, a_i] = \text{error}$, the parser has discovered an error and calls an error recovery routine.

As a result, in parsing with the LR parsing algorithm, a parse derivation can be observed as a stack or state transition over the infinite transition space.
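Purely as an illustration of the deterministic shift-reduce loop described above, the following sketch assumes an `action` table mapping (state, symbol) pairs to entries such as ("shift", s), ("reduce", lhs, rhs_length) or "accept", and a `goto` table; the tables and any grammar behind them are hypothetical.

```python
def lr_parse(tokens, action, goto):
    """Deterministic LR driver: one action per (state, input symbol) pair."""
    stack = [0]                       # stack of states; s0 is the initial state
    symbols = []                      # parallel stack of grammar symbols, for clarity
    pos = 0
    tokens = list(tokens) + ["$"]     # end-of-input marker
    while True:
        state, lookahead = stack[-1], tokens[pos]
        entry = action.get((state, lookahead))
        if entry is None:
            raise SyntaxError(f"no action for state {state} and symbol {lookahead!r}")
        if entry == "accept":
            return symbols            # parsing is completed
        if entry[0] == "shift":
            _, next_state = entry
            symbols.append(lookahead)
            stack.append(next_state)  # push the input symbol and the next state
            pos += 1                  # the next token becomes the current input symbol
        elif entry[0] == "reduce":
            _, lhs, rhs_len = entry
            del stack[len(stack) - rhs_len:]      # pop r states, exposing s_{m-r}
            del symbols[len(symbols) - rhs_len:]  # pop the r grammar symbols
            symbols.append(lhs)                   # push the LHS non-terminal A
            stack.append(goto[(stack[-1], lhs)])  # and goto[s_{m-r}, A]
            # the current input symbol is unchanged by a reduce move
```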
2.2 Parse probability

We review the existing probabilistic models, namely PCFG, two-level PCFG and B&C, which are to be evaluated against the PGLR model. B&C is a high-performance probabilistic model proposed for the GLR parsing framework, as presented in [5]. Two-level PCFG [16] is an extended PCFG model for yielding greater context-sensitivity than the original paradigm. It was recently explored more thoroughly by Charniak and Carroll [13], under the terminology "pseudo context-sensitive grammar" (PCSG), showing an improvement in per-word cross-entropy over the original PCFG model. Our motivation in selecting B&C and two-level PCFG for the comparative evaluation of PGLR is to examine the effectiveness of LR table context-sensitivity (global context over CFG-derived structures and local n-gram context from adjoining pre-terminal constraints), and the appropriateness of the PGLR model for GLR parsing. Since PCFG is widely referred to in the evaluation of various models, we include it as the baseline model for our evaluation tests.

In character-based parsing, given a string of characters $C = c_1, \ldots, c_n$ as input, the task of a probabilistic parser is to search among the many possible joint probabilities of a parse tree ($T$) and word sequence ($W$), for the most likely parse tree and corresponding word sequence:

$$\arg\max_{T,W} P(T, W \mid C) = \arg\max_{T,W} \frac{P(T)\,P(W \mid T)\,P(C \mid W, T)}{P(C)} \qquad (2.1)$$
$$= \arg\max_{T,W} P(T)\,P(W \mid T) \qquad (2.2)$$

The term $P(C \mid W, T)$ becomes one when the word sequence $W$ is fixed, and $P(C)$ is a constant scaling factor, independent of $T$ and $W$, which need not be considered when ranking parse trees and word sequences. The probabilities in equation (2.2) are used to rank the likelihood of parse trees. The term $P(T)$ is the distribution of probabilities over all the possible parse trees that can be derived from a given grammar. The probabilistic models we discuss in this chapter allow us to estimate the parse tree probabilities $P(T)$. The term $P(W \mid T)$ is the distribution of lexical derivations from parse trees $T$. There are various proposed ways to estimate the lexical probability $P(W \mid T)$, discussion of which can be found in [24]. Since such estimation is beyond the scope of this research, in our evaluation we naively assume that word $w_i$ in word sequence $W = w_1, \ldots, w_m$ depends only on its part-of-speech $l_i$. Therefore,

$$P(W \mid T) \approx \prod_{i=1}^{m} P(w_i \mid l_i) \qquad (2.3)$$

The estimation of lexical probability is applied identically in all models.
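The following sketch only illustrates how equation (2.2) and the lexical approximation (2.3) combine when ranking candidate analyses; the data structures and the toy probability tables are assumptions for the example, not part of any of the models compared in this chapter.

```python
import math

def analysis_score(rule_probs, lex_probs, applied_rules, tagged_words):
    """log P(T) + log P(W|T) for one candidate analysis.

    applied_rules: rules used to build T, scored by whichever tree model is in use.
    tagged_words:  (word, part_of_speech) pairs, scored by P(word | tag) as in (2.3).
    """
    log_p_tree = sum(math.log(rule_probs[r]) for r in applied_rules)
    log_p_words = sum(math.log(lex_probs[(w, t)]) for w, t in tagged_words)
    return log_p_tree + log_p_words

def best_analysis(candidates, rule_probs, lex_probs):
    # candidates: iterable of (applied_rules, tagged_words) pairs for one input string
    return max(candidates,
               key=lambda c: analysis_score(rule_probs, lex_probs, c[0], c[1]))
```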
2.3 Probabilistic Context-Free Grammar (PCFG)

A probabilistic context-free grammar (PCFG) is an augmentation of a context-free grammar. Each production rule of the grammar ($r_i$) is of the form $\langle A \rightarrow \alpha, P(r_i) \rangle$, where $P(r_i)$ is the associated probability, and the probabilities associated with all rules for a given left-hand side (LHS) non-terminal must sum to one [19, 59, 58, 60]. The probability of a parse tree ($T$) is regarded as the product of the probabilities of the rules employed in deriving that parse tree, such that:

$$\sum_{\alpha} P(\alpha \mid A) = 1 \qquad (2.4)$$
$$P(T) = \prod_{i} P(r_i) \qquad (2.5)$$

Wright and Wrigley propose a solution for mapping a PCFG into the GLR parsing scheme [60]. They distribute the original probabilities of the PCFG into the parsing action table by way of the LR table generator. The successful application of their probabilistic LR table generator makes word prediction available in speech recognition tasks. By associating probabilities with shift actions as well as reduce actions, the parser can compute not only the total probability of a parse tree but also the intermediate probability midway through the parse process. The use of intermediate parse probabilities provides the means to prune off less probable parses, which can reduce the parse space and lead to parse results in significantly less time. However, the resultant probabilities are identical to those of the original PCFG, which does not sufficiently capture the context-sensitivity of natural language.

Rule probabilities are normalized within each non-terminal symbol, therefore making the number of free parameters in PCFG:

$$F_{PCFG} = |R| - |N| \qquad (2.6)$$

where $R$ is the set of context-free grammar rules, and $N$ is the set of non-terminal symbols appearing in $R$.
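A minimal sketch of PCFG estimation and scoring as defined by equations (2.4) and (2.5): rule probabilities are relative frequencies normalized per LHS non-terminal, and a tree's probability is the product over its applied rules. The treebank format (a list of applied rules per tree) and the toy grammar are assumptions made for the example.

```python
from collections import Counter, defaultdict
import math

def train_pcfg(treebank):
    """treebank: list of parse trees, each given as a list of rules (lhs, rhs_tuple)."""
    rule_counts = Counter(rule for tree in treebank for rule in tree)
    lhs_counts = defaultdict(int)
    for (lhs, rhs), c in rule_counts.items():
        lhs_counts[lhs] += c
    # Equation (2.4): probabilities of rules sharing an LHS sum to one.
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

def pcfg_log_prob(rule_probs, tree):
    # Equation (2.5): P(T) is the product of the applied rule probabilities.
    return sum(math.log(rule_probs[rule]) for rule in tree)

# Tiny hypothetical treebank: two trees over rules S -> NP VP, NP -> n, VP -> v, VP -> v NP.
treebank = [
    [("S", ("NP", "VP")), ("NP", ("n",)), ("VP", ("v",))],
    [("S", ("NP", "VP")), ("NP", ("n",)), ("VP", ("v", "NP")), ("NP", ("n",))],
]
probs = train_pcfg(treebank)
print(probs[("VP", ("v",))])  # 0.5: VP rewrites as v in half of its occurrences
```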
2.4 Two-level Probabilistic Context-Free Grammar (Two-level PCFG)

Two-level PCFG is an extended version of PCFG, deriving from the idea of providing context-sensitivity for a context-free grammar. In the original PCFG model, the probability of a parse tree ($T$) is regarded as the product of the probabilities of the rules employed to derive that parse tree. Each production rule ($r_i$) is of the form $\langle A \rightarrow \alpha, P(r_i) \rangle$, where $P(r_i)$ is the associated probability, and the probabilities associated with all rules for a given LHS non-terminal $A$ must sum to one. As an extension of PCFG, the two-level PCFG model utilizes the extra information provided by the parent of each non-terminal when expanding rules ($r_i$) through the assignment of rule probabilities. Thus, the rule probability in equation (2.4) can be rewritten as:

$$\sum_{\alpha} P(\alpha \mid \pi(A)) = 1 \qquad (2.7)$$

where $\pi(A)$ is the non-terminal that immediately dominates $A$ (i.e. its parent).

[Figure 2.1: Application of the rule {VP → adverb, verb} within an NP.]

Figure 2.1 shows the tree structure of an application of the rule {VP → adverb, verb} within an NP. This rule has the non-terminal NP as its parent. The parent NP indicates the situation in which the rule is applied. Based on the two-level PCFG model, we acquire the probability for the above rule as shown in equation (2.8):

$$P(VP \rightarrow adverb, verb \mid \pi(VP) = NP) \qquad (2.8)$$

Similarly to PCFG, the probability of a parse tree ($T$) can be computed by finding the product of the probabilities of the rules employed in deriving that parse tree, as defined in equation (2.5). Rule probabilities are normalized within each pair of non-terminal symbols, therefore making the number of free parameters in two-level PCFG:

$$F_{2\text{-}level\text{-}PCFG} = (|R| - |N|) \cdot |N| \qquad (2.9)$$

where $R$ is the set of context-free grammar rules, and $N$ is the set of non-terminal symbols appearing in $R$. In other words, the number of free parameters in two-level PCFG is $|N|$ times that in PCFG.
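A sketch of the parent-conditioned counterpart, assuming the same hypothetical treebank format as before but with each applied rule now paired with the label of its parent constituent; normalization is per (parent, LHS) pair, which is what makes the free-parameter count |N| times larger.

```python
from collections import Counter, defaultdict
import math

def train_two_level_pcfg(treebank):
    """treebank: list of trees, each a list of (parent_label, (lhs, rhs_tuple)) items."""
    counts = Counter((parent, rule) for tree in treebank for parent, rule in tree)
    context_counts = defaultdict(int)
    for (parent, (lhs, rhs)), c in counts.items():
        context_counts[(parent, lhs)] += c     # normalize within each parent/LHS pair
    return {(parent, rule): c / context_counts[(parent, rule[0])]
            for (parent, rule), c in counts.items()}

def two_level_log_prob(probs, tree):
    # As in equation (2.5), but each rule is looked up together with its parent label.
    return sum(math.log(probs[(parent, rule)]) for parent, rule in tree)
```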
2.5 Briscoe and Carroll's Model (B&C)

Briscoe and Carroll [5] introduced probability to the GLR parsing algorithm in light of the fact that LR tables provide appropriate contextual information for solving the context-sensitivity problems observable in real-world natural language applications. They pointed out that an LR parse state encodes information about the left and right context of the parse. This results in distinguishability of context for an identical rule reapplied in different ways across different derivations. Briscoe and Carroll's method allows us to associate probabilities with an LR table directly, rather than simply with the rules of the grammar, as found in the PCFG and two-level PCFG models. Since GLR parsing is a table-based parsing algorithm, Briscoe and Carroll consider the LR table as a nondeterministic finite-state automaton. Each row of the LR table corresponds to the possible transitions out of the state represented by that row, and each transition is associated with a particular lookahead item and a parsing action. (The term "lookahead", originally used in [1], refers to the extra information that is incorporated in a state, by redefining items to include a terminal symbol as a second component. In this case, however, Briscoe and Carroll refer to lookahead as an "input symbol".) Nondeterminism arises when more than one action is possible given a particular input symbol, in the form of action conflicts. Action conflicts occur when the grammar is not an LR grammar, as defined in [1]. There is no explicit model formalization for the B&C model in the original proposal. The following is a review of B&C in terms of our formalization.

Briscoe and Carroll regard a parse tree as a sequence of state transitions ($T$):

$$s_0 \xrightarrow{l_1, a_1} s_1 \xrightarrow{l_2, a_2} \cdots \xrightarrow{l_{n-1}, a_{n-1}} s_{n-1} \xrightarrow{l_n, a_n} s_n \qquad (2.10)$$

where $a_i$ is an action, $l_i$ is an input symbol and $s_i$ is the state at time $t_i$. The probability of a complete parse can be represented as:

$$P(T) = P(s_0, l_1, a_1, s_1, \ldots, s_{n-1}, l_n, a_n, s_n) \qquad (2.11)$$
$$= P(s_0) \prod_{i=1}^{n} P(l_i, a_i, s_i \mid s_0, l_1, a_1, s_1, \ldots, l_{i-1}, a_{i-1}, s_{i-1}) \qquad (2.12)$$

As to the nature of the state on top of the stack used in the GLR parsing scheme, which contains all the information needed in parsing, they assume that the state on top of the stack during parsing is representative of the full stack. As a result, the probability of the parse tree $T$ is estimated by equation (2.13):

$$P(T) \approx \prod_{i=1}^{n} P(l_i, a_i, s_i \mid s_{i-1}) \approx \prod_{i=1}^{n} P(l_i, a_i \mid s_{i-1}) \, P(s'_i \mid s_{i-1}, l_i, a_i) \qquad (2.13)$$

where $s'_i$ is the stack-top state after a stack-pop operation in applying a reduce action. Based on B&C, the following is a summary of the scheme for deriving the action probabilities ($p(a)$) from the count of state transitions resulting from parsing a training set.

1. The probability of an action given an input symbol is conditioned by the state it originated from. The probabilities assigned to each action for a given state must sum to one. Therefore,

$$\sum_{l \in La(s)} \sum_{a \in Act(s,l)} p(a) = 1 \quad (\text{for all } s \in S) \qquad (2.14)$$

where $La(s)$ is the set of input symbols at state $s$, $Act(s,l)$ is the set of actions given a state $s$ and input symbol $l$, and $S$ is the set of all states of the LR table. This means that the action probabilities in the LR table are normalized within each state.

2. In the case of a shift action ($a_i \in A_s$), $P(s'_i \mid s_{i-1}, l_i, a_i)$ in equation (2.13) is equal to one, because shift conflicts never occur in an LR table, and shift actions determine the state to move to. Therefore,

$$p(a) = P(l_i, a_i \mid s_{i-1}) \quad (\text{for } a_i \in A_s) \qquad (2.15)$$

3. In the case of a reduce action ($a_i \in A_r$), the probability is subdivided according to the state reached immediately after applying the reduce action (precisely stated, according to the stack-top state after the stack-pop operation in applying the reduce action). The reason for this is that Briscoe and Carroll associate probabilities with transitions in the automaton rather than with actions in the action part of the LR table. After applying a reduce action, $P(s'_i \mid s_{i-1}, l_i, a_i)$ in equation (2.13) is no longer one. It is interpreted as the probability of the state on top of the stack after a number of symbols have been popped off in applying the reduce action. Therefore,

$$p(a) = P(l_i, a_i \mid s_{i-1}) \, P(s'_i \mid s_{i-1}, l_i, a_i) \quad (\text{for } a_i \in A_r) \qquad (2.16)$$

Actions in the goto part of an LR table are deterministic, and therefore Briscoe and Carroll do not consider distributing probabilities to the actions in the goto part of the table. The probability of a parse tree in B&C is the geometric mean of the probabilities of the actions in the state transitions across the whole parse tree, therefore:

$$P(T) = \left( \prod_{i=1}^{n} p(a_i) \right)^{1/n} \qquad (2.17)$$

Normalizing action probabilities within each state causes the number of free parameters to be:

$$F = |A| - |S| \qquad (2.18)$$

where $A$ and $S$ are the sets of all actions and states of the LR table, respectively. Since the B&C model subdivides the probability assigned to a reduce action according to the state reached after the stack-pop operation, the number of free parameters can be more than $F$ in equation (2.18), such that:

$$F_{B\&C} > |A| - |S| \qquad (2.19)$$
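The following sketch shows one way the relative-frequency estimation described above could be implemented from parsed training data; the record format (state, input symbol, action, stack-top state after the pop) and all names are assumptions for illustration, not Briscoe and Carroll's implementation.

```python
from collections import Counter, defaultdict
import math

def train_bc(transitions):
    """transitions: (state, input_symbol, action, stacktop_after_pop) records collected
    while parsing the training corpus with the correct parses."""
    count = Counter(transitions)
    per_state = defaultdict(int)
    for (s, l, a, s_after), c in count.items():
        per_state[s] += c
    # Probabilities are normalized within each state (equation 2.14); reduce actions are
    # further subdivided because the key includes the stack-top state after the pop
    # (equation 2.16).
    return {t: c / per_state[t[0]] for t, c in count.items()}

def bc_score(probs, parse_transitions):
    # Geometric mean of the action probabilities along the parse (equation 2.17).
    logs = [math.log(probs[t]) for t in parse_transitions]
    return math.exp(sum(logs) / len(logs))
```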
2.6 Probabilistic Generalized LR (PGLR)

It seems that Briscoe and Carroll have efficiently grasped the advantages of LR parsing through their assignation of a probabilistic model, as described in Section 2.5. However, there is no explicit formalization of the probabilistic model in the paper published by Briscoe and Carroll [5]. Many dubious points are found in the assignment of probabilities to parsing actions, though Briscoe and Carroll tried to preserve the nature of the LR parsing scheme. We review the probabilistic model proposed by B&C in devising a new formalization of a probabilistic model for GLR parsing [25, 41].

Recall from the LR parsing algorithm [1]: the parser uses a stack to store a string of the form $s_0 X_1 s_1 X_2 s_2 \cdots X_m s_m$, where $s_m$ is the state on top of the current stack. Each $X_i$ is a grammar symbol and each $s_i$ is a symbol called a state. Each state symbol summarizes the information contained in the stack below it, and the combination of the state symbol on top of the stack and the current input symbol is used to index the parsing table and determine the shift-reduce parsing decision.

Following the parsing algorithm above, we can obtain a parse tree if we follow the trace of stack transitions at the completion of the parse (i.e. the parse ends with an accept operation). Therefore, unlike B&C, we regard a parse tree as a sequence of transitions between LR parse stacks ($T$) as shown in (2.20), where $\sigma_i$ is the stack at time $t_i$, $a_i$ is an action, and $l_i$ is an input symbol. Schema (2.20) shows the inherent divergence from B&C in our definition of a parse derivation (cf. schema (2.10)).

$$\sigma_0 \xrightarrow{l_1, a_1} \sigma_1 \xrightarrow{l_2, a_2} \cdots \xrightarrow{l_{n-1}, a_{n-1}} \sigma_{n-1} \xrightarrow{l_n, a_n} \sigma_n \qquad (2.20)$$

Based on the above definition, the probability of a complete stack transition sequence $T$ can be represented as:

$$P(T) = P(\sigma_0, l_1, a_1, \sigma_1, \ldots, \sigma_{n-1}, l_n, a_n, \sigma_n) \qquad (2.21)$$
$$= P(\sigma_0) \prod_{i=1}^{n} P(l_i, a_i, \sigma_i \mid \sigma_0, l_1, a_1, \sigma_1, \ldots, l_{i-1}, a_{i-1}, \sigma_{i-1}) \qquad (2.22)$$

Since $\sigma_0$ is the initial stack containing only the initial state $s_0$, and parsing always starts from the initial state, $P(\sigma_0)$ is equal to one. Assuming that the current stack $\sigma_{i-1}$ contains all details of its preceding parse derivation, and that neither the next input symbol $l_i$ nor the next action $a_i$ depends on the preceding input symbol or action, we can simplify equation (2.22) to:

$$P(T) = \prod_{i=1}^{n} P(l_i, a_i, \sigma_i \mid \sigma_{i-1}) \qquad (2.23)$$

To estimate each transition probability $P(l_i, a_i, \sigma_i \mid \sigma_{i-1})$, we decompose it as:

$$P(l_i, a_i, \sigma_i \mid \sigma_{i-1}) = P(l_i \mid \sigma_{i-1}) \, P(a_i \mid \sigma_{i-1}, l_i) \, P(\sigma_i \mid \sigma_{i-1}, l_i, a_i) \qquad (2.24)$$

The first term $P(l_i \mid \sigma_{i-1})$ is a conditional probability for estimating an input symbol $l_i$, given the current stack $\sigma_{i-1}$. According to the LR parsing algorithm, the input symbol after applying a reduce action remains unchanged. Therefore, we have to consider $P(l_i \mid \sigma_{i-1})$ separately in terms of the actions that are applied to reach the stack $\sigma_{i-1}$:

1. If the previous action $a_{i-1}$ is a shift action, and we assume that the stack-top state represents the stack information beneath it, then

$$P(l_i \mid \sigma_{i-1}) \approx P(l_i \mid s_{i-1}) \qquad (2.25)$$

2. If the previous action $a_{i-1}$ is a reduce action, the input symbol is not changed. Therefore,

$$P(l_i \mid \sigma_{i-1}) = 1 \qquad (2.26)$$

The second term $P(a_i \mid \sigma_{i-1}, l_i)$ is a conditional probability for estimating an action $a_i$, given the current stack $\sigma_{i-1}$ and input symbol $l_i$. By the same assumption as applied above, that the stack-top state represents the stack information beneath it, this term can be estimated by:

$$P(a_i \mid \sigma_{i-1}, l_i) \approx P(a_i \mid s_{i-1}, l_i) \qquad (2.27)$$

The third term $P(\sigma_i \mid \sigma_{i-1}, l_i, a_i)$ is equal to one, because the next stack $\sigma_i$ is uniquely determined given the current stack $\sigma_{i-1}$ and action $a_i$:

$$P(\sigma_i \mid \sigma_{i-1}, l_i, a_i) = 1 \qquad (2.28)$$

As a result, the first term of the transition probability (2.24) is estimated differently depending on the type of action applied to reach the current stack. Therefore, we divide states into two classes, namely the class of states reached after applying a shift action ($S_s$), and the class of states reached after applying a reduce action ($S_r$). Fortunately, these two classes are mutually exclusive, because state transition in LR parsing is representable by a deterministic finite automaton (DFA), which has at most one transition from each state on any input. A state can be reached by a unique grammar symbol, which distinguishes states that are reached by a terminal symbol from states that are reached by a non-terminal symbol. Therefore,

$$S = S_s \cup S_r \quad \text{and} \quad S_s \cap S_r = \emptyset \qquad (2.29)$$

To summarize, the transition probability can be rewritten as:

$$P(l_i, a_i, \sigma_i \mid \sigma_{i-1}) \approx \begin{cases} P(l_i, a_i \mid s_{i-1}) & (s_{i-1} \in S_s) \\ P(a_i \mid s_{i-1}, l_i) & (s_{i-1} \in S_r) \end{cases} \qquad (2.30)$$

such that:

$$\sum_{l \in La(s)} \sum_{a \in Act(s,l)} p(a) = 1 \quad (\text{for } s \in S_s) \qquad (2.31)$$
$$\sum_{a \in Act(s,l)} p(a) = 1 \quad (\text{for } s \in S_r) \qquad (2.32)$$

where $p(a)$ is the probability of an action $a$, $La(s)$ is the set of input symbols at state $s$, and $Act(s,l)$ is the set of actions corresponding to state $s$ and input symbol $l$. In other words, actions for states in $S_r$ are normalized by way of the pair of the state and input symbol, while actions for states in $S_s$ are normalized by way of the state.

The probability of a parse tree is the product of the action probabilities for stack transitions across the whole parse tree:

$$P(T) = \prod_{i=1}^{n} p(a_i) \qquad (2.33)$$

As stated previously, action probabilities are normalized differently according to state type. Therefore, the number of free parameters is less than that shown in equation (2.18):

$$F_{PGLR} < |A| - |S| \qquad (2.34)$$

The PGLR model is not only well-founded for probabilistic induction, but also requires a smaller number of free parameters than the B&C model.
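A minimal sketch of the PGLR normalization contrast with B&C, assuming training records of (state, input symbol, action) triples plus a precomputed membership set for S_s (the states reached by shifting a terminal symbol); the names and data layout are illustrative assumptions only.

```python
from collections import Counter, defaultdict

def train_pglr(transitions, shift_reached_states):
    """transitions: (state, input_symbol, action) triples from parsing the training set.
    shift_reached_states: the set S_s of states reached immediately after a shift."""
    count = Counter(transitions)
    norm = defaultdict(int)
    for (s, l, a), c in count.items():
        if s in shift_reached_states:
            norm[("state", s)] += c              # equation (2.31): normalize over the state
        else:
            norm[("state+symbol", s, l)] += c    # equation (2.32): normalize over (state, symbol)
    probs = {}
    for (s, l, a), c in count.items():
        key = ("state", s) if s in shift_reached_states else ("state+symbol", s, l)
        probs[(s, l, a)] = c / norm[key]
    return probs

def pglr_score(probs, parse_transitions):
    # Equation (2.33): a plain product, with no geometric mean and no re-estimation
    # of the input symbol after a reduce action.
    score = 1.0
    for t in parse_transitions:
        score *= probs[t]
    return score
```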
2.7 Related work
Probabilistic context-free grammars (PCFGs) are the most basic model which allows for the estimation of parse probabilities by regarding a parse tree as a set of applied grammar rules. PCFGs are powerful in terms of their simplicity, but fail to capture enough context for parse discrimination. Most probabilistic models are developed firstly by realizing a PCFG in a parsing paradigm, and later extending its context-sensitivity through various approaches. The more bits of information are included, the more complex the model becomes. In modeling probabilistic parsing, besides overcoming the problems of data sparseness and model tractability, we have to give consideration to increasing contextual information with minimal sacrifice of parsing efficiency. In addition to our proposed PGLR model and the three models selected for comparative performance evaluation, the following are representative bodies of research relating to probabilistic parsing, proposed within various parsing paradigms, including the LR and GLR parsing frameworks.

Su et al. [49, 15] distributed parse probabilities according to the node sequence of the rightmost derivation within the shift-reduce parsing framework. As an example, the parse tree in Figure 2.2 is decomposed into eight parsing phrase levels, as represented by the sentential form notation in Table 2.1.

[Figure 2.2: The decomposition of a parse tree into parsing phrase levels.]

The syntactic score for the parse tree is formulated as:
$$S_{syn}(Tree) \approx P(L_8, L_7, L_6 \mid L_5) \cdot P(L_5 \mid L_4) \cdot P(L_4, L_3 \mid L_2) \cdot P(L_2 \mid L_1)$$
$$\approx P(L_8 \mid L_5) \cdot P(L_5 \mid L_4) \cdot P(L_4 \mid L_2) \cdot P(L_2 \mid L_1) \qquad (2.35)$$
Table 2.1: Parsing phrase levels represented in sentential form.
L =f A
g
8
L =f B
C
L =f B
F G g
L =f B
F c
L =f B
c c g
7
6
5
4
3
g 4
g
4
L =f D E c c g 3
3
4
L =f D c c c g 2
2
3
4
L =f c c c c g 1
1
2
3
4
action reduce shift A
$
C G F
c
4
B
c
3
c
2
c
1
E D
The context for estimating the partial probabilities depends on the number of symbols in the left and right-hand sides, taking the parsing time into consideration. Equation (2.35) is the result of synchronizing the partial probabilities with the shift operations, aimed at solving the problem of bias toward parses involving fewer operations, since the number of shift operations is always the same for a given input string. Su et al. claim that considering the left and right contexts explicitly (e.g. as terminal and non-terminal symbols) yields better results than leaving the contexts implicitly defined, as is the case for states in the GLR parsing framework. In summary, Su et al. consider the (terminal or non-terminal) symbols in the sentential form notation as the context for applying a context-free grammar rule, when applying a shift operation.
Li [30] distributed bi-gram probabilities over an LR table. Since bi-gram contexts reflect the connectivity of terminal symbols only, probabilities are available only for shift actions, that is, for states in the $S_s$ class. Probabilities are normalized within each state, but in the case that conflicts occur in conditional pairs of state and input symbol, the best the algorithm can do is to distribute uniform probabilities over all such actions. Moreover, no probabilities are assigned to reduce actions and, in practice, all reduce action probabilities are assigned a value of one. The advantage of distributing bi-gram probabilities over the LR table, as compared to parsing with only bi-gram probabilities, is that the constraints from the base CFG in generating the LR table help exclude illegal bi-grams from the LR table.

Goddeau and Zue [21] introduced a probabilistic LR model for word prediction $P(w_i \mid w_0, \ldots, w_{i-1})$ in speech understanding systems. The model estimates the probability of the next word conditioned on the current parsing state (the first term $P(w_i \mid Q_j)$ in equation (2.36)), with the current parsing state conditioned on the previous sequence of words (the second term $P(Q_j \mid w_0, \ldots, w_{i-1})$ in equation (2.36)):

$$P(w_i \mid w_0, \ldots, w_{i-1}) \approx \sum_{j} P(w_i \mid Q_j) \, P(Q_j \mid w_0, \ldots, w_{i-1}) \qquad (2.36)$$

By assuming that the grammar is deterministic, $P(Q_j \mid w_0, \ldots, w_{i-1})$ becomes equal to one, therefore producing:

$$P(w_0, \ldots, w_{i-1}) = \prod_{i} P(w_i \mid w_0, \ldots, w_{i-1}) \approx \prod_{i} P(w_i \mid Q_{j_i}) \qquad (2.37)$$
In summary, this model performs word prediction at each state, which is similar to the first term of equation (2.24) for the PGLR model, as well as to Li's bi-gram model for LR parsing, except for Goddeau and Zue's assumption of determinism in the grammar. The main purpose in introducing this probabilistic LR model was to limit the search space in speech understanding systems.

Kita et al. [27] extended the contextual information for CFGs by considering the previously applied CFG rules, in a bigram model of production rules. The model assumes that the structure of a sentence is represented by its derivation, which is a linear sequence of production rule applications:

$$S \stackrel{r_1}{\Rightarrow} \alpha_1 \stackrel{r_2}{\Rightarrow} \alpha_2 \stackrel{r_3}{\Rightarrow} \cdots \stackrel{r_n}{\Rightarrow} \alpha_n = x$$

where $S$ denotes the start symbol, and $\{r_1, r_2, \ldots, r_n\}$ is the rule sequence applied to derive $x$. The probability of the derivation $D$ is then calculated as follows:

$$P(D = r_1, \ldots, r_n) = P(r_1 \mid \#) \, P(r_2 \mid r_1) \prod_{k=3}^{n} P(r_k \mid r_{k-1}) \, P(\# \mid r_n)$$

where $\#$ indicates the boundary marker. The probability of a production rule is dependent on the previously applied rule. In terms of probability, it is claimed that the model has a kind of context-sensitivity. The experimental results showed a significant improvement in parsing accuracy over the original PCFG.

Black et al. [3] proposed a probabilistic model, called History-Based Grammar (HBG), to be able to capture richer context for use in the parse disambiguation process. This model allows for the incorporation of lexical, syntactic, semantic and structural information:
$$P(Syn, Sem, R, H_1, H_2 \mid Syn_p, Sem_p, R_p, I_{pc}, H_{1p}, H_{2p}) \qquad (2.38)$$

HBG predicts the syntactic ($Syn$) and semantic ($Sem$) labels of a constituent, its rewrite rule ($R$), and its two lexical heads ($H_1$ and $H_2$), using the labels of the parent constituent ($Syn_p$ and $Sem_p$), the parent's lexical heads ($H_{1p}$ and $H_{2p}$), the parent's rewrite rule ($R_p$) that leads to the constituent, and the constituent's index ($I_{pc}$) as a child of $R_p$. The probability of each constituent is estimated by the following five factors:

1. $P(Syn \mid R_p, I_{pc}, H_{1p}, Syn_p, Sem_p)$
2. $P(Sem \mid Syn, R_p, I_{pc}, H_{1p}, H_{2p}, Syn_p, Sem_p)$
3. $P(R \mid Syn, Sem, R_p, I_{pc}, H_{1p}, Syn_p, Sem_p)$
4. $P(H_1 \mid R, Syn, Sem, R_p, I_{pc}, H_{1p}, H_{2p})$
5. $P(H_2 \mid H_1, R, Syn, Sem, R_p, I_{pc}, Syn_p)$

Figure 2.3 is an example of a parse tree in the HBG model. The model is trained using a decision-tree algorithm. HBG has the advantages that semantic information can be included and that the range of parse history can be extended across multiple sentences to access discourse information.

Magerman et al. developed Pearl [33, 34, 31] (Probabilistic Earley-style Parser) by distributing context-sensitive conditional probabilities within the bottom-up chart parsing algorithm, through the use of Earley-type top-down prediction for the rule candidates. The probability of a parse $T$, given the word sequence of a sentence $S$, is estimated by assuming that each non-terminal and its immediate children are dependent on that non-terminal's siblings and parent, based on the part-of-speech tri-gram centered at the beginning of that rule:

$$P(T \mid S) \approx \prod_{A \in T} P(A \rightarrow \alpha \mid C \rightarrow \beta A \gamma,\ a_0 a_1 a_2) \qquad (2.39)$$
[Figure 2.3: Sample representation of "with a list" in the HBG model.]

where $C$ is the non-terminal node which immediately dominates $A$, $a_1$ is the part-of-speech associated with the leftmost word of constituent $A$, and $a_0$ and $a_2$ are the parts-of-speech of the words to the left and to the right of $a_1$, respectively, as illustrated in Figure 2.4. The part-of-speech tri-gram for $a_0 a_1 a_2$ is computed using the mutual information of the tri-gram, $\frac{P(a_0\, a_1\, a_2)}{P(a_0\, x\, a_2)\,P(a_1)}$, where $x$ is any part-of-speech.

[Figure 2.4: Context in an arbitrary parse tree for the Pearl model.]

More precisely stated, Magerman et al. proposed using the geometric mean of the partial probabilities to score the likelihood of each parse.
Picky [35] (Probabilistic CKY-like Parser) is the successor of Pearl and uses the same scoring function. It was developed to overcome the shortcomings of Pearl in terms of parsing eciency, accuracy and robustness. The probabilistic models used in Picky are independent of the parsing algorithm, and therefore the parser eciency is maintained. Three phases of probabilistic prediction (i.e. covered left-corner, covered bidirectional and tree completion) are introduced to reduce the number of edges produced by the CKY parsing algorithm in bottom-up Chart parsing. SPATTER [32] (Statistical Pattern Recognizer) was later proposed by Magerman using a statistical decision-tree model to estimate parse probabilities. The parse probability is represented in the form of a tree grown by constructing nodes bottom-up from left-toright, as shown in the state order given for each node in Figure 2.5. Training is done by asking about the feature values of neighboring nodes in each state transition.
Stolcke [48] succeeded in distributing probabilities for context-free grammar production rules into the prediction and completion operations of the Earley parsing algorithm [18, 2]. His eorts put the computation of pre x probabilities to practical use,
which facilitates the computation of the most likely (Viterbi) parse of the input string. Probabilities are computed incrementally in a single left-to-right pass over the input by making use of Earley's top-down control structure. As a result, the parser eciently facilitates online pruning based on the probabilities from the original PCFG. However, the resultant parse probabilities are identical to the original PCFG, which is notorious in not being able to capture sucient contextual information.
Charniak [12] pointed out the potential of parsing with a grammar induced from tree-
bank data. A treebank grammar is a context-free grammar created by reading production rules directly from hand-parsed sentences in a treebank. Charniak showed that his induced grammar outperformed all other non-word-based statistical parsers/grammars on the Penn Wall Street Journal Treebank [11]. He later proposed a statistical parsing model with the grammar in [12]. The parse probability is lexicalized to words and the head word of the constituent (its most important lexical item). The model is composed of two joint probabilities, approximated using the deleted interpolation method for smoothing:
P(s | h, t, l) = λ_1(e) P̂(s | h, t, l) + λ_2(e) P̂(s | ch, t, l) + λ_3(e) P̂(s | t, l) + λ_4(e) P̂(s | t)    (2.40)

The estimation of the head s is conditioned on its type l, the type of its parent constituent t, and the head of the parent constituent h; ch is the cluster containing h. Additionally,

P(r | h, t, l) = λ_1(e) P̂(r | h, t, l) + λ_2(e) P̂(r | h, t) + λ_3(e) P̂(r | ch, t) + λ_4(e) P̂(r | t, l) + λ_5(e) P̂(r | t)    (2.41)
estimates the probability of expanding grammar rule r for a constituent. Charniak's model is reported to outperform Magerman [32] and Collins [17] in the PARSEVAL measures on the Penn Wall Street Journal Treebank.
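To make the form of equations (2.40) and (2.41) concrete, the following sketch shows how such a deleted-interpolation estimate combines progressively less specific conditional estimates. The counts and interpolation weights are hypothetical, illustrative values, not Charniak's actual parameters.

    # Sketch of deleted interpolation as in equation (2.40): progressively
    # less specific conditional estimates are mixed with weights summing to 1.
    # The weights lambda_i(e) would normally depend on the amount of training
    # data e; here they are fixed, assumed values.
    def interpolate(estimates, weights):
        """Combine conditional probability estimates P_hat with weights."""
        assert abs(sum(weights) - 1.0) < 1e-9
        return sum(w * p for w, p in zip(weights, estimates))

    # Hypothetical relative-frequency estimates for P(s | context), from the
    # most specific context (h, t, l) down to the least specific (t).
    p_s_given_htl = 0.02    # P_hat(s | h, t, l)
    p_s_given_chtl = 0.05   # P_hat(s | ch, t, l), ch = cluster containing h
    p_s_given_tl = 0.10     # P_hat(s | t, l)
    p_s_given_t = 0.15      # P_hat(s | t)

    weights = [0.4, 0.3, 0.2, 0.1]   # lambda_1 .. lambda_4, assumed values
    p_s = interpolate([p_s_given_htl, p_s_given_chtl, p_s_given_tl, p_s_given_t], weights)
    print(p_s)   # smoothed estimate of P(s | h, t, l)

The same pattern applies to equation (2.41), with five components instead of four.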
2.8 Summary

In this chapter, we formalized various probabilistic models for estimating parse probabilities, and compared their context-capturing capabilities and their complexity in terms of the number of free parameters. PCFG is the simplest of the models considered and involves only the context encoded within CFG rules. The small number of training parameters required by this model is attractive. Chitrao and Grishman [16] proposed the two-level PCFG model to extend the context of PCFG. Charniak and Carroll [13] showed that the two-level PCFG model is a closer approximation of a language model than PCFG by demonstrating a decrease in test-set perplexity over the original PCFG. In the GLR parsing framework, Wright and Wrigley [60] introduced a method of distributing PCFG rule probabilities over the actions of an LR table. However, the resultant parse probability estimation is equivalent to that of the original PCFG; the only apparent advantage of Wright and Wrigley's method is the ability to prune off less probable parses during GLR parsing. Briscoe and Carroll (B&C) seem to be able to make use of the advantages
of expanded context in GLR parsing. Without a proper model formalization, however, the B&C model requires an undesirably large number of free parameters for subdividing reduce action probabilities and re-estimating the input symbol after applying a reduce action. The PGLR model distributes probabilities in close parallel to the intrinsic nature of the GLR parsing algorithm. PGLR is well-founded, and requires that probabilities be assigned only to actions. The total number of free parameters is also less than the difference between the total number of actions and states. The PGLR model is thus expected to provide an effective probability distribution. We present the results of model evaluation in Chapter 5 and discuss the differences in model performance in Chapter 6. We can summarize the differences between B&C and our PGLR model as follows:
Model estimation
  B&C : Each parse tree is regarded as a sequence of state transitions.
  PGLR : Each parse tree is regarded as a sequence of stack transitions.

Normalization
  B&C : Action probabilities are normalized within each state.
  PGLR : Action probabilities are normalized differently according to the state membership, i.e. in Ss or Sr. This is because the input symbol is not changed after applying a reduce action, and therefore estimation of the next input symbol after applying a reduce action must be excluded.

Action probabilities
  B&C : Probabilities for reduce actions are subdivided according to the state reached after applying the action, aiming at capturing the left context during the parse.
  PGLR : Each action has one probability. The topmost state of the stack provides adequate representation of the stack contents.

Parse probabilities
  B&C : The parse probability is taken as the geometric mean of the action probabilities applied in the parse, to avoid bias in favor of parses involving fewer rules, or equivalently smaller trees.
  PGLR : The parse probability is taken as the product of the action probabilities applied in the parse.
Chapter 3
Context-Sensitivity in GLR Parsing

Like other probabilistic parsing frameworks, we need appropriate contextual information in probabilistic GLR parsing. PGLR, as described in Chapter 2, is modeled on the nature of the GLR parsing algorithm. A parse derivation can then be represented by a sequence of stack transitions. The state on top of the stack at each parsing time is assumed to adequately capture all necessary information contained in the stack, because the pairing of the state and input symbol is used to index the parsing action in the LR table, in line with the nature of GLR parsing. PGLR is modeled by capturing the context during stack transitions.
3.1 Context in GLR parsing

Generalized LR parsers [55] are driven by a precompiled LR table, generated from a context-free grammar. Though the grammar used in generating the table is context-free, the nature of the table and the manner in which the LR parser is driven make the parser mildly context-sensitive. Briscoe and Carroll raised this issue in [5] but misinterpreted some aspects of the context sensitivity, causing the number of free parameters to increase unnecessarily and resulting in an unexpectedly complicated probabilistic model. We will come back to this point after describing the nature of the context-sensitivity of GLR parsing. During the process of generating an LR table, each state in the table is generated by applying the goto function (as described in [1]) to the previous state (s_i = goto[s_{i-1}, X_i]).
Each new state (s_i) is generated by consulting the previous state (s_{i-1}) and a grammar symbol (X_i), where X_i is a terminal symbol in the case of the next input symbol being incorporated onto the stack, or a non-terminal symbol when the stack is reduced by an appropriate reduction rule. Each state in the LR parsing table thus contains local left context for the parser. On the other hand, at any parsing time, the action in the LR table is determined by the pair of state and input symbol (a_{i+1} = action[s_i, l_{i+1}]). This means that at time t_i, when the parser has come to a state (s_i), the parser will consider the next input symbol (l_{i+1}) in determining the next action (a_{i+1}). The next input symbol here provides the parser with a right context to aid in the determination process. Basically, the context taken into account during parsing is limited to one viable state and one input symbol. Since the parse history is pushed onto the stack, we can use it to fully access the left context. Considering a parse derivation as a sequence of stack transitions, parsing with a GLR parser allows us to include as much left context as needed. However, the model will become sparse if we try to include too much context. Moreover, by the nature of GLR parsing, we can take only one input symbol as the right context.
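The two table lookups described above can be pictured with a minimal sketch. The state numbers and grammar symbols below are purely illustrative and do not correspond to any table in this thesis; the point is that the goto function extends the left context one symbol at a time, while the action function consults only the topmost state and one lookahead symbol.

    # Minimal sketch of the two LR-table lookups:
    #   s_i     = goto[s_{i-1}, X_i]       (table construction / after reductions)
    #   a_{i+1} = action[s_i, l_{i+1}]     (at parsing time)
    # All state numbers and symbols are illustrative placeholders.
    goto = {(0, "n"): 1, (0, "NP"): 2, (2, "v"): 3}        # (state, grammar symbol) -> state
    action = {(0, "n"): ["sh1"], (1, "v"): ["re:NP->n"],
              (3, "$"): ["re:VP->v", "sh4"]}               # a shift/reduce conflict

    def next_actions(state, lookahead):
        # The parsing context is exactly this pair: one state and one input symbol.
        return action.get((state, lookahead), [])

    print(next_actions(3, "$"))   # ['re:VP->v', 'sh4'] -> a conflict to be ranked probabilistically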
3.2 Action conflicts

CFGs cause ambiguity in parsing. In parsing natural language, ambiguity inevitably occurs for a fixed state and input symbol: there is more than one action corresponding to the given state and input symbol. Two cases of potential action conflict exist: reduce/reduce conflicts (Figure 3.1) and shift/reduce conflicts (Figures 3.2 and 3.3). Due to the properties of LR tables, shift/shift conflicts never occur. Let us consider the case of parsing with a grammar for constructing a binary tree. Reduce/reduce conflicts represent the dilemma of deciding which non-terminal label should be associated with the structure (for instance, Xa or Xb in Figure 3.1), and shift/reduce conflicts constitute the problem of deciding whether to incorporate the next input symbol (d) onto the stack and delay construction of hierarchical structure to the next step (see Figure 3.2), or to create structure based on the existing stack (see Figure 3.3). In the case of parsing with only a PCFG, the most we can do is to assign a probability to each rule, based on the premise that the probabilities for all the rules that expand a common non-terminal must sum to one.

Figure 3.1: Reduce/reduce conflict.

Disambiguation in the case of Figure 3.1 is then made based on comparison between the probabilities for [Xa → a b] and [Xb → a b]. The same methodology is used to disambiguate between Figures 3.2 and 3.3. For both conflict types, the LR parser provides left/right context to distinguish probabilities for reducing to the same non-terminal. The overall probability for reducing to Xa may be higher than for Xb, but in some context, such as that of Figure 3.1, the probability for reducing to Xb can be higher than for Xa. The LR parser provides the context of the state number (si) and input symbol (c) to determine the parse preference. Similarly, the context of state number (si) and input symbol (d) in Figures 3.2 and 3.3 can be used to give preference to either the shift or the reduce action. Briscoe and Carroll [5] have pointed out some examples of NL phenomena that an LR parser can inherently handle. In the example of "he loves her", an LR parser can distinguish between the contexts for reducing the first pronoun and the second pronoun to NP, given that the next input symbol after "he" is "loves", while that for "her" is the sentence end marker ($). However, this does not work if the next input symbols are the same, as for the reduction of the pronouns in "he passionately loves her" and "he loves her passionately". As previously mentioned, Briscoe and Carroll proposed an approach of subdividing reduce actions according to the state reached after that reduce action is applied. The purpose of this approach is to distinguish between reductions using the left context of the reduction rule.
Figure 3.2: Shift preference for a shift/reduce conflict.

Figure 3.3: Reduce preference for a shift/reduce conflict.

Subdividing reduce actions according to the state reached after the reduce action is one of the factors that leads to Briscoe and Carroll's model being unexpectedly complicated and including an unnecessarily large number of free parameters. As described above regarding the left context of the LR parsing scheme, every state is generated by consulting the previous state and a grammar symbol. The states therefore contain some local context, with the degree of context depending directly on the type of table, namely SLR, LALR or CLR (see [1]). Furthermore, the states reached after reduce actions (for instance, si and sx in Figure 3.4) are determined deterministically. This is accounted for in our probabilistic modeling in Chapter 2.
Figure 3.4: Reduction of [Xa → a b] in different contexts.
The context sensitivity when parsing with an LALR versus a CLR table is different, because during the process of generating an LALR table, states are merged together if they fulfill the requirement of having the same core in their LR items [1]. As a result, the number of states in an LALR table is often drastically decreased compared to a CLR table. Therefore, the left context contained in the states of an LALR table is less than that for the states of a CLR table (for further discussion see [26]). Despite this, however, the results of an experiment presented in Chapter 5 confirm that parsing with an LALR table does not significantly decrease accuracy.
3.3 An example of GLR parsing with ambiguity

Suppose that we have a CFG written as in Grammar-1:

Grammar-1:
(1) S → S S
(2) S → x

The LR table for Grammar-1 is generated as shown in Table 3.1. An action conflict
occurs at state 3 with the input symbol `x'. This means that the parser can derive two distinct parses at state 3 when `x' is the input symbol.

Table 3.1: LR table generated from Grammar-1.

            action            goto
  state     x          $      S
    0       sh1               2
    1       re2        re2
    2       sh1        acc    3
    3       re1/sh1    re1    3
For example, Figure 3.5 shows the two possible parse trees for the input string `x x x'.

Figure 3.5: Parse trees for the input string `x x x', based on Grammar-1. (Tree (A) applies re1 at the conflict point, yielding the left-branching structure [S [S [S x] [S x]] [S x]]; tree (B) applies sh1, yielding the right-branching structure [S [S x] [S [S x] [S x]]].)
Grammar-1 allows both parse trees for the input string `x x x'. This actually occurs
in parsing natural language sentences, with most of the parses being nonsensical. With the grammar alone, it is not possible to eliminate the nonsensical parses and pick out the single required parse. In this case, the context in GLR parsing can be used to differentiate between the parses. If parse tree (A) is more probable than (B), then the re1 action is preferable to the sh1 action at state 3 with `x' as the input symbol, and vice versa. This is an example of using parse context to give preference to a parse. Parsing natural language sentences is not as simple as in this example: millions of parses can easily be generated for a single input when using a simple CFG. However, the kind of context described above can naturally be used in generating a probability model for GLR parsing.
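The preference described here can be sketched as follows for the conflict in Table 3.1. The action probabilities are assumed, illustrative values; in PGLR they would be estimated from a training treebank, as described in Chapter 4.

    # Illustrative sketch: choosing between re1 and sh1 at state 3 with
    # lookahead `x' in the LR table of Grammar-1. The probabilities below
    # are assumed values, not trained ones.
    conflict_probs = {
        (3, "x", "re1"): 0.7,   # assumed: reducing S -> S S is preferred here
        (3, "x", "sh1"): 0.3,   # assumed: shifting the next `x' is less likely
    }

    def preferred_action(state, lookahead, actions):
        """Pick the conflicting action with the highest probability."""
        return max(actions, key=lambda a: conflict_probs.get((state, lookahead, a), 0.0))

    print(preferred_action(3, "x", ["re1", "sh1"]))   # 're1' -> parse tree (A) preferred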
3.4 Summary

Parsing with a GLR parser provides its own particular context for parse disambiguation. When the grammar is a CFG that is not an LR grammar, action conflicts can readily occur in the LR table. We have crafted PGLR carefully to capture the appropriate context during the parsing process. The wide coverage of CFGs can occasionally allow for millions of distinct parses, many of which may be nonsensical. The actions in the LR table are determined by the pairing of a state and input symbol. This pair can then be effectively used as the context for parsing. The context in GLR parsing can be interpreted differently according to the model used, such that the context in PGLR is different from that in B&C. Clarifying the effects of context in this chapter significantly supports our PGLR modeling in capturing context during GLR parsing.
Chapter 4
Incorporating Probability into an LR Table

The PGLR model requires only probabilities for each action. Normalization of action probabilities depends upon the state type. The two classes of states are mutually exclusive, and therefore it is practical to directly associate a probability with each action in the LR table.
In this chapter, we use a simple example to explain the process of estimating action probabilities from a training set, and incorporating them into the LR table. To show the difference between the B&C and PGLR models, we also compute the probabilities according to B&C and incorporate them into the same LR table. The PGLR model described in Chapter 2 normalizes probabilities differently depending on whether a state belongs to Ss or Sr. The states are distinguishable because of the property of GLR parsing that the two state classes are mutually exclusive: any state can be reached only by way of a particular grammar symbol, a fact which is recoverable from the deterministic finite automaton nature of LR tables. States in the Sr class are referred to by transitions in the goto part of the table, and all other states are in the Ss class.
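Since the two classes are recoverable directly from the table, identifying them is straightforward. The sketch below uses a toy goto table with made-up state numbers, not one of the tables in this thesis, to illustrate the classification.

    # Sketch: states referred to in the goto part of the LR table form the
    # S_r class; all remaining states form the S_s class. The toy table is
    # purely illustrative.
    all_states = set(range(6))
    goto_table = {(0, "S"): 4, (0, "U"): 1, (2, "V"): 5}   # (state, non-terminal) -> state

    S_r = set(goto_table.values())    # states reached via a non-terminal (after a reduce)
    S_s = all_states - S_r            # states reached via a terminal (after a shift), plus state 0

    print(sorted(S_r), sorted(S_s))   # the two classes are mutually exclusive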
4.1 An example of incorporating probabilities into an LR table

Suppose that we have the simple context-free grammar shown in Grammar-2. Based on this grammar, we generate the LR table shown in Table 4.1. Say that the parse trees in Figure 4.1 are the bracketed sentences in our training set. These sentences are licensed by Grammar-2. In the training set, there are four parse trees, (S1-a), (S1-b), (S2-a), and (S2-b), which occur 1, 2, 3, and 4 times, respectively.
Grammar-2:
(1) S → V
(2) S → a V
(3) S → V a
(4) U → a
(5) V → a U
(6) V → U a

Table 4.1: LR table generated from Grammar-2.

            Action             Goto
  State     a           $      S    U    V
    0       sh2                4    1    3
    1       sh5
    2       sh7/re4                 8    6
    3       sh9         re1
    4                   acc
    5       re6         re6
    6                   re2
    7       re4/sh10    re4         11
    8       re5/sh5     re5
    9                   re3
   10                   re4
   11                   re5
We train our LR parser in the supervised mode, using a correctly hand-annotated corpus to guide the parser in its extraction of the sequence of actions required to produce the correct parse. We adapt our LR parser to be able to execute parsing actions based on the label-bracketed sentences. The state numbers shown in Figure 4.1 are the state nodes in the transitions during the parsing of each parse tree. We count the frequency of application of each action in parsing each sentence in the training set. In practice, we add a part of a count to each action that appears in the table, to smooth the probabilities of unobserved events. Finally, each action probability is computed according to the state class which that action belongs to.

Figure 4.1: The four parse tree types in the training set. Values in square brackets indicate the frequency of each parse tree in the training set. (The trees are (S1-a) = [S [V [U a] a]] [1], (S1-b) = [S [V a [U a]]] [2], (S2-a) = [S a [V a [U a]]] [3] and (S2-b) = [S a [V [U a] a]] [4], annotated with the LR states visited during parsing.)

As described in the summary of the differences between B&C and PGLR in Section 2.8, differences arise in the computed probabilities. In Table 4.2, the probabilities computed according to B&C and according to PGLR are given in separate columns for each action. It is important to note that B&C normalizes the action probabilities within each state, while PGLR distinguishes probability normalization according to state type. State numbers in parentheses are in the Sr class. It is immediately evident that states in the Sr class are exclusively those indicated in the goto part of the LR table. The action probabilities for states 5 and 8 in Table 4.2 explicitly illustrate the probabilistic differences between B&C and PGLR. Based on B&C, the probability for the reduce action at state 5 with $ as the input symbol is subdivided according to the states reached after applying the reduce action re6, namely state 0 in (S1-a) and state 2 in (S2-b), as shown in Figure 4.1. For the same state 5, PGLR maintains only one probability for the reduce action re6. The action probabilities for B&C in state 8 are normalized within the state, while PGLR normalizes the probabilities within each combination of state and input symbol, because state 8 belongs to the Sr class. Neither model distributes parse probabilities over the actions in the goto part of the LR table, because these actions are all uniquely determined after reduce actions are applied.
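The difference in normalization can be made concrete with a small sketch. The counts below are illustrative placeholders shaped like the entries of Table 4.2, and B&C's additional subdivision of reduce actions by the state reached after the reduce is not shown.

    # Sketch of the two normalization schemes for action counts gathered in training.
    # counts[state][(lookahead, action)] = frequency; the numbers are illustrative.
    counts = {
        7: {("a", "re4"): 4, ("a", "sh10"): 3, ("$", "re4"): 2},   # a state in S_s
        8: {("a", "sh5"): 4, ("$", "re5"): 2},                     # a state in S_r
    }
    S_r = {8}

    def bc_probs(state):
        # B&C: normalize over all (lookahead, action) pairs within the state.
        total = sum(counts[state].values())
        return {k: c / total for k, c in counts[state].items()}

    def pglr_probs(state):
        # PGLR: S_s states are normalized over the whole state (the next input
        # symbol is predicted); S_r states are normalized per lookahead symbol
        # (the lookahead is already known after a reduce).
        if state not in S_r:
            return bc_probs(state)
        probs = {}
        for (look, act), c in counts[state].items():
            denom = sum(c2 for (l2, _), c2 in counts[state].items() if l2 == look)
            probs[(look, act)] = c / denom
        return probs

    print(bc_probs(8))     # {('a','sh5'): 0.66..., ('$','re5'): 0.33...}
    print(pglr_probs(8))   # {('a','sh5'): 1.0, ('$','re5'): 1.0}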
4.2 Parse tree probability

Based on the probabilities in Table 4.2, let us compute the parse probabilities of each parse tree in the training set according to the models of B&C and PGLR. Figure 4.2 shows the action probability associated with the states in each parse tree. The underlined action probabilities are those assigned differently in PGLR. For B&C, at state 5 of trees (S1-a) and (S2-b), the probabilities for the action re6 are assigned differently because the state after the associated stack-pop operation is state 0 in tree (S1-a), but state 2 in tree (S2-b).
Table 4.2: LR table generated from Grammar-2, with associated probabilities. For each action, the training count is given in parentheses, followed by the probability estimated by B&C and by PGLR. State numbers in parentheses belong to the Sr class; the goto entries are as in Table 4.1.

  State    Input    Action (count)    B&C        PGLR
    0      a        sh2 (10)          1          1
   (1)     a        sh5 (1)           1          1
    2      a        sh7 (9)           .9         .9
    2      a        re4 (1)           .1         .1
   (3)     a        sh9               -          -
   (3)     $        re1 (3)           1          1
   (4)     $        acc (10)          1          1
    5      a        re6               -          -
    5      $        re6 (5)           .2 ; .8    1
   (6)     $        re2 (7)           1          1
    7      a        re4 (4)           .44        .44
    7      a        sh10 (3)          .33        .33
    7      $        re4 (2)           .22        .22
   (8)     a        re5               -          -
   (8)     a        sh5 (4)           .66        1
   (8)     $        re5 (2)           .33        1
    9      $        re3               -          -
   10      $        re4 (3)           1          1
   (11)    $        re5 (3)           1          1

(At state 5, the B&C probability of re6 is subdivided according to the state reached after the reduction: .2 for state 0 and .8 for state 2; PGLR keeps a single probability.)
Parse probabilities are computed as shown below each parse tree in Figure 4.2. Parse probabilities are simply the product of the applied action probabilities. In practice, B&C takes the geometric mean of the applied action probabilities as the parse probability, to avoid bias in favor of parses involving fewer rules, or equivalently smaller trees.
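A minimal sketch of the two scoring schemes, using the B&C and PGLR action probabilities of tree (S1-b) from Figure 4.2 as input:

    import math

    # Parse scores from a sequence of action probabilities: PGLR takes the plain
    # product; B&C additionally uses the geometric mean of the applied action
    # probabilities to avoid favouring smaller trees.
    def product_score(action_probs):
        return math.prod(action_probs)

    def geometric_mean_score(action_probs):
        return math.prod(action_probs) ** (1.0 / len(action_probs))

    bc_probs_s1b = [1, 0.9, 0.22, 0.33, 1, 1]    # B&C action probabilities for (S1-b)
    pglr_probs_s1b = [1, 0.9, 0.22, 1, 1, 1]     # PGLR action probabilities for (S1-b)

    print(product_score(bc_probs_s1b))           # about 0.065, as quoted in Figure 4.2
    print(geometric_mean_score(bc_probs_s1b))    # the geometric-mean score B&C would rank with
    print(product_score(pglr_probs_s1b))         # 0.198, the PGLR parse probability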
Figure 4.2: Parse probabilities of the four parse tree types.
(S1-a)[1]:  B&C: 1 x 0.1 x 1 x 0.2 x 1 x 1 = 0.02;        PGLR: 1 x 0.1 x 1 x 1 x 1 x 1 = 0.1
(S1-b)[2]:  B&C: 1 x 0.9 x 0.22 x 0.33 x 1 x 1 = 0.065;   PGLR: 1 x 0.9 x 0.22 x 1 x 1 x 1 = 0.198
(S2-a)[3]:  B&C: 1 x 0.9 x 0.33 x 1 x 1 x 1 x 1 = 0.297;  PGLR: 1 x 0.9 x 0.33 x 1 x 1 x 1 x 1 = 0.297
(S2-b)[4]:  B&C: 1 x 0.9 x 0.44 x 0.66 x 0.8 x 1 x 1 = 0.209;  PGLR: 1 x 0.9 x 0.44 x 1 x 1 x 1 x 1 = 0.396
4.3 Summary

In this chapter, we illustrated the estimation of action probabilities for PGLR, compared to that for B&C. Table 4.2 explicitly shows the differences in assigning action probabilities between these two models, and Figure 4.2 shows the different parse probabilities computed by the models. Compared with PGLR, action probability assignment in B&C is more complicated because it is necessary to determine the states that each reduce action
can reach after applying a stack-pop operation, in the process of LR table generation. In PGLR, by contrast, we only have to normalize action probabilities differently according to the class the states belong to. This can be done easily because the states in each class are mutually exclusive and easily identifiable by the state numbers in the goto part of the LR table. As a result, for the PGLR model, we can directly associate a probability with each action appearing in the LR table.
Chapter 5
Experimental Results

The performance of PGLR is evaluated against existing probabilistic models, i.e. B&C, two-level PCFG and PCFG, in terms of parsing accuracy, label precision and recall, and the PARSEVAL evaluation metrics. Comparative studies between LALR and CLR table-based PGLR are also conducted to confirm that the PGLR model is also effective for the widely used LALR table.
We evaluated the various probabilistic models, i.e. B&C, two-level PCFG and PCFG, against our proposed PGLR model, on a portion of the ATR Japanese corpus called the Spoken Language Database (SLDB) [50]. All models were trained with the same subset of the corpus; about 5% of the corpus was partitioned off for use as an open test set. We measured model performance in terms of parsing accuracy, label precision and recall, and the PARSEVAL evaluation metrics [4, 22]. The task for all models was character-based parsing. That is, given an input string of Japanese characters, each model produced probabilistically ranked parses together with the associated parse probabilities, computed as described in Chapter 2. As the dictionary used to generate word candidates and their corresponding parts-of-speech, we collected all the words used in the corpus. Each word in the dictionary was indexed with lexical probabilities P(wi|li), the probability of generating a word wi from an arbitrary part-of-speech li. Lexical probabilities were estimated from equation (2.3).
Footnote 1: A description of an alternative statistical method of generating a word dictionary from a corpus for non-segmenting languages can be found in [46, 40].
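The estimate of equation (2.3) amounts to a relative frequency over the training corpus. A minimal sketch, with made-up word/part-of-speech counts rather than the actual dictionary:

    from collections import Counter

    # Relative-frequency estimate of lexical probabilities P(w_i | l_i):
    # how often word w is generated from part-of-speech l in the training set.
    # The (word, pos) pairs below are made-up placeholders.
    pairs = [("de", "auxstem-desu"), ("ma", "auxstem-masu"),
             ("ma", "auxstem-masu"), ("su", "infl-masu-su")]

    pair_counts = Counter(pairs)
    pos_counts = Counter(pos for _, pos in pairs)

    def lexical_prob(word, pos):
        # P(w | l) = count(w, l) / count(l); unseen pairs get 0 before smoothing.
        return pair_counts[(word, pos)] / pos_counts[pos] if pos_counts[pos] else 0.0

    print(lexical_prob("ma", "auxstem-masu"))   # 1.0 with these toy counts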
Each model was trained with the same hand-annotated training set. For unseen events, we simply added a part of a count to smooth the model probabilities. Evaluation of the applicability of different smoothing methods is beyond the scope of this thesis, and we therefore applied this naive smoothing method equivalently to each model. At the end of this chapter, we also provide comparative results between LALR and CLR table-based PGLR in terms of performance and learning curves in parameter training. The aim of this comparison is to investigate the appropriateness of PGLR in capturing context in GLR parsing. The context in the states of an LALR table is, in fact, approximated from that of the original CLR table; however, the smaller number of free parameters and the advantages in parsing efficiency afforded through use of the LALR table make LALR a viable substitute for CLR table-based PGLR. In fact, parsing with an LALR table has been empirically shown to be optimal in terms of both parsing time and table size in the GLR parsing framework [28].
5.1 Ambiguity in parsing non-segmenting languages

An enormous increase in ambiguity results when parsing a sentence with no explicit word boundaries, as is the case for Japanese, Thai and Chinese [44]. The morphological analysis of such a language differs from affixal morphological analysis, which simply identifies permissible morphological operations on words in combining affixes and stems. Instead, the principal task of morphological analysis becomes identifying meaningful word sequences in a given character string. A diagrammatic representation of the ambiguity in the morphosyntactic analysis of a sentence in a non-segmenting language is given in Figure 5.1. The task of parsing sentences of a non-segmenting language therefore includes the processes of (i) word segmentation [38], in which the most probable sequences of word boundaries are identified from the input string of characters, (ii) part-of-speech tagging [14, 39, 45], in which the appropriate part-of-speech is determined for each word, and (iii) parse tree construction, in which the parser chooses between possible constituent structures. Most parsing models treat the task of parsing a non-segmenting language in a cascaded manner, where the above three processes are carried out independently. The word segmentation process produces the most likely solution for the part-of-speech tagging
process, and then the part-of-speech tagging process passes the most likely part-of-speech sequence to the parse tree construction process for finding the most likely parse tree. The integration of the word segmentation and part-of-speech tagging processes using a probabilistic tri-gram model has been successfully reported for both Japanese [36] and Thai [38]. Sornlertlamvanich et al. [44] introduced word semantic information for disambiguation in the syntactic parsing process. The syntactic ambiguity is reduced in an interleaving manner by using semantic constraints to exclude invalid interpretations of the syntactic structures. This approach is used to reduce the ambiguity left pending by the word segmentation and part-of-speech selection processes in parsing Thai. However, preparing the semantic information is a costly and labor-intensive job, and there are many arguable issues concerning its representation. In the GLR parsing framework, Tanaka et al. [52] successfully integrated morphological and syntactic analysis into a single framework, MSLR (Morpho-Syntactic LR) parsing, by holding the morphological and syntactic ambiguity in parallel during parsing. We therefore realize the task of parsing a non-segmenting language by implementing each probabilistic model on the MSLR parser. The parsing accuracy of each model can then be examined as a whole. The target of probabilistic parsing is to select the most probable parse tree, with appropriate part-of-speech labels and the corresponding word sequence, as the most likely interpretation of the input sentence. The accuracy of this estimation is the criterion for evaluating the performance of each model.

Figure 5.1: Ambiguity in parsing a non-segmenting language. (The input characters are expanded into candidate words by word segmentation, into candidate parts-of-speech by POS tagging, and into candidate trees by parse tree construction.)
5.2 ATR corpus and grammar

The "Spoken Language Database" (SLDB) is a treebank (a collection of sentences annotated with syntactic analyses, or "trees" for short) developed by ATR, based on Japanese dialogue. A portion of the corpus has been revised through application of a more detailed phrasal grammar developed by Tanaka et al. [51]. We randomly selected about 5% of the revised corpus for use as an open test set, and trained each parsing model with the remaining approximately 10,000 trees. Table 5.1 gives a breakdown of the corpus. The range and average of sentence length in the training and test sets are very close, from which it is plain that the test set was appropriately sampled from the corpus.
Table 5.1: The ATR corpus.

                 # of Sentences    # of Morphemes         # of Characters
                                   Range     Average      Range     Average
  Training set   10,361            1-34      6.69         2-58      12.57
  Test set       545               1-22      6.36         2-42      12.03
We implemented all the models using a GLR parser, and generated an LALR table from the context-free grammar governing the treebank, which contains 762 production rules, 137 non-terminal symbols and 407 terminal symbols. The grammar is a Japanese phrasal grammar developed for use in speech recognition tasks [51]. It is a revised version of the grammar used to develop the original ATR Japanese corpus [50]. The grammar has been successfully used in MSLR (Morpho-Syntactic LR) parsing for speech recognition tasks [52, 56]. The generated LALR table contained 856 states. The details of the table size are shown in Table 5.6.

Table 5.2: Average parse base (APB) of the ATR corpus, compared to other corpora.

  Corpus                          APB
  SUSANNE                         1.256
  SEC (Spoken English Corpus)     1.239
  ATR (character-based)           1.348
Table 5.2 shows the average parse base (APB) of the ATR test set compared with the SUSANNE and SEC corpora, as reported in [6, 10]. APB indicates the average degree of ambiguity of a sentence: APB to the power n is the expected number of parses of an n-word sentence, from which it is apparent that ambiguity grows for higher values of APB. In the case of the ATR corpus, the average length of a sentence in the test set is 12.03 characters, so a sentence has an ambiguity of 1.348^12.03 ≈ 36.32 parses on average, rising to about 1.348^42 ≈ 280,000 parses for the longest sentence of 42 characters. Table 5.2 shows that sentences in the ATR corpus are, on average, more ambiguous than those in the other two corpora.
Footnote 2: Briscoe and Carroll [6] defined APB as a measure of ambiguity for a given corpus. It is the geometric mean, over all sentences in the corpus, of the n-th root of p, where n is the number of words in a sentence and p is the number of parses for that sentence.
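As defined in the footnote, APB can be computed directly from per-sentence parse counts. The following sketch uses made-up (length, parse-count) pairs rather than the actual corpora:

    import math

    # Average parse base (APB): the geometric mean, over all sentences, of the
    # n-th root of p, where n is the sentence length and p its number of parses.
    # APB**n then approximates the expected number of parses of an n-unit sentence.
    def average_parse_base(sentences):
        """sentences: list of (length_n, num_parses_p) pairs (toy values below)."""
        logs = [math.log(p) / n for n, p in sentences if p > 0]
        return math.exp(sum(logs) / len(logs))

    toy_corpus = [(5, 3), (12, 40), (20, 700)]   # illustrative (n, p) pairs only
    apb = average_parse_base(toy_corpus)
    print(apb, apb ** 12.03)                     # ambiguity predicted for a 12.03-unit sentence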
5.3 Parsing the ATR corpus
We used the PARSEVAL measures [4, 22] to compare the performance of the top-ranked parses for each model. The bracketed values in Table 5.3 are the percentages of error rate reduction for PGLR versus the other models. In the case of the 2-42 character test set, PGLR reduces the error rate relative to the B&C model by about 58% and 75%, respectively, for the PA and 0-CB measures, and by about 87% and 69%, respectively, relative to the two-level PCFG model. These figures show that PGLR produces a significant improvement over the other models.

Table 5.3: Performance on the ATR corpus. PA is the parsing accuracy and indicates the percentage of top-ranked parses that match standard parses. LP/LR are label precision/recall. BP/BR are bracket precision/recall. 0-CB and m-CB are zero crossing brackets and mean crossing brackets per sentence, respectively.

2-42 Characters (545 sentences)
  Models           PA              LP      LR      BP      BR      0-CB            m-CB
  B&C              88.62 (58.1%)   97.72   97.50   98.48   98.05   93.94 (75.7%)   0.15
  Two-level PCFG   62.39 (87.3%)   96.28   95.32   98.61   97.38   95.23 (69.2%)   0.10
  PCFG             53.03 (89.8%)   95.67   94.54   98.77   97.35   94.86 (71.4%)   0.08
  PGLR             95.23           99.08   98.50   99.54   98.76   98.53           0.03

15-42 Characters (160 sentences)
  Models           PA              LP      LR      BP      BR      0-CB            m-CB
  B&C              73.75 (61.9%)   96.00   97.26   96.84   98.14   83.75 (76.9%)   0.44
  Two-level PCFG   56.25 (77.1%)   97.44   97.31   98.90   98.76   93.13 (45.4%)   0.18
  PCFG             35.62 (84.5%)   95.86   95.64   98.60   98.39   90.63 (60.0%)   0.17
  PGLR             90.00           98.98   98.99   99.49   99.50   96.25           0.08
Table 5.3 shows that the PGLR model outperformed the other models on every metric. Looking at the metrics of BP, BR and 0-CB, where all structural labels are ignored, every model returned very good results (> 93%). The disparity between models becomes significant when bracket labels are taken into consideration, as in the metrics LP/LR and PA. Therefore, the difficulty lies not in bracketing but in labeling (especially
in determining terminal symbols), which is an essential task in character-based parsing. One reason for this is that the context-free grammar used for this corpus is relatively restricted in terms of terminal symbol assignment. Information about word form (e.g. sahen-meisi, a Sino-Japanese verbal noun) and post-positions (e.g. -ga and -ni), for example, is explicitly included in the non-terminal symbol labels. The word connection constraints within the rules can therefore exclude many invalid word combinations. Results from a preliminary test on part-of-speech input sequences showed that structural ambiguity hardly ever occurred: most sentences involved no ambiguity. From this study, it was observed that most sentences had only one parse once the parts-of-speech for the component words were defined. The ambiguity increases, however, when we consider parsing input strings of characters. According to Table 5.2, the average parse base (APB) of the ATR test set is as high as 1.348 in the character-based measure. This is comparable with the SUSANNE corpus (1.256) and the SEC corpus (1.239). Therefore, the performance of character-based parsing in this test depends mainly on the accuracy of selecting words and their corresponding parts-of-speech. This means that a model that can provide local context in addition to the global context would tend to perform better. As expected, the models which make effective use of the local context according to the intrinsic nature of GLR parsing, namely B&C and PGLR, returned significantly higher results than PCFG-based parsing. Although the CFG rule context in two-level PCFG extends a level higher (i.e. to the parent of the reduced rule), this model still failed to include the appropriate context in some cases. One such case is discussed in Chapter 6. Parsing accuracy (PA) shows the percentage of correct parses that are ranked topmost according to the model probability. By this measure, PGLR maintained the highest accuracy in ranking parses, while the PCFG-based models dropped to slightly above 50%. Since the corpus is a spoken language database, it contains many short response utterances such as "Yes", "No" and "Take care". The lower part of Table 5.3 is therefore added to show the performance on sentences ranging from 15 to 42 characters. The difference in performance is maintained at a significant level even when the evaluation is performed on longer sentences. Parse performance partly depends on the grammar and the corpus. According to the
report of an experiment on the SUSANNE English corpus by Carroll [8], no significant difference between the performance of PGLR and B&C was observed. However, the input test set was a set of part-of-speech sequences, excluding ambiguity in word and part-of-speech selection. Even here, though, PGLR returned the best result in terms of the m-CB metric. A comparison of model performance on parsing part-of-speech inputs was also made on the original ATR Japanese corpus. For this, we used the full set of the ATR Japanese corpus. The results of this evaluation are presented in [43, 42, 41]; Tables 5.4 and 5.5 summarize the evaluation. Table 5.4 gives the number of morphemes and sentences. All trees are licensed by the original Japanese phrasal grammar developed at ATR [50]. We generated an LALR table from this grammar of 360 production rules, comprised of 67 non-terminal symbols and 41 terminal symbols; 376 states were generated in the LALR table.

Table 5.4: The full set of the original ATR Japanese corpus.

                 # of Sentences    # of Morphemes
                                   Range     Average
  Training set   19,586            2-60      12.08
  Test set       995               2-30      11.64
Table 5.5 shows the percentage of exact matches over four classes. "Exact-1" is the percentage of most probable parse trees that matched the standard parse. "Exact-5" is the percentage of parse outputs containing an exactly matching parse ranked within the top 5 parses, and so on. To ensure that the models rank their output parses accurately, we count the rank of the lowest parse among parses of the same probability.

Table 5.5: Performance on the ATR corpus with part-of-speech inputs.

  Ranking     PCFG      B&C       PGLR
  Exact-1     69.35%    74.67%    83.22%
  Exact-5     91.56%    89.05%    95.48%
  Exact-10    94.97%    92.26%    97.49%
  Exact-20    96.78%    95.08%    98.59%
The results of the non-word-based parsing evaluation also confirm that PGLR yields the
best performance. "Exact-1" is in fact the same measure as the parsing accuracy PA in Table 5.3. Note that the non-word-based parsing evaluation involved much longer sentences than the character-based parsing evaluation, as a comparison of the average numbers of morphemes in the two tables shows. The value of "Exact-1" in Table 5.5 is therefore slightly lower than that of PA in Table 5.3.
5.4 Model trainability

We investigated the trainability of each model by varying the number of sentences in the training set, and tested model performance on the same test set. Trainability relates directly to the number of free parameters required by the model. The estimated number of free parameters for each model is discussed in Chapter 2. The required numbers of free parameters can be ranked in ascending order as PCFG, two-level PCFG, PGLR and B&C. The number of free parameters required by B&C is vastly greater than for PGLR because of its methods of normalization and of subdividing reduce action probabilities. Figure 5.2 shows that PGLR requires a reasonable amount of training data, although its performance is much better than the other three models for all sizes of training data. The performance of two-level PCFG and PCFG saturates at a very early stage of training, and increases only slightly when compared to B&C and PGLR. Since PGLR involves relatively fewer free parameters than B&C, it yields very high performance with a small amount of training data. The performance of B&C improves as the amount of training data increases, but remains behind the performance of PGLR for all sizes of training data.

Figure 5.2: Trainability measure for PCFG, two-level PCFG, B&C and PGLR. (Parsing accuracy on 510 sentences (open test set) for different proportions of the training set; the x-axis gives the fraction of the 10,361 training sentences, from 1/32 to 1, and the y-axis the parsing accuracy in %.)
5.5 Comparative results for LALR and CLR table-based PGLR

The degree of context sensitivity when parsing with an LALR and a CLR table differs, because during the process of generating an LALR table, states are merged together if they fulfill the requirement of having the same core in their LR items [2, 1]. As a result, the number of states in an LALR table is drastically reduced compared to that in the
corresponding CLR table. Consequently, the left context encoded in the states of an LALR table is less than that of the states of a CLR table. We thus extended our experiment to examine the performance of PGLR using an LALR table (PGLR(LALR)) against that using a CLR table (PGLR(CLR)).

Table 5.6: Comparison of an LALR table and CLR table generated from the same ATR context-free grammar containing 762 rules, 137 non-terminal symbols and 407 terminal symbols.

                                     LALR table    CLR table
  Number of states                   856           3,715
  Number of shift actions            11,445        43,833
  Number of reduce actions           164,058       756,715
  Number of goto actions             4,682         19,733
  Number of states in Ss class       488           2,539
  Number of states in Sr class       368           1,176
The CLR table contains 3,715 states, more than fourfold the number of states in the corresponding LALR table (856 states) for the same grammar. Table 5.6 compares the table sizes for the LALR and CLR tables. The number of states and actions in a CLR table can be tens of times larger than in the LALR table for the same grammar [28]. Consequently, the number of free parameters in training the PGLR model with a CLR table is much greater than that for an LALR table, and the data sparseness problem becomes more severe when a CLR table is used.

Table 5.7: Performance on the ATR corpus. Comparative results for PGLR using an LALR and CLR table.

2-42 Characters (534 sentences)
  Models        PA      LP      LR      BP      BR      0-CB    m-CB
  PGLR(CLR)     95.13   99.04   98.40   99.46   98.61   97.57   0.04
  PGLR(LALR)    95.32   99.06   98.47   99.53   98.73   98.50   0.03
As expected, Table 5.7 shows that PGLR(LALR) returned slightly better results than PGLR(CLR), though the difference is not significant. In terms of parsing accuracy, Figure 5.3 shows that the two types of LR table perform almost identically for all sentence lengths. PGLR(LALR) yielded a parsing accuracy of 100%
at sentence lengths 21 and 22, while PGLR(CLR) did not return any correct parses there. However, the numbers of sentences at lengths 21 and 22 were 1 and 2 respectively, too small to give conclusive evidence on the relative accuracies of PGLR(LALR) and PGLR(CLR). Based on the above, we may conclude that, empirically, PGLR using an LALR table can perform as well as PGLR using a CLR table. State merging in generating an LALR table does not affect the parsing context a great deal; on the contrary, the drastic increase in the number of states for the CLR table causes parameter sparseness. In PGLR, we distribute parse probabilities to each action appearing in the table, so that the number of training parameters is identical to the number of distinct actions in the LR table. Compared with PGLR using a CLR table, the percentage of trained actions when using an LALR table increases significantly faster, as shown in Figure 5.4. The percentage of actions (both shift and reduce) in the LALR table observed in parsing the training set is around three times higher than for the CLR table. This means that actions in the LALR table are trained more effectively than those in the CLR table. Although the number of trained actions tends to increase with the number of training sentences, the number of actions observed only once in training saturates at an early stage, with most of the observed actions repeated as the training set is expanded. It may turn out that most of the actions in an LR table are never actually used, even though they are allowed under the original context-free grammar; this is especially the case for reduce actions, which show a very low learning curve. It may be possible to generate a more efficient LR table by excluding actions that cause the parser to generate unacceptable parses; however, it is beyond the scope of this thesis to discuss this possibility. Figure 5.5 explicitly shows that with a small training set, the results when using an LALR table are better than when using a CLR table. Training the model with only 1/8 of the training set, or about 1,250 sentences, PGLR(LALR) returned a parsing accuracy as high as 90%, while PGLR(CLR) needed more training sentences to catch up with the performance of PGLR(LALR). In the sense of context-sensitivity, states in a CLR table are more highly distinguishable than those in an LALR table, but the drastic increase in states for the CLR table causes data sparseness in training. Thus, whereas we would expect the predictability for a CLR table to be higher than that for an LALR table, LALR tables perform even better than
CLR tables in some cases, given the same size of training data. Considering the cost of table generation and parsing time, using an LALR table is much more efficient.

Figure 5.3: Parsing accuracy distribution over different sentence lengths, for PGLR(LALR) and PGLR(CLR). (Open test set of 534 sentences; the x-axis gives the sentence length in words, the y-axis the percentage of correct parses; the number of test sentences at each length is also plotted.)

Figure 5.4: Learning curve of actions in PGLR, using both an LALR and CLR table. (Percentage of trained shift and reduce actions, and of actions observed only once, as a function of the number of training sentences; the LALR table contains a total of 11,445 shift and 164,058 reduce actions, the CLR table 43,833 shift and 756,715 reduce actions.)

Figure 5.5: Trainability measure for PGLR using an LALR and CLR table. (Parsing accuracy on 510 sentences (open test set) for different fractions of the 10,361 training sentences.)
5.6 Summary

PGLR outperformed all other models, even though B&C claimed that their model adequately captures GLR parsing context, and despite two-level PCFG extending the PCFG context by one level, to the parent of each rule. B&C redundantly includes extra parameters both in normalizing Sr class states and in subdividing the reduce action probabilities; as a result, its performance is not as high as it could be. Two-level PCFG performed better than the original PCFG because of the extended rule context. Even so, there are many cases where two-level PCFG cannot correctly select between parse trees, especially when the parent nodes of rules with the same left-hand side (LHS) symbols are not distinguishable. We analyze the performance of each model and discuss the results comparatively in Chapter 6. From the evaluation of LALR-based against CLR-based PGLR, it was found that there is no significant difference between these two types of LR table. We conclude that the states in an LALR table are a good approximation of those in the corresponding CLR table. Considering parsing efficiency in terms of both parsing time and space requirements, LALR-based PGLR significantly outperformed CLR-based PGLR. Therefore, LALR-based PGLR suggests itself as the optimal choice for a probabilistic model of GLR parsing.
Chapter 6
Model Analyses

The optimal model must be able to rank parse trees by estimating parse probabilities maximally close to those observed in training.
In this chapter, we compare the results of ranking parse trees for each model. Each model is trained on a given set of parse trees. We then estimate the probabilities of the trained parse trees according to each model. This quantitative comparison gives an indication of the preciseness of the models in estimating parse probabilities. The parse preferences in the model analyses herein support the evaluation results discussed in Chapter 5.
6.1 Advantages of PGLR-based models over PCFG-based models

It is obvious that two-level PCFG shows the benefits of context-sensitivity and yields significant gains over the original PCFG model [16, 13]. However, the results are still well below those for the probabilistic GLR-based parsing models. One reason is the advantage of local context, i.e. the pre-terminal n-gram constraints encoded in the LR table. The n-gram constraints are distributed over the actions of the table, and therefore the parse trees generated by probabilistic GLR-based parsers incorporate pre-terminal n-gram constraints into the parse probabilities. The example below shows that probabilistic GLR-based parsing can successfully exploit the advantages of pre-terminal n-gram constraints, and assign parse probabilities in a more accurate manner. Based on Grammar-3, the three types of parse trees
shown in Figure 6.1 can be generated. Suppose that (S3) and (S4) are found one and two times respectively in our training set, but (S5) does not occur. (S5) may either be found very rarely, or never occur because it has no obvious meaning. This is the scenario actually faced by most wide-coverage grammars. The case shown in Figure 6.1 is simplified from one of the cases we found in our test set: selecting the appropriate part-of-speech for the sentence-ending word "su", which can be "infl-masu-su" or "infl-desu-su". It must be assigned "infl-masu-su" if it follows the word "ma" having the part-of-speech "auxstem-masu", and "infl-desu-su" if it follows the word "de" having the part-of-speech "auxstem-desu". In this case, only the pre-terminal n-gram constraints are effective, rather than the constraints from the parent node, which are the same in both cases. In Figure 6.1, `a', `b' and `c' correspond to "auxstem-masu", "auxstem-desu" and "infl-masu-su", respectively.
Grammar-3. A context-free grammar:
(1) X → U c
(2) X → U
(3) U → a
(4) U → b
Figure 6.1: Parse trees, with the frequency in the training set shown in square brackets: (S3) = [X [U a] c] [1], (S4) = [X [U b]] [2], (S5) = [X [U b] c] [0].
Probability Set-3. Rule probabilities for Grammar-3 estimated by two-level PCFG:
(1) S ; X → U c   (1/3)
(2) S ; X → U     (2/3)
(3) X ; U → a     (1/3)
(4) X ; U → b     (2/3)

The bracketed values given for Probability Set-3 are the rule probabilities estimated
according to the two-level PCFG model, from the training set in Figure 6.1. In fact, they are the same as for the basic PCFG, because the parents of rules (1) and (2) are identical, as are the parents of rules (3) and (4). That is, the extended context in two-level PCFG has no effect if the direct parents are the same. We need more information to distinguish the cases, but unfortunately there are no other parent nodes in this case. Table 6.1 is an LR table generated from Grammar-3. The probabilities below each action are estimated according to B&C and PGLR. For the sake of brevity, we do not consider any smoothing technique in this table, although smoothing was performed in the experiments described in Chapter 5. Applying the probabilities prepared in Probability Set-3 for two-level PCFG (as well as PCFG), and Table 6.1 for B&C and PGLR, to estimate the parse probabilities of (S3), (S4) and (S5) in Figure 6.1, we obtain the results shown in Table 6.2. Two-level PCFG (and PCFG) wrongly assigns preference to (S5) over (S3), despite (S5) never occurring in the training set. Although B&C yields the correct preference, the probabilities are smaller than they should be. In this case, there is no difference between B&C and PGLR in ranking the parses; the side-effects of the inappropriate normalization of probabilities in B&C have already been explored in [25] and empirically confirmed in the evaluation in Chapter 5. Section 6.2 illustrates the preciseness of PGLR compared with B&C.
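The PCFG (and, in this case, two-level PCFG) figures in Table 6.2 follow directly from multiplying rule probabilities. A small sketch reproducing them from Probability Set-3:

    # Reproducing the PCFG column of Table 6.2: a tree's probability is the
    # product of the probabilities of the rules it uses (Probability Set-3).
    rule_probs = {"X->U c": 1/3, "X->U": 2/3, "U->a": 1/3, "U->b": 2/3}

    trees = {
        "S3": ["X->U c", "U->a"],   # [X [U a] c]
        "S4": ["X->U", "U->b"],     # [X [U b]]
        "S5": ["X->U c", "U->b"],   # [X [U b] c], unseen in training
    }

    for name, rules in trees.items():
        p = 1.0
        for r in rules:
            p *= rule_probs[r]
        print(name, p)   # S3: 1/9, S4: 4/9, S5: 2/9 -> (S5) wrongly outranks (S3)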
6.2 Model preciseness of PGLR and B&C

Once again, let us consider Grammar-2, with the LR table and associated probabilities given in Table 4.2, and the parse trees in Figure 4.1 taken as training data.
Table 6.1: An LR table generated from Grammar-3, with the associated probabilities. For each action, the first value is the probability estimated by B&C and the bracketed value is that estimated by PGLR.

  State   a               b               c               $               Goto: U  X
    0     sh3 1/3 (1/3)   sh2 2/3 (2/3)                                         1  4
    1                                     sh5 1/2 (1)     re2 1/2 (1)
    2                                     re4             re4 1 (1)
    3                                     re3 1 (1)       re3
    4                                                     acc 1 (1)
    5                                                     re1 1 (1)
Table 6.2: Probabilities of the parse trees (S3), (S4) and (S5), estimated according to each model.

  Models            Tree(S3)=1    Tree(S4)=2    Tree(S5)=0
  PCFG              1/9           4/9           2/9
  Two-level PCFG    1/9           4/9           2/9
  B&C               1/6           1/3           0
  PGLR              1/3           2/3           0
In addition, the rule probabilities estimated by two-level PCFG are shown in Probability Set-2.
Probability Set-2. Rule probabilities for Grammar-2 estimated by two-level PCFG:
(1) S ; S → V     (0.3)
(2) S ; S → a V   (0.7)
(3) S ; S → V a   (0)
(4) V ; U → a     (1)
(5) S ; V → a U   (0.5)
(6) S ; V → U a   (0.5)
The bracketed values in Probability Set-2 are the rule probabilities estimated by two-level PCFG. They are the same as for PCFG because the rule parents are identical for each LHS. Table 6.3 shows the parse probabilities, estimated by each model, of the parse trees (S1-a), (S1-b), (S2-a) and (S2-b) shown in Figure 4.1. The results show that both PCFG and two-level PCFG can select neither between (S1-a) and (S1-b), nor between (S2-a) and (S2-b). In fact, (S1-b) should be preferred over (S1-a), and (S2-b) over (S2-a). B&C correctly gives preference to (S1-b) over (S1-a), but wrongly gives preference to (S2-a) over (S2-b). Additionally, the parse probabilities for (S1-a) and (S1-b) are very low because B&C re-estimates the next input symbol, even though it is not changed after applying a reduce action. This causes the probability of action sh5 at state 8 in (S1-b) and (S2-b) of Figure 4.2 to be smaller than it should be. Also, the probability of action re6 at state 5 in (S1-a) and (S2-b) of Figure 4.2 is unnecessarily subdivided, given that either state 0 or state 2 is the only reachable state, as fixed by the existing parse stack. This also reduces the parse probabilities of (S1-b) and (S2-b). In particular, the probability of (S2-b) is reduced below that of (S2-a), which brings about the reversed ranking of (S2-a) and (S2-b). PGLR not only ranks the parse probabilities correctly, but also estimates them very close to the original frequency of each parse tree. This means that the model can effectively distribute the parse probabilities over the particular action probabilities.
Table 6.3: Probabilities of parse trees as estimated by each model. (S1-a) and (S1-b) are the possible parse trees for the input string `a a'; (S2-a) and (S2-b) are the possible parse trees for the input string `a a a'.

  Models            Tree(S1-a)=1    Tree(S1-b)=2    Tree(S2-a)=3    Tree(S2-b)=4
  PCFG              0.150           0.150           0.350           0.350
  Two-level PCFG    0.150           0.150           0.350           0.350
  B&C               0.020           0.065           0.297           0.209
  PGLR              0.100           0.198           0.297           0.396
6.3 Summary

PGLR yields the most precise result in distributing parse probabilities over particular parse trees. As a result, PGLR can estimate parse probabilities very close to the original frequencies, and also ranks the parses correctly. Though two-level PCFG is modeled to capture more context than the original PCFG framework, in some cases, especially in the test case above where the rule parents are not discriminative, two-level PCFG provides no power over PCFG and yields the same results as PCFG does. B&C seems to be able to capture the contextual information inherent in GLR parsing. However, through misinterpretation of certain contextual information, the estimated probabilities are smaller than they should be. This happens in two cases: (i) in the re-estimation of the next input symbol after applying a reduce action, and (ii) in the subdivision of reduce action probabilities according to the states reached after applying the actions. Both are wrongly included in the probabilistic modeling process.
Chapter 7
A Node-Driven Parse Pruning Technique for GLR Parsing

Pruning is an essential paradigm for reducing the search space in parsing. The idea of pruning is to exclude hypotheses from further investigation if the corresponding parses turn out to be unlikely, based on evaluation of partial data.
Extracting all possible parses from a packed parse forest is crucially constrained by both time and memory space concerns. Carroll and Briscoe [9] have proposed a method for extracting the n-best parses from a complete packed parse forest. However, their method still requires the parser to parse to completion, and then identifies the n-best parses from the resultant packed parse forest. Their method can only save time in the actual extraction of the n-best parses, and does not concern itself with time and memory consumption during the parse process. Sentences are in general ambiguous because of the wide-coverage nature of context-free grammars, but most of the possible parses for a sentence are nonsensical. In practice, it is not reasonable to parse exhaustively in order to obtain only the few most probable parses. We would obtain the results more quickly if it were possible to prune off the less probable parses as early as possible. The compaction of the graph-structured stack (GSS) prevents us from applying the Viterbi algorithm [57] directly to GLR parsing. Because of the GSS, the parse stack changes dynamically and does not keep track of the individual parses. The Viterbi beam-search methods proposed in [60, 61] and [62] both need additional storage to keep track of the parses, which overrides the benefits of using a GSS in GLR parsing.
Lavie and Tomita [29] introduced a beam search heuristic for GLR* parsing. GLR* parsing is a noise-skipping parsing algorithm which allows shift operations to be performed from inactive state nodes of the GSS; this amounts to skipping words at any previous state in the GSS. The purpose of introducing the beam search algorithm is to limit the number of inactive state nodes from which shift operations are performed. The algorithm simply performs shift operations from the nearest state nodes until the number of state nodes reaches the limit. In fact, any state node may be merged and be common to several different sub-parses. Therefore, undesirable less-probable parses that end up at the same state nodes are included within the beam, which leads to an inefficient beam search. It is also possible that the most probable parses are overlooked because of inaccuracy in scoring and parse estimation. Pruning with a beam search technique is widely discussed in the speech processing community [47, 23, 37, 62]. Steinbiss et al. [47] give a good summary of previous research on beam search methods and propose some improvements to beam searching, such as histogram pruning. This method introduces an additional pre-specified upper limit on the number of active points per frame (or active nodes per time frame) to limit the expansion of hypotheses. The results of their experiments show that the search space is efficiently reduced by observing the distribution of the number of states over the parse. In this research, we propose a new method for pruning parses whose probability falls below that of the parses within a predetermined beam width, using a histogram-pruning-like algorithm called the node-driven parse pruning algorithm. The number of expanded nodes is effectively used to estimate the number of parses in the GSS. We also present evaluation results for various beam width settings, and compare parse time and space consumption against full parsing. Our node-driven parse pruning algorithm allows pruning in a left-to-right manner without modifying the GSS.
7.1 The node-driven parse pruning algorithm

It is inefficient to compute parse probabilities for all parses from the initial state successively up to the current state of parsing, because we have to keep track of all possible parses individually. This also degrades the benefits of local ambiguity packing in the graph-structured stack (GSS). To counter this inefficiency, we observe the number of
state nodes at each stage of parsing and compute all possible parses only when this number exceeds a threshold. Since the number of state nodes in a GSS can be viewed as an indicator of the degree of ambiguity, we indirectly estimate the number of parses by observing the number of state nodes in the GSS, and apply this as a threshold for activating the parse pruning process, as shown in Algorithm-1. The threshold T_t at time t is computed by:
T_t = G_t \, n_t    (7.1)
where T_t is the estimated number of parses at time t, G_t is the gain based on the numbers of state nodes and the length of the input string up to time t, and n_t is the number of state nodes at time t. The gain G_t can be computed by:

G_t = \frac{\sum_{i=1}^{L} n_{t-i} T_{t-i}}{\sum_{i=1}^{L} n_{t-i}^{2}}    (7.2)
where L is the number of past observations (a good setting for L is 5, as reported in [23]). The gain is used in adaptive pruning [23], which regards the pruning process as a non-linear time-variant dynamical system. In our implementation, we simply treat the gain G_t as a linear time-variant factor to reduce the computational overhead. Since our beam width is fixed, the estimated number of partial parses at each parsing step is used to decide whether to activate the parse pruning algorithm.
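As a small illustration of equations (7.1) and (7.2), the following Python sketch estimates the gain and the threshold from the last L observations; the function and variable names are ours and are not taken from the MSLR parser implementation.

    def estimate_gain(node_counts, parse_counts, L=5):
        # Gain G_t from the last L observations of (n_{t-i}, T_{t-i}),
        # following equation (7.2): sum(n * T) / sum(n^2).
        pairs = list(zip(node_counts[-L:], parse_counts[-L:]))
        numerator = sum(n * T for n, T in pairs)
        denominator = sum(n * n for n, _ in pairs)
        return numerator / denominator if denominator else 1.0

    def estimate_threshold(node_counts, parse_counts, n_t, L=5):
        # Estimated number of parses T_t = G_t * n_t, equation (7.1).
        return estimate_gain(node_counts, parse_counts, L) * n_t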
Algorithm-1. Node-driven parse pruning process.

1. If the number of parses estimated from the number of state nodes in the GSS is over the threshold T_t, compute the number of partial parses; else return.

2. If the number of parses is greater than the predetermined beam width, then compute the probabilities for each partial parse and individually store the sequences of state transitions with their corresponding probabilities at each active state node (the top node of each stack); else return.

3. Sort the sequences of state transitions according to their probabilities and determine the minimum probability of the parses within the beam width.
4. Mark sequences of state transitions that have probabilities less than this minimum as `pruned'.

5. Apply the next actions to the active state nodes only if there is at least one possible parse (an unmarked sequence of state transitions). For reduce actions, check reducible parses against the sequences of state transitions at the active state nodes.

A schematic code sketch of these steps is given below.
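The following Python sketch restates Algorithm-1 schematically, under the simplifying assumption that each partial parse is available as a (state sequence, probability) pair attached to its active state node; none of the names below come from the actual parser implementation.

    def prune_parses(partial_parses, estimated_parses, activation_threshold, beam_width):
        # Schematic version of Algorithm-1.  `partial_parses` maps each active
        # state node of the GSS to a list of (state_sequence, probability) pairs.
        # `estimated_parses` is T_t from equation (7.1).
        # Step 1: only consider pruning when the node-based estimate T_t is over
        # the activation threshold.
        if estimated_parses <= activation_threshold:
            return partial_parses
        # Step 2: count the actual partial parses at the active state nodes.
        all_parses = [p for parses in partial_parses.values() for p in parses]
        if len(all_parses) <= beam_width:
            return partial_parses
        # Step 3: sort the state-transition sequences by probability and find
        # the minimum probability among the parses inside the beam.
        all_parses.sort(key=lambda seq_prob: seq_prob[1], reverse=True)
        min_prob = all_parses[beam_width - 1][1]
        # Step 4: sequences below the minimum are marked `pruned' (here: dropped).
        surviving = {}
        for node, parses in partial_parses.items():
            kept = [(seq, prob) for seq, prob in parses if prob >= min_prob]
            # Step 5: a node is expanded further only if at least one unmarked
            # sequence of state transitions remains at that node.
            if kept:
                surviving[node] = kept
        return surviving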
Figure 7.1 exemplifies the GSS in the process of parse pruning. Suppose that the beam width is equal to one, and that the first parse (state sequence (0,3,6,13,9)) at node 9 is the only parse within the beam width. Here, only the action [re1,14] is executed, with the result shown in Figure 7.1 (b). Note that we do not extend the stack at active state node 13, because all parses up to state node 13 are marked to be pruned off. At active state node 17, after trying the action [re1,17], the state sequence (0,4,11,3,9), which is marked to be pruned off, is activated; therefore, the parse after this action is also disregarded. As a result, we can parse with a smaller memory space and lower computational time overhead than required for full parsing. However, parsing with this pruning technique gives appropriate results if and only if the exploited probabilistic model provides precise probabilistic estimates for the partial parses, because beam search is an approximate heuristic method that does not guarantee that the chosen interpretation of a sentence is the best possible interpretation.
7.2 Efficiency of parse pruning in PGLR

We calculated the efficiency of parse pruning using the PGLR model for varying beam widths. Parsing can be sped up by reducing the beam width, except that the correct parses can potentially be pruned off if the beam width is too small. Figure 7.2 shows that PGLR provides quite a precise probabilistic estimate for partial parses, in that the parsing accuracy increases steeply with even a small increase in the beam width. The parser performs as well with a beam width of around 30 as with full parsing. Time consumption in parsing using our pruning technique is linear in sentence length, while it is exponential for full parsing. For example, our pruning technique requires only 1/10,000 of the parsing time required for full parsing for 25-character sentences (for which there are more than 200,000 parses if parsed exhaustively).
Figure 7.1: Parse pruning within a graph-structured stack. Circled state numbers are active nodes. All possible parses (sequences of state transitions) at each active node are shown in the box pointing to that node. The action/state pairs after applying the actions are shown in the square brackets next to the active nodes.
Figure 7.2: Parsing accuracy under a varying beam width for parse pruning. (The plot shows parsing accuracy (%) against the beam width for pruning, over the range 1 to 30; the annotated accuracy level is 95.32.)
We evaluated our pruning algorithm by comparing time and space consumption between full parsing and parsing with the pruning algorithm at a beam width of 30. The distributions of state nodes against input symbols for two individual sentences are shown in Figures 7.3 and 7.4. It is clear that our parse pruning algorithm drastically reduces the number of nodes used in parsing both sentences. Consequently, the parsing time for both sentences is also visibly reduced because of the reduction in search space: parsing time drops from 1709.62 seconds to 1.0 seconds and from 649.49 seconds to 5.52 seconds for the 33 and 36 character sentences, respectively. Table 7.1 shows the average time and space consumption when parsing with the node-driven parse pruning algorithm at a beam width of 30, as compared to full parsing.

Table 7.1: Average time and space consumption when parsing with the node-driven parse pruning algorithm, as compared to full parsing.
                    Average number of            Average parse time
                    state nodes per sentence     (seconds per sentence)
Full parsing              9,146                        163
Beam width = 30             630                          0.243
The setting of the beam width is a trade-off between parsing accuracy and parsing time. In practice, a beam width of around 20 is likely to be sufficient to produce satisfactory parsing results for the ATR corpus.
7.3 Summary

Beam searching is an approximate heuristic method that does not guarantee that the final interpretation of a sentence is the best possible interpretation. Nevertheless, by carefully managing the number of parses in the GSS using the node-driven parse pruning technique, we make significant efficiency gains in both time and space consumption. Moreover, when coupled with a precise model for partial parse probability estimation, our pruning technique yields the same results as full parsing at a beam width of about 30, using about 0.07% of the relative parsing space and 0.0015% of the relative parsing time of full parsing.
Figure 7.3: Distribution of state nodes over input symbols in parsing a sentence of 33 characters with 121,472 potential parses. (Full parsing: 1709.62 sec; beam width = 30: 1.0 sec.)
Figure 7.4: Distribution of state nodes over input symbols in parsing a sentence of 36 characters with 1,316,912 potential parses. (Full parsing: 694.49 sec; beam width = 30: 5.52 sec.)
Our node-driven parse pruning algorithm allows parse pruning in the GSS in a strict left-to-right fashion, with the benefit of being able to disregard less-probable parses at as early a stage as possible. Our pruning algorithm also retains the advantages of the GSS, with negligible cost in estimating the number of parses in the GSS.
Chapter 8
Probabilistic GLR Model-2

By regarding a parse tree as a sequence of state transitions, we are able to reformalize PGLR to improve its performance. Any probabilistic model for full parsing involves a trade-off between the amount of context it uses and data sparseness.
8.1 Problematic issues

The results of Chapter 5 clearly showed that PGLR yielded the best relative performance. The model analyses of Chapter 6 also confirmed that PGLR assigns a probability mass to each parse tree very close to its original frequency in the training set. Let us, once again, consider the training set in Figure 4.1, and Grammar-2 and its associated LR table as shown in Table 4.1. Suppose that a new set of parse trees is added to the training data. We retrain the models for PGLR and B&C based on the expanded training set, and recompute the action probabilities for the LR table as shown in Table 8.1. We then re-estimate the probability for each parse tree and obtain the results shown in Figure 8.1. Undesirable results are produced for both PGLR and B&C. In the case of B&C, the parse trees for the input string `a a' receive the correct probabilistic preference, i.e. (S1-b) is preferred to (S1-a). However, in the case of the input string `a a a', (S2-a) is wrongly given preference over (S2-b) and (S2-c). Moreover, the probabilities for (S1-a), (S1-b), (S2-b) and (S2-c) are significantly lower than they should be because of the re-estimation of the next input symbol at states 3 and 8
Table 8.1: The probabilistic LR table, updated according to the extended training set. For each state and input symbol, the entry gives the action(s) with their training counts in parentheses, followed by the probabilities estimated by B&C and by PGLR (written as B&C | PGLR). For the reduce action re6 at state 5 on `$', B&C's probability is split (.1 ; .4) over the two stack-top states reachable by the stack-pop operation.

State   Action on `a'                               Action on `$'            Goto (S, U, V)
0       sh2 (15)   1 | 1                                                     S=4, U=1, V=3
1       sh5 (6)    1 | 1
2       sh7 (9) / re4 (6)   .6/.4 | .6/.4                                    U=8, V=6
3       sh9 (5)    .62 | 1                          re1 (3)   .37 | 1
4                                                   acc (15)  1 | 1
5       re6 (5)    .5 | .5                          re6 (5)   .1;.4 | .5
6                                                   re2 (7)   1 | 1
7       re4 (4) / sh10 (3)   .44/.33 | .44/.33      re4 (2)   .22 | .22      U=11
8       re5 / sh5 (4)   -/.66 | -/1                 re5 (2)   .33 | 1
9                                                   re3 (5)   1 | 1
10                                                  re4 (3)   1 | 1
11                                                  re5 (3)   1 | 1
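Purely as an illustration of how such a table can be held in memory (this is not the representation used in the MSLR implementation), the action part of Table 8.1 could be stored as a mapping from (state, lookahead) pairs to lists of (action, count, probability) entries; only the PGLR probabilities and a subset of the states are shown, and B&C or PGLR model-2 would attach extra structure, such as the subdivision of re6 at state 5 on `$'.

    # Illustrative sketch only: a fragment of the action part of Table 8.1.
    pglr_action_table = {
        (0, 'a'): [('sh2', 15, 1.0)],
        (2, 'a'): [('sh7', 9, 0.6), ('re4', 6, 0.4)],
        (5, 'a'): [('re6', 5, 0.5)],
        (5, '$'): [('re6', 5, 0.5)],   # B&C splits this 0.5 into 0.1 and 0.4
        (7, 'a'): [('re4', 4, 0.44), ('sh10', 3, 0.33)],
        (7, '$'): [('re4', 2, 0.22)],
    }
    # Goto entries recoverable from the parse trees of Figure 8.1.
    goto_table = {(0, 'S'): 4, (0, 'U'): 1, (0, 'V'): 3,
                  (2, 'U'): 8, (2, 'V'): 6, (7, 'U'): 11}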
[The parse trees of Figure 8.1, annotated with per-action probabilities, are not reproduced here; the resulting parse probabilities read off the figure are:]

(S1-a)[1]:  B&C 1 x 0.4 x 1 x 0.1 x 0.37 x 1 = 0.015;  PGLR 1 x 0.4 x 1 x 0.5 x 1 x 1 = 0.2
(S1-b)[2]:  B&C 1 x 0.6 x 0.22 x 0.33 x 0.37 x 1 = 0.016;  PGLR 1 x 0.6 x 0.22 x 1 x 1 x 1 = 0.132
(S2-a)[3]:  B&C 1 x 0.6 x 0.33 x 1 x 1 x 1 x 1 = 0.198;  PGLR 1 x 0.6 x 0.33 x 1 x 1 x 1 x 1 = 0.198
(S2-b)[4]:  B&C 1 x 0.6 x 0.44 x 0.66 x 0.4 x 1 x 1 = 0.069;  PGLR 1 x 0.6 x 0.44 x 1 x 0.5 x 1 x 1 = 0.132
(S2-c)[5]:  B&C 1 x 0.4 x 1 x 0.5 x 0.62 x 1 x 1 = 0.124;  PGLR 1 x 0.4 x 1 x 0.5 x 1 x 1 x 1 = 0.2

Figure 8.1: Estimation of parse probabilities based on the expanded training set. The numbers in square brackets are the frequencies of the parse trees.
(in fact, this also happens at states 1, 6 and 11, but each of these states admits only one next input symbol in this exemplified LR table). We have already discussed this issue in Chapters 2 and 6. PGLR also displays problems, in that it ranks the parse trees incorrectly: PGLR wrongly gives preference to (S1-a) over (S1-b), and to (S2-a) over (S2-b). This probably results from relying only on the probabilities decomposed from the parse trees; additionally, we may need to include linguistic knowledge in the probabilistic model. Here, let us consider another model for PGLR that may overcome the problems evident in the original PGLR schema.
8.2 Another model for PGLR

In PGLR, we regard a parse tree as a sequence of stack transitions and estimate each factor in equation (2.24) individually, as discussed in Chapter 2. However, the nature of parsing with an LR table is more naturally based around the sequence of state transitions. In this section, we formalize another model for PGLR, called PGLR model-2. Let us consider the sequence of state transitions representing a parse tree as:

s_0 \xrightarrow{l_1, a_1} s_1 \xrightarrow{l_2, a_2} \cdots \xrightarrow{l_{n-1}, a_{n-1}} s_{n-1} \xrightarrow{l_n, a_n} s_n    (8.1)
where a_i is an action, l_i is an input symbol and s_i is the state at time t_i. The probability of a complete parse can be represented as:

P(T) = P(s_0, l_1, a_1, s_1, \ldots, s_{n-1}, l_n, a_n, s_n)    (8.2)
     = P(s_0) \prod_{i=1}^{n} P(l_i, a_i, s_i \mid s_0, l_1, a_1, s_1, \ldots, l_{i-1}, a_{i-1}, s_{i-1})    (8.3)
In the GLR parsing schema, the state on top of the stack contains all the information needed in parsing, and as such, the state on top of the stack at any parsing time can represent the stack. As a result, the probability of the parse tree T given in equation (8.2) is estimated as follows, where P(s_0) is the probability of the initial state in the LR table and is here equal to one. Therefore,

P(T) \approx \prod_{i=1}^{n} P(l_i, a_i, s_i \mid s_{i-1})
           = \prod_{i=1}^{n} P(l_i \mid s_{i-1}) \, P(a_i \mid s_{i-1}, l_i) \, P(s_i \mid s_{i-1}, l_i, a_i)    (8.4)
This is the same as our independent formalization of B&C in Chapter 2. The difference comes in the interpretation and estimation of the particular probabilities in equation (8.4). Like PGLR, the particular probabilities are estimated as follows. The first term P(l_i | s_{i-1}) is the probability of the next input symbol l_i at state s_{i-1}. Following the properties of GLR parsing, the next input symbol remains unchanged if the current state was reached after applying a reduce action. Therefore,

P(l_i \mid s_{i-1}) = 1 \quad (\textrm{for } s_{i-1} \in S_r)    (8.5)
The second term P(a_i | s_{i-1}, l_i) is the probability of the next action a_i for the combination of state s_{i-1} and input symbol l_i. This is the probability of action a_i once the input symbol l_i has been determined according to the first term above; the estimation is therefore the same for states in both the S_s and S_r classes. The third term P(s_i | s_{i-1}, l_i, a_i) is the probability of the state to move to after applying an action. The next state is deterministic after applying a shift action, because the shift action itself specifies the state to move to. Therefore,

P(s_i \mid s_{i-1}, l_i, a_i) = 1 \quad (\textrm{for } a_i \in A_s)    (8.6)
where A_s is the set of shift actions. Problems occur if the state in question directly follows the application of a reduce action. PGLR regards a parse tree as a sequence of stack transitions, and hence the next state is deterministic. On the other hand, we consider a parse tree as a sequence of state transitions in modeling PGLR model-2, and consequently the next state is not deterministic after applying a reduce action. Namely, the third term P(s_i | s_{i-1}, l_i, a_i) is the probability of the state to move to after applying a reduce action. It is important to note that the state to move to after applying a reduce action does not correspond to the stack-top state after the stack-pop operation, as it does in B&C; rather, it is the next state to shift to over a non-terminal symbol. It is sound to interpret the third term in equation (8.4) in this way. In summary, we normalize the action probabilities in the same way as we did for PGLR, but probabilities for reduce actions are subdivided according to the next reachable states. This results in a greater number of free parameters for PGLR model-2 than for PGLR, but still a much smaller number than for B&C, because the next state after a reduce action can
potentially be merged, even if the stack-top states after a stack-pop operation differ. The reverse case, of there being multiple next states following a stack-pop operation associated with a single reduce action, cannot occur, according to the behavior of an LR table as a deterministic finite automaton. Experimentally, based on the LALR table generated for the ATR corpus, the number of training actions in PGLR model-2 is about 5.6 times that for PGLR, whereas for B&C it is about 24.9 times that for PGLR. (The number of training actions is not the number of free parameters, but it at least roughly indicates the relative levels of data sparseness between the models.)
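To make the decomposition concrete, the following Python sketch scores a parse, viewed as a sequence of (previous state, lookahead, action, next state) steps, under PGLR model-2. The probability tables p_lookahead, p_action and p_next_state, and the set of states reached after a reduce (S_r), are assumed to have been estimated from training counts as described above; all names here are illustrative rather than taken from the thesis implementation.

    def pglr2_parse_probability(steps, p_lookahead, p_action, p_next_state,
                                reduce_reached_states):
        # steps: list of (prev_state, lookahead, action, next_state) tuples.
        # p_lookahead[(s, l)]         -> P(l | s)          (first term)
        # p_action[(s, l, a)]         -> P(a | s, l)       (second term)
        # p_next_state[(s, l, a, s2)] -> P(s2 | s, l, a)   (third term, reduce only)
        # reduce_reached_states: the set S_r of states reached via a goto.
        prob = 1.0
        for prev_state, lookahead, action, next_state in steps:
            # First term: P(l_i | s_{i-1}) = 1 when s_{i-1} is in S_r (eq. 8.5),
            # because the lookahead is unchanged after a reduce.
            if prev_state not in reduce_reached_states:
                prob *= p_lookahead[(prev_state, lookahead)]
            # Second term: P(a_i | s_{i-1}, l_i), the same for both state classes.
            prob *= p_action[(prev_state, lookahead, action)]
            # Third term: deterministic for shift actions (eq. 8.6); for reduce
            # actions the probability is subdivided over the reachable next states.
            if action.startswith('re'):
                prob *= p_next_state[(prev_state, lookahead, action, next_state)]
        return prob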
8.3 Effects of PGLR model-2 on the exemplified training set

With respect to the LR table shown in Table 8.1, the action probabilities estimated by PGLR model-2 are similar to those for PGLR, except for the reduce action re6 occurring at the pairing of state 5 and input symbol $. The probability of the reduce action re6 is subdivided, resulting in 0.1 for a next state of 3, and 0.4 for a next state of 6. Figure 8.2 shows the probabilities as estimated by PGLR model-2, compared to B&C and PGLR. In this example, the probabilities for (S1-a) and (S2-b) differ from those for PGLR. Note that in (S1-a) as well as (S2-b), the probability for re6 at state 5 with the input symbol $ is reduced because it is differentiated according to the next state after the reduce action re6; in other words, the next states, i.e. 3 and 6, are estimated after the estimation of the reduce action re6. This is because of the inclusion of the third term P(s_i | s_{i-1}, l_i, a_i) in equation (8.4). As a result, PGLR model-2 returns better results than PGLR, by reproducing the proper probabilistic ranking for (S1-a) and (S1-b). However, (S2-a) still has a higher probability than (S2-b); it would appear that (S2-a) receives excessive preference over the other parse trees. This also occurs when we test the models with a further set of parse trees, as shown in Figure 8.3, which shows all possible parse trees for the input strings `a a' and `a a a'. Table 8.2 summarizes the results of parse probability ranking after testing with (S2-c) and (S2-d) individually added into the training set.
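As a worked check of the subdivision just described, using the counts recoverable from Table 8.1 (the lookahead at state 5 is `a' in 5 of the 10 relevant training events and `$' in the other 5, and of the 5 events with `$', 1 leads to next state 3 and 4 lead to next state 6), the PGLR model-2 probabilities follow from equation (8.4) as

\[
P(\textrm{re6}, s'{=}3 \mid 5, \$) = P(\$ \mid 5)\, P(\textrm{re6} \mid 5, \$)\, P(3 \mid 5, \$, \textrm{re6}) = \tfrac{5}{10} \cdot 1 \cdot \tfrac{1}{5} = 0.1,
\]
\[
P(\textrm{re6}, s'{=}6 \mid 5, \$) = \tfrac{5}{10} \cdot 1 \cdot \tfrac{4}{5} = 0.4.
\]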
[The parse trees of Figure 8.2, annotated with per-action probabilities, are not reproduced here; the resulting parse probabilities read off the figure are:]

(S1-a)[1]:  B&C 1 x 0.4 x 1 x 0.1 x 0.37 x 1 = 0.015;  PGLR 1 x 0.4 x 1 x 0.5 x 1 x 1 = 0.2;  PGLR-2 1 x 0.4 x 1 x 0.1 x 1 x 1 = 0.04
(S1-b)[2]:  B&C 1 x 0.6 x 0.22 x 0.33 x 0.37 x 1 = 0.016;  PGLR 1 x 0.6 x 0.22 x 1 x 1 x 1 = 0.132;  PGLR-2 1 x 0.6 x 0.22 x 1 x 1 x 1 = 0.132
(S2-a)[3]:  B&C 1 x 0.6 x 0.33 x 1 x 1 x 1 x 1 = 0.198;  PGLR 0.198;  PGLR-2 0.198
(S2-b)[4]:  B&C 1 x 0.6 x 0.44 x 0.66 x 0.4 x 1 x 1 = 0.069;  PGLR 1 x 0.6 x 0.44 x 1 x 0.5 x 1 x 1 = 0.132;  PGLR-2 1 x 0.6 x 0.44 x 1 x 0.4 x 1 x 1 = 0.105
(S2-c)[5]:  B&C 1 x 0.4 x 1 x 0.5 x 0.62 x 1 x 1 = 0.124;  PGLR 1 x 0.4 x 1 x 0.5 x 1 x 1 x 1 = 0.2;  PGLR-2 1 x 0.4 x 1 x 0.5 x 1 x 1 x 1 = 0.2

Figure 8.2: Effects of adding (S2-c) into the training set, on PGLR model-2 as compared with B&C and PGLR. The number in the square brackets is the frequency of the parse tree.
[The parse trees of Figure 8.3, annotated with per-action probabilities, are not reproduced here; the resulting parse probabilities read off the figure are:]

(S1-a)[1]:  B&C 1 x 0.28 x 1 x 0.1 x 0.21 x 1 = 0.006;  PGLR 1 x 0.28 x 1 x 0.5 x 1 x 1 = 0.14;  PGLR-2 1 x 0.28 x 1 x 0.1 x 1 x 1 = 0.028
(S1-b)[2]:  B&C 1 x 0.71 x 0.13 x 0.16 x 0.21 x 1 = 0.003;  PGLR 1 x 0.71 x 0.13 x 1 x 1 x 1 = 0.092;  PGLR-2 1 x 0.71 x 0.13 x 1 x 1 x 1 = 0.092
(S2-a)[3]:  B&C 1 x 0.71 x 0.2 x 1 x 1 x 1 x 1 = 0.142;  PGLR 0.142;  PGLR-2 0.142
(S2-b)[4]:  B&C 1 x 0.71 x 0.66 x 0.33 x 0.4 x 1 x 1 = 0.062;  PGLR 1 x 0.71 x 0.66 x 0.4 x 0.5 x 1 x 1 = 0.094;  PGLR-2 1 x 0.71 x 0.66 x 0.4 x 0.4 x 1 x 1 = 0.075
(S2-c)[5]:  B&C 1 x 0.28 x 1 x 0.5 x 0.78 x 1 x 1 = 0.109;  PGLR 1 x 0.28 x 1 x 0.5 x 1 x 1 x 1 = 0.14;  PGLR-2 1 x 0.28 x 1 x 0.5 x 1 x 1 x 1 = 0.14
(S2-d)[6]:  B&C 1 x 0.71 x 0.66 x 0.5 x 0.78 x 1 x 1 = 0.183;  PGLR 1 x 0.71 x 0.66 x 0.6 x 1 x 1 x 1 = 0.281;  PGLR-2 1 x 0.71 x 0.66 x 0.6 x 1 x 1 x 1 = 0.281

Figure 8.3: Effects of adding (S2-d) into the training set, on PGLR model-2 as compared with B&C and PGLR. The number in the square brackets is the frequency of the parse tree.
PGLR model-2 can rank the parse tree candidates for the input string `a a' correctly, while B&C and PGLR return progressively worse results as training data is added. As discussed previously, B&C unexpectedly reduces the parse probabilities by re-distributing action probabilities over events that are in fact deterministic. This occurs (i) when calculating the probability of the next input symbol after applying a reduce action, for instance the probabilities at states 3 and 8 in Figures 8.2 and 8.3, respectively, and (ii) when calculating the probability of the stack-top state after applying a stack-pop operation, for instance the probabilities at state 5 in Figures 8.2 and 8.3. In the case of PGLR, the probability for re6 at state 5 in (S1-a) is given the same value as in (S2-b); in fact, at state 5 in the respective parse trees, the states reached after applying the reduce action re6 are different. It is therefore plausible that distributing the probability over the next states after applying a reduce action is needed to distinguish the reduce action probabilities.

Table 8.2: Results of ranking parse trees in ascending order, according to probabilities from B&C, PGLR and PGLR model-2. Each entry gives the rank under the original training set / with (S2-c) added / with (S2-d) added; a dash indicates that the tree is not included in that training set.
                     `a a'               `a a a'
                   (S1-a)   (S1-b)    (S2-a)   (S2-b)   (S2-c)   (S2-d)
Original ranking      1        2         1        2        3        4
B&C                 1/1/2    2/2/1     2/3/3    1/1/1    -/2/2    -/-/4
PGLR                1/2/2    2/1/1     1/2/3    2/1/1    -/3/2    -/-/4
PGLR model-2        1/1/1    2/2/2     1/2/3    2/1/1    -/3/2    -/-/4
PGLR model-2 improves on the performance of B&C and PGLR, but still cannot produce the correct ranking in all cases, because of the preferential bias towards (S2-a). Some linguistic knowledge may be needed to constrain the distribution of probabilities over the parsing space. We leave the incorporation of linguistic knowledge into the GLR parsing framework as an issue for future research.
8.4 Evaluation of B&C, PGLR and PGLR model-2

To demonstrate that the various models are not only effective in limited cases such as those discussed in Section 8.3, we evaluated the performance of the three models, using the grammar and LR table discussed in Chapter 5, on a different portion of the ATR Japanese corpus. Table 8.3 shows the results of a closed test on 500 sentences.
Table 8.3: Parse performance of B&C, PGLR and PGLR model-2 on a closed test set of 500 sentences (2-48 characters).

Models            PA      LP      LR      BP      BR      0-CB    m-CB
B&C              96.00   98.53   98.34   98.89   98.62   98.00    0.04
PGLR             96.00   98.77   98.35   99.17   98.62   98.20    0.03
PGLR model-2     97.00   99.15   98.53   99.53   98.68   98.80    0.02
In the closed test, we omitted the smoothing operation of adding a very small value to the probability distribution for unseen events. The purpose of this is to exclude the effects of smoothing for both B&C and PGLR model-2, which require more free parameters than PGLR. Though B&C and PGLR returned the same value for PA, LP/LR and BP/BR show that PGLR is superior to B&C, and m-CB for PGLR is better than that for B&C. PGLR model-2, moreover, plays an important role in improving on the performance of PGLR: in fact, PGLR model-2 yielded the best results for all the measures in this closed test on the ATR Japanese corpus. In an open test, the performance of B&C remained lower than PGLR, even though the evaluation used a different test set from that in Chapter 5. PGLR returned performance very close to PGLR model-2, as shown in Table 8.4 and Figure 8.4. Other than for PA, PGLR model-2 returned slightly better performance than PGLR, although the performance difference is not statistically significant. We subsequently applied a smoothing operation to the models in the open test, by simply adding a part of a count to each action. Even here, PGLR showed the advantages of a small number of free parameters in probability estimation.

Table 8.4: Parse performance of B&C, PGLR and PGLR model-2 on an open test set of 500 sentences (2-48 characters).
Models            PA      LP      LR      BP      BR      0-CB    m-CB
B&C              90.20   97.26   97.39   97.95   98.03   94.60    0.14
PGLR             94.00   98.37   97.86   98.96   98.30   97.20    0.06
PGLR model-2     94.00   98.60   98.02   99.15   98.36   97.20    0.05
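For reference, the bracketing measures reported above can be computed in the usual PARSEVAL style [4, 22]; the sketch below is a generic formulation, and the exact conventions used in this thesis (for example the treatment of unary and top-level brackets) may differ in detail.

    from collections import Counter

    def bracket_precision_recall(gold_brackets, test_brackets):
        # PARSEVAL-style bracket precision/recall over multisets of brackets.
        # A bracket is a (label, start, end) triple for labelled measures,
        # or a (start, end) pair for unlabelled ones.
        gold, test = Counter(gold_brackets), Counter(test_brackets)
        matched = sum((gold & test).values())       # brackets found in both
        precision = matched / sum(test.values()) if test else 0.0
        recall = matched / sum(gold.values()) if gold else 0.0
        return precision, recall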
We investigated the effects of the data sparseness problem on PGLR model-2 and PGLR by varying the size of the training set. The results are shown in Figure 8.5.
Figure 8.4: Parsing accuracy distribution over different sentence lengths, for B&C, PGLR and PGLR model-2, on an open test set of 500 sentences. (The plot shows the percentage of correct parses against sentence length in words, together with the number of sentences of each length in the test set.)
Ignoring the results for training the models on 1/16 and 1/8 of the total of about 10,000 sentences, PGLR model-2 would appear to have the potential to perform better than PGLR given large quantities of training data. Though there was no significant difference between PGLR and PGLR model-2 in the open test, the performance of PGLR model-2 in the closed test without parameter smoothing is significantly superior to PGLR. In conclusion, PGLR has the advantage of a smaller number of free parameters, and we need to train on a larger corpus to bring out the full capabilities of PGLR model-2. Improving the smoothing method for PGLR model-2 could also improve its open test performance.
Figure 8.5: Trainability measure for PGLR and PGLR model-2. (The plot shows parsing accuracy (%) on the 500-sentence open test set against the fraction of the 10,397 training sentences used, from 1/32 up to the full set.)
8.5 Summary

Based on a different formalization, we arrived at a second model for PGLR, called PGLR model-2. PGLR model-2 is obtained by regarding a parse tree as a sequence of state transitions. Like PGLR, action probabilities are normalized differently according to the class that each state belongs to. The difference from PGLR is that PGLR model-2 additionally includes the probability of the next state after a reduce action when calculating the overall reduce action probabilities. In practice, this probability is calculated by subdividing the probability of the original reduce action between the next states. It is important to note that this differs from B&C, where reduce probabilities are calculated based on the stack-top state after the stack-pop operation in applying a reduce action. By way of an exemplified problem, we showed how the ability to distinguish the states to move to after applying a reduce action can be applied in PGLR model-2 to reduce the probability bias experienced by PGLR. In the example, PGLR model-2 yielded the parse ranking most closely representative of the actual data, as shown in Table 8.2. PGLR model-2 involves more free parameters than PGLR, but far fewer than B&C: for the LALR table generated from the ATR corpus, the number of training actions in PGLR model-2 is about 5.6 times that for PGLR, while for B&C it is about 24.9 times that for PGLR. Despite this, PGLR model-2 yielded significantly superior performance to both B&C and PGLR in the closed test evaluation. In the open test, since PGLR model-2 requires more free parameters than PGLR, there was no statistically significant difference between these two models. The results for LP/LR, the PARSEVAL measures, and the trainability measure in Figure 8.5 give promising signs of performance improvement for PGLR model-2 given enough training data.
Chapter 9
Conclusions

The probabilistic generalized LR parsing model (PGLR) is formalized based on the nature of GLR parsing, aimed at inheriting the benefits of the context-sensitivity inherent in the GLR parsing algorithm. The context-sensitivity of GLR parsing reflects both (i) global context over structures from the source context-free grammar, and (ii) local n-gram context from adjoining pre-terminal constraints. The context is encoded in the LR table states, and the parsing history of GLR parsing is stored in stacks, in the form of a GSS, to record parsing ambiguity. Therefore, it is effective to concentrate on state transitions in LR parsing and view a parse as a sequence of stack or state transitions. Since GLR parsing is a table-driven shift-reduce left-to-right parser for context-free grammars, we can efficiently incorporate state transition probabilities into parsing action probabilities. We evaluated PGLR against its main predecessors in GLR-based probabilistic parsing, namely B&C and PCFG, as well as two-level PCFG, the pseudo context-sensitive version of PCFG. Evaluation was performed on the ATR Japanese corpus, and gave significant evidence that the PGLR model is able to make effective use of both the global and local context provided in the GLR parsing framework. The PGLR model outperformed both Briscoe and Carroll's model and the two-level PCFG model in all tests. In addition, PGLR requires only the probability of each action in the LR table to compute the overall probability of each parse. It is thus tractable in training, with a number of free parameters as small as the number of distinct actions, which makes it possible to associate a probability directly with each action. Empirical results also demonstrated that the effectiveness of PGLR(LALR) is comparable with that of PGLR(CLR). Though the states in a CLR table are more distinguishable
in terms of encoding context information in parsing, the enormous increase in the number of parsing actions in a CLR table, compared to the corresponding LALR table, leads to data sparseness in training. Considering the advantages of parsing with an LALR table, it can be concluded that parsing with an LALR table is more efficient than parsing with a CLR table. The PGLR model also allows us to compute parse probabilities incrementally in a left-to-right manner. Unlike the previously proposed beam-search style of pulling out the n-best parses from the final packed parse forest, which requires exhaustive parsing, we proposed a new node-driven parse pruning algorithm which supports pruning off less probable parses as early as possible, based on prefix probability. With precisely defined probabilities, we can prune off less probable parses efficiently in terms of both parsing time and memory space consumption, and obtain the most probable parse with high accuracy. Our pruning technique is expected to be applicable particularly to parsing highly ambiguous sentences, that is, long sentences, as well as to speech processing, which is usually associated with a great deal of parsing ambiguity. PGLR model-2 is the result of a second formalization of a probabilistic model for GLR parsing, aimed at solving problematic issues in parse tree disambiguation by considering more context. It results in a requirement for a greater number of free parameters in training the model. However, by way of distributing probabilities between the candidate states to move to after reduce actions, the model becomes more context-sensitive; the state to move to after applying a reduce action is not deterministic when a parse is regarded as a sequence of state transitions. PGLR model-2 performed slightly better than PGLR in the closed test, but the difference was not statistically significant in the open test. The trainability measure shows that we need more training data to produce conclusive performance differences between the two models. Only a sufficient amount of training data can make performance gains for probabilistic models statistically significant. Developing a bracketed corpus is currently still very expensive, and we need a greater amount of training data especially to evaluate those models which involve a greater number of free parameters in training. The following are some future issues in developing probabilistic models within the GLR parsing framework:
- Assess the probabilistic models with a larger corpus, as well as with speech corpora, to confirm the applicability of the PGLR models. We also have an interest in implementing the PGLR models for other non-segmenting languages, such as Thai. Parsing Thai sentences, as reported in [44], faces the same word segmentation problem as parsing Japanese sentences. The accuracy of syntactic parsing depends mostly on the result of word segmentation if the two are implemented in a cascaded manner. With the PGLR models implemented on the integrated MSLR parser, both lexical and structural information are available during parsing, and high parsing accuracy can then be expected.

- Consider smoothing in the probabilistic models. Model smoothing is one of the outstanding issues in this research. The closed test in Chapter 8 showed that smoothing affected the performance of the models. Which smoothing methods are appropriate for GLR parsing is expected to be a crucial issue in cases where there is insufficient data for training. Also, the many parsing actions that are left almost unactivated because of the wide coverage of the grammar could excessively reduce the probability mass distributed over the activated parsing actions, if a suitable smoothing method is not applied.

- Incorporate long-distance constraints, as well as richer context, in parsing. In some disambiguation processes, we may need constraints which do not neighbor the current parsing state. Consequently, the ability to include arbitrary constraints during transitions in GLR parsing is preferable.

- Lexicalize the probability distribution. There are a great many examples where preferences over parse trees and sequences of parts-of-speech do not reflect the preferred interpretation. Lexical information is necessary to constrain these preferences. Parsing with lexicalized probabilities is possible in the GLR parsing framework, but needs to be considered carefully because of its vulnerability to data sparseness problems.
Bibliography

[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[2] A. Aho and J. Ullman. The Theory of Parsing, Translation, and Compiling, Vol. I & II. Prentice Hall, 1972.
[3] E. Black, F. Jelinek, J. Lafferty, D. M. Magerman, R. Mercer, and S. Roukos. Towards History-based Grammars: Using Richer Models for Probabilistic Parsing. In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 134-139, 1992.
[4] E. Black et al. A Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars. In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 306-311, 1992.
[5] T. Briscoe and J. Carroll. Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars. Computational Linguistics, Vol. 19, No. 1, pp. 25-59, 1993.
[6] T. Briscoe and J. Carroll. Developing and Evaluating a Probabilistic LR Parser of Part-Of-Speech and Punctuation Labels. In Proceedings of the 4th International Workshop on Parsing Technologies, pp. 48-58, 1995.
[7] H. Bunt and M. Tomita. Recent Advances in Parsing Technology. Kluwer Academic Publishers, 1996.
[8] J. Carroll. Report of Visit to Tanaka Laboratory, 1997/11/8 - 1997/12/8. (unpublished), 1997.
[9] J. Carroll and T. Briscoe. Probabilistic Normalisation and Unpacking of Packed Parse Forests for Unification-Based Grammars. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, pp. 33-38, 1992.
[10] J. Carroll and T. Briscoe. Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1996.
[11] E. Charniak. Tree-bank Grammar. In Proceedings of AAAI/IAAI-96, pp. 1031-1036, 1996.
[12] E. Charniak. Statistical Parsing with a Context-free Grammar and Word Statistics. In Proceedings of AAAI/IAAI-97, pp. 598-603, 1997.
[13] E. Charniak and G. Carroll. Context-Sensitive Statistics for Improved Grammatical Language Models. In Proceedings of AAAI-94, pp. 728-733, 1994.
[14] T. Charoenporn, V. Sornlertlamvanich, and H. Isahara. Building A Large Thai Text Corpus - Part-Of-Speech Tagged Corpus: ORCHID. In Proceedings of the Natural Language Processing Pacific Rim Symposium, 1997.
[15] T. H. Chiang, Y. C. Lin, and K. Y. Su. Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution. Computational Linguistics, Vol. 21, No. 3, pp. 321-349, 1995.
[16] M. Chitrao and R. Grishman. Statistical Parsing of Messages. In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 263-266, 1990.
[17] M. J. Collins. A New Statistical Parser based on Bigram Lexical Dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996.
[18] J. Earley. An Efficient Context-Free Parsing Algorithm. Communications of the ACM, Vol. 6, No. 8, pp. 451-455, 1970.
[19] T. Fujisaki, F. Jelinek, J. Cocke, E. Black, and T. Nishino. A Probabilistic Parsing Method for Sentence Disambiguation. In Proceedings of the 1st International Workshop on Parsing Technologies, pp. 85-94, 1989.
[20] W. A. Gale and K. Church. Poor Estimates of Context are Worse than None. In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 283-287, 1990.
[21] D. Goddeau and V. Zue. Integrating Probabilistic LR Parsing into Speech Understanding Systems. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 181-184, 1992.
[22] R. Grishman, C. Macleod, and J. Sterling. Evaluating Parsing Strategies Using Standardized Parse Files. In Proceedings of the 3rd ACL Conference on Applied Natural Language Processing, pp. 156-161, 1992.
[23] H. V. Hamme and F. V. Aelten. An Adaptive-Beam Pruning Technique for Continuous Speech Recognition. In Proceedings of the International Conference on Spoken Language Processing, pp. 2083-2086, 1996.
[24] K. Inui, K. Shirai, T. Tokunaga, and H. Tanaka. Integration of Statistics-Based Techniques for Analysis of Japanese Sentences. Information Processing Society of Japan, SIGNL, Vol. 116, No. SIG-NL-116-6, 1996. (in Japanese).
[25] K. Inui, V. Sornlertlamvanich, H. Tanaka, and T. Tokunaga. A New Formalization of Probabilistic GLR Parsing. In Proceedings of the 5th International Workshop on Parsing Technologies, 1997.
[26] K. Inui, V. Sornlertlamvanich, H. Tanaka, and T. Tokunaga. A New Probabilistic LR Language Model for Statistical Parsing. Technical report, Department of Computer Science, Tokyo Institute of Technology, 1997.
[27] K. Kita, T. Morimoto, K. Ohkura, S. Sagayama, and Y. Yano. Spoken Sentence Recognition Based on HMM-LR with Hybrid Language Modeling. IEICE Transactions on Information and Systems, Vol. E77-D, No. 2, pp. 258-265, 1994.
[28] M. Lankhorst. An Empirical Comparison of Generalized LR Tables. In Twente Workshop on Language Technology (TWLT1), Tomita's Algorithm: Extensions and Applications, pp. 87-93, 1991.
[29] A. Lavie and M. Tomita. GLR* - An Efficient Noise-Skipping Parsing Algorithm for Context-Free Grammars. In Recent Advances in Parsing Technology [7], chapter 10, pp. 183-200.
[30] H. Li. Integrating Connection Constraints into a GLR Parser and its Application in a Continuous Speech Recognition System. Doctoral dissertation, Tokyo Institute of Technology, Tokyo, Japan, March 1996.
[31] D. M. Magerman. Natural Language Parsing as Statistical Pattern Recognition. Doctoral dissertation, Stanford University, California, February 1994.
[32] D. M. Magerman. Statistical Decision-Tree Models for Parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 276-283, 1995.
[33] D. M. Magerman and M. P. Marcus. Pearl: A Probabilistic Chart Parser. In Proceedings of the 2nd International Workshop on Parsing Technologies, pp. 193-199, 1991.
[34] D. M. Magerman and M. P. Marcus. Pearl: A Probabilistic Chart Parser. In Proceedings of the 5th Conference of the European Chapter of the Association for Computational Linguistics, pp. 15-20, 1991.
[35] D. M. Magerman and C. Weir. Efficiency, Robustness and Accuracy in Picky Chart Parsing. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pp. 40-47, 1992.
[36] M. Nagata. A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm. In Proceedings of the 15th International Conference on Computational Linguistics, pp. 201-207, 1994.
[37] S. Ortmanns, H. Ney, and A. Eiden. Language-Model Look-Ahead for Large Vocabulary Speech Recognition. In Proceedings of the International Conference on Spoken Language Processing, pp. 2095-2098, 1996.
[38] V. Sornlertlamvanich. Word Segmentation for Thai in Machine Translation System. In Machine Translation, pp. 50-56. National Electronics and Computer Technology Center, Bangkok, Thailand, 1993. (in Thai).
[39] V. Sornlertlamvanich, T. Charoenporn, and H. Isahara. ORCHID: Thai Part-Of-Speech Tagged Corpus. In T. Charoenporn, editor, Orchid, TR-NECTEC-1997001, pp. 5-19. National Electronics and Computer Technology Center, Thailand, December 1997.
[40] V. Sornlertlamvanich and T. Hozumi. The Automatic Extraction of Open Compounds from Text Corpora. In Proceedings of the 16th International Conference on Computational Linguistics, pp. 1143-1146, August 1996.
[41] V. Sornlertlamvanich, K. Inui, K. Shirai, H. Tanaka, T. Tokunaga, and T. Takezawa. Empirical Evaluation of Probabilistic GLR Parsing. In Proceedings of the Natural Language Processing Pacific Rim Symposium, pp. 169-174, 1997.
[42] V. Sornlertlamvanich, K. Inui, K. Shirai, H. Tanaka, T. Tokunaga, and T. Takezawa. Incorporating Probabilistic Parsing into an LR Parser - LR Table Engineering (4). Information Processing Society of Japan, SIGNL, Vol. 97, No. 53, pp. 61-68, May 1997.
[43] V. Sornlertlamvanich, K. Inui, H. Tanaka, and T. Tokunaga. A New Probabilistic LR Parsing. In Proceedings of the Third Annual Meeting of the Association for Natural Language Processing, pp. 301-304, March 1997.
[44] V. Sornlertlamvanich and W. Pantachat. Information-based Language Analysis for Thai. ASEAN Journal on Science & Technology for Development, Vol. 10, No. 2, pp. 181-196, 1993.
[45] V. Sornlertlamvanich, N. Takahashi, and H. Isahara. Thai Part-Of-Speech Tagged Corpus: ORCHID. In Proceedings of the Oriental COCOSDA Workshop, pp. 131-138, 1998.
[46] V. Sornlertlamvanich and H. Tanaka. Extracting Open Compounds from Text Corpora. In Proceedings of the Second Annual Meeting of the Association for Natural Language Processing, pp. 213-216, March 1996.
[47] V. Steinbiss, B.-H. Tran, and H. Ney. Improvements in Beam Search. In Proceedings of the International Conference on Spoken Language Processing, pp. 2143-2146, 1994.
[48] A. Stolcke. An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities. Computational Linguistics, Vol. 21, No. 2, pp. 165-201, 1995.
[49] K. Y. Su, J. N. Wang, M. H. Su, and J. S. Chang. GLR Parsing with Scoring. In Generalized LR Parsing [54], chapter 7, pp. 93-112.
[50] T. Takezawa. ATR Japanese Syntactic Structure Database and the Grammar. Technical report, ATR Interpreting Telecommunications Research Laboratories, April 1997. (in Japanese).
[51] H. Tanaka, T. Takezawa, and J. Etoh. Japanese Grammar for Speech Recognition Considering the MSLR Method. Technical Report 97-SLP-15-25, Information Processing Society of Japan, SIGNL, 1997. (in Japanese).
[52] H. Tanaka, T. Tokunaga, and M. Izawa. Integration of Morphological and Syntactic Analysis Based on GLR Parsing. In Recent Advances in Parsing Technology [7], chapter 17, pp. 325-342.
[53] M. Tomita. Efficient Parsing for Natural Language. Kluwer Academic Publishers, Boston, MA, 1985.
[54] M. Tomita. Generalized LR Parsing. Kluwer Academic Publishers, 1991.
[55] M. Tomita and S-K. Ng. The Generalized LR Parsing Algorithm. In Generalized LR Parsing [54], chapter 1, pp. 1-16.
[56] M. Ueki, T. Tokunaga, and H. Tanaka. A System for Japanese Morphological and Syntactic Analysis Using the EDR Dictionary. In Proceedings of the Symposium on Using the EDR Dictionaries, pp. 33-39, 1995. (in Japanese).
[57] A. J. Viterbi. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory, pp. 260-269, 1967.
[58] J. H. Wright. LR Parsing of Probabilistic Grammars with Input Uncertainty for Speech Recognition. Computer Speech and Language, pp. 297-323, 1990.
[59] J. H. Wright and E. N. Wrigley. Probabilistic LR Parsing for Speech Recognition. In Proceedings of the 1st International Workshop on Parsing Technologies, pp. 105-114, 1989.
[60] J. H. Wright and E. N. Wrigley. GLR Parsing with Probability. In Generalized LR Parsing [54], pp. 113-128.
[61] J. H. Wright, E. N. Wrigley, and R. Sharman. Adaptive Probabilistic Generalized LR Parsing. In Proceedings of the 2nd International Workshop on Parsing Technologies, pp. 100-109, 1991.
[62] T. Yamada and S. Sagayama. LR-Parser-Driven Viterbi Search with Hypotheses Merging Mechanism Using Context-Dependent Phone Models. In Proceedings of the International Conference on Spoken Language Processing, pp. 2103-2106, 1996.