Robust Incremental Parsing using Human-Like Memory Constraints

William Schuler, University of Minnesota
Samir AbdelRahman, Cairo University
Tim Miller, University of Minnesota
Lane Schwartz, University of Minnesota
Psycholinguistic studies suggest a model of human language processing that 1) performs incremental interpretation of spoken utterances or written text, 2) preserves ambiguity by maintaining competing analyses in parallel, and 3) operates within a severely constrained short-term memory store – possibly restricted to as few as four distinct elements. The first two observations are sometimes taken to endorse a probabilistic beam-search model, similar to some existing systems. But the last observation imposes a restriction that has not until now been evaluated in a corpus study. This paper first describes a relatively simple statistical model of incremental language processing that meets all three of the above desiderata. The paper then evaluates the coverage of an implementation of this model, and the accuracy with which this implementation can analyze a large syntactically-annotated corpus of English.
1 Introduction
Psycholinguistic studies suggest a model of human language processing with three important properties. First, eye-tracking studies (Tanenhaus et al., 1995; Brown-Schmidt, Campana, and Tanenhaus, 2002) suggest that humans analyze sentences incrementally, assembling and interpreting referential expressions even while they are still being pronounced. Second, humans appear to maintain competing analyses in parallel, with eye gaze showing significant attention to competitors (referents of words with similar prefixes to the correct word), even relatively long after the end of the word has been encountered, when attention to other distractor referents has fallen off (Dahan and Gaskell, 2007). Preserving ambiguity in a parallel, non-deterministic search like this may account for human robustness to missing, unknown, mispronounced, or misspelled words. Finally, studies of short-term memory capacity suggest that human language processing operates within a severely constrained short-term memory store – possibly restricted to as few as four distinct elements (Miller, 1956; Ericsson and Kintsch, 1995; Cowan, 2001).

The first two observations may be taken to endorse existing probabilistic beam-search models which maintain multiple competing analyses, pruned by contextual preferences and dead ends (e.g. Roark, 2001). But the last observation imposes a restriction that until now has not been evaluated in a corpus study. This paper first describes a relatively simple statistical model of incremental language processing that meets all three of the above desiderata. The paper then evaluates the coverage of an implementation of this model, and the accuracy with which this implementation can analyze a large syntactically-annotated corpus of English.

The remainder of this paper is organized as follows: Section 2 describes some current approaches to incremental parsing;
Section 3 describes a statistical framework for parsing using a bounded stack of explicit constituents; Section 4 describes an experiment to estimate the level of coverage of the Penn Treebank corpus that can be achieved with various stack memory limits, using a set of reversible tree transforms; and Section 5 gives accuracy results of a bounded-memory model trained on this corpus, given certain linguistically motivated assumptions of binary branching structure in tree annotation.

2 Related Work
The representational power and recognition complexity of context-free grammars are derived from the assumption of an infinite stack during recognition. Parsing with a bounded stack is therefore formally equivalent to finite-state parsing. Any bounded-stack model can be compiled into a simple HMM by defining hidden states corresponding to all combinations of stack elements. But the use of a stack of explicit constituents provides a means by which semantic referents (or simply head words) can influence individual attachment decisions. Attempting to enumerate all combinations of head-annotated constituent labels, let alone some representation of actual referents, would result in prohibitively large transition models.

Cascaded finite-state automata, as in FASTUS (Hobbs et al., 1996), make use of a bounded stack similar to the one described in this paper, but stack levels in such systems are typically dedicated to particular syntactic operations: e.g. a word group level, a phrasal level, and a clausal level. As a result, some amount of constituent structure may overflow its dedicated level and be sacrificed (for example, prepositional phrase attachment may be left underspecified). The model described here instead enforces stack bounds in toto, over all syntactic operations, by recognizing phrase structure trees in a right-corner transform. This transformed representation requires stack space only in cases of center embedding, so the stack bounds enforce a center-embedding limit, without placing limits on left- or right-recursive structure. The model is therefore free to allocate stack levels to word groups, phrasal constituents, or clauses in whatever combination is needed.

Finite-state equivalent parsers (and thus, bounded-stack parsers) have asymptotically linear run time. Other parsers (Sagae and Lavie, 2005) have achieved linear runtime complexity with unbounded stacks in incremental parsing by using a greedy strategy, pursuing locally most probable shift or reduce operations, conditioned on multiple surrounding words. This architecture is not compatible with the human memory modeling desiderata outlined in Section 1, however. Moreover, from a practical point of view, conditioning on lexical observations is not desirable in speech processing, where lexical items are not directly observed. Later results by the same authors (Sagae and Lavie, 2006) show a significant increase in accuracy when this greedy strategy is replaced with a parallel best-first strategy, but this is no longer worst-case linear.

Roark's (2001) top-down parser generates trees incrementally in a transformed representation related to that used in this paper, but requires distributions to be maintained over entire trees. This increases the beam width necessary to avoid parse failure. Moreover, although the system is conducting a beam search, the objects in this beam are growing, so the recognition complexity is not linear. In contrast, the model described in
this paper maintains distributions over stack configurations of fixed size, using Viterbi back pointers to only the most likely predecessor of each beam element in order to reconstruct most likely trees.

3 Bounded-Memory Parsing with a Time Series Model
This section describes a basic statistical framework – a factored time-series model, or Dynamic Bayes Net (DBN) – for recognizing hierarchic structures using a bounded set of memory elements, each with a finite number of states, at each time step. Unlike simple FSA compilation, this model maintains an explicit representation of active, incomplete phrase structure constituents on a bounded stack, so it can be readily extended with additional variables that depend on syntax (e.g. to track hypothesized entities or relations). These incomplete constituents are derived from ordinary phrase structure annotations through a series of bidirectional tree transforms. These transforms:

1. binarize phrase structure trees into linguistically motivated head-modifier branches (described in Section 3.2);
2. transform right-recursive sequences of branches to left-recursive sequences (described in Section 3.3); and
3. align transformed trees to an array of random variable values at each depth and time step of a probabilistic time-series model (described in Sections 3.4 and 3.5).
Following these transforms, a model can be trained from example trees, then run as a parser on unseen sentences. The transforms can then be reversed to evaluate the output of the parser. This representation will ultimately be used to evaluate the coverage of a bounded-memory model on a large corpus of tree-annotated sentences in Section 4, and to evaluate the accuracy of a basic (unsmoothed, unlexicalized) implementation of this model in Section 5.

3.1 Tree Transforms using Rewrite Rules

The tree transforms presented in this paper will be defined in terms of destructive rewrite rules applied iteratively to each constituent of a source tree, from leaves to root, and from left to right among siblings, to derive a target tree. These rewrites are ordered; when multiple rewrite rules apply to the same constituent, the later rewrites are applied to the results of the earlier ones (the appropriate analogy here is to a unix sed script, made sensitive to the beginning and end brackets of a constituent and those of its children). For example, the rewrite (shown here in bracketed-tree form):

  [A0 ... [A1 α2 α3] ...]  ⇒  [A0 ... α2 α3 ...]
could be used to iteratively eliminate all binary-branching nonterminal nodes in a tree, except the root. In the notation used in this paper,

• Roman uppercase letters (Ai) are variables matching constituent labels,
• Roman lowercase letters (ai) are variables matching terminal symbols,
• Greek lowercase letters (αi) are variables matching entire subtree structure,
• Roman letters followed by colons, followed by Greek letters (Ai:αi) are variables matching the label and structure, respectively, of the same subtree, and
• ellipses (...) are taken to match zero or more subtree structures, preserving the order of ellipses in cases where there is more than one (as in the rewrite shown above).
Many of the transforms used in this paper are reversible, meaning that the result of applying a transform to a tree, then applying the reverse of that transform to the resulting tree, will be the original tree itself. In general, a transform can be reversed if the direction of its rewrite rules is reversed, and if each constituent in a target tree matches a unique rewrite rule in the reversed transform.
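To make the rewrite machinery concrete, the following is a minimal Python sketch (an illustration only, not the implementation evaluated in this paper), assuming trees are represented as nested lists of the form [label, child, ...], with bare strings as terminals. The example rule is the binary-node-eliminating rewrite shown above.

    def rewrite(tree, rules):
        """Apply ordered, destructive rewrite rules to each constituent,
        from leaves to root and left to right among siblings."""
        if not isinstance(tree, list):                # terminal symbol
            return tree
        tree = [tree[0]] + [rewrite(child, rules) for child in tree[1:]]
        for rule in rules:                            # later rules see the output of earlier ones
            tree = rule(tree)
        return tree

    def splice_binary_children(node):
        """[A0 ... [A1 a2 a3] ...]  =>  [A0 ... a2 a3 ...]
        Replaces any binary-branching nonterminal child by its two children;
        applied bottom-up, this eliminates all binary nonterminals except the root."""
        new_children = []
        for child in node[1:]:
            if isinstance(child, list) and len(child) == 3:   # nonterminal with two children
                new_children.extend(child[1:])
            else:
                new_children.append(child)
        return [node[0]] + new_children

    tree = ['S', ['NP', ['DT', 'the'], ['NN', 'cat']], ['VP', ['VBD', 'slept'], ['RB', 'soundly']]]
    print(rewrite(tree, [splice_binary_children]))
    # ['S', ['DT', 'the'], ['NN', 'cat'], ['VBD', 'slept'], ['RB', 'soundly']]

The transforms in the following subsections can be read as rules of this kind.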
3.2 Binary branching structure

This paper will attempt to draw conclusions about the syntactic complexity of natural language, in terms of stack memory requirements in time-ordered (left-to-right) recognition. These requirements will be minimized by recognizing trees in a right-corner form, which accounts for partially recognized phrases and clauses as incomplete constituents, lacking one instance of another constituent yet to come. In particular, this study will use the trees in the Penn Treebank Wall Street Journal (WSJ) corpus (Marcus, Santorini, and Marcinkiewicz, 1994) as a data set.

In order to obtain a linguistically plausible right-corner transform representation of incomplete constituents, the corpus is subjected to another, preprocessing transform to introduce binary-branching nonterminal projections, and to fold empty categories into nonterminal symbols in a manner similar to that proposed by Johnson (1998b) and Klein and Manning (2003). This binarization is done in such a way as to preserve linguistic intuitions of head projection, so that the depth requirements of right-corner transformed trees will be reasonable approximations to the working memory requirements of a human reader or listener. Ordinary phrases and clauses are decomposed into head projections, each consisting of one subordinate head projection and one argument or modifier, e.g.:

  [A0 ... VB:α1 NP:α2 ...]  ⇒  [A0 ... [VB VB:α1 NP:α2] ...]
The selection of head constituents is done using rewrites based on the Magerman-Black head rules (Magerman, 1995). Any new constituent created by this process is assigned the label of the subordinate head projection. The subordinate projection may be the left or right child, so the decomposed structure may be left- or right-recursive. Conjunctions are decomposed into purely right-branching structures using nonterminals appended with a '-LIST' suffix:

  [A0 ... A1:α1 CC A1:α2]  ⇒  [A0 ... [A1-LIST A1:α1 CC A1:α2]]

  [A0 ... A1:α1 A1-LIST:α2]  ⇒  [A0 ... [A1-LIST A1:α1 A1-LIST:α2]]
This right-branching decomposition of conjoined lists is motivated by the general preference in English toward right-branching structure, and the distinction of right children as '-LIST' categories is motivated by the asymmetry of conjunctions such as 'and' and 'or' generally occurring only between constituents at the end of a list, not at the beginning. (Thus, 'tea or milk' is an NP-LIST constituent, whereas 'coffee, tea' is not.)

Empty constituents are removed outright. In the case of empty constituents representing traces, the extracted category label is annotated onto the lowest nonterminal dominating the trace using the suffix '-extrX,' where 'X' is the category of the extracted constituent. (To preserve grammaticality, this annotation is then passed up the tree and eliminated when a wh-, topicalized, or other moved constituent is encountered, but this does not affect branching structure.)

Together these rewrites remove about 65% of super-binary branches from the unprocessed Treebank. All remaining super-binary branches are 'nominally' decomposed into right-branching structures by introducing intermediate nodes, each with a label concatenated from the labels of its children, delimited by underscores:

  [A0 ... A1:α1 A2:α2]  ⇒  [A0 ... [A1_A2 A1:α1 A2:α2]]

This decomposition is 'nominal' in that the concatenated labels leave the resulting binary branches just as complex as the original n-ary branches prior to this decomposition. It is equivalent to leaving super-binary branches intact and using dot rules in parsing (Earley, 1970). This decomposition therefore does nothing to reduce sparse data effects in statistical parsing. A sample binarized tree is shown in Figure 1(a).
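As an illustration of this last step only, here is a minimal sketch of the 'nominal' right-branching decomposition over the nested-list trees of Section 3.1 (the head-projection and conjunction rewrites are omitted; in the actual pipeline this step applies only to super-binary branches that remain after those linguistically motivated rewrites):

    def label(t):
        return t[0] if isinstance(t, list) else t

    def nominal_binarize(node):
        """[A0 ... A1:a1 A2:a2]  =>  [A0 ... [A1_A2 A1:a1 A2:a2]]
        Repeatedly groups the last two children under a new node whose label
        concatenates their labels, until every branch is at most binary."""
        if not isinstance(node, list):
            return node
        children = [nominal_binarize(child) for child in node[1:]]
        while len(children) > 2:                              # super-binary branch
            a1, a2 = children[-2], children[-1]
            children = children[:-2] + [[label(a1) + '_' + label(a2), a1, a2]]
        return [node[0]] + children

    print(nominal_binarize(['NP', ['DT', 'the'], ['JJ', 'strong'], ['NN', 'demand']]))
    # ['NP', ['DT', 'the'], ['JJ_NN', ['JJ', 'strong'], ['NN', 'demand']]]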
Figure 1
Trees resulting from a) a binarization of a sample phrase structure tree, and b) a right-corner transform of this binarized tree. (The example sentence is "strong demand for new york city 's general obligation bonds propped up the municipal market.")
3.3 Right-corner transform

Binarized trees are then transformed into right-corner trees using transform rules similar to those described by Johnson (1998a). This right-corner transform is simply the left-right dual of a left-corner transform. It transforms all right-recursive sequences in each tree into left-recursive sequences of symbols of the form A1/A2, denoting an incomplete instance of category A1 lacking an instance of category A2 to the right. Rewrite rules for the right-corner transform are shown below, first to flatten out right-recursive structure:

  [A1 α1 [A2 α2 [A3 a3]]]  ⇒  [A1 [A1/A2 α1] [A2/A3 α2] a3]

  [A1 α1 [A2 α2 [A2/A3 ...] a3]]  ⇒  [A1 [A1/A2 α1] [A2/A3 α2] ... a3]

then to replace it with left-recursive structure:

  [A1 A1/A2:α1 [A2/A3 α2] α3 ...]  ⇒  [A1 [A1/A3 A1/A2:α1 α2] α3 ...]
Here, the first two rewrite rules are applied iteratively (bottom-up on the tree) to flatten all right recursion, using incomplete constituents to record the original nonterminal ordering. The final rule is then applied to generate left-recursive structure, preserving this ordering. Note that the last rewrite above leaves a unary branch at the leftmost child of each flattened node. This preserves the nodes at which the original tree was not right-branching, so the original tree can be reconstructed when the right-corner transform concatenates multiple right-recursive sequences into a single left-recursive sequence. An example of a right-corner transformed tree is shown in Figure 1(b).

An important property of this transform is that it is reversible. Rewrite rules for reversing a right-corner transform are simply the converse of those shown above. The correctness of this can be demonstrated by dividing a tree into maximal sequences of right-recursive branches (that is, maximal sequences of adjacent right children). The first two 'flattening' rewrites of the right-corner transform, applied to any such sequence, will replace the right-branching nonterminal nodes with a flat sequence of nodes labeled with slash categories, which preserves the order of the nonterminal category symbols in the original nodes. Reversing this rewrite will therefore generate the original sequence of nonterminal nodes. The final rewrite similarly preserves the order of these nonterminal symbols while grouping them from the left to the right, so reversing this rewrite will reproduce the original version of the flattened tree.

3.3.1 Comparison with left-corner transform

A right-corner transform is used in this study, rather than a left-corner transform, mainly because the right-corner version coincides with an intuition about how incomplete constituents might be stored in human memory. Stacked-up constituents in the right-corner form are chunks that have been encountered, rather than hypothesized goal constituents. Intuitively, in the right-corner view, after a sentence has been recognized, the stack memory contains a sentence (and some associated referent). In the left-corner view, on the other hand, the stack memory is empty after a sentence has been recognized. Second, stacked-up constituents in the right-corner form are not yet attached (although there may be a strong statistical likelihood that they will be). This means that referents of these constituents are, for the time being, independent, suggesting the need for a distinct memory element. Finally, the right-corner version is sometimes more concise when starting with mostly right-branching trees. This is because when rewrite rules are applied from the bottom up, in some cases distinct right-branching sequences (e.g. a noun phrase subject followed by a verb phrase, each with several adjective or adverb modifiers) may be concatenated into a single left-branching sequence in the transformed tree. The resulting tree can then be recognized using fewer levels of stack memory. A left-corner transform similarly consolidates left-branching sequences, but these are less common in the data set. Coverage results for stack bounds are compared on left- and right-corner transformed corpora in Section 4.
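To make the transform concrete, the following is a minimal sketch of a right-corner transformer for binarized trees (an illustration under simplifying assumptions, not the authors' implementation). It treats preterminal [tag, word] nodes as atomic units at the bottom of each right spine, an assumption consistent with Figure 1(b); the left-corner dual discussed above could be obtained by mirroring the spine traversal.

    def is_preterminal(t):
        return isinstance(t, list) and len(t) == 2 and not isinstance(t[1], list)

    def right_corner(tree):
        """Flatten each right-recursive spine and regroup it left-recursively,
        labeling incomplete constituents A1/A2 (an A1 lacking an A2 to the right)."""
        if not isinstance(tree, list) or is_preterminal(tree):
            return tree
        # walk down the right spine, transforming left children recursively
        labels, lefts, node = [], [], tree
        while isinstance(node, list) and len(node) == 3:
            labels.append(node[0])
            lefts.append(right_corner(node[1]))
            node = node[2]
        bottom = right_corner(node)                  # preterminal (or unary node) ending the spine
        if not lefts:                                # no binary right recursion at this node
            return [tree[0]] + [right_corner(child) for child in tree[1:]]
        labels.append(bottom[0] if isinstance(bottom, list) else bottom)
        # flatten: [A1 a1 [A2 a2 [A3 w]]] => [A1 [A1/A2 a1] [A2/A3 a2] [A3 w]]
        # regroup:                        => [A1 [A1/A3 [A1/A2 a1] a2] [A3 w]]
        acc = [labels[0] + '/' + labels[1], lefts[0]]          # unary branch at the leftmost child
        for i in range(1, len(lefts)):
            acc = [labels[0] + '/' + labels[i + 1], acc, lefts[i]]
        return [labels[0], acc, bottom]

    print(right_corner(['S', ['NP', ['DT', 'the'], ['NN', 'dog']], ['VP', ['VBD', 'barked']]]))
    # ['S', ['S/VP', ['NP', ['NP/NN', ['DT', 'the']], ['NN', 'dog']]], ['VP', ['VBD', 'barked']]]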
3.3.2 Comparison with CCG

The incomplete constituent categories generated in the right-corner transform have the same form and much of the same meaning as non-constituent categories in a Combinatorial Categorial Grammar (CCG) (Steedman, 2000). (In fact, one of the original motivations for CCG as a model of language was to minimize stack usage in incremental processing; Ades and Steedman, 1982.) Both CCG operations of forward function application:

  A1 → A1/A2  A2

and forward function composition:

  A1/A3 → A1/A2  A2/A3

appear in the branching structure of right-corner transformed trees. Nested operations can also occur in CCG derivations:

  A1/A2 → A1/A2/A3  A3

as well as in right-corner transformed trees (using the underscore delimiters described in Section 3.2):

  A1/A2 → A1/A3_A2  A3

There are correlates of type-raising, too (unary branches introduced by the right-corner transform operations described in Section 3.3):

  A1/A2 → A3

But, importantly, the right-corner transform generates no correlates to the CCG operations of backward function application or composition:

  A1 → A2  A1\A2

  A1\A3 → A2\A3  A1\A2

This has two consequences. First, right-corner transform models do not introduce spurious ambiguity between type-raised forward and backward operations, as CCG derivations do. Second, since leftward dependencies (as between a verb and its subject in English) cannot be incorporated into lexical categories, right-corner transform models cannot be taken to explicitly encode argument structure, as CCGs can be. The right-corner transform model described in this paper is therefore perhaps better regarded as a 'performance model' of processing, given subcategorizations specified in some other grammar (such as, in this case, the Treebank grammar), rather than a constraint on grammar itself.
3.4 Hierarchic HMMs

Right-corner transformed trees from the training set can then be mapped to random variable positions in a Hierarchic Hidden Markov Model (HHMM) (Murphy and Paskin, 2001), essentially a Hidden Markov Model (HMM) factored into some fixed number of stack levels at each time step. HMMs characterize speech or text as a sequence of hidden states q_t (which may consist of phones, words, or other hypothesized syntactic or semantic information), and observed states o_t at corresponding time steps t (typically short, overlapping frames of an audio signal, or words or characters in a text processing application). A most likely sequence of hidden states q̂_{1..T} can then be hypothesized given any sequence of observed states o_{1..T}, using Bayes' Law (Equation 2) and Markov independence assumptions (Equation 3) to define a full P(q_{1..T} | o_{1..T}) probability as the product of a Language Model (Θ_L) prior probability P(q_{1..T}) ≐ ∏_t P̂_{Θ_L}(q_t | q_{t−1}) and an Observation Model (Θ_O) likelihood probability P(o_{1..T} | q_{1..T}) ≐ ∏_t P̂_{Θ_O}(o_t | q_t) (Baker, 1975; Jelinek, Bahl, and Mercer, 1975):

  \hat{q}_{1..T} = \operatorname{argmax}_{q_{1..T}} P(q_{1..T} \mid o_{1..T})                                          (1)
                 = \operatorname{argmax}_{q_{1..T}} P(q_{1..T}) \cdot P(o_{1..T} \mid q_{1..T})                        (2)
                 \doteq \operatorname{argmax}_{q_{1..T}} \prod_{t=1}^{T} \hat{P}_{\Theta_L}(q_t \mid q_{t-1}) \cdot \hat{P}_{\Theta_O}(o_t \mid q_t)     (3)
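As a reference point, the following is a minimal Viterbi-style decoder for this argmax over a flat HMM (a generic sketch of Equations 1–3, not the parser itself), assuming transition, emission, and initial probabilities are supplied as nested dictionaries:

    import math

    def viterbi(obs, states, trans, emit, init):
        """Most likely hidden sequence under
        P(q_1..T | o_1..T) proportional to prod_t P_L(q_t|q_{t-1}) * P_O(o_t|q_t)."""
        def lp(p):                                    # log probability; unseen events get -inf
            return math.log(p) if p > 0 else float('-inf')
        delta = {q: lp(init.get(q, 0)) + lp(emit.get(q, {}).get(obs[0], 0)) for q in states}
        back = []
        for o in obs[1:]:
            prev, delta, ptr = delta, {}, {}
            for q in states:
                best = max(states, key=lambda p: prev[p] + lp(trans.get(p, {}).get(q, 0)))
                delta[q] = prev[best] + lp(trans.get(best, {}).get(q, 0)) + lp(emit.get(q, {}).get(o, 0))
                ptr[q] = best
            back.append(ptr)
        q = max(delta, key=delta.get)                 # best final state
        path = [q]
        for ptr in reversed(back):                    # follow Viterbi back pointers
            q = ptr[q]
            path.append(q)
        return list(reversed(path))

In the bounded-stack model described next, the hidden state q_t is itself factored into depth-specific variables, but the decoding objective is the same.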
HMM transitions can be modeled using Weighted Finite State Automata (WFSAs), corresponding to regular expressions. An HMM state q_t is then a WFSA state, or a symbol position in a corresponding regular expression.

Language model transitions P̂_{Θ_L}(q_t | q_{t−1}) over complex hidden states q_t can be modeled using synchronized levels of stacked-up component HMMs in a Hierarchic Hidden Markov Model (HHMM) (Murphy and Paskin, 2001), analogous to a shift-reduce parser with a bounded stack. HHMM transition probabilities are calculated in two phases: a 'reduce' phase (resulting in an intermediate final-state variable f_t), modeling whether component HMMs terminate; and a 'shift' phase (resulting in a modeled state q_t), in which unterminated HMMs transition, and terminated HMMs are re-initialized from their parent HMMs. Variables over intermediate f_t and modeled q_t states are factored into sequences of depth-specific variables, one for each of D levels in the HHMM hierarchy:

  f_t = \langle f_t^1 \ldots f_t^D \rangle                                                                             (4)
  q_t = \langle q_t^1 \ldots q_t^D \rangle                                                                             (5)

Transition probabilities are then calculated as a product of transition probabilities at each level, using level-specific 'reduce' Θ_F and 'shift' Θ_Q models:

  P(q_t \mid q_{t-1}) = \sum_{f_t} P(f_t \mid q_{t-1}) \cdot P(q_t \mid f_t\, q_{t-1})                                 (6)
                      \doteq \sum_{f_t^1 \ldots f_t^D} \prod_{d=1}^{D} \hat{P}_{\Theta_F}(f_t^d \mid f_t^{d+1} q_{t-1}^d q_{t-1}^{d-1}) \cdot \prod_{d=1}^{D} \hat{P}_{\Theta_Q}(q_t^d \mid f_t^{d+1} f_t^d q_{t-1}^d q_t^{d-1})     (7)

with f_t^{D+1} and q_t^0 defined as constants.
Figure 2
Graphical representation of a Hierarchic Hidden Markov Model. Circles denote random variables, and edges denote conditional dependencies. Shaded circles denote variables with observed values.
A graphical representation of an HHMM with three levels is shown in Figure 2.

Shift and reduce probabilities can now be defined in terms of finitely recursive FSAs with probability distributions over recursive expansion, transition, and reduction of states at each hierarchy level. In simple HHMMs, each intermediate state variable is a boolean switching variable f_t^d ∈ {0, 1} and each modeled state variable q_t^d is a syntactic, lexical, or phonetic state. The intermediate variable f_t^d is modeled:

  \hat{P}_{\Theta_F}(f_t^d \mid f_t^{d+1} q_{t-1}^d q_{t-1}^{d-1}) \doteq
    \begin{cases}
      \text{if } f_t^{d+1} = 0 : [f_t^d = 0] \\
      \text{if } f_t^{d+1} = 1 : \hat{P}_{\Theta_{F\text{-}Reduce}}(f_t^d \mid q_{t-1}^d, q_{t-1}^{d-1})
    \end{cases}                                                                                                        (8)

where f^{D+1} = 1 and q_t^0 = ROOT, and [·] is an indicator function: [φ] = 1 if φ is true, and 0 otherwise. Shift probabilities at each level are defined using level-specific transition Θ_Q-Trans and expansion Θ_Q-Expand models:

  \hat{P}_{\Theta_Q}(q_t^d \mid f_t^{d+1} f_t^d q_{t-1}^d q_t^{d-1}) \doteq
    \begin{cases}
      \text{if } f_t^{d+1} = 0,\ f_t^d = 0 : [q_t^d = q_{t-1}^d] \\
      \text{if } f_t^{d+1} = 1,\ f_t^d = 0 : \hat{P}_{\Theta_{Q\text{-}Trans}}(q_t^d \mid q_{t-1}^d q_t^{d-1}) \\
      \text{if } f_t^{d+1} = 1,\ f_t^d = 1 : \hat{P}_{\Theta_{Q\text{-}Expand}}(q_t^d \mid q_t^{d-1})
    \end{cases}                                                                                                        (9)

where f^{D+1} = 1 and q_t^0 = ROOT. This model is conditioned on final-state switching variables at and immediately below the current HHMM level. If there is no final state immediately below the current level (the first case above), it deterministically copies the current HHMM state forward to the next time step. If there is a final state immediately below the current level (the second case above), it transitions the HHMM state at the current level, according to the distribution Θ_Q-Trans. And if the state at the current level is final (the third case above), it re-initializes this state given the state at the level above, according to the distribution Θ_Q-Expand. Models Θ_Q-Expand and Θ_Q-Trans are defined directly from training examples using relative frequency estimation.
Figure 3
Sample tree from Figure 1 mapped to q_t^d variable positions of an HHMM at each stack depth d (vertical) and time step t (horizontal). This tree uses only two levels of stack memory. Values for final-state variables f_t^d are not shown. Note that the mapping transform omits some nonterminal labels; labels for these nodes can be reconstructed from their children.
The overall effect is that higher-level HMMs are allowed to transition only when lower-level HMMs terminate. An HHMM therefore behaves like a probabilistic implementation of a pushdown automaton (or 'shift-reduce' parser) with a finite stack, where the maximum stack depth is equal to the number of levels in the HHMM hierarchy.
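A minimal sketch of the factored transition in Equations 6–9 follows, for exposition only: it enumerates the joint state space exhaustively, which a real decoder would not do, and the dictionaries theta_f_reduce, theta_q_trans, and theta_q_expand are hypothetical names standing in for relative-frequency estimates of Θ_F-Reduce, Θ_Q-Trans, and Θ_Q-Expand.

    from itertools import product

    def transition_probs(q_prev, D, states, theta_f_reduce, theta_q_trans, theta_q_expand):
        """Return {q_t: P(q_t | q_{t-1})} for a D-level HHMM, summing over the
        intermediate reduce variables f_t as in Equation 6. q_prev and the keys
        of the result are D-tuples indexed by depth (index d-1 holds depth d)."""
        result = {}
        for f in product((0, 1), repeat=D):                       # f[d-1] = f_t^d
            # reduce phase (Equation 8); f_t^{D+1} is defined to be 1
            p_f = 1.0
            for d in range(1, D + 1):
                f_below = 1 if d == D else f[d]
                q_above_prev = q_prev[d - 2] if d > 1 else 'ROOT'
                if f_below == 0:
                    p_f *= 1.0 if f[d - 1] == 0 else 0.0          # indicator [f_t^d = 0]
                else:
                    p_f *= theta_f_reduce.get((q_prev[d - 1], q_above_prev), {}).get(f[d - 1], 0.0)
            if p_f == 0.0:
                continue
            # shift phase (Equation 9), for every candidate q_t
            for q in product(states, repeat=D):
                p_q = 1.0
                for d in range(1, D + 1):
                    f_below = 1 if d == D else f[d]
                    q_above_now = q[d - 2] if d > 1 else 'ROOT'
                    if f_below == 0 and f[d - 1] == 0:
                        p_q *= 1.0 if q[d - 1] == q_prev[d - 1] else 0.0      # copy state forward
                    elif f_below == 1 and f[d - 1] == 0:
                        p_q *= theta_q_trans.get((q_prev[d - 1], q_above_now), {}).get(q[d - 1], 0.0)
                    else:                                         # f_t^{d+1} = 1 and f_t^d = 1
                        p_q *= theta_q_expand.get(q_above_now, {}).get(q[d - 1], 0.0)
                    if p_q == 0.0:
                        break
                if p_q > 0.0:
                    result[q] = result.get(q, 0.0) + p_f * p_q    # Equation 6: sum over f_t
        return result

Combined with beam pruning and Viterbi back pointers to the most likely predecessor of each beam element, as described earlier, such distributions support incremental decoding over the bounded stack.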
3.5 Mapping trees to HHMM derivations

Any tree can be mapped to an HHMM derivation by aligning the nonterminals with q_t^d categories. First, it is necessary to define rightward depth d, right index position t, and final (right) child status f, for every nonterminal node A in a tree, where:

• d is defined to be the number of right branches between node A and the root,
• t is defined to be the number of words beneath or to the left of node A, and
• f is defined to be 0 if node A is a left (or unary) child, 1 otherwise.
Any binary-branching tree can then be annotated with these values and rewritten to define labels and final-state values for every combination of d and t covered by the tree. This is done using the rewrite rule: d,t, A0 , f
d,t−1, A1 |A2 , 0
d,t, A0 , f
d+1,t, A2 , 1
⇒
d,t ′, A1 , 0
d+1,t, A2 , 1
d,t ′ +1, A1 |A2 , 0 d,t ′, A1 |A2 , 0
which copies stacked up left child constituents over multiple time steps, while lowerlevel (right child) constituents are being recognized. The dashed line on the right side of the rewrite rule represents the variable number of time steps for a stacked up higher-level constituent (as seen, for example, in time steps 4-7 at depth 1 in Figure 3). Coordinates 11
Coordinates d,t ≤ D,T that have f = 1 are assigned label '−', and coordinates not covered by the tree are assigned label '−' and f = 1. Here, the bar notation A1|A2 specifies that constituent A2 is being expanded in a lower stack level. This allows transitions to depend on the sub-constituent that was recognized immediately previously. For example, an incomplete noun phrase NP/NN should transition to another NP/NN after recognizing an adjective, but should transition to a complete NP after recognizing a noun. The bar notation is equivalent to adding a dependency to each variable q_t^d from the previous variable at the stack level below, but preserves the original HHMM topology, and therefore the associated complexity proofs.

The resulting label and final-state values at each node now define a value of q_t^d and f_{t+1}^d for each depth d and time step t of the HHMM (see Figure 3). Probabilities for HHMM models Θ_Q-Expand, Θ_Q-Trans, and Θ_F-Reduce can then be estimated from these values directly. Like the right-corner transform, this mapping is reversible, so q and f values can be taken from a hypothesized most likely sequence and mapped back to trees (which can then undergo the reverse of the right-corner transform to become ordinary phrase structure trees). Inspection of the above rewrite rule will reveal that the reverse of this transform simply involves deleting unary-branching sequences that differ only in the value of t.

This alignment of right-corner transformed trees also has the interesting property that the categories on the stack at any given time step represent a segmentation of the input up to that time step. For example, in Figure 3 at t=12 the stack contains a sentence lacking a verb phrase: S/VP ('strong demand for ... bonds'), followed by a verb projection lacking a particle: VBN/PRT ('propped').
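The d, t, f annotation itself is easy to state procedurally. Here is a minimal sketch over the nested-list trees used earlier; indexing conventions (such as the depth assigned to the root) are assumptions of this illustration rather than details taken from the text.

    def annotate(tree, d=0, t=0, f=0):
        """Return (annotated_tree, rightmost_word_index), where each nonterminal
        is tagged with (d, t, label, f):
          d = number of right branches between the node and the root,
          t = number of words beneath or to the left of the node,
          f = 0 for a left (or unary) child, 1 for a right child."""
        if not isinstance(tree, list):               # terminal word
            return tree, t + 1
        children, t_end = [], t
        for i, child in enumerate(tree[1:]):
            is_right = (len(tree) > 2) and (i == len(tree) - 2)   # last of two or more children
            annotated, t_end = annotate(child, d + 1 if is_right else d, t_end, int(is_right))
            children.append(annotated)
        return [(d, t_end, tree[0], f)] + children, t_end

Copying each stacked-up left child across time steps, as in the rewrite rule above, would then fill in the remaining per-time-step variable values of Figure 3.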
3.5.1 Comparison with Cascaded FSAs in Information Extraction

Hierarchies of WFSA-equivalent HMMs may also be viewed as probabilistic implementations of cascaded FSAs, used for modeling syntax in information extraction systems such as FASTUS (Hobbs et al., 1996). Indeed, the diagonal left-branching sequences of transformed constituents recognized by this model (as shown in Figure 3) bear a strong resemblance to the flattened phrase structure representations recognized by cascaded FSA systems, in that most phrases are consolidated to flat sequences at one hierarchy level. This flat structure is desirable in cascaded FSA systems because it allows information to be extracted from noun or verb phrases using straightforward pattern matching rules, implemented as FSA-equivalent regular expressions.

Like FASTUS, this system produces layers of flat phrases that can be searched using regular expression pattern-matching rules. It also has a fixed number of levels and linear-time recognition complexity. But unlike FASTUS, the model described here can produce – and can be trained on – complete phrase structure trees (accessible by reversing the transforms described above). This suggests the exciting possibility of constructing a Treebank-trained statistical information extraction system by embedding regular expression patterns, similar to those used in FASTUS, directly into this HHMM structure, as joint variables of syntactic constituents and semantic pattern-matching states.
Table 1
Percent coverage of right-corner transformed treebank sections 2–21, for each stack memory capacity, with punctuation.

  stack memory capacity    sentences   coverage
  no stack memory                 35      0.09%
  1 stack element              2,288      5.74%
  2 stack elements            16,672     41.88%
  3 stack elements            34,179     85.81%
  4 stack elements            39,317     98.71%
  5 stack elements            39,805     99.93%
  TOTAL                       39,832    100.00%
Table 2
Percent coverage of right-corner transformed treebank sections 2–21 with punctuation omitted.

  stack memory capacity    sentences   coverage
  no stack memory                127      0.32%
  1 stack element              3,496      8.78%
  2 stack elements            25,909     65.05%
  3 stack elements            38,902     97.67%
  4 stack elements            39,816     99.96%
  5 stack elements            39,832    100.00%
  TOTAL                       39,832    100.00%
4 Coverage Results
Sections 2–21 of the Penn Treebank WSJ corpus were transformed as described in Section 3. Coverage of this corpus, in sentences, for a recognizer using one to five levels of stack memory is shown in Table 1.

In the framework described above, punctuation was mostly covered by 'nominal' binarization rewrites. But branching structure for punctuation can be difficult to motivate on linguistic grounds. And from a psycholinguistic point of view, it seems questionable to account for punctuation using stack memory. In order to counter possible undesirable effects of an arbitrary branching analysis of punctuation, the system was run on a version of the WSJ corpus with punctuation removed. An analysis (Table 2) of the Penn Treebank WSJ corpus sections 2–21 without punctuation, using the right-corner transformed trees described above, shows that 97.67% of trees can be recognized using three hidden levels, and 99.96% can be recognized using four. Table 3 shows slightly worse results for a left-corner transformed corpus, using left-right reflections of the rewrite rules presented in Section 3.3.

Cowan (2001) documents empirically observed short-term memory limits of about four elements across a wide variety of tasks. It is therefore not surprising to find a similar limit in the memory required to parse the Treebank, assuming elements corresponding to right-corner-transformed incomplete constituents.
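A minimal sketch of the per-sentence bookkeeping behind such coverage figures follows (an approximation for illustration, not the evaluation code used here), assuming the nested-list trees and the right-corner sketch from Section 3.3:

    from collections import Counter

    def required_depth(tree):
        """Maximum number of right branches on any path from the root; for a
        right-corner transformed tree this approximates the number of stack
        levels needed for incremental recognition."""
        if not isinstance(tree, list):
            return 0
        children = tree[1:]
        if len(children) <= 1:                       # preterminal or unary: no new level
            return max((required_depth(c) for c in children), default=0)
        depths = [required_depth(c) for c in children[:-1]]          # left children: same level
        depths.append(1 + required_depth(children[-1]))              # rightmost child: one level deeper
        return max(depths)

    def coverage_histogram(transformed_trees):
        """Sentences per required stack depth, as tabulated in Tables 1-3."""
        return Counter(required_depth(t) for t in transformed_trees)

Accumulating these counts by capacity gives cumulative percentages in the layout of Tables 1–3.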
Table 3
Percent coverage of left-corner transformed treebank sections 2–21 with punctuation omitted.

  stack memory capacity    sentences   coverage
  no stack memory                127      0.32%
  1 stack element              4,237     10.64%
  2 stack elements            26,702     67.04%
  3 stack elements            38,802     97.41%
  4 stack elements            39,803     99.93%
  5 stack elements            39,832    100.00%
  TOTAL                       39,832    100.00%
But remembered sequences of letters or numbers can be doubled or tripled up (or 'chunked') to save memory. Could speakers learn to use categories that are merged or chunked into richer structures that span multiple levels in the current framework? The levels in the current framework may form natural barriers to chunking because, in human language processing, constituents may carry semantic referents as well as syntactic categories. Left-branching sequences of e.g. noun phrases therefore correspond to progressively constrained descriptions of entities or sets of entities. Learning a repertoire of 'chunked' combined constituents would therefore require building associations between large numbers of paired entities or entity sets.

5 Accuracy Results
The coverage results described in the previous section set an upper bound on the accuracy one could hope to achieve using only a few levels of stack memory. (Note, however, that this coverage statistic is expressed in terms of sentences, not constituents; most constituents in a depth-five tree are not at depth five themselves.) Since this bound is relatively high, it seems safe to conclude that these results do not rule out the possibility of using very small amounts of stack memory in parsing. But the right-corner transform can drastically alter the structure of a phrase structure tree, and on top of this, the HHMM parser described in Section 3 has very different dependencies than those of a PCFG. Is it still possible to achieve accurate parsing with this modified set of dependencies?

To answer this question, a baseline CKY (deep) parser and a right-corner HHMM (shallow) parser were evaluated on the standard WSJ section 23 parsing task, using the binarized tree set described in Section 3.2 as training data. Both the CKY baseline and the HHMM test systems were run with a simple POS model using relative frequency estimates from the training set, backed off to a discriminative (decision tree) model conditioned on the last five letters of each word, normalized over unigram POS probabilities. The CKY results are comparable to the baseline system reported by Klein and Manning (2003) using no parent or sibling conditioning. The results for HHMM parsing using these same binarized trees (modulo right-corner and variable-mapping transforms) were substantially better than CKY, most likely due to the expanded HHMM dependencies on previous (q_{t−1}^d) and parent (q_t^{d−1}) variables at each q_t^d.
14
Table 4
Labeled precision (LP), labeled recall (LR), weighted average (F score), and parse failure results for the basic CKY parser and the HHMM parser on unmodified and binarized WSJ sections 22 (sentences 1–393: 'devset') and 23–24 (all sentences), with and without punctuation, compared to Klein and Manning 2003 (KM'03) using baseline and parent-sibling (par+sib) conditioning, and Roark 2001 (R'01) using parent-sibling conditioning. Unlike Roark 2001, HHMM parse failures were not figured into recall and precision statistics.

  with punctuation (≤ 40 wds)       LP     LR      F   fail
  KM'03: unmodified, devset          −      −   72.6      0
  KM'03: par+sib, devset             −      −   77.4      0
  CKY: binarized, devset          72.3   71.1   71.7      0
  HHMM: par+sib, devset           81.4   82.9   82.1    1.4
  CKY: binarized, sect 23         72.0   69.7   70.8    0.3
  HHMM: par+sib, sect 23          79.7   80.4   80.1    0.6

  no punctuation (≤ 120 wds)        LP     LR      F   fail
  R'01: par+sib, sect 23–24       77.4   75.2      −    0.1
  HHMM: par+sib, sect 23–24       77.6   76.8   77.2    0.4
These experiments used a maximum stack depth of three, and conditioned expansion and transition probabilities for each q_t^d on only the portion of the parent category following the slash (that is, only A2 of A1/A2), in order to avoid sparse data effects. (Models at greater stack depths, and models depending on complete parent categories, or grandparent categories, etc., as in state-of-the-art parsers, could be developed using smoothing and backoff techniques, but this was left for later work.) These results are comparable to conditioning on the parent and sibling of each node in the Klein-Manning and Roark evaluations, shown in Table 4. This system was run with a beam width of 2000 hypotheses. A version of this HHMM model, extended to hypothesize semantic referents at each stack element, was shown to run in real time in speech recognition experiments at this same beam width, using a four-element stack and a frame rate of 100Hz, on a 2.6GHz dual quad-core desktop computer (Wu, Schwartz, and Schuler, 2008).

6 Conclusion
This paper has described a basic incremental parsing model that achieves worst-case linear time complexity by enforcing fixed limits on a stack of explicit (albeit incomplete) constituents. Initial results show that the use of only three to four levels of stack memory within this framework provides nearly complete coverage of the large-scale Penn Treebank corpus. Moreover, preliminary accuracy results for an unlexicalized, unsmoothed version of this model show about 80% recall and precision on the standard test set of this corpus, and 77% on a 'speechified' version of this corpus, both comparable to that reported for state-of-the-art cubic-time parsers using similar configurations of conditioning information, without lexicalization or smoothing. In other words, the system delivers the accurate phrase structure expected from deep parsing with the attractive runtime properties of a shallow parser.
References
Ades, Anthony E. and Mark Steedman. 1982. On the order of words. Linguistics and Philosophy, 4:517–558.
Baker, James. 1975. The dragon system: an overview. IEEE Transactions on Acoustics, Speech and Signal Processing, 23(1):24–29.
Brown-Schmidt, Sarah, Ellen Campana, and Michael K. Tanenhaus. 2002. Reference resolution in the wild: Online circumscription of referential domains in a natural interactive problem-solving task. In Proceedings of the 24th Annual Meeting of the Cognitive Science Society, pages 148–153, Fairfax, VA, August.
Cowan, Nelson. 2001. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24:87–185.
Dahan, Delphine and M. Gareth Gaskell. 2007. The temporal dynamics of ambiguity resolution: Evidence from spoken-word recognition. Journal of Memory and Language, 57(4):483–501.
Earley, Jay. 1970. An efficient context-free parsing algorithm. CACM, 13(2):94–102.
Ericsson, K. Anders and Walter Kintsch. 1995. Long-term working memory. Psychological Review, 102:211–245.
Hobbs, Jerry R., Douglas E. Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel, and Mabry Tyson. 1996. FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Finite State Devices for Natural Language Processing. MIT Press, Cambridge, MA, pages 383–406.
Jelinek, Frederick, Lalit R. Bahl, and Robert L. Mercer. 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory, 21:250–256.
Johnson, Mark. 1998a. Finite state approximation of constraint-based grammars using left-corner grammar transforms. In Proceedings of COLING/ACL, pages 619–623.
Johnson, Mark. 1998b. PCFG models of linguistic tree representation. Computational Linguistics, 24:613–632.
Klein, Dan and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430.
Magerman, David. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL'95).
Marcus, Mitchell P., Beatrice Santorini, and Mary Ann Marcinkiewicz. 1994. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.
Miller, George A. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97.
Murphy, Kevin P. and Mark A. Paskin. 2001. Linear time inference in hierarchical HMMs. In Proceedings of NIPS, pages 833–840.
Roark, Brian. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249–276.
Sagae, Kenji and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technologies (IWPT'05).
Sagae, Kenji and Alon Lavie. 2006. A best-first probabilistic shift-reduce parser. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 691–698.
Steedman, Mark. 2000. The Syntactic Process. MIT Press/Bradford Books, Cambridge, MA.
Tanenhaus, Michael K., Michael J. Spivey-Knowlton, Kathy M. Eberhard, and Julie E. Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268:1632–1634.
Wu, Stephen, Lane Schwartz, and William Schuler. 2008. Referential semantic language modeling for data-poor domains. In Proceedings of the International Conference on Audio, Speech and Signal Processing (ICASSP'08), Las Vegas, Nevada.