Ambiguity resolution in incremental parsing of natural language Fabrizio Costa1 , Paolo Frasconi1 , Vincenzo Lombardo2 , Patrick Sturt3 , Giovanni Soda1
Abstract
Incremental parsing is important in natural language processing and psycholinguistics because of its cognitive plausibility. Modeling the associated structures is believed to help a better understanding of the human parser. In earlier work we introduced a recursive neural network capable of performing syntactic ambiguity resolution in incremental parsing. In this paper we report a systematic analysis of the behavior of the network that allows us to gain important insights about the kind of information that is exploited to resolve different forms of ambiguity. In particular, we found that learning from examples allows the location of the attachment point to be predicted with high accuracy, while the discrimination amongst alternative syntactic structures with the same anchor point is only slightly better than a decision based purely on frequencies. We also introduce several new ideas to enhance the architectural design, obtaining significant improvements in prediction accuracy, up to a 25% error reduction on the same dataset used in previous work. Finally, we report large scale experiments on the entire Wall Street Journal section of the Penn Treebank. The best prediction accuracy of the model on this large dataset is 87.6%, an error reduction larger than 50% compared to previous results.
I. INTRODUCTION
The incremental strategy is a largely accepted hypothesis about the human mechanism of language processing. Under this model, processing proceeds from left to right, and each input word is assigned a structural position as it is being read [1]. The incrementality hypothesis is supported by several experimental studies that demonstrate how humans are able to assign a meaning to "almost any" initial (left) fragment of a sentence [2], that is, they are capable of anticipating syntactic and semantic decisions before reaching the end of the sentence [3], [4], [5]. In particular, under the strong incrementality framework (assumed in this paper), humans maintain a totally connected parse while scanning the input words from left to right, with no input stored in a disconnected state [6]. Although well accepted in the psycholinguistic community, incremental processing has received relatively modest attention in computational linguistics. In this direction, Roark & Johnson [7] have proposed a top-down left-corner probabilistic parser that uses a probabilistic best-first strategy and a beam-search heuristic to avoid the non-termination problems typical of top-down predictive parsers. Their parser proceeds incrementally from left to right, with one item of look-ahead, and maintains a fully connected tree spanning the left context, which they use to extract non-local dependency information. With regard to connectionist architectures, Lane & Henderson [8] have proposed Simple Synchrony Networks and applied
them to a small scale parsing problem. Their approach combines Temporal Synchrony Variable Binding and Simple Recurrent Networks in order to output representations of tree structures. In this paper we focus on ambiguity resolution in first-pass attachment. A strong model of incremental parsing can be formulated by means of a dynamic grammar with join operations that attach each new word to the syntactic structure spanning the previous input words [16]. Ambiguity in this context arises because several attachment choices are available for each word. This task is important in psycholinguistics to better understand the human language processor [3], [5]. In previous work [?], [10] we introduced a connectionist architecture that learns to predict first-pass attachment decisions from examples of parsed sentences. More recently [11] we have shown that the proposed architecture is capable of modeling some interesting aspects of human incremental processing. Our method, briefly reviewed in Section III, uses recursive neural networks (RNN) [12], [13] trained on a supervised preference problem where each example is a bag of syntactic trees and exactly one member of the bag is labeled as correct. The associated machine learning problem can be seen as a simplified version of the ranking problem introduced by Cohen et al. [14], since in the preference task we are only interested in determining the "best" element of the whole set. The task is also very similar to the one studied by Collins [15] in order to re-rank alternative syntactic structures. In [15], alternative candidate trees are ranked using the voted perceptron, and trees are mapped into a high-dimensional feature space using a convolution kernel. In contrast, when using RNNs, trees are mapped into a low-dimensional space, but the mapping is adaptive and therefore focused by the learning task rather than being dependent on hand-crafted or problem-specific kernel functions. Second, and no less important, making a prediction with the voted perceptron takes time proportional to the number of misclassifications encountered during training, while the RNN takes constant time (with respect to the training set size). As shown in the experiments reported in this paper, this property allows us to train RNNs on realistic large scale annotated linguistic corpora. In this paper we characterize what a recursive connectionist system has learned by investigating the structural features of the input. In order to characterize the solution we perform a statistical analysis of the correlation between structural features of the data and the generalization error of the network. We observe that the system consistently prefers simpler and more frequent structures, and we compare the findings to well known psycholinguistic heuristics. The behavior of the network suggests a way to simplify the syntactic trees by removing nodes that are too distant from the candidate attachment points. A significant error reduction is obtained by keeping only nodes that either belong to the right frontier or have a parent in the right frontier. Domain partitioning is a second way of injecting prior knowledge: we propose to specialize separate networks on different domain splits, according to the grammatical category of the word to be attached. All these enhancements produce a substantial error reduction. The rest of the paper is organized as follows. In section II we briefly describe the linguistic nature of the problem and we formulate the learning task as a preference task.
In section III we describe the model
for learning preferences on syntactic structures. In section IV we describe and characterize the dataset. In section V we study the main properties of the trained network and we correlate the prediction error with the structural properties of the input trees. In section VI we describe the enhanced architecture and report wide coverage experiments.

II. LINGUISTIC FRAMEWORK
In this section we give some basic concepts related to first-pass ambiguity resolution. More details can be found in [16].

A. Definitions
We assume that syntactic processing takes place in a strongly incremental fashion. This means that each word wi is processed by scanning the sentence from left to right and that we do not allow disconnected sub-structures to be assembled together at some later stage in processing. Let T be the parse tree for a sentence s = w1, . . . , wn. For i = 1, . . . , n we define the incremental tree Ti as the sub-tree of T recursively built in the following way (see Fig. 1 (a)):
• T1 consists of the chain of nodes and edges of T that goes from w1 to its maximal projection1.
• Ti consists of all the nodes and edges in Ti−1 and either:
  – the chain of nodes and edges of T starting from node R, where R is the lowest node of Ti−1 that dominates wi, or
  – the chain of nodes and edges of T starting from node R, where R is the lowest node of T that dominates both the root of Ti−1 and wi, together with the chain of nodes and edges that connects R with the root of Ti−1.
Given two incremental trees Ta and Tb, we define the difference between Ta and Tb as the set of all the edges that are in Ta but not in Tb, together with all the nodes touched by those edges. The difference between Ti and Ti−1 is called the connection path: cpi = Ti − Ti−1 (see Fig. 1 (b)). The node that belongs to both Ti−1 and cpi is called the anchor (this is the node where cpi attaches to Ti−1). The preterminal node for word wi is called the foot. This is the node whose label is the "part of speech" (POS) tag of word wi, and in our framework it is a leaf of the syntactic tree. POS tags are the syntactic categories of words and can be predicted with very high accuracy [17]. We use the symbol '◦' to denote the join operator, defined as Ti = Ti−1 ◦ cpi. According to the above definitions, an incremental tree Ti can always be written as the result of repeated joins: Ti = cp1 ◦ cp2 ◦ . . . ◦ cpi. We assume that the children of each node are ordered from left to right. The right frontier of an incremental tree is the sequence of the "last" children from the root node to a leaf (see Fig. 7). Join operations are always performed on nodes belonging to the right frontier.
1 A maximal projection of a word w is a non-terminal symbol X such that w is the linguistic head of X, that is, w is the word that characterizes the behavior of X. For example, a noun projects onto a Noun Phrase.
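To make the above definitions concrete, the following is a minimal sketch, in Python, of how an incremental tree can be represented and extended by the join operator; the Node class and the join helper are illustrative constructions, not part of the system described in this paper.

```python
class Node:
    """A labeled tree node; children are ordered left to right."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    def right_frontier(self):
        """Sequence of 'last' children from the root down to a leaf."""
        frontier, node = [self], self
        while node.children:
            node = node.children[-1]
            frontier.append(node)
        return frontier


def join(anchor, path_root):
    """Join operator T_i = T_{i-1} o cp_i (sketch).

    `anchor` is a node on the right frontier of T_{i-1}; `path_root` is the
    chain of new nodes introduced by the connection path, whose last leaf is
    the foot (the POS tag of the new word).
    """
    anchor.children.append(path_root)
    return anchor


# Tiny usage example: attach a foot node "VBZ" under a new "VP" child of "S".
root = Node("S", [Node("NP", [Node("PRP")])])
join(root.right_frontier()[0], Node("VP", [Node("VBZ")]))
print([n.label for n in root.right_frontier()])   # ['S', 'VP', 'VBZ']
```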
Fig. 1. (a) Example of incremental trees. (b) Anchor and Foot nodes.
Lombardo and Sturt [16] describe a procedure that takes as input a parse tree T and computes all the incremental trees T1, . . . , Tn and all the connection paths cp1, . . . , cpn. By applying this procedure to a treebank B (a set of sentences annotated with their parse trees) we obtain a set of connection paths called the universe of connection paths, denoted U(B).

B. First Pass Attachment Prediction
Suppose we are given a new sentence s = w1, . . . , wn not included in the treebank B, and suppose that at stage i of parsing we know the correct incremental tree Ti−1 spanning w1, . . . , wi−1. We want to compute the next tree Ti in order to accommodate the next word wi. Under the implicit hypothesis that U(B) contains the required connection path, Ti can be obtained by joining Ti−1 to some unknown path c in U(B). The prediction problem is then defined as follows: given Ti−1, find c ∈ U(B) such that Ti−1 ◦ c is the correct incremental tree spanning w1, . . . , wi. The set of candidate paths can be significantly reduced by enforcing the following two rules that must be satisfied by a legal joining:
• the foot of c must match the POS tag of wi;
• the anchor of c must match one of the nodes in the right frontier of Ti−1.
Note that U(B), along with the joining operator and the above rules, can be regarded as a dynamic grammar. This grammar, however, is highly ambiguous, as the set of paths that satisfy the two joining rules may be very large; a small illustrative sketch of the resulting candidate filtering is given after the following list. In particular, there are three different sources of ambiguity:
• a word can have more than one POS tag;
• the anchor node can be any node of the right frontier (see Fig. 2 (a));
• for each such pair there can exist more than one legal connection path (see Fig. 2 (b)).
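As a rough illustration of the candidate filtering referred to above, the following sketch reuses the hypothetical Node/join helpers from the previous sketch; representing U(B) as (anchor label, foot label, path) triples is an assumption made for the example, not the encoding used in the actual system.

```python
import copy

def candidate_forest(prev_tree, pos_tag, universe):
    """Enumerate legal joins of T_{i-1} with paths from U(B) (sketch).

    `universe` is a list of (anchor_label, foot_label, path_root) triples;
    these names are illustrative.  A path is legal if its foot matches the
    POS tag of the next word (rule 1) and its anchor label matches a node
    on the right frontier of T_{i-1} (rule 2).
    """
    forest = []
    for anchor_label, foot_label, path_root in universe:
        if foot_label != pos_tag:           # rule 1
            continue
        frontier = prev_tree.right_frontier()
        for position, node in enumerate(frontier):
            if node.label != anchor_label:  # rule 2
                continue
            candidate = copy.deepcopy(prev_tree)
            anchor = candidate.right_frontier()[position]
            join(anchor, copy.deepcopy(path_root))
            forest.append(candidate)
    return forest
```

The returned forest contains, under the coverage assumption, the correct incremental tree together with all the competing alternatives for the next word.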
Fig. 2. (a) Anchor point variability. (b) Connection path variability.
C. Left Recursion and Lexical Information
The incremental parser suffers from the problem of left recursion [18]. Since a left recursive structure can be arbitrarily nested, we cannot predict the correct connection path incrementally. There are a few practical and psycholinguistically motivated solutions in the literature [19], but in the current work we have resorted to an immediate approach which is extensively implemented in the Penn Treebank schema: namely, we flatten the tree structure and avoid the left recursion issue altogether. Consider as an example the application of the flattening procedure to a local tree like 1), which produces as a result a tree like 2):
1) [NP [NP DT NN] PP]
2) [NP DT NN PP]
Since the main focus of the present linguistic analysis is on syntax, no lexical information is used for prediction (two sentences having the same sequence of POS tags are therefore equivalent in our study).

D. Machine learning formulation
The first-pass attachment prediction problem can now be formulated as a machine learning problem. The set of trees obtained by legally joining Ti−1 to a path in U(B) will be referred to as the forest of candidates for word wi, denoted Fi = {Ti,1, . . . , Ti,mi}, where mi = |Fi|. Note that, under our assumptions, one and only one tree in Fi is the correct incremental tree spanning w1, . . . , wi. Typically, prediction problems are formulated either as classification or as regression, which can both be thought of as function approximation problems, where the image of the function is a subset of $\mathbb{R}$ for regression or a discrete set for classification. Here we are interested in ranking a set of alternatives, a task with characteristics of both of the previous problems: like classification it has a discrete set as image, and like regression there exists an ordering among the elements of the image. In contrast with regression, we are not dealing with an image that is a metric space, but with an image that admits only a partial order relation. We can give a more general formulation of these problems under the unifying framework of "map approximation". The problem can be stated in terms of establishing a map between sets of identifiable elements (i.e., sets for which the i-th element can be determined but where we are not interested in the relative ordering of the elements themselves, as in the case of strings or sequences), $K : S^x \mapsto S^y$.
We have $S^x \subset X^*$, where the elements $S_i^x \in S^x$ are tuples of variable length, $S_i^x = \langle x_1, \ldots, x_{m_i} \rangle$. We call the $x_j \in X$ instances. We proceed analogously for $S^y$. By imposing some constraints over $S^x$ and $S^y$ we can represent a range of different problems, including standard classification, regression, ranking, preference estimation, multiple instance classification [20], [21] and multiple instance regression [22]. As an example, consider the multiple instance classification problem. This problem arises when an object may have a range of different representations in the feature space. The objective is to attribute to a bag of instances $S_i = \{x_1, \ldots, x_{m_i}\}$ a class in $Y \equiv \{\oplus, \ominus\}$. In this case the image is a set of constant unitary size. A whole bag $S_i$ is mapped to $\oplus$, that is $K(S_i) \mapsto \oplus$, if there exists at least one element $x_j \in S_i$ for which the desired characterizing property holds, that is if $\exists j : x_j \in S_i,\ f(x_j) \mapsto \oplus$, where $f$ is the supervised classification function. If we allow the image to be continuous we obtain the case of multiple instance regression. In this case one "primary" instance in the set of representations is responsible for a real-valued output associated with the whole set, that is $f : S^x \mapsto \mathbb{R}$. The problem here is to find the optimal linear regression under the hypothesis of not knowing which element of the set is the primary one. We formulate the ranking problem as a mapping of an input tuple of arbitrarily many elements to an output tuple of discrete elements that admit a partial or total order relation. The definition of the order relation among the elements $y_j \in Y$ is arbitrary and problem specific, though a typical choice consists in taking $y_j \in \mathbb{N}$ with the standard $>$ relation among naturals. We denote by $K_j$ the value $y_j \in Y$ computed by the map $K$:
$$K(\langle x_1, \ldots, x_j, \ldots, x_{m_i} \rangle) \mapsto \langle y_1, \ldots, y_j, \ldots, y_{m_i} \rangle$$
$$K_j(\langle x_1, \ldots, x_j, \ldots, x_{m_i} \rangle) \mapsto y_j$$
We say that an instance $x_a$ is ranked higher than $x_b$ if $K_a(x) > K_b(x)$, and that they have equal rank if $K_a(x) = K_b(x)$. We define "preference" as a special case of ranking obtained by restricting the image to a binary set and constraining the greatest of the classes to be the image of at most one element. In this case $Y \equiv \{C_0, C_1\}$ where $C_1 \succ C_0$. The problem of finding the "greatest" or "preferred" element of a set $S_i = \{x_1, \ldots, x_{m_i}\}$ can now be stated in terms of finding the unique element $x_j \in S_i$ mapped to $C_1$. In this paper we are interested in modeling such a problem and in finding a robust algorithmic solution to the problem of learning a preference map from examples. The solution scheme that we adopt tries to satisfy the following two constraints for the preference problem:
1) $C_1 \succ C_0$
2) $\exists!\, j^* : K_{j^*}(\langle x_1, \ldots, x_j, \ldots, x_{m_i} \rangle) \mapsto C_1$
In order to satisfy constraint 1 we simply take $C_1 = 1$ and $C_0 = 0$, with the standard $>$ relation among naturals as ordering relation. In order to satisfy constraint 2 we choose
$$K_j(\langle x_1, \ldots, x_{m_i} \rangle) = \frac{k(x_j)}{\sum_{j'=1}^{m_i} k(x_{j'})} \qquad (1)$$
and hence $\sum_{j=1}^{m_i} K_j(x) = 1$. This ensures that if there exists at least one element $x_{j^*}$ for which $K_{j^*}(x) = 1$, then
it is unique; in other terms, only the instance $x_{j^*}$ will be mapped to $C_1 = 1$ while all the other elements of the tuple will be mapped to $C_0 = 0$. The condition $K_{j^*}(x) = 1$ is achieved if $k(x_{j^*}) \gg k(x_j)$ for $j \neq j^*$. The robustness of the solution is due to the fact that no constraints are imposed on the elements that are not to be preferred, other than the fact that the value computed by $k$ must be much smaller than the one computed on the element to be preferred. Moreover, note that the decomposition of $K$ in terms of the function $k$ is correctly invariant to any permutation of the instances in the input tuple and is independent of the size of the tuple.

E. Connectionist vs. frequency approach
According to the notation given above, we can restate our learning task as the estimation of a function K that, given a forest of incremental trees $F_i = \{T_{i1}, \ldots, T_{im_i}\}$, maps the correct element $T_i^*$ to $C_1 = 1$ and all the remaining trees to $C_0 = 0$. A reasonable approach is to derive a probabilistic estimator of K by collecting frequency information on all the forests in a large corpus. If the estimator is a multinomial model, we need to collect information on the joint probability of all configurations of incremental trees (left contexts) and POS tags (next words). This approach suffers from a severe data sparseness problem. The combinatorial nature of the grammar determines negligible probabilities for the repeated occurrence of the same incremental trees in training sets of any currently available size (i.e., $10^4$ to $10^6$ sentences). To quantify this statement we have selected a sample of 1000 sentences randomly divided into two sets of the same size, one for a nominal test set and one for a training set, and we have calculated the number of trees of the test set present in the training set, counting first the coincidences among correct incremental trees, and then among all trees (i.e., the correct incremental trees plus negative examples). For the correct incremental trees we obtain:

Correct trees in test set: 11,011
Correct trees in training set: 11,250
Correct trees from the test set also in the training set: 420 (4%)

and for the overall dataset (i.e., correct incremental trees plus negative examples):

Trees in test set: 480,928
Trees in training set: 517,308
Trees from the test set also in the training set: 4,469 (1%)
The small percentage of test-set trees seen in the training set clearly illustrates the infeasibility of a simple multinomial estimator. In computational linguistics, data sparseness is traditionally dealt with by resorting to backing-off or smoothing techniques, that is, by approximating frequency estimation through the
decomposition of complex and infrequent objects into more frequent sub-parts [23], or by resorting to marginalization techniques. It is then possible to estimate the overall object probability by composing the frequencies of the sub-parts, even if this forces some independence hypothesis over the data. In an incremental framework, this decomposition has been attempted by [7]. Our solution does not make any simplifying assumption and tries to take advantage of the whole information represented by the left context, while at the same time overcoming the data sparseness problem. This is achieved by resorting to a parametric estimator that models the underlying statistical phenomenon with a smaller set of parameters. In the following section we briefly restate the solution developed in [10] according to the formalism introduced in II-D.

III. CONNECTIONIST RECURSIVE ARCHITECTURE FOR PREFERENCE ESTIMATION
We present a two-step solution to the first-pass attachment preference estimation task. First, we find a way to encode each parse tree in a vector representation. Second, we show a way to learn a map K that determines the preferred element of a given forest of encoded alternatives. Both steps are implemented resorting to a connectionist computational scheme.

A. Tree Adaptive Encoding
The general theory developed in [12] allows the processing of directed acyclic graphs with a super-source. Here we are interested in the case of labeled ordered m-ary trees. By ordered we mean that, for each vertex v, a total order is defined on the m children of v. Specifically, ch[v] denotes the ordered m-tuple of vertices whose elements are v's children. If v has k < m children, then the last m − k elements of ch[v] are filled with a special entry nil, denoting a missing child. I(v) denotes the label attached to vertex v. In the case of syntactic trees, labels belong to a finite alphabet of nonterminal symbols $\mathcal{I} = \{NT_1, \ldots, NT_N\}$. The basic neural network architecture is based on the following recursive state space representation:
$$x(v) = f(x(ch[v]), I(v)) \qquad (2)$$
In the above equation, $x(v) \in \mathbb{R}^n$ denotes the state vector associated with node v and $x(ch[v]) \in \mathbb{R}^{m \cdot n}$ is a vector obtained by concatenating the components of the state vectors of v's children. $f : \mathcal{I} \times \mathbb{R}^{m \cdot n} \mapsto \mathbb{R}^n$ is the state transition function that maps the states at v's children and the label at v into the state vector at v. States in Eq. (2) are updated bottom-up, traversing the tree in post-order. If a child is missing, the corresponding entries in x(ch[v]) are filled with the frontier state, which is associated with the base step of the recursion; a typical choice is the zero vector. This computational scheme closely resembles the one used by frontier-to-root tree automata [24], but in that case states are discrete whereas here states are real vectors. The transition function f is implemented by a feed-forward neural network, according to the scheme:
$$a_j(v) = \omega_{j0} + \sum_{h=1}^{N} \omega_{jh}\, z_h(I(v)) + \sum_{k=1}^{m} \sum_{\ell=1}^{n} w_{jk\ell}\, x_\ell(ch_k[v]) \qquad (3)$$
$$x_j(v) = \tanh(a_j(v)), \quad j = 1, \ldots, n \qquad (4)$$
where $x_j(v)$ denotes the j-th component of the state vector at vertex v, $z_h(NT_q) = 1$ if $h = q$ and zero otherwise (i.e., we are using a one-hot encoding of symbols), $ch_k[v]$ denotes the k-th child of v, and $\omega_{jh}$, $w_{jk\ell}$ are adjustable weights. It is important to remark that the weights in the transition network are independent of the node v at which the transition function is applied. This can be seen as a form of weight sharing.

B. Connectionist Preference Learner
The learning task consists in computing the preference map K, as defined in II-D, over a forest of alternative incremental trees $x_i = F_i$ for each word position i in sentence s. Each tree has a distributed vector representation computed by the recursive network (the state of the root node of each tree). $K(x_i)$ generates a tuple $y_i$ that has to approximate the target $t_i = \langle 0, 0, \ldots, 1, 0, \ldots, 0 \rangle$, where only the correct element is associated with 1 and all the others with 0. We express K as a sequence of $K_j$, each one modeled through a feed-forward neural network. The resulting network, called the output network, is a modular network with $m_i$ sets of shared weights that has input $x_i$ and desired output $t_i$. We choose k in the exponential family for reasons that will become clearer in the next section. For each module with index $j \in \{1, \ldots, m_i\}$ we have:
$$a_{ij} = w_0 + \sum_{\ell=1}^{n} w_\ell\, x_{ij\ell} \qquad (5)$$
$$y_{ij} = \frac{\exp(a_{ij})}{\sum_{j'} \exp(a_{ij'})} \qquad (6)$$
$$K(x_i) = \langle y_{i1}, \ldots, y_{im_i} \rangle \qquad (7)$$
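The following numpy sketch puts Eqs. (2)-(7) together: a shared transition network encodes each tree bottom-up into the state of its root, and the output network turns the root states of a forest into normalized preferences. It reuses the hypothetical Node class sketched in Section II; the weight shapes, the label_index mapping and the random initialization are illustrative and are not the settings actually used in the experiments.

```python
import numpy as np

N_LABELS, STATE, MAX_CHILDREN = 71, 25, 15
rng = np.random.default_rng(0)

# Shared transition-network weights, Eqs. (3)-(4).
W_label = rng.normal(0.0, 0.1, (STATE, N_LABELS))
W_child = rng.normal(0.0, 0.1, (STATE, MAX_CHILDREN * STATE))
bias = np.zeros(STATE)
# Output-network weights, Eq. (5).
w_out, w0 = rng.normal(0.0, 0.1, STATE), 0.0

def encode(node, label_index):
    """Eq. (2): compute x(v) bottom-up; missing children use the zero state."""
    child_states = np.zeros(MAX_CHILDREN * STATE)
    for k, child in enumerate(node.children[:MAX_CHILDREN]):
        child_states[k * STATE:(k + 1) * STATE] = encode(child, label_index)
    one_hot = np.zeros(N_LABELS)
    one_hot[label_index[node.label]] = 1.0
    return np.tanh(bias + W_label @ one_hot + W_child @ child_states)

def preferences(forest, label_index):
    """Eqs. (5)-(7): normalized preferences over the candidate trees."""
    scores = np.array([w0 + w_out @ encode(tree, label_index) for tree in forest])
    scores -= scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```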
C. Parameters Optimization
The estimation of the parameters in a connectionist model is done by adjusting the weights of the model so as to iteratively diminish an error computed between the output of the system and the target output. The architecture introduced so far is the result of the coupling of two sub-systems: the output network and the unfolded transition network. In the remainder of this section we show the computation of the error contribution that is back-propagated from the output network to the transition network. We define the error function as the negative log-likelihood of the training set $\{x_i, t_i\}$, so that minimizing the error is equivalent to determining the maximum-likelihood estimate of the parameters given the training set. More formally, we write the likelihood for the set of training data as
$$L = \prod_i P(x_i, t_i) = \prod_i P(t_i | x_i) P(x_i) \qquad (8)$$
We define the error function
$$E = -\log L = -\sum_i \log P(t_i | x_i) - \sum_i \log P(x_i) \qquad (9)$$
Since the second term in (9) does not depend on the model parameters, it just represents an additive constant that can be dropped. We have
$$E = -\sum_i \log P(t_i | x_i) \qquad (10)$$
The value of the conditional distribution for this input can be written as
$$P(t_i | x_i) = \prod_{j=1}^{m_i} (y_{ij})^{t_{ij}} \qquad (11)$$
Substituting in (10) we obtain an error function of the form
$$E = -\sum_i \sum_{j=1}^{m_i} t_{ij} \log y_{ij} \qquad (12)$$
The absolute minimum of this error function occurs when $y_{ij} = t_{ij}$ for all values of j and i. Evaluating the derivatives of the error function we obtain
$$\frac{\partial E_i}{\partial a_j} = \sum_{j'} \frac{\partial E_i}{\partial y_{j'}} \frac{\partial y_{j'}}{\partial a_j} \qquad (13)$$
From (6) we have
$$\frac{\partial y_{j'}}{\partial a_j} = y_{j'} \Delta(j, j') - y_{j'} y_j \qquad (14)$$
where $\Delta(j, j')$ is the Kronecker delta, equal to 1 if $j = j'$ and 0 otherwise. From (12) we have
$$\frac{\partial E_i}{\partial y_{j'}} = -\frac{t_{j'}}{y_{j'}} \qquad (15)$$
Substituting (14) and (15) into (13) we have
$$\delta_{ij} = -\frac{\partial E_i}{\partial a_j} = t_j - y_j \qquad (16)$$
The choice of the exponential normalization allows us to compute in a very simple way the error contribution for each alternative in the forest Fi. The optimization of the state transition network is then solved by gradient descent. In this case, gradients are computed by a special form of back-propagation on the feed-forward network obtained by unrolling the state transition network according to the topology of the input tree, as shown in Section III-A. The algorithm was first proposed in [25] and is referred to as back-propagation through structure (BPTS). Backward propagation proceeds from the root to the leaves. Note that gradient contributions must be summed over all the replicas of the transition network to correctly implement weight sharing. In Fig. 3 we depict the coupling and unfolding of the transition network and the output network on a forest of two incremental trees.
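As a quick sanity check of Eq. (16), the following sketch compares the analytic derivative dE_i/da_j = y_j - t_j (so that delta_ij = t_j - y_j) against a central finite-difference estimate of the per-forest error (12); the activation values are arbitrary and chosen only for the example.

```python
import numpy as np

def forest_error(a, t):
    """E_i = -sum_j t_ij log y_ij with y = softmax(a); Eqs. (6) and (12)."""
    y = np.exp(a - a.max())
    y /= y.sum()
    return -np.sum(t * np.log(y))

a = np.array([0.3, -1.2, 0.8, 0.1])   # activations of a 4-candidate forest
t = np.array([0.0, 0.0, 1.0, 0.0])    # exactly one correct candidate

y = np.exp(a - a.max()); y /= y.sum()
analytic = y - t                      # dE_i/da_j = y_j - t_j, hence delta = t - y

h = 1e-6
numeric = np.array([(forest_error(a + h * e, t) - forest_error(a - h * e, t)) / (2 * h)
                    for e in np.eye(len(a))])
print(np.allclose(analytic, numeric, atol=1e-6))   # expected: True
```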
Fig. 3. Network unfolding on a forest of two elements. (a) Syntactic tree. (b) Unfolded recursive net. (c) Unfolded output network
IV. LINGUISTIC DATA AND NETWORK TRAINING
A. Linguistic Data
For the current investigation we used the Wall Street Journal section of the Penn Treebank corpus [26]. For our experiments we have adopted the standard setting widely accepted in the literature (see [27]): specifically, sections 2-21 have been used to form the training set (39,832 sentences, 950,026 words), section 23 has been used for the test set (2,416 sentences, 56,683 words) and section 24 for the validation set (3,677 sentences, 85,335 words). The entire dataset used for our experiments therefore includes 45,925 sentences for a total of 1,092,044 words. The average sentence length is 24 words, within a range of 1-141 (1-67 in the test set). The labels (tags) on the nodes of the parse trees can be divided into part-of-speech (POS, or pre-terminal) tags and non-terminal tags: POS tags dominate a single lexical item and indicate the syntactic category of the item (e.g., a noun or a verb), while non-terminal nodes dominate sequences called phrases that can be made of pre-terminal and/or non-terminal tags. In the Penn Treebank there are 45 POS tags and 26 non-terminal tags. Although the syntactic annotation schema provides a wide range of semantic and co-indexing information, we have used only syntactic information about the dominance relation2.
2 This limitation can be a very restrictive one, since most parsing models achieve valuable results by including lexical statistics on word occurrence and functional dependencies ([28], [23]).
B. Network Training
For each word wi in a sentence, a forest of alternatives is generated by extracting the incremental tree Ti−1 spanning w1, . . . , wi−1 and joining it with all the legal connection paths. Each sentence has an average length of 24 words and each forest contains on average 120 trees. Considering that the average number of nodes of an incremental tree is 27, the entire dataset has about $1 \cdot 10^6$ forests, $117 \cdot 10^6$ trees, and $3 \cdot 10^9$ nodes. Learning proceeds in an online fashion, updating weights after the presentation of each forest. A separate set of 1,000 sentences from section 24 of the treebank is used as a validation set to control overfitting by early stopping. Given the considerable amount of training data, accuracy on the validation set is monitored after the presentation of each group of 100 sentences. Optimization is stopped if the validation error has reached a minimum and does not decrease for the subsequent 1,000 sentences. In this setting, three epochs were enough to reach the optimum. The recursive network is a single-layer feed-forward network with an input vector of 447 units, partitioned into 1 unit for the threshold, 71 units for the one-hot encoding of the non-terminal and POS tag symbols, and 25 units for the state vector of each child node, with the maximum number of children fixed to 15. The output layer of the recursive network represents the state of the node being processed and is consequently made of 25 units, for a total of 11K free parameters. The non-linearity is the standard hyperbolic tangent function. The output network is a feed-forward network with 25+1 input units and 1 output unit, and it is replicated as many times as the number of alternatives in the current forest (120 on average). The non-linear function for the output network is the softmax function. Once the output network and the recursive network have been unrolled, the forward and backward phases proceed following the standard back-propagation algorithm with a fixed learning rate λ and momentum m. Good values for the parameters have been experimentally determined on a working set to be $\lambda = 10^{-3}$ and m = 0.1. Training the system on the whole dataset takes less than 3 days of CPU time per epoch on a 1GHz Intel Pentium III processor. We evaluate the learning curve of the system. The training set has been partitioned into sub-sets of 100, 400, 1,000, 4,000, 10,000 and 40,000 sentences. A validation set of 1,000 sentences is used to avoid overtraining. We report in Fig. 4 the percentage of times that the correct element has been ranked by the system in the first position on a test set of 2,416 sentences. The results indicate that the difference between training with 10,000 or 40,000 sentences yields a 3% error reduction.

C. Connection Path Analysis
One of our working hypotheses is the "coverage assumption", which states that we can extract, from a large enough corpus, the complete set of connection paths with which to form all possible incremental trees. This is likely to be only approximately true, and we perform the following experiment to get a quantitative result on the validity of this assumption. We build sub-sets with an increasing number of sentences: from
Fig. 4. Performance with dataset variation
100 to the full 40K sentences in steps of 100 sentences. The list of connection paths with their frequencies and the number y of distinct connection paths is then extracted from each sub-set using the incremental simulator developed by [16]. Fitting the results (see Fig. 5 (a)) with a polynomial model we have $y = a x^{\alpha}$ with $\alpha = 0.434$. This is in remarkable accordance with Heaps' law [29], an empirical rule which describes vocabulary growth as a function of text size, and which establishes that a text of n words has a vocabulary of size $O(n^{\beta})$ with $\beta \in [0, 1]$, where for English, in first approximation, $\beta = 1/2$. If we now consider how often certain connection paths are used in the making of the syntactic trees, we observe that the frequency counts are distributed according to another famous linguistic law, Zipf's law. Zipf's law expresses a relationship between the frequency of words occurring in a large corpus and a quantity called rank. Given a corpus under examination, the rank of a word is defined as the position of that word in the list of all the words in the corpus, ordered by frequency of occurrence. Zipf's law says that $f \propto 1/r$, or in other words that there is a constant k such that $f \cdot r = k$. What the law states is that there are few very common words and many low-frequency words. Considering connection paths instead of words, we find that their frequency is closely described by Zipf's law (Fig. 5 (b)). Moreover, one can find that the same distribution holds if we keep distinct the connection paths whose foot node belongs to a specific category (such as nouns, verbs, articles, ...). Also, if we consider the new connection paths that are extracted after
Fig. 5. (a) Number of distinct connection paths as a function of dataset size (units of 100 sentences). (b) Zipfian distribution of connection paths.
having processed 10,000 sentences, we see that these are rarely used (less than 1% of the time), and we can therefore say that the coverage approximation holds.

V. STRUCTURAL ANALYSIS
In this section we will investigate the characteristics of the preference mapping learned by the connectionist recursive architecture. We will start by analyzing the correlation between the net's performance and the structural properties of the incremental trees. Then we will study the influence of the frequency of the connection paths on the choices of the system. Finally, we will compare the preferences learned by the network with some heuristics studied in psycholinguistics.

A. Correlation between structural features and prediction error
Basing our intuitions on domain knowledge, we hypothesize a certain number of structural characteristics of the incremental trees that are likely to have an influence on the generalization performance. The features that we investigate in the correct incremental trees are reported in Table I. We are interested in characteristics such as:
• number of nodes (rows: tree num nodes, cpath num nodes, tree max node outdegree, anchor outdegree, root outdegree): a higher value implies a more difficult error assignment task for the network when propagating the error; moreover, the tree automaton that we approximate with the connectionist system becomes more complex as the number of possible configurations increases;
• height (rows: anchor depth, tree height, cpath height): a greater height implies more steps in the propagation of the information and, as a consequence, a weakening of the gradient information;
• frequency of configurations (row: freq of cpath): since the net is ultimately a statistical estimator, this count is always a valuable baseline for comparing the net's performance.
In addition, we study the number of alternative incremental trees (row: forest size), as we expect a negative correlation between the number of alternatives and the prediction performance, and finally the absolute word position (row: word index), since words that occur later in the sentence condition the choice of the correct attachment on a bigger context. For each feature we have collected statistical information (maximum value, mean, standard deviation, skew, kurtosis) and tested for normality, so as to be able to use the appropriate statistical test later on.

TABLE I
FEATURE STATISTICS: MAXIMUM VALUE, MEAN VALUE, STANDARD DEVIATION, SKEW, KURTOSIS, CORRELATION COEFFICIENT.

Description               max     mean      sd     sk      ku     rho
tree max node outdegree    18      4.3      1.9    0.4     4.8    0.18
tree height                28      6.6      3.9    0.7     3.8    0.19
tree num nodes            122     27.3     19.2    0.8     3.4    0.20
cpath height                5      1.5      0.7    0.4     3.4    0.34
cpath num nodes            11      2.7      1.0    1.6     7.5    0.33
anchor outdegree           18      2.6      1.6    0.8     7.2    0.21
anchor depth               28      4.6      3.9    1.0     4.4    0.17
root outdegree             18      2.9      1.6    1.2     6.5    0.02
forest size              1940    126.3    145.1    2.4    11.9    0.31
word index                 66     14.7     10.4    0.9     3.6    0.19
freq of cpath          102291    177.9   2077.6   30.1  1226.1   -0.39
In order to measure the correlation between these features and the net's error on the correct element, we have run an analysis employing the Spearman rank correlation test over a randomly sampled sub-set of 200 (error, feature) pairs. We report the correlation results (column: rho) in Table I. The test indicates a significant if small positive correlation between each feature and the network's error, except for the root outdegree, and a sizeable negative correlation between the frequency of the connection path and the error. The most significant positive correlations are with the size of the connection path and the forest size.

B. Characterizing true and false positives
We now investigate the features of the incremental trees that are correctly classified against those trees that are mistakenly preferred by the network. We distinguish true positive elements and false positive elements: true positives are the correct incremental trees that are preferred by the network; false positives are the trees preferred by the network that are not the correct ones. At first we identify statistically significant differences in the average values of some features. Then, analyzing the distinctive features, we identify some characterizing properties of the set of true positive elements against the second preferred element, and we do the same with the false positives against the correct elements. For the features that do not exhibit a normal distribution we use the Wilcoxon Matched-Pairs Signed-Ranks test on a random sample of 200 pairs from the dataset for each feature. For all the other features the ANOVA (analysis of variance between groups) test is used, randomly sampling 100 pairs from the dataset for each feature and treating the relevant feature as a repeated-measures factor.
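The following sketch illustrates the two kinds of statistical analysis just described, using scipy.stats; the arrays are randomly generated placeholders, not the data analyzed in the paper.

```python
import numpy as np
from scipy.stats import spearmanr, wilcoxon

rng = np.random.default_rng(0)

# Section V-A style analysis: rank correlation between a structural feature
# (e.g. connection-path size) and the network error on the correct element.
feature_values = rng.integers(1, 12, size=200)
errors = rng.random(200) + 0.02 * feature_values          # toy data
rho, p_value = spearmanr(feature_values, errors)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# Section V-B style analysis: paired comparison of a (non-normal) feature
# between the correct elements and the elements preferred by the network.
feature_true = rng.integers(1, 12, size=200)
feature_false_pos = np.maximum(1, feature_true - rng.integers(0, 3, size=200))
stat, p_value = wilcoxon(feature_true, feature_false_pos)
print(f"Wilcoxon statistic = {stat:.1f} (p = {p_value:.3f})")
```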
In the first experiment the tests are used to determine whether there are meaningful differences in some features of the trees when comparing the net's false positives with the true elements. The results are reported in Table II. In column "Sig." we mark with an asterisk the results that are significant (p < .05) under the respective statistical tests, in column "true elem" we report the average values for the correct element that the net has not been able to predict, and in column "false pos" we report the average values for the wrong element picked by the net. In the last column, we report the size of the difference with respect to the standard deviation of the corresponding distribution.

TABLE II
FALSE POSITIVES

Description          Sig.   true elem   false pos   ∆ / sd
tree max outdegree
tree height           *        7.38        7.20      0.05
tree num nodes        *       30.91       30.55      0.02
cpath height          *        2.12        1.67      0.64
cpath num nodes       *        3.35        2.77      0.58
anchor outdegree
anchor depth
root outdegree
The average value of each statistically significant feature is greater for the true elements than for the elements chosen by the network. We can synthesize this result by saying that trees which are "simpler" in various senses are preferred by the network. Simplicity can be expressed in terms of shorter connection paths and incremental trees, or connection paths and trees with fewer nodes. There seems to be no meaningful effect of the outdegree on the net's false positive errors. We note that the differences over the whole incremental tree are much smaller than those between the connection paths. This indicates that connection paths are the key element responsible for the discrimination between the correct element and the incorrectly chosen element. We will use this finding to enhance the performance of the system. In a second experiment we test differences between the true positives and the element ranked second by the net. We report the results in Table III (to be read as the previous one). The same trend holds for the true positives: the net has preferred the correct incremental trees because of their "simplicity" in comparison with the second ranked alternative, which turns out to be more complex. Note that now the root and the anchor outdegree have become meaningful features, in a way that is still consistent with the hypothesis that "simpler" trees are preferred, i.e. the correct incremental trees have root and anchor nodes with a smaller outdegree. This latter fact can be represented by a heuristic that disprefers joining those connection paths that increase the number of children of the root or of the anchor node, since this leads to wrong incremental trees.
TABLE III
TRUE POSITIVES

Description          Sig.   true pos   second elem   ∆ / sd
tree max outdegree
tree height           *       6.02        6.56        0.14
tree num nodes        *      24.08       25.32        0.06
cpath height          *       1.47        1.81        0.49
cpath num nodes       *       2.56        3.08        0.52
anchor outdegree      *       2.48        2.72        0.15
anchor depth
root outdegree        *       2.8         2.98        0.11
We suspect that the simplicity preference of the network is mainly due to the combinatorial nature of the elements of this domain, since all the features are strongly correlated, and there could be an underlying factor that is the direct cause of the preference. Analyzing the Zipfian distribution of connection paths, we find that shorter connection paths are more frequent. As a direct consequence, most correct incremental trees are themselves simpler because they are more frequently derived by joining simpler elements. In order to understand the magnitude of this effect we have run a Pearson correlation test on a sample of 10,000 pairs of connection-path number of nodes vs. log(freq). We obtain a correlation of rho = -0.33 (statistical significance p < 0.001), indicating that simpler connection paths are reliably more frequent. We therefore investigate the influence of connection-path frequency on false positives and true positives, respectively.

C. Influence of connection paths' frequency
We compare the net's predictions with a simple frequency heuristic, i.e. we rank each alternative according to the frequency of the last connection path used in its derivation (Fig. 6). The test is done on the standard test set of 2,416 sentences. Each point (x, y) in the diagram of Fig. 6 is to be interpreted as: y is the proportion of times that the correct element has been ranked in position x or less. That is, for x = 1 we consider valid only the cases when the system has ranked the positive element in the first position, and for x = i we consider valid only the cases when the system has ranked the positive element within the first i positions. From Fig. 6 we can deduce that the net bases its decisions on something more than pure frequency. The ANOVA test was used to determine the influence of the log-transformed frequency (see footnote 3) on the network's accuracy. For the true positives the mean log-frequency of the connection path was 9.2 against a mean of 5.2 for the second best ranked alternative, the difference being highly significant on the random sample of 100 pairs. For the false positive dataset there was no significant difference in the mean of the log frequency
3 This is because the connection path frequency follows a Zipfian (log-log) distribution.
(7.4 for the correct element vs. 7.2 for the predicted element, F < 1). Notice also that the overall mean is much higher for the true positives than for the false positives. This result can be explained by observing that in the case of the true positives the frequency distribution of the connection paths is more skewed, with the correct alternative having a much higher frequency than the other alternatives. This seems to indicate that it is more difficult for the net to express a preference when it cannot draw information directly from the frequency distribution of the alternative connection paths. Since the network performance is better than the frequency heuristic, we can conclude that while the connection path frequency remains a determinant factor, the decision strategy of the network takes other factors into account. We now want to understand what information the network resorts to when it is not relying on the connection-path frequency.

D. Filtering out the connection paths' frequency effect
We investigate the cases where the network correctly predicts trees whose connection path is not the most frequent one. We isolate these cases (which represent 10% of the correct prediction cases) and we analyze their characteristics as we have previously done. In Table IV we report the average values of the significant features that discriminate between the correctly predicted element and the most frequent element, and the relative difference of the values.

TABLE IV
NET PREDICTION VS. FREQUENCY

Description          Sig.   correct   most freq.   ∆ / sd
tree max outdegree    *       4.38       4.69       0.16
tree height           *       7.74       7.45       0.07
tree num nodes        *      31.01      30.80       0.01
cpath height          *       1.60       1.43       0.24
cpath num nodes       *       2.87       2.62       0.25
anchor outdegree      *       2.49       4.77       1.43
anchor depth          *       6.58       3.13       0.88

We observe how the net has preferred slightly more
complex alternatives in terms of height or number of nodes, but has preferred cases characterized by anchors with a smaller outdegree and at a greater distance from the root. This confirms the importance played by frequency and simplicity of the connection path, but indicates a preference for deeper anchors. We therefore try to decompose the preference task into two sub-tasks (as reported in section II-B and Fig. 2): the first consists in finding the correct attachment point of the connection path, and the second consists in choosing the correct connection path itself. Given the previous findings, we hypothesize that the network has a somewhat more complex decision strategy to disambiguate the attachment point of the anchor, but then exploits only the connection-path frequency to choose the path.
E. Analysis of anchor attachment preference
We investigate the performance of the network in determining the correct anchor on the right edge. To do so, we group together all the incremental trees that derive from the attachment of a connection path at the same anchor node. We rank the groups according to the highest score given by the net to any member of the group. We consider a group correctly predicted iff the correct connection path belongs to the group. We compute the percentage of times a group was correctly predicted as the first choice of the net or within the first two or three choices, obtaining: 1st choice: 91.5%, within the first 2 choices: 97.1%, within the first 3 choices: 98.4%.

F. Analysis of connection path preference
We want to test the hypothesis that the network, after having chosen the anchor, relies only on the frequency information of the connection path to choose among alternative paths. To do so, we prepare a test set where the correct anchor is given and we collect the network's preferences on the reduced forest of alternatives. In this setting, the network reaches an 89.5% accuracy in first position. The number of predictions where a connection path with lower frequency is correctly preferred is only 3.8%. We then evaluate how many times the network makes a mistake because it prefers a more frequent connection path instead of the correct, less frequent one. This happens 66.3% of the time. Therefore, the network mostly exploits frequency information once the correct anchor is given. This is not necessarily a negative finding: a high error rate can also be expected for human first-pass disambiguation decisions, which are also biased by frequencies. We remark that there are no experimental results concerning human performance on first-pass disambiguation carried out on large corpora. In the case of the human parser, wrong first-pass decisions do not have a dramatic impact on overall performance, thanks to the existence of a robust recovery mechanism that backtracks to the source of the error and corrects it. Coupling the two preference decisions (about the anchor, and about the connection path given the anchor) is the main strength of the proposed architecture. We now compare the performance of the RNN to a hybrid approach where the anchor is predicted by the RNN and the connection path is predicted based on corpus frequency. More precisely, we order alternative connection paths using the RNN and we extract a corresponding ordered list of anchors, removing duplicates after the first occurrence. We then form a set of alternatives by selecting all the connection paths that match one of the first N anchors in the ordered list. Members of this set are finally ranked by their frequency in the corpus or by the RNN. In Table V we report the results obtained for N = 1, 2, and 3. All the previous experiments seem to validate the initial hypothesis: the RNN is very successful in predicting the anchor and relies mainly on frequency information to predict connection paths. However, the last experiment shows that the network is using something more than pure corpus frequency; it is conditioning the collected statistics on the context offered by the incremental tree.
TABLE V

N    Frequency    RNN    Relative error reduction
1      88.3       89.5        10.0 %
2      77.35      84.4        31.1 %
3      75.9       83.3        30.7 %
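A sketch of the hybrid procedure behind Table V is given below: candidates are ordered by the RNN, an ordered list of distinct anchors is extracted, and the candidates attached to the first N anchors are re-ranked either by corpus frequency or by the RNN itself. The candidate records and the scoring callables are placeholders, not the actual implementation.

```python
def hybrid_ranking(candidates, rnn_score, path_frequency, n_anchors=1,
                   rerank_with_rnn=False):
    """Rank a candidate forest with the RNN-anchor / frequency-path split.

    `candidates` is a list of records with `.anchor` and `.cpath` fields;
    `rnn_score` and `path_frequency` are placeholder scoring callables.
    """
    # Order candidates by the RNN, then extract the ordered, de-duplicated anchors.
    by_rnn = sorted(candidates, key=rnn_score, reverse=True)
    anchors = []
    for cand in by_rnn:
        if cand.anchor not in anchors:
            anchors.append(cand.anchor)
    allowed = set(anchors[:n_anchors])

    # Keep only candidates attached to one of the first N anchors,
    # then rank them by corpus frequency or by the RNN score.
    pool = [c for c in by_rnn if c.anchor in allowed]
    key = rnn_score if rerank_with_rnn else (lambda c: path_frequency(c.cpath))
    return sorted(pool, key=key, reverse=True)
```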
In order to gain better insight into the kind of statistics that the network is really employing, we assume the working hypothesis that the human parser shares with the network some common mechanism for ambiguity resolution. We then simulate some known heuristics that have been found in psycholinguistic research, and investigate to what extent they are matched by the network.

G. Comparison with psycholinguistic heuristics
Among the structural preferences expressed by the syntactic module of the human parser, psycholinguistic studies identify the minimal attachment (MA) preference and the late closure (LC) preference ([30]). The minimal attachment preference suggests that humans tend to prefer simpler and shorter analyses. In our framework this translates to preferring connection paths with fewer nodes, which generally implies shorter connection path trees (see the rank given to the different choices in Fig. 2 (b)). The late closure preference suggests that, while processing the sentence in an incremental way, a preference is expressed for recency, i.e. humans prefer to connect the current analysis with recently processed material. In our framework this is equivalent to preferring low attachment anchor points, i.e. attaching the connection paths nearer to the leaves of the frontier (see Fig. 2 (a)). Since a single preference scheme would lead to a relevant number of ties, we employ first one scheme and then break the ties with the other one. Should ties still be present, we resort to the frequency of connection paths. We have two possible combinations: LC-MA and MA-LC (see Fig. 6). The results of the simulation are as follows: within the first two alternatives the LC-MA heuristic finds the correct element 70% of the time, while the MA-LC heuristic achieves a precision of 50%. In comparison, the network guesses correctly more than 82% of the time with the first proposed element and more than 90% within the first two proposed elements. In order to test whether the network has learned to express a preference that mimics that of the heuristics, or rather has found an orthogonal set of features to look at, we analyze the overlap of the predictions (see Table VI). We consider how many times the network's first choice corresponds exactly to the first element ranked by each heuristic combination. The results indicate that the network's choices coincide with those of the heuristics only roughly half of the time. If we allow the first or second choice of the network to match either the first or second choice of the heuristic combination, we find an agreement between the net's preference and the LC-MA heuristic more than 78% of the time. We can conclude that the network has consistently learned to use the latter heuristic. On the other hand, we also have to conclude that the 66% error reduction within the best two choices is a clear sign that the network has found some more complex strategy to express its preferences.
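The following sketch shows how the LC-MA and MA-LC baselines can be simulated: Late Closure is approximated by preferring deeper anchors, Minimal Attachment by preferring connection paths with fewer nodes, and remaining ties are broken by connection-path frequency. The field names are assumptions made for the example.

```python
def heuristic_ranking(candidates, path_frequency, late_closure_first=True):
    """Rank candidates with LC-MA (default) or MA-LC tie-breaking (sketch).

    Each candidate is assumed to expose `.anchor_depth` (distance of the
    anchor from the root) and `.cpath_num_nodes`; `path_frequency` is a
    placeholder lookup for the corpus frequency of the connection path.
    """
    def key(c):
        late_closure = -c.anchor_depth          # deeper (more recent) anchor preferred
        minimal_attachment = c.cpath_num_nodes  # fewer connection-path nodes preferred
        primary, secondary = ((late_closure, minimal_attachment)
                              if late_closure_first
                              else (minimal_attachment, late_closure))
        # Remaining ties are broken by (descending) connection-path frequency.
        return (primary, secondary, -path_frequency(c.cpath))
    return sorted(candidates, key=key)
```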
Fig. 6. Comparison with psycholinguistic heuristics (curves: Late Closure over Minimal Attachment, Minimal Attachment over Late Closure, Frequency Heuristic, Recursive Network, Recursive Network on the reduced dataset).
TABLE VI
OVERLAPPING PREFERENCES: NET VS. HEURISTIC AND HEURISTIC VS. HEURISTIC

Pos       net / LC-MA    net / MA-LC    LC-MA / MA-LC
1            43.5%          44.5%            91%
1 or 2       78.3%          61.5%            94.4%
VI. ENHANCEMENTS
A. Tree simplification
The experimental results reported in Section V have shown how the complexity of the incremental trees negatively affects the prediction performance. We would like to decrease this complexity (i.e. the number of nodes) without the risk of disregarding useful features. Intuitively, not all the information in the incremental tree is significant for the disambiguation task. Specifically, it can be argued that the knowledge of the internal composition of "closed" constituents, i.e. constituents that have been fully parsed, can be summarized by the non-terminal tag that immediately dominates the constituent. In other words, the knowledge that a deeply nested NP is made of a sequence (DT NN) or rather a more complex (DT JJ NN NN) is not much more
Fig. 7. Tree simplification
informative when deciding how to attach a connection path. If this hypothesis is true, it should be possible to eliminate a significant part of the nodes of the incremental tree without decreasing the discriminating power of the information that is left in the remaining nodes. We propose a reduction scheme in which we keep all the nodes that dominate incomplete components, plus all their children. Because of the incremental nature of the algorithm, it turns out that these nodes belong to the right frontier of the incremental tree, or are children of such nodes. The procedure we are adopting turns out to be consistent with the notion of c-command4 in theoretical linguistics. When we create Ti, we keep only those nodes that c-command the right frontier of Ti−1, plus the right frontier of Ti−1 itself. Preserving the nodes that c-command the nodes that are active (those that are potential anchors) is linguistically motivated in that it keeps the nodes that can exhibit a "linguistic influence" on each other. In Fig. 7 we show the subset of retained nodes. In order to test the equivalence hypothesis we have run an experiment with the following setting. The datasets are the standard training, validation and test sets to which we have applied the simplification procedure. We employ a 20-unit recursive network and output network within the standard validation and test scheme. We report in Fig. 6 the comparison between the performance on the reduced dataset and on the normal dataset. We observe an increase in performance from 81.7% to 84.82%, an error reduction of 17%. The results indicate that not only
4 A node A c-commands a node B if B is a descendant of a sister of A [31].
have we not eliminated relevant information, but we have helped the system by eliminating potential sources of noise making the task somewhat simpler and allowing for a better generalization. To explain this behavior we can hypothesize the fact that the states that encode the information relative to what lays in deep (i.e. more distant from the right frontier) nodes are “noisy” and confound higher (i.e. closer to the frontier) states. B. Modular networks When the learning domain can naturally be decomposed into a set of disjoint sub-domains, it is possible to specialize several learners on each sub-domain. A special case for these specialized learners is when we have informationally encapsulated “modules” [32], that is, predictors whose internal computation is unaffected by the other modules. The linguistic data that we are processing present an intuitive decomposition: the knowledge needed to process the attachment of verb-footed connection paths is quite different from the knowledge used to attach article- or punctuation-footed connection paths, etc. I.e. it can be the case that the features that are relevant for discriminating the correct incremental trees are different. If there is no significant information overlap between the different cases, we can partition the dataset and select a smaller sub-set of examples with which to train a predictor, without subtracting relevant examples to the training itself. The adoption of a modular approach allows moreover a tighter parameter tuning of each module. The knowledge of the domain suggests that certain attachment decisions are harder than others. For example, prepositional phrase attachment is notoriously a hard problem, especially when full-lexical information is not used (which is our case). In order to determine the “hardness” of each sub-task we setup an experiment that has the following setting. We divide the set of POS tags into s = 10 sub-sets where we collate “similar” tags, i.e. tags that have a similar grammatical function5 . A special set contains all those tags that couldn’t be put in any other sub-set6 . We employ a network of 25 units, trained on the training set of 40k sentences and we test the resulting network on the 2k sentences test set. The dataset has been preprocessed with the simplification scheme introduced in the previous section. The prediction results are collected and partitioned in the appropriate sub-sets accordingly to which POS tag was involved in the attachment decision. We report the results in Table VII, where in column Runspec is the best accuracy obtained using the method of Section VI-A while in column Size we report the fraction of the total dataset represented by each sub-set. The results indicate that the problem is harder in the case of adverbs and prepositions and easier for nouns, verbs and articles. We propose to enhance the overall performance by letting single nets to concentrate on specific ambiguities, i.e. having a net being exposed only to attachment decisions involving, for example, adverbs or prepositions. The network specialization can be done in an online or batch fashion. The online scheme is realized using a single network with s different “switching” weight sets for the recursive and output networks. It is the POS tag of the current word that selects the appropriate weight set. The batch scheme is 5
5 For example, the tags MD, VB, VBD, VBG, VBN, VBP and VBZ, which correspond to modal verbs and verbs with different tense and agreement information, are grouped together under the category VERB.
6 It includes POS tags that denote foreign words, exclamations, symbols, etc.
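The batch specialization scheme can be sketched as follows; the tag-to-category grouping shown is only a partial, illustrative example (cf. footnotes 5 and 6), and train_network is a placeholder for the routine that trains a single recursive network on its sub-set.

from collections import defaultdict

# Partial, illustrative mapping from Penn Treebank POS tags to coarse categories;
# unlisted tags fall into the special OTHER set.
POS_TO_CATEGORY = {
    "MD": "VERB", "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "IN": "PREPOSITION", "RB": "ADVERB", "DT": "ARTICLE", "JJ": "ADJECTIVE",
}

def category_of(pos_tag):
    return POS_TO_CATEGORY.get(pos_tag, "OTHER")

def train_specialized(examples, train_network):
    # examples: iterable of (pos_tag_of_word_to_attach, labeled_bag_of_candidate_trees)
    partitions = defaultdict(list)
    for pos_tag, bag in examples:
        partitions[category_of(pos_tag)].append(bag)
    # one network per category, trained only on its own sub-set
    return {category: train_network(subset) for category, subset in partitions.items()}

def predict(models, pos_tag, bag):
    # at test time, the POS tag of the current word selects the specialized network
    return models[category_of(pos_tag)](bag)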
TABLE VII
Specialization improvement. Precision results after training on the original 500-sentence training set with specialized networks (R500), and after training on the 40k training set with an unspecialized network (Ru) and with specialized networks (Rs). The reduction of error is shown in the final column.

Category       Size %   R500 %    Ru %    Rs %    Red %
Adjective        7.48    85.24   87.00   89.46    18.92
Adverb           4.26    45.05   53.46   59.44    12.85
Article         12.45    83.58   89.09   90.99    17.42
Conjunction      2.31    59.55   70.41   78.69    27.98
Noun            32.97    91.84   94.52   95.74    22.26
Other            0.69    51.69   68.64   72.88    13.52
Possessive       2.03    89.75   97.99   97.12   -43.28
Preposition     12.63    61.78   64.26   68.19    11.00
Punctuation     11.72    68.21   75.29   80.84    22.46
Verb            13.46    90.87   94.72   96.77    38.83
Weighted tot   100.00    80.56   84.82   87.52    17.79
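For reference, the Weighted tot row is the size-weighted average of the per-category accuracies, and the values in the Red column agree with the usual relative error reduction between the unspecialized and specialized networks:

\[
  R_{\mathrm{tot}} = \sum_{c} \frac{\mathrm{Size}_c}{100}\, R_c,
  \qquad
  \mathrm{Red} = 100 \cdot \frac{R_s - R_u}{100 - R_u}
\]

For example, for the Adjective row, 100 · (89.46 − 87.00)/(100 − 87.00) ≈ 18.92.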
Since the latter scheme allows an easier parallelization of the training and testing phases, we resorted to the batch approach. We ran two experiments. In the first, we replicate the training set of [10], apply the reduction pre-processing, train the modular network, and test the performance of the new architecture. The results are reported in column R500 of Table VII. We obtain a total precision in first position of 80.57%, against a previous 74.0%, yielding a 25% error reduction. In the second experiment, we train a network of 25 units on the standard training set of 40k sentences and test the resulting network on the standard 2k-sentence test set. We report the comparison between the performance of the specialized networks (column Rs) and the unspecialized network (column Ru) on the same dataset, together with the relative error reduction (column Red). The results indicate an overall enhancement of performance (17.79% error reduction) and show that some categories greatly benefit from this approach. We believe the reason is that the resources (i.e. areas of the state space) allocated to discriminating the less frequent classes (conjunctions, punctuation, adverbs) no longer have to compete with those allocated to the most frequent cases (nouns, verbs).

VII. CONCLUSIONS

We have shown how the analysis of the preferences expressed by the recursive neural network provides useful insight into the nature of the statistical information used by the system. We have found that the network bases its disambiguation of the anchor attachment point on complex structural information, while it mainly resorts to frequency when choosing the correct connection path. We have moreover shown that the system prefers to attach simple structures to recently processed material, modeling human heuristics, but that the incremental tree offers a richer context on which to condition the preferences.
Taking advantage of the highly structured nature of the domain, we have been able to propose a simplification scheme and a specialized architecture that enhance the overall prediction accuracy of the network. We believe that further improvements are achievable by introducing more information, for example by lexicalizing the underlying grammar. Future work will focus on the use of the recursive neural network as an informant to guide an incremental parser.

REFERENCES
[1] G. Altmann and M. Steedman, “Interaction with context during human sentence processing,” Cognition, vol. 30, pp. 191–238, 1988.
[2] W. Marslen-Wilson, “Linguistic structure and speech shadowing at very short latencies,” Nature, vol. 244, pp. 522–533, 1973.
[3] M. J. Pickering and M. J. Traxler, “Plausibility and recovery from garden paths: An eye-tracking study,” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 24, no. 4, pp. 940–961, 1998.
[4] M. Bader and I. Lasser, “German verb-final clauses and sentence processing,” in Perspectives on Sentence Processing, C. Clifton, L. Frazier, and K. Rayner, Eds. Lawrence Erlbaum Associates, New Jersey, 1994.
[5] Y. Kamide and D. C. Mitchell, “Incremental pre-head attachment in Japanese parsing,” Language and Cognitive Processes, vol. 14, pp. 631–632, 1999.
[6] E. P. Stabler, “Parsing for incremental interpretation,” Manuscript, University of California at Los Angeles, 1994.
[7] B. Roark and M. Johnson, “Efficient probabilistic top-down and left-corner parsing,” in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999, pp. 421–428.
[8] P. C. R. Lane and J. B. Henderson, “Incremental syntactic parsing of natural language corpora with simple synchrony networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 2, 2001.
[9] P. Sturt, M. J. Pickering, C. Scheepers, and M. W. Crocker, “The preservation of structure in language comprehension: Is syntactic reanalysis a last resort?,” Journal of Memory and Language.
[10] F. Costa, P. Frasconi, V. Lombardo, and G. Soda, “Towards incremental parsing of natural language using recursive neural networks,” accepted in Applied Artificial Intelligence, 2000.
[11] P. Sturt, F. Costa, V. Lombardo, and P. Frasconi, “Learning first-pass structural attachment preferences with dynamic grammars and recursive neural networks,” in press, 2001.
[12] P. Frasconi, M. Gori, and A. Sperduti, “A general framework for adaptive processing of data structures,” IEEE Transactions on Neural Networks, vol. 9, pp. 768–786, 1998.
[13] A. Sperduti and A. Starita, “Supervised neural networks for the classification of structures,” IEEE Transactions on Neural Networks, vol. 8, no. 3, 1997.
[14] W. W. Cohen, R. E. Schapire, and Y. Singer, “Learning to order things,” in Advances in Neural Information Processing Systems, M. I. Jordan, M. J. Kearns, and S. A. Solla, Eds., vol. 10. The MIT Press, 1998.
[15] M. Collins and N. Duffy, “Convolution kernels for natural language,” in Proc. of NIPS, 2001.
[16] V. Lombardo and P. Sturt, “Incrementality and lexicalism: A treebank study,” in Lexical Representations in Sentence Processing, S. Stevenson and P. Merlo, Eds. John Benjamins, 1999.
[17] E. Brill, “A simple rule-based part-of-speech tagger,” in Proceedings of ANLP-92, 3rd Conference on Applied Natural Language Processing, Trento, IT, 1992, pp. 152–155.
[18] V. Lombardo and P. Sturt, “Incremental parsing and infinite local ambiguity,” in Proc. of XIXth Cognitive Science Society, 1997.
[19] H. Thompson, M. Dixon, and J. Lamping, “Compose-reduce parsing,” in Proceedings of the 29th Meeting of the Association for Computational Linguistics, Berkeley, California, June 1991, pp. 87–97.
[20] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez, “Solving the multiple-instance problem with axis-parallel rectangles,” Artificial Intelligence, vol. 89, no. 1–2, pp. 31–71, 1997.
[21] Y. Chevaleyre and J.-D. Zucker, “A framework for learning rules from multiple instance data,” in 12th European Conference on Machine Learning, vol. 2167 of LNCS, pp. 49–60. Springer, 2001.
[22] S. Ray and D. Page, “Multiple instance regression,” in Proc. 18th International Conf. on Machine Learning, pp. 425–432. Morgan Kaufmann, San Francisco, CA, 2001.
[23] M. Collins, “Three generative, lexicalised models for statistical parsing,” in Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997, pp. 16–23.
[24] J. W. Thatcher, “Tree automata: An informal survey,” in Currents in the Theory of Computing, A. V. Aho, Ed., pp. 143–172. Prentice-Hall, Englewood Cliffs, 1973.
[25] C. Goller and A. Kuechler, “Learning task-dependent distributed structure-representations by back-propagation through structure,” in IEEE International Conference on Neural Networks, 1996, pp. 347–352.
[26] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated corpus of English: the Penn Treebank,” Computational Linguistics, vol. 19, pp. 313–330, 1993.
[27] M. J. Collins, “A new statistical parser based on bigram lexical dependencies,” in Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996.
[28] E. Charniak, “Expected-frequency interpolation,” Technical Report CS96-37, Department of Computer Science, Brown University, 1996.
[29] H. S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, New York, NY, 1978.
[30] L. Frazier, On Comprehending Sentences: Syntactic Parsing Strategies, Ph.D. thesis, University of Connecticut, Storrs, CT, 1978.
[31] N. Chomsky, Lectures on Government and Binding, Foris, 1981.
[32] A. Sharkey, “On combining artificial neural nets,” 1996.