Introduction to Data-Driven Dependency Parsing
Introductory Course, ESSLLI 2007

Ryan McDonald, Google Inc., New York, USA. E-mail: [email protected]
Joakim Nivre, Uppsala University and Växjö University, Sweden. E-mail: [email protected]

Introduction
Overview of the Course

- Dependency parsing (Joakim)
- Machine learning methods (Ryan)
- Transition-based models (Joakim)
- Graph-based models (Ryan)
- Loose ends (Joakim, Ryan):
  - Other approaches
  - Empirical results
  - Available software

Notation Reminder

- Sentence x = w_0, w_1, ..., w_n, with w_0 = root
- L = {l_1, ..., l_{|L|}} is the set of permissible arc labels
- Let G = (V, A) be a dependency graph for sentence x, where:
  - V = {0, 1, ..., n} is the vertex set
  - A is the arc set, i.e., (i, j, k) ∈ A represents a dependency from w_i to w_j with label l_k ∈ L
- By the usual definition, G is a tree

Data-Driven Parsing

- Goal: learn a good predictor of dependency graphs
  - Input: x
  - Output: dependency graph/tree G
- Last lecture:
  - Parameterize parsing by transitions
  - Learn to predict transitions given the input and a history
  - Predict new graphs using a deterministic parsing algorithm
- This lecture:
  - Parameterize parsing by dependency arcs
  - Learn to predict entire graphs given the input
  - Predict new graphs using spanning tree algorithms

Lecture 4: Outline

- Graph theory refresher
- Arc-factored models (a.k.a. edge-factored models)
  - Maximum spanning tree formulation
  - Projective and non-projective inference algorithms
  - Partition function and marginal algorithms: the Matrix Tree Theorem
- Beyond arc-factored models
  - Vertical and horizontal Markovization
  - Approximations

Graph Theory Refresher
Some Graph Theory Reminders

- A graph G = (V, A) is a set of vertices V and arcs (i, j) ∈ A, where i, j ∈ V
- Undirected graphs: (i, j) ∈ A ⇔ (j, i) ∈ A
- Directed graphs (digraphs): (i, j) ∈ A ⇏ (j, i) ∈ A

Multi-Digraphs

- A multi-digraph G = (V, A) is a digraph in which there can be multiple arcs between vertices
- (i, j, k) ∈ A represents the k-th arc from vertex i to vertex j

Directed Spanning Trees (a.k.a. Arborescences)

- A directed spanning tree of a (multi-)digraph G = (V, A) is a subgraph G' = (V', A') such that:
  - V' = V
  - A' ⊆ A, and |A'| = |V'| − 1
  - G' is a tree (acyclic)
- A spanning tree of the following (multi-)digraphs

Weighted Directed Spanning Trees

- Assume we have a weight function for each arc in a multi-digraph G = (V, A)
- Define w_ijk ≥ 0 to be the weight of arc (i, j, k) ∈ A
- Define the weight of a directed spanning tree G' of graph G as

    w(G') = ∏_{(i,j,k) ∈ G'} w_ijk

- Notation: (i, j, k) ∈ G = (V, A) ⇔ the arc (i, j, k) ∈ A

Maximum Spanning Trees (MST) of (Multi-)Digraphs

- Let T(G) be the set of all spanning trees of graph G
- The MST problem: find the spanning tree G' of the graph G that has the highest weight

    G' = argmax_{G' ∈ T(G)} w(G') = argmax_{G' ∈ T(G)} ∏_{(i,j,k) ∈ G'} w_ijk

- Solutions ... to come.

Arc-factored Models
Arc-Factored Dependency Models

- Remember: data-driven parsing parameterizes the model and then learns the parameters from data
- Arc-factored model:
  - Assumes that the score / probability / weight of a dependency graph factors by its arcs (look familiar?)

    w(G) = ∏_{(i,j,k) ∈ G} w_ijk

  - w_ijk is the weight of creating a dependency from word w_i to w_j with label l_k
- Thus there is an assumption that each dependency decision is independent
  - Strong assumption! We will address this later.

Arc-Factored Dependency Models: Example

- Weight of the dependency graph is 10 × 30 × 30 = 9000
- In practice arc weights are much smaller

An Important Concept: Gx

- For input sentence x = w_0, ..., w_n, define Gx = (Vx, Ax) as:
  - Vx = {0, 1, ..., n}
  - Ax = {(i, j, k) | i, j ∈ Vx and l_k ∈ L}
- Thus, Gx is the complete multi-digraph over the vertex set representing the words

Theorem: Every valid dependency graph for sentence x is equivalent to a directed spanning tree of Gx that originates out of vertex 0.

- Falls out of the definitions of tree-constrained dependency graphs and spanning trees:
  - Both are spanning/connected (contain all words)
  - Both are trees

Three Important Problems

Theorem: Every valid dependency graph for sentence x is equivalent to a directed spanning tree of Gx that originates out of vertex 0.

1. Inference ≡ finding the MST of Gx

    G = argmax_{G ∈ T(Gx)} w(G) = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk

2. Defining w_ijk and its feature space
3. Learning w_ijk
   - Can use perceptron-based learning if we solve (1)

Inference - Getting Rid of Arc Labels

    G = argmax_{G ∈ T(Gx)} w(G) = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk

- Consider all the arcs between vertices i and j
- Now, consider the arc (i, j, k*) such that k* = argmax_k w_ijk

Theorem: If the highest weighted dependency tree for sentence x contains an arc from w_i to w_j, it must be the arc (i, j, k*) (assuming no ties).

- Easy proof: if not, substitute in (i, j, k*) and get a higher weighted tree

Inference - Getting Rid of Arc Labels

    G = argmax_{G ∈ T(Gx)} w(G) = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk

- Thus, we can reduce Gx from a multi-digraph to a simple digraph
- Just remove, for each pair (i, j), all arcs that do not satisfy k = argmax_k w_ijk (a small sketch of this reduction follows)
- The problem is now equal to the MST problem for digraphs
- We will use the Chu-Liu-Edmonds algorithm [Chu and Liu 1965, Edmonds 1967]

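A minimal sketch of this label reduction, assuming arc weights are stored in a dict keyed by (i, j, k); the names reduce_labels and weights are illustrative, not from the course code.

def reduce_labels(weights):
    """Collapse a multi-digraph {(i, j, k): w_ijk} into a simple digraph
    {(i, j): (w, k)} by keeping only the best-labelled arc for each (i, j)."""
    best = {}
    for (i, j, k), w in weights.items():
        if (i, j) not in best or w > best[(i, j)][0]:
            best[(i, j)] = (w, k)
    return best

# Example: two labels competing on the arc 1 -> 2; label 0 wins.
print(reduce_labels({(1, 2, 0): 30.0, (1, 2, 1): 10.0, (0, 1, 0): 5.0}))
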
Chu-Liu-Edmonds Algorithm

- Finds the MST originating out of a vertex of choice
- Assumes the weight of a tree is the sum of its arc weights
- No problem, we can use logarithms:

    G = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk
      = argmax_{G ∈ T(Gx)} log ∏_{(i,j,k) ∈ G} w_ijk
      = argmax_{G ∈ T(Gx)} ∑_{(i,j,k) ∈ G} log w_ijk

  So if we redefine w_ijk := log w_ijk, then we get

    G = argmax_{G ∈ T(Gx)} ∑_{(i,j,k) ∈ G} w_ijk

Chu-Liu-Edmonds

- x = root John saw Mary

Chu-Liu-Edmonds

- Find the highest scoring incoming arc for each vertex
- If this is a tree, then we have found the MST!!

Chu-Liu-Edmonds

- If not a tree, identify a cycle and contract it
- Recalculate the weights of arcs into and out of the cycle

Chu-Liu-Edmonds

- Outgoing arc weights:
  - Equal to the max of the outgoing arc over all vertices in the cycle
  - e.g., John → Mary is 3 and saw → Mary is 30

Chu-Liu-Edmonds

- Incoming arc weights:
  - Equal to the weight of the best spanning tree that includes the head of the incoming arc and all nodes in the cycle
  - root → saw → John is 40 (**)
  - root → John → saw is 29

Chu-Liu-Edmonds

Theorem: The weight of the MST of the contracted graph is equal to the weight of the MST of the original graph.

- Therefore, recursively call the algorithm on the new graph

Chu-Liu-Edmonds

- This is a tree and the MST for the contracted graph!!
- Go back up the recursive calls and reconstruct the final graph

Chu-Liu-Edmonds

- This is the MST!!

Chu-Liu-Edmonds Code

Chu-Liu-Edmonds(Gx, w):
  1.  Let M = {(i*, j) : j ∈ Vx, i* = argmax_{i'} w_{i'j}}
  2.  Let GM = (Vx, M)
  3.  If GM has no cycles, then it is an MST: return GM
  4.  Otherwise, find a cycle C in GM
  5.  Let <GC, c, ma> = contract(G, C, w)
  6.  Let G = Chu-Liu-Edmonds(GC, w)
  7.  Find the vertex i ∈ C such that (i', c) ∈ G and ma(i', c) = i
  8.  Find the arc (i'', i) ∈ C
  9.  Find all arcs (c, i''') ∈ G
  10. G = G ∪ {(ma(c, i'''), i''')} for all (c, i''') ∈ G
          ∪ C ∪ {(i', i)} − {(i'', i)}
  11. Remove all vertices and arcs in G containing c
  12. return G

- Reminder: w_ij = max_k w_ijk

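Below is a minimal runnable sketch of the algorithm in Python, assuming the label dimension has already been maximised away, arc weights are additive log-weights stored in nested dicts, and vertex ids are integers (needed to mint the contracted vertex); the names chu_liu_edmonds, find_cycle, and score are illustrative, not from the original course code.

def find_cycle(parent):
    """Return the vertices of a cycle in the arcs {dep: head}, or None."""
    for start in parent:
        path, v = [start], start
        while v in parent:
            v = parent[v]
            if v in path:
                return path[path.index(v):]
            path.append(v)
    return None

def chu_liu_edmonds(score, root=0):
    """score[h][d] = log-weight of arc h -> d (dense, no self-loops).
    Returns the arc set {(h, d)} of the maximum spanning tree rooted at `root`."""
    deps = {d for heads in score.values() for d in heads if d != root}
    # 1. Greedily pick the best incoming arc for every non-root vertex.
    best = {d: max(score, key=lambda h: score[h].get(d, float('-inf')))
            for d in deps}
    cycle = find_cycle(best)
    if cycle is None:
        return {(best[d], d) for d in deps}
    # 2. Contract the cycle C into a fresh vertex c, remembering which
    #    original arc each contracted arc stands for (the `ma` map above).
    cyc = set(cycle)
    c = max(deps | set(score)) + 1
    cyc_weight = sum(score[best[v]][v] for v in cyc)
    new_score, ma = {}, {}
    for h, arcs in score.items():
        for d, s in arcs.items():
            if h in cyc and d not in cyc:            # arc leaving the cycle
                if s > new_score.setdefault(c, {}).get(d, float('-inf')):
                    new_score[c][d] = s
                    ma[(c, d)] = (h, d)
            elif h not in cyc and d in cyc:          # arc entering the cycle
                s2 = s - score[best[d]][d] + cyc_weight
                if s2 > new_score.setdefault(h, {}).get(c, float('-inf')):
                    new_score[h][c] = s2
                    ma[(h, c)] = (h, d)
            elif h not in cyc and d not in cyc:      # arc outside the cycle
                new_score.setdefault(h, {})[d] = s
    # 3. Recurse on the contracted graph, then expand c back into the cycle.
    contracted = chu_liu_edmonds(new_score, root)
    tree = {ma.get(arc, arc) for arc in contracted}
    broken = next(ma[(h, d)][1] for (h, d) in contracted if d == c)
    tree |= {(best[v], v) for v in cyc if v != broken}
    return tree

# Toy run: root = 0, "John saw Mary" with made-up log-weights; the result is
# {(0, 2), (2, 1), (2, 3)}, i.e., root -> saw -> {John, Mary}.
scores = {0: {1: 5, 2: 10, 3: 8}, 1: {2: 20, 3: 3},
          2: {1: 30, 3: 30}, 3: {1: 11, 2: 0}}
print(chu_liu_edmonds(scores))
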
Chu-Liu-Edmonds Code (II)

contract(G = (V, A), C, w):
  1. Let GC be the subgraph of G excluding the nodes in C
  2. Add a node c to GC representing the cycle C
  3. For each i ∈ V − C such that ∃ i' ∈ C with (i', i) ∈ A:
       Add arc (c, i) to GC with
         ma(c, i) = argmax_{i' ∈ C} score(i', i)
         i' = ma(c, i)
         score(c, i) = score(i', i)
  4. For each i ∈ V − C such that ∃ i' ∈ C with (i, i') ∈ A:
       Add arc (i, c) to GC with
         ma(i, c) = argmax_{i' ∈ C} [score(i, i') − score(a(i'), i')]
         i' = ma(i, c)
         score(i, c) = score(i, i') − score(a(i'), i') + score(C)
     where a(v) is the predecessor of v in C and score(C) = ∑_{v ∈ C} score(a(v), v)
  5. return <GC, c, ma>

Chu-Liu-Edmonds

- Naive implementation: O(n³ + |L|n²)
  - Converting Gx to a digraph: O(|L|n²)
  - Finding the best incoming arcs: O(n²)
  - Contracting cycles: O(n²)
  - At most n recursive calls
- Better algorithms run in O(|L|n²) [Tarjan 1977]
- Chu-Liu-Edmonds searches over all dependency graphs
  - Both projective and non-projective
  - Thus, it is an exact non-projective search algorithm!!!
- What about the projective case?

Arc-factored Projective Parsing

- Projective dependency structures are nested
- Can use CFG-like parsing algorithms, i.e., chart parsing
- Each chart item (triangle) represents the weight of the best tree rooted at word h spanning all the words from i to j
  - Analog in CFG parsing: items represent the best tree rooted at non-terminal NT spanning words i to j
- Goal: find the chart item rooted at 0 spanning 0 to n
- Base case: items of length 1, with h = i = j, have weight 1

Arc-factored Projective Parsing

- All projective graphs can be written as the combination of two smaller adjacent graphs
- Inductive hypothesis: the algorithm has calculated the scores of smaller items correctly (just like CKY)

Arc-factored Projective Parsing

- Chart items are filled in a bottom-up manner
  - First do all strings of length 1, then 2, etc., just like CKY
- Weight of a new item: max_{l,j,k} w(A) × w(B) × w^k_{hh'}
- The algorithm runs in O(|L|n⁵)
- Use back-pointers to extract the best parse (like CKY)

Arc-factored Projective Parsing

- O(|L|n⁵) is not that good
- [Eisner 1996] showed how this can be reduced to O(|L|n³)
  - Key: split items so that sub-roots are always on the periphery (a sketch of the resulting dynamic program follows)

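A compact sketch of the O(n³) dynamic program for the unlabeled, first-order case, assuming score[h, d] holds log arc weights and vertex 0 is the root; it returns only the best tree score (back-pointers omitted for brevity), and the names are illustrative rather than taken from Eisner's paper.

import numpy as np

def eisner_best_score(score):
    """score[h, d]: log-weight of arc h -> d over vertices 0..n-1 (0 = root).
    Returns the log-weight of the best projective dependency tree."""
    n = score.shape[0]
    NEG = float('-inf')
    # I[i, j, d]: best "incomplete" span i..j with an arc between its endpoints;
    # C[i, j, d]: best "complete" span i..j; d = 0 means head j, d = 1 means head i.
    I = np.full((n, n, 2), NEG)
    C = np.full((n, n, 2), NEG)
    C[np.arange(n), np.arange(n), :] = 0.0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Build incomplete spans by adding the arc i -> j or j -> i.
            best = max(C[i, r, 1] + C[r + 1, j, 0] for r in range(i, j))
            I[i, j, 0] = best + score[j, i]
            I[i, j, 1] = best + score[i, j]
            # Build complete spans from an incomplete span plus a complete one.
            C[i, j, 0] = max(C[i, r, 0] + I[r, j, 0] for r in range(i, j))
            C[i, j, 1] = max(I[i, r, 1] + C[r, j, 1] for r in range(i + 1, j + 1))
    return C[0, n - 1, 1]   # everything attached under the root at position 0

# Toy example with 4 "words" (root + 3 tokens) and random log-weights.
rng = np.random.default_rng(0)
print(eisner_best_score(rng.normal(size=(4, 4))))
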
Inference in Arc-Factored Models

- Non-projective case:
  - O(|L|n²) with the Chu-Liu-Edmonds MST algorithm
- Projective case:
  - O(|L|n³) with the Eisner algorithm
- But we still haven't defined the form of w_ijk
- Or how to learn these parameters

Arc weights as linear classifiers

    w_ijk = e^{w · f(i,j,k)}

- Arc weights are a linear combination of features of the arc, f, and a corresponding weight vector w
- Raised to an exponent (simplifies some math ...)
- What arc features?
- [McDonald et al. 2005] discuss a number of binary features

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Identities of the words w_i and w_j and the label l_k
  - e.g., head=saw & dependent=with

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Part-of-speech tags of the words w_i and w_j and the label l_k
  - e.g., head-pos=Verb & dependent-pos=Preposition

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Part-of-speech of words surrounding and between w_i and w_j
  - e.g., inbetween-pos=Noun, inbetween-pos=Adverb, dependent-pos-right=Pronoun, head-pos-left=Noun, ...

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Number of words between w_i and w_j, and their orientation
  - e.g., arc-distance=3 & arc-direction=right

Arc Features: f(i, j, k)

- Label features
  - e.g., arc-label=PP

Arc Features: f(i, j, k)

- Combinations of the above
  - e.g., head-pos=Verb & dependent-pos=Preposition & arc-label=PP
  - e.g., head-pos=Verb & dependent=with & arc-distance=3
  - ...
- No limit: any feature over the arc (i, j, k) or the input x (a sketch of such a feature function follows)

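A minimal sketch of a binary feature function and the resulting arc weight w_ijk = e^{w · f(i,j,k)}, assuming the sentence is given as lists of words and POS tags; the feature templates mimic the ones above, but the exact feature names and the dict-based sparse weight vector are illustrative assumptions, not the original implementation.

import math

def arc_features(words, tags, i, j, label):
    """Binary feature names for a candidate arc w_i -> w_j with the given label."""
    direction = 'right' if i < j else 'left'
    return [
        f'head={words[i]}&dep={words[j]}&lab={label}',
        f'head-pos={tags[i]}&dep-pos={tags[j]}&lab={label}',
        f'arc-distance={abs(i - j)}&arc-direction={direction}',
        f'arc-label={label}',
    ]

def arc_weight(weight_vector, words, tags, i, j, label):
    """w_ijk = exp(w . f(i, j, k)) with a sparse weight vector stored as a dict."""
    score = sum(weight_vector.get(feat, 0.0)
                for feat in arc_features(words, tags, i, j, label))
    return math.exp(score)

words = ['root', 'John', 'saw', 'Mary']
tags = ['ROOT', 'Noun', 'Verb', 'Noun']
w = {'head-pos=Verb&dep-pos=Noun&lab=SBJ': 1.5, 'arc-label=SBJ': 0.3}
print(arc_weight(w, words, tags, 2, 1, 'SBJ'))
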
Learning the Parameters

- We can then re-write the inference problem:

    G = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk
      = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} e^{w · f(i,j,k)}
      = argmax_{G ∈ T(Gx)} log ∏_{(i,j,k) ∈ G} e^{w · f(i,j,k)}
      = argmax_{G ∈ T(Gx)} ∑_{(i,j,k) ∈ G} w · f(i,j,k)
      = argmax_{G ∈ T(Gx)} w · ∑_{(i,j,k) ∈ G} f(i,j,k)
      = argmax_{G ∈ T(Gx)} w · f(G)

- Which we can plug into online learning algorithms

Inference-based Learning, e.g., the Perceptron

Training data: T = {(x_t, G_t)}_{t=1..|T|}

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     Let G' = argmax_{G'} w^(i) · f(G')   (**)
  5.     if G' ≠ G_t
  6.       w^(i+1) = w^(i) + f(G_t) − f(G')
  7.       i = i + 1
  8. return w^(i)

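A minimal sketch of this structured perceptron loop, assuming an inference function (e.g., the Chu-Liu-Edmonds sketch above run over scores w · f(i,j,k)) and a feature map f(G) represented as a sparse dict; all names and signatures are illustrative.

from collections import defaultdict

def feature_vector(graph, feat_fn):
    """f(G) = sum of arc feature vectors, as a sparse dict of counts."""
    f = defaultdict(float)
    for arc in graph:
        for feat in feat_fn(arc):
            f[feat] += 1.0
    return f

def perceptron(train, feat_fn, inference, epochs=10):
    """train: list of (sentence, gold_graph); inference(sentence, w) returns the
    highest-scoring graph under weights w (step 4, marked (**) above)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for sentence, gold in train:
            pred = inference(sentence, w)
            if pred != gold:                                   # step 5
                for feat, val in feature_vector(gold, feat_fn).items():
                    w[feat] += val                             # step 6: + f(G_t)
                for feat, val in feature_vector(pred, feat_fn).items():
                    w[feat] -= val                             # step 6: - f(G')
    return w

In practice the inference argument would be the MST or Eisner routine sketched earlier, which is exactly why solving problem (1) makes perceptron-style learning possible.
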
Other Important Problems

- K-best inference: O(K × |L|n²) [Camerini et al. 1980]
- Partition function:

    Zx = ∑_{G ∈ T(Gx)} w(G)

- Arc expectations:

    ⟨i, j, k⟩_x = ∑_{G ∈ T(Gx)} w(G) × 1[(i, j, k) ∈ G]

- Important for some learning & inference frameworks
- Important for some applications

Partition Function: Zx = ∑_{G ∈ T(Gx)} w(G)

- Laplacian matrix Q for the graph Gx = (Vx, Ax):

    Qjj = ∑_{i≠j, (i,j,k) ∈ Ax} w_ijk   and   Qij = −∑_{k: (i,j,k) ∈ Ax} w_ijk  (for i ≠ j)

- The cofactor Q^i is the matrix Q with the i-th row and column removed

The Matrix Tree Theorem [Tutte 1984]: The determinant of the cofactor Q^0 is equal to Zx.

- Thus Zx = |Q^0|; determinants can be calculated in O(n³) (a small NumPy sketch follows)
- Constructing Q takes O(|L|n²)
- Therefore the whole process takes O(n³ + |L|n²)

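A small sketch of the computation with NumPy, assuming the label dimension has already been summed out into a dense matrix w[i, j] of non-negative arc weights with vertex 0 as the root; the function name and the toy graph are illustrative.

import numpy as np

def partition_function(w):
    """Zx via the Matrix Tree Theorem. w[i, j] >= 0 is the total weight of
    arcs i -> j (already summed over labels); the diagonal and arcs into the
    root (column 0) are ignored."""
    n = w.shape[0]
    A = w.copy()
    np.fill_diagonal(A, 0.0)
    A[:, 0] = 0.0                      # nothing points at the artificial root
    Q = np.diag(A.sum(axis=0)) - A     # Laplacian: Q_jj = sum_i w_ij, Q_ij = -w_ij
    return np.linalg.det(Q[1:, 1:])    # determinant of the cofactor Q^0

# Toy check on a 3-vertex graph (root + 2 words): the spanning trees rooted at 0
# have weights 2*3, 2*4 and 5*3, so Zx = 29.
w = np.array([[0.0, 2.0, 3.0],
              [0.0, 0.0, 4.0],
              [0.0, 5.0, 0.0]])
print(partition_function(w))
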
Arc Expectations

    ⟨i, j, k⟩_x = ∑_{G ∈ T(Gx)} w(G) × 1[(i, j, k) ∈ G]

- Can easily be calculated: first reset some weights, setting w_{i'jk'} = 0 for all competing arcs into j, i.e., all (i', k') ≠ (i, k)
- Now ⟨i, j, k⟩_x = Zx
- Why? All competing arc weights are zero, therefore every non-zero weighted graph must contain (i, j, k)
- Naively this takes O(n⁵ + |L|n²) to compute all expectations (the naive recipe is sketched below)
- But it can be calculated in O(n³ + |L|n²) (see [McDonald and Satta 2007, Smith and Smith 2007, Koo et al. 2007])

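Sticking with the unlabeled NumPy setup and the partition_function sketch above, the naive recipe looks like this (a hypothetical helper for illustration, not the O(n³) method of the cited papers):

def arc_expectation(w, i, j):
    """<i, j>_x: total weight of all trees containing the arc i -> j,
    computed by zeroing every competing arc into j and re-running the
    Matrix Tree Theorem computation."""
    w2 = w.copy()
    w2[:, j] = 0.0        # remove all arcs into j ...
    w2[i, j] = w[i, j]    # ... except the one we condition on
    return partition_function(w2)

# With the toy graph above: the trees containing 0 -> 1 weigh 2*3 + 2*4 = 14,
# so the marginal probability of that arc is 14 / 29.
print(arc_expectation(w, 0, 1))
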
Zx for the Projective Case

- Just augment the chart-parsing algorithm
- Weight of a new item: ∑_{l,j,k} w(A) × w(B) × w^k_{hh'}
- The weight of the item rooted at 0 spanning 0 to n is equal to Zx
- Also works for Eisner's algorithm; runtime O(n³ + |L|n²)

⟨i, j, k⟩_x for the Projective Case

- Can be calculated through Zx, just like the non-projective case
- Can also be calculated using the inside-outside algorithm
- See [Paskin 2001] for more details

Why Calculate Zx and ⟨i, j, k⟩_x?

- Useful for many learning and inference problems:
  - Minimum risk decoding (⟨i, j, k⟩_x)
  - Log-linear parsing models (Zx and ⟨i, j, k⟩_x)
  - Syntactic language modeling (Zx)
  - Unsupervised dependency parsing (Zx and ⟨i, j, k⟩_x)
  - ...
- See [McDonald and Satta 2007] for more

Beyond Arc-factored Models

- Arc-factored models make strong independence assumptions
- Can we do better? Rest of the lecture:
  - NP-hardness of Markovization for non-projective parsing
  - But ... the projective case has polynomial solutions!!
  - Approximate non-projective algorithms

Vertical and Horizontal Markovization

- Dependency graph weights factor over neighbouring arcs
- Vertical versus horizontal neighbourhoods

Nth Order Horizontal Markov Factorization

- Assume the unlabeled parsing case (adding labels is easy)
- Weights factor over neighbourhoods of size N
- Normal (arc-factored = first-order), for a head i_0 with dependents i_1, ..., i_m:

    ∏_{k=1}^{m} w_{i_0 i_k}

- Second-order: weights over pairs of adjacent (same-side) arcs, where i_1, ..., i_j are the dependents on one side of the head and i_{j+1}, ..., i_m those on the other (a tiny sketch follows):

    ∏_{k=1}^{j−1} w_{i_0 i_k i_{k+1}} × w_{i_0 · i_j} × w_{i_0 · i_{j+1}} × ∏_{k=j+1}^{m−1} w_{i_0 i_k i_{k+1}}

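A tiny sketch of the second-order factorization for one head, assuming w2(head, prev_sibling, dep) returns the weight of an adjacent-sibling arc pair and None marks the "no previous sibling" slot (the · above); names and the toy weight function are illustrative.

def second_order_head_score(head, left_deps, right_deps, w2):
    """Product of second-order weights for one head; left_deps and right_deps
    are ordered outward from the head (closest dependent first)."""
    score = 1.0
    for deps in (left_deps, right_deps):
        prev = None                       # the '.' slot: no previous sibling yet
        for dep in deps:
            score *= w2(head, prev, dep)
            prev = dep
    return score

# Toy example: head 2 ("saw") with left dependent 1 ("John") and right 3 ("Mary").
w2 = lambda h, s, d: 2.0 if s is None else 1.5
print(second_order_head_score(2, [1], [3], w2))   # 2.0 * 2.0 = 4.0
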
Non-projective Horizontal Markovization

- Non-projective second-order parsing is NP-hard
  - Thus any higher-order non-projective parsing is NP-hard
- Reduction from 3-dimensional matching (3DM): disjoint sets X, Y, Z, each with m elements, and a set T ⊆ X × Y × Z. Question: is there a subset S ⊆ T such that |S| = m and each v ∈ X ∪ Y ∪ Z occurs in exactly one element of S?
- Reduction: define Gx = (Vx, Ax) as a dense graph, where
  - Vx = {v | v ∈ X ∪ Y ∪ Z} ∪ {0}
  - w_{0 x_i x_j} = 1, for all x_i, x_j ∈ X
  - w_{x · y} = 1, for all x ∈ X, y ∈ Y
  - w_{x_i y_j z_k} = 1, for all (x_i, y_j, z_k) ∈ T
  - All other weights are 0

Non-projective Horizontal Markovization

- Non-projective second-order parsing is NP-hard
- Generate a sentence from all x ∈ X, y ∈ Y and z ∈ Z

Non-projective Horizontal Markovization

- Non-projective second-order parsing is NP-hard
- All other arc weights are set to 0

Non-projective Horizontal Markovization

- Theorem: there is a 3DM iff there is a dependency graph of weight 1
- Proof:
  - All non-zero weight dependency graphs correspond to a 3DM
  - Every 3DM corresponds to a non-zero weight dependency graph
  - Therefore, there is a non-zero weight dependency graph iff there is a 3DM
- See [McDonald and Pereira 2006] for more

Projective Horizontal Markovization

- Can simply augment the chart-parsing algorithm
- Same for the Eisner algorithm; runtime still O(|L|n³)

Approximate Non-projective Horizontal Markovization

- Two properties:
  - Projective parsing is polynomial with horizontal Markovization
  - Most non-projective graphs are still primarily projective
- Use these facts to get an approximate non-projective algorithm:
  - Find a high-scoring projective parse
  - Iteratively modify it to create a higher-scoring non-projective parse
  - Non-projectivity is handled as a post-process, which is related to pseudo-projective parsing

Approximate Non-projective Horizontal Markovization

- Algorithm:
  1. Let G be the highest weighted projective graph
  2. Find the arc (i, j, k) ∈ G, a node i', and a label l_k' such that
     - G' = G ∪ {(i', j, k')} − {(i, j, k)} is a valid graph (tree)
     - G' has the highest weight of all possible changes
  3. If w(G') > w(G), then set G = G' and return to step 2
  4. Otherwise return G
- Intuition: start with a high-weighted graph and make local changes that increase the graph's weight until convergence (see the sketch below)
- Works well in practice [McDonald and Pereira 2006]

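A minimal sketch of this hill-climbing post-process for the unlabeled case, assuming a tree is represented as a dict {dependent: head}, graph_weight scores a whole tree (so higher-order features can be used), and is_tree checks validity; all names and helper signatures are illustrative.

def is_tree(heads, root=0):
    """Every vertex must reach the root without revisiting a vertex."""
    for v in heads:
        seen, cur = set(), v
        while cur != root:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def hill_climb(heads, vertices, graph_weight):
    """Greedily re-attach single dependents while the total weight improves."""
    best = graph_weight(heads)
    improved = True
    while improved:
        improved = False
        for dep in list(heads):
            for new_head in vertices:
                if new_head == dep or new_head == heads[dep]:
                    continue
                cand = dict(heads)
                cand[dep] = new_head        # the local change G' = G u {(i',j)} - {(i,j)}
                if is_tree(cand) and graph_weight(cand) > best:
                    heads, best, improved = cand, graph_weight(cand), True
    return heads

The starting point `heads` would be the output of the projective (Eisner-style) parser, matching step 1 of the algorithm above.
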
Vertical Markovization

- Also NP-hard for the non-projective case [McDonald and Satta 2007]
  - Reduction again from 3DM
  - A little more complicated; relies on arc labels
- The projective case is again polynomial
  - Same method of augmenting the chart-parsing algorithm

Beyond Arc-Factorization

- For the non-projective case, increasing the scope of weights (and, as a result, features) makes parsing intractable
- However, the chart-parsing nature of the projective algorithms allows for simple augmentations
- The non-projective case can be approximated using the exact projective algorithms plus a post-processing optimization
- Further reading: [McDonald and Pereira 2006, McDonald and Satta 2007]

Summary: Graph-based Methods

- Arc-factored models
  - Maximum spanning tree formulation
  - Projective and non-projective inference algorithms
  - Partition function and arc expectation algorithms: the Matrix Tree Theorem
- Beyond arc-factored models
  - Vertical and horizontal Markovization
  - Approximations

References and Further Reading
- P. M. Camerini, L. Fratta, and F. Maffioli. 1980. The k best spanning arborescences of a network. Networks, 10(2):91-110.
- Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396-1400.
- J. Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233-240.
- J. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.
- T. Koo, A. Globerson, X. Carreras, and M. Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proc. EMNLP.
- R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. EACL.
- R. McDonald and G. Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proc. IWPT.
- R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proc. ACL.
- M. A. Paskin. 2001. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report UCB/CSD-01-1148, Computer Science Division, University of California Berkeley.
- D. A. Smith and N. A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proc. EMNLP.
- R. E. Tarjan. 1977. Finding optimum branchings. Networks, 7:25-35.
- W. T. Tutte. 1984. Graph Theory. Cambridge University Press.