Introduction to Data-Driven Dependency Parsing
Introductory Course, ESSLLI 2007

Ryan McDonald, Google Inc., New York, USA. E-mail: [email protected]
Joakim Nivre, Uppsala University and Växjö University, Sweden. E-mail: [email protected]

Introduction
Overview of the Course

- Dependency parsing (Joakim)
- Machine learning methods (Ryan)
- Transition-based models (Joakim)
- Graph-based models (Ryan)
- Loose ends (Joakim, Ryan):
  - Other approaches
  - Empirical results
  - Available software

Notation Reminder

- Sentence x = w_0, w_1, ..., w_n, with w_0 = root
- L = {l_1, ..., l_{|L|}} is the set of permissible arc labels
- Let G = (V, A) be a dependency graph for sentence x, where:
  - V = {0, 1, ..., n} is the vertex set
  - A is the arc set, i.e., (i, j, k) ∈ A represents a dependency from w_i to w_j with label l_k ∈ L
- By the usual definition, G is a tree

Data-Driven Parsing

- Goal: learn a good predictor of dependency graphs
  - Input: x
  - Output: dependency graph/tree G
- Last lecture:
  - Parameterize parsing by transitions
  - Learn to predict transitions given the input and a history
  - Predict new graphs using a deterministic parsing algorithm
- This lecture:
  - Parameterize parsing by dependency arcs
  - Learn to predict entire graphs given the input
  - Predict new graphs using spanning tree algorithms

Lecture 4: Outline

- Graph theory refresher
- Arc-factored models (a.k.a. edge-factored models)
  - Maximum spanning tree formulation
  - Projective and non-projective inference algorithms
  - Partition function and marginal algorithms: the Matrix Tree Theorem
- Beyond arc-factored models
  - Vertical and horizontal Markovization
  - Approximations

Graph Theory Refresher
Some Graph Theory Reminders

- A graph G = (V, A) is a set of vertices V and arcs (i, j) ∈ A, where i, j ∈ V
- Undirected graphs: (i, j) ∈ A ⇔ (j, i) ∈ A
- Directed graphs (digraphs): (i, j) ∈ A ⇏ (j, i) ∈ A

Multi-Digraphs

- A multi-digraph G = (V, A) is a digraph in which there can be multiple arcs between vertices
- (i, j, k) ∈ A represents the k-th arc from vertex i to vertex j

Directed Spanning Trees (a.k.a. Arborescences)

- A directed spanning tree of a (multi-)digraph G = (V, A) is a subgraph G' = (V', A') such that:
  - V' = V
  - A' ⊆ A, and |A'| = |V'| − 1
  - G' is a tree (acyclic)
- A spanning tree of the following (multi-)digraphs

Weighted Directed Spanning Trees

- Assume we have a weight function for each arc in a multi-digraph G = (V, A)
- Define w_ijk ≥ 0 to be the weight of arc (i, j, k) ∈ A
- Define the weight of a directed spanning tree G' of graph G as

    w(G') = ∏_{(i,j,k) ∈ G'} w_ijk

- Notation: (i, j, k) ∈ G = (V, A) ⇔ the arc (i, j, k) ∈ A

Maximum Spanning Trees (MST) of (Multi-)Digraphs

- Let T(G) be the set of all spanning trees of graph G
- The MST problem: find the spanning tree G' of the graph G that has the highest weight

    G' = argmax_{G' ∈ T(G)} w(G') = argmax_{G' ∈ T(G)} ∏_{(i,j,k) ∈ G'} w_ijk

- Solutions ... to come.

Arc-factored Models
Arc-Factored Dependency Models

- Remember: data-driven parsing parameterizes the model and then learns the parameters from data
- Arc-factored model:
  - Assumes that the score / probability / weight of a dependency graph factors by its arcs (look familiar?)

    w(G) = ∏_{(i,j,k) ∈ G} w_ijk

  - w_ijk is the weight of creating a dependency from word w_i to w_j with label l_k
- Thus there is an assumption that each dependency decision is independent
  - Strong assumption! We will address this later.

Arc-Factored Dependency Models: Example

- Weight of the dependency graph is 10 × 30 × 30 = 9000
- In practice arc weights are much smaller

An Important Concept: Gx

- For input sentence x = w_0, ..., w_n, define Gx = (Vx, Ax) as:
  - Vx = {0, 1, ..., n}
  - Ax = {(i, j, k) | i, j ∈ Vx and l_k ∈ L}
- Thus, Gx is the complete multi-digraph over the vertex set representing the words

Theorem: Every valid dependency graph for sentence x is equivalent to a directed spanning tree of Gx that originates out of vertex 0.

- Falls out of the definitions of tree-constrained dependency graphs and spanning trees:
  - Both are spanning/connected (contain all words)
  - Both are trees

Three Important Problems

Theorem: Every valid dependency graph for sentence x is equivalent to a directed spanning tree of Gx that originates out of vertex 0.

1. Inference ≡ finding the MST of Gx

    G = argmax_{G ∈ T(Gx)} w(G) = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk

2. Defining w_ijk and its feature space
3. Learning w_ijk
   - Can use perceptron-based learning if we solve (1)

Inference - Getting Rid of Arc Labels

    G = argmax_{G ∈ T(Gx)} w(G) = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk

- Consider all the arcs between vertices i and j
- Now, consider the arc (i, j, k*) such that k* = argmax_k w_ijk

Theorem: If the highest weighted dependency tree for sentence x contains an arc from w_i to w_j, it must be the arc (i, j, k*) (assuming no ties).

- Easy proof: if not, substitute in (i, j, k*) and get a higher weighted tree

Inference - Getting Rid of Arc Labels

    G = argmax_{G ∈ T(Gx)} w(G) = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk

- Thus, we can reduce Gx from a multi-digraph to a simple digraph
- Just remove, for each pair (i, j), all arcs that do not satisfy k = argmax_k w_ijk (a small sketch of this reduction follows)
- The problem is now equal to the MST problem for digraphs
- We will use the Chu-Liu-Edmonds algorithm [Chu and Liu 1965, Edmonds 1967]

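A minimal sketch of this label reduction, assuming arc weights are stored in a dict keyed by (i, j, k); the names reduce_labels and weights are illustrative, not from the course code.

def reduce_labels(weights):
    """Collapse a multi-digraph {(i, j, k): w_ijk} into a simple digraph
    {(i, j): (w, k)} by keeping only the best-labelled arc for each (i, j)."""
    best = {}
    for (i, j, k), w in weights.items():
        if (i, j) not in best or w > best[(i, j)][0]:
            best[(i, j)] = (w, k)
    return best

# Example: two labels competing on the arc 1 -> 2; label 0 wins.
print(reduce_labels({(1, 2, 0): 30.0, (1, 2, 1): 10.0, (0, 1, 0): 5.0}))
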
Chu-Liu-Edmonds Algorithm

- Finds the MST originating out of a vertex of choice
- Assumes the weight of a tree is the sum of its arc weights
- No problem, we can use logarithms:

    G = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk
      = argmax_{G ∈ T(Gx)} log ∏_{(i,j,k) ∈ G} w_ijk
      = argmax_{G ∈ T(Gx)} ∑_{(i,j,k) ∈ G} log w_ijk

  So if we redefine w_ijk := log w_ijk, then we get

    G = argmax_{G ∈ T(Gx)} ∑_{(i,j,k) ∈ G} w_ijk

Chu-Liu-Edmonds

- x = root John saw Mary

Chu-Liu-Edmonds

- Find the highest scoring incoming arc for each vertex
- If this is a tree, then we have found the MST!!

Chu-Liu-Edmonds

- If not a tree, identify a cycle and contract it
- Recalculate the weights of arcs into and out of the cycle

Chu-Liu-Edmonds

- Outgoing arc weights:
  - Equal to the max of the outgoing arc over all vertices in the cycle
  - e.g., John → Mary is 3 and saw → Mary is 30

Chu-Liu-Edmonds

- Incoming arc weights:
  - Equal to the weight of the best spanning tree that includes the head of the incoming arc and all nodes in the cycle
  - root → saw → John is 40 (**)
  - root → John → saw is 29

Chu-Liu-Edmonds

Theorem: The weight of the MST of the contracted graph is equal to the weight of the MST of the original graph.

- Therefore, recursively call the algorithm on the new graph

Chu-Liu-Edmonds

- This is a tree and the MST for the contracted graph!!
- Go back up the recursive calls and reconstruct the final graph

Chu-Liu-Edmonds

- This is the MST!!

Chu-Liu-Edmonds Code

Chu-Liu-Edmonds(Gx, w):
  1.  Let M = {(i*, j) : j ∈ Vx, i* = argmax_{i'} w_{i'j}}
  2.  Let GM = (Vx, M)
  3.  If GM has no cycles, then it is an MST: return GM
  4.  Otherwise, find a cycle C in GM
  5.  Let <GC, c, ma> = contract(G, C, w)
  6.  Let G = Chu-Liu-Edmonds(GC, w)
  7.  Find the vertex i ∈ C such that (i', c) ∈ G and ma(i', c) = i
  8.  Find the arc (i'', i) ∈ C
  9.  Find all arcs (c, i''') ∈ G
  10. G = G ∪ {(ma(c, i'''), i''')} for all (c, i''') ∈ G
          ∪ C ∪ {(i', i)} − {(i'', i)}
  11. Remove all vertices and arcs in G containing c
  12. return G

- Reminder: w_ij = max_k w_ijk

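Below is a minimal runnable sketch of the algorithm in Python, assuming the label dimension has already been maximised away, arc weights are additive log-weights stored in nested dicts, and vertex ids are integers (needed to mint the contracted vertex); the names chu_liu_edmonds, find_cycle, and score are illustrative, not from the original course code.

def find_cycle(parent):
    """Return the vertices of a cycle in the arcs {dep: head}, or None."""
    for start in parent:
        path, v = [start], start
        while v in parent:
            v = parent[v]
            if v in path:
                return path[path.index(v):]
            path.append(v)
    return None

def chu_liu_edmonds(score, root=0):
    """score[h][d] = log-weight of arc h -> d (dense, no self-loops).
    Returns the arc set {(h, d)} of the maximum spanning tree rooted at `root`."""
    deps = {d for heads in score.values() for d in heads if d != root}
    # 1. Greedily pick the best incoming arc for every non-root vertex.
    best = {d: max(score, key=lambda h: score[h].get(d, float('-inf')))
            for d in deps}
    cycle = find_cycle(best)
    if cycle is None:
        return {(best[d], d) for d in deps}
    # 2. Contract the cycle C into a fresh vertex c, remembering which
    #    original arc each contracted arc stands for (the `ma` map above).
    cyc = set(cycle)
    c = max(deps | set(score)) + 1
    cyc_weight = sum(score[best[v]][v] for v in cyc)
    new_score, ma = {}, {}
    for h, arcs in score.items():
        for d, s in arcs.items():
            if h in cyc and d not in cyc:            # arc leaving the cycle
                if s > new_score.setdefault(c, {}).get(d, float('-inf')):
                    new_score[c][d] = s
                    ma[(c, d)] = (h, d)
            elif h not in cyc and d in cyc:          # arc entering the cycle
                s2 = s - score[best[d]][d] + cyc_weight
                if s2 > new_score.setdefault(h, {}).get(c, float('-inf')):
                    new_score[h][c] = s2
                    ma[(h, c)] = (h, d)
            elif h not in cyc and d not in cyc:      # arc outside the cycle
                new_score.setdefault(h, {})[d] = s
    # 3. Recurse on the contracted graph, then expand c back into the cycle.
    contracted = chu_liu_edmonds(new_score, root)
    tree = {ma.get(arc, arc) for arc in contracted}
    broken = next(ma[(h, d)][1] for (h, d) in contracted if d == c)
    tree |= {(best[v], v) for v in cyc if v != broken}
    return tree

# Toy run: root = 0, "John saw Mary" with made-up log-weights; the result is
# {(0, 2), (2, 1), (2, 3)}, i.e., root -> saw -> {John, Mary}.
scores = {0: {1: 5, 2: 10, 3: 8}, 1: {2: 20, 3: 3},
          2: {1: 30, 3: 30}, 3: {1: 11, 2: 0}}
print(chu_liu_edmonds(scores))
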
Chu-Liu-Edmonds Code (II)

contract(G = (V, A), C, w):
  1. Let GC be the subgraph of G excluding the nodes in C
  2. Add a node c to GC representing the cycle C
  3. For each i ∈ V − C such that ∃ i' ∈ C with (i', i) ∈ A:
       Add arc (c, i) to GC with
         ma(c, i) = argmax_{i' ∈ C} score(i', i)
         i' = ma(c, i)
         score(c, i) = score(i', i)
  4. For each i ∈ V − C such that ∃ i' ∈ C with (i, i') ∈ A:
       Add arc (i, c) to GC with
         ma(i, c) = argmax_{i' ∈ C} [score(i, i') − score(a(i'), i')]
         i' = ma(i, c)
         score(i, c) = score(i, i') − score(a(i'), i') + score(C)
     where a(v) is the predecessor of v in C and score(C) = ∑_{v ∈ C} score(a(v), v)
  5. return <GC, c, ma>

Chu-Liu-Edmonds

- Naive implementation: O(n³ + |L|n²)
  - Converting Gx to a digraph: O(|L|n²)
  - Finding the best incoming arcs: O(n²)
  - Contracting cycles: O(n²)
  - At most n recursive calls
- Better algorithms run in O(|L|n²) [Tarjan 1977]
- Chu-Liu-Edmonds searches over all dependency graphs
  - Both projective and non-projective
  - Thus, it is an exact non-projective search algorithm!!!
- What about the projective case?

Arc-factored Projective Parsing

- Projective dependency structures are nested
- Can use CFG-like parsing algorithms, i.e., chart parsing
- Each chart item (triangle) represents the weight of the best tree rooted at word h spanning all the words from i to j
  - Analog in CFG parsing: items represent the best tree rooted at non-terminal NT spanning words i to j
- Goal: find the chart item rooted at 0 spanning 0 to n
- Base case: items of length 1, with h = i = j, have weight 1

Arc-factored Projective Parsing

- All projective graphs can be written as the combination of two smaller adjacent graphs
- Inductive hypothesis: the algorithm has calculated the scores of smaller items correctly (just like CKY)

Arc-factored Projective Parsing

- Chart items are filled in a bottom-up manner
  - First do all strings of length 1, then 2, etc., just like CKY
- Weight of a new item: max_{l,j,k} w(A) × w(B) × w^k_{hh'}
- The algorithm runs in O(|L|n⁵)
- Use back-pointers to extract the best parse (like CKY)

Arc-factored Projective Parsing

- O(|L|n⁵) is not that good
- [Eisner 1996] showed how this can be reduced to O(|L|n³)
  - Key: split items so that sub-roots are always on the periphery (a sketch of the resulting dynamic program follows)

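A compact sketch of the O(n³) dynamic program for the unlabeled, first-order case, assuming score[h, d] holds log arc weights and vertex 0 is the root; it returns only the best tree score (back-pointers omitted for brevity), and the names are illustrative rather than taken from Eisner's paper.

import numpy as np

def eisner_best_score(score):
    """score[h, d]: log-weight of arc h -> d over vertices 0..n-1 (0 = root).
    Returns the log-weight of the best projective dependency tree."""
    n = score.shape[0]
    NEG = float('-inf')
    # I[i, j, d]: best "incomplete" span i..j with an arc between its endpoints;
    # C[i, j, d]: best "complete" span i..j; d = 0 means head j, d = 1 means head i.
    I = np.full((n, n, 2), NEG)
    C = np.full((n, n, 2), NEG)
    C[np.arange(n), np.arange(n), :] = 0.0
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Build incomplete spans by adding the arc i -> j or j -> i.
            best = max(C[i, r, 1] + C[r + 1, j, 0] for r in range(i, j))
            I[i, j, 0] = best + score[j, i]
            I[i, j, 1] = best + score[i, j]
            # Build complete spans from an incomplete span plus a complete one.
            C[i, j, 0] = max(C[i, r, 0] + I[r, j, 0] for r in range(i, j))
            C[i, j, 1] = max(I[i, r, 1] + C[r, j, 1] for r in range(i + 1, j + 1))
    return C[0, n - 1, 1]   # everything attached under the root at position 0

# Toy example with 4 "words" (root + 3 tokens) and random log-weights.
rng = np.random.default_rng(0)
print(eisner_best_score(rng.normal(size=(4, 4))))
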
Inference in Arc-Factored Models

- Non-projective case:
  - O(|L|n²) with the Chu-Liu-Edmonds MST algorithm
- Projective case:
  - O(|L|n³) with the Eisner algorithm
- But we still haven't defined the form of w_ijk
- Or how to learn these parameters

Arc weights as linear classifiers

    w_ijk = e^{w · f(i,j,k)}

- Arc weights are a linear combination of features of the arc, f, and a corresponding weight vector w
- Raised to an exponent (simplifies some math ...)
- What arc features?
- [McDonald et al. 2005] discuss a number of binary features

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Identities of the words w_i and w_j and the label l_k
  - e.g., head=saw & dependent=with

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Part-of-speech tags of the words w_i and w_j and the label l_k
  - e.g., head-pos=Verb & dependent-pos=Preposition

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Part-of-speech of words surrounding and between w_i and w_j
  - e.g., inbetween-pos=Noun, inbetween-pos=Adverb, dependent-pos-right=Pronoun, head-pos-left=Noun, ...

Arc Features: f(i, j, k)

- Features from [McDonald et al. 2005]:
  - Number of words between w_i and w_j, and their orientation
  - e.g., arc-distance=3 & arc-direction=right

Arc Features: f(i, j, k)

- Label features
  - e.g., arc-label=PP

Arc Features: f(i, j, k)

- Combinations of the above
  - e.g., head-pos=Verb & dependent-pos=Preposition & arc-label=PP
  - e.g., head-pos=Verb & dependent=with & arc-distance=3
  - ...
- No limit: any feature over the arc (i, j, k) or the input x (a sketch of such a feature function follows)

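A minimal sketch of a binary feature function and the resulting arc weight w_ijk = e^{w · f(i,j,k)}, assuming the sentence is given as lists of words and POS tags; the feature templates mimic the ones above, but the exact feature names and the dict-based sparse weight vector are illustrative assumptions, not the original implementation.

import math

def arc_features(words, tags, i, j, label):
    """Binary feature names for a candidate arc w_i -> w_j with the given label."""
    direction = 'right' if i < j else 'left'
    return [
        f'head={words[i]}&dep={words[j]}&lab={label}',
        f'head-pos={tags[i]}&dep-pos={tags[j]}&lab={label}',
        f'arc-distance={abs(i - j)}&arc-direction={direction}',
        f'arc-label={label}',
    ]

def arc_weight(weight_vector, words, tags, i, j, label):
    """w_ijk = exp(w . f(i, j, k)) with a sparse weight vector stored as a dict."""
    score = sum(weight_vector.get(feat, 0.0)
                for feat in arc_features(words, tags, i, j, label))
    return math.exp(score)

words = ['root', 'John', 'saw', 'Mary']
tags = ['ROOT', 'Noun', 'Verb', 'Noun']
w = {'head-pos=Verb&dep-pos=Noun&lab=SBJ': 1.5, 'arc-label=SBJ': 0.3}
print(arc_weight(w, words, tags, 2, 1, 'SBJ'))
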
Learning the Parameters

- We can then re-write the inference problem:

    G = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} w_ijk
      = argmax_{G ∈ T(Gx)} ∏_{(i,j,k) ∈ G} e^{w · f(i,j,k)}
      = argmax_{G ∈ T(Gx)} log ∏_{(i,j,k) ∈ G} e^{w · f(i,j,k)}
      = argmax_{G ∈ T(Gx)} ∑_{(i,j,k) ∈ G} w · f(i,j,k)
      = argmax_{G ∈ T(Gx)} w · ∑_{(i,j,k) ∈ G} f(i,j,k)
      = argmax_{G ∈ T(Gx)} w · f(G)

- Which we can plug into online learning algorithms

Inference-based Learning, e.g., the Perceptron

Training data: T = {(x_t, G_t)}_{t=1..|T|}

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     Let G' = argmax_{G'} w^(i) · f(G')   (**)
  5.     if G' ≠ G_t
  6.       w^(i+1) = w^(i) + f(G_t) − f(G')
  7.       i = i + 1
  8. return w^(i)

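A minimal sketch of this structured perceptron loop, assuming an inference function (e.g., the Chu-Liu-Edmonds sketch above run over scores w · f(i,j,k)) and a feature map f(G) represented as a sparse dict; all names and signatures are illustrative.

from collections import defaultdict

def feature_vector(graph, feat_fn):
    """f(G) = sum of arc feature vectors, as a sparse dict of counts."""
    f = defaultdict(float)
    for arc in graph:
        for feat in feat_fn(arc):
            f[feat] += 1.0
    return f

def perceptron(train, feat_fn, inference, epochs=10):
    """train: list of (sentence, gold_graph); inference(sentence, w) returns the
    highest-scoring graph under weights w (step 4, marked (**) above)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for sentence, gold in train:
            pred = inference(sentence, w)
            if pred != gold:                                   # step 5
                for feat, val in feature_vector(gold, feat_fn).items():
                    w[feat] += val                             # step 6: + f(G_t)
                for feat, val in feature_vector(pred, feat_fn).items():
                    w[feat] -= val                             # step 6: - f(G')
    return w

In practice the inference argument would be the MST or Eisner routine sketched earlier, which is exactly why solving problem (1) makes perceptron-style learning possible.
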
Other Important Problems

- K-best inference: O(K × |L|n²) [Camerini et al. 1980]
- Partition function:

    Zx = ∑_{G ∈ T(Gx)} w(G)

- Arc expectations:

    ⟨i, j, k⟩_x = ∑_{G ∈ T(Gx)} w(G) × 1[(i, j, k) ∈ G]

- Important for some learning & inference frameworks
- Important for some applications

Partition Function: Zx = ∑_{G ∈ T(Gx)} w(G)

- Laplacian matrix Q for the graph Gx = (Vx, Ax):

    Qjj = ∑_{i≠j, (i,j,k) ∈ Ax} w_ijk   and   Qij = −∑_{k: (i,j,k) ∈ Ax} w_ijk  (for i ≠ j)

- The cofactor Q^i is the matrix Q with the i-th row and column removed

The Matrix Tree Theorem [Tutte 1984]: The determinant of the cofactor Q^0 is equal to Zx.

- Thus Zx = |Q^0|; determinants can be calculated in O(n³) (a small NumPy sketch follows)
- Constructing Q takes O(|L|n²)
- Therefore the whole process takes O(n³ + |L|n²)

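A small sketch of the computation with NumPy, assuming the label dimension has already been summed out into a dense matrix w[i, j] of non-negative arc weights with vertex 0 as the root; the function name and the toy graph are illustrative.

import numpy as np

def partition_function(w):
    """Zx via the Matrix Tree Theorem. w[i, j] >= 0 is the total weight of
    arcs i -> j (already summed over labels); the diagonal and arcs into the
    root (column 0) are ignored."""
    n = w.shape[0]
    A = w.copy()
    np.fill_diagonal(A, 0.0)
    A[:, 0] = 0.0                      # nothing points at the artificial root
    Q = np.diag(A.sum(axis=0)) - A     # Laplacian: Q_jj = sum_i w_ij, Q_ij = -w_ij
    return np.linalg.det(Q[1:, 1:])    # determinant of the cofactor Q^0

# Toy check on a 3-vertex graph (root + 2 words): the spanning trees rooted at 0
# have weights 2*3, 2*4 and 5*3, so Zx = 29.
w = np.array([[0.0, 2.0, 3.0],
              [0.0, 0.0, 4.0],
              [0.0, 5.0, 0.0]])
print(partition_function(w))
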
Arc Expectations

    ⟨i, j, k⟩_x = ∑_{G ∈ T(Gx)} w(G) × 1[(i, j, k) ∈ G]

- Can easily be calculated: first reset some weights, setting w_{i'jk'} = 0 for all competing arcs into j, i.e., all (i', k') ≠ (i, k)
- Now ⟨i, j, k⟩_x = Zx
- Why? All competing arc weights are zero, therefore every non-zero weighted graph must contain (i, j, k)
- Naively this takes O(n⁵ + |L|n²) to compute all expectations (the naive recipe is sketched below)
- But it can be calculated in O(n³ + |L|n²) (see [McDonald and Satta 2007, Smith and Smith 2007, Koo et al. 2007])

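Sticking with the unlabeled NumPy setup and the partition_function sketch above, the naive recipe looks like this (a hypothetical helper for illustration, not the O(n³) method of the cited papers):

def arc_expectation(w, i, j):
    """<i, j>_x: total weight of all trees containing the arc i -> j,
    computed by zeroing every competing arc into j and re-running the
    Matrix Tree Theorem computation."""
    w2 = w.copy()
    w2[:, j] = 0.0        # remove all arcs into j ...
    w2[i, j] = w[i, j]    # ... except the one we condition on
    return partition_function(w2)

# With the toy graph above: the trees containing 0 -> 1 weigh 2*3 + 2*4 = 14,
# so the marginal probability of that arc is 14 / 29.
print(arc_expectation(w, 0, 1))
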
Zx for the Projective Case

- Just augment the chart-parsing algorithm
- Weight of a new item: ∑_{l,j,k} w(A) × w(B) × w^k_{hh'}
- The weight of the item rooted at 0 spanning 0 to n is equal to Zx
- Also works for Eisner's algorithm; runtime O(n³ + |L|n²)

⟨i, j, k⟩_x for the Projective Case

- Can be calculated through Zx, just like the non-projective case
- Can also be calculated using the inside-outside algorithm
- See [Paskin 2001] for more details

Why Calculate Zx and ⟨i, j, k⟩_x?

- Useful for many learning and inference problems:
  - Minimum risk decoding (⟨i, j, k⟩_x)
  - Log-linear parsing models (Zx and ⟨i, j, k⟩_x)
  - Syntactic language modeling (Zx)
  - Unsupervised dependency parsing (Zx and ⟨i, j, k⟩_x)
  - ...
- See [McDonald and Satta 2007] for more

Beyond Arc-factored Models

- Arc-factored models make strong independence assumptions
- Can we do better? Rest of the lecture:
  - NP-hardness of Markovization for non-projective parsing
  - But ... the projective case has polynomial solutions!!
  - Approximate non-projective algorithms

Vertical and Horizontal Markovization

- Dependency graph weights factor over neighbouring arcs
- Vertical versus horizontal neighbourhoods

Nth Order Horizontal Markov Factorization

- Assume the unlabeled parsing case (adding labels is easy)
- Weights factor over neighbourhoods of size N
- Normal (arc-factored = first-order), for a head i_0 with dependents i_1, ..., i_m:

    ∏_{k=1}^{m} w_{i_0 i_k}

- Second-order: weights over pairs of adjacent (same-side) arcs, where i_1, ..., i_j are the dependents on one side of the head and i_{j+1}, ..., i_m those on the other (a tiny sketch follows):

    ∏_{k=1}^{j−1} w_{i_0 i_k i_{k+1}} × w_{i_0 · i_j} × w_{i_0 · i_{j+1}} × ∏_{k=j+1}^{m−1} w_{i_0 i_k i_{k+1}}

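A tiny sketch of the second-order factorization for one head, assuming w2(head, prev_sibling, dep) returns the weight of an adjacent-sibling arc pair and None marks the "no previous sibling" slot (the · above); names and the toy weight function are illustrative.

def second_order_head_score(head, left_deps, right_deps, w2):
    """Product of second-order weights for one head; left_deps and right_deps
    are ordered outward from the head (closest dependent first)."""
    score = 1.0
    for deps in (left_deps, right_deps):
        prev = None                       # the '.' slot: no previous sibling yet
        for dep in deps:
            score *= w2(head, prev, dep)
            prev = dep
    return score

# Toy example: head 2 ("saw") with left dependent 1 ("John") and right 3 ("Mary").
w2 = lambda h, s, d: 2.0 if s is None else 1.5
print(second_order_head_score(2, [1], [3], w2))   # 2.0 * 2.0 = 4.0
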
Non-projective Horizontal Markovization

- Non-projective second-order parsing is NP-hard
  - Thus any higher-order non-projective parsing is NP-hard
- Reduction from 3-dimensional matching (3DM): disjoint sets X, Y, Z, each with m elements, and a set T ⊆ X × Y × Z. Question: is there a subset S ⊆ T such that |S| = m and each v ∈ X ∪ Y ∪ Z occurs in exactly one element of S?
- Reduction: define Gx = (Vx, Ax) as a dense graph, where
  - Vx = {v | v ∈ X ∪ Y ∪ Z} ∪ {0}
  - w_{0 x_i x_j} = 1, for all x_i, x_j ∈ X
  - w_{x · y} = 1, for all x ∈ X, y ∈ Y
  - w_{x_i y_j z_k} = 1, for all (x_i, y_j, z_k) ∈ T
  - All other weights are 0

Non-projective Horizontal Markovization

- Non-projective second-order parsing is NP-hard
- Generate a sentence from all x ∈ X, y ∈ Y and z ∈ Z

Non-projective Horizontal Markovization

- Non-projective second-order parsing is NP-hard
- All other arc weights are set to 0

Non-projective Horizontal Markovization

- Theorem: there is a 3DM iff there is a dependency graph of weight 1
- Proof:
  - All non-zero weight dependency graphs correspond to a 3DM
  - Every 3DM corresponds to a non-zero weight dependency graph
  - Therefore, there is a non-zero weight dependency graph iff there is a 3DM
- See [McDonald and Pereira 2006] for more

Projective Horizontal Markovization

- Can simply augment the chart-parsing algorithm
- Same for the Eisner algorithm; runtime still O(|L|n³)

Approximate Non-projective Horizontal Markovization

- Two properties:
  - Projective parsing is polynomial with horizontal Markovization
  - Most non-projective graphs are still primarily projective
- Use these facts to get an approximate non-projective algorithm:
  - Find a high-scoring projective parse
  - Iteratively modify it to create a higher-scoring non-projective parse
  - Non-projectivity is handled as a post-process, which is related to pseudo-projective parsing

Approximate Non-projective Horizontal Markovization

- Algorithm:
  1. Let G be the highest weighted projective graph
  2. Find the arc (i, j, k) ∈ G, a node i', and a label l_k' such that
     - G' = G ∪ {(i', j, k')} − {(i, j, k)} is a valid graph (tree)
     - G' has the highest weight of all possible changes
  3. If w(G') > w(G), then set G = G' and return to step 2
  4. Otherwise return G
- Intuition: start with a high-weighted graph and make local changes that increase the graph's weight until convergence (see the sketch below)
- Works well in practice [McDonald and Pereira 2006]

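A minimal sketch of this hill-climbing post-process for the unlabeled case, assuming a tree is represented as a dict {dependent: head}, graph_weight scores a whole tree (so higher-order features can be used), and is_tree checks validity; all names and helper signatures are illustrative.

def is_tree(heads, root=0):
    """Every vertex must reach the root without revisiting a vertex."""
    for v in heads:
        seen, cur = set(), v
        while cur != root:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

def hill_climb(heads, vertices, graph_weight):
    """Greedily re-attach single dependents while the total weight improves."""
    best = graph_weight(heads)
    improved = True
    while improved:
        improved = False
        for dep in list(heads):
            for new_head in vertices:
                if new_head == dep or new_head == heads[dep]:
                    continue
                cand = dict(heads)
                cand[dep] = new_head        # the local change G' = G u {(i',j)} - {(i,j)}
                if is_tree(cand) and graph_weight(cand) > best:
                    heads, best, improved = cand, graph_weight(cand), True
    return heads

The starting point `heads` would be the output of the projective (Eisner-style) parser, matching step 1 of the algorithm above.
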
Vertical Markovization

- Also NP-hard for the non-projective case [McDonald and Satta 2007]
  - Reduction again from 3DM
  - A little more complicated; relies on arc labels
- The projective case is again polynomial
  - Same method of augmenting the chart-parsing algorithm

Beyond Arc-Factorization

- For the non-projective case, increasing the scope of weights (and, as a result, features) makes parsing intractable
- However, the chart-parsing nature of the projective algorithms allows for simple augmentations
- The non-projective case can be approximated using the exact projective algorithms plus a post-processing optimization
- Further reading: [McDonald and Pereira 2006, McDonald and Satta 2007]

Summary: Graph-based Methods

- Arc-factored models
  - Maximum spanning tree formulation
  - Projective and non-projective inference algorithms
  - Partition function and arc expectation algorithms: the Matrix Tree Theorem
- Beyond arc-factored models
  - Vertical and horizontal Markovization
  - Approximations

References and Further Reading
- P. M. Camerini, L. Fratta, and F. Maffioli. 1980. The k best spanning arborescences of a network. Networks, 10(2):91-110.
- Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396-1400.
- J. Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233-240.
- J. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.
- T. Koo, A. Globerson, X. Carreras, and M. Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proc. EMNLP.
- R. McDonald and F. Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proc. EACL.
- R. McDonald and G. Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proc. IWPT.
- R. McDonald, K. Crammer, and F. Pereira. 2005. Online large-margin training of dependency parsers. In Proc. ACL.
- M. A. Paskin. 2001. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report UCB/CSD-01-1148, Computer Science Division, University of California Berkeley.
- D. A. Smith and N. A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proc. EMNLP.
- R. E. Tarjan. 1977. Finding optimum branchings. Networks, 7:25-35.
- W. T. Tutte. 1984. Graph Theory. Cambridge University Press.