Example ("… on the coast … freak storm …"): g(y; x, θ) = score(coast → the) + score(on → coast) + …
Features are binary indicators, e.g. f(x, h, m) = 1 if t_m = u, 0 otherwise. ▷ Each feature has a corresponding real-valued weight.
Statistical NLP, Fall 2016
Lecture 9: Dependency Parsing
Slav Petrov – Google
CoNLL Format
[Figure: a dependency tree for "… solved the problem with statistics" with POS tags (PRON VERB DET NOUN ADP NOUN) and labeled arcs (e.g. su for the subject), as encoded in the CoNLL column format.]
(Non-)Projectivity
• Crossing arcs are needed to account for non-projective constructions
• Fairly rare in English but can be common in other languages (e.g. Czech):
[Figure: non-projective dependency tree for the Dutch sentence "* Cathy zag hen wild zwaaien ." ("Cathy saw them wave wildly"), tokens indexed 0–6, with crossing arcs labeled nsubj, vc, obj1, mod, pobj, det, punct. Data format: http://ilk.uvt.nl/conll/]
Formal Conditions
• Every word has exactly one head (the artificial ROOT has none)
• The dependency graph is connected and acyclic, i.e. a tree
• Optionally projective: arcs can be drawn above the sentence without crossing
Styles of Dependency Parsing
• Transition-Based
  • Fast, greedy, linear-time inference algorithms
  • Trained for greedy search
  • Beam search
• Graph-Based
  • Slower, exhaustive, dynamic-programming inference algorithms
  • Higher-order factorizations

[Figure: accuracy vs. inference time:
  greedy transition-based    O(n)     [Nivre et al. '03–'11]
  k-best transition-based    k · O(n)
  1st-order graph-based      O(n³)    [McDonald et al. '05–'06]
  2nd-order graph-based      O(n³)
  3rd-order graph-based      O(n⁴)]

Arc-Factored Models
Graph-Based Parsing
• Assumes that scores factor over the tree (arc-factored models):
  Score(tree) = Σ_{(h,m) ∈ tree} score(h → m)
• Example: Score("I washed dishes with detergent") = score(washed → I) + score(washed → dishes) + score(washed → with) + score(with → detergent)
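To make the factorization concrete, here is a minimal sketch in Python; the arc scores are made-up numbers, and in a real parser they would come from the weighted features described later:

# Arc-factored scoring: the tree score is the sum of independent
# head -> modifier arc scores.
def score_tree(arcs, arc_score):
    # arcs: list of (head, modifier) pairs; arc_score: dict of arc weights.
    return sum(arc_score[(h, m)] for h, m in arcs)

# Hypothetical scores for "I washed dishes with detergent".
arc_score = {
    ("washed", "I"): 3.0,        # I <- washed
    ("washed", "dishes"): 4.5,   # washed -> dishes
    ("washed", "with"): 1.2,     # washed -> with
    ("with", "detergent"): 2.8,  # with -> detergent
}
print(score_tree(list(arc_score), arc_score))  # 11.5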
Dependency Representation
[Figure: the sentence "* As McGwire neared , fans went wild" laid out as a heads × modifiers grid; each cell holds one candidate arc score, and a parse selects exactly one head per modifier.]
Graph-Based Parsing: Arc-Factored Projective Parsing

Eisner Algorithm: First-Order Rules
[Figure: an incomplete span with head h and modifier m is built by joining a complete span (h, r) with a complete span (m, r+1) and adding the arc h → m; a complete span (h, e) is built by joining an incomplete span (h, m) with a complete span (m, e). In practice there is also a mirror-image left-arc version of each rule.]
Eisner First-Order Parsing
[Figure: step-by-step Eisner first-order parse of "* As McGwire neared , fans went wild", combining complete and incomplete spans bottom-up until the whole sentence is covered by a single complete span headed at the root.]

Eisner Algorithm Pseudo Code
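As a stand-in for the original pseudo code, here is a standard first-order Eisner sketch in Python/NumPy (best score only; recovering the tree itself needs the usual backpointer arrays):

import numpy as np

def eisner_best_score(score):
    # score[h, m] = weight of arc h -> m; token 0 is the artificial ROOT.
    n = score.shape[0]
    # C = complete spans, I = incomplete spans, indexed [s, t, d]
    # with d = 0: head at the right end t, d = 1: head at the left end s.
    C = np.full((n, n, 2), -np.inf)
    I = np.full((n, n, 2), -np.inf)
    for s in range(n):
        C[s, s, 0] = C[s, s, 1] = 0.0
    for k in range(1, n):                  # span length
        for s in range(n - k):
            t = s + k
            # Incomplete: join two complete halves and add one arc.
            best = max(C[s, r, 1] + C[r + 1, t, 0] for r in range(s, t))
            I[s, t, 0] = best + score[t, s]   # arc t -> s
            I[s, t, 1] = best + score[s, t]   # arc s -> t
            # Complete: absorb a finished subtree on one side.
            C[s, t, 0] = max(C[s, r, 0] + I[r, t, 0] for r in range(s, t))
            C[s, t, 1] = max(I[s, r, 1] + C[r, t, 1] for r in range(s + 1, t + 1))
    return C[0, n - 1, 1]                  # whole sentence headed by ROOT

# Toy example: ROOT, "saw", "her" with hypothetical arc scores.
S = np.array([[-np.inf, 10.0, 2.0],
              [-np.inf,  0.0, 8.0],
              [-np.inf,  7.0, 0.0]])
print(eisner_best_score(S))                # 18.0: ROOT -> saw -> her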
Maximum Spanning Trees (MSTs)
Can use MST algorithms for non-projective parsing!

Chu-Liu-Edmonds
• Find a cycle and contract it
• Recalculate edge weights
• Theorem: an MST of the contracted graph expands to an MST of the original graph
• Final MST

Chu-Liu-Edmonds Pseudo Code
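Standing in for the original pseudo code, here is a compact recursive Chu-Liu-Edmonds sketch in Python (naive, no speed-ups); it follows exactly the contract / recalculate / expand steps listed above:

def _find_cycle(head):
    # Follow head pointers; the first repeated node lies on a cycle.
    for start in range(1, len(head)):
        seen, v = set(), start
        while v != 0 and v not in seen:
            seen.add(v)
            v = head[v]
        if v != 0:                        # found a cycle through v
            cycle, u = [v], head[v]
            while u != v:
                cycle.append(u)
                u = head[u]
            return cycle
    return None

def chu_liu_edmonds(score):
    # score[h][m] = weight of arc h -> m; node 0 is ROOT. Returns head[m].
    n = len(score)
    head = [0] * n
    for m in range(1, n):                 # greedy best incoming arc per node
        head[m] = max((h for h in range(n) if h != m), key=lambda h: score[h][m])
    cycle = _find_cycle(head)
    if cycle is None:
        return head
    cyc = set(cycle)
    rest = [v for v in range(n) if v not in cyc]      # rest[0] is ROOT
    idx = {v: i for i, v in enumerate(rest)}
    c = len(rest)                         # index of the contracted node
    S = [[float("-inf")] * (c + 1) for _ in range(c + 1)]
    cyc_weight = sum(score[head[v]][v] for v in cycle)
    leave, enter = {}, {}
    for v in rest:
        for u in rest:
            if u != v:
                S[idx[u]][idx[v]] = score[u][v]
        # Best arc leaving the cycle toward v.
        b = max(cyc, key=lambda u: score[u][v])
        S[c][idx[v]], leave[v] = score[b][v], b
        # Best arc from v into the cycle: recalculated weight accounts
        # for the cycle arc that would be broken.
        b = max(cyc, key=lambda m: score[v][m] - score[head[m]][m])
        S[idx[v]][c] = cyc_weight + score[v][b] - score[head[b]][b]
        enter[v] = b
    sub = chu_liu_edmonds(S)              # recurse on the contracted graph
    out = [0] * n
    for v in cycle:                       # keep cycle arcs for now
        out[v] = head[v]
    for v in rest[1:]:
        out[v] = leave[v] if sub[idx[v]] == c else rest[sub[idx[v]]]
    h_c = rest[sub[c]]                    # head chosen for the contracted node
    out[enter[h_c]] = h_c                 # break exactly one cycle arc
    return out

# Toy: 0 = ROOT; arcs 1 <-> 2 form a high-scoring cycle.
INF = float("-inf")
W = [[INF, 5, 1],
     [INF, INF, 11],
     [INF, 10, INF]]
print(chu_liu_edmonds(W))                 # [0, 0, 1]: ROOT -> 1 -> 2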
Arc Weights
Arc feature ideas for f(i, j, k) (see the sketch after this list):
• Identities of the words w_i and w_j and the label l_k
• Part-of-speech tags of the words w_i and w_j and the label l_k
• Part-of-speech tags of the words surrounding and between w_i and w_j
• Number of words between w_i and w_j, and their orientation
• Combinations of the above
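A small illustrative sketch of such a feature function (the template names and exact set are assumptions, not the MSTParser feature set):

def arc_features(words, tags, i, j, k):
    # i = head index, j = modifier index, k = arc label.
    lo, hi = (i, j) if i < j else (j, i)
    direction = "left" if j < i else "right"
    feats = [
        f"hw={words[i]}|mw={words[j]}|lab={k}",      # word identities + label
        f"ht={tags[i]}|mt={tags[j]}|lab={k}",        # POS tags + label
    ]
    feats += [f"between={tags[b]}" for b in range(lo + 1, hi)]  # in-between POS
    feats.append(f"dir={direction}|dist={hi - lo}")  # orientation + distance
    feats.append(f"hw={words[i]}|mt={tags[j]}|dir={direction}|dist={hi - lo}")
    return feats

words = ["*", "As", "McGwire", "neared", ",", "fans", "went", "wild"]
tags  = ["*", "ADP", "NOUN", "VERB", "PUNCT", "NOUN", "VERB", "ADJ"]
print(arc_features(words, tags, 6, 1, "prep"))       # features for went -> As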
First-Order Feature Computation
[Figure: all features fired for the arc went → As in "* As McGwire neared , fans went wild" (direction left, length 5). Templates combine the head and modifier words and tags, the tags of surrounding and in-between words, and the arc's direction and length, each instantiated with both treebank tags (VBD, IN, NNS, JJ, NNP) and coarse tags (VERB, ADP, NOUN, ADJ), e.g.:
[went] [VBD] [As] [ADP] [went, VBD] [As, ADP] [went, As] [VBD, ADP] [VBD, As, ADP] [went, VBD, As, ADP] [NNS, VBD, *, ADP] [VBD, ADJ, ADP, NNP] [went, left, 5] [VBD, ADP, left, 5] … [went, VBD, As, ADP, left, 5]]
(Structured) Perceptron
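As a reminder of how the arc weights are trained, here is a standard structured-perceptron sketch; the decode and features arguments stand in for a decoder such as the Eisner or Chu-Liu-Edmonds routines above and a feature function like arc_features:

from collections import defaultdict

def perceptron_epoch(data, decode, features, w):
    # data: (sentence, gold_heads) pairs; w: defaultdict(float), updated in place.
    for sent, gold in data:
        pred = decode(sent, w)               # argmax tree under current weights
        if pred == gold:
            continue
        for m in range(1, len(gold)):        # skip the ROOT token
            for f in features(sent, gold[m], m):
                w[f] += 1.0                  # reward gold arcs
            for f in features(sent, pred[m], m):
                w[f] -= 1.0                  # penalize predicted arcs
        # shared arcs get both +1 and -1, so they cancel

w = defaultdict(float)
# usage: perceptron_epoch(train_data, my_decoder, my_features, w)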
Transition-Based Dependency Parsing
• Process the sentence left to right
• Different transition strategies available
• Delay decisions by pushing words onto a stack

Arc-Standard Transition Strategy [Nivre '03]
Initial configuration:  ([], [0, …, n], [])
Terminal configuration: ([0], [], A)
shift:         (σ, [i|β], A) ⇒ ([σ|i], β, A)
left-arc(l):   ([σ|i|j], β, A) ⇒ ([σ|j], β, A ∪ {(j, l, i)})
right-arc(l):  ([σ|i|j], β, A) ⇒ ([σ|i], β, A ∪ {(i, l, j)})
Arc-Standard Example ("I booked a flight to Lisbon"; ↑ stack, ← buffer)
  SHIFT            [*]                          [I booked a flight to Lisbon]
  SHIFT            [* I]                        [booked a flight to Lisbon]
  SHIFT            [* I booked]                 [a flight to Lisbon]
  LEFT-ARC(nsubj)  [* booked]                   [a flight to Lisbon]    adds booked → I
  SHIFT            [* booked a]                 [flight to Lisbon]
  SHIFT            [* booked a flight]          [to Lisbon]
  LEFT-ARC(det)    [* booked flight]            [to Lisbon]             adds flight → a
  SHIFT            [* booked flight to]         [Lisbon]
  SHIFT            [* booked flight to Lisbon]  []
  RIGHT-ARC(pobj)  [* booked flight to]         []                      adds to → Lisbon
  RIGHT-ARC(prep)  [* booked flight]            []                      adds flight → to
  RIGHT-ARC(dobj)  [* booked]                   []                      adds booked → flight
(a final RIGHT-ARC attaches booked to the root *, reaching the terminal configuration)
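The derivation above can be replayed mechanically; a minimal executable sketch of the arc-standard system (the action list is the trace just shown):

def arc_standard(words, actions):
    # words[0] is the artificial root '*'; returns (head, label, modifier) arcs.
    stack, buf, arcs = [], list(range(len(words))), []
    for act in actions:
        if act[0] == "shift":        # (σ, [i|β], A) => ([σ|i], β, A)
            stack.append(buf.pop(0))
        elif act[0] == "left":       # pops i, j; adds (j, l, i); pushes j back
            j, i = stack.pop(), stack.pop()
            arcs.append((j, act[1], i))
            stack.append(j)
        elif act[0] == "right":      # pops j; adds (i, l, j) with i the new top
            j = stack.pop()
            arcs.append((stack[-1], act[1], j))
    return arcs

words = ["*", "I", "booked", "a", "flight", "to", "Lisbon"]
actions = [("shift",), ("shift",), ("shift",), ("left", "nsubj"),
           ("shift",), ("shift",), ("left", "det"), ("shift",), ("shift",),
           ("right", "pobj"), ("right", "prep"), ("right", "dobj"),
           ("right", "root")]
for h, l, m in arc_standard(words, actions):
    print(f"{words[h]} --{l}--> {words[m]}")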
Features
[Figure: the configuration for "I booked a flight to Lisbon" with flight on top of the stack and to at the front of the buffer; the classifier must choose among SHIFT, LEFT-ARC, and RIGHT-ARC. Example feature values: stack-top word = "flight", stack-top POS tag = "NOUN", buffer-front word = "to", child of stack-top word = "a", …]
Features: ZPar Parser

# From single words
pair { stack.tag stack.word }        stack { word tag }
pair { input.tag input.word }        input { word tag }
pair { input(1).tag input(1).word }  input(1) { word tag }
pair { input(2).tag input(2).word }  input(2) { word tag }

# From word pairs
quad { stack.tag stack.word input.tag input.word }
triple { stack.tag stack.word input.word }
triple { stack.word input.tag input.word }
triple { stack.tag stack.word input.tag }
triple { stack.tag input.tag input.word }
pair { stack.word input.word }
pair { stack.tag input.tag }
pair { input.tag input(1).tag }

# From word triples
triple { input.tag input(1).tag input(2).tag }
triple { stack.tag input.tag input(1).tag }
triple { stack.head(1).tag stack.tag input.tag }
triple { stack.tag stack.child(-1).tag input.tag }
triple { stack.tag stack.child(1).tag input.tag }
triple { stack.tag input.tag input.child(-1).tag }

# Distance
pair { stack.distance stack.word }
pair { stack.distance stack.tag }
pair { stack.distance input.word }
pair { stack.distance input.tag }
triple { stack.distance stack.word input.word }
triple { stack.distance stack.tag input.tag }

# Valency
pair { stack.word stack.valence(-1) }
pair { stack.word stack.valence(1) }
pair { stack.tag stack.valence(-1) }
pair { stack.tag stack.valence(1) }
pair { input.word input.valence(-1) }
pair { input.tag input.valence(-1) }

# Unigrams
stack.head(1) { word tag }   stack.label
stack.child(-1) { word tag label }
stack.child(1) { word tag label }
input.child(-1) { word tag label }

# Third order
stack.head(1).head(1) { word tag }   stack.head(1).label
stack.child(-1).sibling(1) { word tag label }
stack.child(1).sibling(-1) { word tag label }
input.child(-1).sibling(1) { word tag label }
triple { stack.tag stack.child(-1).tag stack.child(-1).sibling(1).tag }
triple { stack.tag stack.child(1).tag stack.child(1).sibling(-1).tag }
triple { stack.tag stack.head(1).tag stack.head(1).head(1).tag }
triple { input.tag input.child(-1).tag input.child(-1).sibling(1).tag }

# Label set
pair { stack.tag stack.child(-1).label }
triple { stack.tag stack.child(-1).label stack.child(-1).sibling(1).label }
quad { stack.tag stack.child(-1).label stack.child(-1).sibling(1).label stack.child(-1).sibling(2).label }
pair { stack.tag stack.child(1).label }
triple { stack.tag stack.child(1).label stack.child(1).sibling(-1).label }
quad { stack.tag stack.child(1).label stack.child(1).sibling(-1).label stack.child(1).sibling(-2).label }
pair { input.tag input.child(-1).label }
triple { input.tag input.child(-1).label input.child(-1).sibling(1).label }
quad { input.tag input.child(-1).label input.child(-1).sibling(1).label input.child(-1).sibling(2).label }

SVM / Structured Perceptron Hyperparameters
• Regularization
• Loss function
• Hand-crafted features
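To make the template notation concrete, here is a small illustrative instantiation of a few of the templates above (function and feature names, and the distance cap, are assumptions, not ZPar's actual code):

def zpar_style_features(words, tags, stack, buf):
    # stack / buf hold word indices; s0 = stack top, n0 = buffer front.
    feats = []
    s0 = stack[-1] if stack else None
    n0 = buf[0] if buf else None
    if s0 is not None:
        feats += [f"s0.w={words[s0]}", f"s0.t={tags[s0]}",
                  f"s0.wt={words[s0]}|{tags[s0]}"]        # pair { stack.tag stack.word }
    if n0 is not None:
        feats += [f"n0.w={words[n0]}", f"n0.t={tags[n0]}"]
    if s0 is not None and n0 is not None:
        feats += [f"s0.w|n0.w={words[s0]}|{words[n0]}",   # pair { stack.word input.word }
                  f"s0.t|n0.t={tags[s0]}|{tags[n0]}",     # pair { stack.tag input.tag }
                  f"dist={min(n0 - s0, 5)}|s0.t={tags[s0]}"]  # pair { stack.distance stack.tag }
    return feats

# The configuration from the Features slide: stack [* booked flight], buffer [to Lisbon].
words = ["*", "I", "booked", "a", "flight", "to", "Lisbon"]
tags  = ["*", "PRON", "VERB", "DET", "NOUN", "ADP", "NOUN"]
print(zpar_style_features(words, tags, stack=[0, 2, 4], buf=[5, 6]))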
Neural Network Transition-Based Parser [Chen & Manning '14; Weiss et al. '15]
[Figure: atomic inputs are binary indicators of the configuration, e.g.
  f0 = 1[buffer0-word = "to"]
  f1 = 1[buffer1-word = "Bilbao"]
  f2 = 1[buffer0-POS = "IN"]
  f3 = 1[stack0-label = "pobj"]
  …
An embedding layer maps the word, POS, and label features to dense vectors, followed by a hidden layer (two hidden layers in Weiss et al. '15) and a softmax over transitions.]
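A minimal sketch of the Chen & Manning-style network in PyTorch (the feature-slot counts and the cube nonlinearity follow the paper's recipe; all other sizes and names are illustrative):

import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    # Embeds the atomic word / POS / label features of a configuration
    # and outputs logits over the transition actions.
    def __init__(self, n_words, n_pos, n_labels, n_actions,
                 d_word=50, d_tag=50, hidden=200,
                 n_wfeats=18, n_pfeats=18, n_lfeats=12):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)
        self.pos_emb = nn.Embedding(n_pos, d_tag)
        self.label_emb = nn.Embedding(n_labels, d_tag)
        in_dim = n_wfeats * d_word + (n_pfeats + n_lfeats) * d_tag
        self.hidden = nn.Linear(in_dim, hidden)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, word_ids, pos_ids, label_ids):
        x = torch.cat([self.word_emb(word_ids).flatten(1),
                       self.pos_emb(pos_ids).flatten(1),
                       self.label_emb(label_ids).flatten(1)], dim=-1)
        h = self.hidden(x) ** 3     # cube activation (Chen & Manning '14)
        return self.out(h)          # logits; softmax is applied inside the loss

# One dummy configuration: 18 word, 18 POS, and 12 label feature slots.
net = TransitionClassifier(n_words=10000, n_pos=50, n_labels=40, n_actions=3)
logits = net(torch.zeros(1, 18, dtype=torch.long),
             torch.zeros(1, 18, dtype=torch.long),
             torch.zeros(1, 12, dtype=torch.long))

Weiss et al. '15 modify this recipe with ReLU activations, a second hidden layer, and a structured-perceptron output layer trained on top.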
English Results (WSJ §23)

Method                                    UAS    LAS    Beam
3rd-order Graph-based (ZM2014)            93.22  91.02  –
Transition-based Linear (ZN2011)          93.00  90.95  32
NN Baseline (Chen & Manning, 2014)        91.80  89.60  1
NN Better SGD (Weiss et al., 2015)        92.58  90.54  1
NN Deeper Network (Weiss et al., 2015)    93.19  91.18  1
NN Hyperparameters
• Regularization
• Loss function
• Dimensions
• Activation function
• Initialization
• Adagrad
• Dropout
• Mini-batch size
• Initial learning rate
• Learning rate schedule
• Momentum
• Stopping time
• Parameter averaging

Optimization matters! Use random restarts and grid search; pick the best model using holdout data (Tune: WSJ §24, Dev: WSJ §22, Test: WSJ §23).
Random Restarts: How Much Variance?
[Figure: variance of networks on the tuning vs. dev set (UAS %, WSJ) for models 200, 200x200, Pretrained 200, Pretrained 200x200; a 2nd hidden layer + pretraining increases the tune/dev correlation.]

Effect of Embedding Dimensions
[Figure: word tuning on WSJ (tune set, D_pos = D_labels = 32), UAS vs. word embedding dimension D_words ∈ {1, …, 128}; and POS/label tuning (D_words = 64), UAS vs. POS/label embedding dimension ∈ {1, …, 32}; same four models.]
How Important is Lookahead?
[Figure: for "Alice saw ? Bob eat pizza with Charlie", UAS (§22 of the WSJ) of the Local model as a function of lookahead (0–4 tokens), compared against Bi-LSTM models [Kiperwasser & Goldberg '16].]
Beam Search with Local Model
[Figure (schematic): UAS of the Local model with and without beam search on "Alice saw Bob eat pizza with Charlie".]

Label Bias
• In the Local model every decision is locally normalized (0 ≤ φ ≤ 1), so the model cannot penalize a decision for pushing the overall structure too far from the gold tree
• The Global model learns to assign credit/blame
• Other views:
  • Need to learn to model states not along the gold path
  • Look into the future and overrule low-entropy (low-branching) states

Globally Normalized Model
Globally normalized with respect to the beam:

  p(d^(*)) = exp(Σ_i φ_i^(*)) / Σ_{j=1..|Beam|} exp(Σ_i φ_i^(j))

where φ_i^(j) is the score of the i-th decision on the j-th beam path and (*) marks the gold path.

Training with Early Updates [Collins and Roark '04, Zhou et al. '15]
• Backpropagate through all steps, paths, and layers
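In code, the loss implied by this formula is a log-sum-exp over the paths kept on the beam; a sketch (beam_scores holds Σ_i φ_i^(j) for each path j, with the gold path at gold_index):

import torch

def globally_normalized_loss(beam_scores, gold_index):
    # -log p(gold) = logsumexp over all beam paths - gold path score.
    return torch.logsumexp(beam_scores, dim=0) - beam_scores[gold_index]

# With early updates, the loss is applied at the step where the gold path
# falls off the beam; backward() then reaches every decision score.
scores = torch.tensor([4.2, 3.9, 1.5, 0.7], requires_grad=True)
globally_normalized_loss(scores, gold_index=0).backward()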
Globally Normalized Model
[Figure: UAS of Local, +Beam, and Global models as a function of lookahead (0–4).]

English Results (WSJ §23)

Method                                    UAS    LAS    Beam
3rd-order Graph-based (ZM2014)            93.22  91.02  –
Transition-based Linear (ZN2011)          93.00  90.95  32
NN Baseline (Chen & Manning, 2014)        91.80  89.60  1
NN Better SGD (Weiss et al., 2015)        92.58  90.54  1
NN Deeper Network (Weiss et al., 2015)    93.19  91.18  1
NN Perceptron (Weiss et al., 2015)        93.99  92.05  8
NN CRF (Andor et al., 2016)               94.61  92.79  32
NN CRF Semi-Supervised (Andor et al.)     95.01  92.97  32
S-LSTM (Dyer et al., 2015)                93.20  90.90  1

(Also compared: Contrastive NN (Zhou et al., 2015).)
Tri-Training [Zhou et al. '05, Li et al. '14]
[Figure: two parsers, ZPar (UAS 89.96 / LAS 87.26) and Berkeley (92.83), parse unlabeled text; they agree on ~40% of sentences, and on those the accuracy is much higher (UAS 96.35 / LAS 95.02), so the agreed parses are used as extra training data.]
• Train on WSJ + Web Treebank + QuestionBank
• Evaluate on Web

English Out-of-Domain Results
[Figure: UAS (87–90) of supervised vs. semi-supervised (tri-trained) models: 3rd-order Graph-based (ZM2014), Transition-based Linear (ZN 2011, B=32), Transition-based NN (B=1), Transition-based NN (B=8); e.g. UAS 89.84 / LAS 87.21 and 89.25 appear among the bars.]
Multilingual Results
[Figure: UAS (80–95) on Catalan, Chinese, Czech, English, German, Japanese, and Spanish for Tensor-Based Graph (Lei et al. '14), 3rd-Order Graph (Zhang & McDonald '14), Transition-based NN (Weiss et al. '15), and Transition-based CRF (Andor et al. '16); no tri-training data; with morph features.]

SyntaxNet and Parsey McParseface
LSTMs vs. SyntaxNet
(Screenshots to be added once released.)

              LSTMs   SyntaxNet
Accuracy      +       ++
Efficiency    -       +
End-to-End    ++      - (yet)
Recurrence    +       - (yet)
Universal Dependencies
• Stanford Universal++ Dependencies
• Google Universal++ POS Tags
• Interset++ Morphological Features
http://universaldependencies.github.io/docs/
Summary
• Constituency Parsing
  • CKY Algorithm
  • Lexicalized Grammars
  • Latent Variable Grammars
  • Conditional Random Field Parsing
  • Neural Network Representations
• Dependency Parsing
  • Eisner Algorithm
  • Maximum Spanning Tree Algorithm
  • Transition-Based Parsing
  • Neural Network Representations
  • Universal Dependencies