Dependency Parsing
Statistical NLP, Fall 2016
Slav Petrov – Google

Lecture 9: Dependency Parsing

CoNLL Format

[Figure: example dependency tree over "[PRON] solved the problem with statistics" (POS tags PRON VERB DET NOUN ADP NOUN), with arcs from ROOT and labels such as subj, dobj, det, prep, pobj]
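Each token is one line of tab-separated columns in the CoNLL-X format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, plus optional projective-head columns). A hand-written sketch for the example above; the first word, the lemma column, and the morphological features are placeholders, since they are not recoverable from the slide:

1   We          _   PRON   PRON   _   2   subj
2   solved      _   VERB   VERB   _   0   ROOT
3   the         _   DET    DET    _   4   det
4   problem     _   NOUN   NOUN   _   2   dobj
5   with        _   ADP    ADP    _   2   prep
6   statistics  _   NOUN   NOUN   _   5   pobj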

(Non-)Projectivity
• Crossing arcs are needed to account for non-projective constructions
• Fairly rare in English but can be common in other languages (e.g. Czech)

[Figure: non-projective dependency tree with crossing arcs for the Dutch sentence "* Cathy zag hen wild zwaaien ." (tokens 0-6); arc labels include nsubj, vc, obj1, mod, pobj, det, punct. Data from the CoNLL-X shared task: http://ilk.uvt.nl/conll/]

Formal Conditions
• Every word has exactly one head (the artificial ROOT has none)
• The arcs form a connected, acyclic graph, i.e. a tree
• Optionally projective: no crossing arcs

Styles of Dependency Parsing

• Transition-Based (tr)
  • Fast, greedy, linear-time inference algorithms
  • Trained for greedy search
  • Beam search

• Graph-Based (gr)
  • Slower, exhaustive, dynamic-programming inference algorithms
  • Higher-order factorizations

Accuracy vs. time (sentence length n, beam size k):

  greedy tr      O(n)      [Nivre et al. '03-'11]
  k-best tr      O(k·n)
  1st-order gr   O(n³)
  2nd-order gr   O(n³)
  3rd-order gr   O(n⁴)     [McDonald et al. '05-'06]

Arc-Factored Models

Graph-Based Parsing

• Assumes that scores factor over the tree (arc-factored models):
  Score(tree) = Σ over arcs (h → m) of score(h → m)
• Each feature is an indicator (e.g. 1 if t_m = u, 0 otherwise) with a corresponding real-valued weight.
• Example: "I washed dishes with detergent"
  Score = score(washed → I) + score(washed → dishes) + score(washed → with) + score(with → detergent)
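As a minimal sketch of arc-factored scoring (Python; the index convention is an assumption, with 0 as ROOT):

def tree_score(arcs, score):
    # Arc-factored score: the tree score is the sum of its arc scores.
    # arcs: list of (head, modifier) index pairs; score[h][m] scores the arc h -> m.
    return sum(score[h][m] for h, m in arcs)

# "I washed dishes with detergent" with indices 0=ROOT, 1=I, 2=washed, 3=dishes, 4=with, 5=detergent:
# tree_score([(2, 1), (2, 3), (2, 4), (4, 5)], score)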

Dependency Representation

[Figure: arc scores arranged as a heads × modifiers matrix for the sentence "* As McGwire neared , fans went wild"; each cell scores one head → modifier arc, and a parse picks one head per modifier]

Graph-Based Parsing: Arc-Factored Projective Parsing

Eisner First-Order Parsing

• Items are spans over the sentence: complete spans (a head together with all of its descendants on one side) and incomplete spans (an arc h → m whose inside material is not yet finished).
• Rules (right-arc versions shown; in practice there is also a mirror-image left-arc version of each):
  • Build the incomplete span h → m by combining a complete span from h to r with a complete span from m back to r+1, and adding the arc score score(h → m).
  • Build a complete span from h to e by combining the incomplete span h → m with a complete span from m to e.

[Figures: step-by-step Eisner first-order derivation for "* As McGwire neared , fans went wild"]

Eisner Algorithm Pseudo Code
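The pseudocode slide is an image in the original; here is a minimal Python sketch of the first-order Eisner dynamic program (O(n³); it returns only the best score, and backpointers would be needed to recover the actual tree):

import numpy as np

def eisner_best_score(score):
    # score[h, m]: score of the arc h -> m; token 0 is the artificial ROOT.
    n = score.shape[0]
    # comp[s, t, d] / incomp[s, t, d]: best score of a span s..t whose head is
    # at the left end s (d = 1) or at the right end t (d = 0).
    comp = np.full((n, n, 2), -np.inf)
    incomp = np.full((n, n, 2), -np.inf)
    for i in range(n):
        comp[i, i, 0] = comp[i, i, 1] = 0.0
    for length in range(1, n):
        for s in range(n - length):
            t = s + length
            # Incomplete spans: add an arc between s and t over two complete halves.
            best = max(comp[s, r, 1] + comp[r + 1, t, 0] for r in range(s, t))
            incomp[s, t, 0] = best + score[t, s]      # arc t -> s
            incomp[s, t, 1] = best + score[s, t]      # arc s -> t
            # Complete spans: absorb a finished subtree on one side.
            comp[s, t, 0] = max(comp[s, r, 0] + incomp[r, t, 0] for r in range(s, t))
            comp[s, t, 1] = max(incomp[s, r, 1] + comp[r, t, 1] for r in range(s + 1, t + 1))
    return comp[0, n - 1, 1]   # best projective tree headed by ROOT (token 0)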

Maximum Spanning Trees (MSTs)

• Can use MST algorithms for non-projective parsing!

Chu-Liu-Edmonds
• Greedily pick the highest-scoring incoming arc for every word.
• Find a cycle and contract it into a single node.
• Recalculate edge weights into and out of the contracted node.
• Theorem: an MST of the contracted graph corresponds to an MST of the original graph, so recurse on the contracted graph.
• Final MST: expand all contractions to read off the best (possibly non-projective) tree.

Chu-Liu-Edmonds Pseudo Code
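The pseudocode slides did not survive extraction. Below is a compact recursive sketch in Python, assuming a dense score matrix score[h][m] with node 0 as ROOT and not enforcing a single child of ROOT; production implementations use faster variants:

def find_cycle(head, n):
    # Return one cycle (list of nodes) among the current head choices, or None.
    color = [0] * n                     # 0 = unvisited, 1 = on current path, 2 = done
    for start in range(1, n):
        u, path = start, []
        while color[u] == 0 and u != 0:
            color[u] = 1
            path.append(u)
            u = head[u]
        if color[u] == 1:               # walked back onto the current path: a cycle
            cycle, v = [u], head[u]
            while v != u:
                cycle.append(v)
                v = head[v]
            return cycle
        for v in path:
            color[v] = 2
    return None

def chu_liu_edmonds(score):
    # score[h][m]: weight of arc h -> m; node 0 is ROOT. Returns head[m] for every node.
    n = len(score)
    head = [0] * n
    for m in range(1, n):               # greedy step: best incoming arc for every word
        head[m] = max((h for h in range(n) if h != m), key=lambda h: score[h][m])
    cycle = find_cycle(head, n)
    if cycle is None:
        return head
    cyc = set(cycle)
    cyc_score = sum(score[head[m]][m] for m in cycle)
    non_cyc = [v for v in range(n) if v not in cyc]
    old2new = {v: i for i, v in enumerate(non_cyc)}
    c = len(non_cyc)                    # id of the contracted cycle node
    NEG = float("-inf")
    score2 = [[NEG] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}               # which original arcs the contracted arcs stand for
    for h in range(n):
        for m in range(n):
            if h == m or (h in cyc and m in cyc):
                continue
            if h in cyc:                # arc leaving the cycle
                j = old2new[m]
                if score[h][m] > score2[c][j]:
                    score2[c][j], leave[j] = score[h][m], h
            elif m in cyc:              # arc entering the cycle: break one internal arc
                i, w = old2new[h], cyc_score - score[head[m]][m] + score[h][m]
                if w > score2[i][c]:
                    score2[i][c], enter[i] = w, (h, m)
            else:
                score2[old2new[h]][old2new[m]] = score[h][m]
    head2 = chu_liu_edmonds(score2)     # recurse on the contracted graph
    final = list(head)                  # cycle nodes keep their in-cycle heads by default
    for j in range(1, c + 1):
        if j == c:                      # the chosen arc into the cycle breaks it here
            h_orig, m_orig = enter[head2[j]]
            final[m_orig] = h_orig
        else:
            m_orig = non_cyc[j]
            final[m_orig] = leave[j] if head2[j] == c else non_cyc[head2[j]]
    return final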

Arc Weights

Arc Feature Ideas for f(i,j,k)

• Identities of the words w_i and w_j and the label l_k
• Part-of-speech tags of the words w_i and w_j and the label l_k
• Part-of-speech tags of words surrounding and between w_i and w_j
• Number of words between w_i and w_j, and their orientation
• Combinations of the above
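A hypothetical Python sketch of such a feature function (the template names are made up for illustration; conjoined templates like those on the next slide would be added the same way):

def arc_features(words, tags, i, j, k):
    # First-order arc features f(i, j, k) for the arc words[i] -> words[j] with label k.
    direction = "left" if j < i else "right"
    distance = abs(i - j)
    between = tags[min(i, j) + 1 : max(i, j)]       # POS tags between head and modifier
    feats = [
        ("head-word", words[i], k), ("mod-word", words[j], k),
        ("head-tag", tags[i], k), ("mod-tag", tags[j], k),
        ("head-mod-tag", tags[i], tags[j], k),
        ("dist-dir", distance, direction),
    ]
    feats += [("between-tag", tags[i], t, tags[j]) for t in between]
    return feats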

First-Order Feature Computation

Example: features computed for the arc went → As (direction: left, distance: 5) in "* As McGwire neared , fans went wild", in both lexicalized/treebank-tag and universal-tag variants:

[went]

[VBD]

[As]

[ADP]

[went]

[VERB]

[As]

[IN]

[went, VBD]

[As, ADP]

[went, As]

[VBD, ADP]

[went, VERB]

[As, IN]

[went, As]

[VERB, IN]

[VBD, As, ADP]

[went, As, ADP]

[went, VBD, ADP]

[went, VBD, As]

[ADJ, *, ADP]

[VBD, *, ADP]

[VBD, ADJ, ADP]

[VBD, ADJ, *]

[NNS, *, ADP]

[NNS, VBD, ADP]

[NNS, VBD, *]

[ADJ, ADP, NNP]

[VBD, ADP, NNP]

[VBD, ADJ, NNP]

[NNS, ADP, NNP]

[NNS, VBD, NNP]

[went, left, 5]

[VBD, left, 5]

[As, left, 5]

[ADP, left, 5]

[VERB, As, IN]

[went, As, IN]

[went, VERB, IN]

[went, VERB, As]

[JJ, *, IN]

[VERB, *, IN]

[VERB, JJ, IN]

[VERB, JJ, *]

[NOUN, *, IN]

[NOUN, VERB, IN]

[NOUN, VERB, *]

[NOUN, IN, NOUN]

[NOUN, VERB, NOUN]

[went, left, 5]

[VERB, left, 5]

[As, left, 5]

[IN, left, 5]

[went, VBD, As, ADP]

[VBD, ADJ, *, ADP]

[NNS, VBD, *, ADP]

[VBD, ADJ, ADP, NNP]

[NNS, VBD, ADP, NNP]

[went, VBD, left, 5]

[As, ADP, left, 5]

[went, As, left, 5]

[VBD, ADP, left, 5]

[NOUN, VERB, *, IN]

[went, VERB, As, IN]

[VERB, JJ, *, IN]

[JJ, IN, NOUN]

[VERB, IN, NOUN]

[VERB, JJ, NOUN]

[VERB, JJ, IN, NOUN]

[NOUN, VERB, IN, NOUN]

[went, VERB, left, 5]

[As, IN, left, 5]

[went, As, left, 5]

[VERB, IN, left, 5]

[VBD, As, ADP, left, 5]

[went, As, ADP, left, 5]

[went, VBD, ADP, left, 5]

[went, VBD, As, left, 5]

[ADJ, *, ADP, left, 5]

[VBD, *, ADP, left, 5]

[VBD, ADJ, ADP, left, 5]

[VBD, ADJ, *, left, 5]

[NNS, *, ADP, left, 5]

[NNS, VBD, ADP, left, 5]

[NNS, VBD, *, left, 5]

[ADJ, ADP, NNP, left, 5]

[VBD, ADP, NNP, left, 5]

[VBD, ADJ, NNP, left, 5]

[NNS, ADP, NNP, left, 5]

[NNS, VBD, NNP, left, 5]

[VERB, As, IN, left, 5]

[went, As, IN, left, 5]

[went, VERB, IN, left, 5]

[went, VERB, As, left, 5]

[JJ, *, IN, left, 5]

[VERB, *, IN, left, 5]

[VERB, JJ, IN, left, 5]

[VERB, JJ, *, left, 5]

[NOUN, *, IN, left, 5]

[NOUN, VERB, IN, left, 5]

[NOUN, VERB, *, left, 5]

[JJ, IN, NOUN, left, 5]

[VERB, IN, NOUN, left, 5]

[VERB, JJ, NOUN, left, 5]

[NOUN, IN, NOUN, left, 5]

[NOUN, VERB, NOUN, left, 5]

[went, VBD, As, ADP, left, 5] [VBD, ADJ, *, ADP, left, 5] [NNS, VBD, *, ADP, left, 5] [VBD, ADJ, ADP, NNP, left, 5]

[NNS, VBD, ADP, NNP, left, 5]

[went, VERB, As, IN, left, 5]

[NOUN, VERB, IN, NOUN, left, 5]

[VERB, JJ, *, IN, left, 5]

[NOUN, VERB, *, IN, left, 5] [VERB, JJ, IN, NOUN, left, 5]

(Structured) Perceptron
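The slide body is an image in the original; roughly, training a graph-based parser with the structured perceptron looks like the following sketch, where decode could be the Eisner or Chu-Liu-Edmonds decoder above and feats(sent, tree) returns a sparse dict of feature counts (the names are illustrative):

def perceptron_update(weights, sent, gold_tree, feats, decode):
    # One structured perceptron update: decode the best tree under the current
    # weights, then add the gold feature vector and subtract the predicted one.
    pred_tree = decode(sent, weights)
    if pred_tree != gold_tree:
        for f, v in feats(sent, gold_tree).items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in feats(sent, pred_tree).items():
            weights[f] = weights.get(f, 0.0) - v
    return weights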

Transition-Based Dependency Parsing

• Process the sentence left to right
• Different transition strategies available
• Delay decisions by pushing words onto a stack

Arc-Standard Transition Strategy [Nivre '03]

Initial configuration:  ([], [0,…,n], [])
Terminal configuration: ([0], [], A)

shift:               (σ, [i|β], A)    ⇒  ([σ|i], β, A)
left-arc (label l):  ([σ|i|j], β, A)  ⇒  ([σ|j], β, A ∪ {(j, l, i)})
right-arc (label l): ([σ|i|j], β, A)  ⇒  ([σ|i], β, A ∪ {(i, l, j)})
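A direct transcription of these transitions into Python (a sketch; tokens are integer indices, index 0 is ROOT, and arcs are (head, label, modifier) triples):

def arc_standard_step(stack, buffer, arcs, action, label=None):
    # Apply one transition to the configuration (stack, buffer, arcs).
    if action == "shift":
        stack.append(buffer.pop(0))
    elif action == "left-arc":              # second-from-top gets the top as its head
        j = stack.pop(); i = stack.pop()
        arcs.append((j, label, i))
        stack.append(j)
    elif action == "right-arc":             # top gets the second-from-top as its head
        j = stack.pop(); i = stack.pop()
        arcs.append((i, label, j))
        stack.append(i)
    return stack, buffer, arcs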

Arc-Standard Example

Parsing "I booked a flight to Lisbon" (stack ↑, buffer ←). Target tree: nsubj(booked → I), dobj(booked → flight), det(flight → a), prep(flight → to), pobj(to → Lisbon). One derivation (see the greedy-loop sketch below):

  SHIFT (I), SHIFT (booked)
  LEFT-ARC(nsubj):   booked → I
  SHIFT (a), SHIFT (flight)
  LEFT-ARC(det):     flight → a
  SHIFT (to), SHIFT (Lisbon)
  RIGHT-ARC(pobj):   to → Lisbon
  RIGHT-ARC(prep):   flight → to
  RIGHT-ARC(dobj):   booked → flight

Features

• At each configuration the parser must decide: SHIFT? LEFT-ARC? RIGHT-ARC?
• The decision is scored from features of the current stack and buffer.
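Putting the pieces together, greedy transition-based parsing is one loop over classifier decisions (a sketch; classify is assumed to return a valid action for the current configuration):

def greedy_parse(n_tokens, classify):
    # Greedy arc-standard parsing: repeatedly ask a classifier for the next
    # action given the current configuration (stack, buffer, arcs).
    stack, buffer, arcs = [], list(range(n_tokens + 1)), []   # token 0 is ROOT
    while buffer or len(stack) > 1:
        if len(stack) < 2:                          # only SHIFT is possible
            action, label = "shift", None
        else:
            action, label = classify(stack, buffer, arcs)
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "left-arc":
            j = stack.pop(); i = stack.pop()
            arcs.append((j, label, i)); stack.append(j)
        else:                                       # right-arc
            j = stack.pop(); i = stack.pop()
            arcs.append((i, label, j)); stack.append(i)
    return arcs                                     # terminal configuration ([0], [], arcs)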

• Example features of the current configuration: stack-top word = "flight", stack-top POS tag = "NOUN", buffer-front word = "to", child of stack-top word = "a", ...

ZPar Parser Feature Templates

# From single words
pair { stack.tag stack.word }        stack { word tag }
pair { input.tag input.word }        input { word tag }
pair { input(1).tag input(1).word }  input(1) { word tag }
pair { input(2).tag input(2).word }  input(2) { word tag }

# From word pairs
quad   { stack.tag stack.word input.tag input.word }
triple { stack.tag stack.word input.word }
triple { stack.word input.tag input.word }
triple { stack.tag stack.word input.tag }
triple { stack.tag input.tag input.word }
pair   { stack.word input.word }
pair   { stack.tag input.tag }
pair   { input.tag input(1).tag }

# From word triples
triple { input.tag input(1).tag input(2).tag }
triple { stack.tag input.tag input(1).tag }
triple { stack.head(1).tag stack.tag input.tag }
triple { stack.tag stack.child(-1).tag input.tag }
triple { stack.tag stack.child(1).tag input.tag }
triple { stack.tag input.tag input.child(-1).tag }

# Distance
pair   { stack.distance stack.word }
pair   { stack.distance stack.tag }
pair   { stack.distance input.word }
pair   { stack.distance input.tag }
triple { stack.distance stack.word input.word }
triple { stack.distance stack.tag input.tag }

# Valency
pair { stack.word stack.valence(-1) }
pair { stack.word stack.valence(1) }
pair { stack.tag stack.valence(-1) }
pair { stack.tag stack.valence(1) }
pair { input.word input.valence(-1) }
pair { input.tag input.valence(-1) }

# Unigrams
stack.head(1) { word tag }
stack.label
stack.child(-1) { word tag label }
stack.child(1) { word tag label }
input.child(-1) { word tag label }

# Third order
stack.head(1).head(1) { word tag }
stack.head(1).label
stack.child(-1).sibling(1) { word tag label }
stack.child(1).sibling(-1) { word tag label }
input.child(-1).sibling(1) { word tag label }
triple { stack.tag stack.child(-1).tag stack.child(-1).sibling(1).tag }
triple { stack.tag stack.child(1).tag stack.child(1).sibling(-1).tag }
triple { stack.tag stack.head(1).tag stack.head(1).head(1).tag }
triple { input.tag input.child(-1).tag input.child(-1).sibling(1).tag }

# Label set
pair   { stack.tag stack.child(-1).label }
triple { stack.tag stack.child(-1).label stack.child(-1).sibling(1).label }
quad   { stack.tag stack.child(-1).label stack.child(-1).sibling(1).label stack.child(-1).sibling(2).label }
pair   { stack.tag stack.child(1).label }
triple { stack.tag stack.child(1).label stack.child(1).sibling(-1).label }
quad   { stack.tag stack.child(1).label stack.child(1).sibling(-1).label stack.child(1).sibling(-2).label }
pair   { input.tag input.child(-1).label }
triple { input.tag input.child(-1).label input.child(-1).sibling(1).label }
quad   { input.tag input.child(-1).label input.child(-1).sibling(1).label input.child(-1).sibling(2).label }

SVM / Structured Perceptron Hyperparameters
• Regularization
• Loss function
• Hand-crafted features

Neural Network Transition-Based Parser  [Chen & Manning '14] and [Weiss et al. '15]

• Atomic inputs: sparse indicator features of the current configuration, e.g.
  f0 = 1 [buffer0-word = "to"],  f1 = 1 [buffer1-word = "Bilbao"],
  f2 = 1 [buffer0-POS = "IN"],   f3 = 1 [stack0-label = "pobj"], ...
• Embedding layer: each feature group (words, POS tags, arc labels) is looked up in dense embedding tables and concatenated.
• Hidden layer(s): Chen & Manning use one hidden layer; Weiss et al. '15 add a second hidden layer.
• Softmax output over the possible transitions.
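As a minimal sketch (names and shapes are assumptions, not the released implementation), the scorer is a small feed-forward network:

import numpy as np

def nn_transition_scores(feature_ids, embeddings, W1, b1, W2, b2):
    # Look up an embedding for every atomic word/POS/label feature of the
    # configuration, concatenate, apply one hidden layer, and return one
    # unnormalized score per transition (the softmax is applied afterwards).
    x = np.concatenate([embeddings[f] for f in feature_ids])   # embedding layer
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer (ReLU here; Chen & Manning use a cube activation)
    return W2 @ h + b2                 # scores for SHIFT, LEFT-ARC(label), RIGHT-ARC(label), ...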

English Results (WSJ 23)

Method                                    UAS    LAS    Beam
3rd-order Graph-based (ZM2014)            93.22  91.02   -
Transition-based Linear (ZN2011)          93.00  90.95  32
NN Baseline (Chen & Manning, 2014)        91.80  89.60   1
NN Better SGD (Weiss et al., 2015)        92.58  90.54   1
NN Deeper Network (Weiss et al., 2015)    93.19  91.18   1

NN Hyperparameters

• Regularization
• Loss function
• Dimensions
• Activation function
• Initialization
• Adagrad
• Dropout
• Mini-batch size
• Initial learning rate
• Learning rate schedule
• Momentum
• Stopping time
• Parameter averaging

Optimization matters! Use random restarts and grid search; pick the best model on holdout data.
Tune: WSJ S24   Dev: WSJ S22   Test: WSJ S23

Random Restarts: How Much Variance?

[Plot: "Variance of Networks on Tuning/Dev Set" - UAS (%) on the WSJ tune set vs. dev set for many random restarts of four models (Pretrained 200x200, Pretrained 200, 200x200, 200); a 2nd hidden layer and pre-training increase the tune/dev correlation]

Effect of Embedding Dimensions

[Plot: "Word Tuning on WSJ" (tune set, D_pos = D_labels = 32) - UAS (%) vs. word embedding dimension D_words from 1 to 128 for the same four models]

[Plot: "POS/Label Tuning on WSJ" (tune set, D_words = 64) - UAS (%) vs. POS/label embedding dimension D_pos, D_labels from 1 to 32 for Pretrained 200x200, Pretrained 200, 200x200, and 200]

How Important is Lookahead?

[Plots: UAS (WSJ §22) vs. amount of lookahead (0-4 tokens) for the local feed-forward model on the example "Alice saw Bob eat pizza with Charlie", compared against an LSTM and a Bi-LSTM [Kiperwasser & Goldberg '16]]

Beam Search with Local Model

[Plot (schematic): UAS vs. lookahead on "Alice saw Bob eat pizza with Charlie"; the locally trained model improves when decoded with a beam (+Beam) instead of greedily]

Training with Early Updates  [Collins and Roark '04, Zhou et al. '15]

Globally normalized with respect to the beam:

  P(gold path) = exp( Σ_i φ_i^(*) ) / Σ_{j=1}^{|Beam|} exp( Σ_i φ_i^(j) )

where φ_i^(j) is the score of the i-th decision along the j-th path in the beam and (*) marks the gold path. Backpropagate through all steps, paths, and layers.

Label Bias

• In the local model every decision is a probability 0 ≤ φ ≤ 1, so the model cannot penalize a decision for pushing the overall structure too far from the gold tree.
• The global model learns to assign credit/blame.
• Other views:
  • Need to learn to model states not along the gold path
  • Look into the future and overrule low-entropy (low-branching) states
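In code, the beam-level objective is just a softmax over summed path scores; gradients are then backpropagated through every decision on every beam path (a sketch, with the bookkeeping for building the beam omitted):

import numpy as np

def beam_crf_loss(path_scores, gold_index):
    # path_scores[j] = sum of the decision scores phi_i along the j-th path in
    # the beam; gold_index points at the gold path (kept in the beam during
    # training, e.g. with early updates). Returns -log P(gold | beam).
    log_z = np.logaddexp.reduce(path_scores)
    return log_z - path_scores[gold_index]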

Globally Normalized Model

[Plot (schematic): UAS vs. lookahead (0-4) for the Local, +Beam, and Global models]

English Results (WSJ 23)

Method                                    UAS    LAS    Beam
3rd-order Graph-based (ZM2014)            93.22  91.02   -
Transition-based Linear (ZN2011)          93.00  90.95  32
NN Baseline (Chen & Manning, 2014)        91.80  89.60   1
NN Better SGD (Weiss et al., 2015)        92.58  90.54   1
NN Deeper Network (Weiss et al., 2015)    93.19  91.18   1
NN Perceptron (Weiss et al., 2015)        93.99  92.05   8
NN CRF (Andor et al., 2016)               94.61  92.79  32
NN CRF Semi-Supervised (Andor et al.)     95.01  92.97  32
S-LSTM (Dyer et al., 2015)                93.20  90.90   1
Contrastive NN (Zhou et al., 2015)

Tri-Training  [Zhou et al. '05, Li et al. '14]

• Parse unlabeled data with two different parsers (ZPar and Berkeley).
• Keep only the sentences on which the two parsers agree (~40% agreement); the agreed parses are far more accurate than either parser alone (UAS 96.35 / LAS 95.02 vs. roughly UAS 90 / LAS 87 for an individual parser).
• Train on WSJ + Web Treebank + QuestionBank.
• Evaluate on Web.

English Out-of-Domain Results

[Bar chart: UAS (%) on Web data (roughly 87-90) for supervised vs. semi-supervised (tri-trained) versions of the 3rd-order graph-based parser (ZM2014), the transition-based linear parser (ZN2011, B=32), and the transition-based NN parser (B=1 and B=8)]

Multilingual Results

[Bar chart: UAS (%) (80-95) on Catalan, Chinese, Czech, English, German, Japanese, and Spanish for the Tensor-Based Graph parser (Lei et al. '14), the 3rd-Order Graph parser (Zhang & McDonald '14), the Transition-based NN parser (Weiss et al. '15), and the Transition-based CRF parser (Andor et al. '16); chart notes: "No tri-training data", "With morph features"]

SyntaxNet and Parsey McParseface

• Add screenshots once released

LSTMs vs SyntaxNet

              LSTMs    SyntaxNet
Accuracy      +        ++
Efficiency    -        +
End-to-End    ++       - (yet)
Recurrence    +        - (yet)

Universal Dependencies

• Stanford Universal++ Dependencies
• Interset++ Morphological Features
• Google Universal++ POS Tags

http://universaldependencies.github.io/docs/

Summary

• Constituency Parsing
  • CKY Algorithm
  • Lexicalized Grammars
  • Latent Variable Grammars
  • Conditional Random Field Parsing
  • Neural Network Representations

• Dependency Parsing
  • Eisner Algorithm
  • Maximum Spanning Tree Algorithm
  • Transition-Based Parsing
  • Neural Network Representations
  • Universal Dependencies