Beyond Structured Prediction: Inverse Reinforcement Learning

Hal Daumé III
Computer Science, University of Maryland
[email protected]

A Tutorial at ACL 2011, Portland, Oregon, Sunday, 19 June 2011

Acknowledgements
➢ Some slides: Stuart Russell, Dan Klein, J. Drew Bagnell, Nathan Ratliff, Stephane Ross
➢ Discussions/Feedback: MLRG Spring 2010

Examples of structured problems


Examples of demonstrations


NLP as transduction

Task: Machine Translation
  Input:  Ces deux principes se tiennent à la croisée de la philosophie, de la politique, de l'économie, de la sociologie et du droit.
  Output: Both principles lie at the crossroads of philosophy, politics, economics, sociology, and law.

Task: Document Summarization
  Input:  Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.
  Output: The Falkland islands war, in 1982, was fought between Britain and Argentina.

Task: Syntactic Analysis
  Input:  The man ate a big sandwich.
  Output: [parse tree of "The man ate a big sandwich"]

...many more...

Structured prediction 101

Learn a function mapping inputs to complex outputs (decoding from the input space to the output space):

    f : X → Y

Examples:
➢ Sequence labeling: "I can can a can" → Pro Md Vb Dt Nn
➢ Parsing
➢ Machine translation: "Mary no dio una bofetada a la bruja verde ." → "Mary did not slap the green witch ."
➢ Coreference resolution: Barack Obama ... the President ... he; John McCain


Why is structure important?

➢ Correlations among outputs
  ➢ Determiners often precede nouns
  ➢ Sentences usually have verbs
➢ Global coherence
  ➢ It just doesn't make sense to have three determiners next to each other
➢ My objective (aka "loss function") forces it
  ➢ Translations should have good sequences of words
  ➢ Summaries should be coherent

Outline: Part I

➢ What is Structured Prediction?
➢ Refresher on Binary Classification
  ➢ What does it mean to learn?
  ➢ Linear models for classification
  ➢ Batch versus stochastic optimization
➢ From Perceptron to Structured Perceptron
  ➢ Linear models for Structured Prediction
  ➢ The "argmax" problem
  ➢ From Perceptron to margins
➢ Structure without Structure
  ➢ Stacking
  ➢ Structure compilation

Outline: Part II

➢ Learning to Search
  ➢ Incremental parsing
  ➢ Learning to queue
➢ Refresher on Markov Decision Processes
➢ Inverse Reinforcement Learning
  ➢ Determining rewards given policies
  ➢ Maximum margin planning
➢ Learning by Demonstration
  ➢ Searn
  ➢ Dagger
➢ Discussion

Refresher on Binary Classification


What does it mean to learn?

➢ Informally: to predict the future based on the past
➢ Slightly-less-informally: to take labeled examples and construct a function that will label them as a human would
➢ Formally:
  ➢ Given:
    ➢ A fixed unknown distribution D over X × Y
    ➢ A loss function over Y × Y
    ➢ A finite sample of (x, y) pairs drawn i.i.d. from D
  ➢ Construct a function f that has low expected loss with respect to D

Feature extractors

➢ A feature extractor Φ maps examples to vectors

  "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"

    → Φ →   W=dear : 1, W=sir : 1, W=this : 2, ..., W=wish : 0, ...,
            MISSPELLED : 2, NAMELESS : 1, ALL_CAPS : 0, NUM_URLS : 0, ...

➢ Feature vectors in NLP are frequently sparse
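As a concrete illustration of such a sparse Φ, here is a minimal Python sketch (the feature names and the toy checks are illustrative, not taken from the tutorial):

    from collections import Counter

    def extract_features(text):
        """Map a raw example to a sparse feature vector (feature name -> count)."""
        phi = Counter()
        words = text.lower().split()
        for w in words:
            phi["W=" + w.strip(".,")] += 1                      # lexical count features
        phi["NUM_URLS"] = sum(w.startswith("http") for w in words)
        phi["ALL_CAPS"] = sum(w.isupper() and len(w) > 1 for w in words)
        return phi                                              # only nonzero entries are stored

    phi = extract_features("Dear Sir. First, I must solicit your confidence in this transaction")
    print(phi["W=dear"], phi["W=sir"], phi["NUM_URLS"])         # 1 1 0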

Linear models for binary classification

➢ Decision boundary is the set of "uncertain" points
➢ Linear decision boundaries are characterized by weight vectors

  x = "free money"
  Φ(x):  BIAS : 1, free : 1, money : 1, the : 0, ...
  w:     BIAS : -3, free : 4, money : 2, the : 0, ...
  prediction based on Σ_i w_i Φ_i(x)

The perceptron

➢ Inputs = feature values Φ_1, Φ_2, Φ_3, ...; params = weights w_1, w_2, w_3, ...; the sum Σ_i w_i Φ_i is the response
➢ If the response is:
  ➢ Positive, output +1
  ➢ Negative, output -1
➢ When training, update on errors:  w ← w + y Φ(x)
➢ "Error" when:  y w·Φ(x) ≤ 0
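A minimal sketch of this training loop over sparse, dict-based feature vectors (the helper names and toy data are mine, not from the slides):

    def dot(w, phi):
        return sum(w.get(f, 0.0) * v for f, v in phi.items())

    def perceptron(data, epochs=10):
        """data: list of (phi, y) with phi a dict and y in {-1, +1}."""
        w = {}
        for _ in range(epochs):
            for phi, y in data:
                if y * dot(w, phi) <= 0:                 # "error" when y w.Phi(x) <= 0
                    for f, v in phi.items():             # update: w <- w + y Phi(x)
                        w[f] = w.get(f, 0.0) + y * v
        return w

    data = [({"BIAS": 1, "free": 1, "money": 1}, +1),
            ({"BIAS": 1, "the": 1, "meeting": 1}, -1)]
    w = perceptron(data)
    print(dot(w, {"BIAS": 1, "free": 1, "money": 1}) > 0)    # True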

Why does that update work?

➢ When y w_old·Φ(x) ≤ 0, update w_new = w_old + y Φ(x). Then:

    y w_new·Φ(x) = y (w_old + y Φ(x))·Φ(x)
                 = y w_old·Φ(x) + y² Φ(x)·Φ(x)
                 = y w_old·Φ(x) + ‖Φ(x)‖²

  so (since y² = 1) the response on this example moves up by ‖Φ(x)‖², toward being correct.

Support vector machines

➢ Explicitly optimize the margin
➢ Enforce that all training points are correctly classified

    max_w margin    s.t.  y_n w·Φ(x_n) ≥ 1 , ∀n      (all points are correctly classified)

  equivalently:

    min_w ‖w‖²      s.t.  y_n w·Φ(x_n) ≥ 1 , ∀n

Support vector machines with slack

➢ Explicitly optimize the margin
➢ Allow some "noisy" points to be misclassified

    min_{w,ξ}  ½‖w‖² + C Σ_n ξ_n
    s.t.  y_n w·Φ(x_n) + ξ_n ≥ 1 , ∀n
          ξ_n ≥ 0 , ∀n

Batch versus stochastic optimization

➢ Batch = read in all the data, then process it

    min_{w,ξ}  ½‖w‖² + C Σ_n ξ_n
    s.t.  y_n w·Φ(x_n) + ξ_n ≥ 1 , ∀n ;  ξ_n ≥ 0 , ∀n

➢ Stochastic = (roughly) process a bit at a time

    For n=1..N:
      If y_n w·Φ(x_n) ≤ 0:  w ← w + y_n Φ(x_n)

Stochastically optimized SVMs

➢ SVM objective + SOME MATH gives:

    For n=1..N:
      If y_n w·Φ(x_n) ≤ 1:  w ← w + y_n Φ(x_n)
      w ← (1 − 1/(CN)) w

➢ Compare the perceptron:

    For n=1..N:
      If y_n w·Φ(x_n) ≤ 0:  w ← w + y_n Φ(x_n)

➢ Implementation Note: Weight shrinkage is SLOW. Implement it lazily, at the cost of double storage.
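A minimal sketch of this stochastic pass (a Pegasos-style subgradient step; the eager shrinkage here is exactly the slow version the slide warns about, kept for clarity):

    def svm_sgd(data, C=1.0, epochs=10):
        """Stochastic subgradient optimization of the primal SVM on sparse dicts."""
        N = len(data)
        w = {}
        for _ in range(epochs):
            for phi, y in data:
                margin = y * sum(w.get(f, 0.0) * v for f, v in phi.items())
                if margin <= 1:                          # hinge loss is active
                    for f, v in phi.items():             # w <- w + y Phi(x)
                        w[f] = w.get(f, 0.0) + y * v
                shrink = 1.0 - 1.0 / (C * N)             # w <- (1 - 1/(CN)) w
                for f in w:
                    w[f] *= shrink
        return w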

From Perceptron to Structured Perceptron


Perceptron with multiple classes

➢ Store a separate weight vector for each class: w_1, w_2, ..., w_K
➢ For n=1..N:
  ➢ Predict:  ŷ = argmax_k w_k·Φ(x_n)
  ➢ If ŷ ≠ y_n:  w_ŷ ← w_ŷ − Φ(x_n) ;  w_{y_n} ← w_{y_n} + Φ(x_n)

[figure: the class whose w_k·Φ(x) is biggest wins]

➢ Why does this do the right thing?

Perceptron with multiple classes v2

➢ Originally: separate vectors w_1, w_2, w_3, and

    For n=1..N:
      Predict:  ŷ = argmax_k w_k·Φ(x_n)
      If ŷ ≠ y_n:  w_ŷ ← w_ŷ − Φ(x_n) ;  w_{y_n} ← w_{y_n} + Φ(x_n)

➢ Equivalently: a single w over joint features Φ(x, k), and

    For n=1..N:
      Predict:  ŷ = argmax_k w·Φ(x_n, k)
      If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)

Perceptron with multiple classes v2

➢ Joint features for x = "free money":

    Φ(x, 1):  spam_BIAS : 1, spam_free : 1, spam_money : 1, spam_the : 0, ...
    Φ(x, 2):  ham_BIAS : 1, ham_free : 1, ham_money : 1, ham_the : 0, ...

➢ Same algorithm as before:

    For n=1..N:
      Predict:  ŷ = argmax_k w·Φ(x_n, k)
      If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)

Features for structured prediction

➢ Allowed to encode anything you want. For "I can can a can" tagged Pro Md Vb Dt Nn, Φ(x, y) might include:

    Output features:  I_Pro : 1, can_Md : 1, can_Vb : 1, a_Dt : 1, can_Nn : 1, ...
    Markov features:  -Pro : 1, Pro-Md : 1, Md-Vb : 1, Vb-Dt : 1, Dt-Nn : 1, Nn- : 1, ...
    Other features:   has_verb : 1, has_nn_lft : 0, has_n_lft : 1, has_nn_rgt : 1, has_n_rgt : 1, ...

➢ Output features, Markov features, other features

Structured perceptron

➢ Enumeration over 1..K becomes enumeration over all outputs; the algorithm is unchanged:

    For n=1..N:
      Predict:  ŷ = argmax_k w·Φ(x_n, k)
      If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)

[Collins, EMNLP02]
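A minimal sketch of the loop, with the argmax left abstract so that any decoder (exact Viterbi, beam search, ...) can be plugged in; joint_features and decode are stand-ins, not a fixed API:

    def sparse_add(w, phi, scale):
        for f, v in phi.items():
            w[f] = w.get(f, 0.0) + scale * v

    def structured_perceptron(data, joint_features, decode, epochs=5):
        """data: list of (x, y_true); joint_features(x, y) -> dict Phi(x, y);
        decode(w, x) -> argmax_y w . Phi(x, y) (exact or approximate)."""
        w = {}
        for _ in range(epochs):
            for x, y_true in data:
                y_hat = decode(w, x)
                if y_hat != y_true:                                  # mistake-driven update
                    sparse_add(w, joint_features(x, y_hat), -1.0)    # w -= Phi(x, y_hat)
                    sparse_add(w, joint_features(x, y_true), +1.0)   # w += Phi(x, y_true)
        return w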

Argmax for sequences

➢ If we only have output and Markov features, we can use the Viterbi algorithm:

[figure: trellis over "I can can ...", with node scores such as w·[I_Pro], w·[can_Md], w·[can_Vb] and edge scores such as w·[Pro-Pro], w·[Pro-Md], w·[Pro-Vb]]

  (plus some work to account for boundary conditions)
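A minimal sketch of such a Viterbi decoder over output ("word_tag") and Markov ("prevtag-tag") weights, ignoring boundary handling as the slide notes (the feature-name conventions are mine):

    def viterbi(words, tags, w):
        """argmax over tag sequences of sum_i w[word_tag] + w[prevtag-tag]."""
        delta = {t: w.get(f"{words[0]}_{t}", 0.0) for t in tags}   # best score ending in t
        back = []                                                  # back-pointers per position
        for word in words[1:]:
            new_delta, pointers = {}, {}
            for t in tags:
                best = max(tags, key=lambda p: delta[p] + w.get(f"{p}-{t}", 0.0))
                new_delta[t] = (delta[best] + w.get(f"{best}-{t}", 0.0)
                                + w.get(f"{word}_{t}", 0.0))
                pointers[t] = best
            delta, back = new_delta, back + [pointers]
        seq = [max(tags, key=lambda t: delta[t])]                  # best final tag
        for pointers in reversed(back):                            # follow back-pointers
            seq.append(pointers[seq[-1]])
        return list(reversed(seq))

    w = {"I_Pro": 2.0, "can_Md": 1.0, "can_Vb": 1.0, "a_Dt": 2.0, "can_Nn": 1.0,
         "Pro-Md": 1.0, "Md-Vb": 1.0, "Vb-Dt": 1.0, "Dt-Nn": 1.0}
    print(viterbi("I can can a can".split(), ["Pro", "Md", "Vb", "Dt", "Nn"], w))
    # ['Pro', 'Md', 'Vb', 'Dt', 'Nn']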

Structured perceptron as ranking

➢ For n=1..N:
  ➢ Run Viterbi:  ŷ = argmax_k w·Φ(x_n, k)
  ➢ If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)

➢ When does this make an update?

[figure: "I can can a can" with the true tag sequence Pro Md Vb Dt Nn ranked against alternatives such as Pro Md Md Dt Vb, Pro Md Nn Dt Md, Pro Md Vb Dt Vb, ...]

From perceptron to margins

➢ Maximize margin / minimize errors (binary SVM, recap):

    min_{w,ξ}  ½‖w‖² + C Σ_n ξ_n
    s.t.  y_n w·Φ(x_n) + ξ_n ≥ 1 , ∀n        (each point is correctly classified, modulo ξ)

➢ Structured version (ranking):

    min_{w,ξ}  ½‖w‖² + C Σ_n ξ_n
    s.t.  w·Φ(x_n, y_n) − w·Φ(x_n, y) + ξ_n ≥ 1 , ∀n, ∀y ≠ y_n
          (response for truth beats response for any other output: each true output is more highly ranked, modulo ξ)

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

From perceptron to margins

[figure: joint feature vectors for true outputs, Φ(x_1, y_1) and Φ(x_2, y_2), separated by a margin from vectors for other outputs Φ(x_1, y), Φ(x_2, y)]

    min_{w,ξ}  ½‖w‖² + C Σ_n ξ_n
    s.t.  w·Φ(x_n, y_n) − w·Φ(x_n, y) + ξ_n ≥ 1 , ∀n, ∀y ≠ y_n

  (response for truth vs. response for other: each true output is more highly ranked, modulo ξ)

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

Ranking margins

➢ Some errors are worse than others...

[figure: for "I can can a can", the true sequence Pro Md Vb Dt Nn is ranked above every alternative (Pro Md Md Dt Vb, Pro Md Nn Dt Md, ...) by a margin of one]

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

Accounting for a loss function

➢ Some errors are worse than others...

[figure: the true sequence Pro Md Vb Dt Nn is ranked above each alternative y by a margin of l(y, y'), so alternatives with more wrong tags must be pushed further down than near-misses]

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

Accounting for a loss function

➢ The constraints

    ∀y:  w·Φ(x_n, y_n) − w·Φ(x_n, y) ≥ l(y_n, y)

  are equivalent to the single constraint

    w·Φ(x_n, y_n) − max_y [ w·Φ(x_n, y) + l(y_n, y) ] ≥ 0

  so we only ever need a loss-augmented argmax.

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

Augmented argmax for sequences

➢ Add the "loss" to each wrong node!

[figure: the Viterbi trellis over "I can can ...", with +1 added to the score of every node whose tag disagrees with the truth (e.g. w·[I_Md]+1, w·[can_Pro]+1) while the correct nodes (w·[I_Pro], w·[can_Md], w·[can_Vb]) are left unchanged]

➢ ?! What are we assuming here?

[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]

Stochastically optimizing Markov nets

➢ M3N objective + SOME MATH gives:

    For n=1..N:
      Augmented Viterbi:  ŷ = argmax_k [ w·Φ(x_n, k) + l(y_n, k) ]
      If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
      w ← (1 − 1/(CN)) w

➢ Compare the structured perceptron:

    For n=1..N:
      Viterbi:  ŷ = argmax_k w·Φ(x_n, k)
      If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)

[Ratliff+al, AIStats07]
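A minimal sketch of this subgradient loop; it differs from the structured-perceptron sketch above only in the loss-augmented decoder and the shrinkage step (decode_augmented is a stand-in for augmented Viterbi):

    def m3n_sgd(data, joint_features, decode_augmented, C=1.0, epochs=5):
        """decode_augmented(w, x, y_true) -> argmax_y [ w . Phi(x, y) + l(y_true, y) ]."""
        N = len(data)
        w = {}
        for _ in range(epochs):
            for x, y_true in data:
                y_hat = decode_augmented(w, x, y_true)     # augmented Viterbi
                if y_hat != y_true:
                    for f, v in joint_features(x, y_hat).items():
                        w[f] = w.get(f, 0.0) - v           # w -= Phi(x, y_hat)
                    for f, v in joint_features(x, y_true).items():
                        w[f] = w.get(f, 0.0) + v           # w += Phi(x, y_true)
                shrink = 1.0 - 1.0 / (C * N)               # w <- (1 - 1/(CN)) w
                for f in w:
                    w[f] *= shrink
        return w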

Stacking

➢ Structured models: accurate but slow
  (output labels are jointly connected to each other and to the input features)
➢ Independent models: less accurate but fast
  (each output label depends only on the input features)
➢ Stacking: multiple independent models
  (a second layer of output labels is predicted from the first layer's outputs plus the input features)

Training a stacked model

➢ Train independent classifier f1 on input features
➢ Train independent classifier f2 on input features + f1's output
➢ Danger: overfitting!
➢ Solution: cross-validation
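A minimal sketch of the cross-validation trick using scikit-learn (logistic regression, 5 folds, and numeric labels are my assumptions, not prescribed by the tutorial):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def train_stack(X, y):
        """f2 sees f1's *held-out* predictions, so it is not trained on overfit outputs."""
        f1 = LogisticRegression(max_iter=1000)
        p1 = cross_val_predict(f1, X, y, cv=5).reshape(-1, 1)   # out-of-fold predictions
        f1.fit(X, y)                                            # final f1 on all data
        f2 = LogisticRegression(max_iter=1000)
        f2.fit(np.hstack([X, p1]), y)                           # stacked classifier
        return f1, f2

    def predict_stack(f1, f2, X):
        p1 = f1.predict(X).reshape(-1, 1)
        return f2.predict(np.hstack([X, p1]))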

Do we really need structure?

➢ Structured models: accurate but slow
➢ Independent models: less accurate but fast
➢ Goal: transfer power to get fast + accurate
➢ Questions: are independent models...
  ➢ ... expressive enough? (approximation error)
  ➢ ... easy to learn? (estimation error)

[Liang+D+Klein, ICML08]

"Compiling" structure out

    Labeled Data → CRF(f1):  POS: 95.0%   NER: 75.3%
    Labeled Data → IRL(f1):  POS: 91.7%   NER: 69.1%

  f1 = words/prefixes/suffixes/forms

[Liang+D+Klein, ICML08]

"Compiling" structure out

    Labeled Data → CRF(f1):           POS: 95.0%   NER: 75.3%
    Labeled Data → IRL(f1):           POS: 91.7%   NER: 69.1%
    Auto-Labeled Data → IRL(f2):      POS: 94.4%   NER: 66.2%
    Auto-Labeled Data → CompIRL(f2):  POS: 95.0%   NER: 72.7%

  f1 = words/prefixes/suffixes/forms
  f2 = f1 applied to a larger window

[Liang+D+Klein, ICML08]

Decomposition of errors

➢ CRF(f1): p_C.  Sum of MI on edges: POS = .003 (95.0% → 95.0%), NER = .009 (76.3% → 76.0%)
➢ Train a truncated CRF: p_MC.  NER: 76.0% → 72.7%
➢ Train a marginalized CRF, IRL(f∞): p_A*.  NER: 76.0% → 76.0%
➢ IRL(f2): p_1*

(the gaps between stages are attributed to coherence, global information, and nonlinearities)

Theorem:  KL(p_C ‖ p_1*) = KL(p_C ‖ p_MC) + KL(p_MC ‖ p_A*) + KL(p_A* ‖ p_1*)

[Liang+D+Klein, ICML08]

Structure compilation results

[figure: three plots (part of speech, named entity, parsing) of accuracy vs. amount of auto-labeled data (log scale), comparing structured and independent models; POS accuracy roughly 93-96, NER roughly 66-78, parsing roughly 76-88]

[Liang+D+Klein, ICML08]

Coffee Break!!!


Learning to Search


Argmax is hard!

➢ Classic formulation of structured prediction:

    score(x, y) = something we learn, to make "good" (x, y) pairs score highly

➢ At test time:

    f(x) = argmax_{y ∈ Y} score(x, y)

➢ Combinatorial optimization problem
  ➢ Efficient only in very limiting cases
  ➢ Solved by heuristic search: beam + A* + local search

Argmax is hard!

➢ Word-ordering example [Soricut, PhD Thesis, USC 2007]:

  Order these words: bart better I madonna say than ,
    Best search (32.3):  I say better than bart madonna ,
    Original (41.6):     better bart than madonna , I say

  Another example:
    Best search (51.6):  and so could really be a neural apparently thought things as dissimilar firing two identical
    Original (64.3):     could two things so apparently dissimilar as a thought and neural firing really be identical

Incremental parsing, early 90s style

➢ Train a classifier to make decisions

[figure: building a parse of "I/Pro can/Md can/Vb a/Dt can/Nn" into S → NP VP via per-word decisions such as Left, Right, Up, Unary]

[Magerman, ACL95]

Incremental parsing, mid 2000s style

➢ Train a classifier to make decisions

[figure: the same parse of "I/Pro can/Md can/Vb a/Dt can/Nn", built incrementally]

[Collins+Roark, ACL04]

Learning to beam-search

➢ Structured perceptron, but with exact Viterbi replaced by beam search:

    For n=1..N:
      Beam search:  ŷ = argmax_k w·Φ(x_n, k)   (approximately)
      If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)

[Collins+Roark, ACL04]

Learning to beam-search

➢ For n=1..N:
  ➢ Run beam search until the truth falls out of the beam
  ➢ Update weights immediately!
  ➢ Restart at the truth

[Collins+Roark, ACL04; D+Marcu, ICML05; Xu+al, JMLR09]

Incremental parsing results

[figure: incremental parsing accuracy results from Collins+Roark, ACL04]

Generic Search Formulation

➢ Search Problem:
  ➢ Search space
  ➢ Operators
  ➢ Goal-test function
  ➢ Path-cost function
➢ Search Variable:
  ➢ Enqueue function
➢ Varying the Enqueue function can give us DFS, BFS, beam search, A* search, etc...

    nodes := MakeQueue(S0)
    while nodes is not empty
      node := RemoveFront(nodes)
      if node is a goal state return node
      next := Operators(node)
      nodes := Enqueue(nodes, next)
    fail

[D+Marcu, ICML05; Xu+al, JMLR09]

Online Learning Framework (LaSO)

    nodes := MakeQueue(S0)
    while nodes is not empty
      node := RemoveFront(nodes)
      if none of {node} ∪ nodes is y-good, or node is a goal & not y-good    ← if we erred...
        sibs := siblings(node, y)                                            ← where should we have gone?
        w := update(w, x, sibs, {node} ∪ nodes)                              ← update our weights based on the good and the bad choices
        nodes := MakeQueue(sibs)                                             ← continue search...
      else
        if node is a goal state return w
        next := Operators(node)
        nodes := Enqueue(nodes, next)

➢ Monotonicity: for any node, we can tell if it can lead to the correct solution or not

[D+Marcu, ICML05; Xu+al, JMLR09]
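A minimal, simplified sketch of this early-update/restart idea for a beam-search tagger: when the true prefix falls out of the beam, update toward it and restart the search from it. All helpers here are illustrative stand-ins rather than the LaSO implementation.

    def prefix_features(words, tags):
        """Joint features of a tag prefix: word/tag and tag-bigram indicators."""
        phi, prev = {}, "<s>"
        for word, tag in zip(words, tags):
            for f in (f"{word}_{tag}", f"{prev}-{tag}"):
                phi[f] = phi.get(f, 0) + 1
            prev = tag
        return phi

    def score(w, phi):
        return sum(w.get(f, 0.0) * v for f, v in phi.items())

    def laso_train(data, tagset, beam_size=5, epochs=5):
        """data: list of (words, true_tags)."""
        w = {}
        for _ in range(epochs):
            for words, truth in data:
                beam = [()]                                   # partial tag sequences
                for i in range(len(words)):
                    cands = [p + (t,) for p in beam for t in tagset]
                    cands.sort(key=lambda p: score(w, prefix_features(words, p)),
                               reverse=True)
                    beam = cands[:beam_size]
                    good = tuple(truth[:i + 1])
                    if good not in beam:                      # the truth fell out of the beam
                        for f, v in prefix_features(words, good).items():
                            w[f] = w.get(f, 0.0) + v          # promote the true prefix
                        for f, v in prefix_features(words, beam[0]).items():
                            w[f] = w.get(f, 0.0) - v          # demote the current best
                        beam = [good]                         # restart search at the truth
                if beam[0] != tuple(truth):                   # final perceptron-style update
                    for f, v in prefix_features(words, tuple(truth)).items():
                        w[f] = w.get(f, 0.0) + v
                    for f, v in prefix_features(words, beam[0]).items():
                        w[f] = w.get(f, 0.0) - v
        return w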

Search-based Margin

➢ The margin is the amount by which we are correct: the good nodes' scores u^T Φ(x, g_1), u^T Φ(x, g_2), ... sit above the bad nodes' scores u^T Φ(x, b_1), u^T Φ(x, b_2), ...
➢ Note that the margin, and hence linear separability, is also a function of the search algorithm!

[D+Marcu, ICML05; Xu+al, JMLR09]

Syntactic chunking results

[figure: F-score vs. training time (minutes), with systems ranging from 4 to 33 minutes; comparisons include [Collins 2002] (22 min) and [Zhang+Damerau+Johnson 2002] (timing unknown)]

[D+Marcu, ICML05; Xu+al, JMLR09]

Tagging+chunking results

[figure: joint tagging/chunking accuracy vs. training time (log scale); reported times range from about 1 minute to 23 minutes, compared against [Sutton+McCallum 2004]]

[D+Marcu, ICML05; Xu+al, JMLR09]

Variations on a beam

➢ Observation: we needn't use the same beam size for training and decoding
➢ Varying these values independently yields (columns = training beam, rows = decoding beam):

                    Training beam
                  1      5      10     25     50
    Decode   1   93.9   90.5   89.5   88.7   88.4
             5   92.8   94.3   94.3   94.2   94.2
            10   91.9   94.4   94.4   94.5   94.4
            25   91.3   94.1   94.2   94.3   94.2
            50   90.9   94.1   94.2   94.3   94.4

[D+Marcu, ICML05; Xu+al, JMLR09]

What if our model sucks?

➢ Sometimes our model cannot produce the "correct" output
  ➢ Canonical example: machine translation

[figure: the reference may lie outside the model's outputs (an N-best list, "optimal decoding", or ...); a "bold" update moves the current hypothesis toward the reference itself, a "local" update moves it toward the best achievable output among the model's outputs]

[Och, ACL03; Liang+al, ACL06]

Local versus bold updating...

[figure: machine translation performance (BLEU, roughly 32.5-35.5) for bold updating, local updating, and Pharaoh, under monotonic and distortion decoding]

Refresher on Markov Decision Processes


Reinforcement learning

➢ Basic idea:
  ➢ Receive feedback in the form of rewards
  ➢ Agent's utility is defined by the reward function
  ➢ Must learn to act to maximize expected rewards
  ➢ Change the rewards, change the learned behavior
➢ Examples:
  ➢ Playing a game, reward at the end for outcome
  ➢ Vacuuming, reward for each piece of dirt picked up
  ➢ Driving a taxi, reward for each passenger delivered

Markov decision processes

➢ What are the values (expected future rewards) of states and actions?

[figure: a state s with three available actions; Q*(s, a1) = 30, Q*(s, a2) = 23, Q*(s, a3) = 17, hence V*(s) = 30; each action leads stochastically (e.g. with probabilities 0.6/0.4, 1.0, 0.1/0.9) to successor states]

Markov Decision Processes

➢ An MDP is defined by:
  ➢ A set of states s ∈ S
  ➢ A set of actions a ∈ A
  ➢ A transition function T(s, a, s')
    ➢ Prob that a from s leads to s', i.e., P(s' | s, a)
    ➢ Also called the model
  ➢ A reward function R(s, a, s')
    ➢ Sometimes just R(s) or R(s')
  ➢ A start state (or distribution)
  ➢ Maybe a terminal state
➢ MDPs are a family of nondeterministic search problems
➢ Total utility is one of:  Σ_t r_t   or   Σ_t γ^t r_t

Solving MDPs

➢ In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
➢ In an MDP, we want an optimal policy π(s)
  ➢ A policy gives an action for each state
  ➢ An optimal policy maximizes expected utility if followed
  ➢ Defines a reflex agent

[figure: gridworld optimal policy when R(s, a, s') = -0.04 for all nonterminals s]

Example Optimal Policies

[figure: gridworld optimal policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]

Optimal Utilities

➢ Fundamental operation: compute the optimal utilities of states s (all at once)
➢ Why? Optimal values define optimal policies!
➢ Define the utility of a state s:
    V*(s) = expected return starting in s and acting optimally
➢ Define the utility of a q-state (s, a):
    Q*(s, a) = expected return starting in s, taking action a and thereafter acting optimally
➢ Define the optimal policy:
    π*(s) = optimal action from state s

[figure: lookahead tree s → a → (s, a) → s']

The Bellman Equations

➢ Definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy
➢ Formally:

    V*(s) = max_a Q*(s, a)
    Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Solving MDPs / memoized recursion

➢ Recurrences (the Bellman equations above):

    V*(s) = max_a Q*(s, a)
    Q*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]

➢ Cache all function call results so you never repeat work
➢ What happened to the evaluation function?

Q-Value Iteration

➢ Value iteration: iterate approximate optimal values
  ➢ Start with V_0*(s) = 0, which we know is right (why?)
  ➢ Given V_i*, calculate the values for all states for depth i+1:

      V_{i+1}*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_i*(s') ]

➢ But Q-values are more useful!
  ➢ Start with Q_0*(s, a) = 0, which we know is right (why?)
  ➢ Given Q_i*, calculate the q-values for all q-states for depth i+1:

      Q_{i+1}*(s, a) = Σ_{s'} T(s, a, s') [ R(s, a, s') + γ max_{a'} Q_i*(s', a') ]
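A minimal sketch of tabular Q-value iteration (the two-state toy MDP below is made up for illustration):

    def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
        """T[(s, a)] = list of (prob, s2); R[(s, a, s2)] = reward.
        Q_{i+1}(s,a) = sum_{s'} T(s,a,s') [ R(s,a,s') + gamma max_{a'} Q_i(s',a') ]."""
        Q = {(s, a): 0.0 for s in states for a in actions}        # Q_0 = 0
        for _ in range(iters):
            Q = {(s, a): sum(p * (R[(s, a, s2)] +
                                  gamma * max(Q[(s2, a2)] for a2 in actions))
                             for p, s2 in T[(s, a)])
                 for s in states for a in actions}
        return Q

    states, actions = ["A", "B"], ["stay", "go"]
    T = {("A", "stay"): [(1.0, "A")], ("A", "go"): [(1.0, "B")],
         ("B", "stay"): [(1.0, "B")], ("B", "go"): [(1.0, "A")]}
    R = {("A", "stay", "A"): 0.0, ("A", "go", "B"): 1.0,
         ("B", "stay", "B"): 2.0, ("B", "go", "A"): 0.0}
    Q = q_value_iteration(states, actions, T, R)
    print(max(Q[("A", a)] for a in actions))                      # approx V*(A)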

RL = Unknown MDPs

➢ If we knew the MDP (i.e., the reward function and transition function):
  ➢ Value iteration leads to optimal values
  ➢ Will always converge to the truth
➢ Reinforcement learning is what we do when we do not know the MDP
  ➢ All we observe is a trajectory (s1, a1, r1, s2, a2, r2, s3, a3, r3, …)
➢ Many algorithms exist for this problem; see Sutton+Barto's excellent book!

Q-Learning

➢ Learn Q*(s, a) values
  ➢ Receive a sample (s, a, s', r)
  ➢ Consider your old estimate: Q(s, a)
  ➢ Consider your new sample estimate:  sample = r + γ max_{a'} Q(s', a')
  ➢ Incorporate the new estimate into a running average:

      Q(s, a) ← (1 − α) Q(s, a) + α · sample
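A minimal sketch of tabular Q-learning with ε-greedy action selection (the environment interface env.reset() / env.step(s, a) and the hyperparameters are illustrative assumptions):

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
        """env.reset() -> s; env.step(s, a) -> (s2, r, done)."""
        Q = defaultdict(float)                                   # Q[(s, a)], default 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < eps:                        # explore
                    a = random.choice(actions)
                else:                                            # exploit current estimates
                    a = max(actions, key=lambda a2: Q[(s, a2)])
                s2, r, done = env.step(s, a)
                sample = r + (0.0 if done else gamma * max(Q[(s2, a2)] for a2 in actions))
                Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample   # running average
                s = s2
        return Q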

Exploration / Exploitation

➢ Several schemes for forcing exploration
➢ Simplest: random actions (ε-greedy)
  ➢ Every time step, flip a coin
  ➢ With probability ε, act randomly
  ➢ With probability 1-ε, act according to the current policy
➢ Problems with random actions?
  ➢ You do explore the space, but keep thrashing around once learning is done
  ➢ One solution: lower ε over time
  ➢ Another solution: exploration functions

Q-Learning

➢ In realistic situations, we cannot possibly learn about every single state!
  ➢ Too many states to visit them all in training
  ➢ Too many states to hold the q-tables in memory
➢ Instead, we want to generalize:
  ➢ Learn about some small number of training states from experience
  ➢ Generalize that experience to new, similar states, e.g. with a linear model Q(s, a) = w·f(s, a)
➢ Very simple stochastic updates:

      w_i ← w_i + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ] f_i(s, a)

Inverse Reinforcement Learning (aka Inverse Optimal Control)


Inverse RL: Task

➢ Given:
  ➢ measurements of an agent's behavior over time, in a variety of circumstances
  ➢ if needed, measurements of the sensory inputs to that agent
  ➢ if available, a model of the environment
➢ Determine: the reward function being optimized
➢ Proposed by [Kalman68]
➢ First solution by [Boyd94]

Why inverse RL?

➢ Computational models for animal learning
  ➢ "In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation."
➢ Agent construction
  ➢ "An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior."
  ➢ e.g., "driving well"
➢ Multi-agent systems and mechanism design
  ➢ learning opponents' reward functions that guide their actions to devise strategies against them

IRL from Sample Trajectories

➢ Warning: need to be careful to avoid trivial solutions!
➢ Optimal policy available through sampled trajectories (e.g., driving a car)
➢ Want to find the reward function that makes this policy look as good as possible
➢ Write R_w(s) = w·Φ(s) so the reward is linear, and let V^π_w(s_0) be the value of the starting state under policy π:

    max_w  Σ_{k=1}^{K}  f( V^{π*}_w(s_0) − V^{π_k}_w(s_0) )

  i.e., how good does the "optimal policy" π* look, versus how good does some other policy π_k look?

[Ng+Russell, ICML00]

Apprenticeship Learning via IRL

➢ For t = 1, 2, …
  ➢ Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {π_i}
  ➢ RL step: compute the optimal policy π_t for the estimated reward w

[Abbeel+Ng, ICML04]

Car Driving Experiment

➢ No explicit reward function at all!
➢ Expert demonstrates proper policy via 2 min. of driving time on simulator (1200 data points)
➢ 5 different "driver types" tried
➢ Features: which lane the car is in, distance to closest car in current lane
➢ Algorithm run for 30 iterations, policy hand-picked
➢ Movie Time! (Expert left, IRL right)

[Abbeel+Ng, ICML04]

"Nice" driver  [video]

"Evil" driver  [video]

Maxent IRL

➢ Distribution over trajectories: P(ζ)
➢ Match the reward of observed behavior:  Σ_ζ P(ζ) f_ζ = f_dem
➢ Maximize the causal entropy over trajectories given stochastic outcomes ("as uniform as possible"):

    max H(P(ζ) ‖ O)

  (condition on random uncontrolled outcomes, but only after they happen)

[Ziebart+al, AAAI08]

Data collection

➢ 25 taxi drivers, over 100,000 miles
➢ Road features: length, speed, road type, lanes, accidents, construction, congestion, time of day

[Ziebart+al, AAAI08]

Predicting destinations...

[figure: destination prediction demo]

Planning as structured prediction

[figure: planning examples]

[Ratliff+al, NIPS05]

Maximum margin planning

➢ Let μ(s, a) denote the probability of reaching q-state (s, a) under the current model w
➢ Informally:  max_w margin  s.t. the planner run with w yields the human output
➢ Formally:

    min_w  ½‖w‖²
    s.t.  Σ_{s,a} μ_n(s, a) w·Φ(x_n, s, a)  −  Σ_{s,a} μ̂(s, a) w·Φ(x_n, s, a)  ≥ 1 ,  ∀n

  where μ_n is the q-state visitation frequency of the human and μ̂ that of the planner, summed over all trajectories and all q-states.

[Ratliff+al, NIPS05]

Optimizing MMP

➢ MMP objective + SOME MATH gives:

    For n=1..N:
      Augmented planning: run A* on the current (augmented) cost map to get q-state visitation frequencies μ̂(s, a)
      Update:  w ← w + Σ_s Σ_a [ μ_n(s, a) − μ̂(s, a) ] Φ(x_n, s, a)
      Shrink:  w ← (1 − 1/(CN)) w

[Ratliff+al, NIPS05]
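A minimal sketch of one such subgradient step, with the planner abstracted away (plan_augmented, features, and the visitation-frequency dictionaries are stand-ins; the real system runs A* over a cost map):

    def mmp_step(w, x, mu_human, plan_augmented, features, C, N):
        """mu_human[(s, a)]: expert q-state visitation frequencies for example x.
        plan_augmented(w, x, mu_human): planner's visitation frequencies on the
        loss-augmented cost map."""
        mu_plan = plan_augmented(w, x, mu_human)
        # w <- w + sum_{s,a} [ mu_human(s,a) - mu_plan(s,a) ] Phi(x, s, a)
        for (s, a) in set(mu_human) | set(mu_plan):
            diff = mu_human.get((s, a), 0.0) - mu_plan.get((s, a), 0.0)
            for f, v in features(x, s, a).items():
                w[f] = w.get(f, 0.0) + diff * v
        shrink = 1.0 - 1.0 / (C * N)                  # w <- (1 - 1/(CN)) w
        for f in w:
            w[f] *= shrink
        return w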



Maximum margin planning movies

[videos]

[Ratliff+al, NIPS05]

Parsing via inverse optimal control

➢ State space = all partial parse trees over the full sentence labeled "S"
➢ Actions: take a partial parse and split it anywhere in the middle
➢ Transitions: obvious
➢ Terminal states: when there are no actions left
➢ Reward: parse score at completion

[Neu+Szepesvari, MLJ09]

Parsing via inverse optimal control

[figure: parsing accuracy (roughly 60-90) on small, medium, and large training sets for maximum likelihood, projection, perceptron, maximum margin, maximum entropy, policy matching, and apprenticeship learning]

[Neu+Szepesvari, MLJ09]

Learning by Demonstration


Integrating search and learning

  Input:  Le homme mange l' croissant.
  Output: The man ate a croissant.

[figure: a classifier 'h' repeatedly expands the current hypothesis, e.g. "The man ate" → "The man ate a" → {"The man ate a croissant", "The man ate a fox", "The man ate happy", ...}, each state tracking both the hypothesis and the coverage of the source sentence]

[D+Marcu, ICML05; D+Langford+Marcu, MLJ09]

Reducing search to classification

➢ Natural chicken and egg problem:
  ➢ Want h to get low expected future loss
  ➢ … on future decisions made by h
  ➢ … and starting from states visited by h
➢ Iterative solution:  h(t-1) → h(t)

[figure: competing hypotheses for "Le homme mange l' croissant." with completion losses such as Loss = 0, Loss = 1.8, Loss = 1.2]

[D+Langford+Marcu, MLJ09]

Reduction for Structured Prediction

➢ Idea: view structured prediction in light of search

[figure: a left-to-right lattice of tags {D, N, A, R, V} over the sentence "I sing merrily"]

  Loss function:
    L([N V R], [N V R]) = 0
    L([N V R], [N V V]) = 1/3
    ...

➢ Can we formalize this idea?
➢ Each step here looks like it could be represented as a weighted multi-class problem.

[D+Langford+Marcu, MLJ09]

Reducing Structured Prediction

[figure: the same tag lattice over "I sing merrily"]

➢ Key Assumption: an Optimal Policy for the training data
  ➢ Given: input, true output and state; Return: best successor state
  ➢ Weak!
➢ Desired: a good policy on test data (i.e., given only the input string)

[D+Langford+Marcu, MLJ09]

How to Learn in Search

➢ Idea: train based only on the optimal path (a la MEMM)
➢ Better Idea: train based only on the optimal policy,
  then train based on the optimal policy + a little learned policy,
  then train based on the optimal policy + a little more learned policy,
  then ... eventually only use the learned policy

[D+Langford+Marcu, MLJ09]

How to Learn in Search

[figure: candidate successor states over "I sing merrily" with costs c=1, c=1, c=2, c=1, c=0]

➢ Translating D_SP into Searn(D_SP, loss, π):
  ➢ Draw x ~ D_SP
  ➢ Run π on x, to get a path
  ➢ Pick a position uniformly on the path
  ➢ Generate an example with costs given by the expected (wrt π) completion costs for "loss"

[D+Langford+Marcu, MLJ09]

Searn

➢ Ingredients for Searn:
  ➢ Input space (X) and output space (Y), data from X
  ➢ Loss function (loss(y, y')) and features
  ➢ "Optimal" policy π*(x, y0)

Algorithm: Searn-Learn(A, D_SP, loss, π*, β)
  1: Initialize: π = π*
  2: while not converged do
  3:   Sample: D ~ Searn(D_SP, loss, π)
  4:   Learn: h ← A(D)
  5:   Update: π ← (1-β) π + β h
  6: end while
  7: return π without π*

Theorem:  L(π) ≤ L(π*) + 2 avg(loss) T ln T + c (1 + ln T) / T

[D+Langford+Marcu, MLJ09]
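A minimal sketch of this loop in Python, with the example-generation step (the role of Searn(D_SP, loss, π)) and the base learner A left abstract; the stochastic policy mixture is represented explicitly, and all helper names are stand-ins:

    import random

    def searn(data, optimal_policy, generate_examples, learn, beta=0.1, iters=10):
        """generate_examples(data, pi) plays the role of Searn(D_SP, loss, pi);
        learn(D) is the base cost-sensitive learner A."""
        policies, weights = [optimal_policy], [1.0]

        def mixture(state):                              # pi: sample a component, follow it
            (p,) = random.choices(policies, weights=weights)
            return p(state)

        for _ in range(iters):
            D = generate_examples(data, mixture)         # cost-sensitive examples under pi
            h = learn(D)                                 # h <- A(D)
            weights = [x * (1 - beta) for x in weights] + [beta]   # pi <- (1-beta) pi + beta h
            policies = policies + [h]
        # "return pi without pi*": keep only the learned classifiers, renormalized
        learned, lw = policies[1:], [x / sum(weights[1:]) for x in weights[1:]]
        return lambda state: random.choices(learned, weights=lw)[0](state)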

But what about demonstrations?

➢ What did we assume before?
  ➢ Key Assumption: an Optimal Policy for the training data
  ➢ Given: input, true output and state; Return: best successor state
➢ We can have a human (or system) demonstrate, thus giving us an optimal policy

3d racing game (TuxKart)

➢ Input: the game screen, resized to 25x19 pixels (1425 features)
➢ Output: steering in [-1, 1]

[Ross+Gordon+Bagnell, AIStats11]

DAgger: Dataset Aggregation

➢ Collect trajectories from the expert π*
➢ Dataset D0 = { ( s, π*(s) ) | s ~ π* }
➢ Train π1 on D0
➢ Collect new trajectories from π1
  ➢ But let the expert steer!
➢ Dataset D1 = { ( s, π*(s) ) | s ~ π1 }
➢ Train π2 on D0 ∪ D1
➢ In general:
  ➢ Dn = { ( s, π*(s) ) | s ~ πn }
  ➢ Train πn+1 on ∪_i Di
➢ If N = T log T, then L(πn) ≤ T ε_N + O(1) for some n
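A minimal sketch of the aggregation loop: roll out the current policy, label every visited state with the expert's action, add to the dataset, retrain (expert, learn, and rollout are stand-ins for the expert policy, the base learner, and trajectory collection):

    def dagger(expert, learn, rollout, iters=10):
        """expert(s) -> action; learn(D) -> policy; rollout(policy) -> visited states.
        Returns the last policy (in practice, pick the best of pi_1..pi_N on validation)."""
        D = [(s, expert(s)) for s in rollout(expert)]      # D0: states visited by the expert
        policy = learn(D)                                  # pi_1
        for _ in range(iters - 1):
            # new trajectories from the current policy, but the expert provides the labels
            D += [(s, expert(s)) for s in rollout(policy)]
            policy = learn(D)                              # train pi_{n+1} on the union of all D_i
        return policy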