Beyond Structured Prediction: Inverse Reinforcement Learning
Hal Daumé III
Computer Science, University of Maryland
[email protected]

A Tutorial at ACL 2011, Portland, Oregon, Sunday, 19 June 2011

Acknowledgements
Some slides: Stuart Russell, Dan Klein, J. Drew Bagnell, Nathan Ratliff, Stephane Ross
Discussions/Feedback: MLRG Spring 2010
Examples of structured problems
Examples of demonstrations
NLP as transduction

Task: Machine Translation
    Input: Ces deux principes se tiennent à la croisée de la philosophie, de la politique, de l’économie, de la sociologie et du droit.
    Output: Both principles lie at the crossroads of philosophy, politics, economics, sociology, and law.

Task: Document Summarization
    Input: Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.
    Output: The Falkland Islands war, in 1982, was fought between Britain and Argentina.

Task: Syntactic Analysis
    Input: The man ate a big sandwich.
    Output: [parse tree of "The man ate a big sandwich"]

...many more...
Structured prediction 101
Learn a function mapping inputs to complex outputs:
    f : X → Y    (decoding maps the input space X to the output space Y)
[Figure: example input/output pairs for sequence labeling ("I can can a can" → Pro Md Vb Dt Nn), parsing, machine translation ("Mary no dio una bofetada a la bruja verde ." → "Mary did not slap the green witch ."), and coreference resolution]
Why is structure important?
➢ Correlations among outputs
    ➢ Determiners often precede nouns
    ➢ Sentences usually have verbs
➢ Global coherence
    ➢ Translations should have good sequences of words
    ➢ Summaries should be coherent
    ➢ It just doesn't make sense to have three determiners next to each other
➢ My objective (aka “loss function”) forces it
Outline: Part I
➢ What is Structured Prediction?
➢ Refresher on Binary Classification
    ➢ What does it mean to learn?
    ➢ Linear models for classification
    ➢ Batch versus stochastic optimization
➢ From Perceptron to Structured Perceptron
    ➢ Linear models for Structured Prediction
    ➢ The “argmax” problem
    ➢ From Perceptron to margins
➢ Structure without Structure
    ➢ Stacking
    ➢ Structure compilation
Outline: Part II
➢ Learning to Search
    ➢ Incremental parsing
    ➢ Learning to queue
➢ Refresher on Markov Decision Processes
➢ Inverse Reinforcement Learning
    ➢ Determining rewards given policies
    ➢ Maximum margin planning
➢ Learning by Demonstration
    ➢ Searn
    ➢ DAgger
➢ Discussion
Refresher on Binary Classification
What does it mean to learn?
➢ Informally:
    ➢ to predict the future based on the past
➢ Slightly-less-informally:
    ➢ to take labeled examples and construct a function that will label them as a human would
➢ Formally:
    ➢ Given:
        ➢ A fixed unknown distribution D over X×Y
        ➢ A loss function over Y×Y
        ➢ A finite sample of (x,y) pairs drawn i.i.d. from D
    ➢ Construct a function f that has low expected loss with respect to D
Feature extractors
➢ A feature extractor Φ maps examples to vectors

    "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"

    Φ →  W=dear: 1, W=sir: 1, W=this: 2, ..., W=wish: 0, ...,
         MISSPELLED: 2, NAMELESS: 1, ALL_CAPS: 0, NUM_URLS: 0, ...

➢ Feature vectors in NLP are frequently sparse
Linear models for binary classification
➢ Decision boundary is the set of “uncertain” points
➢ Linear decision boundaries are characterized by weight vectors
➢ Example: x = “free money”
    Φ(x):  BIAS: 1, free: 1, money: 1, the: 0, ...
    w:     BIAS: -3, free: 4, money: 2, the: 0, ...
    Prediction: the sign of Σ_i w_i Φ_i(x)
The perceptron
➢ Inputs = feature values Φ1, Φ2, Φ3, ...
➢ Params = weights w1, w2, w3, ...
➢ Sum is the response: Σ_i w_i Φ_i, fed through "> 0?"
➢ If the response is:
    ➢ Positive, output +1
    ➢ Negative, output -1
➢ When training, update on errors:
    w ← w + y x
➢ "Error" when:
    y (w · x) ≤ 0
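A small runnable sketch of the perceptron rule over sparse feature dicts (the data format is this sketch's assumption: (Φ(x), y) pairs with y ∈ {-1, +1}).

```python
def response(w, phi):
    """w · Φ(x) for sparse dict features."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def perceptron_train(data, epochs=10):
    """data: list of (phi, y) pairs, phi a dict, y in {-1, +1}."""
    w = {}
    for _ in range(epochs):
        for phi, y in data:
            if y * response(w, phi) <= 0:        # "error": wrong side of (or on) the boundary
                for f, v in phi.items():         # w <- w + y x
                    w[f] = w.get(f, 0.0) + y * v
    return w

w = perceptron_train([({"BIAS": 1, "free": 1, "money": 1}, +1),
                      ({"BIAS": 1, "the": 1, "meeting": 1}, -1)])
```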
Why does that update work?
➢ When y (w_old · x) ≤ 0, update w_new = w_old + y x. Then:
    y (w_new · x) = y (w_old + y x) · x
                  = y (w_old · x) + y y (x · x)
                  = y (w_old · x) + ‖x‖²
  so the response on this example moves in the right direction.
Support vector machines
➢ Explicitly optimize the margin
➢ Enforce that all training points are correctly classified
    max_w margin(w)   s.t. all points are correctly classified
    max_w margin(w)   s.t. y_n (w · x_n) ≥ 1, ∀n
    min_w ‖w‖²        s.t. y_n (w · x_n) ≥ 1, ∀n
Support vector machines with slack
➢ Explicitly optimize the margin
➢ Allow some “noisy” points to be misclassified
    min_{w,ξ}  ½‖w‖² + C Σ_n ξ_n
    s.t.       y_n (w · x_n) + ξ_n ≥ 1, ∀n
               ξ_n ≥ 0, ∀n
Batch versus stochastic optimization
➢ Batch = read in all the data, then process it; e.g. the SVM:
    min_{w,ξ} ½‖w‖² + C Σ_n ξ_n
    s.t. y_n (w · x_n) + ξ_n ≥ 1, ∀n;   ξ_n ≥ 0, ∀n
➢ Stochastic = (roughly) process a bit at a time; e.g. the perceptron:
    For n = 1..N:
        If y_n (w · x_n) ≤ 0:  w ← w + y_n x_n
Stochastically optimized SVMs
➢ SVM objective + some math gives a stochastic update:
    For n = 1..N:
        If y_n (w · x_n) ≤ 1:  w ← w + y_n x_n
        w ← (1 − 1/(CN)) w
➢ Compare the perceptron:
    For n = 1..N:
        If y_n (w · x_n) ≤ 0:  w ← w + y_n x_n
➢ Implementation Note: Weight shrinkage is SLOW. Implement it lazily, at the cost of double storage.
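A sketch of the stochastic SVM update with the shrinkage done lazily through a single scale factor, so the full weight vector is never touched; C and the number of epochs are illustrative.

```python
def stochastic_svm(data, C=1.0, epochs=10):
    """data: list of (phi, y), phi a sparse dict, y in {-1, +1}."""
    N = len(data)
    w, scale = {}, 1.0                          # the "real" weights are scale * w
    for _ in range(epochs):
        for phi, y in data:
            margin = scale * sum(w.get(f, 0.0) * v for f, v in phi.items())
            if y * margin <= 1:                 # margin violation, not just a mistake
                for f, v in phi.items():        # w <- w + y x, undoing the scale
                    w[f] = w.get(f, 0.0) + y * v / scale
            scale *= 1.0 - 1.0 / (C * N)        # lazy shrink: w <- (1 - 1/(CN)) w
    return {f: scale * v for f, v in w.items()}
```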
From Perceptron to Structured Perceptron
Perceptron with multiple classes
➢ Store a separate weight vector for each class: w1, w2, ..., wK
➢ Predict the class whose response w_k · Φ(x) is biggest
➢ For n = 1..N:
    ➢ Predict: ŷ = argmax_k w_k · Φ(x_n)
    ➢ If ŷ ≠ y_n:
        w_{y_n} ← w_{y_n} + Φ(x_n)
        w_{ŷ} ← w_{ŷ} − Φ(x_n)
➢ Why does this do the right thing?
Perceptron with multiple classes v2
➢ Originally: separate vectors w1, w2, w3 and per-class features Φ(x)
    For n = 1..N:
        Predict: ŷ = argmax_k w_k · Φ(x_n)
        If ŷ ≠ y_n:  w_{y_n} ← w_{y_n} + Φ(x_n);  w_{ŷ} ← w_{ŷ} − Φ(x_n)
➢ v2: a single vector w and joint features Φ(x, k)
    For n = 1..N:
        Predict: ŷ = argmax_k w · Φ(x_n, k)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
Perceptron with multiple classes v2 (example)
➢ For x = “free money”, the joint features Φ(x, 1) and Φ(x, 2) are class-prefixed copies of Φ(x):
    Φ(x, spam):  spam_BIAS: 1, spam_free: 1, spam_money: 1, spam_the: 0, ...
    Φ(x, ham):   ham_BIAS: 1, ham_free: 1, ham_money: 1, ham_the: 0, ...
➢ For n = 1..N:
    ➢ Predict: ŷ = argmax_k w · Φ(x_n, k)
    ➢ If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
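A sketch of the joint feature map Φ(x, k): prefix each input feature with the candidate class so a single weight vector covers all classes; the class names here are illustrative.

```python
def joint_features(phi, k):
    """Phi(x, k): copy each input feature under a class-specific name."""
    return {f"{k}_{name}": value for name, value in phi.items()}

def predict_class(w, phi, classes):
    """argmax_k w . Phi(x, k)."""
    def score(k):
        return sum(w.get(f, 0.0) * v for f, v in joint_features(phi, k).items())
    return max(classes, key=score)

print(joint_features({"BIAS": 1, "free": 1, "money": 1}, "spam"))
# {'spam_BIAS': 1, 'spam_free': 1, 'spam_money': 1}
```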
Features for structured prediction
➢ Allowed to encode anything you want
➢ Example: x = "I can can a can", y = Pro Md Vb Dt Nn

    Φ(x, y) =
    Output features:  I_Pro: 1, can_Md: 1, can_Vb: 1, a_Dt: 1, can_Nn: 1, ...
    Markov features:  -Pro: 1, Pro-Md: 1, Md-Vb: 1, Vb-Dt: 1, Dt-Nn: 1, Nn-: 1, ...
    Other features:   has_verb: 1, has_nn_lft: 0, has_n_lft: 1, has_nn_rgt: 1, has_n_rgt: 1, ...

➢ Output features, Markov features, other features
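A sketch of Φ(x, y) for a tag sequence, with output (word-tag) and Markov (tag-tag) features; writing the boundary tags as <s> and </s> is this sketch's choice, not the slide's.

```python
from collections import Counter

def structured_features(words, tags):
    """Phi(x, y): output features (word_tag) plus Markov features (tag-tag)."""
    feats = Counter()
    for word, tag in zip(words, tags):
        feats[f"{word}_{tag}"] += 1              # output features, e.g. can_Md
    padded = ["<s>"] + list(tags) + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        feats[f"{prev}-{cur}"] += 1              # Markov features, e.g. Pro-Md
    return feats

print(structured_features("I can can a can".split(), ["Pro", "Md", "Vb", "Dt", "Nn"]))
```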
Structured perceptron
➢ Multiclass perceptron (enumeration over 1..K):
    For n = 1..N:
        Predict: ŷ = argmax_{k ∈ 1..K} w · Φ(x_n, k)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
➢ Structured perceptron (enumeration over all outputs):
    For n = 1..N:
        Predict: ŷ = argmax_{y ∈ Y} w · Φ(x_n, y)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
[Collins, EMNLP02]
Argmax for sequences
➢ If we only have output and Markov features, we can use the Viterbi algorithm:
    [Figure: a trellis over "I can can" with one column per word and one node per tag; node scores are output-feature weights (w·[I_Pro], w·[can_Md], w·[can_Vb], ...) and edge scores are Markov-feature weights (w·[Pro-Pro], w·[Pro-Md], w·[Pro-Vb], ...)]
    (plus some work to account for boundary conditions)
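A compact Viterbi sketch over the feature names used above (word_tag node scores, tag-tag edge scores, <s>/</s> boundaries); it returns the argmax tag sequence for a given weight dict w.

```python
def viterbi(words, tags, w):
    """argmax_y w . Phi(x, y) when Phi has only output and Markov features."""
    # best[i][t] = (score of the best prefix ending with tag t at position i, backpointer)
    best = [{t: (w.get(f"<s>-{t}", 0.0) + w.get(f"{words[0]}_{t}", 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            node = w.get(f"{words[i]}_{t}", 0.0)
            prev, score = max(((p, best[i - 1][p][0] + w.get(f"{p}-{t}", 0.0) + node)
                               for p in tags), key=lambda x: x[1])
            column[t] = (score, prev)
        best.append(column)
    last, _ = max(((t, best[-1][t][0] + w.get(f"{t}-</s>", 0.0)) for t in tags),
                  key=lambda x: x[1])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):       # follow backpointers
        seq.append(best[i][seq[-1]][1])
    return list(reversed(seq))
```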
Structured perceptron as ranking
➢ For n = 1..N:
    ➢ Run Viterbi: ŷ = argmax_y w · Φ(x_n, y)
    ➢ If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
➢ When does this make an update?
    [Figure: for "I can can a can", the truth Pro Md Vb Dt Nn ranked against competing tag sequences such as Pro Md Md Dt Vb, Pro Md Nn Dt Md, Pro Md Vb Dt Vb, ...]
From perceptron to margins
➢ Binary SVM (minimize errors; each point is correctly classified, modulo ξ):
    min_{w,ξ} ½‖w‖² + C Σ_n ξ_n
    s.t. y_n (w · x_n) + ξ_n ≥ 1, ∀n
➢ Structured SVM (maximize margin; each true output is more highly ranked than every other output, modulo ξ):
    min_{w,ξ} ½‖w‖² + C Σ_{n,y} ξ_{n,y}
    s.t. w · Φ(x_n, y_n) − w · Φ(x_n, y) + ξ_{n,y} ≥ 1, ∀n, y ≠ y_n
    (response for the truth vs. response for the other output)
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
From perceptron to margins
    [Figure: for each example x_n, the point Φ(x_n, y_n) for the true output is separated by a margin from the points Φ(x_n, y) for every other output]
    min_{w,ξ} ½‖w‖² + C Σ_{n,y} ξ_{n,y}
    s.t. w · Φ(x_n, y_n) − w · Φ(x_n, y) + ξ_{n,y} ≥ 1, ∀n, y ≠ y_n
    Each true output is more highly ranked, modulo ξ
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Ranking margins
➢ Some errors are worse than others...
    [Figure: for "I can can a can", the truth Pro Md Vb Dt Nn is required to beat every competing tag sequence by the same margin of one, no matter how wrong that competitor is]
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Accounting for a loss function
➢ Some errors are worse than others...
    [Figure: the truth Pro Md Vb Dt Nn is now required to beat each competing sequence y' by a margin of l(y, y'): nearly-correct competitors need only a small margin, badly wrong ones a large one]
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Accounting for a loss function
➢ The loss-scaled constraints
    w · Φ(x_n, y_n) − w · Φ(x_n, y) ≥ l(y_n, y),  ∀y
  are equivalent to the single constraint
    w · Φ(x_n, y_n) − max_y [ w · Φ(x_n, y) + l(y_n, y) ] ≥ 0
  so we only need one loss-augmented argmax per example.
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Augmented argmax for sequences
➢ Add "loss" to each wrong node!
    [Figure: the same Viterbi trellis over "I can", but every node whose tag disagrees with the truth has +1 added to its score]
➢ What are we assuming here? (That the loss decomposes over positions the same way the features do, e.g. Hamming loss.)
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Stochastically optimizing Markov nets
➢ M3N objective + some math gives a subgradient update; compared with the structured perceptron, Viterbi becomes augmented Viterbi and a shrinkage step is added:
    For n = 1..N:
        Augmented Viterbi: ŷ = argmax_y [ w · Φ(x_n, y) + l(y_n, y) ]
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
        Shrink: w ← (1 − 1/(CN)) w
[Ratliff+al, AIStats07]
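A sketch of one such subgradient step; `decode_augmented` (the loss-augmented argmax) and `phi` (the joint feature map) are assumed helpers supplied by the caller.

```python
def m3n_sgd_step(x, y_gold, w, C, N, decode_augmented, phi):
    """One stochastic subgradient step: augmented decode, perceptron-style update, shrink."""
    y_hat = decode_augmented(x, y_gold, w)       # argmax_y [ w.Phi(x,y) + loss(y_gold, y) ]
    if y_hat != y_gold:
        for f, v in phi(x, y_gold).items():      # promote the truth
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, y_hat).items():       # demote the loss-augmented winner
            w[f] = w.get(f, 0.0) - v
    shrink = 1.0 - 1.0 / (C * N)                 # w <- (1 - 1/(CN)) w
    for f in list(w):
        w[f] *= shrink
    return w
```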
Stacking
➢ Structured models: accurate but slow
    [Figure: output labels connected to each other and to the input features]
➢ Independent models: less accurate but fast
    [Figure: each output label connected only to the input features]
➢ Stacking: multiple independent models
    [Figure: a second layer of output labels predicted from the input features plus the first layer's output labels]
Training a stacked model
➢ Train independent classifier f1 on the input features
➢ Train independent classifier f2 on the input features + f1's output
➢ Danger: overfitting! (f2 sees f1's predictions on data f1 was trained on)
➢ Solution: cross-validation (generate f1's outputs for f2 from held-out folds)
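A sketch of this recipe with scikit-learn, where the second model's extra feature is the first model's cross-validated prediction (the model choice and number of folds are arbitrary here).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_stacked(X, y):
    """f2 sees the input features plus f1's output; f1's training-set outputs
    come from cross-validation so f2 never sees overfit predictions."""
    f1 = LogisticRegression(max_iter=1000)
    f1_out = cross_val_predict(f1, X, y, cv=5)      # predictions from held-out folds
    f1.fit(X, y)                                    # final f1, trained on everything
    X2 = np.column_stack([X, f1_out])
    f2 = LogisticRegression(max_iter=1000).fit(X2, y)
    return f1, f2

def predict_stacked(f1, f2, X):
    return f2.predict(np.column_stack([X, f1.predict(X)]))
```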
Do we really need structure?
➢ Structured models: accurate but slow
➢ Independent models: less accurate but fast
➢ Goal: transfer power to get fast + accurate
➢ Questions: are independent models...
    ➢ … expressive enough? (approximation error)
    ➢ … easy to learn? (estimation error)
[Liang+D+Klein, ICML08]
“Compiling” structure out
➢ f1 = words/prefixes/suffixes/forms
➢ Trained on labeled data:
    CRF(f1):  POS: 95.0%   NER: 75.3%
    IRL(f1):  POS: 91.7%   NER: 69.1%
[Liang+D+Klein, ICML08]
“Compiling” structure out
➢ f1 = words/prefixes/suffixes/forms; f2 = f1 applied to a larger window
➢ Trained on labeled data:
    CRF(f1):  POS: 95.0%   NER: 75.3%
    IRL(f1):  POS: 91.7%   NER: 69.1%
    IRL(f2):  POS: 94.4%   NER: 66.2%
➢ Trained on auto-labeled data (labeled by the CRF):
    CompIRL(f2):  POS: 95.0%   NER: 72.7%
[Liang+D+Klein, ICML08]
Decomposition of errors
➢ Chain of models from the full CRF down to the independent model: CRF(f1): p_C → marginalized CRF: p_MC → IRL(f∞): p_A* → IRL(f2): p_1*
➢ Theorem: KL(p_C || p_1*) = KL(p_C || p_MC) + KL(p_MC || p_A*) + KL(p_A* || p_1*)
➢ Measuring the pieces:
    ➢ coherence: sum of MI on edges is small — POS = .003 (95.0% → 95.0%), NER = .009 (76.3% → 76.0%); training a marginalized CRF: NER 76.0% → 76.0%
    ➢ global information: training a truncated CRF: NER 76.0% → 72.7%
    ➢ nonlinearities
[Liang+D+Klein, ICML08]
Structure compilation results
[Figure: accuracy of the structured model vs. the (compiled) independent model, as a function of the amount of data, for part of speech tagging, named entity recognition, and parsing]
[Liang+D+Klein, ICML08]
Coffee Break!!!
Learning to Search
Argmax is hard!
➢ Classic formulation of structured prediction:
    score(x, y) = something we learn, to make "good" (x, y) pairs score highly
➢ At test time:
    f(x) = argmax_{y ∈ Y} score(x, y)
➢ Combinatorial optimization problem
    ➢ Efficient only in very limiting cases
    ➢ Solved by heuristic search: beam + A* + local search
Argmax is hard!
➢ Word-ordering example — order these words: bart better I madonna say than ,
    Best search (32.3): I say better than bart madonna ,
    Original (41.6): better bart than madonna , I say
➢ And a harder one:
    Best search (51.6): and so could really be a neural apparently thought things as dissimilar firing two identical
    Original (64.3): could two things so apparently dissimilar as a thought and neural firing really be identical
[Soricut, PhD Thesis, USC 2007]
Incremental parsing, early 90s style
➢ Train a classifier to make decisions
    [Figure: building a parse of "I/Pro can/Md can/Vb a/Dt can/Nn" one decision at a time (Left, Right, Up, Unary) into NP, VP, and S constituents]
[Magerman, ACL95]
Incremental parsing, mid 2000s style
➢ Train a classifier to make decisions
    [Figure: a partial parse of "I/Pro can/Md can/Vb a/Dt can/Nn" with NP, VP, and S constituents]
[Collins+Roark, ACL04]
Learning to beam-search
➢ Start from the structured perceptron and replace exact Viterbi with beam search:
    For n = 1..N:
        Beam search: ŷ = argmax_{y ∈ Beam} w · Φ(x_n, y)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
[Collins+Roark, ACL04]
Learning to beam-search (early update)
➢ For n = 1..N:
    ➢ Run beam search until the truth falls out of the beam
    ➢ Update weights immediately!
[Collins+Roark, ACL04]
Learning to beam-search (update and restart)
➢ For n = 1..N:
    ➢ Run beam search until the truth falls out of the beam
    ➢ Update weights immediately!
    ➢ Restart the search at the truth
[D+Marcu, ICML05; Xu+al, JMLR09]
Incremental parsing results
[Collins+Roark, ACL04]
Generic Search Formulation
➢ Search Problem:
    ➢ Search space
    ➢ Operators
    ➢ Goal-test function
    ➢ Path-cost function
➢ Search Variable:
    ➢ Enqueue function
➢ Varying the Enqueue function can give us DFS, BFS, beam search, A* search, etc...

    nodes := MakeQueue(S0)
    while nodes is not empty
        node := RemoveFront(nodes)
        if node is a goal state, return node
        next := Operators(node)
        nodes := Enqueue(nodes, next)
    fail
[D+Marcu, ICML05; Xu+al, JMLR09]
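A direct Python rendering of that loop; the search problem supplies `operators` and `is_goal`, and the choice of `enqueue` determines the strategy.

```python
from collections import deque

def generic_search(start, operators, is_goal, enqueue):
    """Vary `enqueue` to get DFS, BFS, beam search, A*, ..."""
    nodes = deque([start])                     # MakeQueue(S0)
    while nodes:
        node = nodes.popleft()                 # RemoveFront(nodes)
        if is_goal(node):
            return node
        nodes = enqueue(nodes, operators(node))
    return None                                # fail

def bfs_enqueue(nodes, successors):            # FIFO -> breadth-first search
    return deque(list(nodes) + list(successors))

def dfs_enqueue(nodes, successors):            # LIFO -> depth-first search
    return deque(list(successors) + list(nodes))
```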
Online Learning Framework (LaSO)
    nodes := MakeQueue(S0)
    while nodes is not empty
        node := RemoveFront(nodes)
        if none of {node} ∪ nodes is y-good, or node is a goal & not y-good    ← if we erred...
            sibs := siblings(node, y)                                          ← where should we have gone?
            w := update(w, x, sibs, {node} ∪ nodes)                            ← update our weights based on the good and the bad choices
            nodes := MakeQueue(sibs)                                           ← continue search...
        else
            if node is a goal state, return w
            next := Operators(node)
            nodes := Enqueue(nodes, next)
➢ Monotonicity: for any node, we can tell if it can lead to the correct solution or not
[D+Marcu, ICML05; Xu+al, JMLR09]
Search-based Margin
➢ The margin is the amount by which we are correct: for y-good nodes g and y-bad nodes b encountered during search,
    u · Φ(x, g_1) − u · Φ(x, b_1) ≥ γ,   u · Φ(x, g_2) − u · Φ(x, b_2) ≥ γ,   ...
➢ Note that the margin, and hence linear separability, is also a function of the search algorithm!
[D+Marcu, ICML05; Xu+al, JMLR09]
Syntactic chunking results
[Figure: F-score vs. training time (minutes) for several chunkers; reported times include 4, 22, 24, and 33 minutes; comparisons include [Collins 2002] (22 min) and [Zhang+Damerau+Johnson 2002] (timing unknown)]
[D+Marcu, ICML05; Xu+al, JMLR09]
Tagging+chunking results
[Figure: joint tagging/chunking accuracy vs. training time (hours, log scale); reported times include 1, 3, 7, and 23 minutes; comparison includes [Sutton+McCallum 2004]]
[D+Marcu, ICML05; Xu+al, JMLR09]
Variations on a beam
➢ Observation:
    ➢ We needn't use the same beam size for training and decoding
    ➢ Varying these values independently yields (training beam vs. decoding beam):

          1     5     10    25    50
    1    93.9  90.5  89.5  88.7  88.4
    5    92.8  94.3  94.3  94.2  94.2
    10   91.9  94.4  94.4  94.5  94.4
    25   91.3  94.1  94.2  94.3  94.2
    50   90.9  94.1  94.2  94.3  94.4
[D+Marcu, ICML05; Xu+al, JMLR09]
What if our model sucks?
➢ Sometimes our model cannot produce the “correct” output
    ➢ canonical example: machine translation
➢ Among the model's outputs (an n-best list, “optimal decoding”, ...):
    ➢ “Bold” update: update the current hypothesis toward the reference itself
    ➢ “Local” update: update the current hypothesis toward the best achievable output
[Och, ACL03; Liang+al, ACL06]
Local versus bold updating...
[Figure: machine translation performance (BLEU, roughly 32.5-35.5) for bold updating, local updating, and Pharaoh, under monotonic and distortion-based decoding]
Refresher on Markov Decision Processes
Reinforcement learning
➢ Basic idea:
    ➢ Receive feedback in the form of rewards
    ➢ Agent’s utility is defined by the reward function
    ➢ Must learn to act to maximize expected rewards
    ➢ Change the rewards, change the learned behavior
➢ Examples:
    ➢ Playing a game, reward at the end for outcome
    ➢ Vacuuming, reward for each piece of dirt picked up
    ➢ Driving a taxi, reward for each passenger delivered
Markov decision processes
➢ What are the values (expected future rewards) of states and actions?
    [Figure: a state s with three actions a1, a2, a3 leading stochastically (e.g. with probabilities 0.6/0.4, 1.0, 0.9/0.1) to successor states; here Q*(s,a1) = 30, Q*(s,a2) = 23, Q*(s,a3) = 17, so V*(s) = 30]
Markov Decision Processes
➢ An MDP is defined by:
    ➢ A set of states s ∈ S
    ➢ A set of actions a ∈ A
    ➢ A transition function T(s,a,s')
        ➢ Prob that a from s leads to s'
        ➢ i.e., P(s' | s,a)
        ➢ Also called the model
    ➢ A reward function R(s, a, s')
        ➢ Sometimes just R(s) or R(s')
    ➢ A start state (or distribution)
    ➢ Maybe a terminal state
➢ MDPs are a family of nondeterministic search problems
➢ Total utility is one of:  Σ_t r_t   or   Σ_t γ^t r_t
Solving MDPs
➢ In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from start to a goal
➢ In an MDP, we want an optimal policy π(s)
    ➢ A policy gives an action for each state
    ➢ An optimal policy maximizes expected utility if followed
    ➢ Defines a reflex agent
[Figure: optimal grid-world policy when R(s, a, s') = -0.04 for all nonterminals s]
Example Optimal Policies
[Figure: optimal grid-world policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]
Optimal Utilities
➢ Fundamental operation: compute the optimal utilities of states s (all at once)
➢ Why? Optimal values define optimal policies!
➢ Define the utility of a state s: V*(s) = expected return starting in s and acting optimally
➢ Define the utility of a q-state (s,a): Q*(s,a) = expected return starting in s, taking action a and thereafter acting optimally
➢ Define the optimal policy: π*(s) = optimal action from state s
The Bellman Equations
➢ Definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values: optimal rewards = maximize over first action and then follow optimal policy
➢ Formally:
    V*(s) = max_a Q*(s,a)
    Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
Solving MDPs / memoized recursion
➢ Recurrences:
    V*(s) = max_a Q*(s,a)
    Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
➢ Cache all function call results so you never repeat work
➢ What happened to the evaluation function?
Q-Value Iteration
➢ Value iteration: iterate approximate optimal values
    ➢ Start with V0*(s) = 0, which we know is right (why?)
    ➢ Given Vi*, calculate the values for all states for depth i+1:
        V_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
➢ But Q-values are more useful!
    ➢ Start with Q0*(s,a) = 0, which we know is right (why?)
    ➢ Given Qi*, calculate the q-values for all q-states for depth i+1:
        Q_{i+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_i(s',a') ]
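A small sketch of Q-value iteration over a dictionary-encoded MDP (the encoding of T and R is this sketch's choice).

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """T[s][a] = list of (s_next, prob); R(s, a, s_next) = reward; actions(s) = list."""
    Q = {s: {a: 0.0 for a in actions(s)} for s in states}        # Q_0 = 0
    for _ in range(iters):
        newQ = {}
        for s in states:
            newQ[s] = {}
            for a in actions(s):
                newQ[s][a] = sum(p * (R(s, a, s2) +
                                      gamma * max(Q[s2].values(), default=0.0))
                                 for s2, p in T[s][a])
        Q = newQ
    return Q

def greedy_policy(Q):
    """pi(s) = argmax_a Q(s, a)."""
    return {s: max(Q[s], key=Q[s].get) for s in Q if Q[s]}
```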
RL = Unknown MDPs
➢ If we knew the MDP (i.e., the reward function and transition function):
    ➢ Value iteration leads to optimal values
    ➢ Will always converge to the truth
➢ Reinforcement learning is what we do when we do not know the MDP
    ➢ All we observe is a trajectory
    ➢ (s1,a1,r1, s2,a2,r2, s3,a3,r3, …)
➢ Many algorithms exist for this problem; see Sutton+Barto's excellent book!
Q-Learning
➢ Learn Q*(s,a) values
    ➢ Receive a sample (s,a,s',r)
    ➢ Consider your old estimate: Q(s,a)
    ➢ Consider your new sample estimate: r + γ max_{a'} Q(s',a')
    ➢ Incorporate the new estimate into a running average:
        Q(s,a) ← (1-α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
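The running-average update as code (a sketch; the Q-table is a dict keyed by (state, action)).

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [ r + gamma max_a' Q(s',a') ]."""
    sample = r + gamma * max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```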
Exploration / Exploitation
➢ Several schemes for forcing exploration
➢ Simplest: random actions (ε-greedy)
    ➢ Every time step, flip a coin
    ➢ With probability ε, act randomly
    ➢ With probability 1-ε, act according to current policy
➢ Problems with random actions?
    ➢ You do explore the space, but keep thrashing around once learning is done
    ➢ One solution: lower ε over time
    ➢ Another solution: exploration functions
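ε-greedy action selection in the same style (a sketch over the dict-based Q-table above).

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```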
Q-Learning
➢ In realistic situations, we cannot possibly learn about every single state!
    ➢ Too many states to visit them all in training
    ➢ Too many states to hold the q-tables in memory
➢ Instead, we want to generalize:
    ➢ Learn about some small number of training states from experience
    ➢ Generalize that experience to new, similar states: Q(s,a) = w · Φ(s,a)
➢ Very simple stochastic updates:
    w ← w + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] Φ(s,a)
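The feature-based version as a sketch: Q is a dot product with sparse features, and only the weights are updated.

```python
def approx_q(w, phi, s, a):
    """Q(s,a) = w . Phi(s,a), with Phi returning a sparse dict."""
    return sum(w.get(f, 0.0) * v for f, v in phi(s, a).items())

def approx_q_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """w <- w + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ] Phi(s,a)."""
    target = r + gamma * max((approx_q(w, phi, s_next, a2) for a2 in actions),
                             default=0.0)
    delta = target - approx_q(w, phi, s, a)
    for f, v in phi(s, a).items():
        w[f] = w.get(f, 0.0) + alpha * delta * v
```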
Inverse Reinforcement Learning (aka Inverse Optimal Control)
Inverse RL: Task
➢ Given:
    ➢ measurements of an agent's behavior over time, in a variety of circumstances
    ➢ if needed, measurements of the sensory inputs to that agent
    ➢ if available, a model of the environment
➢ Determine: the reward function being optimized
➢ Proposed by [Kalman68]
➢ First solution, by [Boyd94]
Why inverse RL?
➢ Computational models for animal learning
    ➢ “In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.”
➢ Agent construction
    ➢ “An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior.”
    ➢ e.g., “Driving well”
➢ Multi-agent systems and mechanism design
    ➢ learning opponents’ reward functions that guide their actions to devise strategies against them
IRL from Sample Trajectories
➢ The optimal policy is available through sampled trajectories (e.g., driving a car)
➢ Want to find a reward function that makes this policy look as good as possible
➢ Write R_w(s) = w · Φ(s) so the reward is linear in w, and let V^π_w(s0) be the value of the starting state under policy π
➢ Maximize how much better the “optimal policy” looks than the other policies:
    max_w Σ_{k=1..K} f( V^{π*}_w(s0) − V^{π_k}_w(s0) )
➢ Warning: need to be careful to avoid trivial solutions (e.g., R ≡ 0 makes every policy look optimal)!
[Ng+Russell, ICML00]
Apprenticeship Learning via IRL
➢ For t = 1,2,…
    ➢ Inverse RL step: Estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {πi}
    ➢ RL step: Compute the optimal policy πt for the estimated reward w
[Abbeel+Ng, ICML04]
Car Driving Experiment
➢ No explicit reward function at all!
➢ Expert demonstrates proper policy via 2 min. of driving time on simulator (1200 data points)
➢ 5 different “driver types” tried
➢ Features: which lane the car is in, distance to closest car in current lane
➢ Algorithm run for 30 iterations, policy hand-picked
➢ Movie Time! (Expert left, IRL right)
[Abbeel+Ng, ICML04]
“Nice” driver
“Evil” driver
Maxent IRL
➢ Distribution over trajectories: P(ζ)
➢ Match the reward of observed behavior: Σ_ζ P(ζ) f_ζ = f_dem
➢ Otherwise be as uniform as possible: maximize the causal entropy over trajectories given stochastic outcomes, max H(P(ζ) || O)
    (Condition on random uncontrolled outcomes, but only after they happen)
[Ziebart+al, AAAI08]
Data collection
➢ 25 taxi drivers, over 100,000 miles
➢ Features: length, speed, road type, lanes, accidents, construction, congestion, time of day
[Ziebart+al, AAAI08]
Predicting destinations....
Planning as structured prediction
[Ratliff+al, NIPS05]
Maximum margin planning
➢ Let μ(s,a) denote the probability of reaching q-state (s,a) under the current model w
➢ Intuitively:
    max_w margin(w)   s.t. the planner run with w yields the human output
➢ Written over q-state visitation frequencies (μ^human for the human, μ for the planner), summing over all q-states, for every training trajectory n:
    min_w ½‖w‖²
    s.t. Σ_{s,a} μ^human(s,a) (w · Φ(x_n, s, a)) − Σ_{s,a} μ(s,a) (w · Φ(x_n, s, a)) ≥ 1, ∀n
[Ratliff+al, NIPS05]
Optimizing MMP
➢ MMP objective + some math gives a subgradient update:
    For n = 1..N:
        Augmented planning: run A* on the current (augmented) cost map to get q-state visitation frequencies μ̂(s,a)
        Update: w ← w + Σ_s Σ_a [ μ^human(s,a) − μ̂(s,a) ] Φ(x_n, s, a)
        Shrink: w ← (1 − 1/(CN)) w
[Ratliff+al, NIPS05]
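A sketch of one MMP subgradient step; `features(s, a)` (the per-q-state feature map) is an assumed helper, and mu_human / mu_planner are dicts of visitation frequencies keyed by (s, a).

```python
def mmp_subgradient_step(w, features, mu_human, mu_planner, C, N):
    """w <- w + sum_{s,a} [mu_human(s,a) - mu_planner(s,a)] Phi(x,s,a), then shrink."""
    for (s, a) in set(mu_human) | set(mu_planner):
        diff = mu_human.get((s, a), 0.0) - mu_planner.get((s, a), 0.0)
        for f, v in features(s, a).items():
            w[f] = w.get(f, 0.0) + diff * v
    shrink = 1.0 - 1.0 / (C * N)                 # w <- (1 - 1/(CN)) w
    for f in list(w):
        w[f] *= shrink
    return w
```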
Maximum margin planning movies
[Ratliff+al, NIPS05]
Parsing via inverse optimal control
➢ State space = all partial parse trees over the full sentence labeled “S”
➢ Actions: take a partial parse and split it anywhere in the middle
➢ Transitions: obvious
➢ Terminal states: when there are no actions left
➢ Reward: parse score at completion
[Neu+Szepesvari, MLJ09]
Parsing via inverse optimal control: results
[Figure: parsing accuracy (roughly 60-90) on small, medium, and large training sets for maximum likelihood, projection, perceptron, maximum margin, maximum entropy, policy matching, and apprenticeship learning]
[Neu+Szepesvari, MLJ09]
Learning by Demonstration
Integrating search and learning
➢ Input: Le homme mange l' croissant.    Output: The man ate a croissant.
➢ A classifier 'h' drives the search over partial hypotheses:
    [Figure: a search tree of partial translations, each with its coverage of the input, e.g. Hyp: "The man ate" → "The man ate a" → candidate expansions "The man ate a croissant" / "The man ate a fox" / "The man ate happy" / ...; Cov: Le homme mange l' croissant.]
[D+Marcu, ICML05; D+Langford+Marcu, MLJ09]
Reducing search to classification
➢ Natural chicken and egg problem:
    ➢ Want h to get low expected future loss
    ➢ … on future decisions made by h
    ➢ … and starting from states visited by h
➢ Iterative solution: h(t-1) → h(t)
    [Figure: run the current policy to reach a state (e.g. Hyp: "The man ate a"), score each candidate continuation by its completion loss (Loss = 0, 0.5, 1.2, 1.8), and train h(t) on these examples]
[D+Langford+Marcu, MLJ09]
Reduction for Structured Prediction
➢ Idea: view structured prediction in light of search
    [Figure: a left-to-right lattice over "I sing merrily" with candidate tags D, N, A, R, V at each position; each step of choosing the next tag looks like it could be represented as a weighted multi-class problem]
➢ Loss function: L([N V R], [N V R]) = 0;  L([N V R], [N V V]) = 1/3;  ...
➢ Can we formalize this idea?
[D+Langford+Marcu, MLJ09]
Reducing Structured Prediction
    [Figure: the same "I sing merrily" tagging lattice]
➢ Key Assumption: an optimal policy for the training data
    ➢ Given: input, true output and state; Return: best successor state
    ➢ Weak!
➢ Desired: a good policy on test data (i.e., given only the input string)
[D+Langford+Marcu, MLJ09]
How to Learn in Search
➢ Idea: Train based only on the optimal path (à la MEMM)
➢ Better Idea: Train based only on the optimal policy,
    then train based on the optimal policy + a little learned policy,
    then train based on the optimal policy + a little more learned policy,
    then ...
    eventually only use the learned policy
[D+Langford+Marcu, MLJ09]
How to Learn in Search
    [Figure: on the "I sing merrily" lattice, one decision point whose candidate successors are costed c=0, c=1, c=1, c=1, c=2]
➢ Translating DSP into Searn(DSP, loss, π):
    ➢ Draw x ~ DSP
    ➢ Run π on x to get a path
    ➢ Pick a position uniformly on the path
    ➢ Generate an example with costs given by expected (w.r.t. π) completion costs for “loss”
[D+Langford+Marcu, MLJ09]
Searn
➢ Ingredients for Searn:
    ➢ Input space (X) and output space (Y), data from X
    ➢ Loss function loss(y, y') and features
    ➢ “Optimal” policy π*(x, y0)
➢ Algorithm: Searn-Learn(A, DSP, loss, π*, β)
    1: Initialize: π = π*
    2: while not converged do
    3:     Sample: D ~ Searn(DSP, loss, π)
    4:     Learn: h ← A(D)
    5:     Update: π ← (1-β) π + β h
    6: end while
    7: return π without π*
➢ Theorem: L(π) ≤ L(π*) + 2 avg(loss) T ln T + c(1+ln T)/T
[D+Langford+Marcu, MLJ09]
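A sketch of the Searn loop; all task-specific pieces (rollout, completion costs, available actions, the cost-sensitive learner) are assumed callables supplied by the caller, and unlike line 7 of the algorithm this sketch does not strip π* out of the returned mixture.

```python
import random

def searn_train(data, pi_star, learner, rollout, completion_cost, actions,
                beta=0.1, n_iters=10):
    """pi_star(x, state) -> action; rollout(x, policy) -> states visited on x;
    completion_cost(x, y, state, a, policy) -> expected loss of action a then policy;
    actions(x, state) -> available actions; learner(D) -> h(x, state)."""
    learned = []                                   # h from each iteration

    def pi(x, state):
        # the mixture pi <- (1-beta) pi + beta h, unrolled over all iterations
        for h in reversed(learned):
            if random.random() < beta:
                return h(x, state)
        return pi_star(x, state)

    for _ in range(n_iters):
        D = []
        for x, y in data:
            state = random.choice(rollout(x, pi))  # pick a position uniformly on the path
            costs = {a: completion_cost(x, y, state, a, pi) for a in actions(x, state)}
            D.append((x, state, costs))            # one cost-sensitive example
        learned.append(learner(D))
    return pi
```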
But what about demonstrations?
➢ What did we assume before?
    ➢ Key Assumption: an optimal policy for the training data
    ➢ Given: input, true output and state; Return: best successor state
➢ We can have a human (or system) demonstrate, thus giving us an optimal policy
3d racing game (TuxKart)
➢ Input: the game screen, resized to 25x19 pixels (1425 features)
➢ Output: steering in [-1,1]
[Ross+Gordon+Bagnell, AIStats11]
DAgger: Dataset Aggregation
➢ Collect trajectories from the expert π*
    ➢ Dataset D0 = { ( s, π*(s) ) | s ~ π* }
    ➢ Train π1 on D0
➢ Collect new trajectories from π1
    ➢ But let the expert steer!
    ➢ Dataset D1 = { ( s, π*(s) ) | s ~ π1 }
    ➢ Train π2 on D0 ∪ D1
➢ In general:
    ➢ Dn = { ( s, π*(s) ) | s ~ πn }
    ➢ Train πn+1 on ∪_{i≤n} Di
➢ Guarantee: if N = T log T, then L(πn) ≤ T εN + O(1) for some n, where εN is the error of the learned classifier
[Ross+Gordon+Bagnell, AIStats11]
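A sketch of the DAgger loop; `expert`, `run_policy`, and `learner` are assumed callables (the expert labels every visited state, whoever visited it).

```python
def dagger_train(expert, learner, run_policy, n_rounds=10):
    """expert(s) -> action; run_policy(pi) -> states visited by pi; learner(D) -> policy."""
    D = [(s, expert(s)) for s in run_policy(expert)]      # D0: states from the expert
    pi = learner(D)                                       # pi_1
    for _ in range(n_rounds - 1):
        # collect states by running the learned policy, but label them with the expert
        D += [(s, expert(s)) for s in run_policy(pi)]
        pi = learner(D)                                   # train pi_{n+1} on the union
    return pi
```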