Beyond Structured Prediction: Inverse Reinforcement Learning
Hal Daumé III
Computer Science, University of Maryland
[email protected]

A Tutorial at ACL 2011, Portland, Oregon, Sunday, 19 June 2011

Acknowledgements
Some slides: Stuart Russell, Dan Klein, J. Drew Bagnell, Nathan Ratliff, Stephane Ross
Discussions/Feedback: MLRG Spring 2010
Examples of structured problems
Examples of demonstrations
NLP as transduction

Task: Machine Translation
    Input: Ces deux principes se tiennent à la croisée de la philosophie, de la politique, de l’économie, de la sociologie et du droit.
    Output: Both principles lie at the crossroads of philosophy, politics, economics, sociology, and law.

Task: Document Summarization
    Input: Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.
    Output: The Falkland Islands war, in 1982, was fought between Britain and Argentina.

Task: Syntactic Analysis
    Input: The man ate a big sandwich.
    Output: [parse tree of "The man ate a big sandwich"]

...many more...
Structured prediction 101
Learn a function mapping inputs to complex outputs:
    f : X → Y    (decoding maps the input space X to the output space Y)
[Figure: example input/output pairs for sequence labeling ("I can can a can" → Pro Md Vb Dt Nn), parsing, machine translation ("Mary no dio una bofetada a la bruja verde ." → "Mary did not slap the green witch ."), and coreference resolution]
Why is structure important?
➢ Correlations among outputs
    ➢ Determiners often precede nouns
    ➢ Sentences usually have verbs
➢ Global coherence
    ➢ Translations should have good sequences of words
    ➢ Summaries should be coherent
    ➢ It just doesn't make sense to have three determiners next to each other
➢ My objective (aka “loss function”) forces it
Outline: Part I
➢ What is Structured Prediction?
➢ Refresher on Binary Classification
    ➢ What does it mean to learn?
    ➢ Linear models for classification
    ➢ Batch versus stochastic optimization
➢ From Perceptron to Structured Perceptron
    ➢ Linear models for Structured Prediction
    ➢ The “argmax” problem
    ➢ From Perceptron to margins
➢ Structure without Structure
    ➢ Stacking
    ➢ Structure compilation
Outline: Part II
➢ Learning to Search
    ➢ Incremental parsing
    ➢ Learning to queue
➢ Refresher on Markov Decision Processes
➢ Inverse Reinforcement Learning
    ➢ Determining rewards given policies
    ➢ Maximum margin planning
➢ Learning by Demonstration
    ➢ Searn
    ➢ DAgger
➢ Discussion
Refresher on Binary Classification
What does it mean to learn?
➢ Informally:
    ➢ to predict the future based on the past
➢ Slightly-less-informally:
    ➢ to take labeled examples and construct a function that will label them as a human would
➢ Formally:
    ➢ Given:
        ➢ A fixed unknown distribution D over X×Y
        ➢ A loss function over Y×Y
        ➢ A finite sample of (x,y) pairs drawn i.i.d. from D
    ➢ Construct a function f that has low expected loss with respect to D
Feature extractors
➢ A feature extractor Φ maps examples to vectors

    "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"

    Φ →  W=dear: 1, W=sir: 1, W=this: 2, ..., W=wish: 0, ...,
         MISSPELLED: 2, NAMELESS: 1, ALL_CAPS: 0, NUM_URLS: 0, ...

➢ Feature vectors in NLP are frequently sparse
Linear models for binary classification
➢ Decision boundary is the set of “uncertain” points
➢ Linear decision boundaries are characterized by weight vectors
➢ Example: x = “free money”
    Φ(x):  BIAS: 1, free: 1, money: 1, the: 0, ...
    w:     BIAS: -3, free: 4, money: 2, the: 0, ...
    Prediction: the sign of Σ_i w_i Φ_i(x)
The perceptron
➢ Inputs = feature values Φ1, Φ2, Φ3, ...
➢ Params = weights w1, w2, w3, ...
➢ Sum is the response: Σ_i w_i Φ_i, fed through "> 0?"
➢ If the response is:
    ➢ Positive, output +1
    ➢ Negative, output -1
➢ When training, update on errors:
    w ← w + y x
➢ "Error" when:
    y (w · x) ≤ 0
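A small runnable sketch of the perceptron rule over sparse feature dicts (the data format is this sketch's assumption: (Φ(x), y) pairs with y ∈ {-1, +1}).

```python
def response(w, phi):
    """w · Φ(x) for sparse dict features."""
    return sum(w.get(f, 0.0) * v for f, v in phi.items())

def perceptron_train(data, epochs=10):
    """data: list of (phi, y) pairs, phi a dict, y in {-1, +1}."""
    w = {}
    for _ in range(epochs):
        for phi, y in data:
            if y * response(w, phi) <= 0:        # "error": wrong side of (or on) the boundary
                for f, v in phi.items():         # w <- w + y x
                    w[f] = w.get(f, 0.0) + y * v
    return w

w = perceptron_train([({"BIAS": 1, "free": 1, "money": 1}, +1),
                      ({"BIAS": 1, "the": 1, "meeting": 1}, -1)])
```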
Why does that update work?
➢ When y (w_old · x) ≤ 0, update w_new = w_old + y x. Then:
    y (w_new · x) = y (w_old + y x) · x
                  = y (w_old · x) + y y (x · x)
                  = y (w_old · x) + ‖x‖²
  so the response on this example moves in the right direction.
Support vector machines
➢ Explicitly optimize the margin
➢ Enforce that all training points are correctly classified
    max_w margin(w)   s.t. all points are correctly classified
    max_w margin(w)   s.t. y_n (w · x_n) ≥ 1, ∀n
    min_w ‖w‖²        s.t. y_n (w · x_n) ≥ 1, ∀n
Support vector machines with slack
➢ Explicitly optimize the margin
➢ Allow some “noisy” points to be misclassified
    min_{w,ξ}  ½‖w‖² + C Σ_n ξ_n
    s.t.       y_n (w · x_n) + ξ_n ≥ 1, ∀n
               ξ_n ≥ 0, ∀n
Batch versus stochastic optimization
➢ Batch = read in all the data, then process it; e.g. the SVM:
    min_{w,ξ} ½‖w‖² + C Σ_n ξ_n
    s.t. y_n (w · x_n) + ξ_n ≥ 1, ∀n;   ξ_n ≥ 0, ∀n
➢ Stochastic = (roughly) process a bit at a time; e.g. the perceptron:
    For n = 1..N:
        If y_n (w · x_n) ≤ 0:  w ← w + y_n x_n
Stochastically optimized SVMs
➢ SVM objective + some math gives a stochastic update:
    For n = 1..N:
        If y_n (w · x_n) ≤ 1:  w ← w + y_n x_n
        w ← (1 − 1/(CN)) w
➢ Compare the perceptron:
    For n = 1..N:
        If y_n (w · x_n) ≤ 0:  w ← w + y_n x_n
➢ Implementation Note: Weight shrinkage is SLOW. Implement it lazily, at the cost of double storage.
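A sketch of the stochastic SVM update with the shrinkage done lazily through a single scale factor, so the full weight vector is never touched; C and the number of epochs are illustrative.

```python
def stochastic_svm(data, C=1.0, epochs=10):
    """data: list of (phi, y), phi a sparse dict, y in {-1, +1}."""
    N = len(data)
    w, scale = {}, 1.0                          # the "real" weights are scale * w
    for _ in range(epochs):
        for phi, y in data:
            margin = scale * sum(w.get(f, 0.0) * v for f, v in phi.items())
            if y * margin <= 1:                 # margin violation, not just a mistake
                for f, v in phi.items():        # w <- w + y x, undoing the scale
                    w[f] = w.get(f, 0.0) + y * v / scale
            scale *= 1.0 - 1.0 / (C * N)        # lazy shrink: w <- (1 - 1/(CN)) w
    return {f: scale * v for f, v in w.items()}
```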
From Perceptron to Structured Perceptron
Perceptron with multiple classes
➢ Store a separate weight vector for each class: w1, w2, ..., wK
➢ Predict the class whose response w_k · Φ(x) is biggest
➢ For n = 1..N:
    ➢ Predict: ŷ = argmax_k w_k · Φ(x_n)
    ➢ If ŷ ≠ y_n:
        w_{y_n} ← w_{y_n} + Φ(x_n)
        w_{ŷ} ← w_{ŷ} − Φ(x_n)
➢ Why does this do the right thing?
Perceptron with multiple classes v2
➢ Originally: separate vectors w1, w2, w3 and per-class features Φ(x)
    For n = 1..N:
        Predict: ŷ = argmax_k w_k · Φ(x_n)
        If ŷ ≠ y_n:  w_{y_n} ← w_{y_n} + Φ(x_n);  w_{ŷ} ← w_{ŷ} − Φ(x_n)
➢ v2: a single vector w and joint features Φ(x, k)
    For n = 1..N:
        Predict: ŷ = argmax_k w · Φ(x_n, k)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
Perceptron with multiple classes v2 (example)
➢ For x = “free money”, the joint features Φ(x, 1) and Φ(x, 2) are class-prefixed copies of Φ(x):
    Φ(x, spam):  spam_BIAS: 1, spam_free: 1, spam_money: 1, spam_the: 0, ...
    Φ(x, ham):   ham_BIAS: 1, ham_free: 1, ham_money: 1, ham_the: 0, ...
➢ For n = 1..N:
    ➢ Predict: ŷ = argmax_k w · Φ(x_n, k)
    ➢ If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
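A sketch of the joint feature map Φ(x, k): prefix each input feature with the candidate class so a single weight vector covers all classes; the class names here are illustrative.

```python
def joint_features(phi, k):
    """Phi(x, k): copy each input feature under a class-specific name."""
    return {f"{k}_{name}": value for name, value in phi.items()}

def predict_class(w, phi, classes):
    """argmax_k w . Phi(x, k)."""
    def score(k):
        return sum(w.get(f, 0.0) * v for f, v in joint_features(phi, k).items())
    return max(classes, key=score)

print(joint_features({"BIAS": 1, "free": 1, "money": 1}, "spam"))
# {'spam_BIAS': 1, 'spam_free': 1, 'spam_money': 1}
```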
Features for structured prediction
➢ Allowed to encode anything you want
➢ Example: x = "I can can a can", y = Pro Md Vb Dt Nn

    Φ(x, y) =
    Output features:  I_Pro: 1, can_Md: 1, can_Vb: 1, a_Dt: 1, can_Nn: 1, ...
    Markov features:  -Pro: 1, Pro-Md: 1, Md-Vb: 1, Vb-Dt: 1, Dt-Nn: 1, Nn-: 1, ...
    Other features:   has_verb: 1, has_nn_lft: 0, has_n_lft: 1, has_nn_rgt: 1, has_n_rgt: 1, ...

➢ Output features, Markov features, other features
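A sketch of Φ(x, y) for a tag sequence, with output (word-tag) and Markov (tag-tag) features; writing the boundary tags as <s> and </s> is this sketch's choice, not the slide's.

```python
from collections import Counter

def structured_features(words, tags):
    """Phi(x, y): output features (word_tag) plus Markov features (tag-tag)."""
    feats = Counter()
    for word, tag in zip(words, tags):
        feats[f"{word}_{tag}"] += 1              # output features, e.g. can_Md
    padded = ["<s>"] + list(tags) + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        feats[f"{prev}-{cur}"] += 1              # Markov features, e.g. Pro-Md
    return feats

print(structured_features("I can can a can".split(), ["Pro", "Md", "Vb", "Dt", "Nn"]))
```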
Structured perceptron
➢ Multiclass perceptron (enumeration over 1..K):
    For n = 1..N:
        Predict: ŷ = argmax_{k ∈ 1..K} w · Φ(x_n, k)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
➢ Structured perceptron (enumeration over all outputs):
    For n = 1..N:
        Predict: ŷ = argmax_{y ∈ Y} w · Φ(x_n, y)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
[Collins, EMNLP02]
Argmax for sequences
➢ If we only have output and Markov features, we can use the Viterbi algorithm:
    [Figure: a trellis over "I can can" with one column per word and one node per tag; node scores are output-feature weights (w·[I_Pro], w·[can_Md], w·[can_Vb], ...) and edge scores are Markov-feature weights (w·[Pro-Pro], w·[Pro-Md], w·[Pro-Vb], ...)]
    (plus some work to account for boundary conditions)
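A compact Viterbi sketch over the feature names used above (word_tag node scores, tag-tag edge scores, <s>/</s> boundaries); it returns the argmax tag sequence for a given weight dict w.

```python
def viterbi(words, tags, w):
    """argmax_y w . Phi(x, y) when Phi has only output and Markov features."""
    # best[i][t] = (score of the best prefix ending with tag t at position i, backpointer)
    best = [{t: (w.get(f"<s>-{t}", 0.0) + w.get(f"{words[0]}_{t}", 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            node = w.get(f"{words[i]}_{t}", 0.0)
            prev, score = max(((p, best[i - 1][p][0] + w.get(f"{p}-{t}", 0.0) + node)
                               for p in tags), key=lambda x: x[1])
            column[t] = (score, prev)
        best.append(column)
    last, _ = max(((t, best[-1][t][0] + w.get(f"{t}-</s>", 0.0)) for t in tags),
                  key=lambda x: x[1])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):       # follow backpointers
        seq.append(best[i][seq[-1]][1])
    return list(reversed(seq))
```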
Structured perceptron as ranking
➢ For n = 1..N:
    ➢ Run Viterbi: ŷ = argmax_y w · Φ(x_n, y)
    ➢ If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
➢ When does this make an update?
    [Figure: for "I can can a can", the truth Pro Md Vb Dt Nn ranked against competing tag sequences such as Pro Md Md Dt Vb, Pro Md Nn Dt Md, Pro Md Vb Dt Vb, ...]
From perceptron to margins
➢ Binary SVM (minimize errors; each point is correctly classified, modulo ξ):
    min_{w,ξ} ½‖w‖² + C Σ_n ξ_n
    s.t. y_n (w · x_n) + ξ_n ≥ 1, ∀n
➢ Structured SVM (maximize margin; each true output is more highly ranked than every other output, modulo ξ):
    min_{w,ξ} ½‖w‖² + C Σ_{n,y} ξ_{n,y}
    s.t. w · Φ(x_n, y_n) − w · Φ(x_n, y) + ξ_{n,y} ≥ 1, ∀n, y ≠ y_n
    (response for the truth vs. response for the other output)
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
From perceptron to margins
    [Figure: for each example x_n, the point Φ(x_n, y_n) for the true output is separated by a margin from the points Φ(x_n, y) for every other output]
    min_{w,ξ} ½‖w‖² + C Σ_{n,y} ξ_{n,y}
    s.t. w · Φ(x_n, y_n) − w · Φ(x_n, y) + ξ_{n,y} ≥ 1, ∀n, y ≠ y_n
    Each true output is more highly ranked, modulo ξ
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Ranking margins
➢ Some errors are worse than others...
    [Figure: for "I can can a can", the truth Pro Md Vb Dt Nn is required to beat every competing tag sequence by the same margin of one, no matter how wrong that competitor is]
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Accounting for a loss function
➢ Some errors are worse than others...
    [Figure: the truth Pro Md Vb Dt Nn is now required to beat each competing sequence y' by a margin of l(y, y'): nearly-correct competitors need only a small margin, badly wrong ones a large one]
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Accounting for a loss function
➢ The loss-scaled constraints
    w · Φ(x_n, y_n) − w · Φ(x_n, y) ≥ l(y_n, y),  ∀y
  are equivalent to the single constraint
    w · Φ(x_n, y_n) − max_y [ w · Φ(x_n, y) + l(y_n, y) ] ≥ 0
  so we only need one loss-augmented argmax per example.
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Augmented argmax for sequences
➢ Add "loss" to each wrong node!
    [Figure: the same Viterbi trellis over "I can", but every node whose tag disagrees with the truth has +1 added to its score]
➢ What are we assuming here? (That the loss decomposes over positions the same way the features do, e.g. Hamming loss.)
[Taskar+al, JMLR05; Tsochantaridis+al, JMLR05]
Stochastically optimizing Markov nets
➢ M3N objective + some math gives a subgradient update; compared with the structured perceptron, Viterbi becomes augmented Viterbi and a shrinkage step is added:
    For n = 1..N:
        Augmented Viterbi: ŷ = argmax_y [ w · Φ(x_n, y) + l(y_n, y) ]
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
        Shrink: w ← (1 − 1/(CN)) w
[Ratliff+al, AIStats07]
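A sketch of one such subgradient step; `decode_augmented` (the loss-augmented argmax) and `phi` (the joint feature map) are assumed helpers supplied by the caller.

```python
def m3n_sgd_step(x, y_gold, w, C, N, decode_augmented, phi):
    """One stochastic subgradient step: augmented decode, perceptron-style update, shrink."""
    y_hat = decode_augmented(x, y_gold, w)       # argmax_y [ w.Phi(x,y) + loss(y_gold, y) ]
    if y_hat != y_gold:
        for f, v in phi(x, y_gold).items():      # promote the truth
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, y_hat).items():       # demote the loss-augmented winner
            w[f] = w.get(f, 0.0) - v
    shrink = 1.0 - 1.0 / (C * N)                 # w <- (1 - 1/(CN)) w
    for f in list(w):
        w[f] *= shrink
    return w
```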
Stacking
➢ Structured models: accurate but slow
    [Figure: output labels connected to each other and to the input features]
➢ Independent models: less accurate but fast
    [Figure: each output label connected only to the input features]
➢ Stacking: multiple independent models
    [Figure: a second layer of output labels predicted from the input features plus the first layer's output labels]
Training a stacked model
➢ Train independent classifier f1 on the input features
➢ Train independent classifier f2 on the input features + f1's output
➢ Danger: overfitting! (f2 sees f1's predictions on data f1 was trained on)
➢ Solution: cross-validation (generate f1's outputs for f2 from held-out folds)
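A sketch of this recipe with scikit-learn, where the second model's extra feature is the first model's cross-validated prediction (the model choice and number of folds are arbitrary here).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_stacked(X, y):
    """f2 sees the input features plus f1's output; f1's training-set outputs
    come from cross-validation so f2 never sees overfit predictions."""
    f1 = LogisticRegression(max_iter=1000)
    f1_out = cross_val_predict(f1, X, y, cv=5)      # predictions from held-out folds
    f1.fit(X, y)                                    # final f1, trained on everything
    X2 = np.column_stack([X, f1_out])
    f2 = LogisticRegression(max_iter=1000).fit(X2, y)
    return f1, f2

def predict_stacked(f1, f2, X):
    return f2.predict(np.column_stack([X, f1.predict(X)]))
```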
Do we really need structure?
➢ Structured models: accurate but slow
➢ Independent models: less accurate but fast
➢ Goal: transfer power to get fast + accurate
➢ Questions: are independent models...
    ➢ … expressive enough? (approximation error)
    ➢ … easy to learn? (estimation error)
[Liang+D+Klein, ICML08]
“Compiling” structure out
➢ f1 = words/prefixes/suffixes/forms
➢ Trained on labeled data:
    CRF(f1):  POS: 95.0%   NER: 75.3%
    IRL(f1):  POS: 91.7%   NER: 69.1%
[Liang+D+Klein, ICML08]
“Compiling” structure out
➢ f1 = words/prefixes/suffixes/forms; f2 = f1 applied to a larger window
➢ Trained on labeled data:
    CRF(f1):  POS: 95.0%   NER: 75.3%
    IRL(f1):  POS: 91.7%   NER: 69.1%
    IRL(f2):  POS: 94.4%   NER: 66.2%
➢ Trained on auto-labeled data (labeled by the CRF):
    CompIRL(f2):  POS: 95.0%   NER: 72.7%
[Liang+D+Klein, ICML08]
Decomposition of errors
➢ Chain of models from the full CRF down to the independent model: CRF(f1): p_C → marginalized CRF: p_MC → IRL(f∞): p_A* → IRL(f2): p_1*
➢ Theorem: KL(p_C || p_1*) = KL(p_C || p_MC) + KL(p_MC || p_A*) + KL(p_A* || p_1*)
➢ Measuring the pieces:
    ➢ coherence: sum of MI on edges is small — POS = .003 (95.0% → 95.0%), NER = .009 (76.3% → 76.0%); training a marginalized CRF: NER 76.0% → 76.0%
    ➢ global information: training a truncated CRF: NER 76.0% → 72.7%
    ➢ nonlinearities
[Liang+D+Klein, ICML08]
Structure compilation results
[Figure: accuracy of the structured model vs. the (compiled) independent model, as a function of the amount of data, for part of speech tagging, named entity recognition, and parsing]
[Liang+D+Klein, ICML08]
Coffee Break!!!
Learning to Search
Argmax is hard!
➢ Classic formulation of structured prediction:
    score(x, y) = something we learn, to make "good" (x, y) pairs score highly
➢ At test time:
    f(x) = argmax_{y ∈ Y} score(x, y)
➢ Combinatorial optimization problem
    ➢ Efficient only in very limiting cases
    ➢ Solved by heuristic search: beam + A* + local search
Argmax is hard!
➢ Word-ordering example — order these words: bart better I madonna say than ,
    Best search (32.3): I say better than bart madonna ,
    Original (41.6): better bart than madonna , I say
➢ And a harder one:
    Best search (51.6): and so could really be a neural apparently thought things as dissimilar firing two identical
    Original (64.3): could two things so apparently dissimilar as a thought and neural firing really be identical
[Soricut, PhD Thesis, USC 2007]
Incremental parsing, early 90s style
➢ Train a classifier to make decisions
    [Figure: building a parse of "I/Pro can/Md can/Vb a/Dt can/Nn" one decision at a time (Left, Right, Up, Unary) into NP, VP, and S constituents]
[Magerman, ACL95]
Incremental parsing, mid 2000s style
➢ Train a classifier to make decisions
    [Figure: a partial parse of "I/Pro can/Md can/Vb a/Dt can/Nn" with NP, VP, and S constituents]
[Collins+Roark, ACL04]
Learning to beam-search
➢ Start from the structured perceptron and replace exact Viterbi with beam search:
    For n = 1..N:
        Beam search: ŷ = argmax_{y ∈ Beam} w · Φ(x_n, y)
        If ŷ ≠ y_n:  w ← w − Φ(x_n, ŷ) + Φ(x_n, y_n)
[Collins+Roark, ACL04]
Learning to beam-search (early update)
➢ For n = 1..N:
    ➢ Run beam search until the truth falls out of the beam
    ➢ Update weights immediately!
[Collins+Roark, ACL04]
Learning to beam-search (update and restart)
➢ For n = 1..N:
    ➢ Run beam search until the truth falls out of the beam
    ➢ Update weights immediately!
    ➢ Restart the search at the truth
[D+Marcu, ICML05; Xu+al, JMLR09]
Incremental parsing results
[Collins+Roark, ACL04]
Generic Search Formulation
➢ Search Problem:
    ➢ Search space
    ➢ Operators
    ➢ Goal-test function
    ➢ Path-cost function
➢ Search Variable:
    ➢ Enqueue function
➢ Varying the Enqueue function can give us DFS, BFS, beam search, A* search, etc...

    nodes := MakeQueue(S0)
    while nodes is not empty
        node := RemoveFront(nodes)
        if node is a goal state, return node
        next := Operators(node)
        nodes := Enqueue(nodes, next)
    fail
[D+Marcu, ICML05; Xu+al, JMLR09]
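A direct Python rendering of that loop; the search problem supplies `operators` and `is_goal`, and the choice of `enqueue` determines the strategy.

```python
from collections import deque

def generic_search(start, operators, is_goal, enqueue):
    """Vary `enqueue` to get DFS, BFS, beam search, A*, ..."""
    nodes = deque([start])                     # MakeQueue(S0)
    while nodes:
        node = nodes.popleft()                 # RemoveFront(nodes)
        if is_goal(node):
            return node
        nodes = enqueue(nodes, operators(node))
    return None                                # fail

def bfs_enqueue(nodes, successors):            # FIFO -> breadth-first search
    return deque(list(nodes) + list(successors))

def dfs_enqueue(nodes, successors):            # LIFO -> depth-first search
    return deque(list(successors) + list(nodes))
```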
Online Learning Framework (LaSO)
    nodes := MakeQueue(S0)
    while nodes is not empty
        node := RemoveFront(nodes)
        if none of {node} ∪ nodes is y-good, or node is a goal & not y-good    ← if we erred...
            sibs := siblings(node, y)                                          ← where should we have gone?
            w := update(w, x, sibs, {node} ∪ nodes)                            ← update our weights based on the good and the bad choices
            nodes := MakeQueue(sibs)                                           ← continue search...
        else
            if node is a goal state, return w
            next := Operators(node)
            nodes := Enqueue(nodes, next)
➢ Monotonicity: for any node, we can tell if it can lead to the correct solution or not
[D+Marcu, ICML05; Xu+al, JMLR09]
Search-based Margin
➢ The margin is the amount by which we are correct: for y-good nodes g and y-bad nodes b encountered during search,
    u · Φ(x, g_1) − u · Φ(x, b_1) ≥ γ,   u · Φ(x, g_2) − u · Φ(x, b_2) ≥ γ,   ...
➢ Note that the margin, and hence linear separability, is also a function of the search algorithm!
[D+Marcu, ICML05; Xu+al, JMLR09]
Syntactic chunking results
[Figure: F-score vs. training time (minutes) for several chunkers; reported times include 4, 22, 24, and 33 minutes; comparisons include [Collins 2002] (22 min) and [Zhang+Damerau+Johnson 2002] (timing unknown)]
[D+Marcu, ICML05; Xu+al, JMLR09]
Tagging+chunking results
[Figure: joint tagging/chunking accuracy vs. training time (hours, log scale); reported times include 1, 3, 7, and 23 minutes; comparison includes [Sutton+McCallum 2004]]
[D+Marcu, ICML05; Xu+al, JMLR09]
Variations on a beam
➢ Observation:
    ➢ We needn't use the same beam size for training and decoding
    ➢ Varying these values independently yields (training beam vs. decoding beam):

          1     5     10    25    50
    1    93.9  90.5  89.5  88.7  88.4
    5    92.8  94.3  94.3  94.2  94.2
    10   91.9  94.4  94.4  94.5  94.4
    25   91.3  94.1  94.2  94.3  94.2
    50   90.9  94.1  94.2  94.3  94.4
[D+Marcu, ICML05; Xu+al, JMLR09]
What if our model sucks?
➢ Sometimes our model cannot produce the “correct” output
    ➢ canonical example: machine translation
➢ Among the model's outputs (an n-best list, “optimal decoding”, ...):
    ➢ “Bold” update: update the current hypothesis toward the reference itself
    ➢ “Local” update: update the current hypothesis toward the best achievable output
[Och, ACL03; Liang+al, ACL06]
Local versus bold updating...
[Figure: machine translation performance (BLEU, roughly 32.5-35.5) for bold updating, local updating, and Pharaoh, under monotonic and distortion-based decoding]
Refresher on Markov Decision Processes
Reinforcement learning
➢ Basic idea:
    ➢ Receive feedback in the form of rewards
    ➢ Agent’s utility is defined by the reward function
    ➢ Must learn to act to maximize expected rewards
    ➢ Change the rewards, change the learned behavior
➢ Examples:
    ➢ Playing a game, reward at the end for outcome
    ➢ Vacuuming, reward for each piece of dirt picked up
    ➢ Driving a taxi, reward for each passenger delivered
Markov decision processes
➢ What are the values (expected future rewards) of states and actions?
    [Figure: a state s with three actions a1, a2, a3 leading stochastically (e.g. with probabilities 0.6/0.4, 1.0, 0.9/0.1) to successor states; here Q*(s,a1) = 30, Q*(s,a2) = 23, Q*(s,a3) = 17, so V*(s) = 30]
Markov Decision Processes
➢ An MDP is defined by:
    ➢ A set of states s ∈ S
    ➢ A set of actions a ∈ A
    ➢ A transition function T(s,a,s')
        ➢ Prob that a from s leads to s'
        ➢ i.e., P(s' | s,a)
        ➢ Also called the model
    ➢ A reward function R(s, a, s')
        ➢ Sometimes just R(s) or R(s')
    ➢ A start state (or distribution)
    ➢ Maybe a terminal state
➢ MDPs are a family of nondeterministic search problems
➢ Total utility is one of:  Σ_t r_t   or   Σ_t γ^t r_t
Solving MDPs
➢ In a deterministic single-agent search problem, we want an optimal plan, or sequence of actions, from start to a goal
➢ In an MDP, we want an optimal policy π(s)
    ➢ A policy gives an action for each state
    ➢ An optimal policy maximizes expected utility if followed
    ➢ Defines a reflex agent
[Figure: optimal grid-world policy when R(s, a, s') = -0.04 for all nonterminals s]
Example Optimal Policies
[Figure: optimal grid-world policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]
Optimal Utilities
➢ Fundamental operation: compute the optimal utilities of states s (all at once)
➢ Why? Optimal values define optimal policies!
➢ Define the utility of a state s: V*(s) = expected return starting in s and acting optimally
➢ Define the utility of a q-state (s,a): Q*(s,a) = expected return starting in s, taking action a and thereafter acting optimally
➢ Define the optimal policy: π*(s) = optimal action from state s
The Bellman Equations
➢ Definition of utility leads to a simple one-step lookahead relationship amongst optimal utility values: optimal rewards = maximize over first action and then follow optimal policy
➢ Formally:
    V*(s) = max_a Q*(s,a)
    Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
Solving MDPs / memoized recursion
➢ Recurrences:
    V*(s) = max_a Q*(s,a)
    Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
➢ Cache all function call results so you never repeat work
➢ What happened to the evaluation function?
Q-Value Iteration
➢ Value iteration: iterate approximate optimal values
    ➢ Start with V0*(s) = 0, which we know is right (why?)
    ➢ Given Vi*, calculate the values for all states for depth i+1:
        V_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
➢ But Q-values are more useful!
    ➢ Start with Q0*(s,a) = 0, which we know is right (why?)
    ➢ Given Qi*, calculate the q-values for all q-states for depth i+1:
        Q_{i+1}(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_i(s',a') ]
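A small sketch of Q-value iteration over a dictionary-encoded MDP (the encoding of T and R is this sketch's choice).

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """T[s][a] = list of (s_next, prob); R(s, a, s_next) = reward; actions(s) = list."""
    Q = {s: {a: 0.0 for a in actions(s)} for s in states}        # Q_0 = 0
    for _ in range(iters):
        newQ = {}
        for s in states:
            newQ[s] = {}
            for a in actions(s):
                newQ[s][a] = sum(p * (R(s, a, s2) +
                                      gamma * max(Q[s2].values(), default=0.0))
                                 for s2, p in T[s][a])
        Q = newQ
    return Q

def greedy_policy(Q):
    """pi(s) = argmax_a Q(s, a)."""
    return {s: max(Q[s], key=Q[s].get) for s in Q if Q[s]}
```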
RL = Unknown MDPs
➢ If we knew the MDP (i.e., the reward function and transition function):
    ➢ Value iteration leads to optimal values
    ➢ Will always converge to the truth
➢ Reinforcement learning is what we do when we do not know the MDP
    ➢ All we observe is a trajectory
    ➢ (s1,a1,r1, s2,a2,r2, s3,a3,r3, …)
➢ Many algorithms exist for this problem; see Sutton+Barto's excellent book!
Q-Learning
➢ Learn Q*(s,a) values
    ➢ Receive a sample (s,a,s',r)
    ➢ Consider your old estimate: Q(s,a)
    ➢ Consider your new sample estimate: r + γ max_{a'} Q(s',a')
    ➢ Incorporate the new estimate into a running average:
        Q(s,a) ← (1-α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]
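The running-average update as code (a sketch; the Q-table is a dict keyed by (state, action)).

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [ r + gamma max_a' Q(s',a') ]."""
    sample = r + gamma * max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```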
Exploration / Exploitation
➢ Several schemes for forcing exploration
➢ Simplest: random actions (ε-greedy)
    ➢ Every time step, flip a coin
    ➢ With probability ε, act randomly
    ➢ With probability 1-ε, act according to current policy
➢ Problems with random actions?
    ➢ You do explore the space, but keep thrashing around once learning is done
    ➢ One solution: lower ε over time
    ➢ Another solution: exploration functions
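ε-greedy action selection in the same style (a sketch over the dict-based Q-table above).

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```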
Q-Learning
➢ In realistic situations, we cannot possibly learn about every single state!
    ➢ Too many states to visit them all in training
    ➢ Too many states to hold the q-tables in memory
➢ Instead, we want to generalize:
    ➢ Learn about some small number of training states from experience
    ➢ Generalize that experience to new, similar states: Q(s,a) = w · Φ(s,a)
➢ Very simple stochastic updates:
    w ← w + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ] Φ(s,a)
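The feature-based version as a sketch: Q is a dot product with sparse features, and only the weights are updated.

```python
def approx_q(w, phi, s, a):
    """Q(s,a) = w . Phi(s,a), with Phi returning a sparse dict."""
    return sum(w.get(f, 0.0) * v for f, v in phi(s, a).items())

def approx_q_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """w <- w + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ] Phi(s,a)."""
    target = r + gamma * max((approx_q(w, phi, s_next, a2) for a2 in actions),
                             default=0.0)
    delta = target - approx_q(w, phi, s, a)
    for f, v in phi(s, a).items():
        w[f] = w.get(f, 0.0) + alpha * delta * v
```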
Inverse Reinforcement Learning (aka Inverse Optimal Control)
Inverse RL: Task
➢ Given:
    ➢ measurements of an agent's behavior over time, in a variety of circumstances
    ➢ if needed, measurements of the sensory inputs to that agent
    ➢ if available, a model of the environment
➢ Determine: the reward function being optimized
➢ Proposed by [Kalman68]
➢ First solution, by [Boyd94]
Why inverse RL?
➢ Computational models for animal learning
    ➢ “In examining animal and human behavior we must consider the reward function as an unknown to be ascertained through empirical investigation.”
➢ Agent construction
    ➢ “An agent designer [...] may only have a very rough idea of the reward function whose optimization would generate 'desirable' behavior.”
    ➢ e.g., “Driving well”
➢ Multi-agent systems and mechanism design
    ➢ learning opponents’ reward functions that guide their actions to devise strategies against them
IRL from Sample Trajectories
➢ The optimal policy is available through sampled trajectories (e.g., driving a car)
➢ Want to find a reward function that makes this policy look as good as possible
➢ Write R_w(s) = w · Φ(s) so the reward is linear in w, and let V^π_w(s0) be the value of the starting state under policy π
➢ Maximize how much better the “optimal policy” looks than the other policies:
    max_w Σ_{k=1..K} f( V^{π*}_w(s0) − V^{π_k}_w(s0) )
➢ Warning: need to be careful to avoid trivial solutions (e.g., R ≡ 0 makes every policy look optimal)!
[Ng+Russell, ICML00]
Apprenticeship Learning via IRL
➢ For t = 1,2,…
    ➢ Inverse RL step: Estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert performs better than all previously found policies {πi}
    ➢ RL step: Compute the optimal policy πt for the estimated reward w
[Abbeel+Ng, ICML04]
Car Driving Experiment
➢ No explicit reward function at all!
➢ Expert demonstrates proper policy via 2 min. of driving time on simulator (1200 data points)
➢ 5 different “driver types” tried
➢ Features: which lane the car is in, distance to closest car in current lane
➢ Algorithm run for 30 iterations, policy hand-picked
➢ Movie Time! (Expert left, IRL right)
[Abbeel+Ng, ICML04]
“Nice” driver
“Evil” driver
Maxent IRL
➢ Distribution over trajectories: P(ζ)
➢ Match the reward of observed behavior: Σ_ζ P(ζ) f_ζ = f_dem
➢ Otherwise be as uniform as possible: maximize the causal entropy over trajectories given stochastic outcomes, max H(P(ζ) || O)
    (Condition on random uncontrolled outcomes, but only after they happen)
[Ziebart+al, AAAI08]
Data collection
➢ 25 taxi drivers, over 100,000 miles
➢ Features: length, speed, road type, lanes, accidents, construction, congestion, time of day
[Ziebart+al, AAAI08]
Predicting destinations....
Planning as structured prediction
[Ratliff+al, NIPS05]
Maximum margin planning
➢ Let μ(s,a) denote the probability of reaching q-state (s,a) under the current model w
➢ Intuitively:
    max_w margin(w)   s.t. the planner run with w yields the human output
➢ Written over q-state visitation frequencies (μ^human for the human, μ for the planner), summing over all q-states, for every training trajectory n:
    min_w ½‖w‖²
    s.t. Σ_{s,a} μ^human(s,a) (w · Φ(x_n, s, a)) − Σ_{s,a} μ(s,a) (w · Φ(x_n, s, a)) ≥ 1, ∀n
[Ratliff+al, NIPS05]
Optimizing MMP
➢ MMP objective + some math gives a subgradient update:
    For n = 1..N:
        Augmented planning: run A* on the current (augmented) cost map to get q-state visitation frequencies μ̂(s,a)
        Update: w ← w + Σ_s Σ_a [ μ^human(s,a) − μ̂(s,a) ] Φ(x_n, s, a)
        Shrink: w ← (1 − 1/(CN)) w
[Ratliff+al, NIPS05]
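A sketch of one MMP subgradient step; `features(s, a)` (the per-q-state feature map) is an assumed helper, and mu_human / mu_planner are dicts of visitation frequencies keyed by (s, a).

```python
def mmp_subgradient_step(w, features, mu_human, mu_planner, C, N):
    """w <- w + sum_{s,a} [mu_human(s,a) - mu_planner(s,a)] Phi(x,s,a), then shrink."""
    for (s, a) in set(mu_human) | set(mu_planner):
        diff = mu_human.get((s, a), 0.0) - mu_planner.get((s, a), 0.0)
        for f, v in features(s, a).items():
            w[f] = w.get(f, 0.0) + diff * v
    shrink = 1.0 - 1.0 / (C * N)                 # w <- (1 - 1/(CN)) w
    for f in list(w):
        w[f] *= shrink
    return w
```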
Maximum margin planning movies
[Ratliff+al, NIPS05]
Parsing via inverse optimal control
➢ State space = all partial parse trees over the full sentence labeled “S”
➢ Actions: take a partial parse and split it anywhere in the middle
➢ Transitions: obvious
➢ Terminal states: when there are no actions left
➢ Reward: parse score at completion
[Neu+Szepesvari, MLJ09]
Parsing via inverse optimal control: results
[Figure: parsing accuracy (roughly 60-90) on small, medium, and large training sets for maximum likelihood, projection, perceptron, maximum margin, maximum entropy, policy matching, and apprenticeship learning]
[Neu+Szepesvari, MLJ09]
Learning by Demonstration
Integrating search and learning
➢ Input: Le homme mange l' croissant.    Output: The man ate a croissant.
➢ A classifier 'h' drives the search over partial hypotheses:
    [Figure: a search tree of partial translations, each with its coverage of the input, e.g. Hyp: "The man ate" → "The man ate a" → candidate expansions "The man ate a croissant" / "The man ate a fox" / "The man ate happy" / ...; Cov: Le homme mange l' croissant.]
[D+Marcu, ICML05; D+Langford+Marcu, MLJ09]
Reducing search to classification
➢ Natural chicken and egg problem:
    ➢ Want h to get low expected future loss
    ➢ … on future decisions made by h
    ➢ … and starting from states visited by h
➢ Iterative solution: h(t-1) → h(t)
    [Figure: run the current policy to reach a state (e.g. Hyp: "The man ate a"), score each candidate continuation by its completion loss (Loss = 0, 0.5, 1.2, 1.8), and train h(t) on these examples]
[D+Langford+Marcu, MLJ09]
Reduction for Structured Prediction
➢ Idea: view structured prediction in light of search
    [Figure: a left-to-right lattice over "I sing merrily" with candidate tags D, N, A, R, V at each position; each step of choosing the next tag looks like it could be represented as a weighted multi-class problem]
➢ Loss function: L([N V R], [N V R]) = 0;  L([N V R], [N V V]) = 1/3;  ...
➢ Can we formalize this idea?
[D+Langford+Marcu, MLJ09]
Reducing Structured Prediction
    [Figure: the same "I sing merrily" tagging lattice]
➢ Key Assumption: an optimal policy for the training data
    ➢ Given: input, true output and state; Return: best successor state
    ➢ Weak!
➢ Desired: a good policy on test data (i.e., given only the input string)
[D+Langford+Marcu, MLJ09]
How to Learn in Search
➢ Idea: Train based only on the optimal path (à la MEMM)
➢ Better Idea: Train based only on the optimal policy,
    then train based on the optimal policy + a little learned policy,
    then train based on the optimal policy + a little more learned policy,
    then ...
    eventually only use the learned policy
[D+Langford+Marcu, MLJ09]
How to Learn in Search
    [Figure: on the "I sing merrily" lattice, one decision point whose candidate successors are costed c=0, c=1, c=1, c=1, c=2]
➢ Translating DSP into Searn(DSP, loss, π):
    ➢ Draw x ~ DSP
    ➢ Run π on x to get a path
    ➢ Pick a position uniformly on the path
    ➢ Generate an example with costs given by expected (w.r.t. π) completion costs for “loss”
[D+Langford+Marcu, MLJ09]
Searn
➢ Ingredients for Searn:
    ➢ Input space (X) and output space (Y), data from X
    ➢ Loss function loss(y, y') and features
    ➢ “Optimal” policy π*(x, y0)
➢ Algorithm: Searn-Learn(A, DSP, loss, π*, β)
    1: Initialize: π = π*
    2: while not converged do
    3:     Sample: D ~ Searn(DSP, loss, π)
    4:     Learn: h ← A(D)
    5:     Update: π ← (1-β) π + β h
    6: end while
    7: return π without π*
➢ Theorem: L(π) ≤ L(π*) + 2 avg(loss) T ln T + c(1+ln T)/T
[D+Langford+Marcu, MLJ09]
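A sketch of the Searn loop; all task-specific pieces (rollout, completion costs, available actions, the cost-sensitive learner) are assumed callables supplied by the caller, and unlike line 7 of the algorithm this sketch does not strip π* out of the returned mixture.

```python
import random

def searn_train(data, pi_star, learner, rollout, completion_cost, actions,
                beta=0.1, n_iters=10):
    """pi_star(x, state) -> action; rollout(x, policy) -> states visited on x;
    completion_cost(x, y, state, a, policy) -> expected loss of action a then policy;
    actions(x, state) -> available actions; learner(D) -> h(x, state)."""
    learned = []                                   # h from each iteration

    def pi(x, state):
        # the mixture pi <- (1-beta) pi + beta h, unrolled over all iterations
        for h in reversed(learned):
            if random.random() < beta:
                return h(x, state)
        return pi_star(x, state)

    for _ in range(n_iters):
        D = []
        for x, y in data:
            state = random.choice(rollout(x, pi))  # pick a position uniformly on the path
            costs = {a: completion_cost(x, y, state, a, pi) for a in actions(x, state)}
            D.append((x, state, costs))            # one cost-sensitive example
        learned.append(learner(D))
    return pi
```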
But what about demonstrations?
➢ What did we assume before?
    ➢ Key Assumption: an optimal policy for the training data
    ➢ Given: input, true output and state; Return: best successor state
➢ We can have a human (or system) demonstrate, thus giving us an optimal policy
3d racing game (TuxKart)
➢ Input: the game screen, resized to 25x19 pixels (1425 features)
➢ Output: steering in [-1,1]
[Ross+Gordon+Bagnell, AIStats11]
DAgger: Dataset Aggregation
➢ Collect trajectories from the expert π*
    ➢ Dataset D0 = { ( s, π*(s) ) | s ~ π* }
    ➢ Train π1 on D0
➢ Collect new trajectories from π1
    ➢ But let the expert steer!
    ➢ Dataset D1 = { ( s, π*(s) ) | s ~ π1 }
    ➢ Train π2 on D0 ∪ D1
➢ In general:
    ➢ Dn = { ( s, π*(s) ) | s ~ πn }
    ➢ Train πn+1 on ∪_{i≤n} Di
➢ Guarantee: if N = T log T, then L(πn) ≤ T εN + O(1) for some n, where εN is the error of the learned classifier
[Ross+Gordon+Bagnell, AIStats11]
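A sketch of the DAgger loop; `expert`, `run_policy`, and `learner` are assumed callables (the expert labels every visited state, whoever visited it).

```python
def dagger_train(expert, learner, run_policy, n_rounds=10):
    """expert(s) -> action; run_policy(pi) -> states visited by pi; learner(D) -> policy."""
    D = [(s, expert(s)) for s in run_policy(expert)]      # D0: states from the expert
    pi = learner(D)                                       # pi_1
    for _ in range(n_rounds - 1):
        # collect states by running the learned policy, but label them with the expert
        D += [(s, expert(s)) for s in run_policy(pi)]
        pi = learner(D)                                   # train pi_{n+1} on the union
    return pi
```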