Statistical Learning Theory in Reinforcement Learning & Approximate Dynamic Programming
A. Lazaric & M. Ghavamzadeh (INRIA Lille – Team SequeL)
ICML 2012, June 26, 2012
Sequential Decision-Making under Uncertainty

How can I...
- move around in the physical world (e.g., driving, navigation)?
- play and win a game?
- retrieve information over the web?
- perform medical diagnosis and treatment?
- maximize the throughput of a factory?
- optimize the performance of a rescue team?
Reinforcement Learning (RL)
- RL: a class of learning problems in which an agent interacts with a dynamic, stochastic, and incompletely known environment
- Goal: learn an action-selection strategy, or policy, to optimize some measure of its long-term performance
- Interaction: modeled as an MDP or a POMDP
Algorithms: RL algorithms are based on the two celebrated dynamic programming algorithms, policy iteration and value iteration.
A Bit of History
- formulation of the problem: optimal control, state, value function, Bellman equations, etc.
- dynamic programming algorithms: policy iteration and value iteration, with proofs of convergence to an optimal policy
- approximate dynamic programming
- performance evaluation: how close is the obtained solution to an optimal one?
- asymptotic analysis: the performance with an infinite number of samples
...but in real problems we always have a finite number of samples.
Motivation
What about the performance with a finite number of samples?
- approximate dynamic programming (ADP): asymptotic analysis vs. finite-sample analysis
- finite-sample analysis of ADP algorithms:
  - error at each iteration of the algorithm: the problem at each iteration is formulated as regression, classification, or a fixed-point problem, and tools from statistical learning theory are used to bound its error
  - how the error propagates through the iterations of the algorithm
This Tutorial
[Diagram comparing supervised learning and sequential decision-making]
- Problem: regression, classification (supervised learning) vs. ADP (sequential decision-making)
- Algorithm: least-squares, SVM, ... vs. API, AVI
- Tools for analysis: statistical learning theory (SLT)
- This tutorial: from asymptotic analysis to finite-sample analysis of ADP algorithms
Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion
Preliminaries
Markov Decision Process (MDP)
- An MDP M is a tuple ⟨X, A, r, p, γ⟩:
- The state space X is a bounded closed subset of R^d.
- The set of actions A is finite (|A| < ∞).
- The reward function r : X × A → R is bounded by R_max.
- The transition model p(·|x, a) is a distribution over X.
- γ ∈ (0, 1) is a discount factor.
- A policy is a mapping from states to actions: π(x) ∈ A.
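To make these objects concrete, here is a minimal Python sketch (not from the tutorial) of how a finite MDP ⟨X, A, r, p, γ⟩ can be stored as arrays; the chain structure, rewards, and discount factor are invented for the example.

```python
import numpy as np

# A tiny made-up MDP with 3 states and 2 actions, stored as arrays:
# P[a, x, x'] = transition probability, R[x, a] = reward, gamma = discount.
n_states, n_actions, gamma = 3, 2, 0.9

P = np.array([
    # action 0: tend to stay
    [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    # action 1: tend to move right
    [[0.1, 0.9, 0.0],
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])   # reward 1 in the rightmost state

# A deterministic policy is just a mapping state -> action.
policy = np.array([1, 1, 0])
print("policy:", policy, "| transition rows sum to 1:", np.allclose(P.sum(axis=2), 1.0))
```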
Value Function
For a policy π:
- Value function V^π : X → R
$$V^\pi(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r\big(X_t, \pi(X_t)\big) \,\Big|\, X_0 = x\Big]$$
- Action-value function Q^π : X × A → R
$$Q^\pi(x, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(X_t, A_t) \,\Big|\, X_0 = x,\, A_0 = a\Big]$$
Notation: Bellman Operator
- Bellman operator for policy π, T^π : B(X; V_max) → B(X; V_max):
$$(T^\pi V)(x) = r\big(x, \pi(x)\big) + \gamma \int_X p\big(dy \mid x, \pi(x)\big)\, V(y)$$
- V^π is the unique fixed point of the Bellman operator: T^π V^π = V^π.
- The action-value function Q^π is defined as
$$Q^\pi(x, a) = r(x, a) + \gamma \int_X p(dy \mid x, a)\, V^\pi(y)$$
B(X; V_max) is the space of functions on X bounded by V_max.
Optimal Value Function and Optimal Policy
- Optimal value function: V^*(x) = sup_π V^π(x), for all x ∈ X.
- Optimal action-value function: Q^*(x, a) = sup_π Q^π(x, a), for all x ∈ X and a ∈ A.
- A policy π is optimal if V^π(x) = V^*(x) for all x ∈ X.
Notation: Bellman Optimality Operator
- Bellman optimality operator T : B(X; V_max) → B(X; V_max):
$$(T V)(x) = \max_{a \in A}\Big[ r(x, a) + \gamma \int_X p(dy \mid x, a)\, V(y)\Big]$$
- V^* is the unique fixed point of the Bellman optimality operator: T V^* = V^*.
- The optimal action-value function Q^* is defined as
$$Q^*(x, a) = r(x, a) + \gamma \int_X p(dy \mid x, a)\, V^*(y)$$
Properties of the Bellman Operators
- Monotonicity: if V_1 ≤ V_2 component-wise, then T^π V_1 ≤ T^π V_2 and T V_1 ≤ T V_2, for all V_1, V_2 ∈ B(X; V_max).
- Max-norm contraction:
$$\|T^\pi V_1 - T^\pi V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty, \qquad \|T V_1 - T V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$$
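As a quick numerical sanity check of the contraction property (an illustrative sketch on a randomly generated finite MDP, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random finite MDP: transition kernel P[a, x, :] and rewards R[x, a].
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def bellman_optimality(V):
    # (T V)(x) = max_a [ r(x,a) + gamma * sum_y p(y|x,a) V(y) ]
    return np.max(R + gamma * np.einsum('axy,y->xa', P, V), axis=1)

V1, V2 = rng.normal(size=nS), rng.normal(size=nS)
lhs = np.max(np.abs(bellman_optimality(V1) - bellman_optimality(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(f"||TV1 - TV2||_inf = {lhs:.4f} <= gamma * ||V1 - V2||_inf = {rhs:.4f}")
```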
Dynamic Programming
Dynamic Programming Algorithms: Value Iteration
- Start with an arbitrary action-value function Q_0.
- At each iteration k: Q_{k+1} = T Q_k.

Convergence: lim_{k→∞} V_k = V^* (the same argument applies to Q_k), since
$$\|V^* - V_{k+1}\|_\infty = \|T V^* - T V_k\|_\infty \le \gamma \|V^* - V_k\|_\infty \le \gamma^{k+1}\|V^* - V_0\|_\infty \xrightarrow{k \to \infty} 0.$$
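A minimal sketch of tabular value iteration on a randomly generated finite MDP (illustrative only), showing the geometric decrease of the successive differences:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 6, 2, 0.9
# Randomly generated finite MDP (illustrative only).
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def T(Q):
    # (T Q)(x, a) = r(x, a) + gamma * E_{x'~p(.|x,a)}[ max_a' Q(x', a') ]
    return R + gamma * np.einsum('axy,y->xa', P, Q.max(axis=1))

Q = np.zeros((nS, nA))
for k in range(60):
    Q_next = T(Q)
    gap = np.abs(Q_next - Q).max()
    if k % 10 == 0:
        print(f"iter {k:2d}  ||Q_k+1 - Q_k||_inf = {gap:.2e}")   # decreases roughly like gamma^k
    Q = Q_next
print("greedy policy:", Q.argmax(axis=1))
```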
Dynamic Programming Algorithms: Policy Iteration
- Start with an arbitrary policy π_0.
- At each iteration k:
  - Policy evaluation: compute Q^{π_k}.
  - Policy improvement: compute the greedy policy w.r.t. Q^{π_k},
$$\pi_{k+1}(x) = (\mathcal{G} Q^{\pi_k})(x) = \arg\max_{a \in A} Q^{\pi_k}(x, a).$$

Convergence: PI generates a sequence of policies with non-decreasing performance (V^{π_{k+1}} ≥ V^{π_k}) and stops after a finite number of iterations with the optimal policy π^*, since
$$V^{\pi_k} = T^{\pi_k} V^{\pi_k} \le T V^{\pi_k} = T^{\pi_{k+1}} V^{\pi_k} \le \lim_{n \to \infty} (T^{\pi_{k+1}})^n V^{\pi_k} = V^{\pi_{k+1}}.$$
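A matching sketch of exact policy iteration on a randomly generated finite MDP (assumptions: tabular representation and exact policy evaluation by solving the linear system), stopping when the greedy policy no longer changes:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 6, 2, 0.9
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def evaluate(pi):
    # Solve V = r_pi + gamma * P_pi V exactly, then recover Q^pi.
    P_pi = P[pi, np.arange(nS), :]          # P_pi[x, :] = p(.|x, pi(x))
    r_pi = R[np.arange(nS), pi]
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return R + gamma * np.einsum('axy,y->xa', P, V)   # Q^pi(x, a)

pi = np.zeros(nS, dtype=int)                 # arbitrary initial policy
for k in range(100):
    Q = evaluate(pi)                         # policy evaluation
    pi_next = Q.argmax(axis=1)               # greedy policy improvement
    if np.array_equal(pi_next, pi):
        break
    pi = pi_next
print("optimal policy after", k, "iterations:", pi)
```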
Approximate Dynamic Programming
Approximate Dynamic Programming & Batch Reinforcement Learning
Approximate Dynamic Programming Algorithms: Value Iteration
- Start with an arbitrary action-value function Q_0.
- At each iteration k: Q_{k+1} = T Q_k.

What if we can only compute Q_{k+1} ≈ T Q_k? Does the contraction argument still give
$$\|Q^* - Q_{k+1}\| \le \gamma \|Q^* - Q_k\| \;?$$
Approximate Dynamic Programming Algorithms: Policy Iteration
- Start with an arbitrary policy π_0.
- At each iteration k:
  - Policy evaluation: compute Q^{π_k}.
  - Policy improvement: compute the greedy policy w.r.t. Q^{π_k}, π_{k+1}(x) = (G Q^{π_k})(x) = arg max_{a∈A} Q^{π_k}(x, a).

What if we cannot compute Q^{π_k} exactly and instead compute an approximation Q̂^{π_k} ≈ Q^{π_k}? Then
$$\pi_{k+1}(x) = \arg\max_{a \in A} \hat{Q}^{\pi_k}(x, a) \ne (\mathcal{G} Q^{\pi_k})(x)
\;\overset{?}{\longrightarrow}\;
V^{\pi_{k+1}} \ge V^{\pi_k},$$
i.e., the policy-improvement guarantee is no longer assured.
Error at Each Iteration (AVI)
Given a budget of B samples and an approximation space F, AVI computes Q_{k+1} ≈ T Q_k.
Error at iteration k:
$$\|T Q_k - Q_{k+1}\|_{p,\rho} \le f(B, \mathcal{F}) \quad \text{w.h.p.}$$
Error at Each Iteration (API)
Given a budget of B samples and an approximation space F, API computes Q̂^{π_k} ≈ Q^{π_k} and π_{k+1} = G Q̂^{π_k} ≈ G π_k.
Error at iteration k:
$$\|Q^{\pi_k} - \hat{Q}^{\pi_k}\|_{p,\rho} \le f(B, \mathcal{F}) \quad \text{w.h.p.}$$
Final Performance Bound
Final objective: bound the error after K iterations of the algorithm,
$$\|V^* - V^{\pi_K}\|_{p,\mu} \le f(B, \mathcal{F}, K) \quad \text{w.h.p.},$$
where π_K is the policy computed by the algorithm after K iterations.

Error propagation: how the error at each iteration propagates through the iterations of the algorithm.
SLT in RL & ADP
- Supervised learning methods (regression, classification) appear in the inner loop of ADP algorithms (performance at each iteration).
- Tools from SLT that are used to analyze supervised learning methods can be used in RL and ADP (e.g., how many samples are required to achieve a certain performance).

What makes RL more challenging?
- The objective is not always to recover a target function from its noisy observations (fixed point vs. regression).
- The target sometimes has to be approximated from sample trajectories (non-i.i.d. samples).
- Propagation of error (control problem).
- The choice of the sampling distribution ρ (exploration problem).
Tools from Statistical Learning Theory
Objective of the Section
- Introduce the theoretical tools used to derive the error bounds at each iteration.
- Understand the relationship between accuracy, number of samples, and confidence.
Concentration Inequalities
The Chernoff–Hoeffding Bound
Remark: all the learning algorithms use random samples instead of the actual distributions.
Question: how reliable is the solution learned from a finite number of random samples?
Theorem (Chernoff–Hoeffding)
Let X_1, ..., X_n be i.i.d. samples from a distribution bounded in [a, b]. Then for any ε > 0,
$$\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > \varepsilon\right] \le 2\exp\left(-\frac{2 n \varepsilon^2}{(b-a)^2}\right).$$
Here ε is the accuracy, the event on the left is the deviation, and the right-hand side controls the confidence.
Theorem (Chernoff–Hoeffding, confidence form)
Let X_1, ..., X_n be i.i.d. samples from a distribution bounded in [a, b]. Then for any δ ∈ (0, 1),
$$\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}\right] \le \delta.$$
Theorem (Chernoff–Hoeffding, sample-size form)
Let X_1, X_2, ... be i.i.d. samples from a distribution bounded in [a, b]. Then for any δ ∈ (0, 1) and ε > 0,
$$\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\Big| > \varepsilon\right] \le \delta
\quad \text{if} \quad
n \ge \frac{(b-a)^2 \log(2/\delta)}{2\varepsilon^2}.$$
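An illustrative numeric check of the sample-size form of the Chernoff–Hoeffding bound (the values of ε, δ, and the Bernoulli mean are made up for the example):

```python
import numpy as np

a, b = 0.0, 1.0          # bounded support (here: Bernoulli samples)
eps, delta = 0.05, 0.05  # made-up accuracy and confidence levels

# n >= (b - a)^2 log(2/delta) / (2 eps^2)
n = int(np.ceil((b - a) ** 2 * np.log(2 / delta) / (2 * eps ** 2)))
print("required n:", n)   # 738 for these values

rng = np.random.default_rng(0)
p = 0.3                                  # true mean of the Bernoulli samples
trials = 2000
means = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
emp_fail = np.mean(np.abs(means - p) > eps)
print(f"empirical P[|mean - E| > eps] = {emp_fail:.4f} (bound: {delta})")
```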
The Azuma Bound
Remark: in ADP and RL, the samples are not necessarily i.i.d. but may be generated along trajectories.
Question: how can the previous results be extended to non-i.i.d. samples?
A sequence of random variables X_1, X_2, ... is a martingale difference sequence if for any t
$$\mathbb{E}[X_{t+1} \mid X_1, \ldots, X_t] = 0.$$

Theorem (Azuma)
Let X_1, ..., X_n be a martingale difference sequence bounded in [a, b]. Then for any ε > 0,
$$\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^n X_t\Big| > \varepsilon\right] \le 2\exp\left(-\frac{2 n \varepsilon^2}{(b-a)^2}\right).$$
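A small simulation sketch (not from the slides) of a bounded martingale difference sequence whose scale depends on the past while its conditional mean stays zero; the empirical deviation probability of the average is compared with the Azuma bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, eps = 500, 1000, 0.1

def mds_average(n):
    # X_{t+1} = (fair random sign) * c_t, where the scale c_t in [0.5, 1]
    # depends on the past; hence E[X_{t+1} | X_1..X_t] = 0 and |X_t| <= 1.
    X = np.zeros(n)
    for t in range(n):
        c_t = 0.5 + 0.5 * abs(X[:t].mean()) if t > 0 else 1.0
        sign = 1.0 if rng.random() < 0.5 else -1.0
        X[t] = sign * c_t
    return X.mean()

avgs = np.array([mds_average(n) for _ in range(trials)])
emp = np.mean(np.abs(avgs) > eps)
bound = 2 * np.exp(-2 * n * eps ** 2 / (1 - (-1)) ** 2)   # Azuma with [a, b] = [-1, 1]
print(f"empirical P[|avg| > eps] = {emp:.4f}  <=  Azuma bound = {bound:.4f}")
```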
Functional Concentration Inequalities
The Functional Chernoff–Hoeffding Bound
Remark: the learning algorithm returns the empirically best hypothesis from a hypothesis set (e.g., a value function, a policy).
Question: how do the previous results extend to the case of random hypotheses in a hypothesis set?
Theorem
Let X_1, ..., X_n be i.i.d. samples from an arbitrary distribution P on X, and let F be a set of functions bounded in [a, b]. Then for any fixed f ∈ F and any ε > 0,
$$\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right] \le 2\exp\left(-\frac{2 n \varepsilon^2}{(b-a)^2}\right).$$
Remark: usually we do not know in advance which function f the learning algorithm will return (it is random!).

Theorem (?)
Let X_1, ..., X_n be i.i.d. samples from an arbitrary distribution P on X, and F a set of functions bounded in [a, b]. Then for any ε > 0,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right] \le \;???$$
The Union Bound
Also known as Boole's inequality, Bonferroni inequality, etc.

Theorem
Let A_1, A_2, ... be a countable collection of events. Then
$$\mathbb{P}\Big[\bigcup_i A_i\Big] \le \sum_i \mathbb{P}[A_i].$$
For a finite set F = {f_1, ..., f_N}, the event in question is a union:
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right]
= \mathbb{P}\left[\bigcup_{j=1}^{N}\Big\{\Big|\frac{1}{n}\sum_{t=1}^n f_j(X_t) - \mathbb{E}[f_j(X_1)]\Big| > \varepsilon\Big\}\right].$$
Theorem
Let X_1, ..., X_n be i.i.d. samples from an arbitrary distribution P on X, and F a finite set of functions bounded in [a, b] with |F| = N. Then for any δ ∈ (0, 1),
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}\right]
\le N \max_{f \in \mathcal{F}} \mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}\right] \le N\delta.$$
Equivalently, for any δ ∈ (0, 1),
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log(2N/\delta)}{2n}}\right] \le \delta.$$
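An illustrative simulation of the finite-class bound (the threshold functions and the sampling distribution are made up): the maximal deviation over the N functions is compared with the (b − a)√(log(2N/δ)/(2n)) radius.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, delta = 500, 50, 0.05
thresholds = np.linspace(0.01, 0.99, N)      # f_j(x) = 1{x <= t_j}, values in [0, 1]

radius = np.sqrt(np.log(2 * N / delta) / (2 * n))   # (b - a) = 1

trials, failures = 1000, 0
for _ in range(trials):
    X = rng.random(n)                              # X ~ Uniform(0, 1)
    emp = (X[:, None] <= thresholds).mean(axis=0)  # empirical means of the f_j
    true = thresholds                              # E[f_j(X)] = t_j
    failures += np.max(np.abs(emp - true)) > radius
print(f"P[ sup_j |emp - true| > radius ] ~ {failures / trials:.4f} (bound: {delta})")
```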
Problem: in general F contains infinitely many functions (e.g., the set of linear classifiers).
The Symmetrization Trick
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right]
\le 2\,\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \frac{1}{n}\sum_{t=1}^n f(X'_t)\Big| > \frac{\varepsilon}{2}\right],$$
where the ghost samples {X'_t}_{t=1}^n are drawn independently from P.
The VC Dimension
Not all infinities are the same...
[Figure: two function classes of very different richness fitted to the same data]
Let us consider a binary hypothesis space F = {f : X → {0, 1}}.
The VC Dimension (cont'd)
How many different predictions can a space F produce over n distinct inputs?
[Figures: labelings of two and three points in the plane produced by linear classifiers]
The VC dimension of a linear classifier in dimension 2 is VC(F) = 3.
Let S = (x_1, ..., x_d) be an arbitrary sequence of points; then
$$\Pi_S(\mathcal{F}) = \big\{(f(x_1), \ldots, f(x_d)) : f \in \mathcal{F}\big\}$$
is the set of all the possible ways the d points can be classified by hypotheses in F.

Definition
A set S is shattered by a hypothesis space F if |Π_S(F)| = 2^d.
Definition (VC Dimension)
The VC dimension of a hypothesis space F is
$$VC(\mathcal{F}) = \max\big\{d \;\big|\; \exists S, |S| = d, \; |\Pi_S(\mathcal{F})| = 2^d\big\}.$$

Lemma (Sauer's Lemma)
Let F be a hypothesis space with VC dimension d. Then for any sequence of n > d points S = (x_1, ..., x_n),
$$|\Pi_S(\mathcal{F})| \le \sum_{i=0}^{d} \binom{n}{i} \le n^d.$$
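A quick numeric illustration of Sauer's lemma (the values are made up): for VC dimension d, the growth-function bound Σ_{i≤d} C(n, i) is compared with the cruder n^d and with the total number 2^n of labelings.

```python
from math import comb

d = 3          # e.g., linear classifiers in the plane have VC dimension 3
for n in (5, 10, 20, 50):
    sauer = sum(comb(n, i) for i in range(d + 1))   # Sauer's bound on |Pi_S(F)|
    print(f"n={n:3d}  Sauer bound={sauer:8d}  n^d={n**d:8d}  2^n={2**n}")
```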
The Functional CH Bound for Binary Spaces
Question: how many distinct values can f ∈ F (with F a binary space) take on the 2n samples appearing in
$$2\,\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \frac{1}{n}\sum_{t=1}^n f(X'_t)\Big| > \frac{\varepsilon}{2}\right]?$$
If VC(F) = d and 2n > d, then by Sauer's lemma the answer is at most (2n)^d!
Theorem
Let X_1, ..., X_n be i.i.d. samples from an arbitrary distribution P on X, and F a set of binary functions with VC(F) = d. Then for any δ ∈ (0, 1),
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > \sqrt{\frac{\log(2N/\delta)}{2n}}\right] \le 2\delta,$$
with N = (2n)^d.
A simplified reading of the previous bound: for any set of n i.i.d. samples and any binary function f ∈ F with VC(F) = d,
$$\Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| \le O\left(\sqrt{\frac{d \log(n/\delta)}{n}}\right)$$
with probability 1 − δ (w.r.t. the randomness of the samples).
Pollard's Inequality
Extension: how does the previous result extend to a real-valued space F?
Question: how many values can f ∈ F (with F a real-valued space) take on 2n samples?
Answer: an infinite number of values...
Covering
Observation: we only need an accuracy of order ε.
Question: how many functions from F do we need to achieve an accuracy of order ε on 2n samples?
A space F_ε ⊂ F is an ε-cover of F on the states {x_t}_{t=1}^n if
$$\forall f \in \mathcal{F}, \;\exists f' \in \mathcal{F}_\varepsilon : \;\frac{1}{n}\sum_{t=1}^n \big|f(x_t) - f'(x_t)\big| \le \varepsilon.$$
The covering number of F is the size of the smallest such cover,
$$\mathcal{N}\big(\mathcal{F}, \varepsilon, \{x_t\}_{t=1}^n\big) = |\mathcal{F}_\varepsilon|.$$
We build an (ε/8)-cover of F on the states {X_t}_{t=1}^n ∪ {X'_t}_{t=1}^n; thus we have
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \frac{1}{n}\sum_{t=1}^n f(X'_t)\Big| > \frac{\varepsilon}{2}\right]
\le \mathbb{P}\left[\exists f \in \mathcal{F}_{\varepsilon/8} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \frac{1}{n}\sum_{t=1}^n f(X'_t)\Big| > \frac{\varepsilon}{4}\right]$$
$$\le \mathbb{E}\Big[\mathcal{N}\big(\mathcal{F}, \varepsilon/8, \{X_t \cup X'_t\}_{t=1}^n\big)\Big]\;
\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \frac{1}{n}\sum_{t=1}^n f(X'_t)\Big| > \frac{\varepsilon}{4}\right].$$
Theorem (Pollard's Inequality)
Let X_1, ..., X_n be i.i.d. samples from an arbitrary distribution P on X, and F a set of functions bounded in [0, B]. Then for any ε > 0,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^n f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right]
\le 8\,\mathbb{E}\Big[\mathcal{N}\big(\mathcal{F}, \varepsilon/8, \{X_t \cup X'_t\}_{t=1}^n\big)\Big] \exp\left(-\frac{n \varepsilon^2}{64 B^2}\right).$$
The Pseudo-Dimension
Question: how can the covering number be computed?
A real-valued space F has pseudo-dimension d if
$$V(\mathcal{F}) = VC\Big(\big\{(x, y) \mapsto \mathrm{sign}\big(f(x) - y\big) : f \in \mathcal{F}\big\}\Big) = d.$$
For any {x_t}_{t=1}^n,
$$\mathcal{N}\big(\mathcal{F}, \varepsilon, \{x_t\}_{t=1}^n\big) \le O\left(\Big(\frac{B}{\varepsilon}\Big)^{d}\right).$$
Functional Concentration Inequality for the L2-norm
Remark: in some cases we want to consider deviations between different norms.
Example: in least-squares regression the error is measured in L2-norm, so we want to bound the deviation between
$$\Big(\frac{1}{n}\sum_{t=1}^n f(X_t)^2\Big)^{1/2} \quad \text{and} \quad \big(\mathbb{E}\big[f(X)^2\big]\big)^{1/2}.$$
Theorem
Let X_1, ..., X_n be i.i.d. samples from an arbitrary distribution P on X, and F a set of functions bounded in [0, B]. Then for any ε > 0,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big(\frac{1}{n}\sum_{t=1}^n f(X_t)^2\Big)^{1/2} - 2\,\big(\mathbb{E}\big[f(X)^2\big]\big)^{1/2} > \varepsilon\right]
\le 3\,\mathbb{E}\Big[\mathcal{N}_2\big(\mathcal{F}, \tfrac{\sqrt{2}}{24}\varepsilon, \{X_t \cup X'_t\}_{t=1}^n\big)\Big] \exp\left(-\frac{n \varepsilon^2}{288 B^2}\right).$$
Summary
- Learning algorithms use finite random samples ⇒ concentration of averages to expectations.
- Learning algorithms use spaces of functions ⇒ concentration of averages to expectations, uniformly over every function in the space.
A Step-by-step Derivation for Linear FQI
Objective of the Section
- Give a step-by-step derivation of a performance bound for a popular algorithm.
- Show the interplay between the prediction error and its propagation.
Linear Fitted Q-iteration
Linear space used to approximate action-value functions:
$$\mathcal{F} = \Big\{ f_\alpha(x, a) = \sum_{j=1}^{d} \alpha_j\, \varphi_j(x, a),\; \alpha \in \mathbb{R}^d \Big\},$$
with features φ_j : X × A → [0, L] and feature vector φ(x, a) = [φ_1(x, a), ..., φ_d(x, a)]^⊤.
Linear Fitted Q-iteration
Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial function Q̃_0 ∈ F
For k = 1, ..., K:
- Draw n samples (x_i, a_i) i.i.d. from ρ
- Sample x'_i ∼ p(·|x_i, a_i) and r_i = r(x_i, a_i)
- Compute y_i = r_i + γ max_a Q̃_{k−1}(x'_i, a)
- Build the training set {((x_i, a_i), y_i)}_{i=1}^n
- Solve the least-squares problem
$$f_{\hat\alpha_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \big(f_\alpha(x_i, a_i) - y_i\big)^2$$
- Return Q̃_k = Trunc(f_{\hat\alpha_k}) (truncation to [−V_max, V_max])
Return the greedy policy π_K(·) = arg max_a Q̃_K(·, a)
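A compact Python sketch of the Linear FQI loop above, run on a made-up finite MDP with one-hot features (so the "linear" space is just the tabular one); this is an illustration of the pseudo-code, not the authors' implementation, and the MDP, the sampling distribution ρ (uniform), and V_max are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma = 6, 2, 0.9
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))                      # rewards in [0, 1] => Rmax = 1
Vmax = 1.0 / (1 - gamma)

# One-hot features phi(x, a) in R^{nS*nA}: the "linear" space is the tabular one.
d = nS * nA
def phi(x, a):
    v = np.zeros(d); v[x * nA + a] = 1.0; return v

def fqi(K=30, n=1000):
    alpha = np.zeros(d)                       # Q_tilde_0 = 0
    Q = lambda x, a, w: np.clip(w @ phi(x, a), -Vmax, Vmax)   # truncated Q_tilde
    for k in range(K):
        # Draw n state-action pairs i.i.d. from rho (uniform here), then next states.
        xs = rng.integers(nS, size=n); acts = rng.integers(nA, size=n)
        xps = np.array([rng.choice(nS, p=P[a, x]) for x, a in zip(xs, acts)])
        ys = np.array([R[x, a] + gamma * max(Q(xp, b, alpha) for b in range(nA))
                       for x, a, xp in zip(xs, acts, xps)])
        # Least-squares fit of f_alpha to the targets y_i.
        Phi = np.stack([phi(x, a) for x, a in zip(xs, acts)])
        alpha, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
    return np.array([[Q(x, a, alpha) for a in range(nA)] for x in range(nS)])

Q_K = fqi()
print("greedy policy pi_K:", Q_K.argmax(axis=1))
```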
Theoretical Objectives
Objective 1: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution μ,
$$\|Q^* - Q^{\pi_K}\|_\mu \le \;???$$
Error at Each Iteration
The error at each iteration enters through the regression step of Linear FQI: building the training set {((x_i, a_i), y_i)}_{i=1}^n and solving the least-squares problem on it.
Theoretical Objectives
Target: at each iteration we want to approximate Q^k = T Q̃_{k−1}.
Objective 2: derive an intermediate bound on the prediction error [random design],
$$\|Q^k - \tilde{Q}_k\|_\rho \le \;???$$
Target: at each iteration we have samples {(x_i, a_i)}_{i=1}^n (drawn from ρ).
Objective 3: derive an intermediate bound on the prediction error at the samples [deterministic design],
$$\frac{1}{n}\sum_{i=1}^{n}\big(Q^k(x_i, a_i) - \tilde{Q}_k(x_i, a_i)\big)^2 = \|Q^k - \tilde{Q}_k\|_{\hat\rho}^2 \le \;???$$
Theoretical Objectives
Objective 3: ‖Q^k − Q̃_k‖²_ρ̂ ≤ ???  ⇒  Objective 2: ‖Q^k − Q̃_k‖_ρ ≤ ???  ⇒  Objective 1: ‖Q^* − Q^{π_K}‖_μ ≤ ???
Returned solution:
$$f_{\hat\alpha_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \big(f_\alpha(x_i, a_i) - y_i\big)^2$$
Best solution in F:
$$f_{\alpha_k^*} = \arg\inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - Q^k\|_\rho$$
Additional Notation
Given the set of inputs {(x_i, a_i)}_{i=1}^n drawn from ρ:
- Vector space: F_n = {z ∈ R^n : z_i = f_α(x_i, a_i), f_α ∈ F} ⊂ R^n
- Empirical L2-norm:
$$\|f_\alpha\|_{\hat\rho}^2 = \frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i, a_i)^2 = \frac{1}{n}\sum_{i=1}^{n} z_i^2 = \|z\|_n^2$$
- Empirical orthogonal projection:
$$\hat\Pi y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n$$
- Target vector:
$$q_i = Q^k(x_i, a_i) = T\tilde{Q}_{k-1}(x_i, a_i) = r(x_i, a_i) + \gamma \int_X \max_a \tilde{Q}_{k-1}(x', a)\, p(dx' \mid x_i, a_i)$$
- Observed target vector:
$$y_i = r_i + \gamma \max_a \tilde{Q}_{k-1}(x'_i, a)$$
- Noise vector (zero-mean and bounded): ξ_i = q_i − y_i, with E[ξ_i | x_i] = 0 and |ξ_i| ≤ V_max.

[Figure: the target vector q, the observed vector y, and the noise ξ relative to the subspace F_n]
- Optimal solution in F_n:
$$\hat\Pi q = \arg\min_{z \in \mathcal{F}_n} \|q - z\|_n$$
- Returned vector, with q̂_i = f_{α̂_k}(x_i, a_i):
$$\hat q = \hat\Pi y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n$$

[Figure: q, its projection Π̂q, the observed y, and the returned vector q̂ = Π̂y in F_n]
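A numerical illustration (a sketch with synthetic data) of the quantities just defined: with a feature matrix Φ, the least-squares fit is exactly the empirical projection q̂ = Π̂y, and the decomposition ‖q − q̂‖_n ≤ ‖q − Π̂q‖_n + ‖Π̂ξ‖_n can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 5
Phi = rng.normal(size=(n, d))                    # features phi(x_i, a_i)
q = np.sin(Phi @ rng.normal(size=d))             # synthetic target vector q
xi = rng.uniform(-0.5, 0.5, size=n)              # bounded, roughly zero-mean noise
y = q - xi                                       # observed targets, y_i = q_i - xi_i

Pi = Phi @ np.linalg.pinv(Phi)                   # orthogonal projection onto F_n
q_hat, Pi_q, Pi_xi = Pi @ y, Pi @ q, Pi @ xi     # q_hat = Pi y = least-squares fit

norm_n = lambda v: np.sqrt(np.mean(v ** 2))      # empirical ||.||_n norm
lhs = norm_n(q - q_hat)
rhs = norm_n(q - Pi_q) + norm_n(Pi_xi)
print(f"||q - q_hat||_n = {lhs:.4f} <= ||q - Pi q||_n + ||Pi xi||_n = {rhs:.4f}")
```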
Theoretical Analysis
$$\|Q^k - f_{\hat\alpha_k}\|_{\hat\rho}^2 = \|q - \hat q\|_n^2$$
[Figure: decomposition of q − q̂ into q − Π̂q and the projected noise ξ̂]
$$\|q - \hat q\|_n \le \|q - \hat\Pi q\|_n + \|\hat\Pi q - \hat q\|_n = \|q - \hat\Pi q\|_n + \|\hat\xi\|_n$$
$$\underbrace{\|q - \hat q\|_n}_{\text{prediction error}} \;\le\; \underbrace{\|q - \hat\Pi q\|_n}_{\text{approximation error}} \;+\; \underbrace{\|\hat\xi\|_n}_{\text{estimation error}}$$
- Prediction error: distance between the learned function and the target function.
- Approximation error: distance between the best function in F and the target function ⇒ depends on F.
- Estimation error: distance between the best function in F and the learned function ⇒ depends on the samples.
The projected noise ξ̂ = Π̂ξ satisfies
$$\|\hat\xi\|_n^2 = \langle \hat\xi, \hat\xi \rangle = \langle \hat\xi, \xi \rangle.$$
The projected noise belongs to F_n, so there exists f_β ∈ F with f_β(x_i, a_i) = ξ̂_i for all (x_i, a_i). By definition of the inner product,
$$\|\hat\xi\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \xi_i.$$
The noise ξ has zero mean and is bounded in [−V_max, V_max]. Thus, for any fixed f_β ∈ F (the expectation is conditioned on (x_i, a_i)),
$$\mathbb{E}_\xi\Big[\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \xi_i\Big] = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \mathbb{E}_\xi[\xi_i] = 0,$$
$$\Big(\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \xi_i\Big)^2 \le 4 V_{\max}^2\, \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)^2 = 4 V_{\max}^2\, \|f_\beta\|_{\hat\rho}^2,$$
⇒ we can use concentration inequalities.
Problem: f_β is a random function (it is determined by the realization of the noise).
Solution: we need functional concentration inequalities.
Define the space of normalized functions
$$\mathcal{G} = \Big\{ g(\cdot) = \frac{f_\alpha(\cdot)}{\|f_\alpha\|_{\hat\rho}},\; f_\alpha \in \mathcal{F} \Big\}.$$
- By definition, ‖g‖_ρ̂ ≤ 1 for all g ∈ G.
- Since F is a linear space, the pseudo-dimension is V(G) = d + 1.
Applying Pollard's inequality to the space G: for any g ∈ G,
$$\frac{1}{n}\sum_{i=1}^{n} g(x_i, a_i)\, \xi_i \le 4 V_{\max} \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}$$
with probability 1 − δ (w.r.t. the realization of the noise ξ).
By definition of g, for any f_α ∈ F,
$$\frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i, a_i)\, \xi_i \le 4 V_{\max}\, \|f_\alpha\|_{\hat\rho} \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}.$$
For the specific f_β corresponding to ξ̂,
$$\langle \hat\xi, \xi \rangle \le 4 V_{\max}\, \|\hat\xi\|_n \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}.$$
Recalling the objective (‖ξ̂‖²_n = ⟨ξ̂, ξ⟩),
$$\|\hat\xi\|_n^2 \le 4 V_{\max}\, \|\hat\xi\|_n \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}
\;\Rightarrow\;
\|\hat\Pi q - \hat q\|_n \le 4 V_{\max} \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}.$$
[Figure: q, Π̂q, y, q̂, and the projected noise ξ̂ in F_n]

Theorem
At each iteration k, and given a set of state-action pairs {(x_i, a_i)}, Linear FQI returns an approximation q̂ such that
$$\|q - \hat q\|_n \le \|q - \hat\Pi q\|_n + \|\hat\Pi q - \hat q\|_n
\le \|q - \hat\Pi q\|_n + O\Big(V_{\max}\sqrt{\frac{d \log(n/\delta)}{n}}\Big).$$
Moving back from vectors to functions, ‖q − q̂‖_n = ‖Q^k − f_{α̂_k}‖_ρ̂ and ‖q − Π̂q‖_n ≤ ‖Q^k − f_{α*_k}‖_ρ̂, so
$$\|Q^k - f_{\hat\alpha_k}\|_{\hat\rho} \le \|Q^k - f_{\alpha_k^*}\|_{\hat\rho} + O\Big(V_{\max}\sqrt{\frac{d \log(n/\delta)}{n}}\Big).$$
By definition of the truncation (Q̃_k = Trunc(f_{α̂_k})):

Theorem (Objective 3)
At each iteration k, and given a set of state-action pairs {(x_i, a_i)}, Linear FQI returns an approximation Q̃_k such that
$$\|Q^k - \tilde{Q}_k\|_{\hat\rho} \le \|Q^k - f_{\hat\alpha_k}\|_{\hat\rho}
\le \|Q^k - f_{\alpha_k^*}\|_{\hat\rho} + O\Big(V_{\max}\sqrt{\frac{d \log(n/\delta)}{n}}\Big).$$
Remark: to move from Objective 3 to Objective 2 we need to pass from empirical to expected L2-norms.
Since Q̃_k is truncated, it is bounded in [−V_max, V_max], so
$$\|Q^k - \tilde{Q}_k\|_{\rho} \le 2\,\|Q^k - \tilde{Q}_k\|_{\hat\rho} + O\Big(V_{\max}\sqrt{\frac{d \log(n/\delta)}{n}}\Big).$$
The best solution f_{α*_k} is a fixed function in F, so
$$\|Q^k - f_{\alpha_k^*}\|_{\hat\rho} \le 2\,\|Q^k - f_{\alpha_k^*}\|_{\rho} + O\Big(\big(V_{\max} + L\|\alpha_k^*\|\big)\sqrt{\frac{\log(1/\delta)}{n}}\Big).$$
Theorem (Objective 2)
At each iteration k, Linear FQI returns an approximation Q̃_k such that, with probability 1 − δ,
$$\|Q^k - \tilde{Q}_k\|_{\rho} \le 4\,\|Q^k - f_{\alpha_k^*}\|_{\rho}
+ O\Big(\big(V_{\max} + L\|\alpha_k^*\|\big)\sqrt{\frac{\log(1/\delta)}{n}}\Big)
+ O\Big(V_{\max}\sqrt{\frac{d \log(n/\delta)}{n}}\Big).$$
A Step-by-step Derivation for Linear FQI
Error at Each Iteration
Theoretical Analysis
e k ||ρ ≤ 4||Qk − fα∗ ||ρ ||Qk − Q k r log 1/δ + O Vmax + L||α∗k || n r d log n/δ + O Vmax n
Remarks I
No algorithm can do better
I
Constant 4
I
Depends on the space F
I
Changes with the iteration k
A. Lazaric & M. Ghavamzadeh – SLT in RL & ADP
June 26, 2012 - 100/185
A Step-by-step Derivation for Linear FQI
Error at Each Iteration
Theoretical Analysis
e k ||ρ ≤ 4||Qk − fα∗ ||ρ ||Qk − Q k r log 1/δ ∗ + O Vmax + L||αk || n r d log n/δ + O Vmax n
Remarks I
Vanishing to zero as O(n−1/2 )
I
Depends on the features (L) and on the best solution (||αk∗ ||)
A. Lazaric & M. Ghavamzadeh – SLT in RL & ADP
June 26, 2012 - 101/185
A Step-by-step Derivation for Linear FQI
Error at Each Iteration
Theoretical Analysis
e k ||ρ ≤ 4||Qk − fα∗ ||ρ ||Qk − Q k r log 1/δ + O Vmax + L||α∗k || n r d log n/δ + O Vmax n
Remarks I
Vanishing to zero as O(n−1/2 )
I
Depends on the dimensionality of the space (d) and the number of samples (n)
A. Lazaric & M. Ghavamzadeh – SLT in RL & ADP
June 26, 2012 - 102/185
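To make the regression step behind this bound concrete, here is a minimal numpy sketch of one LinearFQI iteration: build the targets y_i = r_i + γ max_a φ(x'_i, a)^T α_{k-1}, solve the least-squares problem, and truncate the resulting estimate to [-Vmax, Vmax]. The function and argument names (linear_fqi_iteration, phi_sa, phi_next_all_actions, ...) are illustrative and not part of the tutorial.

```python
import numpy as np

def linear_fqi_iteration(phi_sa, rewards, phi_next_all_actions, alpha_prev, gamma, v_max):
    """One iteration of linear FQI (illustrative sketch).

    phi_sa:               (n, d)    features of the sampled state-action pairs (x_i, a_i)
    rewards:              (n,)      observed rewards r_i
    phi_next_all_actions: (n, A, d) features of (x'_i, a) for every action a
    alpha_prev:           (d,)      weights defining the previous iterate f_{alpha_{k-1}}
    """
    # Targets y_i = r_i + gamma * max_a phi(x'_i, a)^T alpha_{k-1}
    next_q = phi_next_all_actions @ alpha_prev           # (n, A)
    y = rewards + gamma * next_q.max(axis=1)             # (n,)

    # Ordinary least squares: alpha_k = argmin_alpha sum_i (phi(x_i,a_i)^T alpha - y_i)^2
    alpha_k, *_ = np.linalg.lstsq(phi_sa, y, rcond=None)

    # Truncation: the returned estimate is clipped to [-Vmax, Vmax]
    def q_tilde(phi):
        return np.clip(phi @ alpha_k, -v_max, v_max)

    return alpha_k, q_tilde
```

The clipping step is what the theorem calls truncation: it keeps the per-iteration estimate bounded, which is what lets the empirical-norm guarantee be transferred to the expected norm.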
A Step-by-step Derivation for Linear FQI
Error Propagation

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
  - Error at Each Iteration
  - Error Propagation
  - The Final Bound
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion
A Step-by-step Derivation for Linear FQI
Error Propagation
Theoretical Analysis

Objective 1: \|Q^* - Q^{\pi_K}\|_{\mu}
- Problem 1: the test norm \mu is different from the sampling norm \rho
- Problem 2: we have bounds for \tilde Q^k, not for the performance of the corresponding \pi_k
- Problem 3: we have bounds for one single iteration
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

- Bellman operators:
  T Q(x,a) = r(x,a) + \gamma \int_X \max_{a'} Q(x',a')\, p(dx'|x,a)
  T^{\pi} Q(x,a) = r(x,a) + \gamma \int_X Q(x',\pi(x'))\, p(dx'|x,a)
- Optimal action–value function: Q^* = T Q^*
- Greedy policy: \pi(x) = \arg\max_a Q(x,a), \quad \pi^*(x) = \arg\max_a Q^*(x,a)
- Prediction error: \varepsilon_k = Q^k - \tilde Q^k
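As a concrete reference for the operators above, here is a small numpy sketch of T, T^π, and the greedy policy on a finite MDP with S states and A actions; the array layout (R of shape (S, A), P of shape (S, A, S), a policy given as one action index per state) is an assumption made for illustration.

```python
import numpy as np

def bellman_optimality_operator(Q, R, P, gamma):
    """(T Q)(x,a) = r(x,a) + gamma * sum_x' P(x'|x,a) * max_a' Q(x',a').

    Q: (S, A) action-value function, R: (S, A) rewards, P: (S, A, S) transition kernel.
    """
    return R + gamma * P @ Q.max(axis=1)

def bellman_operator(Q, R, P, gamma, pi):
    """(T^pi Q)(x,a) = r(x,a) + gamma * sum_x' P(x'|x,a) * Q(x', pi(x'))."""
    q_pi = Q[np.arange(Q.shape[0]), pi]      # Q(x', pi(x')) for every state x'
    return R + gamma * P @ q_pi

def greedy_policy(Q):
    """pi(x) = argmax_a Q(x,a)."""
    return Q.argmax(axis=1)
```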
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 1: upper bound on the propagation (problem 3)
By definition T Q^k \ge T^{\pi^*} Q^k, so
Q^* - \tilde Q^{k+1} = \underbrace{T^{\pi^*} Q^*}_{\text{fixed point}} - T^{\pi^*} \tilde Q^k + \underbrace{T^{\pi^*} \tilde Q^k - T \tilde Q^k}_{\le 0} + \underbrace{\varepsilon_k}_{\text{error}}
\le \gamma P^{\pi^*} (Q^* - \tilde Q^k) + \varepsilon_k \qquad \text{(recursion)}

Unrolling the recursion over K iterations:
Q^* - \tilde Q^K \le \sum_{k=0}^{K-1} \gamma^{K-k-1} (P^{\pi^*})^{K-k-1} \varepsilon_k + \gamma^K (P^{\pi^*})^K (Q^* - \tilde Q^0)
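The inequality above uses the fact that T^{π*} is affine, so applying it to two functions and taking the difference leaves only a γP^{π*} factor. A short LaTeX spelling-out of this step, in the same notation as the slides:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% T^\pi is affine: T^\pi Q_1 - T^\pi Q_2 = \gamma P^\pi (Q_1 - Q_2).
\begin{align*}
T^{\pi^*} Q^* - T^{\pi^*} \tilde Q^k
  &= \bigl(r + \gamma P^{\pi^*} Q^*\bigr) - \bigl(r + \gamma P^{\pi^*} \tilde Q^k\bigr)
   = \gamma P^{\pi^*}\bigl(Q^* - \tilde Q^k\bigr),\\
Q^* - \tilde Q^{k+1}
  &\le \gamma P^{\pi^*}\bigl(Q^* - \tilde Q^k\bigr) + \varepsilon_k .
\end{align*}
\end{document}
```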
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 2: lower bound on the propagation (problem 3)
By definition T Q^* \ge T^{\pi_k} Q^*, so
Q^* - \tilde Q^{k+1} = \underbrace{T Q^* - T^{\pi_k} Q^*}_{\ge 0} + T^{\pi_k} Q^* - \underbrace{T \tilde Q^k}_{\text{greedy pol.}} + \underbrace{\varepsilon_k}_{\text{error}}
\ge T^{\pi_k} Q^* - T^{\pi_k} \tilde Q^k + \varepsilon_k = \gamma P^{\pi_k} (Q^* - \tilde Q^k) + \varepsilon_k \qquad \text{(recursion)}

Unrolling the recursion over K iterations:
Q^* - \tilde Q^K \ge \sum_{k=0}^{K-1} \gamma^{K-k-1} \big(P^{\pi_{K-1}} P^{\pi_{K-2}} \cdots P^{\pi_{k+1}}\big) \varepsilon_k + \gamma^K \big(P^{\pi_{K-1}} P^{\pi_{K-2}} \cdots P^{\pi_0}\big)(Q^* - \tilde Q^0)
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 3: from \tilde Q^K to \pi_K (problem 2)
By definition T^{\pi_K} \tilde Q^K = T \tilde Q^K \ge T^{\pi^*} \tilde Q^K, so
Q^* - Q^{\pi_K} = \underbrace{T^{\pi^*} Q^*}_{\text{fixed point}} - T^{\pi^*} \tilde Q^K + \underbrace{T^{\pi^*} \tilde Q^K - T^{\pi_K} \tilde Q^K}_{\le 0} + T^{\pi_K} \tilde Q^K - \underbrace{T^{\pi_K} Q^{\pi_K}}_{\text{fixed point}}
\le \gamma P^{\pi^*} (Q^* - \tilde Q^K) + \gamma P^{\pi_K} (\tilde Q^K - Q^* + Q^* - Q^{\pi_K})

\Rightarrow (I - \gamma P^{\pi_K})(Q^* - Q^{\pi_K}) \le \gamma (P^{\pi^*} - P^{\pi_K})(Q^* - \tilde Q^K)
\Rightarrow Q^* - Q^{\pi_K} \le \gamma (I - \gamma P^{\pi_K})^{-1} (P^{\pi^*} - P^{\pi_K})(Q^* - \tilde Q^K)
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 3: plugging the error propagation (problem 2)
Q^* - Q^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1} \Big\{ \sum_{k=0}^{K-1} \gamma^{K-k} \big[(P^{\pi^*})^{K-k} - P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}}\big] \varepsilon_k + \gamma^{K+1} \big[(P^{\pi^*})^{K+1} - P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_0}\big](Q^* - \tilde Q^0) \Big\}
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 4: rewrite in compact form
Q^* - Q^{\pi_K} \le \frac{2\gamma(1 - \gamma^{K+1})}{(1-\gamma)^2} \Big[ \sum_{k=0}^{K-1} \alpha_k A_k |\varepsilon_k| + \alpha_K A_K |Q^* - \tilde Q^0| \Big]
- \alpha_k: weights
- A_k: summarize the P^{\pi_i} terms
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 5: take the norm w.r.t. the test distribution \mu
\|Q^* - Q^{\pi_K}\|^2_{\mu} = \int \mu(dx,da) \big(Q^*(x,a) - Q^{\pi_K}(x,a)\big)^2
\le \Big(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big)^2 \int \Big[\sum_{k=0}^{K-1} \alpha_k A_k |\varepsilon_k| + \alpha_K A_K |Q^* - \tilde Q^0|\Big]^2 (x,a)\, \mu(dx,da)
\le \Big(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big)^2 \int \Big[\sum_{k=0}^{K-1} \alpha_k A_k \varepsilon_k^2 + \alpha_K A_K (Q^* - \tilde Q^0)^2\Big] (x,a)\, \mu(dx,da)
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Focusing on one single term:
\mu A_k = \frac{1-\gamma}{2}\, \mu (I - \gamma P^{\pi_K})^{-1} \big[(P^{\pi^*})^{K-k} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}}\big]
= \frac{1-\gamma}{2} \sum_{m \ge 0} \gamma^m \mu (P^{\pi_K})^m \big[(P^{\pi^*})^{K-k} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}}\big]
= \frac{1-\gamma}{2} \Big[ \sum_{m \ge 0} \gamma^m \mu (P^{\pi_K})^m (P^{\pi^*})^{K-k} + \sum_{m \ge 0} \gamma^m \mu (P^{\pi_K})^m P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \Big]
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Assumption (concentrability terms):
c(m) = \sup_{\pi_1 \dots \pi_m} \Big\| \frac{d(\mu P^{\pi_1} \cdots P^{\pi_m})}{d\rho} \Big\|_{\infty}, \qquad C_{\mu,\rho} = (1-\gamma)^2 \sum_{m \ge 1} m\, \gamma^{m-1} c(m) < +\infty
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 5 (continued): take the norm w.r.t. the test distribution \mu
\|Q^* - Q^{\pi_K}\|^2_{\mu} \le \Big(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big)^2 \Big[ \sum_{k=0}^{K-1} \alpha_k \frac{1-\gamma}{2} \sum_{m \ge 0} \gamma^m c(m+K-k)\, \|\varepsilon_k\|^2_{\rho} + \alpha_K (2V_{\max})^2 \Big]
A Step-by-step Derivation for Linear FQI
Error Propagation
Propagation of Errors

Step 5 (continued): take the norm w.r.t. the test distribution \mu (problem 1)
\|Q^* - Q^{\pi_K}\|^2_{\mu} \le \Big(\frac{2\gamma}{(1-\gamma)^2}\Big)^2 C_{\mu,\rho} \max_k \|\varepsilon_k\|^2_{\rho} + O\Big(\frac{\gamma^K}{(1-\gamma)^3} V^2_{\max}\Big)
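A tiny numeric sketch of how this bound can be evaluated, assuming the per-iteration errors ||ε_k||_ρ, the concentrability coefficient C_{µ,ρ}, γ and Vmax are given, and setting the constant hidden in the O(·) term to 1 for illustration; the function name is hypothetical.

```python
import numpy as np

def propagated_error_bound(eps_rho, c_mu_rho, gamma, v_max):
    """Evaluate the squared propagation bound
       ||Q* - Q^{pi_K}||_mu^2 <= (2*gamma/(1-gamma)^2)^2 * C_{mu,rho} * max_k ||eps_k||_rho^2
                                 + gamma^K / (1-gamma)^3 * Vmax^2,
    with the constant inside the O(.) term set to 1 for illustration."""
    K = len(eps_rho)
    leading = (2.0 * gamma / (1.0 - gamma) ** 2) ** 2 * c_mu_rho * max(e ** 2 for e in eps_rho)
    init_term = gamma ** K / (1.0 - gamma) ** 3 * v_max ** 2
    return leading + init_term

# Example: 10 iterations with per-iteration error 0.1, C_{mu,rho}=4, gamma=0.9, Vmax=10
print(propagated_error_bound([0.1] * 10, c_mu_rho=4.0, gamma=0.9, v_max=10.0))
```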
A Step-by-step Derivation for Linear FQI
The Final Bound

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
  - Error at Each Iteration
  - Error Propagation
  - The Final Bound
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion
A Step-by-step Derivation for Linear FQI
The Final Bound
Plugging Per–Iteration Regret

\|Q^* - Q^{\pi_K}\|^2_{\mu} \le \Big(\frac{2\gamma}{(1-\gamma)^2}\Big)^2 C_{\mu,\rho} \max_k \|\varepsilon_k\|^2_{\rho} + O\Big(\frac{\gamma^K}{(1-\gamma)^3} V^2_{\max}\Big)

\|\varepsilon_k\|_{\rho} = \|Q^k - \tilde Q^k\|_{\rho} \le 4\|Q^k - f_{\alpha_k^*}\|_{\rho} + O\Big(\big(V_{\max} + L\|\alpha_k^*\|\big)\sqrt{\frac{\log 1/\delta}{n}}\Big) + O\Big(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\Big)
A Step-by-step Derivation for Linear FQI
The Final Bound
Plugging Per–Iteration Regret

The inherent Bellman error:
\|Q^k - f_{\alpha_k^*}\|_{\rho} = \inf_{f \in F} \|Q^k - f\|_{\rho} = \inf_{f \in F} \|T \tilde Q^{k-1} - f\|_{\rho} \le \inf_{f \in F} \|T f_{\alpha_{k-1}} - f\|_{\rho} \le \sup_{g \in F} \inf_{f \in F} \|T g - f\|_{\rho} = d(F, TF)
A Step-by-step Derivation for Linear FQI
The Final Bound
Plugging Per–Iteration Regret

f_{\alpha_k^*} is the orthogonal projection of Q^k onto F w.r.t. \rho:
\Rightarrow \|f_{\alpha_k^*}\|_{\rho} \le \|Q^k\|_{\rho} = \|T \tilde Q^{k-1}\|_{\rho} \le \|T \tilde Q^{k-1}\|_{\infty} \le V_{\max}
A Step-by-step Derivation for Linear FQI
The Final Bound
Plugging Per–Iteration Regret

Gram matrix G_{i,j} = E_{(x,a) \sim \rho}[\varphi_i(x,a) \varphi_j(x,a)], with smallest eigenvalue \omega:
\|f_{\alpha}\|^2_{\rho} = \|\phi^{\top}\alpha\|^2_{\rho} = \alpha^{\top} G \alpha \ge \omega\, \alpha^{\top}\alpha = \omega \|\alpha\|^2
\max_k \|\alpha_k^*\| \le \max_k \frac{\|f_{\alpha_k^*}\|_{\rho}}{\sqrt{\omega}} \le \frac{V_{\max}}{\sqrt{\omega}}
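A small numpy sketch of this argument: estimate the Gram matrix from features of samples drawn from ρ, read off its smallest eigenvalue ω, and turn ||f_α||_ρ ≤ Vmax into a bound on ||α||. The helper names are illustrative.

```python
import numpy as np

def smallest_gram_eigenvalue(phi_samples):
    """phi_samples: (n, d) feature vectors of samples (x_i, a_i) ~ rho.
    Returns the smallest eigenvalue of the empirical Gram matrix
    G_ij = (1/n) sum_t phi_i(x_t, a_t) phi_j(x_t, a_t)."""
    n = phi_samples.shape[0]
    G = phi_samples.T @ phi_samples / n
    return np.linalg.eigvalsh(G).min()

def alpha_norm_bound(v_max, omega):
    """||alpha_k^*|| <= ||f_{alpha_k^*}||_rho / sqrt(omega) <= Vmax / sqrt(omega)."""
    return v_max / np.sqrt(omega)
```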
A Step-by-step Derivation for Linear FQI
The Final Bound
The Final Bound

Theorem
LinearFQI with a space F of d features and n samples at each iteration returns a policy \pi_K after K iterations such that
\|Q^* - Q^{\pi_K}\|_{\mu} \le \frac{2\gamma}{(1-\gamma)^2} \sqrt{C_{\mu,\rho}} \Big[ 4\, d(F, TF) + O\Big(V_{\max}\Big(1 + \frac{L}{\sqrt{\omega}}\Big)\sqrt{\frac{d \log n/\delta}{n}}\Big) \Big] + O\Big(\frac{\gamma^K}{(1-\gamma)^3} V^2_{\max}\Big)

Remarks
- The propagation (and the different norms) makes the problem more complex ⇒ how do we choose the sampling distribution?
- The approximation error is worse than in regression ⇒ how do we adapt to the Bellman operator?
- The dependency on \gamma is worse than at each iteration ⇒ is it possible to avoid it?
- The error decreases exponentially in K ⇒ K \approx \log(1/\varepsilon)/(1-\gamma) iterations suffice for accuracy \varepsilon.
- The smallest eigenvalue \omega of the Gram matrix ⇒ design the features so as to be orthogonal w.r.t. \rho.
- The asymptotic rate O(d/n) is the same as for regression.
A Step-by-step Derivation for Linear FQI
The Final Bound
Summary

- At each iteration FQI solves a regression problem ⇒ least–squares prediction error bound
- The error is propagated through iterations ⇒ propagation of any error
Least-Squares Policy Iteration (LSPI)

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
  - Least-Squares Temporal-Difference Learning (LSTD)
  - LSTD and LSPI Error Bounds
- Classification-based Policy Iteration
- Discussion
Least-Squares Policy Iteration (LSPI)
Finite-Sample Performance Bound of Least-Squares Policy Iteration (LSPI)
Least-Squares Policy Iteration (LSPI)

LSPI is an approximate policy iteration algorithm that uses Least-Squares Temporal-Difference Learning (LSTD) for policy evaluation.
Least-Squares Policy Iteration (LSPI)
Objective of the Section

- give a brief description of the LSTD (policy evaluation) and LSPI (policy iteration) algorithms
- report finite-sample performance bounds for LSTD and LSPI
- describe the main components of these bounds
Least-Squares Policy Iteration (LSPI)
Least-Squares Temporal-Difference Learning (LSTD)

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
  - Least-Squares Temporal-Difference Learning (LSTD)
  - LSTD and LSPI Error Bounds
- Classification-based Policy Iteration
- Discussion
Least-Squares Policy Iteration (LSPI)
Least-Squares Temporal-Difference Learning (LSTD)

Least-Squares Temporal-Difference Learning (LSTD)
- Linear function space F = \{ f : f(\cdot) = \sum_{j=1}^{d} \alpha_j \varphi_j(\cdot) \}, with features \{\varphi_j\}_{j=1}^{d} \in B(X; L) and feature vector \phi : X \to \mathbb{R}^d, \phi(\cdot) = \big(\varphi_1(\cdot), \dots, \varphi_d(\cdot)\big)^{\top}
- V^{\pi} is the fixed point of T^{\pi}: V^{\pi} = T^{\pi} V^{\pi}
- V^{\pi} may not belong to F (V^{\pi} \notin F)
- Best approximation of V^{\pi} in F is \Pi V^{\pi} = \arg\min_{f \in F} \|V^{\pi} - f\| (\Pi is the projection onto F)

[Figure: V^{\pi}, its projection \Pi V^{\pi} onto the space F, and the operator T^{\pi}.]
Least-Squares Policy Iteration (LSPI)
Least-Squares Temporal-Difference Learning (LSTD)

Least-Squares Temporal-Difference Learning (LSTD)
- LSTD searches for the fixed point of \Pi_? T^{\pi} instead (\Pi_? is the projection onto F w.r.t. the L_?-norm)
- \Pi_{\infty} T^{\pi} is a contraction in L_{\infty}-norm, but the L_{\infty}-projection is numerically expensive when the number of states is large or infinite
- LSTD therefore searches for the fixed point of \Pi_{2,\rho} T^{\pi}, where \Pi_{2,\rho}\, g = \arg\min_{f \in F} \|g - f\|_{2,\rho}
Least-Squares Policy Iteration (LSPI)
Least-Squares Temporal-Difference Learning (LSTD)

Least-Squares Temporal-Difference Learning (LSTD)
When the fixed point of \Pi_{\rho} T^{\pi} exists, we call it the LSTD solution:
V_{TD} = \Pi_{\rho} T^{\pi} V_{TD}

[Figure: V^{\pi}, \Pi_{\rho} V^{\pi}, T^{\pi} V_{TD}, and the LSTD fixed point V_{TD} = \Pi_{\rho} T^{\pi} V_{TD} in the space F.]

\langle T^{\pi} V_{TD} - V_{TD}, \varphi_i \rangle_{\rho} = 0, \qquad i = 1, \dots, d
\langle r^{\pi} + \gamma P^{\pi} V_{TD} - V_{TD}, \varphi_i \rangle_{\rho} = 0
\langle r^{\pi}, \varphi_i \rangle_{\rho} - \sum_{j=1}^{d} \langle \varphi_j - \gamma P^{\pi} \varphi_j, \varphi_i \rangle_{\rho}\, \alpha_{TD}^{(j)} = 0
\;\longrightarrow\; A\, \alpha_{TD} = b, \qquad A_{ij} = \langle \varphi_j - \gamma P^{\pi} \varphi_j, \varphi_i \rangle_{\rho}, \quad b_i = \langle r^{\pi}, \varphi_i \rangle_{\rho}

- In general, \Pi_{\rho} T^{\pi} is not a contraction and does not have a fixed point.
- If \rho = \rho^{\pi}, the stationary distribution of \pi, then \Pi_{\rho^{\pi}} T^{\pi} has a unique fixed point.
Least-Squares Policy Iteration (LSPI)
Least-Squares Temporal-Difference Learning (LSTD)

LSTD Algorithm
Proposition (LSTD Performance)
\|V^{\pi} - V_{TD}\|_{\rho^{\pi}} \le \frac{1}{\sqrt{1-\gamma^2}} \inf_{V \in F} \|V^{\pi} - V\|_{\rho^{\pi}}

LSTD Algorithm
- We observe a trajectory generated by following the policy \pi: (X_0, R_0, X_1, R_1, \dots, X_N), where X_{t+1} \sim P\big(\cdot\,|X_t, \pi(X_t)\big) and R_t = r\big(X_t, \pi(X_t)\big)
- We build estimators of the matrix A and vector b:
\hat A_{ij} = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t)\big(\varphi_j(X_t) - \gamma \varphi_j(X_{t+1})\big), \qquad \hat b_i = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t) R_t
- \hat A \hat\alpha_{TD} = \hat b, \qquad \hat V_{TD}(\cdot) = \phi(\cdot)^{\top} \hat\alpha_{TD}
- When n \to \infty, \hat A \to A and \hat b \to b, and thus \hat\alpha_{TD} \to \alpha_{TD} and \hat V_{TD} \to V_{TD}.
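A minimal numpy sketch of the empirical LSTD estimator above, assuming the features φ(X_0), ..., φ(X_N) and the rewards R_0, ..., R_{N-1} of a single trajectory are already available as arrays, and that Â is invertible; the function name is illustrative.

```python
import numpy as np

def lstd(phi, rewards, gamma):
    """Empirical LSTD from a single trajectory following policy pi (illustrative sketch).

    phi:     (N+1, d) features phi(X_0), ..., phi(X_N)
    rewards: (N,)     rewards R_0, ..., R_{N-1}
    Builds A_hat and b_hat as above and solves A_hat alpha = b_hat.
    """
    N = rewards.shape[0]
    phi_t, phi_next = phi[:N], phi[1:N + 1]
    # A_hat_ij = (1/N) sum_t phi_i(X_t) (phi_j(X_t) - gamma * phi_j(X_{t+1}))
    A_hat = phi_t.T @ (phi_t - gamma * phi_next) / N
    # b_hat_i = (1/N) sum_t phi_i(X_t) R_t
    b_hat = phi_t.T @ rewards / N
    alpha_td = np.linalg.solve(A_hat, b_hat)   # assumes A_hat is invertible
    # The value estimate is V_hat(x) = phi(x)^T alpha_td
    return alpha_td
```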
Least-Squares Policy Iteration (LSPI)
LSTD and LSPI Error Bounds

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
  - Least-Squares Temporal-Difference Learning (LSTD)
  - LSTD and LSPI Error Bounds
- Classification-based Policy Iteration
- Discussion
Least-Squares Policy Iteration (LSPI)
LSTD and LSPI Error Bounds

LSTD Error Bound
When the Markov chain induced by the policy under evaluation \pi has a stationary distribution \rho^{\pi} (the Markov chain is ergodic, e.g. \beta-mixing), then:

Theorem (LSTD Error Bound)
Let \tilde V be the truncated LSTD solution computed using n samples along a trajectory generated by following the policy \pi. Then with probability 1 - \delta, we have
\|V^{\pi} - \tilde V\|_{\rho^{\pi}} \le \frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in F} \|V^{\pi} - f\|_{\rho^{\pi}} + O\Big(\sqrt{\frac{d \log(d/\delta)}{n\, \nu}}\Big)

- n = number of samples, d = dimension of the linear function space F
- \nu = the smallest eigenvalue of the Gram matrix \big(\int \varphi_i \varphi_j \, d\rho^{\pi}\big)_{i,j} (assume the eigenvalues of the Gram matrix are strictly positive: existence of the model-based LSTD solution)
- \beta-mixing coefficients are hidden in the O(\cdot) notation
Least-Squares Policy Iteration (LSPI)
LSTD and LSPI Error Bounds

LSTD Error Bound
\|V^{\pi} - \tilde V\|_{\rho^{\pi}} \le \underbrace{\frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in F} \|V^{\pi} - f\|_{\rho^{\pi}}}_{\text{approximation error}} + \underbrace{O\Big(\sqrt{\frac{d \log(d/\delta)}{n\, \nu}}\Big)}_{\text{estimation error}}

- Approximation error: depends on how well the function space F can approximate the value function V^{\pi}
- Estimation error: depends on the number of samples n, the dimension of the function space d, the smallest eigenvalue of the Gram matrix \nu, and the mixing properties of the Markov chain (hidden in the O(\cdot))
Least-Squares Policy Iteration (LSPI)
LSTD and LSPI Error Bounds

LSPI Error Bound
Theorem (LSPI Error Bound)
Let V_{-1} \in \tilde F be an arbitrary initial value function, \tilde V_0, \dots, \tilde V_{K-1} be the sequence of truncated value functions generated by LSPI after K iterations, and \pi_K be the greedy policy w.r.t. \tilde V_{K-1}. Then with probability 1 - \delta, we have
\|V^* - V^{\pi_K}\|_{\mu} \le \frac{4\gamma}{(1-\gamma)^2} \Big\{ \sqrt{C C_{\mu,\rho}} \Big[ c\, E_0(F) + O\Big(\sqrt{\frac{d \log(dK/\delta)}{n\, \nu_{\rho}}}\Big) \Big] + \gamma^{\frac{K-1}{2}} R_{\max} \Big\}

- Approximation error: E_0(F) = \sup_{\pi \in G(\tilde F)} \inf_{f \in F} \|V^{\pi} - f\|_{\rho^{\pi}}
- Estimation error: depends on n, d, \nu_{\rho}, and K
- Initialization error: error due to the choice of the initial value function or initial policy, |V^* - V^{\pi_0}|

Lower-Bounding Distribution
There exists a distribution \rho such that for any policy \pi \in G(\tilde F), we have \rho \le C \rho^{\pi}, where C < \infty is a constant and \rho^{\pi} is the stationary distribution of \pi. Furthermore, we can define the concentrability coefficient C_{\mu,\rho} as before.
- \nu_{\rho} = the smallest eigenvalue of the Gram matrix \big(\int \varphi_i \varphi_j \, d\rho\big)_{i,j}
Classification-based Policy Iteration

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
  - Algorithm
  - Bounds on Error at each Iteration and Final Error
- Discussion
Classification-based Policy Iteration
Finite-Sample Performance Bound of a Classification-based Policy Iteration Algorithm
Classification-based Policy Iteration
Objective of the Section

- classification-based vs. regression-based (value-function-based) policy iteration
- describe a classification-based policy iteration algorithm
- report bounds on the error at each iteration and on the error after K iterations of the algorithm
- describe the main components of these bounds
Classification-based Policy Iteration
Value-based (Approximate) Policy Iteration

[Diagram: samples x_i \sim \rho \to rollout estimates \hat Q^{\pi}(x_i, a) \to training set D = \{((x_i, a), \hat Q^{\pi}(x_i, a))\} \to regression \tilde Q^{\pi} = \text{learn}(D, F) \to approximate value function \tilde Q^{\pi}(\cdot,\cdot) \to improvement \tilde\pi = \arg\max_a \tilde Q(\cdot, a) \to policy \pi.]

* We use Monte-Carlo estimation for illustration purposes.
Classification-based Policy Iteration
Classification-based Policy Iteration

[Diagram: samples x_i \sim \rho \to rollout estimates \hat Q^{\pi}(x_i, a) \to training set D = \{(x_i, \arg\max_a \hat Q^{\pi}(x_i, a))\} \to classification \tilde\pi = \text{learn}(D, \Pi) \to policy \pi; no explicit value-function approximation.]

* First introduced by Lagoudakis & Parr (2003) and Fern et al. (2004, 2006).
Classification-based Policy Iteration
Value-based vs Classification-based Policy Iteration

[Diagram: the two loops side by side. Value-based PI: samples x_i \sim \rho \to rollout estimates \hat Q^{\pi}(x_i, a) \to training set D = \{((x_i, a), \hat Q^{\pi}(x_i, a))\} \to regression \tilde Q^{\pi} = \text{learn}(D, F) \to improvement \tilde\pi = \arg\max_a \tilde Q(\cdot, a). Classification-based PI: samples x_i \sim \rho \to rollout estimates \hat Q^{\pi}(x_i, a) \to training set D = \{(x_i, \arg\max_a \hat Q^{\pi}(x_i, a))\} \to classification \tilde\pi = \text{learn}(D, \Pi), with no explicit value-function approximation.]
Classification-based Policy Iteration
Appealing Properties

- Property 1. It is more important to obtain a policy whose performance is close to that of the greedy policy w.r.t. Q^{\pi_k} than an accurate approximation of Q^{\pi_k}.
- Property 2. In some problems, good policies are easier to represent and learn than their corresponding value functions.
Classification-based Policy Iteration
Algorithm

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
  - Algorithm
  - Bounds on Error at each Iteration and Final Error
- Discussion
Classification-based Policy Iteration
Algorithm

Template of the Algorithm
Input: policy space \Pi \subseteq B^{\pi}(X), state distribution \rho, number of rollout states N, number of rollouts per state-action pair M, rollout horizon H
Initialize: let \pi_0 \in \Pi be an arbitrary policy
for k = 0, 1, 2, \dots do
    Construct the rollout set D_k = \{x_i\}_{i=1}^{N}, x_i \overset{iid}{\sim} \rho
    for all states x_i \in D_k and actions a \in A do
        for j = 1 to M do
            Perform a rollout according to policy \pi_k and return
                R_j^{\pi_k}(x_i, a) = r(x_i, a) + \sum_{t=1}^{H-1} \gamma^t r\big(x_t, \pi_k(x_t)\big),
            with x_1 \sim p(\cdot|x_i, a) and x_t \sim p\big(\cdot|x_{t-1}, \pi_k(x_{t-1})\big)
        end for
        \hat Q^{\pi_k}(x_i, a) = \frac{1}{M} \sum_{j=1}^{M} R_j^{\pi_k}(x_i, a)
    end for
    \pi_{k+1} = \arg\min_{\pi \in \Pi} \hat L_{\pi_k}(\hat\rho\,; \pi)    (classifier)
end for
Classification-based Policy Iteration
Algorithm

Template of the Algorithm
Empirical error:
\hat L_{\pi_k}(\hat\rho\,; \pi) = \frac{1}{N} \sum_{i=1}^{N} \Big[ \max_{a \in A} \hat Q^{\pi_k}(x_i, a) - \hat Q^{\pi_k}\big(x_i, \pi(x_i)\big) \Big]
(\hat\rho is the empirical distribution induced by the samples in D_k)

with the objective to minimize the expected error
L_{\pi_k}(\rho\,; \pi) = \int_X \Big[ \max_{a \in A} Q^{\pi_k}(x, a) - Q^{\pi_k}\big(x, \pi(x)\big) \Big] \rho(dx)
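A sketch of one iteration of this template in Python. It assumes a generative model with an `env.sample(x, a) -> (x_next, reward)` interface, a zero-argument sampler for ρ, and a classifier with a scikit-learn-style `fit`/`predict` interface; all of these names are assumptions made for illustration. Note that fitting a standard classifier on the greedy-action labels minimizes a mistake-based surrogate of the gap-weighted empirical error \hat L_{π_k}(\hatρ; π) defined above, which is a common simplification.

```python
import numpy as np

def rollout_return(env, x, a, policy, gamma, H):
    """R^{pi}(x, a): reward of taking a in x, then following `policy` for H-1 more steps."""
    x_next, r = env.sample(x, a)
    total, discount = r, gamma
    for _ in range(H - 1):
        x_next, r = env.sample(x_next, policy(x_next))
        total += discount * r
        discount *= gamma
    return total

def dpi_iteration(env, policy, classifier, rho_sampler, actions, gamma, N, M, H):
    """One iteration of classification-based policy iteration (illustrative sketch)."""
    states = [rho_sampler() for _ in range(N)]           # rollout set D_k, x_i ~ rho
    X, y = [], []
    for x in states:
        q_hat = np.array([
            np.mean([rollout_return(env, x, a, policy, gamma, H) for _ in range(M)])
            for a in actions
        ])
        X.append(x)
        y.append(int(q_hat.argmax()))                    # greedy-action label for state x
    classifier.fit(np.array(X), np.array(y))             # pi_{k+1} = learn(D_k, Pi)
    return lambda x: actions[classifier.predict(np.array([x]))[0]]
```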
Classification-based Policy Iteration
Algorithm

Mistake-based vs. Gap-based Errors
Mistake-based error:
L_{\pi_k}(\rho\,; \pi) = E_{x \sim \rho}\big[\mathbb{I}\{\pi(x) \neq (\mathcal{G}\pi_k)(x)\}\big] = \int_X \underbrace{\mathbb{I}\big\{\pi(x) \neq \arg\max_{a \in A} Q^{\pi_k}(x, a)\big\}}_{\text{mistake}} \, \rho(dx)

Gap-based error:
L_{\pi_k}(\rho\,; \pi) = \int_X \Big[ \max_{a \in A} Q^{\pi_k}(x, a) - Q^{\pi_k}\big(x, \pi(x)\big) \Big] \rho(dx)
= \int_X \underbrace{\mathbb{I}\big\{\pi(x) \neq \arg\max_{a \in A} Q^{\pi_k}(x, a)\big\}}_{\text{mistake}} \, \underbrace{\Big[ \max_{a \in A} Q^{\pi_k}(x, a) - Q^{\pi_k}\big(x, \pi(x)\big) \Big]}_{\text{cost/regret}} \, \rho(dx)
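A small numpy sketch computing empirical versions of the two losses on a set of rollout states, given estimated action values; the names are illustrative.

```python
import numpy as np

def mistake_and_gap_losses(q_values, policy_actions):
    """Empirical mistake-based and gap-based losses.

    q_values:       (N, A) action values Q^{pi_k}(x_i, a) (or rollout estimates)
    policy_actions: (N,)   actions pi(x_i) chosen by the classifier/policy
    """
    greedy = q_values.argmax(axis=1)
    mistakes = (policy_actions != greedy).astype(float)
    gaps = q_values.max(axis=1) - q_values[np.arange(len(q_values)), policy_actions]
    mistake_loss = mistakes.mean()   # E_rho[ I{pi(x) != greedy(x)} ]
    gap_loss = gaps.mean()           # E_rho[ max_a Q(x,a) - Q(x, pi(x)) ]
    return mistake_loss, gap_loss
```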
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
  - Algorithm
  - Bounds on Error at each Iteration and Final Error
- Discussion
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Error at each Iteration
Theorem
Let \Pi be a policy space with h = VC(\Pi) < \infty and \rho be a distribution over X. Let N be the number of states in D_k drawn i.i.d. from \rho, H be the rollout horizon, and M be the number of rollouts per state-action pair. Let
\pi_{k+1} = \arg\min_{\pi \in \Pi} \hat L_{\pi_k}(\hat\rho\,; \pi)
be the policy computed at the k-th iteration of DPI. Then, for any \delta > 0,
L_{\pi_k}(\rho\,; \pi_{k+1}) \le \inf_{\pi \in \Pi} L_{\pi_k}(\rho\,; \pi) + 2(\epsilon_1 + \epsilon_2 + \gamma^H Q_{\max}),
with probability 1 - \delta, where
\epsilon_1 = 16 Q_{\max} \sqrt{\frac{2}{N}\Big(h \log\frac{eN}{h} + \log\frac{32}{\delta}\Big)} \quad \text{and} \quad \epsilon_2 = 8(1-\gamma^H) Q_{\max} \sqrt{\frac{2}{MN}\Big(h \log\frac{eMN}{h} + \log\frac{32}{\delta}\Big)}
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Remarks
The bound:
L_{\pi_k}(\rho\,; \pi_{k+1}) \le \underbrace{\inf_{\pi \in \Pi} L_{\pi_k}(\rho\,; \pi)}_{\text{approximation}} + \underbrace{2\big(\epsilon_1(N) + \epsilon_2(N, M, H) + \gamma^H Q_{\max}\big)}_{\text{estimation}}

- Approximation error: depends on how well the policy space \Pi can approximate greedy policies.
- Estimation error: depends on the number of rollout states, the number of rollouts, and the rollout horizon.
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Remarks
The approximation error:
\inf_{\pi \in \Pi} L_{\pi_k}(\rho\,; \pi) = \inf_{\pi \in \Pi} \int_X \mathbb{I}\{\pi(x) \neq (\mathcal{G}\pi_k)(x)\} \Big[ \max_{a \in A} Q^{\pi_k}(x, a) - Q^{\pi_k}\big(x, \pi(x)\big) \Big] \rho(dx)
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Remarks
The estimation error:
\epsilon_1 = 16 Q_{\max} \sqrt{\frac{2}{N}\Big(h \log\frac{eN}{h} + \log\frac{32}{\delta}\Big)}, \qquad \epsilon_2 = 8(1-\gamma^H) Q_{\max} \sqrt{\frac{2}{MN}\Big(h \log\frac{eMN}{h} + \log\frac{32}{\delta}\Big)}

- Avoid overfitting (\epsilon_1): take N \gg h
- Fixed budget of rollouts B = MN: take M = 1 and N = B
- Fixed budget B = NMH and M = 1: take H = O\big(\frac{\log B}{\log 1/\gamma}\big) and N = O(B/H)
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Final Bound
Theorem
Let \Pi be a policy space with VC-dimension h and \pi_K be the policy generated by DPI after K iterations. Then, for any \delta > 0,
\|V^* - V^{\pi_K}\|_{1,\mu} \le \frac{C_{\mu,\rho}}{(1-\gamma)^2} \Big[ d(\Pi, \mathcal{G}\Pi) + 2\big(\epsilon_1 + \epsilon_2 + \gamma^H Q_{\max}\big) \Big] + \frac{2\gamma^K R_{\max}}{1-\gamma},
with probability 1 - \delta, where
\epsilon_1 = 16 Q_{\max} \sqrt{\frac{2}{N}\Big(h \log\frac{eN}{h} + \log\frac{32K}{\delta}\Big)} \quad \text{and} \quad \epsilon_2 = 8(1-\gamma^H) Q_{\max} \sqrt{\frac{2}{MN}\Big(h \log\frac{eMN}{h} + \log\frac{32K}{\delta}\Big)}

- Concentrability coefficient: C_{\mu,\rho}
- Approximation error: the inherent greedy error d(\Pi, \mathcal{G}\Pi)
- Estimation error: depends on M, N, H, h, and K
- Initialization error: error due to the choice of the initial policy, |V^* - V^{\pi_0}|
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Remarks
Inherent greedy error d(\Pi, \mathcal{G}\Pi) (approximation error):
d(\Pi, \mathcal{G}\Pi) = \sup_{\pi \in \Pi} \inf_{\pi' \in \Pi} L_{\pi}(\rho\,; \pi') = \sup_{\pi \in \Pi} \inf_{\pi' \in \Pi} \int_X \mathbb{I}\{\pi'(x) \neq (\mathcal{G}\pi)(x)\} \Big[ \max_{a \in A} Q^{\pi}(x, a) - Q^{\pi}\big(x, \pi'(x)\big) \Big] \rho(dx)

[Figure: the policy space \Pi containing \pi and \pi', with the greedy policy \mathcal{G}\pi lying outside \Pi.]
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Other Finite-Sample Analysis Results in Batch RL
- Approximate Value Iteration (Munos & Szepesvari 2008)
- Approximate Policy Iteration
  - LSTD and LSPI (Lazaric et al. 2010, 2012)
  - Bellman Residual Minimization (Maillard et al. 2010)
  - Modified Bellman Residual Minimization (Antos et al. 2008)
  - Classification-based Policy Iteration (Fern et al. 2006; Lazaric et al. 2010; Gabillon et al. 2011; Farahmand et al. 2012)
  - Conservative Policy Iteration (Kakade & Langford 2002; Kakade 2003)
Classification-based Policy Iteration
Bounds on Error at each Iteration and Final Error

Other Finite-Sample Analysis Results in Batch RL
- Approximate Modified Policy Iteration (Scherrer et al. 2012)
- Regularized Approximate Dynamic Programming
  - L2-Regularization
    - L2-Regularized Policy Iteration (Farahmand et al. 2008)
    - L2-Regularized Fitted Q-Iteration (Farahmand et al. 2009)
  - L1-Regularization and High-Dimensional RL
    - Lasso-TD (Ghavamzadeh et al. 2011)
    - LSTD (LSPI) with Random Projections (Ghavamzadeh et al. 2010)
Discussion

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion
  - Learned Lessons
  - Open Problems
Discussion
Discussion
Learned Lessons

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion
  - Learned Lessons
  - Open Problems
Discussion
Learned Lessons
Discussion
Learned Lessons

Comparison to Supervised Learning
We obtain the optimal rate of regression and classification for RL (ADP) algorithms.

What makes RL more challenging then?
- the dependency on 1/(1-\gamma) (the sequential nature of the problem)
- the approximation error is more complex
- the propagation of error (control problem)
- the sampling problem (how to choose \rho – the exploration problem)
Discussion
Learned Lessons

Practical Lessons
- Tuning the parameters (given a fixed accuracy \epsilon; see the sketch after this list):
  - number of samples (inverting the bound): n \ge \tilde\Omega(d)
  - number of iterations (inverting the bound): K \approx \log(1/\epsilon)/(1-\gamma)
- Choice of the function space F and/or the policy space \Pi:
  - the features \{\varphi_i\}_{i=1}^{d} should be linearly independent under the sampling distribution \rho (on-policy vs. off-policy sampling)
  - tradeoff between approximation and estimation errors
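A minimal sketch of the "inverting the bound" recipe, with the constants hidden in the O(·) terms set to 1 for illustration: choose K so that the γ^K term is below the target accuracy ε, and grow n until the estimation term Vmax·sqrt(d log(n/δ)/n) drops below ε. The function names are hypothetical.

```python
import math

def iterations_for_accuracy(eps, gamma):
    """Smallest K with gamma^K <= eps, i.e. K ~ log(1/eps) / (1 - gamma)."""
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / gamma))

def samples_for_accuracy(eps, d, delta, v_max):
    """Smallest n (up to doubling) with Vmax * sqrt(d * log(n/delta) / n) <= eps."""
    n = d  # start from n >= d and grow until the estimation term is small enough
    while v_max * math.sqrt(d * math.log(n / delta) / n) > eps:
        n *= 2
    return n

print(iterations_for_accuracy(0.01, gamma=0.9))          # roughly 44 iterations
print(samples_for_accuracy(0.1, d=20, delta=0.05, v_max=1.0))
```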
Discussion
Open Problems

Outline
- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion
  - Learned Lessons
  - Open Problems
Discussion
Open Problems
Discussion
Open Problems

Open Problems
- High-dimensional spaces: how to deal with MDPs with many state-action variables?
  - First example in deterministic design for LSTD
  - Extension to other algorithms
- Optimality: how optimal are the current algorithms?
  - Improve the sampling distribution
  - Control the concentrability terms
  - Limit the propagation of error through iterations
- Off-policy learning for LSTD
Statistical Learning Theory Meets Dynamic Programming
M. Ghavamzadeh, A. Lazaric
{mohammad.ghavamzadeh, alessandro.lazaric}@inria.fr
sequel.lille.inria.fr