Statistical Learning Theory in Reinforcement Learning & Approximate Dynamic Programming

A. Lazaric & M. Ghavamzadeh (INRIA Lille – Team SequeL)

ICML 2012 Tutorial

June 26, 2012

Sequential Decision-Making under Uncertainty

How can an agent ...
- move around in the physical world (e.g., driving, navigation)?
- play and win a game?
- retrieve information over the web?
- perform medical diagnosis and treatment?
- maximize the throughput of a factory?
- optimize the performance of a rescue team?

Reinforcement Learning (RL)

- RL: a class of learning problems in which an agent interacts with a dynamic, stochastic, and incompletely known environment.
- Goal: learn an action-selection strategy, or policy, that optimizes some measure of the agent's long-term performance.
- Interaction: modeled as an MDP or a POMDP.
- Algorithms: most RL algorithms build on the two celebrated dynamic programming algorithms, policy iteration and value iteration.

A Bit of History

- Formulation of the problem: optimal control, state, value function, Bellman equations, etc.
- Dynamic programming algorithms: policy iteration and value iteration, together with proofs of convergence to an optimal policy.
- Approximate dynamic programming:
  - Performance evaluation: how close is the obtained solution to an optimal one?
  - Asymptotic analysis: the performance with an infinite number of samples.
- However, in real problems we always have a finite number of samples.

Motivation

What about the performance with a finite number of samples?

- Approximate dynamic programming (ADP):
  - asymptotic analysis
  - finite-sample analysis
- Finite-sample analysis of ADP algorithms:
  - Error at each iteration of the algorithm: the problem at each iteration is formulated as a regression, classification, or fixed-point problem, and tools from statistical learning theory are used to bound its error.
  - How the error propagates through the iterations of the algorithm.

This Tutorial

Supervised learning: the problem is regression or classification, the algorithms are least-squares, SVM, etc., and the tools for analysis come from statistical learning theory (SLT).

Sequential decision-making: the problem is addressed by ADP, the algorithms are API and AVI, and the classical analysis is asymptotic. This tutorial uses SLT tools to provide a finite-sample analysis of ADP algorithms.

Outline

- Preliminaries
- Tools from Statistical Learning Theory
- A Step-by-step Derivation for Linear FQI
- Least-Squares Policy Iteration (LSPI)
- Classification-based Policy Iteration
- Discussion

Preliminaries

(Sub-sections: Dynamic Programming, Approximate Dynamic Programming)


Markov Decision Process (MDP)

- An MDP $M$ is a tuple $\langle \mathcal{X}, \mathcal{A}, r, p, \gamma \rangle$:
  - the state space $\mathcal{X}$ is a bounded closed subset of $\mathbb{R}^d$,
  - the set of actions $\mathcal{A}$ is finite ($|\mathcal{A}| < \infty$),
  - the reward function $r : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is bounded by $R_{\max}$,
  - the transition model $p(\cdot \mid x, a)$ is a distribution over $\mathcal{X}$,
  - $\gamma \in (0, 1)$ is a discount factor.
- Policy: a mapping from states to actions, $\pi(x) \in \mathcal{A}$.
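To make the MDP tuple concrete, here is a minimal illustrative sketch in Python (not part of the tutorial); the number of states and actions, the rewards, the transition probabilities, and the policy below are all made-up placeholders.

```python
import numpy as np

# A small finite MDP <X, A, r, p, gamma> with |X| = 3 states and |A| = 2 actions.
# All numbers are illustrative placeholders, not from the tutorial.
n_states, n_actions, gamma = 3, 2, 0.9

# r[x, a]: bounded reward function
r = np.array([[0.0, 1.0],
              [0.5, 0.0],
              [1.0, 0.2]])

# p[x, a, y] = p(y | x, a): transition model (each row sums to 1)
p = np.array([[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],
              [[0.0, 0.9, 0.1], [0.5, 0.5, 0.0]],
              [[0.3, 0.3, 0.4], [0.0, 0.1, 0.9]]])

# A deterministic policy maps each state to an action.
pi = np.array([1, 0, 1])
```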

Value Function

For a policy $\pi$:

- Value function $V^\pi : \mathcal{X} \to \mathbb{R}$,
  $$V^\pi(x) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \, r\big(X_t, \pi(X_t)\big) \,\Big|\, X_0 = x\Big].$$
- Action-value function $Q^\pi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$,
  $$Q^\pi(x, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \, r(X_t, A_t) \,\Big|\, X_0 = x, A_0 = a\Big].$$

Notation: Bellman Operator

- Bellman operator for policy $\pi$: $T^\pi : B(\mathcal{X}; V_{\max}) \to B(\mathcal{X}; V_{\max})$,
  $$(T^\pi V)(x) = r\big(x, \pi(x)\big) + \gamma \int_{\mathcal{X}} p\big(dy \mid x, \pi(x)\big) V(y).$$
- $V^\pi$ is the unique fixed point of the Bellman operator $T^\pi$.
- The action-value function $Q^\pi$ is defined as
  $$Q^\pi(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a) V^\pi(y).$$

$B(\mathcal{X}; V_{\max})$ is the space of functions on $\mathcal{X}$ bounded by $V_{\max}$.
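As an illustration (an assumed example, reusing the illustrative arrays `r`, `p`, `pi`, `gamma` from the earlier snippet), the sketch below applies $(T^\pi V)(x) = r(x,\pi(x)) + \gamma \sum_y p(y \mid x, \pi(x))\, V(y)$ on a finite MDP and iterates it to approximate $V^\pi$.

```python
import numpy as np

def bellman_operator(V, r, p, pi, gamma):
    """Apply T^pi to a value-function vector V on a finite MDP:
    (T^pi V)(x) = r(x, pi(x)) + gamma * sum_y p(y | x, pi(x)) V(y)."""
    TV = np.empty(len(V))
    for x in range(len(V)):
        a = pi[x]
        TV[x] = r[x, a] + gamma * p[x, a] @ V
    return TV

# Iterating T^pi from any V converges to V^pi (gamma-contraction in max-norm).
V = np.zeros(len(pi))
for _ in range(200):
    V = bellman_operator(V, r, p, pi, gamma)
```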

Optimal Value Function and Optimal Policy

- Optimal value function: $V^*(x) = \sup_\pi V^\pi(x)$ for all $x \in \mathcal{X}$.
- Optimal action-value function: $Q^*(x, a) = \sup_\pi Q^\pi(x, a)$ for all $x \in \mathcal{X}$, $a \in \mathcal{A}$.
- A policy $\pi$ is optimal if $V^\pi(x) = V^*(x)$ for all $x \in \mathcal{X}$.

Notation: Bellman Optimality Operator

- Bellman optimality operator: $T : B(\mathcal{X}; V_{\max}) \to B(\mathcal{X}; V_{\max})$,
  $$(T V)(x) = \max_{a \in \mathcal{A}} \Big[ r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a) V(y) \Big].$$
- $V^*$ is the unique fixed point of the Bellman optimality operator $T$.
- The optimal action-value function $Q^*$ is defined as
  $$Q^*(x, a) = r(x, a) + \gamma \int_{\mathcal{X}} p(dy \mid x, a) V^*(y).$$

Properties of Bellman Operators

- Monotonicity: for all $V_1, V_2 \in B(\mathcal{X}; V_{\max})$, if $V_1 \le V_2$ component-wise, then $T^\pi V_1 \le T^\pi V_2$ and $T V_1 \le T V_2$.
- Max-norm contraction:
  $$\|T^\pi V_1 - T^\pi V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty, \qquad \|T V_1 - T V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty.$$

Dynamic Programming

Dynamic Programming Algorithms: Value Iteration

- Start with an arbitrary action-value function $Q_0$.
- At each iteration $k$: $Q_{k+1} = T Q_k$.

Convergence: $\lim_{k\to\infty} V_k = V^*$, since
$$\|V^* - V_{k+1}\|_\infty = \|T V^* - T V_k\|_\infty \le \gamma \|V^* - V_k\|_\infty \le \gamma^{k+1} \|V^* - V_0\|_\infty \xrightarrow{k \to \infty} 0.$$
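A minimal value-iteration sketch on the illustrative finite MDP from the earlier snippets; the stopping tolerance is an assumption of the example, not part of the tutorial.

```python
import numpy as np

def value_iteration(r, p, gamma, tol=1e-8):
    """Iterate Q_{k+1} = T Q_k until the max-norm update is below tol."""
    n_states, n_actions = r.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q.max(axis=1)                      # greedy value: max_a Q(x, a)
        Q_next = r + gamma * p @ V             # (T Q)(x, a) = r(x, a) + gamma * E[V(x')]
        if np.abs(Q_next - Q).max() < tol:     # gamma-contraction guarantees convergence
            return Q_next
        Q = Q_next

# Q_star = value_iteration(r, p, gamma); pi_star = Q_star.argmax(axis=1)
```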

Dynamic Programming Algorithms: Policy Iteration

- Start with an arbitrary policy $\pi_0$.
- At each iteration $k$:
  - Policy evaluation: compute $Q^{\pi_k}$.
  - Policy improvement: compute the greedy policy w.r.t. $Q^{\pi_k}$,
    $$\pi_{k+1}(x) = (\mathcal{G}\pi_k)(x) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(x, a).$$

Convergence: PI generates a sequence of policies with increasing performance ($V^{\pi_{k+1}} \ge V^{\pi_k}$) and stops after a finite number of iterations with the optimal policy $\pi^*$, since
$$V^{\pi_k} = T^{\pi_k} V^{\pi_k} \le T V^{\pi_k} = T^{\pi_{k+1}} V^{\pi_k} \le \lim_{n \to \infty} (T^{\pi_{k+1}})^n V^{\pi_k} = V^{\pi_{k+1}}.$$
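A minimal policy-iteration sketch for the same illustrative finite MDP; evaluating the policy by solving the linear system $(I - \gamma P^\pi) V = r^\pi$ exactly is one of several valid choices and is an assumption of this example.

```python
import numpy as np

def policy_iteration(r, p, gamma):
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: V^pi solves (I - gamma * P^pi) V = r^pi
        P_pi = p[np.arange(n_states), pi]           # P_pi[x, y] = p(y | x, pi(x))
        r_pi = r[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy w.r.t. Q^pi
        Q = r + gamma * p @ V
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):              # no change: pi is optimal
            return pi, V
        pi = pi_new
```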

Approximate Dynamic Programming & Batch Reinforcement Learning

Approximate Dynamic Programming Algorithms: Value Iteration

- Start with an arbitrary action-value function $Q_0$.
- At each iteration $k$: $Q_{k+1} = T Q_k$.

What if we can only guarantee $Q_{k+1} \approx T Q_k$? Does $\|Q^* - Q_{k+1}\| \le \gamma \|Q^* - Q_k\|$ still hold?

Approximate Dynamic Programming Algorithms: Policy Iteration

- Start with an arbitrary policy $\pi_0$.
- At each iteration $k$:
  - Policy evaluation: compute $Q^{\pi_k}$.
  - Policy improvement: compute the greedy policy w.r.t. $Q^{\pi_k}$,
    $$\pi_{k+1}(x) = (\mathcal{G}\pi_k)(x) = \arg\max_{a \in \mathcal{A}} Q^{\pi_k}(x, a).$$

What if we cannot compute $Q^{\pi_k}$ exactly and compute an approximation $\widehat{Q}^{\pi_k} \approx Q^{\pi_k}$ instead? Then
$$\pi_{k+1}(x) = \arg\max_{a \in \mathcal{A}} \widehat{Q}^{\pi_k}(x, a) \neq (\mathcal{G}\pi_k)(x),$$
and is the improvement $V^{\pi_{k+1}} \ge V^{\pi_k}$ still guaranteed?

Error at Each Iteration (AVI)

Given a budget of $B$ samples and an approximation space $\mathcal{F}$, AVI computes $Q_{k+1} \approx T Q_k$.

Error at iteration $k$:
$$\|T Q_k - Q_{k+1}\|_{p,\rho} \le f(B, \mathcal{F}) \quad \text{w.h.p.}$$

Error at Each Iteration (API)

Given a budget of $B$ samples and an approximation space $\mathcal{F}$, API computes $\widehat{Q}^{\pi_k} \approx Q^{\pi_k}$ and $\pi_{k+1} = \mathcal{G}\widehat{Q}^{\pi_k} \approx \mathcal{G}\pi_k$.

Error at iteration $k$:
$$\|Q^{\pi_k} - \widehat{Q}^{\pi_k}\|_{p,\rho} \le f(B, \mathcal{F}) \quad \text{w.h.p.}$$

Final Performance Bound

Final objective: bound the error after $K$ iterations of the algorithm,
$$\|V^* - V^{\pi_K}\|_{p,\mu} \le f(B, \mathcal{F}, K) \quad \text{w.h.p.},$$
where $\pi_K$ is the policy computed by the algorithm after $K$ iterations.

Error propagation: how the error at each iteration propagates through the iterations of the algorithm.

SLT in RL & ADP

- Supervised learning methods (regression, classification) appear in the inner loop of ADP algorithms (performance at each iteration).
- Tools from SLT that are used to analyze supervised learning methods can also be used in RL and ADP (e.g., how many samples are required to achieve a certain performance).

What makes RL more challenging?
- The objective is not always to recover a target function from its noisy observations (fixed point vs. regression).
- The target sometimes has to be approximated from sample trajectories (non-i.i.d. samples).
- Propagation of error (control problem).
- The choice of the sampling distribution $\rho$ (exploration problem).

Tools from Statistical Learning Theory

(Sub-sections: Concentration Inequalities, Functional Concentration Inequalities)

Objective of the Section

- Introduce the theoretical tools used to derive the error bounds at each iteration.
- Understand the relationship between accuracy, number of samples, and confidence.

Concentration Inequalities

The Chernoff–Hoeffding Bound

Remark: all learning algorithms use random samples instead of the actual distributions.

Question: how reliable is a solution learned from finitely many random samples?

Theorem (Chernoff–Hoeffding). Let $X_1, \ldots, X_n$ be i.i.d. samples from a distribution $P$ bounded in $[a, b]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}_P[X_1]\Big| > \varepsilon\,\right] \le 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right).$$
Reading: the left-hand side is the probability of a large deviation of the empirical average, $\varepsilon$ is the accuracy, and the right-hand side controls the confidence.

The Chernoff–Hoeffding Bound (cont'd)

Equivalently, for any $\delta \in (0, 1)$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1]\Big| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\,\right] \le \delta,$$
and, for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1]\Big| > \varepsilon\,\right] \le \delta \quad \text{whenever} \quad n \ge \frac{(b-a)^2 \log 2/\delta}{2\varepsilon^2}.$$
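A small sketch illustrating the accuracy/samples/confidence trade-off above: it computes the sample size prescribed by the bound and then checks the deviation empirically; the Bernoulli distribution and the chosen numbers are illustrative assumptions.

```python
import numpy as np

def hoeffding_sample_size(eps, delta, a=0.0, b=1.0):
    """Smallest n such that P[|mean - E[X]| > eps] <= delta for [a, b]-bounded X."""
    return int(np.ceil((b - a) ** 2 * np.log(2 / delta) / (2 * eps ** 2)))

eps, delta = 0.05, 0.01
n = hoeffding_sample_size(eps, delta)          # 1060 for these values

# Empirical check on Bernoulli(0.3) samples (illustrative only).
rng = np.random.default_rng(0)
deviations = [abs(rng.binomial(1, 0.3, n).mean() - 0.3) for _ in range(1000)]
print(n, np.mean(np.array(deviations) > eps))  # empirical failure rate is well below delta
```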

The Azuma Bound

Remark: in ADP and RL, the samples are not necessarily i.i.d., but may be generated along trajectories.

Question: how can the previous results be extended to non-i.i.d. samples?

A sequence of random variables $X_1, X_2, \ldots$ is a martingale difference sequence if for any $t$,
$$\mathbb{E}[X_{t+1} \mid X_1, \ldots, X_t] = 0.$$

Theorem (Azuma). Let $X_1, \ldots, X_n$ be a martingale difference sequence bounded in $[a, b]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} X_t\Big| > \varepsilon\,\right] \le 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right).$$
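A small illustrative simulation (entirely an assumption, not from the tutorial): increments whose scale depends on the past but whose conditional mean is zero form a bounded martingale difference sequence, and their average concentrates as the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

def mds_average(n):
    """Average of a bounded martingale difference sequence X_t = c_{t-1} * eps_t,
    where c_{t-1} in [0, 1] depends on the history and eps_t is a symmetric +/-1 coin,
    so E[X_t | past] = 0 and |X_t| <= 1."""
    xs, c = [], 1.0
    for _ in range(n):
        eps = rng.choice([-1.0, 1.0])
        xs.append(c * eps)
        c = 0.5 + 0.5 * abs(xs[-1])     # next scale depends on the past
    return float(np.mean(xs))

n = 10_000
print(max(abs(mds_average(n)) for _ in range(50)))  # stays small, as Azuma predicts
```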

Functional Concentration Inequalities

The Functional Chernoff–Hoeffding Bound

Remark: a learning algorithm returns the empirically best hypothesis from a hypothesis set (e.g., a value function, a policy).

Question: how do the previous results extend to the case of a random hypothesis chosen from a hypothesis set?

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[a, b]$. Then for any fixed $f \in \mathcal{F}$ and any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\,\right] \le 2\exp\left(-\frac{2n\varepsilon^2}{(b-a)^2}\right).$$

The Functional Chernoff–Hoeffding Bound (cont'd)

Remark: usually we do not know in advance which function $f$ the learning algorithm will return (it is random!).

Theorem (?). Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[a, b]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\,\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\,\right] \le \ ???$$

The Union Bound

Also known as Boole's inequality, Bonferroni inequality, etc.

Theorem. Let $A_1, A_2, \ldots$ be a countable collection of events. Then
$$\mathbb{P}\Big[\bigcup_i A_i\Big] \le \sum_i \mathbb{P}[A_i].$$

The Functional Chernoff–Hoeffding Bound (cont'd)

For a finite class $\mathcal{F} = \{f_1, \ldots, f_N\}$, the event that some function deviates can be written as a union of events:
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right]
= \mathbb{P}\left[\bigcup_{j=1}^{N} \left\{\Big|\frac{1}{n}\sum_{t=1}^{n} f_j(X_t) - \mathbb{E}[f_j(X_1)]\Big| > \varepsilon\right\}\right].$$

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a finite set of functions bounded in $[a, b]$ with $|\mathcal{F}| = N$. Then for any $\delta \in (0, 1)$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\right]
\le N \max_{f \in \mathcal{F}} \mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\right] \le N\delta,$$
or, equivalently,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > (b-a)\sqrt{\frac{\log 2N/\delta}{2n}}\right] \le \delta.$$
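A small illustrative simulation of the uniform deviation over a finite function class; the class of threshold functions and the uniform sampling distribution are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Finite class of N threshold functions f_c(x) = 1{x <= c} on X = [0, 1].
thresholds = np.linspace(0.1, 0.9, 9)          # N = 9
true_means = thresholds                         # E[f_c(X)] = c when X ~ Uniform(0, 1)

n, delta = 2000, 0.05
eps = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))   # uniform CH radius

failures = 0
for _ in range(500):
    X = rng.uniform(0.0, 1.0, n)
    emp_means = np.array([(X <= c).mean() for c in thresholds])
    failures += np.any(np.abs(emp_means - true_means) > eps)
print(eps, failures / 500)   # empirical failure rate should be well below delta
```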

The Functional Chernoff–Hoeffding Bound (cont'd)

Problem: in general $\mathcal{F}$ contains an infinite number of functions (e.g., the class of linear classifiers).

The Symmetrization Trick

$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}\big[f(X_1)\big]\Big| > \varepsilon\right]
\le 2\,\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{2}\right],$$
where the ghost samples $\{X'_t\}_{t=1}^{n}$ are drawn independently from $P$.

The VC Dimension

Not all infinities are the same...

[Figure: two function classes of very different complexity fitted to the same data points]

Let us consider a binary function space $\mathcal{F} = \{f : \mathcal{X} \to \{0, 1\}\}$.

The VC Dimension (cont'd)

How many different predictions can a space $\mathcal{F}$ produce over $n$ distinct inputs?

The VC Dimension (cont'd)

[Figures: configurations of labeled points in the plane classified by linear separators]

The VC dimension of a linear classifier in dimension 2 is $VC(\mathcal{F}) = 3$.

The VC Dimension (cont'd)

Let $S = (x_1, \ldots, x_d)$ be an arbitrary sequence of points. Then
$$\Pi_S(\mathcal{F}) = \{(f(x_1), \ldots, f(x_d)) : f \in \mathcal{F}\}$$
is the set of all possible ways the $d$ points can be classified by hypotheses in $\mathcal{F}$.

Definition. A set $S$ is shattered by a hypothesis space $\mathcal{F}$ if $|\Pi_S(\mathcal{F})| = 2^d$.

The VC Dimension (cont'd)

Definition (VC dimension). The VC dimension of a hypothesis space $\mathcal{F}$ is
$$VC(\mathcal{F}) = \max\{d \mid \exists S, |S| = d, |\Pi_S(\mathcal{F})| = 2^d\}.$$

Lemma (Sauer's lemma). Let $\mathcal{F}$ be a hypothesis space with VC dimension $d$. Then for any sequence of $n > d$ points $S = (x_1, \ldots, x_n)$,
$$|\Pi_S(\mathcal{F})| \le \sum_{i=0}^{d} \binom{n}{i} \le n^d.$$
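As a quick numerical illustration of Sauer's lemma (an assumed example, not from the tutorial), the snippet below counts how many labelings 1-D threshold classifiers ($VC = 1$) realize on $n$ points and compares with the bound $\sum_{i \le d} \binom{n}{i}$.

```python
from math import comb

def sauer_bound(n, d):
    """Sauer's lemma bound on the number of labelings realizable on n points."""
    return sum(comb(n, i) for i in range(d + 1))

def threshold_labelings(n):
    """Distinct labelings of n sorted points produced by f_c(x) = 1{x <= c} (VC dim = 1)."""
    return {tuple(1 if i < k else 0 for i in range(n)) for k in range(n + 1)}

n, d = 10, 1
print(len(threshold_labelings(n)), sauer_bound(n, d))   # 11 <= 11
```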

The Functional CH Bound for Binary Spaces

Question: how many distinct value patterns can the functions $f \in \mathcal{F}$ (with $\mathcal{F}$ a binary space) take on the $2n$ samples appearing in
$$2\,\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{2}\right]?$$

If $VC(\mathcal{F}) = d$ and $2n > d$, then the answer is at most $(2n)^d$.

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of binary functions with $VC(\mathcal{F}) = d$. Then for any $\delta \in (0, 1)$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \sqrt{\frac{\log 2N/\delta}{2n}}\right] \le 2\delta,$$
with $N = (2n)^d$.

The Functional CH Bound for Binary Spaces

A simplified reading of the previous bound: for any set of $n$ i.i.d. samples and any binary function $f \in \mathcal{F}$ with $VC(\mathcal{F}) = d$,
$$\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| \le O\left(\sqrt{\frac{d \log n/\delta}{n}}\right)$$
with probability $1 - \delta$ (w.r.t. the randomness of the samples).

Pollard's Inequality

Extension: how does the previous result extend to a real-valued space $\mathcal{F}$?

Question: how many distinct value patterns can the functions $f \in \mathcal{F}$ (with $\mathcal{F}$ real-valued) take on $2n$ samples?

Answer: an infinite number of values...

Covering

Observation: we only need an accuracy of order $\varepsilon$.

Question: how many functions from $\mathcal{F}$ do we need in order to achieve an accuracy of order $\varepsilon$ on $2n$ samples?

A space $\mathcal{F}_\varepsilon \subset \mathcal{F}$ is an $\varepsilon$-cover of $\mathcal{F}$ on the states $\{x_t\}_{t=1}^{n}$ if
$$\forall f \in \mathcal{F},\ \exists f' \in \mathcal{F}_\varepsilon : \ \Big|\frac{1}{n}\sum_{t=1}^{n} f(x_t) - \frac{1}{n}\sum_{t=1}^{n} f'(x_t)\Big| \le \varepsilon.$$

The covering number of $\mathcal{F}$ is the size of the smallest such cover,
$$\mathcal{N}\big(\mathcal{F}, \varepsilon, \{x_t\}_{t=1}^{n}\big) = |\mathcal{F}_\varepsilon|.$$
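A minimal sketch (an illustrative assumption, not the tutorial's construction) of greedily building an empirical $\varepsilon$-cover of a finite family of functions evaluated on $n$ points, using the empirical distance from the definition above.

```python
import numpy as np

def greedy_eps_cover(values, eps):
    """Greedily pick an eps-cover of a set of functions given their values on n points.

    values: array of shape (num_functions, n), row i = (f_i(x_1), ..., f_i(x_n)).
    Two functions are 'close' if |mean(f on the points) - mean(f' on the points)| <= eps.
    Returns the indices of the chosen cover elements.
    """
    means = values.mean(axis=1)
    cover = []
    for i in range(len(values)):
        if not any(abs(means[i] - means[j]) <= eps for j in cover):
            cover.append(i)
    return cover

# Illustrative family: f_c(x) = c * x evaluated on 100 points, for 50 slopes c.
xs = np.linspace(0.0, 1.0, 100)
family = np.array([c * xs for c in np.linspace(0.0, 2.0, 50)])
print(len(greedy_eps_cover(family, eps=0.05)))   # cover size grows roughly like range/eps
```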

Pollard's Inequality

We build an $(\varepsilon/8)$-cover of $\mathcal{F}$ on the states $\{X_t\}_{t=1}^{n} \cup \{X'_t\}_{t=1}^{n}$, so that
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{2}\right]
\le \mathbb{P}\left[\exists f \in \mathcal{F}_{\varepsilon/8} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{4}\right]
\le \mathbb{E}\Big[\mathcal{N}\big(\mathcal{F}, \varepsilon/8, \{X_t \cup X'_t\}_{t=1}^{n}\big)\Big]\,
\mathbb{P}\left[\Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \frac{1}{n}\sum_{t=1}^{n} f(X'_t)\Big| > \frac{\varepsilon}{4}\right].$$

Theorem (Pollard). Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[0, B]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \Big|\frac{1}{n}\sum_{t=1}^{n} f(X_t) - \mathbb{E}[f(X_1)]\Big| > \varepsilon\right]
\le 8\,\mathbb{E}\Big[\mathcal{N}\big(\mathcal{F}, \varepsilon/8, \{X_t \cup X'_t\}_{t=1}^{n}\big)\Big] \exp\left(-\frac{n\varepsilon^2}{64 B^2}\right).$$

The Pseudo-Dimension

Question: how can the covering number be computed?

A real-valued space $\mathcal{F}$ has pseudo-dimension $d$ if
$$\mathcal{V}(\mathcal{F}) = VC\Big(\big\{(x, y) \mapsto \mathrm{sign}(f(x) - y),\ f \in \mathcal{F}\big\}\Big) = d.$$

For any $\{x_t\}_{t=1}^{n}$,
$$\mathcal{N}\big(\mathcal{F}, \varepsilon, \{x_t\}_{t=1}^{n}\big) \le O\left(\Big(\frac{B}{\varepsilon}\Big)^{d}\right).$$

Functional Concentration Inequality for the L2-norm

Remark: in some cases we need to control deviations measured in other norms.

Example: in least-squares regression, the error is measured in L2-norm, so we want to bound the deviation between
$$\left(\frac{1}{n}\sum_{t=1}^{n} f(X_t)^2\right)^{1/2} \quad \text{and} \quad \left(\mathbb{E}\big[f(X)^2\big]\right)^{1/2}.$$

Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from an arbitrary distribution $P$ on $\mathcal{X}$ and $\mathcal{F}$ a set of functions bounded in $[0, B]$. Then for any $\varepsilon > 0$,
$$\mathbb{P}\left[\exists f \in \mathcal{F} : \left(\frac{1}{n}\sum_{t=1}^{n} f(X_t)^2\right)^{1/2} - 2\left(\mathbb{E}\big[f(X)^2\big]\right)^{1/2} > \varepsilon\right]
\le 3\,\mathbb{E}\Big[\mathcal{N}_2\big(\mathcal{F}, \tfrac{\sqrt{2}}{24}\varepsilon, \{X_t \cup X'_t\}_{t=1}^{n}\big)\Big] \exp\left(-\frac{n\varepsilon^2}{288 B^2}\right).$$

Summary

- Learning algorithms use finite random samples ⇒ concentration of empirical averages around expectations.
- Learning algorithms search over spaces of functions ⇒ concentration of empirical averages around expectations uniformly over all functions in the space.

A Step-by-step Derivation for Linear FQI

(Sub-sections: Error at Each Iteration, Error Propagation, The Final Bound)

Objective of the Section

- Derive, step by step, a performance bound for a popular algorithm.
- Show the interplay between the prediction error at each iteration and its propagation through the iterations.

Linear Fitted Q-iteration

Linear space (used to approximate action-value functions):
$$\mathcal{F} = \Big\{ f_\alpha(x, a) = \sum_{j=1}^{d} \alpha_j \varphi_j(x, a),\ \alpha \in \mathbb{R}^d \Big\},$$
with features $\varphi_j : \mathcal{X} \times \mathcal{A} \to [0, L]$ and feature vector $\phi(x, a) = [\varphi_1(x, a), \ldots, \varphi_d(x, a)]^\top$.

Linear Fitted Q-iteration

Input: space $\mathcal{F}$, number of iterations $K$, sampling distribution $\rho$, number of samples $n$.

Initialize $\widetilde{Q}^0 \in \mathcal{F}$.
For $k = 1, \ldots, K$:
- Draw $n$ samples $(x_i, a_i) \overset{\text{i.i.d.}}{\sim} \rho$.
- Sample $x'_i \sim p(\cdot \mid x_i, a_i)$ and $r_i = r(x_i, a_i)$.
- Compute $y_i = r_i + \gamma \max_a \widetilde{Q}^{k-1}(x'_i, a)$.
- Build the training set $\big\{\big((x_i, a_i), y_i\big)\big\}_{i=1}^{n}$.
- Solve the least-squares problem
  $$f_{\widehat{\alpha}_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \big(f_\alpha(x_i, a_i) - y_i\big)^2.$$
- Return $\widetilde{Q}^k = \mathrm{Trunc}(f_{\widehat{\alpha}_k})$ (truncation at $\pm V_{\max}$).

Return the greedy policy $\pi_K(\cdot) = \arg\max_a \widetilde{Q}^K(\cdot, a)$.
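A compact sketch of the loop above, kept deliberately simple: a generative-model sampler, a fixed feature map, and a ridge-regularized least-squares solve stand in for details the slides leave unspecified (the sampler interface, the feature map, the zero initialization, and the small ridge term are assumptions of this example).

```python
import numpy as np

def linear_fqi(sample_batch, features, n_actions, gamma, K, n, v_max, reg=1e-6):
    """Linear Fitted Q-iteration over the linear space F = {phi(x, a)^T alpha}.

    sample_batch(n) -> (states, actions, rewards, next_states), i.i.d. from rho.
    features(x, a)  -> 1-D feature vector phi(x, a).
    """
    alpha = None                                   # corresponds to Q^0 = 0 (a choice)
    for _ in range(K):
        xs, acts, rs, xps = sample_batch(n)
        if alpha is None:                          # targets y_i = r_i + gamma max_a Q^{k-1}
            ys = np.asarray(rs, dtype=float)
        else:
            q_next = np.array([[np.clip(features(xp, a) @ alpha, -v_max, v_max)
                                for a in range(n_actions)] for xp in xps])
            ys = np.asarray(rs, dtype=float) + gamma * q_next.max(axis=1)
        Phi = np.array([features(x, a) for x, a in zip(xs, acts)])
        # Least-squares fit (a tiny ridge term is added only for numerical stability).
        alpha = np.linalg.solve(Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ ys)
    greedy = lambda x: int(np.argmax([features(x, a) @ alpha for a in range(n_actions)]))
    return alpha, greedy
```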

Theoretical Objectives

Objective 1: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution $\mu$:
$$\|Q^* - Q^{\pi_K}\|_\mu \le \ ???$$

Error at Each Iteration


Theoretical Objectives

Target: at each iteration we want to approximate $Q^k = T\widetilde{Q}^{k-1}$.

Objective 2: derive an intermediate bound on the prediction error [random design]:
$$\|Q^k - \widetilde{Q}^k\|_\rho \le \ ???$$

Target: at each iteration we have samples $\{(x_i, a_i)\}_{i=1}^{n}$ drawn from $\rho$.

Objective 3: derive an intermediate bound on the prediction error on the samples [deterministic design]:
$$\frac{1}{n}\sum_{i=1}^{n}\big(Q^k(x_i, a_i) - \widetilde{Q}^k(x_i, a_i)\big)^2 = \|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}}^2 \le \ ???$$

Theoretical Objectives

$$\text{Obj 3: } \|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}}^2 \le \ ???
\quad\Rightarrow\quad
\text{Obj 2: } \|Q^k - \widetilde{Q}^k\|_\rho \le \ ???
\quad\Rightarrow\quad
\text{Obj 1: } \|Q^* - Q^{\pi_K}\|_\mu \le \ ???$$

Theoretical Objectives

Returned solution:
$$f_{\widehat{\alpha}_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\big(f_\alpha(x_i, a_i) - y_i\big)^2.$$

Best solution in $\mathcal{F}$:
$$f_{\alpha^*_k} = \arg\inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - Q^k\|_\rho.$$

Additional Notation

Given the set of inputs $\{(x_i, a_i)\}_{i=1}^{n}$ drawn from $\rho$:

- Vector space: $\mathcal{F}_n = \{z \in \mathbb{R}^n : z_i = f_\alpha(x_i, a_i),\ f_\alpha \in \mathcal{F}\} \subset \mathbb{R}^n$.
- Empirical L2-norm:
  $$\|f_\alpha\|_{\widehat{\rho}}^2 = \frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i, a_i)^2 = \frac{1}{n}\sum_{i=1}^{n} z_i^2 = \|z\|_n^2.$$
- Empirical orthogonal projection:
  $$\widehat{\Pi} y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n.$$
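The empirical projection onto the span of the features is just a least-squares fit; a minimal illustrative sketch follows (the random features and targets are assumptions of the example).

```python
import numpy as np

def empirical_projection(Phi, y):
    """Project y in R^n onto F_n = span of the columns of Phi (n x d feature matrix).

    Returns hat(Pi) y = Phi @ alpha with alpha = argmin_alpha ||y - Phi alpha||_n.
    """
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ alpha

rng = np.random.default_rng(3)
n, d = 200, 5
Phi = rng.normal(size=(n, d))          # phi(x_i, a_i) for illustrative inputs
y = rng.normal(size=n)                 # illustrative targets
proj = empirical_projection(Phi, y)
print(np.allclose(Phi.T @ (y - proj), 0, atol=1e-8))   # residual is orthogonal to F_n
```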

Additional Notation (cont'd)

- Target vector:
  $$q_i = Q^k(x_i, a_i) = (T\widetilde{Q}^{k-1})(x_i, a_i) = r(x_i, a_i) + \gamma \int_{\mathcal{X}} \max_a \widetilde{Q}^{k-1}(x', a)\, p(dx' \mid x_i, a_i).$$
- Observed target vector:
  $$y_i = r_i + \gamma \max_a \widetilde{Q}^{k-1}(x'_i, a).$$
- Noise vector (zero-mean and bounded):
  $$\xi_i = q_i - y_i, \qquad \mathbb{E}[\xi_i \mid x_i] = 0, \qquad |\xi_i| \le V_{\max}.$$

[Figure: the target vector q, the noisy observation y = q - ξ, and the subspace F_n of R^n]

Additional Notation (cont'd)

- Optimal solution in $\mathcal{F}_n$:
  $$\widehat{\Pi} q = \arg\min_{z \in \mathcal{F}_n} \|q - z\|_n.$$
- Returned vector:
  $$\widehat{q}_i = f_{\widehat{\alpha}_k}(x_i, a_i), \qquad \widehat{q} = \widehat{\Pi} y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n.$$

[Figure: q, y, and their projections hat(Π)q and hat(q) = hat(Π)y onto F_n]

Theoretical Analysis

$$\|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}}^2 = \|q - \widehat{q}\|_n^2.$$

[Figure: q, y, hat(Π)q, hat(q), and the projected noise hat(ξ) in F_n]

By the triangle inequality,
$$\|q - \widehat{q}\|_n \le \|q - \widehat{\Pi} q\|_n + \|\widehat{\Pi} q - \widehat{q}\|_n = \|q - \widehat{\Pi} q\|_n + \|\widehat{\xi}\|_n.$$

Theoretical Analysis

$$\underbrace{\|q - \widehat{q}\|_n}_{\text{prediction error}} \le \underbrace{\|q - \widehat{\Pi} q\|_n}_{\text{approximation error}} + \underbrace{\|\widehat{\xi}\|_n}_{\text{estimation error}}$$

- Prediction error: distance between the learned function and the target function.
- Approximation error: distance between the best function in $\mathcal{F}$ and the target function ⇒ depends on $\mathcal{F}$.
- Estimation error: distance between the best function in $\mathcal{F}$ and the learned function ⇒ depends on the samples.

Theoretical Analysis

The projected noise $\widehat{\xi} = \widehat{\Pi}\xi$ satisfies $\|\widehat{\xi}\|_n^2 = \langle \widehat{\xi}, \widehat{\xi} \rangle = \langle \widehat{\xi}, \xi \rangle$.

The projected noise belongs to $\mathcal{F}_n$, so there exists $f_\beta \in \mathcal{F}$ with $f_\beta(x_i, a_i) = \widehat{\xi}_i$ for all $(x_i, a_i)$.

By definition of the inner product,
$$\|\widehat{\xi}\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \xi_i.$$

Theoretical Analysis

The noise $\xi$ has zero mean and is bounded in $[-V_{\max}, V_{\max}]$. Thus, for any fixed $f_\beta \in \mathcal{F}$ (the expectation being conditioned on $\{(x_i, a_i)\}$),
$$\mathbb{E}_\xi\Big[\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \xi_i\Big] = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\, \mathbb{E}_\xi[\xi_i] = 0,$$
and
$$\frac{1}{n}\sum_{i=1}^{n} \big(f_\beta(x_i, a_i)\, \xi_i\big)^2 \le 4 V_{\max}^2\, \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)^2 = 4 V_{\max}^2\, \|f_\beta\|_{\widehat{\rho}}^2,$$
so we can use concentration inequalities.

Theoretical Analysis

Problem: $f_\beta$ is itself a random variable (it depends on the noise).

Solution: we need functional concentration inequalities.

Theoretical Analysis

Define the space of normalized functions
$$\mathcal{G} = \Big\{ g(\cdot) = \frac{f_\alpha(\cdot)}{\|f_\alpha\|_{\widehat{\rho}}},\ f_\alpha \in \mathcal{F} \Big\}.$$

- By definition, $\|g\|_{\widehat{\rho}} \le 1$ for all $g \in \mathcal{G}$.
- Since $\mathcal{F}$ is a linear space, $\mathcal{V}(\mathcal{G}) = d + 1$.

Theoretical Analysis

Applying Pollard's inequality to the space $\mathcal{G}$: for any $g \in \mathcal{G}$,
$$\Big|\frac{1}{n}\sum_{i=1}^{n} g(x_i, a_i)\, \xi_i\Big| \le 4 V_{\max} \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}$$
with probability $1 - \delta$ (w.r.t. the realization of the noise $\xi$).

Theoretical Analysis

By definition of $g$, for any $f_\alpha \in \mathcal{F}$,
$$\Big|\frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i, a_i)\, \xi_i\Big| \le 4 V_{\max}\, \|f_\alpha\|_{\widehat{\rho}} \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}.$$

For the specific $f_\beta$ equivalent to $\widehat{\xi}$,
$$\langle \widehat{\xi}, \xi \rangle \le 4 V_{\max}\, \|\widehat{\xi}\|_n \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}.$$

Recalling that $\|\widehat{\xi}\|_n^2 = \langle \widehat{\xi}, \xi \rangle$,
$$\|\widehat{\xi}\|_n^2 \le 4 V_{\max}\, \|\widehat{\xi}\|_n \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}
\quad\Rightarrow\quad
\|\widehat{\Pi} q - \widehat{q}\|_n \le 4 V_{\max} \sqrt{\frac{2}{n} \log\left(\frac{3(9ne^2)^{d+1}}{\delta}\right)}.$$

Theoretical Analysis

Theorem. At each iteration $k$, given a set of state-action pairs $\{(x_i, a_i)\}$, Linear-FQI returns an approximation $\widehat{q}$ such that
$$\|q - \widehat{q}\|_n \le \|q - \widehat{\Pi} q\|_n + \|\widehat{\Pi} q - \widehat{q}\|_n
\le \|q - \widehat{\Pi} q\|_n + O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

Theoretical Analysis

Moving back from vectors to functions: $\|q - \widehat{q}\|_n = \|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}}$ and $\|q - \widehat{\Pi} q\|_n \le \|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}}$, hence
$$\|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}} \le \|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}} + O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

Theoretical Analysis

By definition of the truncation ($\widetilde{Q}^k = \mathrm{Trunc}(f_{\widehat{\alpha}_k})$):

Theorem. At each iteration $k$, given a set of state-action pairs $\{(x_i, a_i)\}$, Linear-FQI returns an approximation $\widetilde{Q}^k$ such that (Objective 3)
$$\|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}} \le \|Q^k - f_{\widehat{\alpha}_k}\|_{\widehat{\rho}}
\le \|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}} + O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

Theoretical Analysis

Remark: to move from Objective 3 to Objective 2, we need to pass from empirical to expected L2-norms.

Since $\widetilde{Q}^k$ is truncated, it is bounded in $[-V_{\max}, V_{\max}]$, so
$$2\|Q^k - \widetilde{Q}^k\|_{\widehat{\rho}} \ge \|Q^k - \widetilde{Q}^k\|_\rho - O\left(V_{\max}\sqrt{\frac{d \log n/\delta}{n}}\right).$$

The best solution $f_{\alpha^*_k}$ is a fixed function in $\mathcal{F}$, so
$$\|Q^k - f_{\alpha^*_k}\|_{\widehat{\rho}} \le 2\|Q^k - f_{\alpha^*_k}\|_\rho + O\left(\big(V_{\max} + L\|\alpha^*_k\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right).$$

Theorem. At each iteration $k$, LinearFQI returns an approximation $\tilde Q^k$ such that (Objective 2)
$$\|Q^k - \tilde Q^k\|_{\rho} \;\le\; 4\|Q^k - f_{\alpha^*_k}\|_{\rho}
+ O\!\left(\big(V_{\max} + L\|\alpha^*_k\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right)
+ O\!\left(V_{\max}\sqrt{\frac{d\log n/\delta}{n}}\right),$$
with probability $1-\delta$.

Remarks on the approximation error $4\|Q^k - f_{\alpha^*_k}\|_{\rho}$:
- no algorithm can do better than the best approximation in $\mathcal F$
- constant 4
- it depends on the space $\mathcal F$
- it changes with the iteration $k$

Remarks on the first estimation term:
- vanishing to zero as $O(n^{-1/2})$
- it depends on the features ($L$) and on the best solution ($\|\alpha^*_k\|$)

Remarks on the second estimation term:
- vanishing to zero as $O(n^{-1/2})$
- it depends on the dimensionality of the space ($d$) and the number of samples ($n$)

A Step-by-step Derivation for Linear FQI

Error Propagation


Objective 1: $\|Q^* - Q^{\pi_K}\|_\mu$
- Problem 1: the test norm $\mu$ is different from the sampling norm $\rho$
- Problem 2: we have bounds for $\tilde Q^k$, not for the performance of the corresponding $\pi_k$
- Problem 3: we have bounds for one single iteration

Propagation of Errors

- Bellman operators (spelled out for a finite MDP in the sketch after this list):
$$\mathcal T Q(x,a) = r(x,a) + \gamma\int_{\mathcal X}\max_{a'}Q(x',a')\,p(dx'|x,a), \qquad
\mathcal T^\pi Q(x,a) = r(x,a) + \gamma\int_{\mathcal X}Q\big(x',\pi(x')\big)\,p(dx'|x,a)$$
- Optimal action–value function: $Q^* = \mathcal T Q^*$
- Greedy policy: $\pi(x) = \arg\max_a Q(x,a)$, and in particular $\pi^*(x) = \arg\max_a Q^*(x,a)$
- Prediction error: $\varepsilon_k = Q^k - \tilde Q^k$
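For a finite MDP these operators are plain matrix-vector expressions. The sketch below is an illustration under assumed array shapes (transition tensor `P` of shape (X, A, X), reward matrix `R` of shape (X, A)), not part of the original slides.

```python
# Bellman operators on a finite MDP (illustrative sketch).
import numpy as np

def bellman_optimal(Q, R, P, gamma):
    """(T Q)(x,a) = r(x,a) + gamma * sum_x' p(x'|x,a) * max_a' Q(x',a')."""
    return R + gamma * P @ Q.max(axis=1)

def bellman_policy(Q, R, P, gamma, pi):
    """(T^pi Q)(x,a) = r(x,a) + gamma * sum_x' p(x'|x,a) * Q(x', pi(x'))."""
    return R + gamma * P @ Q[np.arange(Q.shape[0]), pi]

def greedy(Q):
    """Greedy policy: pi(x) = argmax_a Q(x,a)."""
    return Q.argmax(axis=1)
```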

Step 1: upper-bound on the propagation (Problem 3). By definition $\mathcal T Q^k \ge \mathcal T^{\pi^*} Q^k$, so
$$Q^* - \tilde Q^{k+1} = \underbrace{\mathcal T^{\pi^*}Q^* - \mathcal T^{\pi^*}\tilde Q^{k}}_{\text{recursion}} + \underbrace{\mathcal T^{\pi^*}\tilde Q^{k} - \mathcal T\tilde Q^{k}}_{\le 0} + \underbrace{\varepsilon_k}_{\text{error}}
\;\le\; \gamma P^{\pi^*}\big(Q^* - \tilde Q^{k}\big) + \varepsilon_k,$$
and unrolling the recursion over $K$ iterations,
$$Q^* - \tilde Q^{K} \;\le\; \sum_{k=0}^{K-1}\gamma^{K-k-1}\big(P^{\pi^*}\big)^{K-k-1}\varepsilon_k + \gamma^{K}\big(P^{\pi^*}\big)^{K}\big(Q^* - \tilde Q^{0}\big).$$

Step 2: lower-bound on the propagation (Problem 3). By definition $\mathcal T Q^* \ge \mathcal T^{\pi_k} Q^*$, so
$$Q^* - \tilde Q^{k+1} = \underbrace{\mathcal T Q^* - \mathcal T^{\pi_k}Q^*}_{\ge 0} + \underbrace{\mathcal T^{\pi_k}Q^* - \mathcal T\tilde Q^{k}}_{\text{greedy pol.}} + \underbrace{\varepsilon_k}_{\text{error}}
\;\ge\; \mathcal T^{\pi_k}Q^* - \mathcal T^{\pi_k}\tilde Q^{k} + \varepsilon_k
\;=\; \gamma P^{\pi_k}\big(Q^* - \tilde Q^{k}\big) + \varepsilon_k$$
(using that $\pi_k$ is greedy w.r.t. $\tilde Q^k$, so that $\mathcal T\tilde Q^k = \mathcal T^{\pi_k}\tilde Q^k$), and unrolling the recursion,
$$Q^* - \tilde Q^{K} \;\ge\; \sum_{k=0}^{K-1}\gamma^{K-k-1}\big(P^{\pi_{K-1}}P^{\pi_{K-2}}\cdots P^{\pi_{k+1}}\big)\varepsilon_k + \gamma^{K}\big(P^{\pi_{K-1}}P^{\pi_{K-2}}\cdots P^{\pi_{0}}\big)\big(Q^* - \tilde Q^{0}\big).$$

Step 3: from $\tilde Q^K$ to $\pi_K$ (Problem 2). By definition $\mathcal T^{\pi_K}\tilde Q^K = \mathcal T\tilde Q^K \ge \mathcal T^{\pi^*}\tilde Q^K$, so (using the fixed-point identities $Q^* = \mathcal T^{\pi^*}Q^*$ and $Q^{\pi_K} = \mathcal T^{\pi_K}Q^{\pi_K}$)
$$Q^* - Q^{\pi_K} = \underbrace{\mathcal T^{\pi^*}Q^* - \mathcal T^{\pi^*}\tilde Q^{K}}_{\text{error}} + \underbrace{\mathcal T^{\pi^*}\tilde Q^{K} - \mathcal T^{\pi_K}\tilde Q^{K}}_{\le 0} + \underbrace{\mathcal T^{\pi_K}\tilde Q^{K} - \mathcal T^{\pi_K}Q^{\pi_K}}_{\text{function vs. policy}}$$
$$\le\; \gamma P^{\pi^*}\big(Q^* - \tilde Q^{K}\big) + \gamma P^{\pi_K}\big(\tilde Q^{K} - Q^* + Q^* - Q^{\pi_K}\big),$$
so that
$$\big(I - \gamma P^{\pi_K}\big)\big(Q^* - Q^{\pi_K}\big) \;\le\; \gamma\big(P^{\pi^*} - P^{\pi_K}\big)\big(Q^* - \tilde Q^{K}\big)
\;\;\Longrightarrow\;\;
Q^* - Q^{\pi_K} \;\le\; \gamma\big(I - \gamma P^{\pi_K}\big)^{-1}\big(P^{\pi^*} - P^{\pi_K}\big)\big(Q^* - \tilde Q^{K}\big).$$

Step 3 (continued): plugging the error propagation (Problem 2),
$$Q^* - Q^{\pi_K} \;\le\; \big(I - \gamma P^{\pi_K}\big)^{-1}\Bigg\{\sum_{k=0}^{K-1}\gamma^{K-k}\Big[\big(P^{\pi^*}\big)^{K-k} - P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]\varepsilon_k
+ \Big[\big(P^{\pi^*}\big)^{K+1} - \big(P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{0}}\big)\Big]\big(Q^* - \tilde Q^{0}\big)\Bigg\}$$

Step 4: rewrite in compact form,
$$Q^* - Q^{\pi_K} \;\le\; \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Bigg[\sum_{k=0}^{K-1}\alpha_k A_k|\varepsilon_k| + \alpha_K A_K\big|Q^* - \tilde Q^{0}\big|\Bigg]$$
- $\alpha_k$: weights
- $A_k$: summarize the $P^{\pi_i}$ terms

Step 5: take the norm w.r.t. the test distribution $\mu$:
$$\|Q^* - Q^{\pi_K}\|_\mu^2 = \int \mu(dx,da)\,\big(Q^*(x,a) - Q^{\pi_K}(x,a)\big)^2$$
$$\le\; \left(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\right)^2 \int\Bigg[\sum_{k=0}^{K-1}\alpha_k A_k|\varepsilon_k| + \alpha_K A_K\big|Q^* - \tilde Q^{0}\big|\Bigg]^2\!(x,a)\;\mu(dx,da)$$
$$\le\; \left(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\right)^2 \int\Bigg[\sum_{k=0}^{K-1}\alpha_k A_k\,\varepsilon_k^2 + \alpha_K A_K\big(Q^* - \tilde Q^{0}\big)^2\Bigg]\!(x,a)\;\mu(dx,da)$$
(the last step uses Jensen's inequality, treating the $\alpha_k$ as weights summing to one).

Focusing on one single term:
$$\mu A_k = \frac{1-\gamma}{2}\,\mu\big(I-\gamma P^{\pi_K}\big)^{-1}\Big[\big(P^{\pi^*}\big)^{K-k} + P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$
$$= \frac{1-\gamma}{2}\sum_{m\ge0}\gamma^m\,\mu\big(P^{\pi_K}\big)^m\Big[\big(P^{\pi^*}\big)^{K-k} + P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$
$$= \frac{1-\gamma}{2}\Bigg[\sum_{m\ge0}\gamma^m\,\mu\big(P^{\pi_K}\big)^m\big(P^{\pi^*}\big)^{K-k} + \sum_{m\ge0}\gamma^m\,\mu\big(P^{\pi_K}\big)^m P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Bigg]$$

Assumption (concentrability terms):
$$c(m) = \sup_{\pi_1\ldots\pi_m}\left\|\frac{d\big(\mu P^{\pi_1}\cdots P^{\pi_m}\big)}{d\rho}\right\|_\infty, \qquad
C_{\mu,\rho} = (1-\gamma)^2\sum_{m\ge1} m\,\gamma^{m-1}c(m) < +\infty$$

Step 5 (continued): under the concentrability assumption,
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \;\le\; \left(\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\right)^2\Bigg[\sum_{k=0}^{K-1}\alpha_k\,\frac{1-\gamma}{2}\sum_{m\ge0}\gamma^m c(m+K-k)\,\|\varepsilon_k\|_\rho^2 + \alpha_K\,(2V_{\max})^2\Bigg]$$

Step 5 (conclusion): taking the norm w.r.t. the test distribution $\mu$ (Problem 1),
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \;\le\; \left(\frac{2\gamma}{(1-\gamma)^2}\right)^2 C_{\mu,\rho}\,\max_k\|\varepsilon_k\|_\rho^2 + O\!\left(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\right)$$

A Step-by-step Derivation for Linear FQI

The Final Bound


Plugging Per-Iteration Regret

$$\|Q^* - Q^{\pi_K}\|_\mu^2 \;\le\; \left(\frac{2\gamma}{(1-\gamma)^2}\right)^2 C_{\mu,\rho}\,\max_k\|\varepsilon_k\|_\rho^2 + O\!\left(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\right)$$

with the per-iteration error bounded as
$$\|\varepsilon_k\|_\rho = \|Q^k - \tilde Q^k\|_\rho \;\le\; 4\|Q^k - f_{\alpha^*_k}\|_\rho
+ O\!\left(\big(V_{\max} + L\|\alpha^*_k\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right)
+ O\!\left(V_{\max}\sqrt{\frac{d\log n/\delta}{n}}\right)$$

The inherent Bellman error:
$$\|Q^k - f_{\alpha^*_k}\|_\rho = \inf_{f\in\mathcal F}\|Q^k - f\|_\rho = \inf_{f\in\mathcal F}\|\mathcal T\tilde Q^{k-1} - f\|_\rho \;\le\; \inf_{f\in\mathcal F}\|\mathcal T f_{\alpha_{k-1}} - f\|_\rho \;\le\; \sup_{g\in\mathcal F}\inf_{f\in\mathcal F}\|\mathcal T g - f\|_\rho = d(\mathcal F,\mathcal T\mathcal F)$$

Since $f_{\alpha^*_k}$ is the orthogonal projection of $Q^k$ onto $\mathcal F$ w.r.t. $\rho$,
$$\|f_{\alpha^*_k}\|_\rho \;\le\; \|Q^k\|_\rho = \|\mathcal T\tilde Q^{k-1}\|_\rho \;\le\; \|\mathcal T\tilde Q^{k-1}\|_\infty \;\le\; V_{\max}$$

Gram matrix: $G_{i,j} = \mathbb E_{(x,a)\sim\rho}\big[\varphi_i(x,a)\,\varphi_j(x,a)\big]$, with smallest eigenvalue $\omega$. Then
$$\|f_\alpha\|_\rho^2 = \|\phi^\top\alpha\|_\rho^2 = \alpha^\top G\,\alpha \;\ge\; \omega\,\alpha^\top\alpha = \omega\|\alpha\|^2
\quad\Longrightarrow\quad
\max_k\|\alpha^*_k\| \;\le\; \max_k\frac{\|f_{\alpha^*_k}\|_\rho}{\sqrt\omega} \;\le\; \frac{V_{\max}}{\sqrt\omega}$$
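A small numerical sketch of this argument (assumptions: `Phi` stacks feature rows sampled from $\rho$, and the features are linearly independent so that $\omega>0$):

```python
# Estimating the Gram matrix and the bound ||alpha|| <= V_max / sqrt(omega) (sketch).
import numpy as np

def gram_matrix(Phi):
    """Phi: (n, d) matrix whose rows are phi(x_i, a_i)^T with (x_i, a_i) ~ rho."""
    return Phi.T @ Phi / Phi.shape[0]          # Monte-Carlo estimate of G

def alpha_norm_bound(Phi, v_max):
    """max_k ||alpha_k^*|| <= V_max / sqrt(omega), with omega the smallest eigenvalue of G."""
    omega = np.linalg.eigvalsh(gram_matrix(Phi)).min()
    return v_max / np.sqrt(omega)              # assumes omega > 0 (independent features)
```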

Theorem. LinearFQI with a space $\mathcal F$ of $d$ features and $n$ samples at each iteration returns a policy $\pi_K$ after $K$ iterations such that
$$\|Q^* - Q^{\pi_K}\|_\mu \;\le\; \frac{2\gamma}{(1-\gamma)^2}\sqrt{C_{\mu,\rho}}\left(4\,d(\mathcal F,\mathcal T\mathcal F) + O\!\left(V_{\max}\Big(1+\frac{L}{\sqrt\omega}\Big)\sqrt{\frac{d\log n/\delta}{n}}\right)\right) + O\!\left(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\right)$$

Remarks:
- the propagation (and the different norms) makes the problem more complex ⇒ how do we choose the sampling distribution?
- the approximation error is worse than in regression ⇒ how do we adapt to the Bellman operator?
- the dependency on $\gamma$ is worse than at each iteration ⇒ is it possible to avoid it?
- the error decreases exponentially in $K$ ⇒ $K \approx \log(1/\epsilon)/(1-\gamma)$ iterations suffice for accuracy $\epsilon$
- the smallest eigenvalue $\omega$ of the Gram matrix appears in the bound ⇒ design the features so as to be orthogonal w.r.t. $\rho$
- the asymptotic rate $O(d/n)$ is the same as for regression

Summary
- At each iteration FQI solves a regression problem ⇒ least-squares prediction error bound
- The error is propagated through iterations ⇒ propagation of any error

Both points are put together in the sketch below.
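The following is an illustrative LinearFQI loop under assumed names (`phi`, `sample_batch`, `actions`); it is a sketch of the procedure analyzed above, not the authors' implementation.

```python
# Illustrative LinearFQI loop: one least-squares regression per iteration, K iterations.
import numpy as np

def linear_fqi(phi, d, sample_batch, actions, gamma, v_max, K):
    """sample_batch() -> list of (x, a, r, x_next) with (x, a) drawn from rho."""
    alpha = np.zeros(d)
    for _ in range(K):
        samples = sample_batch()
        # Truncated previous estimate, as in the analysis (Q-tilde lives in [-V_max, V_max])
        q_prev = lambda x, a: float(np.clip(phi(x, a) @ alpha, -v_max, v_max))
        y = np.array([r + gamma * max(q_prev(xn, ap) for ap in actions)
                      for (_, _, r, xn) in samples])
        Phi = np.array([phi(x, a) for (x, a, _, _) in samples])
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    # Greedy policy w.r.t. the last truncated estimate
    return lambda x: max(actions, key=lambda a: float(np.clip(phi(x, a) @ alpha, -v_max, v_max)))
```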

Least-Squares Policy Iteration (LSPI)


Finite-Sample Performance Bound of Least-Squares Policy Iteration (LSPI)

LSPI is an approximate policy iteration algorithm that uses Least-Squares Temporal-Difference Learning (LSTD) for policy evaluation.

Objective of the Section
- a brief description of the LSTD (policy evaluation) and LSPI (policy iteration) algorithms
- report finite-sample performance bounds for LSTD and LSPI
- describe the main components of these bounds

Least-Squares Temporal-Difference Learning (LSTD)

- Linear function space $\mathcal F = \big\{f : f(\cdot) = \sum_{j=1}^d \alpha_j\varphi_j(\cdot)\big\}$, with features $\{\varphi_j\}_{j=1}^d \in \mathcal B(\mathcal X; L)$ and feature vector $\phi: \mathcal X\to\mathbb R^d$, $\phi(\cdot) = \big(\varphi_1(\cdot),\ldots,\varphi_d(\cdot)\big)^\top$
- $V^\pi$ is the fixed point of $\mathcal T^\pi$: $V^\pi = \mathcal T^\pi V^\pi$
- $V^\pi$ may not belong to $\mathcal F$: $V^\pi\notin\mathcal F$
- Best approximation of $V^\pi$ in $\mathcal F$: $\Pi V^\pi = \arg\min_{f\in\mathcal F}\|V^\pi - f\|$ ($\Pi$ is the projection onto $\mathcal F$)

[Figure: $V^\pi$ and its projection $\Pi V^\pi$ onto the space $\mathcal F$.]

- LSTD searches for the fixed point of $\Pi_?\,\mathcal T^\pi$ instead ($\Pi_?$ is a projection onto $\mathcal F$ w.r.t. some $L_?$-norm)
- $\Pi_\infty\mathcal T^\pi$ is a contraction in $L_\infty$-norm, but the $L_\infty$-projection is numerically expensive when the number of states is large or infinite
- LSTD therefore searches for the fixed point of $\Pi_{2,\rho}\,\mathcal T^\pi$, where $\Pi_{2,\rho}\,g = \arg\min_{f\in\mathcal F}\|g - f\|_{2,\rho}$

When the fixed point of $\Pi_\rho\mathcal T^\pi$ exists, we call it the LSTD solution: $V_{TD} = \Pi_\rho\mathcal T^\pi V_{TD}$.

[Figure: $V^\pi$, its projection $\Pi_\rho V^\pi$, and the LSTD fixed point $V_{TD} = \Pi_\rho\mathcal T^\pi V_{TD}$ in the space $\mathcal F$.]

$$\langle \mathcal T^\pi V_{TD} - V_{TD},\,\varphi_i\rangle_\rho = 0, \quad i=1,\ldots,d$$
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD},\,\varphi_i\rangle_\rho = 0$$
$$\underbrace{\langle r^\pi,\varphi_i\rangle_\rho}_{b_i} - \sum_{j=1}^d \underbrace{\langle \varphi_j - \gamma P^\pi\varphi_j,\,\varphi_i\rangle_\rho}_{A_{ij}}\,\alpha_{TD}^{(j)} = 0
\quad\longrightarrow\quad A\,\alpha_{TD} = b$$

- In general, $\Pi_\rho\mathcal T^\pi$ is not a contraction and does not have a fixed point.
- If $\rho = \rho^\pi$, the stationary distribution of $\pi$, then $\Pi_{\rho^\pi}\mathcal T^\pi$ has a unique fixed point.

LSTD Algorithm

Proposition (LSTD Performance):
$$\|V^\pi - V_{TD}\|_{\rho^\pi} \;\le\; \frac{1}{\sqrt{1-\gamma^2}}\,\inf_{V\in\mathcal F}\|V^\pi - V\|_{\rho^\pi}$$

- We observe a trajectory generated by following the policy $\pi$: $(X_0, R_0, X_1, R_1, \ldots, X_N)$, where $X_{t+1}\sim P\big(\cdot\,|\,X_t,\pi(X_t)\big)$ and $R_t = r\big(X_t,\pi(X_t)\big)$
- We build estimators of the matrix $A$ and vector $b$:
$$\hat A_{ij} = \frac1N\sum_{t=0}^{N-1}\varphi_i(X_t)\big[\varphi_j(X_t) - \gamma\varphi_j(X_{t+1})\big], \qquad
\hat b_i = \frac1N\sum_{t=0}^{N-1}\varphi_i(X_t)\,R_t$$
- $\hat A\,\hat\alpha_{TD} = \hat b$, and $\hat V_{TD}(\cdot) = \phi(\cdot)^\top\hat\alpha_{TD}$

When $n\to\infty$, $\hat A\to A$ and $\hat b\to b$, and thus $\hat\alpha_{TD}\to\alpha_{TD}$ and $\hat V_{TD}\to V_{TD}$. A direct numerical transcription is sketched below.
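A minimal sketch of these estimators (assumed array layout, and assuming $\hat A$ is invertible; not the authors' code):

```python
# LSTD estimators from a single trajectory (sketch).
import numpy as np

def lstd(features, rewards, gamma):
    """features: (N+1, d) array with rows phi(X_t); rewards: (N,) array of R_t."""
    Phi, Phi_next = features[:-1], features[1:]
    N = len(rewards)
    A_hat = Phi.T @ (Phi - gamma * Phi_next) / N     # A-hat_ij
    b_hat = Phi.T @ np.asarray(rewards) / N          # b-hat_i
    alpha_td = np.linalg.solve(A_hat, b_hat)         # A-hat alpha = b-hat
    return alpha_td                                  # V-hat_TD(x) = phi(x)^T alpha_td
```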

LSTD and LSPI Error Bounds

LSTD Error Bound

When the Markov chain induced by the policy under evaluation $\pi$ has a stationary distribution $\rho^\pi$ (the Markov chain is ergodic, e.g. $\beta$-mixing), then:

Theorem (LSTD Error Bound). Let $\tilde V$ be the truncated LSTD solution computed using $n$ samples along a trajectory generated by following the policy $\pi$. Then, with probability $1-\delta$,
$$\|V^\pi - \tilde V\|_{\rho^\pi} \;\le\; \underbrace{\frac{c}{\sqrt{1-\gamma^2}}\,\inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi}}_{\text{approximation error}} \;+\; \underbrace{O\!\left(\sqrt{\frac{d\log(d/\delta)}{n\,\nu}}\right)}_{\text{estimation error}}$$

- $n$ = number of samples, $d$ = dimension of the linear function space $\mathcal F$
- $\nu$ = smallest eigenvalue of the Gram matrix $\big(\int\varphi_i\varphi_j\,d\rho^\pi\big)_{i,j}$ (assume the eigenvalues of the Gram matrix are strictly positive, so that the model-based LSTD solution exists)
- the $\beta$-mixing coefficients are hidden in the $O(\cdot)$ notation
- Approximation error: depends on how well the function space $\mathcal F$ can approximate the value function $V^\pi$
- Estimation error: depends on the number of samples $n$, the dimension of the function space $d$, the smallest eigenvalue of the Gram matrix $\nu$, and the mixing properties of the Markov chain (hidden in $O$)

LSPI Error Bound

Theorem (LSPI Error Bound). Let $V_{-1}\in\tilde{\mathcal F}$ be an arbitrary initial value function, $\tilde V_0,\ldots,\tilde V_{K-1}$ be the sequence of truncated value functions generated by LSPI after $K$ iterations, and $\pi_K$ be the greedy policy w.r.t. $\tilde V_{K-1}$. Then, with probability $1-\delta$,
$$\|V^* - V^{\pi_K}\|_\mu \;\le\; \frac{4\gamma}{(1-\gamma)^2}\left\{\sqrt{C\,C_{\mu,\rho}}\left[c\,E_0(\mathcal F) + O\!\left(\sqrt{\frac{d\log(dK/\delta)}{n\,\nu_\rho}}\right)\right] + \gamma^{\frac{K-1}{2}}R_{\max}\right\}$$

- Approximation error: $E_0(\mathcal F) = \sup_{\pi\in\mathcal G(\tilde{\mathcal F})}\inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi}$
- Estimation error: depends on $n$, $d$, $\nu_\rho$, $K$
- Initialization error: error due to the choice of the initial value function or initial policy, $|V^* - V^{\pi_0}|$

Lower-Bounding Distribution: there exists a distribution $\rho$ such that for any policy $\pi\in\mathcal G(\tilde{\mathcal F})$ we have $\rho\le C\rho^\pi$, where $C<\infty$ is a constant and $\rho^\pi$ is the stationary distribution of $\pi$. Furthermore, we can define the concentrability coefficient $C_{\mu,\rho}$ as before.
- $\nu_\rho$ = the smallest eigenvalue of the Gram matrix $\big(\int\varphi_i\varphi_j\,d\rho\big)_{i,j}$

A sketch of the LSPI loop analyzed by this bound follows.
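For concreteness, here is a rough LSPI loop (a sketch under assumed names; the slides state LSTD for $V^\pi$, while for control one typically uses the state–action LSTD-Q variant shown here):

```python
# Illustrative LSPI loop: LSTD-Q evaluation of the current policy + greedy improvement.
import numpy as np

def lspi(phi_sa, actions, rollout, gamma, d, K):
    """phi_sa(x, a) -> d-dim features; rollout(policy) -> list of (x, a, r, x_next)."""
    alpha = np.zeros(d)
    policy = lambda x: actions[0]                       # arbitrary initial policy
    for _ in range(K):
        samples = rollout(policy)
        Phi = np.array([phi_sa(x, a) for (x, a, _, _) in samples])
        Phi_next = np.array([phi_sa(xn, policy(xn)) for (_, _, _, xn) in samples])
        rewards = np.array([r for (_, _, r, _) in samples])
        A_hat = Phi.T @ (Phi - gamma * Phi_next) / len(samples)
        b_hat = Phi.T @ rewards / len(samples)
        alpha = np.linalg.solve(A_hat, b_hat)           # LSTD-Q evaluation of `policy`
        policy = lambda x, a_=alpha: max(actions, key=lambda a: phi_sa(x, a) @ a_)
    return policy
```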

Classification-based Policy Iteration


Finite-Sample Performance Bound of a Classification-based Policy Iteration Algorithm

Objective of the Section
- classification-based vs. regression-based (value-function-based) policy iteration
- describe a classification-based policy iteration algorithm
- report bounds on the error at each iteration and on the error after $K$ iterations of the algorithm
- describe the main components of these bounds

Value-based (Approximate) Policy Iteration

[Diagram: given the current policy $\pi$, sample states $x_i\sim\rho$, compute rollout estimates $\hat Q^\pi(x_i,a)$, build the training set $D=\{(x_i,a),\hat Q^\pi(x_i,a)\}$, fit $\tilde Q^\pi=\mathrm{learn}(D,\mathcal F)$ by regression, and improve: $\tilde\pi = \arg\max_a\tilde Q(\cdot,a)$.]
* We use Monte-Carlo estimation for illustration purposes.

Classification-based Policy Iteration

[Diagram: given the current policy $\pi$, sample states $x_i\sim\rho$, compute rollout estimates $\hat Q^\pi(x_i,a)$, build the training set $D=\{x_i,\arg\max_a\hat Q^\pi(x_i,a)\}$, and learn the next policy directly with a classifier: $\tilde\pi=\mathrm{learn}(D,\Pi)$; there is no explicit value-function approximation.]
* First introduced by Lagoudakis & Parr (2003) and Fern et al. (2004, 2006).

Value-based vs. Classification-based Policy Iteration: the two loops share the sampling and rollout-estimation steps and differ only in the final learning step (regression of $\hat Q^\pi$ over $\mathcal F$ vs. classification of the greedy action over $\Pi$).

Appealing Properties
- Property 1. It is more important to have a policy with performance similar to the greedy policy w.r.t. $Q^{\pi_k}$ than an accurate approximation of $Q^{\pi_k}$.
- Property 2. In some problems, good policies are easier to represent and learn than their corresponding value functions.

Algorithm

Template of the Algorithm

Input: policy space $\Pi\subseteq\mathcal B^\pi(\mathcal X)$, state distribution $\rho$, number of rollout states $N$, number of rollouts per state-action pair $M$, rollout horizon $H$
Initialize: let $\pi_0\in\Pi$ be an arbitrary policy
for $k = 0, 1, 2, \ldots$ do
    Construct the rollout set $D_k = \{x_i\}_{i=1}^N$, $x_i\overset{iid}{\sim}\rho$
    for all states $x_i\in D_k$ and actions $a\in\mathcal A$ do
        for $j = 1$ to $M$ do
            Perform a rollout according to policy $\pi_k$ and return
            $R_j^{\pi_k}(x_i,a) = r(x_i,a) + \sum_{t=1}^{H-1}\gamma^t\,r\big(x_t,\pi_k(x_t)\big)$,
            with $x_t\sim p\big(\cdot\,|\,x_{t-1},\pi_k(x_{t-1})\big)$ and $x_1\sim p(\cdot\,|\,x_i,a)$
        end for
        $\hat Q^{\pi_k}(x_i,a) = \frac1M\sum_{j=1}^M R_j^{\pi_k}(x_i,a)$
    end for
    $\pi_{k+1} = \arg\min_{\pi\in\Pi}\hat{\mathcal L}_{\pi_k}(\hat\rho;\pi)$   (classifier)
end for

A minimal sketch of the rollout-estimation step is given below.

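The inner loops of the template amount to the following (a sketch: `sample_next` and `reward` are assumed stand-ins for the MDP simulator, not names from the slides):

```python
# Rollout estimate Q-hat^{pi_k}(x, a) used to build the classifier's training set (sketch).
import numpy as np

def rollout_return(x0, a0, policy, sample_next, reward, gamma, H):
    """One H-horizon rollout: r(x0,a0) + sum_{t=1}^{H-1} gamma^t r(x_t, pi(x_t))."""
    total, x = reward(x0, a0), sample_next(x0, a0)   # x is x_1 ~ p(.|x0, a0)
    for t in range(1, H):
        a = policy(x)
        total += gamma**t * reward(x, a)
        x = sample_next(x, a)
    return total

def q_hat(x, a, policy, sample_next, reward, gamma, H, M):
    """Monte-Carlo average over M rollouts."""
    return np.mean([rollout_return(x, a, policy, sample_next, reward, gamma, H)
                    for _ in range(M)])
```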

The classifier minimizes the Empirical Error
$$\hat{\mathcal L}_{\pi_k}(\hat\rho;\pi) = \frac1N\sum_{i=1}^N\Big[\max_{a\in\mathcal A}\hat Q^{\pi_k}(x_i,a) - \hat Q^{\pi_k}\big(x_i,\pi(x_i)\big)\Big]$$
($\hat\rho$ is the empirical distribution induced by the samples in $D_k$), with the objective of minimizing the Expected Error
$$\mathcal L_{\pi_k}(\rho;\pi) = \int_{\mathcal X}\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]\rho(dx)$$

Mistake-based vs. Gap-based Errors

Mistake-based error:
$$\mathcal L_{\pi_k}(\rho;\pi) = \mathbb E_{x\sim\rho}\big[\mathbb I\{\pi(x)\ne(\mathcal G\pi_k)(x)\}\big] = \int_{\mathcal X}\underbrace{\mathbb I\Big\{\pi(x)\ne\arg\max_{a\in\mathcal A}Q^{\pi_k}(x,a)\Big\}}_{\text{mistake}}\,\rho(dx)$$

Gap-based error:
$$\mathcal L_{\pi_k}(\rho;\pi) = \int_{\mathcal X}\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]\rho(dx)
= \int_{\mathcal X}\underbrace{\mathbb I\Big\{\pi(x)\ne\arg\max_{a\in\mathcal A}Q^{\pi_k}(x,a)\Big\}}_{\text{mistake}}\;\underbrace{\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]}_{\text{cost/regret}}\,\rho(dx)$$

Both losses are compared in the sketch below.
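Side by side in code (a sketch; `q` stands for any estimate of $Q^{\pi_k}$): both losses count the same mistakes, but the gap-based loss weighs each mistake by its regret.

```python
# Mistake-based vs. gap-based empirical losses (illustrative sketch).
import numpy as np

def mistake_loss(pi, states, actions, q):
    return np.mean([float(pi(x) != max(actions, key=lambda a: q(x, a)))
                    for x in states])

def gap_loss(pi, states, actions, q):
    return np.mean([max(q(x, a) for a in actions) - q(x, pi(x))
                    for x in states])
```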

Bounds on Error at each Iteration and Final Error

Error at each Iteration

Theorem. Let $\Pi$ be a policy space with $h = VC(\Pi)<\infty$ and $\rho$ be a distribution over $\mathcal X$. Let $N$ be the number of states in $D_k$ drawn i.i.d. from $\rho$, $H$ be the rollout horizon, and $M$ be the number of rollouts per state-action pair. Let
$$\pi_{k+1} = \arg\min_{\pi\in\Pi}\hat{\mathcal L}_{\pi_k}(\hat\rho;\pi)$$
be the policy computed at the $k$-th iteration of DPI. Then, for any $\delta>0$,
$$\mathcal L_{\pi_k}(\rho;\pi_{k+1}) \;\le\; \inf_{\pi\in\Pi}\mathcal L_{\pi_k}(\rho;\pi) + 2\big(\epsilon_1 + \epsilon_2 + \gamma^H Q_{\max}\big),$$
with probability $1-\delta$, where
$$\epsilon_1 = 16\,Q_{\max}\sqrt{\frac2N\left(h\log\frac{eN}{h} + \log\frac{32}{\delta}\right)}
\quad\text{and}\quad
\epsilon_2 = 8(1-\gamma^H)\,Q_{\max}\sqrt{\frac{2}{MN}\left(h\log\frac{eMN}{h} + \log\frac{32}{\delta}\right)}$$

Remarks. The bound
$$\mathcal L_{\pi_k}(\rho;\pi_{k+1}) \;\le\; \underbrace{\inf_{\pi\in\Pi}\mathcal L_{\pi_k}(\rho;\pi)}_{\text{approximation}} + \underbrace{2\big(\epsilon_1(N) + \epsilon_2(N,M,H) + \gamma^H Q_{\max}\big)}_{\text{estimation}}$$
- Approximation error: it depends on how well the policy space $\Pi$ can approximate greedy policies.
- Estimation error: it depends on the number of rollout states, the number of rollouts, and the rollout horizon.

The approximation error:
$$\inf_{\pi\in\Pi}\mathcal L_{\pi_k}(\rho;\pi) = \inf_{\pi\in\Pi}\int_{\mathcal X}\mathbb I\{\pi(x)\ne(\mathcal G\pi_k)(x)\}\Big[\max_{a\in\mathcal A}Q^{\pi_k}(x,a) - Q^{\pi_k}\big(x,\pi(x)\big)\Big]\rho(dx)$$

The estimation error:
$$\epsilon_1 = 16\,Q_{\max}\sqrt{\frac2N\left(h\log\frac{eN}{h} + \log\frac{32}{\delta}\right)},
\qquad
\epsilon_2 = 8(1-\gamma^H)\,Q_{\max}\sqrt{\frac{2}{MN}\left(h\log\frac{eMN}{h} + \log\frac{32}{\delta}\right)}$$
- Avoid overfitting ($\epsilon_1$): take $N\gg h$
- Fixed budget of rollouts $B = MN$: take $M = 1$ and $N = B$
- Fixed budget $B = NMH$ and $M = 1$: take $H = O\big(\frac{\log B}{\log 1/\gamma}\big)$ and $N = O(B/H)$

A small numerical check of these two terms is sketched below.
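Plugging numbers in is a one-liner per term; the sketch below uses the constants exactly as stated in the bound, so treat the output as an order-of-magnitude check of how $N$, $M$, $H$, and $h$ trade off.

```python
# Numerical check of the two estimation terms eps1 and eps2 (sketch).
import numpy as np

def eps1(N, h, delta, q_max):
    return 16 * q_max * np.sqrt(2.0 / N * (h * np.log(np.e * N / h) + np.log(32 / delta)))

def eps2(N, M, H, h, delta, gamma, q_max):
    return 8 * (1 - gamma**H) * q_max * np.sqrt(
        2.0 / (M * N) * (h * np.log(np.e * M * N / h) + np.log(32 / delta)))

# Example: N = 10000 rollout states, M = 1, H = 50, VC-dim h = 10, delta = 0.05, Q_max = 1
# eps1(10000, 10, 0.05, 1.0), eps2(10000, 1, 50, 10, 0.05, 0.99, 1.0)
```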

Final Bound

Theorem. Let $\Pi$ be a policy space with VC-dimension $h$ and $\pi_K$ be the policy generated by DPI after $K$ iterations. Then, for any $\delta>0$,
$$\|V^* - V^{\pi_K}\|_{1,\mu} \;\le\; \frac{C_{\mu,\rho}}{(1-\gamma)^2}\Big[d(\Pi,\mathcal G\Pi) + 2\big(\epsilon_1 + \epsilon_2 + \gamma^H Q_{\max}\big)\Big] + \frac{2\gamma^K R_{\max}}{1-\gamma},$$
with probability $1-\delta$, where
$$\epsilon_1 = 16\,Q_{\max}\sqrt{\frac2N\left(h\log\frac{eN}{h} + \log\frac{32K}{\delta}\right)}
\quad\text{and}\quad
\epsilon_2 = 8(1-\gamma^H)\,Q_{\max}\sqrt{\frac{2}{MN}\left(h\log\frac{eMN}{h} + \log\frac{32K}{\delta}\right)}$$

- Concentrability coefficient: $C_{\mu,\rho}$
- Estimation error: depends on $M$, $N$, $H$, $h$, and $K$
- Initialization error: error due to the choice of the initial policy, $|V^* - V^{\pi_0}|$

Inherent Greedy Error $d(\Pi,\mathcal G\Pi)$ (approximation error):
$$d(\Pi,\mathcal G\Pi) = \sup_{\pi\in\Pi}\inf_{\pi'\in\Pi}\mathcal L_\pi(\rho;\pi')
= \sup_{\pi\in\Pi}\inf_{\pi'\in\Pi}\int_{\mathcal X}\mathbb I\{\pi'(x)\ne(\mathcal G\pi)(x)\}\Big[\max_{a\in\mathcal A}Q^{\pi}(x,a) - Q^{\pi}\big(x,\pi'(x)\big)\Big]\rho(dx)$$

Other Finite-Sample Analysis Results in Batch RL
- Approximate Value Iteration (Munos & Szepesvari 2008)
- Approximate Policy Iteration
  - LSTD and LSPI (Lazaric et al. 2010, 2012)
  - Bellman Residual Minimization (Maillard et al. 2010)
  - Modified Bellman Residual Minimization (Antos et al. 2008)
  - Classification-based Policy Iteration (Fern et al. 2006; Lazaric et al. 2010; Gabillon et al. 2011; Farahmand et al. 2012)
  - Conservative Policy Iteration (Kakade & Langford 2002; Kakade 2003)

- Approximate Modified Policy Iteration (Scherrer et al. 2012)
- Regularized Approximate Dynamic Programming
  - $L_2$-Regularization
    - $L_2$-Regularized Policy Iteration (Farahmand et al. 2008)
    - $L_2$-Regularized Fitted Q-Iteration (Farahmand et al. 2009)
  - $L_1$-Regularization and High-Dimensional RL
    - Lasso-TD (Ghavamzadeh et al. 2011)
    - LSTD (LSPI) with Random Projections (Ghavamzadeh et al. 2010)

Discussion


Learned Lessons

Comparison to Supervised Learning

We obtain the optimal rates of regression and classification for RL (ADP) algorithms. What makes RL more challenging, then?
- the dependency on $1/(1-\gamma)$ (sequential nature of the problem)
- the approximation error is more complex
- the propagation of error (control problem)
- the sampling problem (how to choose $\rho$, i.e. the exploration problem)

Practical Lessons
- Tuning the parameters (given a fixed accuracy $\epsilon$):
  - number of samples (inverting the bound): $n \ge \tilde\Omega\!\big(d/\epsilon^2\big)$
  - number of iterations (inverting the bound): $K \approx \log(1/\epsilon)/(1-\gamma)$
- Choice of the function space $\mathcal F$ and/or policy space $\Pi$:
  - features $\{\varphi_i\}_{i=1}^d$ should be linearly independent given the sampling distribution $\rho$ (on-policy vs. off-policy sampling)
  - tradeoff between approximation and estimation errors

A rough numerical illustration of these parameter choices is sketched below.
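The sketch below inverts the bound with constants omitted, under the assumption that the estimation term behaves like $\sqrt{d/n}$ and the initialization term like $\gamma^K$; the numbers are order-of-magnitude choices, not guarantees.

```python
# Order-of-magnitude parameter choices from inverting the bound (sketch, constants omitted).
import numpy as np

def choose_parameters(eps, gamma, d):
    n = int(np.ceil(d / eps**2))                          # sqrt(d/n) <= eps  ->  n >= d / eps^2
    K = int(np.ceil(np.log(1.0 / eps) / (1.0 - gamma)))   # gamma^K <= eps    ->  K ~ log(1/eps)/(1-gamma)
    return n, K

# e.g. choose_parameters(eps=0.1, gamma=0.95, d=20) -> (2000, 47)
```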

Open Problems

- High-dimensional spaces: how to deal with MDPs with many state-action variables?
  - first example in deterministic design for LSTD
  - extension to other algorithms
- Optimality: how optimal are the current algorithms?
  - improve the sampling distribution
  - control the concentrability terms
  - limit the propagation of error through iterations
- Off-policy learning for LSTD

Statistical Learning Theory Meets Dynamic Programming

M. Ghavamzadeh, A. Lazaric
{mohammad.ghavamzadeh, alessandro.lazaric}@inria.fr
sequel.lille.inria.fr